Buying down risk: Complexity management
Software complexity has grown exponentially over the past few years. In addition to massive increases in the scale of systems and service infrastructure, new hardware and computing environments like virtualized network functions, custom silicon accelerators, and in-hardware virtualization have added greater complexity to tasks like real-time synchronization and parallel computing. Slick user interfaces rely on copied-and-pasted code, nested loops, and dependencies built on a fragmented mix of proprietary and open-source code. The result is a tangle of applications interacting within vast systems to provide critical, pervasive services. The biggest of these systems—the massive, distributed cloud computing networks run by Amazon, Microsoft, Google, and others—must schedule and allocate computing resources to meet constantly shifting, unimaginably large workloads. The sheer number of possible system states—many unknown and unknowable—creates risks of catastrophic failures that engineers cannot explicitly anticipate. Complexity is distinct from ‘complicatedness’: complexity connotes a fundamental unknowability in a system, whereas a complicated system is challenging to understand but ultimately finite and deterministic.
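To make that scale concrete, consider a rough, illustrative sketch of state-space growth. The component count and per-component state count below are assumptions chosen for arithmetic convenience, not measurements of any real system, but they show why exhaustive anticipation of failure modes is infeasible.

```python
# Illustrative sketch: how quickly a system's combined state space outgrows
# exhaustive analysis. The counts below are assumptions for illustration,
# not measurements of any real cloud system.

components = 50           # e.g., services, schedulers, caches, network functions
states_per_component = 4  # e.g., idle, busy, degraded, failed

total_states = states_per_component ** components
print(f"{total_states:.3e} possible combined states")  # ~1.268e+30

# Even checking one state per nanosecond would take on the order of 10^13
# years, which is why engineers cannot explicitly anticipate every failure mode.
seconds = total_states / 1e9
print(f"~{seconds / (3600 * 24 * 365):.1e} years at one state per nanosecond")
```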
Software complexity affects cybersecurity profoundly. Poorly managed, it increases cognitive load and collaborative costs for developers, and it raises the likelihood of vulnerabilities and errors that drain system resilience. For example, when webs of interdependence make it difficult for developers to identify the logical relationships between subsystems, collaborative costs skyrocket as teams struggle to anticipate the effects of code changes amid unknown interactions. Patches in complex environments can stumble over unknown logical relationships, and intricate compliance regimes might raise complexity and costs rather than tame them. Tight deadlines and financial incentives for quick delivery conspire to produce insecure workflows for critical products. Some studies find that failure to manage complexity can increase already high maintenance costs by 25 percent. A multi-day outage at a single cloud provider could cost billions of dollars in the United States alone.
Fully eliminating the challenges of software complexity is a fool’s errand. Indeed, some degree of complexity is a necessary component of many products, allowing them to adjust to unforeseen needs and circumstances. Nevertheless, managing software complexity is both possible and necessary. The vast scale of complex private-sector products has drawn considerable industry attention to complexity management. Such efforts include comprehensive guidance on systems resilience, such as Adkins et al.’s Building Secure and Reliable Systems; the creation or adoption of agile development processes and robust testing practices; and substantial investments in resilience, management, and redundancy for the largest cloud systems and products. Policy measures should learn from these efforts, focusing on human-centric approaches, identifying and implementing best practices, and researching ways to clarify logical connections and protect us from ourselves.
Recommendations:
- Institutionalize long-term complexity management: The Critical Technology Security Center (CTSC) language added by amendment to the House-passed COMPETES Act (HR 4521) offers a useful model for government outreach to entities managing complex systems. The provisions would create at least four CTSCs covering the security of network technologies, connected industrial control systems, open source software, and federal critical software. These CTSCs would work from the input of the DHS Under Secretary for Science and Technology and the Director of CISA to study, test the security of, coordinate community funding for, and generally support CISA’s work regarding their respective technologies. The legislation allows for the establishment of additional centers as needed; DHS should work within the next two years to establish a fifth CTSC dedicated to the security and management of complexity in IT architectures and services. The CTSC for complexity would serve as a central hub for complexity policy and security, coordinating with CISA’s National Risk Management Center and other offices as appropriate. This entity should work with the Joint Cyber Defense Collaborative (JCDC) to convene private-industry partners in biannual architecture review meetings that conduct long-term architectural reviews of the largest, most critical complex systems. These meetings should include case studies, reviews of the best and worst industry practices in complexity management, and state-of-the-ecosystem assessments of systemic risk stemming from the interconnected complexity of multiple systems.
- Identify critical complex systems: The CTSC for complexity, as envisioned, should, in conjunction with the JCDC and the NSA, coordinate private-sector input to identify the most critical complex software systems in use by the federal government and throughout industry, classifying them as critical infrastructure in line with language throughout Executive Order 14028. CISA would then share the resulting list with private-sector partners and coordinate with NIST to publish a methodology for identifying and classifying similar sources of systemic risk.
- Develop an ecosystem-wide risk-management approach for critical complex systems: Led by the Office of the National Cyber Director (ONCD), an appropriate interagency group with CISA at the helm and including NIST should consult with private-sector stakeholders through the JCDC and the CTSC for complexity to develop approaches for managing systemic increases in complexity.
- Funding for resilient technical designs: CISA should coordinate with industry to identify potential technical tools and designs for managing complexity, research the feasibility of their widespread adoption, and subsidize their development. Industry players offering critical complex cloud products should commit to the gradual integration of these technologies into their systems where possible. Core segmentation and systems-on-chips (SoCs) are natural starting points for these efforts. Core segmentation, achieved by either a hypervisor or a scheduler, refers to reserving computing cores for certain critical tasks and segregating them temporally to reduce the possibility of undesired interactions (a simplified sketch of scheduler-level core reservation appears after this list). SoCs take that segmentation to the hardware level as integrated circuits with special configurations for efficiency, segmenting away riskier software subsystems. Industry players like Amazon Web Services (AWS), with its Nitro system, are already using SoCs to handle networking and storage virtualization, reducing undesired interactions between software and hardware.
- Industry and government standards for complex systems development: Industry and NIST should collaborate to develop best practices for designing, developing, and maintaining complex critical software. These should include guidance on segmentation practices, based on the research recommended above; on testing and embedding resilience in systems at the design phase; and on incorporating complex systems into a broader ecosystem. These best practices can be used to gauge existing products retroactively and to produce improvements for the CTSC for complexity to study further. Additionally, industry practices can help inform government maintenance practices for federal complex systems.
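As referenced in the funding recommendation above, the following is a minimal sketch of scheduler-level core segmentation, using Linux CPU affinity as a stand-in for the hypervisor- or SoC-level mechanisms that recommendation describes. The core indices and tasks are illustrative assumptions, and production segmentation involves far more than process pinning; the sketch only shows the underlying idea of reserving execution resources so critical and riskier workloads cannot interact on the same cores.

```python
# Minimal sketch of scheduler-level core segmentation, assuming a Linux
# machine with at least four logical CPUs. Real hypervisor- or SoC-level
# segmentation is far more involved; this only illustrates reserving cores
# so critical and riskier workloads run on disjoint execution resources.
import os

CRITICAL_CORES = {0, 1}  # illustrative: cores reserved for the critical task
GENERAL_CORES = {2, 3}   # illustrative: cores for everything else

def run_pinned(cores, task):
    """Fork a child, pin it to the given cores, and run the task there."""
    pid = os.fork()
    if pid == 0:                        # child process
        os.sched_setaffinity(0, cores)  # restrict this process to those cores
        task()
        os._exit(0)
    return pid

def critical_task():
    print("critical task running on cores", os.sched_getaffinity(0))

def background_task():
    print("background task running on cores", os.sched_getaffinity(0))

if __name__ == "__main__":
    pids = [
        run_pinned(CRITICAL_CORES, critical_task),
        run_pinned(GENERAL_CORES, background_task),
    ]
    for p in pids:
        os.waitpid(p, 0)
```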