The Elephant in the Room one year on

Published On: December 21, 2023

Gill Ringland, Emeritus Fellow at SAMI Consulting, and Professor Ed Steinmueller, University of Sussex, both Co-Chairs of the Service Resilience Working Group of the IT Leaders Forum of the British Computer Society (BCS), write for the NPC to mark a year since the publication of the BCS report, “The Elephant in the Room.” This 2022 report explored the often-neglected risk of software failure and its implications for national resilience. A year on, and building on that analysis, the Working Group has published its latest report in 2023 to align its findings with the now-published UK Government Resilience Framework. This article explores the latest report’s five key recommendations in the context of that Framework, focusing on metrics, data sharing, infrastructure, holistic thinking in leadership, and education and training. The Working Group is now focussing on carrying these recommendations forward as a call to action to improve national resilience.

In November 2022 we published “The Elephant in the Room”. It analysed why risk from software failure is often ignored in thinking about resilience. It concluded that:

Software is different; it is intangible and obeys different rules from physical systems, creating problems in, for example, quality control and regulation.
The software element of digital systems failure is a cost to the economy and society, and one that will only increase now that software has become a utility, is more widely used, and is becoming more vulnerable to failure.
More people and organisations need to be aware of the actual and potential impact of software failures.

Since then, the UK Government Resilience Framework has been published. It is built around three fundamental principles:

That we need a shared understanding of the risks we face.
That we must focus on prevention and preparation.
That resilience requires a whole of society approach.

As the Service Resilience Working Group of the IT Leaders Forum of the British Computer Society (BCS), we endorse the principles. And we have used them to help explore further why the risk from software failure is often ignored. Through a series of interviews, reports, blogs, roundtables and webinars, we have asked: what could increase shared understanding? How can senior managers be as comfortable quizzing the CTO as they are the FD? How can they appreciate the potential impact that can result from software failure for which they will be held accountable?

We realised that the software systems which underpin the delivery of many services are complex and tightly coupled. Such systems exhibit unpredictable behaviour, including failures. Most organisations depend on systems built from components originating elsewhere, and few organisations have the purchasing and contract management expertise to anticipate the behaviour of any one of the many components in their supply chain.

So, we decided to turn our attention to the consequences of software failures, focussing on how to improve the resilience of services which are delivered through software. Our latest report outlines recommendations under five headings. The implementation of these recommendations would together contribute to national resilience.

The first area is metrics – how to describe and measure resilience. We suggest adopting a framework which is already in use for Registered Data Service Providers – the NIS framework. This classifies the extent of service outages under four dimensions: availability (lost user hours); loss of integrity, authenticity or confidentiality of data stored or transmitted; risk to public safety, public security, or of loss of life; and material (financial) damage to users. While adoption of this framework does not indicate strategies for prevention or preparation, it does improve shared understanding and provides a quality benchmark for data collection.
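As an illustration only, the four dimensions above could be captured in a single outage record so that reports from different organisations are comparable. The sketch below is not taken from the NIS framework or from the report: the class name, the field names and the assumption that lost user hours equal users affected multiplied by outage duration are all hypothetical, and the figures in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageRecord:
    """One service outage, described on the four dimensions listed above.

    All field names are hypothetical and chosen for this sketch.
    """
    service: str
    users_affected: int
    duration_hours: float
    data_compromised: bool      # loss of integrity, authenticity or confidentiality
    safety_risk: bool           # risk to public safety, public security, or of loss of life
    material_damage_gbp: float  # material (financial) damage to users

    @property
    def lost_user_hours(self) -> float:
        # Assumption: availability impact = users affected x outage duration.
        return self.users_affected * self.duration_hours


def total_lost_user_hours(records):
    """Sum the availability impact across a collection of outage records."""
    return sum(r.lost_user_hours for r in records)


# Invented example: 2,000 users unable to use a service for 1.5 hours.
example = OutageRecord(
    service="online payments",
    users_affected=2000,
    duration_hours=1.5,
    data_compromised=False,
    safety_risk=False,
    material_damage_gbp=0.0,
)
print(example.lost_user_hours)
```

A record of this kind says nothing about prevention or preparation, but it gives every organisation the same fields to fill in, which is the precondition for the data sharing discussed next.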

Once a taxonomy is agreed, data can be shared, and this is our second recommendation. There can be reluctance to collect data about operational experiences (especially failures) while using software products or services. This may be partly attributable to a lack of understanding of what has gone wrong (and so of what exactly to report). It may also be exacerbated by a widespread tolerance in everyday life of systems that sometimes ‘go down.’ There is even more reluctance to share this information. Here we see a leadership role for the public sector, in publishing its own outage data and supporting information sharing more widely.

The third recommendation focuses on infrastructure. There is a qualitatively new risk to physical infrastructure from the increasing dependency of Industry 4.0[1] on digital infrastructure systems. But infrastructure organisations do not yet appreciate the potential scale and impact of this risk, so are not prepared for the consequences resulting from software failures.

Further, infrastructure organisations are particularly vulnerable to service outages due to software failures. Across infrastructure sectors, new technology is being combined with legacy systems. The drivers for developing new products are functionality and speed to market, aiming to make the new technology suitable for many sectors, whilst specialising in none. These products have different risk profiles to the existing legacy systems: data exchanges are fragile. Other challenges are real time operation, the increasing array of interconnections and interdependencies, and the growing diversity of users of services[2]. And infrastructure outages have knock-on effects on the economy and society.

We recommend that the remit of regulators of Operators of Essential Services (OESs) should be expanded to require reporting on digital service outages, ideally using the NIS framework. This would enable regulators to address and set standards for service resilience.

The fourth recommendation applies to infrastructure organisations (among others) who supply services that people rely on. We advocate that Boards and C-Suites should think more holistically about the impact of service outages on doing business, delivering on their purpose, or meeting their commitments. We map out how this might be done, and also recommend that organisations design and put in place alternatives or workarounds consistent with the organisation’s tolerances of failure. These should include restoration of data generated during the outage, investing in the additional human capacity or skills needed, and implementing robust communication and operational protocols.

Finally, we make a number of recommendations on education and training. More IT and risk professionals need to be fluent in methods for increasing service resilience under 24/7 operational conditions. This probably requires certification against ISO and BSI standards. However, the biggest hurdle to improving national resilience against software failure may be the lack of a shared understanding of the risks inherent in software-enabled services. Recent surveys find that members of the C-Suite are overwhelmingly unaware of the risks to their business and reputation from service outages due to software failure. We recommend that methods of estimating or modelling the economic and societal impact of software failures, including their effect on services, should be part of management education.

We also recommend that Government, Boards and C-suite should take steps to improve their understanding of their organisation’s service resilience against software failures. This could include: activities to engage the imagination of senior managers about failure possibilities and consequences through simulation games or working through software failure scenarios; dialogue structured around the service delivery approach and leading to action planning to improve resilience to software failures across their supply and demand chains; and management education of the next generation of C-Suite to ensure better understanding of the role of resilience in delivering services.

The Working Group has now completed the work it set itself; however, members will be carrying these messages forward as a call to action to improve competence, shared understanding and national resilience.

[1] Industry 4.0, also called the Fourth Industrial Revolution or 4IR, is the next phase in the digitization of sectors previously less affected, driven by disruptive trends including the rise of data and connectivity, analytics, human-machine interaction, and improvements in robotics: https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-industry-4-0-the-fourth-industrial-revolution-and-4ir

[2] https://www.bcs.org/media/11134/itlf-service-resilience.pdf
