Where and why systems fail - the value of knowing what others value in a crisis

Published On: July 12, 2023

This paper, by John Beckford of Beckford Consulting and Katie Barnes, Executive Director of the National Preparedness Commission (NPC), explores the concept of systemic resilience and failure in the face of crises. It is the third in a series of papers commissioned by the NPC on applying systemic thinking to the notion of national preparedness. The first paper discussed the role of data in preparedness and the second explored crises, resilience, and complex systems.

This paper discusses the importance of stakeholder priorities in understanding where to focus resources during a crisis. It emphasises that the value of a system lies in the eyes of the beholder, and different stakeholders may have different interests and needs. Understanding these varying interests is crucial in determining system success and resilience. During crises, system owners should prioritise understanding the needs of stakeholders and communicate with them effectively. A systemic mapping of the elements and interactions within a system is also vital for identifying alternative solutions and managing dynamic risks.

Crises come in many forms and affect individuals, communities and organisations in a variety of ways. They have one thing in common, highlighted by the very language we use to describe them (e.g., shock or emergency): they are portrayed as something out of the ordinary, with the potential to create severe disruption, degradation or loss and, if proper preparations have not been made, to distract attention for far longer than necessary.

A systemic appreciation of resilience in the face of crisis can enable a considered response that minimises negative impact and positions an organisation (using this term in its loosest possible sense) to identify and capitalise on opportunities resulting from perturbation. Crucially, the priorities of stakeholders – and not those of the organisation itself – hold the key to understanding where to focus resources in dealing with a crisis and its aftermath.

Systemic resilience and failure

In a recent paper (see Beckford et al., 2023), infrastructure resilience was defined in accordance with the UK government context as follows (see DfID, 2016): “Resilience in infrastructure may be interpreted simply as the robustness of a structure, to withstand hazards.”

When we talk about resilient systems, however, we should look beyond the built environment and consider not just physical artefacts and organisations, but the whole political, economic, social and technical context in which we are living. Systemic resilience does not reside in artefacts but in all of the interactions and interdependencies which enable systems to fulfil their purposes, or not. The existence of ‘system’ can be said to be ‘observer defined’, and system failure does not simply belong to the artefact, but to the effect on the observers of the functioning (or not) of that set of artefacts. These systems concepts were recently discussed in an article written for NPC (see Pantaleon 2023).

Failure, then, is also a matter of opinion, arising from the systemic effects on those describing the system. For example, the failure of a piece of rail infrastructure, such as the Nuneham Viaduct near Oxford, is quite clearly a failure of the artefact itself: a collapsing abutment impedes its performance as a railway bridge. Provided, however, that there are alternative ways of completing the journeys that would have been made across that viaduct, there is not a whole-system failure, just a rail system failure. Similarly, if the purpose of the customer (to whom the journey was incidental or enabling, rather than purposeful in itself) can be achieved by other means, such as a phone or video call, then the system itself is resilient to the artefact failure. We therefore have multiple levels of potential failure, requiring multiple appreciations of resilience.

Perhaps what should matter to the owners, governors and operators of a system is not the value it provides to them, but the value it enables for those who use or exploit it. Value is in the eye of the beholder.

Systems and how to approach them

Systemic thinking, as opposed to ‘thinking about things as systems’, is holistic and embraces physical, informational and metaphysical dimensions of existence. Systemic thinking demands that we embrace the whole; that we carefully delineate boundaries around the subject of our inquiry and ask not ‘what it is’ but ‘what it does’, i.e., how do we experience the system? The idea of ‘system’ can be elusive, but grasp it we must if we are to understand systems better. A system can be defined:

Technically: an arrangement of parts or elements that together exhibit behaviour or meaning that the individual constituents do not.
Contextually: a set of processes and services that work together towards common outcomes, such as justice, health or education.
Casually: an ephemeral ‘other’ that is not easily definable (i.e., the ‘system’ doesn’t work!).

Laying out any system in terms of its purpose and stakeholders, such as beneficiaries, customers, owners and users, allows us to ask: “When things go wrong, what matters, to whom, and why?”

One possible answer to that question might include preserving the system’s (bureaucratic) processes, but that is only what matters to the system’s designers and owners. What matters to customers or users is perhaps the more important question; it is, after all, for their benefit that the system should exist in the first place.

Looking at outcomes for stakeholders – the effects or products of the system in terms of the difference they make – is the only reliable way to understand how best to respond in a crisis. It is also important to realise that different stakeholders may have very different interests; owners perhaps seek profit-maximisation or loss-minimisation, whilst customers seek service provision, and neighbours seek ‘good citizenship’, for example.

An example here might be the choice facing a transport operator when there is major disruption. It can either insist on full and appropriate ticketing within the normal bounds of commercial operations, or focus on getting customers to their various destinations by honouring tickets rendered invalid by the disruption. The first decision serves the business model; the second serves the customers.

The essence of comprehension rests in appreciating the varying ‘purposes’ imputed to the system by its observers. Using a well-known example, if we consider the purpose of a pig through the eyes of its ‘stakeholders’:

A farmer will see the pig as a source of income and will be looking for a good return on their investment.
A wolf will see the pig as a source of food.
A poet may see the pig as an object of inspiration.
A vet will see the pig as a patient.

At first glance, all four of these fictitious stakeholders may want what is good for the pig, but the definition of ‘good’ will vary between them, depending on the purpose or utility of the pig to them. A farmer will want high-quality meat that will fetch a high price, and is probably more concerned with leanness than the wolf, who needs sustenance and a slower, fatter pig that is easier to catch. A poet is potentially more concerned with the drama of the pig’s condition, be it good or bad, and the vet has a vested interest in keeping the pig just healthy enough to require occasional visits over a long life.

Here, then, is our first understanding of system success: what matters to the observer (a stakeholder) is what they consider success, and this becomes critical when thinking about resilience. In times of crisis, during emergencies or at other times when the system and its context may be in flux, its use might deliver something other than its owner-intended function. Well-curated systems should be capable of absorbing ‘normal’ shocks and adapting accordingly, but can only do so with a coherent and shared understanding of purpose.

Pleasing all of the people…

We cannot expect to please all of the people all of the time (though sometimes it may be possible). Instead, leaders, system owners and those in positions of system governance must surface the tensions that may exist, as a first step towards optimising or satisficing outcomes.

One obvious starting point is to explore what success or failure means to various stakeholders. In the simple example of a train cancellation, the operator suffers a degradation of its performance indicators, some reputational damage and an economic impact through compensation at ticket face value, and so can measure the impact in terms of ticket prices. For the customer there is consequential loss, which may be significantly higher because they have more at stake: e.g., a missed job interview, wasted time and expense getting to the station, the social cost of waiting for the next train, or lost productivity for their employer. The outcome is different for each stakeholder, so there is a need to develop an understanding of:

tolerances: the extent to which stakeholders are willing to accept sub-optimal outcomes.
dependencies and interdependencies: the system effects that enable something to happen outside of the system itself.
power differences: regulatory, financial, political power that each party brings to the discussion and how that affects the possible outcomes.

Each of these is largely informed by testable assumptions about what makes the system important to each stakeholder, and surfacing those assumptions begins to elucidate a sense of the value of the system (and hence the cost of its failure). The analysis can become complex, particularly in organisations that are owned by one party, funded by another (or several others), and serve yet another: just who is the customer? It is critically important to understand not just the direct economic losses and gains from interdependency resilience (or the lack of it) but also the social consequences. Much of this is not calculable on an actuarial basis but requires consideration and empathy. In earlier or more rudimentary economic thinking this would have been the calculation of opportunity cost, the value of “what might have been”.

A very simple example is the use of rail replacement buses for rail service failures. While the service is resilient (it is still possible to complete a journey) there are financial, time and social costs to customers – but we tend only to value the cost of the delay whilst taking no account of the wider effects.

Efficiency versus resilience

If system success can be understood, controls can be aligned to the delivery of outcomes and assumptions can be tested, what remains to be done? Surely the system is now resilient? The analysis described so far will certainly help to inform good design and operational decisions, but one further tension remains to be tackled: the tension between efficiency and resilience. It is too simplistic to claim that efficient systems are never resilient (and vice versa), but the two certainly seem to pull in opposite directions more often than is desirable.

It is perhaps the definition of resilience that underpins the apparent polarisation of outcomes here. Resilience can be achieved in many ways, one of the most visible being to design in redundant (or spare) capacity. Other approaches use contingency plans to ensure that outcomes are still met by switching to alternative (pre-defined) processes. Some simple examples include:

A system delivering 20 units of product each day might be specified to increase production to 30 units in order to cope with unexpected spikes in demand.
An organisation’s primary dataset might be copied live to a datacentre in a different building, so that it is retrievable in case of flood, fire, power outage or other disaster.
A secondary set of systems might be maintained to enable a ‘hot failover’ should the first set of systems fail.
A pharmacy might deliver patient prescriptions to the doorstep during a pandemic-related ‘lockdown’ period.

Clearly, the extent to which resilience is designed into a system should be proportionate to both its risk exposure and the purpose (from which we can derive value) of the system’s primary functions. The ability to innovate will also influence the need to invest in resilience measures (and can be counted as one in and of itself).

It is possible to maintain efficiency alongside resilience; however, every time effort, process or storage is duplicated, it has the potential to reduce overall short-term cost-efficiency. The additional costs need to be weighed against the value of achieving resilience, and herein lies the challenge. Justifying additional expenditure, especially in periods of economic downturn, requires sufficient foresight and insight to estimate the probability and cost of failure.
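The trade-off above can be made concrete with a small expected-loss calculation. The Python sketch below is illustrative only: the function names and every figure in it are hypothetical assumptions, not data from this paper. It compares the annual cost of a resilience measure with the expected loss it avoids (annual probability of failure multiplied by the cost of one failure), and it sums that cost across stakeholders rather than counting only the owner's losses.

```python
from typing import Mapping


def cost_of_failure(losses_by_stakeholder: Mapping[str, float]) -> float:
    """Total cost of a single failure, summed across every stakeholder listed."""
    return sum(losses_by_stakeholder.values())


def expected_annual_loss(p_fail: float, losses_by_stakeholder: Mapping[str, float]) -> float:
    """Expected loss per year: annual probability of failure x cost of one failure."""
    return p_fail * cost_of_failure(losses_by_stakeholder)


def net_benefit(annual_cost: float, p_without: float, p_with: float,
                losses_by_stakeholder: Mapping[str, float]) -> float:
    """Expected loss avoided by a resilience measure, minus its annual cost.

    p_without / p_with are the assumed annual failure probabilities without
    and with the measure in place. A positive result suggests the measure
    pays for itself under these (hypothetical) assumptions.
    """
    avoided = (expected_annual_loss(p_without, losses_by_stakeholder)
               - expected_annual_loss(p_with, losses_by_stakeholder))
    return avoided - annual_cost


# Hypothetical losses from one failure of a service, e.g. a train cancellation.
owner_only = {"operator": 500_000}
all_stakeholders = {"operator": 500_000, "customers": 1_500_000}

# Hypothetical measure: a secondary site costing 60,000 a year that cuts the
# annual failure probability from 5% to 1%.
print(round(net_benefit(60_000, 0.05, 0.01, owner_only)))        # -40000
print(round(net_benefit(60_000, 0.05, 0.01, all_stakeholders)))  # 20000
```

Counting only the operator's losses, the measure looks unjustified; once customers' consequential losses are included, the same measure pays for itself. In practice, estimating those probabilities and consequential losses is the hard part, which is why the stakeholder analysis described earlier matters.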

The value of knowing what others need to know

We have seen that system success (and hence system failure) is defined by the observers or stakeholders of the system – individually and collectively. Some aspects of success will be agreed upon, but some will not, and negotiations or trade-offs will be necessary.

During perturbations, crises or the aftermath of disasters, the temptation for system owners is to focus on what they deem (or assume) to be of most value. Choosing which stakeholders’ needs to prioritise over others’ is often a complex mix of the effects of power, politics and proximity. Such a choice is also inherently flawed if the underlying needs of each stakeholder are not properly understood.

System interruption or change may not inherently be a problem for users of the system. They may, in fact, be prepared to do without it for a period of time provided they know what is happening, what the problem is and what a return to normal might look like (process, limitations, timescales, etc). Other users may not need the whole system to work but might rely on one aspect of it (data, for example) to operate their own system. Finding an alternative solution to facilitate such a data exchange could be far more cost-effective than finding an alternative end-to-end solution. This can be achieved through a systemic mapping of the elements of a system and its interactions and inter-dependencies.
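A systemic map of this kind can start very simply. The Python sketch below is a hypothetical illustration, not a method described in this paper: the stakeholder names, system elements and alternatives are all invented. It records which elements each stakeholder depends on and which alternatives could stand in for a failed element, then reports, for a given failure, the dependencies each stakeholder still cannot meet.

```python
from typing import AbstractSet, Dict, Set

# Hypothetical map of which system elements each stakeholder relies on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "passenger": {"rail_line", "timetable_data"},
    "journey_planner_app": {"timetable_data"},  # needs the data, not the trains
}

# Hypothetical alternatives that could stand in for a failed element.
ALTERNATIVES: Dict[str, Set[str]] = {
    "rail_line": {"replacement_bus"},
    "timetable_data": {"manual_data_feed"},
}


def unmet_needs(failed: AbstractSet[str],
                available: AbstractSet[str] = frozenset()) -> Dict[str, Set[str]]:
    """Map each affected stakeholder to the failed elements they depend on
    for which no alternative is currently available."""
    result: Dict[str, Set[str]] = {}
    for stakeholder, needs in DEPENDS_ON.items():
        unmet = {element for element in needs & failed
                 if not ALTERNATIVES.get(element, set()) & available}
        if unmet:
            result[stakeholder] = unmet
    return result


failed = {"rail_line", "timetable_data"}
print(unmet_needs(failed))
print(unmet_needs(failed, available={"manual_data_feed"}))
```

With no workarounds in place both stakeholders are affected. Making only the manual data feed available leaves the app fully served, so only the passenger's need for a route remains: a targeted data fix, rather than an end-to-end replacement, resolves one stakeholder's problem entirely.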

From this it should be clear that knowing what others need to know is a valuable tool in systems resilience, both in the initial design stages of a system and in its continued improvement and tuning. It is invaluable during crisis operations, when trade-offs are almost certainly needed and may well differ from those that would be made in times of normal operation. Such an approach also addresses a final aspect of managing resilient systems: understanding how they can be flexed to manage dynamic or evolutionary risks.

Averting systemic failure: towards a contingency theory of resilience

Reflecting on this paper, we can appreciate that neither failure nor resilience can be defined through a singular lens. Each is multifaceted: a function of the particular observer of the system, and a product of the dynamic interactions between the system elements as they act and are acted upon by each different observer. Systemic success or failure is dependent on the multiplicity of actors, artefacts, and interactions through which the system is defined, and we synthesise our systemic understanding by integrating their differing interpretations of system and system effect.

Conventionally, we have arranged our organisations through a bureaucracy designed in the 1920s (see Weber, 1924) to be machine-like, stable, reliable, repeatable and consistent. That may have been appropriate when Weber was writing, but in 2023 it seems at best inept and at worst potentially catastrophic. The turbulence and, in particular, the speed of cascading effects rooted in highly interdependent digital systems (systems which process and transmit data far faster than human beings can deal with) mean that we need to develop organisations that are more adaptive, responsive, and capable of managing dynamic risk and of evolving.

As Stafford Beer put it, “if it works, it’s out of date”.

References:

1. Beckford, J. (2023): Making the Business Case for Resilience. https://beckfordconsulting.com/wp-content/uploads/2023/05/Making-the-Business-Case-for-Resilience-V1.2.pdf
2. Department for International Development (DfID) (2016). https://www.gov.uk/research-for-development-outputs/introducing-infrastructure-resilience
3. Pantaleon, M. (2023): Crises, resilience, and complex systems. https://nationalpreparednesscommission.uk/2023/05/crises-resilience-and-complex-systems/
4. Weber, M. (1924): ‘Legitimate Authority and Bureaucracy’, in D.S. Pugh (ed.), Organisation Theory: Selected Readings, 3rd edition, 1990, Penguin, London.
