Observations on Resilience, Part I: The Heroic St(age)

Nick Rockwell
Jan 17, 2022

I have been thinking about operational resilience — how we keep systems available — lately, again. Jerry Li and Patrick Gallagher at ELC were kind enough to invite me to talk about resilience on their podcast, which gave me an opportunity to collect my thoughts. This is the first of three posts adapted from that conversation.

Resilience has been very much front of mind for the last year at my company. But resilience is always front of mind. If it isn’t, just wait a little while and it will be, because something is going to go wrong. We all learn this sooner or later.

Early in my working life, I had my first real experience with resilience, in a very different context. I spent my teenage years working on my father’s farm, and we learned many tough lessons in resilience. The rain is coming, the baler breaks. What do you do?

At my first startup, I was on call all the time, as the only person who understood the system. There were some really challenging, dark moments. At one point one of my co-workers, who was never on call, challenged me by suggesting I should be as devoted to the product as I was to my newborn son. What disturbed me most about that comment was that, shamefully, I basically was.

Over the last year, I’ve tried to stay alert and observe the actual experience of the organization and the individuals involved in the work. I’ve been challenging myself to start fresh and think from the bottom up, as a generalist rather than an expert — which, fortunately, I am not — and to try to see the patterns in how practice around resilience evolves in younger companies. Where do all the tendrils of resilience lead? It turns out they lead all over the place.

They lead to process, and they lead to culture. They lead to the business model. They lead to the emotional contract you have with each other and the company. They lead to the kinds of relationships you build with your customers. They lead all over.

The Emotional Cycle vs. the Feedback Loop

The simplest construct for responding to operational failure and generating resilience is the continuous improvement feedback loop. Typically this looks like: something goes wrong, we remediate the acute issue, investigate root causes, identify defects, design corrections, and feed those back into the development cycle until the defects are remediated. Then the loop repeats.

Continuous improvement is very powerful. However, it’s hard, expensive, and beyond the realistic capabilities of most young companies. Early on, there’s just no way you’re going to be able to maintain anything close to the kind of continuous improvement process that you need to create real resilience. There’s so much effort, resource, discipline, persistence and learning that goes into it. Most young companies operate in scarcity, and context switching is unavoidable, in fact necessary. The sustained persistence on a single task that continuous improvement requires, the sheer length of the loop, may not be possible, or even advisable.

Instead, we rely on heroism. We rely on key people on the team, often including ourselves, who just care a lot. We feel pain when there’s a problem, and are intrinsically motivated to dive in and work like hell until the problem is resolved, and the pain stops.

Early on, this is not an anti-pattern; it’s a critical survival strategy. I’m not sure companies make it, and even have the opportunity to evolve towards a better process, without that kind of response. Companies that can’t generate it, that don’t evoke that level of urgency, or that perhaps don’t have the right people, may not progress.

The heroic stage is not a bad thing; it’s a necessary thing. However, it has many side effects. One of the side effects is that it runs on an emotional cycle of crisis and relaxation, and reinforces the pattern for all involved. Everybody rallies on the crisis, and energy pours into the initial mitigation, which has a huge emotional payoff. The immediate mitigation is thrilling, and eventually binary. You were in a state of crisis, and now you are not. There is a strong reward function associated with that change of state, that closure.

In contrast, the long loop is very often frustrating. Root causes are elusive, or may not exist, strictly speaking. The long loop rarely has the binary feeling, the state change, of the immediate resolution. Usually we find ourselves back in the domain of design: of trade-offs, of satisficing, of balancing risks, pursuing cause and effect into complexity, where the only “success” available is probabilistic. It can’t compete. We feel the difference between measuring success by the cure of an actual, present harm vs the theoretical reduction in likelihood of a future harm.

So the initial mitigation and its strong reward function actually drain energy from the analysis, learning and long-term mitigation. Combined with the intense pressure in a startup environment to do the next thing, usually switching back to shipping customer value, this makes the closing of the long loop nearly impossible.

This pattern, heavily reinforced and appropriately rewarded, persists for a long time. I think this is the core resistance that organizations have to break through in order to make it to a state of sustainable continuous improvement, which is how you actually build resilience.

Concentration

Concentration of knowledge bears a close relation to the heroism of acute response: it is the same impulse in a state of rest. It can be very pronounced in early companies, and heroic acute response actually reinforces that concentration. The learning from an incident, and any actual long-term mitigation that does take place, the whole long loop, may all be contained in one person’s head. This widens the gap between the critical individuals, the ones whose concentrated knowledge keeps the “bus factor” low, and everyone else.

We correctly regard this as a problem, or a constellation of problems. Reliance on a few key individuals is brittle. It can lead to all kinds of negative dynamics, including resentment and actual hostility towards those people, or various unproductive behaviors from those people. Most of us have seen these kinds of issues.

That familiarity may stem from survivorship bias. It might be that the reason we see this problem so often is that the companies who don’t suffer from knowledge concentration early on actually don’t make it. Generating people who get into a quick cycle and learn fast, and then keeping them so that they don’t burn out and leave, may be another survival gate for young companies. It might even be rare, and it should be rare: the stress, time commitment and disruption that these key people are subject to are rarely sufficiently compensated. A sense of mission is usually a requirement, but that is a subject for another time.

So generating concentration creates problems, but it may be a necessary condition for survival. It might be the only way operationally intense companies make it out of the early stages. Although we don’t hear many stories of companies directly failing due to operational issues, how many fail gradually, after a thousand cuts, fighting rearguard actions, losing their best people, being unable to focus on producing real value, and so on?

The standard response is to try to disseminate the critical knowledge, to get it out of the few heads and into more heads. And this can help. But it’s limited, and has some distinct drawbacks. Effectively, it amounts to trying to scale the heroic model.

What’s required, however, is a completely different model — one that replaces the hero with the machine of continuous improvement. This transition is difficult because it is discontinuous, by necessity and by design. At its core is the replacement of individuals with the collective, of emotional response with process, of the hero with the machine. This transition is the topic of the next post.

Written by Nick Rockwell

SVP Engineering at Fastly, ex-CTO at NY Times. My side project is Heresay.
