Joe Williams home

Failure Is Not An Option

I ride my bike a lot. During most of 2020, much of that time has been solo riding. I got an Audible subscription and have been listening to books during my hours of solitude. Recently I listened to Failure Is Not an Option by Gene Kranz. Overall it's an interesting insider's perspective on flight control and the career of a flight director during the Gemini and Apollo missions. More than that, it has lessons for any team dealing with complex systems in high-pressure environments. Throughout the book I heard anecdotes that rang true for me, having been on numerous teams of folks running software and infrastructure.

Multiple connected systems are a single system

Kranz talks about a mistake they made during Gemini 8. When considering the Gemini and Agena, they initially thought of them as two different spacecraft and failed to see them as a single system when docked. This has the same vibe to me as the classic Leslie Lamport quote: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." It really drives home that we need to take a holistic view of our systems and be mindful of the emergent behavior that two or more interacting subsystems can create. Even relatively simple systems create behavior we couldn't have imagined. I think one tool Kranz used to combat this complexity was ensuring his teams had shared context.

Strong teams require shared context

A number of times in the book, Kranz talks about getting groups of subject matter experts working together. In every case, without saying it, he seems to be trying to create a shared context within the team. I think he realized early on that it's impossible for any single team member to understand the entire system. Beyond getting people talking and working together, one way he did this was to create his mission "binder". The binder gave teams a knowledge base covering every part of the system and mission they might need, right at their fingertips.

A shared context not only ensures everyone is on the same page, it makes the team more resilient to unexpected situations. It allows team members who are experts on disparate parts of the system to understand how their subsystem interacts with the others. Teams of diverse expertise can solve problems of greater complexity; we see this today with interdisciplinary teams, devops, SRE embedding, and the rise of generalists. A shared context breaks down silos and is the glue that holds a team of specialists together.

End to end testing includes human factors

It's impossible to miss the amount of testing Kranz and his teams did prior to launch. There were readiness reviews and seemingly endless simulations. When each mission carries huge risks, the only way to prepare is to simulate real-world conditions; you can't canary deploy in space. The sim sup (simulation supervisor) would devise numerous scenarios that mission control would need to work their way out of. In one extreme simulation they even had a flight controller fake a heart attack. They saw the importance of viewing the human systems and mechanical systems as a single hybrid system that needed to be tested as one. What results is mechanical sympathy: not just knowing how the system works, but having an intuition for how it behaves, including the humans operating it.

Rollbacks don't exist

Another important lesson is that there's no such thing as an undo button. This becomes crystal clear when launching spacecraft. Kranz commented that they had to continuously deal with each problem in a forward-looking manner, as it wasn't possible to un-launch. Each flight was a series of go/no-go (or stay/no-stay) decisions. Once a decision was made, they had to figure out how to fix any problems that might arise, always moving forward towards the ultimate goal.

I think this is true in software as well. We are trained to think the last change made was the one that introduced the problem. As a result, we tend to think we can undo the problem by reverting our changes. Unfortunately, it's a trap. Rolling back happens in linear time like all macroscopic events. Our mental model of code changes, in distributed-systems speak, should be linearized. We might be able to revert the code we just deployed, but we don't always know the effect it'll have on the system. In the Gemini 8 example above, the rollback (in this case, undocking from the Agena) made matters worse for Gemini, increasing the rate at which the craft was spinning. When dealing with complex systems, rolling back is rolling forward, even if it's to the previous version.
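One way to picture the linearized mental model above is to treat deploy history as an append-only log. In this sketch (all names are illustrative, not from any real deploy tooling), a "rollback" is just another forward event whose payload happens to be the previous version; the log only ever grows, just like real time:

```python
# Sketch: deploy history as an append-only log.
# A "rollback" is not an undo; it is a new forward deploy that
# happens to point at the previous version. Hypothetical names only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Deploy:
    version: str  # version being made live
    note: str     # why this deploy happened


@dataclass
class DeployLog:
    events: List[Deploy] = field(default_factory=list)

    def deploy(self, version: str, note: str = "release") -> None:
        self.events.append(Deploy(version, note))

    def rollback(self) -> None:
        # "Rolling back" appends a new event referencing the prior
        # version. Nothing is removed; history stays linear.
        previous = self.events[-2].version
        self.events.append(Deploy(previous, "rollback (still a forward deploy)"))


log = DeployLog()
log.deploy("v1")
log.deploy("v2")
log.rollback()

print([e.version for e in log.events])  # ['v1', 'v2', 'v1']
```

The system ends up running v1 again, but it got there via a third event, not by erasing the second; any side effects v2 had on the wider system are still out there.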
