Joe's Blog

New Job Reading List

2023-12-17T00:00:00+00:00

I left my job in September and I am about to start a new one in January. The last few months I’ve spent most of my time wrangling a toddler, riding bicycles and taking care of household chores and projects. Important work but not geared towards thinking about how a packet gets from place to place in a correct, fault-tolerant and performant way. With a few weeks to go I have given myself an assignment of doing some reading to hopefully learn something new, learn more about the tech my soon-to-be employer has built and generally get the juices flowing. A self-directed “warm up” before the “race” if you will. What follows is that list; some of this is a refresher course and some is purely curiosity. Other parts are to learn something about a self-perceived weak spot in my knowledge (ahem … notice all those MPLS links at the bottom). Wish me luck! We’ll see how far I get during nap time and between diapers.

Fastly

IPv6

DNS

BGP, routing, anycast

Security

Performance, congestion, etc

Automation, configuration, verification and correctness

Monitoring and measurement

Load balancing

MPLS

Failure Is Not An Option

2020-12-28T00:00:00+00:00

Failure Is Not An Option

I ride my bike a lot. During most of 2020 much of that time has been solo riding. I got an Audible subscription and have been listening to books during my hours of solitude. Recently I listened to Failure Is Not an Option by Gene Kranz. Overall it is an interesting insider’s perspective on flight control and the career of a flight director during the Gemini and Apollo missions. More than that, it has lessons for any team dealing with complex systems in high-pressure environments. Throughout the book I heard interesting anecdotes that rang true for me, having been on numerous teams of folks running software and infrastructure.

Multiple connected systems are a single system

Kranz talks about a mistake they made during Gemini 8. When considering the Gemini and Agena they initially thought of them as two different spacecraft and failed to see them as a single system when docked. This has the same vibe to me as the classic Leslie Lamport quote “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”. It really drives home that we need to take a holistic view of our systems and be considerate of the emergent behavior that two or more interacting subsystems can create. Even relatively simple systems create behavior we couldn’t have imagined. I think one tool Kranz used to combat this complexity is ensuring his teams had shared context.

Strong teams require shared context

A number of times in the book Kranz talks about getting groups of subject matter experts working together. In every case, without saying it, he seems to be trying to create a shared context within the team. I think he realizes early on that it’s impossible for any single team member to understand the entire system. Beyond getting people talking and working together, one way he does this is to create his mission “binder”. The binder provides teams with a knowledge base about all parts of the system and mission they might need, right at their fingertips.

A shared context not only ensures everyone is on the same page, it makes the team more resilient to unexpected situations. It allows team members that are experts on disparate parts of the system to understand how their subsystem interacts with the other subsystems. Teams of diverse expertise can solve problems of greater complexity; we see this today with interdisciplinary teams, devops, SRE embedding and the rise of generalists. A shared context breaks down silos and is the glue that holds a team of specialists together.

End to end testing includes human factors

It’s impossible to miss the amount of testing Kranz and his teams did prior to launch. There are readiness reviews and seemingly endless simulations. When each mission includes huge risks the only way to prepare is to simulate real-world conditions; you can’t canary deploy in space. The sim sup would devise numerous scenarios that mission control would need to work their way out of. In one serious but extreme simulation they even had a flight controller fake a heart attack. They saw the importance of viewing the human systems and mechanical systems as a single hybrid system that needed to be tested as one. What results is mechanical sympathy: not just knowing how the system works but having an intuition about how it behaves, including the humans operating those systems.

Roll backs don’t exist

Another important lesson is that there’s no such thing as an undo button. This becomes crystal clear launching spacecraft. Kranz commented that they had to continuously deal with each problem in a forward looking manner as it wasn’t possible to un-launch. Each flight was a series of go/no-go (or stay/no-stay) decisions. Once the decision was made they had to figure out how to fix any problems that might arise, always moving forward towards the ultimate goal.

I think this is true in software as well. We are trained to think the last change that was made was the one that introduced the problem. As a result we tend to think we can undo those changes by reverting our changes. Unfortunately, it’s a trap. Rolling back happens in linear time like all macroscopic events. Our mental model of code changes, in distributed systems speak, should be linearized. We might be able to revert the code we just deployed but we don’t always know the effect it’ll have on the system. In the Gemini 8 example above the roll back, in this case undocking from the Agena, made matters worse for Gemini, increasing the rate the craft was spinning. When dealing with complex systems rolling back is rolling forward, even if it’s the previous version.

Objects Of Power

2020-09-30T00:00:00+00:00

Objects of Power

Langdon Winner’s Do Artifacts Have Politics? (synopsis of the essay) has been a bit of a touchstone for me in the last few months. The essay makes clear that the things humans make have both intended and unintended consequences. Some of these objects affect society by being built with expressly political design but others are innately political simply by existing and being used. The essay discusses a number of examples such as the low overpasses Robert Moses built deliberately to limit public transit from reaching Long Island and hegemonic energy infrastructure such as nuclear vs more decentralized and democratic solar. However, I am interested in viewing this work through the lens of modern information technology and I think a lot of what Winner describes carries forward. The internet itself is a prime example.

Notions of centralization and decentralization are common when talking about the structure of the internet. While the underlying infrastructure of the internet started as a distributed network, today it is far more centralized due to the economics of running data centers and networks. This pressure towards centralization shouldn’t be too surprising given the military origins of the internet. Going back to Winner’s essay, Jerry Mander is quoted regarding nuclear power:

“… if you accept nuclear power plants, you also accept a techno-scientific-industrial military elite. Without these people in charge, you could not have nuclear power.”

Of the creation of the ARPANET and the later internet it would not be a stretch to say that if you accept the internet, you also accept a techno-scientific-industrial military elite. Without these people in charge, you would not have the internet. The same is true for many foundational technologies we use today, such as GPS. Additionally, it’s arguable that the “center” of the internet exists near the Dulles airport; it’s not a mistake that it happens to be in the US. While the internet looks a lot different today, the vestiges of its military origins and centralized power remain.

On top of this physical infrastructure we have built the web. Just having infrastructure isn’t very useful; we need applications and code running in those data centers to do the things we want. At the application layer we are faced with the economic challenges and complexity of running distributed systems. As a result internet services tend to become centralized. A visceral example is that basically no one runs their own email servers these days, opting instead for a relatively small number of companies to do it for us. It simply doesn’t make sense for everyone to run an email server at their house. The result is a small number of companies dictating how we communicate and how we store all of our family photos, writing and anything else we want anyone, including our future selves, to experience or see.

What we end up with is the internet built with political design and innately political services running on top of it. Society set the conditions for the internet to exist as it does today and the internet now affects society. I venture that this feedback loop carries a “memory” of past decisions and biases with it, something along the lines of long-range dependence. I think sensitive dependence on initial conditions likely also plays a role; without the cold war the internet would not exist as we know it today. Conway’s Law suggests that organizations design things that mirror their organizational structure, and as a result the military built the internet in its own image. It’s clear that the physical manifestation of the early internet has tangible and lasting effects on what would later get built on top of it and how society would use those services. The implications and politics of this now impact literally every facet of our lives.

Changing gears a little, I don’t think Winner could have anticipated how pervasive technology has become in the subsequent forty years since the essay was written but that doesn’t make the essay any less true. We all read articles reciting how our phones are always-on, networked supercomputers. Many of us feel “naked” without them, unable to connect with anyone or anything that matters. In Computational Thinking the authors describe technology, and more specifically information technology and computing, as a human multiplier allowing us to do more with less. In many cases this may be the same behaviors we have always done but now amplified in every way that matters, such as speed, effort and reach. I like this multiplier analogy because it drives home that technology is a tool that humans use for human purposes rather than it be technology only for technology’s sake. As a result technology is a powerful multiplier for our own biases, be it a bridge or a cloud service.

Additionally, the internet enables zero distribution and transaction costs and the power of that can’t be overstated. The fact that any digital product can be created and then downloaded, installed and used by just about anyone is remarkable and carries huge implications for society. I would go so far as to say that while the economics of running the cloud tend to create pressure towards centralization at the infrastructure level, zero distribution and transaction costs create decentralization from the perspective of who can be involved and what gets created. We see this tension in the platform battles today. We have centralized platforms sandwiched between decentralized consumers and creators. Without the former the latter would be hard to find, consume and pay; without the latter the former would be a vacant strip mall. While there are caveats to the current system, it goes without saying that it’s never been easier to create something and get it into the hands of someone who might use it.

So, what does all of this mean? The bottom line is that the biases we see in society, however egregious, can end up being reflected in the objects we build. Left unchecked, these biases impact who uses those objects and how, prolonging whatever biases and assumptions they were built with well into the future. For instance, overpasses are rarely replaced. We must also be aware that technology can affect society not only by what it does but by how it’s used; banal technologies used in abhorrent ways are just as bad as abhorrent technologies used in banal ways. The combination of ubiquitous computing and zero distribution and transaction costs means we have more power than ever to multiply and distribute whatever we create. It’s important for everyone, as creators, designers and engineers, to be considerate of the impact of our work on society, to take an active role in checking our biases and to be aware of how the things we build get used. As the old saying goes, with great power comes great responsibility. If we are not considerate in our creation we will build the internet equivalent of low overpasses for future generations.

On A Plus Minus For Engineering Teams

2020-09-28T00:00:00+00:00

On a Plus/Minus for Engineering Teams

Management is rife with sports metaphors; we are all team players trying to get our projects over the line. Netflix’s CEO famously said they are a sports team, not a family. We even go so far as to identify individuals and create roles on engineering teams that map on to sports team archetypes: someone might be a good facilitator, a quarterback or good in clutch situations. What is less obvious, and not standardized, is how we evaluate individual and team performance. Many times quantifying performance is far more qualitative and mysterious than anyone in the process would prefer. How can we understand the impact someone has on their team, projects and organization as a whole the same way that a basketball team might? How can we ensure that their evaluation encompasses the myriad ways they might contribute, rather than just counting things like commits or shipped code?

First off, let me just say I don’t have the answer but I think we as an industry can do better. A step in the right direction might be going beyond the sports analogy and evaluating individuals and teams using sports-like statistics. In basketball there is an all-encompassing statistic called plus/minus.

“In its simplest form, plus-minus is exactly what it sounds like – when a given player is on the floor, be it for a single game, group of games or a season, does his team get outscored or does it outscore the opponent? This very simple metric is housed in most common single-game box scores, and is the rawest way of determining what sort of effect a player has on his team (and the opponent) while on the court.”

“… the general goal remains to contextualize the effect a player has on his team and opponents while accounting for as many situations and player combinations as possible. Rather than tracking what a player accomplishes individually, the idea is to determine what each individual player’s cumulative contribution has meant to what their team does while they’re on the floor.”
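As a rough illustration of the raw statistic described in the quotes above, here is a minimal Python sketch. The stint data, the player names and the `plus_minus` function are hypothetical and exist only for this example; real plus/minus variants adjust for lineups and opponents in ways this does not attempt.

```python
from collections import defaultdict


def plus_minus(stints):
    """Return each player's raw plus/minus.

    `stints` is a list of (players_on_floor, team_points, opponent_points)
    tuples, one per stretch of play with a fixed lineup (hypothetical format).
    """
    totals = defaultdict(int)
    for players, team_points, opponent_points in stints:
        # The scoring margin for this stretch of play.
        margin = team_points - opponent_points
        # Every player on the floor is credited with the same margin,
        # no matter who actually scored.
        for player in players:
            totals[player] += margin
    return dict(totals)


# Made-up data: three stretches of play with different lineups.
stints = [
    ({"A", "B"}, 10, 6),  # team outscores the opponent by 4
    ({"A", "C"}, 4, 9),   # team is outscored by 5
    ({"B", "C"}, 8, 8),   # even
]
result = plus_minus(stints)
```

Player B never needs to score to come out ahead here; the number only reflects what happened to the team while B was on the floor.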

Most engineering teams don’t have direct opponents or games but we do have goals and projects, during which we are competing against time, costs and complexity. The rest of the analogy, and what plus/minus evaluates, still applies: rather than focusing on any individual statistic that an engineer produces, it shows how that individual impacts the team, regardless of the way they might contribute. It captures whether the individual is good at scoring (shipping), facilitating (helping others succeed), defense (avoiding pitfalls), clutch moments (during an outage) or any other possibility. This agnosticism is powerful because it abstracts away the details and focuses on impact.

A vivid example of this is the story of Shane Battier, the No Stats All Star.

“Battier’s game is a weird combination of obvious weaknesses and nearly invisible strengths. When he is on the court, his teammates get better, often a lot better, and his opponents get worse — often a lot worse. He may not grab huge numbers of rebounds, but he has an uncanny ability to improve his teammates’ rebounding. He doesn’t shoot much, but when he does, he takes only the most efficient shots. He also has a knack for getting the ball to teammates who are in a position to do the same, and he commits few turnovers. On defense, although he routinely guards the N.B.A.’s most prolific scorers, he significantly reduces their shooting percentages. At the same time he somehow improves the defensive efficiency of his teammates — probably, Morey surmises, by helping them out in all sorts of subtle ways. “I call him Lego,” Morey says. “When he’s on the court, all the pieces start to fit together. And everything that leads to winning that you can get to through intellect instead of innate ability, Shane excels in. I’ll bet he’s in the hundredth percentile of every category.””

“In his best season, the superstar point guard Steve Nash was a plus 14.5. At the time of the Lakers game, Battier was a plus 10, which put him in the company of Dwight Howard and Kevin Garnett, both perennial All-Stars. For his career he’s a plus 6. “Plus 6 is enormous,” Morey says. “It’s the difference between 41 wins and 60 wins.” He names a few other players who were a plus 6 last season: Vince Carter, Carmelo Anthony, Tracy McGrady.”

Obviously having a team of Shane Battiers won’t help an organization win a championship or ship a new product, but the hard-to-quantify abilities of the Shane Battiers are critical to successful teams. Diverse teams can solve problems that specialized teams simply cannot. The bottom line is that we should all do better at identifying and rewarding the Shane Battiers like we do the more obvious contributions of specialists and high scorers, and that starts with improving how we evaluate each person’s impact.

Operational Vulnerability

2020-04-20T00:00:00+00:00

Benoit Mandelbrot’s seminal The Misbehavior of Markets suggests market behavior has five rules.

Rule 1 - Markets are risky
Rule 2 - Trouble runs in streaks
Rule 3 - Markets have a personality
Rule 4 - Markets mislead
Rule 5 - Market time is relative

The book focuses on the behavior of financial markets and argues that using the rules can reduce society’s and an individual’s “financial vulnerability”. Reading through them I can’t help but see how they can be applied to web operations. Below I paraphrase the rules from the book and rework them with a focus on teams and the services they run, lastly introducing the idea of operational vulnerability.

Rule 1 - Markets are risky → Systems are unstable

“Extreme price swings are the norm in financial markets - not aberrations that can be ignored. Price movements do not follow the well-mannered bell curve assumed by modern finance; they follow a more violent curve that makes an investor’s ride much bumpier.”

Teams building software that perform important, valuable services are by their nature unstable, that is, constantly learning, adapting and acting (i.e. they are solving problems of organized complexity). This change and adaptation can make for a bumpier ride for the individuals on the team and the services they run.

Teams and services that do nothing are stable, never requiring the stress of adaptation. This makes for a smooth ride but unfortunately there is no pay off for building teams and running services that do nothing.
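The quote’s contrast between the “well-mannered bell curve” and a “more violent curve” can be made concrete with a small Python sketch. It compares how often a standard normal distribution and a heavy-tailed one produce a move larger than k units. Using the standard Cauchy distribution as the heavy-tailed stand-in is an assumption made for illustration, not a claim about which distribution markets or services actually follow, and the function names are hypothetical.

```python
import math


def normal_tail(k):
    """P(|Z| > k) for a standard normal variable (the bell curve)."""
    return math.erfc(k / math.sqrt(2))


def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy variable (illustrative heavy tail)."""
    return 1 - (2 / math.pi) * math.atan(k)


# Compare the chance of an "extreme" move at a few thresholds.
for k in (1, 3, 5):
    print(f"k={k}: normal={normal_tail(k):.2e}  cauchy={cauchy_tail(k):.2e}")
```

Under the bell curve a five-unit move is vanishingly rare; under the heavy-tailed curve it happens often enough that planning as if it never will is a mistake.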

Rule 2 - Trouble runs in streaks → Failures come in waves

“Market turbulence tends to cluster. This is no surprise to an experienced trader. … They know that when a market opens choppily, it may well continue that way. They know that a wild Tuesday may well be followed by a wilder Wednesday.”

Errors, failures and outages tend to cluster. This is no surprise to an experienced manager or service operator. They know that when a service or team begins to have problems it may well continue that way. An outage Tuesday can lead to a cascade of failures Wednesday. A mismanaged project one quarter can lead to missed objectives in subsequent quarters.

Rule 3 - Markets have a personality → Systems have a personality

“Prices are not driven solely by real-world events, news, and people. When investors, speculators, industrialists, and bankers come together in a real marketplace, a special, new kind of dynamic emerges – greater than, and different from the sum of the parts. … In substantial part, prices are determined by endogenous effects peculiar to the inner workings of the markets themselves, rather than solely by the exogenous action of outside events. Moreover, this internal market mechanism is remarkably durable.”

Behavior of an internet service is not driven solely by real-world events and people. When management, sales, developers, security and operators come together to build a product, a special, new kind of dynamic emerges – greater than, and different from the sum of the parts. In substantial part, system behavior, in its broadest sense, from organization down to an individual team or service, is determined by endogenous effects peculiar to the inner workings of the organization, team or service itself. This internal behavior is remarkably durable regardless of the purpose, type or scale, persisting through organizational tumult and refactoring.

Rule 4 - Markets mislead → Systems mislead

“Patterns are the fool’s gold of financial markets. The power of chance suffices to create spurious patterns and pseudo-cycles that, for all the world, appear predictable and bankable. But a financial market is especially prone to such statistical mirages.”

Patterns are the fool’s gold of observability. The power of chance suffices to create spurious patterns and pseudo-cycles that, for all the world, appear predictable and repeatable. Organizations, teams and individuals are especially prone to such statistical mirages. The size, shape and frequency of requests to one service aren’t identical to the next. The mitigation for a problem on one service does not necessarily work on the next. Building a product as a part of one team is nothing like building a similar product on another. It’s easy to trick ourselves into seeing a pattern when there is none. A given pattern may be helpful but it isn’t always repeatable or applicable in every situation.

Rule 5 - Market time is relative → System time is relative

“There is what one may call the relativity of time in financial markets. … markets are operating on their own “trading time” – quite distinct from their linear “clock time” … This trading time speeds up the clock periods of high volatility, and slows down in periods of stability.”

There is what one may call the relativity of time in teams and services. Teams and services operate in their own time – quite distinct from their linear “clock time”. “Team time” speeds up in times of organizational volatility, and slows down in periods of stability. “Service time” speeds up during outages and incidents and slows down in periods of stability.

Operational vulnerability

Teams and services are eternally linked. Services don’t get built or run without a team of individuals to organize and do the work. Teams don’t exist without a purpose; that purpose is to build and run a service or product. Understanding the rules that teams and services play by can help us to become more situationally aware. When we act on that awareness we can adapt and improve our response to incidents, understand the personality (i.e. emergent behavior) of the teams we belong to and the services we run, be less susceptible to being misled by blindly following patterns, and use our intuition to estimate the severity of the current situation. In doing so we reduce the risk of normal, everyday operations.

As I define it, operational vulnerability is the risk within the team and/or service that, when left unchecked, tends to create only more risk, leading to failure, outages and missed opportunities. Like in a financial market, operational vulnerability provides a spectrum of risk and reward. For instance, creating a new product inherently introduces risk into the system but provides more value to the organization. As humans in these systems our first, perhaps only, job is to balance this tension.

High operational vulnerability tends to manifest itself, similar to the “this is fine” comic, as stressed-out teams trying to keep the lights on in a burning house. When a team or service is operationally vulnerable an otherwise small mishap can snowball into a cluster of failures. The burning house could be an existing service that is crumbling under its own weight or a poorly performing organization that isn’t giving the team the support it needs. Either way the risk of failure and missed opportunities for the team or service is increasing and mitigations are needed to bring back a healthy balance.

Mild operational vulnerability tends to mean that a team or service can provide value while adapting and being resilient to failure. For instance, a service maintains its availability during a DDoS while preventing impact to downstream services, or a team delivers high-quality code in the face of personnel or organizational changes. Like a circuit breaker in a house preventing an electrical fire, problems tend to be remediated before they cascade throughout the system and get out of control. There are risks but not so many as to dwarf the value generated by building and maintaining the system in the first place.

No operational vulnerability means the team or service is quite literally doing nothing. All actions introduce risk and operational vulnerability. Without introducing risk we cannot build anything of value.

Operational vulnerability scales, from the single line of code to the entire organization. For the individual this could mean introducing technical debt to ship a product, deliberately increasing operational vulnerability, while adding more testing and validation of that code, decreasing operational vulnerability. For a manager this could mean finding ways to increase development velocity, while shielding a team from organizational politics so they can focus on getting work done. For leadership it could mean taking on a large, demanding customer while creating a culture of diversity, inclusion and support.

As humans, at each layer in these systems, we can use the five rules to identify desirable system behaviors, balancing risk with reward. That means increasing operational vulnerability when the time is right, creating more opportunities for value, or decreasing it when the risk is too great to stomach, creating stability and resilience in the system.

Thoughts On System Resilience And Organized Complexity

2020-03-03T00:00:00+00:00

Outages make me contemplative; every incident is an opportunity to learn and understand our environment better so we can become more resilient in the future. During a cross-country flight yesterday I re-read some of my guideposts on how to think about complexity and systems. These papers remind me both of how hard the problems we, society and more specifically people in technology, are trying to solve really are and that every individual on a team contributes to the expertise and diversity needed to combat complexity with complexity.

I started with How Complex Systems Fail by Richard I. Cook. It’s a hit list of how the interactions in complex systems give rise to complexity and to emergent and unexpected behaviors. Many of the items on the list will be familiar to us, such as Change introduces new forms of failure. The paper ends on a positive note, to me at least, with People continuously create safety and Failure free operations require experience with failure. They remind me that we, as engineers, practitioners and leadership, are the only way that the system as a whole can improve and become more resilient to failure, through our ingenuity and experience.

Next up, I was reminded of Warren Weaver’s paper Science and Complexity. This paper, from 1948, digs into the role of science, its history and impact on society, and how complexity will lead to us needing a new way to solve hard problems. The paper categorizes the problems that science tries to solve into what Weaver calls problems of simplicity, problems of disorganized complexity and problems of organized complexity. The first are straightforward problems that science addressed before 1900. The second are problems that have innumerable variables and interactions but can be addressed using statistical methods, such as averages; the example Weaver uses is that of a billiard table with millions of balls. The last category, organized complexity, is the most difficult to solve and sits somewhere between the aforementioned simple and disorganized problems. What makes these problems hard is that they can’t be solved by any specific technique; they also happen to be the problems whose solutions will be foundational to our progress as a species. On the bright side, organizations that prioritize collaborative openness and D&I efforts are on the right track:

“… in spite of the modern tendencies toward intense scientific specialization, that members of such diverse groups could work together and could form a unit which was much greater than the mere sum of its parts. It was shown that these groups could tackle certain problems of organized complexity, and get useful answers.”

Finally I revisited Leverage Points: Places to Intervene in a System by Donella Meadows. This paper focuses on societal and economic examples but the lessons are applicable to any system. Meadows develops a list of common places to make changes to a system and how to get the response you want from your change. I personally like #6 The structure of information flows; simply sharing information and making a problem visible can have a big impact. Meadows uses an example of houses with their electric meters mounted in two different places:

“There was this subdivision of identical houses, the story goes, except that for some reason the electric meter in some of the houses was installed in the basement and in others it was installed in the front hall, where the residents could see it constantly, going round faster or slower as they used more or less electricity. With no other change, with identical prices, electricity consumption was 30 percent lower in the houses where the meter was in the front hall.”

Another powerful lever is #4 The power to add, change, evolve, or self-organize system structure.

“The ability to self-organize is the strongest form of system resilience. A system that can evolve can survive almost any change, by changing itself.”

An important lesson from this paper is that complexity and emergent behavior are hard to predict and can be counter-intuitive, so we as the humans pulling the levers need to think critically about the consequences of our actions.

Together these papers elucidate the difficult but tractable problems we have ahead of us. If we intend to change the future and the world with technology, and I’m hopeful that we can, then I think Weaver summed up our charge pretty well.

“In one sense the answer is very simple: our morals must catch up with our machinery. To state the necessity, however, is not to achieve it. The great gap, which lies so forbiddingly between our power and our capacity to use power wisely, can only be bridged by a vast combination of efforts. Knowledge of individual and group behavior must be improved. Communication must be improved between peoples of different languages and cultures, as well as between all the varied interests which use the same language, but often with such dangerously differing connotations. A revolutionary advance must be made in our understanding of economic and political factors. Willingness to sacrifice selfish short-term interests, either personal or national, in order to bring about long-term improvement for all must be developed.”