When 'everything is on fire,' chaos engineers rush to save tech infrastructure

Somewhere between changing lanes and driving into pushing winds, Kolton Andrus, CEO and co-founder of Gremlin and former chaos software engineer for Netflix and Amazon, gets a call. But this isn't the kind of call that can wait for the next rest stop. Instead, Andrus is forced to pull his motorcycle over to the side of the freeway to guide his team through a network outage.

During network and service outages, chaos engineers are mainly regarded as the ones who "save the day" when "everything is on fire," Andrus told CIO Dive in an interview.

In many ways, chaos engineering is the dirty job of operations. The unpredictability of an outage is what gave Andrus an "adrenaline shot" as a call leader, the person responsible for heading up the investigation during a reported service outage.

In the first five or 10 minutes of an outage, response teams are given a rundown of the events leading up to the outage, usually outlined in metrics or customer impacts that could help signal where things could be "broken," he said.

The role of the chaos engineer has gained more respect and attention over the last decade or so. Before that, "engineers were kind of king" while the operational roles were like "second class citizens," said Andrus.

Operations did the work engineers did not want to do until the shift to DevOps. Once the "you build it, you own it, you operate it" mentality was adopted, a lot of engineers took on quality assurance and site reliability engineering.

What it's like putting out the fire

The fastest resolution times for a service outage are between 10 and 20 minutes; anything longer is more alarming. However, a longer outage does not always indicate malicious activity; it could be an engineer struggling to locate a hard-to-find network bug.

But a classic example of outage resolution is when an engineer unintentionally deletes a database in production, causing everything to go down, according to Andrus. Addressing an incident like that involves asking questions like why the engineer had permission to access the database, what processes were missing to prevent the mistake, why there wasn't guidance, and why there weren't checks and balances in place to prevent the incident.

It is important to answer these questions because about 82% of outages are caused by human error, according to a 2014 Gartner report.

After the initial call among responders, individual teams coordinate to weave through microservices and "bits of code" to understand whether or not the issue arose within their governance.

During his time at Amazon, Andrus said someone would be assigned "owner" of whichever critical service fell first or wherever the problem occurred. Over the next few days, the tech company strategically discussed the outage and tried to "drill into the patterns or maybe the social problems that contributed to it," he said.

"The number one goal is how fast we can make things work" while the secondary goal is working to prevent a similar incident. One way to do so is to perform self-inflicted outages. Performing experiments in a controlled environment acts as a "flu shot": organizations cause harm upfront in a safe manner to prevent damage from something a system isn't already familiar with.

Time offline costs about $5,600 per minute, so experts recommend adopting antifragile practices to avoid those "post-mortem" meetings following an outage. Antifragile systems allow for scalability because they are designed for growth. Network availability, rather than digital bandwidth, is what an organization's network engineers need to focus on, according to Gartner.

Chaos engineers measure time offline with "what is colloquially referred to as three nines," or 99.9% uptime, according to Andrus. An organization that achieves 99.9% uptime has less than nine hours of downtime a year. However, companies like Amazon and Netflix are trying to beat the clock to achieve even less downtime with four nines of availability, or 99.99% uptime, which allows about 53 minutes of downtime per year.
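The arithmetic behind those figures can be checked with a short sketch. It assumes a 365-day year and, purely for illustration, applies the roughly $5,600-per-minute cost cited earlier as a flat rate. The function and constant names are hypothetical, not taken from any tool Andrus or Gremlin uses.

```python
# Minimal sketch: convert an uptime percentage into a yearly downtime budget.
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

# Per-minute downtime cost cited in the article; treated here as a flat rate,
# which is a simplifying assumption.
COST_PER_MINUTE_USD = 5_600


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given uptime."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


def downtime_cost_usd(uptime_percent: float) -> float:
    """Return the yearly cost if the whole downtime budget is used up."""
    return downtime_minutes_per_year(uptime_percent) * COST_PER_MINUTE_USD


for label, pct in [("three nines", 99.9), ("four nines", 99.99)]:
    minutes = downtime_minutes_per_year(pct)
    print(
        f"{label} ({pct}%): {minutes:.1f} minutes/year "
        f"({minutes / 60:.2f} hours), about ${downtime_cost_usd(pct):,.0f}"
    )
```

At three nines the budget works out to roughly 8.76 hours a year; at four nines it drops to about 52.6 minutes.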

The hunt for the three nines

Having a chaos engineering team is a "given" for larger companies like Amazon and Netflix, but the same isn't always true for the startups working to reach new levels of success.

Labeled as the most "innovative" platform in the communication market, Slack recently announced an internal renovation with the addition of a "safety security team" to help stave off its rate of outages. The collaboration platform revealed it has about 40 days of outages and 51 days of incidents that caused "reduced functionality." However, since June of 2017, Slack reported an average of 99.96% uptime, hitting the three nines mark.

It's important to remember "a lot of subtlety" goes into determining what counts as an outage. But Slack most likely fell victim to a common startup problem: growing too quickly for its infrastructure, according to Andrus.

Slack successfully became the pioneer of the communication platform market with an "ethos engineers identify with," he said. However, the company may have gotten lost in how quickly it grew and therefore lost some focus on having built an infrastructure with scaling in mind.

Ultimately, Slack's growth has made its outages and performance delays more cumbersome. And while it's "a good problem to have," no service-based company can risk the number of outages Slack disclosed.

Slack had 99.98% uptime in April and 100% uptime in May so far, according to the company's status history. However, there were 12 days of incident reports and one outage in April and eight incidents and one outage reported in May. Still, those percentages put Slack in the three nines sweet spot.

Editor's Note: This article has been updated to reflect Slack’s reported uptime.