Slack adopts 'disasterpiece theater' strategy for running uptime exercises

Dive Brief:

In January 2018, Slack began intentionally replicating production failures to test the tolerance of its systems and engineers, similar to a practice "evangelized by Netflix," according to a company blog post Thursday.
The company is calling the process "disasterpiece theater," an "approachable" take on chaos engineering. The "theater" consists of running exploratory exercises where a response plan is circulated among the operations team, according to the blog. Hosts of the exercises document how they will "incite the failure," which commands they'll run, and which Amazon Elastic Compute Cloud instances will be involved.
Slack then asks its hosts how "confident they are that fault tolerance in the dev environment predicts fault tolerance in the prod environment" for the exercise. Hosts also hypothesize how customers might feel the experimental failure. Each exercise is carried out during a specified time with experts centrally located because Slack has yet to test its monitoring capabilities during a run.
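The planning details described above can be pictured as a simple record that a host fills in before an exercise. The sketch below is illustrative only: the class name, field names and example values are assumptions made for this example, not Slack's actual template or tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExercisePlan:
    """Hypothetical record of one planned failure exercise (not Slack's format)."""

    failure_method: str                # how the host will "incite the failure"
    commands: List[str]                # commands the host intends to run
    ec2_instances: List[str]           # Amazon EC2 instances involved
    dev_predicts_prod_confidence: str  # host's stated confidence, as free text
    customer_impact_hypothesis: str    # how customers might feel the failure

    def missing_fields(self) -> List[str]:
        """Return the names of any fields the host left empty."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder values, invented for illustration.
plan = ExercisePlan(
    failure_method="stop one cache process",
    commands=["sudo systemctl stop example-cache"],
    ec2_instances=["i-EXAMPLE"],
    dev_predicts_prod_confidence="fairly confident",
    customer_impact_hypothesis="",
)
print(plan.missing_fields())
```

A check like `missing_fields` is one way a team could confirm a plan is complete before circulating it; here it flags that the customer impact hypothesis has not been written yet.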

Dive Insight:

Last March Slack created a "safety engineering team" to reduce its downtime compared to other collaboration software vendors. The team was designed to detect operational interferences and work with Slack's software developers.

Now, more than a year later, the company is public and its platform's uptime usually lingers in the "three nines" range, meaning at least 99.9% uptime, according to its status history from the last year.

But Slack experienced outages in June and July. The first, reported on June 28, lasted from about 8 a.m. to about 11 p.m. Eastern time, according to the issue summary.

The platform experienced further degraded functionality in late July, caused by a "change that inadvertently caused some performance issues," according to the summary.

Netflix, the company Slack is emulating with its exercises, is striving for even less downtime, aiming for four nines (99.99%) of availability. The last two months' outages pulled Slack below its "three nines" mark: June's uptime was reported at 99.83%, while July clocked in at 99.87%.
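To put those percentages in perspective, an uptime figure can be converted into minutes of downtime per month. The sketch below assumes the percentage is measured against every minute of the calendar month and treats all downtime as a full outage; Slack's own status calculation may weight partial degradation differently, and the function name is hypothetical.

```python
def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    """Minutes of downtime implied by an uptime percentage over one month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Figures reported in the article.
june = downtime_minutes(99.83, 30)
july = downtime_minutes(99.87, 31)

# Budgets for a 30-day month at the two availability targets.
three_nines = downtime_minutes(99.9, 30)
four_nines = downtime_minutes(99.99, 30)

print(f"June: {june:.1f} min, July: {july:.1f} min")
print(f"Three nines: {three_nines:.1f} min, four nines: {four_nines:.1f} min")
```

Under those assumptions, three nines allows about 43 minutes of downtime in a 30-day month and four nines about 4 minutes, while the reported June and July figures work out to roughly 73 and 58 minutes.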

Still, Slack has confidence in its foundational infrastructure, but as the platform expands, the company recognizes "luck is not an availability strategy," according to the blog post.

Not all of Slack's "disasterpiece" tests make it beyond the development environment. If automated remediation doesn't work as expected, engineers won't run the test in production.
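That gating rule, where a test only moves on to production if the development run behaves as expected, can be sketched as a staged runner that stops at the first failure. The function and stage names below are hypothetical and stand in for whatever checks a team would actually perform.

```python
from typing import Callable, Dict, List


def run_stages(checks: Dict[str, Callable[[], bool]], order: List[str]) -> List[str]:
    """Run each stage's check in order and stop at the first failure.

    Returns the list of stages that completed successfully.
    """
    completed: List[str] = []
    for stage in order:
        if not checks[stage]():
            break  # remediation did not work as expected; go no further
        completed.append(stage)
    return completed


# Example: automated remediation fails in dev, so prod is never attempted.
result = run_stages(
    {"dev": lambda: False, "prod": lambda: True},
    order=["dev", "prod"],
)
print(result)
```

In this example the returned list is empty, reflecting that the production stage is skipped whenever the development stage fails.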

The company has run "dozens" of "disasterpiece" exercises that have more or less gone according to plan, according to the blog post. However, because Slack's environment is in constant evolution, the engineering team is rarely able to repeat fault tolerance tests on new code.