On Tuesday morning, Cloudflare saw a drop in traffic.
It was "a typical indication that something very bad has happened," said John Graham-Cumming, CTO of Cloudflare, in an interview with CIO Dive.
Though Cloudflare was the one suffering the outage, it brought down its customers' ability to service their traffic too.
"Our network operates as an extension of our customers' networks," Graham-Cumming said. If "we're down, they're down."
The company's outage resulted from a CPU spike caused by a "bad software" deployment that was "rolled back," according to the company's incident update, but it highlights a greater threat to the internet.
The internet is able to function because of its interconnectedness, which also means a single failure can pull much of it down at once. At the heart of the internet are the Border Gateway Protocol (BGP) and the veins that lead to it, the autonomous systems (AS).
BGP sessions "create the interconnection relationships between different AS," Angelique Medina, director of product marketing at ThousandEyes, told CIO Dive in an email.
The protocol "was designed in the early days of the internet to be a chain of trust between well-meaning ISPs and universities that had no reason to question the integrity of the information they received from their peers," said Medina.
What's the big BGP deal?
The backbone of the internet is the set of tools that tell traffic where to go and route it accordingly.
BGP is the "default routing protocol to route traffic among internet domains," according to the National Institute of Standards and Technology, but it fundamentally lacks built-in security, which makes it susceptible to route hijacking.
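As a rough illustration of why unauthenticated route announcements are risky, the sketch below models a router that simply prefers the most specific matching prefix, which is how IP forwarding generally chooses among overlapping routes. It is a toy model, not a BGP implementation: the `best_route` helper is hypothetical, and the prefixes and AS numbers are taken from ranges reserved for documentation.

```python
import ipaddress

def best_route(routes, destination):
    """Return the (prefix, origin) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda route: route[0].prefixlen)

# Hypothetical routing table with one legitimate announcement.
legit = (ipaddress.ip_network("203.0.113.0/24"), "AS64496")
routes = [legit]
before = best_route(routes, "203.0.113.10")

# A bogus, more specific announcement is accepted with no verification of its origin.
hijack = (ipaddress.ip_network("203.0.113.0/25"), "AS64511")
routes.append(hijack)
after = best_route(routes, "203.0.113.10")

print(before[1], "->", after[1])
```

In this toy table, traffic for 203.0.113.10 shifts to the second origin as soon as the more specific prefix appears, because nothing checks whether that origin is entitled to announce it.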
The protocol's maturity has stalled with the introduction of more complex routing needs and bad actors.
"Even corporations like Google, with massive resources at their disposal, are not immune from the breakdowns in BGP routing that occur due to inadvertent errors like BGP route leaks" or BGP hijacking, said Medina.
In 2018, hackers rerouted IP addresses managed by AWS' DNS service, which ultimately allowed them to steal more than $150,000 worth of cryptocurrency via MyEtherWallet.com. The hackers didn't take advantage of the site's own weaknesses but instead used those found in public DNS servers.
Autonomous systems rely on BGP's ability to share routing. Ideas of how to make BGP more secure have been tossed around, but the complexity of autonomous systems derails proposals because all systems would have to update their behaviors too, according to Cloudflare.
"No system is completely resilient and there will always be the on-call component of Ops, but at the end of the day it's better to inject failure proactively and find the problems on your own terms," Kolton Andrus, co-founder and CEO of Gremlin, told CIO Dive.
How to protect against BGP route leaks
Companies are always warned never to depend wholly on a vendor's resilience, even though technologies like the cloud and software as a service make it difficult to temper reliance.
"Everyone relies on someone else's software," said Andrus, but "the best companies assume their third parties will fail, and test for that exact scenario ahead of time."
Understanding the ripple effect of relying on a single provider is where most companies can start their resilience plan in light of an outage.
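The practice of assuming a third party will fail, and testing for it ahead of time, can be sketched in a few lines. This is a minimal illustration, not any vendor's tooling: `fetch_with_fallback`, the provider functions, and the cached response are all hypothetical names, and the outage is simulated by raising `ConnectionError`.

```python
def fetch_with_fallback(primary, fallback):
    """Call primary(); if it raises ConnectionError, serve fallback() instead."""
    try:
        return primary()
    except ConnectionError:
        return fallback()

def healthy_provider():
    # Stand-in for a normal call to a third-party service.
    return "response from provider"

def failing_provider():
    # Injected failure standing in for a third-party outage.
    raise ConnectionError("simulated provider outage")

def cached_copy():
    # Degraded but usable answer served when the provider is unreachable.
    return "response from local cache"

normal = fetch_with_fallback(healthy_provider, cached_copy)
degraded = fetch_with_fallback(failing_provider, cached_copy)

print(normal)
print(degraded)
```

Running the failure case deliberately, rather than waiting for a real outage, is what shows whether the fallback path actually works.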
However, the fragility of the internet is tested regularly by routing leaks. "This incident is yet another example of how incredibly easy it is to dramatically alter the service delivery landscape in the internet," said Medina.
June's outage showcased how smaller networks "are often propagated by large providers" despite filtering techniques. Other cases, like Google's November downtime, were the result of a human configuration error, according to ThousandEyes.
So what is the solution? "The conundrum of the internet is that it is a voluntary network" and trust in BGP can fall short, especially with continued outages, said Medina.
Initiatives like the Mutually Agreed Norms for Routing Security, or enterprises writing internet routing best practices into their ISP contracts, are starting places to mitigate the risk of outages.
Cloudflare, for example, is in a position of leadership and therefore can set a precedent for internet standards.
"The nature of the internet is cooperation," said Graham-Cumming, "people take a look and they think about whether they should be doing the same thing."
It results in a "social pressure" of best practices, he said.