A widespread AWS outage affected thousands of customers, cascading into issues across multiple digital services on Monday morning. Initial efforts to restore hundreds of AWS services in the US-East-1 Region only partially mitigated the difficulties, leaving Amazon to identify and correct the root cause later in the day.
The company attributed the issues to an internal subsystem responsible for monitoring the health of its network load balancers, according to an update on the company's status page.
“We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services,” the company said around noon ET, although it still listed the service status as “degraded.”
A cloud outage can ripple through digital services, disrupting multiple applications simultaneously while stifling business continuity plans. The effects can be compounded when the afflicted hyperscaler is AWS, which leads its peers in market share.
Amazon's cloud services attracted 37.7% of all infrastructure-as-a-service spending last year, compared with Microsoft's market share of 23.9%, according to Gartner estimates. Google controlled just 9% of spending.
Cloud outages represent wake-up calls for CIOs, helping them gauge the resiliency of their IT estates, said John Annand, digital infrastructure practice lead at Info-Tech Research Group.
“To try to get any risk down to zero, it goes up on an exponential curve,” Annand said. “The lower you want the risk, the more money it's going to cost you.”
An IT pressure test
Vendor selection is one part of the resiliency puzzle for CIOs. But cloud systems that rely on overlapping providers can be too complex from an architecture standpoint, Annand said.
"It looks nice on paper, and people talk about it in conferences, but they don't practically do it," Annand said. "You have to pick the efficacy and the ease of use of a cloud platform, and then try to plan around the times when you know you're going to have an outage one way or the other."
Roy Illsley, chief analyst of IT operations at Omdia, said the key takeaway of an outage like this for CIOs is to create a dual-source strategy.
“This incident shows that even somebody like AWS can be impacted, and unless you have contingency, you’re stuck,” he said in an email to CIO Dive.
Multicloud provides an added layer of resilience, but porting workloads between clouds is challenging, Illsley said. Ideally, CIOs should consider multicloud combined with on-premises environments, although he cautioned that the strategy is more expensive and complex to undertake.
“There is no silver bullet,” Illsley said. “But CIOs must do the due diligence and consider having a robust recovery plan that is separate from the primary cloud provider.”
IT outages can lead to significant costs for businesses navigating disruption. Every hour of operational downtime caused by tech issues costs companies a median of $2 million, according to New Relic data published last month. Cloud services failures are a leading cause of IT downtime, the company found.
Global IT systems were roiled last year when a faulty CrowdStrike update pushed to Windows devices caused massive outages. The July 2024 incident led to more than $5 billion in estimated direct financial losses for Fortune 500 companies, with the healthcare industry experiencing the biggest financial disruption.
Unplanned IT failures can provide an opportunity to reassess business continuity plans, analysts and experts previously told CIO Dive.
“It's not a question of if a service is going to go down,” Annand said. “It's a question of when. Your job as a CIO is to manage that risk with the rest of your C-suite and to come up with a plan.”
Editorial Director Nicole Laskowski contributed reporting to this story.
Disclosure: Informa owns a controlling stake in Informa TechTarget, the publisher behind CIO Dive and parent company of Omdia. Informa has no influence over CIO Dive's coverage.