The Amazon Simple Storage Service (S3) outage that occurred in the Northern Virginia Region on Tuesday morning was due to human error, according to an Amazon Web Services statement.
AWS reports that the S3 team was debugging an issue on the S3 billing system when one of the team members executed a command to remove a small number of servers for one of the S3 subsystems. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," AWS said.
A cascading series of events ultimately caused the outage of AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud new instance launches and Amazon Elastic Block Store volumes. AWS Lambda and S3 APIs were also unavailable.
Amazon apologized for the incident and reassured customers it is making changes to ensure this type of event does not occur again. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level," according to AWS.
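To make the two safeguards AWS describes more concrete, here is a minimal Python sketch of a capacity-removal check. It is purely illustrative and is not AWS's actual tool: the names `Subsystem`, `plan_removal`, `CapacityError`, and the `max_batch` parameter are all hypothetical, as are the server counts in the usage example. The sketch refuses any request that would drop a subsystem below its minimum required capacity, and otherwise splits the removal into small batches so capacity comes out slowly.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; real capacity models are likely richer.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 2) -> list[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises CapacityError if the removal would leave the subsystem
    below its minimum required capacity.
    """
    if requested <= 0 or max_batch <= 0:
        raise ValueError("requested and max_batch must be positive")

    # Safeguard 1: never go below the minimum required capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"with {remaining}, below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in batches of at most max_batch.
    batches = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical subsystem of 10 servers and a floor of 6, `plan_removal(Subsystem("index", 10, 6), 3)` yields two batches, while a mistyped request for 5 servers raises `CapacityError` instead of proceeding. Under these assumptions, the same kind of input error would be rejected up front rather than taking the subsystem offline.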
Such outages are certainly damaging from a PR perspective, as cloud providers tout their reliability as a key selling point, but they can be even more damaging for the businesses that rely on those services. During the outage, which lasted about four hours, 54 of the top 100 e-commerce retailers’ websites suffered slow loading times, and others, including Express, Lululemon, and One Kings Lane, were knocked offline entirely.
With recent estimates indicating AWS now controls 40% of the cloud market, it’s alarming to contemplate how a mistaken keystroke could essentially bring down a huge part of the Internet. It’s yet another reminder of the importance of relying on multiple service providers and building in redundancy. Fortunately, it appears many businesses are catching on. A recent survey from RightScale found 85% of enterprises now have a multi-cloud strategy.