
Amazon explains the embarrassing reason for the AWS outage

CIOL Writers

What caused numerous customers and businesses to experience an outage for several hours was nothing but human error. To be specific, it was a typo.


Amazon has released an official statement explaining the outage of Amazon Web Services (AWS), its public cloud infrastructure offering.

In a blog post, Amazon wrote, "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that are used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
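To see how a single mistyped input can turn a routine removal into a mass one, here is a minimal sketch assuming a prefix-based server selector; the fleet names and selection logic are hypothetical and are not Amazon's actual tooling.

```python
# Hypothetical sketch, not Amazon's actual tooling: how one mistyped input to a
# server-removal command can match far more servers than intended when nothing
# bounds the selection.
FLEET = {f"s3-billing-{i:03d}" for i in range(10)} | {f"s3-index-{i:03d}" for i in range(100)}

def servers_to_remove(prefix: str) -> list[str]:
    """Select every server whose name starts with the given prefix."""
    return sorted(name for name in FLEET if name.startswith(prefix))

print(len(servers_to_remove("s3-billing-003")))  # intended input: selects 1 server
print(len(servers_to_remove("s3-")))             # mistyped input: selects all 110 servers
```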

This small mistake took down two subsystems, the index subsystem and the placement subsystem, in the US-EAST-1 region, a massive data centre location. The loss of those servers is what knocked so many services offline, including Quora, Twitch, Kickstarter, Slack, Business Insider, Expedia, and Atlassian's Bitbucket and HipChat. The list also includes the AWS Service Health Dashboard (SHD), which Amazon needs in order to update its own status page.


While Amazon restarted the affected subsystems, S3 was unable to service requests. Both subsystems required a full restart, and the process took longer than expected because the servers had not been completely restarted "for many years."

The index subsystem was fully recovered by 1:18 PM PST, while the placement subsystem recovered by 1:54 PM PST. By that point, S3 was operating normally.

The company further noted that it's making "several changes" because of the latest incident. To avoid such problems in the future, Amazon said, "While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
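A rough sketch of what such safeguards could look like, assuming a rate-limited removal tool with a minimum-capacity floor; the subsystem names, thresholds, and timings below are assumptions, not Amazon's implementation.

```python
# Illustrative sketch of the safeguards described above: remove capacity in small,
# slow batches and refuse to drop any subsystem below its minimum required level.
import time

MIN_REQUIRED = {"index": 80, "placement": 40}  # assumed minimum healthy server counts
BATCH_SIZE = 2                                 # remove at most 2 servers per step
BATCH_DELAY_SECONDS = 1                        # a real tool would pause far longer

def remove_capacity(subsystem: str, current: int, requested: int) -> int:
    """Remove up to `requested` servers, never dropping below the subsystem floor."""
    floor = MIN_REQUIRED[subsystem]
    allowed = max(0, min(requested, current - floor))
    removed = 0
    while removed < allowed:
        step = min(BATCH_SIZE, allowed - removed)
        current -= step
        removed += step
        print(f"{subsystem}: removed {step}, {current} servers remaining")
        time.sleep(BATCH_DELAY_SECONDS)  # slow removal gives operators time to notice
    if removed < requested:
        print(f"{subsystem}: refused {requested - removed} removals (floor is {floor})")
    return current

remove_capacity("index", current=84, requested=10)  # stops after 4: 80 is the floor
```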

Amazon has also started dividing parts of the index subsystem into smaller cells and changing the administration console for the AWS Service Health Dashboard.
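"Cells" here refers to partitioning a service so that any single failure affects only a fraction of the workload. A minimal illustration, assuming object keys are hashed to independent cells; the cell count and hashing scheme are invented for this example.

```python
# Rough illustration of partitioning a subsystem into smaller "cells": each object
# key hashes to one independent cell, so losing a single cell affects only a
# fraction of keys.
import hashlib

NUM_CELLS = 8

def cell_for_key(key: str) -> int:
    """Map an object key to one of NUM_CELLS independent cells."""
    return hashlib.sha256(key.encode()).digest()[0] % NUM_CELLS

for key in ["photos/cat.jpg", "logs/2017-02-28.gz", "backups/db.snap"]:
    print(key, "-> cell", cell_for_key(key))
# If one cell has to be restarted, keys mapped to the other seven stay available.
```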

"We will do everything we can to learn from this event and use it to improve our availability even further,” the company concluded.
