March 3, 2017

Oops! Amazon's cloud outage was due to human error

Angel Gonzalez

Seattle: Amazon.com's cloud computing unit said that the outage that shook up a sizeable part of the internet Tuesday was caused by human error.

The Amazon Web Services division said in a post-mortem published on its website Thursday that its team had been working to fix a problem that slowed down the billing system for S3, a widely used AWS service.

Through S3, companies and individuals can store their data on Amazon's server farms. S3 also houses the data that underpins a wide array of other AWS services, including some computing processing functions. It works as a basic building block of Amazon's cloud, which in turn is a major pillar of the modern internet.

To fix the slowdown, engineers in AWS' Northern Virginia operation - one of the largest clusters of data centres run by the company - needed to take down a small number of servers.

"Unfortunately," as AWS put it in its lengthy mea culpa, a technician made a mistake when entering a command, taking out more servers than needed - some of which were critical to the functioning of S3 in the entire region. Thousands of users relying on AWS data and computing processes were affected.

AWS says that its system is designed to allow the removal of big chunks of its components "with little or no customer impact." But the rebooting took a long time - longer than expected - AWS says, partly because the S3 service has become gigantic since it launched more than a decade ago.

From failure to complete recovery, the outage lasted slightly more than four hours, although other AWS services that had accumulated a backlog of work during the disruption took longer to recover.

AWS said the outage was prompting it to make some changes: for example, reducing the amount of server capacity that can be removed at one time.

"This will prevent an incorrect input from triggering a similar event in the future," AWS said.
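The safeguard AWS describes can be pictured as a check that refuses any request to remove more than a fixed share of servers in one go. What follows is only a minimal sketch in Python: the function name, the error type and the 5 percent cap are hypothetical choices made for illustration, not details of AWS's actual tooling.

```python
class RemovalTooLargeError(ValueError):
    """Raised when a request would remove more capacity than the cap allows."""


def check_removal(total_servers: int, to_remove: int, max_percent: int = 5) -> int:
    """Return the number of servers left after removal, or raise if over the cap.

    max_percent is an illustrative assumption, not a figure published by AWS.
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if to_remove < 0:
        raise ValueError("to_remove must not be negative")

    # Largest number of servers a single command may take out (rounded down).
    allowed = total_servers * max_percent // 100
    if to_remove > allowed:
        raise RemovalTooLargeError(
            f"requested {to_remove} servers, but at most {allowed} may be removed at once"
        )
    return total_servers - to_remove
```

Under these assumptions, a mistyped command asking to remove 500 of 1,000 servers would be rejected before any server went offline, while a request for 10 would go through.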

Seattle Times
