When a Typo Knocked Down Amazon’s S3 – Backbone of Much of the Internet
Amazon Web Services suffered a major outage earlier this week, affecting a number of websites and online services. The tech giant is now blaming the massive outage of its S3 storage service on a typo.
How an Amazon typo broke the internet
Everyone was surprised at the failure of S3 (Simple Storage Service), which is known for its track record of availability. The failure knocked out a huge number of services and sites, including Quora, Apple's iCloud services, Trello, and others. Not much was known at the time of the outage, but Amazon has now published a blog post detailing exactly what caused the "internet" outage.
Amazon said that at the time of the outage its S3 team was trying to diagnose why the S3 billing system was running slowly. During this process, an engineer ran a command with a mistyped input, which ended up removing a larger set of servers than intended.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests."
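To illustrate how a single mistyped input can select far more servers than intended, here is a minimal Python sketch. It is purely hypothetical: the server names, the prefix-matching scheme, and the `max_removals` limit are assumptions made for illustration and do not describe Amazon's actual tooling.

```python
# Hypothetical sketch only: names, prefix matching, and the limit are
# illustrative assumptions, not Amazon's real tooling.

def select_servers(fleet, prefix):
    """Return every server whose name starts with the given prefix."""
    return [name for name in fleet if name.startswith(prefix)]


def remove_servers(fleet, prefix, max_removals=5):
    """Return the fleet without the matching servers.

    Refuses to proceed if the prefix matches more than max_removals servers,
    so a mistyped prefix cannot silently take out a large part of the fleet.
    """
    targets = set(select_servers(fleet, prefix))
    if len(targets) > max_removals:
        raise ValueError(
            f"refusing to remove {len(targets)} servers; limit is {max_removals}"
        )
    return [name for name in fleet if name not in targets]


# A made-up fleet: 20 billing, 40 index, and 40 placement servers.
fleet = (
    [f"s3-billing-{i:02d}" for i in range(20)]
    + [f"s3-index-{i:02d}" for i in range(40)]
    + [f"s3-placement-{i:02d}" for i in range(40)]
)

# Intended input: exactly one billing server is removed.
remaining = remove_servers(fleet, "s3-billing-03")

# Mistyped (truncated) input: the prefix now matches the whole fleet,
# including the index and placement servers, so the guard rejects it.
try:
    remove_servers(fleet, "s3-")
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the intended prefix removes one server, while the truncated prefix matches every server in the fleet and is refused by the limit check.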
Recovering from the error required a full restart of the affected subsystems, which "took longer than expected."
Like other cloud providers, Amazon designs its S3 subsystems with redundancy in mind, so that servers can be removed or can fail with no customer impact. This means that even when engineers have to remove servers, the system should not be affected. However, the company did not anticipate how long it would take to restart some services, given S3's massive growth over the last several years.
"We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
While the issue only affected Amazon’s Northern Virginia region, it was enough to cause significant problems for a large number of websites and services.
Amazon apologized for the issue, saying the company is proud of its track record of availability with Amazon S3. "We know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further."