Amazon’s multi-day outage on its Elastic Compute Cloud (EC2) started with a faulty upgrade at one of its East Coast facilities just after midnight on April 21, the company admitted in a blog post on Thursday.
The network change, described as part of “normal AWS (Amazon Web Services) scaling activity,” incorrectly directed traffic from a single Elastic Block Store (EBS) cluster – a group of hosted volumes – onto a secondary network that lacked the capacity to handle the redirected traffic. The result, Amazon said, was that both the primary and secondary networks were disabled, incapacitating the hosted volumes and – even worse – isolating them from their mirror volumes within the same EBS cluster.
Despite complex and robust redundancies built into the EC2 architecture, the faulty upgrade set off a cascade of events that rippled through the EC2 network. With so many hosted volumes knocked offline and isolated, a “re-mirroring storm” occurred: the volumes all came back online at once and got stuck searching for free storage space to copy their data to on an EBS cluster that was already over capacity. At the height of the outage, Amazon said, about 13% of the volumes in the affected EBS cluster were stuck.
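Amazon’s post describes the storm only in prose, but the arithmetic behind it is simple: once the isolated volumes’ combined demand for new mirror space exceeds the cluster’s remaining free space, most of them cannot be satisfied in any order. The sketch below is a hypothetical illustration with invented numbers, not figures from Amazon.

```python
# Hypothetical sketch (not Amazon's code or data): when many volumes lose
# their mirrors at once, the space they collectively request can exceed the
# cluster's spare capacity, leaving most of them "stuck" and retrying.

VOLUME_SIZE_GB = 100        # assumed size of each volume needing a new mirror
FREE_CAPACITY_GB = 20_000   # assumed spare space left in the cluster
ISOLATED_VOLUMES = 1_300    # assumed number of volumes re-mirroring at once

satisfied = 0
remaining = FREE_CAPACITY_GB
for _ in range(ISOLATED_VOLUMES):
    if remaining >= VOLUME_SIZE_GB:
        remaining -= VOLUME_SIZE_GB   # this volume finds room for a new mirror
        satisfied += 1

stuck = ISOLATED_VOLUMES - satisfied
print(f"{satisfied} volumes re-mirrored, {stuck} left stuck and retrying")
```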
As the event unfolded, the degraded EBS cluster began to affect the larger EBS control plane on EC2. The stricken cluster was unable to respond to API requests, which began to back up, delaying API requests in other EC2 availability zones in the East Coast region as well.
Amazon said it has identified a number of factors that contributed to the sustained outage. The EBS nodes looking for free space on the exhausted primary and secondary networks didn’t back off when none was found, but kept searching repeatedly. A race condition in the code on the EBS nodes also caused them to fail when concurrently closing a large number of replication requests. The chances of that happening were normally extremely low, but the faulty upgrade on April 21 and the re-mirroring storm it triggered produced a volume of connection attempts high enough to trip the race condition frequently, causing nodes to fail as a result of the bug, Amazon said. Those failed nodes added to the list of volumes that needed to re-mirror, leading to more “stuck” volumes and feeding more requests into the re-mirroring storm.
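Amazon’s post does not include code, but the missing safeguard it describes amounts to a retry loop with no backoff. The following sketch uses Python and invented names (`find_free_mirror_space` stands in for whatever internal call an EBS node makes) to contrast a tight retry loop with the capped exponential backoff and jitter that is the conventional fix; it is a sketch of the general technique, not Amazon’s implementation.

```python
import random
import time

def find_free_mirror_space():
    """Hypothetical stand-in for a node's search for replica space.
    Returns a location, or None when the cluster has no room."""
    return None  # simulate an exhausted cluster

def remirror_without_backoff(max_attempts=10):
    # The failure mode described in the post: retry immediately and
    # repeatedly, hammering a cluster that is already out of space.
    for _ in range(max_attempts):  # bounded here only so the demo terminates
        if find_free_mirror_space() is not None:
            return True
    return False

def remirror_with_backoff(max_attempts=10, base=0.1, cap=30.0):
    # The conventional remedy: capped exponential backoff with jitter,
    # so that thousands of stuck volumes do not retry in lockstep.
    for attempt in range(max_attempts):
        if find_free_mirror_space() is not None:
            return True
        delay = min(cap, base * (2 ** attempt))
        time.sleep(random.uniform(0, delay))  # jitter spreads retries out
    return False

if __name__ == "__main__":
    print("no backoff:", remirror_without_backoff())
    print("with backoff:", remirror_with_backoff(max_attempts=5))
```

The jitter matters as much as the backoff itself: without it, a large population of stuck volumes would still retry in lockstep and re-create the same spike of simultaneous requests.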
The company is auditing and changing its network configuration policies – using automation to help prevent the kind of error that occurred. It will also make changes to steer around the kind of re-mirroring storm that occurred – making more capacity available to meet the demands of large recovery events, the company said. Read the full post here.