Much has been written about the fragility of Amazon Web Services' US-East-1 region. Every time AWS has an outage, it seems to be the Eastern region that brings the service down. US-East has the oldest infrastructure, which partly explains why it's so often the center of attention. Given that US-East is the least reliable of all of Amazon's regions, one would have thought that AWS would limit the critical services running there. If a recent blog post by my friend Rene Buest is correct, AWS in fact builds many of its higher-level services on top of US-East, thus exacerbating the impact of problems there. Let's dig into what Buest uncovered in his post.

Buest's concerns centered on the apparent existence of a single point of failure for a number of AWS's more sophisticated services. We need to circle back a little here: one of the key findings from historical AWS outages has been that designing for failure, in the way Netflix approaches its infrastructure, can protect an organization against point failures. Admittedly a service can still go down if the provider's entire infrastructure fails, but that is far less likely than a point failure. Spreading the risk across multiple geographies and zones is a logical approach for users.
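The "spread the risk" idea is easy to sketch. Here is a minimal, hypothetical Python illustration – the zone names and the round-robin placement function are my own illustrative assumptions, not AWS API calls or Netflix's actual tooling:

```python
from itertools import cycle

def spread_across_zones(instance_ids, zones):
    """Assign instances round-robin across availability zones so that
    the failure of any one zone only takes out a fraction of the fleet."""
    placement = {zone: [] for zone in zones}
    for instance, zone in zip(instance_ids, cycle(zones)):
        placement[zone].append(instance)
    return placement

# Hypothetical fleet of six web servers spread over three zones
# (zone names are illustrative, not a claim about AWS's layout).
fleet = ["web-%d" % i for i in range(6)]
zones = ["us-east-1a", "us-west-2a", "eu-west-1a"]
print(spread_across_zones(fleet, zones))
```

Lose any one of those zones and two-thirds of the fleet is still serving traffic – which is exactly the protection that evaporates when the failover machinery itself lives in a single place.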

One would think, then, that AWS would itself adopt this approach and avoid putting all its proverbial eggs in one basket. Instead, many AWS services are built on top of Elastic Block Store (EBS): from Elastic Load Balancer (ELB) to Relational Database Service (RDS) to Elastic Beanstalk, they all rely on EBS being available in order to function. Buest points to a post-mortem investigation of awe.sm's performance during one AWS outage. Awe.sm moved away from EBS due to poor I/O throughput, failures at a regional level and server failure modes on Ubuntu. As the company said:

For these reasons, and our strong focus on uptime, we abandoned EBS entirely, starting about six months ago, at some considerable cost in operational complexity (mostly around how we do backups and restores). So far, it has been absolutely worth it in terms of observed external uptime.

Unfortunately the services that are built on top of EBS (ELB, RDS, Elastic Beanstalk) rely on EBS being up in order to function – if EBS fails and customers want to balance their traffic to another region to compensate, the balancing isn't available (since it sits atop EBS). It's a circular fault that removes customers' ability to use the very services designed to help them reduce the impact of outages. Witness the status messages below from 25 August, where both RDS and ELB suffered issues due to an EBS performance degradation:

EC2 (N. Virginia)
[RESOLVED] Degraded performance for some EBS Volumes

ELB (N. Virginia)
[RESOLVED] Connectivity Issues

RDS (N. Virginia)
[RESOLVED] RDS connectivity issues in a single availability zone

The bottom line here is that upstream services, including ones which, ironically, are used to direct traffic elsewhere in the event of an outage, rely on EBS in order to function. AWS has effectively created a single point of failure in EBS – which makes it more important than ever that AWS ensures EBS is reliable. I reached out to a former AWS engineer to fact-check these assertions. Despite being, to this day, a fan of AWS, he was critical of an architecture that sees so many AWS services rely on what is a fragile service:

AWS EBS depends on a database that runs in ONE AZ. That AZ is the most problematic. When that AZ is unavail, EBS crashes. This was true June 2012. Seems like it’s still true based on recent events. As of June 2012, that AZ is connected to another AZ (doesn’t stand alone) and is more fragile than usual. And when EBS is down…yes, most everything is down. RDS, ELB, etc.
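The cascade the engineer describes is just transitive failure through a dependency graph. The sketch below models it with a simplified map based on the article's claims (ELB, RDS and Elastic Beanstalk all sitting on EBS) – the map is illustrative, not an official AWS architecture diagram:

```python
from collections import deque

# Simplified dependency map drawn from the article's claims, not from
# any published AWS architecture documentation.
DEPENDS_ON = {
    "ELB": ["EBS"],
    "RDS": ["EBS"],
    "ElasticBeanstalk": ["EBS"],
    "EBS": [],
}

def impacted_by(failed_service, depends_on):
    """Return every service that becomes unavailable, transitively,
    when `failed_service` goes down."""
    # Invert the map so we can walk from a failed service to its dependents.
    dependents = {svc: [] for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(svc)
    # Breadth-first walk outward from the failure.
    down, queue = {failed_service}, deque([failed_service])
    while queue:
        for svc in dependents[queue.popleft()]:
            if svc not in down:
                down.add(svc)
                queue.append(svc)
    return down

print(sorted(impacted_by("EBS", DEPENDS_ON)))
```

Run against this toy map, an EBS failure marks ELB, RDS and Elastic Beanstalk as down too – including the load balancing a customer would reach for to route around the outage. That is the circular fault in miniature.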

It’s a pretty sorry state of affairs, and one which can’t possibly continue if AWS is to live up to its promise of becoming a credible enterprise vendor. I expect significant investment to reinforce the robustness of EBS, along with some clarity on short-term measures to remove the weakness.

Ben Kepes

Ben Kepes is a technology evangelist, an investor, a commentator and a business adviser. Ben covers the convergence of technology, mobile, ubiquity and agility, all enabled by the Cloud. His areas of interest extend to enterprise software, software integration, financial/accounting software, platforms and infrastructure as well as articulating technology simply for everyday users.

7 Comments
  • Good article Ben – fully agree.
    Of course all the other cloud vendors would have this issue if they had the AWS customer numbers.
    How would say our chums at RAX do this if they had the same customer numbers, for example?

• Disagree. This is about SPOF vulnerabilities, not capacity for # of users. Does RAX have a SPOF?
      Also, I have read that Salesforce.com does the most transactions/second.

      • … and spof vulnerabilities increase with scaling (aka # users) in any engineering environment.
AMZN “appears” to have more outages because it has had to scale beyond what anyone else has had to in history. AWS’s customer base is 5x that of the combined competition.
        Lastly, Salesforce is SaaS – so let’s compare apples with apples, not what you might have read.

  • Back in the days of yore, AWS would simply say, “EBS? Too bad, so sad.”

They have always explicitly recommended against trusting EBS as a formal part of their best practices, and AWS never really wanted to implement EBS in the first place – it’s yesterday’s architecture, a band-aid to bring users in and make them feel comfy.

    Unfortunately it has proven irrevocably popular and AWS is stuck in the position of providing a service that allows users to harm themselves, and then has to suffer the blame for it as well.

Could they do a better job of building and providing block storage volumes? Maybe – it seems incredibly difficult and expensive to do given what we know about the AWS platform, but I think the overarching point is that users have really brought this point of failure, via popular demand, into “the cloud” themselves. Hard drives bonk in the physical world and sh*t flies – no different in the virtual world. Build the way AWS tells you to, and you sail over each and every one of these outages.

I think this is one of AWS’ true bêtes noires and I’d bet real money that future iterations of IaaS will have far better-protected block storage options, since we apparently can’t give up our old-fashioned modalities when we move to “the cloud” 🙂

  • “the overarching point is that users have really brought this point of failure, via popular demand, into “the cloud” themselves.”

    Carl, you nailed it.

On the other hand – it’s problematic if a service is well designed and SPOF-free until it introduces the feature most people want. 🙂

  • Well written and focused on the topic. Cheers

• Kim Dartford

    Excellent article thanks, lots of detail which is always appreciated. There’s another good article talking about AWS single points of failure on http://www.whyaws.com if anybody’s interested.
