So now the world knows that the current outages that Amazon and Microsoft are suffering in Europe have been caused by lightning strikes on their Dublin data centers. The outages have caused downtime for users of both Amazon EC2 and Microsoft BPOS services.

I’ll not delve into the issues around failover – clearly the lightning strike was a catastrophic event that overcame the protection that both providers have against upstream events and caused the usual power supply backups to fail.

For those not in the know, Ireland has become a key hub for technology providers – it’s got a good climate (read cold and wet, good for temperature control), it has good internet connectivity and it has a ready supply of IT staff. Another big factor is the fact that the Irish Government offers very attractive incentives to technology companies to relocate there… it’s all about costs after all.

Colleague Phil Wainewright covered the Amazon outage, pointing out that:

EU-WEST-1 is Amazon’s only data center in Europe, which means that customers who have to keep their data within the European region for data protection compliance have no available failover to another Amazon location

It’s a very good point, and it raises some questions about redundancy among European-located cloud providers. Wainewright did point out that the Amazon data center in fact has three distinct Availability Zones within the one location, and that the outage affected only one of those zones; however, it appears that recovery efforts are affecting the other Availability Zones as well. Adrian Cockcroft, resident cloud guru at Netflix, was pretty upbeat about the fact that the incident hit only one Availability Zone, saying:

they lost one AZ, that’s why there are several AZ’s. We are testing a global Cassandra cluster and it didn’t go down #doinitright. only use EU for global testing but Cassandra data is repl to all 3 AZ, lose zone it still works and recovers when zone comes back. we use excess reserved instances in prod (in US) to get priority for capacity in zone level outages…

So it seems the cascading issues on the other two zones did not affect Netflix’s infrastructure at all.

Either way, this does highlight issues. Imagine an uber-catastrophic event that knocked out the entire Dublin Amazon data center. With only one physical location in Europe, companies relying on Amazon to host their Euro-centric data have a couple of unpalatable options:

  1. Move to another provider – in an open-standards world this might be easy enough, but for a company built entirely upon Amazon’s proprietary standards it isn’t much fun
  2. Move data to Amazon’s US DCs, and in doing so potentially breach data location regulations

It’s early days in the cloud, but events like this help us think about what the future needs to look like in order to ensure that cloud is safe for everyone…

Ben Kepes

Ben Kepes is a technology evangelist, an investor, a commentator and a business adviser. Ben covers the convergence of technology, mobile, ubiquity and agility, all enabled by the Cloud. His areas of interest extend to enterprise software, software integration, financial/accounting software, platforms and infrastructure as well as articulating technology simply for everyday users.

1 Comment
  • In the bottom part of your post, you missed the third option – “do nothing.”

    Believe it or not, yes it’s an option. One can guesstimate the probability of all three Availability Zones going offline, estimate the costs associated with going down and staying down for N hours, and also assess how much money and resources can be spent on building out reliability.

    Say I lose $10K per hour and my cost of building out sufficient capacity to withstand a complete loss of eu-west-1 is $50K per year.

    Then it’s a matter of relatively straightforward arithmetic and dealing with probabilities.

    It’s OK to take a hit if this was your plan. It’s not OK to take a hit if an outage caught you by surprise, you have no plan and don’t know what to do.
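    The commenter’s break-even comparison can be sketched in a few lines. The per-hour loss and yearly mitigation cost come from the comment itself; the outage probability and duration below are invented placeholders for illustration, not real AWS figures:

    ```python
    # Break-even arithmetic for the "do nothing" option.
    # Figures marked "assumed" are hypothetical, chosen only to illustrate the method.

    loss_per_hour = 10_000      # $ lost per hour of downtime (from the comment)
    mitigation_cost = 50_000    # $ per year for capacity to survive losing eu-west-1

    p_region_outage = 0.05      # assumed annual probability of a full-region outage
    outage_hours = 24           # assumed downtime per such event, in hours

    # Expected loss per year = probability x duration x hourly cost
    expected_annual_loss = p_region_outage * outage_hours * loss_per_hour
    # 0.05 * 24 * 10,000 = $12,000 per year

    if expected_annual_loss < mitigation_cost:
        decision = "do nothing (accept the risk)"
    else:
        decision = "build out cross-region redundancy"

    print(f"Expected annual loss: ${expected_annual_loss:,.0f} -> {decision}")
    ```

    With these particular numbers the expected loss ($12K/year) is well under the $50K/year mitigation cost, so accepting the risk is the rational plan; crank the probability or duration up and the answer flips.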
