As I write this, IT admins across the world (but especially in Europe) are looking bleary-eyed as they try to work around a serious outage that Microsoft’s cloud suffered in Northern Europe. According to the Microsoft advisory:

A subset of customers using Virtual Machines, Storage, SQL Database, Key Vault, App Service, Site Recovery, Automation, Service Bus, Event Hubs, Data Factory, Backup, API management, Log Analytics, Application Insight, Azure Batch, Azure Search, Redis Cache, Media Services, IoT Hub, Stream Analytics, Power BI, Azure Monitor, Azure Cosmos DB or Logic Apps in North Europe may experience connection failures when trying to access resources hosted in the region.

Or pretty much anyone doing anything in North Europe.

It doesn’t have to be this way

An astute commentator posted on a story about this outage saying that:

AWS, GCP, and Azure are all more reliable and more secure than a traditional data center but that doesn’t mean you don’t need a DR plan. Well Architected cloud applications can survive failures. The difference with public cloud is if it has an outage its major news due to its blast radius, not usually true when one of your own DC’s has an issue.

I’ve said it thousands of times in the past decade: even the most reliable of cloud providers suffer outages. You’ll note that no one – not AWS, not Google, nor any of the other relevant public cloud vendors – is gloating about this. They’ve all suffered similar fates.

Plan for failure

Moving to the cloud doesn’t mean that, all of a sudden, you can forget about planning for failure. True, some of the responsibility for reliability moves over to the vendor, but you still need to plan for the sadly inevitable issues that arise. As Netflix so effectively told the world – if you plan redundancy for every part of your infrastructure, it takes next-level catastrophes to cause service outages.
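
To make that a little more concrete, here is a minimal sketch of what "planning for failure" can look like at the application level: try the primary region, and if it refuses connections, fall back to a secondary one. This is an illustration under stated assumptions, not how Netflix or Microsoft do it. The region names and the `fake_request` function are hypothetical stand-ins for whatever client call your application really makes.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each region in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # Remember why this region failed, then move on to the next one.
            errors[region] = exc
    raise AllRegionsFailed(f"no region answered: {errors}")


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a real service call; simulates a regional outage."""
    if region == "north-europe":
        raise ConnectionError("region unavailable")
    return f"served from {region}"


# The primary region is down, so the call falls through to the secondary.
print(call_with_failover(["north-europe", "west-europe"], fake_request))
```

Real failover is much harder than this, of course: data has to be replicated to the second region, and someone has to test the fallback path before the day it is needed. The point is simply that the fallback has to be designed in ahead of time.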

Still, it’s bad

That said, the scale of this outage, in terms of both breadth and time to remedy, is really bad, and you can bet Satya Nadella, Microsoft’s CEO, was roused from his slumber or meditation session to be informed about it. It’s a fairly sure bet that Microsoft’s people will be spending lots of time crafting messages to explain exactly what went wrong and how they’ll stop it happening again.

A cautionary tale

This doesn’t invalidate Microsoft Azure, nor does it invalidate the public cloud generally. What it is, however, is a timely reminder that you need to engineer for the inevitable, and spend lots of time doing “what if” planning.

A timely piece of advice for cloud vendors

People, over-communicating is always, always, always better than under-communicating. Microsoft (as well as all other vendors) should do a deep review of the communication that happened during and after this outage to see whether it really was handled in a best-practice manner.

Ben Kepes

Ben Kepes is a technology evangelist, an investor, a commentator and a business adviser. Ben covers the convergence of technology, mobile, ubiquity and agility, all enabled by the Cloud. His areas of interest extend to enterprise software, software integration, financial/accounting software, platforms and infrastructure as well as articulating technology simply for everyday users.
