Yesterday at 0:51 AM (UTC), Azure, Microsoft’s public cloud service, suffered a massive global outage that lasted around 11 hours. The outage affected 12 of Azure’s 17 regions, taking down the entire US, Europe and Asia together with customers’ applications and services, and causing havoc among users.
After a day of emergency fixes and investigation, Microsoft published an initial formal report on the issue on its blog. The root cause is reported to be:
A bug in the Blob Front-Ends which was exposed by the configuration change made as a part of the performance improvement update, which resulted in the Blob Front-Ends to going into an infinite loop.
Though the issue has not yet been fully investigated, the initial report indicates that the testing scheme the Azure team employed (nicknamed “flighting”) failed to detect the bug, allowing the configuration change to be rolled out to production. In addition, the roll-out itself was performed across the regions concurrently, instead of following the common practice of staged roll-outs across regions:
update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches.
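The “incremental batches” protocol mentioned in the report can be illustrated with a minimal sketch. This is not Azure's actual tooling: the function `staged_rollout`, its parameters, and the region names are all hypothetical, and a real deployment system would add bake time between batches, rollback, and alerting.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change batch by batch and stop at the first unhealthy batch.

    Returns the regions that received the change, in order.
    """
    updated: List[str] = []
    for batch in batches:
        for region in batch:
            apply_change(region)
            updated.append(region)
        # Verify the whole batch before touching the next one, so a bad
        # change is contained to the regions updated so far.
        if not all(is_healthy(region) for region in batch):
            break
    return updated


# Hypothetical example: the change breaks "us-east", so the roll-out
# halts before reaching Europe and Asia.
batches = [["canary"], ["us-west", "us-east"], ["europe", "asia"]]
broken = {"us-east"}
touched = staged_rollout(batches, lambda r: None, lambda r: r not in broken)
print(touched)  # ['canary', 'us-west', 'us-east']
```

The point of the design is the health check between batches: a concurrent roll-out skips it, so a latent bug reaches every region at once.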
Microsoft is not the first cloud vendor to encounter major outages. Amazon’s long-standing AWS cloud has suffered at least one major outage a year (let’s see how this year ends; so far it is looking good for them). In 2011 AWS suffered an outage that lasted 3 days in the US East Region. It is interesting to note that this outage was also triggered by a configuration change (in their case, an upgrade of the network capacity). Following that outage I provided recommendations and best practices for customers on how to keep their cloud-hosted systems resilient.
No cloud vendor is immune to such outages. Even the vendors’ standard built-in geo-redundancy mechanisms, such as multi-availability-zone and multi-region strategies, cannot save customers from such major outages, as we witnessed in these cases. We, as customers placing our mission-critical systems in the cloud, need to guarantee the resilience of our systems regardless of the vulnerabilities of the underlying cloud provider. To achieve an adequate level of resilience we need to employ a multi-cloud strategy, deploying our applications across several vendors to reduce the risk. I covered the multi-cloud strategy in greater detail in my blog in 2012, following yet another AWS outage.
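At its simplest, a multi-cloud deployment needs some way to route around a vendor that is down. The sketch below shows one possible shape of that logic, under stated assumptions: the function `pick_endpoint`, the vendor labels, and the `example.com` URLs are all hypothetical, and the probe is stubbed out rather than performing a real health check.

```python
from typing import Callable, Optional, Sequence, Tuple


def pick_endpoint(
    endpoints: Sequence[Tuple[str, str]],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the URL of the first endpoint whose probe succeeds.

    `endpoints` is an ordered list of (vendor, url) pairs, most preferred
    first. Returns None when every vendor fails its probe.
    """
    for _vendor, url in endpoints:
        if probe(url):
            return url
    return None


# Hypothetical deployment of the same application on two vendors.
endpoints = [
    ("vendor-a", "https://app-a.example.com/health"),
    ("vendor-b", "https://app-b.example.com/health"),
]

# Stubbed probe simulating an outage at the first vendor.
down = {"https://app-a.example.com/health"}
active = pick_endpoint(endpoints, lambda url: url not in down)
print(active)  # https://app-b.example.com/health
```

In practice this decision usually lives in DNS or a global load balancer rather than in application code, and keeping data in sync across vendors is the harder part of the problem.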
There will always be bugs. The cloud vendors need to improve their processes and procedures to flush out such critical bugs early in the testing phase and to avoid cascading problems between systems and geographies in production. And customers need to remember that the cloud is not a silver bullet, prepare for disaster, and design their systems accordingly.