A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.
How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as past outages. Many recommend basing your architecture on multiple AWS Availability Zones (AZ) to spread the risk. But during this outage we saw even multi-Availability Zone applications severely affected. Even Amazon published during the outage that
Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.
The reason is that there is an underlying infrastructure that escalates the traffic from the affected AZ to other AZ in a way that overwhelms the system. In the case of this outage it was the AWS API Platform that was rendered unavailable, as nicely explained in this great post:
The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.
But it doesn’t really matter for us as users which exact infrastructure it was that failed on this specific outage. 18 months ago, during the first major outage, the reason was another infastructure component, the Elastic Block Store (“EBS”) volumes, that cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:
Spread across several public cloud vendors and/or private cloud
The rule of thumb in IT is that there will always be extreme and rare situations (and don’t forget, Amazon only commits to 99.995% SLA) causing such major outages. And there will always be some common infrastructure that under that extreme and rare situation will carry the ripple effect of the outage to other Availability Zones in the region.
Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling, but as I stated on my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: “it requires effort on the part of application builders to take advantage of this isolation”.
The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, to provide the best isolation level. The advantages in terms resilience are clear. But how can that be implemented, given that the vendors are so different in their characteristics and APIs?
There are 2 approaches to deploying across multiple cloud vendors and keeping cloud-vendor-agnostic:
- Open Standards and APIs for cloud API that will be supported by multiple cloud vendors. That way you write your application using a common standard and have immediate support by all conforming cloud vendors. Examples for such emerging standards are OpenStack and JClouds. However, the Cloud is still a young domain with many competing standards and APIs and it is yet to be determined which one shall become the de-facto standard of the industry and where to “place our bet”.
- Open PaaS Platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate to the underlying cloud vendors (whether public or private clouds, or even a hybrid). Examples of such platforms, are CloudFoundry and Cloudify. I dedicated one of my posts for exploring how to build your application using such platforms.
System architects need to face the reality of the Service Level Agreement provided by Amazon and other cloud vendors and their limitations, and start designing for resilience by spreading across isolated environments, deploying DR sites, and by similar redundancy measures to keep their service up-and-running and their data safe. Only that way can we guarantee that we will not be the next one to fall off the 99.995% SLA.
This post was originally posted here.