AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in its US-EAST-1 region. This was the fifth major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question, during this outage as well as past ones. Many recommend basing your architecture on multiple AWS Availability Zones (AZs) to spread the risk. But during this outage we saw even multi-Availability-Zone applications severely affected. Even Amazon stated during the outage:

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is underlying shared infrastructure that shifts the traffic from the affected AZ to the other AZs in a way that overwhelms the system. In the case of this outage it was the AWS API platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn’t really matter to us as users exactly which infrastructure failed in this specific outage. 18 months ago, during the first major outage, the culprit was another infrastructure component, the Elastic Block Store (“EBS”) volumes, which cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations that cause such major outages (and don’t forget, Amazon only commits to a 99.995% SLA). And there will always be some common infrastructure that, under such an extreme and rare situation, carries the ripple effect of the outage to other Availability Zones in the region.
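To put an SLA figure in perspective, the arithmetic can be sketched as follows. This is only an illustration, not how Amazon measures uptime: the function name `downtime_minutes_per_year` is made up for this example, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day service year


def downtime_minutes_per_year(sla_percent: float) -> float:
    """Return the downtime (minutes per year) that an uptime SLA still permits."""
    return (100.0 - sla_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    # The SLA levels below are just sample inputs for comparison.
    for sla in (99.9, 99.99, 99.995):
        minutes = downtime_minutes_per_year(sla)
        print(f"{sla}% uptime still allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.995% commitment still leaves room for roughly 26 minutes of downtime a year, and every lost "nine" multiplies that budget by ten.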

Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling. But as I stated in my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: “it requires effort on the part of application builders to take advantage of this isolation”.
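What "it is up to your application to replicate data" means in practice can be sketched roughly as below. This is a toy model under stated assumptions, not AWS code: `RegionStore` and `ReplicatedStore` are hypothetical names, and each region is simulated by an in-memory dictionary rather than a real per-region API endpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegionStore:
    """Hypothetical stand-in for one region's storage endpoint (in-memory)."""
    name: str
    available: bool = True
    data: Dict[str, str] = field(default_factory=dict)

    def put(self, key: str, value: str) -> None:
        if not self.available:
            raise ConnectionError(f"region {self.name} is unavailable")
        self.data[key] = value

    def get(self, key: str) -> str:
        if not self.available:
            raise ConnectionError(f"region {self.name} is unavailable")
        return self.data[key]


class ReplicatedStore:
    """Application-level replication: write to every region, read from the first healthy one."""

    def __init__(self, regions: List[RegionStore]) -> None:
        self.regions = regions

    def put(self, key: str, value: str) -> List[str]:
        """Write to all regions; return the names of the regions that accepted the write."""
        written = []
        for region in self.regions:
            try:
                region.put(key, value)
                written.append(region.name)
            except ConnectionError:
                continue  # a real system would queue the write for later repair
        if not written:
            raise ConnectionError("no region accepted the write")
        return written

    def get(self, key: str) -> str:
        """Read from the first region that is reachable and holds the key."""
        for region in self.regions:
            try:
                return region.get(key)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(key)


if __name__ == "__main__":
    east, west = RegionStore("us-east"), RegionStore("us-west")
    store = ReplicatedStore([east, west])
    store.put("user:1", "alice")
    east.available = False      # simulate a regional outage
    print(store.get("user:1"))  # served from the surviving region
```

The sketch deliberately ignores the hard parts (conflict resolution, catching up a region after it recovers, cross-region latency), which is exactly the effort that falls on the application builder.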

The most resilient architecture would therefore spread your system across different cloud vendors, which provides the best isolation level. The advantages in terms of resilience are clear. But how can that be implemented, given that the vendors are so different in their characteristics and APIs?

There are two approaches to deploying across multiple cloud vendors while staying cloud-vendor-agnostic:

  1. Open standards and APIs for the cloud that are supported by multiple cloud vendors. That way you write your application against a common standard and get immediate support from all conforming cloud vendors. Examples of such emerging standards are OpenStack and JClouds. However, the cloud is still a young domain with many competing standards and APIs, and it is yet to be determined which one will become the de facto industry standard and where to “place our bet”.
  2. Open PaaS platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate with the underlying cloud vendors (whether public clouds, private clouds, or a hybrid). Examples of such platforms are CloudFoundry and Cloudify. I dedicated one of my posts to exploring how to build your application using such platforms.
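Both approaches boil down to the same idea: the application codes against a vendor-neutral interface, and vendor-specific adapters sit behind it. A minimal sketch of that idea follows. It does not use the API of OpenStack, JClouds, CloudFoundry or Cloudify; `CloudProvider`, `FakeProvider`, `ProviderError` and `launch_with_failover` are all hypothetical names, and the providers are simulated in memory.

```python
from abc import ABC, abstractmethod
from typing import List


class ProviderError(Exception):
    """Raised when a provider cannot serve a request (outage, throttling, ...)."""


class CloudProvider(ABC):
    """Vendor-neutral interface the application codes against (hypothetical)."""

    name: str

    @abstractmethod
    def launch_instance(self, image: str) -> str:
        """Start an instance and return an identifier for it."""


class FakeProvider(CloudProvider):
    """In-memory stand-in for a vendor-specific adapter."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self._count = 0

    def launch_instance(self, image: str) -> str:
        if not self.healthy:
            raise ProviderError(f"{self.name}: API unavailable")
        self._count += 1
        return f"{self.name}-{image}-{self._count}"


def launch_with_failover(providers: List[CloudProvider], image: str) -> str:
    """Try each provider in order and return the first successful launch."""
    errors = []
    for provider in providers:
        try:
            return provider.launch_instance(image)
        except ProviderError as exc:
            errors.append(str(exc))
    raise ProviderError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    primary = FakeProvider("cloud-a", healthy=False)  # simulate an outage
    secondary = FakeProvider("cloud-b")
    print(launch_with_failover([primary, secondary], "web"))
```

Because the application only ever sees `CloudProvider`, adding another vendor means writing one more adapter rather than touching application code. In the PaaS approach, the platform plays the role of these adapters for you.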


System architects need to face the reality of the Service Level Agreement provided by Amazon and other cloud vendors and their limitations, and start designing for resilience by spreading across isolated environments, deploying DR sites, and by similar redundancy measures to keep their service up-and-running and their data safe. Only that way can we guarantee that we will not be the next one to fall off the 99.995% SLA.

This post was originally posted here.




10 responses to “AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud”

  1. Sean Hull

    Another great post Dotan. Btw I think amazon’s SLA might be even less than that at 99.95%!

  2. Sean Hull

    Great follow-up post, Dotan. Btw I think Amazon’s SLA might be even lower than that, at 99.95%!

  3. Dotan, interesting post. Several issues:

    > Amazon only commits to 99.95%:
    “Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year.”

    > Cross-cloud doesn’t necessarily include deep application integration, but only a wrapper that creates the scale. ISVs that want to present a scalable solution with fast AWS adoption must take it step by step. Cross-AZ is a must, and to gain five nines you must go cross-region.

    > Cross-cloud is a different story, one I have a great debate about with Uri C :). I am not sure it is relevant with regard to HA. It might be with regard to the risk of lock-in.

    > Also, I see PaaS as a whole different story. I don’t find HA among the five leading incentives to adopt PaaS.

    I covered past outage and I invite you to read my take –

    read the two parts 🙂 …


  4. Thanks Ofir for the feedback.

    “Cross cloud doesn’t necessarily include deep application integration”
    > I absolutely agree with your assessment. As I explained in my post, using application platforms can actually save you from having to perform any deep integration with the cloud infrastructure, as the platform will take care of the integration for you.

    “cross clouds… not sure that with regards to HA it is relevant”
    > How do you reach that conclusion? The isolation level between different vendors is always much higher than within the same vendor (even across different AZs or even regions). If one IaaS provider suffers infrastructure issues, your application is more likely to survive if it can fail over to a different IaaS provider. Obviously, it also prevents vendor lock-in, as you stated.

  5. A multi-cloud strategy is one in which users rely on the services of more than one cloud provider to host their website. This gives them a stronger assurance that their online infrastructure stays secure and runs seamlessly.

  6. Pingback: Architecting at Scale with Hybrid and Multi Cloud – Wix Case Study | horovits

  7. Pingback: Can a Configuration Update Take Down The Entire Microsoft Azure Cloud? | horovits

  8. Pingback: Retrospect on recent AWS Outage and Resilient Cloud-Based Architecture | horovits

  9. Pingback: Amazon Cloud Outage Hits Dozen Of Sites, But Not Amazon | horovits

  10. Pingback: AWS Outage - Moving from Multi-Availability-Zone to Multi-Cloud | Cloudify
