Retrospective on the Recent AWS Outage and Resilient Cloud-Based Architecture

According to the television series “Terminator: The Sarah Connor Chronicles”, the Skynet computer system began its attack against humanity on April 21, 2011. Luckily that hasn’t happened (or has it?), but on that very day another predominant computing system provided us with a painful reminder of how much humanity relies on computers to run the world.

A couple of weeks ago, on April 21, the IT world experienced a tsunami: the Amazon Web Services (AWS) cloud went down in the US East Region for over 3 days (!!), taking down with it numerous systems and services that rely on AWS, such as HootSuite, Reddit, Foursquare, Quora and many more, with damage estimated at $2M. The affected services were EC2 and RDS, and Amazon provided a detailed technical summary of the event.

This tsunami was a wake-up call. It wasn’t the first outage in the cloud arena, and not even Amazon’s first (in fact it was their second outage this year). But its impact was so vast that it finally brought the realization that the cloud is not a silver bullet. Those who counted on the cloud’s generic provision of scalability and resilience did not survive the AWS outage. Your application’s resilience (as well as its scalability) does not exist unless you take care of it yourself.
So how do we achieve this resilience in our application?

Why not learn from the experience of those who survived the outage?
As a cloud evangelist, I was intrigued by the story of the outage as it unfolded. There were great posts during and after the outage from those who went down. But more interesting to me as an architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

In this post I’d like to summarize the patterns, principles and best practices that emerge from these posts, as I believe we can learn a lot from them about how to design our business applications to truly leverage the resilience and scalability the cloud offers.

Patterns, Guidelines and Best Practices

Design for failure

The first and fundamental principle in building robust architecture is to design for failure. As SmugMug states:

… we designed for failure from day one. Any of our instances, or any group of instances in an AZ [Availability Zone], can be “shot in the head” and our system will recover …

This principle should be prevalent during design, development, deployment and maintenance stages of the system. SimpleGeo presents an excellent work practice:

… At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we’re coding, and we joke about it at lunch …

Some companies even embedded random failure simulation in their work procedures, such as SmugMug:

… once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you …

Netflix even automated random failure simulation using a designated Chaos Monkey service to get their engineering team used to a “constant level of failure in the cloud”.
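
As a rough illustration of the idea (not Netflix’s actual implementation), here is a minimal chaos-monkey-style sketch using the boto3 SDK. The region and the `chaos-target` tag are hypothetical assumptions; the premise is that only instances explicitly opted in via that tag may be killed:

```python
# Hypothetical chaos-monkey-style sketch: terminate one random opted-in instance.
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

def kill_random_instance(tag_key="chaos-target", tag_value="true"):
    # Find running instances that opted in to chaos testing via a tag.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    # "Shoot it in the head" -- the rest of the system should recover on its own.
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```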

Stateless and autonomous services

If possible, divide your business logic into stateless services, to allow easy fail-over and scalability. Netflix explained the fail-over benefits:

… if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it …

Twilio aggregates their stateless services into homogeneous pools, which provides them both fail-over and elasticity:

… The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load …
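
A minimal sketch of that pattern, assuming a hypothetical pool of recording-service endpoints exposing a simple HTTP API (an illustration of the idea, not Twilio’s actual code):

```python
# Retry a request against other members of a homogeneous, stateless pool.
import random
import requests

RECORDING_POOL = [  # hypothetical endpoints
    "http://rec-1.internal:8080",
    "http://rec-2.internal:8080",
    "http://rec-3.internal:8080",
]

def store_recording(payload, retries=3):
    # Because the service is stateless, any member of the pool can serve the
    # request; on failure we simply retry against a different instance.
    candidates = random.sample(RECORDING_POOL, k=min(retries, len(RECORDING_POOL)))
    for base_url in candidates:
        try:
            resp = requests.post(f"{base_url}/recordings", json=payload, timeout=2)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # try the next instance in the pool
    raise RuntimeError("all recording instances failed")
```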

To contain the ripple effect of the failure, make the services well-encapsulated, as SmugMug states:

… Make your system divided into well-encapsulated components that can fail individually without failing the entire system …

Redundant hot copies spread across zones

By replicating your data to other zones, you insulate your service from zone-wide failure, as SmugMug, Twilio, Netflix and others describe. As Netflix explains:

… we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby …

Twilio also emphasizes configuring timeouts and retries to avoid delays when failing over to another copy:

… By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services …
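
A minimal sketch of the idea, assuming hypothetical per-zone replica endpoints behind a simple HTTP API: writes fan out so every zone keeps a hot copy, and reads use short timeouts so a dead zone never stalls the caller:

```python
# Hypothetical zone replicas: fan-out writes, quick-timeout fail-over reads.
import requests

REPLICAS_BY_ZONE = {  # hypothetical endpoints, one per Availability Zone
    "us-east-1a": "http://data-1a.internal:9000",
    "us-east-1b": "http://data-1b.internal:9000",
    "us-east-1c": "http://data-1c.internal:9000",
}

def write_hot_copies(key, value):
    # Push the write to every zone so each copy stays "hot"; a real system
    # would likely do this asynchronously or via the data store's replication.
    for endpoint in REPLICAS_BY_ZONE.values():
        requests.put(f"{endpoint}/items/{key}", json=value, timeout=1)

def read_with_failover(key, timeout=0.5):
    # Short timeouts keep a failed zone from stalling the caller; just move
    # on to the hot copy in the next zone.
    last_error = None
    for endpoint in REPLICAS_BY_ZONE.values():
        try:
            resp = requests.get(f"{endpoint}/items/{key}", timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err
    raise RuntimeError(f"all zone replicas unreachable: {last_error}")
```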

Spread across several public cloud vendors and/or private cloud

Most IT organizations avoid depending on a single ISP by keeping another ISP as backup. Even Amazon uses this strategy internally to ensure high availability of their cloud, with a primary and a backup network. Similarly, you should avoid dependency on a single cloud vendor by having another vendor as backup. This holds true even if the vendor provides a certain level of resilience, as we saw with Amazon’s multi-AZ failure in the recent outage. Many of the companies that survived the recent AWS outage owe it to using their own datacenters, to using other vendors, or to using the US West Region of AWS. SmugMug, for instance, kept their critical data in their own datacenters:

… the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance …

and also recommends that you “spread across many providers”, while admitting that

… This is getting more and more difficult as AWS continues to put distance between themselves and their competitors …

When considering using different AWS regions for resilience, it is interesting to note Amazon’s statement about the effort required on the application side to work with multiple regions, which makes you wonder whether it is really that much easier than working with a different vendor altogether:

… if you want to move data between Regions, you need to do it via your applications as we don’t replicate any data between Regions on our users’ behalf. You also need to use a separate set of APIs to manage each Region. Regions provide users with a powerful availability building block, but it requires effort on the part of application builders to take advantage of this isolation …
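
In practice, this means the application itself copies data between regions, with a separate client per region. Here is a hedged sketch of that approach for S3 objects using the boto3 SDK; the bucket names are hypothetical, and a real setup would also need retries, versioning and a mechanism for switching reads to the backup region:

```python
# Application-level cross-region replication sketch (hypothetical buckets).
import boto3

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-west-1")

def replicate_object(key, src_bucket="myapp-us-east", dst_bucket="myapp-us-west"):
    # Read from the primary region and write to the backup region; if the
    # primary region goes down, the application can serve reads from the copy.
    body = src.get_object(Bucket=src_bucket, Key=key)["Body"].read()
    dst.put_object(Bucket=dst_bucket, Key=key, Body=body)
```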

Automation and Monitoring

Automation is the key. Your application needs to automatically pick up alerts on system events, and should be able to react to them automatically as well. As SimpleGeo’s architect states:

… Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated …

It is interesting to see that even Netflix, who took pride in surviving the failure, admitted that the manual responses their engineers used this time will not work in the future, as they grow into a

… worldwide service with dozens of availability zones, even with top engineers we simply won’t be able to scale our responses manually …

Detailed alerting mechanisms are also essential for manual control of the system, as Bizo states:

… we have our own internal alarms and dashboards that give us up to the minute metrics such as request rate, cpu utilization etc. …
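
To make the point concrete, here is a minimal sketch of automating the reaction rather than just the alert: a watchdog that health-checks each node and replaces it when it stops responding. The health URLs, AMI and instance type are hypothetical assumptions, and a production setup would more likely lean on auto-scaling groups and proper alarms than on a hand-rolled loop:

```python
# Hypothetical watchdog: detect a dead node, terminate it, spin up a replacement.
import time
import boto3
import requests

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

NODES = {"i-0abc123": "http://10.0.1.10:8080/health"}  # instance-id -> health URL

def watchdog():
    while True:
        for instance_id, health_url in list(NODES.items()):
            try:
                requests.get(health_url, timeout=2).raise_for_status()
            except requests.RequestException:
                # React automatically: kill the sick node and launch a fresh one.
                ec2.terminate_instances(InstanceIds=[instance_id])
                ec2.run_instances(ImageId="ami-12345678", InstanceType="m1.small",
                                  MinCount=1, MaxCount=1)
                NODES.pop(instance_id)
        time.sleep(30)
```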

Avoiding ACID services and leveraging NoSQL solutions

The CTO of SimpleGeo recommends avoiding reliance on ACID services, as it inhibits the distributed nature of the cloud. To achieve that, Twilio recommends that you “relax consistency requirements”. Netflix implemented this by

… leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees …
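
As a hedged illustration of relaxed consistency with a NoSQL store, here is a minimal sketch using Cassandra via the DataStax Python driver; the keyspace, table and node addresses are hypothetical, and none of the companies above are claimed to use this exact code:

```python
# Write at ConsistencyLevel.ONE: favor availability over strict consistency.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])  # nodes in several zones
session = cluster.connect("myapp")  # hypothetical keyspace

def save_profile(user_id, blob):
    # ONE = succeed as soon as a single replica acknowledges the write;
    # the other replicas catch up later (eventual consistency).
    stmt = SimpleStatement(
        "INSERT INTO profiles (user_id, blob) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(stmt, (user_id, blob))
```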

Load Balancing

Use dynamic balancing across instances, regardless of zone. When traffic is balanced equally per zone, as Amazon’s Elastic Load Balancer (ELB) does, a single failed zone can bring down the system.

… Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention …
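
Here is a minimal sketch of such middle-tier, zone-agnostic balancing (a hypothetical round-robin illustration, not Netflix’s actual load balancer): instances are tracked along with their zone, but routing ignores the zone entirely and simply skips unhealthy instances:

```python
# Round-robin over healthy instances, ignoring which zone each lives in.
import itertools

class ZoneAgnosticBalancer:
    def __init__(self, instances):
        # instances: list of (zone, endpoint) pairs; the zone is kept only for
        # bookkeeping and never used to weight the routing decision.
        self.instances = instances
        self.healthy = {endpoint for _, endpoint in instances}
        self._cycle = itertools.cycle(instances)

    def mark_down(self, endpoint):
        self.healthy.discard(endpoint)

    def next_endpoint(self):
        # If one zone loses capacity, traffic simply shifts to instances elsewhere.
        for _ in range(len(self.instances)):
            zone, endpoint = next(self._cycle)
            if endpoint in self.healthy:
                return endpoint
        raise RuntimeError("no healthy instances available")
```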

Conclusion

The recent AWS outage serves as an important lesson to the IT world, and as an important milestone in our maturity in using the cloud. The most important thing to do now is to learn from the mistakes of those who went down with AWS, as well as from the success of those who survived it, and come up with proper methodologies, patterns, guidelines and best practices for doing it right, so that Skynet will not take down humanity.

Update: added links to my follow-up posts with some more thoughts I had following subsequent AWS outages, around Disaster Recovery Policies and Moving from Multi-AZ to Multi-Cloud.

Resources

· Netflix, http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

· Joe Stump, SimpleGeo, http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

· SimpleGeo, http://developers.simplegeo.com/blog/2011/04/26/how-simplegeo-stayed-up/

· Ted Theodoropoulos, http://blog.acrowire.com/cloud-computing/failing-to-plan-is-planning-to-fail

· SmugMug, http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse

· Twilio, http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

· Bizo, http://dev.bizo.com/2011/04/how-bizo-survived-great-aws-outage-of.html

· Heroku, http://status.heroku.com/incident/151

· Cloud Computing Google Group, http://groups.google.com/group/cloud-computing/browse_thread/thread/e8079a54e6a8c4b9/72756bf9e587869d?show_docid=72756bf9e587869d

· Amazon, http://aws.amazon.com/message/65648/

· Amazon, http://aws.amazon.com/architecture/

