Tag Archives: outage

AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as past outages. Many recommend basing your architecture on multiple AWS Availability Zones (AZs) to spread the risk. But during this outage we saw even multi-Availability-Zone applications severely affected. Even Amazon stated during the outage:

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is an underlying infrastructure that escalates the traffic from the affected AZ to the other AZs in a way that overwhelms the system. In the case of this outage it was the AWS API Platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn’t really matter for us as users which exact infrastructure it was that failed in this specific outage. 18 months ago, during the first major outage, the cause was another infrastructure component, the Elastic Block Store (“EBS”) volumes, that cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations (and don’t forget, Amazon only commits to 99.995% SLA) causing such major outages. And there will always be some common infrastructure that under that extreme and rare situation will carry the ripple effect of the outage to other Availability Zones in the region.
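To put an availability figure like this in perspective, the downtime it permits can be worked out directly. The sketch below is plain arithmetic; the function name is illustrative and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the yearly downtime (in minutes) that an availability SLA permits."""
    unavailable_fraction = (100.0 - sla_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.995):
        print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} minutes/year")
```

A 99.995% commitment works out to roughly 26 minutes of downtime a year, so an outage lasting hours or days falls far outside it.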

Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling, but as I stated on my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: “it requires effort on the part of application builders to take advantage of this isolation”.

The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, to provide the best isolation level. The advantages in terms of resilience are clear. But how can that be implemented, given that the vendors are so different in their characteristics and APIs?

There are two approaches to deploying across multiple cloud vendors while staying cloud-vendor-agnostic:

  1. Open Standards and APIs for cloud API that will be supported by multiple cloud vendors. That way you write your application using a common standard and have immediate support by all conforming cloud vendors. Examples for such emerging standards are OpenStack and JClouds. However, the Cloud is still a young domain with many competing standards and APIs and it is yet to be determined which one shall become the de-facto standard of the industry and where to “place our bet”.
  2. Open PaaS Platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate with the underlying cloud vendors (whether public or private clouds, or even a hybrid). Examples of such platforms are CloudFoundry and Cloudify. I dedicated one of my posts to exploring how to build your application using such platforms.

Conclusion

System architects need to face the reality of the Service Level Agreement provided by Amazon and other cloud vendors and their limitations, and start designing for resilience by spreading across isolated environments, deploying DR sites, and by similar redundancy measures to keep their service up-and-running and their data safe. Only that way can we guarantee that we will not be the next one to fall off the 99.995% SLA.

This post was originally posted here.


AWS Outage – Thoughts on Disaster Recovery Policies

A couple of days ago it happened again. On June 14 around 9 pm PDT Amazon AWS hit a power outage in its Northern Virginia data center, affecting EC2, RDS, Elastic Beanstalk and other services in the US-EAST region. The AWS status page reported:

Some Cache Clusters in a single AZ in the US-EAST-1 region are currently unavailable. We are also experiencing increased error rates and latencies for the ElastiCache APIs in the US-EAST-1 Region. We are investigating the issue.

This outage affected major sites such as Quora, Foursquare, Pinterest, Heroku and Dropbox. I followed the outage reports, the tweets, the blog posts, and it all sounded all too familiar. A year ago AWS faced a mega-outage that lasted over 3 days, when another datacenter (in Virginia, no less!) went down, and took down with it major sites (Quora, Foursquare… ring a bell?).

Back during last year’s outage I analyzed the reports of the sites that managed to survive the outage, and compiled a list of field-proven guidelines and best practices to apply in your architecture to make it resilient when deployed on AWS and other IaaS providers. I find these guidelines and best practices highly useful in my architectures. I then followed up with another blog post suggesting using designated software platforms to apply some of the guidelines and best practices.

In this blog post I’d like to address one specific guideline in greater depth: architecting for Disaster Recovery.

Disaster Recovery – Characteristics and Challenges

PC Magazine defines Disaster Recovery (DR):

A plan for duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.

DR planning has been a common practice since the days of the mainframes. An interesting question is why this practice is not as widespread in cloud-based architectures. In his recent post “Lessons from the Heroku/Amazon Outage” Nati Shalom, GigaSpaces CTO, analyzes this apparent behavior, and suggests two possible causes:

  • We give up responsibility when we move to the cloud - When we move our operation to the cloud we often assume that we’re outsourcing our data center operation completely, and that includes our Disaster-Recovery procedures. The truth is that when we move to the cloud we’re only outsourcing the infrastructure, not our operation, and the responsibility for using this infrastructure remains ours.
  • Complexity - The current DR processes and tools were designed for a pre-cloud world and don’t work well in a dynamic environment such as the cloud. Many of the tools that are provided by the cloud vendor (Amazon in this specific case) are still fairly complex to use.

I addressed the first cause, the perception that cloud is a silver bullet that lets people give up responsibility for resilience, in my previous post. The second cause, the lack of tools, is usually addressed by DevOps tools such as Chef, Puppet, CFEngine and Cloudify, which capture the setup and are able to bootstrap the application stack on different environments. In my example I used Cloudify to provide consistent installation between EC2 and RackSpace clouds.

Making sure your architecture incorporates a Disaster Recovery Plan is essential to ensure the business continuity, and avoid cases such as the ones seen over Amazon’s outages. Online services require the Hot Backup Site architecture, so the service can stay up even during the outage:

A hot site is a duplicate of the original site of the organization, with full computer systems as well as near-complete backups of user data. Real time synchronization between the two sites may be used to completely mirror the data environment of the original site using wide area network links and specialized software.

DR sites can be in an Active/Standby architecture (as in traditional DRPs), where the DR site starts serving only upon an outage, or they can be in an Active/Active architecture (as in more modern architectures). In his discussion on assuming responsibility, Nati states that a DR architecture should assume responsibility for the following aspects:

  • Workload migration - specifically the ability to clone our application environment in a consistent way across sites in an on demand fashion.
  • Data Synchronization - The ability to maintain real time copy of the data between the two sites.
  • Network connectivity - The ability to enable flow of network traffic between the two sites.

I’d like to experiment with an example DR architecture to address these aspects, as well as addressing Nati’s second challenge: complexity. In this part I will use an example of a simple web app and show how we can easily create two sites on demand. I would even go as far as setting this environment up on two separate clouds to show how we can ensure an even higher degree of redundancy by running our application across two different cloud providers.

A step-by-step example: Disaster Recovery from AWS to RackSpace

Let’s roll up our sleeves and start experimenting hands-on with DR architecture. As a reference application let’s take Spring’s PetClinic Sample Application and run it on an Apache Tomcat web container. The application will persist its data locally to a MySQL relational database. In my experiment I used the Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do. We tried the same example with a combination of HP Cloud Services and a flavor of a private cloud.

Data synchronization over WAN

How do we replicate data between the MySQL database instances over WAN? In this experiment we’ll use the following pattern:

  1. Monitor data-mutating SQL statements on the source site. Turn on the MySQL query log, and write a listener (“Feeder”) to intercept data-mutating SQL statements, then write them to the GigaSpaces In-Memory Data Grid.
  2. Replicate data-mutating SQL statements over WAN. I used GigaSpaces WAN Replication to replicate the SQL statements between the data grids of the primary and secondary sites in a real-time and transactional manner.
  3. Execute data-mutating SQL statements on the target site. Write a listener (“Processor”) to intercept incoming SQL statements on the data grid and execute them on the local MySQL DB.


To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.
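The Feeder/Processor pattern above can be sketched in a few lines. This is only an illustration of the idea, not the code from the experiment: an in-process queue stands in for the replicated data grid, SQLite stands in for MySQL, and every function name here is an assumption made for the example.

```python
import queue
import sqlite3

# Statement types that change data and therefore need to be replicated.
MUTATING_PREFIXES = ("INSERT", "UPDATE", "DELETE", "REPLACE")


def is_mutating(statement: str) -> bool:
    """Return True if the SQL statement modifies data."""
    return statement.lstrip().upper().startswith(MUTATING_PREFIXES)


def feeder(query_log, channel: queue.Queue) -> int:
    """Source site: forward data-mutating statements found in the query log."""
    forwarded = 0
    for statement in query_log:
        if is_mutating(statement):
            channel.put(statement)
            forwarded += 1
    return forwarded


def processor(channel: queue.Queue, target_db: sqlite3.Connection) -> int:
    """Target site: execute every replicated statement on the local database."""
    applied = 0
    while not channel.empty():
        target_db.execute(channel.get())
        applied += 1
    target_db.commit()
    return applied


def replicate(query_log, target_db: sqlite3.Connection) -> int:
    """Run one feeder/processor pass; the queue is a stand-in for the WAN link."""
    channel = queue.Queue()
    feeder(query_log, channel)
    return processor(channel, target_db)


if __name__ == "__main__":
    secondary = sqlite3.connect(":memory:")
    secondary.execute("CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT)")
    log = [
        "SELECT * FROM owners",
        "INSERT INTO owners (id, name) VALUES (1, 'George')",
        "UPDATE owners SET name = 'Betty' WHERE id = 1",
    ]
    print(replicate(log, secondary), "statements applied")
    print(secondary.execute("SELECT name FROM owners").fetchall())
```

Note that a bi-directional setup would also need some way to keep replayed statements from being fed back to the site they came from, which this sketch does not attempt.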

Workload migration

I would like to address the complexity challenge and show how to automate setting up the site on demand. This is also useful for Active/Standby architectures, where the DR site is activated only upon outage.

In order to set up a site for service, we need to perform the following flow:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when peak load is over – perform the reverse flow to tear down the secondary site

We would like to automate this bootstrap process to support the on-demand capabilities in the cloud that we know from traditional DR solutions. I used GigaSpaces’ open-source Cloudify product as the automation tool for setting up and for taking down the secondary site, utilizing the out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing in case of VM or process failure, and can later help in scaling the application (in case of clustered applications).
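The shape of such an automated flow, ordered setup steps with a reverse tear-down, can be sketched generically. This is not Cloudify recipe syntax; the step names and the `bring_up`/`tear_down` helpers are hypothetical and the steps themselves are left as placeholders.

```python
from typing import Callable, List, Tuple

# Each step is (name, setup action, undo action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def tear_down(done: List[Step], log: List[str]) -> None:
    """Undo completed steps in reverse order (the reverse flow)."""
    for name, _, undo in reversed(done):
        undo()
        log.append(f"down:{name}")


def bring_up(steps: List[Step], log: List[str]) -> List[Step]:
    """Run setup steps in order; if one fails, undo the completed ones and re-raise."""
    done: List[Step] = []
    try:
        for step in steps:
            name, setup, _ = step
            setup()
            log.append(f"up:{name}")
            done.append(step)
    except Exception:
        tear_down(done, log)
        raise
    return done


if __name__ == "__main__":
    def noop() -> None:
        """Placeholder for a real provisioning or installation call."""

    site_steps = [
        ("spin up VMs", noop, noop),
        ("install Tomcat", noop, noop),
        ("install PetClinic", noop, noop),
        ("register with load balancer", noop, noop),
    ]
    events: List[str] = []
    completed = bring_up(site_steps, events)
    tear_down(completed, events)
    print("\n".join(events))
```

Keeping the undo action next to each setup action means the tear-down order never has to be maintained separately.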

Network Connectivity

The network connectivity between the primary and secondary sites can be addressed in several ways, ranging from load-balancing between the sites, through setting up a VPN between the sites, and up to using designated products such as Cisco’s Connected Cloud Solution.

In this example I went for a simple LB solution using RackSpace’s Load Balancer Service to balance between the web instances, and automated the LB configuration using Cloudify to make the changes as seamless as possible.

Implementation Details

The application is actually a re-use of an application I wrote recently to experiment with Cloud Bursting architectures, seeing that Cloud Bursting follows the same architecture guidelines as DR (Active/Standby DR to be exact). The result of the experimentation is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.

Conclusion

Cloud-hosted applications should take care of the non-functional requirements of the system, including resilience and scalability, just as on-premise applications do. Systems that neglect to incorporate these considerations in their architecture, relying solely on the underlying cloud infrastructure, end up severely affected by cloud outages such as the one experienced a few days ago in AWS. In my previous post I listed some guidelines, an important one of which is Disaster Recovery, which I explored here, suggesting possible architectural approaches and an example implementation. I hope this discussion raises awareness in the cloud community and helps cloud-based architectures mature, so that on the next outage we will not see as many systems go down.

Follow Dotan on Twitter!


Building Cloud Applications the Easy Way Using Elastic Application Platforms

Patterns, Guidelines and Best Practices Revisited

In my previous post I analyzed Amazon’s recent AWS outage and the patterns and best practices that enabled some of the businesses hosted on Amazon’s affected availability zones to survive the outage.

The patterns and best practices I presented are essential to guarantee robust and scalable architectures in general and on the cloud in particular. Those who dismissed my latest post as exaggeration of an isolated incident got affirmation of my statement last week when Amazon found itself apologizing once again after its Cloud Drive service was overwhelmed by unpredictable peak demand for Lady Gaga’s newly-released album (99 cents, who wouldn’t buy it?!) and was rendered non-responsive. This failure to scale up/out to accommodate fluctuating demands raises the scalability concern in the public cloud, in addition to the resilience concern raised in the AWS outage.

Surprisingly, as obvious as the patterns I listed may seem, they are clearly not common practice, judging by the number of applications that went down when AWS did, and by how many other applications have similar issues on public cloud providers.

Why are such fundamental principles not prevalent in today’s architectures on the cloud?

One of the reasons these patterns are not prevalent in today’s cloud applications is that it requires an experienced and confident architect in the areas of distributed and scalable systems to design such architectures. The typical public cloud APIs also require developers to perform complex coding and utilize various non-standard APIs that are usually not common knowledge. Similar difficulties are found in testing, operating, monitoring and maintaining such systems. This makes it quite difficult to implement the above patterns to ensure the application’s resilience and scalability, and diverts valuable development time and resources from the application’s business logic that is the core value of the application.

How can we make the introduction of these patterns and best practices smoother and simpler? Can we get these patterns as a service to our application? We are all used to traditional application servers that provide our enterprise applications with underlying services such as database connection pooling, transaction management and security, and free us from worrying about these concerns so that we can focus on designing our business logic. Similarly, Elastic Application Platforms (EAP) allow your application to easily employ the patterns and best practices I enumerated in my previous post for high availability and elasticity, without requiring you to become an expert in the field, so that you can focus on your business logic.

So what is an Elastic Application Platform? Forrester defines an elastic application platform as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

Last month Forrester published a review under the title “Cloud Computing Brings Demand For Elastic Application Platforms”. The review is the result of comprehensive research, and spans 17 pages (a blog post introducing it can be found on the Forrester blog). It analyzes the difficulties companies encounter in implementing their applications on top of cloud infrastructure, and recognizes elastic application platforms as the emerging solution for a smooth path into the cloud. It then maps the potential providers of such solutions. For its research Forrester interviewed 17 vendor and user companies. Out of all the reviewed vendors, Forrester identified only 3 vendors that are “offering comprehensive EAPs today”: Microsoft, SalesForce.com and GigaSpaces.

As Forrester did an amazing job in their research reviewing and comparing solutions for EAP today, I’ll avoid repeating that. Instead, I’d like to review the GigaSpaces EAP solution in light of the patterns discussed in my previous post, and see how building your solution on top of GigaSpaces enables you to introduce these patterns easily and without having to divert your focus from your business logic.

Patterns, Guidelines and Best Practices Revisited

Design for failure

Well, that’s GigaSpaces’ bread and butter. Whereas thinking about failure diverts you from your core business, in our case it is our core business. GigaSpaces platform provides underlying services to enable high availability and elasticity, so that you don’t have to take care of that. So now that we’ve established that, let’s see how it’s done.

Stateless and autonomous services

The GigaSpaces architecture segregates your application into Processing Units. A Processing Unit (PU) is an autonomous unit of your application. It can be a pure business-logic (stateless) unit, or hold data in-memory, or provide a web application, and mix together these and other functions. You can define the required Service Level Agreement (SLA) for your Processing Unit, and the GigaSpaces platform will make sure to enforce it. When your Processing Unit SLA requires high-availability – the platform will deploy a (hot) backup instance (or multiple backups) of the Processing Unit to which the PU will fail over in case the primary instance fails. When your application needs to scale out – the platform will add another instance of the Processing Unit (maybe over a newly-provisioned virtual machine booted automatically by the platform). When your application needs to distribute data and/or data processing – the platform will shard the data evenly on several instances of the Processing Unit, so that each instance will handle a subset of the data independently of the other instances.

Redundant hot copies spread across zones

You can divide your deployment environment into virtual zones. These zones can represent different data centers, different cloud infrastructure vendors, or any physical or logical division you see fit. Then you can tell the platform (as part of the SLA) not to place both the primary and its backup instances of the Processing Unit in the same zone, thus making sure the data stored within the Processing Unit is backed up in two different zones. This will provide your application with resilience over two data centers, two cloud vendors or two regions, depending on your required resilience, all with a uniform development API. Want a higher level of resilience? Just define more zones and more backups for each PU.
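The placement rule described here, never putting a primary and its backups in the same zone, can be illustrated with a small sketch. This is not the GigaSpaces SLA syntax or its placement algorithm; `place_instances` and the zone names are made up for the example.

```python
from typing import Dict, List


def place_instances(zones: List[str], partitions: int, backups: int) -> Dict[int, List[str]]:
    """Assign each partition's primary and backups to distinct zones.

    Returns {partition: [primary_zone, backup_zone, ...]}.
    """
    copies = 1 + backups
    if copies > len(zones):
        raise ValueError("need at least as many zones as copies per partition")
    placement: Dict[int, List[str]] = {}
    for partition in range(partitions):
        # Rotate the starting zone so that primaries are spread across zones.
        placement[partition] = [
            zones[(partition + offset) % len(zones)] for offset in range(copies)
        ]
    return placement


if __name__ == "__main__":
    layout = place_instances(["aws-us-east", "aws-us-west", "rackspace"],
                             partitions=4, backups=1)
    for partition, assigned in layout.items():
        print(partition, "primary:", assigned[0], "backups:", assigned[1:])
```

Because each partition takes consecutive zones from the rotation and never more copies than there are zones, no two copies of the same partition can share a zone.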

Spread across several public cloud vendors and/or private cloud

GigaSpaces abstracts the details of the underlying infrastructure from your application. GigaSpaces’ Multi-Cloud Adaptor technology provides built-in integration with several major cloud providers, including the JClouds open source abstraction layer, thus supporting any cloud vendor that conforms to the JClouds standard. So all you need to do is plug in your desired cloud providers into the platform, and your application logic remains agnostic to the cloud infrastructure details. Plugging in two vendors to ensure resilience now becomes just a matter of configuration. The integration with JClouds is an open-source project under OpenSpaces.org, so feel free to review and even pitch in to extend and enhance integration with cloud vendors.

Automation and Monitoring

GigaSpaces offers a powerful set of tools that allow you to automate your system. First, it offers the Elastic Processing Unit, which can automatically monitor CPU and memory utilization and employ corrective actions based on your defined SLA. GigaSpaces also offers a rich Administration and Monitoring API that enables administration and monitoring of all the GigaSpaces services and components, as well as the layers running beneath the platform, such as the transport layer, machine and operating system. GigaSpaces also offers a web-based dashboard and a management center client. Another powerful tool for monitoring and automation is the administrative alerts that can be configured and then viewed through GigaSpaces or external tools (e.g. via SNMP traps).
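The general idea of monitoring utilization and taking corrective action against an SLA can be sketched as a simple decision function. The `Sla` fields, the thresholds and `desired_instances` are assumptions for illustration only and do not reflect the Elastic Processing Unit API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sla:
    max_cpu: float        # scale out above this average CPU utilization (0..1)
    min_cpu: float        # scale in below this average CPU utilization (0..1)
    min_instances: int    # never go below this many instances
    max_instances: int    # never go above this many instances


def desired_instances(current: int, cpu_samples: Sequence[float], sla: Sla) -> int:
    """Return the instance count the SLA calls for, given recent CPU samples."""
    average = sum(cpu_samples) / len(cpu_samples)
    if average > sla.max_cpu:
        target = current + 1
    elif average < sla.min_cpu:
        target = current - 1
    else:
        target = current
    # Clamp to the bounds the SLA allows.
    return max(sla.min_instances, min(sla.max_instances, target))


if __name__ == "__main__":
    sla = Sla(max_cpu=0.8, min_cpu=0.2, min_instances=2, max_instances=4)
    print(desired_instances(2, [0.90, 0.95, 0.85], sla))
    print(desired_instances(3, [0.10, 0.15], sla))
```

A monitoring loop would call such a function periodically and hand the result to whatever provisions or releases machines.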

Avoiding ACID services and leveraging on NoSQL solutions

GigaSpaces does not rule out SQL for querying your data. We believe that true NoSQL stands for “Not Only SQL”, and that SQL as a language is good for certain uses, whereas other uses require other query semantics. GigaSpaces supports some of the SQL language through its SQLQuery API or through standard JDBC. However, GigaSpaces also provides a rich set of alternative standards and protocols for accessing your data, such as the Map API for key/value access, the Document API for dynamic schemas, object-oriented access (through the proprietary Space API or standard JPA), and the Memcached protocol.

Another challenge of traditional relational databases is scaling data storage in read-write environments. Distributed relational databases were sufficient for read-mostly environments. But Web 2.0 brought social concepts into the web, with customers feeding data into the websites. Several NoSQL solutions try to address distributed data storage and querying. GigaSpaces provides this via its support for a clustered topology of the in-memory data grid (the “space”) and for distributing queries and execution using patterns such as Map/Reduce and event-driven design.
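As a reminder of what the Map/Reduce pattern looks like over partitioned data, here is a minimal in-process sketch. It assumes each partition is just a list and runs the map phase sequentially; in a real data grid each mapper would execute next to its partition. The `map_reduce` helper is hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, List, TypeVar

Item = TypeVar("Item")
Partial = TypeVar("Partial")


def map_reduce(partitions: Iterable[List[Item]],
               mapper: Callable[[List[Item]], Partial],
               reducer: Callable[[Partial, Partial], Partial],
               initial: Partial) -> Partial:
    """Apply the mapper to each partition, then fold the partial results."""
    partials = [mapper(partition) for partition in partitions]  # map phase
    return reduce(reducer, partials, initial)                   # reduce phase


if __name__ == "__main__":
    visits_per_partition = [[1, 2], [3], [4, 5, 6]]
    total = map_reduce(visits_per_partition, sum, lambda a, b: a + b, 0)
    print(total)
```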

Load Balancing

The elastic nature of the GigaSpaces platform allows it to automatically detect the CPU and memory capacity of the deployment environment and optimize the load dynamically based on your defined SLA, instead of employing an arbitrary division of the data into fixed zones. Such a dynamic nature also allows your system to adjust in case of a failure of an entire zone (such as what happened with Amazon’s availability zones) so that your system doesn’t go down even in such extreme cases, and maintains optimal balance under the new conditions.

Furthermore, the GigaSpaces platform supports content-based routing, which allows for smart load balancing based on your business model and logic. Content-based routing allows your application to route related data to the same host and then execute entire business flows within the same JVM, thus avoiding network hops and complex distributed transaction locking that hinder your application’s scalability.
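Content-based routing boils down to deriving the target partition from a field of the data itself, so that related records always land together. The sketch below shows the idea with a stable hash; the `owner_id` field and both function names are assumptions for the example, not the GigaSpaces routing API.

```python
import zlib
from typing import Dict, Iterable, List


def partition_for(routing_key: str, partitions: int) -> int:
    """Map a routing key to a partition using a stable hash."""
    # zlib.crc32 gives the same value on every run, unlike the built-in
    # hash() for strings, which is randomized per process by default.
    return zlib.crc32(routing_key.encode("utf-8")) % partitions


def route(records: Iterable[dict], key_field: str, partitions: int) -> Dict[int, List[dict]]:
    """Group records by the partition their routing field maps to."""
    buckets: Dict[int, List[dict]] = {p: [] for p in range(partitions)}
    for record in records:
        buckets[partition_for(str(record[key_field]), partitions)].append(record)
    return buckets


if __name__ == "__main__":
    visits = [
        {"owner_id": 7, "pet": "cat"},
        {"owner_id": 9, "pet": "bird"},
        {"owner_id": 7, "pet": "dog"},
    ]
    for partition, members in route(visits, "owner_id", partitions=4).items():
        print(partition, members)
```

All records for the same owner end up in one partition, so a business flow touching that owner's data can run locally without cross-partition coordination.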

Conclusion

Most significant advancements do not happen in slow gradual steps but rather in leaps. These leaps happen when the predominant conception crashes in face of the new reality, leaving behind chaos and uncertainty, and out of the chaos then emerges the next stage in the maturity of the system.

This is the case with the maturity process of the young cloud realm as well: the AWS outage was a major reality check that opened the eyes of the IT world to the fact that their systems crashed with AWS because they counted on their cloud infrastructure provider to handle their applications’ high-availability and elasticity using its generic logic. This concept proved to be wrong. Now the IT world is in confusion, and many discussions are taking place on whether the faith in cloud was mistaken, with titles like “EC2 Failure Feeds Worries About Cloud Services”.

The next step in the cloud’s maturity was the realization that cloud infrastructure is just infrastructure, and that you need to implement your application correctly, using patterns and best practices such as the ones I raised in my previous post, to leverage the cloud infrastructure to gain high-availability and elasticity.

The next step in the evolution is to start leveraging on designated application platforms that will handle these concerns for you and virtualize the cloud from your application, so that you can simply define the SLA for your application for high-availability and elasticity, and leave it up to the platform to manipulate the cloud infrastructure to enforce your SLA, while you concentrate on writing your application’s business logic. As Forrester said:

… A new generation of application platforms for elastic applications is arriving to help remove this barrier to realizing cloud’s benefits. Elastic application platforms will reduce the skill required to design, deliver, and manage elastic applications, making automatic scaling of cloud available to all shops …

 


Retrospect on recent AWS Outage and Resilient Cloud-Based Architecture

According to the television series “Terminator: the Sarah Connor Chronicles”, the Skynet computer system began its attack against humanity on April 21, 2011. Luckily that hasn’t happened (or has it?) but on that very day another predominant computing system provided us with a painful reminder of how much humanity relies on computers to run the world.

A couple of weeks ago, on April 21, the IT world experienced a tsunami: the Amazon Web Services (AWS) cloud went down in the US East Region for over 3 days (!!), and took down with it numerous systems and services that rely on AWS such as HootSuite, Reddit, Foursquare, Quora and many more, with damage estimated at $2M. The affected services were EC2 and RDS, and Amazon provided a detailed technical summary of the event.

This tsunami was a wake-up call. This wasn’t the first outage in the cloud arena, and not even the first of Amazon (in fact this is their second outage this year). But its impact was so vast that it finally brought the realization that cloud is not a silver bullet. Those who counted on the generic cloud’s provision of scalability and resilience did not survive the AWS outage. The resilience of your application (as well as scalability) does not exist unless you take care of it.
So how do we achieve this resilience in our application?

Why not learn from the experience of those who survived the outage?
As a cloud evangelist, I was intrigued by the history of the outage as it occurred. There were great posts during and after the outage from those who went down. But more interesting for me as an architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

In this post I’d like to summarize the patterns, principles and best practices that emerge from these posts, as I believe we can learn a lot from them on how to design our business applications to truly leverage the benefits that the cloud offers in resilience and scalability.

Patterns, Guidelines and Best Practices

Design for failure

The first and fundamental principle in building robust architecture is to design for failure. As SmugMug states:

… we designed for failure from day one. Any of our instances, or any group of instances in an AZ [Availability Zone], can be “shot in the head” and our system will recover …

This principle should be prevalent during design, development, deployment and maintenance stages of the system. SimpleGeo presents an excellent work practice:

… At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we’re coding, and we joke about it at lunch …

Some companies even embedded random failure simulation in their work procedures, such as SmugMug:

… once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you …

Netflix even automated random failure simulation using a designated Chaos Monkey service to get their engineering team used to a “constant level of failure in the cloud”.
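To make the idea of automated random failure concrete, here is a toy sketch. It is not Netflix's Chaos Monkey and makes no claim about how that service works; the `unleash_monkey` function, the `terminate` callback and the instance names are hypothetical.

```python
import random
from typing import Callable, List, Optional


def unleash_monkey(instances: List[str],
                   terminate: Callable[[str], None],
                   probability: float,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Terminate each instance with the given probability; return the victims."""
    rng = rng or random.Random()
    victims = [name for name in instances if rng.random() < probability]
    for victim in victims:
        terminate(victim)
    return victims


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3", "worker-1"]
    killed = unleash_monkey(fleet,
                            terminate=lambda name: print("terminating", name),
                            probability=0.25)
    print("victims:", killed)
```

In a real environment the `terminate` callback would call the cloud provider's API, and the run would be scheduled during working hours so that engineers are around to observe how the system recovers.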

Stateless and autonomous services

If possible, divide your business logic into stateless services, to allow easy fail-over and scalability. Netflix explained the fail-over benefits:

… if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it …

Twilio aggregates their stateless services into homogeneous pools, which provides them both fail-over and elasticity:

… The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load …

To contain the ripple effect of the failure, make the services well-encapsulated, as SmugMug states:

… Make your system divided into well-encapsulated components that can fail individually without failing the entire system …

Redundant hot copies spread across zones

By replicating your data to other zones, you insulate your service from zone-wide failure, as SmugMug, Twilio, Netflix and others describe. As Netflix explains:

… we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby …

Twilio also emphasizes the configuration of timeout and retry to avoid delays in failing over to another copy:

… By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services …
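The timeout-and-retry approach Twilio describes can be sketched as a loop over redundant replicas. This is an illustration only; the `call` signature, the replica names and the `AllReplicasFailed` exception are assumptions for the example, and a real client would pass the timeout to its HTTP or RPC library.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica answered successfully."""


def call_with_failover(replicas: Sequence[str],
                       call: Callable[[str, float], str],
                       timeout_seconds: float = 0.5) -> str:
    """Try each redundant replica in turn, using a short timeout for each."""
    errors = []
    for replica in replicas:
        try:
            return call(replica, timeout_seconds)
        except (TimeoutError, ConnectionError) as exc:
            # Give up on this replica quickly and route around it.
            errors.append(f"{replica}: {exc!r}")
    raise AllReplicasFailed("; ".join(errors))


if __name__ == "__main__":
    def fake_call(replica: str, timeout: float) -> str:
        """Stand-in for a network call; the first zone is simulated as down."""
        if replica == "zone-a":
            raise TimeoutError(f"no answer within {timeout}s")
        return f"ok from {replica}"

    print(call_with_failover(["zone-a", "zone-b", "zone-c"], fake_call))
```

The short timeout matters as much as the retry: a long default timeout would make the caller wait on the failed zone before it ever tries a healthy one.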

Spread across several public cloud vendors and/or private cloud

Most IT organizations avoid depending on a single ISP by having another ISP as backup. Even Amazon is using this strategy internally to ensure high-availability of their cloud by using a primary and a backup network. Similarly, you would like to avoid dependency on a single cloud vendor by having another vendor as backup. This holds true even if the vendor provides a certain level of resilience, as we saw with Amazon’s multi-AZ failure in the recent outage. Many of the companies that survived the recent AWS outage owe it to using their own datacenters, to using other vendors, or to using the US West Region of AWS. SmugMug for instance kept their critical data in their own datacenters:

… the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance …

and also recommends that you “spread across many providers”, while admitting that

… This is getting more and more difficult as AWS continues to put distance between themselves and their competitors …

When considering using different regions of AWS for resilience, it’s interesting to note Amazon’s statement about the effort required on your application’s side to work with multiple regions, which makes you wonder if it’s that much easier than working with a different vendor altogether:

… if you want to move data between Regions, you need to do it via your applications as we don’t replicate any data between Regions on our users’ behalf. You also need to use a separate set of APIs to manage each Region. Regions provide users with a powerful availability building block, but it requires effort on the part of application builders to take advantage of this isolation …

Automation and Monitoring

Automation is the key. Your application needs to pick up alerts on system events automatically, and to react to those alerts automatically as well. As SimpleGeo’s architect states:

… Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated …
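As a rough sketch of what "pick up alerts and react" can look like in code, the example below routes each alert type to a registered handler. Everything in it is an assumption made for illustration: the `Automation` class, the alert dictionaries, the `instance_failed` and `disk_full` alert names, and the in-memory `cluster` set that stands in for real provider API calls.

```python
class Automation:
    """Routes incoming alerts to handlers so that no human has to react."""

    def __init__(self):
        self.handlers = {}
        self.unhandled = []

    def on(self, alert_type):
        """Decorator that registers a handler for one alert type."""
        def register(handler):
            self.handlers[alert_type] = handler
            return handler
        return register

    def dispatch(self, alert):
        handler = self.handlers.get(alert["type"])
        if handler is None:
            self.unhandled.append(alert)  # not automated yet: surface to humans
            return None
        return handler(alert)


automation = Automation()
cluster = {"web-1", "web-2", "web-3"}  # hypothetical in-memory view of the cluster


@automation.on("instance_failed")
def replace_instance(alert):
    cluster.discard(alert["instance"])
    replacement = alert["instance"] + "-replacement"
    cluster.add(replacement)  # a real handler would call the provider's API here
    return replacement


new_node = automation.dispatch({"type": "instance_failed", "instance": "web-2"})
automation.dispatch({"type": "disk_full", "instance": "web-1"})
```

The `unhandled` list is the useful part of the design: every alert that lands there is a response someone still performs by hand, and therefore a candidate for the next handler to write.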

It is interesting to see that even Netflix, which took pride in surviving the failure, admitted that the manual responses its engineers used this time will not work in the future, as it grows to a

… worldwide service with dozens of availability zones, even with top engineers we simply won’t be able to scale our responses manually …

Detailed alerting mechanisms are also essential to the manual control of the system, as Bizo states:

… we have our own internal alarms and dashboards that give us up to the minute metrics such as request rate, cpu utilization etc. …

Avoiding ACID services and leveraging NoSQL solutions

The CTO of SimpleGeo recommends against relying on ACID services, as this inhibits the distributed nature of the cloud. To achieve that, Twilio recommends that you “relax consistency requirements”. Netflix implemented that by

… leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees …

Load Balancing

Balance load dynamically across instances, regardless of zone. Balancing equally by zone, as Amazon’s Elastic Load Balancer (ELB) does, can bring down the system when a zone fails or loses capacity, since the degraded zone keeps receiving an equal share of the traffic.

… Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention …
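A little arithmetic shows the difference between the two policies. The sketch below assumes, as described above, that a zone-based balancer splits traffic equally between zones that still have instances; the function names, zone names, instance counts and the 1000 requests per second figure are all made up for the example.

```python
def zone_balanced_load(instances_by_zone, total_rps):
    """Requests/sec per instance when traffic is split equally between zones."""
    live_zones = [zone for zone, count in instances_by_zone.items() if count > 0]
    per_zone = total_rps / len(live_zones)
    return {zone: per_zone / instances_by_zone[zone] for zone in live_zones}


def instance_balanced_load(instances_by_zone, total_rps):
    """Requests/sec per instance when traffic is split equally between instances."""
    total_instances = sum(instances_by_zone.values())
    return {
        zone: total_rps / total_instances
        for zone, count in instances_by_zone.items()
        if count > 0
    }


# zone-a has lost three of its four instances; zone-b is intact.
capacity = {"zone-a": 1, "zone-b": 4}

zone_balanced = zone_balanced_load(capacity, total_rps=1000)
instance_balanced = instance_balanced_load(capacity, total_rps=1000)
```

With zone-based balancing the lone survivor in zone-a has to absorb 500 requests per second while each zone-b instance handles 125. Balancing by instance gives every instance 200, which is the "uneven zone capacity with no intervention" behavior the quote describes.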

Conclusion

The recent AWS outage serves as an important lesson to the IT world, and an important milestone in our maturity in using the cloud. The most important thing to do now is to learn from the mistakes made by those who went down with AWS, as well as from the success of those who survived it, and to come up with a proper methodology, patterns, guidelines and best practices for doing it right, so that Skynet will not take down humanity.

Update: added links to my follow-up posts with some more thoughts I had following subsequent AWS outages, around Disaster Recovery Policies and Elastic Application Platforms.

Resources

· Netflix, http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

· Joe Stump, SimpleGeo, http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

· SimpleGeo, http://developers.simplegeo.com/blog/2011/04/26/how-simplegeo-stayed-up/

· Ted Theodoropoulos, http://blog.acrowire.com/cloud-computing/failing-to-plan-is-planning-to-fail

· SmugMug, http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse

· Twilio, http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

· Bizo, http://dev.bizo.com/2011/04/how-bizo-survived-great-aws-outage-of.html

· Heroku, http://status.heroku.com/incident/151

· http://groups.google.com/group/cloud-computing/browse_thread/thread/e8079a54e6a8c4b9/72756bf9e587869d?show_docid=72756bf9e587869d

· Amazon, http://aws.amazon.com/message/65648/

· Amazon, http://aws.amazon.com/architecture/

Follow Dotan on Twitter!
