Category Archives: Solution Architecture

AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as during past ones. Many recommend basing your architecture on multiple AWS Availability Zones (AZs) to spread the risk. But during this outage we saw even multi-Availability-Zone applications severely affected. Amazon itself acknowledged during the outage that

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is underlying infrastructure that escalates the traffic from the affected AZ to the other AZs in a way that overwhelms the system. In the case of this outage it was the AWS API platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn't really matter for us as users which exact piece of infrastructure failed in this specific outage. 18 months ago, during the first major outage, the culprit was another infrastructure component, the Elastic Block Store ("EBS") volumes, which cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations (and don't forget, Amazon only commits to a 99.995% SLA) causing such major outages. And there will always be some shared infrastructure that, under such extreme and rare conditions, carries the ripple effect of the outage to the other Availability Zones in the region.

Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling, but as I stated in my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: "it requires effort on the part of application builders to take advantage of this isolation".

The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, which provides the best isolation level. The advantages in terms of resilience are clear. But how can that be implemented, given that the vendors differ so much in their characteristics and APIs?

There are two approaches to deploying across multiple cloud vendors while staying cloud-vendor-agnostic:

  1. Open standards and APIs that are supported by multiple cloud vendors. That way you write your application against a common standard and get immediate support from all conforming cloud vendors. Examples of such emerging standards and libraries are OpenStack and jclouds (see the short jclouds sketch after this list). However, the Cloud is still a young domain with many competing standards and APIs, and it is yet to be determined which one will become the de-facto industry standard and where to "place our bet".
  2. Open PaaS platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate with the underlying cloud vendors (whether public clouds, private clouds, or even a hybrid). Examples of such platforms are CloudFoundry and Cloudify. I dedicated one of my posts to exploring how to build your application using such platforms.
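
To make the first approach more concrete, here is a minimal sketch (not the post's original code) of provisioning nodes through jclouds' provider-agnostic ComputeService API. The provider ID, credentials, group name and template values are illustrative assumptions; the point is that switching from "aws-ec2" to another supported provider ID leaves the application code untouched.

    import java.util.Set;

    import org.jclouds.ContextBuilder;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.ComputeServiceContext;
    import org.jclouds.compute.RunNodesException;
    import org.jclouds.compute.domain.NodeMetadata;
    import org.jclouds.compute.domain.OsFamily;
    import org.jclouds.compute.domain.Template;

    public class MultiCloudProvisioner {

        // The provider ID is the only cloud-specific piece: "aws-ec2", Rackspace's provider ID, etc.
        public static Set<? extends NodeMetadata> provision(String providerId, String identity,
                                                            String credential) throws RunNodesException {
            ComputeServiceContext context = ContextBuilder.newBuilder(providerId)
                    .credentials(identity, credential)
                    .buildView(ComputeServiceContext.class);
            try {
                ComputeService compute = context.getComputeService();
                // Describe what we need; each provider maps it to its own images and instance types.
                Template template = compute.templateBuilder()
                        .osFamily(OsFamily.UBUNTU)
                        .minRam(2048)
                        .build();
                // Start two web nodes in a named group.
                return compute.createNodesInGroup("petclinic-web", 2, template);
            } finally {
                context.close();
            }
        }
    }

The application code describes its needs in portable terms (OS family, minimum RAM, node count) and never references a vendor-specific API directly.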

Conclusion

System architects need to face the reality of the Service Level Agreements provided by Amazon and other cloud vendors, along with their limitations, and start designing for resilience: spreading across isolated environments, deploying DR sites, and taking similar redundancy measures to keep their service up and running and their data safe. Only that way can we make sure we are not the next ones to fall outside the 99.995% SLA.

This post was originally posted here.

Filed under cloud deployment, Disaster-Recovery, IaaS, PaaS, Solution Architecture, Uncategorized

AWS Outage – Thoughts on Disaster Recovery Policies

A couple of days ago it happened again. On June 14, around 9 pm PDT, Amazon AWS was hit by a power outage in its Northern Virginia data center, affecting EC2, RDS, Elastic Beanstalk and other services in the US-EAST-1 region. The AWS status page reported:

Some Cache Clusters in a single AZ in the US-EAST-1 region are currently unavailable. We are also experiencing increased error rates and latencies for the ElastiCache APIs in the US-EAST-1 Region. We are investigating the issue.

This outage affected major sites such as Quora, Foursquare, Pinterest, Heroku and Dropbox. I followed the outage reports, the tweets, the blog posts, and it all sounded too familiar. A year ago AWS suffered a mega-outage that lasted over 3 days, when another data center (in Virginia, no less!) went down and took major sites down with it (Quora, Foursquare… ring a bell?).

Back during last year’s outage I analyzed the reports of the sites that managed to survive the outage, and compiled a list of field-proven guidelines and best practices to apply in your architecture to make it resilient when deployed on AWS and other IaaS providers. I find these guidelines and best practices highly useful in my architectures. I then followed up with another blog post suggesting using designated software platforms to apply some of the guidelines and best practices.

In this blog post I'd like to address one specific guideline in greater depth – architecting for Disaster Recovery.

Disaster Recovery – Characteristics and Challenges

PC Magazine defines Disaster Recovery (DR) as:

A plan for duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.

DR planning has been a common practice since the days of the mainframe. An interesting question is why this practice is not as widespread in cloud-based architectures. In his recent post "Lessons from the Heroku/Amazon Outage", Nati Shalom, GigaSpaces CTO, analyzes this behavior and suggests two possible causes:

  • We give up responsibility when we move to the cloud - When we move our operation to the cloud we often assume that we're outsourcing our data center operation completely, including our Disaster Recovery procedures. The truth is that when we move to the cloud we're only outsourcing the infrastructure, not our operation, and the responsibility for using this infrastructure remains ours.
  • Complexity - The current DR processes and tools were designed for a pre-cloud world and don't work well in a dynamic environment such as the cloud. Many of the tools provided by the cloud vendor (Amazon in this specific case) are still fairly complex to use.

I addressed the first cause, the perception that cloud is a silver bullet that lets people give up responsibility for resilience, in my previous post. The second cause, the lack of tools, is usually addressed by DevOps tools such as Chef, Puppet, CFEngine and Cloudify, which capture the setup and are able to bootstrap the application stack on different environments. In my example I used Cloudify to provide a consistent installation across the EC2 and RackSpace clouds.

Making sure your architecture incorporates a Disaster Recovery plan is essential to ensure business continuity and avoid cases such as those seen during Amazon's outages. Online services require a Hot Backup Site architecture, so the service can stay up even during an outage:

A hot site is a duplicate of the original site of the organization, with full computer systems as well as near-complete backups of user data. Real time synchronization between the two sites may be used to completely mirror the data environment of the original site using wide area network links and specialized software.

DR sites can use an Active/Standby architecture (as was common in traditional DR plans), where the DR site starts serving only upon an outage, or an Active/Active architecture (the more modern approach). In his discussion on assuming responsibility, Nati states that a DR architecture should take responsibility for the following aspects:

  • Workload migration - specifically the ability to clone our application environment in a consistent way across sites in an on-demand fashion.
  • Data synchronization - the ability to maintain a real-time copy of the data between the two sites.
  • Network connectivity - the ability to enable the flow of network traffic between the two sites.

I'd like to experiment with an example DR architecture that addresses these aspects, as well as Nati's second challenge - complexity. In this part I will use a simple web app as an example and show how we can easily create two sites on demand. I will even go as far as setting up this environment on two separate clouds, to show how we can achieve an even higher degree of redundancy by running our application across two different cloud providers.

A step-by-step example: Disaster Recovery from AWS to RackSpace

Let's roll up our sleeves and start experimenting hands-on with the DR architecture. As the reference application let's take Spring's PetClinic Sample Application and run it on an Apache Tomcat web container. The application will persist its data locally to a MySQL relational database. In my experiment I used the Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do. We tried the same example with a combination of HP Cloud Services and a flavor of private cloud.

Data synchronization over WAN

How do we replicate data between the MySQL database instances over the WAN? In this experiment we'll use the following pattern:

  1. Monitor data-mutating SQL statements on the source site. Turn on the MySQL query log, and write a listener ("Feeder") to intercept data-mutating SQL statements, then write them to the GigaSpaces In-Memory Data Grid.
  2. Replicate data-mutating SQL statements over the WAN. I used GigaSpaces WAN Replication to replicate the SQL statements between the data grids of the primary and secondary sites in a real-time and transactional manner.
  3. Execute data-mutating SQL statements on the target site. Write a listener ("Processor") to intercept incoming SQL statements on the data grid and execute them on the local MySQL DB.


To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.
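
To give a feel for the "Processor" side of step 3, here is a minimal sketch, using plain JDBC, of a class that re-executes replicated statements against the local MySQL database. The class name and the onReplicatedStatement() callback are hypothetical; in the actual experiment this logic is invoked by a listener on the local data grid, which is omitted here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    /**
     * Applies data-mutating SQL statements, received from the remote site via the
     * replicated data grid, to the local MySQL database. Sketch only: the grid
     * listener that invokes onReplicatedStatement() is not shown.
     */
    public class SqlReplicationProcessor {

        private final String jdbcUrl;   // e.g. "jdbc:mysql://localhost:3306/petclinic"
        private final String user;
        private final String password;

        public SqlReplicationProcessor(String jdbcUrl, String user, String password) {
            this.jdbcUrl = jdbcUrl;
            this.user = user;
            this.password = password;
        }

        /** Invoked for every replicated INSERT/UPDATE/DELETE arriving from the other site. */
        public void onReplicatedStatement(String sql) {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(sql);       // re-execute the mutation locally
            } catch (SQLException e) {
                // in a real deployment this would go to a retry/dead-letter mechanism
                throw new IllegalStateException("Failed to apply replicated statement: " + sql, e);
            }
        }
    }

The same class, deployed on both sites, also covers the bi-directional case described above.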

Workload migration

I would like to address the complexity challenge and show how to automate setting up the site on demand. This is also useful for Active/Standby architectures, where the DR site is activated only upon outage.

In order to set up a site for service, we need to perform the following flow:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when the secondary site is no longer needed (e.g. the primary site has recovered) – perform the reverse flow to tear it down

We would like to automate this bootstrap process to get the on-demand capabilities in the cloud that we know from traditional DR solutions. I used the GigaSpaces Cloudify open-source product as the automation tool for setting up and tearing down the secondary site, utilizing its out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing in case of VM or process failure, and can later help in scaling the application (in the case of clustered applications).

Network Connectivity

The network connectivity between the primary and secondary sites can be addressed in several ways, ranging from load balancing between the sites, through setting up a VPN between them, up to using dedicated products such as Cisco's Connected Cloud Solution.

In this example I went for a simple load-balancing solution, using RackSpace's Load Balancer Service to balance between the web instances, and automated the LB configuration using Cloudify to make the changes as seamless as possible.

Implementation Details

The application actually reuses an application I wrote recently to experiment with Cloud Bursting architectures, since Cloud Bursting follows the same architectural guidelines as DR (Active/Standby DR, to be exact). The result of the experimentation is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.

Conclusion

Cloud-hosted applications should take care of the non-functional requirements of the system, including resilience and scalability, just as on-premise applications do. Systems that neglect to incorporate these considerations in their architecture, relying solely on the underlying cloud infrastructure, end up severely affected by cloud outages such as the one experienced a few days ago in AWS. In my previous post I listed some guidelines, an important one of which is Disaster Recovery, which I explored here, suggesting possible architectural approaches and an example implementation. I hope this discussion raises awareness in the cloud community and helps cloud-based architectures mature, so that during the next outage we will not see as many systems go down.

Follow Dotan on Twitter!

Filed under Cloud, DevOps, Disaster-Recovery, IaaS, Solution Architecture, Uncategorized

Bursting into the Clouds – Experimenting with Cloud Bursting

Who needs Cloud Bursting?

We see many organizations examining Cloud as a replacement for their existing in-house IT. But we see interest in cloud even among organizations that have no plans to replace their traditional data center. One prominent use case is Cloud Bursting:

Cloud bursting is an application deployment model in which an application runs in a private cloud or data center and bursts into a public cloud when the demand for computing capacity spikes. The advantage of such a hybrid cloud deployment is that an organization only pays for extra compute resources when they are needed.
[Definition from SearchCloudComputing]

Cloud Bursting appears to be a prominent use case in cloud on-boarding projects. In a recent post, Nati Shalom nicely summarizes the economic rationale for cloud bursting and discusses theoretical approaches to its architecture. In this post I'd like to examine the architectural challenges more closely and explore possible designs for Cloud Bursting.

Examining Cloud Bursting Architecture

Overflowing compute to the cloud is addressed by workload migration: when we need more compute power we just spin up more VMs in the cloud (the secondary site) and install instances of the application. The challenge in workload migration is how to build an environment in the secondary site that is consistent with the primary site, so the system can overflow transparently. This is usually addressed by DevOps tools such as Chef, Puppet, CFEngine and Cloudify, which capture the setup and are able to bootstrap the application stack on different environments. In my example I used Cloudify to provide a consistent installation across the EC2 and RackSpace clouds.

The Cloud Bursting problem becomes more interesting where data is concerned. In his post Nati mentions two approaches for handling data during cloud bursting:

  1. The primary site approach - use the private cloud as the primary data site, and point all the burst activity to that site.
  2. The federated site approach - this approach is similar to the way Content Distribution Networks (CDN) work today: we maintain a replica of the data at each site and keep the replicas in sync.

The primary site approach incurs a heavy latency penalty, as each computation needs to make the round trip to the primary site to get its data. Such an architecture is not suitable for online flows.

The federated site approach uses data synchronization to bring the data to the compute, which avoids that latency and enables online flows. But if we want to support "hot" bursting to the cloud, we have to replicate the data between the sites in an ongoing, streaming fashion, so that the data is already available in the cloud when the peak occurs and we can spin up compute instances and immediately start redirecting load. Let's see how it's done.

Cloud Bursting – Examining the Federated Site Approach

Let's roll up our sleeves and start experimenting hands-on with the federated site approach to Cloud Bursting architecture. As the reference application let's take Spring's PetClinic Sample Application and run it on an Apache Tomcat web container. The application will persist its data locally to a MySQL relational database.

The primary site, representing our private data center, will run the above stack and serve the PetClinic online service. The secondary site, representing the public cloud, will only have a MySQL database, and we will replicate data between the primary and secondary sites to keep data synchronized. As soon as the load on the primary site increases beyond a certain threshold, we will spin up a machine with an instance of Tomcat and the PetClinic application, and update the load balancer to offload some of the traffic to the secondary site.

In my experiment I used the Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do.

Replicating RDBMS data over WAN

How do we replicate data between the MySQL database instances over the WAN? In this experiment we'll use the following pattern:

  1. Monitor data-mutating SQL statements on the source site. Turn on the MySQL query log, and write a listener ("Feeder") to intercept data-mutating SQL statements, then write them to the GigaSpaces In-Memory Data Grid.
  2. Replicate data-mutating SQL statements over the WAN. I used GigaSpaces WAN Replication to replicate the SQL statements between the data grids of the primary and secondary sites in a real-time and transactional manner.
  3. Execute data-mutating SQL statements on the target site. Write a listener ("Processor") to intercept incoming SQL statements on the data grid and execute them on the local MySQL DB.

To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.
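
For the "Feeder" side of step 1, the sketch below tails the MySQL general query log and filters out the data-mutating statements. It is an illustrative approximation, not the project's actual code: the log-line parsing is simplified, and the handleMutation() hook stands in for writing the statement to the local data grid for WAN replication.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class QueryLogFeeder {

        /** Tails the MySQL general query log and forwards data-mutating statements for replication. */
        public void follow(String queryLogPath) throws IOException, InterruptedException {
            BufferedReader reader = new BufferedReader(new FileReader(queryLogPath));
            try {
                while (true) {
                    String line = reader.readLine();
                    if (line == null) {            // no new content yet - wait and poll again
                        Thread.sleep(200);
                        continue;
                    }
                    // General-log lines look roughly like: "<timestamp> <thread-id> Query <statement>";
                    // the parsing here is intentionally simplified.
                    int queryIdx = line.indexOf("Query");
                    if (queryIdx < 0) {
                        continue;
                    }
                    String sql = line.substring(queryIdx + "Query".length()).trim();
                    String upper = sql.toUpperCase();
                    if (upper.startsWith("INSERT") || upper.startsWith("UPDATE") || upper.startsWith("DELETE")) {
                        handleMutation(sql);       // only mutations are interesting for replication
                    }
                }
            } finally {
                reader.close();
            }
        }

        /** In the experiment this would write the statement to the local in-memory data grid. */
        protected void handleMutation(String sql) {
            System.out.println("replicating: " + sql);
        }
    }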

Auto-bootstrap secondary site

When peak load occurs, we need to react immediately, and perform a series of operations to activate the secondary site:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when peak load is over – perform the reverse flow to tear down the secondary site

We need to automate this bootstrap process to support a real-time response to peak-load events. How do we do that? I used the GigaSpaces Cloudify open-source product as the automation tool for setting up and tearing down the secondary site, utilizing its out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing in case of VM or process failure, and can later help in scaling the application (in the case of clustered applications).

Implementation Details

The result of the above experimentation is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.

Conclusion

Cloud Bursting is a common use case for cloud on-boarding, and it requires good architecture patterns. In this post I tried to suggest some patterns and experiment with a simple demo, sharing it with the community to get feedback and spark discussion of these cloud architectures.

Follow Dotan on Twitter!

Filed under Cloud, DevOps, Solution Architecture

Analytics for Big Data – Venturing with the Twitter Use Case

Performing analytics on Big Data is a hot topic these days. Many organizations have realized that they can gain valuable insight from the data that flows through their systems, both in real time and by researching historical data. Imagine what Google, Facebook or Twitter can learn from the data that flows through their systems. And indeed the boom of real-time analytics is here: Google Analytics, Facebook Social Analytics, Twitter's paid tweet analytics and many others, including start-ups that specialize in real-time analytics.

But the challenge of analyzing such huge volumes of data is enormous. Mining terabytes and petabytes of data is no simple task, and one that traditional databases cannot handle, which drove the invention of new NoSQL database technologies such as Hadoop, Cassandra and MongoDB. Analyzing data in real time is yet another challenging task when dealing with very high throughputs. For example, according to Twitter's statistics, the number of tweets sent on March 11, 2011 was 177 million! Now, analyzing this stream, that's a challenge!

Standing up to the challenge

When discussing the challenge of real-time analytics on Big Data, I often use the Twitter use case to illustrate the challenges. People find it easy to relate to this use case and appreciate the volumes and challenges of such a system. A few weeks ago, when planning a workshop on real-time analytics for Big Data, I was challenged to take up this use case and design a proof of concept that meets the challenge of analyzing real tweets from real-time feeds. Well, challenge is my name, and solution architecture is my game…

Note that this is by no means a complete solution, but more of a thought exercise, and I'd like to share these thoughts with you, as well as the code itself, which is shared on GitHub (see the link at the bottom of the post). I hope this will be only the starting point of a joint community effort to turn it into a complete reference example. So let's have a look at what I sketched out.

What kind of analytics?

First of all, we need to see what kinds of analytics are of interest in such a use case. Looking at Twitter analytics, we see various types of analytics, which I group into three categories:

Counting: real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

Correlation: near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

Research: more in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

When approaching the architecture of a solution that covers all of the above types of analytics, we need to recognize the different nature of real-time vs. historical analysis, leverage appropriate technologies to meet each challenge on its own ground, and then combine the technologies into a single harmonious solution. I'll get back to that point…

In my sample application I wanted to listen to the public timeline of Twitter, perform some sample real-time word-counting analytics, prepare the data for research analytics, and combine it all into a single solution that handles all aspects.

Feeding in a hefty stream of tweets

I chose to listen on the Twitter public timeline (so I can share the application with you without having to give you my user/password).

For integration with Twitter I chose to use Spring Social, which offers a built-in Twitter connector integrating with Twitter's REST API. I wanted to integrate with Twitter's Streaming API, but unfortunately it appears that Spring Social does not currently support this API, so I settled for repeated calls to the regular API.
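
As a rough illustration of those repeated calls, the snippet below polls the public timeline with Spring Social's TwitterTemplate (assuming the Spring Social Twitter module of that era, where the public timeline could be read without authentication). The poll interval and the process() hand-off to the feeder are placeholders, not the project's actual code.

    import java.util.List;

    import org.springframework.social.twitter.api.Tweet;
    import org.springframework.social.twitter.api.Twitter;
    import org.springframework.social.twitter.api.impl.TwitterTemplate;

    /**
     * Repeatedly polls the Twitter public timeline via Spring Social and hands
     * each tweet to the feeder. Sketch only: error handling, de-duplication and
     * rate-limit handling are omitted.
     */
    public class PublicTimelinePoller {

        private final Twitter twitter = new TwitterTemplate(); // unauthenticated template (era-specific assumption)

        public void poll(long intervalMillis) throws InterruptedException {
            while (true) {
                List<Tweet> tweets = twitter.timelineOperations().getPublicTimeline();
                for (Tweet tweet : tweets) {
                    process(tweet);
                }
                Thread.sleep(intervalMillis);
            }
        }

        /** Hands the tweet over to the ETL feeder (see the next section). */
        protected void process(Tweet tweet) {
            System.out.println(tweet.getFromUser() + ": " + tweet.getText());
        }
    }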

A feeder took in the tweets, converted them in an ETL style to a canonical Document-oriented representation, whose semi-structured nature makes it ideal for the evolving structure of tweets, and wrote them to an in-memory data grid (IMDG) on the server side.
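
A hedged sketch of that ETL step is shown below: each tweet is converted to a GigaSpaces SpaceDocument and written to the grid. The "Tweet" type name and its properties are illustrative, and type registration with the space is omitted.

    import com.gigaspaces.document.SpaceDocument;
    import org.openspaces.core.GigaSpace;
    import org.springframework.social.twitter.api.Tweet;

    /**
     * Converts incoming tweets to a canonical, semi-structured document
     * representation and writes them to the in-memory data grid.
     * Property names and the "Tweet" type name are illustrative.
     */
    public class TweetEtlFeeder {

        private final GigaSpace gigaSpace;  // injected proxy to the space

        public TweetEtlFeeder(GigaSpace gigaSpace) {
            this.gigaSpace = gigaSpace;
        }

        public void feed(Tweet tweet) {
            SpaceDocument doc = new SpaceDocument("Tweet")
                    .setProperty("id", tweet.getId())
                    .setProperty("text", tweet.getText())
                    .setProperty("fromUser", tweet.getFromUser())
                    .setProperty("createdAt", tweet.getCreatedAt())
                    .setProperty("processed", Boolean.FALSE);   // picked up later by the processing workflow
            gigaSpace.write(doc);
        }
    }

The "processed" flag is what the downstream processing workflow keys on (see the polling-container sketch further below).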

The design needs to accommodate a very high throughput of around 10k tweets/sec, with latency of milliseconds. To that end I chose to implement the feeder as an independent processing unit in GigaSpaces XAP, so that I can cope with the write-scalability requirement by simply adding more parallel feeders to handle the stream. Since the feeder is a stateless service, scaling out by adding instances is relatively easy. Trying to do the same with stateful services will prove to be much more challenging, as we're about to find out…

Let’s pick the brains of my accumulated tweets

On the server side, I wanted to store the tweets and prepare them for batch-mode historical research. For the same reason of semi-structured data, I also chose a schema-flexible NoSQL database to store the tweets. In this case, I chose the open-source Apache Cassandra, which has become a prominent NoSQL database and is used by Twitter itself, as well as by many other companies. The API I used to interact with Cassandra is Hector.

To avoid tight coupling of my application code with Cassandra, I followed the Inversion of Control (IoC) principle and created an abstraction layer for the persistent store, of which Cassandra is just one implementation; I provided another implementation, for testing purposes, that persists to the local file system. Leveraging the Spring Framework's wiring capabilities (see below), switching between implementations becomes a configuration choice, with no code changes.
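
A minimal sketch of that abstraction might look as follows. The TweetArchive interface and the file-based implementation are hypothetical names used for illustration; the Cassandra/Hector implementation is omitted, and Spring wiring decides which bean gets injected.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    /** Abstraction over the persistent tweet store; implementations are chosen via Spring wiring. */
    interface TweetArchive {
        void archive(String tweetId, String rawTweetJson);
    }

    /** Test/dev implementation that appends tweets to a local file. */
    class FileSystemTweetArchive implements TweetArchive {

        private final Path target;

        FileSystemTweetArchive(String fileName) {
            this.target = Paths.get(fileName);
        }

        @Override
        public void archive(String tweetId, String rawTweetJson) {
            try {
                Files.write(target,
                            (tweetId + "\t" + rawTweetJson + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new IllegalStateException("Failed to archive tweet " + tweetId, e);
            }
        }
    }

    // A Cassandra-backed implementation (using Hector) would implement the same interface;
    // application code only ever sees TweetArchive, so switching stores is a wiring change.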

Easy configuration

For easy configuration and wiring I used the Spring Framework, leveraging its wiring capabilities as well as property injection and parameter configuration. Using these features I made the application highly configurable, exposing the Twitter connection details, buffer sizes, thread pool sizes, etc. This means that the application can be tuned to any load, depending on the hardware, network and JVM specifications.
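
As an illustration of that configuration style, the fragment below uses Spring property injection with default values; the property names are made up for the example and would normally be supplied from a properties file via a property placeholder.

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.stereotype.Component;

    /**
     * Tunable settings injected by Spring from external configuration.
     * Property names are illustrative, with sensible defaults after the colon.
     */
    @Component
    public class FeederSettings {

        @Value("${twitter.poll.interval.millis:5000}")
        private long pollIntervalMillis;       // how often to poll the Twitter API

        @Value("${feeder.batch.size:100}")
        private int batchSize;                 // tweets handed to the grid per write

        @Value("${feeder.threads:4}")
        private int feederThreads;             // parallel feeder threads

        public long getPollIntervalMillis() { return pollIntervalMillis; }
        public int getBatchSize() { return batchSize; }
        public int getFeederThreads() { return feederThreads; }
    }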

What can I learn from the tweet stream on the fly?

In addition to persisting the data, I also wanted to perform some sample on-the-fly real-time analytics on it. For this experiment I chose to perform word counting (or, more exactly, token counting, as a token can also be an expression or a combination of symbols).

At first glance you may think it's a simple task to implement, but when facing the volumes and throughput of Twitter you'll quickly realize that we need a highly scalable and distributed architecture, one that uses appropriate technology and takes into account data sharding and consistency, data-model de-normalization, processing reliability, message locality, throughput balancing to avoid backlog build-up, and so on.

Processing workflow

I chose to meet these challenges by employing an Event-Driven Architecture (EDA) and designed a workflow that runs through the different stages of the processing (parsing, filtering, persisting, local counting, global aggregation, etc.), where each stage of the workflow is a processor. To meet the above challenges of throughput, backlog build-up, distribution, etc., I designed the processors with the following characteristics:

  1. Each processor has a thread pool (of a configurable size) to enable concurrent processing.
  2. Each processor thread can process events in batches (of a configurable size) to balance between input and output streams and avoid backlog build-up.
  3. Processors are co-located with the (sharded) data, so that most of the data processing is performed locally, within the same JVM, avoiding distributed transactions, network, and serialization overhead.

The overall workflow chains these processors in sequence, from parsing and filtering through persisting and local counting to global aggregation.

For the implementation of the workflow and the processors I chose the XAP Polling Container, which runs inside the in-memory data grid, co-located with the data, and makes it easy to implement the above characteristics.

The events that drive the workflow are simple POJOs written to the grid: I listen on them, and their state changes are what trigger the processing. This is a very useful characteristic of the XAP platform, which saved me the need to generate message objects and place them in a message broker.
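
Below is a hedged sketch of such a processor, using XAP's polling-container annotations. The "Tweet" document type, its properties and the TokenCounters helper (defined in the atomic-counter sketch in the next section) are illustrative stand-ins for the project's actual classes.

    import com.gigaspaces.document.SpaceDocument;
    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    /**
     * Polling-container processor that picks up unprocessed tweet documents from
     * its local partition, counts their tokens locally, and marks them as processed.
     * Type and property names are illustrative.
     */
    @EventDriven
    @Polling
    public class TokenCountingProcessor {

        /** Template: only take tweet documents that have not been processed yet. */
        @EventTemplate
        public SpaceDocument unprocessedTweet() {
            return new SpaceDocument("Tweet").setProperty("processed", Boolean.FALSE);
        }

        /** Invoked by the container for each matching document taken from the local partition. */
        @SpaceDataEvent
        public SpaceDocument process(SpaceDocument tweet) {
            String text = tweet.getProperty("text");
            for (String token : text.toLowerCase().split("\\s+")) {
                TokenCounters.increment(token);   // see the atomic-counter sketch below
            }
            tweet.setProperty("processed", Boolean.TRUE);
            return tweet;                          // returned document is written back to the space
        }
    }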

Atomic counters

For the implementation of atomic counters I used XAP's MAP API, which allows using the in-memory data grid as a transactional key-value store, where the key is the token and the value is the count; each such entry can be locked individually to achieve atomic updates, much like a ConcurrentHashMap.
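
The counter semantics are easiest to show with a plain-Java analogy. The sketch below mimics locally, with a ConcurrentHashMap, what the grid-backed map does across partitions; it is an analogy of the semantics, not the XAP MAP API itself.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Local analogy of the grid-backed token counters: each token maps to a counter
     * that can be updated atomically. In the experiment the map is the transactional
     * key-value view of the in-memory data grid rather than a ConcurrentHashMap.
     */
    public final class TokenCounters {

        private static final ConcurrentMap<String, AtomicLong> COUNTS =
                new ConcurrentHashMap<String, AtomicLong>();

        private TokenCounters() {
        }

        public static long increment(String token) {
            AtomicLong counter = COUNTS.get(token);
            if (counter == null) {
                AtomicLong fresh = new AtomicLong();
                counter = COUNTS.putIfAbsent(token, fresh);   // atomic "create if missing"
                if (counter == null) {
                    counter = fresh;
                }
            }
            return counter.incrementAndGet();                  // atomic per-token update
        }

        public static long count(String token) {
            AtomicLong counter = COUNTS.get(token);
            return counter == null ? 0L : counter.get();
        }
    }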

Making it all play together in harmony

So we have a deployment that incorporates a feeder, a processor and a Cassandra persistent store, each such service having multiple instances and needing to scale in/out dynamically based on demand. Designing my solution for the real deal, I'm looking at tens to hundreds of instances of each service. Manual deployment or scripting will not be manageable, not to mention the automatic scaling of each service, monitoring and troubleshooting. How do I manage all that automatically, as a single cohesive solution?

For that I used GigaSpaces Cloudify, which allows me to integrate any stack of services by writing Groovy-based recipes that declaratively describe the lifecycle of the application and its services.

I can then deploy and manage the end-to-end application using the CLI and the Web Console.

Conclusion

This was a thought exercise on real-time analytics for Big Data. I used the Twitter use case because I wanted to aim high on the Big Data challenge, and, well, you can't get much bigger than that.

The end-to-end solution included a clustered Cassandra NoSQL database for the elaborate batch analytics of the historical data, the GigaSpaces XAP platform for a distributed In-Memory Data Grid with co-located real-time processing, Spring Social for feeding in tweets from Twitter, the Spring Framework for configuration and wiring, and GigaSpaces Cloudify for deployment, management and monitoring. I used an event-driven architecture with semi-structured Documents, POJOs and atomic counters, and with write-behind eviction.

This is just the beginning. My design hardly utilized the capabilities of the chosen technology stack, and it barely scratched the surface of the analytics you can perform on Twitter. Imagine, for example, what it would take to calculate not just real-time word counts but also the reach of tweets, as done by the tweetreach service.

This project is just a starting point. I would like to share it with you and invite you to take up the challenge with me, so that together we can turn it into a complete reference solution for a real-time analytics architecture for Big Data.

The project is available on GitHub at https://github.com/dotanh/rt-analytics.

You’re welcome to contribute!

Follow Dotan on Twitter!

Filed under Big Data, Real Time Analytics, Solution Architecture