Tag Archives: GigaSpaces

AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as past outages. Many recommend basing your architecture on multiple AWS Availability Zones (AZ) to spread the risk. But during this outage we saw even multi-Availability Zone applications severely affected. Even Amazon published during the outage that

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is an underlying infrastructure that escalates the traffic from the affected AZ to other AZ in a way that overwhelms the system. In the case of this outage it was the AWS API Platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn’t really matter for us as users which exact infrastructure it was that failed on this specific outage. 18 months ago, during the first major outage, the reason was another infastructure component, the Elastic Block Store (“EBS”) volumes, that cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations (and don’t forget, Amazon only commits to 99.995% SLA) causing such major outages. And there will always be some common infrastructure that under that extreme and rare situation will carry the ripple effect of the outage to other Availability Zones in the region.

Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling, but as I stated on my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: “it requires effort on the part of application builders to take advantage of this isolation”.

The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, to provide the best isolation level. The advantages in terms resilience are clear. But how can that be implemented, given that the vendors are so different in their characteristics and APIs?

There are 2 approaches to deploying across multiple cloud vendors and keeping cloud-vendor-agnostic:

  1. Open Standards and APIs for cloud API that will be supported by multiple cloud vendors. That way you write your application using a common standard and have immediate support by all conforming cloud vendors. Examples for such emerging standards are OpenStack and JClouds. However, the Cloud is still a young domain with many competing standards and APIs and it is yet to be determined which one shall become the de-facto standard of the industry and where to “place our bet”.
  2. Open PaaS Platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate to the underlying cloud vendors (whether public or private clouds, or even a hybrid). Examples of such platforms, are CloudFoundry and Cloudify. I dedicated one of my posts for exploring how to build your application using such platforms.

Conclusion

System architects need to face the reality of the Service Level Agreement provided by Amazon and other cloud vendors and their limitations, and start designing for resilience by spreading across isolated environments, deploying DR sites, and by similar redundancy measures to keep their service up-and-running and their data safe. Only that way can we guarantee that we will not be the next one to fall off the 99.995% SLA.

This post was originally posted here.

5 Comments

Filed under cloud deployment, Disaster-Recovery, IaaS, PaaS, Solution Architecture, Uncategorized

Cloud Deployment: It’s All About Cloud Automation

Not only for modern applications

Many organizations are facing the challenge of migrating their IT to the cloud. But not many know how to actually approach this undertaking. In my recent post – Cloud Deployment: The True Story – I started sketching best practices for performing the cloud on-boarding task in a manageable fashion. But many think this methodology is only good for modern applications that were built with some dynamic/cloud orientation in mind, such as Cassandra NoSQL DB from my previous blog, and that existing legacy application stacks cannot use the same pattern. For example, how different would the cloud on-boarding process be if I modify the PetClinic example application from my previous post to use a MySQL relational database instead of the modern Cassandra NoSQL clustered database? In this blog post I intend to demonstrate that cloud on-boarding of brownfield applications doesn’t have to be a huge monolithic migration project with high risk. Cloud on-boarding can take the pragmatic approach and can be performed in a gradual process that both mitigates the risk and enables you to enjoy the immediate benefits of automation and easier management of your application’s operational lifecycle even before moving to the cloud.

MySQL case study

Let’s look at the above challenge of taking a standard and long-standing MySQL database and adapt it to the cloud. In fact, this challenge was already met by Amazon for their cloud. Amazon Web Services (AWS) include the very popular Relational Database Service (RDS). This service is an adaptation of a MySQL database to the Amazon cloud. MySQL DB was not built or designed for cloud environment, and yet it proved highly popular, and even the new SimpleDB service that Amazon built from scratch with cloud orientation in mind was unable to overthrow the RDS reign. The adaptation of MySQL to AWS was achieved using some pre-tuning of MySQL to the Amazon environment and extensive automation of the installation and management of the DB instances. The case study of Amazon RDS can teach us that on-boarding existing application is not only doable but may even prove better than developing a new implementation from scratch to suit the cloud.

I will follow the MySQL example throughout this post and examine how this traditional pre-cloud database can be made ready for the cloud.

Automation is the key

We have our existing application stack running within our data center, knowing nothing of the cloud, and we would like to deploy it to the cloud. How shall we begin?

Automation is the key. Experts say automated application deployment tools are a requirement when hosting an application in the cloud. Once automation is in place, and given a PaaS layer that abstracts the underlying IaaS, your application can easily be migrated to any common cloud provider with minimal effort.

Furthermore, automation has a value in its own right. The emerging agile movements such as Agile ALM (Application Lifecycle Management) and DevOps endorse automation as a means to support the Continuous Deployment methodology and ever-increasing frequency of releases to multiple environments. Some even go beyond DevOps and as far as NoOps. Forrester analyst Mike Gualtieri states that “NoOps is the peak of DevOps”, where “DevOps Is About Collaboration; NoOps Is About Automation“:

DevOps is a noble and necessary movement for immature organizations. Mature organizations have DevOps down pat. They aspire to automate to speed release increments.

This value of automation in providing a more robust and agile management of your application is a no-brainer and will prove useful even before migrating to the cloud. It is also much easier to test and verify the automation when staying in the well-familiar environment in which the system has been working until now. Once deciding to migrate to the cloud, automation will make the process much simpler and smoother.

Automating application deployment

Let’s take the pragmatic approach. The first step is to automate the installation and deployment of the application in the current environment, namely within the same data center. We capture the operational flows of deploying the application and start automating these processes, either using scripts or using higher-level DevOps automation tools such as Chef and Puppet for Change and Configuration Management (CCM).

Let’s revisit our MySQL example: MySQL doesn’t come with built-in deployment automation. Let’s examine the manual processes involved with installing MySQL DB from scratch and capture that in a simple shell script so we can launch the process automatically:

This script is only the basics. A more complete automation should take care of additional concerns such as super-user permissions, updating ‘yum’, killing old processes and cleaning up previous installations, and maybe even handling differences between flavors of Linux (e.g. Ubuntu’s quirks…). You can check out the more complete version of the installation script for Linux here (mainstream Linux, e.g. RedHat, CentOS, Fedora), as well as a variant for Ubuntu (adapting to its quirks) here. This is open source and a work in progress so feel free to fork the GitHub repo and contribute!

Automating post-deployment operations

Once automation of the application deployment is complete we can then move to automating other operational flows of the application’s lifecycle, such as fail-over or shut down of the application. This aligns with cloud on-boarding, since “Deployment in the cloud is attached to the whole idea of running the application in the cloud”, as Paul Burns, president and analyst at Neovise, says:

People don’t say, ‘Should I automate my deployment in the cloud?’ It’s, ‘Should I run it in the cloud?’ Then, ‘How do I get it to the cloud?’

In our MySQL example we will of course want to automate the start-up of the MySQL service, stopping it and even uninstalling it. More interestingly, we may also want to automate operational steps unique to MySQL such as granting DB permissions, creating a new database, generating a dump (snapshot) of our database content or importing a DB dump to our database. Let’s look at a snippet to capture and automate dump generation. This time we’ll use the Groovy scripting language which provides higher-level utilities for automation and better yet it is portable between OSs, so we don’t have the headache as we described above with Ubuntu (not to mention Windows …):

Adding automation of these post-deployment steps will provide us with end-to-end automation of the entire lifecycle of the application from start-up to tear-down within our data center. Such automation can be performed using elaborate scripting, or can leverage modern open PaaS platforms such as CloudFoundry, Cloudify, and OpenShift to manage the full application lifecycle. For this MySQL automation example I used the Cloudify open source platform, where I modeled the MySQL lifecycle using a Groovy-based DSL as follows:

As you can see, the lifecycle is pretty clear from the DSL, and maps to individual scripts similar to the ones we scripted above. We even have the custom commands for generating dumps and more. With the above in place, we can now install and start MySQL automatically with a single command line:

install-service mysql

Similarly, we can later perform other steps such as tearing it down or generating dumps with a single command line.

You can view the full automation of the MySQL lifecycle in the scripts and recipes in this GitHub repo.

Monitoring application metrics

We may also want to have better visibility into the availability and performance of our application for better operational decision-making, whether for manual processes (e.g. via logs or monitoring tools) or automated processes (e.g. auto-scaling based on load). This is becoming common practice in methodologies such as Application Performance Management (APM). This will also prove useful once in the cloud, as visibility is essential for successful cloud utilization. Rick Blaisdell, CTO at ConnectEDU, explains:

… the key to successful cloud utilization lays in the management and automation tools’ capability to provide visibility into ongoing capacity

In our MySQL example we can sample several interesting metrics that MySQL exposes (e.g. using the SHOW STATUS syntax or ‘mysqladmin’), such as the number of client connections, query counts or query throughput.

Summary

On-boarding existing applications to the cloud does not have to be a painful and high-risk migration process. On boarding can be done in a gradual “baby-step” manner to mitigate risk.

The first step is automation. Automating your application’s management within your existing environment is a no-brainer, and has its own value in making your application deployment, management and monitoring easier and more robust.

Once automation of the full application lifecycle is in place, migrating your application to the cloud becomes smooth sailing, especially if you use PaaS platforms that abstract the underlying cloud provider specifics.

This post was originally posted here.

For the full MySQL cloud automation open source code see this public GitHub repo. Feel free to download, play around, and also fork and contribute.

3 Comments

Filed under Cloud, cloud automation, cloud deployment, DevOps, IaaS, PaaS

Cloud Deployment: The True Story

Everyone wants to be in the cloud. Organizations have internalized the notion and have plans in place to migrate their applications to the cloud in the immediate future. According to Cisco’s recent global cloud survey:

Presently, only 5 percent of IT decision makers have been able to migrate at least half of their total applications to the cloud. By the end of 2012, that number is expected to significantly rise, as one in five (20 percent) will have deployed over half of their total applications to the cloud.

But that survey also reveals the fact that on-boarding your application to the cloud “is harder, and it takes longer than many thought”, as David Linthicum said in his excellent blog post summarizing the above Cisco survey. Taking standard enterprise applications that were designed to run in the data center and on-boarding them to the cloud is in essence a reincarnation of the well-known challenge of platform-migration, which is never easy. But why is there a sense of extra difficulty in on-boarding to the cloud? The first reason David identifies for the extra difficulty is the misconception that cloud is a “silver bullet”. Such “silver bullet” misconception can lead to lack of proper design of the system, which may result in application outage, as I outlined in my previous blogs. Another reason David states for the extra difficulty is the lack of well-defined process and best practices for on-boarding applications to the cloud:

What makes the migration to the cloud even more difficult is the lack of information about the process. Many new cloud users are lost in a sea of hype-driven desire to move to cloud computing, without many proven best practices and metrics.

It is about time for a field-proven process for on-boarding applications to the cloud. In this post I’d like to start examining the accumulated experience in on-boarding various types of applications to the cloud, and see if we can extract a simple process for the migration. This is of course based on the experience of me and my colleagues, and not the result of any academic research, so I would very much like for it to serve as a cornerstone to trigger an open discussion in the community, sharing experience of different types of migration projects and applications, and iteratively refine the suggested process based on the joint experience.

Examining the n-tier enterprise application use case

As a first use case, it makes sense to examine a classic n-tier enterprise application. For the sake of discussion, I’d like to use common open-source modules, assuming they are well-known and to allow us to play with them freely. For the test-case application let’s take Spring’s PetClinic Sample Application and adapt it. We’ll use Apache Tomcat web container and Grails platform for the web and business logic tiers, and MongoDB NoSQL DB for the data tier, to simulate a Big Data use case. We can later add the Apache HTTP Server as a front-end load balancer. To those who start wondering, I’m not invested in the Apache Foundation, just an open-source enthusiast.

First step of on-boarding the application to the cloud is to identify the individual services which comprise the application, and the dependency between these services. In this use case, since the application is well-divided into tiers, it is quite easy to map the services from the tiers. Also, the dependency between the tiers is quite clear. For example, the Tomcat instances are dependent on the back-end database. Mapping the application’s services and their dependency will help us determine which VMs we should spin up, of which images, how many of each, and in which order. In later posts I’ll address additional benefits of the services paradigm.

Next let’s dive into the specific services, and see what it takes to prepare them for on-boarding to the cloud. First step is to identify the operational phases which comprise the service’s lifecycle. Typically, a service will undergo a lifecycle of install-init-start-stop-shutdown. We should capture the operational process for each such phase and formalize it into an automated DevOps process, for example in the form of a script. This process of capturing and formalizing the steps also helps exposing many important issues that need to be addressed to enable the application to run in the cloud, and may even require further lifecycle phases or intermediate steps. For example in the case of Tomcat we may want to support deploying a new WAR file to Tomcat without restarting the container. Another example for MongoDB is that we noticed that it may fail starting up without failure indication in the OS process status, so simple generic monitoring of the process status wasn’t enough and we needed a more accurate and customized way to know when the service successfully completed start-up and is ready to serve. Similar considerations arise with almost every application. I will touch these considerations further in a follow-up post.

With the break-down of the application into services, and the services break-down into their individual lifecycle stages, we have a good skeleton to automate the work on the cloud. You are welcome to review the result of the experimentation available as open-source under CloudifySourice GitHub. On my next post I will further examine the n-tier use case and discuss additional concerns that needed to be addressed to bring it to a full solution.

 

1311765722_picons03
Follow Dotan on Twitter!

2 Comments

Filed under Cloud, DevOps, IaaS, PaaS

Analytics for Big Data – Venturing with the Twitter Use Case

Performing analytics on Big Data is a hot topic these days. Many organizations realized that they can gain valuable insight from the data that flows through their systems, both in real time and in researching historical data. Imagine what Google, Facebook or Twitter can learn from the data that flows through their systems. And indeed the boom of real time analytics is here: Google Analytics, Facebook Social Analytics, Twitter paid tweet analytics and many others, including start-ups that specialize in real time analytics.

But the challenge of analyzing such huge volumes of data is enormous. Mining Terabytes and Petabytes of data is no simple task, one that traditional databases cannot meet, which drove the invention of new NoSQL database technologies such as Hadoop, Cassandra and MongoDB. Analyzing data in real time is yet another challenging task, when dealing with very high throughputs. For example, according the Twitter’s statistics, the number of tweets sent  on March 11 2011 was 177 million! Now, analyzing this stream, that’s a challenge!

Standing up to the challenge

When discussing the challenge of real time analytics on Big Data, I oftentimes use the Twitter use case as an example to illustrate the challenges. People find it easy to relate to this use case and appreciate the volumes and challenges of such a system. A few weeks ago, when planning a workshop on real time analytics for Big Data, I was challenged to take up this use case and design a proof of concept that meets up to the challenge of analyzing real tweets from real time feeds. Well, challenge is my name, and solution architecture is my game…

Note that it is by no means a complete solution, but more of a thought exercise, and I’d like to share these thoughts with you, as well as the code itself, which is shared on GitHub (see at the bottom of the post). I hope this will only be the starting point of a joint community effort to make it into a complete reference example. So let’s have a look at what I sketched up.

What kind of analytics?

First of all, we need to see what kinds of analytics are of interest in such use case. Looking at Twitter analytics, we see various types of analytics that I group in 3 categories:

Counting: real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.

Correlation: near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.

Research: more in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.

When approaching the architecture of a solution that covers all of the above types of analytics, we need to realize the different nature of the real time vs. historical analysis, and leverage on appropriate technologies to meet each challenge on its own ground, and then combine the technologies into a single harmonious solution. I’ll get back to that point…

In my sample application I wanted to listen on the public timeline of Twitter, perform some sample real-time analytics of word-counting, as well as preparing the data for research analytics, and combining it into a single solution to handle all aspects.

Feeding in a hefty stream of tweets

I chose to listen on the Twitter public timeline (so I can share the application with you without having to give you my user/password).

For integration with Twitter I chose to use Spring Social, which offers a built-in Twitter connector integrating with Twitter’s REST API. I wanted to integrate with Twitter’s  Streaming API, but unfortunately it appears that currently Spring Social does not support this API, so I settled for repetitive calls to the regular API.

A feeder took in the tweets and converted them in an ETL style to a canonical Document-oriented representation, which semi-structured nature makes it ideal for the evolving nature of tweet structure, and wrote them to an in-memory data grid (IMDG) on the server side.

The design needs to accommodate a very high throughput of around 10k tweets/sec, with latency of milliseconds. For that end I chose to implement the feeder as an independent processing unit in GigaSpaces XAP, so that I can cope with the write scalability requirement by simply adding more parallel feeders to handle the stream. Since the feeder is a stateless service, scaling out by adding instances is relatively easy. Trying to do the same with stateful services will prove to be much more challenging, as we’re about to find out …

Let’s pick the brains of my accumulated tweets

On the server side, I wanted to store the tweets and prepare them for batch-mode historical research. For the same reason of semi-structured data, I also chose a Document-oriented database to store the Tweets. In this case, I chose the open source Apache Cassandra, which has become a prominent NoSQL database that is in use by Twitter itself, as well as by many other companies. The API to interact with Cassandra is Hector.

To avoid tight coupling of my application code with Cassandra, I followed the Inversion of Control principle (IoC) and created an abstraction layer for the persistent store, where Cassandra is just one implementation, and provided another implementation for testing purposes of persistence to the local file system. Leveraging on Spring Framework wiring capabilities (see below), switching between implementations becomes a configuration choice, with no code changes.

Easy configuration

For easy configuration and wiring I used Spring Framework, leveraging on its wiring capabilities as well as properties injection and parameter configuration. Using these features I made the application highly configurable, exposing the Twitter connection details, buffer sizes, thread pool sizes, etc. This means that the application can be tuned to any load, depending on the hardware, network and JVM specifications.

What can I learn from the tweet stream on the fly?

In addition to persisting the data, I also wanted to perform some sample on-the-fly real-time analytics on the data. For this experiment I chose to perform word counting (or more exactly token counting, as token can also be an expression or a combination of symbols).

At first glance you may think it’s a simple task to implement, but when facing the volumes and throughput of Twitter you’ll quickly realize that we need a highly scalable and distributed architecture that uses an appropriate technology and that takes into account data sharding and consistency, data model de-normalization, processing reliability, message locality, throughput balancing to avoid backlog build-up etc.

Processing workflow

I chose to meet these challenges by employing Event-Driven Architecture (EDA) and designed a workflow to run through the different stages of the processing (parsing, filtering, persisting, local counting, global aggregation, etc.) where each stage of the workflow is a processor. To meet the above challenges of throughput, backlog build-up, distribution etc., I designed the processor with the following characteristics:

  1. Each processor has a thread pool (of a configurable size) to enable concurrent processing.
  2. Each processor thread can process events in batches (of a configurable size) to balance between input and output streams and avoid backlog build-up.
  3. Processors are co-located with the (sharded) data, so that most of the data processing is performed locally, within the same JVM, avoiding distributed transactions, network, and serialization overhead.

The overall workflow looks as follows:

For the implementation of the workflow and the processors I chose XAP Polling Container, which runs inside the in-memory data grid co-located with the data and enables easy implementation of the above characteristics.

The events that drive the workflow are simple POJOs on which I listen and which state changes trigger the events. This is a very useful characteristic of the XAP platform, which saved me the need to generate message objects and place them in a message broker.

Atomic counters

For the implementation of atomic counters I used XAP’s MAP API, which allows using the in-memory data grid as a transactional key-value store, where the key is the token and the value is the count, and each such entry can be locked individually to achieve atomic updates, very similarly to ConcurrentHashMap.

Making it all play together in harmony

So we have a deployment that incorporates a feeder, a processor and a Cassandra persistent store, each such service having multiple instances and needing to scale in/out dynamically based on demand. Designing my solution for the real deal, I’m about to face 10s-100s of instances of each service. Manual deployment or scripting will not be manageable, not to mention the automatic scaling of each service, monitoring and troubleshooting. How do I manage that automatically as a single cohesive solution?

For that I used GigaSpaces Cloudify, which allows me to integrate any stack of services by writing Groovy-based recipes describing declaratively the lifecycle of the application and its services.

I can then deploy and manage the end-to-end application using the CLI and the Web Console.

Conclusion

This was a thought exercise on real-time analytics for big data. I used Twitter use case because I wanted to aim high up the big data challenge and, well, you can’t get much bigger than that.

The end-to-end solution included a clustered Cassandra NoSQL database for the elaborated batch analytics of the historical data, GigaSpaces XAP platform for distributed In-Memory Data Grid with co-located with real-time processing, Spring Social for feeding in Tweets from twitter, Spring Framework for configuration and wiring capabilities, and GigaSpaces Cloudify for deployment, management and monitoring. I used event-driven architecture with semi-structured Documents, POJOs and atomic counters, and with write-behind eviction.

This is just the beginning. My design hardly utilized the capabilities of the chosen technology stack, and it barely scratched the surface of the analytics you can perform on Twitter. Imagine for example what it would take to calculate not just real-time word counts but also the reach of tweets, as done on the tweetreach service.

This project is just the starting point, and I would like to share this project with you and invite you to stand up to the challenge with me and together make it into a complete reference solution for real-time analytics architecture for big data.

The project is found on GitHub under https://github.com/dotanh/rt-analytics.

You’re welcome to contribute!

1311765722_picons03
Follow Dotan on Twitter!

4 Comments

Filed under Big Data, Real Time Analytics, Solution Architecture

Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

Recently I held a webinar around architecting solutions for scalable and near-real-time risk analysis solutions based on the experience gathered with our Financial Services customers. In the webinar I also had the honor of hosting Mr. Larry Mitchel, a leading expert in the Financial Services industry, who provided background on the Risk Management domain. Following the general interest in the webinar, I decided to dedicate a post to the subject.

What goes on in the Risk Management domain?

The Finance world continually undergoes changes driven for the most part by the lessons learned from the 2008 financial crash, in an attempt to prevent such catastrophes from reoccurring. Regulations such as Dodd-Frank, EMIR, and Basel III have further formalized it, imposing tighter control and supervision. We see financial institutions addressing these conformance goals by assigning dedicated projects with dedicated budgets (which means more work for solutions architects, lucky me). One of the aspects of this conformance is reducing the risk by shortening the settlement cycles to near-real-time, as seen by initiatives such as Straight-Through Processing.

Traditional architectures, new challenges

Conforming to the new regulations mandates an entirely different approach to risk analysis. This means that the old systems, which relied on overnight batch risk calculations and predefined queries, can no longer suffice, and a more real time approach to risk calculation, with on-the-fly queries, is needed.

From a solution architecture point of view, Risk Analysis is a compute-intensive and a data-intensive process. Looking at our customers’ systems, we see ever-increasing volumes (number of calculated positions and assets, number of re-calculations, data affinity, etc.) and on the other hand we see an ever-increasing demand to reduce the response time, to conform with the regulations or for competitive edge. That makes it a classic Big Data analytics problem.

From a technology point of view, risk analysis solutions traditionally relied on designated compute grid products for the calculations and on relational databases as the data store. That was fine for overnight batch processing, but with the introduction of the new real-time demands databases tend to become bottlenecks under the load, due to the disk and network resources.

Risk Analysis solution architecture revisited

Our experience with such solutions shows that the effective architecture to meet these challenges is a Big Data multi-tiered architecture, in which intraday data is cached in-memory for low-latency response, while historical data is kept in a database for more extensive data mining and reporting. Simple caching solutions cannot provide the scalability of the intraday data under such write-intensive flows (streaming market data, calculation results, and such), and it is therefore an In-Memory Data Grid that has become the standard technology in modern solutions for storing intraday data. Intelligent data grids such as GigaSpaces XAP also provide on-the-fly SQL querying capabilities, which overcome the limitation of predefined queries in traditional architectures.  As for historical data, we see a clear shift from relational databases to NoSQL databases, which perform much better for mining these volumes of semi-structured data.

A piece of the architecture that is often overlooked on initial architecture discussions is the system orchestration. Surprisingly, many of the customers I visit tend to think of risk analysis solutions as the mere sum of a Compute Grid product (for computation scalability) and a Data Grid product (for data scalability). But they neglect to consider the orchestration logic to handle the intersection between the data grid and the compute grid, taking care to avoid duplicate calculations, handling cancellation of calculations, monitoring the state of ongoing calculations, feeding ticks and updates to the client UI, end more. All this amounts to a significant orchestration layer that is traditionally developed in-house.

A much more effective architecture is to embed the orchestration logic together with the data grid within one platform, thereby abstracting the complexities from the clients and removing the need of the clients to interact with anything but the unified platform. GigaSpaces XAP offers the co-location of processing and messaging together with the data, which makes implementing such architectures quite easy. This also enables pre-/post-processing on the data, such as data formatting prior to processing, and result aggregation after calculations, which are requirements often seen in such solutions.

Event-Driven Architecture is highly useful for streaming calculation results to the awaiting clients as they arrive and streaming ticks and other updates to the UI. Using GigaSpaces XAP the implementation of such architecture is made simple by leveraging on the Asynchronous API and on the messaging layer which can treat each data mutation as an event.

To address the real time analytics challenge on the end-to-end Big Data architecture, across both the intraday data (which resides in-memory within the data grid) and the historical data (which resides within a relational/NoSQL database), requires a holistic view of the multi-tier architecture. Intraday data is changed at an extremely high rate with frequent event feeds, whereas historical data can be written in a more relaxed manner, using a write-behind (write-back) caching architectural approach, and consolidating queries across the data stores, making it seem as one unified source for query purposes. Such consolidation is traditionally achieved by combining the various products, but GigaSpaces offers a Real-Time Analytics solution, enabling you to focus on your business logic and leave the rest to the platform.

Future directions

There’s more to discuss in such architectures, such as multi-site deployments over WAN, support for cloud bursting, and more, which should be considered when approaching such solutions. I will not get into these concerns on this post, but you can see coverage of future directions on my webinar.

To get more information on the domain and its challenges, and to hear more on the suggested architecture for Big Data risk analysis solutions I’d recommend watching the full webinar.

 

1311765722_picons03
Follow Dotan on Twitter!

7 Comments

Filed under Big Data, Financial Services, Market Analytics, Real Time Analytics, Risk Managment

Scaling All the Way: Running Scala Applications on GigaSpaces XAP Platform

Scala, the “scalable language”, is the hottest thing in the programming languages world these days. You may have thought that innovation these days only happens in dynamic scripting languages. Or you may have thought that Java and .Net cover all angles for static languages. Or that object-oriented programming is the answer to every large-scale system. So here comes Scala and shuffles the cards. It offers a fascinating fusion of object-oriented and functional paradigms, an elegant and compact syntax, full compatibility with Java and .Net, and most notably – it is easily extensible for defining Domain-Specific Languages (DSL). Several major players already recognized the potential of the language and switched to Scala, including Twitter, Foursquare, and Sony Pictures. A fascinating example of the power of the language is the adoption of the Actor Model for high-level concurrency inspired by the Erlang language.

On one of my recent consulting sessions with a customer in the Financial Services industry in the UK, the customer expressed a desire to develop in Scala on top of the GigaSpaces XAP platform. XAP officially does not support the Scala language, so the customer consulted me on how to construct such a solution on top of the product. Well, I’m a solutions architect, so that’s just the kind of challenges I live for…

My first thought was that as Scala declares full compatibility with Java and complies with Java classes (with very few exceptions), and as any Java program can run on XAP, there should be a way to make it work. However, the customer reported that initial experiments had failed. While visiting the customer onsite, I investigated the problems together with the customer, and came up with some interesting conclusions about inter-operability with Scala that I’d like to share in this post.

The customer had domain model objects written in Scala that he wished to write into the Space. The Space is GigaSpaces implementation of an In-Memory Data Grid. Interestingly, the space functions as a general-purpose data store and exposes an Open Interfacing Layer that enables access to the data using various languages (Java, .Net, C++, scripting) and standard APIs (JPA, JDBC, REST, Map, Document, Memcached, Spring, and more). So having Scala as yet another API for interacting with data in the space would be a natural extension, wouldn’t it?

In order to store instances of a given Java domain class (POJO) in the Space, the class needs to be decorated with metadata to determine things such as which of its properties would serve as the primary key, which properties need to be indexed for efficient querying, etc. GigaSpaces offers two standard methods to specify this metadata: using Java annotations in the class source definition, or using XML alongside the class definition. I wanted to check if any of these methods can serve to similarly decorate Scala classes in preparation for storage in the Space. Experimenting with the customer’s use case, I was able to make the domain model of the customer work with both methods, namely the Scala objects were properly written into the data grid and replicated across the data grid after being annotated using the standard GigaSpaces Java annotations, or using the standard GigaSpaces XML.

That indeed solved the needs of the customer. But I myself only got more of an appetite: What else can be done with Scala on top of XAP? GigaSpaces XAP offers more than just an in-memory data grid, it also offers services on top of the data grid, such as remote invocation using the space as the transport layer, event-driven containers, executing distributed tasks in a Map/Reduce pattern, and more. Would these also work so neatly with Scala in virtue of Scala’s compatibility with Java? So I worked with my colleague Dan Kilman to port the GigaSpaces “Hello World” application to Scala. This application is the beginner’s example application (packaged with the XAP distribution), and it utilizes the basic features of the platform. Porting turned out to be straight-forward, and worked smoothly. The example is published as an open-source project at OpenSpaces.org.

This little experiment showed that programming in Scala on top of XAP is a viable option that deserves further investigation. I should add a proper disclaimer that the XAP platform offers a vast array of features, and that the Scala language offers a vast array of constructs, very few of which have been covered in this experiment. Similarly, I should also state that Scala is not officially supported by the XAP product, which means that there is no official support or test coverage of Scala for the product. However, as a solutions architect exploring possible solutions on top of the product I can now say that this is an exciting option worth exploring, and who knows, if Scala becomes predominant it may one day become an official feature of the product.

1311765722_picons03
Follow Dotan on Twitter!

2 Comments

Filed under Programming Languages

Building Cloud Applications the Easy Way Using Elastic Application Platforms

Patterns, Guidelines and Best Practices Revisited

In my previous post I analyzed Amazon’s recent AWS outage and the patterns and best practices that enabled some of the businesses hosted on Amazon’s affected availability zones to survive the outage.

The patterns and best practices I presented are essential to guarantee robust and scalable architectures in general and on the cloud in particular. Those who dismissed my latest post as exaggeration of an isolated incident got affirmation of my statement last week when Amazon found itself apologizing once again after its Cloud Drive service was overwhelmed by unpredictable peak demand for Lady Gaga’s newly-released album (99 cents, who wouldn’t buy it?!) and was rendered non-responsive. This failure to scale up/out to accommodate fluctuating demands raises the scalability concern in the public cloud, in addition to the resilience concern raised in the AWS outage.

Surprisingly, as obvious as the patterns I listed may seem, it seems they are definitely not common practice, seeing the amount of applications that went down when AWS did, and seeing how many other applications have similar issues on public cloud providers.

Why are such fundamental principles not prevalent in today’s architectures on the cloud?

One of the reasons these patterns are not prevalent in today’s cloud applications is that it requires an experienced and confident architect in the areas of distributed and scalable systems to design such architectures. The typical public cloud APIs also require developers to perform complex coding and utilize various non-standard APIs that are usually not common knowledge. Similar difficulties are found in testing, operating, monitoring and maintaining such systems. This makes it quite difficult to implement the above patterns to ensure the application’s resilience and scalability, and diverts valuable development time and resources from the application’s business logic that is the core value of the application.

How can we make the introduction of these patterns and best practices smoother and simpler? Can we get these patterns as a service to our application? We are all used to traditional application servers that provide our enterprise applications with underlying services such as database connection pooling, transaction management and security, and free us from worrying about these concerns so that we can focus on designing our business logic. Similarly, Elastic Application Platforms (EAP)allow your application to easily employ the patterns and best practices I enumerated on my previous post for high availability and elasticity without having to become experts in the field and allowing you to focus on your business logic.

So what is Elastic Application Platform? Forrester defines an elastic application platform as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

Last month Forrester published a review under the title “Cloud Computing Brings Demand For Elastic Application Platforms”. The review is the result of a comprehensive research, and spans 17 pages (a blog post introducing it can be found on the Forrester blog). It analyzes the difficulties companies encounter in implementing their applications on top of cloud infrastructure, and recognizes the elastic application platforms as the emerging solution for a smooth path into the cloud. It then maps the potential providers of such solutions. For its research Forrester interviewed 17 vendor and user companies. Out of all the reviewed vendors, Forrester identified only 3 vendors that are “offering comprehensive EAPs today”: Microsoft, SalesForce.com and GigaSpaces.

As Forrester did an amazing job in their research reviewing and comparing solutions for EAP today, I’ll avoid repeating that. Instead, I’d like to review the GigaSpaces EAP solution in light of the patterns discussed on my previous post, and see how building your solution on top of GigaSpaces enables you to introduce these patterns easily and without having to divert your focus from your business logic.

Patterns, Guidelines and Best Practices Revisited

Design for failure

Well, that’s GigaSpaces’ bread and butter. Whereas thinking about failure diverts you from your core business, in our case it is our core business. GigaSpaces platform provides underlying services to enable high availability and elasticity, so that you don’t have to take care of that. So now that we’ve established that, let’s see how it’s done.

Stateless and autonomous services

The GigaSpaces architecture segregates your application into Processing Units. A Processing Unit (PU) is an autonomous unit of your application. It can be a pure business-logic (stateless) unit, or hold data in-memory, or provide a web application, and mix together these and other functions. You can define the required Service Level Agreement (SLA) for your Processing Unit, and the GigaSpaces platform will make sure to enforce it. When your Processing Unit SLA requires high-availability – the platform will deploy a (hot) backup instance (or multiple backups) of the Processing Unit to which the PU will fail over in case the primary instance fails. When your application needs to scale out – the platform will add another instance of the Processing Unit (maybe over a newly-provisioned virtual machine booted automatically by the platform). When your application needs to distribute data and/or data processing – the platform will shard the data evenly on several instances of the Processing Unit, so that each instance will handle a subset of the data independently of the other instances.

Redundant hot copies spread across zones

You can divide your deployment environment into virtual zones. These zones can represent different data centers, different cloud infrastructure vendors, or any physical or logical division you see fit. Then you can tell the platform (as part of the SLA) not to place both primary and its backup instances of the Processing Unit on the same zone – thus making sure the data stored within the Processing Unit is backed up on two different zones. This will provide your application resilience over two data centers, two cloud vendors, two regions, depending on your required resilience, all with uniform development API. You want higher level of resilience? Just define more zones and more backups for each PU.

Spread across several public cloud vendors and/or private cloud

GigaSpaces abstracts the details of the underlying infrastructure from your application. GigaSpaces’ Multi-Cloud Adaptor technology provides built-in integration with several major cloud providers, including the JClouds open source abstraction layer, thus supporting any cloud vendor that conforms to the JClouds standard. So all you need to do is plug in your desired cloud providers into the platform, and your application logic remains agnostic to the cloud infrastructure details. Plugging in two vendors to ensure resilience now becomes just a matter of configuration. The integration with JClouds is an open-source project under OpenSpaces.org, so feel free to review and even pitch in to extend and enhance integration with cloud vendors.

Automation and Monitoring

GigaSpaces offers a powerful set of tools that allow you to automate your system. First, it offers the Elastic Processing Unit, which can automatically monitor CPU and memory utilization and employ corrective actions based on your defined SLA. GigaSpaces also offers a rich Administration and Monitoring API that enables administration and monitoring of all the GigaSpaces services and components and layers running beneath the platform such as transport layer and, machine and operating system. GigaSpaces also offers a web-based dashboard and a management center client. Another powerful tool for monitoring and automation is the administrative alerts that can be configured and then viewed through GigaSpaces or external tools (e.g. via SNMP traps).

Avoiding ACID services and leveraging on NoSQL solutions

GigaSpaces does not rule out SQL for querying your data. We believe that true NoSQL stands for “Not Only SQL”, and that SQL as a language is good for certain uses, whereas other uses require other query semantics. GigaSpaces supports some of the SQL language through its SQLQuery API or through standard JDBC . However, GigaSpaces also provides a rich set of alternative standards and protocols for accessing your data, such as Map API for key/value access, Document API for dynamic schemas, Object-oriented (through proprietary Space API or standard JPA), and Memcached protocol.

Another challenge of the traditional relational databases is scaling data storage in read-write environment. The distributed relational databases were enough to deal with read-mostly environments. But Web2.0 brought social concepts into the web, with customers feeding data into the websites. Several NoSQL solutions try to address distributed data storage and querying. GigaSpaces provides this via its support for clustered topology of the in-memory data grid (the “space”) and for distributing queries and execution using patterns such as Map/Reduce and event-driven design.

Load Balancing

The elastic natureof the GigaSpaces platform allows it to automatically detect the CPU and memory capacity of the  deployment environment and optimize the load dynamically based on your defined SLA, instead of employing arbitrary division of the data into fixed zones. Such dynamic nature also allows your system to adjust in case of a failure of an entire zone (such as what happened with Amazon’s availability zones) so that your system doesn’t go down even in such extreme cases, and maintains optimal balance under the new conditions.

Furthermore, GigaSpaces platform supports content-based routing, which allow for smart load balancing based on your business model and logic. Content-based routing allows your application to route related data to the same host and then execute entire business flows within the same JVM, thus avoiding network hops and complex distributed transaction locking that hinder your application’s scalability.

Conclusion

Most significant advancements do not happen in slow gradual steps but rather in leaps. These leaps happen when the predominant conception crashes in face of the new reality, leaving behind chaos and uncertainty, and out of the chaos then emerges the next stage in the maturity of the system.

This is the case with the maturity process of the young cloud realm as well: the AWS outage was a major reality check that opened the eyes of the IT world to see that their systems crashed with AWS because they counted on their cloud infrastructure provider to handle your application’s high-availability and elasticity using its generic logic. This concept proved to be wrong. Now the IT world is in confusion, and many discussions are done on whether the faith in cloud was mistaken, with titles like “EC2 Failure Feeds Worries About Cloud Services”.

The next step in the cloud’s maturity was the realization that cloud infrastructure is just infrastructure, and that you need to implement your application correctly, using patterns and best practices such as the ones I raised in my previous post, to leverage on the cloud infrastructure to gain high-availability and elasticity.

The next step in the evolution is to start leveraging on designated application platforms that will handle these concerns for you and virtualize the cloud from your application, so that you can simply define the SLA for your application for high-availability and elasticity, and leave it up to the platform to manipulate the cloud infrastructure to enforce your SLA, while you concentrate on writing your application’s business logic. As Forrester said:

… A new generation of application platforms for elastic applications is arriving to help remove this barrier to realizing cloud’s benefits. Elastic application platforms will reduce the skill required to design, deliver, and manage elastic applications, making automatic scaling of cloud available to all shops …

 

1311765722_picons03
Follow Dotan on Twitter!

7 Comments

Filed under Cloud, PaaS

Retrospect on recent AWS Outage and Resilient Cloud-Based Architecture

According to the television series “Terminator: the Sarah Connor Chronicles”, Skynetcomputer system began its attack against humanity on April 21, 2011. Luckily that hasn’t happened (or has it?) but on that very day another predominant computing system provided us with a painful reminder on how much humanity relies on computers to run the world.

A couple of weeks ago, on April 21, the IT world experienced a tsunami: Amazon Web Services (AWS) cloud went down in the US East Region for over 3 days (!!), and took down with it numerous systems and services that rely on AWS such as HootSuite, Reddit, Foursquare, Quora and many more, with a damage estimated at 2M$. The affected services were EC2 and RDS and Amazon provided a detailed technical summaryof the event.

This tsunami was a wake-up call. This wasn’t the first outage in the cloud arena, and not even the first of Amazon (in fact this is their second outage this year). But its impact was so vast that it finally brought the realization that cloud is not a silver bullet. Those who counted on the generic cloud’s provision of scalability and resilience did not survive the AWS outage. The resilience of your application (as well as scalability) does not exist unless you take care of it.
So how do we achieve this resilience in our application?

Why not learn from the experience of those who survived the outage?
As a cloud evangelist, I was intrigued by the history of the outage as it occurred. There were great posts during and after the outage from those who went down. But more interestingly for me as architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

In this post I’d like to summarize the patterns, principles and best practices that emerge from these posts, as I believe we can learn a lot from them on how to design our business applications to truly leverage on the benefits that the cloud offers in resilience and scalability.

Patterns, Guidelines and Best Practices

Design for failure

The first and fundamental principle in building robust architecture is to design for failure. As SmugMug states:

… we designed for failure from day one. Any of our instances, or any group of instances in an AZ [Availability Zone], can be “shot in the head” and our system will recover …

This principle should be prevalent during design, development, deployment and maintenance stages of the system. SimpleGeo presents an excellent work practice:

… At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we’re coding, and we joke about it at lunch …

Some companies even embedded random failure simulation in their work procedures, such as SmugMug:

… once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you …

Netflix even automated random failure simulation using a designated Chaos Monkey service to get their engineering team used to a “constant level of failure in the cloud”.

Stateless and autonomous services

If possible, divide your business logic into stateless services, to allow easy fail-over and scalability. Netflix explained the fail-over benefits:

… if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it …

Twilio aggregates their stateless services into homogeneous pools, which provides them both fail-over and elasticity:

… The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load …

To contain the ripple effect of the failure, make the services well-encapsulated, as SmugMug states:

… Make your system divided into well-encapsulated components that can fail individually without failing the entire system …

Redundant hot copies spread across zones

By replicating your data to other zones, you insulate your service from zone-wide failure, as SmugMug, Twilio, Netflix and others describe. As Netflix explains:

… we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby …

Twilio also emphasizes the configuration of timeout and retry to avoid delays in failing over to another copy:

… By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services …

Spread across several public cloud vendors and/or private cloud

Most IT organizations avoid depending on a single ISP by having another ISP as backup. Even Amazon is using this strategy internally to ensure high-availability of their cloud by using a primary and a backup network. Similarly, you would like to avoid dependency on a single cloud vendor by having another vendor as backup. This holds true even if the vendor provides a certain level of resilience, as we saw with Amazon’s multi-AZ failure on the recent outage. Many of the companies that survived the recent AWS outage owe it to using their own datacenters, to using other vendors, or to using the US West Region of AWS. SmugMug for instance kept their critical data on their own datacenter:

… the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance …

and also recommends to “spread across many providers”, although admitting that

… This is getting more and more difficult as AWS continues to put distance between themselves and their competitors …

When considering using different regions of AWS for resilience, it’s interesting to note Amazon’s statement about the effort required on your application’s side to work with multiple regions, which makes you wonder if it’s that much easier than working with a different vendor altogether:

… if you want to move data between Regions, you need to do it via your applications as we don’t replicate any data between Regions on our users’ behalf. You also need to use a separate set of APIs to manage each Region. Regions provide users with a powerful availability building block, but it requires effort on the part of application builders to take advantage of this isolation …

Automation and Monitoring

Automation is the key. Your application needs to automatically pick up alerts on system events, and should be able to automatically react to the alerts. As SimpleGeo architect states:

… Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated …

Interesting to see that even Netflix that took pride in surviving the failure, admitted that the manual responses their engineers used this time will not work in the future, as they grow to a

… worldwide service with dozens of availability zones, even with top engineers we simply won’t be able to scale our responses manually …

Detailed alerting mechanisms are also essential to the manual control of the system, as Bizo states:

… we have our own internal alarms and dashboards that give us up to the minute metrics such as request rate, cpu utilization etc. …

Avoiding ACID services and leveraging on NoSQL solutions

The CTO of SimpleGeo recommends avoiding to rely on ACID services, as it inhibits the distributed nature of the cloud. In order to achieve that, Twilio recommends to “relax consistency requirements”. Netflix implemented that by

… leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees …

Load Balancing

Use dynamic balancing, regardless of the zone. When balancing equally by zone, like Amazon’s Elastic Load Balancer (ELB) does, if a zone fails this can bring down the system.

… Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention …

Conclusion

Recent AWS outage serves as an important lesson to the IT world, and an important milestone in our maturity in using the cloud. The most important thing to do now is to learn from the mistakes made by those who went down with AWS, as well as from the success of the ones who survived it, and come up with proper methodology, patterns, guidelines and best practices on doing it right, so that Skynet will not take down humanity.

Update: added links to my follow-up posts with some more thoughts I had following subsequent AWS outages, around Disaster Recovery Policies and Elastic Application Platforms.

Resources

· Netflix, http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

· Joe Stump, SimpleGeo, http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

· SimpleGeo, http://developers.simplegeo.com/blog/2011/04/26/how-simplegeo-stayed-up/

· Ted Theodoropoulos, http://blog.acrowire.com/cloud-computing/failing-to-plan-is-planning-to-fail

· http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

· SmugMug, http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse

· Twilio, http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

· Bizo, http://dev.bizo.com/2011/04/how-bizo-survived-great-aws-outage-of.html

· Heroku, http://status.heroku.com/incident/151

· http://groups.google.com/group/cloud-computing/browse_thread/thread/e8079a54e6a8c4b9/72756bf9e587869d?show_docid=72756bf9e587869d

· Amazon, http://aws.amazon.com/message/65648/

· Amazon, http://aws.amazon.com/architecture/

1311765722_picons03
Follow Dotan on Twitter!

13 Comments

Filed under Cloud, IaaS