Scaling All The Way: Welcoming Scala as a First-Level Citizen in GigaSpaces XAP

Scala is a hot topic in the programming world. I’ve been following Scala for quite a while, and about 2 years ago I endeavored programming in Scala on top of GigaSpaces XAP (disclaimer: I’m solution architect for GigaSpaces). XAP had no built-in support for Scala, but I leveraged XAP’s native support for Java together with Scala’s compatibility with Java.

And it worked like a charm!

You can read the full details in my previous blog post. I concluded my blog post by saying that

… programming in Scala on top of XAP is a viable notion that deserves further investigation.

However, I also added a disclaimer that

… XAP platform offers a vast array of features, and that the Scala language offers a vast array of constructs, very few of which have been covered on this experiment. Similarly, I should also state that Scala is not officially supported by the XAP product, which means that there is no official support or test coverage of Scala in the product.

Further explorations of more advanced Scala usage, also together with concrete customer use cases, showed that although possible, the resulting code is less intuitive for Scala users, and does not fully utilize the elegant constucts of the Scala language.

Two years went by, and now we decided to take our relationship with Scala to the next level, and make Scala programming on XAP much more intuitive with things such as better support for immutable objects, functional querying using predicates, Scala script execution and enhanced REPL shell. XAP now also exposes some of the platform’s powerful mechanisms for distributed and scalable processing, such as remote script execution and Map/Reduce pattern, over native Scala. These goodies have just been unveiled as part of the latest release of XAP (XAP 9.6). Let’s give you a taste of some of the goodies.

Predicate-based queries

You can now run queries based on Scala predicates just as you’re used to for functional querying:

val pGigaSpace = gigaSpace.predicate
val person = { person: Person =>
person.age > 25 || == personName }

This will be compiled into the XAP native SQL query mechanism, so no runtime overhead and you get all the optimizations available on the platform’s SQL query engine and indexing.

Scatter/Gather and Map/Reduce patterns

XAP contains a mechanism called Task Execution which provides easy implementation of the Scatter/Gather and Map/Reduce patterns.





You can now define both the Scatter and the Gather as Scala functions and dispatch onto a cluster of nodes with a simple invocation:

val asyncFuture2 = gigaSpace.execute(
{ gigaSpace: GigaSpace => },
{ results: Seq[AsyncResult[String]] => { _.getResult() } mkString } )

Remote and parallel execution of Scala scripts over a cluster

What if you want to execute your Scala script across a cluster of compute nodes (a-la compute grid)? Maybe even colocated with the data found on these nodes (a-la in-memory data grid)? This is now easily achievable using XAP’s Dynamic Language Tasks (which also supports other dynamic language scripts such as JavaScript and Groovy):

val script = new ScalaTypedStaticScript("myScript", "scala", code)
  .parameter("someNumber", 1)
  .parameter("someString", "str")
  .parameter("someSet", Set(1,2,3), classOf[Set[_]]) 
val result = executor.execute(script)

Final words

XAP is a multi-language and multi-interface platform. You can write in Java, .NET, C++;you can use standards such as SQL, JPA and Spring; you can use it as a key-value store; you may even choose to store your data in document format instead of objects, to support semi-structured model. So enhancing XAP to support Scala was but a natural move.


In my blog post two years ago I concluded saying that

… [Scala] is an exciting option worth exploring, and who knows, if Scala becomes predominant it may one day become an official feature of the product.

Finally the day has come and Scala made its first steps in becoming a first-level citizen in GigaSpaces XAP. What I described above is just part of it. You can read the full listing here.  There are yet many more things to do in order to fully expose XAP platform’s rich functionality through Scala functional language, such as full support for immutable types. Now it’s time for the user community to check it out. So go ahead and play with it and let us know what you think you need to make your application scale all the way with XAP and Scala.

Leave a comment

Filed under Programming Languages

Enterprises Taking Off to the Cloud(s)

Cloud Deployment: the enterprise angle

The Cloud is no longer the exclusive realm of the young and small start up companies. Enterprises are now joining the game and examining how to migrate their application ecosystem to the cloud. A recent survey conducted by research firm MeriTalk showed that one-third of respondents say they plan to move some mission-critical applications to the cloud in the next year. Within two years, the IT managers said they will move 26 percent of their mission-critical apps to the cloud, and in five years, they expect 44 percent of their mission-critical apps to run in the cloud. Similar results arise from surveys conducted by HP, Cisco and others.

SaaS on the rise in enterprises

Enterprises are replacing their legacy applications with SaaS-based applications. A comprehensive survey published by Gartner last week, which surveyed nearly 600 respondents in over 10 countries, shows that

Companies are not only buying into SaaS (software as a service) more than ever, they are also ripping out legacy on-premises applications and replacing them with SaaS

IaaS providers see the potential of the migration of enterprises to the cloud and adapt their offering. Amazon, having spearheaded Cloud Infrastructure, leads with on-boarding enterprise applications to their AWS cloud. Only a couple of weeks ago Amazon announced that AWS is now certified to run SAP Business Suite (SAP’s CRM, ERP, SCM, PLM) for production applications. That joins Microsoft SharePoint and other widely-adopted enterprise business applications now supported by AWS, which helps enterprises migrate their IT to AWS easier than ever before.

Mission-critical apps call for PaaS

Running your CRM or ERP as SaaS in the cloud is very useful. But what about your enterprise’s mission-critical applications? Whether in the Telco, Financial Services, Healthcare or  other domains, the core business of the organization’s IT usually lies in the form of a complex ecosystem of 100s of interacting applications. How can we on-board the entire ecosystem in a simple and consistent manner to the cloud? One approach that gains steam for such enterprise ecosystems is using PaaS. Gartner predicting PaaS will increase from “three percent to 43 percent of all enterprises by 2015”.

Running your ecosystem of applications on a cloud-based platform provides a good way to build applications for the cloud in a consistent and unified manner. But what about legacy applications? Many of the mission-critical applications in enterprises are ones that have been around for quite some time and were not designed for the cloud and are not supported by any cloud provider. Migrating such applications to the cloud often seems to call for major overhaoul, as stated in MeriTalk’s report on the Federal market:

Federal IT managers see the benefits of moving mission-critical applications to the cloud, but they say many of those application require major re-engineering to modernize them for the cloud

The more veteran PaaS vendors such as Google App Engine and Heroku provide great productivity for developing new applications, but do not provide answer for such legacy applications, which gets us back to square one, having to do the cloud migration ourselves. This migration work seems too daunting for most enterprises to even dare, and that is one of the main inhibitors for cloud adoption despite the incentives.

It is only recently that organizations have started to use PaaS for critical functions, examining PaaS for mission-critical applications. According to a recent survey conducted by Engine Yard among some 162 management and technical professionals of various companies:

PaaS is now seen as a way to boost agility, improve operational efficiency, and increase the performance, scalability, and reliability of mission-critical applications.

What IT organizations are looking for is a way to on-board their existing application ecosystem to the cloud in a consistent manner as provided with the PaaS, but while having IaaS-like low-level control over the environment and the application life cycle. IT organizations seek the means to keep the way they are used to doing things in the data center even when moving to the cloud. A new class of PaaS products emerged over the past couple of years to answer this need, with products such as OpenShift, CloudFoundry and Cloudify. In my MySQL example discussion I demonstrated how the classic MySQL relational database can be on-boarded to the cloud using Cloudify without need for re-engineering MySQL, and without locking into any specific IaaS vendor API.


Enterprises are migrating their applications to the cloud in an increasing rate. Some applications are easily migrated using existing SaaS offering. But the mission-critical applications are complex and call for PaaS for on-boarding them to the cloud. If the mission-critical application contains legacy systems or requires low-level control of OS and other environment configuration then not every PaaS would fit the job. There are many cloud technologies, infrastructure, platforms, tools and vendors out there, and the right choice is not trivial. It is important to make proper assessment of the enterprise system at hand and choose the right tool for the job, to ensure smooth migration, avoid re-engineering as much as possible, and keep flexible to accomodate for future evolution of the application.

If you are interested in consulting around assessment of your application’s on-boarding to the cloud, feel free to contact me directly or email

1311765722_picons03 Follow Dotan on Twitter!

Leave a comment

Filed under cloud deployment, IaaS, PaaS

AWS Outage: Moving from Multi-Availability-Zone to Multi-Cloud

A couple of days ago Amazon Web Services (AWS) suffered a significant outage in their US-EAST-1 region. This has been the 5th major outage in that region in the past 18 months. The outage affected leading services such as Reddit, Netflix, Foursquare and Heroku.

How should you architect your cloud-hosted system to sustain such outages? Much has been written on this question during this outage, as well as past outages. Many recommend basing your architecture on multiple AWS Availability Zones (AZ) to spread the risk. But during this outage we saw even multi-Availability Zone applications severely affected. Even Amazon published during the outage that

Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery.

The reason is that there is an underlying infrastructure that escalates the traffic from the affected AZ to other AZ in a way that overwhelms the system. In the case of this outage it was the AWS API Platform that was rendered unavailable, as nicely explained in this great post:

The waterfall effect seems to happen, where the AWS API stack gets overwhelmed to the point of being useless for any management task in the region.

But it doesn’t really matter for us as users which exact infrastructure it was that failed on this specific outage. 18 months ago, during the first major outage, the reason was another infastructure component, the Elastic Block Store (“EBS”) volumes, that cascaded the problem. Back then I wrote a post on how to architect your system to sustain such outages, and one of my recommendations was:

Spread across several public cloud vendors and/or private cloud

The rule of thumb in IT is that there will always be extreme and rare situations (and don’t forget, Amazon only commits to 99.995% SLA) causing such major outages. And there will always be some common infrastructure that under that extreme and rare situation will carry the ripple effect of the outage to other Availability Zones in the region.

Of course, you can mitigate risk by spreading your system across several AWS Regions (e.g. between US-EAST and US-WEST), as they have much looser coupling, but as I stated on my previous post, that loose coupling comes with a price: it is up to your application to replicate data, using a separate set of APIs for each region. As Amazon themselves state: “it requires effort on the part of application builders to take advantage of this isolation”.

The most resilient architecture would therefore be to mitigate risk by spreading your system across different cloud vendors, to provide the best isolation level. The advantages in terms resilience are clear. But how can that be implemented, given that the vendors are so different in their characteristics and APIs?

There are 2 approaches to deploying across multiple cloud vendors and keeping cloud-vendor-agnostic:

  1. Open Standards and APIs for cloud API that will be supported by multiple cloud vendors. That way you write your application using a common standard and have immediate support by all conforming cloud vendors. Examples for such emerging standards are OpenStack and JClouds. However, the Cloud is still a young domain with many competing standards and APIs and it is yet to be determined which one shall become the de-facto standard of the industry and where to “place our bet”.
  2. Open PaaS Platforms that abstract the underlying cloud infrastructure and provide transparent support for all major vendors. You build your application on top of the platform, and leave it up to the platform to communicate to the underlying cloud vendors (whether public or private clouds, or even a hybrid). Examples of such platforms, are CloudFoundry and Cloudify. I dedicated one of my posts for exploring how to build your application using such platforms.


System architects need to face the reality of the Service Level Agreement provided by Amazon and other cloud vendors and their limitations, and start designing for resilience by spreading across isolated environments, deploying DR sites, and by similar redundancy measures to keep their service up-and-running and their data safe. Only that way can we guarantee that we will not be the next one to fall off the 99.995% SLA.

This post was originally posted here.


Filed under cloud deployment, Disaster-Recovery, IaaS, PaaS, Solution Architecture, Uncategorized

Cloud Deployment: It’s All About Cloud Automation

Not only for modern applications

Many organizations are facing the challenge of migrating their IT to the cloud. But not many know how to actually approach this undertaking. In my recent post – Cloud Deployment: The True Story – I started sketching best practices for performing the cloud on-boarding task in a manageable fashion. But many think this methodology is only good for modern applications that were built with some dynamic/cloud orientation in mind, such as Cassandra NoSQL DB from my previous blog, and that existing legacy application stacks cannot use the same pattern. For example, how different would the cloud on-boarding process be if I modify the PetClinic example application from my previous post to use a MySQL relational database instead of the modern Cassandra NoSQL clustered database? In this blog post I intend to demonstrate that cloud on-boarding of brownfield applications doesn’t have to be a huge monolithic migration project with high risk. Cloud on-boarding can take the pragmatic approach and can be performed in a gradual process that both mitigates the risk and enables you to enjoy the immediate benefits of automation and easier management of your application’s operational lifecycle even before moving to the cloud.

MySQL case study

Let’s look at the above challenge of taking a standard and long-standing MySQL database and adapt it to the cloud. In fact, this challenge was already met by Amazon for their cloud. Amazon Web Services (AWS) include the very popular Relational Database Service (RDS). This service is an adaptation of a MySQL database to the Amazon cloud. MySQL DB was not built or designed for cloud environment, and yet it proved highly popular, and even the new SimpleDB service that Amazon built from scratch with cloud orientation in mind was unable to overthrow the RDS reign. The adaptation of MySQL to AWS was achieved using some pre-tuning of MySQL to the Amazon environment and extensive automation of the installation and management of the DB instances. The case study of Amazon RDS can teach us that on-boarding existing application is not only doable but may even prove better than developing a new implementation from scratch to suit the cloud.

I will follow the MySQL example throughout this post and examine how this traditional pre-cloud database can be made ready for the cloud.

Automation is the key

We have our existing application stack running within our data center, knowing nothing of the cloud, and we would like to deploy it to the cloud. How shall we begin?

Automation is the key. Experts say automated application deployment tools are a requirement when hosting an application in the cloud. Once automation is in place, and given a PaaS layer that abstracts the underlying IaaS, your application can easily be migrated to any common cloud provider with minimal effort.

Furthermore, automation has a value in its own right. The emerging agile movements such as Agile ALM (Application Lifecycle Management) and DevOps endorse automation as a means to support the Continuous Deployment methodology and ever-increasing frequency of releases to multiple environments. Some even go beyond DevOps and as far as NoOps. Forrester analyst Mike Gualtieri states that “NoOps is the peak of DevOps”, where “DevOps Is About Collaboration; NoOps Is About Automation“:

DevOps is a noble and necessary movement for immature organizations. Mature organizations have DevOps down pat. They aspire to automate to speed release increments.

This value of automation in providing a more robust and agile management of your application is a no-brainer and will prove useful even before migrating to the cloud. It is also much easier to test and verify the automation when staying in the well-familiar environment in which the system has been working until now. Once deciding to migrate to the cloud, automation will make the process much simpler and smoother.

Automating application deployment

Let’s take the pragmatic approach. The first step is to automate the installation and deployment of the application in the current environment, namely within the same data center. We capture the operational flows of deploying the application and start automating these processes, either using scripts or using higher-level DevOps automation tools such as Chef and Puppet for Change and Configuration Management (CCM).

Let’s revisit our MySQL example: MySQL doesn’t come with built-in deployment automation. Let’s examine the manual processes involved with installing MySQL DB from scratch and capture that in a simple shell script so we can launch the process automatically:

This script is only the basics. A more complete automation should take care of additional concerns such as super-user permissions, updating ‘yum’, killing old processes and cleaning up previous installations, and maybe even handling differences between flavors of Linux (e.g. Ubuntu’s quirks…). You can check out the more complete version of the installation script for Linux here (mainstream Linux, e.g. RedHat, CentOS, Fedora), as well as a variant for Ubuntu (adapting to its quirks) here. This is open source and a work in progress so feel free to fork the GitHub repo and contribute!

Automating post-deployment operations

Once automation of the application deployment is complete we can then move to automating other operational flows of the application’s lifecycle, such as fail-over or shut down of the application. This aligns with cloud on-boarding, since “Deployment in the cloud is attached to the whole idea of running the application in the cloud”, as Paul Burns, president and analyst at Neovise, says:

People don’t say, ‘Should I automate my deployment in the cloud?’ It’s, ‘Should I run it in the cloud?’ Then, ‘How do I get it to the cloud?’

In our MySQL example we will of course want to automate the start-up of the MySQL service, stopping it and even uninstalling it. More interestingly, we may also want to automate operational steps unique to MySQL such as granting DB permissions, creating a new database, generating a dump (snapshot) of our database content or importing a DB dump to our database. Let’s look at a snippet to capture and automate dump generation. This time we’ll use the Groovy scripting language which provides higher-level utilities for automation and better yet it is portable between OSs, so we don’t have the headache as we described above with Ubuntu (not to mention Windows …):

Adding automation of these post-deployment steps will provide us with end-to-end automation of the entire lifecycle of the application from start-up to tear-down within our data center. Such automation can be performed using elaborate scripting, or can leverage modern open PaaS platforms such as CloudFoundry, Cloudify, and OpenShift to manage the full application lifecycle. For this MySQL automation example I used the Cloudify open source platform, where I modeled the MySQL lifecycle using a Groovy-based DSL as follows:

As you can see, the lifecycle is pretty clear from the DSL, and maps to individual scripts similar to the ones we scripted above. We even have the custom commands for generating dumps and more. With the above in place, we can now install and start MySQL automatically with a single command line:

install-service mysql

Similarly, we can later perform other steps such as tearing it down or generating dumps with a single command line.

You can view the full automation of the MySQL lifecycle in the scripts and recipes in this GitHub repo.

Monitoring application metrics

We may also want to have better visibility into the availability and performance of our application for better operational decision-making, whether for manual processes (e.g. via logs or monitoring tools) or automated processes (e.g. auto-scaling based on load). This is becoming common practice in methodologies such as Application Performance Management (APM). This will also prove useful once in the cloud, as visibility is essential for successful cloud utilization. Rick Blaisdell, CTO at ConnectEDU, explains:

… the key to successful cloud utilization lays in the management and automation tools’ capability to provide visibility into ongoing capacity

In our MySQL example we can sample several interesting metrics that MySQL exposes (e.g. using the SHOW STATUS syntax or ‘mysqladmin’), such as the number of client connections, query counts or query throughput.


On-boarding existing applications to the cloud does not have to be a painful and high-risk migration process. On boarding can be done in a gradual “baby-step” manner to mitigate risk.

The first step is automation. Automating your application’s management within your existing environment is a no-brainer, and has its own value in making your application deployment, management and monitoring easier and more robust.

Once automation of the full application lifecycle is in place, migrating your application to the cloud becomes smooth sailing, especially if you use PaaS platforms that abstract the underlying cloud provider specifics.

This post was originally posted here.

For the full MySQL cloud automation open source code see this public GitHub repo. Feel free to download, play around, and also fork and contribute.


Filed under Cloud, cloud automation, cloud deployment, DevOps, IaaS, PaaS

AWS Outage – Thoughts on Disaster Recovery Policies

A couple of days ago it happened again. On June 14 around 9 pm PDT Amazon AWS hit a power outage in its Northern Virginia data center, affecting EC2, RDS, Elastic Beanstalk and other services in the US-EAST region. The AWS status page reported:

Some Cache Clusters in a single AZ in the US-EAST-1 region are currently unavailable. We are also experiencing increased error rates and latencies for the ElastiCache APIs in the US-EAST-1 Region. We are investigating the issue.

This outage affected major sites such as Quora, Foursquare, Pinterest, Heroku and Dropbox. I followed the outage reports, the tweets, the blog posts, and it all sounded all too familiar. A year ago AWS faced a mega-outage that lasted over 3 days, when another datacenter (in Virginia, no less!) went down, and took down with it major sites (Quora, Foursquare… ring a bell?).

Back during last year’s outage I analyzed the reports of the sites that managed to survive the outage, and compiled a list of field-proven guidelines and best practices to apply in your architecture to make it resilient when deployed on AWS and other IaaS providers. I find these guidelines and best practices highly useful in my architectures. I then followed up with another blog post suggesting using designated software platforms to apply some of the guidelines and best practices.

On this blog post I’d like to address one specific guideline in greater depth – architecting for Disaster Recovery.

Disaster Recovery – Characteristics and Challenges

PC Magazine defines Disaster Recovery (DR):

A plan for duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.

DR Planning is a common practice since the days of the mainframes. An interesting question is why this practice is not as widespread in cloud-based architectures. In his recent post “Lessons from the Heroku/Amazon Outage” Nati Shalom, GigaSpaces CTO, analyzes this apparent behavior, and suggests two possible causes:

  • We give up responsability when we move to the cloud - When we move our operation to the cloud we often assume that were outsourcing our data center operation completly, that include our Disaster-Recovery procedures. The truth is that when we move to the cloud were only outsourcing the infrastructure not our operation and the responsability of using this infrastructure remain ours.
  • Complexity - The current DR processes and tools were designed for a pre-cloud world and doesn’t work well in a dynamic environment as the cloud. Many of the tools that are provided by the cloud vendor (Amazon in this sepcific case) are still fairly complex to use.

I addressed the first cause, the perception that cloud is a silver bullet that lets people give up responsibility on resilience aspects, in my previous post. The second cause, the lack of tools, is usually addressed by DevOps tools such as ChefPuppetCFEngine and Cloudify, which capture the setup and are able to bootstrap the application stack on different environments. In my example I used Cloudify to provide consistent installation between EC2 and RackSpace clouds.

Making sure your architecture incorporates a Disaster Recovery Plan is essential to ensure the business continuity, and avoid cases such as the ones seen over Amazon’s outages. Online services require the Hot Backup Site architecture, so the service can stay up even during the outage:

A hot site is a duplicate of the original site of the organization, with full computer systems as well as near-complete backups of user data. Real time synchronization between the two sites may be used to completely mirror the data environment of the original site using wide area network links and specialized software.

DR sites can be in Active/Standby architecture (as was in traditional DRPs), where the DR site starts serving only upon outage event, or they can be in Active/Active architecture (the more modern architectures). In his discussion on assuming responsibility, Nati states that DR architecture should assume responsibility for the following aspects:

  • Workload migration - specifically the ability to clone our application environment in a consistent way across sites in an on demand fashion.
  • Data Synchronization - The ability to maintain real time copy of the data between the two sites.
  • Network connectivity - The ability to enable flow of netwrok traffic between between two sites.

I’d like to experiment with an example DR architecture to address these aspects, as well as addressing Nati’s second challange - Complexity. In this part I will use an example of a simple web app and show how we can easily create two sites on-demand. I would even go as far as setting this environment on two seperate clouds to show how we can ensure even higher degree of redundancy by running our application across two different cloud providers.

A step-by step example: Disaster Recovery from AWS to RackSpace

Let’s put up our sleeves and start experimenting hands-on with DR architecture. As reference application let’s take Spring’s PetClinic Sample Application and run it on an Apache Tomcat web container. The application will persist its data locally to a MySQL relational database. On my experiment I used Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do. We tried the same example with a combination of HP Cloud Services and a flavor of a Private cloud.

Data synchronization over WAN

How do we replicate data between the MySQL database instances over WAN? On this experiment we’ll use the following pattern:

  1. Monitor data mutating SQL statements on source site. Turn on the MySQL query log, and write a listener (“Feeder”) to intercept data mutating SQL statements, then write them to GigaSpaces In-Memory Data Grid.
  2. Replicate data mutating SQL statements over WAN. I used GigaSpaces WAN Replication to replicate the SQL statements  between the data grids of the primary and secondary sites in a real-time and transactional manner.
  3. Execute data mutating SQL statements on target site. Write a listener (“Processor”) to intercept incoming SQL statements on the data grid and execute them on the local MySQL DB.

To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.

Workload migration

I would like to address the complexity challenge and show how to automate setting up the site on demand. This is also useful for Active/Standby architectures, where the DR site is activated only upon outage.

In order to set up a site for service, we need to perform the following flow:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when peak load is over – perform the reverse flow to tear down the secondary site

We would like to automate this bootstrap process to support on-demand capabilities in the cloud as we know from traditional DR solutions. I used GigaSpaces Cloudify open-source product as the automation tool for setting up and for taking down the secondary site, utilizing the out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing  in case of VM or process failure, and can later help in scaling the application (in case of clustered applications).

Network Connectivity

The network connectivity between the primary and secondary sites can be addressed in several ways, ranging from load-balancing between the sites, through setting up VPN between the sites, and up to using designated products such as Cisco’s Connected Cloud Solution.

In this example I went for a simple LB solution using RackSpace’s Load Balancer Service to balance between the web instances, and automated the LB configuration using Cloudify to make the changes as seamless as possible.
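Conceptually, what the automated LB configuration does is maintain the balancer's node list as sites come and go. Here is a hypothetical, minimal round-robin pool illustrating that bookkeeping; it is not the RackSpace Load Balancer Service API, just a sketch of the state the Cloudify recipe updates on its behalf.

```python
class LoadBalancerPool:
    """Minimal round-robin pool standing in for the balancer's node list."""

    def __init__(self):
        self.nodes = []
        self._next = 0

    def add_node(self, address):
        # Called when a secondary-site web instance comes up.
        if address not in self.nodes:
            self.nodes.append(address)

    def remove_node(self, address):
        # Called when the secondary site is torn down; traffic is drained
        # back to the remaining nodes.
        self.nodes.remove(address)
        self._next = 0

    def route(self):
        # Pick the next node round-robin.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node
```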

Implementation Details

The application is actually a re-use of an application I wrote recently to experiment with Cloud Bursting architectures, since Cloud Bursting follows the same architectural guidelines as DR (Active/Standby DR, to be exact). The result of the experimentation is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.


Cloud-hosted applications should take care of the non-functional requirements of the system, including resilience and scalability, just as on-premise applications do. Systems that neglect to incorporate these considerations into their architecture, relying solely on the underlying cloud infrastructure, end up severely affected by a cloud outage such as the one experienced a few days ago in AWS. In my previous post I listed some guidelines, an important one of which is Disaster Recovery, which I explored here, suggesting possible architectural approaches and an example implementation. I hope this discussion raises awareness in the cloud community and helps mature cloud-based architectures, so that in the next outage we will not see as many systems go down.

Follow Dotan on Twitter!


Filed under Cloud, DevOps, Disaster-Recovery, IaaS, Solution Architecture, Uncategorized

Bursting into the Clouds – Experimenting with Cloud Bursting

Who needs Cloud Bursting?

We see many organizations examining the cloud as a replacement for their existing in-house IT. But we see interest in the cloud even among organizations that have no plans to replace their traditional data center. One prominent use case is Cloud Bursting:

Cloud bursting is an application deployment model in which an application runs in a private cloud or data center and bursts into a public cloud when the demand for computing capacity spikes. The advantage of such a hybrid cloud deployment is that an organization only pays for extra compute resources when they are needed.
[Definition from SearchCloudComputing]

Cloud Bursting appears to be a prominent use case in cloud on-boarding projects. In a recent post, Nati Shalom nicely summarizes the economic rationale for cloud bursting and discusses theoretical approaches to its architecture. In this post I’d like to examine the architectural challenges more closely and explore possible designs for Cloud Bursting.

Examining Cloud Bursting Architecture

Overflowing compute to the cloud is addressed by workload migration: when we need more compute power we just spin up more VMs in the cloud (the secondary site) and install instances of the application. The challenge in workload migration is how to build an environment in the secondary site consistent with the primary site, so the system can overflow transparently. This is usually addressed by DevOps tools such as Chef, Puppet, CFEngine and Cloudify, which capture the setup and are able to bootstrap the application stack in different environments. In my example I used Cloudify to provide a consistent installation between the EC2 and RackSpace clouds.

The Cloud Bursting problem becomes more interesting when data is concerned. In his post Nati mentions two approaches for handling data during cloud bursting:

1. The primary site approach - Use the private cloud as the primary data site, and then point all the burst activity to that site.
2. Federated site approach - This approach is similar to the way Content Distribution Networks (CDN) work today. With this approach we maintain a replica of the data available at each site and keep their replicas in sync.

The primary site approach incurs a heavy latency penalty, as each computation needs to make the round trip to the primary site to fetch the data it operates on. Such an architecture is not applicable to online flows.

The federated site approach uses data synchronization to bring the data to the compute, which avoids the above latency and enables online flows. But if we want to support “hot” bursting to the cloud, we have to replicate the data between the sites in an ongoing, streaming fashion, so that the data is already available in the cloud when the peak occurs and we can spin up compute instances and immediately start redirecting load. Let’s see how it’s done.

Cloud Bursting – Examining the Federated Site Approach

Let’s roll up our sleeves and start experimenting hands-on with the federated site approach to Cloud Bursting architecture. As a reference application let’s take Spring’s PetClinic Sample Application and run it on an Apache Tomcat web container. The application will persist its data locally to a MySQL relational database.

The primary site, representing our private data center, will run the above stack and serve the PetClinic online service. The secondary site, representing the public cloud, will only have a MySQL database, and we will replicate data between the primary and secondary sites to keep data synchronized. As soon as the load on the primary site increases beyond a certain threshold, we will spin up a machine with an instance of Tomcat and the PetClinic application, and update the load balancer to offload some of the traffic to the secondary site.
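The "load increases beyond a certain threshold" trigger is worth a small sketch: using a single threshold for both bursting and tearing down makes the secondary site flap when load hovers around it, so two thresholds (hysteresis) are the usual fix. The threshold values and function below are illustrative, not taken from the original experiment.

```python
def burst_decision(load, bursted, burst_at=0.8, scale_in_at=0.4):
    """Decide whether to activate or tear down the secondary site.

    Two thresholds (hysteresis) keep brief fluctuations around a single
    threshold from flapping the site up and down. Values are illustrative.
    """
    if not bursted and load >= burst_at:
        return "burst"       # spin up secondary site, update LB
    if bursted and load <= scale_in_at:
        return "tear-down"   # reverse flow: drain LB, destroy VMs
    return "hold"
```

A monitoring loop would feed this with a smoothed load metric (e.g. requests per second against capacity) and invoke the bootstrap or tear-down flow on a state change.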

In my experiment I used the Amazon EC2 and RackSpace IaaS providers to simulate the two distinct environments of the primary and secondary sites, but any on-demand environments will do.

Replicating RDBMS data over WAN

How do we replicate data between the MySQL database instances over the WAN? In this experiment we’ll use the following pattern:

  1. Monitor data mutating SQL statements on the source site. Turn on the MySQL query log, and write a listener (“Feeder”) to intercept data mutating SQL statements and write them to the GigaSpaces In-Memory Data Grid.
  2. Replicate data mutating SQL statements over WAN. I used GigaSpaces WAN Replication to replicate the SQL statements between the data grids of the primary and secondary sites in a real-time, transactional manner.
  3. Execute data mutating SQL statements on the target site. Write a listener (“Processor”) to intercept incoming SQL statements on the data grid and execute them on the local MySQL DB.

To support bi-directional data replication we simply deploy both the Feeder and the Processor on each site.

Auto-bootstrap secondary site

When peak load occurs, we need to react immediately, and perform a series of operations to activate the secondary site:

  1. spin up compute nodes (VMs)
  2. download and install Tomcat web server
  3. download and install the PetClinic application
  4. configure the load balancer with the new node
  5. when peak load is over – perform the reverse flow to tear down the secondary site

We need to automate this bootstrap process to support a real-time response to peak-load events. How do we automate it? I used the GigaSpaces Cloudify open-source product as the automation tool for setting up and taking down the secondary site, utilizing its out-of-the-box connectors for EC2 and RackSpace. Cloudify also provides self-healing in case of VM or process failure, and can later help in scaling the application (in the case of clustered applications).

Implementation Details

The result of the above experimentation is available on GitHub. It contains:

  • DB scripts for setting up the logging, schema and demo data for the PetClinic application
  • PetClinic application (.war) file
  • WAN replication gateway module
  • Cloudify recipe for automating the PetClinic deployment

See the documentation on GitHub for detailed instructions on how to configure the above with your specific deployment details.


Cloud Bursting is a common use case for cloud on-boarding, which requires good architecture patterns. In this post I tried to suggest some patterns and experiment with a simple demo, sharing it with the community to get feedback and raise discussion on these cloud architectures.

Follow Dotan on Twitter!


Filed under Cloud, DevOps, Solution Architecture

Cloud Deployment: The True Story

Everyone wants to be in the cloud. Organizations have internalized the notion and have plans in place to migrate their applications to the cloud in the immediate future. According to Cisco’s recent global cloud survey:

Presently, only 5 percent of IT decision makers have been able to migrate at least half of their total applications to the cloud. By the end of 2012, that number is expected to significantly rise, as one in five (20 percent) will have deployed over half of their total applications to the cloud.

But that survey also reveals that on-boarding your application to the cloud “is harder, and it takes longer than many thought”, as David Linthicum said in his excellent blog post summarizing the Cisco survey. Taking standard enterprise applications that were designed to run in the data center and on-boarding them to the cloud is in essence a reincarnation of the well-known challenge of platform migration, which is never easy. But why is there a sense of extra difficulty in on-boarding to the cloud? The first reason David identifies is the misconception that the cloud is a “silver bullet”. This misconception can lead to a lack of proper system design, which may result in application outages, as I outlined in my previous posts. Another reason David states is the lack of a well-defined process and best practices for on-boarding applications to the cloud:

What makes the migration to the cloud even more difficult is the lack of information about the process. Many new cloud users are lost in a sea of hype-driven desire to move to cloud computing, without many proven best practices and metrics.

It is about time for a field-proven process for on-boarding applications to the cloud. In this post I’d like to start examining the accumulated experience in on-boarding various types of applications to the cloud, and see if we can extract a simple migration process from it. This is of course based on my experience and that of my colleagues, not on academic research, so I would very much like it to serve as a cornerstone that triggers an open discussion in the community, sharing experience from different types of migration projects and applications, and iteratively refining the suggested process based on our joint experience.

Examining the n-tier enterprise application use case

As a first use case, it makes sense to examine a classic n-tier enterprise application. For the sake of discussion, I’d like to use common open-source modules, assuming they are well-known, and to allow us to play with them freely. As the test-case application let’s take Spring’s PetClinic Sample Application and adapt it. We’ll use the Apache Tomcat web container and the Grails platform for the web and business-logic tiers, and the MongoDB NoSQL DB for the data tier, to simulate a Big Data use case. We can later add the Apache HTTP Server as a front-end load balancer. For those who are wondering: I’m not invested in the Apache Foundation, just an open-source enthusiast.

The first step in on-boarding the application to the cloud is to identify the individual services that comprise the application, and the dependencies between these services. In this use case, since the application is well-divided into tiers, it is quite easy to map the services to the tiers. The dependencies between the tiers are also quite clear: for example, the Tomcat instances depend on the back-end database. Mapping the application’s services and their dependencies will help us determine which VMs we should spin up, from which images, how many of each, and in which order. In later posts I’ll address additional benefits of the services paradigm.
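Deriving the boot order from the service dependencies is a topological sort, which can be sketched in a few lines of Python. The dependency map below is illustrative for this use case (Tomcat needs the database up; the front-end balancer needs the web tier up), not a Cloudify artifact.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Service -> the services it depends on, as mapped from the application tiers.
# The exact dependency set is illustrative.
DEPENDENCIES = {
    "apache-lb": {"tomcat"},   # front-end balances over the web tier
    "tomcat":    {"mongod"},   # web/business tier needs the data tier up
    "mongod":    set(),
}

def boot_order(deps):
    """Derive the order in which to spin up and start services: every service
    starts only after all of its dependencies are running."""
    return list(TopologicalSorter(deps).static_order())
```

The same order, reversed, gives a safe shutdown sequence.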

Next let’s dive into the specific services and see what it takes to prepare them for on-boarding to the cloud. The first step is to identify the operational phases that comprise the service’s lifecycle. Typically, a service will undergo a lifecycle of install-init-start-stop-shutdown. We should capture the operational process for each such phase and formalize it into an automated DevOps process, for example in the form of a script. This process of capturing and formalizing the steps also helps expose many important issues that need to be addressed to enable the application to run in the cloud, and may even require further lifecycle phases or intermediate steps. For example, in the case of Tomcat we may want to support deploying a new WAR file without restarting the container. Another example: we noticed that MongoDB may fail to start without any failure indication in the OS process status, so simple generic monitoring of the process status wasn’t enough, and we needed a more accurate, customized way to know when the service has successfully completed start-up and is ready to serve. Similar considerations arise with almost every application. I will touch on these considerations further in a follow-up post.
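The MongoDB readiness concern above (a process that is "up" but not actually serving) generalizes to a wait-until-ready loop with a pluggable, service-specific probe. The sketch below is hypothetical, not the actual recipe code: for MongoDB the probe might open a client connection and issue a ping, rather than merely checking the OS process status.

```python
import time

def wait_until_ready(probe, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Block until `probe()` reports the service is ready, or raise on timeout.

    `probe` is a service-specific readiness check; `clock` and `sleep` are
    injectable for testing. Defaults are illustrative.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    raise TimeoutError("service did not become ready in time")
```

A lifecycle script would call this between its start phase and declaring the service available, with a probe tailored to each service.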

With the break-down of the application into services, and the break-down of the services into their individual lifecycle stages, we have a good skeleton for automating the work on the cloud. You are welcome to review the result of the experimentation, available as open-source under CloudifySource on GitHub. In my next post I will further examine the n-tier use case and discuss additional concerns that need to be addressed to bring it to a full solution.


Follow Dotan on Twitter!


Filed under Cloud, DevOps, IaaS, PaaS