Amazon Adds Auto-Recovery To Its EC2 Cloud Instances

Amazon has had its share of outages, and these forced prudent users to factor failure into their designs. One of the important guidelines I pointed out after one such major AWS outage is Automation and Monitoring:

Your application needs to automatically pick up alerts on system events, and should be able to automatically react to the alerts.

At first, users had to take care of that themselves. Then Amazon added CloudWatch to provide built-in automatic monitoring. Now comes the next stage: automatically reacting to these notifications.

Amazon has announced its latest feature to do just that: Auto Recovery for EC2. With this feature enabled, a failed instance is automatically detected and recovered, and the recovered instance is identical to the original, including the same instance ID, IP address and configuration (e.g. Elastic IPs and attached EBS volumes). This means developers no longer need to write their own recovery mechanism for the simple cases.


The Auto-Recovery feature is based on AWS CloudWatch and on the status checks that were added over the last couple of years. These checks can detect a variety of failure symptoms, such as loss of network connectivity, loss of system power, and software or hardware issues on the physical host. Upon failure, it first attempts to recover the instance on the same machine (this involves a reboot, so in-memory data is lost), and then on a different machine (while retaining the same ID, IP and configuration). To keep the recovery cycle fully automated, developers should make sure that their software starts automatically on boot-up (e.g. using cloud-init), so that it also comes back up on the rebooted instance.
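As a rough illustration of how such a recovery alarm might be wired up, here is a minimal Python sketch using the boto3 SDK. It is not taken from Amazon's announcement: the alarm name, the one-minute period, the two evaluation periods and the example instance ID are all assumptions chosen for illustration, and the call requires valid AWS credentials.

```python
def build_recovery_alarm(instance_id, region="us-east-1"):
    """Build the arguments for a CloudWatch alarm that triggers EC2 recovery.

    The alarm watches the system status check of a single instance and,
    when the check fails, invokes the built-in EC2 "recover" action.
    """
    return {
        # Hypothetical naming scheme, chosen only for this sketch.
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # System status check: reports 1 when the underlying host has a problem.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        # Assumed timing: two consecutive failed one-minute periods.
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # ARN of the built-in recover action for the given region.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id, region="us-east-1"):
    """Create the alarm in CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party AWS SDK, imported here so the builder runs without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_recovery_alarm(instance_id, region))
```

Calling `create_recovery_alarm("i-0123456789abcdef0")` (a made-up instance ID) would register the alarm; keeping the parameters in a separate function makes it possible to inspect or adjust them before anything is sent to AWS.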

The service is currently available in Amazon’s US East (N. Virginia) region, for C3, M3, R3, and T2 instance types. The feature itself is offered for free, so you only pay for the CloudWatch alarms used to monitor the system.

Unlike in a traditional data center, the base assumption in the cloud is that infrastructure instances can and will fail, even frequently. As more applications move to the cloud, including more mission-critical ones, demand grows for cloud providers to offer greater levels of built-in resilience. To meet those needs, we should expect cloud providers to offer more features for automatic recovery and self-healing, as well as other automated, monitoring-triggered responses (both predefined and configurable), at the infrastructure level as well as the application level.

Follow Dotan on Twitter!

