Amazon has had its share of outages, and these forced prudent users to factor failure into their designs. One of the important guidelines I pointed out after one such major AWS outage is Automation and Monitoring:
Your application needs to automatically pick up alerts on system events, and should be able to automatically react to the alerts.
At the beginning, users had to take care of that themselves. Then Amazon added CloudWatch to enable built-in automatic monitoring. Now comes the next stage: automatically reacting to these notifications.
Amazon announced its latest feature to do just that: Auto Recovery for EC2. With this feature enabled, a failed instance is automatically detected and recovered, and the recovered instance is identical to the original, including the same instance ID, IP address and configuration (e.g. Elastic IPs and attached EBS volumes). This means developers no longer need to write their own recovery mechanism for the trivial cases.
The Auto Recovery feature is based on AWS CloudWatch and on the status checks that were added in the last couple of years, which can detect a variety of failure symptoms such as loss of network connectivity, loss of system power, and software or hardware issues on the physical host. Upon failure, it will first attempt to recover on the same machine (this involves a reboot, so in-memory data will be lost), and then on a different machine (while retaining the same ID, IP and configuration). To keep the recovery cycle fully automated, developers should make sure that their software starts automatically on boot (e.g. using cloud-init), so that it also comes back up on the rebooted instance.
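To make the mechanism concrete, here is a minimal sketch in Python of what enabling recovery for one instance could look like: a CloudWatch alarm on the instance's system status check whose action is the EC2 recover action. The function names (build_recovery_alarm, create_recovery_alarm), the alarm name and the period and evaluation settings are my own illustrative choices, not something prescribed by Amazon. The metric name StatusCheckFailed_System and the recover action ARN are what I understand from the CloudWatch documentation, so verify them against the current docs before relying on this.

```python
from typing import Any, Dict


def build_recovery_alarm(instance_id: str,
                         region: str = "us-east-1",
                         period_seconds: int = 60,
                         evaluation_periods: int = 2) -> Dict[str, Any]:
    """Build the keyword arguments for a CloudWatch PutMetricAlarm call.

    Hypothetical helper: the alarm fires when the system status check has
    failed for `evaluation_periods` consecutive periods, and its action
    asks EC2 to recover the instance.
    """
    if not instance_id.startswith("i-"):
        raise ValueError(f"not an EC2 instance ID: {instance_id!r}")
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # illustrative naming
        "AlarmDescription": "Recover the instance when the system status check fails",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumed ARN format for the built-in EC2 recover action.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id: str, region: str = "us-east-1") -> None:
    """Create the alarm in AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_recovery_alarm(instance_id, region))
```

Calling create_recovery_alarm("i-1234567890abcdef0") would register the alarm for that (made-up) instance ID. Keeping the builder separate from the AWS call makes it easy to inspect or unit-test the alarm definition without touching a real account.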
The service is currently launched in Amazon’s US East (N. Virginia) region and is available for C3, M3, R3, and T2 instance types. The service itself is offered for free, so you only pay for the CloudWatch alarms used to monitor the system.
Unlike in a traditional data center, the base assumption in the cloud is that infrastructure instances can and will fail, and even frequently. As more applications move to the cloud, including more mission-critical ones, the demand grows for cloud providers to offer greater levels of built-in resilience. To meet those needs, we will be seeing more of these automatic recovery and self-healing features among cloud providers, as well as other types of automated, monitoring-triggered responses (both predefined and configurable), at the infrastructure level as well as the application level.
Follow Dotan on Twitter!