Caring for Pets in the Cloud

“Pets vs. Cattle”, the widely used analogy between servers and animals, comes up consistently in the IT industry when automating infrastructure deployments or migrating applications to a cloud vendor.

As systems engineers, architects and developers struggle to turn pets into cattle, the underlying application will, more than likely, not have been architected to take advantage of scalability features. Features like a stateless application architecture take time to become reality, and that work often outlasts the infrastructure architecture changes. Immutability of servers is not easy either. It starts with how the application is architected; only then can it leverage multiple layers of infrastructure automation, operating system configuration management, acceptance testing, software deployments, continuous integration, etc.

Expect extra cost, time and effort when you take a beautifully hand-crafted snowflake application, cross your fingers and just throw it into the cloud, especially if you’re hoping the underlying services continue to meet their guaranteed uptime SLA while concurrent efforts to develop cloud-native versions or a potential replacement application take shape.

AWS has answered this situation with AutoRecovery, which is preferable to an AutoScaling group set to one desired/max instance. Although an instance stop/start will move your instance to another host, AWS does not support live migration. Caveats include the risk of volume corruption and the loss of public IPs, so ensure your instance has an Elastic IP or sits behind an ELB, and that important data is backed up or regular EBS snapshots are taken.
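As a rough illustration of what configuring recovery for a single pet instance can look like, here is a minimal Python sketch that builds the parameters for a CloudWatch alarm on the system status check with the EC2 recover action. The helper name, alarm name, instance ID, region and thresholds are all assumptions for the example, not values from any real environment.

```python
def recovery_alarm_params(instance_id, region="us-east-1"):
    """Build keyword arguments for a CloudWatch alarm that recovers one EC2 instance.

    The alarm name, period and evaluation count are illustrative assumptions.
    """
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",  # fails when the underlying host has a problem
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,  # consecutive failing windows before acting (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Placeholder instance ID, not a real instance.
params = recovery_alarm_params("i-0123456789abcdef0")
```

With boto3 installed and credentials configured, these parameters could be passed along as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. Treat the thresholds as a starting point to tune, not a recommendation.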

Google Cloud mitigates instance retirement with Transparent Maintenance, which, as the name suggests, is transparent and requires no additional configuration. By default, instances will “live migrate” to healthy hardware, but they can be configured to “terminate and reboot” if the application requires it.
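To make the two behaviours concrete, the following Python sketch builds the `scheduling` portion of a Compute Engine instance definition, where `onHostMaintenance` selects between migrating and terminating and `automaticRestart` controls whether the instance comes back afterwards. The helper name is hypothetical and the sketch only constructs the dictionary; it does not call any Google API.

```python
def scheduling_policy(live_migrate=True):
    """Return the 'scheduling' block for a Compute Engine instance resource.

    live_migrate=True  -> instance is moved to healthy hardware during maintenance
    live_migrate=False -> instance is terminated and then restarted
    """
    return {
        "onHostMaintenance": "MIGRATE" if live_migrate else "TERMINATE",
        "automaticRestart": True,  # bring the instance back after a terminate
    }


default_policy = scheduling_policy()
terminate_and_reboot = scheduling_policy(live_migrate=False)
```

An application that cannot tolerate live migration would use the second form; everything else can stay on the default.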

Azure doesn’t seem to have equivalent functionality, but it has technology in place that lets the underlying virtualization layer be updated without a VM reboot; the VM will, however, experience a 30-second pause. This does not solve the problem of underlying hardware failures, for which Azure recommends using multiple instances; single-instance deployments are covered by an SLA of 99.9%.
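To put a 99.9% SLA in perspective, this small Python sketch converts an uptime percentage into a downtime budget. The 30-day month is an assumption, and the comparison with 30-second pauses is purely for scale; whether such pauses count against the SLA is not something stated here.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of downtime permitted by an uptime SLA over a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


single_instance = allowed_downtime_minutes(99.9)
print(f"{single_instance:.1f} minutes of downtime allowed per 30 days")

# For scale only: how many 30-second pauses would fit in that budget.
print(f"{single_instance * 60 / 30:.0f} pauses of 30 seconds fit in that budget")
```

For a 30-day month that works out to roughly 43 minutes, which is not much headroom for a pet that takes a while to rebuild by hand.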

Be sure to make the right decisions regarding your pets. Research the available options and plan for failure. Turning a kitten into a cow isn’t easy.