Rotating EBS Snapshots

If you use Elastic Block Storage (EBS) for storing your files on your ec2 instances you more than likely backup those files using the ec2 snapshots. If you don’t already do this you should probably start, as EBS volumes are not 100% fault tolerant, and can (and do) degrade just like normal drives. A good script for taking snapshots of data can be found on the alestic.com website, called ec2-consistent-snapshot. You can find all the information for this script here:

http://alestic.com/2009/09/ec2-consistent-snapshot

How to Rotate EBS Snapshots

After using the ec2-consistent-snapshot script for a while I realized I would eventually need to find something to rotate these backups as they were growing out of control. Some of our volumes were having snapshots done every hour, and that was adding up quickly. Google provided me with no easy solution for rotating the snapshots, so I decided to write my own script.

Essentially what I wanted was to have a script that would rotate the snapshots in Grandfather-Father-Son type setup. I wanted to have hourly backups kept for 24 hours, daily kept for a week, weekly kept for a month, and monthly kept for a year. Anything older than that I don’t want, however the script can be tweaked to allow for older backups.

Basically what the script does is the following:

  • Gets a list of all snapshots and puts them into an array indexed by the volume and the date the snapshot was taken
  • For a given volume organize the snapshots so that there are only hourly snapshots for 1 day, daily snapshots for 1 week, weekly snapshots for 1 month, and monthly snapshots for 1 year and collect which snapshots require deleting.
  • Delete the snapshots that are set for delete.

I wrote the script in PHP, mainly because it is what I feel most comfortable using. I am also once again using the Amazon PHP library. Here is the script in it’s entirety.


<?php
/*
 * rotate-ebs-snapshots.php
 *
 * Author: Stefan Klopp
 * Website: http://www.kloppmagic.ca
 * Requires: Amazon ec2 PHP Library
 *
 */

ini_set("include_path", ".:../:./include:../include:/PATH/TO/THIS/SCRIPT");

// include the amazon php library
require_once("Amazon/EC2/Client.php");
require_once("Amazon/EC2/Model/DeleteSnapshotRequest.php");

// include our configuration file with out ACCESS KEY and SECRET
include_once ('.config.inc.php');

$service = new Amazon_EC2_Client(AWS_ACCESS_KEY_ID,
                                       AWS_SECRET_ACCESS_KEY);


// setup our array of snapshots
$snap_array = setup_snapshot_array();

// call to rotate (you can call this for every volume you want to rotate)
rotate_standard_volume('VOLUME_ID_YOU_WISH_TO_ROTATE');


/* 
 * Used to setup an array of all snapshots for a given aws account
 */
function setup_snapshot_array() {
    global $service;
    // Get a list of all EBS snapshots
    $response = $service->describeSnapshots($request);

    $snap_array = array();

    if ($response->isSetDescribeSnapshotsResult()) {
        $describeSnapshotsResult = $response->getDescribeSnapshotsResult();
        $snapshotList = $describeSnapshotsResult->getSnapshot();
        foreach ($snapshotList as $snapshot) {
            if ($snapshot->getStatus() == 'completed') {

                    // date is in the format of 2009-04-30T15:32:00.000Z
                    list($date, $time) = split("T", $snapshot->getStartTime());

                    list($year, $month, $day) = split("-", $date);
                    list($hour, $min, $second) = split(":", $time);

                    // convert the date to unix time
                    $time = mktime($hour, $min, 0, $month, $day, $year);

                    $new_row = array(
                            'snapshot_id'=>$snapshot->getSnapShotId(),
                            'volume_id'=>$snapshot->getVolumeId(),
                            'start_time'=>$time
                    );
                    // add to our array of snapshots indexed by the volume_id
                    $snap_array[$new_row['volume_id']][$new_row['start_time']] = $new_row;
            }
        }
    }

    // sort each volumes snapshots by the date it was created
    foreach ($snap_array as $vol=>$vol_snap_array) {
            krsort($vol_snap_array);
            $snap_array[$vol] = $vol_snap_array;
    }

    return($snap_array);
}

/*
 * Used to rotate the snapshots
 */
function rotate_standard_volume($vol_id) {
        global $snap_array, $service;

        // calculate the date ranges for snapshots
        $one_day = time() - 86400;
        $one_week = time() - 604800;
        $one_month = time() - 2629743;
        $one_year = time() - 31556926;

        $hourly_snaps = array();
        $daily_snaps = array();
        $weekly_snaps = array();
        $monthly_snaps = array();
        $delete_snaps = array();

        echo "Beginning rotation of volume: {$vol_id}\n";

        foreach($snap_array[$vol_id] as $time=>$snapshot) {

                echo "Testing snapshot {$snapshot['snapshot_id']} with a date of ".date("F d, Y @ G:i:s", $time)."... ";

                if ($time >=  $one_day) {
                        echo "Snapshot is within a day lets keep it.\n";
                        $hourly_snaps[$time] = $snapshot;
                }
                elseif ($time < $one_day &#038;&#038; $time >= $one_week) {
                        $ymd = date("Ymd", $time);
                        echo "Snapshot is daily {$ymd}.\n";

                        if (is_array($daily_snaps[$ymd])) {
                                echo "Already have a snapshot for {$ymd}, lets delete this snap.\n";
                                $delete_snaps[] = $snapshot;
                        }
                        else {
                                $daily_snaps[$ymd] = $snapshot;
                        }
                }
                elseif ($time < $one_week &#038;&#038; $time >= $one_month) {
                        $week = date("W", $time);
                        echo "Snapshot is weekly {$week}.\n";

                        if (is_array($weekly_snaps[$week])) {
                                echo "Already have a snapshot for week {$week}, lets delete this snap.\n";
                                $delete_snaps[] = $snapshot;
                        }
                        else {
                                $weekly_snaps[$week] = $snapshot;
                        }
                }
                elseif ($time < $one_month &#038;&#038; $time >= $one_year) {
                        $month = date("m", $time);
                        echo "Snapshot is monthly {$month}.\n";

                        if (is_array($monthly_snaps[$month])) {
                                echo "Already have a snapshot for month {$month}, lets delete this snap.\n";
                                $delete_snaps[] = $snapshot;
                        }
                        else {
                                $monthly_snaps[$month] = $snapshot;
                        }
                }
                else{
                        echo "Snapshot older than year old, lets delete it.\n";
                        $delete_snaps[] = $snapshot;
                }
        }

        foreach ($delete_snaps as $snapshot) {
                echo "Delete snapshot {$snapshot['snapshot_id']} with date ".date("F d, Y @ H:i", $snapshot['start_time'])." forever.\n";
                $request = new Amazon_EC2_Model_DeleteSnapshotRequest();
                $request->setSnapshotId($snapshot['snapshot_id']);
                $response = $service->deleteSnapshot($request);
        }
        echo "\n";
}

You can either run the script by editing the call to rotate_standard_volume. You can call this method for each volume you wish to rotate snapshots for. Also feel free to change the values of the date ranges to keep snapshots for a given date range for longer or shorter periods.

Finally to make this script effective you should have it run at least once a day via cron.

Conclusion

If you are like me and utilize EBS snapshots for backups of your data you will likely need to rotate those snapshots at some point. With the script above you should be able to quickly and easily rotate your snapshots. With a few tweaks you should be able to easily customize the rotation schedule to suit your needs.

Auto Scaling with Elastic Load Balancing

Along with the ability to Setup Elastic Load Balancing, which I showed you how to setup in my previous post. Amazon provides the ability to auto scale your instances. Auto Scaling groups allow you to set up groups of instances that will scale up and down depending on triggers you can create. For example you can set up a scaling group to always have 2 instances in it, and to scale up another server if the CPU utilization of the servers grows over a certain threshold. This is extremely helpful when you receive unexpected traffic and you are unable to react in time to add new instances. The beauty of Auto Scaling in conjunction with Elastic Load Balancing is that it will automatically assign the new instances to the load balancer you provide.

Creating an Auto Scaling Launch Config

The first step in setting up Auto Scaling is to create a launch config. The launch config is used to determine what ec2 image, and size (small, medium, etc) will be used to setup a new instance for your Auto Scaling group. To setup a launch config you will call the as-create-launch-config. For example to create a new launch config called auto-scaling-test that would launch the image ami-12345678 of size c1.medium you would run the following command:


as-create-launch-config auto-scaling-test --image-id ami-12345678 --instance-type c1.medium

Create an Auto Scaling Group

The next step to enabling Auto Scaling is to setup an Auto Scaling Group. An Auto Scaling group tells Amazon what zones you want your instances created in, the minimum and maximum number of instances to ever launch, and which launch config to utilize. To create an Auto Scaling group you will call the as-create-auto-scaling-group command. For example if you wanted to create a new group with a name of auto-scaling-test using the availability zones of us-east-1a with a minimum number of instances being 2 and a maximum of 4 using our newly created launch config you would run:


as-create-auto-scaling-group auto-scaling-test --availability-zones us-east-1a --launch-configuration auto-scaling-test --max-size 4 --min-size 2

When this command is executed 2 new instances will be created as per the directions of the launch config. the as-create-auto-scaling-group can also take be linked to a load balancer. Thus if we wanted to have this group setup with the load balancer we created in the previous article, you would run:


as-create-auto-scaling-group auto-scaling-test --availability-zones us-east-1a --launch-configuration auto-scaling-test --max-size 4 --min-size 2 --load-balancers test-balancer

After execution this would setup 2 new instances as per the instructions of the launch config, and register these instances with the load balancer test-balancer.

Creating Auto Scaling Triggers

Triggers are used by Auto Scaling to determine whether to launch or terminate instances within an Auto Scaling Group. To setup a trigger you will use the as-create-or-update-trigger command. Here is an example using the auto scaling group we created earlier:


as-create-or-update-trigger auto-scaling-test --auto-scaling-group auto-scaling-test --measure CPUUtilization --statistic Average --period 60 --breach-duration 120 --lower-threshold 30 --lower-breach-increment"=-1" --upper-threshold 60 --upper-breach-increment 2

Lets walk through what this command is doing. Basically what this command is saying is, create a new trigger called auto-scaling-test. This trigger should use the auto-scaling group called auto-scaling-test. It should measure the average CPU utilization of the current instances in the auto scaling group every 60 seconds. If the CPU utilization goes over 60% over the period of 120 seconds launch 2 new instances. Alternatively if the CPU utilization drops below 30% over the period of 120 seconds terminate 1 of the instances. Remember that the trigger will never terminate more instances than the minimum number of instances and it will not launch more instances than the maximum number of instances as defined in the Auto Scaling Group.

Shutting Down an Auto Scaling Group

Initially shutting down an Auto Scaling group can be a bit tricky as you cannot delete an Auto Scaling Group until all the instances are terminated or deregistered from the group. The best way to terminate an Auto Scaling group, it’s triggers and launch config is to do the following steps:

  • Delete all triggers
  • Update the Auto Scaling group to have a minimum and maximum number of instances of 0
  • Wait for the instances registered with the Auto Scaling group to be terminated
  • Delete the Auto Scaling group
  • Delete the launch config

To do this with the examples we used above we would issue the following commands:


as-delete-trigger auto-scaling-test --auto-scaling-group auto-scaling-test

as-update-auto-scaling-group auto-scaling-test --min-size 0 -max-size 0

as-delete-auto-scaling-group auto-scaling-test

as-delete-launch-config auto-scaling-test

With those 4 commands you can easily delete your Auto Scaling group as well as any launch configs or triggers that are associated with it.

Conclusion

Auto Scaling provides an easy and efficient way to grow and shrink your hosting solution based on your current traffic. In events where you are hit with unexpected traffic Auto Scaling can provide a failsafe by automatically launching new instances and scaling up your solution to meet this new traffic. When the traffic subsides Auto Scaling will scale down your solution so that you are not wasting money by running more instances than you require.

One thing to note is that if you know before hand that you will be receiving a traffic spike at a specific time, it may be more benefitial to launch new instances manually before the spike. This will save your system from getting hammered before having the Auto Scaling launches new instances to cope with the additional load. If you rely on Auto Scaling alone in this scenario you could see many requests at the start of the heavy traffic timeout or fail as the minimum number of instances likely won’t be able to handle the traffic load.