Backup to AWS S3 with s3cmd

Particularly since the introduction of Glacier, Amazon S3 is quite attractive as an offsite backup offering: with lifecycle management you can archive the backups to Glacier automatically after, say, a week, and your storage costs drop dramatically.
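As a sketch of that lifecycle idea (this uses the AWS CLI rather than s3cmd, and assumes you have it installed and configured; the "current/" prefix and the seven-day cutoff are just illustrative):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-backups-to-glacier",
      "Filter": { "Prefix": "current/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 7, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket "$BUCKET" --lifecycle-configuration file://lifecycle.json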

Of course, we still have to keep an eye on our data transfer costs. There are two candidates for backing up our Linux server/VPS to S3 that I've seen and used in the past: s3cmd or s3fs.

s3fs certainly feels nice, and we can rsync to it in the normal way, but (and it is potentially a huge but, no pun intended) AWS S3 charges not just for storage, but also for bandwidth transferred and, perhaps critically, for the number of requests made to the S3 API. I freely confess to having done zero measurement on the subject, but it feels instinctive that a FUSE filesystem implementation is going to make far more API calls than Python scripts that call the API directly, which is what s3cmd is.

So, using rsync-like logic, you might consider doing something like:

cd /var/www/
s3cmd sync -r vhosts --delete-removed s3://$BUCKET/current/vhosts/
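
If you want a sanity check before spending any bandwidth, s3cmd has a dry-run flag (-n/--dry-run) that only reports what would be transferred:

cd /var/www/
s3cmd sync -r --dry-run vhosts --delete-removed s3://$BUCKET/current/vhosts/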

There is a small snag to this approach, however. s3cmd keeps the directory structure in memory to help it with the rsync logic. This is fine if you are on real tin, with memory to spare. But on a VPS, especially an OpenVZ-based one where there is no such thing as swap, this can be a real show-stopper for large directory structures, as the hundreds of MB of RAM required just are not available. Time for our old friend the OOM killer to rear its head?
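If you want to see this for yourself on your own tree, GNU time (the binary at /usr/bin/time, not the shell builtin) can report peak memory use; a rough check might look like:

/usr/bin/time -v s3cmd sync -r vhosts --delete-removed s3://$BUCKET/current/vhosts/ 2>&1 | grep 'Maximum resident'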

Recursion of some form would be the elegant answer here. However, elegance is for those with time for it, and the following seems to work very effectively with minimal RAM consumption:

cd /var/www
# walk the leaf directories one at a time, so s3cmd never has to
# hold the whole tree in memory
find . -type d -links 2 | sort | sed -e 's/^\.\///' | while read -r i
do
    s3cmd sync -r "$i/" --delete-removed "s3://$BUCKET/current/vhosts/$i/"
done

The find command looks for directories with a hard-link count of exactly two: one link for the directory's entry in its parent and one for its own "." entry. A directory only gains further links when it has subdirectories (each child's ".." points back at it), so a link count of two means the directory is a leaf node of the tree. And then we back them up, one by one.
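A quick illustration of the link-count trick (note it relies on the traditional Unix convention that a directory's link count is 2 plus its number of subdirectories; some filesystems such as btrfs report 1 for every directory, in which case -links 2 finds nothing):

mkdir -p /tmp/demo/a/b /tmp/demo/c
find /tmp/demo -type d -links 2
# prints /tmp/demo/a/b and /tmp/demo/c, the leaf directories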

Simples.