Stuck inside because of all the snow, I can think of no better time to consider the topic of backup. Seriously, backup is important and the issue is fascinating. The area of backup brings together so many topics in computing. Think about it! To do backup successfully, you must deal with data transfer, data integrity validation, networks, distributed systems, compression and, if you’re doing it right, cryptography. It is not just computer science, however, but also workflow management and that somewhat-nebulous-yet-often-referred-to thing called systems thinking.

I got to thinking about backups and was curious about the state of research as well as the state of backup tools. A great intro to so-called “backup theory” is available in the ‘Backup’ Wikipedia article, and others have written on the subject (Google will verify that this is true). As it turns out, advances in storage have recently opened up many new opportunities for improving the way a given backup process might work.

Several distributed, fault-tolerant filesystems are coming along quite nicely. Though the Google File System and its equivalent, the HDFS project which is part of Apache’s Hadoop, have gotten much of the attention, there are other options. GlusterFS (follow GlusterFS on Twitter, perhaps?) and Ceph are two excellent examples of Free/Open Source Software projects offering the compelling combo of fault tolerance and distributed storage. Each employs replication, and their architectures are similar insofar as they abstract the storage on individual machines into “chunks” or “blocks” and manage replication automatically. One interesting difference is that while GlusterFS exports filesystems “as-is” (see the docs for an explanation), Ceph can export entire block devices.

So then what about the theory and goals of backup? In my opinion, backing up is not enough. Having a backup plan and executing it perfectly doesn’t mean a thing if the data can’t be recovered. (In fact, this issue relates strongly to a larger discussion of reliability.
Rather than focusing exclusively on total time spent in a failure state, people concerned with reliability would do well to also consider how fast a system can recover from those failures. If a system can be up and running again after only 5 minutes, then it can go down 12 times before accumulating an hour of downtime. If another system takes 20 minutes to rebound from a failure, then it can only go down 3 times in that same hour!) I haven’t yet gotten to thinking about the problem of restoration following failure for anything other than plain old files.

For example, I back up my personal data to servers in Philadelphia and California in addition to an external hard drive in my apartment. I make careful use of old standbys like tar, gzip and rsync, along with a checksum utility like md5sum or, more recently, something in the SHA family. Plus, I use the git revision control system for code and etckeeper for configuration. By backing up all of my configuration data in addition to my personal files, I can easily return a given system to a usable state, if not restore it perfectly. I have successfully restored systems by doing little more than a reverse rsync.

To be fair, I just spoke about the few systems under my personal control, which constitute a small and limited case. I back up to different locations, which is good practice, but my local copy is certainly not sufficient for anything industrial. If the disk breaks, I’m out of luck. It is at larger scales that GlusterFS, Ceph and the others would come in handy. Obviously, a great many books have been written about this topic but, for the purposes of discussion, if I were to build a platform for the reliable storage of huge amounts of data, my project would look something like this…

First, I would round up spare computers with room for extra disks. The machines would not need to be particularly fast or possess large amounts of memory.
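(Before moving on to the hardware: the personal tar-plus-checksum routine I described above boils down to just a few commands. Here is a rough sketch using a throwaway temp directory instead of real data — the paths are invented for illustration:)

```shell
#!/bin/sh
set -e

# Stand-in for the data worth protecting; a real run would point at
# $HOME, /etc snapshots from etckeeper, and so on.
src=$(mktemp -d)
echo "important data" > "$src/notes.txt"

work=$(mktemp -d)

# Bundle and compress, then record a checksum next to the archive.
tar -czf "$work/backup.tar.gz" -C "$src" .
( cd "$work" && sha256sum backup.tar.gz > backup.tar.gz.sha256 )

# After copying the archive off-site (e.g. with rsync -a), verify it
# against the recorded checksum before trusting -- or restoring -- it.
( cd "$work" && sha256sum -c backup.tar.gz.sha256 )  # prints "backup.tar.gz: OK"
```

In practice an rsync step would push the archive and its checksum file to the remote servers, and the verification would happen on the far side before anything gets restored.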
I’m not sure of the exact hardware requirements for either GlusterFS (the wiki entry is vague) or Ceph, but it’s hard to imagine that they’d require huge amounts of anything but disk space. Anyway, if my organization had old desktops or the like that were being replaced, they might be perfect candidates.

Second comes storage. At the time of this writing, one can purchase a 1TB hard disk for around US$85. Let us assume that 20 desktop machines could be procured and that each had two spare disk slots. For around US$3500 (figuring 2 US$85 disks per machine, plus a little more for tax, shipping, etc.), one could buy 40TB of raw storage. Now, it’s not quite that simple, because the fault-tolerance scheme in both GlusterFS and Ceph relies on replication. Assuming a replication factor of 3 (the norm adhered to by the Google File System), usable capacity is cut to a third of the raw total, leaving around 13TB of fault-tolerant storage.

Filesystems of this type usually require cluster control processes which (ideally, I think) reside on dedicated machines, so an extra machine or two would also be required. I got a good explanation of how metadata servers work in GFS/HDFS by reading the HDFS design document, actually. Metadata servers and other control processes serve similar functions. Ceph’s documentation is very explicit about not having a single point of failure, whereas GlusterFS is not quite so adamant; I need some help figuring out whether GlusterFS is as fault-tolerant in that respect.

For under US$4000, then, one could theoretically build over 13TB of fault-tolerant, distributed storage (provided that spare machines are plentiful, which shouldn’t be a problem for organizations with a semi-regular hardware replacement cycle). Now, as for actually using that storage system for backup, that’s a different piece of the discussion.
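(The back-of-the-envelope arithmetic is easy to check in a few lines of shell, using the assumptions above: 20 machines, two US$85 1TB disks each, triple replication.)

```shell
#!/bin/sh
# Assumptions taken from the text above.
machines=20
disks_per_machine=2
tb_per_disk=1
usd_per_disk=85
replication=3

raw_tb=$((machines * disks_per_machine * tb_per_disk))
cost_usd=$((machines * disks_per_machine * usd_per_disk))

# Triple replication means usable capacity is a third of raw capacity.
usable_tb=$(awk "BEGIN { printf \"%.1f\", $raw_tb / $replication }")

echo "raw: ${raw_tb}TB  cost: US\$${cost_usd}  usable: ${usable_tb}TB"
# prints: raw: 40TB  cost: US$3400  usable: 13.3TB
```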
I’ve seen quite a few setups where a single-but-very-large server provides networked storage to a large number of users via something like Samba or NFS. In these cases, the big server is usually blessed with some complicated RAID arrangement in which people put entirely too much faith. No one ever listens when told that RAID is not a backup strategy. Personally, I don’t believe in betting on hardware reliability, because it seems silly to spend money trying to prevent hardware from failing when you know it’s going to break (eventually) anyway. I’m not saying that RAID isn’t useful — it *is* useful and it has its place. What I am saying is that it’s not backup.

So if an organization has a big network storage server, where does that get backed up to? It’s hard to back up 6TB off site, but it certainly can be done. And if an organization is lucky enough to have multiple buildings, then a distributed storage cluster like the one I described earlier would be an excellent addition to the overall backup infrastructure. Putting a few machines in different buildings and using the resulting cluster as a place to shadow the main file server (and whatever else needs backing up) would grant an added measure of security.

Snow days are a good time for thinking, and my backup jobs went smoothly. I feel as if I have begun a track of study which might yield some good results. Granted, it’s not just about the technology (humans screw things up), but the ability to build reliable, fault-tolerant storage for massive amounts of data using only commodity hardware is a huge boon to users everywhere. This concludes my backup rant.