Can I get a little MapReduce from my Debian people?
Debian is a world-class Linux distribution. It is used on its own for so many applications (desktop, laptop, workstation, handheld, server, etc.) as well as serving as the foundation for so many wonderful projects ((U|K|X)buntu, Maemo, etc.). Personally, I run Debian on my laptop as well as my servers. So when I went to see about setting up a little ad-hoc cluster, I was rather disappointed. Though there are a few clustering tools available, as well as several distributed filesystems (GFS, GlusterFS, OCFS2, and Lustre), shockingly, I could not find any implementation of MapReduce available in the Debian repositories.
For those who might not know, MapReduce is a novel data-processing system developed by Google for internal usage and described in their publication entitled MapReduce: Simplified Data Processing on Large Clusters. For the enlightened out there, it should be clear that the name and mechanism are derived from Lisp’s map and reduce functions. In any case, though Google’s implementation is proprietary, there have been several implementations based on their paper both written in and geared toward a variety of programming languages. Unfortunately, none of these are available in the Debian repositories. In all fairness, Debian does include CouchDB which uses map and reduce functions for generating views. However, it’s not a solution aimed at sorting and processing huge amounts of data, though it is an interesting and capable piece of software.
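For anyone who wants to see the shape of the idea, here is a minimal single-machine sketch in Python of the classic word-count example. It is only an illustration of the map, group and reduce steps, not how Google or any of the projects below actually implement it; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own placeholders, and a real framework would spread each phase across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values collected for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["Debian runs on servers", "Debian runs on laptops"]
    print(word_count(docs))
```

The point of splitting the work this way is that every call to `map_phase` and every call to `reduce_phase` is independent of the others, which is what lets a cluster run them in parallel.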
So, to try and get things moving, I have filed three Debian RFPs (Request For Package) for a few separate MapReduce implementations.
- Hadoop – Probably the most well-known of the Free/Open Source implementations. Includes a distributed filesystem (HDFS), scalable distributed database (HBase) and tools to get you going from start to finish. Hadoop is written in Java, though it can interoperate with other languages (Scala, too). It’s a top-level project of the Apache Software Foundation and licensed under the Apache License 2.0 – http://hadoop.apache.org
- Skynet – A MapReduce implementation written in Ruby. It’s designed to be fault-tolerant and distributed, just like the big boys. Originally written for use at Geni.com and licensed under the MIT License – http://skynet.rubyforge.org/
- Disco – Though the implementation is itself written in Erlang, thus providing excellent distributed fault-tolerance, Disco jobs can be written in Python. It was developed as an in-house tool for rapid data analysis at Nokia and they seem to be quite keen on it. Disco is licensed under a modified BSD License. Page at http://discoproject.org/ and code at http://github.com/tuulos/disco/tree/master
Ok, there might be a few objections to my choices. Why did I leave out neat projects like GridGain, FileMap and BashReduce? Well, for starters, GridGain is another Java implementation that doesn’t seem (at least to me) to have the same momentum Hadoop does. FileMap and BashReduce, while novel, useful and fascinating, are not designed for use in networked environments and are therefore unsuitable for cluster situations. So then why not MapSharp? Well, primarily because of all the Debian Mono debates going on right now (Gnome’s fail!). I’ve done work in C# and it’s got some neat features, but cool stuff doesn’t and will not ensure that users are safe from patent litigation.
Also, it seems like those RFPs have some mistakes, so if anyone figures out how to edit them, let me know so I can clean them up.














