Thursday, December 1, 2011

Hadoop/MapReduce on Cassandra using Ruby and REST!

In an effort to make Hadoop/MapReduce on Cassandra more accessible, we added a REST layer to Virgil that allows you to run MapReduce jobs written in Ruby against column families in Cassandra simply by posting the Ruby script to a URL. This greatly reduces the skill set required to write and deploy the jobs, and allows users to rapidly develop analytics for data stored in Cassandra.

To get started, just write a map/reduce job in Ruby like the example included in Virgil:
http://code.google.com/a/apache-extras.org/p/virgil/source/browse/trunk/server/src/test/resources/wordcount.rb

Then throw that script at Virgil with a curl:

curl -X POST 'http://localhost:8080/virgil/job?jobName=wordcount&inputKeyspace=dummy&inputColumnFamily=book&outputKeyspace=stats&outputColumnFamily=word_counts' --data-binary @src/test/resources/wordcount.rb
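
If you'd rather submit the job from Ruby instead of the shell, a sketch like the following should do the same thing. The host, port, parameters, and file path simply mirror the curl example above; nothing here is Virgil-specific beyond that URL.

require 'net/http'
require 'uri'

# Job parameters, mirroring the curl example.
params = {
  'jobName'            => 'wordcount',
  'inputKeyspace'      => 'dummy',
  'inputColumnFamily'  => 'book',
  'outputKeyspace'     => 'stats',
  'outputColumnFamily' => 'word_counts'
}
query = params.map { |k, v| "#{k}=#{v}" }.join('&')
uri = URI.parse("http://localhost:8080/virgil/job?#{query}")

# POST the raw Ruby script as the request body (the equivalent of --data-binary).
script = File.read('src/test/resources/wordcount.rb')
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new("#{uri.path}?#{uri.query}")
request.body = script
response = http.request(request)
puts response.code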

In the POST, you specify the input keyspace and column family and the output keyspace and column family. Each row is fed to the Ruby map function as a Map, where each entry in the Map is a column in the row. The map function must return tuples (key/value pairs), which are fed back into Hadoop for sorting.
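
As a rough sketch of that contract, a word-count style map function might look like the following. The exact method signature and tokenization are illustrative assumptions; the real example is the wordcount.rb linked above.

# Receives one Cassandra row as a Map of column name => column value.
# Must return an array of [key, value] tuples for Hadoop to sort.
def map(row)
  tuples = []
  row.each do |column_name, value|
    value.split.each do |word|
      tuples << [word, "1"]   # emit one tuple per word occurrence
    end
  end
  tuples
end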

Then, the reduce method is called with the keys and values from Hadoop. The reduce function must return a map of maps, which represents the rows and columns that need to be written back to Cassandra (the keys are the row keys, and the sub-maps are the columns).
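
A corresponding reduce sketch, assuming the reducer is invoked once per key with its grouped values in the usual Hadoop fashion (again, the signature and column name are illustrative, not Virgil's exact API):

# Receives a key and the values Hadoop grouped under it.
# Must return a map of maps: row key => { column name => column value },
# which gets written back to the output column family.
def reduce(key, values)
  total = values.inject(0) { |sum, v| sum + v.to_i }
  { key => { "count" => total.to_s } }
end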

Presently, the job runs inside the Virgil JVM and the HTTP connection is left open until the job completes. Over the next week or two, we'll fix that: we intend to add the ability to distribute jobs across an existing Hadoop cluster. Stay tuned.

For more information see the Virgil wiki:
http://code.google.com/a/apache-extras.org/p/virgil/wiki/mapreduce
