To get started, just write a map/reduce job in Ruby like the example included in Virgil:
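A minimal sketch of what such a script might look like. The exact function signatures Virgil expects (`map(rowKey, columns)` and `reduce(key, values)`) are assumptions here for illustration; the bundled `wordcount.rb` may differ:

```ruby
# map receives one Cassandra row: the row key plus a map of
# column name => value. It must return an array of [key, value]
# tuples, which Hadoop sorts and groups by key.
def map(rowKey, columns)
  tuples = []
  columns.each do |name, value|
    value.split(/\s+/).each do |word|
      tuples << [word, "1"]
    end
  end
  tuples
end

# reduce receives a key and the list of values Hadoop grouped
# under it. It must return a map of maps: row key => { column
# name => value }, which Virgil writes back to Cassandra.
def reduce(key, values)
  total = values.inject(0) { |sum, v| sum + v.to_i }
  { key => { "count" => total.to_s } }
end
```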
Then throw that script at Virgil with a curl:
curl -X POST "http://localhost:8080/virgil/job?jobName=wordcount&inputKeyspace=dummy&inputColumnFamily=book&outputKeyspace=stats&outputColumnFamily=word_counts" --data-binary @src/test/resources/wordcount.rb
In the POST, you specify the input keyspace and column family, and the output keyspace and column family. Each row is fed to the Ruby map function as a Map; each entry in that map is a column in the row. The map function must return tuples (key/value pairs), which are fed back into Hadoop for sorting.
Then, the reduce method is called with the keys and values from Hadoop. The reduce function must return a map of maps representing the rows and columns to be written back to Cassandra (the keys are row keys; the sub-maps are the columns).
Presently, the actual job runs inside the Virgil JVM, and the HTTP connection is left open until the job completes. We'll fix that over the next week or two; we intend to add the ability to distribute the job across an existing Hadoop cluster. Stay tuned.
For more information, see the Virgil wiki: