Monday, December 19, 2011

Programmatically submitting jobs to a remote Hadoop Cluster


I'm adding the ability to deploy a Map/Reduce job to a remote Hadoop cluster from Virgil. With this, users can schedule a Hadoop job with a simple REST POST. (pretty handy)

To get this to work properly, Virgil needed to be able to remotely deploy a job. Ordinarily, to run a job against a remote cluster you issue a command from the shell:


hadoop jar $JAR_FILE $CLASS_NAME 

We wanted to do the same thing, but from within the Virgil runtime. It was easy enough to find the class we needed to use: RunJar. RunJar's main() method stages the jar and submits the job. Thus, to achieve the same functionality as the command line, we used the following:


 List<String> args = new ArrayList<String>();
 args.add(locationOfJarFile);
 args.add(className);
 RunJar.main(args.toArray(new String[0]));

That worked just fine, but it resulted in a local job deployment. To get the job to deploy to a remote cluster, we needed Hadoop to load the cluster configuration. For Hadoop, the cluster configuration is spread across three files: core-site.xml, hdfs-site.xml, and mapred-site.xml. To get the Hadoop runtime to load that configuration, you need to include these files on your classpath. The key line is found in the Hadoop Configuration Javadoc:

"Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:"


Once we dropped the cluster configuration onto the classpath, everything worked like a charm.
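
Putting it together, here's a rough sketch of what the submission code might look like once those three files are on the classpath. The class and method names below are ours (illustrative only, not Virgil's actual code), but RunJar and Configuration are the real Hadoop classes involved:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.RunJar;

public class RemoteJobSubmitter {
    public static void submit(String jarFile, String className) throws Throwable {
        // With core-site.xml, hdfs-site.xml, and mapred-site.xml on the classpath,
        // a freshly constructed Configuration resolves to the remote cluster.
        Configuration conf = new Configuration();
        System.out.println("namenode:    " + conf.get("fs.default.name"));    // should point at the remote namenode
        System.out.println("job tracker: " + conf.get("mapred.job.tracker")); // should point at the remote job tracker

        // Equivalent of "hadoop jar $JAR_FILE $CLASS_NAME", run in-process.
        List<String> args = new ArrayList<String>();
        args.add(jarFile);
        args.add(className);
        RunJar.main(args.toArray(new String[0]));
    }
}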

Monday, December 12, 2011

Binaries available for download from Virgil


In response to a few requests for a binary distribution, we just posted artifacts for Virgil.

For simplicity, we're keeping the version number aligned with the version of Cassandra. 
(which is important when you are running with an embedded Cassandra ;)

Also, we changed it so you can simply specify the Cassandra instance you want to run against as a command-line parameter, which makes it easy to point the GUI at different Cassandra instances.

Now, all you need to do is download the binary distribution, untar/unzip and type:

bin/virgil -h CASSANDRA_HOST

Let me know if you have any trouble.

Friday, December 9, 2011

Dependencies and Repositories in SBT 0.11 (vs 0.7) for Kestrel

I was trying to build Kestrel last night, which is written in Scala.

After I installed Scala and sbt on my Mac using Homebrew, Kestrel wouldn't build because it requires an older version of sbt. Instead of installing the old version of sbt, I took it as a good opportunity to learn what sbt was about. In migrating the Kestrel build file to sbt 0.11, I found the documentation somewhat lacking (especially when compared to 0.7). So here are two tidbits that I had a hard time finding... (I actually had to dig through other GitHub projects to see how they did it.)

I learned that sbt uses Maven repositories as one source for dependencies. Just like Maven, you need to declare the dependencies and the available repositories. Here is how you do it.

I forked the Kestrel project and posted my new build.sbt here:
https://github.com/boneill42/kestrel/blob/master/build.sbt  

To add a dependency to your project, add the following line to build.sbt:

  libraryDependencies += "com.twitter" % "util-core" % "1.12.4"

In Maven terms, the first string is the group ID, the second string is the artifact ID, and the third string is the version.

In this case, Kestrel required additional repositories.

To add a repository to your project, add the following line to build.sbt:

resolvers += "twitter.com" at "http://maven.twttr.com/"
The first string is the name of the repository. The second string is the url for the repo.

Hope that helps some people.

Thursday, December 1, 2011

Hadoop/MapReduce on Cassandra using Ruby and REST!

In an effort to make Hadoop/MapReduce on Cassandra more accessible, we added a REST layer to Virgil that allows you to run map/reduce jobs written in Ruby against column families in Cassandra by simply posting the Ruby script to a URL. This greatly reduces the skill set required to write and deploy the jobs, and allows users to rapidly develop analytics for data stored in Cassandra.

To get started, just write a map/reduce job in Ruby like the example included in Virgil:
http://code.google.com/a/apache-extras.org/p/virgil/source/browse/trunk/server/src/test/resources/wordcount.rb

Then throw that script at Virgil with a curl:

curl -X POST http://localhost:8080/virgil/job?jobName=wordcount\&inputKeyspace=dummy\&inputColumnFamily=book\&outputKeyspace=stats\&outputColumnFamily=word_counts --data-binary @src/test/resources/wordcount.rb

In the POST, you specify the input keyspace and column family and the output keyspace and column family. Each row is fed to the Ruby map function as a Map; each entry in the map is a column in the row. The map function must return tuples (key/value pairs), which are fed back into Hadoop for sorting.

Then, the reduce method is called with the keys and values from Hadoop. The reduce function must return a map of maps, which represents the rows and columns that need to be written back to Cassandra. (The keys are the row keys, and the sub-maps are the columns.)
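
To make that contract concrete, here is a rough sketch of what a wordcount-style script might look like. The method names, arguments, and return shapes below are inferred from the description above rather than copied from Virgil; see the wordcount.rb linked earlier for the real example.

# Sketch only: names and argument shapes are assumptions based on the contract described above.

# Called once per Cassandra row; 'columns' is a map of column name => column value.
# Must return a list of [key, value] tuples for Hadoop to sort.
def map(columns)
  tuples = []
  columns.each_value do |text|
    text.split(/\s+/).each { |word| tuples << [word, "1"] }
  end
  tuples
end

# Called with a key and the list of values Hadoop collected for it.
# Must return a map of maps: row key => { column name => value },
# which gets written back to the output column family.
def reduce(key, values)
  { key => { "count" => values.size.to_s } }
end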

Presently, the actual job runs inside the Virgil JVM and the HTTP connection is left open until the job completes. Over the next week or two, we'll fix that. We intend to implement the ability to distribute that job across an existing Hadoop cluster. Stay tuned.

For more information see the Virgil wiki:
http://code.google.com/a/apache-extras.org/p/virgil/wiki/mapreduce