Monday, December 19, 2011

Programmatically submitting jobs to a remote Hadoop Cluster

I'm adding the ability to deploy a Map/Reduce job to a remote Hadoop cluster in Virgil. With this, Virgil allows users to make a REST POST to schedule a Hadoop job. (pretty handy)

To get this to work properly, Virgil needed to be able to remotely deploy a job. Ordinarily, to run a job against a remote cluster you issue a command from the shell:

hadoop jar $JAR_FILE $CLASS_NAME 

We wanted to do the same thing, but from within the Virgil runtime. It was easy enough to find the class we needed to use: RunJar. RunJar's main() method stages the jar and submits the job. Thus, to achieve the same functionality as the command line, we used the following:

 List args = new ArrayList(); 
 RunJar.main(args.toArray(new String[0])); 

That worked just fine, but would result in a local job deployment. To get it to deploy to a remote cluster, we needed Hadoop to load the cluster configuration. For Hadoop, cluster configuration is spread across three files: core-site.xml, hdfs-site.xml, and mapred-site.xml. To get the Hadoop runtime to load the configuration, you need to include these files on your classpath. The key line is found in the configuration Hadoop Javadoc.

"Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:"

Once we dropped the cluster configuration onto the classpath, everything worked like a charm.

No comments: