Friday, January 23, 2015

Hadoop for Cassandra: CqlInputFormat != CqlPagingInputFormat != ColumnFamilyInputFormat

We haven't had cause to write a Hadoop job against Cassandra since the old days of Thrift (that is, since we introduced Elasticsearch into our system). But this week, we found ourselves needing to get some metrics on data stored in the actual C* tables.

I went to the documentation and found this page:

That page references:
"CQL partition input format: ColumnFamilyInputFormat class"

I was familiar with the ColumnFamilyInputFormat class from the old Thrift days, and I was pretty sure that a newer, CQL-based InputFormat was available. I headed over to the code, dropped down to the 2.0 branch and found this:

Notice the import:
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat

I went happily along my way and implemented the MapReduce code using this InputFormat, but the compiler kept complaining that CqlPagingInputFormat could not be found. After some investigation, it looks like that class was removed from cassandra-all sometime between 2.0.3 and 2.0.11: the 2.0.3 jar contains CqlPagingInputFormat but no CqlInputFormat, and the 2.0.11 jar contains CqlInputFormat but no CqlPagingInputFormat. See below:

➜  tusk  unzip -l /Users/bone/.m2/repository/org/apache/cassandra/cassandra-all/2.0.11/cassandra-all-2.0.11.jar | grep Cql | grep Input
     2882  10-21-14 16:31   org/apache/cassandra/hadoop/cql3/CqlInputFormat.class
➜  tusk  unzip -l /Users/bone/.m2/repository/org/apache/cassandra/cassandra-all/2.0.3/cassandra-all-2.0.3.jar | grep Cql | grep Input
     1359  11-22-13 08:56   org/apache/cassandra/hadoop/cql3/CqlPagingInputFormat$1.class
     2875  11-22-13 08:56   org/apache/cassandra/hadoop/cql3/CqlPagingInputFormat.class
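
The same check can be done from code instead of by unzipping jars, which is handy when you are not sure which cassandra-all version actually ends up on the job's classpath. What follows is only a rough sketch, not anything shipped with Cassandra: the class name InputFormatProbe and the method isOnClasspath are made up for illustration, and the three class names it probes are simply the ones mentioned in this post.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Cassandra): reports which of the
// InputFormat classes discussed in this post are visible on the classpath.
public class InputFormatProbe {

    // Fully qualified names taken from the jar listings and the docs quoted above.
    static final List<String> CANDIDATES = Arrays.asList(
            "org.apache.cassandra.hadoop.cql3.CqlInputFormat",
            "org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat",
            "org.apache.cassandra.hadoop.ColumnFamilyInputFormat");

    // Returns true if the named class can be loaded. The class is not
    // initialized, so no static initializers run as a side effect.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, InputFormatProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers the case where the class file is present
            // but something it depends on (e.g. Hadoop itself) is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : CANDIDATES) {
            System.out.println((isOnClasspath(name) ? "found    " : "missing  ") + name);
        }
    }
}
```

Run it with the same classpath as the job (for example, with the cassandra-all jar and the Hadoop jars on -cp). Note that a class which is present but whose dependencies are absent is also reported as missing, so a "missing" result is only meaningful when the rest of the job's dependencies are on the classpath too.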

It looks like the crew is already addressing it:

Hopefully no one else runs into this. ;)