Thursday, November 8, 2012

Big Data Quadfecta: (Cassandra + Storm + Kafka) + Elastic Search


In my previous post, I discussed our BigData Trifecta, which includes Storm, Kafka and Cassandra. Kafka played the role of our work/data queue.  Storm filled the role of our data intake / flow mechanism, and Cassandra our system of record for all things storage.

Cassandra is fantastic for flexible/scalable storage, but unless you purchase Datastax Enterprise, you're on your own for unstructured search.  Thus, we wanted a mechanism that could index the data we put in Cassandra.

Initially, we went with our Cassandra trigger mechanism, connected to a SOLR backend (https://github.com/hmsonline/cassandra-indexing).  That was sufficient, but as we scale our use of Cassandra, we anticipate a much greater load on SOLR, which means the additional burden of managing master/slave relationships.  To get ahead of that, we wanted to look at alternatives.

We had evaluated Elastic Search (ES) before choosing SOLR.  ES was better in almost every aspect: performance, administration, scalability, etc.  But it still felt young.  We did that evaluation back in mid-2011, and finding commercial support for ES was difficult compared to SOLR.

That changed substantially this year, however, when Elastic Search incorporated and pulled in some substantial players.  With our reservations addressed, we decided to re-examine Elastic Search, but from a new angle.

We now have Storm solidly in our technology stack.  With Storm acting as our intake mechanism, we decided to move away from the trigger-based mechanism and instead orchestrate the data flow between Cassandra and ES using Storm.

That motivated us to write and open-source an Elastic Search bolt for Storm:
https://github.com/hmsonline/storm-elastic-search

We simply tacked that bolt onto the end of our Storm topology and, with little effort, we have an index of all the data we write to Cassandra.

For the bolt, we implemented the same "mapper" pattern that we put in place when we refactored the Storm Cassandra bolt.  To use the bolt, you just need to implement TupleMapper, which has the following methods:

    public String mapToDocument(Tuple tuple);
    public String mapToIndex(Tuple tuple);
    public String mapToType(Tuple tuple);
    public String mapToId(Tuple tuple); 

Similar to the Cassandra Bolt, where you map a tuple into a Cassandra row, here you simply map the tuple to a document that can be posted to Elastic Search (ES).  ES needs four pieces of information to index a document: the document itself (JSON), the index to which the document should be added, the type of the document, and its id.
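
As a rough illustration of the mapper pattern, here is a minimal sketch of a custom mapper.  To keep it self-contained, Tuple and TupleMapper below are cut-down stand-ins for the real Storm and storm-elastic-search types.  The EventTupleMapper class, the "events" index, the "event" type, and the "id" / "json" field names are all made up for the example.

```java
// Stand-in for Storm's Tuple, reduced to the single accessor used here.
interface Tuple {
    String getStringByField(String field);
}

// Stand-in for the bolt's TupleMapper, restating the four methods listed above.
interface TupleMapper {
    String mapToDocument(Tuple tuple);
    String mapToIndex(Tuple tuple);
    String mapToType(Tuple tuple);
    String mapToId(Tuple tuple);
}

// Hypothetical mapper: every tuple goes to one fixed index and type,
// while the document id and JSON body are read from tuple fields.
class EventTupleMapper implements TupleMapper {
    @Override
    public String mapToDocument(Tuple tuple) {
        return tuple.getStringByField("json"); // assumed field name
    }

    @Override
    public String mapToIndex(Tuple tuple) {
        return "events"; // assumed index name
    }

    @Override
    public String mapToType(Tuple tuple) {
        return "event"; // assumed type name
    }

    @Override
    public String mapToId(Tuple tuple) {
        return tuple.getStringByField("id"); // assumed field name
    }
}
```

Hard-coding the index and type like this would suit a topology that only ever writes one kind of document.  If the destination varies per tuple, those two methods can read from the tuple as well.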

We included a default mapper, which does nothing more than extract the four different pieces of data directly from the tuple based on field name:
https://github.com/hmsonline/storm-elastic-search/blob/master/src/main/java/com/hmsonline/storm/contrib/bolt/elasticsearch/mapper/DefaultTupleMapper.java
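
For a sense of what "based on field name" means, a field-name-driven mapper might look roughly like the sketch below.  This is not the code from the repository linked above: Tuple and TupleMapper are cut-down stand-ins declared only so the example compiles on its own, and the class name FieldNameTupleMapper and the field names "document", "index", "type" and "id" are assumptions.  Check DefaultTupleMapper.java for the names the bolt actually expects.

```java
// Stand-in for Storm's Tuple, reduced to the single accessor used here.
interface Tuple {
    String getStringByField(String field);
}

// Stand-in for the bolt's TupleMapper with its four mapping methods.
interface TupleMapper {
    String mapToDocument(Tuple tuple);
    String mapToIndex(Tuple tuple);
    String mapToType(Tuple tuple);
    String mapToId(Tuple tuple);
}

// Hypothetical field-name-driven mapper: each piece of data ES needs
// is read straight from a tuple field with a well-known name.
class FieldNameTupleMapper implements TupleMapper {
    // Assumed field names; the real default mapper may use different ones.
    static final String DOCUMENT_FIELD = "document";
    static final String INDEX_FIELD = "index";
    static final String TYPE_FIELD = "type";
    static final String ID_FIELD = "id";

    @Override
    public String mapToDocument(Tuple tuple) {
        return tuple.getStringByField(DOCUMENT_FIELD);
    }

    @Override
    public String mapToIndex(Tuple tuple) {
        return tuple.getStringByField(INDEX_FIELD);
    }

    @Override
    public String mapToType(Tuple tuple) {
        return tuple.getStringByField(TYPE_FIELD);
    }

    @Override
    public String mapToId(Tuple tuple) {
        return tuple.getStringByField(ID_FIELD);
    }
}
```

With a mapper of this shape, the upstream bolt or spout is responsible for emitting tuples that carry all four fields, so the routing decision (which index, which type) lives in the topology rather than in the mapper.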

The bolt is still *very* raw.  I have a sample topology that uses it with the Cassandra Bolt. I'll try to get that out there ASAP.

Sidenote: We are still waiting on a resolution to the classpath issue in Storm before we can do another release.  (See: https://github.com/nathanmarz/storm/issues/115)

As always, let me know if you have any trouble. 

1 comment:

Tim Terlegård said...

Have you implemented or used a Cassandra Spout? I googled for one, but couldn't find any.

Shortly I will investigate the fastest possible way to take all rows in a Cassandra Column Family and index them in SOLR. So I found your articles about Cassandra, Storm and SOLR/ES interesting.