Tuesday, November 27, 2012

Compiling Storm (and jzmq) on Mac OSX


I recently set up a new machine, went to compile Storm, and forgot what a PITA it is.  Here is a blog post so future-me knows how to do it.

The biggest hurdle is jzmq.   To get that installed, you're going to need some make mojo and some libraries.  I use Brew.  If you are still using MacPorts, consider a switch.

Anyway, here is the recipe to get jzmq installed assuming you have Brew.
$ git clone https://github.com/nathanmarz/jzmq.git
$ brew install automake
$ brew install libtool
$ brew install zmq
$ cd jzmq
$ ./configure

That resulted in a cryptic message:
cannot find jni.h in /Library/Java/Home/include 

To get past that, I found a suggested fix: create a symbolic link.
$ sudo ln -s /System/Library/Frameworks/JavaVM.framework/Versions/A/Headers/ /Library/Java/Home/include

Then, you hit this:
No rule to make target `classdist_noinst.stamp'

I had to dig through Google cache archives to find a solution for that one:
$ touch src/classdist_noinst.stamp

Then, you hit:
error: cannot access org.zeromq.ZMQ
That's because the jzmq Java classes haven't been compiled yet!  So:
$ cd src
$ javac -d . org/zeromq/*.java
Wasn't that easy? =)  All that's left is to build and install:
$ cd ..
$ make
$ sudo make install

Now you have jzmq installed, so we can get on with Storm.  Storm needs lein to build, but don't go grabbing the latest version of lein: you'll need something < 2.  There is an explicit check in Storm that will refuse to build with lein >= 2.  You can grab older versions from the Leiningen downloads page (we use 1.7.1).

Unzip that and copy $LEIN_HOME/bin/lein to a directory on your PATH, and make it executable.  Once you've done that, building Storm isn't so bad.  From the root of the Storm source tree:
$ lein deps
$ lein jar 
$ lein install

Happy Storming.

Monday, November 26, 2012

Running Cassandra 1.2-beta on JDK 7 w/ Mac OSX: no snappyjava in java.library.path


The latest and greatest Cassandra (1.2-beta) now uses snappy-java for compression.  Unfortunately, for now Cassandra uses version 1.0.4.1 of snappy-java, and that version doesn't play well with JDK 7 on Mac OSX.

There is a known bug:
https://github.com/xerial/snappy-java/issues/6

The fix is in the latest milestone release:
1.0.5-M3

Until that is formally released and Cassandra upgrades its dependency, if you want to run Cassandra under JDK 7 on Mac OSX, follow the instructions at the bottom of that issue.  Basically, unzip the snappy-java jar file and copy the jni library into $CASSANDRA_HOME.  You can see below that I used the jar file from my m2 repo.

bone@zen:~/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1-> unzip -l snappy-java-1.0.4.1.jar | grep Mac | grep jni
    44036  10-05-11 10:34   org/xerial/snappy/native/Mac/i386/libsnappyjava.jnilib
    49792  10-05-11 10:34   org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib

Extract the jar and copy the libsnappyjava.jnilib file into $CASSANDRA_HOME, and you should be good to go.  If you extracted the version in your m2 repo, that's (run from $CASSANDRA_HOME):
cp ~/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib .

Alternatively, if you are building Cassandra from source, you can upgrade the snappy-java version yourself (in the build.xml).

BTW, Cassandra is tracking this under issue:
https://issues.apache.org/jira/browse/CASSANDRA-4400

Wednesday, November 21, 2012

Installing JDK 7 on Mac OS X


To get JDK 7 up and running, a little surgery is required.  So, I headed over to:
/System/Library/Frameworks/JavaVM.framework/Versions
This is where the system JVMs are stored.  You'll notice a symbolic link for CurrentJDK.  It probably points to:
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents
You're going to want to point that to the new JDK, which java_home tells us is located in:
bone@zen:/usr/libexec$ /usr/libexec/java_home
/Library/Java/JavaVirtualMachines/jdk1.7.0_09.jdk/Contents/Home
So, the magic commands you need are:
bone@zen:/System/Library/Frameworks/JavaVM.framework/Versions$ sudo rm CurrentJDK
bone@zen:/System/Library/Frameworks/JavaVM.framework/Versions$ sudo ln -s /Library/Java/JavaVirtualMachines/jdk1.7.0_09.jdk/Contents/ CurrentJDK
Then, you should be good:
bone@zen:/System/Library/Frameworks/JavaVM.framework/Versions$ java -version 
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Thursday, November 8, 2012

Big Data Quadfecta: (Cassandra + Storm + Kafka) + Elastic Search


In my previous post, I discussed our Big Data Trifecta, which includes Storm, Kafka and Cassandra.  Kafka plays the role of our work/data queue, Storm fills the role of our data intake/flow mechanism, and Cassandra is our system of record for all things storage.

Cassandra is fantastic for flexible/scalable storage, but unless you purchase DataStax Enterprise, you're on your own for unstructured search.  Thus, we wanted a mechanism that could index the data we put into Cassandra.

Initially, we went with our Cassandra trigger mechanism (https://github.com/hmsonline/cassandra-indexing) connected to a SOLR backend.  That was sufficient, but as we scale our use of Cassandra, we anticipate a much greater load on SOLR, which means the additional burden of managing master/slave relationships.  Trying to get ahead of that, we wanted to look at alternatives.

We evaluated Elastic Search (ES) before choosing SOLR.  ES was better in almost every aspect: performance, administration, scalability, etc.  BUT, it still felt young.  We did this evaluation back in mid-2011, and finding commercial support for ES was difficult compared to SOLR.

That changed substantially this year, however, when Elastic Search incorporated and pulled in some substantial players.  With our reservations addressed, we decided to re-examine Elastic Search, but from a new angle.

We now have Storm solidly in our technology stack.  With Storm acting as our intake mechanism, we decided to move away from triggers and instead orchestrate the data flow between Cassandra and ES using Storm.

That motivated us to write and open-source an Elastic Search bolt for Storm:
https://github.com/hmsonline/storm-elastic-search

We simply tacked that bolt onto the end of our Storm topology, and with little effort we had an index of all the data we write to Cassandra.

For the bolt, we implemented the same "mapper" pattern that we put in place when we refactored the Storm Cassandra bolt.  To use the bolt, you just need to implement TupleMapper, which has the following methods:

    public String mapToDocument(Tuple tuple);
    public String mapToIndex(Tuple tuple);
    public String mapToType(Tuple tuple);
    public String mapToId(Tuple tuple); 

Similar to the Cassandra bolt, where you map a tuple into a Cassandra row, here you simply map the tuple to a document that can be posted to Elastic Search.  ES needs four pieces of information to index a document: the document itself (JSON), the index to which the document should be added, the id of the document, and the type.
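
For example, here is a minimal sketch of a custom mapper for a hypothetical stream of status updates.  The field names ("document", "key"), the index/type values, and the assumption that TupleMapper lives alongside DefaultTupleMapper in the mapper package are all mine, for illustration only:

    import backtype.storm.tuple.Tuple;
    import com.hmsonline.storm.contrib.bolt.elasticsearch.mapper.TupleMapper;

    public class StatusTupleMapper implements TupleMapper {
        // the tuple already carries the JSON document we want to index
        public String mapToDocument(Tuple tuple) {
            return tuple.getStringByField("document");
        }
        // send all status updates to a single index
        public String mapToIndex(Tuple tuple) {
            return "statuses";
        }
        public String mapToType(Tuple tuple) {
            return "status";
        }
        // reuse the Cassandra row key as the document id
        public String mapToId(Tuple tuple) {
            return tuple.getStringByField("key");
        }
    }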

We included a default mapper, which does nothing more than extract the four different pieces of data directly from the tuple based on field name:
https://github.com/hmsonline/storm-elastic-search/blob/master/src/main/java/com/hmsonline/storm/contrib/bolt/elasticsearch/mapper/DefaultTupleMapper.java

The bolt is still *very* raw.  I have a sample topology that uses it with the Cassandra bolt; I'll try to get that out there ASAP.
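
In the meantime, here is a rough sketch of the shape of that topology.  The class names (StatusSpout, CassandraBolt, ElasticSearchBolt) are hypothetical stand-ins for your own spout and the two bolts, not the projects' actual APIs:

    import backtype.storm.topology.TopologyBuilder;

    public class QuadfectaTopology {
        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            // the source of tuples (a Kafka spout in our stack)
            builder.setSpout("status-spout", new StatusSpout(), 1);
            // persist each tuple to Cassandra first...
            builder.setBolt("cassandra-bolt", new CassandraBolt(), 2)
                   .shuffleGrouping("status-spout");
            // ...then tack the ES bolt onto the end to index the same data
            builder.setBolt("es-bolt", new ElasticSearchBolt(new StatusTupleMapper()), 2)
                   .shuffleGrouping("cassandra-bolt");
            return builder;
        }
    }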

Sidenote: We are still waiting on a resolution to the classpath issue in Storm before we can do another release.  (See: https://github.com/nathanmarz/storm/issues/115)

As always, let me know if you have any trouble.