Friday, April 3, 2015

High-Performance Computing Clusters (HPCC) and Cassandra on OS X


Our new parent company, LexisNexis, has one of the world's largest public records databases:

"...our comprehensive collection of more than 46 billion records from more than 10,000 diverse sources—including public, private, regulated, and derived data. You get comprehensive information on approximately 269 million individuals and 277 million unique businesses."
http://www.lexisnexis.com/en-us/products/public-records.page

And they've been managing, analyzing and searching this database for decades.  Over that time period, they've built up quite an assortment of "Big Data" technologies.  Collectively, LexisNexis refers to those technologies as their High-Performance Computing Cluster (HPCC) platform.
http://hpccsystems.com/Why-HPCC/How-it-works

HPCC is entirely open source:
https://github.com/hpcc-systems/HPCC-Platform

Naturally, we are working through the marriage of HPCC with our real-time data management and analytics stack.  The potential is really exciting.  Specifically, HPCC has sophisticated machine learning and statistics libraries, and a query engine (Roxie) capable of serving up those statistics.
http://hpccsystems.com/ml

Lo and behold, HPCC can use Cassandra as a backend storage mechanism! (FTW!)

The HPCC platform isn't technically supported on a Mac, but here is what I did to get it running:

HPCC Install

  • Install the dependencies with Homebrew:
      brew install icu4c
      brew install boost
      brew install libarchive
      brew install bison27
      brew install openldap
      brew install nodejs
      
  • Make a build directory, and run cmake from there:
      export CC=/usr/bin/clang
      export CXX=/usr/bin/clang++
      cmake ../ -DICU_LIBRARIES=/usr/local/opt/icu4c/lib/libicuuc.dylib -DICU_INCLUDE_DIR=/usr/local/opt/icu4c/include -DLIBARCHIVE_INCLUDE_DIR=/usr/local/opt/libarchive/include -DLIBARCHIVE_LIBRARIES=/usr/local/opt/libarchive/lib/libarchive.dylib -DBOOST_REGEX_LIBRARIES=/usr/local/opt/boost/lib -DBOOST_REGEX_INCLUDE_DIR=/usr/local/opt/boost/include  -DUSE_OPENLDAP=true -DOPENLDAP_INCLUDE_DIR=/usr/local/opt/openldap/include -DOPENLDAP_LIBRARIES=/usr/local/opt/openldap/lib/libldap_r.dylib -DCLIENTTOOLS_ONLY=false -DPLATFORM=true
  • Then compile and install with: sudo make install
  • After that, you'll need to muck with the permissions a bit:
      sudo chmod -R a+rwx /opt/HPCCSystems/
      sudo chmod -R a+rwx /var/lock/HPCCSystems
      sudo chmod -R a+rwx /var/log/HPCCSystems
      
  • Now, ordinarily you would run hpcc-init to get the system configured, but that script fails on OS X. Instead, I used Linux to generate config files that work and posted those to a repository here:
  • Clone this repository and replace /var/lib/HPCCSystems with the content of var_lib_hpccsystems.zip
      sudo rm -fr /var/lib/HPCCSystems
      sudo unzip var_lib_hpccsystems.zip -d /var/lib
      sudo chmod -R a+rwx /var/lib/HPCCSystems
      
  • Then, from the directory containing the xml files in this repository, you can run:
      daserver (Runs the Dali server, which is the persistence mechanism for HPCC)
      esp (Runs the ESP server, which is the web services and UI layer for HPCC)
      eclccserver (Runs the ECL compile server, which takes the ECL and compiles it down to C++ and then a dynamic library)
      roxie (Runs the Roxie server, which is capable of responding to queries)
  • Kick off each one of those, then go to http://localhost:8010 in a browser. You are ready to run some ECL!
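Starting four processes by hand gets tedious, so here is a minimal sketch of a launcher. It assumes the four binaries are on your PATH, that they stay in the foreground when started by hand, and that you run it from the directory containing the xml files. The LOG_DIR location and the start_hpcc function name are made up for this example. Dali is started first because the other components depend on it for persistence.

```shell
#!/bin/sh
# Sketch: start the four HPCC daemons in the background, one log file each.
# LOG_DIR is a hypothetical location chosen for this example.
LOG_DIR="${LOG_DIR:-./hpcc-logs}"
HPCC_DAEMONS="daserver esp eclccserver roxie"

start_hpcc() {
    mkdir -p "$LOG_DIR"
    for daemon in $HPCC_DAEMONS; do
        # Skip (rather than abort) if a binary is not on the PATH.
        if ! command -v "$daemon" >/dev/null 2>&1; then
            echo "skipping $daemon: not found on PATH" >&2
            continue
        fi
        # Run in the background and capture stdout/stderr per daemon.
        "$daemon" > "$LOG_DIR/$daemon.log" 2>&1 &
        echo "started $daemon (pid $!)"
    done
}
```

Source the file and call start_hpcc, then check the per-daemon logs under LOG_DIR if http://localhost:8010 does not come up.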

Running ECL

Like Pig with Hadoop, HPCC runs a DSL called ECL.  More information on ECL can be found here:
http://hpccsystems.com/download/docs/learning-ecl
  • As a simple smoke test, go into your HPCC-Platform repository and change to the ./testing/regress/ecl directory.
  • Then, run the following:
      ecl run hello.ecl --target roxie --server=localhost:8010
  • You should see the following:
        <dataset name="Result 1"> 
        <row><result_1>Hello world</result_1></row> 
        </dataset>
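If you want the smoke test to give a pass/fail answer instead of eyeballing the XML, something like the following sketch could work. The check_hello helper is hypothetical (it is not part of HPCC); it only looks for the expected row in whatever ecl run prints.

```shell
#!/bin/sh
# Sketch: turn the hello.ecl smoke test into a pass/fail check.
# check_hello is a hypothetical helper; it reads the output of `ecl run` on stdin.
check_hello() {
    if grep -q '<result_1>Hello world</result_1>'; then
        echo "smoke test passed"
    else
        echo "smoke test FAILED: expected row not found" >&2
        return 1
    fi
}

# Usage, from ./testing/regress/ecl in the HPCC-Platform repository:
#   ecl run hello.ecl --target roxie --server=localhost:8010 | check_hello
```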
      
      

Cassandra Plugin

With HPCC up and running, we are ready to have some fun with Cassandra.  HPCC has plugins.  Those plugins reside in /opt/HPCCSystems/plugins.  For me, I had to copy those libraries into /opt/HPCCSystems/lib to get HPCC to recognize them.
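The copy step can be sketched as below. The two directories come from the paragraph above; the *.dylib pattern is an assumption about how the plugin libraries are named on OS X, and copy_plugins is just a name picked for this example.

```shell
#!/bin/sh
# Sketch: copy HPCC plugin libraries next to the core libraries.
# The *.dylib pattern is an assumption about plugin file names on OS X.
PLUGIN_DIR="${PLUGIN_DIR:-/opt/HPCCSystems/plugins}"
LIB_DIR="${LIB_DIR:-/opt/HPCCSystems/lib}"

copy_plugins() {
    count=0
    for lib in "$PLUGIN_DIR"/*.dylib; do
        # If the glob matched nothing, the literal pattern is left in place.
        [ -e "$lib" ] || continue
        cp "$lib" "$LIB_DIR/" && count=$((count + 1))
    done
    echo "copied $count plugin libraries to $LIB_DIR"
}
```

Call copy_plugins once after the install, then restart the HPCC processes so they pick up the libraries.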

Go back to the /opt/HPCCSystems/testing/regress/ecl directory and have a look at cassandra-simple.ecl. A snippet is shown below:

-------------------------

childrec := RECORD
   string name,
   integer4 value { default(99999) },
   boolean boolval { default(true) },
   real8 r8 {default(99.99)},
   real4 r4 {default(999.99)},
   DATA d {default (D'999999')},
   DECIMAL10_2 ddd {default(9.99)},
   UTF8 u1 {default(U'9999 ß')},
   UNICODE u2 {default(U'9999 ßßßß')},
   STRING a,
   SET OF STRING set1,
   SET OF INTEGER4 list1,
   LINKCOUNTED DICTIONARY(maprec) map1{linkcounted};
END;

init := DATASET([{'name1', 1, true, 1.2, 3.4, D'aa55aa55', 1234567.89, U'Straße', U'Straße','Ascii',['one','two','two','three'],[5,4,4,3],[{'a'=>'apple'},{'b'=>'banana'}]},
                 {'name2', 2, false, 5.6, 7.8, D'00', -1234567.89, U'là', U'là','Ascii', [],[],[]}], childrec);

load(dataset(childrec) values) := EMBED(cassandra : user('boneill'),keyspace('test'),batch('unlogged'))
  INSERT INTO tbl1 (name, value, boolval, r8, r4,d,ddd,u1,u2,a,set1,list1,map1) values (?,?,?,?,?,?,?,?,?,?,?,?,?);
ENDEMBED;

--------------------

In this example, we define childrec as a RECORD with a set of fields. We then create a DATASET of type childrec. Then we define a method that takes a dataset of type childrec and runs the Cassandra insert command for each of the records in the dataset.

Start up Cassandra locally: download Cassandra, unzip it, then run bin/cassandra -f (the -f flag keeps it in the foreground).

Once Cassandra is up, simply run the ECL as you did with the hello program.

ecl run cassandra-simple.ecl --target roxie --server=localhost:8010

You can then go over to cqlsh and validate that all the data made it into Cassandra:

➜  cassandra  bin/cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> select * from test.tbl1 limit 5;

 name      | a | boolval | d              | ddd  | list1 | map1 | r4     | r8     | set1 | u1     | u2        | value
-----------+---+---------+----------------+------+-------+------+--------+--------+------+--------+-----------+-------
  name1575 |   |    True | 0x393939393939 | 9.99 |  null | null | 1576.6 |   1575 | null | 9999 ß | 9999 ßßßß |  1575
  name3859 |   |    True | 0x393939393939 | 9.99 |  null | null | 3862.9 |   3859 | null | 9999 ß | 9999 ßßßß |  3859
 name11043 |   |    True | 0x393939393939 | 9.99 |  null | null |  11054 |  11043 | null | 9999 ß | 9999 ßßßß | 11043
  name3215 |   |    True | 0x393939393939 | 9.99 |  null | null | 3218.2 |   3215 | null | 9999 ß | 9999 ßßßß |  3215
  name7608 |   |   False | 0x393939393939 | 9.99 |  null | null | 7615.6 | 7608.1 | null | 9999 ß | 9999 ßßßß |  7608

OK -- that should give a little taste of ECL and HPCC. It is a powerful platform.
As always, let me know if you run into any trouble.

1 comment:

Richard said...

Rather than generating configuration files on linux and copying them across, it's probably preferable to use configgen to generate them on OSX directly. I haven't tried this, but it should work (and is basically what the linux process will be doing).

Richard