Tuesday, October 21, 2014

Brew trouble w/ Yosemite (gcc47 and subversion)


Your mileage may vary on this one, but after upgrading to Yosemite, I encountered two issues with brew.

The first was with gcc47.  After running "brew upgrade", it got stuck on the gcc47 formula, which errored out with:

==> Upgrading gcc47
gcc47: OS X Mavericks or older is required for stable.
Use `brew install devel or --HEAD` for newer.
Error: An unsatisfied requirement failed this build.

I decided it was better to move to gcc48 than to bother trying to get gcc47 working. To do that, just install gcc48 and remove gcc47 with:


➜  ~  brew install gcc48
➜  ~  brew uninstall gcc47
That fixed that.  Then, I ran into an issue with subversion, which relied on serf.  Serf failed to compile with:
#include <apr_pools.h>
         ^
1 error generated.
scons: *** [context.o] Error 1
scons: building terminated because of errors.
There is an open issue with brew for that one. (https://github.com/Homebrew/homebrew/issues/33422)

But you can get around the issue by installing the command-line tools for Xcode with:

➜ xcode-select --install

If you run into the same issues, hopefully that fixes you.

Example: Parsing tab delimited file using OpenCSV


I prefer opencsv for CSV parsing in Java.  That library also supports parsing of tab-delimited files.  Here's how:

Just a quick gist:
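Here's a minimal sketch of the idea, assuming the opencsv 2.x API (the au.com.bytecode.opencsv package; newer versions moved to com.opencsv and a builder style).  The sample data is just a placeholder; in practice you'd hand CSVReader a FileReader:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

import au.com.bytecode.opencsv.CSVReader;

public class TabDelimitedExample {
    public static void main(String[] args) throws IOException {
        // Sample tab-delimited content; in practice this would come from a FileReader
        String tsv = "first\tlast\tzip\nBart\tSimpson\t94111\n";

        // The second constructor argument is the separator: '\t' instead of the default ','
        CSVReader reader = new CSVReader(new StringReader(tsv), '\t');
        try {
            List<String[]> rows = reader.readAll();
            for (String[] row : rows) {
                System.out.println(row.length + " fields: " + Arrays.toString(row));
            }
        } finally {
            reader.close();
        }
    }
}
```

That's really all there is to it: the separator character is the only thing that changes from the comma-delimited case.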

Tuesday, October 14, 2014

Sqoop Oracle Example : Getting Started with Oracle -> HDFS import/extract


In this post, we'll get Sqoop (1.99.3) connected to an Oracle database, extracting records to HDFS.

Add Oracle Driver to Sqoop Classpath

The first thing we'll need to do is copy the Oracle JDBC jar file into the Sqoop lib directory.  Note, this directory may not exist.  You may need to create it.

For me, this amounted to:
➜  sqoop  mkdir lib
➜  sqoop  cp ~/git/boneill/data-lab/lib/ojdbc6.jar ./lib

Add YARN and HDFS to Sqoop Classpath

Next, you will need to add the HDFS and YARN jar files to the classpath of Sqoop.  If you recall from the initial setup, the classpath is controlled by the common.loader property in the server/conf/catalina.properties file.  To get things submitting to the YARN cluster properly, I added the following additional paths to the common.loader property:

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/*.jar,/Users/bone/tools/hadoop/share/hadoop/yarn/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/mapreduce/*.jar,/Users/bone/tools/hadoop/share/hadoop/tools/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/hdfs/*.jar,/Users/bone/tools/hadoop/share/hadoop/yarn/*.jar

Note the added Hadoop paths.

*IMPORTANT* : Restart your Sqoop server so it picks up the new jar files. 
(including the driver jar!)

Create JDBC Connection

After that, we can fire up the client, and create a connection with the following:

bin/sqoop.sh client
...
sqoop> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: my_datasource
Connection configuration
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@change.me:1521:service.name
Username: your.user
Password: ***********
JDBC Connection Properties:
There are currently 0 values in the map:
entry# HIT RETURN HERE!
Security related configuration options
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1

Create Sqoop Job

Next step is to make a job.  This is done with the following:

sqoop> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: data_import

Database configuration

Schema name: MY_SCHEMA
Table name: MY_TABLE
Table SQL statement:
Table column names:
Partition column name: UID
Nulls in partition column:
Boundary query:

Output configuration

Storage type:
  0 : HDFS
Choose: 0
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format:
  0 : NONE
...
Choose: 0
Output directory: /user/boneill/dump/

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE  and persistent id 3

Everything is fairly straightforward. The output directory is the HDFS directory to which the output will be written.

Run the job!

This was actually the hardest step, because the documentation is out of date (AFAIK).  Instead of using "submission" as the documentation states, use the following:

sqoop> start job --jid 1
Submission details
Job ID: 3
Server URL: http://localhost:12000/sqoop/
Created by: bone
Creation date: 2014-10-14 13:27:57 EDT
Lastly updated by: bone
External ID: job_1413298225396_0001
	http://your_host:8088/proxy/application_1413298225396_0001/
2014-10-14 13:27:57 EDT: BOOTING  - Progress is not available

From there, you should be able to see the job in YARN!

After a bit of churning, you should be able to go over to HDFS and find your files in the output directory.

Best of luck all.  Let me know if you have any trouble.



Monday, October 13, 2014

Sqoop 1.99.3 w/ Hadoop 2 Installation / Getting Started Craziness (addtowar.sh not found, common.loader, etc.)


We have a ton of data in relational databases that we are looking to migrate onto our Big Data platform.  We took an initial look around and decided Sqoop might be worth a try.   I ran into some trouble getting Sqoop up and running.  Herein lies that story...

The main problem is the documentation (and google).  It appears as though Sqoop changed install processes between minor dot releases.  Google will likely land you on this documentation:
http://sqoop.apache.org/docs/1.99.1/Installation.html

That documentation mentions a shell script, ./bin/addtowar.sh.  That shell script no longer exists in sqoop version 1.99.3.  Instead you should reference this documentation:
http://sqoop.apache.org/docs/1.99.3/Installation.html

In that documentation, they mention the common.loader property in server/conf/catalina.properties.   If you haven't been following the Tomcat scene, that is the new property that allows you to load jar files onto your classpath without dropping them into $TOMCAT/lib, or your war file. (yuck)

To get Sqoop running, you'll need all of the Hadoop jar files (and their transitive dependencies) on the CLASSPATH when Sqoop/Tomcat starts up.  And unless you add all of the Hadoop jar files to this property, you will end up with any or all of the following CNFE/NCDFE exceptions in your log file (found in server/logs/localhost*.log):

java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobClient
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration

Through trial and error, I found all of the paths needed for the common.loader property.  I ended up with the following in my catalina.properties:

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/*.jar,/Users/bone/tools/hadoop/share/hadoop/yarn/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/mapreduce/*.jar,/Users/bone/tools/hadoop/share/hadoop/tools/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/lib/*.jar

That got me past all of the classpath issues.  Note, in my case /Users/bone/tools/hadoop was a complete install of Hadoop 2.4.0.

I also ran into this exception:

Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/

That path has to point to your Hadoop conf directory.   You can find this setting in server/conf/sqoop.properties.  I updated mine to:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/Users/bone/tools/hadoop/etc/hadoop
(Again, /Users/bone/tools/hadoop is the directory of my hadoop installation)

OK ---  Now, you should be good to go!

Start the server with:
bin/sqoop.sh server start

Then, the client should work! (as shown below)

bin/sqoop.sh client
...
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version --all
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
server version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b

...

From there, follow this:
http://sqoop.apache.org/docs/1.99.3/Sqoop5MinutesDemo.html

Happy sqooping all.

Tuesday, October 7, 2014

Diction in Software Development (i.e. Don't be a d1ck!)

Over the years, I've come to realize how important diction is in software development (and life in general).  It may mean the difference between a 15 minute meeting where everyone nods their heads, and a day long battle of egos. (especially when you have a room full of passionate people)

Here are a couple of key words and phrases I've incorporated into my vernacular.  Hopefully, these will help out next time you are in an architecture/design session. (or a conversation with your significant other =)

"I appreciate X":
Always use the phrase, "I appreciate that..." in response to a point.  But more importantly, *mean* it. It is an age-old adage, but when talking, it is best to listen.  Once you've heard the other party, try to understand and appreciate what they are saying.  Then, let them know that you appreciate their point, before adding additional information to the conversation.  (tnx to +Jeff Klein for this one)

"I am not passionate about X"
To drive consensus, I try to hold focused design discussions.  During those discussions, I'd try to squash tangential topics.  I used to say, "I don't care about X, do whatever, we're focused elsewhere". Obviously, that would aggravate the people that did care about X.  These days, I use the phrase, "I am not passionate about that...".  People have different value-systems. Those value systems drive different priorities.  It is important to acknowledge that, while also keeping discussions focused.  (tnx to +Bob Binion for this one)

"What is the motivation / thought process behind X?"
People enter meetings with different intentions.   Those intentions often drive their positions.  It is tremendously important to understand those motivations, especially in larger companies or B2B interactions where organizational dynamics come into play.   In those contexts, success is often gauged by the size of one's department.  Thus, suggesting an approach that eliminates departments of people is a difficult conversation.  Sometimes you don't have the right people in the room for that conversation.  I've found that it is often useful to tease out motivations and the thought processes behind positions.  Then you can address them directly, or at least "appreciate"=) that motivations are in conflict. 

Eliminate the "But's"
This is a simple change, but a monumental difference in both diction and philosophy.  If you listen to *yourself*, you'll likely make statements like, "X is great, but consider Y".  You'll be surprised how many of these can actually be changed to, "X is great, *and* consider Y".    Believe me, this changes the whole dynamic of the conversation.  

AND, this actually begins to change the way you think...   

When you view the world as a set of boolean tradeoffs, "either you can have X or you can have Y", you never consider the potential of having X *and* Y.  Considering that possibility is at the root of many innovations.  (e.g.  I want multi-dimensional analytics AND I want them in real-time!) 

F. Scott Fitzgerald said, "The test of a first-rate intelligence is the ability to hold two opposed ideas in the mind at the same time, and still retain the ability to function."  These days, this is how I try to live my life, embracing the "what if?" and then delivering on it. =)

(tnx to +Jim Stogdill and +William Loftus for this perspective)



Tuesday, September 30, 2014

Rockstars drive Teslas, not F-150’s (Mac vs. Windows for development)

I do quite a bit of speaking at technology events.  And these days, when you are up on stage, there is a sea of glowing Apples staring back at you.  Almost every software developer I know cranks code on a Macbook Pro.  It’s become a status symbol, and a recruiting necessity.  Honestly though, it’s just a more productive platform.  With it, I can go further on less gas.  

It wasn’t always that way however...

I spent my elementary school years coding on an Apple IIc, hacking assembly code to get faster graphics routines.  For many of my generation, IIe's and IIc's were what we grew up with.  Every classroom had one.

When it came to buying my first machine however, I saved my nickels and dimes and bought an Intel 8088, with a monochrome graphics card.  Why? Because that was all I could afford in middle school. It wasn’t powerful or sexy.  I had to work to squeeze every last bit of performance out of it, which is likely why I then spent the next six years cutting my teeth on every operating system out there: Linux, FreeBSD, Solaris, Windows and OS/2 Warp.  (ooh, ahh, multi-tasking!)

I learned a ton, but I was definitely *not* productive.  I spent so much time fighting with the machinery that I didn’t have time to solve people’s problems.  And evidently, solving people’s problems is how you make money.

Admittedly, I partially blame my OCD for the hours I wasted away tweaking windows managers (much love for fvwm) and xterms so I never had to touch the mouse.  But the fact of the matter is, I was wasting time fighting and customizing my tools instead of using those tools to craft products and solutions.  Minimally, I’d say the operating systems and tools were distractions, but sometimes they were actual impediments.

Then came OS X.   An operating system with the power of unix, with none of the impediments and distractions.  I was freed to focus on problem solving.  And I never looked back.

Now -- these days, when I go to conferences, everyone has a mac.   Or maybe I should say everyone who’s anyone has a mac.  That’s not entirely true, but that is certainly the way it feels.  (* One exception to this rule is the hard-core hackers running the latest flavor of linux, hacking at such low-levels that they are one with the code. For them, I pour a little out.)

There is no doubt that Mac’s are now ubiquitous in the development community. But when I was recently approached by an executive of a local technology company wondering if they should introduce Mac’s into their environment, I had trouble expressing concrete functional differences (especially differences substantial enough to offset the cost of supporting two platforms).  Quite literally, Mac’s have that: ‘je ne sais quoi’.

After thinking it over a bit, I thought it might best to draw an analogy...  

It is true that Windows machines and Mac’s have very similar functional components and capabilities (cpu, storage, memory, etc.).  So do Tesla’s and Ford F-150’s (wheels, air conditioning, engines). Both will comfortably get you from point A to point B. And depending on the driver, it might even take the same amount of time.  There are even situations and terrain for which an F150 is more suited (e.g. ass-backwards country roads).

But between the two, the experience is completely different.  If you are looking to attract rockstar race car drivers, you’ll attract more with Teslas than you will with F150s.  And if you give a Tesla to the right race car driver, they will have you rocketing down the information superhighway, while you both enjoy the ride.

I love my MBP (macbook pro), and while I can’t afford a Tesla yet – it’s the closest thing I have to one. =)

Maybe someday the tables will turn, but for today – fuel your crew with mac’s, get them an assortment of decals (http://www.macdecals.com/), and keep them hydrated.   Sit back and enjoy the ride.

Thursday, September 25, 2014

[YARN] : NullPointerException in MRClientService.getHttpPort()

Have you moved your Hadoop jobs over to Yarn?  Are you seeing the following NullPointerException coming out of your job?
Caused by: java.lang.NullPointerException
 at org.apache.hadoop.mapreduce.v2.app.client.MRClientService.getHttpPort(MRClientService.java:167)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:150)
 ... 14 more
This appears to be a poorly handled case in Hadoop, where the webApp fails to load.  You need to look at the full log file.  You will likely see the following in the log preceding the NPE:
ERROR (org.apache.hadoop.mapreduce.v2.app.client.MRClientService:139) - Webapps failed to start. Ignoring for now:
That line is produced by the following code in YARN:
    LOG.info("Instantiated MRClientService at " + this.bindAddress);
    try {
      webApp = WebApps.$for("mapreduce", AppContext.class, appContext, "ws").with(conf).
          start(new AMWebApp());
    } catch (Exception e) {
      LOG.error("Webapps failed to start. Ignoring for now:", e);
    }
    super.start();
This means your webApp failed to load.  In my case, this was caused by dependency conflicts on my classpath.  (servlet & jasper compiler)  I've also seen it caused by a missing dependency, such as:
java.lang.ClassNotFoundException: Class org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer
In my situation, the solution was to clean up my classpath.  But the webApp might fail for lots of reasons, any of which will result in an NPE later in the log file when MRClientService attempts to getHttpPort without a webApp, as shown in the following code:
  public int getHttpPort() {
    return webApp.port();
  }
Anyway, that is probably more detail than you wanted. Happy yarning.

Thursday, July 10, 2014

Applied Big Data : The Freakonomics of Healthcare

 
I went with a less provocative title this time because my last blog post (http://brianoneill.blogspot.com/2014/04/big-data-fixes-obamacare.html) evidently incited political flame wars.   In this post, I hope to avoid that by detailing exactly how Big Data can help our healthcare system in a nonpartisan way. =)

First, let's decompose the problem a bit.

Economics:
Our healthcare system is still (mostly) based on capitalism: more patients + more visits = more money.  Within such a system, it is not in the best interest of healthcare providers to have healthy patients.  Admittedly, this is a pessimistic view, and doctors and providers are not always prioritizing financial gain.   Minimally however, at a macro-scale there exists a conflict of interest for some segment of the market, because not all healthcare providers profit entirely from preventative care.

Behavior:
Right now, with a few exceptions, everyone pays the same for healthcare.  Things are changing, but broadly speaking, there are no financial incentives to make healthy choices.  We are responsible only for a fraction of the medical expenses we incur.  That means everyone covered by my payer (the entity behind the curtain that actually foots the bills) is helping pay for the medical expenses I may rack up as a result of my Friday night pizza and beer binges.

Government:
Finally, the government is trying.  They are trying really hard.   Through transparency, reporting, and compliance, they have the correct intentions and ideas to bend the cost curve of healthcare.  But the government is the government, and large enterprises are large enterprises.  And honestly, getting visibility into the disparate systems of any single large enterprise is difficult (ask any CIO).  Imagine trying to gain visibility into thousands of enterprises, all at once.  It's daunting: schematic disparities, messy data, ETL galore.

Again, this is a pessimistic view and there are remedies in the works.  Things like high deductible plans are making people more aware of their expenses.  Payers are trying to transition away from fee-for-service models. (http://en.m.wikipedia.org/wiki/Fee-for-service).  But what do these remedies need to be effective?  You guessed it.  Data.  Mounds of it.
 
If you are a payer and want to reward the doctors that are keeping their patients healthy (and out of the doctors' offices!), how would you find them?   If you are a patient, and want to know who provides the most effective treatments at the cheapest prices, where would you look?  If you are the government and want to know how much pharmaceutical companies are spending on doctors, or which pharmacies are allowing fraudulent prescriptions, what systems would you need to integrate?

Hopefully now, you are motivated.  This is a big data problem.  What's worse is that it is a messy data problem.  At HMS, it's taken us more than three years and significant blood, sweat and tears to put together a platform that deals with the big and messy mound o' data.   The technologies had to mature, along with the people and processes.  And finally, on sunny days, I can see a light at the end of the tunnel for US healthcare.

If you are on the same mission, please don't hesitate to reach out.
Ironically, I'm posting this from a hospital bed as I recover from the bite of a brown recluse spider.
I guess there are certain things that big data can't prevent. ;)

The Life(Cycles) of UX/UI Development

It recently occurred to me that not one of the dozens and dozens of user interfaces I've worked on over the years had the same methodology/lifecycle.  Many of those differences were the result of the environments under which they were constructed: startup, BIG company, government contract, side-project, open-source, freelance, etc.  But the technology also played a part in choosing the methodology we used.

Consider the evolution of UI technology for a moment.  Back in the early days of unix/x-windows, UI development was a grind.  It wasn't easy to relocate controls and re-organize interactions.  Because of that, we were forced to first spend some time with a napkin and a sharpie, and run that napkin around to get feedback, etc.   The "UX" cycle was born.

Then along came things like Visual Basic/C++ and WYSIWYG development.  The UI design was literally the application.  Drag a button here.  Double click.  Add some logic.  Presto, instant prototype...  and application.  You could immediately get feedback on the look and feel, etc.  It was easy to relocate and reorganize things and react to user feedback.   What happened to the "UX" cycle?  It collapsed into the development cycle.  The discipline wasn't lost, it was just done differently, using the same tools/framework used for development.

Then, thanks to Mr. Gore, the world-wide web was invented, bringing HTML along with it.  I'm talking early here, the days of HTML sans javascript frameworks. (horror to think!)  UI development was thrown back into the stone ages.  You needed to make sure you got your site right, because adjusting look and feel often meant rewriting the entire web site/application.  Much of the "roundtrip" interaction and MVC framework, even though physically separated from the browser, was entwined in the flow/logic of the UI.  (Spring Web Flow anyone? http://projects.spring.io/spring-webflow/)

In such a world, again -- you wanted to make sure you got it right, because adjustments were costly.   Fortunately, in the meantime the UX discipline and their tools had advanced.  It wasn't just about information display, it was about optimizing the interactions.  The tools were able to not only play with look, but they could focus on and mock out feel.   We could do a full UX design cycle before any code was written.  Way cool.  Once blessed, the design is/was handed off to the development team, and implementation began.

Fast forward to present day.  JavaScript frameworks have come a long way.  I now know people that can mockup user experience *in code*, faster and more flexibly than the traditional wire-framing tools.  This presents an opportunity to once again collapse the toolset and smooth the design/development process into one cohesive, ongoing process instead of the somewhat disconnected, one-time traditional handoff.

I liken this to the shift that took place years ago for application design.   We used to sit down and draw out the application architecture and design before coding: class hierarchies, sequence diagrams, etc. (Rational Rose and UML anyone?). But once the IDEs advanced enough, it became faster to code the design than to draw it out.   The disciplines of architecture and design didn't go away, they are just done differently now.

Likewise with UX.  User experience and design are still of paramount importance.  And that needs to include user research, coordination with marketing, etc.  But can we change the toolset at this point, so the design and development can be unified as one?  If so, imagine the smoothed, accelerated design->development->delivery (repeat) process we could construct!

For an innovative company like ours, that depends heavily on time-to-market, that accelerated process is worth a ton.  We'll pay extra to get resources that can bridge that gap between UX and development, and play in both worlds.  (Provided we don't need to sacrifice on either!)

On that note, if you think you match that persona, let me know.   We have a spot for you. :)
And if you are a UXer, it might be worth playing around with angular and bootstrap to see how easy things have become. We don't mind on the job training. ;)







Friday, April 4, 2014

Big Data fixes Obamacare!

I'll admit it.  I figured a sensationalistic title would catch your attention, but honestly it might not be that far from the truth.

At HMS, we sit at an interesting intersection.  We have the industry's most accurate and current masterfile, which contains profile information for all of the healthcare providers in the U.S.  It also contains information on every organization (e.g. hospital), and all the affiliations between all of those entities.

We color that picture with medical claims data, which allows us to identify networks of influence and key opinion leaders.  (i.e. which doctors are the most influential?)  We further overlay that with sales data to identify underserved markets, etc.  Then we add in social data, expense information, real estate information, quality metrics, etc...  Anything and everything related to healthcare.

You get the picture.

We quite literally sit at the crossroads of over two thousand sources of healthcare data.  To tame those sources, I'm confident we have put together one of the best data integration and analytic platforms available.  Even with that, some days I come to work and feel like a butterfly caught in a chaotic shit-storm of messy data.  Other days I come to work and feel like Cypher sitting in front of a dripping screen of ones and zeros.  But instead of seeing blondes and brunettes, I see huge potential: opportunities to control costs, optimize sales operations, enable transparency, eliminate fraud, waste and abuse, and improve the health of vast populations of people.

I've spent the day at the HealthImpact conference in NYC, and it's exciting to be sitting at the center of two trends/markets: Big Data and HealthCare.  Articles like this one hint at the potential:
http://www.informationweek.com/government/big-data-analytics/agencies-see-big-data-as-cure-for-healthcare-ills/d/d-id/1127924

If we get things right, we will change the world.  And if we get things wrong, we can still save tax payers, our partners, and patients millions and millions of dollars. ;)

Tuesday, March 25, 2014

Shark on Cassandra (w/ Cash) : Interrogating cached data from C* using HiveQL

As promised, here is part deux of the Spark/Shark on Cassandra series.

In the previous post, we got up and running with Spark on Cassandra.   Spark gave us a way to report off of data stored in Cassandra.  It was an improvement over MR/Hadoop, but we were still left articulating our operations in Java code.  Shark provides an integration layer between Hive and Spark, which allows us to articulate operations in HiveQL at the shark prompt.  This enables a non-developer community to explore and analyze data in Cassandra.

Setup a Spark Cluster

Before we jump to Shark, let's get a Spark cluster going.  To start a spark cluster, first start the master server with:
$SPARK_HOME> bin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /Users/bone/tools/spark-0.8.1-incubating-bin-hadoop2/bin/../logs/spark-bone-org.apache.spark.deploy.master.Master-1-zen.local.out

To ensure the master started properly, tail the logs:
14/03/25 19:54:42 INFO ActorSystemImpl: RemoteServerStarted@akka://sparkMaster@zen.local:7077
14/03/25 19:54:42 INFO Master: Starting Spark master at spark://zen.local:7077
14/03/25 19:54:42 INFO MasterWebUI: Started Master web UI at http://10.0.0.5:8080

In the log output, you will see the master Spark URL (e.g. spark://zen.local:7077).  You will also see the URL for the web UI.  Cut and paste that URL into a browser and have a look at the UI. You'll notice that no workers are available.  So, let's start one:
$SPARK_HOME> bin/start-slaves.sh
Password:
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/bone/tools/spark-0.8.1-incubating-bin-hadoop2/bin/../logs/spark-bone-org.apache.spark.deploy.worker.Worker-1-zen.local.out

Again, tail the logs.  You should see the worker successfully register with the master.  You should also see the worker show up in the web UI.  And now, we are ready to get moving with Shark.

Setup Shark

First, download Shark and Hive.  I used shark-0.8.1-bin-hadoop1.tgz and hive-0.9.0-bin.tgz. Untar each of those.  In the $SHARK_HOME/conf directory, copy the shark-env.sh.template file to shark-env.sh and edit the file.  Ensure the settings are configured properly.  For example:
export SPARK_MEM=4g
export SHARK_MASTER_MEM=4g
export SCALA_HOME="/Users/bone/tools/scala-2.9.3"
export HIVE_HOME="/Users/bone/tools/hive-0.9.0-bin"
export SPARK_HOME="/Users/bone/tools/spark-0.8.1-incubating-bin-hadoop2"
export MASTER="spark://zen.local:7077"
Note that the MASTER variable is set to the master URL from the spark cluster.  Make sure that the HADOOP_HOME variable is *NOT* set.  Shark can operate directly on Spark. (you need not have Hadoop deployed)

As with Spark, we are going to use an integration layer developed by TupleJump.   The integration layer is called Cash:
https://github.com/tuplejump/cash/

To get started, clone the cash repo and follow the instructions here.   In summary, build the project and copy target/*.jar and target/dependency/cassandra-*.jar into $HIVE_HOME/lib.

Play with Shark

Fun time.  Start shark with the following:
bone@zen:~/tools/shark-> bin/shark

Note that there are two other versions of this command (bin/shark-withinfo and bin/shark-withdebug).  Both are *incredibly* useful if you run into trouble. 

Once you see the shark prompt, you should be able to refresh the Spark Web UI and see Shark under Running Applications.  To get started, first create a database.  Using the schema from our previous example/post, let's call our database "northpole":
shark> create database northpole;
OK
Time taken: 0.264 seconds

Next, you'll want to create an external table that maps to your cassandra table with:
shark> CREATE EXTERNAL TABLE northpole.children(child_id string, country string, first_name string, last_name string, state string, zip string)
     >    STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
     >    WITH SERDEPROPERTIES ("cql.primarykey"="child_id", "comment"="check", "read_repair_chance"="0.1", "cassandra.host"="localhost", "cassandra.port"="9160", "dclocal_read_repair_chance"="0.0", "gc_grace_seconds"="864000", "bloom_filter_fp_chance"="0.1", "cassandra.ks.repfactor"="1", "compaction"="{'class' : 'SizeTieredCompactionStrategy'}", "replicate_on_write"="false", "caching"="all");
OK
Time taken: 0.419 seconds

At this point, you are free to execute some HiveQL queries!  Let's do a simple select:
shark> select * from northpole.children;
977.668: [Full GC 672003K->30415K(4054528K), 0.2639420 secs]
OK
michael.myers USA Michael Myers PA 18964
bart.simpson USA Bart Simpson CA 94111
johny.b.good USA Johny Good CA 94333
owen.oneill IRL Owen O'Neill D EI33
richie.rich USA Richie Rich CA 94333
collin.oneill IRL Collin O'Neill D EI33
dennis.menace USA Dennis Menace CA 94222
Time taken: 13.251 seconds

Caching

How cool is that? Now, let's create a cached table!
shark> CREATE TABLE child_cache TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM northpole.children;
Moving data to: file:/user/hive/warehouse/child_cache
OK
Time taken: 10.294 seconds

And finally, let's try that select again, this time against the cached table:
shark> select * from child_cache;
OK
owen.oneill IRL Owen O'Neill D EI33
bart.simpson USA Bart Simpson CA 94111
michael.myers USA Michael Myers PA 18964
dennis.menace USA Dennis Menace CA 94222
richie.rich USA Richie Rich CA 94333
johny.b.good USA Johny Good CA 94333
collin.oneill IRL Collin O'Neill D EI33
Time taken: 6.511 seconds

Woot!

Alrighty, that should get you started.
Again -- kudos to TupleJump for all their work on the Spark/Shark -> C* bridge.

Wednesday, March 19, 2014

Pleasantly Uncomfortable : Innovation run rampant.

Over the last three years, we've built out a kick-ass platform for data management and analytics.

Early on, it was a lot of new technology.  We were integrating and deploying technologies at a rapid rate, almost one per month: dropwizard, spring, extjs then angular, cassandra, solr then elastic search, kafka, zookeeper, storm, titan, github, puppet, etc.  It was a whirlwind.

The new technologies substantially increased productivity and agility.  The platform was working. Product development became capabilities composition.

But recently, it occurred to me that the people we hired along the way and the processes we implemented to support that rapid technical evolution are more powerful than the platform itself. To support that platform approach, we adopted open-source dynamics internally.  Anyone can contribute to any module, just submit a pull request.  Teams are accountable for their products, but they are free to enhance and contribute to any module in the platform.  Those dynamics have allowed us to keep the innovative/collaborative spirit as we grow the company.

And oh my, are we growing...

We now have half-a-dozen product development teams. Each is a cross-discipline (dev, qa, ba, pm) mix of onshore and offshore resources.  The product development teams are enabled by another half-dozen teams that provide infrastructure and support services (ux/ui design, data/information architecture, infrastructure, maintenance, and devops).  The support teams pave the way for the product development teams, so they can rock out features and functionality at warp speed.  For the most part, the product teams use Scrum while the support teams use Kanban so they can react quickly to changing priorities and urgent needs.

Each month we have sprint reviews that mimic a science fair.  Teams have started dressing up and wearing costumes.  It is fun.  Plain and simple.   But at this last sprint review, something happened to me that has never happened in my career.

I've spent my years at two different types of companies:  startups and large enterprises.  At the startups, every innovation (and all of the code) was generated by hands at keyboards within a single room (or two).  You knew everything that was going into the product, every second of the day.  At large enterprises, innovation was stifled by process.  You knew everything that was happening because things happened at a turtle's pace.  At HMS, we've got something special, a balance between those worlds.

At the last sprint review, I was surprised... for the first time in my career.  The teams presented innovations that I didn't know about and never would have come up with on my own.   There were beautiful re-factorings and enabling technical integrations.  But honestly, I was uncomfortable.   I thought maybe I was losing my edge.  I questioned whether I had reached a point where I could no longer keep up with everything.  It was disconcerting.

I spent a couple hours moping.  Then in a moment of clarity, I realized that I was a member of an Innovative Organization: an organization at an optimal balance point between process and productivity, where the reins of innovation were in everyone's hands -- with a platform and processes that supported them.

Yeah, this sounds corny.  But I kid you not, it is amazing.  We've gone from a situation where a few champions were moving boulders up mountains, to a state where entire teams are leaning forward and pulling the organization along.  I'm now happy to enjoy the ride. (but you can bet your ass, I'm going to try to race ahead and meet them at the pass =)

(Kudos to @acollautt, @irieksts, @jmosco, @pabrahamsson, @bflad for giving me this uncomfortable wedgie-like feeling)


Friday, March 7, 2014

Spark on Cassandra (w/ Calliope)


We all know that reporting off of data stored in NoSQL repositories can be cumbersome.  Either you built the queries into your data model, or you didn't.  If you are lucky, you've paired Cassandra with an indexer like SOLR or Elastic Search, but sometimes an index isn't enough to perform complex analytics on your data.  Alternatively, maybe you just want to do a simple transformation on the data.  That is often easier said than done.

What we all need is a generic way to run functions over data stored in Cassandra.   Sure, you could go grab Hadoop, and be locked into articulating analytics/transformations as MapReduce constructs.  But that just makes people sad.  Instead, I'd recommend Spark.  It makes people happy.

When I set out to run Spark against Cassandra however, I found relatively little information.  This is my attempt to rectify that.   If you are impatient, you can just go clone the repo I made:
https://github.com/boneill42/spark-on-cassandra-quickstart

Get Calliope

First stop, Calliope.
http://tuplejump.github.io/calliope/
Then go here so you know how to pronounce it. =)

Again, for reasons I've mentioned before,  I wanted to access Cassandra via CQL.  Unfortunately, at the time of this writing, the CQL version of Calliope wasn't generally available.  You need to submit for early access.  Fortunately, Rohit and crew are very responsive.  And once you have access, you can create a new project that uses it.  Drop the dependency in your pom.

<dependency>
    <groupId>com.tuplejump</groupId>
    <artifactId>calliope_2.9.3</artifactId>
    <version>0.8.1-EA</version>
</dependency>

Get Scala'd Up

Now, if you want to fully immerse yourself in the Spark experience, you'll want to develop in Scala.  For me, that meant switching over to IntelliJ because I had some challenges using Eclipse with specific (older) versions of Scala. Calliope 0.8.1 early access was compiled with Scala 2.9.3. So you'll want an IDE that supports that version.  To get maven support for scala, drop the following into your pom:

<pluginRepositories>
   <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
   </pluginRepository>
</pluginRepositories>

<dependency>
   <groupId>org.scalatest</groupId>
   <artifactId>scalatest_2.9.3</artifactId>
   <version>2.0.M5b</version>
</dependency>

<plugin>
   <groupId>org.scala-tools</groupId>
   <artifactId>maven-scala-plugin</artifactId>
   <version>2.15.2</version>
   <executions>
      <execution>
         <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
         </goals>
      </execution>
   </executions>
   <configuration>
      <scalaVersion>2.9.3</scalaVersion>
      <launchers>
         <launcher>
            <id>run-scalatest</id>
            <mainClass>org.scalatest.tools.Runner</mainClass>
            <args>
               <arg>-p</arg>
               <arg>${project.build.testOutputDirectory}</arg>
            </args>
            <jvmArgs>
               <jvmArg>-Xmx512m</jvmArg>
            </jvmArgs>
         </launcher>
      </launchers>
      <jvmArgs>
         <jvmArg>-Xms64m</jvmArg>
         <jvmArg>-Xmx1024m</jvmArg>
      </jvmArgs>
   </configuration>
</plugin>


Get Spark

Now, the simple part.  Add Spark. =)
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.9.3</artifactId>
   <version>0.8.1-incubating</version>
</dependency>

Sling Code

Now that we have our project stood up, let's race over a column family and do some processing!

All of the code to perform a Spark job is contained in FindChildrenTest.  There are two components: a transformer and the job.  The transformer is very similar to the Mapper concept that we have in storm-cassandra-cql.  The transformer translates CQL rows into objects that can be used in the job.  Here is the code for the transformer:

private object Cql3CRDDTransformers {
  import RichByteBuffer._
  implicit def row2String(key: ThriftRowKey, row: ThriftRowMap): List[String] = {
    row.keys.toList
  }
  implicit def cql3Row2Child(keys: CQLRowKeyMap, values: CQLRowMap): Child = {
    Child(keys.get("child_id").get, values.get("first_name").get, values.get("last_name").get, values.get("country").get, values.get("state").get, values.get("zip").get)
  }
}


The only truly important part is the function that translates a row (keys and values) into a Child object.

With a transformer in place, it is quite simple to create a job:

class FindChildrenTest extends FunSpec {
  import Cql3CRDDTransformers._
  val sc = new SparkContext("local[1]", "castest")
  describe("Find All Children in IRL") {
    it("should be able find children in IRL") {
      val cas = CasBuilder.cql3.withColumnFamily("northpole", "children")
      val cqlRdd = sc.cql3Cassandra[Child](cas)
      val children = cqlRdd.collect().toList
      children.filter((child) => child.country.equals("IRL")).map((child) => println(child))
      sc.stop()
    }
  }
}


The first line connects to a keyspace and table.  For this example, I reused a schema from my webinar a few years ago.  You can find the cql here.  The second line creates a Resilient Distributed Dataset (RDD) containing Child objects.  An RDD is the primary dataset abstraction in Spark. Once you have an RDD, you can operate on it as you would any other collection.  (pretty powerful stuff)

In the code above, we filter the RDD for children in Ireland.  We then race over the result, and print the children out.  If all goes well, you should end up with the following output:

Child(collin.oneill,Collin,O'Neill,IRL,D,EI33)
Child(owen.oneill,Owen,O'Neill,IRL,D,EI33)
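If Scala isn't your thing yet, the same filter pattern is easy to see in plain Java.  This sketch (the Child class here is just a hypothetical stand-in for the case class used by the job; no Spark or Calliope involved) filters an in-memory list the same way the job filters the RDD:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ChildFilter {
    // Hypothetical stand-in for the Scala case class in the post.
    static class Child {
        final String childId, firstName, lastName, country, state, zip;
        Child(String childId, String firstName, String lastName,
              String country, String state, String zip) {
            this.childId = childId; this.firstName = firstName; this.lastName = lastName;
            this.country = country; this.state = state; this.zip = zip;
        }
    }

    // Same logic as the job: keep only the children in the given country.
    static List<Child> childrenIn(String country, List<Child> children) {
        return children.stream()
                .filter(c -> c.country.equals(country))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Child> all = Arrays.asList(
                new Child("collin.oneill", "Collin", "O'Neill", "IRL", "D", "EI33"),
                new Child("bart.simpson", "Bart", "Simpson", "USA", "CA", "94111"));
        System.out.println(childrenIn("IRL", all).size()); // prints 1
    }
}
```

The difference, of course, is that the RDD version distributes the same operation across the cluster for you.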

OK -- That should be enough to make you dangerous.  I have to give a *HUGE* pile of kudos to Rohit Rai and his team at TupleJump for developing the Calliope project. They are doing great things at TupleJump.  I'm keeping an eye on Stargate and Cash as well.  In fact, next time, I'll take this a step further and show Shark against Cassandra using Cash.

Friday, February 28, 2014

Big Data : On the Precipice of a Collapse

Before anyone freaks out, I'm talking about a technology collapse, not a market collapse or the steep downhill slope of a hype curve.  But I guess a "technology collapse" doesn't sound much better.  Maybe "technology consolidation" is a better word, but that's not really what I mean.  Oh well, let me explain...

For anyone that has been through introductory computer science courses, you know that different data structures are suited for different applications.   With a bit of analysis you can classify the cost of an operation against a specific structure, and how that will scale as the size of the data or the input scales. (See Big O Notation if you would like to geek out.)  In Big Data, things are no different, but we are usually talking about performance and scaling relative to additional constraints such as the CAP Theorem.

So what are our "data structures and algorithms" in the Big Data space?  Well, here is my laundry/reading list:
So what does that have to do with an imminent collapse?  Well, market demands are pushing our systems to ingest an increasing amount of data in a decreasing amount of time, while also making that data immediately available to an increasing variety of queries.

We know from our classes that to accommodate the increased variety of queries, we need different data structures and algorithms to service those queries quickly.  That leaves us with no choice but to duplicate data across different data structures and to use different tools to query across those systems.   Often this approach is referred to as "Polyglot Persistence".  That worked, but it left the application to orchestrate the writes and queries across the different persistence mechanisms.  (painful)

To alleviate that pain, people are collapsing the persistence mechanisms and interfaces. Already, we see first-level integrations.  People are combining inverted-indexes and search/query mechanisms with distributed databases. e.g.
  • Titan integrated Elastic Search.
  • Datastax integrated SOLR.
  • Stargate combines Cassandra and Lucene
This is a natural pairing because the queries you can perform efficiently against distributed databases are constrained by the partitioning and physical layout of the data.  The inverted indexes fill the gap, allowing users to perform queries along dimensions that may not align with the data model used in the distributed database.  The distributed database can handle the write-intensive traffic, while the inverted indexes handle the read-intensive traffic, likely with some lag in synchronization between the two. (but not necessarily)
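To make the pairing concrete, here is a toy sketch in Java (the DualWrite class and its names are made up for illustration; this is not the actual Titan or Datastax integration code): a single write is fanned out to a key-value "primary" store and an inverted index, so key lookups and attribute queries each hit the structure suited to them:

```java
import java.util.*;

public class DualWrite {
    // Primary store: fast key-based reads/writes, like the distributed database.
    private final Map<String, Map<String, String>> primary = new HashMap<>();
    // Inverted index: attribute value -> record ids, like SOLR/Elastic Search.
    private final Map<String, Set<String>> byState = new HashMap<>();

    // One logical write updates both structures -- exactly the orchestration
    // that first-level integrations hide from the application.
    public void write(String id, Map<String, String> record) {
        primary.put(id, record);
        byState.computeIfAbsent(record.get("state"), k -> new TreeSet<>()).add(id);
    }

    public Map<String, String> byId(String id) { return primary.get(id); }

    public Set<String> idsInState(String state) {
        return byState.getOrDefault(state, Collections.emptySet());
    }

    public static void main(String[] args) {
        DualWrite store = new DualWrite();
        Map<String, String> bart = new HashMap<>();
        bart.put("state", "CA");
        store.write("bart.simpson", bart);
        System.out.println(store.idsInState("CA")); // prints [bart.simpson]
    }
}
```

In the real systems, of course, the two "maps" are separate clustered services, and keeping them in sync (with some lag) is where all the hard work lives.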

The tight-integration between persistence mechanisms makes it transparent to the end-user that data was duplicated across stores, but IMHO, we still have a ways to go.  What happens if you want to perform an ad-hoc ETL against the data?  Well, then you need to fire up Hive or Pig (Spark and/or Shark), and use a whole different set of infrastructure, and a whole other language to accomplish your task.

One can imagine a second or third-level integration here, which unifies the interface into "big data": a single language that would provide search/query capabilities (e.g. lucene-like queries), with structured aggregations (e.g. SQL-like queries), with transformation, load and extract functions (e.g. Pig/Hive-like queries) rolled into one cohesive platform capable of orchestrating the operations/functions/queries across the various persistence and execution frameworks.

I'm not sure quite what that looks like.  Would it use Storm or Spark as the compute framework? perhaps running on YARN, backed by Cassandra, Elastic Search and Titan, with pre-aggregation capabilities like those found in Druid?

Who knows?   Food for thought on a Friday afternoon.
Time to grab a beer.

(P.S. shout out to Philip Klein and John Savage, two professors back at Brown who inspired these musings)



Thursday, February 13, 2014

Storm and Cassandra : A Three Year Retrospective


We are doing our final edits on our Storm book due out in April.  In reviewing the chapters, I got to thinking through the evolution of our architecture with Storm.  I thought I would capture some of our journey.  Maybe people out there can skip a few epochs of evolution. =)

Kudos again to +P. Taylor Goetz for introducing Storm at Health Market Science.  When Taylor joined the team, we were well on our way to selecting Cassandra as our persistence mechanism, but we had yet to solve the distributed processing problem.  We had varying levels of JMS/JVM sprawl and we were dealing with all the challenges of transactional processing against those queues. (exactly the situation Nathan Marz typically depicts when motivating Storm)

To accompany the JMS/JVM sprawl, we also had Hadoop deployments against Cassandra that we were fairly frustrated with.  The Map/Reduce paradigm for analysis seemed very restrictive, and we were spending a lot of time tweaking jobs to balance work across phases (map, shuffle, reduce).   It felt like we were shoe-horning our problems into M/R.  If you then add on the overhead of spinning up a job and waiting for the results, we wanted better.  Enter Storm.

Amoebas
We had made the decision to go forward with Cassandra, but we didn't see any bridge between Storm and Cassandra -- so we built one.  By December 2011, we had made enough progress on storm-cassandra that it made it into Nathan's talk at the Cassandra Summit, and we started building out our first topologies.

Back in those days, there was no such thing as Trident in Storm.   And given the pain that we first encountered, I'd guess that most of the production uses of Storm did not demand transactional integrity.  I'm guessing that many of those uses only needed "good enough" real-time answers, and likely leveraged Hadoop, lagging somewhat, to correct issues offline.

We didn't like that approach.  Our systems are used to make health-care related decisions in many large operational systems.   We wanted the data to be immediately vendable, and guaranteed. Enter Transactional Topologies.

We started using transactional topologies, getting our feet wet connecting Storm to our JMS queues, and birthing storm-jms in the process.  In this phase, we felt some of the pain of being on the bleeding edge.  APIs felt like they were shifting quite a bit, with the coup de grâce coming when Trident was introduced.

Trident added to Storm what we needed: better support for transactional processing and state management.  But what was that?  Transactional topologies are now deprecated?  Ouch.   Quick -- everyone over to Trident!  (we could have used a heads-up here!)

Vertebrates
We rapidly ported all of our transactional topologies over to Trident and got acquainted with the new concepts of State.   At the same time, we were advancing our understanding of Cassandra data modeling.

We learned the following, which should be tattooed on everyone working in distributed computing:

  1. Eliminate read-before-write scenarios (never fetch+update, just insert)
  2. Ensure all operations are idempotent (when you insert, overwrite)
  3. When possible, avoid shuffling data around (partitionPersist is your friend)

Make sure your processing flow and data model support the above tenets.  With our tattoos, we continued to build out our use of Storm throughout 2013.
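A toy illustration of tenets #1 and #2, using a plain HashMap in place of Cassandra (the class and method names here are made up for the example): replaying an overwrite is harmless, while replaying a fetch+update drifts:

```java
import java.util.HashMap;
import java.util.Map;

public class Idempotency {
    // Tenet #2: an overwrite applied twice leaves the same state (idempotent),
    // so replays during failure recovery are harmless.
    static void overwrite(Map<String, Integer> store, String key, int value) {
        store.put(key, value);
    }

    // A fetch+update (read-before-write) is NOT idempotent: replaying it
    // double-counts, which is why tenet #1 says to eliminate it.
    static void increment(Map<String, Integer> store, String key) {
        store.put(key, store.getOrDefault(key, 0) + 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>();
        overwrite(store, "count", 5);
        overwrite(store, "count", 5);   // replayed: still 5
        increment(store, "count");
        increment(store, "count");      // replayed: drifts to 7
        System.out.println(store.get("count")); // prints 7
    }
}
```

Storm's at-least-once delivery means replays *will* happen, which is why these tenets matter so much in practice.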

Walking Upright
Many people tend to compare Storm and Hadoop, with Storm portrayed as a "real-time Hadoop". I believe this short-changes Storm a bit.   Hadoop (v1) runs a specific algorithm across a pre-defined set of data.  Many times, to accomplish something useful, you need to string many M/R jobs together.  Eventually, you find yourself in want of a higher-level language like Pig, Hive or Cascading.  Storm operates at this higher level, and although it is often cast as a framework for "real-time analytics", it is a general data processing layer capable of accommodating fairly sophisticated data flows.

In fact, Storm excels as data flow and orchestration infrastructure.  We use it as our data ingestion infrastructure, orchestrating writes across multiple data storage mechanisms.  (see trident-elasticsearch)  It provides the backbone for a solid platform that avails of polyglot persistence.

Future Evolution
The best is yet to come.  Cassandra is churning out new features that make it even more useful for distributed processing.  See my previous post on CQL enhancements.  We created a new project to take advantage of those features. (see storm-cassandra-cql)  It's already getting some traction. (shout out to Robert Lee for contributing the BackingMap implementation, which should be merged shortly)

Also, with Storm's incorporation into Apache and Hortonworks' commitment, we should see a more stable API, more frequent releases, and better synergy with other Apache projects.  (YARN anyone?!)

Conclusion
So, if you are a gambling type, I'd push your chips to the center of the table.  Bet on Storm and Cassandra being a powerful pair as demands for bigger, better and faster continue to push those of us at the edge of the envelope.  It's anyone's guess what the powerful pair will evolve into, but one can imagine great things.


Thursday, February 6, 2014

Determining if a conditional update was applied w/ CQL Java-Driver


Sorry, I should have included this in my previous post on conditional updates.   One of the critical aspects of using conditional updates is determining whether the update was applied. Here is how you do it.

Given our previous update (all fluentized):
        
Update updateStatement = update(KEYSPACE_NAME, TABLE_NAME);
updateStatement.with(set(VALUE_NAME, 15)).where(eq(KEY_NAME, "DE")).onlyIf(eq(VALUE_NAME, 10));
this.executeAndAssert(updateStatement, "DE", 15);
When we execute the updateStatement with our session and examine the ResultSet returned:
LOG.debug("EXECUTING [{}]", statement.toString());
ResultSet results = clientFactory.getSession().execute(statement);
for (Row row : results.all()) {
   for (ColumnDefinitions.Definition definition : results.getColumnDefinitions().asList()) {
      LOG.debug("[{}]=[{}]", definition.getName(), row.getBool(definition.getName()));
   }
}
You'll notice that you get a ResultSet back that contains one row, with a column named "[applied]". The above code results in:
DEBUG EXECUTING [UPDATE mykeyspace.incrementaltable SET v=15 WHERE k='DE' IF v=10;]
DEBUG [[applied]]=[true]
Thus, to check to see if a conditional update is applied or not, you can use the concise code:
Row row = results.one();
if (row != null)
   LOG.debug("APPLIED?[{}]", row.getBool("[applied]"));

Wednesday, February 5, 2014

Conditional updates with CQL3 java-driver


Within storm-cassandra-cql, I'm building out a state implementation that is capable of making incremental state changes.   (Read-Modify-Write)  It leverages Cassandra 2.0's lightweight transactions, and the ability to perform a conditional update.  (more on the rationale for incremental state changes later)

As I was implementing it, I couldn't find any good examples that showed how to implement conditional updates using the QueryBuilder in the CQL3 java driver.

So, here you go:

        // Now let's conditionally update where it is true
        Update updateStatement = QueryBuilder.update(SalesAnalyticsMapper.KEYSPACE_NAME,
                SalesAnalyticsMapper.TABLE_NAME);
        updateStatement.with(QueryBuilder.set(VALUE_NAME, 15));
        Clause clause = QueryBuilder.eq(KEY_NAME, "DE");
        updateStatement.where(clause);
        Clause conditionalClause = QueryBuilder.eq(VALUE_NAME, 10);
        updateStatement.onlyIf(conditionalClause);
        this.executeAndAssert(updateStatement, "DE", 15);

The key clause is the conditionalClause, which is appended to the statement by calling the onlyIf method on the Update object. The above code will set the value of v=15, where k='DE', but only if the current value of v is still 10. In the storm-cassandra-cql use case, this will cause the state to increment only if the value that was read hasn't changed underneath it.
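If the semantics feel familiar, they should: the IF clause behaves much like AtomicInteger.compareAndSet in Java.  This in-memory sketch (just an analogy, not driver code) shows the read-modify-conditional-write retry loop that an incremental state implementation needs:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasRetry {
    // The conditional update applies only if the value read hasn't changed
    // underneath us.  On a miss ([applied]=false in CQL terms), we re-read
    // and retry.
    static int incrementWithCas(AtomicInteger value, int delta) {
        while (true) {
            int current = value.get();          // read
            int updated = current + delta;      // modify
            if (value.compareAndSet(current, updated)) {  // conditional write
                return updated;
            }
            // another writer won the race; loop and try again
        }
    }

    public static void main(String[] args) {
        AtomicInteger v = new AtomicInteger(10);
        System.out.println(incrementWithCas(v, 5)); // prints 15
    }
}
```

The difference is cost: a lightweight transaction in Cassandra runs a Paxos round across replicas, so each "compareAndSet" is far more expensive than its in-memory cousin.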

I've encapsulated a couple conditional updates in a unit test within storm-cassandra-cql. Have a look to see a more complete set of calls (with comments and assertions).

Tuesday, February 4, 2014

Work started on Storm-Cassandra-CQL!


As I laid out in my previous post, there are a number of motivations to start using CQL.  CQL has better support for batching, conditional updates, and collections. (IMHO)  For those reasons, I've started porting our Trident State implementation to CQL.

The implementation has the same Mapper concept. Simply implement the Mapper interface: map Storm tuples to CQL3 statements.

For example:
   public Statement map(TridentTuple tuple) {
        Update statement = QueryBuilder.update("mykeyspace", "mytable");
        String field = "col1";
        String value = tuple.getString(0);
        Assignment assignment = QueryBuilder.set(field, value);
        statement.with(assignment);
        long t = System.currentTimeMillis() % 10;
        Clause clause = QueryBuilder.eq("t", t);
        statement.where(clause);
        return statement;
    }

(From the ExampleMapper)

The CQL3 statements are then collected and submitted as a batch inside the State implementation.

Below is an example topology:
    public static StormTopology buildTopology() {
        LOG.info("Building topology.");
        TridentTopology topology = new TridentTopology();
        ExampleSpout spout = new ExampleSpout();
        Stream inputStream = topology.newStream("test", spout);
        ExampleMapper mapper = new ExampleMapper();
        inputStream.partitionPersist(new CassandraCqlStateFactory(), 
                                     new Fields("test"), 
                                     new CassandraCqlStateUpdater(mapper));
        return topology.build();
    }

Presently the implementation is *very* simple.  We know we'll need to enhance the batching mechanism.  (e.g. What happens when the size of a batch in Storm exceeds the batch size limit in CQL3?  Bad things. =)
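One plausible way to handle that (purely a sketch of the idea, not what the project does today, and the BatchChunker name is made up): split the collected statements into sub-batches that respect a maximum size, and submit each one separately:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {
    // Partition the statements collected for a Storm batch into sub-batches
    // of at most maxSize, so each stays under the CQL batch limit.
    static <T> List<List<T>> chunk(List<T> statements, int maxSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < statements.size(); i += maxSize) {
            batches.add(statements.subList(i, Math.min(i + maxSize, statements.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> stmts = new ArrayList<>();
        for (int i = 1; i <= 5; i++) stmts.add(i);
        System.out.println(chunk(stmts, 2).size()); // prints 3
    }
}
```

The wrinkle, of course, is that splitting one Storm batch into several CQL batches weakens the all-or-nothing guarantee, so the state implementation would need to keep the writes idempotent to stay safe on replay.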

At first glance though, this approach for Storm / Cassandra integration is much simpler than our previous implementation and will allow users to leverage the power and features available in CQL3.  (e.g. We have grand plans to expose / utilize conditional updates to realize incremental state updates from Storm -- more on that later)

I'd encourage people to give it a spin... and submit contributions!
https://github.com/hmsonline/storm-cassandra-cql


Thursday, January 30, 2014

Book Review : Web Crawling and Data Mining with Apache Nutch


In our space, we've found that some of the most current healthcare-related information is on the internet.  We harvest that information as input to our healthcare masterfile.  Our crawlers run against hundreds of websites.  We have a fairly large web harvester, which is what drove me to explore Nutch with Cassandra: Crawling the web with Cassandra.

When Web Crawling and Data Mining with Apache Nutch came out, I was eager to have a read.   The first quarter of the book is largely introductory.  It walks you through the basics of operating Nutch and the layers in the design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with SOLR).

For me, the book got a bit more interesting when it covered the Nutch Plugin architecture.  HINT: Take a look at the overall architecture diagram on Page 34 before you start reading!

The book then covers deployment and scaling.   A fair amount of time is spent on SOLR deployment and scaling (via sharding), which in and of itself may be valuable if you are a SOLR shop.   (not so much if you are an Elastic Search (ES) fan -- in fact, it was one of the reasons we moved to ES ;)

About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop.  This includes detailed instructions on Hadoop installation and configuration.  This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.

Overall, this is a solid book, especially if you are new to the space and need detailed, line-by-line instructions to get up and running.  To kick it up a notch, it would have been nice to have a smattering of use cases and real-world examples, but given the book is only about a hundred pages, it does a good job of balancing utility with color commentary.

The book is available from PACKT here:




Wednesday, January 29, 2014

Looking for your aaS? (IaaS vs. PaaS vs. SaaS vs. BaaS)


Our API is getting a lot of traction these days.  We enable our customers to perform lookups against our masterfile via a REST API.  Recently, we've also started exposing our Master Data Management (MDM) capabilities via our REST API.  This includes matching/linking, analysis, and consolidation functionality.  A customer can send us their data, and we will run a sophisticated set of fuzzy matching rules to locate the healthcare entity in our universe (i.e. "match").  We can then compare the attributes supplied by our customers with those on the entity in our universe, and decide which are the most accurate attributes (i.e. "consolidate").  Once we have the consolidated record, we run analysis against that record to look for attributes that might trigger an audit.

I've always described this as a Software as a Service (SaaS) offering, but as we release more and more of our MDM capabilities via the REST API, it is beginning to feel more like Platform as a Service (PaaS).  I say that because we allow our tenants/customers/clients to deploy logic (code) for consolidation and analytics.  That code runs on our "platform".

That got me thinking about the differences between Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Back-end-as-a-Service (BaaS), and Software as a Service (SaaS).  Let's first start with some descriptions.  (all IMHO)

IaaS: This service is the alternative to racks, blades and metal.  IaaS allows you to spin-up new virtual machines, provisioned with an operating system and potentially a framework.  From there you are on your own.  You need to deploy your own apps, etc.  (BYOAC == Bring your own Application Container)

PaaS: This service provides an application container.   You provide the application, potentially built with a provider's tools/libraries, then the service provisions everything below that. PaaS adds the application layer on top of IaaS. (BYOA == Bring your own Application)

SaaS: These services expose specific business functionality via an interface.  Consumers typically consume the services off-premise over the web.  In most cases, SaaS refers to some form of web services and/or user interface.   (Either no BYO, or BYOC == Bring your own Configuration)

BaaS:  For me, there is a blurred line between BaaS and SaaS.  From the examples I've seen, BaaS often refers to services consumed by mobile devices.  Often, the backend composes a set of other services and allows the mobile application to offload much of the hard work. (user management, statistics, tracking, notifications, etc)  But honestly, I'm not sure if it is the composition of services, the fact that they are consumed from mobile devices, or the type of services that distinguishes BaaS from SaaS.  (ideas anyone?)

Of course, each one of these has pros/cons, and which one you select as the foundation for your development will depend highly on what you are building.  I see it as a continuum:


The more flexibility you need, the more overhead you have to take on to build out the necessary infrastructure on top of the lower-level services.  In the end, you will likely end up with a blend of all of these.

We consume SaaS, build on PaaS (salesforce),  leverage IaaS (AWS), and expose interfaces for both PaaS and SaaS!  

Any which way you look at it, that's a lot of aaS!




Tuesday, January 28, 2014

Mesos on Mac OS X Mavericks (SOLVED: "Could not link test program to Python")

Continuing on my expedition with Scala and Spark, I wanted to get Mesos working (underneath of Spark).  I ran into a couple hiccups along the way...

First, download Mesos:
http://mesos.apache.org/downloads/

Unpack the tar ball, and run "./configure".
If you are running Mavericks, and you've installed Python using brew, you may end up with:

configure: error:
  Could not link test program to Python. Maybe the main Python library has been
  installed in some non-standard library path. If so, pass it to configure,
  via the LDFLAGS environment variable.
  Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
  ============================================================================
   ERROR!
   You probably have to install the development version of the Python package
   for your distribution.  The exact name of this package varies among them.
  ============================================================================

It turns out there is a bug in python that prevents Mesos from properly linking. Here is the JIRA issue on the mesos project: https://issues.apache.org/jira/browse/MESOS-617

To get around this bug, I needed to downgrade python.

With brew, you can use the following commands:
bone@zen:~/tools/mesos-0.15.0-> brew install homebrew/versions/python24
bone@zen:~/tools/mesos-0.15.0-> brew unlink python
bone@zen:~/tools/mesos-0.15.0-> brew link python24
After that, the configure will complete, BUT -- the compilation will fail with:
In file included from src/stl_logging_unittest.cc:34:
./src/glog/stl_logging.h:56:11: fatal error: 'ext/slist' file not found
# include <ext/slist>
For this one, you are going to want to get things compiling with gcc (instead of clang). Use the following:
brew install gcc47
rm -rf build
mkdir build
cd build
CC=gcc-4.7 CXX=g++-4.7 ../configure
After that, you should be able to "sudo make install" and be all set. Happy Meso'ing.

Monday, January 27, 2014

Scala IDE in Eclipse (with 2.9.x and Juno... or not)


I'm taking the plunge into Scala to determine if it has any benefits over Java.   To motivate that, I decided to play around with Spark/Shark against Cassandra.  To get my feet wet, I set out to run Spark's example Cassandra test (and perhaps enhance it to use CQL).

First, I needed to get my IDE setup to handle Scala.  I'm an Eclipse fan, so I just added in the Scala IDE for Eclipse. (but make sure you get the right scala version! see below!)

Go to Help->Install New Software->Add, and use this url:
http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site

Race through the dialog boxes to install the plugin, which will require you to restart.

For me, I was working with a Java project to which I wanted to add the CassandraTest scala class from Spark.  If you are in the same situation, and you have an existing Java project, you will need to add the Scala nature in Eclipse.  Do this by right-clicking on the project, then Configure->Add Scala Nature.

At this point, you can start the Scala interpreter by right-clicking on the project, then Scala->Create Scala Interpreter.

I was happy -- for a moment.  I was all setup, but Eclipse started complaining that certain jar files were "cross-compiled" using a different version of Scala: an older version, 2.9.x.  Unfortunately, I had Storm in my project, which appeared to be pulling in files compiled with 2.9.x.

So, I uninstalled the Scala IDE plugin because it appeared to work only with 2.10.x.  I needed to downgrade to an older version of the Scala IDE to get 2.9.x support.  That forced me onto an experimental version of Scala IDE, because I needed 2.9.x support in Juno.  Unfortunately, after re-installing the old version, I lost the ability to add the Scala nature. =(

PUNT

I decided to go hack at it from the command line.  I followed this getting started guide to add Scala to my maven pom file.  That worked like a champ.  And I could run the CassandraTest.
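For reference, the relevant bits of the pom ended up looking roughly like this. (A sketch from memory -- the plugin coordinates and version numbers below are assumptions; follow the guide for current values.)

```xml
<!-- Hypothetical fragment: versions here are illustrative -->
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.9.3</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.1.6</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With that in place, "mvn compile" should pick up sources under src/main/scala.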

So, at this point, I'm editing the files in Eclipse, but running via the command line.  I'm not sure Scala IDE will bother supporting 2.9.x in Juno or Kepler, because they seem to have moved on.  But if anyone has any idea how to get Scala IDE with 2.9.x support in Juno, I'm all ears. (@jamie_allen, any ideas?)


Thursday, January 16, 2014

Jumping on the CQL Bandwagon (a tipping point to migrate off Astyanax/Hector?)

It's been over a year since we started looking at CQL. (see my blog post from last October)

At first we didn't know what to make of CQL.   We were heavily invested in the thrift-based APIs (Astyanax + Hector).  We had even written a REST API called Virgil directly on top of Thrift (which enabled the server to run an embedded Cassandra).  

But there was a fair amount of controversy around CQL, and whether it was putting the "SQL" back into "NoSQL".  We took a wait-and-see approach to see how much CQL and the thrift-based API diverged.  The Cassandra community pledged to maintain the thrift layer, but it was clear that Datastax was throwing its weight behind the new CQL java driver.  It was also clear that newcomers to Cassandra might start with CQL (and the CQL java-driver), especially if they were coming from a SQL background.

Here we are a year later, and with the latest releases of Cassandra, (IMHO) we've hit a tipping point that has driven this C* old-timer to begin the migration to CQL.  Specifically, there are three things that CQL has better support for:

Lightweight Transactions: These are conditional inserts and updates.  In CQL, you can add an IF clause to the end of a statement, which is verified before the upsert is applied.  This is hugely powerful in a distributed system, because it helps accommodate distributed read-before-write patterns.  A client can add a condition that will prevent the update if the client was working with stale information. (e.g. by checking a timestamp or checksum and only updating if that timestamp or checksum hasn't changed)
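A quick sketch of what this looks like in CQL (the table and column names here are made up for illustration):

```sql
-- Insert only if the row doesn't already exist
INSERT INTO users (user_id, email) VALUES (42, 'bone@example.com')
IF NOT EXISTS;

-- Update only if our view of the row is still current
UPDATE users SET email = 'new@example.com', version = 8
WHERE user_id = 42
IF version = 7;
```

If the condition fails, the statement is not applied, and Cassandra hands back the current values so the client can re-read and retry.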

Batching:  This allows the client to group statements.  The batch construct can guarantee that either all the statements will succeed, or all will fail.  Even though it doesn't provide isolation (other clients may see partially committed batches), this is still a very important construct when creating consistent systems that scale, because you end up batching in the client to reduce the database traffic.
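For example, keeping a lookup table in sync with its source table (this schema is hypothetical):

```sql
BEGIN BATCH
  INSERT INTO users (user_id, email) VALUES (42, 'bone@example.com');
  INSERT INTO users_by_email (email, user_id) VALUES ('bone@example.com', 42);
APPLY BATCH;
```

Either both inserts eventually apply, or neither does.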

Collections: When you do enough data modeling on top of Cassandra, you end up building on top of the row key / sorted column key structure using composite columns.  And although it is amazing what you can accomplish with that simple structure, a lot of effort is spent marshaling in and out of those primitive structures.  Collections offer a convenient translation layer on top of those primitives, which simplifies things.  You can always drop down into the primitives when need be, but sometimes it's nice to have a simple list, map, or set at hand.
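For illustration, here is a table using all three collection types (again, the schema is made up):

```sql
CREATE TABLE user_profiles (
  user_id     bigint PRIMARY KEY,
  emails      set<text>,
  recent_ips  list<text>,
  preferences map<text, text>
);

-- Add to a set without reading it first
UPDATE user_profiles SET emails = emails + {'bone@example.com'}
WHERE user_id = 42;
```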

Now -- don't get me wrong.  I'm still a *huge* Astyanax fan, and it still provides some convenience capabilities that AFAIK are not yet available in CQL.  (e.g. the Chunked Object Store)  But as we guessed a while back, it looks like CQL will offer better support for newer C* features.

SOOO ----
I've started on a rewrite of Virgil that offers up CQL capabilities via REST.  I'm calling the project memnon.  You can follow along on github as I build it out.

Additionally, I started rewriting the Storm-Cassandra bolt/state mechanisms to ride on top of CQL.  You can see that action on github as well.

More to come on both of those.


Tuesday, January 14, 2014

ElasticSearch from AngularJS (fun w/ elasticsearch-js!)


We've recently switched over to AngularJS (from ExtJS).  And if you've been following along at home, you know that we are *HUGE* ElasticSearch fans.  So today, I set out to answer the question, "How easy would it be to hit Elastic Search directly from javascript?"  The answer lies below. =)

First off, I should say that we recently hired a rock-star UI architect (@ddubya) that has brought with him Javascript Voodoo the likes of which few have seen.  We have wormsign... and we now have grunt paired with bower for package management.  Wicked cool.

So, when I set out to connect our Angular App to Elastic Search, I was pleased to see that Elastic Search Inc recently announced a javascript client that they will officially support.  I quickly raced over to bower.io and found a repo listed in their registry...

I then spent the next two hours banging my head against a wall.

Do NOT pull the git repo via bower even though it is listed in the bower registry!  There is an open issue on the project to add support for bower.  Until that closes, the use of Node's require() within the angular wrapper prevents it from running inside a browser.  Using browserify, the ES guys kindly generate a browser compatible version for us.   So *use* the browser download (zip or tar ball) instead!

Once you have the download, just unzip it in your app.  Load the library with a script tag:
<script src="third_party_components/elasticsearch-js/elasticsearch.angular.js"></script>
That code registers the elasticsearch module and creates a factory that you can use (excerpted below; AngularConnector and Client are defined earlier in that file):
angular.module('elasticsearch', [])
  .factory('esFactory', ['$http', '$q', function ($http, $q) {

    var factory = function (config) {
      config = config || {};
      config.connectionClass = AngularConnector;
      config.$http = $http;
      config.defer = function () {
        return $q.defer();
      };
      return new Client(config);
    };
You can then create a service that uses that factory, creating an instance of a client that you can use from your controller, etc:
angular.module('hms.complete.control.components.search')
  .service('es', function (esFactory) {
  return esFactory({
    host: 'search01:9200'
  });
});
Then, assuming you have a controller defined similar to:
angular.module('my.components.search')
  .controller('searchController', ['$window', '$scope', 'es', function ($window, $scope, es) {
The following code, translated from their example, works like a champ!
es.ping({
  requestTimeout: 1000,
  hello: "elasticsearch!"
}, function (error) {
  if (error) {
    console.error('elasticsearch cluster is down!');
  } else {
    console.log('All is well');
  }
});
From there, you can extend things to do much more... like search! =)
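To sketch what that might look like (a rough, untested example -- the index name, field name, and helper function here are made up; check the elasticsearch-js docs for the full search API):

```javascript
// Hypothetical helper that builds a simple match-query body.
function buildSearchBody(term) {
  return {
    query: { match: { title: term } },  // 'title' is an assumed field name
    size: 10
  };
}

// With the 'es' service injected as above, a search might look like this.
// (es.search takes a params object and a callback, like es.ping above.)
function searchFor(es, term, callback) {
  es.search({ index: 'blog', body: buildSearchBody(term) }, callback);
}
```

You would call searchFor from a controller and bind the hits in the response onto $scope.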

Kudos to the Elastic Search Inc crew for making this available.


Thursday, January 9, 2014

WTF is an architect anyway?

In full disclosure, I'm writing this as a "Chief" Architect (I can't help but picture a big headdress), and I've spent the majority of my career as an "architect" (note the air quotes).  And honestly, I've always sought out opportunities that came with this title.  I think my fixation came largely from the deification of the term in the Matrix movies.

But in reality, titles can cause a lot of headaches, and when you need to scale an organization to accommodate double digit growth year over year, "architects" and "architecture" can help... or hurt that growth process.  Especially when architecture is removed/isolated from the implementation/development process, we know that ivory-tower architecture kills.

In this day and age, however, a company is dead if it doesn't have a platform.  And once you have a critical number of teams, especially agile teams that are hyper-focused only on their committed deliverables, how do you cultivate a platform without introducing some form of architecture (and "architects")?

I've seen this done a number of ways.  I've been part of an "Innovative Architecture Roadmap Team", an "Enterprise Architecture Forum", and even a "Shared Core Services Team".  All of these sought to establish and promote a platform of common reusable services.  Looking back, the success of each of these was directly proportional to the extent to which the actual functional development teams were involved.

In some instances, architects sat outside the teams, hawking the development and injecting themselves when things did not conform to their vision.  (Read as: minimal team involvement).  In other cases, certain individuals on each team were anointed members of the architecture team.  This increased involvement, but still restricted architectural influence (and consequently buy-in) to the chosen few.  Not only is this less than ideal, but it also breeds resentment.  Why are some people anointed and not others?

Consider the rock-star hotshot developer that is right out of college.  He or she may have disruptive, brilliant architectural insights because dogma hasn't found them yet.  Unfortunately, this likely also means that they don't have the clout to navigate political waters into the architectural inner circle.  Should the architecture suffer for this?  Hell no.

So, what do we do?  I suggest we change the flow of architecture.  In the scenarios I've described thus far, architecture was defined by and emanated from the architectural inner circle.  We need to invert this.  IMHO, an architectural approach that breeds innovation is one that seeks to collect and disseminate ideas from the weeds of development. 

Pave the road for people that want to influence and contribute to the architecture and make it easy for them to do so.  In this approach, everyone is an architect.  Or rather, an architect is a kind of person: a person that wants to lift their head up, look around, and contribute to the greater good.

That sounds a bit too utopian.  And it is.  In reality, architectural beauty is in the eye of the beholder, and people often disagree on approach and design.  In most cases, it is possible to come to consensus, or at least settle on a path forward that provides for course correction if the need should arise.  

But there are cases when that doesn't happen.  In these cases, I've found it beneficial to bring a smaller crew together to shut out the noise, set personal passions aside, and make a final call.  Following that gathering, no matter what happened in the room, it is/was the job of those people to champion the approach.

In this capacity, the role of "architects" is to collect, cultivate and champion a common architectural approach.  (pretty picture below)


To distinguish this construct from pre-conceived notions of "architecture teams" and "architects" (again, emphasis on the air quotes), I suggest we emphasize that this is a custodial function, and we start calling ourselves "custodians".

Then, we can set the expectation that everyone is an architect (no air quotes), and contributes to architecture.  Then, a few custodians -- resolve stalemates, care for, nurture, and promote the architecture to create a unified approach/platform.

I'm considering changing my title to Chief Custodian.  I think the janitorial imagery that it conjures up is a closer likeness anyway.   Maybe we can get Hollywood to come out with a Matrix prequel that deifies a Custodian. =)