Monday, April 1, 2013

BI/Analytics on Big Data/Cassandra: Vertica, Acunu and Intravert(!?)


As part of our presentation up at NYC* Big Data Tech Day, we noted that Hadoop didn't really work for us.  It was great for ingesting flat files from HDFS into Cassandra, but the map/reduce jobs that used Cassandra as input didn't cut it.  We found ourselves contorting our solutions to fit within the map/reduce framework, which required developer-level capabilities.  We had to add complexity into the system to do batch management/composition, and in the end the map/reduce jobs took too long to complete.

Eventually, we swapped out Hadoop for Storm.  That allowed us to do real-time cumulative analytics.  And most recently, we converted our topologies to Trident.  Handling all CRUD operations through Storm allowed us to perform roll-up metrics by different dimensions using Trident State.  (Additionally, we can write to wide-rows for indexing, etc.)

This is working really well, but we are seeing increasing demand from our data scientists and customers to support "ad hoc" dimensional analysis, dashboards, and reporting.  Elastic Search keeps us covered on many of the ad hoc queries, but aside from facets, it has little support for real-time dimensional aggregations, and no support for dashboards and reports.

We turned to the industry to find the best of breed.  With some help from others who have traveled this road (shout out to @elubow), we settled on Vertica, Infobright and Acunu as contenders.  I quickly grabbed VMs from each of them and went to work.

WARNING: What I'm about to say is based on a few days of experimentation, and largely consists of first impressions.  It has no basis in real production experience. (yet =)

First up was Acunu.  Although each of the VMs functioned as an appliance, when logging into the VM and playing around with things, we were most at home with Acunu.  Acunu is backed by Cassandra.  Having C* installed and running as the persistence layer was like having an old friend playing wingman on a first date.  (they can bail you out if things start going south =)

Acunu had a nice REST API and a simple enough web-based UI to manage schemas and dimensions.  Within minutes, I was inserting data from a Ruby script and playing around with dashboards.... until something went wrong and the server started throwing OOMs.  After a restart, things cleared up, but it left me questioning the stability a bit.  (once again, this was a *single* VM running on my laptop, so it wasn't the most robust environment)

Next, I moved on to Vertica.  From a features and functions point of view, Vertica looked to be leaps and bounds ahead.  It had sophisticated support for R, which would make our data scientists happy.  It also has compression capabilities, which will make our IT/Ops guys happy.  And it looked to have some sophisticated integration with Hadoop, just in case we ever wanted/needed to support deep analytics that could leverage M/R.

That said, it was far more cumbersome to get up and running, and felt a bit like I went backwards in time.  I couldn't find a REST API. (please let me know if someone has one for Vertica)  So, I was left to go through the hoop-drill of getting a JDBC client driver, which was not available in public repos, etc.  When using the admin tool provided on the appliance, I felt like I was back in middle school (early '90s) installing Linux via an ANSI interface on an Intel 8080.  In the end, however, I grew accustomed to their client (vsql) and was happily hacking away over the JDBC driver, and it felt fairly solid.

Although we are still interested in pursuing both Acunu and Vertica, both experiences left me wanting.  What we really want is a fully open-source solution (preferably Apache-licensed) that we are free to enhance, supplement, etc.... with optional commercial support.

That got me thinking about Edward Capriolo's presentation on Intravert.   If I boil down our needs into "must-haves" and "nice-to-haves", what we really *need* is just an implementation of Rainbird.  (http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011)

AS AN ASIDE:
Does anyone know what happened to Rainbird?  I've been trying to get the answer, to no avail.
http://www.youtube.com/watch?v=84k7o4GdkQg

Now, time for crazy talk...
Intravert provides a slick REST API for CRUD operations on Cassandra.  As I said before, I'm a *huge* REST fan.  It provides the loose coupling for everything in our polyglot persistence architecture.  Intravert also provides a loosely coupled eventing framework to which I can attach handlers.  What if I implemented a handler that took the CRUD events and updated additional column families with the dimensional counts/aggregations???  If I then combine that with a JavaScript framework for charting, how far would that get me?  (60-70% solution?)
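To make the handler idea a bit more concrete, here is a minimal sketch in Python.  It is not Intravert's API: the `DimensionalCounter` class, its `on_event` method, and the "insert"/"delete" event names are all hypothetical, and an in-memory `Counter` stands in for the additional counter column families.  It only shows the bookkeeping: every CRUD event bumps one counter per combination of dimensions.

```python
from collections import Counter
from itertools import combinations


class DimensionalCounter:
    """Hypothetical CRUD-event handler that maintains roll-up counts.

    An in-memory Counter stands in for the counter column families
    (an assumption for illustration only).
    """

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.counts = Counter()

    def _keys(self, record):
        # One key per non-empty combination of the dimensions present
        # in the record, e.g. (country), (device), (country, device).
        present = [d for d in self.dimensions if d in record]
        for size in range(1, len(present) + 1):
            for combo in combinations(present, size):
                yield tuple((d, record[d]) for d in combo)

    def on_event(self, op, record):
        # Inserts increment the roll-ups, deletes decrement them.
        deltas = {"insert": 1, "delete": -1}
        if op not in deltas:
            raise ValueError("unsupported operation: %r" % op)
        for key in self._keys(record):
            self.counts[key] += deltas[op]

    def count(self, **filters):
        # Build the lookup key in the same dimension order used on write.
        key = tuple((d, filters[d]) for d in self.dimensions if d in filters)
        return self.counts[key]


handler = DimensionalCounter(["country", "device"])
handler.on_event("insert", {"country": "US", "device": "mobile"})
handler.on_event("insert", {"country": "US", "device": "desktop"})
print(handler.count(country="US"))
print(handler.count(country="US", device="mobile"))
```

The number of counters written per event grows as 2^n - 1 in the number of dimensions, so a real handler would probably only maintain the combinations that dashboards actually query.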

To be clear, I'm not bashing Vertica or Acunu.  Both have solid value propositions and they are both contenders in our options analysis.  I'm just mourning the fact that there seems to be no good open-source solution in this space like there are in others.  (Neo4j/TitanDB for graphs, Elastic Search/SOLR for search, Kafka/Kestrel for queueing, Cassandra for Storage, etc.)

We are also considering Druid and Infobright, but I haven't gotten to them yet:
https://github.com/metamx/druid

Please don't bash me for early judgments.
I'm definitely interested in hearing people's thoughts.



6 comments:

Edward Capriolo said...

Do not get me wrong. I love me some map reduce. However, the original web crawl use case for map reduce is not even done in map reduce anymore, right? Is it Caffeine or Percolator?

We have shot around how Intravert could be used as a map reduce light. I have been considering something. Currently a CQL query or slice returns data to the client. What we likely need is server-side "temporary" storage, like all the results go to a row key. This would make it easy to implement things like a union or intersection that are bigger than memory but not map-reduce big.
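As a rough illustration of why sorted server-side storage helps here, the sketch below streams a union and an intersection over two already-sorted result sets without loading either one fully into memory. This is plain Python, not Intravert or Cassandra code: the function names are hypothetical, and it assumes each input is sorted and free of duplicates, the way column names under a single row key would be.

```python
import heapq


def intersect_sorted(left, right):
    """Yield values present in both sorted, duplicate-free iterables."""
    left, right = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted iterator
    x, y = next(left, done), next(right, done)
    while x is not done and y is not done:
        if x == y:
            yield x
            x, y = next(left, done), next(right, done)
        elif x < y:
            x = next(left, done)   # advance the side that is behind
        else:
            y = next(right, done)


def union_sorted(left, right):
    """Yield each value from either sorted iterable exactly once."""
    last = object()  # sentinel that never equals a real value
    for value in heapq.merge(left, right):
        if value != last:
            yield value
            last = value


print(list(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8])))
print(list(union_sorted([1, 3, 5], [3, 4])))
```

Both functions hold only one element per input at a time, which is what makes "bigger than memory" workable as long as the interim results can be read back in sorted order.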

Maybe something like what GridGain does? Partitionable problems you can annotate with @Gridify.

I am not 100% sure if Vert.x/Intravert is a perfect fit here, since there would need to be some stateful component to manage longer-running things.

Brian O'Neill said...

Agreed. I've always liked the Riak interface for map/reduce light.
http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/
(I wish C* offered a similar simple M/R interface via REST)

I think some users would be willing to forego the state management, and have the additional writes to support the dimensions happen synchronously (e.g. in situations where there are tight SLAs for reads, but loose ones for writes). That would have the added benefit of consistency (if everything was done as a single batch mutation). To compensate, you could simply add additional C* nodes to handle the extra load for the added dimensional writes.

For those that needed longer running queries, where joins are needed before aggregation, you're right. Minimally, I think you would need to introduce a persistent queue of some sort to support async, and then you may need to introduce scratch tables for interim results.

Eric Lubow said...

Given what you are looking for with things being open source, I don't think that Infobright will solve your problems either if Vertica isn't. However I am still a bit unsure as to what you are looking to do.

One of the nice things about Vertica here is that you currently have a SPoF with Storm (Nimbus is the failure point) and with Vertica you can add a level of fault tolerance that you didn't have before (assuming those 2 structures could be made interchangeable).

If I were your ops guys, I would first be looking at how to mitigate the SPoFs. Adding more systems into the mix that are failure points is scary at best, and it's hard to think about anything beyond that.

As for the data scientists, it really comes down to their skill levels and what access you are willing to give them. Questions like this and dealing with SPoFs are why we went the route of a service architecture. And while Intravert (with its REST interface in particular) is incredibly appealing, it only solves one level of the problem.

Another architecture choice we made and use heavily is persistent queues. We used our own homegrown queueing system when Resque kept breaking down on us and are now moving over to NSQ. It serves many functions, but most importantly it acts as a safety net. And while you have some of that in Storm, it's hard to consider something with a SPoF a safety net.

Brian O'Neill said...

Great points, Eric. Fortunately, on the ingest side of things (where Storm plays its primary role), we don't have tight SLAs. As long as we aren't down "too long", Kafka can just hold the data until we get back online.

Regardless, I hear you loud and clear on the SPoF.

FWIW, I just attended a talk by @zedruid. We might look at that next, since it looks like it will fit well into our existing infrastructure. (reading data off Kafka, and providing the JSON/HTTP interface to drill down by dimension)

Ramith Jayasinghe said...

Did you try out WSO2 BAM ?
http://wso2.com/products/business-activity-monitor/
