Monday, April 2, 2012

Cassandra vs. (CouchDB | MongoDB | Riak | HBase)

Here is why in "Cassandra vs.", it's Cassandra FTW!

Our organization processes thousands of data sources continuously to produce a single consolidated view of the healthcare space.  There are two aspects of this problem that are challenging.  The first is schema management, and the second is processing time.

Creating a flexible RDBMS model to accommodate thousands of disparate data sources is difficult, especially as those schemas change over time.  Even given a flexible relational model, properly accessing and manipulating data in that model is complicated.  That complexity bleeds into application code and hampers analytics.

Given the volume of data and the frequency of updates, standardizing, indexing, analyzing and processing that data takes days across dozens of machines.  And even with round-the-clock processing, the business and customer appetites for additional and more current analytics are insatiable.

Scaling an RDBMS vertically through hardware eventually hits its limits.  Scaling horizontally through sharding becomes a challenge.  Operations and Maintenance (O&M) is difficult and requires a lot of custom coding to accommodate the partitioning.
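
To make the "custom coding" point concrete, here is a minimal sketch of the kind of routing logic an application ends up carrying when it shards an RDBMS by hand. The shard names, the `shard_for` and `moved_keys` helpers, and the MD5-modulo placement are all illustrative assumptions, not a description of any particular production setup.

```python
import hashlib
from typing import List, Sequence

# Hypothetical shard names; a real deployment would map these to connection strings.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key: str, shards: Sequence[str] = SHARDS) -> str:
    """Pick a shard by hashing the record key (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def moved_keys(keys: Sequence[str], old: Sequence[str], new: Sequence[str]) -> List[str]:
    """Return the keys whose shard changes when the shard list changes."""
    return [k for k in keys if shard_for(k, old) != shard_for(k, new)]


if __name__ == "__main__":
    # Hypothetical record keys, used only to show the effect of adding a shard.
    keys = [f"provider-{i}" for i in range(1000)]
    moved = moved_keys(keys, SHARDS, SHARDS + ["shard-4"])
    print(f"{len(moved)} of {len(keys)} keys move when a fifth shard is added")
```

With naive modulo placement like this, adding a single shard relocates the large majority of keys, and every query, migration, and backup script has to know about the routing. That is the sort of O&M burden a distributed data store is meant to take off the application.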

We needed a distributed data system that provided:
  • Flexible Schema Management
  • Distributed Processing
  • Easy Administration (to lower O&M costs)

Driven by the need for flexible schemas, we turned to NoSQL.  Alongside Cassandra, we considered MongoDB, CouchDB, HBase, and Riak.  Immediately we set out to see what support each of these had for "real" map/reduce.  Given the processing we do, we knew we would eventually need support for all of Hadoop's goodness.  This includes extensions like Pig, Hive, and Cascading.

CouchDB dropped out here.  It supports map/reduce, but has little or no notable support for Hadoop proper.  MongoDB scored "acceptable", but its Hadoop support was not nearly as evolved as the support in Cassandra.  DataStax actually distributes an enterprise version of Cassandra that fully integrates the Hadoop runtime.  Thus, we left MongoDB for another day and scored HBase's Hadoop support off the charts.

Riak is interesting in that they provide very slick native support for map/reduce (via REST), while they also provide a nice bridge from Hadoop.  I must admit, we were *very* attracted to the REST interface (which is why we eventually went on to create Virgil for Cassandra).

Left with Riak, HBase and Cassandra, we layered in some non-functional requirements.  First, we needed to be able to get solid third-party support.  Unfortunately, this is where Riak fell out.  Basho provides support for Riak, but DataStax and Cloudera were names we were familiar with.

NOW -- Down to HBase and Cassandra.  For this comparison, I won't bother reiterating all the great points from Dominic Williams's great post.  Given that post and a few others, we decided on Cassandra.

Now, since choosing Cassandra, I can say there are a few other *really* important, less tangible considerations.  The first is the code base.  Cassandra has an extremely clean and well-maintained code base.  Jonathan and team do a fantastic job managing the community and the code.  As we adopted NoSQL, the ability to extend the code base and incorporate our own features (e.g. triggers, a REST interface, and server-side wide-row indexing) has proven invaluable.

Secondly, the community is phenomenal.  That results in timely support and solid releases on a regular schedule.  They do a great job prioritizing features, accepting contributions, and cranking out features (they are now releasing roughly quarterly).  We've all probably been part of other open source projects where the leadership is lacking and features and releases are unpredictable, which makes your own release planning difficult.  Kudos to the Cassandra team.


DuranDuran said...

This was extremely helpful to me. Did you choose to go with a physical or virtual environment? Also, I'd like to know what kind of performance benefits you saw?

Unknown said...

Disclaimer: I work for Cloudera

You mentioned that your choice of Cassandra was influenced by the article that Dominic Williams wrote. I went through the post, and it looked great, except for one problem. That post was written in early 2010.

It's worth mentioning that HBase has vastly improved since 2012, so much so that Facebook (inventors of Cassandra) moved their entire messaging back-end from Cassandra to HBase.

benslin kard said...

Cassandra boots quickly, and its performance scales smoothly as new nodes are added.

Lincoln Clarete said...

Thanks for the great post. I know it's been a while since you wrote it but I'd love to know if you're still happy with your choice.
