Thursday, January 16, 2014

Jumping on the CQL Bandwagon (a tipping point to migrate off Astyanax/Hector?)

It's been over a year since we started looking at CQL (see my blog post from last October).

At first we didn't know what to make of CQL.   We were heavily invested in the thrift-based APIs (Astyanax + Hector).  We had even written a REST API called Virgil directly on top of Thrift (which enabled the server to run an embedded Cassandra).  

But there was a fair amount of controversy around CQL, and whether it was putting "SQL" back into "NoSQL".  We took a wait-and-see approach to see how much CQL and the thrift-based API diverged.  The Cassandra community pledged to maintain the thrift layer, but it was clear that DataStax was throwing its weight behind the new CQL Java driver.  It was also clear that newcomers to Cassandra might start with CQL (and the CQL Java driver), especially if they were coming from a SQL background.

Here we are a year later, and with the latest releases of Cassandra, (IMHO) we've hit a tipping point that has driven this C* old-timer to begin the migration to CQL.   Specifically, there are three things that CQL has better support for:

Lightweight Transactions: These are conditional inserts and updates.  In CQL, you can add an IF clause on the end of a statement, which is verified before the upsert occurs.  This is hugely powerful in a distributed system, because it makes distributed read-before-write patterns safe.  A client can add a condition that will prevent the update if it was working with stale information (e.g. by checking a timestamp or checksum and only updating if that timestamp or checksum hasn't changed).
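To make the stale-write check concrete, here is a minimal sketch.  The table and column names (docs, id, body, checksum) are made up for illustration, and the Python function is only an in-memory stand-in for what the IF clause does on the server; it is not driver code.

```python
# Hypothetical CQL for an illustrative table docs(id, body, checksum).
CONDITIONAL_INSERT = "INSERT INTO docs (id, body, checksum) VALUES (?, ?, ?) IF NOT EXISTS"
CONDITIONAL_UPDATE = "UPDATE docs SET body = ?, checksum = ? WHERE id = ? IF checksum = ?"


def conditional_update(table, row_id, new_body, new_checksum, expected_checksum):
    """In-memory model of the IF clause: write only if the checksum is unchanged.

    `table` is a plain dict standing in for the Cassandra table (an assumption
    for this sketch). Returns True if the write was applied, False otherwise,
    mirroring the [applied] column Cassandra returns for conditional writes.
    """
    row = table.get(row_id)
    if row is None or row["checksum"] != expected_checksum:
        return False  # someone else changed the row since we read it
    row["body"] = new_body
    row["checksum"] = new_checksum
    return True
```

Two clients that both read checksum "aaa" and then both try to write will see only the first write applied; the second gets False back and has to re-read before trying again.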

Batching:  This allows the client to group statements.  The batch construct can guarantee that either all the statements will succeed, or all will fail.  It doesn't provide isolation, meaning other clients may see partially applied batches.  Even so, this is a very important construct when creating consistent systems that scale, because you end up batching in the client to reduce the database traffic.
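As a rough sketch of what a batch looks like on the wire, the helper below wraps individual statements in CQL's BEGIN BATCH ... APPLY BATCH syntax.  The helper name build_batch and the statements in the usage note are hypothetical; in practice you would more likely use the batch support in your driver than build strings by hand.

```python
def build_batch(statements):
    """Wrap CQL statements in a logged batch (BEGIN BATCH ... APPLY BATCH).

    Illustrative helper, not part of any driver. Each statement is normalized
    to end with exactly one semicolon.
    """
    if not statements:
        raise ValueError("a batch needs at least one statement")
    lines = ["  " + s.strip().rstrip(";") + ";" for s in statements]
    return "BEGIN BATCH\n" + "\n".join(lines) + "\nAPPLY BATCH;"
```

For example, build_batch(["INSERT INTO users (id) VALUES ('u1')", "UPDATE counts_by_day SET n = 1 WHERE day = '2014-01-16'"]) produces one string that can be sent to the server as a single request.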

Collections: When you do enough data modeling on top of Cassandra, you end up building on top of the row key / sorted column key structure using composite columns.  And although it is amazing what you can accomplish with that simple structure, a lot of effort is spent marshaling in and out of those primitive structures.  Collections offer a convenient translation layer on top of those primitives, which simplifies things.  You can always drop down into the primitives when need be, but sometimes it's nice to have a simple list, map, or set at hand.
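Here is a small sketch of how collections read in CQL, plus a toy function that renders Python values as CQL collection literals.  The users table and its columns are invented for the example, and cql_literal is a hypothetical helper that only handles strings; real code should bind parameters through the driver rather than format literals.

```python
# Hypothetical schema and updates using the three collection types.
CREATE_USERS = (
    "CREATE TABLE users (id text PRIMARY KEY, "
    "emails set<text>, prefs map<text, text>, recent list<text>)"
)
ADD_EMAIL = "UPDATE users SET emails = emails + {'a@example.com'} WHERE id = 'u1'"
SET_PREF = "UPDATE users SET prefs['theme'] = 'dark' WHERE id = 'u1'"
APPEND_RECENT = "UPDATE users SET recent = recent + ['login'] WHERE id = 'u1'"


def cql_literal(value):
    """Render a str, set, list, or dict of strings as a CQL literal (sketch only)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # CQL escapes ' by doubling it
    if isinstance(value, (set, frozenset)):
        return "{" + ", ".join(cql_literal(v) for v in sorted(value)) + "}"
    if isinstance(value, list):
        return "[" + ", ".join(cql_literal(v) for v in value) + "]"
    if isinstance(value, dict):
        pairs = (cql_literal(k) + ": " + cql_literal(v) for k, v in sorted(value.items()))
        return "{" + ", ".join(pairs) + "}"
    raise TypeError("unsupported type: " + type(value).__name__)
```

Compare that with hand-rolling composite column names for every set member or map entry, and the appeal is pretty clear.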

Now -- don't get me wrong.  I'm still a *huge* Astyanax fan, and it still provides some convenience capabilities that AFAIK are not yet available in CQL.  (e.g. the Chunked Object Store)  But as we guessed a while back, it looks like CQL will offer better support for newer C* features.

SOOO ----
I've started on a rewrite of Virgil that offers up CQL capabilities via REST.  I'm calling the project memnon.  You can follow along on github as I build it out.

Additionally, I started rewriting the Storm-Cassandra bolt/state mechanisms to ride on top of CQL.  You can see that action on github as well.

More to come on both of those.


vivek said...

Nice post, thanks for the insights. This is what Kundera offers, provides a way to choose between CQL2/CQL3, and JPQL to CQL3 translation is implicitly managed by Kundera API.

Henning Rauch said...

Nice post. What about performance? Do you see significant differences between astyanax/thrift and cql? I would assume that going through the parser/scanner process of a language like CQL does not come without some cost.