Wednesday, March 21, 2012

Cassandra Indexing: The good, the bad and the ugly

Introduction
Within NoSQL, the operations of indexing, fetching and searching for information are intimately tied to the physical storage mechanisms.  It is important to remember that rows are stored across hosts, but a single row is stored on a single host. (with replicas)  Columns families are stored in sorted order, which makes querying a set of columns efficient (provided you are spanning rows).

The Bad : Partitioning

One of the tough things to get used to at first is that without any indexes queries that span rows can be (very) bad.  Thinking back to our storage model however, that isn't surprising.  The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning. 


Partitioning is the act of carving up the range of rowkeys assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host.  You've probably seen this when you initialized your cluster with a "token".  The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range.  


There are two primary partitioners: Random and Order Preserving.  They are appropriately named.  The RandomPartitioner hashes the rowkeys into tokens.  With the RandomPartitioner, the token is a hash of the rowkey.  This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need.  It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query.


Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP).  I am *not* down with OPP.  The OPP preserves order as it translates rowkeys into tokens.  Now, given a start rowkey value and a end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for.  It computes the start value to a token the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either and you'll get a lopsided cluster, which greatly increases the cost of configuration and administration of the cluster. (not worth it)

The Good : Secondary Indexes

Cassandra does provide a native indexing mechanism in Secondary Indexes.  Secondary Indexes work off of the columns values.   You declare a secondary index on a Column Family.  Datastax has good documentation on the usage.  Under the hood,  Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics)  Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on columns value (rather than rowkeys), a query still needs to be sent to all nodes.  Additionally, secondary indexes are not recommended for high-cardinality sets.  I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family".  If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query.


From Ed's presentation:
  • Not recommended for high cardinality values(i.e.timestamps,birthdates,keywords,etc.)
  • Requires at least one equality comparison in a query--not great for less-than/greater-than/range queries
  • Unsorted - results are in token order, not query value order
  • Limited to search on datatypes, Cassandra natively understands
With all that said, secondary indexes work out of the box and we've had good success using them on simple values.


The Ugly : Do-It-Yourself (DIY) / Wide-Rows

Now, beauty is in the eye of the beholder.  One of the beautiful things about NoSQL is the simplicity.  The constructs are simple: Keyspaces, Column Families, Rows and Columns.  Keeping it simple however means sometimes you need to take things into your own hands.

This is the case with wide-row indexes.  Utilizing Cassandra's storage model, its easy to build your own indexes where each row-key becomes a column in the index.  This is sometimes hard to get your head around, but lets imagine we have a case whereby we want to select all users in a zip code.  The main users column family is keyed on userid, zip code is a column on each user row.  We could use secondary indexes, but there are quite a few zip codes.  Instead we could maintain a column family with a single row called "idx_zipcode".  We could then write columns into this row of the form "zipcode_userid".  Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values).

One obvious downside of this approach is that rows are self-contained on a host. (again except for replicas)  This means that all queries are going to hit a single node.  I haven't yet found a good answer for this.

Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective.  In our implementation, we've done our best to be language agnostic on the client-side, allowing people to pick the best tool for the job to interact with the data in Cassandra.  With that mentality, the DIY indexes present some trouble.  Wide-rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state then zip).  Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift).   This means that client needing to query data needs to have the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner.

Making It Better...
For this very reason, we've decided to release two open source projects that help push this logic to the server-side.  The first project is Cassandra-Triggers.  This allows you to attached asynchronous activities to writes in Cassandra.  (one such activity could be indexing)  We've also released Cassandra-Indexing.  This is hot off the presses and is still in its infancy (e.g. it only supports UT8Types in the index), but the intent is to provide a generic server-side mechanism that indexes data as its written to Cassandra.  Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF.  As always, questions, comments and thoughts are welcome. (especially if I'm off-base somewhere)

3 comments:

Dominique De Vito said...

Hi,

In the section "The Ugly : Do-It-Yourself (DIY) / Wide-Rows", you talk to "maintain a column family with a single row called "idx_zipcode".
And then, of course, this row is going to be hosted on a single node that will have to catch all the "zip" requests.

Why not simply using the zipcode as the rowkey so that all "zip code/user id" relationships would be spread over all the cluster?

Brian O'Neill said...

You're right. If you are always looking for only a single zip code, you could create an inverted CF using zipcode as the row key. The wide-rows are more for range queries on keys. Maybe the better example query for this would be to get a list of users for a range of zipcodes. (e.g. 18000-19000).

Dean Hiller said...

You should checkout playorm on github. It can do indexing and pushes it down with the entities. In fact, it pushes everything in a flush so if something failed it is kind of like all or nothing(not a tx though). Also, you can JQL/HQL into a trillion row table in under 100ms via partitioning.