Tuesday, September 11, 2012

Compound/Composite Keys: Connecting the dots between CQL3, Astyanax and Hector


I still don't know where I stand on the terminology debate, but I know things can get confusing if you are accessing your database from CQL as well as Java.  They are very different views of the world.  Hopefully this article can bridge that gap and explain the different views with respect to composite keys.

Before we get started, we should get our terminology straight.  Cassandra has a CompositeType data type.  This is *not* what we'll be discussing.  Instead, we're discussing compound column names.  CQL3 calls these compound keys, while Astyanax and Hector call them Composites.

Underneath the hood, Cassandra's storage model hasn't changed to accomodate composite keys.  Composite keys are really just a fancy bit-wise concatenation of the key components prior to storage and/or retrieval.  The storage model remains, rowkey -> columnname -> value.

Thus, you have to bend your brain a bit when seeing Cassandra through CQL.  Since multiple "columns" as viewed by CQL, will become a fewer number of columns in the physical storage.  And since the Java APIs provide direct access to the storage model, it will appear as one column through the Java APIs (at least initially).

Let's consider the following CQL as our example:

CREATE TABLE fishblogs (
  userid varchar,
  when timestamp,
  fishtype varchar,
  blog varchar,
  image blob,
  PRIMARY KEY (userid, when, fishtype)
);

Although its not apparent from the CQL, the PRIMARY KEY statement uses the ordering of the parameters to map the schema to Cassandra's storage model.  Specifically, the five columns declared above will result in two columns when an insert statement is run.  Here's how Cassandra maps the schema to storage.
row_key => userid  (since it is the first key declared)
column_name_1 => when + fishtype + blog  (since when and fishtype are keys, and blog is a value it needs to store)
column_name_1 => when + fishtype + image (since when and fishtype are keys and image is a value it needs to store)
If I can paraphrase the mapping... the first primary key becomes the row key.  Subsequent primary keys comprise the prefix components of the composite column name.  One column is generated for each non-primary key column.   The composite key combines the common prefix (comprising the primary keys) plus the column name for the value.

If we perform the following insert:

cqlsh:hms_data> insert into fishblogs (userid, when, fishtype, blog, image) values ('boneill42', 1343925155443, 'CATFISH', 'Caught the big one.', '632a726f636b73');
cqlsh:hms_data> select * from fishblogs;
 userid    | when                     | fishtype | blog                | image
-----------+--------------------------+----------+---------------------+----------------
 boneill42 | 2012-08-02 12:32:35-0400 |  CATFISH | Caught the big one. | 632a726f636b73

CQL presents this as five columns, but under the hood there are really only two (column name -> value):
[8?.vsCATFISHblog]->[Caught the big one.]
[8?.vsCATFISHimage]->[c*rocks]

In the above, you can see the concatenated keys in the column name.  This is the raw output you receive from a query using Astyanax, when you don't have composite keys in place.  This the abbreviated code:
ColumnFamily columnFamily = new ColumnFamily(columnFamilyName, StringSerializer.get(), StringSerializer.get());
OperationResult> result = this.keyspace.prepareQuery(columnFamily).getKey(rowKey).execute();
result.getResult(); 

In order to have the same view as CQL, we need to map the composite key to a class and map the fields of that class.  For example,
public class FishBlog {
    @Component(ordinal = 0)
    public long when;
    @Component(ordinal = 1)
    public String type;
    @Component(ordinal = 2)
    public String field;
    
    public FishBlog() {
    }
}
What seems a little unnatural about this mapping is the "field" member variable.  Ideally, we'd prefer two member variables: "image" and "blog".  But since the data is spread across multiple columns, each containing the value for a separate field, the class has to be flexible enough to carry either (but not both).  

We can then use the following Astyanax code to retrieve data:
AnnotatedCompositeSerializer entitySerializer = new AnnotatedCompositeSerializer(FishBlog.class);
ColumnFamily columnFamily = new ColumnFamily(columnFamilyName, StringSerializer.get(), entitySerializer); OperationResult> result = this.keyspace.prepareQuery(columnFamily).getKey(rowKey).execute();
return result.getResult();


This results in the following ouput, when accessing the result set, which contains FishBlog's.
fishBlog.when=>[Thu Aug 02 12:32:35 EDT 2012]
fishBlog.type=>[CATFISH]
fishBlog.field=>[image]
fishBlog.value=>[c*rocks]
Clear as mud?  Now, to complete the picture I'll show the same code for Hector.  Here is the fetch:
ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(this.keyspace, columnFamilyName, new StringSerializer(), new CompositeSerializer());
return template.queryColumns(rowKey);
Then, what may feel a little weird is that the Composite object, which contains the component keys for your object, comes from the column name:
for (Composite columnName : columns.getColumnNames()){
   FishBlog fishBlog = new FishBlog(columnName);
   LOG.debug("fishBlog.when=>[" + new Date(fishBlog.getWhen()) + "]");
   LOG.debug("fishBlog.type=>[" + fishBlog.getType() + "]");
   LOG.debug("fishBlog.field=>[" + fishBlog.getField() + "]");
   LOG.debug("fishBlog.value=>[" + columns.getString(columnName) + "]");

}
As you can see from above, I recommend wrapping the composite in an object that provides easy accessors. (which makes it feel like Astyanax)  Here is my FishBlog object for Hector:
public class FishBlog {
    private Composite composite;
    public FishBlog(Composite composite) {
        this.composite = composite;
    }
    public long getWhen() {
        return composite.get(0, new LongSerializer());
    }
    public String getType() {
        return composite.get(1, new StringSerializer());
    }
    public String getField() {
        return composite.get(2, new StringSerializer());
    }
}
OK -- hopefully that demonstrates how the five columns from the CQL perspective translates into two physical columns that can be accessed via Hector or Astyanax, resulting in two entries in the result set mapped to java objects that you can use to get at the values.  I'll try to get the code for these examples up on github.





9 comments:

Jonathan Ellis said...

Incidentally, I'm preferring the term "compound keys" to avoid implying that we use CompositeType with the partition key, which we do not.

Brian O'Neill said...

Good point Jonathan. I updated the title and added a paragraph to make that distinction clear.

Juuso Kosonen said...

Any chance that you could also add how to save/update data via astyanax? Not able to make it work. "InvalidRequestException(why:Not enough bytes to read value of component 0)"

Brian O'Neill said...

Sure thing. In fact, I just ran into this yesterday. For a preview, here is the conversation on the dev list:
http://bit.ly/T9hjNo

I'll post an update shortly.

Brian O'Neill said...

Juuso, here is the update that shows how to save/update data:
http://brianoneill.blogspot.com/2012/10/cql-astyanax-and-compoundcomposite-keys.html

Juuso Kosonen said...

Thanks, Brian!

benslin kard said...

you have shared very good information about the Cassandra being more suitable for real time transaction processing and the serving of interactive data.

Ashot Golovenko said...

Thanks, Brian! This is a very useful article.

I have some questions though:
Let's say I just need to get blog value from your fishblogs table:

select blog from fishblogs where userid = ? and when > ?.

How effective this query would be in comparison to

select * from fishblogs where userid = ? and when > ?.

I mean cassandra boasts the efficiency of retrieval of data stored sequentially which will not be the case for the first query. On the other hand I will spare some node heap memory while retrieving only the columns I need, especially if there are more than just two non-primary key columns in the table.

Generalizing the question, which of the following table definition is more cassandra-effective keeping in mind that most of my queries do not require fetching of all the non-primary key columns:

CREATE TABLE t1 (
partKey ,
rowKey ,
restOfDataInJson ,
PRIMARY KEY (partKey, rowKey)
);

or
CREATE TABLE t1 (
partKey ,
rowKey ,
field1 ,
field2 ,
field3 ,
field4
...
PRIMARY KEY (partKey, rowKey)
);

Brian O'Neill said...

I'd probably go with the second schema. As you point out, with the first schema you will always end up pulling all of the data fields back.

For flat key/value pairs (in the JSON blob), you may even be better served using CQL collections:
http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html

Using map(s) instead of JSON, you can perform operations on single keys. (we are in the process of evaluating this approach right now -- converting our JSON blobs to collections)