Why are super columns in Cassandra no longer favoured?

Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to deserialze all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.

In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.

The easiest way to make use of composite columns is to take advantage of the abstraction that CQL 3 provides. Consider the following schema:

CREATE TABLE messages(
    username text,
    sent_at timestamp,
    message text,
    sender text,
    PRIMARY KEY(username, sent_at)
);

Username here is the row key, but we’ve used a PRIMARY KEY definition which creates a grouping of row key and the sent_at column. This is important as it has the effect of indexing that attribute.

INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:42:15', 'Hi', 'alice');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('alice', '2012-08-01 11:42:37', 'Hi yourself', 'bob');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:43:00', 'What are you doing later?', 'alice');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:47:14', 'Bob?', 'alice');

Behind the scenes Cassandra will store the above inserted data something like this:

alice: (2012-08-01 11:42:37,message): Hi yourself, (2012-08-01 11:42:37,sender): bob
bob:   (2012-08-01 11:42:15,message): Hi,          (2012-08-01 11:42:15,sender): alice, (2012-08-01 11:43:00,message): What are you doing later?, (2012-08-01 11:43:00,sender): alice (2012-08-01 11:47:14,message): Bob?, (2012-08-01 11:47:14,sender): alice

But using CQL 3, we can query the “row” using a sent_at predicate, and get back a tabular result set.

SELECT * FROM messages WHERE username="bob" AND sent_at > '2012-08-01';
 username | sent_at                  | message                   | sender
----------+--------------------------+---------------------------+--------
      bob | 2012-08-01 11:43:00+0000 | What are you doing later? |  alice
      bob | 2012-08-01 11:47:14+0000 |                      Bob? |  alice

Leave a Comment

tech