Cassandra NYC 2011: Data Modeling
Data Modeling Examples
Matthew F. Dennis // @mdennis
Overview
● general guiding goals for Cassandra data models
● Interesting and/or common examples/questions to get us started
● Should be plenty of time at the end for questions, so bring them up if you have them !
Data Modeling Goals
● Keep data that is queried together stored together on disk
● More generally: think about the efficiency of querying your data and work backward from there to a model in Cassandra
● Usually, you shouldn't try to normalize your data (contrary to many use cases in relational databases)
● Usually better to keep a record that something happened as opposed to changing a value (not always the best approach though)
Time Series Data
● Easily the most common use of Cassandra
● Financial tick data
● Click streams
● Sensor data
● Performance metrics
● GPS data
● Event logs
● etc, etc, etc ...
● All of the above are essentially the same as far as C* is concerned
● Things happen in some timestamp ordered stream and consist of values associated with the given timestamp (i.e. “data points”)
– Every 30 seconds record location, speed, heading and engine temp
– Every 5 minutes record CPU, IO and Memory usage
● We are interested in recreating, aggregating and/or analyzing arbitrary time slices of the stream
– Where was agent:007 and what was he doing between 11:21am and 2:38pm yesterday?
– What are the last N actions foo did on my site?
Time Series Thought Model
Data Points Defined
● Each data point has 1-N values
● Each data point corresponds to a specific point in time or an interval/bucket (e.g. the 5th minute of the 17th hour on some date)
Data Points Mapped to Cassandra
● Row Key is the id of the data point stream, bucketed by time
– e.g. plane01:jan_2011 or plane01:jan_01_2011 for month or day buckets respectively
● Column Name is TimeUUID(timestamp of data point)
● Column Value is the serialized data point
– JSON, XML, pickle, msgpack, thrift, protobuf, avro, BSON, et cetera
● Bucketing
– Avoids always requiring multiple seeks when only small slices of the stream are requested (e.g. the stream is 5 years old but I'm only interested in Jan 5th 3 years ago and/or yesterday between 2pm and 3pm)
– Makes it easy to lazily aggregate old stream activity
– Reduces compaction overhead since old rows will never have to be merged again (until you “back fill” and/or delete something)
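The row-key bucketing described above can be sketched as a small helper. This is an illustrative function (the name and granularities are assumptions, not from the talk) that produces keys in the slide's plane01:jan_2011 / plane01:jan_01_2011 style:

```python
from datetime import datetime

def bucket_row_key(stream_id, ts, granularity="month"):
    # Hypothetical helper: build a time-bucketed row key such as
    # "plane01:jan_2011" (month bucket) or "plane01:jan_01_2011" (day bucket).
    if granularity == "month":
        suffix = ts.strftime("%b_%Y").lower()      # e.g. jan_2011
    elif granularity == "day":
        suffix = ts.strftime("%b_%d_%Y").lower()   # e.g. jan_01_2011
    else:
        raise ValueError("unknown granularity: %s" % granularity)
    return "%s:%s" % (stream_id, suffix)

print(bucket_row_key("plane01", datetime(2011, 1, 1)))          # plane01:jan_2011
print(bucket_row_key("plane01", datetime(2011, 1, 5), "day"))   # plane01:jan_05_2011
```

Every writer and reader must derive keys the same way, so in practice this function would live in shared client code.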
A Slightly More Concrete Example
● Sensor data from airplanes
● Every 30 seconds each plane sends latitude+longitude, altitude and wine remaining in mdennis' glass.
The Visual
● Row Key is the id of stream being recorded (e.g. plane5:jan_2011)
● Column Name is timestamp (or TimeUUID) associated with the data point
● Column Value is the value of the event (e.g. protobuf serialized lat/long+alt+wine_level)
Row Key: plane5:jan_2011 (abbreviated p5:j11)

  TimeUUID0 → 28.90, 124.30 / 45K feet / 70% wine
  TimeUUID1 → 28.85, 124.25 / 44K feet / 50% wine
  TimeUUID2 → 28.81, 124.22 / 44K feet / 95% wine

Middle of the ocean and half a glass of wine at 44K feet
Querying
● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call
● Or use an empty start and/or end along with a count
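Constructing the min/max TimeUUID bounds can be done by hand with the stdlib uuid module. The sketch below is one way to do it (the function name is mine, not from the talk): Cassandra's TimeUUIDType compares by the embedded 60-bit timestamp first, so a range bound fills the non-time bits with all zeros (lowest) or all ones (highest).

```python
import uuid
from datetime import datetime, timedelta

UUID_EPOCH = datetime(1582, 10, 15)  # start of the UUID v1 timeline

def timeuuid_bound(dt, lowest=True):
    # Build a synthetic version-1 UUID usable as a slice bound for `dt`.
    # 100-ns intervals since the UUID epoch:
    ticks = ((dt - UUID_EPOCH) // timedelta(microseconds=1)) * 10
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # version 1
    if lowest:
        clk_hi, clk_lo, node = 0x80, 0x00, 0x000000000000  # variant bits + zeros
    else:
        clk_hi, clk_lo, node = 0xBF, 0xFF, 0xFFFFFFFFFFFF  # variant bits + ones
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clk_hi, clk_lo, node))

start = timeuuid_bound(datetime(2011, 1, 5, 14, 0), lowest=True)
end = timeuuid_bound(datetime(2011, 1, 5, 15, 0), lowest=False)
# start.bytes / end.bytes would be the start/finish of the get_slice predicate
```

Most client libraries of this era (e.g. pycassa) shipped an equivalent convenience, so check yours before rolling your own.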
Bucket Sizes?
● Depends greatly on
● Average size of time slice queried
● Average data point size
● Write rate of data points to a stream
● IO capacity of the nodes
So... Bucket Sizes?
● No bigger than a few GB per row
● bucket_size * write_rate * sizeof(avg_data_point)
● Bucket size >= average size of time slice queried
● No more than maybe 10M entries per row
● No more than a month if you have lots of different streams
● NB: there are exceptions to all of the above, which are really nothing more than guidelines
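The bucket_size * write_rate * sizeof(avg_data_point) rule is easy to check with back-of-envelope arithmetic. The rates below are assumed for illustration (one point every 30 seconds, ~300 bytes serialized, month buckets), not numbers from the talk:

```python
# Back-of-envelope row size for a month bucket.
seconds_per_month = 30 * 24 * 3600   # ~2.6M seconds
write_interval = 30                  # assumed: one data point every 30 seconds
avg_point_bytes = 300                # assumed serialized size per point

points_per_row = seconds_per_month // write_interval
row_bytes = points_per_row * avg_point_bytes

print(points_per_row)     # 86400 columns in the row
print(row_bytes / 1e6)    # ~25.9 MB -- well under the few-GB guideline
```

At these rates a month bucket is comfortably small; a stream written every few milliseconds would push the same arithmetic toward day or hour buckets.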
Ordering
● In cases where the most recent data is the most interesting (e.g. last N events for entity foo or last hour of events for entity bar), you can reverse the comparator (i.e. sort descending instead of ascending)
● http://thelastpickle.com/2011/10/03/Reverse-Comparators/
● https://issues.apache.org/jira/browse/CASSANDRA-2355
Spanning Buckets
● If your time slice spans buckets, you'll need to construct all the row keys in question (i.e. number of unique row keys = spans+1)
● If you want all the results between the dates, pass all the row keys to multiget_slice with the start and end of the desired time slice
● If you only want the first N results within your time slice, lowest latency comes from multiget_slice as above but best efficiency comes from serially paging one row key at a time until your desired count is reached
Expiring Streams (e.g. “I only care about the past year”)
● Just set the TTL to the age you want to keep
● yeah, that's pretty much it ...
Counters
● Sometimes you're only interested in counting things that happened within some time slice
● Minor adaptation to the previous content to use counters (be aware they are not idempotent)
● Column names become buckets
● Values become counters
Example: Counting User Logins
Row Key: user3:system5:logins:by_day (abbreviated U3:S5:L:D)

  20110107 → 2   (2 logins on Jan 7th 2011 for user3 on system5)
  20110523 → 7   (7 logins on May 23rd 2011 for user3 on system5)

Row Key: user3:system5:logins:by_hour (abbreviated U3:S5:L:H)

  2011010710 → 1   (1 login for user3 on system5 on Jan 7th 2011, 10th hour)
  2011052316 → 2   (2 logins for user3 on system5 on May 23rd 2011, 16th hour)
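The bucketed column names in the example above are just fixed-width date strings, so both granularities can be derived from the same login timestamp. A minimal sketch (helper names are mine; the counter-increment calls mentioned in the comment are illustrative, pycassa-style):

```python
from datetime import datetime

def day_bucket(ts):
    return ts.strftime("%Y%m%d")      # e.g. 20110107

def hour_bucket(ts):
    return ts.strftime("%Y%m%d%H")    # e.g. 2011010710

login = datetime(2011, 1, 7, 10, 42)
# On each login you would increment both counter columns, e.g.
#   add('user3:system5:logins:by_day',  day_bucket(login))
#   add('user3:system5:logins:by_hour', hour_bucket(login))
print(day_bucket(login), hour_bucket(login))  # 20110107 2011010710
```

Fixed-width names also sort correctly under an ASCII/UTF8 comparator, so time-range slices over the buckets work as expected.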
Eventually Atomic
● In a legacy RDBMS atomicity is “easy”
● Attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)
● However, consistency is important and can certainly be achieved in C*
● Many approaches / alternatives
● I like a transaction log approach, especially in the context of C*
Transaction Logs (in this context)
● Records what is going to be performed before it is actually performed
● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense which is usually what people mean when they say isolation)
● Marks that the actions were performed
In Cassandra
● Serialize all actions that need to be performed into a single column
– JSON, XML, YAML (yuck!), pickle, msgpack, protobuf, et cetera
● Row Key = randomly chosen C* node token
● Column Name = TimeUUID(nowish)
● Perform the actions
● Delete the column
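The write-ahead step above can be sketched as follows. This is a minimal illustration (the function name and action schema are assumptions; JSON is just one of the serializations the slide lists): all mutations for one atomic unit go into a single column value, keyed by TimeUUID(nowish).

```python
import json
import uuid

def make_xact_log_entry(actions):
    # One column per atomic unit of work: name is TimeUUID(nowish),
    # value is every action serialized together.
    column_name = uuid.uuid1()       # version-1 UUID = TimeUUID
    column_value = json.dumps(actions)
    return column_name, column_value

actions = [
    {"cf": "users",    "key": "user3", "set": {"credits": 10}},
    {"cf": "receipts", "key": "r789",  "set": {"user": "user3"}},
]
name, value = make_xact_log_entry(actions)
# 1) write (name -> value) to XACT_LOG under a randomly chosen node token
# 2) perform each action (the actions must be idempotent writes)
# 3) delete the column
```

Because the whole unit lives in one column, the single-column write to XACT_LOG is itself atomic, which is what makes the scheme sound.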
Configuration Details
● Short gc_grace_seconds on the XACT_LOG Column Family (e.g. 5 minutes)
● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability
● if it fails with an UnavailableException, pick a different node token and/or node and try again (gives the same semantics as a relational DB in terms of knowing the state of your transaction)
Failures
● Before insert into the XACT_LOG
● After insert, before actions
● After insert, in middle of actions
● After insert, after actions, before delete
● After insert, after actions, after delete
Recovery
● Each C* node has a cron job offset from every other by some time period
● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period (the “recovery period”)
● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working)
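The selection step of the recovery job is simple age filtering. A sketch, assuming the scan has already returned (timestamp, serialized_actions) pairs and a hypothetical 5-minute recovery period matching the short gc_grace_seconds above:

```python
import time

RECOVERY_PERIOD = 300  # seconds; assumed, matches the 5-minute gc_grace example

def needs_replay(columns, now=None):
    # `columns` is a list of (timestamp_seconds, serialized_actions) pairs
    # from the XACT_LOG scan; keep only entries old enough to be
    # considered stuck and in need of replay.
    now = time.time() if now is None else now
    return [c for c in columns if now - c[0] > RECOVERY_PERIOD]

cols = [(1000.0, "xact-a"), (1290.0, "xact-b")]
print(needs_replay(cols, now=1400.0))  # [(1000.0, 'xact-a')]
```

Entries younger than the recovery period are left alone since a live client may still be mid-flight on them.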
XACT_LOG Comments
● Idempotent writes are awesome (that's why this works so well)
● Doesn't work so well for counters (they're not idempotent)
● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)
Q?