Cassandra NYC 2011: Data Modeling
Data Modeling Examples
Matthew F. Dennis // @mdennis
Overview
● general guiding goals for Cassandra data models
● Interesting and/or common examples/questions to get us started
● Should be plenty of time at the end for questions, so bring them up if you have them !
Data Modeling Goals
● Keep data that is queried together stored together on disk
● More generally: think about the efficiency of querying your data and work backward from there to a model in Cassandra
● Usually, you shouldn't try to normalize your data (contrary to many use cases in relational databases)
● Usually better to keep a record that something happened as opposed to changing a value (not always the best approach though)
Time Series Data
● Easily the most common use of Cassandra
● Financial tick data
● Click streams
● Sensor data
● Performance metrics
● GPS data
● Event logs
● etc, etc, etc ...
● All of the above are essentially the same as far as C* is concerned
● Things happen in some timestamp ordered stream and consist of values associated with the given timestamp (i.e. “data points”)
– Every 30 seconds record location, speed, heading and engine temp
– Every 5 minutes record CPU, IO and Memory usage
● We are interested in recreating, aggregating and/or analyzing arbitrary time slices of the stream
– Where was agent:007 and what was he doing between 11:21am and 2:38pm yesterday?
– What are the last N actions foo did on my site?
Time Series Thought Model
Data Points Defined
● Each data point has 1-N values
● Each data point corresponds to a specific point in time or an interval/bucket (e.g. the 5th minute of the 17th hour on some date)
Data Points Mapped to Cassandra
● Row Key is the id of the data point stream, bucketed by time
– e.g. plane01:jan_2011 or plane01:jan_01_2011 for month or day buckets respectively
● Column Name is TimeUUID(timestamp of data point)
● Column Value is the serialized data point
– JSON, XML, pickle, msgpack, thrift, protobuf, avro, BSON, et cetera
● Bucketing
– Avoids always requiring multiple seeks when only small slices of the stream are requested (e.g. the stream is 5 years old but I'm only interested in Jan 5th 3 years ago and/or yesterday between 2pm and 3pm)
– Makes it easy to lazily aggregate old stream activity
– Reduces compaction overhead since old rows will never have to be merged again (until you “back fill” and/or delete something)
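The row-key bucketing described above can be sketched as a small helper. This is an illustrative function (the name and granularities are assumptions, not from the talk) that produces keys in the slide's plane01:jan_2011 / plane01:jan_01_2011 style:

```python
from datetime import datetime

def bucket_row_key(stream_id, ts, granularity="month"):
    # Hypothetical helper: build a time-bucketed row key such as
    # "plane01:jan_2011" (month bucket) or "plane01:jan_01_2011" (day bucket).
    if granularity == "month":
        suffix = ts.strftime("%b_%Y").lower()      # e.g. jan_2011
    elif granularity == "day":
        suffix = ts.strftime("%b_%d_%Y").lower()   # e.g. jan_01_2011
    else:
        raise ValueError("unknown granularity: %s" % granularity)
    return "%s:%s" % (stream_id, suffix)

print(bucket_row_key("plane01", datetime(2011, 1, 1)))          # plane01:jan_2011
print(bucket_row_key("plane01", datetime(2011, 1, 5), "day"))   # plane01:jan_05_2011
```

Every writer and reader must derive keys the same way, so in practice this function would live in shared client code.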
A Slightly More Concrete Example
● Sensor data from airplanes
● Every 30 seconds each plane sends latitude+longitude, altitude and wine remaining in mdennis' glass.
The Visual
● Row Key is the id of stream being recorded (e.g. plane5:jan_2011)
● Column Name is timestamp (or TimeUUID) associated with the data point
● Column Value is the value of the event (e.g. protobuf serialized lat/long+alt+wine_level)
Row Key: plane5:jan_2011 (abbreviated p5:j11)

  TimeUUID0 → 28.90, 124.30 / 45K feet / 70% wine
  TimeUUID1 → 28.85, 124.25 / 44K feet / 50% wine
  TimeUUID2 → 28.81, 124.22 / 44K feet / 95% wine

Middle of the ocean and half a glass of wine at 44K feet
Querying
● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call
● Or use an empty start and/or end along with a count
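Constructing the min/max TimeUUID bounds can be done by hand with the stdlib uuid module. The sketch below is one way to do it (the function name is mine, not from the talk): Cassandra's TimeUUIDType compares by the embedded 60-bit timestamp first, so a range bound fills the non-time bits with all zeros (lowest) or all ones (highest).

```python
import uuid
from datetime import datetime, timedelta

UUID_EPOCH = datetime(1582, 10, 15)  # start of the UUID v1 timeline

def timeuuid_bound(dt, lowest=True):
    # Build a synthetic version-1 UUID usable as a slice bound for `dt`.
    # 100-ns intervals since the UUID epoch:
    ticks = ((dt - UUID_EPOCH) // timedelta(microseconds=1)) * 10
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # version 1
    if lowest:
        clk_hi, clk_lo, node = 0x80, 0x00, 0x000000000000  # variant bits + zeros
    else:
        clk_hi, clk_lo, node = 0xBF, 0xFF, 0xFFFFFFFFFFFF  # variant bits + ones
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clk_hi, clk_lo, node))

start = timeuuid_bound(datetime(2011, 1, 5, 14, 0), lowest=True)
end = timeuuid_bound(datetime(2011, 1, 5, 15, 0), lowest=False)
# start.bytes / end.bytes would be the start/finish of the get_slice predicate
```

Most client libraries of this era (e.g. pycassa) shipped an equivalent convenience, so check yours before rolling your own.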
Bucket Sizes?
● Depends greatly on
● Average size of time slice queried
● Average data point size
● Write rate of data points to a stream
● IO capacity of the nodes
So... Bucket Sizes?
● No bigger than a few GB per row
● bucket_size * write_rate * sizeof(avg_data_point)
● Bucket size >= average size of time slice queried
● No more than maybe 10M entries per row
● No more than a month if you have lots of different streams
● NB: there are exceptions to all of the above, which are really nothing more than guidelines
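The bucket_size * write_rate * sizeof(avg_data_point) rule is easy to check with back-of-envelope arithmetic. The rates below are assumed for illustration (one point every 30 seconds, ~300 bytes serialized, month buckets), not numbers from the talk:

```python
# Back-of-envelope row size for a month bucket.
seconds_per_month = 30 * 24 * 3600   # ~2.6M seconds
write_interval = 30                  # assumed: one data point every 30 seconds
avg_point_bytes = 300                # assumed serialized size per point

points_per_row = seconds_per_month // write_interval
row_bytes = points_per_row * avg_point_bytes

print(points_per_row)     # 86400 columns in the row
print(row_bytes / 1e6)    # ~25.9 MB -- well under the few-GB guideline
```

At these rates a month bucket is comfortably small; a stream written every few milliseconds would push the same arithmetic toward day or hour buckets.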
Ordering
● In cases where the most recent data is the most interesting (e.g. last N events for entity foo or last hour of events for entity bar), you can reverse the comparator (i.e. sort descending instead of ascending)
● http://thelastpickle.com/2011/10/03/Reverse-Comparators/
● https://issues.apache.org/jira/browse/CASSANDRA-2355
Spanning Buckets
● If your time slice spans buckets, you'll need to construct all the row keys in question (i.e. number of unique row keys = spans+1)
● If you want all the results between the dates, pass all the row keys to multiget_slice with the start and end of the desired time slice
● If you only want the first N results within your time slice, lowest latency comes from multiget_slice as above but best efficiency comes from serially paging one row key at a time until your desired count is reached
Expiring Streams (e.g. “I only care about the past year”)
● Just set the TTL to the age you want to keep
● yeah, that's pretty much it ...
Counters
● Sometimes you're only interested in counting things that happened within some time slice
● Minor adaptation to the previous content to use counters (be aware they are not idempotent)
● Column names become buckets
● Values become counters
Example: Counting User Logins
Row Key: user3:system5:logins:by_day (abbreviated U3:S5:L:D)

  20110107 → 2   (2 logins on Jan 7th 2011 for user3 on system5)
  20110523 → 7   (7 logins on May 23rd 2011 for user3 on system5)

Row Key: user3:system5:logins:by_hour (abbreviated U3:S5:L:H)

  2011010710 → 1   (1 login for user3 on system5 on Jan 7th 2011, 10th hour)
  2011052316 → 2   (2 logins for user3 on system5 on May 23rd 2011, 16th hour)
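The bucketed column names in the example above are just fixed-width date strings, so both granularities can be derived from the same login timestamp. A minimal sketch (helper names are mine; the counter-increment calls mentioned in the comment are illustrative, pycassa-style):

```python
from datetime import datetime

def day_bucket(ts):
    return ts.strftime("%Y%m%d")      # e.g. 20110107

def hour_bucket(ts):
    return ts.strftime("%Y%m%d%H")    # e.g. 2011010710

login = datetime(2011, 1, 7, 10, 42)
# On each login you would increment both counter columns, e.g.
#   add('user3:system5:logins:by_day',  day_bucket(login))
#   add('user3:system5:logins:by_hour', hour_bucket(login))
print(day_bucket(login), hour_bucket(login))  # 20110107 2011010710
```

Fixed-width names also sort correctly under an ASCII/UTF8 comparator, so time-range slices over the buckets work as expected.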
Eventually Atomic
● In a legacy RDBMS atomicity is “easy”
● Attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)
● However, consistency is important and can certainly be achieved in C*
● Many approaches / alternatives
● I like a transaction log approach, especially in the context of C*
Transaction Logs (in this context)
● Records what is going to be performed before it is actually performed
● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense which is usually what people mean when they say isolation)
● Marks that the actions were performed
In Cassandra
● Serialize all actions that need to be performed into a single column
– JSON, XML, YAML (yuck!), pickle, msgpack, protobuf, et cetera
● Row Key = randomly chosen C* node token
● Column Name = TimeUUID(nowish)
● Perform the actions
● Delete the column
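The write-ahead step above can be sketched as follows. This is a minimal illustration (the function name and action schema are assumptions; JSON is just one of the serializations the slide lists): all mutations for one atomic unit go into a single column value, keyed by TimeUUID(nowish).

```python
import json
import uuid

def make_xact_log_entry(actions):
    # One column per atomic unit of work: name is TimeUUID(nowish),
    # value is every action serialized together.
    column_name = uuid.uuid1()       # version-1 UUID = TimeUUID
    column_value = json.dumps(actions)
    return column_name, column_value

actions = [
    {"cf": "users",    "key": "user3", "set": {"credits": 10}},
    {"cf": "receipts", "key": "r789",  "set": {"user": "user3"}},
]
name, value = make_xact_log_entry(actions)
# 1) write (name -> value) to XACT_LOG under a randomly chosen node token
# 2) perform each action (the actions must be idempotent writes)
# 3) delete the column
```

Because the whole unit lives in one column, the single-column write to XACT_LOG is itself atomic, which is what makes the scheme sound.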
Configuration Details
● Short gc_grace_seconds on the XACT_LOG Column Family (e.g. 5 minutes)
● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability
● if it fails with an UnavailableException, pick a different node token and/or node and try again (gives the same semantics as a relational DB in terms of knowing the state of your transaction)
Failures
● Before insert into the XACT_LOG
● After insert, before actions
● After insert, in middle of actions
● After insert, after actions, before delete
● After insert, after actions, after delete
Recovery
● Each C* node has a cron job offset from every other by some time period
● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period (the “recovery period”)
● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working)
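The selection step of the recovery job is simple age filtering. A sketch, assuming the scan has already returned (timestamp, serialized_actions) pairs and a hypothetical 5-minute recovery period matching the short gc_grace_seconds above:

```python
import time

RECOVERY_PERIOD = 300  # seconds; assumed, matches the 5-minute gc_grace example

def needs_replay(columns, now=None):
    # `columns` is a list of (timestamp_seconds, serialized_actions) pairs
    # from the XACT_LOG scan; keep only entries old enough to be
    # considered stuck and in need of replay.
    now = time.time() if now is None else now
    return [c for c in columns if now - c[0] > RECOVERY_PERIOD]

cols = [(1000.0, "xact-a"), (1290.0, "xact-b")]
print(needs_replay(cols, now=1400.0))  # [(1000.0, 'xact-a')]
```

Entries younger than the recovery period are left alone since a live client may still be mid-flight on them.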
XACT_LOG Comments
● Idempotent writes are awesome (that's why this works so well)
● Doesn't work so well for counters (they're not idempotent)
● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)
Q?