Data Modeling with Cassandra
DESCRIPTION
Data modeling doesn't have to be difficult. This talk walks through different CQL data model examples.
TRANSCRIPT
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
Data Modeling with Cassandra
Patricia Gorla @patriciagorla
Cassandra Consultant
About The Last Pickle
Work with clients to deliver and improve Apache Cassandra based solutions. Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid Committer. Based in New Zealand & USA.
A Few Notes about Cassandra
Open sourced in 2008 by Facebook
A lot has changed since then…
See issues.apache.org/jira/browse/CASSANDRA
Cassandra is…
• Distributed
[Diagram: the keys 'foo' and 'bar', each stored on several nodes]
Data distributed by hash
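A minimal sketch of hash-based placement, to make the diagram concrete. It is a toy model, not Cassandra's implementation: the node names, the replication factor of 3, and the use of MD5 with a simple modulo are all assumptions for illustration (Cassandra itself assigns token ranges to nodes and uses its own partitioner).

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5", "node6"]  # hypothetical node names
REPLICATION_FACTOR = 3  # assumed number of copies per key


def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Return the nodes that would hold `key` in this toy ring."""
    # Hash the key to a stable integer (MD5 is used here only for illustration).
    token = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    start = token % len(nodes)
    # Walk the ring from the owning node to pick the remaining rf - 1 replicas.
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


for key in ("foo", "bar"):
    print(key, "->", replicas_for(key))
```

The same key always hashes to the same nodes, so any client can locate the data without a central lookup, and each key lives on several nodes at once.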
Availability through Redundancy
[Figure: world map divided into regions]
Geolocated datacenters
Cassandra is…
• Distributed
• Eventually Consistent
Read Repair, Maintenance Repair
Consistency Level
QUORUM, ONE, ALL, ANY
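A small sketch of how the consistency levels trade off, assuming a replication factor of 3. The function names are hypothetical. ANY is left out because it applies to writes only and can be satisfied by a stored hint rather than a replica. The rule illustrated is the usual one: a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Replicas that must acknowledge a request at a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


rf = 3
print(read_sees_latest_write("QUORUM", "QUORUM", rf))  # True: 2 + 2 > 3
print(read_sees_latest_write("ONE", "ONE", rf))        # False: 1 + 1 <= 3
```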
Cassandra is…
• Distributed
• Eventually Consistent
• Fast
See http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster
2.1 - 190,000 wps
2.0 - 105,000 wps
Note: Reads can be tuned through data model and JVM
Cassandra is…
• Distributed
• Eventually Consistent
• Fast
• Familiar
CQL - Cassandra Query Language
CREATE TABLE IF NOT EXISTS foo ( bar text, baz text, PRIMARY KEY (bar));
INSERT INTO foo (bar, baz) VALUES ('one', 'two');
SELECT * FROM foo;
cqlsh - CLI tool
Cassandra is…
• Distributed
• Eventually Consistent
• Fast
• Familiar
• Popular
Drivers
Datastax: C#, Java, C++, Python, Node.js*, Ruby*
.NET/C#: Cassandra Sharp, Aquiles, … Cassandra
Apache Spark: Datastax Spark Connector
C++: libQTCassandra
Clojure: CLJ-Hector, Cassaforte, Alia
Erlang: CQerl
Go: Gossie, GoCQL, CQLc
Haskell: Cassy
Java: Astyanax, Hector, Achilles
Node.js: Helenus, Node-Cassandra-CQL
ODBC: Simba ODBC
Perl: Cassandra::Simple, Perlcassa
PHP: CQL PHP, CQLSI, php-cassandra
Python: Datastax Python, Pycassa
R: R Cassandra
Ruby: Fauna, CQL Ruby, CQLEngine
Rust: Rust-CQL
Scala: Cascal
Storm: Storm-Cassandra
For full list, see http://planetcassandra.org/client-drivers-tools/
The Hard Part (Data Modeling)
No JOINs, Denormalize
Duplicate the Data
Identify Usage
Case Study: City BikeShare
Bikes, Customers, Stations, Trips
© Noah Berger, Flickr
CREATE KEYSPACE bikeshare WITH replication = { 'class': 'NetworkTopologyStrategy', 'datacenter1': 3 };
USE bikeshare;
Note: The replication factor (RF) can be altered after the keyspace has been created.
- List the properties of the bike.
CREATE TABLE IF NOT EXISTS bike (
  bike_id text,
  properties map<text, text>,
  is_damaged boolean,
  is_checked_out boolean,
  latitude double,
  longitude double,
  PRIMARY KEY (bike_id)
);
See www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html for all data types
INSERT INTO bike ( bike_id, properties, is_damaged, is_checked_out, latitude, longitude ) VALUES ( 'bike1', {'serial_number' : 'GS-00143', 'type' : 'road bike'}, False, False, 37.7648, 122.4200);
SELECT * FROM bike;

 bike_id | is_checked_out | is_damaged | latitude | longitude | properties
---------+----------------+------------+----------+-----------+-----------------------------------------------------
   bike3 |          False |       True |   37.793 |     122.4 | {'serial_number': 'GS-70159', 'type': 'fixed gear'}
   bike2 |           True |      False |   37.786 |     122.4 | {'serial_number': 'GS-79366', 'type': 'road bike'}
   bike1 |          False |      False |   37.765 |    122.42 | {'serial_number': 'GS-00143', 'type': 'road bike'}

(3 rows)
UPDATE bike SET properties['color'] = 'royal blue' WHERE bike_id = 'bike1';
SELECT properties FROM bike WHERE bike_id = 'bike1';

 properties
---------------------------------------------------------------------------
 {'color': 'royal blue', 'serial_number': 'GS-00143', 'type': 'road bike'}

(1 rows)
DELETE properties['color'] FROM bike WHERE bike_id = 'bike1';
SELECT properties FROM bike WHERE bike_id = 'bike1';

 properties
----------------------------------------------------
 {'serial_number': 'GS-00143', 'type': 'road bike'}

(1 rows)
- List the properties of the bike.
- Verify whether the bike can be checked out.
UPDATE bike SET is_checked_out = True WHERE bike_id = 'bike1' IF is_checked_out = False;
Set conditional statement

Result when the bike was not checked out:

 [applied]
-----------
      True

Result when the bike was already checked out:

 [applied] | is_checked_out
-----------+----------------
     False |           True
See www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
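The two results of the conditional UPDATE can be mimicked with a toy compare-and-set. This is only a single-process sketch of the semantics: the `bikes` dictionary and `check_out` function are hypothetical stand-ins, and nothing here reflects how Cassandra coordinates the check across replicas.

```python
# Hypothetical in-memory stand-in for the bike table.
bikes = {"bike1": {"is_checked_out": False}}


def check_out(bike_id):
    """Mimic UPDATE ... IF is_checked_out = False; returns (applied, blocking_value)."""
    row = bikes[bike_id]
    if row["is_checked_out"] is False:
        row["is_checked_out"] = True
        return True, None
    # Condition failed: report the current value that blocked the update.
    return False, row["is_checked_out"]


print(check_out("bike1"))  # (True, None)
print(check_out("bike1"))  # (False, True)
```

The first call applies the change; the second is rejected and reports the current value, matching the shape of the two cqlsh results.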
- Get the customer details.
CREATE TYPE IF NOT EXISTS address ( street_name text, zip text);
CREATE TABLE IF NOT EXISTS customer (
  customer_id text,
  email text,
  name text,
  password text,
  mailing_address address,
  PRIMARY KEY (customer_id)
);
Note: This example uses text fields for simplicity. Passwords should not be stored in plain text.
In Cassandra 2.1 the user-defined type column must be declared frozen:
CREATE TABLE IF NOT EXISTS customer (
  customer_id text,
  email text,
  name text,
  password text,
  mailing_address frozen<address>,
  PRIMARY KEY (customer_id)
);
Limitations
Data is serialised
CASSANDRA-7857
CASSANDRA-7423 - Freezing UDT
- Query individual subfields
INSERT INTO customer ( customer_id, email, name, password, mailing_address) VALUES ( 'customer1', '[email protected]', 'Paul Van Haver', 'p@ssw0rd1', {street_name: 'Capp Street', zip: '94110'});
SELECT mailing_address.street_name FROM customer WHERE customer_id = 'customer2';

 mailing_address.street_name
-----------------------------
               Bryant Street

(1 rows)
- List the available bikes at a station.
CREATE TABLE IF NOT EXISTS station ( station_name text, latitude double, longitude double, PRIMARY KEY (station_name));
CREATE TABLE IF NOT EXISTS bikes_at_stations_count ( station_name text, bikes_available counter, PRIMARY KEY (station_name));
All counters start at 0
Only increment, decrement
UPDATE bikes_at_stations_count SET bikes_available = bikes_available + 1 WHERE station_name = '16th & Mission';
2.1 - Creates a local lock
See www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
UPDATE bikes_at_stations_count SET bikes_available = bikes_available + 1 WHERE station_name = '16th & Mission';
SELECT * FROM bikes_at_stations_count WHERE station_name = '16th & Mission';

 station_name   | bikes_available
----------------+-----------------
 16th & Mission |               2

(1 rows)
- List all trips a bike has been on.
CREATE TABLE IF NOT EXISTS BikeTrips ( bike_id text, trip_id text, PRIMARY KEY (bike_id, trip_id));
Flaw: All trips for a bike will be stored in the same partition (the row will grow unbounded)
Two components of a primary key
PRIMARY KEY ((a, b, …), c, …)
Partition Key: Where the row will be physically located
Clustering Key: How the columns will be ordered on disk
Single PK
CREATE TABLE IF NOT EXISTS user ( first_name text, last_login timestamp, PRIMARY KEY (first_name));
Each row is on a separate partition
Can be uniquely identified

Compound PK
CREATE TABLE IF NOT EXISTS user ( first_name text, last_login timestamp, PRIMARY KEY (first_name, last_login)) WITH CLUSTERING ORDER BY (last_login DESC);
Columns are ordered by logins
Most recent logins will be at the top

Composite PK
CREATE TABLE IF NOT EXISTS user ( first_name text, last_name text, last_login timestamp, PRIMARY KEY ((first_name, last_name), last_login)) WITH CLUSTERING ORDER BY (last_login DESC);
Data is bucketed by composite
Row width will be limited
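One way to picture the two key components is a dictionary of sorted lists: the partition key picks the dictionary entry, and the clustering key decides the order inside it. The sketch below models the composite-key version of the user table; it is a mental model only, and the sample names and timestamps are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# partition key -> rows; each list is kept sorted by the clustering key
partitions = defaultdict(list)


def insert_login(first_name, last_name, last_login):
    """Store a login under the composite partition key (first_name, last_name)."""
    rows = partitions[(first_name, last_name)]
    rows.append(last_login)
    # CLUSTERING ORDER BY (last_login DESC): newest first within the partition.
    rows.sort(reverse=True)


# Hypothetical sample data.
insert_login("Ada", "Lovelace", datetime(2014, 8, 1, 9, 0))
insert_login("Ada", "Lovelace", datetime(2014, 8, 10, 6, 7))
insert_login("Ada", "Byron", datetime(2014, 8, 5, 12, 0))

print(partitions[("Ada", "Lovelace")][0])  # most recent login in this partition
print(len(partitions))                     # number of distinct partitions
```

Adding last_name to the partition key splits what would have been one "Ada" partition into two smaller ones, which is the sense in which row width is limited.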
Solution: Create artificial bucket
CREATE TABLE IF NOT EXISTS BikeTrips ( bike_id text, bucket int, trip_id text, PRIMARY KEY ((bike_id, bucket), trip_id));
Must specify all parts on SELECT
SELECT * FROM BikeTrips WHERE bike_id = 'bike1' AND bucket = 0;
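The slides do not say how the bucket value is chosen. One common option, assumed here purely for illustration, is to derive it from the trip's start time so that each bike gets one partition per calendar month. The function names are hypothetical.

```python
from datetime import datetime


def bucket_for(started_at):
    """Map a trip start time to an integer bucket, one per calendar month (assumed scheme)."""
    return started_at.year * 100 + started_at.month


def partition_key(bike_id, started_at):
    """Composite partition key (bike_id, bucket) for a row in BikeTrips."""
    return (bike_id, bucket_for(started_at))


print(partition_key("bike1", datetime(2014, 8, 10, 6, 7, 55)))  # ('bike1', 201408)
```

Whatever scheme is used, the reader has to be able to recompute the bucket, because both parts of the partition key must be supplied on SELECT.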
- List all trips a bike has been on.
- List all trips a customer has taken.
CREATE TABLE IF NOT EXISTS CustomerTrips ( customer_id text, trip_id text, PRIMARY KEY (customer_id, trip_id));
Rows will not be as wide as BikeTrips
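BikeTrips and CustomerTrips hold the same trips under two different keys, which is the "duplicate the data" rule in practice. A rough sketch of the write path, using in-memory dictionaries as stand-ins for the two tables; the function name and the sample IDs are hypothetical.

```python
from collections import defaultdict

# One "table" per query, mirroring the access patterns identified up front.
trips_by_bike = defaultdict(list)      # which trips has a bike been on?
trips_by_customer = defaultdict(list)  # which trips has a customer taken?


def record_trip(trip_id, bike_id, customer_id):
    """Write the same fact twice so each query is a single key lookup."""
    trips_by_bike[bike_id].append(trip_id)
    trips_by_customer[customer_id].append(trip_id)


record_trip("trip1", "bike15", "customer3")
record_trip("trip2", "bike15", "customer7")

print(trips_by_bike["bike15"])         # ['trip1', 'trip2']
print(trips_by_customer["customer3"])  # ['trip1']
```

The cost is an extra write per trip; the benefit is that neither query needs a JOIN or a scan.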
- List all trips a bike has been on.
- List all trips a customer has taken.
- Show details of a particular trip (duration, distance traveled).
CREATE TABLE IF NOT EXISTS trip (
  trip_id text,
  customer_id text static,
  bike_id text static,
  started_at timestamp static,
  stopped_at timestamp static,
  sequence timestamp,
  latitude decimal,
  longitude decimal,
  delta_distance double,
  PRIMARY KEY (trip_id, sequence)
) WITH CLUSTERING ORDER BY (sequence DESC);

SELECT * FROM trip WHERE trip_id = 'trip1';

 trip_id | sequence                 | bike_id | customer_id | started_at               | stopped_at               | delta_distance | latitude    | longitude
---------+--------------------------+---------+-------------+--------------------------+--------------------------+----------------+-------------+-----------
   trip1 | 2014-08-10 06:10:05+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         8.7951 | -122.405319 | 37.796936
   trip1 | 2014-08-10 06:10:00+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         15.381 | -122.403347 | 37.795535
   trip1 | 2014-08-10 06:09:55+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |              0 | -122.403347 | 37.795535
   trip1 | 2014-08-10 06:09:50+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         10.557 | -122.401702 | 37.795731
   trip1 | 2014-08-10 06:09:45+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |              0 | -122.401702 | 37.795731
   trip1 | 2014-08-10 06:09:40+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         35.282 | -122.400589 | 37.790268
 ...
   trip1 | 2014-08-10 06:08:45+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         6.1672 | -122.414782 | 37.771255
   trip1 | 2014-08-10 06:08:40+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         2.6682 | -122.415047 | 37.770929
   trip1 | 2014-08-10 06:08:35+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         2.9604 | -122.415287 | 37.770529
   trip1 | 2014-08-10 06:08:30+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |          2.775 |  -122.41544 | 37.770119
   trip1 | 2014-08-10 06:08:25+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         5.7684 |  -122.41566 | 37.769236
   trip1 | 2014-08-10 06:08:20+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         3.1183 | -122.415669 | 37.768744
   trip1 | 2014-08-10 06:08:15+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         93.217 | -122.414251 | 37.754102
   trip1 | 2014-08-10 06:08:10+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |              0 | -122.414251 | 37.754102
   trip1 | 2014-08-10 06:08:05+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |         31.664 | -122.409291 | 37.754393
   trip1 | 2014-08-10 06:08:00+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |              0 | -122.409291 | 37.754393
   trip1 | 2014-08-10 06:07:55+0100 |  bike15 |   customer3 | 2014-08-10 06:07:55+0100 | 2014-08-10 06:07:55+0100 |        0.54761 | -122.409282 | 37.754307

(27 rows)
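Duration and distance are not stored as columns, so a client would derive them from the rows of a trip. The sketch below is one way to do that, using the three oldest rows of the trip1 output as sample input. The time zone offset is dropped for brevity, the unit of delta_distance is not stated in the slides, and the function name is hypothetical.

```python
from datetime import datetime

# (sequence, delta_distance) pairs, newest first, taken from three rows of the trip1 output.
rows = [
    (datetime(2014, 8, 10, 6, 8, 5), 31.664),
    (datetime(2014, 8, 10, 6, 8, 0), 0.0),
    (datetime(2014, 8, 10, 6, 7, 55), 0.54761),
]


def trip_summary(rows):
    """Return (duration in seconds, total distance) for rows sorted newest first."""
    newest, oldest = rows[0][0], rows[-1][0]
    duration = (newest - oldest).total_seconds()
    distance = sum(delta for _, delta in rows)
    return duration, distance


duration, distance = trip_summary(rows)
print(duration)            # 10.0
print(round(distance, 5))  # 32.21161
```

Because the rows arrive already sorted by the clustering key, the first and last rows give the time span without any client-side sorting.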
SELECT sequence, latitude, longitude FROM trip WHERE trip_id = 'trip1' AND sequence > '2014-08-10 06:09:00+0100';

 sequence                 | latitude    | longitude
--------------------------+-------------+-----------
 2014-08-10 06:10:05+0100 | -122.405319 | 37.796936
 2014-08-10 06:10:00+0100 | -122.403347 | 37.795535
 2014-08-10 06:09:55+0100 | -122.403347 | 37.795535
 2014-08-10 06:09:50+0100 | -122.401702 | 37.795731
 2014-08-10 06:09:45+0100 | -122.401702 | 37.795731
 2014-08-10 06:09:40+0100 | -122.400589 | 37.790268
 2014-08-10 06:09:35+0100 | -122.400589 | 37.790268
 2014-08-10 06:09:30+0100 | -122.400404 | 37.790241
 2014-08-10 06:09:25+0100 | -122.400359 | 37.790128
 2014-08-10 06:09:20+0100 | -122.400359 | 37.790128
 2014-08-10 06:09:15+0100 | -122.408092 | 37.784008
 2014-08-10 06:09:10+0100 | -122.408092 | 37.784008
 2014-08-10 06:09:05+0100 | -122.403416 | 37.780284
Use comparator for data type
Recap
• There is hope
• Identify usage
• Be mindful of storage engine
Patricia Gorla @patriciagorla
www.thelastpickle.com
Q&A