cassandra scalable data modelling by example
TRANSCRIPT
Scalable data modelling by example
Carlos Alonso (@calonso)
Carlos Alonso
• Ex-Spanish Londoner
• MSc Salamanca University, Spain
• Software Engineer @MyDrive Solutions
• Cassandra certified developer
• Cassandra MVP 2015
• @calonso / http://mrcalonso.com
MyDrive Solutions
❖ World leading driver profiling company
❖ Using technology and data to understand how
to improve driving behaviour
❖ Recently acquired by the Generali Group
❖ @_MyDrive / http://mydrivesolutions.com
❖ We are hiring!!
Cassandra• Open Source distributed database
management system
• Initially developed at Facebook
• Inspired by Amazon’s Dynamo and Google BigTable papers
• Became Apache top-level project in Feb, 2010
• Nowadays developed by DataStax
The data model is the only thing you can’t change once in production.
Cassandra Concepts
Cassandra• Fast Distributed NoSQL Database
• High Availability
• Linear Scalability => Predictability
• No SPOF
• Multi-DC
• Horizontally scalable => $$$
• Not a drop in replacement for RDBMS
The CAP Theorem
Cluster Components
Cluster Components
Consistent Hashing
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234
Replication factor
❖ How many copies (replicas) for your data
Consistency LevelHow many replicas of your data
must respond ok?
Availability /Partition tolerance Consistency
Request Coordination
❖ Coordinator: The node designed to coordinate a particular query
❖ ANY node can coordinate ANY request.
❖ No SPOF: One of Cassandra’s ground principles
❖ The driver chooses which node will coordinate.
A complete example
DriverClient
Partitioner
f81d4fae-…834
• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…
Compaction
❖ SizeTieredCompactionStrategy
❖ LeveledCompactionStrategy
❖ DateTieredCompactionStrategy
❖ TimeWindowCompactionStrategy (kudos Jeff Jirsa)
CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT>, ) WITH COMPACTION = { “class”: “<strategy>”, <params> };
Physical Data Layout
Data Modelling
The Primary Key
PARTITION KEY + CLUSTERING
COLUMN(S)
CQL WHERE• Full Primary Key has to be specified.
• Equality matches (only in primary keys or clustering columns):
• SELECT … WHERE album_title = ‘Revolver’;
• IN (only on the last clause):
• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);
• Range search (only on clustering columns):
• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;
Data Modelling
❖ Understand your data
❖ Decide (know) how you’ll query the data
❖ Define column families to satisfy those queries
❖ Implement and optimize
Data Modelling
❖ Understand your data
❖ Decide (know) how you’ll query the data
❖ Define column families to satisfy those
queries
❖ Implement and optimize
Conceptual Model
Logical Model
Physical Model
Query-DrivenMethodology
Analysis &Validation
Query Driven Methodology❖ Spread data evenly around the cluster
❖ Minimise the number of partitions read
❖ Follow the mapping rules:
❖ Entities and relationships: map to tables
❖ Equality search attributes: must be at the beginning of the primary key
❖ Inequality search attributes: become clustering columns
❖ Ordering attributes: become clustering columns
❖ Key attributes: map to primary key columns
Analysis & Validation
❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
An E-Library project.
Requirement: 1Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.
QDMBook
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (5 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
N/A
N/A
Physical data model
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR );
SELECT * FROM books WHERE ISBN = ‘…’;
Requirement 2Users register into the system uniquely identified by an email and a password. We also want their full name.
They will be accessed by email and password or internal unique ID.
QDM K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Users_by_login_info
email K
password
Q2, Q3
Q2: Find users by login info
C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (2 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
N/A
N/A
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Physical Data Model
CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR );
SELECT * FROM users_by_id WHERE ID = …;
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (4 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table? N/A
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
Physical Data ModelCREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) );
SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’];
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
Physical Data Model
BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;
Requirement 3Users read books.
We want to know which books has a user read.
QDMK
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
Q1: Find all books a loggeduser has read
S
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user
❖ How much data duplication? (batches)
❖ Client side joins or new table?
Q1: Find all books a loggeduser has read
New table
N/A
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
Physical Data ModelCREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) );
SELECT * FROM books_read_by_user WHERE user_ID = …;
Q1: Find all books a loggeduser has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
Requirement 4In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have
with our site.
QDMuser_ID
2_weeks
time
element
type
K
Actions_by_user
Q1
Q1: Find all actions a user does in a time range
K
C
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks
❖ How much data duplication? (batches)
❖ Client side joins or new table?
N/A
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user does in a time range
K
time
element
type
C
N/A
Physical Data ModelCREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );
SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND time < … AND time > …;
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user does in a time range
K
time
element
type
C
Further validation
∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv
CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );
Next Steps
❖ Test your models against your hardware setup
❖ cassandra-stress
❖ http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
❖ Monitor everything
❖ DataStax OpsCenter
Thanks!
Carlos AlonsoSoftware Engineer
@calonso