cassandra scalable data modelling by example

Scalable data modelling by example

Carlos Alonso (@calonso)

Carlos Alonso

• Ex-Spanish Londoner

• MSc Salamanca University, Spain

• Software Engineer @MyDrive Solutions

• Cassandra certified developer

• Cassandra MVP 2015

• @calonso / http://mrcalonso.com

http://mrcalonso.com

MyDrive Solutions

❖ World leading driver profiling company

❖ Using technology and data to understand how

to improve driving behaviour

❖ Recently acquired by the Generali Group

❖ @_MyDrive / http://mydrivesolutions.com

❖ We are hiring!!

http://mydrivesolutions.com

Cassandra• Open Source distributed database

management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became Apache top-level project in Feb, 2010

• Nowadays developed by DataStax

The data model is the only thing you can’t change once in production.

Cassandra Concepts

Cassandra• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop in replacement for RDBMS

The CAP Theorem

Cluster Components

Consistent Hashing

Hash function

“Carlos” 185664

1773456738847666528349

-894763734895827651234

Replication factor

❖ How many copies (replicas) for your data

Consistency LevelHow many replicas of your data

must respond ok?

Availability /Partition tolerance Consistency

Request Coordination

❖ Coordinator: The node designed to coordinate a particular query

❖ ANY node can coordinate ANY request.

❖ No SPOF: One of Cassandra’s ground principles

❖ The driver chooses which node will coordinate.

A complete example

DriverClient

Partitioner

f81d4fae-…834

• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…

Compaction

❖ SizeTieredCompactionStrategy

❖ LeveledCompactionStrategy

❖ DateTieredCompactionStrategy

❖ TimeWindowCompactionStrategy (kudos Jeff Jirsa)

CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT>, ) WITH COMPACTION = { “class”: “<strategy>”, <params> };

Physical Data Layout

Data Modelling

The Primary Key

PARTITION KEY + CLUSTERING

COLUMN(S)

CQL WHERE• Full Primary Key has to be specified.

• Equality matches (only in primary keys or clustering columns):

• SELECT … WHERE album_title = ‘Revolver’;

• IN (only on the last clause):

• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);

• Range search (only on clustering columns):

• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;

Data Modelling

❖ Understand your data

❖ Decide (know) how you’ll query the data

❖ Define column families to satisfy those queries

❖ Implement and optimize

Data Modelling

❖ Understand your data

❖ Decide (know) how you’ll query the data

❖ Define column families to satisfy those

queries

❖ Implement and optimize

Conceptual Model

Logical Model

Physical Model

Query-DrivenMethodology

Analysis &Validation

Query Driven Methodology❖ Spread data evenly around the cluster

❖ Minimise the number of partitions read

❖ Follow the mapping rules:

❖ Entities and relationships: map to tables

❖ Equality search attributes: must be at the beginning of the primary key

❖ Inequality search attributes: become clustering columns

❖ Ordering attributes: become clustering columns

❖ Key attributes: map to primary key columns

Analysis & Validation

❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ How much data duplication? (batches)

❖ Client side joins or new table?

An E-Library project.

Requirement: 1Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.

QDMBook

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBN

Analysis & Validation❖ 1 Partition per read?




❖ 1 X (5 - 1 - 0) + 0 < 1M



Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBN

N/A

N/A

Physical data model

Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR );

SELECT * FROM books WHERE ISBN = ‘…’;

Requirement 2Users register into the system uniquely identified by an email and a password. We also want their full name.

They will be accessed by email and password or internal unique ID.

QDM K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Users_by_login_info

email K

password

Q2, Q3

Q2: Find users by login info

C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)





❖ 1 X (2 - 1 - 0) + 0 < 1M



N/A

N/A

K

Users_by_ID

ID

full_name

Q1


Physical Data Model

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR );

SELECT * FROM users_by_id WHERE ID = …;

K

Users_by_ID

ID

full_name

Q1






❖ 1 X (4 - 1 - 0) + 0 < 1M


❖ Client side joins or new table? N/A


Users_by_login_info

email K

password C

full_name

ID


Physical Data ModelCREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) );

SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’];


Users_by_login_info

email K

password C

full_name

ID


Physical Data Model

BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;

Requirement 3Users read books.

We want to know which books has a user read.

QDMK

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

Q1: Find all books a loggeduser has read

S





❖ Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user




New table

N/A

K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Physical Data ModelCREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) );

SELECT * FROM books_read_by_user WHERE user_ID = …;


K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Requirement 4In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have

with our site.

QDMuser_ID

2_weeks

time

element

type

K

Actions_by_user

Q1

Q1: Find all actions a user does in a time range

K

C





❖ Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks



N/A

K

Actions_by_user

user_ID

2_weeks

Q1


K

time

element

type

C

N/A

Physical Data ModelCREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );

SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND time < … AND time > …;

K

Actions_by_user

user_ID

2_weeks

Q1


K

time

element

type

C

Further validation

∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv

CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );

Next Steps

❖ Test your models against your hardware setup

❖ cassandra-stress

❖ http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)

❖ Monitor everything

❖ DataStax OpsCenter

http://www.sestevez.com/sestevez/CassandraDataModeler/

Thanks!

Carlos AlonsoSoftware Engineer

@calonso

cassandra scalable data modelling by example

Engineering