cassandra scalable data modelling by example

46
Scalable data modelling by example Carlos Alonso (@calonso)

Upload: carlos-alonso-perez

Post on 13-Jan-2017

494 views

Category:

Engineering


7 download

TRANSCRIPT

Page 1: Cassandra Scalable data modelling by example

Scalable data modelling by example

Carlos Alonso (@calonso)

Page 2: Cassandra Scalable data modelling by example

Carlos Alonso

• Ex-Spanish Londoner

• MSc Salamanca University, Spain

• Software Engineer @MyDrive Solutions

• Cassandra certified developer

• Cassandra MVP 2015

• @calonso / http://mrcalonso.com

Page 3: Cassandra Scalable data modelling by example

MyDrive Solutions

❖ World leading driver profiling company

❖ Using technology and data to understand how

to improve driving behaviour

❖ Recently acquired by the Generali Group

❖ @_MyDrive / http://mydrivesolutions.com

❖ We are hiring!!

Page 4: Cassandra Scalable data modelling by example

Cassandra• Open Source distributed database

management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became Apache top-level project in Feb, 2010

• Nowadays developed by DataStax

Page 5: Cassandra Scalable data modelling by example

The data model is the only thing you can’t change once in production.

Page 6: Cassandra Scalable data modelling by example

Cassandra Concepts

Page 7: Cassandra Scalable data modelling by example

Cassandra• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop in replacement for RDBMS

Page 8: Cassandra Scalable data modelling by example

The CAP Theorem

Page 9: Cassandra Scalable data modelling by example

Cluster Components

Page 10: Cassandra Scalable data modelling by example

Cluster Components

Page 11: Cassandra Scalable data modelling by example

Consistent Hashing

Hash function

“Carlos” 185664

1773456738847666528349

-894763734895827651234

Page 12: Cassandra Scalable data modelling by example

Replication factor

❖ How many copies (replicas) for your data

Page 13: Cassandra Scalable data modelling by example

Consistency LevelHow many replicas of your data

must respond ok?

Page 14: Cassandra Scalable data modelling by example

Availability /Partition tolerance Consistency

Page 15: Cassandra Scalable data modelling by example

Request Coordination

❖ Coordinator: The node designed to coordinate a particular query

❖ ANY node can coordinate ANY request.

❖ No SPOF: One of Cassandra’s ground principles

❖ The driver chooses which node will coordinate.

Page 16: Cassandra Scalable data modelling by example

A complete example

DriverClient

Partitioner

f81d4fae-…834

• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…

Page 17: Cassandra Scalable data modelling by example
Page 18: Cassandra Scalable data modelling by example

Compaction

❖ SizeTieredCompactionStrategy

❖ LeveledCompactionStrategy

❖ DateTieredCompactionStrategy

❖ TimeWindowCompactionStrategy (kudos Jeff Jirsa)

CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT>, ) WITH COMPACTION = { “class”: “<strategy>”, <params> };

Page 19: Cassandra Scalable data modelling by example
Page 20: Cassandra Scalable data modelling by example

Physical Data Layout

Page 21: Cassandra Scalable data modelling by example

Data Modelling

Page 22: Cassandra Scalable data modelling by example

The Primary Key

PARTITION KEY + CLUSTERING

COLUMN(S)

Page 23: Cassandra Scalable data modelling by example

CQL WHERE• Full Primary Key has to be specified.

• Equality matches (only in primary keys or clustering columns):

• SELECT … WHERE album_title = ‘Revolver’;

• IN (only on the last clause):

• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);

• Range search (only on clustering columns):

• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;

Page 24: Cassandra Scalable data modelling by example

Data Modelling

❖ Understand your data

❖ Decide (know) how you’ll query the data

❖ Define column families to satisfy those queries

❖ Implement and optimize

Page 25: Cassandra Scalable data modelling by example

Data Modelling

❖ Understand your data

❖ Decide (know) how you’ll query the data

❖ Define column families to satisfy those

queries

❖ Implement and optimize

Conceptual Model

Logical Model

Physical Model

Query-DrivenMethodology

Analysis &Validation

Page 26: Cassandra Scalable data modelling by example

Query Driven Methodology❖ Spread data evenly around the cluster

❖ Minimise the number of partitions read

❖ Follow the mapping rules:

❖ Entities and relationships: map to tables

❖ Equality search attributes: must be at the beginning of the primary key

❖ Inequality search attributes: become clustering columns

❖ Ordering attributes: become clustering columns

❖ Key attributes: map to primary key columns

Page 27: Cassandra Scalable data modelling by example

Analysis & Validation

❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ How much data duplication? (batches)

❖ Client side joins or new table?

Page 28: Cassandra Scalable data modelling by example

An E-Library project.

Page 29: Cassandra Scalable data modelling by example

Requirement: 1Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.

QDMBook

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBN

Page 30: Cassandra Scalable data modelling by example

Analysis & Validation❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ 1 X (5 - 1 - 0) + 0 < 1M

❖ How much data duplication? (batches)

❖ Client side joins or new table?

Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBN

N/A

N/A

Page 31: Cassandra Scalable data modelling by example

Physical data model

Book

ISBN K

Title

Author

Genre

Publisher

Q1

Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR );

SELECT * FROM books WHERE ISBN = ‘…’;

Page 32: Cassandra Scalable data modelling by example

Requirement 2Users register into the system uniquely identified by an email and a password. We also want their full name.

They will be accessed by email and password or internal unique ID.

QDM K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Users_by_login_info

email K

password

Q2, Q3

Q2: Find users by login info

C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)

Page 33: Cassandra Scalable data modelling by example

Analysis & Validation❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ 1 X (2 - 1 - 0) + 0 < 1M

❖ How much data duplication? (batches)

❖ Client side joins or new table?

N/A

N/A

K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Page 34: Cassandra Scalable data modelling by example

Physical Data Model

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR );

SELECT * FROM users_by_id WHERE ID = …;

K

Users_by_ID

ID

full_name

Q1

Q1: Find users by ID

Page 35: Cassandra Scalable data modelling by example

Analysis & Validation❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ 1 X (4 - 1 - 0) + 0 < 1M

❖ How much data duplication? (batches)

❖ Client side joins or new table? N/A

Q2: Find users by login info

Users_by_login_info

email K

password C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)

Page 36: Cassandra Scalable data modelling by example

Physical Data ModelCREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) );

SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’];

Q2: Find users by login info

Users_by_login_info

email K

password C

full_name

ID

Q3: Find users by email (to guarantee uniqueness)

Page 37: Cassandra Scalable data modelling by example

Physical Data Model

BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;

Page 38: Cassandra Scalable data modelling by example

Requirement 3Users read books.

We want to know which books has a user read.

QDMK

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

Q1: Find all books a loggeduser has read

S

Page 39: Cassandra Scalable data modelling by example

Analysis & Validation❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user

❖ How much data duplication? (batches)

❖ Client side joins or new table?

Q1: Find all books a loggeduser has read

New table

N/A

K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Page 40: Cassandra Scalable data modelling by example

Physical Data ModelCREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) );

SELECT * FROM books_read_by_user WHERE user_ID = …;

Q1: Find all books a loggeduser has read

K

Books_read_by_user

user_ID

title

Q1

full_name

ISBN

genre

publisher

author

C

C

S

Page 41: Cassandra Scalable data modelling by example

Requirement 4In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have

with our site.

QDMuser_ID

2_weeks

time

element

type

K

Actions_by_user

Q1

Q1: Find all actions a user does in a time range

K

C

Page 42: Cassandra Scalable data modelling by example

Analysis & Validation❖ 1 Partition per read?

❖ Are write conflicts (overwrites) possible?

❖ How large are partitions?

❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

❖ Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks

❖ How much data duplication? (batches)

❖ Client side joins or new table?

N/A

K

Actions_by_user

user_ID

2_weeks

Q1

Q1: Find all actions a user does in a time range

K

time

element

type

C

N/A

Page 43: Cassandra Scalable data modelling by example

Physical Data ModelCREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );

SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND time < … AND time > …;

K

Actions_by_user

user_ID

2_weeks

Q1

Q1: Find all actions a user does in a time range

K

time

element

type

C

Page 44: Cassandra Scalable data modelling by example

Further validation

∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv

CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );

Page 45: Cassandra Scalable data modelling by example

Next Steps

❖ Test your models against your hardware setup

❖ cassandra-stress

❖ http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)

❖ Monitor everything

❖ DataStax OpsCenter

Page 46: Cassandra Scalable data modelling by example

Thanks!

Carlos AlonsoSoftware Engineer

@calonso