Rigorous Cassandra Data Modeling
for the Relational Data Architect
Artem Chebotko
1 Cassandra Data and Query Models
2 Rigorous Data Modeling
3 Data Modeling Example
4 From Relational to Cassandra
5 Conclusions
2 © 2015. All Rights Reserved.
Tables with Single-Row Partitions
users
username   age   address
Alice      28    Santa Clara, CA
Alex       37    Austin, TX

sensors
id   type         settings                       owner
1    phone        {gps ⇒ on, pedometer ⇒ on}     Alice
2    wristband    {heart rate ⇒ on, …}           Alice
3    thermostat   {temp ⇒ 75, …}                 Alice
4    security     {…}                            Alex
5    phone        {…}                            Alex
Tables with Single-Row Partitions
CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY (username)
);
SELECT * FROM users
WHERE username = ?;
CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY (id)
);
SELECT * FROM sensors
WHERE id = ?;
Tables with Multi-Row Partitions
sensors_by_user (rows within each partition are ordered by id ASC)
username   id   type         settings                age   address
Alice      1    phone        {gps ⇒ on, …}           28    Santa Clara, CA
Alice      2    wristband    {heart rate ⇒ on, …}    28    Santa Clara, CA
Alice      3    thermostat   {temp ⇒ 75, …}          28    Santa Clara, CA
Alex       4    security     …                       37    Austin, TX
Alex       5    phone        …                       37    Austin, TX
Tables with Multi-Row Partitions
CREATE TABLE sensors_by_user (
  username TEXT,
  age INT STATIC,
  address TEXT STATIC,
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  PRIMARY KEY (username, id)
) WITH CLUSTERING ORDER BY (id ASC);
SELECT * FROM sensors_by_user WHERE username = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id > ?
  ORDER BY id DESC;
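The three queries against sensors_by_user can be mimicked with a small in-memory model. This is only a minimal Python sketch, assuming a dict keyed by the partition key whose values keep rows sorted by the clustering key. The `Partition` class and its method names are hypothetical and are not part of any Cassandra driver.

```python
from bisect import bisect_right, insort


class Partition:
    """One multi-row partition: static columns plus rows sorted by clustering key."""

    def __init__(self, **static):
        self.static = static  # shared by every row of the partition (age, address)
        self.ids = []         # clustering key values, kept in ascending order
        self.data = {}        # clustering key value -> regular columns

    def upsert(self, sensor_id, **columns):
        if sensor_id not in self.data:
            insort(self.ids, sensor_id)
        self.data[sensor_id] = columns

    def row(self, sensor_id):
        # Lookup by partition key and clustering key.
        return {"id": sensor_id, **self.data[sensor_id], **self.static}

    def all_rows(self):
        # Lookup by partition key only: rows come back in clustering order.
        return [self.row(i) for i in self.ids]

    def rows_after(self, lower, descending=False):
        # Range search on the clustering key (id > lower), optionally reversed.
        ids = self.ids[bisect_right(self.ids, lower):]
        if descending:
            ids = ids[::-1]
        return [self.row(i) for i in ids]


# The table is a mapping from partition key (username) to partition.
sensors_by_user = {}
alice = sensors_by_user.setdefault(
    "Alice", Partition(age=28, address="Santa Clara, CA")
)
alice.upsert(3, type="thermostat")
alice.upsert(1, type="phone")
alice.upsert(2, type="wristband")
```

Every access starts from the partition key (`sensors_by_user["Alice"]`), which mirrors the requirement that each query above supplies `username`.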
Key Observations
• C* Data Model
– Single-row partitions
– Multi-row partitions
• C* Query Model
– Partition key
– Partition and clustering keys
– Range search and ordering on a clustering key
• Relational Data Model
– Normalized tables
• Relational Query Model
– SQL and relational algebra
– Expressive
– Expensive
2 Rigorous Data Modeling
Rigorous: Definition and Implications
Formal, Well-Defined, Sound
Repeatable, Automatable
Tools, Ease of Use
Wider Adoption
We Need the Methodology!
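[Figure: methodology overview. A Conceptual Data Model and an Application Workflow are the inputs to Mapping, which produces a Logical Data Model; Optimization turns it into a Physical Data Model. The conceptual model is drawn as an ER diagram with entity types User (username, age, address), Sensor (id, type, settings), and Measurement (datetime, parameter, value), and relationships owns, records, and has, the last one n:m with roles user and follower.]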
Methodology Models
Model                        Representation
Conceptual Data Model        ERD
Application Workflow Model   Graph
Logical Data Model           Chebotko Diagram
Physical Data Model          Chebotko Diagram, CQL
[Figure: thumbnails of the representations: a sample CREATE TABLE users statement, the application workflow graph with queries Q1 to Q5, and the logical and physical Chebotko diagrams. Each is shown in full in the Data Modeling Example section.]
Methodology Protocols
• Conceptual-to-logical mapping
– Mapping rules
– Mapping patterns
• Physical optimizations
– Partition size analysis
– Duplication factor analysis
– Keys, aggregation, transactions, …
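Partition size analysis can be approximated with a one-line estimate. The sketch below assumes the formula commonly given in the DataStax methodology materials, N_v = N_r × (N_c − N_pk − N_s) + N_s; the formula is not shown on these slides, and the function name, parameter names, and the one-measurement-per-second rate are hypothetical.

```python
def partition_values(n_rows, n_columns, n_primary_key, n_static):
    """Estimate the number of values stored in one partition.

    n_rows        -- rows in the partition (N_r)
    n_columns     -- columns in the table (N_c)
    n_primary_key -- primary key columns (N_pk)
    n_static      -- static columns (N_s)
    """
    return n_rows * (n_columns - n_primary_key - n_static) + n_static


# sensors_by_user: username, id, type, settings, age (static), address (static).
# Alice owns three sensors in the sample data.
alice_values = partition_values(n_rows=3, n_columns=6, n_primary_key=2, n_static=2)

# measurements_by_sensor with a week bucket in the partition key:
# id, week, parameter, datetime, value. Assumed rate: one measurement per second.
rows_per_week = 7 * 24 * 60 * 60
weekly_values = partition_values(rows_per_week, n_columns=5, n_primary_key=4, n_static=0)
```

An estimate like this is one way to judge whether a partition key needs an extra bucketing column.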
Sample Mapping Pattern
ER fragment: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) 1 : n ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3), related by RT (attr)
ACCESS PATTERN search attributes: key1.1, key1.2
ET2_by_ET1_key
  key1.1 K
  key1.2 K
  key2.1 C↑
  key2.2 C↑
  attr1.1 S
  attr1.2 S
  attr1.3 (collection) S
  attr2.1
  attr2.2
  attr2.3 (collection)
  attr
ACCESS PATTERN search attributes: key1.1 =, key1.2 >
ET2_by_ET1_key
  key1.1 K
  key1.2 C↑
  key2.1 C↑
  key2.2 C↑
  attr2.1
  attr2.2
  attr2.3 (collection)
  attr
PRIMARY KEY: All search attributes, followed by all key attributes of RT
STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key
What if we add green attributes to the above table?
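The two rules of this pattern can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: equality search attributes are assumed to form the partition key, inequality search attributes are assumed to become clustering columns, and for a 1:n relationship the key of RT is taken to be the key of ET2. The function name `map_one_to_many` and its argument names are hypothetical.

```python
def map_one_to_many(search, et1_keys, et1_attrs, et2_keys, et2_attrs, rt_attrs=()):
    """Derive a table layout for a 1:n relationship RT between ET1 and ET2.

    search -- list of (attribute, operator) pairs, equality predicates first;
              operator is "=" or an inequality such as ">".
    """
    # Assumption: equality search attributes form the partition key,
    # inequality search attributes become clustering columns.
    partition = [attr for attr, op in search if op == "="]
    clustering = [attr for attr, op in search if op != "="]

    # PRIMARY KEY rule: search attributes, followed by the key attributes of RT
    # (assumed here to be the key of ET2, the "many" side).
    for key in et2_keys:
        if key not in partition and key not in clustering:
            clustering.append(key)

    # STATIC COLUMNS rule: non-key attributes of ET1 are static
    # iff all key attributes of ET1 are part of the partition key.
    static = list(et1_attrs) if set(et1_keys) <= set(partition) else []

    return {
        "partition_key": partition,
        "clustering_key": clustering,
        "static": static,
        "regular": list(et2_attrs) + list(rt_attrs),
    }


ET1_KEYS = ["key1.1", "key1.2"]
ET1_ATTRS = ["attr1.1", "attr1.2", "attr1.3"]
ET2_KEYS = ["key2.1", "key2.2"]
ET2_ATTRS = ["attr2.1", "attr2.2", "attr2.3"]

# Both search attributes compared with "=".
both_equal = map_one_to_many(
    [("key1.1", "="), ("key1.2", "=")],
    ET1_KEYS, ET1_ATTRS, ET2_KEYS, ET2_ATTRS, ["attr"],
)

# key1.1 compared with "=", key1.2 with ">".
with_range = map_one_to_many(
    [("key1.1", "="), ("key1.2", ">")],
    ET1_KEYS, ET1_ATTRS, ET2_KEYS, ET2_ATTRS, ["attr"],
)
```

In the second case key1.2 is a clustering column, so the ET1 attributes cannot be static and the sketch leaves them out, as in the second table above.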
The Easy Way
kdm.dataview.org
• Implements the methodology
– CDM and Query design
– Automated LDM generation
– Automated PDM and CQL generation
Yesterday’s talk:
World’s Best Data Modeling Tool
for Apache Cassandra
3 Data Modeling Example
Conceptual Data Model: Fact-Based Model
• Alice is a user
• Alice is 28 y.o.
• Alice wears a wristband
• A wristband is a sensor
• A wristband records a heart rate
• A heart rate is a measurement
• …
Conceptual Data Model: Entity-Relationship Model
[Figure: ER diagram. Entity types: User (username, age, address), Sensor (id, type, settings), Measurement (datetime, parameter, value). Relationships: User owns Sensor, Sensor records Measurement, and has (n:m, with roles user and follower).]
Application Workflow
ACCESS PATTERNS
Q1: Find a user with a known username
Q2: Find followers of a user
Q3: Find sensors owned by a user
Q4: Find measurements for a sensor in a date range
Q5: Find daily summary of hourly aggregates
[Figure: workflow graph with tasks Display user information, Find followers, Display sensors, Show measurements in a date range, and Show today's hourly aggregates; tasks and transitions are labeled with Q1 to Q5.]
Application Workflow and Queries
Q1 (Display user information):
  SELECT * FROM users
  WHERE username = ?
Q2 (Find followers):
  SELECT * FROM followers_by_user
  WHERE username = ?
Q3 (Display sensors):
  SELECT * FROM sensors_by_user
  WHERE username = ?
Q4 (Show measurements in a date range):
  SELECT *
  FROM measurements_by_sensor
  WHERE id = ? AND parameter = ?
  AND datetime > ?
Q5 (Show today's hourly aggregates):
  SELECT *
  FROM summary_by_sensor
  WHERE id = ? AND date = ?
Logical Data Model
users (Q1)
  username K
  age
  address
followers_by_user (Q2)
  username K
  follower_username C↑
  follower_age
  follower_address
sensors_by_user (Q3)
  username K
  id C↑
  type
  <settings>
measurements_by_sensor (Q4)
  id K
  parameter K
  datetime C↓
  value
summary_by_sensor (Q5)
  id K
  date K
  parameter C↑
  hour C↓
  avg
  ...
Physical Data Model
users (Q1)
  username TEXT K
  age INT
  address TEXT
followers_by_user (Q2)
  username TEXT K
  follower_username TEXT C↑
  follower_age INT
  follower_address TEXT
sensors_by_user (Q3)
  username TEXT K
  id UUID C↑
  type TEXT
  <settings> MAP<TEXT,TEXT>
measurements_by_sensor (Q4)
  id UUID K
  week K
  parameter TEXT K
  datetime TIMESTAMP C↓
  value FLOAT
summary_by_sensor (Q5)
  id UUID K
  date K
  parameter TEXT C↑
  hour C↓
  avg FLOAT
  ...
Remaining types in the diagram, for week, date, and hour: INT, TIMESTAMP, TIMESTAMP
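The physical model adds a week column to the partition key of measurements_by_sensor, which splits one sensor's measurements into weekly partitions. The slides do not say how the week value is derived; the Python sketch below assumes it is the UTC timestamp of the Monday that starts the week, and `week_bucket` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def week_bucket(ts):
    """Return midnight UTC of the Monday starting the week that contains ts.

    ts must be a timezone-aware datetime.
    """
    ts = ts.astimezone(timezone.utc)
    monday = ts - timedelta(days=ts.weekday())  # weekday(): Monday is 0
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


# Two measurements taken in the same week land in the same partition bucket.
first = week_bucket(datetime(2015, 9, 22, 8, 15, tzinfo=timezone.utc))
second = week_bucket(datetime(2015, 9, 24, 13, 30, tzinfo=timezone.utc))
```

With such a bucket in the partition key, a query for a date range has to supply the week value (or values) that the range covers.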
4 From Relational to Cassandra
Relational Methodology
[Figure: relational design flow. Box labels: CDM, Relational LDM, Normalized Relational, Relational PDM. Step labels: Mapping, Normalization, Optimization. Input label: Queries.]
Relational Design Example
users
  username PK
  age
  address
followers
  username PK, FK
  follower_username PK, FK
ownership
  username PK, FK
  sensor_id PK, FK
measurements
  sensor_id PK, FK
  parameter PK
  datetime PK
  value
sensors
  sensor_id PK
  type
settings
  sensor_id PK, FK
  setting_name PK
  settings_value
Relational-to-Cassandra: Indirect Translation
[Figure: Relational Data Model → Reverse Engineer → Conceptual Data Model; Relational Application → Reverse Engineer → Application Workflow; then Apply the C* Methodology.]
Reverse Engineering is Almost Straightforward
[Figure: the relational schema of the Relational Design Example (users, followers, ownership, measurements, sensors, settings), annotated with the ER constructs it reverse-engineers to: User owns Sensor, Sensor records Measurement. The settings table carries two annotations: "has settings" and "has Setting".]
Relational-to-Cassandra: Direct Translation
[Figure: Relational Schema and SQL Queries → Relational-to-Cassandra Mapping → Cassandra Schema.]
Extracting Functional Dependencies
users: username → age, address
ownership: username, sensor_id → username, sensor_id
sensors: sensor_id → type
followers: username, follower_username → username, follower_username
measurements: sensor_id, parameter, datetime → value
settings: sensor_id, setting_name → setting_value
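Reading functional dependencies off a relational schema can be sketched in a few lines of Python. The sketch assumes that each table's primary key is its only candidate key and that no other dependencies hold; the schema literal and the name `fds_from_schema` are illustrative, and the settings value column is spelled setting_value as in the dependency list.

```python
SCHEMA = {
    "users": {"pk": ["username"], "other": ["age", "address"]},
    "ownership": {"pk": ["username", "sensor_id"], "other": []},
    "sensors": {"pk": ["sensor_id"], "other": ["type"]},
    "followers": {"pk": ["username", "follower_username"], "other": []},
    "measurements": {"pk": ["sensor_id", "parameter", "datetime"], "other": ["value"]},
    "settings": {"pk": ["sensor_id", "setting_name"], "other": ["setting_value"]},
}


def fds_from_schema(schema):
    """Return {table: (lhs, rhs)} with one dependency per table: PK → non-key columns."""
    fds = {}
    for table, columns in schema.items():
        lhs = frozenset(columns["pk"])
        # An all-key table yields only the trivial dependency PK → PK.
        rhs = frozenset(columns["other"]) or lhs
        fds[table] = (lhs, rhs)
    return fds


FDS = fds_from_schema(SCHEMA)
```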
Entailing New Functional Dependencies
• Armstrong’s Axioms
– Reflexivity: If Y ⊆ X then X → Y (trivial functional dependency)
    username, sensor_id → username, sensor_id
– Augmentation: If X → Y then XZ → YZ
    username → age, address
    username, sensor_id → age, address, sensor_id
– Transitivity: If X → Y and Y → Z then X → Z
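The three axioms can be written as small Python functions over dependencies represented as (left side, right side) pairs of frozensets. This representation and the function names are assumptions made for illustration only.

```python
def reflexivity(x, y):
    """If Y ⊆ X then X → Y."""
    x, y = frozenset(x), frozenset(y)
    if not y <= x:
        raise ValueError("reflexivity requires Y to be a subset of X")
    return (x, y)


def augmentation(fd, z):
    """If X → Y then XZ → YZ."""
    x, y = fd
    z = frozenset(z)
    return (x | z, y | z)


def transitivity(fd_xy, fd_yz):
    """If X → Y and Y → Z then X → Z."""
    x, y = fd_xy
    y2, z = fd_yz
    if y != y2:
        raise ValueError("right side of the first FD must equal left side of the second")
    return (x, z)


# The augmentation example: username → age, address, augmented with sensor_id.
user_fd = (frozenset({"username"}), frozenset({"age", "address"}))
augmented = augmentation(user_fd, {"sensor_id"})

# The reflexivity example: the trivial dependency of the ownership table.
trivial = reflexivity({"username", "sensor_id"}, {"username", "sensor_id"})
```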
The Idea
Cassandra table schema must satisfy the original or entailed relational FDs
The best way to verify this is by computing an attribute closure
No kidding!
You better believe
this guy …
Computing an Attribute Closure
(1) A → BC, (2) B → F, (3) AD → E
Closure of AD:
{AD}       (trivial)
{ADBC}     (1)
{ADBCF}    (2)
{ADBCFE}   (3)
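The closure computation can be sketched as a fixed-point loop in Python. The sketch assumes single-character attribute names, so that a string such as "BC" stands for the set {B, C}; the function name `closure` is illustrative.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs.

    fds is a list of (lhs, rhs) pairs; each side is an iterable of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is already covered
            # and it contributes at least one new attribute.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# (1) A → BC, (2) B → F, (3) AD → E
FDS = [("A", "BC"), ("B", "F"), ("AD", "E")]
ad_closure = closure("AD", FDS)
```

The loop stops once a full pass over the dependencies adds nothing new.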
Simple Example
SELECT age, address FROM users WHERE username = 'Alice'
Functional dependencies:
username → age, address
sensor_id → type
sensor_id, parameter, datetime → value
sensor_id, setting_name → setting_value
Partition key   Clustering key   Other columns   Primary key attribute closure
username        (none)           age, address    username, age, address
Advanced Example
SELECT age, type, datetime, value
FROM users NATURAL JOIN ownership NATURAL JOIN sensors NATURAL JOIN measurements
WHERE username = 'Alice' AND parameter = 'heart rate'
ORDER BY datetime DESC
Functional dependencies:
username → age, address
sensor_id → type
sensor_id, parameter, datetime → value
sensor_id, setting_name → setting_value
Candidate 1
  Primary key columns (partition key, then clustering key): username, parameter, datetime ↓
  Other columns: age (S), type, value
  Primary key attribute closure: username, age, address, parameter, datetime
  The closure contains neither type nor value, so this primary key does not determine every column of the table.
Candidate 2
  Primary key columns (partition key, then clustering key): username, parameter, datetime ↓, sensor_id ↑
  Other columns: age (S), type, value
  Primary key attribute closure: username, age, address, sensor_id, type, parameter, datetime, value
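The check behind this example can be sketched in Python: compute the closure of a candidate primary key and list the table columns it fails to determine. The function names `closure` and `missing_columns` are hypothetical; the dependencies are the four listed for this example.

```python
FDS = [
    ({"username"}, {"age", "address"}),
    ({"sensor_id"}, {"type"}),
    ({"sensor_id", "parameter", "datetime"}, {"value"}),
    ({"sensor_id", "setting_name"}, {"setting_value"}),
]


def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def missing_columns(primary_key, other_columns, fds):
    """Columns of the table that the primary key does not functionally determine."""
    return set(other_columns) - closure(primary_key, fds)


OTHER = {"age", "type", "value"}

# Primary key without sensor_id: type and value are not determined.
without_sensor = missing_columns({"username", "parameter", "datetime"}, OTHER, FDS)

# Primary key with sensor_id added as a clustering column.
with_sensor = missing_columns(
    {"username", "parameter", "datetime", "sensor_id"}, OTHER, FDS
)
```

An empty result means the primary key's closure covers every column of the table, which is the condition the second design satisfies.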
5 Conclusions
• Cassandra data models from scratch
– The methodology: academy.datastax.com
– Automation: kdm.dataview.org
• Cassandra data models from a relational database
– Two approaches to consider
– Ripe for automation
Thank you