TDC 2017 | São Paulo - NoSQL Track - Cassandra: why the...
Globalcode – Open4education
Cassandra
Why will relational thinking destroy your system's performance?
Paulo Ricardo R. Almeida
OCJP, 2 years working with Cassandra
Agenda
• What is Cassandra?
• Why Cassandra?
• Quick Review
• The Problem to tackle
• Relational solution and its drawbacks
• Addressing the problem with C* thinking
• Goals and Non-Goals
• Query First
• The Cassandra solution
• Benchmarking
• Additional resources
What is Cassandra?
● Distributed
● Fault Tolerant
● Linear Scalability
Pick two of: Availability, Consistency, Partition Tolerance
Why Cassandra?
● Distributed Cache (Netflix EVCache)
● Real-time Processing
● Data doesn't fit in one place
● High write workload
  ○ Time series data
  ○ Log storage/analysis
● Geographical distribution
● Performance
Quick Review
[Diagram: the CLIENT sends a request to a Coordinator node, which computes token(partitionKey) using the Partitioner to find the replicas within the Keyspace (RF = 3)]
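The request path in the diagram can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's implementation: MD5 stands in for the Murmur3 partitioner, each node gets a single token instead of virtual nodes, and the node names are hypothetical. Replicas are placed by walking the ring clockwise, which is how SimpleStrategy is documented to behave.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Map a partition key to a ring position (MD5 stands in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # One token per node, derived from its name (real clusters use many vnodes).
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """First node whose token is >= the key's token, then the next ones clockwise."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        count = min(self.rf, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


# Hypothetical 5-node cluster with RF = 3, as on the slide.
ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
owners = ring.replicas("PR")
print(owners)
```

Any node can act as coordinator because every node can compute the same token and therefore the same replica list.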
https://pandaforme.gitbooks.io/introduction-to-cassandra
The problem
Store TDC information (speakers and talks)
Relational Way
SELECT * FROM speaker
WHERE state = 'PR'
Relational Way
SELECT * FROM talk
INNER JOIN speaker
ON speaker.id = talk.speaker_id_a
OR speaker.id = talk.speaker_id_b;
Putting it into Cassandra
Why?
SELECT * FROM speaker WHERE state = 'PR'
ALLOW FILTERING;
Retrieves all rows and filters them one by one
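A rough way to see the cost difference, assuming a toy in-memory model rather than a real cluster: the first function mimics what ALLOW FILTERING forces (touch every row, keep the matches), and the second mimics a table partitioned by state. The sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (id, name, state)
speakers = [
    (1, "Ana", "SP"),
    (2, "Bruno", "PR"),
    (3, "Carla", "SP"),
    (4, "Diego", "PR"),
    (5, "Elisa", "RJ"),
]


def filter_scan(rows, state):
    """Model of ALLOW FILTERING: every row is read, then filtered."""
    rows_read = 0
    matches = []
    for row in rows:
        rows_read += 1
        if row[2] == state:
            matches.append(row)
    return matches, rows_read


def build_by_state(rows):
    """Model of a table whose partition key is the state."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[2]].append(row)
    return partitions


def partition_read(partitions, state):
    """Model of a query that reads exactly one partition."""
    matches = partitions.get(state, [])
    return matches, len(matches)


scan_matches, scan_cost = filter_scan(speakers, "PR")
part_matches, part_cost = partition_read(build_by_state(speakers), "PR")
print("rows read with filtering:", scan_cost)
print("rows read from one partition:", part_cost)
```

Both paths return the same speakers; only the number of rows touched differs, and the gap grows with the size of the table.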
Secondary index to improve read performance
Secondary Index
CREATE INDEX speaker_name ON speaker (name);
Secondary Index
0312 Paulo Almeida
2315 Gessica Dutra
...
0003 Jefferson….
5 lookups, 1 response = poor performance
SELECT * FROM tdc.speaker
WHERE name = 'Paulo Almeida';
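One reading of "5 lookups, 1 response" is that each node indexes only its own rows, so a query on the indexed column may have to visit every node to find a single match. The sketch below models that worst case; the `Node` class, the 5-node cluster, and the placement of rows on nodes are assumptions made for illustration.

```python
class Node:
    """A node holding some rows plus a local index over only those rows."""

    def __init__(self, name):
        self.name = name
        self.rows = {}        # speaker id -> speaker name
        self.name_index = {}  # speaker name -> ids stored on this node

    def insert(self, speaker_id, speaker_name):
        self.rows[speaker_id] = speaker_name
        self.name_index.setdefault(speaker_name, []).append(speaker_id)

    def lookup_by_name(self, speaker_name):
        return [(i, self.rows[i]) for i in self.name_index.get(speaker_name, [])]


def query_by_index(nodes, speaker_name):
    """The coordinator does not know which node indexed the value, so it asks all of them."""
    lookups = 0
    results = []
    for node in nodes:
        lookups += 1
        results.extend(node.lookup_by_name(speaker_name))
    return results, lookups


nodes = [Node(f"node-{i}") for i in range(5)]  # hypothetical 5-node cluster
nodes[1].insert("0312", "Paulo Almeida")
nodes[3].insert("2315", "Gessica Dutra")

results, lookups = query_by_index(nodes, "Paulo Almeida")
print(results, lookups)
```

The number of lookups follows the cluster size, not the number of matching rows, which is why this approach degrades as nodes are added.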
Limitations
● No JOIN, LIKE… support
● No constraints
● No ACID transactions
● No strong consistency by default (consistency level is tunable per query)
● Secondary Index doesn't scale well
Goals and Non-Goals
● Non-Goals
  ○ Minimize number of writes
  ○ Minimize data duplication
● Goals
  ○ Spread data evenly around the cluster
  ○ Minimize the number of partitions read
Query first!
● Know your queries first and model around them
  ○ Don't model around relations
  ○ Don't model around objects
  ○ Try to create a CF (column family, i.e. a table) where you can satisfy the query by reading one partition
Queries
● Speaker by state
● Speaker by name
● Talks by speaker name
● Talks by keywords
● Talks by track
Cassandra Way
Data Modeling
CREATE KEYSPACE tdc WITH REPLICATION =
{
    'class': 'SimpleStrategy',
    'replication_factor': 3
};
Data Modeling
CREATE TABLE tdc.speaker (   -- tdc: keyspace
    id uuid,
    name text,
    email text,
    bio text,
    city text,
    state text,
    PRIMARY KEY (id)         -- id: partition key
);
Data Modeling
CREATE TABLE tdc.speaker_by_name (
    speaker_id uuid,
    name text,
    PRIMARY KEY (name, speaker_id)   -- name: partition key
);
SELECT speaker_id FROM tdc.speaker_by_name WHERE name = 'Paulo Almeida';
SELECT * FROM tdc.speaker WHERE id = $speaker_id;
Better approach, but it always requires 2 lookups
Data Modeling
SELECT * FROM tdc.speaker_by_state
WHERE state = 'PR';
CREATE TABLE tdc.speaker_by_state (
    speaker_id uuid,
    name text,
    state text,
    bio text,
    -- state: partition key; name, speaker_id: clustering key
    PRIMARY KEY (state, name, speaker_id)
) WITH CLUSTERING ORDER BY (name ASC, speaker_id ASC);
Data Modeling
CREATE TABLE tdc.speaker_by_state (
    speaker_id uuid,
    name text,
    city text,
    state text,
    bio text,
    -- state: partition key; city, name, speaker_id: clustering key
    PRIMARY KEY (state, city, name, speaker_id)
) WITH CLUSTERING ORDER BY (city ASC, name ASC, speaker_id ASC);
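The roles of the two key parts can be illustrated with a small in-memory model: the partition key picks which bucket a row lands in, and the clustering key decides the order of rows inside that bucket. This is only a sketch; the `SpeakerByState` class and the sample speakers are hypothetical.

```python
import bisect
from collections import defaultdict


class SpeakerByState:
    """Toy model: partition key = state, clustering key = (city, name, speaker_id)."""

    def __init__(self):
        # state -> list of (city, name, speaker_id), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, state, city, name, speaker_id):
        # insort keeps each partition ordered by the clustering key, all ascending.
        bisect.insort(self.partitions[state], (city, name, speaker_id))

    def by_state(self, state):
        """One partition read; rows come back already sorted."""
        return list(self.partitions.get(state, []))

    def by_state_and_city(self, state, city):
        # Rows of one city sit next to each other in the partition;
        # this toy version simply filters instead of slicing.
        return [row for row in self.partitions.get(state, []) if row[0] == city]


table = SpeakerByState()
table.insert("PR", "Londrina", "Bia", 2)
table.insert("PR", "Curitiba", "Caio", 3)
table.insert("PR", "Curitiba", "Ana", 1)
table.insert("SP", "Campinas", "Davi", 4)

for row in table.by_state("PR"):
    print(row)
```

Because the ordering is fixed at write time, a query by state needs no sorting step when it is read.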
Data Modeling
BEGIN BATCH
INSERT INTO speaker (id, …) VALUES (...);
INSERT INTO speaker_by_name (name, ...) VALUES (...);
INSERT INTO speaker_by_state (state, ...) VALUES (...);
APPLY BATCH;
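The batch reflects the cost of query-first modeling: one logical write fans out to every table that serves a query. A minimal sketch of that fan-out, using plain dictionaries in place of tables, might look like this. The function names and sample values are hypothetical, and the model does not attempt to reproduce the all-or-nothing behavior of a logged batch.

```python
import uuid

speaker = {}           # id -> full row
speaker_by_name = {}   # name -> set of ids
speaker_by_state = {}  # state -> list of (city, name, id)


def save_speaker(name, email, bio, city, state):
    """Write the same speaker to every query table, mirroring the three INSERTs."""
    speaker_id = uuid.uuid4()
    row = {"id": speaker_id, "name": name, "email": email,
           "bio": bio, "city": city, "state": state}
    speaker[speaker_id] = row
    speaker_by_name.setdefault(name, set()).add(speaker_id)
    speaker_by_state.setdefault(state, []).append((city, name, speaker_id))
    return speaker_id


def find_by_name(name):
    """Two lookups: ids from speaker_by_name, then full rows from speaker."""
    return [speaker[i] for i in speaker_by_name.get(name, ())]


sid = save_speaker("Ana", "ana@example.com", "Hypothetical bio", "Curitiba", "PR")
print(find_by_name("Ana")[0]["state"])
```

Writes and storage grow with the number of queries supported, which is the trade-off the Non-Goals slide accepts.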
Data Modeling
CREATE TABLE tdc.talk_by_speaker_name (
    talk_id uuid,
    talk_name text,
    speaker_name text,
    date timestamp,
    PRIMARY KEY (speaker_name, date, talk_id)
) WITH CLUSTERING ORDER BY (date DESC, talk_id ASC);
Data Modeling
CREATE INDEX talk_by_track_name ON tdc.talk (track_name);
SELECT * FROM tdc.talk WHERE track_name = 'Test';
Netflix benchmark
https://academy.datastax.com/planet-cassandra/nosql-performance-benchmarks
Operations/sec
Nodes   Cassandra     Couchbase    HBase        MongoDB
1       18,925.59     1,554.14     973.85       1,278.81
2       35,539.69     2,985.28     3,430.59     1,441.32
4       64,911.39     3,755.28     6,451.95     1,801.06
8       117,237.91    10,138.80    6,262.95     2,195.92
16      210,237.90    11,761.31    15,268.93    1,230.96
32      348,682.44    21,375.02    58,463.15    2,335.14
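To compare how each database scales, one option is to compute the speedup from 1 to 32 nodes and divide it by the ideal 32x. The sketch below does that arithmetic on the figures from the table; the `speedup` and `efficiency` helpers are hypothetical names, and the result says nothing about how the benchmark itself was run.

```python
# Operations/sec from the benchmark table, keyed by node count.
throughput = {
    "Cassandra": {1: 18925.59, 2: 35539.69, 4: 64911.39,
                  8: 117237.91, 16: 210237.90, 32: 348682.44},
    "Couchbase": {1: 1554.14, 2: 2985.28, 4: 3755.28,
                  8: 10138.80, 16: 11761.31, 32: 21375.02},
    "HBase": {1: 973.85, 2: 3430.59, 4: 6451.95,
              8: 6262.95, 16: 15268.93, 32: 58463.15},
    "MongoDB": {1: 1278.81, 2: 1441.32, 4: 1801.06,
                8: 2195.92, 16: 1230.96, 32: 2335.14},
}


def speedup(series, nodes, baseline=1):
    """Throughput at `nodes` relative to throughput at `baseline` nodes."""
    return series[nodes] / series[baseline]


def efficiency(series, nodes, baseline=1):
    """Fraction of ideal linear scaling achieved (1.0 would be perfectly linear)."""
    return speedup(series, nodes, baseline) / (nodes / baseline)


for db, series in throughput.items():
    print(f"{db:<10} speedup x{speedup(series, 32):6.2f}  "
          f"efficiency {efficiency(series, 32):.2f}")
```

Speedup is relative to each database's own single-node figure, so it should be read alongside the absolute operations/sec rather than instead of them.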
Resources
● Cassandra: The Definitive Guide
● DataStax self-paced Training
  ○ https://academy.datastax.com/resources/ds220-data-modeling
● DataStax CQL Reference
  ○ http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlReferenceTOC.html
● Cassandra-demo-middle:
  ○ https://github.com/rochapaulo/cassandra-demo-middle
● Presentation source code:
  ○ https://github.com/rochapaulo/TDC-SP-2017-Cassandra
● YouTube videos