tdc2017 | são paulo - trilha nosql how we figured out we had a sre team at - cassandra: por que o...

Globalcode – Open4education

Cassandra
Why will relational thinking destroy your system performance?

Paulo Ricardo R. Almeida
OCJP, 2 years working with Cassandra


Agenda
• What is Cassandra?
• Why Cassandra?
• Quick Review
• The Problem to tackle
• Relational solution and its drawbacks
• Addressing the problem with C* thinking
• Goals and Non-Goals
• Query First
• The Cassandra solution
• Benchmarking
• Additional resources


What is Cassandra?

● Distributed
● Fault Tolerant
● Linear Scalability


Pick two of: Availability, Consistency, Partition Tolerance


Why Cassandra?

● Distributed Cache (Netflix EVCache)
● Real time Processing
● Data doesn't fit in one place
● High write workload
  ○ Time series data
  ○ Log storage/analysis
● Geographical distribution
● Performance


Quick Review

[Diagram: the CLIENT sends a request to a Coordinator node, which computes token(partitionKey) using the Partitioner to locate the replicas; the Keyspace uses RF = 3.]


https://pandaforme.gitbooks.io/introduction-to-cassandra
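The token lookup in the Quick Review can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names n1..n6 are hypothetical, each node owns a single token, MD5 stands in for Cassandra's real partitioner hash, and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(partition_key):
    """Map a partition key to a ring position (MD5 is a stand-in hash, an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas(partition_key, ring, rf=3):
    """Return rf distinct nodes: the first node whose token is >= token(key),
    then the following nodes clockwise around the ring."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners


# Hypothetical six-node cluster, sorted by token so bisect can search it.
RING = sorted((token(name), name) for name in ["n1", "n2", "n3", "n4", "n5", "n6"])

owners = replicas("PR", RING, rf=3)
print(owners)  # three node names; which ones depends on the hash
```

Any node can act as coordinator: it runs the same computation and forwards the request to the replicas it finds.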


The problem

Store TDC information (speakers and talks)


Relational Way


Relational Way

SELECT * FROM speaker

WHERE state = 'PR'


Relational Way

SELECT * FROM talk
INNER JOIN speaker
ON speaker.id = talk.speaker_id_a
OR speaker.id = talk.speaker_id_b;


Putting into Cassandra


Why?

SELECT * FROM speaker WHERE state = 'PR'
ALLOW FILTERING;

Retrieves all rows and filters them one by one
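A rough Python model of why the filtering query is expensive: every row is read, and only afterwards is the predicate applied. The data set below is synthetic (1000 hypothetical speakers spread over four states); it is not taken from the talk.

```python
def filter_scan(partitions, state):
    """Model ALLOW FILTERING: visit every row, keep the ones that match."""
    rows_read = 0
    matches = []
    for row in partitions.values():
        rows_read += 1
        if row["state"] == state:
            matches.append(row)
    return matches, rows_read


# Synthetic table keyed by id, as in a speaker table partitioned by id.
STATES = ["PR", "SP", "RJ", "MG"]
SPEAKERS = {i: {"name": f"speaker-{i}", "state": STATES[i % 4]} for i in range(1000)}

matches, rows_read = filter_scan(SPEAKERS, "PR")
print(len(matches), "matches after reading", rows_read, "rows")
```

The cost grows with the size of the table, not with the size of the result.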


Secondary index to improve read performance


Secondary Index

CREATE INDEX speaker_name ON speaker (name);


Secondary Index

0312 Paulo Almeida
2315 Gessica Dutra
...
0003 Jefferson….

5 lookups, 1 response = poor performance

SELECT * FROM tdc.speaker
WHERE name = 'Paulo Almeida';
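The "5 lookups, 1 response" arithmetic can be reproduced with a small sketch. It assumes a hypothetical five-node cluster in which each node keeps an index of its own rows only, so the coordinator has to ask every node; the placement of the two names on the first node is made up for illustration.

```python
def query_by_name(nodes, name):
    """Ask every node's local index for the name; return (row ids, lookups made)."""
    lookups = 0
    rows = []
    for local_index in nodes:
        lookups += 1
        rows.extend(local_index.get(name, []))
    return rows, lookups


# Each dict is one node's local index: name -> row ids (hypothetical placement).
NODES = [
    {"Paulo Almeida": ["0312"], "Gessica Dutra": ["2315"]},
    {},
    {},
    {},
    {},
]

rows, lookups = query_by_name(NODES, "Paulo Almeida")
print(lookups, "lookups for", len(rows), "row")
```

Adding nodes adds lookups without adding results, which is why this pattern degrades as the cluster grows.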


Limitations
● No JOIN, LIKE… support
● No constraints
● No ACID transactions
● No strong consistency by default
● Secondary indexes don't scale well


Goals and Non-Goals

● Non-Goals
  ○ Minimize number of writes
  ○ Minimize data duplication
● Goals
  ○ Spread data evenly around the cluster
  ○ Minimize the number of partitions read


Query first!

● Know your queries first and model around them
  ○ Don't model around relations
  ○ Don't model around objects
  ○ Try to create a CF (column family, i.e. a table) where you can satisfy the query by reading one partition


Queries

● Speaker by state
● Speaker by name
● Talks by speaker name
● Talks by keywords
● Talks by track


Cassandra Way


Data Modeling

CREATE KEYSPACE tdc WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};


Data Modeling

CREATE TABLE tdc.speaker (

id uuid,

name text,

email text,

bio text,

city text,

state text,

PRIMARY KEY (id)

);

Keyspace: tdc
Partition key: id


Data Modeling

CREATE TABLE tdc.speaker_by_name (
  speaker_id uuid,
  name text,
  PRIMARY KEY (name, speaker_id)
);

SELECT speaker_id FROM tdc.speaker_by_name WHERE name = $name;
SELECT * FROM tdc.speaker WHERE id = $speaker_id;

Better approach, but it always requires 2 lookups

Partition key: name


Data Modeling

SELECT * FROM tdc.speaker_by_state
WHERE state = 'PR';

CREATE TABLE tdc.speaker_by_state (
  speaker_id uuid,
  name text,
  state text,
  bio text,
  PRIMARY KEY (state, name, speaker_id)
) WITH CLUSTERING ORDER BY (name ASC, speaker_id ASC);

Partition key: state
Clustering key: name, speaker_id


Data Modeling

CREATE TABLE tdc.speaker_by_state (
  speaker_id uuid,
  name text,
  city text,
  state text,
  bio text,
  PRIMARY KEY (state, city, name, speaker_id)
) WITH CLUSTERING ORDER BY (city ASC, name ASC, speaker_id ASC);

Partition key: state
Clustering key: city, name, speaker_id
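To make the roles of the two key parts concrete, here is a minimal in-memory model of speaker_by_state. It is a sketch, not Cassandra's storage engine: the partition key picks one bucket, and rows inside the bucket are kept sorted by the clustering columns, so a query by state (optionally narrowed by city) reads one contiguous slice. The class name, cities, and speaker names are hypothetical.

```python
import bisect
from collections import defaultdict
from itertools import takewhile


class SpeakerByState:
    def __init__(self):
        # partition key (state) -> rows sorted by (city, name, speaker_id)
        self.partitions = defaultdict(list)

    def insert(self, state, city, name, speaker_id):
        bisect.insort(self.partitions[state], (city, name, speaker_id))

    def select(self, state, city=None):
        rows = self.partitions.get(state, [])
        if city is None:
            return list(rows)
        # (city,) sorts before every (city, name, id), so this finds the slice start.
        start = bisect.bisect_left(rows, (city,))
        return list(takewhile(lambda r: r[0] == city, rows[start:]))


table = SpeakerByState()
table.insert("PR", "Londrina", "speaker-c", 3)
table.insert("PR", "Curitiba", "speaker-b", 2)
table.insert("PR", "Curitiba", "speaker-a", 1)
table.insert("SP", "Campinas", "speaker-d", 4)

print(table.select("PR"))
print(table.select("PR", city="Curitiba"))
```

Rows come back already ordered by city and then name, with no sorting at read time.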


Data Modeling

BEGIN BATCH

INSERT INTO speaker (id, …) VALUES (...);

INSERT INTO speaker_by_name (name, ...) VALUES (...);

INSERT INTO speaker_by_state (state, ...) VALUES (...);

APPLY BATCH;
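The batch reflects the non-goals stated earlier: one logical write fans out to every table that serves a query. A toy Python model of that fan-out, with plain dictionaries standing in for the three tables (the function name and the sample values are hypothetical, and a dictionary gives none of the guarantees a real batch does):

```python
import uuid

speaker = {}           # id -> row
speaker_by_name = {}   # name -> set of ids
speaker_by_state = {}  # state -> list of (name, id)


def save_speaker(name, state):
    """Write the same speaker to all three tables; return its generated id."""
    speaker_id = uuid.uuid4()
    speaker[speaker_id] = {"name": name, "state": state}
    speaker_by_name.setdefault(name, set()).add(speaker_id)
    speaker_by_state.setdefault(state, []).append((name, speaker_id))
    return speaker_id


new_id = save_speaker("speaker-a", "PR")
print(len(speaker), len(speaker_by_name), len(speaker_by_state))
```

Three writes per speaker is the price paid so that each read touches a single partition.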


Data Modeling

CREATE TABLE tdc.talk_by_speaker_name (
  talk_id uuid,
  talk_name text,
  speaker_name text,
  date timestamp,
  PRIMARY KEY (speaker_name, date, talk_id)
) WITH CLUSTERING ORDER BY (date DESC, talk_id ASC);


Data Modeling

CREATE INDEX talk_by_track_name ON tdc.talk (track_name);

SELECT * FROM tdc.talk WHERE track_name = 'Test';


Netflix benchmark
https://academy.datastax.com/planet-cassandra/nosql-performance-benchmarks

Operations/sec by number of nodes

Nodes   Cassandra     Couchbase    HBase        MongoDB
1       18,925.59     1,554.14     973.85       1,278.81
2       35,539.69     2,985.28     3,430.59     1,441.32
4       64,911.39     3,755.28     6,451.95     1,801.06
8       117,237.91    10,138.80    6,262.95     2,195.92
16      210,237.90    11,761.31    15,268.93    1,230.96
32      348,682.44    21,375.02    58,463.15    2,335.14
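One way to read the table is to compare throughput at 32 nodes with throughput at 1 node. The short script below does that arithmetic using only the figures from the table; the function names are made up for this sketch, and "efficiency" here just means speedup divided by the 32x node increase.

```python
# Operations/sec at 1 and 32 nodes, copied from the table.
THROUGHPUT = {
    "Cassandra": {1: 18925.59, 32: 348682.44},
    "Couchbase": {1: 1554.14, 32: 21375.02},
    "HBase": {1: 973.85, 32: 58463.15},
    "MongoDB": {1: 1278.81, 32: 2335.14},
}


def speedup(db):
    """Throughput at 32 nodes relative to a single node."""
    return THROUGHPUT[db][32] / THROUGHPUT[db][1]


def efficiency(db):
    """Speedup as a fraction of the 32x increase in nodes."""
    return speedup(db) / 32


for db in THROUGHPUT:
    print(f"{db:10s} speedup {speedup(db):6.2f}x  efficiency {efficiency(db):.2f}")
```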


Resources
● Cassandra: The Definitive Guide
● Datastax self-paced Training
  ○ https://academy.datastax.com/resources/ds220-data-modeling
● Datastax CQL Reference
  ○ http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlReferenceTOC.html
● Cassandra-demo-middle:
  ○ https://github.com/rochapaulo/cassandra-demo-middle
● Presentation source code:
  ○ https://github.com/rochapaulo/TDC-SP-2017-Cassandra
● Youtube videos



Thank you!

/rochapaulo

/pauloricardoalmeida

[email protected]
