traveler's guide to cassandra
TRANSCRIPT
Who wants to be a Cassandra Millionaire? 40 minutes of best practice, getting you ready for certification
@VictorFAnjos
Toronto Cassandra Day
© 2015. All Rights Reserved. @VictorFAnjos
Prize ladder (questions 1 to 15): 100, 200, 300, 500, 1,000, 2,000, 4,000, 8,000, 16,000, 32,000, 64,000, 125,000, 250,000, 500,000, 1 Million
Welcome to Who Wants to be a Cassandra Millionaire
This storage medium allows for best performance.
A: NAS / SAN
B: SSD
C: DAS SATA
D: DAS SCSI
Installation and considerations
how to store the datastore
Storage Area Network Solid State Drive
Local (DAS), iSCSI, Fiber Channel
● AVOID network storage like the plague
● Direct Attached Storage FTW
● Disk latency is a HUGE deal for performance
SATA/SAS DAS
PCIe/NVMe DAS
When using SSDs, this filesystem type is best.
A: ZFS
B: Btrfs
C: Ext4
D: F2FS
Congratulations! You’ve reached the 1,000 ops/s milestone!
Installation and considerations
i can’t believe it’s not btrfs
● ext4 is the easiest to use (it’s on most Linux distros), but F2FS gets 5-10% gains in write performance
● if NOT using F2FS, make sure to TRIM
● multiple disks → use RAID0
This is the sweet spot for SWAP when using C*.
A: 0
B: ½ of HEAP
C: Equal to HEAP
D: EQUAL TO RAM
Installation and considerations
to swap or not to swap
Having 64G of RAM means you should optimize to have ___G of HEAP.
A: 64G
B: 32G
C: 16G
D: 8G
Installation and considerations
how much heap?
http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_tune_jvm_c.html
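As a rough illustration of the heap question, here is a minimal sketch of the default sizing rule, assuming the formula that cassandra-env.sh documents: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)). The function name is made up for this example, and a tuned production heap may well differ.

```python
def default_max_heap_mb(system_ram_mb: int) -> int:
    """Assumed default rule: max(min(RAM/2, 1 GB), min(RAM/4, 8 GB))."""
    half_capped = min(system_ram_mb // 2, 1024)      # small machines: half of RAM, at most 1 GB
    quarter_capped = min(system_ram_mb // 4, 8192)   # larger machines: a quarter of RAM, at most 8 GB
    return max(half_capped, quarter_capped)


for ram_gb in (2, 8, 64):
    heap_mb = default_max_heap_mb(ram_gb * 1024)
    print(f"{ram_gb} GB RAM -> {heap_mb} MB heap")
```

Under this assumed rule the heap stops growing at 8 GB, and the rest of the RAM is left to the OS page cache.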
Definitely DO NOT use this snitch in Multi-DC environments.
A: EC2Snitch
B: Dynamic Snitch
C: Simple Snitch
D: Property File Snitch
Installation and considerations
son of a snitch
To reduce latency and wire time to my app, I should opt for:
A: Synchronous AND Full Queries
B: Asynchronous AND Prepared Statements
C: Synchronous AND Prepared Statements
D: Asynchronous AND Full Queries
Achieving performance through code/drivers
should I stay or should I go
● Client writes to any Cassandra node
● Coordinator node replicates to other nodes (in local and remote Data Center)
● Local write acks returned to coordinator
● Client gets ack when enough total nodes are committed
● Data written to internal commit log disks
● When data arrives, remote node replicates data
MULTI DC
● Ack direct to source region coordinator
● Remote copies written to commit log disks
If a node or region goes offline, hinted handoff completes the write when the node comes back up (as long as there are enough nodes to satisfy consistency level).
Prepare ONCE...
Bind and Execute multiple times.
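A minimal sketch of "prepare once, bind and execute many times", combined with asynchronous execution. It assumes `session` behaves like a DataStax Python driver Session (its `prepare` and `execute_async` methods); the helper name `insert_many` is made up for this example, and the `users` table is the one defined later in this deck.

```python
def insert_many(session, rows):
    """Prepare once, then bind and execute asynchronously for every row.

    `session` is assumed to be a driver Session-like object exposing
    prepare() and execute_async(); it is passed in rather than created here.
    """
    # Prepare ONCE: the query text is parsed a single time and reused.
    insert = session.prepare(
        "INSERT INTO users (user, email, state) VALUES (?, ?, ?)"
    )
    # Bind and execute many times without waiting on each round trip.
    futures = [session.execute_async(insert, row) for row in rows]
    # result() blocks until that request completes (or raises on failure).
    return [future.result() for future in futures]
```

After preparation only the statement id and the bound values go over the wire, and the asynchronous calls let many requests be in flight at once.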
With RF=2 and CL=Quorum, operations failed when 1 node went down because of this.
A: 1 / 1 = 1
B: 2 / 1 = 2
C: 2 * 1 = 2
D: 2 / 2 + 1 = 2
Congratulations! You’ve reached the 32,000 ops/s milestone!
Achieving performance through code/drivers
when friends aren’t enough
Replication Factor = 3
Insert into a cluster of size 6 with consistency Quorum
Two nodes in token range must be present for write to succeed
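The arithmetic behind that statement can be sketched as follows, using the usual quorum rule of floor(RF / 2) + 1. The function names are made up for this example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for CL=QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM is still reachable."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, can lose {tolerated_failures(rf)} replica(s)")
```

With RF=3 the quorum is 2, so one replica of a token range can be down. With RF=2 the quorum is also 2, so losing a single replica already blocks QUORUM operations.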
What happens now?
Cannot achieve consistency level QUORUM
Nodes in partition key DOWN
This mathematical and CS concept helps when data modeling for query optimization.
A: Truth table
B: Brewer’s Theorem
C: CAP Theorem
D: Entropy
Data modelling, CQLSH and more
the truth shall set you free
Motivated by CS, Math, Engineering
Allows for creating building blocks that yield a single output
More complex truth tables can arise
How about searching for username?
And what about full_name?
user_stream
← ← ← Partition Key → → →
user_id  username  full_name
1        0         0
0        1         0
0        0         1
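One possible reading of the table above: each row marks which column is known at query time, and each such access pattern would want a table partitioned by that column. The sketch below enumerates the full truth table and picks out the single-column rows. The helper name and this interpretation are assumptions, not something the slide spells out.

```python
from itertools import product

COLUMNS = ("user_id", "username", "full_name")


def access_patterns(columns):
    """Yield (bits, known_columns) for every row of the truth table."""
    for bits in product((0, 1), repeat=len(columns)):
        known = tuple(col for col, bit in zip(columns, bits) if bit)
        yield bits, known


# Rows with exactly one 1 correspond to the three rows shown on the slide.
single_key = [known[0] for bits, known in access_patterns(COLUMNS) if sum(bits) == 1]
for col in single_key:
    print(f"query by {col} -> needs a table partitioned by {col}")
```

Rows with more than one known column would be candidates for composite keys.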
A shift in paradigms: what should you maximize and reduce for best performance?
A: Reads / Batches
B: Writes / Batches
C: Writes / Deletes
D: Reads / Deletes
Data modelling, CQLSH and more
do the write thing
memtable --- < 100ns
commit log --- ~ 1 ms
DELETES / TTL cause compactions
To avoid hitting the 2B record limit (per row), this term borrowed from RDBMS can still make sense.
A: ACID
B: Vector
C: Rollback
D: Sharding
Data modelling, CQLSH and more
sit on this and rotate
Many say to use sparingly, I would say, avoid like the plague.
A: Batches
B: Synchronous
C: Secondary Indexes
D: MySQL
Performance must-haves
never be second best
writes are distributed among the cluster
each partition key refers to one exact position in which to get a row
but what do we do when we don’t have exactly the right type of index to specify a query?

CREATE TABLE users (
  user varchar,
  email varchar,
  state varchar,
  PRIMARY KEY (user)
);

-- OPTION 1 : create an index
CREATE INDEX idxUBS ON users (state);

-- OPTION 2 : create another table (store data twice)
CREATE TABLE usersByState (
  state varchar,
  user varchar,
  PRIMARY KEY (state, user)
);
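A minimal sketch of what OPTION 2 means for application code: every write goes to both tables, and the by-state lookup then reads from the second table. It assumes `session` behaves like a DataStax Python driver Session (`execute` with `%s` placeholders, rows with attribute access). The helper names `save_user` and `users_in_state` are made up for this example.

```python
def save_user(session, user, email, state):
    """Store the data twice: once per query pattern (OPTION 2)."""
    # Main table, keyed by user.
    session.execute(
        "INSERT INTO users (user, email, state) VALUES (%s, %s, %s)",
        (user, email, state),
    )
    # Lookup table, partitioned by state.
    session.execute(
        "INSERT INTO usersByState (state, user) VALUES (%s, %s)",
        (state, user),
    )


def users_in_state(session, state):
    """Read from usersByState using its partition key."""
    rows = session.execute(
        "SELECT user FROM usersByState WHERE state = %s", (state,)
    )
    return [row.user for row in rows]
```

The cost is an extra write per insert, which fits the advice in this deck to favour writes.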
This addition to C* can help with ACID-like transactions, at a bit of a performance hit.
A: UDT
B: Lightweight Transactions
C: JSON
D: Hinted handoff
Performance must-haves
slimfast agreement
Proposer
Prepares a proposal that is sent to a number of Acceptors.
Waits on an acknowledgement (in the form of a promise) from Acceptors.
Sends accept message to a Quorum of Acceptors with the new value to commit.
Returns success/completion to client.
Acceptor
Determines if proposal is newer than what it has seen.
Acknowledges/agrees with its own highest proposal value seen AND the current value (of what is to be set).
Receives message to commit new value.
Accepts and returns on successful commit of value.
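The two roles above can be sketched as a tiny in-memory, single-value Paxos round. This is a teaching sketch under simplifying assumptions (no networking, no failures, integer ballots chosen by the caller), not how Cassandra implements lightweight transactions. The class and function names are made up for this example.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted_ballot = -1   # ballot of the value accepted so far
        self.accepted_value = None

    def prepare(self, ballot):
        """Promise only if the proposal is newer than anything seen."""
        if ballot <= self.promised:
            return None
        self.promised = ballot
        return (self.accepted_ballot, self.accepted_value)

    def accept(self, ballot, value):
        """Accept the value unless a newer ballot was promised meanwhile."""
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return True


def propose(acceptors, ballot, value):
    """Run one proposer round; return the chosen value, or None on failure."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, adopt the newest one.
    prior_ballot, prior_value = max(promises, key=lambda p: p[0])
    if prior_ballot >= 0:
        value = prior_value
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="alice"))  # first proposal is chosen
print(propose(acceptors, ballot=2, value="bob"))    # later round keeps the chosen value
print(propose(acceptors, ballot=1, value="carol"))  # stale ballot gets no promises
```

The extra prepare/promise round trip before the actual write is one reason these transactions cost more than a plain write.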
Thank you