C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
DESCRIPTION
A discussion of the recent work to transition Cassandra from its naive one-partition-per-node distribution to a proper virtual nodes implementation.
TRANSCRIPT
#Cassandra13
Rethinking Topology in Cassandra
Cassandra Summit, June 11, 2013
Eric Evans
@jericevans
DHT 101
DHT 101: partitioning
[Diagram: ring labeled A and Z]
DHT 101: partitioning
[Diagram: ring labeled A, B, C, …, Y, Z]
DHT 101: partitioning
[Diagram: ring labeled A, B, C, …, Y, Z]
Key = Aaa
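The following minimal Python sketch illustrates the partitioning step with a one-token-per-node ring lookup. It assumes integer tokens. It also uses a hypothetical `token_for` hash (the first 8 bytes of an MD5 digest) instead of a real Cassandra partitioner. `Ring` and `node_for` are illustrative names only.

```python
# Sketch only: a toy DHT ring, not Cassandra's partitioner or token metadata code.
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hypothetical hash: map a key to a 64-bit ring position."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Minimal DHT ring in which each node owns exactly one token (T = 1)."""

    def __init__(self, tokens: dict[int, str]) -> None:
        self.tokens = sorted(tokens)  # ring positions in ascending order
        self.owner = dict(tokens)     # token -> node name

    def node_for(self, key_token: int) -> str:
        # A node owns the range (previous token, its token], so the owner is the
        # first token >= key_token, wrapping to the lowest token past the end.
        i = bisect.bisect_left(self.tokens, key_token)
        return self.owner[self.tokens[i % len(self.tokens)]]


ring = Ring({100: "A", 200: "B", 300: "C"})
print(ring.node_for(150))  # B
print(ring.node_for(350))  # A (wraps around the ring)
print(ring.node_for(token_for("Aaa")))
```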
DHT 101: replica placement
[Diagram: ring labeled A, B, C, …, Y, Z]
Key = Aaa
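Replica placement can be sketched the same way. Start at the key's position and walk clockwise until RF distinct hosts have been collected. This loosely follows the idea behind Cassandra's SimpleStrategy but ignores racks and data centers. `replicas` and the integer tokens are assumptions for illustration. Skipping hosts that were already chosen only matters once a host owns several tokens, which is the virtual-node case.

```python
# Sketch only: SimpleStrategy-like placement, no rack or data center awareness.
import bisect


def replicas(ring: dict[int, str], key_token: int, rf: int) -> list[str]:
    """Return up to rf distinct hosts, walking clockwise from the key's position."""
    tokens = sorted(ring)
    start = bisect.bisect_left(tokens, key_token)
    chosen: list[str] = []
    for step in range(len(tokens)):
        host = ring[tokens[(start + step) % len(tokens)]]
        if host not in chosen:  # skip extra tokens of a host already chosen
            chosen.append(host)
            if len(chosen) == rf:
                break
    return chosen


ring = {100: "A", 200: "B", 300: "C", 400: "D"}
print(replicas(ring, 150, 3))  # ['B', 'C', 'D']
print(replicas(ring, 350, 3))  # ['D', 'A', 'B']
```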
DHT 101: consistency
Consistency
Availability
Partition tolerance
DHT 101 scenario: consistency level = one
[Diagram: write (W) to three replicas, labeled A, ?, ?]
DHT 101 scenario: consistency level = all
[Diagram: read (R) from three replicas, labeled A, ?, ?]
DHT 101 scenario: quorum write
[Diagram: write (W) to three replicas, labeled A, B, ?]
DHT 101 scenario: quorum read
[Diagram: read (R) from three replicas, labeled A, B, ?]
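The scenarios above rest on simple arithmetic. With replication factor RF, QUORUM means a majority of replicas (floor(RF/2) + 1). A read is guaranteed to see the latest write when the number of replicas read plus the number written exceeds RF. The sketch below checks that rule against a brute-force enumeration for RF = 3. The function names are hypothetical.

```python
# Sketch only: consistency-level arithmetic, not Cassandra's ConsistencyLevel code.
from itertools import combinations


def quorum(rf: int) -> int:
    """QUORUM: a strict majority of the replication factor."""
    return rf // 2 + 1


def overlaps(reads: int, writes: int, rf: int) -> bool:
    """Any read set intersects any write set exactly when R + W > RF."""
    return reads + writes > rf


def always_intersect(reads: int, writes: int, rf: int) -> bool:
    """Brute-force check of the same property over every possible replica subset."""
    replicas = range(rf)
    return all(
        set(r) & set(w)
        for r in combinations(replicas, reads)
        for w in combinations(replicas, writes)
    )


rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # True: quorum read after quorum write
print(overlaps(1, 1, rf))                    # False: ONE/ONE may miss the write
```

With RF = 3, an ALL read paired with a ONE write also satisfies R + W > RF. The cost is availability, because a single unreachable replica blocks the read.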
Awesome, yes?
Well...
Problem: Poor request/stream distribution
Distribution
[Diagram: ring labeled A, B, C, …, M, …, Y, Z]
Distribution
[Diagram: the same ring, with Z drawn apart from A]
Distribution
[Diagram: the same ring with an added label, A1, next to Z]
Problem: Poor data distribution
Distribution
[Diagram: ring of four nodes, A, B, C and D]
Distribution
[Diagram: the four-node ring (A, B, C, D) with a fifth node, E]
Distribution
[Diagram: five-node ring (A, B, C, D, E)]
Distribution
[Diagram: eight-node ring (A through H)]
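A small ownership calculation quantifies the data-distribution problem. The sketch assumes a toy token space of 400 positions and one token per node, which is the pre-vnode layout. Under those assumptions, adding a fifth node only splits a single neighbor's range. That pair ends up with half the load of every other node. Restoring balance means moving tokens or doubling the cluster, as in the eight-node picture. `ownership` is a hypothetical helper, not a Cassandra API.

```python
# Sketch only: toy ring ownership, assuming one token per node.
def ownership(ring: dict[int, str], ring_size: int) -> dict[str, float]:
    """Fraction of the ring owned by each host; a token owns (previous token, token]."""
    ordered = sorted(ring)
    shares: dict[str, float] = {}
    for i, tok in enumerate(ordered):
        prev = ordered[i - 1]  # for i == 0 this wraps to the last token
        span = (tok - prev) % ring_size or ring_size  # a lone token owns everything
        shares[ring[tok]] = shares.get(ring[tok], 0.0) + span / ring_size
    return shares


size = 400  # hypothetical small token space
four = {0: "A", 100: "B", 200: "C", 300: "D"}
print(ownership(four, size))               # every host owns 0.25
print(ownership({**four, 50: "E"}, size))  # E and B own 0.125; A, C, D still 0.25
eight = {t: "ABCDEFGH"[t // 50] for t in range(0, size, 50)}
print(ownership(eight, size))              # even again: 0.125 each
```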
Virtual Nodes
In a nutshell...
Benefits
● Operationally simpler (no token management)
● Better distribution of load
● Concurrent streaming (all hosts)
● Smaller partitions mean greater reliability
● Better supports heterogeneous hardware
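To illustrate the "concurrent streaming" benefit, count the hosts that share replicated ranges with a given host. Only those peers can take part in rebuilding it. The sketch below is a hypothetical model, not Cassandra's streaming code. It assumes random 64-bit tokens, SimpleStrategy-style placement and RF = 3.

```python
# Sketch only: hypothetical model of which peers share data with one host.
import random


def random_ring(hosts: list[str], t: int, seed: int = 0) -> dict[int, str]:
    """Give every host t random 64-bit tokens (random token assignment)."""
    rng = random.Random(seed)
    ring: dict[int, str] = {}
    for host in hosts:
        for _ in range(t):
            tok = rng.getrandbits(64)
            while tok in ring:  # avoid (very unlikely) token collisions
                tok = rng.getrandbits(64)
            ring[tok] = host
    return ring


def replica_sets(ring: dict[int, str], rf: int) -> list[list[str]]:
    """For each range (previous token, token], the rf distinct hosts that store it."""
    tokens = sorted(ring)
    n = len(tokens)
    sets = []
    for i in range(n):
        chosen: list[str] = []
        for step in range(n):
            host = ring[tokens[(i + step) % n]]
            if host not in chosen:
                chosen.append(host)
                if len(chosen) == rf:
                    break
        sets.append(chosen)
    return sets


def streaming_peers(ring: dict[int, str], host: str, rf: int) -> set[str]:
    """Hosts sharing at least one replicated range with `host`: the peers that
    could stream to (or from) it in parallel when it is rebuilt."""
    peers: set[str] = set()
    for replica_set in replica_sets(ring, rf):
        if host in replica_set:
            peers.update(replica_set)
    peers.discard(host)
    return peers


hosts = [f"h{i}" for i in range(12)]
for t in (1, 256):
    print(t, len(streaming_peers(random_ring(hosts, t), "h0", rf=3)))
```

With one token per host, only the two neighbors on each side share data with `h0`. With 256 tokens per host, the shared ranges typically spread across every other host in the cluster, so a rebuild can draw on all of them at once.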
Strategies
● Automatic sharding
● Fixed partition assignment
● Random token assignment
Strategy: automatic sharding
● Partitions are split when data exceeds a threshold
● Newly created partitions are relocated to a host with less data
● Similar to Bigtable or MongoDB auto-sharding
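Below is a toy model of automatic sharding. It assumes that a partition's data is spread evenly over its key range. It also treats "less data" as the smallest total size. `Partition` and `rebalance` are hypothetical names. Real systems such as Bigtable or MongoDB add split-point selection, migrations and metadata services, all of which this sketch ignores.

```python
# Sketch only: toy auto-sharding, assuming data is uniform within each key range.
from dataclasses import dataclass


@dataclass
class Partition:
    start: int  # inclusive lower bound of the key range
    end: int    # exclusive upper bound of the key range
    size: int   # amount of data stored, in arbitrary units


def load(parts: list[Partition]) -> int:
    return sum(p.size for p in parts)


def rebalance(hosts: dict[str, list[Partition]], threshold: int) -> None:
    """Split any partition whose data exceeds `threshold` and move the new half
    to the host that currently stores the least data."""
    changed = True
    while changed:
        changed = False
        for host in list(hosts):
            for p in list(hosts[host]):
                if p.size > threshold and p.end - p.start > 1:
                    mid = (p.start + p.end) // 2
                    new = Partition(mid, p.end, p.size - p.size // 2)
                    p.end, p.size = mid, p.size // 2
                    target = min(hosts, key=lambda h: load(hosts[h]))
                    hosts[target].append(new)
                    changed = True


cluster = {"A": [Partition(0, 1024, 1000)], "B": [], "C": []}
rebalance(cluster, threshold=300)
for host, parts in cluster.items():
    print(host, [(p.start, p.end, p.size) for p in parts])
```

Every split keeps partition size bounded by the threshold, but the number of partitions keeps growing with the data.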
Strategy: fixed partition assignment
● Namespace divided into Q evenly-sized partitions
● Q/N partitions assigned per host (where N is the number of hosts)
● Joining hosts “steal” partitions evenly from existing hosts
● Used by Dynamo/Voldemort (“strategy 3” in the Dynamo paper)
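The stealing step can be sketched as follows. The sketch assumes partitions are just integer IDs 0..Q-1 and that a joining host takes its share from the most heavily loaded hosts first. It illustrates the strategy described above and is not Voldemort's or Dynamo's actual rebalancing code. `add_host` is a hypothetical helper.

```python
# Sketch only: fixed partition assignment with a hypothetical stealing rule.
def add_host(assign: dict[str, list[int]], new_host: str) -> None:
    """A joining host steals partitions one at a time from whichever host
    currently holds the most, until it has its fair share Q // N."""
    q = sum(len(parts) for parts in assign.values())
    assign[new_host] = []
    fair_share = q // len(assign)
    while len(assign[new_host]) < fair_share:
        donor = max(assign, key=lambda h: len(assign[h]))
        assign[new_host].append(assign[donor].pop())


Q = 12  # number of fixed, equally sized partitions, chosen up front
cluster = {"A": list(range(Q))}
for host in ("B", "C", "D"):
    add_host(cluster, host)
print({h: len(p) for h, p in cluster.items()})  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}
```

Because Q is fixed, each partition simply holds more data as the data set grows, which is the O(B) partition size noted in the evaluation.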
Strategy: random token assignment
● Each host assigned T random tokens
● T random tokens generated for joining hosts; new tokens divide existing ranges
● Similar to libketama; identical to classic Cassandra when T = 1
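Random token assignment is simple enough to simulate directly. The sketch assumes a 64-bit token space. It measures imbalance as the largest host's share divided by a fair 1/N share. `join`, `ownership` and `imbalance` are hypothetical helpers, not Cassandra APIs.

```python
# Sketch only: simulated random token assignment over an assumed 64-bit space.
import random

RING_SIZE = 2**64  # assumed token space


def join(ring: dict[int, str], host: str, t: int, rng: random.Random) -> None:
    """Generate t random tokens for a joining host; each one splits an existing range."""
    for _ in range(t):
        tok = rng.randrange(RING_SIZE)
        while tok in ring:
            tok = rng.randrange(RING_SIZE)
        ring[tok] = host


def ownership(ring: dict[int, str]) -> dict[str, float]:
    """Fraction of the token space owned by each host; a token owns (previous, token]."""
    ordered = sorted(ring)
    shares = dict.fromkeys(ring.values(), 0.0)
    for i, tok in enumerate(ordered):
        span = (tok - ordered[i - 1]) % RING_SIZE or RING_SIZE
        shares[ring[tok]] += span / RING_SIZE
    return shares


def imbalance(ring: dict[int, str]) -> float:
    """Largest share divided by the fair share 1/N; 1.0 means perfectly even."""
    shares = ownership(ring)
    return max(shares.values()) * len(shares)


for t in (1, 256):  # t = 1 is the classic one-token-per-node layout
    rng = random.Random(42)
    ring: dict[int, str] = {}
    for i in range(10):
        join(ring, f"host{i}", t, rng)
    print(t, round(imbalance(ring), 2))
```

With T = 1, random placement tends to leave some hosts owning several times their fair share. With T = 256, the shares usually land close to even.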
Considerations
1. Number of partitions
2. Partition size
3. How (1) changes with more nodes and data
4. How (2) changes with more nodes and data
Evaluating

| Strategy | No. of partitions | Partition size |
|---|---|---|
| Random | O(N) | O(B/N) |
| Fixed | O(1) | O(B) |
| Auto-sharding | O(B) | O(1) |

(N = number of hosts, B = data size)
Evaluating: automatic sharding
● Partition size is constant (great)
● Number of partitions scales linearly with data size (bad)
Evaluating: fixed partition assignment
● Number of partitions is constant (good)
● Partition size scales linearly with data size (bad)
● Greater operational complexity (bad)
Evaluating: random token assignment
● Number of partitions scales linearly with the number of hosts (OK)
● Partition size increases with more data and decreases with more hosts (good)
Evaluating
● Automatic sharding
● Fixed partition assignment
● Random token assignment
Cassandra
Configuration: conf/cassandra.yaml

    # Comma separated list of tokens,
    # (new installs only).
    initial_token: <token>,<token>,<token>

or

    # Number of tokens to generate.
    num_tokens: 256
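When `initial_token` is given several tokens per node, the lists have to come from somewhere. The following is a minimal sketch that assumes Murmur3Partitioner's token range of -2^63 to 2^63 - 1 and evenly spaced, interleaved tokens. `initial_tokens` is a hypothetical helper. In practice, setting `num_tokens` and letting Cassandra generate random tokens avoids this bookkeeping entirely.

```python
# Sketch only: hypothetical generator for evenly spaced initial_token lists.
MIN_TOKEN = -(2**63)  # Murmur3Partitioner tokens span [-2**63, 2**63 - 1]
RANGE = 2**64


def initial_tokens(num_nodes: int, tokens_per_node: int) -> list[str]:
    """One comma-separated initial_token value per node. Tokens are evenly spaced
    over the whole range and interleaved, so ring neighbors belong to different nodes."""
    total = num_nodes * tokens_per_node
    values = []
    for node in range(num_nodes):
        tokens = [
            MIN_TOKEN + (k * num_nodes + node) * RANGE // total
            for k in range(tokens_per_node)
        ]
        values.append(",".join(str(tok) for tok in tokens))
    return values


for node, value in enumerate(initial_tokens(num_nodes=3, tokens_per_node=4)):
    print(f"node {node}: initial_token: {value}")
```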
Configuration: nodetool info

    Token            : (invoke with -T/--tokens to see all 256 tokens)
    ID               : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5
    Gossip active    : true
    Thrift active    : true
    Load             : 42.92 MB
    Generation No    : 1370016307
    Uptime (seconds) : 221
    Heap Memory (MB) : 998.72 / 1886.00
    Data Center      : datacenter1
    Rack             : rack1
    Exceptions       : 0
    Key Cache        : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...
    Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...
Configuration: nodetool ring

    Datacenter: datacenter1
    ==========
    Replicas: 0

    Address    Rack   Status  State   Load      Owns    Token
                                                        3074457345618258602
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  -9223372036854775808
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3098476543630901247
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3122495741643543892
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3146514939656186537
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3170534137668829183
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3194553335681471828
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3218572533694114473
    127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3242591731706757118
    ...
Configuration: nodetool status

    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns   Host ID                               Rack
    UN  127.0.0.1  42.92 MB  256     33.3%  6a8dc22c-1f37-473f-8f7e-47742f4b83a5  rack1
    UN  127.0.0.2  60.17 MB  256     33.3%  26263a2b-768e-4a79-8d41-3624a14b13a8  rack1
    UN  127.0.0.3  56.85 MB  256     33.3%  5b3e208f-6d36-4c7b-b2bb-b7c476a1af66  rack1
Migration
[Diagram: ring labeled A, B, D]
Migration: edit conf/cassandra.yaml and restart

    # Number of tokens to generate.
    num_tokens: 256
Migration: convert to T contiguous tokens in existing ranges
[Diagram: ring with runs of contiguous A tokens alongside B and C]
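The conversion step can be modeled as splitting each node's existing range into T contiguous sub-ranges. Ownership does not change, so no data has to move yet. The shuffle that follows is what redistributes the tokens. This sketch assumes integer tokens on a ring of `ring_size` positions. `split_range` and `convert` are hypothetical helpers, not the code Cassandra runs on restart.

```python
# Sketch only: hypothetical model of converting one token into T contiguous tokens.
def split_range(prev_token: int, token: int, t: int, ring_size: int) -> list[int]:
    """Replace one token with t tokens inside its existing range (prev_token, token].
    The last new token equals the old one, so the node keeps exactly the same keys."""
    width = (token - prev_token) % ring_size or ring_size  # a lone token owns everything
    return [(prev_token + width * (i + 1) // t) % ring_size for i in range(t)]


def convert(ring: dict[int, str], t: int, ring_size: int) -> dict[int, str]:
    """Convert a one-token-per-node ring to t contiguous tokens per node."""
    ordered = sorted(ring)
    converted: dict[int, str] = {}
    for i, tok in enumerate(ordered):
        for sub in split_range(ordered[i - 1], tok, t, ring_size):
            converted[sub] = ring[tok]
    return converted


old = {0: "A", 400: "B", 700: "C"}
print(sorted(convert(old, t=4, ring_size=1000).items()))
```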
Migration: shuffle
[Diagram: the same ring of contiguous A, B and C tokens]
Shuffle
● Range transfers are queued on each host
● Hosts initiate transfer to self
● Pay attention to the logs!
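The following is a rough model of the shuffle. It assumes that shuffling randomly reassigns tokens while keeping each host's token count equal. Only the structure of one queue per receiving host, with hosts pulling ranges to themselves, comes from the slide. `plan_shuffle` and `apply_shuffle` are hypothetical names, not the internals of the `shuffle` tool.

```python
# Sketch only: hypothetical shuffle model, not Cassandra's shuffle implementation.
import random
from collections import Counter, defaultdict


def plan_shuffle(ring: dict[int, str], seed: int = 0) -> dict[str, list[int]]:
    """Choose a new random owner for every token, keeping per-host token counts
    equal, and queue each relocation on the host that will receive it."""
    rng = random.Random(seed)
    hosts = sorted(set(ring.values()))
    tokens = sorted(ring)
    rng.shuffle(tokens)
    queues: dict[str, list[int]] = defaultdict(list)
    for i, tok in enumerate(tokens):
        new_owner = hosts[i % len(hosts)]  # deal shuffled tokens round-robin
        if new_owner != ring[tok]:
            queues[new_owner].append(tok)
    return dict(queues)


def apply_shuffle(ring: dict[int, str], queues: dict[str, list[int]]) -> dict[int, str]:
    """Each host works through its own queue, pulling ranges to itself."""
    result = dict(ring)
    for host, pending in queues.items():
        for tok in pending:
            result[tok] = host
    return result


# Three hosts, each with 8 contiguous tokens, as after the conversion step.
ring = {tok: "ABC"[tok // 8] for tok in range(24)}
queues = plan_shuffle(ring)
print({h: len(q) for h, q in queues.items()})  # pending relocations per host
print(Counter(apply_shuffle(ring, queues).values()))
```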
Shuffle

    Usage: shuffle [options] <sub-command>

    Sub-commands:
      create        Initialize a new shuffle operation
      ls            List pending relocations
      clear         Clear pending relocations
      en[able]      Enable shuffling
      dis[able]     Disable shuffling

    Options:
      -dc, --only-dc        Apply only to named DC (create only)
      -u, --username        JMX username
      -tp, --thrift-port    Thrift port number (Default: 9160)
      -p, --port            JMX port number (Default: 7199)
      -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
      -en, --and-enable     Immediately enable shuffling (create only)
      -pw, --password       JMX password
      -H, --help            Print help information
      -h, --host            JMX hostname or IP address (Default: localhost)
      -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
Performance
removenode
[Chart: Cassandra 1.2 vs. Cassandra 1.1; axis ticks 0 to 450, units not shown]
bootstrap
[Chart: Cassandra 1.2 vs. Cassandra 1.1; axis ticks 0 to 600, units not shown]
The End
● DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
● Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
● Overton, Sam. “Virtual Nodes Strategies.” Web.
● Overton, Sam. “Virtual Nodes: Performance Results.” Web.
● Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.