Cassandra and Solid State Drives
TRANSCRIPT
![Page 1: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/1.jpg)
CASSANDRA & SOLID STATE DRIVES
Rick Branson, DataStax
![Page 2: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/2.jpg)
FACT
CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED
FOR SPINNING DISKS
![Page 3: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/3.jpg)
LSM-TREES
![Page 4: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/4.jpg)
WRITE PATH
![Page 5: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/5.jpg)
Client -> Cassandra Node
insert({ cf1: { row1: { col3: foo } } })
On-Disk Commit Log
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }
In-Memory Memtable for “cf1”
row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”
COMMIT
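The write path on this slide can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, its method names, and the `TOMBSTONE` marker are hypothetical, and the commit log is an in-memory list standing in for an append-only file.

```python
TOMBSTONE = "<del>"  # hypothetical marker for a deleted column


class Node:
    """Toy write path: append to the commit log, then update the memtable."""

    def __init__(self):
        self.commit_log = []  # stands in for the on-disk, append-only log
        self.memtables = {}   # column family -> {row -> {column -> value}}

    def insert(self, cf, row, column, value):
        # 1. Sequential append to the commit log (durability).
        self.commit_log.append((cf, row, column, value))
        # 2. In-memory update; the newest value for a column wins.
        self.memtables.setdefault(cf, {}).setdefault(row, {})[column] = value

    def delete(self, cf, row, column):
        # A delete is just another write that records a tombstone.
        self.insert(cf, row, column, TOMBSTONE)


# Replay the five mutations shown in the commit log on the slide.
node = Node()
node.insert("cf1", "row1", "col1", "abc")
node.insert("cf1", "row1", "col2", "def")
node.delete("cf1", "row1", "col1")
node.insert("cf1", "row2", "col1", "xyz")
node.insert("cf1", "row1", "col3", "foo")

print(node.memtables["cf1"])
```

The resulting memtable matches the slide: `row1` holds a tombstone for `col1` plus `col2` and `col3`, and `row2` holds `col1`.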
![Page 6: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/6.jpg)
SSTable 1, SSTable 2, SSTable 3, SSTable 4
In-Memory Memtable for “cf1”
row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”
FLUSH
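A flush can be pictured as writing the memtable out once, in sorted order. The sketch below is illustrative only: `flush` is a hypothetical helper, and a tuple stands in for the on-disk file because it cannot be modified after creation.

```python
def flush(memtable):
    """Return the memtable's contents as a sorted, immutable sequence.

    Each entry is (row, column, value). Sorting by row and then by column
    is what lets the data be written in a single sequential pass.
    """
    return tuple(
        (row, column, value)
        for row in sorted(memtable)
        for column, value in sorted(memtable[row].items())
    )


# The memtable from the slide, deliberately entered out of order.
memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col3": "foo", "col1": "<del>", "col2": "def"},
}

sstable = flush(memtable)
for entry in sstable:
    print(entry)
```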
![Page 7: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/7.jpg)
SSTable 1, SSTable 2, SSTable 3, SSTable 4
SSTables are merged to maintain read performance
COMPACT
SSTable (merged)
![Page 8: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/8.jpg)
New SSTable is streamed to disk and old SSTables are erased
Old SSTables: X X X X
![Page 9: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/9.jpg)
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
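Because every SSTable is already sorted, compaction is a merge rather than a sort. A minimal sketch using Python's `heapq.merge`; the `compact` function and the flat `"row:column"` keys are assumptions made for illustration, and tombstones are kept rather than purged. For a fixed number of input tables the merge does work proportional to the total number of entries.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables (ordered oldest to newest) into one sorted table.

    Each SSTable is a list of (key, value) pairs sorted by key.
    When the same key appears in several tables, the newest value wins.
    """
    def tagged(age, table):
        # Negate the age so that, for equal keys, the newest table sorts first.
        for key, value in table:
            yield key, -age, value

    streams = [tagged(age, table) for age, table in enumerate(sstables)]

    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams):
        if key != last_key:  # first occurrence of a key is the newest version
            merged.append((key, value))
            last_key = key
    return merged


old = [("row1:col1", "abc"), ("row1:col2", "def")]
new = [("row1:col1", "<del>"), ("row2:col1", "xyz")]

print(compact([old, new]))
```

The inputs are only read and the output is built fresh, which mirrors the immutability point above: old tables can be dropped whole once the new one is complete.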
![Page 10: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/10.jpg)
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
IMPORTANT
![Page 11: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/11.jpg)
COMPARED
• Most popular data storage engines rewrite modified data in place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.
• Most perform similar buffering of writes before flushing to disk
• ... but flushes are RANDOM writes
![Page 12: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/12.jpg)
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Random access is limited by the time it takes the platter to rotate: IOPS ≈ RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec for modern SATA drives
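The rule of thumb on this slide is easy to check. In the sketch below, the 4 KB request size is an assumption (borrowed from the flash page size quoted later in the deck), used only to put random and sequential throughput on the same scale.

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: about one random operation per revolution."""
    return rpm / 60


SEQUENTIAL_MB_PER_SEC = 125  # from the slide
REQUEST_KB = 4               # assumed request size, not from this slide

for rpm in (7200, 15000):
    iops = rotational_iops(rpm)
    random_mb_per_sec = iops * REQUEST_KB / 1000
    print(
        f"{rpm} RPM: ~{iops:.0f} IOPS, ~{random_mb_per_sec:.2f} MB/sec random "
        f"vs {SEQUENTIAL_MB_PER_SEC} MB/sec sequential"
    )
```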
![Page 13: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/13.jpg)
THAT WAS THE WORLD IN WHICH CASSANDRA
WAS BUILT
![Page 14: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/14.jpg)
2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
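The SSD side of the cost-per-IOPS comparison can be reproduced from the numbers on this slide. The spinning-disk side cannot, because the deck does not say which drive the $1.25 figure refers to; the last two lines only work backwards to the drive price that figure would imply at each rotation speed, which is an inference rather than something the slides state.

```python
SSD_PRICE_PER_GB = 1.75   # from the slide
SSD_CAPACITY_GB = 300     # from the footnote
SSD_READ_IOPS = 39_500    # from the slide
SSD_WRITE_IOPS = 23_000   # from the slide
HDD_COST_PER_IOPS = 1.25  # from the slide; underlying drive not stated

ssd_price = SSD_PRICE_PER_GB * SSD_CAPACITY_GB
ssd_cost_per_write_iops = ssd_price / SSD_WRITE_IOPS
ssd_cost_per_read_iops = ssd_price / SSD_READ_IOPS

print(f"SSD price: ${ssd_price:.2f}")
print(f"SSD cost per write IOPS: ${ssd_cost_per_write_iops:.3f}")
print(f"SSD cost per read IOPS:  ${ssd_cost_per_read_iops:.3f}")

# Inference only: the drive price implied by $1.25/IOPS at 120 and 250 IOPS.
for hdd_iops in (120, 250):
    print(f"Implied HDD price at {hdd_iops} IOPS: ${HDD_COST_PER_IOPS * hdd_iops:.2f}")
```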
![Page 15: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/15.jpg)
WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S
LSM-TREES OBSOLETE?
![Page 16: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/16.jpg)
![Page 17: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/17.jpg)
SOLID STATE HAS SOME MAJOR BUTS...
![Page 18: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/18.jpg)
... BUT
• Cannot overwrite directly: must erase first, then write
• Can write in small increments (4KB), but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC) for each erase block
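A few derived numbers make these constraints concrete. The sketch uses only figures from the slides; the endurance total assumes the 300GB drive from the earlier footnote and an ideal write amplification of 1, so it is an upper bound, not a prediction.

```python
PAGE_KB = 4            # smallest write unit
ERASE_BLOCK_KB = 512   # smallest erase unit
WRITE_US = 100         # write latency in microseconds
ERASE_US = 2000        # erase latency (2 ms) in microseconds
ERASE_CYCLES = 5000    # MLC endurance per erase block
DRIVE_GB = 300         # drive size from the earlier footnote

pages_per_block = ERASE_BLOCK_KB // PAGE_KB
erase_vs_write = ERASE_US / WRITE_US
ideal_endurance_tb = DRIVE_GB * ERASE_CYCLES / 1000

print(f"Pages per erase block: {pages_per_block}")
print(f"An erase takes {erase_vs_write:.0f}x as long as a write")
print(f"Best-case total writes before wear-out: {ideal_endurance_tb:.0f} TB")
```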
![Page 19: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/19.jpg)
WEAR LEVELING is used to spread erase operations evenly across blocks, so no single block wears out early
![Page 20: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/20.jpg)
WEAR LEVELING
![Page 21: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/21.jpg)
WEAR LEVELING
Erase Block
![Page 22: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/22.jpg)
WEAR LEVELING
![Page 23: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/23.jpg)
WEAR LEVELING
![Page 24: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/24.jpg)
WEAR LEVELING
Disk Page
![Page 25: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/25.jpg)
WEAR LEVELING
Write 1
![Page 26: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/26.jpg)
WEAR LEVELING
Write 1
Write 2
![Page 27: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/27.jpg)
WEAR LEVELING
Write 1
Write 2
Write 3
![Page 28: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/28.jpg)
Write 1
Write 2
Write 3
How is data from only Write 2 modified?
Remember: the whole block must be erased
![Page 29: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/29.jpg)
Mark Garbage
![Page 30: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/30.jpg)
Mark Garbage
Append Modified Data
Empty Block
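The "mark garbage, append modified data" step can be modeled with a toy flash layer. Everything here is hypothetical (the `ToyFlash` class, the page states, and the tiny 3-page blocks chosen so the first block fills after three writes); real controllers are far more elaborate.

```python
class ToyFlash:
    """Toy flash layer: a page is never overwritten in place."""

    def __init__(self, num_blocks, pages_per_block):
        # Each page is a (state, data) pair: "empty", "live" or "garbage".
        self.blocks = [
            [("empty", None)] * pages_per_block for _ in range(num_blocks)
        ]
        self.location = {}  # logical name -> (block index, page index)

    def _next_empty(self):
        for b, block in enumerate(self.blocks):
            for p, (state, _) in enumerate(block):
                if state == "empty":
                    return b, p
        raise RuntimeError("no empty pages left: garbage collection required")

    def write(self, name, data):
        if name in self.location:
            # Modifying existing data: mark the old page as garbage.
            b, p = self.location[name]
            self.blocks[b][p] = ("garbage", self.blocks[b][p][1])
        # Append the new version to the next empty page.
        b, p = self._next_empty()
        self.blocks[b][p] = ("live", data)
        self.location[name] = (b, p)


flash = ToyFlash(num_blocks=2, pages_per_block=3)
for name in ("write1", "write2", "write3"):
    flash.write(name, name + " v1")

flash.write("write2", "write2 v2")  # modify only Write 2

for index, block in enumerate(flash.blocks):
    print(index, block)
```

After the update, the first block still holds the stale copy of Write 2 as garbage, and the new version sits in the previously empty block.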
![Page 31: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/31.jpg)
Wait... GARBAGE?
![Page 32: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/32.jpg)
THAT MEANS...
![Page 33: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/33.jpg)
... fragmentation, WHICH MEANS...
![Page 34: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/34.jpg)
Garbage Collection!
![Page 35: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/35.jpg)
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations are a drag on performance
• Modern SSDs do this in the background... as much as possible
• If no empty blocks are available, GC must be done before ANY writes can complete
![Page 36: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/36.jpg)
WRITE AMPLIFICATION
• Occurs when only a few kilobytes are written, but fragmentation forces a whole block to be rewritten
• The smaller & more random the writes, the worse this gets
• Modern “mark and sweep” GC reduces it, but cannot eliminate it
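Write amplification is commonly expressed as the ratio of data physically written to flash to data the host asked to write. A worked worst case, using the 4KB page and ~512KB erase block sizes quoted earlier in the deck; the function name is just for illustration.

```python
def write_amplification(host_kb, flash_kb):
    """Ratio of data written to flash to data written by the host."""
    return flash_kb / host_kb


# Worst case: a single 4 KB update forces a whole 512 KB block to be rewritten.
worst_case = write_amplification(host_kb=4, flash_kb=512)

# Ideal case: the drive writes exactly what the host wrote.
ideal = write_amplification(host_kb=512, flash_kb=512)

print(f"worst case: {worst_case:.0f}x, ideal: {ideal:.0f}x")
```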
![Page 37: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/37.jpg)
Torture test shows massive write performance drop-off for heavily fragmented drive
Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
![Page 38: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/38.jpg)
Some poorly designed drives COMPLETELY fall apart
Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
![Page 39: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/39.jpg)
Even a well-behaved drive suffers significantly from the torture test
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
![Page 40: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/40.jpg)
Post-torture, all disk blocks were marked empty, and the “fast” comes back...
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
![Page 41: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/41.jpg)
![Page 42: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/42.jpg)
“TRIM”
• Filesystems typically don’t erase data immediately when files are deleted; they just mark it as deleted and erase later
• TRIM allows the OS to actively tell the drive when a region of disk is no longer used
• If an entire erase block is marked as unused, GC is avoided; otherwise TRIM just hastens the collection process
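One way to picture the effect of TRIM is to count how many pages garbage collection must copy elsewhere before it can erase a block. The helper below is a hypothetical model, not a drive or OS API.

```python
def pages_to_copy(block, trimmed):
    """Count the pages GC must relocate before erasing `block`.

    block   -- logical page ids stored in the erase block (None = garbage)
    trimmed -- ids the OS has reported as no longer in use
    """
    return sum(1 for page in block if page is not None and page not in trimmed)


block = ["a", "b", None, "c"]

print(pages_to_copy(block, trimmed=set()))            # 3
print(pages_to_copy(block, trimmed={"a"}))            # 2
print(pages_to_copy(block, trimmed={"a", "b", "c"}))  # 0
```

When the count reaches zero, the block can be erased without copying anything, which is the case the last bullet describes.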
![Page 43: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/43.jpg)
TRIM only reduces the write amplification effect; it can’t eliminate it.
![Page 44: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/44.jpg)
THEN THERE’S LIFETIME...
![Page 45: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/45.jpg)
![Page 46: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/46.jpg)
![Page 47: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/47.jpg)
AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
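The link between write amplification and lifetime follows from simple arithmetic. In this sketch the drive size and erase-cycle count come from earlier slides, but the host write rate of 100 GB/day is a made-up figure, so the absolute results are illustrative only; the point is the ratio between the two cases.

```python
def lifetime_years(capacity_gb, erase_cycles, host_gb_per_day, write_amplification):
    """Rough drive lifetime, assuming wear is spread evenly over all blocks."""
    total_flash_gb = capacity_gb * erase_cycles
    flash_gb_per_day = host_gb_per_day * write_amplification
    return total_flash_gb / flash_gb_per_day / 365


HOST_GB_PER_DAY = 100  # hypothetical workload, not from the slides

amplified = lifetime_years(300, 5000, HOST_GB_PER_DAY, write_amplification=10)
sequential = lifetime_years(300, 5000, HOST_GB_PER_DAY, write_amplification=1)

print(f"10x write amplification: {amplified:.1f} years")
print(f"no write amplification:  {sequential:.1f} years")
```

Under the same host load, removing a 10x write amplification extends the estimated lifetime tenfold.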
![Page 48: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/48.jpg)
REMEMBER THIS?
![Page 49: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/49.jpg)
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
![Page 50: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/50.jpg)
CASSANDRA ONLY WRITES
SEQUENTIALLY
![Page 51: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/51.jpg)
“For a sequential write workload, write amplification is equal to 1,
i.e., there is no write amplification.”
Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
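The quoted result can be illustrated with a small simulation. This is not the model from the cited paper: the greedy victim selection, the block and page counts, and the function name are all assumptions chosen to keep the sketch short.

```python
import random


def simulate_write_amplification(writes, num_blocks=16, pages_per_block=8):
    """Toy flash layer with greedy GC; returns flash writes / host writes."""
    if len(set(writes)) >= num_blocks * pages_per_block:
        raise ValueError("need spare capacity: fewer logical than physical pages")

    blocks = [[] for _ in range(num_blocks)]  # logical pages programmed per block
    where = {}                                # logical page -> (block, slot)
    free = list(range(1, num_blocks))         # erased blocks not yet in use
    active = 0                                # block currently being filled
    flash_writes = 0

    def valid(b):
        # A slot is valid only if the mapping still points at it.
        return [lp for slot, lp in enumerate(blocks[b]) if where.get(lp) == (b, slot)]

    for lp in writes:
        pending = [lp]
        if len(blocks[active]) == pages_per_block:  # active block is full
            if free:
                active = free.pop(0)
            else:
                # Greedy GC: erase the block with the fewest valid pages and
                # rewrite its survivors before the host write.
                victim = min(range(num_blocks), key=lambda b: len(valid(b)))
                pending = valid(victim) + [lp]
                blocks[victim] = []
                active = victim
        for page in pending:
            where[page] = (active, len(blocks[active]))
            blocks[active].append(page)
            flash_writes += 1

    return flash_writes / len(writes)


LOGICAL_PAGES = 112  # 14 of the 16 blocks; the rest is spare capacity

sequential = list(range(LOGICAL_PAGES)) * 10
rng = random.Random(0)
scattered = [rng.randrange(LOGICAL_PAGES) for _ in range(5000)]

sequential_wa = simulate_write_amplification(sequential)
random_wa = simulate_write_amplification(scattered)
print(f"sequential: {sequential_wa:.2f}x, random: {random_wa:.2f}x")
```

With sequential overwrites, whole blocks become garbage at once, so the collector always finds a block with nothing left to copy. With random overwrites, every victim block still contains live pages that must be rewritten.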
![Page 52: Cassandra and Solid State Drives](https://reader034.vdocument.in/reader034/viewer/2022042623/54b740f44a7959be4c8b4918/html5/thumbnails/52.jpg)
THANK YOU.
~ @rbranson