cassandra and solid state drives
Post on 15-Jan-2015
CASSANDRA & SOLID STATE DRIVES
Rick Branson, DataStax
FACT
CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED
FOR SPINNING DISKS
LSM-TREES
WRITE PATH
Client
Cassandra Node
On-Disk Commit Log
{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }
insert({ cf1: { row1: { col3: foo } } })
In-Memory Memtable for “cf1”
row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”
COMMIT
SSTable 1
SSTable 2
SSTable 3
SSTable 4
FLUSH
SSTables 1, 2, 3, 4
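The write path above (commit log append, memtable update, flush to a sorted SSTable) can be sketched in a few lines. This is a toy model, not Cassandra's implementation: the `Node` class, its method names, and the `TOMBSTONE` marker are all hypothetical, and in-memory lists stand in for on-disk files.

```python
import json

TOMBSTONE = "<del>"  # hypothetical deletion marker, mirroring the slide's <del>


class Node:
    """Toy single-node write path: commit log append, memtable update, flush."""

    def __init__(self):
        self.commit_log = []  # stands in for the append-only on-disk commit log
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # each flushed SSTable is an immutable, sorted tuple

    def insert(self, row, column, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(json.dumps({row: {column: value}}))
        # 2. Apply it to the memtable; a later write to a column replaces the earlier one.
        self.memtable.setdefault(row, {})[column] = value

    def flush(self):
        # Write the memtable out in sorted row/column order as a new SSTable.
        sstable = tuple(
            (row, tuple(sorted(columns.items())))
            for row, columns in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # flushed mutations no longer need replay


# Replay the five mutations shown on the slide, then flush.
node = Node()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.insert("row1", "col1", TOMBSTONE)
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
node.flush()
print(node.sstables[0])
```

Nothing is ever modified in place: the commit log only grows, and a flush produces a brand-new table.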
SSTables are merged to maintain read performance
COMPACT
New SSTable is streamed to disk and old SSTables are erased
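Because every SSTable is already sorted, the COMPACT step is a single merge pass. The sketch below uses Python's `heapq.merge` as a stand-in; the `compact` function and `TOMBSTONE` marker are hypothetical, and dropping tombstones immediately is a simplification. Note that `heapq.merge` costs O(N log k) for k input tables, which is effectively linear when k is small.

```python
import heapq

TOMBSTONE = "<del>"  # hypothetical deletion marker


def compact(sstables):
    """Merge sorted SSTables (listed oldest first) into one sorted SSTable.

    Each SSTable is a list of ((row, column), value) pairs sorted by key.
    """
    # Tag entries with the negated generation so that, for equal keys,
    # the entry from the newest table sorts first.
    tagged = [
        [(key, -generation, value) for key, value in table]
        for generation, table in enumerate(sstables)
    ]
    merged = []
    last_key = None
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value != TOMBSTONE:  # simplification: drop deleted columns right away
            merged.append((key, value))
    return merged


older = [(("row1", "col1"), "abc"), (("row1", "col2"), "def")]
newer = [
    (("row1", "col1"), TOMBSTONE),
    (("row1", "col3"), "foo"),
    (("row2", "col1"), "xyz"),
]
compacted = compact([older, newer])
print(compacted)
```

The inputs are only read and the output is only appended to, so the disk sees sequential I/O throughout.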
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
IMPORTANT
COMPARED
• Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Seek time limited by time it takes for drive to rotate: IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/sec for modern SATA drives
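The rule of thumb on this slide, IOPS = RPM/60, treats each random operation as costing roughly one rotation. As a quick check of the arithmetic (the function name is only illustrative):

```python
def rotational_iops(rpm):
    """Slide's rule of thumb: about one random I/O per platter rotation."""
    return rpm / 60


iops_7200 = rotational_iops(7_200)    # the slide's "~120 IOPS"
iops_15000 = rotational_iops(15_000)  # the fastest spindle speed mentioned
print(iops_7200, iops_15000)
```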
THAT WAS THE WORLD IN WHICH CASSANDRA
WAS BUILT
2012: MLC NAND FLASH*
• Affordable: ~$1.75/GB street
• Massive IOPS: 39,500/sec read, 23,000/sec write
• Latency of less than 100µs
• Good sequential throughput: 270MB/sec read, 205MB/sec write
• Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
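The cost-per-IOPS comparison can be roughly reproduced from the numbers on the slides. The slides do not say which IOPS figure or which spinning-disk price was used, so two assumptions are made here: the SSD figure uses the 23,000/sec write rate, and the spinning disk is a hypothetical $150 drive at ~120 IOPS, chosen because it matches the quoted $1.25.

```python
def cost_per_iops(drive_price_usd, iops):
    """Dollars paid per random I/O operation per second."""
    return drive_price_usd / iops


# SSD: 300GB at ~$1.75/GB, 23,000 write IOPS (figures from the slide).
ssd_price = 1.75 * 300
ssd_cost = cost_per_iops(ssd_price, 23_000)

# Spinning disk: ASSUMED $150 drive price (not from the slides), ~120 IOPS.
hdd_cost = cost_per_iops(150.0, 120)

print(f"SSD: ${ssd_cost:.2f}/IOPS, spinning disk: ${hdd_cost:.2f}/IOPS")
```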
WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S
LSM-TREES OBSOLETE?
SOLID STATE HAS SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase first, then write
• Can write in small increments (4KB), but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC) for each erase block
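To put the figures above side by side, here is a small calculation using only the slide's approximate numbers (the constant names are illustrative):

```python
PAGE_KB = 4           # smallest write increment
ERASE_BLOCK_KB = 512  # approximate erase granularity
WRITE_US = 100        # approximate write latency in microseconds
ERASE_US = 2_000      # approximate erase latency (2ms) in microseconds

# How many writable pages share a single erase block.
pages_per_block = ERASE_BLOCK_KB // PAGE_KB

# How much slower an erase is than a write.
erase_to_write_ratio = ERASE_US / WRITE_US

print(pages_per_block, erase_to_write_ratio)
```

So one erase affects 128 pages at once and takes about as long as 20 page writes.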
WEAR LEVELING is used to reduce the number of total erase operations
[Diagram: an Erase Block divided into Disk Pages, filled in turn by Write 1, Write 2, and Write 3]
How is data from only Write 2 modified?
Remember: the whole block must be erased
Mark Garbage
Append Modified Data
Empty Block
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the background... as much as possible
• If no empty blocks are available, GC must be done before ANY writes can complete
WRITE AMPLIFICATION
• When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten
• The smaller & more random the writes, the worse this gets
• Modern “mark and sweep” GC reduces it, but cannot eliminate it
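Write amplification is commonly expressed as the ratio of bytes physically written to flash to bytes the host asked to write. A minimal sketch, assuming the 4KB page and ~512KB erase block sizes quoted earlier in the deck (the function name is illustrative):

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """Ratio of physical flash writes to the writes the host requested."""
    return flash_bytes_written / host_bytes_written


# Worst case: a 4KB host write forces a whole 512KB block to be rewritten.
worst_case = write_amplification(512 * 1024, 4 * 1024)

# Best case: every byte the host writes lands on flash exactly once.
best_case = write_amplification(4 * 1024, 4 * 1024)

print(worst_case, best_case)
```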
Torture test shows massive write performance drop-off for heavily fragmented drive
Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives COMPLETELY fall apart
Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive suffers significantly from the
torture test
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks were marked empty, and the
“fast” comes back...
Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems don’t typically erase data immediately when files are deleted; they just mark them as deleted and erase later
• TRIM allows the OS to actively tell the drive when a region of disk is no longer used
• If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
TRIM only reduces the write amplification effect,
it can’t eliminate it.
THEN THERE’S LIFETIME...
AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load,
which causes around 10x write amplification
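A back-of-the-envelope model shows why write amplification matters so much for lifetime. This is not AnandTech's methodology: it assumes perfect wear leveling, uses the 300GB capacity and ~5,000 cycle figure from earlier slides, and the 250GB/day host write rate is a purely hypothetical workload.

```python
def lifetime_years(capacity_bytes, erase_cycles, host_bytes_per_day, write_amp):
    """Years until every block has used up its erase cycles (ideal wear leveling)."""
    total_flash_writes = capacity_bytes * erase_cycles
    flash_bytes_per_day = host_bytes_per_day * write_amp
    return total_flash_writes / flash_bytes_per_day / 365


CAPACITY = 300e9       # 300GB drive, as in the earlier spec slide
CYCLES = 5_000         # ~5,000 erase cycles per block (MLC)
HOST_PER_DAY = 250e9   # ASSUMED workload: 250GB of host writes per day

years_amplified = lifetime_years(CAPACITY, CYCLES, HOST_PER_DAY, write_amp=10)
years_sequential = lifetime_years(CAPACITY, CYCLES, HOST_PER_DAY, write_amp=1)

print(f"{years_amplified:.1f} years at 10x, {years_sequential:.1f} years at 1x")
```

Whatever the actual workload, in this model lifetime scales inversely with write amplification: 10x amplification means one tenth the lifetime.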
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential, append-only operations
• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)
• SSTables are completely immutable
CASSANDRA ONLY WRITES
SEQUENTIALLY
“For a sequential write workload, write amplification is equal to 1,
i.e., there is no write amplification.”
Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
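The quoted result can be illustrated with a toy flash translation layer. This is a sketch under stated assumptions, not the paper's model or any real drive's firmware: writes are appended to an active block, overwritten pages become garbage, and a greedy collector reclaims the block with the fewest valid pages. All names and sizes are hypothetical.

```python
import random

NUM_BLOCKS = 64       # physical erase blocks (hypothetical toy sizes)
PAGES_PER_BLOCK = 32
SPARE_BLOCKS = 8      # over-provisioning hidden from the host
LOGICAL_PAGES = (NUM_BLOCKS - SPARE_BLOCKS) * PAGES_PER_BLOCK


def simulate(workload):
    """Return write amplification for a stream of logical page numbers."""
    location = {}                                # logical page -> physical block
    valid = [set() for _ in range(NUM_BLOCKS)]   # live logical pages per block
    used = [0] * NUM_BLOCKS                      # pages programmed since last erase
    free = list(range(1, NUM_BLOCKS))            # erased blocks ready for use
    state = {"active": 0, "flash_writes": 0}

    def program(page):
        if used[state["active"]] == PAGES_PER_BLOCK:
            state["active"] = free.pop()         # active block full: open a fresh one
        old = location.get(page)
        if old is not None:
            valid[old].discard(page)             # mark the old copy as garbage
        block = state["active"]
        valid[block].add(page)
        location[page] = block
        used[block] += 1
        state["flash_writes"] += 1

    def collect():
        # Greedy GC: reclaim the full block holding the fewest valid pages.
        victim = min(
            (b for b in range(NUM_BLOCKS)
             if b != state["active"] and used[b] == PAGES_PER_BLOCK),
            key=lambda b: len(valid[b]),
        )
        for page in list(valid[victim]):
            program(page)                        # relocating live data costs extra writes
        used[victim] = 0                         # erase the victim
        free.append(victim)

    host_writes = 0
    for page in workload:
        while len(free) < 2:                     # keep one block in reserve for GC
            collect()
        program(page)
        host_writes += 1
    return state["flash_writes"] / host_writes


rng = random.Random(0)
wa_sequential = simulate(i % LOGICAL_PAGES for i in range(20_000))
wa_random = simulate(rng.randrange(LOGICAL_PAGES) for _ in range(20_000))
print(f"sequential: {wa_sequential:.2f}, random: {wa_random:.2f}")
```

With sequential overwrites, whole blocks turn to garbage at once, so the collector never has to relocate live pages. Random overwrites leave live pages scattered across every block, and each relocation is an extra flash write.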
THANK YOU.
~ @rbranson