cassandra and solid state drives

CASSANDRA & SOLID STATE DRIVES

Rick Branson, DataStax

CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS

LSM-TREES

WRITE PATH

[Diagram: the client sends a write to Cassandra. The node appends it to the on-disk commit log and applies it to the in-memory memtable for “cf1”.]

insert({ cf1: { row1: { col3: foo } } })

On-disk commit log:

{ cf1: { row1: { col1: abc } } }

{ cf1: { row1: { col2: def } } }

{ cf1: { row1: { col1: <del> } } }

{ cf1: { row2: { col1: xyz } } }

{ cf1: { row1: { col3: foo } } }

In-memory memtable for “cf1”:

row1: col1: [del] col2: “def” col3: “foo”

row2: col1: “xyz”

COMMIT

[Diagram: the in-memory memtable for “cf1” is written out to disk as a new SSTable, joining the existing SSTables, which are shown in numbered groups.]
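The write path in the slides above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the class `ToyNode`, its method names, and the `TOMBSTONE` marker are hypothetical, the commit log is a plain list standing in for an append-only file, and discarding the log at flush time is a simplification.

```python
import json

TOMBSTONE = "<del>"  # hypothetical marker for a deleted column


class ToyNode:
    """Toy write path: append to a commit log, update a memtable, flush."""

    def __init__(self):
        self.commit_log = []  # stands in for the on-disk, append-only log
        self.memtable = {}    # in-memory: row -> {column: value}
        self.sstables = []    # each flush adds one immutable, sorted table

    def insert(self, row, column, value):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(json.dumps({row: {column: value}}))
        # 2. Apply the mutation to the in-memory memtable.
        self.memtable.setdefault(row, {})[column] = value

    def delete(self, row, column):
        # A delete is just another write: it records a tombstone.
        self.insert(row, column, TOMBSTONE)

    def flush(self):
        # Write the memtable out in sorted order as an immutable SSTable.
        sstable = tuple(
            (row, col, val)
            for row in sorted(self.memtable)
            for col, val in sorted(self.memtable[row].items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []  # simplification: flushed data no longer needs the log
        return sstable


# The same five mutations as the commit log in the slides.
node = ToyNode()
node.insert("row1", "col1", "abc")
node.insert("row1", "col2", "def")
node.delete("row1", "col1")
node.insert("row2", "col1", "xyz")
node.insert("row1", "col3", "foo")
memtable_before_flush = dict(node.memtable)
flushed = node.flush()
```

Nothing in this sketch ever modifies data already written: the log only grows, and each flush produces a new table.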

SSTables are merged to maintain read performance

COMPACT

[Diagram: several existing SSTables are merged into one new SSTable.]

New SSTable is streamed to disk and old SSTables are erased

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable
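Because every SSTable is already sorted, compaction can be a streaming merge. The sketch below is a toy under stated assumptions, not Cassandra's code: `compact` and `TOMBSTONE` are hypothetical names, each table is a list of (key, value) pairs sorted by key, and tombstones are dropped immediately (a real system would keep them for some grace period). `heapq.merge` costs O(N log k) for k input tables, which is linear in N when k is small and fixed.

```python
import heapq

TOMBSTONE = "<del>"  # hypothetical marker for a deleted column


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one new sorted SSTable."""
    # Tag each entry with its table's age so that, for equal keys,
    # the entry from the newer table sorts last.
    tagged = (
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = []
    for key, _age, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)  # newer value replaces the older one
        else:
            merged.append((key, value))
    # Toy simplification: drop deleted entries right away.
    return [(key, value) for key, value in merged if value != TOMBSTONE]


older = [
    (("row1", "col1"), "abc"),
    (("row1", "col2"), "def"),
    (("row2", "col1"), "xyz"),
]
newer = [
    (("row1", "col1"), TOMBSTONE),
    (("row1", "col3"), "foo"),
]
new_sstable = compact([older, newer])
```

The inputs are only read, never modified. The result is a new table, which matches the immutability point above.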

IMPORTANT

COMPARED

• Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc.

• Most perform similar buffering of writes before flushing to disk

• ... but flushes are RANDOM writes

SPINNING DISKS

• Dirt cheap: $0.08/GB

• Seek time limited by time it takes for drive to rotate: IOPS = RPM/60

• 7,200 RPM = ~120 IOPS

• 15,000 RPM has been the max for decades

• Sequential operations are best: 125MB/sec for modern SATA drives
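The arithmetic behind these bullets, as a quick sketch. The rule of thumb IOPS = RPM/60 and the 125MB/sec figure come from the slides. The 4KB I/O size is an assumption added here for illustration, as is treating 1MB as 1,000KB.

```python
def hdd_iops(rpm):
    """Rule of thumb from the slides: one random I/O per rotation."""
    return rpm / 60


IO_SIZE_KB = 4                  # assumption: size of each random I/O
SEQUENTIAL_KB_PER_SEC = 125_000  # 125MB/sec, taking 1MB = 1,000KB

iops_7200 = hdd_iops(7200)
iops_15000 = hdd_iops(15000)

# Throughput when every I/O lands at a random position on a 7,200 RPM drive.
random_kb_per_sec = iops_7200 * IO_SIZE_KB
sequential_advantage = SEQUENTIAL_KB_PER_SEC / random_kb_per_sec

print(f"7,200 RPM:  {iops_7200:.0f} IOPS")
print(f"15,000 RPM: {iops_15000:.0f} IOPS")
print(f"random {IO_SIZE_KB}KB I/O: {random_kb_per_sec:.0f} KB/sec")
print(f"sequential is roughly {sequential_advantage:.0f}x faster")
```

Under these assumptions, random 4KB I/O on a 7,200 RPM drive moves well under 1MB/sec, which is why an append-only design pays off on spinning disks.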

THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT

2012: MLC NAND FLASH*

• Affordable: ~$1.75/GB street

• Massive IOPS: 39,500/sec read, 23,000/sec write

• Latency of less than 100µs

• Good sequential throughput: 270MB/sec read, 205MB/sec write

• Way cheaper per IOPS: $0.02 vs $1.25

* based on specifications provided by Intel for 300GB Intel 320 drive
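A rough check of the cost-per-IOPS comparison, using only figures from the slides. The slides do not say which spinning disk the $1.25 figure refers to, so the sketch works backwards to the drive price and capacity that figure would imply. Treat those two values as derived guesses, not stated facts.

```python
# SSD figures from the slides (300GB Intel 320).
SSD_PRICE_PER_GB = 1.75
SSD_CAPACITY_GB = 300
SSD_READ_IOPS = 39_500
SSD_WRITE_IOPS = 23_000

ssd_price = SSD_PRICE_PER_GB * SSD_CAPACITY_GB
ssd_cost_per_write_iops = ssd_price / SSD_WRITE_IOPS
ssd_cost_per_read_iops = ssd_price / SSD_READ_IOPS

# Spinning disk figures from the slides.
HDD_COST_PER_IOPS = 1.25
HDD_IOPS = 7200 / 60
HDD_PRICE_PER_GB = 0.08

# Derived, not stated in the slides: the drive that $1.25/IOPS would imply.
implied_hdd_price = HDD_COST_PER_IOPS * HDD_IOPS
implied_hdd_capacity_gb = implied_hdd_price / HDD_PRICE_PER_GB

print(f"SSD price: ${ssd_price:.2f}")
print(f"SSD cost per write IOPS: ${ssd_cost_per_write_iops:.3f}")
print(f"SSD cost per read IOPS:  ${ssd_cost_per_read_iops:.3f}")
print(f"Implied HDD price: ${implied_hdd_price:.2f}")
print(f"Implied HDD capacity: {implied_hdd_capacity_gb:.0f}GB")
```

The SSD side rounds to the $0.02 quoted above when write IOPS are used.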

WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?

SOLID STATE HAS SOME MAJOR BUTS...

... BUT

• Cannot overwrite directly: must erase first, then write

• Can write in small increments (4KB), but only erase in ~512KB blocks

• Latency: write is ~100µs, erase is ~2ms

• Limited durability: ~5,000 cycles (MLC) for each erase block

WEAR LEVELING is used to spread erase operations evenly across blocks, so that no single block wears out early

WEAR LEVELING

[Diagram: an erase block divided into disk pages. Write 1, Write 2, and Write 3 are appended to successive pages of the block.]

How is data from only Write 2 modified?

Remember: the whole block must be erased

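[Diagram: the pages from Write 2 are marked as garbage, and the modified data is appended to empty pages. Labels: Mark Garbage, Append Modified, Empty Block.]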
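The "mark garbage, append modified" sequence from the diagram can be mimicked with a toy model. Everything here is hypothetical and heavily simplified: `ToyFlash`, its page states, and the tiny sizes (2 blocks of 4 pages) are illustrative only, and there is no erase or garbage collection step.

```python
FREE, VALID, GARBAGE = "free", "valid", "garbage"


class ToyFlash:
    """Toy flash: pages cannot be overwritten, only appended or invalidated."""

    def __init__(self, blocks=2, pages_per_block=4):
        self.pages = [[FREE] * pages_per_block for _ in range(blocks)]
        self.location = {}  # logical name -> (block index, page index)

    def _first_free(self):
        for b, block in enumerate(self.pages):
            for p, state in enumerate(block):
                if state == FREE:
                    return b, p
        raise RuntimeError("no free pages: garbage collection required")

    def write(self, name):
        b, p = self._first_free()
        old = self.location.get(name)
        if old is not None:
            # Cannot overwrite in place: the old copy becomes garbage.
            old_b, old_p = old
            self.pages[old_b][old_p] = GARBAGE
        self.pages[b][p] = VALID  # append the new copy to a free page
        self.location[name] = (b, p)
        return b, p


flash = ToyFlash()
flash.write("write1")
flash.write("write2")
flash.write("write3")
where = flash.write("write2")  # modify only the data from Write 2
first_block = list(flash.pages[0])
```

After the last call, the first block holds a garbage page between valid ones, which is the fragmentation the following slides describe.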

Wait... GARBAGE?

THAT MEANS...

... fragmentation,

WHICH MEANS...

Garbage Collection!

GARBAGE COLLECTION

• Compacts fragmented disk blocks

• Erase operations drag on performance

• Modern SSDs do this in the background... as much as possible

• If no empty blocks are available, GC must be done before ANY writes can complete

WRITE AMPLIFICATION

• Occurs when only a few kilobytes are written, but fragmentation forces a whole block to be rewritten

• The smaller & more random the writes, the worse this gets

• Modern “mark and sweep” GC reduces it, but cannot eliminate it
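A worked example of the effect, using the 4KB page and ~512KB erase block sizes quoted earlier in the slides. Write amplification is taken here in its usual sense: data physically written to flash divided by data the host asked to write. The function name and the two scenarios are illustrative assumptions, not measurements.

```python
PAGE_KB = 4
BLOCK_KB = 512
PAGES_PER_BLOCK = BLOCK_KB // PAGE_KB


def write_amplification(host_pages, relocated_pages):
    """Pages physically written to flash per page written by the host."""
    return (host_pages + relocated_pages) / host_pages


# Worst case: one 4KB page changes in an otherwise full block and no empty
# block is available, so the other valid pages must be rewritten as well.
worst_case = write_amplification(host_pages=1,
                                 relocated_pages=PAGES_PER_BLOCK - 1)

# Sequential case: the host writes a whole block's worth of pages,
# so there is nothing valid left over to relocate.
sequential = write_amplification(host_pages=PAGES_PER_BLOCK,
                                 relocated_pages=0)

print(f"pages per erase block: {PAGES_PER_BLOCK}")
print(f"worst case: {worst_case:.0f}x")
print(f"sequential: {sequential:.0f}x")
```

Real drives fall somewhere between the two extremes, depending on how fragmented the blocks are and how much spare area the drive keeps.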

Torture test shows massive write performance drop-off for a heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6

Some poorly designed drives COMPLETELY fall apart

Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6

Even a well-behaved drive suffers significantly from the torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Post-torture, all disk blocks were marked empty, and the “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

“TRIM”

• Filesystems typically don’t erase data immediately when files are deleted; they just mark it as deleted and erase it later

• TRIM allows the OS to actively tell the drive when a region of disk is no longer used

• If an entire erase block is marked as unused, GC is avoided; otherwise TRIM just hastens the collection process

TRIM only reduces the write amplification effect; it can’t eliminate it.

THEN THERE’S LIFETIME...

AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
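A back-of-the-envelope model shows how write amplification scales drive lifetime. It rests on several assumptions not in the AnandTech estimate itself: the 300GB capacity and ~5,000 erase cycles are borrowed from earlier slides, wear leveling is treated as perfect, and the host write rate is simply solved for so that 10x amplification gives 1.5 years.

```python
CAPACITY_GB = 300      # assumption: the Intel 320 from the earlier slide
ERASE_CYCLES = 5_000   # MLC figure from the earlier slide


def host_writes_gb(write_amplification):
    """Total data the host can write before the flash wears out.

    Assumes perfect wear leveling, i.e. every block is erased equally often.
    """
    return CAPACITY_GB * ERASE_CYCLES / write_amplification


def lifetime_years(write_amplification, host_gb_per_day):
    return host_writes_gb(write_amplification) / host_gb_per_day / 365


# Derived: the daily host write volume that makes 10x amplification
# wear the drive out in 1.5 years.
implied_gb_per_day = host_writes_gb(10) / (1.5 * 365)

years_at_10x = lifetime_years(10, implied_gb_per_day)
years_at_1x = lifetime_years(1, implied_gb_per_day)

print(f"implied host writes: {implied_gb_per_day:.0f}GB/day")
print(f"lifetime at 10x amplification: {years_at_10x:.1f} years")
print(f"lifetime at 1x amplification:  {years_at_1x:.1f} years")
```

In this model lifetime is inversely proportional to write amplification, so the same host workload at 1x would last ten times as long.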

REMEMBER THIS?

TAKEAWAYS

CASSANDRA ONLY WRITES SEQUENTIALLY

“For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.”

Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”

THANK YOU.

~ @rbranson
