Tech Talk: RocksDB slides by Dhruba Borthakur & Haobo Xu of Facebook

Posted on 27-Jan-2015


DESCRIPTION

This presentation describes why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB, and how it differs from other open-source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed to support embedded-storage deployments. He explains typical workloads that could be the primary use cases for RocksDB. He also lays out the roadmap for making RocksDB the key-value store of choice for highly multi-core processors and RAM-speed storage devices.

TRANSCRIPT

The Story of RocksDB

Dhruba Borthakur & Haobo Xu, Database Engineering @ Facebook

Embedded Key-Value Store for Flash and RAM

Monday, December 9, 13


A Client-Server Architecture with disks

Application Server -> Database Server (locally attached disks)

Network roundtrip = 50 microseconds
Disk access = 10 milliseconds

Client-Server Architecture with fast storage

Application Server -> Database Server

Network roundtrip = 50 microseconds
SSD access = 100 microseconds
RAM access = 100 nanoseconds

Latency dominated by the network


Architecture of an Embedded Database

Storage attached directly to application servers (no 50-microsecond network roundtrip to a database server)

SSD access = 100 microseconds
RAM access = 100 nanoseconds


Any pre-existing embedded databases?

Open Source key-value stores:
1. Berkeley DB
2. SQLite
3. Kyoto TreeDB
4. LevelDB

FB Proprietary key-value stores:
1. High performance
2. No transaction log
3. Fixed-size keys

Comparison of open source databases

Random Reads (ops/sec):
  LevelDB       129,000
  Kyoto TreeDB  151,000
  SQLite3       134,000

Random Writes (ops/sec):
  LevelDB       164,000
  Kyoto TreeDB   88,500
  SQLite3         9,860

HBase and HDFS (in April 2012)

Random Reads (ops/sec):
  HDFS (1 node)   93,000
  HBase (1 node)  35,000

Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html

Log Structured Merge Architecture

[Diagram: write requests from the application go into read-write data in RAM, backed by a transaction log. Read-only data lives in RAM and on disk and is merged by periodic compaction. Scan requests from the application consult both the read-write and the read-only data.]
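The write and read flow sketched in the diagram above can be shown with a toy in-memory LSM store. This is an illustrative sketch of the general technique, not RocksDB's implementation; all names here are made up.

```python
import json

class TinyLSM:
    """Illustrative LSM store: an in-memory memtable backed by a
    write-ahead log, flushed to sorted immutable runs ("sst" files)."""

    def __init__(self):
        self.wal = []        # stand-in for the on-disk transaction log
        self.memtable = {}   # read-write data in RAM
        self.ssts = []       # read-only sorted runs (newest last)

    def put(self, key, value):
        self.wal.append(json.dumps([key, value]))  # log first, for recovery
        self.memtable[key] = value

    def flush(self):
        # memtable becomes an immutable sorted run; the log can be truncated
        self.ssts.append(sorted(self.memtable.items()))
        self.memtable, self.wal = {}, []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.ssts):        # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```

A later put shadows a flushed value, exactly as a newer LSM level shadows an older one.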


LevelDB has low write rates

Facebook Application 1:

• Write rate of only 2 MB/sec per machine

• Only one CPU was used

We developed multithreaded compaction:

• 10x improvement in write rate

• 100% of CPUs in use


LevelDB has stalls

Facebook Feed:

• P99 latencies were tens of seconds

• Single-threaded compaction

We implemented thread-aware compaction:

• Dedicated thread(s) to flush the memtable

• Pipelined memtables

• P99 reduced to less than a second


LevelDB has high write amplification

Facebook Application 2:

• Level-style compaction

• Write amplification of 70, which is very high

[Diagram: two compactions under level-style compaction. Stage 1: Level-0 (5 bytes), Level-1 (6 bytes), Level-2 (10 bytes). Stage 2: Level-0 merged into Level-1 (11 bytes), Level-2 (10 bytes). Stage 3: Level-1 merged into Level-2.]


Our solution: lower write amplification

Facebook Application 2:

• We implemented Universal Style Compaction

• Start from the newest file; include the next file in the candidate set if candidate set size >= size of the next file

[Diagram: a single compaction under universal-style compaction merges the Level-0 (5 bytes), Level-1 (6 bytes), and Level-2 (10 bytes) files in one pass.]

Write amplification reduced to <10
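The candidate-selection rule stated above can be sketched in a few lines. This follows the rule as the slide words it, not RocksDB's full universal-compaction heuristics; the function name is made up.

```python
def pick_candidates(file_sizes):
    """Sketch of the stated rule: walk from the newest file, and keep
    including the next (older) file while the candidate set's total
    size is >= the size of that next file. Sizes are newest-first."""
    candidates = []
    total = 0
    for size in file_sizes:
        if candidates and total < size:
            break                      # next file too big: stop the set here
        candidates.append(size)
        total += size
    return candidates
```

Merging many small recent files in one pass is what keeps each byte from being rewritten over and over, which is where the write-amplification saving comes from.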


LevelDB has high read amplification

• Secondary Index Service:

• LevelDB does not use bloom filters for scans

• We implemented prefix scans

• Range scans within the same key prefix

• Bloom filters created for the prefix

• Reduces read amplification
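A toy bloom filter over key prefixes shows how a prefix-restricted scan can skip files. This is a sketch of the general technique only; the class, sizes, and hash counts are arbitrary, and it is not RocksDB's prefix-bloom implementation.

```python
import hashlib

class PrefixBloom:
    """Toy bloom filter over key prefixes: a scan restricted to one
    prefix can skip any sst file whose bloom says the prefix is
    definitely absent, avoiding IO on that file."""

    def __init__(self, prefixes, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits
        for p in prefixes:
            for i in self._positions(p):
                self.array[i] = True

    def _positions(self, prefix):
        # derive `hashes` bit positions from the prefix deterministically
        for seed in range(self.hashes):
            h = hashlib.sha256(f"{seed}:{prefix}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def may_contain(self, prefix):
        return all(self.array[i] for i in self._positions(prefix))

# One bloom per sst file, built from the prefixes of its keys:
sst_keys = ["user:1", "user:2", "post:9"]
bloom = PrefixBloom({k.split(":")[0] for k in sst_keys})
```

A "no" from the bloom is definitive, so the scan never opens that file; a "yes" may occasionally be a false positive, which only costs an extra read.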


LevelDB: read-modify-write = 2X IOs

• Counter increments

• Get value, value++, Put value

• LevelDB uses 2X IOPS

• We implemented MergeRecord

• Put the "++" operation in a MergeRecord

• Background compaction merges all MergeRecords

• Uses only 1X IOPS
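The merge-record idea above can be sketched as follows: increments are appended as operations rather than applied by read-modify-write, and a background pass folds them later. This is an illustrative sketch, not RocksDB's actual Merge API; all names are made up.

```python
class MergeStore:
    """Sketch of merge records: a counter increment is one write
    (append an op), never a read followed by a write."""

    def __init__(self):
        self.records = {}   # key -> list of ("put", v) / ("add", delta) ops

    def put(self, key, value):
        self.records[key] = [("put", value)]

    def merge(self, key, delta):
        # one write, no read: just append the operation
        self.records.setdefault(key, []).append(("add", delta))

    def compact(self):
        # background compaction folds all merge records into one value
        for key, ops in self.records.items():
            total = 0
            for kind, v in ops:
                total = v if kind == "put" else total + v
            self.records[key] = [("put", total)]

    def get(self, key):
        total = 0
        for kind, v in self.records.get(key, []):
            total = v if kind == "put" else total + v
        return total
```

Reads still see the correct value before compaction, because Get folds any pending operations on the fly.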


LevelDB has a rigid design

• LevelDB design

• Cannot tune the system; fixed file sizes

• We wanted a pluggable architecture

• Pluggable compaction filter, e.g. TimeToLive

• Pluggable memtable/sstable for RAM/Flash

• Pluggable compaction algorithm
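A compaction filter in the spirit of the TimeToLive example above decides, for each entry seen during compaction, whether it survives. The signature and entry shape here are hypothetical, for illustration only.

```python
def ttl_compaction_filter(now, ttl_seconds):
    """Return a predicate that keeps an entry only while it is younger
    than the TTL; compaction drops everything else."""
    def keep(entry):
        key, value, write_time = entry
        return (now - write_time) < ttl_seconds
    return keep

# Hypothetical (key, value, write_time) entries seen by a compaction:
entries = [("a", "1", 100.0), ("b", "2", 900.0)]
keep = ttl_compaction_filter(now=1000.0, ttl_seconds=200.0)
survivors = [e for e in entries if keep(e)]
```

Because the filter runs inside compaction, expired data disappears without any explicit Delete from the application.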


The changes we made to LevelDB

Inherited from LevelDB:

• Log Structured Merge DB

• Gets/Puts/Scans of keys

• Forward and reverse iteration

RocksDB:

• 10X higher write rate

• Fewer stalls

• 7X lower write amplification

• Bloom filters for range scans

• Ability to avoid read-modify-write

• Optimizations for flash or RAM

• And many more…

RocksDB is born!

• Key-value persistent store

• Embedded

• Optimized for fast storage

• Server workloads

What is it not?

• Not distributed

• No failover

• Not highly available: if the machine dies you lose your data

RocksDB API

▪ Keys and values are arbitrary byte arrays.

▪ Data is stored sorted by key.

▪ The basic operations are Put(key, value), Get(key), Delete(key), and Merge(key, delta).

▪ Forward and backward iteration is supported over the data.
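The API surface listed above can be mimicked with a small in-memory sorted map. This is a sketch of the interface only, assuming a simple additive merge function; it is not the RocksDB library.

```python
import bisect

class MiniKV:
    """Sketch of the Put/Get/Delete/Merge operations plus ordered
    forward and backward iteration, over a sorted in-memory list."""

    def __init__(self, merge_fn=lambda old, delta: (old or 0) + delta):
        self.keys, self.values = [], []
        self.merge_fn = merge_fn   # hypothetical additive merge

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i], self.values[i]

    def merge(self, key, delta):
        self.put(key, self.merge_fn(self.get(key), delta))

    def scan(self, reverse=False):
        pairs = list(zip(self.keys, self.values))
        return list(reversed(pairs)) if reverse else pairs
```

Because data is kept sorted by key, forward and backward scans fall out for free.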

RocksDB Architecture

[Diagram: write requests go to the active MemTable, backed by a log. On Switch, the active memtable becomes a read-only MemTable, which Flush turns into sst files; Compaction merges sst files within the LSM. Memtables live in memory; logs and sst files live on persistent storage. Read requests consult the memtables and the sst files.]

Log Structured Merge Tree -- Writes

▪ Log Structured Merge Tree

▪ New Puts are written to memory, and optionally to the transaction log

▪ A log-sync option can also be specified for each individual write

▪ We say RocksDB is optimized for writes; what does this mean?

RocksDB Write Path

[Diagram: a write request appends to the log and the active MemTable; Switch moves the active memtable to a read-only memtable, Flush writes it out as sst files, and Compaction merges sst files in the LSM.]

Log Structured Merge Tree -- Reads

▪ Data could be in memory or on disk

▪ Consult multiple files to find the latest instance of the key

▪ Use bloom filters to reduce IO
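The three read-path points above can be sketched as one lookup function: check the memtable first, then sst files from newest to oldest, skipping files whose bloom says the key is absent. Plain sets stand in for real (probabilistic) bloom filters here; the function shape is made up.

```python
def lsm_get(key, memtable, ssts_with_blooms):
    """Sketch of the LSM read path. `ssts_with_blooms` is a
    newest-first list of (bloom_set, data_dict) pairs."""
    if key in memtable:
        return memtable[key]                  # freshest data wins
    for bloom, data in ssts_with_blooms:
        if key not in bloom:                  # bloom: "definitely absent"
            continue                          # -> skip the IO for this file
        if key in data:
            return data[key]                  # newest instance of the key
    return None
```

Note that the first file containing the key answers the Get; older files may hold stale versions that are never consulted.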

RocksDB Read Path

[Diagram: a read request checks the active MemTable and the read-only MemTable in memory, then the sst files on persistent storage; per-file bloom filters let reads skip sst files that cannot contain the key.]

RocksDB: Open & Pluggable

[Diagram: write requests and Get or Scan requests from the application flow through a pluggable memtable format in RAM (with blooms), a customizable WAL backing the transaction log, a pluggable sst data format on storage, and pluggable compaction.]

Example: Customizable WALogging

• Our in-house replication solution wants to embed arbitrary blobs in the RocksDB WAL stream, for log annotation

• Use case: indicate where a log record came from in multi-master replication

• Solution: a Put that only speaks to the log

[Diagram: the replication layer issues one write batch: PutLogData("I came from Mars"); Put(k1, v1). The annotation "I came from Mars" is written only to the log; k1/v1 goes to both the log and the active MemTable.]
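The write-batch behavior in the diagram can be sketched as follows. Method names mirror the slide (PutLogData, Put); the batch and commit machinery is an illustrative stand-in, not the actual RocksDB implementation.

```python
class WriteBatch:
    """Sketch of a batch that carries log-only blobs alongside Puts.
    On commit, everything reaches the WAL, but only real Puts reach
    the memtable."""

    def __init__(self):
        self.ops = []

    def put(self, key, value):
        self.ops.append(("put", key, value))

    def put_log_data(self, blob):
        self.ops.append(("log_only", blob, None))

def commit(batch, wal, memtable):
    for kind, a, b in batch.ops:
        wal.append((kind, a, b))      # every record hits the log
        if kind == "put":
            memtable[a] = b           # log-only blobs skip the memtable
```

Replicas tailing the WAL see the annotation; readers of the database never do.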

Example: Pluggable SST format

• One Facebook use case needs extremely fast responses but can tolerate some loss of durability

• Quick hack: mount the sst files in tmpfs

• Still not performant: the existing sst format is block based

• Solution: a much simpler format that just stores sorted key/value pairs sequentially

• no blocks, no caching; mmap the whole file

• build an efficient lookup index on load
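The "index built on load" idea can be sketched over an already-sorted sequence of pairs. Here a Python list stands in for the mmap'd file contents; the class name is made up.

```python
import bisect

class PlainSst:
    """Sketch of the simpler format: sorted key/value pairs stored
    sequentially, with a key index built once when the file is
    loaded, so each Get is one binary search."""

    def __init__(self, records):
        # records are already sorted by key, as written at flush time
        self.records = records
        self.index = [k for k, _ in records]   # built once, on load

    def get(self, key):
        i = bisect.bisect_left(self.index, key)
        if i < len(self.index) and self.index[i] == key:
            return self.records[i][1]
        return None
```

With the whole file mapped into memory there is no block decoding or cache layer on the read path, which is where the latency win comes from.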

Example: Blooms for MemTable

• Same use case: after we optimized sst access, memtable lookup became a major query cost

• Problem: a Get has to go through memtable lookups that eventually return no data

• Solution: just add a bloom filter to the memtable!

RocksDB Read Path

[Diagram: as the earlier read path, but with bloom filters on the active MemTable and the read-only MemTable as well as on the sst files, so a read request can also skip memtables that cannot contain the key.]

Example: Pluggable memtable format

• Another Facebook use case has a distinct load phase during which no queries are issued

• Problem: write throughput is limited by the single writer thread

• Solution: a new memtable representation that does not keep keys sorted
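The unsorted-memtable idea can be sketched as an append-only buffer that is sorted exactly once, at flush time. This is an illustrative sketch with made-up names, not the actual RocksDB memtable representation.

```python
class UnsortedMemtable:
    """Sketch of a load-phase memtable: writes just append, paying no
    per-write ordering cost; keys are sorted once when the memtable
    is flushed into a sorted sst run."""

    def __init__(self):
        self.entries = []          # append-only: cheap writes

    def put(self, key, value):
        self.entries.append((key, value))

    def flush(self):
        # sort once; later entries for a key override earlier ones
        latest = {}
        for k, v in self.entries:
            latest[k] = v
        self.entries = []
        return sorted(latest.items())
```

The trade-off is that point reads during the load phase would have to scan the unsorted buffer, which is acceptable here because no queries are issued until loading finishes.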

Example: Pluggable memtable format

[Diagram: the write path is unchanged except that the active memtable is an Unsorted MemTable; at Switch it becomes a read-only memtable, and the flush step becomes Sort + Flush before compaction into the LSM's sst files.]

Possible workloads for RocksDB?

▪ Serving data to users via a website

▪ A spam-detection backend that needs fast access to data

▪ A graph-search query that needs to scan a dataset in realtime

▪ Distributed configuration management systems

▪ Fast serving of Hive data

▪ A queue that needs a high rate of inserts and deletes

Futures

• Scale linearly with the number of CPUs

• 32, 64, or higher core machines

• ARM processors

• Scale linearly with storage IOPS

• Striped flash cards

• RAM & NVRAM storage

Come Hack with us

• RocksDB is open sourced

• http://rocksdb.org

• Developers group: https://www.facebook.com/groups/rocksdb.dev/

• Help us HACK RocksDB
