Preview: A Next Generation Storage Engine for NoSQL Database Systems (Couchbase Connect 2014)
DESCRIPTION
The B+-tree has been used as one of the main index structures in the database field for more than four decades. However, typical B+-tree implementations show scalability and performance issues as modern global-scale Web and mobile applications generate volumes of data that have not been seen before. Various key-value storage engines based on variants of the B+-tree, such as the log-structured merge-tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have also been working on a new key-value storage engine that can provide high scalability and performance, and recently released the beta version of ForestDB, whose main index structure is based on a Hierarchical B+-Tree based Trie, or HB+-Trie. In this presentation, we introduce ForestDB and discuss why it is a good fit for modern big data applications. We also explain various optimizations to ForestDB that are planned especially for solid-state drives (SSDs).

TRANSCRIPT
Preview: A Next Generation Storage Engine for NoSQL Database Systems
Chiyoung Seo
Software Architect, Couchbase Inc.
Contents
- Why do we need a new KV storage engine?
- ForestDB: HB+-Trie, block aligning, Write Ahead Logging (WAL), block buffer cache, evaluation
- Optimizations for Solid-State Drives (SSDs): async I/O to exploit the parallel I/O capabilities of SSDs, a volume manager inside ForestDB, lightweight and I/O-efficient compaction, append / prepend support
- Summary
Why do we need a new KV storage engine?
Modern Web / Mobile Applications
- Operate on huge volumes of unstructured data
- A significant amount of new data is constantly generated by hundreds of millions of users or devices
- Still require high performance and scalability in managing an ever-growing database
- The underlying storage engine is one of the most critical parts of a database system for providing high performance and scalability
B+Tree
- The main storage index structure in the database field
- A generalization of the binary search tree
- Each node consists of two or more {key, value (or pointer)} pairs; the fanout (branch) degree f is the number of KV pairs in a node
- The node size is generally fitted to a multiple of the page size

[Figure: B+Tree structure. The root and index (non-leaf) nodes store {K_i, P_i} pairs, where K_i is the i-th smallest key in the node and P_i is the pointer corresponding to K_i; leaf nodes store {K_i, V_i} pairs, where V_i is the value corresponding to K_i; f is the fanout degree.]

A structural sketch of these two node layouts follows.
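As a rough illustration of the figure, here is a minimal C sketch of the two node kinds. The fixed fanout, fixed key length, and field names are assumptions for clarity, not ForestDB's (or any particular engine's) actual layout:

```c
#include <stdint.h>

#define FANOUT 128   /* f: illustrative fixed fanout degree */
#define KEYLEN 32    /* illustrative fixed key length */

/* Root / index (non-leaf) node: {K_i, P_i} pairs. */
struct index_node {
    uint32_t nentries;
    struct {
        char key[KEYLEN];        /* K_i: i-th smallest key in the node */
        uint64_t child;          /* P_i: block id of the child node */
    } entry[FANOUT];
};

/* Leaf node: {K_i, V_i} pairs. */
struct leaf_node {
    uint32_t nentries;
    struct {
        char key[KEYLEN];        /* K_i */
        uint64_t value;          /* V_i: value, or its disk offset */
    } entry[FANOUT];
};
```

Note that the index nodes store entire key strings; this is exactly the space overhead the next slide calls out.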
B+Tree Limitations
- Not suitable for indexing variable-length or fixed-length long keys: significant space overhead, since entire key strings are indexed in non-leaf nodes
- The tree depth grows quickly as more data is loaded
- I/O performance degrades significantly as the data size gets bigger
- Several B+Tree variants and alternatives have been proposed: LevelDB (Google), RocksDB (Facebook), TokuDB (Tokutek)
Goals
- A fast and scalable index structure for variable-length or fixed-length long keys
- Targeting block I/O storage devices: not only SSDs but also legacy HDDs
- Less storage space overhead: reduce write amplification
- Efficient regardless of the key pattern: works for keys that share a common prefix as well as keys that do not
ForestDB
- A key-value storage engine developed by the Couchbase caching / storage team
- Its main index structure is built from a Hierarchical B+-Tree based Trie, or HB+-Trie
- HB+-Trie was originally presented at the ACM SIGMOD 2011 Programming Contest by Jung-Sang Ahn, who now works at Couchbase (http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)
- Significantly better read and write performance with less storage overhead
- Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)
- The 1.0 beta was released last week
Main Features
- Multi-Version Concurrency Control (MVCC) with an append-only storage model
- Write-Ahead Logging (WAL)
- A value can be retrieved by its sequence number or disk offset, in addition to its key
- Custom compare function to support a customized key order
- Snapshot support to provide different views of the database
- Rollback to revert the database to a specific point
- Ranged iteration by keys or sequence numbers
- Transactional support with read_committed or read_uncommitted isolation levels
- Manual or auto compaction, configured per KV instance

A minimal usage sketch follows.
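The sketch below exercises the basic open / set / commit / get cycle through ForestDB's public C API. The calls follow the 1.0-beta headers as I recall them (fdb_open, fdb_kvs_open_default, fdb_set_kv, fdb_commit, fdb_get_kv); treat the exact names, the file path, and the memory-management convention as assumptions to verify against the current headers:

```c
#include <stdio.h>
#include <stdlib.h>
#include "libforestdb/forestdb.h"

int main(void) {
    fdb_file_handle *dbfile;
    fdb_kvs_handle *db;
    fdb_config config = fdb_get_default_config();
    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();

    /* Open (or create) a database file and its default KV store. */
    fdb_open(&dbfile, "./example.fdb", &config);
    fdb_kvs_open_default(dbfile, &db, &kvs_config);

    /* Writes land in the WAL first; fdb_commit makes them durable
     * and appends a new DB header. */
    fdb_set_kv(db, "key1", 4, "value1", 6);
    fdb_commit(dbfile, FDB_COMMIT_NORMAL);

    /* Point lookup by key; the returned buffer is heap-allocated. */
    void *value;
    size_t valuelen;
    if (fdb_get_kv(db, "key1", 4, &value, &valuelen) == FDB_RESULT_SUCCESS) {
        printf("%.*s\n", (int)valuelen, (char *)value);
        free(value);
    }

    fdb_kvs_close(db);
    fdb_close(dbfile);
    return 0;
}
```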
ForestDB: Main Index Structure
- HB+Trie (Hierarchical B+Tree based Trie): a trie (prefix tree) whose nodes are B+Trees
- A variable-length key is split into a list of fixed-size chunks (sub-strings of the key); e.g., with 4-byte chunks, 'a83jgls83jgo29a…' becomes chunk1 = 'a83j', chunk2 = 'gls8', chunk3 = '3jgo', …
- A lookup searches the first B+Tree using chunk1, the next-level B+Tree using chunk2, then chunk3, and so on, until it reaches the document
- Supports lexicographically ordered traversal

[Figure: HB+Trie lookup. Each node of the HB+Trie is a B+Tree built from B+Tree nodes; the search descends one chunk at a time until a document is reached.]

A sketch of the chunking step follows.
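A minimal sketch of splitting a key into fixed-size chunks, assuming a 4-byte chunk size; the zero-padding rule for a trailing partial chunk is an assumption for illustration:

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4  /* fixed chunk size in bytes (assumed) */

/* Copy the i-th fixed-size chunk of 'key' into 'chunk', zero-padding
 * when the key ends mid-chunk. */
static void get_chunk(const char *key, size_t keylen,
                      size_t i, char chunk[CHUNK_SIZE]) {
    size_t off = i * CHUNK_SIZE;
    memset(chunk, 0, CHUNK_SIZE);
    if (off < keylen) {
        size_t n = keylen - off;
        if (n > CHUNK_SIZE) n = CHUNK_SIZE;
        memcpy(chunk, key + off, n);
    }
}
```

For example, get_chunk("a83jgls83jgo29a", 15, 1, buf) yields "gls8", the key used in the second-level B+Tree.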
Prefix Compression
- As in an original trie, each node (B+Tree) is created on demand, except for the root node. Example: chunk size = 1 byte.
- Insert 'aaaa': only the root B+Tree (which uses the 1st chunk as its key) is needed; 'aaaa' is distinguishable by its first chunk 'a'.
- Insert 'bbbb': still only the root B+Tree; 'bbbb' is distinguishable by its first chunk 'b'.
- Insert 'aaab': 'aaaa' and 'aaab' cannot be distinguished using the first chunk 'a'. Their first distinguishable chunk is the 4th, so a new B+Tree using the 4th chunk as its key is created under the root's 'a' entry, and the skipped common prefix 'aa' (chunks 2-3) is stored with it; its entries point to 'aaaa' and 'aaab'.
- Insert 'bbcd': 'bbbb' and 'bbcd' cannot be distinguished using the first chunk 'b'. Their first distinguishable chunk is the 3rd, so a new B+Tree using the 3rd chunk as its key is created under the root's 'b' entry, and the skipped common prefix 'b' (chunk 2) is stored with it; its entries point to 'bbbb' and 'bbcd'.

A sketch of the chunk-comparison step appears below.
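To decide where a new sub-B+Tree goes, the insert path has to find the first chunk at which two colliding keys differ. A self-contained sketch of that comparison, assuming zero padding past the end of the shorter key (the function name and padding rule are illustrative):

```c
#include <stddef.h>

/* Return the 0-based index of the first chunk at which keys a and b
 * differ; the chunks before it form the skipped common prefix. */
static size_t first_diff_chunk(const char *a, size_t alen,
                               const char *b, size_t blen,
                               size_t chunk_size) {
    size_t maxlen = (alen > blen) ? alen : blen;
    for (size_t off = 0; off < maxlen; off += chunk_size) {
        for (size_t j = 0; j < chunk_size; ++j) {
            char ca = (off + j < alen) ? a[off + j] : 0; /* zero pad */
            char cb = (off + j < blen) ? b[off + j] : 0;
            if (ca != cb) return off / chunk_size;
        }
    }
    return maxlen / chunk_size; /* equal up to padding */
}
```

With chunk size 1, first_diff_chunk("aaaa", 4, "aaab", 4, 1) returns 3 (the 4th chunk), matching the example above: the new B+Tree is keyed on the 4th chunk and stores the skipped prefix 'aa'.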
Benefits
- When keys are sufficiently long and uniformly random (e.g., UUIDs or hash values), the majority of keys can be indexed by the first chunk alone, so there is only one B+Tree in the HB+Trie and we don't need to store and compare entire key strings.
- Example: chunk size = 2 bytes. Insert a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, 023o8f9o8zufisue. All five keys are already distinguished by their first 2-byte chunks (a8, dp, z9, 28, 02), so the single root B+Tree suffices.
- When the chunk size is n bits, up to 2^n keys can be indexed by the first chunk alone: n = 32 (4 bytes) gives 2^32 ≈ 4 billion keys; n = 64 (8 bytes) gives 2^64 ≈ 10^19.
- HB+Trie is also efficient when keys share common prefixes (e.g., secondary index keys), as shown in the prefix-compression example.
ForestDB Index Structures
- ForestDB maintains two index structures: the HB+Trie (key index) and a sequence B+Tree (an index on the 8-byte integer sequence number)
- Either one retrieves the file offset to a value, using a key or a sequence number respectively

[Figure: the HB+Trie (searched by key) and the sequence B+Tree (searched by sequence number) both point to document offsets within the DB file.]
ForestDB: Write Ahead Logging
- Append updates first, and update the main indexes later
- Main purposes: maximize write throughput through sequential writes (append-only updates), and reduce the number of index nodes to be written through batched updates
- Documents are appended to the DB file, followed by a DB header appended for every commit; index nodes are written later
- The WAL indexes are in-memory structures (hash tables): an ID index mapping h(key) to the document's file offset, and a sequence number index mapping h(seq no) to the offset
- Key query: 1. Retrieve the WAL index first. 2. If hit, return immediately. 3. If miss, retrieve the HB+Trie (or sequence B+Tree). A sketch of this read path follows.
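A self-contained sketch of that WAL-first read path, under simplifying assumptions: a tiny fixed-size linear-probing hash table stands in for ForestDB's in-memory WAL ID index, and hbtrie_find() is a stub for the on-disk HB+Trie lookup. All names and sizes are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WAL_BUCKETS 1024
#define MAX_KEYLEN 32

typedef struct {
    char key[MAX_KEYLEN];
    size_t keylen;
    uint64_t offset;   /* file offset of the appended document */
    bool used;
} wal_entry_t;

static wal_entry_t wal_index[WAL_BUCKETS];  /* the in-memory ID index */

static size_t wal_hash(const char *key, size_t keylen) {
    size_t h = 5381;
    for (size_t i = 0; i < keylen; ++i)
        h = h * 33 + (unsigned char)key[i];
    return h % WAL_BUCKETS;
}

/* Record a freshly appended document in the WAL index. */
static void wal_index_set(const char *key, size_t keylen, uint64_t offset) {
    size_t i = wal_hash(key, keylen);
    while (wal_index[i].used &&
           (wal_index[i].keylen != keylen ||
            memcmp(wal_index[i].key, key, keylen) != 0))
        i = (i + 1) % WAL_BUCKETS;  /* linear probing */
    memcpy(wal_index[i].key, key, keylen);
    wal_index[i].keylen = keylen;
    wal_index[i].offset = offset;
    wal_index[i].used = true;
}

/* Stub for the main-index lookup. */
static bool hbtrie_find(const char *key, size_t keylen, uint64_t *offset_out) {
    (void)key; (void)keylen; (void)offset_out;
    return false;  /* a real implementation walks the on-disk trie */
}

/* Key query: WAL index first; fall back to the HB+Trie on a miss. */
static bool lookup_offset(const char *key, size_t keylen, uint64_t *offset_out) {
    size_t i = wal_hash(key, keylen);
    while (wal_index[i].used) {
        if (wal_index[i].keylen == keylen &&
            memcmp(wal_index[i].key, key, keylen) == 0) {
            *offset_out = wal_index[i].offset;  /* WAL hit */
            return true;
        }
        i = (i + 1) % WAL_BUCKETS;
    }
    return hbtrie_find(key, keylen, offset_out);  /* WAL miss */
}
```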
ForestDB: Block Cache
- ForestDB has its own block cache layer, managed on a block basis
- Index node blocks get higher priority than data blocks
- Provides an option to bypass the OS page cache

[Figure: the HB+Trie (or sequence index) and the WAL index perform block reads/writes against the block cache layer; the DB file on the file system is read or written only when a cache miss or eviction occurs.]
Internally, the block cache consists of:
- A global LRU list of the database files that are currently open
- A separate AVL tree per file to keep track of dirty blocks
- A separate hash table per file, keyed by block ID (BID), whose value is a pointer to a cache entry in either the clean LRU list or the AVL tree

[Figure: a file LRU list of open files (e.g., File 4, File 2, File 1, File 5); per file, a hash table maps hash(BID) to cache-entry pointers; clean blocks sit on a clean LRU list while dirty blocks sit in the AVL tree.]

A structural sketch of this bookkeeping follows.
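A hedged C sketch of those per-file structures; the type and field names are invented for illustration and do not match ForestDB's internals:

```c
#include <stdbool.h>
#include <stdint.h>

struct hash_table;  /* opaque: hash(BID) -> cache_entry*          */
struct avl_tree;    /* opaque: dirty blocks, ordered by block ID  */

/* One cached block; clean entries live on the clean LRU list,
 * dirty entries live in the file's AVL tree. */
struct cache_entry {
    uint64_t bid;                 /* block ID within the file */
    void *block;                  /* the cached block contents */
    bool dirty;
    struct cache_entry *lru_prev; /* clean LRU links */
    struct cache_entry *lru_next;
};

/* Per-file cache state, linked into a global LRU of open files. */
struct file_cache {
    struct hash_table *by_bid;        /* lookup by block ID */
    struct cache_entry *clean_head;   /* clean LRU list */
    struct cache_entry *clean_tail;
    struct avl_tree *dirty;           /* dirty blocks, sorted by BID */
    struct file_cache *file_lru_prev; /* global file LRU links */
    struct file_cache *file_lru_next;
};
```

Keeping dirty blocks in a tree ordered by block ID presumably lets a flush write them out in ascending-offset order, though the slides do not spell that out.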
ForestDB: Compaction
- Manual compaction: performed by calling the compaction public API manually
- Daemon compaction: a single daemon thread inside ForestDB manages compaction automatically
- A compactor thread can interleave with a writer thread: while a compaction task is running, a writer thread can still write dirty items into the WAL section of the new file

Both modes are sketched below.
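A sketch of configuring both modes through the C API; the field names compaction_mode and compaction_threshold and the constants FDB_COMPACTION_AUTO / FDB_COMPACTION_MANUAL follow my reading of the 1.0-beta headers and should be verified:

```c
#include "libforestdb/forestdb.h"

void compaction_examples(void) {
    fdb_config config = fdb_get_default_config();
    fdb_file_handle *dbfile;

    /* Daemon compaction: let the compactor thread rewrite the file
     * once (for example) 50% of it is stale. */
    config.compaction_mode = FDB_COMPACTION_AUTO;
    config.compaction_threshold = 50;  /* percent of stale data */
    fdb_open(&dbfile, "./auto.fdb", &config);
    /* ... normal reads/writes; compaction happens in the background */
    fdb_close(dbfile);

    /* Manual compaction: the application picks the moment and the
     * new file name. */
    config = fdb_get_default_config();
    config.compaction_mode = FDB_COMPACTION_MANUAL;
    fdb_open(&dbfile, "./manual.fdb", &config);
    fdb_compact(dbfile, "./manual_compacted.fdb");
    fdb_close(dbfile);
}
```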
ForestDB: Evaluation
Evaluation Environment (ForestDB DGM performance)
- 64-bit machine running CentOS 6.5
- Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
- 32 GB RAM and a Crucial M4 SSD
- Data: 32-byte keys and 1 KB values; 100M items loaded; logical data size 100 GB in total

KV Storage Engine Configurations
- LevelDB: compression disabled; write buffer size 256 MB (initial load) / 4 MB (otherwise); buffer cache size 8 GB
- RocksDB: compression disabled; write buffer size 256 MB (initial load) / 4 MB (otherwise); max 8 background compaction threads; max 8 background memtable flushes; max 8 write buffers; buffer cache size 8 GB (uncompressed)
- ForestDB: compression disabled; WAL size 4,096 documents; buffer cache size 8 GB
Initial Load Performance

[Chart: initial load time and write overhead for ForestDB, LevelDB, and RocksDB. ForestDB takes 3x ~ 6x less time and incurs 4x less write overhead.]
Read-Only Performance

[Chart: operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB. ForestDB is 2x ~ 5x faster.]
Write-Only Performance

[Chart: operations per second vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB. ForestDB is 3x ~ 5x faster; note that very small batch sizes (e.g., fewer than 10 documents) are not common in practice.]
Write Amplification

[Chart: write amplification (normalized to a single document size) vs. write batch size (1, 4, 16, 64, 256 documents). ForestDB shows 4x ~ 20x less write amplification than LevelDB and RocksDB.]
Mixed Workload Performance

[Chart: mixed (unrestricted) performance, operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB. ForestDB is 2x ~ 5x faster.]
Optimizations for Solid-State Drives
OS File System Stack Overhead

[Figure: in a typical database storage stack, the database storage engine sits on top of the OS file system (page cache, metadata management), which accesses SSDs through a block I/O interface (SATA, PCI). In an advanced database storage stack, a volume manager and buffer cache inside the storage engine replace the OS file system, and the engine talks to the SSDs directly through the block I/O interface.]
Advanced Database Storage Stack
- Bypasses the entire OS file system stack
- Volume manager: operates on unformatted disks, maintains the list of valid blocks used by the database, and garbage-collects all invalid blocks
- Buffer cache: allows us to use different cache policies based on the application workload
Database Compaction
- Required for the append-only storage model: garbage-collects stale data blocks
- Uses significant disk I/O bandwidth: reads the entire database file and writes all valid blocks into a new file
- Affects other performance metrics: regular read / write performance drops significantly during compaction
SWAT-Based Compaction Optimization
- A logical page can change its physical address in flash memory whenever it is overwritten
- For this reason, the Flash Translation Layer (FTL) maintains the mapping table between logical block addresses (LBAs) and physical block addresses (PBAs)

[Figure: logical addresses A, B, C, D, E, F in the file system are mapped by the FTL to physical addresses in flash memory; when A is overwritten, its mapping is redirected to a new physical page A'.]
[Figure: the current DB file holds documents, B+Tree (HB+Trie) nodes, and stale old versions of B+Tree nodes; the new compacted DB file holds only the valid versions.]

- A new compacted file can be created simply by building new LBA-to-PBA mappings that reference only the valid pages of the current DB file, with no page copying
- This requires extending the FTL with a new interface: SWAT (Swap and Trim), sketched below
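A purely conceptual sketch of compaction by remapping; swat() stands in for the proposed FTL extension and is not a real device interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FTL extension: make 'lba_new' point at the physical
 * page currently backing 'lba_old', then trim 'lba_old'. */
void swat(uint64_t lba_old, uint64_t lba_new);

/* Compact by remapping: the i-th valid page of the old file becomes
 * the i-th page of the new file without any flash page being copied. */
void compact_by_remap(const uint64_t *valid_lbas, size_t n,
                      uint64_t new_file_first_lba) {
    for (size_t i = 0; i < n; ++i)
        swat(valid_lbas[i], new_file_first_lba + i);
}
```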
SWAT-Based Compaction Optimization: Results
- Implemented the SWAT interface on the OpenSSD development platform by adapting its FTL code
- The total time taken for compactions was reduced by 17x
- The number of compactions triggered was reduced by 4x
Utilizing Parallel Channels on SSDs
- Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs
- Quite useful when querying secondary indexes, where the items satisfying a query predicate are located in multiple blocks on different channels; see the sketch below
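A minimal libaio sketch of the idea: batch several block reads into one submission so the SSD can serve them from different channels concurrently. The file name, block size, and scattered offsets are placeholders, and error handling is omitted:

```c
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

#define NREADS 4
#define BLOCK_SIZE 4096

int main(void) {
    int fd = open("./example.fdb", O_RDONLY | O_DIRECT);
    io_context_t ctx = 0;
    io_setup(NREADS, &ctx);

    struct iocb cbs[NREADS], *cbp[NREADS];
    void *bufs[NREADS];
    for (int i = 0; i < NREADS; ++i) {
        posix_memalign(&bufs[i], BLOCK_SIZE, BLOCK_SIZE);
        /* scattered offsets: blocks likely on different channels */
        io_prep_pread(&cbs[i], fd, bufs[i], BLOCK_SIZE,
                      (long long)i * 1024 * BLOCK_SIZE);
        cbp[i] = &cbs[i];
    }
    io_submit(ctx, NREADS, cbp);  /* one submission, NREADS in-flight I/Os */

    struct io_event events[NREADS];
    io_getevents(ctx, NREADS, NREADS, events, NULL);  /* wait for all */

    io_destroy(ctx);
    close(fd);
    return 0;
}
```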
Append / Prepend Limitations
- More and more applications use the Append / Prepend APIs provided by Couchbase Server, e.g., mobile messaging services such as Viber
- The value size gets bigger over time, so compaction happens frequently
- Today an append is a read-modify-write of the whole document:

  curr_val = Get("foo");
  new_val = curr_val + delta;
  Set("foo", new_val);

[Figure: in the append-only DB file, each such update writes a complete new version of the "foo" document alongside the old one.]
Storage-Level Append / Prepend APIs
- Internally check whether the key exists
- If it exists, write only the delta value into the file, keeping an offset to the old value:

  Append("foo", delta);

- Appended / prepended values are consolidated together periodically or as part of compaction
- Result: less compaction overhead

[Figure: the new "foo" document in the DB file stores only the delta, with an offset pointing back to the previous version of "foo".]

A hypothetical sketch of the resulting read path follows.
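To make the delta-chain idea concrete, here is a hypothetical sketch of the read path such an API implies: each on-disk record stores its payload plus the offset of the previous version, and a read walks the chain and stitches the pieces together. The record layout and every name here are invented for illustration; read_record() is an assumed helper:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk record for a value stored as a delta chain. */
struct delta_rec {
    uint64_t prev_offset;   /* offset of the previous version; 0 if none */
    uint32_t len;           /* payload bytes in this record */
    char payload[];
};

/* Assumed helper: read the record at 'offset' from the DB file. */
const struct delta_rec *read_record(uint64_t offset);

/* Reassemble a value by walking the chain back to the base record,
 * then copying the pieces front-to-back (appends only, for brevity). */
size_t materialize(uint64_t offset, char *out, size_t outcap) {
    /* collect the chain, newest first */
    const struct delta_rec *chain[64];
    size_t depth = 0;
    for (uint64_t off = offset; off && depth < 64;
         off = chain[depth - 1]->prev_offset)
        chain[depth++] = read_record(off);

    /* copy oldest-to-newest so appended deltas land in order */
    size_t pos = 0;
    for (size_t i = depth; i-- > 0; ) {
        if (pos + chain[i]->len > outcap) break;
        memcpy(out + pos, chain[i]->payload, chain[i]->len);
        pos += chain[i]->len;
    }
    return pos;
}
```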
Summary
- ForestDB: compacted main index structure built from HB+-Trie; high performance, space efficiency, and scalability
- Various optimizations for solid-state drives: compaction, volume manager, exploiting parallel I/O channels on SSDs, append / prepend
- ForestDB integrations: Couchbase Server secondary index, Couchbase Lite (see "The Future of Couchbase Mobile" session @ 5:10pm today), Couchbase Server KV engine