Developing a Flash-Optimized Storage
Engine for NoSQL Databases
Brian O’Krafka
Engineering Fellow
SanDisk Corporation
Santa Clara, CA
August 2015
Forward-Looking Statements

During our meeting today we will make forward-looking statements.

Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, industry trends, future products, product performance, and product capabilities. This presentation also contains forward-looking statements attributed to third parties, which reflect their projections as of the date of issuance.

Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including the factors detailed under the caption "Risk Factors" and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.

We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.
Overview
• Flash-optimizing NoSQL
• Using a general-purpose key-value library for flash optimization
• Examples: MongoDB, Cassandra
• Conclusion
Flash-Optimizing NoSQL

Flash-optimizing applications: replace the application storage engine with a fast key-value library
• Exploit flash latency and IOPS
• Require extensive parallelism
• Cache hot data in DRAM

Flash-extending applications: use a key-value library to stage data between DRAM and flash
• Get "good-enough" performance at flash capacity and cost, enabling server consolidation

Key-value abstraction is a good semantic fit for extending many apps
• Many applications manage data internally as objects
• A good key-value storage engine can simplify flash optimization
• Need more than basic CRUD functionality: crash safety, transactions, snapshots, range queries
• Typical applications: caching, databases, message queues, data grids

A good key-value library can dramatically reduce the work required to flash-extend or flash-optimize applications
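The "replace the storage engine" idea above can be made concrete with a minimal sketch. The class and method names below are illustrative only (they are not ZetaScale's actual API): the point is that an application coded against a small CRUD contract like this can later swap in a flash-optimized key-value library behind the same interface.

```python
class KVEngine:
    """Hypothetical minimal key-value storage-engine contract:
    Create, Replace, Update, Delete, plus a read. Backed here by a
    plain dict standing in for a flash-optimized implementation."""

    def __init__(self):
        self._store = {}

    def create(self, key, value):
        # Create fails if the key already exists.
        if key in self._store:
            raise KeyError(f"{key!r} already exists")
        self._store[key] = value

    def replace(self, key, value):
        # Replace has upsert semantics: write whether or not the key exists.
        self._store[key] = value

    def update(self, key, value):
        # Update requires the key to exist.
        if key not in self._store:
            raise KeyError(f"{key!r} does not exist")
        self._store[key] = value

    def read(self, key):
        return self._store.get(key)

    def delete(self, key):
        self._store.pop(key, None)
```

An application would hold a single `KVEngine` reference and route all object persistence through it, so the backing implementation can change without touching application logic.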
Necessary Key-Value Features (1/2)

Basic CRUD operations
• Create, Replace, Update, Delete
• Support for a wide range of key and value sizes
• Flexible buffering options (caller- vs. library-allocated)

Crash safety
• The key-value store must recover to a usable state after a hardware or software crash
• Durability semantics:
  – "Pretty good" durability: a small number of acknowledged writes may be lost in a crash
  – Software crash-safe: no acknowledged writes lost in a software crash
  – Hardware crash-safe: no acknowledged writes lost in a software or hardware crash
• Configurable durability levels are nice to have

Range queries
• Get all objects with keys in the range "apple" to "barn"
• Sort function should be configurable
• Ascending and descending order
• Retrieve objects in bulk, with optional caller-provided buffers

Transactions
• txn-start; write(), write(), delete(), ...; txn-commit
• Isolation not necessary: usually easy to enforce in the application, and avoiding it sidesteps deadlock issues
• Read-uncommitted is OK
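The txn-start/txn-commit semantics described above (atomic batches, no isolation, read-uncommitted allowed) can be sketched as a staging buffer that is applied to the store in one step at commit. This is an illustrative model, not ZetaScale's implementation; all names are hypothetical.

```python
class TxnKV:
    """Sketch of txn-start / write / delete / txn-commit semantics:
    writes inside a transaction are staged and applied atomically at
    commit; reads see staged data (read-uncommitted, per the slides)."""

    def __init__(self):
        self.store = {}
        self.staged = None  # None => autocommit mode (no open transaction)

    def txn_start(self):
        self.staged = {}

    def write(self, key, value):
        target = self.staged if self.staged is not None else self.store
        target[key] = value

    def delete(self, key):
        if self.staged is not None:
            self.staged[key] = None  # tombstone, applied at commit
        else:
            self.store.pop(key, None)

    def read(self, key):
        # Read-uncommitted: staged writes and tombstones are visible.
        if self.staged is not None and key in self.staged:
            return self.staged[key]
        return self.store.get(key)

    def txn_commit(self):
        # Apply the whole batch in one step: all or nothing.
        for k, v in self.staged.items():
            if v is None:
                self.store.pop(k, None)
            else:
                self.store[k] = v
        self.staged = None

    def txn_abort(self):
        # Discard the staged batch; the store is untouched.
        self.staged = None
```

A crash (or abort) before txn-commit simply discards the staged batch, which is how atomicity is preserved without needing isolation between concurrent transactions.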
Necessary Key-Value Features (2/2)

Snapshots
• Read-only image of a namespace at a point in time
• Often used for backups

Multiple namespaces
• Separate namespace per "table", "collection", or "column family"

DRAM caching
• Flash is good, but DRAM is still orders of magnitude faster
• Most applications have locality
• Space-efficient for large and small objects
• Immune to size churn

Memory efficiency
• Low DRAM bytes per object

High performance
• Granular locking for high concurrency (100,000s of IOPS)
• Short code path
• Minimize write amplification
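The DRAM-caching feature above (hot objects in memory, flash behind it) can be sketched with a small least-recently-used cache in front of a backing store. This is a toy model for illustration, assuming LRU eviction; the slides do not specify ZetaScale's actual caching policy.

```python
from collections import OrderedDict

class DramCache:
    """Sketch of a DRAM object cache in front of a flash-resident store:
    hits are served from memory, misses fetch from the backing store,
    and the least-recently-used object is evicted when full."""

    def __init__(self, backing, capacity):
        self.backing = backing        # dict standing in for the flash store
        self.capacity = capacity
        self.cache = OrderedDict()    # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most-recently-used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]        # simulated flash read
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the LRU entry
        return value
```

With typical workload locality, most reads hit the small DRAM cache, which is why caching pays off even though flash itself is fast.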
Example: ZetaScale™ Software Key-Value Store

Key-Value API
• Create, Replace, Update, Delete of objects
• Range queries, transactions, snapshots
• Multiple namespaces via containers

Optimized for flash storage
• Maps object keys to flash locations
• Granular caching of objects in DRAM

Persistent data
• Persists data with configurable durability guarantees
• Automatically recovers persisted data after a crash

High performance on flash
• Granular locking for maximum concurrency at 100,000s of IOPS
• Short code paths to minimize CPU overhead
• Low memory overhead

[Diagram: an application issues Create/Replace/Update/Delete calls to the ZetaScale key-value library, which stores objects (obj00–obj02, obj10–obj12) in multiple containers (Ctnr 0, Ctnr 1)]
• MongoDB (from "humongous") is an open-source NoSQL document store
  – JSON-style documents
  – Built-in sharding across multiple nodes
  – Automatic resharding when adding or deleting nodes
  – Rich, document-based queries
  – Supports multiple indices
• Test hardware: 2x 8-core 2.6 GHz Intel Xeon; 64 GB DRAM; 8x 200 GB Lightning® SSDs; client co-resident on the server
• Test software: CentOS 6.6; MongoDB 3.0.1; YCSB with point read, update, and insert; 1 KB objects; 15-minute runs
• For Read/Update: the 128 GB dataset contained 128 million 1 KB objects
MongoDB with ZetaScale™: Read/Update 128G

[Chart: transactions per second (0–100,000) vs. read/update mix (100/0 through 0/100) for MMAPv1, WiredTiger, and ZetaScale]

Source: Based on SanDisk testing, Apr/May 2015
MongoDB Storage Engine Architecture

[Diagram: the ZetaScale MongoDB shim plugs in beneath MongoDB's storage engine API, implementing the SortedData, RecordStore, and RecoveryUnit (transactions) interfaces]
MongoDB Key-Value Integration

MongoDB calls the ZetaScale MongoDB shim through the MongoDB storage engine API; the shim translates each interface onto the key-value (KV) library, which manages the SSDs:
• RecordStore interface (data record store) → KV read/write/delete API
• SortedData interface (index CRUD) → KV read/write/delete API
• Iterator → KV range API
• Cursor (index and range query) → KV range API
• RecoveryUnit interface (durability and isolation) → KV transaction API

• A MongoDB collection and its indexes map to one or more key-value namespaces
• A record's location is identified by a unique auto-generated ID
• Secondary indexes use the record location as the value
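The secondary-index mapping above (record location stored under an index key) can be sketched with a simple composite-key encoding. The encoding here is purely illustrative: the shim's actual key format is not documented in the slides, and the separator/ID-width choices below are assumptions.

```python
def index_key(field_value: str, record_id: int) -> bytes:
    """Hypothetical secondary-index key: indexed value + NUL separator +
    fixed-width big-endian record ID. Appending the ID keeps duplicate
    field values distinct, and fixed-width big-endian IDs sort correctly."""
    return field_value.encode() + b"\x00" + record_id.to_bytes(8, "big")

def scan_equal(index: dict, field_value: str):
    """Emulate an equality lookup as a prefix range scan over the index
    namespace, returning the matching record IDs in sorted order."""
    prefix = field_value.encode() + b"\x00"
    return sorted(int.from_bytes(k[len(prefix):], "big")
                  for k in index
                  if k.startswith(prefix))
```

The NUL separator matters: without it, a scan for "smith" would also match keys indexed under "smithson", because one encoded value is a prefix of the other.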
Cassandra is an open-source distributed key-value store
• Large-scale synchronous/asynchronous replication
• Automatic fault tolerance and scaling
• Tunable consistency
• Efficient support for large rows (1000s of columns)
• CQL (SQL-like) query language
• Supports multiple indices

• Test hardware: Dell R720 with 2x 8-core 2.60 GHz Intel CPUs; 128 GB DRAM; 8x Lightning SSDs; LSI 9207 HBA controller; remote client with 2x 8-core Intel Xeon CPUs over 10G Ethernet
• Test software: Cassandra v2.0.3, stock and with ZetaScale; DataStax-modified cassandra-stress tool; 60M rows, 5 columns per row, 100-byte object size; 128 threads
Cassandra with ZetaScale™ Key-Value Library

[Chart: transactions per second (0–60,000) vs. read/write mix (32R/1W through 1R/32W) for stock Cassandra and Cassandra with ZetaScale]

Source: Based on SanDisk testing, Mar 2014
Cassandra Key-Value Integration

[Diagram: a Cassandra column family holds rows; each row (Key1, Key2) has columns Field 1/2/3 with Value 1/2/3. Each column maps to one key-value pair:
  Key => row key + field 'i'
  Value => value 'i']
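The row-to-key-value mapping sketched above (one KV pair per column, keyed by row key plus field name) can be illustrated directly. The separator and helper names below are assumptions for the sketch, not the shim's actual encoding.

```python
def flatten_row(row_key: str, columns: dict) -> dict:
    """Map one wide Cassandra-style row to flat KV pairs:
    key = row key + separator + column name, value = column value."""
    return {f"{row_key}\x00{col}": val for col, val in columns.items()}

def read_row(store: dict, row_key: str) -> dict:
    """Recover a whole row with a prefix range query over the flat
    keyspace, stripping the row-key prefix from each result."""
    prefix = row_key + "\x00"
    return {k[len(prefix):]: v
            for k, v in store.items()
            if k.startswith(prefix)}
```

Because all of a row's columns share a common key prefix, a single KV range query reassembles the row, which is how the slides' "row and column range queries" map onto the key-value range API.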
Cassandra Key-Value Integration
• Replace the Cassandra LSM tree (memtable and SSTables) with the key-value storage engine
• Route object get/put calls to key-value get/put
• Disable the stock Cassandra journal: the key-value store maintains its own journal
• Key-value indexing is used for row and column range queries
• Key-value transactions enforce atomicity of row updates and secondary-index modifications
• Key-value snapshots are used for full and incremental backups
• Compaction is eliminated
[Diagram: stock Cassandra serves clients through the Thrift service, writing to an in-memory MemTable and CommitLog, flushing MemTables to on-storage SSTables, and merging MemTable and SSTables on reads; with ZetaScale, the Thrift service serializes/deserializes objects directly to and from the key-value store]
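The "compaction is eliminated" point above can be made concrete with a toy contrast: a miniature LSM-style write path accumulates SSTables whose obsolete versions linger until a compaction pass, while routing the same puts into a single key-value store leaves nothing to compact. This is a deliberately simplified model, not Cassandra's or ZetaScale's code.

```python
class ToyLSM:
    """Toy LSM write path: puts land in a memtable, which is flushed to a
    new immutable SSTable when full. Old versions of a key pile up across
    SSTables until a (not-implemented-here) compaction merges them."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []            # list of flushed, immutable tables
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(self.memtable))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:
                return table[key]
        return None

class ToyKVEngine:
    """Same puts routed to one flat store: no flushes, no SSTables,
    and therefore no compaction debt."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)
```

In the LSM model, read cost and space both grow with the number of un-compacted SSTables; the flat key-value engine keeps exactly one live version per key.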
Conclusion
• Key-value abstraction is a good semantic fit for flash-optimizing NoSQL databases (and other apps too!)
• Need sufficient functionality: crash safety, transactions, snapshots, range queries
• Proof points: MongoDB and Cassandra show significant performance gains
• Higher performance -> server consolidation -> reduced TCO
Questions? SanDisk Booth #207
(C)2015 SanDisk Corporation. All rights reserved. SanDisk and Lightning are trademarks of SanDisk Corporation, registered in the
United States and other countries. ZetaScale is a trademark of SanDisk Corporation. Other brand names mentioned herein are for
identification purposes only and may be the trademarks of their holder(s).