Developing a Flash-Optimized Storage
Engine for NoSQL Databases
Brian O’Krafka
Engineering Fellow
SanDisk Corporation
Santa Clara, CA
August 2015
Forward-Looking Statements

During our meeting today we will make forward-looking statements.

Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, industry trends, future products, product performance, and product capabilities. This presentation also contains forward-looking statements attributed to third parties, which reflect their projections as of the date of issuance.

Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including the factors detailed under the caption "Risk Factors" and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.

We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.
Overview
• Flash-optimizing NoSQL
• Using a general-purpose key-value library for flash optimization
• Examples: MongoDB, Cassandra
• Conclusion
Flash-Optimizing NoSQL

Flash-optimizing applications: replace the application storage engine with a fast key-value library
• Exploit flash latency and IOPS
• Require extensive parallelism
• Cache hot data in DRAM

Flash-extending applications: use a key-value library to stage data between DRAM and flash
• Get "good-enough" performance at flash capacity and cost, enabling server consolidation

Key-value abstraction is a good semantic fit for extending many apps
• Many applications manage data internally as objects
• A good key-value storage engine can simplify flash optimization
• Need more than basic CRUD functionality: crash safety, transactions, snapshots, range queries
• Typical applications: caching, databases, message queues, data grids

A good key-value library can dramatically reduce the work required to flash-extend or flash-optimize applications
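The "replace the storage engine" idea above can be made concrete with a minimal sketch. The class and method names below are illustrative only (they are not ZetaScale's actual API): the point is that an application coded against a small CRUD contract like this can later swap in a flash-optimized key-value library behind the same interface.

```python
class KVEngine:
    """Hypothetical minimal key-value storage-engine contract:
    Create, Replace, Update, Delete, plus a read. Backed here by a
    plain dict standing in for a flash-optimized implementation."""

    def __init__(self):
        self._store = {}

    def create(self, key, value):
        # Create fails if the key already exists.
        if key in self._store:
            raise KeyError(f"{key!r} already exists")
        self._store[key] = value

    def replace(self, key, value):
        # Replace has upsert semantics: write whether or not the key exists.
        self._store[key] = value

    def update(self, key, value):
        # Update requires the key to exist.
        if key not in self._store:
            raise KeyError(f"{key!r} does not exist")
        self._store[key] = value

    def read(self, key):
        return self._store.get(key)

    def delete(self, key):
        self._store.pop(key, None)
```

An application would hold a single `KVEngine` reference and route all object persistence through it, so the backing implementation can change without touching application logic.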
Necessary Key-Value Features (1/2)

Basic CRUD operations
• Create, Replace, Update, Delete
• Support for a wide range of key and value sizes
• Flexible buffering options (caller- vs. library-allocated)

Crash safety
• The key-value store must recover to a usable state after a hardware or software crash
• Durability semantics:
  – "Pretty good" durability: a small number of acknowledged writes may be lost in a crash
  – Software crash-safe: no acknowledged writes lost in a software crash
  – Hardware crash-safe: no acknowledged writes lost in a software or hardware crash
• Configurable durability levels are nice to have

Range queries
• Get all objects with keys in the range "apple" to "barn"
• Sort function should be configurable
• Ascending and descending order
• Retrieve objects in bulk, with optional caller-provided buffers

Transactions
• txn-start; write(), write(), delete(), ...; txn-commit
• Isolation not necessary: usually easy to enforce in the application, and avoiding it sidesteps deadlock issues
• Read-uncommitted is OK
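The txn-start/txn-commit semantics described above (atomic batches, no isolation, read-uncommitted allowed) can be sketched as a staging buffer that is applied to the store in one step at commit. This is an illustrative model, not ZetaScale's implementation; all names are hypothetical.

```python
class TxnKV:
    """Sketch of txn-start / write / delete / txn-commit semantics:
    writes inside a transaction are staged and applied atomically at
    commit; reads see staged data (read-uncommitted, per the slides)."""

    def __init__(self):
        self.store = {}
        self.staged = None  # None => autocommit mode (no open transaction)

    def txn_start(self):
        self.staged = {}

    def write(self, key, value):
        target = self.staged if self.staged is not None else self.store
        target[key] = value

    def delete(self, key):
        if self.staged is not None:
            self.staged[key] = None  # tombstone, applied at commit
        else:
            self.store.pop(key, None)

    def read(self, key):
        # Read-uncommitted: staged writes and tombstones are visible.
        if self.staged is not None and key in self.staged:
            return self.staged[key]
        return self.store.get(key)

    def txn_commit(self):
        # Apply the whole batch in one step: all or nothing.
        for k, v in self.staged.items():
            if v is None:
                self.store.pop(k, None)
            else:
                self.store[k] = v
        self.staged = None

    def txn_abort(self):
        # Discard the staged batch; the store is untouched.
        self.staged = None
```

A crash (or abort) before txn-commit simply discards the staged batch, which is how atomicity is preserved without needing isolation between concurrent transactions.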
Necessary Key-Value Features (2/2)

Snapshots
• Read-only image of a namespace at a point in time
• Often used for backups

Multiple namespaces
• Separate namespace per "table", "collection", or "column family"

DRAM caching
• Flash is good, but DRAM is still orders of magnitude faster
• Most applications have locality
• Space-efficient for large and small objects
• Immune to size churn

Memory efficiency
• Low DRAM bytes per object

High performance
• Granular locking for high concurrency (100,000s of IOPS)
• Short code path
• Minimize write amplification
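The DRAM-caching feature above (hot objects in memory, flash behind it) can be sketched with a small least-recently-used cache in front of a backing store. This is a toy model for illustration, assuming LRU eviction; the slides do not specify ZetaScale's actual caching policy.

```python
from collections import OrderedDict

class DramCache:
    """Sketch of a DRAM object cache in front of a flash-resident store:
    hits are served from memory, misses fetch from the backing store,
    and the least-recently-used object is evicted when full."""

    def __init__(self, backing, capacity):
        self.backing = backing        # dict standing in for the flash store
        self.capacity = capacity
        self.cache = OrderedDict()    # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most-recently-used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]        # simulated flash read
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the LRU entry
        return value
```

With typical workload locality, most reads hit the small DRAM cache, which is why caching pays off even though flash itself is fast.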
Example: ZetaScale™ Software Key-Value Store

Key-Value API
• Create, Replace, Update, Delete of objects
• Range queries, transactions, snapshots
• Multiple namespaces via containers

Optimized for flash storage
• Maps object keys to flash locations
• Granular caching of objects in DRAM

Persistent data
• Persists data with configurable durability guarantees
• Automatically recovers persisted data after a crash

High performance on flash
• Granular locking for maximum concurrency at 100,000s of IOPS
• Short code paths to minimize CPU overhead
• Low memory overhead

[Diagram: an application issues Create/Replace/Update/Delete calls to the ZetaScale key-value library, which stores objects (obj00–obj02, obj10–obj12) in multiple containers (Ctnr 0, Ctnr 1)]
• MongoDB (from "humongous") is an open-source NoSQL document store
  – JSON-style documents
  – Built-in sharding across multiple nodes
  – Automatic resharding when adding or deleting nodes
  – Rich, document-based queries
  – Supports multiple indices
• Test hardware: 2x 8-core 2.6 GHz Intel Xeon; 64 GB DRAM; 8x 200 GB Lightning® SSDs; client co-resident on the server
• Test software: CentOS 6.6; MongoDB 3.0.1; YCSB with point read, update, and insert; 1 KB objects; 15-minute runs
• For Read/Update: the 128 GB dataset contained 128 million 1 KB objects
MongoDB with ZetaScale™: Read/Update 128G

[Chart: transactions per second (0–100,000) vs. read/update mix (100/0 through 0/100) for MMAPv1, WiredTiger, and ZetaScale]

Source: Based on SanDisk testing, Apr/May 2015
MongoDB Storage Engine Architecture

[Diagram: the ZetaScale MongoDB shim plugs in beneath MongoDB's storage engine API, implementing the SortedData, RecordStore, and RecoveryUnit (transactions) interfaces]
MongoDB Key-Value Integration

MongoDB calls the ZetaScale MongoDB shim through the MongoDB storage engine API; the shim translates each interface onto the key-value (KV) library, which manages the SSDs:
• RecordStore interface (data record store) → KV read/write/delete API
• SortedData interface (index CRUD) → KV read/write/delete API
• Iterator → KV range API
• Cursor (index and range query) → KV range API
• RecoveryUnit interface (durability and isolation) → KV transaction API

• A MongoDB collection and its indexes map to one or more key-value namespaces
• A record's location is identified by a unique auto-generated ID
• Secondary indexes use the record location as the value
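The secondary-index mapping above (record location stored under an index key) can be sketched with a simple composite-key encoding. The encoding here is purely illustrative: the shim's actual key format is not documented in the slides, and the separator/ID-width choices below are assumptions.

```python
def index_key(field_value: str, record_id: int) -> bytes:
    """Hypothetical secondary-index key: indexed value + NUL separator +
    fixed-width big-endian record ID. Appending the ID keeps duplicate
    field values distinct, and fixed-width big-endian IDs sort correctly."""
    return field_value.encode() + b"\x00" + record_id.to_bytes(8, "big")

def scan_equal(index: dict, field_value: str):
    """Emulate an equality lookup as a prefix range scan over the index
    namespace, returning the matching record IDs in sorted order."""
    prefix = field_value.encode() + b"\x00"
    return sorted(int.from_bytes(k[len(prefix):], "big")
                  for k in index
                  if k.startswith(prefix))
```

The NUL separator matters: without it, a scan for "smith" would also match keys indexed under "smithson", because one encoded value is a prefix of the other.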
Cassandra is an open-source distributed key-value store
• Large-scale synchronous/asynchronous replication
• Automatic fault tolerance and scaling
• Tunable consistency
• Efficient support for large rows (1000s of columns)
• CQL (SQL-like) query language
• Supports multiple indices

• Test hardware: Dell R720 with 2x 8-core 2.60 GHz Intel CPUs; 128 GB DRAM; 8x Lightning SSDs; LSI 9207 HBA controller; remote client with 2x 8-core Intel Xeon CPUs over 10G Ethernet
• Test software: Cassandra v2.0.3, stock and with ZetaScale; DataStax-modified cassandra-stress tool; 60M rows, 5 columns per row, 100-byte object size; 128 threads
Cassandra with ZetaScale™ Key-Value Library

[Chart: transactions per second (0–60,000) vs. read/write mix (32R/1W through 1R/32W) for stock Cassandra and Cassandra with ZetaScale]

Source: Based on SanDisk testing, Mar 2014
Cassandra Key-Value Integration

[Diagram: a Cassandra column family holds rows; each row (Key1, Key2) has columns Field 1/2/3 with Value 1/2/3. Each column maps to one key-value pair:
  Key => row key + field 'i'
  Value => value 'i']
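The row-to-key-value mapping sketched above (one KV pair per column, keyed by row key plus field name) can be illustrated directly. The separator and helper names below are assumptions for the sketch, not the shim's actual encoding.

```python
def flatten_row(row_key: str, columns: dict) -> dict:
    """Map one wide Cassandra-style row to flat KV pairs:
    key = row key + separator + column name, value = column value."""
    return {f"{row_key}\x00{col}": val for col, val in columns.items()}

def read_row(store: dict, row_key: str) -> dict:
    """Recover a whole row with a prefix range query over the flat
    keyspace, stripping the row-key prefix from each result."""
    prefix = row_key + "\x00"
    return {k[len(prefix):]: v
            for k, v in store.items()
            if k.startswith(prefix)}
```

Because all of a row's columns share a common key prefix, a single KV range query reassembles the row, which is how the slides' "row and column range queries" map onto the key-value range API.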
Cassandra Key-Value Integration
• Replace the Cassandra LSM tree (memtable and SSTables) with the key-value storage engine
• Route object get/put calls to key-value get/put
• Disable the stock Cassandra journal: the key-value store maintains its own journal
• Key-value indexing is used for row and column range queries
• Key-value transactions enforce atomicity of row updates and secondary-index modifications
• Key-value snapshots are used for full and incremental backups
• Compaction is eliminated
[Diagram: stock Cassandra serves clients through the Thrift service, writing to an in-memory MemTable and CommitLog, flushing MemTables to on-storage SSTables, and merging MemTable and SSTables on reads; with ZetaScale, the Thrift service serializes/deserializes objects directly to and from the key-value store]
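The "compaction is eliminated" point above can be made concrete with a toy contrast: a miniature LSM-style write path accumulates SSTables whose obsolete versions linger until a compaction pass, while routing the same puts into a single key-value store leaves nothing to compact. This is a deliberately simplified model, not Cassandra's or ZetaScale's code.

```python
class ToyLSM:
    """Toy LSM write path: puts land in a memtable, which is flushed to a
    new immutable SSTable when full. Old versions of a key pile up across
    SSTables until a (not-implemented-here) compaction merges them."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []            # list of flushed, immutable tables
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(self.memtable))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:
                return table[key]
        return None

class ToyKVEngine:
    """Same puts routed to one flat store: no flushes, no SSTables,
    and therefore no compaction debt."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)
```

In the LSM model, read cost and space both grow with the number of un-compacted SSTables; the flat key-value engine keeps exactly one live version per key.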
Conclusion
• Key-value abstraction is a good semantic fit for flash-optimizing NoSQL databases (and other apps too!)
• Need sufficient functionality: crash safety, transactions, snapshots, range queries
• Proof points: MongoDB and Cassandra show significant performance gains
• Higher performance -> server consolidation -> reduced TCO
Questions? SanDisk Booth #207
(C)2015 SanDisk Corporation. All rights reserved. SanDisk and Lightning are trademarks of SanDisk Corporation, registered in the
United States and other countries. ZetaScale is a trademark of SanDisk Corporation. Other brand names mentioned herein are for
identification purposes only and may be the trademarks of their holder(s).