ceph day beijing: newstore: design, implementation, and next step work

NewStore: design, implementation and next step work

Xiaoxi.chen@intel.com

Software Engineer@ Intel

Agenda

• The Problem

• NewStore Design

• Read/Write Process

• Preliminary Performance Result

• Summary

NewStore (a.k.a. KeyFile Store)

• Object Store
  • MemStore
  • KeyValue Store
    • KeyValueDB
      • RocksDBStore
      • LevelDBStore
      • KineticStore
  • Journaling Object Store
    • FileStore
      • GenericFileStoreBackend
      • XFSFileStoreBackend
      • BTRFSFileStoreBackend
      • ZFSFileStoreBackend
  • NewStore

Comparison of current stores

| Store | Real data | Data placement | Transaction implementation (prevents partial writes) | |
|---|---|---|---|---|
| JournalingObjectStore (a.k.a. FileStore) | Filesystem | An object is a file in the filesystem; its location is encoded from (pgid, oid, snapset, etc.); space is managed by the filesystem | Journaling | LevelDB (etc.) |
| MemStore | Ceph::Bufferlist | An object holds a bufferlist; the bufferlist claims or releases memory via the standard glibc allocation API. Memory space is co-managed by the library (tcmalloc, jemalloc) and the OS | N/A | STL::Map |
| KeyValue store | LevelDB (etc.) | An object is striped into a set of fixed-length key-value pairs and managed by the K/V database | Provided by the KV database | LevelDB (etc.) |

Why do we want NewStore?

• Can we skip journaling in the non-overwrite case?
  • Journaling introduces extra writes, which not only add latency but also require extra hardware as a journal device.
  • Especially in object storage scenarios (CDN, S3/Swift), the continuous high write throughput is a challenge for SSD lifetime.
• Can we avoid overloading the filesystem?
  • We tried to maintain a direct mapping of object names to file names. This is problematic because of the 255-character limit, RADOS namespaces, PG prefixes, and the PG directory hashing we do to allow efficient splits.
  • In small-object storage scenarios, we cannot hold the whole inode tree in memory, so every read incurs several FS metadata reads.
• Can we utilize the power of the backend SSD?
  • FileStore cannot utilize the IOPS of an SSD.
  • The key-value store has significant write amplification, so it is not usable in practice.

NewStore Design

MemStore
• Object
  • Data: Ceph::Bufferlist
  • Xattr: STL::Map
  • Omap: STL::Map

KeyValue Store
• Object
  • Data: KeyValueDB (data is striped into 1K (configurable) chunks and stored in the K/V DB)
  • Xattr: KeyValueDB
  • Omap: KeyValueDB

FileStore
• Object
  • Data: FileSystem (/var/lib/ceph/osd.0/current/4.0_head/DIR_0/DIR_0/rb.0.2179.6b8b4567.00000000271b__head_37B6ED00__4)
  • Xattr: Filesystem
  • Omap: KeyValueDB

NewStore
• Object
  • Data: FileSystem
    • Fragment0: /var/lib/ceph/osd.0/current/5/32768
    • Fragment1
  • Omap: KeyValueDB
  • Onode: KeyValueDB
    • [0-4095]: fid=(8,43341), offset 32768
    • [4096-17971]: fid=(5,37784), offset 0
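The KeyValue Store striping mentioned above (data split into fixed-length chunks, each stored as one key-value pair) can be sketched roughly as follows. This is only an illustration: the key format `oid/index` and the function names are assumptions made for the example, not the actual Ceph encoding.

```python
CHUNK_SIZE = 1024  # 1K; the slide notes that this is configurable


def stripe(oid: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Split object data into fixed-length chunks, one key-value pair per chunk.

    The key layout "<oid>/<zero-padded chunk index>" is hypothetical.
    """
    return {
        f"{oid}/{i // chunk_size:08d}": data[i:i + chunk_size]
        for i in range(0, len(data), chunk_size)
    }


def unstripe(oid: str, kv: dict) -> bytes:
    """Reassemble the object by concatenating its chunks in key order."""
    prefix = f"{oid}/"
    keys = sorted(k for k in kv if k.startswith(prefix))
    return b"".join(kv[k] for k in keys)
```

With 1K chunks, every small overwrite rewrites a whole key-value pair in the database, which is one way to see where the write amplification discussed earlier comes from.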

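NewStore data layout

[Figure: NewStore (KeyFile Store) data layout. On the K/V database ("Key") side, the XATTR and the fragment list are stored; each fragment entry records offset, length and FID, where an FID is (fset, fno). On the filesystem ("File") side, fragments are files under /…/osd.0/fragements/, shown as files 5 and 6 in …/fragements/100/.]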
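As a rough sketch of the layout above, the onode can be modelled as a record holding the xattrs and a list of fragments, with each FID resolved to a file path. The reading that `fset` names a directory and `fno` a file inside it is an assumption drawn from the figure, and all class and function names here are hypothetical.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fid:
    fset: int  # assumed: directory number under the fragments root
    fno: int   # assumed: file number inside that directory


@dataclass
class Fragment:
    offset: int   # logical offset of this fragment within the object
    length: int   # number of bytes covered by this fragment
    fid: Fid      # which file holds the bytes


@dataclass
class Onode:
    size: int = 0
    xattrs: dict = field(default_factory=dict)
    fragments: list = field(default_factory=list)


def fragment_path(root: str, fid: Fid) -> str:
    """Map an FID to a file path of the form <root>/<fset>/<fno>."""
    return posixpath.join(root, str(fid.fset), str(fid.fno))
```

For example, `fragment_path("frag-root", Fid(100, 5))` yields `frag-root/100/5`, matching the "directory 100, file 5" picture in the figure.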

NewStore Write Process (Creation)

Creation
• Create a new fragment in the FS
• Write data directly to the fragment (file) in the FS
• Update the onode
  • Object_size, fragment_list

This is the workload pattern in object store!

NO journal write!

[Figure: the K-V database (RocksDB) holds the WAL, Onode and Omap/XATTR; the filesystem holds the fragment. Example onode: Obj:Foo, Size:4MB…]
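The creation steps above can be sketched as follows, with a plain dict standing in for the K-V database. The directory layout, the onode fields, and the choice to fsync the fragment before updating the onode are assumptions made for this illustration, not details confirmed by the slides.

```python
import os
import tempfile


def create_object(root, kv, oid, data, fid):
    """Create an object: write a new fragment file, then record it in the onode."""
    fset, fno = fid
    path = os.path.join(root, str(fset), str(fno))
    os.makedirs(os.path.dirname(path), exist_ok=True)

    # Steps 1 and 2: create the fragment and write the data straight into it.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # assumed: make the data durable before the metadata

    # Step 3: update the onode (object size and fragment list) in the K-V store.
    kv[oid] = {"size": len(data), "fragments": [(0, len(data), fid)]}
    return path
```

The data is written exactly once, to its final location; only the small onode update goes through the database.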

NewStore Write Process (Append)

Append (gaps allowed)
• Write data directly to the last fragment of the object
• Update the onode
  • Object_size, fragment_list

NO journal write!

[Figure: the filesystem holding two fragment boxes. Example onode: Obj:Foo, Size:5MB…]
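A minimal sketch of the append path, assuming a single fragment that starts at logical offset 0 and an onode kept as a plain dict (both are simplifications for illustration). Writing past the current end of the file leaves a hole, which reads back as zeros.

```python
import os
import tempfile


def append_write(path, onode, offset, data):
    """Append `data` at `offset` (at or beyond the current object size)."""
    if offset < onode["size"]:
        raise ValueError("not an append: offset lies inside the object")

    # Write directly into the last fragment; seeking past EOF leaves a gap.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)

    # Update the onode: new object size and the last fragment's length.
    onode["size"] = offset + len(data)
    last = onode["fragments"][-1]
    last[1] = onode["size"] - last[0]
```

Here each fragment is a mutable list `[offset, length, fid]`; that representation is hypothetical.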

NewStore Write Process (Overwrite)

Overwrite (fragment-aligned)
• Create a new fragment in the FS
• Write data directly to the new fragment
• Update the onode
  • Update the fragment mapping.
• [WIP] Background thread to clean up the out-of-date fragment

NO journal write!

[Figure: the filesystem holding the old fragment and the new fragment.]
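The fragment-aligned overwrite above behaves like a copy-on-write swap. A sketch under assumed names: the onode is a dict, fragments are `(offset, length, fid)` tuples, and `stale_fids` is a hypothetical queue that a background cleaner would drain later.

```python
import os
import tempfile


def overwrite_fragment(root, onode, index, data, new_fid, stale_fids):
    """Replace one whole fragment by writing a new file and swapping the mapping."""
    offset, length, old_fid = onode["fragments"][index]
    if len(data) != length:
        raise ValueError("this sketch only handles fragment-aligned overwrites")

    # Create the new fragment and write the data directly into it.
    path = os.path.join(root, str(new_fid[0]), str(new_fid[1]))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Update the onode so the logical range points at the new fragment.
    onode["fragments"][index] = (offset, length, new_fid)

    # The old fragment is now garbage; leave it for background cleanup.
    stale_fids.append(old_fid)
    return path
```

Because the old fragment is untouched until the mapping changes, a crash before the onode update leaves the previous object contents intact.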

NewStore Write Process (WAL)

Modification
• Encode the new data into a WAL entry (op, offset, length, fid, data); the WAL entry is associated with the onode
• Update the attributes of Onode_t (object_size)
• Write the WAL entry and update the onode in the KV database in ONE TRANSACTION.
• Ack to the client: the data is already persistent
• Apply the change to the fragments
  • Aio/ThreadPool/Sync apply
• Clean up the WAL

[Figure: the filesystem holding the fragment. Example onode: Obj:Foo, Size:4.2MB…]
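The WAL path above can be sketched with a dict standing in for the K-V database. The single `kv.update(batch)` call is only a stand-in for one atomic database transaction, and the key names (`wal/...`, `onode/...`) are hypothetical.

```python
import os
import tempfile


def submit_overwrite(kv, oid, seq, offset, data, fid):
    """Persist a partial overwrite as a WAL entry plus onode update, then ack."""
    onode = dict(kv[f"onode/{oid}"])
    onode["size"] = max(onode["size"], offset + len(data))
    batch = {
        # WAL entry: (op, offset, length, fid, data)
        f"wal/{oid}/{seq:08d}": ("write", offset, len(data), fid, data),
        f"onode/{oid}": onode,
    }
    kv.update(batch)  # stand-in for ONE atomic K-V transaction
    return "ack"      # the data is persistent in the K-V store at this point


def apply_wal(kv, oid, path_of):
    """Apply pending WAL entries to the fragment files, then delete them."""
    prefix = f"wal/{oid}/"
    for key in sorted(k for k in kv if k.startswith(prefix)):
        _op, offset, _length, fid, data = kv[key]
        with open(path_of(fid), "r+b") as f:
            f.seek(offset)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        del kv[key]  # clean up the WAL entry once it has been applied
```

The client is acknowledged after `submit_overwrite`; `apply_wal` can then run asynchronously, which corresponds to the Aio/ThreadPool/Sync apply options listed above.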

NewStore Read Process

• Read the onode from the K-V database
  • There is an onode cache in front of the K-V database; the DB is accessed only on a cache miss.
• Check whether any pending WAL entry is associated with this object, and wait until the pending WAL has been applied.
  • A pending WAL means writes are persisted in the KV DB but not yet written to the FS. This is similar to FileStore, where data has been written to the journal but not yet to the FS.
• Iterate over the fragment list
  • Read the data from the fragments (if any) and pad the holes with zeros.
• Reply to the client
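The last two read steps (iterating the fragment list and zero-padding the holes) can be sketched as below. The onode cache lookup and the wait on pending WAL entries are left out, and `read_fragment` is a hypothetical callback that returns bytes from one fragment file.

```python
def read_object(onode, read_fragment, offset, length):
    """Read [offset, offset + length) from an object, zero-filling any holes."""
    end = min(offset + length, onode["size"])
    if end <= offset:
        return b""

    # bytearray(n) is zero-filled, so ranges not covered by a fragment stay zero.
    buf = bytearray(end - offset)

    for frag_off, frag_len, fid in onode["fragments"]:
        lo = max(offset, frag_off)
        hi = min(end, frag_off + frag_len)
        if lo < hi:  # this fragment overlaps the requested range
            data = read_fragment(fid, lo - frag_off, hi - lo)
            buf[lo - offset:lo - offset + len(data)] = data
    return bytes(buf)
```

For an object with fragments covering bytes 0-3 and 8-11, a full read returns the first fragment, four zero bytes, then the second fragment.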

How to minimize the WAL overhead?

• The throughput and write amplification of the WAL are critical to performance
  • A leveled key-value database usually introduces 10X-50X write amplification.
• Special pattern in WAL items
  • Short-lived!!!
  • Never read except in recovery!!
• Tricks we are playing
  • Enlarge the RocksDB write buffer: try to keep as much of the WAL in memory as possible
  • Enforce merging of several write buffers before flushing to disk
    • Since WAL entries are short-lived, a put and its delete are likely to be merged, so less data is written to Level 0
  • Cap the max unapplied WAL size in Ceph
    • Then, if total_buffer_size > max_dirty_wal_size, we can ensure no data is written to Level 0
    • Which implies no compaction; write amplification ~= 1 !!!!

Tunings to RocksDB

rocksdb_max_background_compactions = 4

rocksdb_compaction_threads = 4

rocksdb_write_buffer_size = 536870912 //512MB

rocksdb_write_buffer_num = 4

rocksdb_min_write_buffer_number_to_merge = 2

rocksdb_level0_file_num_compaction_trigger = 4

rocksdb_max_bytes_for_level_base = 104857600 //100MB

rocksdb_target_file_size_base = 10485760 //10MB

rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.

rocksdb_compression = none

| Device | r/s | w/s | rMB/s | wMB/s |
|---|---|---|---|---|
| sdb | 0.00 | 2038.00 | 0.00 | 3.98 |
| sdc | 0.00 | 0.00 | 0.00 | 0.00 |
| sdd | 0.00 | 2099.67 | 0.00 | 11.28 |

• /dev/sdb1: Newstore data
• /dev/sdc1: Newstore RocksDB data
• /dev/sdd1: Newstore RocksDB WAL
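The condition total_buffer_size > max_dirty_wal_size can be checked with a little arithmetic. This sketch assumes that total_buffer_size is simply write buffer size times number of write buffers (the slides do not define it), and the max_dirty_wal_size values passed in are made up for illustration.

```python
MB = 1024 * 1024

# Values taken from the tunings listed above.
write_buffer_size = 536870912  # rocksdb_write_buffer_size (512MB)
write_buffer_num = 4           # rocksdb_write_buffer_num

# Assumption: total in-memory buffer capacity = size x count = 2048 MB.
total_buffer_size = write_buffer_size * write_buffer_num


def wal_fits_in_memory(max_dirty_wal_size: int) -> bool:
    """True if the cap on unapplied WAL data is below the buffer capacity."""
    return total_buffer_size > max_dirty_wal_size
```

With a hypothetical cap of 1024 MB the check passes, so WAL entries would be expected to be deleted while still in the write buffers. That is consistent with the iostat sample above, where the RocksDB data device (sdc) shows no writes at all.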

Preliminary RBD performance comparison

| | Newstore | FileStore (hammer) |
|---|---|---|
| SeqWrite | 0.61 | 1.00 |
| SeqRead | 1.00 | 1.00 |
| RandWrite | 1.65 | 1.00* |
| RandRead | 1.64 | 1.00 |

• The test was done on a 4-node cluster, with
  • 40 HDDs, 10 per node
  • 4 SSDs, 1 per node, each partitioned into 10, hosting newstore_db
• Methodology
  • RBD was pre-allocated before the test by filling it with zeros using dd
  • Page cache was dropped before each run
  • Sequential test with REQ_SIZE=64K and QD=64
  • Random test with REQ_SIZE=4K and QD=8
• Newstore looks promising even on RBD workloads.

What’s next

• Bug fixing and performance analysis
• Multiple fragments per object
  • Currently only one
• Different caching/compaction policies for different data
  • Leverage column families in RocksDB.
• Multiple objects in one fragment?
  • WIP design summit BP

Summary

• By design, NewStore
  • Definitely helps object store performance and TCO by eliminating journaling
  • Still needs effort in the RBD/overwrite case
• The code is functionally ready; any test/benchmark is welcome.
  • Check out the ceph/wip-newstore branch.
  • Not very stable under pressure yet.
• Stay tuned

