
Page 1: Optimizing ZFS for Block Storage

Optimizing ZFS for Block Storage

Will Andrews, Justin Gibbs
Spectra Logic Corporation

Page 2: Optimizing ZFS for Block Storage

Talk Outline

• Quick Overview of ZFS
• Motivation for our Work
• Three ZFS Optimizations
  – COW Fault Deferral and Avoidance
  – Asynchronous COW Fault Resolution
  – Asynchronous Read Completions
• Validation of the Changes
• Performance Results
• Commentary
• Further Work
• Acknowledgements

Page 3: Optimizing ZFS for Block Storage

ZFS Feature Overview

• File System/Object store + Volume Manager + RAID
• Data integrity via RAID, checksums stored independently of data, and metadata duplication
• Changes are committed via transactions, allowing fast recovery after an unclean shutdown
• Snapshots
• Deduplication
• Encryption
• Synchronous write journaling
• Adaptive, tiered caching of hot data

Page 4: Optimizing ZFS for Block Storage

Simplified ZFS Block Diagram

[Diagram: consumers (ZFS POSIX Layer, ZFS Volumes, Lustre, CAM Target Layer) sit on the Presentation Layer, which offers file, block, or object access; zfs(8) and zpool(8) provide configuration & control. Below them, the Data Management Unit handles objects and caching (TX management & object coherency), and the Storage Pool Allocator handles layout policy (volumes, RAID, snapshots, I/O pipeline). The Spectra optimizations live here, in the DMU and SPA.]

Page 5: Optimizing ZFS for Block Storage

ZFS Records or Blocks

• ZFS's unit of allocation and modification is the ZFS record.
• Records range from 512B to 128KB.
• The checksum for each record is verified when the record is read, to ensure data integrity.
• Checksums for a record are stored in the parent record (indirect block or DMU node) that references it, and parent records are themselves checksummed (see the sketch below).
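The checksum-in-parent arrangement can be pictured with a toy model (this sketch is illustrative and does not mirror the real blkptr_t layout): every pointer a parent holds carries the checksum of the child it references, so a read is trusted only once it matches the parent's expectation.

    #include <stdint.h>

    /* Toy block pointer: the parent, not the child, stores the checksum. */
    struct toy_blkptr {
        uint64_t offset;      /* where the child record lives on disk */
        uint64_t size;        /* record size, 512B to 128KB */
        uint64_t checksum;    /* checksum of the child's contents */
    };

    /* Stand-in for ZFS's real checksum functions (fletcher4, SHA-256). */
    static uint64_t
    toy_checksum(const uint8_t *buf, uint64_t len)
    {
        uint64_t sum = 0;
        for (uint64_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];
        return (sum);
    }

    /* A child read is accepted only if it matches the parent's checksum;
     * the parent itself was verified the same way by its own parent. */
    static int
    toy_read_verified(const struct toy_blkptr *bp, const uint8_t *data)
    {
        return (toy_checksum(data, bp->size) == bp->checksum);
    }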

Page 6: Optimizing ZFS for Block Storage

Copy-on-Write, Transactional Semantics

• ZFS never overwrites a currently allocated block:
  – A new version of the storage pool is built in free space
  – The pool is atomically transitioned to the new version
  – Free space from the old version is eventually reused
• Atomicity of the version update is guaranteed by transactions, just like in databases (a toy sketch follows).
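A minimal sketch of that rule, assuming toy types and hypothetical helpers (alloc_from_free_space and write_at are not ZFS APIs): the old block is never touched; a new copy lands in free space, and the change of address ripples up to the überblock, whose rewrite is the atomic commit point.

    #include <stddef.h>
    #include <stdint.h>

    struct toy_block {
        struct toy_block *parent;   /* NULL at the uberblock */
        uint64_t addr;              /* current on-disk address */
        uint64_t size;
    };

    /* Hypothetical primitives standing in for the allocator and vdev I/O. */
    extern uint64_t alloc_from_free_space(uint64_t size);
    extern void write_at(uint64_t addr, const void *buf, uint64_t len);

    static void
    cow_write(struct toy_block *blk, const void *data)
    {
        /* Build the new version in free space; the old block is untouched. */
        blk->addr = alloc_from_free_space(blk->size);
        write_at(blk->addr, data, blk->size);

        /* The parent's pointer to this block changed, so each ancestor is
         * rewritten the same way; the uberblock update at the top is the
         * atomic transition to the new pool version. Blocks of the old
         * version become reclaimable free space afterward. */
        for (struct toy_block *p = blk->parent; p != NULL; p = p->parent) {
            p->addr = alloc_from_free_space(p->size);
            write_at(p->addr, p, sizeof (*p));   /* toy: updated pointers */
        }
    }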

Page 7: Optimizing ZFS for Block Storage

ZFS Transactions

• Each write is assigned a transaction.
• Transactions are written in batches called "transaction groups" (TXGs) that aggregate the I/O into sequential streams for optimum write bandwidth.
• TXGs are pipelined to keep the I/O subsystem saturated (sketched after this list):
  – Open TXG: the current version of objects; most changes happen here.
  – Quiescing TXG: waiting for writers to finish changes to in-memory buffers.
  – Syncing TXG: buffers being committed to disk.
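A sketch of the pipeline under the description above (toy struct; the real machinery lives in ZFS's TXG code, with per-stage synchronization omitted here):

    #include <stdint.h>

    /* At any instant one TXG occupies each stage of the pipeline. */
    struct txg_pipeline {
        uint64_t open_txg;        /* accepting new changes */
        uint64_t quiescing_txg;   /* waiting for writers to finish buffers */
        uint64_t syncing_txg;     /* buffers being committed to disk */
    };

    /* Advance the pipeline once the syncing TXG has committed: quiescing
     * starts syncing, open starts quiescing, and a fresh TXG opens, so
     * the I/O subsystem stays saturated. */
    static void
    txg_advance(struct txg_pipeline *tp)
    {
        tp->syncing_txg = tp->quiescing_txg;
        tp->quiescing_txg = tp->open_txg;
        tp->open_txg++;
    }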

Page 8: Optimizing ZFS for Block Storage

Copy on Write in Action

[Diagram: the überblock, root of the storage pool, points to DMU nodes; a DMU node is the root of an object (e.g. a file) and references data blocks through indirect blocks, which provide the linkage for object expansion. A write creates a new copy of the modified data block and of every indirect block, DMU node, and überblock above it.]

Page 9: Optimizing ZFS for Block Storage

Tracking Transaction Groups

• DMU Buffer (DBUF): metadata for ZFS blocks being modified
• Dirty Record: syncer information for committing the data

[Diagram: one DMU buffer carries a dirty record for each TXG that touched it (open, quiescing, and syncing), each dirty record pointing at its own copy of the record data; the open TXG's copy is the current object version.]

Page 10: Optimizing ZFS for Block Storage

Performance Demo

Page 11: Optimizing ZFS for Block Storage

Performance Analysis

When we write an existing block, we must mark it dirty…

    void
    dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
    {
        int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;

        ASSERT(tx->tx_txg != 0);
        ASSERT(!refcount_is_zero(&db->db_holds));

        DB_DNODE_ENTER(db);
        if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
            rf |= DB_RF_HAVESTRUCT;
        DB_DNODE_EXIT(db);
        /* Synchronous read of the old record contents, even if the
         * caller is about to overwrite all of them. */
        (void) dbuf_read(db, NULL, rf);
        (void) dbuf_dirty(db, tx);
    }

Page 12: Optimizing ZFS for Block Storage

Doctor, it hurts when I do this…

• Why does ZFS read on writes?
  – ZFS records are never overwritten directly
  – Any missing old data must be read before the new version of the record can be written
  – This behavior is a COW fault
• Observations (the avoidance check they suggest is sketched below)
  – Block consumers (databases, disk images, FC LUNs, etc.) are always overwriting existing data.
  – Why read data in a sequential workload when you are destined to discard it?
  – Why force the writer to wait on that read?
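The avoidance half of the fix falls straight out of those observations. A minimal sketch, with a hypothetical helper name: only a write that leaves part of the record unknown needs the old contents, so a full-record overwrite can skip the COW fault entirely, and a partial write can defer resolution off the writer's path.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical check: a write covering the whole record discards the
     * old data anyway, so reading it first is pure overhead. Partial
     * writes still need the old bytes eventually, but not synchronously
     * in the writer's path. */
    static bool
    write_needs_old_data(uint64_t rec_size, uint64_t off, uint64_t len)
    {
        bool full_overwrite = (off == 0 && len == rec_size);
        return (!full_overwrite);
    }

With 128KB records and the sequential block workloads above, most overwrites cover the full record and never fault at all.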

Page 13: Optimizing ZFS for Block Storage

Optimization #1: Deferred Copy-on-Write Faults

How Hard Can It Be? (Famous Last Words)

Page 14: Optimizing ZFS for Block Storage

DMU Buffer State Machine (Before)

[State diagram: UNCACHED → READ when a read is issued; UNCACHED → FILL on a full block write or truncate; READ → CACHED on read complete; FILL → CACHED on copy complete; CACHED → EVICT on teardown.]
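Rendered as code, the pre-optimization machine is small; this sketch follows the diagram's labels (the real enum in the ZFS sources differs in naming and detail):

    /* DMU buffer states before the change, per the diagram above. */
    typedef enum toy_dbuf_state {
        DB_UNCACHED,   /* initial state: no data */
        DB_READ,       /* read issued, awaiting completion */
        DB_FILL,       /* full block write or truncate copying data in */
        DB_CACHED,     /* read complete or copy complete: data valid */
        DB_EVICT       /* teardown */
    } toy_dbuf_state_t;

Note the asymmetry: a full block write reaches CACHED through FILL without any I/O, but a partial write of an uncached buffer has no state of its own, which is why it had to detour through READ synchronously. The reworked machine on the next slide adds the PARTIAL states that the following pages walk through.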

Page 15: Optimizing ZFS for Block Storage

DMU Buffer State Machine (After)

Page 16: Optimizing ZFS for Block Storage

[Diagram: a write arrives for an UNCACHED DMU buffer; a dirty record is created in the open TXG without reading the old record.]

Page 17: Optimizing ZFS for Block Storage

[Diagram: the buffer enters PARTIAL|FILL while the writer copies its data into the open TXG's dirty record.]

Page 18: Optimizing ZFS for Block Storage

[Diagram: the fill completes and the buffer settles in PARTIAL, holding only the written portion of the record.]

Page 19: Optimizing ZFS for Block Storage

[Diagram: the first TXG quiesces and a second dirty record accumulates in the new open TXG; the buffer remains PARTIAL.]

Page 20: Optimizing ZFS for Block Storage

[Diagram: dirty records now span the open, quiescing, and syncing TXGs; the syncer begins processing the syncing TXG's record while the buffer is still PARTIAL.]

Page 21: Optimizing ZFS for Block Storage

[Diagram: to resolve the COW fault, the syncer dispatches a synchronous read into a separate read buffer and the DMU buffer enters the READ state.]

Page 22: Optimizing ZFS for Block Storage

[Diagram: the synchronous read returns and its data is merged into each partial dirty record.]

Page 23: Optimizing ZFS for Block Storage

[Diagram: after the merge the buffer is CACHED and the syncer can write the fully resolved record. The merge step is sketched below.]
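A sketch of that merge, with toy types (merge_dirty_record is illustrative, and real dirty records can carry more complex fill ranges than the single extent assumed here): the resolving read lands in its own buffer, and each partial dirty record copies in only the bytes the writer did not supply.

    #include <stdint.h>
    #include <string.h>

    #define TOY_RECSIZE 131072   /* one 128KB record */

    /* Toy dirty record: one per TXG touching this DMU buffer. */
    struct toy_dirty_record {
        uint8_t  data[TOY_RECSIZE];  /* this TXG's version of the record */
        uint64_t fill_off;           /* single extent the writer filled */
        uint64_t fill_len;
    };

    /* Merge the resolving read into one partial dirty record: bytes
     * inside the writer-filled extent are kept; everything else comes
     * from the old version of the record that the read returned. */
    static void
    merge_dirty_record(struct toy_dirty_record *dr, const uint8_t *read_buf)
    {
        uint64_t end = dr->fill_off + dr->fill_len;

        memcpy(dr->data, read_buf, dr->fill_off);
        memcpy(dr->data + end, read_buf + end, TOY_RECSIZE - end);
    }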

Page 24: Optimizing ZFS for Block Storage

Optimization #2: Asynchronous Fault Resolution

Page 25: Optimizing ZFS for Block Storage

Issues with Implementation #1

• The syncer stalls due to the synchronous resolve behavior.
• Resolving reads that are known to be needed are delayed.
  – Example: a modified version of the record is created in a new TXG
• Writers should be able to cheaply start the resolve process without blocking.
• The syncer should operate on multiple COW faults in parallel.

Page 26: Optimizing ZFS for Block Storage

Complications

• Split brain
  – A ZFS record can have multiple-personality disorder
    • Example: a write, a truncate, and another write, all in flight at the same time with a resolving read
  – The term reflects how dealing with this issue made us feel.
• Chaining the syncer's write to the resolving read
  – The read may have been started in advance of syncer processing, due to a writer noticing that resolution is necessary.

Page 27: Optimizing ZFS for Block Storage

Optimization #3: Asynchronous Reads

Page 28: Optimizing ZFS for Block Storage

ZFS Block Diagram (revisited)

[Diagram: the same simplified block diagram as Page 4, now annotated to contrast the thread-blocking semantics of the existing interfaces with the callback semantics introduced by this work.]

Page 29: Optimizing ZFS for Block Storage

Asynchronous DMU I/O

• Goal: get as much I/O in flight as possible
• Uses Thread-Local Storage (TLS)
  – Avoids lock-order reversals
  – Avoids modifying APIs just to pass down a queue
  – No lock overhead, since the storage is per-thread
• Refcounting while issuing I/Os ensures the callback is not called until the entire I/O completes (see the sketch below)
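A minimal sketch of the refcounting rule, using C11 atomics (names are illustrative, not the DMU API): the issuer holds one reference for the duration of the issue loop, each child I/O holds another, and the user callback fires only when the last reference drops, so an early child completion can never trigger it prematurely.

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Toy aggregate for one logical async I/O built from many child I/Os. */
    struct aio_agg {
        atomic_int refs;
        void (*done)(void *arg);   /* user callback, fired exactly once */
        void *arg;
    };

    static struct aio_agg *
    aio_agg_create(void (*done)(void *), void *arg)
    {
        struct aio_agg *ag = malloc(sizeof (*ag));
        atomic_init(&ag->refs, 1);   /* the issuer's own reference */
        ag->done = done;
        ag->arg = arg;
        return (ag);
    }

    /* Take a reference before dispatching each child I/O. */
    static void
    aio_agg_hold(struct aio_agg *ag)
    {
        atomic_fetch_add(&ag->refs, 1);
    }

    /* Called by the issuer after dispatching everything, and by each
     * child I/O's completion handler: whoever drops the last reference
     * runs the callback. */
    static void
    aio_agg_rele(struct aio_agg *ag)
    {
        if (atomic_fetch_sub(&ag->refs, 1) == 1) {
            ag->done(ag->arg);
            free(ag);
        }
    }

The per-thread queue plays a similar role on the issue side: because it is reachable only from its owning thread, child I/Os can be batched there without locks or API changes.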

Page 30: Optimizing ZFS for Block Storage

Results

Page 31: Optimizing ZFS for Block Storage

Bugs, bugs, bugs…

• Deadlocks
• Data corruption
• Missed events
• Wrong arguments to bcopy
• Page faults
• Bad comments
• Invalid state machine transitions
• Split brain conditions
• Incorrect refcounting
• Unprotected critical sections
• Sleeping while holding non-sleepable locks
• Memory leaks
• Insufficient interlocking

Disclaimer: This is not a complete list.

Page 32: Optimizing ZFS for Block Storage

Validation

• ZFS has many complex moving parts
• Simply thrashing a ZFS pool is not a sufficient test
  – Many hidden parts make use of the DMU layer without being directly involved in data I/O, or involved at all
• Extensive modifications of the DMU layer require thorough verification
  – Every object in ZFS uses the DMU layer to support its transactional nature

Page 33: Optimizing ZFS for Block Storage

Testing, testing, testing…

• Many more asserts added
• Solaris Test Framework ZFS test suite
  – Extensively modified to (mostly) pass on FreeBSD
  – Has ~300 tests; needs more
• ztest: unit(-ish) test suite
  – Its element of randomization requires multiple test runs
  – Some test frequencies increased to verify fixes
• xdd: performance tests
  – Finds bugs involving high workloads

Page 34: Optimizing ZFS for Block Storage

Cleanup & Refactoring

• DMU I/O APIs rewritten to allow issuing async I/Os, minimize hold/release cycles, and unify the API for all callers
• DBUF dirty path restructured
  – Now looks more like a checklist than an organically grown process
  – Broken apart to reduce complexity and ease understanding of its many nuances

Page 35: Optimizing ZFS for Block Storage

Performance Results

• It goes 3-10X faster! Without breaking (almost) anything!
• Results that follow are for the following configuration:
  – RAIDZ2 of four 2TB SATA drives on a 6Gb LSI SAS HBA
  – Xen HVM DomU with 4GB RAM and 4 cores of a 2GHz Xeon
  – 10GB ZVOL, 128KB record size
  – Care taken to avoid cache effects

Page 36: Optimizing ZFS for Block Storage

1 Thread Performance Results

[Bar chart, before vs. after (y-axis 0-450): aligned 128K sequential write (MB/s), aligned 16K sequential write (MB/s), unaligned 128K sequential write (MB/s), 16K random write (IOPS), 16K random read (IOPS).]

Page 37: Optimizing ZFS for Block Storage

10 Thread Performance Results

[Bar chart, before vs. after (y-axis 0-400): aligned 128K sequential write (MB/s), aligned 16K sequential write (MB/s), unaligned 128K sequential write (MB/s), 16K random write (IOPS), 16K random read (IOPS).]

Page 38: Optimizing ZFS for Block Storage

Commentary

• Commercial consumption of open source works best when the code is well written and documented
  – Drastically improved comments and code readability
• Community differences & development choices
  – Sun had a small ZFS team that stayed together
  – FreeBSD has a large group of people who will frequently work on one area and then move on to another
  – Clear coding style, naming conventions, and test cases are required for long-term maintainability

Page 39: Optimizing ZFS for Block Storage

Further Work

• Apply the deferred COW fault optimization to indirect blocks
  – Uncached metadata still blocks writers, and this can cut write performance in half
• Fetch required indirect blocks asynchronously
• Eliminate copies and allow larger I/O cluster sizes in the SPA clustered I/O implementation
• Improve read prefetch performance for sequential read workloads
• Hybrid RAIDZ and/or a more standard RAID 5/6 transform
• All the other things that have kept Kirk working on file systems for 30 years.

Page 40: Optimizing ZFS for Block Storage

Acknowledgments

• Sun's original ZFS team for developing ZFS
• Pawel Dawidek for the FreeBSD port
• HighCloud Security for the FreeBSD port of the STF ZFS test suite
• Illumos for continuing open source ZFS development
• Spectra Logic for funding our work

Page 41: Optimizing ZFS for Block Storage

Questions?

Preliminary Patch Set: http://people.freebsd.org/~will/zfs/