Btrfs Goals
• General purpose filesystem that scales to very large storage
• Fast development cycle to attract developers and users early
• Feature focused, provide things other Linux filesystems cannot
• Administration focused, easy to run and very fault tolerant
• Perform well in a variety of workloads
Btrfs Features
• Extent based file storage
• Copy on write metadata and data
• Space efficient packing of small files
• Space efficient indexed directories
• Integrity checksumming
• Writable snapshots
• Efficient incremental backups
• Multiple device support
• Full back references for all types of metadata
• Offline conversion from Ext3
Btrfs Status
• V0.16 recently released
• Most releases happen 1-2 months apart
• Generally usable in many workloads
• Generally stable
• Works against mainline kernels and 2.6.18 enterprise kernels
• Multi-device support functional
Key Storage
• Only three types of extents exist in Btrfs
  – Data
  – Btree
  – Super block
• Metadata is stored in key / value pairs in a btree
• Items from multiple files and directories share the same btree blocks
• Special purpose btrees used for different functions
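As a rough illustration of the key / value idea above, the sketch below models a single btree leaf as a sorted array of items. The three-part key `(objectid, item_type, offset)`, the type codes, and the `Leaf` class are assumptions made for this example; the slides do not describe the key layout.

```python
import bisect

# Hypothetical key layout: (objectid, item_type, offset). Sorting on this
# tuple keeps all items of one file or directory next to each other.
INODE_ITEM, DIR_ITEM, EXTENT_DATA = 1, 2, 3  # illustrative type codes only


class Leaf:
    """One btree leaf: parallel sorted lists of keys and values."""

    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite an existing item
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def items_for(self, objectid):
        """Return every item that belongs to one file or directory."""
        lo = bisect.bisect_left(self.keys, (objectid,))
        hi = bisect.bisect_left(self.keys, (objectid + 1,))
        return list(zip(self.keys[lo:hi], self.values[lo:hi]))


# Items from a directory (256) and a file (257) share the same leaf.
leaf = Leaf()
leaf.insert((257, EXTENT_DATA, 0), "extent at logical 4096")
leaf.insert((256, INODE_ITEM, 0), "inode for directory 256")
leaf.insert((257, INODE_ITEM, 0), "inode for file 257")
leaf.insert((256, DIR_ITEM, 1234), "name a.txt -> 257")
```

Because the leaf is ordered by key, fetching everything about one object is a single range scan rather than several separate lookups.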
Btrfs Snapshots
• As efficient as directories on disk
• Every transaction is a new unnamed snapshot
• Every commit schedules the old snapshot for deletion
• Snapshots use a reference counted COW algorithm to track blocks
• Snapshots can be written and snapshotted again
• COW leaves old data in place and writes the new data to a new location
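The reference counted COW behaviour can be sketched with a deliberately small two-level tree. The `Block` class and the `snapshot` and `cow_write` helpers are hypothetical names for this example, and a real implementation would handle deeper trees and free blocks whose count reaches zero.

```python
class Block:
    """A tree block with a reference count (illustrative model only)."""

    def __init__(self, data=None, children=None):
        self.data = data
        self.children = children or {}  # slot -> Block
        self.refs = 1


def snapshot(root):
    """Create a new root that shares every child with the old root."""
    new_root = Block(children=dict(root.children))
    for child in new_root.children.values():
        child.refs += 1  # the child is now reachable from two roots
    return new_root


def cow_write(root, slot, data):
    """Write a leaf under `root` without disturbing any other root."""
    old = root.children.get(slot)
    if old is not None:
        old.refs -= 1  # this root no longer points at the old block
    root.children[slot] = Block(data=data)  # new data goes to a new block
    return old


base = Block(children={0: Block("A"), 1: Block("B")})
snap = snapshot(base)
cow_write(snap, 0, "A2")
```

After the write, `base` still sees `"A"` in slot 0, `snap` sees `"A2"`, and the untouched block in slot 1 is shared by both roots with a count of 2.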
Multi-device Support
• Devices are added into a pool of available storage
• New logical address space is added with a specific RAID configuration and data storage flags
  – System (used by the volume management code)
  – Metadata
  – Data
  – Raid0, raid1, raid10, single-spindle-dup
• Space is allocated from the storage pool in large chunks
• Devices can be mixed in size and speed
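One way to picture chunk allocation from a pool of mixed-size devices is the sketch below. The policy of preferring the devices with the most free space, the `allocate_chunk` helper, and the device names are assumptions for illustration; the slides do not say how devices are chosen.

```python
def allocate_chunk(free, size, copies):
    """Reserve `size` bytes on `copies` distinct devices.

    `free` maps a device name to its free bytes and is updated in place.
    Returns the chosen device names.
    """
    candidates = sorted(
        (dev for dev in free if free[dev] >= size),
        key=lambda dev: free[dev],
        reverse=True,  # assumed policy: emptiest devices first
    )
    if len(candidates) < copies:
        raise RuntimeError("not enough devices with free space")
    chosen = candidates[:copies]
    for dev in chosen:
        free[dev] -= size
    return chosen


GiB = 1 << 30
free = {"sda": 100 * GiB, "sdb": 50 * GiB, "sdc": 20 * GiB}

# A mirrored (raid1-style) chunk needs space on two different devices.
first = allocate_chunk(free, GiB, copies=2)
```

Allocating in large chunks keeps the bookkeeping small: the pool only tracks free bytes per device, while finer-grained allocation happens inside each chunk.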
Multi-device Support
• All extent pointers and btree pointers use logical addresses
• Super block contains a bootstrap mapping with enough information to translate the addresses used by the chunk tree
• Single translation and IO path for all extents
• Storage can be restriped or dynamically reconfigured
• Storage pools use the same transaction subsystem as the rest of the FS
  – Much lower overhead when reconfiguring storage
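The single translation path from logical to physical addresses might look roughly like the following. The `Chunk` and `ChunkMap` names and their fields are hypothetical; the real chunk tree stores more information (stripe length, RAID type, and so on) that is left out here.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Chunk:
    logical: int   # first logical address covered by this chunk
    length: int    # size of the logical range in bytes
    stripes: list  # [(device_id, physical_start), ...], one per copy


class ChunkMap:
    """Maps logical addresses to device locations (illustrative only)."""

    def __init__(self):
        self.starts = []
        self.chunks = []

    def add(self, chunk):
        i = bisect.bisect_left(self.starts, chunk.logical)
        self.starts.insert(i, chunk.logical)
        self.chunks.insert(i, chunk)

    def translate(self, logical):
        """Return every (device_id, physical) location for an address."""
        i = bisect.bisect_right(self.starts, logical) - 1
        if i < 0:
            raise KeyError(logical)
        chunk = self.chunks[i]
        offset = logical - chunk.logical
        if offset >= chunk.length:
            raise KeyError(logical)  # address falls in an unmapped hole
        return [(dev, phys + offset) for dev, phys in chunk.stripes]


MiB, GiB = 1 << 20, 1 << 30
cmap = ChunkMap()
cmap.add(Chunk(logical=0, length=GiB, stripes=[(1, 4 * MiB)]))
# A mirrored chunk: the same logical range lives on devices 1 and 2.
cmap.add(Chunk(logical=GiB, length=GiB, stripes=[(1, 2 * GiB), (2, GiB)]))

copies = cmap.translate(GiB + 4096)
```

Since extent and btree pointers hold only logical addresses, moving data between devices means updating the chunk mapping rather than rewriting every pointer.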
Btrfs Directories
• Indexed twice to improve directory read speeds
• First index based on name for fast lookups
• Second index based on directory sequence number
  – Returns filenames in something close to inode number order
  – Tree defrag responsible for keeping inode number order close to disk order
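A minimal sketch of a directory that is indexed twice follows. The `Directory` class is hypothetical, and `zlib.crc32` is only a stand-in for whatever name hash the filesystem uses.

```python
import zlib


class Directory:
    """Directory with a name index and a sequence index (illustrative)."""

    def __init__(self):
        self.by_hash = {}  # name hash -> [(name, inode), ...]
        self.by_seq = {}   # sequence number -> (name, inode)
        self.next_seq = 0

    def add(self, name, inode):
        h = zlib.crc32(name.encode())  # stand-in hash function
        self.by_hash.setdefault(h, []).append((name, inode))
        self.by_seq[self.next_seq] = (name, inode)
        self.next_seq += 1

    def lookup(self, name):
        """Fast path for opening a file by name."""
        h = zlib.crc32(name.encode())
        for entry_name, inode in self.by_hash.get(h, []):
            if entry_name == name:  # resolve hash collisions by name
                return inode
        return None

    def readdir(self):
        """List entries in creation (sequence) order."""
        return [self.by_seq[seq] for seq in sorted(self.by_seq)]


d = Directory()
d.add("zeta", 257)
d.add("alpha", 258)
d.add("mid", 259)
listing = d.readdir()
```

In this example the files were created in inode order, so `readdir` returns them in inode order even though the names are not sorted. A program that stats every file after listing them then walks the inodes roughly sequentially.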
Data Ordering
• Ext3 data ordering writes all dirty data before each commit
  – Any fsync waits for all dirty data to reach disk
• Ext4 data ordering writes all dirty non-delalloc data before each commit
  – Much better than Ext3
• XFS updates inodes only after the data is written
• Btrfs updates inodes, checksums, and extent pointers after the data is written
  – Allows atomic writes
  – Makes transactions independent of data writeback
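The Btrfs ordering rule, where metadata changes only once the data IO has finished, can be sketched as below. The `Inode` class and its method names are hypothetical, and `zlib.crc32` is a stand-in checksum.

```python
import zlib


class Inode:
    """Tracks committed metadata separately from in-flight data IO."""

    def __init__(self):
        self.extents = {}    # file offset -> (disk location, checksum)
        self.in_flight = {}  # file offset -> (disk location, data)

    def start_write(self, offset, location, data):
        # Data IO is submitted; metadata is left untouched for now.
        self.in_flight[offset] = (location, data)

    def finish_write(self, offset):
        # Runs when the data IO completes: only now do the extent
        # pointer and checksum become part of the metadata.
        location, data = self.in_flight.pop(offset)
        self.extents[offset] = (location, zlib.crc32(data))

    def commit(self):
        # A commit captures the metadata as it stands and does not
        # wait for data that is still in flight.
        return dict(self.extents)


inode = Inode()
inode.start_write(0, location=8192, data=b"hello")
before = inode.commit()  # data not on disk yet
inode.finish_write(0)
after = inode.commit()   # extent pointer and checksum now visible
```

A commit taken while the write is in flight never points at data that has not reached disk, which is why the transaction does not need to wait for writeback.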
Synchronous Operations
• COW transaction subsystem is slow for frequent commits
  – Forces recow of many blocks
  – Forces significant amounts of IO writing out extent allocation metadata
• Simple write ahead log added for synchronous operations on files or directories
• File or directory items are copied into a dedicated tree
  – One log tree per subvolume
  – File back refs allow us to log file names without the directory
Synchronous Operations
• The log tree uses the same COW btree code as the rest of the FS
• The log tree uses the same writeback code as the rest of the FS, and uses the metadata raid policy.
• Commits of the log tree are separate from commits in the main transaction code.
  – fsync(file) only writes metadata for that one file
  – fsync(file) does not trigger writeback of any other data blocks
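The per-subvolume log tree can be modelled loosely as follows. The `Subvolume` class, its dictionaries standing in for btrees, and the choice to discard the log after a full commit are all assumptions made for this sketch.

```python
class Subvolume:
    """Toy model of a subvolume with a separate fsync log tree."""

    def __init__(self):
        self.tree = {}       # (inode, item_type) -> value, in memory
        self.log = {}        # log tree: items copied in by fsync
        self.disk_log = {}   # state written by log commits
        self.disk_tree = {}  # state written by full transaction commits

    def fsync(self, inode):
        """Copy one file's items into the log and commit only the log."""
        copied = {k: v for k, v in self.tree.items() if k[0] == inode}
        self.log.update(copied)
        self.disk_log = dict(self.log)  # log commit, no full commit
        return len(copied)

    def commit_transaction(self):
        """Full commit: write the main tree, after which the log is
        no longer needed (assumed behaviour for this sketch)."""
        self.disk_tree = dict(self.tree)
        self.log.clear()
        self.disk_log = {}


sub = Subvolume()
sub.tree[(257, "inode")] = "size=5"
sub.tree[(257, "extent")] = "logical 8192"
sub.tree[(258, "inode")] = "size=9000"

written = sub.fsync(257)
logged_inodes = {key[0] for key in sub.disk_log}
```

Here `fsync` writes two items for inode 257 and nothing for inode 258, and the main tree stays unwritten until `commit_transaction` runs.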
Mixed Fsync Workload
• 8 Processes creating, writing and fsyncing 20k files
• 1 Process creating 30 copies of the linux kernel sources in sequence
Filesystem   Kernel tree I/O Rate   Fsync Creates
Btrfs        83MB/s                 15,680
Ext4         122MB/s                1280
Ext3         66.85MB/s              3200
Btrfs Problems
• ENOSPC
• Online, offline fsck
• http://btrfs.wiki.kernel.org/index.php/Project_ideas