G22.3250-001
Robert Grimm, New York University
SGI’s XFS, or Cool Pet Tricks with B+ Trees
Altogether Now: The Three Questions
What is the problem?
What is new or different?
What are the contributions and limitations?
Some Background
Inode: On-disk data structure containing a file’s metadata and pointers to its data
Defines the FS-internal namespace: inode numbers
Originated in Unix FFS
Vnode: Kernel data structure for representing files
Provides a standard API for accessing different FS’s
Originated in SunOS
Motivation
On one side: the I/O bottleneck
It’s becoming harder to utilize increasing I/O capacity and bandwidth
112 9-GB disk drives provide 1 TB of storage
High-end drives provide 500 MB/sec of sustained disk bandwidth
On the other side: I/O-intensive applications
Editing of uncompressed video: 30 MB/sec per stream, 108 GB for one hour
Streaming compressed video on demand: 2.7 TB for 1,000 movies; serving 200 movies concurrently requires 100 MB/sec
Scalability Problems of Existing File Systems
Slow crash recovery: fsck needs to scan the entire disk
No support for large file systems: 32-bit block pointers address only 4 billion blocks
At 8 KB per block, that is 32 TB
No support for large, sparse files: 64-bit block pointers require more levels of indirection
Indirect blocks are also quite inefficient (see the sketch below)
Fixed-size extents are still too limiting
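A minimal sketch of a classic FFS-style inode, to make the pointer arithmetic concrete; the field names and counts are illustrative assumptions, not the real on-disk layout:

/* Illustrative FFS-style inode (assumed layout, not the real one). */
#include <stdint.h>

#define NDIRECT 12

struct old_inode {
    uint32_t size_bytes;        /* file size */
    uint32_t direct[NDIRECT];   /* direct block pointers */
    uint32_t single_indirect;   /* points to a block of 2048 pointers
                                   (8 KB block / 4-byte pointer) */
    uint32_t double_indirect;   /* 2048 * 2048 pointers */
    uint32_t triple_indirect;   /* 2048^3 pointers */
};

/* 32-bit pointers address 2^32 blocks; at 8 KB per block, 32 TB total.
 * Widening pointers to 64 bits halves the pointers per indirect block
 * (1024 instead of 2048), so large files need even more levels of
 * indirection -- one motivation for switching to extents. */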
Scalability Problems of Existing File Systems (cont.)
No support for large, contiguous files: bitmap structures for tracking free and allocated blocks do not scale
Hard to find large regions of contiguous space
But we need contiguous allocation for good utilization of bandwidth
No support for large directories: linear layout does not scale; in-memory hashing imposes high memory overheads
No support for large numbers of files: inodes are preallocated during file system creation
XFS Approach in a Nutshell
Use 64-bit block addresses: support for larger file systems
Use B+ trees and extents: support for larger numbers of files, larger files, and larger directories; better utilization of I/O bandwidth
Log metadata updates: faster crash recovery
XFS Architecture
I/O manager: I/O requests
Directory manager: file system name space
Space manager: free space, inode & file allocation
Transaction manager: atomic metadata updates
Unified buffer cache
Volume manager: striping, concatenation, mirroring
Storage Scalability
Allocation groups are regions with their own free space maps and inodes
Support AG-relative block and inode pointers, which reduce the size of data structures
Improve parallelism of metadata management by allowing concurrent accesses to different allocation groups
Unlike FFS’s cylinder groups, they are mostly motivated by scalability, not locality
Free space: two B+ trees describing extents (what’s a B+ tree?)
One indexed by starting block (used when?), one indexed by length of extent (used when?); see the sketch below
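A minimal sketch, with assumed names rather than XFS’s actual structures, of the free-extent record that both trees index and of why each index exists:

/* Assumed free-extent record; both free-space B+ trees index these. */
#include <stdint.h>

struct free_extent {
    uint32_t start_block;   /* AG-relative starting block */
    uint32_t block_count;   /* length of the free extent in blocks */
};

/* Tree 1, keyed by start_block: used when freeing an extent, to find
 * the free neighbors of the returned blocks and coalesce them into
 * one larger free extent.
 * Tree 2, keyed by block_count: used when allocating, to find a free
 * extent at least as large as the request (e.g., best fit) so files
 * stay contiguous. */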
Storage Scalability (cont.)
Large files: file storage tracked by an extent map
Each entry: block offset in file, length in blocks, starting block on disk
Small extent maps are organized as a list in the inode
Large extent maps are organized as a B+ tree rooted in the inode, indexed by block offset in file (lookup sketched below)
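A sketch of an extent-map entry and a lookup by file offset; the names are hypothetical, and a B+ tree rooted in the inode performs the same search once the in-inode list grows too large:

#include <stddef.h>
#include <stdint.h>

/* One extent-map entry, per the slide: offset, length, disk location. */
struct extent {
    uint64_t file_offset;   /* starting block offset within the file */
    uint64_t disk_block;    /* starting block on disk */
    uint32_t length;        /* length in blocks (XFS: a 21-bit field) */
};

/* Binary search over extents sorted by file_offset. Returns the extent
 * covering `offset`, or NULL for a hole in a sparse file. */
static const struct extent *
lookup_extent(const struct extent *map, size_t n, uint64_t offset)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offset < map[mid].file_offset)
            hi = mid;
        else if (offset >= map[mid].file_offset + map[mid].length)
            lo = mid + 1;
        else
            return &map[mid];
    }
    return NULL;   /* hole: no backing extent */
}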
Large numbers of files: inodes are allocated dynamically, in chunks of 64
Inode locations are tracked by a B+ tree that points only to inode chunks (sketched below)
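A sketch of how a chunk-granularity inode index can work; the record layout is an assumption, and the real XFS inode B+ tree differs in detail:

#include <stdint.h>

#define INODES_PER_CHUNK 64

/* One record describes a whole chunk of 64 inodes, so the tree stays
 * small even with millions of files. (Assumed layout.) */
struct inode_chunk_rec {
    uint64_t first_inode;   /* inode number of slot 0 in this chunk */
    uint64_t free_mask;     /* bit i set => inode first_inode + i free */
};

/* Looking up inode N: find the record with
 * first_inode <= N < first_inode + INODES_PER_CHUNK; the inode number
 * itself encodes the chunk's disk location, so no per-inode pointer
 * needs to be stored. */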
Storage Scalability (cont.)
Large directories: directories are implemented as (surprisingly) B+ trees
Map 4-byte hashes of names to directory entries (name, inode number); see the hashing sketch below
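A sketch of the name-to-key step; this hash function is purely illustrative, not XFS’s actual one, and entries whose names collide on the same 4-byte hash are disambiguated by comparing full names:

#include <stdint.h>

/* Illustrative 4-byte name hash: the fixed-size key the directory
 * B+ tree is indexed by, so lookup cost is independent of name length. */
static uint32_t dir_hash(const char *name)
{
    uint32_t h = 0;
    while (*name)
        h = (h << 7) ^ (h >> 25) ^ (uint32_t)(unsigned char)*name++;
    return h;
}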
Fast crash recovery: enabled by a write-ahead log (replay sketched below)
Covers all structural updates to metadata
E.g., creating a file touches a directory block, a new inode, an inode allocation tree block, an allocation group header block, and the superblock
Logging is independent of the actual data structures
However, disk scavengers are still needed for catastrophic failures
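A heavily simplified replay loop, with hypothetical types and helpers, to show why log-based recovery avoids the full-disk scan that fsck needs:

#include <stdbool.h>
#include <stdint.h>

struct log_rec {
    uint64_t blockno;        /* metadata block this update applies to */
    bool     committed;      /* part of a fully committed transaction? */
    uint8_t  image[512];     /* new block contents (size assumed) */
};

/* Hypothetical helpers: iterate log records, rewrite a disk block. */
extern bool next_log_rec(struct log_rec *out);
extern void write_block(uint64_t blockno, const void *buf);

void replay_log(void)
{
    struct log_rec rec;
    while (next_log_rec(&rec))
        if (rec.committed)
            write_block(rec.blockno, rec.image);
    /* Replay is idempotent: a crash during recovery simply replays
     * again. Only the bounded log is read, never the whole disk. */
}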
Performance Scalability
Allocating files contiguously: on-disk allocation is delayed until flush (sketched below)
Uses (cheap) memory to improve I/O performance
Typically enables allocation in one extent, even for random writes (think memory-mapped files)
Avoids allocation entirely for short-lived files
Extents have a large range: 21-bit length field
Block size can vary by file system: small blocks for file systems with many small files, large blocks for file systems with mostly large files
What prevents long-term fragmentation?
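A sketch of the delayed-allocation write path under assumed helper functions; the point is that buffered writes only reserve space, and real extents are chosen at flush time, when the final size is known:

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8192   /* assume 8 KB blocks */

/* Hypothetical helpers. */
extern void     reserve_blocks(uint64_t ino, uint64_t nblocks);
extern void     cache_dirty(uint64_t ino, uint64_t off,
                            const void *buf, size_t len);
extern uint64_t alloc_contiguous(uint64_t nblocks);  /* by-size B+ tree */

/* Buffered write: reserve space, remember dirty data, pick no blocks. */
void delayed_write(uint64_t ino, uint64_t off, const void *buf, size_t len)
{
    reserve_blocks(ino, (len + BLOCK_SIZE - 1) / BLOCK_SIZE);
    cache_dirty(ino, off, buf, len);
}

/* Flush: all dirty data is visible at once, so one large extent can
 * back it, even if the writes arrived in random order. Files deleted
 * before the flush never get blocks at all. */
void flush_file(uint64_t ino, uint64_t dirty_blocks)
{
    uint64_t start = alloc_contiguous(dirty_blocks);
    (void)ino; (void)start;   /* ... write cached pages to disk ... */
}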
Performance Scalability (cont.)
Performing file I/O: read requests are issued for large I/O buffers, followed by multiple read-ahead requests for sequential reads
Writes are clustered to form larger I/O requests; delayed allocation helps with buffering writes
Direct I/O lets applications bypass the buffer cache and use DMA
Applications gain control over I/O while still accessing the file system
But they must align data on block boundaries and issue requests in multiples of the block size (see the example below)
Reader/writer locking supports more concurrency; direct I/O leaves serialization entirely to applications
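An example of the alignment rules in practice, using Linux’s O_DIRECT (the paper’s IRIX interface differs in detail, and the 4 KB block size is an assumption):

#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 4096;   /* assumed device block size */
    void *buf;

    /* Buffer address must be block-aligned for DMA. */
    if (posix_memalign(&buf, blksz, 64 * blksz) != 0)
        return 1;

    int fd = open("bigfile", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* File offset (0) and length (64 * blksz) are both multiples of
     * the block size; the kernel DMAs straight into buf, bypassing
     * the buffer cache. */
    ssize_t n = pread(fd, buf, 64 * blksz, 0);

    close(fd);
    free(buf);
    return n < 0;
}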
Performance Scalability (cont.)
Accessing and updating metadata: updates are performed through an asynchronous write-ahead log
Modified data is still flushed only after the log update has completed
But metadata is not locked, so multiple updates can be batched
The log may be placed on a different device from the file system, including non-volatile memory
Log operation is simple (though the log is centralized): provide buffer space, copy, write out, notify (sketched below)
Copying can be done by the processors performing the transaction
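A sketch of that four-step log path with hypothetical helpers; note that the memcpy runs on the CPU of the transaction itself, which keeps the centralized log from becoming a compute bottleneck:

#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for the shared, centralized log. */
extern void *log_reserve(size_t len);          /* grab log buffer space */
extern void  log_write_async(void);            /* start I/O on the buffer */
extern void  log_on_complete(void *waiter);    /* notify when durable */

void log_transaction(const void *records, size_t len, void *waiter)
{
    void *dst = log_reserve(len);   /* 1. provide buffer space */
    memcpy(dst, records, len);      /* 2. copy (transaction's own CPU) */
    log_write_async();              /* 3. write out asynchronously */
    log_on_complete(waiter);        /* 4. notify once the I/O completes */
}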
Experiences
I/O Throughput
What can we conclude? Read speed, difference between creates and writes, parallelism
Benchmark Results
Datamation sort: 3.52 seconds (7 seconds previous record)
Indy MinuteSort: 1.6 GB sorted in 56 seconds (1.08 GB previously)
SPEC SFS: 8806 SPECnfs instead of 7023 SPECnfs with EFS
A 12% increase with mostly small & synchronous writes on similar hardware
Directory Lookups
Why this noticeable break?
A Final Note
Are these systems/approaches contradictory?
Recoverable virtual memory: simpler is better, better performing
XFS: way more complex is better, better performing
What Do You Think?