Conquest: Better Performance Through A Disk/Persistent-RAM Hybrid File System
USENIX 2002
An-I Andy Wang • Peter Reiher • Gerald Popek
University of California, Los Angeles
Geoffrey Kuenning
Harvey Mudd College
Conquest Overview
- File systems are optimized for disks
  - Performance problem
  - Complexity
- Now we have tons of inexpensive RAM
  - What can we do with that RAM?
Conquest Approach
- Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
- Simplification
  - More than 20% fewer semicolons than ext2, reiserfs, and SGI XFS
- Performance (under popular benchmarks)
  - 24% to 1900% faster than LRU disk caching
Motivation
- Most file systems are built for disks
- Problems with the disk assumption:
  - Performance
  - Complexity
Motivation – Conquest Design – Conquest Components – Performance Evaluation – Conclusion
Hardware Evolution
[Figure: accesses per second (log scale; 1 KHz, 1 MHz, 1 GHz) vs. year (1990, 1995, 2000). CPU and memory improve at 50% per year; disk improves at 15% per year. Annotations on the widening gap: 10^5, 10^6, (1 sec : 6 days), (1 sec : 3 months).]
Inside the Pandora’s Box
[Figure: inside a disk drive, showing the disk arm and the disk platters]
Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
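The access-time breakdown above can be made concrete with a small calculation. This is only an illustrative sketch: the seek time, spindle speed, transfer rate, and request size are assumed example values rather than figures from the talk, and `disk_access_time_ms` is a hypothetical helper name.

```python
def disk_access_time_ms(seek_ms, rpm, transfer_mb_per_s, request_kb):
    """Access time = seek time + rotational delay + transfer time (in ms)."""
    # On average the platter turns half a revolution before the sector arrives.
    rotational_delay_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes once the head is in position.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms


# Assumed example drive: 8 ms average seek, 7200 RPM, 40 MB/s, 4 KB request.
total = disk_access_time_ms(seek_ms=8.0, rpm=7200, transfer_mb_per_s=40.0, request_kb=4)
print(f"{total:.2f} ms per random 4 KB access")
```

With these assumed numbers the mechanical terms (seek and rotation) dominate the transfer term, which is the overhead the disk optimizations on the following slide try to hide.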
Disk Optimization Methods
- Disk arm scheduling
- Group information on disk
- Disk readahead
- Buffered writes
- Disk caching
- Data mirroring
- Hardware parallelism
Complexity Bytes
- synchronization
- predictive readahead
- cache replacement
- elevator algorithm
- data clustering
- data consistency
- asynchronous write
[Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000]
Storage Media Alternatives
[Figure: storage media plotted by accesses/sec (log scale, 10^-3 to 10^6) against $/MB (log scale, 10^-3 to 10^3). Labeled media: tape, disk, flash memory, (write once), battery-backed DRAM, Magnetic RAM?; a "persistent RAM" label marks the persistent-RAM group.]
[Grochowski 2000]
Price Trend of Persistent RAM
[Figure: price in $/MB (log scale, 10^-2 to 10^2) vs. year (1995 to 2005) for paper/film, 3.5” HDD, 2.5” HDD, 1” HDD, and persistent RAM. Annotations: booming of digital photography; 4 to 10 GB of persistent RAM.]
Old Order; New World
- Disk staying around
  - Cost, capacity, power, heat
- RAM as a viable storage alternative
  - PDAs, digital cameras, MP3 players
- More architectural changes due to RAM
  - A big assumption change from disk
  - Rethink data structures, interface, applications
Getting a Fresh Start
What does it take to design and build a system that assumes ample persistent RAM as the primary storage medium?
Conquest
- Design and build a disk/persistent-RAM hybrid file system
- Deliver all file system services from memory, with the exception of high-capacity storage
- Benefits:
  - Simplicity
  - Performance
Simplicity
- Remove disk-related complexities for most files
- Make things simpler for disk as well
- Less complexity
  - Fewer bugs
  - Easier maintenance
  - Shorter data path
Performance
- Overall
  - All management performed in memory
- Memory data path
  - No disk-related overhead
- Disk data path
  - Faster speed due to simpler access models
Conquest Components
- Media management
- Metadata management
- Allocation service
- Persistence support
- Resiliency support
[Iram 1993; Douceur et al., 1999; Roselli et al., 2000]
User Access Patterns
- Small files
  - Take little space (10%)
  - Represent most accesses (90%)
- Large files
  - Take most space
  - Mostly sequential accesses
  - Except database applications
Files Stored in Persistent RAM
- Small files (< 1MB)
  - No seek time or rotational delays
  - Fast byte-level accesses
  - Contiguous allocation
- Metadata
  - Fast synchronous update
  - No dual representations
- Executables and shared libraries
  - In-place execution
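The placement rule on this slide can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not Conquest's kernel code: `choose_medium` and its flags are hypothetical names, and only the 1 MB threshold and the three RAM-resident categories come from the slide.

```python
SMALL_FILE_LIMIT = 1 << 20  # 1 MB threshold, as stated on the slide


def choose_medium(size_bytes, is_metadata=False, is_executable=False):
    """Return "ram" or "disk" for an object, following the slide's categories."""
    # Metadata, executables, and shared libraries live in persistent RAM.
    if is_metadata or is_executable:
        return "ram"
    # Ordinary files go to persistent RAM only while they are small.
    return "ram" if size_bytes < SMALL_FILE_LIMIT else "disk"


print(choose_medium(4096))                         # small file
print(choose_medium(700 << 20))                    # large media file
print(choose_medium(5 << 20, is_executable=True))  # large executable
```

The sketch treats the threshold as a strict less-than, matching the "< 1MB" wording; how the real system handles a file that grows across the boundary is not covered here.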
Memory Data Path of Conquest
[Figure: two data paths side by side. Conventional file systems: storage requests, IO buffer management, IO buffer, persistence support, disk management, disk. Conquest memory data path: storage requests, persistence support, battery-backed RAM (small file and metadata storage).]
[Devlinux.com 2000]
Large-File-Only Disk Storage
- Allocate in big chunks
  - Lower access overhead
  - Reduced management overhead
- No fragmentation management
- No tricks for small files
  - Storing data in metadata
- No elaborate data structures
  - Wrapping a balanced tree onto disk cylinders
Sequential-Access Large Files
- Sequential disk accesses
  - Near-raw bandwidth
- Well-defined readahead semantics
- Read-mostly
  - Little synchronization overhead (between memory and disk)
Disk Data Path of Conquest
[Figure: two data paths side by side. Conventional file systems: storage requests, IO buffer management, IO buffer, persistence support, disk management, disk. Conquest disk data path: storage requests, IO buffer management, IO buffer, disk management, disk (large-file-only file system), alongside battery-backed RAM (small file and metadata storage).]
Random-Access Large Files
- Random access?
  - Common definition: nonsequential access
  - A typical movie has 150 scene changes
  - MP3 stores the title at the end of the files
- Near-sequential access?
  - Simplify large-file metadata representation significantly
Logical File Representation
[Figure: a file consists of name(s), an i-node holding file attributes, and data]
Physical File Representation
[Figure: a file consists of name(s), an i-node holding file attributes and data locations, and data blocks]
Ext2 Data Representation
[Figure: an ext2 i-node holds a group of data block locations (labeled "10" on the slide) followed by index block locations; each index block points to data blocks or to further index blocks, forming a multi-level tree.]
Problems with Ext2 Design
- Designed for disk storage
- Optimization for small files makes things complex
- Random-access data structure for large files that are accessed mostly sequentially
- Data access time dependent on the byte position in a file
- Maximum file size is limited
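Two of the points above (position-dependent access time and a bounded file size) follow directly from the direct/indirect pointer scheme. The sketch below counts how many index blocks must be read to reach a given byte offset. It is an illustration only: `ext2_lookup_levels` is a hypothetical helper, and the defaults (4 KB blocks, 12 direct pointers, 4-byte block pointers) are assumed typical ext2 values rather than numbers taken from the slides.

```python
def ext2_lookup_levels(offset, block_size=4096, n_direct=12, ptr_size=4):
    """Return the number of index blocks read before reaching the data block."""
    ptrs = block_size // ptr_size   # block pointers held by one index block
    block = offset // block_size    # logical block number within the file
    if block < n_direct:
        return 0                    # located directly from the i-node
    block -= n_direct
    for level in (1, 2, 3):         # single, double, triple indirection
        if block < ptrs ** level:
            return level
        block -= ptrs ** level
    raise ValueError("offset exceeds the maximum file size of this layout")


for off in (0, 48 * 1024, 8 * 1024 * 1024, 8 * 1024 ** 3):
    print(off, ext2_lookup_levels(off))
```

Under these assumptions the first 48 KB of a file needs no index block, while offsets deep into a multi-gigabyte file need three, and offsets past the triple-indirect range cannot be represented at all.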
Conquest Representation
- Persistent RAM
  - Hash(file name) = location of data
  - Offset(location of data)
- Disk storage
  - Per-file, doubly linked list of disk block segments (stored in persistent RAM)
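The two representations on this slide can be sketched in user-space Python. This is a toy model under stated assumptions, not the kernel implementation: a `dict` stands in for the file-name hash table, and `Segment`, `LargeFile`, `read_small`, and the 4 KB block size are hypothetical names and values chosen for illustration.

```python
class Segment:
    """One contiguous run of disk blocks belonging to a large file."""

    def __init__(self, start_block, n_blocks):
        self.start_block = start_block
        self.n_blocks = n_blocks
        self.prev = None
        self.next = None


class LargeFile:
    """Per-file doubly linked list of segments (kept in persistent RAM)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.head = None
        self.tail = None

    def append_segment(self, start_block, n_blocks):
        seg = Segment(start_block, n_blocks)
        seg.prev = self.tail
        if self.tail is None:
            self.head = seg
        else:
            self.tail.next = seg
        self.tail = seg

    def locate(self, offset):
        """Map a byte offset to a disk block by walking the list in order."""
        block = offset // self.block_size
        seg = self.head
        while seg is not None:
            if block < seg.n_blocks:
                return seg.start_block + block
            block -= seg.n_blocks
            seg = seg.next
        raise ValueError("offset past end of file")


# In-memory files: the name hashes straight to the data, then a byte offset.
ram_files = {"notes.txt": bytearray(b"hello conquest")}


def read_small(name, offset, length):
    data = ram_files[name]
    return bytes(data[offset:offset + length])


movie = LargeFile()
movie.append_segment(start_block=100, n_blocks=4)
movie.append_segment(start_block=500, n_blocks=8)
print(movie.locate(5 * 4096))            # second block of the second segment
print(read_small("notes.txt", 6, 8))
```

A random access to a large file walks the segment list in memory, which costs a linear search but no disk reads for metadata; sequential access simply follows `next`.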
Conquest Design
+ Direct data access for in-core files
+ Worst case: sequential memory search for infrequent random accesses to on-disk files
+ Maximum file size limited by physical storage
Implementation Status
- Kernel module under Linux 2.4.2
- Fully functional and POSIX compliant
- Modified memory manager to support Conquest persistence
- Preparing for office-wide deployment
Conquest Evaluation
- Architectural simplification
  - Feature count
- Performance improvement
  - Memory-only workload
  - Memory and disk workload
Conventional Data Path
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Figure: conventional file systems data path, with storage requests, IO buffer management, IO buffer, persistence support, disk management, and disk]
Memory Path of Conquest
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Figure: Conquest memory data path, with storage requests, persistence support, and battery-backed RAM (small file and metadata storage); annotation: memory manager encapsulation]
Disk Path of Conquest
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Figure: Conquest disk data path, with storage requests, IO buffer management, IO buffer, disk management, and disk (large-file-only file system), alongside battery-backed RAM (small file and metadata storage)]
[Katcher 1997; Sweeney et al., 1996; Card et al., 1999; Namesys 2002]
PostMark Benchmark
- ISP workload (emails, web-based transactions)
- Conquest is comparable to ramfs
- At least 24% faster than the LRU disk cache
[Figure: transactions per second (0 to 9000) vs. number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest; 250 MB working set with 2 GB physical RAM]
PostMark Benchmark
- When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
[Figure: transactions per second (0 to 5000) vs. percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 3.5 GB working set with 2 GB physical RAM; the plot is split into a "<= RAM" region and a "> RAM" region]
PostMark Benchmark
- When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
[Figure: transactions per second (0 to 120) vs. percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 3.5 GB working set with 2 GB physical RAM]
Lessons Learned
- Faster than LRU caching, unexpected
  - Heavyweight disk handling
  - Severe penalty for accesses to content
- Matching user access patterns to storage media offers considerable simplification and better performance
  - Not an automatic result
  - Need careful design
Conclusion
- Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
- Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.