TRANSCRIPT
Improving the Performance of Storage Servers
Yuanyuan Zhou
Princeton University
Traditional Storage
• Delivers limited performance
  – Locally attached
  – Little processing power
  – Small or no internal cache
  – Limited scale
  – Limited bandwidth
  – Simple storage interface

[Figure: a database server (file server) attached directly to a disk array]
Modern Storage Servers
– Network attachable
– Increasing processing power
– Gigabytes of memory cache
– Gigabytes of bandwidth
– Clustering of storage
– Offloading application operations

[Figure: database and file servers connected through a Storage Area Network to storage servers, each with its own processor and memory]

• "Disks become super-computers" -- Jim Gray
Impact of Storage Performance
• Storage I/O remains a bottleneck in many high-end or mid-size On-line Transaction Processing (OLTP) databases (Microsoft report & SOSP'95).
• Current technology trends
  – Processor speed increases 60% per year
  – Disk access time improves 7% per year
• Our goal: reduce I/O time

[Figure: normalized execution time for MS SQL, split into computation and I/O]
Approaches to Improving I/O Performance
• Improving response time and throughput
• Minimizing I/O and communication overhead
My Solutions
• Effective hierarchy-aware storage caching
  – Improving response time and throughput
• Using user-level communication as the database-storage network
  – Minimizing I/O & communication overhead

[Figure: database and file servers connected through a Storage Area Network to storage servers with processors and memory]
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary
Multi-level Server Cache Hierarchy

[Figure: database/file clients with client caches (64MB – 128MB), database/file servers with 1st-level buffer caches (4GB – 32GB), and storage servers with storage server caches (1GB – 64GB), connected by networks. The client caches are much smaller than the server caches, so there is ~no need for the inclusion property.]
Multi-level Server Caching

[Figure: accesses go first to the database or file server cache (higher level), which uses Least Recently Used (LRU) replacement; its misses go to the storage server cache (lower level). Should the storage cache use LRU too?]
Analogy: Storage Box (Basement)

• Assumption for the analogy: item = box
• Question: do you keep the box?
• If you have a basement, you can keep all the boxes

[Figure: living room (higher level) above a basement (lower level) holding pizza and DELL boxes]
Traditional Client-Server Cache Hierarchy
Analogy: Storage Box (Closet)

• If you just have a closet, you may keep only the box for your holiday decorations!

Database-Storage Server Cache Hierarchy

[Figure: living room (higher level) above a closet (lower level); hot accesses, cold accesses, and hot misses flow between the levels]
But If You Use LRU for Your Closet…

• Your closet will be full of garbage!

[Figure: the lower level fills up with pizza boxes]
“Your cache ain’t nothin’but trash”
• Storage server cache access patterns are not well understood
  – Most storage server caches still use LRU
• Muntz & Honeyman (USENIX '92)
  – Cache hit ratios at lower-level file server caches are very low
• Willick et al. (ICDCS '92)
  – FBR outperforms LRU for disk caches
Questions
• What is the access pattern at storage server caches?
• What are the properties of a good storage server cache replacement algorithm?
• What algorithms are good for storage server caches?
Storage Cache Access Traces
Database or File Server Miss Traces

                            Oracle-1         Oracle-2         HP Disk       Auspex Server
Description                 TPC-C, 100 GB    TPC-C, 100 GB    Cello, 1992   File server, 1993
                            database         database
DB or file cache size (MB)  128              16               30            8
# Reads (millions)          7.3              3.8              0.2           1.8
# Writes (millions)         4.3              2.0              0.3           0.8
# DB or file servers        Single           Single           Multiple      Multiple
Temporal Distances
• Temporal distance: the inter-reference gap from the previous reference to the same block
• Example (blocks A, B, C, D):

  access sequence:     A  B  C  A  D  B  C
  temporal distances:           3  -  4  4
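The example above can be reproduced in a few lines of code. This is an illustrative sketch (the function name and the toy trace are mine, not from the talk), using 0-based positions:

```python
# Temporal distance: the number of references between a block's previous
# reference and its current one, matching the slide's example.

def temporal_distances(trace):
    """Return {position: distance} for every re-reference in the trace."""
    last_seen = {}          # block -> index of its previous reference
    distances = {}
    for i, block in enumerate(trace):
        if block in last_seen:
            distances[i] = i - last_seen[block]
        last_seen[block] = i
    return distances

# The slide's sequence A B C A D B C:
#   A at index 3 re-references A at index 0 -> distance 3
#   B at index 5 re-references B at index 1 -> distance 4
#   C at index 6 re-references C at index 2 -> distance 4
print(temporal_distances("ABCADBC"))   # {3: 3, 5: 4, 6: 4}
```

Applying this to a full trace and bucketing the results by powers of two gives distributions like those on the following slides.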
Temporal Distance Distribution

[Figure: histograms of #accesses vs. temporal distance (1 to 16m, log scale) for the Auspex access trace (high level) and the Auspex miss trace (lower level); the miss trace's distribution sits far to the right, beyond minDist]

Accesses to the storage server have poor temporal locality.

Notation: 1k = 1,000 references; 1m = 1,000,000 references
Why Poor Temporal Locality?

[Figure: block B in the database or file cache's LRU queue (16K cache blocks) must be pushed out by >16K distinct accesses before it is re-referenced at the storage level; assuming a 20% miss ratio, >3.2K of those accesses reach the storage cache, so B's temporal distance there is large]
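The filtering effect in the figure can be demonstrated with a toy two-level simulation: accesses that hit the first-level LRU cache never reach storage, so the stream seen below has its short temporal distances removed. A minimal sketch with a made-up cache size and artificial trace; `lru_filter` is an illustrative helper, not code from the talk:

```python
from collections import OrderedDict

def lru_filter(trace, capacity):
    """Return the miss stream of an LRU cache of the given capacity."""
    cache = OrderedDict()
    misses = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # hit: absorbed by this level
        else:
            misses.append(block)          # miss: passed to the next level
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses

# A trace with strong locality: hot block H is re-referenced every
# 2 accesses, interleaved with cold blocks.
trace = []
for i in range(8):
    trace += ["H", f"cold{i}"]

miss_stream = lru_filter(trace, capacity=2)
# H's re-references all hit the first level, so the storage level
# sees H only once -- the short distances never reach it.
print(miss_stream)
```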
[Figure: temporal distance distributions of the storage cache miss traces for Oracle-1 (128MB client cache), Oracle-2 (16MB client cache), the HP Disk Trace, and the Auspex Server Trace; all are dominated by long distances]

Minimal Lifetime Property

A block should stay in cache for at least minDist time to be hit at the next reference.
What Blocks to Keep?
[Figure: for Oracle-1, cumulative percentage of accesses and percentage of blocks vs. access frequency (1 to 128)]

A large percentage of accesses are made to a small percentage of the data, but to a lesser extent.
[Figure: the same access-frequency distributions for Oracle-1, Oracle-2, the HP Disk Trace, and the Auspex Server Trace]

Frequency-based Priority Property

Blocks should be prioritized based on their access frequencies.
Replacement Algorithm Properties
• Minimal lifetime
  – A block should stay in cache for at least minDist time
• Frequency-based priority
  – Blocks should be prioritized based on their access frequencies
• Temporal (aged) frequency
  – Reference counts accumulated long ago should carry less weight
Performance of Existing Algorithms
Cache Hit Ratios (Oracle-1)

[Figure: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, FBR, and LRU]

There is a big gap between the on-line algorithms and the off-line Optimal (OPT) algorithm.
Do They Satisfy the Properties?
No on-line algorithm satisfies all three properties:

        Minimal lifetime              Frequency-based priority   Temporal frequency
OPT     Best                          Best                       Best
LRU     Poor with small cache sizes   Poor                       Well
FBR     Poor                          Well                       Well
Our Replacement Algorithm: Multi-Queue (MQ)

• Designed based on the three properties
  – Minimal lifetime: multiple LRU queues with different priorities
  – Frequency-based priority: promoting based on reference counts
  – Aged frequency: demoting when the lifetime expires
• lifetime = f(minDist)

[Figure: queues Q0–Q3 plus a history buffer; each entry holds a block ID and reference count. Accessing block B (count 7 → 8) promotes it to a higher queue and sets B.expireTime = CurrentTime + lifetime; block D (count 6) with D.expireTime < CurrentTime is demoted; block C (count 1) is evicted into the history buffer.]
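The mechanism on this slide can be sketched as a small simulator. This is a hedged reconstruction of the MQ idea, not the paper's exact code: the lifetime is a fixed constant here (the real MQ sets it adaptively from minDist statistics), and the history-buffer size is an arbitrary choice of mine.

```python
from collections import OrderedDict

class MQCache:
    def __init__(self, capacity, m=8, lifetime=32):
        self.capacity = capacity
        self.m = m
        self.lifetime = lifetime      # fixed here; adaptive in the real MQ
        self.queues = [OrderedDict() for _ in range(m)]  # block -> expire time
        self.freq = {}                # block -> reference count
        self.history = OrderedDict()  # evicted block -> reference count
        self.time = 0
        self.size = 0

    def _queue_of(self, count):
        # Queue index grows with log2(reference count), capped at m-1.
        return min(count.bit_length() - 1, self.m - 1)

    def _demote_expired(self):
        for k in range(1, self.m):
            q = self.queues[k]
            while q:
                block, expire = next(iter(q.items()))
                if expire >= self.time:
                    break
                del q[block]          # lifetime expired: demote one level
                self.queues[k - 1][block] = self.time + self.lifetime

    def _evict(self):
        for q in self.queues:         # evict the LRU block of the lowest queue
            if q:
                block, _ = q.popitem(last=False)
                self.history[block] = self.freq.pop(block)
                if len(self.history) > 4 * self.capacity:  # arbitrary bound
                    self.history.popitem(last=False)
                self.size -= 1
                return

    def access(self, block):
        """Return True on hit, False on miss."""
        self.time += 1
        hit = False
        for q in self.queues:
            if block in q:
                del q[block]
                hit = True
                break
        if not hit:
            if self.size >= self.capacity:
                self._evict()
            self.size += 1
        # Restore the count from history if the block was seen before.
        count = self.freq.get(block, self.history.pop(block, 0)) + 1
        self.freq[block] = count
        self.queues[self._queue_of(count)][block] = self.time + self.lifetime
        self._demote_expired()
        return hit
```

Frequently referenced blocks climb to higher queues and survive bursts of cold blocks, which only ever churn through Q0; the history buffer lets a re-fetched block resume its old priority.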
Simulation Evaluation
• Trace-driven cache simulator
  – Write-through
  – Block size: 8 Kbytes
• MQ settings
  – m = 8
  – Adaptive lifetime setting based on on-line statistics
Simulation Results
MQ performs better than the others.

Cache Hit Ratios (Oracle-1)

[Figure: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, MQ, FBR, and LRU; MQ tracks closest to OPT]
            Temp. distance < 64K      Temp. distance >= 64K
Algorithm   #hits      #misses        #hits      #misses
MQ          1553K      293K           1919K      2645K
FBR         1611K      235K           1146K      3418K
LRU         1846K      0              407K       4157K
Why Does MQ Perform Better?

• MQ can selectively keep some blocks for a longer time

[Figure: temporal distance distribution for Oracle-1 with a 512 MB storage cache (64K entries); 29% of re-references fall below 64K, 71% at or above]
Implementation Results (Oracle)
Storage cache hit ratios (database cache size: 128MB, database size: 100GB):

Storage Cache Size   MQ       LRU
128 MB               19.85%   8.85%
256 MB               31.42%   17.66%
512 MB               44.34%   31.69%

• MQ has an effect similar to doubling the cache size with LRU
OLTP End Performance (Oracle)
[Figure: normalized execution time, split into computation and I/O, for LRU vs. MQ at storage cache sizes of 128 MB, 256 MB, and 512 MB; MQ shows a smaller I/O fraction at every size]

• MQ can reduce I/O time by 16~25% and improve overall performance by 9~11% compared to LRU.
Related Work
• Practice
  – LRU, MRU, LFU, SEQ, LFRU, 2Q, FBR, LRU-k, …
• Theory
  – LUP & RLUP analytical models
  – Competitive analysis
• Temporal locality metrics
  – LRU stack distance (1970)
  – Distance string (1976)
  – IRG model (1995)
• Multi-queue process scheduling
Summary
• Access pattern?
  – Long temporal distances & unevenly distributed frequencies
• Properties?
  – Minimal lifetime, frequency-based priority, aged frequency
• What algorithms are good for storage caches?
  – MQ performs better than seven tested alternatives and has an effect similar to doubling the cache size with LRU.
• Details can be found in
  – Y. Zhou, J. F. Philbin and K. Li. The Multi-Queue Server Buffer Cache Replacement Algorithm. USENIX '01
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary
I/O Related Host Overhead
• High-end or mid-size OLTP configurations
  – Tolerate disk access latency via asynchronous I/Os
  – Problem: high I/O-related processor overhead
• Reasons
  – OS overhead
  – Communication protocol overhead
• Our solution: using user-level communication for database storage
Analogy: Overhead Walkway

• An overhead walkway avoids stairs & traffic, so you can win money faster!

[Figure: an overhead walkway over the Las Vegas "Strip" connecting casinos, analogous to SCSI or Fibre Channel connecting hosts and storage]
[Figure: the database (user space) and the storage server (user space) communicate directly via user-level communication, bypassing kernel space on both sides]
User-level Communication
• High bandwidth, low latency, low overhead
• Main features
  – Bypassing the OS
  – Zero-copy transfers
  – Remote DMA (RDMA)
• University research
  – VMMC, U-Net, FM, AM, Memory Channel, Myrinet, …
• Industry standard
  – Virtual Interface (VI) Architecture
Using User-level Communication
• Intra-cluster communication
  – Scientific parallel applications
  – Application servers (Intel CSP, Compaq TruCluster)
  – Web servers
• Client-server communication
  – Databases (Oracle, DB2, MS SQL)
  – Direct Access File Systems (DAFS)
Our Goals

• User-level communication as a database-storage interconnect:
  – Is user-level communication effective at reducing the host overhead for database storage?
  – How should user-level communication be used to connect the database with storage?
VI-Attached Storage Server
[Figure: databases with a VI client stub connect over a VI network to storage servers; each storage server has a storage cache and local disks]

• Database and storage communicate using VI
• The storage cache is managed using MQ
Client-stub Implementations
• Challenges
  – Application transparency
  – Taking advantage of VI
  – Storage API
• Implementations (in decreasing order of transparency)
  – Kernel driver
  – DLL interceptor
  – Direct Storage Access (DSA)
Client Stub: Kernel Driver
• Fully transparent + standard API
• Plus
  – Supports all applications
  – Takes advantage of VI's zero-copy and RDMA features
• Minus
  – High kernel overhead
  – Requires kernel-space VI

[Figure: databases call a standard DLL, which goes through a kernel-space device driver to kernel-space VI]
Client Stub: DLL Interceptor
• User-level transparent + standard API
• Plus
  – No modification to databases
  – Takes advantage of user-level communication
• Minus
  – High overhead to satisfy standard I/O API semantics
    (example: triggering events for I/O completions)

[Figure: databases call a standard DLL; a DLL interceptor redirects I/O to user-space VI, bypassing the kernel]
Client Stub: Direct Storage Access (DSA)
• Least transparent + new API
• DSA interface
  – Minimizes kernel involvement
  – Designed around VI features, e.g. polling for I/O completions
• Plus
  – Fully takes advantage of VI
• Minus
  – Requires small modifications to the database

[Figure: modified databases call the DSA library, which uses user-space VI directly]
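The polling-based completion model DSA relies on can be illustrated with a toy sketch. `CompletionQueue`, `post`, and `poll` are hypothetical names (not the actual DSA API), and the "device" is simulated with a thread; the point is only that detecting a completion needs no system call or interrupt on the critical path:

```python
import threading
from collections import deque

class CompletionQueue:
    """A user-space completion queue shared with an I/O 'device' thread."""
    def __init__(self):
        self._done = deque()
        self._lock = threading.Lock()

    def post(self, request_id):
        # Called by the device side when an I/O finishes.
        with self._lock:
            self._done.append(request_id)

    def poll(self):
        """Non-blocking: return a completed request id, or None."""
        with self._lock:
            return self._done.popleft() if self._done else None

def fake_device(cq, request_id):
    # Stand-in for the storage server completing an I/O.
    cq.post(request_id)

cq = CompletionQueue()
t = threading.Thread(target=fake_device, args=(cq, 42))
t.start()
t.join()

# The application spins on poll() -- no system call, no interrupt.
while (done := cq.poll()) is None:
    pass
print("completed request", done)   # completed request 42
```

The trade-off is classic: polling burns CPU while waiting but avoids per-I/O kernel transitions, which is a win when completions arrive quickly and frequently, as in OLTP.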
Limitations of User-level Communication
• Substantial enhancements are needed to address the following issues
  – Lack of flow-control and reconnection mechanisms
  – High memory registration overhead
  – High locking overhead
  – High interrupt overhead
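For the first item, one standard remedy is credit-based flow control: the sender consumes one credit per message and stalls at zero; the receiver returns credits as it drains its pre-posted buffers. The talk does not spell out its mechanism, so this is a plausible sketch only, and all class and method names are illustrative:

```python
class CreditSender:
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.pending = []            # messages waiting for credits

    def send(self, msg, wire):
        if self.credits == 0:
            self.pending.append(msg) # would queue or block in a real system
            return False
        self.credits -= 1            # one credit == one receive buffer
        wire.append(msg)
        return True

    def on_credit_return(self, n, wire):
        # Receiver drained n buffers; retry any stalled messages.
        self.credits += n
        while self.pending and self.credits:
            self.send(self.pending.pop(0), wire)

wire = []
s = CreditSender(initial_credits=2)
assert s.send("a", wire) and s.send("b", wire)
assert not s.send("c", wire)        # out of credits: message queued
s.on_credit_return(1, wire)         # receiver drained one buffer
assert wire == ["a", "b", "c"]
```

This guarantees the sender never overruns the receiver's posted buffers, which user-level NICs, unlike kernel TCP stacks, do not enforce on their own.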
Evaluation
• Real systems
  – OS: Windows XP
  – Databases: MS SQL, TPC-C benchmark
  – VI network: Giganet
  – Tested by customers for 6 months
• Large-size configuration
  – Database: 32-way SMP, 32 GB memory
  – Storage: 8 PCs, each with 3 GB memory; 12 Terabytes of data
• Mid-size configuration
  – Database: 4-way SMP, 4 GB memory
  – Storage: 4 PCs, each with 2 GB memory; 1 Terabyte total
OLTP Performance (Large-size Config.)
• DSA improvement
  – I/O time: 40%
  – Overall: 18%

[Figure: normalized execution time, split into computation and I/O, for Fibre Channel, Driver, DLL, and DSA; DSA has the smallest I/O fraction]
TPC-C I/O Overhead Breakdown
[Figure: CPU utilization breakdown of TPC-C I/O overhead (other, VI, client stub, lock, OS kernel) for Fibre Channel, Driver, DLL, and DSA]
Summary
• User-level communication can effectively connect the database with storage, but may require substantial enhancements.
• A storage API that minimizes kernel involvement in the I/O path is necessary to fully exploit the benefits of user-level communication.
• Details can be found in
  – Yuanyuan Zhou, Angelos Bilas, Suresh Jagannathan, Cezary Dubnicki, James F. Philbin and Kai Li. Experience with VI-based Communication for Database Storage. To appear in ISCA '02
Conclusions

• Effective hierarchy-aware storage caching
  – MQ has a doubling-cache-size effect compared to LRU, and can reduce I/O time in OLTP by 15~30%
  – Provides insights for other similar multi-level cache hierarchies (e.g. Web proxy caches)
• Using user-level communication for database storage
  – DSA can reduce I/O time in OLTP by 40%
  – Provides guidelines for the design and implementation of new I/O interconnects (e.g. InfiniBand) and other applications (e.g. DAFS)
My Other Related Research
• Improving availability
  – Fast cluster fail-over using memory-mapped communication (ICS'99)
• Memory management for networked servers
  – Consistency protocols for DSMs (OSDI'96)
  – Coherence granularity vs. protocols (PPoPP'97)
  – Performance limitations of software DSMs (HPCA'99)
  – Thread scheduling for locality (IOPADS'99)
• http://www.cs.princeton.edu/~yzhou/