TRANSCRIPT
Improving the Performance of Storage Servers
Yuanyuan Zhou
Princeton University
Traditional Storage
• Delivers limited performance
  – Locally attached
  – Little processing power
  – Small or no internal cache
  – Limited scale
  – Limited bandwidth
  – Simple storage interface

[Figure: a database server (file server) attached directly to a disk array]
Modern Storage Servers
– Network attachable
– Increasing processing power
– Gigabytes of memory cache
– Gigabytes of bandwidth
– Clustering of storage
– Offloading application operations

[Figure: database and file servers connected through a Storage Area Network to storage servers, each with its own processor and memory]

• "Disks become super-computers" -- Jim Gray
Impact of Storage Performance
• Storage I/O remains a bottleneck in many high-end or mid-size On-line Transaction Processing (OLTP) databases (Microsoft report & SOSP'95).
• Current technology trends
  – Processor speed increases 60% per year
  – Disk access time improves 7% per year
• Our goal: reduce I/O time

[Figure: normalized execution time for MS SQL, split into computation and I/O]
Approaches to Improving I/O Performance
• Improving response time and throughput
• Minimizing I/O and communication overhead
My Solutions
• Effective hierarchy-aware storage caching
  – Improving response time and throughput
• Using user-level communication as the database-storage network
  – Minimizing I/O & communication overhead

[Figure: database and file servers connected through a Storage Area Network to storage servers with processors and memory]
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary
Multi-level Server Cache Hierarchy

[Figure: database/file clients with client caches (64MB – 128MB), database/file servers with 1st-level buffer caches (4GB – 32GB), and storage servers with storage server caches (1GB – 64GB), connected by networks. The client caches are much smaller than the server caches, so there is ~no need for the inclusion property.]
Multi-level Server Caching

[Figure: accesses go first to the database or file server cache (higher level), which uses Least Recently Used (LRU) replacement; its misses go to the storage server cache (lower level). Should the storage cache use LRU too?]
Analogy: Storage Box (Basement)

• Assumption for the analogy: item = box
• Question: do you keep the box?
• If you have a basement, you can keep all the boxes

[Figure: living room (higher level) above a basement (lower level) holding pizza and DELL boxes]
Traditional Client-Server Cache Hierarchy
Analogy: Storage Box (Closet)

• If you just have a closet, you may keep only the box for your holiday decorations!

Database-Storage Server Cache Hierarchy

[Figure: living room (higher level) above a closet (lower level); hot accesses, cold accesses, and hot misses flow between the levels]
But If You Use LRU for Your Closet…

• Your closet will be full of garbage!

[Figure: the lower level fills up with pizza boxes]
“Your cache ain’t nothin’but trash”
• Storage server cache access patterns are not well understood
  – Most storage server caches still use LRU
• Muntz & Honeyman (USENIX '92)
  – Cache hit ratios at lower-level file server caches are very low
• Willick et al. (ICDCS '92)
  – FBR outperforms LRU for disk caches
Questions
• What is the access pattern at storage server caches?
• What are the properties of a good storage server cache replacement algorithm?
• What algorithms are good for storage server caches?
Storage Cache Access Traces
Database or File Server Miss Traces

                            Oracle-1         Oracle-2         HP Disk       Auspex Server
Description                 TPC-C, 100 GB    TPC-C, 100 GB    Cello, 1992   File server, 1993
                            database         database
DB or file cache size (MB)  128              16               30            8
# Reads (millions)          7.3              3.8              0.2           1.8
# Writes (millions)         4.3              2.0              0.3           0.8
# DB or file servers        Single           Single           Multiple      Multiple
Temporal Distances
• Temporal distance: the inter-reference gap from the previous reference to the same block
• Example (blocks A, B, C, D):

  access sequence:     A  B  C  A  D  B  C
  temporal distances:           3  -  4  4
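The example above can be reproduced in a few lines of code. This is an illustrative sketch (the function name and the toy trace are mine, not from the talk), using 0-based positions:

```python
# Temporal distance: the number of references between a block's previous
# reference and its current one, matching the slide's example.

def temporal_distances(trace):
    """Return {position: distance} for every re-reference in the trace."""
    last_seen = {}          # block -> index of its previous reference
    distances = {}
    for i, block in enumerate(trace):
        if block in last_seen:
            distances[i] = i - last_seen[block]
        last_seen[block] = i
    return distances

# The slide's sequence A B C A D B C:
#   A at index 3 re-references A at index 0 -> distance 3
#   B at index 5 re-references B at index 1 -> distance 4
#   C at index 6 re-references C at index 2 -> distance 4
print(temporal_distances("ABCADBC"))   # {3: 3, 5: 4, 6: 4}
```

Applying this to a full trace and bucketing the results by powers of two gives distributions like those on the following slides.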
Temporal Distance Distribution

[Figure: histograms of #accesses vs. temporal distance (1 to 16m, log scale) for the Auspex access trace (high level) and the Auspex miss trace (lower level); the miss trace's distribution sits far to the right, beyond minDist]

Accesses to the storage server have poor temporal locality.

Notation: 1k = 1,000 references; 1m = 1,000,000 references
Why Poor Temporal Locality?

[Figure: block B in the database or file cache's LRU queue (16K cache blocks) must be pushed out by >16K distinct accesses before it is re-referenced at the storage level; assuming a 20% miss ratio, >3.2K of those accesses reach the storage cache, so B's temporal distance there is large]
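The filtering effect in the figure can be demonstrated with a toy two-level simulation: accesses that hit the first-level LRU cache never reach storage, so the stream seen below has its short temporal distances removed. A minimal sketch with a made-up cache size and artificial trace; `lru_filter` is an illustrative helper, not code from the talk:

```python
from collections import OrderedDict

def lru_filter(trace, capacity):
    """Return the miss stream of an LRU cache of the given capacity."""
    cache = OrderedDict()
    misses = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # hit: absorbed by this level
        else:
            misses.append(block)          # miss: passed to the next level
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses

# A trace with strong locality: hot block H is re-referenced every
# 2 accesses, interleaved with cold blocks.
trace = []
for i in range(8):
    trace += ["H", f"cold{i}"]

miss_stream = lru_filter(trace, capacity=2)
# H's re-references all hit the first level, so the storage level
# sees H only once -- the short distances never reach it.
print(miss_stream)
```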
[Figure: temporal distance distributions of the storage cache miss traces for Oracle-1 (128MB client cache), Oracle-2 (16MB client cache), the HP Disk Trace, and the Auspex Server Trace; all are dominated by long distances]

Minimal Lifetime Property

A block should stay in cache for at least minDist time to be hit at the next reference.
What Blocks to Keep?
[Figure: for Oracle-1, cumulative percentage of accesses and percentage of blocks vs. access frequency (1 to 128)]

A large percentage of accesses are made to a small percentage of the data, but to a lesser extent.
[Figure: the same access-frequency distributions for Oracle-1, Oracle-2, the HP Disk Trace, and the Auspex Server Trace]

Frequency-based Priority Property

Blocks should be prioritized based on their access frequencies.
Replacement Algorithm Properties
• Minimal lifetime
  – A block should stay in cache for at least minDist time
• Frequency-based priority
  – Blocks should be prioritized based on their access frequencies
• Temporal (aged) frequency
  – Reference counts accumulated long ago should carry less weight
Performance of Existing Algorithms
Cache Hit Ratios (Oracle-1)

[Figure: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, FBR, and LRU]

There is a big gap between the on-line algorithms and the off-line Optimal (OPT) algorithm.
Do They Satisfy the Properties?
No on-line algorithm satisfies all three properties:

        Minimal lifetime              Frequency-based priority   Temporal frequency
OPT     Best                          Best                       Best
LRU     Poor with small cache sizes   Poor                       Well
FBR     Poor                          Well                       Well
Our Replacement Algorithm: Multi-Queue (MQ)

• Designed based on the three properties
  – Minimal lifetime: multiple LRU queues with different priorities
  – Frequency-based priority: promoting based on reference counts
  – Aged frequency: demoting when the lifetime expires
• lifetime = f(minDist)

[Figure: queues Q0–Q3 plus a history buffer; each entry holds a block ID and reference count. Accessing block B (count 7 → 8) promotes it to a higher queue and sets B.expireTime = CurrentTime + lifetime; block D (count 6) with D.expireTime < CurrentTime is demoted; block C (count 1) is evicted into the history buffer.]
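The mechanism on this slide can be sketched as a small simulator. This is a hedged reconstruction of the MQ idea, not the paper's exact code: the lifetime is a fixed constant here (the real MQ sets it adaptively from minDist statistics), and the history-buffer size is an arbitrary choice of mine.

```python
from collections import OrderedDict

class MQCache:
    def __init__(self, capacity, m=8, lifetime=32):
        self.capacity = capacity
        self.m = m
        self.lifetime = lifetime      # fixed here; adaptive in the real MQ
        self.queues = [OrderedDict() for _ in range(m)]  # block -> expire time
        self.freq = {}                # block -> reference count
        self.history = OrderedDict()  # evicted block -> reference count
        self.time = 0
        self.size = 0

    def _queue_of(self, count):
        # Queue index grows with log2(reference count), capped at m-1.
        return min(count.bit_length() - 1, self.m - 1)

    def _demote_expired(self):
        for k in range(1, self.m):
            q = self.queues[k]
            while q:
                block, expire = next(iter(q.items()))
                if expire >= self.time:
                    break
                del q[block]          # lifetime expired: demote one level
                self.queues[k - 1][block] = self.time + self.lifetime

    def _evict(self):
        for q in self.queues:         # evict the LRU block of the lowest queue
            if q:
                block, _ = q.popitem(last=False)
                self.history[block] = self.freq.pop(block)
                if len(self.history) > 4 * self.capacity:  # arbitrary bound
                    self.history.popitem(last=False)
                self.size -= 1
                return

    def access(self, block):
        """Return True on hit, False on miss."""
        self.time += 1
        hit = False
        for q in self.queues:
            if block in q:
                del q[block]
                hit = True
                break
        if not hit:
            if self.size >= self.capacity:
                self._evict()
            self.size += 1
        # Restore the count from history if the block was seen before.
        count = self.freq.get(block, self.history.pop(block, 0)) + 1
        self.freq[block] = count
        self.queues[self._queue_of(count)][block] = self.time + self.lifetime
        self._demote_expired()
        return hit
```

Frequently referenced blocks climb to higher queues and survive bursts of cold blocks, which only ever churn through Q0; the history buffer lets a re-fetched block resume its old priority.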
Simulation Evaluation
• Trace-driven cache simulator
  – Write-through
  – Block size: 8 Kbytes
• MQ settings
  – m = 8
  – Adaptive lifetime setting based on on-line statistics
Simulation Results
MQ performs better than the others.

Cache Hit Ratios (Oracle-1)

[Figure: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, MQ, FBR, and LRU; MQ tracks closest to OPT]
            Temp. distance < 64K      Temp. distance >= 64K
Algorithm   #hits      #misses        #hits      #misses
MQ          1553K      293K           1919K      2645K
FBR         1611K      235K           1146K      3418K
LRU         1846K      0              407K       4157K
Why Does MQ Perform Better?

• MQ can selectively keep some blocks for a longer time

[Figure: temporal distance distribution for Oracle-1 with a 512 MB storage cache (64K entries); 29% of re-references fall below 64K, 71% at or above]
Implementation Results (Oracle)
Storage cache hit ratios (database cache size: 128MB, database size: 100GB):

Storage Cache Size   MQ       LRU
128 MB               19.85%   8.85%
256 MB               31.42%   17.66%
512 MB               44.34%   31.69%

• MQ has an effect similar to doubling the cache size with LRU
OLTP End Performance (Oracle)
[Figure: normalized execution time, split into computation and I/O, for LRU vs. MQ at storage cache sizes of 128 MB, 256 MB, and 512 MB; MQ shows a smaller I/O fraction at every size]

• MQ can reduce I/O time by 16~25% and improve overall performance by 9~11% compared to LRU.
Related Work
• Practice
  – LRU, MRU, LFU, SEQ, LFRU, 2Q, FBR, LRU-k, …
• Theory
  – LUP & RLUP analytical models
  – Competitive analysis
• Temporal locality metrics
  – LRU stack distance (1970)
  – Distance string (1976)
  – IRG model (1995)
• Multi-queue process scheduling
Summary
• Access pattern?
  – Long temporal distances & unevenly distributed frequencies
• Properties?
  – Minimal lifetime, frequency-based priority, aged frequency
• What algorithms are good for storage caches?
  – MQ performs better than seven tested alternatives and has an effect similar to doubling the cache size with LRU.
• Details can be found in
  – Y. Zhou, J. F. Philbin and K. Li. The Multi-Queue Server Buffer Cache Replacement Algorithm. USENIX '01
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary
I/O Related Host Overhead
• High-end or mid-size OLTP configurations
  – Tolerate disk access latency via asynchronous I/Os
  – Problem: high I/O-related processor overhead
• Reasons
  – OS overhead
  – Communication protocol overhead
• Our solution: using user-level communication for database storage
Analogy: Overhead Walkway

• An overhead walkway avoids stairs & traffic, so you can win money faster!

[Figure: an overhead walkway over the Las Vegas "Strip" connecting casinos, analogous to SCSI or Fibre Channel connecting hosts and storage]
[Figure: the database (user space) and the storage server (user space) communicate directly via user-level communication, bypassing kernel space on both sides]
User-level Communication
• High bandwidth, low latency, low overhead
• Main features
  – Bypassing the OS
  – Zero-copy transfers
  – Remote DMA (RDMA)
• University research
  – VMMC, U-Net, FM, AM, Memory Channel, Myrinet, …
• Industry standard
  – Virtual Interface (VI) Architecture
Using User-level Communication
• Intra-cluster communication
  – Scientific parallel applications
  – Application servers (Intel CSP, Compaq TruCluster)
  – Web servers
• Client-server communication
  – Databases (Oracle, DB2, MS SQL)
  – Direct Access File Systems (DAFS)
Our Goals

• User-level communication as a database-storage interconnect:
  – Is user-level communication effective at reducing the host overhead for database storage?
  – How should user-level communication be used to connect the database with storage?
VI-Attached Storage Server
[Figure: databases with a VI client stub connect over a VI network to storage servers; each storage server has a storage cache and local disks]

• Database and storage communicate using VI
• The storage cache is managed using MQ
Client-stub Implementations
• Challenges
  – Application transparency
  – Taking advantage of VI
  – Storage API
• Implementations (in decreasing order of transparency)
  – Kernel driver
  – DLL interceptor
  – Direct Storage Access (DSA)
Client Stub: Kernel Driver
• Fully transparent + standard API
• Plus
  – Supports all applications
  – Takes advantage of VI's zero-copy and RDMA features
• Minus
  – High kernel overhead
  – Requires kernel-space VI

[Figure: databases call a standard DLL, which goes through a kernel-space device driver to kernel-space VI]
Client Stub: DLL Interceptor
• User-level transparent + standard API
• Plus
  – No modification to databases
  – Takes advantage of user-level communication
• Minus
  – High overhead to satisfy standard I/O API semantics
    (example: triggering events for I/O completions)

[Figure: databases call a standard DLL; a DLL interceptor redirects I/O to user-space VI, bypassing the kernel]
Client Stub: Direct Storage Access (DSA)
• Least transparent + new API
• DSA interface
  – Minimizes kernel involvement
  – Designed around VI features, e.g. polling for I/O completions
• Plus
  – Fully takes advantage of VI
• Minus
  – Requires small modifications to the database

[Figure: modified databases call the DSA library, which uses user-space VI directly]
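The polling-based completion model DSA relies on can be illustrated with a toy sketch. `CompletionQueue`, `post`, and `poll` are hypothetical names (not the actual DSA API), and the "device" is simulated with a thread; the point is only that detecting a completion needs no system call or interrupt on the critical path:

```python
import threading
from collections import deque

class CompletionQueue:
    """A user-space completion queue shared with an I/O 'device' thread."""
    def __init__(self):
        self._done = deque()
        self._lock = threading.Lock()

    def post(self, request_id):
        # Called by the device side when an I/O finishes.
        with self._lock:
            self._done.append(request_id)

    def poll(self):
        """Non-blocking: return a completed request id, or None."""
        with self._lock:
            return self._done.popleft() if self._done else None

def fake_device(cq, request_id):
    # Stand-in for the storage server completing an I/O.
    cq.post(request_id)

cq = CompletionQueue()
t = threading.Thread(target=fake_device, args=(cq, 42))
t.start()
t.join()

# The application spins on poll() -- no system call, no interrupt.
while (done := cq.poll()) is None:
    pass
print("completed request", done)   # completed request 42
```

The trade-off is classic: polling burns CPU while waiting but avoids per-I/O kernel transitions, which is a win when completions arrive quickly and frequently, as in OLTP.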
Limitations of User-level Communication
• Substantial enhancements are needed to address the following issues
  – Lack of flow-control and reconnection mechanisms
  – High memory registration overhead
  – High locking overhead
  – High interrupt overhead
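For the first item, one standard remedy is credit-based flow control: the sender consumes one credit per message and stalls at zero; the receiver returns credits as it drains its pre-posted buffers. The talk does not spell out its mechanism, so this is a plausible sketch only, and all class and method names are illustrative:

```python
class CreditSender:
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.pending = []            # messages waiting for credits

    def send(self, msg, wire):
        if self.credits == 0:
            self.pending.append(msg) # would queue or block in a real system
            return False
        self.credits -= 1            # one credit == one receive buffer
        wire.append(msg)
        return True

    def on_credit_return(self, n, wire):
        # Receiver drained n buffers; retry any stalled messages.
        self.credits += n
        while self.pending and self.credits:
            self.send(self.pending.pop(0), wire)

wire = []
s = CreditSender(initial_credits=2)
assert s.send("a", wire) and s.send("b", wire)
assert not s.send("c", wire)        # out of credits: message queued
s.on_credit_return(1, wire)         # receiver drained one buffer
assert wire == ["a", "b", "c"]
```

This guarantees the sender never overruns the receiver's posted buffers, which user-level NICs, unlike kernel TCP stacks, do not enforce on their own.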
Evaluation
• Real systems
  – OS: Windows XP
  – Databases: MS SQL, TPC-C benchmark
  – VI network: Giganet
  – Tested by customers for 6 months
• Large-size configuration
  – Database: 32-way SMP, 32 GB memory
  – Storage: 8 PCs, each with 3 GB memory; 12 Terabytes of data
• Mid-size configuration
  – Database: 4-way SMP, 4 GB memory
  – Storage: 4 PCs, each with 2 GB memory; 1 Terabyte total
OLTP Performance (Large-size Config.)
• DSA improvement
  – I/O time: 40%
  – Overall: 18%

[Figure: normalized execution time, split into computation and I/O, for Fibre Channel, Driver, DLL, and DSA; DSA has the smallest I/O fraction]
TPC-C I/O Overhead Breakdown
[Figure: CPU utilization breakdown of TPC-C I/O overhead (other, VI, client stub, lock, OS kernel) for Fibre Channel, Driver, DLL, and DSA]
Summary
• User-level communication can effectively connect the database with storage, but may require substantial enhancements.
• A storage API that minimizes kernel involvement in the I/O path is necessary to fully exploit the benefits of user-level communication.
• Details can be found in
  – Yuanyuan Zhou, Angelos Bilas, Suresh Jagannathan, Cezary Dubnicki, James F. Philbin and Kai Li. Experience with VI-based Communication for Database Storage. To appear in ISCA '02
Conclusions

• Effective hierarchy-aware storage caching
  – MQ has a doubling-cache-size effect compared to LRU, and can reduce I/O time in OLTP by 15~30%
  – Provides insights for other similar multi-level cache hierarchies (e.g. Web proxy caches)
• Using user-level communication for database storage
  – DSA can reduce I/O time in OLTP by 40%
  – Provides guidelines for the design and implementation of new I/O interconnects (e.g. InfiniBand) and other applications (e.g. DAFS)
My Other Related Research
• Improving availability
  – Fast cluster fail-over using memory-mapped communication (ICS'99)
• Memory management for networked servers
  – Consistency protocols for DSMs (OSDI'96)
  – Coherence granularity vs. protocols (PPoPP'97)
  – Performance limitations of software DSMs (HPCA'99)
  – Thread scheduling for locality (IOPADS'99)
• http://www.cs.princeton.edu/~yzhou/