
EECC722 - Shaaban, Lec # 9, Fall 2000, 10-11-2000

Storage System Issues
• Designing an I/O System

• ABCs of UNIX File Systems

• I/O Benchmarks

• Comparing UNIX File System Performance


Designing an I/O System
• When designing an I/O system, the components that make it up should be balanced.
• Six steps for designing an I/O system:
  – List the types of devices and buses in the system
  – List the physical requirements (e.g., volume, power, connectors, etc.)
  – List the cost of each device, including the controller if needed
  – Record the CPU resource demands of each device
    • CPU clock cycles directly for I/O (e.g., initiate, interrupts, complete)
    • CPU clock cycles due to stalls waiting for I/O
    • CPU clock cycles to recover from I/O activity (e.g., cache flush)
  – List the memory and I/O bus resource demands
  – Assess the performance of the different ways to organize these devices


Example: Determining the I/O Bottleneck
• Assume the following system components:
  – 500 MIPS CPU
  – 16-byte-wide memory system with 100 ns cycle time
  – 200 MB/sec I/O bus
  – 20 20-MB/sec SCSI-2 buses, with 1 ms controller overhead
  – 5 disks per SCSI bus: 8 ms seek, 7,200 RPM, 6 MB/sec transfer rate
• Other assumptions:
  – All devices are used to 100% capacity and always have average values
  – Average I/O size is 16 KB
  – OS uses 10,000 CPU instructions per disk I/O

• What is the average IOPS? What is the average bandwidth?


• The performance of an I/O system is determined by the portion with the lowest I/O bandwidth:
  – CPU: (500 MIPS)/(10,000 instr. per I/O) = 50,000 IOPS
  – Main memory: (16 bytes per 100 ns)/(16 KB per I/O) = 10,000 IOPS
  – I/O bus: (200 MB/sec)/(16 KB per I/O) = 12,500 IOPS
  – SCSI-2: (20 buses)/((1 ms + (16 KB)/(20 MB/sec)) per I/O) = 11,120 IOPS
  – Disks: (100 disks)/((8 ms + 0.5/(7,200 RPM) + (16 KB)/(6 MB/sec)) per I/O) = 6,700 IOPS
• In this case, the disks limit the I/O performance to 6,700 IOPS.
• The average I/O bandwidth is:
  – 6,700 IOPS x (16 KB per I/O) = 107.2 MB/sec

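To double-check the arithmetic above, here is a minimal Python sketch of the same bottleneck calculation (the variable names are illustrative, and 1 KB is treated as 1,000 bytes to match the slide's rounding):

```python
# Bottleneck analysis for the example I/O system above: each expression is the
# maximum I/O rate (IOPS) one component can sustain for 16 KB requests; the
# system is limited by the smallest value.
IO_SIZE = 16e3                                     # bytes per I/O (16 KB)

cpu_iops  = 500e6 / 10_000                         # 500 MIPS / 10,000 instr. per I/O
mem_iops  = (16 / 100e-9) / IO_SIZE                # 16 bytes every 100 ns, per 16 KB I/O
bus_iops  = 200e6 / IO_SIZE                        # 200 MB/sec I/O bus
scsi_iops = 20 / (1e-3 + IO_SIZE / 20e6)           # 20 buses; 1 ms overhead + transfer
disk_iops = 100 / (8e-3 + 0.5 / (7200 / 60) + IO_SIZE / 6e6)  # 100 disks; seek + half rotation + transfer

rates = {"CPU": cpu_iops, "Memory": mem_iops, "I/O bus": bus_iops,
         "SCSI-2": scsi_iops, "Disks": disk_iops}
bottleneck = min(rates, key=rates.get)
print({k: round(v) for k, v in rates.items()})
print(f"Bottleneck: {bottleneck} at {rates[bottleneck]:,.0f} IOPS "
      f"= {rates[bottleneck] * IO_SIZE / 1e6:.1f} MB/sec")
```

Running this reproduces the slide's numbers to within rounding: the disks come out around 6,700 IOPS and roughly 107 MB/sec.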


OS Policies and I/O Performance

• Performance potential determined by HW: CPU, Disk, bus, memory system.

• Operating system policies can determine how much of that potential is achieved.

• OS Policies:

1) How much main memory allocated for file cache?

2) Can boundary change dynamically?

3) Write policy for disk cache:
   • Write Through with Write Buffer

• Write Back


Network Attached Storage
• Decreasing disk diameters: 14" » 10" » 8" » 5.25" » 3.5" » 2.5" » 1.8" » 1.3" » . . .
  – high-bandwidth disk systems based on arrays of disks
• Increasing network bandwidth: 3 Mb/s » 10 Mb/s » 50 Mb/s » 100 Mb/s » 1 Gb/s » 10 Gb/s
  – networks capable of sustaining high-bandwidth transfers
• The network provides well-defined physical and logical interfaces: separate CPU and storage system!
• Together with OS structures supporting remote file access, this enables network file services: high-performance storage service on a high-speed network.


ABCs of UNIX File Systems
• Key issues:
  – File vs. Raw I/O
  – File cache size policy
  – Write policy
  – Local disk vs. server disk
• File vs. Raw:
  – File system access is the norm: standard policies apply
  – Raw: an alternate I/O system that avoids the file system, used by databases
• File Cache Size Policy:
  – Files are cached in main memory, rather than being accessed from disk
  – With older UNIX, the % of main memory dedicated to the file cache is fixed at system generation (e.g., 10%)
  – With newer UNIX, the % of main memory for the file cache varies depending on the amount of file I/O (e.g., up to 80%)


ABCs of UNIX File Systems
• Write Policy
  – File storage should be permanent: either write immediately or flush the file cache after a fixed period (e.g., 30 seconds)
  – Write Through with Write Buffer
  – Write Back
  – Write Buffer is often confused with Write Back (a toy sketch of the difference follows below):
    • With Write Through with Write Buffer, all writes go to disk
    • With Write Through with Write Buffer, writes are asynchronous, so the processor doesn't have to wait for the disk write
    • Write Back will combine multiple writes to the same page; hence it can be called Write Cancelling

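The following toy Python sketch (not from the lecture; the class and its methods are invented for illustration) shows why Write Back is also called Write Cancelling: repeated writes to the same page collapse into a single disk write, while Write Through with Write Buffer sends every write to disk, just asynchronously.

```python
# Toy model contrasting the two write policies named above.
class FileCache:
    def __init__(self, policy):
        self.policy = policy        # "through_buffer" or "write_back"
        self.pages = {}             # page number -> latest data
        self.dirty = set()          # pages awaiting write-back
        self.disk_queue = []        # pending (asynchronous) disk writes

    def write(self, page, data):
        self.pages[page] = data
        if self.policy == "through_buffer":
            self.disk_queue.append((page, data))   # every write reaches disk
        else:
            self.dirty.add(page)                   # rewrite of a dirty page is "cancelled"

    def flush(self):
        if self.policy == "write_back":
            self.disk_queue.extend((p, self.pages[p]) for p in self.dirty)
            self.dirty.clear()
        writes, self.disk_queue = self.disk_queue, []
        return writes

for policy in ("through_buffer", "write_back"):
    cache = FileCache(policy)
    for version in range(3):                       # three writes to the same page
        cache.write(page=7, data=f"version {version}")
    print(policy, "->", len(cache.flush()), "disk write(s)")
# through_buffer -> 3 disk write(s); write_back -> 1 disk write(s)
```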


ABCs of UNIX File Systems
• Local vs. Server
  – UNIX file systems have historically had different policies (and even different file systems) for the local client vs. the remote server
  – NFS local disk allows a 30-second delay before flushing writes
  – NFS server disk writes through to disk on file close
  – Cache coherency problem if clients are allowed to have file caches in addition to the server file cache:
    • NFS just writes through on file close. Stateless protocol: clients periodically get new copies of file blocks
    • Other file systems use cache coherency with write back to check state and selectively invalidate or update


Network File Systems
[Figure: NFS software layering. On the client, an application program enters the UNIX system call layer and the virtual file system (VFS) interface; local accesses go through the UNIX file system and block device driver, while remote accesses go through the NFS client and the network protocol stack. On the server, requests arrive over the network via RPC/transmission protocols and pass through the UNIX system call layer and VFS interface to the server routines and the NFS file system.]


Typical File Server Architecture
• Limits to performance: data copying
  – Read data is staged from the device to primary memory
  – Copied again into network packet templates
  – Copied yet again to the network interface
• Normally no special hardware for fast processing between network and disk.

[Figure: single-processor file server. NFS requests arrive over Ethernet and are handled entirely in the kernel (NFS protocol and file processing, TCP/IP protocols, UNIX file system, Ethernet driver, disk manager and driver); data moves over the backplane bus between primary memory and the disk controller.]


AUSPEX NS5000 File Server
• Special hardware/software architecture for high-performance NFS I/O
• Functional multiprocessing

[Figure: NS5000 organization around an enhanced VME backplane: a host processor with host memory (UNIX front end), Ethernet processors (single-board computers specialized for protocol processing, with I/O buffers), file processors running dedicated file-system software (independent file system), primary memory, and a storage processor managing 10 parallel SCSI channels.]


AUSPEX Software Architecture
[Figure: software layering across the host processor (UNIX system call layer, VFS interface, NFS client, LFS client), the Ethernet processor (NFS server protocols, LFS server, network interface), the file processor (file system server), primary memory, and the storage processor driving the disk arrays; the figure distinguishes primary control flow, primary data flow, and the limited control interfaces between components.]


Berkeley RAID-II Disk Array File Server
[Figure: the file server controls, over a VME control bus, X-bus boards built around an 8-port interleaved memory (128 MB), an 8 x 8 x 32-bit crossbar, and XOR hardware, with VME links to ATC controllers of 5 SCSI channels each, scaling to 120 disk drives; HiPPI source/destination links and a TMC IOP bus connect the array to UltraNet and an FDDI network.]
• Low-latency transfers mixed with high-bandwidth transfers
• Application area: "diskless supercomputers"


I/O Performance Metrics: Throughput
• Throughput is a measure of speed: the rate at which the storage system delivers data.
• Throughput is measured in two ways:
• I/O rate, measured in accesses/second:
  – I/O rate is generally used for applications where the size of each request is small, such as transaction processing.
• Data rate, measured in bytes/second or megabytes/second (MB/s):
  – Data rate is generally used for applications where the size of each request is large, such as scientific applications.


I/O Performance Metrics: Response time

• Response time measures how long a storage system takes to access data. This time can be measured in several ways. For example:

– One could measure time from the user’s perspective,

– the operating system’s perspective,

– or the disk controller’s perspective, depending on what you view as the storage system.


I/O Performance Metrics
Capacity:
• How much data can be stored on the storage system.
• Capacity is not normally applied as a metric to non-storage components of a computer system, but it is an integral part of evaluating an I/O system.
Reliability:
• I/O systems require a reliability level much higher than other parts of a computer.
  – If a memory chip develops a parity error, the system will (hopefully) crash and be restarted.
  – If a storage device develops a parity error in a database of bank accounts, however, banks could unwittingly lose billions of dollars. Thus, reliability is a metric of great importance to storage.
Cost:
• Applies to all components in computer systems.
• Disk subsystems are often the most expensive component in a large computer installation.


I/O Benchmarks
• Processor benchmarks have classically aimed at response time for a fixed-size problem.
• I/O benchmarks typically measure throughput, possibly with an upper limit on response times (or on 90% of response times); a small sketch of such a measurement follows the examples below.

• Traditional I/O benchmarks fix the problem size in the benchmark.

• Examples:

Benchmark Size of Data % Time I/O Year

I/OStones 1 MB 26% 1990

Andrew 4.5 MB 4% 1988

– Not much I/O time in benchmarks

– Limited problem size

– Not measuring disk (or even main memory)
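For concreteness, here is a minimal Python sketch (not an actual benchmark) of reporting an I/O result as throughput subject to a 90th-percentile response-time limit; the request count, duration, latency distribution, and limit are all assumed values.

```python
# Report throughput together with a 90th-percentile response-time constraint.
import random

def report(n_requests: int, elapsed_seconds: float, limit_ms: float = 50.0):
    # Stand-in for measured per-request response times (milliseconds).
    latencies = sorted(random.expovariate(1 / 12.0) for _ in range(n_requests))
    p90 = latencies[int(0.9 * len(latencies)) - 1]   # 90th-percentile latency
    iops = n_requests / elapsed_seconds              # I/O rate in accesses/second
    verdict = "meets" if p90 <= limit_ms else "violates"
    print(f"{iops:,.0f} IOPS; 90th-percentile latency {p90:.1f} ms ({verdict} the {limit_ms} ms limit)")

report(n_requests=50_000, elapsed_seconds=10.0)
```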


The Ideal I/O Benchmark
• An I/O benchmark should help system designers and users understand why the system performs as it does.
• The performance of an I/O benchmark should be limited by the I/O devices, to maintain the focus of measuring and understanding I/O systems.
• The ideal I/O benchmark should scale gracefully over a wide range of current and future machines; otherwise I/O benchmarks quickly become obsolete as machines evolve.
• A good I/O benchmark should allow fair comparisons across machines.
• The ideal I/O benchmark would be relevant to a wide range of applications.
• In order for results to be meaningful, benchmarks must be tightly specified. Results should be reproducible by general users; optimizations which are allowed and disallowed must be explicitly stated.


I/O Benchmarks Comparison


Self-Scaling I/O Benchmarks
• Alternative to traditional I/O benchmarks: a self-scaling benchmark automatically and dynamically increases aspects of the workload to match the characteristics of the system being measured.
  – Measures a wide range of current and future applications
• Types of self-scaling benchmarks:
  – Transaction processing: interested in IOPS, not bandwidth
    • TPC-A, TPC-B, TPC-C
  – NFS: SPEC SFS/LADDIS - average response time and throughput
  – UNIX I/O: performance of file systems
    • Willy


I/O Benchmarks: Transaction Processing
• Transaction Processing (TP) (or On-line TP = OLTP)

– Changes to a large body of shared information from many terminals, with the TP system guaranteeing proper behavior on a failure

– If a bank’s computer fails when a customer withdraws money, the TP system would guarantee that the account is debited if the customer received the money and that the account is unchanged if the money was not received

– Airline reservation systems & banks use TP

• Atomic transactions make this work.
• Each transaction takes 2 to 10 disk I/Os and 5,000 to 20,000 CPU instructions per disk I/O.
  – Depends on the efficiency of the TP software and on avoiding disk accesses by keeping information in main memory (a rough estimate is sketched below)
• The classic metric is Transactions Per Second (TPS).
  – Under what workload? How is the machine configured?
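As a back-of-the-envelope illustration of those per-transaction figures, the Python sketch below estimates a TPS ceiling; the CPU rate, disk IOPS, and the specific points chosen inside the quoted ranges are assumptions, not TPC results.

```python
# Rough TPS estimate from the figures quoted above:
# 2-10 disk I/Os per transaction, 5,000-20,000 CPU instructions per disk I/O.
cpu_mips     = 500        # assumed CPU rate (million instructions per second)
disk_iops    = 6_700      # assumed disk subsystem capability (earlier example)
ios_per_txn  = 4          # disk I/Os per transaction (within the 2-10 range)
instr_per_io = 10_000     # CPU instructions per disk I/O (within 5,000-20,000)

tps_cpu_limit  = cpu_mips * 1e6 / (ios_per_txn * instr_per_io)
tps_disk_limit = disk_iops / ios_per_txn
print(f"CPU-limited:  {tps_cpu_limit:,.0f} TPS")
print(f"Disk-limited: {tps_disk_limit:,.0f} TPS")
print(f"Estimate:     {min(tps_cpu_limit, tps_disk_limit):,.0f} TPS")
```

With these assumed numbers the disks, not the CPU, bound the transaction rate, which is why keeping information in main memory matters so much for TP systems.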


I/O Benchmarks: Old TPC Benchmarks
• TPC-A: Revised version of TP1/DebitCredit
  – Arrivals: random (TPC) vs. uniform (TP1)
  – Terminals: smart vs. dumb (affects instruction path length)
  – ATM scaling: 10 terminals per TPS vs. 100
  – Branch scaling: 1 branch record per TPS vs. 10
  – Response time constraint: 90% ≤ 2 seconds vs. 95% ≤ 1 second
  – Full disclosure, approved by the TPC
  – Complete TPS vs. response time plots vs. a single point
• TPC-B: Same as TPC-A but without terminals (batch processing of requests)
  – Response time makes no sense: plots TPS vs. residence time (the time a transaction resides in the system)

• These have been withdrawn as benchmarks


I/O Benchmarks: TPC-C Complex OLTP

• Models a wholesale supplier managing orders.

• Order-entry conceptual model for benchmark.

• Workload = 5 transaction types.

• Users and database scale linearly with throughput.

• Defines full-screen end-user interface

• Metrics: new-order rate (tpmC) and price/performance ($/tpmC)

• Approved July 1992


TPC-C Price/Performance ($/tpmC)

Rank  Config                              $/tpmC   tpmC       Database
1     Acer AcerAltos 19000Pro4            $27.25   11,072.07  M/S SQL 6.5
2     Dell PowerEdge 6100 c/s             $29.55   10,984.07  M/S SQL 6.5
3     Compaq ProLiant 5500 c/s            $33.37   10,526.90  M/S SQL 6.5
4     ALR Revolution 6x6 c/s              $35.44   13,089.30  M/S SQL 6.5
5     HP NetServer LX Pro                 $35.82   10,505.97  M/S SQL 6.5
6     Fujitsu teamserver M796i            $37.62   13,391.13  M/S SQL 6.5
7     Fujitsu GRANPOWER 5000 Model 670    $37.62   13,391.13  M/S SQL 6.5
8     Unisys Aquanta HS/6 c/s             $37.96   13,089.30  M/S SQL 6.5
9     Compaq ProLiant 7000 c/s            $39.25   11,055.70  M/S SQL 6.5
10    Unisys Aquanta HS/6 c/s             $39.39   12,026.07  M/S SQL 6.5


I/O Benchmarks: TPC-D Complex Decision Support Workload

• OLTP: business operation

• Decision support: business analysis (historical)

• Workload = 17 ad hoc transaction types

• Synthetic generator of data

• Size determined by Scale Factor: 100 GB, 300 GB, 1 TB, 3 TB, 10 TB

• Metrics ("queries per gigabyte hour"); a small numeric sketch follows this slide:
  – Power: QppD@Size = 3600 x SF / (geometric mean of query times)
  – Throughput: QthD@Size = 17 x SF / (time / 3600)
  – Price/Performance: $/QphD@Size = $ / geometric mean(QppD@Size, QthD@Size)

• Report time to load database (indices, stats) too.

• Approved April 1995
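Here is a small Python sketch of the TPC-D metric arithmetic listed above; the query times, run time, and system price are invented placeholders, not figures from any published result.

```python
# TPC-D style metrics: power (QppD), throughput (QthD), price/performance ($/QphD).
from math import prod

def geomean(values):
    return prod(values) ** (1 / len(values))

SF           = 300                  # scale factor: 300 GB database
query_times  = [42.0] * 17          # elapsed seconds per query (placeholder)
stream_time  = sum(query_times)     # seconds for the throughput run (placeholder)
system_price = 4_000_000            # total system cost in dollars (placeholder)

QppD = 3600 * SF / geomean(query_times)            # power metric
QthD = 17 * SF / (stream_time / 3600)              # throughput metric
price_perf = system_price / geomean([QppD, QthD])  # $/QphD
print(f"QppD@{SF}GB = {QppD:,.0f}, QthD@{SF}GB = {QthD:,.0f}, $/QphD@{SF}GB = {price_perf:,.2f}")
```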


TPC-D Performance/Price, 300 GB

Ranked by performance (QppD):
Rank  Config.                        QppD     QthD     $/QphD    Database
1     NCR WorldMark 5150             9,260.0  3,117.0  2,172.00  Teradata
2     HP 9000 EPS22 (16 node)        5,801.2  2,829.0  1,982.00  Informix-XPS
3     DG AViiON AV20000              3,305.8  1,277.7  1,319.00  Oracle8 v8.0.4
4     Sun Ultra Enterprise 6000      3,270.6  1,477.8  1,553.00  Informix-XPS
5     Sequent NUMA-Q 2000 (32 way)   3,232.3  1,097.8  3,283.00  Oracle8 v8.0.4

Ranked by price/performance ($/QphD):
Rank  Config.                        QppD     QthD     $/QphD    Database
1     DG AViiON AV20000              3,305.8  1,277.7  1,319.00  Oracle8 v8.0.4
2     Sun Ultra Enterprise 6000      3,270.6  1,477.8  1,553.00  Informix-XPS
3     HP 9000 EPS22 (16 node)        5,801.2  2,829.0  1,982.00  Informix-XPS
4     NCR WorldMark 5150             9,260.0  3,117.0  2,172.00  Teradata
5     Sequent NUMA-Q 2000 (32 way)   3,232.3  1,097.8  3,283.00  Oracle8 v8.0.4


TPC-D Performance, 1 TB

Rank  Config.                        QppD      QthD     $/QphD    Database
1     Sun Ultra E6000 (4 x 24-way)   12,931.9  5,850.3  1,353.00  Informix Dyn.
2     NCR WorldMark (32 x 4-way)     12,149.2  3,912.3  2,103.00  Teradata
3     IBM RS/6000 SP (32 x 8-way)    7,633.0   5,155.4  2,095.00  DB2 UDB, V5

• NOTE: It is inappropriate to compare results from different database sizes.


I/O Benchmarks: TPC-W Transactional Web Benchmark

• Represent any business (retail store, software distribution, airline reservation, electronic stock trades, etc.) that markets and sells over the Internet/ Intranet

• Measure systems supporting users browsing, ordering, and conducting transaction oriented business activities.

• Security (including user authentication and data encryption) and dynamic page generation are important

• Before: processing of customer order by terminal operator working on LAN connected to database system

• Today: the customer accesses the company site over an Internet connection, browses both static and dynamically generated Web pages, and searches the database for product or customer information. Customers also initiate, finalize, and check on product orders and deliveries.

• Approved Fall, 1998


SPEC SFS/LADDIS Predecessor: NFSstones
• NFSstones: a synthetic benchmark that generates a series of NFS requests from a single client to test the server; reads, writes, commands, and file sizes are drawn from other studies.

– Problem: 1 client could not always stress server.

– Files and block sizes not realistic.

– Clients had to run SunOS.


SPEC SFS/LADDIS
• 1993 attempt by NFS companies to agree on a standard benchmark: Legato, Auspex, Data General, DEC, Interphase, Sun.
• Like NFSstones but:
  – Run on multiple clients & networks (to prevent bottlenecks)

– Same caching policy in all clients

– Reads: 85% full block & 15% partial blocks

– Writes: 50% full block & 50% partial blocks

– Average response time: 50 ms

– Scaling: for every 100 NFS ops/sec, increase capacity 1GB.

– Results: plot of server load (throughput) vs. response time & number of users

• Assumes: 1 user => 10 NFS ops/sec
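A quick Python sketch of the SFS/LADDIS sizing rules quoted above (the number of simulated users is an assumed value for illustration): 1 user generates 10 NFS ops/sec, and each 100 NFS ops/sec of offered load adds 1 GB of file data.

```python
# SFS/LADDIS-style load and capacity sizing from the rules above.
users = 500                            # assumed number of simulated users
nfs_ops_per_sec = users * 10           # 1 user => 10 NFS ops/sec
capacity_gb = nfs_ops_per_sec / 100    # 1 GB of data per 100 NFS ops/sec
print(f"{users} users -> {nfs_ops_per_sec:,} NFS ops/sec -> ~{capacity_gb:.0f} GB of file data")
```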


Unix I/O Benchmarks: Willy
• A UNIX file system benchmark that gives insight into I/O system behavior (Chen and Patterson, 1993).
• Self-scaling, to automatically explore system size.
• Examines five parameters:
  – Unique bytes touched: data size; locality via LRU
    • Gives file cache size
  – Percentage of reads: % writes = 1 - % reads; typically 50%
    • 100% reads gives peak throughput
  – Average I/O request size: Bernoulli, C = 1
  – Percentage of sequential requests: typically 50%
  – Number of processes: concurrency of the workload (number of processes issuing I/O requests)
• Fixes four parameters while varying one parameter (see the sketch below).
• Searches the space to find high throughput.
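The toy Python sketch below (not Willy itself; the parameter names, focal values, and sweep ranges are illustrative) shows the self-scaling idea: hold four workload parameters at focal values and sweep the fifth, measuring throughput at each point.

```python
# Self-scaling parameter sweep in the style described above.
focal = {
    "unique_megabytes": 64,       # data size touched
    "read_fraction": 0.5,         # percentage of reads
    "request_kb": 32,             # average I/O request size
    "sequential_fraction": 0.5,   # percentage of sequential requests
    "processes": 4,               # concurrency
}

sweeps = {
    "unique_megabytes": [1, 4, 16, 64, 256, 1024],
    "read_fraction": [0.0, 0.25, 0.5, 0.75, 1.0],
}

def measure_throughput(params):
    # Placeholder: a real benchmark would issue the described workload
    # against the file system and report the achieved MB/sec.
    return 0.0

for name, values in sweeps.items():
    for value in values:
        params = dict(focal, **{name: value})     # vary one parameter, fix the rest
        print(f"vary {name}={value}: {measure_throughput(params):.1f} MB/sec")
```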


UNIX File System Performance Study Using Willy

9 machines & operating systems:

Desktop:
Machine               OS            Year   Price       Memory
Alpha AXP 3000/400    OSF/1         1993   $30,000     64 MB
DECstation 5000/200   Sprite LFS    1990   $20,000     32 MB
DECstation 5000/200   Ultrix 4.2    1990   $20,000     32 MB
HP 730                HP/UX 8 & 9   1991   $35,000     64 MB
IBM RS/6000/550       AIX 3.1.5     1991   $30,000     64 MB
SparcStation 1+       SunOS 4.1     1989   $30,000     28 MB
SparcStation 10/30    Solaris 2.1   1992   $20,000     128 MB

Mini/Mainframe:
Convex C2/240         Convex OS     1988   $750,000    1,024 MB
IBM 3090/600J VF      AIX/ESA       1990   $1,000,000  128 MB


Disk Performance
• I/O is limited by the weakest link in the chain from processor to disk.
• That could be the disks, disk controller, I/O bus, CPU/memory bus, CPU, or OS; it is not uniform across machines.

Machine               OS            I/O bus             Disk
Alpha AXP 3000/400    OSF/1         TurboChannel SCSI   RZ26
DECstation 5000/200   Sprite LFS    SCSI-I              3 CDC Wren
DECstation 5000/200   Ultrix 4.2    SCSI-I              DEC RZ56
HP 730                HP/UX 8 & 9   Fast SCSI-II        HP 1350SX
IBM RS/6000/550       AIX 3.1.5     SCSI-I              IBM 2355
SparcStation 1+       SunOS 4.1     SCSI-I              CDC Wren IV
SparcStation 10/30    Solaris 2.1   SCSI-I              Seagate Elite
Convex C2/240         Convex OS     IPI-2               4 DKD-502
IBM 3090/600J VF      AIX/ESA       Channel             IBM 3390


Self-Scaling Benchmark Parameters


Disk Performance
• 32 KB reads
• The SS 10 disk spins at 5,400 RPM; 4 IPI disks on the Convex

Measured disk throughput (MB/sec) by machine and operating system:
DS5000, Sprite              0.5
DS5000, Ultrix              0.6
Sparc1+, SunOS 4.1          0.7
3090, AIX/ESA               1.1   (IBM Channel, IBM 3390 disk)
HP 730, HP/UX 9             1.4
RS/6000, AIX                1.6
AXP/4000, OSF1              2.0
SS 10, Solaris 2            2.4   (5,400 RPM SCSI-II disk)
Convex C240, ConvexOS 10    4.2   (IPI-2, RAID)


File Cache Performance
• UNIX file system performance: not how fast the disk is, but whether the disk is used at all (the file cache has 3 to 7 x the disk's performance).
• 4X speedup between generations: DEC & Sparc.

File cache throughput (MB/sec) by machine and operating system:
Sparc1+, SunOS 4.1          2.8
DS5000, Ultrix              5.0
DS5000, Sprite              8.7
Convex C240, ConvexOS 10    9.9
SS 10, Solaris 2            11.4
3090, AIX/ESA               27.2
HP 730, HP/UX 9             27.9
RS/6000, AIX                28.2
AXP/4000, OSF1              31.8

(The original figure marks the Sun and DEC generation gaps and the machines with fast memory systems.)


File Cache Size
[Figure: percentage of main memory devoted to the file cache, and the resulting file cache size (MB, log scale from 1 to 1000), for each machine and operating system; values range from 8% (HP/UX 8) up to 87%.]
• HP/UX v8 (8%) vs. v9 (81%); DS5000 Ultrix (10%) vs. Sprite (63%)


File System Write Policies
• Write Through with Write Buffer (asynchronous): AIX, Convex, OSF/1 w.t., Solaris, Ultrix
[Figure: throughput (MB/sec, 0 to 35) vs. percentage of reads (0% to 100%) for Convex, Solaris, AIX, and OSF/1, with annotations marking "fast disks" and "fast file caches for reads".]


File System Write Policies
• Write Cancelling (Write Back): HP/UX with no write daemon (vs. the 30-second flush); must wait for the write to complete when flushed.
[Figure: throughput (MB/sec, 0 to 35) vs. percentage of reads (0% to 100%) for HP/UX with no write daemon, SunOS, and Sprite.]


[Figure: file cache performance vs. read percentage.]


[Figure: performance vs. megabytes touched.]


Write Policy Performance for Client/Server Computing
• NFS: write through on close (no buffers)
• HP/UX (DUX): client caches writes; 25X faster at 80% reads
[Figure: throughput (MB/sec, 0 to 18) vs. percentage of reads for an HP 720-730 client (HP/UX 8, DUX, FDDI network) and an SS1+ client (SunOS 4.1, NFS, Ethernet).]


UNIX I/O Performance Study Conclusions
• The study uses Willy, a new I/O benchmark which supports self-scaling evaluation and predicted performance.

• The hardware determines the potential I/O performance, but the operating system determines how much of that potential is delivered: differences of factors of 100.

• File cache performance in workstations is improving rapidly, with over four-fold improvements in three years for DEC (AXP/3000 vs. DECStation 5000) and Sun (SPARCStation 10 vs. SPARCStation 1+).

• File cache performance of Unix on mainframes and minisupercomputers is no better than on workstations.

• Current workstations can take advantage of high performance disks.

• RAID systems can deliver much higher disk performance.

• File caching policy determines performance of most I/O events, and hence is the place to start when trying to improve I/O performance.