TRANSCRIPT
Klaus Bergmann, 02/24/2003, Session #2559
Linux for zSeries Performance Update
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries: Enterprise Storage Server, ESCON*, FICON, FICON Express, HiperSockets, IBM*, IBM logo*, IBM eServer, Netfinity*, S/390*, VM/ESA*, WebSphere*, z/VM, zSeries. * Registered trademarks of IBM Corporation.
The following are trademarks or registered trademarks of other companies: Intel is a trademark of the Intel Corporation in the United States and other countries. Java and all Java-related trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. Lotus, Notes, and Domino are trademarks or registered trademarks of Lotus Development Corporation. Linux is a registered trademark of Linus Torvalds. Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation. Penguin (Tux) compliments of Larry Ewing. SET and Secure Electronic Transaction are trademarks owned by SET Secure Electronic Transaction LLC. UNIX is a registered trademark of The Open Group in the United States and other countries. All other products may be trademarks or registered trademarks of their respective companies.
Trademarks
Agenda
Hardware
Scalability
Networking
Disk I/O: FCP / SCSI, ESS, LVM
All measurement results are based on IBM internal benchmarks in a controlled environment; results may vary.
[Diagram: 20-PU and 12-PU multichip module layouts - PU chips (PU0..PUF, PU10..PU13), SD cache chips (4 MB each), MBAs, storage controllers SC0/SC1, and clock chip CLK]
20-PU module (models 110-116, 1C1-1C9, 210-216, 2C1-2C9): 20 PU chips @ 1.3 / 1.09 ns; 3 SAPs, 1 spare; up to 16 CPs; up to 8 ICFs/IFLs
12-PU module (models 101..109, 100 (CF)): 12 PU chips @ 1.3 ns; 2 SAPs, 1 spare; up to 9 CPs; up to 8 (9) ICFs/IFLs
[Diagram: binodal system structure - two nodes, each with memory cards (up to 16 GB each), a shared 16 MB Level 2 cache, and 10 PUs. Relative access latencies: L1 cache 1x, L2 cache 15..20x, memory 100..150x]
The problem: memory access times do not scale with the CPU cycle time.
Solutions:
High bandwidth (=> throughput): 4 x 16 bytes @ 2.18 ns ==> 28 GB/sec
Cache hierarchies (=> latencies): Level 1 caches on the CPU chip, Level 2 cache shared by 10 PUs
[Diagram: PU chip with split Level 1 caches - 256 KB data, 256 KB instruction]
z900 system structure: optimized for maximum internal bandwidth
[Diagram: four MBAs, each driving 6 STIs at 1 GB/s (6 x 1 GB/s per MBA pair and node), attached to the two-node memory / Level 2 cache structure shown above]
z900 system structure: optimized for maximum external bandwidth
New functions since 04/2002:
z900 'Turbo' @ 1.09 nsec: 16 additional models (2C1..2C9, 210..216), about 20% faster systems
Additional z900 functionality: support for FCP and SCSI on the FICON Express feature in Linux environments; LA since 6/2002, GA 1Q 2003 (with new distributions)
CIU (Customer Initiated Upgrade): web interface for processor or memory upgrades; LIC/microcode download and upgrade via RSF
OSA-Express (QDIO): IPv6, VLAN, SNMP, ...
[Chart: relative performance z900 vs. G6 (2064-1C1 = 250), x-axis 1-16 CPs, y-axis 0-3000; curves for G6 Turbo, z900 2-bus (2064-10x), z900 4-bus (2064-1Cx, 11x), and z900 4-bus (2064-2Cx, 21x)]
[Diagram: FICON Express 2 Gbps connectivity - direct attachment to ESS 800 and FICON CUs over 2 Gbps autonegotiated links, plus a DWDM-extended configuration with a 1 Gbps autonegotiated link]
Direct attachment: IBM Enterprise Storage Server 800
DWDM and optical amplifiers: Cisco ONS 15540 ESP (LX, SX) and optical amplifiers (LX, SX); Nortel Optera Metro 5200 and 5300E* and optical amplifiers*; IBM 2029 Fiber Saver*
FICON switched and cascaded connectivity: McDATA ED-6064 (2032-064), INRANGE FC/9000-001/-128 (2042)
Advantages of 2 Gbps links: higher throughput, improved performance; extended opportunities for channel consolidation
Prerequisite: FICON Express cards (FC 2319, 2320). Applies to native FICON, FICON CTC, FICON Cascaded Directors, and Fibre Channel. Link speeds are negotiated between server and device, transparent to application and user.
* SX expected 2H 2003
FICON Express Connectivity Options - 2 Gbps Links
2064-216 (z900): 1.09 ns (917 MHz), 2 x 16 MB L2 cache (shared), 64 GB, LPAR; ESCON, FICON, HiperSockets, OSA-Express GbE
2105-F20 (Shark, ESS F20): 384 MB NVS, 16 GB cache, 128 x 36 GB disks @ 10,000 RPM; FCP (1 Gbps), FICON (1 Gbps)
8687-3RX (8-way x440): 8-way Intel Pentium III Xeon 1.6 GHz, 8 x 512 KB L2 cache (private), hyperthreading, Summit chipset
Our Hardware for Measurements
From kernel version 2.4.7 / 2.4.17 to version 2.4.19
From glibc version 2.2.4-31 to version 2.2.5-84
From gcc version 2.95.3 to version 3.2-31
A huge number of UnitedLinux patches: 1.3 MLOC (including x, p, i changes)
New Linux scheduler
Async I/O
...
SuSE SLES7 versus SuSE SLES8
[Chart: dbench throughput (MB/s, 0-1200) vs. number of processes (1-40) for 1, 4, 8, and 16 CPUs - SLES8, LPAR, 2064-216, ext2, 31-bit]
[Chart: dbench throughput (MB/s, 0-1200) vs. number of processes (1-40) for 1, 4, 8, and 16 CPUs - SLES7, LPAR, 2064-216, ext2, 31-bit]
Netfinity: 8-way Intel Pentium III, 700 MHz. xSeries: 8-way Intel Pentium III Xeon, 1.6 GHz, hyperthreading, Summit chipset. Values in () are the Linux maxcpus parameter.
[Chart: dbench throughput (MB/s, 0-1200) vs. number of processes (1-30) for 1, 2, 4, and 8 CPUs - Netfinity 8-way, ext2, kernel 2.4.14]
[Chart: dbench throughput (MB/s, 0-1200) vs. number of processes (1-40) for 1 (1), 1 (2), 2 (4), 4 (8), and 8 (16) CPUs - xSeries 8-way, ext2, kernel 2.4.20]
Dbench File I/O - Scalability
Context Switching
[Chart: context switch latency (microseconds, 0-1000) vs. number of processes (2-256) for working set sizes 0K-256K - z900, kernel 2.4.18]
[Chart: context switch latency (microseconds, 0-1000) vs. number of processes (2-256) for working set sizes 0K-256K - x440, kernel 2.4.20]
Networking
Netmarks Results
Transaction workload - RR 200/1000
[Chart: transactions per second (0-9000), 1 connection, SLES7 vs. SLES8, for 31-bit HiperSockets 32K, 31-bit GbE MTU 9000, and 31-bit GbE MTU 1500 - higher is better]
Network cost VM - transaction workload
[Chart: network cost under VM, RR 200/1000, 1 connection - 2064-216 CPU microseconds per transaction (0-750), SLES7 vs. SLES8, for 31-bit HS 32K, 31-bit GbE MTU 9000, and 31-bit GbE MTU 1500 - lower is better]
Connect Request Response workload (CRR 64/8k-10 connections)
[Chart: transactions per second (0-12000) by connection type, SLES7 vs. SLES8 - Guest LAN (GL) 32K and HiperSockets (HS) 32K, 31-bit and 64-bit, under VM and LPAR - higher is better]
Disk I/O
Don't treat the ESS as a black box; understand its structure.
The default is close to the worst case: you ask for 16 disks and your sysadmin gives you addresses 5100-510F. What's wrong with that?
[Diagram: zSeries 2064 with four FCP CHPIDs, connected through an FCP switch (2109) to the ESS 2105-F20 - I/O adapters in host adapter bays 1-4, two cluster processor complexes (4-way SMP RISC systems), device adapter pairs (DA), and disk ranks on loops A and B]
CHPIDs
Host adapters: 16 HAs in 4 bays
Device adapter pairs, each supporting 2 loops
Disk ranks: each rank is one RAID-5 array
ESS Architecture
Scenario       CHPIDs  Host adapters  Ranks  Disks  Bottleneck
single disk    1       1              1      1      1 HA
single rank    1       1              1      8      1 HA
single HA      1       1              4      8      1 HA
single CHPID   1       4              4      8      1 CHPID
two CHPIDs     2       4              4      8      4 HA
max. avail.    4       4              4      8      4 HA
Benchmark used for measuring: Iozone (http://www.iozone.org)
Multi-process sequential file system I/O, scaling from 1 to 8/16 processes
Each process writes and reads a 350 MB file on a separate disk
System: LPAR, 4 CPUs, 128 MB main memory, Linux 2.4.17
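A workload like the one described above could be launched with an Iozone command line along these lines. This is only a sketch: the mount points /mnt/disk1..8 and the 64 KB record size are assumptions, not values taken from the measurement.

```shell
#!/bin/bash
# Sketch: 8 processes, each doing sequential write (-i 0) and
# sequential read (-i 1) of a 350 MB file (-s 350m) on its own disk.
# Mount points and the 64k record size (-r) are assumptions.
FILES=""
for i in 1 2 3 4 5 6 7 8; do
    FILES="$FILES /mnt/disk$i/iozone.tmp"
done
# -t 8: throughput mode with 8 processes, -F: one file per process
iozone -s 350m -r 64k -t 8 -i 0 -i 1 -F $FILES
```

Scaling the test means varying -t (and the file list) from 1 up to 8 or 16 processes.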
Scaling Tests
2064-216, 917 MHz, 256 MB LPAR
4 FICON Express channels used for FCP (IOCDS: type FCP)
6 FICON Express channels used for FICON (IOCDS: type FC)
2109-F16 FCP switch
ESS 2105-F20: 16 GB cache, 4 FCP host adapters, 6 FICON host adapters, 4 device adapter pairs
Only the A-loops contain disks (36.4 GB, 10,000 RPM):
- 4 ranks of FB (fixed block) disks used for FCP
- 8 ranks of ECKD disks used for FICON measurements
Hardware Setup
[Chart: Iozone maximum throughput (MB/s, 0-300), sequential file system I/O, 31-bit - sequential write and sequential read for single disk, single rank, single HA, single CHPID, and two CHPIDs - 4C 4W 4R, 2105-F20]
- 1 HA limits throughput to 40 MB/s write and 65 MB/s read, regardless of the number of ranks
- 4 HAs limit throughput to 125 MB/s write and 240 MB/s read, but 4 CHPIDs are required to make use of them
- The difference between 31-bit and 64-bit is small
- The values are expected to increase further with more ranks, HAs, and CHPIDs
Results for sequential file system I/O
Locating disks
Use the 'ESS Specialist' tool to generate an HTML report of all disks (Storage Allocation -> Tabular View -> print table).
For an ECKD disk the report shows: logical control unit address (IOCDS: CUADD), unit address (IOCDS: UNITADD), the ECKD disk, and rank information.
For an FCP disk the report shows: rank information, the FCP disk, and a 16-bit LUN; 16-bit LUN * 0x1000000000000 = 64-bit FCP LUN.
FCP disks are not defined in the IOCDS!
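When scripting the attachment, the LUN conversion can be done with shell arithmetic; multiplying by 0x1000000000000 simply moves the 16-bit LUN into the top 16 bits of the 64-bit FCP LUN. The value 0x5401 below is a hypothetical example, not one of the measured disks.

```shell
#!/bin/bash
# Convert a 16-bit LUN from the ESS Specialist report (hypothetical
# example: 0x5401) into the 64-bit FCP LUN:
# 16-bit LUN * 0x1000000000000 = 64-bit FCP LUN
lun16=0x5401
printf '0x%016x\n' $(( lun16 * 0x1000000000000 ))
# prints 0x5401000000000000
```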
CHPIDs: C1, C2, C3, C4
Host adapters: HA1, HA2, HA3, HA4
Device numbers / LUNs:
rank1: 5000, 5001, 5002, 5003, ...
rank2: 5400, 5401, 5402, 5403, ...
rank3: 5800, 5801, 5802, 5803, ...
rank4: 5C00, 5C01, 5C02, 5C03, ...
For FCP use the paths:
C1 -> HA1 -> rank1
C2 -> HA2 -> rank2
C3 -> HA3 -> rank3
C4 -> HA4 -> rank4
FICON: define a path from each CHPID to each LCU
Sample Setup
LVM: add the disks to one volume group in the following order: 5000, 5400, 5800, 5C00, 5001, 5401, 5801, 5C01, ...
Use striping (32 KB or 64 KB)!
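A minimal command sketch of such a setup, assuming the LVM tools shipped with SLES7/SLES8. The device names /dev/sda1 .. /dev/sdp1 are hypothetical; what matters is that their order interleaves the ranks (5000, 5400, 5800, 5C00, 5001, ...) so consecutive stripes hit different host adapters and ranks.

```shell
#!/bin/bash
# Sketch only: device names are assumptions and depend on the FCP
# attachment order. List the 16 disks rank-interleaved.
PVS=""
for d in a b c d e f g h i j k l m n o p; do
    PVS="$PVS /dev/sd${d}1"
done
pvcreate $PVS                  # initialize the physical volumes
vgcreate striped_vg $PVS       # one volume group over all 16 disks
# 16 stripes (-i), 32 KB stripe size (-I); 64 KB also works well
lvcreate -i 16 -I 32 -L 60G -n striped_lv striped_vg
mke2fs /dev/striped_vg/striped_lv
```

The commands act on real block devices, so this is meant as a template rather than something to run as-is.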
For Linux VM guests:
guest1: 5000, 5400, 5800, 5C00, ...
guest2: 5001, 5401, 5801, 5C01, ...
guest3: 5003, 5403, 5803, 5C03, ...
guest4: 5004, 5404, 5804, 5C04, ...
Sample Setup
LVM: Logical Volume Manager
What is LVM?
LVM builds an abstraction layer that hides hardware details.
Data is stored on a virtual disk managed by LVM.
LVM is included in SuSE SLES7 and SLES8.
http://www.sistina.com/products_lvm.htm
Benefits of LVM:
File and file system sizes are independent of physical hardware sizes.
Allows managing multiple devices as one entity (scalability!).
Offers much higher performance when configured appropriately.
LVM: operating system view
[Diagram: a (journaled) file system or raw logical volume sits on logical volumes provided by the Logical Volume Manager; the LVM maps them onto a volume group of physical volumes, which the RAID adapter device driver maps to physical disks and arrays]
LVM: improving performance
[Diagram: a striped data stream distributed across multiple physical volumes]
With LVM and striping, parallelism is achieved.
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/tips0128.html?open
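As a toy illustration of why striping parallelizes I/O (the script is not from the talk; 32 KB stripe size and 16 physical volumes match the experiment's LVM setup), this computes which physical volume serves a given offset of the logical volume:

```shell
#!/bin/bash
# With a 32 KB stripe size over 16 PVs, stripe n lands on PV n mod 16,
# so a sequential stream rotates through all 16 disks.
stripe_kb=32
n_pvs=16
for off_kb in 0 32 64 480 512; do
    pv=$(( (off_kb / stripe_kb) % n_pvs ))
    echo "offset ${off_kb}KB -> PV $pv"
done
# offsets 0, 32, 64 KB map to PVs 0, 1, 2; 480 KB to PV 15;
# 512 KB wraps around to PV 0
```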
A simple experiment (1)
System setup: SuSE SLES8, VM guest, 1 CPU, 256 MB RAM
Single disk setup: 50 GB SCSI disk
LVM setup: 4 CHPIDs, 4 WWPNs, 4 ranks, 16 disks, 16 stripes, 32 KB stripe size, 62 GB total disk size
FS block size: 4096 bytes
Write block size: 1024 bytes (4096 bytes)
Write command: dd if=/dev/zero of=/mnt/dummy.file bs=1024 count=43352000
[Chart: data rates (MB/s, 0-150) for ext2, ext3, and reiser on a single disk and on LVM, SLES7 vs. SLES8, including the ext2 LVM bs=4096 case]
A simple experiment (2)
Datarates
dd if=/dev/zero of=/mnt/dummy.file bs=1024 count=43352000
[Chart: elapsed times (thousands of seconds, 0-6) for the same file system / LVM configurations]
A simple experiment (3)
Elapsed Times
dd if=/dev/zero of=/mnt/dummy.file bs=1024 count=43352000
[Chart: CPU times (thousands of seconds, 0-5) for the same file system / LVM configurations]
A simple experiment (4)
CPU Times
dd if=/dev/zero of=/mnt/dummy.file bs=1024 count=43352000
Visit us
Linux for zSeries performance website:
http://www10.software.ibm.com/developerworks/opensource/linux390/whatsnew.shtml
Linux-VM performance website:
http://www.vm.ibm.com/perf/tips/linuxper.html
Questions
Thank you!