POWER5, Federation and Linux: HPS
TRANSCRIPT
Charles Grassl, IBM
August, 2004
Agenda
High Performance Switch
Switch technology
High Performance Switch (HPS)
Also known as “Federation”
Follow-on to SP Switch2 – also known as “Colony”
                Colony                       Federation
Bandwidth:      500 Mbyte/s                  2 Gbyte/s
Latency:        15 microsec.                 5 microsec.
Configuration:  1 adaptor per logical node   1 adaptor, 2 links, per MCM; 16 Gbyte/s per 32-processor node
Switch Technology
Internal network
– In lieu of, e.g., Gigabit Ethernet
Multiple links per node
– Match number of links to number of processors
Generation              Processors
HPS Switch              POWER2
SP Switch               POWER2 → POWER3
SP Switch 2 (Colony)    POWER3 → POWER4
HPS (Federation)        POWER4 → POWER5
HPS Software
MPI-LAPI (PE V4.1)
– Uses LAPI as the reliable transport
– Library uses threads, not signals, for async activities
Existing applications are binary compatible
New performance characteristics
– Improved collective communication
– Eager
– Bulk Transfer
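These characteristics can be checked with an ordinary ping-pong test between two tasks. The sketch below is illustrative only and is not from the presentation; the message sizes and repetition counts are arbitrary choices that span the eager and bulk-transfer ranges.

```c
/* Minimal MPI ping-pong sketch: reports one-way latency and bandwidth
 * for a few message sizes spanning the eager and bulk-transfer ranges.
 * Run with at least two tasks; ranks 0 and 1 exchange messages. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, reps = 100;
    size_t sizes[] = { 1024, 65536, 131072, 1048576 };
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(sizes[3]);

    for (int s = 0; s < 4; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8zu bytes  %10.2f microsec.  %8.1f Mbyte/s\n",
                   sizes[s], t * 1.0e6, (double)sizes[s] / t / 1.0e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```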
HPS Packaging
4U, 24-inch drawer
– Fits in the bottom position of p655 or p690 frame
Switch-only frame (7040-W42 frame)
– Up to eight HPS switches
16 ports for server-to-switch connections
16 ports for switch-to-switch connections
Host attachment directly to server bus via 2-link or 4-link Switch Network Interface (SNI)
– Up to two links per pSeries 655
– Up to eight links per pSeries 690
HPS Scalability
Supports up to 16 p690 or p655 servers and 32 links at GA
Increased support for up to 64 servers
– 32 can be pSeries 690 servers
– 128 links planned for July, 2004
Higher scalability limits available by special order
HPS Performance: Single and Multiple Messages
[Chart: bandwidth (Mbyte/s, 0 to 2500) vs. message size (bytes, 0 to 1,000,000); series: Single, Multiple]
Underlying Message Procedures
Protocols:
– Eager
– Rendezvous
– MP_EAGER_LIMIT
  – Range: 0 - 65536
Mechanisms:
– Packet
– Bulk
– MP_BULK_MIN_MSG_SIZE
  – Range: any non-negative integer
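As a rough illustration of how these settings partition message sizes, the sketch below classifies a few sizes against MP_EAGER_LIMIT and MP_BULK_MIN_MSG_SIZE read from the environment. The fallback values and the exact boundary comparisons are assumptions made for illustration, and the Bulk mechanism additionally requires MP_USE_BULK_XFER=yes.

```c
/* Sketch: map a message size to the protocol (Eager vs. Rendezvous) and
 * mechanism (Packet vs. Bulk) listed above.  Fallback values used when a
 * variable is unset are illustrative assumptions, not documented defaults. */
#include <stdio.h>
#include <stdlib.h>

static long env_or(const char *name, long fallback)
{
    const char *v = getenv(name);
    return v ? atol(v) : fallback;
}

int main(void)
{
    long eager_limit = env_or("MP_EAGER_LIMIT", 65536);        /* range: 0 - 65536 */
    long bulk_min    = env_or("MP_BULK_MIN_MSG_SIZE", 128000); /* default start: 128 kbyte */
    long sizes[] = { 1024, 65536, 131072, 1048576 };

    for (int i = 0; i < 4; i++) {
        long n = sizes[i];
        printf("%8ld bytes: %-10s protocol, %-6s mechanism\n", n,
               n <= eager_limit ? "Eager" : "Rendezvous",
               n >= bulk_min    ? "Bulk"  : "Packet");
    }
    return 0;
}
```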
MPI Transfer Protocols
[Diagram: Small Messages use the Eager protocol (header and message sent together); Large Messages use the Rendezvous protocol (header sent first, message follows once the receiver is ready)]
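The protocol split has a practical consequence for application code: a small eager message can be deposited at the receiver before the matching receive is posted, while a large rendezvous message is not moved until the receiver is ready. The sketch below shows the common remedy of pre-posting the receive; this is general MPI behavior rather than anything HPS-specific, and the 1 Mbyte size is an arbitrary choice above typical eager limits.

```c
/* Sketch: pre-posting a receive so a rendezvous-protocol (large) message
 * can start moving as soon as the sender arrives.  Run with two tasks. */
#include <mpi.h>
#include <stdlib.h>

#define BIG (1 << 20)   /* 1 Mbyte: above a typical eager limit */

int main(int argc, char *argv[])
{
    int rank;
    char *buf = malloc(BIG);
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* Post the receive before local work so the rendezvous handshake
         * completes without waiting for the receiver to reach a blocking
         * MPI_Recv. */
        MPI_Irecv(buf, BIG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        /* ... local computation would go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 0) {
        MPI_Send(buf, BIG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```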
MPI Transfer Mechanisms
[Diagram: Small Messages are sent as Packets through the switch; Large Messages use the Bulk transfer mechanism through the switch]
HPS Performance Characteristics
Latest software, firmware
– Eager
– Bulk Transfer
– Latency
Original General Acceptance software, firmware
– Eager
– Bulk Transfer
MPI Bandwidth
[Chart: MPI_Send bandwidth (Mbyte/s, 0 to 2000) vs. message size (bytes, 0 to 800,000)]
MPI Bandwidth: Eager Crossover
[Chart: bandwidth (Mbyte/s, 0 to 2000) vs. message size (bytes, 0 to 100,000) around the eager crossover]
MPI Latency
[Chart: latency (microsec., 0 to 100) vs. message size (bytes, 0 to 100,000), with the Eager Limit marked]
MPI Latency
[Chart: latency (microsec., 0 to 25) vs. message size (bytes, 0 to 1,000)]
HPS Original (General Acceptance) Release
High bandwidth
– 1.4 Gbyte/s single message
Latency
– 10 microsec.
Bulk transfer
– Messages larger than 128000 bytes
MPI Bandwidth
[Chart: bandwidth (Mbyte/s, 0 to 1600) vs. message size (bytes, 0 to 10,000,000); series: HPS (old) Single, SP Switch2 Single]
Performance: Latency, Bulk Transfer Transition
[Chart: latency (microsec., 0 to 3500) vs. message size (bytes, 0 to 1,000,000); series: HPS, SP Switch2]
MPI Bandwidth: Bulk Transfer Transition
[Chart: bandwidth (Mbyte/s, 0 to 500) vs. message size (bytes, 0 to 200,000); series: HPS, SP Switch2]
MPI_Barrier
[Chart: MPI_Barrier time (microsec., 0 to 175) vs. tasks per node (0 to 8); series: 4-node Colony, 4-node Federation]
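Barrier times such as these are typically obtained by averaging many repetitions. The sketch below shows one common measurement approach; it is an assumption about methodology, not the benchmark used for the chart.

```c
/* Sketch: time MPI_Barrier by averaging many repetitions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, reps = 10000;
    double t0, t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* warm up and synchronize */
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("MPI_Barrier: %.1f microsec. average\n", t * 1.0e6);

    MPI_Finalize();
    return 0;
}
```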
MPI Bandwidth: Prototype
[Chart: bandwidth (Mbyte/s, 0 to 2000) vs. message size (bytes, 0 to 2,000,000); series: HPS Old, HPS New]
HPS Performance
High asymptotic peak bandwidth
– ~4x vs. Colony
Extra “kink” in performance curve
– Bulk Transfer
Small message performance improvement
– New microcode
– Memory bandwidth limits
Low latency
– 5 microsec.
[Chart: bandwidth vs. message size (bytes, 0 to 10,000,000); series: Federation Multiple, Federation Single, Colony Single]
Bandwidth Structure: Performance Aspects
Shared memory
Large pages
Bulk Transfer
Eager Limit
Single threaded
Large Pages: Single Node
[Chart: bandwidth (Mbyte/s, 0 to 2500) vs. message size (bytes, 0 to 10,000,000); series: Large Pages, Small Pages]
Large Pages: Inter-node
[Chart: bandwidth (Mbyte/s, 0 to 1200) vs. message size (bytes, 0 to 1,200,000); series: Large Pages, Small Pages]
Large Pages
Controlled by application
– -blpdata
– LDR_CNTRL=LARGE_PAGE_DATA={M|N|Y}
Significant performance increase
– ~2x
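Because large pages are enabled at link time or through the loader environment, no source changes are required. The sketch below only reports the run-time setting; the xlc command line and buffer size in the comments are illustrative assumptions, with the flag and variable names taken from this slide.

```c
/* Sketch: a data-buffer workload whose bandwidth benefits from large pages.
 * Enabling large pages (per the slide) is done outside the source:
 *   link with:            xlc -o app app.c -blpdata
 *   or set at run time:   LDR_CNTRL=LARGE_PAGE_DATA=Y
 * The program only reports which loader setting is in effect. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *ldr = getenv("LDR_CNTRL");
    size_t n = (size_t)16 * 1024 * 1024;          /* 128 Mbyte of doubles */
    double *buf = malloc(n * sizeof(double));

    printf("LDR_CNTRL = %s\n", ldr ? ldr : "(unset)");
    if (buf != NULL)
        memset(buf, 0, n * sizeof(double));       /* touch the data pages */

    free(buf);
    return 0;
}
```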
[Chart: bandwidth vs. message size (bytes, 0 to 1,000,000)]
Bulk Transfer
High bandwidth mechanism
Latency is ~150 microseconds
– Default start is 128 kbyte
– MP_BULK_MIN_MSG_SIZE
Default transfer size: 1 Mbyte
– LAPI_DEBUG_BULK_XFER_SIZE
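The ~150 microsecond setup cost is why bulk transfer only starts at a minimum message size. A back-of-the-envelope break-even size can be computed as below; the two bandwidth figures are placeholders to be replaced with measured values, not numbers from the presentation.

```c
/* Sketch: estimate where Bulk transfer pays off, given its ~150 microsec.
 * setup cost (from the slide) and the asymptotic rates of the two paths:
 *     n > t_setup / (1/bw_packet - 1/bw_bulk)
 * The bandwidth values are placeholders, not measured HPS numbers. */
#include <stdio.h>

int main(void)
{
    double t_setup   = 150.0e-6;   /* ~150 microsec. bulk setup latency */
    double bw_packet = 0.9e9;      /* placeholder packet-mode rate, byte/s */
    double bw_bulk   = 1.8e9;      /* placeholder bulk-mode rate, byte/s */

    double breakeven = t_setup / (1.0 / bw_packet - 1.0 / bw_bulk);
    printf("Bulk transfer pays off above ~%.0f bytes "
           "(compare with MP_BULK_MIN_MSG_SIZE, default 128000)\n", breakeven);
    return 0;
}
```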
Bulk Transfer
[Chart: bandwidth (Mbyte/s, 0 to 1200) vs. message size (bytes, 0 to 1,200,000); series: No Bulk Trans., Bulk Trans.]
Bulk Transfer Size: LAPI_DEBUG_BULK_XFER_SIZE
[Chart: bandwidth (Mbyte/s, 0 to 1600) vs. message size (bytes, 0 to 100,000,000); series: 0.5M, 1M, 2M, 5M, 10M]
Bulk Transfer Size
Affects bandwidth for very large messages
Environment variable:
– LAPI_DEBUG_BULK_XFER_SIZE
– Default: 1 M
[Chart: bandwidth (0 to 1600) vs. message size (bytes, 0 to 10,000,000)]
MPI Environment Variables
Environment Variable            Recommended Value
MP_EUILIB                       us
MP_EUIDEVICE                    sn_single, sn_all
MP_SINGLE_THREAD                yes*
MP_BULK_MIN_MSG_SIZE            128000
LAPI_DEBUG_BULK_XFER_SIZE       1000000
LDR_CNTRL                       LARGE_PAGE_DATA=Y
MP_SHARED_MEMORY                yes
MP_USE_BULK_XFER                yes
MP_INSTANCES                    2
* If possible
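To make runs easy to compare against these recommendations, a job can log the relevant variables at startup. A minimal sketch follows; the variable names are taken from the table above, and the program only reports them, it does not set anything.

```c
/* Sketch: print the tuning environment at job start for later comparison. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *vars[] = {
        "MP_EUILIB", "MP_EUIDEVICE", "MP_SINGLE_THREAD",
        "MP_BULK_MIN_MSG_SIZE", "LAPI_DEBUG_BULK_XFER_SIZE",
        "LDR_CNTRL", "MP_SHARED_MEMORY", "MP_USE_BULK_XFER", "MP_INSTANCES"
    };
    int rank, nvars = (int)(sizeof vars / sizeof vars[0]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 0; i < nvars; i++)
            printf("%-28s = %s\n", vars[i],
                   getenv(vars[i]) ? getenv(vars[i]) : "(unset)");
    MPI_Finalize();
    return 0;
}
```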
Application Considerations
MPI-LAPI has a different architecture than the prior version
Bulk transfer:
– Larger messages (>128K) are used; use large pages
Set MP_SINGLE_THREAD=yes if possible
32-bit applications with -bmaxdata:0x80000000 will not use LAPI shared memory for large messages. Convert to 64-bit.
32-bit applications will not use MPI Collective Communication optimizations. Convert to 64-bit.
Applications that use signal handlers may require some changes
MPL is no longer supported
Summary
HPS
– Bandwidth
  – Bulk Transfers
  – 4x higher bandwidth (large messages)
– Latency
  – 5 microsec.