POWER5, Federation and Linux: HPS



© 2004 IBM

Charles Grassl, IBM

August, 2004

POWER5, Federation and Linux: HPS

2 © 2004 IBM Corporation

Agenda

High Performance Switch
Switch technology


3 © 2004 IBM Corporation

High Performance Switch (HPS)

Also known as "Federation"
Follow-on to SP Switch2
– SP Switch2 was also known as "Colony"

                Colony                       Federation
Bandwidth:      500 Mbyte/s                  2 Gbyte/s
Latency:        15 microsec.                 5 microsec.
Configuration:  1 adaptor per logical node   1 adaptor, 2 links, per MCM;
                                             16 Gbyte/s per 32-processor node

4 © 2004 IBM Corporation

Switch Technology

Internal network
– In lieu of, e.g., Gigabit Ethernet

Multiple links per node
– Match the number of links to the number of processors

Generation             Processors
HPS Switch             POWER2
SP Switch              POWER2 → POWER3
SP Switch 2 (Colony)   POWER3 → POWER4
HPS (Federation)       POWER4 → POWER5


5 © 2004 IBM Corporation

HPS Software

MPI-LAPI (PE V4.1)
– Uses LAPI as the reliable transport
– Library uses threads, not signals, for asynchronous activities

Existing applications are binary compatible
New performance characteristics:
– Improved collective communication
– Eager
– Bulk Transfer

6 © 2004 IBM Corporation

HPS Packaging

4U, 24-inch drawer
– Fits in the bottom position of a p655 or p690 frame

Switch-only frame (7040-W42)
– Up to eight HPS switches

16 ports for server-to-switch connections
16 ports for switch-to-switch connections
Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
– Up to two links per pSeries 655
– Up to eight links per pSeries 690


7 © 2004 IBM Corporation

HPS Scalability

Supports up to 16 p690 or p655 servers and 32 links at GA
Increased support for up to 64 servers
– 32 can be pSeries 690 servers
– 128 links planned for July, 2004

Higher scalability limits available by special order

8 © 2004 IBM Corporation

HPS Performance: Single and Multiple Messages

[Chart: MPI bandwidth (Mbyte/s) vs. message size (bytes), single vs. multiple messages.]
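The single-message curve above comes from a ping-pong style measurement. The sketch below is a minimal C/MPI version of such a benchmark, not the code used for these measurements; the message sizes, repetition count and output format are illustrative choices.

    /* Ping-pong bandwidth sketch: rank 0 sends a buffer to rank 1 and gets
     * it back; bandwidth is bytes moved one way divided by the one-way time.
     * Run with two tasks, placed on different nodes to exercise the switch. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, reps = 100;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int bytes = 1024; bytes <= 1000000; bytes *= 2) {
            char *buf = malloc(bytes);

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double oneway = (MPI_Wtime() - t0) / (2.0 * reps);

            if (rank == 0)
                printf("%8d bytes  %8.1f Mbyte/s\n",
                       bytes, bytes / oneway / 1.0e6);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }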


9 © 2004 IBM Corporation

Underlying Message Procedures

Protocols:
– Eager
– Rendezvous
– MP_EAGER_LIMIT (range: 0 - 65536)

Mechanisms:
– Packet
– Bulk
– MP_BULK_MIN_MSG_SIZE (range: any non-negative integer)

10 © 2004 IBM Corporation

MPI Transfer Protocols

[Diagram: send/receive flow of message headers and data for the two protocols.]

Small messages: Eager – the sender transmits the header and message data immediately.
Large messages: Rendezvous – the sender transmits the header first; the message data follows once the receiver is ready.
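The protocol switch shows up as a step in small-message latency (see the latency chart with the eager limit marked later in the deck). Below is a minimal C/MPI sketch that times half round trips for sizes straddling an assumed 65536-byte limit; the probe sizes and repetition count are illustrative, and the actual crossover is wherever MP_EAGER_LIMIT is set.

    /* Latency probe around the eager/rendezvous crossover: time half round
     * trips for sizes just below and above the eager limit.  Assumes the
     * limit is near 65536 bytes; adjust the probe sizes to MP_EAGER_LIMIT. */
    #include <mpi.h>
    #include <stdio.h>

    #define NSIZES 5

    int main(int argc, char **argv)
    {
        static const int sizes[NSIZES] = { 0, 1024, 32768, 65536, 131072 };
        static char buf[131072];          /* large enough for every probe */
        int rank, reps = 1000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int s = 0; s < NSIZES; s++) {
            int n = sizes[s];

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double half_rt = (MPI_Wtime() - t0) / (2.0 * reps) * 1.0e6;

            if (rank == 0)
                printf("%7d bytes  %7.1f microsec.\n", n, half_rt);
        }

        MPI_Finalize();
        return 0;
    }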


11 © 2004 IBM Corporation

MPI Transfer Mechanisms

[Diagram: data paths through the switch for the two mechanisms.]

Small messages: Packets – the message is sent through the switch as a series of packets.
Large messages: Bulk – the message is moved in a single large bulk transfer.

12 © 2004 IBM Corporation

HPS Performance Characteristics

Latest software, firmware:
– Eager
– Bulk Transfer
– Latency

Original General Acceptance software, firmware:
– Eager
– Bulk Transfer


13 © 2004 IBM Corporation

MPI Bandwidth

[Chart: MPI_Send bandwidth (Mbyte/s) vs. message size (bytes).]

14 © 2004 IBM Corporation

MPI Bandwidth: Eager Crossover

[Chart: MPI bandwidth (Mbyte/s) vs. message size (bytes) around the eager crossover.]


15 © 2004 IBM Corporation

MPI Latency

[Chart: MPI latency (microsec.) vs. message size (bytes), with the eager limit marked.]

16 © 2004 IBM Corporation

MPI Latency

[Chart: MPI latency (microsec.) vs. message size (0 to 1000 bytes).]


17 © 2004 IBM Corporation

HPS Original (General Acceptance) Release

High bandwidth
– 1.4 Gbyte/s single message

Latency
– 10 microsec.

Bulk transfer:
– Messages larger than 128000 bytes

18 © 2004 IBM Corporation

MPI Bandwidth

[Chart: MPI bandwidth (Mbyte/s) vs. message size (bytes): HPS (old), single message vs. SP Switch2, single message.]


19 © 2004 IBM Corporation

Performance: Latency, Bulk Transfer Transition

[Chart: latency (microsec.) vs. message size (bytes) across the bulk transfer transition: HPS vs. SP Switch2.]

20 © 2004 IBM Corporation

MPI Bandwidth: Bulk Transfer Transition

[Chart: MPI bandwidth (Mbyte/s) vs. message size (bytes) across the bulk transfer transition: HPS vs. SP Switch2.]


21 © 2004 IBM Corporation

MPI_Barrier

[Chart: MPI_Barrier time (microsec.) vs. tasks per node, 4-node Colony vs. 4-node Federation.]
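A barrier curve like the one above can be produced by timing a large number of MPI_Barrier calls at each tasks-per-node count. A minimal C/MPI sketch follows; the warm-up and repetition counts are arbitrary choices.

    /* MPI_Barrier timing sketch: average the cost of many barriers.
     * Run at several tasks-per-node counts to reproduce a curve like the
     * one above. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, reps = 10000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 100; i++)          /* warm up */
            MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double avg = (MPI_Wtime() - t0) / reps * 1.0e6;

        if (rank == 0)
            printf("average MPI_Barrier time: %.1f microsec.\n", avg);

        MPI_Finalize();
        return 0;
    }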

22 © 2004 IBM Corporation

MPI Bandwidth: Prototype

[Chart: MPI bandwidth (Mbyte/s) vs. message size (bytes), HPS old software vs. HPS new (prototype) software.]


23 © 2004 IBM Corporation

HPS Performance

High asymptotic peak bandwidth
– ~4x vs. Colony

Extra "kink" in performance curve
– Bulk Transfer

Small message performance improvement
– New microcode
– Memory bandwidth limits

Low latency
– 5 microsec.

[Chart: bandwidth (Mbyte/s) vs. message size (bytes): Federation multiple, Federation single, Colony single.]

24 © 2004 IBM Corporation

Bandwidth Structure: Performance Aspects

Shared memory
Large pages
Bulk Transfer
Eager Limit
Single threaded


25 © 2004 IBM Corporation

Large Pages: Single Node

[Chart: bandwidth (Mbyte/s) vs. message size (bytes) within a single node: large pages vs. small pages.]

26 © 2004 IBM Corporation

Large Pages: Inter-node

[Chart: bandwidth (Mbyte/s) vs. message size (bytes) between nodes: large pages vs. small pages.]


27 © 2004 IBM Corporation

Large Pages

Controlled by the application:
– -blpdata
– LDR_CNTRL=LARGE_PAGE_DATA={M|N|Y}

Significant performance increase:
– ~2x

[Chart: bandwidth (Mbyte/s) vs. message size (bytes).]

28 © 2004 IBM Corporation

Bulk Transfer

High bandwidth mechanism
Latency is ~150 microseconds
– Default start is 128 kbyte
– MP_BULK_MIN_MSG_SIZE

Default transfer size: 1 Mbyte
– LAPI_DEBUG_BULK_XFER_SIZE


29 © 2004 IBM Corporation

Bulk Transfer

[Chart: bandwidth (Mbyte/s) vs. message size (bytes): bulk transfer vs. no bulk transfer.]

30 © 2004 IBM Corporation

Bulk Transfer Size: LAPI_DEBUG_BULK_XFER_SIZE

[Chart: bandwidth (Mbyte/s) vs. message size (bytes, up to 100,000,000) for bulk transfer sizes of 0.5M, 1M, 2M, 5M and 10M.]


31 © 2004 IBM Corporation

Bulk Transfer Size

Affects bandwidth for very large messages
Environment variable:
– LAPI_DEBUG_BULK_XFER_SIZE
– Default: 1 M

[Chart: bandwidth (Mbyte/s) vs. message size (bytes).]

32 © 2004 IBM Corporation

MPI Environment Variables

Environment Variable           Recommended Value
MP_INSTANCES                   2
MP_USE_BULK_XFER               yes
MP_SHARED_MEMORY               yes
LDR_CNTRL                      LARGE_PAGE_DATA=Y
LAPI_DEBUG_BULK_XFER_SIZE      1000000
MP_BULK_MIN_MSG_SIZE           128000
MP_SINGLE_THREAD               yes*
MP_EUIDEVICE                   sn_single, sn_all
MP_EUILIB                      us

* If possible
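These settings are easy to get wrong in job scripts, so a small check at startup can be worthwhile. The C/MPI sketch below simply echoes the variables from the table as task 0 sees them; it is an illustrative helper, not part of the original material, and it does not validate the values.

    /* Echo the switch-related environment settings seen by the parallel job.
     * Only reports what task 0 sees; the variable list is taken from the
     * table above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        static const char *vars[] = {
            "MP_EUILIB", "MP_EUIDEVICE", "MP_SHARED_MEMORY",
            "MP_USE_BULK_XFER", "MP_BULK_MIN_MSG_SIZE",
            "LAPI_DEBUG_BULK_XFER_SIZE", "MP_SINGLE_THREAD",
            "MP_INSTANCES", "LDR_CNTRL"
        };
        int rank, nvars = (int)(sizeof vars / sizeof vars[0]);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < nvars; i++) {
                const char *val = getenv(vars[i]);
                printf("%-28s %s\n", vars[i], val ? val : "(not set)");
            }
        }

        MPI_Finalize();
        return 0;
    }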


33 © 2004 IBM Corporation

Application Considerations

MPI-LAPI has a different architecture than the prior version
Bulk transfer:
– Used for larger messages (>128 kbyte); use large pages

Set MP_SINGLE_THREAD=yes if possible
32-bit applications with -bmaxdata:0x80000000 will not use LAPI shared memory for large messages; convert to 64-bit
32-bit applications will not use MPI collective communication optimizations; convert to 64-bit
Applications that use signal handlers may require some changes
MPL is no longer supported

34 © 2004 IBM Corporation

Summary

HPS
– Bandwidth
  – Bulk Transfers
  – 4x higher bandwidth (large messages)
– Latency
  – 5 microsec.