IBM Systems & Technology Group
© 2004 IBM Corporation
IBM HPC Development: MPI Update
Chulho Kim, SciComp 2007
Enhancements in PE 4.3.0 & 4.3.1
• MPI 1-sided improvements (October 2006)
• Selective collective enhancements (October 2006)
• AIX IB US enablement (July 2007)
MPI 1-sided improvements
MPI 1-sided test cases were provided by William Gropp and Rajeev Thakur of Argonne National Laboratory. The two test programs perform nearest-neighbor ghost-area data exchange:

fence2d.c - four MPI_Puts bracketed by MPI_Win_fence calls
lock2d.c - four MPI_Puts bracketed by MPI_Win_lock and MPI_Win_unlock calls

A single communication step in this set consists of a synchronization call to start an epoch, an MPI_Put of the data to each of the task's four neighbors (4 MPI_Puts in total), and a synchronization call to end the epoch.

Configuration: 64 tasks on 16 SQ IH nodes.

Improvements seen: more than 10%, and sometimes up to 90%, reduction in the time taken for the operation, especially for small messages (< 64 KB).
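Below is a minimal C sketch in the spirit of fence2d.c: one communication step of fence, four MPI_Puts, fence. The ghost-region size, the 1-D periodic neighbor map, and all variable names are illustrative assumptions; the original Argonne test uses a 2-D decomposition.

    #include <mpi.h>
    #include <stdlib.h>

    #define GHOST 1024   /* ghost-region size in doubles (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        int nbr[4];
        double *winbuf, *sendbuf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* The window holds four incoming ghost regions, one per neighbor. */
        MPI_Alloc_mem(4 * GHOST * sizeof(double), MPI_INFO_NULL, &winbuf);
        MPI_Win_create(winbuf, 4 * GHOST * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        sendbuf = malloc(4 * GHOST * sizeof(double));
        for (i = 0; i < 4 * GHOST; i++)
            sendbuf[i] = (double)rank;

        /* 1-D periodic neighbor map standing in for the 2-D grid. */
        nbr[0] = (rank + 1) % nprocs;
        nbr[1] = (rank - 1 + nprocs) % nprocs;
        nbr[2] = (rank + 2) % nprocs;
        nbr[3] = (rank - 2 + nprocs) % nprocs;

        /* One communication step: start the epoch, put to each of the
         * four neighbors, end the epoch. */
        MPI_Win_fence(0, win);
        for (i = 0; i < 4; i++)
            MPI_Put(&sendbuf[i * GHOST], GHOST, MPI_DOUBLE,
                    nbr[i], (MPI_Aint)(i * GHOST), GHOST, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Free_mem(winbuf);
        free(sendbuf);
        MPI_Finalize();
        return 0;
    }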
MPI 1-sided improvements (cont.)

[Chart: fence2d, 64 tasks on 16 SQ nodes. X-axis: message size (4 to 65536 bytes); Y-axis: time (0 to 9 seconds). Series: PE 4.3 (new code) vs. PE 4.2 (current code).]
[Chart: lock2d, 64 tasks on 16 SQ nodes. X-axis: message size (4 to 65536 bytes); Y-axis: time (0 to 3 seconds). Series: PE 4.3 (new code) vs. PE 4.2 (current code).]
New algorithms in PE 4.3.0
• A set of high-radix algorithms for:
  • MPI_Barrier,
  • small-message (<= 2 KB) MPI_Reduce,
  • small-message (<= 2 KB) MPI_Allreduce, and
  • small-message (<= 2 KB) MPI_Bcast.
• A new parallel pipelining algorithm for large-message (>= 256 KB) MPI_Bcast.
• These new algorithms are implemented in IBM Parallel Environment (PE) 4.3.
• Performance comparisons are made between IBM PE 4.3 and IBM PE 4.2.2 (old algorithms).
Common shared memory optimization for SMP clusters

[Diagram: 8 SMP nodes n0 through n7, each running 8 MPI tasks (t0 to t7 on n0, t32 to t39 on n4, and so on), with the tasks on each node connected through shared memory. Colors of the nodes represent the inter-node step after which the node has the broadcast message (root, step 0, step 1, step 2); arrows show message flow.]

• One task is selected as node leader per SMP node.
• Possible prolog step: the node leader gathers inputs from the tasks on the same SMP node and forms a partial result through shared memory.
• Inter-node communication steps take place among the node leaders through the interconnect.
• Possible epilog step: the node leader distributes results among the tasks on the same node through shared memory.
• The benefit from shared memory applies only to the intra-node part; the available communication resources may not be fully utilized.
• Example of MPI_Bcast with root=t0, using the common shared memory optimization: with 8 SMP nodes each running 8 MPI tasks, 3 inter-node steps are needed to complete the broadcast.
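The leader pattern above can be sketched with communicator splits, as below. MPI_Comm_split_type is an MPI-3 call, newer than the PE releases discussed here; it is used only as a portable way to group the tasks on one SMP node, and this sketch is not PE's internal implementation, which works through shared memory directly.

    #include <mpi.h>

    /* Broadcast via the node-leader pattern.  Assumes the broadcast root
     * is global rank 0; with key=0 in both splits, rank 0 becomes the
     * leader of its node and rank 0 of the leader communicator. */
    void leader_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group the tasks that share an SMP node. */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Node leaders (local rank 0) form their own communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0,
                       &leader_comm);

        /* Inter-node step(s): broadcast among the leaders over the
         * interconnect. */
        if (node_rank == 0)
            MPI_Bcast(buf, count, type, 0, leader_comm);

        /* Epilog: each leader distributes the result within its node
         * (shared memory in PE's case; a node-local broadcast here). */
        MPI_Bcast(buf, count, type, 0, node_comm);

        if (node_rank == 0)
            MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }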
High Radix Algorithm

[Diagram: 8 SMP nodes n0 through n7, each with 8 tasks connected through shared memory; several tasks per node take part in the inter-node communication.]

• k (k >= 1) tasks on each SMP node participate in the inter-node communication.
• Possible shared memory step after each inter-node step to synchronize and reshuffle data.
• Better short-message performance as long as the extra shared memory overhead and switch/adapter contention are low.
• 64-bit only; requires shared memory and the same number of tasks on each node.
• In this MPI_Bcast example, k = 7, so only one inter-node step is needed.
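One way to see the benefit, inferred from the two examples rather than stated on the slides: if k tasks per node join the inter-node phase, each inter-node step can multiply the number of informed nodes by k + 1, so a broadcast over N nodes needs roughly ceil(log base (k+1) of N) inter-node steps.

    /* Inter-node step count under the assumption above: each step lets
     * every informed node feed k further nodes.  Inferred from the two
     * slide examples, not an IBM-documented formula. */
    #include <stdio.h>

    static int internode_steps(int nodes, int k)
    {
        int steps = 0;
        long informed = 1;          /* only the root's node has the data */
        while (informed < nodes) {
            informed *= (k + 1);    /* every informed node feeds k others */
            steps++;
        }
        return steps;
    }

    int main(void)
    {
        printf("8 nodes, k=1 (node leader): %d steps\n",
               internode_steps(8, 1));   /* 3, as on the previous slide */
        printf("8 nodes, k=7 (high radix): %d steps\n",
               internode_steps(8, 7));   /* 1, as in this example */
        return 0;
    }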
Parallel pipeline algorithm: MPI_Bcast
• Split large data into small slices and pipeline the slices along multiple broadcast trees in parallel, in a non-conflicting fashion.
• Optimal on any number of tasks (power of two or not).
• Communication scheduling is done through simple bit operations on the rank of a task; the schedule does not need to be cached in the communicator.
• Once the pipeline is established, each task performs concurrent sends and receives of data slices.
• Paper accepted at EuroPVM/MPI 2007.
Parallel Pipelining Example

[Diagram: three broadcast trees over tasks t0 through t7, all rooted at t0, each ordering the remaining tasks differently so that the trees do not conflict; slices s0, s1, s2 flow down the trees.]

• Broadcast among 8 tasks: t0, t1, ..., t7; t0 is the root of the broadcast.
• The message is split into slices: s0, s1, s2, ...
• t0 round-robins the slices to the three pipelines.
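A single relay chain already shows the core idea of slice pipelining; the C sketch below is only that, not PE's multi-tree schedule (which runs three such trees concurrently, scheduled by bit operations on the rank). The slice size and names are illustrative.

    #include <mpi.h>

    #define SLICE 65536   /* slice size in bytes (illustrative) */

    /* Relay the buffer rank-to-rank in SLICE-sized pieces.  Once the
     * pipeline fills, every link carries a slice concurrently, so the
     * large-message cost approaches one message transfer time rather
     * than log(P) transfers. */
    void chain_pipeline_bcast(char *buf, int nbytes, MPI_Comm comm)
    {
        int rank, size, off;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (off = 0; off < nbytes; off += SLICE) {
            int n = (nbytes - off < SLICE) ? nbytes - off : SLICE;
            if (rank > 0)          /* receive this slice from upstream */
                MPI_Recv(buf + off, n, MPI_BYTE, rank - 1, 0, comm,
                         MPI_STATUS_IGNORE);
            if (rank + 1 < size)   /* forward it downstream */
                MPI_Send(buf + off, n, MPI_BYTE, rank + 1, 0, comm);
        }
    }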
Benchmark configuration

• Micro-benchmarks:
  • run on up to 32 P5 nodes with 8 tasks per node
  • AIX 5.3, IBM HPS (4 links), IBM RSCT/LAPI 2.4.3, and IBM LoadLeveler 3.4
• For each benchmark:
  • run using the old algorithm in PE 4.2.2 first,
  • then run with the new algorithm in PE 4.3.
Benchmark
• Loops of:

    MPI_Barrier();
    Start_time = MPI_Wtime();
    MPI_Barrier();   /* or other measured collectives */
    End_time = MPI_Wtime();
    MPI_Barrier();
Timing
• The global switch clock is used: MP_CLOCK_SOURCE=SWITCH.
• The time of the measured collective at any task for a particular loop is:
  • for MPI_Barrier and MPI_Allreduce: End_time of this task - Start_time of this task
  • for MPI_Bcast: End_time of this task - Start_time at the root task
  • for MPI_Reduce: End_time of the root task - Start_time of this task
• 1. For every loop, select the longest time reported by any task as the time of the loop. 2. Select the time of the fastest loop.
• This filters out the impact of OS jitter and highlights the capability of the algorithm.
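A sketch of this measurement for the MPI_Barrier case is shown below; the loop count and variable names are illustrative. MPI_Allreduce with MPI_MAX implements step 1 (longest time in the loop) and a running minimum implements step 2 (fastest loop). For MPI_Barrier the per-task time difference suffices, so the globally synchronized switch clock is not strictly needed in this particular case.

    #include <mpi.h>
    #include <float.h>
    #include <stdio.h>

    #define LOOPS 1000   /* loop count (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, i;
        double best = DBL_MAX;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < LOOPS; i++) {
            double start, t, worst;

            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
            MPI_Barrier(MPI_COMM_WORLD);   /* the measured collective */
            t = MPI_Wtime() - start;
            MPI_Barrier(MPI_COMM_WORLD);

            /* Step 1: longest time reported by any task in this loop. */
            MPI_Allreduce(&t, &worst, 1, MPI_DOUBLE, MPI_MAX,
                          MPI_COMM_WORLD);

            /* Step 2: keep the fastest loop, filtering out OS jitter. */
            if (worst < best)
                best = worst;
        }

        if (rank == 0)
            printf("MPI_Barrier: %g seconds\n", best);

        MPI_Finalize();
        return 0;
    }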