Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
IPDPS 2004, April 27, 2004
Overview
Introduction
Pure Message-passing Model
Hybrid Models
• Hyperplane Scheduling
• Fine-grain Model
• Coarse-grain Model
Experimental Results
Conclusions – Future Work
Motivation
Active research interest in:
• SMP clusters
• Hybrid programming models
However:
• Mostly fine-grain hybrid paradigms (masteronly model)
• Mostly DOALL multi-threaded parallelization
Contribution
Comparison of 3 programming models for the parallelization of tiled loop algorithms:
• pure message-passing
• fine-grain hybrid
• coarse-grain hybrid
Advanced hyperplane scheduling:
• minimizes synchronization needs
• overlaps computation with communication
• preserves data dependencies
Algorithmic Model
Tiled nested loops with constant flow data dependencies
FORACROSS tile_0 DO
  ...
  FORACROSS tile_{n-2} DO
    FOR tile_{n-1} DO
      Receive(tile);
      Compute(tile);
      Send(tile);
    END FOR
  END FORACROSS
  ...
END FORACROSS
Target Architecture
SMP clusters
Pure Message-passing Model
tile_0 = pr_0;
...
tile_{n-2} = pr_{n-2};
FOR tile_{n-1} = 0 TO x_{n-1}^max - x_{n-1}^min DO
  Pack(snd_buf, tile_{n-1} - 1, pr);
  MPI_Isend(snd_buf, dest(pr));
  MPI_Irecv(recv_buf, src(pr));
  Compute(tile);
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, pr);
END FOR
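A minimal C/MPI sketch of the same per-process pipeline; the neighbour ranks dest/src (MPI_PROC_NULL at the chain ends), the buffer handling and compute_tile are illustrative assumptions, not the original implementation:

#include <mpi.h>

/* Stand-in for Compute(tile): some work on the current tile's data. */
static void compute_tile(double *tile, int count, int step) {
    for (int i = 0; i < count; i++)
        tile[i] += step;
}

/* For each tile along the last dimension: send the previous tile's results,
   receive the data needed by the next tile, and overlap both transfers with
   the current tile's computation. */
void mpi_pipeline(double *tile, double *snd_buf, double *recv_buf,
                  int count, int dest, int src, int ntiles) {
    for (int t = 0; t < ntiles; t++) {
        MPI_Request req[2];
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);
        compute_tile(tile, count, t);              /* overlapped with the transfers */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* the received data would be unpacked for tile t+1 here, and the
           boundary of tile t packed into snd_buf for the next iteration */
    }
}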
Hyperplane Scheduling
Implements coarse-grain parallelism assuming inter-tile data dependencies
Tiles are organized into data-independent subsets (groups)
Tiles of the same group can be concurrently executed by multiple threads
Barrier synchronization between threads
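A concrete way to state the grouping, consistent with the assignment formula in the pseudocode that follows: with unit dependencies a tile can only depend on tiles whose coordinate sum is strictly smaller, so the coordinate sum itself can serve as the group (hyperplane) index. A small C sketch; the function name is an assumption of this write-up:

/* Hyperplane (group) index of an n-dimensional tile: tiles with equal
   coordinate sums cannot depend on each other under unit flow dependencies,
   so they form one data-independent group. */
static int group_of(const int *tile, int n) {
    int g = 0;
    for (int i = 0; i < n; i++)
        g += tile[i];        /* group = tile_0 + tile_1 + ... + tile_{n-1} */
    return g;
}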
April 27, 2004 IPDPS 2004 12
Hyperplane Scheduling

[Figure: example schedule on a 3D tile space (axes X, Y, Z) for 6 MPI processes x 6 OpenMP threads (process grid M_0 = 3, M_1 = 2; thread grid m_0 = 3, m_1 = 2). The highlighted tile has mpi_rank = (1,1), omp_tid = (1,1), tile = 3. Each tile is identified by (mpi_rank, omp_tid, tile) and assigned to a group; per-dimension tile coordinates follow tile_i = pr_i * m_i + th_i, as in the pseudocode on the next slide.]
April 27, 2004 IPDPS 2004 13
Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = pr_0;
  ...
  group_{n-2} = pr_{n-2};
  tile_0 = pr_0 * m_0 + th_0;
  ...
  tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2};
  FOR(group_{n-1}){
    tile_{n-1} = group_{n-1} - Σ_{i=0}^{n-2} tile_i;
    if(0 <= tile_{n-1} <= x_{n-1}^max - x_{n-1}^min)
      compute(tile);
    #pragma omp barrier
  }
}
Fine-grain Model
Incremental parallelization of computationally intensive parts
Pure MPI + hyperplane scheduling
Inter-node communication outside of multi-threaded part (MPI_THREAD_MASTERONLY)
Thread synchronization through implicit barrier of omp parallel directive
Fine-grain Model

FOR(group_{n-1}){
  Pack(snd_buf, tile_{n-1} - 1, pr);
  MPI_Isend(snd_buf, dest(pr));
  MPI_Irecv(recv_buf, src(pr));
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    if(valid(tile, thread_id, group_{n-1}))
      Compute(tile);
  }
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, pr);
}
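A minimal C sketch of this masteronly structure, assuming a 1-D neighbour layout and dummy valid_tile/compute_tile helpers; it illustrates the pattern rather than reproducing the original code:

#include <mpi.h>
#include <omp.h>

/* Dummy stand-ins for the tile ownership test and the tile computation. */
static int  valid_tile(int thread_id, int group)   { return thread_id <= group; }
static void compute_tile(int thread_id, int group) { (void)thread_id; (void)group; }

/* Fine-grain (masteronly) sweep: all MPI calls stay outside the parallel
   region, which is re-opened for every group of tiles. */
void fine_grain_sweep(double *snd_buf, double *recv_buf, int count,
                      int dest, int src, int groups) {
    for (int g = 0; g < groups; g++) {
        MPI_Request req[2];
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);

        #pragma omp parallel                 /* threads spawned/woken once per group */
        {
            int tid = omp_get_thread_num();
            if (valid_tile(tid, g))          /* does this thread own a tile of group g? */
                compute_tile(tid, g);
        }                                    /* implicit barrier closes the region */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* the received data would be unpacked for the next group here */
    }
}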
Coarse-grain Model
Threads are only initialized once
SPMD paradigm (requires more programming effort)
Inter-node communication inside multi-threaded part (requires MPI_THREAD_FUNNELED)
Thread synchronization through explicit barrier (omp barrier directive)
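In MPI-2 terms, the funneled thread-support level mentioned above is requested at initialization as sketched below; this is a generic illustration, not the initialization code of the original study:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Threads exist, but only the master thread makes MPI calls
       (from inside the parallel region). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);

    /* ... coarse-grain sweep as in the pseudocode on the next slide ... */

    MPI_Finalize();
    return 0;
}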
April 27, 2004 IPDPS 2004 19
Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num();
  FOR(group_{n-1}){
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, pr);
      MPI_Isend(snd_buf, dest(pr));
      MPI_Irecv(recv_buf, src(pr));
    }
    if(valid(tile, thread_id, group_{n-1}))
      Compute(tile);
    #pragma omp master
    {
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, pr);
    }
    #pragma omp barrier
  }
}
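A minimal C sketch of the same structure, assuming a 1-D neighbour layout and dummy valid_tile/compute_tile helpers; it mirrors the pseudocode above rather than reproducing the original implementation:

#include <mpi.h>
#include <omp.h>

/* Dummy stand-ins for the tile ownership test and the tile computation. */
static int  valid_tile(int thread_id, int group)   { return thread_id <= group; }
static void compute_tile(int thread_id, int group) { (void)thread_id; (void)group; }

/* Coarse-grain (funneled) sweep: one parallel region encloses the whole group
   loop, only the master thread issues MPI calls, and an explicit barrier
   separates consecutive groups. */
void coarse_grain_sweep(double *snd_buf, double *recv_buf, int count,
                        int dest, int src, int groups) {
    MPI_Request req[2];                      /* shared; used by the master thread only */
    #pragma omp parallel                     /* threads created once, outside the loop */
    {
        int tid = omp_get_thread_num();
        for (int g = 0; g < groups; g++) {
            #pragma omp master
            {
                MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
                MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);
            }                                /* no implied barrier: workers proceed */
            if (valid_tile(tid, g))
                compute_tile(tid, g);
            #pragma omp master
            {
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                /* unpack recv_buf for the next group here */
            }
            #pragma omp barrier              /* explicit barrier before the next group */
        }
    }
}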
April 27, 2004 IPDPS 2004 20
Experimental Results
8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared)
Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
FastEthernet interconnection
ADI micro-kernel benchmark (3D)
April 27, 2004 IPDPS 2004 22
Alternating Direction Implicit (ADI)
Stencil computation used for solving partial differential equations
Unitary data dependencies
3D iteration space (X x Y x Z)
[Figure: 3D iteration space (X, Y, Z) illustrating the sequential execution order, the processor mapping, and the data dependencies.]
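To make the dependence pattern concrete, an illustrative stand-in kernel with unit flow dependencies along all three dimensions; it is not the actual ADI update formula, and the array sizes are placeholders:

/* Every point depends on its immediate predecessors along X, Y and Z,
   i.e. unitary flow dependencies in all three dimensions. */
#define NX 64
#define NY 64
#define NZ 64

static double a[NX][NY][NZ];

void adi_like_sweep(void) {
    for (int i = 1; i < NX; i++)
        for (int j = 1; j < NY; j++)
            for (int k = 1; k < NZ; k++)
                a[i][j][k] = 0.25 * (a[i][j][k]
                                   + a[i-1][j][k]    /* dependence along X */
                                   + a[i][j-1][k]    /* dependence along Y */
                                   + a[i][j][k-1]);  /* dependence along Z */
}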
April 27, 2004 IPDPS 2004 23
ADI – 2 dual SMP nodes
[Figure: tile-to-CPU mapping of the 3D domain (X, Y, Z) for Pure MPI vs Hybrid on 2 dual SMP nodes, shown for X < Y and for X > Y. Legend: MPI processes, OpenMP threads, MPI communication, OpenMP synchronization; node 0 / CPU 0, node 0 / CPU 1, node 1 / CPU 0, node 1 / CPU 1.]
April 27, 2004 IPDPS 2004 24
ADI X=128 Y=512 Z=8192 – 2 nodes
April 27, 2004 IPDPS 2004 25
ADI X=256 Y=512 Z=8192 – 2 nodes
April 27, 2004 IPDPS 2004 26
ADI X=512 Y=512 Z=8192 – 2 nodes
April 27, 2004 IPDPS 2004 27
ADI X=512 Y=256 Z=8192 – 2 nodes
April 27, 2004 IPDPS 2004 28
ADI X=512 Y=128 Z=8192 – 2 nodes
April 27, 2004 IPDPS 2004 29
ADI X=128 Y=512 Z=8192 – 2 nodes
[Figure: computation and communication time breakdown]
April 27, 2004 IPDPS 2004 30
ADI X=512 Y=128 Z=8192 – 2 nodes
[Figure: computation and communication time breakdown]
April 27, 2004 IPDPS 2004 31
Conclusions
Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
Hybrid models can be competitive with the pure message-passing paradigm
The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
Programming efficiently in OpenMP is not easier than programming efficiently in MPI
April 27, 2004 IPDPS 2004 33
Future Work
Application of methodology to real applications and standard benchmarks
Work balancing for coarse-grain model
Investigation of alternative topologies, irregular communication patterns
Performance evaluation on advanced interconnection networks (SCI, Myrinet)
April 27, 2004 IPDPS 2004 34
Thank You!
Questions?