Equipe Projet INRIA AlGorille
Equipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost
13/01/2011
1 ‐ First experiments on GPU clusters
First experiments on GPU clusters
Experimental testbed

• 2 x 16 CPU+GPU-node clusters:
  - Xeon dual-core + GT8800
  - Xeon dual-core + GT285
  - Nehalem quad-core + GT285
  - Nehalem quad-core + GT480
• A state-of-the-art 16-node CPU+GPU cluster and an older 16-node CPU+GPU cluster, with 2 Gigabit Ethernet switches: a heterogeneous 32-node cluster
• Regular upgrade of the system
• Some energy monitoring external devices: Raritan (Dominion PX)
First experiments on GPU clusters
Collection of experiments

3 benchmarks with different features:

1 - European option pricer:
    Embarrassingly parallel Monte-Carlo computations
    Parallel random number generators have been ported on GPU
2 - PDE solver:
    Strong computations
    Regular communications between nodes
    Some computations (must) remain on CPU
3 - 2D Jacobi relaxation:
    Repetitive light computations
    Frequent communications between neighbor nodes
First experiments on GPU clusters
Collection of experiments

[Figure: execution time (s) and energy (Wh) vs number of nodes (1 to 100, log-log scales) for the three benchmarks - Pricer (parallel MC), EDP Solver (synchronous) and Jacobi Relaxation - on a monocore-CPU cluster, a multicore-CPU cluster and a manycore-GPU cluster]
First experiments on GPU clusters
Computational & energy model design

Temporal gain (speedup) and energy gain of GPU cluster vs CPU cluster:

[Figure: speedup and energy gain of the GPU cluster vs the multicore CPU cluster, as a function of the number of nodes (1 to 100, log scales), for the Pricer (parallel MC: OK), the EDP Solver (synchronous: Hum...) and the Jacobi Relaxation (beyond? predictions?)]

Up to 16 nodes this GPU cluster is more interesting than our CPU cluster, but its interest decreases:
- why?
- beyond 16 nodes?
First experiments on GPU clusters
Computational & energy model design

                 CPU cluster                GPU cluster
Computations     T-calc-CPU                 If the algorithm is adapted to the GPU architecture:
                                            T-calc-GPU << T-calc-CPU
                                            else: do not use GPUs!
Communications   T-comm-CPU = T-comm-MPI    T-comm-GPU >= T-comm-CPU
                                            T-comm-GPU = T-transfert-GPUtoCPU + T-comm-MPI + T-transfert-CPUtoGPU
Total time       T-CPU                      T-GPU < ? > T-CPU ...

For a set problem size: when the number of nodes increases, T-comm becomes dominant and the GPU cluster's interest decreases.
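The total-time comparison in the table can be sketched as follows. This is a minimal illustration: the function names, the constant-communication assumption and all numeric values are assumptions, not measurements from the talk.

```python
# Minimal sketch of the total-time model in the table above.
# All numeric values and function names are hypothetical
# illustrations, not measurements from the talk.

def t_cpu(n_nodes, t_calc_cpu_1, t_comm_mpi):
    # Computation is distributed over the nodes; for a set problem
    # size the MPI communication cost is modeled as roughly constant.
    return t_calc_cpu_1 / n_nodes + t_comm_mpi

def t_gpu(n_nodes, t_calc_gpu_1, t_comm_mpi, t_transfer):
    # GPU nodes compute much faster (T-calc-GPU << T-calc-CPU), but
    # each communication also pays GPU<->CPU transfers, so
    # T-comm-GPU >= T-comm-CPU.
    return t_calc_gpu_1 / n_nodes + t_comm_mpi + t_transfer

# With few nodes the GPU cluster wins clearly; as nodes increase,
# the communication terms dominate and the advantage shrinks.
for n in (1, 4, 16, 64):
    print(n, t_cpu(n, 1000.0, 2.0), t_gpu(n, 50.0, 2.0, 1.0))
```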
2 ‐ A first performance model
A first performance model
First modelling approach

[Figure: execution time (s) and energy (Wh) of the EDP Solver (synchronous) vs number of nodes, with the « scalable area » highlighted, and the resulting GPU cluster vs multicore CPU cluster gain curve]

Observation of the first experimental performances:
• there exists a « scalable area »,
• performances of CPU and GPU clusters have different slopes.

Modelling of the « scalable area », assuming some experimental measurements of the real application are possible (simple modelling).
A first performance model
First modelling approach

[Figure: execution time T and energy E vs number of nodes N, with slopes -σT (for T) and σE (for E), for the CPU and GPU clusters]

We model the « scalable area »:
    T(N) = T(1) / N^σT
    E(N) = E(1) . N^σE

We observe:  σT-CPU > σT-GPU

We consider the electrical power dissipated by the nodes and the switch:
    E(N) = P(1).T(N).N + Pswitch.(N/Nmax).T(N)
with P: electrical power (Watts)
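The « scalable area » model above can be sketched directly. The parameter values (sigma_t, P(1), Pswitch, Nmax) below are illustrative assumptions only.

```python
# Sketch of the « scalable area » model above. The parameter values
# (sigma_t, p1, p_switch, n_max) are illustrative assumptions.

def exec_time(n, t1, sigma_t):
    # T(N) = T(1) / N^sigma_t   (sigma_t = 1 would be perfect scaling)
    return t1 / n ** sigma_t

def energy(n, t1, sigma_t, p1, p_switch, n_max):
    # E(N) = P(1).T(N).N + Pswitch.(N/Nmax).T(N)
    t = exec_time(n, t1, sigma_t)
    return p1 * t * n + p_switch * (n / n_max) * t

# As observed in the experiments, sigma_t-CPU > sigma_t-GPU: the CPU
# cluster scales better, even though each GPU node is much faster.
print(exec_time(16, 1000.0, 0.9))
print(energy(16, 1000.0, 0.9, 200.0, 100.0, 16))
```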
A first performance model
First modelling approach

We obtain (with σE = 1 - σT):
    SU-G/C(N) = SU-G/C(1) . N^-(σT-CPU - σT-GPU)
    EG-G/C(N) = EG-G/C(1) . N^-(σT-CPU - σT-GPU)

We compute the 2 threshold numbers of nodes:
    N-T-G/C such that SU-G/C(N-T-G/C) = 1:  N-T-G/C = SU-G/C(1)^(1/(σT-CPU - σT-GPU))
    N-E-G/C such that EG-G/C(N-E-G/C) = 1:  N-E-G/C = EG-G/C(1)^(1/(σT-CPU - σT-GPU))
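The gain and threshold formulas above translate into a few lines of code. The initial gain and sigma values below are illustrative assumptions, not fitted measurements.

```python
# Sketch of the gain and threshold formulas above. The initial gain
# and the sigma values are illustrative assumptions.

def gain(n, gain_1, sigma_t_cpu, sigma_t_gpu):
    # SU-G/C(N) = SU-G/C(1) . N^-(sigma_t_cpu - sigma_t_gpu)
    # (the energy gain EG-G/C follows the same form, since
    #  sigma_E = 1 - sigma_T on both clusters)
    return gain_1 * n ** -(sigma_t_cpu - sigma_t_gpu)

def threshold_nodes(gain_1, sigma_t_cpu, sigma_t_gpu):
    # Solve gain(N) = 1:  N = gain(1)^(1/(sigma_t_cpu - sigma_t_gpu))
    return gain_1 ** (1.0 / (sigma_t_cpu - sigma_t_gpu))

# E.g. an initial 8x GPU advantage eroded by poorer GPU scalability:
n_t = threshold_nodes(8.0, 0.95, 0.60)
print(n_t)                           # node count where the gain crosses 1
print(gain(n_t, 8.0, 0.95, 0.60))    # ~1.0 by construction
```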
A first performance model
First modelling approach

3 areas appear when increasing the number of nodes:
• N < min(N-T-G/C, N-E-G/C): GPU cluster more efficient (about T and E) -> choose GPU cluster
• min(N-T-G/C, N-E-G/C) <= N <= max(N-T-G/C, N-E-G/C): GPU cluster faster OR less energy consuming -> strategy and heuristic required to choose GPU or CPU cluster
• N > max(N-T-G/C, N-E-G/C): CPU cluster more efficient (about T and E) -> choose CPU cluster
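The 3-area decision rule can be sketched as below. The tie-breaking heuristic used in the middle area is an assumption (here, the energy-delay product introduced later in the talk); the slides only state that some heuristic is required there.

```python
# Sketch of the 3-area decision rule above. The EDP tie-breaker in
# the ambiguous area is a hypothetical heuristic, not the talk's.

def choose_cluster(n, n_t, n_e, edp_gpu=None, edp_cpu=None):
    lo, hi = min(n_t, n_e), max(n_t, n_e)
    if n < lo:
        return "GPU"          # GPU cluster faster AND less energy
    if n > hi:
        return "CPU"          # GPU cluster has lost both advantages
    # Ambiguous area: GPU is faster OR cheaper in energy, not both.
    if edp_gpu is not None and edp_cpu is not None:
        return "GPU" if edp_gpu < edp_cpu else "CPU"
    return "undecided"

print(choose_cluster(4, 16.0, 24.0))
print(choose_cluster(64, 16.0, 24.0))
print(choose_cluster(20, 16.0, 24.0))
```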
3 - Improving performances with asynchronous algorithms

Investigation with our PDE solver
Improving performances with asynchronous algorithms
Asynchronous parallel computing

Asynchronous algorithms provide implicit overlapping of communications and computations, and ... communications are important on GPU clusters.
They should improve executions on GPU clusters.

But:
• some iterative algorithms can be turned into asynchronous algorithms (not all),
• a strong mathematical theory supports this approach.

And:
• the convergence detection of the algorithm is more complex and requires more communications (than with a synchronous algorithm),
• some extra iterations are required to achieve the same accuracy.
Improving performances with asynchronous algorithms
Parallel iterative PDE solver
Improving performances with asynchronous algorithms
Inner linear solver
Improving performances with asynchronous algorithms
Asynchronous version

... and a much more complex parallel implementation!
Improving performances with asynchronous algorithms
Asynchronous version
Improving performances with asynchronous algorithms
Performances on a heterogeneous cluster

Execution time using both GPU clusters of Supelec (to minimize):
• 17 nodes Xeon dual-core + GT8800
• 16 nodes Nehalem quad-core + GT285
• 2 interconnected Gigabit switches
Rmk: the two clusters are managed by one OAR server

[Figure: execution time T-exec (s) vs number of fast nodes, for the GPUs & synchronous and GPUs & asynchronous versions]
Improving performances with asynchronous algorithms
Performances on a heterogeneous cluster

Speedup vs 1 GPU (to maximize):
• the asynchronous version achieves more regular speedup
• the asynchronous version achieves better speedup on a high number of nodes

[Figure: speedup vs sequential execution, as a function of the number of fast nodes, for GPU cluster & synchronous and GPU cluster & asynchronous, each compared to 1 GPU]
Improving performances with asynchronous algorithms
Performances on a heterogeneous cluster

Energy consumption (to minimize):
• measurement errors become important
• sync. and async. energy consumption curves are (just) « different »

[Figure: energy consumption (Wh) vs number of fast nodes, for GPU cluster & synchronous and GPU cluster & asynchronous]
Improving performances with asynchronous algorithms
Performances on a heterogeneous cluster

Energy overhead factor vs 1 GPU (to minimize):
• overhead curves are (just) « different »: no more globally attractive solution!

[Figure: energy overhead factor vs 1 GPU, as a function of the number of fast nodes, for GPU cluster & synchronous and GPU cluster & asynchronous]
4 ‐ Need for a new performance model and an auto‐adaptive solution
Need for a new performance model and an auto‐adaptive solution
Relative async. vs sync. performances

Relative async. vs sync. speedup and energy gain exhibit some similarities: they can be used to choose the version to run
• need a fine model (region frontiers are complex)
• need a heuristic when only one gain is greater than 1

[Figure: relative speedup and relative energy gain maps, each marking the regions where the asynchronous version is better]
Need for a new performance model and an auto‐adaptive solution
Relative async. vs sync. performances

Energy-Delay Product (EDP) (to minimize):
• to track a global optimum, considering both T and E
• parallel runs on many nodes seem better
• no large differences between sync. and async. versions

[Figure: EDP maps for GPU cluster & synchronous and GPU cluster & asynchronous]
Need for a new performance model and an auto‐adaptive solution
Relative async. vs sync. performances

Async. vs sync. relative Energy-Delay Product ratio:
• we compute: EDP-sync / EDP-async
• it can be used to make a choice inside ambiguous regions

[Figure: map of the EDP-sync / EDP-async ratio; in the ambiguous region (where relative-SU > 1 and relative-EG < 1), compute the EDP and choose the sync. or async. version accordingly; elsewhere, choose the async. or sync. version directly]
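The EDP-based choice above can be sketched in a few lines. The measured times and energies below are placeholders, not the talk's measurements.

```python
# Sketch of the EDP-based choice above. The time and energy values
# are hypothetical placeholders; EDP = energy x delay.

def edp(energy_wh, time_s):
    # Energy-Delay Product: lower is better for both goals at once.
    return energy_wh * time_s

def choose_version(t_sync, e_sync, t_async, e_async):
    # Compute EDP-sync / EDP-async; a ratio > 1 favors async.
    ratio = edp(e_sync, t_sync) / edp(e_async, t_async)
    return "async" if ratio > 1.0 else "sync"

# Hypothetical measurements: async is slightly slower here but saves
# enough energy to reach a better EDP.
print(choose_version(t_sync=100.0, e_sync=50.0, t_async=105.0, e_async=40.0))
```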
Need for a new performance model and an auto‐adaptive solution
Automatic choice criteria

Automatic selection of the « best » version to run:
• CPU cluster or GPU cluster
• synchronous algo. or asynchronous algo.

Criteria:
• relative speedup: tracking HPC performances
• relative energy gain: tracking low energy consumption
• relative energy-delay-product: tracking a compromise

But a model is needed to automatize this choice:
• a fine model: criteria variations are small and region frontiers are complex
• a model not requiring long and large experiments
Need for a new performance model and an auto‐adaptive solution
Fine model required

First model limitations:
• requires/assumes a « scalable area »
• approximative model (not fine)
• requires 4 executions of the entire application: on 1 and N0 nodes, running the 2 versions to compare, measuring both T and E
-> adapted to optimize the execution of a long-life application scaling on a parallel machine

To achieve an automatic selection on a short-life application, we need:
• a model requiring only small elementary benchmarks to fix the model parameter values on the hardware used,
• a fine model not requiring the application to exhibit a perfect scalability on the architecture.
Need for a new performance model and an auto‐adaptive solution
Fine model required

A first version of this fine model exists.

It takes into account:
• the different power dissipation of the different « identical » nodes of the cluster
• when starting computations on GPU, the power dissipation increases 2 times:
  - when the GPU starts to compute,
  - when the fan of the GPU starts, or/and when the GPU increases its frequency, ...
• when stopping computations on GPU, the power dissipation decreases several times, but not immediately!
• ...
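The stepped power behavior described above could be modeled as a piecewise profile like the following sketch. All timings and wattages are illustrative assumptions, not the fine model's actual parameters.

```python
# Sketch of the stepped power-dissipation behavior described above.
# The timings and wattages are illustrative assumptions only.

def gpu_node_power(t, t_start=0.0, t_stop=60.0):
    # Hypothetical power profile of one GPU node (Watts vs seconds):
    # idle -> first step when computing starts -> second step when
    # the fan/frequency rises -> delayed, stepped decay after stop.
    if t < t_start:
        return 150.0                  # idle
    if t < t_start + 5.0:
        return 230.0                  # GPU starts to compute
    if t < t_stop:
        return 260.0                  # fan starts / frequency rises
    if t < t_stop + 10.0:
        return 200.0                  # power falls in several steps,
    return 150.0                      # not immediately, back to idle

# Energy is then integrated over such profiles instead of assuming
# one constant per-node power P(1).
print([gpu_node_power(t) for t in range(0, 80, 5)])
```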
Need for a new performance model and an auto‐adaptive solution
Fine model required

[Figure: measured power dissipation (Watts) of a node vs time (s) during a GPU computation]
Need for a new performance model and an auto‐adaptive solution
Fine model required

First evaluation:
• on our PDE solver execution and on our heterogeneous GPU cluster
• error (model - observation): 6%
• after some bias corrections: 1%

Stronger evaluation is planned:
• on different hardware
• on different applications
• then: a heuristic of auto-adaptation/auto-selection of the right algorithm will be implemented
To be continued
5 - Conclusion and perspectives

Long term objectives

[Diagram: the user runs « a.out -O [speed, energy, edp, ...] »; a heuristic and a performance model - fed by elementary hardware benchmarks measuring energy and computations - drive the auto-adaptation of a.out, choosing among parallel algorithm 1 and parallel algorithm 2 and their kernel variants (Kernel 1-v1/v2/v3, Kernel 2-v1/v2); the run ends after a fast execution, a low energy consumption execution, or an edp-compromise execution]
Equipe Projet INRIA AlGorille
Equipe IMS

Computing and energy performance optimization of a multi-algorithms PDE solver on CPU and GPU clusters

Questions?