TRANSCRIPT
ON THE FUTURE OF HIGH
PERFORMANCE COMPUTING:
HOW TO THINK FOR PETA
AND EXASCALE COMPUTING
JACK DONGARRA
UNIVERSITY OF TENNESSEE
OAK RIDGE NATIONAL LAB
What Is LINPACK?
LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
LINPACK: "LINear algebra PACKage", written in Fortran 66.
The project had its origins in 1974
The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler who was at New Mexico at that time, and Pete Stewart from the University of Maryland. LINPACK as a software package has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory, vector supercomputers.
Computing in 1974
High Performance Computers: IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
Fortran 66
Trying to achieve software portability
Floating point operations were expensive
We didn't think much about the energy used
Run efficiently using the BLAS (Level 1) vector operations
Software released in 1979, about the time of the Cray 1
LINPACK Benchmark?
The Linpack Benchmark is a measure of a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations.
Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.
LINPACK Benchmark: dense linear system solved with LU factorization using partial pivoting.
Operation count: 2/3 n³ + O(n²).
Benchmark measure: MFlop/s.
The original benchmark measures the execution rate of a Fortran program on a matrix of size 100x100.
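To make the measurement concrete, here is a minimal sketch of the same idea in Python/NumPy (not the official benchmark code, which is Fortran/C): factor and solve a dense system, then divide the nominal 2/3 n³ + 2 n² operation count by the elapsed time.

```python
import time
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 2000                                   # the original benchmark fixed n = 100
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
lu, piv = lu_factor(A)                     # LU factorization with partial pivoting
x = lu_solve((lu, piv), b)                 # forward and back substitution
elapsed = time.perf_counter() - t0

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2      # nominal operation count for the solve
print(f"residual: {np.linalg.norm(A @ x - b):.2e}")
print(f"rate:     {flops / elapsed / 1e9:.2f} Gflop/s")
```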
Accidental Benchmarker
Appendix B of the Linpack Users' Guide was designed to help users extrapolate execution time for the Linpack software package.
First benchmark report from 1977: Cray 1 down to DEC PDP-10.
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem; rate as a function of problem size, TPP performance)
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
Top500 List of Supercomputers Over the Last 20 Years: Performance Development
[Chart, 1993-2012, log scale from 100 Mflop/s to 1 Eflop/s: the sum of all 500 systems (SUM) has grown from 1.17 Tflop/s to 123 Pflop/s; the #1 system (N=1) from 59.7 Gflop/s to 16.3 Pflop/s; the #500 system (N=500) from 400 Mflop/s to 60.8 Tflop/s. The #500 entry trails the #1 entry by roughly 6-8 years. For reference: my laptop (70 Gflop/s) and my iPad 2 & iPhone 4s (1.02 Gflop/s).]
June 2012: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE/NNSA Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895
2 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830
3 | DOE/OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069
4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823
5 | Nat. Supercomputer Center in Tianjin | Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
6 | DOE/OS Oak Ridge Nat Lab | Jaguar, Cray, AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377
7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.821 | 2099
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | 0.657 | 2099
9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull, Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604
10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493
500 | Energy Comp | IBM Cluster, Intel + IB | Italy | 4,096 | 0.061 | 93* | |
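The last three columns are related by simple arithmetic; as a sanity check, here is the calculation for the #1 entry (Sequoia) using only the numbers in the table:

```python
rmax = 16.3e15                      # flop/s, measured LINPACK rate (Rmax)
power = 8.6e6                       # watts
frac_of_peak = 0.81                 # the "% of Peak" column

print(rmax / power / 1e6)           # ~1895 Mflop/s per watt, matching the last column
print(rmax / frac_of_peak / 1e15)   # implied theoretical peak: ~20 Pflop/s
```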
Accelerators (58 systems)
[Chart, 2006-2012: number of Top500 systems using accelerators/co-processors, reaching 58 in 2012. By type: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), ATI GPU (2), IBM PowerXCell 8i (2), Intel MIC (1), Clearspeed CSX600 (0).]
By country: 27 US, 7 China, 4 Japan, 3 Russia, 2 France, 2 Germany, 2 India, 2 Italy, 2 Poland, 1 Australia, 1 Brazil, 1 Canada, 1 Singapore, 1 Spain, 1 Taiwan, 1 UK.
Countries Share
[Chart: country share of the 500 systems.] Absolute counts: US 252, China 68, Japan 35, UK 25, France 22, Germany 20.
28 Systems at > Pflop/s (Peak)
Top500 in France

Rank | Manufacturer | Computer | Installation Site | Procs | Rpeak [Gflop/s]
9 | Bull | Bullx B510, Xeon E5-2680 8C 2.70GHz, InfB | CEA/TGCC-GENCI | 77184 | 1667174
17 | Bull | Bull bullx super-node S6010/S6030 | CEA | 138368 | 1254550
29 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | EDF R&D | 65536 | 838861
30 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | IDRIS/GENCI | 65536 | 838861
45 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70GHz, InfB | Bull | 20480 | 442368
69 | HP | HP POD - Cluster Platform 3000, X5675 3.06 GHz, InfB | Airbus | 24192 | 296110
75 | SGI | SGI Altix ICE 8200EX, Xeon E5472 3.0/X5560 2.8 GHz | GENCI-CINES | 23040 | 267878
95 | HP | Cluster Platform 3000 BL2x220, L54xx 2.5 GHz, InfB | Government | 24704 | 247040
97 | Bull | Bullx B510, Xeon E5-2680 8C 2.70GHz, InfB | CEA | 9440 | 203904
104 | IBM | iDataPlex, Xeon X56xx 6C 2.93 GHz, InfB | EDF R&D | 16320 | 191270
115 | Bull | Bullx B505, Xeon E56xx (Westmere-EP) 2.40 GHz, InfB | CEA | 7020 | 274560
149 | IBM | Blue Gene/P Solution | IDRIS | 40960 | 139264
162 | Bull | Bullx B505, Xeon E5640 2.67 GHz, InfB | CEA/TGCC-GENCI | 5040 | 198162
168 | Bull | BULL Novascale R422-E2 | CEA | 11520 | 129998
171 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70GHz, InfB | Bull | 5760 | 124416
173 | SGI | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz | Total | 10240 | 122880
201 | IBM | Blue Gene/P Solution | EDF R&D | 32768 | 111411
240 | HP | Cluster Platform 3000, Xeon X5570 2.93 GHz, InfB | Manufacturing | 8576 | 100511
244 | Bull | Bull bullx super-node S6010/S6030 | Bull | 11520 | 104602
245 | Bull | Bullx S6010 Cluster, Xeon 2.26 GHz 8-core, InfB | CEA/TGCC-GENCI | 11520 | 104417
365 | IBM | xSeries x3550M3 Cluster, Xeon X5650 6C 2.66 GHz, GigE | Information | 13068 | 139044
477 | IBM | iDataPlex, Xeon E55xx QC 2.26 GHz, GigE | Financial | 13056 | 118026
Linpack Efficiency
[Chart: Linpack efficiency (Rmax as a fraction of Rpeak) for each of the 500 ranked systems, on a 0-100% scale.]
Performance Development in Top500
[Chart, 1994-2020, log scale from 100 Mflop/s to 1 Eflop/s: the N=1 and N=500 trend lines extended forward; the extrapolation reaches 1 Eflop/s toward the end of the decade.]
• Flop/s or the percentage of peak flop/s become much less relevant.
• Algorithms & software: minimize data movement; perform more work per unit of data movement.
The High Cost of Data Movement
Approximate energy costs (in picojoules); source: John Shalf, LBNL

Operation | 2011 | 2018
DP FMADD flop | 100 pJ | 10 pJ
DP DRAM read | 4800 pJ | 1920 pJ
Local interconnect | 7500 pJ | 2500 pJ
Cross system | 9000 pJ | 3500 pJ
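As a rough illustration of what these numbers mean for a memory-bound kernel, consider a dot product of two vectors of one million doubles streamed from DRAM, using the 2011 column (a back-of-the-envelope estimate that ignores caches and everything other than the FMAs and DRAM reads):

```python
n = 10**6
fma_pj = 100.0            # pJ per double-precision fused multiply-add (2011)
dram_pj = 4800.0          # pJ per double-precision DRAM read (2011)

flop_energy = n * fma_pj * 1e-12        # n multiply-adds
data_energy = 2 * n * dram_pj * 1e-12   # 2n operands read once from DRAM, no reuse

print(f"arithmetic:    {flop_energy * 1e3:.2f} mJ")
print(f"data movement: {data_energy * 1e3:.2f} mJ  (~{data_energy / flop_energy:.0f}x more)")
```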
Energy Cost Challenge
At ~$1M per MW, energy costs are substantial:
- 10 Pflop/s in 2011 uses ~10 MW
- 1 Eflop/s in 2018 would need > 100 MW
- DOE target: 1 Eflop/s in 2018 at 20 MW
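Reading the ~$1M-per-MW figure as an annual operating cost (my assumption), the arithmetic behind these bullets is simply:

```python
cost_per_mw_year = 1.0               # $M per MW per year (assumption stated above)

print(10 * cost_per_mw_year)         # 10 Pflop/s in 2011 at ~10 MW:  ~$10M per year
print(100 * cost_per_mw_year)        # 1 Eflop/s on that technology:  >$100M per year
print(20 * cost_per_mw_year)         # DOE target of 20 MW:           ~$20M per year
```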
Potential System Architecture with a cap of $200M and 20 MW

Systems | 2012 (BG/Q) | 2022 | Difference, Today & 2022
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) |
System memory | 1.6 PB (16*96*1024) | 32-64 PB | O(10)
Node performance | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s | O(10) - O(100)
Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000)
Node concurrency | 64 threads | O(1k) or 10k | O(100) - O(1000)
Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) - O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | -O(10)
Critical Issues at Peta & Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods that attain lower bounds on communication.
- Mixed precision methods: 2x speed for the operations and 2x speed for the data movement (a sketch of the idea follows this list).
- Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
- Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
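As an illustration of the mixed precision idea, here is a minimal NumPy/SciPy sketch (not PLASMA's implementation): the O(n³) factorization is done in single precision, which roughly doubles the arithmetic rate and halves the data moved, and double-precision accuracy is recovered with a few cheap O(n²) refinement steps.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=20):
    """Factor in float32, then iteratively refine the solution in float64."""
    lu, piv = lu_factor(A.astype(np.float32))             # O(n^3) work in single precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                      # residual in double precision, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x += lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)            # keep the matrix well conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))       # double-precision-level residual
```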
Major Changes to Algorithms/Software
- We must rethink the design of our algorithms and software.
- Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
- Rethink and rewrite the applications, algorithms, and software.
- Data movement is expensive; flops are cheap.
Dense Linear Algebra Software Evolution
- LINPACK (70's): vector operations, built on the Level 1 BLAS
- LAPACK (80's): block operations, built on the Level 3 BLAS (see the rate comparison below)
- ScaLAPACK (90's): block cyclic data distribution, built on the PBLAS and BLACS (message passing)
- PLASMA (00's): tile operations, tile layout, dataflow scheduling
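The driver for each step in this evolution was data reuse. A quick way to see why the move from Level 1 to Level 3 BLAS mattered so much is to time a vector update against a matrix multiply: the axpy does 2 flops per word of memory traffic, the gemm does O(n) flops per word, so the measured rates typically differ by an order of magnitude or more (exact numbers depend on the machine):

```python
import time
import numpy as np

# Level 1 BLAS style (axpy): 2n flops over ~3n words of traffic, memory bound.
n1 = 50_000_000
x = np.random.rand(n1)
y = np.random.rand(n1)
t = time.perf_counter(); y += 2.0 * x; t_axpy = time.perf_counter() - t

# Level 3 BLAS style (gemm): 2n^3 flops over ~3n^2 words, lots of reuse per word moved.
n3 = 2000
A = np.random.rand(n3, n3)
B = np.random.rand(n3, n3)
t = time.perf_counter(); C = A @ B; t_gemm = time.perf_counter() - t

print(f"axpy rate: {2 * n1 / t_axpy / 1e9:6.2f} Gflop/s")
print(f"gemm rate: {2 * n3**3 / t_gemm / 1e9:6.2f} Gflop/s")
```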
PLASMA Principles
- Tile algorithms: minimize capacity misses.
- Tile matrix layout: minimize conflict misses.
- Dynamic DAG scheduling: minimizes idle time; more overlap; asynchronous operations.
[Diagram: the memory hierarchy targeted by each library; LAPACK (one CPU with its cache and memory) vs. PLASMA (several cores, each with its own cache, sharing memory).]
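Here is a minimal sketch of what "tile layout" means in practice (the function name and the fixed square-tile assumption are mine, not PLASMA's API): the matrix is repacked so that each nb x nb tile is contiguous in memory, and a tile kernel then works on one compact block instead of nb widely strided columns.

```python
import numpy as np

def to_tile_layout(A, nb):
    """Repack a square matrix into contiguous nb x nb tiles."""
    n = A.shape[0]
    assert n % nb == 0, "sketch assumes the tile size divides the matrix size"
    nt = n // nb
    tiles = np.empty((nt, nt, nb, nb))
    for i in range(nt):
        for j in range(nt):
            # Each tiles[i, j] is a contiguous nb*nb block in memory.
            tiles[i, j] = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
    return tiles

A = np.arange(64.0).reshape(8, 8)
T = to_tile_layout(A, 4)
print(T.shape)          # (2, 2, 4, 4): a 2 x 2 grid of 4 x 4 tiles
```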
Fork-Join Parallelization of LU and QR
Parallelize the update:
- Easy, and done in any reasonable software.
- This is the 2/3 n³ term in the flop count.
- Can be done efficiently with LAPACK plus a multithreaded BLAS (dgemm).
[Trace: cores vs. time for the fork-join execution, dominated by the dgemm updates.]
PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures
Objectives:
- High utilization of each core
- Scaling to a large number of cores
- Synchronization-reducing algorithms
Methodology:
- Dynamic DAG scheduling (QUARK)
- Explicit parallelism
- Implicit communication
- Fine granularity / block data layout
[Trace: cores vs. time, contrasting fork-join parallelism with an arbitrary DAG under dynamic scheduling (DAG-scheduled parallelism); a task-generation sketch follows below.]
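To make "arbitrary DAG with dynamic scheduling" concrete, here is a sketch (using tiled Cholesky rather than LU/QR, and Python pseudotasks rather than QUARK's C API) of the task graph a tile algorithm hands to the runtime. Each task names the tiles it reads and writes, and the runtime fires a task as soon as its inputs are ready instead of waiting for a global fork-join step.

```python
def tiled_cholesky_tasks(nt):
    """List the tasks of a right-looking tiled Cholesky factorization on an
    nt x nt grid of tiles, with the tiles each task reads and writes."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", {"read": [], "write": [(k, k)]}))
        for i in range(k + 1, nt):
            tasks.append(("TRSM", {"read": [(k, k)], "write": [(i, k)]}))
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                kernel = "SYRK" if i == j else "GEMM"
                reads = [(i, k)] if i == j else [(i, k), (j, k)]
                tasks.append((kernel, {"read": reads, "write": [(i, j)]}))
    return tasks

for name, deps in tiled_cholesky_tasks(3):
    print(name, deps)
```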
Communication Avoiding QR Example
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.
[Diagram: the tall matrix is split into domains D0-D3; each domain is factored independently (Domain_Tile_QR), producing local factors R0-R3; the small R factors are then merged pairwise up a reduction tree (R0+R1 and R2+R3, then the final R).]
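A minimal NumPy sketch of the reduction pictured above (serial, and only recovering R; computing the Q factor and running this across distributed domains takes more bookkeeping): each domain does an independent local QR, and then only the small R factors are combined pairwise up a binary tree.

```python
import numpy as np

def tsqr_r(A, ndomains=4):
    """Communication-avoiding QR of a tall-skinny matrix: return R via a
    tree reduction of per-domain R factors."""
    blocks = np.array_split(A, ndomains, axis=0)
    Rs = [np.linalg.qr(blk, mode='r') for blk in blocks]      # local Domain_Tile_QR
    while len(Rs) > 1:                                         # pairwise reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

A = np.random.rand(10000, 64)
R_tree = tsqr_r(A)
R_ref = np.linalg.qr(A, mode='r')
# R is unique only up to the signs of its rows, so compare magnitudes.
print(np.allclose(np.abs(R_tree), np.abs(R_ref)))
```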
PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
Power for QR Factorization
Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m x n = 1,152,000 by 288).
[Power traces over time for four implementations:]
- PLASMA's communication-reducing QR factorization (DAG based)
- PLASMA's conventional QR factorization (DAG based)
- MKL's QR factorization (fork-join based)
- LAPACK's QR factorization (fork-join based)
The standard tridiagonal reduction xSYTRD
Step k: apply Q A Q*, then update, then step k+1.
LAPACK xSYTRD:
1. Apply the left-right transformations Q A Q* to the panel (A22; A32).
2. Update the remaining submatrix A33.
For the symmetric eigenvalue problem, the first stage takes:
- 90% of the time if only eigenvalues are needed
- 50% of the time if eigenvalues and eigenvectors are needed
The standard tridiagonal reduction xSYTRD: characteristics
1. Phase 1 requires:
- 4 panel-vector multiplications,
- 1 symmetric matrix-vector multiplication with A33,
- cost: 2(n-k)²b flops.
2. Phase 2 requires:
- a symmetric update of A33 using SYRK,
- cost: 2(n-k)²b flops.
Observations:
- Too many Level 2 BLAS operations
- Relies on panel factorization
- Total cost: 4n³/3 flops
- Bulk synchronous phases
- Memory-bound algorithm
Symmetric Eigenvalue Problem
- Standard reduction algorithms are very slow on multicore.
- Step 1: reduce the dense matrix to band form (matrix-matrix operations, high degree of parallelism).
- Step 2: bulge chasing on the band matrix (done by groups of columns, cache aware).
Block DAG-based reduction to banded form, then pipelined group chasing to tridiagonal form. The reduction to condensed form accounts for the factor of 50 improvement over LAPACK; execution rates are based on 4/3 n³ ops.
[Chart: rates for the symmetric eigenvalue problem (eigenvalues only) and singular values (singular values only); experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL V10.3.]
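Here is a small NumPy/SciPy sketch of the two-stage idea (my own simplified stage 1 built from blocked QR panels, with SciPy's banded eigensolver standing in for the bulge-chasing second stage; it is not the PLASMA kernel):

```python
import numpy as np
from scipy.linalg import eig_banded

def reduce_to_band(A, b):
    """Stage 1: reduce a dense symmetric matrix to symmetric band form
    (bandwidth b) with blocked Householder panels, i.e. matrix-matrix work."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n - b, b):
        panel = A[k + b:, k:k + b]                        # block column below the band
        Q, R = np.linalg.qr(panel, mode='complete')       # annihilate it with one QR
        A[k + b:, k:k + b] = R                            # upper trapezoidal: zeros below the band
        A[k:k + b, k + b:] = R.T                          # keep the matrix symmetric
        A[k + b:, k + b:] = Q.T @ A[k + b:, k + b:] @ Q   # two-sided trailing update
    return A

n, b = 512, 32
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = reduce_to_band(A, b)

# Stage 2 (here: a banded eigensolver instead of bulge chasing to tridiagonal form).
band = np.zeros((b + 1, n))
for d in range(b + 1):
    band[d, :n - d] = np.diagonal(B, -d)
w = eig_banded(band, lower=True, eigvals_only=True)
print(np.allclose(np.sort(w), np.sort(np.linalg.eigvalsh(A))))
```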
Summary
- These are old ideas (today: SMPSs, StarPU, Charm++, ParalleX, Swarm, ...).
- Major challenges are ahead for extreme computing: power, levels of parallelism, communication, hybrid architectures, fault tolerance, and many others not discussed here.
- This is not just a programming assignment.
- It opens up many new opportunities for applied mathematicians and computer scientists.
Collaborators / Software / Support
- PLASMA: http://icl.cs.utk.edu/plasma/
- MAGMA: http://icl.cs.utk.edu/magma/
- QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
- PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
Collaborating partners
University of Tennessee, Knoxville
University of California, Berkeley
University of Colorado, Denver
INRIA, France
KAUST, Saudi Arabia
These tools are being applied to a range of applications beyond dense linear algebra:
sparse direct solvers, sparse iterative methods, and fast multipole methods.