An Overview of High Performance Computing and Benchmark Changes for the Future

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory

12/6/16
State of Supercomputing in 2016

- Pflops (> 10^15 Flop/s) computing fully established, with 117 systems.
- Three technology architecture possibilities or "swim lanes" are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs) (88 systems)
  - Lightweight cores (e.g. IBM BG, ARM, Knights Landing)
- Interest in supercomputing is now worldwide, and growing in many new markets (~50% of Top500 computers are in industry).
- Exascale (10^18 Flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD at 1%.
The Top500 list (H. Meuer, H. Simon, E. Strohmaier, & JD):

- Listing of the 500 most powerful computers in the world.
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem).
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June.
- All data available from www.top500.org

[Top500 graphic: performance (Rate) vs. problem Size; Rmax is the TPP performance.]
Performance Development of HPC over the Last 24 Years from the Top500

[Chart, log scale from 100 Mflop/s to 1 Eflop/s, 1994-2016; series SUM, N=1, N=500:
- 1994: N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s, SUM = 1.17 TFlop/s
- 2016: N=1 = 93 PFlop/s, N=500 = 349 TFlop/s, SUM = 672 PFlop/s
- A given performance level takes roughly 6-8 years to move from N=1 to N=500.
- For scale: my laptop is ~70 Gflop/s; my iPhone & iPad are ~4 Gflop/s.]
PERFORMANCE DEVELOPMENT

[Chart, log scale from 1 Gflop/s to 1 Eflop/s, 1994-2020 (projected); series SUM, N=1, N=10, N=100. Milestones marked: Tflops achieved (ASCI Red, Sandia NL), Pflops achieved (RoadRunner, Los Alamos NL), Eflops achieved: ?]
November 2016: The TOP 10 Systems

| Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | GFlops/Watt |
|---|---|---|---|---|---|---|---|---|
| 1 | National Supercomputer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04 |
| 2 | National Supercomputer Center in Guangzhou | Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91 |
| 3 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14 |
| 4 | DOE / NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18 |
| 5 | DOE / OS, Lawrence Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA | 622,336 | 14.0 | 50 | 3.94 | 3.55 |
| 6 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan | 558,144 | 13.6 | 54 | 2.72 | 4.98 |
| 7 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827 |
| 8 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland | 206,720 | 9.78 | 61 | 1.31 | 7.45 |
| 9 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.59 | 85 | 3.95 | 2.07 |
| 10 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92 |
| 500 | Internet company | Inspur, Intel (8C) + Nvidia | China | 5,440 | 0.286 | 71 | | |

TaihuLight is 5.2x the performance of Titan.
TaihuLight is 1.1x the sum of all DOE systems.
Recent Developments

- US DOE planning to deploy O(100) Pflop/s systems for 2017-2018: $525M in hardware.
  - Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems.
  - Argonne Lab to receive an Intel based system.
  - After this: Exaflops.
- US Dept of Commerce is preventing some China groups from receiving Intel technology.
  - Citing concerns about nuclear research being done with the systems; February 2015.
  - On the blockade list:
    - National SC Center Guangzhou, site of Tianhe-2
    - National SC Center Tianjin, site of Tianhe-1A
    - National University for Defense Technology, developer
    - National SC Center Changsha, location of NUDT
Since the Dept of Commerce Action …

- Expanded focus on Chinese-made HW and SW: "anything but from the US."
- Three separate developments in HPC:
  - Wuxi: ShenWei, O(100) Pflops, all Chinese, June 2016.
  - NUDT: Tianhe-2A, O(100) Pflops, will be Chinese ARM + accelerator, 2017.
  - Sugon - CAS ICT: AMD? new processors.
- In the latest "5 Year Plan":
  - Government push to build out a domestic HPC ecosystem.
  - Exascale system will not use any US chips.
SW26010 Processor

- China's first homegrown many-core processor.
- Vendor: Shanghai High Performance IC Design Center.
- Supported by National Science and Technology Major Project (NMP): Core Electronic Devices, High-end Generic Chips, and Basic Software.
- 28 nm technology.
- 260 cores.
- 3 Tflop/s peak.
Sunway TaihuLight (http://bit.ly/sunway-2016)

- SW26010 processor: Chinese design, fab, and ISA; 1.45 GHz.
- Node = 260 cores (1 socket), organized as 4 core groups:
  - 64 CPEs per group: no cache, 64 KB scratchpad per CPE.
  - 1 MPE per group, with 32 KB L1 dcache & 256 KB L2 cache.
  - 32 GB memory total, 136.5 GB/s.
  - ~3 Tflop/s per node (22 flops/byte).
- Cabinet = 1024 nodes: 4 supernodes = 32 boards (4 cards/board, 2 nodes/card); ~3.14 Pflop/s.
- 40 cabinets in the system:
  - 40,960 nodes, 10,649,600 cores total.
  - 125 Pflop/s total peak.
  - 1.31 PB of primary memory (DDR3).
  - 93 Pflop/s for HPL, 74% of peak.
  - 0.371 Pflop/s for HPCG, 0.3% of peak.
  - 15.3 MW, water cooled; 6.07 Gflop/s per Watt.
- 1.8B RMB ~ $280M (building, hw, apps, sw, …).
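The 22 flops/byte machine balance quoted above follows directly from the node's peak rate and memory bandwidth. A quick check of the arithmetic, using the slide's figures (the Python here and throughout is illustrative, not from the talk):

```python
# Machine balance of one Sunway TaihuLight node, from the slide's figures.
node_peak_gflops = 3000.0  # ~3 Tflop/s peak per node (slide value)
mem_bw_gb_s = 136.5        # memory bandwidth per node in GB/s (slide value)

# Flops the node can issue per byte it can move from memory: the larger
# this ratio, the more data reuse an algorithm needs to stay compute-bound.
balance = node_peak_gflops / mem_bw_gb_s
print(f"machine balance: {balance:.0f} flops/byte")  # -> 22 flops/byte
```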
Gordon Bell Award

- Since 1987 the Gordon Bell Prize has been awarded at the SC conference to recognize outstanding achievement in high-performance computing.
- The purpose of the award is to track the progress of parallel computing, with emphasis on rewarding innovation in applying HPC to applications.
- Financial support of the $10,000 award is provided by Gordon Bell, a pioneer in high-performance and parallel computing.
- Authors mark their SC paper as a possible Gordon Bell Prize competitor.
- The Gordon Bell committee reviews the papers and selects 6 papers for the competition.
- Presentations are made at SC and a winner is chosen.
Gordon Bell Award Finalists at SC16

- "Modeling Dilute Solutions Using First-Principles Molecular Dynamics: Computing More than a Million Atoms with Over a Million Cores"
  - Lawrence Livermore National Laboratory (Calif.)
- "Towards Green Aviation with Python at Petascale"
  - Imperial College London (England)
- "Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes"
  - RIKEN (Japan), Chiba University (Japan), Kobe University (Japan) and Fujitsu Ltd. (Japan)
- "Extreme-Scale Phase Field Simulations of Coarsening Dynamics on the Sunway TaihuLight Supercomputer"
  - Chinese Academy of Sciences, the University of South Carolina, Columbia University (New York), the National Research Center of Parallel Computer Engineering and Technology (China) and the National Supercomputing Center in Wuxi (China)
- "A Highly Effective Global Surface Wave Numerical Simulation with Ultra-High Resolution"
  - First Institute of Oceanography (China), National Research Center of Parallel Computer Engineering and Technology (China) and Tsinghua University (China)
- "10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics"
  - Chinese Academy of Sciences, Tsinghua University (China), the National Research Center of Parallel Computer Engineering and Technology (China) and Beijing Normal University (China)
VENDORS / SYSTEM SHARE (# of systems, % of 500)

| Vendor | Systems | Share |
|---|---|---|
| HPE | 112 | 22% |
| Lenovo | 96 | 19% |
| Cray Inc. | 56 | 11% |
| Sugon | 47 | 9% |
| IBM | 36 | 7% |
| SGI | 28 | 6% |
| Bull, Atos | 20 | 4% |
| Inspur | 18 | 4% |
| Huawei | 16 | 3% |
| Dell | 13 | 3% |
| Fujitsu | 11 | 2% |
| NUDT | 4 | 1% |
| Others | 43 | 9% |

Vendors from China account for 36% of the systems.
Countries Share

China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.

| Country | Systems on Top500 |
|---|---|
| US | 171 |
| China | 171 |
| Germany | 32 |
| Japan | 27 |
| France | 20 |
| UK | 17 |
| Poland | 7 |
| Italy | 6 |
| Saudi Arabia | 5 |
| Russia | 5 |
| India | 5 |
| Others | 34 |
Performance of Countries

[Four build-up charts: total Top500 performance [Tflop/s] per region, 2000-2016, log scale from 1 to 100,000; series added one slide at a time: US; then EU; then Japan; then China.]
Confessions of an Accidental Benchmarker

- Appendix B of the Linpack Users' Guide.
- Designed to help users extrapolate execution time for the Linpack software package.
- First benchmark report from 1977: Cray 1 to DEC PDP-10.
- Began in the late 70's, a time when floating point operations were expensive compared to other operations and data movement.
Many Other Benchmarks

TOP500, Green 500, Graph 500, Sustained Petascale Performance, HPC Challenge, Perfect, ParkBench, SPEC-hpc, Big Data Top100, Livermore Loops, EuroBen, NAS Parallel Benchmarks, Genesis, RAPS, SHOC, LAMMPS, Dhrystone, Whetstone, I/O benchmarks, WRF, Yellowstone, Roofline, Neptune.

(HPCG: http://tiny.cc/hpcg)
High Performance Linpack (HPL)

- A widely recognized and discussed metric for ranking high performance computing systems.
- When HPL gained prominence as a performance metric in the early 1990s, there was a strong correlation between its predicted system rankings and the rankings that full-scale applications would realize.
- Computer system vendors pursued designs that would increase their HPL performance, which would in turn improve overall application performance.
- Today HPL remains valuable as a measure of historical trends, and as a stress test, especially for leadership-class systems that are pushing the boundaries of current technology.
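To make the metric concrete: HPL times the solution of one dense Ax = b system and converts the nominal LU operation count, 2/3·n³ + 2·n², into a rate. A single-node NumPy sketch of the same measurement (the real HPL is a tuned distributed-memory code; this toy version only illustrates what Rmax counts):

```python
import time
import numpy as np

def hpl_like_gflops(n: int) -> float:
    """Time a dense solve of Ax = b and report Gflop/s using
    HPL's nominal operation count 2/3*n^3 + 2*n^2."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)  # LU factorization plus triangular solves
    elapsed = time.perf_counter() - t0
    # Sanity check, in the spirit of HPL's residual verification step.
    assert np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-8
    return (2.0 / 3.0 * n**3 + 2.0 * n**2) / elapsed / 1e9

print(f"{hpl_like_gflops(2000):.1f} Gflop/s")
```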
The Problem

- HPL performance is no longer strongly correlated with real application performance, especially for the broad set of HPC applications governed by partial differential equations.
- Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.
HPCG (hpcg-benchmark.org)

- High Performance Conjugate Gradients (HPCG).
- Solves Ax = b: A large and sparse, b known, x computed.
- An optimized implementation of PCG contains essential computation and communication patterns that are prevalent in a variety of methods for discretization and numerical solution of PDEs.
- Synthetic discretized 3D PDE (FEM, FVM, FDM).
- Sparse matrix:
  - 27 nonzeros per interior row; 8-18 on the boundary.
  - Symmetric positive definite.
- Patterns:
  - Dense and sparse computations.
  - Dense and sparse collectives.
  - Multi-scale execution of kernels via a (truncated) multigrid V cycle.
  - Data-driven parallelism (unstructured sparse triangular solves).
- Strong verification (via spectral properties of PCG).
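As a rough illustration of the pattern HPCG exercises, here is a minimal, unpreconditioned CG loop on a sparse symmetric positive definite system. A 2D 5-point Laplacian stands in for HPCG's 3D 27-point problem; the real benchmark adds a symmetric Gauss-Seidel smoother inside the truncated V cycle, plus halo exchanges across MPI ranks:

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=1000):
    """Plain conjugate gradients: the SpMV + dot-product + AXPY
    mix at the heart of HPCG (preconditioner omitted)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                     # sparse matrix-vector product
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * bnorm:
            break
        p = r + (rr_new / rr) * p      # new search direction
        rr = rr_new
    return x

# SPD test matrix: 2D 5-point Laplacian on a 64x64 grid.
n = 64
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
b = np.ones(A.shape[0])
x = cg(A, b)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```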
HPCG Results, Nov 2016, 1-10

| # | Site | Computer | Cores | Rmax [Pflops] | HPCG [Pflops] | HPCG/HPL | % of Peak |
|---|---|---|---|---|---|---|---|
| 1 | RIKEN Advanced Institute for Computational Science | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect | 705,024 | 10.5 | 0.603 | 5.7% | 5.3% |
| 2 | NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.8 | 0.580 | 1.7% | 1.1% |
| 3 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi | 557,056 | 13.6 | 0.385 | 2.8% | 1.5% |
| 4 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 | 10,649,600 | 93.0 | 0.371 | 0.4% | 0.3% |
| 5 | DOE/SC/LBNL/NERSC, USA | Cori, Cray XC40, Intel Xeon Phi | 632,400 | 13.8 | 0.355 | 2.6% | 1.3% |
| 6 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q | 1,572,864 | 17.1 | 0.330 | 1.9% | 1.6% |
| 7 | DOE/SC/Oak Ridge Nat Lab | Titan, Cray XK7, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17.5 | 0.322 | 1.8% | 1.2% |
| 8 | DOE/NNSA/LANL/SNL | Trinity, Cray XC40, Intel E5-2698v3, Aries custom | 301,056 | 8.10 | 0.182 | 2.3% | 1.6% |
| 9 | NASA / Mountain View | Pleiades, SGI ICE X, Intel E5-2680, E5-2680 v2, E5-2680 v3, Infiniband FDR | 243,008 | 5.9 | 0.175 | 2.9% | 2.5% |
| 10 | DOE/SC/Argonne National Laboratory | Mira, BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 786,432 | 8.58 | 0.167 | 1.9% | 1.7% |
Peak Performance - Per Core

Floating point operations per cycle per core:

- Most of the recent computers have FMA (fused multiply-add): x ← x + y*z in one cycle.
- Intel Xeon earlier models and AMD Opteron have SSE2: 2 flops/cycle DP & 4 flops/cycle SP.
- Intel Xeon Nehalem ('09) & Westmere ('10) have SSE4: 4 flops/cycle DP & 8 flops/cycle SP.
- Intel Xeon Sandy Bridge ('11) & Ivy Bridge ('12) have AVX: 8 flops/cycle DP & 16 flops/cycle SP.
- Intel Xeon Haswell ('13) & Broadwell ('14) have AVX2: 16 flops/cycle DP & 32 flops/cycle SP. ← We are here (almost)
- Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP.
- Intel Xeon Skylake (server) and Knights Landing have AVX-512: 32 flops/cycle DP & 64 flops/cycle SP.
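These per-cycle numbers turn into peak rates in the obvious way: peak = cores × clock × flops/cycle/core. A minimal sketch, using the dual-socket 2.6 GHz Sandy Bridge machine that appears later in this talk (16 cores, 8 DP flops/cycle/core with AVX):

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every core retires its maximum SIMD/FMA flops
    every cycle -- an upper bound that real codes rarely sustain."""
    return cores * ghz * flops_per_cycle

# 16 Sandy Bridge cores at 2.6 GHz, AVX: 8 DP flops/cycle/core.
print(peak_gflops(16, 2.6, 8), "Gflop/s double precision")  # 332.8
```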
CPU Access Latencies in Clock Cycles

[Chart: access latency, in clock cycles, at each level of the memory hierarchy.]

In 167 cycles a core can do 2672 DP flops.
Classical Analysis of Algorithms May Not Be Valid

- Processors are over-provisioned for floating point arithmetic.
- Data movement is extremely expensive.
- Operation count is not a good indicator of the time to solve a problem.
- Algorithms that do more ops may actually take less time.
Level 1, 2 and 3 BLAS

68-core Intel Xeon Phi KNL, 1.3 GHz; theoretical double precision peak = 2662 Gflop/s. Compiled with icc, using Intel MKL 2017b1 20160506.

[Chart: Gflop/s vs. matrix size N (vector size N×N), N = 2k-20k:
- dgemm (BLAS Level 3, C = C + A*B): ~2100 Gflop/s
- dgemv (BLAS Level 2, y = y + A*x): ~80.3 Gflop/s
- daxpy (BLAS Level 1, y = α*x + y): ~35.1 Gflop/s
The slide marks a ~35x gap between the Level 3 kernel and the memory-bound Level 1/2 kernels.]
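The gap between the three kernels is explained by arithmetic intensity: flops per byte of data that must move. Only the Level 3 kernel reuses each operand enough to exceed the machine balance and run compute-bound. A back-of-the-envelope sketch (idealized, compulsory-traffic-only counts, assuming 8-byte doubles):

```python
# Arithmetic intensity (flops per byte moved) for the three kernels above,
# counting only compulsory traffic on the operands in double precision.
def intensity(flops: float, doubles_moved: float) -> float:
    return flops / (8.0 * doubles_moved)

n = 10_000
print("daxpy:", intensity(2 * n, 3 * n))            # read x, y; write y -> ~0.08
print("dgemv:", intensity(2 * n**2, n**2 + 3 * n))  # matrix traffic dominates -> ~0.25
print("dgemm:", intensity(2 * n**3, 4 * n**2))      # O(n) flops/byte with reuse
```

At ~0.25 flop/byte, even a few hundred GB/s of bandwidth caps dgemv at tens of Gflop/s, which is roughly what the chart shows; dgemm's O(n) intensity is what lets it approach peak.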
Singular Value Decomposition: 3 Generations of Software Compared

LAPACK (version 1991) uses Level 1, 2, & 3 BLAS; the first stage costs 8/3 n³ ops. QR here refers to the QR algorithm for computing the eigenvalues. Machine: dual-socket 8-core Intel Sandy Bridge, 2.6 GHz (8 flops per core per cycle).

[Chart: speedup over EISPACK vs. matrix size N×N (square, with vectors), N up to 20k; curves for LAPACK QR (BLAS in parallel, 16 cores), LAPACK QR (1 core, 1991), LINPACK QR (1979), and EISPACK QR (1975).]
Bottleneck in the Bidiagonalization

The standard bidiagonal reduction (xGEBRD) computes Q·A·Pᴴ in two steps: factor a panel, then update the trailing matrix (factor panel k, update, then factor panel k+1, …). Each panel step requires 2 GEMVs.

Characteristics:
- Total cost 8n³/3 (reduction to bidiagonal).
- Too many Level 2 BLAS operations: 4/3 n³ in GEMV and 4/3 n³ in GEMM.
- Performance limited to 2x the performance of GEMV.
- ⇒ A memory-bound algorithm.

[Spy plots of the panel-factor / trailing-update sweep (nz = 3035, 275, 225, 2500).]
Recent Work on 2-Stage Algorithm

Characteristics:
- Stage 1 (first stage, reduction to band):
  - Fully Level 3 BLAS.
  - Dataflow, asynchronous execution.
- Stage 2 (second stage, bulge chasing, band to bidiagonal):
  - Level "BLAS-1.5".
  - Asynchronous execution.
  - Cache-friendly kernel (reduced communication).

[Spy plots: dense matrix (nz = 3600) → band form (nz = 605) → bidiagonal (nz = 119).]
Recent work on developing the new 2-stage algorithm

[Spy plots: dense matrix (nz = 3600) → band form (nz = 605) → bidiagonal (nz = 119); first stage to band, second stage bulge chasing to bidiagonal.]

Operation counts for the two stages (with $n_b$ the band/tile width and $nt = n/n_b$ the number of tiles):

$$\text{flops} \approx \sum_{s=1}^{(n-n_b)/n_b}\Big[2n_b^3+(nt-s)\,3n_b^3+(nt-s)\,\tfrac{10}{3}n_b^3+(nt-s)^2\,5n_b^3\Big] \;+\; \sum_{s=1}^{(n-n_b)/n_b}\Big[2n_b^3+(nt-s-1)\,3n_b^3+(nt-s-1)\,\tfrac{10}{3}n_b^3+(nt-s)(nt-s-1)\,5n_b^3\Big] \;\approx\; \tfrac{10}{3}n^3+\tfrac{10n_b}{3}n^2+\text{lower-order terms} \;\approx\; \tfrac{10}{3}n^3 \quad\text{(gemm, first stage)} \tag{7}$$

$$\text{flops} = 6\,n_b\,n^2 \quad\text{(gemv, second stage)} \tag{8}$$

Maximum performance bound:

$$P_{\max}=\frac{\text{reference number of operations}}{\text{minimum time } t_{\min}}=\frac{\tfrac{8}{3}n^3}{t_{\min}\big(\tfrac{10}{3}n^3\ \text{flops in gemm}\big)+t_{\min}\big(6n_b n^2\ \text{flops in gemv}\big)}=\frac{8n^3 P_{\text{gemm}}P_{\text{gemv}}}{10n^3 P_{\text{gemv}}+18n_b n^2 P_{\text{gemm}}}\;\Longrightarrow\;\tfrac{8}{40}P_{\text{gemm}}\le P_{\max}\le\tfrac{8}{12}P_{\text{gemm}} \tag{9}$$

if $P_{\text{gemm}}$ is about $20\times P_{\text{gemv}}$ and $120\le n_b\le 240$.

Speedup over the one-stage reduction:

$$\text{speedup}=\frac{\text{time of one-stage}}{\text{time of two-stage}}=\frac{4n^3/(3P_{\text{gemv}})+4n^3/(3P_{\text{gemm}})}{10n^3/(3P_{\text{gemm}})+6n_b n^2/P_{\text{gemv}}}\;\Longrightarrow\;\tfrac{84}{70}\le\text{speedup}\le\tfrac{84}{15}\;\Longrightarrow\;1.2\le\text{speedup}\le 5.6 \tag{10}$$

if $P_{\text{gemm}}$ is about $20\times P_{\text{gemv}}$ and $120\le n_b\le 240$.

More flops than the original one-stage reduction, which did 8/3 n³: the two-stage algorithm does about 25% more.
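A small sketch of the speedup model in equation (10), useful for seeing how the trade-off moves with the gemm/gemv rate ratio and the band width n_b (the rates below are illustrative placeholders, not measurements from the talk):

```python
def two_stage_speedup(n: int, nb: int, p_gemv: float, p_gemm: float) -> float:
    """Eq. (10): time of one-stage reduction (4/3 n^3 in gemv + 4/3 n^3
    in gemm) over time of two-stage (10/3 n^3 in gemm + 6 nb n^2 in gemv)."""
    one_stage = (4 * n**3 / 3) / p_gemv + (4 * n**3 / 3) / p_gemm
    two_stage = (10 * n**3 / 3) / p_gemm + (6 * nb * n**2) / p_gemv
    return one_stage / two_stage

# Illustrative rates with gemm ~20x gemv, and nb in the 120-240 range.
p_gemv, p_gemm = 15.0, 300.0  # Gflop/s (assumed for illustration)
for n in (2_000, 10_000, 26_000):
    print(n, round(two_stage_speedup(n, 160, p_gemv, p_gemm), 2))
```

As n grows, the gemv term in the two-stage time fades and the predicted speedup rises from about 2x toward its large-n limit, the same qualitative shape as the measured curve on the next slide.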
Recent work on developing the new 2-stage algorithm: measured speedup

With $P_{\text{gemm}}$ about $22\times P_{\text{gemv}}$ and $120\le n_b\le 240$, the same model gives $1.8\le\text{speedup}\le 7$.

[Chart: speedup of the 2-stage reduction over MKL (DGEBRD) vs. matrix size, 2k-26k, on 16 Sandy Bridge cores at 2.6 GHz; the measured speedup rises to about 6.]

25% more flops and 1.8 - 6 times faster.
Critical Issues at Peta & Exascale for Algorithm and Software Design

- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods that attain lower bounds on communication.
- Mixed precision methods: 2x speed of ops and 2x speed for data movement (see the sketch after this list).
- Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
- Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
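One concrete form of the mixed-precision idea is classic iterative refinement: do the O(n³) factorization in the fast, low precision, then recover full double-precision accuracy with cheap O(n²) residual corrections. A minimal NumPy/SciPy sketch (illustrative only, not the MAGMA/PLASMA implementation):

```python
import numpy as np
import scipy.linalg as la

def mixed_precision_solve(A, b, sweeps=5):
    """Factor A in float32 (where the O(n^3) work runs roughly 2x faster),
    then iteratively refine the solution in float64."""
    lu, piv = la.lu_factor(A.astype(np.float32))      # O(n^3) work in fp32
    x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(sweeps):                           # O(n^2) work per sweep
        r = b - A @ x                                 # residual in fp64
        d = la.lu_solve((lu, piv), r.astype(np.float32))
        x += d.astype(np.float64)                     # correct the solution
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))
b = rng.standard_normal(1000)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # ~1e-13 for well-conditioned A
```

This works whenever A is not too ill-conditioned for the low precision to resolve; otherwise the refinement stalls.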
Collaborators and Support

- MAGMA team: http://icl.cs.utk.edu/magma
- PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners:
- University of Tennessee, Knoxville
- Lawrence Livermore National Laboratory, Livermore, CA
- University of California, Berkeley
- University of Colorado, Denver
- INRIA, France (StarPU team)
- KAUST, Saudi Arabia