TRANSCRIPT
10/20/2017
Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory
University of Manchester
Outline
• Overview of High Performance Computing
• With Extreme Computing the “rules” for computing have changed
Since 1993
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP, Ax=b, dense problem
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
Performance Development of HPC over the Last 24.5 Years from the TOP500
[Plot: performance (Flop/s, log scale) vs. year, 1994-2017, for the SUM, N=1, and N=500 lines. In 1993: N=1 at 59.7 GFlop/s, N=500 at 400 MFlop/s, SUM at 1.17 TFlop/s. Today: N=1 at 93 PFlop/s, N=500 at 432 TFlop/s, SUM at 749 PFlop/s. The lag between N=1 and N=500 is roughly 6-8 years. My laptop: 166 Gflop/s.]
Performance Development (with projections)
[Plot: performance vs. year, 1994-2020, for SUM, N=1, N=10, and N=100. Milestones: Tflop/s (10^12) achieved by ASCI Red at Sandia NL; Pflop/s (10^15) achieved by Roadrunner at Los Alamos NL; Eflop/s (10^18) achieved? China says 2020, the U.S. says 2021. Today: SW26010 & Intel KNL ~ 3 Tflop/s; 1 cabinet of TaihuLight ~ 3 Pflop/s.]
State of Supercomputing in 2017
• Pflops (> 10^15 Flop/s) computing fully established with 138 computer systems.
• Three technology architectures, or "swim lanes", are thriving:
  • Commodity (e.g. Intel)
  • Commodity + accelerator (e.g. GPUs) (91 systems)
  • Lightweight cores (e.g. IBM BG, Intel's Knights Landing, ShenWei 26010, ARM)
• Interest in supercomputing is now worldwide, and growing in many new markets (~50% of Top500 computers are in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share, 92%, followed by AMD at 1%.
Countries Share
China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.
Each rectangle represents one of the Top500 computers, area of rectangle reflects its performance.
17 - Top500 Systems in France
Name | Computer | Site | Manufacturer | Total Cores | Acc. | Rmax (Gflop/s)
Pangea | SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR | Total Exploration Production | HPE | 220,800 | 0 | 5,283,110
occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | GENCI-CINES | Bull | 85,824 | 0 | 2,494,650
Prolix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 72,000 | 0 | 2,167,990
Beaufix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 73,440 | 0 | 2,157,410
Tera-1000-1 | bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR | Commissariat a l'Energie Atomique (CEA) | Bull | 70,272 | 0 | 1,871,000
Sid | bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR | Atos | Bull | 49,896 | 0 | 1,363,480
Curie thin nodes | Bullx B510, Xeon E5-2680 8C 2.700GHz, Infiniband QDR | CEA/TGCC-GENCI | Bull | 77,184 | 0 | 1,359,000
Cobalt | bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR | Commissariat a l'Energie Atomique (CEA)/CCRT | Bull | 38,528 | 0 | 1,299,470
Diego | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Atos | Bull | 46,800 | 0 | 1,225,280
Turing | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | CNRS/IDRIS-GENCI | IBM | 98,304 | 0 | 1,073,327
Tera-100 | Bull bullx super-node S6010/S6030 | Commissariat a l'Energie Atomique (CEA) | Bull | 138,368 | 0 | 1,050,000
Eole | Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | EDF | Bull | 29,568 | 0 | 942,829
Zumbrota | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | EDF R&D | IBM | 65,536 | 0 | 715,551
Occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | Centre Informatique National (CINES) | Bull | 19,488 | 0 | 685,985
Sator | NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | ONERA | NEC | 17,360 | 0 | 579,232
HPC4 | HP POD - Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR | Airbus | HPE | 34,560 | 0 | 516,897
PORTHOS | IBM NeXtScale nx360M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR | EDF R&D | IBM | 16,100 | 0 | 506,357
June 2017: The TOP 10 Systems
Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | GFlops/Watt
1 | National Supercomputer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04
2 | National Supercomputer Center in Guangzhou | Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91
3 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland | 361,760 | 19.6 | 77 | 2.27 | 8.6
4 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14
5 | DOE / NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18
6 | DOE / OS, Lawrence Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA | 622,336 | 14.0 | 50 | 3.94 | 3.55
7 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan | 558,144 | 13.6 | 54 | 2.72 | 4.98
8 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827
9 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.59 | 85 | 3.95 | 2.07
10 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92
500 | Internet company | Sugon, Intel (8C) + Nvidia | China | 110,000 | 0.432 | 71 | 1.2 | 0.36
TaihuLight is 60% of the sum of all EU systems; TaihuLight is 35% of the sum of all systems in the US.
Next Big Systems in the US DOE
• Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems
  • IBM Power9 + Nvidia Volta V100
  • Installation started, but not completed until late 2018-2019
  • 5-10 times Titan on applications
  • 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs; aggregate peak performance should be > 200 petaflops
  • ORNL maybe fully ready June 2018 (Top500 number)
• In 2021 Argonne Lab to receive an Intel based system
  • Exascale system, based on future Intel parts, not Knights Hill
  • Aurora 21
  • Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data intensive applications (science pipelines), and deep learning and emerging science AI
  • Integrated computing, acceleration, storage
  • Towards a common software stack
Toward Exascale: China plans for Exascale in 2020
Three separate developments in HPC; "Anything but from the US"
• Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese
• National University for Defense Technology: Tianhe-2A, O(100) Pflops, Chinese ARM processor + accelerator
• Sugon - CAS ICT: X86 based; collaboration with AMD
US DOE - Exascale Computing Program, a 7 Year Program
• Initial exascale system based on an advanced architecture and delivered in 2021
• Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023
Exascale:
• 50X the performance of today's 20 PF systems
• 20-30 MW power
• ≦ 1 perceived fault per week
• SW to support a broad range of applications
Exascale Computing Project, www.exascaleproject.org
US Department of Energy Exascale Computing Program has formulated a holistic approach that uses co-design and integration to achieve capable exascale:
• Application Development: science and mission applications
• Software Technology: a scalable and productive software stack
• Hardware Technology: hardware technology elements
• Exascale Systems: integrated exascale supercomputers
[Diagram of the software stack: applications co-design; correctness, visualization, and data analysis; programming models, development environment, and runtimes; tools; math libraries and frameworks; system software (resource management, threading, scheduling, monitoring, and control); memory and burst buffer; data management, I/O and file system; node OS and runtimes; resilience; workflows; hardware interface.]
ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce development
The Problem with the LINPACK Benchmark
• HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
• Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.
Many Problems in Computational Science Involve Solving PDEs; Large Sparse Linear Systems
Given a PDE, e.g. modeling diffusion or fluid flow,
$-\Delta u + \vec{c} \cdot \nabla u + \gamma u \equiv P u = f$
over some domain, plus boundary conditions (where $P$ denotes the differential operator).
Discretization (e.g., Galerkin equations): find $u_h = \sum_{i=1}^{n} \Phi_i x_i$ such that $(P u_h, \Phi_j) = (f, \Phi_j)$ for all $\Phi_j$, i.e. $\sum_{i=1}^{n} (P \Phi_i, \Phi_j)\, x_i = (f, \Phi_j)$, which is a sparse linear system $A x = b$.
The basis functions $\Phi_j$ often have local support, leading to local interactions and hence sparse matrices; e.g., row 10 of $A$ in this case has only 6 nonzeros: $a_{10,10}, a_{10,35}, a_{10,100}, a_{10,115}, a_{10,201}, a_{10,332}$.
HPCG
• High Performance Conjugate Gradients (HPCG).
• Solves Ax=b, with A large and sparse, b known, x computed.
• An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for discretization and numerical solution of PDEs (see the CG sketch below).
• Synthetic discretized 3D PDE (FEM, FVM, FDM).
• Sparse matrix: 27 nonzeros per row in the interior, 8-18 on the boundary; symmetric positive definite.
• Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves).
• Strong verification (via spectral properties of PCG).
hpcg-benchmark.org
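For readers who want to see the pattern concretely, even an unpreconditioned conjugate gradient loop already exhibits the kernel mix HPCG exercises: one sparse matrix-vector product, two dot products, and a few vector updates per iteration, all memory-bandwidth bound. A minimal sketch in C over a CSR matrix; this is illustrative only, not the HPCG reference implementation:

```c
#include <stdlib.h>
#include <math.h>

/* y = A*x for a CSR matrix (val, col, rowptr) with n rows */
static void spmv(int n, const int *rowptr, const int *col,
                 const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Unpreconditioned CG: solve A x = b to tolerance tol, at most maxit iterations. */
void cg(int n, const int *rowptr, const int *col, const double *val,
        const double *b, double *x, double tol, int maxit) {
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *Ap = malloc(n * sizeof *Ap);
    spmv(n, rowptr, col, val, x, Ap);                     /* r = b - A*x */
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        spmv(n, rowptr, col, val, p, Ap);                 /* sparse mat-vec, bandwidth bound */
        double alpha = rr / dot(n, p, Ap);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}
```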
HPL vs. HPCG: Bookends
• Some see HPL and HPCG as "bookends" of a spectrum.
• Application teams know where their codes lie on the spectrum.
• Can gauge performance on a system using both HPL and HPCG numbers.
hpcg-benchmark.org
HPCG Results, June 2017, Top 10
Rank | Site | Computer | Cores | Rmax (Pflops) | HPCG (Pflops) | HPCG/HPL | % of Peak
1 | RIKEN Advanced Institute for Computational Science, Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10.5 | 0.60 | 5.7% | 5.3%
2 | NSCC / Guangzhou, China | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.8 | 0.58 | 1.7% | 1.1%
3 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC | 10,649,600 | 93.0 | 0.48 | 0.5% | 0.4%
3 | Swiss National Supercomputing Centre (CSCS), Switzerland | Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray | 361,760 | 19.6 | 0.48 | 2.4% | 1.9%
5 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, Fujitsu | 557,056 | 24.9 | 0.39 | 2.8% | 1.5%
6 | DOE/SC/LBNL/NERSC, USA | Cori, Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, Cray | 632,400 | 13.8 | 0.36 | 2.6% | 1.3%
7 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM | 1,572,864 | 17.2 | 0.33 | 1.9% | 1.6%
8 | DOE/SC/Oak Ridge National Lab, USA | Titan, Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17.6 | 0.32 | 1.8% | 1.2%
9 | DOE/NNSA/LANL/SNL, USA | Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray | 301,056 | 8.10 | 0.18 | 2.3% | 1.6%
10 | NASA / Mountain View, USA | Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE | 243,008 | 6.0 | 0.17 | 2.9% | 2.5%
Bookends: Peak and HPL Only
[Plot: Rpeak and Rmax (Pflop/s, log scale) for a selection of TOP500-ranked systems.]
Bookends: Peak, HPL, and HPCG
[Plot: Rpeak, Rmax, and HPCG (Pflop/s, log scale) for the same selection of systems; HPCG falls well below Rmax for every system.]
Machine Learning in Computational Science
• Climate
• Biology
• Drug Design
• Epidemiology
• Materials
• Cosmology
• High-Energy Physics
Many fields are beginning to adopt machine learning to augment modeling and simulation methods
IEEE 754 Half Precision (16-bit) Floating Pt Standard
A lot of interest driven by "machine learning"
Mixed Precision
• Extended to sparse direct methods.
• Iterative methods with an inner (low precision) and outer (high precision) iteration (a sketch follows below).
• Today there are many precisions to deal with.
• Note the limited number range with half precision (16-bit floating point).
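As one concrete instance of the inner/outer idea, classic mixed-precision iterative refinement factors the matrix once in low precision and recovers full double-precision accuracy with a cheap outer loop. A minimal dense sketch in C, assuming LAPACKE is available; the single-precision factorization stands in for whatever lower precision the hardware supports (it could equally be half precision):

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve A x = b (n x n, row-major) with LU in single precision (inner solver)
 * and iterative refinement in double precision (outer loop). */
int mp_solve(int n, const double *A, const double *b, double *x, int sweeps) {
    float  *As = malloc((size_t)n * n * sizeof *As);
    float  *bs = malloc((size_t)n * sizeof *bs);
    double *r  = malloc((size_t)n * sizeof *r);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];       /* demote A once */
    lapack_int info = LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv);
    if (info != 0) return (int)info;

    for (int i = 0; i < n; i++) bs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, bs, 1);
    for (int i = 0; i < n; i++) x[i] = (double)bs[i];          /* initial solution */

    for (int s = 0; s < sweeps; s++) {
        /* residual r = b - A*x, accumulated in double precision */
        for (int i = 0; i < n; i++) {
            double t = b[i];
            for (int j = 0; j < n; j++) t -= A[i * n + j] * x[j];
            r[i] = t;
        }
        for (int i = 0; i < n; i++) bs[i] = (float)r[i];        /* demote residual  */
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, bs, 1);
        for (int i = 0; i < n; i++) x[i] += (double)bs[i];      /* correct in double */
    }
    free(As); free(bs); free(r); free(ipiv);
    return 0;
}
```

The design point is that the O(n^3) factorization runs at the (roughly 2x faster) low-precision rate, while the O(n^2) refinement work keeps the answer accurate.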
Deep Learning needs Batched Operations
Matrix multiply is the time-consuming part: both convolution layers and fully connected layers reduce to matrix multiply.
There are many GEMMs on small matrices; they are perfectly parallel and can get by with 16-bit floating point.
Convolution step: in this case a 3x3 GEMM.
[Diagram: inputs x1, x2, x3; weights w11..w23; outputs y1, y2; followed by a fully connected layer for classification.]
Standard for Batched Computations
• Define a standard API for batched BLAS and LAPACK in collaboration with Intel/Nvidia/ECP/other users.
• Fixed size: most of BLAS and LAPACK released.
• Variable size: most of BLAS released; variable-size LAPACK in the branch.
• Native GPU algorithms (Cholesky, LU, QR) in the branch.
• Tiled algorithms using batched routines on tile or LAPACK data layout in the branch.
• Framework for Deep Neural Network kernels: CPU, KNL and GPU routines; FP16 routines in progress.
Batched Computations
Non-batched computation: loop over the matrices one by one and compute each product with multithreading. Since the matrices are small, there is not enough work for all the cores, so we expect low performance; thread contention may also hurt performance.

for (i = 0; i < batchcount; i++)
    dgemm(...);

There is not enough work to fill all the cores; a low percentage of the resources is used.
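Spelled out, the non-batched baseline might look like the following. This is a sketch assuming a CBLAS interface (e.g. MKL or OpenBLAS); the pointer arrays Aarray, Barray, Carray and the uniform size n are hypothetical names for illustration:

```c
#include <cblas.h>

/* One small dgemm per matrix, issued sequentially: each call is too small
 * to occupy all cores, so most of the machine sits idle. */
void gemm_loop(int batchcount, int n,
               double **Aarray, double **Barray, double **Carray) {
    for (int i = 0; i < batchcount; i++) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, Aarray[i], n,
                         Barray[i], n,
                    0.0, Carray[i], n);
    }
}
```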
Batched Computations
Batched computation: distribute all the matrices over the available resources by assigning a matrix to each group of cores (CPU) or thread blocks (GPU) to operate on independently.
• For very small matrices, assign one matrix per core (CPU) or per thread block (GPU).
• For medium sizes, a matrix goes to a team of cores (CPU) or many thread blocks (GPU).
• For large sizes, switch to the classical multithreaded approach, one matrix per round.

Batched_dgemm(...)

The kernel design decides the number of thread blocks or threads (GPU/CPU), and a task manager/dispatcher distributes the work through the Nvidia/OpenMP scheduler. A high percentage of the resources is used (a batched call is sketched below).
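A batched interface hands the whole set of problems to the library in a single call so it can schedule the small GEMMs across all cores or thread blocks itself. A sketch assuming Intel MKL's cblas_dgemm_batch with one group of identically sized matrices (same hypothetical array names as above); on GPUs, MAGMA's batched GEMM plays the analogous role:

```c
#include <mkl.h>

/* All batchcount small GEMMs submitted at once; the library distributes
 * them over the available resources instead of running them one by one. */
void gemm_batched(int batchcount, int n,
                  const double **Aarray, const double **Barray, double **Carray) {
    CBLAS_TRANSPOSE trans = CblasNoTrans;
    MKL_INT m = n, nn = n, k = n, ld = n;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_count = 1;
    MKL_INT group_size = batchcount;   /* one group: all matrices share one size */

    cblas_dgemm_batch(CblasColMajor,
                      &trans, &trans,
                      &m, &nn, &k,
                      &alpha, Aarray, &ld,
                              Barray, &ld,
                      &beta,  Carray, &ld,
                      group_count, &group_size);
}
```

With variable sizes, the same call takes group_count > 1 and per-group size arrays, which is exactly what the batched-BLAS standardization effort above is defining.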
Batched Computations: How do we design and optimize?
[Plots: batched dgemm vs. standard (non-batched) dgemm, Gflop/s as a function of matrix size, for batches of 50~1000 matrices. Annotated speedups for the batched routine: roughly 100x and 30x for small sizes, 2~3x and 1.2x for medium sizes on the two platforms shown; for large sizes the code switches to the non-batched routine.]
[Plot: Nvidia V100 GPU, 50~1000 matrices of size up to ~4000; batch dgemm (BLAS 3) vs. standard dgemm (BLAS 3). The batched routine is ~19x faster for small sizes and ~1.4x for medium sizes; for large sizes it switches to the non-batched routine, reaching 97% of peak.]
Intel Knights Landing: FLAT mode
• KNL is in FLAT mode (the entire MCDRAM is used as addressable memory), so memory allocations are treated similarly to DDR4 memory allocations (an allocation sketch follows below).
• KNL Thermal Design Power (TDP) is 215 W (for the main SKUs).
• 68-core Intel Xeon Phi KNL, 1.3 GHz; the theoretical peak double precision is 2662 Gflop/s.
• Compiled with icc and using Intel MKL 2017b1 20160506.
[Plot: Level 1, 2 and 3 BLAS on the 68-core KNL, Peak DP = 2662 Gflop/s; chart annotations: 2000 Gflop/s (DGEMM), 85.3 Gflop/s (DGEMV), 35.1 Gflop/s (DAXPY), 35x.]
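In FLAT mode the 16 GB of MCDRAM shows up as a separate NUMA node, so placement is explicit: either bind the whole run with numactl (e.g. `numactl --membind=1 ./a.out`, assuming MCDRAM is node 1 on the system) or allocate selected buffers through the memkind library. A small sketch of the latter:

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory interface */

int main(void) {
    if (hbw_check_available() != 0) {        /* 0 means MCDRAM (HBM) is present */
        fprintf(stderr, "no high-bandwidth memory on this node\n");
        return 1;
    }
    size_t n = 1u << 26;                     /* 64M doubles = 512 MB */
    double *x = hbw_malloc(n * sizeof *x);   /* placed in MCDRAM, not DDR4 */
    if (!x) return 1;
    for (size_t i = 0; i < n; i++) x[i] = 1.0;
    printf("x[0] = %f\n", x[0]);
    hbw_free(x);
    return 0;
}
```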
PAPI for power-aware computing
• We use PAPI's latest powercap component for measurement, control, and performance analysis.
• PAPI power components in the past supported only reading power information.
• The new component exposes RAPL functionality to allow users to read and write power limits (see the sketch below).
• Study numerical building blocks of varying computational intensity.
• Use the PAPI powercap component to detect power optimization opportunities.
• Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant → energy savings!
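A minimal sketch of how the powercap component is driven from C. The PAPI calls themselves (PAPI_library_init, PAPI_add_named_event, PAPI_read, PAPI_write) are real; the event names and the microwatt scale below are assumptions for illustration and should be checked on the actual machine with papi_native_avail, since zone and constraint naming varies by system:

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);

    /* Hypothetical powercap event names: a package energy counter (read-only)
     * and a package power limit (read/write). Discover the real names with
     * `papi_native_avail | grep powercap` on the target machine.            */
    PAPI_add_named_event(evset, "powercap:::ENERGY_UJ:ZONE0");
    PAPI_add_named_event(evset, "powercap:::POWER_LIMIT_A_UW:ZONE0");

    PAPI_start(evset);
    PAPI_read(evset, values);            /* values[0] = energy, values[1] = current cap */
    printf("energy = %lld uJ, power cap = %lld uW\n", values[0], values[1]);

    values[1] = 180 * 1000000LL;         /* request a 180 W cap (assuming microwatts)    */
    PAPI_write(evset, values);           /* writable events take the new value           */

    /* ... run the kernel under the new cap, then PAPI_read again for the energy used ... */

    PAPI_stop(evset, values);
    return 0;
}
```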
Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
DGEMM is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W (a sketch of this measurement loop follows below).
[Plot: average power (Watts) vs. time (sec), showing accelerator (PACKAGE) power, memory (DDR4) power, the max power limit set, and the Thermal Design Power (TDP) line. Starting from idle, DGEMM on the 18K matrix at the default power limit delivers 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules.]
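The loop behind these measurements can be sketched as follows: pick the next cap, apply it, run the 18K x 18K DGEMM, and record time and energy. The set_power_cap_watts/read_energy_joules helpers are stubs standing in for the PAPI powercap calls shown above; this is an illustrative harness, not the authors' actual benchmark code:

```c
#include <stdio.h>
#include <time.h>
#include <cblas.h>

/* Stubs for illustration: in a real harness these wrap the PAPI powercap
 * calls shown earlier (PAPI_write for the cap, PAPI_read for the energy). */
static void set_power_cap_watts(double watts) { (void)watts; }
static double read_energy_joules(void) {
    static double fake = 0.0;
    return fake += 1.0;                 /* placeholder so the sketch links and runs */
}

void power_sweep_dgemm(int n, double *A, double *B, double *C) {
    /* default first, then 220 W down to 120 W in 10/20 W steps */
    const double caps[] = { 0 /* default */, 220, 200, 180, 160, 150, 140, 130, 120 };
    for (int s = 0; s < (int)(sizeof caps / sizeof caps[0]); s++) {
        if (caps[s] > 0) set_power_cap_watts(caps[s]);

        double e0 = read_energy_joules();
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs   = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double joules = read_energy_joules() - e0;

        double gflops = 2.0 * n * (double)n * n / secs / 1e9;
        printf("cap %6.0f W: %8.1f Gflop/s, %6.2f Gflop/s/W, %8.1f J\n",
               caps[s], gflops, gflops * secs / joules, joules);
    }
}
```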
Level 3 BLAS DGEMM on KNL using MCDRAM (continued)
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Same plot: at the default power limit the 18K DGEMM delivers 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules.] The power limit is changed at run time through PAPI:
long long values[2];
PAPI_read(EventSet, values);
values[0] = 26875000;   // set short-term power limit to 268.75 Watts
values[1] = 25800000;   // set long-term power limit to 258 Watts
PAPI_write(EventSet, values);
Level 3 BLAS DGEMM on KNL using MCDRAM (continued)
[Plot, next step: with the power limit set to 220 W, the same 18K DGEMM delivers 1904 Gflop/s, 8.92 Gflop/s per Watt, 1279 Joules, compared with 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules at the default limit.]
Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s. DGEMM is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.

Power limit | Gflop/s | Gflop/s per Watt | Joules
default     | 1997    | 8.82             | 1303
220 W       | 1904    | 8.92             | 1279
200 W       | 1741    | 8.95             | 1267
180 W       | 1589    | 9.04             | 1253
160 W       | 1328    | 8.47             | 1345
150 W       | 1137    | 7.73             | 1480
140 W       | 956     | 6.95             | 1661
130 W       | 773     | 6.03             | 1904
120 W       | 560     | 4.72             | 2443
Level 3 BLAS DGEMM on KNL using MCDRAM
[Same data as the table above, with an additional panel showing core frequency (MHz) over time.]
• When the power cap kicks in, the core frequency is decreased, which confirms that capping acts through DVFS.
• The optimum for Flops is at the default limit; the optimum for Flops/Watt is reached under capping (9.04 Gflop/s per Watt at the 180 W step).
Level 3 BLAS DGEMM on KNL, MCDRAM vs. DDR4
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s. Same experiment as above, run once with data in MCDRAM and once in DDR4; each panel also shows core frequency (MHz).

MCDRAM (as in the table above): default 1997 Gflop/s / 8.82 Gflop/s per W / 1303 J; then 1904 / 8.92 / 1279; 1741 / 8.95 / 1267; 1589 / 9.04 / 1253; 1328 / 8.47 / 1345; 1137 / 7.73 / 1480; 956 / 6.95 / 1661; 773 / 6.03 / 1904; 560 / 4.72 / 2443 at the successive power limits.
DDR4: default 1991 Gflop/s / 8.22 Gflop/s per W / 1383 J; then 1901 / 8.36 / 1341; 1785 / 8.58 / 1314; 1615 / 8.57 / 1325; 1340 / 8.02 / 1419; 1154 / 7.36 / 1562; 971 / 6.66 / 1715; 785 / 5.81 / 1982; 575 / 4.64 / 2489 at the successive power limits.
[Annotated optima: best Flops at the default limit, best Flops/Watt under a moderate cap, for both memory configurations.]

Lesson for DGEMM-type operations (compute intensive):
• Capping can reduce performance, but sometimes it can hold the energy efficiency (Gflop/s per Watt).
Level 2 BLAS DGEMV on KNL, MCDRAM vs. DDR4
DGEMV is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.
[Plots for MCDRAM and DDR4: average power (Watts) and core frequency (MHz) vs. time, annotated per step with performance (Gflop/s), achieved bandwidth (GB/s), Gflop/s per Watt, and Joules.]

Lesson for DGEMV-type operations (memory bound):
• For MCDRAM, capping at 190 Watts results in power reduction without any loss of performance (energy saving ~20%).
• For DDR4, capping at 120 Watts improves energy efficiency by ~17%.
• Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant increase in time to solution.
Level 1 BLAS DAXPY on KNL, MCDRAM vs. DDR4
DAXPY is run repeatedly on fixed vectors of length 1E8; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.
[Plots for MCDRAM and DDR4: average power (Watts) and core frequency (MHz) vs. time, annotated per step with performance (Gflop/s), achieved bandwidth (GB/s), Gflop/s per Watt, and Joules.]

Lesson for DAXPY-type operations (memory bound):
• For MCDRAM, capping at 180 Watts results in power reduction without any loss of performance (energy saving ~16%).
• For DDR4, capping at 130 Watts improves energy efficiency by ~25%.
• Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant increase in time to solution.
Intel Knights Landing
Level 1, 2 and 3 BLAS, 68-core Intel Xeon Phi KNL, 1.3 GHz, Peak DP = 2662 Gflop/s

MCDRAM:
BLAS    | Rate (Gflop/s) | Power efficiency (Gflop/s per W)
Level 3 | 1997           | 9.04
Level 2 | 84             | 0.45
Level 1 | 35             | 0.17

DRAM (DDR4):
BLAS    | Rate (Gflop/s) | Power efficiency (Gflop/s per W)
Level 3 | 1991           |
Level 2 | 21             |
Level 1 | 7              |
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, data in MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
• At the TDP basic power limit of 215 W.
[Plot: average power (Watts) vs. time (sec); the MCDRAM run at 215 W uses 1522 Joules.]
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot: 1522 Joules at 215 W, 1504 Joules at 200 W.]
• Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot, energy per run: 1522 J at 215 W, 1504 J at 200 W, 1468 J at 180 W, 1486 J at 160 W, 1728 J at 140 W, 2390 J at 120 W.]
• Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate.
• Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than a 10% increase in time to solution.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, DDR4: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot, energy per run: 3915 J at 215 W, 3957 J at 200 W, 3940 J at 180 W, 3710 J at 160 W, 3300 J at 140 W, 3158 J at 120 W.]
• Decreasing the power limit to 160 W does not result in any loss in performance, while we observe a reduction in the power rate and an energy saving.
• Decreasing the power limit down by 40 Watts (setting the limit at 140 W) provides a large energy saving (15%) and a power rate reduction without any loss in performance.
• At 120 Watts there is less than a 10% increase in time to solution, while a large energy saving (20%) and power rate reduction are observed.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM and DDR4: sparse matrix-vector operations, solving Ax=b with CG.
[Plots as above for the two memory configurations.]
Lesson for HPCG:
• For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than a 10% increase in time to solution.
• For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and a power rate reduction without any significant increase in time to solution.
Power-awareness in applications
Solving the Helmholtz equation with finite differences, Jacobi iteration on a 12800x12800 grid.
[Plots, energy per run. MCDRAM: 826 J at 215 W, 842 J at 200 W, 724 J at 180 W, 803 J at 160 W, 1066 J at 140 W, 1865 J at 120 W. DDR4: 2770 J at 215 W, 2878 J at 200 W, 2689 J at 180 W, 2438 J at 160 W, 2137 J at 140 W, 2731 J at 120 W.]
Lesson for Jacobi iteration:
• For MCDRAM, capping at 180 Watts improves power efficiency by ~13% without any loss in time to solution, while also decreasing the power rate requirement by about 20%.
• For DDR4, capping at 140 Watts improves power efficiency by ~20% without any loss in time to solution.
• Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, similar to the behavior observed for the DGEMV and DAXPY kernels.
Power-awareness in applications
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark).
[Plots, energy per run. MCDRAM: 3050 J at 270 W, 2981 J at 215 W, 2872 J at 200 W, 3043 J at 180 W, 3383 J at 160 W, 4033 J at 140 W, 5734 J at 120 W. DDR4: 9064 J at 270 W, 9196 J at 215 W, 9194 J at 200 W, 9178 J at 180 W, 9005 J at 160 W, 8084 J at 140 W, 8467 J at 120 W.]
Lesson for Lattice Boltzmann:
• For MCDRAM, capping at 200 Watts results in a 5% energy saving and a power rate reduction without any increase in time to solution.
• For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power rate without any increase in time to solution.
Power-awareness in applications
Monte Carlo neutron transport application (from XSBench).
[Plots, energy per run. MCDRAM: 9175 J at 215 W, 8955 J at 200 W, 8702 J at 180 W, 8724 J at 160 W, 10078 J at 140 W. DDR4: 19901 J at 215 W, 19796 J at 200 W, 19676 J at 180 W, 17763 J at 160 W, 16003 J at 140 W, 13432 J at 120 W.]
Lesson for Monte Carlo:
• For MCDRAM, capping at 180 W results in an energy saving (5%) and a power rate reduction without any meaningful increase in time to solution.
• For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss in performance.
Power Aware Tools
• Designing a toolset that will "automatically" monitor memory traffic.
  - Track the performance and the data movement.
• Adjust the power to keep the performance roughly the same while decreasing energy consumption (a sketch of such a control loop follows below).
• Better energy efficiency without significant loss of performance.
• All in an automatic fashion, without user involvement.
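One way to picture the control loop such a tool would implement; a minimal sketch under stated assumptions, not the toolset itself. The read_counters and set_power_cap_watts helpers are hypothetical stand-ins for PAPI flop/bandwidth counters and the powercap write shown earlier, and the policy is the simplest possible one: lower the cap while measured throughput stays within a tolerance of its uncapped value.

```c
/* Hypothetical power-steering loop: lower the cap step by step as long as
 * throughput stays within `tol` of the uncapped baseline, else back off. */
typedef struct { double gflops; double gbytes_per_s; } sample_t;

/* Stubs: a real tool would read PAPI flop and memory-traffic counters over
 * `interval_s` seconds and write the cap through the powercap component. */
static sample_t read_counters(double interval_s) {
    (void)interval_s;
    sample_t s = { 100.0, 50.0 };
    return s;
}
static void set_power_cap_watts(double watts) { (void)watts; }

void steer_power(double cap_max, double cap_min, double step, double tol) {
    set_power_cap_watts(cap_max);
    double baseline = read_counters(1.0).gflops;   /* uncapped throughput */
    double cap = cap_max;

    for (;;) {
        double trial = cap - step;
        if (trial < cap_min) break;
        set_power_cap_watts(trial);
        double now = read_counters(1.0).gflops;
        if (now >= (1.0 - tol) * baseline) {
            cap = trial;                   /* performance held: keep the lower cap   */
        } else {
            set_power_cap_watts(cap);      /* performance dropped: restore and stop  */
            break;
        }
    }
}
```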
Critical Issues and Challenges at Peta & Exascale for Algorithm and Software Design
♦ Synchronization-reducing algorithms: break the fork-join model.
♦ Communication-reducing algorithms: use methods which have a lower bound on communication.
♦ Mixed precision methods: 2x speed of operations and 2x speed for data movement.
♦ Autotuning: today's machines are too complicated; build "smarts" into the software to adapt to the hardware.
♦ Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
♦ Reproducibility of results: today we can't guarantee this without a penalty. We understand the issues, but some of our "colleagues" have a hard time with this.
Collaborators and Support
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners:
University of Tennessee, Knoxville
University of California, Berkeley
University of Colorado, Denver
University of Manchester