TRANSCRIPT
10/20/2017
Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory
University of Manchester
Outline
• Overview of High Performance Computing
• With Extreme Computing the “rules” for computing have changed
Since 1993
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP, Ax=b, dense problem
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
Performance Development of HPC over the Last 24.5 Years from the TOP500
[Plot: performance (Flop/s, log scale) vs. year, 1994-2017, for the SUM, N=1, and N=500 lines. In 1993: N=1 at 59.7 GFlop/s, N=500 at 400 MFlop/s, SUM at 1.17 TFlop/s. Today: N=1 at 93 PFlop/s, N=500 at 432 TFlop/s, SUM at 749 PFlop/s. The lag between N=1 and N=500 is roughly 6-8 years. My laptop: 166 Gflop/s.]
Performance Development (with projections)
[Plot: performance vs. year, 1994-2020, for SUM, N=1, N=10, and N=100. Milestones: Tflop/s (10^12) achieved by ASCI Red at Sandia NL; Pflop/s (10^15) achieved by Roadrunner at Los Alamos NL; Eflop/s (10^18) achieved? China says 2020, the U.S. says 2021. Today: SW26010 & Intel KNL ~ 3 Tflop/s; 1 cabinet of TaihuLight ~ 3 Pflop/s.]
State of Supercomputing in 2017
• Pflops (> 10^15 Flop/s) computing fully established with 138 computer systems.
• Three technology architectures, or "swim lanes", are thriving:
  • Commodity (e.g. Intel)
  • Commodity + accelerator (e.g. GPUs) (91 systems)
  • Lightweight cores (e.g. IBM BG, Intel's Knights Landing, ShenWei 26010, ARM)
• Interest in supercomputing is now worldwide, and growing in many new markets (~50% of Top500 computers are in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share, 92%, followed by AMD at 1%.
Countries Share
China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.
Each rectangle represents one of the Top500 computers, area of rectangle reflects its performance.
17 - Top500 Systems in France
Name | Computer | Site | Manufacturer | Total Cores | Acc. | Rmax (Gflop/s)
Pangea | SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR | Total Exploration Production | HPE | 220,800 | 0 | 5,283,110
occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | GENCI-CINES | Bull | 85,824 | 0 | 2,494,650
Prolix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 72,000 | 0 | 2,167,990
Beaufix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 73,440 | 0 | 2,157,410
Tera-1000-1 | bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR | Commissariat a l'Energie Atomique (CEA) | Bull | 70,272 | 0 | 1,871,000
Sid | bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR | Atos | Bull | 49,896 | 0 | 1,363,480
Curie thin nodes | Bullx B510, Xeon E5-2680 8C 2.700GHz, Infiniband QDR | CEA/TGCC-GENCI | Bull | 77,184 | 0 | 1,359,000
Cobalt | bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR | Commissariat a l'Energie Atomique (CEA)/CCRT | Bull | 38,528 | 0 | 1,299,470
Diego | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Atos | Bull | 46,800 | 0 | 1,225,280
Turing | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | CNRS/IDRIS-GENCI | IBM | 98,304 | 0 | 1,073,327
Tera-100 | Bull bullx super-node S6010/S6030 | Commissariat a l'Energie Atomique (CEA) | Bull | 138,368 | 0 | 1,050,000
Eole | Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | EDF | Bull | 29,568 | 0 | 942,829
Zumbrota | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | EDF R&D | IBM | 65,536 | 0 | 715,551
Occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | Centre Informatique National (CINES) | Bull | 19,488 | 0 | 685,985
Sator | NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | ONERA | NEC | 17,360 | 0 | 579,232
HPC4 | HP POD - Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR | Airbus | HPE | 34,560 | 0 | 516,897
PORTHOS | IBM NeXtScale nx360M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR | EDF R&D | IBM | 16,100 | 0 | 506,357
June 2017: The TOP 10 Systems
Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | GFlops/Watt
1 | National Supercomputer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04
2 | National Supercomputer Center in Guangzhou | Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91
3 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland | 361,760 | 19.6 | 77 | 2.27 | 8.6
4 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14
5 | DOE / NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18
6 | DOE / OS, Lawrence Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA | 622,336 | 14.0 | 50 | 3.94 | 3.55
7 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan | 558,144 | 13.6 | 54 | 2.72 | 4.98
8 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827
9 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.59 | 85 | 3.95 | 2.07
10 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92
500 | Internet company | Sugon, Intel (8C) + Nvidia | China | 110,000 | 0.432 | 71 | 1.2 | 0.36
TaihuLight is 60% of the sum of all EU systems; TaihuLight is 35% of the sum of all systems in the US.
Next Big Systems in the US DOE
• Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems
  • IBM Power9 + Nvidia Volta V100
  • Installation started, but not completed until late 2018-2019
  • 5-10 times Titan on applications
  • 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs; aggregate peak performance should be > 200 petaflops
  • ORNL maybe fully ready June 2018 (Top500 number)
• In 2021 Argonne Lab to receive an Intel based system
  • Exascale system, based on future Intel parts, not Knights Hill
  • Aurora 21
  • Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data intensive applications (science pipelines), and deep learning and emerging science AI
  • Integrated computing, acceleration, storage
  • Towards a common software stack
Toward Exascale: China plans for Exascale in 2020
Three separate developments in HPC; "Anything but from the US"
• Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese
• National University for Defense Technology: Tianhe-2A, O(100) Pflops, Chinese ARM processor + accelerator
• Sugon - CAS ICT: X86 based; collaboration with AMD
US DOE - Exascale Computing Program, a 7 Year Program
• Initial exascale system based on an advanced architecture and delivered in 2021
• Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023
Exascale:
• 50X the performance of today's 20 PF systems
• 20-30 MW power
• ≦ 1 perceived fault per week
• SW to support a broad range of applications
Exascale Computing Project, www.exascaleproject.org
US Department of Energy Exascale Computing Program has formulated a holistic approach that uses co-design and integration to achieve capable exascale:
• Application Development: science and mission applications
• Software Technology: a scalable and productive software stack
• Hardware Technology: hardware technology elements
• Exascale Systems: integrated exascale supercomputers
[Diagram of the software stack: applications co-design; correctness, visualization, and data analysis; programming models, development environment, and runtimes; tools; math libraries and frameworks; system software (resource management, threading, scheduling, monitoring, and control); memory and burst buffer; data management, I/O and file system; node OS and runtimes; resilience; workflows; hardware interface.]
ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce development
The Problem with the LINPACK Benchmark
• HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
• Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.
Many Problems in Computational Science Involve Solving PDEs; Large Sparse Linear Systems
Given a PDE, e.g. modeling diffusion or fluid flow,
$-\Delta u + \vec{c} \cdot \nabla u + \gamma u \equiv P u = f$
over some domain, plus boundary conditions (where $P$ denotes the differential operator).
Discretization (e.g., Galerkin equations): find $u_h = \sum_{i=1}^{n} \Phi_i x_i$ such that $(P u_h, \Phi_j) = (f, \Phi_j)$ for all $\Phi_j$, i.e. $\sum_{i=1}^{n} (P \Phi_i, \Phi_j)\, x_i = (f, \Phi_j)$, which is a sparse linear system $A x = b$.
The basis functions $\Phi_j$ often have local support, leading to local interactions and hence sparse matrices; e.g., row 10 of $A$ in this case has only 6 nonzeros: $a_{10,10}, a_{10,35}, a_{10,100}, a_{10,115}, a_{10,201}, a_{10,332}$.
HPCG
• High Performance Conjugate Gradients (HPCG).
• Solves Ax=b, with A large and sparse, b known, x computed.
• An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for discretization and numerical solution of PDEs (see the CG sketch below).
• Synthetic discretized 3D PDE (FEM, FVM, FDM).
• Sparse matrix: 27 nonzeros per row in the interior, 8-18 on the boundary; symmetric positive definite.
• Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves).
• Strong verification (via spectral properties of PCG).
hpcg-benchmark.org
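For readers who want to see the pattern concretely, even an unpreconditioned conjugate gradient loop already exhibits the kernel mix HPCG exercises: one sparse matrix-vector product, two dot products, and a few vector updates per iteration, all memory-bandwidth bound. A minimal sketch in C over a CSR matrix; this is illustrative only, not the HPCG reference implementation:

```c
#include <stdlib.h>
#include <math.h>

/* y = A*x for a CSR matrix (val, col, rowptr) with n rows */
static void spmv(int n, const int *rowptr, const int *col,
                 const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Unpreconditioned CG: solve A x = b to tolerance tol, at most maxit iterations. */
void cg(int n, const int *rowptr, const int *col, const double *val,
        const double *b, double *x, double tol, int maxit) {
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *Ap = malloc(n * sizeof *Ap);
    spmv(n, rowptr, col, val, x, Ap);                     /* r = b - A*x */
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        spmv(n, rowptr, col, val, p, Ap);                 /* sparse mat-vec, bandwidth bound */
        double alpha = rr / dot(n, p, Ap);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}
```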
HPL vs. HPCG: Bookends
• Some see HPL and HPCG as "bookends" of a spectrum.
• Application teams know where their codes lie on the spectrum.
• Can gauge performance on a system using both HPL and HPCG numbers.
hpcg-benchmark.org
HPCG Results, June 2017, Top 10
Rank | Site | Computer | Cores | Rmax (Pflops) | HPCG (Pflops) | HPCG/HPL | % of Peak
1 | RIKEN Advanced Institute for Computational Science, Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10.5 | 0.60 | 5.7% | 5.3%
2 | NSCC / Guangzhou, China | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.8 | 0.58 | 1.7% | 1.1%
3 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC | 10,649,600 | 93.0 | 0.48 | 0.5% | 0.4%
3 | Swiss National Supercomputing Centre (CSCS), Switzerland | Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray | 361,760 | 19.6 | 0.48 | 2.4% | 1.9%
5 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, Fujitsu | 557,056 | 24.9 | 0.39 | 2.8% | 1.5%
6 | DOE/SC/LBNL/NERSC, USA | Cori, Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, Cray | 632,400 | 13.8 | 0.36 | 2.6% | 1.3%
7 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM | 1,572,864 | 17.2 | 0.33 | 1.9% | 1.6%
8 | DOE/SC/Oak Ridge National Lab, USA | Titan, Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17.6 | 0.32 | 1.8% | 1.2%
9 | DOE/NNSA/LANL/SNL, USA | Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray | 301,056 | 8.10 | 0.18 | 2.3% | 1.6%
10 | NASA / Mountain View, USA | Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE | 243,008 | 6.0 | 0.17 | 2.9% | 2.5%
Bookends: Peak and HPL Only
[Plot: Rpeak and Rmax (Pflop/s, log scale) for a selection of TOP500-ranked systems.]
Bookends: Peak, HPL, and HPCG
[Plot: Rpeak, Rmax, and HPCG (Pflop/s, log scale) for the same selection of systems; HPCG falls well below Rmax for every system.]
Machine Learning in Computational Science
• Climate
• Biology
• Drug Design
• Epidemiology
• Materials
• Cosmology
• High-Energy Physics
Many fields are beginning to adopt machine learning to augment modeling and simulation methods
IEEE 754 Half Precision (16-bit) Floating Pt Standard
A lot of interest driven by "machine learning"
Mixed Precision
• Extended to sparse direct methods.
• Iterative methods with an inner (low precision) and outer (high precision) iteration (a sketch follows below).
• Today there are many precisions to deal with.
• Note the limited number range with half precision (16-bit floating point).
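As one concrete instance of the inner/outer idea, classic mixed-precision iterative refinement factors the matrix once in low precision and recovers full double-precision accuracy with a cheap outer loop. A minimal dense sketch in C, assuming LAPACKE is available; the single-precision factorization stands in for whatever lower precision the hardware supports (it could equally be half precision):

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve A x = b (n x n, row-major) with LU in single precision (inner solver)
 * and iterative refinement in double precision (outer loop). */
int mp_solve(int n, const double *A, const double *b, double *x, int sweeps) {
    float  *As = malloc((size_t)n * n * sizeof *As);
    float  *bs = malloc((size_t)n * sizeof *bs);
    double *r  = malloc((size_t)n * sizeof *r);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];       /* demote A once */
    lapack_int info = LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv);
    if (info != 0) return (int)info;

    for (int i = 0; i < n; i++) bs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, bs, 1);
    for (int i = 0; i < n; i++) x[i] = (double)bs[i];          /* initial solution */

    for (int s = 0; s < sweeps; s++) {
        /* residual r = b - A*x, accumulated in double precision */
        for (int i = 0; i < n; i++) {
            double t = b[i];
            for (int j = 0; j < n; j++) t -= A[i * n + j] * x[j];
            r[i] = t;
        }
        for (int i = 0; i < n; i++) bs[i] = (float)r[i];        /* demote residual  */
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, bs, 1);
        for (int i = 0; i < n; i++) x[i] += (double)bs[i];      /* correct in double */
    }
    free(As); free(bs); free(r); free(ipiv);
    return 0;
}
```

The design point is that the O(n^3) factorization runs at the (roughly 2x faster) low-precision rate, while the O(n^2) refinement work keeps the answer accurate.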
Deep Learning needs Batched Operations
Matrix multiply is the time-consuming part: both convolution layers and fully connected layers reduce to matrix multiply.
There are many GEMMs on small matrices; they are perfectly parallel and can get by with 16-bit floating point.
Convolution step: in this case a 3x3 GEMM.
[Diagram: inputs x1, x2, x3; weights w11..w23; outputs y1, y2; followed by a fully connected layer for classification.]
Standard for Batched Computations
• Define a standard API for batched BLAS and LAPACK in collaboration with Intel/Nvidia/ECP/other users.
• Fixed size: most of BLAS and LAPACK released.
• Variable size: most of BLAS released; variable-size LAPACK in the branch.
• Native GPU algorithms (Cholesky, LU, QR) in the branch.
• Tiled algorithms using batched routines on tile or LAPACK data layout in the branch.
• Framework for Deep Neural Network kernels: CPU, KNL and GPU routines; FP16 routines in progress.
Batched Computations
Non-batched computation: loop over the matrices one by one and compute each product with multithreading. Since the matrices are small, there is not enough work for all the cores, so we expect low performance; thread contention may also hurt performance.

for (i = 0; i < batchcount; i++)
    dgemm(...);

There is not enough work to fill all the cores; a low percentage of the resources is used.
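Spelled out, the non-batched baseline might look like the following. This is a sketch assuming a CBLAS interface (e.g. MKL or OpenBLAS); the pointer arrays Aarray, Barray, Carray and the uniform size n are hypothetical names for illustration:

```c
#include <cblas.h>

/* One small dgemm per matrix, issued sequentially: each call is too small
 * to occupy all cores, so most of the machine sits idle. */
void gemm_loop(int batchcount, int n,
               double **Aarray, double **Barray, double **Carray) {
    for (int i = 0; i < batchcount; i++) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, Aarray[i], n,
                         Barray[i], n,
                    0.0, Carray[i], n);
    }
}
```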
Batched Computations
Batched computation: distribute all the matrices over the available resources by assigning a matrix to each group of cores (CPU) or thread blocks (GPU) to operate on independently.
• For very small matrices, assign one matrix per core (CPU) or per thread block (GPU).
• For medium sizes, a matrix goes to a team of cores (CPU) or many thread blocks (GPU).
• For large sizes, switch to the classical multithreaded approach, one matrix per round.

Batched_dgemm(...)

The kernel design decides the number of thread blocks or threads (GPU/CPU), and a task manager/dispatcher distributes the work through the Nvidia/OpenMP scheduler. A high percentage of the resources is used (a batched call is sketched below).
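A batched interface hands the whole set of problems to the library in a single call so it can schedule the small GEMMs across all cores or thread blocks itself. A sketch assuming Intel MKL's cblas_dgemm_batch with one group of identically sized matrices (same hypothetical array names as above); on GPUs, MAGMA's batched GEMM plays the analogous role:

```c
#include <mkl.h>

/* All batchcount small GEMMs submitted at once; the library distributes
 * them over the available resources instead of running them one by one. */
void gemm_batched(int batchcount, int n,
                  const double **Aarray, const double **Barray, double **Carray) {
    CBLAS_TRANSPOSE trans = CblasNoTrans;
    MKL_INT m = n, nn = n, k = n, ld = n;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_count = 1;
    MKL_INT group_size = batchcount;   /* one group: all matrices share one size */

    cblas_dgemm_batch(CblasColMajor,
                      &trans, &trans,
                      &m, &nn, &k,
                      &alpha, Aarray, &ld,
                              Barray, &ld,
                      &beta,  Carray, &ld,
                      group_count, &group_size);
}
```

With variable sizes, the same call takes group_count > 1 and per-group size arrays, which is exactly what the batched-BLAS standardization effort above is defining.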
Batched Computations: How do we design and optimize?
[Plots: batched dgemm vs. standard (non-batched) dgemm, Gflop/s as a function of matrix size, for batches of 50~1000 matrices. Annotated speedups for the batched routine: roughly 100x and 30x for small sizes, 2~3x and 1.2x for medium sizes on the two platforms shown; for large sizes the code switches to the non-batched routine.]
[Plot: Nvidia V100 GPU, 50~1000 matrices of size up to ~4000; batch dgemm (BLAS 3) vs. standard dgemm (BLAS 3). The batched routine is ~19x faster for small sizes and ~1.4x for medium sizes; for large sizes it switches to the non-batched routine, reaching 97% of peak.]
Intel Knights Landing: FLAT mode
• KNL is in FLAT mode (the entire MCDRAM is used as addressable memory), so memory allocations are treated similarly to DDR4 memory allocations (an allocation sketch follows below).
• KNL Thermal Design Power (TDP) is 215 W (for the main SKUs).
• 68-core Intel Xeon Phi KNL, 1.3 GHz; the theoretical peak double precision is 2662 Gflop/s.
• Compiled with icc and using Intel MKL 2017b1 20160506.
[Plot: Level 1, 2 and 3 BLAS on the 68-core KNL, Peak DP = 2662 Gflop/s; chart annotations: 2000 Gflop/s (DGEMM), 85.3 Gflop/s (DGEMV), 35.1 Gflop/s (DAXPY), 35x.]
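In FLAT mode the 16 GB of MCDRAM shows up as a separate NUMA node, so placement is explicit: either bind the whole run with numactl (e.g. `numactl --membind=1 ./a.out`, assuming MCDRAM is node 1 on the system) or allocate selected buffers through the memkind library. A small sketch of the latter:

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory interface */

int main(void) {
    if (hbw_check_available() != 0) {        /* 0 means MCDRAM (HBM) is present */
        fprintf(stderr, "no high-bandwidth memory on this node\n");
        return 1;
    }
    size_t n = 1u << 26;                     /* 64M doubles = 512 MB */
    double *x = hbw_malloc(n * sizeof *x);   /* placed in MCDRAM, not DDR4 */
    if (!x) return 1;
    for (size_t i = 0; i < n; i++) x[i] = 1.0;
    printf("x[0] = %f\n", x[0]);
    hbw_free(x);
    return 0;
}
```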
PAPI for power-aware computing
• We use PAPI's latest powercap component for measurement, control, and performance analysis.
• PAPI power components in the past supported only reading power information.
• The new component exposes RAPL functionality to allow users to read and write power limits (see the sketch below).
• Study numerical building blocks of varying computational intensity.
• Use the PAPI powercap component to detect power optimization opportunities.
• Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant → energy savings!
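A minimal sketch of how the powercap component is driven from C. The PAPI calls themselves (PAPI_library_init, PAPI_add_named_event, PAPI_read, PAPI_write) are real; the event names and the microwatt scale below are assumptions for illustration and should be checked on the actual machine with papi_native_avail, since zone and constraint naming varies by system:

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);

    /* Hypothetical powercap event names: a package energy counter (read-only)
     * and a package power limit (read/write). Discover the real names with
     * `papi_native_avail | grep powercap` on the target machine.            */
    PAPI_add_named_event(evset, "powercap:::ENERGY_UJ:ZONE0");
    PAPI_add_named_event(evset, "powercap:::POWER_LIMIT_A_UW:ZONE0");

    PAPI_start(evset);
    PAPI_read(evset, values);            /* values[0] = energy, values[1] = current cap */
    printf("energy = %lld uJ, power cap = %lld uW\n", values[0], values[1]);

    values[1] = 180 * 1000000LL;         /* request a 180 W cap (assuming microwatts)    */
    PAPI_write(evset, values);           /* writable events take the new value           */

    /* ... run the kernel under the new cap, then PAPI_read again for the energy used ... */

    PAPI_stop(evset, values);
    return 0;
}
```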
Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
DGEMM is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W (a sketch of this measurement loop follows below).
[Plot: average power (Watts) vs. time (sec), showing accelerator (PACKAGE) power, memory (DDR4) power, the max power limit set, and the Thermal Design Power (TDP) line. Starting from idle, DGEMM on the 18K matrix at the default power limit delivers 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules.]
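The loop behind these measurements can be sketched as follows: pick the next cap, apply it, run the 18K x 18K DGEMM, and record time and energy. The set_power_cap_watts/read_energy_joules helpers are stubs standing in for the PAPI powercap calls shown above; this is an illustrative harness, not the authors' actual benchmark code:

```c
#include <stdio.h>
#include <time.h>
#include <cblas.h>

/* Stubs for illustration: in a real harness these wrap the PAPI powercap
 * calls shown earlier (PAPI_write for the cap, PAPI_read for the energy). */
static void set_power_cap_watts(double watts) { (void)watts; }
static double read_energy_joules(void) {
    static double fake = 0.0;
    return fake += 1.0;                 /* placeholder so the sketch links and runs */
}

void power_sweep_dgemm(int n, double *A, double *B, double *C) {
    /* default first, then 220 W down to 120 W in 10/20 W steps */
    const double caps[] = { 0 /* default */, 220, 200, 180, 160, 150, 140, 130, 120 };
    for (int s = 0; s < (int)(sizeof caps / sizeof caps[0]); s++) {
        if (caps[s] > 0) set_power_cap_watts(caps[s]);

        double e0 = read_energy_joules();
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs   = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double joules = read_energy_joules() - e0;

        double gflops = 2.0 * n * (double)n * n / secs / 1e9;
        printf("cap %6.0f W: %8.1f Gflop/s, %6.2f Gflop/s/W, %8.1f J\n",
               caps[s], gflops, gflops * secs / joules, joules);
    }
}
```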
Level 3 BLAS DGEMM on KNL using MCDRAM (continued)
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Same plot: at the default power limit the 18K DGEMM delivers 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules.] The power limit is changed at run time through PAPI:
long long values[2];
PAPI_read(EventSet, values);
values[0] = 26875000;   // set short-term power limit to 268.75 Watts
values[1] = 25800000;   // set long-term power limit to 258 Watts
PAPI_write(EventSet, values);
Level 3 BLAS DGEMM on KNL using MCDRAM (continued)
[Plot, next step: with the power limit set to 220 W, the same 18K DGEMM delivers 1904 Gflop/s, 8.92 Gflop/s per Watt, 1279 Joules, compared with 1997 Gflop/s, 8.81 Gflop/s per Watt, 1303 Joules at the default limit.]
Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s. DGEMM is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.

Power limit | Gflop/s | Gflop/s per Watt | Joules
default     | 1997    | 8.82             | 1303
220 W       | 1904    | 8.92             | 1279
200 W       | 1741    | 8.95             | 1267
180 W       | 1589    | 9.04             | 1253
160 W       | 1328    | 8.47             | 1345
150 W       | 1137    | 7.73             | 1480
140 W       | 956     | 6.95             | 1661
130 W       | 773     | 6.03             | 1904
120 W       | 560     | 4.72             | 2443
Level 3 BLAS DGEMM on KNL using MCDRAM
[Same data as the table above, with an additional panel showing core frequency (MHz) over time.]
• When the power cap kicks in, the core frequency is decreased, which confirms that capping acts through DVFS.
• The optimum for Flops is at the default limit; the optimum for Flops/Watt is reached under capping (9.04 Gflop/s per Watt at the 180 W step).
Level 3 BLAS DGEMM on KNL, MCDRAM vs. DDR4
68-core KNL, Peak DP = 2662 Gflop/s; bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s. Same experiment as above, run once with data in MCDRAM and once in DDR4; each panel also shows core frequency (MHz).

MCDRAM (as in the table above): default 1997 Gflop/s / 8.82 Gflop/s per W / 1303 J; then 1904 / 8.92 / 1279; 1741 / 8.95 / 1267; 1589 / 9.04 / 1253; 1328 / 8.47 / 1345; 1137 / 7.73 / 1480; 956 / 6.95 / 1661; 773 / 6.03 / 1904; 560 / 4.72 / 2443 at the successive power limits.
DDR4: default 1991 Gflop/s / 8.22 Gflop/s per W / 1383 J; then 1901 / 8.36 / 1341; 1785 / 8.58 / 1314; 1615 / 8.57 / 1325; 1340 / 8.02 / 1419; 1154 / 7.36 / 1562; 971 / 6.66 / 1715; 785 / 5.81 / 1982; 575 / 4.64 / 2489 at the successive power limits.
[Annotated optima: best Flops at the default limit, best Flops/Watt under a moderate cap, for both memory configurations.]

Lesson for DGEMM-type operations (compute intensive):
• Capping can reduce performance, but sometimes it can hold the energy efficiency (Gflop/s per Watt).
Level 2 BLAS DGEMV on KNL, MCDRAM vs. DDR4
DGEMV is run repeatedly on a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.
[Plots for MCDRAM and DDR4: average power (Watts) and core frequency (MHz) vs. time, annotated per step with performance (Gflop/s), achieved bandwidth (GB/s), Gflop/s per Watt, and Joules.]

Lesson for DGEMV-type operations (memory bound):
• For MCDRAM, capping at 190 Watts results in power reduction without any loss of performance (energy saving ~20%).
• For DDR4, capping at 120 Watts improves energy efficiency by ~17%.
• Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant increase in time to solution.
Level 1 BLAS DAXPY on KNL, MCDRAM vs. DDR4
DAXPY is run repeatedly on fixed vectors of length 1E8; at each step a new power limit is set, decreasing from the default through 220 W, 200 W, and so on down to 120 W in steps of 10/20 W.
[Plots for MCDRAM and DDR4: average power (Watts) and core frequency (MHz) vs. time, annotated per step with performance (Gflop/s), achieved bandwidth (GB/s), Gflop/s per Watt, and Joules.]

Lesson for DAXPY-type operations (memory bound):
• For MCDRAM, capping at 180 Watts results in power reduction without any loss of performance (energy saving ~16%).
• For DDR4, capping at 130 Watts improves energy efficiency by ~25%.
• Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant increase in time to solution.
Intel Knights Landing
Level 1, 2 and 3 BLAS, 68-core Intel Xeon Phi KNL, 1.3 GHz, Peak DP = 2662 Gflop/s

MCDRAM:
BLAS    | Rate (Gflop/s) | Power efficiency (Gflop/s per W)
Level 3 | 1997           | 9.04
Level 2 | 84             | 0.45
Level 1 | 35             | 0.17

DRAM (DDR4):
BLAS    | Rate (Gflop/s) | Power efficiency (Gflop/s per W)
Level 3 | 1991           |
Level 2 | 21             |
Level 1 | 7              |
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, data in MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
• At the TDP basic power limit of 215 W.
[Plot: average power (Watts) vs. time (sec); the MCDRAM run at 215 W uses 1522 Joules.]
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot: 1522 Joules at 215 W, 1504 Joules at 200 W.]
• Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot, energy per run: 1522 J at 215 W, 1504 J at 200 W, 1468 J at 180 W, 1486 J at 160 W, 1728 J at 140 W, 2390 J at 120 W.]
• Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate.
• Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than a 10% increase in time to solution.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, DDR4: sparse matrix-vector operations, solving Ax=b with the Conjugate Gradient algorithm.
[Plot, energy per run: 3915 J at 215 W, 3957 J at 200 W, 3940 J at 180 W, 3710 J at 160 W, 3300 J at 140 W, 3158 J at 120 W.]
• Decreasing the power limit to 160 W does not result in any loss in performance, while we observe a reduction in the power rate and an energy saving.
• Decreasing the power limit down by 40 Watts (setting the limit at 140 W) provides a large energy saving (15%) and a power rate reduction without any loss in performance.
• At 120 Watts there is less than a 10% increase in time to solution, while a large energy saving (20%) and power rate reduction are observed.
Power-awareness in applications
HPCG benchmark on a grid of size 192^3, MCDRAM and DDR4: sparse matrix-vector operations, solving Ax=b with CG.
[Plots as above for the two memory configurations.]
Lesson for HPCG:
• For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than a 10% increase in time to solution.
• For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and a power rate reduction without any significant increase in time to solution.
Power-awareness in applications
Solving the Helmholtz equation with finite differences, Jacobi iteration on a 12800x12800 grid.
[Plots, energy per run. MCDRAM: 826 J at 215 W, 842 J at 200 W, 724 J at 180 W, 803 J at 160 W, 1066 J at 140 W, 1865 J at 120 W. DDR4: 2770 J at 215 W, 2878 J at 200 W, 2689 J at 180 W, 2438 J at 160 W, 2137 J at 140 W, 2731 J at 120 W.]
Lesson for Jacobi iteration:
• For MCDRAM, capping at 180 Watts improves power efficiency by ~13% without any loss in time to solution, while also decreasing the power rate requirement by about 20%.
• For DDR4, capping at 140 Watts improves power efficiency by ~20% without any loss in time to solution.
• Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, similar to the behavior observed for the DGEMV and DAXPY kernels.
Power-awareness in applications
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark).
[Plots, energy per run. MCDRAM: 3050 J at 270 W, 2981 J at 215 W, 2872 J at 200 W, 3043 J at 180 W, 3383 J at 160 W, 4033 J at 140 W, 5734 J at 120 W. DDR4: 9064 J at 270 W, 9196 J at 215 W, 9194 J at 200 W, 9178 J at 180 W, 9005 J at 160 W, 8084 J at 140 W, 8467 J at 120 W.]
Lesson for Lattice Boltzmann:
• For MCDRAM, capping at 200 Watts results in a 5% energy saving and a power rate reduction without any increase in time to solution.
• For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power rate without any increase in time to solution.
Power-awareness in applications
Monte Carlo neutron transport application (from XSBench).
[Plots, energy per run. MCDRAM: 9175 J at 215 W, 8955 J at 200 W, 8702 J at 180 W, 8724 J at 160 W, 10078 J at 140 W. DDR4: 19901 J at 215 W, 19796 J at 200 W, 19676 J at 180 W, 17763 J at 160 W, 16003 J at 140 W, 13432 J at 120 W.]
Lesson for Monte Carlo:
• For MCDRAM, capping at 180 W results in an energy saving (5%) and a power rate reduction without any meaningful increase in time to solution.
• For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss in performance.
Power Aware Tools
• Designing a toolset that will "automatically" monitor memory traffic.
  - Track the performance and the data movement.
• Adjust the power to keep the performance roughly the same while decreasing energy consumption (a sketch of such a control loop follows below).
• Better energy efficiency without significant loss of performance.
• All in an automatic fashion, without user involvement.
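One way to picture the control loop such a tool would implement; a minimal sketch under stated assumptions, not the toolset itself. The read_counters and set_power_cap_watts helpers are hypothetical stand-ins for PAPI flop/bandwidth counters and the powercap write shown earlier, and the policy is the simplest possible one: lower the cap while measured throughput stays within a tolerance of its uncapped value.

```c
/* Hypothetical power-steering loop: lower the cap step by step as long as
 * throughput stays within `tol` of the uncapped baseline, else back off. */
typedef struct { double gflops; double gbytes_per_s; } sample_t;

/* Stubs: a real tool would read PAPI flop and memory-traffic counters over
 * `interval_s` seconds and write the cap through the powercap component. */
static sample_t read_counters(double interval_s) {
    (void)interval_s;
    sample_t s = { 100.0, 50.0 };
    return s;
}
static void set_power_cap_watts(double watts) { (void)watts; }

void steer_power(double cap_max, double cap_min, double step, double tol) {
    set_power_cap_watts(cap_max);
    double baseline = read_counters(1.0).gflops;   /* uncapped throughput */
    double cap = cap_max;

    for (;;) {
        double trial = cap - step;
        if (trial < cap_min) break;
        set_power_cap_watts(trial);
        double now = read_counters(1.0).gflops;
        if (now >= (1.0 - tol) * baseline) {
            cap = trial;                   /* performance held: keep the lower cap   */
        } else {
            set_power_cap_watts(cap);      /* performance dropped: restore and stop  */
            break;
        }
    }
}
```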
Critical Issues and Challenges at Peta & Exascale for Algorithm and Software Design
♦ Synchronization-reducing algorithms: break the fork-join model.
♦ Communication-reducing algorithms: use methods which have a lower bound on communication.
♦ Mixed precision methods: 2x speed of operations and 2x speed for data movement.
♦ Autotuning: today's machines are too complicated; build "smarts" into the software to adapt to the hardware.
♦ Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
♦ Reproducibility of results: today we can't guarantee this without a penalty. We understand the issues, but some of our "colleagues" have a hard time with this.
Collaborators and Support
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners:
University of Tennessee, Knoxville
University of California, Berkeley
University of Colorado, Denver
University of Manchester