Intel® Xeon Phi™: Architecture, Programming Models, Optimization

Dmitry Prokhorov, Dmitry Ryabtsev, Intel

Agenda

• What and Why

• Intel Xeon Phi –Top 500 insights, roadmap, architecture

• How

• Programming models - positioning and spectrum

• How Fast

• Optimization and Tools

What and Why: HPC

High-Performance Computing: the use of supercomputers and parallel processing techniques for solving complex computational problems.

What and Why: TOP 500 – “today’s future” of tomorrow’s mainstream HPC

What and Why: TOP 500 Highlights – Performance Projection

What and Why: TOP 500 Highlights – Top 10 list

What and Why: TOP 500 Highlights – Accelerators in Power Efficiency

What and Why: TOP 500 Highlights – Accelerators/Coprocessors

[Chart; one label reads “NVIDIA”]

What and Why: Intel Many Integrated Core (MIC) architecture

Larrabee

+ Teraflops Research Chip

+ competition with NVIDIA on accelerators

What and Why: Parallelization and vectorization

[Diagram: 2 x 2 grid of execution styles]

Scalar | Vector

Parallel | Parallel + Vector
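The two corners of this grid, Scalar and Parallel + Vector, can be illustrated with a minimal C sketch. The function names are illustrative and not from the talk; OpenMP is assumed as the threading and SIMD notation, and the pragma relies on `x` and `y` not overlapping.

```c
#include <stddef.h>

/* Scalar: one thread, one element per iteration. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Parallel + Vector: iterations are divided among threads, and each
 * thread's chunk may be executed with SIMD instructions.
 * Assumes x and y do not overlap. If OpenMP is not enabled at compile
 * time, the pragma is ignored and the loop still gives the same result. */
void saxpy_parallel_simd(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; only the way the iterations are mapped to cores and vector lanes differs.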

What and Why: Xeon vs. Xeon Phi

KNL Mesh Interconnect: All-to-All

Addresses are uniformly hashed across all distributed directories.

Typical read L2 miss:

1. L2 miss encountered

2. Send request to the distributed directory

3. Miss in the directory. Forward to memory

4. Memory sends the data to the requestor

[Diagram: KNL mesh of Tiles with EDC, iMC, OPIO, IIO, PCIe, DDR and Misc blocks; labels 1–4 mark the steps above]

KNL Mesh Interconnect: Quadrant

The chip is divided into four quadrants. The directory for an address resides in the same quadrant as the memory location. Software-transparent.

Typical read L2 miss:

1. L2 miss encountered

2. Send request to the distributed directory

3. Miss in the directory. Forward to memory

4. Memory sends the data to the requestor

[Diagram: KNL mesh of Tiles with EDC, iMC, OPIO, IIO, PCIe, DDR and Misc blocks; labels 1–4 mark the steps above]

KNL Mesh Interconnect: Sub-NUMA Clustering

Each quadrant (cluster) is exposed to the OS as a separate NUMA domain. Analogous to a 4-socket (4S) Xeon. Software-visible.

Typical read L2 miss:

1. L2 miss encountered

2. Send request to the distributed directory

3. Miss in the directory. Forward to memory

4. Memory sends the data to the requestor

[Diagram: KNL mesh of Tiles with EDC, iMC, OPIO, IIO, PCIe, DDR and Misc blocks; labels 1–4 mark the steps above]
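Because sub-NUMA clustering is visible to software, data placement becomes the application's concern. A minimal sketch of one common approach, parallel "first touch" initialization, is shown here. It assumes an operating system with a first-touch page placement policy (the Linux default) and threads pinned to cores, for example through the OpenMP `OMP_PROC_BIND` setting; the function names are illustrative and not from the talk.

```c
#include <stddef.h>

/* Initialize the array with the same static schedule the compute loop
 * uses, so that each page is first touched by the thread that later
 * works on it. Under a first-touch policy this tends to place the page
 * in that thread's NUMA domain. */
void first_touch_init(double *a, size_t n, double value)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = value;
}

/* Compute loop with the identical static partitioning of iterations. */
double sum_local(const double *a, size_t n)
{
    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

If the array were initialized by a single thread instead, all of its pages could end up in one NUMA domain and three quarters of the threads would read remote memory.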

• The Cori supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016.

• The “Trinity” supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016.

• Over 50 system providers are expected for the KNL host processor, in addition to many more PCIe*-card based solutions.

• More than 100 petaflops of committed customer deals to date.

• The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million. Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta is the first system, with greater than 8.5 petaFLOP/s and more than 2,500 nodes, featuring the Intel® Xeon Phi™ processor (Knights Landing), the Cray* Aries* interconnect and Cray’s* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and largest system, with 180-450 petaFLOP/s and approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™ processor (Knights Hill), 2nd generation Intel® Omni-Path fabric, Cray’s* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high-bandwidth on-package memory.

How: Programming models

How: Positioning works – adoption of coprocessors in the TOP 500

How: Positioning works – adoption speed of coprocessors in the TOP 500

How: KNL positioning

• Out-of-box performance on throughput workloads is “about the same” as Xeon, with potential for more than 2x when optimized for vectors, threads and memory bandwidth.

• Same programming model, tools, compilers and libraries as Xeon. Boots a standard OS and runs all legacy code.

[Diagram: Xeon and KNL share the programming model; the compilers, tools and libraries; and the code base]

Massive thread and data parallelism and massive memory bandwidth, with good single-thread (ST) performance, in an ISA-compatible standard CPU form factor.

How: Programming models on Xeon Phi

• Native (Xeon Phi)

• Offload (Xeon -> Xeon Phi)

How: Programming models on Xeon Phi: native

• Recompilation with -xMIC-AVX512

• Vectorization: increased efficiency, use of new instructions

• MCDRAM and memory tuning: tile, 1GB pages
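One way to place selected data in MCDRAM is the `hbwmalloc` interface of the memkind library. The sketch below is an assumption about how that could look, not something shown in the talk: `alloc_hot`, `free_hot` and the `USE_MEMKIND` build switch are hypothetical names, and the code falls back to ordinary `malloc` when built without memkind.

```c
#include <stdlib.h>
#ifdef USE_MEMKIND              /* hypothetical switch: build with -DUSE_MEMKIND -lmemkind */
#include <hbwmalloc.h>
#endif

/* Allocate a bandwidth-critical buffer in high-bandwidth memory when it
 * is available, otherwise in regular memory. *in_hbw records which
 * allocator was used so the buffer can be released correctly. */
double *alloc_hot(size_t n, int *in_hbw)
{
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {       /* 0 means HBW memory is usable */
        double *p = hbw_malloc(n * sizeof *p);
        if (p != NULL) {
            *in_hbw = 1;
            return p;
        }
    }
#endif
    *in_hbw = 0;
    return malloc(n * sizeof(double));
}

/* Release a buffer obtained from alloc_hot. */
void free_hot(double *p, int in_hbw)
{
#ifdef USE_MEMKIND
    if (in_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)in_hbw;
    free(p);
}
```

Keeping the allocation behind a small wrapper like this lets the same source run on systems without MCDRAM.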

How: Offload programming model

How: Offload with pragma target in OpenMP 4.0
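The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of what an OpenMP 4.0 `target` region can look like; the function name `saxpy_offload` and the choice of kernel are illustrative, not taken from the talk.

```c
/* y = a*x + y, executed inside a target region.
 * map(to: ...) copies x to the device before the region runs;
 * map(tofrom: ...) copies y to the device and back afterwards.
 * When no device is present (or OpenMP is disabled) the region
 * runs on the host and produces the same result. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The `map` clauses make the data transfers explicit, which is where the cost of the offload model shows up.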

How: Programming models on Xeon Phi: offload

• Mostly applicable to coprocessor cards

  • Data transfers have a cost

• Three ways to use it:

  • OpenMP 4.0 “target” directives

  • MKL Automatic Offload

  • Direct calls to the offload API (COI) and to libraries built on it (e.g., hStreams)

• Offload over Fabric implementation for self-boot systems

How Fast: Optimization BKMs (best known methods)

Optimization techniques are the same as for Xeon, and they help on both.

• Loop unrolling to feed vectorization

• Loop reorganization to avoid strides

  • Be careful with no-dependency pragmas

• Data layout changes for more efficient cache usage

• Moving from pure MPI to hybrid MPI+OpenMP

  • Avoids data replication and intra-node communication, increases MPI buffer size

• NUMA awareness for sub-NUMA clustering mode

  • MPI/thread pinning with parallel data initialization

• Eliminating synchronization on barriers where possible

  • The more threads, the higher the barrier cost
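The "loop reorganization to avoid strides" item can be sketched with a loop interchange. The example below is illustrative only (the function names and the column-sum kernel are assumptions): it sums the columns of a row-major matrix, first with a strided inner loop and then with a unit-stride inner loop. The `simd` pragma is a no-dependency promise, so it is only valid if `out` and `m` do not overlap.

```c
#include <stddef.h>

/* Column sums of a row-major rows x cols matrix.
 * Strided version: the inner loop walks down a column, so consecutive
 * accesses are cols elements apart. */
void col_sums_strided(const double *m, size_t rows, size_t cols, double *out)
{
    for (size_t j = 0; j < cols; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
        out[j] = s;
    }
}

/* Interchanged version: the inner loop walks along a row, so accesses
 * to both m and out are unit-stride and vectorize well.
 * The simd pragma asserts that out and m do not overlap. */
void col_sums_unit_stride(const double *m, size_t rows, size_t cols, double *out)
{
    for (size_t j = 0; j < cols; ++j)
        out[j] = 0.0;
    for (size_t i = 0; i < rows; ++i) {
        #pragma omp simd
        for (size_t j = 0; j < cols; ++j)
            out[j] += m[i * cols + j];
    }
}
```

Each column is still accumulated in the same row order, so both versions return identical sums.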

How Fast: Tools – Vector Advisor: explore vectorization

1. Compiler diagnostics + performance data + SIMD efficiency information

2. Guidance: detect problems and recommend how to fix them

3. “Accurate” trip counts: understand parallelism granularity and overheads

4. Loop-carried dependency analysis

5. Memory access pattern analysis

“Intel® Advisor’s Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors.”

Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre
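To make the loop-carried dependency item concrete, here is a small illustrative pair of loops (not from the talk; the function names are assumptions). The first has independent iterations and can safely carry a `simd` pragma; the second reads the result of the previous iteration, so forcing vectorization with a plain no-dependency pragma would give wrong results.

```c
#include <stddef.h>

/* No loop-carried dependency: iteration i touches only a[i]. */
void scale(float *a, size_t n, float k)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        a[i] *= k;
}

/* Loop-carried dependency: iteration i reads a[i - 1], which was
 * written by iteration i - 1. The loop is deliberately left without
 * a simd pragma. */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```

A dependency analysis distinguishes these two cases when the compiler cannot prove independence on its own.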

How Fast: Tools – VTune Amplifier: explore threading/CPU utilization

• Is the serial time of my application significant enough to prevent scaling?

• How efficient is my parallelization compared with ideal parallel execution?

• How much theoretical gain can I get if I invest in tuning?

• Which regions are the most promising to invest in?

• Links to the grid view give more details on inefficiencies
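The question about serial time can be reasoned about with Amdahl's law, which the slide does not name; the sketch below is an added illustration with hypothetical function names. It gives the speedup on `n` threads when a fraction `serial` of the runtime cannot be parallelized.

```c
/* Amdahl's law: speedup on n threads when a fraction `serial`
 * (between 0 and 1) of the single-thread runtime stays serial. */
double amdahl_speedup(double serial, int n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}

/* Upper bound on the speedup as the thread count grows without limit.
 * Only meaningful for serial > 0. */
double amdahl_limit(double serial)
{
    return 1.0 / serial;
}
```

For example, with 5% serial time the speedup on 64 threads is about 15.4x, and it can never exceed 20x however many threads are added.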

Deep Dive in OpenMP* for Efficiency and Scalability at Region/Barrier Level

See the wall-clock impact of inefficiencies and identify their cause.

• Which region is inefficient?

• Is the potential gain worth it?

• Why is it inefficient? Imbalance? Scheduling? Lock spinning?

Intel® Xeon Phi™ systems supported

[Diagram labels: Fork, Join, Actual Elapsed Time, Ideal Time, Potential Gain, Lock Spinning, Imbalance, Scheduling]
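As an illustration of the imbalance case, consider a loop whose iterations get progressively more expensive. The sketch below is not from the talk; the function name and the chunk size of 16 are arbitrary choices. It uses a dynamic schedule so that threads that finish early pick up more chunks instead of waiting at the barrier.

```c
/* Triangular workload: iteration i performs i + 1 units of work, so an
 * even static split would give the last threads far more work than the
 * first ones. schedule(dynamic, 16) hands out chunks of 16 iterations
 * on demand, trading some scheduling overhead for better balance. */
long triangular_work(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            total += 1;
    return total;
}
```

Smaller chunks improve balance but raise the scheduling cost, which is the other inefficiency category listed on the slide.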

Memory Profiling with VTune Amplifier XE: Memory Access Analysis

• Memory-related PMU events + tracing of memory allocations

• Metrics by function: CPU Time, Memory Bound, KNL Bandwidth Estimate (NDA)

  – KNL Bandwidth Estimate is per core and should be multiplied by the number of KNL cores

• Metrics by memory object: Loads, Stores, LLC Misses, Remote DRAM and Remote Cache accesses

• Memory objects are identified by allocation source line and call stack

• Helps identify the high-bandwidth data structures to place in MCDRAM

• Group by ‘Function / Memory Object / Allocation Stack’ -> sort by the ‘KNL Bandwidth Estimate’ metric -> expand to see memory objects -> sort by Loads

Memory Profiling with VTune Amplifier XE: Memory Access Analysis – Bandwidth

Bandwidth data for DDR and MCDRAM can be analyzed in VTune.

Summary

• Many-core architectures play a major role in reaching exascale and beyond

• Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models

• KNL is a step forward in this direction, with:

  • More cores, faster single-thread (ST) performance

  • High-bandwidth memory

  • Self-boot, with better performance per watt and no data transfer cost
