TRANSCRIPT
Intel® Xeon Phi™ – architecture,
programming models, optimization.
Dmitry Prokhorov, Dmitry Ryabtsev, Intel
Agenda
• What and Why
• Intel Xeon Phi –Top 500 insights, roadmap, architecture
• How
• Programming models - positioning and spectrum
• How Fast
• Optimization and Tools
What and Why: HPC
High-Performance Computing: the use of supercomputers and parallel processing techniques for solving complex computational problems.
What and Why: TOP 500 – “Today’s Future” of tomorrow’s mainstream HPC
What and Why: TOP 500 Highlights – Performance Projection
What and Why: TOP 500 Highlights – Top 10 List
What and Why: TOP 500 Highlights – Accelerators in Power Efficiency
What and Why: TOP 500 Highlights – Accelerators/Coprocessors
What and Why: Intel Many Integrated Core (MIC) architecture
Larrabee
+ Teraflops Research Chip
+ Competition with NVIDIA on accelerators
What and Why: Parallelization and vectorization
[Figure: four quadrants of execution – scalar, vector, parallel, and parallel + vector]
What and Why: Xeon vs. Xeon Phi
KNL Mesh Interconnect: All-to-All
Address uniformly hashed across all distributed directories
Typical Read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
[Figure: KNL mesh die diagram – grid of tiles with EDC, iMC, OPIO, IIO, PCIe, and DDR blocks, annotated with steps 1–4 of the read miss]
KNL Mesh Interconnect: Quadrant
Chip divided into four quadrants. The directory for an address resides in the same quadrant as the memory location. SW transparent.
Typical Read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
KNL Mesh Interconnect: Sub-NUMA Clustering
Each quadrant (cluster) exposed as a separate NUMA domain to the OS. Analogous to a 4-socket Xeon. SW visible.
Typical Read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
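Because sub-NUMA clustering is software visible, each quadrant shows up as a NUMA node that applications can bind to with standard tools. A minimal command-line sketch (the binary name `./app` is illustrative, and the node numbering depends on the machine's configuration):

```shell
# Inspect the NUMA topology the OS sees in SNC mode
numactl -H

# Pin a process and its allocations to one quadrant (NUMA node 1 here)
numactl --cpunodebind=1 --membind=1 ./app
```

For MPI ranks, the same effect is usually achieved through the MPI launcher's per-rank pinning options rather than a single numactl invocation.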
• Cori Supercomputer at NERSC (National Energy Research Scientific
Computing Center at LBNL/DOE) became the first publicly announced Knights
Landing based system, with over 9,300 nodes slated to be deployed in mid-2016
• “Trinity” Supercomputer at NNSA (National Nuclear Security Administration) is
a $174 million deal awarded to Cray that will feature Haswell and Knights
Landing, with acceptance phases in both late-2015 and 2016.
• Expecting over 50 system providers for the KNL host processor, in addition to
many more PCIe*-card based solutions.
• >100 Petaflops of committed customer deals to date
• The DOE* and Argonne* awarded Intel contracts for two systems (Theta and
Aurora) as a part of the CORAL* program, with a combined value of over $200
million. Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta
is the first system with greater than 8.5 petaFLOP/s and more than 2,500 nodes,
featuring the Intel® Xeon Phi™ processor (Knights Landing), Cray* Aries*
interconnect and Cray’s* XC* supercomputing platform. Scheduled for 2018,
Aurora is the second and largest system with 180-450 petaFLOP/s and
approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™
processor (Knights Hill), 2nd generation Intel® Omni-Path fabric, Cray’s* Shasta*
platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer
Storage, and persistent memory through high bandwidth on-package memory
How: Programming models
How: Positioning works – Adoption for Coprocessors in TOP 500
How: Positioning works – Adoption speed for Coprocessors in TOP 500
How: KNL positioning
Out-of-the-box performance on throughput workloads is “about the same” as Xeon, with potential for >2X performance when optimized for vectors, threads, and memory bandwidth.
Same programming model, tools, compilers, and libraries as Xeon. Boots a standard OS, runs all legacy code.
[Figure: Xeon and KNL share the same programming model, compilers, tools & libraries, and code base]
Massive thread and data parallelism and massive memory bandwidth with good single-thread performance, in an ISA-compatible standard CPU form factor.
How: Programming models on Xeon Phi
• Native (Xeon Phi)
• Offload (Xeon -> Xeon Phi)
How: Programming models on Xeon Phi – native
• Recompilation, with –xMIC-AVX512
• Vectorization: increased efficiency, use of new instructions
• MCDRAM and memory tuning: tile, 1GB pages
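A command-line sketch of the native workflow above: recompile with the KNL instruction set, then bind allocations to MCDRAM when the memory is configured in flat mode. The source and binary names are illustrative, and the MCDRAM NUMA node number (commonly 1 in quadrant/flat mode) must be checked on the actual machine:

```shell
# Native recompilation for KNL with the AVX-512 instruction set
icc -O3 -qopenmp -xMIC-AVX512 app.c -o app

# Flat-mode MCDRAM appears as a separate NUMA node; verify which one
numactl -H

# Run with all allocations placed in MCDRAM (assuming it is node 1)
numactl --membind=1 ./app
```

For working sets larger than MCDRAM, `--preferred=1` is the usual softer alternative to `--membind=1`.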
How: Offload programming model
How: Offload with pragma target in OpenMP 4.0
How: Programming models on Xeon Phi – offload
• Applicable mostly to coprocessor cards
• Incurs a cost for data transfers
• Three ways to use:
• OpenMP 4.0 “target” directives
• MKL Automatic offload
• Direct calls to the offload APIs (COI), and those built on it (e.g.,
HStreams)
• Offload over fabric implementation for self-boot
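Of the three ways listed, MKL automatic offload is the least invasive: no source changes, just an environment switch before running an unmodified MKL-using binary (the binary name is illustrative):

```shell
# Enable MKL automatic offload to Xeon Phi coprocessors
export MKL_MIC_ENABLE=1

# Run the unmodified application; eligible MKL calls (e.g. large
# DGEMMs) are split between host and coprocessor automatically
./app
```

Automatic offload only triggers for calls whose problem sizes are large enough to amortize the transfer cost.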
How Fast: Optimization BKMs
Optimization techniques are the same as for Xeon and help both:
• Loop unrolling to feed vectorization
• Loop reorganization to avoid strides
• Be careful with no-dependency pragmas
• Data layout changes for more efficient cache usage
• Moving from pure MPI to hybrid MPI+OpenMP
• Avoids data replication, intra-node communication, and increased MPI buffer sizes
• NUMA awareness for sub-NUMA clustering mode
• MPI/thread pinning with parallel data initialization
• Eliminating syncs on barriers where possible
• The more threads, the higher the barrier cost
How Fast: Tools – Vector Advisor: explore vectorization
1. Compiler diagnostics + performance data + SIMD efficiency information
2. Guidance: detect a problem and recommend how to fix it
3. “Accurate” trip counts: understand parallelism granularity and overheads
4. Loop-carried dependency analysis
5. Memory access patterns analysis
“Intel® Advisor’s Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors.”
— Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre
How Fast: Tools – VTune Amplifier: explore threading/CPU utilization
Is the serial time of my application significant enough to prevent scaling?
How efficient is my parallelization compared to ideal parallel execution?
How much theoretical gain can I get if I invest in tuning?
Which regions are the most promising to invest in?
Links to the grid view give more details on inefficiency.
Which region is inefficient? Is the potential gain worth it?
Why is it inefficient? Imbalance? Scheduling? Lock spinning?
Intel® Xeon Phi™ systems supported
Deep Dive into OpenMP* for Efficiency and Scalability at Region/Barrier Level
See the wall-clock impact of inefficiencies and identify their cause.
[Figure: actual elapsed time vs. ideal time between fork and join; potential gain broken down into lock spinning, imbalance, and scheduling]
Node
• Memory related PMU-events + tracing of memory allocations
• Metrics by function: CPU Time, Memory Bound, KNL Bandwidth Estimate (NDA)
– KNL Bandwidth Estimate - per core, should be multiplied by number of KNL cores
• Metrics by memory object: Loads, Stores, LLC Misses, Remote DRAM and Remote Cache
accesses
• Memory objects are identified by allocation source line and call stack
• Allows to define structures on high bandwidth path to put to MCDRAM
• Group by ‘Function / Memory Object / Allocation Stack’ -> Sort by ‘KNL Bandwidth Estimate’
metric -> Expand to see Memory Objects -> Sort by Loads
35
Memory Profiling with VTune Amplifier XE Memory Access
Analysis
Node
Memory Profiling with VTune Amplifier XE: Memory Access Analysis – Bandwidth
Bandwidth data for DDR and MCDRAM can be analyzed in VTune:
Summary
• Many-core architectures play a central role in reaching Exascale and beyond
• Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models
• KNL is a step forward in this direction, with:
• More cores and faster single-thread performance
• High-bandwidth memory
• Self-boot operation with better performance/watt and no data transfer cost
Intel Confidential