Preparing Your Application for Advanced Manycore Architectures
TRANSCRIPT
Katie Antypas, NERSC Services Department Head, NERSC-8 Project Lead
CSGF HPC Workshop, July 17, 2014
What is “Manycore?”
• No precise definition
• Multicore (heavyweight): slow evolution from 2–12 cores per chip; individual core performance matters more than core count
• Manycore (lightweight): jump to 30–60+ cores per chip; core count matters more than individual core performance
• Manycore cores are relatively simplified, lower-frequency computational cores with less instruction-level parallelism (ILP), engineered for lower power consumption and heat dissipation
Why Now? “Moore’s Law” will continue, but:
• Memory wall
• Power wall
• ILP wall
To sustain historic performance growth, the DOE community must prepare for new architectures
Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach, 5th ed., 2011.
[Figure: single-processor performance relative to the VAX-11/780, showing decades of exponential growth (“do not rewrite software, just buy a new machine!”) that flattens in the mid-2000s (“hmmm…”, “oh no!”)]
What Does it Mean for Users?
• While Moore’s Law continues, users will need to make changes to applications to run well on manycore architectures
• Look at sources of parallelism: new and old
Sources of Parallelism
• Domain parallelism – independent program units; explicit
• Thread parallelism – independent execution units within the program; generally explicit
• Data parallelism – same operation on multiple elements
• Instruction-level parallelism – between independent instructions in a sequential program
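The on-node sources above can be sketched in a few lines of C. This is an illustrative sketch, not code from the talk: domain parallelism (MPI ranks) is omitted, OpenMP stands in for thread parallelism, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Thread parallelism: the OpenMP "parallel for" splits iterations
 * across threads (the pragma is ignored by compilers built without
 * OpenMP, so the code also runs serially).
 * Data parallelism: the loop body applies the same operation to many
 * elements, a candidate for SIMD vectorization. */
void add_arrays(const double *a, const double *b, double *r, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        r[i] = a[i] + b[i];
    }
}

/* Instruction-level parallelism: t0 and t1 do not depend on each
 * other, so the hardware may execute them concurrently. */
double ilp_example(double x, double y) {
    double t0 = x * x;
    double t1 = y * y;
    return t0 + t1;
}
```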
[Diagram: domain parallelism as MPI ranks over x/y/z subdomains; thread parallelism as threads within each domain; data parallelism as a vectorized loop: DO I = 1, N; R(I) = B(I) + A(I); ENDDO]
Multicore vs. Manycore Parallelism
• TLP: Thread-Level Parallelism; DLP: Data-Level Parallelism
• Intel Ivy Bridge: TLP = 24 (12 cores, 2 hardware threads each); DLP = 4 (256-bit-wide vector unit)
• Intel Xeon Phi (Knights Corner): TLP = 240 (60 cores, 4 hardware threads each); DLP = 8 (512-bit-wide vector unit)
Regardless of vendor, major systems throughout the DOE complex will require the same changes
• Regardless of processor architecture, users will need to modify applications to achieve performance
  – Expose more on-node parallelism in applications
  – Increase application vectorization capabilities
  – Manage hierarchical memory
  – For co-processor architectures, locality directives must be added
Knights Landing
NERSC’s next supercomputer: Cori
Cori Configuration
• 64 cabinets of Cray XC system
  – Over 9,300 ‘Knights Landing’ compute nodes
  – Over 1,900 ‘Haswell’ compute nodes (data partition)
• Cray Aries interconnect
• Lustre file system: 28 PB capacity, 432 GB/sec peak performance
• NVRAM “Burst Buffer” for I/O acceleration
• Significant Intel and Cray application transition support
• Delivery in mid-2016; installation in the new LBNL CRT
Intel “Knights Landing” Processor
• Next-generation Xeon Phi, >3 TF peak
• Single-socket processor: self-hosted, not a co-processor, not an accelerator
• More than 60 cores per processor with support for four hardware threads each; more cores than the current-generation Intel Xeon Phi™
• Intel® “Silvermont” architecture enhanced for high performance computing
• 512-bit vector units (32 flops/clock – AVX-512)
• 3x the single-thread performance of the current-generation Xeon Phi co-processor
• High-bandwidth on-package memory, up to 16 GB capacity, with bandwidth projected to be 5x that of DDR4 DRAM
• Higher performance per watt
Cache Model
Let the hardware automatically manage the integrated on-package memory as an “L3” cache between KNL CPU and external DDR
Flat Model
Manually manage how your application uses the integrated on-package memory and external DDR for peak performance
Hybrid Model
Harness the benefits of both cache and flat models by segmenting the integrated on-package memory
Maximum performance through higher memory bandwidth and flexibility
Knights Landing Integrated On-Package Memory
[Diagram, side and top views: KNL CPU package with multiple stacks of high-bandwidth (HBW) in-package “near memory” on the package, and DDR “far memory” on the PCB]
Programming Model Considerations
• Knights Landing is a self-hosted part
  – Users can focus on adding parallelism to their applications without concerning themselves with PCI-bus transfers
• MPI + OpenMP is the preferred programming model
  – Should enable NERSC users to make robust code changes
• MPI-only will work, but performance may not be optimal
• On-package MCDRAM: how to use it optimally? Explicitly or implicitly?
Case Study
Architecture: NERSC’s Babbage Testbed
• 45 Sandy Bridge nodes, each with a Xeon Phi co-processor
• Each Xeon Phi co-processor has
  – 60 cores
  – 4 hardware threads per core
  – 8 GB of memory
• Multiple ways to program with the co-processor
  – As an accelerator
  – As a self-hosted processor (ignore the Sandy Bridge host)
  – As a reverse accelerator
• We chose to test as if the Xeon Phi were a standalone processor, to mimic the Knights Landing architecture
FLASH Application Readiness
• FLASH is an astrophysics code with explicit solvers for hydrodynamics and magnetohydrodynamics
• Parallelized using
  – MPI domain decomposition, AND
  – OpenMP multithreading over local domains or over cells in each local domain
• Target application is a 3D Sedov explosion problem
  – A spherical blast wave is evolved over multiple time steps
  – Uses a configuration with a uniform-resolution grid of 100^3 global cells
• The hydrodynamics solvers perform large stencil computations
Case study by Chris Daley
Initial best KNC performance vs. host
[Figure: runtime comparison; lower is better. Case study by Chris Daley]
Best configuration on 1 MIC card
[Figure: runtime comparison; lower is better. Case study by Chris Daley]
MIC performance study 1: thread speedup
[Figure: thread speedup; higher is better]
• 1 MPI rank per MIC card and various numbers of OpenMP threads
• Each OpenMP thread is placed on a separate core
• A 10x thread count ideally gives a 10x speedup
• Speedup is not ideal
  – But it is not the main cause of the poor MIC performance
  – ~70% efficiency at 12 threads (as would be used with 10 MPI ranks per card)
Case study by Chris Daley
Data Parallelism – Vectorization
• Vectorization is another form of on-node parallelism

DO i = 1, 100
  a(i) = b(i) + c(i)
ENDDO

• A single instruction launches many operations on different data
• Most commonly, multiple iterations of a loop can be done “concurrently” (simpler control logic, too)
• Concurrent double-precision operations: Intel Xeon Sandy Bridge/Ivy Bridge: 4; Intel Xeon Phi: 8; NVIDIA Kepler GPUs: 32 SIMT threads
• Compilers with optimization settings turned on will attempt to vectorize your code, conceptually unrolling the loop:

DO i = 1, 100, 4
  a(i)   = b(i)   + c(i)
  a(i+1) = b(i+1) + c(i+1)
  a(i+2) = b(i+2) + c(i+2)
  a(i+3) = b(i+3) + c(i+3)
ENDDO
MIC performance study 2: vectorization
• We find that most time is spent in subroutines which update the fluid state 1 grid point at a time
  – The data for 1 grid point is laid out as a structure of fluid fields, e.g. density, pressure, …, temperature next to each other: A(HY_DENS:HY_TEMP)
  – Vectorization can only happen when the same operation is performed on multiple fluid fields of 1 grid point!
[Figure: no vectorization gain; lower is better]
Case study by Chris Daley
Enabling Vectorization
• Must restructure the code
  – The fluid fields should no longer be next to each other in memory
  – A(HY_DENS:HY_TEMP) should become A_dens(1:N), …, A_temp(1:N)
  – The 1:N indicates the kernels now operate on N grid points at a time
• We tested these changes on part of a data reconstruction kernel
• The new code compiled with vectorization options gives the best performance on 3 different platforms
[Figure: performance comparison; higher is better]
Case study by Chris Daley
Case Study Summary
• FLASH on MIC
  – MPI+OpenMP parallel efficiency: OK
  – Vectorization: zero/negative gain … must restructure!
• Compiler auto-vectorization / vectorization directives do not help the current code
• Changes needed to enable vectorization
  – Make the kernel subroutines operate on multiple grid points at a time
  – Change the data layout by using a separate array for each fluid field
  – Effectively a change from array of structures (AoS) to structure of arrays (SoA)
• Tested these proof-of-concept changes on a reduced hydro kernel
  – Demonstrated improved performance on Ivy Bridge, BG/Q, and Xeon Phi platforms
Case study by Chris Daley
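The AoS-to-SoA restructuring described above can be sketched in C. This is a minimal illustration of the layout change, not FLASH code; the field names (dens, pres) and the update itself are hypothetical.

```c
#include <assert.h>
#define N 8

/* Array of structures (AoS): the fields for one grid point are
 * adjacent in memory, so a loop over grid points strides through
 * memory field-by-field and resists vectorization. */
struct point { double dens, pres; };

void update_aos(struct point p[N]) {
    for (int i = 0; i < N; i++)
        p[i].pres = 2.0 * p[i].dens;   /* one grid point at a time */
}

/* Structure of arrays (SoA): each field is a contiguous array, so
 * the loop is a unit-stride SIMD candidate over N grid points. */
struct field { double dens[N], pres[N]; };

void update_soa(struct field *f) {
    for (int i = 0; i < N; i++)
        f->pres[i] = 2.0 * f->dens[i]; /* N grid points at a time */
}
```

Both versions compute the same result; only the memory layout (and hence the compiler's ability to vectorize) differs.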
Some Initial Lessons Learned
• Improving a code for advanced architectures can improve performance on traditional architectures
• Inclusion of OpenMP may be needed just to get the code to run (or to fit within memory)
• Some codes may need significant rewriting or refactoring; others gain significantly just by adding OpenMP and vectorization
• Profiling/debugging tools and optimized libraries will be essential
• Vectorization is important for performance
Preparing your code for manycore
(More) Optimized Code
What steps should I take to optimize my application without becoming a performance/architecture expert?
• Profile application
• Analyze code vectorization
• Create MPI vs. OpenMP scaling plot
• Measure memory bandwidth contention
Step 1: Profile Code
• Determine in which routines your code spends the most time
• Find out how much memory your code uses
• Find out which MPI routines take the most time

[Figure: example profile pie chart – solve 80%, init 10%, decompose 5%, distribute 5%]

There are many tools to help you profile your code: HPC Toolkit, CrayPat, Open|SpeedShop, IPM, and many more. Go to your HPC center’s web page for information.
Step 2: Find optimal thread and task balance
• Run this experiment:
  – Hold constant the product of MPI tasks and OpenMP threads per task
  – Vary the number of MPI tasks, which results in a change to the number of OpenMP threads
Step 2.5: Assess quality of your OpenMP implementation
• Take the optimal number of threads from the previous graph and assess the efficiency of the OpenMP implementation
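The assessment above amounts to running your dominant kernel at several thread counts and computing parallel efficiency, time(1) / (t × time(t)) at t threads. A minimal sketch, assuming an OpenMP-capable compiler; kernel_sum and thread_sweep are illustrative names, not tools from the talk, and the timing calls are elided:

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* The kernel whose OpenMP efficiency we want to assess. In a real
 * code this would be the routine your profile says dominates. */
double kernel_sum(const double *x, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}

/* Sweep thread counts; on a real system wrap the kernel call in
 * omp_get_wtime() and report time(1) / (t * time(t)) for each t. */
void thread_sweep(const double *x, int n) {
    int counts[] = {1, 2, 4};
    for (int k = 0; k < 3; k++) {
#ifdef _OPENMP
        omp_set_num_threads(counts[k]);
#endif
        printf("threads=%d sum=%f\n", counts[k], kernel_sum(x, n));
    }
}
```

The #ifdef guards let the same source build and run serially when OpenMP is not enabled.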
Step 3: Run in packed vs unpacked mode
Packed
[Diagram: Hopper node, 24 cores – two sockets, each with two NUMA nodes of 6 cores and local DDR3 memory; all cores occupied]
Step 3: Run in packed vs unpacked mode
Unpacked
[Diagram: the same Hopper node, 24 cores across 4 NUMA nodes, with only some cores per NUMA node occupied]
Step 3: Run in packed and unpacked mode
• In your experiment, use the same number of cores, but the packed version will use fewer nodes.
• Quiz: the unpacked case runs more than 5 times faster than the packed version. What could be happening?
[Diagram: packed vs. unpacked core placement across the two sockets and four NUMA nodes of a Hopper node]
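The usual answer to the quiz is memory bandwidth contention: a bandwidth-bound kernel saturates each NUMA node's DDR channels long before all its cores are busy. A STREAM-style triad makes this concrete; this is an illustrative sketch (with timing elided), not the official STREAM benchmark:

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: runtime is dominated by memory traffic, so
 * running one copy per core quickly saturates a NUMA node's DDR
 * bandwidth -- which is why a "packed" run of a bandwidth-bound code
 * can be far slower per core than an "unpacked" one. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes moved per iteration: read b and c, write a = 3 * 8 bytes.
 * Achieved bandwidth ~ 24*n / elapsed_seconds. Compare a run with
 * every core busy against one using half the cores per NUMA node. */
```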
Step 4: Use on-package memory
• New!
• If your application is memory-bandwidth limited, how should you use the smaller high-bandwidth memory?
• On Knights Landing, on-package memory bandwidth is up to 5 times that of DRAM
• This is a new challenge for the HPC community, and we need your help!
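One proposed explicit interface for on-package memory is the memkind library's hbw_malloc()/hbw_free() (from <hbwmalloc.h>); an implicit alternative is binding an unmodified binary to the MCDRAM NUMA node with numactl. Both are assumptions about the eventual KNL software stack, not settled practice at the time of this talk. The sketch below uses plain malloc() so it runs anywhere, with the KNL substitutions shown in comments:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of explicit placement of a bandwidth-critical allocation.
 * On a flat-mode KNL, the memkind library's hbw_malloc()/hbw_free()
 * would place this buffer in MCDRAM; malloc() keeps the sketch
 * portable. Implicit alternative: run the unmodified binary under
 * "numactl --membind=<mcdram numa node> ./app". */
double *alloc_bandwidth_critical(size_t n) {
    /* On KNL flat mode: return hbw_malloc(n * sizeof(double)); */
    return malloc(n * sizeof(double));
}

void free_bandwidth_critical(double *p) {
    /* On KNL flat mode: hbw_free(p); */
    free(p);
}
```

The point of isolating the allocation behind these two functions is that only the arrays your profile identifies as bandwidth-bound need to move to the (smaller) high-bandwidth memory.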
Step 5: Examine code vectorization
• Vectorization analysis is performed by the compiler
• Run with -vec-report (or similar) and with -no-vec to measure the code speedup from vectorization

Loop dependency (will not vectorize):
DO I = 1, N
  A(I) = A(I-1) + B(I)
ENDDO

Loop exit (may not vectorize):
DO I = 1, N
  IF (A(I) < X) CYCLE
  ...
ENDDO

% ftn -o v.out -vec-report=3 yax.f
(col. 10) remark: LOOP WAS VECTORIZED
(col. 22) remark: loop was not vectorized: not inner loop
(col. 7) remark: loop was not vectorized: existence of vector dependence
(col. 7) remark: vector dependence: assumed FLOW dependence between line 16 and line 16
What comes next?
What’s wrong with current programming environments?

Old constraints:
• Peak clock frequency as primary limiter for performance improvement
• Cost: FLOPs are the biggest cost for the system: optimize for compute
• Concurrency: modest growth of parallelism by adding nodes
• Memory scaling: maintain byte-per-flop capacity and bandwidth
• Locality: MPI+X model (uniform costs within node & between nodes)
• Uniformity: assume uniform system performance
• Reliability: it’s the hardware’s problem

New constraints:
• Power is the primary design constraint for future HPC system design
• Cost: data movement dominates: optimize to minimize data movement
• Concurrency: exponential growth of parallelism within chips
• Memory scaling: compute growing 2x faster than capacity or bandwidth
• Locality: must reason about data locality and possibly topology
• Heterogeneity: architectural and performance non-uniformity increase
• Reliability: cannot count on hardware protection alone

This fundamentally breaks our current programming paradigm and computing ecosystem.
Slide from John Shalf
Potential future node architecture
Parameterized Machine Model (what do we need to reason about when designing a new code?)
• Cores: how many? heterogeneous? SIMD width?
• Network on chip (NoC): are the cores equidistant, or is there a constrained topology (2D)?
• On-chip memory hierarchy: automatic or scratchpad? memory coherency method?
• Node topology: NUMA or flat? topology may be important, or perhaps just distance
• Memory: nonvolatile / multi-tiered? intelligence in memory (or not)?
• Fault model for node: FIT rates, kinds of faults, granularity of faults/recovery
• Interconnect: bandwidth/latency/overhead, topology
• Primitives for data movement/sync: global address space or messaging? synchronization primitives/fences
Slide from John Shalf
In Summary – Join our postdoc program!
• We will soon be hiring 8 postdocs for the NERSC Exascale Science Applications Program
Acknowledgements
• Thanks to Harvey Wasserman, John Shalf, Chris Daley and Jack Deslippe for slides
• Thanks to the whole NERSC App Readiness team for discussion and content
Thank you
NERSC-8 system named after Gerty Cori (1896–1957), biochemist
• First American woman to be awarded a Nobel Prize in science (1947)
• Born in Prague; naturalized as a US citizen in 1928
• Shared the Nobel Prize in Physiology or Medicine with her husband and one other scientist
• Recognized for work involving enzyme chemistry in carbohydrates: how cells produce and store energy
• The breakdown of carbohydrates and the mechanism of enzyme action are of fundamental importance in renewable bioenergy (cf. DOE Complex Carbohydrate Research Center)
The Cori System
[System diagram: Cray Cascade, 64 cabinets – 9,304 KNL compute nodes; 1,920 Haswell compute nodes; 384 Burst Buffer nodes; Lustre PFS with 12 scalable units (MGS 2 nodes, MDS 4 nodes, OSS 96 nodes; DDN WolfCreek, 5 NetApp E2700); 68 LNET routers; 32 FDR IB to NGF; 14 esLogin nodes; 2 esMS nodes; 28 MOM nodes; 32 DSL nodes; 32 DVS server nodes; 10 RSIP network nodes; boot/SDB/network nodes with boot RAID, FC switch, and SMW; redundant core IB switches; GigE, 40 GigE, FDR IB, and FC links to the NERSC network]
Community Involvement
• Tri-facility app readiness meeting with the LCFs, 3/25/2014
  – Compilation of Office of Science applications
• NESAP created after learning best LCF practices
• LCFs will help evaluate NESAP proposals
• Collaboration with Berkeley Lab Computing Research Division: postdocs, IPCC, more
• Other labs and universities
• NERSC visibility will help
Early Science Program (ESP)
You must be this tall to enter the dungeon session:
• Profile application
• Define a few kernels that can fit into the memory of a KNC and can complete in a few minutes
• Analyze code vectorization with a vector report
• Create an MPI vs. OpenMP scaling plot
• Compute the operational intensity of the application
• Run the application in ‘packed’ and ‘unpacked’ mode to assess memory bandwidth contention
Programming Model Challenges (why is MPI+X not sufficient?)
• Lightweight cores are not fast enough to process complex protocol stacks at line rate
  – Simplify MPI or add thread match/dispatch extensions
  – Or use the memory address for endpoint matching
• Can no longer ignore locality (especially inside the node)
  – It’s not just memory-system NUMA issues anymore
  – The on-chip fabric is not infinitely fast (topology as a first-class citizen)
  – Relaxed consistency (or no guaranteed HW coherence)
• New memory classes & memory management
  – NVRAM, fast/low-capacity, slow/high-capacity
  – How to annotate & manage data for different classes of memory
• Asynchrony/heterogeneity
  – New potential sources of performance heterogeneity
  – Is BSP up to the task?
Things that Kill Vectorization
Compilers want to “vectorize” your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:
Loop dependency:
do i = 1, n
  a(i) = a(i-1) + b(i)
enddo

Task forking:
do i = 1, n
  if (a(i) < x) cycle
  ...
enddo
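The loop-dependency case and its usual cure can be shown side by side in C. This is an illustrative sketch: a carried dependence (each iteration reads the previous iteration's result) blocks vectorization, while an independent formulation of the loop body does not.

```c
#include <assert.h>

/* Loop dependency: a prefix sum. Each iteration needs a[i-1], so
 * iterations cannot execute concurrently and the compiler will not
 * vectorize the loop as written. */
void prefix_sum(double *a, const double *b, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];   /* carried dependence on a[i-1] */
}

/* Independent formulation: each a[i] depends only on the inputs,
 * never on an earlier iteration, so the loop vectorizes. When the
 * algorithm allows it, restructuring toward this form is the fix. */
void independent(double *a, const double *b, const double *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];     /* no carried dependence */
}
```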
Assumptions of Uniformity are Breaking (many new sources of heterogeneity)
• Heterogeneous compute engines (hybrid/GPU computing)
• Fine-grained power management makes homogeneous cores look heterogeneous
  – Thermal throttling: can no longer guarantee a deterministic clock rate
• Non-uniformities in process technology create non-uniform operating characteristics for cores on a CMP
  – Near-Threshold Voltage (NTV)
• Fault resilience introduces inhomogeneity in execution rates
  – Error correction is not instantaneous
  – And this will get WAY worse if we move toward software-based resilience
Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy
Bulk Synchronous Execution
NERSC collaborates with computer companies to deploy advanced HPC and data resources
• Hopper (N6) and Cielo (ACES) were the first Cray petascale systems with a Gemini interconnect
• Architected and deployed data platforms including the largest DOE system focused on genomics
• Edison (N7) is the first Cray petascale system with Intel processors, Aries interconnect and Dragonfly topology (serial #1)
• Cori (N8) will be one of the first large Intel KNL systems and will have unique data capabilities