TRANSCRIPT
The Challenge of Portable Parallel Programming at Scale
Prepared by Dr. Barbara Chapman, University of Houston
SimRace, Rueil-Malmaison, France, December 8, 2015
http://www.cs.uh.edu/~hpctools
On-Going Architectural Changes
• HPC system nodes continue to grow
• Thermics and power are now key in design decisions
• Massive increase in intra-node concurrency
• Trend toward heterogeneity
• Deeper, more complex memory hierarchies
[Figure: Top500 average core count per system ("No. of Cores" vs. year), two data series:]

  Year    Series 1    Series 2
  2011    15,559.8         0.0
  2012    26,855.9     1,334.5
  2013    38,692.8     8,263.0
  2014    43,298.7     9,592.3
  2015    50,463.8    13,780.4
Intel: "Sea of Blocks" Compute Model (© 2014, Intel)

[Figure: block diagram. A CE host processor (full x86: TLBs, SSE, …) with iL1/dL1/sL1 caches, an asynchronous offload engine, a tweaked decoder, a bus gasket and a bridge sit on the standard x86 on-die fabric and memory map (MC, IPM bus), connected to external DRAM & NVM and a special I/O fabric. An intra-accelerator network links many accelerator units (AU), each with its own iL1/dL1/sL1, behind shared uL2/sL2 caches.]
10+ Levels Memory, O(100M) Cores (© 2014, Intel)

[Figure: per-level storage, from ALU/RF through L1$/L1S, L2$/L2S, LL$/LLS, IPM, DDR+NVM and a disk pool, paired with approximate counts at each level:]
• O(10) cores per block
• O(100) blocks with shared L2 per die
• O(1) dies with shared LL$/SPAD per socket
• O(1) sockets with IPM per board
• O(10) boards with limited DDR+NVM per chassis
• O(1,000) chassis with large DDR+NVM per exa-machine
• O(1) machines plus disk arrays
Integration of Accelerators: CAPI and APU
• IBM's Coherent Accelerator Processor Interface (CAPI) integrates accelerators into the system architecture with a standardized protocol
• AMD's Heterogeneous System Architecture (HSA)-based APU also integrates accelerators
• Nvidia's high-speed GPU interconnect (NVLink)
[Figure: left, CAPI: a Power8 CPU with a CAPP unit on the coherence bus, talking over PCIe to an accelerator's PSL. Right, NVLink: a CPU (x86, ARM64 or POWER) and a GPU, each with its own L2, kept hardware-coherent over NVLink to a global memory.]
HPC Applications: Requirements

Growing complexity in applications
• Multidisciplinary, increasing amounts of data
• Very dense connectivity in social networks, etc.
• How do we minimize communications in apps?

Performance
• Must exploit features of emerging machines at all levels
• APIs must facilitate expression of concurrency, help save power, use memory efficiently, exploit heterogeneity, minimize synchronization

Performance portability
• Implies not just that APIs are widely supported, but also that the same code runs well everywhere
• Very hard to accomplish
• Performance is less predictable in a dynamic execution environment
Developing Future HPC Applications

Productivity
• Outside of HPC this is paramount; how important is it in HPC? Application development costs are high.

Any new HPC programming languages out there?
• DSLs will appear
• New approaches may yet emerge, esp. task-based such as Legion
• Need a reasonable migration path for existing code, along with interoperability to avoid unneeded rewrites
• The role of the application developer in detecting/overcoming errors and in efficient energy consumption is debated

Both MPI and OpenMP are being extended
• Exploring how they might work with proposed exascale runtimes
• Directive features may ultimately be integrated into the base languages
US National Strategic Computing Initiative (NSCI), Summer 2015
• Exaflops of computing power for exabytes of data
• Maintain US leadership in HPC capabilities
• Improve HPC application developer productivity
• Make HPC readily available
• Establish hardware technology for future HPC systems
• Increase capacity and capability of the HPC ecosystem
• DOD, DOE, NSF lead the effort
  • DOE: capable exascale computing program
  • NSF: scientific discovery, ecosystem, workforce
  • DOD: data-analytic computing
A Layered Programming Approach

Applications (Computational Chemistry, Climate Research, Astrophysics, …)
• New kinds of info: DSLs and other means for application scientists to provide information
• Familiar: adapted versions of today's portable parallel programming APIs (MPI, OpenMP, PGAS, Charm++)
• Custom: maybe some non-portable low-level APIs (threads, CUDA, Verilog)
• Very low-level: machine code, device-level interoperability standards, a powerful runtime
Heterogeneous Hardware
PGAS Programming Model

Partitioned Global Address Space characteristics:
• Global view of data
• Data affinity is part of the memory model
• Single-sided remote access

[Figure: several processes, each with its own CPU and local memory partition, together forming a partitioned global address space.]

Advantages:
• Representative of modern NUMA architectures
• Works well for distributed interconnects with RMA support
• Productivity and performance

Language extensions: UPC, Coarray Fortran (CAF), Titanium
Libraries: OpenSHMEM, Global Arrays, GASPI
Research: X10, Chapel, Fortress, CAF 2.0
OpenSHMEM: PGAS Library
• Partitioned Global Address Space
• Many processes running the same code, operating on different data
  • Private, or remotely accessible
• RDMA network support: InfiniBand, Cray Aries, BG/Q, …
• Implements PGAS as a library
• Standardizes SPMD one-sided communications
  • Point-to-point & strided transfers
  • Broadcasts, concatenations, reductions
  • Synchronizations
  • Remote atomic memory operations, waits, distributed locks
• Our work: reference implementation, exploring new features
Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, Lauren Smith. Introducing OpenSHMEM: SHMEM for the PGAS Community. In Proc. Fourth Conference on Partitioned Global Address Space Programming Models (PGAS'10), ACM, 2010.
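As a flavour of the library interface, here is a minimal sketch of an SPMD OpenSHMEM program in C: every PE writes its rank into a symmetric variable on its right-hand neighbour with a one-sided put. This assumes an OpenSHMEM 1.2-style implementation (shmem_init/shmem_finalize) and must be built against its headers and launched with its runner (e.g. oshrun), so it is a sketch rather than a standalone program.

```c
#include <stdio.h>
#include <shmem.h>

int main(void) {
    static int from_left = -1;   /* symmetric: exists on every PE */

    shmem_init();
    int me = shmem_my_pe();
    int n  = shmem_n_pes();

    /* One-sided put: write my rank into from_left on my right neighbour.
       The target PE does not participate in the transfer. */
    shmem_int_p(&from_left, me, (me + 1) % n);

    shmem_barrier_all();         /* ensure all puts are complete */
    printf("PE %d received %d\n", me, from_left);

    shmem_finalize();
    return 0;
}
```

The symmetric variable and the barrier are the essential PGAS ingredients: the same address is valid on every PE, and synchronization is explicit.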
OpenMP 4.0
• Released July 2013
  • http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
  • http://www.openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf
• Main changes from 3.1:
  • Accelerator extensions
  • SIMD extensions
  • Places and thread affinity
  • Taskgroup and dependent tasks
  • Error handling (cancellation)
  • User-defined reductions
• OpenMP 4.5 enhances these features
  • Released November 2015
    #pragma omp parallel
    #pragma omp for schedule(dynamic)
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    } /* implicit barrier here */
Use of OpenMP
• Moderate-size scientific, technical applications
  • Initially, Fortran binding only
• General-purpose multicore programming
  • Tasks, C and C++ bindings
• Embedded systems
  • Tasks, kernel offloads
• Large-scale parallel computations
  • Usually in conjunction with MPI
• Entry-level parallel programmers
Many application developers think that the use of directives means they don’t have to restructure code. That is not true.
The OpenMP ARB 2015
• OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which
• Interprets OpenMP • Writes new specifications - keeps OpenMP relevant • Works to increase the impact of OpenMP
• Members are organizations - not individuals • Current members
• Permanent: AMD, ARM, Cray, Fujitsu, HP, IBM, Intel, Micron, NEC, Nvidia, Oracle, Red Hat, Texas Instruments
• Auxiliary: ANL, ASC/LLNL, BSC, cOMPunity, EPCC, LANL, LBNL, NASA, ORNL, RWTH Aachen, SNL, TACC, University of Houston
• IWOMP and OpenMPCon: http://www.iwomp.org
www.openmp.org
“High-level directive-based multi-language parallelism that is performant, productive and portable”
OpenMP 4.0 Affinity
• OpenMP places and thread affinity policies
• OMP_PLACES to describe hardware regions
• proc_bind(spread | close | master) clause; the OMP_PROC_BIND environment variable also accepts true | false
  • spread: spread threads evenly among the places
  • close (shown as "compact" on the slide): collocate OpenMP threads with the master thread

[Figure: a two-chip machine with eight cores (places p0-p7) and sixteen hardware threads (t0-t15); "spread 8" distributes eight threads evenly across both chips, while "compact 4" packs four threads onto places adjacent to the master thread on chip 0.]
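As an illustration, these policies could be applied to the two-chip, eight-core machine of the figure as follows (a sketch; ./app stands for any hypothetical OpenMP executable):

```shell
# One place per core; spread 8 threads evenly across both chips:
export OMP_PLACES=cores
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
./app

# Pack 4 threads onto places close to the master thread:
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=close
./app
```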
The Tasking Concept in OpenMP

[Figure: one thread generates tasks; the other threads of the team execute them.]

Dependences between tasks are now also supported.
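The concept can be sketched in C as follows: one thread creates the top-level work inside a single region, each call spawns two child tasks, and taskwait joins them. The Fibonacci kernel is illustrative only; a real code would add a serial cutoff to keep tasks coarse enough. Without an OpenMP compiler the pragmas are ignored and the code simply runs sequentially.

```c
#include <assert.h>

/* Naive Fibonacci with OpenMP explicit tasks. x and y live on the parent's
 * stack and are shared with the child tasks; taskwait guarantees both
 * children have completed before the sum is formed. */
long fib(long n) {
    long x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

/* Driver: the team is created once, and a single thread seeds the task tree. */
long fib_par(long n) {
    long r;
    #pragma omp parallel
    #pragma omp single
    r = fib(n);
    return r;
}
```

Declaring x and y shared is essential: by default they would be firstprivate in the child tasks and the parent would never see the results.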
OpenMP for Accelerators

    while ((k <= mits) && (error > tol)) {
        /* a loop copying u[][] to uold[][] is omitted here */
        ...
    #pragma omp parallel for private(resid,j,i) reduction(+:error)
        for (i = 1; i < (n-1); i++)
            for (j = 1; j < (m-1); j++) {
                resid = (ax * (uold[i-1][j] + uold[i+1][j])
                       + ay * (uold[i][j-1] + uold[i][j+1])
                       + b * uold[i][j] - f[i][j]) / b;
                u[i][j] = uold[i][j] - omega * resid;
                error = error + resid * resid;
            }
        /* rest of the code omitted ... */
    }

To offload this loop nest, the example adds:

    #pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, \
            f[0:n][0:m]) map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
    #pragma omp target device(gpu0)
Early Experiences With The OpenMP Accelerator Model; Chunhua Liao, Yonghong Yan, Bronis R. de Supinski, Daniel J. Quinlan and Barbara Chapman; International Workshop on OpenMP (IWOMP) 2013, September 2013
New and Coming Soon… OpenMP 4.5
• Device construct enhancements
  • More control and flexibility in data movement between host and devices; multiple device types
  • Asynchronous support with nowait and depend
  • "Deep copy" for pointer-based structures/objects
• Loop parallelism enhancements
  • Extended ordered clause to support doacross (e.g. wavefront) parallelism for loop nests
  • New taskloop construct for asynchronous loop parallelism with control over task grain size
• Array reductions for C and C++, and more…

OpenMP 5.0 discussions already under way
• Interoperability and composability
• Locality and affinity
• General error model
• Transactional memory
• Additional looping constructs
• Enhanced tasking support
4.5 specification at www.openmp.org
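The new taskloop construct can be sketched as follows (square_all is a hypothetical helper; grainsize controls how many iterations each generated task executes, and without an OpenMP 4.5 compiler the pragmas are ignored and the loop runs sequentially):

```c
#include <assert.h>

/* Square each element of in[] into out[] using OpenMP 4.5 taskloop.
 * A single thread encounters the loop and carves it into tasks of roughly
 * 64 iterations each; the whole team then executes those tasks. */
void square_all(const double *in, double *out, int n) {
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(64)
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
```

Unlike a plain parallel for, the iterations become asynchronous tasks, so taskloop composes naturally with other task-based code and with the depend-driven execution the slide mentions.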
OpenACC Programming Model
• Announced at Supercomputing 2011
• Initial work by NVIDIA, Cray, PGI, CAPS
• Directive-based programming for accelerators
  • For Fortran, C, C++
  • Loop-based computations
• Compilers: PGI, Cray, CAPS, OpenARC, OpenUH, GCC (4.9)

OpenACC Features
• High-level directive-based programming API for accelerators such as GPUs, APUs, Intel's Xeon Phi, FPGAs and even DSP chips
• Data directives: copy, copyin, copyout, etc.
• Data synchronization directive: update
• Compute directives
  • parallel: more control to the user
  • kernels: more freedom to the compiler
• Three levels of parallelism: gang, worker and vector
• Commercial OpenACC compilers: PGI, Cray, PathScale
• Open-source OpenACC compilers: GCC 5.0, OpenARC, OpenUH, RoseACC, etc.
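A minimal sketch of the parallel compute directive and the data clauses on a SAXPY kernel (the clause choices shown are one plausible mapping, not the only one; a compiler without OpenACC support ignores the pragma and runs the loop on the host):

```c
#include <assert.h>

/* SAXPY with OpenACC: y = a*x + y.
 * copyin moves x to the device, copy moves y in and back out.
 * restrict tells the compiler the arrays do not alias. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With kernels instead of parallel loop, the compiler itself would decide how (and whether) to parallelize the loop, which is exactly the control/freedom trade-off listed above.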
OpenACC Compiler Translation
• Need to achieve coalesced memory access on GPUs
Compiling a High-level Directive-Based Programming Model for GPGPUs; Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman; 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013)
Performance Portability Across Compilers?
• The same OpenACC thread settings do not guarantee the best performance for both the OpenUH and PGI compilers (PGI 15.7)

[Figures: OpenUH performance for micro-RTM; PGI performance for micro-RTM]
Research into Future Runtimes
• Roles and responsibilities of runtime system? • Relationship between RT and OS, programming models? • How is information exchanged? • Does RT really have to be dynamic?
What to Model?

Cost models:
• Processor model: loop overhead; machine cost (computational resource cost, dependency latency cost, register spill cost, operation cost, issue cost, mem_ref cost, TLB cost)
• Cache model: cache cost
• Parallel model: parallel overhead, reduction cost, cache cost
[Figure: HT3 bandwidth (MB/s) vs. thread configuration (# of remote + # of local threads) on two Istanbul processors; measured bandwidth ranges from about 2,100 to 8,900 MB/s depending on the configuration.]
Open Community Runtime (OCR)
• Data abstraction: data-blocks (DBs)
  • Explicitly created and destroyed
  • The only available "global" memory
  • Data-blocks can move
• Compute abstraction: event-driven tasks (EDTs)
  • EDTs are fine-grained computation units
  • Execute only after all input data-blocks are ready
  • EDTs can create new EDTs and DBs
• Synchronization abstraction: events
  • Events help set up EDT dependences
  • A "producer" EDT delivers a data-block to an event
  • The data-block is "consumed" by dependent EDTs
• Globally visible namespace: GUIDs

[Figure: EDTs connected through events and data-blocks; all objects are named by GUIDs.]
HPX
• Prototype exascale software stack
• HPX is a low-level programming interface and runtime system
• MPI, OpenMP over HPX
• Execution model
  • Dynamic adaptive resource management
  • Message-driven computation
  • Efficient synchronization
  • Global name space
  • Lightweight threads
  • Task scheduling; thread migration for load balancing, throughput

[Figure: XPRESS migration stack. An MPI/OpenMP application runs either on the legacy stack (OpenMP compiler, MPI) or via a thin OpenMP runtime glue over HPX, within the OpenX stack.]

OpenMP over HPX (on-going work)

OpenMP translation maps OpenMP task and data parallelism onto HPX:
• Exploits data-flow execution capabilities at scale
• Big increase in throughput for fine-grained tasks
• No direct interface to OS threads; no tied tasks
• Threadprivate is tricky and slow; places are not supported
• OpenMP task dependencies via futures
• HPX locks are faster than OS locks
[Figure: LU run time in seconds on 40 threads, problem size 8192, for block sizes from 512 down to 64, comparing the Intel OpenMP runtime with OpenMP over HPX.]
Auto-Generation of Tasks

What is the "right" way for an application developer to express tasks? An old example: the compiler translates OpenMP into a collection of tasks and a task graph, and analyzes data usage per task.

T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002.

What is the "right" size of a task? It might need to be adjusted at run time: the runtime trades off load balance against co-mapping of tasks that use the same data, and in-situ mappings execute a task where its data is.
Productive Application Development and Deployment Environment

A model to adjust the number of threads, the schedule and other policies to optimize both energy and execution performance:
• Restructuring / refactoring
• High accuracy in information on execution behavior
• Models to guide parameter choices
• Dynamic adjustment adapts to the current state of the system and the application / data
• Open interactions across the software stack
Where are Directives Headed?

Directives are a convenient means to express parallelism
- Enable the user to retain Fortran / C / C++ … code

OpenMP continues to evolve to meet new needs
- Strong HPC representation; prescriptive model
- Data locality, affinity; reduced reliance on barriers

OpenACC is a more focused effort, with faster progress
- Will it be subsumed by OpenMP?

Both require (node) code restructuring. Is there a subset of OpenMP (and a programming style) that will work particularly well for exascale nodes?
- Or do we need a new set of directives to generate tasks?
Life is Short
Questions?