TRANSCRIPT
An Autonomic Performance Environment for Exascale
Kevin A. Huck¹, Nick Chaimov¹, Allen D. Malony¹, Sameer Shende¹, Allan Porterfield², Rob Fowler²
¹University of Oregon, Eugene, Oregon, USA  ²RENCI, Chapel Hill, North Carolina, USA
2 An Autonomic Performance Environment for Exascale
What I am going to tell you… • XPRESS project overview – OpenX, ParalleX, HPX etc.
• Asynchronous tasking runtimes, tool challenges • APEX overview • Examples
3 An Autonomic Performance Environment for Exascale
XPRESS • DOE X-Stack project to develop the OpenX software
stack, implementing the ParalleX execution model – Sandia National Laboratories (R. Brightwell) – Indiana University (T. Sterling, A. Lumsdaine) – Lawrence Berkeley National Laboratory (A. Koniges) – Louisiana State University (H. Kaiser) – Oak Ridge National Laboratory (C. Baker) – University of Houston (B. Chapman) – University of North Carolina (A. Porterfield, R. Fowler) – University of Oregon (A. Malony, K. Huck, S. Shende)
• http://xstack.sandia.gov/xpress/
4 An Autonomic Performance Environment for Exascale
[Figure 1: Major components of the OpenX architecture. Application layer: legacy applications (MPI, OpenMP), new model applications, domain-specific languages, a metaprogramming framework, domain-specific active libraries, XPI, and a compiler. Runtime system instances (HPX): AGAS (name space, processes), LCOs (dataflow, futures, synchronization), lightweight threads with a context manager, and parcels (message-driven computation). Prime Medium: the interface/control layer between runtime and OS. Operating system instances (distributed framework OS): task recognition, address space control, memory bank control, OS threads, network drivers, and instrumentation. Hardware architecture: roughly 10^6 nodes x 10^3 cores per node plus integration network.]
The two major components of the XPRESS OpenX software architecture are the LXK operating system and the HPX-4 runtime system software. Between these two is a major interface protocol, the "Prime Medium," that supports bi-directional complex interoperability between the operating system and the runtime system. The interrelationship is mutually supporting, bi-directional, and dynamically adaptive.
Figure 1 shows key elements of the runtime system, including lightweight user multithreading, parcel message-driven computation, LCO local control objects for sophisticated synchronization, and AGAS active global address space and processes. Each application has its own ephemeral instance of the HPX runtime system. The figure also shows the LXK operating system on the other side of the Prime Medium, consisting of a large ensemble of persistent lightweight kernel supervisors, each dedicated to a particular node of processor, memory, and network resources. It is the innovation and power of the OpenX software architecture that the runtime system and the operating system employ each other for services. Both are simple in design but achieve complexity of operation through a combination of high replication and dynamic interaction.
XPI, the low-level imperative API, will serve both as a readable interface for system software development and early experimental application programming with which to conduct experiments. It will also serve as a target for source-to-source compiler translation from high-level APIs and languages. XPI will represent the ParalleX semantics as closely as possible in a familiar form: the binding of calls to the C programming language. Part of the compiler challenge is to bridge XPI to the HPX runtime system.
Domain-specific programming will be provided through a metaprogramming framework that will permit rapid development of DSLs for diverse disciplines. An example of one such DSL derived from the meta-toolkit will be developed by the XPRESS project. Finally, targeting either XPI or native calls of the HPX runtime will be compilation strategies and systems to translate MPI and OpenMP legacy codes to a form that can be run by OpenX with performance at least as good as a native code implementation. (While it is desired that advance work in declarative programming styles, especially in the area of parallelism, be derived and [text truncated in the original excerpt]
[Slide annotations on the figure: Application Layer, Runtime Layer, OS Layer.]
OpenX architecture
5 An Autonomic Performance Environment for Exascale
ParalleX, HPX, and OpenX • ParalleX is an experimental execution model – Theoretical foundation of task-based parallelism – Runs in a distributed computing context – Supports integration of the entire system stack
• HPX is the runtime system that implements ParalleX – Exposes a uniform, standards-oriented API – Enables writing fully asynchronous code using hundreds of
millions of threads – Provides unified syntax and semantics for local and remote
operations – HPX-3 (LSU, C++ language/Boost) – HPX-5 (IU, C language)
• XPRESS will develop an OpenX exascale software stack based on ParalleX
6 An Autonomic Performance Environment for Exascale
ParalleX / HPX • Governing principles (components) – Active global address space (AGAS) instead of PGAS
• Enables seamless distributed load balancing and adaptive data placement and locality control
– Message-driven via Parcels instead of message passing • Enables transparent latency hiding and event driven computation • Parcels are a form of active messages
– Local control objects (LCOs) instead of global barriers • Reduces global contention and improves parallel efficiency • Synchronization semantics
– Moving work to data instead of moving data to work • Basis for asynchronous parallelism and dataflow style computation
– Fine-grained parallelism of lightweight Threads (vs. CSP) • Improves system and resource utilization
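The future/dataflow style these principles describe can be illustrated with plain C++ futures. This is not HPX code (HPX exposes a much richer API for remote actions and continuations), just a minimal sketch of the shape of the computation: independent tasks return futures, and dependent work runs only once its inputs are ready.

```cpp
// Plain C++ sketch (not HPX) of future-based, dataflow-style task composition:
// two independent partial sums run as asynchronous tasks, and a dependent
// "continuation" combines them when both results are available.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> chunk(1000, 1.0);

    // Launch two independent tasks; nothing blocks until a result is needed.
    auto partial_a = std::async(std::launch::async, [&] {
        return std::accumulate(chunk.begin(), chunk.begin() + 500, 0.0);
    });
    auto partial_b = std::async(std::launch::async, [&] {
        return std::accumulate(chunk.begin() + 500, chunk.end(), 0.0);
    });

    // Dependent task: fires only when both inputs are ready.
    auto total = std::async(std::launch::async, [&] {
        return partial_a.get() + partial_b.get();
    });

    std::cout << "sum = " << total.get() << "\n";   // prints 1000
    return 0;
}
```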
7 An Autonomic Performance Environment for Exascale
HPX-3 Architecture
[Figure: HPX-3 architecture: process manager, local memory management, AGAS translation, parcel port and parcel handler over the interconnect, action manager, thread manager and thread pool, local control objects, performance counters, and performance monitor.]
8 An Autonomic Performance Environment for Exascale
HPX-5 Architecture
9 An Autonomic Performance Environment for Exascale
OpenX Software Stack and APEX
[Figure 1 from the XPRESS proposal repeated (see slide 5): major components of the OpenX architecture, with instrumentation indicated across the application, runtime, and OS layers.]
* From XPRESS proposal
10 An Autonomic Performance Environment for Exascale
Runtime Adaptation Motivation • Controlling concurrency – Energy efficiency – Performance
• Parametric variability – Granularity for this machine / dataset?
• Load Balancing – When to perform AGAS migration?
• Parallel Algorithms (for_each…) – Separate what from how
• Address the “SLOWER” performance model
11 An Autonomic Performance Environment for Exascale
Tool Challenges • Runtime wants to manage pre-emption – Task starts on one OS thread, at some “waiting” point
is de-scheduled by the task manager (changes the stack), resumes on another (thread/core/node)
– Where/how to instrument?
• Overhead concerns – ParalleX goal: million-way concurrency – NOTHING* shared between threads
• Need access to profile for decision control • 2 implementations of HPX
12 An Autonomic Performance Environment for Exascale
APEX and Autonomics • Performance awareness and performance adaptation • Top down and bottom up performance mapping / feedback – Make node-wide resource utilization data and analysis, energy
consumption, and health information available in real time – Associate performance state with policy for feedback control
• APEX introspection – OS (LXK) track system resource assignment, utilization, job
contention, overhead – Runtime (HPX) track threads, queues, concurrency, remote
operations, parcels, memory management – ParalleX, DSLs and legacy codes allow language-level
performance semantics to be measured
13 An Autonomic Performance Environment for Exascale
APEX Information Flow
[Figure: the Application and HPX generate events; APEX Introspection collects them synchronously (triggered events) and asynchronously (periodic sampling), together with RCR Toolkit data, to build APEX State; the APEX Policy Engine evaluates triggered and periodic policies against that state.]
14 An Autonomic Performance Environment for Exascale
APEX Global Flow (as needed)
Leverage the runtime (HPX) to provide global introspection, state, and policies.
[Figure: Node 1 through Node N, each running HPX with a local APEX Policy Engine and State, exchanging introspection data through the runtime.]
15 An Autonomic Performance Environment for Exascale
APEX Introspection • APEX collects data through “inspectors” – Synchronous uses an event API and event “listeners”
• Initialize, terminate, new thread – added to HPX runtime • Timer start, stop, yield*, resume* - added to HPX task scheduler • Sampled value (counters from HPX-5, HPX-3) • Custom events (meta-events)
– Asynchronous do not rely on events, but occur periodically • APEX exploits access to performance data from
lower stack components – Reading from the RCR blackboard (i.e., power, energy) – “Health” data through other interfaces (/proc/stat, cpuinfo,
meminfo, net/dev, self/status, lm_sensors, power*, etc.)
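A minimal, self-contained sketch of the synchronous event flow just described; the apex_* names below are illustrative stand-ins rather than the exact APEX signatures. A timer start event returns a handle, the matching stop event reports the elapsed time, and a sampled-value event carries a counter (for example, one exposed by HPX) into the profile.

```cpp
// Illustrative stand-ins for the synchronous event API: start/stop timer
// events plus a sampled-value event. Not the actual APEX function signatures.
#include <chrono>
#include <cstdio>
#include <string>

struct apex_profiler {                      // handle returned by a "start" event
    std::string name;
    std::chrono::steady_clock::time_point t0;
};

apex_profiler apex_start(const std::string& task) {
    return {task, std::chrono::steady_clock::now()};   // listener records a timestamp
}

void apex_stop(const apex_profiler& p) {    // listener would queue this for aggregation
    auto dt = std::chrono::duration<double>(std::chrono::steady_clock::now() - p.t0);
    std::printf("timer  %-24s %.6f s\n", p.name.c_str(), dt.count());
}

void apex_sample_value(const std::string& counter, double value) {
    std::printf("sample %-24s %g\n", counter.c_str(), value);   // e.g. an HPX counter
}

int main() {
    apex_profiler p = apex_start("my_task");    // task begins on some worker thread
    // ... lightweight task body would run here ...
    apex_stop(p);
    apex_sample_value("thread_queue_length", 42.0);
    return 0;
}
```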
16 An Autonomic Performance Environment for Exascale
RCR: Resource Centric Reflection • Performance introspection across layers to enable
dynamic, adaptive operation and decision control • Extends previous work on building decision support
instrumentation (RCRToolkit) for introspective adaptive scheduling
• Daemon monitors shared, non-core resources • Real-time analysis, raw/processed data published to
shared memory region, clients subscribe • Utilized at lower levels of the OpenX stack • APEX introspection and policy components access
and evaluate RCR values
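A rough sketch of the subscriber side of such a shared-memory blackboard, using POSIX shared memory. The region name and record layout here are hypothetical; the real RCRToolkit defines its own publication format.

```cpp
// Hypothetical blackboard subscriber: map a shared-memory region published by
// a monitoring daemon and read the latest power/energy values from it.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

struct rcr_record {              // hypothetical published record layout
    double node_power_watts;
    double node_energy_joules;
};

int main() {
    int fd = shm_open("/rcr_blackboard", O_RDONLY, 0);   // clients open read-only
    if (fd < 0) { std::perror("shm_open"); return 1; }

    void* addr = mmap(nullptr, sizeof(rcr_record), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) { std::perror("mmap"); return 1; }

    const rcr_record* rec = static_cast<const rcr_record*>(addr);
    std::printf("power: %.1f W  energy: %.1f J\n",
                rec->node_power_watts, rec->node_energy_joules);

    munmap(addr, sizeof(rcr_record));
    return 0;
}
```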
17 An Autonomic Performance Environment for Exascale
APEX Event Listeners • Profiling listener – Start event: input name/address, get timestamp, return profiler
handle – Stop event: get timestamp, put profiler object in a queue for
back-end processing, return – Sample event: put the name & value in the queue – Asynchronous consumer thread: process profiler objects and
samples to build statistical profile (in HPX-3, processed/scheduled as a thread/task)
• Concurrency listener (postmortem analysis) – Start event: push timer ID on stack – Stop event: pop timer ID off stack – Asynchronous consumer thread: periodically log current timer
for each thread, output report at termination
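The profiling-listener pattern above (cheap start/stop events, with aggregation deferred to an asynchronous consumer thread) might look roughly like the sketch below; the names and data structures are illustrative only, not the APEX implementation.

```cpp
// Sketch of the profiling-listener pattern: stop events push small records onto
// a queue, and an asynchronous consumer thread folds them into a statistical
// profile so the event path on the worker threads stays cheap.
#include <condition_variable>
#include <cstdio>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct profile_record { std::string name; double seconds; };
struct profile_entry  { long calls = 0; double total = 0.0; };

std::queue<profile_record> work;                 // filled by stop events
std::mutex                 mtx;
std::condition_variable    cv;
bool                       done = false;         // guarded by mtx
std::map<std::string, profile_entry> profile;    // the statistical profile

void on_stop_event(const std::string& name, double seconds) {
    { std::lock_guard<std::mutex> lk(mtx); work.push({name, seconds}); }
    cv.notify_one();                             // defer aggregation to the consumer
}

void consumer() {                                // asynchronous back-end processing
    std::unique_lock<std::mutex> lk(mtx);
    for (;;) {
        cv.wait(lk, [] { return done || !work.empty(); });
        while (!work.empty()) {
            profile_record r = work.front(); work.pop();
            profile[r.name].calls += 1;
            profile[r.name].total += r.seconds;
        }
        if (done) break;
    }
}

int main() {
    std::thread t(consumer);
    on_stop_event("do_work_action", 0.002);      // as if emitted by stop events
    on_stop_event("do_work_action", 0.003);
    { std::lock_guard<std::mutex> lk(mtx); done = true; }
    cv.notify_one();
    t.join();
    for (const auto& kv : profile)
        std::printf("%-20s %ld calls  %.4f s total\n",
                    kv.first.c_str(), kv.second.calls, kv.second.total);
    return 0;
}
```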
18 An Autonomic Performance Environment for Exascale
APEX Policy Listener • Policies are rules that decide on outcomes based on
observed state – Triggered policies are invoked by introspection API events – Periodic policies are run periodically on asynchronous thread
• Policies are registered with the Policy Engine – Applications, runtimes, and/or OS register callback functions
• Callback functions define the policy rules – “If x < y then…”
• Enables runtime adaptation using introspection data – Engages actuators across stack layers – Could also be used to involve online auto-tuning support*
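A minimal sketch of the registration pattern described above, with hypothetical registration functions (the actual APEX policy API differs in detail): a callback captures the "if x < y then ..." rule, and the policy engine invokes it either when an event triggers it or periodically on the asynchronous thread.

```cpp
// Hypothetical policy registration and evaluation. The apex_context fields,
// registration functions, and thresholds are illustrative stand-ins.
#include <cstdio>
#include <functional>
#include <vector>

struct apex_context { double thread_queue_length; };        // observed state given to policies
using apex_policy_fn = std::function<void(const apex_context&)>;

static std::vector<apex_policy_fn> triggered_policies;      // run when an event fires
static std::vector<apex_policy_fn> periodic_policies;       // run on the asynchronous thread

void register_triggered_policy(apex_policy_fn f) { triggered_policies.push_back(std::move(f)); }
void register_periodic_policy (apex_policy_fn f) { periodic_policies.push_back(std::move(f)); }

static int thread_cap = 24;                                  // an actuator the policy engages

int main() {
    // Triggered rule: react to a sample event.
    register_triggered_policy([](const apex_context& ctx) {
        if (ctx.thread_queue_length < 1.0) std::printf("queue nearly empty\n");
    });
    // Periodic "if x < y then ..." rule: adjust the thread cap actuator.
    register_periodic_policy([](const apex_context& ctx) {
        if (ctx.thread_queue_length < 2.0 && thread_cap > 1)       thread_cap -= 1;
        else if (ctx.thread_queue_length > 8.0 && thread_cap < 24) thread_cap += 1;
    });

    apex_context observed{1.5};                              // a fake observation
    for (auto& p : triggered_policies) p(observed);          // as if an event fired
    for (auto& p : periodic_policies)  p(observed);          // as if the period elapsed
    std::printf("thread cap is now %d\n", thread_cap);
    return 0;
}
```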
19 An Autonomic Performance Environment for Exascale
APEX Global View • All APEX introspection is collected locally – However, it is not limited to a single-node view
• Global view of introspection data and interactions – Take advantage of the distributed runtime support
• HPX3, HPX5, MPI, …
• API provided for back-end implementations – apex_global_get_value() – each node gets data to be
reduced, optional AGAS/RDMA put (push model) – apex_global_reduce() – optional AGAS/RDMA get (pull
model), node data is aggregated at root node, optional broadcast back out
• Extends global view for policies
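Since MPI is listed as one possible back end, the reduce step might be sketched as below. This is only an illustration of the push/aggregate/broadcast pattern behind apex_global_get_value() and apex_global_reduce(); it is not the actual APEX back-end code, and the local value shown is a stand-in.

```cpp
// Illustrative MPI back-end pattern for the global view: each node contributes
// its locally observed value, the root aggregates, and the result is optionally
// broadcast back so local policies can act on global state.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Stand-in for the per-node value gathered by apex_global_get_value(),
    // e.g. the node's current thread queue length or power draw.
    double local_value = 10.0 + rank;

    // Stand-in for apex_global_reduce(): aggregate at the root node ...
    double global_sum = 0.0;
    MPI_Reduce(&local_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // ... then optionally broadcast the aggregate back out to every node.
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double global_mean = global_sum / size;
    if (rank == 0) std::printf("global mean = %f\n", global_mean);

    MPI_Finalize();
    return 0;
}
```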
20 An Autonomic Performance Environment for Exascale
APEX Examples • HPX-3 1-D stencil code • HPX-5 LULESH kernel • Experiments conducted on Edison – Cray XC30 @ NERSC.gov – 5576 nodes with two 12-core Intel "Ivy Bridge"
processors at 2.4 GHz – 48 threads per node (24 physical cores w/
hyperthreading) – Cray Aries interconnect with Dragonfly topology and
23.7 TB/s global bandwidth
21 An Autonomic Performance Environment for Exascale
Concurrency Throttling for Performance
• Heat diffusion • 1D stencil code • Data array
partitioned into chunks
• 1 node with no hyperthreading
• Performance increases to a point with increasing worker threads, then decreases
[Plot: 1d_stencil (HPX-3) runtime in seconds vs. number of worker threads (1 to 24).]
22 An Autonomic Performance Environment for Exascale
Concurrency Throttling for Performance
• Region of maximum performance correlates with thread queue length runtime performance counter – Represents # tasks currently waiting to execute
• Introspection of this counter could drive a concurrency throttling policy (*work in progress)
[Plot: 1d_stencil runtime (seconds) and thread queue length vs. number of worker threads (1 to 24).]
23 An Autonomic Performance Environment for Exascale
1d_stencil Baseline
• 48 worker threads (with hyperthreading)
• Actual concurrency much lower – Implementation is
memory bound • Large variation in
concurrency over time – Tasks waiting on
prior tasks to complete
[Plot: concurrency and power over time for the 1d_stencil baseline (100,000,000 elements, 1000 partitions; 138 seconds). Per-task concurrency is shown for partition_data, do_work_action, hpx::lcos::local::dataflow::execute, primary_namespace_bulk_service_action, primary_namespace_service_action, and other, along with the thread cap and power. Slide annotations: "Event-generated metrics"; "Where calculation takes place".]
24 An Autonomic Performance Environment for Exascale
1d_stencil w/optimal # of Threads
• 12 worker threads • Greater proportion
of threads kept busy – Less interference
between active threads and threads waiting for memory
• Much faster – 61 sec. vs 138
sec.
[Plot: concurrency and power over time with 12 worker threads (100,000,000 elements, 1000 partitions; 61 seconds). Same task legend as the baseline plot, plus thread cap and power.]
25 An Autonomic Performance Environment for Exascale
1d_stencil Adaptation with APEX
• Initially 48 worker threads
• Discrete hill-climbing search to minimize the average # of pending tasks
• Converges on 13 (vs. optimal of 12)
• Nearly as fast as optimal – 64 seconds vs.
61 seconds
[Plot: concurrency and power over time with APEX adaptation enabled (100,000,000 elements, 1000 partitions; 64 seconds). Same task legend as the baseline plot; the thread cap trace shows the adaptation converging.]
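The discrete hill-climbing search might be sketched as follows: each period, compare the newly observed average number of pending tasks with the previous observation, and keep stepping the thread cap in whichever direction improves it. This is an illustration of the approach under stated assumptions, not the actual APEX policy code.

```cpp
// Illustrative discrete hill climber over the thread cap, minimizing an
// observed cost (here, the average number of pending tasks per period).
#include <cstdio>

struct hill_climber {
    int    cap       = 48;     // current thread cap (start with all workers)
    int    step      = -1;     // search direction: -1 shrink, +1 grow
    double last_cost = 1e30;   // previous average pending-task count

    int next_cap(double observed_cost) {
        if (observed_cost > last_cost) step = -step;   // got worse: reverse direction
        last_cost = observed_cost;
        cap += step;
        if (cap < 1)  cap = 1;                         // stay within legal bounds
        if (cap > 48) cap = 48;
        return cap;
    }
};

int main() {
    hill_climber hc;
    // Fake observations of "average pending tasks" for successive periods.
    const double samples[] = {40.0, 35.0, 30.0, 28.0, 29.5, 28.5};
    for (double s : samples)
        std::printf("observed %.1f -> new thread cap %d\n", s, hc.next_cap(s));
    return 0;
}
```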
26 An Autonomic Performance Environment for Exascale
1D stencil – adapting grain size
[Plot: concurrency and grain size parameter value over time while adapting grain size; tasks shown are hpx::lcos::local::dataflow::execute, hpx_main, and other, along with the grain_size_parameter trace.]
27 An Autonomic Performance Environment for Exascale
OpenMP version: adapting concurrency and block size with an ActiveHarmony policy
[Plot: concurrency and block size over time for the OpenMP version under the ActiveHarmony policy; tasks shown are the outlined solve_iteration OpenMP regions (omp_fn.2, omp_fn.3) and idle, along with the thread cap and power.]
28 An Autonomic Performance Environment for Exascale
Throttling for Power • Livermore Unstructured Lagrangian Explicit Shock
Hydrodynamics (LULESH) – Proxy application in DOE co-design efforts for exascale – CPU bound in most implementations (HPX-5 used here)
• Develop an APEX policy for power – Threads are idled in HPX to keep node under power cap – Use hysteresis of last 3 observations – If Power < low cap increase thread cap – If Power > high cap decrease thread cap
• HPX thread scheduler modified to idle/activate threads per cap • Test example:
– 343 domains, nx = 48, 100 iterations – 16 nodes of Edison, Cray XC30 – Baseline vs. Throttled (200W per node high power cap)
LULESH image source: Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory, Technical Report LLNL-TR-490254
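A sketch of the power-capping rule described above, assuming a hysteresis window of the last three power readings: if the windowed power stays above the high cap, lower the thread cap; if it stays below the low cap, raise it. The thresholds, step size, and names are stand-ins, not the actual HPX/APEX implementation.

```cpp
// Illustrative power-cap policy with hysteresis over the last three power
// observations. The HPX scheduler would then idle or activate worker threads
// to honor the resulting thread cap.
#include <algorithm>
#include <cstdio>
#include <deque>

struct power_policy {
    double low_cap    = 180.0;   // watts per node (stand-in threshold)
    double high_cap   = 200.0;   // watts per node (the cap used in the experiment)
    int    thread_cap = 768;     // threads across the allocation
    std::deque<double> window;   // last three power observations

    void observe(double watts) {
        window.push_back(watts);
        if (window.size() > 3) window.pop_front();
        if (window.size() < 3) return;                      // wait for a full window

        double avg = (window[0] + window[1] + window[2]) / 3.0;
        if      (avg > high_cap) thread_cap = std::max(thread_cap - 8, 1);
        else if (avg < low_cap)  thread_cap = std::min(thread_cap + 8, 768);
    }
};

int main() {
    power_policy p;
    const double readings[] = {230.0, 225.0, 218.0, 210.0, 205.0, 190.0, 178.0, 175.0, 174.0};
    for (double w : readings) {
        p.observe(w);
        std::printf("power %.0f W -> thread cap %d\n", w, p.thread_cap);
    }
    return 0;
}
```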
29 An Autonomic Performance Environment for Exascale
LULESH Baseline
• No power cap, no thread cap
• 768 threads
• Avg. 247 Watts/node
• ~360 kilojoules total
• ~91 seconds total; ~74 seconds in HPX tasks
[Plot: concurrency and power over the ~91-second aprun execution; per-action concurrency (advanceDomain, SBN1/SBN3 result and send actions, PosVel and MonoQ result and send actions, LCO get/set operations, probe, and other), thread cap, and power.]
30 An Autonomic Performance Environment for Exascale
LULESH Throttled by Power Cap
• 200 W power cap
• 768 threads throttled to 220
• Avg. 186 Watts/node
• ~280 kilojoules total
• ~94 seconds total; ~77 seconds in HPX tasks
[Plot: concurrency and power over the ~94-second aprun execution; same per-action legend as the baseline, plus thread cap and power. The thread cap trace drops as threads are idled to stay under the cap.]
31 An Autonomic Performance Environment for Exascale
LULESH Performance Explanation • 768 vs. 220 threads (after throttling) • 360 kJ vs. 280 kJ total energy consumed (~22% decrease) • 75.24 vs. 77.96 seconds in HPX (~3.6% increase) • Big reduction in yielded action stalls (in thread scheduler)
– Less contention for network access • Hypothesis: LULESH implementation is showing signs of
being network bound: spatial locality of subdomains is not maintained during decomposition

Metric            Baseline       Throttled      Throttled as % of Baseline
Cycles            1.11341E+13    3.88187E+12    34.865%
Instructions      7.33378E+12    5.37177E+12    73.247%
L2 Cache Misses   8422448397     3894908172     46.244%
IPC               0.658677       1.38381        210.089%
INS/L2CM          870.742        1379.18        158.391%
32 An Autonomic Performance Environment for Exascale
LULESH latest results
[Plots: concurrency and power over time for the unmodified and power-capped LULESH executions; per-action concurrency (advanceDomain, SBN1/SBN3 result and send actions, PosVel and MonoQ result and send actions, probe and LCO request handlers, and other), thread cap, and power.]
Figures: unmodified and power-capped (220 W) executions of LULESH (HPX-5), with equal execution times (8000 subdomains, 64^3 elements per subdomain on 8016 cores of Edison, 343 nodes, 24 cores per node). 12.3% energy savings with no performance change.
33 An Autonomic Performance Environment for Exascale
Future Work, Discussion • Conduct more robust experiments and at larger scales on
different platforms • More, better policy rules
– Runtime and operating system – Application and device-specific (*in progress) – Global policies (*in progress)
• Multi-objective optimization (ActiveHarmony) • Integration into HPX-5, HPX-3 code releases • API refinements for general purpose usage • Global data exchange – who does it, and when? • “MPMD” processing – how to (whether to?) sandbox APEX • Other runtimes: OpenMP/OMPT, OmpSS, Legion, OCR… • Source code: https://github.com/khuck/xpress-apex
34 An Autonomic Performance Environment for Exascale
Acknowledgements • Support for this work was provided through the Scientific Discovery
through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics) under award numbers DE-SC0008638, DE-SC0008704, DE-FG02-11ER26050 and DE-SC0006925.
• Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. DOE's National Nuclear Security Administration under contract DE-AC04-94AL85000.
• This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.