LLNL-PRES-804125
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Managing Power Efficiency of HPC Applications with Variorum and GEOPM
Tapasya Patki, Stephanie Brink, Aniruddha Marathe, Barry Rountree
David Lowenthal (U. Arizona), Jonathan Eastep (Intel)
ECP Tutorial
Feb 4, 2020, 2:30 PM-6:00 PM
Agenda
• Part I: Overview of GEOPM (15 minutes)
  • High-level design
  • User-facing, application-context markup API
  • Demonstrations (10 minutes)
• Part II: Plug-ins to extend GEOPM algorithm and platform support (30 minutes)
  • Agent: Run-time tuning extension
  • PlatformIO: Platform-specific support extension
  • Demonstrations (10 minutes)
• Part III: ECP Argo Contributions (30 minutes)
  • ConductorAgent: Transparent, performance-optimizing configuration selection
  • IBM PlatformIO plugin: Port of GEOPM to IBM Power9 + Nvidia platform
• Questions/Discussion (10 minutes)
Part I: Hands-on Tutorial on GEOPM
Background: System Software Stack for Power Management
• Site: Demand Response, Renewables. Software: Dashboards
• Cluster: Overprovisioning, Job scheduling. Software: RMAP, P-SLURM, PowSched
• Job/Application: Adaptive runtimes, Power balancing. Software: GEOPM, Conductor, PShifter, ...
• Node: Measurement and control (capping). Software: LibMSR, msr-safe
Figure label spanning the levels: Inherited Power Bounds
§ Critical contribution to the development of HPC power-aware system software stack.
Power-Constrained Performance-Optimization Problem
Problem definition
Given a job-level power constraint and number of nodes, how do we optimize application performance?
GEOPM: Global Extensible Open Power Manager
• Power-aware runtime system for large-scale HPC systems
• Intel developed a production-grade, scalable, open-source job-level extensible runtime and framework
• Extensibility through plug-ins + advanced default functionality
• Limitations of existing runtimes
  • Research-based codes addressed specific needs and situations
  • Ad-hoc, targeted specific architecture, memory model
  • Suffered scalability issues
  • Reliance on empirical data
• Funded through a contract with Argonne National Laboratory
GEOPM Project Goals
§ Managing power
  • Maximizing power efficiency or performance under a power cap
§ Managing manufacturing variation
  • Power / frequency relationship is non-uniform across different processors of same type
§ Managing workload imbalance
  • Divert power to CPUs with more work
§ Managing system jitter
  • Divert power to CPUs interrupted or stalled by system noise
§ Application profiling
  • Report application performance and power metrics
§ Runtime application tuning
  • Extensible runtime control agent with plug-in architecture
§ Integration with MPI
  • Automatic integration with MPI runtime through PMPI interface
§ Integration with OpenMP
  • Automatic integration with OpenMP through OMPT interface
GEOPM System Model
[Figure: GEOPM system model. Legend: extensible / plug-in components; human actors; external components (hardware, software, files). Elements shown: Job Monitor, Job Optimizer, Power-aware Job Scheduler, User Applications, Site Admin, User, Application Developer, per-node trace file, System Hardware (sensors, controls, actuators), GEOPM.]
GEOPM: Capabilities
§ Enables analysis and transparent tuning of distributed-memory applications
§ Feedback-guided optimization: Leverages lightweight application profiling
§ Learns application phase patterns: load imbalance across nodes, distinct computational phases within a node
§ Uses tuning parameters: processor power limit, core frequency, etc.
§ Built-in optimization algorithms: static power capping, energy reduction, load balancing, limiting synchronization costs
GEOPM Components of Interest
[Figure: GEOPM components of interest, numbered 1 to 4: GEOPM Core (hierarchical communication + plugin infrastructure), Agent, PlatformIO, Markup API, Application, Endpoint (4).]
GEOPM Infrastructure
GEOPM Core: Hierarchical communication + power-management plugin
Agent Plugin
PlatformIO Plugin
GEOPM Infrastructure
• GEOPM source repository navigation
  • Branches, directories, releases
  • GEOPM Wiki
• Build process
  • Dependencies
  • Build configuration
• GEOPM core infrastructure source
  • Overview of important classes
  • Plug-in source
  • Tutorials and examples
  • Test coverage
GEOPM: Component Communication
[Figure: GEOPM component communication. Labels: User Submits Job, Resource Manager, SLURM, Spank Plugin, Endpoint, GEOPM Runtime, Controller, Agent, PlatformIO, HW Interface (OS). Legend: signal and control flow; component creation; GEOPM component.]
GEOPM: Input/Output Files
[Figure: GEOPM input/output files. Labels: GEOPM Runtime, Controller, Agent, PlatformIO, HW Interface (OS), Application Profile, Policy, Report/Trace. Legend: signal and control flow; component creation; GEOPM component; I/O files.]
GEOPM Configuration, Build and Launch
Building an Application with GEOPM
Step 1: Set the environment
$> module load geopm
$> module load <intel compiler>
$> module load <MPI compiled with intel-c>
Step 2: Link the application to the GEOPM library
$> mpicc APP_SRC.c -L$GEOPM_LIB -lgeopm \
       -o APP_EXEC \
       COMPILER_FLAGS
Example:
$> mpicc helloworld.c -L$GEOPM_LIB -lgeopm -o a.out
Running an Application with GEOPM
Step 3: Generate a policy file
$> geopmagent --agent=AGENT_NAME --policy=INPUT_PARAMS > POLICY_FILE.json
Example:
$> geopmagent --agent=monitor --policy=None > monitor_policy.json
Step 4: Launch the application with the GEOPM launcher wrapper
$> geopmlaunch srun -n < > -N < > \
       --geopm-ctl=process \
       --geopm-agent=AGENT_NAME \
       --geopm-policy=POLICY_FILE.json \
       --geopm-report=REPORT_FILE.txt \
       --geopm-trace=TRACE_FILE.csv \
       -- APP_EXEC APP_OPTIONS
Example:
$> geopmlaunch srun -n 4 -N 1 \
       --geopm-ctl=process \
       --geopm-agent=monitor \
       --geopm-policy=monitor_policy.json \
       --geopm-report=report.txt \
       --geopm-trace=trace.csv \
       -- a.out
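The Step 4 launch line can also be assembled from a job script. The following is a minimal sketch that uses only the flags shown on this slide; the helper name `build_geopmlaunch_cmd` and its parameters are hypothetical and are not part of GEOPM. It builds the argument list and prints it, without running anything.

```python
import shlex


def build_geopmlaunch_cmd(num_ranks, num_nodes, agent, policy, report, trace,
                          app, app_args=()):
    """Assemble the geopmlaunch/srun command line from Step 4 as an argv list.

    Hypothetical helper for illustration only: it uses just the flags that
    appear on the slide and does not execute the command.
    """
    return [
        "geopmlaunch", "srun",
        "-n", str(num_ranks),      # value passed to srun -n
        "-N", str(num_nodes),      # value passed to srun -N
        "--geopm-ctl=process",     # controller mode used on the slide
        f"--geopm-agent={agent}",
        f"--geopm-policy={policy}",
        f"--geopm-report={report}",
        f"--geopm-trace={trace}",
        "--", app, *app_args,      # everything after "--" is the application
    ]


cmd = build_geopmlaunch_cmd(4, 1, "monitor", "monitor_policy.json",
                            "report.txt", "trace.csv", "a.out")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later handed to a process launcher.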
Demo: Running Application with GEOPM
GEOPM: Components and Interfaces
Collecting Application Context
§ Application region markup API: computation/communication regions of interest
§ Epoch: end of iteration
§ OpenMP event callbacks
Power Assignment Policies
§ Governed policy: node-level assignment
§ Balanced policy: cluster-level assignment
Extension Interfaces
§ New Agent plugin: ConductorAgent
§ New PlatformIO plugin: IBM port of GEOPM
GEOPM Markup API: Purpose
• C interfaces provided in GEOPM that the application links against
  • Resemble typical profiler interfaces
• Annotation functions for programmers to provide information about application critical path and phases to GEOPM
  • Points where bulk synchronizations occur
  • Phase changes occur in an MPI rank (i.e. phase entry and exit)
  • Hints on whether phases will be compute-, memory-, or communication-intensive
  • How much progress each MPI rank has made in the phase (critical path)
Application Markup API
MPI/Sequential Region
• Marking up regions of interest
  • geopm_prof_region(name, hint, ID)
  • geopm_prof_enter(ID)
  • geopm_prof_exit(ID)
• Marking region progress
  • geopm_prof_progress(ID, %progress)
• Marking a timestep
  • geopm_prof_epoch()
OpenMP Region
• Marking up regions of interest
  • geopm_tprof_init(num_work_unit)
  • geopm_tprof_init_loop(num_thread, thread ID, num_iter, chunk_size)
• Marking region progress
  • geopm_tprof_post()
Demo: Using the GEOPM Markup API
Part II: Plug-ins to extend GEOPM algorithm and platform support
GEOPM: Policy plugins
Collecting Application Context
§ Application region markup API: computation/communication regions of interest
§ Epoch: end of iteration
§ OpenMP event callbacks
Power Assignment Policies
§ Governed policy: node-level assignment
§ Balanced policy: cluster-level assignment
Extension Interfaces
§ New Agent plugin: ConductorAgent
§ New PlatformIO plugin: IBM port of GEOPM
Demo: Using the Default GEOPM Policies
GEOPM Components of Interest
[Figure: GEOPM components of interest, numbered 1 to 4: GEOPM Core (hierarchical communication + plugin infrastructure), Agent, PlatformIO, Markup API, Application, Endpoint (4). Annotations next to Agent and PlatformIO: MSR access, control, telemetry, application context, power management algorithm, profiling, accounting.]
GEOPM Plugin Interface
• Two types of plugins: PlatformIO and Agent plugins
• Example Agent plugins
  • MonitorAgent
  • BalancerAgent
  • GoverningAgent
  • EnergyEfficientAgent
• Example PlatformIO plugins
  • MSRIOGroup
  • KNLIOGroup
• Tutorial plugins: ExampleAgent and ExampleIOGroup
  • Key methods and code blocks
  • Policy description interface
Part III: ECP Argo Contributions
ECP Argo: Selecting Power-Optimizing Configuration
§ Approach: Hardware overprovisioning with job-level power guarantees
  • More compute resources than you can power up at once
§ Objective: Optimize job performance under a power constraint
§ Solution: GEOPM – power-constrained performance optimization
§ ECP Argo Contributions:
  • Augment GEOPM’s algorithm with performance-optimizing application configurations: # threads, frequency, etc.
  • Port GEOPM to IBM POWER9 (support for LLNL Sierra)
ECP Argo Contributions: Components and Interfaces
Collecting Application Context
§ Application region markup API: computation/communication regions of interest
§ Epoch: end of iteration
§ OpenMP event callbacks
Power Assignment Policies
§ Governed policy: node-level assignment
§ Balanced policy: cluster-level assignment
Extension Interfaces
§ New policy agent plugin: ConductorAgent
§ New PlatformIO plugin: IBM port of GEOPM
ECP Argo: How Much Do We Gain With Configuration Tuning?
Naïve Scheme: Static Power Allocation
§ Equally distribute and enforce power constraint over all nodes of a job
  • Uses Intel’s Running Average Power Limit (RAPL) interface
§ Statically select a configuration under the power constraint
  • Configuration: {Number of cores, Frequency/power limit}
  • Commonly used: Packed configuration
    • Maximum cores possible on the processor
    • Frequency or power limit as the control knob
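The naive scheme amounts to dividing the job-level budget evenly once and never revisiting the split. A minimal sketch of that arithmetic follows; the function name, the two-sockets-per-node default, and the example budget are illustrative assumptions and are not taken from the slides.

```python
def static_power_allocation(job_power_watts, num_nodes, sockets_per_node=2):
    """Split a job-level power budget evenly across nodes and sockets.

    Returns (per_node_watts, per_socket_watts). The per-socket value is what
    would be handed to a RAPL-style package power limit; the call that
    actually enforces the limit is outside this sketch.
    """
    if num_nodes <= 0 or sockets_per_node <= 0:
        raise ValueError("num_nodes and sockets_per_node must be positive")
    per_node = job_power_watts / num_nodes
    per_socket = per_node / sockets_per_node
    return per_node, per_socket


# Illustrative numbers only: an 8960 W job budget over 64 nodes.
per_node, per_socket = static_power_allocation(job_power_watts=8960.0, num_nodes=64)
print(f"{per_node:.1f} W per node, {per_socket:.1f} W per socket")
```

Every node receives the same limit regardless of how much work it has, which is the property the later slides identify as a limitation.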
Limitations of Static Power Allocation
1. Trivial node-level configurations may be inefficient
   Input: {# cores, frequency/power limit}
   Output: {Execution time, power usage}
   • Up to 30% slower than the optimal configuration
   • Finding the optimal configuration needs a prohibitively large number of runs of the application
   [Figure: CoMD, 64 nodes; x-axis: processor power usage (watts), 50 to 90]
2. Portion of power left unused with load-imbalanced applications (up to 40%)
Conductor: Dynamic Configuration and Power Management
§ Goals of ConductorAgent
  • Speed up computation on the critical path
  • Use power-efficient configuration
§ Need to dynamically identify
  • Computation region potentially on the critical path
  • {execution time, power usage} profile for every computation on every processor
ConductorAgent Algorithm
Step 1: Configuration Exploration
• Start: explore configurations
• MPI processes 1, 2, 3, ..., n run configurations k1, k2, k3, ..., kn
• Allgather {Power, Execution Time}
ConductorAgent Algorithm
Step 1: Configuration Exploration
• Start
• Explore configurations
• Construct Pareto frontier
• Select configuration kOPT
[Figure: x-axis: power usage (watts), 50 to 90]
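The "construct Pareto frontier" and "select configuration kOPT" steps can be illustrated with a small sketch. It assumes the gathered {power, execution time} samples are available as a dictionary. The configuration labels and measurements below are made up, and the function names are hypothetical rather than ConductorAgent's actual code.

```python
def pareto_frontier(samples):
    """Return the configurations that are not dominated in (power, time).

    samples maps a configuration label to (power_watts, exec_time_s).
    A configuration is dominated if another one uses no more power and takes
    no more time, and is strictly better in at least one of the two.
    """
    frontier = {}
    for name, (p, t) in samples.items():
        dominated = any(
            (p2 <= p and t2 <= t) and (p2 < p or t2 < t)
            for other, (p2, t2) in samples.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (p, t)
    return frontier


def select_k_opt(frontier, power_limit_watts):
    """Pick the fastest frontier configuration that fits under the power limit."""
    feasible = {k: v for k, v in frontier.items() if v[0] <= power_limit_watts}
    if not feasible:
        return None
    return min(feasible, key=lambda k: feasible[k][1])


# Made-up measurements: "<threads>t@<power cap>" -> (watts, seconds)
samples = {
    "8t@60W": (58.0, 12.0),
    "8t@80W": (78.0, 9.5),
    "16t@60W": (60.0, 10.0),
    "16t@70W": (69.0, 8.8),
    "16t@90W": (88.0, 8.7),
    "12t@80W": (79.0, 9.6),
}
frontier = pareto_frontier(samples)
k_opt = select_k_opt(frontier, power_limit_watts=70.0)
print(sorted(frontier), k_opt)
```

The pairwise dominance check is quadratic in the number of configurations, which is acceptable for the small configuration spaces (thread count by power cap) discussed here.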
ConductorAgent Algorithm
Step 2: Power Re-allocation
• Start
• Explore configurations
• Construct Pareto frontier
• Select configuration kOPT
• Is computation non-critical?
  • Yes: Slow down (reduce power)
  • No: Speed up (with unused power)
• Calculate new power allocation
[Figure: ParaDiS before and after power re-allocation; histograms of # tasks versus power usage (watts); power limit: 70 W]
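The re-allocation decision can be sketched as a single step. This is a deliberately simplified illustration under stated assumptions: every task starts with the same limit, criticality is given as a flag per task, and freed power is split evenly among critical tasks. The function name and numbers are hypothetical, and the real ConductorAgent may distribute power differently.

```python
def reallocate_power(power_limit, usage, critical):
    """One simplified re-allocation step under a fixed job power budget.

    power_limit: current per-task power limit (watts), the same for all tasks.
    usage:       measured power per task (watts).
    critical:    per-task flags, True if the task is on the critical path.

    Non-critical tasks are capped at what they actually used; the headroom
    they free up is split evenly among critical tasks, so the total
    allocation never exceeds power_limit * number of tasks.
    """
    n = len(usage)
    new_limits = [0.0] * n
    freed = 0.0
    for i in range(n):
        if not critical[i]:
            new_limits[i] = min(usage[i], power_limit)  # slow down (reduce power)
            freed += power_limit - new_limits[i]
    num_critical = sum(1 for flag in critical if flag)
    bonus = freed / num_critical if num_critical else 0.0
    for i in range(n):
        if critical[i]:
            new_limits[i] = power_limit + bonus  # speed up (with unused power)
    return new_limits


# Illustrative values only, using the 70 W limit from the slide.
limits = reallocate_power(
    power_limit=70.0,
    usage=[55.0, 60.0, 70.0, 70.0],
    critical=[False, False, True, True],
)
print(limits, sum(limits))
```

The invariant worth checking in any variant of this step is that the sum of the new limits stays within the original job-level budget.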
Conductor: Integration into GEOPM
§ OMPT class
  • Explore {OMP, Pcap} configurations during the exploration phase
  • Select power-efficient configuration during regular execution
§ Profile class
  • Report end of timestep (i.e., ‘epoch’), application and system telemetry to enable sweep of configuration at runtime
§ ConfigApp class
  • Perform profiling, generate Pareto-optimal configurations
§ ConfigAgent class
  • Share telemetry with PowerBalancer agent, send configuration to OMPT
ConductorAgent OMPT Profiler
Initialization: GEOPM, Application Handshake
[Figure: timeline between the GEOPM Controller and the application process. Labels: Init Handshake; shared memory space (GEOPM::SharedMemory, GEOPM::SharedMemoryUser); Initialize control and telemetry.]
ConductorAgent OMPT Profiler
Configuration Exploration: Set Configuration, Collect Telemetry
[Figure: timeline between the GEOPM Controller and the application process during configuration exploration. Labels: Set Configuration (ThreadCnt, PowerCap, RegionID); Set Threads; Set Power Cap; Run Region; Telemetry (Power, Time); Signal Timestep; Sweep all configurations.]
ConductorAgent OMPT Profiler
Configuration Selection: Pick Power-Efficient Configurations
[Figure: timeline between the GEOPM Controller and the application process during configuration selection. Labels: Set Configuration (ThreadCnt, PowerCap); Set Power Cap; Set Threads; Run Region; Through application completion.]
ECP Argo: End Result
GEOPM Port: Migration to New GEOPM IOGroup Interface
| Purpose | Old: PlatformImp interface | New: IOGroup interface |
| --- | --- | --- |
| Get platform information on POWER9 | PowerPlatformImp: extends PlatformImp | PowerIOGroup: extends IOGroup |
| RAPL-like monitoring and control on POWER9 | OCCPlatform: extends Platform | PowerIO: Direct CPU monitoring/control interface |
| Get platform information on GPUs | NVMLPlatformImp: extends PlatformImp | NVMLIOGroup: extends IOGroup |
| RAPL-like monitoring and control on GPUs | NVMLPlatform: extends Platform | NVMLIO: Direct GPU monitoring/control |

*Additional modifications in GEOPM Agent implementations to fully support GEOPM power management on POWER9 dual socket + Nvidia Volta with NVLink
GEOPM IBM Port: IBM “Witherspoon” Node
System configuration & interfaces
§ CPU ID: PowerNV 8335-GTH, 2.2
§ Number of cores: 160 4-way SMT, 3.7 GHz
§ System memory: 66 GB
§ GPU: Nvidia Tesla V100-SXM2
§ Software: RHEL, GNU C/C++, GNU Fortran, MPICH2
Telemetry
§ CPU frequency: /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
§ CPU sensors:
  • /sys/firmware/opal/exports/occ_inband_sensors
  • Performance Monitoring library (perfmon2): libpfm4
§ GPU information: NVML :: *
Control
§ CPU frequency: /sys/devices/system/cpu/cpufreq/policy*/scaling_setspeed
§ GPU power limit: NVML :: nvmlDeviceSetPowerManagementLimit()
§ Node-level power capping: /sys/firmware/opal/powercap/system-powercap/powercap-current
We use a linear regression-based model to predict power usage at a given CPU frequency:
$P_{CPU} = \alpha F + C$
where
§ $P_{CPU}$: P9 CPU power usage (watts)
§ $F$: CPU frequency (GHz)
§ $\alpha$: Coefficient of frequency scaling
§ $C$: Constant offset (base frequency <-> power correlation)
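Fitting the linear model reduces to ordinary least squares on (frequency, power) samples. A minimal sketch follows, assuming such samples have already been collected. The data below is synthetic and chosen to lie exactly on a line; it is not measured POWER9 data, and the function names are hypothetical.

```python
def fit_power_model(freqs_ghz, powers_watts):
    """Ordinary least-squares fit of P_CPU = alpha * F + C.

    Returns (alpha, c) for paired samples of CPU frequency (GHz) and
    CPU power (watts).
    """
    n = len(freqs_ghz)
    if n < 2 or n != len(powers_watts):
        raise ValueError("need at least two matching (F, P) samples")
    mean_f = sum(freqs_ghz) / n
    mean_p = sum(powers_watts) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs_ghz)
    if sxx == 0.0:
        raise ValueError("frequencies must not all be identical")
    sxy = sum((f - mean_f) * (p - mean_p)
              for f, p in zip(freqs_ghz, powers_watts))
    alpha = sxy / sxx
    c = mean_p - alpha * mean_f
    return alpha, c


def predict_power(alpha, c, freq_ghz):
    """Predicted CPU power (watts) at the given frequency (GHz)."""
    return alpha * freq_ghz + c


# Synthetic samples on the line P = 40 * F + 30 (illustrative, not measured).
freqs = [2.0, 2.5, 3.0, 3.5]
powers = [110.0, 130.0, 150.0, 170.0]
alpha, c = fit_power_model(freqs, powers)
print(round(alpha, 3), round(c, 3), round(predict_power(alpha, c, 3.2), 3))
```

With a fitted model, a power cap can be translated into a target frequency by inverting the line, F = (P - C) / α, which is one way a frequency-only control such as `scaling_setspeed` could stand in for a RAPL-like limit.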
ECP Argo: Github Contributions
§ Conductor integration and IBM platform plugin:
  • https://github.com/geopm/geopm/pull/757
§ GEOPM integration with Caliper:
  • https://github.com/LLNL/Caliper/pull/213
GEOPM Team and Collaborations
GEOPM Core Team (Intel)
• Jonathan Eastep (Project Lead)
• Chris Cantalupo (Lead Developer)
• Fede Ardanaz
• Brad Geltz
• Brandon Baker
• Mohammad Ali
• Siddhartha Jana
• Diana Guttman
LLNL Team
• Aniruddha Marathe
• Tapasya Patki
• Stephanie Brink
• Barry Rountree
Questions?
Github links
Configuration Exploration: https://github.com/amarathe84/geopm/tree/master
IBM Port: https://github.com/amarathe84/geopm/tree/ibm-port
ECP Future Work
Status labels on the slide: WIP, Done, Future
§ Extend configuration exploration to include CPU-GPU configuration space
§ Power/Performance models for co-scheduling and workflows
§ Port Variorum to ARM, HPE, and other architectures
§ PowerStack Consortium and industry integration
§ ECP Phase I: GEOPM extensions, Power-aware SLURM, Legion extensions
§ ECP Phase II: Power control and monitoring through Variorum, co-scheduling and workflows
§ Deliver initial version of PowerStack
§ Integrate GEOPM and Variorum
§ Include node-level power capping after OPAL firmware update