TRANSCRIPT
The Center for Autonomic Computing is supported by the National Science Foundation under Grant No. 0758596.
China – US Software Workshop, September 2011
Green High Performance Computing Initiative at Rutgers University (GreenHPC@RU)
Manish Parashar, Ivan Rodero
Contact: {parashar, irodero}@cac.rutgers.edu http://nsfcac.rutgers.edu/GreenHPC
Motivations for HPC Power Management
• Growing scales of high-end High Performance Computing systems
• Power consumption has become critical in terms of operational costs (a dominant part of IT budgets)
• Existing power/energy management is focused on single layers
  • May result in suboptimal solutions
Challenges in HPC Power Management
• GreenHPC
  • $/W/MFLOP; defining energy efficiency
  • Infrastructure: thermal sensors, instrumentation, monitoring, cooling, etc.
• Architectural challenges
  • Processor, memory, interconnect technologies
  • Increased use of accelerators
  • Power-aware micro-architectures
• Compilers, OS, runtime support
  • Power-aware code generation
  • Power-aware scheduling
  • DVFS, programming models, abstractions (a minimal DVFS sketch follows this list)
• Considerations for system-level energy efficiency
  • Optimizing the CPU alone is not sufficient; need to look at the entire system/cluster
  • Application/workload-aware optimizations
• Power-aware algorithm design
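The DVFS knob above is typically exercised through the operating system. As a minimal sketch, assuming a Linux machine that exposes the cpufreq sysfs interface and permits the userspace governor (the helper names here are illustrative, not part of any GreenHPC code):

```c
/* dvfs_set.c -- illustrative sketch: set a per-core CPU frequency via the
 * Linux cpufreq sysfs interface (assumes the "userspace" governor is
 * available and the process has write permission, e.g., runs as root). */
#include <stdio.h>

/* Write a single string value to a sysfs file; returns 0 on success. */
static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fprintf(f, "%s\n", value) > 0) ? 0 : -1;
    fclose(f);
    return rc;
}

/* Pin core `cpu` to `khz` using the userspace governor. */
static int dvfs_set_khz(int cpu, long khz)
{
    char path[128], val[32];

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    if (write_sysfs(path, "userspace") != 0)
        return -1;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    snprintf(val, sizeof val, "%ld", khz);
    return write_sysfs(path, val);
}

int main(void)
{
    /* Example: drop core 0 to 1.2 GHz during a low-activity phase. */
    if (dvfs_set_khz(0, 1200000) != 0)
        perror("dvfs_set_khz");
    return 0;
}
```

In practice a power-aware scheduler or runtime would invoke such a helper around low-activity phases rather than from main().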
Motivations and Challenges

Cross-layer Energy-Power Management

Architecture: Component-based Power Management [HiPC'10/11]
• Application-centric aggressive power management at the component level
• Workload profiling and characterization
  • HPC workloads (e.g., HPC Linpack)
• Use of low-power modes to configure subsystems (i.e., CPU, memory, disk, and NIC)
• Energy-efficient memory hierarchies (see the channel-profiling sketch below)
  • Memory power management for multi-/many-core systems
  • Multiple channels to main memory
  • Application-driven power control (i.e., ensuring bandwidth to main memory by leveraging channel affinity)
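To make the channel-affinity idea concrete, here is an illustrative sketch assuming a per-region channel-access profile like the one plotted at the end of this deck; the profile values, the threshold, and the power-down hook are all hypothetical:

```c
/* channel_idle.c -- illustrative sketch: given profiled per-region memory
 * channel access percentages, flag channels whose traffic in a region falls
 * below a threshold as candidates for a low-power state. */
#include <stdio.h>

#define REGIONS  4
#define CHANNELS 4
#define IDLE_THRESHOLD 5.0  /* percent of accesses (assumed cutoff) */

/* Hypothetical profile: access share of each channel per program region. */
static const double access_pct[REGIONS][CHANNELS] = {
    { 80.0, 15.0,  3.0,  2.0 },
    { 40.0, 40.0, 10.0, 10.0 },
    { 90.0,  4.0,  3.0,  3.0 },
    { 25.0, 25.0, 25.0, 25.0 },
};

int main(void)
{
    for (int r = 0; r < REGIONS; r++)
        for (int c = 0; c < CHANNELS; c++)
            if (access_pct[r][c] < IDLE_THRESHOLD)
                /* In a real system this would drive a memory-controller
                 * low-power mode for the channel during that region. */
                printf("region %d: channel %d is a power-down candidate\n",
                       r, c);
    return 0;
}
```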
Runtime Power Management
• Partitioned Global Address Space (PGAS)
  • Implicit message passing
  • Unified Parallel C (UPC) so far (see the barrier-wait sketch below)
• Target platforms
  • Many-core (i.e., Intel SCC)
  • HPC clusters
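Collective waits (e.g., the upcr_wait2 calls profiled at the end of this deck) are a natural place for a runtime to lower frequency. A minimal sketch of that pattern, using a pthread barrier as a stand-in for the UPC runtime's wait and a stubbed-out DVFS helper (both stand-ins are ours, not the actual runtime hooks):

```c
/* barrier_dvfs.c -- illustrative sketch: reduce a core's frequency while
 * waiting on a barrier and restore it afterwards, mimicking the kind of
 * opportunity shown for upcr_wait2. Compile with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_barrier_t bar;

/* Placeholder for the sysfs-based DVFS helper sketched earlier. */
static void set_core_khz(int cpu, long khz)
{
    printf("cpu %d -> %ld kHz\n", cpu, khz);
}

static void *worker(void *arg)
{
    (void)arg;
    int cpu = sched_getcpu();

    /* ... computation phase at nominal frequency ... */

    set_core_khz(cpu, 800000);        /* drop to a low-power state */
    pthread_barrier_wait(&bar);       /* stand-in for upcr_wait2() */
    set_core_khz(cpu, 2400000);       /* restore nominal frequency */

    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t t[NTHREADS];

    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}
```

The payoff depends on how long threads actually spin in the collective; the upcr_wait2 measurements shown later are what motivate this pattern on the SCC platform.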
Energy-aware Autonomic Provisioning [IGCC'10]
• Virtualized Cloud infrastructures with multiple geographically distributed entry points
• Workloads composed of HPC applications
• Distributed Online Clustering (DOC)
  • Clusters job requests in the input stream based on their resource requirements (multiple dimensions, e.g., memory, CPU, and network requirements); a simplified sketch follows this list
• Optimizing energy efficiency in the following ways:
  • Powering down subsystems when they are not needed
  • Efficient, just-right VM provisioning (reduces over-provisioning)
  • Efficient proactive provisioning and grouping (reduces re-provisioning)
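As an illustration of the clustering step, here is a simplified, single-node sketch that groups incoming job requests by their normalized resource vectors; the real DOC algorithm is distributed and more sophisticated than this centroid heuristic, and all constants below are assumptions:

```c
/* doc_sketch.c -- illustrative sketch of online clustering of job requests
 * by resource requirements, in the spirit of Distributed Online Clustering
 * (DOC). Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define DIMS      3     /* e.g., memory, CPU, network demand (normalized) */
#define MAX_CLUST 16
#define RADIUS    0.25  /* assignment threshold in normalized space */

typedef struct { double centroid[DIMS]; int count; } Cluster;

static Cluster clusters[MAX_CLUST];
static int nclusters = 0;

static double dist(const double *a, const double *b)
{
    double s = 0.0;
    for (int d = 0; d < DIMS; d++)
        s += (a[d] - b[d]) * (a[d] - b[d]);
    return sqrt(s);
}

/* Assign a request to the nearest cluster, updating its centroid, or start
 * a new cluster when no centroid is within RADIUS. Returns the cluster id. */
static int classify(const double req[DIMS])
{
    int best = -1;
    double bestd = RADIUS;

    for (int i = 0; i < nclusters; i++) {
        double d = dist(req, clusters[i].centroid);
        if (d < bestd) { bestd = d; best = i; }
    }
    if (best < 0 && nclusters < MAX_CLUST) {
        best = nclusters++;
        for (int d = 0; d < DIMS; d++)
            clusters[best].centroid[d] = req[d];
        clusters[best].count = 0;
    }
    if (best >= 0) {
        Cluster *c = &clusters[best];
        c->count++;
        for (int d = 0; d < DIMS; d++)   /* incremental centroid update */
            c->centroid[d] += (req[d] - c->centroid[d]) / c->count;
    }
    return best;
}

int main(void)
{
    const double stream[][DIMS] = {
        { 0.90, 0.20, 0.10 }, { 0.85, 0.25, 0.10 },  /* memory-heavy jobs */
        { 0.10, 0.90, 0.30 }, { 0.15, 0.80, 0.35 },  /* CPU-heavy jobs */
    };
    for (int i = 0; i < 4; i++)
        printf("request %d -> VM class %d\n", i, classify(stream[i]));
    return 0;
}
```

Each resulting cluster can then be mapped to a "just-right" VM class, which is how over-provisioning is reduced.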
Energy and Thermal Autonomic Management
• Reactive thermal and energy management of HPC workloads
  • Autonomic decision making to react to thermal hotspots, considering multiple dimensions (i.e., energy and thermal efficiency) and using different techniques: VM migration, CPU DVFS, CPU pinning (a minimal reactive loop is sketched after this list)
• Proactive energy-aware application-centric VM allocation for HPC workloads
  • Strategy for proactive VM allocation based on VM consolidation while satisfying QoS guarantees
  • Based on application profiling (profiles are known in advance) and benchmarking on real hardware
• Abnormal operational state detection (e.g., poor performance, hotspots)
  • Reactive and proactive approaches
    • Reacting to anomalies to return to a steady state
    • Predicting anomalies in order to avoid them
  • Workload/application awareness
    • Application profiling/characterization
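A minimal sketch of the reactive path, assuming the standard Linux thermal sysfs interface; the mitigation hooks and the two-level policy below are illustrative placeholders, not the actual autonomic controller:

```c
/* hotspot_react.c -- illustrative sketch of a reactive loop: poll a Linux
 * thermal zone and pick a mitigation when a hotspot threshold is crossed.
 * The thermal sysfs path is real on Linux; migrate_vm() and throttle_dvfs()
 * are hypothetical stand-ins for the actual actuators. */
#include <stdio.h>
#include <unistd.h>

#define HOTSPOT_MC 85000  /* 85 C in millidegrees Celsius (assumed limit) */

static long read_temp_mc(void)
{
    long mc = -1;
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    if (f) {
        if (fscanf(f, "%ld", &mc) != 1)
            mc = -1;
        fclose(f);
    }
    return mc;
}

/* Hypothetical actuators standing in for VM migration and CPU DVFS. */
static void migrate_vm(void)    { puts("migrating hottest VM"); }
static void throttle_dvfs(void) { puts("lowering CPU frequency"); }

int main(void)
{
    for (;;) {
        long t = read_temp_mc();
        if (t > HOTSPOT_MC) {
            /* Simple policy: prefer DVFS for mild overshoot, migration
             * for severe hotspots; a real controller would also weigh
             * energy efficiency and QoS. */
            if (t > HOTSPOT_MC + 5000)
                migrate_vm();
            else
                throttle_dvfs();
        }
        sleep(5);  /* monitoring period */
    }
    return 0;
}
```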
Approach
[Figure: state diagram of the approach. The system transitions between a steady state and an abnormal state, each assessed along three dimensions: QoS, energy efficiency, and thermal efficiency.]
Goals
• Autonomic (self-monitored and self-managed) computing systems
• Optimizing energy efficiency, cost-effectiveness, and utilization, while ensuring (maximizing) the performance/quality of service delivered
• Addressing both "traditional" and virtualized systems (i.e., data centers and GreenHPC in the Cloud)
[Figure: cross-layer power management architecture. Each layer (physical environment, resources, virtualization, application/workload) is instrumented with a sensor and managed by an observer, controller, and actuator. Cross-infrastructure power management extends this across multiple Clouds (private, public, hybrid, etc.) over a virtualized, instrumented, application-aware infrastructure.]
[Figure: percentage of channel accesses per memory channel across program regions, highlighting opportunities for powering down lightly used channels.]
[Figures: VM classes based on clusters; over-provisioning cost.]
[Figure: duration (s) over time of calls to the UPC runtime function upcr_wait2 on node 1 during a CG.A.O3.16 execution.]
Example: potential savings during collective operations (e.g., barriers) on the SCC platform.