


Green High Performance Computing Initiative at Rutgers University (GreenHPC@RU)

Manish Parashar, Ivan Rodero
Contact: {parashar, irodero}@cac.rutgers.edu
http://nsfcac.rutgers.edu/GreenHPC

China – US Software Workshop, September 2011

The Center for Autonomic Computing is supported by the National Science Foundation under Grant No. 0758596.

Motivations for HPC Power Management
- Growing scale of high-end High Performance Computing systems
- Power consumption has become critical in terms of operational costs (a dominant part of IT budgets)
- Existing power/energy management focuses on single layers, which may result in suboptimal solutions

Challenges in HPC Power Management
- GreenHPC
  - Defining energy efficiency (e.g., $/W/MFLOP); one possible formalization is sketched after this list
  - Infrastructure: thermal sensors, instrumentation, monitoring, cooling, etc.
- Architectural challenges
  - Processor, memory, and interconnect technologies
  - Increased use of accelerators
  - Power-aware micro-architectures
- Compilers, OS, and runtime support
  - Power-aware code generation
  - Power-aware scheduling
  - DVFS, programming models, abstractions
- Considerations for system-level energy efficiency
  - Optimizing the CPU alone is not sufficient; the entire system/cluster must be considered
  - Application/workload-aware optimizations
  - Power-aware algorithm design
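One way to make these metrics concrete (a sketch only, not the project's official definition): express efficiency as sustained performance per watt, energy to solution as average power times runtime, and operating cost from the local electricity price.

% Sketch of energy-efficiency metrics; R = sustained performance (MFLOP/s),
% P = average power (W), T = time to solution (s), c = electricity price ($/kWh).
\begin{align*}
  \mathrm{Efficiency} &= \frac{R}{P} \quad [\mathrm{MFLOP/s\ per\ W}] \\
  E_{\mathrm{solution}} &= P \cdot T \quad [\mathrm{J}] \\
  \mathrm{Cost} &= c \cdot \frac{P \cdot T}{3.6 \times 10^{6}} \quad [\$]
\end{align*}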

Cross-layer Energy-Power Management
- Architecture: see the cross-layer/cross-infrastructure figure described below

Component-based Power Management [HiPC'10/11]
- Application-centric aggressive power management at the component level
- Workload profiling and characterization
  - HPC workloads (e.g., HPC Linpack)
- Use of low-power modes to configure subsystems (i.e., CPU, memory, disk, and NIC); a minimal CPU example is sketched after this list
- Energy-efficient memory hierarchies
  - Memory power management for multi-/many-core systems
  - Multiple channels to main memory
  - Application-driven power control (i.e., ensuring bandwidth to main memory by leveraging channel affinity)
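As an illustration of the low-power-mode idea above, the Python sketch below switches one core's frequency governor through the Linux cpufreq sysfs interface. It assumes a cpufreq-enabled kernel and root privileges, and it is only a sketch of the mechanism, not the instrumentation used in the HiPC work.

# Minimal sketch: put a CPU core into a low-power mode via the Linux cpufreq
# sysfs interface. Assumes a cpufreq-enabled kernel and root privileges.
import pathlib

CPUFREQ = "/sys/devices/system/cpu/cpu{core}/cpufreq/{knob}"

def read_knob(core, knob):
    return pathlib.Path(CPUFREQ.format(core=core, knob=knob)).read_text().strip()

def write_knob(core, knob, value):
    pathlib.Path(CPUFREQ.format(core=core, knob=knob)).write_text(value + "\n")

def enter_low_power(core):
    """Switch one core to the 'powersave' governor and return the previous governor."""
    previous = read_knob(core, "scaling_governor")
    if "powersave" in read_knob(core, "scaling_available_governors").split():
        write_knob(core, "scaling_governor", "powersave")
    return previous

if __name__ == "__main__":
    old = enter_low_power(0)                  # down-clock core 0 during a low-activity phase
    print("core 0 governor was:", old)
    write_knob(0, "scaling_governor", old)    # restore when the phase ends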

Runtime Power Management (a conceptual sketch of power-aware barrier waits follows this list)
- Partitioned Global Address Space (PGAS) programming models
  - Implicit message passing
  - Unified Parallel C (UPC) so far
- Target platforms
  - Many-core (i.e., Intel SCC)
  - HPC clusters
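The idea is to spend the slack that threads accumulate while blocked in collectives in a low-power state. The sketch below is conceptual only: Python's threading.Barrier stands in for a UPC barrier, and lower_power()/restore_power() stand in for DVFS calls; the actual runtime work instruments the UPC runtime in C.

# Conceptual sketch of runtime power management around collective waits.
import random
import threading
import time

NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)

def lower_power(tid):
    # Stand-in for a DVFS call (e.g., via cpufreq, as in the previous sketch).
    print(f"[thread {tid}] dropping to a low-power state while waiting")

def restore_power(tid):
    print(f"[thread {tid}] restoring the nominal power state")

def power_aware_barrier(tid):
    """Enter low power before blocking on the barrier and restore afterwards.
    A real runtime would do this only when a long wait is predicted."""
    lower_power(tid)
    barrier.wait()                    # analogous to a UPC barrier wait
    restore_power(tid)

def worker(tid):
    time.sleep(random.uniform(0.0, 0.5))   # imbalanced compute phase
    power_aware_barrier(tid)               # slow arrivers make fast ones wait

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()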

Energy-aware Autonomic Provisioning [IGCC'10]
- Virtualized Cloud infrastructures with multiple geographically distributed entry points
- Workloads composed of HPC applications
- Distributed Online Clustering (DOC)
  - Clusters job requests in the input stream based on their resource requirements (multiple dimensions, e.g., memory, CPU, and network requirements); a simplified sketch follows this list
- Optimizing energy efficiency by:
  - Powering down subsystems when they are not needed
  - Efficient, just-right VM provisioning (reduces over-provisioning)
  - Efficient proactive provisioning and grouping (reduces re-provisioning)
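A highly simplified, single-node sketch of the clustering step (a leader-style heuristic, not the published DOC algorithm): each incoming request joins the nearest existing cluster if it is close enough in the resource-requirement space, otherwise it seeds a new cluster; each resulting cluster can then back a "just-right" VM class. The threshold and the (cpu, memory, network) encoding are assumptions.

# Simplified online clustering of incoming job requests by resource demand.
import math

THRESHOLD = 0.25   # max normalized distance to join an existing cluster (assumed)

class Cluster:
    def __init__(self, point):
        self.centroid = list(point)
        self.size = 1
    def distance(self, point):
        return math.dist(self.centroid, point)
    def absorb(self, point):
        self.size += 1
        # incremental mean update of the centroid
        self.centroid = [c + (p - c) / self.size for c, p in zip(self.centroid, point)]

def cluster_stream(requests, threshold=THRESHOLD):
    """requests: iterable of normalized (cpu, memory, network) demand vectors."""
    clusters = []
    for req in requests:
        best = min(clusters, key=lambda c: c.distance(req), default=None)
        if best is not None and best.distance(req) <= threshold:
            best.absorb(req)
        else:
            clusters.append(Cluster(req))
    return clusters

if __name__ == "__main__":
    stream = [(0.2, 0.3, 0.1), (0.22, 0.28, 0.12), (0.8, 0.7, 0.5), (0.79, 0.72, 0.55)]
    for i, c in enumerate(cluster_stream(stream)):
        print(f"VM class {i}: centroid={c.centroid}, requests={c.size}")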

Energy and Thermal Autonomic Management
- Reactive thermal and energy management of HPC workloads
  - Autonomic decision making to react to thermal hotspots, considering multiple dimensions (i.e., energy and thermal efficiency) and using different techniques: VM migration, CPU DVFS, and CPU pinning
- Proactive, energy-aware, application-centric VM allocation for HPC workloads
  - Strategy for proactive VM allocation based on VM consolidation while satisfying QoS guarantees
  - Based on application profiling (profiles are known in advance) and benchmarking on real hardware
- Abnormal operational state detection (e.g., poor performance, hotspots)
- Reactive and proactive approaches (a minimal sketch combining both follows this list)
  - Reacting to anomalies to return to a steady state
  - Predicting anomalies in order to avoid them
- Workload/application awareness
  - Application profiling/characterization
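To illustrate the two directions above, the sketch below combines a proactive first-fit-decreasing consolidation of VMs from profiled CPU demands with a reactive check that flags a VM for migration away from a host whose temperature crosses a threshold. The capacity, threshold, and data layout are assumptions, and the heuristics are illustrative, not the policies used in the project.

# Minimal sketch of (a) proactive VM consolidation and (b) a reactive hotspot check.
HOST_CPU_CAPACITY = 1.0   # normalized CPU capacity per host (assumed)
HOT_THRESHOLD_C = 75.0    # temperature above which a host is a hotspot (assumed)

def consolidate(vm_demands):
    """Pack VMs onto as few hosts as possible without exceeding capacity (QoS proxy)."""
    hosts = []                                   # each host: {"load": float, "vms": [...]}
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for host in hosts:
            if host["load"] + demand <= HOST_CPU_CAPACITY:
                host["load"] += demand
                host["vms"].append(vm)
                break
        else:                                    # no host fits: power on a new one
            hosts.append({"load": demand, "vms": [vm]})
    return hosts

def react_to_hotspots(hosts, temperatures):
    """If a host exceeds the threshold, flag one of its VMs for migration to the
    coolest other host (a real policy would also check capacity and QoS)."""
    actions = []
    for i, host in enumerate(hosts):
        if temperatures[i] > HOT_THRESHOLD_C and host["vms"]:
            vm = host["vms"][-1]
            target = min((j for j in range(len(hosts)) if j != i),
                         key=lambda j: temperatures[j], default=None)
            if target is not None:
                actions.append(f"migrate {vm} from host{i} to host{target}")
    return actions

if __name__ == "__main__":
    demands = {"vm-a": 0.6, "vm-b": 0.5, "vm-c": 0.3, "vm-d": 0.2}
    hosts = consolidate(demands)                 # proactive placement from profiles
    print(hosts)
    print(react_to_hotspots(hosts, temperatures=[82.0, 60.0]))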

Approach (figure): an autonomic control loop monitors QoS, energy efficiency, and thermal efficiency, detects when the system drifts into an abnormal state, and drives it back to a steady state.

Goals
- Autonomic (self-monitored and self-managed) computing systems
- Optimizing energy efficiency, cost-effectiveness, and utilization while ensuring (maximizing) the performance/quality of service delivered
- Addressing both "traditional" and virtualized systems (i.e., data centers and GreenHPC in the Cloud)

Cross-layer and cross-infrastructure power management (figure): each layer of the stack (physical environment, resources, virtualization, application/workload) is instrumented with a sensor, observer, actuator, and controller; cross-layer power management coordinates decisions across these layers, and cross-infrastructure power management extends them across clouds (private, public, hybrid, etc.) over a virtualized, instrumented, application-aware infrastructure.
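A minimal, hypothetical sketch of the per-layer loop the figure depicts (sensor feeds observer, controller decides, actuator applies the action); the class and function names are illustrative, not the project's API.

# Hypothetical per-layer sensor/observer/controller/actuator loop.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Reading:
    metric: str    # e.g., "power_w", "temp_c", "qos_latency_ms"
    value: float

class LayerLoop:
    """One control loop for a single layer, as in the architecture figure."""
    def __init__(self,
                 sense: Callable[[], List[Reading]],
                 decide: Callable[[List[Reading]], List[str]],
                 act: Callable[[str], None]):
        self.sense, self.decide, self.act = sense, decide, act

    def step(self):
        readings = self.sense()                               # sensor
        observed = [r for r in readings if r.value >= 0.0]    # observer: filter/aggregate
        for action in self.decide(observed):                  # controller: policy decision
            self.act(action)                                  # actuator: apply the action

# Example wiring for a virtualization-layer loop (dummy inputs and actions):
loop = LayerLoop(
    sense=lambda: [Reading("temp_c", 81.0), Reading("qos_latency_ms", 12.0)],
    decide=lambda obs: ["migrate_vm"]
        if any(r.metric == "temp_c" and r.value > 75.0 for r in obs) else [],
    act=lambda action: print("actuating:", action),
)
loop.step()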

Figure (Component-based Power Management): percentage of channel accesses per program region and memory channel; opportunities for powering down.

Figures (Energy-aware Autonomic Provisioning): VM classes based on the clusters; over-provisioning cost.

Figure (Runtime Power Management): duration of calls to the UPC runtime function upcr_wait2 on node 1 during a CG.A.O3.16 execution. Example: potential savings during collective operations (e.g., barriers) on the SCC platform.