
The Center for Autonomic Computing is supported by the National Science Foundation under Grant No. 0758596.

China – US Software Workshop, September 2011

Green High Performance Computing Initiative at Rutgers University (GreenHPC@RU)

Manish Parashar, Ivan Rodero

Contact: {parashar, irodero}@cac.rutgers.edu
http://nsfcac.rutgers.edu/GreenHPC

Motivations for HPC Power Management

- Growing scales of high-end High Performance Computing systems
- Power consumption has become critical in terms of operational costs (a dominant part of IT budgets)
- Existing power/energy management is focused on single layers
  - May result in suboptimal solutions

Challenges in HPC Power Management

- GreenHPC
  - $/W/MFLOP, defining energy efficiency
  - Infrastructure: thermal sensors, instrumentation, monitoring, cooling, etc.
- Architectural challenges
  - Processor, memory, and interconnect technologies
  - Increased use of accelerators
  - Power-aware micro-architectures
- Compilers, OS, and runtime support
  - Power-aware code generation
  - Power-aware scheduling
  - DVFS, programming models, abstractions (a DVFS sketch follows this list)
- Considerations for system-level energy efficiency
  - Optimizing the CPU alone is not sufficient; the entire system/cluster must be considered
  - Application/workload-aware optimizations
  - Power-aware algorithm design
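To make the DVFS item above concrete, here is a minimal sketch of per-core frequency scaling through the Linux cpufreq sysfs interface. It is illustrative only and not part of the GreenHPC work: it assumes the "userspace" cpufreq governor is available, and the paths, frequencies, and policy are examples.

#include <stdio.h>

/* Write a target frequency (in kHz) for one core via sysfs.
 * Requires the cpufreq "userspace" governor and sufficient privileges. */
static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;   /* governor is not "userspace", or no permission */
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example policy: run core 0 slowly through a memory-bound region,
     * then restore full speed for the compute-bound region. */
    set_cpu_khz(0, 1200000);   /* 1.2 GHz while bandwidth-limited */
    /* ... memory-bound program region ... */
    set_cpu_khz(0, 2400000);   /* 2.4 GHz for compute-intensive work */
    return 0;
}

A power-aware scheduler or runtime would issue the same kind of call per program region or per task, which is the role the compiler/OS/runtime bullet assigns to DVFS.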

Cross-layer Energy-Power Management: Motivations, Challenges, and Architecture

Component-based Power Management [HiPC’10/11]

- Application-centric, aggressive power management at the component level
- Workload profiling and characterization
  - HPC workloads (e.g., HPC Linpack)
- Use of low-power modes to configure subsystems (i.e., CPU, memory, disk, and NIC)
- Energy-efficient memory hierarchies
  - Memory power management for multi-/many-core systems
  - Multiple channels to main memory
  - Application-driven power control, i.e., ensuring bandwidth to main memory by leveraging channel affinity (see the sketch after this list)
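The sketch below illustrates the application-driven channel power control idea: given a profiled per-channel access breakdown for a program region, channels that see little traffic are put into a low-power state. set_channel_low_power() is a hypothetical platform hook, the threshold and profile values are made up, and this is not the HiPC’10/11 implementation.

#include <stdio.h>
#include <stdbool.h>

#define NUM_CHANNELS 4

/* Hypothetical hook into the memory controller / platform firmware. */
static void set_channel_low_power(int channel, bool low_power)
{
    printf("channel %d -> %s\n", channel, low_power ? "low-power" : "active");
}

/* access_pct[i]: share of this region's memory accesses hitting channel i,
 * obtained from offline profiling of channel accesses per program region. */
static void configure_channels(const double access_pct[NUM_CHANNELS],
                               double idle_threshold_pct)
{
    for (int i = 0; i < NUM_CHANNELS; i++)
        set_channel_low_power(i, access_pct[i] < idle_threshold_pct);
}

int main(void)
{
    /* Example region profile: nearly all traffic hits channels 0 and 1
     * (channel affinity), so channels 2 and 3 can be powered down. */
    double region_profile[NUM_CHANNELS] = { 55.0, 40.0, 3.0, 2.0 };
    configure_channels(region_profile, 5.0);
    return 0;
}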

Runtime Power Management

- Partitioned Global Address Space (PGAS) models
  - Implicit message passing
  - Unified Parallel C (UPC) so far (see the barrier sketch after this list)
- Target platforms
  - Many-core (i.e., the Intel SCC)
  - HPC clusters
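The savings targeted here come from the slack threads spend blocked in collective operations (cf. the upcr_wait2 trace near the end of the transcript). The sketch below shows the pattern with a pthread barrier standing in for the UPC runtime wait and a hypothetical scale_frequency() DVFS helper; it is not the actual UPC runtime modification.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical DVFS helper (e.g., cpufreq-based, as in the earlier sketch). */
static void scale_frequency(long khz)
{
    printf("core frequency -> %ld kHz\n", khz);
}

/* Drop to a low frequency while blocked in the barrier, restore it afterwards. */
static void energy_aware_barrier(pthread_barrier_t *bar)
{
    scale_frequency(800000);      /* little work to do while waiting */
    pthread_barrier_wait(bar);    /* stand-in for the UPC runtime wait call */
    scale_frequency(2400000);     /* back to full speed for computation */
}

static pthread_barrier_t bar;

static void *worker(void *arg)
{
    (void)arg;
    /* ... per-thread computation ... */
    energy_aware_barrier(&bar);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    pthread_barrier_init(&bar, NULL, 4);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}

Build with -pthread; the same wrapping could in principle be applied inside a runtime around any blocking collective.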

Energy-aware Autonomic Provisioning [IGCC’10]

- Virtualized cloud infrastructures with multiple geographically distributed entry points
- Workloads composed of HPC applications
- Distributed Online Clustering (DOC)
  - Clusters job requests in the input stream based on their resource requirements (multiple dimensions, e.g., memory, CPU, and network requirements); a simplified clustering sketch follows this list
- Optimizing energy efficiency in the following ways:
  - Powering down subsystems when they are not needed
  - Efficient, just-right VM provisioning (reduces over-provisioning)
  - Efficient proactive provisioning and grouping (reduces re-provisioning)
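As a simplified stand-in for the clustering idea behind DOC (not the published algorithm), the sketch below assigns each incoming job request, described by a normalized CPU/memory/network demand vector, to the nearest existing VM class, or opens a new class when none is close enough; the radius and example profiles are arbitrary.

#include <math.h>
#include <stdio.h>

#define DIMS 3          /* cpu, memory, network demand, normalized to [0,1] */
#define MAX_CLASSES 32

typedef struct { double center[DIMS]; int count; } VmClass;

static VmClass classes[MAX_CLASSES];
static int num_classes = 0;

static double dist(const double *a, const double *b)
{
    double s = 0.0;
    for (int d = 0; d < DIMS; d++)
        s += (a[d] - b[d]) * (a[d] - b[d]);
    return sqrt(s);
}

/* Returns the VM class index the request was assigned to. */
static int classify_request(const double req[DIMS], double radius)
{
    int best = -1;
    double best_d = radius;

    for (int c = 0; c < num_classes; c++) {
        double d = dist(req, classes[c].center);
        if (d <= best_d) { best_d = d; best = c; }
    }
    if (best < 0 && num_classes < MAX_CLASSES) {     /* open a new VM class */
        best = num_classes++;
        for (int d = 0; d < DIMS; d++) classes[best].center[d] = req[d];
        classes[best].count = 0;
    }
    if (best >= 0) {
        classes[best].count++;
        /* Drift the centroid toward the new request (running mean). */
        for (int d = 0; d < DIMS; d++)
            classes[best].center[d] +=
                (req[d] - classes[best].center[d]) / classes[best].count;
    }
    return best;
}

int main(void)
{
    double jobs[][DIMS] = { {0.8, 0.2, 0.1}, {0.75, 0.25, 0.1}, {0.1, 0.9, 0.3} };
    for (int i = 0; i < 3; i++)
        printf("job %d -> VM class %d\n", i, classify_request(jobs[i], 0.2));
    return 0;
}

Grouping requests into a small number of VM classes is what enables the "just-right" provisioning above: each class maps to a VM size that closely fits its members, reducing over-provisioning.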

Energy and Thermal Autonomic Management

- Reactive thermal and energy management of HPC workloads
  - Autonomic decision making to react to thermal hotspots, considering multiple dimensions (i.e., energy and thermal efficiency) and using different techniques: VM migration, CPU DVFS, and CPU pinning (a control-loop sketch follows this list)
- Proactive energy-aware, application-centric VM allocation for HPC workloads
  - Strategy for proactive VM allocation based on VM consolidation while satisfying QoS guarantees
  - Based on application profiling (profiles are known in advance) and benchmarking on real hardware
- Abnormal operational-state detection (e.g., poor performance, hotspots)
  - Reactive and proactive approaches
    - Reacting to anomalies to return to the steady state
    - Predicting anomalies in order to avoid them
  - Workload/application awareness
    - Application profiling/characterization
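The control-loop sketch referenced above is given here. It is an illustrative reactive policy, not the actual GreenHPC implementation: read_core_temp(), apply_dvfs(), repin_vcpus(), and migrate_vm() are hypothetical placeholders, and the thresholds are arbitrary.

#include <stdio.h>

enum action { NONE, DVFS_DOWN, CPU_REPIN, VM_MIGRATE };

/* Hypothetical sensing and actuation hooks. */
static double read_core_temp(int core) { (void)core; return 78.0; }  /* stub, degrees C */
static void   apply_dvfs(int core)     { printf("DVFS down core %d\n", core); }
static void   repin_vcpus(int core)    { printf("re-pin vCPUs away from core %d\n", core); }
static void   migrate_vm(int core)     { printf("migrate VM off the host of core %d\n", core); }

/* Pick the least disruptive action expected to remove the hotspot. */
static enum action decide(double temp_c)
{
    if (temp_c < 75.0) return NONE;        /* within the thermal budget     */
    if (temp_c < 82.0) return DVFS_DOWN;   /* mild hotspot: slow the core   */
    if (temp_c < 90.0) return CPU_REPIN;   /* move load off the hot core    */
    return VM_MIGRATE;                     /* severe hotspot: evacuate host */
}

int main(void)
{
    for (int core = 0; core < 4; core++) {
        switch (decide(read_core_temp(core))) {
        case DVFS_DOWN:  apply_dvfs(core);   break;
        case CPU_REPIN:  repin_vcpus(core);  break;
        case VM_MIGRATE: migrate_vm(core);   break;
        case NONE:       break;
        }
    }
    return 0;
}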

Approach

[Figure: state diagram of the autonomic management approach. The system moves between a steady state and an abnormal state; transitions are evaluated along three dimensions: QoS, energy efficiency, and thermal efficiency.]

Goals

- Autonomic (self-monitored and self-managed) computing systems
- Optimizing energy efficiency, cost-effectiveness, and utilization while ensuring (maximizing) the performance/quality of service delivered
- Addressing both "traditional" and virtualized systems (i.e., data centers and GreenHPC in the Cloud)

[Figure: cross-layer autonomic management architecture. Each layer (physical environment, resources, virtualization, and application/workload) has its own sensors, observer, controller, and actuator. Cross-layer power management spans these layers over a virtualized, instrumented, application-aware infrastructure, and cross-infrastructure power management spans multiple clouds (private, public, hybrid, etc.).]
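The per-layer loop in the figure can be read as a small observe/decide/act cycle. The sketch below is one possible way to structure it; the Layer type, the stub power metric, and run_layer() are assumptions for illustration, not the project's actual interfaces.

#include <stdio.h>

typedef struct {
    const char *name;                       /* e.g., "virtualization"        */
    double (*sense)(void);                  /* sensor: read a metric         */
    int    (*decide)(double observation);   /* controller: choose an action  */
    void   (*act)(int action);              /* actuator: apply the action    */
} Layer;

static double sense_power(void)      { return 212.5; }         /* stub: Watts */
static int    decide_power(double w) { return w > 200.0; }     /* cap or not  */
static void   act_power(int cap)     { if (cap) printf("apply power cap\n"); }

static void run_layer(const Layer *l)
{
    double obs = l->sense();            /* observer aggregates sensor data  */
    int action = l->decide(obs);        /* controller evaluates the policy  */
    l->act(action);                     /* actuator enforces the decision   */
    printf("[%s] observation=%.1f action=%d\n", l->name, obs, action);
}

int main(void)
{
    Layer resources = { "resources", sense_power, decide_power, act_power };
    run_layer(&resources);              /* one iteration of the control loop */
    return 0;
}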

[Figure: percentage of channel accesses per memory channel across program regions, showing opportunities for powering down lightly used channels.]

[Figure: VM classes based on clusters of job requests, and the associated over-provisioning cost.]

[Figure: durations of calls to the UPC runtime function upcr_wait2 on node 1 during a CG.A.O3.16 execution, plotted over time. Example: potential savings during collective operations (e.g., barriers) on the SCC platform.]
