
Understanding Performance, Power and Energy Behavior in Asymmetric Processors

Nagesh B Lakshminarayana

Hyesoon Kim

School of Computer Science

Georgia Institute of Technology

Outline

• Background and Motivation

• Thread Interactions

• Dynamic Scheduling

• Asymmetry Aware Scheduling

• Conclusion and Future Work

Heterogeneous Architectures

• A particularly interesting class of parallel machines is Heterogeneous Architectures

– Multiple types of Processing Elements (PEs) available on the same machine

Figure: processing elements (PEs) connected by an interconnect

Heterogeneous Architectures

• Heterogeneous architectures are becoming very common

Figure: examples of heterogeneous architectures. IBM Cell processor (fast core, special accelerators) and asymmetric processors (fast core, slow cores)

Focus of this talk: Asymmetric Processors

Machine configurations

• All-slow (SMP): all processors running at their lowest frequency

• Half-half (AMP): half of the processors running at their highest frequency, the rest running at their lowest frequency

• All-fast (SMP): all processors running at their highest frequency

• M-I experiments have 8 threads, M-II experiments have 16 threads

• AMPs emulated using SpeedStep/PowerNow

Machine-I: 2-socket 1.87 GHz quad-core Intel Xeon, 4MB L2 cache, 8GB RAM, 40GB HDD, RHEL 5

Machine-II: 4-socket 2 GHz quad-core AMD Opteron 8350, 2MB L3 cache, 32GB RAM, 1TB HDD, RHEL 4

Power Measurement

• Using Extech 380801 Power Analyzer

• Total system power consumption

Figure: measurement setup (experiment machine, power socket, power cable, serial cable, Windows machine)
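Because the analyzer reports power, energy has to be derived by integrating the readings over the run. The following is a minimal sketch of that step, assuming the readings are available as (time in seconds, power in watts) pairs. The function names and sample values are hypothetical and do not reflect the analyzer's actual log format.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def average_power_watts(samples):
    """Average power over the sampled interval (energy divided by duration)."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration


# Hypothetical readings: one sample per second, values made up for illustration.
readings = [(0.0, 200.0), (1.0, 240.0), (2.0, 260.0), (3.0, 220.0)]
print(energy_joules(readings), average_power_watts(readings))
```

The same two quantities, total energy and average power, are the ones compared across machine configurations in the rest of the talk.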

PARSEC Benchmark Suite

• Desktop-oriented multithreaded benchmark suite

– Multithreaded

– Animation, Data Mining, Financial Analysis

– Pthreads, OpenMP

Performance of PARSEC benchmarks

• On average, performance of half-half is between that of all-slow and all-fast

Figure: execution time (sec) of the PARSEC benchmarks for all-fast, half-half and all-slow

Classification of Benchmarks

Figure: thread execution between barriers for the three benchmark classes: (a) slow-limited, (b) middle-perf, (c) unstable

Energy Consumption of PARSEC

• In half-half/all-slow, total energy consumption is higher even though average power consumed might be lower

Figure: energy consumption (J) of the PARSEC benchmarks for all-fast, half-half and all-slow, grouped into slow-limited and middle-perf
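The observation that total energy can be higher at lower average power follows from how the two quantities are related. The sketch below uses notation introduced here for illustration ($E$ for energy, $\bar{P}$ for average power, $T$ for execution time); the slides do not define these symbols.

```latex
E = \bar{P} \cdot T,
\qquad
\frac{E_{\text{slow}}}{E_{\text{fast}}}
  = \frac{\bar{P}_{\text{slow}}}{\bar{P}_{\text{fast}}}
    \cdot
    \frac{T_{\text{slow}}}{T_{\text{fast}}}
```

A slower configuration therefore uses more energy whenever its execution time grows by a larger factor than its average power drops.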

Behavior of PARSEC Benchmarks

• Observations

– Different applications behave differently on AMPs

– Usually SMP with fast processors saves energy

Why do different applications behave differently on AMPs?


Thread Interactions

Sources of thread interactions

• Critical Sections

• Barriers

Critical Sections (CS)

• Waiting to enter CSs

Figure: thread timelines for Case (a) and Case (b); legend: critical section, useful work, waiting

Barriers

• Waiting for other threads to finish

Figure: thread timelines between two barriers
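The two sources of waiting can be seen in a small threaded program. This is a minimal sketch using Python's standard `threading` module rather than the Pthreads/OpenMP code of the benchmarks; the thread count, iteration count and the work done are hypothetical.

```python
import threading

NUM_THREADS = 4      # hypothetical thread count
ITERATIONS = 1000    # hypothetical work per thread

counter = 0                      # shared state, updated only inside the critical section
lock = threading.Lock()          # guards the critical section
barrier = threading.Barrier(NUM_THREADS)
passed_barrier = []              # records which threads got past the barrier


def worker(tid):
    global counter
    for _ in range(ITERATIONS):
        local = tid * 2 + 1      # "useful work": private to the thread, no waiting
        with lock:               # waiting happens here if another thread holds the lock
            counter += local
    barrier.wait()               # waiting happens here until the last thread arrives
    with lock:
        passed_barrier.append(tid)


threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter, sorted(passed_barrier))
```

On an AMP, a thread running on a slow core holds the lock longer and reaches the barrier later, so both constructs turn per-core speed differences into waiting time on the other cores.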

Effect of Critical Section length

• CS limited application

• As critical section length increases, the average power consumed decreases

Figure: normalized power consumption for critical section lengths of 10%, 15%, 20%, 50% and 75%, for 16 cores @ 1 GHz, 1.2 GHz, 1.4 GHz, 1.7 GHz and 2 GHz

Normalized Execution Time

• CS limited application

• Performance of AMPs sensitive to CS length

Figure: normalized execution time for critical section lengths of 10%, 15%, 20%, 50% and 75%. SMP configurations: 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, 16 @ 2 GHz. AMP configurations: 8 @ 1 GHz, 8 @ 1.2 GHz, 8 @ 1.4 GHz or 8 @ 1.7 GHz, each combined with 8 @ 2 GHz

Normalized Energy Consumption

• CS limited application

• Energy consumption shows the same trend

Figure: normalized energy consumption for critical section lengths of 10%, 15%, 20%, 50% and 75%. SMP configurations: 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, 16 @ 2 GHz. AMP configurations: 8 @ 1 GHz, 8 @ 1.2 GHz, 8 @ 1.4 GHz or 8 @ 1.7 GHz, each combined with 8 @ 2 GHz

Effect of Critical Section frequency

• Both length and frequency of CS affect performance and energy consumption

• As frequency increases, performance difference between half-half and all-fast reduces

• If majority of the execution time is spent waiting for locks, it is OK to have a few slow processors

• Results available in the paper

Effect of Barriers

• With few barriers, half-half performs similarly to all-slow

• With a large number of barriers, half-half performs similarly to all-fast

• Results available in the paper


Dynamic Scheduling

• Motivation: better run-time adaptivity

• Each thread requests more work after completing the assigned work

• OpenMP, Intel Threading Building Blocks

• Can help improve performance and reduce energy consumption in AMPs

• Should be preferred to static and guided policies

Machine configuration           Normalized Execution Time   Normalized Energy Consumption
                                (Static/Dynamic)            (Static/Dynamic)
16 @ 1 GHz (SMP)                1.0                         1.0
16 @ 1.2 GHz (SMP)              0.83                        0.87
16 @ 1.4 GHz (SMP)              0.71                        0.78
16 @ 1.7 GHz (SMP)              0.59                        0.68
16 @ 2 GHz (SMP)                0.50                        0.61
8 @ 1 GHz, 8 @ 2 GHz (AMP)      1.00/0.67                   1.05/0.73
8 @ 1.2 GHz, 8 @ 2 GHz (AMP)    0.83/0.63                   0.90/0.70
8 @ 1.4 GHz, 8 @ 2 GHz (AMP)    0.71/0.59                   0.80/0.67
8 @ 1.7 GHz, 8 @ 2 GHz (AMP)    0.59/0.54                   0.69/0.63

• Parallel-for application
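The gap between static and dynamic scheduling on an AMP can be illustrated with a toy model of a parallel-for loop. This is a sketch under strong assumptions, not the experiment behind the table: every iteration costs the same amount of work, a core's speed is proportional to its frequency, and scheduling overhead is ignored. The function names and the trip count are hypothetical.

```python
import heapq


def static_makespan(speeds, iterations):
    """Static schedule: iterations are split evenly across cores up front."""
    per_core, extra = divmod(iterations, len(speeds))
    # The first `extra` cores take one additional iteration.
    counts = [per_core + (1 if i < extra else 0) for i in range(len(speeds))]
    return max(n / s for n, s in zip(counts, speeds))


def dynamic_makespan(speeds, iterations):
    """Dynamic schedule: whichever core becomes free first takes the next iteration."""
    free_at = [(0.0, core) for core in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(iterations):
        t, core = heapq.heappop(free_at)
        t += 1.0 / speeds[core]      # one iteration costs 1/speed time units
        finish = max(finish, t)
        heapq.heappush(free_at, (t, core))
    return finish


ITERATIONS = 4800                      # hypothetical loop trip count
all_slow = [1.0] * 16                  # 16 @ 1 GHz (SMP)
half_half = [1.0] * 8 + [2.0] * 8      # 8 @ 1 GHz, 8 @ 2 GHz (AMP)

baseline = static_makespan(all_slow, ITERATIONS)
static_ratio = static_makespan(half_half, ITERATIONS) / baseline
dynamic_ratio = dynamic_makespan(half_half, ITERATIONS) / baseline
print(round(static_ratio, 2), round(dynamic_ratio, 2))
```

In this model the static schedule on the AMP is bound by the slow cores and matches all-slow, while the dynamic schedule lets the fast cores take twice as many iterations and finishes in about two thirds of the time. That agrees with the 1.00/0.67 execution-time entry for the 8 @ 1 GHz, 8 @ 2 GHz row, although the model says nothing about energy.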


Scheduling in AMPs

• Longest Job to a Fast Processor First (LJFPF) [Lakshminarayana’08]

Figure: threads of different lengths ending at a barrier, assigned to fast cores and slow cores

How Does the Scheduler Know

• Length of work?

• Current mechanism: application sends task length information

• On-going work: Prediction mechanism
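Given task lengths reported by the application, the idea suggested by the name LJFPF can be sketched as follows. This is an illustration of the policy's name only, not the implementation from [Lakshminarayana'08]: it assumes exactly one task per core, and the function names, task lengths and core speeds are hypothetical.

```python
def ljfpf_assign(task_lengths, core_speeds):
    """Pair the longest tasks with the fastest cores (one task per core)."""
    if len(task_lengths) != len(core_speeds):
        raise ValueError("this sketch assumes exactly one task per core")
    tasks = sorted(range(len(task_lengths)), key=lambda i: task_lengths[i], reverse=True)
    cores = sorted(range(len(core_speeds)), key=lambda c: core_speeds[c], reverse=True)
    return dict(zip(tasks, cores))     # task index -> core index


def makespan(assignment, task_lengths, core_speeds):
    """Time until the last task finishes, i.e. when every thread reaches the barrier."""
    return max(task_lengths[t] / core_speeds[c] for t, c in assignment.items())


# Hypothetical inputs: task lengths as reported by the application, speeds in GHz.
lengths = [6, 7, 6, 6, 7, 6, 6, 6]
speeds = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]

ljfpf = ljfpf_assign(lengths, speeds)
naive = {t: t for t in range(len(lengths))}   # task i simply runs on core i
print(makespan(ljfpf, lengths, speeds), makespan(naive, lengths, speeds))
```

With these inputs the naive placement leaves one of the longest tasks on a slow core, which delays the barrier, while the longest-first pairing puts both long tasks on fast cores so the slow cores only run the shorter ones.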

• ITK: Medical image processing applications (OpenSource)

• MultiRegistration (Registration method)

– kernel with 50 iterations

– 50 iterations divided among 8 threads

Figure: normalized execution time and normalized energy consumption


Conclusion & Future Work

Conclusion

• Evaluated the performance/energy consumption behavior of multithreaded applications in AMPs

• For symmetric workloads

– With little thread interaction: SMP with fast processors

– With a lot of thread interaction: AMP could be better

• For asymmetric threads

– AMP could provide lowest energy consumption

Future Work

• Predict application characteristics and use predicted information for thread scheduling on AMPs

Thank you!
