Post on 14-Dec-2015
Understanding Performance, Power and Energy Behavior in Asymmetric Processors
Nagesh B Lakshminarayana
Hyesoon Kim
School of Computer Science
Georgia Institute of Technology
2
Outline
• Background and Motivation
• Thread Interactions
• Dynamic Scheduling
• Asymmetry Aware Scheduling
• Conclusion and Future Work
3
Heterogeneous Architectures
• A particularly interesting class of parallel machines is heterogeneous architectures
– Multiple types of Processing Elements (PEs) available on the same machine
[Figure: one PE of type A and four PEs of type B attached to an interconnect]
4
Heterogeneous Architectures
• Heterogeneous architectures are becoming very common
[Figure: examples of heterogeneous architectures, including the IBM Cell processor, a special accelerator, and asymmetric processors built from fast cores and slow cores]
• Focus of this talk: asymmetric processors
5
Machine configurations
All-slow (SMP): all processors running at their lowest frequency
Half-half (AMP): half of the processors running at their highest frequency, the rest running at their lower frequency
All-fast (SMP): all processors running at their highest frequency
• M-I experiments have 8 threads, M-II experiments have 16 threads
• AMPs emulated using SpeedStep/PowerNow
Machine-I: 2-socket 1.87 GHz quad-core Intel Xeon, 4MB L2 cache, 8GB RAM, 40GB HDD, RHEL 5
Machine-II: 4-socket 2 GHz quad-core AMD Opteron 8350, 2MB L3 cache, 32GB RAM, 1TB HDD, RHEL 4
6
Power Measurement
• Using Extech 380801 Power Analyzer
• Total system power consumption
[Figure: measurement setup showing the experiment machine, the power socket, the power cable, the serial cable, and a Windows machine]
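Total energy follows from logged power readings by integrating power over time. The following is a minimal sketch, assuming the analyzer log has already been parsed into (time in seconds, power in watts) pairs. The function names and the sample readings are hypothetical and are not taken from the talk.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def average_power_watts(samples):
    """Average power over the sampled interval."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration


# Hypothetical readings, one per second (not measured data).
readings = [(0, 200.0), (1, 220.0), (2, 240.0), (3, 220.0)]
print(energy_joules(readings) / 1000.0, "kJ")
print(average_power_watts(readings), "W")
```

Because energy is average power multiplied by run time, a configuration that draws less power can still use more energy if it runs long enough.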
7
PARSEC Benchmark Suite
• Desktop-oriented multithreaded benchmark suite
– Multithreaded
– Animation, Data Mining, Financial Analysis
– Pthreads, OpenMP
8
Performance of PARSEC benchmarks
• On average, performance of half-half is between that of all-slow and all-fast
[Figure: execution time (sec) of each benchmark under all-fast, half-half, and all-slow; benchmarks are grouped as slow-limited, middle-perf, and unstable]
9
Classification of Benchmarks
[Figure: thread timelines up to a barrier for (a) slow-limited, (b) middle-perf, and (c) unstable benchmarks]
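One way to see why a barrier-synchronized program can be limited by its slow cores: with equal work per thread, every thread waits at the barrier for the thread on the slowest core. The sketch below is a simple model under that assumption. The function name, the work amount, and the frequencies are illustrative and not from the talk.

```python
def barrier_phase_time(work_per_thread, core_freqs_ghz):
    """Time for one barrier-to-barrier phase with one thread per core.

    Assumes execution time scales as work / frequency, so the phase
    ends when the thread on the slowest core reaches the barrier.
    """
    return max(work_per_thread / f for f in core_freqs_ghz)


# Hypothetical 8-core machine with 1 GHz and 2 GHz settings.
all_slow = [1.0] * 8
half_half = [1.0] * 4 + [2.0] * 4
all_fast = [2.0] * 8

for name, freqs in [("all-slow", all_slow),
                    ("half-half", half_half),
                    ("all-fast", all_fast)]:
    print(name, barrier_phase_time(10.0, freqs))
```

In this model half-half takes as long as all-slow, which is one plausible reading of the slow-limited category.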
10
Energy Consumption of PARSEC
• In half-half/all-slow, total energy consumption is higher even though average power consumed might be lower
[Figure: energy consumption (KJ) of each benchmark under all-fast, half-half, and all-slow; benchmarks are grouped as slow-limited and middle-perf]
11
Behavior of PARSEC Benchmarks
• Observations
– Different applications behave differently on AMPs
– Usually SMP with fast processors saves energy
12
Why do different applications behave differently on AMPs?
13
Outline
• Background and Motivation
• Thread Interactions
• Dynamic Scheduling
• Asymmetry Aware Scheduling
• Conclusion and Future Work
14
Thread Interactions
Sources of thread interactions
• Critical Sections
• Barriers
15
Critical Sections (CS)
• Waiting to enter CSs
[Figure: thread timelines for Case (a) and Case (b), with segments marked as useful work, critical section, and waiting]
16
Barriers
• Waiting for other threads to finish
[Figure: thread timelines between two barriers]
17
Effect of Critical Section length
• CS limited application
• As critical section length increases, the average power consumed decreases
[Figure: normalized power consumption (0.8 to 1) versus critical section length (10%, 15%, 20%, 50%, 75% CS) for 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, and 16 @ 2 GHz]
18
Effect of Critical Section length
• CS limited application
[Figure: normalized execution time (0 to 1) versus critical section length (10%, 15%, 20%, 50%, 75%) for the SMP configurations 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, and 16 @ 2 GHz]
19
Effect of Critical Section length
• CS limited application
• Performance of AMPs sensitive to CS length
[Figure: normalized execution time (0 to 1) versus critical section length (10%, 15%, 20%, 50%, 75%) for the SMP configurations 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, and 16 @ 2 GHz, and the AMP configurations 8 @ 1 GHz, 8 @ 1.2 GHz, 8 @ 1.4 GHz, and 8 @ 1.7 GHz, each paired with 8 @ 2 GHz]
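A rough way to reason about a CS-limited application: critical sections run one at a time, so the total serialized CS time is a lower bound on execution time. The sketch below computes that bound under assumptions not stated in the talk: one thread per core, equal work per thread, a single lock, and time proportional to work / frequency. All names are hypothetical, and the 5% case is added only to show the short-CS regime.

```python
def cs_lower_bound(core_freqs_ghz, cs_fraction, work=1.0):
    """Lower bound on execution time for a lock-serialized workload.

    Each thread does `work` units, of which `cs_fraction` is inside a
    critical section guarded by a single lock. The run can finish no
    sooner than the slowest thread, nor sooner than the sum of all
    critical sections, which cannot overlap.
    """
    slowest_thread = max(work / f for f in core_freqs_ghz)
    serialized_cs = sum(cs_fraction * work / f for f in core_freqs_ghz)
    return max(slowest_thread, serialized_cs)


configs = {
    "16 @ 1 GHz (SMP)": [1.0] * 16,
    "8 @ 1 GHz, 8 @ 2 GHz (AMP)": [1.0] * 8 + [2.0] * 8,
    "16 @ 2 GHz (SMP)": [2.0] * 16,
}

for cs in (0.05, 0.10, 0.50):
    base = cs_lower_bound(configs["16 @ 1 GHz (SMP)"], cs)
    for name, freqs in configs.items():
        bound = cs_lower_bound(freqs, cs) / base
        print(f"{cs:.0%} CS  {name}: {bound:.2f}")
```

In this model the AMP matches the all-slow SMP while critical sections are short, and lands between the two SMPs once the serialized critical sections dominate. It is only a bound and ignores lock handoff costs and waiting order.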
20
Effect of Critical Section length
• CS limited application
• Energy consumption shows the same trend
[Figure: normalized energy consumption (0 to 1) versus critical section length (10%, 15%, 20%, 50%, 75%) for the SMP configurations 16 @ 1 GHz, 16 @ 1.2 GHz, 16 @ 1.4 GHz, 16 @ 1.7 GHz, and 16 @ 2 GHz, and the AMP configurations 8 @ 1 GHz, 8 @ 1.2 GHz, 8 @ 1.4 GHz, and 8 @ 1.7 GHz, each paired with 8 @ 2 GHz]
21
Effect of Critical Section frequency
• Both length and frequency of CS affect performance and energy consumption
• As CS frequency increases, the performance difference between half-half and all-fast shrinks
• If the majority of the execution time is spent waiting for locks, it is OK to have a few slow processors
• Results available in the paper
22
Effect of Barriers
• With few barriers, half-half performs similarly to all-slow
• With a large number of barriers, half-half performs similarly to all-fast
• Results available in the paper
23
Outline
• Background and Motivation
• Thread Interactions
• Dynamic Scheduling
• Asymmetry Aware Scheduling
• Conclusion and Future Work
24
Dynamic Scheduling
• Motivation: better run-time adaptivity
• Each thread requests more work after completing its assigned work
• OpenMP, Intel Threading Building Blocks
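The pull-based idea can be illustrated with a small event-driven simulation: whichever core becomes idle first takes the next task, so faster cores naturally take more tasks. This is a sketch under simplifying assumptions (equal-sized tasks, no scheduling overhead). The function names and core speeds are hypothetical, and it does not reproduce the OpenMP or TBB runtimes.

```python
import heapq


def dynamic_schedule(num_tasks, core_speeds, task_work=1.0):
    """Simulate cores that pull one task at a time from a shared pool.

    Returns (makespan, tasks_done_per_core). Each task takes
    task_work / speed on the core that picks it up.
    """
    # Heap of (time the core becomes free, core index).
    free_at = [(0.0, i) for i in range(len(core_speeds))]
    heapq.heapify(free_at)
    done = [0] * len(core_speeds)
    makespan = 0.0
    for _ in range(num_tasks):
        t, core = heapq.heappop(free_at)   # earliest idle core asks for work
        finish = t + task_work / core_speeds[core]
        done[core] += 1
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, core))
    return makespan, done


def static_schedule(num_tasks, core_speeds, task_work=1.0):
    """Equal split decided up front; assumes num_tasks divides evenly."""
    per_core = num_tasks // len(core_speeds)
    return max(per_core * task_work / s for s in core_speeds)


speeds = [1.0, 1.0, 2.0, 2.0]   # hypothetical: two slow cores, two fast cores
print("static :", static_schedule(120, speeds))
print("dynamic:", dynamic_schedule(120, speeds))
```

With an equal static split the slow cores finish last, while in the dynamic case each fast core ends up taking twice as many tasks as each slow core.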
25
Dynamic Scheduling
• Can help improve performance and reduce energy consumption in AMPs
• Should be preferred to static and guided policies

Machine configuration            Normalized Execution Time   Normalized Energy Consumption
                                 (Static/Dynamic)            (Static/Dynamic)
16 @ 1 GHz (SMP)                 1.0                         1.0
16 @ 1.2 GHz (SMP)               0.83                        0.87
16 @ 1.4 GHz (SMP)               0.71                        0.78
16 @ 1.7 GHz (SMP)               0.59                        0.68
16 @ 2 GHz (SMP)                 0.50                        0.61
8 @ 1 GHz, 8 @ 2 GHz (AMP)       1.00/0.67                   1.05/0.73
8 @ 1.2 GHz, 8 @ 2 GHz (AMP)     0.83/0.63                   0.90/0.70
8 @ 1.4 GHz, 8 @ 2 GHz (AMP)     0.71/0.59                   0.80/0.67
8 @ 1.7 GHz, 8 @ 2 GHz (AMP)     0.59/0.54                   0.69/0.63

• Parallel-for application
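The execution-time figures for the AMP rows can be compared against a simple idealized model. The model is an assumption of this sketch, not something stated in the talk: run time scales inversely with frequency, a static split finishes when the slowest core finishes, and dynamic scheduling shares work in proportion to core frequency. Dividing normalized energy by normalized time also gives average power relative to the 16 @ 1 GHz baseline. Function names are hypothetical; the numbers in `AMP_ROWS` are copied from the table.

```python
BASE_GHZ = 1.0   # the 16 @ 1 GHz SMP is the normalization baseline


def static_time(freqs, base=BASE_GHZ):
    """Equal split: the slowest core finishes last."""
    return base / min(freqs)


def dynamic_time(freqs, base=BASE_GHZ):
    """Ideal work sharing: throughput is the sum of core frequencies."""
    return len(freqs) * base / sum(freqs)


# AMP rows copied from the table:
# slow-core GHz -> (time static, time dynamic, energy static, energy dynamic)
AMP_ROWS = {
    1.0: (1.00, 0.67, 1.05, 0.73),
    1.2: (0.83, 0.63, 0.90, 0.70),
    1.4: (0.71, 0.59, 0.80, 0.67),
    1.7: (0.59, 0.54, 0.69, 0.63),
}

for slow, (t_static, t_dynamic, e_static, e_dynamic) in AMP_ROWS.items():
    freqs = [slow] * 8 + [2.0] * 8
    print(f"8 @ {slow} GHz, 8 @ 2 GHz: "
          f"model time {static_time(freqs):.3f}/{dynamic_time(freqs):.3f}, "
          f"table time {t_static}/{t_dynamic}, "
          f"relative power {e_static / t_static:.2f}/{e_dynamic / t_dynamic:.2f}")
```

Under these assumptions the model stays within 0.01 of every execution-time entry in the AMP rows. The relative power column shows dynamic scheduling drawing somewhat more average power than static scheduling while still using less energy, because the run is shorter.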
26
Outline
• Background and Motivation
• Thread Interactions
• Dynamic Scheduling
• Asymmetry Aware Scheduling
• Conclusion and Future Work
27
Scheduling in AMPs
• Longest Job to a Fast Processor First (LJFPF) [Lakshminarayana’08]
[Figure: thread timelines on fast cores and slow cores up to a barrier]
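The policy name suggests a simple rule: sort jobs by length and give the longest ones to the fastest cores. The sketch below follows that reading for the one-job-per-core case. It is an illustration, not the implementation from the cited work, and the function names, job lengths, and core speeds are hypothetical.

```python
def ljfpf_assign(job_lengths, core_speeds):
    """Pair the longest jobs with the fastest cores, one job per core.

    Returns a list of (job_length, core_speed) pairs.
    """
    if len(job_lengths) != len(core_speeds):
        raise ValueError("this sketch assumes exactly one job per core")
    jobs = sorted(job_lengths, reverse=True)
    cores = sorted(core_speeds, reverse=True)
    return list(zip(jobs, cores))


def time_to_barrier(assignment):
    """All threads wait at the barrier for the last job to finish."""
    return max(length / speed for length, speed in assignment)


jobs = [8.0, 2.0, 6.0, 4.0]    # hypothetical task lengths
cores = [1.0, 2.0, 1.0, 2.0]   # hypothetical: two slow cores, two fast cores

in_order = list(zip(jobs, cores))   # asymmetry-unaware: job i runs on core i
print("in order:", time_to_barrier(in_order))
print("LJFPF   :", time_to_barrier(ljfpf_assign(jobs, cores)))
```

The rule needs job lengths up front, which is why the scheduler has to be told or has to predict how long each task will run.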
28
How Does the Scheduler Know
• Length of work?
• Current mechanism: application sends task length information
• On-going work: prediction mechanism
29
LJFPF
• ITK: medical image processing applications (open source)
• MultiRegistration (registration method)
– Kernel with 50 iterations
– 50 iterations divided among 8 threads
[Figures: normalized execution time and normalized energy consumption]
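Fifty iterations do not divide evenly among 8 threads, so some threads receive more work than others. A minimal sketch, assuming a simple block partition in which the first few threads take one extra iteration; how ITK actually splits the iterations is not stated in the slides, and the function name is hypothetical.

```python
def split_iterations(total, num_threads):
    """Block partition: the first (total % num_threads) threads get one extra."""
    base, extra = divmod(total, num_threads)
    return [base + 1 if i < extra else base for i in range(num_threads)]


counts = split_iterations(50, 8)
print(counts)   # two threads receive 7 iterations, six receive 6
```

Threads with the extra iteration are the longer jobs that a length-aware policy would place on fast cores.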
30
Outline
• Background and Motivation
• Thread Interactions
• Dynamic Scheduling
• Asymmetry Aware Scheduling
• Conclusion and Future Work
31
Conclusion & Future Work
Conclusion
• Evaluated the performance/energy consumption behavior of multithreaded applications in AMPs
• For symmetric workloads
– With little thread interaction: SMP with fast processors
– With a lot of thread interaction: AMP could be better
• For asymmetric threads
– AMP could provide lowest energy consumption
Future Work
• Predict application characteristics and use predicted information for thread scheduling on AMPs
32
Thank you!