Asymmetry Aware Scheduling Algorithms for Asymmetric Processors
Nagesh Lakshminarayana Sushma Rao Hyesoon Kim
Computer Science Georgia Institute of Technology
Outline
• Background and Problem
• Application characteristics on AMP/SMP
• LJFPF Policy
• CJFPF Policy
• Conclusion
Heterogeneous Architectures
• A particularly interesting class of parallel machines is the heterogeneous architecture:
– Multiple types of processing elements (PEs) are available on the same machine
[Figure: one PE A and four PE B elements connected through an interconnect]
Heterogeneous Architectures
• Heterogeneous architectures are becoming very common:
– Multicore CPU + GPU
– IBM Cell processor
– Special accelerators
– Asymmetric processors: one fast core and several slow cores (focus of this talk)
[Figure: an asymmetric processor with one fast core and four slow cores]
Scheduling Problem: Multiple applications
[Figure: scalable and non-scalable applications sharing one fast core and four slow cores]
Scheduling Problem: Multi-threaded application
[Figure: the threads of one multi-threaded application mapped onto one fast core and four slow cores]
Problem
How should multi-threaded applications be scheduled on asymmetric multiprocessors (AMPs)?
Experimental Methodology
• Use a 1.87 GHz two-socket quad-core machine (8 cores in total) to measure performance
• Use Intel SpeedStep technology to emulate an AMP by lowering the frequency of some cores
– All-slow (SMP): all 8 cores run at 1.6 GHz
– One-fast (AMP): 1 core runs at 1.87 GHz, 7 cores run at 1.6 GHz
– Half-half (AMP): 4 cores run at 1.87 GHz, 4 cores run at 1.6 GHz
– All-fast (SMP): all 8 cores run at 1.87 GHz
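The four configurations can be written down as per-core frequency lists. This Python sketch uses illustrative names (FAST_GHZ, SLOW_GHZ, CONFIGS, ideal_time_ratio) that do not come from the talk. It also computes the best-case all-fast/all-slow time ratio, under the simplifying assumption that run time scales exactly with 1/frequency; memory-bound code will usually scale less.

```python
# Per-core frequencies (GHz) for the four emulated configurations
# on the 8-core test machine. Names are illustrative.
FAST_GHZ = 1.87
SLOW_GHZ = 1.6

CONFIGS = {
    "All-slow": [SLOW_GHZ] * 8,
    "One-fast": [FAST_GHZ] * 1 + [SLOW_GHZ] * 7,
    "Half-half": [FAST_GHZ] * 4 + [SLOW_GHZ] * 4,
    "All-fast": [FAST_GHZ] * 8,
}


def ideal_time_ratio(slow_ghz: float, fast_ghz: float) -> float:
    """Best-case time at fast_ghz relative to slow_ghz, assuming run time
    scales exactly with 1/frequency (an idealization)."""
    return slow_ghz / fast_ghz


for name, freqs in CONFIGS.items():
    n_fast = freqs.count(FAST_GHZ)
    print(f"{name:9s}: {n_fast} fast, {len(freqs) - n_fast} slow")

print(f"ideal all-fast / all-slow time: {ideal_time_ratio(SLOW_GHZ, FAST_GHZ):.3f}")
```

Under this model, all-fast can at best reach about 0.86 of the all-slow execution time. This gives a rough reference point for normalized execution times.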
Performance Results on AMP/SMP
[Figure: normalized execution time (axis 0.8 to 1.05) of each benchmark under the All-slow, One-fast, Half-half, and All-fast configurations]
[Figure: one fast core and four slow cores]
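Slow-Limited Applications
[Figure: threads synchronizing at a barrier]
Middle-perf Benchmarks
[Figure: threads synchronizing at a barrier]
• Similar to a slow-limited benchmark, but the sequential section is much longer
Unstable Benchmarks
[Figure: threads synchronizing at two barriers]
• Lots of barriers
• Asymmetric workloads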
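A toy model helps explain why the first two categories behave differently. It assumes a program that consists of a sequential section followed by a parallel section split evenly across all cores and closed by a barrier, and that the sequential section runs on the fastest core. The model also assumes time = work / frequency. The functions exec_time and normalized are hypothetical. Under these assumptions, a purely barrier-bound program gains nothing from one or four fast cores (slow-limited), while a long sequential section puts the AMPs between all-slow and all-fast (middle-perf).

```python
FAST_GHZ, SLOW_GHZ = 1.87, 1.6

CONFIGS = {
    "All-slow": [SLOW_GHZ] * 8,
    "One-fast": [FAST_GHZ] * 1 + [SLOW_GHZ] * 7,
    "Half-half": [FAST_GHZ] * 4 + [SLOW_GHZ] * 4,
    "All-fast": [FAST_GHZ] * 8,
}


def exec_time(freqs, seq_work, par_work):
    """Idealized run time: sequential section on the fastest core, then
    a parallel section split evenly over all cores, ended by a barrier."""
    seq_time = seq_work / max(freqs)
    per_thread = par_work / len(freqs)
    # The barrier releases only when the slowest thread arrives.
    par_time = max(per_thread / f for f in freqs)
    return seq_time + par_time


def normalized(seq_work, par_work):
    """Execution time of each configuration relative to All-slow."""
    base = exec_time(CONFIGS["All-slow"], seq_work, par_work)
    return {name: exec_time(f, seq_work, par_work) / base
            for name, f in CONFIGS.items()}


# Barrier-bound, no sequential section: slow-limited behaviour.
print(normalized(seq_work=0.0, par_work=80.0))
# Long sequential section: AMPs land between all-slow and all-fast.
print(normalized(seq_work=40.0, par_work=80.0))
```

The model ignores memory effects and OS placement decisions, so it can only suggest the trend, not reproduce measured numbers.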
PARSEC Benchmarks
Application    Locks       Barriers  Cond. variables  AMP performance category
blackscholes   39          8         0.000            slow-limited
bodytrack      6824702     111160    0.003            unstable
canneal        34          0         0.003            middle-perf
dedup          10002625    0         0.009            unstable
ferret         1422579     0         0.014            slow-limited
facesim        7384488     0         0.03             middle-perf
fluidanimate   1153407308  31998     0.02             unstable
freqmine       39          0         0.12             middle-perf
streamcluster  1379        633174    0.013            middle-perf
swaptions      9           0         0.00             slow-limited
vips           11          0         0.0049           unstable
x264           207692      0         13793            middle-perf
LJFPF Policy
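• Longest Job to a Fast Processor First (LJFPF): when threads carry unequal amounts of work up to a barrier, run the longest jobs on the fast cores
[Figure: threads with unequal work on two fast cores and two slow cores, synchronizing at a barrier]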
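LJFPF can be sketched as a sorted pairing. This version makes several assumptions: there is one job per core between two barriers, job lengths are known, and time = length / frequency. The helpers ljfpf_assign and barrier_time and the job lengths are hypothetical. Pairing the longest job with the fastest core minimizes the barrier wait in this model. Swapping any "longer job on slower core" pair never increases the maximum.

```python
from typing import List, Tuple


def ljfpf_assign(job_lengths: List[float],
                 core_ghz: List[float]) -> List[Tuple[float, float]]:
    """Pair the longest jobs with the fastest cores (one job per core)."""
    if len(job_lengths) != len(core_ghz):
        raise ValueError("this sketch expects exactly one job per core")
    jobs = sorted(job_lengths, reverse=True)
    cores = sorted(core_ghz, reverse=True)
    return list(zip(jobs, cores))


def barrier_time(pairs: List[Tuple[float, float]]) -> float:
    """The barrier releases when the slowest (job, core) pair finishes."""
    return max(length / ghz for length, ghz in pairs)


cores = [1.87, 1.87, 1.6, 1.6]   # two fast and two slow cores
jobs = [3.0, 5.0, 3.0, 5.0]      # hypothetical uneven work

oblivious = list(zip(jobs, cores))  # placement in submission order
print(barrier_time(oblivious), barrier_time(ljfpf_assign(jobs, cores)))
```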
How Does the Scheduler Know the Length of Work?
• Current mechanism: the application sends the information to the scheduler
• On-going work: a prediction mechanism
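The talk does not describe the prediction mechanism. One common, simple option is shown here only as an illustration of what "prediction" could mean: an exponentially weighted moving average (EWMA) of each thread's past work lengths between barriers. The class WorkLengthPredictor and its alpha parameter are hypothetical and are not the authors' design.

```python
from typing import Dict, List


class WorkLengthPredictor:
    """Hypothetical per-thread EWMA predictor of work length between barriers."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.estimates: Dict[int, float] = {}

    def observe(self, thread_id: int, measured_length: float) -> None:
        """Blend a newly measured length into the thread's estimate."""
        old = self.estimates.get(thread_id)
        if old is None:
            self.estimates[thread_id] = measured_length
        else:
            self.estimates[thread_id] = (self.alpha * measured_length
                                         + (1.0 - self.alpha) * old)

    def predict(self, thread_id: int, default: float = 0.0) -> float:
        """Predicted length, or `default` for a thread never observed."""
        return self.estimates.get(thread_id, default)

    def longest_first(self) -> List[int]:
        """Thread ids by predicted length, longest first: the order in
        which an LJFPF-style scheduler would hand out fast cores."""
        return sorted(self.estimates, key=lambda t: self.estimates[t],
                      reverse=True)


p = WorkLengthPredictor(alpha=0.5)
p.observe(0, 10.0)
p.observe(1, 4.0)
p.observe(0, 6.0)
print(p.predict(0), p.longest_first())
```

A larger alpha reacts faster to phase changes but is noisier. A smaller alpha smooths more but lags behind real changes in work length.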
Evaluation
• Matrix multiplication
– Sequential version
– Parallel version: symmetric workload
– Parallel version: asymmetric workload
Asymmetric Workload (Matrix Multiplication)
[Figure: normalized execution time (axis 0.9 to 1.2) for work splits from 300-300 to 360-240, comparing All-fast, Half-half (LJFPF), Half-half (RR), and All-slow]
Real Application
• ITK (medical image processing toolkit)
– Open source, but a real application
Evaluation: MultiRegistration
• The kernel loop has 50 iterations
– 50 % 8 ≠ 0: the iterations cannot be divided evenly among 8 threads
• Divide the 50 iterations into chunks of 7, 7, 7, 7, 6, 6, 5, and 5 iterations
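A back-of-the-envelope check of the chunking above uses the chunk sizes from the slide and the Half-half configuration. It assumes each chunk's time is iterations / GHz and that the barrier waits for the slowest chunk. This is an idealized model, not the measured ITK result. The names are illustrative.

```python
FAST_GHZ, SLOW_GHZ = 1.87, 1.6
ITERATIONS = 50
THREADS = 8
CHUNKS = [7, 7, 7, 7, 6, 6, 5, 5]  # split given on the slide

assert sum(CHUNKS) == ITERATIONS and len(CHUNKS) == THREADS
print("leftover iterations:", ITERATIONS % THREADS)


def barrier_time(chunks, core_ghz):
    """Idealized time until the barrier: slowest chunk / core pair."""
    return max(c / f for c, f in zip(chunks, core_ghz))


half_half = [FAST_GHZ] * 4 + [SLOW_GHZ] * 4   # fast cores listed first

ljfpf = barrier_time(sorted(CHUNKS, reverse=True), half_half)  # 7s on fast
worst = barrier_time(sorted(CHUNKS), half_half)                # 7s on slow
all_fast = barrier_time(CHUNKS, [FAST_GHZ] * 8)
all_slow = barrier_time(CHUNKS, [SLOW_GHZ] * 8)

print(f"LJFPF {ljfpf:.3f}  worst {worst:.3f}  "
      f"all-fast {all_fast:.3f}  all-slow {all_slow:.3f}")
```

In this model, LJFPF on Half-half lands within about 0.2% of All-fast. Placing the 7-iteration chunks on slow cores falls back to All-slow time. A length-oblivious placement can land anywhere between these two extremes.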
Results: ITK Benchmark
[Figure: normalized execution time (axis 0.92 to 1.04) under All-fast, Half-half (LJFPF), Half-half (RR), and All-slow; annotated with 2.3%]
Critical Section
[Figure: threads acquiring a lock to enter a critical section]
Critical Section Limited Workloads
[Figure: timelines of critical section execution, useful work, and waiting, for Case (a) and Case (b)]
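A standard upper bound shows why critical sections limit scaling. If a fraction f of the total work must run inside one lock-protected critical section, that part is serialized and alone takes f of the sequential time. The speedup therefore cannot exceed 1/f, and it also cannot exceed the thread count. The helper cs_speedup_bound is hypothetical. Reading a "10% CS" style label as this fraction is an assumption, not something the slides define.

```python
def cs_speedup_bound(n_threads: int, cs_fraction: float) -> float:
    """Upper bound on speedup when cs_fraction of all work is serialized
    in a single critical section: min(n_threads, 1 / cs_fraction)."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if not 0.0 < cs_fraction <= 1.0:
        raise ValueError("cs_fraction must be in (0, 1]")
    return min(float(n_threads), 1.0 / cs_fraction)


for pct in (10, 15, 20):
    print(f"{pct}% CS, 8 threads: speedup <= {cs_speedup_bound(8, pct / 100):.2f}")
```

The bound says nothing about core speeds. Running the serialized part faster shortens exactly the portion that every thread waits on, which is the intuition behind giving critical work to the fast core.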
Critical Section Effects
[Figure: speedup (axis 0 to 10) at 10%, 15%, and 20% critical section (CS) time for All-fast, Half-half, and All-slow]
• Half-half performs similarly to all-fast
CJFPF Policy
• Critical Job to a Fast Processor First Policy
[Figure: one fast core and three slow cores]
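This small simulation shows one possible interpretation of CJFPF, not the authors' implementation. Each thread (one per core) does some useful work and then a critical section guarded by a single FIFO lock. In the baseline, every thread stays on its own core. In the CJFPF-like case, every critical section runs at the fast core's speed. The model ignores migration cost and contention for the fast core. The function makespan and its parameters are hypothetical.

```python
from typing import List


def makespan(core_ghz: List[float], work: float, cs: float,
             cs_on_fast: bool) -> float:
    """Idealized finish time of the last thread.

    Each thread runs `work` units on its own core, then a `cs`-unit
    critical section under one FIFO lock. If cs_on_fast is True, the
    critical section runs at the fastest core's speed (CJFPF-like);
    otherwise it runs on the thread's own core.
    """
    fast = max(core_ghz)
    # (time the thread reaches the lock, its own core speed), in lock order.
    arrivals = sorted((work / f, f) for f in core_ghz)
    lock_free_at = 0.0
    finish = 0.0
    for arrive, own_ghz in arrivals:
        speed = fast if cs_on_fast else own_ghz
        start = max(lock_free_at, arrive)   # wait if the lock is held
        lock_free_at = start + cs / speed
        finish = max(finish, lock_free_at)
    return finish


cores = [1.87, 1.6, 1.6, 1.6]   # one fast core, three slow cores
print("baseline:", makespan(cores, work=10.0, cs=5.0, cs_on_fast=False))
print("CJFPF-like:", makespan(cores, work=10.0, cs=5.0, cs_on_fast=True))
```

Because the critical sections are serialized, every second saved inside them shortens the whole chain of waiting threads. A real scheduler would have to pay for moving the lock holder, which this sketch leaves out.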
CJFPF Results
[Figure: speedup (axis 0 to 7) of CJFPF vs. RR for critical section lengths 8-12, 16-24, and 40-60]
• The longer the critical section, the smaller the benefit of the CJFPF policy
Conclusion
• We evaluated the characteristics of multi-threaded applications on AMPs.
• Barriers and critical sections are important factors.
• We propose two new scheduling policies: longest job to a fast processor first (LJFPF) and critical job to a fast processor first (CJFPF).
– The scheduling policies improve performance for asymmetric workloads.
• Future work
– Develop a prediction mechanism
– Evaluate symmetric workloads on AMPs
– Study other kinds of heterogeneous architectures
Thank you!