Asymmetry Aware Scheduling Algorithms for Asymmetric Processors

Nagesh Lakshminarayana, Sushma Rao, Hyesoon Kim

Computer Science, Georgia Institute of Technology

Outline

• Background and Problem

• Application characteristics on AMP/SMP

• LJFPF Policy

• CJFPF Policy

• Conclusion

Heterogeneous Architectures

• A particularly interesting class of parallel machines is the heterogeneous architecture:
  – Multiple types of Processing Elements (PEs) available on the same machine

[Figure: one PE labeled PEA and four PEs labeled PEB, connected by an interconnect]

Heterogeneous Architectures

• Heterogeneous architectures are becoming very common:
  – Multicore CPU + GPU
  – IBM Cell processor
  – Special accelerator
  – Asymmetric processors (focus of this talk)

[Figure: asymmetric processor combining fast cores and slow cores]

Scheduling Problem: Multiple applications

[Figure: fast and slow cores, with scalable applications and non-scalable applications to be scheduled onto them]

Scheduling Problem: Multi-threaded application

[Figure: threads of one multi-threaded application to be scheduled onto fast and slow cores]

Problem

How to schedule multi-threaded applications on Asymmetric Multiprocessors (AMP)?

Experimental Methodology

• Use a 1.87 GHz two-socket quad-core machine (8 cores) to measure performance

• Use SpeedStep technology to emulate an AMP

Configuration     Core frequencies
All-slow (SMP)    All 8 processors run at 1.6 GHz
One-fast (AMP)    1 processor runs at 1.87 GHz; 7 processors run at 1.6 GHz
Half-half (AMP)   4 processors run at 1.87 GHz; 4 processors run at 1.6 GHz
All-fast (SMP)    All 8 processors run at 1.87 GHz
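
The four configurations can be written down as per-core frequency lists, which makes the size of the asymmetry explicit. This is only a sketch: the names `make_config`, `CONFIGS`, and `relative_capacity` are hypothetical, and summed clock frequency is a crude proxy for capacity, not a measured result.

```python
# Core frequencies in GHz, taken from the configuration table.
FAST_GHZ = 1.87
SLOW_GHZ = 1.6


def make_config(n_fast, n_total=8):
    """Return per-core frequencies: fast cores first, then slow cores."""
    return [FAST_GHZ] * n_fast + [SLOW_GHZ] * (n_total - n_fast)


CONFIGS = {
    "all-slow": make_config(0),
    "one-fast": make_config(1),
    "half-half": make_config(4),
    "all-fast": make_config(8),
}


def relative_capacity(name):
    """Summed frequency relative to all-slow (assumption: a rough proxy only)."""
    return sum(CONFIGS[name]) / sum(CONFIGS["all-slow"])


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, round(relative_capacity(name), 3))
```

Under this proxy a fast core is at most 1.87 / 1.6, about 1.17 times, quicker than a slow one, so any scheduling gain on this testbed is bounded by that ratio.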

Performance Results on AMP/SMP

[Figure: normalized execution time (y-axis, 0.8 to 1.05) for the All-slow, One-fast, Half-half, and All-fast configurations]

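Slow-Limited Applications

[Figure: threads running up to a barrier]

Middle-perf Benchmarks

[Figure: thread timeline with a barrier]

• Similar to a slow-limited benchmark, but the sequential section is much longer

Unstable Benchmarks

[Figure: thread timeline with multiple barriers]

• Lots of barriers

• Asymmetric workloads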
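
One way to read the slow-limited category is through a toy model of a barrier phase. The sketch below assumes each thread is pinned to one core and that run time scales inversely with clock frequency; both are assumptions of this sketch, not claims from the slides, and `barrier_phase_time` is a hypothetical helper.

```python
def barrier_phase_time(work, speeds):
    """Time until every thread reaches the barrier: the slowest thread decides."""
    if len(work) != len(speeds):
        raise ValueError("need exactly one core speed per thread")
    return max(w / s for w, s in zip(work, speeds))


# Eight threads with identical work, in arbitrary units.
equal_work = [100.0] * 8

all_slow = [1.6] * 8
one_fast = [1.87] + [1.6] * 7
all_fast = [1.87] * 8

if __name__ == "__main__":
    for name, speeds in [("all-slow", all_slow),
                         ("one-fast", one_fast),
                         ("all-fast", all_fast)]:
        print(name, barrier_phase_time(equal_work, speeds))
```

With equal work per thread, the one-fast configuration finishes a phase no sooner than all-slow in this model, because the threads on slow cores still arrive last at the barrier.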

PARSEC Benchmarks

Application     Locks        Barriers   Cond. variables   AMP performance category
Blackscholes    39           8          0.000             slow-limited
Bodytrack       6824702      111160     0.003             unstable
Canneal         34           0          0.003             middle-perf
dedup           10002625     0          0.009             unstable
ferret          1422579      0          0.014             slow-limited
facesim         7384488      0          0.03              middle-perf
Fluidanimate    1153407308   31998      0.02              unstable
Freqmine        39           0          0.12              middle-perf
streamcluster   1379         633174     0.013             middle-perf
swaptions       9            0          0.00              slow-limited
vips            11           0          0.0049            unstable
x264            207692       0          13793             middle-perf

LJFPF Policy

• Longest Job to a Fast Processor First

[Figure: threads running up to a barrier on two fast cores and two slow cores]
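
The policy name suggests a simple pairing rule, sketched below. The slides do not give an implementation, so this is an illustration under assumptions: job lengths are known in advance, there is at most one job per core, and `ljfpf_assign` is a hypothetical name.

```python
def ljfpf_assign(job_lengths, core_speeds):
    """Pair the longest job with the fastest core, the next longest with the
    next fastest, and so on. Returns a dict {job_index: core_index}."""
    if len(job_lengths) > len(core_speeds):
        raise ValueError("this sketch assumes at most one job per core")
    # Job indices ordered from longest to shortest.
    jobs = sorted(range(len(job_lengths)),
                  key=lambda j: job_lengths[j], reverse=True)
    # Core indices ordered from fastest to slowest.
    cores = sorted(range(len(core_speeds)),
                   key=lambda c: core_speeds[c], reverse=True)
    return dict(zip(jobs, cores))


if __name__ == "__main__":
    lengths = [250, 350, 300, 280]    # hypothetical work per thread
    speeds = [1.6, 1.87, 1.6, 1.87]   # GHz: two slow cores, two fast cores
    print(ljfpf_assign(lengths, speeds))
```

In the example, the two longest jobs (indices 1 and 2) land on the two 1.87 GHz cores, and the two shortest go to the 1.6 GHz cores.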

How Does the Scheduler Know the Length of Work?

• Current mechanism: the application sends the information to the scheduler

• Ongoing work: a prediction mechanism

Evaluation

• Matrix Multiplication
  – Sequential version
  – Parallel version: symmetric workload
  – Parallel version: asymmetric workload

Asymmetric Workload (Matrix Multiplication)

[Figure: normalized execution time (y-axis, 0.9 to 1.2) for workload splits 300-300, 310-290, 320-280, 330-270, 340-260, 350-250, and 360-240; series: All-fast, Half-half (LJFPF), Half-half (RR), All-slow]

Real Application

• ITK (medical image processing toolkit)
  – Open source, but a real application

Evaluation: MultiRegistration

• Kernel loop has 50 iterations
  – 50 % 8 ≠ 0, so the iterations cannot be split evenly across 8 threads

• Divide the 50 iterations into chunks of 7, 7, 7, 7, 6, 6, 5, 5
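
To see why the placement of these chunks matters, the split can be run through a toy barrier model. Everything beyond the chunk sizes and clock frequencies is an assumption of this sketch: run time is taken to scale inversely with frequency, `phase_time` is a hypothetical helper, and the alternating placement is a made-up speed-oblivious baseline, not necessarily what the round-robin (RR) scheduler in the evaluation does.

```python
FAST, SLOW = 1.87, 1.6               # GHz, from the experimental setup
chunks = [7, 7, 7, 7, 6, 6, 5, 5]    # iteration split from the slide


def phase_time(chunks, speeds):
    """Toy model: the phase ends when the slowest (chunk, core) pair finishes."""
    return max(c / s for c, s in zip(chunks, speeds))


# LJFPF-style placement: chunks are listed longest first, so put fast cores first.
ljfpf_cores = [FAST] * 4 + [SLOW] * 4
# Hypothetical speed-oblivious placement: cores alternate slow, fast, slow, ...
oblivious_cores = [SLOW, FAST] * 4

t_all_fast = phase_time(chunks, [FAST] * 8)
t_ljfpf = phase_time(chunks, ljfpf_cores)
t_oblivious = phase_time(chunks, oblivious_cores)
t_all_slow = phase_time(chunks, [SLOW] * 8)

if __name__ == "__main__":
    print(t_all_fast, t_ljfpf, t_oblivious, t_all_slow)
```

In this model the LJFPF-style placement is limited by a 6-iteration chunk on a slow core (6 / 1.6 = 3.75 units), while the speed-oblivious placement is limited by a 7-iteration chunk on a slow core (7 / 1.6 = 4.375 units). These are model outputs only and should not be read as the measured ITK results.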

Results: ITK Benchmark

[Figure: normalized execution time (y-axis, 0.92 to 1.04) for All-fast, Half-half (LJFPF), Half-half (RR), and All-slow; the chart carries a 2.3% annotation]

Critical Section

[Figure: threads acquiring a lock]

Critical Section Limited Workloads

[Figure: thread timelines for Case (a) and Case (b); legend: critical section, useful work, waiting]

Critical Section Effects

[Figure: speedup (y-axis, 0 to 9) for workloads with 10%, 15%, and 20% critical section (CS); series: All-fast, Half-half, All-slow]

Half-half performs similarly to all-fast

CJFPF Policy

• Critical Job to a Fast Processor First Policy

[Figure: one fast core and three slow cores]
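
A minimal sketch of the idea behind the policy name follows. The slides do not say how a critical job is detected or when threads are migrated, so this sketch assumes each thread carries a hypothetical `in_critical_section` flag and that there is at most one thread per core; `Thread` and `cjfpf_assign` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    in_critical_section: bool  # hypothetical flag, assumed to be known


def cjfpf_assign(threads, core_speeds):
    """Give the fastest cores to threads inside a critical section; the other
    threads fill the remaining cores. Returns a dict {tid: core_index}."""
    if len(threads) > len(core_speeds):
        raise ValueError("this sketch assumes at most one thread per core")
    # Core indices ordered from fastest to slowest.
    cores = sorted(range(len(core_speeds)),
                   key=lambda c: core_speeds[c], reverse=True)
    # sorted() is stable: critical threads move to the front, ties keep order.
    ordered = sorted(threads, key=lambda t: not t.in_critical_section)
    return {t.tid: c for t, c in zip(ordered, cores)}


if __name__ == "__main__":
    threads = [Thread(0, False), Thread(1, False),
               Thread(2, True), Thread(3, False)]
    speeds = [1.6, 1.6, 1.6, 1.87]  # three slow cores and one fast core
    print(cjfpf_assign(threads, speeds))
```

In the example, thread 2 holds the critical section and is placed on the single 1.87 GHz core, so the serialized part of the program runs at the higher frequency.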

CJFPF Results

[Figure: speedup (y-axis, 0 to 7) for workloads labeled 8-12, 16-24, and 40-60; series: CJFPF, RR]

With longer critical sections, the benefit of the CJFPF policy decreases

Conclusion

• We evaluated the characteristics of multi-threaded applications on AMPs.

• Barriers and critical sections are important factors.

• We propose two new scheduling policies: longest job to a fast processor first (LJFPF) and critical job to a fast processor first (CJFPF)
  – The scheduling policies improve performance for asymmetric workloads.

• Future work
  – Develop a prediction mechanism
  – Evaluate symmetric workloads on AMPs
  – Other kinds of heterogeneous architectures

Thank you!
