Scheduler Performance in Many-Core Architecture

Itai Avron
MSc Thesis
Technion - Electrical Engineering Dept.
May 2, 2012

Agenda

• Introduction and Motivation
• The Plural Architecture
• Improved Scheduler
• Analysis of Simulation Results
• Conclusions and Future Work

Background

• CPU performance improvements
  – In the past: increase of clock frequency
    • We reached the power wall
  – Today: multi-cores
  – The future: many-cores
    • Homogeneous / heterogeneous?
    • What architecture?
    • Memory model?
    • Scheduler?
    • …

Scheduling in Many-Core Architecture

• Software scheduling is slow
  – A lot of cores to schedule
  – Fine-granularity tasks mean many tasks to schedule at the same time
    • To enhance parallelism
• Dedicated hardware required!

Scheduler Challenges

• Latency
  – Message delay
    • From core to scheduler (completed previous task)
    • From scheduler to core (start new task)
  – Schedule time
    • To allocate tasks to cores
• Capacity
  – Number of instances/tasks scheduled per cycle

Other Architectures

• Graphics Processing Units (GPUs)
• Tilera
• Larrabee
• XMT
• Rigel
• Data-Driven Multithreading Model
• Task Superscalar

GPU – NVIDIA Fermi

• Composed of many processing elements (PEs)
• Scheduling is done in hardware
  – Schedules warps
  – Only one control flow
• SIMD

Tilera

• Composed of tiles
  – Each tile is independent
• Static scheduling
  – Determined during compile time
• MIMD

[Agarwal (MIT) 1997- ]

Larrabee (Intel)

• Array of processor cores
• Software-controlled scheduling
  – Lightweight distributed task-stealing scheduler
• MIMD

XMT

• Composed of TCUs
  – Thread control units
• Hardware scheduling
  – Using prefix-sum
• PRAM programming model
• SPMD

[Vishkin (UMD) 2005-]

Rigel

• Composed of tiles of clusters
  – Each cluster holds 8 cores
• Software scheduling
  – Allocation via task queues
  – Synchronization via barriers
• SPMD

[Patel (UIUC) 2008- ]

Data-Driven Multithreading Model

• A Thread Synchronization Unit (TSU)
  – Connects to existing cores
• Hardware scheduling
  – Using a task map
• Producer-consumer programming model

[Evripidou (U Cyprus) 1997- ]

Task Superscalar

• An out-of-order task pipeline
  – Connects to existing cores
  – No speculation
• Hardware scheduling
  – Creation of new tasks is done in software
  – Management and allocation are done in hardware
• StarSs programming model

[Etsion (BSC) 2009- ]


The ‘Plural’ System Architecture

[Figure: block diagram of the system, showing the Scheduler, the Cores, the Memory Network, and the Memory banks]

[Bayer (Technion) 1988 ]

The System

• Many RISC cores
  – In-order, blocking load/store
  – No data cache
• Shared on-chip memory banks
  – Interleaved addresses
  – Access takes 2 cycles
    • Core retries on collision
• Hardware synchronization and scheduling unit
  – Distributes tasks to cores according to a task map
  – Collects task completion messages from cores

Plural Task Map

• Precedence graph
• Created by the programmer
• Duplicable tasks
  – All instances are concurrent

[Figure: example task map. Each node shows the task name and its number of instances: A (1), B (1200), C (5000), D (130), E (1). Edges are dependencies between tasks, and cntr=4 is an example of a condition.]
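A task map of this kind can be held in a small data structure. The sketch below is a minimal illustration in Python, not the Plural implementation: the instance counts follow the example figure, while the dependency edges and the names `Task`, `task_map`, and `ready_tasks` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    instances: int                  # duplicable task: all instances may run concurrently
    depends_on: list = field(default_factory=list)


# Instance counts as in the example figure; the edges are hypothetical.
task_map = {
    "A": Task("A", 1),
    "B": Task("B", 1200, ["A"]),
    "C": Task("C", 5000, ["A"]),
    "D": Task("D", 130, ["B", "C"]),
    "E": Task("E", 1, ["D"]),
}


def ready_tasks(task_map, completed):
    """Names of tasks that have not completed and whose predecessors all have."""
    return [
        t.name
        for t in task_map.values()
        if t.name not in completed and all(d in completed for d in t.depends_on)
    ]
```

With nothing completed only `A` is ready; once `A` completes, `B` and `C` become ready together, which is where the parallelism of the map comes from.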

Plural Scheduling

• Central Synchronization Unit (CSU)
  – Manages allocation, scheduling, and synchronization of tasks
  – Collects task-termination events
  – Programmed by the task map
  – Allocates packs (sets) of parallel task instances
• Distribution Network (DN)
  – Organized as a tree with the CSU as its root
  – Mediates between the CSU and the processing cores
  – Downstream flow: decomposes allocated packs of task instances
  – Upstream flow: unifies task-termination events from the cores

Scheduling Process

1. CSU allocates ready-to-run tasks
2. DN distributes packs to cores
3. Cores send termination messages on completion
4. DN unifies termination messages
5. CSU processes newly eligible-to-run tasks
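The loop above can be sketched as a tiny cycle-by-cycle model. This is an illustrative toy, not the simulator used in the thesis: it assumes zero message latency, no per-core queues, an acyclic task map, and that a task becomes ready only when every instance of every predecessor has terminated. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(tasks, deps, n_cores, capacity):
    """Return the cycle at which the last task instance terminates.

    tasks: {name: (instances, length_in_cycles)}
    deps:  {name: [predecessor names]}
    Assumes zero message latency, no per-core queues, and an acyclic map.
    """
    to_allocate = {t: n for t, (n, _) in tasks.items()}
    unfinished = dict(to_allocate)
    running = []  # (finish_cycle, task_name) for every busy core
    cycle = 0
    while True:
        # Upstream flow: collect termination messages due by this cycle.
        for finish, t in running:
            if finish <= cycle:
                unfinished[t] -= 1
        running = [(f, t) for f, t in running if f > cycle]
        if not any(unfinished.values()):
            return cycle
        # CSU: allocate instances of ready tasks, limited by capacity and free cores.
        budget = min(capacity, n_cores - len(running))
        for t, (_, length) in tasks.items():
            ready = all(unfinished[d] == 0 for d in deps.get(t, []))
            if budget > 0 and to_allocate[t] > 0 and ready:
                pack = min(budget, to_allocate[t])
                to_allocate[t] -= pack
                budget -= pack
                # Downstream flow: the pack is split into single instances.
                running.extend((cycle + length, t) for _ in range(pack))
        cycle += 1
```

For example, with `tasks = {"A": (1, 3), "B": (4, 2)}`, `deps = {"B": ["A"]}` and 2 cores, a capacity of 10 instances per cycle finishes at cycle 7, while a capacity of 1 finishes at cycle 8, because the second core is started one cycle late.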


Scheduler Improvements

• Enhancing scheduler capacity
• Reducing scheduling latency
• Adding task queues to each core
  – Sharing queues
• Adding task length indicator

Simulation Environment

• Matlab simulator
  – Based on Eyal and Dima’s simulator
• Benchmarks
  – 3 demo programs
  – 3 benchmarks
    • JPEG, Mandelbrot, Linear Solver
• 24 system configurations
  – 256 cores, 256 banks
  – Scheduler capacity: 5, 10, infinite [instances]
  – Latency (between scheduler and cores): 0, 20 [cycles]
  – Task queue depth: 0, 1, 2, 10 [instances]

[Friedman, Khoretz, Ginosar, PDP 2012]
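The count of 24 configurations is the product of the three swept parameters: 3 capacities × 2 latencies × 4 queue depths. A short Python sketch enumerates them; the dictionary keys are illustrative names, not identifiers from the original Matlab simulator.

```python
from itertools import product
import math

capacities = [5, 10, math.inf]   # instances scheduled per cycle
latencies = [0, 20]              # cycles between scheduler and cores
queue_depths = [0, 1, 2, 10]     # instances per task queue

# Every configuration uses 256 cores and 256 memory banks.
configs = [
    {"cores": 256, "banks": 256, "capacity": c, "latency": l, "queue_depth": q}
    for c, l, q in product(capacities, latencies, queue_depths)
]

print(len(configs))  # 3 * 2 * 4 = 24
```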

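Benchmark Task Maps

Each entry gives the task name (number of instances, length in time units).

• Normal and Shared Variable: A (1, 23), B (100, 15), C (500, 35), D (600, 20), E (130, 18), F (1, 27); condition cntr=4
• Parallel: A (1, 23), B (2000, 25), C (2500, 35), D (2600, 26), E (2300, 18), F (1, 19); condition cntr=4
• Mandelbrot: A (1, 540), B (1, 225), C (4096, 80), D (4096, 7)
• JPEG: A (1, 10), B (1, 10), C (1, 5715), D (300, 181), E (1, 12810), F (300, 705), G (300, 2418), H (300, 2927), I (100, 1952), J (200, 1490), K (100, 1659), L (1, 460), M (1, 2548), N (1, 207)
• Linear Solver: A (1, 236), B (1, 40), C (1, 214), D (1, 172), E (100, 126), F (1, 58), G (7720, 197), H (100, 78), J (1, 47), K (1, 87); condition cntr=5

[Figure: precedence graphs of the five task maps]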


Analysis of Simulation Results

• “Normal” Benchmark
• “Parallel” Benchmark
• “Shared Variable” Benchmark
• JPEG Benchmark
• Linear Solver Benchmark
• Mandelbrot Benchmark
• Benchmarks Analysis

“Normal” Benchmark

Activity per core, Latency = 0 cycles

[Figure: activity chart, with the “Normal” task map shown alongside]

“Normal” Benchmark

Unbalanced scheduling, Latency = 0 cycles

[Figure: activity chart, with the “Normal” task map shown alongside]

“Normal” Benchmark

Activity per core, Latency = 20 cycles

[Figure: activity chart, with the “Normal” task map shown alongside]


“Parallel” Benchmark

Activity per core, Latency = 0 cycles

[Figure: activity chart, with the “Parallel” task map shown alongside]

“Parallel” Benchmark

Activity per core, Latency = 20 cycles

[Figure: activity chart, with the “Parallel” task map shown alongside]

Queues help hide latency only if scheduler capacity is sufficiently high


“Shared Variable” Benchmark

Activity per cycle, Latency = 0 cycles

[Figure: activity chart, with the “Shared Variable” task map shown alongside]

Is this a problem of the scheduler?


JPEG Benchmark

Activity per cycle, Latency = 0 cycles

[Figure: activity chart, with the JPEG task map shown alongside]

JPEG Benchmark

Unbalanced scheduling, Latency = 0 cycles

[Figure: activity chart, with the JPEG task map shown alongside]

Queues may degrade system performance

Solutions to imbalance

1. Queue sharing among multiple cores
2. Scheduling awareness of long tasks
3. Using fine granularity tasks
4. Task migration among queues
5. Task map optimization
6. Pipeline multiple instances of an algorithm

Solutions 1 to 3 were simulated
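A toy comparison shows why per-core queues can hurt and why sharing a queue helps. The sketch below is an assumption-laden illustration, not the thesis simulator: it ignores latency and capacity, pre-assigns instances to private queues round-robin (an assumed policy), and lets cores pull from a single shared queue as they become free. The task lengths are borrowed from the JPEG task map (task E and three instances of task G); the function names are hypothetical.

```python
import heapq


def makespan_private(lengths, n_cores):
    """Each instance is pre-assigned, round-robin, to one core's private queue."""
    totals = [0] * n_cores
    for i, length in enumerate(lengths):
        totals[i % n_cores] += length
    return max(totals)


def makespan_shared(lengths, n_cores):
    """All cores pull the next instance from one shared queue when they become free."""
    free_at = [0] * n_cores          # min-heap of the times at which cores become free
    heapq.heapify(free_at)
    for length in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + length)
    return max(free_at)


lengths = [12810, 2418, 2418, 2418]
print(makespan_private(lengths, 2))  # 15228: one short instance waits behind the long one
print(makespan_shared(lengths, 2))   # 12810: bounded by the long task alone
```

With private queues a short instance is stuck behind the 12810-unit task while the other core sits idle; with a shared queue the idle core picks it up.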


JPEG Benchmark

Shared Queues

Activity per cycle, Latency = 0 cycles

[Figure: activity chart, with the JPEG task map shown alongside]


JPEG Benchmark

Execution-Time Aware Scheduler

Activity per cycle, Latency = 0 cycles, task E flagged as long

Flag task C as well

[Green 2010]

JPEG Benchmark

Execution-Time Aware Scheduler

Activity per cycle, Latency = 0 cycles, tasks E and C flagged as long

Needs a profiling tool


JPEG Benchmark

Fine Granularity

Activity per cycle, Latency = 0 cycles

[Figure: activity chart, with the JPEG task map shown alongside. Task E (1 instance, 12810 time units) is replaced by tasks E1, E2, and E3 (1 instance, 4270 time units each).]

Might be further improved by decomposing task E further and by also decomposing task C


Linear Solver Benchmark

Activity per core, Latency = 20 cycles

[Figure: activity chart, with the Linear Solver task map shown alongside]


Mandelbrot Benchmark

Activity per cycle, Latency = 20 cycles

[Figure: activity chart, with the Mandelbrot task map shown alongside]

Mandelbrot Benchmark

Activity per cycle, Latency = 20 cycles, zoom on task D execution for infinite capacity

[Figure: activity chart, with the Mandelbrot task map shown alongside]

Fine-grained tasks require deep queues and a powerful scheduler to assign instances fast enough to hide latencies
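A back-of-envelope estimate makes this concrete. The sketch below assumes a simple steady-state model that is not taken from the thesis: every core finishes one instance per task length, so the scheduler must supply `cores / task_length` instances per cycle, and a core needs enough queued instances to stay busy for one scheduler round trip. The function names and the 40-cycle round trip (20 cycles each way) are assumptions.

```python
import math


def required_rate(cores, task_length):
    """Instances per cycle the scheduler must allocate to keep all cores busy."""
    return cores / task_length


def queue_slots_to_hide(round_trip_latency, task_length):
    """Queued instances a core needs to stay busy during one scheduler round trip."""
    return math.ceil(round_trip_latency / task_length)


# Mandelbrot on 256 cores: task C is 80 time units long, task D only 7.
print(required_rate(256, 80))      # about 3.2 instances per cycle
print(required_rate(256, 7))       # about 36.6 instances per cycle
print(queue_slots_to_hide(40, 7))  # 6 slots for an assumed 40-cycle round trip
```

Under these assumptions a capacity of 5 or 10 instances per cycle can keep up with task C but not with task D, which is consistent with the slide's remark about fine-grained tasks.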


Total Run-Time

A 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores

Load Balancing

(STD of cores’ busy time, latency = 20)

• Queues may cause imbalance
• Larger scheduler capacity decreases imbalance
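The imbalance metric named on this slide, the standard deviation of per-core busy time, is simple to compute. The sketch below uses Python's `statistics.pstdev` (population standard deviation); whether the thesis used the population or the sample form is not stated, so that choice is an assumption, and the busy-time values are made up for illustration.

```python
from statistics import pstdev


def imbalance(busy_cycles):
    """Standard deviation of per-core busy time; 0 means a perfectly balanced load."""
    return pstdev(busy_cycles)


balanced = [100, 100, 100, 100]   # hypothetical busy cycles of four cores
skewed = [0, 200, 0, 200]         # same total work, unevenly spread

print(imbalance(balanced))  # 0.0
print(imbalance(skewed))    # 100.0
```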

Effective Allocation Latency

A 1-slot queue is sufficient to hide much of the latency


Conclusions

• Analysis of scheduler effect on many-core architecture
• A simulation and investigation tool
• Queues to hide latencies
  – Might cause imbalance
• Task map optimization and tuning
• Sharing queues

Future Research

• Scheduler distribution networks
• Implications of scheduler on power
• Other imbalance solutions
  – As described before
• Profiling for task map optimization and scheduling analysis


QUESTIONS?