Scheduler Performance in Many-Core Architecture
Itai Avron
MSc Thesis
Technion - Electrical Engineering Dept.
Agenda
• Introduction and Motivation
• The Plural Architecture
• Improved Scheduler
• Analysis of Simulation Results
• Conclusions and Future Work
Background
• CPU performance improvements
– In the past: increasing the clock frequency
• We reached the power wall
– Today: multi-cores
– The future: many-cores
• Homogeneous or heterogeneous?
• What architecture?
• Memory model?
• Scheduler?
• …
Scheduling In Many-Core Architecture
• Software scheduling is slow
– A lot of cores to schedule
– Fine-granularity tasks (used to enhance parallelism) → many tasks to schedule at the same time
• Dedicated hardware is required!
Scheduler Challenges
• Latency
– Message delay
• From core to scheduler (completed previous task)
• From scheduler to core (start new task)
– Schedule time
• Time to allocate tasks to cores
• Capacity
– Number of instances/tasks scheduled per cycle
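A rough back-of-envelope (an illustration of the capacity requirement, not a figure from the thesis): if P cores run tasks averaging T cycles, they complete about P/T instances per cycle, so the scheduler must sustain roughly that allocation rate. For 256 cores and ~25-cycle tasks that is about 256/25 ≈ 10 instances per cycle; below that, the scheduler itself becomes the bottleneck.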
Other Architectures
• Graphics Processing Units (GPUs)
• Tilera
• Larrabee
• XMT
• Rigel
• Data-Driven Multithreading Model
• Task Superscalar
GPU – NVIDIA Fermi
• Composed of many processing elements (PEs)
• Scheduling is done in hardware
– Schedules warps
– Only one control flow
• SIMD
Tilera
• Composed of tiles
– Each tile is independent
• Static scheduling
– Determined at compile time
• MIMD
[Agarwal (MIT) 1997- ]
Larrabee (Intel)
• Array of processor cores
• Software-controlled scheduling
– Lightweight distributed task-stealing scheduler
• MIMD
XMT
• Composed of TCUs
– Thread Control Units
• Hardware scheduling
– Using prefix-sum
• PRAM programming model
• SPMD
[Vishkin (UMD) 2005-]
Rigel
• Composed of tiles of clusters
– Each cluster holds 8 cores
• Software scheduling
– Allocation via task queues
– Synchronization via barriers
• SPMD
[Patel (UIUC) 2008- ]
Data-Driven Multithreading Model
• A Thread Synchronization Unit (TSU)
– Connects to existing cores
• Hardware scheduling
– Using a task map
• Producer-consumer programming model
[Evripidou (U Cyprus) 1997- ]
Task Superscalar
• An out-of-order task pipeline
– Connects to existing cores
– No speculation
• Hardware scheduling
– Creation of new tasks is done in software
– Management and allocation are done in hardware
• StarSs programming model
[Etsion (BSC) 2009- ]
The ‘Plural’ System Architecture
[Block diagram: the scheduler connects to the cores, which access shared memory banks through the memory network]
[Bayer (Technion) 1988 ]
The System
• Many RISC cores
– In-order, blocking load/store
– No data cache
• Shared on-chip memory banks
– Interleaved addresses
– Access takes 2 cycles
• Core retries on collision
• Hardware synchronization and scheduling unit
– Distributes tasks to cores according to a task map
– Collects task-completion messages from cores
Plural Task Map
• Precedence graph
• Created by the programmer
• Duplicable tasks
– All instances are concurrent
[Task-map figure: tasks A ×1, B ×1200, C ×5000, D ×130, E ×1, with a condition cntr=4. Legend: each node carries the task name and its number of instances; edges are dependencies; cntr=4 marks a condition.]
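A minimal sketch of how such a task map might be captured in software. The Task class, its field names, and the chain of dependencies shown are illustrative assumptions (the figure does not spell out the edges); they are not the CSU's actual programming format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    name: str
    instances: int                                     # duplicable task: concurrent instances
    deps: List["Task"] = field(default_factory=list)   # predecessor tasks
    condition: Optional[str] = None                    # e.g. a "cntr = 4" guard

# The example map above, assuming a simple chain of dependencies
A = Task("A", 1)
B = Task("B", 1200, deps=[A])
C = Task("C", 5000, deps=[B])
D = Task("D", 130, deps=[C])
E = Task("E", 1, deps=[D], condition="cntr = 4")
```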
Plural Scheduling
• Central Synchronization Unit (CSU)
– Manages allocation, scheduling, and synchronization of tasks
– Collects task-termination events
– Programmed by the task map
– Allocates packs (sets) of parallel task instances
• Distribution Network (DN)
– Organized as a tree with the CSU at its root
– Mediates between the CSU and the processing cores
– Downstream flow: decomposes allocated packs of task instances
– Upstream flow: unifies task-termination events from the cores
Scheduling Process
1. CSU allocates ready-to-run tasks
2. DN distributes packs to cores
3. Cores send a termination message on completion
4. DN unifies termination messages
5. CSU processes newly eligible tasks (back to 1)
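The following is a toy, self-contained sketch of this round-trip; single-cycle tasks, the dict encoding, and the fixed pack capacity are simplifying assumptions, not the hardware's behaviour.

```python
# task -> (number of instances, predecessor tasks)
task_map = {"A": (1, []), "B": (4, ["A"]), "C": (4, ["A"]), "D": (1, ["B", "C"])}

to_issue = {t: n for t, (n, _) in task_map.items()}   # instances not yet allocated
to_finish = dict(to_issue)                            # instances not yet terminated
cycle = 0
while any(to_finish.values()):
    cycle += 1
    # CSU: a task is eligible once all of its predecessors have terminated
    ready = [t for t, (_, deps) in task_map.items()
             if to_issue[t] and all(to_finish[d] == 0 for d in deps)]
    capacity = 2                               # pack of up to 2 instances per cycle
    for t in ready:
        grant = min(capacity, to_issue[t])     # DN fans the pack out to cores
        to_issue[t] -= grant
        to_finish[t] -= grant                  # cores run and report termination
        capacity -= grant
        if capacity == 0:
            break
    print(f"cycle {cycle}: remaining {to_finish}")
```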
Scheduler Improvements
• Enhancing scheduler capacity
• Reducing scheduling latency
• Adding task queues to each core
– Sharing queues
• Adding a task-length indicator
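A sketch of the per-core queue idea: the scheduler pushes a few instances ahead of time so an idle core can start its next task without waiting a round trip. The class and method names are illustrative; the depths 0/1/2/10 mirror the configurations simulated next.

```python
from collections import deque
from typing import Optional

class CoreQueue:
    """Bounded per-core task queue (depth 0 = no queue, as in the baseline)."""
    def __init__(self, depth: int):
        self.depth = depth
        self.slots = deque()

    def try_push(self, task) -> bool:
        # the scheduler pre-assigns an instance; refuse when the queue is full
        if len(self.slots) >= self.depth:
            return False
        self.slots.append(task)
        return True

    def pop(self) -> Optional[object]:
        # a core with a non-empty queue starts its next task immediately,
        # hiding the scheduler-to-core latency
        return self.slots.popleft() if self.slots else None
```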
Simulation Environment
• MATLAB simulator
– Based on Eyal and Dima's simulator
• Benchmarks
– 3 demo programs
– 3 benchmarks: JPEG, Mandelbrot, Linear Solver
• 24 system configurations
– 256 cores, 256 banks
– Scheduler capacity: 5, 10, infinite [instances]
– Latency (scheduler to cores): 0, 20 [cycles]
– Task-queue depth: 0, 1, 2, 10 [instances]
[Friedman, Khoretz, Ginosar, PDP 2012]
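The 24 configurations are just the cross product of these parameter lists (3 capacities × 2 latencies × 4 queue depths). A sketch of the sweep, with run_simulation standing in for the MATLAB simulator (a hypothetical name, not its real interface):

```python
import itertools

capacities = [5, 10, float("inf")]   # scheduler capacity [instances/cycle]
latencies = [0, 20]                  # scheduler-to-core latency [cycles]
queue_depths = [0, 1, 2, 10]         # per-core task-queue depth [instances]

configs = list(itertools.product(capacities, latencies, queue_depths))
assert len(configs) == 24            # 3 * 2 * 4 system configurations
for cap, lat, depth in configs:
    # run_simulation(cores=256, banks=256, capacity=cap, latency=lat, depth=depth)
    print(f"capacity={cap:>4}, latency={lat:>2}, queue depth={depth:>2}")
```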
Benchmark Task Maps
(each entry: task name, ×number of instances, length in time units)

Normal and Shared Variable:
A ×1, 23; B ×100, 15; C ×500, 35; D ×600, 20; E ×130, 18; F ×1, 27; condition cntr=4

Parallel:
A ×1, 23; B ×2000, 25; C ×2500, 35; D ×2600, 26; E ×2300, 18; F ×1, 19; condition cntr=4

Mandelbrot:
A ×1, 540; B ×1, 225; C ×4096, 80; D ×4096, 7

JPEG:
A ×1, 10; B ×1, 10; C ×1, 5715; D ×300, 181; E ×1, 12810; F ×300, 705; G ×300, 2418; H ×300, 2927; I ×100, 1952; J ×200, 1490; K ×100, 1659; L ×1, 460; M ×1, 2548; N ×1, 207

Linear Solver:
A ×1, 236; B ×1, 40; C ×1, 214; D ×1, 172; E ×100, 126; F ×1, 58; G: "7720197" (instance count and length run together in the extraction); H ×100, 78; J ×1, 47; K ×1, 87; condition cntr=5
Analysis of Simulation Results
• “Normal” Benchmark
• “Parallel” Benchmark
• “Shared Variable” Benchmark
• JPEG Benchmark
• Linear Solver Benchmark
• Mandelbrot Benchmark
• Benchmarks Analysis
“Normal” Benchmark: activity per core, latency = 0 cycles
[Chart: per-core activity; task-map inset duplicates the “Normal” map above]
“Normal” Benchmark: unbalanced scheduling, latency = 0 cycles
[Chart: per-core activity showing the imbalance; task-map inset as above]
“Normal” Benchmark: activity per core, latency = 20 cycles
[Chart: per-core activity; task-map inset as above]
“Parallel” Benchmark: activity per core, latency = 0 cycles
[Chart: per-core activity; task-map inset duplicates the “Parallel” map above]
“Parallel” Benchmark: activity per core, latency = 20 cycles
Queues help hide latency only if scheduler capacity is sufficiently high
[Chart: per-core activity; task-map inset as above]
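A rough estimate of why (mine, not from the slides): with thousands of concurrent instances of 18-35-cycle tasks, 256 busy cores complete on the order of 256/25 ≈ 10 instances per cycle; a capacity-5 scheduler cannot refill the per-core queues at that rate, so the queues drain and the 20-cycle latency is exposed.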
“Shared Variable” Benchmark: activity per cycle, latency = 0 cycles
[Chart: activity per cycle; task-map inset duplicates the “Normal”/“Shared Variable” map above]
Is this a problem of the scheduler?
JPEG Benchmark: activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
JPEG Benchmark: unbalanced scheduling, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
Queues may degrade system performance
Solutions to imbalance
1. Queue sharing among multiple cores (simulated)
2. Scheduling awareness of long tasks (simulated)
3. Using fine-granularity tasks (simulated)
4. Task migration among queues
5. Task map optimization
6. Pipelining multiple instances of an algorithm
JPEG Benchmark (shared queues): activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
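A sketch of the queue-sharing idea: one queue feeds a small group of cores, so instances pre-assigned to the group cannot be stranded behind a single core that is busy with a long task. The locking stands in for hardware arbitration, and the names and group size are illustrative assumptions.

```python
import threading
from collections import deque

class SharedQueue:
    """One task queue shared by a group of cores instead of one per core."""
    def __init__(self):
        self._q = deque()
        self._lock = threading.Lock()

    def push(self, task):
        with self._lock:
            self._q.append(task)

    def pop(self):
        # any core in the group may take the next instance, so work queued
        # behind a busy core is picked up by its idle neighbours
        with self._lock:
            return self._q.popleft() if self._q else None

GROUP_SIZE = 4                                    # illustrative group size
queues = [SharedQueue() for _ in range(256 // GROUP_SIZE)]
```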
JPEG Benchmark (execution-time-aware scheduler): activity per cycle, latency = 0 cycles, task E flagged as long
Flag task C as well
[Green 2010]
JPEG Benchmark (execution-time-aware scheduler): activity per cycle, latency = 0 cycles, tasks E and C flagged as long
A profiling tool is needed
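A sketch of the policy as I read it from the slides (building on the cited [Green 2010]): ready instances of tasks flagged as long are dispatched before short ones, so long tasks do not start near the end of the run and dominate the tail. The flag set and the sort key are assumptions.

```python
# tasks flagged as long after profiling; E and C as on the slides
LONG_TASKS = {"E", "C"}

def dispatch_order(ready_instances):
    # long-flagged instances first; the stable sort keeps ties in order
    return sorted(ready_instances, key=lambda t: t not in LONG_TASKS)

print(dispatch_order(["D", "E", "F", "C"]))   # -> ['E', 'C', 'D', 'F']
```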
JPEG Benchmark (fine granularity): activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above, with task E (×1, 12810) decomposed into E1, E2, E3 (each ×1, 4270)]
Might be further improved by decomposing task E further and by also decomposing task C
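Note that the split is exact: 3 × 4270 = 12810, so the total work of E is unchanged; the gain is that the scheduler can now place three shorter pieces instead of one 12810-unit block, which shortens the imbalance tail. (Whether E1, E2, E3 may overlap depends on their dependencies, which the figure does not spell out.)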
Linear Solver Benchmark: activity per core, latency = 20 cycles
[Chart: per-core activity; Linear Solver task-map inset as above]
Mandelbrot Benchmark: activity per cycle, latency = 20 cycles
[Chart: activity per cycle; Mandelbrot task-map inset as above]
Mandelbrot Benchmark: activity per cycle, latency = 20 cycles, zoom on task D execution for infinite capacity
[Chart: zoom on task D execution; Mandelbrot task-map inset as above]
Fine-grained tasks require deep queues and a powerful scheduler that assigns instances fast enough to hide latencies
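A back-of-envelope illustration (mine): task D has 4096 instances of 7 time units, so 256 busy cores finish roughly 256/7 ≈ 37 instances per cycle, well beyond a capacity of 5 or 10; and with a 20-cycle allocation latency each core needs about 20/7 ≈ 3 queued instances to bridge the gap, so shallow queues stall.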
Total Run-Time
A 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores
Load Balancing (standard deviation of per-core busy time, latency = 20)
• Queues may cause imbalance
• Larger scheduler capacity decreases imbalance
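The metric here is the standard deviation of per-core busy time over the run (0 means perfectly balanced). A sketch of its computation, with busy_times standing in for the simulator's per-core counters (the values below are made up):

```python
import statistics

busy_times = [9800, 10200, 10050, 9950]    # cycles busy, one entry per core
imbalance = statistics.pstdev(busy_times)  # population std over all cores
print(f"load imbalance: {imbalance:.1f} cycles")
```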
Effective Allocation Latency
A 1-slot queue is sufficient to hide much of the latency
Conclusions
• Analysis of the scheduler's effect on a many-core architecture
• A simulation and investigation tool
• Queues to hide latencies
– Might cause imbalance
• Task map optimization and tuning
• Sharing queues
Future Research
• Scheduler distribution networks
• Implications of the scheduler on power
• Other imbalance solutions
– As described above
• Profiling for task map optimization and scheduling analysis
QUESTIONS?