Scheduler Performance in Many-Core Architecture
Itai Avron
MSc Thesis
Technion - Electrical Engineering Dept.
Agenda
• Introduction and Motivation
• The Plural Architecture
• Improved Scheduler
• Analysis of Simulation Results
• Conclusions and Future Work
Background
• CPU performance improvements
– In the past: increasing the clock frequency
• We reached the power wall
– Today: multi-cores
– The future: many-cores
• Homogeneous or heterogeneous?
• What architecture?
• Memory model?
• Scheduler?
• …
Scheduling In Many-Core Architecture
• Software scheduling is slow
– A lot of cores to schedule
– Fine-granularity tasks (used to enhance parallelism) → many tasks to schedule at the same time
• Dedicated hardware is required!
Scheduler Challenges
• Latency
– Message delay
• From core to scheduler (completed previous task)
• From scheduler to core (start new task)
– Schedule time
• Time to allocate tasks to cores
• Capacity
– Number of instances/tasks scheduled per cycle
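A rough back-of-envelope (an illustration of the capacity requirement, not a figure from the thesis): if P cores run tasks averaging T cycles, they complete about P/T instances per cycle, so the scheduler must sustain roughly that allocation rate. For 256 cores and ~25-cycle tasks that is about 256/25 ≈ 10 instances per cycle; below that, the scheduler itself becomes the bottleneck.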
Other Architectures
• Graphics Processing Units (GPUs)
• Tilera
• Larrabee
• XMT
• Rigel
• Data-Driven Multithreading Model
• Task Superscalar
GPU – NVIDIA Fermi
• Composed of many processing elements (PEs)
• Scheduling is done in hardware
– Schedules warps
– Only one control flow
• SIMD
Tilera
• Composed of tiles
– Each tile is independent
• Static scheduling
– Determined at compile time
• MIMD
[Agarwal (MIT) 1997- ]
Larrabee (Intel)
• Array of processor cores
• Software-controlled scheduling
– Lightweight distributed task-stealing scheduler
• MIMD
XMT
• Composed of TCUs
– Thread Control Units
• Hardware scheduling
– Using prefix-sum
• PRAM programming model
• SPMD
[Vishkin (UMD) 2005-]
Rigel
• Composed of tiles of clusters
– Each cluster holds 8 cores
• Software scheduling
– Allocation via task queues
– Synchronization via barriers
• SPMD
[Patel (UIUC) 2008- ]
Data-Driven Multithreading Model
• A Thread Synchronization Unit (TSU)
– Connects to existing cores
• Hardware scheduling
– Using a task map
• Producer-consumer programming model
[Evripidou (U Cyprus) 1997- ]
Task Superscalar
• An out-of-order task pipeline
– Connects to existing cores
– No speculation
• Hardware scheduling
– Creation of new tasks is done in software
– Management and allocation are done in hardware
• StarSs programming model
[Etsion (BSC) 2009- ]
The ‘Plural’ System Architecture
[Block diagram: the scheduler connects to the cores, which access shared memory banks through the memory network]
[Bayer (Technion) 1988 ]
The System
• Many RISC cores
– In-order, blocking load/store
– No data cache
• Shared on-chip memory banks
– Interleaved addresses
– Access takes 2 cycles
• Core retries on collision
• Hardware synchronization and scheduling unit
– Distributes tasks to cores according to a task map
– Collects task-completion messages from cores
Plural Task Map
• Precedence graph
• Created by the programmer
• Duplicable tasks
– All instances are concurrent
[Task-map figure: tasks A ×1, B ×1200, C ×5000, D ×130, E ×1, with a condition cntr=4. Legend: each node carries the task name and its number of instances; edges are dependencies; cntr=4 marks a condition.]
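A minimal sketch of how such a task map might be captured in software. The Task class, its field names, and the chain of dependencies shown are illustrative assumptions (the figure does not spell out the edges); they are not the CSU's actual programming format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    name: str
    instances: int                                     # duplicable task: concurrent instances
    deps: List["Task"] = field(default_factory=list)   # predecessor tasks
    condition: Optional[str] = None                    # e.g. a "cntr = 4" guard

# The example map above, assuming a simple chain of dependencies
A = Task("A", 1)
B = Task("B", 1200, deps=[A])
C = Task("C", 5000, deps=[B])
D = Task("D", 130, deps=[C])
E = Task("E", 1, deps=[D], condition="cntr = 4")
```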
Plural Scheduling
• Central Synchronization Unit (CSU)
– Manages allocation, scheduling, and synchronization of tasks
– Collects task-termination events
– Programmed by the task map
– Allocates packs (sets) of parallel task instances
• Distribution Network (DN)
– Organized as a tree with the CSU at its root
– Mediates between the CSU and the processing cores
– Downstream flow: decomposes allocated packs of task instances
– Upstream flow: unifies task-termination events from the cores
Scheduling Process
1. CSU allocates ready-to-run tasks
2. DN distributes packs to cores
3. Cores send a termination message on completion
4. DN unifies termination messages
5. CSU processes newly eligible tasks (back to 1)
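The following is a toy, self-contained sketch of this round-trip; single-cycle tasks, the dict encoding, and the fixed pack capacity are simplifying assumptions, not the hardware's behaviour.

```python
# task -> (number of instances, predecessor tasks)
task_map = {"A": (1, []), "B": (4, ["A"]), "C": (4, ["A"]), "D": (1, ["B", "C"])}

to_issue = {t: n for t, (n, _) in task_map.items()}   # instances not yet allocated
to_finish = dict(to_issue)                            # instances not yet terminated
cycle = 0
while any(to_finish.values()):
    cycle += 1
    # CSU: a task is eligible once all of its predecessors have terminated
    ready = [t for t, (_, deps) in task_map.items()
             if to_issue[t] and all(to_finish[d] == 0 for d in deps)]
    capacity = 2                               # pack of up to 2 instances per cycle
    for t in ready:
        grant = min(capacity, to_issue[t])     # DN fans the pack out to cores
        to_issue[t] -= grant
        to_finish[t] -= grant                  # cores run and report termination
        capacity -= grant
        if capacity == 0:
            break
    print(f"cycle {cycle}: remaining {to_finish}")
```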
Scheduler Improvements
• Enhancing scheduler capacity
• Reducing scheduling latency
• Adding task queues to each core
– Sharing queues
• Adding a task-length indicator
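A sketch of the per-core queue idea: the scheduler pushes a few instances ahead of time so an idle core can start its next task without waiting a round trip. The class and method names are illustrative; the depths 0/1/2/10 mirror the configurations simulated next.

```python
from collections import deque
from typing import Optional

class CoreQueue:
    """Bounded per-core task queue (depth 0 = no queue, as in the baseline)."""
    def __init__(self, depth: int):
        self.depth = depth
        self.slots = deque()

    def try_push(self, task) -> bool:
        # the scheduler pre-assigns an instance; refuse when the queue is full
        if len(self.slots) >= self.depth:
            return False
        self.slots.append(task)
        return True

    def pop(self) -> Optional[object]:
        # a core with a non-empty queue starts its next task immediately,
        # hiding the scheduler-to-core latency
        return self.slots.popleft() if self.slots else None
```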
Simulation Environment
• MATLAB simulator
– Based on Eyal and Dima's simulator
• Benchmarks
– 3 demo programs
– 3 benchmarks: JPEG, Mandelbrot, Linear Solver
• 24 system configurations
– 256 cores, 256 banks
– Scheduler capacity: 5, 10, infinite [instances]
– Latency (scheduler to cores): 0, 20 [cycles]
– Task-queue depth: 0, 1, 2, 10 [instances]
[Friedman, Khoretz, Ginosar, PDP 2012]
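The 24 configurations are just the cross product of these parameter lists (3 capacities × 2 latencies × 4 queue depths). A sketch of the sweep, with run_simulation standing in for the MATLAB simulator (a hypothetical name, not its real interface):

```python
import itertools

capacities = [5, 10, float("inf")]   # scheduler capacity [instances/cycle]
latencies = [0, 20]                  # scheduler-to-core latency [cycles]
queue_depths = [0, 1, 2, 10]         # per-core task-queue depth [instances]

configs = list(itertools.product(capacities, latencies, queue_depths))
assert len(configs) == 24            # 3 * 2 * 4 system configurations
for cap, lat, depth in configs:
    # run_simulation(cores=256, banks=256, capacity=cap, latency=lat, depth=depth)
    print(f"capacity={cap:>4}, latency={lat:>2}, queue depth={depth:>2}")
```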
Benchmark Task Maps
(each entry: task name, ×number of instances, length in time units)

Normal and Shared Variable:
A ×1, 23; B ×100, 15; C ×500, 35; D ×600, 20; E ×130, 18; F ×1, 27; condition cntr=4

Parallel:
A ×1, 23; B ×2000, 25; C ×2500, 35; D ×2600, 26; E ×2300, 18; F ×1, 19; condition cntr=4

Mandelbrot:
A ×1, 540; B ×1, 225; C ×4096, 80; D ×4096, 7

JPEG:
A ×1, 10; B ×1, 10; C ×1, 5715; D ×300, 181; E ×1, 12810; F ×300, 705; G ×300, 2418; H ×300, 2927; I ×100, 1952; J ×200, 1490; K ×100, 1659; L ×1, 460; M ×1, 2548; N ×1, 207

Linear Solver:
A ×1, 236; B ×1, 40; C ×1, 214; D ×1, 172; E ×100, 126; F ×1, 58; G: "7720197" (instance count and length run together in the extraction); H ×100, 78; J ×1, 47; K ×1, 87; condition cntr=5
Analysis of Simulation Results
• “Normal” Benchmark
• “Parallel” Benchmark
• “Shared Variable” Benchmark
• JPEG Benchmark
• Linear Solver Benchmark
• Mandelbrot Benchmark
• Benchmarks Analysis
“Normal” Benchmark: activity per core, latency = 0 cycles
[Chart: per-core activity; task-map inset duplicates the “Normal” map above]
“Normal” Benchmark: unbalanced scheduling, latency = 0 cycles
[Chart: per-core activity showing the imbalance; task-map inset as above]
“Normal” Benchmark: activity per core, latency = 20 cycles
[Chart: per-core activity; task-map inset as above]
“Parallel” Benchmark: activity per core, latency = 0 cycles
[Chart: per-core activity; task-map inset duplicates the “Parallel” map above]
“Parallel” Benchmark: activity per core, latency = 20 cycles
Queues help hide latency only if scheduler capacity is sufficiently high
[Chart: per-core activity; task-map inset as above]
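A rough estimate of why (mine, not from the slides): with thousands of concurrent instances of 18-35-cycle tasks, 256 busy cores complete on the order of 256/25 ≈ 10 instances per cycle; a capacity-5 scheduler cannot refill the per-core queues at that rate, so the queues drain and the 20-cycle latency is exposed.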
“Shared Variable” Benchmark: activity per cycle, latency = 0 cycles
[Chart: activity per cycle; task-map inset duplicates the “Normal”/“Shared Variable” map above]
Is this a problem of the scheduler?
JPEG Benchmark: activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
JPEG Benchmark: unbalanced scheduling, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
Queues may degrade system performance
Solutions to imbalance
1. Queue sharing among multiple cores (simulated)
2. Scheduling awareness of long tasks (simulated)
3. Using fine-granularity tasks (simulated)
4. Task migration among queues
5. Task map optimization
6. Pipelining multiple instances of an algorithm
JPEG Benchmark (shared queues): activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above]
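A sketch of the queue-sharing idea: one queue feeds a small group of cores, so instances pre-assigned to the group cannot be stranded behind a single core that is busy with a long task. The locking stands in for hardware arbitration, and the names and group size are illustrative assumptions.

```python
import threading
from collections import deque

class SharedQueue:
    """One task queue shared by a group of cores instead of one per core."""
    def __init__(self):
        self._q = deque()
        self._lock = threading.Lock()

    def push(self, task):
        with self._lock:
            self._q.append(task)

    def pop(self):
        # any core in the group may take the next instance, so work queued
        # behind a busy core is picked up by its idle neighbours
        with self._lock:
            return self._q.popleft() if self._q else None

GROUP_SIZE = 4                                    # illustrative group size
queues = [SharedQueue() for _ in range(256 // GROUP_SIZE)]
```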
JPEG Benchmark (execution-time-aware scheduler): activity per cycle, latency = 0 cycles, task E flagged as long
Flag task C as well
[Green 2010]
JPEG Benchmark (execution-time-aware scheduler): activity per cycle, latency = 0 cycles, tasks E and C flagged as long
A profiling tool is needed
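A sketch of the policy as I read it from the slides (building on the cited [Green 2010]): ready instances of tasks flagged as long are dispatched before short ones, so long tasks do not start near the end of the run and dominate the tail. The flag set and the sort key are assumptions.

```python
# tasks flagged as long after profiling; E and C as on the slides
LONG_TASKS = {"E", "C"}

def dispatch_order(ready_instances):
    # long-flagged instances first; the stable sort keeps ties in order
    return sorted(ready_instances, key=lambda t: t not in LONG_TASKS)

print(dispatch_order(["D", "E", "F", "C"]))   # -> ['E', 'C', 'D', 'F']
```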
JPEG Benchmark (fine granularity): activity per cycle, latency = 0 cycles
[Chart: activity per cycle; JPEG task-map inset as above, with task E (×1, 12810) decomposed into E1, E2, E3 (each ×1, 4270)]
Might be further improved by decomposing task E further and by also decomposing task C
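Note that the split is exact: 3 × 4270 = 12810, so the total work of E is unchanged; the gain is that the scheduler can now place three shorter pieces instead of one 12810-unit block, which shortens the imbalance tail. (Whether E1, E2, E3 may overlap depends on their dependencies, which the figure does not spell out.)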
Linear Solver Benchmark: activity per core, latency = 20 cycles
[Chart: per-core activity; Linear Solver task-map inset as above]
Mandelbrot Benchmark: activity per cycle, latency = 20 cycles
[Chart: activity per cycle; Mandelbrot task-map inset as above]
Mandelbrot Benchmark: activity per cycle, latency = 20 cycles, zoom on task D execution for infinite capacity
[Chart: zoom on task D execution; Mandelbrot task-map inset as above]
Fine-grained tasks require deep queues and a powerful scheduler that assigns instances fast enough to hide latencies
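A back-of-envelope illustration (mine): task D has 4096 instances of 7 time units, so 256 busy cores finish roughly 256/7 ≈ 37 instances per cycle, well beyond a capacity of 5 or 10; and with a 20-cycle allocation latency each core needs about 20/7 ≈ 3 queued instances to bridge the gap, so shallow queues stall.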
Total Run-Time
A 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores
Load Balancing (standard deviation of per-core busy time, latency = 20)
• Queues may cause imbalance
• Larger scheduler capacity decreases imbalance
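The metric here is the standard deviation of per-core busy time over the run (0 means perfectly balanced). A sketch of its computation, with busy_times standing in for the simulator's per-core counters (the values below are made up):

```python
import statistics

busy_times = [9800, 10200, 10050, 9950]    # cycles busy, one entry per core
imbalance = statistics.pstdev(busy_times)  # population std over all cores
print(f"load imbalance: {imbalance:.1f} cycles")
```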
Effective Allocation Latency
A 1-slot queue is sufficient to hide much of the latency
Conclusions
• Analysis of the scheduler's effect on a many-core architecture
• A simulation and investigation tool
• Queues to hide latencies
– Might cause imbalance
• Task map optimization and tuning
• Sharing queues
Future Research
• Scheduler distribution networks
• Implications of the scheduler on power
• Other imbalance solutions
– As described above
• Profiling for task map optimization and scheduling analysis
QUESTIONS?