
Page 1: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit

Page 2: Multicores Are Here!

[Figure: timeline from 1970 to 20?? plotting processors against number of cores (1 to 512): 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, PA-8800, Xeon MP, Power4, Power6, Opteron, Opteron 4P, Niagara, Yonah, Pentium Extreme, Tanglewood, Raw, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom 1480, Ambric AM2045.]

Page 3: Multicores Are Here!

Uniprocessors: C is the common machine language.

For uniprocessors, C was:
• Portable
• High performance
• Composable
• Malleable
• Maintainable

[Figure: same processor timeline as the previous slide.]

Page 4: Multicores Are Here!

What is the common machine language for multicores?

[Figure: same processor timeline as the previous slide.]

Page 5: Common Machine Languages

Uniprocessors:
• Common properties: single flow of control; single memory image
• Differences: register file (hidden by register allocation); ISA (hidden by instruction selection); functional units (hidden by instruction scheduling)

Multicores:
• Common properties: multiple flows of control; multiple local memories
• Differences: number and capabilities of cores; communication model; synchronization model

von Neumann languages represent the common properties and abstract away the differences. We need common machine language(s) for multicores.

Page 6: Streaming as a Common Machine Language

• Regular and repeating computation
• Independent filters with explicit communication
– Segregated address spaces and multiple program counters
• Natural expression of parallelism
– Producer/consumer dependencies
– Enables powerful, whole-program transformations

[Figure: FM radio stream graph: AtoD → FMDemod → Scatter → three band pipelines (LPF1→HPF1, LPF2→HPF2, LPF3→HPF3) → Gather → Adder → Speaker.]

A StreamIt-style sketch of this graph follows.
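To make the picture concrete, here is a sketch of the graph above in StreamIt syntax. Only the structure is taken from the figure; the filter bodies are hypothetical pass-through stand-ins, not the benchmark's real code.

void->void pipeline FMRadioSketch {
    add AtoD();
    add FMDemod();
    add splitjoin {                // the Scatter ... Gather region
        split duplicate;
        for (int i = 1; i <= 3; i++) {
            add pipeline {         // band i: LPFi -> HPFi
                add LPF(i);
                add HPF(i);
            };
        }
        join roundrobin;
    };
    add Adder(3);                  // sums the three bands
    add Speaker();
}

// Stand-in filters so the sketch is complete; real implementations
// would perform the demodulation and band-pass arithmetic.
void->float filter AtoD { work push 1 { push(0.0); } }
float->float filter FMDemod { work pop 1 push 1 { push(pop()); } }
float->float filter LPF(int band) { work pop 1 push 1 { push(pop()); } }
float->float filter HPF(int band) { work pop 1 push 1 { push(pop()); } }
float->float filter Adder(int N) {
    work pop N push 1 {
        float sum = 0;
        for (int i = 0; i < N; i++) sum += pop();
        push(sum);
    }
}
float->void filter Speaker { work pop 1 { println(pop()); } }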

Page 7: Types of Parallelism

Task Parallelism
– Parallelism explicit in algorithm
– Between filters without a producer/consumer relationship

Data Parallelism
– Peel iterations of a filter and place them within a scatter/gather pair (fission)
– Can't parallelize filters with state

Pipeline Parallelism
– Between producers and consumers
– Stateful filters can be parallelized

[Figure: stream graph with a scatter/gather pair; the independent branches are labeled "Task".]

Page 8: Types of Parallelism

Task Parallelism
– Parallelism explicit in algorithm
– Between filters without a producer/consumer relationship

Data Parallelism
– Between iterations of a stateless filter
– Place within a scatter/gather pair (fission)
– Can't parallelize filters with state

Pipeline Parallelism
– Between producers and consumers
– Stateful filters can be parallelized

[Figure: the same graph annotated with all three kinds: "Task" across independent branches, "Data" across the fissed copies inside a scatter/gather pair, and "Pipeline" along producer-to-consumer chains.]

Page 9: Types of Parallelism

Traditionally:

Task Parallelism
– Thread (fork/join) parallelism

Data Parallelism
– Data-parallel loop (forall)

Pipeline Parallelism
– Usually exploited in hardware

[Figure: same annotated graph as the previous slide.]

Page 10: Problem Statement

Given:
– A stream graph with a compute and communication estimate for each filter
– The computation and communication resources of the target machine

Find:
– A schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources (one way to formalize this follows)
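One standard way to formalize the objective (our phrasing and notation, not the talk's): let w(f) be the work estimate of filter f, and let an assignment sigma map each filter to one of the n cores. Steady-state throughput is limited by the busiest core, so the compiler seeks

    \min_{\sigma} \; \max_{1 \le c \le n} \Big( \sum_{f : \sigma(f) = c} w(f) \;+\; \mathrm{comm}(c, \sigma) \Big)

subject to each core's local-memory capacity. Even with the communication term dropped this is makespan minimization, which is NP-hard, so the three phases that follow are heuristics.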

Page 11: Our 3-Phase Solution

1. Coarsen: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters

Compiling to a 16-core architecture yields an 11.2x mean throughput speedup over a single core.

[Figure: the stream graph after each phase: Coarsen Granularity, Data Parallelize, Software Pipeline.]

Page 12: Outline

• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution

Page 13: The StreamIt Project

• Applications
– DES and Serpent [PLDI 05]
– MPEG-2 [IPDPS 06]
– SAR, DSP benchmarks, JPEG, …
• Programmability
– StreamIt Language (CC 02)
– Teleport Messaging (PPOPP 05)
– Programming Environment in Eclipse (P-PHEC 05)
• Domain-Specific Optimizations
– Linear Analysis and Optimization (PLDI 03)
– Optimizations for bit streaming (PLDI 05)
– Linear State Space Analysis (CASES 05)
• Architecture-Specific Optimizations
– Compiling for Communication-Exposed Architectures (ASPLOS 02)
– Phased Scheduling (LCTES 03)
– Cache Aware Optimization (LCTES 05)
– Load-Balanced Rendering (Graphics Hardware 05)

[Figure: compiler flow. A StreamIt program passes through the front-end (producing annotated Java) and stream-aware optimizations, then to one of several backends: a uniprocessor backend emitting C/C++, a cluster backend emitting MPI-like C/C++, a Raw backend emitting C per tile plus message code, an IBM X10 backend targeting the streaming X10 runtime, and a simulator (Java library).]

Page 14: Model of Computation

• Synchronous dataflow [Lee '92]
– Graph of autonomous filters
– Communicate via FIFO channels
• Static I/O rates
– Compiler decides on an order of execution (schedule)
– Static estimation of computation

[Figure: example stream graph with an A/D filter, a Band Pass filter, a Duplicate splitter, and four LED Detect filters.]

The balance equation that makes this compile-time scheduling possible is sketched below.
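For every channel from filter A to filter B, the compiler picks whole numbers of firings n_A and n_B per steady-state iteration so that production matches consumption; this is the standard synchronous-dataflow balance equation (textbook SDF, not taken from the slide):

    n_A \cdot \mathrm{push}_A \;=\; n_B \cdot \mathrm{pop}_B

For example, if A pushes 2 items per firing and B pops 3, the minimal solution fires A three times and B twice per iteration (6 items produced and consumed). Solving the balance equations over the whole graph fixes the schedule, and buffer sizes follow from it.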

Page 15: Example StreamIt Filter

[Figure: FIR filter reading from an input tape (items 0, 1, 2, …) and writing an output tape; each firing peeks at N items, pops 1, and pushes 1.]

float->float filter FIR (int N, float[N] weights) {

  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateless: no variable written in one firing is read by a later one, so every invocation is independent.

Page 16: Example StreamIt Filter

[Figure: the same FIR input/output diagram as the previous slide.]

float->float filter FIR (int N) {
  float[N] weights;                      // field: state carried across firings

  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);     // rewrites the stored weights on every firing
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateful: weights is written on each firing and read on the next, so the firings can no longer run independently.

Page 17: StreamIt Language Overview

• StreamIt is a novel language for streaming
– Exposes parallelism and communication
– Architecture independent
– Modular and composable: simple structures are composed to create complex graphs (see the example below)
– Malleable: change program behavior with small modifications

[Figure: the four hierarchical stream structures: filter; pipeline; splitjoin (splitter → parallel computation → joiner); and feedback loop (joiner → body → splitter, with a loop stream feeding back). Any StreamIt language construct may appear as a child.]
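As a small illustration of the composability and malleability claims (a hypothetical example, not from the talk), a parameterized splitjoin reshapes the whole graph when one argument changes:

float->float splitjoin ChannelBank(int N) {
    split duplicate;                 // every channel sees the full signal
    for (int i = 0; i < N; i++) {
        add BandProcess(i, N);       // hypothetical per-channel filter
    }
    join roundrobin;                 // interleave the channel outputs
}

float->float filter BandProcess(int band, int total) {
    work pop 1 push 1 { push(pop()); }   // stand-in body
}

Writing add ChannelBank(8) instead of add ChannelBank(3) inside an enclosing pipeline changes the graph from 3 parallel channels to 8 without touching any other code.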

Page 18: Outline

• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution

Page 19: Baseline 1: Task Parallelism

[Figure: a splitjoin with two identical pipelines (BandPass → Compress → Process → Expand → BandStop) between a Splitter and a Joiner, followed by an Adder.]

• Inherent task parallelism between the two processing pipelines
• Task-parallel model: parallelize only the explicit task parallelism (fork/join parallelism)
• Executing this on a 2-core machine gives roughly a 2x speedup over a single core
• What about 4, 16, 1024, … cores?

Page 20: Evaluation: Task Parallelism

Raw microprocessor:
– 16 in-order, single-issue cores with D$ and I$ (data and instruction caches)
– 16 memory banks, each bank with DMA
– Cycle-accurate simulator

[Figure: bar chart of throughput, normalized to single-core StreamIt, for each benchmark under the task-parallel baseline (y-axis 0 to 19).]

Parallelism: not matched to the target! Synchronization: not matched to the target!

Page 21: Baseline 2: Fine-Grained Data Parallelism

• Each of the filters in the example is stateless
• Fine-grained data-parallel model:
– Fiss each stateless filter N ways (N is the number of cores)
– Remove the scatter/gather if possible
• We can introduce data parallelism (example: 4 cores)
• Each fission group occupies the entire machine

[Figure: the filter bank graph after fine-grained fission: every filter (BandPass, Compress, Process, Expand, BandStop, Adder) replicated 4 ways inside its own splitter/joiner pair, yielding a dense network of scatters and gathers.]

A source-level sketch of what fission means follows.
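At the source level, fissing a stateless pop-1/push-1 filter F four ways amounts to the rewrite below (our sketch of the transformation the compiler applies internally; peeking filters additionally need duplicated, overlapping inputs):

float->float splitjoin FissedF {
    split roundrobin;      // "scatter": deal one item to each copy in turn
    add F();               // four independent copies, no shared state
    add F();
    add F();
    add F();
    join roundrobin;       // "gather": collect results back in order
}

float->float filter F {    // hypothetical stateless worker
    work pop 1 push 1 { push(2.0 * pop()); }
}

Each copy handles every fourth item, which is safe precisely because F carries no state from one firing to the next.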

Page 22: Evaluation: Fine-Grained Data Parallelism

[Figure: bar chart of throughput normalized to single-core StreamIt (0 to 19), comparing Task vs. Fine-Grained Data across the benchmarks.]

Good parallelism! Too much synchronization!

Page 23: Outline

• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution

Page 24: Phase 1: Coarsen the Stream Graph

• Before data parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
– Don't fuse stateless with stateful
– Don't fuse a peeking filter with anything upstream

[Figure: the filter bank graph with peeking filters marked "Peek"; those filters are not fused with anything upstream.]

Page 25: Phase 1: Coarsen the Stream Graph

[Figure: the coarsened graph: each branch's BandPass, Compress, Process, and Expand are fused into a single filter; the BandStop filters and the Adder remain separate.]

• Before data parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
– Don't fuse stateless with stateful
– Don't fuse a peeking filter with anything upstream
• Benefits:
– Reduces global communication and synchronization
– Exposes inter-node optimization opportunities

A source-level sketch of what fusion does follows.
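To see what fusion buys, compare a two-filter stateless pipeline with its fused equivalent (hypothetical stand-in filters, not the benchmark's code):

// Before coarsening: two filters, one FIFO channel between them.
float->float pipeline ScaleThenOffset {
    add Scale();
    add Offset();
}
float->float filter Scale  { work pop 1 push 1 { push(3.0 * pop()); } }
float->float filter Offset { work pop 1 push 1 { push(pop() + 1.0); } }

// After coarsening: one filter; the channel becomes a local variable,
// so the communication and synchronization between the bodies disappear.
float->float filter FusedScaleOffset {
    work pop 1 push 1 {
        float t = 3.0 * pop();   // Scale's body
        push(t + 1.0);           // Offset's body
    }
}

The fused filter is still stateless, so it remains a legal target for the data-parallelization phase that follows.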

Page 26: Phase 2: Data Parallelize

Data parallelize for 4 cores: fiss the Adder 4 ways so that it occupies the entire chip.

[Figure: the coarsened graph with the Adder replaced by a splitter, four Adder copies, and a joiner.]

Page 27: Phase 2: Data Parallelize

Data parallelize for 4 cores: the fused filters are task parallel and each does equal work, so fiss each one 2 ways to occupy the entire chip.

[Figure: each fused BandPass→Compress→Process→Expand filter fissed 2 ways inside its own scatter/gather pair; the Adder fissed 4 ways.]

Page 28: Phase 2: Data Parallelize

Data parallelize for 4 cores: each filter does equal work within its task-parallel branch, so the BandStop filters are likewise fissed 2 ways each.

• Task-conscious data parallelization
– Preserve task parallelism
• Benefits:
– Reduces global communication and synchronization

The fission widths follow a simple balance, sketched below.

[Figure: the fully data-parallelized graph: per branch, the fused filter and the BandStop fissed 2 ways each; the Adder fissed 4 ways.]
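The widths here follow a simple balance (our rule of thumb, read off the figure rather than quoted from the talk): with n cores and t equally loaded task-parallel branches, fiss each branch about n/t ways. In this example n = 4 and t = 2, so each fused branch is fissed 4/2 = 2 ways, while the lone Adder (t = 1) is fissed 4 ways. The fine-grained baseline instead uses width n = 4 for every filter, which is exactly what multiplies the scatter/gather synchronization.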

Page 29: Evaluation: Coarse-Grained Data Parallelism

[Figure: bar chart of throughput normalized to single-core StreamIt (0 to 19), comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data across the benchmarks.]

Good parallelism! Low synchronization!

Page 30: Simplified Vocoder

[Figure: vocoder stream graph annotated with work estimates: a splitjoin of two AdaptDFT filters (66 each); RectPolar (20), marked data parallel; a splitjoin of two chains Amplify (2) → Diff (1) → UnWrap (1) → Accum (1), marked data parallel but with too little work; PolarRect (20), marked data parallel.]

Target a 4-core machine.

Page 31: Data Parallelize

[Figure: the vocoder after judicious fission for the 4-core target: RectPolar and PolarRect are each fissed 4 ways inside scatter/gather pairs, cutting their per-copy work from 20 to 5; the low-work middle chains and the AdaptDFT filters are left unfissed.]

Target a 4-core machine.

Page 32: Data + Task Parallel Execution

[Figure: execution schedule of the transformed vocoder plotted as time versus cores; the steady state spans 21 time units on the slide's scale.]

Target a 4-core machine.

Page 33: We Can Do Better!

[Figure: the same 4-core schedule; producer/consumer dependences leave cores idle, although the same work could be packed into 16 time units.]

Target a 4-core machine.

Page 34: Phase 3: Coarse-Grained Software Pipelining

[Figure: iterations are overlapped: after a prologue fills the buffers, RectPolar firings from different iterations execute together in the new steady state.]

• The new steady state is free of dependences
• Schedule the new steady state using a greedy partitioning (a schematic of the overlap follows)
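Schematically, with two adjacent filters A → B (our illustration, hypothetical filters): a prologue fires A once to fill its output buffer; afterwards, steady-state iteration k executes A's firing k+1 and B's firing k at the same time, because B only reads data that A buffered during the previous iteration:

    prologue:      A_1
    steady state:  A_2 || B_1
                   A_3 || B_2
                   A_4 || B_3
                   ...

Within one steady-state iteration no filter waits on another, so even stateful filters can be placed on different cores by the greedy partitioner. This is the same trick software pipelining plays with loop iterations on VLIW machines.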

Page 35: Greedy Partitioning

[Figure: the dependence-free steady-state filters ("To schedule") packed greedily onto 4 cores; the resulting schedule spans 16 time units.]

Target a 4-core machine.

Page 36: Evaluation: Coarse-Grained Task + Data + Software Pipelining

[Figure: bar chart of throughput normalized to single-core StreamIt (0 to 19), comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline across the benchmarks.]

Best parallelism! Lowest synchronization!

Page 37: Generalizing to Other Multicores

• Architectural requirements:
– Compiler-controlled local memories with DMA
– Efficient implementation of scatter/gather
• To port to other architectures, consider:
– Local memory capacities
– The communication-to-computation tradeoff
• We did not use processor-to-processor communication on Raw

Page 38: Related Work

• Streaming languages:
– Brook [Buck et al. '04]
– StreamC/KernelC [Kapasi '03, Das et al. '06]
– Cg [Mark et al. '03]
– SPUR [Zhang et al. '05]
• Streaming for multicores:
– Brook [Liao et al. '06]
• Ptolemy [Lee '95]
• Explicit parallelism:
– OpenMP, MPI, and HPF

Page 39: Conclusions

• The streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly

Technique                                         Parallelism    Synchronization
Task                                              not matched    not matched
Fine-Grained Data                                 good           high
Coarse-Grained Task + Data                        good           low
Coarse-Grained Task + Data + Software Pipeline    best           lowest

• Good speedups across a varied benchmark suite
• The algorithms should be applicable across multicores

Page 40: Questions?

Thanks!