Download - Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

1

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in

Stream Programs

Michael Gordon, William Thies, andSaman Amarasinghe

Massachusetts Institute of Technology

ASPLOS October 2006San Jose, CA

http://cag.csail.mit.edu/streamit

2

Multicores Are Here!

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005 20??

# ofcores

1

2

4

8

16

32

64

128256

512

Athlon

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480 Opteron 4PXeon MP

AmbricAM2045

3


For uniprocessors,C was:•Portable•High Performance•Composable•Malleable•Maintainable

Uniprocessors:C is the commonmachine language

1985 199019801970 1975 1995 2000

4004

8008


Itanium 2

2005

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480

20??

# ofcores

1

2

4

8

16

32

64

128256

512

Opteron 4PXeon MP

Athlon

AmbricAM2045

4


What is the commonmachine languagefor multicores?

1985 199019801970 1975 1995 2000

4004

8008


Itanium 2

2005

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480

20??

# ofcores

1

2

4

8

16

32

64

128256

512

Opteron 4PXeon MP

Athlon

AmbricAM2045

5

Common Machine Languages

Single memory imageSingle flow of controlCommon PropertiesUniprocessors:

ISA

Functional Units

Register FileDifferences:

Register AllocationInstruction Selection

Instruction Scheduling

Multiple local memoriesMultiple flows of controlCommon PropertiesMulticores:

Communication Model

Synchronization Model

Number and capabilities of coresDifferences:

von-Neumann languages represent the common properties and abstract away the differences

Need common machine language(s) for multicores

6

Streaming as a Common Machine Language

• Regular and repeating computation

• Independent filters with explicit communication– Segregated address spaces and

multiple program counters

• Natural expression of Parallelism:– Producer / Consumer dependencies– Enables powerful, whole-program

transformations Adder

Speaker

AtoD

FMDemod

LPF1

Scatter

Gather

LPF2 LPF3

HPF1 HPF2 HPF3

7

Types of Parallelism

Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Peel iterations of filter, place within

scatter/gather pair (fission)– parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Task

8


Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Between iterations of a stateless filter – Place within scatter/gather pair (fission)– Can’t parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

Data Parallel

9


Traditionally:

Task Parallelism– Thread (fork/join) parallelism

Data Parallelism– Data parallel loop (forall)

Pipeline Parallelism– Usually exploited in hardware

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

10

Problem Statement

Given: – Stream graph with compute and communication

estimate for each filter– Computation and communication resources of

the target machine

Find:– Schedule of execution for the filters that best

utilizes the available parallelism to fit the machine resources

11

Our 3-Phase Solution

1. Coarsen: Fuse stateless sections of the graph2. Data Parallelize: parallelize stateless filters3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture– 11.2x mean throughput speedup over single core

Coarsen Granularity

Data Parallelize

Software Pipeline

12

Outline

• StreamIt language overview• Mapping to multicores

– Baseline techniques– Our 3-phase solution

13

• Applications– DES and Serpent [PLDI 05]– MPEG-2 [IPDPS 06]– SAR, DSP benchmarks, JPEG, …

• Programmability– StreamIt Language (CC 02) – Teleport Messaging (PPOPP 05)– Programming Environment in Eclipse (P-PHEC 05)

• Domain Specific Optimizations– Linear Analysis and Optimization (PLDI 03)– Optimizations for bit streaming (PLDI 05)– Linear State Space Analysis (CASES 05)

• Architecture Specific Optimizations– Compiling for Communication-Exposed

Architectures (ASPLOS 02)– Phased Scheduling (LCTES 03)– Cache Aware Optimization (LCTES 05)– Load-Balanced Rendering

(Graphics Hardware 05)

StreamIt Program

Front-end

Stream-AwareOptimizations

Uniprocessorbackend

Clusterbackend

Rawbackend

IBM X10backend

C/C++ C per tile +msg code

StreamingX10 runtime

Annotated Java

MPI-likeC/C++

Simulator(Java Library)

The StreamIt Project

14

Model of Computation

• Synchronous Dataflow [Lee ‘92]– Graph of autonomous filters– Communicate via FIFO channels

• Static I/O rates– Compiler decides on an order

of execution (schedule)– Static estimation of

computation

A/D

Duplicate

LED

Detect

Band Pass

LED

Detect

LED

Detect

LED

Detect

15

Example StreamIt Filter0 1 2 3 4 5 6 7 8 9 10 11 input

output

FIR0 1

float→float filter FIR (int N, float[N] weights) {

work push 1 pop 1 peek N {float result = 0;

for (int i = 0; i < N; i++) {result += weights[i] ∗ peek(i);

}pop();push(result);

}}

Stateless

16

Example StreamIt Filter

float→float filter FIR (int N, ) {

work push 1 pop 1 peek N {float result = 0;

for (int i = 0; i < N; i++) {result += weights[i] ∗ peek(i);

}pop();push(result);

}}

0 1 2 3 4 5 6 7 8 9 10 11 input

output

FIR0 1

weights = adaptChannel(weights);

float[N] weights;

(int N) {

Stateful

17

parallel computation

StreamIt Language Overview• StreamIt is a novel

language for streaming– Exposes parallelism and

communication– Architecture independent– Modular and composable

– Simple structures composed to creates complex graphs

– Malleable– Change program behavior

with small modifications

may be any StreamIt language construct

joinersplitter

pipeline

feedback loop

joiner splitter

splitjoin

filter

18

Outline



19

Baseline 1: Task Parallelism

Adder

Splitter

Joiner

Compress

BandPass

Expand

Process

BandStop

Compress

BandPass

Expand

Process

BandStop

• Inherent task parallelism between two processing pipelines

• Task Parallel Model:– Only parallelize explicit task

parallelism – Fork/join parallelism

• Execute this on a 2 core machine ~2x speedup over single core

• What about 4, 16, 1024, … cores?

20

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

Bitonic

Sort

Chann

elVoc

oder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpe

nt

TDEMPEG2D

ecod

er

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Evaluation: Task ParallelismRaw Microprocessor

16 inorder, single-issue cores with D$ and I$16 memory banks, each bank with DMA

Cycle accurate simulator

Parallelism: Not matched to target!Synchronization: Not matched to target!

21

Baseline 2: Fine-Grained Data Parallelism

Adder

Splitter

Joiner

• Each of the filters in the example are stateless

• Fine-grained Data Parallel Model:– Fiss each stateless filter N

ways (N is number of cores)– Remove scatter/gather if

possible

• We can introduce data parallelism– Example: 4 cores

• Each fission group occupies entire machineBandStopBandStopBandStopAdder

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

22

Evaluation:Fine-Grained Data Parallelism

0

1

2

34

5

6

7

8

910

11

12

13

14

1516

17

18

19

Bitonic

SortCha

nnelVoc

oder

DCT

DES

FFT

Filterban

k

FMRadio

Serpen

t

TDEMPEG2Deco

der

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

TaskFine-Grained Data

Good Parallelism!Too Much Synchronization!

23

Outline



24

Phase 1: Coarsen the Stream GraphSplitter

Joiner

Expand

BandStop

Process

BandPass

Compress

Expand

BandStop

Process

BandPass

Compress

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstreamPeek Peek

PeekPeek

Adder

25

Splitter

Joiner

BandPassCompressProcessExpand


BandStop BandStop

Adder

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstream

• Benefits:– Reduces global communication

and synchronization– Exposes inter-node

optimization opportunities

Phase 1: Coarsen the Stream Graph

26

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder



BandStop BandStop

Splitter

Joiner

Fiss 4 ways, to occupy entire chip

Data Parallelize for 4 cores

27


AdderAdderAdder

Splitter

Joiner

Adder


Splitter

Joiner



Splitter

Joiner


BandStop BandStop

Splitter

Joiner

Task parallelism!Each fused filter does equal workFiss each filter 2 times to occupy entire chip


28

BandStop BandStop


AdderAdderAdder

Splitter

Joiner

Adder


Splitter

Joiner



Splitter

Joiner


Splitter

Joiner

BandStop

Splitter

Joiner

BandStop

Splitter

Joiner

Task parallelism, each filter does equal workFiss each filter 2 times to occupy entire chip

• Task-conscious data parallelization– Preserve task parallelism

• Benefits:– Reduces global communication

and synchronization


29

Evaluation: Coarse-Grained Data Parallelism

012

34567

89

1011

1213141516

171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

Mean

Thro

ughp

ut N

orm

aliz

ed to

Sin

gle

Cor

e St

ream

It

TaskFine-Grained DataCoarse-Grained Task + Data

Good Parallelism!Low Synchronization!

30

Simplified Vocoder

RectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

Joiner

PolarRect

66

20

2

1

1

1

2

1

1

1

20 Data Parallel

Data Parallel

Target a 4 core machine

Data Parallel, but too little work!

31

Data Parallelize

RectPolarRectPolarRectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

RectPolar

Splitter

Joiner

RectPolarRectPolarRectPolarPolarRect

Splitter

Joiner

Joiner

66

20

2

1

1

1

2

1

1

1

20

5

5

Target a 4 core machine

32

Data + Task Parallel Execution

Time

Cores

21

Target 4 core machine

Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

33

We Can Do Better!

Time

Cores


Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

16

34

Phase 3: Coarse-Grained Software Pipelining

RectPolar

RectPolar

RectPolar

RectPolar

Prologue

New Steady

State

• New steady-state is free of dependencies

• Schedule new steady-state using a greedy partitioning

35

Greedy Partitioning


Time 16

CoresTo Schedule:

36

0123456789

10111213141516171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

MeanTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Task Fine-Grained DataCoarse-Grained Task + Data Coarse-Grained Task + Data + Software Pipeline

Evaluation: Coarse-Grained Task + Data + Software Pipelining

Best Parallelism!Lowest Synchronization!

37

Generalizing to Other Multicores

• Architectural requirements:– Compiler controlled local memories with DMA– Efficient implementation of scatter/gather

• To port to other architectures, consider:– Local memory capacities – Communication to computation tradeoff

• Did not use processor-to-processor communication on Raw

38

Related Work

• Streaming languages:– Brook [Buck et al. ’04]– StreamC/KernelC [Kapasi ’03, Das et al. ’06]– Cg [Mark et al. ‘03]– SPUR [Zhang et al. ‘05]

• Streaming for Multicores:– Brook [Liao et al., ’06]

• Ptolemy [Lee ’95]• Explicit parallelism:

– OpenMP, MPI, & HPF

39

Conclusions

• Good speedups across varied benchmark suite• Algorithms should be applicable across multicores

Low

Good

Coarse-Grained Task + Data

High

Good

Fine-Grained Data

LowestNot matched Synchronization

Best Not matchedParallelism

Coarse-Grained Task + Data +

Software PipelineTask

• Streaming model naturally exposes task, data, and pipeline parallelism

• This parallelism must be exploited at the correct granularity and combined correctly

Download - Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

Top Related