
6.189 IAP 2007, MIT

Lecture 12: StreamIt Parallelizing Compiler

Prof. Saman Amarasinghe, MIT

Common Machine Language

● Represent common properties of architectures
    Necessary for performance
● Abstract away differences in architectures
    Necessary for portability
● Cannot be too complex
    Must keep in mind the typical programmer
● C and Fortran were the common machine languages for uniprocessors
    Imperative languages are not the correct abstraction for parallel architectures.
● What is the correct abstraction for parallel multicore machines?

Common Machine Language for Multicores

● Current offerings:
    OpenMP
    MPI
    High Performance Fortran
● Explicit parallel constructs grafted onto imperative language
● Language features obscured:
    Composability
    Malleability
    Debugging
● Huge additional burden on programmer:
    Introducing parallelism
    Correctness of parallelism
    Optimizing parallelism

Explicit Parallelism

● Programmer controls details of parallelism!
● Granularity decisions:
    If too small, lots of synchronization and thread creation
    If too large, bad locality
● Load balancing decisions
    Create balanced parallel sections (not data-parallel)
● Locality decisions
    Sharing and communication structure
● Synchronization decisions
    Barriers, atomicity, critical sections, order, flushing
● For mass adoption, we need a better paradigm:
    Where the parallelism is natural
    Exposes the necessary information to the compiler

Unburden the Programmer

● Move these decisions to compiler!
    Granularity
    Load Balancing
    Locality
    Synchronization
● Hard to do in traditional languages
    Can a novel language help?

Properties of Stream Programs

● Regular and repeating computation
● Synchronous Data Flow
● Independent actors with explicit communication
● Data items have short lifetimes

Benefits:
● Naturally parallel
● Expose dependencies to compiler
● Enable powerful transformations

[Figure: example stream graph with nodes AtoD, FMDemod, Duplicate, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, RoundRobin, Adder, Speaker]

Outline

● Why we need New Languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism

Steady-State Schedule

● All data pop/push rates are constant
● Can find a Steady-State Schedule
    The number of items in each buffer is the same before and after executing the schedule
    There exists a unique minimal steady-state schedule

[Figure: pipeline A (… push=2) → B (pop=3, push=1) → C (pop=2 …)]

● The schedule is built up one firing at a time:
    Schedule = { }
    Schedule = { A }
    Schedule = { A, A }
    Schedule = { A, A, B }
    Schedule = { A, A, B, A }
    Schedule = { A, A, B, A, B }
    Schedule = { A, A, B, A, B, C }
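
The schedule on this slide can be reproduced with a short sketch. It is illustrative only and not the StreamIt compiler's scheduler: the tuple layout, the function names, and the "fire the most downstream filter that has enough input" policy are assumptions, and A's pop rate and C's push rate (elided on the slide) are set to 0 as placeholders.

```python
from fractions import Fraction
from math import lcm

# (name, pop, push) in pipeline order. A's pop rate and C's push rate are
# elided on the slide; 0 is a placeholder, since they only touch external channels.
PIPELINE = [("A", 0, 2), ("B", 3, 1), ("C", 2, 0)]

def repetitions(filters):
    """Minimal firing counts that leave every internal buffer unchanged."""
    reps = [Fraction(1)]
    for (_, _, push), (_, pop, _) in zip(filters, filters[1:]):
        # Balance equation for one channel: producer_reps * push == consumer_reps * pop
        reps.append(reps[-1] * push / pop)
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]

def schedule(filters):
    """Order the firings by always firing the most downstream ready filter."""
    remaining = repetitions(filters)
    buffers = [0] * (len(filters) - 1)  # buffers[i] sits between filter i and i+1
    order = []
    while any(remaining):
        for i in reversed(range(len(filters))):
            name, pop, push = filters[i]
            ready = i == 0 or buffers[i - 1] >= pop  # the source is always ready
            if remaining[i] and ready:
                if i > 0:
                    buffers[i - 1] -= pop
                if i < len(filters) - 1:
                    buffers[i] += push
                remaining[i] -= 1
                order.append(name)
                break
        else:
            raise RuntimeError("no filter can fire")
    return order, buffers

REPS = repetitions(PIPELINE)        # [3, 2, 1]
ORDER, FINAL = schedule(PIPELINE)   # (['A', 'A', 'B', 'A', 'B', 'C'], [0, 0])
```

A fires three times (6 items), B twice (consumes 6, produces 2), and C once (consumes 2), so both buffers end where they started.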

Initialization Schedule

● When peek > pop, buffer cannot be empty after firing a filter
● Buffers are not empty at the beginning/end of the steady-state schedule
● Need to fill the buffers before starting the steady-state execution

[Figure: filter with peek=4, pop=3, push=1]
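
A minimal sketch of why the prefill is needed, assuming the peeking filter (peek=4, pop=3) is fed by a producer with push=2 as in the earlier A/B example. That pairing, the function names, and the firing policy are assumptions for illustration.

```python
def items_to_prefill(peek, pop):
    """Items that must stay buffered between firings when peek > pop."""
    return max(peek - pop, 0)

def simulate(buffered, producer_push, peek, pop, producer_reps, consumer_reps):
    """Run one steady state; return the final occupancy, or None if the consumer starves."""
    while producer_reps or consumer_reps:
        if consumer_reps and buffered >= peek:
            buffered -= pop            # the consumer inspects `peek` items but removes `pop`
            consumer_reps -= 1
        elif producer_reps:
            buffered += producer_push
            producer_reps -= 1
        else:
            return None                # firings still owed, but too few items to peek at
    return buffered

PREFILL = items_to_prefill(peek=4, pop=3)   # 1
EMPTY_START = simulate(0, 2, 4, 3, 3, 2)    # None: starting empty, the second firing of B starves
ONE_INIT_FIRING = simulate(2, 2, 4, 3, 3, 2)  # 2: one initial producer firing, occupancy restored
```

With an empty buffer the steady-state firing counts (3 producer, 2 consumer) cannot complete; after one initialization firing of the producer they can, and the buffer returns to its starting occupancy.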

Outline

● Why we need New Languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism

Types of Parallelism

Task Parallelism
    Parallelism explicit in algorithm
    Between filters without producer/consumer relationship
Data Parallelism
    Between iterations of a stateless filter
    Place within scatter/gather pair (fission)
    Can’t parallelize filters with state
Pipeline Parallelism
    Between producers and consumers
    Stateful filters can be parallelized

[Figure: stream graph with Scatter/Gather pairs; regions are labeled Task, Pipeline, Data, and Data Parallel]
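
The restriction on fission can be illustrated with a small sketch: round-robin scatter, independent replicas, round-robin gather. The helper names and the two example filters are hypothetical, not StreamIt code.

```python
def run_sequential(make_filter, chunks):
    fire = make_filter()
    return [fire(chunk) for chunk in chunks]

def run_fissed(make_filter, chunks, ways):
    """Scatter chunks round-robin to `ways` replicas, then gather in the same order."""
    out = [None] * len(chunks)
    for w in range(ways):
        fire = make_filter()  # each replica gets its own private copy of the filter
        out[w::ways] = [fire(chunk) for chunk in chunks[w::ways]]
    return out

def make_average():
    """Stateless: the output depends only on the items popped in this firing."""
    return lambda chunk: sum(chunk) / len(chunk)

def make_running_total():
    """Stateful: the output depends on every earlier firing."""
    total = 0
    def fire(chunk):
        nonlocal total
        total += sum(chunk)
        return total
    return fire

CHUNKS = [[1, 2], [3, 4], [5, 6], [7, 8]]
STATELESS_OK = run_fissed(make_average, CHUNKS, 2) == run_sequential(make_average, CHUNKS)
STATEFUL_OK = run_fissed(make_running_total, CHUNKS, 2) == run_sequential(make_running_total, CHUNKS)
```

Here STATELESS_OK is True and STATEFUL_OK is False: the replicas of the stateful filter each see only half of the history, so their outputs diverge from the sequential run.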

Types of Parallelism

Traditionally:

Task Parallelism
    Thread (fork/join) parallelism
Data Parallelism
    Data parallel loop (forall)
Pipeline Parallelism
    Usually exploited in hardware

[Figure: stream graph with Scatter/Gather pairs; regions are labeled Task, Pipeline, and Data]

Outline

● Why we need New Languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism

Baseline 1: Task Parallelism

● Inherent task parallelism between two processing pipelines
● Task Parallel Model:
    Only parallelize explicit task parallelism
    Fork/join parallelism
● Execute this on a 2 core machine: ~2x speedup over single core
● What about 4, 16, 1024, … cores?

[Figure: stream graph. A Splitter feeds two pipelines, each containing BandPass, Compress, Process, Expand, and BandStop; a Joiner and an Adder follow]

Evaluation: Task Parallelism

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 19). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, Geometric Mean]

Raw Microprocessor
    16 in-order, single-issue cores with D$ and I$
    16 memory banks, each bank with DMA
    Cycle-accurate simulator

Parallelism: Not matched to target!
Synchronization: Not matched to target!

Baseline 2: Fine-Grained Data Parallelism

● Each of the filters in the example is stateless
● Fine-grained Data Parallel Model:
    Fiss each stateless filter N ways (N is number of cores)
    Remove scatter/gather if possible
● We can introduce data parallelism
    Example: 4 cores
● Each fission group occupies entire machine

[Figure: stream graph after fission, with replicated BandPass, Compress, Process, Expand, BandStop, and Adder filters placed between Splitter/Joiner pairs]

Evaluation: Fine-Grained Data Parallelism

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 19). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, Geometric Mean. Series: Task, Fine-Grained Data]

Good Parallelism!
Too Much Synchronization!

Baseline 3: Hardware Pipeline Parallelism

● The BandPass and BandStop filters contain all the work
● Hardware Pipelining
    Use a greedy algorithm to fuse adjacent filters
    Want # filters <= # cores
● Example: 8 cores
● Resultant stream graph is mapped to hardware
    One filter per core
● What about 4, 16, 1024, … cores?
    Performance dependent on fusing to a load-balanced stream graph

[Figure: stream graph before and after fusion. In each branch, Compress, Process, and Expand are fused into one filter between BandPass and BandStop; Splitter, Joiner, and Adder are unchanged]
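
One possible greedy fusion heuristic, sketched for a single linear branch: repeatedly fuse the adjacent pair with the smallest combined work until the filter count fits the core budget. The slide does not say which greedy rule is used, so this rule, the work estimates, and the function name are all assumptions.

```python
def greedy_fuse(pipeline, max_filters):
    """pipeline: list of (name, work) in order. Returns the fused pipeline."""
    if max_filters < 1:
        raise ValueError("need at least one filter")
    stages = list(pipeline)
    while len(stages) > max_filters:
        # Pick the adjacent pair whose fusion creates the lightest new stage.
        i = min(range(len(stages) - 1),
                key=lambda k: stages[k][1] + stages[k + 1][1])
        (left_name, left_work), (right_name, right_work) = stages[i], stages[i + 1]
        stages[i:i + 2] = [(left_name + "+" + right_name, left_work + right_work)]
    return stages

# Hypothetical work estimates: BandPass and BandStop dominate, as the slide states.
BRANCH = [("BandPass", 10), ("Compress", 1), ("Process", 1),
          ("Expand", 1), ("BandStop", 10)]
FUSED = greedy_fuse(BRANCH, 3)
# [('BandPass', 10), ('Compress+Process+Expand', 3), ('BandStop', 10)]
```

With these numbers the result matches the fused graph on the slide, but the load is far from balanced (10, 3, 10), which is the weakness the last bullet points at.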

Evaluation: Hardware Pipeline Parallelism

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 19). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, Geometric Mean. Series: Task, Fine-Grained Data, Hardware Pipelining]

Parallelism: Not matched to target!
Synchronization: Not matched to target!

The StreamIt Compiler

1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture
    11.2x mean throughput speedup over single core

[Figure: compiler phases in order: Coarsen Granularity, Data Parallelize, Software Pipeline]

Phase 1: Coarsen the Stream Graph

● Before data-parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
    Don’t fuse stateless with stateful
    Don’t fuse a peeking filter with anything upstream
● Benefits:
    Reduces global communication and synchronization
    Exposes inter-node optimization opportunities

[Figure: stream graph before and after coarsening. Before: Splitter, two branches of BandPass, Compress, Process, Expand, BandStop (four filters marked Peek), Joiner, Adder. After: each branch is a fused BandPass+Compress+Process+Expand filter followed by BandStop]
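
The two fusion rules can be sketched for one linear branch as follows. The Filter record and the coarsen function are hypothetical, and marking BandPass and BandStop as the peeking filters is an assumption read off the "Peek" labels in the figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False

def coarsen(pipeline):
    """Group adjacent filters into fusable runs, in pipeline order."""
    groups = []
    for f in pipeline:
        can_join = (
            bool(groups)
            and not f.stateful                            # keep stateful filters on their own
            and not f.peeks                               # a peeking filter never fuses with upstream
            and not any(g.stateful for g in groups[-1])   # don't mix stateless with stateful
        )
        if can_join:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups

BRANCH = [Filter("BandPass", peeks=True), Filter("Compress"), Filter("Process"),
          Filter("Expand"), Filter("BandStop", peeks=True)]
GROUPS = [[f.name for f in g] for g in coarsen(BRANCH)]
# [['BandPass', 'Compress', 'Process', 'Expand'], ['BandStop']]
```

BandPass can absorb the filters downstream of it, while BandStop starts a new group because fusing it with anything upstream would break the second rule.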

Phase 2: Data Parallelize

Data Parallelize for 4 cores

● Adder: fiss 4 ways, to occupy entire chip
● Fused BandPass/Compress/Process/Expand filters: task parallelism! Each fused filter does equal work, so fiss each filter 2 times to occupy entire chip
● BandStop filters: task parallelism, each filter does equal work, so fiss each filter 2 times to occupy entire chip
● Task-conscious data parallelization
    Preserve task parallelism
● Benefits:
    Reduces global communication and synchronization

[Figure: stream graph after data parallelization. The Adder is replicated 4 ways; each fused BandPass+Compress+Process+Expand filter and each BandStop filter is replicated 2 ways, each group between its own Splitter/Joiner pair]
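
The fission factors on this slide follow from dividing the core count by the number of task-parallel filters already running side by side. The sketch below is a simplification under that reading, assuming equal work per parallel filter; the function name is hypothetical.

```python
def fission_ways(cores, task_parallel_width):
    """Replicas per filter so that one task-parallel level fills the chip."""
    return max(cores // task_parallel_width, 1)

ADDER_WAYS = fission_ways(cores=4, task_parallel_width=1)     # 4: the Adder runs alone
BRANCH_WAYS = fission_ways(cores=4, task_parallel_width=2)    # 2: two equal branches run side by side
```

Fissing the branch filters 4 ways each would create 8 replicas for 4 cores and add scatter/gather traffic without adding useful parallelism.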

Evaluation: Coarse-Grained Data Parallelism

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 19). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, Geometric Mean. Series: Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data]

Good Parallelism!
Low Synchronization!

Simplified Vocoder

Target a 4 core machine

[Figure: stream graph with per-filter work estimates. Nodes: RectPolar, AdaptDFT (two copies), Amplify, Diff, Unwrap, Accum (two copies of each), PolarRect, plus Splitters and Joiners. Work labels as extracted: 6, 6, 20, 2, 1, 1, 1, 2, 1, 1, 1, 20. Annotations: "Data Parallel" (twice) and "Data Parallel, but too little work!"]

Data Parallelize

Target a 4 core machine

[Figure: the vocoder stream graph after data parallelization. RectPolar and PolarRect are each replicated between a Splitter/Joiner pair. Work labels as extracted: 6, 6, 20, 2, 1, 1, 1, 2, 1, 1, 1, 20, 5, 5]

Data + Task Parallel Execution

Target 4 core machine

[Figure: the data-parallelized vocoder graph next to a Cores vs. Time execution chart; the time axis is marked at 21]

We Can Do Better!

Target 4 core machine

[Figure: the data-parallelized vocoder graph next to a Cores vs. Time execution chart; the time axis is marked at 16]
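
The two time marks, 21 and 16, can be checked with a little arithmetic. This is a reconstruction and not taken from the slides: it assumes the work labels are AdaptDFT 6 (two copies), RectPolar 20, PolarRect 20, Amplify 2, and Diff, Unwrap, Accum 1 each (two copies of each), and that data + task execution runs the graph one level at a time.

```python
from math import ceil

CORES = 4
# Assumed mapping of the figure's work labels to filters.
WORK = {"AdaptDFT": 6, "RectPolar": 20, "Amplify": 2, "Diff": 1,
        "Unwrap": 1, "Accum": 1, "PolarRect": 20}

chain = WORK["Amplify"] + WORK["Diff"] + WORK["Unwrap"] + WORK["Accum"]  # 5 per branch

# Data + task execution: each level must finish before the next one starts.
DATA_TASK_TIME = (
    WORK["AdaptDFT"]              # two copies on two cores, two cores idle
    + WORK["RectPolar"] // CORES  # fissed 4 ways: 5 per core
    + chain                       # two branches on two cores, two cores idle
    + WORK["PolarRect"] // CORES  # fissed 4 ways: 5 per core
)

TOTAL_WORK = 2 * WORK["AdaptDFT"] + WORK["RectPolar"] + 2 * chain + WORK["PolarRect"]
LOWER_BOUND = ceil(TOTAL_WORK / CORES)  # best possible with perfect load balance
```

Under these assumptions DATA_TASK_TIME is 21 and LOWER_BOUND is 16 (62 units of work over 4 cores), which matches the marks on the two charts: the gap comes from cores idling while the two-wide levels execute.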

Phase 3: Coarse-Grained Software Pipelining

● New steady-state is free of dependencies
● Schedule new steady-state using a greedy partitioning

[Figure: execution timeline with RectPolar firings, divided into a Prologue followed by the New Steady State]

Greedy Partitioning

Target 4 core machine

[Figure: Cores vs. Time chart with the time axis marked at 16, next to the list of filters labeled "To Schedule:"]
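
Once software pipelining has removed dependencies inside the steady state, filters can be packed onto cores like independent jobs. The sketch below uses a longest-first greedy rule (an assumption; the slides only say "greedy partitioning") and the same assumed work estimates for the vocoder: AdaptDFT 6, fissed RectPolar and PolarRect pieces 5 each, Amplify 2, and Diff, Unwrap, Accum 1 each.

```python
import heapq

def greedy_partition(work_items, cores):
    """Assign each (name, work) item, heaviest first, to the least loaded core."""
    heap = [(0, core) for core in range(cores)]  # (current load, core id)
    assignment = {core: [] for core in range(cores)}
    for name, work in sorted(work_items, key=lambda item: -item[1]):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + work, core))
    loads = sorted(load for load, _ in heap)
    return max(loads), loads, assignment

# Assumed steady-state work list for the data-parallelized vocoder.
ITEMS = (
    [(f"RectPolar{i}", 5) for i in range(4)]
    + [(f"PolarRect{i}", 5) for i in range(4)]
    + [(f"AdaptDFT{i}", 6) for i in range(2)]
    + [(f"Amplify{i}", 2) for i in range(2)]
    + [(f"{name}{i}", 1) for name in ("Diff", "Unwrap", "Accum") for i in range(2)]
)

MAKESPAN, LOADS, ASSIGNMENT = greedy_partition(ITEMS, cores=4)
# MAKESPAN == 16, LOADS == [15, 15, 16, 16]
```

With these numbers the greedy packing reaches 16 time units per steady state, the value marked on the chart, compared with 21 for level-by-level execution.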

Evaluation: Coarse-Grained Task + Data + Software Pipelining

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 19). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, Geometric Mean. Series: Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data, Coarse-Grained Task + Data + Software Pipeline]

Best Parallelism!
Lowest Synchronization!

Summary

● Robust speedups across varied benchmark suite

Technique                                        Parallelism            Synchronization
Task                                             Application Dependent  Application Dependent
Fine-Grained Data                                Good                   High
Hardware Pipelining                              Application Dependent  Application Dependent
Coarse-Grained Task + Data                       Good                   Low
Coarse-Grained Task + Data + Software Pipeline   Best                   Lowest

● Streaming model naturally exposes task, data, and pipeline parallelism
● This parallelism must be exploited at the correct granularity and combined correctly