stream programming: luring programmers into the …people.csail.mit.edu/thies/jobtalk08.pdf ·...

Stream Programming: Luring Programmers into the Multicore Era Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Spring 2008


Page 1: Stream Programming: Luring Programmers into the …people.csail.mit.edu/thies/jobtalk08.pdf · 2009-02-09 · Stream Programming: Luring Programmers into the Multicore Era ... Application

Stream Programming: Luring Programmers into the Multicore Era

Bill Thies

Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Spring 2008

Page 2:

Multicores are Here

[Figure: timeline of processors by year (1970 to 20??) against number of cores (1 to 512, doubling at each tick). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045, Tilera.]

Page 3:

Multicores are Here

[Figure: processor timeline from Page 2, with the annotation below added.]

Hardware was responsible for improving performance

Page 4:

Multicores are Here

[Figure: processor timeline from Page 2, with the annotation below added.]

Now, performance burden falls on programmers

Page 5:

Is Parallel Programming a New Problem?

• No! Decades of research targeting multiprocessors
  – Languages, compilers, architectures, tools…

• What is different today?
  1. Multicores vs. multiprocessors. Multicores have:
     - New interconnects with non-uniform communication costs
     - Faster on-chip communication than off-chip I/O, memory ops
     - Limited per-core memory availability
  2. Non-expert programmers
     - Supercomputers with >2048 processors today: 100 [top500.org]
     - Machines with >2048 cores in 2020: >100 million [ITU, Moore]
  3. Application trends
     - Embedded: 2.7 billion cell phones vs 850 million PCs [ITU 2006]
     - Data-centric: YouTube streams 200 TB of video daily

Page 6:

Streaming Application Domain

• For programs based on streams of data
  – Audio, video, DSP, networking, and cryptographic processing kernels
  – Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics

[Figure: FM radio stream graph. AtoD → FMDemod → Duplicate → three parallel branches (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker.]

Page 7:

Streaming Application Domain

• For programs based on streams of data
  – Audio, video, DSP, networking, and cryptographic processing kernels
  – Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics

• Properties of stream programs
  – Regular and repeating computation
  – Independent filters with explicit communication
  – Data items have short lifetimes

[Figure: FM radio stream graph, as on Page 6.]

Page 8:

Brief History of Streaming

[Figure: timeline from 1960 to 2000 with three rows: Models of Computation, Languages / Compilers, Modeling Environments. Entries shown: Petri Nets, Comp. Graphs, Kahn Proc. Networks, Communicating Sequential Processes, Synchronous Dataflow, Sisal, Occam, Lucid, Id, VAL, lazy, LUSTRE, Esterel, C, Erlang, pH, Gabriel, Ptolemy, Grape-II, Matlab/Simulink, etc.]

Page 9:

Brief History of Streaming

[Figure: timeline from Page 8, annotated with the strengths and weaknesses below.]

Strengths
• Elegance
• Generality

Weaknesses
• Unsuitable for static analysis
• Cannot leverage deep results from DSP / modeling community

Page 10:

Brief History of Streaming

[Figure: timeline from Page 8, annotated with the strengths and weaknesses below, plus new entries StreamIt, Cg, StreamC, and Brook under the label “Stream Programming”.]

Strengths
• Elegance
• Generality

Weaknesses
• Unsuitable for static analysis
• Cannot leverage deep results from DSP / modeling community

Page 11:

StreamIt: A Language and Compiler for Stream Programs

• Key idea: design language that enables static analysis

• Goals:
  1. Expose and exploit the parallelism in stream programs
  2. Improve programmer productivity in the streaming domain

• Project contributions:
  – Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
  – Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06]
  – Domain-specific optimizations [PLDI'03, CASES'05, TechRep'07]
  – Cache-aware scheduling [LCTES'03, LCTES'05]
  – Extracting streams from legacy code [MICRO'07]
  – User + application studies [PLDI'05, P-PHEC'05, IPDPS'06]
  – 7 years, 25 people, 300 KLOC
  – 700 external downloads, 5 external publications

Page 12:

StreamIt: A Language and Compiler for Stream Programs

[Build of Page 11: same key idea, goals, and list of contributions, with the list heading changed to "I contributed to:".]

Page 13:

StreamIt: A Language and Compiler for Stream Programs

[Build of Page 11: same key idea, goals, and list of contributions, with the list heading changed to "This talk:".]

Page 14:

Part 1: Language Design

Joint work with Michael Gordon

William Thies, Michal Karczmarek, Saman Amarasinghe (CC’02)

William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, Saman Amarasinghe (PPoPP’05)

Page 15:

StreamIt Language Basics

• High-level, architecture-independent language
  – Backend support for uniprocessors, multicores (Raw, SMP), cluster of workstations

• Model of computation: synchronous dataflow [Lee & Messerschmitt, 1987]
  – Program is a graph of independent filters
  – Filters have an atomic execution step with known input / output rates
  – Compiler is responsible for scheduling and buffer management

• Extensions to synchronous dataflow
  – Dynamic I/O rates
  – Support for sliding window operations
  – Teleport messaging [PPoPP’05]

[Figure: pipeline Input → Decimate → Output. Input pushes 1 item, Decimate pops 10 and pushes 1, Output pops 1; in the steady state they execute x 10, x 1, and x 1 times respectively.]
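The rates in the Input → Decimate → Output figure fix how often each filter fires in one steady-state iteration. The following is a minimal sketch of that calculation for a straight pipeline. It assumes each filter is described by a hypothetical `(pop, push)` pair; the function name `steady_state` and this representation are illustrative and are not taken from the StreamIt compiler.

```python
from fractions import Fraction
from math import gcd


def steady_state(rates):
    """Return the smallest integer firing counts for a pipeline.

    rates: list of (pop, push) pairs in pipeline order (hypothetical
    representation). Adjacent filters must balance:
    reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    """
    reps = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push_prev / pop_next)

    # Scale the rational solution up to integers.
    scale = 1
    for r in reps:
        scale = scale * r.denominator // gcd(scale, r.denominator)
    counts = [int(r * scale) for r in reps]

    # Divide out any common factor so the schedule is minimal.
    common = 0
    for c in counts:
        common = gcd(common, c)
    return [c // common for c in counts]


# Input pushes 1; Decimate pops 10 and pushes 1; Output pops 1.
print(steady_state([(0, 1), (10, 1), (1, 0)]))
```

For the rates on the slide this yields 10 firings of Input for each firing of Decimate and Output, matching the x 10, x 1, x 1 annotations. General graphs with splitjoins and feedback loops need the full balance equations rather than this one-pass pipeline walk.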

Page 16:

Representing Streams

• Conventional wisdom: stream programs are graphs
  – Graphs have no simple textual representation
  – Graphs are difficult to analyze and optimize

• Insight: stream programs have structure

[Figure: two stream graphs, labeled "unstructured" and "structured".]

Page 17:

Structured Streams

• Each structure is single-input, single-output
• Hierarchical and composable

[Figure: the four stream constructs: filter, pipeline, splitjoin (splitter, joiner), and feedback loop (joiner, splitter). Child boxes are annotated "may be any StreamIt language construct".]

Page 18:

Radar-Array Front End

Page 19:

Filterbank

Page 20:

FFT

Page 21:

Block Matrix Multiply

Page 22:

MP3 Decoder

Page 23:

Bitonic Sort

Page 24:

FM Radio with Equalizer

Page 25:

Ground Moving Target Indicator (GMTI)

99 filters, 3566 filter instances

Page 26:

Example Syntax: FMRadio

    void->void pipeline FMRadio(int N, float lo, float hi) {
      add AtoD();
      add FMDemod();
      add splitjoin {
        split duplicate;
        for (int i=0; i<N; i++) {
          add pipeline {
            add LowPassFilter(lo + i*(hi - lo)/N);
            add HighPassFilter(lo + i*(hi - lo)/N);
          }
        }
        join roundrobin();
      }
      add Adder();
      add Speaker();
    }

[Figure: the corresponding stream graph with three branches. AtoD → FMDemod → Duplicate → (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker.]
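To make the hierarchical structure of the FMRadio program concrete, here is a small Python sketch that expands the same pipeline into a nested description and counts its filters. The classes `Filter`, `Pipeline`, and `SplitJoin` and the helper `count_filters` are hypothetical names chosen for illustration; they mirror the constructs on the slide but are not StreamIt's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Filter:
    name: str
    cutoff: Optional[float] = None


@dataclass
class Pipeline:
    children: List["Stream"]


@dataclass
class SplitJoin:
    splitter: str
    branches: List["Stream"]
    joiner: str


Stream = Union[Filter, Pipeline, SplitJoin]


def fm_radio(n, lo, hi):
    """Build the FMRadio graph with n parallel band branches."""
    branches = []
    for i in range(n):
        cutoff = lo + i * (hi - lo) / n  # same expression as on the slide
        branches.append(Pipeline([Filter("LowPassFilter", cutoff),
                                  Filter("HighPassFilter", cutoff)]))
    return Pipeline([
        Filter("AtoD"),
        Filter("FMDemod"),
        SplitJoin("duplicate", branches, "roundrobin"),
        Filter("Adder"),
        Filter("Speaker"),
    ])


def count_filters(node):
    """Count leaf filters by recursing through the structured graph."""
    if isinstance(node, Filter):
        return 1
    if isinstance(node, Pipeline):
        return sum(count_filters(c) for c in node.children)
    return sum(count_filters(b) for b in node.branches)


graph = fm_radio(3, 0.0, 3.0)
print(count_filters(graph))
```

Because every construct is single-input, single-output, a recursive walk like `count_filters` is enough to traverse the whole program, which is the property the structured representation is meant to provide.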

Page 27:

StreamIt Application Suite

• Software radio
• Frequency hopping radio
• Acoustic beam former
• Vocoder
• FFTs and DCTs
• JPEG Encoder/Decoder
• MPEG-2 Encoder/Decoder
• MPEG-4 (fragments)
• Sorting algorithms
• GMTI (Ground Moving Target Indicator)
• DES and Serpent crypto algorithms
• SSCA#3 (HPCS scalable benchmark for synthetic aperture radar)
• Mosaic imaging using RANSAC algorithm

Total size: 60,000 lines of code

Page 28:

Control Messages

• Occasionally, low-bandwidth control messages are sent between actors

• Often demand precise timing
  – Communications: adjust protocol, amplification, compression
  – Network router: cancel invalid packet
  – Adaptive beamformer: track a target
  – Respond to user input, runtime errors
  – Frequency hopping radio

• Traditional techniques:
  – Direct method call (no timing guarantees)
  – Embed message in stream (opaque, slow)

[Figure: stream graph containing AtoD, Decode, a duplicate splitter feeding three branches (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3), a roundrobin joiner, Encode, and Transmit.]

Page 29:

Idea 2: Teleport Messaging

• Looks like method call, but timed relative to data in the stream
  – Exposes dependences to compiler
  – Simple and precise for user
    - Adjustable latency
    - Can send upstream or downstream

    void setProtocol(int p) {
      reconfig(p);
    }

    TargetFilter x;
    if (newProtocol(p)) {
      x.setProtocol(p) @ 2;
    }
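The phrase "timed relative to data in the stream" can be illustrated with a deliberately simplified Python sketch. It assumes a sender and a downstream receiver that each process exactly one item per firing, so that a message sent with latency k while the sender handles item n takes effect just before the receiver handles item n + k. The trigger rule (a negative item requests a protocol change) and the function name `simulate` are invented for the example; the full semantics for arbitrary rates are defined in the PPoPP’05 paper, not here.

```python
def simulate(items, latency):
    """Pair each item with the protocol in force when the receiver sees it."""
    # Phase 1: the sender runs over the whole stream first. This shows that
    # delivery depends on item positions, not on when the sender executed.
    deliveries = {}
    for n, item in enumerate(items):
        if item < 0:  # hypothetical trigger for a protocol change
            deliveries[n + latency] = -item

    # Phase 2: the receiver applies each message just before the firing
    # that the message was timed against.
    protocol = 0
    seen = []
    for n, item in enumerate(items):
        if n in deliveries:
            protocol = deliveries[n]
        seen.append((item, protocol))
    return seen


print(simulate([1, 2, -7, 4, 5, 6], latency=2))
```

Even though the sender has finished before the receiver starts, the protocol change still lands two items after the triggering item, which is the kind of guarantee a direct method call cannot give.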

Page 30:

Part 2: Automatic Parallelization

Joint work with Michael Gordon

Michael I. Gordon, William Thies, Saman Amarasinghe (ASPLOS’06)

Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe (ASPLOS’02)

Page 31:

Streaming is an Implicitly Parallel Model

• Programmer thinks about functionality, not parallelism

• More explicit models may…
  – Require knowledge of target [MPI] [cG]
  – Require parallelism annotations [OpenMP] [HPF] [Cilk] [Intel TBB]

• Novelty over other implicit models?
  [Erlang] [MapReduce] [Sequoia] [pH] [Occam] [Sisal] [Id] [VAL] [LUSTRE] [HAL] [THAL] [SALSA] [Rosette] [ABCL] [APL] [ZPL] [NESL] […]

  Exploiting streaming structure for robust performance

Page 32:

Parallelism in Stream Programs

Task parallelism
  – Analogous to thread (fork/join) parallelism

Data parallelism
  – Peel iterations of filter, place within scatter/gather pair (fission)
  – Cannot parallelize filters with state

Pipeline parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized

[Figure: splitjoin (Splitter, Joiner) with its parallel branches labeled "Task".]

Page 33:

Parallelism in Stream Programs

Task parallelism
  – Analogous to thread (fork/join) parallelism

Data parallelism
  – Analogous to DOALL loops

Pipeline parallelism
  – Analogous to ILP that is exploited in hardware

[Figure: stream graph with two splitjoins (Splitter, Joiner), with regions labeled "Task", "Pipeline", and "Data"; the data-parallel filter is marked "Stateless".]
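The "Stateless" label on the data-parallel filter matters because fission (replicating a filter between a scatter and a gather) only preserves the output when firings are independent. A minimal Python sketch of that point follows, assuming round-robin scatter and gather and one item per firing. The helpers `sequential`, `fission`, `make_scale`, and `make_running_sum` are hypothetical names for illustration.

```python
def make_scale():
    """Stateless filter: each output depends only on the current input."""
    return lambda x: 2 * x


def make_running_sum():
    """Stateful filter: each output depends on all earlier inputs."""
    total = 0

    def work(x):
        nonlocal total
        total += x
        return total

    return work


def sequential(make_work, items):
    work = make_work()
    return [work(x) for x in items]


def fission(make_work, items, k):
    """Run k replicas between a round-robin scatter and gather."""
    lanes = [items[i::k] for i in range(k)]          # scatter
    results = []
    for lane in lanes:
        work = make_work()                           # one replica per lane
        results.append([work(x) for x in lane])
    out = []
    for j in range(max(len(r) for r in results)):    # gather
        for r in results:
            if j < len(r):
                out.append(r[j])
    return out


data = [1, 2, 3, 4, 5, 6]
print(fission(make_scale, data, 3) == sequential(make_scale, data))
print(fission(make_running_sum, data, 3) == sequential(make_running_sum, data))
```

The stateless filter gives identical results either way, while the replicated running sum diverges because each replica only sees its own lane. Stateful filters can still overlap with their neighbors through pipeline parallelism.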

Page 34:

Baseline: Fine-Grained Data Parallelism

[Figure: stream graph built from BandPass, Compress, Process, Expand, BandStop, and Adder filters, shown before and after each filter is replicated between its own Splitter / Joiner pair.]

Page 35:

Evaluation: Fine-Grained Data Parallelism

Raw Microprocessor
16 in-order, single-issue cores with D$ and I$
16 memory banks, each bank with DMA
Cycle accurate simulator

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 18). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2-subset, Vocoder, Radar, Geometric Mean. Legend: Fine-Grained Data; Coarse-Grained Task + Data.]

Page 36:

Evaluation: Fine-Grained Data Parallelism

[Figure: bar chart from Page 35 (Throughput Normalized to Single Core StreamIt, per benchmark; legend: Fine-Grained Data; Coarse-Grained Task + Data).]

Good Parallelism! Too Much Synchronization!

Page 37:

Coarsening the Granularity

[Figure: stream graph. A Splitter feeds two branches, each containing BandPass, Compress, Process, Expand, and BandStop filters; a Joiner merges the branches into Adder.]

Page 38:

Coarsening the Granularity

[Figure: the same graph after fusion. In each branch, BandPass, Compress, Process, and Expand are fused into one filter (BandPassCompressProcessExpand), followed by BandStop; the Joiner feeds Adder.]

Page 39:

Coarsening the Granularity

[Figure: each fused BandPassCompressProcessExpand filter is now replicated into two copies inside its own Splitter / Joiner pair; BandStop and Adder are unchanged.]

Page 40:

Coarsening the Granularity

[Figure: in addition to the replicated BandPassCompressProcessExpand filters, each BandStop and the Adder are now replicated inside Splitter / Joiner pairs.]

Page 41:

Evaluation: Coarse-Grained Data Parallelism

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 18). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2-subset, Vocoder, Radar, Geometric Mean. Legend: Fine-Grained Data; Coarse-Grained Task + Data.]

Good Parallelism! Low Synchronization!

Page 42:

Simplified Vocoder

Target a 4-core machine

[Figure: stream graph annotated with work estimates. Two AdaptDFT filters in a splitjoin (6 each); RectPolar (20, marked Data Parallel); a splitjoin with two branches, each containing Amplify, Diff, UnWrap, and Accum (estimates 2, 1, 1, 1 per branch); PolarRect (20, marked Data Parallel).]

Page 43:

Data Parallelize

Target a 4-core machine

[Figure: the vocoder graph from Page 42 after data parallelization. RectPolar and PolarRect are each replicated four ways inside a Splitter / Joiner pair, so their work estimate drops from 20 to 5 per copy; the AdaptDFT filters (6 each) and the two branches (2, 1, 1, 1) are unchanged.]

Page 44:

Data + Task Parallel Execution

Target a 4-core machine

[Figure: the data-parallelized vocoder graph (work estimates 6, 6; 5 per RectPolar copy; 2, 1, 1, 1 per branch; 5 per PolarRect copy) next to its execution schedule, with axes Cores and Time. Schedule length shown: 21.]

Page 45:

We Can Do Better

Target a 4-core machine

[Figure: the same vocoder graph and Cores / Time schedule as on Page 44, with a shorter schedule length marked: 16.]

Page 46:

Coarse-Grained Software Pipelining

[Figure: execution schedule showing the four RectPolar copies, divided into a Prologue and a New Steady State.]
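The schedule lengths 21 and 16 shown for the vocoder example can be reproduced with a short calculation. The sketch below assumes the work estimates from the vocoder figures (AdaptDFT 6 each, four RectPolar and four PolarRect copies at 5 each, and 2, 1, 1, 1 for the filters in each of the two branches). It models data + task parallel execution as stages separated by barriers, and models the software-pipelined steady state as freely packing all filters onto cores. The greedy longest-first packing is an illustrative stand-in, not the scheduler used by the StreamIt compiler.

```python
def barrier_schedule(stages):
    """Each stage finishes when its slowest parallel piece finishes."""
    return sum(max(stage) for stage in stages)


def greedy_pack(work, cores):
    """Place the largest remaining filter on the least-loaded core."""
    loads = [0] * cores
    for w in sorted(work, reverse=True):
        core = loads.index(min(loads))
        loads[core] += w
    return max(loads)


# Stages: AdaptDFT pair, RectPolar copies, two branches (2+1+1+1 = 5 each),
# PolarRect copies.
stages = [[6, 6], [5, 5, 5, 5], [5, 5], [5, 5, 5, 5]]

# After software pipelining, filters in one steady state are independent,
# so every filter can be packed individually.
filters = [6, 6] + [5] * 4 + [2, 1, 1, 1] * 2 + [5] * 4

print(barrier_schedule(stages))   # data + task parallel
print(greedy_pack(filters, 4))    # software pipelined, 4 cores
```

Under these assumptions the barrier model gives 21 time units and the packed steady state gives 16, which is also the lower bound of 62 total work units divided across 4 cores, rounded up.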

Page 47:

Evaluation: Coarse-Grained Task + Data + Software Pipelining

[Figure: bar chart. Y-axis: Throughput Normalized to Single Core StreamIt (0 to 18). X-axis: BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2-subset, Vocoder, Radar, Geometric Mean. Legend: Fine-Grained Data; Coarse-Grained Task + Data; Coarse-Grained Task + Data + Software Pipeline.]

Page 48:

Evaluation: Coarse-Grained Task + Data + Software Pipelining

[Figure: bar chart from Page 47 (Throughput Normalized to Single Core StreamIt, per benchmark; legend: Fine-Grained Data; Coarse-Grained Task + Data; Coarse-Grained Task + Data + Software Pipeline).]

Best Parallelism! Lowest Synchronization!

Page 49:

Parallelism: Take Away

• Stream programs have abundant parallelism
  – However, parallelism is obfuscated in languages like C

• Stream languages enable new & effective mapping
  – In C, analogous transformations impossibly complex
  – In StreamC or Brook, similar transformations possible
    [Khailany et al., IEEE Micro’01] [Buck et al., SIGGRAPH’04] [Das et al., PACT’06] […]

• Results should extend to other multicores
  – Parameters: local memory, comm.-to-comp. cost
  – Preliminary results on Cell are promising [Zhang, dasCMP’07]

[Figure: the three mapping steps: Coarsen Granularity, Data Parallelize, Software Pipeline.]

Page 50:

Part 3: Domain-Specific Optimizations

Joint work with Andrew Lamb, Sitij Agrawal

Andrew Lamb, William Thies, Saman Amarasinghe (PLDI’03)

Sitij Agrawal, William Thies, Saman Amarasinghe (CASES’05)

Page 51:

DSP Optimization Process

• Given specification of algorithm, minimize the computation cost

[Figure: FM radio stream graph (AtoD → FMDemod → Duplicate → LPF1..3 → HPF1..3 → RoundRobin → Adder → Speaker), with a region labeled "Linear".]

Page 52:

DSP Optimization Process

• Given specification of algorithm, minimize the computation cost
  – Currently done by hand (MATLAB)

[Figure: optimized stream graph with nodes AtoD, FMDemod, FFT, Equalizer, IFFT, and Speaker.]

Page 53:

DSP Optimization Process

• Given specification of algorithm, minimize the computation cost
  – Currently done by hand (MATLAB)

• Can compiler replace DSP expert?
  – Library generators limited [Spiral] [FFTW] [ATLAS]
  – Enable unified development environment

[Figure: optimized stream graph with nodes AtoD, FMDemod, FFT, Equalizer, IFFT, and Speaker.]

Page 54:

Focus: Linear State Space Filters

• Properties:
  – Outputs are linear function of inputs and states
  – New states are linear function of inputs and states

• Most common target of DSP optimizations
  – FIR / IIR filters
  – Linear difference equations
  – Upsamplers / downsamplers
  – DCTs

[Figure: filter box with inputs u, states x, and outputs y.]

    x’ = Ax + Bu
    y = Cx + Du
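As a worked instance of the state space form x’ = Ax + Bu, y = Cx + Du, the sketch below runs a one-pole IIR filter y[n] = a·y[n-1] + u[n]. Taking the state x to be the previous output gives A = [a], B = [1], C = [a], D = [1]. The helper names (`matvec`, `vadd`, `run_state_space`) and the choice a = 0.5 are assumptions made for the example, and plain lists are used so the block has no dependencies.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vadd(p, q):
    return [a + b for a, b in zip(p, q)]


def run_state_space(A, B, C, D, x0, inputs):
    """Fire the filter once per input vector u, returning the outputs y."""
    x = list(x0)
    outputs = []
    for u in inputs:
        y = vadd(matvec(C, x), matvec(D, u))   # y  = Cx + Du
        x = vadd(matvec(A, x), matvec(B, u))   # x' = Ax + Bu
        outputs.append(y)
    return outputs


a = 0.5
A, B, C, D = [[a]], [[1.0]], [[a]], [[1.0]]
print(run_state_space(A, B, C, D, [0.0], [[1.0], [1.0], [1.0]]))
```

An FIR filter fits the same template with the state holding a window of past inputs, and a stateless linear filter is the special case with no state at all, leaving only y = Du.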

Page 55:

Focus: Linear State Space Filters

[Figure: filter box with inputs u, states x, and outputs y.]

    x’ = Ax + Bu
    y = Cx + Du

Page 56:

Focus: Linear Filters

    float->float filter Scale {
      work push 2 pop 1 {
        float u = pop();
        push(u);
        push(2*u);
      }
    }

Linear dataflow analysis

[Figure: filter box with inputs u and outputs y.]

    y = Du

Page 57:

Focus: Linear Filters

    float->float filter Scale {
      work push 2 pop 1 {
        float u = pop();
        push(u);
        push(2*u);
      }
    }

Linear dataflow analysis

[Figure: filter box with inputs u and outputs y.]

    \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} u

Page 58:

Combining Adjacent Filters

[Figure: Filter 1 (input u, output y) feeds Filter 2 (output z); the pair is replaced by a Combined Filter with input u and output z.]

    Filter 1:         y = Du
    Filter 2:         z = Ey
    Combined Filter:  z = EDu = Gu, where G = ED

Page 59:

Combination Example

[Figure: Filter 1 (input u, output y) feeds Filter 2 (output z), at 6 mults per output; the Combined Filter maps u to z at 1 mult per output.]

    D = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad E = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}, \quad G = ED = \begin{bmatrix} 32 \end{bmatrix}
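The combination example can be checked numerically. The sketch below uses D = [1, 2, 3] as a column and E = [4, 5, 6] as a row, as in the example, and counts one multiplication per matrix entry. The helpers `matmul`, `apply_filter`, and `mults` are hypothetical names used only for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_filter(m, v):
    """One firing of a stateless linear filter: output = m * input."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def mults(m):
    """Multiplications per firing, counting one per matrix entry."""
    return len(m) * len(m[0])


D = [[1], [2], [3]]   # Filter 1: pops 1 item, pushes 3
E = [[4, 5, 6]]       # Filter 2: pops 3 items, pushes 1
G = matmul(E, D)      # Combined filter: pops 1, pushes 1

u = [2]
print(G)
print(apply_filter(E, apply_filter(D, u)), apply_filter(G, u))
print(mults(D) + mults(E), mults(G))
```

Running the two filters back to back and running the combined filter give the same output, while the multiplication count per output drops from 6 to 1, in line with the figures quoted for the example.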

Page 60:

The General Case

• If matrix dimensions mis-match? Matrix expansion:

[Figure: an "Original" filter with matrix [D] feeding E, shown next to an "Expanded" version that contains several copies of [D]. Labels: U, E, σ, pop = σ.]

Page 61:

The General Case

• If matrix dimensions mis-match? Matrix expansion:

[Figure: matrix expansion diagram, as on Page 60.]

Page 62:

The General Case

[Figure: combination rules for Pipelines and Feedback Loops.]

Page 63:

The General Case

[Figure: combination rule for Splitjoins.]

Page 64:

Floating-Point Operations Reduction

[Figure: bar chart. Y-axis: Flops Removed (%), from -40% to 100%. X-axis (Benchmark): FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA. Series: linear. One bar is labeled 0.3%.]

Page 65:

Floating-Point Operations Reduction

[Figure: bar chart. Y-axis: Flops Removed (%), from -40% to 100%. X-axis (Benchmark): FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA. Series: linear, freq. Bars are labeled 0.3% and -140%.]

Page 66:

Radar (Transformation Selection)

[Figure: Radar stream graph. Splitter(null) feeds 12 channels, each containing Input, Dec, Dec, FIR1, and FIR2 nodes; RR (roundrobin) and Duplicate nodes connect them to 4 beams, each containing BeamForm, Filter, Mag, and Detect; a final RR and Splitter lead to Sink.]

Radar (Transformation Selection)

[Figure: Radar stream graph. Node labels: Splitter(null); 12 × Input; 4 × RR; Duplicate; 4 × BeamForm; 4 × Filter; 4 × Mag; 4 × Detect; Splitter; Sink]

Radar (Transformation Selection)

[Figure: Radar stream graph, next build of the slide. Node labels: Splitter(null); 12 × Input; 4 × RR; Duplicate; 4 × BeamForm; 4 × Filter; 4 × Mag; 4 × Detect; Splitter; Sink]

Radar (Transformation Selection)

• Maximal Combination and Shifting to Frequency Domain: 2.4 times as many FLOPS
• Using Transformation Selection: half as many FLOPS

[Figure: the two resulting Radar stream graphs. Node labels in one: Splitter(null); 12 × Input; RR; 4 × Filter; 4 × Mag; 4 × Detect; Splitter; Sink. Node labels in the other: Splitter(null); 12 × Input; RR; Duplicate; 4 × Mag; RR; Splitter; Sink]
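These two figures can be related to the "Flops Removed (%)" axis used in the charts. The conversion below is a small sketch that assumes "flops removed" is measured relative to the original operation count; the function names are hypothetical. Under that assumption, 2.4 times as many FLOPS corresponds to -140% removed, which appears to match the -140% annotation on the chart, and half as many FLOPS corresponds to 50% removed.

```python
def flops_ratio(removed_pct):
    """Convert a 'flops removed (%)' value into new_flops / original_flops."""
    return (100 - removed_pct) / 100


def flops_removed_pct(ratio):
    """Inverse conversion: new/original ratio to percent of flops removed."""
    return (1 - ratio) * 100


maximal_ratio = flops_ratio(-140)       # removing -140% means adding 140%
selection_removed = flops_removed_pct(0.5)
```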

Floating-Point Operations Reduction

[Chart: Flops Removed (%) by Benchmark (FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA); y-axis from -40% to 100%; series: linear, freq, autosel; annotations: -140%, 0.3%]

Execution Speedup

[Chart: Speedup (%) by Benchmark (FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA), measured on a Pentium IV; y-axis from -200% to 900%; series: linear, freq, autosel; annotation: 5%]

Additional transformations:
1. Eliminating redundant states
2. Eliminating parameters (non-zero, non-unary coefficients)
3. Translation to the compressed domain

StreamIt: Lessons Learned

• In practice, I/O rates of filters are often matched [LCTES’03]
  – Over 30 publications study an uncommon case (CD-DAT)
• Multi-phase filters complicate programs and compilers
  – Should maintain simplicity of only one atomic step per filter
• Programmers accidentally introduce mutable filter state

[Figure: CD-DAT pipeline of four filters with I/O rates 1→2, 3→2, 7→8, 7→5, executed ×147, ×98, ×28, ×32]
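The CD-DAT numbers show why mismatched rates make scheduling harder. As a sketch, reading each number pair in the figure as (items consumed, items produced) per firing, which is an interpretation of the figure rather than something the slide states, the smallest steady-state firing counts can be computed by balancing each filter's output against the next filter's input. The function name is hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def steady_state_repetitions(rates):
    """Smallest positive firing counts for a pipeline of (pop, push) filters.

    Each filter must produce exactly as many items per steady state as the
    next filter consumes: reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    """
    reps = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push_prev / pop_next)
    # Scale the rational solution up to the smallest all-integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    counts = [int(r * scale) for r in reps]
    divisor = reduce(gcd, counts)
    return [c // divisor for c in counts]


cd_dat = [(1, 2), (3, 2), (7, 8), (7, 5)]  # (pop, push) per filter, assumed
repetitions = steady_state_repetitions(cd_dat)
```

For these rates the result is 147, 98, 28, and 32 firings, the same multipliers shown in the figure. When adjacent rates match, every count is 1 and the schedule is trivial.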

Stateful:

    void->int filter SquareWave() {
        int x = 0;

        work push 1 {
            push(x);
            x = 1 - x;
        }
    }

Stateless:

    void->int filter SquareWave() {
        work push 2 {
            push(0);
            push(1);
        }
    }
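The two versions produce the same stream. The following Python sketch mirrors each filter's work function to make that concrete; it is an illustration only, and the function names are not StreamIt constructs.

```python
def square_wave_stateful(firings):
    """Mirrors the stateful filter: one push per firing, x carried across firings."""
    x = 0
    out = []
    for _ in range(firings):
        out.append(x)  # push(x)
        x = 1 - x      # mutable state updated on every firing
    return out


def square_wave_stateless(firings):
    """Mirrors the stateless filter: each firing pushes 0 then 1 and keeps no state."""
    out = []
    for _ in range(firings):
        out.extend([0, 1])  # push(0); push(1)
    return out


stateful_out = square_wave_stateful(8)    # 8 firings, 1 item each
stateless_out = square_wave_stateless(4)  # 4 firings, 2 items each
```

Because no firing of the stateless version depends on an earlier one, its firings can in principle run in any order or in parallel, whereas the stateful version forces sequential execution.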

Future of StreamIt

• Goal: influence the next big language

[Figure: Origins of C++, a timeline from 1960 to 1990 distinguishing structural influence from feature influence. Languages shown: Fortran, Algol 60, CPL, BCPL, C, ANSI C, Simula 67, C with Classes, C++, C++arm, C++std, ML, Clu, Algol 68, Ada. Label: Academic origin. Source: B. Stroustrup, The Design and Evolution of C++]

Research Trajectory

• Vision: Make emerging computational substrates universally accessible and useful

1. Languages, compilers, & tools for multicores
   – I believe new language / compiler technology can enable scalable and robust performance
   – Next inroads: expose & exploit flexibility in programs
2. Programmable microfluidics
   – We have developed programming languages, tools, and flexible new devices for microfluidics
   – Potential to revolutionize biology experimentation
3. Technologies for the developing world
   – TEK: enable Internet experience over email account
   – Audio Wiki: publish content from a low-cost phone
   – uBox / uPhone: monitor & improve rural healthcare

Conclusions

• A parallel programming model will succeed only by luring programmers, making them do less, not more
• Stream programming lures programmers with:
  – Elegant programming primitives
  – Domain-specific optimizations
• Meanwhile, streaming is implicitly parallel
  – Robust performance via task, data, & pipeline parallelism
• We believe stream programming will play a key role in enabling a transition to multicore processors

Contributions
– Structured streams
– Teleport messaging
– Unified algorithm for task, data, pipeline parallelism
– Software pipelining of whole procedures
– Algebraic simplification of whole procedures
– Translation from time to frequency
– Selection of best DSP transforms

Acknowledgments

• Project supervisors
  – Prof. Saman Amarasinghe
  – Dr. Rodric Rabbah
• Contributors to this talk
  – Michael I. Gordon (Ph.D. Candidate) – leads StreamIt backend efforts
  – Andrew A. Lamb (M.Eng) – led linear optimizations
  – Sitij Agrawal (M.Eng) – led statespace optimizations
• Compiler developers
  – Kunal Agrawal
  – Allyn Dimock
  – Qiuyuan Jimmy Li
  – Jasper Lin
  – Michal Karczmarek
  – David Maze
  – Janis Sermulins
  – Phil Sung
  – David Zhang
• Application developers
  – Basier Aziz
  – Matthew Brown
  – Matthew Drake
  – Shirley Fung
  – Hank Hoffmann
  – Chris Leger
  – Ali Meli
  – Satish Ramaswamy
  – Jeremy Wong
• User interface developers
  – Kimberly Kuo
  – Juan Reyes