Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz

Upload: mark-perry

Post on 20-Jan-2016


Page 1:

Profile Guided Deployment of Stream Programs on Multicores

S. M. Farhad

The University of Sydney

Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz

Page 2:

Outline

Motivation
    Multicore trend
    Stream programming

Research Questions
    How to profile communication overhead on multicores?
    How to deploy stream programs?

Related works

Page 3:

Motivation

[Figure: number of cores per chip (1 to 512) plotted against year (1975 to 2010). Processors are grouped as unicore (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2), homogeneous multicore and heterogeneous multicore (among them Power4, PA8800, Opteron, CoreDuo, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80, Core, Core2Duo, Core2Quad). Courtesy: Scott '08]

[Figure: programming models for multicores: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming]

Page 4:

Stream Programming Paradigm

Programs expressed as stream graphs
    Streams: infinite sequence of data elements
    Actors: functions applied to streams

[Figure: an actor with an incoming stream and an outgoing stream]
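The paradigm can be illustrated with a minimal Python sketch in which generators stand in for actors and streams. The actor names (`source`, `scale`, `moving_sum`) are hypothetical and are not taken from the talk; they only show functions applied to an unbounded sequence of data elements.

```python
from itertools import count, islice

def source():
    """Actor that produces an unbounded stream of integers 0, 1, 2, ..."""
    yield from count()

def scale(stream, factor):
    """Actor that multiplies every element of its input stream by `factor`."""
    for item in stream:
        yield item * factor

def moving_sum(stream, window):
    """Actor that emits the sum of the last `window` elements it has seen."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == window:
            yield sum(buf)
            buf.pop(0)

# A three-actor pipeline: source -> scale -> moving_sum.
pipeline = moving_sum(scale(source(), 2), window=3)
first = list(islice(pipeline, 4))  # take a finite prefix of the infinite stream
```

Each actor only sees its input and output streams, which is what makes the communication between actors explicit.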

Page 5:

Properties of Stream Program

Regular and repeating computation
Independent actors with explicit communication
Producer / consumer dependencies

[Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, Speaker]

Page 6:

StreamIt Language

An implementation of stream programming
Hierarchical structure
Each construct has a single input/output stream

[Figure: StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter)]

Page 7:

Outline

Motivation
    Multicore trend
    Stream programming

Research Questions
    How to profile communication overhead on multicores?
    How to deploy stream programs?

Related works

Page 8:

How to Estimate the Communication Overhead on Multicores?

Page 9:

Problems in Measuring Communication Overhead on Multicores

Reasons:
    Multicores are non-communication-exposed architectures
    Complex cache hierarchy
    Cache coherence protocols

Consequence:
    Cannot directly measure the communication cost
    Estimate the communication cost by measuring the execution time of actors

Page 10:

Measuring the Communication Overhead of an Edge

[Figure: actors i and k connected by an edge. No communication cost: both actors on Processor 1, with execution times $t_i$ and $t_k$. With communication cost: i on Processor 1 and k on Processor 2, with execution times $\hat{t}_i$ and $\hat{t}_k$]

$C(i,k) = \hat{t}_i - t_i + \hat{t}_k - t_k$
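The estimate for a single edge can be written as a small Python helper. This is only a sketch: the function name `edge_comm_cost` and the sample timings are hypothetical, and the cost is taken to be the increase in the execution times of the two actors when they are moved from one processor to two.

```python
def edge_comm_cost(t_same, t_split):
    """Estimate the communication cost of edge (i, k).

    t_same  -- (time of i, time of k) with both actors on one processor
    t_split -- (time of i, time of k) with the actors on different processors
    """
    (t_i, t_k), (t_i_split, t_k_split) = t_same, t_split
    return (t_i_split - t_i) + (t_k_split - t_k)

# Hypothetical measurements, in arbitrary time units.
cost = edge_comm_cost(t_same=(5.0, 40.0), t_split=(7.5, 42.5))
```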

Page 11:

How to Minimize the Required Number of Experiments

[Figure: a pipeline A, B, C with edges 1 and 2. Graph coloring requires 2 + 1 steps]

[Figure: a longer pipeline with numbered edges, mapped on Processor 1 and Processor 2 in two ways: once with the even edges across the partition, once with the odd edges across the partition]
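For a pipeline, the two placements can be generated mechanically. The following Python sketch is an illustration under assumptions: edges are numbered from 1, edge j links actor j-1 and actor j, and only two processors (0 and 1) are used. The function names are hypothetical.

```python
def pipeline_mapping(n_actors, cut_parity):
    """Map pipeline actors to processors 0/1 so that exactly the edges whose
    1-based index has parity `cut_parity` (0 = even, 1 = odd) cross processors."""
    mapping = [0]
    for edge in range(1, n_actors):      # edge j links actor j-1 and actor j
        crosses = edge % 2 == cut_parity
        mapping.append(mapping[-1] ^ int(crosses))
    return mapping

def crossing_edges(mapping):
    """Return the 1-based indices of the edges that cross the partition."""
    return [j for j in range(1, len(mapping)) if mapping[j - 1] != mapping[j]]

odd_cut = pipeline_mapping(5, cut_parity=1)   # actors A..E, odd edges cross
even_cut = pipeline_mapping(5, cut_parity=0)  # actors A..E, even edges cross
```

Together with one run in which all actors share a processor, the two placements expose every edge of the pipeline once.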

Page 12:

Obs. 1: There is no loop of three actors in a stream graph

[Figure: actors i, k and l, mapped on Processor 1 and Processor 2]

Page 13:

Obs. 2: There is no interference of adjacent nodes between edges

[Figure: stream graph with actors A to F; for the blue colored edges the actors are mapped on processors P-1 to P-4]

Page 14:

Remove Interference

Convert to a line graph
Add interference edges
Use vertex coloring algorithm

[Figure: the stream graph with actors A to F and its line graph with vertices AB, BC, BD, CE, DE, EF, shown before and after adding interference edges]
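The three steps can be sketched in Python on the example graph, whose edges are read off the line-graph vertices AB, BC, BD, CE, DE, EF. Two points are assumptions rather than statements from the slides: an interference edge is taken to link two stream edges whose endpoints are joined by a third edge, and a simple greedy vertex coloring stands in for whatever coloring algorithm the authors used.

```python
# Stream graph from the slide, one tuple per edge.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]

def conflicts(e, f, all_edges):
    """True if stream edges e and f may not be profiled in the same step."""
    if set(e) & set(f):
        return True  # adjacent in the line graph: the edges share an actor
    # Assumed interference rule: some third edge joins an endpoint of e
    # to an endpoint of f.
    return any((u in e and v in f) or (u in f and v in e) for u, v in all_edges)

def color_edges(all_edges):
    """Greedy vertex coloring of the line graph plus interference edges."""
    colors = {}
    for e in all_edges:
        used = {colors[f] for f in colors if conflicts(e, f, all_edges)}
        colors[e] = next(c for c in range(len(all_edges)) if c not in used)
    return colors

colors = color_edges(edges)
steps = len(set(colors.values()))  # one profiling step per color
```

Under these assumptions AB and EF receive the same color, so the two edges can be measured in the same profiling step.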

Page 15:

Processor Leveling Graph

[Figure: for the blue colored edges, the stream graph with actors A to F collapses into the processor leveling graph with nodes A; B, C, D, E; F]
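One way to obtain the nodes of the processor leveling graph is to merge all actors that are connected by edges not being profiled in the current step. The Python sketch below does this with a small union-find; the function name `leveling_groups` is hypothetical, and the blue edges are assumed to be AB and EF.

```python
def leveling_groups(actors, edges, profiled):
    """Merge actors joined by edges that are NOT profiled in this step."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for u, v in edges:
        if (u, v) not in profiled:
            parent[find(u)] = find(v)  # keep u and v on the same processor

    groups = {}
    for a in actors:
        groups.setdefault(find(a), []).append(a)
    return sorted(groups.values())

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
blue = {("A", "B"), ("E", "F")}  # assumed set of edges profiled together
groups = leveling_groups("ABCDEF", edges, blue)
```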

Page 16:

Coloring the Processor Leveling Graph

[Figure: the processor leveling graph with nodes A; B, C, D, E; F, colored with Processor 1 and Processor 2]
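Coloring the processor leveling graph assigns a processor to each merged group so that every profiled edge crosses processors. The following Python sketch assumes two processors suffice and uses a breadth-first two-coloring; it raises an error otherwise. The function name and the processor numbers 0 and 1 are hypothetical.

```python
from collections import deque

def assign_processors(groups, profiled):
    """Two-color the groups so that each profiled edge crosses processors."""
    group_of = {a: i for i, g in enumerate(groups) for a in g}
    neighbours = {i: set() for i in range(len(groups))}
    for u, v in profiled:
        neighbours[group_of[u]].add(group_of[v])
        neighbours[group_of[v]].add(group_of[u])

    processor = {}
    for start in neighbours:
        if start in processor:
            continue
        processor[start] = 0
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in neighbours[g]:
                if h not in processor:
                    processor[h] = 1 - processor[g]
                    queue.append(h)
                elif processor[h] == processor[g]:
                    raise ValueError("more than two processors are needed")
    return {a: processor[group_of[a]] for a in group_of}

groups = [["A"], ["B", "C", "D", "E"], ["F"]]
placement = assign_processors(groups, {("A", "B"), ("E", "F")})
```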

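Page 17:

Measuring the Communication Cost

[Figure: for the blue colored edges, the stream graph with actors A to F and its processor leveling graph with nodes A; B, C, D, E; F placed on Processor 1 and Processor 2. The measured execution times are those of A, B, E and F]

$C(A,B) = (\hat{t}_A - t_A) + (\hat{t}_B - t_B)$

$C(E,F) = (\hat{t}_E - t_E) + (\hat{t}_F - t_F)$

Here $t_X$ is the execution time of actor X without communication cost and $\hat{t}_X$ its execution time with the edge placed across processors.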

Page 18:

Profiling Performance

Benchmark     Total Edges   Prof Steps   Steps/Edge (%)   Err (%)
SAR                    44            3                7        10
MatrixMult             88           21               24        17
MergeSort              37            4               11        31
FMRadio                21            3               14        24
DCT                    28            9               32        14
RadixSort              12            2               17         5
FFT                    26            3               12        27
MPEG                   56           17               30        15
Channel                22            6               27        11
BeamFormer             39            5               13        13
GM                                                   17        15
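The GM row is consistent with a geometric mean over the ten benchmarks, which can be checked with a few lines of Python. The lists below are copied from the Steps/Edge (%) and Err (%) columns; reading GM as "geometric mean" is an interpretation, not something the slide states.

```python
from statistics import geometric_mean

# Columns copied from the table, in benchmark order (SAR ... BeamFormer).
steps_per_edge = [7, 24, 11, 14, 32, 17, 12, 30, 27, 13]
error = [10, 17, 31, 24, 14, 5, 27, 15, 11, 13]

gm_steps = geometric_mean(steps_per_edge)
gm_error = geometric_mean(error)
```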

Page 19:

Outline

Motivation
    Multicore trend
    Stream programming

Research Questions
    How to profile communication overhead?
    How to deploy stream programs?

Related works

Page 20:

Deployment of Stream Programs

[Figure: actors A (5), B (40), C (40), D (5) with edge communication costs of 25 and 5, mapped on Processor 1 and Processor 2]

Load = (5 + 40) + 5 = 50

Load = (40 + 5) + 5 = 50

Makespan = 50, Speedup = 90/50 = 1.8

Page 21:

Deploying Stream Programs without Considering Communication

[Figure: the same actors mapped with A (5) and C (40) on Processor 1 and B (40) and D (5) on Processor 2; edges with communication costs 25, 5 and 25 cross the partition]

Load = (5 + 40) + (25 + 5 + 25) = 100

Load = (40 + 5) + (25 + 5 + 25) = 100

Makespan = 100, Speedup = 90/100 = 0.9

Compare = (100 – 50) x 100% / 50 = 100%
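The load arithmetic of the two deployments can be reproduced with a short Python sketch. Several things here are assumptions inferred from the numbers rather than stated on the slides: the graph is taken to be the pipeline A, B, C, D with edge costs 25, 5 and 25, and a crossing edge is charged to both processors it connects. The exhaustive search at the end is a stand-in for illustration only; it is not the ILP formulation used by the authors.

```python
from itertools import product

work = {"A": 5, "B": 40, "C": 40, "D": 5}                   # actor workloads
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}      # assumed edge costs

def loads(mapping, n_procs=2):
    """Per-processor load: actor work plus the cost of every crossing edge."""
    load = [0] * n_procs
    for actor, cost in work.items():
        load[mapping[actor]] += cost
    for (u, v), cost in comm.items():
        if mapping[u] != mapping[v]:
            load[mapping[u]] += cost  # sender side pays
            load[mapping[v]] += cost  # receiver side pays
    return load

def makespan(mapping):
    return max(loads(mapping))

with_comm = {"A": 0, "B": 0, "C": 1, "D": 1}     # communication-aware mapping
without_comm = {"A": 0, "C": 0, "B": 1, "D": 1}  # balances work only
sequential = sum(work.values())                   # 90 on a single processor

# Brute force over all 2^4 mappings (illustrative substitute for the ILP).
best = min((dict(zip(work, procs))
            for procs in product(range(2), repeat=len(work))),
           key=makespan)
```

With these inputs, `sequential / makespan(with_comm)` gives the speedup of 1.8 and `sequential / makespan(without_comm)` gives 0.9, matching the two slides.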

Page 22:

Deployment Performance

Benchmark     m (µs)   m' (µs)   (m' – m)/m (%)
SAR            45.54     45.54                0
MatrixMult     67.80    111.14               64
MergeSort       1.63      6.99              329
FMRadio         1.57      7.00              346
DCT             4.64      7.68               66
RadixSort       1.49      3.08              107
FFT            18.28     34.15               87
MPEG           37.26     37.26                0
Channel        89.00     91.20                2
BeamFormer      7.29      7.29                0

Page 23:

Speedups obtained for 2, 4 and 6 processors

Page 24:

Summary

We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
We propose an ILP-based approach that minimizes the makespan
We conducted experiments
    The number of profiling steps is on average only 17% of the number of edges
    The profiling scheme shows only 15% error on average in the random mapping test
    Obtains a speedup of 3.11x for 4 processors and a speedup of 4.02x for 6 processors

Page 25:

Related Works

[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]

Page 26:

Questions?

Page 27:

Minimizing Errors in Profiling Process

Errors are likely in any profiling process
We chose an architecture which has a uniform cache hierarchy
We pin the threads using the likwid-pin tool
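Pinning keeps a measured thread on one core, so that migrations do not disturb the timings. The talk uses likwid-pin for this; the Python sketch below shows the same idea with the Linux-only `os.sched_setaffinity` call instead, and is not the authors' setup. The helper name `pin_to_cpu` is hypothetical, and the function does nothing on platforms that lack the call.

```python
import os

def pin_to_cpu(cpu=None):
    """Pin the calling process to one CPU and return the new affinity set.

    Returns None on platforms without sched_setaffinity (for example macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = sorted(os.sched_getaffinity(0))   # CPUs we may currently run on
    target = allowed[0] if cpu is None else cpu
    os.sched_setaffinity(0, {target})           # 0 means the calling process
    return os.sched_getaffinity(0)

pinned = pin_to_cpu()
```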

Page 28:

Cache Topology of Processor

        Core #0   Core #1   Core #2   Core #3   Core #4   Core #5
L1      64kB      64kB      64kB      64kB      64kB      64kB
L2      512kB     512kB     512kB     512kB     512kB     512kB
L3      6MB (shared by all cores)

800MHz hexa-core AMD Phenom(tm) II X6 1090T