
Profile Guided Deployment of Stream Programs on Multicores

S. M. Farhad
The University of Sydney

Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead on multicores?
  - How to deploy stream programs?
Related work

Motivation

[Figure: number of cores per chip vs. year (1975-2010), tracing the shift from unicore through homogeneous multicore to heterogeneous multicore designs; processors shown include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80. Courtesy: Scott '08]

Stream Programming

Languages: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, RStream, RapidMind

Stream Programming Paradigm

Programs are expressed as stream graphs:
  - Streams: infinite sequences of data elements
  - Actors: functions applied to streams

[Diagram: an actor with an input stream and an output stream]
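The actor/stream model above can be illustrated with a minimal Python sketch (not from the slides; all names are illustrative) in which a stream is an infinite iterator and an actor is a generator function applied to its input streams:

```python
from itertools import count, islice

def source():
    yield from count(0)            # infinite stream 0, 1, 2, ...

def doubler(stream):               # actor: multiplies each element by 2
    for x in stream:
        yield 2 * x

def adder(a, b):                   # actor consuming two input streams
    for x, y in zip(a, b):
        yield x + y

# Pipeline: source -> doubler, combined with a second source by adder.
out = adder(doubler(source()), source())
print(list(islice(out, 5)))        # -> [0, 3, 6, 9, 12]
```

Each actor only sees its explicit input streams, mirroring the independent-actors-with-explicit-communication property described on the next slide.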

Properties of Stream Programs

  - Regular and repeating computation
  - Independent actors with explicit communication
  - Producer/consumer dependencies

[Example stream graph (FM radio): AtoD, FMDemod, a splitter feeding low-pass filters LPF1-LPF3 and high-pass filters HPF1-HPF3, a joiner, Adder, Speaker]

StreamIt Language

  - An implementation of stream programming
  - Hierarchical structure
  - Each construct has a single input and a single output stream

[Diagram: StreamIt constructs - filter; pipeline; splitjoin (a splitter and joiner enclosing parallel computation, where each branch may be any StreamIt construct); and feedback loop (joiner/splitter)]

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead on multicores?
  - How to deploy stream programs?
Related work

How to Estimate the Communication Overhead on Multicores?

Problems in Measuring Communication Overhead on Multicores

Reasons:
  - Multicores are not communication-exposed architectures
  - Complex cache hierarchy
  - Cache-coherence protocols

Consequence: the communication cost cannot be measured directly. Instead, estimate it by measuring the execution times of actors.

Measuring the Communication Overhead of an Edge

Run actors i and k on the same processor (no communication cost) and measure their execution times t_i and t_k; then run i on processor 1 and k on processor 2 (with communication cost) and measure t'_i and t'_k. The communication cost of edge (i, k) is estimated as

  C(i, k) = (t'_i - t_i) + (t'_k - t_k)
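As a minimal sketch of this estimate (the timing numbers below are hypothetical, not from the paper), the cost is just a difference of measured execution times:

```python
def edge_comm_cost(t_i, t_k, tp_i, tp_k):
    """Estimate C(i, k) = (t'_i - t_i) + (t'_k - t_k), where t_i, t_k are
    execution times with both actors on one processor and tp_i, tp_k the
    times with the actors placed on two different processors."""
    return (tp_i - t_i) + (tp_k - t_k)

# Hypothetical measurements in microseconds:
print(edge_comm_cost(t_i=10.0, t_k=12.0, tp_i=14.0, tp_k=13.0))  # -> 5.0
```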

How to Minimize the Required Number of Experiments

[Figure: edge coloring. A pipeline A -> B -> C (edges 1, 2) requires 2 + 1 profiling steps. For a six-actor graph A-F with edges 1-5 spread over processor 1 and processor 2, one profiling step places the even-numbered edges across the partition and another step places the odd-numbered edges across the partition.]

Observation 1: there is no loop of three actors in a stream graph.

[Figure: actors i, k, and l split across processor 1 and processor 2]

Observation 2: there is no interference of adjacent nodes between edges.

[Figure: stream graph A-F with the blue-colored edges distributed across processors P-1 to P-4]

Remove Interference

1. Convert the stream graph to a line graph
2. Add interference edges
3. Use a vertex-coloring algorithm
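These three steps can be sketched as follows; this is an illustrative reconstruction, not the paper's implementation. The edge list is the A-F example from the slides (topology assumed from the line-graph nodes AB, BC, BD, CE, DE, EF), and interference (steps 1-2) is folded into a shared-endpoint test:

```python
from itertools import combinations

# Stream-graph edges of the slides' A-F example (assumed topology).
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]

def line_graph(edges):
    """Steps 1-2: one line-graph node per stream-graph edge; two nodes
    interfere (are adjacent) when their edges share an actor."""
    adj = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            adj[e1].add(e2)
            adj[e2].add(e1)
    return adj

def greedy_color(adj):
    """Step 3: greedy vertex coloring; each color class is a set of edges
    sharing no actor, so they can all be profiled in a single step."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

adj = line_graph(edges)
coloring = greedy_color(adj)
steps = max(coloring.values()) + 1     # number of profiling steps needed
```

Each color class then yields one processor leveling graph and one profiling run, which is why the number of steps can be far smaller than the number of edges.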

[Figure: a stream graph over actors A-F and its line graph, with nodes AB, BC, BD, CE, DE, and EF]

Processor Leveling Graph

[Figure: for the blue-colored edges of the stream graph A-F, the processor leveling graph collapses the actors into three groups: A; B, C, D, E; F]

Coloring the Processor Leveling Graph

[Figure: the processor leveling graph A - {B, C, D, E} - F is 2-colored, assigning A and F to processor 1 and the group B, C, D, E to processor 2]

Measuring the Communication Cost

[Figure: for the blue-colored edges, actors A and F run on processor 1 while the group B, C, D, E runs on processor 2]

With t_A, t_B, t_E, t_F the execution times without communication and t'_A, t'_B, t'_E, t'_F the times under this placement:

  C(A, B) = (t'_A - t_A) + (t'_B - t_B)
  C(E, F) = (t'_E - t_E) + (t'_F - t_F)

Profiling Performance

Benchmark    Total Edges  Prof. Steps  Steps/Edge (%)  Err (%)
SAR               44           3             7            10
MatrixMult        88          21            24            17
MergeSort         37           4            11            31
FMRadio           21           3            14            24
DCT               28           9            32            14
RadixSort         12           2            17             5
FFT               26           3            12            27
MPEG              56          17            30            15
Channel           22           6            27            11
BeamFormer        39           5            13            13
GM                                          17            15

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead?
  - How to deploy stream programs?
Related work

Deployment of Stream Programs

[Figure: stream graph A (5) -> B (40) -> C (40) -> D (5) with edge communication costs 25, 5, 25; A and B are mapped to processor 1, C and D to processor 2]

Load(processor 1) = (5 + 40) + 5 = 50
Load(processor 2) = (40 + 5) + 5 = 50

Makespan = 50, Speedup = 90/50 = 1.8

Deploying Stream Programs without Considering Communication

[Figure: the same graph mapped with A and C on processor 1 and B and D on processor 2, cutting the costly edges]

Load(processor 1) = (5 + 40) + (25 + 5 + 25) = 100
Load(processor 2) = (40 + 5) + (25 + 5 + 25) = 100

Makespan = 100, Speedup = 90/100 = 0.9
Slowdown = (100 - 50) x 100% / 50 = 100%
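Under the convention implied by the load sums above (each processor is charged its actors' work plus the communication cost of every cut edge incident to one of its actors), both mappings can be checked, and an optimal one found for tiny graphs, by exhaustive search. This is only an illustrative sketch: the paper uses an ILP, and the edge costs A->B = 25, B->C = 5, C->D = 25 are inferred from the slide arithmetic:

```python
from itertools import product

# Actor work from the slides; edge costs inferred from the load sums.
work = {"A": 5, "B": 40, "C": 40, "D": 5}
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}

def makespan(assign, nprocs=2):
    """Each processor's load is its actors' work plus the cost of every
    cut edge incident to one of its actors; makespan is the max load."""
    load = [0] * nprocs
    for actor, proc in assign.items():
        load[proc] += work[actor]
    for (u, v), cost in comm.items():
        if assign[u] != assign[v]:       # edge crosses the partition
            load[assign[u]] += cost
            load[assign[v]] += cost
    return max(load)

actors = list(work)
best = min(
    (dict(zip(actors, placement))
     for placement in product(range(2), repeat=len(actors))),
    key=makespan,
)
# Communication-aware mapping {A,B} | {C,D}: makespan 50 (speedup 1.8).
# Communication-oblivious mapping {A,C} | {B,D}: makespan 100.
```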

Deployment Performance

Benchmark    m_c (µs)  m_nc (µs)  (m_nc - m_c)/m_c (%)
SAR            45.54      45.54         0
MatrixMult     67.80     111.14        64
MergeSort       1.63       6.99       329
FMRadio         1.57       7.00       346
DCT             4.64       7.68        66
RadixSort       1.49       3.08       107
FFT            18.28      34.15        87
MPEG           37.26      37.26         0
Channel        89.00      91.20         2
BeamFormer      7.29       7.29         0

(m_c: makespan of the communication-aware deployment; m_nc: makespan when communication is ignored)

[Figure: speedups obtained for 2, 4, and 6 processors]

Summary

  - We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
  - We propose an ILP-based approach that minimizes the makespan
  - We conducted experiments:
      - The number of profiling steps is on average only 17% of the number of edges
      - The profiling scheme shows only 15% error on average in the random-mapping test
      - We obtain a speedup of 3.11x for 4 processors and 4.02x for 6 processors

Related Works

[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by Approximation [Farhad '11]

Questions?

Minimizing Errors in the Profiling Process

  - Errors are likely in any profiling process
  - We chose an architecture with a uniform cache hierarchy
  - We pin the threads using the likwid-pin tool

Cache Topology of the Processor

800 MHz hexa-core AMD Phenom(tm) II X6 1090T:
  - Cores #0-#5, each with a private 64 kB L1 and 512 kB L2
  - Shared 6 MB L3
