
Profile Guided Deployment of Stream Programs on Multicores

S. M. Farhad
The University of Sydney

Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead on multicores?
  - How to deploy stream programs?
Related work

Motivation

[Figure: number of cores per chip vs. year (1975-2010), tracing the shift from unicore through homogeneous multicore to heterogeneous multicore designs; processors shown include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80. Courtesy: Scott '08]

Stream Programming

Languages: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, RStream, RapidMind

Stream Programming Paradigm

Programs are expressed as stream graphs:
  - Streams: infinite sequences of data elements
  - Actors: functions applied to streams

[Diagram: an actor with an input stream and an output stream]
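The actor/stream model above can be illustrated with a minimal Python sketch (not from the slides; all names are illustrative) in which a stream is an infinite iterator and an actor is a generator function applied to its input streams:

```python
from itertools import count, islice

def source():
    yield from count(0)            # infinite stream 0, 1, 2, ...

def doubler(stream):               # actor: multiplies each element by 2
    for x in stream:
        yield 2 * x

def adder(a, b):                   # actor consuming two input streams
    for x, y in zip(a, b):
        yield x + y

# Pipeline: source -> doubler, combined with a second source by adder.
out = adder(doubler(source()), source())
print(list(islice(out, 5)))        # -> [0, 3, 6, 9, 12]
```

Each actor only sees its explicit input streams, mirroring the independent-actors-with-explicit-communication property described on the next slide.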

Properties of Stream Programs

  - Regular and repeating computation
  - Independent actors with explicit communication
  - Producer/consumer dependencies

[Example stream graph (FM radio): AtoD, FMDemod, a splitter feeding low-pass filters LPF1-LPF3 and high-pass filters HPF1-HPF3, a joiner, Adder, Speaker]

StreamIt Language

  - An implementation of stream programming
  - Hierarchical structure
  - Each construct has a single input and a single output stream

[Diagram: StreamIt constructs - filter; pipeline; splitjoin (a splitter and joiner enclosing parallel computation, where each branch may be any StreamIt construct); and feedback loop (joiner/splitter)]

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead on multicores?
  - How to deploy stream programs?
Related work

How to Estimate the Communication Overhead on Multicores?

Problems in Measuring Communication Overhead on Multicores

Reasons:
  - Multicores are not communication-exposed architectures
  - Complex cache hierarchy
  - Cache-coherence protocols

Consequence: the communication cost cannot be measured directly. Instead, estimate it by measuring the execution times of actors.

Measuring the Communication Overhead of an Edge

Run actors i and k on the same processor (no communication cost) and measure their execution times t_i and t_k; then run i on processor 1 and k on processor 2 (with communication cost) and measure t'_i and t'_k. The communication cost of edge (i, k) is estimated as

  C(i, k) = (t'_i - t_i) + (t'_k - t_k)
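As a minimal sketch of this estimate (the timing numbers below are hypothetical, not from the paper), the cost is just a difference of measured execution times:

```python
def edge_comm_cost(t_i, t_k, tp_i, tp_k):
    """Estimate C(i, k) = (t'_i - t_i) + (t'_k - t_k), where t_i, t_k are
    execution times with both actors on one processor and tp_i, tp_k the
    times with the actors placed on two different processors."""
    return (tp_i - t_i) + (tp_k - t_k)

# Hypothetical measurements in microseconds:
print(edge_comm_cost(t_i=10.0, t_k=12.0, tp_i=14.0, tp_k=13.0))  # -> 5.0
```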

How to Minimize the Required Number of Experiments

[Figure: edge coloring. A pipeline A -> B -> C (edges 1, 2) requires 2 + 1 profiling steps. For a six-actor graph A-F with edges 1-5 spread over processor 1 and processor 2, one profiling step places the even-numbered edges across the partition and another step places the odd-numbered edges across the partition.]

Observation 1: there is no loop of three actors in a stream graph.

[Figure: actors i, k, and l split across processor 1 and processor 2]

Observation 2: there is no interference of adjacent nodes between edges.

[Figure: stream graph A-F with the blue-colored edges distributed across processors P-1 to P-4]

Remove Interference

1. Convert the stream graph to a line graph
2. Add interference edges
3. Use a vertex-coloring algorithm
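These three steps can be sketched as follows; this is an illustrative reconstruction, not the paper's implementation. The edge list is the A-F example from the slides (topology assumed from the line-graph nodes AB, BC, BD, CE, DE, EF), and interference (steps 1-2) is folded into a shared-endpoint test:

```python
from itertools import combinations

# Stream-graph edges of the slides' A-F example (assumed topology).
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]

def line_graph(edges):
    """Steps 1-2: one line-graph node per stream-graph edge; two nodes
    interfere (are adjacent) when their edges share an actor."""
    adj = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            adj[e1].add(e2)
            adj[e2].add(e1)
    return adj

def greedy_color(adj):
    """Step 3: greedy vertex coloring; each color class is a set of edges
    sharing no actor, so they can all be profiled in a single step."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

adj = line_graph(edges)
coloring = greedy_color(adj)
steps = max(coloring.values()) + 1     # number of profiling steps needed
```

Each color class then yields one processor leveling graph and one profiling run, which is why the number of steps can be far smaller than the number of edges.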

[Figure: a stream graph over actors A-F and its line graph, with nodes AB, BC, BD, CE, DE, and EF]

Processor Leveling Graph

[Figure: for the blue-colored edges of the stream graph A-F, the processor leveling graph collapses the actors into three groups: A; B, C, D, E; F]

Coloring the Processor Leveling Graph

[Figure: the processor leveling graph A - {B, C, D, E} - F is 2-colored, assigning A and F to processor 1 and the group B, C, D, E to processor 2]

Measuring the Communication Cost

[Figure: for the blue-colored edges, actors A and F run on processor 1 while the group B, C, D, E runs on processor 2]

With t_A, t_B, t_E, t_F the execution times without communication and t'_A, t'_B, t'_E, t'_F the times under this placement:

  C(A, B) = (t'_A - t_A) + (t'_B - t_B)
  C(E, F) = (t'_E - t_E) + (t'_F - t_F)

Profiling Performance

Benchmark    Total Edges  Prof. Steps  Steps/Edge (%)  Err (%)
SAR               44           3             7            10
MatrixMult        88          21            24            17
MergeSort         37           4            11            31
FMRadio           21           3            14            24
DCT               28           9            32            14
RadixSort         12           2            17             5
FFT               26           3            12            27
MPEG              56          17            30            15
Channel           22           6            27            11
BeamFormer        39           5            13            13
GM                                          17            15

Outline

Motivation
  - Multicore trend
  - Stream programming
Research questions
  - How to profile the communication overhead?
  - How to deploy stream programs?
Related work

Deployment of Stream Programs

[Figure: stream graph A (5) -> B (40) -> C (40) -> D (5) with edge communication costs 25, 5, 25; A and B are mapped to processor 1, C and D to processor 2]

Load(processor 1) = (5 + 40) + 5 = 50
Load(processor 2) = (40 + 5) + 5 = 50

Makespan = 50, Speedup = 90/50 = 1.8

Deploying Stream Programs without Considering Communication

[Figure: the same graph mapped with A and C on processor 1 and B and D on processor 2, cutting the costly edges]

Load(processor 1) = (5 + 40) + (25 + 5 + 25) = 100
Load(processor 2) = (40 + 5) + (25 + 5 + 25) = 100

Makespan = 100, Speedup = 90/100 = 0.9
Slowdown = (100 - 50) x 100% / 50 = 100%
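Under the convention implied by the load sums above (each processor is charged its actors' work plus the communication cost of every cut edge incident to one of its actors), both mappings can be checked, and an optimal one found for tiny graphs, by exhaustive search. This is only an illustrative sketch: the paper uses an ILP, and the edge costs A->B = 25, B->C = 5, C->D = 25 are inferred from the slide arithmetic:

```python
from itertools import product

# Actor work from the slides; edge costs inferred from the load sums.
work = {"A": 5, "B": 40, "C": 40, "D": 5}
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}

def makespan(assign, nprocs=2):
    """Each processor's load is its actors' work plus the cost of every
    cut edge incident to one of its actors; makespan is the max load."""
    load = [0] * nprocs
    for actor, proc in assign.items():
        load[proc] += work[actor]
    for (u, v), cost in comm.items():
        if assign[u] != assign[v]:       # edge crosses the partition
            load[assign[u]] += cost
            load[assign[v]] += cost
    return max(load)

actors = list(work)
best = min(
    (dict(zip(actors, placement))
     for placement in product(range(2), repeat=len(actors))),
    key=makespan,
)
# Communication-aware mapping {A,B} | {C,D}: makespan 50 (speedup 1.8).
# Communication-oblivious mapping {A,C} | {B,D}: makespan 100.
```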

Deployment Performance

Benchmark    m_c (µs)  m_nc (µs)  (m_nc - m_c)/m_c (%)
SAR            45.54      45.54         0
MatrixMult     67.80     111.14        64
MergeSort       1.63       6.99       329
FMRadio         1.57       7.00       346
DCT             4.64       7.68        66
RadixSort       1.49       3.08       107
FFT            18.28      34.15        87
MPEG           37.26      37.26         0
Channel        89.00      91.20         2
BeamFormer      7.29       7.29         0

(m_c: makespan of the communication-aware deployment; m_nc: makespan when communication is ignored)

[Figure: speedups obtained for 2, 4, and 6 processors]

Summary

  - We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
  - We propose an ILP-based approach that minimizes the makespan
  - We conducted experiments:
      - The number of profiling steps is on average only 17% of the number of edges
      - The profiling scheme shows only 15% error on average in the random-mapping test
      - We obtain a speedup of 3.11x for 4 processors and 4.02x for 6 processors

Related Works

[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by Approximation [Farhad '11]

Questions?

Minimizing Errors in the Profiling Process

  - Errors are likely in any profiling process
  - We chose an architecture with a uniform cache hierarchy
  - We pin the threads using the likwid-pin tool

Cache Topology of the Processor

800 MHz hexa-core AMD Phenom(tm) II X6 1090T:
  - Cores #0-#5, each with a private 64 kB L1 and 512 kB L2
  - Shared 6 MB L3
