A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops. João M. P. Cardoso, Portugal. ITIV, University of Karlsruhe, July 2, 2007.


Page 1: João M. P. Cardoso

A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

João M. P. Cardoso

ITIV, University of Karlsruhe, July 2, 2007

Portugal

Page 2: João M. P. Cardoso

Motivation

Many applications have sequences of tasks
• E.g., in image and video processing algorithms

Contemporary FPGAs
• Plenty of room to accommodate highly specialized complex architectures
• Time to creatively "use available resources" rather than simply "save resources"

Page 3: João M. P. Cardoso

Motivation

Computing Stages
• Sequentially

[Figure: Task A, Task B and Task C executed one after another along the time axis]

Page 4: João M. P. Cardoso

Motivation

Computing Stages
• Concurrently

[Figure: Task A, Task B and Task C overlapped along the time axis]

Page 5: João M. P. Cardoso

Outline

• Objective
• Loop Pipelining
• Producer/Consumer Computing Stages
• Pipelining Sequences of Loops
• Inter-Stage Communication
• Experimental Setup and Results
• Related Work
• Conclusions
• Future Work

Page 6: João M. P. Cardoso

Objectives

To speed up applications with multiple data-dependent stages
• Each stage is seen as a set of nested loops

How?
• Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
• Taking advantage of field-custom computing structures (FPGAs)

Page 7: João M. P. Cardoso

Loop Pipelining

Attempt to overlap loop iterations

Significant speedups are achieved

But how to pipeline sequences of loops?

[Figure: loop iterations I1, I2, I3, I4, ... executed back to back versus overlapped along the time axis]
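A simple cycle-count model illustrates the gain from overlapping iterations. This is a minimal sketch under assumptions that are not in the slides: every iteration takes `latency` cycles, and a pipelined loop starts a new iteration every `ii` cycles (the initiation interval). The function names are illustrative.

```python
def sequential_cycles(iterations: int, latency: int) -> int:
    """Cycles when each iteration starts only after the previous one ends."""
    return iterations * latency


def pipelined_cycles(iterations: int, latency: int, ii: int = 1) -> int:
    """Cycles when a new iteration starts every `ii` cycles.

    The first iteration needs the full latency; each later iteration
    adds only one initiation interval.
    """
    if iterations == 0:
        return 0
    return latency + (iterations - 1) * ii


# Four iterations (I1..I4), each four cycles long.
print(sequential_cycles(4, 4))    # 16
print(pipelined_cycles(4, 4, 1))  # 7
```

The model covers a single loop only. It says nothing about a second loop that consumes the results of the first, which is the question the slide raises.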

Page 8: João M. P. Cardoso

Computing Stages

Sequentially

Producer: ... A[2] A[1] A[0]

Consumer: A[0] A[1] A[2] ...

Page 9: João M. P. Cardoso

Computing Stages

Concurrently
• Ordered producer/consumer pairs
• Send/receive

Producer: ... A[2] A[1] A[0]

Consumer: A[0] A[1] A[2] ...

[Figure: FIFO with N stages between producer and consumer, holding A[0], A[1], A[2], A[3], ...]
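When the consumer reads elements in the order the producer writes them, a bounded FIFO is enough. The sketch below is a software analogy, not the hardware from the slides: `queue.Queue` stands in for the N-stage FIFO, the `None` end-of-stream marker and the doubling "computation" are assumptions made for illustration.

```python
import queue
import threading


def run_ordered_pipeline(values, fifo_stages=4):
    """Run a producer and a consumer concurrently over a bounded FIFO."""
    fifo = queue.Queue(maxsize=fifo_stages)  # FIFO with N stages
    received = []

    def producer():
        for v in values:
            fifo.put(v)      # send: blocks while the FIFO is full
        fifo.put(None)       # end-of-stream marker (illustrative assumption)

    def consumer():
        while True:
            v = fifo.get()   # receive: blocks while the FIFO is empty
            if v is None:
                break
            received.append(2 * v)  # stand-in for the consumer's computation

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


result = run_ordered_pipeline(list(range(8)))
print(result)
```

The FIFO preserves order, so the consumer never needs an address. That is also why this scheme stops working once the two stages traverse the data in different orders.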

Page 10: João M. P. Cardoso

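Computing Stages

Concurrently
• Unordered producer/consumer pairs
• Empty/Full table

Producer: ... A[3] A[5] A[1]

Consumer: A[3] A[1] A[5] ...

[Figure: eight-entry table with an empty/full bit and a data field per entry; entries 1 and 5 are full (1) and hold A[1] and A[5], all other entries are empty (0)]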
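The empty/full table can be sketched as a cycle-stepped simulation using the orders from the slide (producer writes A[1], A[5], A[3]; consumer reads A[3], A[1], A[5]). The timing model is an assumption: one write and one read attempt per cycle, with a write visible to a read in the same cycle. The function name is illustrative.

```python
def simulate_empty_full(producer_order, consumer_order, size):
    """Return (values consumed in order, number of consumer stall cycles)."""
    data = [None] * size
    full = [False] * size          # 1-bit empty/full table
    consumed, stalls = [], 0
    p = c = 0
    while c < len(consumer_order):
        # Producer: at most one store per cycle, flagged as full.
        if p < len(producer_order):
            idx = producer_order[p]
            data[idx] = f"A[{idx}]"
            full[idx] = True
            p += 1
        # Consumer: one load attempt per cycle; retry while the entry is empty.
        idx = consumer_order[c]
        if full[idx]:
            consumed.append(data[idx])
            c += 1
        elif p >= len(producer_order):
            raise RuntimeError(f"A[{idx}] is never produced")
        else:
            stalls += 1
    return consumed, stalls


consumed, stalls = simulate_empty_full([1, 5, 3], [3, 1, 5], size=8)
print(consumed, stalls)  # ['A[3]', 'A[1]', 'A[5]'] 2
```

The consumer stalls for two cycles because A[3] is the last element written, then proceeds without waiting.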

Page 11: João M. P. Cardoso

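Main Idea

FDCT

[Figure: data input, Loop 1 and Loop 2, intermediate data, Loop 3 and data output, controlled by a global FSM. Timeline: execution of Loops 1 and 2, then execution of Loop 3. The intermediate data array is shown as a block whose first row is indexed 0 1 2 3 4 5 6 7 and whose first column is indexed 0, 8, 16, 24, 32, 40, 48, 56.]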

Page 12: João M. P. Cardoso

Main Idea

FDCT
• Out-of-order producer/consumer pairs
• How to overlap computing stages?

[Figure: two views of the intermediate data array, each with first row indexed 0 1 2 3 4 5 6 7 and first column indexed 0, 8, 16, 24, 32, 40, 48, 56]

Page 13: João M. P. Cardoso

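Main Idea

Pipelined FDCT

[Figure: data input, Loop 1 and Loop 2 (FSM 1), intermediate data in a dual-port RAM with a dual-port 1-bit empty/full table, Loop 3 (FSM 2) and data output. Timeline: execution of Loops 1 and 2 and execution of Loop 3 along the time axis. The intermediate data array is shown with first row indexed 0 1 2 3 4 5 6 7 and first column indexed 0, 8, 16, 24, 32, 40, 48, 56.]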

Page 14: João M. P. Cardoso

Main Idea

[Figure: Task A and Task B connected through memories]

Page 15: João M. P. Cardoso

Possible Scenarios

Single write, single read
• Accepted without code changes

Single write, multiple reads
• Accepted without code changes (by using an N-bit table)

Multiple writes, single read
• Need code transformations

Multiple writes, multiple reads
• Need code transformations

Page 16: João M. P. Cardoso

Inter-Stage Communication

Responsible for:
• Communicating data between pipelined stages
• Flagging data availability

Solutions
• Perfect associative memory
  • Cost too high
• Memory for data plus 1-bit table (each cell represents full/empty information)
  • Size of the data set to communicate
  • Decrease size using hash-based solution

[Figure: eight-entry table with an empty/full bit and a data field per entry; entries 1 and 5 are full (1) and hold A[1] and A[5], all other entries are empty (0)]

Page 17: João M. P. Cardoso

Inter-Stage Communication

Memory plus 1-bit table

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true;
        tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[i_1]; if(!tab[i_1]) goto L1;
      L2: f1 = tmp[1+i_1]; if(!tab[1+i_1]) goto L2;
      // remaining loads
      // computations …
      // stores
      i_1 += 8;
    }

[Figure: img feeds Loop 1 and Loop 2 (FSM 1), which access the dual-port memory tmp and the dual-port 1-bit table tab; Loop 3 (FSM 2) accesses the same memory and table and writes dct_o. Data connections and address connections are drawn separately.]
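The access pattern of the two code fragments can be replayed to see how much overlap the 1-bit table allows. The sketch below makes several assumptions that the slide does not state: N = 8; every Loop 2 iteration stores the eight elements `r*8 + i_1` for r = 0..7 (the slide shows only the last two stores); one step equals one Loop 2 iteration and at most one Loop 3 iteration; a store is visible to a load in the same step.

```python
N = 8  # assumed block dimension (8x8 FDCT blocks)


def producer_writes(num_fdcts):
    """Yield the tmp indices stored by each Loop 2 iteration (one column)."""
    i_1 = 0
    for _ in range(num_fdcts):          # Loop 1
        for _ in range(N):              # Loop 2
            yield [r * N + i_1 for r in range(N)]
            i_1 += 1
        i_1 += N * N - N                # i_1 += 56


def consumer_reads(num_fdcts):
    """Yield the tmp indices loaded by each Loop 3 iteration (one row)."""
    i_1 = 0
    for _ in range(N * num_fdcts):      # Loop 3
        yield [i_1 + c for c in range(N)]
        i_1 += N


def simulate(num_fdcts):
    """Return the step at which each Loop 3 iteration completes."""
    tab = [False] * (N * N * num_fdcts)  # 1-bit empty/full table
    writes = producer_writes(num_fdcts)
    reads = list(consumer_reads(num_fdcts))
    done_at, step, nxt = [], 0, 0
    while nxt < len(reads):
        step += 1
        for idx in next(writes, []):     # producer stores one column
            tab[idx] = True
        if all(tab[idx] for idx in reads[nxt]):  # all loads hit
            done_at.append(step)
            nxt += 1
    return done_at


for blocks in (1, 2, 64):
    pipelined = simulate(blocks)[-1]
    sequential = 2 * N * blocks          # Loops 1-2, then Loop 3
    print(blocks, pipelined, sequential)
```

Under these assumptions the first row of a block cannot be read until its last column has been written, so the overlap comes from consecutive blocks rather than from within one block. One block takes 15 steps instead of 16, two blocks take 23 instead of 32.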

Page 18: João M. P. Cardoso

Inter-Stage Communication

Hash-based solution:

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true;
        tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[H(i_1)]; if(!tab[H(i_1)]) goto L1;
      L2: f1 = tmp[H(1+i_1)]; if(!tab[H(1+i_1)]) goto L2;
      // remaining loads
      // computations …
      // stores
      i_1 += 8;
    }

[Figure: img feeds Loop 1 and Loop 2 (FSM 1), which access the dual-port memory tmp and the empty/full table tab; Loop 3 (FSM 2) accesses the same memory and table and writes dct_o. Every address connection to tmp and tab passes through H. Data connections and address connections are drawn separately.]

Page 19: João M. P. Cardoso

Inter-Stage Communication

Hash-based solution
• We did not want to include additional delays in the load/store operations
• Use H(k) = k MOD m
  • When m is a multiple of 2*N, H(k) can be implemented by just using the log2(m) least significant bits of k to address the cache (translates to simple interconnections)

[Figure: empty/full table and data memory addressed through H on both the write side and the read side; two entries are full (1) and hold A[5] and A[1], the other entries are empty (0)]
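The low-bits shortcut can be checked in a few lines. The sketch assumes that m is a power of two, which is what makes log2(m) an integer and lets the modulo reduce to a bit mask; the function names are illustrative.

```python
def hash_mod(k: int, m: int) -> int:
    """H(k) = k MOD m, as written on the slide."""
    return k % m


def hash_low_bits(k: int, m: int) -> int:
    """Same hash, computed by keeping only the log2(m) least significant bits."""
    if m <= 0 or m & (m - 1):
        raise ValueError("m must be a power of two for the bit-mask form")
    return k & (m - 1)


m = 64
same = all(hash_mod(k, m) == hash_low_bits(k, m) for k in range(10_000))
print(same)                              # True
print(hash_mod(5, m), hash_mod(69, m))   # 5 5
```

In hardware the mask costs nothing: the upper address wires are simply left unconnected. The price is aliasing, since addresses that differ by a multiple of m (5 and 69 above) share one entry.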

Page 20: João M. P. Cardoso

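Inter-Stage Communication

Hash-based solution: H(k) = k MOD m

Single read

(L=1) R = 1 = 0

a) write
b) read
c) empty/full update

[Figure: communication component with data_in and address_in (through H) on the write side; address_out (through H), data_out and hit/miss on the read side; table T with field R; dimensions L, N and M; signal paths labeled (a), (b) and (c)]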

Page 21: João M. P. Cardoso

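Inter-Stage Communication

Hash-based solution: H(k) = k MOD m

Multiple reads

(L>1) R = 11...1 (L) >>= R

a) write
b) read
c) empty/full update

[Figure: communication component with data_in and address_in (through H) on the write side; address_out (through H), data_out and hit/miss on the read side; table T with field R; dimensions L, N and M; signal paths labeled (a), (b) and (c)]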
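One reading of "R = 11...1 (L)" and ">>=" is that a write sets an L-bit field to all ones and every read shifts it right by one, so the entry becomes empty after L reads. The sketch below follows that interpretation, which is an assumption; the class and method names are hypothetical, and aliasing between different addresses that hash to the same entry is not detected.

```python
class ReadCountedBuffer:
    """Hash-addressed buffer whose entries are released after L reads."""

    def __init__(self, m: int, reads_per_item: int):
        self.m = m
        self.reads_per_item = reads_per_item   # L
        self.data = [None] * m
        self.r = [0] * m                       # L-bit field; 0 means empty

    def write(self, k: int, value) -> bool:
        """Store value at H(k); return False if the entry is still in use."""
        h = k % self.m
        if self.r[h] != 0:
            return False                       # collision: producer must wait
        self.data[h] = value
        self.r[h] = (1 << self.reads_per_item) - 1   # R = 11...1 (L ones)
        return True

    def read(self, k: int):
        """Return (hit, value); each hit consumes one of the L reads."""
        h = k % self.m
        if self.r[h] == 0:
            return False, None                 # miss: consumer must retry
        self.r[h] >>= 1                        # empty/full update
        return True, self.data[h]


buf = ReadCountedBuffer(m=8, reads_per_item=2)
print(buf.write(3, "x"))    # True
print(buf.read(3))          # (True, 'x')
print(buf.write(11, "y"))   # False, 11 MOD 8 == 3 and one read is pending
print(buf.read(3))          # (True, 'x')
print(buf.read(3))          # (False, None)
print(buf.write(11, "y"))   # True
```

With `reads_per_item=1` the field degenerates to the 1-bit empty/full flag of the single-read case.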

Page 22: João M. P. Cardoso

Buffer size calculation

By monitoring the behavior
• of the communication component

For each read and write
• determine the size of the buffer needed to avoid collisions

Done during RTL simulation
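The monitoring step can be approximated in software by replaying a trace of writes and reads. This is only a sketch of the idea, not the RTL monitor from the slides. It assumes a single write and a single read per element, a trace given as `(op, index)` pairs in time order, and power-of-two buffer sizes for the simple hash; all names are illustrative.

```python
def collision_free(trace, m):
    """True if no write lands on an entry that still holds unread data."""
    occupied = set()
    for op, k in trace:
        h = k % m
        if op == "w":
            if h in occupied:
                return False
            occupied.add(h)
        else:
            occupied.discard(h)
    return True


def buffer_sizes(trace):
    """Return (peak number of live elements, smallest safe power-of-two m)."""
    live, peak = set(), 0
    for op, k in trace:
        if op == "w":
            live.add(k)
            peak = max(peak, len(live))
        else:
            live.discard(k)
    max_k = max(k for _, k in trace)
    m = 1
    while not collision_free(trace, m):
        if m > max_k:   # H is already the identity; a larger m cannot help
            raise ValueError("an element is rewritten before it is read")
        m *= 2
    return peak, m


trace = [("w", 0), ("w", 8), ("r", 0), ("w", 16), ("r", 8), ("r", 16)]
print(buffer_sizes(trace))  # (2, 16)
```

The peak is the minimum a perfect hash could reach, while the second value is what H(k) = k MOD m needs for the same trace. Here a stride of 8 forces m = 16 although only two elements are ever live at once.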

Page 23: João M. P. Cardoso

Experimental Setup

Compilation flow
• Uses our previous work on compiling algorithms in a Java subset to FPGAs

[Figure: compilation flow. Boxes: Java code with directives; Front-End (includes compilation to JVM); Java bytecodes; Nau; Estimators; Library (FUs); FU Models (HDL); FU Models (Java); Control Units (XML); Datapath Units (XML); RTG (XML); XSL Transformers; Logic Synthesis and Place and Route (vendor-specific); Specific Reconfigurable Hardware (FPGA)]

Page 24: João M. P. Cardoso

Experimental Setup

Simulation back-end

[Figure: simulation back-end. XSLTs translate datapath.xml, fsm.xml and rtg.xml to dotty, hds, java and vhdl. Generated files include datapath.hds, fsm.java and rtg.java; fsm.class and rtg.class are used with HADES, together with a Library of Operators (Java) and I/O data (RAMs and stimulus). An ANT build file is part of the flow.]

Page 25: João M. P. Cardoso

Experimental Results

Benchmarks

Algorithm | # Stages | # Loops | Description
fdct | 2 {s1,s2} | 3 | Fast DCT (Discrete Cosine Transform)
fwt2D | 4 {s1,s2,s3,s4} | 8 | Forward Haar Wavelet
RGB2gray + histogram | 2 {s1,s2} | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
Smooth + sobel, 3 versions: (a), (b), (c) | 2 {s1,s2} | 6 | Smooth image operation based on 3×3 windows, with the resulting image being the input to the Sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): the same as (b) plus elimination of redundant array references in the original code of sobel.

Page 26: João M. P. Cardoso

Experimental Results

[Figure: FDCT, speed-up achieved by pipelining sequences of loops. x-axis: number of 8x8 blocks (1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 40, 48, 56, 64, 128, 256, 512, 1024); y-axis: speedup (1.00 to 2.00)]

Page 27: João M. P. Cardoso

Experimental Results

Algorithm | Input data size | #cc w/o PSL, per stage set | Speed-up upper bound | #cc w/ PSL | Speed-up
fdct | 800×600 | (s1,s2): 3,930,005; (s1): 1,950,003; (s2): 1,920,003 | 2.02 | 1,830,215 | 2.02
fwt2D | 512×512 | (s1,s2,s3,s4): 4,724,745; (s1,s2): 2,362,373; (s3,s4): 2,362,373 | 2.00 | 3,664,917 | 1.29
RGB2gray + histogram | 800×600 | (s1,s2): 6,720,025; (s1): 2,880,015; (s2): 3,840,015 | 1.75 | 3,840,007 | 1.75
Smooth + sobel (a) | 800×600 | (s1,s2): 49,634,009; (s1): 32,929,473; (s2): 16,606,951 | 1.51 | 32,929,489 | 1.51
Smooth + sobel (b) | 800×600 | (s1,s2): 30,068,645; (s1): 13,364,109; (s2): 16,606,951 | 1.81 | 16,640,509 | 1.81
Smooth + sobel (c) | 800×600 | (s1,s2): 25,773,809; (s1): 13,364,109; (s2): 11,862,791 | 1.92 | 13,364,117 | 1.92
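The speed-up columns can be recomputed from the cycle counts. The sketch assumes, since the slides do not define it, that the upper bound is the sequential cycle count divided by the cycle count of the slowest stage set, and that the speed-up is the ratio of cycles without and with PSL. Under those assumptions the fwt2D and RGB2gray + histogram rows are reproduced to two decimals.

```python
def upper_bound(cc_sequential: int, cc_stage_sets: list[int]) -> float:
    """Best case: total time is bounded by the slowest stage set alone."""
    return cc_sequential / max(cc_stage_sets)


def speedup(cc_without_psl: int, cc_with_psl: int) -> float:
    """Measured gain from pipelining sequences of loops (PSL)."""
    return cc_without_psl / cc_with_psl


# fwt2D, 512x512
fwt2d_bound = upper_bound(4_724_745, [2_362_373, 2_362_373])
fwt2d_speedup = speedup(4_724_745, 3_664_917)

# RGB2gray + histogram, 800x600
rgb_bound = upper_bound(6_720_025, [2_880_015, 3_840_015])
rgb_speedup = speedup(6_720_025, 3_840_007)

print(f"{fwt2d_bound:.2f} {fwt2d_speedup:.2f}")  # 2.00 1.29
print(f"{rgb_bound:.2f} {rgb_speedup:.2f}")      # 1.75 1.75
```

fwt2D stays well below its bound, while RGB2gray + histogram reaches it, which is consistent with the later remark that the achievable speed-up depends on the producer/consumer ordering.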

Page 28: João M. P. Cardoso

Experimental Results

What happens with buffer sizes?

[Figure: bar chart on a log scale (1 to 1,000,000) for smooth + sobel (a), RGB2gray + histogram (a), fwt2D and fdct. Series: table size (no hash function); buffer size used (simple hash function); buffer minimum size (perfect hash). Bar labels as extracted, not assigned to bars: 128, 480000, 480000, 480000, 262144, 2, 2048, 131072, 56, 1, 120000, 1198]

Page 29: João M. P. Cardoso

Experimental Results

Adjust latency of tasks in order to balance pipeline stages:
• Slow down tasks with higher latency
• Optimize slower tasks in order to reduce their latency

Slowing down producer tasks usually reduces the size of the inter-stage buffers
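The effect of slowing the producer can be illustrated with a small occupancy model. It is a simplification and not taken from the slides: elements are produced and consumed in the same order, the producer stores one element every `producer_period` cycles, the consumer needs `consumer_period` cycles per element, and an element leaves the buffer when the consumer finishes with it.

```python
def peak_occupancy(n_items: int, producer_period: int, consumer_period: int) -> int:
    """Largest number of elements held in the inter-stage buffer at once."""
    write_time = [i * producer_period for i in range(n_items)]
    release_time = []
    t = 0
    for w in write_time:
        # The consumer starts an element when it is free and the data exists.
        t = max(t, w) + consumer_period
        release_time.append(t)
    # At equal times a release (-1) sorts before a write (+1).
    events = sorted([(w, 1) for w in write_time] +
                    [(r, -1) for r in release_time])
    occupancy = peak = 0
    for _, delta in events:
        occupancy += delta
        peak = max(peak, occupancy)
    return peak


for period in (1, 2, 3):
    print(period, peak_occupancy(6, period, 3))
# 1 5
# 2 3
# 3 1
```

In all three runs the consumer finishes at cycle 18, so matching the producer to the consumer shrinks the buffer here without lengthening the run.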

Page 30: João M. P. Cardoso

Experimental Results

Buffer sizes

[Figure: bar chart on a log scale (1 to 1,000,000) for smooth + sobel (a), (b), (c) and RGB2gray + histogram (a), (b), (c). Series: table size (no hash function); buffer size used (simple hash function); buffer minimum size (perfect hash). Variant annotations: original, optimizations in the producer, + optimizations in the consumer; original, +1 cycle per iteration of the producer, +2 cycles per iteration of the producer. Bar labels as extracted, not assigned to bars: 131072, 1, 480000, 480000, 480000, 480000, 480000, 480000, 2048, 2048, 8192, 2, 131072, 6001, 120000, 1198, 95110, 1198]

Page 31: João M. P. Cardoso

Experimental Results

Buffer sizes

[Figure: for smooth + sobel (a), (b), (c), RGB2gray + histogram (a), (b), (c), fwt2D and fdct: overhead related to optimal size and reduction related to original, with a percentage axis (0.0% to 60.0%) and a log-scale axis starting at 1. Labels as extracted, not assigned to bars: 41.5%, 41.5%, 8.4%, 27.4%, 50.0%, 50.0%, 26.7%, 56.3%; 234, 4, 59, 234, 4, 240000, 131072, 3750]

Page 32: João M. P. Cardoso

Experimental Results

[Figure: Resources and Frequency (Spartan-3 400). For fdct, fdct-hash, fdct-table, smooth+sobel, smooth+sobel-hash, smooth+sobel-table, RGB2gray+histogram, RGB2gray+histogram-hash, RGB2gray+histogram-table, fwt2D, fwt2D-hash and fwt2D-table: # FFs, # 4-LUTs and # Slices (FPGA resources, log scale from 1 to 10000) and normalized frequency (0.0 to 1.2). Normalized frequency labels as extracted: 1.14, 1.00, 0.96, 1.13, 1.13, 1.00, 1.00, 1.00, 1.03, 0.99, 1.00, 0.99]

Page 33: João M. P. Cardoso

Related Work

Previous approach (Ziegler et al.)
• Coarse-grained communication and synchronization scheme
• FIFOs are used to communicate data between pipelining stages
• Width of FIFO stages dependent on producer/consumer ordering
• Less applicable

[Figure: three producer/consumer orderings over time, each with the FIFO stages it requires.
Producer A[0] A[1] A[2] A[3] ..., consumer A[0] A[1] A[2] A[3] ...: FIFO stages A[0], A[1], ...
Producer A[0] A[1] A[2] A[3] ..., consumer A[1] A[0] A[3] A[2] ...: FIFO stages A[0] A[1], then A[2] A[3], ...
Producer A[0] A[1] A[2] A[3] A[4] A[5] ..., consumer A[0] A[3] A[1] A[4] A[2] A[5] ...: FIFO stages A[0] A[1] A[2] A[3] A[4], then A[5] A[6] A[7] A[8] A[9], ...]

Page 34: João M. P. Cardoso

Conclusions

We presented a scheme to accelerate applications by pipelining sequences of loops
• I.e., before the end of a stage (set of nested loops), a subsequent stage (set of nested loops) can start executing based on data already produced

A data-driven scheme is used, based on empty/full tables
• A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)

Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved
• As if the stages were executed concurrently and independently

Page 35: João M. P. Cardoso

Future Work

• Research other hash functions
• Study slowdown effects
• Apply the technique in the context of Multi-Core Systems

[Figure: Processor Core A and Processor Core B, each with a memory, connected through the hash-based communication component (data_in, address_in, address_out, data_out, hit/miss, H, T, R)]

Page 36: João M. P. Cardoso

Acknowledgments

Work partially funded by

• CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems

• Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002

Based on the work done by Rui Rodrigues

In collaboration with Pedro C. Diniz

Page 37: João M. P. Cardoso

technology from seed

A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

Page 38: João M. P. Cardoso

Buffer Monitor

FDCT

[Figure: buffer size (elements, 0 to 60) and store, load(hit) and load(miss) events versus clock cycles (0 to 300)]

Page 39: João M. P. Cardoso

Buffer Monitor

fwt2D

[Figure: buffer size (elements, 0 to 1.2) and load(miss), load(hit) and store events versus clock cycles (0 to 100)]

Page 40: João M. P. Cardoso

Buffer Monitor

RGB2gray + histogram

[Figure: buffer size (elements, 0 to 12) and store, load(miss) and load(hit) events versus clock cycles (0 to 360)]

Page 41: João M. P. Cardoso

Buffer Monitor

RGB2gray + histogram (modified)

[Figure: buffer size (elements, 0 to 6) and store, load(miss) and load(hit) events versus clock cycles (0 to 378)]

Page 42: João M. P. Cardoso

Buffer Monitor

Smooth + Sobel a)

[Figure: buffer size (elements, 0 to 30) and store, load(miss) and load(hit) events versus clock cycles (0 to 2373)]

Page 43: João M. P. Cardoso

Buffer Monitor

Smooth + Sobel a)

[Figure: buffer size (elements, 0 to 14) and store, load(miss) and load(hit) events versus clock cycles (1 to 2394)]