Zhiduo Liu Supervisor: Guy Lemieux Sep. 28th, 2012 Accelerator Compiler for the VENICE Vector Processor

Upload: jeffery-ramsey

Post on 17-Jan-2016

TRANSCRIPT

Page 1:

Accelerator Compiler for the VENICE Vector Processor

Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012

Page 2:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 4:

Motivation

[Figure: word cloud of parallel platforms and programming models]

Platforms: Multi-core, Many-core, GPU, FPGA, Vector Processor, Computer clusters

Languages and APIs: CUDA, OpenCL, OpenMP, MPI, Pthread, OpenHMPP, OpenGL, SSE, Verilog, SystemVerilog, VHDL, Bluespec, Erlang, Cilk, X10, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge

Page 5:

Motivation

[Figure: the same word cloud of platforms, languages and APIs as on page 4]

Simplification

Page 6:

Motivation

Single Description

Page 7:

Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.

Page 8:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 9:

Complicated

[Figure: VENICE pipeline diagram; stage labels: ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]

Page 10:

Program in VENICE assembly

#include "vector.h"

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );

    int *va = ( int *) vector_malloc ( data_len );

    vector_dma_to_vector ( va, A, data_len );
    vector_wait_for_dma ();

    vector_set_vl ( data_len / sizeof (int) );

    vector ( SVW, VADD, va, 42, va );
    vector_instr_sync ();

    vector_dma_to_host ( A, va, data_len );
    vector_wait_for_dma ();

    vector_free ();
}

• Allocate vectors in scratchpad

• Move data from main memory to scratchpad

• Wait for DMA transaction to be completed

• Setup for vector instructions

• Perform vector computations

• Wait for vector operations to be completed

• Move data from scratchpad to main memory

• Wait for DMA transaction to be completed

• Deallocate memory from scratchpad

Page 11:

Program in Accelerator

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};

    Target *tgt = CreateVectorTarget();

    IPA b = IPA( A, sizeof (A)/sizeof (int));

    IPA c = b + 42;

    tgt->ToArray( c, A, sizeof (A)/sizeof (int));

    tgt->Delete();
}

Other targets shown on the slide:

    Target *tgt = CreateDX9Target();
    Target *tgt = CreateMulticoreTarget();

• Create a Target

• Create Parallel Array objects

• Write expressions

• Call ToArray to evaluate expressions

• Delete Target object

Page 12:

Assembly Programming:

[Flow diagram] Steps: Write Assembly, Compile with Gcc, Download to board, Get Result

Decision points: Doesn’t compile? Result incorrect?

Accelerator Programming:

[Flow diagram] Steps: Write in Accelerator, Compile with Microsoft Visual Studio, Get Result, Compile with Gcc, Download to board, Get Result

Decision point: Doesn’t compile? Or result incorrect?

Page 13:

Assembly Programming:

1. Hard to program

2. Long debug cycle

3. Not portable

4. Manual: not always optimal or correct (WYSIWYG)

Accelerator Programming:

1. Easy to program

2. Easy to debug

3. Can also target other devices

4. Automated compiler optimizations

Page 14:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 17:

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, … , 8192};
    int d[length];

    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;

    tgtVector->ToArray( D, d, length * sizeof(int));

    tgtVector->Delete();
}

[Figure: expression graph of D = (A + (Rot(A) + 1)) × Abs(A + 2)]

Page 18:

[Figure: expression graph of D, as on page 17]

Page 19:

[Figure: the Rot node is folded into its input leaf, which becomes A(rot)]

Page 20:

[Figure: the graph is split into one sub-graph per result: B = A(rot) + 1, C = Abs(A + 2), D = (A + B) × C]

Page 23:

Combine Operations

[Figure: sub-graphs B = A(rot) + 1, C = Abs(A + 2), D = (A + B) × C]

Page 24:

Combine Operations

[Figure: in the sub-graph for C, the Abs and + nodes are merged into a single |+| node, so C = |A + 2|]
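The effect of merging Abs and + into one |+| node can be sketched in plain C. This is only a scalar model, not VENICE code: the function names are hypothetical, and each loop stands in for one vector instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* abs */

/* Unfused: two passes over the data and one temporary buffer. */
void abs_add_unfused(const int *a, int scalar, int *tmp, int *c, size_t n)
{
    assert(a && tmp && c);
    for (size_t i = 0; i < n; i++)
        tmp[i] = a[i] + scalar;      /* first "instruction": add */
    for (size_t i = 0; i < n; i++)
        c[i] = abs(tmp[i]);          /* second "instruction": absolute value */
}

/* Fused: one pass and no temporary; models the combined |+| node. */
void abs_add_fused(const int *a, int scalar, int *c, size_t n)
{
    assert(a && c);
    for (size_t i = 0; i < n; i++)
        c[i] = abs(a[i] + scalar);
}
```

Both functions produce the same output; the fused form needs one pass and one buffer fewer, which is the saving the combined node is meant to capture.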

Page 25:

Scratchpad Memory

“Virtual Vector Register File”

Page 27:

“Virtual Vector Register File”

Number of vector registers = ?

Vector register size = ?

Page 29:

Evaluation Order

[Figure: sub-graphs for B, C and D with their nodes numbered in evaluation order]

Pages 30 to 58:

Count number of virtual vector registers

[Figure: the sub-graphs for B, C and D are walked step by step while the counters below are updated. A dash marks a value not shown on that slide.]

Pages    Ref Count    Active         numLoads  numTemps  numTotal  maxTotal
         A  B  C      A    B    C
30-31    -  -  -      -    -    -    -         -         -         -
32-33    3  -  -      -    -    -    -         -         -         -
34-35    3  1  -      -    -    -    -         -         -         -
36       3  1  1      -    -    -    -         -         -         -
37       3  1  1      Yes  No   No   -         -         -         -
38-39    3  1  1      Yes  No   No   1         -         -         -
40-42    2  1  1      Yes  No   No   1         -         -         -
43       2  1  1      Yes  No   No   1         1         -         -
44       2  1  1      Yes  No   No   1         1         2         2
45       2  1  1      Yes  Yes  No   2         1         -         -
46       2  1  1      Yes  Yes  No   2         0         -         -
47-48    1  1  1      Yes  Yes  No   2         0         -         -
49       1  1  1      Yes  Yes  No   2         1         -         -
50       1  1  1      Yes  Yes  No   2         1         3         3
51       1  1  1      Yes  Yes  Yes  3         0         -         -
52       0  1  1      No   Yes  Yes  3         0         -         -
53-54    0  0  1      No   No   Yes  3         0         -         -
55-56    0  0  0      No   No   No   3         0         -         -
57       0  0  0      No   No   No   3         0         3         3
58       0  0  0      No   No   No   0         0         0         3

Page 59:

“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = ?

Page 60:

“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = Capacity/3
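The register count can be approximated with a small liveness walk. The sketch below is a simplified reconstruction in plain C, not the compiler's actual numLoads/numTemps bookkeeping: it uses the same Ref Count and Active idea, assumes a result may reuse the register of an operand that has no uses left, and splits D into two instructions (T = A + B, D = T × C). All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VALUES 32
#define NONE (-1)   /* marks a scalar or absent operand */

/* One element-wise vector instruction: dst = op(src1, src2). */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* Peak number of vector values that hold a register at the same time
 * when the instructions run in the given order. */
int count_vector_registers(const Instr *code, size_t n)
{
    int refs[MAX_VALUES] = {0};    /* remaining uses of each value   */
    int active[MAX_VALUES] = {0};  /* 1 while a value holds a register */
    int live = 0, peak = 0;

    /* First pass: count how often each value is used as an operand. */
    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { code[i].src1, code[i].src2 };
        assert(code[i].dst >= 0 && code[i].dst < MAX_VALUES);
        for (int k = 0; k < 2; k++) {
            assert(srcs[k] >= NONE && srcs[k] < MAX_VALUES);
            if (srcs[k] != NONE) refs[srcs[k]]++;
        }
    }

    /* Second pass: walk the instructions and track live values. */
    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { code[i].src1, code[i].src2 };

        /* An operand that is not resident yet must be loaded first. */
        for (int k = 0; k < 2; k++) {
            int v = srcs[k];
            if (v != NONE && !active[v]) { active[v] = 1; live++; }
        }
        if (live > peak) peak = live;

        /* Operands with no uses left give up their registers... */
        for (int k = 0; k < 2; k++) {
            int v = srcs[k];
            if (v != NONE && --refs[v] == 0) { active[v] = 0; live--; }
        }

        /* ...so the result can reuse one of them (assumption). */
        if (!active[code[i].dst]) { active[code[i].dst] = 1; live++; }
        if (live > peak) peak = live;
    }
    return peak;
}

/* Words per register when the scratchpad is split evenly. */
size_t register_size_words(size_t capacity_words, int num_registers)
{
    assert(num_registers > 0);
    return capacity_words / (size_t)num_registers;
}
```

For the example program this walk yields 3, in agreement with maxTotal on the slides; the register size then follows as Capacity/3.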

Page 61:

Convert to LIR

[Figure: the three sub-graphs, numbered in evaluation order, are converted to LIR]

Result B: A(rot) + 1

Result C: |A + 2|

Result D: (A + B) × C

Page 62:

Code Generation

Result B: A(rot) + 1

Result C: |A + 2|

Result D: (A + B) × C

Page 63:

Code Generation

[Figure: data buffer] 1 2 3 4 ... 8192

Page 64:

Code Generation

[Figure: data buffer] 1 2 3 4 ... 8192 1

Page 65:

Code Generation

The setup code and loop header appear (complete listing on page 68), followed by the instruction for Result B:

vector ( SVW, VADD, vb, 1, va+1 );

Page 66:

Code Generation

Instruction added for Result C:

vector_abs ( SVW, VADD, vc, 2, va );

Page 67:

Code Generation

First instruction added for Result D:

vector ( VVW, VADD, vb, vb, va );

Page 68:

Code Generation

#include "vector.h"

int main()
{
    int A[8192] = {1,2,3,4, … 8192};
    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;

    vector_dma_to_vector ( va, A, 32772 );
    for(int i=0; i<4; i++){
        vector_set_vl ( 1024 );
        /* swap the two input buffers */
        vtemp = va; va = vd; vd = vtemp;
        vector_wait_for_dma ();
        /* start loading the next chunk */
        if(i<3)
            vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
        /* write back the result of the previous chunk */
        if(i>0){
            vector_instr_sync ();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        }
        vector ( SVW, VADD, vb, 1, va+1 );
        vector_abs ( SVW, VADD, vc, 2, va );
        vector ( VVW, VADD, vb, vb, va );
        vector ( VVW, VADD, vc, vc, vb );
    }
    vector_instr_sync ();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
    vector_wait_for_dma ();

    vector_free ();
}

Result D: (A + B) × C
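The chunked loop with two alternating input buffers can be modelled in plain C without the VENICE API. This is a sketch under stated assumptions: `load_chunk` and `memcpy` stand in for DMA transfers and run synchronously, so only the buffer rotation is shown and not the real overlap of transfer and compute; the chunk length `VL` of 4 is hypothetical (the slides use 1024); the rotate is assumed to read the next element with wrap-around, as the `va+1` operand suggests; and the last step multiplies, following D = (A + B) × C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* abs */
#include <string.h>   /* memcpy */

#define VL 4   /* chunk length in elements (hypothetical) */

/* Stand-in for a DMA load of VL+1 elements starting at `start`; the extra
 * element wraps around to a[0] so a rotate-by-one can read it. */
static void load_chunk(int *buf, const int *a, size_t n, size_t start)
{
    for (size_t j = 0; j <= VL; j++)
        buf[j] = a[(start + j) % n];
}

/* Computes d[i] = (a[i] + b[i]) * c[i], with b[i] = a[(i+1) % n] + 1 and
 * c[i] = |a[i] + 2|, one chunk at a time using two input buffers. */
void compute_d_chunked(const int *a, int *d, size_t n)
{
    int buf0[VL + 1], buf1[VL + 1], vb[VL], vc[VL];
    int *cur = buf0, *next = buf1;

    assert(n > 0 && n % VL == 0);
    load_chunk(cur, a, n, 0);                     /* prefetch the first chunk */
    for (size_t start = 0; start < n; start += VL) {
        if (start + VL < n)
            load_chunk(next, a, n, start + VL);   /* would overlap with compute */
        for (size_t j = 0; j < VL; j++) {
            vb[j] = cur[j + 1] + 1;               /* B = A(rot) + 1   */
            vc[j] = abs(cur[j] + 2);              /* C = |A + 2|      */
            vb[j] = vb[j] + cur[j];               /* A + B, in place  */
            vc[j] = vc[j] * vb[j];                /* D = (A + B) * C  */
        }
        memcpy(d + start, vc, sizeof vc);         /* stand-in for DMA to host */
        int *t = cur; cur = next; next = t;       /* swap the two buffers */
    }
}

/* Straightforward reference used to check the chunked version. */
void compute_d_reference(const int *a, int *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int b = a[(i + 1) % n] + 1;
        int c = abs(a[i] + 2);
        d[i] = (a[i] + b) * c;
    }
}
```

Comparing `compute_d_chunked` against `compute_d_reference` on a small array is a quick way to check that the wrap-around element and the buffer swap are handled consistently.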

Page 69:

[Figure: overview of the compiler flow; box labels grouped by stage]

Expression Graph

Convert to IR: Sub-divide IR, Constant folding, CSE, Move Bounds to Leaves

IR

Convert To LIR: Combine Memory transforms, Combine Operations, Evaluation Ordering, Buffer Counting, Calculate Register Size, Need Double buffering?

LIR

Code generation: Initialize Memory, Transfer Data To Scratchpad, Set VL, Write Vector Instructions, Transfer Result To Host, Allocate Memory

VENICE Code

Page 70:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 71:

370x

Speedups: Compiler vs. Human

       fir     2Dfir   life    imgblend  median  motest
V1     1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
V4     1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
V16    1.09x   1.12x   1.38x   0.90x     0.96x   1.01x
V64    1.30x   1.42x   2.24x   0.92x     0.81x   1.04x

Page 74:

Compare to Intel CPU

Benchmark Runtime (ms)

CPU                     fir     2Dfir   life    imgblend  median  motest
Xeon E5540 (2.53GHz)    0.07    0.44    0.53    0.12      9.97    0.24
VENICE (V64, 100MHz)    0.07    0.29    0.23    0.33      3.11    0.22
Speedup                 1.0x    1.5x    2.3x    0.4x      3.2x    1.1x
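The Speedup row is consistent with the Xeon runtime divided by the VENICE runtime, rounded to one decimal place. A minimal C sketch of that arithmetic follows; the function names are hypothetical and the rounding rule is an assumption inferred from the numbers.

```c
#include <assert.h>

/* Speedup of the accelerator over the baseline:
 * baseline runtime divided by accelerator runtime. */
double speedup(double baseline_ms, double accel_ms)
{
    assert(accel_ms > 0.0);
    return baseline_ms / accel_ms;
}

/* Speedup rounded to the nearest tenth, returned in tenths
 * (for example, 15 means 1.5x). */
int speedup_tenths(double baseline_ms, double accel_ms)
{
    return (int)(speedup(baseline_ms, accel_ms) * 10.0 + 0.5);
}
```

For instance, the imgblend column gives 0.12 / 0.33, which rounds to 0.4x, the one benchmark in this table where the Xeon is faster.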

Compile Time

                    fir     2Dfir   life    imgblend  median  motest  geomean
Compile time (ms)   4.74    5.05    4.49    4.44      92.72   24.27   10.12

Page 75:

Using smaller data types

Data type used by each benchmark:

fir     2Dfir     life    imgblend  median  motest
byte    halfword  byte    halfword  byte    word

Speedup using bytes

       fir     life    median  geomean
V1     3.93x   4.36x   4.07x   4.12x
V4     3.54x   3.83x   4.03x   3.79x
V16    2.90x   3.22x   4.00x   3.34x

Speedup using halfwords

       2Dfir   imgblend  geomean
V1     1.96x   1.54x     1.74x
V4     2.00x   1.46x     1.71x
V16    1.97x   1.83x     1.90x

Page 76:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 77:

Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable third-party back-end support to provide a sustainable solution for emerging hardware.

Page 78:

Thank you !

Page 79:

Optimal VL for V16

Look-up Table

Instruction   Input Data Sizes (words)
Count         8192   16384  32768  65536  131072  262144  524288  1048576
1             4096   8192   8192   8192   8192    8192    8192    8192
2             4096   8192   8192   8192   8192    8192    8192    8192
3             2048   2048   4096   4096   8192    8192    8192    8192
4 to 16       1024   2048   2048   4096   4096    8192    8192    8192
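A look-up of this kind can be sketched in C as a constant table indexed by instruction count and input size. The values are copied from the slide; the function name, the merged row for counts 4 to 16, and the choice to return -1 for inputs outside the table are assumptions, since the slide does not say how such cases are handled.

```c
#include <assert.h>

#define NUM_SIZES 8
#define NUM_ROWS  4

/* Input sizes in words, one per table column. */
static const long kInputWords[NUM_SIZES] = {
    8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576
};

/* Rows: instruction count 1, 2, 3, and 4..16 (identical on the slide). */
static const int kOptimalVL[NUM_ROWS][NUM_SIZES] = {
    {4096, 8192, 8192, 8192, 8192, 8192, 8192, 8192},
    {4096, 8192, 8192, 8192, 8192, 8192, 8192, 8192},
    {2048, 2048, 4096, 4096, 8192, 8192, 8192, 8192},
    {1024, 2048, 2048, 4096, 4096, 8192, 8192, 8192},
};

/* Returns the tabulated VL for V16, or -1 when the instruction count or
 * the input size is not covered by the table. */
int optimal_vl_v16(int instr_count, long input_words)
{
    if (instr_count < 1 || instr_count > 16)
        return -1;
    int row = instr_count < 4 ? instr_count - 1 : 3;
    assert(row >= 0 && row < NUM_ROWS);
    for (int col = 0; col < NUM_SIZES; col++)
        if (kInputWords[col] == input_words)
            return kOptimalVL[row][col];
    return -1;
}
```

A call such as `optimal_vl_v16(4, 8192)` returns 1024, the entry for a four-instruction kernel on 8192 input words.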

Page 82:

“Virtual Vector Register File”

Number of vector registers = 4

Vector register size = 1024

Page 83:

Combine Operators for Motion Estimation

              V4      V16     V64
Before (ms)   2.03    0.55    0.30
After (ms)    1.36    0.37    0.21
Speedup       1.49x   1.48x   1.43x

Page 84:

Performance Degradation on median

Human-written compare-and-swap

int *v_min = v_input1;
int *v_max = v_input2;

vector ( VVW, VOR, v_tmp, v_min, v_min );
vector ( VVW, VSUB, v_sub, v_max, v_min );
vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap

vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
vector ( VVW, VCMV_GTEZ, v_min, v_input1, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );

Page 85: Zhiduo Liu Supervisor: Guy Lemieux Sep. 28 th, 2012 Accelerator Compiler for the VENICE Vector Processor

Double Buffering

Page 86: Zhiduo Liu Supervisor: Guy Lemieux Sep. 28 th, 2012 Accelerator Compiler for the VENICE Vector Processor