Accelerator Compiler for the VENICE Vector Processor

Zhiduo Liu, Supervisor: Guy Lemieux, Sep. 28th, 2012


Page 1: Zhiduo Liu Supervisor: Guy Lemieux Sep. 28th, 2012

Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012

Accelerator Compiler for the VENICE Vector Processor

Page 2:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 3:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 4:

Motivation

Multi-core, GPU, FPGA, Many-core, CUDA, SystemVerilog, VHDL, OpenCL, Erlang, Computer clusters, OpenMP, MPI, Pthread, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, aJava, ParC, Fortress, Chapel, Vector Processor, StreamIt, Sponge, SSE

Page 5:

Motivation

Simplification

Page 6:

Motivation

Single Description

Page 7:

Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.

Page 8:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 9:

Complicated

Figure labels: ALIGN, WR RD, ALIGN, EX1, EX2, ACCUM

Page 10:

Program in VENICE assembly

#include "vector.h"

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );

    int *va = ( int *) vector_malloc ( data_len );

    vector_dma_to_vector ( va, A, data_len );
    vector_wait_for_dma ();

    vector_set_vl ( data_len / sizeof (int) );

    vector ( SVW, VADD, va, 42, va );
    vector_instr_sync ();

    vector_dma_to_host ( A, va, data_len );
    vector_wait_for_dma ();

    vector_free ();
}

• Allocate vectors in scratchpad
• Move data from main memory to scratchpad
• Wait for DMA transaction to be completed
• Setup for vector instructions
• Perform vector computations
• Wait for vector operations to be completed
• Move data from scratchpad to main memory
• Wait for DMA transaction to be completed
• Deallocate memory from scratchpad

Page 11:

Program in Accelerator

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};

    Target *tgt = CreateVectorTarget();

    IPA b = IPA( A, sizeof (A)/sizeof (int));

    IPA c = b + 42;

    tgt->ToArray( c, A, sizeof (A)/sizeof (int));

    tgt->Delete();
}

Other targets shown on the slide:

Target *tgt = CreateDX9Target();
Target *tgt = CreateMulticoreTarget();

• Create a Target
• Create Parallel Array objects
• Write expressions
• Call ToArray to evaluate expressions
• Delete Target object

Page 12:

Assembly Programming:

Flowchart boxes: Write Assembly, Compile with Gcc, Download to board, Get Result

Decision points: Doesn’t compile? Result Incorrect?

Accelerator Programming:

Flowchart boxes: Write in Accelerator, Compile with Microsoft Visual Studio, Get Result, Compile with Gcc, Download to board

Decision point: Doesn’t compile? Or result incorrect?

Page 13:

Assembly Programming:

1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual – Not always optimal or correct (wysiwyg)

Accelerator Programming:

1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations

Page 14:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 15:
Page 16:
Page 17:

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, … , 8192};
    int d[length];

    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;

    tgtVector->ToArray( D, d, length * sizeof(int));

    tgtVector->Delete();
}

Expression graph: D = (A + (Rot(A) + 1)) × Abs(A + 2)

Page 18:

Expression graph: D = (A + (Rot(A) + 1)) × Abs(A + 2)

Page 19:

Expression graph with the rotation moved onto the leaf: D = (A + (A(rot) + 1)) × Abs(A + 2)

Page 20:

Expression graph before sub-division: D = (A + (A(rot) + 1)) × Abs(A + 2)

Expression graphs after sub-division: D = (A + B) × C, B = A(rot) + 1, C = Abs(A + 2)

Page 21:
Page 22:
Page 23:

Combine Operations

Expression graphs: D = (A + B) × C, B = A(rot) + 1, C = Abs(A + 2)

Page 24:

Combine Operations

Expression graphs: D = (A + B) × C, B = A(rot) + 1, C = A |+| 2, where the + and Abs nodes of C are merged into a single |+| node
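The merge shown on these two slides can be sketched as a small tree rewrite. This is a minimal C++ sketch under stated assumptions, not the compiler's actual implementation: the `Node` type, the helpers `leaf`, `unary`, `binary`, `combine` and `countOps`, and the string operator tags are all hypothetical names chosen to illustrate folding `Abs(x + y)` into one `|+|` node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression-tree node; not the compiler's actual IR type.
struct Node {
    std::string op;                  // "leaf", "+", "*", "abs", or the fused "|+|"
    std::string name;                // label for leaves, e.g. "A" or "2"
    std::unique_ptr<Node> lhs, rhs;  // children; rhs stays null for "abs"
};

std::unique_ptr<Node> leaf(std::string name) {
    auto n = std::make_unique<Node>();
    n->op = "leaf";
    n->name = std::move(name);
    return n;
}

std::unique_ptr<Node> unary(std::string op, std::unique_ptr<Node> a) {
    assert(a);
    auto n = std::make_unique<Node>();
    n->op = std::move(op);
    n->lhs = std::move(a);
    return n;
}

std::unique_ptr<Node> binary(std::string op, std::unique_ptr<Node> a,
                             std::unique_ptr<Node> b) {
    assert(a && b);
    auto n = unary(std::move(op), std::move(a));
    n->rhs = std::move(b);
    return n;
}

// Bottom-up rewrite: abs(x + y) becomes a single fused "|+|" node.
std::unique_ptr<Node> combine(std::unique_ptr<Node> n) {
    if (n->lhs) n->lhs = combine(std::move(n->lhs));
    if (n->rhs) n->rhs = combine(std::move(n->rhs));
    if (n->op == "abs" && n->lhs && n->lhs->op == "+") {
        std::unique_ptr<Node> add = std::move(n->lhs);
        add->op = "|+|";  // reuse the add node, drop the abs wrapper
        return add;
    }
    return n;
}

// Number of operator nodes, i.e. vector instructions in this simplified model.
int countOps(const Node& n) {
    int count = (n.op == "leaf") ? 0 : 1;
    if (n.lhs) count += countOps(*n.lhs);
    if (n.rhs) count += countOps(*n.rhs);
    return count;
}
```

Calling `combine(unary("abs", binary("+", leaf("A"), leaf("2"))))` on the graph for C turns two operator nodes into one, which in this model corresponds to one vector instruction instead of two.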

Page 25:

Scratchpad Memory

“Virtual Vector Register File”

Page 26:

“Virtual Vector Register File”

Page 27:

“Virtual Vector Register File”

Number of vector registers = ?
Vector register size = ?

Page 28:

“Virtual Vector Register File”

Number of vector registers = ?
Vector register size = ?

Page 29:

Evaluation Order

Expression graphs: D = (A + B) × C, B = A(rot) + 1, C = A(rot) + 2

Node labels on the graphs: B: 1, 0, 1; C: 1, 0, 1; D: 1, 1, 2, 1, 2

Evaluation order numbers: B: 1, 2, 3; C: 1, 2, 3; D: 1, 2, 3, 4, 5

Pages 30-58:

Count number of virtual vector registers

Expression graphs: D = (A + B) × C, B = A(rot) + 1, C = A(rot) + 2

State shown on each page (a dash means the value is not shown on that page):

Page    Ref Count (A B C)   Active (A B C)   numLoads   numTemps   numTotal   maxTotal
30-31   -  -  -             -    -    -      -          -          -          -
32-33   3  -  -             -    -    -      -          -          -          -
34-35   3  1  -             -    -    -      -          -          -          -
36      3  1  1             -    -    -      -          -          -          -
37      3  1  1             Yes  No   No     -          -          -          -
38-39   3  1  1             Yes  No   No     1          -          -          -
40-42   2  1  1             Yes  No   No     1          -          -          -
43      2  1  1             Yes  No   No     1          1          -          -
44      2  1  1             Yes  No   No     1          1          2          2
45      2  1  1             Yes  Yes  No     2          1          -          -
46      2  1  1             Yes  Yes  No     2          0          -          -
47-48   1  1  1             Yes  Yes  No     2          0          -          -
49      1  1  1             Yes  Yes  No     2          1          -          -
50      1  1  1             Yes  Yes  No     2          1          3          3
51      1  1  1             Yes  Yes  Yes    3          0          -          -
52      0  1  1             No   Yes  Yes    3          0          -          -
53-54   0  0  1             No   No   Yes    3          0          -          -
55-56   0  0  0             No   No   No     3          0          -          -
57      0  0  0             No   No   No     3          0          3          3
58      0  0  0             No   No   No     0          0          0          3

Page 59:

“Virtual Vector Register File”

Number of vector registers = 3
Vector register size = ?

Page 60:

“Virtual Vector Register File”

Number of vector registers = 3
Vector register size = Capacity/3

Page 61:

Convert to LIR

Expression graphs: D = (A + B) × C, B = A(rot) + 1, C = A |+| 2

Evaluation order numbers: B: 1, 2, 3; C: 1, 2, 3; D: 1, 2, 3, 4, 5

LIR, with operands listed before their operator:

Result:B   A(rot)  1  +
Result:C   A  2  |+|
Result:D   A  B  +  C  ×
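The LIR on this slide lists each result's operands before the operator that consumes them, which is what a post-order walk of the expression graph produces. A minimal C++ sketch of that linearization follows; `Expr`, `leaf`, `node`, `linearize` and `toLir` are hypothetical names, and the real LIR presumably carries more information than a list of labels.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node: an operator with two children, or a leaf.
struct Expr {
    std::string label;               // operator symbol or leaf name
    std::shared_ptr<Expr> lhs, rhs;  // both null for a leaf
};

std::shared_ptr<Expr> leaf(const std::string& name) {
    return std::make_shared<Expr>(Expr{name, nullptr, nullptr});
}

std::shared_ptr<Expr> node(const std::string& op, std::shared_ptr<Expr> l,
                           std::shared_ptr<Expr> r) {
    assert(l && r);
    return std::make_shared<Expr>(Expr{op, std::move(l), std::move(r)});
}

// Post-order walk: emit both operands first, then the operator.
void linearize(const Expr& e, std::vector<std::string>& out) {
    if (e.lhs) linearize(*e.lhs, out);
    if (e.rhs) linearize(*e.rhs, out);
    out.push_back(e.label);
}

std::vector<std::string> toLir(const Expr& root) {
    std::vector<std::string> out;
    linearize(root, out);
    return out;
}
```

Applied to D = (A + B) × C, written here with `*` for ×, `toLir` yields the sequence A, B, +, C, *, matching the order listed for Result:D.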

Page 62:

Code Generation

LIR:

Result:B   A(rot)  1  +
Result:C   A  2  |+|
Result:D   A  B  +  C  ×

Page 63:

Code Generation

Data: 1 2 3 4 ... 8192

Page 64:

Code Generation

Data: 1 2 3 4 ... 8192 1

Pages 65-68:

Code Generation

The generated program is built up one LIR result at a time: page 65 ends at the instruction for B, page 66 adds the instruction for C, page 67 adds A + B, and page 68 shows the complete program for D:

#include "vector.h"

int main()
{
    int A[8192] = {1,2,3,4, … 8192};
    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;
    int i;

    vector_dma_to_vector ( va, A, 32772 );
    for(i=0; i<4; i++){
        vector_set_vl ( 1024 );
        vtemp = va; va = vd; vd = vtemp;
        vector_wait_for_dma ();
        if(i<3)
            vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
        if(i>0){
            vector_instr_sync ();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        }
        vector ( SVW, VADD, vb, 1, va+1 );
        vector_abs ( SVW, VADD, vc, 2, va );
        vector ( VVW, VADD, vb, vb, va );
        vector ( VVW, VADD, vc, vc, vb );
    }
    vector_instr_sync ();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
    vector_wait_for_dma ();

    vector_free ();
}

Page 69:

Compiler flow diagram, box labels:

Expression Graph, Convert to IR, IR, Convert To LIR, LIR, VENICE Code

Sub-divide IR, Constant folding, CSE, Move Bounds to Leaves

Combine Memory transforms, Combine Operations, Evaluation Ordering, Buffer Counting, Calculate Register Size, Need Double buffering?

Initialize Memory, Allocate Memory, Transfer Data To Scratchpad, Set VL, Write Vector Instructions, Transfer Result To Host

Page 70:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 71:

370x

Speedups: Compiler vs. Human

        fir     2Dfir   life    imgblend  median  motest
V1      1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
V4      1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
V16     1.09x   1.12x   1.38x   0.90x     0.96x   1.01x
V64     1.30x   1.42x   2.24x   0.92x     0.81x   1.04x

Page 72:
Page 73:
Page 74:

Compare to Intel CPU

Benchmark Runtime (ms)

CPU                     fir    2Dfir  life   imgblend  median  motest
Xeon E5540 (2.53GHz)    0.07   0.44   0.53   0.12      9.97    0.24
VENICE (V64, 100MHz)    0.07   0.29   0.23   0.33      3.11    0.22
Speedup                 1.0x   1.5x   2.3x   0.4x      3.2x    1.1x

Compile Time

                    fir    2D fir  life   imgblend  median  motest  geomean
Compile time (ms)   4.74   5.05    4.49   4.44      92.72   24.27   10.12
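The Speedup row and the geomean column can be re-derived from the other figures on this slide. The following C++ sketch shows the arithmetic; the function names `speedup` and `geomean` are hypothetical and only illustrate the two formulas (runtime ratio and geometric mean).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speedup of a candidate over a baseline, both given as runtimes in ms.
double speedup(double baselineMs, double candidateMs) {
    assert(baselineMs > 0.0 && candidateMs > 0.0);
    return baselineMs / candidateMs;
}

// Geometric mean, computed through logarithms to avoid a large product.
double geomean(const std::vector<double>& values) {
    assert(!values.empty());
    double logSum = 0.0;
    for (double v : values) {
        assert(v > 0.0);
        logSum += std::log(v);
    }
    return std::exp(logSum / static_cast<double>(values.size()));
}
```

With the table's numbers, `speedup(9.97, 3.11)` is about 3.2 for median, and `geomean` over the six compile times (4.74, 5.05, 4.49, 4.44, 92.72, 24.27) is about 10.12 ms, in line with the reported values.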

Page 75:

Using smaller data types

Benchmark   fir    2D fir    life   imgblend  median  motest
Data type   byte   halfword  byte   halfword  byte    word

Speedup using bytes

        fir     life    median  geomean
V1      3.93x   4.36x   4.07x   4.12x
V4      3.54x   3.83x   4.03x   3.79x
V16     2.90x   3.22x   4.00x   3.34x

Speedup using halfwords

        2D fir  imgblend  geomean
V1      1.96x   1.54x     1.74x
V4      2.00x   1.46x     1.71x
V16     1.97x   1.83x     1.90x

Page 76:

Outline:

Motivation

Background

Implementation

Results

Conclusion

Page 77:

Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable 3rd-party back-end support to provide a sustainable solution for future emerging hardware.

Page 78:

Thank you!

Page 79:

Optimal VL for V16

Look-up Table

Rows: instruction count. Columns: input data size (words).

Count   8192   16384  32768  65536  131072  262144  524288  1048576
1       4096   8192   8192   8192   8192    8192    8192    8192
2       4096   8192   8192   8192   8192    8192    8192    8192
3       2048   2048   4096   4096   8192    8192    8192    8192
4       1024   2048   2048   4096   4096    8192    8192    8192
5       1024   2048   2048   4096   4096    8192    8192    8192
6       1024   2048   2048   4096   4096    8192    8192    8192
7       1024   2048   2048   4096   4096    8192    8192    8192
8       1024   2048   2048   4096   4096    8192    8192    8192
9       1024   2048   2048   4096   4096    8192    8192    8192
10      1024   2048   2048   4096   4096    8192    8192    8192
11      1024   2048   2048   4096   4096    8192    8192    8192
12      1024   2048   2048   4096   4096    8192    8192    8192
13      1024   2048   2048   4096   4096    8192    8192    8192
14      1024   2048   2048   4096   4096    8192    8192    8192
15      1024   2048   2048   4096   4096    8192    8192    8192
16      1024   2048   2048   4096   4096    8192    8192    8192
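A table like this is straightforward to encode. The C++ sketch below stores the three distinct rows and picks a column by input size. The names `kInputSizes`, `kRow1to2`, `kRow3`, `kRow4to16` and `optimalVL` are hypothetical, and two behaviours are assumptions that the slide does not state: input sizes between two columns round up to the next column, and instruction counts above 16 reuse the last row.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Column headings of the table: input data size in words.
constexpr std::array<long, 8> kInputSizes = {
    8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576};

// Distinct rows of the table. Rows 1-2 are identical, as are rows 4-16.
constexpr std::array<int, 8> kRow1to2 = {4096, 8192, 8192, 8192,
                                         8192, 8192, 8192, 8192};
constexpr std::array<int, 8> kRow3 = {2048, 2048, 4096, 4096,
                                      8192, 8192, 8192, 8192};
constexpr std::array<int, 8> kRow4to16 = {1024, 2048, 2048, 4096,
                                          4096, 8192, 8192, 8192};

// Looks up the vector length for a given instruction count and input size.
int optimalVL(int instrCount, long inputWords) {
    assert(instrCount >= 1 && inputWords >= 1);
    // Assumption: sizes between columns round up to the next column, and
    // sizes beyond the last column reuse the last column.
    std::size_t col = 0;
    while (col + 1 < kInputSizes.size() && inputWords > kInputSizes[col]) ++col;
    // Assumption: instruction counts above 16 reuse the last row.
    const std::array<int, 8>& row =
        instrCount <= 2 ? kRow1to2 : (instrCount == 3 ? kRow3 : kRow4to16);
    return row[col];
}
```

For example, `optimalVL(3, 32768)` returns 4096 and `optimalVL(4, 8192)` returns 1024, matching the corresponding table cells.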

Page 80:
Page 81:
Page 82:

“Virtual Vector Register File”

Number of vector registers = 4
Vector register size = 1024

Page 83:

Combine Operators for Motion Estimation

              V4      V16     V64
Before (ms)   2.03    0.55    0.30
After (ms)    1.36    0.37    0.21
Speedup       1.49x   1.48x   1.43x

Page 84:

Performance Degradation on median

Human-written compare-and-swap

int *v_min = v_input1;
int *v_max = v_input2;

vector ( VVW, VOR, v_tmp, v_min, v_min );
vector ( VVW, VSUB, v_sub, v_max, v_min );
vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap

vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
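The difference between the two sequences is easier to see in a host-side model. The C++ sketch below emulates both on `std::vector<int>` and is only an illustration: it assumes that `VCMV_LTZ` and `VCMV_GTEZ` copy a source element into the destination where the condition vector is negative or non-negative respectively, that `VSUB` computes its first source minus its second, and that the fourth compiler-generated instruction writes `v_max`. The helper names are hypothetical and integer overflow in the subtraction is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Assumed conditional-move semantics: dest[i] = src[i] where cond[i] < 0.
void cmvLtz(Vec& dest, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dest.size(); ++i)
        if (cond[i] < 0) dest[i] = src[i];
}

// Assumed conditional-move semantics: dest[i] = src[i] where cond[i] >= 0.
void cmvGtez(Vec& dest, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dest.size(); ++i)
        if (cond[i] >= 0) dest[i] = src[i];
}

// Element-wise a - b.
Vec sub(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] - b[i];
    return out;
}

// Human-written style: works in place, 4 vector operations.
void swapInPlace(Vec& vMin, Vec& vMax) {
    Vec tmp = vMin;              // copy of v_min (the VOR with itself)
    Vec diff = sub(vMax, vMin);  // VSUB
    cmvLtz(vMin, vMax, diff);    // min takes max where max < min
    cmvLtz(vMax, tmp, diff);     // max takes the old min in the same places
}

// Compiler-generated style: separate inputs and outputs, 5 vector operations.
void swapOutOfPlace(const Vec& in1, const Vec& in2, Vec& vMin, Vec& vMax) {
    assert(in1.size() == in2.size());
    vMin.assign(in1.size(), 0);
    vMax.assign(in1.size(), 0);
    Vec diff = sub(in1, in2);    // VSUB
    cmvGtez(vMin, in2, diff);
    cmvLtz(vMin, in1, diff);
    cmvGtez(vMax, in1, diff);
    cmvLtz(vMax, in2, diff);
}
```

Both functions leave the element-wise minimum in the first output and the maximum in the second. The in-place form needs one fewer vector operation per compare-and-swap, which is consistent with the slide's point about median.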

Page 85:

Double Buffering

Page 86: