Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012

Accelerator Compiler for the VENICE Vector Processor

Outline:

Motivation

Background

Implementation

Results

Conclusion

Motivation

Multi-core

Many-core

System Verilog

OpenCL

Erlang

Computer clusters

dOpenHM

Verilog

Bluespec

OpenGL

Fortress

Chapel

Vector Processor

StreamIt

Simplification

Motivation

Single Description

Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, “VEGAS: Soft Vector Processor with Scratchpad Memory,” in FPGA 2011.

Background

Complicated

[Figure: VENICE pipeline diagram; stage labels include RD, ALIGN, EX1, EX2, ACCUM, and WR]

#include "vector.h"

int main(){
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );

    int *va = ( int *) vector_malloc ( data_len );

    vector_dma_to_vector ( va, A, data_len );
    vector_wait_for_dma ();

    vector_set_vl ( data_len / sizeof (int) );

    vector ( SVW, VADD, va, 42, va );
    vector_instr_sync ();

    vector_dma_to_host ( A, va, data_len );
    vector_wait_for_dma ();

    vector_free ();
}

Program in VENICE assembly

•Allocate vectors in scratchpad

•Move data from main memory to scratchpad

•Wait for DMA transaction to be completed

•Setup for vector instructions

•Perform vector computations

•Wait for vector operations to be completed

•Move data from scratchpad to main memory

•Wait for DMA transaction to be completed

•Deallocate memory from scratchpad

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main(){
    int A[] = {1,2,3,4,5,6,7,8};

    Target *tgt = CreateVectorTarget();
    // Other targets: Target *tgt = CreateDX9Target();
    //                Target *tgt = CreateMulticoreTarget();

    IPA b = IPA( A, sizeof (A)/sizeof (int));
    IPA c = b + 42;

    tgt->ToArray( c, A, sizeof (A)/sizeof (int));

    tgt->Delete();
}

Program in Accelerator

•Create a Target
•Create Parallel Array objects
•Write expressions
•Call ToArray to evaluate expressions
•Delete Target object
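The steps above rely on deferred evaluation: writing an expression only records it, and nothing is computed until ToArray is called. A minimal host-only sketch of that idea follows. The names `LazyInts`, `fromArray`, `addScalar`, and `toArray` are hypothetical and chosen for illustration; they are not the Accelerator API, and no target or device is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a parallel-array handle: it stores a recipe
// for producing element i, not the data itself.
struct LazyInts {
    std::size_t length;
    std::function<int(std::size_t)> elementAt;
};

// Wrap existing host data (copied, so the handle owns its input).
LazyInts fromArray(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::vector<int> copy(data, data + n);
    return {n, [copy](std::size_t i) { return copy[i]; }};
}

// Record "x + k" without computing anything yet.
LazyInts addScalar(const LazyInts& x, int k) {
    return {x.length, [x, k](std::size_t i) { return x.elementAt(i) + k; }};
}

// Evaluation happens only here, mirroring the role of ToArray.
std::vector<int> toArray(const LazyInts& x) {
    std::vector<int> out(x.length);
    for (std::size_t i = 0; i < x.length; ++i) out[i] = x.elementAt(i);
    return out;
}
```

Because the whole expression is known before anything runs, a back-end is free to analyse and reorder it first, which is what makes the compiler passes in the rest of the talk possible.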

Assembly Programming:

Write Assembly -> Compile with Gcc -> Download to board -> Get Result

On “Doesn’t compile?” or “Result incorrect?”, go back to Write Assembly.

Accelerator Programming:

Write in Accelerator -> Compile with Microsoft Visual Studio -> Compile with Gcc -> Download to board -> Get Result

On “Doesn’t compile? Or result incorrect?”, go back to Write in Accelerator.

Assembly Programming:

1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual – Not always optimal or correct (wysiwyg)

Accelerator Programming:

1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations

Implementation

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main(){
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, … , 8192};
    int d[length];

    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;

    tgtVector->ToArray( D, d, length * sizeof(int));

    tgtVector->Delete();
}
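For checking results, it can help to have a plain scalar version of what this program computes. The sketch below is one such reference, not code from the talk. It assumes that `Rotate(A, [1])` places element i+1 at position i and wraps at the end, which is consistent with the `va+1` operand in the generated code shown later but is not stated on the slide. The function name `referenceD` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Scalar reference for D = (A + B) * C with
//   B = Rotate(A, 1) + 1   and   C = Abs(A + 2).
// ASSUMPTION: Rotate(A, 1)[i] == A[(i + 1) % n].
std::vector<int> referenceD(const std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> d(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int b = a[(i + 1) % n] + 1;   // rotated input plus one
        const int c = std::abs(a[i] + 2);   // absolute value of A + 2
        d[i] = (a[i] + b) * c;
    }
    return d;
}
```

For the input {1, 2, 3, 4} this yields {12, 24, 40, 36} under the stated rotation assumption.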

[Figure: expression graph for the program above, shown in successive builds; node labels include Abs, Rot, the constant 1, and A(rot), with later builds replacing the Rot node by A(rot)]

Combine Operations

[Figure: expression graph after combining operations; node labels include A(rot) and the combined operator |+|]
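One way to picture this pass is as a tree rewrite that replaces an absolute value applied to an addition with a single combined node, in the spirit of the |+| label. The sketch below is an illustration only: the `Node` type, the `Op` enum, and the functions `make`, `combine`, and `countOps` are hypothetical and do not come from the compiler's source.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

enum class Op { Input, Const, Add, Mul, Abs, AbsAdd };

struct Node {
    Op op;
    std::vector<std::shared_ptr<Node>> args;
};

using NodePtr = std::shared_ptr<Node>;

NodePtr make(Op op, std::vector<NodePtr> args = {}) {
    return std::make_shared<Node>(Node{op, std::move(args)});
}

// Bottom-up rewrite: Abs(Add(x, y)) becomes the single node AbsAdd(x, y).
// Shared sub-expressions are not treated specially in this sketch.
NodePtr combine(const NodePtr& n) {
    assert(n);
    std::vector<NodePtr> args;
    for (const NodePtr& a : n->args) args.push_back(combine(a));
    if (n->op == Op::Abs && args.size() == 1 && args[0]->op == Op::Add) {
        return make(Op::AbsAdd, args[0]->args);
    }
    return make(n->op, std::move(args));
}

// Number of operator nodes (everything except inputs and constants).
int countOps(const NodePtr& n) {
    int total = (n->op == Op::Input || n->op == Op::Const) ? 0 : 1;
    for (const NodePtr& a : n->args) total += countOps(a);
    return total;
}
```

Applied to `Abs(A + 2)`, the rewrite turns two operator nodes into one, which corresponds to issuing one vector instruction instead of two.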

Scratchpad Memory: “Virtual Vector Register File”

Number of vector registers = ?
Vector register size = ?

Evaluation Order

[Figure: expression graph with its nodes numbered in evaluation order; node labels include A(rot)]

Count number of virtual vector registers

[Animation: the expression graph is walked in evaluation order; each node carries a Ref Count and an Active marker, and node labels include A(rot). Counter values shown in successive frames:]

numLoads = 1
numLoads = 1, numTemps = 1, numTotal = 2, maxTotal = 2
numLoads = 2, numTemps = 1
numLoads = 2, numTemps = 0
numLoads = 2, numTemps = 1, numTotal = 3, maxTotal = 3
numLoads = 3, numTemps = 0, numTotal = 3, maxTotal = 3
numLoads = 0, numTemps = 0, numTotal = 0, maxTotal = 3

Number of vector registers = 3
Vector register size = Capacity/3
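The counting step can be approximated by a reference-count walk over the values in evaluation order. The sketch below is a simplified liveness count, not the numLoads/numTemps bookkeeping shown in the animation. It assumes that a result may reuse the register of an operand whose last use is the current step, as the in-place instructions in the generated code suggest. `Step`, `countRegisters`, and `registerSize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per value, listed in evaluation order. `operands` holds the
// indices of earlier entries that this value reads (empty for a load).
struct Step {
    std::vector<std::size_t> operands;
};

// Maximum number of values that are live at the same time.
std::size_t countRegisters(const std::vector<Step>& steps) {
    std::vector<std::size_t> refCount(steps.size(), 0);
    for (const Step& s : steps)
        for (std::size_t o : s.operands) ++refCount[o];

    std::size_t live = 0, maxLive = 0;
    for (const Step& s : steps) {
        for (std::size_t o : s.operands) {
            assert(refCount[o] > 0);
            if (--refCount[o] == 0) --live;  // last use: register is free
        }
        ++live;                              // register for this result
        maxLive = std::max(maxLive, live);
    }
    return maxLive;
}

// Split the scratchpad evenly between the virtual vector registers.
std::size_t registerSize(std::size_t capacity, std::size_t numRegisters) {
    assert(numRegisters > 0);
    return capacity / numRegisters;
}
```

Modelling the example as five values (A, B, C, A + B, and D) gives a peak of three live values under these assumptions, which agrees with the count of 3 above.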

Convert to LIR

[Figure: LIR graph with results B, C, and D; node labels include A(rot) and |+|]

Code Generation

[Animation: code is emitted in turn for Result:B, Result:C, and Result:D; the input array 1 2 3 4 ... 8192 is shown extended by one element to 1 2 3 4 ... 8192 1, next to the A(rot) node]

#include "vector.h"

int main(){
    int A[8192] = {1,2,3,4, … 8192};
    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;
    int i;  /* declared here because it is also used after the loop */

    /* Start loading the first block before entering the loop. */
    vector_dma_to_vector ( va, A, 32772 );
    for(i=0; i<4; i++){
        vector_set_vl ( 1024 );
        /* Swap the two input buffers (double buffering). */
        vtemp = va; va = vd; vd = vtemp;
        vector_wait_for_dma ();
        /* Prefetch the next block while this one is processed. */
        if(i<3) vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
        /* Write back the previous block's result. */
        if(i>0){
            vector_instr_sync ();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        }
        vector ( SVW, VADD, vb, 1, va+1 );
        vector_abs ( SVW, VADD, vc, 2, va );
        vector ( VVW, VADD, vb, vb, va );
        vector ( VVW, VADD, vc, vc, vb );
    }
    /* Write back the last block's result. */
    vector_instr_sync ();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
    vector_wait_for_dma ();

    vector_free ();
}
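The buffer swap in the generated code is a double-buffering pattern: one buffer is processed while the other receives the next block. The sketch below shows that schedule on the host for a simple add-scalar operation. It is a sequential stand-in under stated assumptions: `loadChunk` replaces the asynchronous DMA call with an ordinary copy, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a DMA transfer into the scratchpad; here it is a plain,
// synchronous copy. On real hardware this would run in the background.
static void loadChunk(const std::vector<int>& src, std::size_t begin,
                      std::size_t count, std::vector<int>& buffer) {
    std::copy(src.begin() + begin, src.begin() + begin + count,
              buffer.begin());
}

// Adds `k` to every element, one chunk at a time, alternating between two
// buffers so that chunk i+1 can be fetched while chunk i is processed.
std::vector<int> addScalarDoubleBuffered(const std::vector<int>& input,
                                         std::size_t chunk, int k) {
    assert(chunk > 0 && input.size() % chunk == 0);
    const std::size_t numChunks = input.size() / chunk;
    std::vector<int> output(input.size());
    std::vector<int> buffers[2] = {std::vector<int>(chunk),
                                   std::vector<int>(chunk)};

    if (numChunks > 0) loadChunk(input, 0, chunk, buffers[0]);
    for (std::size_t i = 0; i < numChunks; ++i) {
        std::vector<int>& current = buffers[i % 2];
        std::vector<int>& next = buffers[(i + 1) % 2];
        // Prefetch the following chunk into the idle buffer.
        if (i + 1 < numChunks) loadChunk(input, (i + 1) * chunk, chunk, next);
        // Compute in place on the current buffer, then write it back.
        for (int& v : current) v += k;
        std::copy(current.begin(), current.end(),
                  output.begin() + i * chunk);
    }
    return output;
}
```

With real DMA the prefetch overlaps the computation, so the second buffer costs scratchpad space but can hide transfer time.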

[Figure: compiler flow, in three stages]

Expression Graph to IR: Convert to IR, Sub-divide IR, Constant folding, CSE, Move Bounds to Leaves

IR to LIR: Convert To LIR, Combine Memory transforms, Combine Operations, Evaluation Ordering, Buffer Counting, Calculate Register Size, Need Double buffering?

LIR to VENICE Code: Allocate Memory, Initialize Memory, Transfer Data To Scratchpad, Set VL, Write Vector Instructions, Transfer Result To Host

Results

Speedups: Compiler vs. Human

       fir    2Dfir  life   imgblend  median  motest
V1     1.04x  0.97x  1.01x  1.00x     0.99x   0.81x
V4     1.01x  1.12x  1.10x  1.02x     1.07x   1.01x
V16    1.09x  1.12x  1.38x  0.90x     0.96x   1.01x
V64    1.30x  1.42x  2.24x  0.92x     0.81x   1.04x

Compare to Intel CPU: Benchmark Runtime (ms)

CPU                    fir    2Dfir  life   imgblend  median  motest
Xeon E5540 (2.53GHz)   0.07   0.44   0.53   0.12      9.97    0.24
VENICE (V64, 100MHz)   0.07   0.29   0.23   0.33      3.11    0.22
Speedup                1.0x   1.5x   2.3x   0.4x      3.2x    1.1x
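The speedup row is the ratio of the CPU runtime to the VENICE runtime, rounded to one decimal place. A small sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Speedup of VENICE over the CPU: ratio of runtimes in the same units.
double speedup(double cpuMs, double veniceMs) {
    assert(veniceMs > 0.0);
    return cpuMs / veniceMs;
}

// Round to one decimal place, as in the table.
double roundToTenth(double x) { return std::round(x * 10.0) / 10.0; }
```

For median, 9.97 ms / 3.11 ms rounds to 3.2; for imgblend, 0.12 ms / 0.33 ms rounds to 0.4, the one benchmark where the CPU is faster.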

Compile Time

                    fir    2D fir  life   imgblend  median  motest  geomean
Compile time (ms)   4.74   5.05    4.49   4.44      92.72   24.27   10.12
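The geomean column is the geometric mean of the six per-benchmark times. A sketch of the calculation, computed through logarithms to avoid overflow on long lists (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of strictly positive values: exp of the mean of the logs.
double geomean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double logSum = 0.0;
    for (double x : xs) {
        assert(x > 0.0);
        logSum += std::log(x);
    }
    return std::exp(logSum / static_cast<double>(xs.size()));
}
```

Applied to the six compile times above, this gives approximately 10.12 ms, far below the arithmetic mean, because the geometric mean damps the two outliers (median and motest).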

Using smaller data types

Benchmark   fir    2D fir    life   imgblend  median  motest
Data type   byte   halfword  byte   halfword  byte    word

Speedup using bytes

       fir    life   median  geomean
V1     3.93x  4.36x  4.07x   4.12x
V4     3.54x  3.83x  4.03x   3.79x
V16    2.90x  3.22x  4.00x   3.34x

Speedup using halfwords

       2D fir  imgblend  geomean
V1     1.96x   1.54x     1.74x
V4     2.00x   1.46x     1.71x
V16    1.97x   1.83x     1.90x

Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages such as Microsoft Accelerator, with pluggable third-party back-end support, to provide a sustainable solution for emerging hardware.

Thank you!

Optimal VL for V16

Rows: instruction count. Columns: input data size (words).

Count   8192   16384  32768  65536  131072  262144  524288  1048576
1       4096   8192   8192   8192   8192    8192    8192    8192
2       4096   8192   8192   8192   8192    8192    8192    8192
3       2048   2048   4096   4096   8192    8192    8192    8192
4       1024   2048   2048   4096   4096    8192    8192    8192
5       1024   2048   2048   4096   4096    8192    8192    8192
6       1024   2048   2048   4096   4096    8192    8192    8192
7       1024   2048   2048   4096   4096    8192    8192    8192
8       1024   2048   2048   4096   4096    8192    8192    8192
9       1024   2048   2048   4096   4096    8192    8192    8192
10      1024   2048   2048   4096   4096    8192    8192    8192
11      1024   2048   2048   4096   4096    8192    8192    8192
12      1024   2048   2048   4096   4096    8192    8192    8192
13      1024   2048   2048   4096   4096    8192    8192    8192
14      1024   2048   2048   4096   4096    8192    8192    8192
15      1024   2048   2048   4096   4096    8192    8192    8192
16      1024   2048   2048   4096   4096    8192    8192    8192

Look-up Table
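A table like this can be stored compactly, since instruction counts 1 and 2 share one row and counts 4 to 16 share another. The sketch below transcribes the values from the table; the names `kOptimalVL` and `optimalVL`, and the choice to return -1 for inputs outside the table, are assumptions made for illustration rather than the compiler's actual interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Optimal VL for V16, transcribed from the table above. Columns are input
// sizes 8192, 16384, ..., 1048576 words; rows are grouped because
// instruction counts 1-2 and 4-16 have identical entries.
constexpr std::array<std::array<int, 8>, 3> kOptimalVL = {{
    {{4096, 8192, 8192, 8192, 8192, 8192, 8192, 8192}},  // 1-2 instructions
    {{2048, 2048, 4096, 4096, 8192, 8192, 8192, 8192}},  // 3 instructions
    {{1024, 2048, 2048, 4096, 4096, 8192, 8192, 8192}},  // 4-16 instructions
}};

// Returns the tabulated VL, or -1 when the inputs fall outside the table.
int optimalVL(int instructionCount, long dataWords) {
    if (instructionCount < 1 || instructionCount > 16) return -1;
    std::size_t col = 0;
    long size = 8192;
    while (col < 8 && size != dataWords) {
        size *= 2;
        ++col;
    }
    if (col == 8) return -1;
    const std::size_t row =
        instructionCount <= 2 ? 0 : (instructionCount == 3 ? 1 : 2);
    return kOptimalVL[row][col];
}
```

How sizes between the tabulated columns should be handled is not shown on the slide, so this sketch rejects them instead of guessing.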

Number of vector registers = 4
Vector register size = 1024

Combine Operators for Motion Estimation

              V4     V16    V64
Before (ms)   2.03   0.55   0.30
After (ms)    1.36   0.37   0.21
Speedup       1.49x  1.48x  1.43x

Performance Degradation on median

Human-written compare-and-swap:

    int *v_min = v_input1;
    int *v_max = v_input2;

    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap:

    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
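A scalar model makes it easier to see that both styles compute an element-wise minimum and maximum, and why the out-of-place form needs more conditional moves. The sketch below is such a model, not VENICE code. It assumes that VSUB computes first source minus second source, and that VCMV_LTZ and VCMV_GTEZ copy the source into the destination only where the condition element is negative or non-negative respectively. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<int>;

// ASSUMPTION: element-wise a - b.
Vec vsub(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

// ASSUMPTION: copy src into dst only where cond < 0.
void vcmvLtz(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] < 0) dst[i] = src[i];
}

// ASSUMPTION: copy src into dst only where cond >= 0.
void vcmvGtez(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] >= 0) dst[i] = src[i];
}

// In-place style: the inputs double as outputs, so only elements that are
// out of order need to move (one copy, one subtract, two moves).
void swapInPlace(Vec& vMin, Vec& vMax) {
    const Vec tmp = vMin;              // saved copy of the first input
    const Vec sub = vsub(vMax, vMin);  // negative where max < min
    vcmvLtz(vMin, vMax, sub);          // min takes the smaller value
    vcmvLtz(vMax, tmp, sub);           // max takes the old min
}

// Out-of-place style: separate outputs, so both cases of the comparison
// must be written for both outputs (one subtract, four moves).
std::pair<Vec, Vec> swapOutOfPlace(const Vec& in1, const Vec& in2) {
    Vec vMin(in1.size()), vMax(in1.size());
    const Vec sub = vsub(in1, in2);
    vcmvGtez(vMin, in2, sub);
    vcmvLtz(vMin, in1, sub);
    vcmvGtez(vMax, in1, sub);
    vcmvLtz(vMax, in2, sub);
    return {vMin, vMax};
}
```

The extra instruction per compare-and-swap in the out-of-place form is one plausible contributor to the slowdown on median, where this operation dominates.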

Double Buffering
