Zhiduo Liu
Supervisor: Guy Lemieux
Sep. 28th, 2012
Accelerator Compiler for the VENICE Vector Processor
Outline:
Motivation
Background
Implementation
Results
Conclusion
Motivation
Multi-core, GPU, FPGA, Many-core, Computer clusters, Vector Processor, …
CUDA, System Verilog, VHDL, OpenCL, Erlang, OpenMP, MPI, Pthread, OpenHMPP, Verilog, Bluespec, Cilk, X10, OpenGL, Sh, aJava, ParC, Fortress, Chapel, StreamIt, Sponge, SSE
Simplification
Motivation
…
Single Description
Contributions
The compiler serves as a new back-end of a single-description multiple-device language.
The compiler makes VENICE easier to program and debug.
The compiler provides auto-parallelization and optimization.
[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.
Background
Complicated
[Figure: VENICE pipeline diagram with stages labeled ALIGN, WR, RD, EX1, EX2, and ACCUM]
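#include "vector.h"

int main() {
    int A[] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int data_len = sizeof(A);              // size of A in bytes

    int *va = (int *) vector_malloc(data_len);   // allocate a vector in the scratchpad
    vector_dma_to_vector(va, A, data_len);       // copy A from main memory to the scratchpad
    vector_wait_for_dma();
    vector_set_vl(data_len / sizeof(int));       // vector length in elements
    vector(SVW, VADD, va, 42, va);               // add the scalar 42 to every element
    vector_instr_sync();                         // wait for the vector operation
    vector_dma_to_host(A, va, data_len);         // copy the result back to main memory
    vector_wait_for_dma();
    vector_free();
}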
Program in VENICE assembly
• Allocate vectors in scratchpad
• Move data from main memory to scratchpad
• Wait for DMA transaction to be completed
• Setup for vector instructions
• Perform vector computations
• Wait for vector operations to be completed
• Move data from scratchpad to main memory
• Wait for DMA transaction to be completed
• Deallocate memory from scratchpad
#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main() {
    int A[] = {1, 2, 3, 4, 5, 6, 7, 8};
    Target *tgt = CreateVectorTarget();
    IPA b = IPA(A, sizeof(A) / sizeof(int));
    IPA c = b + 42;
    tgt->ToArray(c, A, sizeof(A) / sizeof(int));
    tgt->Delete();
}
Other targets: Target *tgt = CreateDX9Target(); or Target *tgt = CreateMulticoreTarget();
Program in Accelerator
• Create a Target
• Create Parallel Array objects
• Write expressions
• Call ToArray to evaluate expressions
• Delete Target object
Assembly Programming:
Write Assembly → Compile with Gcc → Download to board → Get Result
Doesn’t compile? Result incorrect?
Accelerator Programming:
Write in Accelerator → Compile with Microsoft Visual Studio → Compile with Gcc → Download to board → Get Result
Doesn’t compile? Or result incorrect?
Assembly Programming:
1. Hard to program
2. Long debug cycle
3. Not portable
4. Manual – Not always optimal or correct (WYSIWYG)
Accelerator Programming:
1. Easy to program
2. Easy to debug
3. Can also target other devices
4. Automated compiler optimizations
Implementation
#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main() {
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1, 2, 3, 4, … , 8192};
    int d[length];

    IPA A = IPA(a, length);
    IPA B = Evaluate(Rotate(A, [1]) + 1);
    IPA C = Evaluate(Abs(A + 2));
    IPA D = (A + B) * C;

    tgtVector->ToArray(D, d, length * sizeof(int));
    tgtVector->Delete();
}
[Figure: expression graph for D, transformed step by step]
Initial graph: D = (A + (Rot(A) + 1)) × Abs(A + 2)
Rotate absorbed into the access to A: D = (A + (A(rot) + 1)) × Abs(A + 2)
Split into sub-graphs: B = A(rot) + 1, C = Abs(A + 2), D = (A + B) × C
Combine Operations
Before: C = Abs(A + 2)
After: C = A |+| 2
B = A(rot) + 1 and D = (A + B) × C are unchanged.
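The operation-combining step above can be sketched on a toy expression tree. This is only an illustration under stated assumptions: the `Node` type, the helper functions, and the string opcodes are hypothetical and are not the compiler's actual IR. The sketch rewrites every `abs(x + y)` pattern into a single fused `|+|` node, as the slide does for C.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression-tree node; not the compiler's real IR type.
struct Node {
    std::string op;                  // "leaf", "+", "abs", or the fused "|+|"
    std::string name;                // leaf label, e.g. "A" or "2"
    std::unique_ptr<Node> lhs, rhs;  // rhs stays null for unary operations
};

std::unique_ptr<Node> leaf(std::string name) {
    auto n = std::make_unique<Node>();
    n->op = "leaf";
    n->name = std::move(name);
    return n;
}

std::unique_ptr<Node> unary(std::string op, std::unique_ptr<Node> x) {
    auto n = std::make_unique<Node>();
    n->op = std::move(op);
    n->lhs = std::move(x);
    return n;
}

std::unique_ptr<Node> binary(std::string op, std::unique_ptr<Node> l,
                             std::unique_ptr<Node> r) {
    auto n = unary(std::move(op), std::move(l));
    n->rhs = std::move(r);
    return n;
}

// Bottom-up rewrite: abs(x + y) becomes one fused node x |+| y.
std::unique_ptr<Node> combineOperations(std::unique_ptr<Node> n) {
    if (!n) return n;
    n->lhs = combineOperations(std::move(n->lhs));
    n->rhs = combineOperations(std::move(n->rhs));
    if (n->op == "abs" && n->lhs && n->lhs->op == "+") {
        std::unique_ptr<Node> add = std::move(n->lhs);
        add->op = "|+|";  // the add node now also takes the absolute value
        return add;       // the separate abs node is dropped
    }
    return n;
}

// Print a tree in infix form for inspection.
std::string toString(const Node& n) {
    if (n.op == "leaf") return n.name;
    assert(n.lhs && "non-leaf nodes need at least one operand");
    if (!n.rhs) return n.op + "(" + toString(*n.lhs) + ")";
    return "(" + toString(*n.lhs) + " " + n.op + " " + toString(*n.rhs) + ")";
}
```

For the graph of C, building `unary("abs", binary("+", leaf("A"), leaf("2")))` and passing it through `combineOperations` leaves a single `|+|` node with operands A and 2.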
Scratchpad Memory as a “Virtual Vector Register File”
Number of vector registers = ?
Vector register size = ?
Evaluation Order
[Figure: the expression graphs for B, C, and D, annotated with node labels and evaluation-order numbers]
Count number of virtual vector registers
[Figure: the graphs for B = A(rot) + 1, C = A |+| 2, and D = (A + B) × C are walked step by step while the counters below are updated; "-" marks a value not shown on that slide]
Step          Ref Count (A, B, C)   Active (A, B, C)   numLoads   numTemps   numTotal   maxTotal
Start         3, 1, 1               Yes, No, No        1          -          -          -
Computing B   2, 1, 1               Yes, No, No        1          1          2          2
B stored      2, 1, 1               Yes, Yes, No       2          0          -          -
Computing C   1, 1, 1               Yes, Yes, No       2          1          3          3
C stored      1, 1, 1               Yes, Yes, Yes      3          0          -          -
Computing D   0, 0, 0               No, No, No         3          0          3          3
Done          0, 0, 0               No, No, No         0          0          0          3
“Virtual Vector Register File”
Number of vector registers = 3
Vector register size = Capacity/3
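The counting walk can be approximated with reference counts over a list of operations in evaluation order. The sketch below is a simplified model, not the compiler's algorithm: the `Op` struct and both function names are hypothetical, and the rule that a result reuses the buffer of an input whose reference count has just reached zero is an assumption chosen because it reproduces the slide's count of 3. The scratchpad capacity passed to `registerSizeWords` is likewise a made-up number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one operation: a result name and its inputs.
struct Op {
    std::string result;
    std::vector<std::string> inputs;
};

// Return the peak number of buffers that are live at the same time.
int countVirtualRegisters(const std::vector<std::string>& loaded,
                          const std::vector<Op>& ops) {
    // Reference count = number of times a value is read by later operations.
    std::map<std::string, int> refCount;
    for (const Op& op : ops)
        for (const std::string& in : op.inputs) ++refCount[in];

    // Arrays loaded from main memory are active from the start.
    std::set<std::string> active;
    for (const std::string& name : loaded)
        if (refCount[name] > 0) active.insert(name);

    std::size_t maxTotal = active.size();
    for (const Op& op : ops) {
        std::vector<std::string> freed;
        for (const std::string& in : op.inputs) {
            assert(active.count(in) == 1 && "input must be live when used");
            if (--refCount[in] == 0) freed.push_back(in);
        }
        // Assumption: if some input dies here, the result reuses its buffer;
        // otherwise one temporary buffer is needed for the result.
        const std::size_t numTemps = freed.empty() ? 1 : 0;
        maxTotal = std::max(maxTotal, active.size() + numTemps);

        for (const std::string& name : freed) active.erase(name);
        if (refCount[op.result] > 0) active.insert(op.result);
    }
    return static_cast<int>(maxTotal);
}

// Split the scratchpad evenly between the virtual vector registers.
std::size_t registerSizeWords(std::size_t capacityWords, int numRegisters) {
    assert(numRegisters > 0);
    return capacityWords / static_cast<std::size_t>(numRegisters);
}
```

With A as the only loaded array and the operations B from {A}, C from {A}, and D from {A, B, C}, the function returns 3, matching the slide. A hypothetical 12288-word scratchpad would then give registers of 4096 words each.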
Convert to LIR
Result:B  A(rot), 1, +
Result:C  A, 2, |+|
Result:D  A, B, +, C, ×
Code Generation
#include "vector.h"

int main() {
    int A[8192] = {1, 2, 3, 4, … 8192};
    int *va = (int *) vector_malloc(32772);
    int *vb = (int *) vector_malloc(32768);
    int *vc = (int *) vector_malloc(32768);
    int *vd = (int *) vector_malloc(32772);
    int *vtemp = va;

    vector_dma_to_vector(va, A, 32772);
    for (int i = 0; i < 4; i++) {
        vector_set_vl(1024);
        vtemp = va; va = vd; vd = vtemp;                          // swap the two input buffers
        vector_wait_for_dma();
        if (i < 3)
            vector_dma_to_vector(va, A + (i + 1) * 1024, 32772);  // start loading the next chunk
        if (i > 0) {
            vector_instr_sync();
            vector_dma_to_host(A + (i - 1) * 1024, vc, 32768);    // write back the previous result
        }
        vector(SVW, VADD, vb, 1, va + 1);                         // Result:B = A(rot) + 1
        vector_abs(SVW, VADD, vc, 2, va);                         // Result:C = A |+| 2
        vector(VVW, VADD, vb, vb, va);                            // A + B
        vector(VVW, VADD, vc, vc, vb);                            // combine with C; result left in vc
    }
    vector_instr_sync();
    vector_dma_to_host(A + 3 * 1024, vc, 32768);                  // last chunk (i is 4 after the loop)
    vector_wait_for_dma();
    vector_free();
}
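The buffer-swapping pattern in the generated code can be tried on a host machine with a small simulation. This is a sketch under assumptions, not VENICE code: `computeDoubleBuffered` is a hypothetical name, plain copies stand in for DMA transfers (so nothing actually overlaps), the final operation is taken to be the multiplication from the LIR, and Rotate(A, [1]) is assumed to read element i + 1 with wrap-around.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Compute D = (A + B) * C with B = Rotate(A, 1) + 1 and C = |A + 2|,
// one chunk at a time, using two input buffers that alternate roles.
std::vector<int> computeDoubleBuffered(const std::vector<int>& a,
                                       std::size_t chunk) {
    const std::size_t n = a.size();
    assert(n > 0 && chunk > 0 && n % chunk == 0);

    std::vector<int> d(n);
    // Each buffer holds one extra element so the rotated access can read
    // the first element of the following chunk.
    std::vector<int> current(chunk + 1), next(chunk + 1);

    // Stand-in for a DMA transfer from main memory into a scratchpad buffer.
    auto load = [&](std::vector<int>& buf, std::size_t start) {
        for (std::size_t j = 0; j <= chunk; ++j) buf[j] = a[(start + j) % n];
    };

    load(next, 0);  // prefetch the first chunk
    for (std::size_t start = 0; start < n; start += chunk) {
        std::swap(current, next);  // the prefetched chunk becomes current
        if (start + chunk < n) {
            // On real hardware this transfer would run while the vector
            // instructions below work on the current buffer.
            load(next, start + chunk);
        }
        for (std::size_t j = 0; j < chunk; ++j) {
            const int b = current[j + 1] + 1;          // B = A(rot) + 1
            const int c = std::abs(current[j] + 2);    // C = A |+| 2
            d[start + j] = (current[j] + b) * c;       // D = (A + B) * C
        }
    }
    return d;
}
```

The result does not depend on the chunk size; only the amount of buffer space changes, which is why the register size can be chosen separately from the data size.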
[Figure: compiler flow. Expression Graph → IR → LIR → VENICE Code]
Steps shown in the diagram: Convert to IR; Constant folding; CSE; Move Bounds to Leaves; Sub-divide IR; Combine Memory transforms; Combine Operations; Evaluation Ordering; Buffer Counting; Need Double buffering?; Calculate Register Size; Convert To LIR; Initialize Memory; Allocate Memory; Transfer Data To Scratchpad; Set VL; Write Vector Instructions; Transfer Result To Host
Results
[Figure: chart annotated with 370x]
Speedups: Compiler vs. Human
       fir     2Dfir   life    imgblend  median  motest
V1     1.04x   0.97x   1.01x   1.00x     0.99x   0.81x
V4     1.01x   1.12x   1.10x   1.02x     1.07x   1.01x
V16    1.09x   1.12x   1.38x   0.90x     0.96x   1.01x
V64    1.30x   1.42x   2.24x   0.92x     0.81x   1.04x
Compare to Intel CPU
Benchmark Runtime (ms)
CPU                     fir     2Dfir   life    imgblend  median  motest
Xeon E5540 (2.53GHz)    0.07    0.44    0.53    0.12      9.97    0.24
VENICE (V64, 100MHz)    0.07    0.29    0.23    0.33      3.11    0.22
Speedup                 1.0x    1.5x    2.3x    0.4x      3.2x    1.1x
Compile Time
                    fir     2Dfir   life    imgblend  median  motest  geomean
Compile time (ms)   4.74    5.05    4.49    4.44      92.72   24.27   10.12
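The geomean column can be checked with a few lines of code. The helper below is a generic geometric-mean function written for this check (the name `geomean` is ours, not something from the compiler); it works in log space to avoid overflow on long inputs.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive values: exp of the average of the logarithms.
double geomean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double logSum = 0.0;
    for (double x : xs) {
        assert(x > 0.0 && "geometric mean needs positive inputs");
        logSum += std::log(x);
    }
    return std::exp(logSum / static_cast<double>(xs.size()));
}
```

Calling `geomean({4.74, 5.05, 4.49, 4.44, 92.72, 24.27})` gives a value that rounds to the 10.12 ms reported above. The geometric mean damps the effect of the median outlier (92.72 ms) far more than an arithmetic mean would.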
Using smaller data types
Data type per benchmark: fir byte, 2Dfir halfword, life byte, imgblend halfword, median byte, motest word
Speedup using bytes
       fir     life    median  geomean
V1     3.93x   4.36x   4.07x   4.12x
V4     3.54x   3.83x   4.03x   3.79x
V16    2.90x   3.22x   4.00x   3.34x
Speedup using halfwords
       2Dfir   imgblend  geomean
V1     1.96x   1.54x     1.74x
V4     2.00x   1.46x     1.71x
V16    1.97x   1.83x     1.90x
Conclusions:
The compiler greatly improves the programming and debugging experience for VENICE.
The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.
The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable third-party back-end support to provide a sustainable solution for emerging hardware.
Thank you!
Optimal VL for V16 (Look-up Table)
Rows: Instruction Count. Columns: Input Data Sizes (words).
Instruction Count   8192   16384   32768   65536   131072   262144   524288   1048576
1                   4096   8192    8192    8192    8192     8192     8192     8192
2                   4096   8192    8192    8192    8192     8192     8192     8192
3                   2048   2048    4096    4096    8192     8192     8192     8192
4 to 16             1024   2048    2048    4096    4096     8192     8192     8192
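A table like this maps directly onto a small lookup function. The sketch below hard-codes the V16 values from the slide; the function name `optimalVL` and the choice to return -1 for inputs outside the table are assumptions made for illustration, since the slides do not say how the compiler handles other sizes.

```cpp
#include <cassert>
#include <cstddef>

// Look up the vector length for a V16 configuration.
// Returns -1 when the instruction count or data size is not in the table.
int optimalVL(int instructionCount, long inputWords) {
    assert(inputWords > 0);
    static const long sizes[8] = {8192,   16384,  32768,  65536,
                                  131072, 262144, 524288, 1048576};
    static const int table[3][8] = {
        {4096, 8192, 8192, 8192, 8192, 8192, 8192, 8192},  // 1 or 2 instructions
        {2048, 2048, 4096, 4096, 8192, 8192, 8192, 8192},  // 3 instructions
        {1024, 2048, 2048, 4096, 4096, 8192, 8192, 8192},  // 4 to 16 instructions
    };
    if (instructionCount < 1 || instructionCount > 16) return -1;
    const int row = instructionCount <= 2 ? 0 : (instructionCount == 3 ? 1 : 2);
    for (std::size_t col = 0; col < 8; ++col) {
        if (sizes[col] == inputWords) return table[row][col];
    }
    return -1;
}
```

For example, `optimalVL(4, 8192)` returns 1024, the entry for four instructions on 8192 words of input.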
“Virtual Vector Register File”
Number of vector registers = 4
Vector register size = 1024
Combine Operators for Motion Estimation
              V4      V16     V64
Before (ms)   2.03    0.55    0.30
After (ms)    1.36    0.37    0.21
Speedup       1.49x   1.48x   1.43x
Performance Degradation on median
Human-written compare-and-swap
int *v_min = v_input1;
int *v_max = v_input2;
vector(VVW, VOR, v_tmp, v_min, v_min);
vector(VVW, VSUB, v_sub, v_max, v_min);
vector(VVW, VCMV_LTZ, v_min, v_max, v_sub);
vector(VVW, VCMV_LTZ, v_max, v_tmp, v_sub);
Compiler-generated compare-and-swap
vector(VVW, VSUB, v_sub, v_input1, v_input2);
vector(VVW, VCMV_GTEZ, v_min, v_input2, v_sub);
vector(VVW, VCMV_LTZ, v_min, v_input1, v_sub);
vector(VVW, VCMV_GTEZ, v_min, v_input1, v_sub);
vector(VVW, VCMV_LTZ, v_max, v_input2, v_sub);
Double Buffering