TRANSCRIPT
[Slide 1]
RISC-V Vector Performance Analysis

Guy Lemieux, CEO at VectorBlox Computing
Prof. at University of British Columbia
Collaborators:
Fredy Augusto Alves
José Augusto Nacif
Science and Technology Institute
Universidade Federal de Viçosa, Brazil
[Slide 2]
Motivation
RISC-V Vectors (RVV) is a highly anticipated extension
Very early performance analysis
Vector spec is incomplete
Give feedback on vector spec
Compare to:
Fixed-width SIMD
Alternative vector architecture (MXP)
Find cause of performance differences
[Slide 3]
[Slide 4]
Example: SIMD vs Vector
[Slide 5]
Example: SIMD vs Vector
MIPS (MSA) Intel (AVX2) RISC-V (RVV)
[Slide 6]
Example: SIMD vs Vector
[Slide 7]
Example: SIMD vs Vector
[Slide 8]
This talk...
Let’s look at real code
Running on Xilinx Zynq-7020 FPGA (ZedBoard):
1. ARM Cortex-A9 with NEON (667MHz, 128b datapath)
2. ARM Cortex-A9 with RVV (100MHz, 512b datapath)
3. ARM Cortex-A9 with MXP (100MHz, 512b datapath)
Note1: NEON has 1.66x “ops per second” advantage (667MHz/100MHz) * (128b / 512b)
Note2: NEON has 8x more memory bandwidth (6400MB/s vs 800MB/s)
Note3: RVV and MXP have 256x more vector data storage (64kB vs NEON’s 256B)
[Slide 9]
ARM NEON
16 named registers, 128b wide
128b datapath
4 x 32b operations / cycle
8 x 16b operations / cycle
16 x 8b operations / cycle
supports float32; we’ll use int32, int16
Note1: wider registers or datapath → ISA must change
Note2: wider registers or datapath → software must be rewritten
Source: ARM NEON Programmers Guide
[Slide 10]
RISC-V Vectors
32 named registers
Implementation-defined:
Register size (in bits) ~ MAXVL: 16384b (2kB, 512 x 32b words)
Execution width (in bits) ~ EXECW: 512b
Cycles to execute ~ MAXVL / EXECW: #cycles = 32
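The cycle count follows directly from the two implementation-defined parameters; a minimal sketch of the arithmetic (function name is illustrative, not from the spec):

```c
#include <assert.h>

/* Cycles a full-length vector instruction needs when each register holds
   MAXVL bits but the datapath executes EXECW bits per cycle. */
static unsigned rvv_cycles(unsigned maxvl_bits, unsigned execw_bits) {
    return maxvl_bits / execw_bits;
}
```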
[Slide 11]
MXP
No named registers, no MAXVL
All vector data is memory-mapped scratchpad, i.e., uses scalar pointers
Scratchpad is multi-banked
Analogy: file lookup on RAID disk array
Emulate RVV: V0 = base_addr, Vi = V0 + i * MAXVL * sizeof(max_elem_size)
Note1: extremely flexible, no vector placement (alignment) or length restrictions
Note2: vector length subject to size of scratchpad
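The emulation formula can be sketched in C. This is a hypothetical illustration of the address mapping only, not the MXP API; all names are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulated RVV register file on a flat scratchpad: register Vi starts at
   base + i * MAXVL bytes, so V0 = base_addr and Vi = V0 + i*MAXVL*elem_size. */
#define MAXVL_WORDS 512   /* 2kB of 32b elements per emulated register */

static int32_t *emulated_vreg(int32_t *scratchpad_base, unsigned i) {
    return scratchpad_base + (size_t)i * MAXVL_WORDS;
}
```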
VectorBlox MXP
[Slide 12]
MXP Scratchpad as a RAID disk array
[Slide 13]
Benchmark Performance
[Slide 14]
Benchmark Performance
[Slide 15]
Benchmark Performance
[Slide 16]
Benchmark Performance
[Slide 17]
Benchmark Performance
[Slide 18]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
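A self-contained scalar reference for the inner loop above, assuming `lag` and `k` name the same quantity (the sample count and `>> 2` scaling come from the slide):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar autocorrelation for a single lag k over n samples of A. */
static int32_t autocorr_lag(const int32_t *A, int n, int k) {
    int32_t s = 0;
    for (int i = 0; i < n - k; i++)
        s += (A[i] * A[i + k]) >> 2;
    return s;
}
```

The benchmark repeats this for every k in 0..1024, so each lag re-reads a shifted window of A; that shifted access is exactly where vslidedn (RVV) versus a scalar pointer bump (MXP) diverges.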
[Slide 19]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
NEON
[Slide 20]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
NEON RISC-V
[Slide 21]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
NEON RISC-V MXP
[Slide 22]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
NEON RISC-V MXP
Why is MXP faster?
[Slide 23]
Autocorrelation (32b)
for( i=0; i<1800-k; i++ ) s += ( A[i]*A[i+k] ) >> 2; // k: 0..1024
NEON RISC-V MXP
Why is MXP faster?
RISC-V:
vslidedn (moves data)
vsrl, vredsum (2 instructions)
MXP:
scalar increment (start address of vector)
vshr with accumulate (1 instruction)
[Slide 24]
2D FIR (32b)
out[x,y] = Σ_{dx,dy} coeff[dx,dy] * A[x+dx,y+dy] // A is 512 x 512
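The filter coefficients are elided on the slide; a scalar sketch assuming a K x K tap array named `coeff` (both the name and K = 3 are assumptions):

```c
#include <assert.h>
#include <stdint.h>

#define K 3  /* assumed tap count per dimension */

/* Scalar 2D FIR at one output point: sum of coeff[dy][dx] * A[x+dx, y+dy]
   over the KxK window, with A stored row-major at the given width. */
static int32_t fir2d_point(const int32_t *A, int width,
                           int32_t coeff[K][K], int x, int y) {
    int32_t acc = 0;
    for (int dy = 0; dy < K; dy++)
        for (int dx = 0; dx < K; dx++)
            acc += coeff[dy][dx] * A[(y + dy) * width + (x + dx)];
    return acc;
}
```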
[Slide 25]
2D FIR (32b)
out[x,y] = Σ_{dx,dy} coeff[dx,dy] * A[x+dx,y+dy] // A is 512 x 512
RISC-V MXP
[Slide 26]
2D FIR (32b)
out[x,y] = Σ_{dx,dy} coeff[dx,dy] * A[x+dx,y+dy] // A is 512 x 512
RISC-V MXP
Why is MXP faster?
[Slide 27]
2D FIR (32b)
out[x,y] = Σ_{dx,dy} coeff[dx,dy] * A[x+dx,y+dy] // A is 512 x 512
RISC-V MXP
Why is MXP faster?
RISC-V: vslidedn needed to move data
MXP: scalar computes start address of vector
[Slide 28]
Sobel (16b)
out[x,y] = min( 255, ( |Gx| + |Gy| ) >>2 ); // input A is 1024 x 1024
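The slide shows only the final clamp; a scalar sketch assuming the standard 3x3 Sobel Gx/Gy masks (the masks themselves are not on the slide):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar Sobel at one interior point of a 16b row-major image:
   Gx/Gy from the usual 3x3 masks, then min(255, (|Gx|+|Gy|) >> 2). */
static int16_t sobel_point(const int16_t *A, int w, int x, int y) {
    int32_t gx = -A[(y-1)*w + (x-1)] + A[(y-1)*w + (x+1)]
               - 2*A[ y   *w + (x-1)] + 2*A[ y   *w + (x+1)]
               -   A[(y+1)*w + (x-1)] +   A[(y+1)*w + (x+1)];
    int32_t gy = -A[(y-1)*w + (x-1)] - 2*A[(y-1)*w + x] - A[(y-1)*w + (x+1)]
               +   A[(y+1)*w + (x-1)] + 2*A[(y+1)*w + x] + A[(y+1)*w + (x+1)];
    int32_t v = (abs(gx) + abs(gy)) >> 2;
    return (int16_t)(v < 255 ? v : 255);
}
```

Each |…| is one absolute-difference-style operation; RVV emulates it in 4 instructions while MXP has it natively, which is the gap the next slides quantify.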
[Slide 29]
Sobel (16b)
out[x,y] = min( 255, ( |Gx| + |Gy| ) >>2 ); // input A is 1024 x 1024
RISC-V MXP
[Slide 30]
Sobel (16b)
out[x,y] = min( 255, ( |Gx| + |Gy| ) >>2 ); // input A is 1024 x 1024
RISC-V MXP
Why is MXP faster?
RISC-V: 20 arith operations, 3 vslidedn, 4 instructions per absdiff, 6 data move operations
MXP: 11 total arith operations, 0 slides (scalar add), 2 absdiff instructions, 0 data moves (scalar ptr assignment)
[Slide 31]
Why is MXP faster? (RISC-V vs MXP)
20 vs 11 arith operations
3 vslidedn
4 instructions per absdiff
6 vs 0 data move operations
[Slide 32]
RGBA to LUMA (32b → 16b → 8b)
luma8 = uint16( 25*blu8 + 129*grn8 + 66*red8 + 128 ) >> 8
// input is 1600 x 1600
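A scalar reference for the slide's luma formula: fixed-point weights, rounding bias of 128, then a narrowing shift (the function name is an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Weighted sum fits in 16 bits (max 25*255 + 129*255 + 66*255 + 128 = 56228),
   so the uint16 cast before the >> 8 is safe, matching the slide. */
static uint8_t rgba_to_luma(uint8_t red8, uint8_t grn8, uint8_t blu8) {
    uint16_t acc = (uint16_t)(25u*blu8 + 129u*grn8 + 66u*red8 + 128u);
    return (uint8_t)(acc >> 8);
}
```

The 32b → 16b → 8b narrowing in the title is the interesting part for the comparison: RVV spends dedicated vsrln instructions on it, while MXP can narrow as a side effect of any instruction.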
[Slide 33]
RGBA to LUMA (32b → 16b → 8b)
luma8 = uint16( 25*blu8 + 129*grn8 + 66*red8 + 128 ) >> 8
// input is 1600 x 1600
NEON
[Slide 34]
RGBA to LUMA (32b → 16b → 8b)
luma8 = uint16( 25*blu8 + 129*grn8 + 66*red8 + 128 ) >> 8
// input is 1600 x 1600
NEON RISC-V MXP
[Slide 35]
RGBA to LUMA (32b → 16b → 8b)
luma8 = uint16( 25*blu8 + 129*grn8 + 66*red8 + 128 ) >> 8
// input is 1600 x 1600
NEON RISC-V MXP
Why is MXP faster?
RISC-V: narrows data with 3 vsrln instructions; vector load not prefetched
MXP: narrows data with any instruction (vand); row data is prefetched
[Slide 36]
Summary
RVV vs NEON: already a strong performance advantage
RVV vs MXP:
1. Extra data movement
   a. vslidedn/up take extra time (e.g., sliding windows)
   b. rotating buffers require vmov
2. Unbundled operations
   a. Reductions take an extra instruction
   b. Data-element narrowing takes an extra instruction
3. Missing operations
   a. vabsdiff takes 4 instructions (vslt, vsub, vsub, vmerge)
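Elementwise, the 4-instruction RVV absdiff sequence named above is equivalent to this scalar sketch (one comparison mask, both subtractions, then a merge):

```c
#include <assert.h>
#include <stdint.h>

/* Per-element model of the RVV emulation: vslt builds a mask, two vsubs
   compute both orderings, vmerge picks the non-negative one. */
static int32_t absdiff_emulated(int32_t a, int32_t b) {
    int mask   = a < b;   /* vslt   */
    int32_t d0 = a - b;   /* vsub   */
    int32_t d1 = b - a;   /* vsub   */
    return mask ? d1 : d0; /* vmerge */
}
```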