
Floating Point Vector Processing on an FPGA

Prof. Miriam Leeser

Department of Electrical and Computer Engineering

Boston, [email protected]

Based on MS thesis by Jainik Kathiara, Jan 2011, and FCCM 2011 paper



• Introduction to Vector Processing

• Vector‐scalar ISA

• ‐

• Vectorized Linear Algebra Kernels
• Results

• Future Directions



• Rich set of reconfigurable elements

• Embedded Processor: PowerPC

instruction extensions to PowerPC:
– Emulated in software
– Hardware coprocessor

•

pipeline

– NU VFLOAT library



• FPGA Floating Point Unit serializes operations:

– PowerPC fetches and executes instructions, data

– ,



• Vector Processor: potential to operate on lots of data at the same time
– Multiple data elements stored in a vector

–

• Eliminates loops

• FPVC does its own instruction fetch and execute
• Vector instructions are dense
– Reduced program code size
– Reduced dynamic instruction bandwidth
– Reduced data hazards

• Parallel execution, parallel data –



for (i=0; i < n; i++)

Y[i] = A[i] * x + Y[i];

• BLAS library routine SAXPY / DAXPY

•

vector Y

• In Vector ISA such operations are written very compactly: operate on the entire vector Y



      L.D    F0,a        ;load scalar a
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.D    F2,0(Rx)    ;load X(i)
      MUL.D  F2,F2,F0    ;a × X(i)
      L.D    F4,0(Ry)    ;load Y(i)
      ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
      S.D    0(Ry),F4    ;store into Y(i)
      DADDIU Rx,Rx,#8    ;increment index to X
      DADDIU Ry,Ry,#8    ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done



L.S     F0,a       ;load scalar a
LV      V1,Rx      ;load vector X
MULVS.S V2,V1,F0   ;vector‐scalar multiply
LV      V3,Ry      ;load vector Y
ADDV.S  V4,V2,V3   ;add
SV      Ry,V4      ;store the result

• Assumes vector length matches length of registers, etc.
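To make the vector listing concrete, here is a minimal C sketch that models each vector instruction as a loop over a whole register. The `vreg` type, the helper names and `MVL = 64` are assumptions made for illustration, not part of the FPVC ISA. Like the listing, it assumes the data length equals the register length.

```c
#include <stddef.h>

#define MVL 64  /* assumed register length; the slides list 64, 128, 256 as typical */

/* Hypothetical software model of one vector register. */
typedef struct { float e[MVL]; } vreg;

/* LV: load MVL consecutive elements from memory into a vector register. */
static void lv(vreg *v, const float *addr)
{
    for (size_t i = 0; i < MVL; i++) v->e[i] = addr[i];
}

/* MULVS.S: multiply every element of a vector by one scalar. */
static void mulvs(vreg *d, const vreg *a, float s)
{
    for (size_t i = 0; i < MVL; i++) d->e[i] = a->e[i] * s;
}

/* ADDV.S: element-wise add of two vectors. */
static void addv(vreg *d, const vreg *a, const vreg *b)
{
    for (size_t i = 0; i < MVL; i++) d->e[i] = a->e[i] + b->e[i];
}

/* SV: store a vector register back to memory. */
static void sv(float *addr, const vreg *v)
{
    for (size_t i = 0; i < MVL; i++) addr[i] = v->e[i];
}

/* Y = a * X + Y for exactly MVL elements, one call per vector instruction. */
void saxpy_vector(float a, const float *X, float *Y)
{
    vreg v1, v2, v3, v4;
    lv(&v1, X);           /* LV      V1,Rx    */
    mulvs(&v2, &v1, a);   /* MULVS.S V2,V1,F0 */
    lv(&v3, Y);           /* LV      V3,Ry    */
    addv(&v4, &v2, &v3);  /* ADDV.S  V4,V2,V3 */
    sv(Y, &v4);           /* SV      Ry,V4    */
}
```

Each helper hides a loop that the scalar version has to spell out with explicit index updates and a branch, which is where the reduction in instruction count comes from.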



• Vector registers hold many operands at once
– 64, 128, 256 typical

• Vector instructions operate on many operands at once:

– LV, SV

– VADD, VMULT

– This reduces code size and dynamic instruction count

• What about processing?

– Use one functional unit (e.g. MULT) and pipeline it

– Have multiple functional units operating at once: vector lanes

– Do both: parallelism and pipelining



• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six stage multiply pipeline operating on vector registers V1, V2 and V3; credit: UC Berkeley, 1998.]



Vector Instruction Execution
[Figure: an element-wise operation producing C[i] from A[i] and B[i], comparing execution using one pipelined functional unit (one result per cycle: C[0], C[1], C[2], …) with execution using four pipelined functional units (four results per cycle: C[0] to C[3], then C[4] to C[7], …).]



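[Figure: vector unit structure with four lanes. Each lane contains functional units and a slice of the vector registers: the first lane holds elements 0, 4, 8, …; the second holds elements 1, 5, 9, …; the third holds elements 2, 6, 10, …; the fourth holds elements 3, 7, 11, …. All lanes connect to the memory subsystem.]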
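The interleaving of elements across lanes can be sketched in C as follows. The function names are hypothetical, and the code only models which lane owns which element; it does not model the FPVC hardware.

```c
#include <stddef.h>

/* Lane that holds element i when elements are interleaved across `lanes`
 * lanes (lanes must be > 0). With 4 lanes: 0,4,8,... -> lane 0; 1,5,9,... -> lane 1. */
size_t lane_of(size_t i, size_t lanes) { return i % lanes; }

/* Position of element i inside its lane's slice of the vector register. */
size_t slot_of(size_t i, size_t lanes) { return i / lanes; }

/* Element-wise add organised by lanes: each outer step stands for one
 * time step, and the inner loop is the work the lanes would do in parallel. */
void vadd_lanes(size_t n, size_t lanes, const float *A, const float *B, float *C)
{
    if (lanes == 0) return;  /* nothing can execute without a lane */
    for (size_t base = 0; base < n; base += lanes)
        for (size_t lane = 0; lane < lanes && base + lane < n; lane++)
            C[base + lane] = A[base + lane] + B[base + lane];
}
```

For example, with four lanes element 9 lives in lane 1 at slot 2, matching the "1, 5, 9, …" column of the figure.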



m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i*MVL+j] = A[i*MVL+j] * x + Y[i*MVL+j];
    m = m - MVL; i++;
}
for (j = 0; j < m; j++)
    Y[i*MVL+j] = A[i*MVL+j] * x + Y[i*MVL+j];

• Maximum vector length (MVL)

• Vector Length Register (VLR)

• Strip mining



Vector Strip Mining
Solution: Break loops into pieces that fit into vector registers, “Strip mining”

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63    # N mod 64
      MTC1   VLR, R1      # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    # Multiply by 8
      DADDU  RA, RA, R2   # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      # Reset full length
      BGTZ   N, loop      # Any more to do?

[Figure: arrays A, B and C added strip by strip: the remainder first, then strips of 64 elements.]
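The remainder-first strip mining on this slide can be mirrored in C as a rough sketch. The helper `vadd_strip` stands in for one vector add executed with the vector length register set to `vl`; both function names are assumptions for illustration.

```c
#include <stddef.h>

#define MVL 64  /* strip size used on the slide */

/* Stand-in for one vector add of vl elements (vl plays the role of VLR). */
static void vadd_strip(size_t vl, const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < vl; j++) c[j] = a[j] + b[j];
}

/* C[i] = A[i] + B[i] for i in [0, n), processed as one remainder strip of
 * n mod MVL elements followed by full strips of MVL elements. */
void strip_mined_add(size_t n, const double *A, const double *B, double *C)
{
    size_t vl = n % MVL;   /* first strip handles the remainder (may be empty) */
    size_t done = 0;
    do {
        vadd_strip(vl, A + done, B + done, C + done);
        done += vl;        /* bump the pointers past the finished strip */
        vl = MVL;          /* every later strip uses the full length */
    } while (done < n);
}
```

For n = 130 this issues strips of 2, 64 and 64 elements. When n is a multiple of 64 the first strip is empty, which matches what the assembly loop does with VLR set to 0.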



• VIRAM and T0 designed by Kozyrakis[1] and Asanovic[2] respectively
• VIRAM and T0 are implemented with ASICs
• Yiannacouras[3] and Yu[4] have designed FPGA based soft vector processors inspired by VIRAM and T0
– These designs implement integer arithmetic, not floating point



• ‐

from earlier work on floating point co‐

– Fetches its own instructions

–

• Loop control is local to the FPVC

– Includes divide and square root in floating point



Vector Chaining and Hybrid vector/SIMD Architecture
• Vector chaining is pipeline forwarding in a vector processor
• Requires one read port and one write port for each functional unit
• Hybrid vector/SIMD computation proceeds in SIMD fashion across lanes and over time as in a traditional vector processor
• AMD GPU architecture implements vector/SIMD architecture



Vector Scalar Instruction Set Architecture
• 32 bit instruction set
• Supports 32 vector registers
• All the instructions can be classified into categories:
– Memory access instructions
– Integer arithmetic instructions
– Program flow control instructions
– Floating point arithmetic instructions
– Special instructions



• Two types of organization
– Register Partitioned
– Element Partitioned
• Vector Register
– Number of Vector lanes

–



Vector Lane, Short Vector, Vector Register, Scalar Register
[Figure: register file organization showing scalar registers, vector lanes (L) and short vector (SV).]



Memory Instruction format
op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]
• Memory access patterns are
– Unit stride
– Non‐unit stride
– Permutation access
– Look up table access
– Rake access
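The five fields of the memory instruction format add up to 32 bits (6 + 5 + 5 + 5 + 11). A minimal C sketch of packing and unpacking such a word is shown below. The slide does not state the bit positions, so placing `op` in the most significant bits and `imd` in the least significant bits is an assumption, and `mem_instr`, `encode_mem` and `decode_mem` are hypothetical names.

```c
#include <stdint.h>

/* Fields of the memory instruction format:
 * op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]. */
typedef struct {
    uint32_t op;   /* 6-bit opcode */
    uint32_t rd;   /* 5-bit destination register */
    uint32_t r1;   /* 5-bit source register 1 */
    uint32_t r2;   /* 5-bit source register 2 */
    uint32_t imd;  /* 11-bit immediate */
} mem_instr;

/* ASSUMPTION: fields are packed left to right, op in the top bits. */
uint32_t encode_mem(mem_instr f)
{
    return ((f.op  & 0x3Fu) << 26) |
           ((f.rd  & 0x1Fu) << 21) |
           ((f.r1  & 0x1Fu) << 16) |
           ((f.r2  & 0x1Fu) << 11) |
            (f.imd & 0x7FFu);
}

/* Inverse of encode_mem under the same layout assumption. */
mem_instr decode_mem(uint32_t w)
{
    mem_instr f;
    f.op  = (w >> 26) & 0x3Fu;
    f.rd  = (w >> 21) & 0x1Fu;
    f.r1  = (w >> 16) & 0x1Fu;
    f.r2  = (w >> 11) & 0x1Fu;
    f.imd =  w        & 0x7FFu;
    return f;
}
```

The 5-bit register fields are consistent with the 32 vector registers the ISA supports.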



Arithmetic Instruction with both register operands
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]
Arithmetic Instruction with 16‐bit immediate value
op[5:0] rd[4:0] r1[4:0] imd[15:0]
• Includes both integer and floating point instructions
• Masked instruction execution is also included



• Same instruction format is used
• Only first element of first short vector of each vector register is used
• Result will be replicated to all lanes and stored on the first short vector



• Expand, Compress
[Figure: expand and compress examples driven by a mask vector over elements A[0] to A[7]; the elements with mask bit 1 (A[0], A[2], A[3], A[4], A[7]) are the ones selected, and positions with mask bit 0 are left empty (‐).]
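A small C sketch of mask-driven compress and expand follows. It shows the usual meaning of these operations in vector ISAs; whether the FPVC leaves unselected destination elements untouched, as done here, is an assumption, and the function names are hypothetical.

```c
#include <stddef.h>

/* Compress: pack the elements of src whose mask entry is nonzero into the
 * front of dst. Returns how many elements were written. */
size_t compress(size_t n, const int *mask, const float *src, float *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;
}

/* Expand: the inverse. Take packed elements from the front of src and place
 * them at the positions of dst whose mask entry is nonzero.
 * ASSUMPTION: positions with mask 0 are left unchanged.
 * Returns how many elements were consumed from src. */
size_t expand(size_t n, const int *mask, const float *src, float *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
    return k;
}
```

With the mask 1,0,1,1,1,0,0,1 over A[0] to A[7], compress yields A[0], A[2], A[3], A[4], A[7], and expand with the same mask puts them back in their original positions.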



• Autonomous from the main processor
• Supports vector scalar ISA
• In‐order issue, out‐of‐order completion
– Arbiter handles completion
• Unified vector scalar register file
• Uses NU VFLOAT library for floating point units



• Supports modified Harvard style memory architecture
• Separate instruction and data memory in local on‐chip RAM
• Unified main memory (in other on‐chip RAM)
• Local on‐chip RAM reduces traffic on the system bus
• Program and data size are limited by local on‐chip RAM size
– Vector code is more compact than scalar code!



• FPVC is connected through PLB interface to system bus but not limited to any bus protocol
• Two ports are provided for connection in embedded system
– Slave port – for communication with main processor
– Master port – for main memory accesses
• Interface can be configured for different data widths for memory accesses



• Design implemented on Xilinx ML510 board
• 32‐bit PLB based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
• FPU only used for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• Main metric for performance: cycles



PowerPC_main(){
    1. Start PowerPC timer();
    2. Write kernel parameter to FPVC’s local data RAM;
    3. FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}

FPVC_main(){
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}



• DOT Product
• Matrix‐Vector Product
• Matrix‐Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition



• Performs O(N) operations
• Product can be formulated as:
DOT_product_kernel(){
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
}
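The two steps of the dot product kernel, an element-wise multiply followed by a reduction, can be sketched in C as below. The slides do not say how the reduction is carried out, so the pairwise (tree) scheme here is an assumption chosen because it maps naturally onto vector hardware; the function names and the caller-provided `scratch` buffer are also illustrative.

```c
#include <stddef.h>

/* Step 1: element-wise multiply into a scratch vector (mul_vector). */
static void vmul(size_t n, const float *u, const float *v, float *out)
{
    for (size_t i = 0; i < n; i++) out[i] = u[i] * v[i];
}

/* Step 2: pairwise reduction. Each pass adds the upper half of the live
 * elements onto the lower half, so the live length roughly halves per pass.
 * ASSUMPTION: this ordering is illustrative, not the FPVC's documented one. */
static float reduce_sum(size_t n, float *s)
{
    if (n == 0) return 0.0f;
    for (size_t len = n; len > 1; len = (len + 1) / 2) {
        size_t half = len / 2;     /* number of pairs to add */
        size_t off  = len - half;  /* start of the upper half */
        for (size_t i = 0; i < half; i++) s[i] += s[i + off];
    }
    return s[0];
}

/* Dot product of u and v; scratch must hold at least n floats. */
float dot_product(size_t n, const float *u, const float *v, float *scratch)
{
    vmul(n, u, v, scratch);
    return reduce_sum(n, scratch);
}
```

Because floating point addition is not associative, a tree reduction can round differently from a left-to-right sum, which is worth keeping in mind when comparing results against a scalar reference.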



DOT Product Performance for Short Vector Scaling
[Figure: DOT PRODUCT with Lane (L) = 2. Performance improvement versus number of vector elements (8, 16, 32, 64, 128, 256, 512). Series: PowerPC; SV = 8, L = 2; SV = 16, L = 2; SV = 32, L = 2.]



[Figure: DOT PRODUCT with Short Vector Size (SV) = 32. Performance improvement versus number of vector elements (8, 16, 32, 64, 128, 256, 512). Series: PowerPC, L = 1, L = 2, L = 4, L = 8.]



Matrix‐Vector Product

• BLAS level 2 routine
• Performs O(N^2) floating point operations
• Product can be formulated as:
MV_product_kernel(){
    loop (i = 0 to i = N‐1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}
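As a scalar reference for the kernel above, the matrix-vector product can be written in C as one dot product per matrix row. Row-major storage of A and the function names are assumptions for illustration.

```c
#include <stddef.h>

/* Plain dot product of two length-n vectors. */
static float dot(size_t n, const float *u, const float *v)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; k++) acc += u[k] * v[k];
    return acc;
}

/* y = A x for an n x n matrix A stored row-major (ASSUMED layout), so that
 * row i starts at A[i * n] and can be passed straight to the dot product. */
void mv_product(size_t n, const float *A, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = dot(n, &A[i * n], x);
}
```

N dot products of length N give the O(N^2) operation count stated on the slide.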



Matrix‐Vector Product Performance for Lane Scaling
[Figure: performance improvement versus square matrix size (4, 8, 12, 16). Series: PowerPC, L = 1, L = 2, L = 4, L = 8.]



Matrix‐Matrix Multiplication

• BLAS level 3 routine
• Performs O(N^3) floating point operations
• Product can be formulated as:
MM_product_kernel(){
    loop (i = 0 to i = N‐1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}
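The column-by-column formulation above can be sketched in C as follows. Column-major storage is an assumption made here so that each column B_i and C_i is contiguous in memory; the function names are illustrative.

```c
#include <stddef.h>

/* y = A b, where A is n x n and stored column-major: A[c * n + r]. */
static void mv_product_cm(size_t n, const float *A, const float *b, float *y)
{
    for (size_t r = 0; r < n; r++) {
        float acc = 0.0f;
        for (size_t k = 0; k < n; k++) acc += A[k * n + r] * b[k];
        y[r] = acc;
    }
}

/* C = A B, computed one column at a time: C_i = A * B_i.
 * All three matrices are n x n and column-major (ASSUMED layout). */
void mm_product(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        mv_product_cm(n, A, &B[i * n], &C[i * n]);
}
```

N matrix-vector products of O(N^2) each give the O(N^3) total stated on the slide.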



Matrix‐Matrix Multiplication Performance for Lane Scaling
[Figure: MM Product with Short Vector Size (SV) = 32. Performance improvement versus square matrix size (4, 8, 12, 16). Series: PowerPC, Lane = 1, Lane = 2, Lane = 4, Lane = 8.]



• This kernel uses Givens rotation to decompose matrix into an orthogonal (Q) and upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using 2 x 2 rotation matrix Qi,j
• Performs O(N^3) floating point operations.
QR_Decomp_kernel(){
    loop (i = 0 up to i = M‐1)
        loop (j = N‐1 down to j > i)
            x = A[j‐1][i];
            y = A[j][i];
            A[j‐1:j][0:N‐1] = MM_product_kernel(Qi,j, A[j‐1:j][0:N‐1]);
        end loop;
    end loop;
}



QR Decomposition Performance for Lane Scaling



• Cholesky decomposition factors a symmetric positive‐definite matrix into the product of a lower triangular matrix L and its transpose such that A = LL^T.
• Each element of L can be defined as below:
loop (i = 0 up to i = N‐1)
    pivot value = sqrt(A_i,i);
    divide i th column vector from i to N by pivot value;
    loop (j = i+1 upto N)
        accumulate row vector from 0 to i;
        subtract accumulated value from A_j,i+1;
    end loop;
end loop;



Cholesky Decomposition Performance for Lane Scaling



• Designed and implemented unified vector scalar floating point architecture
• operations:

– ,

,

,

,

• Initiated designing linear algebra library for computation



•

processor

– Easier to implement

• FPVC is autonomous from embedded processor
– Good choice for implementing scientific apps that use the rest of the FPGA at the same time



• Double Precision Floating Point Support
• Architectural Improvements
– Memory Caching
• Improved Tools

– Vector Compiler Tool Flow

• More applications

– Demonstrate concurrent use of FPVC



[1] C. Kozyrakis and D. Patterson, “Overcoming the Limitations of Conventional Vector Processors,” in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, “The T0 Vector Microprocessor,” Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, “VESPA: Portable, Scalable, and Flexible FPGA‐Based Vector Processors,” International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Atlanta, GA, October 2008.
[4] J. Yu, G. Lemieux, and C. Eagleston, “Vector Processing as a Soft‐core CPU Accelerator,” ACM International Symposium on FPGA, 2008.



Miriam Leeser

[email protected] : www.coe.neu.edu Research rcl index. h

More details can be found in:
Jainik Kathiara’s MS thesis under publications link
An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs by Jainik Kathiara and Miriam Leeser
