FPGA based Acceleration of Linear Algebra Computations.
B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde
Prof. Sachin Patkar, Prof. H. Narayanan
Outline
Double Precision Dense Matrix-Matrix Multiplication: Motivation, Related Work, Algorithm, Design, Results, Conclusions.
Double Precision Sparse Matrix-Vector Multiplication: Introduction, Prasanna, DeLorimier, David Gregg et al., What can we do?
FPGA based Double Precision Dense Matrix-Matrix Multiplication.
Motivation
FPGAs have been making inroads into high-performance computing (HiPC).
Accelerating BLAS-3 is achieved by accelerating matrix multiplication.
Modern FPGAs provide an abundance of resources; we must capitalise on these.
Related Work (1/2)
The two main works are by Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
Dou: optimised for a large Xilinx Virtex-II Pro device. Created his own MAC (not fully IEEE compliant). Sub-block dimensions must be powers of 2. Optimised for low IO bandwidth.
Related Work (2/2)
Prasanna:
Scaling results in a speed degradation of about 35% (from 2 PEs to 20 PEs).
2.1 GFLOPS on a Cray XD1 with Virtex-II Pro devices (XC2VP50).
For a design-only study (XC2VP125), they report 15% clock degradation from 2 to 24 PEs.
They state that no platform-specific optimisations were made for the implemented design.
Algorithm
1. Broadcast 'A'; keep a unique 'B' per PE.
2. Multiply, and put the result in the multiplier pipeline.
3. The output is fed directly to the Adder + RAM (accumulator).
4. When the updated C entries are ready, take them out (a software sketch of the scheme follows below).
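A minimal software emulation of this scheme, given as a sketch: it assumes square matrices and a hypothetical NUM_PE, and the column-strip mapping of B and C onto PEs is an illustrative assumption, not the exact PE layout of the design.

```python
import numpy as np

NUM_PE = 4  # hypothetical number of processing elements

def pe_matmul(A, B):
    """Emulate C = A*B as a sequence of rank-one updates across PEs."""
    n = A.shape[0]
    # Illustrative mapping: each PE owns a strip of B's columns and the
    # matching strip of C (its private "Adder + RAM" accumulator).
    strips = np.array_split(np.arange(n), NUM_PE)
    C = np.zeros((n, n))
    for k in range(n):              # one rank-one update per step
        a_col = A[:, k]             # column of A broadcast to every PE
        for cols in strips:         # each PE updates only its own strip
            C[:, cols] += np.outer(a_col, B[k, cols])
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(pe_matmul(A, B), A @ B)
```

On the FPGA the inner update is the pipelined multiplier feeding the accumulator; in the sketch it is collapsed into one outer-product line.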
Design-I
Design-II
FPGA Synthesis/PAR Data (1/2)
Table: Resource utilisation for SX95T and SX240T (post PAR)

PE            DSP48Es   FIFOs   BRAMs   Slice Reg   Slice LUT
1             16        1       2       2511        1374
4             64        4       8       10377       5451
8             128       8       16      20865       10886
16            256       16      32      41841       21750
20 (SX240T)   320       20      40      52329       27176
40 (SX240T)   640       40      80      103335      53914

Table: Clock speed in MHz for the overall design for different numbers of PEs

Device / PE   1     4     8     16    19    20    40
SX95T-3       377   374   373   373   372   201   -
SX240T-2      374   373   344   -     -     372   371.7
FPGA Synthesis/PAR Data (2/2)

Table: Resource utilisation for Virtex-II Pro XC2VP100 (post PAR)

             15 PE          20 PE
MULT18x18    240 (54%)      304 (68%)
RAMB16s      90 (20%)       114 (26%)
Slices       30218 (68%)    37023 (83%)
Speed        133.94 MHz     133.79 MHz
Conclusions
We propose a variation of the rank-one update algorithm for matrix multiplication.
We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
The two designs clearly show the effect of local storage on IO bandwidth.
The design achieved a clock speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also achieve 5.3 GFLOPS on an XC2VP100.
FPGA based Double Precision Sparse Matrix-Vector Multiplication.
Introduction
There are three main papers we will be looking at:
Viktor Prasanna: hybrid method using HLL + S/W + HDL.
Michael DeLorimier: maximum performance, but unrealistic assumptions.
David Gregg et al.: most realistic assumptions w.r.t. DRAM.
Prasanna
Use of pre-existing IP cores, specifically for an iterative solver (CG).
A 4-input reduction circuit computes the dot product, producing partial sums as output.
An adder loop with an array performs the summation of the dot product; created using an HLL.
A reduction circuit at the end uses a binary tree to produce the final value (see the sketch after this list).
The IP cores are available.
DRAM is looked at, but not realistically.
The order of the matrices is small.
DRAM is the bottleneck.
With their IP cores they have a good architecture; however, one could change the IP and modify the datapath, e.g. using Dou's MAC.
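A rough software illustration of this style, not the published circuit: one sparse row's dot product is formed from fixed-width partial sums and then collapsed by a binary-tree reduction. The 4-way width and the function names are assumptions made for the sketch.

```python
def tree_reduce(vals):
    """Pairwise (binary-tree) summation, as a reduction circuit would do."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0

def row_dot(values, cols, x, width=4):
    """Dot product of one sparse row with x via 4-input partial sums."""
    partials = []
    for i in range(0, len(values), width):
        chunk = sum(values[i + j] * x[cols[i + j]]
                    for j in range(min(width, len(values) - i)))
        partials.append(chunk)
    return tree_reduce(partials)
```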
DeLorimier
Uses BRAMs for everything.
Used for an iterative solver, specifically CG.
The MAC requires interleaving.
They do load balancing in their partitioner, which requires a communication stage; this is very matrix/partitioner dependent (see the sketch after this list).
Communication is the bottleneck.
Performance: 750 MFLOPS per processor.
16 Virtex-II 6000 devices, each with 5 PEs + 1 CE.
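As a rough picture of what such load balancing involves, not their partitioner: the sketch below greedily assigns rows to the least-loaded PE by nonzero count. The name balance_rows and the greedy heuristic are illustrative assumptions; the real partitioner must also minimise the communication stage between PEs.

```python
def balance_rows(row_nnz, num_pe):
    """Greedy partitioner: assign each row to the PE with the least work."""
    loads = [0] * num_pe
    assignment = {}
    # Heaviest rows first gives a tighter balance.
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        pe = min(range(num_pe), key=lambda p: loads[p])
        loads[pe] += row_nnz[row]
        assignment[row] = pe
    return assignment, loads
```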
David Gregg et al. (SPAR)
They only report the use of the SPAR architecture on FPGAs.
They use very pessimistic DRAM access times, with an emphasis on cache-miss removal.
They do not use their Block RAMs well; maybe something interesting can be done here.
128 MFLOPS for 3 parallel SPAR units, but with cache misses removed the peak reaches 570 MFLOPS.
What can we do?
Both use CSR; this is not required, so why not modify the representation?
Two approaches, and we can try both simultaneously (see the sketch after this list):
Prasanna: split across dot products (same row on many PEs).
DeLorimier: split across rows (many rows on one PE).
Use data from SPAR; it is a viable approach. Both do zero multiplies, while we get away with one zero multiply per column.
Minimise communication or overlap it; we can use interleaving for this, so that while one stage computes, the previous one communicates.
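For reference, a minimal CSR SpMV sketch with comments marking where the two splits would apply; the function name is illustrative and this is not any of the published designs.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Baseline y = A*x with A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        # Row split (DeLorimier-style): whole rows r are distributed over PEs.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # Dot-product split (Prasanna-style): the nonzeros of row r are
            # spread over several PEs whose partial sums are then reduced.
            y[r] += values[k] * x[col_idx[k]]
    return y
```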
Questions?
Thank You