
Automated Generation of High-Performance Large-Scale Matrix Multiplication Accelerator on FPGA

Jie Wang, Jason Cong
Department of Computer Science
University of California, Los Angeles
{jiewang, cong}@cs.ucla.edu

Abstract— Matrix multiplication (MM) is a key linear algebra routine which has been widely used in many application areas. In this work we provide a high-performance single-precision dense MM FPGA accelerator, along with an automatic generator that produces an accelerator with high throughput and high resource efficiency based on hardware and MM workload specifications. The accelerator adopts the linear systolic array as its basic building block and contains an optimized architecture which integrates several blocks together. The size and the number of blocks are parameterized, allowing the user to search for the optimal design parameters through automatic design space exploration. The accelerator is tested on the Xilinx VC709 evaluation board, and shows a peak performance of 198.1 GFLOPs.

I. INTRODUCTION

Modern FPGAs offer high computation throughput with high energy efficiency; this makes them a competitive heterogeneous platform for high-performance computing (HPC) in current warehouse-scale data centers [1], [2].

Linear-algebra-intensive applications make up 70% of the HPC market worldwide [3]. Numerous acceleration implementations have been proposed on different platforms, including BLAS and MKL on CPUs, cuBLAS and Magma on GPUs, LAPACKrc on FPGAs, etc.

In this work we focus on one of the most basic yet important computation routines in linear algebra – dense matrix multiplication (MM). The increasing scale of computation in HPC applications has made large-scale MM increasingly common. For large-scale MM, tiling is necessary, since the computation of the whole MM can hardly fit on one FPGA device. In such cases, trade-offs between two parameters – 1) the tile size and 2) the number of tiles computed in parallel – dramatically influence the overall performance of the accelerator across different hardware and MM workload specifications.

We use a simple example to show the importance of such trade-offs. The MM workload is denoted by C = A × B, where A, B, C ∈ R^{320×320}. We assume that the accelerator contains D basic building blocks, each of which can finish one square MM operation of size S in S² clock cycles (CCs). The parameters S and D refer to the tile size and the number of tiled MMs computed in parallel, respectively. Three different configurations of the accelerator are compared, denoted Config.1, Config.2, and Config.3 in Table 1.

In Config.2, as shown in Figure 1, the input matrices are tiled by a factor of 128, and two tiled MMs are computed in parallel, consuming two tiles from matrix A and one tile from matrix B.

Fig. 1: Matrix tiling and zero padding in Config.2 (matrix A is padded to matrix A′ of size 512 × 384, and matrix B to matrix B′ of size 384 × 384). Two tiles from matrix A′ (e.g., A′11 and A′21) and one tile from matrix B′ (e.g., B′11) are fetched to compute two tiles in matrix C′ (e.g., C′11 and C′21).

TABLE 1: Comparison among different configurations.

                        Config.1   Config.2   Config.3
Tile Size (S)           256        128        64
#Tiles in Parallel (D)  1          2          4
Bandwidth               Low        Medium     High
DSP                     Same       Same       Same
Other Resources         High       Medium     Low
Execution Cycles        524288     294912     184320

We extend the dimensions of matrices A and B to 512 × 384 and 384 × 384 respectively by padding zeros, for simplicity of scheduling. The execution time of the original MM workload on the accelerator with this configuration can then be calculated as 512/(128 × 2) × 384/128 × 384/128 × 128² = 294912 CCs. As compared in Table 1, Config.1 takes the longest time to finish due to the cycles wasted computing the padded elements, while Config.3 consumes the least time since the tile size 64 fits the original matrix well. However, Config.3 requires more tiles to be fetched from the external memory each time, consuming more bandwidth than Config.1. For resources, although these three configurations consume the same number of DSPs, with further architecture optimization other resources including SliceLUT, SliceReg, BRAM, etc. will vary across configurations; this will be covered in detail later in the paper. The two parameters, 1) the tile size and 2) the number of tiled MMs computed in parallel, which determine


the different configurations of the accelerator, form a design space in which we can find the optimal design parameters that maximize the throughput of the accelerator, based on different hardware and MM workload specifications. For a device with rich resources but low bandwidth, Config.1 is the optimal choice, while for a device with fewer resources but more bandwidth, Config.2 and Config.3 are better choices since their execution times are shorter, offering higher throughput than Config.1.
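To make this concrete, the following sketch (ours, not part of the paper) reproduces the execution-cycle column of Table 1 from the model above, assuming the D parallel blocks are arranged as dr row tiles by dc column tiles (dr × dc = D, matching the d1/d2 parameters introduced later) and that tiled dimensions are padded up to multiples of the parallel factors as in Figure 1:

```c
/* Cycle-count model for the Table 1 example (a sketch; names are ours).
 * An S x S building block finishes one tiled MM in S*S cycles; D blocks
 * work in parallel, split as dr row tiles x dc column tiles. */
#include <stdio.h>

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

static long exec_cycles(long M, long N, long K, long S, long dr, long dc) {
    long row_tiles = ceil_div(ceil_div(M, S), dr) * dr;  /* padded, e.g. 512/128 = 4 */
    long col_tiles = ceil_div(ceil_div(N, S), dc) * dc;
    long k_tiles   = ceil_div(K, S);
    return (row_tiles / dr) * (col_tiles / dc) * k_tiles * S * S;
}

int main(void) {
    /* reproduces Table 1 for the 320 x 320 x 320 workload */
    printf("Config.1: %ld\n", exec_cycles(320, 320, 320, 256, 1, 1)); /* 524288 */
    printf("Config.2: %ld\n", exec_cycles(320, 320, 320, 128, 2, 1)); /* 294912 */
    printf("Config.3: %ld\n", exec_cycles(320, 320, 320, 64, 2, 2));  /* 184320 */
    return 0;
}
```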

Many previous MM accelerator designs on FPGA focus on optimizing the computation engine for a single tiled MM operation, without exploring multi-tile designs. This makes them less optimal when handling large-scale MMs. Therefore, in this work, we explore the full design space by combining these two factors. We use the linear systolic array [4], which works on a single tiled MM operation, as the building block. Our accelerator architecture uses multiple blocks together to compute multiple tiles in parallel. Several optimization strategies are applied to this architecture, which help reduce the on-chip resource utilization and the bandwidth consumption between the FPGA chip and the external DDR memory. The whole accelerator architecture is parameterized by the number and the size of such blocks. We further offer a generator which automatically produces the optimal parameter set achieving the highest throughput and resource efficiency based on user specifications, including the resources available for the accelerator and the MM workloads. This accelerator can be easily integrated into high-level synthesis (HLS) tools as a BLAS-like library component without further modification from the user side.

Our contributions can be summarized as follows:

• We propose a novel architecture with linear systolic arrays as building blocks, using multiple blocks to compose the overall accelerator. The size and the number of blocks are parameterized, enabling full design space exploration. Several optimization techniques are applied, including data sharing among the MM blocks. These techniques help improve performance while saving area.

• We present a generator to automate the design space exploration and generate a complete FPGA MM accelerator for users. The generator takes the hardware specifications of the target FPGA device and the MM workloads as inputs, and outputs the optimal design parameters achieving the highest throughput and resource efficiency.

• Our accelerator has high scalability and portability, and can be integrated into HLS tools as a library component. The accelerator is tested on the Xilinx VC709 evaluation board and shows a peak performance of 198.1 GFLOPs.

Our paper is organized as follows. Section II introduces previous MM accelerator designs on FPGA and the linear systolic array architecture that we adopt. Section III explains the accelerator architecture in detail, and Section IV discusses the design space of our accelerator and the generator that explores it. Experimental results are shown and discussed in Section V. We conclude the paper in Section VI.

II. BACKGROUND & RELATED WORK

A. MM Designs on FPGA

Matrix multiplication, as an important routine for linear algebra, has long been a hot topic in both industry and academia. Numerous designs [4]–[9] have been proposed in recent years. These designs can be divided into two major categories based on the data network topology they use: broadcast and systolic designs.

The most recent work on broadcast designs is proposed by Kumar et al. in [5]. They tile the original matrix into small square matrices, and the proposed accelerator works on a single tiled MM operation at a time. For each tiled MM, the data from the two input matrices are fetched in column-major and row-major order respectively. The accelerator broadcasts the data fetched from the first matrix to an array of processing elements (PEs), each holding an element from a row of the second matrix. Altera also provides a floating-point MM IP [6] in which a vector calculator unit computes the dot product between elements from the two input matrices; the data from the first input matrix is broadcast to the elements from a row of the second matrix. The structure of broadcast designs is simple and easy to implement; however, the long wires and high fan-outs of the broadcast data networks limit the scalability of such designs. The design in [5] is tested with only 40 PEs, and the matrix dimension of Altera's MM IP [6] is bounded by 256. These designs are not suitable for large-scale MMs.

Systolic architectures have gained more research attention than broadcast designs. With proper organization of the data and control signal flow, the systolic architecture avoids global interconnects, communicating between adjacent PEs only. This feature offers more flexibility for routing, and therefore shows the potential for high performance and scalability. Amira et al. [7] propose a bit-level 2D systolic architecture in which the number of PEs equals the number of elements in the output matrix; data from the input matrices are loaded in a systolic manner so that each PE calculates one element of the output matrix. The 2D systolic architecture has the drawback of complex on-chip communication among the PE arrays and the data buffers which feed the PEs, which may limit scalability when implementing a large number of PEs. To solve this problem, different architectures have been proposed to transform the 2D systolic architecture into a 1D array. Dou et al. [8] propose an accelerator which uses the same algorithm as Kumar et al. in [5]. The accelerator consists of a master processor and a linear array of PEs; the master processor distributes a set of elements from one column of the first input matrix and one row of the second input matrix to each PE. The MAC in this design is highly optimized for Xilinx Virtex-2 devices and is not portable to other FPGA devices. Jovanovic et al. [9] propose a tiled MM architecture which returns the result tiles to the host processor as soon as they are complete.


Fig. 2: Overview of the linear systolic array architecture [4]. The architecture contains N PEs (PE1 through PEN) in total, each processing one column of the output matrix. In each clock cycle, the architecture fetches one element from matrix A in column-major order and one element from matrix B in row-major order, and outputs one element of matrix C in column-major order.

This architecture requires the result tiles to be accumulated by the host processor, which limits the portability of the design.

In this work we adopt the design proposed in [4]. This architecture has high performance and scalability, and has been shown to have high energy efficiency compared to other FPGA implementations [10]. Details of the design are reviewed in the next section.

Tiling is a necessary approach for handling large-scale MMs. Apart from optimizing the accelerator for a single tiled MM, the scheduling among the tiled MMs affects performance as well. Most previous works [5], [8], [9] handle tiled MMs sequentially: the accelerator works on one tiled MM operation at a time. The work in [4] proposed an architecture that handles multiple tiled MMs in parallel; however, the design space of such an architecture has not been fully explored. The increasing capacity of modern FPGAs makes it possible to place multiple tiled MM modules on chip. In this work we show that with the associated design space exploration for such an architecture, we gain significant performance improvement compared to previous designs.

B. Linear Systolic Array

In this section we give a brief introduction to the linear systolic array architecture, first proposed by Jang et al. in [4]. Figure 2 shows the overall architecture.

N PEs are used to solve the problem C = A × B, where A, B, C ∈ R^{N×N}. Each PE computes one column of the output matrix. For example, as shown in Figure 3, PE_j calculates the jth column of matrix C (c_ij = Σ_k a_ik × b_kj, where 1 ≤ i, k ≤ N). The data from matrices A and B are fetched in column-major and row-major order respectively, and the data in matrix C are output in column-major order. In every N clock cycles, N elements from one column of matrix A are passed through all the PEs, along with N elements from one row of matrix B. PE_j calculates all the partial results for the elements in the jth column and adds them to the intermediate results computed in the previous cycles. The first element of matrix C is output after N² cycles.

Fig. 3: Architecture of a single PE. PE_j works on the jth column of the output matrix; it contains the registers A_buf, B_buf, B_work0, and B_work1, a MAC unit, the local memories C_buf and CO_buf, and the control signals buf_select, operand_select, and result_select, with data and control passed from PE_{j-1} and toward PE_{j+1}. Wires denoted by orange dashed lines are padded with extra registers to improve timing performance.

The architecture uses dedicated buffers to stream the outputs so that only one element is output in each cycle. Therefore, another N² cycles are needed to finish transferring all the results. However, this process can be overlapped with the computation of the following MM, resulting in a computation latency of N² cycles on average.

Inside each PE there are four registers (A_buf, B_buf, B_work0, B_work1), one MAC, and two local memories of N words each. Figure 3 shows the architecture of each PE in detail. Inside PE_j, registers A_buf and B_buf transfer data from PE_{j-1} to PE_{j+1}. Registers B_work0 and B_work1 buffer the data from matrix B to be processed in the local PE. A dedicated memory switch mechanism selects the corresponding element from the two registers, which is multiplied in the MAC unit by the element from matrix A buffered in A_buf. Memory C_buf stores the intermediate results from the MAC, and memory CO_buf buffers the results transferred from PE_{j+1}, which helps serialize the outputs in a streaming manner.

This architecture has shown high performance and scalability [10], mostly owing to the local interconnect between adjacent PEs, which avoids the long wires and high fan-outs of broadcast designs.
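As a functional (not cycle-accurate) illustration of this column-per-PE mapping, the sketch below (ours, not the paper's RTL) streams column k of matrix A and row k of matrix B past all PEs; PE j multiplies each passing a[i][k] by its buffered b[k][j] and accumulates into its local C_buf:

```c
/* Functional model of the linear systolic array (a sketch; names are ours).
 * PE j owns column j of C: as column k of A and row k of B stream through
 * the chain, PE j multiplies each a[i][k] by its buffered b[k][j]. */
#include <stdio.h>

#define N 4

void systolic_mm_model(float A[N][N], float B[N][N], float C[N][N]) {
    for (int j = 0; j < N; j++)          /* each PE clears its C_buf */
        for (int i = 0; i < N; i++)
            C[i][j] = 0.0f;
    for (int k = 0; k < N; k++)          /* stream column k of A / row k of B */
        for (int j = 0; j < N; j++)      /* PE j holds b[k][j] in B_work0/1 */
            for (int i = 0; i < N; i++)  /* a[i][k] passes PE by PE */
                C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    float A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    float B[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
    float C[N][N];
    systolic_mm_model(A, B, C);
    printf("C[1][2] = %.1f\n", C[1][2]);  /* B is identity: expect 7.0 */
    return 0;
}
```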

III. ACCELERATOR ARCHITECTURE

In this section we first give an overview of the whole system in Section III-A. Then we explain our optimizations on the single-block architecture in Section III-B, and the scheduling and optimization techniques for multiple blocks in Section III-C. In Section III-D we briefly introduce the interface module of the accelerator.

A. System Overview

The whole system, as shown in Figure 4, is composed of two parts: the host processor and the MM accelerator. The host processor can be either a hard-core processor like an Intel CPU or a soft-core processor like the Xilinx MicroBlaze.


Fig. 4: System overview. The accelerator consists of an interface and a kernel composed of multiple blocks, with the A, B, and C buffers connected to the kernel through FIFOs. The host processor controls the accelerator via the AXI4-Lite bus. Both the host processor and the accelerator access the external DDR memory through the DDR controller via the AXI4 bus.

The MM accelerator is wrapped with a standard interface, and therefore places no constraint on the choice of host processor. In this work we use MicroBlaze as the host processor; the host processor and the MM accelerator are connected by the AXI4-Lite bus.

The host processor runs the application programs and calls the MM routines, which are computed on the MM accelerator. The control arguments, passed as inputs to the accelerator, describe the MM workload – the dimensions of the matrices, the batch size, and the memory addresses of the matrices in the external DDR memory. The host processor initializes the accelerator after passing the workload arguments and polls the status of the accelerator until it finishes. Both the host processor and the MM accelerator have direct access to the DDR memory.

The MM accelerator consists of two parts: the interface and the computation kernel. The interface module fetches the input arguments from the host processor, which are used to schedule the whole set of tiled MM operations in the computation kernel. It also transfers the data between the DDR memory and the kernel.

The computation kernel consists of several building blocks, each handling a single tiled MM operation. Each block is implemented as a linear systolic array, as introduced in Section II-B. All the building blocks work in lockstep, for simplicity of scheduling and implementation.

B. Single Block Architecture Optimization

We adopt the linear systolic array as the basic building block. Based on the original design, we make several optimizations to improve the timing performance while saving area.

1) Interconnect Pipelining: The main advantage of the systolic array architecture is the local connection between adjacent PEs, which avoids global interconnects. However, this advantage can still be lost during routing when wires inside or across PEs become quite long due to local resource scarcity; such wires become critical paths and cause timing problems. Normally there are sufficient registers and LUTs inside each slice, but relatively scarce BRAMs and DSPs. Therefore, we pick the paths denoted by orange dashed lines in Figure 3, which connect to BRAMs or DSPs, and pad them with extra registers to reduce routing pressure. The cost of these registers is small compared to the timing improvement they bring, as confirmed in our experiments.

Algorithm 1: Tiled MM scheduling with multiple blocks. For matrices whose dimensions are not multiples of the tile size S, we pad the matrices with zeros. #pragma UNROLL denotes that d1 × d2 blocks are placed on chip working in parallel.

for i ← 1 to ceil(M/(S × d1)) do
    for j ← 1 to ceil(N/(S × d2)) do
        for k ← 1 to ceil(K/S) do
            #pragma UNROLL
            for ii ← 1 to d1 do
                for jj ← 1 to d2 do
                    C_tile[(i-1) × d1 + ii, (j-1) × d2 + jj] +=
                        A_tile[(i-1) × d1 + ii, k] × B_tile[k, (j-1) × d2 + jj]
                end
            end
        end
    end
end

2) Daisy-Chain Control: The data processed by each PE are passed one by one through the array. It is therefore natural to pass the control signals (e.g., buf_select, operand_select, result_select in Figure 3) through the array along with the data, instead of generating them inside each PE independently. In our design we implement the controller only in the first PE, and all the control signals it generates are passed along the controller chain across the PEs. This saves the area occupied by duplicate controllers while causing no performance degradation.

C. Multiple Block Integration

In our design we place several blocks in parallel, which raises two critical problems: 1) how to determine the number and the size of the blocks, and 2) how to schedule the tiled MM operations and map them onto these blocks. In this section we discuss our solution to the second problem in detail; we solve the first problem in Section IV.

Algorithm 1 shows the tiled MM scheduling with multiple blocks. The MM workload is C = A × B, where A ∈ R^{M×K}, B ∈ R^{K×N}, and C ∈ R^{M×N}. The tiling factor is S, which means that each building block handles a tiled MM of R^{S×S} × R^{S×S}. Based on the features of our systolic array architecture, we place S PEs inside each block. d1 refers to the number of tiles fetched from matrix A, and d2 to the number of tiles fetched from matrix B. These tiles are shared among all the blocks. There are d1 × d2 blocks in total, generating d1 × d2 tiles of matrix C in parallel.
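A host-side C rendering of this schedule may make the mapping concrete (our sketch; on the FPGA the ii/jj loops correspond to the d1 × d2 hardware blocks rather than software loops):

```c
/* Reference implementation of the schedule in Algorithm 1 (a sketch; names
 * are ours). Matrices are assumed already zero-padded to Mp x Kp and
 * Kp x Np, where Mp is a multiple of S*d1, Np of S*d2, and Kp of S. */
#include <string.h>

void tiled_mm_schedule(const float *A, const float *B, float *C,
                       int Mp, int Np, int Kp, int S, int d1, int d2) {
    memset(C, 0, (size_t)Mp * Np * sizeof(float));
    for (int i = 0; i < Mp / (S * d1); i++)
      for (int j = 0; j < Np / (S * d2); j++)
        for (int k = 0; k < Kp / S; k++)
          /* unrolled in hardware: d1*d2 blocks compute these tiles in parallel */
          for (int ii = 0; ii < d1; ii++)
            for (int jj = 0; jj < d2; jj++) {
              int r0 = (i * d1 + ii) * S;   /* row offset of the C/A tile */
              int c0 = (j * d2 + jj) * S;   /* column offset of the C/B tile */
              int k0 = k * S;               /* inner-dimension offset */
              for (int r = 0; r < S; r++)   /* one tiled MM: S x S x S */
                for (int c = 0; c < S; c++)
                  for (int kk = 0; kk < S; kk++)
                    C[(r0 + r) * Np + (c0 + c)] +=
                        A[(r0 + r) * Kp + (k0 + kk)] * B[(k0 + kk) * Np + (c0 + c)];
            }
}
```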

Figure 5 shows a simple example with d1 = 2 and d2 = 1 for a tiled MM. As the figure shows, each time one tile from matrix B is shared by the two blocks.


Fig. 5: Workload scheduling and mapping when d1 = 2 and d2 = 1. There are d1 × d2 = 2 × 1 = 2 blocks in the kernel. Two tiles from matrix A and one tile from matrix B are fetched in parallel, and two tiles of matrix C are generated in parallel. Each tile from matrix A is distributed to one block, while the tile from matrix B is shared between the two blocks. In this case there is only one copy of the (B_buf, B_work0, B_work1) modules inside the kernel for transferring data from matrix B, instead of the two copies required in a design without data sharing.

TABLE 2: Comparison between kernels with and without data sharing.

                  w/ Data Sharing         w/o Data Sharing
A_buf Size        O(d1 × S)               O(d1 × d2 × S)
B_buf Size        O(d2 × S)               O(d1 × d2 × S)
B_work0/1 Size    O(d2 × S)               O(d1 × d2 × S)
Bandwidth         O(d1 + d2 + d1 × d2)    O(d1 × d2 × 3)

The benefits this data sharing brings are significant. First, it saves the area of the buffers used for transferring data from matrices A and B inside each PE. Second, it cuts down the bandwidth requirements of the kernel: since data from matrix A or B is shared among the blocks, the kernel accesses less data than one without data sharing. Table 2 compares the designs with and without data sharing in detail. Considering a kernel with d1 × d2 blocks, without data sharing we would need to fetch d1 × d2 tiles from matrix A and d1 × d2 tiles from matrix B to compute d1 × d2 tiles of matrix C in parallel. With data sharing we only need to fetch d1 and d2 tiles from matrices A and B respectively, reducing both the off-chip bandwidth consumption and the buffers needed for data transfer.
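To see the bandwidth effect numerically, a small back-of-the-envelope check (ours; it simply counts tiles moved per scheduling step using the O(·) terms of Table 2):

```c
/* Tile traffic per scheduling step, with and without data sharing
 * (a sketch based on Table 2's bandwidth terms). */
#include <stdio.h>

int main(void) {
    int d1 = 2, d2 = 2;
    /* with sharing: d1 reads of A, d2 reads of B, d1*d2 writes of C */
    int shared   = d1 + d2 + d1 * d2;
    /* without sharing: every block fetches its own A and B tile, writes C */
    int unshared = 3 * d1 * d2;
    /* for d1 = d2 = 2: 8 vs 12 tiles, i.e. a third of the traffic saved */
    printf("tiles/step w/ sharing: %d, w/o sharing: %d\n", shared, unshared);
    return 0;
}
```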

D. Interface Design

The interface module takes the arguments from the host processor and transfers data between the DDR memory and the kernel. The buffers which store the data of matrices A, B, and C are connected to the kernel through FIFOs, and to the DDR memory via the AXI4 bus. The interface module is also wrapped to present a standard interface to the host processor, providing portability of the accelerator.

Fig. 6: Effective bandwidth test results. (a) Effective bandwidth (GB/s) for a single AXI4 IP, measured across different burst lengths (KByte) and AXI4 data widths (32-bit, 128-bit, 512-bit). (b) Effective bandwidth (GB/s) for one to three AXI4 IPs, each with 512-bit data width, measured for burst lengths of 64, 512, 16384, and 65536 Bytes.

We implement double buffering to overlap the computation inside the kernel with the data communication with the external DDR memory. When the buffers for matrix C hold intermediate results, the data will be fetched again and accumulated with the new results in the next tiled MM operation, so the contents of these buffers are not transferred to the external DDR memory. This cuts down unnecessary off-chip communication.

Note that the effective bandwidth between the external DDR memory and the FPGA chip is strongly affected by 1) the AXI4 data width of each IP that fetches data from the DDR memory via the AXI4 bus, 2) the number of AXI4 IPs attached to the bus, and 3) the burst length of each read/write request issued by an IP. This in turn affects the overall performance of the accelerator. Therefore, we run a simple bandwidth test on the Xilinx VC709 evaluation board, as shown in Figure 6. We set the frequency of the DDR controller and the data width between the DDR controller and the AXI4 bus to their maximums, and then test the effects of the three factors above. In Figure 6a we find that, keeping the AXI4 data width unchanged, the effective bandwidth first increases with the burst length and then saturates once the burst length becomes large enough; increasing the AXI4 data width improves the bandwidth as well. In Figure 6b, the more IPs, the higher the bandwidth. In conclusion, increasing all three factors improves the effective bandwidth. However, these benefits do not come for free. Increasing the number of AXI4 IPs increases the complexity of the on-chip AXI4 interconnect, which causes timing closure problems and occupies a large amount of area when there are too many IPs. Increasing the burst length requires data reorganization effort on the host processor side. In this work, we choose 512-bit as the AXI4 data width and implement three AXI4 IPs, one each for matrices A, B, and C. We also reorganize the matrix data stored in the DDR memory so that an entire tile is read/written in one burst transfer, increasing the burst length from S words to S × S words.
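The reorganization itself is straightforward on the host side. A minimal sketch (the function name and layout details are ours; it assumes a row-major input already padded to multiples of S):

```c
/* Repack a row-major Mp x Np matrix into tile-major order so that each
 * S x S tile occupies S*S consecutive words, allowing the interface to
 * read a whole tile in one long burst (a sketch; names are ours). */
#include <stddef.h>

void pack_tile_major(const float *rm, float *tm, int Mp, int Np, int S) {
    size_t out = 0;
    for (int ti = 0; ti < Mp / S; ti++)        /* tile row */
        for (int tj = 0; tj < Np / S; tj++)    /* tile column */
            for (int r = 0; r < S; r++)        /* rows within the tile */
                for (int c = 0; c < S; c++)    /* S*S consecutive output words */
                    tm[out++] = rm[(size_t)(ti * S + r) * Np + (tj * S + c)];
}
```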


TABLE 3: Hardware and workload specification format and a simple example.

Hardware                         MM Workloads
Type of Resource    Available    M     K     N     Batch Size
SliceLUT            12475        356   23    464   3
SliceReg            21552        256   256   256   20
BRAM                800          125   345   678   3
DSP                 2546
Bandwidth (GB/s)    12.8

This adds overhead on the host processor side, but is rather necessary. As shown in Figure 6b, for S = 128 we can only achieve an effective bandwidth of 3.43 GB/s with 3 AXI4 IPs, which is only 38% of the achievable peak (9.07 GB/s when using 3 IPs with a burst length of 65536 Bytes).

IV. DESIGN SPACE EXPLORATION

In this section we introduce our automated generator, which produces the optimal design parameters (d1, d2, S) via design space exploration. The generator takes the hardware resource and workload specifications as inputs, in the format shown in Table 3, and outputs the optimal design parameters (d1, d2, S) for the accelerator. The objective is to minimize the execution time of the MM workloads under the hardware resource constraints. The generator works in two modes: 1) SINGLE and 2) CROSS. In SINGLE mode, the generator produces the optimal design parameters separately for each MM workload; different bitstreams are then needed to handle different workloads. In CROSS mode, the generator finds the single set of design parameters that minimizes the total execution time, so the FPGA device is configured only once. The optimization problem for CROSS mode is formulated as follows:

minimize over (d1, d2, S):    Σ_i exec_time(M_i, N_i, K_i, d1, d2, S)

subject to:   SliceLUT_used(d1, d2, S) ≤ SliceLUT_total
              SliceReg_used(d1, d2, S) ≤ SliceReg_total
              BRAM_used(d1, d2, S) ≤ BRAM_total
              DSP_used(d1, d2, S) ≤ DSP_total
              BW_used(d1, d2, S) ≤ BW_total                    (1)

The MM workloads are defined by the matrix dimensions M_i, N_i, K_i. The hardware resource specifications are defined by the amounts available on chip, denoted SliceLUT_total, SliceReg_total, BRAM_total, DSP_total, and BW_total in Equation 1.

The next problem is to estimate the resource utilization for different design parameters (d1, d2, S). This is done by multiple linear regression (MLR) based on several predefined experiments. The architecture of our accelerator is linear, which makes the resource utilization linear in the design parameters; MLR therefore fits our accelerator well. We first write the estimation function for each type of resource with undetermined coefficients, choosing the independent variables based on an analysis of our accelerator architecture. Then we use several predefined experiments to obtain the input data for MLR and generate the coefficients.

For example, to estimate SliceReg utilization, we first pick the independent variables (Var.) listed in Table 4. These variables are closely tied to the accelerator architecture. Take the registers in the kernel as an example: O(d1 × S) registers are used for A_buf, O(d2 × S) registers for B_buf, B_work0, and B_work1, O(d1 × d2 × S) registers inside the MACs of the d1 × d2 blocks, and O(S × log2 S) registers for control signals. Based on this analysis, we obtain the SliceReg utilization function for the kernel module:

kernel_SliceReg_used(d1, d2, S) = b + c0 × d1 × S + c1 × d2 × S + c2 × d1 × d2 × S + c3 × S × log2 S    (2)

We first use these variables as the independent variables of the MLR, and then filter out the variables whose P-values fail the significance test at the chosen significance level. This ensures that the regression function is statistically significant, making it reliable for estimating resource utilization. We use 12 predefined experiments on the Xilinx VC709 evaluation board to generate the MLR coefficients (Coeff.) for SliceReg, as listed in Table 4. We also test our fitted function on other designs, where it shows high prediction accuracy; some results are shown in Table 5.
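For illustration, the fitted model can be evaluated directly from the Table 4 coefficients. The sketch below (function name ours) sums the kernel, interface-buffer, and interface-FIFO estimates; since the paper's final model may include or drop terms according to the significance test, the result is close to, but not necessarily identical to, the Table 5 predictions:

```c
/* Evaluate the fitted SliceReg model (Equation 2 plus the interface terms)
 * using the Table 4 coefficients -- a sketch, not the paper's solver. */
#include <math.h>
#include <stdio.h>

double slicereg_estimate(int d1, int d2, int S) {
    double kernel = 482.22 + 5.68 * S * log2((double)S)
                  + 65.48 * d1 * S + 128.14 * d2 * S
                  + 542.25 * d1 * d2 * S;               /* kernel column */
    double buffer = 9542.33 + 411.0 * d1 + 530.33 * d2
                  + 889.0 * d1 * d2;                    /* interface buffers */
    double fifo   = 92.0 * (d1 + d2 + d1 * d2);         /* interface FIFOs */
    return kernel + buffer + fifo;
}

int main(void) {
    /* compare against Table 5's row (d1, d2, S) = (2, 1, 128) */
    printf("estimated SliceReg: %.0f\n", slicereg_estimate(2, 1, 128));
    return 0;
}
```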

Note that these coefficients are platform-dependent, due to the different configurations of these resources on different devices. For example, the different LUT configurations, such as LUT-4 and LUT-6 in different Xilinx devices, affect the LUT resources consumed by the kernel. However, the linearity of the design itself ensures that the MLR methodology remains reliable across platforms. The number of experiments required is small (we employ MLR based on 12 predefined experiments to generate the coefficients in Table 4), which keeps the methodology practical. Most importantly, this work does not need to be done by users: we run the experiments across different platforms and package the solver with these coefficients in advance.

The optimization problem defined in Equation 1 for our design space exploration is a nonlinear discrete optimization problem, because the design parameters (d1, d2, S) are discrete and the constraint functions involve nonlinear terms. Generally, there are two approaches to solving such problems [11]: 1) exhaustive enumeration and 2) constraint relaxation with heuristic rounding. The first approach is simple to implement but often impractical. The second approach first relaxes the integer constraints and solves the problem with nonlinear methods; when a non-integer solution is obtained, it is rounded to an integer heuristically. However, the solutions it finds may be suboptimal depending on the problem formulation and the heuristic adopted. We nevertheless adopt the first approach and implement it as a C program, for two reasons: this approach is simple to implement, and the runtime of enumeration is actually low.


TABLE 4: MLR coefficients for SliceReg. We estimate the kernel and interface modules separately. The interface module is further divided into the FIFOs between the interface and the kernel, and the buffers storing matrices A, B, and C.

Kernel                      Interface: Buffer       Interface: FIFO
Var.          Coeff.        Var.       Coeff.       Var.       Coeff.
b             482.22        b          9542.33      d1         92.00
S × log2 S    5.68          d1         411.00       d2         92.00
d1 × S        65.48         d2         530.33       d1 × d2    92.00
d2 × S        128.14        d1 × d2    889.00
d1 × d2 × S   542.25

TABLE 5: Testing results for SliceReg. The following figures are all from place-and-route results. We use the fitted function with the coefficients shown in Table 4 to estimate the SliceReg utilization for different design parameters.

d1   d2   S     MLR Prediction   Actual Utilization   Relative Error
2    1    128   189202           190636               -0.75%
2    2    128   346024           348110               -0.60%
1    1    256   213633           211928               0.80%
2    1    256   371795           368967               0.77%

In practice, the enumeration normally finishes within one second on a standard workstation.

The goal of the algorithm is to find the optimal parameters that minimize the computation time of the MM workloads under the available hardware resources. In SINGLE mode, for each MM workload we enumerate all the feasible design parameters that satisfy the resource constraints stated in Equation 1, and then choose the one with the least execution time for that workload. The algorithm for CROSS mode is quite similar, except that we enumerate all the feasible parameter sets against all the MM workloads, and pick the set achieving the least total execution time across the workloads.
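A minimal sketch of the SINGLE-mode enumeration follows (ours, not the paper's C program; the resource check here is a stand-in for the MLR estimators of Section IV, capping the total PE count as a proxy for the DSP budget):

```c
/* Exhaustive enumeration for SINGLE mode (a sketch; the candidate ranges
 * and the resource constraint are illustrative stand-ins). */
#include <stdio.h>

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* cycles for one workload under Algorithm 1's schedule */
static long exec_cycles(long M, long N, long K, int d1, int d2, int S) {
    return ceil_div(M, (long)S * d1) * ceil_div(N, (long)S * d2)
         * ceil_div(K, S) * (long)S * S;
}

int main(void) {
    long M = 512, N = 512, K = 512;          /* one MM workload */
    long best = -1; int bd1 = 0, bd2 = 0, bS = 0;
    for (int S = 32; S <= 512; S *= 2)
        for (int d1 = 1; d1 <= 4; d1++)
            for (int d2 = 1; d2 <= 4; d2++) {
                /* stand-in constraint: budget of 512 PEs (d1 * d2 * S) */
                if (d1 * d2 * S > 512) continue;
                long t = exec_cycles(M, N, K, d1, d2, S);
                if (best < 0 || t < best) { best = t; bd1 = d1; bd2 = d2; bS = S; }
            }
    printf("best (d1,d2,S) = (%d,%d,%d), %ld cycles\n", bd1, bd2, bS, best);
    return 0;
}
```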

The automated generator enables the accelerator to beeasily integrated into the user’s RTL/HLS designs withoutany modification from the user perspective.

V. EXPERIMENTAL RESULTS

The interface module of the accelerator is written in C and compiled to RTL by Xilinx HLS. The kernel is written in RTL manually. All the following designs are synthesized and implemented with Xilinx Vivado 2015.1, and tested on the Xilinx VC709 evaluation board. The MicroBlaze processor runs at 100 MHz, and all the other peripheral IPs, including the accelerator, run at 200 MHz. The on-board execution time is measured by the Xilinx AXI Timer IP. The accelerator performs single-precision computation.

In this section, we will first give a comparison of ouraccelerator with different design parameters. Then, we willcompare our accelerator with other MM FPGA acceleratordesigns.

A. Accelerator with Different Parameters

In this section we compare our accelerator under different design parameters. Three configurations, denoted Config.1, Config.2, and Config.3, are compared in Table 6.

TABLE 6: Design comparison with different parameters.

                     Config.1   Config.2   Config.3
S                    256        128        64
(d1, d2)             (1,1)      (2,1)      (2,2)
SliceLUT             98420      97301      99084
SliceReg             211928     190636     182042
BRAM                 599        402.5      378
DSP                  1311       1311       1314
BW (GB/s)            2.4        4          6.4
Actual GFLOPs        97.9       100.7      101.6
Theoretical GFLOPs   102.4      102.4      102.4
Efficiency (%)       96         98         99

As the table shows, all the designs achieve high performance efficiency (actual GFLOPs / theoretical GFLOPs) due to the proper configuration of the interface module. As for resource utilization, Config.2 saves SliceRegs compared to Config.1 by sharing the data from matrix B between its 2 blocks, while Config.3 reduces SliceReg utilization further by also sharing the data from matrix A among all its blocks. BRAM utilization varies widely across configurations. Both the interface and the kernel consume BRAMs; the BRAM utilization of the kernel, for example, can be estimated as c × S² × d1 × d2. Config.2 therefore uses (128² × 2 × 1)/(256² × 1 × 1) × 100% = 50% of Config.1's kernel BRAM, a 50% reduction. Similarly, Config.3 offers a 50% reduction compared to Config.2, and a 75% reduction compared to Config.1.

All the designs offer similar GFLOPs performance while varying in resource utilization. For different hardware resource and MM workload specifications, our accelerator thus gives the user a design space to explore in order to achieve optimal performance.

B. Comparison with Other MM FPGA accelerators

We compare our accelerator's performance with previous designs. We select the works in [10], [12] as baselines, which show the best GFLOPs performance to date; the baseline designs use the VC709 board as well. We run the generator in SINGLE mode to select the optimal design parameters for the MM workloads listed in Table 7. The optimal design parameters are (2, 2, 128) for both workloads. In this configuration, we place four blocks in the kernel, reading two tiles from matrix A and two tiles from matrix B and distributing them to the four blocks at the same time; four tiles of matrix C are generated in parallel. We place 128 PEs inside each block, for 128 × 2 × 2 = 512 PEs in total.


TABLE 7: Design comparison with other MM FPGA accelerators.

(a) Workload1 (M,N,K) = (512, 512, 512)
             GFLOPs   Slice Used   Slice %   GFLOPs/Slice (×10⁻³)
[10], [12]   279.0    99585        92        2.1
Ours         198.1    75436        70        2.6

(b) Workload2 (M,N,K) = (256, 256, 256)
             GFLOPs   Slice Used   Slice %   GFLOPs/Slice (×10⁻³)
[10], [12]   153.6    56404        52        1.8
Ours         198.1    75436        70        2.6

We first compare the GFLOPs performance of our accelerator and the baselines. In [10], [12], GFLOPs performance is calculated from place-and-route results, at frequencies of 274 MHz (Workload 1) and 300 MHz (Workload 2) respectively. However, designs with high resource utilization rarely achieve such high frequencies in practice. Our results come from on-board testing, which is more convincing in practice. Besides, the slice utilization in the baselines is reported for the kernel part only, excluding the interface; the reported utilization of our accelerator therefore excludes the interface part as well.

Based on the analysis above, the comparison of resource utilization becomes more meaningful. The slice utilization of our accelerator is lower than the baseline's in the first workload, where both use the same number of PEs, and higher in the second workload, where we use more PEs than their design. To offer a fair comparison, we further compare the performance density of the two designs, using GFLOPs/Slice as the metric. To calculate the performance density of the baselines, we use their GFLOPs projected to a working frequency of 200 MHz. The performance density of our design is higher than the baselines', indicating that our accelerator consumes fewer resources when providing the same GFLOPs as the baseline design. The high resource efficiency of our design comes from the multiple optimization strategies we adopt, including the data sharing among the blocks in the kernel.

Overall, in this section we first compared our accelerator under different design parameters, showing the high flexibility of the accelerator. The design space formed by the architecture enables the automatic generator to produce the design achieving the highest throughput and resource efficiency based on the user's hardware resource and MM workload specifications. We then compared our design with previous designs, showing high sustained performance and high resource efficiency.

VI. CONCLUSION

In this work we propose a high-performance MM accelerator for large-scale MM workloads. We adopt the linear systolic array architecture as the building block and group several blocks together. We also optimize the architecture with data sharing, which enables the high resource efficiency of the design. The size and the number of blocks are parameterized, forming a design space explored by an automated generator that searches for the optimal design parameters, providing the highest throughput based on the hardware resource and MM workload specifications from the user side. The accelerator can be easily integrated into current HLS tools as a library component without any modification from the user side. Testing results show that our accelerator provides sustained high performance across different workloads with high resource efficiency, and offers a peak performance of 198.1 GFLOPs on the Xilinx VC709 evaluation board.

Linear algebra routines play an important role in HPC. With great effort in both industry and academia to integrate modern FPGAs into warehouse-scale data centers, the need for high-performance linear algebra accelerators handling large-scale workloads is becoming more urgent. This work accelerates the MM routine, the most common kernel in linear-algebra-intensive applications. A necessary and challenging direction for future work is migrating this methodology to the acceleration of other linear algebra routines, such as sparse matrix multiplication, LU decomposition, etc.

REFERENCES

[1] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, June 2014, pp. 13–24.

[2] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, "Accelerating deep convolutional neural networks using specialized hardware," February 2015. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=240715

[3] J. Gonzalez and R. C. Núñez, "LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators," Journal of Physics: Conference Series, vol. 180, no. 1, p. 012042, 2009.

[4] J.-w. Jang, S. Choi, and V. Prasanna, "Energy- and time-efficient matrix multiplication on FPGAs," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 13, no. 11, pp. 1305–1319, Nov 2005.

[5] V. Kumar, S. Joshi, S. Patkar, and H. Narayanan, "FPGA based high performance double-precision matrix multiplication," in VLSI Design, 2009 22nd International Conference on, Jan 2009, pp. 341–346.

[6] Floating-Point IP Cores User Guide, Altera.

[7] A. Amira and F. Bensaali, "An FPGA based parameterizable system for matrix product implementation," in Signal Processing Systems (SIPS '02), IEEE Workshop on, Oct 2002, pp. 75–79.

[8] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, "64-bit floating-point FPGA matrix multiplication," in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, ser. FPGA '05. New York, NY, USA: ACM, 2005, pp. 86–95.

[9] Z. Jovanovic and V. Milutinovic, "FPGA accelerator for floating-point matrix multiplication," Computers Digital Techniques, IET, vol. 6, no. 4, pp. 249–256, July 2012.

[10] K. Matam, H. Le, and V. Prasanna, "Evaluating energy efficiency of floating point matrix multiplication on FPGAs," in High Performance Extreme Computing Conference (HPEC), 2013 IEEE, Sept 2013, pp. 1–6.

[11] J.-P. Vert, "Nonlinear optimization: Discrete optimization," Spring 2006. [Online]. Available: http://cbio.ensmp.fr/~jvert/teaching/2006insead/slides/9_discrete/discrete.pdf

[12] K. Matam and V. Prasanna, "Energy-efficient large-scale matrix multiplication on FPGAs," in Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on, Dec 2013, pp. 1–8.
