Implementation of Elementary Functions for FPGA Compute Accelerators

Spenser Gilliland and Jafar Saniie
Illinois Institute of Technology
Chicago, Illinois 60616

Fernando Martinez Vallina
Xilinx, Inc.
San Jose, California 95124

Abstract—Field programmable gate arrays (FPGA) are growing from the role of glue logic into the area of application acceleration and compute. This is fostered by advances in silicon technologies as well as standards based methodologies for interacting with heterogeneous compute resources. As these standards generally require the implementation of elementary functions, this work outlines the implementation and evaluation of the elementary functions required by the heterogeneous programming standard OpenCL. It outlines the implementation of the math "builtin" functions using CORDIC methods and details the process used to benchmark the resource usage, maximum frequency, and latency of each function on Xilinx 7 Series FPGAs. Because of the applicability and standardization of the OpenCL math functions, this benchmarking effort provides a basis for understanding and analyzing future implementations.

I. INTRODUCTION

Mobile, data center, and high performance computing (HPC) are rapidly approaching a thermal and power wall due to the inefficiencies of homogeneous compute architectures. This has caused attention to shift from performance per dollar towards performance per watt as the primary metric that defines overall system performance. The change in focus has been quantified as part of the Green 500 [1] benchmarking effort. Due to these changes, modern computing systems are increasingly defined by a heterogeneous architecture, which includes one or more general purpose processors augmented by accelerators. Traditionally, these accelerators have been graphics processing units (GPU) due to their inclusion in most commercially available computing systems. However, FPGAs are an interesting competitor in this market due to their ability to achieve higher performance per watt for many applications.

With the advent of OpenCL, there now exists a de facto common language to express hardware acceleration on heterogeneous compute architectures. As the OpenCL kernel language is based on the C99 standard [2], FPGAs utilize high level synthesis (HLS) tools to create FPGA configurations from the kernel source code. HLS tools have made major strides in recent years and represent a possible alternative to register transfer level (RTL) schematic capture in many applications [3]. However, a major remaining step is the implementation of optimized standard math libraries compliant with the OpenCL standard.

The primary contribution of this paper is the implementation and benchmarking of the single precision math functions required by the embedded profile of the OpenCL 2.0 specification. Each implementation is benchmarked for latency, maximum frequency, and resource usage.

A. Hardware Accelerators and Heterogeneous Systems

The introduction of general purpose compute accelerators has increased the overall efficiency of compute intensive tasks. Accelerators generally improve throughput by trading single thread performance for a higher density of cores. When applications can take advantage of multiple cores, performance is greatly improved by offloading the parallel tasks to an accelerator.

Fig. 1. Median Data Center Cost Breakdown

In data center environments, the primary objective is to achieve the lowest total cost of ownership for a workload. As most data center systems have a service life of between three and five years, the power and operational budget of such a system will generally make up approximately 15 percent of the total costs per year [4]. A total budget breakdown per year is shown in figure 1. As hardware accelerators offer increases in efficiency, they allow systems to use less power, which allows operators to either increase density or decrease total power usage.

Current HPC methodologies focus on using a large number of general purpose processors. General purpose processors are designed to maximize single-core performance. As a result, designers utilize high transistor counts per processor, high voltages, and high frequencies to maximize performance. By using methods such as branch prediction, multi-level cache structures, and scoreboarding, designers have increased single-core performance at the expense of idle transistors. These techniques limit the number of transistors that are being used for computation at any one time in exchange for lower latency in single-core applications. Additionally, by increasing frequencies and voltages to extract the maximum performance from the few transistors that are active in any given application, static and dynamic power, as well as heat dissipation, increase exponentially.

GPU-based heterogeneous computing is starting to establish a foothold in the HPC market. GPUs are simpler processors, traditionally classified in the Flynn taxonomy as single instruction multiple data (SIMD). The processors in these accelerators may have lower clock speeds and higher latencies than CPUs but make up for this deficiency by including an order of magnitude more cores per device.

FPGAs are an emerging competitor to GPUs for accelerated processing. FPGAs utilize a heterogeneous configurable fabric that interconnects DSP, slice, BRAM, and IOB elements to create virtually any synchronous circuit. These devices are able to express extremely fine-grained parallelism and an optimized pipeline based on the complexity and requirements of the application.

FPGAs are currently being evaluated for their possible use as a compute platform in several applications. Notably, they were recently selected by Microsoft for the implementation of search algorithms [5]. This has led to broader interest in FPGA-based computing for high performance computing and data center applications.

B. OpenCL

OpenCL is a standard that allows interoperability between accelerators and algorithm developers. Its sole purpose is to enable the use of heterogeneous compute elements within a general purpose computing environment. This enables a standard methodology for utilizing computation accelerators such as GPUs, multiprocessors (MP), FPGAs, and application specific integrated circuits (ASIC).

OpenCL divides the problem of heterogeneous compute platforms into two major components: the host application programming interface (API) and computation kernels. Compute kernels are written in a subset of the C99 programming standard, which must be fully supported by all OpenCL compliant accelerators. This enables kernels and host code to be functionally portable between accelerators.

The host API primarily provides functions to queue computation kernels, copy data to devices, and retrieve data from device accessible memory. As shown in figure 2, a host will accelerate applications using the following pattern: copy data to device accessible memory; initiate the compute kernel; finally, copy results from device to host accessible memory. Other routines are available on the host, including device discovery and kernel compilation. Host code runs on a general purpose computing system such as an x86 or ARM processor.
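A minimal host-side sketch of this pattern is shown below. It assumes the context, command queue, kernel, and device buffers have already been created, and omits error checking for brevity; the function and argument names are ours, not from the paper.

    #include <CL/cl.h>

    /* Sketch of the copy-in, execute, copy-out pattern described above. */
    void run_kernel(cl_command_queue q, cl_kernel k,
                    cl_mem dev_in, cl_mem dev_out,
                    const float *host_in, float *host_out, size_t n)
    {
        size_t bytes = n * sizeof(float);

        /* 1. Copy input data to device accessible memory. */
        clEnqueueWriteBuffer(q, dev_in, CL_TRUE, 0, bytes, host_in,
                             0, NULL, NULL);

        /* 2. Initiate the compute kernel over n work items. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &dev_in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &dev_out);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* 3. Copy results from device to host accessible memory. */
        clEnqueueReadBuffer(q, dev_out, CL_TRUE, 0, bytes, host_out,
                            0, NULL, NULL);
    }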

C. OpenCL Math Builtins

Kernels have access to a library of standard functions or "builtins." These include elementary math functions, image manipulation functions, conversion routines, etc. To be compliant with the OpenCL specification, each "builtin" must produce the same results as a reference function, either exactly or within a specified error. The main benefit of OpenCL requiring full compliance is that kernels written for one device will be functionally portable between different accelerators.
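For illustration, the short kernel below uses the sin and cos builtins; the kernel name and arguments are hypothetical, but any compliant device, GPU or FPGA, must evaluate these builtins within the error bounds the specification assigns to them.

    /* Each work item rotates one 2-D point by theta using the
       sin and cos math builtins. */
    __kernel void rotate2d(__global float2 *v, const float theta)
    {
        size_t i = get_global_id(0);
        float s = sin(theta);
        float c = cos(theta);
        float2 p = v[i];
        v[i] = (float2)(c * p.x - s * p.y, s * p.x + c * p.y);
    }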

Fig. 2. OpenCL context with multiple queues

II. ALGORITHMS FOR ELEMENTARY FUNCTION APPROXIMATIONS

A. Introduction

In FPGAs, all functions must be implemented using a combination of DSP, LUT, and BRAM elements. As each of these elements is limited within a given FPGA, a general purpose implementation of the elementary functions should attempt to reduce resource usage to a minimum. This allows optimal flexibility for the HLS tools to pipeline and replicate the implementation. For this reason, a CORDIC based implementation was chosen. CORDIC algorithms are a class of shift-and-add algorithms which utilize the rotation of a vector to achieve convergence.

1) Unit in the Last Place: The primary goal of this proposal is the implementation of all functions required by the OpenCL 2.0 Embedded Profile Specification. The main metric used to describe the accuracy of an implementation is the unit in the last place (ULP). By Definition 1, an error of 0 ULP means a function returns the exact answer, and a bound of 0.5 ULP means a function must be correctly rounded.

Definition 1 (Unit in the Last Place): If x lies between two finite consecutive floating-point numbers a and b without being equal to either of them, then ulp(x) = |b - a|; otherwise, ulp(x) is the distance between the two finite floating-point numbers nearest x.
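As a concrete illustration, a minimal C sketch of an ULP-error measurement in the spirit of Definition 1 is shown below; the helper is our own, not from the paper, and it ignores edge cases such as exponent boundaries, infinities, and NaN.

    #include <math.h>

    /* Error of a single-precision result against a double-precision
       reference, expressed in ULPs near the reference value. */
    double ulp_error(float approx, double ref)
    {
        float  lo  = (float)ref;                 /* ref rounded to float */
        float  hi  = nextafterf(lo, INFINITY);   /* next float above lo  */
        double ulp = (double)hi - (double)lo;    /* local ULP size       */
        return fabs((double)approx - ref) / ulp;
    }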

B. CORDIC Algorithms

The CORDIC algorithm was originally developed by Volder [6] and adapted to the common case by Walther [7]. The main reference used throughout this implementation was [8]. However, it is also well described in other books such as [9]; there is an excellent overview of the hardware implementation in [10], and error bounding is well described in [11]. In addition, some computations require double rotations, which are described in [12] and [13]. The rotation of a vector using CORDIC is shown in figure 3.

The CORDIC iteration is defined as

    \begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} =
    \begin{pmatrix} 1 & -d_n 2^{-n} \\ d_n 2^{-n} & 1 \end{pmatrix}
    \begin{pmatrix} x_n \\ y_n \end{pmatrix}    \tag{1}

    t_{n+1} = t_n - d_n \arctan(2^{-n})
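A minimal fixed-point sketch of iteration (1) is given below. The 32-bit format, iteration count, and mode-selection flag are illustrative assumptions rather than details from the paper; an HLS implementation would typically use arbitrary-precision fixed-point types and a ROM for the arctangent table.

    #include <stdint.h>

    #define ITERS 30   /* illustrative iteration count */

    /* One CORDIC pass per equation (1). atan_tab[n] holds arctan(2^-n)
       in the same fixed-point format as t. Arithmetic right shift of
       signed values is assumed, as on typical targets. */
    void cordic_iterate(int32_t *x, int32_t *y, int32_t *t,
                        const int32_t atan_tab[ITERS], int rotation_mode)
    {
        for (int n = 0; n < ITERS; n++) {
            /* d_n = sign(t_n) in rotation mode, -sign(y_n) in
               vectoring mode (see the mode descriptions below). */
            int d = rotation_mode ? (*t >= 0 ? 1 : -1)
                                  : (*y < 0 ? 1 : -1);
            int32_t xn = *x - d * (*y >> n);   /* x - d 2^-n y       */
            int32_t yn = *y + d * (*x >> n);   /* y + d 2^-n x       */
            *t -= d * atan_tab[n];             /* t - d arctan(2^-n) */
            *x = xn; *y = yn;
        }
    }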

1) Rotation Mode: Rotation mode is a method for computing the sin and cos of t_0. In this mode, d_n is chosen such that d_n = sign(t_n).


Fig. 3. One iteration of the CORDIC algorithm

    x_n = K (x_0 \cos(t_0) - y_0 \sin(t_0))    \tag{2}
    y_n = K (y_0 \cos(t_0) + x_0 \sin(t_0))
    t_n = 0

This can be used to compute sin and cos simultaneously using the following iteration scheme.

    x_0 = 1/K         x_n = \cos(\theta)    \tag{3}
    y_0 = 0           y_n = \sin(\theta)
    t_0 = \theta      t_n = 0
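Building on the cordic_iterate sketch above, the scheme in (3) becomes a short wrapper; INV_K is a hypothetical precomputed constant holding 1/K in the chosen fixed-point format.

    /* Rotation-mode wrapper for (3): x0 = 1/K, y0 = 0, t0 = theta. */
    void cordic_sincos(int32_t theta, const int32_t atan_tab[ITERS],
                       int32_t *s, int32_t *c)
    {
        int32_t x = INV_K;   /* hypothetical constant: 1/K, fixed point */
        int32_t y = 0;
        int32_t t = theta;
        cordic_iterate(&x, &y, &t, atan_tab, 1 /* rotation mode */);
        *c = x;              /* x_n -> cos(theta) */
        *s = y;              /* y_n -> sin(theta) */
    }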

2) Vectoring Mode: In vectoring mode, the direction d_n is chosen such that d_n = -sign(y_n).

    x_n = K \sqrt{x_0^2 + y_0^2}    \tag{4}
    y_n = 0
    t_n = t_0 + \arctan(y_0 / x_0)

This can be used to compute the atan, atan2, and hypot functions simultaneously using the following iteration scheme.

    x_0 = x_0 / K     x_n = \sqrt{x_0^2 + y_0^2}    \tag{5}
    y_0 = y_0 / K     y_n = 0
    t_0 = 0           t_n = \arctan(y_0 / x_0)

3) sin^{-1} and cos^{-1} Mode: A number of papers have been written on the topic of sin^{-1} and cos^{-1} computation using CORDIC [11], [14], [15]. The method used for implementation is described in [8].

To determine sin^{-1}(\theta), the following algorithm based on double rotations may be used.

    \begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} =
    \begin{pmatrix} 1 & -d_n 2^{-n} \\ d_n 2^{-n} & 1 \end{pmatrix}^{2}
    \begin{pmatrix} x_n \\ y_n \end{pmatrix}    \tag{6}

    t_{n+1} = t_n + 2 d_n \arctan(2^{-n})
    v_{n+1} = v_n + v_n 2^{-2n}

In this algorithm, d_n is chosen such that d_n = sign(x_n) when y_n \le v_n, and d_n = -sign(x_n) otherwise.

    v_n = K^2 v_0    \tag{7}
    x_n = \sqrt{(K^2 x_0)^2 - c^2}
    y_n = v_n
    t_n = t_0 + \sin^{-1}\left( \frac{v_0}{K^2 x_0} \right)

Thus, by using the following initial arguments, it is possible to compute sin^{-1}.

    v_0 = \theta      v_n = K^2 v_0    \tag{8}
    x_0 = 1/K^2       x_n = \sqrt{1 - v_0^2}
    y_0 = 0           y_n = v_n
    t_0 = 0           t_n = \sin^{-1}(v_0)

A similar method can be used to determine cos^{-1}.
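As a concrete illustration of the double-rotation recurrence, below is a floating-point sketch following the formulation of Mazenc et al. [11], which starts from x_0 = 1 and returns only the angle accumulator; fixed-point conversion and argument range handling are left out.

    #include <math.h>

    #define AITERS 30   /* illustrative iteration count */

    /* Double-rotation arcsin per (6), in the Mazenc et al. [11]
       variant with x0 = 1, y0 = 0, t0 = 0 and growing target v. */
    double cordic_asin(double v0)   /* requires |v0| <= 1 */
    {
        double x = 1.0, y = 0.0, t = 0.0, v = v0;
        for (int n = 0; n < AITERS; n++) {
            double p = ldexp(1.0, -n);   /* 2^-n */
            double d = (y <= v) ? copysign(1.0, x) : -copysign(1.0, x);
            /* Apply the squared rotation matrix of (6). */
            double xn = (1.0 - p * p) * x - 2.0 * d * p * y;
            double yn = 2.0 * d * p * x + (1.0 - p * p) * y;
            t += 2.0 * d * atan(p);      /* t += 2 d_n arctan(2^-n) */
            v += v * p * p;              /* v += v 2^-2n            */
            x = xn; y = yn;
        }
        return t;                        /* t_n -> asin(v0) */
    }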

4) Hyperbolic Functions: The hyperbolic functions may be computed using CORDIC with minor changes. To allow a CORDIC core to implement both hyperbolic and trigonometric functions, we introduce the parameter m.

    \begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} =
    \begin{pmatrix} 1 & -m d_n 2^{-n} \\ d_n 2^{-n} & 1 \end{pmatrix}
    \begin{pmatrix} x_n \\ y_n \end{pmatrix}    \tag{9}

    t_{n+1} = t_n - d_n \tanh^{-1}(2^{-n})

When utilizing a lookup table for both tanh^{-1} and tan^{-1}, the hyperbolic and trigonometric functions may be implemented in the same CORDIC core.
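A sketch of such a unified kernel follows, reusing the types of the cordic_iterate sketch above; e_tab holds arctan(2^-n) for the circular mode (m = +1) or tanh^-1(2^-n) for the hyperbolic mode (m = -1). Note that hyperbolic CORDIC must start at n = 1 and, for guaranteed convergence, repeat certain iterations (e.g. n = 4, 13), which is omitted here for brevity.

    /* Unified circular/hyperbolic iteration per (9). */
    void cordic_unified(int32_t *x, int32_t *y, int32_t *t,
                        const int32_t e_tab[ITERS], int m,
                        int rotation_mode)
    {
        for (int n = (m == 1) ? 0 : 1; n < ITERS; n++) {
            int d = rotation_mode ? (*t >= 0 ? 1 : -1)
                                  : (*y < 0 ? 1 : -1);
            int32_t xn = *x - m * d * (*y >> n);   /* x - m d 2^-n y */
            int32_t yn = *y + d * (*x >> n);       /* y + d 2^-n x   */
            *t -= d * e_tab[n];                    /* atan or atanh  */
            *x = xn; *y = yn;
        }
    }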

III. PERFORMANCE COMPARISON

A. Microbenchmark Analysis

Microbenchmarks that evaluate each elementary function individually have been developed as part of this proposal. One measures the total size of a fully sequential implementation of each function; another tests a pipelined version with 256 different inputs; a third utilizes a test bench to verify accuracy using emulation on a computing cluster.
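As an illustration of the pipelined variant, a hypothetical HLS-style harness is sketched below; cl_sinf stands in for whichever builtin is under test, and the pragma reflects common Vivado HLS usage rather than the paper's actual benchmark code.

    #include <math.h>

    #define N 256
    #define cl_sinf sinf   /* stand-in: link the implementation under test */

    /* Streams N inputs through the function under test so the HLS tool
       reports pipelined latency, initiation interval, and resources. */
    void microbench(const float in[N], float out[N])
    {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1
            out[i] = cl_sinf(in[i]);
        }
    }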


TABLE I. RESULTS OF MICROBENCHMARK ANALYSIS

    Function  BRAM  DSP     FF   LUTs  Delay (ns)
    pown         0   22   4562   5757    8.62
    acos         4    3   2945   9196    8.34
    asin         4    6   6285  19124    8.44
    atan         4    3   3201   8681    8.34
    atan2        4    3   3525   9847    8.74
    cospi        4   13   3116   9394    8.63
    sinpi        4   13   3116   9394    8.63
    tanpi        4   13   4089  10901    8.63
    atanh        6    0   2503   8051    8.54
    cosh         4   26   4738  12719    8.47
    sinh         4   26   4799  12817    8.47
    tanh         4   26   5837  14532    8.47
    cos          4   90   5605   9599    8.34
    sin          4   90   5637   9767    8.42
    tan          4   90   6579  11370    8.42
    log2         2    0   1050   2762    8.34
    exp2         2    2   1483   5343    8.63
    log          2    3   1196   3085    8.34
    log10        2    3   1196   3085    8.34
    exp          2   10   1545   4871    8.34
    exp10        2   10   1579   4939    8.34
    log1p        2    5   1899   4832    8.74
    expm1        2   12   2248   7143    8.34
    powr         4   31   8105  14363    8.63
    pow          4   55  14210  25922    8.73
    rootn        4    0   2867   5722    8.48
    cbrt         4   16   1646   4101    8.34
    sqrt         4    0   1506   3923    8.52
    rsqrt        4    0   2448   5276    8.52
    acosh       10   10   6138  15264    8.74
    asinh        4   13   2522   5966    8.75

1) Latency, Delay, Area, and Power: Latency, combined with a delay specification, is the metric that determines the speed with which an implementation can produce a result. Area is defined as the number of FPGA elements used in the implementation of the elementary function, measured in the number of BRAM, DSP, and LUT elements. The results for all thirty-one implemented functions, from pown through asinh, are listed in table I.

IV. CONCLUSION

As computing systems have evolved, there has been a long trend of using more and more transistors to accomplish a given task. FPGAs turn this idea on its head: they suggest that building highly efficient small blocks and connecting them together will yield a more efficient and more performant solution. Thus, FPGA based parallel processing is a solution to one of the largest problems on the ITRS 2013 road map: Power Management [16].

"Power management is now the primary issue across most application segments due to the 2x increase in transistor count per generation while cost effective heat removal from packaged chips remains almost flat."

REFERENCES

[1] B. Subramaniam, W. Saunders, T. Scogland, and W.-c. Feng, "Trends in Energy-Efficient Computing: A Perspective from the Green500," in 4th International Green Computing Conference, Arlington, VA, June 2013.

[2] ISO/IEC 9899:1999 (E), Programming Languages - C.

[3] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 30, no. 4, pp. 473–491, April 2011.

[4] M. Heikkurinen, S. Cohen, F. Karagiannis, K. Iqbal, S. Andreozzi, and M. Michelotto, "Answering the cost assessment scaling challenge: Modelling the annual cost of European computing services for research," Journal of Grid Computing, pp. 1–24, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10723-014-9302-y

[5] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, June 2014, pp. 13–24.

[6] J. E. Volder, "The CORDIC trigonometric computing technique," Electronic Computers, IRE Transactions on, vol. EC-8, no. 3, pp. 330–334, Sept 1959.

[7] J. S. Walther, "A unified algorithm for elementary functions," in Proceedings of the May 18-20, 1971, Spring Joint Computer Conference, ser. AFIPS '71 (Spring). New York, NY, USA: ACM, 1971, pp. 379–385. [Online]. Available: http://doi.acm.org/10.1145/1478786.1478840

[8] J.-M. Muller, Elementary Functions. Springer, 2006.

[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2009.

[10] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, ser. FPGA '98. New York, NY, USA: ACM, 1998, pp. 191–200. [Online]. Available: http://doi.acm.org/10.1145/275107.275139

[11] C. Mazenc, X. Merrheim, and J. M. Muller, "Computing functions cos^-1 and sin^-1 using CORDIC," Computers, IEEE Transactions on, vol. 42, no. 1, pp. 118–122, Jan 1993.

[12] L. Deng and J. An, "A low latency high-throughput elementary function generator based on enhanced double rotation CORDIC," in Computer Applications and Communications (SCAC), 2014 IEEE Symposium on, July 2014, pp. 125–130.

[13] N. Takagi, T. Asada, and S. Yajima, "Redundant CORDIC methods with a constant scale factor for sine and cosine computation," Computers, IEEE Transactions on, vol. 40, no. 9, pp. 989–995, Sep 1991.

[14] T. Lang and E. Antelo, "CORDIC-based computation of arccos and arcsin," in Application-Specific Systems, Architectures and Processors, 1997. Proceedings., IEEE International Conference on, Jul 1997, pp. 132–143.

[15] ——, "CORDIC vectoring with arbitrary target value," Computers, IEEE Transactions on, vol. 47, no. 7, pp. 736–749, Jul 1998.

[16] The International Technology Roadmap for Semiconductors (ITRS), Executive Summary, 2013. http://www.itrs.net/
