

Implementing Closed-Form Expressions on FPGAs Using the NAL, with Comparison to CUDA GPU and Cell BE Implementations

Robin Bruce, Javier Setoain, Richard Chamberlain, Malachy Devlin & Rosa M. Badia

For more information, you may contact [email protected], [email protected], [email protected], [email protected], [email protected]

Institute for System Level Integration

Abstract

This poster outlines the Nallatech Accelerator Layer (NAL) FPGA programming environment and its relationship to Intel’s Accelerator Abstraction Layer. A general look at FPGAs versus stored-program processors is given. Hardware platforms that support the NAL are presented: the Nallatech H101, the Intel FSB-FPGA Module and the BenOne PCIe.

To demonstrate the NAL system, two closed-form expressions are implemented. Both are evaluated in single-precision floating point and use arithmetic operations and elementary functions. The functions selected were the normal probability density function (PDF) and the Black-Scholes-Merton options pricing formula (BSM). They were implemented on a dual-core Opteron, on a Nallatech H101 card (with a Xilinx Virtex-4 LX160 FPGA) and a BenOne PCIe card (LX160 FPGA) using the NAL, on an NVIDIA G80 using CUDA, and on a Cell BE system using the CellSs programming environment. The aim was to use the same ANSI C code for the kernels in all computing environments. The GPU system showed the best silicon performance for these kernels. Including data transfer times, the BenOne PCIe FPGA platform had the highest performance.


Probability Density Function & Black Scholes

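$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x-\mu}{\sigma}\right), \quad \text{where} \quad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$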

#include <math.h>

#define SIZE 8192
#define RECIP_ROOT2PI 0.39894228040143267793994605993438  /* 1 / sqrt(2*pi) */

/* Standard normal density: exp(-x^2 / 2) / sqrt(2*pi) */
float ndf(float x)
{
    float x_sqrd;
    float exp_in;
    float result;

    x_sqrd = x * x;
    exp_in = ldexpf(x_sqrd, -1);              /* x^2 / 2 via an exponent shift */
    result = RECIP_ROOT2PI * expf(-exp_in);
    return result;
}

/* Normal density with mean mu and standard deviation sigma */
float pdf(float x, float mu, float sigma)
{
    float recip_sigma;
    float ndf_in;
    float ndf_out;
    float result;

    recip_sigma = 1.0 / sigma;                /* one division, reused twice */
    ndf_in = (x - mu) * recip_sigma;          /* standardised input */
    ndf_out = ndf(ndf_in);
    result = recip_sigma * ndf_out;
    return result;
}

/* Top-level streaming loop: one output per (x, mu, sigma) triple */
void pdf_main(float output[SIZE], float x[SIZE], float mu[SIZE], float sigma[SIZE])
{
    int i = 0;

    for (i = 0; i < SIZE; i++) {
        output[i] = pdf(x[i], mu[i], sigma[i]);
    }
}

DIME-C Code for Probability Density Function

FPGAs versus Stored-Program Processors

The diagram shows the efficiency of a selection of modern processing technologies for application data types that range from bit-level processing to symbolic processing. Efficiency is an admittedly subjective composite of size, weight, energy consumption, absolute performance and time to solution. The diagram reflects how, with each generation, stored-program processors and FPGAs are evolving from their respective symbolic and bit-level roots to become ever more capable vector/streaming processors.

Nallatech Reconfigurable Computing Platforms

The NAL is a set of C++ classes that functions as a system-level design environment for Nallatech reconfigurable computing platforms. The NAL was designed to complement Intel’s Accelerator Abstraction Layer (AAL) as a programming environment for the FSB-FPGA accelerator.

Nallatech’s high-level language compiler DIME-C is a prominent component of the NAL. It compiles ANSI C code to VHDL targeted at Xilinx FPGAs.

The NAL approach permits the system-level modeling of multiple DIME-C blocks, which the previous DIMETalk-based system could not reliably model at the software level.

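Results – Peak Streaming Performance

Throughput in MOPS, without and with data transfer. PDF: Probability Density Function. BSM: Black Scholes Merton.

Platform               PDF without   PDF with    BSM without   BSM with
                       transfer      transfer    transfer      transfer
---------------------  ------------  ----------  ------------  ----------
Opteron                N/A           4.75        N/A           3.4
H101 PCI-X (LX160)     600           74.4        200           49
Cell BE                195           189         27.7          26.2
G80                    5959          205         1276          110
BenOne PCIe (LX160)    600           250         200           125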

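$$C(S,T) = S\,\Phi(d_1) - K e^{-rT}\,\Phi(d_2), \qquad d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}$$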

The formula gives the price C of a European call option with exercise price K on a stock currently trading at price S, i.e., the right to buy a share of the stock at price K after T years. The constant risk-free interest rate is r, and the constant stock volatility is σ. Φ is the standard normal cumulative distribution function, shown in equation (3). The error function, though not theoretically closed form, can be adequately evaluated in single-precision arithmetic by means of a Taylor expansion, making it closed-form from a computational perspective.

Black Scholes Merton Options Pricing Formula

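$$\text{where:} \quad \Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \tag{3}$$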

The DIME-C code for both the PDF and Black Scholes implementations was reused unchanged, or virtually unchanged, in the Opteron, CellSs and CUDA kernel implementations, though the top-level code for each naturally had to take the differing environments into account.

The results presented here assume a fictional application running on the host processor(s) with a constant stream of input values, for which it needs output values from the closed-form function implemented on the attached accelerator.

The diagram shows the unique and shared properties of the FSB-FPGA platform, the H101 PCI-X card and the BenOne PCIe FPGA compute platforms. The NAL sits atop the AAL accelerator API when used to program the FSB-FPGA. At present, the NAL sits atop the FUSE API on the PCI-X and PCIe platforms.

 

Acknowledgements

The lead author’s research is sponsored by Nallatech, and partially funded by the UK Engineering and Physical Sciences Research Council. The Institute for System Level Integration and Strathclyde University, both in Scotland, provide academic and logistical support. This work has also been supported by the Spanish government through the research contracts CICYT-TIN 2005/5619 and Ingenio 2010 Consolider CSD00C-07-20811.

The Opteron had the weakest performance for both functions. The GPU had the strongest silicon performance, that is, performance excluding data transfer. Once data transfer is taken into account, the outcome depends on the interconnect used. Amongst the accelerators, for the PDF the PCI-X H101 FPGA card had the lowest transfer-inclusive performance, followed by the Cell processor and then the GPU, with the BenOne PCIe implementation coming out on top. For BSM the Cell and the H101 swap places (26.2 versus 49 MOPS), while the BenOne remains fastest. The Cell had the lowest silicon potential of the accelerators, but was the most balanced in terms of silicon potential and data transfer bandwidth.
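One way to put a number on "balanced" is the fraction of silicon throughput that survives once data transfer is included. The sketch below computes that ratio from the figures in the results table; the struct layout and the names retained and most_balanced_pdf are hypothetical, and the ratio itself is an illustrative metric rather than one defined on the poster. The Opteron is left out because it has no transfer-free figure.

```c
#include <stddef.h>

/* Throughput in MOPS, copied from the results table. */
struct result {
    const char *platform;
    double pdf_peak, pdf_with;    /* PDF: without / with data transfer */
    double bsm_peak, bsm_with;    /* BSM: without / with data transfer */
};

static const struct result results[] = {
    { "H101 PCI-X (LX160)",   600.0,  74.4,  200.0,  49.0 },
    { "Cell BE",              195.0, 189.0,   27.7,  26.2 },
    { "G80",                 5959.0, 205.0, 1276.0, 110.0 },
    { "BenOne PCIe (LX160)",  600.0, 250.0,  200.0, 125.0 },
};

#define N_RESULTS (sizeof results / sizeof results[0])

/* Fraction of the transfer-free throughput that remains with transfer. */
static double retained(double with_transfer, double without_transfer)
{
    return with_transfer / without_transfer;
}

/* Index of the accelerator that retains the largest fraction for the PDF. */
static size_t most_balanced_pdf(void)
{
    size_t best = 0;
    size_t i;

    for (i = 1; i < N_RESULTS; i++) {
        if (retained(results[i].pdf_with, results[i].pdf_peak) >
            retained(results[best].pdf_with, results[best].pdf_peak))
            best = i;
    }
    return best;
}
```

For the PDF, the Cell BE keeps about 97% of its silicon throughput (189 of 195 MOPS), the BenOne PCIe about 42%, the H101 about 12% and the G80 about 3%, which is consistent with the Cell being the most balanced platform and the GPU the most transfer-bound.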