
Evaluation of running FFTs on the Cray XD1 with attached FPGAs

May 19, 2005

David Strenski, Cray, Inc.

stren@cray.com

Michael Babst, DSPlogic, Inc.

mike.babst@DSPlogic.com

Roderick Swift, DSPlogic, Inc.

rod.swift@DSPlogic.com

Cray XD1 Overview

The Rebirth of Co-processing

1976: 8086 Processor, 8087 Coprocessor

2004: AMD Opteron, Xilinx Virtex II Pro FPGA

Application Acceleration

Reconfigurable Computing
  Tightly coupled to Opteron
  FPGA acts like a programmable co-processor
  Performs vector operations
  Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation

Application Accelerator

Programming the FFT on the FPGA

Data Flow Management

FFT Application Architecture

[Figure: block diagram of the FFT application spanning the Opteron and the Application Accelerator. Labeled blocks: Opteron Application, IO Library, Cray AAAPI, User Memory, Result Buffer (2 MB), Cray RT Core, Input FIFO (4 MB), Output FIFO (4 MB), I/O Control, FPGA Application, Control State Machine, FFT Core, FFT Config/Status.]

FFT Application Architecture

Four Layer Architecture
  Cray Driver Layer
  I/O Management Layer
  FFT Library
  End-User Application

Cray Driver Layer
  Low-level interface between Opteron and FPGA
  Cray Application Accelerator software API
  Rapid-Array Transport Core
  QDRII Core

FFT Application Architecture

I/O Management Layer
  Simplified software and FPGA interfaces
  Framed, streaming data interface

Software API
  io_init(fpgafile)
  io_config(frame length)
  loadframe(pointer to input data)
  getresult(pointer to output location)
  txrx(input pointer, output pointer, dataset size)

I/O Management Core
  Input/output FPGA FIFOs decouple Opteron and FPGA processing
  Simple data I/O bus for FPGA FFT application
  User-definable control/status registers

FFT Application Architecture

FFT Library
  FFT Core
    Combination of off-the-shelf cores and custom VHDL optimized for XD1
    Cooley-Tukey algorithm
    Radix-2 decimation-in-frequency
    Streaming data at 1 sample per clock
  Software API
    fft_init(fft length, direction)
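As a software illustration of the radix-2 decimation-in-frequency structure named above, the following minimal Python sketch computes a forward FFT recursively. It is not the FPGA core (which is a streaming VHDL pipeline); the function name `fft_dif` is hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def fft_dif(x):
    """Forward FFT by radix-2 decimation-in-frequency (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(sums)    # X[0], X[2], X[4], ...
    out[1::2] = fft_dif(diffs)   # X[1], X[3], X[5], ...
    return out


if __name__ == "__main__":
    print(fft_dif([1, 1, 1, 1]))  # all energy lands in bin 0
```

The recursion writes results back in natural order; a hardware pipeline would more typically emit bit-reversed order and reorder afterwards, which this sketch does not model.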

FFT Application Architecture

End-User Application
  Uses both FFT Library and I/O Library
  Initialization
    io_init()
    io_config()
    fft_init()
  Data Transfer
    Serial data I/O
      loadframe()
      getresult()
    Optimized data I/O
      txrx()
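The call order above (initialize, then load frames and fetch results) can be sketched with a pure-software stand-in. Everything below is an assumption made for illustration: `MockAccelerator`, its Python method signatures, the file name `fft.bin`, the `prefill` parameter, and the use of a naive DFT in place of the FPGA core. Only the function names come from the slides; the real Cray and DSPlogic interfaces are not shown in this deck. The mock also ignores the real frame-length limits so the example stays small.

```python
import cmath
from collections import deque


def dft(block):
    """Naive O(N^2) DFT, standing in for the FPGA FFT core."""
    n = len(block)
    return [sum(block[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


class MockAccelerator:
    """Software stand-in; NOT the real Cray or DSPlogic API."""

    def io_init(self, fpgafile):
        self.fpgafile = fpgafile   # the real call would load an FPGA image
        self.pending = deque()     # models the FIFOs between load and result

    def io_config(self, frame_length):
        self.frame_length = frame_length

    def fft_init(self, fft_length, direction):
        if self.frame_length % fft_length:
            raise ValueError("frame must hold a whole number of FFT blocks")
        self.fft_length = fft_length
        self.direction = direction  # only the forward transform is modelled

    def loadframe(self, frame):
        if len(frame) != self.frame_length:
            raise ValueError("wrong frame length")
        out = []
        for start in range(0, len(frame), self.fft_length):
            out.extend(dft(frame[start:start + self.fft_length]))
        self.pending.append(out)

    def getresult(self):
        return self.pending.popleft()


def run(dataset, frame_length, fft_length, prefill=2):
    """Send `prefill` frames ahead of the first getresult(), then interleave."""
    acc = MockAccelerator()
    acc.io_init("fft.bin")          # hypothetical file name
    acc.io_config(frame_length)
    acc.fft_init(fft_length, "forward")
    results, in_flight = [], 0
    for start in range(0, len(dataset), frame_length):
        acc.loadframe(dataset[start:start + frame_length])
        in_flight += 1
        if in_flight > prefill:
            results.extend(acc.getresult())
            in_flight -= 1
    while in_flight:                # drain what is still queued
        results.extend(acc.getresult())
        in_flight -= 1
    return results


if __name__ == "__main__":
    out = run([1 + 0j] * 16, frame_length=8, fft_length=4)
    print(len(out), abs(out[0]))
```

Keeping a few frames queued before the first read is the software-side counterpart of the pipelining the deck describes; in the mock it changes nothing numerically, only the order of calls.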

Data Flow

Data Format/Frame Processing

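[Figure: data hierarchy. A dataset of M samples is divided into frames F0 ... F(M/K-1) of K samples each; each frame is divided into FFT blocks B0 ... B(K/N-1) of N samples each; each FFT block holds samples D0 ... D(N-1). Complex input sample: Re[15:0], Im[15:0]. Complex output sample: Im[31:0], Re[31:0]. 64 bits / sample.]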

Frame Length: 1024 ≤ K ≤ 65536
FFT Size: 32 ≤ N ≤ 65536, non-overlapping
Dataset Size: 32 ≤ M ≤ Available Memory
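The dataset, frame, and FFT-block hierarchy can be sketched in a few lines of Python. The helper name `partition` is hypothetical, and the sketch assumes M is a multiple of K and K a multiple of N, which is what the non-overlapping layout implies.

```python
def partition(dataset, frame_len, fft_len):
    """Split M samples into M/K frames, each holding K/N FFT blocks of N samples."""
    m = len(dataset)
    if m % frame_len:
        raise ValueError("dataset size M must be a multiple of frame length K")
    if frame_len % fft_len:
        raise ValueError("frame length K must be a multiple of FFT size N")
    frames = [dataset[f:f + frame_len] for f in range(0, m, frame_len)]
    return [[frame[b:b + fft_len] for b in range(0, frame_len, fft_len)]
            for frame in frames]


if __name__ == "__main__":
    # M = 4096 samples, K = 1024 samples per frame, N = 256 samples per FFT
    blocks = partition(list(range(4096)), frame_len=1024, fft_len=256)
    print(len(blocks), "frames of", len(blocks[0]), "FFT blocks each")
```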

Data Pipelining
  Opteron-FPGA communications latency
  Processing latency
  Data pipelining required to achieve maximum FFT performance
    Send multiple frames prior to receiving first result
  Latency to initial FFT result

$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$

    Further optimization of latency possible
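To get a feel for the numbers, the latency expression can be evaluated directly. This sketch reads the formula as T_L ≈ (3.125K + 1.5N)·T_clk + 8K/(1.1×10^9) seconds, with T_clk = 1/F_clk; the function name, the default 200 MHz clock, and the parameter names are assumptions for illustration.

```python
def fft_latency(n_fft, frame_len, f_clk_hz=200e6, mem_rate_bytes_per_sec=1.1e9,
                bytes_per_sample=8):
    """Latency to the first FFT result, in seconds, per the expression above."""
    t_clk = 1.0 / f_clk_hz
    pipeline = (3.125 * frame_len + 1.5 * n_fft) * t_clk            # FIFO, fabric, FFT delay
    copy_out = bytes_per_sample * frame_len / mem_rate_bytes_per_sec  # memory-limited copy
    return pipeline + copy_out


if __name__ == "__main__":
    for log2n in (10, 13, 16):
        n = 1 << log2n
        print(f"N = K = {n:6d}: {fft_latency(n, n) * 1e6:9.2f} usec")
```

With N = K = 1024 at 200 MHz this gives about 31 microseconds, and the result scales linearly with K and N.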

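Processing Pipeline

[Figure: pipeline timing diagram. Successive loadframe() calls load frames F0 to F4 into the input FIFO and then into the FPGA processor. After the processing delay, each frame moves to the output FIFO, to the result buffer, and to user memory, where getresult() retrieves it. Stage times are labeled Tff, Tfo, Tfr and Tfc; the FFT processing delay is labeled Dfft. Rate annotations: Fabric Limited (1.6 GB/s), Fabric Limited (1.4 GB/s), Memory Access Limited (≅1.1 GB/sec).]

$$T_{ff} = \frac{K}{F_{clk}} = K\,T_{clk}$$

$$T_{fr} = \frac{9K}{8F_{clk}} = \frac{9}{8}K\,T_{clk}$$

$$D_{fft} \approx 1.5\,N\,T_{clk}$$

$$T_{fc} \approx \frac{8K}{1.1 \times 10^{9}}$$

$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$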

FFT Latency

[Figure: latency (sec, log scale from 1.00E-05 to 1.00E-02) versus log2(Nfft) from 5 to 16. Series: Fc=200, Rm=1.6G; Fc=175, Rm=1.6G; Fc=175, Rm=1.1G.]

Accuracy

Accuracy Methodology
  Difference between fixed/floating point
    32-bit single precision: 23-bit mantissa, 8-bit exponent, 1 sign bit
    Fixed point has limited dynamic range
    Variable mantissa for programmable precision
    16-bit input / 32-bit output example
  Total FFT error also depends on other factors
    Rounding/truncation at each stage
    Twiddle factor precision
  Normalized metric, independent of length

$$\mathrm{compare}(a, b) = \frac{\|a - b\|_n}{\|b\|_n}$$

$$L_n = \|x\|_n = \left( \sum_{i=0}^{len-1} |x_i|^n \right)^{1/n}$$

FFTW L2norm Accuracy
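The normalized error metric can be sketched in plain Python. The names `lp_norm` and `compare` follow the formula; the test data (real-valued samples drawn uniformly over the full 16-bit range) and the seed are assumptions for illustration, not the dataset used in the measurements.

```python
import random


def lp_norm(x, n=2):
    """L_n norm: (sum |x_i|^n)^(1/n)."""
    return sum(abs(v) ** n for v in x) ** (1.0 / n)


def compare(a, b, n=2):
    """Relative error of a with respect to the reference b."""
    return lp_norm([p - q for p, q in zip(a, b)], n) / lp_norm(b, n)


if __name__ == "__main__":
    random.seed(0)
    # Reference data in double precision (assumed uniform, full 16-bit scale).
    x_r64 = [random.uniform(-32767.0, 32767.0) for _ in range(65536)]
    # Same data after rounding to the nearest 16-bit integer value.
    x_int16 = [float(round(v)) for v in x_r64]
    print(compare(x_int16, x_r64))
```

Under these assumptions the rounding error alone comes out on the order of 1.5e-5, which is why the initial quantization of the input sets a floor on the error of any FFT computed from 16-bit data.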

Measured Accuracy Results

N = 1024
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526948880682192e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 2.635449438537074e-005

N = 65536
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526796487132111e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 3.216349170399355e-005

Accuracy Summary

Comparable to other single-precision FFTs
Initial rounding of data causes most error
Input less accurate, output more accurate than single-precision float
Precision may be traded off for speed, FFT size, etc.
Dynamic range limits

Performance

FFT Computation Rate

1.6 GBytes/sec FPGA FFT computation rate w/ 200 MHz clock
  Not realizable

Expected Rates
  1.4 GBytes/sec theoretical max
  1.1 GBytes/sec realistic with one or more of the following enhancements
    I/O joint R/W software optimization
    Increased result buffer size beyond 2 MB (future Cray release)
    Bidirectional DMA (future Cray release)
  550 MBytes/sec approximate worst case
    Time-shared Opteron transmit and receive
  Current optimized rate ~830 MBytes/sec

64 bits / complex sample
  R = 1.4 GB/sec = 5.7 ns/point
  R = 1.1 GB/sec = 7.3 ns/point
  R = 830 MB/sec = 9.6 ns/point
  R = 550 MB/sec = 14.6 ns/point

The average FFT computation rate is determined entirely by I/O throughput
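Because the rate is set by I/O throughput, the time per point follows directly from the sustained byte rate and the 64-bit (8-byte) sample size. A minimal sketch of that conversion, with a hypothetical helper name:

```python
def ns_per_point(rate_bytes_per_sec, bytes_per_sample=8):
    """Time per complex sample, in nanoseconds, at a sustained I/O rate."""
    return bytes_per_sample / rate_bytes_per_sec * 1e9


if __name__ == "__main__":
    for label, rate in (("1.4 GB/sec", 1.4e9), ("1.1 GB/sec", 1.1e9),
                        ("830 MB/sec", 830e6), ("550 MB/sec", 550e6)):
        print(f"R = {label}: {ns_per_point(rate):.1f} ns/point")
```

An N-point FFT then takes roughly N times this figure once the pipeline is full, independent of the FFT arithmetic itself.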

FFT Computation Rate

[Figure: time (log scale) versus log2(Nfft) from 5 to 16. Series: R = 1.42, R = 1.1, R = 1.1 / 2, FFTW(cpignol64-246).]

FFT Computation Rate

[Figure: time (log scale, 1 to 10000) versus log2(Nfft) from 10 to 16. Series: R = 1.42, R = 1.1, R = 1.1 / 2, FFTW(cpignol64-246).]

Experimental Results

Un-optimized

N = FFT size, K = frame size, F = number of frames, K/N = FFTs per frame, Nproc = total processed

                                 Average Duration (usec)     Throughput (MBytes/sec)
    N      K    F  K/N   Nproc    Total      TX      RX      Total       TX       RX
65536  65536    8    1  524288   932.63  403.00  526.00     562.16  1300.96   996.75
32768  32768   16    1  524288   462.50  205.88  255.06     566.80  1273.32  1027.77
16384  16384   32    1  524288   231.31  106.28  123.88     566.65  1233.26  1058.10
 8192   8192   64    1  524288   116.86   56.05   60.31     560.81  1169.30  1086.62
 4096   4096  128    1  524288    60.45   30.04   29.74     542.11  1090.85  1101.74
 2048   2048  256    1  524288    32.81   17.96   14.50     499.44   912.20  1129.62
 1024   1024  512    1  524288    19.23   11.09    7.90     426.02   738.95  1037.22
  512   1024  512    2  524288     9.27    5.44    3.73     441.76   753.50  1098.71
  256   1024  512    4  524288     4.62    2.75    1.81     443.67   745.54  1130.24
  128   1024  512    8  524288     2.31    1.38    0.91     442.91   744.73  1126.51
   64   1024  512   16  524288     1.16    0.69    0.46     441.00   740.96  1117.90
   32   1024  512   32  524288     0.58    0.35    0.23     441.38   742.03  1127.75
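The throughput columns are consistent with reading each duration as the average time per N-point FFT at 8 bytes per sample. The sketch below checks that reading against three rows of the table; the interpretation and the helper name are assumptions, not something the slides state explicitly.

```python
def throughput_mbytes_per_sec(n_fft, duration_usec, bytes_per_sample=8):
    """Bytes moved per FFT divided by its duration; bytes/usec equals MBytes/sec."""
    return n_fft * bytes_per_sample / duration_usec


# (N, total duration in usec, reported total throughput in MBytes/sec)
ROWS = [
    (65536, 932.63, 562.16),
    (1024, 19.23, 426.02),
    (32, 0.58, 441.38),
]

if __name__ == "__main__":
    for n, duration, reported in ROWS:
        computed = throughput_mbytes_per_sec(n, duration)
        print(f"N = {n:5d}: computed {computed:7.2f}, reported {reported:7.2f}")
```

The small differences at short FFT sizes come from the durations being rounded to two decimal places.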

Experimental Results

[Figure: sustained throughput (MBytes/sec, 0 to 900) versus frame length (1024 to 65536), with improved TX/RX software optimization.]

Speed Improvement Ratio

[Figure: T(fftw)/T(fpga) (0.00 to 9.00) versus Nfft (32 to 65536). Series: R=1.4G, R=1.1G, R=800M.]

Conclusions
  Up to 4.75x speed gains are achievable today
  FPGA performance enhancement increases with FFT length
  Multiple FFTs utilize pipeline and provide efficiency
  Latency limits usefulness for single computations of small FFT sizes
  FFT L2norm accuracy ~10^-5, similar to other single-precision algorithms
  Modular architecture
    Separate I/O and application optimization
    Rapid application development
