
Evaluation of running FFTs on the Cray XD1 with attached FPGAs

May 19, 2005

David Strenski, Cray, Inc.

stren@cray.com

Michael Babst, DSPlogic, Inc.

mike.babst@DSPlogic.com

Roderick Swift, DSPlogic, Inc.

rod.swift@DSPlogic.com

Cray XD1 Overview

The Rebirth of Co-processing

8086 Processor 8087 Coprocessor

AMD Opteron Xilinx Virtex II Pro FPGA

Application Acceleration

Reconfigurable Computing

Tightly coupled to Opteron
FPGA acts like a programmable co-processor
Performs vector operations
Well-suited for:

Searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation.

Application Accelerator

Programming the FFT on the FPGA

Data Flow Management

FFT Application Architecture

[Figure: block diagram spanning the Opteron and the Application Accelerator. Labeled blocks: Opteron Application, User Memory, Result Buffer (2 MB), Cray AAAPI, FPGA Application, Input FIFO (4 MB), Output FIFO (4 MB), I/O Control, Control State Machine, FFT Core, FFT Config/Status.]

Four Layer Architecture

Cray Driver Layer
I/O Management Layer
FFT Library
End-User Application

Cray Driver Layer

Low-level interface between Opteron and FPGA
Cray Application Accelerator software API
Rapid-Array Transport Core
QDRII Core

I/O Management Layer

Simplified Software and FPGA Interfaces
Framed, streaming data interface

Software API

io_init(fpgafile)
io_config(frame length)
loadframe(pointer to input data)
getresult(pointer to output location)
txrx(input pointer, output pointer, dataset size)

I/O Management Core

Input/output FPGA FIFOs decouple Opteron and FPGA processing
Simple Data I/O bus for FPGA FFT application
User Definable Control/Status Registers

FFT Library

FFT Core

Combination of off-the-shelf cores and custom VHDL optimized for XD1
Cooley-Tukey algorithm
Radix-2 decimation-in-frequency
Streaming data at 1 sample per clock

Software API

fft_init(fft length, direction)
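The radix-2 decimation-in-frequency structure named above can be illustrated in software. This is a minimal recursive sketch in plain Python, not the FPGA core's streaming VHDL implementation; the function names `fft_dif_radix2` and `dft_naive` are hypothetical and exist only for this illustration.

```python
import cmath

def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT of a power-of-two-length sequence."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    half = n // 2
    # DIF butterfly: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif_radix2(top)
    out[1::2] = fft_dif_radix2(bottom)
    return out

def dft_naive(x):
    """Direct O(n^2) DFT, used only as a reference for checking the FFT."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]
```

Comparing `fft_dif_radix2(x)` against `dft_naive(x)` on a short input is a quick way to check the butterfly ordering.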

End-User Application

Uses both FFT Library and I/O Library

Initialization

io_init()
io_config()
fft_init()

Data Transfer

Serial Data I/O: loadframe(), getresult()
Optimized Data I/O: txrx()
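The call order above (initialize, then load frames and collect results) can be sketched with a software stand-in. Everything here is an assumption made for illustration: `MockAccelerator` is a hypothetical in-memory class that only borrows the call names from the slides, the file name is made up, no FPGA or Cray library is involved, and the real frame-length limits are not enforced. The `depth` parameter models keeping several frames in flight before reading the first result.

```python
from collections import deque
import cmath

class MockAccelerator:
    """Hypothetical software stand-in for the I/O and FFT layers (no FPGA)."""

    def __init__(self):
        self.frame_len = None
        self.fft_len = None
        self.sign = -1
        self.pending = deque()  # plays the role of the FIFOs

    def io_init(self, fpgafile):
        self.fpgafile = fpgafile  # a real system would load the FPGA image

    def io_config(self, frame_length):
        self.frame_len = frame_length

    def fft_init(self, fft_length, direction):
        self.fft_len = fft_length
        self.sign = -1 if direction == "forward" else 1

    def loadframe(self, frame):
        if len(frame) != self.frame_len or self.frame_len % self.fft_len:
            raise ValueError("frame must hold K samples, K a multiple of N")
        self.pending.append(list(frame))

    def getresult(self):
        frame = self.pending.popleft()
        n = self.fft_len
        out = []
        for start in range(0, len(frame), n):  # K/N non-overlapping FFTs
            out.extend(self._dft(frame[start:start + n]))
        return out

    def _dft(self, block):
        n = len(block)
        return [sum(block[k] * cmath.exp(self.sign * 2j * cmath.pi * j * k / n)
                    for k in range(n)) for j in range(n)]

def run_pipelined(acc, frames, depth):
    """Keep up to `depth` frames in flight, collecting results in order."""
    results, in_flight = [], 0
    for frame in frames:
        acc.loadframe(frame)
        in_flight += 1
        if in_flight == depth:
            results.append(acc.getresult())
            in_flight -= 1
    while in_flight:
        results.append(acc.getresult())
        in_flight -= 1
    return results
```

A typical use would be `acc.io_init(...)`, `acc.io_config(K)`, `acc.fft_init(N, "forward")`, followed by `run_pipelined(acc, frames, depth=2)`.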

Data Flow

Data Format/Frame Processing

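[Figure: data format and framing. A dataset of M samples is divided into frames F0 ... F(M/K-1) of K samples each; each frame is divided into FFT blocks B0 ... B(K/N-1) of N samples each (D0 ... D(N-1)). Complex input sample: Im[15:0], Re[15:0]. Complex output sample: Im[31:0], Re[31:0], 64 bits / sample.]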

Frame Length: 1024 < K < 65536
FFT Size: 32 < N < 65536, non-overlapping
Dataset Size: 32 < M < Available Memory
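The relationship between dataset size M, frame length K, and FFT size N can be sketched as a small helper. This assumes M is a whole multiple of K and K a whole multiple of N, which the frame indices F0 ... F(M/K-1) and block indices B0 ... B(K/N-1) suggest; the function names are hypothetical.

```python
def frame_layout(m, k, n):
    """Return (frames per dataset, FFT blocks per frame) for sizes M, K, N."""
    if k % n:
        raise ValueError("frame length K must be a multiple of FFT size N")
    if m % k:
        raise ValueError("dataset size M must be a multiple of frame length K")
    return m // k, k // n

def split_dataset(samples, k, n):
    """Split a flat sample list into frames, each a list of N-sample blocks."""
    frames, blocks = frame_layout(len(samples), k, n)
    return [[samples[f * k + b * n: f * k + (b + 1) * n] for b in range(blocks)]
            for f in range(frames)]
```

For example, a dataset of 524288 samples with K = 1024 and N = 32 gives 512 frames of 32 FFT blocks each.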

Data Pipelining

Opteron-FPGA communications latency
Processing latency
Data pipelining required to achieve maximum FFT performance
Latency to Initial FFT Result
Send multiple frames prior to receiving first result
Further optimization of latency possible

$$ T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec} $$
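A small calculator makes the latency estimate concrete. It assumes the expression reads T_L ≈ (3.125 K + 1.5 N) T_clk + 8K / (1.1 × 10^9) seconds, with 8 bytes per complex sample and a 1.1 GB/sec transfer rate; that reading is a reconstruction from the slide, and the function and parameter names are hypothetical.

```python
def initial_latency_seconds(k, n, f_clk_hz,
                            link_bytes_per_sec=1.1e9, bytes_per_sample=8):
    """Estimate latency to the first FFT result under the assumed formula."""
    t_clk = 1.0 / f_clk_hz
    # Clock-bound part: frame loading and FFT processing delay.
    pipeline = (3.125 * k + 1.5 * n) * t_clk
    # Transfer-bound part: moving one K-sample frame over the link.
    transfer = bytes_per_sample * k / link_bytes_per_sec
    return pipeline + transfer
```

Calling it with K = N = 1024 and a 200 MHz clock gives a latency in the tens of microseconds, and the estimate grows linearly with both K and N.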

Processing Pipeline

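[Figure: processing pipeline timing diagram for frames F0 ... F4, with timing annotated in units of K T_clk. Stages: Load Input FIFO (loadframe()), Load FPGA Processor, Process Data (processing delay), To Output FIFO, To Result Buffer, To User Memory (getresult()).]

$$ D_{fft} \approx 1.5\,N\,T_{clk} $$

$$ T_{fc} \approx \frac{8K}{1.1 \times 10^{9}} $$

$$ T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec} $$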

Fabric Limited (1.6 GB/s)

Fabric Limited (1.4 GB/s)

Memory Access Limited (≅1.1 GB/sec)

FFT Latency

[Figure: FFT latency (log scale, 1.00E-05 to 1.00E-02) versus log2(Nfft) from 5 to 16. Curves: Fc=200, Rm=1.6G; Fc=175, Rm=1.6G; Fc=175, Rm=1.1G.]

Accuracy

Accuracy Methodology

Difference between fixed/floating point
  32-bit single precision: 23-bit mantissa, 8-bit exponent, 1 sign bit
  Fixed point has limited dynamic range
  Variable mantissa for programmable precision
  16-bit input / 32-bit output example
Total FFT error also depends on other factors
  Rounding/truncation at each stage
  Twiddle factor precision

Normalized metric, independent of length

$$ L_n(x) = \left( \sum_{i=0}^{len-1} |x_i|^n \right)^{1/n}, \qquad \mathrm{compare}(a, b) = \frac{L_n(a - b)}{L_n(b)} $$
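A normalized L2-norm comparison of this kind can be sketched as follows. The sketch assumes the metric is the L2 norm of the difference divided by the L2 norm of the reference, which is one common definition of relative L2 error; the helper names, including `quantize_int16`, are hypothetical and are not the measurement code behind the slides.

```python
import math

def l2_norm(x):
    """Euclidean norm of a real or complex sequence."""
    return math.sqrt(sum(abs(v) ** 2 for v in x))

def l2_norm_error(test, reference):
    """Relative L2 error of `test` against `reference` (assumed definition)."""
    if len(test) != len(reference):
        raise ValueError("sequences must have the same length")
    diff = [t - r for t, r in zip(test, reference)]
    return l2_norm(diff) / l2_norm(reference)

def quantize_int16(x, full_scale=1.0):
    """Round real samples to 16-bit steps, clamp, and scale back to floats."""
    step = full_scale / 32768.0
    return [max(-32768, min(32767, round(v / step))) * step for v in x]
```

Applying `l2_norm_error(quantize_int16(x), x)` to a full-scale test signal shows how much error the initial 16-bit rounding contributes before any FFT is computed.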

FFTW L2norm Accuracy

Measured Accuracy Results

N = 1024
L2NormError(x_int16, x_r64) = 1.526792830336569e-005
L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526948880682192e-005
L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 2.635449438537074e-005

N = 65536
L2NormError(x_int16, x_r64) = 1.526792830336569e-005
L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526796487132111e-005
L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 3.216349170399355e-005

Accuracy Summary

Comparable to other single-precision FFTs
Initial rounding of data causes most error
Input less accurate, output more accurate than single precision float
Precision may be traded off for speed, FFT size, etc.
Dynamic range limits

Performance

FFT Computation Rate

1.6 GBytes/sec FPGA FFT computation rate w/ 200 MHz clock
  Not realizable

Expected Rates

1.4 GBytes/sec theoretical max
1.1 GBytes/sec realistic with one or more of the following enhancements
  I/O joint R/W software optimization
  Increased result buffer size beyond 2 MB (future Cray release)
  Bidirectional DMA (future Cray release)
550 Mbytes/sec approximate worst case
  Time-shared Opteron transmit and receive
Current optimized rate ~830 Mbytes/sec

64 bits / complex sample

R = 1.4 GB/sec = 5.7 ns/point
R = 1.1 GB/sec = 7.3 ns/point
R = 830 MB/sec = 12 ns/point
R = 550 MB/sec = 14.6 ns/point

The average FFT computation rate is determined entirely by I/O throughput
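Because the rate is set by I/O throughput, the time per point follows directly from the transfer rate and the 64-bit (8-byte) complex sample. The sketch below shows that arithmetic; the function names are hypothetical. Under the same assumption, 830 MB/sec works out to roughly 9.6 ns/point rather than the 12 ns listed, so that entry may reflect a different effective rate.

```python
def ns_per_point(rate_bytes_per_sec, bytes_per_sample=8):
    """Nanoseconds needed to move one complex sample at the given rate."""
    return bytes_per_sample / rate_bytes_per_sec * 1e9

def fft_time_usec(n, rate_bytes_per_sec, bytes_per_sample=8):
    """Microseconds per N-point FFT when throughput is the only limit."""
    return n * bytes_per_sample / rate_bytes_per_sec * 1e6

if __name__ == "__main__":
    for rate in (1.4e9, 1.1e9, 830e6, 550e6):
        print(f"{rate / 1e6:7.0f} MB/sec -> {ns_per_point(rate):5.2f} ns/point")
```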

[Figure: two plots versus log2(Nfft), one spanning 5 to 16 and one spanning 10 to 16. Curves: R = 1.42, R = 1.1, R = 1.1 / 2, FFTW (cpignol64-246).]

Experimental Results

Un-optimized

FFT size  Frame size  Frames  FFTs/frame  Total processed  Average duration (usec)   Throughput (Mbytes/sec)
N         K           F       K/N         Nproc            Total    TX      RX       Total    TX       RX
65536     65536       8       1           524288           932.63   403.00  526.00   562.16   1300.96  996.75
32768     32768       16      1           524288           462.50   205.88  255.06   566.80   1273.32  1027.77
16384     16384       32      1           524288           231.31   106.28  123.88   566.65   1233.26  1058.10
8192      8192        64      1           524288           116.86   56.05   60.31    560.81   1169.30  1086.62
4096      4096        128     1           524288           60.45    30.04   29.74    542.11   1090.85  1101.74
2048      2048        256     1           524288           32.81    17.96   14.50    499.44   912.20   1129.62
1024      1024        512     1           524288           19.23    11.09   7.90     426.02   738.95   1037.22
512       1024        512     2           524288           9.27     5.44    3.73     441.76   753.50   1098.71
256       1024        512     4           524288           4.62     2.75    1.81     443.67   745.54   1130.24
128       1024        512     8           524288           2.31     1.38    0.91     442.91   744.73   1126.51
64        1024        512     16          524288           1.16     0.69    0.46     441.00   740.96   1117.90
32        1024        512     32          524288           0.58     0.35    0.23     441.38   742.03   1127.75
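The throughput columns appear to follow from the per-FFT durations and the 8-byte complex sample: N × 8 bytes divided by the duration in microseconds gives Mbytes/sec. The sketch below shows that relationship under that assumption; the function name is hypothetical.

```python
def throughput_mbytes_per_sec(n, duration_usec, bytes_per_sample=8):
    """Throughput implied by moving one N-point FFT in `duration_usec`.

    Bytes per microsecond equals Mbytes/sec when 1 Mbyte = 10**6 bytes.
    """
    return n * bytes_per_sample / duration_usec
```

For instance, `throughput_mbytes_per_sec(65536, 932.63)` corresponds to the first row's total duration, and passing the TX or RX duration instead gives the per-direction figures.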

Experimental Results

[Figure: sustained throughput versus frame length (1024 to 65536), with improved TX/RX software optimization.]

[Figure: speed improvement ratio, x-axis from 32 to 65536. Curves: R=1.4G, R=1.1G, R=800M.]

Conclusions

Up to 4.75x speed gains are achievable today
FPGA performance enhancement increases with FFT length
Multiple FFTs utilize pipeline and provide efficiency
Latency limits usefulness for single computations of small FFT sizes
FFT L2norm accuracy ~10^-5, similar to other single-precision algorithms
Modular architecture
  Separate I/O and application optimization
  Rapid application development
