Evaluation of running FFTs on the Cray XD1 with attached FPGAs
May 19, 2005
David Strenski, Cray, Inc.
stren@cray.com
Michael Babst, DSPlogic, Inc.
mike.babst@DSPlogic.com
Roderick Swift, DSPlogic, Inc.
rod.swift@DSPlogic.com
Cray XD1 Overview
The Rebirth of Co-processing
1976: 8086 Processor + 8087 Coprocessor
2004: AMD Opteron + Xilinx Virtex II Pro FPGA
Application Acceleration
Reconfigurable Computing
  Tightly coupled to Opteron
  FPGA acts like a programmable co-processor
  Performs vector operations
  Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation.
Application Accelerator
Programming the FFT on the FPGA
Data Flow Management
FFT Application Architecture
[Figure: block diagram of the FFT application spanning the Opteron and the Application Accelerator. Labelled blocks: Opteron Application, User Memory, IO Library, Cray AAAPI, Result Buffer (2 MB), Cray RT Core, Input FIFO (4 MB), Output FIFO (4 MB), I/O Control, FPGA Application, FFT Core, Control State Machine, FFT Config/Status.]
FFT Application Architecture
Four Layer Architecture
  Cray Driver Layer
  I/O Management Layer
  FFT Library
  End-User Application
Cray Driver Layer
  Low-level interface between Opteron and FPGA
  Cray Application Accelerator software API
  Rapid-Array Transport Core
  QDRII Core
FFT Application Architecture
I/O Management Layer
  Simplified Software and FPGA Interfaces
  Framed, streaming data interface
Software API
  io_init(fpgafile)
  io_config(frame length)
  loadframe(pointer to input data)
  getresult(pointer to output location)
  txrx(input pointer, output pointer, dataset size)
I/O Management Core
  Input/output FPGA FIFOs decouple Opteron and FPGA processing
  Simple Data I/O bus for FPGA FFT application
  User Definable Control/Status Registers
FFT Application Architecture
FFT Library
FFT Core
  Combination of off-the-shelf cores and custom VHDL optimized for XD1
  Cooley-Tukey Algorithm
  Radix-2 decimation-in-frequency
  Streaming data at 1 sample per clock
Software API
  fft_init(fft length, direction)
FFT Application Architecture
End-User Application
  Uses both FFT Library and I/O Library
Initialization
  io_init()
  io_config()
  fft_init()
Data Transfer
  Serial Data I/O
    loadframe()
    getresult()
  Optimized Data I/O
    txrx()
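The call sequence above can be sketched in software. The following is a minimal pure-Python stand-in, not the Cray or DSPlogic library: the function names come from the slides, but the Python signatures, the `direction` convention (+1 forward, -1 inverse), the file name `fft.bin`, and the small frame and FFT sizes are assumptions chosen for illustration. The FPGA is emulated by an unnormalized recursive radix-2 decimation-in-frequency FFT and a software FIFO.

```python
import cmath
from collections import deque

# Emulated device state (hypothetical; the real state lives in the FPGA and driver).
_state = {"frame_len": None, "fft_len": None, "inverse": False, "fifo": deque()}


def fft_dif(x, inverse=False):
    """Unnormalized recursive radix-2 decimation-in-frequency FFT (power-of-two length)."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    sign = 1 if inverse else -1
    # DIF butterfly: sums feed the even outputs, twiddled differences the odd outputs.
    top = [x[i] + x[i + half] for i in range(half)]
    bot = [(x[i] - x[i + half]) * cmath.exp(sign * 2j * cmath.pi * i / n)
           for i in range(half)]
    out = [0j] * n
    out[0::2], out[1::2] = fft_dif(top, inverse), fft_dif(bot, inverse)
    return out


def io_init(fpgafile):
    """Stand-in for loading the FPGA image; here it only clears the emulated FIFO."""
    _state["fifo"].clear()


def io_config(frame_length):
    _state["frame_len"] = frame_length


def fft_init(fft_length, direction):
    _state["fft_len"] = fft_length
    _state["inverse"] = direction < 0  # assumed convention: +1 forward, -1 inverse


def loadframe(frame):
    """Queue one frame of K samples, transformed as K/N back-to-back N-point blocks."""
    k, n = _state["frame_len"], _state["fft_len"]
    assert len(frame) == k and k % n == 0
    result = []
    for start in range(0, k, n):
        result.extend(fft_dif(frame[start:start + n], _state["inverse"]))
    _state["fifo"].append(result)


def getresult():
    """Return the oldest finished frame."""
    return _state["fifo"].popleft()


def txrx(inp, out, dataset_size):
    """Send every frame, then collect every result into out."""
    k = _state["frame_len"]
    for start in range(0, dataset_size, k):
        loadframe(inp[start:start + k])
    for start in range(0, dataset_size, k):
        out[start:start + k] = getresult()


# Initialization, then one optimized transfer of a 16-sample dataset.
io_init("fft.bin")
io_config(8)      # K = 8 samples per frame (far below the hardware minimum)
fft_init(4, +1)   # N = 4 point forward FFT
data = [1, 0, 0, 0, 1, 1, 1, 1] * 2
out = [0j] * 16
txrx(data, out, 16)
print(out[:8])
```

In this sketch txrx() simply calls loadframe() for every frame before calling getresult(), whereas the slides describe it as an optimized path; the point is only to show the order of initialization and transfer calls.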
Data Flow
Data Format/Frame Processing
Dataset = M samples, split into frames F0 ... F(M/K-1)
Frame = K samples, split into FFT blocks B0 ... B(K/N-1)
FFT Block = N samples, D0 ... D(N-1)
Complex Input Sample: Im[15:0], Re[15:0]
Complex Output Sample: Im[31:0], Re[31:0]
64 bits / sample
Frame Length: 1024 ≤ K ≤ 65536
FFT Size: 32 ≤ N ≤ 65536, non-overlapping
Dataset Size: 32 ≤ M ≤ Available Memory
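The dataset, frame and FFT-block hierarchy can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the bounds are treated as inclusive (the experimental results use K = 65536 and N = 32), N is required to be a power of two because the core is radix-2, and N is assumed to divide K and K to divide M. The helper names `partition` and `locate` are hypothetical.

```python
def is_power_of_two(v):
    return v > 0 and v & (v - 1) == 0


def partition(m, k, n):
    """Return (frames per dataset, FFT blocks per frame) for M, K, N samples."""
    if not 1024 <= k <= 65536:
        raise ValueError("frame length K out of range")
    if not (32 <= n <= 65536 and is_power_of_two(n)):
        raise ValueError("FFT size N must be a power of two in range")
    if k % n or m % k:
        raise ValueError("expects N to divide K and K to divide M")
    return m // k, k // n


def locate(i, k, n):
    """Map dataset sample index i to (frame, block within frame, offset within block)."""
    frame, within = divmod(i, k)
    block, offset = divmod(within, n)
    return frame, block, offset


print(partition(524288, 1024, 512))   # (512, 2)
print(locate(1500, 1024, 512))        # (1, 0, 476)
```

The first call uses one of the configurations from the experimental results: 524288 samples in 1024-sample frames with 512-point FFTs gives 512 frames of 2 FFTs each.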
Data Pipelining
  Opteron-FPGA communications latency
  Processing latency
  Data pipelining required to achieve maximum FFT performance
Latency to Initial FFT Result
  Send multiple frames prior to receiving first result
  Further optimization of latency possible
$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$
Processing Pipeline
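[Figure: processing pipeline timing diagram. Successive loadframe() calls load frames F0 to F4 into the input FIFO. Each frame then passes through the stages Load FPGA Processor, Process Data (processing delay D_fft), To Output FIFO, To Result Buffer and To User Memory, where getresult() retrieves it. Stage times are labelled T_ff, T_fo, T_fr and T_fc.]

$$T_{ff} = \frac{K}{F_{clk}} = K\,T_{clk} \quad \text{(fabric limited, 1.6 GB/s)}$$

$$T_{fr} = \frac{9K}{8F_{clk}} = \frac{9}{8}\,K\,T_{clk} \quad \text{(fabric limited, 1.4 GB/s)}$$

$$T_{fc} \approx \frac{8K}{1.1 \times 10^{9}}\ \text{sec} \quad \text{(memory access limited, } \cong 1.1\ \text{GB/sec)}$$

$$D_{fft} \approx 1.5\,N\,T_{clk}$$

$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$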
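The latency model can be evaluated numerically. The sketch below reads the slide's expression as T_L ≈ (3.125K + 1.5N)·T_clk + 8K/(1.1×10^9) sec, with the processing delay D_fft ≈ 1.5·N·T_clk and the copy to user memory T_fc ≈ 8K/(1.1×10^9) sec. The function name and its default arguments (200 MHz clock, 1.1 GB/s memory rate) are assumptions for illustration.

```python
def first_result_latency(n, k, f_clk_hz=200e6, mem_rate=1.1e9):
    """Approximate latency in seconds from first loadframe() to first FFT result.

    n: FFT size in samples, k: frame length in samples,
    f_clk_hz: FPGA clock, mem_rate: copy rate to user memory in bytes/sec.
    """
    t_clk = 1.0 / f_clk_hz
    d_fft = 1.5 * n * t_clk       # processing delay of the FFT core
    t_fabric = 3.125 * k * t_clk  # combined clock-limited transfer stages
    t_fc = 8.0 * k / mem_rate     # 8 bytes per sample, memory access limited
    return t_fabric + d_fft + t_fc


for log2n in (10, 13, 16):
    n = 2 ** log2n
    print(f"N = K = {n:6d}: {first_result_latency(n, n):.3e} sec")
```

Passing f_clk_hz=175e6 or mem_rate=1.6e9 reproduces the other parameter combinations named in the latency plot (Fc and Rm), under the same reading of the formula.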
FFT Latency
[Figure: latency (sec, log scale from 1.00E-05 to 1.00E-02) versus log2(Nfft) from 5 to 16, for three cases: Fc=200, Rm=1.6G; Fc=175, Rm=1.6G; Fc=175, Rm=1.1G.]
Accuracy
Accuracy Methodology
Difference between fixed/floating point
  32-bit Single Precision
    23-bit mantissa, 8-bit exponent, 1 sign bit
  Fixed point has limited dynamic range
    Variable mantissa for programmable precision
    16-bit input / 32-bit output example
Total FFT error also depends on other factors
  Rounding/Truncation at each stage
  Twiddle factor precision
Normalized metric, independent of length
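FFTW L2norm Accuracy:

$$\mathrm{compare}(a, b) = \frac{\lVert a - b \rVert_n}{\lVert b \rVert_n}, \qquad L_n = \lVert x \rVert_n = \left( \sum_{i=0}^{len-1} \lvert x_i \rvert^{n} \right)^{1/n}$$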
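The normalized error metric is short enough to write out. The sketch below implements compare(a, b) as the norm of the difference divided by the norm of the reference, and applies it to a 16-bit quantization of a test signal. The helper `quantize_int16`, the uniform test signal on [-1, 1) and the fixed seed are assumptions; the slides do not say how the input data was generated.

```python
import random


def lnorm(x, n=2):
    """L_n norm of a sequence of real or complex values."""
    return sum(abs(v) ** n for v in x) ** (1.0 / n)


def compare(a, b, n=2):
    """Relative error of a against reference b: ||a - b||_n / ||b||_n."""
    return lnorm([p - q for p, q in zip(a, b)], n) / lnorm(b, n)


def quantize_int16(x):
    """Round values in [-1, 1) to 16-bit fixed point, then scale back to floats."""
    scale = 2 ** 15
    return [max(-scale, min(scale - 1, round(v * scale))) / scale for v in x]


random.seed(0)
x_r64 = [random.uniform(-1.0, 1.0) for _ in range(1024)]
x_int16 = quantize_int16(x_r64)
print(compare(x_int16, x_r64))
```

Because the numerator and denominator both grow with the number of samples, the ratio does not depend on the length of the data, which is why the metric can be compared across FFT sizes.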
Measured Accuracy Results
N = 1024
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526948880682192e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 2.635449438537074e-005
N = 65536
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526796487132111e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 3.216349170399355e-005
Accuracy Summary
Comparable to other single-precision FFTs
Initial rounding of data causes most error
Input less accurate, output more accurate than single precision float
Precision may be traded off for speed, FFT size, etc.
Dynamic range limits
Performance
FFT Computation Rate
1.6 GBytes/sec FPGA FFT computation rate w/ 200 MHz clock
  Not realizable
Expected Rates
  1.4 GBytes/sec theoretical max
  1.1 GBytes/sec realistic with one or more of the following enhancements
    I/O joint R/W software optimization
    Increased result buffer size beyond 2 MB (future Cray release)
    Bidirectional DMA (future Cray release)
  550 Mbytes/sec approximate worst case
    Time-shared Opteron transmit and receive
  Current optimized rate ~830 Mbytes/sec
64 bits / complex sample
  R = 1.4 GB/sec = 5.7 ns/point
  R = 1.1 GB/sec = 7.3 ns/point
  R = 830 MB/sec = 9.6 ns/point
  R = 550 MB/sec = 14.5 ns/point
The average FFT computation rate is determined entirely by I/O throughput
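Since each complex sample occupies 64 bits (8 bytes), an I/O rate converts directly into a time per point and a time per FFT. A small sketch of that arithmetic, with hypothetical function names:

```python
BYTES_PER_SAMPLE = 8  # 64 bits per complex sample


def ns_per_point(rate_bytes_per_sec):
    """Time to move one complex sample at the given I/O rate, in nanoseconds."""
    return BYTES_PER_SAMPLE / rate_bytes_per_sec * 1e9


def fft_time_usec(n, rate_bytes_per_sec):
    """Average time per N-point FFT when throughput is I/O bound, in microseconds."""
    return n * ns_per_point(rate_bytes_per_sec) / 1e3


for rate in (1.4e9, 1.1e9, 830e6, 550e6):
    print(f"R = {rate / 1e6:6.0f} MB/sec: {ns_per_point(rate):5.2f} ns/point, "
          f"{fft_time_usec(65536, rate):7.1f} usec per 65536-point FFT")
```

The model deliberately ignores the FFT core itself: at one sample per clock the core keeps up with the data stream, so only the transfer rate appears in the estimate.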
FFT Computation Rate
[Figure: time (log scale, up to 10000) versus log2(Nfft) from 5 to 16, for four cases: R = 1.42; R = 1.1; R = 1.1 / 2; FFTW(cpignol64-246).]
FFT Computation Rate
[Figure: time (log scale, 1 to 10000) versus log2(Nfft) from 10 to 16, for four cases: R = 1.42; R = 1.1; R = 1.1 / 2; FFTW(cpignol64-246).]
Experimental Results
Un-optimized
FFT size  Frame size  Frames  FFTs/frame  Total processed  Average duration (usec)    Throughput (Mbytes/sec)
N         K           F       K/N         Nproc            Total    TX      RX        Total    TX       RX
65536     65536       8       1           524288           932.63   403.00  526.00    562.16   1300.96   996.75
32768     32768       16      1           524288           462.50   205.88  255.06    566.80   1273.32  1027.77
16384     16384       32      1           524288           231.31   106.28  123.88    566.65   1233.26  1058.10
8192      8192        64      1           524288           116.86    56.05   60.31    560.81   1169.30  1086.62
4096      4096        128     1           524288            60.45    30.04   29.74    542.11   1090.85  1101.74
2048      2048        256     1           524288            32.81    17.96   14.50    499.44    912.20  1129.62
1024      1024        512     1           524288            19.23    11.09    7.90    426.02    738.95  1037.22
512       1024        512     2           524288             9.27     5.44    3.73    441.76    753.50  1098.71
256       1024        512     4           524288             4.62     2.75    1.81    443.67    745.54  1130.24
128       1024        512     8           524288             2.31     1.38    0.91    442.91    744.73  1126.51
64        1024        512     16          524288             1.16     0.69    0.46    441.00    740.96  1117.90
32        1024        512     32          524288             0.58     0.35    0.23    441.38    742.03  1127.75
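The throughput columns appear to follow from the durations if each duration is taken as the average time per N-point FFT at 8 bytes per sample; that reading is an inference from the numbers, not something the slide states. A short check on three rows of the table:

```python
BYTES_PER_SAMPLE = 8  # 64 bits per complex sample


def throughput_mbytes(n, duration_usec):
    """Throughput in Mbytes/sec for one N-point FFT taking duration_usec.

    Bytes per microsecond is numerically equal to Mbytes/sec (10^6 bytes).
    """
    return n * BYTES_PER_SAMPLE / duration_usec


# (N, total duration in usec, reported total throughput in Mbytes/sec)
rows = [(65536, 932.63, 562.16), (1024, 19.23, 426.02), (32, 0.58, 441.38)]
for n, usec, reported in rows:
    print(f"N = {n:5d}: computed {throughput_mbytes(n, usec):7.2f}, reported {reported:7.2f}")
```

Small differences between computed and reported values are expected because the durations in the table are rounded to two decimals.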
Experimental Results
[Figure: sustained throughput (MBytes/sec, 0 to 900) versus frame length (1024 to 65536), with improved TX/RX software optimization.]
Speed Improvement Ratio
[Figure: speed improvement ratio T(fftw)/T(fpga) (0.00 to 9.00) versus Nfft (32 to 65536), for R=1.4G, R=1.1G and R=800M.]
Conclusions
  Up to 4.75x speed gains are achievable today
  FPGA performance enhancement increases with FFT length
  Multiple FFTs utilize pipeline and provide efficiency
  Latency limits usefulness for single computations of small FFT sizes
  FFT L2norm accuracy ~10^-5, similar to other single-precision algorithms
  Modular architecture
    Separate I/O and application optimization
    Rapid application development