DATE 2011 Workshop W2
FPGA-based Scientific Computing: A Bright Future?
Dr. Greg Stitt
Electrical and Computer Engineering, University of Florida
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Introduction
FPGAs widely shown to have performance advantages compared to other devices
However, trend for scientific computing is microprocessors+GPUs
Problems with GPU trend
  Microprocessor+GPU trend unsustainable due to power consumption
  Top supercomputers nearing 10 megawatts
  Result: Energy and cooling dominate total cost of ownership
Are FPGAs a potential solution?
  FPGAs use significantly less power than GPUs, sometimes with similar or better performance
  Computational density per watt becoming an important metric for device efficiency for scientific computing
Novo-G RC Machine @ CHREC
192 Stratix-III E260 FPGAs
  Each Altera FPGA with 768 18x18 multipliers, 254K logic elements, 204K registers, & max. power ~18 W
48 quad-FPGA boards
  GiDEL PCIe x8 PROCStar-III
  Embedded-style boards, for both HPEC- & HPC-oriented research
  4¼ GB DDR2 attached to each FPGA
  ~1 TB total RAM in Novo-G
24+1 Linux servers in cluster
  24 compute servers (2 boards/server)
  1 head-node server for management
  20 Gb/s non-blocking InfiniBand
  1 Gb/s Ethernet
  26 (24+2) quad-core Xeons
  Max. system power of ~8 kW
“Novo” is Latin, "to make anew, refresh, revive, change, alter," essence of RC; “G” is for Genesis or Green.
NW/SW/ND Performance on Novo-G
Baseline: 192·225, length 850 Sequence Comparisons
Software Runtime: 11,026 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            47,616          833
4            12,014          3,304
96           503             78,914
128          391             101,518
192 (est.)   270             147,013

Baseline: Human X Chromosome vs. 19200, length 650 Seqs
Software Runtime: 5,481 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            23,846          827
4            5,966           3,307
96           250             78,926
128          188             104,955
192 (est.)   127             155,366

Baseline: 192·224, length 450 Distance Calculations
Software Runtime: 11,673 CPU hours on 2.4 GHz Opteron

# FPGAs      Runtime (sec)   Speedup
1            13,522          3,108
4            3,429           12,255
96           144             291,825
128          118             356,125
192 (est.)   77              545,751

Results on Novo-G for NW (left), SW (center), and ND (right). Each chart illustrates performance of a single FPGA under varying input conditions. Each table shows scaling performance with a varying number of FPGAs under optimal input conditions.
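The speedup column in each table is the software baseline, converted from CPU hours to seconds, divided by the FPGA runtime. A minimal Python sketch using the first table's figures reproduces the column to within rounding. The parallel-efficiency metric (scaling relative to one FPGA) and all function names are illustrative additions, not something shown on the slides:

```python
# Software baseline for the first table: 11,026 CPU hours on a 2.4 GHz Opteron.
BASELINE_SECONDS = 11_026 * 3600

# (number of FPGAs, measured or estimated runtime in seconds) from the first table.
RUNTIMES = [(1, 47_616), (4, 12_014), (96, 503), (128, 391), (192, 270)]


def speedup(baseline_s, runtime_s):
    """Speedup of the FPGA run over the software baseline."""
    return baseline_s / runtime_s


def parallel_efficiency(n_fpgas, runtime_s, single_runtime_s):
    """Fraction of ideal linear scaling achieved relative to one FPGA."""
    return single_runtime_s / (n_fpgas * runtime_s)


if __name__ == "__main__":
    single = RUNTIMES[0][1]
    for n, t in RUNTIMES:
        s = speedup(BASELINE_SECONDS, t)
        e = parallel_efficiency(n, t, single)
        print(f"{n:>3} FPGAs: speedup {s:>9,.0f}, efficiency {e:.2f}")
```

By this measure the estimated 192-FPGA run retains roughly nine tenths of ideal linear scaling, which is what lets the speedup grow almost in proportion to the FPGA count.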
Estimated performance on Novo-G comparable to or better than the biggest supercomputers on www.Top500.org
  Jaguar @ ORNL: 224,162 cores – 2.4 GHz Hexacore Opterons; 6.95 MW
  Roadrunner @ LANL: 122,400 cores – 3.2 GHz Cells + 1.8 GHz Opterons; 2.35 MW
  Novo-G power: 8 kW max.
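For scale, the power figures quoted here can be compared directly. A small Python sketch of that arithmetic follows. The dictionary and function names are illustrative. The comparison covers maximum power only and says nothing about performance on workloads other than those reported:

```python
# Maximum power figures quoted on the slide, in watts.
POWER_W = {
    "Novo-G": 8e3,         # ~8 kW max.
    "Jaguar": 6.95e6,      # 6.95 MW
    "Roadrunner": 2.35e6,  # 2.35 MW
}


def power_ratio(system, reference="Novo-G"):
    """How many times more power `system` draws than `reference`."""
    return POWER_W[system] / POWER_W[reference]


if __name__ == "__main__":
    for name in ("Jaguar", "Roadrunner"):
        print(f"{name}: about {power_ratio(name):.0f}x the max. power of Novo-G")
```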
Information-Theoretic Adaptive Filtering
ITL applications (diagram): System identification, Feature extraction, Blind source separation, Clustering
Information-Theoretic Learning (ITL)
  New way of data quantification based upon MEE (minimum error entropy) instead of MSE (mean square error)
  Superior results for nonlinear system identification
  However, prohibitive increase in computational complexity; Solution? RC
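The extra cost comes from the pairwise structure of the entropy estimate. A minimal Python sketch contrasts MSE with one common MEE formulation from the ITL literature: Renyi's quadratic entropy estimated with a Gaussian Parzen window, where minimizing entropy is equivalent to maximizing the "information potential". The estimator choice, the kernel width `sigma`, and the function names are assumptions for illustration, not details of the hardware design described in the talk:

```python
import math


def mse_cost(errors):
    """Mean square error: one pass over the window, O(N)."""
    return sum(e * e for e in errors) / len(errors)


def information_potential(errors, sigma=1.0):
    """Parzen estimate V(e) = (1/N^2) * sum_i sum_j G(e_i - e_j).

    G is a Gaussian kernel with variance 2*sigma^2 (two width-sigma kernels
    convolved). Every pair of errors is visited, so the cost is O(N^2).
    """
    n = len(errors)
    var = 2.0 * sigma ** 2
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for ei in errors:
        for ej in errors:
            total += norm * math.exp(-((ei - ej) ** 2) / (2.0 * var))
    return total / (n * n)


def quadratic_entropy(errors, sigma=1.0):
    """Renyi's quadratic entropy estimate H2(e) = -log V(e); MEE minimizes this."""
    return -math.log(information_potential(errors, sigma))
```

Under these assumptions, a window of 100 error samples needs 10,000 kernel evaluations per cost evaluation, against 100 squared terms for MSE. The pairs are independent of one another, so the double loop can be evaluated in parallel.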
Baseline: 160 10th order AFs with window size of 100
Software Runtime: 3 min 34 sec CPU time on 2.4 GHz Opteron

# FPGAs        Runtime (ms)   Speedup
1              36.85          5,800
4 (1 board)    9.22           23,200
8 (1 server)   4.61           46,400

ITL Adaptive Filters (AFs)
  20 AFs/FPGA @ 150 MHz
  80 AFs/board (4 FPGAs), 160 AFs/server (8 FPGAs)
  Additional AFs not scientifically meaningful for 1D data, so app capped @ 160 AFs
Software: Fastest possible sampling frequency of ~1.5 kHz
Hardware: Fastest possible sampling frequency of 425 kHz
Impact: Able to employ superior MEE cost function for much broader spectrum of signals and problems
REF: S. Craciun, A. George, H. Lam, J. Principe, "A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering," Proc. of High-Performance Reconfigurable Computing Technology and Applications Workshop at SC'10, New Orleans, LA, Nov. 14, 2010.
Why aren’t FPGAs more widely used?
5 main barriers preventing wider usage
  Increased application design complexity
    Significantly more complex than microprocessors or GPUs
  Limited applicability
    Not all applications benefit from FPGAs
  Prohibitive compilation times
    Placement and routing often take hours, days, even more than a week
  Device cost
    Newest devices cost more than $10,000
  Lack of application/tool portability, standardization
    Application and tool designers must start over for new systems