Accelerating computation with FPGAs
web.stanford.edu/class/ee380/Abstracts/090513-slides.pdf


Page 1:

Accelerating computation with FPGAs

Michael J. Flynn, Maxeler Technologies and Stanford University

Page 2:

Based on work done by my colleagues at Maxeler, especially Oskar Mencer, Oliver Pell, Rob Dimond and Jacob Bower, and, for the seismic speedup study, Tamas Nemeth of Chevron. The seismic background slides are courtesy of Olav Lindtjorn of Schlumberger and Bob Clapp of Stanford Geophysics.

Page 3:

Outline

• A walk back to the beginning of hardware “Design Automation”

• An alternative parallel model for hardware and software.

• Maxeler tools and hardware
• Seismic computation, and an example
• Some results and comments

Page 4:

A walk back to the beginning of hardware “Design Automation”

Page 5:

IBM 704 / 709 (c. 1955): hand assembly

Page 6:

The Standard SMS Card

Page 7:

ALDs (Automated Logic Diagrams): each block was a logic gate, represented on a punched card. Machines did the routing and wiring.

(Figure: an ALD block, labelled with type, socket#, and pin#.)

Page 8:

SMS Cards and Sockets

Page 9:

Large SMS frame (7030 and 7090)

Page 10:

7094 (c. 1961); more than 350 were made, enabled by design automation

Page 11:

An alternative parallel model for hardware and software.

Page 12:

Today’s HPC architecture

Page 13:

Hardware and software alternatives

• Hardware: A more generalized (and reconfigurable) parallel array model.

• Software: A cylindrical rather than a layered model suits many applications.

Page 14:

Design space for large applications

• Some distinctions:
– Memory limited vs. compute limited
– Static vs. dynamic control structure
– Homogeneous (scalar) vs. core + accelerator at the node

Page 15:

HPC: the case for reconfigurable heterogeneous nodes

• Traditional "scalar" multicore processors are designed to optimize latency, but not throughput.
• Structured (systolic) arrays (FPGAs, GPUs, hyper "core" or "cell" dies) offer complementary throughput acceleration.
• Properly configured, an array can provide 10x-100x improvement in arithmetic (SIMD) or memory bandwidth (by streaming, or MISD).

Page 16:

Accelerate tasks by streaming

• MISD structured computation: stream computations across a long array before storing results in memory. Can achieve >100x improvement in the use of memory. A minimal software sketch of the idea follows the diagram.

(Diagram: data from node memory enters the structured array, flows through Computation #1, intermediate results are buffered, flow through Computation #2, and the results return to memory.)
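To make the streaming idea concrete, here is a minimal plain-Java sketch (ours, not Maxeler code) of the fusion the diagram implies: two computations are chained so the intermediate value stays in the pipe instead of round-tripping through memory as a full array. The two stage functions are illustrative assumptions.

    // Sketch of the streaming (MISD) idea: fuse two computations so the
    // intermediate result is buffered in the pipeline, not written to memory.
    public final class StreamFusion {
        static double stage1(double x) { return 2.0 * x + 1.0; } // Computation #1
        static double stage2(double x) { return x * x; }         // Computation #2

        // Unfused: a full intermediate array round-trips through memory.
        static double[] unfused(double[] in) {
            double[] tmp = new double[in.length];
            for (int i = 0; i < in.length; i++) tmp[i] = stage1(in[i]);
            double[] out = new double[in.length];
            for (int i = 0; i < in.length; i++) out[i] = stage2(tmp[i]);
            return out;
        }

        // Streamed: each element flows through both stages; only a scalar
        // "buffer" holds the intermediate, as in the diagram above.
        static double[] streamed(double[] in) {
            double[] out = new double[in.length];
            for (int i = 0; i < in.length; i++) {
                double buffered = stage1(in[i]); // buffered intermediate result
                out[i] = stage2(buffered);       // forwarded, never stored
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(java.util.Arrays.toString(
                streamed(new double[] {1, 2, 3, 4})));
        }
    }

On an FPGA the fused loop body becomes a spatial pipeline, so the intermediate value never touches DRAM at all; that is where the >100x improvement in memory use comes from.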

Page 17:

FPGA acceleration

• One-tenth the frequency, but 10^5 - 10^6 cells per die.
• The magnitude of parallelism overcomes the frequency limitation.
• Stream data across the large cell array, minimizing memory bandwidth; create a systolic array without nearest-neighbour constraints.
• Customized data structures, e.g. 17-bit floating point: always just enough precision (sketch below).
• A software (re)configurable technology.
• An in-depth application study is needed to realize acceleration; acceleration is not automatic, and it requires more programming effort.
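As a hedged illustration of "just enough precision", the plain-Java sketch below emulates a narrow custom float by rounding away low mantissa bits. The 8-bit mantissa here (and the 1-sign / 8-exponent / 8-mantissa reading of a 17-bit format) is our assumption for illustration, not Maxeler's actual format.

    // Emulate a narrow custom floating-point format in software by rounding
    // a float's 23-bit mantissa down to a chosen width (assumes width < 23).
    public final class NarrowFloat {
        static float quantizeMantissa(float v, int mantissaBits) {
            int bits = Float.floatToRawIntBits(v);
            int drop = 23 - mantissaBits;            // mantissa bits to discard
            int half = 1 << (drop - 1);              // for round-to-nearest
            int rounded = (bits + half) & ~((1 << drop) - 1);
            return Float.intBitsToFloat(rounded);
        }

        public static void main(String[] args) {
            float x = 3.14159265f;
            // Prints 3.140625: pi kept to an 8-bit mantissa
            System.out.println(quantizeMantissa(x, 8));
        }
    }

In hardware the payoff is direct: a narrower operand means smaller multipliers and fewer wires, so more parallel pipes fit on the same die and the same DRAM bandwidth carries more points.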

Page 18:

A different programming model

• A cylindrical rather than a layered model suits static applications

• A streaming (MISD) computational model suits memory and (often) compute limited applications.

Page 19:

Cylindrical Model: Non-Accelerated System

(Diagram: the critical kernel spans software, CPU, and hardware; high-level choices at the software end, low level with no choice at the hardware end.)

Page 20:

Cylindrical Model: Accelerated System

(Diagram: the critical kernel sits between high-level choices (application choices, algorithms, types of parallelism, data representations) and low-level choices (communications, routing, timing, gate delays).)

• First accelerate the (initially small) kernel
• Later extend the kernel size and optimize

Page 21:

Speedup with the Cylindrical Model

• Transform the application to execute multiple simultaneous strips using DRAM "pipes" (see the sketch after this list)
• Stream computations through each pipe
• Strip size is limited by FPGA area and DRAM bandwidth
• Application-specific data precision:
– multiplies FPGA area
– multiplies DRAM bandwidth
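A rough software analogue of the strip transformation (our sketch; the pipe count and the two-stage arithmetic are illustrative assumptions): the volume is split into strips, and each strip streams through the same fused pipeline concurrently, standing in for multiple DRAM pipes.

    import java.util.stream.IntStream;

    public final class Strips {
        // One "pipe": stream a strip through the fused two-stage pipeline.
        static void streamStrip(double[] data, int from, int to) {
            for (int i = from; i < to; i++) {
                double t = 2.0 * data[i] + 1.0; // stage 1
                data[i] = t * t;                // stage 2, forwarded in-pipe
            }
        }

        public static void main(String[] args) {
            double[] volume = new double[1 << 20];
            int pipes = 4;                      // simultaneous strips
            int strip = volume.length / pipes;
            IntStream.range(0, pipes).parallel() // one pipe per strip
                     .forEach(p -> streamStrip(volume, p * strip, (p + 1) * strip));
        }
    }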

Page 22:

Too much effort?

"The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue, this is a decided disadvantage."

- Daniel Slotnick, 1967

Page 23:

Maxeler approach and tools.

Page 24:

MAXware: acceleration support

• Acceleration methodology
• Application tool set
• Acceleration hardware from Maxeler

Page 25:

Acceleration methodology

• Four stages:
– Analysis of the program (static and dynamic)
– Transformation: create the dataflow graph, data layout and representation
– Partitioning: optimize dataflow, data access and data representation options
– Implementation: uses the Maxeler compiler to configure the application for FPGAs, and sets the run-time interfaces

Page 26:

Maxeler tool set

• Parton: profiles the running application and finds hot spots and suitability for the FPGA; key metric: computations per memory access
• Performance modeller: estimates the execution time of alternative FPGA designs; accurate to within 10%
• MaxelerOS: support for the MAX2 board
• MaxQ: FPGA bitstream manufacturing

Page 27:

Maxeler Compiler

1) User application written in (implementable) Java

2) Code is analyzed and transformed to match data bandwidth

3) Maxeler manager compiler generates structural hardware design.

4) Maxeler kernel compiler generates data flow graph.
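Step 4 is the key inversion: the Java arithmetic is captured as a dataflow graph rather than executed. The toy Node class below is ours, for illustration only (it is not the Maxeler compiler); it shows how writing an expression in Java can build such a graph.

    import java.util.*;

    // Toy dataflow-graph builder: evaluating Java expressions constructs
    // nodes and edges instead of computing values.
    final class Node {
        final String op;
        final List<Node> inputs;
        Node(String op, Node... in) { this.op = op; this.inputs = List.of(in); }

        static Node input(String name) { return new Node("in:" + name); }
        Node add(Node o) { return new Node("+", this, o); }
        Node mul(Node o) { return new Node("*", this, o); }

        // Walk the graph in dependency order, printing each node once.
        void dump(Set<Node> seen) {
            if (!seen.add(this)) return;
            for (Node n : inputs) n.dump(seen);
            System.out.println(op + " (" + inputs.size() + " input(s))");
        }

        public static void main(String[] args) {
            Node a = input("a"), b = input("b"), c = input("c");
            Node y = a.add(b.mul(c)); // writing "a + b*c" builds the graph
            y.dump(new LinkedHashSet<>());
        }
    }

In the real tool chain the graph nodes become pipelined arithmetic units, which is why the compiler-generated graphs (page 31) run to thousands of nodes.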

Page 28:

Maxeler Manager Compiler

(Diagram: the manager connects PCIe, the memory controller, the arithmetic computation (kernel), and the streaming interface specification.)

The manager:
• connects blocks in different clock domains
• bridges streaming blocks with different configurations
• performs aspect changes on the data flow

Page 29:

Data flow graph fragment

Page 30:

Same fragment, now buffer synchronized

Page 31:

Data flow graph as generated by compiler

4866 nodes; about 250x100

Page 32:

Maxeler Kernel Compiler

• Takes the data flow graph as input
• Performs a number of transformations:
– Pre-scheduling optimizations, e.g. constant folding
– Adds scheduling buffers (toy simulation below)
– Buffer optimization
• Generates a structural hardware design
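To illustrate "adds scheduling buffers" (compare the buffer-synchronized fragment on page 30), here is a toy cycle-by-cycle simulation in plain Java. The latencies are illustrative assumptions: a 3-cycle multiplier path and a 1-cycle adder path feed one final adder, so the short path needs a 2-deep delay buffer for the operands to line up.

    import java.util.ArrayDeque;

    public final class BufferSync {
        // A fixed-latency pipeline stage modeled as a pre-filled FIFO.
        static ArrayDeque<Double> pipe(int latency) {
            ArrayDeque<Double> q = new ArrayDeque<>();
            for (int i = 0; i < latency; i++) q.add(Double.NaN); // bubbles
            return q;
        }

        public static void main(String[] args) {
            int mulLatency = 3, addLatency = 1;
            ArrayDeque<Double> mulPipe = pipe(mulLatency);              // x*x path
            ArrayDeque<Double> incPipe = pipe(addLatency);              // x+1 path
            ArrayDeque<Double> syncBuf = pipe(mulLatency - addLatency); // inserted buffer

            for (int cycle = 0; cycle < 8; cycle++) {
                double x = cycle;            // one new stream element per cycle
                mulPipe.add(x * x);
                incPipe.add(x + 1);
                syncBuf.add(incPipe.poll()); // delay the short path
                double a = mulPipe.poll(), b = syncBuf.poll();
                if (!Double.isNaN(a) && !Double.isNaN(b))
                    System.out.println("y = " + (a + b)); // operands from the same x
            }
        }
    }

Without syncBuf, the adder would combine values from different input elements; the inserted FIFO is exactly the scheduling buffer the kernel compiler adds.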

Page 33:

Streaming hardware examples from Maxeler Technologies

• MaxCard: PCI Express x16 card
• MaxBox: 1U server with 1 MaxCard, or 4 MaxCards in a 1U box
• MaxRack: 10U, 20U or 40U

Max 75 W, typical 50 W

Page 34:

The MAX2 board

• 2 x Virtex-5 LX330T, each with 24 GB of memory, with a PCI Express x16 link.
• 10 GB/s inter-FPGA link; total off-chip bandwidth of 48 GB/s.
• MaxelerOS has Linux device drivers and manages all the PCIe & DRAM "plumbing".

Page 35:

Seismic computation

Page 36:

Example: seismic data processing

• For O&G exploration: distribute a grid of sensors over a large area.
• Send a sonic impulse into the area and record the reflections: frequency, amplitude and delay at each sensor.
• Sea-based surveys use 30,000 sensors to record data (120 dB range), each sampled at more than 2 kbps, with a new sonic impulse every 10 sec.

On the order of terabytes of data each day.

Page 37:

(Diagram of a marine seismic streamer spread: vessel, source, Q-fins, monowing, miniwing, tailbuoy.)

Page 38:

Page 39:

Q-Technology: Higher Resolution & Fidelity

Slide courtesy of Olav Lindtjorn

Page 40:

Relative Cost of Imaging Algorithms

(Chart: relative CPU cost, on a log scale from 1 to 1,000,000, vs. imaging complexity. Algorithms shown include KPrSTM, KPrSDM (TTI), CAZ WEM (VTI), Shot WEM (VTI), DW-WEM, Fast Beam, Reverse Time, and FWI-Elastic, "The Holy Grail" (2012). Memory and disk space cost is also marked.)

Slide courtesy of Olav Lindtjorn

Page 41:

Application Architecture and data rates

Data:
• Meta data
• Trace: 8 – 20 kB
• Shot: 500 MB – 2 GB
• Survey: 2 – 500 TB

(Diagram: a control node feeds distributors (MAP), which fan work out to banks of compute nodes; collectors (Reduce) gather the results. 100s – 10,000+ nodes; run time: days – weeks.)

Slide courtesy of Olav Lindtjorn

Page 42:

Computational problems in geophysics

(Diagram mapping problems to methods: wave propagation, diffusion, and fluid flow; solved with finite difference, FFT, finite element, and sparse matrix methods; across oil, water, and climate applications.)

Slide courtesy of Bob Clapp

Page 43:

Some results and comments

Page 44:

Seismic forward modelling example

• The data can be interpreted: frequency and amplitude indicate structure; delay indicates depth (z axis).
• Process the data to determine the location of structures of interest.
• There are many different ways to process. One option: simulate wave propagation from source to receivers and compare to the recorded data.

Page 45:

Computations per output point on an Intel Xeon

Kernel               Cycles   FLOPs   Other Ops   L1 Cache Miss Rate   CPI
2nd order – X pass     11%     40.2      72.3           0.2%           0.6
2nd order – Y pass     15%     40.2      72.3           3.8%           0.8
2nd order – Z pass     21%     40.2      72.4           7.3%           1.1
Vector add              1%      1.0       9.0           1.3%           0.9
4th order – X pass     10%     40.2      64.7           0.2%           0.6
4th order – Y pass     15%     40.2      64.7           4.0%           0.9
4th order – Z pass     21%     39.6      65.3           7.8%           1.2
Update pressure         1%      1.9       9.9           1.0%           0.7
Boundary sponge         5%      0.8       4.0           5.6%           5.8

Page 46:

Computations

• On average, about one data cache miss per 10 floating point ops.
• The Xeon achieves about 1.0 CPI.
• So the Xeon has a 20x frequency starting advantage over an FPGA-based computation.
• BUT the FPGA uses lots of parallelism, to significant advantage.

Page 47:

Chevron 3D Finite Difference

(Chart: run time in seconds, on a log scale from 1 to 1,000,000, vs. problem size (x^3) from 240 to 1008, comparing the Maxeler FPGA card against a single CPU core with -O3. Speedup using the Maxeler FPGA system at Chevron: 85x to 240x.)

Data from Tamas Nemeth (Chevron), SEG conference, Nov 08, Las Vegas.

Page 48:

Streaming solution (FPGA)

• Convolve 4 input points (strips) simultaneously (per FPGA); buffer intermediate results; forward 4 outputs to next pipeline stage: SIMD

• Continue (streaming) pipelining until the silicon runs out (468 stages): MISD

• Size the Floating Point so that there is just enough range & precision

• One PCIe board provides 8 x 468 FLOPs every 4 ns (about 936 GFLOPS): almost 1 teraflop.

O. Pell, T. Nemeth, J. Stefani and R. Ergas. Design Space Analysis for the Acoustic Wave Equation Implementation on FPGA Circuits. European Association of Geoscientists and Engineers (EAGE) Conference, Rome, June 2008.

Page 49:

Page 50:

(Diagram: 4 input points x 4 output points; no cache misses, no memory accesses. Four input streams plus a constant feed the array, which produces four output streams. Implements the dataflow graph in 468 pipeline stages. A software sketch of the underlying stencil idea follows.)
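The heart of this pipeline is a stencil computed over a stream. Here is a minimal plain-Java sketch (ours; the 3-point window and coefficients are illustrative assumptions, not Chevron's operator): a small shift register holds the window, so each input is read from memory exactly once and intermediate values never leave the pipe.

    // A 3-point 1D convolution over a stream, using a 2-deep shift register
    // instead of repeated array reads: no cache misses on intermediates.
    public final class StreamingStencil {
        public static void main(String[] args) {
            double[] in = {1, 2, 3, 4, 5, 6, 7, 8};
            double c0 = 0.25, c1 = 0.5, c2 = 0.25;  // illustrative coefficients
            double xm1 = 0, xm2 = 0;                // shift register (window)
            for (double x : in) {
                double y = c0 * x + c1 * xm1 + c2 * xm2; // stencil on the window
                xm2 = xm1; xm1 = x;                      // shift the window
                System.out.println(y);
            }
        }
    }

On the FPGA the same idea is unrolled in space: the shift register becomes registers and wires, and hundreds of such stages run concurrently.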

Page 51:

Achieving Speedup > 100x

• Stream the computation in a 468-stage pipeline
• Execute 8 points simultaneously
• Eliminate cache misses; eliminate overhead operations (load, store, branch, ...)
• So: 8 (points processed) x 468 stages x 2 (credit for eliminated overhead ops) / 20 (the frequency disadvantage) = 374 (maximum possible speedup)
• Operate at one-twentieth the frequency; reduce power and space.

Page 52:

So how typical is this?

• Other application areas (ongoing studies):
– Financial computations: Monte Carlo based derivative price and risk
– PDE solvers: finite difference, finite elements
– Image processing
– CFD, Lattice Boltzmann

• Kernels, referenced to a Xeon:
– Gaussian random number generation: 300x
– 3D finite difference: 160x
– Seismic offset gathers: 170x

Page 53:

So how can emulation technology (FPGA) be better than the x86 processor(s)?

• Lack of robustness in x86 streaming hardware (spanning area, time, power)
• Lack of a robust parallel software methodology and tools
• Effort and support tools enable large systolic dataflow with surprisingly good A x T x P (area x time x power)
• Effort and support tools provide significant application speedup

Page 54:

So, are we there yet?

• Do we have a design automation technology for parallelizing software?

• Well, no; but Slotnick promised it wouldn't be easy, and we're making progress.

There's no free lunch in parallel programming, yet.

Page 55:

Summary

• Acceleration-based speedup using systolic arrays.
• More attention to memory and/or arithmetic: data representation, streaming, and RAM.
• Lower power through non-aggressive frequency use.
• Programming uses the cylindrical model, but speedup requires lots of low-level program optimization. Good tools are golden.