TRANSCRIPT
System Acceleration Overview
Bill Jenkins
Altera Sr. Product Specialist for
Programming Language Solutions
Industry Trends
Increasing product functionality and performance
Smaller time-to-market window
Shorter product lifecycle
Limited design resources
Parallel Computing
“A form of computation in which many calculations are
carried out simultaneously, operating on the principle that
large problems can often be divided into smaller ones, which
are then solved concurrently (in parallel)”
~ Highly Parallel Computing, Almasi/Gottlieb (1989)
Need for Parallel Computing

Memory Wall (information): memory architectures have limited bandwidth and can't keep up with the processor. DRAM cores are slow!
Power Wall (realization/cost): process scaling (28nm/20nm/14nm) trends towards exponentially increasing power consumption.
ILP Wall (computation): compilers don't find enough parallelism in a single instruction stream to keep a von Neumann architecture busy.

CPUs aren't getting faster. Performance now comes from parallelism.
Challenges in Parallel Programming

Finding parallelism: what activities can be executed concurrently? Is parallelism explicit (programmer specified) or implicit?
Data sharing and synchronization: what happens if two activities access the same data at the same time? There are hardware design implications, e.g. uniform address spaces, cache coherency.
Applications exhibit different behaviors:
Control: searching, parsing, etc.
Data intensive: image processing, data mining, etc.
Compute intensive: iterative methods, financial modeling, etc.
Finding Parallelism

Scatter-gather: separate input data into subsets that are sent to each parallel resource, then combine the results (data parallelism).
Divide-and-conquer: decompose the problem into sub-problems that run well on the available compute resources (task parallelism).
Pipeline parallelism: task parallelism where tasks have a producer-consumer relationship; different tasks operate in parallel on different data.
Granularity of Parallelism

Ratio of computation to communication: fine-grain vs. coarse-grain.
The most efficient granularity depends heavily on the application and hardware environment.
Fine-grained parallelism

Low computation intensity; small task size; data transferred frequently.
Benefit: easy to load balance among tasks.
Drawback: synchronization and communication overhead can overshadow the benefits of parallelism.
Examples: GPU threads, instruction-level parallelism.
[Figure: fine-grain timeline alternating short computation and communication phases]
Coarse-grained parallelism

High arithmetic intensity; data transferred infrequently.
Benefit: low overhead.
Drawback: difficult to load balance.
Example: threads running on different CPUs.
[Figure: coarse-grain timeline with long computation phases and infrequent communication]
Programmer's Dilemma

"The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it."
~ Steve Jobs (1955-2011)
Heterogeneous Computing Systems

Heterogeneous computing refers to systems that use more than one kind of processor. These systems gain performance not just by adding more of the same type of processor, but by adding dissimilar processors, usually incorporating specialized processing capabilities to handle particular tasks (applications).
Example Heterogeneous Computing Devices

Multi-core, general-purpose central processing units (CPUs): include multiple execution units ("cores") on the same chip.
Digital signal processors (DSPs): optimized for the operational needs of digital signal processing.
Graphics processing units (GPUs): heavily optimized for computer graphics processing.
Field-programmable gate arrays (FPGAs): custom architectures for each problem being solved.
Efficiency via Specialization

[Figure: efficiency comparison of GPUs, FPGAs, and ASICs]
Source: Bob Brodersen, Berkeley Wireless group
What is OpenCL?

A low-level software programming model for software engineers and a software methodology for system architects. The first industry standard for heterogeneous computing, it provides increased performance with hardware acceleration.

Low-level programming language based on ANSI C99.
Open, royalty-free standard managed by the Khronos Group (Altera is an active member); conformance requirements apply.
V1.0 is the current reference; V2.0 is the current release. http://www.khronos.org

The host is programmed in C/C++ through the OpenCL API; the accelerator is programmed in OpenCL C.
OpenCL Platform Model

Heterogeneous platform model: a host (e.g. x86) connects to one or more devices (e.g. over PCIe). The host has its own host memory; each device has its own global memory.
OpenCL Constructs

The OpenCL standard provides abstract models:
Generic: able to be mapped onto significantly different architectures.
Flexible: able to extract high performance from every architecture.
Portable: vendor and device independent.
Parallelism in OpenCL

Parallelism is explicit: identify parallel regions of the algorithm and implement them as kernels executed by many work-items. Task (SMT) or data (SPMD) parallelism.

Hierarchy of work-items (threads):
Work-items are grouped into workgroups.
Workgroup size is usually restricted by the hardware implementation (256-1024).
Work-items within a workgroup can explicitly synchronize and share data; otherwise they are free to execute independently.
Workgroups are always independent.

Explicit memory hierarchy:
Global memory is visible to all workgroups and work-items.
Local memory is visible only to work-items in a workgroup.
Private memory is visible only to a single work-item.
High-Level CPU / DSP Architectures

Optimized for latency: large caches, hardware prefetch.
Complicated control: superscalar, out-of-order execution, etc.
Comparatively few execution units; vector units (e.g. Intel SSE/AVX, ARM® NEON).
[Figure: multi-core CPU with per-core control, ALUs, and L1/L2 caches, a shared L3 cache, and DRAM]
Compiling the OpenCL Standard to CPUs

Execute different workgroups across cores.
Potentially 'fuse' work-items together to execute on vector units, or 'vectorize' kernels to work with explicit vector types.
Synchronization between work-items is handled entirely in software.
No dedicated hardware for sharing data between work-items; rely on caches instead.
High-Level GPU Architectures

Multiple compute units ("cores"); conceptually many parallel threads with vector-like execution.
Each core has dedicated resources: registers, local memory/L1 cache, hardware synchronization.
Wide memory bus: high bandwidth but high latency (800 cycles!), with small read/write caches.
Delivered as a PCIe® board.
Compiling the OpenCL Standard to GPUs

Distribute workgroups across compute units; work-items execute in parallel on the vector-like cores.
Dedicated hardware resources provide efficient synchronization of work-items and data sharing between work-items.
Run 'enough' work-items per compute unit to mask latencies:
Work-items contend for fixed resources (registers, local memory), and hardware also limits the number of work-items/workgroups per compute unit.
This translates into a requirement for thousands of work-items!
NVIDIA provides a tool to help calculate this.
Greater Challenges Require More Complex Solutions: FPGAs

Tell me what it doesn't have and you'll see it in the next generation:
Intel fab improves performance from 20% to 2X!
10 Tflops throughput
Quad ARM A53 processors
100's of interface protocols
GB's of on-die DRAM

Growth in density: >50K logic elements (1990's), >500K logic elements (2000's), >5M logic elements (2010's).
FPGA Architecture: Fine-grained, Massively Parallel

Millions of reconfigurable logic elements
Thousands of 20Kb memory blocks
Thousands of variable-precision DSP blocks
Dozens of high-speed transceivers
Multiple high-speed configurable memory controllers
Multiple ARM® cores
[Figure: FPGA floorplan ringed by I/O]

Let's zoom in.
FPGA Architecture: Basic Elements

The basic element is a 1-bit configurable operation plus a 1-bit register to store the result. It can be configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB.
FPGA Architecture: Flexible Interconnect

Basic elements are surrounded by a flexible interconnect.
FPGA Architecture: Custom Operations Using Basic Elements

Wider custom operations are implemented by configuring and interconnecting basic elements: a 16-bit add, a 32-bit sqrt, or your custom 64-bit bit-shuffle and encode.
FPGA Architecture: Memory Blocks

20 Kb memory blocks (with address and data-in/data-out ports) can be configured and grouped using the interconnect to create various cache architectures: lots of smaller caches, or a few larger caches.
FPGA Architecture: Floating-Point Multiplier/Adder Blocks

Dedicated floating-point multiply and add blocks.
FPGA Architecture: Configurable Routing

Blocks are connected into a custom datapath that matches your application.
Mapping a simple program to an FPGA

High-level code:
Mem[100] += 42 * Mem[101]

CPU instructions:
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
First let's take a look at execution on a simple CPU

[Figure: simple CPU datapath with instruction fetch, a register file (A/B read ports, C write port), an ALU, and a load/store unit]

Fixed and general architecture:
- General "cover-all-cases" data-paths
- Fixed data-widths
- Fixed operations
Load constant value into register

[Figure: the same CPU datapath, with only the load-constant path active]

Very inefficient use of hardware!
CPU activity, step by step

Each instruction executes in turn, spread over time:
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
On the FPGA we unroll the CPU hardware…

One copy of the hardware per instruction, spread over space:
R0 ← Load Mem[100]
R1 ← Load Mem[101]
R2 ← Load #42
R2 ← Mul R1, R2
R0 ← Add R2, R0
Store R0 → Mem[100]
… and specialize by position

Step by step, the unrolled hardware is specialized:
1. Instructions are fixed. Remove "Fetch".
2. Remove unused ALU ops.
3. Remove unused Load / Store units.
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
Custom data-path on the FPGA matches your algorithm!

Build exactly what you need:
Operations
Data widths
Memory size & configuration

Efficiency: throughput / latency / power.

High-level code:
Mem[100] += 42 * Mem[101]
Custom data-path: two loads feed a multiply by the constant 42 and an add, then a store.
OpenCL on FPGA

OpenCL kernels are translated into a highly parallel circuit:
A unique functional unit is created for every operation in the kernel (memory loads/stores, computational operations, registers).
Functional units are connected only where the kernel dictates a data dependence.
The resulting circuit is pipelined, with a new thread entering each clock cycle to keep the functional units busy.
Kernels can be launched as single-threaded (single work-item) kernels or as NDRanges; there are no workgroup-size limitations, since the workgroup is the NDRange (configure internal memory accordingly).
The amount of parallelism is dictated by the number of pipelined computing operations in the generated hardware.
FPGA Channels Advantage

[Figure: two FPGA system diagrams, standard OpenCL vs. Altera vendor extension. In the standard model, the host interface, CvP update block, and DDR3/QDRII/10Gb interfaces all reach the OpenCL kernels through a shared interconnect. With I/O and kernel channels, kernels stream data directly between each other and the DDR, QDR, and 10G network interfaces.]
Shared Virtual Memory (SVM) Platform Model

Traditional hosted heterogeneous platform (OpenCL 1.2): the CPU's host memory and each device's global memory are separate; data is copied between them.
New hosted heterogeneous platform with SVM (OpenCL 2.0): host and devices share virtual memory, over coherent interconnects such as PCIe, QPI, or CAPI (CAPP/PSL).
OpenCL and FPGA Acceleration in the News

IBM and Altera Collaborate on OpenCL: "IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems.

Intel Reveals FPGA and Xeon in One Socket: "That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant.

Baidu and Altera Demonstrate Faster Image Classification: "Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications…"

Search Engine Gets Help From FPGA: "Altera was really interesting in helping with the development—the resources they were willing to throw our way were more significant than those from Xilinx" ~ Microsoft engineering manager.
OpenCL on FPGAs Fits Into All Markets

Compute & Storage (HPC, financial, data compression)
Military/Government (crypto, image detection)
Broadcast, Consumer (video image processing)
Medical (diagnostic image processing, bioinformatics)
Automotive/Industrial (pedestrian detection, motion estimation)
Networking (DPI, SDN, NFV)

All are data processing algorithms.
Case Study: Image Classification

Deep learning algorithm: convolutional neural networks, based on Hinton's CNN.

Early results on Stratix V:
2X performance/power vs. GPGPU, despite soft floating point
400 images/s
8+ simultaneous kernels (vs. 2 on GPGPU), exploiting OpenCL channels between kernels

Arria 10 results:
Hard floating point uses all DSPs; better density and frequency
~4X performance/watt vs. Stratix V
6800 images/s
No code change required

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6000 images per class: 50000 training images and 10000 test images.
Smith-Waterman

Sequence alignment via a scoring matrix.
Advantage FPGA: integer arithmetic, SMT streaming.

Results:

| Platform | Power (W) | Performance (MCUPS) | Efficiency (MCUPS/W) |
|---|---|---|---|
| W3565 Xeon Processor | 140 | 40 | 0.29 |
| nVidia K20 | 225 | 704 | 3.13 |
| PCIe385 SV FPGA | 25 | 32596 | 1303.00 |
Haplotype Caller (Pair-HMM)

A Smith-Waterman-like algorithm that uses hidden Markov models to compare gene sequences.
3 stages: Assembler, Pair-HMM (70% of runtime), Traversal + Genotyping.
Floating point (SP + DP); C++ code starting point (ported from Java).
A whole genome takes 7.6 days!

Results:

| Platform | Runtime (ms) |
|---|---|
| Java (GATK 2.8) | 10,800 |
| Intel Xeon E5-1650 | 138 |
| nVidia Tesla K40 | 70 |
| Altera SV FPGA | 15.5 |
| Altera A10 FPGA | 3 |

*Source: http://www.broadinstitute.org/~carneiro/talks/20140501-bio_it_world.pdf
Multi-Asset Barrier Option Pricing

Monte-Carlo simulation: no closed-form solution is possible, a high-quality random number generator is required, and billions of simulations are needed. Used the GPU vendor's example code.
Advantage FPGA: complex control flow.
Optimizations: channels, loop pipelining.

Results:

| Platform | Power (W) | Performance (Bsims/s) | Efficiency (Msims/s/W) |
|---|---|---|---|
| W3690 Xeon Processor | 130 | 0.032 | 0.0025 |
| nVidia Kepler20 | 212 | 10.1 | 48 |
| SV FPGA | 45 | 12.0 | 266 |