draf: a low-power dram-based reconfigurable acceleration...
TRANSCRIPT
-
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng,
Bob Brennan, Christos Kozyrakis
ISCA – June 22, 2016
-
FPGA-Based Accelerators
Improve performance and energy efficiency
Good balance between flexibility (CPUs) and efficiency (ASICs)
Recently used for many datacenter appso Image/video processing, websearch, neural networks, …
2Pictures: Putnam, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ISCA’14
-
Motivation
Deploy FPGAs in cost & power constrained systems
Datacenter systems
o High-density FPGAs for large accelerators for multiple apps
o Low-power FPGAs to simplify integration in servers and racks
Mobile systems
o High-density FPGAs for accelerators for multiple apps
o Low-power FPGAs for low cost and long battery life
3
-
DRAF in a Nutshell
A high-density & low-power FPGA
o Bit-level reconfigurable, just like conventional FPGAs
Uses dense DRAM technology for lookup tables
o Replacing the SRAM technology in conventional FPGAs
DRAF vs. FPGA
o 10 – 100x logic density
o 1/3 power consumption
o Multi-context support with fast context switch
4
-
Challenges of Building DRAM-based FPGAs
5
-
DRAM Array Structure
6
Subarray
MAT
Sense-amp
Master wordline
Ro
w d
eco
de
r
Local wordlineb
itlin
e
A DRAM subarray is naturally a lookup-table
Input
Output
-
MAT
Sense-amp
Master wordline
Ro
w d
eco
de
r
Local wordlineb
itlin
e
Challenges
7
~8k-bit output
~1k rows~10-bit input
Mismatch LUT sizea 8192-bit LUT?
Slow speeda LUT with 10 ns delay?
10-30 ns delay
Destructive accessdata lost after access?
-
User Clock
Physical Path
R1 R2
L1 L2
L3
L4
Destructive Access
Explicit activation, restoration, and precharge operations
o Longer access delay due to serialization
Issue of LUT chaining: order of LUT access
8
Must activate L2 after L1
Must activate L4 after both L2 & L3
-
DRAF ArchitectureBasic Logic Element
Multi-Context Support
Timing
9
-
DRAF Overview
Same island layout and configurable interconnect as FPGA
10
DSPIn DRAM technologySlower but not critical
Block RAMUses DRAM arrays
CLBContains multiple basic logic elements (BLEs)
-
Basic Logic Element
7-10 bits input
2-4 bits output
11
Subarray
MAT
Sense-amp
Master wordlineR
ow
de
cod
er
Local wordline
bit
lin
e Narrower MAT1k bits to 8-16 bits
Col logic
6
14
4x2 4x2
Specialized column logicBetter flexibility
FFs
4 4Additional FFs & MUXsRegistering & retiming
Single-MAT accessMulti-context
3
4
-
Multi-Context Support
DRAF supports 8-16 contexts per chip
o Context: one MAT per BLE
o Efficient use of MATs with little area and power overhead
Instant switch between active contexts
o Similar to context-switch between processes on CPU
Context uses
o One context per accelerator design or application
o One context per part of a very large accelerator design
12
-
User Clock
Physical Path
R1 R2
L1 L2
L3
L4
Timing – Destructive Access
Issue of LUT chaining: order of LUT access
Solution: phase – similar to critical path finding
13
Phase 0 Phase 1
PhaseTimeline
Phase 0
Phase 2
Phase 0 Phase 1 Phase 2
-
LUT-1
LUT-2
Timing – Latency Optimization
Issue: precharge and restore delays
Solution: 3-way delay overlapping
o Hide PRE/RST delays with wire propagation delay
Performance gap between DRAF and FPGA reduces from >10x to 2-4x
14
ACTPRE RSTWire
ACTPRE RST
LUT-1
LUT-2
ACTPRE RSTWire
ACTPRE RST
Saved delay
-
Summary
Challenges solutions
o Mismatch LUT size multi-context BLE
o Destructive access phase-based timing
o Slow speed 3-way delay overlapping
Other design features (see paper)
o Sense-amp as register
o Time-multiplexed routing
o Handling DRAM Refresh
15
-
EvaluationArea, power, performance against FPGA and CPU
16
-
Methodology
Synthesize, place & route with Yosys + VTR
CACTI-3DD with 45 nm power and area models
Comparisons
o 70 mm2 FPGA based on Xilinx Virtex-6
o 70 mm2 DRAF device, 8-context
o Intel Xeon E5-2630 multi-core processor (2.3 GHz)
18 accelerator designs
o MachSuite, Sirius, Vivado HLS Video Library, VTR benchsuite
o Web service, image processing, analytics, neural networks, …17
-
DRAF Chip Area & Power
18
0.01
0.1
1
10
100
1000
0 0.5 1 1.5
Ch
ip A
rea
(mm
2)
Logic Capacity(in million 6-LUT equivalents)
FPGA
DRAF
0.01
0.1
1
10
100
1000
0 0.5 1 1.5
Pe
ak C
hip
Po
we
r (W
)
Logic Capacity(in million 6-LUT equivalents)
FPGA
DRAF
10x area improvement 50x peak power reduction
-
FPGA vs. DRAF (Area)
8-context DRAF occupies 19% less area than 1-context FPGA
o 10x area efficiency: 8 designs in less silicon area than 1 design before
o But only one context can be active at a time
19
0
0.2
0.4
0.6
0.8
1
1.2
1.4
aes backprop gemm gmm harris stemmer stencil viterbi editdist
No
rmal
ized
Min
Bo
un
din
g A
rea
FPGA Logic FPGA Routing DRAF Logic DRAF Routing
Inefficient use of larger DRAM LUT
exp/log functions
-
FPGA vs. DRAF (Power)
Use one context in DRAF
DRAF consumes 1/3 power of FPGA and 15% less energy
o Note: current CAD tools are less efficient with DRAF
20
0
0.2
0.4
0.6
0.8
1
aes backprop gemm gmm harris stemmer stencil viterbi editdist
No
rmal
ized
Po
wer
FPGA Logic FPGA Routing DRAF Logic DRAF Routing
-
Performance
DRAF is 2.7x slower than FPGA
DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
21
0.1
1
10
100
1000
aes backprop gemm gmm harris stemmer stencil viterbi
No
rmal
ized
Th
rou
ghp
ut
CPU 4 CPU FPGA DRAF
exp/log functions
Efficient line buffer
-
Conclusions
DRAF: high-density and low-power reconfigurable fabric
o Based on dense DRAM technology
o Optimized timing + multi-context support
DRAF targets cost and power constrained applications
o E.g., datacenters and mobile systems
DRAF trades off some performance for area & power efficiency
o 10x smaller area, 3x less power, and 2.7x slower than FPGA
o Still 13x speedup over Xeon cores
22
-
Thanks!Questions?