draf: a low-power dram-based reconfigurable acceleration...

23
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA – June 22, 2016

Upload: others

Post on 30-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

    Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng,

    Bob Brennan, Christos Kozyrakis

    ISCA – June 22, 2016

  • FPGA-Based Accelerators

    Improve performance and energy efficiency

    Good balance between flexibility (CPUs) and efficiency (ASICs)

    Recently used for many datacenter appso Image/video processing, websearch, neural networks, …

    2Pictures: Putnam, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ISCA’14

  • Motivation

    Deploy FPGAs in cost & power constrained systems

    Datacenter systems

    o High-density FPGAs for large accelerators for multiple apps

    o Low-power FPGAs to simplify integration in servers and racks

    Mobile systems

    o High-density FPGAs for accelerators for multiple apps

    o Low-power FPGAs for low cost and long battery life

    3

  • DRAF in a Nutshell

    A high-density & low-power FPGA

    o Bit-level reconfigurable, just like conventional FPGAs

    Uses dense DRAM technology for lookup tables

    o Replacing the SRAM technology in conventional FPGAs

    DRAF vs. FPGA

    o 10 – 100x logic density

    o 1/3 power consumption

    o Multi-context support with fast context switch

    4

  • Challenges of Building DRAM-based FPGAs

    5

  • DRAM Array Structure

    6

    Subarray

    MAT

    Sense-amp

    Master wordline

    Ro

    w d

    eco

    de

    r

    Local wordlineb

    itlin

    e

    A DRAM subarray is naturally a lookup-table

    Input

    Output

  • MAT

    Sense-amp

    Master wordline

    Ro

    w d

    eco

    de

    r

    Local wordlineb

    itlin

    e

    Challenges

    7

    ~8k-bit output

    ~1k rows~10-bit input

    Mismatch LUT sizea 8192-bit LUT?

    Slow speeda LUT with 10 ns delay?

    10-30 ns delay

    Destructive accessdata lost after access?

  • User Clock

    Physical Path

    R1 R2

    L1 L2

    L3

    L4

    Destructive Access

    Explicit activation, restoration, and precharge operations

    o Longer access delay due to serialization

    Issue of LUT chaining: order of LUT access

    8

    Must activate L2 after L1

    Must activate L4 after both L2 & L3

  • DRAF ArchitectureBasic Logic Element

    Multi-Context Support

    Timing

    9

  • DRAF Overview

    Same island layout and configurable interconnect as FPGA

    10

    DSPIn DRAM technologySlower but not critical

    Block RAMUses DRAM arrays

    CLBContains multiple basic logic elements (BLEs)

  • Basic Logic Element

    7-10 bits input

    2-4 bits output

    11

    Subarray

    MAT

    Sense-amp

    Master wordlineR

    ow

    de

    cod

    er

    Local wordline

    bit

    lin

    e Narrower MAT1k bits to 8-16 bits

    Col logic

    6

    14

    4x2 4x2

    Specialized column logicBetter flexibility

    FFs

    4 4Additional FFs & MUXsRegistering & retiming

    Single-MAT accessMulti-context

    3

    4

  • Multi-Context Support

    DRAF supports 8-16 contexts per chip

    o Context: one MAT per BLE

    o Efficient use of MATs with little area and power overhead

    Instant switch between active contexts

    o Similar to context-switch between processes on CPU

    Context uses

    o One context per accelerator design or application

    o One context per part of a very large accelerator design

    12

  • User Clock

    Physical Path

    R1 R2

    L1 L2

    L3

    L4

    Timing – Destructive Access

    Issue of LUT chaining: order of LUT access

    Solution: phase – similar to critical path finding

    13

    Phase 0 Phase 1

    PhaseTimeline

    Phase 0

    Phase 2

    Phase 0 Phase 1 Phase 2

  • LUT-1

    LUT-2

    Timing – Latency Optimization

    Issue: precharge and restore delays

    Solution: 3-way delay overlapping

    o Hide PRE/RST delays with wire propagation delay

    Performance gap between DRAF and FPGA reduces from >10x to 2-4x

    14

    ACTPRE RSTWire

    ACTPRE RST

    LUT-1

    LUT-2

    ACTPRE RSTWire

    ACTPRE RST

    Saved delay

  • Summary

    Challenges solutions

    o Mismatch LUT size multi-context BLE

    o Destructive access phase-based timing

    o Slow speed 3-way delay overlapping

    Other design features (see paper)

    o Sense-amp as register

    o Time-multiplexed routing

    o Handling DRAM Refresh

    15

  • EvaluationArea, power, performance against FPGA and CPU

    16

  • Methodology

    Synthesize, place & route with Yosys + VTR

    CACTI-3DD with 45 nm power and area models

    Comparisons

    o 70 mm2 FPGA based on Xilinx Virtex-6

    o 70 mm2 DRAF device, 8-context

    o Intel Xeon E5-2630 multi-core processor (2.3 GHz)

    18 accelerator designs

    o MachSuite, Sirius, Vivado HLS Video Library, VTR benchsuite

    o Web service, image processing, analytics, neural networks, …17

  • DRAF Chip Area & Power

    18

    0.01

    0.1

    1

    10

    100

    1000

    0 0.5 1 1.5

    Ch

    ip A

    rea

    (mm

    2)

    Logic Capacity(in million 6-LUT equivalents)

    FPGA

    DRAF

    0.01

    0.1

    1

    10

    100

    1000

    0 0.5 1 1.5

    Pe

    ak C

    hip

    Po

    we

    r (W

    )

    Logic Capacity(in million 6-LUT equivalents)

    FPGA

    DRAF

    10x area improvement 50x peak power reduction

  • FPGA vs. DRAF (Area)

    8-context DRAF occupies 19% less area than 1-context FPGA

    o 10x area efficiency: 8 designs in less silicon area than 1 design before

    o But only one context can be active at a time

    19

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    aes backprop gemm gmm harris stemmer stencil viterbi editdist

    No

    rmal

    ized

    Min

    Bo

    un

    din

    g A

    rea

    FPGA Logic FPGA Routing DRAF Logic DRAF Routing

    Inefficient use of larger DRAM LUT

    exp/log functions

  • FPGA vs. DRAF (Power)

    Use one context in DRAF

    DRAF consumes 1/3 power of FPGA and 15% less energy

    o Note: current CAD tools are less efficient with DRAF

    20

    0

    0.2

    0.4

    0.6

    0.8

    1

    aes backprop gemm gmm harris stemmer stencil viterbi editdist

    No

    rmal

    ized

    Po

    wer

    FPGA Logic FPGA Routing DRAF Logic DRAF Routing

  • Performance

    DRAF is 2.7x slower than FPGA

    DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core

    21

    0.1

    1

    10

    100

    1000

    aes backprop gemm gmm harris stemmer stencil viterbi

    No

    rmal

    ized

    Th

    rou

    ghp

    ut

    CPU 4 CPU FPGA DRAF

    exp/log functions

    Efficient line buffer

  • Conclusions

    DRAF: high-density and low-power reconfigurable fabric

    o Based on dense DRAM technology

    o Optimized timing + multi-context support

    DRAF targets cost and power constrained applications

    o E.g., datacenters and mobile systems

    DRAF trades off some performance for area & power efficiency

    o 10x smaller area, 3x less power, and 2.7x slower than FPGA

    o Still 13x speedup over Xeon cores

    22

  • Thanks!Questions?