
  • 3D-Stacked Logic-in-Memory Hardware for Sparse Matrix Operations

    Franz Franchetti, Carnegie Mellon University. In collaboration with Qiuling Zhu, Fazle Sadi, Qi Guo, Guangling Xu, Ekin Sumbul, James C. Hoe, and Larry Pileggi.

    The work was sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement No. HR0011-13-2-0007. The content, views, and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government; no official endorsement should be inferred. This work was also supported in part by the Intelligence Advanced Research Projects Activity and the Space and Naval Warfare Systems Center Pacific under Contract No. N66001-12-C-2008, Program Manager Dennis Polla.

  • How to Accelerate Graph Algorithms?

    Insight: important graph algorithms = funny* tensor contractions**
    Examples: edge betweenness centrality, Bayesian belief propagation, minimal spanning trees, single-source and all-pairs shortest paths

    But: sparse matrix operations are inefficient on current machines
    High meta-data ratio: one index per data element in CSR*** format
    Memory bound: little or no reuse (e.g., SpMV****)
    Tiling impossible: tiling in CSR is hard
    Inefficient: bandwidth bound, runs at 1% of compute peak

    *funny = over a semiring with add/mul redefined, e.g., (*, +) = (+, min); **tensor contraction captures scalar product, matrix-vector, matrix-matrix, and multiple matrix-matrix products, including nD transposes; ***CSR = compressed sparse row; ****SpMV = sparse matrix-vector multiplication
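    To make the semiring view concrete, here is a minimal C sketch of CSR SpMV over the (min, +) semiring from the footnote (add = min, mul = +), which makes y = A*x one relaxation step of a shortest-path computation. The struct and function names are illustrative, not from the slides. Note the meta-data ratio: colidx carries exactly one index per stored value.

    #include <float.h>

    /* CSR storage as described above: one column index per non-zero. */
    typedef struct {
        int     n;        /* number of rows                 */
        int    *rowptr;   /* n+1 entries: row start offsets */
        int    *colidx;   /* one column index per non-zero  */
        double *val;      /* one value per non-zero         */
    } csr_t;

    /* Semiring (min, +): "add" = min, "mul" = +; the identity of min
     * is +infinity, approximated here by DBL_MAX. */
    static void spmv_min_plus(const csr_t *A, const double *x, double *y) {
        for (int i = 0; i < A->n; i++) {
            double acc = DBL_MAX;                        /* semiring zero */
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
                double t = A->val[k] + x[A->colidx[k]];  /* semiring mul  */
                if (t < acc) acc = t;                    /* semiring add  */
            }
            y[i] = acc;
        }
    }

    The inner loop makes the memory-bound nature visible: each non-zero is touched once with no reuse, and x is accessed through the index array.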

  • Enabling Technology: Logic on 3D DRAM Stack

    3D-stacked DRAM and logic/DRAM integration: logic and DRAM layers connected by TSVs; better timing, lower area, advanced IO in the logic layer

    Logic directly on the DRAM stack: bandwidth and latency concerns still exist off-chip, but sitting behind the DRAM interface opens up the stack's internal resources

    [Figure: 3D-stacked memory examples: Micron HMC, DDR4 DIMMs, Intel Knights Landing, Nvidia Volta]

  • Concept: Memory-Side Accelerators

    Idea: accelerator on the DRAM side; no off-DIMM data traffic; huge problem sizes possible; 3D stacking is the enabling technology

    Configurable array of accelerators: domain specific, highly configurable; covers DoD-relevant kernels; configurable on-accelerator routing

    System CPU: multicore/manycore CPU with CPU-side accelerators; explicit memory management; SIMD and multicore parallelism


  • Simulation, Emulation, and Software Stack

    Accelerator simulation: timing via Synopsys DesignWare; power via McPAT, DRAMSim2, CACTI, and CACTI-3DD

    Full-system evaluation: run code on a real system (Haswell, Xeon Phi) or in a simulator (SimpleScalar, …); the CPU performs normal DRAM accesses, but accesses to the accelerator command memory space are trapped and invoke the simulator

    API and software stack (a usage sketch follows below):
    Accelerator: memory-mapped device with command and data address spaces
    User API: C configuration library, standard APIs where applicable
    Virtual memory: fixed non-standard logical-to-physical mapping
    Memory management: special malloc/free, Linux kernel support
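    As a usage sketch of the memory-mapped programming model above: the device path, the command-page layout, and the opcode below are assumptions made for illustration, not the project's actual driver interface.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CMD_PAGE_SIZE 4096
    #define CMD_SPGEMM    0x1   /* hypothetical opcode */

    typedef struct {            /* hypothetical command block layout */
        volatile uint64_t opcode;
        volatile uint64_t a_addr, b_addr, c_addr; /* tile addresses */
        volatile uint64_t status;
    } lim_cmd_t;

    int main(void) {
        int fd = open("/dev/lim_accel", O_RDWR);  /* assumed device node */
        if (fd < 0) { perror("open"); return 1; }

        lim_cmd_t *cmd = mmap(NULL, CMD_PAGE_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
        if (cmd == MAP_FAILED) { perror("mmap"); return 1; }

        /* A write to the command space either reaches the real accelerator
         * or, on a host-only setup, traps into the simulator as described. */
        cmd->opcode = CMD_SPGEMM;
        while (cmd->status == 0)  /* poll for completion */
            ;
        munmap(cmd, CMD_PAGE_SIZE);
        close(fd);
        return 0;
    }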


  • Sparse Matrix Multiplication (SpGEMM)

    State-of-the-art algorithm (Buluç, Gilbert) for C = A x B:
    Minimizes floating-point operations
    Matrix B: streaming memory access
    But: for matrix A, every access is a cache and row-buffer miss
    Expensive multiway merge required for matrix C
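    For reference, a column-wise Gustavson-style SpGEMM sketch in C that exhibits exactly this access pattern: B streams sequentially, each touched column of A is an irregular access, and a sparse accumulator performs the merge that forms C. This is an illustrative kernel, not the Buluç-Gilbert implementation itself; names are assumptions.

    typedef struct {
        int n, *colptr, *rowidx;  /* CSC: column pointers, row indices */
        double *val;
    } csc_t;

    /* Computes column j of C = A*B into the accumulator acc[].
     * Caller zero-initializes mark[]; nzlist collects the non-zero rows. */
    static void spgemm_column(const csc_t *A, const csc_t *B, int j,
                              double *acc, int *mark, int *nzlist, int *nnz) {
        *nnz = 0;
        for (int t = B->colptr[j]; t < B->colptr[j + 1]; t++) { /* stream B */
            int k = B->rowidx[t];
            double bkj = B->val[t];
            for (int s = A->colptr[k]; s < A->colptr[k + 1]; s++) {
                int i = A->rowidx[s];        /* irregular access into A */
                if (mark[i] != j + 1) {      /* first touch in column j */
                    mark[i] = j + 1;
                    acc[i] = 0.0;
                    nzlist[(*nnz)++] = i;
                }
                acc[i] += A->val[s] * bkj;   /* merge/accumulate into C(:,j) */
            }
        }
    }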

  • SpGEMM Revisited: 3DIC Changes Trade-Offs

    On chip: standard approach
    Off-chip: SUMMA with hypersparse blocks
    3DIC: stacked DRAM + ASIC, with DRAM layers above an accelerator layer
    Trade-off: full streaming memory access, but higher memory bandwidth required

    [Figure: C = A x B, on-chip standard approach vs. off-chip blocked approach, mapped onto a 3D stack of DRAM layers over an accelerator layer]

  • Why Does This Work?

    Power-graph incidence matrices are very sparse: on average 3 non-zeros per row

    Tiles are hypersparse: about 1,000 entries in a 50,000 x 50,000 tile

    Total memory size limits the matrix dimension: only about 100 tiles per row/column

    3D stacking provides enormous bandwidth, but the DRAM array (and the row-buffer miss latency) is unchanged

    Here the O(n^3/b^3) algorithm beats the O(n*nnz) algorithm; the key is the huge block size and the bounded maximum matrix dimension
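    A back-of-the-envelope sketch with this slide's numbers (constant factors and meta-data traffic ignored; the non-zero count assumes roughly 3 non-zeros per row as stated above):

    % Numbers from this slide: b = 5*10^4, at most ~100 tiles per
    % row/column, hence n <= 100*b = 5*10^6 and nnz ~ 3n.
    \[
      \left(\tfrac{n}{b}\right)^{3} \le 100^{3} = 10^{6}
      \ \text{hypersparse tile products, all fully streamed,}
    \]
    \[
      \text{versus}\ \mathrm{nnz} \approx 3n = 1.5 \times 10^{7}
      \ \text{irregular accesses to}\ A,\ \text{each a likely row-buffer miss.}
    \]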

  • 3D Sparse Matrix Accelerator

    Off-chip: SRUMMA (tiled shared memory algorithm)

    On-chip: regular SpGEMM kernel

    Scratchpad layer: hierarchical tiling, SRAM logic-in-memory, eDRAM scratchpad

    [Figure: stacked DRAM dies connected through a buffer/scratchpad layer to the LiM dies of the accelerator layer]

  • Logic Layer Functionality Examples

    Smart SRAM designs:
    Data sorting memory [figure: unsorted input values merged stepwise into the sorted sequence 1 3 4 6 7 8 10]
    Pointer chasing memory [figure: non-zero row index array pointing into the column index array, entries i and i+1]
    Orthogonal access memory
    Sparse matrix content-addressable memory (CAM): bit-0 access, one-big-array reordering

  • SpGEMM Kernel

    [Figure-only slide]

  • Content Addressable Memory (CAM)

    SpGEMM-CAM functional architecture: a sequencer streams row_A, val_A, and val_B through a multiply-add pipeline; the search index is matched against the stored entries (index 0 ... index n, value 0 ... value n) through the decoder/encoder, producing row_C and val_C
    Op1 (mismatch): insert a new entry
    Op2 (match): read, modify, and write back
    The CAM structure facilitates an efficient one-cycle non-zero merge

    Match algorithm to hardware structure (sparse matrix component -> CAM storage structure):
    Numerical values -> CAM data array
    Row/column index -> CAM key array
    Index matching -> CAM comparisons
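    In software terms the CAM acts as an associative accumulator. A minimal C model of the match/mismatch behavior follows, with a linear scan standing in for the CAM's parallel one-cycle compare; names and sizes are illustrative, not the actual hardware.

    #define CAM_ENTRIES 256

    typedef struct {
        int    n;
        int    key[CAM_ENTRIES];   /* row index of C (CAM key array) */
        double val[CAM_ENTRIES];   /* partial value of C (CAM data)  */
    } cam_t;

    /* One pipeline step: merge the product val_a * val_b into row row_c. */
    static void cam_merge(cam_t *cam, int row_c, double val_a, double val_b) {
        double prod = val_a * val_b;
        for (int i = 0; i < cam->n; i++) {
            if (cam->key[i] == row_c) {  /* Op2: match -> read-modify-write */
                cam->val[i] += prod;
                return;
            }
        }
        cam->key[cam->n] = row_c;        /* Op1: mismatch -> insert */
        cam->val[cam->n] = prod;
        cam->n++;                        /* assumes n < CAM_ENTRIES */
    }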

  • Map SpGEMM Algorithm to 3D LiM Architecture

    Map matrix tiles to DRAM rows: sequential tile streaming guarantees maximum sustained bandwidth and good data locality (cheap, regular access); a sketch of the mapping follows below

    [Figure: conceptual 3D stack; tiles of matrices A and B map to DRAM rows and stream through TSVs to the logic layer computing C = A x B]
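    A sketch of the tile-to-DRAM-row address computation implied above; the row size, tile size, and row-major tile order are assumptions for illustration.

    #include <stdint.h>

    #define DRAM_ROW_BYTES 8192
    #define TILE_BYTES     (64 * DRAM_ROW_BYTES)  /* one tile spans 64 rows */

    /* First DRAM row of tile (ti, tj) of a matrix with tiles_per_dim
     * tiles per dimension, stored in row-major tile order. */
    static inline uint64_t tile_to_dram_row(uint64_t base_row, int ti, int tj,
                                            int tiles_per_dim) {
        uint64_t tile_id = (uint64_t)ti * tiles_per_dim + tj;
        return base_row + tile_id * (TILE_BYTES / DRAM_ROW_BYTES);
    }

    Because consecutive tiles occupy consecutive DRAM rows, a tile stream activates each row once and reads it end to end, which is the cheap regular access the slide refers to.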

  • SpGEMM Multicore

    Duplicated SpGEMM LiM cores for parallel matrix processing
    Match the TSV bandwidth to the LiM processing throughput (number of kernels)
    A completely balanced system, optimized from application knowledge

    [Figure: stacked DRAM dies feed, via TSVs, an LiM die layer containing a central controller, a meta-data array, and CAM-based SpGEMM LiM cores 1 ... n, each streaming its own A, B, C tiles]
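    The balance condition can be written directly; as a hedged back-of-the-envelope using the TSV bandwidth and core counts quoted in the simulation results below (668 GB/s, 30-50 cores):

    % Balance: aggregate TSV bandwidth = number of cores x per-core demand.
    \[
      n_{\mathrm{cores}} = \frac{BW_{\mathrm{TSV}}}{BW_{\mathrm{core}}}
      \quad\Rightarrow\quad
      BW_{\mathrm{core}} \approx \frac{668\ \mathrm{GB/s}}{30 \ldots 50}
      \approx 13 \ldots 22\ \mathrm{GB/s\ per\ core.}
    \]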

  • Recursive SpGEMM on Hierarchical 3D LiM

    [Figure: recursive n1 x n2 tiling across hierarchy levels L-1, L-2, L-3, spanning the DRAM dies (D1, D2), the buffer/scratchpad layer, and the LiM dies; TSV idle phases shown]

  • Design Automation Framework & Chip Generator

    Re-architects the DRAM dies
    3D-system design space exploration: 3D memory bandwidth, access timing, energy efficiency (bandwidth/power), thermal and temperature
    Silicon interposer: future system scaling (multiple stacks, e.g., stack 1 and stack 2, on a silicon interposer)
    Chip generator GUI interface

    [Figure (b): bandwidth in GB/s with 256 IOs per bank, swept over N_stack = 2 ... 32 and N_bank = 8 ... 32, axis range 0-800 GB/s]

  • Test Chip: Single SpGEMM Core

    Transistor count: > 1M
    Core area: 1.003 x 1.292 mm^2
    Max frequency: 475 MHz
    Avg. power at max frequency: 71.7 mW

  • SpGEMM: Simulation Results

    Intel reference: dual-CPU Intel Xeon E5-2430 system, about 140 W; 6 DRAM channels, 64 GB/s max, 6 DIMMs (48 chips), 210 GFLOPS peak
    Intel MKL 11.0 mkl_dcsrmultcsr (unsorted): 100 MFLOPS - 1 GFLOPS, 1 - 10 MFLOPS/W

    Our 3DIC system: 4 DRAM layers @ 2 Gbit, 1 logic layer, 32 nm, 30 - 50 cores, 1 GHz, 30 - 50 GFLOPS peak
    Performance: 10 - 50 GFLOPS @ 668 GB/s with 1024 TSVs
    Power efficiency: 1 GFLOPS/W @ 350 GB/s with 512 TSVs, or 8 GB/s with 1024 TSVs

    Matrices from the University of Florida Sparse Matrix Collection

  • Beyond SpGEMM: SpMV Accelerator

    3DIC memory-side accelerator; target: large sparse graphs

    Algorithm: standard sparse matrix times dense vector; the matrix is very sparse (a few non-zeros per row); matrix and vector sizes are large (>10M x 10M, GB-scale)

    SpMV kernel (see the sketch below): block-column multiply plus a merge step; the matrix streams from DRAM, partial results are streamed back to DRAM, and the active vector segment is held in the scratchpad/eDRAM
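    A minimal C sketch of the block-column scheme under the stated assumptions: the matrix is split into column blocks, each block multiplies the vector segment resident in the scratchpad, and the per-block partial results are merged afterwards. The CSC-per-block layout and all names are illustrative.

    #include <string.h>

    typedef struct {
        int nrows, ncols, *colptr, *rowidx;  /* one CSC column block */
        double *val;
    } col_block_t;

    /* Phase 1: y_part = block * x_seg, with x_seg in the scratchpad. */
    static void spmv_block(const col_block_t *b, const double *x_seg,
                           double *y_part) {
        memset(y_part, 0, b->nrows * sizeof(double)); /* partial result   */
        for (int j = 0; j < b->ncols; j++)            /* stream the block */
            for (int k = b->colptr[j]; k < b->colptr[j + 1]; k++)
                y_part[b->rowidx[k]] += b->val[k] * x_seg[j];
    }

    /* Phase 2 (merge): accumulate the streamed partials into the result. */
    static void merge_partials(double *y, const double *y_part, int nrows) {
        for (int i = 0; i < nrows; i++)
            y[i] += y_part[i];
    }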

  • Ongoing: System-Level Design

    Full system: CPU and DRAM
    Scalability: multiple DRAM stacks on a silicon interposer

  • Summary

    [Figure: power efficiency in MFLOPS/W (log scale, 1.0E-02 to 1.0E+04) across benchmark matrices of sizes 10K-100K, 100K-1M, and 1M-10M, comparing Intel MKL (12 threads) with 3D-LiM (32K blocks); 3D-LiM is about 1000x more power efficient]

    SpGEMM is a core algorithm for graphs: graphs and sparse matrices are dual
    3DIC requires rethinking algorithms; example: trading off latency for bandwidth
    Enormous efficiency gains are possible, in both power efficiency and performance
