
Page 1:

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Farzad Farshchi§, Qijing Huang¶, Heechul Yun§

§University of Kansas, ¶University of California, Berkeley

Page 2:

SiFive Internship

• Rocket Chip: open-source RISC-V SoC

• NVDLA: open-source DNN inference engine

• Demoed the integration at Hot Chips’18

[Diagram: NVDLA attached to the Rocket Chip SoC]

Page 3:

SiFive Internship

Page 4:

Motivation

• The FPGA demo is a useful platform for research

• Limitations

• No L2 cache

• Fast DRAM, slow SoC: the FPGA's DRAM runs near full speed while the SoC logic is clocked slowly, so memory timing is skewed

• Expensive: $7k FPGA board

• Let’s integrate NVDLA into FireSim

Page 5:

FireSim

• Fast, cycle-exact full system simulator, runs on FPGA in the cloud

• Simulated design is derived from Rocket Chip RTL

• Decouples target from FPGA DRAM

• Adds its own DRAM and LLC model

• Easy to use, with very good documentation

Page 6:

How Does FireSim Work?

• Transforms the RTL into a target model

• Inserts token queues at the I/O ports of the target

• Creates a token-based simulator

• In each cycle, a token is consumed by the model

• What if the token queue is empty? The model has to wait: stall the target pipeline (sketched below)

Figure credit: Donggyu Kim et al., “Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL”
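The slide's figure is not in the transcript, so here is a minimal Chisel sketch of the token idea (a toy example, not FireSim's actual transform): the wrapper advances the target by one cycle only when an input token is available and the output token can be accepted; otherwise the target stalls. The module and port names are hypothetical.

```scala
import chisel3._
import chisel3.util._

// Hypothetical 8-bit target with one input and one output port.
class TinyTarget extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(8.W))
    val out = Output(UInt(8.W))
    val en  = Input(Bool())       // advance one target cycle only when asserted
  })
  val reg = RegEnable(io.in + 1.U, 0.U(8.W), io.en)  // toy compute: add 1
  io.out := reg
}

// Token-based wrapper: one token per simulated target cycle on each channel.
class TokenWrapper extends Module {
  val io = IO(new Bundle {
    val inTok  = Flipped(Decoupled(UInt(8.W)))  // input-token queue
    val outTok = Decoupled(UInt(8.W))           // output-token queue
  })
  val target = Module(new TinyTarget)
  // Fire (advance one target cycle) only when an input token is present
  // and the consumer can accept the output token; otherwise stall.
  val fire = io.inTok.valid && io.outTok.ready
  target.io.in := io.inTok.bits
  target.io.en := fire
  io.inTok.ready  := fire
  io.outTok.valid := fire
  io.outTok.bits  := target.io.out
}
```

FireSim applies the same principle to the whole SoC, which is what keeps the simulation cycle-exact even when the host cannot supply a token on every FPGA cycle.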

Page 7:

How to Stall The Target Pipeline?

• For Chisel code: Rocket Chip is written in Chisel

• For Verilog (we added):
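The code shown on the slide is not in the transcript. Below is a hedged sketch of one common way to stall both kinds of logic (an illustration, not necessarily the authors' exact mechanism): Chisel state can be stalled by qualifying register enables, while a Verilog black box such as NVDLA can be stalled by gating the clock it sees. `AccelBlackBox` and its ports are hypothetical.

```scala
import chisel3._
import chisel3.util._

// Hypothetical Verilog black box with an explicit clock input
// (stands in for an accelerator such as NVDLA).
class AccelBlackBox extends BlackBox {
  val io = IO(new Bundle {
    val clk = Input(Clock())
    val in  = Input(UInt(32.W))
    val out = Output(UInt(32.W))
  })
}

class StallableAccel extends Module {
  val io = IO(new Bundle {
    val stall = Input(Bool())     // asserted when a token queue runs dry
    val in    = Input(UInt(32.W))
    val out   = Output(UInt(32.W))
  })

  // Chisel logic: stall by qualifying register enables.
  val sampled = RegEnable(io.in, 0.U(32.W), !io.stall)

  // Verilog black box: stall by gating the clock it sees.
  // (Conceptual only; a real design would use a glitch-free clock gate.)
  val gatedClock = (clock.asUInt.asBool && !io.stall).asClock
  val accel = Module(new AccelBlackBox)
  accel.io.clk := gatedClock
  accel.io.in  := sampled
  io.out := accel.io.out
}
```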

Page 8:

Overall System Architecture

• NVDLA is integrated into the target

• LLC + memory model: not part of the target; added by FireSim

• Supports multiple memory models, e.g. DDR3 and constant latency

• Runtime-configurable LLC: different set, way, and block sizes without rebuilding the FPGA image (illustrated below)
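As a sketch of what "runtime configurable" means here, the LLC geometry is just data handed to the simulation at launch, so changing it does not require rebuilding the FPGA image. The class and field names below are illustrative, not FireSim's actual configuration interface.

```scala
// Hypothetical host-side knobs for a FireSim-style LLC model; the field
// names are illustrative and are not FireSim's actual configuration keys.
case class LLCModelConfig(sets: Int, ways: Int, blockBytes: Int) {
  def capacityBytes: Int = sets * ways * blockBytes
}

object LLCExample extends App {
  // Baseline from the talk: 2 MiB shared, 8-way, 64 B blocks -> 4096 sets.
  val baseline = LLCModelConfig(sets = 4096, ways = 8, blockBytes = 64)
  // A different geometry is just different runtime data; no FPGA rebuild.
  val smallerBlocks = baseline.copy(blockBytes = 32)
  println(s"baseline capacity: ${baseline.capacityBytes / 1024} KiB")
}
```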

Page 9:

Integrate Your Own Accelerator

• Any accelerator can be integrated (if it fits inside the FPGA)

• Develop and test software for your accelerator in a Linux environment before having the chip in hand

• Get fast and accurate performance results

Page 10:

NVDLA

• Scalable: nv_small, nv_medium, nv_large

• We used nv_large: 2048 MACs

• Convolutional core: matrix-matrix multiplication (reference sketch below)

• Post-processing: activation function, pooling, etc.

Adapted from “The NVIDIA Deep Learning Accelerator”, https://goo.gl/Znyba5
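As a software reference for the operation the convolutional core accelerates, the sketch below is a plain INT8 matrix-matrix multiply with 32-bit accumulation; NVDLA's actual datapath adds tiling and the convolution buffer, which are out of scope here. With 2048 MACs, the hardware performs 2048 of these multiply-accumulates per cycle.

```scala
object Int8GemmModel {
  // Reference INT8 GEMM, C = A * B, accumulating into 32-bit integers:
  // a software stand-in for what the MAC array computes in hardware.
  def int8MatMul(a: Array[Array[Byte]], b: Array[Array[Byte]]): Array[Array[Int]] = {
    val m = a.length; val k = a(0).length; val n = b(0).length
    require(b.length == k, "inner dimensions must match")
    val c = Array.ofDim[Int](m, n)
    for (i <- 0 until m; j <- 0 until n) {
      var acc = 0
      for (p <- 0 until k) acc += a(i)(p).toInt * b(p)(j).toInt // one MAC
      c(i)(j) = acc
    }
    c
  }
}
```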

Page 11:

Performance Analysis (I)

• Baseline config:

• Quad-core Rocket Core, 3.2 GHz

• NVDLA: 2048 INT8 MACs, 512 KiB conv. buffer, 3.2 GHz

• LLC: Shared 2 MiB, 8-way, 64 B block

• DRAM: 4 ranks, 8 banks, FR-FCFS

• YOLOv3: 416 x 416 frame, 66 billion operations
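A back-of-envelope check on these numbers (my arithmetic, not from the slides): counting a MAC as two operations, 2048 MACs at 3.2 GHz peak at about 13.1 Tops/s, so the 66 billion operations of one frame would take roughly 5 ms at full utilization. The measured times on the next slide are much larger, which hints at less-than-peak MAC utilization and at work that falls back to the CPU.

```scala
object PeakThroughput extends App {
  // Counting one MAC as two operations (multiply + add).
  val macs        = 2048
  val clockHz     = 3.2e9
  val peakOpsPerS = macs * 2 * clockHz    // ≈ 1.31e13 ops/s ≈ 13.1 Tops/s
  val yoloOps     = 66e9                  // operations per YOLOv3 frame (from this slide)
  val idealFrameS = yoloOps / peakOpsPerS // ≈ 5.0 ms at 100% MAC utilization
  println(f"peak: ${peakOpsPerS / 1e12}%.1f Tops/s, ideal frame time: ${idealFrameS * 1e3}%.1f ms")
}
```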

Page 12:

Performance Analysis (II)

• Frame processing time: 133 ms (7.5 fps)

• 67 ms on NVDLA

• 66 ms on the processor, multithreaded with OpenMP

• Layers not supported by NVDLA run on the processor

• Custom YOLO layer, upsampling, FP ⇔ INT8 conversion

• Common DNN computations run very fast on the accelerator ✔

• Computations not supported by the accelerator can slow you down ✖

Page 13:

Performance Comparison

• Rocket: baseline config, no NVDLA

• NVDLA+Rocket: baseline config

• Xeon: E5-2658 v3

• Titan Xp: Pascal arch, 3840 CUDA cores

• Titan Xp consumes more power

• Titan Xp: board TDP 250 W, 471 mm² in 16nm

• NVDLA IP: 766 mW peak, 3.3 mm² in 16nm

[Chart: YOLOv3 performance comparison across Rocket, NVDLA+Rocket, Xeon, and Titan Xp; callouts: 407x, 5.5x]
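Two ratios implied by the power and area figures above (my arithmetic, and only a rough comparison since a board-level TDP and an IP-level peak power are not directly comparable): roughly 326x in power and 143x in silicon area between the Titan Xp and the NVDLA IP.

```scala
object PowerAreaRatios extends App {
  // Rough comparison only: board-level TDP vs. IP-level peak power.
  val powerRatio = 250.0 / 0.766 // Titan Xp TDP / NVDLA IP peak ≈ 326x
  val areaRatio  = 471.0 / 3.3   // Titan Xp die / NVDLA IP area ≈ 143x
  println(f"power: ${powerRatio}%.0fx, area: ${areaRatio}%.0fx")
}
```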

Page 14:

Sharing LLC with Accelerator

• Sharing the LLC can be a good alternative to a scratchpad

• Consumes less chip area

• Less programming effort

• Performance does not vary with the LLC size

• But it does vary with the block size

• Streaming access pattern: not much data reuse left (see the model below)

• NVDLA minimum burst length: 32 B

• A hardware prefetcher should help

* Speedup is measured w.r.t. the design with no LLC

[Chart: speedup relative to a design with no LLC for different LLC configurations; callout: 1.6x]
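A rough model of why capacity barely matters but block size does for a streaming client (my simplification, assuming a purely sequential access stream with no reuse): every block is touched exactly once, so extra capacity buys nothing, while each doubling of the block size halves the miss count and acts like a simple prefetch. The 64 MiB traffic figure below is hypothetical.

```scala
object StreamingMissModel extends App {
  // One compulsory miss per block for a sequential, no-reuse stream,
  // regardless of cache capacity.
  def misses(streamBytes: Long, blockBytes: Int): Long =
    (streamBytes + blockBytes - 1) / blockBytes

  val streamBytes = 64L * 1024 * 1024 // hypothetical 64 MiB of streamed tensor data
  Seq(32, 64, 128).foreach { b =>
    println(s"block ${b} B -> ${misses(streamBytes, b)} misses")
  }
}
```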

Page 15:

Contention In Memory System

• We care about worst-case execution time in real-time systems

• A synthetic benchmark runs on the CPU, stressing the memory system

• NVDLA execution time is measured

[Chart: NVDLA execution time under CPU-induced memory contention, normalized to solo execution; callout: 2.5x]

* Normalized to solo execution time, i.e., running in isolation

Page 16:

Conclusion

• We integrated NVDLA with a RISC-V SoC on FireSim

• Fast and easy to use

• No FPGA board needed: runs on the Amazon cloud

• Can be used for architectural/system research

• We will be using it for research in real-time embedded systems

• Open-sourced and publicly available at:

https://github.com/CSL-KU/firesim-nvdla/

Google “firesim nvdla”

Page 17:

Demo

Page 18:

• Questions?
