TRANSCRIPT
Brucek Khailany, Director of ASIC and VLSI Research, NVIDIA
June 23, 2019
MACHINE-LEARNING-ASSISTED AGILE VLSI DESIGN FOR MACHINE LEARNING
© NVIDIA 2019
AGENDA
Motivation - Chip Design Complexity
Agile Design Approach
A “C++-to-Layout” Flow
Research Testchips
ML for EDA
CHIP DESIGN COMPLEXITY
Customer demand for more transistors & capability; need lower design costs
NVIDIA Xavier SoC [CES 2018]
[Figure: chip capability from technology scaling, 1959-2020. Moore's Law [Electronics, 1965] delivers capability at fixed power and cost until Dennard voltage scaling ends; capability at fixed cost continues until Moore's Law ends, while customer demand keeps rising.]
MOTIVATION: SHORTEN DEVELOPMENT TIME & COST
• RTL design and verification dominates: >70% of IC design effort at NVIDIA
• Limits which features can make it into each SoC
• Need 10x lower design and verification effort:
• Overlap architect & implement phases for faster time-to-market, more features
Typical development timeline: 3-5 years from R&D to product.
[Figure: overlapping 2013-2018 timelines showing R&D → Architect → Implement → Productize phases for Feature 1 and Feature 2.]
METHODOLOGY RESEARCH APPROACH

Raise HW design level of abstraction:
• Use high-level languages: C++ instead of Verilog
• Use tools, e.g. High-Level Synthesis (HLS)
• Use libraries/generators: MatchLib

Agile VLSI design:
• Small teams, jointly working on architecture, implementation, VLSI
• Continuous integration, automated tool flow, agile project management
• 24-hour spins from C++ to layout

Use modularity to scale up:
• Multi-chip modules
• Network-on-Package / Network-on-Chip
OBJECT-ORIENTED HLS-BASED DESIGN
MatchLib – Modular Approach to Circuits and Hardware Library
HLS-compatible
Highly-parameterizable, high QoR implementations
Open-sourced at https://github.com/NVlabs/matchlib
Goal: “STL/Boost” for HW
Working closely with Siemens Mentor on HLS support, open source libraries
Target: 10x productivity of manual RTL coding
[“A Modular Digital VLSI Flow for High-Productivity SoC Design”, Khailany et al., DAC 2018]
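The "STL/Boost for HW" idea above can be illustrated with a minimal sketch: a depth- and type-parameterized FIFO written in plain, HLS-friendly C++. This is loosely in the spirit of MatchLib's component library, not the actual MatchLib API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a parameterized, HLS-friendly FIFO (not the real
// MatchLib API). Payload type and depth are template parameters, echoing
// the "STL/Boost for HW" goal of highly parameterizable components.
template <typename T, size_t Depth>
class Fifo {
  std::array<T, Depth> buf{};
  size_t head = 0, tail = 0, count = 0;

 public:
  bool full() const { return count == Depth; }
  bool empty() const { return count == 0; }
  bool push(const T& v) {  // returns false when back-pressured (full)
    if (full()) return false;
    buf[tail] = v;
    tail = (tail + 1) % Depth;
    ++count;
    return true;
  }
  bool pop(T& v) {  // returns false when no data is ready
    if (empty()) return false;
    v = buf[head];
    head = (head + 1) % Depth;
    --count;
    return true;
  }
};
```

Because the same C++ is executed natively for verification and fed to HLS for implementation, one source can serve both roles, which is the productivity lever the slide describes.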
OBJECT-ORIENTED HLS-BASED DESIGN
[Figure: C++-to-layout flow diagram. Designer inputs (Specification; HLS-able architectural model in C++/SystemC; verification testbenches in SystemC; HLS and synthesis TCL scripts) and libraries (MatchLib, LI Channels in SystemC) feed C++ simulation, HLS compilation with automatic RAM mapping, logic synthesis, RTL cosimulation, and power analysis (from an FSDB trace), producing verified SystemC models, HLS-generated RTL, a netlist, and performance/power/area results.]
▪ All functional verification and performance verification run natively in C++ simulation of SystemC models
• ~50x speedup over RTL
• ~3% error in performance
▪ Leverage libraries for productivity
▪ Use standard verification approaches, but in SystemC
• Constrained random generators and checkers
• Line and expression coverage
• Assertions
▪ “Pushbutton” flow from SystemC to layout
• Achieved 24-hour turnaround time
▪ All SystemC testbenches reusable with generated RTL for final verification at full-chip and unit-level
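The constrained-random style above can be sketched roughly in plain C++ (rather than SystemC). The saturating-adder DUT, the bias toward the saturation corner, and the golden-model checker here are all hypothetical, for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical device under test: a saturating 8-bit adder.
uint8_t sat_add_dut(uint8_t a, uint8_t b) {
  unsigned s = unsigned(a) + unsigned(b);
  return s > 255 ? 255 : uint8_t(s);
}

// Constrained-random test loop: generate stimulus under a constraint,
// then check the DUT against a golden reference model.
bool run_crv(unsigned num_tests, uint32_t seed) {
  std::mt19937 rng(seed);
  // Constraint: bias half the stimulus toward the saturation corner (a >= 200).
  std::uniform_int_distribution<int> corner(200, 255), any(0, 255);
  for (unsigned i = 0; i < num_tests; ++i) {
    uint8_t a = uint8_t((i % 2) ? corner(rng) : any(rng));
    uint8_t b = uint8_t(any(rng));
    unsigned golden = std::min(255u, unsigned(a) + unsigned(b));  // checker
    if (sat_add_dut(a, b) != golden) return false;  // mismatch found
  }
  return true;
}
```

In the flow described above, the same testbench would drive both the SystemC model and, later, the HLS-generated RTL via cosimulation.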
AGILE VLSI DESIGN TECHNIQUES
GLOBALLY ASYNCHRONOUS, LOCALLY SYNCHRONOUS PAUSIBLE ADAPTIVE CLOCKS
SMALL, ABUTTING PARTITIONS FOR FAST PLACE-AND-ROUTE ITERATIONS
[“A Fine-Grained GALS SoC with Pausible Adaptive Clocking in 16 nm FinFET”, Fojtik et al., ASYNC 2019]
• 100s of local adaptive clock generators
• Fast, error-free clock domain crossings
• “Correct by construction” top-level timing closure
• Reduced margin for power-supply noise
[Figure: chip floorplan composed of 22 small abutting partitions plus I/O.]

SMALL, ABUTTING PARTITIONS FOR FAST PLACE-AND-ROUTE ITERATIONS
• Daily iterations through P&R tools
• Always working on “top of tree” RTL
• Agile, incremental approach to design closure during the march-to-tapeout phase
• Tweak floorplans, timing constraints
• RTL design bugs, performance, VLSI constraints, and tool settings all converge together
SOC TESTCHIP DEMONSTRATIONS
RC18 OVERVIEW

Project goals:
1. Design a state-of-the-art, scalable, high-performance deep learning inference accelerator (>100 TOPS, ~10 TOPS/W)
2. Demonstrate this is possible with a small team in a high-productivity VLSI flow
RC18 overview: 6x6 MCM in a package

Each chip has:
• 4x4 array of Processing Elements (PEs)
• Global Buffer (Global PE) for 2nd-level inter-layer storage
• RISC-V controller
• 5x4 mesh Network-on-Chip
• Network-on-Package and GPIO interface

Each PE has:
• Distributed weight storage
• Input and activation buffers
• Vector MACs for convolutional/fully-connected layers
• Post-processing (ReLU, pooling, etc.)
[Figures: RC18 package and RC18 chip photos]
[“A 0.11 pJ/Op, 0.32-128 TOPS, Scalable, Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm”, Zimmer et al., VLSI 2019] [Venkatesan et al., To Appear at HotChips 2019]
RC18 ARCHITECTURE: PROCESSING ELEMENT
Optimized for vector MAC dot-products to run direct convolutions
[Figure: PE block diagram — eight vector MAC lanes fed from a distributed weight buffer and input buffer through address generators and a manager, an accumulation buffer with adders, a post-processing unit (PPU: truncation, ReLU, pooling, bias), and a router interface with serdes.]
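A minimal sketch of the vector MAC dot-product pattern named above, assuming 8 lanes of 8-bit weights and activations accumulating into a 32-bit register. This is illustrative only, not RC18's actual datapath.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed vector width for illustration, matching the 8 MAC lanes in the
// PE diagram; the real datapath parameters may differ.
constexpr int kLanes = 8;

// One vector MAC step of a direct-convolution inner loop: each lane
// multiplies an 8-bit weight by an 8-bit activation, and the products are
// summed into a wide (32-bit) accumulator to avoid overflow.
int32_t vector_mac(const std::array<int8_t, kLanes>& weights,
                   const std::array<int8_t, kLanes>& acts,
                   int32_t acc) {
  for (int i = 0; i < kLanes; ++i)
    acc += int32_t(weights[i]) * int32_t(acts[i]);  // multiply-accumulate per lane
  return acc;
}
```

A direct convolution then iterates this step over filter taps and input channels, keeping weights stationary in the distributed weight buffer while activations stream through, which is the pattern the slide says the PE is optimized for.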
RC18 IMPLEMENTATION
Reuse, modularity, hierarchical design
SCALING UP
Advantages: larger-than-reticle systems, higher yields, simpler design, mixed process technologies, faster time-to-market, easier power delivery
Disadvantages: Chip-to-chip communication costs (area and energy)
MULTI-CHIP MODULES: INTERCONNECTION NETWORKS

NETWORK ON PACKAGE (NOP)
• 6x6 mesh, one NoP router per chiplet
• Multicast with configurable routing table

NETWORK ON CHIP (NOC)
• 4x5 mesh topology connects 16 PEs, one Global PE, and one RISC-V
• Unicast/multicast with configurable routing table
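The configurable multicast routing tables mentioned above can be sketched roughly as follows. The port encoding and table layout here are assumptions for illustration, not the RC18 router's actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed 5-port mesh router encoding: four neighbors plus the local client.
enum Port : uint8_t { N = 1, E = 2, S = 4, W = 8, LOCAL = 16 };

// Table-based multicast: software programs each router with a map from
// multicast group ID to an output-port bitmask; an arriving flit is
// replicated onto every port whose bit is set.
struct Router {
  std::vector<uint8_t> mcast_table;  // index = group ID, value = port mask
  explicit Router(size_t groups) : mcast_table(groups, 0) {}

  void program(size_t group, uint8_t ports) { mcast_table[group] = ports; }

  std::vector<uint8_t> route(size_t group) {  // ports to fork the flit onto
    std::vector<uint8_t> out;
    for (uint8_t p : {N, E, S, W, LOCAL})
      if (mcast_table[group] & p) out.push_back(p);
    return out;
  }
};
```

Table-based routing keeps the router datapath simple while letting the same hardware serve unicast (one bit set) and multicast (several bits set), matching the configurable unicast/multicast behavior the slide describes.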
RC18 RESULTS

Chip stats           1-chip         36-chip
Process              16nm FinFET    16nm FinFET
Core area            3.1 mm2        111.6 mm2
Voltage              0.41-1.2 V     0.52-1.1 V
Frequency            0.16-2 GHz     0.48-1.8 GHz
Peak perf (8b int)   0.32-4 TOPS    32.5-128 TOPS
Energy efficiency    0.1-1 pJ/op    0.16-8.3 pJ/op

[Figures: measurement setup; package and die photos]
[“A 0.11 pJ/Op, 0.32-128 TOPS, Scalable, Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm”, Zimmer et al., VLSI 2019]
Application benchmarks (measured at 0.85 V)

Benchmark                       Throughput      Core energy
DriveNet (4 chips, batch=1)     1.8K images/s   0.19 mJ/image
DriveNet (4 chips, batch=4)     6.5K images/s   0.23 mJ/image
ResNet-50 (36 chips, batch=2)   2.6K images/s   16.1 mJ/image
MACHINE-LEARNING-ASSISTED DESIGN AUTOMATION
CHALLENGES AND RESEARCH OPPORTUNITIES
Challenges
• Difficult to predict VLSI QoR (routability, timing, area, power) from a high-level design
• Exploration of the physical design space (floorplans, timing constraints, tool knobs) is still limited by machines, licenses, and humans
Opportunities
• Train ML-based predictors using supervised learning techniques
• Automate physical design space exploration, using ML/GPU-accelerated EDA to assist
Can we use machine learning to solve these problems?
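As a toy illustration of the supervised-learning opportunity above, one could fit a simple model mapping layout features to a QoR proxy. The features (cell utilization, pin density), the congestion label, and the linear model below are synthetic placeholders, not a real EDA predictor.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical training sample: two layout features and a QoR proxy label.
struct Sample { double util, pin_density, congestion; };

// Fit congestion ~= w0 + w1*util + w2*pin_density by stochastic gradient
// descent on squared error. Real QoR predictors (see the papers that follow)
// use far richer features and CNN/GCN models; this only shows the
// train-a-predictor pattern end to end.
std::array<double, 3> fit_linear(const std::vector<Sample>& data,
                                 int epochs = 5000, double lr = 0.05) {
  std::array<double, 3> w{0.0, 0.0, 0.0};  // bias, util weight, density weight
  for (int e = 0; e < epochs; ++e)
    for (const Sample& s : data) {
      double pred = w[0] + w[1] * s.util + w[2] * s.pin_density;
      double err = pred - s.congestion;  // gradient of 0.5 * err^2
      w[0] -= lr * err;
      w[1] -= lr * err * s.util;
      w[2] -= lr * err * s.pin_density;
    }
  return w;
}
```

Trained on QoR data mined from past P&R runs, such a predictor could steer design space exploration toward promising floorplans and tool settings before expensive tool runs, which is the workflow the slide proposes.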
RECENT RESEARCH EXAMPLES
ML/DL-based predictors for VLSI
• DRC prediction: Chan et al., “Routability Optimization in Sub-14nm Technologies”, ISPD’17
• Timing prediction: Han et al., “A deep learning methodology to proliferate golden signoff timing”, DATE’14
• DRC hotspot prediction: Xie et al., “RouteNet: Routability Prediction for Mixed-Size Designs Using Convolutional Neural Network”, ICCAD’18
• Testability prediction: Ma et al., “High Performance Graph Convolutional Networks with Applications in Testability Analysis”, DAC’19
• Power dissipation prediction: Zhou et al., “PRIMAL: Power Inference using Machine Learning”, DAC’19
RECENT RESEARCH EXAMPLES
Using ML tools for physical Design Space Exploration (DSE)
• Using ML models to predict QoR effects of tool knobs for DSE: Liu and Carloni, “On Learning-Based Methods for Design-Space Exploration with High-Level Synthesis”, DAC’13
• Using GPUs and DL frameworks to accelerate VLSI placement (30x speedup): Lin et al., “DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement”, DAC’19 (Best Paper), https://github.com/limbo018/DREAMPlace
SUMMARY AND FUTURE WORK
Addressing the design complexity problem
• Design effort reduction from raising design abstraction levels
• Embrace agile design and open-source where it makes sense
• Research automated physical design space exploration
• Use machine learning
• Use parallel computing resources (e.g. GPUs)
ACKNOWLEDGMENTS
Research sponsored by DARPA under the CRAFT program.
NVIDIA Collaborators: Evgeni Krimer, Jason Clemons, Ben Keller, Matthew Fojtik, Alicia Klinefelter, Angshuman Parashar, Michael Pellauer, Nathaniel Pinckney, Mark Ren, Yakun Sophia Shao, Stephen Tell, Yanqing Zhang, Brian Zimmer, Bill Dally, Joel S. Emer, Stephen Keckler
Harvard Collaborators: David Brooks, Gu-Yeon Wei, and team
Former NVIDIA interns/postdocs: Christopher Fletcher, Ziyun Li, Yuzhe Ma, Antonio Puglielli, Priyanka Raina, Shreesha Srinath, Gopal Srinivasan, Chris Torng, Sam (Likun) Xi, Zhiyao Xie, Yuan Zhou
Thanks to the Siemens Mentor Catapult HLS team for discussions and support: Bryan Bowyer, Stuart Clubb, Moises Garcia, Khalid Islam, and Stuart Swan.
Collaborative effort across architecture and design methodology
This research was, in part, funded by the U.S. Government, under the DARPA CRAFT program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
THANK YOU!