TRANSCRIPT
Brucek Khailany, Director of ASIC and VLSI Research, NVIDIA
June 23, 2019
MACHINE-LEARNING-ASSISTED AGILE VLSI DESIGN FOR MACHINE LEARNING
© NVIDIA 2019
AGENDA
Motivation - Chip Design Complexity
Agile Design Approach
A “C++-to-Layout” Flow
Research Testchips
ML for EDA
CHIP DESIGN COMPLEXITY
Customer demand for more transistors & capability; need lower design costs
NVIDIA Xavier SoC [CES 2018]
[Figure: chip capability from technology scaling, 1959-2020. Moore's Law [Electronics, 1965] delivers capability at fixed power and cost until Dennard voltage scaling ends; capability at fixed cost continues until Moore's Law ends, while customer demand keeps rising.]
MOTIVATION: SHORTEN DEVELOPMENT TIME & COST
• RTL design and verification dominates: >70% of IC design effort at NVIDIA
• Limits which features can make it into each SoC
• Need 10x lower design and verification effort:
• Overlap architect & implement phases for faster time-to-market, more features
Typical development timeline: 3-5 years from R&D to product.
[Figure: overlapping 2013-2018 timelines showing R&D → Architect → Implement → Productize phases for Feature 1 and Feature 2.]
METHODOLOGY RESEARCH APPROACH

Raise HW design level of abstraction:
• Use high-level languages: C++ instead of Verilog
• Use tools, e.g. High-Level Synthesis (HLS)
• Use libraries/generators: MatchLib

Agile VLSI design:
• Small teams, jointly working on architecture, implementation, VLSI
• Continuous integration, automated tool flow, agile project management
• 24-hour spins from C++ to layout

Use modularity to scale up:
• Multi-chip modules
• Network-on-Package / Network-on-Chip
OBJECT-ORIENTED HLS-BASED DESIGN
MatchLib – Modular Approach to Circuits and Hardware Library
HLS-compatible
Highly-parameterizable, high QoR implementations
Open-sourced at https://github.com/NVlabs/matchlib
Goal: “STL/Boost” for HW
Working closely with Siemens Mentor on HLS support, open source libraries
Target: 10x productivity of manual RTL coding
[“A Modular Digital VLSI Flow for High-Productivity SoC Design”, Khailany et al., DAC 2018]
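The "STL/Boost for HW" idea above can be illustrated with a minimal sketch: a depth- and type-parameterized FIFO written in plain, HLS-friendly C++. This is loosely in the spirit of MatchLib's component library, not the actual MatchLib API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a parameterized, HLS-friendly FIFO (not the real
// MatchLib API). Payload type and depth are template parameters, echoing
// the "STL/Boost for HW" goal of highly parameterizable components.
template <typename T, size_t Depth>
class Fifo {
  std::array<T, Depth> buf{};
  size_t head = 0, tail = 0, count = 0;

 public:
  bool full() const { return count == Depth; }
  bool empty() const { return count == 0; }
  bool push(const T& v) {  // returns false when back-pressured (full)
    if (full()) return false;
    buf[tail] = v;
    tail = (tail + 1) % Depth;
    ++count;
    return true;
  }
  bool pop(T& v) {  // returns false when no data is ready
    if (empty()) return false;
    v = buf[head];
    head = (head + 1) % Depth;
    --count;
    return true;
  }
};
```

Because the same C++ is executed natively for verification and fed to HLS for implementation, one source can serve both roles, which is the productivity lever the slide describes.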
OBJECT-ORIENTED HLS-BASED DESIGN
[Figure: C++-to-layout flow diagram. Designer inputs (Specification; HLS-able architectural model in C++/SystemC; verification testbenches in SystemC; HLS and synthesis TCL scripts) and libraries (MatchLib, LI Channels in SystemC) feed C++ simulation, HLS compilation with automatic RAM mapping, logic synthesis, RTL cosimulation, and power analysis (from an FSDB trace), producing verified SystemC models, HLS-generated RTL, a netlist, and performance/power/area results.]
▪ All functional verification and performance verification run natively in C++ simulation of SystemC models
• ~50x speedup over RTL
• ~3% error in performance
▪ Leverage libraries for productivity
▪ Use standard verification approaches, but in SystemC
• Constrained random generators and checkers
• Line and expression coverage
• Assertions
▪ “Pushbutton” flow from SystemC to layout
• Achieved 24-hour turnaround time
▪ All SystemC testbenches reusable with generated RTL for final verification at full-chip and unit-level
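The constrained-random style above can be sketched roughly in plain C++ (rather than SystemC). The saturating-adder DUT, the bias toward the saturation corner, and the golden-model checker here are all hypothetical, for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical device under test: a saturating 8-bit adder.
uint8_t sat_add_dut(uint8_t a, uint8_t b) {
  unsigned s = unsigned(a) + unsigned(b);
  return s > 255 ? 255 : uint8_t(s);
}

// Constrained-random test loop: generate stimulus under a constraint,
// then check the DUT against a golden reference model.
bool run_crv(unsigned num_tests, uint32_t seed) {
  std::mt19937 rng(seed);
  // Constraint: bias half the stimulus toward the saturation corner (a >= 200).
  std::uniform_int_distribution<int> corner(200, 255), any(0, 255);
  for (unsigned i = 0; i < num_tests; ++i) {
    uint8_t a = uint8_t((i % 2) ? corner(rng) : any(rng));
    uint8_t b = uint8_t(any(rng));
    unsigned golden = std::min(255u, unsigned(a) + unsigned(b));  // checker
    if (sat_add_dut(a, b) != golden) return false;  // mismatch found
  }
  return true;
}
```

In the flow described above, the same testbench would drive both the SystemC model and, later, the HLS-generated RTL via cosimulation.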
AGILE VLSI DESIGN TECHNIQUES
GLOBALLY ASYNCHRONOUS, LOCALLY SYNCHRONOUS PAUSIBLE ADAPTIVE CLOCKS
SMALL, ABUTTING PARTITIONS FOR FAST PLACE-AND-ROUTE ITERATIONS
[“A Fine-Grained GALS SoC with Pausible Adaptive Clocking in 16 nm FinFET”, Fojtik et al., ASYNC 2019]
• 100s of local adaptive clock generators
• Fast, error-free clock domain crossings
• “Correct by construction” top-level timing closure
• Reduced margin for power-supply noise
[Figure: chip floorplan composed of 22 small abutting partitions plus I/O.]

SMALL, ABUTTING PARTITIONS FOR FAST PLACE-AND-ROUTE ITERATIONS
• Daily iterations through P&R tools
• Always working on “top of tree” RTL
• Agile, incremental approach to design closure during the march-to-tapeout phase
• Tweak floorplans, timing constraints
• RTL design bugs, performance, VLSI constraints, and tool settings all converge together
SOC TESTCHIP DEMONSTRATIONS
RC18 OVERVIEW

Project goals:
1. Design a state-of-the-art, scalable, high-performance deep learning inference accelerator (>100 TOPS, ~10 TOPS/W)
2. Demonstrate this is possible with a small team in a high-productivity VLSI flow
RC18 overview: 6x6 MCM in a package

Each chip has:
• 4x4 array of Processing Elements (PEs)
• Global Buffer (Global PE) for 2nd-level inter-layer storage
• RISC-V controller
• 5x4 mesh Network-on-Chip
• Network-on-Package and GPIO interface

Each PE has:
• Distributed weight storage
• Input and activation buffers
• Vector MACs for convolutional/fully-connected layers
• Post-processing (ReLU, pooling, etc.)
[Figures: RC18 package and RC18 chip photos]
[“A 0.11 pJ/Op, 0.32-128 TOPS, Scalable, Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm”, Zimmer et al., VLSI 2019] [Venkatesan et al., To Appear at HotChips 2019]
RC18 ARCHITECTURE: PROCESSING ELEMENT
Optimized for vector MAC dot-products to run direct convolutions
[Figure: PE block diagram — eight vector MAC lanes fed from a distributed weight buffer and input buffer through address generators and a manager, an accumulation buffer with adders, a post-processing unit (PPU: truncation, ReLU, pooling, bias), and a router interface with serdes.]
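A minimal sketch of the vector MAC dot-product pattern named above, assuming 8 lanes of 8-bit weights and activations accumulating into a 32-bit register. This is illustrative only, not RC18's actual datapath.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed vector width for illustration, matching the 8 MAC lanes in the
// PE diagram; the real datapath parameters may differ.
constexpr int kLanes = 8;

// One vector MAC step of a direct-convolution inner loop: each lane
// multiplies an 8-bit weight by an 8-bit activation, and the products are
// summed into a wide (32-bit) accumulator to avoid overflow.
int32_t vector_mac(const std::array<int8_t, kLanes>& weights,
                   const std::array<int8_t, kLanes>& acts,
                   int32_t acc) {
  for (int i = 0; i < kLanes; ++i)
    acc += int32_t(weights[i]) * int32_t(acts[i]);  // multiply-accumulate per lane
  return acc;
}
```

A direct convolution then iterates this step over filter taps and input channels, keeping weights stationary in the distributed weight buffer while activations stream through, which is the pattern the slide says the PE is optimized for.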
RC18 IMPLEMENTATION
Reuse, modularity, hierarchical design
SCALING UP
Advantages: larger-than-reticle systems, higher yields, simpler design, mixed process technologies, faster time-to-market, easier power delivery
Disadvantages: Chip-to-chip communication costs (area and energy)
MULTI-CHIP MODULES: INTERCONNECTION NETWORKS

NETWORK ON PACKAGE (NOP)
• 6x6 mesh, one NoP router per chiplet
• Multicast with configurable routing table

NETWORK ON CHIP (NOC)
• 4x5 mesh topology connects 16 PEs, one Global PE, and one RISC-V
• Unicast/multicast with configurable routing table
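The configurable multicast routing tables mentioned above can be sketched roughly as follows. The port encoding and table layout here are assumptions for illustration, not the RC18 router's actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed 5-port mesh router encoding: four neighbors plus the local client.
enum Port : uint8_t { N = 1, E = 2, S = 4, W = 8, LOCAL = 16 };

// Table-based multicast: software programs each router with a map from
// multicast group ID to an output-port bitmask; an arriving flit is
// replicated onto every port whose bit is set.
struct Router {
  std::vector<uint8_t> mcast_table;  // index = group ID, value = port mask
  explicit Router(size_t groups) : mcast_table(groups, 0) {}

  void program(size_t group, uint8_t ports) { mcast_table[group] = ports; }

  std::vector<uint8_t> route(size_t group) {  // ports to fork the flit onto
    std::vector<uint8_t> out;
    for (uint8_t p : {N, E, S, W, LOCAL})
      if (mcast_table[group] & p) out.push_back(p);
    return out;
  }
};
```

Table-based routing keeps the router datapath simple while letting the same hardware serve unicast (one bit set) and multicast (several bits set), matching the configurable unicast/multicast behavior the slide describes.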
RC18 RESULTS

Chip stats           1-chip         36-chip
Process              16nm FinFET    16nm FinFET
Core area            3.1 mm2        111.6 mm2
Voltage              0.41-1.2 V     0.52-1.1 V
Frequency            0.16-2 GHz     0.48-1.8 GHz
Peak perf (8b int)   0.32-4 TOPS    32.5-128 TOPS
Energy efficiency    0.1-1 pJ/op    0.16-8.3 pJ/op

[Figures: measurement setup; package and die photos]
[“A 0.11 pJ/Op, 0.32-128 TOPS, Scalable, Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm”, Zimmer et al., VLSI 2019]
Application benchmarks (measured at 0.85 V)

Benchmark                       Throughput      Core energy
DriveNet (4 chips, batch=1)     1.8K images/s   0.19 mJ/image
DriveNet (4 chips, batch=4)     6.5K images/s   0.23 mJ/image
ResNet-50 (36 chips, batch=2)   2.6K images/s   16.1 mJ/image
MACHINE-LEARNING-ASSISTED DESIGN AUTOMATION
CHALLENGES AND RESEARCH OPPORTUNITIES
Challenges
• Difficult to predict VLSI QoR (routability, timing, area, power) from a high-level design
• Exploration of the physical design space (floorplans, timing constraints, tool knobs) is still limited by machines, licenses, and humans
Opportunities
• Train ML-based predictors using supervised learning techniques
• Automate physical design space exploration, using ML/GPU-accelerated EDA to assist
Can we use machine learning to solve these problems?
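As a toy illustration of the supervised-learning opportunity above, one could fit a simple model mapping layout features to a QoR proxy. The features (cell utilization, pin density), the congestion label, and the linear model below are synthetic placeholders, not a real EDA predictor.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical training sample: two layout features and a QoR proxy label.
struct Sample { double util, pin_density, congestion; };

// Fit congestion ~= w0 + w1*util + w2*pin_density by stochastic gradient
// descent on squared error. Real QoR predictors (see the papers that follow)
// use far richer features and CNN/GCN models; this only shows the
// train-a-predictor pattern end to end.
std::array<double, 3> fit_linear(const std::vector<Sample>& data,
                                 int epochs = 5000, double lr = 0.05) {
  std::array<double, 3> w{0.0, 0.0, 0.0};  // bias, util weight, density weight
  for (int e = 0; e < epochs; ++e)
    for (const Sample& s : data) {
      double pred = w[0] + w[1] * s.util + w[2] * s.pin_density;
      double err = pred - s.congestion;  // gradient of 0.5 * err^2
      w[0] -= lr * err;
      w[1] -= lr * err * s.util;
      w[2] -= lr * err * s.pin_density;
    }
  return w;
}
```

Trained on QoR data mined from past P&R runs, such a predictor could steer design space exploration toward promising floorplans and tool settings before expensive tool runs, which is the workflow the slide proposes.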
RECENT RESEARCH EXAMPLES
ML/DL-based predictors for VLSI
• DRC prediction: Chan et al., “Routability Optimization in Sub-14nm Technologies”, ISPD’17
• Timing prediction: Han et al., “A deep learning methodology to proliferate golden signoff timing”, DATE’14
• DRC hotspot prediction: Xie et al., “RouteNet: Routability Prediction for Mixed-Size Designs Using Convolutional Neural Network”, ICCAD’18
• Testability prediction: Ma et al., “High Performance Graph Convolutional Networks with Applications in Testability Analysis”, DAC’19
• Power dissipation prediction: Zhou et al., “PRIMAL: Power Inference using Machine Learning”, DAC’19
RECENT RESEARCH EXAMPLES
Using ML tools for physical Design Space Exploration (DSE)
• Using ML models to predict QoR effects of tool knobs for DSE: Liu and Carloni, “On Learning-Based Methods for Design-Space Exploration with High-Level Synthesis”, DAC’13
• Using GPUs and DL frameworks to accelerate VLSI placement (30x speedup): Lin et al., “DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement”, DAC’19 (Best Paper), https://github.com/limbo018/DREAMPlace
SUMMARY AND FUTURE WORK
Addressing the design complexity problem
• Design effort reduction from raising design abstraction levels
• Embrace agile design and open-source where it makes sense
• Research automated physical design space exploration
• Use machine learning
• Use parallel computing resources (e.g. GPUs)
ACKNOWLEDGMENTS
Research sponsored by DARPA under the CRAFT program.
NVIDIA Collaborators: Evgeni Krimer, Jason Clemons, Ben Keller, Matthew Fojtik, Alicia Klinefelter, Angshuman Parashar, Michael Pellauer, Nathaniel Pinckney, Mark Ren, Yakun Sophia Shao, Stephen Tell, Yanqing Zhang, Brian Zimmer, Bill Dally, Joel S. Emer, Stephen Keckler
Harvard Collaborators: David Brooks, Gu-Yeon Wei, and team
Former NVIDIA interns/postdocs: Christopher Fletcher, Ziyun Li, Yuzhe Ma, Antonio Puglielli, Priyanka Raina, Shreesha Srinath, Gopal Srinivasan, Chris Torng, Sam (Likun) Xi, Zhiyao Xie, Yuan Zhou
Thanks to the Siemens Mentor Catapult HLS team for discussions and support: Bryan Bowyer, Stuart Clubb, Moises Garcia, Khalid Islam, and Stuart Swan.
Collaborative effort across architecture and design methodology
This research was, in part, funded by the U.S. Government, under the DARPA CRAFT program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
THANK YOU!