efficient embedded computing - darpa · 2018. 7. 25. · area numbers are from synthesized result...

WILLIAM

CHIEF SCIENTIST AND SVP OF RESEARCH

DALLY

NVIDIA

The Accelerator Age

DARPA ERI SummitJuly 25, 2018

Bill DallyChief Scientist and SVP of Research, NVIDIA Corporation

Professor (Research), Stanford University

3

We need to continue delivering improved performance and perf/W

Evolution of DL is Gated by Hardware

IMAGE RECOGNITION SPEECH RECOGNITION

Important Property of Neural Networks

Results get better with

more data +bigger models +

more computation

(Better algorithms, new insights and improved techniques always help, too!)

2012AlexNet

2015ResNet

152 layers22.6 GFLOP~3.5% error

8 layers1.4 GFLOP~16% Error

16XModel

2014Deep Speech 1

2015Deep Speech 2

80 GFLOP7,000 hrs of Data

~8% Error

10XTraining Ops

465 GFLOP12,000 hrs of Data

~5% Error

5

But Process Technology isn’t Helping us Anymore

Moore’s Law is Dead

John

Hen

ness

y an

d D

avid

Pat

ters

on, C

ompu

ter A

rchi

tect

ure:

A Q

uant

itativ

e Ap

proa

ch, 6

/e. 2

018

7

Accelerators can continue scaling

perf and perf/W

Fast Accelerators since 1985• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for

switch-level simulation. IEEE Trans. CAD, 4(3), pp.239-250.• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE

Trans. CAD, 9(1), pp.19-29.• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable

arithmetic processor . ISCA 1988.• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens,

J.D., 2003. Programmable stream processors. Computer, 36(8), pp.54-62.• ELM: Dally, W.J., Balfour, J., Black-Shaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and

Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016, June. EIE:

efficient inference engine on compressed deep neural network, ISCA 2016• SCNN:Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J.,

Keckler, S.W. and Dally, W.J., 2017, June. Scnn: An accelerator for compressed-sparse convolutional neural networks, ISCA 2017

• Darwin: Turakhia, Bejerano, and Dally, “Darwin: A Genomics Co-processor provides up to 15,000× acceleration on long read assembly”, ASPLOS 2018.

Accelerators Employ:• Massive Parallelism – 1,000,000x, not 16x – with Locality

• Special Data Types and Operations• Do in 1 cycle what normally takes 10s or 100s

• Optimized Memory• High bandwidth (and low energy) for specific data structures and operations

• Reduced or Amortized Overhead

• Algorithm-Architecture Co-Design9

Need to Understand Cost of OperationsAnd Communication

Operation: Energy (pJ)8b Add 0.0316b Add 0.0532b Add 0.116b FP Add 0.432b FP Add 0.98b Mult 0.232b Mult 3.116b FP Mult 1.132b FP Mult 3.732b SRAM Read (8KB) 532b DRAM Read 640

Area (µm2)3667

13713604184282349516407700N/AN/A

Energy numbers are from Mark Horowitz “Computing’s Energy Problem (and what we can do about it)”, ISSCC 2014Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library.

Communication is Expensive, Be Small, Be Local

LPDDR DRAMGB

On-Chip SRAMMB

Local SRAMKB

640pJ/word

50pJ/word

5pJ/word

Scaling of Communication

0

50

100

150

200

250

300

350

DFMA 40nm DFMA 10nm Wire 40nm Wire 10nm

pJ

Keckler et al. Micro 2011.

13

The Algorithm often Has to Change

Algorithm-Architecture Co-Design for DarwinStart with Graphmap

0.1 1 10 100 1000 10000 100000

Time/read (ms)

Filtration Alignment

Graphmap~10K seeds~440M hits

Filtration

Alignment~3 hits

~1 hits

1. Graphmap (software)

1

Algorithm-Architecture Co-Design for DarwinReplace Graphmap with Hardware-Friendly AlgorithmsSpeed up Filtering by 100x, but 2.1x Slowdown Overall

0.1 1 10 100 1000 10000 100000

Time/read (ms)


Graphmap~10K seeds~440M hits

Darwin~2K seeds~1M hits

Filtration(D-SOFT)

Alignment(GACT)

Filtration

Alignment~3 hits

~1 hits

~1680 hits

~1 hits

2.1X slowdown

1. Graphmap (software)2. Replace by D-SOFT and GACT

(software)

1

2

Algorithm-Hardware Co-Design for DarwinAccelerate Alighment – 380x Speedup

0.1 1 10 100 1000 10000 100000

Time/read (ms)

Filtration Alignment1. Graphmap (software)2. Replace by D-SOFT and GACT

(software)3. GACT hardware-acceleration

2.1X slowdown

380X speedup


(software)3. GACT hardware-acceleration

1

2

3

Algorithm-Hardware Co-Design for DarwinSpecialized Memory for Filtering – 60.8x Speedup

17

0.1 1 10 100 1000 10000 100000

Time/read (ms)


DRAM

DRAM

DRAM

DRAM

SPL

SPL

SPL

SPL

UBL

UBL

Bin-countSRAM

Bin-countSRAM

2.1X slowdown

380X speedup

3.9X speedup

15.6X speedup


(software)3. GACT hardware-acceleration4. Four DRAM channels for D-SOFT5. Move bin updates in D-SOFT to

SRAM (ASIC)

1

2

3

4

5

Algorithm-Hardware Co-Design for DarwinPipeline D-Soft and GACT – now completely D-Soft limited – 1.4x

Overall 15,000x

18

0.1 1 10 100 1000 10000 100000

Time/read (ms)

Filtration Alignment1. Graphmap (software)2. Replace by D-SOFT and GACT

(software)3. GACT hardware-acceleration4. Four DRAM channels for D-SOFT5. Move bin updates in D-SOFT to

SRAM (ASIC)6. Pipeline D-SOFT and GACT

2.1X slowdown

380X speedup

3.9X speedup

15.6X speedup

1.4X speedup

D-SOFT

Softw

are

GACT60

GACT61

GACT62

GACT63

GACT0

GACT1

GACT2

GACT3

1

2

3

4

5

6

19

GPUs are Ideal Accelerator Platforms

GPUs Provide:• High-Bandwidth, Hierarchical Memory System

• Can be configured to match application

• Programmable Control and Operand Delivery

• Simple places to bolt on Domain-Specific Hardware• As instructions or memory clients

20

Volta V10021B xtors | TSMC 12nm FFN | 815mm2

5,120 CUDA cores7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS125 Tensor TFLOPS20MB SM RF | 16MB Cache 32GB HBM2 @ 900 GB/s300 GB/s NVLink

22

Tensor Core

D = AB + C

D =

FP16 FP16 FP16 or FP32

A0,0 A0,1 A0,2 A0,3

A1,0 A1,1 A1,2 A1,3

A2,0 A2,1 A2,2 A2,3

A3,0 A3,1 A3,2 A3,3

B0,0 B0,1 B0,2 B0,3

B1,0 B1,1 B1,2 B1,3

B2,0 B2,1 B2,2 B2,3

B3,0 B3,1 B3,2 B3,3

C0,0 C0,1 C0,2 C0,3

C1,0 C1,1 C1,2 C1,3

C2,0 C2,1 C2,2 C2,3

C3,0 C3,1 C3,2 C3,3

Specialized Instructions Amortize Overhead

Operation Ops Energy** Overhead*HFMA 2 1.5pJ 2000%HDP4A 8 6.0pJ 500%HMMA 128 110pJ 27%

*Overhead is instruction fetch, decode, and operand fetch – 30pJ**Energy numbers from 45nm process

24

Volta GPUFP32 / FP16 / INT8 Multi Precision

512 CUDA Cores1.3 CUDA TFLOPS

20 Tensor Core TOPS

ISP1.5 GPIX/sNative Full-range HDRTile-based Processing

PVA1.6 TOPS

Stereo DisparityOptical Flow

Image Processing

Video Processor1.2 GPIX/s Encode1.8 GPIX/s Decode

16 CSI109 Gbps

1gE & 10gE

XAVIER

World’s First Autonomous Machine Processor

Most Complex SOC Ever Made | 9 Billion Transistors, 350mm2, 12nFFN | ~8,000 Engineering YearsDiversity of Engines Accelerate Entire AV Pipeline | Designed for ASIL-D AV

256-Bit LPDDR4137 GB/s

DLA5 TFLOPS FP1610 TOPS INT8

Carmel ARM64 CPU8 Cores10-wide Superscalar2700 SpecInt2000Functional Safety FeaturesDual Execution ModeParity & ECC

Command Interface

Tensor Execution Micro-controller

Memory Interface

Input DMA(Activations

and Weights)

Unified512KB InputBuffer

Activations and Weights

Sparse Weight Decompression

Native Winograd

InputTransform

MACArray

2048 Int8or

1024 Int16or

1024 FP16

Output Accumulator

s

Output Postprocessor

(Activation Function,

Pooling etc.)

Output DMA

NVIDIA DLA

Advance Accelerators via a

Government, Industry, Academic Partnership

27@NVIDIA 2017

NVIDIA’s DARPA ProgramsKey technologies

Ubiquitous High PerformanceComputing (UHPC) CRAFT

2010 2012 2014 2016 2018

PERFECT

VirtualEye

Circuits and VLSIGround-reference signaling (GRS)Charge-recycling signalingHigh-productivity VLSI design methodologyPausible bisynchronous FIFO for GALSLow voltage SRAM assist circuitsEnergy-efficient latchesVariation-tolerant clocking

Programming Systems and ApplicationsCUB and other algorithm librariesCUDA extensionsCUDNN initial implementationReal-time video synthesis

ArchitectureVolta SM architectureEnergy-efficient ML inference accelerators

28@NVIDIA 2017

NVIDIA’s Approach to Gov’t Partnerships

Gov’t and NVIDIA must be aligned on program objectives

Research must be aligned with technical/strategic direction of the company

Results must have the potential for transferto NVIDIA product teams

Vehicle for direct collaboration with Universities

The Accelerated Future

(map force (pairs

particles)

Mapping DirectivesProgram

Mapper &Runtime

GPUData & Task Placement

Synthesis

Custom Compute Blocks (Instructions or Clients)

SMs

Configurable MemoryEfficient NoC

Enabling Technologies

1. Efficient GPUs (memory and interconnect substrate)

Library of custom (or configurable) blocksLinear algebraDynamic programming

Tools for building custom blocksPackagingConfigurable blocksLibrariesInterfacesHLS

Tools for mapping computation to parallel substrate

1

1

2

2

3

3

4

4

(map force (pairs

particles)

Mapping DirectivesProgram

Mapper &

Runtime

GPU Data & Task Placement

Synthesis

Custom Compute Blocks

SMs

Configurable Memory

Efficient NoC

32

Efficiency and ProgrammabilitySoftware Defined Hardware (SDH) - Symphony

Dyn

amic

Rec

onfi

gura

tion

of

HW

/SW

DSLs targeting algorithmic patterns and architecture

primitives

Data orchestration tiles (DOT): efficient data transfer

and transformation

Configurable compute tiles (CCT): adaptable

algorithmic primitives

Support for different data types and varying data

sparsity

Mapping, Optimization, and Load balancing

DRAM DOT

DOT

DOT

DOT

CCT

CCT

CCT

CCT

DOT

DOT

DOT

DOT

CCT

CCT

CCT

CCT

DRAM

DRAM

DRAM

1

4

33

Enabling Fast Creation of Accelerators

Need tools & flows to dramatically improve accelerator design productivity

Current Research (CRAFT)

Future Research (IDEA)

Applications of GPU Computing and Machine Learning to Design Automation

Design Automation FlowObject-Oriented High-Level Synthesis

Globally Asynchronous Locally Synchronous ClockingAutomatic Floorplanning

Agile Hardware Design Techniques

• Architecture Models (C++)

• Libraries• 3rd party IP• Internal IP

2

3

34

Conclusion

John Hennessy – “Achieving cost-performance in this era of DSAs will require matching applications, languages, architecture - and reducing design cost”

Summary• We need to continue scaling perf/W

– Economic growth depends on it– Machine learning depends on it

• Moore’s Law is over• Accelerators are the future

– GPUs – efficient memory, communication and control– Custom blocks – instructions or clients

• Enabling Technologies– Efficient Memory/Interconnect substrate (GPUs) [SDH]– Mapping Tools – [SDH]– Efficient Generation of Custom Blocks – [CRAFT]

• Industry, Government, University Partnership– NVIDIA, DARPA, (MIT, UIUC, UC Davis) [SDH]

Thank You

efficient embedded computing - darpa · 2018. 7. 25. · area numbers are from synthesized result...

Documents