efficient embedded computing - darpa · 2018. 7. 25. · area numbers are from synthesized result...
TRANSCRIPT
![Page 1: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/1.jpg)
WILLIAM
CHIEF SCIENTIST AND SVP OF RESEARCH
DALLY
NVIDIA
![Page 2: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/2.jpg)
The Accelerator Age
DARPA ERI SummitJuly 25, 2018
Bill DallyChief Scientist and SVP of Research, NVIDIA Corporation
Professor (Research), Stanford University
![Page 3: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/3.jpg)
3
We need to continue delivering improved performance and perf/W
![Page 4: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/4.jpg)
Evolution of DL is Gated by Hardware
IMAGE RECOGNITION SPEECH RECOGNITION
Important Property of Neural Networks
Results get better with
more data +bigger models +
more computation
(Better algorithms, new insights and improved techniques always help, too!)
2012AlexNet
2015ResNet
152 layers22.6 GFLOP~3.5% error
8 layers1.4 GFLOP~16% Error
16XModel
2014Deep Speech 1
2015Deep Speech 2
80 GFLOP7,000 hrs of Data
~8% Error
10XTraining Ops
465 GFLOP12,000 hrs of Data
~5% Error
![Page 5: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/5.jpg)
5
But Process Technology isn’t Helping us Anymore
Moore’s Law is Dead
![Page 6: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/6.jpg)
John
Hen
ness
y an
d D
avid
Pat
ters
on, C
ompu
ter A
rchi
tect
ure:
A Q
uant
itativ
e Ap
proa
ch, 6
/e. 2
018
![Page 7: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/7.jpg)
7
Accelerators can continue scaling
perf and perf/W
![Page 8: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/8.jpg)
Fast Accelerators since 1985• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for
switch-level simulation. IEEE Trans. CAD, 4(3), pp.239-250.• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE
Trans. CAD, 9(1), pp.19-29.• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable
arithmetic processor . ISCA 1988.• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens,
J.D., 2003. Programmable stream processors. Computer, 36(8), pp.54-62.• ELM: Dally, W.J., Balfour, J., Black-Shaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and
Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016, June. EIE:
efficient inference engine on compressed deep neural network, ISCA 2016• SCNN:Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J.,
Keckler, S.W. and Dally, W.J., 2017, June. Scnn: An accelerator for compressed-sparse convolutional neural networks, ISCA 2017
• Darwin: Turakhia, Bejerano, and Dally, “Darwin: A Genomics Co-processor provides up to 15,000× acceleration on long read assembly”, ASPLOS 2018.
![Page 9: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/9.jpg)
Accelerators Employ:• Massive Parallelism – 1,000,000x, not 16x – with Locality
• Special Data Types and Operations• Do in 1 cycle what normally takes 10s or 100s
• Optimized Memory• High bandwidth (and low energy) for specific data structures and operations
• Reduced or Amortized Overhead
• Algorithm-Architecture Co-Design9
![Page 10: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/10.jpg)
Need to Understand Cost of OperationsAnd Communication
Operation: Energy (pJ)8b Add 0.0316b Add 0.0532b Add 0.116b FP Add 0.432b FP Add 0.98b Mult 0.232b Mult 3.116b FP Mult 1.132b FP Mult 3.732b SRAM Read (8KB) 532b DRAM Read 640
Area (µm2)3667
13713604184282349516407700N/AN/A
Energy numbers are from Mark Horowitz “Computing’s Energy Problem (and what we can do about it)”, ISSCC 2014Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library.
![Page 11: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/11.jpg)
Communication is Expensive, Be Small, Be Local
LPDDR DRAMGB
On-Chip SRAMMB
Local SRAMKB
640pJ/word
50pJ/word
5pJ/word
![Page 12: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/12.jpg)
Scaling of Communication
0
50
100
150
200
250
300
350
DFMA 40nm DFMA 10nm Wire 40nm Wire 10nm
pJ
Keckler et al. Micro 2011.
![Page 13: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/13.jpg)
13
The Algorithm often Has to Change
![Page 14: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/14.jpg)
Algorithm-Architecture Co-Design for DarwinStart with Graphmap
0.1 1 10 100 1000 10000 100000
Time/read (ms)
Filtration Alignment
Graphmap~10K seeds~440M hits
Filtration
Alignment~3 hits
~1 hits
1. Graphmap (software)
1
![Page 15: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/15.jpg)
Algorithm-Architecture Co-Design for DarwinReplace Graphmap with Hardware-Friendly AlgorithmsSpeed up Filtering by 100x, but 2.1x Slowdown Overall
0.1 1 10 100 1000 10000 100000
Time/read (ms)
Filtration Alignment
Graphmap~10K seeds~440M hits
Darwin~2K seeds~1M hits
Filtration(D-SOFT)
Alignment(GACT)
Filtration
Alignment~3 hits
~1 hits
~1680 hits
~1 hits
2.1X slowdown
1. Graphmap (software)2. Replace by D-SOFT and GACT
(software)
1
2
![Page 16: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/16.jpg)
Algorithm-Hardware Co-Design for DarwinAccelerate Alighment – 380x Speedup
0.1 1 10 100 1000 10000 100000
Time/read (ms)
Filtration Alignment1. Graphmap (software)2. Replace by D-SOFT and GACT
(software)3. GACT hardware-acceleration
2.1X slowdown
380X speedup
1. Graphmap (software)2. Replace by D-SOFT and GACT
(software)3. GACT hardware-acceleration
1
2
3
![Page 17: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/17.jpg)
Algorithm-Hardware Co-Design for DarwinSpecialized Memory for Filtering – 60.8x Speedup
17
0.1 1 10 100 1000 10000 100000
Time/read (ms)
Filtration Alignment
DRAM
DRAM
DRAM
DRAM
SPL
SPL
SPL
SPL
UBL
UBL
Bin-countSRAM
Bin-countSRAM
2.1X slowdown
380X speedup
3.9X speedup
15.6X speedup
1. Graphmap (software)2. Replace by D-SOFT and GACT
(software)3. GACT hardware-acceleration4. Four DRAM channels for D-SOFT5. Move bin updates in D-SOFT to
SRAM (ASIC)
1
2
3
4
5
![Page 18: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/18.jpg)
Algorithm-Hardware Co-Design for DarwinPipeline D-Soft and GACT – now completely D-Soft limited – 1.4x
Overall 15,000x
18
0.1 1 10 100 1000 10000 100000
Time/read (ms)
Filtration Alignment1. Graphmap (software)2. Replace by D-SOFT and GACT
(software)3. GACT hardware-acceleration4. Four DRAM channels for D-SOFT5. Move bin updates in D-SOFT to
SRAM (ASIC)6. Pipeline D-SOFT and GACT
2.1X slowdown
380X speedup
3.9X speedup
15.6X speedup
1.4X speedup
D-SOFT
Softw
are
GACT60
GACT61
GACT62
GACT63
GACT0
GACT1
GACT2
GACT3
1
2
3
4
5
6
![Page 19: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/19.jpg)
19
GPUs are Ideal Accelerator Platforms
![Page 20: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/20.jpg)
GPUs Provide:• High-Bandwidth, Hierarchical Memory System
• Can be configured to match application
• Programmable Control and Operand Delivery
• Simple places to bolt on Domain-Specific Hardware• As instructions or memory clients
20
![Page 21: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/21.jpg)
Volta V10021B xtors | TSMC 12nm FFN | 815mm2
5,120 CUDA cores7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS125 Tensor TFLOPS20MB SM RF | 16MB Cache 32GB HBM2 @ 900 GB/s300 GB/s NVLink
![Page 22: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/22.jpg)
22
Tensor Core
D = AB + C
D =
FP16 FP16 FP16 or FP32
A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3
A3,0 A3,1 A3,2 A3,3
B0,0 B0,1 B0,2 B0,3
B1,0 B1,1 B1,2 B1,3
B2,0 B2,1 B2,2 B2,3
B3,0 B3,1 B3,2 B3,3
C0,0 C0,1 C0,2 C0,3
C1,0 C1,1 C1,2 C1,3
C2,0 C2,1 C2,2 C2,3
C3,0 C3,1 C3,2 C3,3
![Page 23: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/23.jpg)
Specialized Instructions Amortize Overhead
Operation Ops Energy** Overhead*HFMA 2 1.5pJ 2000%HDP4A 8 6.0pJ 500%HMMA 128 110pJ 27%
*Overhead is instruction fetch, decode, and operand fetch – 30pJ**Energy numbers from 45nm process
![Page 24: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/24.jpg)
24
Volta GPUFP32 / FP16 / INT8 Multi Precision
512 CUDA Cores1.3 CUDA TFLOPS
20 Tensor Core TOPS
ISP1.5 GPIX/sNative Full-range HDRTile-based Processing
PVA1.6 TOPS
Stereo DisparityOptical Flow
Image Processing
Video Processor1.2 GPIX/s Encode1.8 GPIX/s Decode
16 CSI109 Gbps
1gE & 10gE
XAVIER
World’s First Autonomous Machine Processor
Most Complex SOC Ever Made | 9 Billion Transistors, 350mm2, 12nFFN | ~8,000 Engineering YearsDiversity of Engines Accelerate Entire AV Pipeline | Designed for ASIL-D AV
256-Bit LPDDR4137 GB/s
DLA5 TFLOPS FP1610 TOPS INT8
Carmel ARM64 CPU8 Cores10-wide Superscalar2700 SpecInt2000Functional Safety FeaturesDual Execution ModeParity & ECC
![Page 25: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/25.jpg)
Command Interface
Tensor Execution Micro-controller
Memory Interface
Input DMA(Activations
and Weights)
Unified512KB InputBuffer
Activations and Weights
Sparse Weight Decompression
Native Winograd
InputTransform
MACArray
2048 Int8or
1024 Int16or
1024 FP16
Output Accumulator
s
Output Postprocessor
(Activation Function,
Pooling etc.)
Output DMA
NVIDIA DLA
![Page 26: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/26.jpg)
Advance Accelerators via a
Government, Industry, Academic Partnership
![Page 27: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/27.jpg)
27@NVIDIA 2017
NVIDIA’s DARPA ProgramsKey technologies
Ubiquitous High PerformanceComputing (UHPC) CRAFT
2010 2012 2014 2016 2018
PERFECT
VirtualEye
Circuits and VLSIGround-reference signaling (GRS)Charge-recycling signalingHigh-productivity VLSI design methodologyPausible bisynchronous FIFO for GALSLow voltage SRAM assist circuitsEnergy-efficient latchesVariation-tolerant clocking
Programming Systems and ApplicationsCUB and other algorithm librariesCUDA extensionsCUDNN initial implementationReal-time video synthesis
ArchitectureVolta SM architectureEnergy-efficient ML inference accelerators
![Page 28: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/28.jpg)
28@NVIDIA 2017
NVIDIA’s Approach to Gov’t Partnerships
Gov’t and NVIDIA must be aligned on program objectives
Research must be aligned with technical/strategic direction of the company
Results must have the potential for transferto NVIDIA product teams
Vehicle for direct collaboration with Universities
![Page 29: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/29.jpg)
The Accelerated Future
![Page 30: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/30.jpg)
(map force (pairs
particles)
Mapping DirectivesProgram
Mapper &Runtime
GPUData & Task Placement
Synthesis
Custom Compute Blocks (Instructions or Clients)
SMs
Configurable MemoryEfficient NoC
![Page 31: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/31.jpg)
Enabling Technologies
1. Efficient GPUs (memory and interconnect substrate)
Library of custom (or configurable) blocksLinear algebraDynamic programming
Tools for building custom blocksPackagingConfigurable blocksLibrariesInterfacesHLS
Tools for mapping computation to parallel substrate
1
1
2
2
3
3
4
4
(map force (pairs
particles)
Mapping DirectivesProgram
Mapper &
Runtime
GPU Data & Task Placement
Synthesis
Custom Compute Blocks
SMs
Configurable Memory
Efficient NoC
![Page 32: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/32.jpg)
32
Efficiency and ProgrammabilitySoftware Defined Hardware (SDH) - Symphony
Dyn
amic
Rec
onfi
gura
tion
of
HW
/SW
DSLs targeting algorithmic patterns and architecture
primitives
Data orchestration tiles (DOT): efficient data transfer
and transformation
Configurable compute tiles (CCT): adaptable
algorithmic primitives
Support for different data types and varying data
sparsity
Mapping, Optimization, and Load balancing
DRAM DOT
DOT
DOT
DOT
CCT
CCT
CCT
CCT
DOT
DOT
DOT
DOT
CCT
CCT
CCT
CCT
DRAM
DRAM
DRAM
1
4
![Page 33: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/33.jpg)
33
Enabling Fast Creation of Accelerators
Need tools & flows to dramatically improve accelerator design productivity
Current Research (CRAFT)
Future Research (IDEA)
Applications of GPU Computing and Machine Learning to Design Automation
Design Automation FlowObject-Oriented High-Level Synthesis
Globally Asynchronous Locally Synchronous ClockingAutomatic Floorplanning
Agile Hardware Design Techniques
• Architecture Models (C++)
• Libraries• 3rd party IP• Internal IP
2
3
![Page 34: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/34.jpg)
34
Conclusion
John Hennessy – “Achieving cost-performance in this era of DSAs will require matching applications, languages, architecture - and reducing design cost”
![Page 35: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/35.jpg)
Summary• We need to continue scaling perf/W
– Economic growth depends on it– Machine learning depends on it
• Moore’s Law is over• Accelerators are the future
– GPUs – efficient memory, communication and control– Custom blocks – instructions or clients
• Enabling Technologies– Efficient Memory/Interconnect substrate (GPUs) [SDH]– Mapping Tools – [SDH]– Efficient Generation of Custom Blocks – [CRAFT]
• Industry, Government, University Partnership– NVIDIA, DARPA, (MIT, UIUC, UC Davis) [SDH]
![Page 36: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/36.jpg)
Thank You
![Page 37: Efficient Embedded Computing - DARPA · 2018. 7. 25. · Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library](https://reader036.vdocument.in/reader036/viewer/2022071509/612db9e21ecc515869425e59/html5/thumbnails/37.jpg)