distribution statement a. approved for public...
TRANSCRIPT
![Page 1: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/1.jpg)
Distribution Statement A. Approved for Public Release
![Page 2: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/2.jpg)
EPOCHS: EFFICIENT PROGRAMMABILITY OF COGNITIVE HETEROGENEOUS SYSTEMS
Distribution Statement A. Approved for Public Release
![Page 3: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/3.jpg)
SARITA
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
ADVE
Distribution Statement A. Approved for Public Release
![Page 4: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/4.jpg)
EPOCHS OVERVIEW
Quick design & implementation of domain-specific SoC + compiler + scheduler for cognitive connected vehicles
EPOCHS Application
Compiler + Scheduler
Ontology& Design Space
Exploration
Accelerator + NoC + Memory
Architecture
Agile Implementation
Flow
Domain-Specific
SoC Hardware
S. Adve, V. Adve, P. Bose, D. Brooks, L. Carloni, S. Misailovic, V. Reddi, K. Shepard, G-Y Wei
Distribution Statement A. Approved for Public Release
![Page 5: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/5.jpg)
Key to heterogeneous system design: Interfaces for hardware & software• Specializable• Programmable• Efficient
This talk: Two interfaces• Spandex: Accelerator communication interface• Heterogeneous Parallel Virtual Machine (HPVM): Hardware-software interface
Enable specialization with programmability and efficiency
KEY: HARDWARE & SOFTWARE INTERFACES
Distribution Statement A. Approved for Public Release
![Page 6: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/6.jpg)
HETEROGENEOUS SYSTEMS COMMUNICATION
CPU
Accelerator
Traditional heterogeneity:
CPU
Accelerator
CPU Mem Space
Accelerator Mem Space
data in
data out
Shared Mem Space
coherent data
Coherent shared memory:
Wasteful data movement No fine‐grain synchronization (atomics) No irregular access patterns
Implicit data reuseFine‐grain synchronization (atomics)Irregular access patterns
Current solutions: complex and inflexible
Distribution Statement A. Approved for Public Release
![Page 7: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/7.jpg)
HETEROGENEOUS DEVICES HAVE DIVERSE MEMORY DEMANDS
Spatiallocality
Temporallocality
ThroughputSensitivity
LatencySensitivity
Fine-grainSynch
Distribution Statement A. Approved for Public Release
![Page 8: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/8.jpg)
HETEROGENEOUS DEVICES HAVE DIVERSE MEMORY DEMANDS
Spatiallocality
Temporallocality
Throughput Sensitivity
Latency Sensitivity
Fine-grainSynch
Typical CPU workloads: fine-grain synchronization, latency sensitive
Distribution Statement A. Approved for Public Release
![Page 9: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/9.jpg)
HETEROGENEOUS DEVICES HAVE DIVERSE MEMORY DEMANDS
Spatiallocality
Temporallocality
Throughput Sensitivity
Latency Sensitivity
Fine-grainSynch
Typical GPU workloads: spatial locality, throughput sensitive
Distribution Statement A. Approved for Public Release
![Page 10: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/10.jpg)
KEY PROPERTIES OF COHERENCE PROTOCOLS
Properties
Granularity
Invalidation
Updates
Distribution Statement A. Approved for Public Release
![Page 11: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/11.jpg)
KEY PROPERTIES OF COHERENCE PROTOCOLS
Properties CPU GPU DeNovo(for CPU or GPU)
Granularity Line Reads: LineWrites: Word
Reads: FlexibleWrites: Word
Invalidation Writer-invalidate Self-invalidate Self-invalidate
Updates Ownership Write-through Ownership
MESI GPU coh.
DeNovo
How to integrate different accelerators with different protocols?
Distribution Statement A. Approved for Public Release
![Page 12: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/12.jpg)
Accel 2
CURRENT SOLUTIONS: INFLEXIBLE, INEFFICIENT
Accel 1CPU
MESI L1
GPU
GPU coh. L1
DeNovoL1
GPU coh. L1
Distribution Statement A. Approved for Public Release
![Page 13: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/13.jpg)
Accel 2
CURRENT SOLUTIONS: INFLEXIBLE, INEFFICIENT
Accel 1CPU
MESI Last Level Cache (LLC)
MESI L1
GPU
GPU coh. L1
DeNovoL1
GPU coh. L1
Distribution Statement A. Approved for Public Release
![Page 14: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/14.jpg)
Accel 2
CURRENT SOLUTIONS: INFLEXIBLE, INEFFICIENT
Accel 1CPU
MESI Last Level Cache (LLC)
MESI L1
GPU
GPU coh. L1
DeNovoL1
Distribution Statement A. Approved for Public Release
![Page 15: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/15.jpg)
Accel 2
CURRENT SOLUTIONS: INFLEXIBLE, INEFFICIENT
Accel 1CPU
MESI Last Level Cache (LLC)
MESI L1
GPU
GPU coh. L1
DeNovoL1MESI L1
Distribution Statement A. Approved for Public Release
![Page 16: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/16.jpg)
Accel 2
CURRENT SOLUTIONS: INFLEXIBLE, INEFFICIENT
Accel 1CPU
MESI Last Level Cache (LLC)
MESI L1
GPU
GPU coh. L1
DeNovoL1
MESI/GPU coh.Hybrid L2
MESI L1
Distribution Statement A. Approved for Public Release
![Page 17: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/17.jpg)
SPANDEX: FLEXIBLE HETEROGENEOUS COHERENCE INTERFACE [ISCA’18]
CPU
MESI L1
• Adapts to exploit individual device’s workload attributes• Better performance, lower complexity
⇒ Fits like a glove for heterogeneous systems!
GPU
GPU coh. L1
Accel 1
GPU coh. L1
Accel 2
DeNovoL1
Spandex
Supported by ADA JUMP and C-FAR STARNET centers
Distribution Statement A. Approved for Public Release
![Page 18: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/18.jpg)
DeNovo L1
SPANDEX OVERVIEW
Key Components
• Flexible device request interface
• DeNovo-based LLC
• External request interface
Device may need a translation unit (TU)Spandex LLC
Device Request Interface
External Request Interface
TUTUTU
CPU GPU Accel
MESI L1 GPU coh. L1
Distribution Statement A. Approved for Public Release
![Page 19: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/19.jpg)
DeNovo L1
SPANDEX OVERVIEW
Key Components
• Flexible device request interface
• DeNovo-based LLC
• External request interface
Device may need a translation unit (TU)Spandex LLC
Device Request Interface
External Request Interface
TUTUTU
CPU GPU Accel
MESI L1 GPU coh. L1
Distribution Statement A. Approved for Public Release
![Page 20: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/20.jpg)
SPANDEX DEVICE REQUEST INTERFACE
Action Request Indicates
ReadReqV Self-invalidationReqS Writer-invalidation
WriteReqWT Write-throughReqO Ownership only
Read+Write
ReqWT+data Atomic write-through
ReqO+data Ownership + DataWriteback ReqWB Owned data eviction
• Requests also specify granularity and (optionally) a bitmask
Distribution Statement A. Approved for Public Release
![Page 21: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/21.jpg)
DYNAMIC & NEW COHERENCE SPECIALIZATIONS
• Spandex flexibility & simplicity enables dynamic & new coherence specializations
Producer Consumer
Dynamic Spandex request selectionProducer-consumer forwardingExtended granularity flexibility
Spandex LLC
• Directed by compiler or runtime
• Applied to neural networks: enables new programming patterns
Distribution Statement A. Approved for Public Release
![Page 22: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/22.jpg)
DATA PARALLEL VS. PIPELINED NEURAL NETWORKS
Fully connected DNN Recurrent neural nw (RNN)Pipelined (no simple data parallel)
• Each core has independent image• Each core processes all network layers• No communication/synch overhead• But need weights of all layers for reuse
• Each core processes one layer• Communication/synch overhead• But need weights of only one
layer for reuse
• No simple communication-freedata parallel version sincecomputation depends onprevious input
Data Parallel Pipelined
Spandex flexibility enables fine-grained communication and efficient pipelines
Distribution Statement A. Approved for Public Release
![Page 23: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/23.jpg)
COHERENCE SPECIALIZATION FOR PIPELINED NEURAL NETWORKS
• Dynamically select coherence strategy for different data− Use ownership for weight data (reuse), writethrough for input feature data
Accel2 L1
Spandex LLC
Accel1 L1w
f
Distribution Statement A. Approved for Public Release
![Page 24: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/24.jpg)
COHERENCE SPECIALIZATION FOR PIPELINED NEURAL NETWORKS
• Dynamically select coherence strategy for different data− Use ownership for weight data (reuse), writethrough for input feature data
• Optimization1: Producer → Consumer forward via last level cache (LLC)− LLC forwards feature output to next consumer
Accel2 L1
Spandex LLC
Accel1 L1w f
Distribution Statement A. Approved for Public Release
![Page 25: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/25.jpg)
COHERENCE SPECIALIZATION FOR PIPELINED NEURAL NETWORKS
• Dynamically select coherence strategy for different data− Use ownership for weight data (reuse), writethrough for input feature data
• Optimization1: Producer → Consumer forward via last level cache (LLC)− LLC forwards feature output to next consumer
• Optimization2: Direct Producer → Consumer forward w/ owner prediction− Producer directly forwards feature data to consumer cache, with no LLC lookup
Accel2 L1
Spandex LLC
Accel1 L1w f
Distribution Statement A. Approved for Public Release
![Page 26: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/26.jpg)
EVALUATION METHODOLOGY
• Baseline system: CPU + Multiple GPU compute cores on a network on chip
• Neural network (NN) computation on different compute cores of GPUs
• NNs based on standard networks or information from literature
• Cycle accurate architectural simulation: GEMS+GPGPUSim+Garnett+Spandex
Distribution Statement A. Approved for Public Release
![Page 27: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/27.jpg)
020406080
100120140160
FCNN Lenet LSTM
Execution cycles
Base: Data Parallel (Serial for LSTM) Base: Pipelined Optimized
RESULTS FOR NEURAL NETWORKS
020406080
100120140160
FCNN Lenet LSTM
Network traffic
6X
1.6X
25X 300X
1.5X / 3X
Large reduction in execution time and/or network trafficNext steps for Spandex: Apply to EPOCHS application and integrate with compiler
Distribution Statement A. Approved for Public Release
![Page 28: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/28.jpg)
Key to heterogeneous system design: Interfaces for hardware & software• Specializable• Programmable• Efficient
This talk: Two interfacesSpandex: Accelerator communication interfaceHeterogeneous Parallel Virtual Machine (HPVM): Hardware-software interface
Enable specialization with programmability and efficiency
RECAP SO FAR
Distribution Statement A. Approved for Public Release
![Page 29: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/29.jpg)
CURRENT INTERFACE LEVELS
"Hardware" ISA
Virtual ISA
Language-neutral Compiler IR
Language-level Compiler IR
General-purpose prog. language
Domain-specific prog. language
Hardware innovation
Object-code portability
Compiler investment
Language innovation
App. performance
App. productivity
CPUs + Vector SIMD Units
…
GPU DSP FPGACustomaccelerators
Distribution Statement A. Approved for Public Release
![Page 30: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/30.jpg)
Much moreuniform
WHERE SHOULD WE ABSTRACT HETEROGENEITY?
"Hardware" ISA
Virtual ISA
Language-neutral Compiler IR
Language-level Compiler IR
General-purpose prog. language
Domain-specific prog. language Too diverseto define a
uniforminterface
Also too diverse …
HPVM
CPUs + Vector SIMD Units
…
GPU DSP FPGACustomaccelerators
Distribution Statement A. Approved for Public Release
![Page 31: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/31.jpg)
HETEROGENEOUS PARALLEL VIRTUAL MACHINE
Key to Programmability: Common abstractions for heterogeneous parallel hardware
Use HPVM for:1. Portable object code2. Retargetable parallel
compiler IR and system3. Run-time schedulingTranslators
HPVM Compiler IR &
Virtual ISARuntime
Scheduler
HPVM-C/C++Keras
TensorFlowOtherDSLs
Front ends
Halide
GPU DSP FPGACustomAccelerators
CPU
Kotsifakou et al., PPOPP 2018
Distribution Statement A. Approved for Public Release
![Page 32: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/32.jpg)
ABSTRACTION OF PARALLEL COMPUTATION
Dataflow Graph with side effects
LLVMVA = load <4 x float>* AVB = load <4 x float>* B
…VC = fmul <4 x float> VA, VB
Hierarchical
or
Captures• coarse grain task parallelism• streams, pipelined parallelism• nested parallelism• SPMD-style data parallelism• fine grain vector parallelism
Supports high-level optimizations as graph transformations
N different parallelism models
Single unified parallelism model
Distribution Statement A. Approved for Public Release
![Page 33: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/33.jpg)
HPVM BOTTOM-UP CODE GENERATION
Intel Xeon E5 core i7 +AVX
nVidia GeForceGTX 680 GPU
Frontend
Sourceprogram
.bc (with HPVM intrinsics)
Developer siteUser site
HPVM-to-PTX
HPVM-to-SPIR-to-AVX
HPVM graph optimizer
Code-gen: Bottom-up on graph hierarchyTwo key aspects1. Any node Any device2. Reuse vendor back ends for
high performance code-gen
Intel / AlteraArria 10 FPGA
HPVM-to-FPGA
Distribution Statement A. Approved for Public Release
![Page 34: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/34.jpg)
APPROX-HPVM: ACCURACY AWARE OPTIMIZATION
Relaxing accuracy can enable higher efficiency (2X to 50X in our work)Can we make these techniques easier to use?
Goal 1: Applications should only specify high-level accuracy goals• Maximum acceptable loss in quality, e.g., inference error, PSNR• End-to-end metrics, not per function or pipeline stage or …
Goal 2: (Often) Want object-code portability• Approximation choices are highly system-dependent• Can make orders-of-magnitude difference in performance, energy
ApproxHPVM: Accuracy-aware representation and optimizationsImplemented for Keras: Neural networks in TensorFlow
Sharif et al., OOPSLA 2019, accepted with revisions
Distribution Statement A. Approved for Public Release
![Page 35: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/35.jpg)
DNN SPEEDUP AND ENERGY SAVINGS
1-2% loss of inference accuracy gives 2X-9X speedup, 2X-11X energy savings in most networks
Target system: NVIDIA Tegra TX2 + PROMISE ML acceleratorTX2: FP32 or FP16, PROMISE: 7 voltage levels in SRAM bitlines
Distribution Statement A. Approved for Public Release
![Page 36: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/36.jpg)
Key to heterogeneous system design: Interfaces for hardware & software• Specializable• Programmable• Efficient
This talk: Two interfaces• Spandex: Accelerator communication interface• Heterogeneous Parallel Virtual Machine (HPVM): Hardware-software interface
Enable specialization with programmability and efficiency
SUMMARY: HARDWARE & SOFTWARE INTERFACES
Distribution Statement A. Approved for Public Release
![Page 37: Distribution Statement A. Approved for Public Releasersim.cs.uiuc.edu/Talks/Adve_Sarita_DSSoC_Final2.pdf · SARITA UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN ADVE SADVE@ILLINOIS.EDU](https://reader033.vdocument.in/reader033/viewer/2022060310/5f0a8c177e708231d42c2bc0/html5/thumbnails/37.jpg)
Distribution Statement A. Approved for Public Release