Neural Network Engine (NNE)

Conclusion
• NN is dominated by memory accesses, not MAC operations
• 8-bit wordlength has little impact on model accuracy
• Using the three optimization techniques in NNE reduced:
  o Memory accesses by 5.6x compared to the baseline xDSP
  o Energy costs by 5.5x compared to the baseline xDSP
  o Memory required for parameters by 3x
• NNE always executes in a deterministic number of cycles for any arbitrary network

Results
Deterministic number of cycles:

3. Two-step scaling • Solves inefficient individual scaling of activations
   3.1 Within a vector (when writing results)
   3.2 Across a layer (when reading results in the next layer)

[Table: per-layer values for L1–L4]

* Minimum number of required accesses. This number can grow depending on how many times previously computed values must be retrieved and shifted additionally.
** This number does not include the energy required for fetching and decoding the instructions, which would increase the energy costs even more.

Table 2: 24-bit DSP vs. 8-bit NNE. Energy cost is estimated for a 45 nm process [4].
NNE
• Inference = processing of a one-second audio file
• Required MAC operations: 6,600/inference
• Total cycles: 7,295/inference
• Memory not accessed in only 45 cycles (0.62% of all cycles)
• Accuracy: 80.28% (4,890 audio files)
[Table: per-layer values for L1–L4]

Before computing the additional shifts:

| Group  | Shifts/group | Additional shifts |
|--------|--------------|-------------------|
| Green  | 2            | ?                 |
| Blue   | 1            | ?                 |
| Orange | 3            | ?                 |

After (each group is shifted up to the layer-wide maximum of 3):

| Group  | Shifts/group | Additional shifts |
|--------|--------------|-------------------|
| Green  | 2            | 3 - 2 = 1         |
| Blue   | 1            | 3 - 1 = 2         |
| Orange | 3            | 0                 |
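The "additional shifts" computation in the tables above can be sketched in a few lines (a minimal sketch of the idea, not the hardware implementation; the group names and shift counts are the ones from the example):

```python
# Sketch of the "across a layer" step of two-step scaling: every group of
# outputs is scaled individually when written; when the next layer reads
# them, each group is shifted by the difference to the layer-wide maximum
# so that all activations end up on one common scale.
shifts_per_group = {"green": 2, "blue": 1, "orange": 3}

layer_max = max(shifts_per_group.values())  # 3 (the orange group)
additional = {g: layer_max - s for g, s in shifts_per_group.items()}

print(additional)  # {'green': 1, 'blue': 2, 'orange': 0}
```

Because the shift amounts are fixed once the layer maximum is known, the number of extra shift operations per group is known ahead of time, which is what makes the cycle count deterministic.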
Benchmarking and improving NN execution on DSP vs. custom accelerator for hearing instruments
Zuzana Jelčicová*, Adrian Mardari*, Oskar Andersson*, Evangelia Kasapaki*, Jens Sparsø✺
Demant A/S, Smørum (Denmark)*; Technical University of Denmark, Lyngby (Denmark)✺
N = number of layers, excluding the input layer
O = number of output neurons in the current layer
A = number of inputs/activations to the layer
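The cycle-count formula itself did not survive extraction, but with the definitions above, and assuming twelve parallel 8-bit MACs per 96-bit memory word, the reported 6,600 MAC operations per inference are consistent with the following (a reconstruction, not the original equation):

```latex
\text{MAC operations/inference} \;=\; \frac{1}{12}\sum_{i=1}^{N} O_i\, A_i \;=\; \frac{79\,200}{12} \;=\; 6\,600
```

The remaining cycles up to the reported total of 7,295 would then be scaling and control overhead.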
NNE includes three optimizations:
1. Reduced wordlength
   • 24-bit to 8-bit parameters (inputs, weights, biases)
   • On-the-fly symmetric quantization (each layer individually) [3]
   • Insignificant loss in accuracy
2. Several MACs in parallel
   • Input & output stationary processing
   • Considerations: 96-bit memory interface and 8-bit data

Table 1: Number of read operations (inputs and weights) needed for inference using 8-bit parameters. Biases are excluded as they are negligible.
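The parallel-MAC figures can be sanity-checked with a short sketch (layer sizes are those of the 250x144x144x144x12 model; the factor of 12 follows from packing twelve 8-bit values into one 96-bit memory word):

```python
# Sanity check of the MAC counts for the 250x144x144x144x12 fully
# connected KWS model (one MAC per weight).
layers = [250, 144, 144, 144, 12]

# Total multiply-accumulate operations for one inference.
total_macs = sum(a * o for a, o in zip(layers, layers[1:]))

# A 96-bit memory interface holds twelve 8-bit values, so the NNE can
# feed 12 MAC units per access.
parallel_ops = total_macs // 12

print(total_macs)    # 79200
print(parallel_ops)  # 6600
```

The result matches the 6,600 MAC operations per inference reported in the Results box; the gap to the 7,295 total cycles is the remaining control and scaling work.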
Introduction

DSPs in general
• Datapaths with 16- and 24-bit elements

Neural network (NN) inference
• 8 bits are usually sufficient
  o However, a reduced wordlength brings more frequent overflows → additional memory accesses, increased energy costs, and unpredictable timing behavior

xDSP vs. NN processing engine (NNE)
• xDSP → Oticon's DSP-based platform used for obtaining baseline results
• NNE achieves further power optimizations by exploiting three mutually dependent techniques:
  1. Reduced wordlength – 24 to 8 bits with insignificant loss in accuracy
  2. Several MACs in parallel – reduced wordlength → processing of more data at once
  3. Two-step scaling
     • Eliminates the need to reload and scale already computed outputs to maintain the ratio
     • Makes our NNE always execute in a deterministic number of cycles
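The reduced-wordlength technique relies on per-layer symmetric quantization; a generic int8 sketch in the spirit of [3] (a sketch of the general scheme, not the poster's hardware implementation):

```python
def quantize_symmetric(values, n_bits=8):
    """Symmetric linear quantization of one layer's values to signed n_bits.

    The scale is chosen per layer from the largest magnitude, so zero maps
    to zero and the range is symmetric around it.
    """
    qmax = (1 << (n_bits - 1)) - 1              # 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax  # per-layer scale factor
    q = [round(v / scale) for v in values]
    return q, scale

q, scale = quantize_symmetric([0.5, -1.0, 0.25])
# the largest-magnitude value (-1.0) maps to -127
```

Doing this per layer, as the poster describes, lets each layer use the full 8-bit range regardless of how its activation magnitudes differ from other layers.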
Keyword Spotting Application (KWS)
• Feedforward fully connected DNN model (250x144x144x144x12) [1] trained on 12 categories (the first two represent silence and unknown words): "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"
• ReLU and softmax (output layer) activation functions
• Dataset: 65,000 one-second utterances of 30 words [2]
• Input to the DNN is the flattened feature matrix
• Input signal (length L) is framed into overlapping frames (length l = 40 ms) with a stride (s = 40 ms), giving 25 frames as input/inference
• 10 frequency bins per frame → 25 × 10 = 250 inputs

Figure 1: KWS system consisting of a feature extractor and an NN-based classifier
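The framing arithmetic above can be checked with a short sketch (values from the poster; the standard frame-count formula is assumed):

```python
# Framing arithmetic for the KWS front end; all lengths in milliseconds.
L = 1000  # one-second input signal
l = 40    # frame length
s = 40    # stride

frames = (L - l) // s + 1  # frames that fit in the signal
inputs = frames * 10       # 10 frequency bins per frame

print(frames)  # 25
print(inputs)  # 250
```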
Digital Signal Processor (xDSP)
• Generic audio DSP → baseline for comparisons
• 24 bits for representing data (Q5.19)
• SIMD4 (96-bit memory interface)
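Q5.19 denotes the xDSP's 24-bit fixed-point format: 5 integer bits (sign included, under the convention that matches the 24-bit total) and 19 fractional bits. A minimal conversion sketch (hypothetical helpers, not part of the xDSP toolchain):

```python
FRAC_BITS = 19  # Q5.19: 5 integer bits + 19 fractional bits = 24 bits

def to_q5_19(x: float) -> int:
    """Quantize a float to 24-bit Q5.19 fixed point, with saturation."""
    scaled = round(x * (1 << FRAC_BITS))
    # Saturate to the representable 24-bit two's-complement range.
    lo, hi = -(1 << 23), (1 << 23) - 1
    return max(lo, min(hi, scaled))

def from_q5_19(q: int) -> float:
    """Convert a Q5.19 fixed-point value back to a float."""
    return q / (1 << FRAC_BITS)
```

Values outside roughly ±16 saturate, which is one reason narrow NN wordlengths need the scaling machinery described for the NNE.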
References
[1] https://github.com/ARM-software/ML-KWS-for-MCU
[2] https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
[3] Quantization algorithms. https://nervanasystems.github.io/distiller/algo_quantization.html
[4] Horowitz, M. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (Feb 2014), pp. 10–14.
Figure 2: System overview diagram for the NNE