TRANSCRIPT
Abstract
Backend: precision adaptability
Results
Motivation
Over 90% of Activation bits are ZERO
Throughput:
Energy efficiency:
Comparison with other accelerators:
Baseline
Straightforward implementation
Synchronized parallel units that do not skip any work:
Frontend: Skipping 0-Weights
Sparsified networks present ineffectual computation in the form of zero-valued weights. A data-parallel accelerator will not take advantage of this, taking 4 cycles in this example. Bit-Tactical borrows effectual weights temporally (from the future) and spatially (from different coordinates within the same filter), so it needs only 2 cycles to process the same example.
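The borrowing idea can be sketched as a toy cycle-count model. This is not the hardware scheduler: the single-multiplier-per-lane assumption, the greedy policy, and the function names are illustrative.

```python
# Toy cycle-count model of the front end: zero weights are known
# statically, so a lane can "borrow" an effectual weight from a small
# lookahead window instead of idling. Greedy policy is an assumption.

def naive_cycles(lane):
    """A data-parallel unit spends one cycle per weight, zero or not."""
    return len(lane)

def skipping_cycles(lane, lookahead=2):
    """Each cycle, the lane's multiplier consumes the earliest pending
    effectual weight within [base, base + lookahead]; zeros are free."""
    done = [w == 0 for w in lane]     # zero weights need no work
    base, cycles = 0, 0
    while base < len(lane):
        while base < len(lane) and done[base]:
            base += 1                 # slide past finished positions
        if base == len(lane):
            break
        for i in range(base, min(base + lookahead + 1, len(lane))):
            if not done[i]:
                done[i] = True        # borrow this effectual weight
                break
        cycles += 1
    return cycles
```

On the poster's example shape, a lane like `[1, 0, 2, 0]` costs 4 cycles naively but only 2 with skipping.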
Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How
Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos,
Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos
Previous accelerators exploit data sparsity in one dimension (zero values, precision variability, or zero bits, on either weights or activations). Combining zero-value weight sparsity with zero-bit activation sparsity provides the best performance potential for sparse networks, while still benefiting dense ones.
Most computation is ineffectual: over 31x potential speedup.
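A back-of-the-envelope way to see how ineffectual work compounds across zero weights and zero activation bits (a sketch with made-up operand values, not the poster's measured data):

```python
# A dense bit-parallel engine does `bits` bits of work per MAC; the
# effectual work is only the set activation bits paired with nonzero
# weights. The sample operand pairs below are illustrative.

def potential_speedup(pairs, bits=16):
    """pairs: list of (weight, activation) operands for one layer slice."""
    dense_work = len(pairs) * bits
    effectual_work = sum(bin(a).count("1") for w, a in pairs if w != 0)
    return dense_work / effectual_work

pairs = [(1, 0b0001), (0, 0xFFFF), (2, 0b0011)]
print(potential_speedup(pairs))   # 48 bits of dense work vs 3 terms -> 16.0
```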
Bit-Tactical can work with multiple backends to take advantage of this: TCLp uses a bit-serial, synchronized backend that skips zero bits at both ends of the value; TCLe uses a term-serial backend that processes only effectual terms.
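The two backends can be contrasted with simplified per-group cycle models, assuming 16-bit fixed-point values; these are cost sketches, not the exact pipeline behavior.

```python
# TCLp: all values in a synchronized group share one trimmed precision,
# skipping zero bits at both ends. TCLe: one cycle per effectual
# (nonzero) bit, with a synchronized group limited by its worst value
# (Booth-style term encoding is ignored here for simplicity).

def tclp_cycles(group):
    """Bit-serial: cycles = bits between highest and lowest set bit
    across the whole group."""
    mask = 0
    for a in group:
        mask |= a
    if mask == 0:
        return 0
    hi = mask.bit_length() - 1
    lo = (mask & -mask).bit_length() - 1   # index of lowest set bit
    return hi - lo + 1

def tcle_cycles(group):
    """Term-serial: cycles = max effectual-bit count in the group."""
    return max(bin(a).count("1") for a in group)
```

A value like `0x0101` shows the difference: its set bits span 9 positions (9 TCLp cycles) but only 2 of them are effectual (2 TCLe cycles).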
Bit-Tactical: Deep neural network inference accelerator
• Primarily targets CNNs, but can handle any layer type
• Exploits sparsity and effectual bit content
Over 10x faster than data-parallel accelerators
2x more energy efficient
32% area cost
Throughput (speedup over the baseline):

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  geomean
TCLp<1,6>        5.8         7.4          4.5           5.5           5.3        5.6
TCLp<2,5>        6.5         8.4          4.7           6.0           5.3        6.1
TCLp<2,5>+S      7.0         9.1          5.1           6.2           5.6        6.5
TCLp<4,3>        7.1         9.5          4.9           6.2           5.4        6.4
TCLe<1,6>        9.6        12.9          7.8           9.4           9.2        9.7
TCLe<2,5>       10.5        14.5          7.8          10.1           9.2       10.2
TCLe<2,5>+S     11.2        15.5          8.6          10.5           9.6       10.8
TCLe<4,3>       11.0        16.2          7.9          10.3           9.1       10.6
Comparison with other accelerators (speedup over the base design):

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  geomean
base             1.0         1.0          1.0           1.0           1.0        1.0
dense-SCNN       0.7         0.7          0.8           0.6           0.6        0.7
SCNN             2.3         3.1          1.8           1.6           1.0        1.8
DynSTR           3.6         4.2          3.1           3.4           4.4        3.7
TCLp<2,5>+S      7.0         9.1          5.1           6.2           5.6        6.5
PRA              6.1         7.3          5.5           6.2           7.9        6.6
TCLe<2,5>+S     11.2        15.5          8.6          10.5           9.6       10.8
Energy efficiency (relative to the baseline):

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  geomean
TCLp<1,6>        2.9         3.7          2.3           2.7           2.7        2.8
TCLp<2,5>        3.2         4.1          2.3           2.9           2.6        3.0
TCLp<2,5>+S      3.4         4.4          2.5           3.1           2.7        3.2
TCLp<4,3>        3.3         4.5          2.3           2.9           2.6        3.0
TCLe<1,6>        2.7         3.6          2.2           2.6           2.6        2.7
TCLe<2,5>        2.7         3.7          2.0           2.6           2.3        2.6
TCLe<2,5>+S      2.7         3.7          2.0           2.5           2.3        2.6
TCLe<4,3>        2.4         3.6          1.8           2.3           2.0        2.4
Distribution of activation values using a 16-bit fixed point representation.
[Figure: bit-serial multiplication example with a five-bit weight W4W3W2W1W0 shifted across activation bit positions.]
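The poster's claim that over 90% of activation bits are zero in a 16-bit fixed-point representation can be sanity-checked with a few lines of counting; the sample activations below are invented, stand-ins for the small, mostly-zero values typical of ReLU outputs.

```python
# Count set bits over a batch of 16-bit fixed-point activations and
# report the fraction of bits that are zero.

def zero_bit_fraction(acts, bits=16):
    total_bits = len(acts) * bits
    set_bits = sum(bin(a & ((1 << bits) - 1)).count("1") for a in acts)
    return 1.0 - set_bits / total_bits

sample = [0, 0, 3, 1, 0, 12, 0, 5]   # small, mostly-zero ReLU-like values
print(zero_bit_fraction(sample))     # 0.9453125
```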
Tile: general view
WSU (Weight Skipping Unit Slice)
ASU (Activation Selection Unit)
TCLe/TCLp tile enabling serial activation processing
Organization
Off-Chip bandwidth characterization
No-Crossbar design
Instead of using crossbars to shuffle data around (whether at the outputs, left, or at one of the inputs, center), Bit-Tactical (right) uses small, local muxes to select effectual term pairs.
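The design point can be illustrated in a few lines. This is a sketch of the wiring trade-off only: the function names and the mux reach are assumptions, not the hardware's actual parameters.

```python
# A full crossbar lets any lane read any activation (O(N^2) wiring);
# a small per-lane mux reads only a few nearby activation lanes, chosen
# by a select field that software fixes at scheduling time, since zero
# weights are known statically.

def crossbar_read(acts, src):
    """Full crossbar: any of N sources reachable from every lane."""
    return acts[src]

def local_mux_read(acts, lane, sel, reach=4):
    """Small mux: lane i reads one of `reach` nearby activations, so the
    select field needs only log2(reach) bits per weight."""
    assert 0 <= sel < reach
    return acts[(lane + sel) % len(acts)]
```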
Speedup across the extended benchmark set:

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  Mobilenet  Bi-LSTM  geomean
DaDianNao++      1.0         1.0          1.0           1.0           1.0         1.0       1.0      1.0
SCNN             2.3         3.1          1.8           1.6           1.0         4.2       2.6      2.2
Dstripes         3.6         4.2          3.1           3.4           4.4         2.0       5.5      3.6
Pragmatic        6.1         7.3          5.5           6.2           7.9         5.8       7.1      6.5
TCLp<2,5>        7.0         9.1          5.1           6.2           5.6         4.6      11.3      6.7
TCLe<2,5>       11.2        15.5          8.6          10.5           9.6        10.9      14.1     11.3
[Figure (metric label not recoverable): TCLp<2,5> and TCLe<2,5> across the extended benchmark set.]

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  Mobilenet  Bi-LSTM  geomean
TCLp<2,5>        2.3         3.0          1.7           2.1           1.8         1.5       3.7      2.2
TCLe<2,5>        2.0         2.8          1.6           1.9           1.8         2.0       2.6      2.1
Throughput across the extended benchmark set (speedup over the baseline):

              AlexNet-ss  AlexNet-es  GoogLeNet-ss  GoogLeNet-es  ResNet50-ss  Mobilenet  Bi-LSTM  geomean
TCLp<1,6>        5.8         7.4          4.5           5.5           5.3         3.3       9.2      5.6
TCLp<2,5>        7.0         9.1          5.1           6.2           5.6         4.6      11.3      6.7
TCLp<4,3>        7.1         9.5          4.9           6.2           5.4         3.1       9.3      6.1
TCLe<1,6>        9.6        12.9          7.8           9.4           9.2         8.4      11.7      9.7
TCLe<2,5>       11.2        15.5          8.6          10.5           9.6        10.9      14.1     11.3
TCLe<4,3>       11.0        16.2          7.9          10.3           9.1         7.1      11.7     10.1
Potential speedup from each source of ineffectual computation (A: zero-valued activations; W: zero-valued weights; Ap: activation precision; Ae: effectual activation bits):

        AlexNet-ES  AlexNet-SS  GoogleNet-ES  GoogleNet-SS  ResNet50-SS  Mobilenet  Bi-LSTM  Geomean
A          1.6x        1.5x        1.8x          1.9x          2.5x        1.8x      1.6x     1.8x
W          6.7x        4.3x        4.4x          2.5x          1.7x        2.2x      3.7x     3.3x
W+A       11.6x        5.1x        7.8x          4.6x          3.9x        3.3x      5.6x     5.5x
Ap         3.6x        3.2x        3.9x          3.4x          9.0x        2.7x      2.5x     3.7x
Ae         8.0x        7.1x        8.8x          8.2x         17.9x       6.5x      6.1x     8.4x
W+Ap      26.2x       10.6x       16.6x          8.5x         14.0x       5.0x      9.2x    11.4x
W+Ae      58.4x       23.7x       37.2x         20.7x         28.0x      12.0x     22.5x    26.0x