TRANSCRIPT
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Beidi Chen
Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*
* Rice University, †Intel
MLSys 2020
Our SLIDE system (C++ from scratch) on a 44-core CPU beats TF on a V100 (1 hour vs. 3.5 hours) on networks with 100+ million parameters. TF on the same CPU takes 16 hours with all HPC optimizations (Intel MKL-DNN).
3.5x faster on CPU than TF on V100 (log scale in time)
The Age of Large Networks
• More Data
• Large Models
• Tons of Engineering
• Backpropagation (a.k.a. simple gradient descent)
Giant Matrix Multiplication for every data point in each epoch (Forward + Backward)
Fully Connected NN
[Figure: fully connected network with Input, Hidden 1, Hidden 2, and Output layers; every layer computes $f(W^{\top}x)$ over all of its neurons.]
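To make that cost concrete, here is a minimal dense-layer sketch (illustrative code with made-up names, not SLIDE code): every output neuron multiplies against every input, so one sample costs O(n_in * n_out) per layer, and the backward pass touches the same weights again.

// Illustrative sketch of a dense layer's forward pass (not SLIDE code).
// Every output neuron reads every input: O(n_in * n_out) work per sample,
// and the backward pass touches all of the same weights again.
#include <algorithm>
#include <vector>

std::vector<float> dense_forward(const std::vector<float>& x,               // n_in inputs
                                 const std::vector<std::vector<float>>& W,  // n_out x n_in
                                 const std::vector<float>& b) {             // n_out biases
    std::vector<float> y(W.size());
    for (size_t j = 0; j < W.size(); ++j) {             // every output neuron...
        float z = b[j];
        for (size_t i = 0; i < x.size(); ++i)            // ...multiplies against every input
            z += W[j][i] * x[i];
        y[j] = std::max(0.0f, z);                        // activation f(Wx), here ReLU
    }
    return y;
}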
Challenges
Do we really need all the computations? No!! Good news: only the high activations are important.
• Sampling a few neurons in proportion to their activations is enough (adaptive dropout)
(Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015)
• ReLU filters out negative activations (50% sparsity by design)
• Softmax
Bad news: we need to compute all the activations in order to identify (or sample) the high-activation neurons. NO SAVINGS.
The Fundamental Sampling Puzzle
Given N fixed sampling weights $w_1, w_2, \ldots, w_N$.
• Task: sample $x_i$ with probability $w_i$.
• Cost of 1 sample: $O(N)$. Cost of K samples: $O(N)$.

Given N time-varying sampling weights (activations) $w_1^t, w_2^t, \ldots, w_N^t$.
• Task: at time t, sample $x_i$ with probability $w_i^t$.
• Cost of sampling: $O(N)$ at every time t.
• Last few years of work in Locality Sensitive Hashing: if $w_i^t = f(\mathrm{sim}(\theta_t, x_i))$ for a specific set of f and sim, then $O(1)$ per sample after an initial preprocessing cost of $O(N)$.
Textbook Hashing (Dictionary)
Hashing: a function h that maps a given data point $x \in \mathbb{R}^D$ to an integer key, $h: \mathbb{R}^D \mapsto \{0, 1, 2, \ldots, N\}$. $h(x)$ serves as a discrete fingerprint.

Property (ideal hash functions):
• If $x = y$, then $h(x) = h(y)$.
• If $x \neq y$, then $h(x) \neq h(y)$.
Probabilistic Fingerprinting (Hashing) (late 90s)
Hashing: a randomized function h that maps a given data point $x \in \mathbb{R}^D$ to an integer key, $h: \mathbb{R}^D \mapsto \{0, 1, 2, \ldots, N\}$. $h(x)$ serves as a discrete fingerprint.

Locality sensitive property:
• If $\mathrm{Sim}(x, y)$ is high, then $\Pr(h(x) = h(y))$ is high (a collision is likely).
• If $\mathrm{Sim}(x, y)$ is low, then $\Pr(h(x) = h(y))$ is low (a collision is unlikely).
Example 1: Signed Random Projection (SRP)
$\Pr(h(x) = h(y)) = 1 - \frac{1}{\pi}\cos^{-1}(\theta)$, which is monotonic in $\theta$.

[Figure: two random hyperplanes H1 and H2 split the space into + and − half-spaces; points X and Y collide when they fall on the same side of every hyperplane.]

A classical result from Goemans-Williamson (1995).
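For concreteness, a minimal SRP sketch (illustrative code with made-up names, not taken from the SLIDE implementation): each bit is the sign of a projection onto a random Gaussian direction, and K bits are packed into the integer fingerprint.

// Illustrative SRP sketch: K random hyperplanes give a K-bit integer fingerprint.
// Per bit, Pr[h(x) = h(y)] = 1 - arccos(cos_sim(x, y)) / pi, monotone in similarity.
#include <cstdint>
#include <random>
#include <vector>

struct SRPHash {
    std::vector<std::vector<float>> planes;   // K random Gaussian directions of dimension d
    SRPHash(int K, int d, unsigned seed = 42) : planes(K, std::vector<float>(d)) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> g(0.f, 1.f);
        for (auto& p : planes)
            for (auto& v : p) v = g(gen);
    }
    uint32_t operator()(const std::vector<float>& x) const {
        uint32_t code = 0;
        for (const auto& p : planes) {
            float dot = 0.f;
            for (size_t i = 0; i < x.size(); ++i) dot += p[i] * x[i];
            code = (code << 1) | (dot >= 0.f ? 1u : 0u);   // sign bit of the projection
        }
        return code;
    }
};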
Example 2: (Densified) Winner Take All
[Figure: example vectors with their WTA hash codes (Yagnik, ICCV 2011) and DWTA hash codes (Chen and Shrivastava, UAI 2018), with K = 3.]
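A rough sketch of the plain WTA idea (illustrative code only; DWTA adds a densification step for sparse vectors that is not shown): permute the coordinates randomly and record which of the first K permuted entries is the largest.

// Illustrative WTA hash sketch (Yagnik, ICCV 2011): one hash value is the index of
// the maximum among the first K coordinates of a randomly permuted copy of the vector.
// DWTA (Chen and Shrivastava, UAI 2018) additionally densifies this for sparse inputs.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

struct WTAHash {
    std::vector<int> perm;   // one random permutation of the coordinate indices
    int K;
    WTAHash(int dim, int k, unsigned seed = 7) : perm(dim), K(k) {
        std::iota(perm.begin(), perm.end(), 0);
        std::mt19937 gen(seed);
        std::shuffle(perm.begin(), perm.end(), gen);
    }
    int operator()(const std::vector<float>& x) const {
        int best = 0;
        for (int k = 1; k < K; ++k)
            if (x[perm[k]] > x[perm[best]]) best = k;
        return best;   // value in [0, K); codes from several permutations are concatenated
    }
};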
Probabilistic Hash Tables
Given: $f$ is monotonic and $\Pr[h(x) = h(y)] = f(\mathrm{sim}(x, y))$.
• Given a query q, if $h_1(q) = 11$ and $h_2(q) = 01$, then probe the bucket with index 1101. It is a good bucket!
• (Locality sensitive) $h_i(q) = h_i(x)$ is a noisy indicator of high similarity.
• Doing better than random!
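A minimal sketch of such tables (illustrative names; SRPHash refers to the sketch above): each table concatenates K hash bits into a bucket index, and a query retrieves the union of its buckets across the L tables.

// Illustrative sketch of the tables: L tables, each keyed by an integer fingerprint
// (e.g., K concatenated SRP bits). Insert every item id under its fingerprint; at query
// time, take the union of the buckets the query lands in -- cheap and better than random.
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using HashFn = std::function<uint32_t(const std::vector<float>&)>;   // e.g., SRPHash above

struct LSHTables {
    std::vector<HashFn> fns;                                          // one fingerprint per table
    std::vector<std::unordered_map<uint32_t, std::vector<int>>> tables;
    explicit LSHTables(std::vector<HashFn> f) : fns(std::move(f)), tables(fns.size()) {}

    void insert(int id, const std::vector<float>& v) {
        for (size_t t = 0; t < fns.size(); ++t)
            tables[t][fns[t](v)].push_back(id);
    }
    std::unordered_set<int> query(const std::vector<float>& q) const {
        std::unordered_set<int> candidates;
        for (size_t t = 0; t < fns.size(); ++t) {
            auto it = tables[t].find(fns[t](q));
            if (it != tables[t].end())
                candidates.insert(it->second.begin(), it->second.end());
        }
        return candidates;
    }
};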
LSH for Search (Known)
Theory
• Super-linear $O(N^{1+\rho})$ memory.
• Sub-linear query time, $O(N^{\rho})$.
• $\rho < 1$, but generally large (close to 1) and often hard to determine.

Practical issues
• Needs a lot of hash tables and distance computations for good accuracy on near neighbors.
• Buckets can be quite heavy due to poor randomness or unfavorable data distributions.
New View: Data Structures for Efficient Sampling!
Is LSH really a search algorithm?
• Given the query $\theta_t$, LSH samples $x_i$ from the dataset with probability $w_i^t = 1 - (1 - p(x_i, \theta_t)^K)^L$.
• $w_i^t$ grows with $p(x_i, \theta_t)^K$, and hence with the similarity between $x_i$ and $\theta_t$.
• LSH is usually treated as a black box for nearest-neighbor search. It is not just that!!
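The exact form of $w_i^t$ follows from a one-line argument, assuming K concatenated hash bits per table and L independently built tables:

$\Pr[\text{$x_i$ and $\theta_t$ collide in one table}] = p(x_i, \theta_t)^K$
$w_i^t = \Pr[\text{collision in at least one of the $L$ tables}] = 1 - \bigl(1 - p(x_i, \theta_t)^K\bigr)^L$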
LSH as Samplers
We can pre-process the dataset D such that:
• Given any query q, we can sample $x \in D$ with probability $\mathrm{Const} \times (1 - (1 - p(q, x)^K)^L)$ using KL hash computations and L bucket probes.
• Even K = 1, L = 1 is adaptive. So it is O(1)-time adaptive.
• Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y).

We can exactly compute the sampling probability:
• Const = (number of elements sampled) / (number of elements in the probed buckets).

(Chen et al., NeurIPS 2019)

Sufficient for importance sampling estimation. Sampling cost: O(1).
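A rough sketch of this sampler view (illustrative code; it builds on the LSHTables sketch above): probe the buckets for the query, keep a uniform subsample within a budget, and record the constant that turns retrieval into an exact sampling probability.

// Illustrative sampler sketch: probe the buckets for the query (LSHTables::query above),
// keep a uniform subsample within a budget, and report
// Const = (#kept) / (#elements found in the buckets), so each kept item's overall
// sampling probability is Const * (1 - (1 - p(q, x)^K)^L).
#include <algorithm>
#include <random>
#include <vector>

struct AdaptiveSample {
    std::vector<int> ids;   // sampled element ids
    double constant;        // #kept / #elements found in the probed buckets
};

AdaptiveSample sample_from_buckets(std::vector<int> bucket_union,   // ids returned by query()
                                   size_t budget, std::mt19937& gen) {
    std::shuffle(bucket_union.begin(), bucket_union.end(), gen);
    const size_t keep = std::min(budget, bucket_union.size());
    AdaptiveSample out;
    out.ids.assign(bucket_union.begin(), bucket_union.begin() + keep);
    out.constant = bucket_union.empty() ? 0.0 : double(keep) / double(bucket_union.size());
    return out;
}

Kept items can then be reweighted by the reciprocal of that probability, which is what makes them usable for unbiased importance-sampling estimates.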
SLIDE: Sub-LInear Deep learning Engine
Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization).
Subtlety: the vectors stored in the hash tables are neurons (weight vectors), not data vectors; we are reorganizing the neurons.

[Figure: network with Input, Hidden 1, Hidden 2, and Output layers; hash tables H1 (buckets 1|1, 2|2,4, 3|3) and H2 (buckets 1|3, 2|1,4, 3|2) map bucket ids to neuron ids.]
SLIDE: Sub-LInear Deep learning Engine
Step 2 – Hash the input to any given layer using its randomized hash function.
[Figure: same network and hash tables as in Step 1; the layer input is hashed with each table's hash function.]
SLIDE: Sub-LInear Deep learning Engine
Step 3 – Query the hidden layer's hash table(s) with the integer fingerprint to retrieve the active set. Neurons are sampled in proportion to their activations.

[Figure: same network and hash tables as before; the buckets matching the input's fingerprint yield the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 4 – Perform forward and backward propagation only on the nodes in the active set. The computation is of the same order as the number of active neurons.

[Figure: same network and hash tables as before; only the active neurons participate in the forward and backward passes.]
SLIDE: Sub-LInear Deep learning Engine
Step 5 – Update the hash tables by rehashing the updated node weights. The computation is of the same order as the number of active neurons. (A sketch combining Steps 1–5 follows below.)

[Figure: same network and hash tables as before; updated neurons are rehashed into the tables.]
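Putting Steps 1–5 together, a highly simplified single-layer sketch (illustrative code, not the actual SLIDE engine; SparseLayer, Neuron, and train_step are made-up names, and it reuses the LSHTables/HashFn types from the earlier sketch):

// Illustrative single-layer sketch of the SLIDE loop (not the actual engine code).
// Step 1: hash every neuron's weight vector into the tables (at initialization).
// Step 2: hash the layer input.  Step 3: probe buckets to get the active neuron set.
// Step 4: forward/backward only over active neurons.  Step 5: rehash updated neurons.
#include <unordered_set>
#include <vector>

struct Neuron { std::vector<float> w; float b = 0.f; };

struct SparseLayer {
    std::vector<Neuron> neurons;
    LSHTables tables;                            // from the LSHTables sketch above

    SparseLayer(std::vector<Neuron> n, std::vector<HashFn> f)
        : neurons(std::move(n)), tables(std::move(f)) { build_tables(); }

    void build_tables() {                        // Step 1 (init) and Step 5 (rebuild)
        tables = LSHTables(tables.fns);          // reset buckets, keep the hash functions
        for (int j = 0; j < (int)neurons.size(); ++j)
            tables.insert(j, neurons[j].w);      // hash each neuron's weight vector
    }

    void train_step(const std::vector<float>& x, float lr,
                    const std::vector<float>& grad_out /* dL/dz per output neuron */) {
        std::unordered_set<int> active = tables.query(x);   // Step 2 (hash input) + Step 3 (probe)
        for (int j : active) {                               // Step 4: only the active neurons
            float z = neurons[j].b;
            for (size_t i = 0; i < x.size(); ++i) z += neurons[j].w[i] * x[i];
            float gz = (z > 0.f) ? grad_out[j] : 0.f;        // ReLU gradient
            for (size_t i = 0; i < x.size(); ++i) neurons[j].w[i] -= lr * gz * x[i];
            neurons[j].b -= lr * gz;
        }
        // Step 5: rehash updated neurons via build_tables(); in practice this
        // can be amortized (rebuilt periodically rather than after every update).
    }
};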
We can go very sparse if Adaptive
• 2 Hidden Layers
• 1000 Nodes Per Layer
• Reduce both training and inference cost by 95%!
• Significantly more for larger networks.
(The wider the better)
Sparsity + Randomness → Asynchronous Updates
• 3 Hidden Layers
• 1000 Nodes Per Layer
Less Computation + Asynchronous Parallelism
• Each update is computationally very small (100x+ reduction in computation and energy)
• Updates are near-independent, very low chance of conflict. Hence, parallel SGD!
SLIDE: Sub-LInear Deep learning Engine
[Figure: SLIDE data structures. For each layer, every neuron keeps per-input arrays sized by the batch size: active-input flags, activations, and accumulated gradients, plus its weights (sized by the previous layer). The network also keeps L hash tables (HashTable1 ... HashTableL); each table maps a fingerprint (h_1 ... h_K) to a bucket of neuron ids, and some buckets may be empty.]
(Extreme sparsity and randomness in gradient updates)
Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011)
Parallelism with OpenMP
Parallel across training samples in a batch.

[Figure: per-neuron arrays indexed by batch position: active inputs, activation for each input, accumulated gradients, and weights.]
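A minimal sketch of this batch-level parallelism (illustrative; it assumes the SparseLayer sketch above): each thread processes a different sample and writes its tiny gradient update without locks, in the HOGWILD! spirit, because the active sets of different samples rarely overlap.

// Illustrative OpenMP sketch (compile with -fopenmp): samples in a batch are processed
// by different threads; each touches only its own small active set, so lock-free
// (HOGWILD!-style) updates rarely conflict.
#include <vector>

void train_batch(SparseLayer& layer,
                 const std::vector<std::vector<float>>& batch_x,
                 const std::vector<std::vector<float>>& batch_grad_out,
                 float lr) {
    #pragma omp parallel for
    for (int s = 0; s < (int)batch_x.size(); ++s) {
        // No locks: each update is tiny and touches a near-disjoint set of neurons.
        layer.train_step(batch_x[s], lr, batch_grad_out[s]);
    }
    layer.build_tables();   // Step 5: rehash updated neurons outside the parallel region
}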
Flexible choices of Hash Functions
SLIDE supports four different LSH hash functions:
• SimHash (cosine similarity)
• Winner-Take-All hashing (order)
• Densified Winner-Take-All hashing (for sparse data)
• MinHash (Jaccard similarity; a sketch follows below)
Easily add more!
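As one example of such a family, a minimal MinHash sketch for sparse binary inputs (illustrative only; it uses a cheap multiply-add in place of a true random permutation):

// Illustrative MinHash sketch for sparse binary vectors (Jaccard similarity):
// the hash value is the smallest "permuted" id among the non-zero coordinates.
// A random odd multiply-add stands in for a random permutation of the universe.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct MinHash {
    uint64_t a, b;
    MinHash(uint64_t a_, uint64_t b_) : a(a_ | 1), b(b_) {}
    uint64_t operator()(const std::vector<uint32_t>& nonzero_indices) const {
        uint64_t best = std::numeric_limits<uint64_t>::max();
        for (uint32_t i : nonzero_indices)
            best = std::min(best, a * uint64_t(i) + b);   // min over "permuted" ids
        return best;
    }
};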
Design Choices for Speed
• Vanilla sub-sampling: choose sub-samples uniformly.
• Top-K sub-sampling: rank the retrieved candidates and choose the top k.
• Hard-thresholding sub-sampling: choose candidates that occur more than a threshold number of times (a sketch of all three follows below).
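A rough sketch of the three selection rules over the retrieved candidates (illustrative names; counts maps a candidate neuron id to how many of the L probed buckets it appeared in):

// Illustrative selection rules over retrieved candidates.
// counts[id] = number of probed buckets (out of L) in which neuron `id` appeared.
#include <algorithm>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

// Vanilla: keep a uniform subsample of the candidates.
std::vector<int> vanilla(std::vector<int> ids, size_t budget, std::mt19937& gen) {
    std::shuffle(ids.begin(), ids.end(), gen);
    ids.resize(std::min(budget, ids.size()));
    return ids;
}

// Top-K: rank candidates by how often they were retrieved and keep the top k.
std::vector<int> top_k(const std::unordered_map<int, int>& counts, size_t k) {
    std::vector<std::pair<int, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<int> ids;
    for (size_t i = 0; i < std::min(k, ranked.size()); ++i) ids.push_back(ranked[i].first);
    return ids;
}

// Hard thresholding: keep candidates that appeared in more than `threshold` buckets.
std::vector<int> hard_threshold(const std::unordered_map<int, int>& counts, int threshold) {
    std::vector<int> ids;
    for (const auto& [id, c] : counts)
        if (c > threshold) ids.push_back(id);
    return ids;
}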
Micro-Architecture Optimization
• Cache optimization
• Transparent hugepages
• Vector processing
• Software pipelining and prefetching
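Two of these knobs can be illustrated with standard Linux/GCC facilities (a sketch of the general idea under those assumptions, not the engine's actual code): transparent hugepages via madvise, and software prefetching via __builtin_prefetch.

// Illustrative sketch of two of the optimizations on a commodity Linux/GCC toolchain.
#include <cstddef>
#include <sys/mman.h>

// Hint the kernel to back a large, hot allocation with transparent hugepages,
// reducing TLB misses for the big weight and hash-table arrays (addr must be page-aligned).
void hint_hugepages(void* addr, size_t bytes) {
    madvise(addr, bytes, MADV_HUGEPAGE);   // Linux-specific advice flag
}

// Software prefetching: pull the next active neuron's weight row into cache while
// the current one is being processed, hiding part of the random-access latency.
float dot_with_prefetch(const float* const* rows, const int* active, int n_active,
                        const float* x, int dim, int j) {
    if (j + 1 < n_active)
        __builtin_prefetch(rows[active[j + 1]]);   // GCC/Clang builtin
    const float* w = rows[active[j]];
    float z = 0.f;
    for (int i = 0; i < dim; ++i) z += w[i] * x[i];
    return z;
}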
Looks Good on Paper. Does it change anything?
Baseline: state-of-the-art optimized implementations
• TF on Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)
• TF on NVIDIA Tesla V100 (32GB)

vs.

SLIDE on the same Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)
Datasets
Performance
Performance compared to sampled softmax
Performance @ Different Batchsizes
Asynchronous Parallelism gets best scalability
Inefficiency Diagnosis
[Figure: inefficiency ratio in CPU usage (front-end bound, memory bound, retiring, core bound) at 8, 16, and 32 threads, for TensorFlow-CPU and for SLIDE; the y-axis ranges from 0 to 0.6.]
Impact of HugePages
Conclusion: From Matrix Multiplication to (few) Hash Lookups
Standard
• Operation: matrix multiply
• Pros: hardware support
• Cons: expensive O(N^3); can only scale with hardware; energy

SLIDE
• Operations: compute random hashes of the data; hash lookups, sample, and update (decades of work in databases); very few multiplications (100x+ reduction)
• Pros: energy (IoT), latency; asynchronous parallel gradient updates; simple hash tables; larger network → more savings
• Cons: random memory access (but parallel SGD)
Future Work
• Distributed SLIDE
• SLIDE on more complex architectures like CNN/RNN
References
[1] Beidi Chen, Tharun Medini, Anshumali Shrivastava. "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems". Proceedings of the 3rd MLSys Conference (2020).
[2] Ryan Spring, Anshumali Shrivastava. "Scalable and Sustainable Deep Learning via Randomized Hashing". Proceedings of the 23rd ACM SIGKDD (2017).
[3] Alireza Makhzani, Brendan J. Frey. "Winner-Take-All Autoencoders". Advances in Neural Information Processing Systems (2015).
[4] Beidi Chen, Anshumali Shrivastava. "Densified Winner Take All (WTA) Hashing for Sparse Datasets". Uncertainty in Artificial Intelligence (2018).
[5] Beidi Chen, Yingchen Xu, Anshumali Shrivastava. "LGD: Fast and Accurate Stochastic Gradient Estimation". NeurIPS (2019), Vancouver.
[6] Benjamin Recht, Christopher Ré, Stephen Wright, Feng Niu. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent". Advances in Neural Information Processing Systems (2011).
Thanks!!! Welcome to stop by Poster #7.