TRANSCRIPT
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Beidi Chen
Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*
* Rice University, †Intel
MLSys 2020
Our SLIDE system (C++ from scratch) on a 44-core CPU beats TF on a V100 (1 hour vs. 3.5 hours) on networks with 100+ million parameters. TF on the same CPU takes 16 hours with all HPC optimizations (Intel MKL-DNN).
3.5x faster on CPU than TF on V100 (log scale in time)
The Age of Large Networks
• More Data
• Large Models
• Tons of Engineering
• Backpropagation (a.k.a. simple gradient descent)
Giant Matrix Multiplication for every data point in each epoch (Forward + Backward)
Fully Connected NN
[Figure: fully connected network with Input, Hidden 1, Hidden 2, and Output layers; every layer computes $f(W^{\top}x)$ over all of its neurons.]
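To make that cost concrete, here is a minimal dense-layer sketch (illustrative code with made-up names, not SLIDE code): every output neuron multiplies against every input, so one sample costs O(n_in * n_out) per layer, and the backward pass touches the same weights again.

// Illustrative sketch of a dense layer's forward pass (not SLIDE code).
// Every output neuron reads every input: O(n_in * n_out) work per sample,
// and the backward pass touches all of the same weights again.
#include <algorithm>
#include <vector>

std::vector<float> dense_forward(const std::vector<float>& x,               // n_in inputs
                                 const std::vector<std::vector<float>>& W,  // n_out x n_in
                                 const std::vector<float>& b) {             // n_out biases
    std::vector<float> y(W.size());
    for (size_t j = 0; j < W.size(); ++j) {             // every output neuron...
        float z = b[j];
        for (size_t i = 0; i < x.size(); ++i)            // ...multiplies against every input
            z += W[j][i] * x[i];
        y[j] = std::max(0.0f, z);                        // activation f(Wx), here ReLU
    }
    return y;
}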
Challenges
Do we really need all the computations? No!! Good news: only the high activations are important.
• Sampling a few neurons in proportion to their activations is enough (adaptive dropout)
(Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015)
• ReLU filters out negative activations (50% sparsity by design)
• Softmax
Bad news: we need to compute all the activations in order to identify (or sample) the high-activation neurons. NO SAVINGS.
The Fundamental Sampling Puzzle
Given N fixed sampling weights $w_1, w_2, \ldots, w_N$.
• Task: sample $x_i$ with probability $w_i$.
• Cost of 1 sample: $O(N)$. Cost of K samples: $O(N)$.

Given N time-varying sampling weights (activations) $w_1^t, w_2^t, \ldots, w_N^t$.
• Task: at time t, sample $x_i$ with probability $w_i^t$.
• Cost of sampling: $O(N)$ at every time t.
• Last few years of work in Locality Sensitive Hashing: if $w_i^t = f(\mathrm{sim}(\theta_t, x_i))$ for a specific set of f and sim, then $O(1)$ per sample after an initial preprocessing cost of $O(N)$.
Textbook Hashing (Dictionary)
Hashing: a function h that maps a given data point $x \in \mathbb{R}^D$ to an integer key, $h: \mathbb{R}^D \mapsto \{0, 1, 2, \ldots, N\}$. $h(x)$ serves as a discrete fingerprint.

Property (ideal hash functions):
• If $x = y$, then $h(x) = h(y)$.
• If $x \neq y$, then $h(x) \neq h(y)$.
Probabilistic Fingerprinting (Hashing) (late 90s)
Hashing: a randomized function h that maps a given data point $x \in \mathbb{R}^D$ to an integer key, $h: \mathbb{R}^D \mapsto \{0, 1, 2, \ldots, N\}$. $h(x)$ serves as a discrete fingerprint.

Locality sensitive property:
• If $\mathrm{Sim}(x, y)$ is high, then $\Pr(h(x) = h(y))$ is high (a collision is likely).
• If $\mathrm{Sim}(x, y)$ is low, then $\Pr(h(x) = h(y))$ is low (a collision is unlikely).
Example 1: Signed Random Projection (SRP)
$\Pr(h(x) = h(y)) = 1 - \frac{1}{\pi}\cos^{-1}(\theta)$, which is monotonic in $\theta$.

[Figure: two random hyperplanes H1 and H2 split the space into + and − half-spaces; points X and Y collide when they fall on the same side of every hyperplane.]

A classical result from Goemans-Williamson (1995).
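For concreteness, a minimal SRP sketch (illustrative code with made-up names, not taken from the SLIDE implementation): each bit is the sign of a projection onto a random Gaussian direction, and K bits are packed into the integer fingerprint.

// Illustrative SRP sketch: K random hyperplanes give a K-bit integer fingerprint.
// Per bit, Pr[h(x) = h(y)] = 1 - arccos(cos_sim(x, y)) / pi, monotone in similarity.
#include <cstdint>
#include <random>
#include <vector>

struct SRPHash {
    std::vector<std::vector<float>> planes;   // K random Gaussian directions of dimension d
    SRPHash(int K, int d, unsigned seed = 42) : planes(K, std::vector<float>(d)) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> g(0.f, 1.f);
        for (auto& p : planes)
            for (auto& v : p) v = g(gen);
    }
    uint32_t operator()(const std::vector<float>& x) const {
        uint32_t code = 0;
        for (const auto& p : planes) {
            float dot = 0.f;
            for (size_t i = 0; i < x.size(); ++i) dot += p[i] * x[i];
            code = (code << 1) | (dot >= 0.f ? 1u : 0u);   // sign bit of the projection
        }
        return code;
    }
};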
Example 2: (Densified) Winner Take All
[Figure: example vectors with their WTA hash codes (Yagnik, ICCV 2011) and DWTA hash codes (Chen and Shrivastava, UAI 2018), with K = 3.]
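A rough sketch of the plain WTA idea (illustrative code only; DWTA adds a densification step for sparse vectors that is not shown): permute the coordinates randomly and record which of the first K permuted entries is the largest.

// Illustrative WTA hash sketch (Yagnik, ICCV 2011): one hash value is the index of
// the maximum among the first K coordinates of a randomly permuted copy of the vector.
// DWTA (Chen and Shrivastava, UAI 2018) additionally densifies this for sparse inputs.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

struct WTAHash {
    std::vector<int> perm;   // one random permutation of the coordinate indices
    int K;
    WTAHash(int dim, int k, unsigned seed = 7) : perm(dim), K(k) {
        std::iota(perm.begin(), perm.end(), 0);
        std::mt19937 gen(seed);
        std::shuffle(perm.begin(), perm.end(), gen);
    }
    int operator()(const std::vector<float>& x) const {
        int best = 0;
        for (int k = 1; k < K; ++k)
            if (x[perm[k]] > x[perm[best]]) best = k;
        return best;   // value in [0, K); codes from several permutations are concatenated
    }
};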
Probabilistic Hash Tables
Given: $f$ is monotonic and $\Pr[h(x) = h(y)] = f(\mathrm{sim}(x, y))$.
• Given a query q, if $h_1(q) = 11$ and $h_2(q) = 01$, then probe the bucket with index 1101. It is a good bucket!
• (Locality sensitive) $h_i(q) = h_i(x)$ is a noisy indicator of high similarity.
• Doing better than random!
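A minimal sketch of such tables (illustrative names; SRPHash refers to the sketch above): each table concatenates K hash bits into a bucket index, and a query retrieves the union of its buckets across the L tables.

// Illustrative sketch of the tables: L tables, each keyed by an integer fingerprint
// (e.g., K concatenated SRP bits). Insert every item id under its fingerprint; at query
// time, take the union of the buckets the query lands in -- cheap and better than random.
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using HashFn = std::function<uint32_t(const std::vector<float>&)>;   // e.g., SRPHash above

struct LSHTables {
    std::vector<HashFn> fns;                                          // one fingerprint per table
    std::vector<std::unordered_map<uint32_t, std::vector<int>>> tables;
    explicit LSHTables(std::vector<HashFn> f) : fns(std::move(f)), tables(fns.size()) {}

    void insert(int id, const std::vector<float>& v) {
        for (size_t t = 0; t < fns.size(); ++t)
            tables[t][fns[t](v)].push_back(id);
    }
    std::unordered_set<int> query(const std::vector<float>& q) const {
        std::unordered_set<int> candidates;
        for (size_t t = 0; t < fns.size(); ++t) {
            auto it = tables[t].find(fns[t](q));
            if (it != tables[t].end())
                candidates.insert(it->second.begin(), it->second.end());
        }
        return candidates;
    }
};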
LSH for Search (Known)
Theory
• Super-linear $O(N^{1+\rho})$ memory.
• Sub-linear query time, $O(N^{\rho})$.
• $\rho < 1$, but generally large (close to 1) and often hard to determine.

Practical issues
• Needs a lot of hash tables and distance computations for good accuracy on near neighbors.
• Buckets can be quite heavy due to poor randomness or unfavorable data distributions.
New View: Data Structures for Efficient Sampling!
Is LSH really a search algorithm?
• Given the query $\theta_t$, LSH samples $x_i$ from the dataset with probability $w_i^t = 1 - (1 - p(x_i, \theta_t)^K)^L$.
• $w_i^t$ grows with $p(x_i, \theta_t)^K$, and hence with the similarity between $x_i$ and $\theta_t$.
• LSH is usually treated as a black box for nearest-neighbor search. It is not just that!!
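The exact form of $w_i^t$ follows from a one-line argument, assuming K concatenated hash bits per table and L independently built tables:

$\Pr[\text{$x_i$ and $\theta_t$ collide in one table}] = p(x_i, \theta_t)^K$
$w_i^t = \Pr[\text{collision in at least one of the $L$ tables}] = 1 - \bigl(1 - p(x_i, \theta_t)^K\bigr)^L$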
LSH as Samplers
We can pre-process the dataset D such that:
• Given any query q, we can sample $x \in D$ with probability $\mathrm{Const} \times (1 - (1 - p(q, x)^K)^L)$ using KL hash computations and L bucket probes.
• Even K = 1, L = 1 is adaptive. So it is O(1)-time adaptive.
• Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y).

We can exactly compute the sampling probability:
• Const = (number of elements sampled) / (number of elements in the probed buckets).

(Chen et al., NeurIPS 2019)

Sufficient for importance sampling estimation. Sampling cost: O(1).
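A rough sketch of this sampler view (illustrative code; it builds on the LSHTables sketch above): probe the buckets for the query, keep a uniform subsample within a budget, and record the constant that turns retrieval into an exact sampling probability.

// Illustrative sampler sketch: probe the buckets for the query (LSHTables::query above),
// keep a uniform subsample within a budget, and report
// Const = (#kept) / (#elements found in the buckets), so each kept item's overall
// sampling probability is Const * (1 - (1 - p(q, x)^K)^L).
#include <algorithm>
#include <random>
#include <vector>

struct AdaptiveSample {
    std::vector<int> ids;   // sampled element ids
    double constant;        // #kept / #elements found in the probed buckets
};

AdaptiveSample sample_from_buckets(std::vector<int> bucket_union,   // ids returned by query()
                                   size_t budget, std::mt19937& gen) {
    std::shuffle(bucket_union.begin(), bucket_union.end(), gen);
    const size_t keep = std::min(budget, bucket_union.size());
    AdaptiveSample out;
    out.ids.assign(bucket_union.begin(), bucket_union.begin() + keep);
    out.constant = bucket_union.empty() ? 0.0 : double(keep) / double(bucket_union.size());
    return out;
}

Kept items can then be reweighted by the reciprocal of that probability, which is what makes them usable for unbiased importance-sampling estimates.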
SLIDE: Sub-LInear Deep learning Engine
Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization).
Subtlety: the vectors stored in the hash tables are neurons (weight vectors), not data vectors; we are reorganizing the neurons.

[Figure: network with Input, Hidden 1, Hidden 2, and Output layers; hash tables H1 (buckets 1|1, 2|2,4, 3|3) and H2 (buckets 1|3, 2|1,4, 3|2) map bucket ids to neuron ids.]
SLIDE: Sub-LInear Deep learning Engine
Step 2 – Hash the input to any given layer using its randomized hash function.
[Figure: same network and hash tables as in Step 1; the layer input is hashed with each table's hash function.]
SLIDE: Sub-LInear Deep learning Engine
Step 3 – Query the hidden layer's hash table(s) with the integer fingerprint to retrieve the active set. Neurons are sampled in proportion to their activations.

[Figure: same network and hash tables as before; the buckets matching the input's fingerprint yield the active neurons.]
SLIDE: Sub-LInear Deep learning Engine
Step 4 – Perform forward and backward propagation only on the nodes in the active set. The computation is of the same order as the number of active neurons.

[Figure: same network and hash tables as before; only the active neurons participate in the forward and backward passes.]
SLIDE: Sub-LInear Deep learning Engine
Step 5 – Update the hash tables by rehashing the updated node weights. The computation is of the same order as the number of active neurons. (A sketch combining Steps 1–5 follows below.)

[Figure: same network and hash tables as before; updated neurons are rehashed into the tables.]
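Putting Steps 1–5 together, a highly simplified single-layer sketch (illustrative code, not the actual SLIDE engine; SparseLayer, Neuron, and train_step are made-up names, and it reuses the LSHTables/HashFn types from the earlier sketch):

// Illustrative single-layer sketch of the SLIDE loop (not the actual engine code).
// Step 1: hash every neuron's weight vector into the tables (at initialization).
// Step 2: hash the layer input.  Step 3: probe buckets to get the active neuron set.
// Step 4: forward/backward only over active neurons.  Step 5: rehash updated neurons.
#include <unordered_set>
#include <vector>

struct Neuron { std::vector<float> w; float b = 0.f; };

struct SparseLayer {
    std::vector<Neuron> neurons;
    LSHTables tables;                            // from the LSHTables sketch above

    SparseLayer(std::vector<Neuron> n, std::vector<HashFn> f)
        : neurons(std::move(n)), tables(std::move(f)) { build_tables(); }

    void build_tables() {                        // Step 1 (init) and Step 5 (rebuild)
        tables = LSHTables(tables.fns);          // reset buckets, keep the hash functions
        for (int j = 0; j < (int)neurons.size(); ++j)
            tables.insert(j, neurons[j].w);      // hash each neuron's weight vector
    }

    void train_step(const std::vector<float>& x, float lr,
                    const std::vector<float>& grad_out /* dL/dz per output neuron */) {
        std::unordered_set<int> active = tables.query(x);   // Step 2 (hash input) + Step 3 (probe)
        for (int j : active) {                               // Step 4: only the active neurons
            float z = neurons[j].b;
            for (size_t i = 0; i < x.size(); ++i) z += neurons[j].w[i] * x[i];
            float gz = (z > 0.f) ? grad_out[j] : 0.f;        // ReLU gradient
            for (size_t i = 0; i < x.size(); ++i) neurons[j].w[i] -= lr * gz * x[i];
            neurons[j].b -= lr * gz;
        }
        // Step 5: rehash updated neurons via build_tables(); in practice this
        // can be amortized (rebuilt periodically rather than after every update).
    }
};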
We can go very sparse if Adaptive
• 2 Hidden Layers
• 1000 Nodes Per Layer
• Reduce both training and inference cost by 95%!
• Significantly more for larger networks.
(The wider the better)
Sparsity + Randomness → Asynchronous Updates
• 3 Hidden Layers
• 1000 Nodes Per Layer
Less Computation + Asynchronous Parallelism
• Each update is computationally very small (100x+ reduction in computation and energy)
• Updates are near-independent, very low chance of conflict. Hence, parallel SGD!
SLIDE: Sub-LInear Deep learning Engine
[Figure: SLIDE data structures. For each layer, every neuron keeps per-input arrays sized by the batch size: active-input flags, activations, and accumulated gradients, plus its weights (sized by the previous layer). The network also keeps L hash tables (HashTable1 ... HashTableL); each table maps a fingerprint (h_1 ... h_K) to a bucket of neuron ids, and some buckets may be empty.]
(Extreme sparsity and randomness in gradient updates)
Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011)
Parallelism with OpenMP
Parallel across training samples in a batch.

[Figure: per-neuron arrays indexed by batch position: active inputs, activation for each input, accumulated gradients, and weights.]
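A minimal sketch of this batch-level parallelism (illustrative; it assumes the SparseLayer sketch above): each thread processes a different sample and writes its tiny gradient update without locks, in the HOGWILD! spirit, because the active sets of different samples rarely overlap.

// Illustrative OpenMP sketch (compile with -fopenmp): samples in a batch are processed
// by different threads; each touches only its own small active set, so lock-free
// (HOGWILD!-style) updates rarely conflict.
#include <vector>

void train_batch(SparseLayer& layer,
                 const std::vector<std::vector<float>>& batch_x,
                 const std::vector<std::vector<float>>& batch_grad_out,
                 float lr) {
    #pragma omp parallel for
    for (int s = 0; s < (int)batch_x.size(); ++s) {
        // No locks: each update is tiny and touches a near-disjoint set of neurons.
        layer.train_step(batch_x[s], lr, batch_grad_out[s]);
    }
    layer.build_tables();   // Step 5: rehash updated neurons outside the parallel region
}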
Flexible choices of Hash Functions
SLIDE supports four different LSH hash functions:
• SimHash (cosine similarity)
• Winner-Take-All hashing (order)
• Densified Winner-Take-All hashing (for sparse data)
• MinHash (Jaccard similarity; a sketch follows below)
Easily add more!
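As one example of such a family, a minimal MinHash sketch for sparse binary inputs (illustrative only; it uses a cheap multiply-add in place of a true random permutation):

// Illustrative MinHash sketch for sparse binary vectors (Jaccard similarity):
// the hash value is the smallest "permuted" id among the non-zero coordinates.
// A random odd multiply-add stands in for a random permutation of the universe.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct MinHash {
    uint64_t a, b;
    MinHash(uint64_t a_, uint64_t b_) : a(a_ | 1), b(b_) {}
    uint64_t operator()(const std::vector<uint32_t>& nonzero_indices) const {
        uint64_t best = std::numeric_limits<uint64_t>::max();
        for (uint32_t i : nonzero_indices)
            best = std::min(best, a * uint64_t(i) + b);   // min over "permuted" ids
        return best;
    }
};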
Design Choices for Speed
• Vanilla sub-sampling: choose sub-samples uniformly.
• Top-K sub-sampling: rank the retrieved candidates and choose the top k.
• Hard-thresholding sub-sampling: choose candidates that occur more than a threshold number of times (a sketch of all three follows below).
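A rough sketch of the three selection rules over the retrieved candidates (illustrative names; counts maps a candidate neuron id to how many of the L probed buckets it appeared in):

// Illustrative selection rules over retrieved candidates.
// counts[id] = number of probed buckets (out of L) in which neuron `id` appeared.
#include <algorithm>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

// Vanilla: keep a uniform subsample of the candidates.
std::vector<int> vanilla(std::vector<int> ids, size_t budget, std::mt19937& gen) {
    std::shuffle(ids.begin(), ids.end(), gen);
    ids.resize(std::min(budget, ids.size()));
    return ids;
}

// Top-K: rank candidates by how often they were retrieved and keep the top k.
std::vector<int> top_k(const std::unordered_map<int, int>& counts, size_t k) {
    std::vector<std::pair<int, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<int> ids;
    for (size_t i = 0; i < std::min(k, ranked.size()); ++i) ids.push_back(ranked[i].first);
    return ids;
}

// Hard thresholding: keep candidates that appeared in more than `threshold` buckets.
std::vector<int> hard_threshold(const std::unordered_map<int, int>& counts, int threshold) {
    std::vector<int> ids;
    for (const auto& [id, c] : counts)
        if (c > threshold) ids.push_back(id);
    return ids;
}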
Micro-Architecture Optimization
• Cache optimization
• Transparent hugepages
• Vector processing
• Software pipelining and prefetching
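Two of these knobs can be illustrated with standard Linux/GCC facilities (a sketch of the general idea under those assumptions, not the engine's actual code): transparent hugepages via madvise, and software prefetching via __builtin_prefetch.

// Illustrative sketch of two of the optimizations on a commodity Linux/GCC toolchain.
#include <cstddef>
#include <sys/mman.h>

// Hint the kernel to back a large, hot allocation with transparent hugepages,
// reducing TLB misses for the big weight and hash-table arrays (addr must be page-aligned).
void hint_hugepages(void* addr, size_t bytes) {
    madvise(addr, bytes, MADV_HUGEPAGE);   // Linux-specific advice flag
}

// Software prefetching: pull the next active neuron's weight row into cache while
// the current one is being processed, hiding part of the random-access latency.
float dot_with_prefetch(const float* const* rows, const int* active, int n_active,
                        const float* x, int dim, int j) {
    if (j + 1 < n_active)
        __builtin_prefetch(rows[active[j + 1]]);   // GCC/Clang builtin
    const float* w = rows[active[j]];
    float z = 0.f;
    for (int i = 0; i < dim; ++i) z += w[i] * x[i];
    return z;
}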
Looks Good on Paper. Does it change anything?
Baseline: state-of-the-art optimized implementations
• TF on Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)
• TF on NVIDIA Tesla V100 (32GB)

vs.

SLIDE on the same Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)
Datasets
Performance
Performance compared to sampled softmax
Performance @ Different Batchsizes
Asynchronous Parallelism gets best scalability
Inefficiency Diagnosis
[Figure: inefficiency ratio in CPU usage (front-end bound, memory bound, retiring, core bound) at 8, 16, and 32 threads, for TensorFlow-CPU and for SLIDE; the y-axis ranges from 0 to 0.6.]
Impact of HugePages
Conclusion: From Matrix Multiplication to (few) Hash Lookups
Standard
• Operation: matrix multiply
• Pros: hardware support
• Cons: expensive O(N^3); can only scale with hardware; energy

SLIDE
• Operations: compute random hashes of the data; hash lookups, sample, and update (decades of work in databases); very few multiplications (100x+ reduction)
• Pros: energy (IoT), latency; asynchronous parallel gradient updates; simple hash tables; larger network → more savings
• Cons: random memory access (but parallel SGD)
Future Work
• Distributed SLIDE
• SLIDE on more complex architectures like CNN/RNN
References
[1] Beidi Chen, Tharun Medini, Anshumali Shrivastava. "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems". Proceedings of the 3rd MLSys Conference (2020).
[2] Ryan Spring, Anshumali Shrivastava. "Scalable and Sustainable Deep Learning via Randomized Hashing". Proceedings of the 23rd ACM SIGKDD (2017).
[3] Alireza Makhzani, Brendan J. Frey. "Winner-Take-All Autoencoders". Advances in Neural Information Processing Systems (2015).
[4] Beidi Chen, Anshumali Shrivastava. "Densified Winner Take All (WTA) Hashing for Sparse Datasets". Uncertainty in Artificial Intelligence (2018).
[5] Beidi Chen, Yingchen Xu, Anshumali Shrivastava. "LGD: Fast and Accurate Stochastic Gradient Estimation". NeurIPS (2019), Vancouver.
[6] Benjamin Recht, Christopher Ré, Stephen Wright, Feng Niu. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent". Advances in Neural Information Processing Systems (2011).
Thanks!!! Welcome to stop by Poster #7.