
Page 1:

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Beidi Chen

Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*

* Rice University, †Intel

MLSys 2020

Page 2:

Our SLIDE system (C++ from scratch) on a 44-core CPU beats TF on a V100 (1 hour vs. 3.5 hours) on networks with 100+ million parameters. TF on the same CPU takes 16 hours even with all HPC optimizations (Intel MKL-DNN).

3.5× faster on CPU than TF on V100 (log-scale time axis)

Page 3:

The Age of Large Networks

• More Data
• Large Models
• Tons of Engineering
• Backpropagation (a.k.a. Simple Gradient Descent)


Page 4:

Giant Matrix Multiplication for every data point in each epoch (Forward + Backward)

Fully Connected NN

[Figure: a fully connected network with an input layer, two hidden layers (Hidden 1, Hidden 2), and an output layer; every layer computes f(Wᵀx) over the full previous layer.]

Page 5:

Challenges
Do we really need all the computations? No!
Good news: only the high activations are important.

• Sampling a few neurons in proportion to their activations is enough (adaptive dropout) (Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015)

• ReLU filters out negative activations (50% sparsity by design)

• Softmax

Bad news: we need to compute all activations in order to identify (or sample) the high-activation neurons. NO SAVINGS.

Page 6:

The Fundamental Sampling Puzzle

Given N fixed sampling weights w_1, w_2, ..., w_N:
• Task: sample x_i with probability w_i
• Cost of 1 sample: O(N). Cost of K samples: O(N).

Given N time-varying sampling weights (activations) w_1^t, w_2^t, ..., w_N^t:
• Task: at time t, sample x_i with probability w_i^t
• Cost of sampling: O(N) at every time t.
• Last few years of work in Locality Sensitive Hashing: if w_i^t = f(sim(θ_t, x_i)) for a specific set of f and sim, then O(1) every time, after an initial preprocessing cost of O(N).


Page 7:

Textbook Hashing (Dictionary)
Hashing: a function h that maps a data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.

Property (ideal hash functions):
• If x = y, then h(x) = h(y)
• If x ≠ y, then h(x) ≠ h(y)


Page 8:

Probabilistic Fingerprinting (Hashing) (late '90s)
Hashing: a randomized function h that maps a data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.

Locality-sensitive property:
• If Sim(x, y) is high, then Pr(h(x) = h(y)) is high
• If Sim(x, y) is low, then Pr(h(x) = h(y)) is low



Page 9:

Example 1: Signed Random Projection (SRP)

Pr(h(x) = h(y)) = 1 − cos⁻¹(θ)/π, which is monotonic in θ, the cosine similarity between x and y.

[Figure: vectors X and Y with two random hyperplanes H1 and H2; a hyperplane separates X and Y (giving different sign bits) with probability proportional to the angle between them.]

A classical result from Goemans-Williamson (1995).
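A minimal C++ sketch of such an SRP/SimHash fingerprint; the struct name SRPHash, the Gaussian hyperplanes, and all parameters are illustrative assumptions, not the API of any particular system.

```cpp
// Signed random projections: K random Gaussian hyperplanes give a K-bit
// integer fingerprint; collision probability rises with cosine similarity.
#include <cstdint>
#include <random>
#include <vector>

struct SRPHash {
    std::vector<std::vector<float>> planes;  // K random Gaussian directions

    SRPHash(int K, int dim, uint32_t seed = 42) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> gauss(0.f, 1.f);
        planes.assign(K, std::vector<float>(dim));
        for (auto& p : planes)
            for (auto& v : p) v = gauss(gen);
    }

    // Concatenate the K sign bits into one integer fingerprint.
    uint32_t operator()(const std::vector<float>& x) const {
        uint32_t code = 0;
        for (const auto& p : planes) {
            float dot = 0.f;
            for (size_t i = 0; i < x.size(); ++i) dot += p[i] * x[i];
            code = (code << 1) | (dot >= 0.f ? 1u : 0u);
        }
        return code;
    }
};
```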

Page 10:

Example 2: (Densified) Winner Take All

[Figure: example original vectors with their WTA hash codes (ICCV 2011) and DWTA hash codes (UAI 2018), using K = 3.]

Yagnik (ICCV 2011); Chen and Shrivastava (UAI 2018).
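A sketch of a single WTA hash for the K = 3 setting above; the function name and per-hash permutation are assumptions for illustration. DWTA additionally "densifies" empty windows so the scheme works on sparse data.

```cpp
// One Winner-Take-All hash: randomly permute the coordinates and return the
// argmax position among the first K. Assumes dense x with x.size() >= K.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

int wta_hash(const std::vector<float>& x, int K, uint32_t seed) {
    std::vector<int> perm(x.size());
    std::iota(perm.begin(), perm.end(), 0);       // 0, 1, ..., d-1
    std::mt19937 gen(seed);                        // one permutation per hash
    std::shuffle(perm.begin(), perm.end(), gen);
    int best = 0;                                  // argmax over first K coords
    for (int i = 1; i < K; ++i)
        if (x[perm[i]] > x[perm[best]]) best = i;
    return best;                                   // integer code in [0, K)
}
```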

Page 11:

Probabilistic Hash Tables

Given: f is monotonic. Then

Pr_h[h(x) = h(y)] = f(sim(x, y)).

• Given a query q, if h_1(q) = 11 and h_2(q) = 01, then probe the bucket with index 1101. It is a good bucket!

• (Locality sensitive) h_i(q) = h_i(x) is a noisy indicator of high similarity.

• Doing better than random!
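The first bullet's index arithmetic, written out as a hypothetical helper:

```cpp
// Two 2-bit fingerprints concatenate into one table index:
// bucket_index(0b11, 0b01) == 0b1101 == 13.
#include <cstdint>

uint32_t bucket_index(uint32_t h1, uint32_t h2) {
    return (h1 << 2) | h2;  // shift by the bit-width of h2
}
```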


Page 12:

LSH for Search (Known)

Theory
• Super-linear memory: O(N^(1+ρ))
• Sub-linear query time: O(N^ρ)
• ρ < 1, but generally large (close to 1) and often hard to determine

Practical issues
• Needs a lot of hash tables and distance computations for good accuracy on near neighbors
• Buckets can be quite heavy, due to poor randomness or unfavorable data distributions


Page 13:

New View: Data Structures for Efficient Sampling!

Is LSH really a search algorithm?
• Given the query θ_t, LSH samples x_i from the dataset with probability w_i^t = 1 − (1 − p(x_i, θ_t)^K)^L
• w_i^t is monotonic in p(x_i, θ_t)^K, and hence in the similarity between x_i and θ_t
• LSH is usually treated as a black box for nearest-neighbor search. It is not!


Page 14:

LSH as Samplers
We can pre-process the dataset D such that:
• Given any query q, we can sample x ∈ D with probability Const × (1 − (1 − p(q, x)^K)^L), using K·L hash computations and L bucket probes.
• Even K = 1, L = 1 is adaptive. So sampling is adaptive in O(1) time.
• Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y).

We can exactly compute the sampling probability:
• Const = (number of elements sampled) / (number of elements in the probed buckets)

(Chen et al. NeurIPS 2019)

Sufficient for importance-sampling estimation. Sampling cost: O(1).
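A tiny self-contained check of the collision probability 1 − (1 − p^K)^L; the values K = 2, L = 3 are made up for illustration.

```cpp
// Probability that LSH with K concatenated bits and L tables retrieves a
// point whose per-hash collision probability with the query is p.
#include <cmath>
#include <cstdio>

double sample_prob(double p, int K, int L) {
    return 1.0 - std::pow(1.0 - std::pow(p, K), L);
}

int main() {
    std::printf("similar    (p=0.8): %.3f\n", sample_prob(0.8, 2, 3));  // ~0.953
    std::printf("dissimilar (p=0.2): %.3f\n", sample_prob(0.2, 2, 3));  // ~0.115
    return 0;
}
```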


Page 15:

SLIDE: Sub-LInear Deep learning Engine

Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization).

Subtlety: the vectors stored in the hash tables are the neurons' weight vectors, not the data vectors; we are reorganizing the neurons. A code sketch follows the figure.

[Figure: the fully connected network with hash tables attached to the hidden layers. H1 indexes Hidden 1's neurons (bucket 1 → neuron 1; bucket 2 → neurons 2, 4; bucket 3 → neuron 3); H2 indexes Hidden 2's (bucket 1 → neuron 3; bucket 2 → neurons 1, 4; bucket 3 → neuron 2).]
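A minimal sketch of Step 1, reusing the SRPHash sketch from Page 9; LSHTables and its layout are assumed names, not SLIDE's internal structures.

```cpp
// L tables, each keyed by an independent K-bit fingerprint of a neuron's
// weight vector; buckets store neuron ids, mirroring H1/H2 in the figure.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct LSHTables {
    std::vector<SRPHash> hashes;  // one independent hash function per table
    std::vector<std::unordered_map<uint32_t, std::vector<int>>> tables;

    LSHTables(int L, int K, int dim) {
        for (int t = 0; t < L; ++t) hashes.emplace_back(K, dim, /*seed=*/t);
        tables.resize(L);
    }

    // Fingerprint a neuron's weights and append its id to the matching bucket.
    void insert(int neuron_id, const std::vector<float>& weights) {
        for (size_t t = 0; t < tables.size(); ++t)
            tables[t][hashes[t](weights)].push_back(neuron_id);
    }
};
```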

Page 16:

SLIDE: Sub-LInear Deep learning Engine

Step 2 – Hash the input to any given layer using its randomized hash function.

[Figure: the same network and tables; the input to each hidden layer is hashed by that layer's hash functions to produce its integer fingerprint.]

Page 17:

SLIDE: Sub-LInear Deep learning Engine

Step 3 – Query the hidden layer's hash table(s) with the integer fingerprint to retrieve the active set. This samples neurons in proportion to their activations (sketched in code after the figure).

[Figure: the fingerprints index into H1 and H2; the retrieved buckets form each layer's active set.]
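Step 3 against the hypothetical LSHTables above: hash the layer's input with each table's function and union the matching buckets into the active set (before any of the sub-sampling policies discussed on Page 26).

```cpp
// Retrieve the active set: one bucket probe per table, union of neuron ids.
#include <unordered_set>
#include <vector>

std::unordered_set<int> query_active_set(const LSHTables& index,
                                         const std::vector<float>& input) {
    std::unordered_set<int> active;
    for (size_t t = 0; t < index.tables.size(); ++t) {
        auto it = index.tables[t].find(index.hashes[t](input));
        if (it != index.tables[t].end())
            active.insert(it->second.begin(), it->second.end());
    }
    return active;
}
```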

Page 18:

SLIDE: Sub-LInear Deep learning Engine

Step 4 – Perform forward and back propagation only on the nodes in the active set. Computation is on the order of the number of active neurons (sketched after the figure).

[Figure: forward and backward passes run only over the active neurons; the rest of each layer is skipped.]
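The forward half of Step 4 as a sketch; W, bias, and the activation buffers are illustrative names, and the previous layer is kept dense for simplicity. Cost scales with the number of active neurons, not the layer width.

```cpp
// Forward pass restricted to the active neurons of one layer.
#include <algorithm>
#include <vector>

void forward_active(const std::vector<std::vector<float>>& W,  // [neuron][fan-in]
                    const std::vector<float>& bias,
                    const std::vector<int>& active,            // from Step 3
                    const std::vector<float>& prev_act,
                    std::vector<float>& act) {
    for (int j : active) {
        float z = bias[j];
        for (size_t i = 0; i < prev_act.size(); ++i) z += W[j][i] * prev_act[i];
        act[j] = std::max(0.f, z);  // ReLU, as on Page 5
    }
}
```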

Page 19:

SLIDE: Sub-LInear Deep learning Engine

Step 5 – Update the hash tables by rehashing the updated node weights. Computation is again on the order of the number of active neurons (sketched after the figure).

[Figure: the updated neurons are rehashed and re-inserted into H1 and H2 under their new fingerprints.]
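Step 5 on the same hypothetical structures: only the touched neurons are rehashed. A real system would also have to evict the stale entries (or rebuild the tables periodically).

```cpp
// Re-insert updated neurons under their new fingerprints.
#include <vector>

void rehash_updated(LSHTables& index,
                    const std::vector<std::vector<float>>& W,
                    const std::vector<int>& active) {
    for (int j : active) index.insert(j, W[j]);
}
```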

Page 20:

We can go very sparse if sampling is adaptive

• 2 Hidden Layers
• 1000 Nodes Per Layer

• Reduces both training and inference cost by 95%!

• Significantly more for larger networks. (The wider, the better.)

Page 21:

Sparsity + Randomness → Asynchronous Updates

• 3 Hidden Layers
• 1000 Nodes Per Layer

Page 22:

Less Computation + Asynchronous Parallelism

• Each update is computationally very small (100×+ reduction in computation and energy)

• Updates are near-independent, with a very low chance of conflict. Hence, parallel SGD!


Page 23:

SLIDE: Sub-LInear Deep learning Engine

[Figure: SLIDE's data layout. Per neuron: active-input flags, an activation for each input in the batch, accumulated gradients, and a weight vector sized by the previous layer. Per layer: hash tables HashTable1 ... HashTableL, each mapping a fingerprint (h_1 ... h_K) to a bucket of neuron ids (buckets may be empty).]

Page 24:

(Extreme sparsity and randomness in gradient updates)

Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011)

Parallelism with OpenMP

Parallel across training samples in a batch.

[Figure: each thread processes one sample; the shared per-neuron state (active-input flags, activations for each input, accumulated gradients, weights) is updated without locks.]
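A self-contained HOGWILD!-style toy (not SLIDE's code): lock-free parallel SGD on a sparse linear model, parallel across the samples of a batch with OpenMP. Because each sample touches very few weights, unsynchronized writes rarely conflict. Compile with -fopenmp.

```cpp
// Lock-free parallel SGD: every thread updates the shared weight vector
// directly; sparsity makes conflicting writes rare and harmless.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int dim = 1000, batch = 256;
    const float lr = 0.1f;
    std::vector<float> w(dim, 0.f);                             // shared weights
    std::vector<std::vector<std::pair<int, float>>> xs(batch);  // sparse inputs
    std::vector<float> ys(batch, 1.f);
    for (int b = 0; b < batch; ++b) xs[b] = {{b % dim, 1.f}};   // toy data

    #pragma omp parallel for
    for (int b = 0; b < batch; ++b) {            // one sample per iteration
        float pred = 0.f;
        for (auto& [i, v] : xs[b]) pred += w[i] * v;
        const float err = pred - ys[b];
        for (auto& [i, v] : xs[b]) w[i] -= lr * err * v;  // no locks taken
    }
    std::printf("w[0] = %.2f\n", w[0]);
    return 0;
}
```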


Page 25:

Flexible Choices of Hash Functions

SLIDE supports four different LSH hash functions:
• SimHash (cosine similarity)
• Winner-Take-All hashing (order)
• Densified Winner-Take-All hashing (for sparse data)
• MinHash (Jaccard similarity)

Easily add more!


Page 26:

Design Choices for Speed

• Vanilla sub-sampling: choose sub-samples uniformly
• Top-K sub-sampling: rank the retrieved samples and choose the top K (see the sketch below)
• Hard-thresholding sub-sampling: choose sub-samples that occur more than a threshold number of times (see the sketch below)
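Sketches of the last two policies over bucket-retrieval counts, assuming counts[j] holds how many probed buckets contained neuron j; the names and this counting view are illustrative, not SLIDE's exact code.

```cpp
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

// Hard thresholding: keep neurons returned by more than m of the L buckets.
std::vector<int> hard_threshold(const std::unordered_map<int, int>& counts, int m) {
    std::vector<int> keep;
    for (const auto& [j, c] : counts)
        if (c > m) keep.push_back(j);
    return keep;
}

// Top-K: rank neurons by retrieval count and keep the K most frequent.
std::vector<int> top_k(const std::unordered_map<int, int>& counts, size_t k) {
    std::vector<std::pair<int, int>> ranked(counts.begin(), counts.end());
    const size_t n = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(), ranked.begin() + n, ranked.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<int> keep;
    for (size_t i = 0; i < n; ++i) keep.push_back(ranked[i].first);
    return keep;
}
```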


Page 27:

Micro-Architecture Optimization

• Cache optimization
• Transparent hugepages (a sketch follows this list)
• Vector processing
• Software pipelining and prefetching
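A sketch of the transparent-hugepages item above: on Linux, madvise(MADV_HUGEPAGE) hints that a large, hot allocation (e.g., the weight arrays) should be backed by huge pages, reducing TLB pressure. Best-effort and Linux-specific; the helper name is ours.

```cpp
// Allocate anonymous memory and hint the kernel to back it with huge pages.
#include <sys/mman.h>
#include <cstddef>

void* alloc_huge(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // hint only; the kernel may decline
    return p;
}
```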


Page 28:

Looks Good on Paper. Does it change anything?

Baselines: state-of-the-art optimized implementations
• TF on Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)
• TF on NVIDIA Tesla V100 (32GB)

vs.

SLIDE on the same Intel Xeon E5-2699A v4 @ 2.40GHz CPU (FMA, AVX, AVX2, SSE4.2)


Page 29:

Datasets


Page 30:

Performance


Page 31:

Performance compared to sampled softmax


Page 32:

Performance at Different Batch Sizes


Page 33:

Asynchronous parallelism gets the best scalability


Page 34:

Inefficiency Diagnosis


[Charts: inefficiency ratios in CPU usage (front-end bound, memory bound, retiring, core bound) at 8, 16, and 32 threads, for TensorFlow-CPU vs. SLIDE; the y-axis runs from 0 to 0.6.]

Page 35:

Impact of HugePages


Page 36:

Conclusion: From Matrix Multiplication to (few) Hash Lookups

Standard
• Operation: matrix multiplication
• Pros: hardware support
• Cons: expensive (O(N^3)); can only scale with hardware; energy

SLIDE
• Operations: compute random hashes of the data; hash lookups, sample, and update (decades of work in databases); very few multiplications (100×+ reduction)
• Pros: energy (IoT), latency; asynchronous parallel gradient updates; simple hash tables; larger network → more savings
• Cons: random memory access (but parallel SGD)

Page 37:

Future Work

• Distributed SLIDE

• SLIDE on more complex architectures like CNN/RNN


Page 38:

References

[1] Beidi Chen, Tharun Medini, Anshumali Shrivastava. "SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems". Proceedings of the 3rd MLSys Conference (2020).
[2] Ryan Spring, Anshumali Shrivastava. "Scalable and Sustainable Deep Learning via Randomized Hashing". Proceedings of the 23rd ACM SIGKDD (2017).
[3] Alireza Makhzani, Brendan J. Frey. "Winner-Take-All Autoencoders". Advances in Neural Information Processing Systems (2015).
[4] Beidi Chen, Anshumali Shrivastava. "Densified Winner Take All (WTA) Hashing for Sparse Datasets". Uncertainty in Artificial Intelligence (2018).
[5] Beidi Chen, Yingchen Xu, Anshumali Shrivastava. "LGD: Fast and Accurate Stochastic Gradient Estimation". NeurIPS (2019), Vancouver.
[6] Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent". Advances in Neural Information Processing Systems (2011).


Page 39:

Thanks!!! Welcome to stop by Poster #7.
