
Page 1:

Distributed Machine Learning

1

Georgios Damaskinos, 2018

Page 2:

Machine Learning?

2

Page 3:

Machine Learning “in a nutshell”

3

Page 4:

Machine Learning algorithm

4

Page 5:

Machine Learning algorithm

5

Page 6:

Safety?

6

Page 7:

Safety?

Convergence

[Plot: cost function decreasing over training, illustrating convergence]

7

Page 8:

Machine Learning?

8

Page 9:

Think big!

9

Page 10:

Think big!

Example: Image Classification

Data: ImageNet: 1.3 Million training images (224 x 224 x 3)

Model: ResNet-152: 60.2 Million parameters (model size)

Training time (single node): TensorFlow: 19 days!!

10

Page 11:

Think big!

Example: Image Classification

Data: ImageNet: 1.3 Million training images (224 x 224 x 3)

Model: ResNet-152: 60.2 Million parameters (model size)

Training time (single node): TensorFlow: 19 days!!

11

Page 12:

12

Page 13:

Performance?

Training time (single node): TensorFlow: 19 days

Training time (distributed): 1024 nodes (theoretical): 25 minutes
CSCS (3rd top supercomputer, 4500+ GPUs, state-of-the-art interconnect):
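Back-of-the-envelope check on the theoretical figure: 19 days ≈ 27,360 minutes, and 27,360 / 1024 ≈ 27 minutes, i.e., roughly the quoted 25 minutes, assuming perfect linear scaling.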

13

Page 14:

Performance?

State-of-the-art (ResNet-50): 1 hour [GP+17]

● Batch size = 8192
● 256 GPUs

[GP+17] Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." 2017.

14

Page 15:

Distributed … how?

15

Page 16:

Data Parallelism

16

Page 17:

17

Batch Learning

Page 18:

18

Batch Learning

Page 19:

19

Batch Learning

Page 20:

20

Batch Learning

Page 21:

21

• Partition data
• Parallel compute on partitions

Parallel Batch Learning

Page 22:

22

Parallel Batch Learning

Page 23:

23

Parallel Batch Learning

Page 24:

24

• More frequent updates

Parallel Synchronous Mini-Batch Learning
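To make the data-parallel, synchronous mini-batch scheme concrete, here is a minimal single-process sketch (NumPy only; the least-squares gradient and helper names are illustrative, not from the talk): the data is partitioned across workers, each worker computes a gradient on a mini-batch of its own partition, and every step waits for all workers before averaging and updating.

```python
import numpy as np

def compute_gradient(w, X, y):
    # Least-squares gradient as a stand-in for any model's gradient.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sync_minibatch_sgd(X, y, n_workers=4, batch_size=32, lr=0.01, steps=100):
    # Data parallelism: partition the training data across the workers.
    parts = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(steps):
        grads = []
        for Xp, yp in parts:  # in a real deployment these run in parallel
            idx = rng.choice(len(yp), size=min(batch_size, len(yp)), replace=False)
            grads.append(compute_gradient(w, Xp[idx], yp[idx]))
        # Synchronization barrier: wait for every worker, then average and update.
        w -= lr * np.mean(grads, axis=0)
    return w
```

Each step costs as much as the slowest worker, which is what motivates the asynchronous variant on the following slides.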

Page 25:

25

Parallel Asynchronous Mini-Batch Learning

Page 26:

26

Parallel Asynchronous Mini-Batch Learning

Page 27:

27

Parallel Asynchronous Mini-Batch Learning

Page 28:

28

Parallel Asynchronous Mini-Batch Learning

Page 29:

29

Parallel Asynchronous Mini-Batch Learning

Page 30:

30

Parallel Asynchronous Mini-Batch Learning

Page 31:

31

Parallel Asynchronous Mini-Batch Learning

Page 32:

32

• Gradients computed using stale parameters

• Increased utilization
• Central lock

Parallel Asynchronous Mini-Batch Learning
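For contrast, a toy sketch of the asynchronous scheme (illustrative only; the class and function names are hypothetical): workers pull the current parameters, compute a gradient, and push it back whenever they finish, so the server may apply gradients computed on stale parameters, with a central lock serializing the updates.

```python
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.version = 0
        self.lock = threading.Lock()  # central lock around the parameter state

    def pull(self):
        with self.lock:
            return self.w.copy(), self.version

    def push(self, grad, version):
        with self.lock:
            staleness = self.version - version  # > 0: gradient used stale parameters
            self.w -= self.lr * grad            # plain async SGD applies it anyway
            self.version += 1
            return staleness

def worker(ps, X, y, steps=50):
    for _ in range(steps):
        w, version = ps.pull()
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # same toy gradient as before
        ps.push(grad, version)

# Usage sketch: one thread per worker, each on its own data partition.
# threads = [threading.Thread(target=worker, args=(ps, Xp, yp)) for Xp, yp in parts]
```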

Page 33:

33

Distributed ML

● Parallelism
  ○ Model
  ○ Data

● Learning
  ○ Synchronous
  ○ Asynchronous

Page 34:

34

Distributed ML: Challenges

1. Scalability

2. Privacy

3. Security

Page 35:

35

Scalability - Asynchrony

Page 36:

36

Scalability - Communication

ImageNet classification (ResNet-152): model/update size ≈ 250 MB
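Back-of-the-envelope (assuming 32-bit floats): 60.2 million parameters × 4 bytes ≈ 241 MB, i.e., roughly the ~250 MB quoted per model/update.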

Page 37:

37

Scalability - Communication

Page 38:

38

Scalability - Communication

Page 39:

39

Scalability - Communication

Page 40:

40

Scalability - Communication

Page 41:

41

ImageNet classification (ResNet-152): model/update size ≈ 250 MB

Compression
● Distillation [PPA+18]
● Quantization [DGL+17]
  ○ signSGD [BJ+18]

[DGL+17] Alistarh, Dan, et al. "QSGD: Communication-efficient SGD via gradient quantization and encoding." NIPS 2017.
[PPA+18] Polino, Antonio, Razvan Pascanu, and Dan Alistarh. "Model compression via distillation and quantization." ICLR 2018.
[BJ+18] Bernstein, Jeremy, et al. "signSGD: Compressed optimisation for non-convex problems." ICML 2018.

Scalability - Communication
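As one concrete instance of compression, here is a rough sketch of the signSGD-with-majority-vote idea from [BJ+18] (simplified, not the paper's exact algorithm): each worker transmits only the sign of every gradient coordinate (1 bit instead of 32), and the server updates with the coordinate-wise majority sign.

```python
import numpy as np

def signsgd_majority_step(w, worker_grads, lr=0.001):
    # worker_grads: list of gradient vectors, one per worker.
    # Only the signs need to be communicated: 1 bit per coordinate.
    signs = np.stack([np.sign(g) for g in worker_grads])
    majority = np.sign(signs.sum(axis=0))  # coordinate-wise majority vote
    return w - lr * majority
```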

Page 42:

42

Distributed ML: Challenges

1. Scalability
   a. Asynchrony
   b. Communication efficiency

2. Privacy

3. Security

Page 43:

43

• Medical data

• Photos

• Search logs

Page 44:

44

Privacy

Differential Privacy
● Decentralized learning [BGT+18]
● Compression <-> DP [AST+18]

Local Privacy
● MPC (secure multi-party computation)

[BGT+18] Bellet, A., Guerraoui, R., Taziki, M., & Tommasi, M. "Personalized and Private Peer-to-Peer Machine Learning." AISTATS 2018.
[AST+18] Agarwal, N., Suresh, A. T., Yu, F., Kumar, S., & McMahan, H. B. "cpSGD: Communication-efficient and differentially-private distributed SGD." NIPS 2018.
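To make the privacy mechanisms concrete, here is a generic gradient-perturbation sketch in the spirit of differentially private SGD (the standard clip-and-add-Gaussian-noise recipe, not cpSGD's exact communication-efficient mechanism; parameter names are illustrative):

```python
import numpy as np

def dp_sanitize(grad, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    # Bound each participant's contribution, then add calibrated Gaussian noise
    # before the gradient leaves the data owner.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise
```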

Page 45:

45

Distributed ML: Challenges

1. Scalability
   a. Asynchrony
   b. Communication efficiency

2. Privacy
   a. Differential Privacy
   b. Local Privacy

3. Security

Page 46:

46

Security: Byzantine worker

Page 47:

47

• More frequent updates

Security: Synchronous BFT

Page 48:

48

Security: Synchronous BFT

Krum [Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. "Machine learning with adversaries: Byzantine tolerant gradient descent." NIPS 2017.]

● Byzantine resilience against f/n workers, 2f + 2 < n

● Provable convergence (i.e., safety)

How?
1. Worker i: score(i) = sum of the squared distances ||g_i - g_j||^2 to the n - f - 2 gradients g_j closest to g_i.
   Select the gradient with minimum score.
2. m-Krum: average the m gradients with the lowest scores.

Majority + Squared distance-based decision -> BFT
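A compact NumPy sketch of the Krum rule described above, together with the m-Krum (Multi-Krum) variant that averages the m best-scored gradients (assumes the server holds one gradient per worker):

```python
import numpy as np

def krum(grads, f, m=1):
    # grads: list of n gradient vectors; requires 2f + 2 < n.
    n = len(grads)
    G = np.stack(grads)
    # Pairwise squared Euclidean distances between workers' gradients.
    d2 = ((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)
    scores = []
    for i in range(n):
        others = np.delete(d2[i], i)
        # score(i) = sum of squared distances to the n - f - 2 closest gradients.
        scores.append(np.sort(others)[: n - f - 2].sum())
    best = np.argsort(scores)[:m]  # m = 1 gives Krum, m > 1 gives m-Krum
    return G[best].mean(axis=0)
```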

Page 49:

49

• Gradients computed using stale parameters

• Increased utilization
• Central lock

Security: Asynchronous BFT

Page 50:

50

Security: Asynchronous BFT

Kardam [Damaskinos, G., Mhamdi, E. M. E., Guerraoui, R., Patra, R., & Taziki, M. Asynchronous Byzantine Machine Learning. ICML 2018.]

● Byzantine resilience against f/n workers, 3f < n

● Optimal slowdown:

● Provable (almost sure) convergence (i.e., safety)

How?
1. Lipschitz Filtering Component => Byzantine resilience
2. Staleness Dampening Component => Asynchronous convergence

Asynchrony can be viewed as Byzantine behavior
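A rough, purely illustrative sketch of the two components (the paper's actual filter threshold, quantile, and dampening function differ; everything below is a simplification with hypothetical names):

```python
import numpy as np

def lipschitz_accept(grad, prev_grad, w, prev_w, history, quantile=0.5):
    # Lipschitz filter (illustrative): accept a gradient only if its empirical
    # growth rate is consistent with what has been observed so far.
    k = np.linalg.norm(grad - prev_grad) / (np.linalg.norm(w - prev_w) + 1e-12)
    accept = len(history) == 0 or k <= np.quantile(history, quantile)
    if accept:
        history.append(k)
    return accept

def dampen(grad, staleness):
    # Staleness dampening (illustrative): scale down gradients computed on old parameters.
    return grad / (1.0 + staleness)
```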

Page 51:

51

Distributed ML: Challenges

1. Scalability
   a. Asynchrony
   b. Communication efficiency

2. Privacy
   a. Differential Privacy
   b. Local Privacy

3. Security
   a. Synchronous BFT
   b. Asynchronous BFT

Page 52:

52

Distributed ML: Frameworks

Page 53:

53

TensorFlow: Why?

https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a

Popularity

Page 54:

54

TensorFlow: Why?

Support

● Visualization tools

● Documentation

● Tutorials

Page 55:

55

TensorFlow: Why?

Portability - Flexibility - Scalability

Page 56:

56

TensorFlow: What is it?

● Dataflow graph computation

● Automatic differentiation (also for while loops [Y+18])

[Y+18] Yu, Yuan, et al. "Dynamic control flow in large-scale machine learning." EuroSys. ACM, 2018.
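A tiny illustration of the automatic differentiation mentioned above, written against the TensorFlow 1.x graph API that was current at the time (the variable names are just examples):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[])
w = tf.Variable(2.0)
loss = (w * x - 1.0) ** 2

# TensorFlow walks the graph and builds the gradient ops automatically.
grad_w = tf.gradients(loss, [w])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w, feed_dict={x: 3.0}))  # d/dw (w*x - 1)^2 = 2*(w*x - 1)*x = 30.0
```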

Page 57:

57

Tensor?

Multidimensional array of numbers

Examples:
● A scalar
● A vector
● A matrix
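In TensorFlow these are simply constant tensors of increasing rank, for example:

```python
import tensorflow as tf

scalar = tf.constant(3.0)                        # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])            # rank 1, shape (3,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank 2, shape (2, 2)
```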

Page 58:

58

DataFlow?

● Computations are graphs
  ○ Nodes: Operations
  ○ Edges: Tensors

● Program phases:
  ○ Construction: create the graph
  ○ Execution: push data through the graph
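A minimal example of the two phases with the TensorFlow 1.x graph API (the specific ops are just illustrations):

```python
import tensorflow as tf

# Construction phase: build the dataflow graph; nothing is computed yet.
a = tf.placeholder(tf.float32, shape=[None, 2], name="a")  # edge: tensor fed at run time
b = tf.Variable([[1.0], [2.0]], name="b")                  # edge: stateful tensor
product = tf.matmul(a, b)                                   # node: an operation

# Execution phase: push data through the graph inside a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(product, feed_dict={a: [[3.0, 4.0]]}))   # [[11.]]
```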

Page 59:

59

TensorFlow vs. Dataflow Frameworks

● Batch Processing

● Relaxed consistency

● Simplicity
  ○ No join operations
  ○ Input diff => new batch

Page 60:

60

Architecture

Page 61:

61

Learning

Page 62:

62

Page 63:

63

TensorFlow BFT? No!

How can we make it BFT?

[Damaskinos G., El Mhamdi E., Guerraoui R., Guirguis A., Rouault S.]