Scaling up Machine Learning Algorithms for Classification
Department of Mathematical Informatics, The University of Tokyo
Shin Matsushima
How can we scale up Machine Learning to Massive datasets?
• Exploit hardware traits
  – Disk IO is the bottleneck
  – Dual Cached Loops
  – Run Disk IO and computation simultaneously
• Distributed asynchronous optimization (ongoing)
  – Current work using multiple machines
LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS
• Intuition of linear SVM
  – x_i: i-th datapoint
  – y_i: i-th label, +1 or −1
  – y_i w · x_i: larger is better, smaller is worse
[Figure: scatter of datapoints from the two classes (y_i = +1 and y_i = −1) on either side of a linear decision boundary]
• Formulation of Linear SVM
  – n: number of data points
  – d: number of features
  – Convex non-smooth optimization
• Formulation of Linear SVM
  – Primal
  – Dual
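The formulas themselves are on image slides and are not recoverable from this transcript; as a sketch, the standard L1-loss linear SVM primal and its dual (the formulation used by Hsieh et al. 2008) are:

$$\min_{w}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\max\bigl(0,\,1 - y_i\, w^{\top} x_i\bigr)$$

$$\min_{\alpha}\ \frac{1}{2}\alpha^{\top} Q\,\alpha - \mathbf{1}^{\top}\alpha \quad \text{s.t.}\ \ 0 \le \alpha_i \le C,\qquad Q_{ij} = y_i y_j\, x_i^{\top} x_j,\qquad w = \sum_{i=1}^{n} y_i\,\alpha_i\, x_i$$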
Coordinate descent
• Coordinate Descent Method
  – For each update we solve a one-variable optimization problem with respect to the variable to update.
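As a worked illustration (notation mine, not from the slide), a single coordinate descent step on an objective D(α_1, …, α_n) holds all coordinates fixed except one and solves

$$\alpha_i \leftarrow \arg\min_{a}\ D(\alpha_1, \ldots, \alpha_{i-1},\, a,\, \alpha_{i+1}, \ldots, \alpha_n),$$

cycling (or randomizing) over the coordinates i.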
• Applying Coordinate Descent to the Dual formulation of SVM
Dual Coordinate Descent [Hsieh et al. 2008]
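The algorithm box on this slide is an image; below is a minimal sketch of the per-coordinate update of Hsieh et al. 2008 for the dual above, assuming dense NumPy arrays for readability (a real implementation would use sparse data; function and variable names are mine).

```python
import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One pass of dual coordinate descent for the L1-loss linear SVM dual:
    each alpha_i gets a projected one-variable Newton update, and
    w = sum_i alpha_i * y_i * x_i is maintained incrementally."""
    n = X.shape[0]
    for i in np.random.permutation(n):
        x_i, y_i = X[i], y[i]
        grad = y_i * w.dot(x_i) - 1.0          # gradient of the dual w.r.t. alpha_i
        q_ii = x_i.dot(x_i)                    # Q_ii = x_i^T x_i
        if q_ii == 0.0:
            continue
        new_alpha = min(max(alpha[i] - grad / q_ii, 0.0), C)   # clip to [0, C]
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
    return alpha, w
```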
Attractive properties
• Suitable for large-scale learning
  – We need only one datapoint for each update.
• Theoretical guarantees
  – Linear convergence (cf. SGD)
• Shrinking [Joachims 1999]
  – We can eliminate "uninformative" data.
Shrinking [Joachims 1999]
• Intuition: a datapoint far from the current decision boundary is unlikely to become a support vector
Shrinking [Joachims 1999]
• Condition
• Available only in the dual problem
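The condition itself is on an image slide; as a hedged reconstruction, in the dual coordinate descent setting of Hsieh et al. 2008 a point i is shrunk when its dual variable sits at a bound and the gradient says it would stay there, roughly

$$\alpha_i = 0 \ \text{and}\ \nabla_i D(\alpha) > M \qquad \text{or} \qquad \alpha_i = C \ \text{and}\ \nabla_i D(\alpha) < m,$$

where $\nabla_i D(\alpha) = y_i\, w^{\top} x_i - 1$ and $M$, $m$ are thresholds maintained from the previous pass over the data. This test needs the box-constrained dual variables, which is why it is available only in the dual problem.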
Problem in scaling up to massive data
• When dealing with small-scale data, we first copy the entire dataset into main memory
• When dealing with large-scale data, we cannot copy the entire dataset at once
• Schemes when data cannot fit in memory
  1. Block Minimization [Yu et al. 2010]
    – Split the entire dataset into blocks so that each block can fit in memory
    – Alternate between reading one block from disk and training on the block held in RAM
Block Minimization [Yu et al. 2010]
• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
    – Keep "informative data" in memory
    – Read a new block from disk, train on that block together with the cached informative datapoints, and retain the informative ones for the next block
Selective Block Minimization [Chang and Roth 2011]
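The algorithm boxes for both schemes are image slides; the sketch below is my own loose rendering of the two outer loops (all function names are placeholders): with an empty cache it degenerates to plain Block Minimization, while carrying informative points across blocks gives the selective variant.

```python
def selective_block_minimization(block_files, num_outer_iters,
                                 load_block, train, is_informative, state):
    """Outer loop sketch: load one block from disk at a time, train on it
    together with the cached informative points, then refresh the cache."""
    cache = []                                   # informative points kept in RAM
    for _ in range(num_outer_iters):
        for path in block_files:
            block = load_block(path)             # disk IO
            state = train(block + cache, state)  # in-memory training (e.g. dual CD)
            cache = [p for p in block + cache if is_informative(p, state)]
    return state
```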
• Previous schemes alternate between CPU and Disk IO
  – Training (CPU) is idle while reading
  – Reading (Disk IO) is idle while training
• We want to exploit modern hardware
  1. Multicore processors are commonplace
  2. CPU (memory IO) is often 10-100 times faster than hard disk IO
Dual Cached Loops
1. Make the reader and trainer run simultaneously and almost asynchronously (see the sketch below).
2. The trainer updates the parameters many times faster than the reader loads new datapoints.
3. Keep informative data in main memory (= primarily evict uninformative data from main memory).
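A minimal sketch of the two loops, assuming a shared in-memory cache guarded by a lock; `update` and `is_informative` stand for the dual CD step and the eviction test and are placeholders, not the actual implementation.

```python
import random
import threading

def reader(stream, cache, capacity, lock, done):
    """Reader loop: stream datapoints from disk into free slots of the RAM cache."""
    for point in stream:
        while True:                                # wait until the trainer evicts something
            with lock:
                if len(cache) < capacity:
                    cache.append(point)
                    break
    done.set()

def trainer(cache, lock, state, update, is_informative, done):
    """Trainer loop: keep running dual CD steps over cached points, evicting
    uninformative ones so the reader can replace them with fresh data."""
    while not done.is_set() or cache:
        with lock:
            if not cache:
                continue
            i = random.randrange(len(cache))
            state = update(cache[i], state)        # one coordinate descent step
            if not is_informative(cache[i], state):
                del cache[i]                       # evict from RAM
    return state

# Both loops run at the same time, e.g.:
#   lock, done, cache = threading.Lock(), threading.Event(), []
#   threading.Thread(target=reader, args=(stream, cache, CAPACITY, lock, done)).start()
#   state = trainer(cache, lock, state, update, is_informative, done)
```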
[Figure: Dual Cached Loops – a Reader thread moves data from Disk into RAM while, simultaneously, a Trainer thread updates the Parameter using the data in memory]
[Figure: the Read loop fills the working index set W with datapoints from Disk into Memory; the Train loop updates the Parameter using the datapoints indexed by W]
Which data is “uninformative”?
• A datapoint far from the current decision boundary is unlikely to become a support vector
• Ignore such a datapoint for a while.
Which data is “uninformative”?
– Condition
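The condition on this slide is an image; as an assumed reconstruction (by analogy with the shrinking rule above, not taken verbatim from the slide), a cached point i is treated as uninformative when its dual variable is at a bound and its functional margin indicates it will stay there:

$$\alpha_i = 0 \ \text{and}\ y_i\, w^{\top} x_i > 1 + \delta \qquad \text{or} \qquad \alpha_i = C \ \text{and}\ y_i\, w^{\top} x_i < 1 - \delta,$$

for some tolerance $\delta \ge 0$; such points are evicted first from main memory.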
• Datasets with Various Characteristics:
• 2GB memory for storing datapoints
• Measured relative function value
• Comparison with (Selective) Block Minimization (implemented in Liblinear)
  – ocr: dense, 45GB
  – dna: dense, 63GB
  – webspam: sparse, 20GB
  – kddb: sparse, 4.7GB
• When C gets larger (dna, C = 1)
• When C gets larger (dna, C = 10)
• When C gets larger (dna, C = 100)
• When C gets larger (dna, C = 1000)
• When memory gets larger (ocr, C = 1)
• Expanding features on the fly
  – Expand features explicitly when the reader thread loads an example into memory (a sketch of such a reader follows below):
    • Read (y, x) from the disk
    • Compute f(x) and load (y, f(x)) into RAM
  – Example: x = GTCCCACCT… is read from disk and expanded to f(x) ∈ R^12495340
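A minimal sketch of such a reader, assuming a hashed k-mer (substring) feature map for the DNA strings; the choice of k and the hashing are my assumptions, only the output dimension 12,495,340 comes from the slide.

```python
def expand_kmers(x, k=8, dim=12_495_340):
    """Map a DNA string to sparse hashed k-mer counts {feature_index: count}."""
    feats = {}
    for i in range(len(x) - k + 1):
        j = hash(x[i:i + k]) % dim          # hashed index of the k-mer
        feats[j] = feats.get(j, 0) + 1
    return feats

def reader(lines):
    """Reader loop: parse (y, x) from disk and yield (y, f(x)) for the RAM cache,
    so the expanded representation never needs to be stored on disk."""
    for line in lines:
        y, x = line.split(maxsplit=1)
        yield int(y), expand_kmers(x.strip())
```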
• 50M examples with 12M features, corresponding to 2TB of data, trained with 16GB of memory in 10 hours
• Summary
  – Linear SVM optimization when data cannot fit in memory
  – Use the scheme of Dual Cached Loops
  – Outperforms the state of the art by orders of magnitude
  – Can be extended to:
    • Logistic regression
    • Support vector regression
    • Multiclass classification
DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK)
Future/Current Work
• Utilize the same principle as Dual Cached Loops in a multi-machine algorithm
  – Transportation of data can be done efficiently without harming optimization performance
  – The key is to run communication and computation simultaneously and asynchronously
  – Can we handle the more sophisticated communication patterns that emerge in multi-machine optimization?
• (Selective) Block Minimization scheme for Large-scale SVM
[Diagram: move data from the HDD / file system to one machine; process the optimization on that one machine]
• Map-Reduce scheme for multi-machine algorithm
[Diagram: move parameters between a master node and worker nodes; process the optimization on the worker nodes]
Stratified Stochastic Gradient Descent [Gemulla, 2011]
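The illustrations for this method are image slides; as a hedged sketch (names mine), stratified SGD partitions the data into a p-by-p grid of blocks and, in each sub-epoch, processes a "diagonal" of p blocks that share no rows or columns, so p workers can run SGD on them in parallel without conflicts.

```python
def ssgd_epoch(blocks, p, sgd_on_block, state):
    """One epoch of stratified SGD over a p-by-p blocking of the data:
    each of the p sub-epochs picks p mutually non-conflicting blocks."""
    for s in range(p):
        stratum = [(w, (w + s) % p) for w in range(p)]    # disjoint rows and columns
        for row_block, col_block in stratum:              # in a real system: one block per worker, in parallel
            state = sgd_on_block(blocks[row_block][col_block], state)
    return state
```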
Asynchronous multi-machine scheme
[Diagram: parameter communication and parameter updates run at the same time]
NOMAD
Asynchronous multi-machine scheme
• Each machine holds a subset of the data
• The machines keep communicating portions of the parameters with each other
• Simultaneously, each machine keeps updating the parameters it currently possesses (see the sketch below)
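A minimal sketch of this scheme under the assumption of a ring of workers passing parameter blocks through queues; the names and the ring topology are illustrative, not the actual NOMAD implementation.

```python
import threading
from queue import Queue

def async_worker(my_data, in_q, out_q, update_block, num_rounds):
    """Repeatedly receive a parameter block, update it using only the local
    data shard, and forward it; communication and computation overlap because
    every worker always has some block to work on."""
    for _ in range(num_rounds):
        block_id, block = in_q.get()
        update_block(block, my_data)
        out_q.put((block_id, block))

def run_ring(shards, blocks, update_block, num_rounds):
    """Wire len(shards) workers in a ring, each seeded with one parameter block."""
    p = len(shards)
    queues = [Queue() for _ in range(p)]
    for i, b in enumerate(blocks):
        queues[i].put((i, b))
    threads = [threading.Thread(target=async_worker,
                                args=(shards[i], queues[i], queues[(i + 1) % p],
                                      update_block, num_rounds))
               for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return blocks
```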
• Distributed stochastic gradient descent for saddle point problems
  – Another formulation of SVM (regularized risk minimization in general); one such formulation is sketched below
  – Suitable for parallelization
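The formulation on the slide is an image; one standard saddle-point rewriting of the L1-loss linear SVM (assumed here to be the kind of reformulation meant) trades the hinge loss for a maximization over box-constrained dual variables:

$$\min_{w}\ \max_{\alpha \in [0, C]^{n}}\ \frac{1}{2}\|w\|^{2} + \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w^{\top} x_i\bigr),$$

which recovers the primal objective because $\max_{\alpha_i \in [0, C]} \alpha_i (1 - y_i w^{\top} x_i) = C \max(0,\, 1 - y_i w^{\top} x_i)$. Both $w$ and $\alpha$ can then be updated by stochastic gradient steps on randomly sampled datapoints.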
How can we scale up Machine Learning to Massive datasets?
• Exploit hardware traits
  – Disk IO is the bottleneck
  – Run Disk IO and computation simultaneously
• Distributed asynchronous optimization (ongoing)
  – Current work using multiple machines