Lecture 7: "Big" Learning

Pavel Laskov, Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, June 12, 2012
Lessons from Last Lecture

- Theory of convex optimization subject to constraints:

      min_x  f(x)   s.t.   g(x) ≤ 0

- Transformation from primal to dual problems
- Formulations of various SVM learning problems:
  - Two-class classification
  - One-class classification
  - Regression with ε-insensitive loss
Agenda for Today / Next Lectures

How to efficiently solve an optimization problem of the form

      min_α  −1⊤α + ½ α⊤Hα
      s.t.   α⊤y = 0,   0 ≤ α ≤ C

for large amounts of data:
- Sequential Minimal Optimization
- Feasible Direction Decomposition

and for huge amounts of data:
- Map-Reduce
- Stochastic Gradient Descent (next lecture)
Decomposition of the SVM Learning Problem

Main idea: freeze all but d variables, solve a subproblem, select another working set, and repeat until convergence. Split the variables into a working set w and a frozen set n:

      α = [α_w; α_n],   y = [y_w; y_n],   H = [H_ww  H_wn; H_nw  H_nn]

Subproblem formulation:

      min_{α_w}  (α_n⊤ H_nw − 1_w⊤) α_w + ½ α_w⊤ H_ww α_w
      s.t.       y_w⊤ α_w = −y_n⊤ α_n
                 0 ≤ α_w ≤ C

How do we select an optimal working set w?
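To fix ideas, here is a minimal Python sketch of this decomposition loop. The working-set selection and subproblem solver are left as placeholder callbacks (both are the subject of the following slides); all names are illustrative, not from any particular library.

```python
import numpy as np

def decomposition_loop(H, y, C, select_working_set, solve_subproblem,
                       n_iter=100):
    """Skeleton of the decomposition idea above (an illustrative sketch,
    not a full solver): freeze all but the working-set variables,
    re-optimize them, and repeat."""
    n = len(y)
    alpha = np.zeros(n)                     # alpha = 0 is always feasible
    for _ in range(n_iter):                 # a real solver would test KKT here
        w = select_working_set(alpha)       # indices of the free variables
        nw = np.setdiff1d(np.arange(n), w)  # indices of the frozen variables
        # Subproblem pieces, matching the formulation above:
        H_ww = H[np.ix_(w, w)]                        # working-set block of H
        lin = alpha[nw] @ H[np.ix_(nw, w)] - 1.0      # (alpha_n^T H_nw - 1_w^T)
        eq_rhs = -(y[nw] @ alpha[nw])                 # y_w^T alpha_w must equal this
        alpha[w] = solve_subproblem(H_ww, lin, y[w], eq_rhs, C)
    return alpha
```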
Sequential Minimal Optimization (SMO) [4]: LibSVM

Main idea: select the smallest possible working set at each iteration
⇒ analytical solution for the subproblems
⇒ easy working set selection by heuristics

[Figure: geometry of the SVM constraints for two points; the pair lies on a line segment inside the box [0, C]², whose direction depends on whether y1 ≠ y2 or y1 = y2]
SMO Derivation: Preliminaries

Define

      s = y1 y2
      v_i = Σ_{j=3}^{k} y_j α_j^old K_ij
          = f^old(x_i) + b^old − y1 α1^old K_1i − y2 α2^old K_2i

Then we can write the objective function as

      W(α1, α2) = −α1 − α2 + ½ K11 α1² + ½ K22 α2² + s K12 α1 α2
                  + y1 α1 v1 + y2 α2 v2 + const
SMO Derivation: Handling the Equality Constraint

The equality constraint implies

      α1 + s α2 = α1^old + s α2^old = γ   ⇒   α1 = γ − s α2

Substituting α1 in the objective function, we obtain

      W(α2) = −(γ − s α2) − α2
              + ½ K11 (γ − s α2)² + ½ K22 α2² + s K12 (γ − s α2) α2
              + y1 (γ − s α2) v1 + y2 α2 v2 + const
SMO Derivation: Optimality for α2

Differentiating W w.r.t. α2, we obtain

      ∂W(α2)/∂α2 = α2 (K11 + K22 − 2 K12)
                   + s − 1 − s K11 γ + s K12 γ − y2 v1 + y2 v2

Hence the optimal α2 must satisfy the condition

      α2 (K11 + K22 − 2 K12) = s (K11 − K12) γ + y2 (v1 − v2) + 1 − s

Here comes the mess...
SMO Derivation: Update of α2

Substituting the expressions for v1, v2, and γ, we obtain

      α2 (K11 + K22 − 2 K12)
        = s (K11 − K12)(α1^old + s α2^old)
          + y2 (f^old(x1) + b^old − y1 α1^old K11 − y2 α2^old K12)
          − y2 (f^old(x2) + b^old − y1 α1^old K12 − y2 α2^old K22)
          + y2 y2 − y1 y2
        = α2^old (K11 + K22 − 2 K12)
          + y2 ((f^old(x1) − y1) − (f^old(x2) − y2))

Hence, writing E_i = f^old(x_i) − y_i for the prediction errors, α2 can be updated as

      α2 = α2^old + y2 (E1 − E2) / (K11 + K22 − 2 K12)
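Below is a minimal Python sketch of one such pairwise update, assuming a precomputed kernel matrix K and cached outputs f_old; it also applies the box clipping that keeps the pair inside [0, C]², which the derivation above omits. All names are illustrative.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, f_old):
    """One SMO step on the pair (i, j), following the update formula above."""
    s = y[i] * y[j]
    E_i = f_old[i] - y[i]                    # prediction errors E = f_old(x) - y
    E_j = f_old[j] - y[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]    # denominator of the update
    if eta <= 0:
        return alpha                         # degenerate pair; real solvers treat this case separately
    a_j = alpha[j] + y[j] * (E_i - E_j) / eta
    # Clip a_j so that (alpha_i, a_j) stays on the constraint line inside the box
    gamma = alpha[i] + s * alpha[j]
    if s == 1:
        lo, hi = max(0.0, gamma - C), min(C, gamma)
    else:
        lo, hi = max(0.0, -gamma), min(C, C - gamma)
    a_j = float(np.clip(a_j, lo, hi))
    alpha = alpha.copy()
    alpha[j] = a_j
    alpha[i] = gamma - s * a_j               # restore the equality constraint
    return alpha
```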
SMO: Working Set Selection

Choosing the first element of the pair:
- Iterate over all examples with 0 < α < C (strictly!) that violate the KKT conditions.
- If no such examples can be found, pick some of the remaining examples that violate the KKT conditions.
- If no such examples can be found, we are done!

Choosing the second element of the pair:
- We want the element that provides the largest update...
- Computing the denominator in the update formula is expensive, but the numerator can be evaluated cheaply...
- If E1 < 0, choose the element with the largest E2; if E1 > 0, choose the element with the smallest E2.
- Further heuristics are used if these do not help...
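The second-choice heuristic amounts to a single pass over the cached errors; a minimal sketch (function and variable names are illustrative):

```python
def choose_second(i, E, candidates):
    """Pick the partner j for i that maximizes the cheap proxy |E[i] - E[j]|,
    per the heuristic above, using only cached error values."""
    if E[i] > 0:
        return min(candidates, key=lambda j: E[j])   # smallest E2
    return max(candidates, key=lambda j: E[j])       # largest E2
```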
SMO Summary

Quick and dirty:
+ A single iteration is extremely fast
− The number of iterations is difficult to control
− Convergence analysis is very cumbersome

Other "tricks of the trade":
- Error caching: keep previously computed error values whenever possible
- Special speed-up rules for linear SVMs and for sparse data

How can we guarantee an optimal choice of the working set at each iteration?
KKT Revisited

Theorem (Karush-Kuhn-Tucker)
The point α is an optimal solution of the optimization problem

      min_α  −1⊤α + ½ α⊤Hα
      s.t.   α⊤y = 0,   0 ≤ α ≤ C

if and only if there exists a vector u = (μ, Π⊤, Υ⊤) such that

      Hα − 1 + μy + Υ − Π = 0      (Lagrangian stationarity)
      Π ≥ 0,  Υ ≥ 0                (nonnegative Lagrange multipliers)
      Υ⊤(α − C1) = 0,  Π⊤α = 0     (complementary slackness)
Analysis of KKT Conditions

The first two conditions can be combined into

      Π = Hα − 1 + μy + Υ = −g + μy + Υ ≥ 0,   where g := 1 − Hα.

When is this system of inequalities inconsistent?
⇒ the KKT conditions are not satisfied for the working set
⇒ re-optimization of the working set brings an improvement
Conditions for Inconsistency of KKT: Derivation

Each line of the KKT system of inequalities has the form

      π_i = −g_i + y_i μ + υ_i ≥ 0

The complementary slackness conditions fall into three cases (note that g_i/y_i = g_i y_i since y_i = ±1):

- 0 < α_i < C implies π_i = 0 and υ_i = 0. The KKT inequality becomes
      −g_i + y_i μ = 0   ⇒   μ = g_i y_i
- α_i = 0 implies υ_i = 0. The KKT inequality becomes
      −g_i + y_i μ ≥ 0   ⇒   μ ≥ g_i y_i   (reversed for y_i = −1)
- α_i = C implies π_i = 0. The KKT inequality becomes
      −g_i + y_i μ ≤ 0   ⇒   μ ≤ g_i y_i   (reversed for y_i = −1)
Conditions for Inconsistency of KKT: Result

Every KKT inequality restricts the variable μ to an interval M_i on the real line:

      M_i = [g_i y_i]  (a single point)   if 0 < α_i < C
            [g_i, +∞)                     if α_i = 0 and y_i = 1
            (−∞, −g_i]                    if α_i = 0 and y_i = −1
            (−∞, g_i]                     if α_i = C and y_i = 1
            [−g_i, +∞)                    if α_i = C and y_i = −1

[Figure: μ-intervals on the real line; an "inconsistency gap" between them indicates a suboptimal working set]

Theorem (Laskov, 1999)
A working set in the SVM decomposition is suboptimal if and only if the intersection of all its μ-intervals is empty.
Selection of Maximally Inconsistent Working Sets

Let q be the desired size of the working set.

- Compute the intervals M_i = [μ_i^left, μ_i^right] for all training examples.
- Select the q/2 points with the largest values μ_i^left.
- Select the q/2 points with the smallest values μ_i^right.
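A minimal Python sketch of this selection rule, assuming the gradient term g = 1 − Hα is available; the case analysis follows the table on the previous slide, and a real implementation would additionally keep the two halves disjoint:

```python
import numpy as np

def select_working_set(alpha, y, g, C, q):
    """Maximal-inconsistency selection: build the mu-intervals and pick the
    q/2 largest left endpoints and the q/2 smallest right endpoints."""
    n = len(alpha)
    left = np.full(n, -np.inf)               # left endpoints mu_i^left
    right = np.full(n, np.inf)               # right endpoints mu_i^right
    for i in range(n):
        if 0 < alpha[i] < C:                 # single-point interval
            left[i] = right[i] = g[i] * y[i]
        elif (alpha[i] == 0 and y[i] == 1) or (alpha[i] == C and y[i] == -1):
            left[i] = g[i] * y[i]            # interval [g_i y_i, +inf)
        else:
            right[i] = g[i] * y[i]           # interval (-inf, g_i y_i]
    return np.union1d(np.argsort(left)[-(q // 2):],   # largest left endpoints
                      np.argsort(right)[:q // 2])     # smallest right endpoints
```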
Back to Theory: Method of Feasible Directions

Consider a general optimization problem  min_α f(α)  s.t.  Aα ≤ b.

Problem
Let α0 be some feasible point, i.e., Aα0 ≤ b. How can we find a direction d = α − α0 that
(a) improves the objective function, and
(b) does not violate the constraints?

Solution
The optimal feasible direction is given by the following linear program:

      min_d  ∇f(α0)⊤ d
      s.t.   A d ≤ 0
             ||d|| ≤ 1
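As a sketch, the direction-finding LP can be posed directly with scipy. To keep the problem linear we assume the ∞-norm ball ||d||_∞ ≤ 1 (any norm would do in principle), and A_active is assumed to hold the constraint rows active at α0:

```python
import numpy as np
from scipy.optimize import linprog

def feasible_direction(grad, A_active):
    """Solve min grad^T d  s.t.  A_active d <= 0, ||d||_inf <= 1."""
    n = len(grad)
    res = linprog(c=grad,                        # linearized objective
                  A_ub=A_active,                 # stay feasible: A d <= 0
                  b_ub=np.zeros(A_active.shape[0]),
                  bounds=[(-1.0, 1.0)] * n)      # infinity-norm ball
    return res.x                                 # a descent direction if grad^T d < 0
```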
Maximal Inconsistency vs. Feasible Directions

The maximal inconsistency algorithm corresponds to the feasible direction method with the normalization d_i ∈ {−1, 0, 1} and the maximum number of non-zero directions bounded by q:

- The equality constraint y⊤d = 0 corresponds to the selection of exactly q/2 elements in each pass.
- The "left pass" contributes the largest values g_i y_i to the objective function.
- The "right pass" contributes the smallest values g_i y_i to the objective function.
Feasible Direction Decomposition: Running Time

Scaling factors
- Measure execution time for various dataset sizes.
- Fit a linear regression on a log-log scale and record the slope of the regression function.

Convergence
- Study the behaviour of the ratio  ||W_{k+1} − W*|| / ||W_k − W*||^p

Analysis
- An O(nKC/ε + n² log(Cn/K)) upper bound on the overall run time.

[Figure: scaling factors for a regression SVM, with and without FD decomposition]
Summary of Feasible Direction Decomposition

The only practical general method for large-scale training of SVMs [3]:
- Reasonable speed ("quadratic" in the number of examples)
- Provable convergence (the proofs are quite complex)
- Linear convergence rate
- Some improved working set selection strategies exist

Limitations:
- Inherently sequential
- Storing the full kernel matrix is impossible for "Internet-age" data volumes
Approaches for Huge Datasets

How do you train an SVM on 1 billion examples with 10,000 dimensions?

Focus on the easy case: linear SVMs
- Newton methods
- Subgradient methods
- Bundle methods

Parallelize
- Map-Reduce of various flavours

Ignore most of the irrelevant data
- Online learning: stochastic gradient descent
Map-Reduce Paradigm [2]

Map: process a given key-value pair and produce a list of output key-value pairs.
Reduce: given a list of values for a certain key, produce a smaller list of values.

[Figure: input data → Map → intermediate representation → Reduce → output data]
Map-Reduce Examples

Distributed grep
- Map: emit a line if it matches a supplied pattern.
- Reduce: copy the intermediate data to the output.

Count of URL access frequency
- Map: process a log file and output entries of the form (URL, 1).
- Reduce: add all values for each URL and produce entries of the form (URL, total count).
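As a toy illustration of the URL-frequency example, a minimal single-machine sketch (the real framework distributes both phases across many workers; the log format here is an assumption):

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit (URL, 1) for every log entry."""
    for line in log_lines:
        url = line.split()[0]              # assume the URL is the first field
        yield url, 1

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each URL."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

log = ["/index.html 200", "/about.html 200", "/index.html 304"]
print(reduce_phase(map_phase(log)))        # {'/index.html': 2, '/about.html': 1}
```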
Map-Reduce for SMO

Modified SMO variant:
- Find the two elements with the smallest and the largest classification function values (cf. feasible direction decomposition!).
- Update the dual weights using the SMO formulas.
- Update the classification function values for all examples according to the new weights.

Map-Reduce implementation:
- Map: classification update
- Reduce: search for the min/max elements

Performance improvement [1]:
- 5–32× for training
- 120–150× for classification
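A minimal sketch of how the two phases might look on sharded data (illustrative only; delta_f_shard stands for the change in the classification function induced by the new dual weights, computed from the shard's kernel columns):

```python
import numpy as np

def map_step(global_idx, f_shard, delta_f_shard):
    """Map: apply the classification update on one shard and emit the
    shard's extreme elements as (global index, value) candidates."""
    f_shard = f_shard + delta_f_shard
    lo, hi = np.argmin(f_shard), np.argmax(f_shard)
    return f_shard, (global_idx[lo], f_shard[lo]), (global_idx[hi], f_shard[hi])

def reduce_step(candidates):
    """Reduce: pick the global min and max over all shards' candidates;
    these two elements form the next SMO working pair."""
    lows, highs = zip(*candidates)
    return min(lows, key=lambda c: c[1]), max(highs, key=lambda c: c[1])
```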
Summary

- Solving the optimization problems that arise in SVM training requires special-purpose algorithms.
- The key ideas for conventional speed-up of SVM training are decomposition and gradient-optimal working set selection.
- Parallelization can be used to scale learning algorithms up to billions of examples.
Bibliography

[1] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104–111, 2008.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008.

[3] Pavel Laskov. Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46:315–349, March 2002.

[4] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, 1999.