Lecture 7: "Big" Learning

Pavel Laskov and Blaine Nelson
Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany
Advanced Topics in Machine Learning, June 12, 2012

Lessons from Last Lecture

Theory of convex optimization subject to constraints:

    min_x  f(x)   s.t.   g(x) ≤ 0

Transformation from primal to dual problems.

Formulations of various SVM learning problems:
- Two-class classification
- One-class classification
- Regression with ε-insensitive loss

Agenda for Today / Next Lectures

How to efficiently solve optimization problems of the form

    min_α  −1⊤α + (1/2) α⊤Hα
    s.t.   α⊤y = 0,  0 ≤ α ≤ C

for large amounts of data:
- Sequential Minimal Optimization
- Feasible Direction Decomposition

and for huge amounts of data:
- Map-Reduce
- Stochastic Gradient Descent (next lecture)
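To make the quantities in this problem concrete, here is a small helper for building the matrix H. It is our own illustration (not from the slides), using the standard SVM dual relation H_ij = y_i y_j k(x_i, x_j), with an RBF kernel chosen purely as an example.

    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        """Gram matrix K with K_ij = exp(-gamma * ||x_i - x_j||^2)."""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-gamma * d2)

    def dual_hessian(X, y, gamma=1.0):
        """H with H_ij = y_i y_j k(x_i, x_j), as used in the dual objective above."""
        K = rbf_kernel(X, gamma)
        return (y[:, None] * y[None, :]) * K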

Decomposition of the SVM Learning Problem

Main idea: freeze all but d variables, solve a subproblem, select another working set, and repeat until convergence.

    α = [ α_w ]      y = [ y_w ]      H = [ H_ww  H_wn ]
        [ α_n ]          [ y_n ]          [ H_nw  H_nn ]

Subproblem formulation:

    min_{α_w}  (α_n⊤ H_nw − 1_w⊤) α_w + (1/2) α_w⊤ H_ww α_w
    s.t.       y_w⊤ α_w = −y_n⊤ α_n
               0 ≤ α_w ≤ C

How to select an optimal working set w?
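The decomposition scheme can be summarized in a few lines of Python. This is only a sketch of the generic outer loop, under the assumption that a subproblem solver and a working-set selector (for example the ones discussed later) are supplied by the caller; none of the names below come from the slides.

    import numpy as np

    def decomposition_svm(H, y, C, select_working_set, solve_subproblem, max_iter=1000):
        """Generic working-set decomposition: repeatedly re-optimize a small block
        of dual variables while all remaining variables stay frozen."""
        n = len(y)
        alpha = np.zeros(n)                       # feasible starting point
        for _ in range(max_iter):
            w = select_working_set(alpha, H, y, C)
            if w is None:                         # no KKT violations left: optimal
                break
            rest = np.setdiff1d(np.arange(n), w)
            # Solve the subproblem in alpha_w with alpha_rest frozen.
            alpha[w] = solve_subproblem(H[np.ix_(w, w)], H[np.ix_(rest, w)],
                                        alpha[rest], y[w], y[rest], C)
        return alpha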

Sequential Minimal Optimization (SMO): LibSVM

Main idea: select the smallest possible working set at each iteration

⇒ analytical solution of the subproblems
⇒ easy working set selection by heuristics

Geometry of the SVM constraints for two points (the slide's figure shows the box [0, C]² cut by the equality constraint): for y1 ≠ y2 the pair (α1, α2) moves along a diagonal of slope 1, for y1 = y2 along a diagonal of slope −1.

SMO Derivation: Preliminaries

Define

    s = y1 y2
    v_i = Σ_{j=3}^{k} y_j α_j^old K_ij
        = f^old(x_i) + b^old − y1 α1^old K_1i − y2 α2^old K_2i

Then we can write the objective function as

    W(α1, α2) = −α1 − α2 + (1/2) K11 α1² + (1/2) K22 α2² + s K12 α1 α2
                + y1 α1 v1 + y2 α2 v2 + const

SMO Derivation: Handling the Equality Constraint

The equality constraint implies

    α1 + s α2 = α1^old + s α2^old = γ   ⇒   α1 = γ − s α2

Substituting α1 in the objective function, we obtain

    W(α2) = −(γ − s α2) − α2
            + (1/2) K11 (γ − s α2)² + (1/2) K22 α2² + s K12 (γ − s α2) α2
            + y1 (γ − s α2) v1 + y2 α2 v2 + const

SMO Derivation: Optimality for α2

Differentiating W w.r.t. α2, we obtain

    ∂W(α2)/∂α2 = α2 (K11 + K22 − 2 K12) + s − 1 − s K11 γ + s K12 γ − y2 v1 + y2 v2

Hence the optimal α2 must satisfy the condition

    α2 (K11 + K22 − 2 K12) = s (K11 − K12) γ + y2 (v1 − v2) + 1 − s

Here comes the mess...

SMO Derivation: Update of α2

Substituting the expressions for v1, v2 and γ, we obtain

    α2 (K11 + K22 − 2 K12) = s (K11 − K12)(α1^old + s α2^old)
                             + y2 (f^old(x1) + b^old − y1 α1^old K11 − y2 α2^old K12)
                             − y2 (f^old(x2) + b^old − y1 α1^old K12 − y2 α2^old K22)
                             + y2 y2 − y1 y2
                           = α2^old (K11 + K22 − 2 K12)
                             + y2 ((f^old(x1) − y1) − (f^old(x2) − y2))

Hence, writing E_i = f^old(x_i) − y_i, α2 can be updated as follows:

    α2 = α2^old + y2 (E1 − E2) / (K11 + K22 − 2 K12)
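As an illustration, here is a minimal Python sketch of the resulting two-variable step. The function name, the error cache f_old, and the clipping bounds L, H are our own additions; the bounds implement the box-constraint handling from Platt's SMO paper [4], which the slides only allude to ("here comes the mess").

    def smo_pair_update(i, j, alpha, y, K, C, f_old, eps=1e-12):
        """One SMO step on the working set {i, j}, using
        alpha_j += y_j (E_i - E_j) / (K_ii + K_jj - 2 K_ij)."""
        E_i, E_j = f_old[i] - y[i], f_old[j] - y[j]
        eta = K[i, i] + K[j, j] - 2.0 * K[i, j]      # curvature along the feasible line
        if eta < eps:                                # degenerate direction: skip this pair
            return False
        a_j = alpha[j] + y[j] * (E_i - E_j) / eta    # unconstrained optimum for alpha_j

        # Clip to the segment of the box [0, C]^2 allowed by the equality constraint.
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        a_j = min(max(a_j, L), H)

        s = y[i] * y[j]
        alpha[i] += s * (alpha[j] - a_j)             # keep y^T alpha constant
        alpha[j] = a_j
        return True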

SMO: Working Set Selection

Choosing the first element in the pair:
- Iterate over all examples with 0 < α < C (strictly!) that violate the KKT conditions.
- If no such examples can be found, pick some of the remaining examples that violate KKT.
- If no such examples can be found, we are done!

Choosing the second element in the pair:
- We want an element that provides the largest update...
- Computing the denominator in the update formula is expensive, but the numerator can be evaluated cheaply...
- If E1 < 0, choose the element with the largest E2; if E1 > 0, choose the element with the smallest E2.
- Further heuristics are used if these do not help...
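A sketch of the second-choice heuristic in Python (our own illustration, not LibSVM's code), assuming an error cache E_k = f^old(x_k) − y_k is available for all examples:

    def select_second(i, E):
        """Pick j to roughly maximize |E_i - E_j|: the largest E_j if E_i < 0,
        the smallest E_j if E_i > 0."""
        candidates = [j for j in range(len(E)) if j != i]
        if E[i] < 0:
            return max(candidates, key=lambda j: E[j])
        return min(candidates, key=lambda j: E[j])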

SMO Summary

Quick and dirty:
+ a single iteration is extremely fast
− the number of iterations is difficult to control
− convergence analysis is very cumbersome

Other "tricks of the trade":
- Error caching: keep previously computed error values if possible
- Special speed-up rules for linear SVMs and for sparse data

How can we guarantee the optimal choice of a working set at each iteration?

KKT Re-visited

Theorem (Karush-Kuhn-Tucker)
The point α is an optimal solution of the optimization problem

    min_α  −1⊤α + (1/2) α⊤Hα
    s.t.   α⊤y = 0,  0 ≤ α ≤ C

if and only if there exists a vector u = (µ, Π⊤, Υ⊤) such that

    Hα − 1 + µy + Υ − Π = 0        # stationarity of the Lagrangian
    Π ≥ 0,  Υ ≥ 0                  # nonnegative Lagrange multipliers
    Υ⊤(α − C1) = 0,  Π⊤α = 0       # complementary slackness

Analysis of KKT Conditions

The first two conditions can be combined into

    Π = (Hα − 1) + µy + Υ ≥ 0,    where the gradient term is abbreviated as Hα − 1 = −g

When is the system of inequalities above inconsistent?
⇒ KKT is not satisfied for the working set
⇒ re-optimization of the working set brings improvement

Conditions for Inconsistency of KKT: Derivation

Each line of the KKT system of inequalities is of the form

    π_i = −g_i + y_i µ + υ_i ≥ 0

The complementary slackness conditions fall into three cases:

- 0 < α_i < C implies π_i = 0 and υ_i = 0. The KKT inequality becomes
      −g_i + y_i µ = 0   ⇒   µ = g_i / y_i
- α_i = 0 implies υ_i = 0. The KKT inequality becomes
      −g_i + y_i µ ≥ 0   ⇒   y_i µ ≥ g_i
- α_i = C implies π_i = 0. The KKT inequality becomes
      −g_i + y_i µ ≤ 0   ⇒   y_i µ ≤ g_i

(Dividing by y_i = ±1 flips the inequality for y_i = −1, which produces the case distinction on the next slide.)

Conditions for Inconsistency of KKT: Result

Every KKT inequality restricts the variable µ to an interval M_i on the real line:

    M_i = [g_i/y_i, g_i/y_i]  (a single point)   if 0 < α_i < C
    M_i = [g_i, +∞)                              if α_i = 0 and y_i = 1
    M_i = (−∞, −g_i]                             if α_i = 0 and y_i = −1
    M_i = (−∞, g_i]                              if α_i = C and y_i = 1
    M_i = [−g_i, +∞)                             if α_i = C and y_i = −1

(The slide's figure shows these µ-intervals on the real line, with an empty intersection appearing as an "inconsistency gap".)

Theorem (Laskov, 1999)
A working set in the SVM decomposition is suboptimal if and only if the intersection of all its µ-intervals is empty.
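The optimality test is easy to state in code. The following Python sketch is our own illustration of the theorem, assuming g = 1 − Hα and y_i ∈ {−1, +1} as on the slides; intervals are represented as (left, right) pairs.

    import math

    def mu_interval(alpha_i, g_i, y_i, C):
        """µ-interval M_i for one example, following the case table above."""
        if 0.0 < alpha_i < C:
            point = g_i / y_i
            return (point, point)                                   # single point
        if alpha_i == 0.0:
            return (g_i, math.inf) if y_i == 1 else (-math.inf, -g_i)
        return (-math.inf, g_i) if y_i == 1 else (-g_i, math.inf)   # alpha_i == C

    def is_suboptimal(indices, alpha, g, y, C):
        """Laskov (1999): the working set is suboptimal iff its µ-intervals have an
        empty intersection, i.e. the largest left endpoint exceeds the smallest
        right endpoint."""
        ivs = [mu_interval(alpha[i], g[i], y[i], C) for i in indices]
        return max(l for l, _ in ivs) > min(r for _, r in ivs)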

Selection of Maximally Inconsistent Working Sets

Let q be the desired size of the working set.
- Compute the intervals M_i = [µ_i^left, µ_i^right] for all training examples.
- Select q/2 points with the largest values of µ_i^left.
- Select q/2 points with the smallest values of µ_i^right.
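This selection rule can be sketched directly on top of the mu_interval helper from the previous listing (again an illustration only; ties and overlapping selections would need more care in a real solver):

    def maximally_inconsistent_working_set(alpha, g, y, C, q):
        """Return the q/2 examples with the largest left endpoints and the q/2
        with the smallest right endpoints of their µ-intervals."""
        n = len(alpha)
        ivs = [mu_interval(alpha[i], g[i], y[i], C) for i in range(n)]
        by_left = sorted(range(n), key=lambda i: ivs[i][0], reverse=True)
        by_right = sorted(range(n), key=lambda i: ivs[i][1])
        return sorted(set(by_left[:q // 2]) | set(by_right[:q // 2]))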

Back to Theory: Method of Feasible Directions

Consider a general optimization problem  min_α f(α)  s.t.  Aα ≤ b.

Problem
Let α0 be some feasible point, i.e., Aα0 ≤ b. How can we find a direction d = α − α0 which
(a) improves the objective function, and
(b) does not violate the constraints?

Solution
The optimal feasible direction is given by the following linear program:

    min_d  ∇f(α0)⊤ d
    s.t.   Ad ≤ 0
           ||d|| ≤ 1
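For intuition, this LP is straightforward to pose with an off-the-shelf solver. The sketch below uses scipy.optimize.linprog and picks the box normalization −1 ≤ d_i ≤ 1; that choice is our assumption, since the slide writes ||d|| ≤ 1 without fixing a norm, and the next slide's d_i ∈ {−1, 0, 1} suggests the max-norm.

    import numpy as np
    from scipy.optimize import linprog

    def feasible_direction(grad, A):
        """Solve min_d grad^T d subject to A d <= 0 and -1 <= d_i <= 1."""
        m, n = A.shape
        res = linprog(c=grad, A_ub=A, b_ub=np.zeros(m), bounds=[(-1.0, 1.0)] * n)
        return res.x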

Maximal Inconsistency vs. Feasible Directions

The maximal inconsistency algorithm corresponds to the feasible direction method with the normalization d_i ∈ {−1, 0, 1} and the maximum number of non-zero directions bounded by q:
- The equality constraint y⊤d = 0 corresponds to the selection of exactly q/2 elements in each pass.
- The "left pass" contributes the largest values g_i y_i to the objective function.
- The "right pass" contributes the smallest values g_i y_i to the objective function.

Feasible Direction Decomposition: Running Time

Scaling factors:
- Measure execution time for various dataset sizes.
- Fit a linear regression on a log-log scale and record the slope of the regression function.

Convergence:
- Study the behaviour of the ratio

      ||W_{k+1} − W*|| / ||W_k − W*||^p

Analysis:
- An O(nKC/ε + n² log(Cn/K)) upper bound on the overall run time.

(Figure: scaling factors for regression SVM, comparing "no decomposition" with "FD decomposition".)
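The empirical scaling factor mentioned above is just the slope of a least-squares fit in log-log coordinates. A minimal sketch (our own, with hypothetical timings in the comment):

    import numpy as np

    def scaling_exponent(sizes, times):
        """Fit log(time) = a * log(size) + b and return the slope a;
        a value near 2 indicates roughly quadratic scaling."""
        slope, _intercept = np.polyfit(np.log(sizes), np.log(times), 1)
        return slope

    # e.g. scaling_exponent([1000, 2000, 4000, 8000], [1.2, 4.9, 19.8, 80.1]) -> ~2.0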

Summary of Feasible Direction Decomposition

The only practical general method for large-scale training of SVMs:
- Reasonable speed ("quadratic" in the number of examples)
- Provable convergence (proofs are quite complex)
- Linear convergence rate
- Some improved working set selection strategies exist

Limitations:
- Inherently sequential
- Storage of the full kernel matrix is impossible for "Internet-age" data volumes

Approaches for Huge Datasets

How do you train an SVM on 1 billion examples with 10,000 dimensions?

Focus on easy cases: linear SVM
- Newton methods
- Subgradient methods
- Bundle methods

Parallelize
- Map-Reduce of various flavours

Ignore most of the irrelevant data
- Online learning: stochastic gradient descent

Map-Reduce Paradigm

Map: process a given key-value pair and produce a list of output key-value pairs.

Reduce: given a list of values for a certain key, produce a smaller list of values.

Data flow: input data → Map → intermediate representation → Reduce → output data.

Map-Reduce Examples

Distributed grep
- Map: emit a line if it matches a supplied pattern
- Reduce: copy the intermediate data to the output

Count of URL access frequency
- Map: process a log file and output entries in the form (URL, 1)
- Reduce: add all values for each URL and produce entries in the form (URL, total count)
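A toy, single-machine rendering of the URL-count example in Python; it only mimics the map/group/reduce phases and assumes the URL is the first whitespace-separated field of each log line (no real Map-Reduce framework is involved):

    from collections import defaultdict

    def map_log_line(line):
        url = line.split()[0]          # emit (URL, 1) for each log entry
        yield (url, 1)

    def reduce_counts(url, values):
        yield (url, sum(values))       # total access count per URL

    def run(lines):
        grouped = defaultdict(list)
        for line in lines:
            for key, value in map_log_line(line):
                grouped[key].append(value)         # the "shuffle": group by key
        return dict(kv for key, values in grouped.items()
                       for kv in reduce_counts(key, values))

    # run(["/index.html 200", "/index.html 200", "/about 200"])
    # -> {"/index.html": 2, "/about": 1}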

Map-Reduce for SMO

Modified SMO variant:
- Find the two elements with the smallest and the largest classification function values (cf. feasible direction decomposition).
- Update the dual weights using the SMO formulas.
- Update the classification function values for all examples according to the new weights.

Map-Reduce implementation:
- Map: classification update
- Reduce: search for the min/max elements

Performance improvement:
- ×5–32 for training
- ×120–150 for classification
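The split between the two phases can be sketched as follows; the shard layout, function names, and arguments are our own assumptions about how the scheme above could look in code, not the implementation from the cited work:

    import numpy as np

    def map_update_f(f_shard, K_i_shard, K_j_shard, y_i, y_j, d_alpha_i, d_alpha_j):
        """Map: refresh the classification values of one data shard after the
        pair (i, j) of dual weights has changed by (d_alpha_i, d_alpha_j)."""
        return f_shard + y_i * d_alpha_i * K_i_shard + y_j * d_alpha_j * K_j_shard

    def reduce_extremes(f_shards):
        """Reduce: find the global indices of the smallest and largest
        classification values across all shards."""
        f = np.concatenate(f_shards)
        return int(np.argmin(f)), int(np.argmax(f))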

Summary

- The solution of the optimization problems arising in SVM training requires special-purpose algorithms.
- The key ideas behind conventional speed-ups of SVM training are decomposition and gradient-optimal working set selection.
- Parallelization can be used to scale the learning algorithms up to billions of examples.

Bibliography

[1] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104–111, 2008.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008.

[3] Pavel Laskov. Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46:315–349, March 2002.

[4] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, 1999.