Lecture 7: "Big" Learning

Pavel Laskov, Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, June 12, 2012
Lessons from Last Lecture

- Theory of convex optimization subject to constraints:

      min_x  f(x)   s.t.   g(x) ≤ 0

- Transformation from primal to dual problems
- Formulations of various SVM learning problems:
  - Two-class classification
  - One-class classification
  - Regression with ε-insensitive loss
Agenda for Today / Next Lectures

How to efficiently solve an optimization problem of the form

      min_α  −1⊤α + ½ α⊤Hα
      s.t.   α⊤y = 0,   0 ≤ α ≤ C

for large amounts of data:
- Sequential Minimal Optimization
- Feasible Direction Decomposition

and for huge amounts of data:
- Map-Reduce
- Stochastic Gradient Descent (next lecture)
Decomposition of the SVM Learning Problem

Main idea: freeze all but d variables, solve a subproblem, select another working set, and repeat until convergence. Split the variables into a working set w and a frozen set n:

      α = [α_w; α_n],   y = [y_w; y_n],   H = [H_ww  H_wn; H_nw  H_nn]

Subproblem formulation:

      min_{α_w}  (α_n⊤ H_nw − 1_w⊤) α_w + ½ α_w⊤ H_ww α_w
      s.t.       y_w⊤ α_w = −y_n⊤ α_n
                 0 ≤ α_w ≤ C

How do we select an optimal working set w?
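To fix ideas, here is a minimal Python sketch of this decomposition loop. The working-set selection and subproblem solver are left as placeholder callbacks (both are the subject of the following slides); all names are illustrative, not from any particular library.

```python
import numpy as np

def decomposition_loop(H, y, C, select_working_set, solve_subproblem,
                       n_iter=100):
    """Skeleton of the decomposition idea above (an illustrative sketch,
    not a full solver): freeze all but the working-set variables,
    re-optimize them, and repeat."""
    n = len(y)
    alpha = np.zeros(n)                     # alpha = 0 is always feasible
    for _ in range(n_iter):                 # a real solver would test KKT here
        w = select_working_set(alpha)       # indices of the free variables
        nw = np.setdiff1d(np.arange(n), w)  # indices of the frozen variables
        # Subproblem pieces, matching the formulation above:
        H_ww = H[np.ix_(w, w)]                        # working-set block of H
        lin = alpha[nw] @ H[np.ix_(nw, w)] - 1.0      # (alpha_n^T H_nw - 1_w^T)
        eq_rhs = -(y[nw] @ alpha[nw])                 # y_w^T alpha_w must equal this
        alpha[w] = solve_subproblem(H_ww, lin, y[w], eq_rhs, C)
    return alpha
```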
Sequential Minimal Optimization (SMO) [4]: LibSVM

Main idea: select the smallest possible working set at each iteration
⇒ analytical solution for the subproblems
⇒ easy working set selection by heuristics

[Figure: geometry of the SVM constraints for two points; the pair lies on a line segment inside the box [0, C]², whose direction depends on whether y1 ≠ y2 or y1 = y2]
SMO Derivation: Preliminaries

Define

      s = y1 y2
      v_i = Σ_{j=3}^{k} y_j α_j^old K_ij
          = f^old(x_i) + b^old − y1 α1^old K_1i − y2 α2^old K_2i

Then we can write the objective function as

      W(α1, α2) = −α1 − α2 + ½ K11 α1² + ½ K22 α2² + s K12 α1 α2
                  + y1 α1 v1 + y2 α2 v2 + const
SMO Derivation: Handling the Equality Constraint

The equality constraint implies

      α1 + s α2 = α1^old + s α2^old = γ   ⇒   α1 = γ − s α2

Substituting α1 in the objective function, we obtain

      W(α2) = −(γ − s α2) − α2
              + ½ K11 (γ − s α2)² + ½ K22 α2² + s K12 (γ − s α2) α2
              + y1 (γ − s α2) v1 + y2 α2 v2 + const
SMO Derivation: Optimality for α2

Differentiating W w.r.t. α2, we obtain

      ∂W(α2)/∂α2 = α2 (K11 + K22 − 2 K12)
                   + s − 1 − s K11 γ + s K12 γ − y2 v1 + y2 v2

Hence the optimal α2 must satisfy the condition

      α2 (K11 + K22 − 2 K12) = s (K11 − K12) γ + y2 (v1 − v2) + 1 − s

Here comes the mess...
SMO Derivation: Update of α2

Substituting the expressions for v1, v2, and γ, we obtain

      α2 (K11 + K22 − 2 K12)
        = s (K11 − K12)(α1^old + s α2^old)
          + y2 (f^old(x1) + b^old − y1 α1^old K11 − y2 α2^old K12)
          − y2 (f^old(x2) + b^old − y1 α1^old K12 − y2 α2^old K22)
          + y2 y2 − y1 y2
        = α2^old (K11 + K22 − 2 K12)
          + y2 ((f^old(x1) − y1) − (f^old(x2) − y2))

Hence, writing E_i = f^old(x_i) − y_i for the prediction errors, α2 can be updated as

      α2 = α2^old + y2 (E1 − E2) / (K11 + K22 − 2 K12)
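Below is a minimal Python sketch of one such pairwise update, assuming a precomputed kernel matrix K and cached outputs f_old; it also applies the box clipping that keeps the pair inside [0, C]², which the derivation above omits. All names are illustrative.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, f_old):
    """One SMO step on the pair (i, j), following the update formula above."""
    s = y[i] * y[j]
    E_i = f_old[i] - y[i]                    # prediction errors E = f_old(x) - y
    E_j = f_old[j] - y[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]    # denominator of the update
    if eta <= 0:
        return alpha                         # degenerate pair; real solvers treat this case separately
    a_j = alpha[j] + y[j] * (E_i - E_j) / eta
    # Clip a_j so that (alpha_i, a_j) stays on the constraint line inside the box
    gamma = alpha[i] + s * alpha[j]
    if s == 1:
        lo, hi = max(0.0, gamma - C), min(C, gamma)
    else:
        lo, hi = max(0.0, -gamma), min(C, C - gamma)
    a_j = float(np.clip(a_j, lo, hi))
    alpha = alpha.copy()
    alpha[j] = a_j
    alpha[i] = gamma - s * a_j               # restore the equality constraint
    return alpha
```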
SMO: Working Set Selection

Choosing the first element of the pair:
- Iterate over all examples with 0 < α < C (strictly!) that violate the KKT conditions.
- If no such examples can be found, pick some of the remaining examples that violate the KKT conditions.
- If no such examples can be found, we are done!

Choosing the second element of the pair:
- We want the element that provides the largest update...
- Computing the denominator in the update formula is expensive, but the numerator can be evaluated cheaply...
- If E1 < 0, choose the element with the largest E2; if E1 > 0, choose the element with the smallest E2.
- Further heuristics are used if these do not help...
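The second-choice heuristic amounts to a single pass over the cached errors; a minimal sketch (function and variable names are illustrative):

```python
def choose_second(i, E, candidates):
    """Pick the partner j for i that maximizes the cheap proxy |E[i] - E[j]|,
    per the heuristic above, using only cached error values."""
    if E[i] > 0:
        return min(candidates, key=lambda j: E[j])   # smallest E2
    return max(candidates, key=lambda j: E[j])       # largest E2
```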
SMO Summary

Quick and dirty:
+ A single iteration is extremely fast
− The number of iterations is difficult to control
− Convergence analysis is very cumbersome

Other "tricks of the trade":
- Error caching: keep previously computed error values whenever possible
- Special speed-up rules for linear SVMs and for sparse data

How can we guarantee an optimal choice of the working set at each iteration?
KKT Revisited

Theorem (Karush-Kuhn-Tucker)
The point α is an optimal solution of the optimization problem

      min_α  −1⊤α + ½ α⊤Hα
      s.t.   α⊤y = 0,   0 ≤ α ≤ C

if and only if there exists a vector u = (μ, Π⊤, Υ⊤) such that

      Hα − 1 + μy + Υ − Π = 0      (Lagrangian stationarity)
      Π ≥ 0,  Υ ≥ 0                (nonnegative Lagrange multipliers)
      Υ⊤(α − C1) = 0,  Π⊤α = 0     (complementary slackness)
Analysis of KKT Conditions

The first two conditions can be combined into

      Π = Hα − 1 + μy + Υ = −g + μy + Υ ≥ 0,   where g := 1 − Hα.

When is this system of inequalities inconsistent?
⇒ the KKT conditions are not satisfied for the working set
⇒ re-optimization of the working set brings an improvement
Conditions for Inconsistency of KKT: Derivation

Each line of the KKT system of inequalities has the form

      π_i = −g_i + y_i μ + υ_i ≥ 0

The complementary slackness conditions fall into three cases (note that g_i/y_i = g_i y_i since y_i = ±1):

- 0 < α_i < C implies π_i = 0 and υ_i = 0. The KKT inequality becomes
      −g_i + y_i μ = 0   ⇒   μ = g_i y_i
- α_i = 0 implies υ_i = 0. The KKT inequality becomes
      −g_i + y_i μ ≥ 0   ⇒   μ ≥ g_i y_i   (reversed for y_i = −1)
- α_i = C implies π_i = 0. The KKT inequality becomes
      −g_i + y_i μ ≤ 0   ⇒   μ ≤ g_i y_i   (reversed for y_i = −1)
Conditions for Inconsistency of KKT: Result

Every KKT inequality restricts the variable μ to an interval M_i on the real line:

      M_i = [g_i y_i]  (a single point)   if 0 < α_i < C
            [g_i, +∞)                     if α_i = 0 and y_i = 1
            (−∞, −g_i]                    if α_i = 0 and y_i = −1
            (−∞, g_i]                     if α_i = C and y_i = 1
            [−g_i, +∞)                    if α_i = C and y_i = −1

[Figure: μ-intervals on the real line; an "inconsistency gap" between them indicates a suboptimal working set]

Theorem (Laskov, 1999)
A working set in the SVM decomposition is suboptimal if and only if the intersection of all its μ-intervals is empty.
Selection of Maximally Inconsistent Working Sets

Let q be the desired size of the working set.

- Compute the intervals M_i = [μ_i^left, μ_i^right] for all training examples.
- Select the q/2 points with the largest values μ_i^left.
- Select the q/2 points with the smallest values μ_i^right.
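A minimal Python sketch of this selection rule, assuming the gradient term g = 1 − Hα is available; the case analysis follows the table on the previous slide, and a real implementation would additionally keep the two halves disjoint:

```python
import numpy as np

def select_working_set(alpha, y, g, C, q):
    """Maximal-inconsistency selection: build the mu-intervals and pick the
    q/2 largest left endpoints and the q/2 smallest right endpoints."""
    n = len(alpha)
    left = np.full(n, -np.inf)               # left endpoints mu_i^left
    right = np.full(n, np.inf)               # right endpoints mu_i^right
    for i in range(n):
        if 0 < alpha[i] < C:                 # single-point interval
            left[i] = right[i] = g[i] * y[i]
        elif (alpha[i] == 0 and y[i] == 1) or (alpha[i] == C and y[i] == -1):
            left[i] = g[i] * y[i]            # interval [g_i y_i, +inf)
        else:
            right[i] = g[i] * y[i]           # interval (-inf, g_i y_i]
    return np.union1d(np.argsort(left)[-(q // 2):],   # largest left endpoints
                      np.argsort(right)[:q // 2])     # smallest right endpoints
```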
Back to Theory: Method of Feasible Directions

Consider a general optimization problem  min_α f(α)  s.t.  Aα ≤ b.

Problem
Let α0 be some feasible point, i.e., Aα0 ≤ b. How can we find a direction d = α − α0 that
(a) improves the objective function, and
(b) does not violate the constraints?

Solution
The optimal feasible direction is given by the following linear program:

      min_d  ∇f(α0)⊤ d
      s.t.   A d ≤ 0
             ||d|| ≤ 1
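As a sketch, the direction-finding LP can be posed directly with scipy. To keep the problem linear we assume the ∞-norm ball ||d||_∞ ≤ 1 (any norm would do in principle), and A_active is assumed to hold the constraint rows active at α0:

```python
import numpy as np
from scipy.optimize import linprog

def feasible_direction(grad, A_active):
    """Solve min grad^T d  s.t.  A_active d <= 0, ||d||_inf <= 1."""
    n = len(grad)
    res = linprog(c=grad,                        # linearized objective
                  A_ub=A_active,                 # stay feasible: A d <= 0
                  b_ub=np.zeros(A_active.shape[0]),
                  bounds=[(-1.0, 1.0)] * n)      # infinity-norm ball
    return res.x                                 # a descent direction if grad^T d < 0
```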
Maximal Inconsistency vs. Feasible Directions

The maximal inconsistency algorithm corresponds to the feasible direction method with the normalization d_i ∈ {−1, 0, 1} and the maximum number of non-zero directions bounded by q:

- The equality constraint y⊤d = 0 corresponds to the selection of exactly q/2 elements in each pass.
- The "left pass" contributes the largest values g_i y_i to the objective function.
- The "right pass" contributes the smallest values g_i y_i to the objective function.
Feasible Direction Decomposition: Running Time

Scaling factors
- Measure execution time for various dataset sizes.
- Fit a linear regression on a log-log scale and record the slope of the regression function.

Convergence
- Study the behaviour of the ratio  ||W_{k+1} − W*|| / ||W_k − W*||^p

Analysis
- An O(nKC/ε + n² log(Cn/K)) upper bound on the overall run time.

[Figure: scaling factors for a regression SVM, with and without FD decomposition]
Summary of Feasible Direction Decomposition

The only practical general method for large-scale training of SVMs [3]:
- Reasonable speed ("quadratic" in the number of examples)
- Provable convergence (the proofs are quite complex)
- Linear convergence rate
- Some improved working set selection strategies exist

Limitations:
- Inherently sequential
- Storing the full kernel matrix is impossible for "Internet-age" data volumes
Approaches for Huge Datasets

How do you train an SVM on 1 billion examples with 10,000 dimensions?

Focus on the easy case: linear SVMs
- Newton methods
- Subgradient methods
- Bundle methods

Parallelize
- Map-Reduce of various flavours

Ignore most of the irrelevant data
- Online learning: stochastic gradient descent
Map-Reduce Paradigm [2]

Map: process a given key-value pair and produce a list of output key-value pairs.
Reduce: given a list of values for a certain key, produce a smaller list of values.

[Figure: input data → Map → intermediate representation → Reduce → output data]
Map-Reduce Examples

Distributed grep
- Map: emit a line if it matches a supplied pattern.
- Reduce: copy the intermediate data to the output.

Count of URL access frequency
- Map: process a log file and output entries of the form (URL, 1).
- Reduce: add all values for each URL and produce entries of the form (URL, total count).
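As a toy illustration of the URL-frequency example, a minimal single-machine sketch (the real framework distributes both phases across many workers; the log format here is an assumption):

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit (URL, 1) for every log entry."""
    for line in log_lines:
        url = line.split()[0]              # assume the URL is the first field
        yield url, 1

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each URL."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

log = ["/index.html 200", "/about.html 200", "/index.html 304"]
print(reduce_phase(map_phase(log)))        # {'/index.html': 2, '/about.html': 1}
```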
Map-Reduce for SMO

Modified SMO variant:
- Find the two elements with the smallest and the largest classification function values (cf. feasible direction decomposition!).
- Update the dual weights using the SMO formulas.
- Update the classification function values for all examples according to the new weights.

Map-Reduce implementation:
- Map: classification update
- Reduce: search for the min/max elements

Performance improvement [1]:
- 5–32× for training
- 120–150× for classification
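A minimal sketch of how the two phases might look on sharded data (illustrative only; delta_f_shard stands for the change in the classification function induced by the new dual weights, computed from the shard's kernel columns):

```python
import numpy as np

def map_step(global_idx, f_shard, delta_f_shard):
    """Map: apply the classification update on one shard and emit the
    shard's extreme elements as (global index, value) candidates."""
    f_shard = f_shard + delta_f_shard
    lo, hi = np.argmin(f_shard), np.argmax(f_shard)
    return f_shard, (global_idx[lo], f_shard[lo]), (global_idx[hi], f_shard[hi])

def reduce_step(candidates):
    """Reduce: pick the global min and max over all shards' candidates;
    these two elements form the next SMO working pair."""
    lows, highs = zip(*candidates)
    return min(lows, key=lambda c: c[1]), max(highs, key=lambda c: c[1])
```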
Summary

- Solving the optimization problems that arise in SVM training requires special-purpose algorithms.
- The key ideas for conventional speed-up of SVM training are decomposition and gradient-optimal working set selection.
- Parallelization can be used to scale learning algorithms up to billions of examples.
Bibliography

[1] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning, pages 104–111, 2008.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008.

[3] Pavel Laskov. Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46:315–349, March 2002.

[4] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, 1999.