
Page 1:

Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent

Page 2:

Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels


Page 3:

SVM objective function


Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

The hyper-parameter C controls the tradeoff between a large margin and a small hinge loss.

min_w  ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)
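This objective is easy to compute directly; a minimal NumPy sketch (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)                   # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * (w @ w) + C * hinge.sum()

# Two correctly classified points outside the margin: only the regularizer contributes
X = np.array([[2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 1.0])
print(svm_objective(w, X, y, C=1.0))
```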

Page 4:

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent

2. Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent

4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron



Page 6:

Solving the SVM optimization problem

This function is convex in w:

min_w  ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

Page 7:

Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

From a geometric perspective: every tangent plane lies below the function.

[Figure: a convex curve with a chord from (u, f(u)) to (v, f(v)) lying above the curve]


Page 9:

Convex functions

Examples: linear functions are convex, and the max of convex functions is convex.

Some ways to show that a function is convex:

1. Using the definition of convexity

2. Showing that the second derivative is non-negative (for one-dimensional functions)

3. Showing that the Hessian is positive semi-definite (for vector functions)
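Convexity can also be spot-checked numerically against the definition. A sketch for the hinge loss (as a function of the score z = y wᵀx); the random sampling scheme is only a sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(z):
    # Hinge loss as a function of the score z = y * w^T x
    return np.maximum(0.0, 1.0 - z)

# Check f(l*u + (1-l)*v) <= l*f(u) + (1-l)*f(v) on random points
ok = True
for _ in range(1000):
    u, v = rng.uniform(-5, 5, size=2)
    lam = rng.uniform(0, 1)
    lhs = hinge(lam * u + (1 - lam) * v)
    rhs = lam * hinge(u) + (1 - lam) * hinge(v)
    ok &= lhs <= rhs + 1e-12
print(ok)
```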

Page 10:

Not all functions are convex

These are concave (the inequality flips):

f(λu + (1 − λ)v) ≥ λf(u) + (1 − λ)f(v)

These are neither convex nor concave

Page 11:

Convex functions are convenient

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

In general, a necessary condition for x to be a minimum of the function f is that the gradient ∇f(x) = 0.

For convex functions, this is both necessary and sufficient.

Page 12:

Solving the SVM optimization problem

min_w  ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

This function is convex in w.

• This is a quadratic optimization problem because the objective is quadratic

• Older methods: used techniques from quadratic programming. Very slow.

• There are no constraints, so we can use gradient descent. Still very slow!

Page 13:

Gradient descent

General strategy for minimizing a function J(w):

• Start with an initial guess for w, say w₀

• Iterate till convergence:
– Compute the gradient of J at w_t
– Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

[Figure: the curve J(w) with successive iterates w0, w1, w2, … descending toward the minimum]

We are trying to minimize

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)


Page 17:

Gradient descent for SVM

1. Initialize w₀

2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update: w_{t+1} ← w_t − r ∇J(w_t)

r: the learning rate

We are trying to minimize

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)
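A sketch of this procedure in NumPy. Since the hinge loss is not differentiable everywhere (a point the lecture returns to), the code uses a sub-gradient; the helper names, toy data, learning rate, and step count are all illustrative:

```python
import numpy as np

def svm_subgradient(w, X, y, C):
    # Sub-gradient of 0.5*w^T w + C*sum_i max(0, 1 - y_i w^T x_i)
    active = y * (X @ w) <= 1.0  # examples inside the margin or misclassified
    return w - C * (y[active, None] * X[active]).sum(axis=0)

def gradient_descent(X, y, C=1.0, r=0.01, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - r * svm_subgradient(w, X, y, C)  # step against the gradient
    return w

# Toy separable data: class +1 near (2, 2), class -1 near (-2, -2)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = gradient_descent(X, y)
print(np.all(np.sign(X @ w) == y))
```

Note that each call to `svm_subgradient` touches every training example, which is the scaling problem discussed on the next slides.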

Page 18:

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent

2. Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent

4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron

Page 19:

Gradient descent for SVM

1. Initialize w₀

2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate

The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale.

We are trying to minimize

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

Page 20:

Stochastic gradient descent for SVM

Given a training set S = {(xᵢ, yᵢ)}, x ∈ ℝᵈ, y ∈ {−1, 1}
1. Initialize w₀ = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   1. Pick a random example (xᵢ, yᵢ) from the training set S
   2. Treat (xᵢ, yᵢ) as a full dataset and take the derivative of the SVM objective J̃ at the current w_{t−1}. Call it ∇J̃(w_{t−1})

      J̃(w) = ½ wᵀw + C max(0, 1 − yᵢ wᵀxᵢ)

   3. Update: w_t ← w_{t−1} − γ_t ∇J̃(w_{t−1})
3. Return final w

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)


Page 25:

Stochastic gradient descent for SVM

Given a training set S = {(xᵢ, yᵢ)}, x ∈ ℝᵈ, y ∈ {−1, 1}
1. Initialize w₀ = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   1. Pick a random example (xᵢ, yᵢ) from the training set S
   2. Treat (xᵢ, yᵢ) as a full dataset and take the derivative of the SVM objective J̃ at the current w_{t−1}. Call it ∇J̃(w_{t−1})
   3. Update: w_t ← w_{t−1} − γ_t ∇J̃(w_{t−1})
3. Return final w

This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? Because the objective J(w) is a convex function.

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

Page 26:

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent

✓ Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent

4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron


Page 27:

Gradient Descent vs SGD

[Figures repeated across slides 27–45: the smooth optimization path of gradient descent toward the minimum, contrasted with the noisier path taken by stochastic gradient descent]

Many more updates than gradient descent, but each individual update is less computationally expensive
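The note above can be made concrete with a back-of-the-envelope count of multiply-adds in the dot products wᵀxᵢ per update (the sizes are illustrative):

```python
# Work per update, counting multiply-adds in the dot products w^T x_i
N, d = 1_000_000, 100         # a large training set (illustrative numbers)

full_gradient_update = N * d  # gradient descent touches every example
sgd_update = d                # SGD touches a single example

print(full_gradient_update // sgd_update)  # SGD does N times less work per update
```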

Page 46:

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent

✓ Stochastic gradient descent

✓ Gradient descent vs stochastic gradient descent

4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron


Page 47:

Stochastic gradient descent for SVM

Given a training set S = {(xᵢ, yᵢ)}, x ∈ ℝᵈ, y ∈ {−1, 1}
1. Initialize w₀ = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   1. Pick a random example (xᵢ, yᵢ) from the training set S
   2. Treat (xᵢ, yᵢ) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J(w_{t−1})
   3. Update: w_t ← w_{t−1} − γ_t ∇J(w_{t−1})
3. Return final w

J(w) = ½ wᵀw + C Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

Page 48:

Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?

J̃(w) = ½ wᵀw + C max(0, 1 − yᵢ wᵀxᵢ)

Page 49:

Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of such a hyperplane.

Page 50:

Sub-gradients

[Example from Boyd]

[Figure: g1 is a gradient at x1 (f is differentiable at x1, with its tangent at that point); g2 and g3 are both subgradients at x2]

Formally, a vector g is a subgradient of f at point x if

f(z) ≥ f(x) + gᵀ(z − x) for all z


Page 53:

Sub-gradient of the SVM objective

J̃(w) = ½ wᵀw + C max(0, 1 − yᵢ wᵀxᵢ)

General strategy: first solve the max and compute the gradient for each case:

If yᵢ wᵀxᵢ ≤ 1: ∇J̃ = w − C yᵢ xᵢ
Otherwise: ∇J̃ = w
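This case analysis translates directly into code; a sketch for a single example (the function name is illustrative):

```python
import numpy as np

def subgradient(w, x, y_i, C):
    # Case split from solving the max in C * max(0, 1 - y_i w^T x)
    if y_i * (w @ x) <= 1.0:  # inside the margin (or misclassified): hinge is active
        return w - C * y_i * x
    return w                  # outside the margin: only the regularizer contributes

w = np.array([0.0, 0.0])
x = np.array([1.0, 2.0])
print(subgradient(w, x, y_i=1.0, C=1.0))  # hinge active at w = 0
```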


Page 55:

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent

✓ Stochastic gradient descent

✓ Gradient descent vs stochastic gradient descent

✓ Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron


Page 56:

Stochastic sub-gradient descent for SVM

Given a training set S = {(xᵢ, yᵢ)}, x ∈ ℝᵈ, y ∈ {−1, 1}
1. Initialize w = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   For each training example (xᵢ, yᵢ) ∈ S:
      If yᵢ wᵀxᵢ ≤ 1: update w ← (1 − γ_t) w + γ_t C yᵢ xᵢ
      else: update w ← (1 − γ_t) w
3. Return w

Both cases apply the update w ← w − γ_t ∇J̃(w)

γ_t: learning rate, many tweaks possible

Important to shuffle examples at the start of each epoch
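Putting the pieces together, a sketch of the full algorithm in NumPy (the γ₀/(1 + t) schedule is one common choice, and the toy data is illustrative):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=50, seed=0):
    """Stochastic sub-gradient descent for the regularized hinge loss (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        order = rng.permutation(n)        # shuffle at the start of each epoch
        for i in order:
            gamma = gamma0 / (1 + t)      # decaying learning rate
            if y[i] * (X[i] @ w) <= 1.0:  # hinge active: shrink and correct
                w = (1 - gamma) * w + gamma * C * y[i] * X[i]
            else:                         # hinge inactive: only the regularizer shrinks w
                w = (1 - gamma) * w
            t += 1
    return w

# Toy linearly separable data
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(X, y)
print(np.all(np.sign(X @ w) == y))
```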

Page 62:

Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are "square summable, but not summable":

• Step sizes γ_t are positive
• The sum of squares of step sizes over t = 1 to ∞ is finite
• The sum of step sizes over t = 1 to ∞ is infinite

• Some examples: γ_t = γ₀/(1 + γ₀t/C) or γ_t = γ₀/(1 + t)
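These conditions can be checked numerically for the schedule γ_t = γ₀/(1 + t): partial sums of γ_t grow without bound (like log T), while partial sums of γ_t² approach the finite limit π²/6 (a sketch):

```python
# Partial sums for gamma_t = gamma0 / (1 + t): the step sizes are
# "square summable, but not summable"
gamma0 = 1.0
T = 1_000_000
s, s2 = 0.0, 0.0
for t in range(T):
    g = gamma0 / (1 + t)
    s += g       # grows without bound, like log T
    s2 += g * g  # approaches pi^2 / 6, a finite limit

print(s > 10.0, abs(s2 - 1.6449) < 0.01)
```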

Page 63:

Convergence and learning rates

• Number of iterations to get to accuracy within ε

• For strongly convex functions, N examples, d dimensional:
– Gradient descent: O(Nd ln(1/ε))
– Stochastic gradient descent: O(d/ε)

• More subtleties involved, but SGD is generally preferable when the data size is huge

• Recently, many variants that are based on this general strategy. Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.


Page 65:

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent

✓ Stochastic gradient descent

✓ Gradient descent vs stochastic gradient descent

✓ Sub-derivatives of the hinge loss

✓ Stochastic sub-gradient descent for SVM

6. Comparison to perceptron


Page 66:

Stochastic sub-gradient descent for SVM

Given a training set S = {(xᵢ, yᵢ)}, x ∈ ℝᵈ, y ∈ {−1, 1}
1. Initialize w = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   For each training example (xᵢ, yᵢ) ∈ S:
      If yᵢ wᵀxᵢ ≤ 1: update w ← (1 − γ_t) w + γ_t C yᵢ xᵢ
      else: update w ← (1 − γ_t) w
3. Return w

Compare with the Perceptron update:
If yᵢ wᵀxᵢ ≤ 0, update w ← w + γ_t yᵢ xᵢ
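The two updates can be placed side by side; a sketch highlighting the difference on a point that is correctly classified but falls inside the margin (the function names are illustrative):

```python
import numpy as np

def svm_step(w, x, y_i, gamma, C):
    # SVM update: always shrinks w (regularization), corrects when the margin <= 1
    if y_i * (w @ x) <= 1.0:
        return (1 - gamma) * w + gamma * C * y_i * x
    return (1 - gamma) * w

def perceptron_step(w, x, y_i, gamma):
    # Perceptron update: corrects only on an outright mistake, no shrinkage
    if y_i * (w @ x) <= 0.0:
        return w + gamma * y_i * x
    return w

# Correctly classified but inside the margin: SVM updates, Perceptron does not
w = np.array([0.5, 0.0])
x = np.array([1.0, 0.0])
print(perceptron_step(w, x, 1.0, 0.1))   # unchanged
print(svm_step(w, x, 1.0, 0.1, C=1.0))   # nudged toward a larger margin
```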

Page 67:

Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss, with no regularization

• SVM optimizes the hinge loss, with regularization

Page 68:

SVM summary from optimization perspective

• Minimize regularized hinge loss

• Solve using stochastic gradient descent
– Very fast; the run time of each update does not depend on the number of examples
– Compare with the Perceptron algorithm: Perceptron does not maximize the margin width, though Perceptron variants can force a margin
– The convergence criterion is an issue: SGD can be aggressive in the beginning and get to a reasonably good solution fast, but convergence to a very accurate weight vector is slow

• Other successful optimization algorithms exist
– E.g., dual coordinate descent, implemented in LIBLINEAR

Questions?