SVMs with Slack Nicholas Ruozzi University of Texas at Dallas Based roughly on the slides of David Sontag


Page 1:

SVMs with Slack

Nicholas Ruozzi, University of Texas at Dallas

Based roughly on the slides of David Sontag

Page 2:

Dual SVM

$$\max_{\lambda \ge 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \left(x^{(i)}\right)^T x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$

• The dual formulation only depends on inner products between the data points

• The same thing is true if we use feature vectors instead

Page 3:

Dual SVM

$$\max_{\lambda \ge 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, \Phi\!\left(x^{(i)}\right)^T \Phi\!\left(x^{(j)}\right) + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$

• The dual formulation only depends on inner products between the data points

• The same thing is true if we use feature vectors instead

Page 4:

The Kernel Trick

• For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large

• This is best illustrated by example

• Let $\phi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• $\phi(x_1, x_2)^T \phi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2$

Page 5:

The Kernel Trick

• For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large

• This is best illustrated by example

• Let $\phi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• $\phi(x_1, x_2)^T \phi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2$

Reduces to a dot product in the original space
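This identity is easy to check numerically. A minimal Python sketch (the helper names `phi` and `dot` and the test vectors are illustrative, not from the slides):

```python
# Sanity check: the explicit feature map phi(x1, x2) = (x1*x2, x2*x1, x1^2, x2^2)
# has inner products equal to the squared dot product (x^T z)^2.

def phi(x1, x2):
    """Explicit degree-2 feature map from the slide."""
    return [x1 * x2, x2 * x1, x1 ** 2, x2 ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = (1.0, 2.0)
z = (3.0, -1.0)

lhs = dot(phi(*x), phi(*z))   # inner product in the 4-dimensional feature space
rhs = dot(x, z) ** 2          # (x^T z)^2 computed in the original 2-d space

assert abs(lhs - rhs) < 1e-9
```

The right-hand side needs only a 2-dimensional dot product, which is the whole point of the trick.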

5

Page 6:

The Kernel Trick

• The same idea can be applied for the feature vector $\phi$ of all polynomials of degree (exactly) $d$

• $\phi(x)^T \phi(z) = (x^T z)^d$

• More generally, a kernel is a function $k(x, z) = \phi(x)^T \phi(z)$ for some feature map $\phi$

• Rewrite the dual objective

$$\max_{\lambda \ge 0,\; \sum_i \lambda_i y_i = 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, k\!\left(x^{(i)}, x^{(j)}\right) + \sum_i \lambda_i$$
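In practice, the kernelized dual never touches the feature vectors themselves, only the Gram matrix of pairwise kernel values. A minimal sketch, using the degree-2 polynomial kernel on a toy dataset (the data and helper names are illustrative):

```python
# Build the Gram matrix K[i][j] = k(x^(i), x^(j)) that the kernelized dual
# objective is written in terms of, using the degree-2 polynomial kernel.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, z, d=2):
    """Polynomial kernel of degree exactly d: (x^T z)^d."""
    return dot(x, z) ** d

data = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy inputs x^(i)

K = [[k(xi, xj) for xj in data] for xi in data]

# A valid kernel matrix is symmetric, since k(x, z) = phi(x)^T phi(z) = k(z, x).
n = len(data)
assert all(K[i][j] == K[j][i] for i in range(n) for j in range(n))
```

Once K is computed, the dual objective depends on the data only through this matrix.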

Page 7:

Examples of Kernels

• Polynomial kernel of degree exactly $d$
  • $k(x, z) = (x^T z)^d$

• General polynomial kernel of degree $d$ for some $c$
  • $k(x, z) = (x^T z + c)^d$

• Gaussian kernel for some $\sigma$
  • $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
  • The corresponding $\phi$ is infinite dimensional!

• So many more…
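All three kernels are one-liners to implement. A minimal sketch in Python (function names and test values are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_exact(x, z, d):
    """Polynomial kernel of degree exactly d: (x^T z)^d."""
    return dot(x, z) ** d

def poly_general(x, z, d, c):
    """General polynomial kernel of degree d: (x^T z + c)^d."""
    return (dot(x, z) + c) ** d

def gaussian(x, z, sigma):
    """Gaussian kernel; its corresponding feature map is infinite dimensional."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

x, z = (1.0, 2.0), (2.0, 0.0)
assert poly_exact(x, z, 2) == 4.0          # (1*2 + 2*0)^2 = 4
assert poly_general(x, z, 2, 1.0) == 9.0   # (2 + 1)^2 = 9
assert gaussian(x, x, 1.0) == 1.0          # k(x, x) = exp(0) = 1
```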

Page 8:

Gaussian Kernels

• Consider the Gaussian kernel

$$\exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{(x - z)^T (x - z)}{2\sigma^2}\right) = \exp\left(\frac{-\|x\|^2 + 2 x^T z - \|z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|z\|^2}{2\sigma^2}\right) \exp\left(\frac{x^T z}{\sigma^2}\right)$$

• Use the Taylor expansion for exp()

$$\exp\left(\frac{x^T z}{\sigma^2}\right) = \sum_{n=0}^{\infty} \frac{(x^T z)^n}{\sigma^{2n} \, n!}$$

Page 9:

Gaussian Kernels

• Consider the Gaussian kernel

$$\exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{(x - z)^T (x - z)}{2\sigma^2}\right) = \exp\left(\frac{-\|x\|^2 + 2 x^T z - \|z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|z\|^2}{2\sigma^2}\right) \exp\left(\frac{x^T z}{\sigma^2}\right)$$

• Use the Taylor expansion for exp()

$$\exp\left(\frac{x^T z}{\sigma^2}\right) = \sum_{n=0}^{\infty} \frac{(x^T z)^n}{\sigma^{2n} \, n!}$$

Polynomial kernels of every degree!
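The factorization above can be checked numerically by truncating the Taylor series. A sketch, with illustrative test vectors and a truncation at 20 terms:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian(x, z, sigma):
    """Gaussian kernel computed directly from the squared distance."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

def gaussian_taylor(x, z, sigma, terms):
    """Gaussian kernel via the slide's factorization: the two norm factors
    times a truncated Taylor series of exp(x^T z / sigma^2)."""
    s2 = sigma ** 2
    norm_factors = math.exp(-dot(x, x) / (2 * s2)) * math.exp(-dot(z, z) / (2 * s2))
    series = sum(dot(x, z) ** n / (s2 ** n * math.factorial(n)) for n in range(terms))
    return norm_factors * series

x, z = (0.5, 1.0), (1.0, 0.5)
exact = gaussian(x, z, sigma=1.0)
approx = gaussian_taylor(x, z, sigma=1.0, terms=20)
assert abs(exact - approx) < 1e-9   # the series converges quickly here
```

Each term of the series is a (scaled) polynomial kernel of degree n, which is what "polynomial kernels of every degree" means.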

Page 10:

Kernels

• Bigger feature space increases the possibility of overfitting

• Large margin solutions may still generalize reasonably well

• Alternative: add "penalties" to the objective to disincentivize complicated solutions

$$\min_w \frac{1}{2}\|w\|^2 + c \cdot (\#\text{ of misclassifications})$$

• Not a quadratic program anymore (in fact, it's NP-hard)

• Similar problem to Hamming loss (i.e., counting the number of misclassifications): no notion of how badly the data is misclassified

Page 11:

SVMs with Slack

• Allow misclassification

• Penalize misclassification linearly (just like in the perceptron algorithm)

• Again, easier to work with than the Hamming loss

• Objective stays convex

• Will let us handle data that isn't linearly separable!

Page 12:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

Page 13:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

Potentially allows some points to be misclassified/inside the margin

Page 14:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

The constant $c$ determines the degree to which slack is penalized

Page 15:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• How does this objective change with $c$?

Page 16:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• How does this objective change with $c$?

• As $c \to \infty$, requires a perfect classifier

• As $c \to 0$, allows arbitrary classifiers (i.e., ignores the data)

Page 17:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• How should we pick $c$?

Page 18:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• How should we pick $c$?

• Divide the data into three pieces: training, testing, and validation

• Use the validation set to tune the value of the hyperparameter $c$

Page 19:

Evaluation Methodology

• General learning strategy

• Build a classifier using the training data

• Select hyperparameters using validation data

• Evaluate the chosen model with the selected hyperparameters on the test data

How can we tell if we overfit the training data?
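The split itself can be sketched in a few lines of Python (the fractions, seed, and function name are illustrative choices, not prescribed by the slides):

```python
import random

def three_way_split(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly partition data into training, validation, and test sets."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = list(range(100))             # stand-in for a list of (x, y) pairs
train, val, test = three_way_split(data)
assert (len(train), len(val), len(test)) == (60, 20, 20)
assert sorted(train + val + test) == data   # every example lands in exactly one set
```

Hyperparameters like $c$ are then chosen by training on `train` for each candidate value and keeping the value with the best accuracy on `val`; `test` is touched only once, at the end.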

Page 20:

ML in Practice

• Gather data + labels
• Select feature vectors
• Randomly split into three groups
  • Training set
  • Validation set
  • Test set
• Experimentation cycle
  • Select a "good" hypothesis from the hypothesis space
  • Tune hyperparameters using validation set
  • Compute accuracy on test set (fraction of correctly classified instances)

Page 21:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• What is the optimal value of $\xi$ for fixed $w$ and $b$?

Page 22:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• What is the optimal value of $\xi$ for fixed $w$ and $b$?

• If $y_i\left(w^T x^{(i)} + b\right) \ge 1$, then $\xi_i = 0$

• If $y_i\left(w^T x^{(i)} + b\right) < 1$, then $\xi_i = 1 - y_i\left(w^T x^{(i)} + b\right)$
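The case analysis above amounts to one line per example. A minimal sketch (the toy weight vector and data are illustrative):

```python
def optimal_slack(w, b, data):
    """Closed-form optimal slack for fixed w and b:
    xi_i = 0 if the margin constraint already holds,
    else xi_i = 1 - y_i (w^T x^(i) + b)."""
    xis = []
    for x, y in data:
        margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
        xis.append(max(0.0, 1.0 - margin))
    return xis

w, b = (1.0, 0.0), 0.0
data = [((2.0, 0.0), 1),    # margin 2   >= 1 -> slack 0
        ((0.5, 0.0), 1),    # margin 0.5 <  1 -> slack 0.5
        ((1.0, 0.0), -1)]   # margin -1  <  1 -> slack 2 (misclassified)
assert optimal_slack(w, b, data) == [0.0, 0.5, 2.0]
```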

Page 23:

SVMs with Slack

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

• We can formulate this slightly differently

• $\xi_i = \max\left(0, 1 - y_i\left(w^T x^{(i)} + b\right)\right)$

• Does this look familiar?

• Hinge loss provides an upper bound on Hamming loss

Page 24:

Hinge Loss Formulation

• Obtain a new objective by substituting in for $\xi$

$$\min_{w,b} \frac{1}{2}\|w\|^2 + c \sum_i \max\left(0, 1 - y_i\left(w^T x^{(i)} + b\right)\right)$$

Can minimize with gradient descent!
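A minimal subgradient-descent sketch for this objective, on a toy separable dataset (learning rate, iteration count, and data are illustrative; this is a sketch, not a tuned solver):

```python
def svm_subgradient(data, c=1.0, lr=0.1, iters=200):
    """Minimize (1/2)||w||^2 + c * sum_i max(0, 1 - y_i (w^T x^(i) + b))
    by subgradient descent. For points that violate the margin, a
    subgradient of the hinge term is -y_i x^(i) (and -y_i for b)."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(iters):
        gw, gb = list(w), 0.0          # gradient of the (1/2)||w||^2 term
        for x, y in data:
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1:             # inside the margin: hinge term is active
                for k in range(dim):
                    gw[k] -= c * y * x[k]
                gb -= c * y
        w = [wk - lr * gk for wk, gk in zip(w, gw)]
        b -= lr * gb
    return w, b

data = [((1.0, 1.0), 1), ((2.0, 2.0), 1), ((-1.0, -1.0), -1), ((-2.0, -2.0), -1)]
w, b = svm_subgradient(data)
# All four points end up on the correct side of the learned hyperplane.
assert all(y * (sum(wk * xk for wk, xk in zip(w, x)) + b) > 0 for x, y in data)
```

The objective is convex but not differentiable at the hinge's kink, which is why a subgradient (rather than a gradient) is used at margin-violating points.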

Page 25:

Hinge Loss Formulation

• Obtain a new objective by substituting in for $\xi$

$$\min_{w,b} \frac{1}{2}\|w\|^2 + c \sum_i \max\left(0, 1 - y_i\left(w^T x^{(i)} + b\right)\right)$$

Here the $\frac{1}{2}\|w\|^2$ term is a penalty to prevent overfitting, and the sum is the hinge loss.

Page 26:

Imbalanced Data

• If the data is imbalanced (i.e., more positive examples than negative examples), may want to evenly distribute the error between the two classes

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + \frac{c}{N_+} \sum_{i:\, y_i = 1} \xi_i + \frac{c}{N_-} \sum_{i:\, y_i = -1} \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$
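The per-class weighting can be sketched directly (the toy labels and slack values are illustrative):

```python
def weighted_slack_penalty(slacks, labels, c=1.0):
    """Slack penalty with per-class weights c/N+ and c/N-, so each class
    contributes to the objective in proportion to its average slack,
    regardless of how many examples it has."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == -1)
    pos = sum(s for s, y in zip(slacks, labels) if y == 1)
    neg = sum(s for s, y in zip(slacks, labels) if y == -1)
    return (c / n_pos) * pos + (c / n_neg) * neg

# 4 positives with total slack 4 and 1 negative with slack 1:
# the unweighted penalty would be 5, but here each class contributes 1.
labels = [1, 1, 1, 1, -1]
slacks = [1.0, 1.0, 1.0, 1.0, 1.0]
assert weighted_slack_penalty(slacks, labels) == 2.0
```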

Page 27:

Dual of Slack Formulation

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \text{ for all } i$$
$$\xi_i \ge 0, \text{ for all } i$$

Page 28:

Dual of Slack Formulation

$$L(w, b, \xi, \lambda, \mu) = \frac{1}{2} w^T w + c \sum_i \xi_i + \sum_i \lambda_i \left(1 - \xi_i - y_i\left(w^T x^{(i)} + b\right)\right) - \sum_i \mu_i \xi_i$$

Convex in $w, b, \xi$, so take derivatives to form the dual

$$\frac{\partial L}{\partial w_k} = w_k - \sum_i \lambda_i y_i x_k^{(i)} = 0$$

$$\frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_k} = c - \lambda_k - \mu_k = 0$$

Page 29:

Dual of Slack Formulation

$$\max_{\lambda \ge 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \left(x^{(i)}\right)^T x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$
$$c \ge \lambda_i \ge 0, \text{ for all } i$$

Page 30:

Generalization

• We argued, intuitively, that SVMs generalize better than the perceptron algorithm

• How can we make this precise?

• Coming soon... but first...

Page 31:

Roadmap

• Where are we headed?
• Other simple hypothesis spaces for supervised learning
  • $k$ nearest neighbor
  • Decision trees
• Learning theory
  • Generalization and PAC bounds
  • VC dimension
  • Bias/variance tradeoff