Page 1:

©Jan-19 Christopher W. Clifton

Materials adapted from Profs. Jennifer Neville and Dan Goldwasser

CS37300:

Data Mining & Machine Learning

Margin-Based Classifiers:

Support Vector Machine

Prof. Chris Clifton

3 March 2020

Maximal Margin

• The discussion motivates the notion of maximal margin

• The maximal margin of a data set S is defined as:

γ(S) = max_{||w||=1} min_{(x,y) ∈ S} y wᵀx

1. For a given w: find the closest point (the one minimizing y wᵀx).
2. Then, find the w that gives the maximal margin value across all w's (of size 1).

Note: the selection of the point in the min, and therefore the max, does not change if we scale w, so it's okay to deal only with normalized w's.
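The two-step definition above can be checked numerically. A brute-force sketch (the dataset and function name are illustrative, not from the slides) that approximates γ(S) in 2-D by scanning unit vectors w:

```python
import numpy as np

# Hypothetical separable 2-D dataset; labels are in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def maximal_margin(X, y, n_angles=3600):
    """Approximate gamma(S) = max_{||w||=1} min_{(x,y) in S} y * w.x
    by scanning unit vectors w = (cos t, sin t)."""
    best = -np.inf
    for t in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(t), np.sin(t)])   # ||w|| = 1 by construction
        best = max(best, np.min(y * (X @ w)))  # inner min: closest point for this w
    return best

gamma = maximal_margin(X, y)  # positive iff the data are linearly separable
```

A positive result confirms separability; scaling w would scale the inner product but not change which point attains the min, which is why restricting to ||w|| = 1 loses nothing.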

Page 2:

Hard SVM Optimization

• We have shown that the sought-after weight vector w is the solution of the following optimization problem:

SVM Optimization:

Minimize: ½ ||w||²

Subject to: ∀ (x,y) ∈ S: y wᵀx ≥ 1

• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.

Visualizing the Solution in the Non-Separable Case

Page 3:

Soft SVM

• Notice that the relaxation of the constraint y_i wᵀx_i ≥ 1 can be done by introducing a slack variable ξ_i (one per example) and requiring: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• Now, we want to solve:

Min ½ ||w||² + C Σ_i ξ_i subject to y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• Which can be written as:

Min ½ wᵀw + C Σ_i max(0, 1 − y_i wᵀx_i)
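The unconstrained hinge-loss form above can be minimized directly by sub-gradient descent. The following sketch is illustrative (synthetic data, no bias term, hand-picked step size and names), not the optimizer prescribed by the course:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-class data: Gaussian blobs around (+2,+2) and (-2,-2).
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

def soft_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on 1/2 w.w + C * sum_i max(0, 1 - y_i w.x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0                  # examples violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0)
        w -= lr * (w + C * hinge_grad)          # regularizer + hinge sub-gradient
    return w

w = soft_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)              # training accuracy
```

On this well-separated synthetic data the learned w classifies almost every point correctly; C trades margin width against the total slack.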

SVM Objective Function

• The problem we solved is: Min ½ ||w||² + C Σ_i ξ_i

• Where ξ_i ≥ 0 is a slack variable, defined as: ξ_i = max(0, 1 − y_i wᵀx_i)

– Equivalently, we can say that: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• In this objective, ½ ||w||² is the regularization term and C Σ_i ξ_i is the empirical loss.

• General form of a learning algorithm:

– Minimize empirical loss, and regularize (to avoid overfitting)

Page 4:

What if the true model isn't linearly separable?

• By transforming the feature space, these functions can be made linearly separable

• Represent each point in 2-D as (x, x²)
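A tiny illustration of this idea (the points below are hypothetical): positive labels inside an interval and negative labels outside it cannot be split by a single threshold on x, but in the (x, x²) representation the linear rule with w = (0, −1) and b = 2 separates them:

```python
# Hypothetical 1-D dataset: + inside the interval, - outside it.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [-1, -1, 1, 1, 1, -1, -1]

# No single threshold on x separates these labels. In the (x, x^2) space,
# sign(w . (x, x^2) + b) with w = (0, -1), b = 2 is linear and implements
# "predict + iff x^2 < 2".
preds = [1 if -x * x + 2.0 > 0 else -1 for x in xs]
```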

SVM: Kernel Function

• Don’t need to transform the dataset

– Just need inner products

• Kernel function: compute the inner product in the new space without actually computing point positions!

– Let φ(x) = (x₁², x₂², √2·x₁x₂)

– Then φ(x)ᵀφ(y) = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² = (xᵀy)²

• Radial basis function: a kernel representing an infinite expansion

– “Infinite-dimensional” feature space
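The degree-2 polynomial kernel identity above is easy to verify numerically; note that the √2 factor in the feature map is what makes the cross term come out as 2x₁x₂y₁y₂ (the points below are arbitrary examples):

```python
import math

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(z))   # inner product in the transformed space
rhs = dot(x, z) ** 2        # kernel value, computed without mapping
```

The right-hand side never materializes the 3-D feature vectors, which is the whole point of the kernel trick.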

Page 5:

Polynomial Regression

• Gradient descent (GD): a general optimization algorithm

– Works for classification (with a different loss function)

– Incorporate polynomial features to fit a complex model

– Danger: overfitting!

Regularization

• Both for regression and classification, for a given error we prefer a simpler model

– Keep w small: an ε change in the input causes only an ε·w change in the output

• Sometimes we are even willing to trade a higher error rate for a simpler model (why?)

• Add a regularization term:

– This is a form of inductive bias

Page 6:

Regularization

• A very popular choice of regularization term is to minimize the norm of the weight vector

– For convenience: ½ the squared norm

• What is the gradient update of the loss function?

– At each iteration we also subtract λ·w from the weights

• In general, we can pick other norms (p-norms)

– Referred to as the L-p norm

– Different p will have a different effect!
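The λ·w term can be seen in a single gradient step. A minimal sketch (function name and constants hypothetical) of one update on a loss with an added (λ/2)||w||² regularizer:

```python
def ridge_step(w, grad_loss, lr=0.1, lam=0.01):
    # One gradient step on loss + (lam/2) * ||w||^2: the regularizer
    # contributes lam * w to the gradient, so every step also shrinks
    # w toward zero ("weight decay").
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad_loss)]

w = [1.0, -2.0]
w = ridge_step(w, [0.0, 0.0])  # pure decay when the data gradient is zero
```

With a zero data gradient each weight shrinks by a factor (1 − lr·λ), which is the inductive bias toward simpler models described above.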

Model Complexity and the Bias/Variance Tradeoff

Page 7:

Bias/Variance

• Easy to define very powerful learners

– Move to a higher dimension

• Is it always a good idea?

• Simple (simplistic) approach:

– Option 1: Learning doesn't work because our model is not powerful enough! Move to a richer representation.

– Option 2: Learning doesn't work because our model overfits! Get more data! Remove features! Regularize!

• How can we tell?

Bias-Variance Tradeoff

Sample different data sets from P(x,y) and train (each × in the figure is a learned model). How would the learned models "behave" when trained over different data sets from the distribution?

[2×2 figure, bias vs. variance:
– Low bias (powerful enough to learn the target), low variance (doesn't overfit)
– High bias (cannot represent the target), low variance (doesn't overfit)
– High bias (cannot represent the target), high variance (overfits)
– Low bias (powerful enough to learn the target), high variance (overfits)]
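This "behavior across data sets" can be observed in a small simulation (everything below, including the target function, noise level, and polynomial degrees, is an illustrative choice): fit a simple and a complex model on many data sets drawn from the same distribution and compare how much their predictions vary:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2.0 * x)          # hypothetical target function

x_grid = np.linspace(0.0, 3.0, 50)  # fixed points at which we compare models

def predictions_over_datasets(degree, n_datasets=200, n=15):
    """Fit one polynomial per sampled data set; return all predictions on x_grid."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0.0, 3.0, n)
        yv = true_f(x) + rng.normal(0.0, 0.3, n)
        coef = np.polyfit(x, yv, degree)
        preds.append(np.polyval(coef, x_grid))
    return np.array(preds)

# Average (over x_grid) of the variance of predictions across data sets.
var_simple = predictions_over_datasets(1).var(axis=0).mean()
var_complex = predictions_over_datasets(9).var(axis=0).mean()
```

The complex model's predictions swing far more from sample to sample, which is exactly the "far apart" (high-variance) picture above.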

Page 8:

Bias-Variance Tradeoff

Sample different data sets from P(x,y) and train (each × is a learned model). How would the learned models "behave" when trained over different data sets from the distribution?

[Figure: the learned models are clustered together (low variance) in a "good" place (low bias)]

Page 9:

Bias-Variance Tradeoff

[Figure: the learned models are clustered together (low variance) but away from the target function (high bias)]

Bias-Variance Tradeoff

[Figure: the learned models are far apart (high variance) and away from the target function (high bias)]

Page 10:

Overfitting

• Model complexity: "how many parameters are learned?"

– Decision trees: depth; linear models: number of features, degree of a polynomial kernel

• Empirical error: for a given dataset, what percentage of items are misclassified by the learned classifier?

[Figure: empirical error vs. model complexity]

Data Generating Distribution

• We assume a distribution P(x,y) from which the dataset is sampled

– The data-generating distribution

• Training and testing examples are drawn from the same distribution

– The examples are identically distributed

• Each example is drawn independently

– We say that the train/test examples are independent and identically distributed (i.i.d.)

• We care about our performance over any sample from that distribution, not just the one we observe during training

Page 11:

Overfitting

• Expected error: what percentage of items drawn from P(x,y) do we expect to be misclassified by the learned classifier?

• Let's consider different samples from the data distribution. You can get closer to the expected loss by considering the expectation of the empirical loss over (all/many) different data samples.

• Let's continue the discussion with that in mind!

[Figure: expected error vs. model complexity]

The Variance of the Learner

• How susceptible is the learner to minor changes in the training data (i.e., different samples from P(x,y))?

– You can think about the testing data as just another sample

• Overfitting: variance increases with model complexity!

[Figure: variance growing with model complexity]

Page 12:

Bias of the Learner

• How likely is the learner to identify the target hypothesis?

– What will happen if we test on a different sample?

– What will happen if we train on a different sample?

• Underfitting: bias is high when the model is too simple

[Figure: bias shrinking as model complexity grows]

Model Complexity

• Simple models: high bias and low variance

• Complex models: high variance and low bias

[Figure: expected error vs. model complexity, with bias and variance curves]

Page 13:

Model Complexity

• Simple models: high bias and low variance (underfitting)

• Complex models: high variance and low bias (overfitting)

[Figure: expected error vs. model complexity, with the underfitting and overfitting regimes marked]

Impact of Bias and Variance

Expected Error ≈ Bias + Variance

[Figure: the expected-error curve as the sum of the bias and variance curves]
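For squared loss this approximation is in fact exact: expected error = bias² + variance + irreducible noise. A simulation sketch of the decomposition at a single fixed query point (all constants below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000        # number of simulated training sets
true_y = 1.5       # target value at the fixed query point (hypothetical)
bias = 0.4         # systematic error of the learner (hypothetical)
model_sd = 0.3     # spread of predictions across training sets (variance)
noise_sd = 0.2     # irreducible label noise

preds = true_y + bias + rng.normal(0.0, model_sd, n)   # one model per training set
labels = true_y + rng.normal(0.0, noise_sd, n)         # noisy test labels

mse = np.mean((preds - labels) ** 2)
decomposed = bias ** 2 + model_sd ** 2 + noise_sd ** 2  # bias^2 + variance + noise
```

Up to sampling error, `mse` matches `decomposed`, illustrating why shrinking either bias or variance lowers the expected error.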

Page 14:

Bias-Variance Tradeoff in Practice

• We saw that the classification error can be expressed in terms of bias and variance

• Reducing the bias/variance can reduce the expected error!

• Different scenarios call for different actions to reduce the error

– High bias: add more features

– High variance: simplify the model, add more examples

• How can we diagnose each one of these scenarios?

Bias-Variance Analysis

• Let's look at the influence each scenario has on the validation and training errors

• Let's compare simple models (e.g., linear in x) to complex ones

[Figure: error vs. model complexity]

Page 15:

Bias-Variance Analysis

• As we move to a more expressive model, the training error is likely to decrease

[Figure: training error vs. model complexity]

Bias-Variance Analysis

• When we are using a simple model, it's very likely we'll underfit, resulting in a high validation error

[Figure: training and validation error vs. model complexity]

Page 16:

Bias-Variance Analysis

• If we increase the complexity, we can do better on the validation data

[Figure: validation error for the simple model vs. the complex model]

Bias-Variance Analysis

• If we keep increasing the complexity, the model will eventually overfit and the validation error will increase

[Figure: validation error for the simple, complex, and very complex models]

Page 17:

Bias-Variance Analysis

• Interpolating over the points gives two curves that we can use for diagnosis

[Figure: smoothed training-error and validation-error curves vs. model complexity]

Bias-Variance Analysis

• Bias problem: the model is not expressive enough (underfitting). Signs: (1) high training error, (2) validation error ≈ training error

[Figure: the underfitting region of the two curves]

Page 18:

Bias-Variance Analysis

• Bias problem: the model is not expressive enough (underfitting). Signs: (1) high training error, (2) validation error ≈ training error

• Variance problem: the model is too expressive for the data set that we have (overfitting). Signs: (1) low training error, (2) validation error ≫ training error

[Figure: the underfitting and overfitting regions of the two curves]
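These two signs can be packaged into a small diagnostic helper (the thresholds below are illustrative choices, not from the slides):

```python
def diagnose(train_err, val_err, target_err=0.05, gap_tol=0.02):
    """Heuristic read-off from the two curves; thresholds are illustrative."""
    if train_err > target_err and (val_err - train_err) <= gap_tol:
        return "high bias (underfitting)"
    if train_err <= target_err and (val_err - train_err) > gap_tol:
        return "high variance (overfitting)"
    return "inconclusive"

print(diagnose(0.20, 0.21))  # high training error, small gap -> high bias
print(diagnose(0.01, 0.15))  # low training error, large gap -> high variance
```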

Learning Curve

Plotting the learning curve of an algorithm on a given data set is also a useful way to diagnose the algorithm's performance.

• Learning curve: the error rate of the learner as we increase the size of the training set

[Figure: error vs. #examples]

Page 19:

Learning Curve

• When the algorithm has access to only a few examples, the training error will be very low: it is easy to separate a small number of examples!

[Figure: very low training error at small #examples]

Learning Curve

• As more examples are sampled, it becomes more difficult to separate the data, and the training error will increase

[Figure: training error rising with #examples]
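This rise in training error can be reproduced with a quick simulation (the target function, noise level, and degree-4 model are illustrative): with very few examples an expressive model interpolates the training set almost exactly, and the training error climbs toward the noise floor as m grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2.0 * x)          # hypothetical target function

# Fixed held-out validation set drawn from the same distribution.
x_val = rng.uniform(0.0, 3.0, 500)
y_val = true_f(x_val) + rng.normal(0.0, 0.3, 500)

def learning_curve(sizes, degree=4):
    """(m, training error, validation error) for each training-set size m."""
    out = []
    for m in sizes:
        x = rng.uniform(0.0, 3.0, m)
        yv = true_f(x) + rng.normal(0.0, 0.3, m)
        coef = np.polyfit(x, yv, degree)
        train_err = np.mean((np.polyval(coef, x) - yv) ** 2)
        val_err = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
        out.append((m, train_err, val_err))
    return out

curve = learning_curve([5, 20, 80, 320])
```

At m = 5 the degree-4 polynomial passes through every training point (near-zero training error); at m = 320 it can no longer absorb the noise, so the training error is much higher.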

Page 20:

Learning Curve

• On the other hand, with more data the validation error is likely to decrease

[Figure: training error rising and validation error falling with #examples]

Learning Curve

• High-bias and high-variance models react differently when presented with more examples

[Figure: training and validation error vs. #examples]

Page 21:

Learning Curve (High Bias)

• High bias: we train a simple model (e.g., a linear classifier over the original feature space)

• Training error: increasing the number of examples will increase the training error, since the simple model cannot account for the "noise" introduced

[Figure: training and validation error vs. #examples for a high-bias model]

Learning Curve (High Bias)

• Adding more examples will not change the learned model much (it will just increase the training error)

• Adding more examples will not reduce the validation error significantly

Page 22:

Learning Curve (High Bias)

• Simple hypotheses converge quickly (after a few examples; adding more will not help)

• Feature engineering: carefully selecting a subset of features will not help; instead, add more relevant attributes!

• Reduce the regularization tradeoff

[Figure: flat training and validation error curves for the high-bias model]

Learning Curve (High Variance)

• In the high-variance case, let's assume we blow up the feature space, so we have a very expressive function

[Figure: error vs. number of examples]

Page 23:

Learning Curve (High Variance)

• When we have few examples, the model can easily overfit and have a small training error

[Figure: very low training error at small #examples]

Learning Curve (High Variance)

• As we add more examples, it might be more difficult to fit the data, so the training error will increase (the model might not account for all the "noise")

[Figure: training error rising with #examples]

Page 24:

Learning Curve (High Variance)

• The validation error will be high (the model is overfitting)

• High variance: a large gap between the error on the training and validation sets (useful for identifying high variance)

[Figure: large gap between training and validation error]

Learning Curve (High Variance)

1) Adding more examples will change the resulting classifier and decrease the gap (reduce the validation error!)

2) Simplifying the model (fewer features) might also help

3) Increase the regularization tradeoff

[Figure: training and validation error vs. #examples for the high-variance model]

Page 25:

Summary

• Bias/Variance: a convenient way to analyze a learning system's performance

– Identify underfitting/overfitting

– Both high bias and high variance can happen