Page 1:

©Jan-19 Christopher W. Clifton

Materials adapted from Profs. Jennifer Neville and Dan Goldwasser

CS37300:

Data Mining & Machine Learning

Margin-Based Classifiers:

Support Vector Machine

Prof. Chris Clifton

3 March 2020

Maximal Margin

• The discussion motivates the notion of maximal margin

• The maximal margin of a data set S is defined as:

γ(S) = max_{||w||=1} min_{(x,y) ∈ S} y wᵀx

1. For a given w: find the closest point (the one minimizing y wᵀx).
2. Then, find the w that gives the maximal margin value across all w's (of size 1).

Note: the selection of the point in the min, and therefore the max, does not change if we scale w, so it's okay to deal only with normalized w's.
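The two-step definition above can be checked numerically. A brute-force sketch (the dataset and function name are illustrative, not from the slides) that approximates γ(S) in 2-D by scanning unit vectors w:

```python
import numpy as np

# Hypothetical separable 2-D dataset; labels are in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def maximal_margin(X, y, n_angles=3600):
    """Approximate gamma(S) = max_{||w||=1} min_{(x,y) in S} y * w.x
    by scanning unit vectors w = (cos t, sin t)."""
    best = -np.inf
    for t in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(t), np.sin(t)])   # ||w|| = 1 by construction
        best = max(best, np.min(y * (X @ w)))  # inner min: closest point for this w
    return best

gamma = maximal_margin(X, y)  # positive iff the data are linearly separable
```

A positive result confirms separability; scaling w would scale the inner product but not change which point attains the min, which is why restricting to ||w|| = 1 loses nothing.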

Page 2:

Hard SVM Optimization

• We have shown that the sought-after weight vector w is the solution of the following optimization problem:

SVM Optimization:

Minimize: ½ ||w||²

Subject to: ∀ (x,y) ∈ S: y wᵀx ≥ 1

• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.

Visualizing the Solution in the Non-Separable Case

Page 3:

Soft SVM

• Notice that the relaxation of the constraint y_i wᵀx_i ≥ 1 can be done by introducing a slack variable ξ_i (one per example) and requiring: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• Now, we want to solve:

Min ½ ||w||² + C Σ_i ξ_i subject to y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• Which can be written as:

Min ½ wᵀw + C Σ_i max(0, 1 − y_i wᵀx_i)
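The unconstrained hinge-loss form above can be minimized directly by sub-gradient descent. The following sketch is illustrative (synthetic data, no bias term, hand-picked step size and names), not the optimizer prescribed by the course:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-class data: Gaussian blobs around (+2,+2) and (-2,-2).
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

def soft_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on 1/2 w.w + C * sum_i max(0, 1 - y_i w.x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0                  # examples violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0)
        w -= lr * (w + C * hinge_grad)          # regularizer + hinge sub-gradient
    return w

w = soft_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)              # training accuracy
```

On this well-separated synthetic data the learned w classifies almost every point correctly; C trades margin width against the total slack.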

SVM Objective Function

• The problem we solved is: Min ½ ||w||² + C Σ_i ξ_i

• Where ξ_i ≥ 0 is a slack variable, defined as: ξ_i = max(0, 1 − y_i wᵀx_i)

– Equivalently, we can say that: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0

• In this objective, ½ ||w||² is the regularization term and C Σ_i ξ_i is the empirical loss.

• General form of a learning algorithm:

– Minimize empirical loss, and regularize (to avoid overfitting)

Page 4:

What if the true model isn't linearly separable?

• By transforming the feature space, these functions can be made linearly separable

• Represent each point in 2-D as (x, x²)
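A tiny illustration of this idea (the points below are hypothetical): positive labels inside an interval and negative labels outside it cannot be split by a single threshold on x, but in the (x, x²) representation the linear rule with w = (0, −1) and b = 2 separates them:

```python
# Hypothetical 1-D dataset: + inside the interval, - outside it.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [-1, -1, 1, 1, 1, -1, -1]

# No single threshold on x separates these labels. In the (x, x^2) space,
# sign(w . (x, x^2) + b) with w = (0, -1), b = 2 is linear and implements
# "predict + iff x^2 < 2".
preds = [1 if -x * x + 2.0 > 0 else -1 for x in xs]
```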

SVM: Kernel Function

• Don’t need to transform the dataset

– Just need inner products

• Kernel function: compute the inner product in the new space without actually computing point positions!

– Let φ(x) = (x₁², x₂², √2·x₁x₂)

– Then φ(x)ᵀφ(y) = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² = (xᵀy)²

• Radial basis function: a kernel representing an infinite expansion

– “Infinite-dimensional” feature space
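The degree-2 polynomial kernel identity above is easy to verify numerically; note that the √2 factor in the feature map is what makes the cross term come out as 2x₁x₂y₁y₂ (the points below are arbitrary examples):

```python
import math

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(z))   # inner product in the transformed space
rhs = dot(x, z) ** 2        # kernel value, computed without mapping
```

The right-hand side never materializes the 3-D feature vectors, which is the whole point of the kernel trick.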

Page 5:

Polynomial Regression

• Gradient descent (GD): a general optimization algorithm

– Works for classification (with a different loss function)

– Incorporate polynomial features to fit a complex model

– Danger: overfitting!

Regularization

• Both for regression and classification, for a given error we prefer a simpler model

– Keep w small: an ε change in the input causes only an ε·w change in the output

• Sometimes we are even willing to trade a higher error rate for a simpler model (why?)

• Add a regularization term:

– This is a form of inductive bias

Page 6:

Regularization

• A very popular choice of regularization term is to minimize the norm of the weight vector

– For convenience: ½ the squared norm

• What is the gradient update of the loss function?

– At each iteration we also subtract λ·w from the weights

• In general, we can pick other norms (p-norms)

– Referred to as the L-p norm

– Different p will have a different effect!
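The λ·w term can be seen in a single gradient step. A minimal sketch (function name and constants hypothetical) of one update on a loss with an added (λ/2)||w||² regularizer:

```python
def ridge_step(w, grad_loss, lr=0.1, lam=0.01):
    # One gradient step on loss + (lam/2) * ||w||^2: the regularizer
    # contributes lam * w to the gradient, so every step also shrinks
    # w toward zero ("weight decay").
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad_loss)]

w = [1.0, -2.0]
w = ridge_step(w, [0.0, 0.0])  # pure decay when the data gradient is zero
```

With a zero data gradient each weight shrinks by a factor (1 − lr·λ), which is the inductive bias toward simpler models described above.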

Model Complexity and the Bias/Variance Tradeoff

Page 7:

Bias/Variance

• Easy to define very powerful learners

– Move to a higher dimension

• Is it always a good idea?

• Simple (simplistic) approach:

– Option 1: Learning doesn't work because our model is not powerful enough! Move to a richer representation.

– Option 2: Learning doesn't work because our model overfits! Get more data! Remove features! Regularize!

• How can we tell?

Bias-Variance Tradeoff

Sample different data sets from P(x,y) and train (each × in the figure is a learned model). How would the learned models "behave" when trained over different data sets from the distribution?

[2×2 figure, bias vs. variance:
– Low bias (powerful enough to learn the target), low variance (doesn't overfit)
– High bias (cannot represent the target), low variance (doesn't overfit)
– High bias (cannot represent the target), high variance (overfits)
– Low bias (powerful enough to learn the target), high variance (overfits)]
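This "behavior across data sets" can be observed in a small simulation (everything below, including the target function, noise level, and polynomial degrees, is an illustrative choice): fit a simple and a complex model on many data sets drawn from the same distribution and compare how much their predictions vary:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2.0 * x)          # hypothetical target function

x_grid = np.linspace(0.0, 3.0, 50)  # fixed points at which we compare models

def predictions_over_datasets(degree, n_datasets=200, n=15):
    """Fit one polynomial per sampled data set; return all predictions on x_grid."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0.0, 3.0, n)
        yv = true_f(x) + rng.normal(0.0, 0.3, n)
        coef = np.polyfit(x, yv, degree)
        preds.append(np.polyval(coef, x_grid))
    return np.array(preds)

# Average (over x_grid) of the variance of predictions across data sets.
var_simple = predictions_over_datasets(1).var(axis=0).mean()
var_complex = predictions_over_datasets(9).var(axis=0).mean()
```

The complex model's predictions swing far more from sample to sample, which is exactly the "far apart" (high-variance) picture above.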

Page 8:

Bias-Variance Tradeoff

Sample different data sets from P(x,y) and train (each × is a learned model). How would the learned models "behave" when trained over different data sets from the distribution?

[Figure: the learned models are clustered together (low variance) in a "good" place (low bias)]

Page 9:

Bias-Variance Tradeoff

[Figure: the learned models are clustered together (low variance) but away from the target function (high bias)]

Bias-Variance Tradeoff

[Figure: the learned models are far apart (high variance) and away from the target function (high bias)]

Page 10:

Overfitting

• Model complexity: "how many parameters are learned?"

– Decision trees: depth; linear models: number of features, degree of a polynomial kernel

• Empirical error: for a given dataset, what percentage of items are misclassified by the learned classifier?

[Figure: empirical error vs. model complexity]

Data Generating Distribution

• We assume a distribution P(x,y) from which the dataset is sampled

– The data-generating distribution

• Training and testing examples are drawn from the same distribution

– The examples are identically distributed

• Each example is drawn independently

– We say that the train/test examples are independent and identically distributed (i.i.d.)

• We care about our performance over any sample from that distribution, not just the one we observe during training

Page 11:

Overfitting

• Expected error: what percentage of items drawn from P(x,y) do we expect to be misclassified by the learned classifier?

• Let's consider different samples from the data distribution. You can get closer to the expected loss by considering the expectation of the empirical loss over (all/many) different data samples.

• Let's continue the discussion with that in mind!

[Figure: expected error vs. model complexity]

The Variance of the Learner

• How susceptible is the learner to minor changes in the training data (i.e., different samples from P(x,y))?

– You can think about the testing data as just another sample

• Overfitting: variance increases with model complexity!

[Figure: variance growing with model complexity]

Page 12:

Bias of the Learner

• How likely is the learner to identify the target hypothesis?

– What will happen if we test on a different sample?

– What will happen if we train on a different sample?

• Underfitting: bias is high when the model is too simple

[Figure: bias shrinking as model complexity grows]

Model Complexity

• Simple models: high bias and low variance

• Complex models: high variance and low bias

[Figure: expected error vs. model complexity, with bias and variance curves]

Page 13:

Model Complexity

• Simple models: high bias and low variance (underfitting)

• Complex models: high variance and low bias (overfitting)

[Figure: expected error vs. model complexity, with the underfitting and overfitting regimes marked]

Impact of Bias and Variance

Expected Error ≈ Bias + Variance

[Figure: the expected-error curve as the sum of the bias and variance curves]
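For squared loss this approximation is in fact exact: expected error = bias² + variance + irreducible noise. A simulation sketch of the decomposition at a single fixed query point (all constants below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000        # number of simulated training sets
true_y = 1.5       # target value at the fixed query point (hypothetical)
bias = 0.4         # systematic error of the learner (hypothetical)
model_sd = 0.3     # spread of predictions across training sets (variance)
noise_sd = 0.2     # irreducible label noise

preds = true_y + bias + rng.normal(0.0, model_sd, n)   # one model per training set
labels = true_y + rng.normal(0.0, noise_sd, n)         # noisy test labels

mse = np.mean((preds - labels) ** 2)
decomposed = bias ** 2 + model_sd ** 2 + noise_sd ** 2  # bias^2 + variance + noise
```

Up to sampling error, `mse` matches `decomposed`, illustrating why shrinking either bias or variance lowers the expected error.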

Page 14:

Bias-Variance Tradeoff in Practice

• We saw that the classification error can be expressed in terms of bias and variance

• Reducing the bias/variance can reduce the expected error!

• Different scenarios call for different actions to reduce the error

– High bias: add more features

– High variance: simplify the model, add more examples

• How can we diagnose each one of these scenarios?

Bias-Variance Analysis

• Let's look at the influence each scenario has on the validation and training errors

• Let's compare simple models (e.g., linear in x) to complex ones

[Figure: error vs. model complexity]

Page 15:

Bias-Variance Analysis

• As we move to a more expressive model, the training error is likely to decrease

[Figure: training error vs. model complexity]

Bias-Variance Analysis

• When we are using a simple model, it's very likely we'll underfit, resulting in a high validation error

[Figure: training and validation error vs. model complexity]

Page 16:

Bias-Variance Analysis

• If we increase the complexity, we can do better on the validation data

[Figure: validation error for the simple model vs. the complex model]

Bias-Variance Analysis

• If we keep increasing the complexity, the model will eventually overfit and the validation error will increase

[Figure: validation error for the simple, complex, and very complex models]

Page 17:

Bias-Variance Analysis

• Interpolating over the points gives two curves that we can use for diagnosis

[Figure: smoothed training-error and validation-error curves vs. model complexity]

Bias-Variance Analysis

• Bias problem: the model is not expressive enough (underfitting). Signs: (1) high training error, (2) validation error ≈ training error

[Figure: the underfitting region of the two curves]

Page 18:

Bias-Variance Analysis

• Bias problem: the model is not expressive enough (underfitting). Signs: (1) high training error, (2) validation error ≈ training error

• Variance problem: the model is too expressive for the data set that we have (overfitting). Signs: (1) low training error, (2) validation error ≫ training error

[Figure: the underfitting and overfitting regions of the two curves]
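These two signs can be packaged into a small diagnostic helper (the thresholds below are illustrative choices, not from the slides):

```python
def diagnose(train_err, val_err, target_err=0.05, gap_tol=0.02):
    """Heuristic read-off from the two curves; thresholds are illustrative."""
    if train_err > target_err and (val_err - train_err) <= gap_tol:
        return "high bias (underfitting)"
    if train_err <= target_err and (val_err - train_err) > gap_tol:
        return "high variance (overfitting)"
    return "inconclusive"

print(diagnose(0.20, 0.21))  # high training error, small gap -> high bias
print(diagnose(0.01, 0.15))  # low training error, large gap -> high variance
```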

Learning Curve

Plotting the learning curve of an algorithm on a given data set is also a useful way to diagnose the algorithm's performance.

• Learning curve: the error rate of the learner as we increase the size of the training set

[Figure: error vs. #examples]

Page 19:

Learning Curve

• When the algorithm has access to only a few examples, the training error will be very low: it is easy to separate a small number of examples!

[Figure: very low training error at small #examples]

Learning Curve

• As more examples are sampled, it becomes more difficult to separate the data, and the training error will increase

[Figure: training error rising with #examples]
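This rise in training error can be reproduced with a quick simulation (the target function, noise level, and degree-4 model are illustrative): with very few examples an expressive model interpolates the training set almost exactly, and the training error climbs toward the noise floor as m grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2.0 * x)          # hypothetical target function

# Fixed held-out validation set drawn from the same distribution.
x_val = rng.uniform(0.0, 3.0, 500)
y_val = true_f(x_val) + rng.normal(0.0, 0.3, 500)

def learning_curve(sizes, degree=4):
    """(m, training error, validation error) for each training-set size m."""
    out = []
    for m in sizes:
        x = rng.uniform(0.0, 3.0, m)
        yv = true_f(x) + rng.normal(0.0, 0.3, m)
        coef = np.polyfit(x, yv, degree)
        train_err = np.mean((np.polyval(coef, x) - yv) ** 2)
        val_err = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
        out.append((m, train_err, val_err))
    return out

curve = learning_curve([5, 20, 80, 320])
```

At m = 5 the degree-4 polynomial passes through every training point (near-zero training error); at m = 320 it can no longer absorb the noise, so the training error is much higher.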

Page 20:

Learning Curve

• On the other hand, with more data the validation error is likely to decrease

[Figure: training error rising and validation error falling with #examples]

Learning Curve

• High-bias and high-variance models react differently when presented with more examples

[Figure: training and validation error vs. #examples]

Page 21:

Learning Curve (High Bias)

• High bias: we train a simple model (e.g., a linear classifier over the original feature space)

• Training error: increasing the number of examples will increase the training error, since the simple model cannot account for the "noise" introduced

[Figure: training and validation error vs. #examples for a high-bias model]

Learning Curve (High Bias)

• Adding more examples will not change the learned model much (it will just increase the training error)

• Adding more examples will not reduce the validation error significantly

Page 22:

Learning Curve (High Bias)

• Simple hypotheses converge quickly (after a few examples; adding more will not help)

• Feature engineering: carefully selecting a subset of features will not help; instead, add more relevant attributes!

• Reduce the regularization tradeoff

[Figure: flat training and validation error curves for the high-bias model]

Learning Curve (High Variance)

• In the high-variance case, let's assume we blow up the feature space, so we have a very expressive function

[Figure: error vs. number of examples]

Page 23:

Learning Curve (High Variance)

• When we have few examples, the model can easily overfit and have a small training error

[Figure: very low training error at small #examples]

Learning Curve (High Variance)

• As we add more examples, it might be more difficult to fit the data, so the training error will increase (the model might not account for all the "noise")

[Figure: training error rising with #examples]

Page 24:

Learning Curve (High Variance)

• The validation error will be high (the model is overfitting)

• High variance: a large gap between the error on the training and validation sets (useful for identifying high variance)

[Figure: large gap between training and validation error]

Learning Curve (High Variance)

1) Adding more examples will change the resulting classifier and decrease the gap (reduce the validation error!)

2) Simplifying the model (fewer features) might also help

3) Increase the regularization tradeoff

[Figure: training and validation error vs. #examples for the high-variance model]

Page 25:

Summary

• Bias/Variance: a convenient way to analyze a learning system's performance

– Identify underfitting/overfitting

– Both high bias and high variance can happen