©Jan-19 Christopher W. Clifton
Materials adapted from Profs. Jennifer Neville and Dan Goldwasser
CS37300: Data Mining & Machine Learning
Margin-Based Classifiers: Support Vector Machine
Prof. Chris Clifton
3 March 2020
Maximal Margin
• The discussion motivates the notion of maximal margin
• The maximal margin of a data set S is defined as:

γ(S) = max_{‖w‖=1} min_{(x,y)∈S} y wᵀx

1. For a given w: find the closest point.
2. Then find the w (of size 1) that gives the maximal margin value across all w’s.

Note: the selection of the point in the min, and therefore the max, does not change if we scale w, so it’s okay to deal only with normalized w’s.
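To make the max–min definition concrete, here is a small numerical sketch (not from the slides; the four data points and the grid search over angles are purely illustrative): it approximates γ(S) for a toy 2-D dataset by scanning unit vectors w.

```python
import numpy as np

# Tiny linearly separable toy dataset: points x with labels y in {-1, +1}.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def margin_for_w(w, X, y):
    """Inner min: the margin of the closest point, y * w^T x (w unit-norm)."""
    return np.min(y * (X @ w))

# Outer max: crude search over unit vectors w = (cos t, sin t) to approximate
# gamma(S) = max_{||w||=1} min_{(x,y) in S} y w^T x.
angles = np.linspace(0, 2 * np.pi, 10000)
margins = [margin_for_w(np.array([np.cos(t), np.sin(t)]), X, y) for t in angles]
gamma = max(margins)
print(round(gamma, 3))
```

For this toy set the search settles near w ∝ (1, 1), where the active constraints balance and γ(S) ≈ 2√2.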
Hard SVM Optimization
• We have shown that the sought after weight vector w is the solution of the following optimization problem:
SVM Optimization:
• This is an optimization problem in (n+1) variables, with |S|=m inequality constraints.
Minimize: ½ ‖w‖²
Subject to: ∀ (x,y) ∈ S: y wᵀx ≥ 1
Visualizing the Solution in the Non-Separable Case
Soft SVM
• Notice that the relaxation of the constraint y_i wᵀx_i ≥ 1 can be done by introducing a slack variable ξ_i (per example) and requiring: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0
• Now, we want to solve:

Min ½ ‖w‖² + C Σ_i ξ_i subject to y_i wᵀx_i ≥ 1 − ξ_i, ξ_i ≥ 0

• Which can be written as:

Min ½ wᵀw + C Σ_i max(0, 1 − y_i wᵀx_i)
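One standard way to minimize the unconstrained hinge-loss form above is subgradient descent. The sketch below is illustrative only: the toy data, the value of C, the learning rate, and the epoch count are all made up, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs with labels in {-1, +1}.
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C, lr, epochs = 1.0, 0.01, 200
w = np.zeros(2)

for _ in range(epochs):
    # Subgradient of 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i):
    # the regularizer contributes w; each example with hinge loss > 0
    # (margin < 1) contributes -C * y_i * x_i.
    margins = y * (X @ w)
    active = margins < 1
    grad = w - C * (y[active, None] * X[active]).sum(axis=0)
    w -= lr * grad

train_acc = np.mean(np.sign(X @ w) == y)
print(round(train_acc, 2))
```

The max(0, ·) term is not differentiable at margin = 1, which is why a subgradient (rather than a gradient) step is used.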
SVM Objective Function
• The problem we solved is: Min ½ ‖w‖² + C Σ_i ξ_i
• Where ξ_i ≥ 0 is a slack variable, defined as: ξ_i = max(0, 1 − y_i wᵀx_i)
– Equivalently, we can say that: y_i wᵀx_i ≥ 1 − ξ_i; ξ_i ≥ 0
• And this can be written as:

Min ½ ‖w‖² + C Σ_i max(0, 1 − y_i wᵀx_i)

(the first term is the regularization term, the second the empirical loss)

• General form of a learning algorithm:
– Minimize empirical loss, and regularize (to avoid overfitting)
What if the true model isn’t linearly separable?
By transforming the feature space these functions can be made linearly separable.
Represent each point in 2D as (x, x²).
SVM: Kernel Function
• Don’t need to transform the dataset
– Just need inner products (similarities)
• Kernel function: compute the inner product in the new space without actually computing point positions!
– Let φ(x) = (x₁², x₂², √2·x₁x₂)
– φ(x)ᵀφ(y) = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² = (xᵀy)²
• Radial basis function: kernel representing infinite expansion
– “Infinite dimension” feature space
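The kernel identity can be checked numerically. The sketch below (illustrative, not from the slides; the two sample points are arbitrary) confirms that the explicit map φ and the kernel (xᵀy)² yield the same inner product:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(u, v):
    """Kernel trick: the same inner product without mapping into 3D."""
    return (u @ v) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # inner product in the transformed space
implicit = k(x, z)           # computed directly from the original vectors
print(explicit, implicit)
```

The kernel evaluation costs one 2-D dot product, while the explicit route materializes 3-D features first; for the RBF kernel the explicit route is impossible, which is exactly why the trick matters.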
Polynomial Regression
• GD (gradient descent): a general optimization algorithm
– Works for classification (with a different loss function)
– Incorporate polynomial features to fit a complex model
– Danger: overfitting!
Regularization
• Both for regression and classification, for a given error we prefer a simpler model
– Keep w small: ε changes in the input cause only ε·w changes in the output
• Sometimes we are even willing to trade a higher error rate for a simpler model (why?)
• Add a regularization term:
– This is a form of inductive bias
Regularization
• A very popular choice of regularization term is to minimize the norm of the weight vector
– For example, the squared L2 norm; for convenience, ½ the squared norm
• What is the update gradient of the loss function?
– The gradient of ½λ‖w‖² is λ·w, so at each iteration we subtract λ·w from the weights
• In general, we can pick other norms (p-norms)
– Referenced as the L-p norm
– Different p will have a different effect!
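The λ·w update can be seen in a minimal ridge-style sketch (all constants here are illustrative; it assumes a half-mean-squared-error loss with an L2 penalty, which matches the ½-squared-norm convention above):

```python
import numpy as np

# Loss: (1/2n)||Xw - y||^2 + (lambda/2)||w||^2.
# The regularizer's gradient is simply lambda * w ("weight decay").
def grad_step(w, X, y, lr=0.1, lam=0.5):
    pred_grad = X.T @ (X @ w - y) / len(y)   # gradient of the data-fit term
    return w - lr * (pred_grad + lam * w)    # shrink w toward zero each step

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
w = np.zeros(2)
for _ in range(500):
    w = grad_step(w, X, y)
print(np.round(w, 3))
```

With X = I and λ = 0.5 the iteration converges to w = y/2 rather than the unregularized solution w = y, showing the shrinkage the penalty induces.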
Model Complexity and Bias/Variance Tradeoff
![Page 7: CS37300: Data Mining & Machine Learning©Jan20-19 Christopher W. Clifton 3 Soft SVM • Notice that the relaxation of the constraint: y i w i tx i ≥ 1 can be done by introducing](https://reader033.vdocument.in/reader033/viewer/2022053019/5f252cc53bbd843bbd574b89/html5/thumbnails/7.jpg)
©Jan-19 Christopher W. Clifton 720
Bias/Variance
• Easy to define very powerful learners
– Move to a higher dimension
• Is it always a good idea?
• Simple (simplistic) approach:
• Option 1: Learning doesn’t work – our model is not powerful enough!
– Move to a richer representation
• Option 2: Learning doesn’t work – our model overfits!
– Get more data! Remove features! Regularize!
• How can we tell?
Bias-Variance Tradeoff
Sample different sets from P(x,y) and train (each x is a learned model).
How would the learned models “behave” when trained over different datasets from the distribution?

[Figure: a 2×2 grid of bias vs. variance:
– Low bias (powerful enough to learn), low variance (doesn’t overfit)
– High bias (cannot represent target), low variance (doesn’t overfit)
– High bias (cannot represent target), high variance (overfits)
– Low bias (powerful enough to learn), high variance (overfits)]
Bias-Variance Tradeoff

Sample different sets from P(x,y) and train (each x is a learned model).
How would the learned models “behave” when trained over different datasets from the distribution?

[Figure: the bias/variance grid, highlighting models clustered together (low variance) in a “good” place (low bias)]
Bias-Variance Tradeoff

Sample different sets from P(x,y) and train (each x is a learned model).
How would the learned models “behave” when trained over different datasets from the distribution?

[Figure: the bias/variance grid, highlighting models clustered together (low variance) away from the target function (high bias)]
Bias-Variance Tradeoff

Sample different sets from P(x,y) and train (each x is a learned model).
How would the learned models “behave” when trained over different datasets from the distribution?

[Figure: the bias/variance grid, highlighting models far apart (high variance) and away from the target function (high bias)]
Overfitting
• Model complexity: “how many parameters are learned?”
– Decision trees: depth; linear models: number of features, degree of a polynomial kernel
• Empirical error: for a given dataset, what percentage of items is misclassified by the learned classifier?

[Figure: empirical error vs. model complexity]
Data Generating Distribution
• We assume a distribution P(x,y) from which the dataset is sampled
– The data generating distribution
• Training and testing examples are drawn from the same distribution
– The examples are identically distributed
• Each example is drawn independently
– We say that the train/test examples are independently and identically distributed (i.i.d.)
• We care about our performance over any sample from that distribution – not just the one we observe during training
Overfitting
Expected Error: what percentage of items drawn from P(x,y) do we expect to be misclassified by the learned classifier?
Let’s consider different samples from the data distribution. You can get closer to the expected loss by considering the expectation of the empirical loss on (all/many) different data samples.
Let’s continue the discussion with that in mind!

[Figure: expected error vs. model complexity]
The Variance of the Learner
How susceptible is the learner to minor changes in the training data? (i.e., different samples from P(x,y))
• You can think about the testing data as another sample
• Overfitting: variance increases with model complexity!

[Figure: variance rising with model complexity]
Bias of the Learner
How likely is the learner to identify the target hypothesis?
• What will happen if we test on a different sample?
• What will happen if we train on a different sample?
• Underfitting: bias is high when the model is too simple

[Figure: bias falling with model complexity]
Model Complexity
Simple models: high bias and low variance
Complex models: high variance and low bias

[Figure: bias and variance curves vs. model complexity]
Model Complexity
Simple models: high bias and low variance
Complex models: high variance and low bias

[Figure: bias and variance curves vs. model complexity, with underfitting on the left and overfitting on the right]
Impact of bias and variance
Expected Error ≈ Bias + Variance
[Figure: expected error vs. model complexity as the sum of the bias and variance curves]
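The decomposition can be illustrated with a small simulation (everything here — the target f(x) = x², the noise level, the polynomial degrees, and the test point — is made up for illustration): repeatedly resample a dataset from a fixed P(x,y), fit a simple and a complex model, and compare the spread and centering of their predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target f(x) = x^2 with noisy observations. Two learners: degree-1
# (simple, high bias) vs. degree-8 (complex, high variance) polynomials.
def fit_predict(deg, x_test):
    x = rng.uniform(0, 1, 30)
    y = x**2 + rng.normal(0, 0.05, 30)
    coef = np.polyfit(x, y, deg)       # least-squares polynomial fit
    return np.polyval(coef, x_test)

x0, truth = 0.9, 0.81                  # f(0.9) = 0.81
preds_simple = np.array([fit_predict(1, x0) for _ in range(300)])
preds_complex = np.array([fit_predict(8, x0) for _ in range(300)])

bias_simple = abs(preds_simple.mean() - truth)    # large: line can't fit x^2
bias_complex = abs(preds_complex.mean() - truth)  # small: x^2 is representable
var_simple = preds_simple.var()                   # small: stable across samples
var_complex = preds_complex.var()                 # large: chases the noise
print(bias_simple > bias_complex, var_complex > var_simple)
```

Each retraining plays the role of one “sampled dataset from P(x,y)” in the dartboard pictures above.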
Bias-Variance Tradeoff in Practice
• We saw that the classification error can be expressed in terms of bias and variance
• Reducing the bias/variance can reduce expected error!
• Different scenarios can lead to different actions for reducing the error
– High bias: add more features
– High variance: simplify the model, add more examples
• How can we diagnose each one of these scenarios?
Bias-Variance Analysis
• Let’s look at the influence each scenario has on the validation and training errors
• Let’s compare simple models (e.g., linear in x) to complex ones

[Figure: error vs. model complexity]
Bias-Variance Analysis
As we move to a more expressive model, the training error is likely to decrease.

[Figure: training error vs. model complexity]
Bias-Variance Analysis
When we are using a simple model, it’s very likely we’ll underfit, resulting in a high validation error. As we move to a more expressive model, the training error is likely to decrease.

[Figure: training and validation error vs. model complexity]
Bias-Variance Analysis
If we increase the complexity, we can do better on the validation data.

[Figure: validation error for the simple model vs. the complex model]
Bias-Variance Analysis
If we keep increasing the complexity, the model will eventually overfit and the validation error will increase.

[Figure: validation error for the simple, complex, and very complex models]
Bias-Variance Analysis
Interpolating over the points: two curves that we can use for diagnosis.

[Figure: smooth training-error and validation-error curves vs. model complexity]
Bias-Variance Analysis
Bias problem: the model is not expressive enough (underfitting).
(1) High training error
(2) Validation error ≈ training error

[Figure: the two diagnosis curves, with the underfitting region marked]
Bias-Variance Analysis
Bias problem: the model is not expressive enough (underfitting).
(1) High training error
(2) Validation error ≈ training error

Variance problem: the model is too expressive for the data set that we have (overfitting).
(1) Low training error
(2) Validation error > training error

[Figure: the two diagnosis curves, with underfitting and overfitting regions marked]
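The two diagnosis rules can be phrased as a rule-of-thumb function. The sketch below is illustrative only: the thresholds `acceptable` and `gap` are invented for the example, not values from the lecture.

```python
# Rule-of-thumb diagnosis from training and validation error.
# "acceptable" and "gap" are made-up illustrative thresholds.
def diagnose(train_err, val_err, acceptable=0.05, gap=0.05):
    if train_err > acceptable and abs(val_err - train_err) < gap:
        # High training error, validation ~ training error.
        return "high bias (underfitting)"
    if train_err <= acceptable and val_err - train_err >= gap:
        # Low training error, validation error much higher.
        return "high variance (overfitting)"
    return "looks ok"

print(diagnose(0.20, 0.22))  # -> "high bias (underfitting)"
print(diagnose(0.01, 0.15))  # -> "high variance (overfitting)"
```

In practice such thresholds depend on the irreducible noise of the task, so the curves themselves are more informative than any fixed cutoff.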
Learning Curve
Plotting the learning curve of an algorithm on a given data set is also a useful way to diagnose the algorithm’s performance.

Learning curve: the error rate of the learner as we increase the size of the training set.

[Figure: error vs. #examples]
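A learning curve can be computed directly by training on increasingly large samples. The sketch below is illustrative (the target sin(3x), the noise level, the degree-1 model, and the sample sizes are all made up); it shows the pattern the next slides describe: training error rises and validation error falls as examples are added.

```python
import numpy as np

rng = np.random.default_rng(2)

# High-bias learner: a degree-1 polynomial fit to noisy samples of sin(3x).
def errors(n_train, deg=1):
    """Return (training MSE, validation MSE) for one sampled dataset."""
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.1, n_train)
    x_va = rng.uniform(-1, 1, 500)
    y_va = np.sin(3 * x_va) + rng.normal(0, 0.1, 500)
    coef = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    va = np.mean((np.polyval(coef, x_va) - y_va) ** 2)
    return tr, va

def avg_errors(n_train, reps=100):
    """Average the two errors over many resampled training sets."""
    return np.mean([errors(n_train) for _ in range(reps)], axis=0)

tr5, va5 = avg_errors(5)        # few examples: easy to fit the training set
tr200, va200 = avg_errors(200)  # many examples: training error rises
print(round(tr5, 3), round(va5, 3), round(tr200, 3), round(va200, 3))
```

Averaging over repetitions smooths the curve, mirroring the expectation over datasets used in the bias/variance discussion.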
Learning Curve
When the algorithm has access to a few examples, the training error will be very low: it is easy to separate a small number of examples!

[Figure: training error vs. #examples, low at first]
Learning Curve
As more examples are sampled, it is more difficult to separate the data and the training error will increase.

[Figure: training error rising with #examples]
Learning Curve
On the other hand, with more data the validation error is likely to decrease.

[Figure: training error rising and validation error falling with #examples]
Learning Curve
High bias and high variance models react differently when presented with more examples.

[Figure: training and validation error vs. #examples]
Learning Curve (High Bias)
High bias: we train a simple model (e.g., a linear classifier over the original feature space).
Training error: increasing the number of examples will increase the training error, since the simple model cannot account for the “noise” introduced.

[Figure: training and validation error vs. #examples for a high-bias model]
Learning Curve (High Bias)
Adding more examples will not change the learned model (it just increases the training error), and will not reduce the validation error significantly.

[Figure: both curves flattening out]
Learning Curve (High Bias)
Simple hypotheses converge quickly (after a few examples; adding more will not help). Remedies:
– Feature engineering: carefully selecting a subset of features will not help; instead, add more relevant attributes!
– Reduce the regularization tradeoff

[Figure: converged training and validation error curves]
Learning Curve (High Variance)
In the high variance case, let’s assume we blow up the feature space, and we have a very expressive function.

[Figure: error vs. #examples]
Learning Curve (High Variance)
When we have few examples, the model can easily overfit and have a small training error.

[Figure: training error vs. #examples, low at first]
Learning Curve (High Variance)
As we add more examples, it might be more difficult to fit the data, so the training error will increase (the model might not account for all the “noise”).

[Figure: training error rising with #examples]
Learning Curve (High Variance)
The validation error will be high (the model is overfitting).
High variance: a large gap between the error on the training and validation sets (useful for identifying high variance).

[Figure: large gap between validation and training error]
Learning Curve (High Variance)
Remedies:
1) Adding more examples will change the resulting classifier and decrease the gap (reduce the validation error!)
2) Simplifying the model (fewer features) might also help
3) Increase the regularization tradeoff

[Figure: training/validation gap]
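The gap-shrinking effect of remedy (1) can be checked with a quick simulation. Everything below is illustrative (a degree-12 polynomial on noisy sin(3x), arbitrary sample sizes and noise): the train/validation gap of an over-expressive model shrinks as more examples are added.

```python
import numpy as np

rng = np.random.default_rng(3)

# High-variance learner: a degree-12 polynomial (far more expressive than
# the target needs) fit to noisy samples of sin(3x).
def gap(n_train, deg=12, reps=40):
    """Average (validation - training) MSE gap over repeated samples."""
    gaps = []
    for _ in range(reps):
        x_tr = rng.uniform(-1, 1, n_train)
        y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.1, n_train)
        x_va = rng.uniform(-1, 1, 500)
        y_va = np.sin(3 * x_va) + rng.normal(0, 0.1, 500)
        coef = np.polyfit(x_tr, y_tr, deg)
        tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
        va = np.mean((np.polyval(coef, x_va) - y_va) ** 2)
        gaps.append(va - tr)
    return float(np.mean(gaps))

g_small, g_large = gap(20), gap(400)   # few vs. many training examples
print(g_small > g_large)
```

With only 20 points the 13-parameter model nearly interpolates the training set (huge gap); with 400 points the same model is forced toward the true function and the gap collapses.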
Summary
• Bias/Variance: a convenient way to analyze a learning system’s performance
– Identify underfitting/overfitting
– Both high bias and high variance can happen