TRANSCRIPT
-
Learning: Linear Methods
CE417: Introduction to Artificial Intelligence
Sharif University of Technology
Fall 2020
Soleymani
Some slides are based on Klein and Abbeel, CS188, UC Berkeley.
-
Paradigms of ML
Reinforcement learning
partial (indirect) feedback, no explicit guidance
rewards are given for a sequence of moves, from which a policy and utility functions are learned
Supervised learning (regression, classification)
predicting a target variable for which we get to see examples
Unsupervised learning
revealing structure in the observed unlabeled data
-
Components of (Supervised) Learning
Unknown target function: $f: \mathcal{X} \to \mathcal{Y}$
Input space: $\mathcal{X}$
Output space: $\mathcal{Y}$
Training data: $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \dots, (\boldsymbol{x}^{(N)}, y^{(N)})$
Pick a formula $g: \mathcal{X} \to \mathcal{Y}$ that approximates the target function $f$, selected from a set of hypotheses $\mathcal{H}$
-
Supervised Learning:
Regression vs. Classification
Supervised Learning
Regression: predict a continuous target variable, e.g., $y \in [0, 1]$
Classification: predict a discrete target variable, e.g., $y \in \{1, 2, \dots, C\}$
-
Regression: Example
Housing price prediction
[Figure: scatter plot of housing prices ($ in 1000's) vs. size in feet²]
Figure adopted from slides of Andrew Ng
-
Training data: Example
$x_1$   $x_2$   $y$
0.9   2.3   1
3.5   2.6   1
2.6   3.3   1
2.7   4.1   1
1.8   3.9   1
6.5   6.8   -1
7.2   7.5   -1
7.9   8.3   -1
6.9   8.3   -1
8.8   7.9   -1
9.1   6.2   -1
[Figure: the two classes plotted in the $(x_1, x_2)$ plane]
Training data
-
Classification: Example
Weight (Cat, Dog)
[Figure: labels 1 (Dog) and 0 (Cat) plotted against weight]
-
Linear regression
Cost function:
[Figure: fitted line for price ($ in 1000's) vs. size in feet², with residuals $y^{(i)} - g(x^{(i)}; \boldsymbol{w})$ shown]
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - g(x^{(i)}; \boldsymbol{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - w_0 - w_1 x^{(i)} \right)^2$$
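The cost above is easy to sketch in code; the sizes, prices, and weights below are made-up illustrative numbers, not data from the slides.

```python
# Sum-of-squares cost J(w) = sum_i (y_i - w0 - w1*x_i)^2 for a line
# g(x; w) = w0 + w1*x. All data values are made up for illustration.

def sse_cost(w0, w1, xs, ys):
    return sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys))

sizes  = [1.0, 1.5, 2.0]        # hypothetical sizes (1000's of feet^2)
prices = [200.0, 300.0, 400.0]  # hypothetical prices ($1000's)

# The line price = 200 * size fits these points exactly, so J = 0.
print(sse_cost(0.0, 200.0, sizes, prices))  # 0.0
print(sse_cost(0.0, 100.0, sizes, prices))  # 72500.0 for a worse line
```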
-
Cost function
[Figure: price ($ in 1000's) vs. size in feet² ($x$) on the left; the cost $J(\boldsymbol{w})$ as a function of the parameters $w_0, w_1$ on the right]
This example has been adapted from Prof. Andrew Ng's slides
-
Review:
Gradient descent
First-order optimization algorithm to find $\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} J(\boldsymbol{w})$
Also known as "steepest descent"
In each step, takes steps proportional to the negative of the gradient vector of the function at the current point $\boldsymbol{w}^t$:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \gamma_t \nabla J(\boldsymbol{w}^t)$$
$J(\boldsymbol{w})$ decreases fastest if one moves from $\boldsymbol{w}^t$ in the direction of $-\nabla J(\boldsymbol{w}^t)$
Assumption: $J(\boldsymbol{w})$ is defined and differentiable in a neighborhood of the point $\boldsymbol{w}^t$
Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
-
Review:
Gradient descent
Minimize $J(\boldsymbol{w})$:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \nabla_{\boldsymbol{w}} J(\boldsymbol{w}^t)$$
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \left[ \frac{\partial J(\boldsymbol{w})}{\partial w_1}, \frac{\partial J(\boldsymbol{w})}{\partial w_2}, \dots, \frac{\partial J(\boldsymbol{w})}{\partial w_d} \right]$$
$\eta$ is the step size (learning rate parameter).
If $\eta$ is small enough, then $J(\boldsymbol{w}^{t+1}) \le J(\boldsymbol{w}^t)$.
$\eta$ can be allowed to change at every iteration as $\eta_t$.
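A minimal sketch of the update $w^{t+1} = w^t - \eta \nabla J(w^t)$ on a one-dimensional convex cost; the cost $J(w) = (w - 3)^2$ and the step size are illustrative choices, not from the slides.

```python
# Gradient descent on J(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# The cost function and step size here are illustrative choices.

def gradient_descent(grad, w0, eta=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)   # w_{t+1} = w_t - eta * grad J(w_t)
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_star, 4))   # converges toward the minimizer w* = 3
```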
-
Review:
Gradient descent disadvantages
Local minima problem
However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
-
Review: Problem of gradient descent with non-convex cost functions
[Figure: a non-convex surface $J(w_0, w_1)$; different starting points lead to different local minima]
This example has been adopted from Prof. Ng's slides (ML Online Course, Stanford)
-
Gradient descent for SSE cost function
Minimize $J(\boldsymbol{w})$: $\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \nabla_{\boldsymbol{w}} J(\boldsymbol{w}^t)$
$J(\boldsymbol{w})$: sum of squares error
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - g(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$$
Weight update rule for $g(\boldsymbol{x}; \boldsymbol{w}) = \boldsymbol{w}^T \boldsymbol{x}$:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^{tT} \boldsymbol{x}^{(i)} \right) \boldsymbol{x}^{(i)}$$
where $\boldsymbol{w} = [w_0, w_1, \dots, w_d]^T$ and $\boldsymbol{x} = [1, x_1, \dots, x_d]^T$
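The batch update above can be sketched directly in NumPy; the toy data (generated from $y = 1 + 2x$) and the learning rate are made up for illustration.

```python
import numpy as np

# Batch gradient descent for linear regression with the SSE cost,
# following the update w <- w + eta * sum_i (y_i - w.x_i) x_i.
# Data and learning rate are made up for illustration.

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # rows [1, x] (bias feature)
y = np.array([1.0, 3.0, 5.0])                        # generated by y = 1 + 2x

w = np.zeros(2)
eta = 0.05
for _ in range(2000):
    w = w + eta * (y - X @ w) @ X   # batch update over all training data

print(np.round(w, 3))   # approaches [1, 2]
```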
-
Gradient descent for SSE cost function
Weight update rule for $g(\boldsymbol{x}; \boldsymbol{w}) = \boldsymbol{w}^T \boldsymbol{x}$:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{x}^{(i)} \right) \boldsymbol{x}^{(i)}$$
$\eta$ too small → gradient descent can be slow.
$\eta$ too large → gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Batch mode: each step considers all training data.
-
$g(x; w_0, w_1) = w_0 + w_1 x$
$J(w_0, w_1)$ (function of the parameters $w_0, w_1$)
[Figure sequence: the current fitted line alongside the contours of $J(w_0, w_1)$; successive frames trace the path of gradient descent]
This example has been adopted from Prof. Ng's slides (ML Online Course, Stanford)
-
Linear Classifiers
-
Error-Driven Classification
-
Errors, and What to Do
Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the
latest version of OmniPage Pro, for just $99.99* - the
regular list price is $499! The most common question we've
received about this offer is - Is this genuine? We would like
to assure you that this offer is authorized by ScanSoft, is
genuine and valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate,
click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are
there. We hope you enjoyed receiving this message. However,
if you'd rather not receive future e-mails announcing new
store launches, please click . . .
-
What to Do About Errors
Problem: there's still spam in your inbox
Need more features: words aren't enough!
Have you emailed the sender before?
Have 1M other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do inline URLs point where they say they point?
Does the email address you by (your) name?
Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g., all features are word occurrences)
-
Feature Vectors
Hello,
Do you want free printr
cartriges? Why pay more
when you can get them
ABSOLUTELY FREE! Just
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
SPAM (or +)
PIXEL-7,12 : 1
PIXEL-7,13 : 0
...
NUM_LOOPS : 1
...
“2”
We denote the input by $\boldsymbol{x}$ and its feature vector by $f(\boldsymbol{x})$
-
Weights
Binary case: compare features to a weight vector to identify the class
Learning: figure out the weight vector from examples
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
# free : 4
YOUR_NAME :-1
MISSPELLED : 1
FROM_FRIEND :-3
...
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
...
Dot product positive means the positive class
-
Binary Decision Rule
In the space of feature vectors
Examples are points
Any weight vector is a hyperplane
One side corresponds to $y = +1$
Other corresponds to $y = -1$
BIAS : -3
free : 4
money : 2
...
[Figure: the decision boundary in the (free, money) feature plane; +1 = SPAM, -1 = HAM]
-
Squared error loss function for classification!
Squared error loss is not suitable for classification:
It penalizes "too correct" predictions, i.e., points that lie a long way on the correct side of the decision boundary
It also lacks robustness to noise
$$J(\boldsymbol{w}) = \sum_{i=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{x}^{(i)} + w_0 - y^{(i)} \right)^2 \qquad (K = 2)$$
-
Perceptron criterion
Two-class: $y \in \{-1, 1\}$ ($y = -1$ for $C_2$, $y = 1$ for $C_1$)
Goal: $\forall i,\; \boldsymbol{x}^{(i)} \in C_1 \Rightarrow \boldsymbol{w}^T \boldsymbol{x}^{(i)} > 0$
$\forall i,\; \boldsymbol{x}^{(i)} \in C_2 \Rightarrow \boldsymbol{w}^T \boldsymbol{x}^{(i)} < 0$
$$J_P(\boldsymbol{w}) = -\sum_{i \in \mathcal{M}} \boldsymbol{w}^T \boldsymbol{x}^{(i)} y^{(i)}$$
$\mathcal{M}$: subset of training data that are misclassified
If there are many solutions, which one among them should we pick?
-
Batch Perceptron
Gradient descent to solve the optimization problem:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \nabla_{\boldsymbol{w}} J_P(\boldsymbol{w}^t)$$
$$\nabla_{\boldsymbol{w}} J_P(\boldsymbol{w}) = -\sum_{i \in \mathcal{M}} \boldsymbol{x}^{(i)} y^{(i)}$$
Batch perceptron converges in a finite number of steps for linearly separable data:
Initialize $\boldsymbol{w}$
Repeat
  $\boldsymbol{w} = \boldsymbol{w} + \eta \sum_{i \in \mathcal{M}} \boldsymbol{x}^{(i)} y^{(i)}$
Until convergence
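The batch update above can be sketched as follows; the toy 2D data (with a bias feature prepended) are made up and linearly separable.

```python
import numpy as np

# Batch perceptron sketch: repeatedly add eta * sum of y_i * x_i over
# the currently misclassified set M. Toy data are made up.

X = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 2.0],       # class +1
              [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]])  # class -1
y = np.array([1, 1, -1, -1])

w, eta = np.zeros(3), 1.0
for _ in range(100):
    miscls = (X @ w) * y <= 0              # i in M: w.x_i * y_i <= 0
    if not miscls.any():
        break                              # all points correctly classified
    w = w + eta * (y[miscls] @ X[miscls])  # batch update over M

print(np.sign(X @ w))   # matches y on this separable data
```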
-
Stochastic gradient descent for Perceptron
Single-sample perceptron: if $\boldsymbol{x}^{(i)}$ is misclassified:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \eta \boldsymbol{x}^{(i)} y^{(i)}$$
Perceptron convergence theorem: if training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.
Fixed-increment single-sample perceptron:
Initialize $\boldsymbol{w}$, $t \leftarrow 0$
Repeat
  $t \leftarrow t + 1$
  $i \leftarrow t \bmod N$
  if $\boldsymbol{x}^{(i)}$ is misclassified then $\boldsymbol{w} = \boldsymbol{w} + \boldsymbol{x}^{(i)} y^{(i)}$
Until all patterns are properly classified
($\eta$ can be set to 1 and the proof still works)
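The fixed-increment algorithm above can be sketched as follows; the toy 1D data (bias feature included) are made up and linearly separable.

```python
import numpy as np

# Fixed-increment single-sample perceptron sketch (eta = 1): cycle
# through examples and update only on a misclassified one.

X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.5]])
y = np.array([1, 1, -1, -1])
N = len(y)

w, t = np.zeros(2), 0
while any((X @ w) * y <= 0):       # some pattern still misclassified
    i = t % N
    if (w @ X[i]) * y[i] <= 0:     # x_i misclassified
        w = w + X[i] * y[i]        # w <- w + x_i * y_i
    t += 1

print(w, np.sign(X @ w))
```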
-
Weight Updates
-
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., $\hat{y} = y$), no change!
If wrong: adjust the weight vector
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t + \boldsymbol{x}^{(i)} y^{(i)}$$
-
Example
-
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
Prediction: $\hat{y} = \mathrm{sign}(\boldsymbol{w}^T f(\boldsymbol{x}))$
-
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., $\hat{y} = y$), no change!
If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if $y$ is -1):
$$\boldsymbol{w} = \boldsymbol{w} + y \, f(\boldsymbol{x})$$
-
Multiclass Decision Rule
If we have multiple classes:
A weight vector for each class: $\boldsymbol{w}_y$
Score (activation) of a class $y$: $\boldsymbol{w}_y^T f(\boldsymbol{x})$
Prediction: highest score wins, $\hat{y} = \arg\max_y \boldsymbol{w}_y^T f(\boldsymbol{x})$
Binary = multiclass where the negative class has weight zero
-
Learning: Multiclass Perceptron
Start with all weights = 0
Pick up training examples one by one
Predict with current weights: $\hat{y} = \arg\max_y \boldsymbol{w}_y^T f(\boldsymbol{x})$
-
Learning: Multiclass Perceptron
Start with all weights = 0
Pick up training examples one by one
Predict with current weights
If correct, no change!
If wrong: lower score of wrong answer, raise score of right answer
$$\boldsymbol{w}_{\hat{y}} = \boldsymbol{w}_{\hat{y}} - f(\boldsymbol{x})$$
-
Learning: Multiclass Perceptron
Start with all weights = 0
Pick up training examples one by one
Predict with current weights
If correct, no change!
If wrong: lower score of wrong answer, raise score of right answer
$$\boldsymbol{w}_{\hat{y}} = \boldsymbol{w}_{\hat{y}} - f(\boldsymbol{x})$$
$$\boldsymbol{w}_{y} = \boldsymbol{w}_{y} + f(\boldsymbol{x})$$
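The multiclass update above can be sketched as follows; the three toy feature vectors (with a bias entry) and their labels are made up for illustration.

```python
import numpy as np

# Multiclass perceptron sketch: one weight vector per class; predict
# argmax_k w_k.f(x); on a mistake, subtract f(x) from the wrong
# class's weights and add it to the right class's.

X = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, -2.0, -2.0]])
y = np.array([0, 1, 2])          # three classes
K, d = 3, X.shape[1]

W = np.zeros((K, d))             # row k is w_k
for _ in range(50):
    mistakes = 0
    for xi, yi in zip(X, y):
        pred = int(np.argmax(W @ xi))
        if pred != yi:
            W[pred] -= xi        # lower score of the wrong answer
            W[yi] += xi          # raise score of the right answer
            mistakes += 1
    if mistakes == 0:
        break

print([int(np.argmax(W @ xi)) for xi in X])   # [0, 1, 2]
```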
-
Properties of Perceptrons
Separability: true if some parameters get the training set perfectly classified
Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
Mistake bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
[Figure: a separable and a non-separable point set]
-
Examples: Perceptron
Non-Separable Case
-
Discriminative approach: logistic regression
$$g(\boldsymbol{x}; \boldsymbol{w}) = \sigma(\boldsymbol{w}^T \boldsymbol{x})$$
$\sigma(\cdot)$ is an activation function; here, the sigmoid (logistic) function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$\boldsymbol{x} = [1, x_1, \dots, x_d]^T$, $\boldsymbol{w} = [w_0, w_1, \dots, w_d]^T$ (two-class case, $K = 2$)
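A quick numerical check of the sigmoid; the probe values below are just illustrative.

```python
import math

# The logistic (sigmoid) activation sigma(z) = 1 / (1 + e^{-z})
# squashes any score w.x into the interval (0, 1).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5: a score of zero is maximally uncertain
print(sigmoid(5))    # close to 1
print(sigmoid(-5))   # close to 0
```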
-
Logistic regression: cost function
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} J(\boldsymbol{w})$$
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log \sigma(\boldsymbol{w}^T \boldsymbol{x}^{(i)}) - (1 - y^{(i)}) \log\left(1 - \sigma(\boldsymbol{w}^T \boldsymbol{x}^{(i)})\right) \right]$$
$J(\boldsymbol{w})$ is convex w.r.t. the parameters.
-
Logistic regression: loss function
$$\sigma(\boldsymbol{x}; \boldsymbol{w}) = \frac{1}{1 + \exp(-\boldsymbol{w}^T \boldsymbol{x})}$$
Since $y = 1$ or $y = 0$:
$$\mathrm{Loss}(y, f(\boldsymbol{x}; \boldsymbol{w})) = -y \log \sigma(\boldsymbol{x}; \boldsymbol{w}) - (1 - y) \log(1 - \sigma(\boldsymbol{x}; \boldsymbol{w}))$$
$$\Rightarrow \mathrm{Loss}(y, \sigma(\boldsymbol{x}; \boldsymbol{w})) = \begin{cases} -\log \sigma(\boldsymbol{x}; \boldsymbol{w}) & \text{if } y = 1 \\ -\log(1 - \sigma(\boldsymbol{x}; \boldsymbol{w})) & \text{if } y = 0 \end{cases}$$
How is it related to the zero-one loss?
$$\mathrm{Loss}(y, \hat{y}) = \begin{cases} 1 & y \ne \hat{y} \\ 0 & y = \hat{y} \end{cases}$$
-
Logistic regression: Gradient descent
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \nabla_{\boldsymbol{w}} J(\boldsymbol{w}^t)$$
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( \sigma(\boldsymbol{w}^T \boldsymbol{x}^{(i)}) - y^{(i)} \right) \boldsymbol{x}^{(i)}$$
Is it similar to the gradient of SSE for linear regression?
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( \boldsymbol{w}^T \boldsymbol{x}^{(i)} - y^{(i)} \right) \boldsymbol{x}^{(i)}$$
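The gradient above plugs straight into the descent loop; the 1D toy data (bias feature included) and learning rate are made up for illustration.

```python
import numpy as np

# Gradient descent for two-class logistic regression, using the
# gradient sum_i (sigma(w.x_i) - y_i) x_i from above.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])   # labels in {0, 1}

w, eta = np.zeros(2), 0.5
for _ in range(1000):
    grad = (sigmoid(X @ w) - y) @ X   # sum_i (sigma(w.x_i) - y_i) x_i
    w = w - eta * grad

probs = sigmoid(X @ w)
print(np.round(probs, 2))   # low for the y=0 points, high for y=1
```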
-
Multi-class classifier
$$g(\boldsymbol{x}; \boldsymbol{W}) = \left[ g_1(\boldsymbol{x}; \boldsymbol{W}), \dots, g_K(\boldsymbol{x}; \boldsymbol{W}) \right]$$
$\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_K]$ contains one vector of parameters for each class
In linear classifiers, $\boldsymbol{W}$ is $d \times K$, where $d$ is the number of features
$\boldsymbol{W}^T \boldsymbol{x}$ gives us a vector
$g(\boldsymbol{x}; \boldsymbol{W})$ contains $K$ numbers giving class scores for the input $\boldsymbol{x}$
-
Multi-class logistic regression
$$g(\boldsymbol{x}; \boldsymbol{W}) = \left[ g_1(\boldsymbol{x}; \boldsymbol{W}), \dots, g_K(\boldsymbol{x}; \boldsymbol{W}) \right]^T$$
$\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_K]$ contains one vector of parameters for each class
$$g_k(\boldsymbol{x}; \boldsymbol{W}) = \frac{\exp(\boldsymbol{w}_k^T \boldsymbol{x})}{\sum_{j=1}^{K} \exp(\boldsymbol{w}_j^T \boldsymbol{x})}$$
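The softmax function above turns the $K$ class scores into probabilities; the scores below are arbitrary illustrative numbers.

```python
import numpy as np

# Softmax: maps K scores w_k.x to positive values summing to 1.

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

g = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(g, 3))   # probabilities, largest for the largest score
```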
-
Logistic regression: multi-class
$$\hat{\boldsymbol{W}} = \arg\min_{\boldsymbol{W}} J(\boldsymbol{W})$$
$$J(\boldsymbol{W}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log g_k(\boldsymbol{x}^{(i)}; \boldsymbol{W})$$
$\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_K]$; $\boldsymbol{y}$ is a vector of length $K$ (1-of-K coding), e.g., $\boldsymbol{y} = [0, 0, 1, 0]^T$ when the target class is $C_3$
-
Logistic regression: multi-class
$$\boldsymbol{w}_j^{t+1} = \boldsymbol{w}_j^t - \eta \nabla_{\boldsymbol{w}_j} J(\boldsymbol{W}^t)$$
$$\nabla_{\boldsymbol{w}_j} J(\boldsymbol{W}) = \sum_{i=1}^{n} \left( g_j(\boldsymbol{x}^{(i)}; \boldsymbol{W}) - y_j^{(i)} \right) \boldsymbol{x}^{(i)}$$
-
Logistic regression: Probabilistic perspective
Maximum likelihood estimation with:
$$P(y^{(i)} = +1 \mid \boldsymbol{x}^{(i)}; \boldsymbol{w}) = \frac{1}{1 + e^{-\boldsymbol{w}^T \boldsymbol{x}^{(i)}}}$$
$$P(y^{(i)} = 0 \mid \boldsymbol{x}^{(i)}; \boldsymbol{w}) = 1 - \frac{1}{1 + e^{-\boldsymbol{w}^T \boldsymbol{x}^{(i)}}}$$
= two-class logistic regression
-
Multiclass Logistic Regression
Multi-class linear classification
A weight vector for each class: $\boldsymbol{w}_k$
Score (activation) of a class $k$: $z_k = \boldsymbol{w}_k^T \boldsymbol{x}$
Prediction w/ highest score wins: $\hat{y} = \arg\max_k \boldsymbol{w}_k^T \boldsymbol{x}$
How to make the scores into probabilities?
[Figure: original activations $\boldsymbol{w}_1^T \boldsymbol{x}, \boldsymbol{w}_2^T \boldsymbol{x}, \boldsymbol{w}_3^T \boldsymbol{x}$ mapped to softmax activations]
-
Logistic regression: Probabilistic perspective
Maximum likelihood estimation with:
$$P(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{W}) = \frac{e^{\boldsymbol{w}_{y^{(i)}}^T \boldsymbol{x}^{(i)}}}{\sum_{k=1}^{K} e^{\boldsymbol{w}_k^T \boldsymbol{x}^{(i)}}}$$
= multi-class logistic regression
-
Logistic regression: multi-class
$$\boldsymbol{w}_j^{t+1} = \boldsymbol{w}_j^t - \eta \nabla_{\boldsymbol{w}_j} J(\boldsymbol{W}^t)$$
$$\nabla_{\boldsymbol{w}_j} J(\boldsymbol{W}) = \sum_{i=1}^{n} \left( P(y_j \mid \boldsymbol{x}^{(i)}; \boldsymbol{W}) - y_j^{(i)} \right) \boldsymbol{x}^{(i)}$$
-
Example
How can we tell whether this $\boldsymbol{W}$ and $\boldsymbol{b}$ is good or bad?
$$\boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_4 \end{bmatrix}, \quad \boldsymbol{W}^T = \begin{bmatrix} \boldsymbol{w}_1 \\ \boldsymbol{w}_2 \\ \boldsymbol{w}_3 \end{bmatrix}_{3 \times 4}, \quad \boldsymbol{w}_0 = \boldsymbol{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
-
Bias can also be included in the W matrix
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
-
Softmax classifier loss: example
$$L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{j=1}^{K} e^{s_j}}, \qquad s_k = \boldsymbol{w}_k^T \boldsymbol{x}$$
$$L^{(1)} = -\log 0.13 = 0.89$$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
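The per-example loss can be computed directly; note this sketch uses the natural log, and the class scores below are hypothetical rather than taken from the slide's figure.

```python
import math

# Softmax classifier loss for one example,
# L = -log(e^{s_y} / sum_j e^{s_j}), computed with the natural log.
# The scores and correct-class index are illustrative.

def softmax_loss(scores, correct):
    exps = [math.exp(s) for s in scores]
    p = exps[correct] / sum(exps)   # probability of the correct class
    return -math.log(p)

loss = softmax_loss([3.2, 5.1, -1.7], correct=0)
print(round(loss, 2))   # the correct class gets probability ~0.13
```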
-
Generalized linear
Linear combination of fixed non-linear functions of the input vector:
$$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$$
$\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: set of basis functions (or features), $\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$
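A minimal sketch of this idea with polynomial basis functions $\phi_j(x) = x^j$; the data, generated from a quadratic, are made up for illustration.

```python
import numpy as np

# Generalized linear model: f(x; w) = w0 + w1*phi_1(x) + ... is
# linear in w even though it is non-linear in x.

def poly_features(xs, m):
    """Map each scalar x to [1, x, x^2, ..., x^m]."""
    return np.vander(xs, m + 1, increasing=True)

xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = xs ** 2 + 1.0                 # generated by a quadratic

Phi = poly_features(xs, m=2)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)   # least-squares fit
print(np.round(w, 3))   # recovers coefficients [1, 0, 1]
```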
-
Basis functions: examples
Linear
Polynomial (univariate)
-
Polynomial regression: example
[Figure: polynomial regression fits with $m = 1, 3, 5, 7$]
-
Generalized linear classifier
Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space, $\boldsymbol{x} \to \boldsymbol{\phi}(\boldsymbol{x})$
Find a hyperplane in the transformed feature space:
$$\boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}) + w_0 = 0$$
$\boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})]$: set of basis functions (or features), $\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$
[Figure: a non-linear boundary in the $(x_1, x_2)$ space becomes linear in the $(\phi_1(\boldsymbol{x}), \phi_2(\boldsymbol{x}))$ space]
-
Model complexity and overfitting
With limited training data, models may achieve zero training error but a large test error:
$$\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - g(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2 \approx 0 \quad \text{(training (empirical) loss)}$$
$$\mathbb{E}_{\boldsymbol{x}, y}\left[ \left( y - g(\boldsymbol{x}; \boldsymbol{w}) \right)^2 \right] \gg 0 \quad \text{(expected (true) loss)}$$
Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss; the model fails to generalize to unseen examples.
-
Polynomial regression
[Figure: polynomial fits of degree $m = 0, 1, 3, 9$ to the same data] [Bishop]
-
Over-fitting causes
Model complexity
e.g., a model with a large number of parameters (degrees of freedom)
Small number of training examples
i.e., data size is small compared to the complexity of the model
-
Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data increases.
[Figure: $m = 9$ fits with $n = 15$ and $n = 100$ training points] [Bishop]
-
How to evaluate the learner’s performance?
Generalization error: the true (or expected) error that we would like to optimize
Two ways to assess the generalization error:
Practical: use a separate data set to test the model
Theoretical: law of large numbers; statistical bounds on the difference between training and expected errors
-
Avoiding over-fitting
Determine a suitable value for model complexity (Model
Selection)
Simple hold-out method
Cross-validation
Regularization (Occam’s Razor)
Explicit preference towards simple models
Penalize for the model complexity in the objective function
-
Model Selection
The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
-
Model Selection
Model selection is the process by which we choose the "best" model from among a set of candidates
assumes access to a function capable of measuring the quality of a model
typically done "outside" the main training algorithm
Model selection / hyperparameter optimization is just another form of learning
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
-
Simple hold out:
training, validation, and test sets
Simple hold-out chooses the model that minimizes error on the validation set.
The generalization error is then estimated on the test set: the performance of the selected model is finally evaluated on the test set.
[Figure: the data split into Training, Validation, and Test portions]
-
Simple hold-out: model selection
Steps:
Divide the training data into a training set and a validation set $v\_set$
Use only the training set to train a set of models
Evaluate each learned model on the validation set:
$$J_v(\boldsymbol{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$$
Choose the best model based on the validation set error
This is usually wasteful of valuable training data, since training data may be limited; on the other hand, a small validation set gives a relatively noisy estimate of performance. The cross-validation technique is used instead.
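The steps above can be sketched as follows; the noisy data, the candidate polynomial degrees, and the 40/20 split are all made-up illustrative choices.

```python
import numpy as np

# Simple hold-out sketch: train each candidate polynomial degree on
# the training split, then pick the degree with the lowest
# validation error J_v.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.1 * rng.normal(size=60)   # noisy target

x_tr, y_tr = x[:40], y[:40]        # training set
x_v,  y_v  = x[40:], y[40:]        # validation set

def fit_eval(m):
    w = np.polyfit(x_tr, y_tr, m)       # train on the training set only
    resid = y_v - np.polyval(w, x_v)
    return np.mean(resid ** 2)          # J_v: validation error

degrees = [0, 1, 3, 5, 9]
errors = {m: fit_eval(m) for m in degrees}
best = min(errors, key=errors.get)
print(best, round(errors[best], 4))     # degree chosen by hold-out
```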
-
Regularization
Add a penalty term to the cost function to discourage the coefficients from reaching large values.
Ridge regression (weight decay):
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2 + \lambda \boldsymbol{w}^T \boldsymbol{w}$$
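A minimal sketch of this cost in practice, using the standard closed-form minimizer $\boldsymbol{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \boldsymbol{y}$ (well known, though not derived on these slides); the basis and data are made up for illustration.

```python
import numpy as np

# Ridge regression via the standard closed-form solution
# w = (Phi^T Phi + lambda I)^{-1} Phi^T y.

def ridge_fit(Phi, y, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

x = np.linspace(0, 1, 10)
Phi = np.vander(x, 4, increasing=True)   # basis [1, x, x^2, x^3]
y = 2 * x + 0.5                          # data from a simple line

w_small = ridge_fit(Phi, y, lam=1e-8)    # ~ordinary least squares
w_large = ridge_fit(Phi, y, lam=100.0)   # heavily shrunk coefficients

print(np.round(w_small, 3))              # close to [0.5, 2, 0, 0]
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))
```

Increasing $\lambda$ shrinks the coefficient norm, which is exactly how the penalty discourages large values.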
-
Regularization parameter
[Figure: generalization behavior for different values of $\lambda$]
$\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting [Bishop]
-
Choosing the regularization parameter
Consider a set of models with different values of $\lambda$:
Find $\hat{\boldsymbol{w}}$ for each model based on the training data
Find $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$) for each model:
$$J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$$
Select the model with the best $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$)
-
Resources
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, "Learning from Data", Chapters 2.3, 3.2, 3.4.