Introduction to Machine Learning: Regression (ml-tau-2014.wdfiles.com/.../Regression_2014.pdf)
TRANSCRIPT
Introduction to Machine Learning Regression
Computer Science, Tel-Aviv University, 2014-15
Classification
Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)
- Other structures (e.g., strings, graphs, etc.)
Output: Y
- Discrete (0, 1, 2, ...)
Regression
Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)
Output: Y
- Real-valued scalars or vectors over the reals.
Examples: Regression
- Weight + height → cholesterol level
- Age + gender → time spent in front of the TV
- Past choices of a user → 'Netflix score'
- Profile of a job (user, machine, time) → memory usage of a submitted process
Linear Regression
Input: a set of points (x_i, y_i).
Assume there is a linear relation between y and x.
Linear Regression
Input: a set of points (x_i, y_i). Assume there is a linear relation between y and x.
Find a, b by solving
  min_{a,b} Σ_{i=1}^{n} (y_i − a·x_i − b)²
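As a minimal sketch of this fit (the data values here are illustrative, not from the slides), the one-dimensional least-squares solution for a and b has a simple closed form:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (illustrative values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Closed-form solution of min_{a,b} sum (y_i - a*x_i - b)^2:
#   a = cov(x, y) / var(x),  b = mean(y) - a * mean(x)
xm, ym = x.mean(), y.mean()
a = ((x - xm) * (y - ym)).sum() / ((x - xm) ** 2).sum()
b = ym - a * xm
print(a, b)  # slope close to 2, intercept close to 1
```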
Regression: Minimize the Residuals
y = a·x + b
For each point (x_i, y_i), the residual relative to the fitted line y = a·x + b is
  r_i = y_i − a·x_i − b
Likelihood Formulation
Model assumptions: y_i = a·x_i + b + ε_i, where the noise terms ε_i are independent Gaussians, ε_i ~ N(0, σ²).
Input: a set of points (x_i, y_i).
Under this model, the likelihood is maximized when the sum of squared residuals Σ_i (y_i − a·x_i − b)² is minimized; that is, maximum likelihood coincides with least squares.
Note:
We can add another variable x_{d+1} = 1 and set a_{d+1} = b. Therefore, without loss of generality, we can drop the intercept b and write the model as y = a·x.
Matrix Notations
Stack the n input vectors as the rows of an n×d matrix X, and the outputs as a vector y ∈ Rⁿ. The least-squares objective becomes ‖y − Xa‖².
The Normal Equations
Setting the gradient of ‖y − Xa‖² to zero gives the normal equations X^tX a = X^t y. If X^tX is invertible, the solution is a = (X^tX)⁻¹ X^t y.
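A sketch of solving the normal equations with numpy on synthetic, noise-free data; `np.linalg.lstsq` solves the same least-squares problem more stably and is usually preferred in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # 50 samples, 3 features
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true                         # noise-free, so both solvers recover a_true

# Solve the normal equations X^t X a = X^t y directly.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```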
Functions over n-dimensions
For a function f(x_1, ..., x_n), the gradient is the vector of partial derivatives:
  ∇f = (∂f/∂x_1, ..., ∂f/∂x_n)
In one dimension, the gradient is simply the derivative.
Gradient Descent
Goal: minimize a function f.
Algorithm:
1. Start from a point (x_1^0, ..., x_n^0).
2. Compute u = ∇f(x_1^i, ..., x_n^i).
3. Update by stepping against the gradient: (x_1^{i+1}, ..., x_n^{i+1}) = (x_1^i, ..., x_n^i) − η·u, for a small step size η > 0.
4. Return to (2), unless converged.
Gradient Descent
For the least-squares objective ‖y − Xa‖², whose gradient is −2·X^t(y − Xa), the Gradient Descent iteration is
  a^{t+1} = a^t + η·X^t(y − X·a^t)
(the constant factor 2 is absorbed into the step size η).
Advantage: simple, efficient.
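A minimal sketch of this iteration on synthetic data; the step size η and iteration count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
a_true = np.array([2.0, -1.0, 0.5])
y = X @ a_true                  # noise-free target for a clean convergence check

a = np.zeros(3)
eta = 0.001                     # small fixed step size (assumed adequate here)
for _ in range(5000):
    # One gradient step for ||y - Xa||^2 (factor 2 folded into eta).
    a = a + eta * X.T @ (y - X @ a)
```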
Online Least Squares
Process one example (x_i, y_i) at a time. Online update step:
  a ← a + η·(y_i − a·x_i)·x_i
Advantage: efficient, and similar in form to the perceptron update.
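A sketch of the online update on a synthetic stream of examples (learning rate and stream length are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
a_true = np.array([1.0, -1.0])
a = np.zeros(2)
eta = 0.05                          # learning rate (assumed)

for _ in range(2000):               # examples arrive one at a time
    x = rng.normal(size=2)
    y = a_true @ x                  # noise-free label for this sketch
    a = a + eta * (y - a @ x) * x   # online least-squares (LMS-style) update
```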
Singularity issues
- Not very efficient, since we need to invert a matrix.
- The solution is unique if X^tX is invertible.
- If X^tX is singular, we have an infinite number of solutions to the equations. What is the solution minimizing ‖y − Xa‖²?
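A small sketch of the singular case: with a duplicated column, X^tX is singular and infinitely many coefficient vectors fit equally well; `np.linalg.pinv` picks the minimum-norm one among them (the data here is illustrative):

```python
import numpy as np

# The second column duplicates the first, so X^t X is singular.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Any a with a_1 + a_2 = 2 fits exactly; the pseudoinverse returns
# the minimum-norm solution, which here is (1, 1).
a_min_norm = np.linalg.pinv(X) @ y
```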
The Singular Case
[Figure: 3-D plot illustrating the singular case.]
Risk of Overfitting
- Say we have a very large number of variables (d is large).
- When the number of variables is large, we usually have collinear variables, and therefore X^tX is singular.
- Even if X^tX is non-singular, there is a risk of over-fitting. For instance, if d = n we can typically fit the training data exactly (zero residuals).
- Intuitively, we want a small number of variables to explain y.
Regularization
Let λ be a regularization parameter. Ideally, we would like to solve the following:
  min_a ‖y − Xa‖² + λ·‖a‖₀
where ‖a‖₀ is the number of nonzero entries of a. This is a hard problem (NP-hard).
Shrinkage Methods
Lasso regression: min_a ‖y − Xa‖² + λ·Σ_i |a_i|
Ridge regression: min_a ‖y − Xa‖² + λ·Σ_i a_i²
Ridge Regression
Setting the gradient of ‖y − Xa‖² + λ‖a‖² to zero gives (X^tX + λI) a = X^t y, so
  a = (X^tX + λI)⁻¹ X^t y
X^tX + λI is positive definite and therefore nonsingular.
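A sketch of the closed-form Ridge solution; reusing the singular design from the previous example shows that the problem becomes well-posed for any λ > 0 (the value of λ here is illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution a = (X^t X + lam * I)^{-1} X^t y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A singular design (duplicated columns) now has a unique solution.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
a = ridge(X, y, lam=0.1)   # symmetric solution, shrunk slightly toward 0
```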
Ridge Regression – Bayesian View
Prior on a: each a_i is drawn independently from N(0, τ²).
Maximizing the posterior is equivalent to Ridge with λ = σ²/τ²:

logPosterior(a | σ, τ, Data) = −(1/(2σ²))·Σ_{i=1}^{n} (y_i − a·x_i)² − (1/(2τ²))·Σ_{i=1}^{d} a_i² − (n/2)·log(2πσ²) − (d/2)·log(2πτ²)
Lasso Regression
  min_a ‖y − Xa‖² + λ·Σ_i |a_i|
The above is equivalent to the following quadratic program, with auxiliary variables s_i:
  min_{a,s} ‖y − Xa‖² + λ·Σ_i s_i   subject to   −s_i ≤ a_i ≤ s_i for all i
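In practice the Lasso objective is often solved by coordinate descent with soft-thresholding rather than a general QP solver; a minimal sketch on synthetic sparse data (this solver choice is an assumption, not taken from the slides):

```python
import numpy as np

def soft_threshold(rho, t):
    """Solve the 1-D lasso subproblem: shrink rho toward 0 by t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_a ||y - Xa||^2 + lam * sum_i |a_i|."""
    n, d = X.shape
    a = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ a + X[:, j] * a[j]   # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            a[j] = soft_threshold(rho, lam / 2.0) / z
    return a

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
a_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # sparse coefficients
y = X @ a_true
a_hat = lasso_cd(X, y, lam=1.0)                 # recovers the sparsity pattern
```

The per-coordinate update comes from setting the subgradient of the objective in a_j to zero, which is why the threshold is λ/2 under this (un-halved) squared-error scaling.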
Lasso Regression – Bayesian View
Prior on a: each a_i is drawn independently from a Laplace prior, p(a_i) = (β/(4σ²))·exp(−β·|a_i|/(2σ²)).
Maximizing the posterior is equivalent to Lasso with λ = β:

logPosterior(a | Data, σ, β) = −(1/(2σ²))·Σ_{i=1}^{n} (y_i − a·x_i)² − (n/2)·log(2πσ²) + d·log(β/(4σ²)) − (β/(2σ²))·Σ_{i=1}^{d} |a_i|
Lasso vs. Ridge
[Figure: Laplace vs. Normal priors (mean 0, variance 1).]
An Equivalent Formulation
Lasso: min_a ‖y − Xa‖²   subject to   Σ_i |a_i| ≤ λ′
Ridge: min_a ‖y − Xa‖²   subject to   Σ_i a_i² ≤ λ′
Claim: for every λ in the penalized form there is a λ′ in the constrained form that produces the same solution.
Breaking Linearity
To fit a non-linear function of the input (e.g., a polynomial), define new features as fixed non-linear transformations of x. This can be solved using the usual linear regression by plugging the transformed features into X.
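For example, a quadratic fit reduces to ordinary linear regression on the features 1, x, x² (the target function here is illustrative):

```python
import numpy as np

# Target: y = 1 + 2x^2, a non-linear function of x.
x = np.linspace(-2, 2, 21)
y = 1.0 + 2.0 * x**2

# Design matrix whose columns are the transformed features 1, x, x^2;
# ordinary least squares then recovers the polynomial coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # -> approximately [1, 0, 2]
```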
Regression for Classification
Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)
- Other structures (e.g., strings, graphs, etc.)
Output: Y
- Discrete (0 or 1)
We treat the probability Pr(Y|X) as a linear function of X. Problem: Pr(Y|X) should be bounded in [0, 1].
Logistic Regression
Model:
  Pr(Y = 1 | X = x) = 1 / (1 + e^{−a·x})
[Figure: the logistic (sigmoid) function, mapping a·x to the interval (0, 1).]
Logistic Regression
Given training data (x_i, y_i) with y_i ∈ {0, 1}, we can write down the log-likelihood:
  ℓ(a) = Σ_i [ y_i·(a·x_i) − log(1 + e^{a·x_i}) ]
The first term is linear in a and the second is concave, so ℓ(a) is concave.
There is a unique solution, which can be found using gradient descent.
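A sketch of fitting this model by gradient ascent on the log-likelihood, whose gradient is X^t(y − sigmoid(Xa)); the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
a_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
# Sample labels from the logistic model itself.
y = (sigmoid(X @ a_true) > rng.uniform(size=500)).astype(float)

# Gradient ascent on the concave log-likelihood
#   l(a) = sum_i [ y_i (a.x_i) - log(1 + e^{a.x_i}) ].
a = np.zeros(2)
eta = 0.005
for _ in range(5000):
    a = a + eta * X.T @ (y - sigmoid(X @ a))
```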