Linear Regression · Aarti Singh & Barnabas Poczos · Machine Learning 10-701/15-781 · Jan 23, 2014
TRANSCRIPT
![Page 1](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/1.jpg)
Linear Regression
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781, Jan 23, 2014
![Page 2](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/2.jpg)
So far …
• Learning distributions – Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP)
• Learning classifiers – Naïve Bayes
![Page 3](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/3.jpg)
Discrete to Continuous Labels
Classification: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell).
Regression: Stock Market Prediction – X = Feb 01, Y = ? (a continuous value to be predicted).
![Page 4](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/4.jpg)
Regression Tasks
Weather Prediction: X = 7 pm, Y = Temp.
Estimating Contamination: X = new location, Y = sensor reading.
![Page 5](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/5.jpg)
Supervised Learning
Goal: learn a predictor f with small expected loss; the loss function is the performance measure.
Classification (e.g. X = Document, Y = Topic ∈ {Sports, Science, News}): loss = Probability of Error, $P(f(X) \neq Y)$.
Regression (e.g. stock market prediction, X = Feb 01, Y = ?): loss = Mean Squared Error, $\mathbb{E}[(f(X) - Y)^2]$.
![Page 6](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/6.jpg)
Regression algorithms
A learning algorithm maps training data to a predictor. Some choices:
– Linear Regression
– Regularized Linear Regression (Ridge regression, Lasso)
– Polynomial Regression
– Kernel Regression
– Regression Trees, Splines, Wavelet estimators, …
![Page 7](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/7.jpg)
Replace Expectation with Empirical Mean
Optimal predictor: $f^* = \arg\min_f \mathbb{E}\left[(f(X) - Y)^2\right]$
Empirical Minimizer: $\widehat{f} = \arg\min_f \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2$
Law of Large Numbers: the empirical mean converges to the expectation as $n \to \infty$.
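As a quick sanity check of the Law of Large Numbers step, the snippet below shows the empirical mean of the squared loss approaching its expectation as n grows. This is a minimal numpy sketch, not from the lecture; the distribution of (X, Y) and the fixed predictor f are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * x                      # some fixed predictor
for n in (10, 1_000, 100_000):
    X = rng.normal(size=n)
    Y = 2 * X + rng.normal(size=n)       # here E[(f(X) - Y)^2] = 1 exactly
    print(n, np.mean((f(X) - Y) ** 2))   # empirical mean -> expectation
```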
![Page 8](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/8.jpg)
Restrict class of predictors
Optimal predictor: $f^* = \arg\min_f \mathbb{E}\left[(f(X) - Y)^2\right]$
Empirical Minimizer: $\widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2$, over a class of predictors $\mathcal{F}$.
Why restrict the class? Overfitting! Without a restriction, the empirical loss is minimized by any function that interpolates the data, i.e. any $f$ with $f(X_i) = Y_i$ for all $i$.
![Page 9](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/9.jpg)
Restrict class of predictors
Optimal predictor: $f^* = \arg\min_f \mathbb{E}\left[(f(X) - Y)^2\right]$
Empirical Minimizer: $\widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2$
Choices for the class of predictors $\mathcal{F}$:
– Class of Linear functions
– Class of Polynomial functions
– Class of nonlinear functions
![Page 10](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/10.jpg)
Linear Regression
– Class of Linear functions
Uni-variate case: $f(X) = \beta_1 + \beta_2 X$, where $\beta_1$ is the intercept and $\beta_2$ the slope.
Multi-variate case: $f(X) = \beta_0 \cdot 1 + \beta_1 X^{(1)} + \dots + \beta_p X^{(p)} = X\beta$, where $X = [1, X^{(1)}, \dots, X^{(p)}]$ and $\beta = [\beta_0, \beta_1, \dots, \beta_p]^\top$.
The coefficients are fit with the Least Squares Estimator.
![Page 11](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/11.jpg)
Least Squares Estimator
$\widehat{\beta} = \arg\min_\beta \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - f(X_i)\right)^2$, with $f(X_i) = X_i\beta$
![Page 12](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/12.jpg)
Least Squares Estimator
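The algebra on this slide did not survive extraction; below is a standard reconstruction in matrix form, writing $A$ for the $n \times (p+1)$ matrix whose rows are the $X_i$ and $Y$ for the stacked vector of the $Y_i$ (notation consistent with the ridge formula on Page 18):

$$J(\beta) = \frac{1}{n}\sum_{i=1}^{n}(Y_i - X_i\beta)^2 = \frac{1}{n}\,\|Y - A\beta\|_2^2,
\qquad
\nabla_\beta J(\beta) = \frac{2}{n}\,A^\top(A\beta - Y).$$

Setting the gradient to zero gives $A^\top A\,\beta = A^\top Y$: the Normal Equations of the next slide.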
![Page 13](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/13.jpg)
Normal Equations
$\underbrace{(A^\top A)}_{p \times p}\;\underbrace{\beta}_{p \times 1} = \underbrace{A^\top Y}_{p \times 1}$
If $A^\top A$ is invertible, $\widehat{\beta} = (A^\top A)^{-1} A^\top Y$.
When is $A^\top A$ invertible? Recall: full rank matrices are invertible. What is the rank of $A^\top A$?
What if $A^\top A$ is not invertible? Regularization (later).
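A minimal numpy sketch of the closed-form estimator; the synthetic data and all variable names here are mine, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n points with p features, plus an intercept column of ones.
n, p = 100, 3
X = rng.normal(size=(n, p))
A = np.column_stack([np.ones(n), X])           # design matrix: rows [1, X(1), ..., X(p)]
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
Y = A @ beta_true + 0.1 * rng.normal(size=n)   # signal plus zero-mean noise

# Normal equations: solve (A^T A) beta = A^T Y.
# Solving the linear system is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)
print(beta_hat)                                # close to beta_true
```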
![Page 14](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/14.jpg)
Gradient Descent
Even when $A^\top A$ is invertible, computing $(A^\top A)^{-1}$ might be computationally expensive if A is huge.
Instead, treat least squares as an optimization problem. Observation: J(β) is convex in β.
[Figure: the convex objective, plotted as J(β1) against β1 and as the surface J(β1, β2) over (β1, β2).]
How to find the minimizer?
![Page 15](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/15.jpg)
Gradient Descent
Even when $A^\top A$ is invertible, computing $(A^\top A)^{-1}$ might be computationally expensive if A is huge.
Since J(β) is convex, move along the negative of the gradient:
Initialize: $\beta^{(0)}$
Update: $\beta^{(t+1)} = \beta^{(t)} - \alpha\,\nabla J(\beta^{(t)})$, with step size α; the update is 0 if $\nabla J(\beta^{(t)}) = 0$.
Stop: when some criterion is met, e.g. a fixed # of iterations, or the update magnitude is < ε.
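A sketch of the loop, assuming the squared-error objective from the Least Squares slides; the default step size and tolerance are placeholders of mine, not values from the lecture:

```python
import numpy as np

def least_squares_gd(A, Y, alpha=0.01, eps=1e-6, max_iters=10_000):
    """Gradient descent on J(beta) = (1/n) * ||A beta - Y||^2."""
    n, p = A.shape
    beta = np.zeros(p)                        # initialize
    for _ in range(max_iters):
        grad = (2.0 / n) * A.T @ (A @ beta - Y)
        if np.linalg.norm(grad) < eps:        # stopping criterion
            break
        beta = beta - alpha * grad            # move along the negative gradient
    return beta

# e.g., with A and Y as in the earlier sketch: least_squares_gd(A, Y)
```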
![Page 16](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/16.jpg)
Effect of step-size α
Large α ⇒ fast convergence, but larger residual error; oscillations are also possible.
Small α ⇒ slow convergence, but small residual error.
![Page 17](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/17.jpg)
Least Squares and MLE
Intuition: signal plus (zero-mean) noise model, $Y_i = X_i\beta^* + \epsilon_i$ with Gaussian noise, so $\mathbb{E}[Y \mid X] = X\beta^*$.
Maximizing the log likelihood over β shows:
The Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!
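A reconstruction of the log-likelihood step, under the model above with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d.:

$$\log \prod_{i=1}^{n} p(Y_i \mid X_i; \beta)
= \sum_{i=1}^{n}\log\!\left(\tfrac{1}{\sqrt{2\pi\sigma^2}}\,
 e^{-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}}\right)
= -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - X_i\beta)^2.$$

Maximizing over β drops the constant term, leaving $\arg\min_\beta \sum_i (Y_i - X_i\beta)^2$: exactly the least squares objective.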
![Page 18](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/18.jpg)
Regularized Least Squares and MAP
What if $A^\top A$ is not invertible?
MAP estimate: $\widehat{\beta}_{MAP} = \arg\max_\beta\; \underbrace{\log P(Y \mid X, \beta)}_{\text{log likelihood}} + \underbrace{\log P(\beta)}_{\text{log prior}}$
I) Gaussian Prior (zero mean)
A prior belief that β is Gaussian with zero mean biases the solution to "small" β.
Ridge Regression:
$\widehat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$
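A one-line numpy sketch of the ridge estimate above; the function name and the default λ are mine:

```python
import numpy as np

def ridge(A, Y, lam=1.0):
    """MAP / ridge estimate: beta = (A^T A + lam * I)^{-1} A^T Y."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)
```

Note that for λ > 0 the matrix $A^\top A + \lambda I$ is positive definite and hence always invertible, which is exactly what fixes the non-invertible case.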
![Page 20](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/20.jpg)
Regularized Least Squares and MAP
What if $A^\top A$ is not invertible?
MAP estimate: log likelihood + log prior, as before.
II) Laplace Prior (zero mean)
A prior belief that β is Laplace with zero mean biases the solution to "small" β.
Lasso:
$\widehat{\beta}_{MAP} = \arg\min_\beta \sum_{i=1}^{n}(Y_i - X_i\beta)^2 + \lambda\|\beta\|_1$
![Page 21](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/21.jpg)
Ridge Regression vs Lasso
Ridge Regression: ℓ2 penalty, $\lambda\|\beta\|_2^2$. Lasso: ℓ1 penalty, $\lambda\|\beta\|_1$.
Lasso (ℓ1 penalty) results in sparse solutions – a vector with more zero coordinates. Good for high-dimensional problems – you don't have to store all the coordinates! (HOT!)
[Figure, axes β1 and β2: the βs with constant J(β) (level sets of J(β)) shown against the βs with constant ℓ1 norm and the βs with constant ℓ2 norm; the ℓ1 ball has corners on the axes, so the level sets typically touch it at a sparse point.]
Ideally one would use an ℓ0 penalty (βs with constant ℓ0 norm), but the optimization becomes non-convex.
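The slides don't say how the lasso objective is minimized; one standard algorithm is ISTA (proximal gradient descent), sketched here for the ½-scaled objective of my choosing:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, Y, lam=1.0, n_iters=1000):
    """Minimize 0.5 * ||A beta - Y||^2 + lam * ||beta||_1 by ISTA.
    Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant
    of the smooth part's gradient."""
    L = np.linalg.norm(A, 2) ** 2             # largest eigenvalue of A^T A
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ beta - Y)           # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

The soft-thresholding step is what zeroes out coordinates, producing the sparsity discussed above.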
![Page 22](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/22.jpg)
Beyond Linear Regression
• Polynomial regression
• Regression with nonlinear features
• Later … Kernel regression, Local/Weighted regression
![Page 23](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/23.jpg)
Polynomial Regression
Univariate (1-dim) case: $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_m X^m = X\beta$ (degree m), where $X = [1, X, X^2, \dots, X^m]$ and $\beta = [\beta_0, \beta_1, \dots, \beta_m]^\top$.
Multivariate (p-dim) case:
$$f(X) = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + \dots + \beta_p X^{(p)}
+ \sum_{i=1}^{p}\sum_{j=1}^{p}\beta_{ij}\, X^{(i)} X^{(j)}
+ \sum_{i=1}^{p}\sum_{j=1}^{p}\sum_{k=1}^{p}\beta_{ijk}\, X^{(i)} X^{(j)} X^{(k)}
+ \dots \ \text{(terms up to degree } m\text{)}$$
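Polynomial regression is still linear regression in the coefficients β, so the earlier least squares machinery applies directly. A minimal univariate sketch (the function name is mine):

```python
import numpy as np

def poly_fit(x, y, m):
    """Least squares fit of a degree-m polynomial in one variable."""
    A = np.vander(x, m + 1, increasing=True)      # columns 1, x, x^2, ..., x^m
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta
```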
![Page 24](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/24.jpg)
Polynomial Regression
[Plots: least squares polynomial fits with k = 1, 2, 3, 7 on the same data over x ∈ [0, 1]; the y-axis of the k = 7 panel spans roughly −45 to 5, far beyond the data range of the lower-order panels.]
Polynomial of order k, equivalently of degree up to k−1.
What is the right order? Recall overfitting! More later …
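Training error alone can't answer "what is the right order": it only decreases as k grows. A small demonstration on made-up data (the generating function and noise level are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=9)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=9)   # noisy 1-d sample

for k in (1, 2, 3, 7):                                 # order k = degree k - 1
    A = np.vander(x, k, increasing=True)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(f"order k={k}: training MSE = {np.mean((A @ beta - y) ** 2):.4f}")
```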
![Page 25](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/25.jpg)
Regression with nonlinear features
In general, use any nonlinear features, e.g. $e^X$, $\log X$, $1/X$, $\sin(X)$, …
$f(X) = \sum_j \beta_j\, g_j(X)$, where the $g_j(X)$ are the nonlinear features and the $\beta_j$ are the weights of each feature.
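A sketch of fitting the weights for a fixed set of nonlinear features by ordinary least squares; the particular features, data, and names here are illustrative only:

```python
import numpy as np

def nonlinear_features(x):
    """Feature map: intercept plus a few fixed nonlinear features."""
    return np.column_stack([np.ones_like(x), np.exp(x), np.log(x),
                            1.0 / x, np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=200)     # keep x > 0 so log x and 1/x are defined
y = 2.0 * np.log(x) + 0.5 * np.sin(x) + 0.05 * rng.normal(size=200)

Phi = nonlinear_features(x)             # n x (number of features)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta)                             # the weight of each feature
```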
![Page 26](https://reader035.vdocument.in/reader035/viewer/2022071007/5fc4af936159526a8b20fd04/html5/thumbnails/26.jpg)
What you should know
• Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Probabilistic Interpretation (connection to MLE)
• Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
• Polynomial Regression, Regression with Non-linear features