![Page 1: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/1.jpg)
Advanced Introduction to Machine Learning
10715, Fall 2014
Linear Regression and Sparsity
Eric XingLecture 2, September 10, 2014
Reading:© Eric Xing @ CMU, 2014 1
![Page 2: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/2.jpg)
Machine learning for apartment hunting
Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs:
square-ft., # of bedroom, distance to campus …
Living area (ft2) # bedroom Rent ($)
230 1 600506 2 1000433 2 1100109 1 500…150 1 ?270 1.5 ?
© Eric Xing @ CMU, 2014 2
![Page 3: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/3.jpg)
The learning problem
Features: Living area, distance to campus, #
bedroom … Denote as x=[x1, x2, … xk]
Target: Rent Denoted as y
Training set:
rent
Living area
rent
Living area
Location
knnn
k
k
n xxx
xxxxxx
21
222
12
121
11
2
1
x
xx
X
nny
yy
y
yy
Y
2
1
2
1
or
© Eric Xing @ CMU, 2014 3
![Page 4: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/4.jpg)
Linear Regression Assume that Y (target) is a linear function of X (features):
e.g.:
let's assume a vacuous "feature" X0=1 (this is the intercept term, why?), and define the feature vector to be:
then we have the following general representation of the linear function:
Our goal is to pick the optimal . How! We seek that minimize the following cost function:
22
110ˆ xxy
n
iiii yxyJ
1
2
21 ))(ˆ()(
© Eric Xing @ CMU, 2014 4
![Page 5: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/5.jpg)
The Least-Mean-Square (LMS) method The Cost Function:
Consider a gradient descent algorithm:
n
ii
Ti yJ
1
2
21 )()( x
tj
tj
tj J )(
1
© Eric Xing @ CMU, 2014 5
![Page 6: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/6.jpg)
The Least-Mean-Square (LMS) method Now we have the following descent rule:
For a single training point, we have:
This is known as the LMS update rule, or the Widrow-Hoff learning rule This is actually a "stochastic", "coordinate" descent algorithm This can be used as a on-line algorithm
n
i
ji
tTii
tj
tj xy
1
1 )( x
© Eric Xing @ CMU, 2014 6
![Page 7: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/7.jpg)
Geometry and Convergence of LMS
N=1 N=2 N=3
Claim: when the step size satisfies certain condition, and when certain other technical conditions are satisfied, LMS will converge to an “optimal region”.
itT
iitt y xx )(1
© Eric Xing @ CMU, 2014 7
![Page 8: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/8.jpg)
Steepest Descent and LMS Steepest descent
Note that:
This is as a batch gradient descent algorithm
n
in
Tnn
T
k
yJJJ11
xx )(,,
n
in
tTnn
tt y1
1 xx )(
© Eric Xing @ CMU, 2014 8
![Page 9: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/9.jpg)
The normal equations Write the cost function in matrix form:
To minimize J(θ), take derivative and set to zero:
yyXyyXXX
yXyX
yJ
TTTTTT
T
n
ii
Ti
212121
1
2)()( x
nx
xx
X
2
1
ny
yy
y
2
1
0
221
22121
yXXX
yXXXXX
yyXyXX
yyXyyXXXJ
TT
TTT
TTTT
TTTTTT
trtrtr
tryXXX TT
The normal equations
yXXX TT 1*
© Eric Xing @ CMU, 2014 9
![Page 10: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/10.jpg)
Some matrix derivatives For , define:
Trace:
Some fact of matrix derivatives (without proof)
fA
fA
fA
fA
Af
mnm
n
A
1
111
)(
RR nmf :
, tr
n
iiiAA
1
, tr aa BCACABABC trtrtr
, tr TA BAB , tr TTT
A ABCCABCABA TA AAA 1
© Eric Xing @ CMU, 2014 10
![Page 11: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/11.jpg)
Comments on the normal equation In most situations of practical interest, the number of data
points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that XTX is necessarily invertible.
The assumption that XTX is invertible implies that it is positive definite, thus at the critical point we have found is a minimum.
What if X has less than full column rank? regularization (later).
© Eric Xing @ CMU, 2014 11
![Page 12: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/12.jpg)
Direct and Iterative methods Direct methods: we can achieve the solution in a single step
by solving the normal equation Using Gaussian elimination or QR decomposition, we converge in a finite number
of steps It can be infeasible when data are streaming in in real time, or of very large
amount
Iterative methods: stochastic or steepest gradient Converging in a limiting sense But more attractive in large practical problems Caution is needed for deciding the learning rate
© Eric Xing @ CMU, 2014 12
![Page 13: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/13.jpg)
Convergence rate Theorem: the steepest descent equation algorithm
converge to the minimum of the cost characterized by normal equation:
If
A formal analysis of LMS need more math-mussels; in practice, one can use a small , or gradually decrease .
© Eric Xing @ CMU, 2014 13
![Page 14: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/14.jpg)
A Summary: LMS update rule
Pros: on-line, low per-step cost, fast convergence and perhaps less prone to local optimum
Cons: convergence to optimum not always guaranteed
Steepest descent
Pros: easy to implement, conceptually clean, guaranteed convergence Cons: batch, often slow converging
Normal equations
Pros: a single-shot algorithm! Easiest to implement. Cons: need to compute pseudo-inverse (XTX)-1, expensive, numerical issues
(e.g., matrix is singular ..), although there are ways to get around this …
intT
nnt
jt
j xy ,)( x1
n
in
tTnn
tt y1
1 xx )(
yXXX TT 1*
© Eric Xing @ CMU, 2014 14
![Page 15: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/15.jpg)
Geometric Interpretation of LMS The predictions on the training data are:
Note that
and
is the orthogonal projection ofinto the space spanned by the columns of X
yXXXXXy TT 1 *ˆ
yIXXXXyy TT 1ˆ
0
1
1
yXXXXXX
yIXXXXXyyXTTTT
TTTT
!!y y
nx
xx
X
2
1
ny
yy
y
2
1
© Eric Xing @ CMU, 2014 15
![Page 16: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/16.jpg)
Probabilistic Interpretation of LMS Let us assume that the target variable and the inputs are
related by the equation:
where ε is an error term of unmodeled effects or random noise
Now assume that ε follows a Gaussian N(0,σ), then we have:
By independence assumption:
iiT
iy x
2
2
221
)(exp);|( i
Ti
iiyxyp x
2
12
1 221
n
i iT
inn
iii
yxypL
)(exp);|()(
x
© Eric Xing @ CMU, 2014 16
![Page 17: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/17.jpg)
Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is:
Do you recognize the last term?
Yes it is:
Thus under independence assumption, LMS is equivalent to MLE of θ !
n
i iT
iynl1
22 211
21 )(log)( x
n
ii
Ti yJ
1
2
21 )()( x
© Eric Xing @ CMU, 2014 17
![Page 18: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/18.jpg)
Case study: predicting gene expression
The genetic picture
CGTTTCACTGTACAATTTcausal SNPs
a univariate phenotype:
i.e., the expression intensity of a gene
© Eric Xing @ CMU, 2014 18
![Page 19: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/19.jpg)
Individual 1
Individual 2
Individual N
Phenotype (BMI)
2.5
4.8
4.7
Genotype
. . C . . . . . T . . C . . . . . . . T . . .
. . C . . . . . A . . C . . . . . . . T . . .
. . G . . . . . A . . G . . . . . . . A . . .
. . C . . . . . T . . C . . . . . . . T . . .
. . G . . . . . T . . C . . . . . . . T . . .
. . G . . . . . T . . G . . . . . . . T . . .
Causal SNPBenign SNPs
…Association Mapping as Regression
© Eric Xing @ CMU, 2014 19
![Page 20: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/20.jpg)
Individual 1
Individual 2
Individual N
Phenotype (BMI)
2.5
4.8
4.7
Genotype
. . 0 . . . . . 1 . . 0 . . . . . . . 0 . . .
. . 1 . . . . . 1 . . 1 . . . . . . . 1 . . .
. . 2 . . . . . 2 . . 1 . . . . . . . 0 . . .
…
yi =
J
jjijx
1 SNPs with large
|βj| are relevant
Association Mapping as Regression
© Eric Xing @ CMU, 2014 20
![Page 21: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/21.jpg)
Experimental setup Asthama dataset
543 individuals, genotyped at 34 SNPs Diploid data was transformed into 0/1 (for homozygotes) or 2 (for heterozygotes) X=543x34 matrix Y=Phenotype variable (continuous)
A single phenotype was used for regression
Implementation details Iterative methods: Batch update and online update implemented. For both methods, step size α is chosen to be a small fixed value (10-6). This
choice is based on the data used for experiments. Both methods are only run to a maximum of 2000 epochs or until the change in
training MSE is less than 10-4
© Eric Xing @ CMU, 2014 21
![Page 22: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/22.jpg)
Convergence Curves
For the batch method, the training MSE is initially large due to uninformed initialization
In the online update, N updates for every epoch reduces MSE to a much smaller value.
© Eric Xing @ CMU, 2014
22
![Page 23: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/23.jpg)
The Learned Coefficients
© Eric Xing @ CMU, 2014 23
![Page 24: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/24.jpg)
T G
A A
C C
A T
G A
A G
T A
GenotypeTrait
Multivariate Regression for Trait Association Analysis
Xy
2.1 x=
x= β
Association Strength
?
© Eric Xing @ CMU, 2014 24
![Page 25: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/25.jpg)
T G
A A
C C
A T
G A
A G
T A
GenotypeTrait
Multivariate Regression for Trait Association Analysis
Many non-zero associations: Which SNPs are truly significant?
2.1 x=
Association Strength
© Eric Xing @ CMU, 2014 25
![Page 26: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/26.jpg)
Sparsity One common assumption to make sparsity.
Makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.
Makes statistical sense: Learning is now feasible in high dimensions with small sample size
© Eric Xing @ CMU, 2014 26
![Page 27: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/27.jpg)
Sparsity: In a mathematical sense Consider least squares linear regression problem: Sparsity means most of the beta’s are zero.
But this is not convex!!! Many local optima, computationally intractable.
y
x1
x2
x3
xn-1
xn
1
2
3
n-1
n
© Eric Xing @ CMU, 2014 27
![Page 28: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/28.jpg)
L1 Regularization (LASSO)(Tibshirani, 1996)
A convex relaxation.
Still enforces sparsity!
Constrained Form Lagrangian Form
© Eric Xing @ CMU, 2014 28
![Page 29: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/29.jpg)
Theoretical Guarantees Assumptions
Dependency Condition: Relevant Covariates are not overly dependent Incoherence Condition: Large number of irrelevant covariates cannot be too
correlated with relevant covariates Strong concentration bounds: Sample quantities converge to expected values
quickly
If these are assumptions are met, LASSO will asymptotically recover correct subset of covariates that relevant.
© Eric Xing @ CMU, 2014 29
![Page 30: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/30.jpg)
Consistent Structure Recovery[Zhao and Yu 2006]
© Eric Xing @ CMU, 2014 30
![Page 31: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/31.jpg)
Genotype
x=2.1
Trait
Lasso for Reducing False Positives
Many zero associations (sparse results), but what if there are multiple related traits?
+
J
j 1 |βj |
T G
A A
C C
A T
G A
A G
T A
Lasso Penalty for sparsity
Association Strength
© Eric Xing @ CMU, 2014 31
![Page 32: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/32.jpg)
Ridge Regression vs Lasso
Ridge Regression: Lasso: HOT!
Lasso (l1 penalty) results in sparse solutions – vector with more zero coordinatesGood for high‐dimensional problems – don’t have to store all coordinates!
βs with constant l1 norm
βs with constant J(β)(level sets of J(β))
βs with constant l2 norm
β2
β1
X X
© Eric Xing @ CMU, 2014 32
![Page 33: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/33.jpg)
Bayesian Interpretation Treat the distribution parameters also as a random variable The a posteriori distribution of after seem the data is:
This is Bayes Rule
likelihood marginalpriorlikelihoodposterior
dpDppDp
DppDpDp
)()|()()|(
)()()|()|(
The prior p(.) encodes our prior knowledge about the domain© Eric Xing @ CMU, 2014 33
![Page 34: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/34.jpg)
What if (XTX) is not invertible ?
log likelihood log prior
Prior belief that β is Gaussian with zero‐mean biases solution to “small” β
I) Gaussian Prior
0
Ridge Regression
Closed form: HW
Regularized Least Squares and MAP
© Eric Xing @ CMU, 2014 34
![Page 35: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/35.jpg)
Regularized Least Squares and MAP
log likelihood log prior
Prior belief that β is Laplace with zero‐mean biases solution to “small” β
Lasso
Closed form: HW
II) Laplace Prior
What if (XTX) is not invertible ?
© Eric Xing @ CMU, 2014 35
![Page 36: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/36.jpg)
Take home message Gradient descent
On-line Batch
Normal equations Geometric interpretation of LMS Probabilistic interpretation of LMS, and equivalence of LMS and
MLE under certain assumption (what?) Sparsity:
Approach: ridge vs. lasso regression Interpretation: regularized regression versus Bayesian regression Algorithm: convex optimization (we did not discuss this)
LR does not mean fitting linear relations, but linear combination or basis functions (that can be non-linear)
Weighting points by importance versus by fitness
© Eric Xing @ CMU, 2014 36
![Page 37: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/37.jpg)
After class material …
© Eric Xing @ CMU, 2014 37
![Page 38: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/38.jpg)
Advanced Material: Beyond basic LR LR with non-linear basis functions
Locally weighted linear regression
Regression trees and Multilinear Interpolation
We will discuss this in next class after we set the state right!(if we’ve got time )
© Eric Xing @ CMU, 2014 38
![Page 39: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/39.jpg)
LR with non-linear basis functions LR does not mean we can only deal with linear relationships
We are free to design (non-linear) features under LR
where the j(x) are fixed basis functions (and we define 0(x) = 1).
Example: polynomial regression:
We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.
)()( xxy Tm
j j 10
321 xxxx ,,,:)(
© Eric Xing @ CMU, 2014 39
![Page 40: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/40.jpg)
Basis functions There are many basis functions, e.g.:
Polynomial
Radial basis functions
Sigmoidal
Splines, Fourier, Wavelets, etc
1 jj xx)(
2
2
2sx
x jj
exp)(
sx
x jj
)(
© Eric Xing @ CMU, 2014 40
![Page 41: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/41.jpg)
1D and 2D RBFs 1D RBF
After fit:
© Eric Xing @ CMU, 2014 41
![Page 42: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/42.jpg)
Good and Bad RBFs A good 2D RBF
Two bad 2D RBFs
© Eric Xing @ CMU, 2014 42
![Page 43: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/43.jpg)
Overfitting and underfitting
xy 10 2210 xxy
5
0jj
j xy
© Eric Xing @ CMU, 2014 43
![Page 44: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/44.jpg)
Bias and variance We define the bias of a model to be the expected
generalization error even if we were to fit it to a very (say, infinitely) large training set.
By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.
© Eric Xing @ CMU, 2014 44
![Page 45: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/45.jpg)
Locally weighted linear regression
The algorithm:Instead of minimizing
now we fit θ to minimize
Where do wi's come from?
where x is the query point for which we'd like to know its corresponding y
Essentially we put higher weights on (errors on) training examples that are close to the query point (than those that are further away from the query)
n
ii
Ti yJ
1
2
21 )()( x
n
ii
Tii ywJ
1
2
21 )()( x
2
2
2)(exp xx i
iw
© Eric Xing @ CMU, 2014 45
![Page 46: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/46.jpg)
Parametric vs. non-parametric Locally weighted linear regression is the second example we
are running into of a non-parametric algorithm. (what is the first?)
The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm because it has a fixed, finite number of parameters (the θ), which are fit to the
data; Once we've fit the θ and stored them away, we no longer need to keep the
training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need
to keep the entire training set around.
The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
© Eric Xing @ CMU, 2014 46
![Page 47: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/47.jpg)
Robust Regression
The best fit from a quadratic regression
But this is probably better …
How can we do this?© Eric Xing @ CMU, 2014 47
![Page 48: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/48.jpg)
LOESS-based Robust Regression Remember what we do in "locally weighted linear regression"? we "score" each point for its impotence
Now we score each point according to its "fitness"
(Courtesy to Andrew Moor) © Eric Xing @ CMU, 2014 48
![Page 49: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/49.jpg)
Robust regression For k = 1 to R…
Let (xk ,yk) be the kth datapoint Let yest
k be predicted value of yk
Let wk be a weight for data point k that is large if the data point fits well and small if it fits badly:
Then redo the regression using weighted data points.
Repeat whole thing until converged!
2)( estkkk yyw
© Eric Xing @ CMU, 2014 49
![Page 50: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/50.jpg)
Robust regression—probabilistic interpretation What regular regression does:
Assume yk was originally generated using the following recipe:
Computational task is to find the Maximum Likelihood estimation of θ
),( 20 N kT
ky x
© Eric Xing @ CMU, 2014 50
![Page 51: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/51.jpg)
Robust regression—probabilistic interpretation What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
with probability p:
but otherwise
Computational task is to find the Maximum Likelihood estimates of θ, p, µ and σhuge.
The algorithm you saw with iterative reweighting/refittingdoes this computation for us. Later you will find that it is an instance of the famous E.M. algorithm
),( 20 N kT
ky x
),(~ huge2Nky
© Eric Xing @ CMU, 2014 51
![Page 52: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/52.jpg)
Regression Tree Decision tree for regression
Gender Rich? Num. Children
# travel per yr.
Age
F No 2 5 38
M No 0 2 25
M Yes 1 0 72
: : : : :
Gender?
Predicted age=39 Predicted age=36
Female Male
© Eric Xing @ CMU, 2014 52
![Page 53: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/53.jpg)
A conceptual picture Assuming regular regression trees, can you sketch a graph of
the fitted function y*(x) over this diagram?
© Eric Xing @ CMU, 2014 53
![Page 54: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/54.jpg)
How about this one? Multilinear Interpolation
We wanted to create a continuous and piecewise linear fit to the data
© Eric Xing @ CMU, 2014 54
![Page 55: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/55.jpg)
Take home message Gradient descent
On-line Batch
Normal equations Geometric interpretation of LMS Probabilistic interpretation of LMS, and equivalence of LMS and
MLE under certain assumption (what?) Sparsity:
Approach: ridge vs. lasso regression Interpretation: regularized regression versus Bayesian regression Algorithm: convex optimization (we did not discuss this)
LR does not mean fitting linear relations, but linear combination or basis functions (that can be non-linear)
Weighting points by importance versus by fitness
© Eric Xing @ CMU, 2014 55
![Page 56: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/56.jpg)
Appendix
© Eric Xing @ CMU, 2014 56
![Page 57: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/57.jpg)
Parameter Learning from iid Data Goal: estimate distribution parameters from a dataset of N
independent, identically distributed (iid), fully observed, training cases
D = {x1, . . . , xN}
Maximum likelihood estimation (MLE)1. One of the most common estimators2. With iid and full-observability assumption, write L() as the likelihood of the data:
3. pick the setting of parameters most likely to have generated the data we saw:
);,,()( , NxxxPL 21
N
i i
N
xP
xPxPxP
1
2
);(
);(,),;();(
)(maxarg*
L )(logmaxarg
L© Eric Xing @ CMU, 2014 57
![Page 58: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/58.jpg)
Example: Bernoulli model Data:
We observed N iid coin tossing: D={1, 0, 1, …, 0}
Representation:Binary r.v:
Model:
How to write the likelihood of a single observation xi ?
The likelihood of datasetD={x1, …,xN}:
ii xxixP 11 )()(
N
i
xxN
iiN
iixPxxxP1
1
121 1 )()|()|,...,,(
},{ 10nx
tails#head# )()(
11 111
N
ii
N
ii xx
101
xx
xPfor for
)(
xxxP 11 )()(
© Eric Xing @ CMU, 2014 58
![Page 59: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/59.jpg)
Maximum Likelihood Estimation Objective function:
We need to maximize this w.r.t.
Take derivatives wrt
Sufficient statistics The counts, are sufficient statistics of data D
)log()(log)(log)|(log);( 11 hhnn nNnDPD thl
01
hh nNnl
Nnh
MLE
i
iMLE xN1
or
Frequency as sample mean
, where, i ikh xnn
© Eric Xing @ CMU, 2014 59
![Page 60: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/60.jpg)
Overfitting Recall that for Bernoulli Distribution, we have
What if we tossed too few times so that we saw zero head?We have and we will predict that the probability of seeing a head next is zero!!!
The rescue: "smoothing" Where n' is know as the pseudo- (imaginary) count
But can we make this more formal?
tailhead
headheadML nn
n
,0headML
''
nnnnn
tailhead
headheadML
© Eric Xing @ CMU, 2014 60
![Page 61: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/61.jpg)
Bayesian Parameter Estimation Treat the distribution parameters also as a random variable The a posteriori distribution of after seem the data is:
This is Bayes Rule
likelihood marginalpriorlikelihoodposterior
dpDppDp
DppDpDp
)()|()()|(
)()()|()|(
The prior p(.) encodes our prior knowledge about the domain© Eric Xing @ CMU, 2014 61
![Page 62: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/62.jpg)
Frequentist Parameter Estimation Two people with different priors p() will end up with different estimates p(|D).
Frequentists dislike this “subjectivity”. Frequentists think of the parameter as a fixed, unknown
constant, not a random variable. Hence they have to come up with different "objective"
estimators (ways of computing from data), instead of using Bayes’ rule. These estimators have different properties, such as being “unbiased”, “minimum
variance”, etc. The maximum likelihood estimator, is one such estimator.
© Eric Xing @ CMU, 2014 62
![Page 63: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/63.jpg)
Discussion
or p(), this is the problem!
Bayesians know it
© Eric Xing @ CMU, 2014 63
![Page 64: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/64.jpg)
Bayesian estimation for Bernoulli Beta distribution:
When x is discrete
Posterior distribution of :
Notice the isomorphism of the posterior to the prior, such a prior is called a conjugate prior and are hyperparameters (parameters of the prior) and correspond to the
number of “virtual” heads/tails (pseudo counts)
1111
1
11 111 thth nnnn
N
NN xxp
pxxpxxP )()()(),...,(
)()|,...,(),...,|(
1111 11
)(),()(
)()()(),;( BP
!)()( xxxx 1
© Eric Xing @ CMU, 2014 64
![Page 65: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/65.jpg)
Bayesian estimation for Bernoulli, con'd Posterior distribution of :
Maximum a posteriori (MAP) estimation:
Posterior mean estimation:
Prior strength: A=+ A can be interoperated as the size of an imaginary data set from which we obtain
the pseudo-counts
1111
1
11 111 thth nnnn
N
NN xxp
pxxpxxP )()()(),...,(
)()|,...,(),...,|(
NndCdDp hnn
Bayesth 11 1 )()|(
Bata parameters can be understood as pseudo-counts
),...,|(logmaxarg NMAP xxP 1
© Eric Xing @ CMU, 2014 65
![Page 66: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/66.jpg)
Effect of Prior Strength Suppose we have a uniform prior (==1/2),
and we observe Weak prior A = 2. Posterior prediction:
Strong prior A = 20. Posterior prediction:
However, if we have enough data, it washes away the prior. e.g., . Then the estimates under weak and strong prior are and , respectively, both of which are close to 0.2
),( 82 th nnn
25010221282 .)',,|(
th nnhxp
40010202102082 .)',,|(
th nnhxp
),( 800200 th nnn
100022001
10002020010
© Eric Xing @ CMU, 2014 66
![Page 67: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/67.jpg)
Example 2: Gaussian density Data:
We observed N iid real samples: D={-0.1, 10, 1, -5.2, …, 3}
Model:
Log likelihood:
MLE: take derivative and set to zero:
22212 22 /)(exp)( /
xxP
N
n
nxNDPD1
2
22
212
2 )log()|(log);(l
n n
n n
xN
x
2422
2
21
2
1
l
l )/(
n MLnMLE
n nMLE
xN
xN
22 1
1
© Eric Xing @ CMU, 2014 67
![Page 68: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/68.jpg)
MLE for a multivariate-Gaussian It can be shown that the MLE for µ and Σ is
where the scatter matrix is
The sufficient statistics are nxn and nxnxnT.
Note that XTX=nxnxnT may not be full rank (eg. if N <D), in which case ΣML is not
invertible
SN
xxN
xN
nT
MLnMLnMLE
n nMLE
11
1
TMLMLn n
Tnnn
TMLnMLn NxxxxS
Kn
n
n
n
x
xx
x
2
1
TN
T
T
x
xx
X2
1
© Eric Xing @ CMU, 2014 68
![Page 69: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/69.jpg)
Bayesian estimation
Normal Prior:
Joint probability:
Posterior:
2020
2120 22 /)(exp)( /
P
1
20
22
020
2
20
20
2
2 11
11
N
Nx
NN ~ and ,
///
///~ where
Sample mean
2020
2120
1
22
22
22
212
/)(exp
exp),(
/
/
N
nn
N xxP
22212 22 ~/)~(exp~)|( /
xP
© Eric Xing @ CMU, 2014 69
![Page 70: Advanced Introduction to Machine Learningepxing/Class/10715/lectures/lecture2-LR.pdf · A Summary: LMS update rule Pros: on-line, low per-step cost, fast convergence and perhaps less](https://reader033.vdocument.in/reader033/viewer/2022042312/5edad2a509ac2c67fa685ca1/html5/thumbnails/70.jpg)
Bayesian estimation: unknown µ, known σ
The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels.
The precision of the posterior 1/σ2N is the precision of the prior 1/σ2
0 plus one contribution of data precision 1/σ2 for each observed data point.
Sequentially updating the mean µ∗ = 0.8 (unknown), (σ2)∗ = 0.1 (known)
Effect of single data point
Uninformative (vague/ flat) prior, σ20 →∞
1
20
22
020
2
20
20
2
2 11
11
N
Nx
NN
N~ ,
///
///
20
2
20
020
2
20
001
)( )( xxx
0 N
© Eric Xing @ CMU, 2014 70