CpSc 881: Machine Learning
Regression
Copyright Notice
Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!
Regression problems
The goal is to make quantitative (real-valued) predictions on the basis of a vector of features or attributes.
Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.
Questions: What can we assume about the problem? How do we formalize the regression problem? How do we evaluate predictions?
A generic regression problem
The input attributes are given as fixed-length vectors $x = [x_1, \ldots, x_d]^T$, where each component $x_i$ may be discrete or real-valued.
The outputs are assumed to be real-valued, $y \in \mathbb{R}$ (or a restricted subset of the real values).
We have access to a set of n training examples, Dn = {(x1,y1),...,(xn,yn)}, sampled independently at random from some fixed but unknown distribution P(x,y)
The goal is to minimize the prediction error/loss on new examples (x, y) drawn at random from the same distribution P(x, y). The loss may be, for example, the squared loss
$$\mathrm{loss}(y, \hat{y}) = (y - \hat{y})^2$$
where $\hat{y}$ denotes our prediction in response to x.
Linear regression
We need to define a class of functions (the types of predictions we will try to make), such as linear predictions:
$$f(x; w_0, w_1) = w_0 + w_1 x$$
where $w_1, w_0$ are the parameters we need to set.
Estimation criterion
• We need an estimation criterion so as to be able to select appropriate values for our parameters (w1 and w0) based on the training set $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
• For example, we can use the empirical loss:
$$J_n(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; w_0, w_1)\right)^2$$
Empirical loss
• Ideally, we would like to find the parameters w1, w0 that minimize the expected loss (assuming unlimited training data):
$$J(w_0, w_1) = E_{(x,y)\sim P}\left[(y - f(x; w_0, w_1))^2\right]$$
where the expectation is over samples from P(x, y).
• When the number of training examples n is large, however, the empirical error is approximately what we want:
$$E_{(x,y)\sim P}\left[(y - f(x; w_0, w_1))^2\right] \approx \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; w_0, w_1)\right)^2$$
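To make this concrete, here is a minimal sketch (Python/NumPy; the function names and the synthetic data are ours, not from the slides) of evaluating the empirical loss for a candidate parameter pair:

```python
import numpy as np

def f(x, w0, w1):
    """Linear prediction f(x; w0, w1) = w0 + w1 * x."""
    return w0 + w1 * x

def empirical_loss(x, y, w0, w1):
    """J_n(w0, w1) = (1/n) * sum_i (y_i - f(x_i; w0, w1))^2."""
    return np.mean((y - f(x, w0, w1)) ** 2)

# Hypothetical sample standing in for draws from an unknown P(x, y).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

print(empirical_loss(x, y, w0=2.0, w1=0.5))  # approaches the noise variance as n grows
```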
Estimating the parameters
We minimize the empirical squared loss
$$J_n(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; w_0, w_1)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w_0 - w_1 x_i\right)^2$$
By setting the derivatives with respect to w1 and w0 to zero, we get necessary conditions for the "optimal" parameter values:
$$\frac{\partial}{\partial w_0} J_n(w_0, w_1) = 0, \qquad \frac{\partial}{\partial w_1} J_n(w_0, w_1) = 0$$
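Solving these two conditions gives the familiar closed-form estimates $w_1 = \mathrm{Cov}(x, y)/\mathrm{Var}(x)$ and $w_0 = \bar{y} - w_1 \bar{x}$; below is a minimal sketch (the helper name and synthetic data are ours):

```python
import numpy as np

def fit_simple_linear(x, y):
    """Solve dJn/dw0 = 0 and dJn/dw1 = 0 in closed form:
    w1 = cov(x, y) / var(x),  w0 = mean(y) - w1 * mean(x)."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)
print(fit_simple_linear(x, y))  # approximately (2.0, 0.5)
```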
Types of error
Structural error measures the error introduced by the limited function class (infinite training data):
$$\min_{w_0, w_1} E_{(x,y)\sim P}\,(y - w_0 - w_1 x)^2 = E_{(x,y)\sim P}\,(y - w_0^* - w_1^* x)^2$$
where $(w_0^*, w_1^*)$ are the optimal linear regression parameters.
Approximation error measures how close we can get to the optimal linear predictions with limited training data:
$$E_{(x,y)\sim P}\,(w_0^* + w_1^* x - \hat{w}_0 - \hat{w}_1 x)^2$$
where $(\hat{w}_0, \hat{w}_1)$ are the parameter estimates based on a small training set (therefore themselves random variables).
Multivariate Regression
Write matrix X and vector Y thus:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$$
(there are R datapoints; each input has m components)
The linear regression model assumes a vector w such that
$$\mathrm{Out}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = w_1 x[1] + w_2 x[2] + \cdots + w_m x[m]$$
The result is the same as in the one-dimensional case:
$$\hat{\mathbf{w}} = (X^T X)^{-1} X^T Y$$
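A minimal NumPy sketch of this estimator (np.linalg.solve is used instead of forming the inverse explicitly, which is numerically safer but mathematically equivalent; the data here are synthetic):

```python
import numpy as np

def fit_multivariate(X, Y):
    """Normal-equation solution w_hat = (X^T X)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)
R, m = 200, 3                       # R datapoints, m input components
X = rng.normal(size=(R, m))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=R)
print(fit_multivariate(X, Y))       # approximately w_true
```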
Beyond linear regression
• The linear regression functions
$$f: \; f(x, \mathbf{w}) = w_0 + w_1 x$$
$$f: \; f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_d x_d$$
are convenient because they are linear in the parameters, not necessarily in the input x.
• We can easily generalize these classes of functions to be non-linear functions of the inputs x but still linear in the parameters w. For example, the mth-order polynomial prediction:
$$f: \; f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_m x^m$$
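Because the model stays linear in w, the polynomial case reduces to ordinary least squares on an expanded design matrix; a minimal sketch (our own names, synthetic data):

```python
import numpy as np

def polynomial_design(x, m):
    """Columns [1, x, x^2, ..., x^m]: non-linear in x, linear in w."""
    return np.vander(x, N=m + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.1, size=100)

Phi = polynomial_design(x, m=2)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat)  # approximately [1.0, -2.0, 3.0]
```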
Subset Selection and Shrinkage: Motivation
Bias-Variance Trade-Off
Goal: choose the model to minimize the error
$$E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2$$
Method: sacrifice a little bit of bias to reduce the variance
Better interpretation: find the strongest factors from the input space
Shrinkage
Intuition: a continuous version of subset selection.
Goal: impose a penalty on the complexity of the model to get lower variance.
Two examples: ridge regression and the lasso.
Ridge Regression
Penalize by the sum of squares of the parameters:
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\right\}$$
Or, equivalently:
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s$$
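For the first (penalized) formulation, a standard closed form follows from setting the gradient to zero: $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. A minimal sketch (assuming X and y are centered so the unpenalized intercept is just $\bar{y}$; data synthetic):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y.
    Assumes centered X and y, so beta_0 = mean(y) is handled separately."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

Xc, yc = X - X.mean(axis=0), y - y.mean()
print(fit_ridge(Xc, yc, lam=1.0))   # shrunk toward zero relative to OLS
```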
Understanding of Ridge Regression
Find the orthogonal principal components (basis vectors), and then apply a greater amount of shrinkage to the basis vectors with small variance.
Assumption: y varies most in the directions of high variance.
Intuitive example: stop words in text classification, if we assume no covariance between words.
Relates to MAP estimation. If $\beta \sim N(0, \tau^2 I)$ and $y \sim N(X\beta, \sigma^2 I)$, then:
$$\hat{\beta}^{\mathrm{ridge}} = \arg\max_{\beta}\; P(\mathrm{Data} \mid \beta)\,P(\beta)$$
Lasso (Least Absolute Shrinkage and Selection Operator)
A popular model selection and shrinkage estimation method.
In a linear regression set-up:
$Y$: continuous response
$X$: $n \times p$ design matrix
$\beta$: parameter vector
The lasso estimator is then defined as:
$$\hat{\beta}(\lambda) = \arg\min_{\beta}\left(\|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p}|\beta_j|\right)$$
where $\|u\|_2^2 = \sum_{i=1}^{n} u_i^2$, and a larger $\lambda$ sets some coefficients exactly to 0.
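One common solver for this objective (not the constrained-QP algorithm discussed later in these slides) is coordinate descent with soft-thresholding. The sketch below minimizes $(1/2)\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, which differs from the formulation above only by a constant rescaling of $\lambda$; all names and data are ours:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=100)
print(fit_lasso(X, y, lam=10.0))  # most coefficients come out exactly zero
```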
Lasso (Least Absolute Shrinkage and Selection Operator)
Features: Sparse Solutions
Let $\hat{\beta}^0$ be the full least squares estimates and let $t_0 = \sum_j |\hat{\beta}^0_j|$.
Values $t < t_0$ will cause the shrinkage.
Let $s = t / t_0$ be the scaled lasso parameter.
Why Lasso?
LASSO is proposed because:
• Ridge regression is not parsimonious.
• Ridge regression may generate huge prediction errors when the true (unknown) coefficient vector is sparse.
LASSO can outperform ridge regression if:
• The true (unknown) coefficients are composed of a lot of zeros.
Why Lasso?
Prediction Accuracy
Assume $y = f(x) + \varepsilon$, with $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$; then the prediction error of the estimate $\hat{f}(x)$ is
$$\mathrm{Err}(x) = E[(y - \hat{f}(x))^2] = \sigma^2 + [E\hat{f}(x) - f(x)]^2 + E[(\hat{f}(x) - E\hat{f}(x))^2] = \sigma^2 + \mathrm{bias}^2(\hat{f}(x)) + \mathrm{var}(\hat{f}(x))$$
OLS estimates often have low bias but large variance; the lasso can improve the overall prediction accuracy by sacrificing a little bias to reduce the variance of the predicted value.
Why Lasso?
Interpretation
In many cases, the response y is determined by just a small subset of the predictor variables.
How to solve the problem?
The absolute-value constraint $\sum_j |\beta_j| \le t$ can be translated into $2^p$ linear inequality constraints ($p$ stands for the number of predictor variables):
$$G\beta \le t\,\mathbf{1}$$
where $G$ is a $2^p \times p$ matrix, corresponding to the $2^p$ linear inequality constraints.
But direct application of this procedure is not practical due to the fact that $2^p$ may be very large.
Lawson, C. and Hansen, R. (1974). Solving Least Squares Problems. Prentice Hall.
How to solve the problem?
Outline of the algorithm: sequentially introduce the inequality constraints.
In practice, the average number of iteration steps required is in the range (0.5p, 0.75p), so the algorithm is acceptable.
Lawson, C. and Hansen, R. (1974). Solving Least Squares Problems. Prentice Hall.
Group Lasso
In some cases not only continuous but also categorical predictors (factors) are present; there, the lasso solution is not satisfactory, since it selects individual dummy variables rather than whole factors.
Extended from the lasso penalty, the group lasso estimator is:
$$\hat{\beta} = \arg\min_{\beta}\left(\|Y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G}\|\beta_{I_g}\|_2\right)$$
$I_g$: the index set belonging to the $g$-th group of variables.
The penalty does variable selection at the group level, and is intermediate between the $\ell_1$ and $\ell_2$ type penalties.
It encourages that either $\hat{\beta}_g = 0$ or $\hat{\beta}_{g,j} \neq 0$ for all $j \in \{1, \ldots, \mathrm{df}_g\}$.
Elastic Net
Compromise Between ℓ1 and ℓ2 to Improve Reliability
Training set: $D = (X_i, y_i),\; i = 1, \ldots, n$
Residual sum of squares: $\mathrm{RSS}(\beta) = \sum_{i=1}^{n}(y_i - X_i\beta)^2$
Lasso penalty: $L_1(\beta) = \lambda_1 \sum_{j=1}^{p}|\beta_j|, \quad \lambda_1 > 0$
Ridge penalty: $L_2(\beta) = \lambda_2 \sum_{j=1}^{p}\beta_j^2, \quad \lambda_2 > 0$
Objective: find $\hat{\beta} = \arg\min_{\beta}\; \mathrm{RSS}(\beta) + L_1(\beta) + L_2(\beta)$
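In practice this objective is available off the shelf; for instance, scikit-learn's ElasticNet solves an equivalent problem, though it reparameterizes $(\lambda_1, \lambda_2)$ as (alpha, l1_ratio) and scales the RSS by 1/(2n). A sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=100)

# Penalty: alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(model.coef_)  # sparse like the lasso, with ridge-style stability
```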
Elastic Net
[Figure: the elastic net penalty, combining the ridge penalty (λ2) and the lasso penalty (λ1).]
Principal Component Regression
Goal: use linear combinations of the inputs as inputs in the regression.
Usually the derived input directions are orthogonal to each other.
Principal component regression:
• Get $v_m$ using the SVD of X.
• Use $z_m = X v_m$ as inputs in the regression:
$$\hat{y}^{\mathrm{pcr}} = \bar{y} + \sum_{m=1}^{M}\hat{\theta}_m z_m$$
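A minimal sketch of PCR via the SVD (our own helper; assumes X is centered, data synthetic):

```python
import numpy as np

def fit_pcr(X, y, M):
    """Keep the M highest-variance directions v_m from X = U S V^T,
    regress on z_m = X v_m, and return the fitted values."""
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:M].T                              # derived inputs z_m
    theta = (Z.T @ yc) / np.sum(Z ** 2, axis=0)   # columns of Z are orthogonal
    return y.mean() + Z @ theta                   # y_bar + sum_m theta_m z_m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)
y_hat = fit_pcr(X, y, M=2)
print(np.mean((y - y_hat) ** 2))  # training error using 2 of 5 components
```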
Partial Least Squares
Idea: find directions that have high variance and high correlation with y.
Unlike general multiple linear regression, PLS regression can handle strongly collinear data, as well as data in which the number of predictors is larger than the number of observations.
PLS builds the relationship between the response and the predictors through a few latent variables constructed from the predictors. The number of latent variables is much smaller than the number of original predictors.
Partial Least Squares
Let vector y (n×1) denote the single response, matrix X (n×p) denote the n observations of p predictors, and matrix T (n×h) denote the n values of the h latent variables. The latent variables are linear combinations of the original predictors:
$$T = XW, \qquad T_{ik} = \sum_{j} X_{ij} W_{jk}$$
where matrix W (p×h) contains the weights. Then the response and the observations of the predictors can be expressed using T as follows (Wold S., et al. 2001):
$$X = TP + E, \qquad y = TC + f$$
where matrix P (h×p) is called the loadings and matrix C (h×1) contains the regression coefficients of T. The matrix E (n×p) and vector f (n×1) are the random errors of X and y. PLS regression decomposes X and y simultaneously to find a set of latent variables that explain the covariance between X and y as much as possible.
Partial Least Squares
PLS regression also establishes the relation between the response y and the original predictors X as a multiple regression model:
$$y = XB + f'$$
where vector f' (n×1) contains the regression errors and matrix B (p×1) contains the PLS regression coefficients, which can be calculated by:
$$B = WC$$
The significant predictors can then be selected based on the values of the regression coefficients from PLS regression, which is called the PLS-Beta method.
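For illustration, scikit-learn's PLSRegression exposes exactly such coefficients; a sketch with hypothetical collinear data (the screening rule at the end is an illustrative stand-in for PLS-Beta, not a reference implementation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p, h = 100, 20, 3                            # h latent variables, h << p
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # strongly collinear columns
y = X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

pls = PLSRegression(n_components=h).fit(X, y)
B = pls.coef_                                   # plays the role of B above
# PLS-Beta style screening: rank predictors by |B|.
print(np.argsort(-np.abs(B).ravel())[:5])
```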
PCR vs. PLS vs. Ridge Regression
PCR discards the smallest-eigenvalue components (the low-variance directions). The mth component $v_m$ solves:
$$\max_{\|v\|=1,\; v^T S v_l = 0,\; l=1,\ldots,m-1} \mathrm{Var}(Xv)$$
PLS shrinks the low-variance directions while inflating the high-variance directions. The mth component $v_m$ solves:
$$\max_{\|v\|=1,\; v^T S v_l = 0,\; l=1,\ldots,m-1} \mathrm{Corr}^2(y, Xv)\,\mathrm{Var}(Xv)$$
Ridge regression shrinks the coefficients of the principal components; the low-variance directions are shrunk more.