TRANSCRIPT
Lecture 11: Regression Methods I (Linear Regression)
Hao Helen Zhang
Outline
1 Regression: Supervised Learning with Continuous Responses
2 Linear Models and Multiple Linear Regression
  Ordinary Least Squares
  Statistical inferences
  Computational algorithms
Regression Models
If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.
linear regression models: parametric models
nonparametric regression: splines, kernel estimators, local polynomial regression
semiparametric regression
Broad coverage:
penalized regression, regression trees, support vector regression, quantile regression
Linear Regression Models
A standard linear regression model assumes
y_i = x_i^T β + ε_i, with ε_i i.i.d., E(ε_i) = 0, Var(ε_i) = σ^2,
where y_i is the response for the ith observation and x_i ∈ R^d is the covariate vector;
β ∈ R^d is the d-dimensional parameter vector.
Common model assumptions:
independence of errors
constant error variance (homoscedasticity)
ε independent of X.
Normality is not needed.
About Linear Models
The linear model has been a mainstay of statistics for the past 30 years and remains one of the most important tools.
The covariates may come from different sources
quantitative inputs; dummy coding of qualitative inputs
transformed inputs: log(X), X^2, √X, ...
basis expansions: X_1, X_1^2, X_1^3, ... (polynomial representation)
interactions between variables: X_1 X_2, ...
Review on Matrix Theory - Notations
A is an m ×m matrix.
col(A): the subspace of Rm spanned by the columns of A.
Im is the identity matrix of size m.
Vectors,
x is a nonzero m × 1 vector
0 is the m × 1 zero vector.
e_i, i = 1, ..., m, is the m × 1 unit vector with 1 in the ith position and zeros elsewhere.
The ith column of A can be expressed as Ae_i, for i = 1, ..., m.
Basic Concepts
The determinant of A is det(A) = |A|. The trace of A is tr(A), the sum of the diagonal elements.
The roots of the mth-degree polynomial equation in λ,
|λI_m − A| = 0,
denoted by λ_1, ..., λ_m, are called the eigenvalues of A.
The collection {λ_1, ..., λ_m} is called the spectrum of A.
Any nonzero m × 1 vector x_i ≠ 0 such that
Axi = λixi
is an eigenvector of A corresponding to the eigenvalue λi .
Let B be another m ×m matrix, then
|AB| = |A||B|, tr(AB) = tr(BA).
A is symmetric if A′ = A.
Review on Matrix Theory (II)
The following are equivalent:
|A| ≠ 0
rank(A) = m
A−1 exists.
Linear transformation: Ax
generates a vector in col(A)
Orthogonal Matrix
An m ×m matrix P is called an orthogonal matrix if
PP ′ = P ′P = Im, or P−1 = P ′.
If P is an orthogonal matrix, then
|PP′| = |P||P′| = |P|^2 = |I| = 1, so |P| = ±1.
For any A, we have tr(PAP ′) = tr(AP ′P) = tr(A).
PAP ′ and A have the same eigenvalues, since
|λI_m − PAP′| = |λPP′ − PAP′| = |P|^2 |λI_m − A| = |λI_m − A|.
Spectral Decomposition of Symmetric Matrix
If A is symmetric, there exists an orthogonal matrix P such that
P ′AP = Λ = diag{λ1, · · · , λm},
λi ’s are the eigenvalues of A.
The eigenvectors of A are the column vectors of P.
Denote the ith column of P by p_i; then
PP′ = ∑_{i=1}^m p_i p_i′ = I_m.
The spectral decomposition of A is
A = PΛP′ = ∑_{i=1}^m λ_i p_i p_i′.
tr(A) = tr(Λ) = ∑_{i=1}^m λ_i and |A| = |Λ| = ∏_{i=1}^m λ_i.
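A quick numerical check of the spectral decomposition using numpy (a sketch; the symmetric matrix A below is an arbitrary illustrative example, not from the lecture):

```python
import numpy as np

# A small symmetric matrix (arbitrary example).
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

# eigh returns the eigenvalues and orthonormal eigenvectors (columns of P).
lam, P = np.linalg.eigh(A)

A_rebuilt = P @ np.diag(lam) @ P.T               # A = P Lambda P'
print(np.allclose(A, A_rebuilt))                 # True
print(np.allclose(P @ P.T, np.eye(3)))           # P is orthogonal
print(np.isclose(np.trace(A), lam.sum()))        # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), lam.prod()))  # |A| = product of eigenvalues
```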
Idempotent Matrices
An m ×m matrix A is idempotent if
A2 = AA = A.
The eigenvalues of an idempotent matrix are either zero or one
λx = Ax = A(Ax) = A(λx) = λ^2 x ⟹ λ = λ^2.
If A is idempotent, so is I_m − A.
Projection Matrix
A symmetric, idempotent matrix A is called a projection matrix.
If A is symmetric and idempotent with rank(A) = r, then
A has r eigenvalues equal to 1 and m − r zero eigenvalues;
tr(A) = rank(A);
I_m − A is also symmetric idempotent, of rank m − r.
Projection Matrices
Given x ∈ Rm, define y = Ax, z = (I − A)x = x− y. Then
y ⊥ z.
y is the orthogonal projection of x onto the subspace col(A).
z = (I − A)x is the orthogonal projection of x onto the complementary subspace, so that
x = y + z = Ax + (I − A)x.
Matrix Notations for Linear Regression
The response vector y = (y1, · · · , yn)T
The design matrix X .
Assume the first column of X is 1. The dimension of X is n × (1 + d).
The regression coefficient vector β = (β_0, β_1, ..., β_d)^T, with intercept β_0.
The error vector ε = (ε_1, ..., ε_n)^T.
The linear model is written as:
y = Xβ + ε
the estimated coefficients β̂
the predicted response ŷ = Xβ̂.
Ordinary Least Squares (OLS)
The most popular method for fitting the linear model is ordinary least squares (OLS):
min_β RSS(β) = (y − Xβ)^T (y − Xβ).
Normal equations: X^T (y − Xβ) = 0.
β̂ = (X^TX)^{-1} X^T y and ŷ = X (X^TX)^{-1} X^T y.
The residual vector is r = y − ŷ = (I − P_X) y.
The residual sum of squares is RSS = r^T r.
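A minimal numpy sketch of these OLS computations (the simulated data and coefficient values are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # first column is 1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta = X'y (avoiding an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat            # fitted values
r = y - y_hat                   # residual vector
RSS = r @ r                     # residual sum of squares
print(beta_hat, RSS)
```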
Projection Matrix
Call the following square matrix the projection or hat matrix:
P_X = X (X^TX)^{-1} X^T.
Properties:
symmetric and non-negative definite
idempotent: P_X^2 = P_X; the eigenvalues are 0's and 1's
P_X X = X, (I − P_X) X = 0.
We have r = (I − P_X) y and RSS = y^T (I − P_X) y.
Note X^T r = X^T (I − P_X) y = 0.
The residual vector is orthogonal to the column space spanned by X, col(X).
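A short numerical check of these hat-matrix properties, as a sketch on simulated data (the explicit formation of P_X is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Hat matrix P_X = X (X'X)^{-1} X' (fine for small examples).
P = X @ np.linalg.solve(X.T @ X, X.T)
r = y - P @ y                                 # residual vector

print(np.allclose(P, P.T))                    # symmetric
print(np.allclose(P @ P, P))                  # idempotent
print(np.allclose(P @ X, X))                  # P_X X = X
print(np.allclose(X.T @ r, 0))                # residuals orthogonal to col(X)
print(np.isclose(np.trace(P), X.shape[1]))    # trace = rank = d + 1
```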
[Figure 3.2 of The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001, Chapter 3): The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.]
Sampling Properties of β̂
Var(β̂) = σ^2 (X^TX)^{-1}.
The variance σ^2 can be estimated as
σ̂^2 = RSS/(n − d − 1).
This is an unbiased estimator, i.e., E(σ̂^2) = σ^2.
Inferences for Gaussian Errors
Under the Normal assumption on the error ε, we have
β̂ ∼ N(β, σ^2 (X^TX)^{-1})
(n − d − 1) σ̂^2 ∼ σ^2 χ^2_{n−d−1}
β̂ is independent of σ̂^2
To test H_0: β_j = 0, we use
if σ^2 is known, z_j = β̂_j / (σ √v_j), which has a standard Normal distribution under H_0;
if σ^2 is unknown, t_j = β̂_j / (σ̂ √v_j), which has a t_{n−d−1} distribution under H_0;
where v_j is the jth diagonal element of (X^TX)^{-1}.
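A sketch of these coefficient tests in numpy/scipy; the simulated data are illustrative, and scipy.stats is used only for the t tail probabilities:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
sigma2_hat = r @ r / (n - d - 1)             # unbiased estimate of sigma^2
v = np.diag(XtX_inv)                         # the v_j values
t_stat = beta_hat / np.sqrt(sigma2_hat * v)  # t_j for H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - d - 1)
print(np.round(t_stat, 3), np.round(p_val, 4))
```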
Confidence Interval for Individual Coefficients
Under the Normal assumption, the 100(1 − α)% C.I. for β_j is
β̂_j ± t_{n−d−1; α/2} σ̂ √v_j,
where t_{k; ν} is the (1 − ν) percentile of the t_k distribution.
In practice, we use the approximate 100(1 − α)% C.I. for β_j:
β̂_j ± z_{α/2} σ̂ √v_j,
where z_{α/2} is the (1 − α/2) percentile of the standard Normal distribution.
Even if the Gaussian assumption does not hold, this interval is approximately correct, with coverage probability approaching 1 − α as n → ∞.
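The two interval formulas above, written as a small helper function (a sketch; the function name and arguments are illustrative, and beta_hat, sigma2_hat, v are assumed to be computed as in the previous sketch):

```python
import numpy as np
from scipy import stats

def ols_ci(beta_hat, sigma2_hat, v, n, d, alpha=0.05, exact=True):
    """100(1 - alpha)% confidence intervals for each beta_j."""
    if exact:   # t-based interval, valid under Gaussian errors
        q = stats.t.ppf(1 - alpha / 2, df=n - d - 1)
    else:       # large-sample Normal approximation
        q = stats.norm.ppf(1 - alpha / 2)
    half = q * np.sqrt(sigma2_hat * v)
    return np.column_stack([beta_hat - half, beta_hat + half])
```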
Review on Multivariate Normal Distributions
Distributions of Quadratic Form (Non-central χ2):
If X ∼ N_p(µ, I_p), then
W = X^T X = ∑_{i=1}^p X_i^2 ∼ χ^2_p(λ), with λ = (1/2) µ^T µ.
Special case: if X ∼ N_p(0, I_p), then W = X^T X ∼ χ^2_p.
If X ∼ N_p(µ, V) where V is nonsingular, then
W = X^T V^{-1} X ∼ χ^2_p(λ), with λ = (1/2) µ^T V^{-1} µ.
If X ∼ N_p(µ, V) with V nonsingular, and A is symmetric with AV idempotent of rank s, then
W = X^T A X ∼ χ^2_s(λ), with λ = (1/2) µ^T A µ.
Cochran’s Theorem
Let y ∼ N_n(µ, σ^2 I_n) and let A_j, j = 1, ..., J, be symmetric idempotent matrices with rank s_j. Furthermore, assume that ∑_{j=1}^J A_j = I_n and ∑_{j=1}^J s_j = n. Then
(i) W_j = (1/σ^2) y^T A_j y ∼ χ^2_{s_j}(λ_j), where λ_j = (1/(2σ^2)) µ^T A_j µ;
(ii) the W_j's are mutually independent.
Essentially, we decompose y^T y into the (scaled) sum of quadratic forms:
∑_{i=1}^n y_i^2 = y^T I_n y = ∑_{j=1}^J y^T A_j y.
Application of Cochran’s Theorem to Linear Models
Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Define A = I − P_X and
the residual sum of squares: RSS = y^T A y = ||r||^2;
the sum of squares regression: SSR = y^T P_X y = ||ŷ||^2.
By Cochran's Theorem, we have
(i) RSS/σ^2 ∼ χ^2_{n−d−1} and SSR/σ^2 ∼ χ^2_{d+1}(λ), where λ = (Xβ)^T (Xβ)/(2σ^2);
(ii) RSS is independent of SSR. (Note r ⊥ ŷ.)
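A small numerical illustration of this decomposition on simulated data (a sketch; it only checks the algebraic identity y'y = RSS + SSR, not the distributional claims):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # projection onto col(X)
A = np.eye(n) - P                        # projection onto the orthogonal complement

RSS = y @ A @ y                          # ||r||^2
SSR = y @ P @ y                          # ||y_hat||^2
print(np.isclose(RSS + SSR, y @ y))      # y'y = RSS + SSR, since A + P = I
```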
F Distribution
If U_1 ∼ χ^2_p, U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then
F = (U_1/p) / (U_2/q) ∼ F_{p,q}.
If U_1 ∼ χ^2_p(λ), U_2 ∼ χ^2_q, and U_1 ⊥ U_2, then
F = (U_1/p) / (U_2/q) ∼ F_{p,q}(λ) (noncentral F).
Example: Assume y ∼ N_n(Xβ, σ^2 I_n). Let A = I − P_X, and
RSS = y^T A y = ||r||^2, SSR = y^T P_X y = ||ŷ||^2.
Then
F = [SSR/(d + 1)] / [RSS/(n − d − 1)] ∼ F_{d+1, n−d−1}(λ), with λ = ||Xβ||^2/(2σ^2).
Making Inferences about Multiple Parameters
Assume X = [X_0, X_1], where X_0 consists of the first k columns. Correspondingly, β = [β_0′, β_1′]′. To test H_0: β_0 = 0, we use
F = [(RSS_1 − RSS)/k] / [RSS/(n − d − 1)],
where
RSS_1 = y^T (I − P_{X_1}) y (reduced model),
RSS = y^T (I − P_X) y (full model),
RSS ∼ σ^2 χ^2_{n−d−1},
RSS_1 − RSS = y^T (P_X − P_{X_1}) y.
Testing Multiple Parameters
Applying Cochran's Theorem to RSS_1, RSS, and RSS_1 − RSS:
RSS and RSS_1 − RSS are independent;
they respectively follow noncentral χ^2 distributions, with noncentralities (Xβ)^T (I − P_{X_1})(Xβ)/(2σ^2), 0, and (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ^2).
Then we have
F ∼ F_{k, n−d−1}(λ), with λ = (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ^2).
Under H_0, we have Xβ = X_1 β_1, so F ∼ F_{k, n−d−1}.
Nested Model Selection
To test for the significance of groups of coefficients simultaneously, we use the F-statistic
F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / [RSS_1/(n − d_1 − 1)],
where
RSS_1 is the RSS for the bigger model with d_1 + 1 parameters,
RSS_0 is the RSS for the nested smaller model with d_0 + 1 parameters, having d_1 − d_0 parameters constrained to zero.
The F-statistic measures the change in RSS per additional parameter in the bigger model, normalized by an estimate of σ^2.
Under the assumption that the smaller model is correct, F ∼ F_{d_1−d_0, n−d_1−1}.
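A sketch of this nested-model F-test with numpy/scipy; the particular column split into a smaller and a bigger model is illustrative:

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)

rng = np.random.default_rng(3)
n = 80
X_big = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])  # d1 = 4
X_small = X_big[:, :3]                                          # d0 = 2 (nested)
y = X_big @ np.array([1.0, 0.5, -0.5, 0.0, 0.0]) + rng.normal(size=n)

d1, d0 = X_big.shape[1] - 1, X_small.shape[1] - 1
RSS1, RSS0 = rss(X_big, y), rss(X_small, y)
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_val = stats.f.sf(F, d1 - d0, n - d1 - 1)
print(F, p_val)
```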
Confidence Set
The approximate confidence set for β is
C_β = {β | (β̂ − β)^T (X^TX)(β̂ − β) ≤ σ̂^2 χ^2_{d+1; 1−α}},
where χ^2_{k; 1−α} is the (1 − α) percentile of the χ^2_k distribution.
The confidence interval for the true function f(x) = x^T β is
{x^T β | β ∈ C_β}.
Gauss-Markov Theorem
Assume s^T β is linearly estimable, i.e., there exists a linear estimator b + c^T y such that E(b + c^T y) = s^T β.
A function s^T β is linearly estimable iff s = X^T a for some a.
Theorem: If s^T β is linearly estimable, then s^T β̂ is the best linear unbiased estimator (BLUE) of s^T β:
for any c^T y satisfying E(c^T y) = s^T β, we have
Var(s^T β̂) ≤ Var(c^T y).
s^T β̂ is the best among all the unbiased estimators. (It is a function of the complete and sufficient statistic (y^T y, X^T y).)
Question: Is it possible to find a slightly biased linear estimator but with smaller variance? (Trade a little bias for a large reduction in variance.)
Linear Regression with Orthogonal Design
If X is univariate, the least squares estimate is
β̂ = (∑_i x_i y_i) / (∑_i x_i^2) = ⟨x, y⟩ / ⟨x, x⟩.
If X = [x_1, ..., x_d] has orthogonal columns, i.e.,
⟨x_j, x_k⟩ = 0 for all j ≠ k,
or equivalently X^T X = diag(||x_1||^2, ..., ||x_d||^2), the OLS estimates are given as
β̂_j = ⟨x_j, y⟩ / ⟨x_j, x_j⟩ for j = 1, ..., d.
Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to univariate regressions.
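A sketch illustrating that, with orthogonal columns, multiple regression coefficients coincide with the coordinate-wise univariate ratios (the orthogonal design here is built artificially via QR of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 3
# Build a design with exactly orthogonal columns via QR of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q                                    # columns are orthogonal (in fact orthonormal)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

beta_multiple = np.linalg.solve(X.T @ X, X.T @ y)            # multiple regression
beta_univariate = np.array([x @ y / (x @ x) for x in X.T])   # <x_j, y> / <x_j, x_j>
print(np.allclose(beta_multiple, beta_univariate))           # True
```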
How to orthogonalize X?
Consider the simple linear regression y = β_0 1 + β_1 x + ε. We regress x onto 1 and obtain the residual
z = x − x̄ 1.
Orthogonalization Process:
The residual z is orthogonal to the regressor 1.
The column space of X is span{1, x}. Note: ŷ ∈ span{1, x} = span{1, z}, because
β_0 1 + β_1 x = β_0 1 + β_1 [x̄ 1 + (x − x̄ 1)]
= β_0 1 + β_1 [x̄ 1 + z]
= (β_0 + β_1 x̄) 1 + β_1 z
= η_0 1 + β_1 z.
{1, z} form an orthogonal basis for the column space of X .
How to orthogonalize X? (continued)
Estimation Process:
First, we regress y onto z for the OLS estimate of the slope β_1:
β̂_1 = ⟨y, z⟩ / ⟨z, z⟩ = ⟨y, x − x̄1⟩ / ⟨x − x̄1, x − x̄1⟩.
Second, we regress y onto 1 and get the coefficient η̂_0 = ȳ.
The OLS fit is given as
ŷ = η̂_0 1 + β̂_1 z
= η̂_0 1 + β̂_1 (x − x̄1) = (η̂_0 − β̂_1 x̄) 1 + β̂_1 x.
Therefore, the OLS intercept is obtained as
β̂_0 = η̂_0 − β̂_1 x̄ = ȳ − β̂_1 x̄.
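A sketch of this two-step computation for simple linear regression, checked against the direct OLS fit (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

one = np.ones(n)
z = x - x.mean() * one                   # residual of x regressed on 1
beta1 = (y @ z) / (z @ z)                # slope from regressing y on z
eta0 = y.mean()                          # coefficient from regressing y on 1
beta0 = eta0 - beta1 * x.mean()          # recover the intercept

# Compare with the direct least squares fit.
X = np.column_stack([one, x])
print(np.allclose([beta0, beta1], np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```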
[Figure 3.4 of The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001, Chapter 3): Least squares regression by orthogonalization of the inputs. The vector x2 is regressed on the vector x1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x2. Adding together the projections of y on each of x1 and z gives the least squares fit ŷ.]
How to orthogonalize X? (d = 2)
Consider y = β_1 x_1 + β_2 x_2 + β_3 x_3 + ε.
Orthogonalization process:
1 We regress x_2 onto x_1 and compute the residual
z_1 = x_2 − γ_{12} x_1. (note z_1 ⊥ x_1)
2 We regress x_3 onto (x_1, z_1) and compute the residual
z_2 = x_3 − γ_{13} x_1 − γ_{23} z_1. (note z_2 ⊥ {x_1, z_1})
Note: span{x_1, x_2, x_3} = span{x_1, z_1, z_2}, because
β_1 x_1 + β_2 x_2 + β_3 x_3 = β_1 x_1 + β_2 (γ_{12} x_1 + z_1) + β_3 (γ_{13} x_1 + γ_{23} z_1 + z_2)
= (β_1 + β_2 γ_{12} + β_3 γ_{13}) x_1 + (β_2 + β_3 γ_{23}) z_1 + β_3 z_2
= η_1 x_1 + η_2 z_1 + β_3 z_2.
Estimation Process
We project y onto the orthogonal basis {x_1, z_1, z_2} one by one, and then recover the coefficients corresponding to the original columns of X.
First, we regress y onto z_2 for the OLS estimate of the slope β_3:
β̂_3 = ⟨y, z_2⟩ / ⟨z_2, z_2⟩.
Second, we regress y onto z_1, leading to the coefficient η̂_2, and
β̂_2 = η̂_2 − β̂_3 γ_{23}.
Third, we regress y onto x_1, leading to the coefficient η̂_1, and
β̂_1 = η̂_1 − β̂_3 γ_{13} − β̂_2 γ_{12}.
Gram-Schmidt Procedure (Successive Orthogonalization)
1 Initialize z_0 = x_0 = 1.
2 For j = 1, ..., d: regress x_j on z_0, z_1, ..., z_{j−1} to produce coefficients γ_{kj} = ⟨z_k, x_j⟩ / ⟨z_k, z_k⟩ for k = 0, ..., j − 1, and the residual vector z_j = x_j − ∑_{k=0}^{j−1} γ_{kj} z_k. ({z_0, z_1, ..., z_{j−1}} are orthogonal.)
3 Regress y on the residual z_d to get
β̂_d = ⟨y, z_d⟩ / ⟨z_d, z_d⟩.
4 Compute β̂_j, j = d − 1, ..., 0, in that order successively.
{z_0, z_1, ..., z_d} forms an orthogonal basis for col(X).
The multiple regression coefficient β̂_j is the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_d.
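A numpy sketch of successive orthogonalization producing the last coefficient β̂_d; the function name is illustrative, and the back-substitution for the remaining coefficients follows the recipe above:

```python
import numpy as np

def last_coef_by_gram_schmidt(X, y):
    """Regress each column on the previous residuals, then regress y on z_d."""
    n, p = X.shape                       # columns x_0 = 1, x_1, ..., x_d
    Z = np.empty((n, p))
    Gamma = np.eye(p)                    # upper triangular, 1's on the diagonal
    for j in range(p):
        z = X[:, j].copy()
        for k in range(j):
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            z -= Gamma[k, j] * Z[:, k]
        Z[:, j] = z
    beta_d = (y @ Z[:, -1]) / (Z[:, -1] @ Z[:, -1])
    return beta_d, Z, Gamma

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)
beta_d, Z, Gamma = last_coef_by_gram_schmidt(X, y)
print(np.isclose(beta_d, np.linalg.lstsq(X, y, rcond=None)[0][-1]))  # True
```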
Collinearity Issue
The dth coefficient is
β̂_d = ⟨z_d, y⟩ / ⟨z_d, z_d⟩.
If x_d is highly correlated with some of the other x_j's, then
the residual vector z_d is close to zero;
the coefficient β̂_d will be very unstable.
The variance estimate is
Var(β̂_d) = σ^2 / ||z_d||^2.
The precision for estimating β_d depends on the length of z_d, i.e., on how much of x_d is left unexplained by the other x_k's.
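A small sketch of the collinearity effect: as x_d becomes nearly a linear combination of the other columns, ||z_d||^2 shrinks, so Var(β̂_d) = σ^2/||z_d||^2 blows up. (The construction of the correlated column is illustrative.)

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
for noise in [1.0, 0.1, 0.01]:            # decreasing noise gives stronger collinearity
    x2 = x1 + noise * rng.normal(size=n)  # x2 nearly collinear with x1 when noise is small
    X = np.column_stack([np.ones(n), x1, x2])
    # z_d: residual of x2 after regressing it on the other columns.
    others = X[:, :-1]
    gamma, *_ = np.linalg.lstsq(others, x2, rcond=None)
    z_d = x2 - others @ gamma
    print(noise, z_d @ z_d)               # ||z_d||^2 shrinks, so sigma^2/||z_d||^2 grows
```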
Two Computational Algorithms For Multiple Regression
Consider the normal equations
X^T X β = X^T y.
We would like to avoid computing (X^TX)^{-1} directly.
1 QR decomposition of X:
X = QR, where Q has orthonormal columns and R is upper triangular. Essentially, a process of orthogonal matrix triangularization.
2 Cholesky decomposition of X^T X:
X^T X = R R^T, where R is lower triangular.
Matrix Formulation of Orthogonalization
In Step 2 of the Gram-Schmidt procedure, for j = 1, ..., d,
z_j = x_j − ∑_{k=0}^{j−1} γ_{kj} z_k ⟹ x_j = ∑_{k=0}^{j−1} γ_{kj} z_k + z_j.
In matrix form, with X = [x_1, ..., x_d] and Z = [z_1, ..., z_d],
X = ZΓ.
The columns of Z are orthogonal to each other.
The matrix Γ is upper triangular, with 1's on the diagonal.
Standardizing Z using D = diag{||z_1||, ..., ||z_d||},
X = ZΓ = Z D^{-1} D Γ ≡ QR, with Q = Z D^{-1} and R = DΓ.
QR Decomposition
The columns of Q form an orthonormal basis for the column space of X.
Q is an n × d matrix with orthonormal columns, satisfying Q^T Q = I.
R is a d × d upper triangular matrix of full rank.
X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R.
The least squares solutions are
β̂ = (X^TX)^{-1} X^T y = R^{-1} R^{-T} R^T Q^T y = R^{-1} Q^T y,
ŷ = X β̂ = (QR)(R^{-1} Q^T y) = Q Q^T y.
QR Algorithm for Normal Equations
Regard β̂ as the solution of the linear system
R β = Q^T y.
1 Conduct the QR decomposition X = QR (Gram-Schmidt orthogonalization).
2 Compute Q^T y.
3 Solve the triangular system R β = Q^T y.
The computational complexity is O(nd^2).
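A sketch of the QR-based solver in numpy (np.linalg.qr returns the thin QR factorization; scipy's triangular solver is used for the back-substitution; the test data are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, y):
    """Solve the least squares problem via QR: R beta = Q'y."""
    Q, R = np.linalg.qr(X)               # thin QR: Q is n x p, R is p x p upper triangular
    return solve_triangular(R, Q.T @ y, lower=False)

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=30)
print(np.allclose(ols_qr(X, y), np.linalg.lstsq(X, y, rcond=None)[0]))  # True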
Cholesky Decomposition Algorithm
For any positive definite square matrix A, we have
A = R R^T,
where R is a lower triangular matrix of full rank.
1 Compute X^T X and X^T y.
2 Factor X^T X = R R^T; then β̂ = (R^T)^{-1} R^{-1} X^T y.
3 Solve the triangular system R w = X^T y for w.
4 Solve the triangular system R^T β = w for β̂.
The computational complexity is d^3 + nd^2/2 (can be faster than QR for small d, but can be less stable).
Var(ŷ_0) = Var(x_0^T β̂) = σ^2 x_0^T (R^T)^{-1} R^{-1} x_0.
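A sketch of the Cholesky-based solver; note that np.linalg.cholesky returns the lower triangular factor with X'X = R R', matching the R in the algorithm above (the test data are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_cholesky(X, y):
    """Solve the normal equations via Cholesky: R w = X'y, then R' beta = w."""
    R = np.linalg.cholesky(X.T @ X)      # lower triangular, X'X = R R'
    w = solve_triangular(R, X.T @ y, lower=True)
    return solve_triangular(R.T, w, lower=False)

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=30)
print(np.allclose(ols_cholesky(X, y), np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```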