CPH 636 - Dr. Charnigo
Chap. 3 Notes

The authors discuss linear methods for regression, not only because of their historical importance but because (especially with some modern innovations) they may perform well when n is small, p is large, and/or the noise variance is large.

Formula (3.1) describes the model for f(x) := E[Y|X=x], which can accommodate non-linear functions of x including dummy codes; regarding the latter, note part f of exercise 1 on your first team project.

Even if f(x) is non-linear in x, f(x) is linear in the parameters β0, β1, …, βp. Thus, estimating the parameters by ordinary least squares – i.e., minimization of (3.2) – leads to the closed-form solution in (3.6). If you’ve not studied vector calculus, I will show you how the solution is obtained with p=1 and neglecting the intercept.

Figure 3.1 illustrates ordinary least squares when p=2.
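
To make the closed-form solution concrete, here is a small Python sketch (my own code, not from the textbook, using simulated data) that computes (3.6) in general and checks it against the p=1, no-intercept special case I will derive in class:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)              # simulated data, no intercept

# General closed form (3.6): beta_hat = (X'X)^{-1} X'y
X = x.reshape(-1, 1)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Special case p = 1, no intercept: beta_hat = sum(x*y) / sum(x^2)
beta_hat_simple = np.sum(x * y) / np.sum(x * x)

print(beta_hat[0], beta_hat_simple)           # the two agree
```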

Note that, once the parameters are estimated, we may predict Y for any x. In particular, (3.7) shows that predicting Y for x which occurred in the training data is accomplished by a matrix-vector multiplication,

Y_predicted = H Y_training, where H := X(XᵀX)⁻¹Xᵀ.

To make predictions for a new data set (rather than for the training data set), simply replace the first X in H by its analogue from the new data set.
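
As a hypothetical illustration with simulated data, the following sketch forms H, reproduces the fitted values as in (3.7), and then predicts for a new design matrix by swapping in its analogue of the first X:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept column included
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                          # the hat matrix
y_hat_train = H @ y                            # fitted values, as in (3.7)

# For a new data set, replace the first X in H by the new design matrix
X_new = np.column_stack([np.ones(5), rng.normal(size=(5, p))])
y_hat_new = X_new @ XtX_inv @ X.T @ y
```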

We refer to H as the hat matrix. The hat matrix is what mathematicians call a projection matrix. Consequently, H² = H. Figure 3.2 provides an illustration with p=2, but I will provide an illustration with p=1 which may be easier to understand.

Assuming there are no redundancies in the columns of X, the trace of H is p+1. This trace is called the degrees of freedom (df) for the model, a concept which generalizes to nonparametric methods which are linear in the outcomes.
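
A quick numerical check of both facts (again with simulated data, not the textbook's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # no redundant columns

H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H @ H, H))    # True: H is idempotent, i.e., a projection
print(np.trace(H))              # p + 1 = 4, the degrees of freedom
```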

If we regard the predictors as fixed rather than random (or, if we condition upon their observed values), then under the usual assumptions for linear regression (what are they ?), we have result (3.10).

Combined with result (3.11), which says that the residual sum of squares divided by the error variance follows the chi-square distribution on n-p-1 degrees of freedom, result (3.10) forms the basis for inference on β0, β1, …, βp. (Even if we are more interested in prediction, this is still worth your understanding.)
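
Here is a sketch (my own code, simulated data) of how (3.10) and (3.11) translate into standard errors, t statistics, and p-values for the individual coefficients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p - 1)                     # unbiased error variance estimate

se = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # standard errors via (3.10)
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)   # n - p - 1 df, per (3.11)
```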

The authors make the point that the T distribution which is proper to such inferences is often replaced by the Z distribution when n (or, rather, n-p-1) is large.

I think the authors have oversimplified a bit here, though, because the adequacy of the Z approximation to the T distribution depends on the desired level of confidence or significance. In any case, with modern computing powerful enough to implement methods in this book, why use such an approximation at all ?
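
A quick comparison of critical values makes the point; the discrepancy between t and Z grows as the significance level shrinks:

```python
from scipy import stats

df = 30                                  # i.e., n - p - 1 = 30
for alpha in (0.10, 0.05, 0.01, 0.001):
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(alpha, round(t_crit, 3), round(z_crit, 3))
```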

Result (3.13) tells you how to test a null hypothesis that some but not all predictors in your model are unnecessary.

This is important because testing H0: β1 = β2 = 0 is not the same as testing H0: β1 = 0 and H0: β2 = 0. Possibly X1 could be deleted if X2 were kept in the model, or vice versa, while deleting both would be unwise. For example, consider X1 = SBP and X2 = DBP (systolic and diastolic blood pressure).
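
Here is a hypothetical sketch of the F test in (3.13), written for any nested pair of design matrices (both including the intercept column); you could take the dropped columns to be SBP and DBP:

```python
import numpy as np
from scipy import stats

def f_test(X_full, X_reduced, y):
    """F statistic as in (3.13), comparing a reduced model nested inside a full model."""
    def rss(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.sum((y - X @ beta) ** 2)
    n = len(y)
    k_full, k_red = X_full.shape[1], X_reduced.shape[1]    # column counts, intercept included
    rss_full, rss_red = rss(X_full), rss(X_reduced)
    f = ((rss_red - rss_full) / (k_full - k_red)) / (rss_full / (n - k_full))
    p_value = stats.f.sf(f, k_full - k_red, n - k_full)
    return f, p_value
```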

One could (and sometimes does, especially in backward elimination) test whether to remove X2 and then, once X2 is gone, test whether to remove X1 as well. But that entails (a greater degree of) sequential hypothesis testing, which is not well understood in terms of its implications for actual versus nominal statistical significance.

Moreover, there are some situations in which X1 and X2 should be either both in the model or both out, such as when they are dummy variables coding the same categorical predictor.

Result (3.15) describes how to make a confidence region for β0, β1, …, βp. I will illustrate what this looks like for β0 and β1 when p = 1. Importantly, such a region is not a rectangle.

Though not shown explicitly in (3.15), one may also make a confidence region for any subset of the parameters. I will illustrate what this looks like for β1 and β2 when p > 2 supposing that, for example, X1 = SBP and X2 = DBP.
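
Since the region in (3.15) is defined through a quadratic form, one can at least check numerically whether a candidate parameter vector falls inside it; here is a sketch using a chi-square cutoff as in (3.15) (my own function and variable names):

```python
import numpy as np
from scipy import stats

def in_confidence_region(beta_candidate, beta_hat, X, sigma2_hat, alpha=0.05):
    """Is beta_candidate inside the joint (1 - alpha) confidence region of (3.15)?"""
    diff = beta_hat - beta_candidate
    quad_form = diff @ (X.T @ X) @ diff
    cutoff = sigma2_hat * stats.chi2.ppf(1 - alpha, df=X.shape[1])   # df = p + 1
    return quad_form <= cutoff
```

A candidate can lie inside each marginal confidence interval yet outside this ellipsoidal region, which is exactly why the region is not a rectangle.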

The authors discuss the prostate cancer data set at some length. Notice that they began (Figure 1.1) by exploring the data. Dr. Stromberg would be proud !

I don’t quite agree with the authors’ conflation of “strongest effect” with largest Z score, even though they did standardize the predictors to have unit variance. Let’s discuss that…

Let’s also make sure we understand what the authors mean by “base error rate” and its reduction by 50%.

Returning to the idea of inference, the Gauss-Markov Theorem (and, likewise, the Cramer-Rao lower bound, for those of you who’ve heard of it) will permit us to conclude that, for a correctly specified model: (i) parameters are estimated unbiasedly; and, (ii) parameters are estimated with minimal variance subject to (i).

Although the Gauss-Markov Theorem sounds reassuring, there are some cases when we can achieve a huge reduction in variance by tolerating a modest amount of bias. This may substantially reduce both mean square error of estimation and mean square error of prediction.

Moreover, there’s a big catch to the Gauss-Markov Theorem: for a correctly specified model. How often do you suppose that (really) happens ?

The authors proceed to describe how a multiple linear regression model can actually be viewed as the result of fitting several simple linear regression models.

They begin by noting that when the inputs are orthogonal (roughly akin to the idea of statistical independence), unadjusted and adjusted parameter estimates are identical.

Usually inputs are not orthogonal. But imagine that, with standardized inputs and response (hence, no need for an intercept), we do the following:

1. Multiple linear regression of Xk on all other features.

2. Simple linear regression of Y on the residuals from step 1.

These residuals are orthogonal to the other features…

…and so the parameter estimate from step 2 will be the same as we would have obtained for Xk in a multiple linear regression of Y on X1, X2, …, Xp.

This gives us an alternate interpretation of an adjusted regression coefficient: we are quantifying the effect of Xk on that portion of Y which is orthogonal to (or, if you prefer, unexplained by) the other inputs.
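
A numerical check of this equivalence, with simulated standardized data (my own example, not the textbook's):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 2] += 0.5 * X[:, 0]                    # make the inputs correlated
X = (X - X.mean(0)) / X.std(0)              # standardized inputs
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)
y = (y - y.mean()) / y.std()                # standardized response, so no intercept

k = 2
others = [j for j in range(p) if j != k]

# Step 1: regress X_k on the other features and keep the residuals
gamma = np.linalg.lstsq(X[:, others], X[:, k], rcond=None)[0]
z_k = X[:, k] - X[:, others] @ gamma

# Step 2: simple regression of Y on those residuals
beta_k_two_step = (z_k @ y) / (z_k @ z_k)

# Coefficient of X_k from the full multiple regression
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_k_two_step, beta_full[k])        # the two agree
```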

Formula (3.29) then shows why it’s difficult to estimate the coefficient for Xk if Xk is highly correlated with the other features. This condition, in its extreme form, is known as collinearity.

In fact, the quantity in the denominator of (3.29) is related to the so-called variance inflation factor, which is sometimes used as a diagnostic for collinearity.
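
For reference, here is a small sketch (my own code) of the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing X_j on the remaining inputs; large values flag the difficulty that (3.29) describes:

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, p = X.shape
    Xc = X - X.mean(0)                       # centered columns
    vifs = np.empty(p)
    for j in range(p):
        others = np.delete(Xc, j, axis=1)
        coef = np.linalg.lstsq(others, Xc[:, j], rcond=None)[0]
        resid = Xc[:, j] - others @ coef
        r2 = 1.0 - (resid @ resid) / (Xc[:, j] @ Xc[:, j])
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs
```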

You may have also heard of the distinction between ANOVA and MANOVA.

You may wonder, then, whether there is an analogue to MANOVA for situations when you have multiple continuous outcomes which are being regressed on multiple continuous predictors.

There is, but parameter estimates relating a particular outcome to the predictors do not depend on whether they are acquired for the one outcome by itself or for all outcomes simultaneously. This is true even if the multiple outcomes are correlated with each other.

This is in stark contrast, of course, to the dependence of parameter estimates relating a particular outcome to a particular predictor on whether other predictors are considered simultaneously.

The authors note that ordinary least squares may have little bias and large variability. (Here they are assuming that the model is specified correctly. If the model is a drastic simplification of reality, then ordinary least squares will have large bias and little variability in relation to viable competing paradigms for modeling and estimation.)

The authors therefore discuss subset selection, which may reduce variability and enhance interpretation.

Best subset selection, which is computationally feasible for up to a few dozen candidate predictors, entails finding the best one-predictor model, the best two-predictor model, and so forth. This is illustrated in Figure 3.5.

Then, using either a validation data set or cross-validation (the authors do the latter in Figure 3.7), choose from among the best one-predictor model, best two-predictor model, and so forth.
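
To make the procedure concrete, here is a brute-force sketch (my own code, practical only for small p; the feasibility for "a few dozen" predictors relies on smarter search strategies such as leaps and bounds) that finds the best subset of each size by residual sum of squares; choosing among the sizes would then be done by validation or cross-validation as described above:

```python
import itertools
import numpy as np

def best_subsets(X, y):
    """For each subset size k, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = np.column_stack([np.ones(n), X[:, list(subset)]])   # intercept included
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - Xs @ beta) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    return best
```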

Note that the authors do not define “best” strictly by mean square error of prediction but also by considerations of parsimony.

Forward selection is an alternative to best subsets selection when p is large. The authors refer to it as a “greedy algorithm” because, at each step, that predictor is chosen which explains the greatest part of the remaining variability in the outcome.

While that seems desirable, the end result may actually be sub-optimal if predictors are strongly correlated.

Backward elimination is another option. A disadvantage is that it may not be viable when p is large relative to n. A compelling advantage is that backward elimination can be easily implemented “manually” if not otherwise programmed into the statistical software. This may be useful in, for example, PROC MIXED of SAS.

In addition to explicitly choosing from among available predictors, we may also employ “shrinkage” methods for estimating parameters in a linear regression model.

These are so called because the resulting estimates are often smaller in magnitude than those acquired via ordinary least squares.

Until further notice, we assume that Y and X1, …, Xp have been standardized with respect to training data.

Ridge regression is defined by formula (3.44) in the textbook and can be viewed as the solution to the penalized least squares problem expressed in (3.41).

Though perhaps not obvious, the constrained least squares problem in (3.42) is equivalent to (3.41), for an appropriate choice of t depending on λ. Moreover, (3.44) may be a good way to address collinearity. In particular, a correlation of ρ between X1 and X2 is, roughly speaking, effectively reduced to ρ / (1 + λ).
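
Here is a minimal sketch of the closed-form ridge solution in (3.44), assuming the standardization described earlier (my own code, simulated usage commented out):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lambda I)^{-1} X'y, assuming standardized X and centered y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Computing this over a grid of lambda values traces out the coefficient paths
# shown in a "ridge trace" plot:
# betas = np.array([ridge(X, y, lam) for lam in np.logspace(-2, 4, 50)])
```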

Figure 3.8 displays a “ridge trace”, which shows how the estimated parameters in ridge regression depend on λ. One may choose λ by cross validation, as the authors have done.

Ridge regression can also be viewed as finding a Bayesian posterior mode (whilst ordinary least squares is frequentist maximum likelihood), when the prior distribution on each parameter is normal with mean 0 and variance σ²/λ.

What do you think about ridge regression in terms of the bias / variance tradeoff ?

One weakness of ridge regression is that, almost invariably, you are still “stuck” with all of the predictors. Even if the collinearity issue is satisfactorily resolved, why is this a weakness ?

An alternative is the lasso, for which the corresponding optimization problems are (3.51) and (3.52).

There is no analytic solution for the parameter estimates with the lasso; one must use numerical optimization methods.

However, a favourable feature of the lasso is that some parameter estimates are “shrunk” to zero. In other words, some variables are effectively removed.

Figure 3.10 displays estimated parameters in relation to a shrinkage factor s which is proportional to t.
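
For concreteness, here is a bare-bones coordinate-descent sketch with soft thresholding, one standard numerical approach (my own code, assuming standardized data; in practice you would use a packaged solver):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n))||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual, excluding X_j
            z_j = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z_j, lam) / col_sq[j]
    return beta                                      # some entries come out exactly zero
```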

Figure 3.11 explains why the lasso can effectively remove some predictors from the model. Each red ellipse represents a contour on which the residual sum of squares equals a fixed value.

However, we are only allowed to accept parameter estimates within the blue geometric regions. So, the final parameter estimates will occur where an ellipse is tangent to a region. For a circular region, this will almost never happen along a coordinate axis.

The lasso also has a Bayesian interpretation, corresponding to a prior distribution on the parameters which has much heavier tails than a normal distribution. Thus, the lasso is less capable of reducing very large ordinary least squares estimates such as may occur with collinearity.

Table 3.4 nicely characterizes how subset selection, lasso, and ridge regression shrink ordinary least squares parameter estimates for uncorrelated predictors.

While having uncorrelated predictors is a fanciful notion for an observational study (versus a designed experiment), Table 3.4 helps explain why ridge regression does not produce zeroes and why the lasso is more of a “continuous” operation than subset selection.

The authors also mention the elastic net and least angle regression as shrinkage methods.

The former is a sort of compromise between ridge regression and the lasso, as suggested by Figure 18.5 later in the textbook. The idea is to both reduce very large ordinary least squares estimates and eliminate extraneous predictors from the model.

The latter is similar to the lasso, as shown in Figure 3.15, and provides insight into how to compute parameter estimates for the lasso more efficiently.

Besides subset selection and shrinkage methods, one may fit linear regression models via approaches based on derived input directions.

Principal components regression replaces X1, X2, …, Xp by a set of uncorrelated variables W1, W2, …, Wp such that Var(W1) > Var(W2) > … > Var(Wp). Each W is a linear combination of the X’s, such that the squared coefficients of the X’s sum to 1.

One then uses some or all of W1, W2, …, Wp as predictors in lieu of X1, X2, …, Xp. This eliminates any problem that may exist with collinearity.

The downside of principal components regression is that a W may be difficult to interpret contextually, unless it should happen that the W is approximately proportional to an average of some of the X’s or a “contrast” (the difference between the average of one subset of the X’s and the average of another subset).
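
Here is a compact sketch (my own code) of principal components regression via the singular value decomposition of the standardized inputs, keeping the first m components:

```python
import numpy as np

def pcr(X, y, m):
    """Principal components regression using the first m components of standardized X."""
    Xs = (X - X.mean(0)) / X.std(0)
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    W = Xs @ Vt.T                       # W_1, ..., W_p: uncorrelated, decreasing variance
    theta = np.linalg.lstsq(W[:, :m], y - y.mean(), rcond=None)[0]
    beta = Vt.T[:, :m] @ theta          # coefficients on the standardized X scale
    return beta, y.mean()               # intercept is the mean of y
```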

Partial least squares – which has been investigated by our own Dr. Rayens, among others – is similar to principal components regression, except that W1, W2, …, Wp are chosen in a way that Corr²(Y,W1)Var(W1) > Corr²(Y,W2)Var(W2) > … > Corr²(Y,Wp)Var(Wp).

If one intends to use only some of W1, W2, …, Wp as predictors, partial least squares may explain more variation in Y than principal components regression.
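
If you want to experiment, scikit-learn's PLSRegression fits these directions sequentially; a hypothetical run on simulated data might look like this (the variable names and settings are mine):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

pls = PLSRegression(n_components=2)     # keep only the first two derived directions
pls.fit(X, y)
y_hat = pls.predict(X).ravel()          # in-sample fitted values
```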

Figure 3.18 presents a nice illustrative comparison of ordinary least squares, best subset selection, ridge regression, lasso, principal components regression, and partial least squares.

To aid in the interpretation, note that X2 could be expressed as ±(1/2) X1 + (√3/2) Z, where Z is standard normal and independent of X1. Also, W1 = (X1 ± X2)/√2 and W2 = (X1 ∓ X2)/√2 for principal components regression.