statistics ii lesson 5. regression analysis (second...
TRANSCRIPT
Statistics IILesson 5. Regression analysis (second part)
Year 2010/11
Lesson 5. Regression analysis (second part)
Contents
I Diagnosis: Residual analysis
I The ANOVA (ANalysis Of VAriance) decomposition
I Nonlinear relationships and linearizing transformations
I A matrix treatment of the linear regression model
I Introduction to multiple linear regression
Lesson 5. Regression analysis (second part)
Bibliography references
I Newbold, P. “Estadıstica para los negocios y la economıa” (1997)I Chapters 12, 13 and 14
I Pena, D. “Regresion y Analisis de Experimentos” (2002)I Chapters 5 and 6
Regression diagnostics
Diagnostics
I The theoretical assumptions for the simple linear regression modelwith one response variable Y and one explicative variable x are:
I Linearity: yi = β0 + β1xi + ui , for i = 1, . . . , nI Homogeneity: E [ui ] = 0, for i = 1, . . . , nI Homoscedasticity: Var[ui ] = σ2, for i = 1, . . . , nI Independence: ui and uj are independent for i 6= jI Normality: ui ∼ N(0, σ2), for i = 1, . . . , n
I We study how to apply diagnostic procedures to test if theseassumptions are appropriate for the available data (xi , yi )
I Based on the analysis of the residuals ei = yi − yi
Diagnostics: scatterplots
Scatterplots
I The simplest diagnostic procedure is based on the visualexamination of the scatterplot for (xi , yi )
I Often this simple but powerful method reveals patterns suggestingwhether the theoretical model might be appropriate or not
I We illustrate its application on a classical example. Consider the fourfollowing datasets
Diagnostics: scatterplots
The Anscombe datasets132 TWO-VARIABLE LINEAR REGRESSION
TABLE 3-10 Four Data Sets
DATASET 1 DATASET 2 X Y X Y
10.0 8.04 10.0 9.14 8.0 6.95 8.0 8.14
1'3.0 7.58 13.0 8.74 9.0 8 .81 9.0 8.77
11.0 8.33 11.0 9.26 14.0 9.96 14.0 8.10
6.0 7.24 6.0 6.13 4.0 4.26 4.0 3.10
12.0 10.84 12.0 9.13 7.0 4.82 7.0 7.26 5.0 5.68 5.0 4.74
DATASET 3 DATASET 4 X Y X Y
10.0 7.46 8.0 6.58 8.0 6.77 8.0 5.76
13.0 12.74 8.0 7.71 9.0 7.11 8.0 8.84
11.0 7.81 8.0 8.47 14.0 8.84 8.0 7.04 6.0 6.08 8.0 5.25 4.0 5.39 19.0 12.50
12.0 8.15 8.0 5.56 7.0 6.42 8.0 7.91 5.0 5.73 8.0 6.89
SOURCE: F . J . Anscombe, op. cit.
mean of X = 9.0,
mean of Y = 7.5, for all four data sets.
And yet the four situations-although numerically equivalent in major respects-are substantively very different. Figure 3-29 shows how very different the four data sets actually are.
Anscombe has emphasized the importance of visual displays in statistical analysis:
Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:
Diagnostics: scatterplots
The Anscombe example
I The estimated regression model for each of the four previousdatasets is the same
I yi = 3,0 + 0,5xi
I n = 11, x = 9,0, y = 7,5, rxy = 0,817
I The estimated standard error of the estimator β1,√s2R
(n − 1)s2x
takes the value 0,118 in all four cases. The corresponding T statistictakes the value T = 0,5/0,118 = 4,237
I But the corresponding scatterplots show that the four datasets arequite different. Which conclusions could we reach from thesediagrams?
Diagnostics: scatterplots
Anscombe data scatterplots133 TWO-VARIABLE LINEAR REGRESSION
10 10
5 5
o
10
5
5 10 15 5 10 Data Set 1 Data Set 2
10
5
5 10 15 20 5 10 Data Set 3 Data Set 4
FIGURE 3-29 Scatterplots for the four data sets of Table 3-10 SOURCE: F. J . Anscombe, op cit.
(1) numerical calculations are exact, but graphs are rough;
15
15
(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.
Graphs can have various purposes, such as: (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought
20
20
Residual analysis
Further analysis of the residuals
I If the observation of the scatterplot is not sufficient to reject themodel, a further step would be to use diagnosis methods based onthe analysis of the residuals ei = yi − yi
I This analysis starts by standarizing the residuals, that is, dividingthem by the residuals (quasi-)standard deviation sR . The resultingquantities are known as standarized residuals:
ei
sR
I Under the assumptions of the linear regression model, thestandarized residuals are approximately independent standard normalrandom variables
I A plot of these standarized residuals should show no clear pattern
Residual analysis
Residual plots
I Several types of residual plots can be constructed. The mostcommon ones are:
I Plot of standardized residuals vs. xI Plot of standardized residuals vs. y (the predicted responses)
I Deviations from the model hypotheses result in patterns on theseplots, which should be visually recognizable
Residual plots examples
Consistency of the theoretical model5.1: Ej: consistencia con el modelo teorico
Residual plots examples
Nonlinearity5.1: Ej: No linealidad
Residual plots examples
Heteroscedasticity
5.1: Ej: Heterocedasticidad
Residual analysis
Outliers
I In a plot of the regression line we may observe outlier data, that is,data that show significant deviations from the regression line (orfrom the remaining data)
I The parameter estimators for the regression model, β0 and β1, arevery sensitive to these outliers
I It is important to identify the outliers and make sure that they reallyare valid data
I Statgraphics is able to show for example the data that generate“Unusual residuals” or “Influential points”
Residual analysis
Normality of the errors
I One of the theoretical assumptions of the linear regression model isthat the errors follow a normal distribution
I This assumption can be checked visually from the analysis of theresiduals ei , using different approaches:
I By inspection of the frequency histogram for the residualsI By inspection of the “Normal Probability Plot” of the residuals
(significant deviations of the data from the straight line in the plotcorrespond to significant departures from the normality assumption)
The ANOVA decomposition
Introduction
I ANOVA: ANalysis Of VAriance
I When fitting the simple linear regression model yi = β0 + β1xi to adata set (xi , yi ) for i = 1, . . . , n, we may identify three sources ofvariability in the responses
I variability associated to the model:
SSM =Pn
i=1(yi − y)2,
where the initials SS denote “sum of squares” and M refers to themodel
I variability of the residuals:
SSR =Pn
i=1(yi − yi )2 =
Pni=1 e2
i
I total variability: SST =Pn
i=1(yi − y)2
I The ANOVA decomposition states that: SST = SSM + SSR
The ANOVA decomposition
The coefficient of determination R2
I The ANOVA decomposition states that SST = SSM + SSR
I Note that yi − y = (yi − yi ) + (yi − y)
I SSM =∑n
i=1(yi − y)2 measures the variation in the responses dueto the regression model (explained by the predicted values yi )
I Thus, the ratio SSR/SST is the proportion of the variation in theresponses that is not explained by the regression model
I The ratio R2 = SSM/SST = 1− SSR/SST is the proportion of thevariation in the responses that is explained by the regression model.It is known as the coefficient of determination
I The value of the coefficient of determination satisfies R2 = r2xy (the
squared correlation coefficient)
I For example, if R2 = 0,85 the variable x explains 85 % of thevariation in the response variable y
The ANOVA decomposition
ANOVA table
Source of variability SS DF Mean F ratioModel SSM 1 SSM/1 SSM/s2
R
Residuals/errors SSR n − 2 SSR/(n − 2) = s2R
Total SST n − 1
The ANOVA decomposition
ANOVA hypothesis testing
I Hypothesis test, H0 : β1 = 0 vs. H1 : β1 6= 0
I Consider the ratio
F =SSM/1
SSR/(n − 2)=
SSM
s2R
I Under H0, F follows an F1,n−2 distribution
I Test at a significance level α: reject H0 if F > F1,n−2;α
I How does this result relate to the test based on the Student-t wesaw in Lesson 4? They are equivalent
The ANOVA decomposition
Statgraphics output5.2: Ej. ANOVA
Estimacion de la varianza
Regression Analysis - Linear model: Y = a + b*X
-----------------------------------------------------------------------------
Dependent variable: Precio en ptas.
Independent variable: Produccion en kg.
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
Intercept 74,1151 8,73577 8,4841 0,0000
Slope -1,35368 0,3002 -4,50924 0,0020
-----------------------------------------------------------------------------
Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-ValueSource Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 528,475 1 528,475 20,33 0,0020
Residual 207,925 8 25,9906
-----------------------------------------------------------------------------
Total (Corr.) 736,4 9
Correlation Coefficient = -0,84714
R-squared = 71,7647 percent
Standard Error of Est. = 5,0981!!"
Nonlinear relationships and linearizing transformations
Introduction
I Consider the case when the deterministic part f (xi ; a, b) of theresponse in the model
yi = f (xi ; a, b) + ui , i = 1, . . . , n
is a nonlinear function of x that depends on two parameters a and b(for example, f (x ; a, b) = abx)
I In some cases we may apply transformations to the data to linearizethem. We are then able to apply the linear regression procedure
I From the original data (xi , yi ) we obtain the transformed data(x ′i , y
′i )
I The parameters β0 and β1 corresponding to the linear relationbetween x ′i and y ′i are transformations of the parameters a and b
Nonlinear relationships and linearizing transformations
Linearizing transformations
I Examples of linearizing transformations:I If y = f (x ; a, b) = axb then log y = log a + b log x . We have
I y ′ = log y , x ′ = log x , β0 = log a, β1 = bI If y = f (x ; a, b) = abx then log y = log a + (log b)x . We have
I y ′ = log y , x ′ = x , β0 = log a, β1 = log bI If y = f (x ; a, b) = 1/(a + bx) then 1/y = a + bx . We have
I y ′ = 1/y , x ′ = x , β0 = a, β1 = b
I If y = f (x ; a, b) = log(axb) then y = log a + b(log x). We haveI y ′ = y , x ′ = log x , β0 = log a, β1 = b
A matrix treatment of linear regression
Introduction
I Remember the simple linear regression model,
yi = β0 + β1xi + ui , i = 1, . . . , n
I If we write one equation for each one of the observations, we have
y1 = β0 + β1x1 + u1
y2 = β0 + β1x2 + u2
......
yn = β0 + β1xn + un
A matrix treatment of linear regression
The model in matrix form
I We can write the preceding equations in matrix form asy1
y2
...yn
=
β0 + β1x1
β0 + β1x2
...β0 + β1xn
+
u1
u2
...un
I And splitting the parameters β from the variables xi ,
y1
y2
...yn
=
1 x1
1 x2
......
1 xn
(β0
β1
)+
u1
u2
...un
A matrix treatment of linear regression
The regression model in matrix form
I We can write the preceding matrix relationshipy1
y2
...yn
=
1 x1
1 x2
......
1 xn
(β0
β1
)+
u1
u2
...un
as
y = Xβ + u
I y: response vector; X: explanatory variables matrix (or experimentaldesign matrix); β: vector of parameters; u: error vector
The regression model in matrix form
Covariance matrix for the errors
I We denote as Cov(u) the n × n matrix of covariances for the errors.Its (i , j)-th element is given by cov(ui , uj) = 0 if i 6= j andcov(ui , ui ) = Var(ui ) = σ2
I Thus, Cov(u) is the identity matrix In×n multiplied by σ2:
Cov(u) =
σ2 0 · · · 00 σ2 · · · 0...
.... . .
...0 0 · · · σ2
= σ2I
The regression model in matrix form
Least-squares estimation
I The least-squares vector parameter estimate β is the unique solutionof the 2× 2 matrix equation (check the dimensions)
(XTX)β = XTy,
that is,β = (XTX)−1XTy.
I The vector y = (yi ) of response estimates is given by
y = Xβ
and the residual vector is defined as e = y − y
The multiple linear regression model
Introduction
I Use of the simple linear regression model: predict the value of aresponse y from the value of an explanatory variable x
I In many applications we wish to predict the response y from thevalues of several explanatory variables x1, . . . , xk
I For example:I forecast the value of a house as a function of its size, location, layout
and number of bathroomsI forecast the size of a parliament as a function of the population, its
rate of growth, the number of political parties with parliamentaryrepresentation, etc.
The multiple linear regression model
The model
I Use of the multiple linear regression model: predict a response yfrom several explanatory variables x1, . . . , xk
I If we have n observations, for i = 1, . . . , n,
yi = β0 + β1xi1 + β2xi2 + · · ·+ βkxik + ui
I We assume that the error variables ui are independent randomvariables following a N(0, σ2) distribution
The multiple linear regression model
The least-squares fit
I We have n observations, and for i = 1, . . . , n
yi = β0 + β1xi1 + β2xi2 + · · ·+ βkxik + ui
I We wish to fit to the data (xi1, xi2, . . . , xik , yi ) a hyperplane of theform
yi = β0 + β1xi1 + β2xi2 + · · ·+ βkxik
I The residual for observation i is defined as ei = yi − yi
I We obtain the parameter estimates βj as the values that minimizethe sum of the squares of the residuals
The multiple linear regression model
The model in matrix form
I We can write the model as a matrix relationship,
y1
y2
...yn
=
1 x11 x12 · · · x1k
1 x21 x22 · · · x2k
......
.... . .
...1 xn1 xn2 · · · xnk
β0
β1
β2
...βk
+
u1
u2
...un
and in compact form as
y = Xβ + u
I y: response vector; X: explanatory variables matrix (or experimentaldesign matrix); β: vector of parameters; u: error vector
The multiple linear regression model
Least-squares estimation
I The least-squares vector parameter estimate β is the unique solutionof the (k + 1)× (k + 1) matrix equation (check the dimensions)
(XTX)β = XTy,
and as in the k = 1 case (simple linear regression) we have
β = (XTX)−1XTy.
I The vector y = (yi ) of response estimates is given by
y = Xβ
and the residual vector is defined as e = y − y
The multiple linear regression model
Variance estimation
I For the multiple linear regression model, an estimator for the errorvariance σ2 is the residual (quasi-)variance,
s2R =
∑ni=1 e2
i
n − k − 1,
and this estimator is unbiasedI Note that for the simple linear regression case we had n − 2 in the
denominator
The multiple linear regression model
The sampling distribution of β
I Under the model assumptions, the least-squares estimator β for theparameter vector β follows a multivariate normal distribution
I E (β) = β (it is an unbiased estimator)
I The covariance matrix for β is Cov(β) = σ2(XTX)−1
I We estimate Cov(β) using s2R(XTX)−1
I The estimate of Cov(β) provides estimates s2(βj) for the variance
Var(βj). s2(βj) is the standard error of the estimator βj
I If we standardize βj we have
βj − βj
s(βj)∼ tn−k−1 (the Student-t distribution)
The multiple linear regression model
Inference on the parameters βj
I Confidence interval for βj at a confidence level 1− α
βj ± tn−k−1;α/2 s(βj)
I Hypothesis testing for H0 : βj = 0 vs. H1 : βj 6= 0 at a confidencelevel α
I Reject H0 if |T | > tn−k−1;α/2, where
T =βj
s(βj)
is the test statistic
The ANOVA decomposition
The multivariate case
I ANOVA: ANalysis Of VAriance
I When fitting the multiple linear regression modelyi = β0 + β1xi1 + · · ·+ βkxik to a data set (xi1, . . . , xik , yi ) for i = 1, . . . , n,we may identify three sources of variability in the responses
I variability associated to the model:
SSM =Pn
i=1(yi − y)2,
where the initials SS denote “sum of squares” and M refers to themodel
I variability of the residuals:
SSR =Pn
i=1(yi − yi )2 =
Pni=1 e2
i
I total variability: SST =Pn
i=1(yi − y)2
I The ANOVA decomposition states that: SST = SSM + SSR
The ANOVA decomposition
The coefficient of determination R2
I The ANOVA decomposition states that SST = SSM + SSR
I Note that yi − y = (yi − yi ) + (yi − y)
I SSM =∑n
i=1(yi − y)2 measures the variation in the responses dueto the regression model (explained by the predicted values yi )
I Thus, the ratio SSR/SST is the proportion of the variation in theresponses that is not explained by the regression model
I The ratio R2 = SSM/SST = 1− SSR/SST is the proportion of thevariation in the responses that is explained by the explanatoryvariables. It is known as the coefficient of multiple determination
I The value of this coefficient satisfies R2 = r2yy (the squared
correlation coefficient)
I For example, if R2 = 0,85 the variables x1, . . . , xk explain 85 % ofthe variation in the response variable y
The ANOVA decomposition
ANOVA table
Source of variability SS DF Mean F ratio
Model SSM k SSM/k (SSM/k)/s2R
Residuals/errors SSR n − k − 1 SSR/(n − k − 1) = s2R
Total SST n − 1
The ANOVA decomposition
ANOVA hypothesis testing
I Hypothesis test, H0 : β1 = β2 = · · · = βk = 0 vs. H1 : βj 6= 0 forsome j = 1, . . . , k
I H0: the response does not depend on any xj
I Consider the ratio
F =SSM/k
SSR/(n − k − 1)=
SSM
s2R
I Under H0, F follows an Fk,n−k−1 distribution
I Test at a significance level α: reject H0 if F > Fk,n−k−1;α