Statistics II. Lesson 5. Regression analysis (second part)

Year 2010/11


Contents

- Diagnosis: residual analysis
- The ANOVA (ANalysis Of VAriance) decomposition
- Nonlinear relationships and linearizing transformations
- A matrix treatment of the linear regression model
- Introduction to multiple linear regression


Bibliography references

- Newbold, P. "Estadística para los negocios y la economía" (1997). Chapters 12, 13 and 14
- Peña, D. "Regresión y Análisis de Experimentos" (2002). Chapters 5 and 6

Regression diagnostics

Diagnostics

- The theoretical assumptions for the simple linear regression model with one response variable $Y$ and one explanatory variable $x$ are:
  - Linearity: $y_i = \beta_0 + \beta_1 x_i + u_i$, for $i = 1, \ldots, n$
  - Homogeneity: $E[u_i] = 0$, for $i = 1, \ldots, n$
  - Homoscedasticity: $\mathrm{Var}[u_i] = \sigma^2$, for $i = 1, \ldots, n$
  - Independence: $u_i$ and $u_j$ are independent for $i \neq j$
  - Normality: $u_i \sim N(0, \sigma^2)$, for $i = 1, \ldots, n$
- We study how to apply diagnostic procedures to test whether these assumptions are appropriate for the available data $(x_i, y_i)$
  - These procedures are based on the analysis of the residuals $e_i = y_i - \hat{y}_i$

Diagnostics: scatterplots

Scatterplots

- The simplest diagnostic procedure is based on the visual examination of the scatterplot for $(x_i, y_i)$
- Often this simple but powerful method reveals patterns suggesting whether the theoretical model might be appropriate or not
- We illustrate its application on a classical example. Consider the following four datasets


The Anscombe datasets

TABLE 3-10 Four Data Sets

 DATASET 1      DATASET 2      DATASET 3      DATASET 4
   X      Y       X      Y       X      Y       X      Y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

SOURCE: F. J. Anscombe, op. cit.

mean of X = 9.0, mean of Y = 7.5, for all four data sets.

And yet the four situations, although numerically equivalent in major respects, are substantively very different. Figure 3-29 shows how very different the four data sets actually are.

Anscombe has emphasized the importance of visual displays in statistical analysis:

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:


The Anscombe example

- The estimated regression model for each of the four previous datasets is the same:
  - $\hat{y}_i = 3.0 + 0.5 x_i$
  - $n = 11$, $\bar{x} = 9.0$, $\bar{y} = 7.5$, $r_{xy} = 0.817$
- The estimated standard error of the estimator $\hat{\beta}_1$,

  $$\sqrt{\frac{s_R^2}{(n - 1) s_x^2}},$$

  takes the value 0.118 in all four cases. The corresponding T statistic takes the value $T = 0.5/0.118 = 4.237$
- But the corresponding scatterplots show that the four datasets are quite different. What conclusions could we reach from these diagrams?
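The claim that the four datasets share the same numerical summary can be checked directly. The following is a minimal sketch in plain Python, using the values of Table 3-10; the helper name `summarize` and the dictionary layout are illustrative choices, not part of the lesson.

```python
from math import sqrt

# Anscombe's four datasets, transcribed from Table 3-10.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the simple-regression summary for one dataset (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)      # equals (n - 1) * s_x^2
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                                # slope estimate
    b0 = y_bar - b1 * x_bar                       # intercept estimate
    r = sxy / sqrt(sxx * syy)                     # correlation coefficient r_xy
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2_r = ssr / (n - 2)                          # residual variance s_R^2
    se_b1 = sqrt(s2_r / sxx)                      # standard error of the slope
    return {"n": n, "x_bar": x_bar, "y_bar": y_bar, "b0": b0, "b1": b1,
            "r": r, "se_b1": se_b1, "T": b1 / se_b1}


summaries = {k: summarize(x, y) for k, (x, y) in datasets.items()}
for k, s in summaries.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
```

All four rows of output agree with the quoted values up to rounding, which is why the numerical summary alone cannot distinguish the datasets.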


Anscombe data scatterplots

[FIGURE 3-29: Scatterplots for the four data sets of Table 3-10, one panel each for Data Set 1, Data Set 2, Data Set 3 and Data Set 4. SOURCE: F. J. Anscombe, op. cit.]

(1) numerical calculations are exact, but graphs are rough;

(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;

(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Graphs can have various purposes, such as: (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought

Residual analysis

Further analysis of the residuals

- If inspecting the scatterplot is not sufficient to reject the model, a further step is to use diagnostic methods based on the analysis of the residuals $e_i = y_i - \hat{y}_i$
- This analysis starts by standardizing the residuals, that is, dividing them by the residual (quasi-)standard deviation $s_R$. The resulting quantities are known as standardized residuals:

  $$\frac{e_i}{s_R}$$

- Under the assumptions of the linear regression model, the standardized residuals are approximately independent standard normal random variables
- A plot of these standardized residuals should show no clear pattern
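As an illustration, the standardized residuals can be computed by hand for Anscombe's dataset 2 from Table 3-10. This is a minimal sketch in plain Python; the variable names are illustrative, and $s_R$ is taken with $n - 2$ in the denominator as in the simple regression model.

```python
from math import sqrt

# Anscombe dataset 2, transcribed from Table 3-10 earlier in the lesson.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope and the intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - yhat_i and residual standard deviation s_R.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_r = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residuals e_i / s_R, listed in increasing order of x.
standardized = sorted(zip(x, (e / s_r for e in residuals)))
for xi, zi in standardized:
    print(f"x = {xi:2d}   e/s_R = {zi:+.2f}")
```

For this dataset the standardized residuals are negative at both ends of the range of $x$ and positive in the middle. This is a clear pattern, so the linearity assumption is doubtful even though the fitted line and its summary statistics look unremarkable.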

Residual analysis

Residual plots

- Several types of residual plots can be constructed. The most common ones are:
  - Plot of standardized residuals vs. $x$
  - Plot of standardized residuals vs. $\hat{y}$ (the predicted responses)
- Deviations from the model hypotheses result in patterns on these plots, which should be visually recognizable

Residual plot examples

Consistency with the theoretical model

[Figure: Example 5.1, residual plot consistent with the theoretical model]

Nonlinearity

[Figure: Example 5.1, nonlinearity]

Heteroscedasticity

[Figure: Example 5.1, heteroscedasticity]

Residual analysis

Outliers

- In a plot of the regression line we may observe outliers, that is, data that show significant deviations from the regression line (or from the remaining data)
- The parameter estimators for the regression model, $\hat{\beta}_0$ and $\hat{\beta}_1$, are very sensitive to these outliers
- It is important to identify the outliers and make sure that they really are valid data
- Statgraphics is able to show, for example, the data that generate "Unusual residuals" or "Influential points"

Residual analysis

Normality of the errors

- One of the theoretical assumptions of the linear regression model is that the errors follow a normal distribution
- This assumption can be checked visually from the analysis of the residuals $e_i$, using different approaches:
  - By inspection of the frequency histogram for the residuals
  - By inspection of the "Normal Probability Plot" of the residuals (significant deviations of the data from the straight line in the plot correspond to significant departures from the normality assumption)

The ANOVA decomposition

Introduction

- ANOVA: ANalysis Of VAriance
- When fitting the simple linear regression model $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ to a data set $(x_i, y_i)$ for $i = 1, \ldots, n$, we may identify three sources of variability in the responses
  - variability associated with the model:

    $$SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2,$$

    where the initials SS denote "sum of squares" and M refers to the model
  - variability of the residuals:

    $$SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2$$

  - total variability:

    $$SST = \sum_{i=1}^n (y_i - \bar{y})^2$$

- The ANOVA decomposition states that $SST = SSM + SSR$
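A short justification of the decomposition may help. The sketch below assumes the two least-squares normal equations for the simple model, $\sum_i e_i = 0$ and $\sum_i e_i x_i = 0$, and the identity $\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x})$, which follows from $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$:

```latex
\begin{align*}
SST &= \sum_{i=1}^n (y_i - \bar{y})^2
     = \sum_{i=1}^n \bigl[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\bigr]^2 \\
    &= \underbrace{\sum_{i=1}^n e_i^2}_{SSR}
     + \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}_{SSM}
     + 2 \sum_{i=1}^n e_i (\hat{y}_i - \bar{y}), \\
\sum_{i=1}^n e_i (\hat{y}_i - \bar{y})
    &= \hat{\beta}_1 \Bigl( \sum_{i=1}^n e_i x_i - \bar{x} \sum_{i=1}^n e_i \Bigr) = 0 .
\end{align*}
```

The cross term vanishes, which leaves $SST = SSM + SSR$.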


The coefficient of determination R2

- The ANOVA decomposition states that $SST = SSM + SSR$
- Note that $y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$
- $SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ measures the variation in the responses due to the regression model (explained by the predicted values $\hat{y}_i$)
- Thus, the ratio $SSR/SST$ is the proportion of the variation in the responses that is not explained by the regression model
- The ratio $R^2 = SSM/SST = 1 - SSR/SST$ is the proportion of the variation in the responses that is explained by the regression model. It is known as the coefficient of determination
- The value of the coefficient of determination satisfies $R^2 = r_{xy}^2$ (the squared correlation coefficient)
- For example, if $R^2 = 0.85$ the variable $x$ explains 85 % of the variation in the response variable $y$
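The three sums of squares and the coefficient of determination can be computed numerically. The following is a minimal sketch in plain Python, using Anscombe's dataset 1 from Table 3-10 as example data; the variable names are illustrative.

```python
from math import sqrt

# Anscombe dataset 1, transcribed from Table 3-10 earlier in the lesson.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit and fitted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares of the ANOVA decomposition.
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variability
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained by the model
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual variability

r_squared = ssm / sst
r_xy = sxy / sqrt(sxx * sst)                             # correlation coefficient

print(f"SST = {sst:.4f}, SSM = {ssm:.4f}, SSR = {ssr:.4f}")
print(f"R^2 = {r_squared:.4f}, r_xy^2 = {r_xy ** 2:.4f}")
```

Both printed values on the last line coincide, as the identity $R^2 = r_{xy}^2$ requires, and the first line satisfies $SST = SSM + SSR$ up to rounding.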


ANOVA table

Source of variability   SS    DF      Mean square           F ratio
Model                   SSM   1       SSM/1                 SSM/s_R^2
Residuals/errors        SSR   n - 2   SSR/(n - 2) = s_R^2
Total                   SST   n - 1


ANOVA hypothesis testing

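- Hypothesis test, $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
- Consider the ratio

  $$F = \frac{SSM/1}{SSR/(n - 2)} = \frac{SSM}{s_R^2}$$

- Under $H_0$, $F$ follows an $F_{1,n-2}$ distribution
- Test at a significance level $\alpha$: reject $H_0$ if $F > F_{1,n-2;\alpha}$
- How does this result relate to the test based on the Student-t we saw in Lesson 4? They are equivalent: $F = T^2$, where $T = \hat{\beta}_1 / s(\hat{\beta}_1)$, and $F_{1,n-2;\alpha} = t_{n-2;\alpha/2}^2$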


Statgraphics output

Example 5.2: ANOVA

Variance estimation

Regression Analysis - Linear model: Y = a + b*X
-----------------------------------------------------------------------------
Dependent variable: Price in ptas.
Independent variable: Production in kg.
-----------------------------------------------------------------------------
                               Standard          T
Parameter       Estimate         Error       Statistic        P-Value
-----------------------------------------------------------------------------
Intercept        74,1151        8,73577        8,4841          0,0000
Slope           -1,35368        0,3002        -4,50924         0,0020
-----------------------------------------------------------------------------

                           Analysis of Variance
-----------------------------------------------------------------------------
Source             Sum of Squares     Df  Mean Square    F-Ratio      P-Value
-----------------------------------------------------------------------------
Model                     528,475      1      528,475      20,33       0,0020
Residual                  207,925      8      25,9906
-----------------------------------------------------------------------------
Total (Corr.)               736,4      9

Correlation Coefficient = -0,84714
R-squared = 71,7647 percent
Standard Error of Est. = 5,0981
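The derived quantities in this output can be reproduced from the reported sums of squares and the slope estimate. The sketch below is plain Python; it only re-uses numbers printed by Statgraphics (with decimal commas read as decimal points), and the variable names are illustrative.

```python
from math import copysign, sqrt

# Values read from the Statgraphics output (decimal commas written as points).
ssm, ssr, sst = 528.475, 207.925, 736.4
n = 10                          # Total (Corr.) has n - 1 = 9 degrees of freedom
slope, se_slope = -1.35368, 0.3002

s2_r = ssr / (n - 2)            # residual mean square, the estimate of sigma^2
f_ratio = (ssm / 1) / s2_r      # F ratio of the ANOVA table
t_stat = slope / se_slope       # T statistic for H0: beta_1 = 0
r_squared = ssm / sst           # coefficient of determination
r = copysign(sqrt(r_squared), slope)   # correlation takes the sign of the slope
s_r = sqrt(s2_r)                # standard error of the estimate

print(f"s_R^2 = {s2_r:.4f}, F = {f_ratio:.2f}, T^2 = {t_stat ** 2:.2f}")
print(f"R^2 = {100 * r_squared:.4f} %, r = {r:.5f}, s_R = {s_r:.4f}")
```

The F ratio and the squared T statistic agree (about 20.33), which illustrates the equivalence of the ANOVA test and the Student-t test for the slope; both share the P-value 0,0020 reported above.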

Nonlinear relationships and linearizing transformations

Introduction

- Consider the case when the deterministic part $f(x_i; a, b)$ of the response in the model

  $$y_i = f(x_i; a, b) + u_i, \quad i = 1, \ldots, n$$

  is a nonlinear function of $x$ that depends on two parameters $a$ and $b$ (for example, $f(x; a, b) = a b^x$)
- In some cases we may apply transformations to the data to linearize them. We are then able to apply the linear regression procedure
- From the original data $(x_i, y_i)$ we obtain the transformed data $(x'_i, y'_i)$
- The parameters $\beta_0$ and $\beta_1$ corresponding to the linear relation between $x'_i$ and $y'_i$ are transformations of the parameters $a$ and $b$

Nonlinear relationships and linearizing transformations

Linearizing transformations

- Examples of linearizing transformations:
  - If $y = f(x; a, b) = a x^b$ then $\log y = \log a + b \log x$. We have
    - $y' = \log y$, $x' = \log x$, $\beta_0 = \log a$, $\beta_1 = b$
  - If $y = f(x; a, b) = a b^x$ then $\log y = \log a + (\log b) x$. We have
    - $y' = \log y$, $x' = x$, $\beta_0 = \log a$, $\beta_1 = \log b$
  - If $y = f(x; a, b) = 1/(a + bx)$ then $1/y = a + bx$. We have
    - $y' = 1/y$, $x' = x$, $\beta_0 = a$, $\beta_1 = b$
  - If $y = f(x; a, b) = \log(a x^b)$ then $y = \log a + b (\log x)$. We have
    - $y' = y$, $x' = \log x$, $\beta_0 = \log a$, $\beta_1 = b$
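The first transformation in the list can be sketched in code. The example below is plain Python with hypothetical, noise-free data generated from $y = a x^b$ with $a = 2$ and $b = 1.5$ (these values and the helper `fit_line` are assumptions made for illustration). It fits a line to $(\log x, \log y)$ and then undoes the transformation.

```python
from math import exp, log

# Hypothetical noise-free data generated from y = a * x**b with a = 2, b = 1.5.
a_true, b_true = 2.0, 1.5
x = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0]
y = [a_true * xi ** b_true for xi in x]

# Transformed data: x' = log x, y' = log y.
xt = [log(xi) for xi in x]
yt = [log(yi) for yi in y]


def fit_line(u, v):
    """Least-squares intercept and slope of v on u (simple linear regression)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    slope = (sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
             / sum((ui - u_bar) ** 2 for ui in u))
    return v_bar - slope * u_bar, slope


beta0, beta1 = fit_line(xt, yt)

# Undo the transformation: beta0 = log a and beta1 = b.
a_hat = exp(beta0)
b_hat = beta1
print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}")
```

With noisy data the recovered parameters would only approximate $a$ and $b$, since least squares is then applied to the errors on the transformed scale rather than on the original one.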

A matrix treatment of linear regression

Introduction

- Recall the simple linear regression model,

  $$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, \ldots, n$$

- If we write one equation for each of the observations, we have

  $$\begin{aligned}
  y_1 &= \beta_0 + \beta_1 x_1 + u_1 \\
  y_2 &= \beta_0 + \beta_1 x_2 + u_2 \\
      &\;\;\vdots \\
  y_n &= \beta_0 + \beta_1 x_n + u_n
  \end{aligned}$$


The model in matrix form

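- We can write the preceding equations in matrix form as

  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}
  + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$

- And separating the parameters $\beta$ from the variables $x_i$,

  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
  \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
  + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$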


The regression model in matrix form

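- We can write the preceding matrix relationship

  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
  \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
  + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$

  as

  $$\mathbf{y} = X \boldsymbol{\beta} + \mathbf{u}$$

- $\mathbf{y}$: response vector; $X$: explanatory variables matrix (or experimental design matrix); $\boldsymbol{\beta}$: vector of parameters; $\mathbf{u}$: error vector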


Covariance matrix for the errors

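- We denote as $\mathrm{Cov}(\mathbf{u})$ the $n \times n$ matrix of covariances for the errors. Its $(i, j)$-th element is given by $\mathrm{cov}(u_i, u_j) = 0$ if $i \neq j$ and $\mathrm{cov}(u_i, u_i) = \mathrm{Var}(u_i) = \sigma^2$
- Thus, $\mathrm{Cov}(\mathbf{u})$ is the identity matrix $I_{n \times n}$ multiplied by $\sigma^2$:

  $$\mathrm{Cov}(\mathbf{u}) =
  \begin{pmatrix}
  \sigma^2 & 0 & \cdots & 0 \\
  0 & \sigma^2 & \cdots & 0 \\
  \vdots & \vdots & \ddots & \vdots \\
  0 & 0 & \cdots & \sigma^2
  \end{pmatrix} = \sigma^2 I$$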


Least-squares estimation

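- The least-squares vector parameter estimate $\hat{\boldsymbol{\beta}}$ is the unique solution of the $2 \times 2$ matrix equation (check the dimensions)

  $$(X^T X) \hat{\boldsymbol{\beta}} = X^T \mathbf{y},$$

  that is,

  $$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}.$$

- The vector $\hat{\mathbf{y}} = (\hat{y}_i)$ of response estimates is given by

  $$\hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}}$$

  and the residual vector is defined as $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$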
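For the simple model the matrix formula can be evaluated by hand, since $X^T X$ is only $2 \times 2$. The sketch below is plain Python, again using Anscombe's dataset 1 from Table 3-10 as example data; the explicit $2 \times 2$ inverse is an illustrative shortcut, not a general-purpose solver.

```python
# Anscombe dataset 1, transcribed from Table 3-10 earlier in the lesson.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(x)

# Entries of X^T X (2 x 2) and X^T y (2 x 1) for the design matrix with rows (1, x_i).
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
xtx = [[n, sum_x], [sum_x, sum_xx]]
xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]

# Explicit inverse of a 2 x 2 matrix: adjugate divided by the determinant.
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]

# beta_hat = (X^T X)^{-1} X^T y
beta_hat = [inv[0][0] * xty[0] + inv[0][1] * xty[1],
            inv[1][0] * xty[0] + inv[1][1] * xty[1]]

# Fitted values y_hat = X beta_hat and residuals e = y - y_hat.
y_hat = [beta_hat[0] + beta_hat[1] * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, y_hat)]

print(f"beta0_hat = {beta_hat[0]:.4f}, beta1_hat = {beta_hat[1]:.4f}")
```

The residual vector satisfies $X^T \mathbf{e} = \mathbf{0}$, that is, $\sum e_i = 0$ and $\sum x_i e_i = 0$, which is a convenient numerical check of the solution.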

The multiple linear regression model

Introduction

- Use of the simple linear regression model: predict the value of a response $y$ from the value of an explanatory variable $x$
- In many applications we wish to predict the response $y$ from the values of several explanatory variables $x_1, \ldots, x_k$
- For example:
  - forecast the value of a house as a function of its size, location, layout and number of bathrooms
  - forecast the size of a parliament as a function of the population, its rate of growth, the number of political parties with parliamentary representation, etc.


The model

- Use of the multiple linear regression model: predict a response $y$ from several explanatory variables $x_1, \ldots, x_k$
- If we have $n$ observations, for $i = 1, \ldots, n$,

  $$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i$$

- We assume that the error variables $u_i$ are independent random variables following a $N(0, \sigma^2)$ distribution


The least-squares fit

- We have $n$ observations, and for $i = 1, \ldots, n$

  $$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i$$

- We wish to fit to the data $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$ a hyperplane of the form

  $$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}$$

- The residual for observation $i$ is defined as $e_i = y_i - \hat{y}_i$
- We obtain the parameter estimates $\hat{\beta}_j$ as the values that minimize the sum of the squares of the residuals


The model in matrix form

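- We can write the model as a matrix relationship,

  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix}
  1 & x_{11} & x_{12} & \cdots & x_{1k} \\
  1 & x_{21} & x_{22} & \cdots & x_{2k} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  1 & x_{n1} & x_{n2} & \cdots & x_{nk}
  \end{pmatrix}
  \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
  + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$

  and in compact form as

  $$\mathbf{y} = X \boldsymbol{\beta} + \mathbf{u}$$

- $\mathbf{y}$: response vector; $X$: explanatory variables matrix (or experimental design matrix); $\boldsymbol{\beta}$: vector of parameters; $\mathbf{u}$: error vector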


Least-squares estimation

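- The least-squares vector parameter estimate $\hat{\boldsymbol{\beta}}$ is the unique solution of the $(k + 1) \times (k + 1)$ matrix equation (check the dimensions)

  $$(X^T X) \hat{\boldsymbol{\beta}} = X^T \mathbf{y},$$

  and as in the $k = 1$ case (simple linear regression) we have

  $$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}.$$

- The vector $\hat{\mathbf{y}} = (\hat{y}_i)$ of response estimates is given by

  $$\hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}}$$

  and the residual vector is defined as $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$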


Variance estimation

- For the multiple linear regression model, an estimator for the error variance $\sigma^2$ is the residual (quasi-)variance,

  $$s_R^2 = \frac{\sum_{i=1}^n e_i^2}{n - k - 1},$$

  and this estimator is unbiased
- Note that for the simple linear regression case we had $n - 2$ in the denominator
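The least-squares fit and the variance estimate can be sketched for $k = 2$ explanatory variables. The example below is plain Python with a small hypothetical dataset (six observations made up for illustration); `solve` is an illustrative Gaussian-elimination helper for the normal equations, not a routine from any library.

```python
# Hypothetical data: n = 6 observations of two explanatory variables and a response.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [0.5, 2.5, 3.0, 8.0, 8.5, 10.5]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]          # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * size
    for r in range(size - 1, -1, -1):                     # back substitution
        z[r] = (m[r][size] - sum(m[r][c] * z[c] for c in range(r + 1, size))) / m[r][r]
    return z


# Design matrix with a leading column of ones; p = k + 1 columns.
X = [[1.0, a, b] for a, b in zip(x1, x2)]
n, p = len(X), len(X[0])
k = p - 1

# Normal equations (X^T X) beta_hat = X^T y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta_hat = solve(xtx, xty)

# Fitted values, residuals and the residual (quasi-)variance s_R^2.
y_hat = [sum(bj * v for bj, v in zip(beta_hat, row)) for row in X]
e = [yi - yh for yi, yh in zip(y, y_hat)]
s2_r = sum(ei ** 2 for ei in e) / (n - k - 1)

print("beta_hat =", [round(b, 4) for b in beta_hat])
print("s_R^2 =", round(s2_r, 4))
```

Solving the normal equations directly avoids forming $(X^T X)^{-1}$, which is only needed later for the covariance matrix of the estimator.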


The sampling distribution of β

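- Under the model assumptions, the least-squares estimator $\hat{\boldsymbol{\beta}}$ for the parameter vector $\boldsymbol{\beta}$ follows a multivariate normal distribution
- $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ (it is an unbiased estimator)
- The covariance matrix for $\hat{\boldsymbol{\beta}}$ is $\mathrm{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}$
- We estimate $\mathrm{Cov}(\hat{\boldsymbol{\beta}})$ using $s_R^2 (X^T X)^{-1}$
- The diagonal of the estimate of $\mathrm{Cov}(\hat{\boldsymbol{\beta}})$ provides estimates $s^2(\hat{\beta}_j)$ for the variances $\mathrm{Var}(\hat{\beta}_j)$. Its square root, $s(\hat{\beta}_j)$, is the standard error of the estimator $\hat{\beta}_j$
- If we standardize $\hat{\beta}_j$ we have

  $$\frac{\hat{\beta}_j - \beta_j}{s(\hat{\beta}_j)} \sim t_{n-k-1} \quad \text{(the Student-t distribution)}$$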


Inference on the parameters βj

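- Confidence interval for $\beta_j$ at a confidence level $1 - \alpha$:

  $$\hat{\beta}_j \pm t_{n-k-1;\alpha/2} \, s(\hat{\beta}_j)$$

- Hypothesis testing for $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$ at a significance level $\alpha$
  - Reject $H_0$ if $|T| > t_{n-k-1;\alpha/2}$, where

    $$T = \frac{\hat{\beta}_j}{s(\hat{\beta}_j)}$$

    is the test statistic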
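The standard errors, T statistics and confidence intervals can be computed once $(X^T X)^{-1}$ is available. The sketch below is plain Python with a small hypothetical dataset ($n = 6$, $k = 2$); `invert` is an illustrative Gauss-Jordan helper, and the critical value $t_{3;0.025} = 3.182$ is taken from a Student-t table rather than computed.

```python
from math import sqrt

# Hypothetical data: n = 6 observations, k = 2 explanatory variables.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [0.5, 2.5, 3.0, 8.0, 8.5, 10.5]
T_CRIT = 3.182   # t_{n-k-1; alpha/2} for n - k - 1 = 3 and alpha = 0.05 (from a table)


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    size = len(a)
    m = [row[:] + [float(i == j) for j in range(size)] for i, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p_val = m[col][col]
        m[col] = [v / p_val for v in m[col]]
        for r in range(size):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[size:] for row in m]


X = [[1.0, a, b] for a, b in zip(x1, x2)]
n, p = len(X), len(X[0])                     # p = k + 1
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
xtx_inv = invert(xtx)

# beta_hat = (X^T X)^{-1} X^T y, residuals, and s_R^2 with n - k - 1 = n - p.
beta_hat = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
residuals = [yi - sum(b * v for b, v in zip(beta_hat, row)) for row, yi in zip(X, y)]
s2_r = sum(e ** 2 for e in residuals) / (n - p)

# Standard errors from the diagonal of s_R^2 (X^T X)^{-1}, then T and intervals.
se = [sqrt(s2_r * xtx_inv[j][j]) for j in range(p)]
t_stats = [b / s for b, s in zip(beta_hat, se)]
intervals = [(b - T_CRIT * s, b + T_CRIT * s) for b, s in zip(beta_hat, se)]
reject = [abs(t) > T_CRIT for t in t_stats]

for j in range(p):
    print(f"beta_{j}: estimate {beta_hat[j]:+.3f}, s.e. {se[j]:.3f}, "
          f"T = {t_stats[j]:+.3f}, reject H0: {reject[j]}")
```

In this made-up example only the coefficient of the first explanatory variable is significantly different from zero at the 5 % level; equivalently, its confidence interval is the only one that does not contain zero.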


The multivariate case

- ANOVA: ANalysis Of VAriance
- When fitting the multiple linear regression model $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}$ to a data set $(x_{i1}, \ldots, x_{ik}, y_i)$ for $i = 1, \ldots, n$, we may identify three sources of variability in the responses
  - variability associated with the model:

    $$SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2,$$

    where the initials SS denote "sum of squares" and M refers to the model
  - variability of the residuals:

    $$SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2$$

  - total variability:

    $$SST = \sum_{i=1}^n (y_i - \bar{y})^2$$

- The ANOVA decomposition states that $SST = SSM + SSR$


The coefficient of determination R2

- The ANOVA decomposition states that $SST = SSM + SSR$
- Note that $y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$
- $SSM = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ measures the variation in the responses due to the regression model (explained by the predicted values $\hat{y}_i$)
- Thus, the ratio $SSR/SST$ is the proportion of the variation in the responses that is not explained by the regression model
- The ratio $R^2 = SSM/SST = 1 - SSR/SST$ is the proportion of the variation in the responses that is explained by the explanatory variables. It is known as the coefficient of multiple determination
- The value of this coefficient satisfies $R^2 = r_{y\hat{y}}^2$ (the squared correlation coefficient between the responses $y_i$ and the predicted values $\hat{y}_i$)
- For example, if $R^2 = 0.85$ the variables $x_1, \ldots, x_k$ explain 85 % of the variation in the response variable $y$


ANOVA table

Source of variability   SS    DF          Mean square               F ratio
Model                   SSM   k           SSM/k                     (SSM/k)/s_R^2
Residuals/errors        SSR   n - k - 1   SSR/(n - k - 1) = s_R^2
Total                   SST   n - 1


ANOVA hypothesis testing

- Hypothesis test, $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for some $j = 1, \ldots, k$
- $H_0$: the response does not depend on any $x_j$
- Consider the ratio

  $$F = \frac{SSM/k}{SSR/(n - k - 1)} = \frac{SSM/k}{s_R^2}$$

- Under $H_0$, $F$ follows an $F_{k,n-k-1}$ distribution
- Test at a significance level $\alpha$: reject $H_0$ if $F > F_{k,n-k-1;\alpha}$
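The F ratio can also be written in terms of the coefficient of determination, which is sometimes convenient when only $R^2$ is reported. The identity below follows from $SSM = R^2 \, SST$ and $SSR = (1 - R^2) \, SST$; the numerical check uses the Statgraphics example of the simple regression case ($k = 1$, $n = 10$, $R^2 = 0.717647$):

```latex
\begin{align*}
F &= \frac{SSM/k}{SSR/(n - k - 1)}
   = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}, \\
F &= \frac{0.717647 / 1}{(1 - 0.717647)/8} \approx 20.33
   \qquad (k = 1,\ n = 10).
\end{align*}
```

The value agrees with the F-Ratio printed in the ANOVA table of that example.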
