
    Faculty of Applied Sciences, Department of Mathematics and Physics

    Statistical Methods 2B Lecture Notes

    Lecturer: Mr. T. Farrar

    Contents

    1 Review of Random Variables and Probability Distributions

    2 Correlation Analysis of Paired Data Sets

    3 Simple Linear Regression Analysis

    4 Multiple Linear Regression

    5 Logistic Regression

    6 Poisson Regression

    1 Review of Random Variables and Probability Distributions

    What you will be expected to already know

    1. Descriptive Statistics

    2. Basic Probability concepts

    3. Graphical methods of displaying data (line graph, scatter plot, histogram)

    4. Random Variables and Probability Distributions (Discrete and continuous)

    5. Special probability distributions (binomial, Poisson, normal)

    6. Hypothesis Testing (t-tests, F tests, χ² tests, nonparametric tests, p-values)

    7. Basic calculus

    8. Matrices


    Discrete Random Variables

    Definition: A random variable is a variable which takes on its values by chance.

    Definition: The sample space S (a.k.a. support) is the set of possible values that a random variable may take.

    A random variable is discrete if it can take only a finite or countably infinite number of distinct values. Usually a discrete random variable only takes on integer values.

    E.g. Number of defective television sets in a shipment of 100 sets: S = {0, 1, 2, . . . , 100}

    E.g. Number of visits to a website in one year: S = {0, 1, 2, 3, . . .}

    We use an uppercase letter such as Y to denote a random variable, and a lowercase letter such as y to denote a particular value that the random variable may assume.

    Discrete Probability Distributions

    We may denote the probability that Y takes on the value y by Pr(Y = y). This probability is subject to the following restrictions:

    1. 0 ≤ Pr(Y = y) ≤ 1 for all y (all probabilities must be between 0 and 1)

    2. ∑_{y∈S} Pr(Y = y) = 1 (sum of probabilities over the whole sample space must be 1)

    E.g. Rolling a six-sided die: let Y be the number that comes up. Pr(Y = y) = 1/6, y = 1, 2, 3, 4, 5, 6

    It is easy to see that both restrictions hold.

    The probability distribution of the lengths of patent lives for new drugs is given below. The patent life refers to the number of years a company has to make a profit from the drug after it is approved before competitors may produce the same drug.

    Years, y 3 4 5 6 7 8 9 10 11 12 13

    Pr (Y =y) .03 .05 .07 .10 .14 .20 .18 .12 .07 .03 .01

    The function that maps all values in the sample space to their probabilities is called a probability mass function.

    It may be expressed in a table (as above) or as a mathematical formula
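    For example, the patent-life table above can be stored and checked against the two restrictions with a short Python sketch (Python is not part of these notes; this is just an illustration):

        # Patent-life probability mass function from the table above
        patent_pmf = {3: 0.03, 4: 0.05, 5: 0.07, 6: 0.10, 7: 0.14, 8: 0.20,
                      9: 0.18, 10: 0.12, 11: 0.07, 12: 0.03, 13: 0.01}

        # Restriction 1: every probability lies between 0 and 1
        assert all(0 <= p <= 1 for p in patent_pmf.values())

        # Restriction 2: the probabilities sum to 1 over the whole sample space
        assert abs(sum(patent_pmf.values()) - 1) < 1e-9

        # Example query: probability that the patent life is at least 10 years
        print(sum(p for y, p in patent_pmf.items() if y >= 10))  # 0.23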


    We can use a graph to represent the probability mass function:

    Suppose the law dictates that the sentence (in years) for a particular crime must be between 5 and 10 years in prison. By looking at past cases a lawyer is able to construct the following probability distribution for the number of years to which a person convicted of the crime is sentenced:

    f(y) = 0.4471/√y , y = 5, 6, 7, 8, 9, 10

    Hence the probability that a person convicted of this crime receives a 6 year sentence is

    f_Y(6) = 0.4471/√6 = 0.1825

    As an exercise, graph this probability mass function and verify that it satisfies the two restrictions on probability mass functions.

    Expected Value of a Discrete Random Variable

    We can define the expected value of a random variable as follows:

    E(Y) = ∑_{y∈S} y f(y)

    If f(y) accurately characterises the population described by the random variable Y, then E(Y) = μ, the population mean.


    In our prison sentencing example:

    E(Y) = ∑_{y=5}^{10} y (0.4471/√y)

    = ∑_{y=5}^{10} 0.4471 √y

    = 0.4471 (√5 + √6 + √7 + √8 + √9 + √10)

    = 7.298

    Thus, we would expect the average sentence to be 7.3 years.

    It can also be shown that for any real-valued function g(Y), the expected value of g(Y) is given by:

    E(g(Y)) = ∑_{y∈S} g(y) f(y)

    Variance of a Discrete Random Variable

    We can define the variance of a random variable as follows:

    σ² = Var(Y) = E[(Y − μ)²]

    = E(Y²) − μ² (why?)

    = ∑_{y∈S} y² f(y) − E(Y)²

    In our prison sentencing example:

    Var(Y) = ∑_{y=5}^{10} y² (0.4471/√y) − E(Y)²

    = ∑_{y=5}^{10} 0.4471 y^(3/2) − 7.298²

    = 0.4471 (5^(3/2) + 6^(3/2) + 7^(3/2) + 8^(3/2) + 9^(3/2) + 10^(3/2)) − 7.298²

    = 56.177 − 53.261 = 2.916


    Thus, the variance of Y is 2.916 and the standard deviation is √2.916 = 1.71.
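    The same sentencing calculations can be reproduced with a short Python sketch (not part of the notes):

        from math import sqrt

        support = range(5, 11)
        f = {y: 0.4471 / sqrt(y) for y in support}   # f(y) = 0.4471/sqrt(y)

        mean = sum(y * f[y] for y in support)                    # approx 7.298
        variance = sum(y**2 * f[y] for y in support) - mean**2   # approx 2.916
        print(mean, variance, sqrt(variance))                    # approx 7.298, 2.916, 1.71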

    Properties of Expected Value

    Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then E(aY) = a E(Y).

    Proof:

    E(aY) = ∑_{y∈S} a y f(y)

    = a ∑_{y∈S} y f(y)

    = a E(Y)

    As an exercise, prove that if b is a constant, then E(b) = b. As a further exercise, if Y1 and Y2 are two random variables, prove that E(Y1 + Y2) = E(Y1) + E(Y2).

    Properties of Variance

    Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then Var(aY) = a² Var(Y).

    Proof:

    Var(aY) = E(a²Y²) − E(aY)²

    = a² E(Y²) − a² E(Y)²

    = a² [E(Y²) − E(Y)²]

    = a² Var(Y)

    As an exercise, prove that if b is a constant, then Var(b) = 0.

    Special Discrete Probability Distributions

    Binomial Distribution

    The binomial distribution relates to a binomial experiment which has the following five properties:

    1. The experiment consists of a fixed number of trials, n


    2. Each trial results in one of two outcomes, called success and failure (denoted 1 and 0)

    3. The probability of success in each trial is equal to p and the probability of failure is 1 − p (sometimes called q)

    4. All the trials are independent of one another

    5. The random variable of interest is Y, the total number of successes observed in the n trials

    The probability mass function for the binomial distribution is as follows:

    f(y) = C(n, y) p^y (1 − p)^(n−y) , y = 0, 1, 2, . . . , n and 0 ≤ p ≤ 1

    We can derive this function using the multiplicative probability rule for independent events and the concept of combinations.

    We have y successes and n − y failures, and there are n!/(y!(n − y)!) = C(n, y) ways to arrange them in order.

    Here is a graph of the binomial probability mass function where n = 15 and p = 0.4:

    As an exercise, draw the binomial probability mass function where n = 9 and p = 0.8.

    Mean and Variance of Binomial Distribution

    The mean of a binomially distributed random variable is E(Y) = np. The variance of a binomially distributed random variable is Var(Y) = np(1 − p).


    Binomial Example

    There is an English saying, "Don't count your chickens before they hatch."

    A farmer is breeding chickens. He has 15 hens that each lay one egg per day. The eggs are then placed in incubators. He has observed that there is an 80% hatchability rate, that is, an 80% probability that an egg will hatch into a live chick.

    1. How many live chicks should the farmer expect per day?

    E(Y) = np = 15 × 0.8 = 12

    2. What is the probability that at least 13 eggs from a given day will hatch?

    Pr(Y ≥ 13) = Pr(Y = 13) + Pr(Y = 14) + Pr(Y = 15)

    = C(15, 13) (0.8)^13 (1 − 0.8)^(15−13) + C(15, 14) (0.8)^14 (1 − 0.8)^(15−14) + C(15, 15) (0.8)^15 (1 − 0.8)^(15−15)

    = 0.2309 + 0.1319 + 0.0352 = 0.398
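    As a quick check (not part of the notes), the same binomial calculation can be done in Python:

        from math import comb

        n, p = 15, 0.8
        print(n * p)  # expected number of live chicks per day: 12.0

        prob_at_least_13 = sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(13, 16))
        print(round(prob_at_least_13, 3))  # 0.398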

    Negative Binomial Probability Distribution

    While a binomial random variable measures the number of successes in n trials of a binomial experiment where n is fixed, a negative binomial random variable measures the number of trials y required for k successes to occur.

    We could think of this as the event A ∩ B where A is the event that the first y − 1 trials contain k − 1 successes and B is the event that the yth trial results in a success.

    f(y) = Pr(A ∩ B) = Pr(A) Pr(B) (since A and B are independent)

    Pr(A) = C(y − 1, k − 1) p^(k−1) q^(y−k) , y ≥ k (by the binomial distribution)

    Pr(B) = p

    Thus f(y) = C(y − 1, k − 1) p^k q^(y−k) , y = k, k + 1, k + 2, . . .


    Negative Binomial Distribution

    Here is a graph of the negative binomial probability mass function where k = 3 and p = 0.6 (going as far as y = 17):

    As an exercise, draw the negative binomial probability mass function where k = 2 and p = 0.5, up to y = 10.

    Mean and Variance of Negative Binomial Distribution

    The mean of a negative binomial random variable is E(Y) = k/p

    The variance of a negative binomial random variable is Var(Y) = k(1 − p)/p²

    Negative Binomial Distribution Example

    Each time a fisherman casts his line into the water there is a probability of 1/8 that he will catch a fish.

    Today he has decided that he will continue casting his line until he catches 5 fish.

    1. What is the expected number of casts required to catch 5 fish?

    E(Y) = k/p = 5/0.125 = 40

    2. What is the standard deviation of the number of casts required to catch 5 fish?

    Var(Y) = 5(1 − 0.125)/0.125² = 280

    σ = √Var(Y) = √280 = 16.73


    4. What is the probability that he will need exactly 50 casts?

    Pr(Y = 50) = C(50 − 1, 5 − 1) (0.125)^5 (1 − 0.125)^(50−5)

    = 0.0159

    5. What is the probability that he will need more than 8 casts?

    Pr(Y > 8) = 1 − ∑_{y=5}^{8} C(y − 1, 5 − 1) (0.125)^5 (1 − 0.125)^(y−5)

    = 1 − [C(4, 4)(0.125)^5(1 − 0.125)^0 + C(5, 4)(0.125)^5(1 − 0.125)^1 + C(6, 4)(0.125)^5(1 − 0.125)^2 + C(7, 4)(0.125)^5(1 − 0.125)^3]

    = 1 − (0.0000 + 0.0001 + 0.0004 + 0.0007) = 1 − 0.0011 = 0.999
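    A Python sketch (not part of the notes) reproducing the fisherman example with the negative binomial pmf above:

        from math import comb, sqrt

        k, p = 5, 0.125
        q = 1 - p

        def neg_binom_pmf(y):
            # f(y) = C(y-1, k-1) p^k q^(y-k)
            return comb(y - 1, k - 1) * p**k * q**(y - k)

        print(k / p, sqrt(k * q / p**2))          # mean 40.0, standard deviation approx 16.73
        print(round(neg_binom_pmf(50), 4))        # Pr(Y = 50) approx 0.0159
        print(round(1 - sum(neg_binom_pmf(y) for y in range(5, 9)), 3))  # Pr(Y > 8) approx 0.999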

    Poisson Distribution

    The Poisson Distribution can be thought of as a limiting case of the binomial distribution.

    Suppose we are interested in the number of car accidents Y that occur at a busy intersection during one week.

    We could divide the week into n intervals of time, with each interval being so small that at most one accident could occur in that interval.

    We define p as the probability that an accident occurs in a particular sub-interval and 1 − p as the probability that no accident occurs.

    We could then think of this as a binomial experiment. It can then be shown that:

    lim_{n→∞} C(n, y) p^y (1 − p)^(n−y) = (np)^y e^(−np) / y!

    If we let λ = np then we have the probability mass function of the Poisson distribution:

    f(y) = λ^y e^(−λ) / y! , y = 0, 1, 2, . . .


    Here is a graph of the Poisson probability mass function where λ = 3.3 (going as far as y = 12):

    As an exercise, draw the Poisson probability mass function where λ = 1, up to y = 6.

    Mean and Variance of the Poisson Distribution

    The Poisson Distribution is used to model the counting of rare events that occur with a certain average rate per unit of time or space.

    For the Poisson Distribution, E(Y) = λ and Var(Y) = λ

    The expected value and variance are equal!

    Poisson Distribution Example

    The number of complaints that a busy laundry facility receives per day is a random variable Y having a Poisson distribution with λ = 3.3

    1. What is the probability that the facility will receive less than two complaints on a particular day?

    Pr(Y < 2) = Pr(Y = 0) + Pr(Y = 1) = e^(−3.3) + 3.3 e^(−3.3) = 0.0369 + 0.1217 = 0.1586


    If the number of complaints per day has a Poisson distribution with parameter λ then the number of complaints in five days has a Poisson distribution with parameter 5λ. Thus, if we let W be the number of complaints per week, then:

    E(W) = 5λ = 16.5
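    A small Python sketch (not from the notes) for the laundry-complaints example, using the Poisson pmf directly:

        from math import exp, factorial

        lam = 3.3
        def poisson_pmf(y):
            # f(y) = lam^y e^(-lam) / y!
            return lam**y * exp(-lam) / factorial(y)

        print(round(poisson_pmf(0) + poisson_pmf(1), 4))  # Pr(Y < 2) approx 0.1586
        print(5 * lam)                                    # expected complaints in five days: 16.5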

    Continuous Random Variables

    A random variable is continuous if it can take on any value in an interval (e.g., between 0 and 5). In other words, continuous random variables take on real-numbered values.

    There is no such thing as a probability mass function for a continuous random variable. Instead, we have a probability density function which allows us to find probabilities over an interval.

    If Y is a continuous random variable, and f(y) is the probability density function, then:

    Pr(a ≤ Y ≤ b) = ∫_a^b f(y) dy

    What we are actually doing is finding the area under the curve between a and b.

    Properties of a Probability Density Function

    1. f(y) ≥ 0 for all y, −∞ < y < ∞

    2. ∫_{−∞}^{∞} f(y) dy = 1

    E.g. Suppose the proportion of people who pay their income tax on time is a random variable Y with probability density function f(y) = 3y² for 0 ≤ y ≤ 1 and f(y) = 0 elsewhere. Verify that this is a valid probability density function.


    First we note that 3y² ≥ 0 for all 0 ≤ y ≤ 1, so the first condition is satisfied.

    Second:

    ∫_{−∞}^{∞} f(y) dy = ∫_0^1 f(y) dy (since the function is 0 elsewhere)

    = ∫_0^1 3y² dy

    = [y³]_0^1

    = 1³ − 0³ = 1

    Thus the second condition is also satisfied.

    Find the probability that between 60% and 90% of people pay their income tax on time.

    Pr(0.6 ≤ Y ≤ 0.9) = ∫_{0.6}^{0.9} 3y² dy

    = [y³]_{0.6}^{0.9}

    = 0.9³ − 0.6³ = 0.513

    Thus there is a 51.3% probability that between 60% and 90% of people pay their income tax on time according to this model.

    Note that it does not matter whether we use < or ≤ with continuous random variables.

    Expected Value and Variance of a Continuous Random Variable

    The expected value of a continuous random variable Y is defined as follows:

    μ = E(Y) = ∫_{−∞}^{∞} y f(y) dy

    Similarly the variance is defined thus:

    σ² = Var(Y) = E(Y²) − μ² = ∫_{−∞}^{∞} y² f(y) dy − μ²

    These have the same properties as in the discrete case.


    Find the expected value of the proportion of people who pay their income tax on time.

    μ = E(Y) = ∫_0^1 y (3y²) dy

    = ∫_0^1 3y³ dy

    = [3y⁴/4]_0^1

    = 3/4 = 0.75

    Find the standard deviation of the proportion of people who pay their income tax on time.

    σ² = Var(Y) = ∫_0^1 y² (3y²) dy − μ²

    = ∫_0^1 3y⁴ dy − 0.75²

    = [3y⁵/5]_0^1 − 0.75²

    = 3/5 − 0.75² = 0.6 − 0.5625 = 0.0375

    Hence σ = √0.0375 = 0.194
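    These integrals can be verified symbolically with a short Python/sympy sketch (not part of the notes):

        import sympy as sp

        y = sp.symbols('y')
        f = 3 * y**2                                      # density on [0, 1]

        print(sp.integrate(f, (y, 0, 1)))                 # total probability: 1
        print(sp.integrate(f, (y, 0.6, 0.9)))             # Pr(0.6 <= Y <= 0.9) = 0.513
        mu = sp.integrate(y * f, (y, 0, 1))               # 3/4
        var = sp.integrate(y**2 * f, (y, 0, 1)) - mu**2   # 3/80 = 0.0375
        print(mu, var, sp.sqrt(var))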

    Special Continuous Probability Distributions

    Uniform Distribution

    Suppose that Y can take on any value between θ₁ and θ₂ with equal probability. Then Y follows the continuous uniform distribution and its probability density function is as follows:

    f(y) = 1/(θ₂ − θ₁) , θ₁ ≤ y ≤ θ₂

    f(y) = 0 , elsewhere


    We can use integrals to compute probabilities, but in this case we don't need to because we are actually just finding the area of a rectangle! It can be shown that E(Y) = (θ₁ + θ₂)/2 and Var(Y) = (θ₂ − θ₁)²/12

    Uniform Distribution Example

    An insurance company provides roadside assistance to its clients. To save costs they want to dispatch the nearest possible tow truck.

    Along a particular highway which is 100 km long, breakdowns occur at uniformly distributed locations.

    Towing Company A is the nearest for the first 70 km of the highway and Towing Company B is the nearest for the final 30 km of the highway.

    1. What is the expected location of the next breakdown?

    E(Y) = (θ₁ + θ₂)/2 = (0 + 100)/2 = 50

    We expect the next breakdown to occur at the 50 km mark.

    3. What is the probability that the next breakdown will be attended by company B?

    Here f(y) = 1/100, 0 ≤ y ≤ 100, and 0 elsewhere.

    We need to find the area under f(y) between 70 and 100.

    We could calculate ∫_{70}^{100} f(y) dy

    Or we can simply calculate the area of this rectangle:


    The area of a rectangle is length × width. Thus:

    Pr(70 ≤ Y ≤ 100) = 30 × (1/100) = 0.30

    Normal Distribution

    A random variable Y is said to have a normal distribution with parameters μ and σ (−∞ < μ < ∞, σ > 0) if its probability density function is:

    f(y) = [1/(σ√(2π))] e^(−(y−μ)²/(2σ²)) , −∞ < y < ∞


    Even more good news: any normally distributed random variable Y with mean μ and standard deviation σ can be transformed to a Standard Normal random variable Z using this simple transformation:

    Z = (Y − μ)/σ

    This graph shows how the transformation works:

    Using the Z Table to Calculate Probabilities

    The Z Table provides us with Pr(Z < z) for any z value that we choose up to 2 decimal places.


    Suppose we want to know Pr(Z ≥ z) = 1 − Pr(Z < z)

    If we want to find Pr(Z < z) for a negative z value, we can use the fact that the Standard Normal Distribution is symmetric:

    Pr(Z < −z) = 1 − Pr(Z < z)

    5. t_observed = 10.50 > 2.228, thus we reject H0

    6. We conclude at the 5% significance level that the correlation is significantly different from 0

    The Fisher Transformation

    What if we want to test whether ρ = ρ₀ for any value −1 < ρ₀ < 1? What if we want a confidence interval for ρ? The Fisher Transformation allows us to do both (approximately).

    z_r = (1/2) ln[(1 + r)/(1 − r)]

    This quantity has an approximate normal distribution with a mean of (1/2) ln[(1 + ρ)/(1 − ρ)] and a variance of 1/(n − 3). From this we get the following test statistic, which has a standard normal distribution under the null hypothesis:

    distribution under the null hypothesis:

    Z=

    12ln

    1+r1r

    12ln

    1+0101n3
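    This statistic is easy to compute with a short Python sketch (not part of the notes; the sample values r = 0.96 and n = 12 below are hypothetical, chosen only to illustrate the call):

        from math import atanh, sqrt

        def fisher_z(r, rho0, n):
            # atanh(r) equals (1/2) ln[(1+r)/(1-r)], the Fisher transformation
            return (atanh(r) - atanh(rho0)) / sqrt(1 / (n - 3))

        # Hypothetical illustration: r = 0.96 from n = 12 pairs, testing H0: rho = 0.99
        print(round(fisher_z(0.96, 0.99, 12), 2))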


    Pearson's Correlation Coefficient: General Hypothesis Test Example

    Suppose we want to find out whether the correlation is less than 0.99 in our ice cream sales vs. temperature example.

    1. H0 : ρ = 0.99 vs. HA : ρ < 0.99


    Spearman's Rank Correlation Coefficient

    What if one or both of X and Y are not normally distributed?

    Suppose we have the Statistics FISA marks and number of hours of TV watched per week for n = 8 students:

    FISA Marks vs. Hours of TV per week

    Hours of TV per week (xi) FISA Mark (yi)

    3 73

    11 50

    7 87

    38 31

    13 62

    20 61

    22 46

    34 59

    Spearman's Rank Correlation Coefficient

    In this case we can instead use Spearman's Rank Correlation Coefficient ρ_s, which is based on the ranks of the xi and yi rather than the values themselves.

    It is a general measure of association rather than a measure of linear dependence.

    R(xi) are the ranks of the x values; thus the lowest value has a rank of 1, the second lowest a rank of 2, etc.

    R(yi) is computed the same way for the y values. The sample estimator of ρ_s is:

    r_s = [n ∑ R(xi)R(yi) − ∑ R(xi) ∑ R(yi)] / √{ [n ∑ R(xi)² − (∑ R(xi))²] [n ∑ R(yi)² − (∑ R(yi))²] }

    If there are no ties in x or y, this reduces to a simpler formula:

    r_s = 1 − 6 ∑_{i=1}^{n} di² / [n(n² − 1)] , where di = R(xi) − R(yi)


    FISA Marks vs. TV hours per week

    Hours of TV per week (xi)   FISA Mark (yi)   R(xi)   R(yi)   di   di²

    3    73   1   7   −6   36
    11   50   3   3    0    0
    7    87   2   8   −6   36
    38   31   8   1    7   49
    13   62   4   6   −2    4
    20   61   5   5    0    0
    22   46   6   2    4   16
    34   59   7   4    3    9

    ∑ di² = 150

    Spearman's Rank Correlation Coefficient Example

    In our FISA marks vs. TV hours example:

    We can now compute the sample Spearman correlation coefficient:

    r_s = 1 − (6 × 150)/(8(8² − 1)) = −0.786

    This suggests that there is a negative association between hours spent watching TV and FISA mark.
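    A Python sketch (not from the notes) that computes the same coefficient from the ranks, using the no-ties formula:

        tv_hours = [3, 11, 7, 38, 13, 20, 22, 34]
        marks = [73, 50, 87, 31, 62, 61, 46, 59]

        def ranks(values):
            # rank 1 = smallest value; this simple version assumes no ties
            order = sorted(values)
            return [order.index(v) + 1 for v in values]

        d2 = [(rx - ry)**2 for rx, ry in zip(ranks(tv_hours), ranks(marks))]
        n = len(tv_hours)
        rs = 1 - 6 * sum(d2) / (n * (n**2 - 1))
        print(sum(d2), round(rs, 3))  # 150, -0.786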

    Spearman's Rank Correlation Coefficient: Hypothesis Testing

    We may want to test the null hypothesis H0: ρ_s = 0 against some alternative to see if there is a significant association between x and y.

    If n is large (and there are no ties) then the statistic t = r_s √[(n − 2)/(1 − r_s²)] has approximately a t distribution with n − 2 degrees of freedom. If n is small we use r_s as our test statistic and use a table of critical values (see appendix).

    For our student marks vs. TV hours example, suppose we want to check if the association between these two variables is significant at the 5% significance level.


    Spearman's Rank Correlation Coefficient: Hypothesis Testing Example

    1. H0: ρ_s = 0 vs. HA: ρ_s ≠ 0

    2. α = 0.05

    3. Test statistic is r_s

    4. Critical value is r_{s,α/2,8} = 0.738, so we reject H0 if |r_s observed| > 0.738

    5. |r_s observed| = |−0.786| = 0.786 > 0.738, so we reject H0

    6. We conclude there is a (negative) association between hours spent watching TV per week and FISA mark

    Spearman's Rank Correlation Coefficient: General Hypothesis Tests and Confidence Intervals

    The Fisher Transformation that was done on the Pearson Correlation Coefficient also applies to the Spearman Rank Correlation Coefficient. Thus we can use the very same formulas based on the standard normal distribution to carry out general hypothesis tests such as H0 : ρ_s = 0.6 vs. HA : ρ_s ≠ 0.6, as well as to construct confidence intervals for ρ_s.

    Of course we need to use r_s instead of r in these formulas, but everything else stays the same.

    Limitations of Correlation Analysis

    Two of the limitations of correlation analysis are:

    1. It does not allow us to compare more than two variables at a time

    2. It does not allow us to make predictions

    We now turn to linear regression analysis which enables us to do both of these

    3 Simple Linear Regression Analysis

    Equation of a Line

    The equation of a line is often expressed as y = mx + c

    m is the slope of the line, the change in y for a one unit change in x. c is the intercept of the line, the value of y when x = 0 (and the point where the line crosses the vertical axis).

    Often when we compare observations from two variables, we see what appears to be an approximately linear relationship.

    We must decide logically which is the independent variable (x) and which is the dependent variable (y).

    For example, the scatter plot of ice cream sales vs. temperatures (which is dependent on the other?)


    Line Fitting

    If we have only two points, we can fit a line that goes right through them both.

    E.g. if we have the points (x₁ = 2, y₁ = 4) and (x₂ = 6, y₂ = 6):

    m = (y₂ − y₁)/(x₂ − x₁) = (6 − 4)/(6 − 2) = 1/2

    m = (y − y₁)/(x − x₁)

    1/2 = (y − 4)/(x − 2)

    2(y − 4) = x − 2

    2y − 8 = x − 2

    2y = x + 6

    y = (1/2)x + 3


    Line Fitting

    However, as soon as we have three or more points, we usually can't fit them perfectly with a straight line.

    Consider the following scatter plot:

    There is no line that describes this relationship perfectly. So how do we model a relationship that is "kind of" linear?

    The Simple Linear Regression Model

    We could assume that the yi observations depend on the xi observations in a linear way but also contain some unexplained variation.

    We model this unexplained variation or error as a random variable εi. This means Y is a random variable since it depends on a random variable. Thus we have Y = β₀ + β₁x + ε or, for individual observations, yi = β₀ + β₁xi + εi for i = 1, 2, . . . , n

    We have simply changed the name of m to β₁ and c to β₀, switched their order, and added the error term.


    Model Assumptions

    The most important assumptions of a simple linear regression model are as follows:

    The x values are fixed, not random (thus we write x in lower case and Y, a random variable, in upper case)

    All error terms have a zero mean, i.e. E(εi) = 0 for all i

    All error terms have the same fixed variance, i.e. Var(εi) = σ² for all i

    All observations are independent of each other

    The error terms follow the normal distribution

    The Problem

    Even if our model and its assumptions are correct, we have a problem: we don't know the values of β₀, β₁ or εi.

    In order to know them we would have to have data from the whole population of x and y, which is usually impossible.

    We can only estimate β₀, β₁ and εi as best as we can. But how?

    Line Fitting

    If we asked three people to draw the line that best fits the points, we might get three different results:

    How would we know which line is the best? As statisticians we want to use a statistic to quantify this! But how?


    The Least Squares Method

    Suppose we have observations (xi, yi) for i = 1, 2, . . . , n, and we fit a line with equation ŷi = β̂₀ + β̂₁xi

    We have simply changed the name of m to β̂₁ and c to β̂₀, and switched their order.

    The hats on ŷ, β̂₀ and β̂₁ remind us that these are estimates of the relationship.

    We can determine how far each individual yi value is from the line using the formula ei = yi − ŷi = yi − (β̂₀ + β̂₁xi). The ei values are called residuals.


    The residuals ei are our best estimate of the unknown errors εi. They also provide us with a clue of how to find the estimated line that best fits the data.

    Overall, we want the errors to be as small as possible. However, we can't just minimize the sum of errors because the positive errors (points above the line) and negative errors (points below the line) will cancel each other out!

    Instead we minimize the sum of squared errors εi² because these will all be positive:

    SSError = ∑_{i=1}^{n} εi²

    This quantifies the overall distance between the points and the line, similar to how the variance gives an indication of the distance between data points and their mean.


    We will choose the values of β̂₀ and β̂₁ that minimize the sum of squared errors.

    How do we do this? Calculus! The sum of squared errors is a function of β̂₀ and β̂₁:

    SSError = S(β̂₀, β̂₁) = ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi)²

    So our method is as follows:

    1. Take partial derivatives of the SSError function with respect to β̂₀ and β̂₁

    2. Set the derivatives equal to zero

    3. Solve this system of equations for β̂₀ and β̂₁ to get the values which minimize the function

    Deriving the Least Squares Estimators

    ∂S(β̂₀, β̂₁)/∂β̂₀ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi) = 0 (1)

    ∂S(β̂₀, β̂₁)/∂β̂₁ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi) xi = 0 (2)

    This is the system of equations we must solve in terms of β̂₀ and β̂₁. We simplify them as follows:

    −2 ∑ (yi − β̂₀ − β̂₁xi) = 0

    ∑ yi − ∑ β̂₀ − ∑ β̂₁xi = 0

    ∑ yi − nβ̂₀ − β̂₁ ∑ xi = 0

    nȳ − nβ̂₀ − nβ̂₁x̄ = 0

    β̂₀ = ȳ − β̂₁x̄


    −2 ∑ (yi − β̂₀ − β̂₁xi) xi = 0

    ∑ xiyi − ∑ β̂₀xi − ∑ β̂₁xi² = 0

    ∑ xiyi − β̂₀ ∑ xi − β̂₁ ∑ xi² = 0

    ∑ xiyi − (ȳ − β̂₁x̄) ∑ xi − β̂₁ ∑ xi² = 0

    ∑ xiyi − nx̄ȳ + nβ̂₁x̄² − β̂₁ ∑ xi² = 0

    β̂₁ (∑ xi² − nx̄²) = ∑ xiyi − nx̄ȳ

    β̂₁ = (∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)

    Least Squares Estimation Formula

    Thus the least squares estimates of β₀ and β₁ can be calculated using the following formulas:

    β̂₁ = (∑_{i=1}^{n} xiyi − nx̄ȳ) / (∑_{i=1}^{n} xi² − nx̄²)

    β̂₀ = ȳ − β̂₁x̄

    It turns out that β̂₁ and β̂₀ are Minimum Variance Unbiased Estimators (MVUE) of β₁ and β₀.

    This means that:

    1. E(β̂₀) = β₀ and E(β̂₁) = β₁ (unbiased)

    2. β̂₀ and β̂₁ can be proven to have the smallest variance (greatest precision) of any linear estimators of β₀ and β₁


    Proof that β̂₁ is an Unbiased Estimator of β₁

    We first need to derive E(Yi) and E(Ȳ).

    We will also use our assumptions that the x values are fixed and that E(εi) = 0.

    E(Yi) = E(β₀ + β₁xi + εi)

    = E(β₀) + E(β₁xi) + E(εi)

    = β₀ + β₁xi + 0 (since the first two are constants)

    = β₀ + β₁xi

    E(Ȳ) = E[(1/n) ∑_{i=1}^{n} yi]

    = (1/n) ∑_{i=1}^{n} E(yi)

    = (1/n) ∑_{i=1}^{n} (β₀ + β₁xi)

    = (1/n)(nβ₀ + β₁nx̄)

    = β₀ + β₁x̄


    E(β̂₁) = E[(∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)]

    = [1/(∑ xi² − nx̄²)] E(∑ xiyi − nx̄ȳ) (since x is fixed, the denominator is constant)

    = [1/(∑ xi² − nx̄²)] [∑ xi E(yi) − nx̄ E(ȳ)]

    = [1/(∑ xi² − nx̄²)] [∑ xi(β₀ + β₁xi) − nx̄(β₀ + β₁x̄)] (see results proved above)

    = [1/(∑ xi² − nx̄²)] [β₀nx̄ + β₁ ∑ xi² − nx̄β₀ − nx̄²β₁]

    = β₁ (∑ xi² − nx̄²) / (∑ xi² − nx̄²)

    = β₁

    Proof that β̂₀ is an Unbiased Estimator of β₀

    As an exercise, try to prove that E(β̂₀) = β₀.

    The proof is much shorter than the proof for β̂₁.

    Prediction with Simple Linear Regression

    Once we have calculated the least squares estimates β̂₁ and β̂₀, we can write out the fitted regression equation:

    ŷ = β̂₀ + β̂₁x

    We can now use this equation to predict the most likely value of y for a particular value of x.


    This is one of the most useful things about this model! However, we must be careful to only make predictions for values of x in the domain of our data.

    We cannot extrapolate since the relationship may not be linear outside of the domain of the data.

    The Riskiness of Extrapolation

    Suppose we fit a line to a set of data points with xi values ranging from 0 to 6. Now we use our fitted line to predict the value of y for x = 10.

    The Riskiness of Extrapolation

    What if modeling the relationship between y and x as a straight line is only appropriate between x = 0 and x = 6?

    Can you see how far off the prediction would appear to be if we had data for larger x values like this?

    Simple Linear Regression Example

    Various doses of a toxic substance were given to groups of 25 rats and the results were observed (see table below).


    Rat Deaths vs. Doses

    Dose in mg (x) Number of Deaths (y)

    4 1

    6 3

    8 6

    10 8

    12 14

    14 16

    16 20

    1. Find the fitted simple linear regression equation for this data

    2. Use the model to predict the number of deaths in a group of 25 rats who receive a 7 mg dose of the toxin


    Rat Deaths vs. Doses

    xi   yi   xi²   xiyi

    4     1    16     4
    6     3    36    18
    8     6    64    48
    10    8   100    80
    12   14   144   168
    14   16   196   224
    16   20   256   320

    ∑xi = 70   ∑yi = 68   ∑xi² = 812   ∑xiyi = 862

    x̄ = 10   ȳ = 9.714

    β̂₁ = (∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)

    = (862 − 7 × 10 × 9.714) / (812 − 7 × 10²)

    = 182.02/112

    = 1.625

    β̂₀ = ȳ − β̂₁x̄ = 9.714 − 1.625 × 10 = −6.536

    Note that it is important not to round numbers off until you have the final regression equation, otherwise your answer may be inaccurate.

    Thus the fitted regression equation is ŷ = −6.54 + 1.63x. Predicting the number of deaths for a dose of 7 mg:

    ŷ = −6.54 + 1.63x = −6.54 + 1.63 × 7 = 4.9
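    The same fit can be reproduced with a short Python sketch (not part of the notes), using the least squares formulas above:

        x = [4, 6, 8, 10, 12, 14, 16]
        y = [1, 3, 6, 8, 14, 16, 20]
        n = len(x)

        xbar, ybar = sum(x) / n, sum(y) / n
        b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
             (sum(xi**2 for xi in x) - n * xbar**2)
        b0 = ybar - b1 * xbar
        print(round(b1, 3), round(b0, 3))  # 1.625, -6.536
        # prediction at a 7 mg dose, about 4.8 with unrounded coefficients
        # (the notes round the coefficients first and get 4.9)
        print(round(b0 + b1 * 7, 2))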


    Simple Linear Regression Exercise

    Calculate the equation of the line of best fit for the temperature (x) vs. ice cream sales (y) example.

    Use the equation to predict the ice cream sales on a day on which the temperature is 20.

    Inferences from a Simple Linear Regression

    The two unknown parameters involved in a simple linear regression model are β₀ and β₁.

    σ², the variance of the error terms, is also unknown.

    We may be interested in knowing whether it is reasonable to conclude that one of these unknowns is equal to (or not equal to) a particular value.

    Most often we are interested in whether β₁ = 0, since this determines whether x and y have a positive relationship, a negative relationship or no relationship.

    Like in correlation analysis! To use hypothesis testing to make inferences about these unknowns we need an appropriate test statistic.

    Inferences on β₁

    Inferences about β₁ will be based on how far the estimated value β̂₁ is from the null hypothesis value.

    As always, we also take into account the standard error of the estimate and its probability distribution.

    We already proved that E(β̂₁) = β₁.

    Let:

    SSx = ∑_{i=1}^{n} xi² − nx̄²

    SSy = ∑_{i=1}^{n} yi² − nȳ²

    SSxy = ∑_{i=1}^{n} xiyi − nx̄ȳ

    Notice that, expressed in these terms, β̂₁ = SSxy/SSx.

    Subject to our model assumptions, it can be proven that Var(β̂₁) = σ²/SSx.


    However, because we do not know the value of σ², we must use the best estimate, which turns out to be

    σ̂² = [1/(n − 2)] ∑_{i=1}^{n} ei² = SSResidual/(n − 2) = MSResidual

    Thus V̂ar(β̂₁) = σ̂²/SSx.

    It can be proven that [β̂₁ − E(β̂₁)] / √V̂ar(β̂₁) has a t distribution with n − 2 degrees of freedom.

    Thus t = (β̂₁ − β₁) / √(σ̂²/SSx) has a t distribution with n − 2 degrees of freedom.

    Since SSResidual = SSy − β̂₁SSxy, we can express this as:

    t = (β̂₁ − β₁) / √[(SSy − β̂₁SSxy) / ((n − 2) SSx)]

    If we replace β₁ with β₁₀ (the hypothesised value) this becomes our test statistic for testing H0 : β₁ = β₁₀.

    Hypothesis Testing Review

    For such a t test, our decision rules would be as follows:

    H0: β₁ = β₁₀ vs. HA: β₁ ≠ β₁₀

    Reject H0 if |t_observed| > t_{α/2,n−2}

    H0: β₁ = β₁₀ vs. HA: β₁ < β₁₀

    Reject H0 if t_observed < −t_{α,n−2}

    H0: β₁ = β₁₀ vs. HA: β₁ > β₁₀

    Reject H0 if t_observed > t_{α,n−2}


    The p-value Approach

    Instead of using critical values to decide whether to reject H0, one can also use p-values.

    A p-value is defined as the probability of obtaining a result at least as extreme as the observed data, given that H0 is true.

    For such a t test, our decision rules would be as follows:

    H0: β₁ = β₁₀ vs. HA: β₁ ≠ β₁₀

    Reject H0 if 2 Pr(t > |t_observed| given that β₁ = β₁₀) < α

    H0: β₁ = β₁₀ vs. HA: β₁ < β₁₀

    Reject H0 if Pr(t < t_observed given that β₁ = β₁₀) < α

    H0: β₁ = β₁₀ vs. HA: β₁ > β₁₀

    Reject H0 if Pr(t > t_observed given that β₁ = β₁₀) < α

    Note that p-values cannot usually be computed by hand. As an example, the third p-value involves computing

    p = ∫_{t_observed}^{∞} f(y) dy

    where f(y) is the probability density function of the t distribution.

    However, p-values can be easily calculated with a computer, and are the quickest way to reach a decision about a hypothesis test when using statistical software packages.

    Confidence Interval for β₁

    Using the t statistic above, we can derive a (1 − α)100% confidence interval for β₁ as follows:

    Pr( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] < β₁ < β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] ) = 1 − α

    Thus the C.I. for β₁ is:

    ( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] , β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] )


    Inference on β₁ Example

    Suppose we want to test H0 : β₁ = 0 vs. HA : β₁ ≠ 0 for the rat death vs. dosage example, at the α = 0.05 significance level.

    Our test statistic is t ~ t(n − 2) as defined above. Our critical region is |t_observed| > t_{α/2,n−2} = t_{0.025,5} = 2.570. We have already calculated that SSxy = 182 and SSx = 112. We can further calculate that SSy = 301.4286.

    t = (β̂₁ − β₁) / √[(SSy − β̂₁SSxy)/((n − 2)SSx)]

    = (1.625 − 0) / √[(301.4286 − 1.625 × 182)/((7 − 2) × 112)]

    = 1.625/√0.01014

    = 1.625/0.1007 = 16.14

    |t_observed| > 2.570, thus we reject H0 and conclude that β₁ ≠ 0; the slope of the regression model is statistically significant.

    A 95% Confidence Interval for β₁ is given by:

    ( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] , β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] )

    (1.625 − 2.570 × 0.1007 , 1.625 + 2.570 × 0.1007)

    (1.37 , 1.88)
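    The t statistic and interval can be checked with a small Python sketch (not part of the notes), working from the summary quantities quoted above:

        from math import sqrt

        n, b1 = 7, 1.625
        SSx, SSy, SSxy = 112, 301.4286, 182
        t_crit = 2.571  # t_{0.025, 5}; the notes quote 2.570

        se_b1 = sqrt((SSy - b1 * SSxy) / ((n - 2) * SSx))
        t_obs = b1 / se_b1
        print(round(se_b1, 4), round(t_obs, 2))  # approx 0.1007, 16.14
        print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))  # approx 1.37, 1.88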

    Inference on β₀

    In a similar way it can be proven that:

    E(β̂₀) = β₀

    Var(β̂₀) = σ² (1/n + x̄²/SSx)

    If we estimate σ² with σ̂² then t = (β̂₀ − β₀) / √[σ̂² (1/n + x̄²/SSx)] has a t distribution with n − 2 degrees of freedom.


    We can also express t as:

    t = (β̂₀ − β₀) / √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) }

    Confidence Interval for β₀

    A (1 − α)100% Confidence Interval for β₀ is given by:

    ( β̂₀ − t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) } , β̂₀ + t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) } )

    Inference on β₀ Example

    With our dosage vs. rat deaths example, suppose we are interested in whether β₀ < 0


    Inference on σ²

    It is also possible to perform hypothesis tests and confidence intervals concerning σ² using the χ² distribution.

    However we will not cover these in this module.

    Predicting the Mean Response

    One of the advantages of the linear regression model is that we can use x to predict Y.

    Suppose we want to estimate the mean value of Y when x = x*, i.e. E(Y | x = x*).

    We know that E(Y | x = x*) = β₀ + β₁x*. Our best estimate of E(Y | x = x*) is ŷ* = β̂₀ + β̂₁x*.

    The variance of this estimator is Var(ŷ*) = σ² [1/n + (x* − x̄)²/SSx]

    Since σ² is unknown, we can use the following estimate:

    V̂ar(ŷ*) = σ̂² [1/n + (x* − x̄)²/SSx] = [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx]

    It can also be shown that t = [ŷ* − E(Y | x = x*)] / √V̂ar(ŷ*) ~ t(n − 2)

    Confidence Interval for Mean Response

    Thus a (1 − α)100% Confidence Interval for E(Y | x = x*) is given by:

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx] }

    If we want the interval to be as narrow as possible (a more accurate prediction), then n should be large, SSx should be large, and x* should be near x̄.

    That is, we should gather data on a wide range of x values.


    Predicting a New Response

    Suppose we want to predict the response value y* for a new observation x = x*.

    Our best estimate would be ŷ* = β̂₀ + β̂₁x*

    E(ŷ*) = β₀ + β₁x*

    Var(y* − ŷ*) = σ² [1 + 1/n + (x* − x̄)²/SSx]

    Thus:

    V̂ar(y* − ŷ*) = σ̂² [1 + 1/n + (x* − x̄)²/SSx] = [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx]

    It can be shown that t = (y* − ŷ*) / √V̂ar(y* − ŷ*) ~ t(n − 2)

    Prediction Interval for an Individual Response

    A (1 − α)100% Prediction Interval for y* is given by:

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx] }

    It is called a prediction interval rather than a confidence interval because y* is a random variable, not an unknown parameter.

    Notice that the prediction interval for y* is always wider than the confidence interval for E(Y | x = x*).

    It is more difficult to predict the value of an individual observation than the mean of many observations.

    Example

    Consider our Temperature vs. Ice Cream Sales example. We want a confidence interval for the average ice cream sales when the temperature is 20 and a prediction interval for the ice cream sales on a particular day when the temperature is 20.


    1. Confidence Interval for E(Y | x = 20)

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx] }

    −159.474 + 30.088(20) ± t_{0.025,10} √{ [(174754.9 − 30.088(5325.025))/(12 − 2)] [1/12 + (20 − 18.675)²/176.9825] }

    442.286 ± 2.228 √135.549

    442.286 ± 25.94 = (416.35, 468.23)

    2. Prediction Interval for y* when x = 20

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx] }

    −159.474 + 30.088(20) ± t_{0.025,10} √{ [(174754.9 − 30.088(5325.025))/(12 − 2)] [1 + 1/12 + (20 − 18.675)²/176.9825] }

    442.286 ± 2.228 √1589.10

    442.286 ± 88.82 = (353.47, 531.11)
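    Both intervals can be reproduced with a short Python sketch (not part of the notes), using only the summary quantities quoted above for the ice cream data:

        from math import sqrt

        n, xbar = 12, 18.675
        b0, b1 = -159.474, 30.088
        SSx, SSy, SSxy = 176.9825, 174754.9, 5325.025
        t_crit = 2.228  # t_{0.025, 10}
        x_star = 20

        mse = (SSy - b1 * SSxy) / (n - 2)
        y_hat = b0 + b1 * x_star
        ci_half = t_crit * sqrt(mse * (1 / n + (x_star - xbar)**2 / SSx))
        pi_half = t_crit * sqrt(mse * (1 + 1 / n + (x_star - xbar)**2 / SSx))
        print(round(y_hat, 3), round(ci_half, 2), round(pi_half, 2))
        # approx 442.286, 25.94 and 88.82, giving (416.35, 468.23) and (353.47, 531.11)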

    Assessing the Fit of a Regression Line

    While testing the hypothesis H0 : β₁ = 0 can give us a yes or no answer on whether the model is appropriate, we would like a statistic that can quantify how good the model is.

    One method is to calculate what proportion of the total variation in y is explained by our model.

    The total variation in y is SSy = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi² − nȳ²

    The variation not explained by the model is SSResidual = ∑_{i=1}^{n} (yi − ŷi)²

    Thus the variation explained by the model is the difference SSy − SSResidual. Our goodness of fit statistic, called the Coefficient of Determination, is the ratio of the variation explained by the model to the total variation:

    r² = (SSy − SSResidual)/SSy = 1 − SSResidual/SSy


    We call this statistic r² because it turns out that it is the square of Pearson's sample correlation coefficient r.

    Proof:

    r² = 1 − SSResidual/SSy

    = 1 − (SSy − β̂₁SSxy)/SSy

    = 1 − 1 + β̂₁SSxy/SSy

    = β̂₁ SSxy/SSy

    = (SSxy/SSx)(SSxy/SSy)

    = SSxy²/(SSx SSy)

    = (r)²

    Goodness of Fit Example

    In our dosage vs. rat deaths example:

    r² = SSxy²/(SSx SSy)

    = 182²/(112 × 301.4286) = 0.981

    Thus in this case we can say that 98.1% of the variation in rat deaths can be explained by the dosage given.

    4 Multiple Linear Regression

    Multiple Linear Regression Model Specification

    Before now we have used models with only one independent variable xi. What if we want to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂?

    The multiple linear regression model allows us to do this.

    Motivational Example

    An experiment was conducted to determine the effect of pressure and temperature on the yield of a chemical. Two levels of pressure (in kPa) and three levels of temperature (in °C) were used and the results were as follows:


    Yield (yi)   Pressure (xi1)   Temperature (xi2)

    21   350    40
    23   350    90
    26   350   150
    22   550    40
    23   550    90
    28   550   150

    3D Scatter Plot

    If we want to represent the relationship graphically we would need a three dimensional scatter plot.

    Instead of a line of best fit, we now need a plane of best fit.

    Multiple Linear Regression Model

    The multiple linear regression model allows us to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂.

    The model is specified as follows:

    Y = β₀ + β₁x₁ + β₂x₂ + ε

    Or, in terms of observations, as follows:

    yi = β₀ + β₁x1i + β₂x2i + εi


    This is the equation of a plane, not a line.

    β₀ is still the intercept (the point where the plane crosses the vertical axis, x₁ = x₂ = 0)

    β₁ is the slope of the plane in the x₁ direction

    β₂ is the slope of the plane in the x₂ direction

    β₁ and β₂ are sometimes referred to as partial slope coefficients.

    This model relies on the same assumptions as the simple linear regression model, with one addition:

    x₁ and x₂ must not be collinear (highly correlated with one another)

    The fitted regression equation in this case is:

    Ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂

    Multiple Linear Regression Model: Deriving Least Squares Parameter Estimates

    We can again use the Method of Least Squares to estimate the parameters β₀, β₁ and β₂.

    We still have our sum of squared error function, which is now a function of three variables:

    SSError = S(β̂₀, β̂₁, β̂₂) = ∑_{i=1}^{n} εi² = ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i)²

    We can still use the same steps:

    1. Take partial derivatives of the SSError function with respect to β̂₀, β̂₁ and β̂₂

    2. Set the derivatives equal to zero

    3. Solve this system of equations for β̂₀, β̂₁ and β̂₂ to get the values which minimize the function

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₀ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) = 0

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₁ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) x1i = 0

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₂ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) x2i = 0


    Solving this system of equations for β̂₀, β̂₁ and β̂₂ is possible but it will take long and the formula will be complicated.

    An alternative is to use matrix notation, which is more compact.

    Multiple Linear Regression Model: Matrix Notation

    We can specify the regression model in matrix notation as follows:

    y = Xβ + ε where

    y is an n × 1 matrix:

    y = [y1, y2, . . . , yn]′

    X is an n × 3 matrix:

    X = [ 1  x11  x21
          1  x12  x22
          ⋮   ⋮    ⋮
          1  x1n  x2n ]

    β is a 3 × 1 matrix:

    β = [β₀, β₁, β₂]′

    ε is an n × 1 matrix:

    ε = [ε1, ε2, . . . , εn]′

    Quick Review of Matrices

    For any matrices A and B, where A′ is the transpose of A:

    (A′)′ = A

    (A + B)′ = A′ + B′

    (AB)′ = B′A′


    Additionally, the inverse of a square matrix A (which is like the matrix equivalent of division) is the matrix A⁻¹ such that AA⁻¹ = I, where I is the identity matrix, e.g.

    I = [ 1 0 0
          0 1 0
          0 0 1 ]

    To find the inverse of a matrix we can use the following method (similar to Gauss-Jordan elimination):

    Suppose

    A = [ 1 2 3
          0 4 5
          1 0 6 ]

    Then, starting from the augmented matrix [A | I] and row-reducing until the left block is the identity:

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 1 0 6 | 0 0 1 ]

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 −2 3 | −1 0 1 ]

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 2 0 1 | 2 −1 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 22 0 0 | 24 −12 −2 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 22 0 0 | 24 −12 −2 ]
    [ 0 44 0 | 10 6 −10 ]
    [ 0 0 11 | −2 1 2 ]

    [ 1 0 0 | 12/11 −6/11 −1/11 ]
    [ 0 1 0 | 5/22 3/22 −5/22 ]
    [ 0 0 1 | −2/11 1/11 2/11 ]

    Thus A⁻¹ = [ 12/11  −6/11  −1/11
                 5/22    3/22  −5/22
                −2/11   1/11   2/11 ]

    Deriving Least Squares Estimates in Matrix Notation

    Our sum of squared error function in matrix notation is:

    S(β̂) = ∑_{i=1}^{n} εi² = ε′ε = (y − Xβ̂)′(y − Xβ̂)

    = (y′ − (Xβ̂)′)(y − Xβ̂)

    = (y′ − β̂′X′)(y − Xβ̂)

    = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂


    Now, in β̂′X′y we are multiplying a 1 × 3 matrix by a 3 × n matrix by an n × 1 matrix, so the result will be a 1 × 1 matrix, i.e. a scalar number.

    Similarly, in y′Xβ̂ we are multiplying a 1 × n matrix by an n × 3 matrix by a 3 × 1 matrix, so the result will again be a 1 × 1 matrix, i.e. a scalar.

    Notice also that (β̂′X′y)′ = y′Xβ̂. The transpose of a scalar is itself. Thus, since these matrices are both scalars, they are equal, and we can simplify our equation to:

    S(β̂) = y′y − 2β̂′X′y + β̂′X′Xβ̂

    We now differentiate this function using vector calculus and set it equal to 0:

    ∂S/∂β̂ = −2X′y + 2X′Xβ̂ = 0

    X′Xβ̂ = X′y

    β̂ = (X′X)⁻¹X′y

    Thus in matrix form, the least squares estimators of β are given by β̂ = (X′X)⁻¹X′y.

    This matrix exists as long as the inverse of X′X exists, which it does as long as our assumption of no linear dependence between x₁ and x₂ holds true.

    The estimators have the same Minimum Variance Unbiased Estimator property as β̂₀ and β̂₁ do in the simple linear regression case.

    In matrix form, the fitted regression equation is ŷ = Xβ̂. In matrix form, the residuals are e = y − ŷ.

    Multiple Linear Regression Example

    We have the following data from ten species of mammal:


    Species Name     Gestation Period in days (y)   Body Weight in kg (x1)   Avg. Litter size (x2)

    Rat              23     0.05    7.3
    Tree Squirrel    38     0.33    3
    Dog              63     8.5     4
    Porcupine        112    11      1.2
    Pig              115    190     8
    Bush Baby        135    0.7     1
    Goat             150    49      2.4
    Hippo            240    1400    1
    Fur seal         254    250     1
    Human            270    65      1

    Here, our individual matrices are as follows:

    y = [23, 38, 63, 112, 115, 135, 150, 240, 254, 270]′

    X = [ 1  0.05   7.3
          1  0.33   3
          1  8.5    4
          1  11     1.2
          1  190    8
          1  0.7    1
          1  49     2.4
          1  1400   1
          1  250    1
          1  65     1 ]

    We first check if our y values appear to be normally distributed:


    Looks okay

    Our X′X matrix is as follows:

    X′X = [ 10        1974.58       29.9
            1974.58   2065419.851   3401.855
            29.9      3401.855      153.49 ]

    To find the inverse of this matrix we would use Gauss-Jordan Elimination as above.

    However, in the age of technology it's much quicker to use computer software such as MATLAB.

    We find that

    (X′X)⁻¹ = [ 0.3021           −1.9913 × 10⁻⁴   −5.4428 × 10⁻²
                −1.9913 × 10⁻⁴    6.3378 × 10⁻⁷    2.4744 × 10⁻⁵
                −5.4428 × 10⁻²    2.4744 × 10⁻⁵    1.6569 × 10⁻² ]

    We multiply this matrix by X′ and then by y to get our parameter estimates:

    β̂ = [178.7, 0.07569, −17.93]′

    Thus our fitted regression equation is Ŷ = 178.7 + 0.07569x₁ − 17.93x₂. We interpret this as follows:

    The intercept means that (according to the model) a mammal with body weight of 0 kg which has an average litter size of 0 babies would have a gestation period of 179 days.

    (Note that the intercept does not always make practical sense!)


    For every kg of body weight, gestation period increases by 0.07569 days.

    For every baby in the average litter, gestation period decreases by 17.93 days.

    Remember, we cannot assume the relationships are causal.

    It can be dangerous to extrapolate outside the region of x₁ and x₂ values in the data even if it is within range of individual values.

    The intercept may be an example of this! See the graph below.
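    The matrix calculation β̂ = (X′X)⁻¹X′y can be reproduced with a short numpy sketch (Python is not used elsewhere in these notes; this is only an illustration using the mammal data above):

        import numpy as np

        X = np.array([[1, 0.05, 7.3], [1, 0.33, 3], [1, 8.5, 4], [1, 11, 1.2],
                      [1, 190, 8], [1, 0.7, 1], [1, 49, 2.4], [1, 1400, 1],
                      [1, 250, 1], [1, 65, 1]], dtype=float)
        y = np.array([23, 38, 63, 112, 115, 135, 150, 240, 254, 270], dtype=float)

        XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^(-1)
        beta_hat = XtX_inv @ X.T @ y       # least squares estimates
        print(beta_hat)                    # approx [178.7, 0.0757, -17.93]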

    Multiple Linear Regression with k Independent Variables

    Using our matrix notation we can generalise the multiple linear regression model from 2 independent variables to k independent variables.

    The model is specified as follows:

    Y = β₀ + β₁x₁ + β₂x₂ + · · · + βₖxₖ + ε

    Or, in terms of observations, as follows:

    yi = β₀ + β₁x1i + β₂x2i + · · · + βₖxki + εi

    Note that p = k + 1 is the total number of parameters in the model (k independent variables plus one intercept).


    Hence y = Xβ + ε where:

    y is an n × 1 matrix, X is an n × p matrix, β is a p × 1 matrix, and ε is an n × 1 matrix.

    This model relies on the same assumptions as the simple linear regression model, along with the assumption of no multicollinearity:

    None of the independent variables are collinear (highly correlated with one another).

    Multiple Linear Regression Example

    Data was collected from 195 American universities on the following variables:

    Graduation Rate (the proportion of students in Bachelor's degree programmes who graduate after four years)

    Admission Rate (the proportion of applicants to the university who are accepted)

    Student-to-Faculty Ratio (the number of students per lecturer)

    Average Debt (the average student debt level at graduation, in US dollars)

    A few observations from the data are displayed below:

    Grad Rate (y)   Admission Rate (x1)   S/F Ratio (x2)   Avg Debt (x3)

    0.65   0.35   14   11156
    0.81   0.39   16   13536
    0.80   0.35   12   19762
    0.46   0.65   13   12906
    0.50   0.58   21   14449
    0.47   0.65   11   16645
    0.18   0.59   14   17221
    0.52   0.60   13   14791
    0.39   0.79   15   14382
    . . .


    In this case we have k = 3 independent variables and p = 4 parameters to estimate.

    The model equation is as follows:

    yi = β₀ + β₁xi1 + β₂xi2 + β₃xi3 + εi

    Using computer software we determine that (X′X)⁻¹ is:

              j = 0            j = 1            j = 2            j = 3
    j = 0     0.1059           0.01782          3.0672 × 10⁻³    3.2823 × 10⁻⁶
    j = 1     0.01782          0.1906           5.7407 × 10⁻³    5.7146 × 10⁻⁷
    j = 2     3.0672 × 10⁻³    5.7407 × 10⁻³    4.6400 × 10⁻⁴    2.1002 × 10⁻⁹
    j = 3     3.2823 × 10⁻⁶    5.7146 × 10⁻⁷    2.1002 × 10⁻⁹    2.3045 × 10⁻¹⁰

    We further determine that:

    β̂ = (X′X)⁻¹X′y = [1.1095, −0.3798, −0.02789, 5.1687 × 10⁻⁷]′

    Thus our sample regression function is:

    ŷ = 1.1095 − 0.3798x₁ − 0.02789x₂ + 5.1687 × 10⁻⁷ x₃

    Interpretation:

    For every 0.01 unit increase in admission rate, there is an expected 0.003798 unit decrease in graduation rate (we can't really talk about the usual 1 unit increase in x₁ since it is a proportion and ranges only from 0 to 1)

    For every one unit increase in student-to-lecturer ratio, there is an expected 0.02789 unit decrease in graduation rate

    For every $1 increase in average student debt, there is an expected 5.1687 × 10⁻⁷ unit increase in graduation rate

    Inferences from a Multiple Linear Regression

    Just like in simple linear regression, we often want to do hypothesis testing for multiple linear regression.

    There are three main types of hypothesis tests to consider:

    1. Inferences on Individual Parameters

    2. Inferences on the Full Model (all parameters)

    3. Inferences on Subsets of Parameters


    Inferences on Individual Parameters

    The logic is the same as in simple linear regression but we now use a matrix approach.

    It can be proven that E(β̂) = β. It can also be proven that the covariance matrix of β̂ is:

    Cov(β̂) = σ²(X′X)⁻¹

    This means that for each individual element β̂ⱼ of β̂:

    E(β̂ⱼ) = βⱼ

    Var(β̂ⱼ) = σ²Cjj

    where Cjj is the diagonal element of (X′X)⁻¹ corresponding to β̂ⱼ.

    This is the multivariate equivalent of our result in simple linear regression that Var(β̂₁) = σ²/SSx.

    Now, we face the same problem as before in that we don't usually know the value of σ².

    Remember, before we estimated σ² with σ̂² = [1/(n − 2)] ∑_{i=1}^{n} ei² = SSResidual/(n − 2)

    In the multivariate case, we have to divide by n − p instead of n − 2 (we subtract the number of parameters to be estimated, which was 2 in that case).

    Our sum of squared residuals can be expressed as follows:

    SSResidual = ∑_{i=1}^{n} ei² = e′e

    = (y − ŷ)′(y − ŷ)

    = (y − Xβ̂)′(y − Xβ̂)

    = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂

    = y′y − 2β̂′X′y + β̂′X′Xβ̂

    = y′y − β̂′X′y since X′Xβ̂ = X′y


    Therefore, σ̂² = SSResidual/(n − p) = [1/(n − p)](y′y − β̂′X′y)

    The test statistic for testing the null hypothesis H0: βⱼ = βⱼ₀ is thus:

    t = (β̂ⱼ − βⱼ₀)/√(σ̂²Cjj) = (β̂ⱼ − βⱼ₀)/√[(y′y − β̂′X′y)Cjj/(n − p)]

    Under the null hypothesis, t follows a t distribution with n − p degrees of freedom.

    Our decision rules will be the same as for inferences on β₁ in the simple linear regression model (depending whether we have a two-tailed, lower tail or upper tail test).

    Note that this formula can be used for any βⱼ including β₀. If we set βⱼ₀ = 0 then we are testing for the significance of an individual coefficient, that is, whether there is a linear relationship between Y and xⱼ.

    Inferences on Individual Parameters: Example

    Suppose we want to test whether the average student debt has a significant effect on the graduation rate.

    1. H0 : β₃ = 0 vs. HA : β₃ ≠ 0

    2. α = 0.05

    3. t = β̂₃/√[(y′y − β̂′X′y)C33/(n − p)] ~ t(n − p)

    4. Critical region: |t_observed| > t_{α/2,n−p} = t_{0.025,195−4} = t_{0.025,191} ≈ 1.984

    5. t_observed = 5.169 × 10⁻⁷ / √(4.769 × 2.3045 × 10⁻¹⁰ / (195 − 4)) = 0.215

    |t_observed| < 1.984, thus we do not reject H0

    6. We conclude that average student debt has no significant effect on graduation rate

    Inference on the Whole Regression Model

    One way to test the usefulness of a particular multiple linear regression model with k independent variables is to test the following:

    H0 : β₁ = β₂ = · · · = βₖ = 0

    HA : βⱼ ≠ 0 for at least one j

    If we reject H0, this implies that at least one of the independent variables x₁, x₂, . . . , xₖ contributes significantly to the model.

    To develop this test, remember the following from our r² calculations:

    SSy = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi² − nȳ² = y′y − nȳ²

    SSResidual = y′y − β̂′X′y

    Hence SSModel = SSy − SSResidual = β̂′X′y − nȳ²

    It can be shown that under H0, SSModel/σ² ~ χ²(p − 1) and SSResidual/σ² ~ χ²(n − p). From this we can develop a test statistic which compares the variation explained by the model to the variation not explained by the model:

    F = [SSModel/(p − 1)] / [SSResidual/(n − p)]

    Under H0, F ~ F(p − 1, n − p) and so we use the F distribution table to determine whether or not to reject the null hypothesis.

    In this case we always have a one-sided, upper tail test. Our decision rule is:

    Reject H0 if F_observed > F_{α,p−1,n−p}


    Inference on the Whole Regression Model: Example

    For our graduation rate example:

    1. H0 : β₁ = β₂ = β₃ = 0 vs. HA : βⱼ ≠ 0 for at least one j = 1, 2, 3

    2. α = 0.05

    3. Test statistic: F = [SSModel/(p − 1)] / [SSResidual/(n − p)] ~ F(p − 1, n − p)

    4. Critical region: F_observed > F_{α,p−1,n−p} = F_{0.05,3,191} ≈ 2.65

    5. F_observed = [(β̂′X′y − nȳ²)/(p − 1)] / [(y′y − β̂′X′y)/(n − p)] = [6.102/(4 − 1)] / [4.769/(195 − 4)] = 81.47 > 2.65, so we reject H0

    6. We conclude that at least one of the independent variables contributes significantly to the model.

    Inference on a Subset of the Parameters

    It is also possible to carry out a test of significance on a subset of the parameters, but we will not cover this.

    Confidence Intervals for Individual Coefficients

    By rearranging our test statistic for an individual coefficient parameter, we can obtain the following (1 − α)100% Confidence Interval for βⱼ for any j = 0, 1, 2, . . . , k:

    Pr( β̂ⱼ − t_{α/2,n−p}√(σ̂²Cjj) ≤ βⱼ ≤ β̂ⱼ + t_{α/2,n−p}√(σ̂²Cjj) ) = 1 − α

    where σ̂² = SSResidual/(n − p) = (y′y − β̂′X′y)/(n − p)

    Confidence Intervals for Individual Coefficients: Example

    Let us construct a confidence interval for β₃ in the graduation rate example. First let's calculate σ̂².

    If y′y = 68.9714 and β̂′X′y = 64.20232, then SSResidual = 4.769.

    Thus σ̂² = SSResidual/(n − p) = 4.769/(195 − 4) = 0.02497

    We know that β̂₃ = 5.1687 × 10⁻⁷ and C33 = 2.3045 × 10⁻¹⁰

    Thus our confidence interval is given by:

    β̂ⱼ ± t_{α/2,n−p}√(σ̂²Cjj)

    5.1687 × 10⁻⁷ ± t_{0.025,195−4}√(0.02497 × 2.3045 × 10⁻¹⁰)

    5.1687 × 10⁻⁷ ± 1.984√(0.02497 × 2.3045 × 10⁻¹⁰)

    5.1687 × 10⁻⁷ ± 4.759 × 10⁻⁶

    = (−4.24 × 10⁻⁶ , 5.28 × 10⁻⁶)


Thus we can say with 95% confidence that the change in graduation rate for a $1 increase in average student debt is between -4.24 \times 10^{-6} and 5.28 \times 10^{-6}

Notice that the confidence interval contains the value 0, which agrees with the conclusion of our hypothesis test earlier
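The interval can be reproduced with a short calculation such as the sketch below (assumed Python code, using the values quoted above; the exact critical value from software differs slightly from the table value 1.984).

```python
import math
from scipy import stats

beta_hat_3, C_33 = 5.1687e-7, 2.3045e-10     # values from the example above
sigma2_hat = 4.769 / (195 - 4)               # SS_Residual / (n - p)
t_val = stats.t.ppf(0.975, df=195 - 4)       # about 1.97

half_width = t_val * math.sqrt(sigma2_hat * C_33)
print(beta_hat_3 - half_width, beta_hat_3 + half_width)   # roughly (-4.2e-06, 5.3e-06)
```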

    Confidence Region for All Coefficients

One can also construct a joint confidence region for all the parameters

For a simple linear regression model, the joint confidence region for (\beta_0, \beta_1) would have the shape of a two-dimensional ellipse

This is outside the scope of this course, however

Confidence Interval for the Mean Response

As we did in simple linear regression, we can construct a confidence interval for the mean response at a particular point, say x^*, where

x^* = [1, x_1^*, x_2^*, \ldots, x_k^*]'

The mean response at this point is E(Y \mid x = x^*) = x^{*\prime}\beta

The estimated mean response at this point is \hat{y}^* = x^{*\prime}\hat{\beta}

A (1 - \alpha)100% Confidence Interval for E(Y \mid x = x^*) is given by:

Pr\left( \hat{y}^* - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*} \le E(Y \mid x = x^*) \le \hat{y}^* + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*} \right) = 1 - \alpha

    Confidence Interval for the Mean Response: Example

Let's find a confidence interval for the average graduation rate of universities which have an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

In this case, x^{*\prime} = [1, 0.5, 20, 20000], a 1 \times 4 matrix

Our point estimate is:

\hat{y}^* = x^{*\prime}\hat{\beta} = [1, 0.5, 20, 20000] \, [1.1095, -0.3798, -0.02789, 5.1687 \times 10^{-7}]'

= 1.1095 - 0.3798(0.5) - 0.02789(20) + 5.1687 \times 10^{-7}(20000)

= 0.3721


Thus we would predict that such universities would have an average graduation rate of 37.21%

The only thing left to calculate in our confidence interval formula is x^{*\prime}(X'X)^{-1}x^*

Using matrix multiplication we see this is equal to 0.03492

Thus our 95% confidence interval for E(Y \mid x = x^*) is:

\hat{y}^* \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*}

0.3721 \pm 1.984\sqrt{0.02497(0.03492)}

0.3721 \pm 0.0586

= (0.3135, 0.4307)

    Prediction Interval for a New Response

Also, as in simple linear regression, we can predict the value of the response Y for a new observation x^* and obtain an interval estimate for it

The predicted value is \hat{y}^* = x^{*\prime}\hat{\beta} (actually the same as \hat{y}^* above)

A (1 - \alpha)100% Prediction Interval for Y^* is:

Pr\left( \hat{y}^* - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)} \le Y^* \le \hat{y}^* + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)} \right) = 1 - \alpha

As in the simple linear regression case, we can see from the '1 +' that this prediction interval is wider than the confidence interval for the mean response

    Prediction Interval for a New Response: Example

Let us obtain a prediction interval for a particular university which has an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

Our point estimate is \hat{y}^*, which is actually the same as above; it equals 0.3721

Our 95% prediction interval is as follows:

\hat{y}^* \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)}

0.3721 \pm 1.984\sqrt{0.02497(1 + 0.03492)}

0.3721 \pm 0.3189

= (0.0532, 0.691)

We can see that this is a very wide (and not very useful) prediction interval
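Both intervals can be checked with the short sketch below (illustrative Python, using the quantities computed in the two examples above; the only difference between the two intervals is the '1 +' inside the square root).

```python
import math
from scipy import stats

y_hat_star = 0.3721      # point estimate x*' beta_hat
quad_form  = 0.03492     # x*' (X'X)^-1 x*
sigma2_hat = 0.02497     # SS_Residual / (n - p)
t_val = stats.t.ppf(0.975, df=195 - 4)

ci_half = t_val * math.sqrt(sigma2_hat * quad_form)         # mean response
pi_half = t_val * math.sqrt(sigma2_hat * (1 + quad_form))   # new response

print(y_hat_star - ci_half, y_hat_star + ci_half)   # close to (0.3135, 0.4307)
print(y_hat_star - pi_half, y_hat_star + pi_half)   # close to (0.0532, 0.6910)
```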


    Assessing Goodness of Fit of a Multiple Linear Regression Model

We can define r^2 just as we did for the simple linear regression model:

r^2 = 1 - \frac{SS_{Residual}}{SS_y} = 1 - \frac{y'y - \hat{\beta}'X'y}{y'y - n\bar{y}^2}

In this case it is referred to as the Multiple Coefficient of Determination

One of the disadvantages of this statistic is that it will always increase as more independent variables are added to the model

This will suggest that the fit is getting better even if the new variables are not significant

This problem led to the development of an alternative goodness of fit statistic for multiple linear regression called Adjusted r^2

    Adjusted r2

Adjusted r^2, written as \bar{r}^2, imposes a penalty for adding more terms to the model

It will thus decrease when we add an independent variable that does not contribute much explanatory power

\bar{r}^2 = 1 - \frac{SS_{Residual}/(n - p)}{SS_y/(n - 1)} = 1 - \frac{n - 1}{n - p}(1 - r^2)

r^2 and \bar{r}^2 for the Multiple Linear Regression Model: Example

In our university graduation rates example, we calculate r^2 as follows:

r^2 = 1 - \frac{y'y - \hat{\beta}'X'y}{y'y - n\bar{y}^2} = 1 - \frac{68.9714 - 64.20232}{68.9714 - 58.09986} = 1 - 0.4387 = 0.5613

This suggests that 56% of the variation in graduation rates can be explained by the three factors in the model


Now we calculate \bar{r}^2 as follows:

\bar{r}^2 = 1 - \frac{n - 1}{n - p}(1 - r^2) = 1 - \frac{195 - 1}{195 - 4}(1 - 0.5613) = 1 - 0.4456 = 0.5544

In this case, there is not much difference between the two, because the sample size n is very large compared to the number of parameters p
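Both statistics follow directly from the sums of squares; a small sketch (illustrative Python) using the numbers above:

```python
yty, bXy, n_ybar2 = 68.9714, 64.20232, 58.09986   # y'y, beta_hat' X'y, n * ybar^2
n, p = 195, 4

ss_resid = yty - bXy          # SS_Residual
ss_y     = yty - n_ybar2      # SS_y

r2     = 1 - ss_resid / ss_y
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
print(round(r2, 4), round(r2_adj, 4))   # 0.5613 and 0.5544
```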

    Model Selection Algorithms

Various algorithms (procedures) have been proposed for selecting which variables to include in a model

This is particularly important when there are many possible independent variables to choose from

We do not want to miss out on variables that contribute significantly to the model, but we also don't want to include unnecessary variables which make our estimates less precise

The three most common algorithms that are used are:

1. Backward Elimination

2. Forward Selection

3. Stepwise Selection

    Backward Elimination

Backward Elimination starts with a full model consisting of all possible independent variables, and cuts it down until the 'best' model is achieved (a rough code sketch of the procedure follows the list below)

The algorithm proceeds as follows:

1. Begin with a model including all possible independent variables

2. Estimate the model and take note of the t_{observed} statistic values for the individual coefficients (not including \beta_0)

3. Choose the coefficient with the smallest |t_{observed}|; call it \hat{\beta}_j

4. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level

5. If the null hypothesis is rejected, we accept this as our final model

6. If the null hypothesis is not rejected, we remove the variable x_j from the model and repeat from step (2)
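The sketch below is one rough way the procedure could be coded in Python with NumPy and SciPy (the function and variable names are illustrative, and this is not the SAS procedure used in the tutorial).

```python
import numpy as np
from scipy import stats

def backward_elimination(X, y, names, alpha=0.05):
    """Rough sketch: drop the least significant variable until all remaining ones are significant."""
    keep = list(range(X.shape[1]))
    while keep:
        Xd = np.column_stack([np.ones(len(y)), X[:, keep]])    # design matrix with intercept
        n, p = Xd.shape
        XtX_inv = np.linalg.inv(Xd.T @ Xd)
        beta = XtX_inv @ Xd.T @ y                              # least squares estimates
        resid = y - Xd @ beta
        sigma2 = resid @ resid / (n - p)                       # error variance estimate
        t_stats = beta / np.sqrt(sigma2 * np.diag(XtX_inv))    # t statistic for each coefficient
        j = 1 + int(np.argmin(np.abs(t_stats[1:])))            # slope with the smallest |t| (skip the intercept)
        if abs(t_stats[j]) > stats.t.ppf(1 - alpha / 2, n - p):
            return [names[i] for i in keep], beta              # smallest |t| is significant: stop
        del keep[j - 1]                                        # otherwise drop that variable and refit
    return [], None                                            # no variable survived
```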


    Forward Selection

Forward Selection works in the opposite direction: it begins with an empty model and adds variables until the 'best' model is achieved

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable

2. Identify the independent variable with the highest |t_{observed}| value in its simple linear regression with y

3. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level in this simple linear regression model

4. If we reject H_0, we add x_j to the multiple linear regression model, proceed to the independent variable with the next highest |t_{observed}| in its simple linear regression with y, and repeat from step (3)

5. If the null hypothesis is not rejected, we conclude x_j is not significant to the model, so we do not add it. We also realise that none of the other independent variables with smaller |t_{observed}| will be significant; thus the model is final and we are done

    Stepwise Selection

Stepwise Selection combines elements of both Backward Elimination and Forward Selection

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable

2. Identify the independent variable with the highest |t_{observed}| value in its simple linear regression with y

3. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level in this simple linear regression model

4. If we reject H_0, we add x_j to the multiple linear regression model

So far the algorithm is exactly like Forward Selection; but now it changes

5. Carry out a t test from the multiple linear regression model for the significance of each \beta_j in the model so far

6. If the null hypothesis is not rejected for any of these \beta_j, we delete the corresponding x_j from the model

7. Proceed to the independent variable with the next highest |t_{observed}| in its simple linear regression with y, and repeat from step (3)

8. Once we reach a point where all the variables in the model are significant, and none of the variables outside the model are significant, this is our final model


    Model Selection Algorithms: Example

It is easier to see an example in the tutorial using SAS, since these algorithms are very tedious to carry out by hand

In the case of our Graduation Rate example, all three algorithms lead to the same result: we keep x_1 and x_2 in the model and drop x_3

Note: there are other model selection algorithms, but we will not cover them

    Residual Analysis

    Revisiting Model Assumptions

Remember that the assumptions of the multiple linear regression model include the following:

All error terms have a zero mean, i.e. E(\epsilon_i) = 0 for all i

All error terms have the same fixed variance, i.e. Var(\epsilon_i) = \sigma^2 for all i

All observations are independent of each other

The error terms follow the normal distribution

None of the x variables are highly correlated with one another

Whenever we are applying a multiple linear regression model it is important to check these assumptions

    Model Adequacy

The first four of these assumptions can be assessed using residual analysis: that is, looking at the residuals of the model

There are two basic ways to do this: Graphical Analysis and Hypothesis Tests

In this module we will only look at graphical analysis (the hypothesis testing approach will be taught in Econometrics in third year)

    Graphical Residual Analysis

Remember that the residuals are defined as e = y - \hat{y}, that is, e_i = y_i - \hat{y}_i

To calculate the residuals we first determine the least squares regression equation and then obtain the predicted value for each x_i in the sample; then we subtract these predicted values from the observed y_i values in the sample

Once we have the residuals we can plot the residuals (vertical axis) against the predicted values (horizontal axis)

One can gain a lot of information about the model by looking at this plot
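A possible sketch of such a plot in Python with matplotlib (illustrative only; the function name and styling are assumptions, not part of the module's SAS material):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(X, y):
    """Fit by least squares, then plot residuals against predicted values."""
    Xd = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # least squares estimates
    fitted = Xd @ beta
    resid = y - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, linestyle="--")                   # reference line at zero
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.show()
```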


    Plot of Residuals vs. Predicted Values

The main things to look for in the plot are patterns or unusual points

Ideally, the points should be evenly distributed above and below zero and should appear completely random

In this plot we can see that the points appear random


Do you see anything different in this plot?

The variance of the residuals appears to increase as \hat{y} increases

    Normal Quantile-Quantile Plot

A normal quantile-quantile plot is a useful tool for checking whether the residuals are normally distributed

If so, the points should fall approximately in a straight line

    Does this QQ plot look normally distributed?


    How about this one?


    Histogram of Residuals

Another way to check normality is to plot a histogram of the residuals and see if it is bell shaped

    How about this one?
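For reference, a Q-Q plot and a histogram of the residuals could be produced with a sketch like the following (illustrative Python; the function name is an assumption):

```python
import matplotlib.pyplot as plt
from scipy import stats

def normality_plots(resid):
    """Normal Q-Q plot and histogram of the residuals (resid is a 1-D array)."""
    fig, (ax1, ax2) = plt.subplots(1, 2)
    stats.probplot(resid, dist="norm", plot=ax1)     # points near a straight line suggest normality
    ax2.hist(resid, bins=10)                         # should look roughly bell shaped
    ax2.set_xlabel("Residual")
    plt.show()
```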


    Summary of Graphical Analysis of Residuals

Graphical analysis of residuals is a useful diagnostic tool for determining model adequacy

However it has limitations: often the results can be inconclusive

This is especially true for small sample sizes

    Outlier Diagnostics

We can also use the residuals to look for outliers: values which the model predicts extremely badly

While we could simply look at the residuals themselves, it is better to scale them in some way

Analogy to z-scores from STA100A: we don't only want to know how far an observation is from its mean; we want to know how many standard deviations away it is

A basic way to scale the residuals would be to divide them by their estimated standard deviation:

d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}}

This is called the standardized residual

Since these residuals should be approximately normally distributed with mean 0 and variance 1, they should almost always lie in the range -3 \le d_i \le 3

Thus we could define an outlier as any observation whose standardized residual is > 3 or < -3

A refinement, which accounts for the fact that the residuals do not all have the same variance, divides instead by \sqrt{\hat{\sigma}^2(1 - h_{ii})}, where h_{ii} is the i-th diagonal element of the hat matrix H = X(X'X)^{-1}X'; this gives the internally studentized residual r_i = e_i / \sqrt{\hat{\sigma}^2(1 - h_{ii})}


    Outlier Diagnostics: Externally Studentized Residuals

The only weakness of the internally studentized residual is that the variance estimate \hat{\sigma}^2 used in calculating r_i is influenced by the i-th observation

It may be thrown off by an outlier; thus r_i is not ideal for outlier detection

Instead, for each observation, we could estimate the variance using a data set of n - 1 observations with the i-th observation removed, and use this estimate S_{(i)}^2 in the scaling formula

It can be shown that:

S_{(i)}^2 = \frac{(n - p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}

If we replace \hat{\sigma}^2 with S_{(i)}^2 in the internally studentized residual formula we get:

t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1 - h_{ii})}}

This is known as the externally studentized residual and is the best way of scaling residuals

    Hypothesis Test for Outliers

A further advantage is that, under the model assumptions, t_i \sim t(n - p - 1)

One could carry out a hypothesis test on each observation to check if it is an outlier:

1. H_0: the i-th observation is not an outlier vs. H_A: the i-th observation is an outlier

2. \alpha = 0.05

3. Test statistic is |t_i|

4. Rejection rule: Reject H_0 if |t_i| > t_{\alpha/(2n), n-p-1}

5. Compute the observed t_i and reach a decision

6. State the conclusion

The reason why we have \alpha/(2n) instead of \alpha/2 is that we are running the hypothesis test n times, so we are basically dividing up the overall type I error probability among the n individual tests (this is known as the Bonferroni approach)


y_i    x_i    \hat{y}_i    e_i        d_i        r_i        t_i
19     8      18.325       0.675      0.2008     0.2178     0.1997
17     7      16.275       0.725      0.2157     0.2450     0.2248
23     10     22.425       0.575      0.1711     0.1856     0.1699
22     9      20.375       1.625      0.4835     0.5169     0.4827
33     14     30.625       2.375      0.7067     1.4133     1.5696
18     7      16.275       1.725      0.5133     0.5830     0.5480
16     7      16.275      -0.275     -0.0818    -0.0929    -0.0849
15     10     22.425      -7.425     -2.2092    -2.3962   -10.5468

    Outlier Diagnostics: Example

Suppose we have the set of data shown in the first two columns of the table above (n = 8)

When we estimate the simple linear regression model y_i = \beta_0 + \beta_1 x_i + \epsilon_i using the least squares method, we get:

\hat{\beta}_0 = 1.925, \hat{\beta}_1 = 2.05

We can substitute each of our x_i for x in the fitted equation \hat{y} = 1.925 + 2.05x to obtain the predicted values \hat{y}_i, which are in the third column of the table above

We can then calculate the residuals: e_i = y_i - \hat{y}_i (see the fourth column of the table)

To calculate the standardized residuals we first need to calculate \hat{\sigma}^2:

\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i=1}^{n} e_i^2 = \frac{1}{6}\left[0.675^2 + 0.725^2 + \cdots + (-7.425)^2\right] = 11.296

Now we have d_i = e_i / \sqrt{\hat{\sigma}^2} (see the calculated values in the fifth column)

Next we can calculate the internally studentized residuals. We first need to calculate the hat matrix H = X(X'X)^{-1}X'

In this case, X is the 8 \times 2 matrix

X =
1   8
1   7
1   10
1   9
1   14
1   7
1   7
1   10


Taking the diagonal elements of H and using them in the formula r_i = e_i / \sqrt{\hat{\sigma}^2(1 - h_{ii})}, we get the values in the sixth column of the table above

Next we calculate the externally studentized residuals. We first need to calculate

S_{(i)}^2 = \frac{(n - p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}

Then we plug these into the following formula to get the values in the seventh column:

t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1 - h_{ii})}}

It is now apparent for the first time that the 8th observation is an outlier

    Hypothesis Test for Outliers: Example

We conduct the hypothesis test described above for each of the 8 observations, at the \alpha = 0.05 level

In every case, our rejection rule is: reject H_0 if |t_i| > t_{\alpha/(2n), n-p-1} = t_{0.003125, 5}

We don't have a column for 0.003125 in our t table, so we can take the average of the entries in the 0.005 and 0.001 columns to get an approximation: (4.030 + 5.876)/2 = 4.953

We reject H_0 for all observations for which |t_i| > 4.953; in this case we reject only for the 8th observation

Thus we conclude that the 8th observation is an outlier and none of the others are
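For reference, the sketch below recomputes these diagnostics in Python for the data in the table (illustrative only; because of rounding the values may differ from the table in the last digit, but only the 8th observation is flagged):

```python
import numpy as np
from scipy import stats

# The x and y columns of the table above
x = np.array([8, 7, 10, 9, 14, 7, 7, 10], dtype=float)
y = np.array([19, 17, 23, 22, 33, 18, 16, 15], dtype=float)

X = np.column_stack([np.ones_like(x), x])             # design matrix for the simple linear model
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                               # least squares estimates
e = y - X @ beta                                       # raw residuals
h = np.diag(X @ XtX_inv @ X.T)                         # leverages h_ii (diagonal of the hat matrix)
sigma2 = e @ e / (n - p)                               # estimate of sigma^2

d = e / np.sqrt(sigma2)                                # standardized residuals
r = e / np.sqrt(sigma2 * (1 - h))                      # internally studentized residuals
s2_del = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))                      # externally studentized residuals

t_crit = stats.t.ppf(1 - 0.05 / (2 * n), df=n - p - 1)     # Bonferroni cut-off
print(np.where(np.abs(t) > t_crit)[0])                 # only index 7 (the 8th observation) is flagged
```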

    Influence Diagnostics

Sometimes, a small subset of observations (even one observation) exerts a disproportionate influence on the fitted regression model

In other words, the parameter estimates depend more on these few observations than on the majority of the data

We would like to be able to locate these influential observations and possibly eliminate them

    Leverage

The elements of the hat matrix, h_{ij}, describe the amount of influence exerted by y_j on \hat{y}_i

Thus a basic measure of the influence of an observation, known as the leverage, is given by h_{ii}


The properties of the hat matrix H include that the sum of all n diagonal elements is equal to p, that is:

\sum_{i=1}^{n} h_{ii} = p

Therefore, the average h_{ii} value would be p/n

As a rule of thumb, any observation i such that h_{ii} > 2p/n would be called a high-leverage observation

Cook's Distance

The leverage only takes into account the location of an observation's x values

A more sophisticated measure of influence would take into account the location of both the x and y values of an observation

The Cook's Distance is one such measure

Let \hat{\beta} be the usual least squares parameter estimates from all n observations, and let \hat{\beta}_{(i)} be the least squares parameter estimates where the i-th observation has been deleted from the data

Then the Cook's Distance is defined as:

D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{p \, MS_{Residual}}

The Cook's Distance formula can also be expressed in terms of the internally studentized residuals:

D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}

In general, if D_i > 1 we say that the i-th observation is influential

Influence Diagnostics: Example

With the outlier data set used above, the h_{ii} values are:

h_{ii} = [0.15, 0.225, 0.15, 0.125, 0.75, 0.225, 0.225, 0.15]

In this case 2p/n = 2(2)/8 = 0.5. Since h_{55} = 0.75 > 0.5, we can say that the 5th observation is a high-leverage observation

We can calculate the Cook's Distance using the formula D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}

In this case, D_i = [0.0042, 0.0087, 0.0030, 0.0191, 2.9961, 0.0493, 0.0013, 0.5066]

Since D_5 > 1 we can again say that the 5th observation is influential
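A small helper like the one sketched below (illustrative Python; the function name is an assumption) computes both measures from quantities we have already met:

```python
import numpy as np

def influence_measures(X_design, e, sigma2):
    """Leverage and Cook's distance from the design matrix, residuals and sigma^2 estimate."""
    n, p = X_design.shape
    h = np.diag(X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T)   # leverages h_ii
    r = e / np.sqrt(sigma2 * (1 - h))          # internally studentized residuals
    D = (r**2 / p) * (h / (1 - h))             # Cook's distance
    return h, h > 2 * p / n, D, D > 1          # leverages, high-leverage flags, D_i, influential flags
```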


    Multicollinearity

Multicollinearity occurs when two or more of the x variables have a strong linear relationship with each other

This makes the estimates less precise

In fact, if two or more x variables have a perfect linear relationship, we cannot use the method of least squares at all

Technically this is because the X'X matrix is not invertible

In most cases the multicollinearity will not be perfect; but if it is strong, it can still ruin the model

How do we know if there is multicollinearity?

    Detecting Multicollinearity

The simplest way to detect multicollinearity is to calculate the Pearson correlation coefficient between each pair of independent variables x_s and x_t

A rule of thumb says that if any of these correlation coefficients is higher than 0.7 in absolute value, there is serious multicollinearity

SAS can also provide us with variance inflation factor (VIF) estimates, which tell us by what factor the error variance increases due to multicollinearity in a particular independent variable

A rule of thumb says that if the VIF > 5 for any independent variable, there is serious multicollinearity involving that variable

The simplest way of resolving multicollinearity is to remove one of the offending x variables

    Multicollinearity: Example

The table below gives the cost of adding a new communications node to a network, along with three independent variables thought to explain this cost: the number of ports available for access (x_1), the bandwidth (x_2), and the port speed (x_3)

When we estimate the model Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i using Ordinary Least Squares, we get the fitted equation:

\hat{y} = 17487 - 14168 x_1 + 81.39 x_2 + 1523.7 x_3

Continue from SAS project


y_i      x_{1i}   x_{2i}   x_{3i}
52388    68       58       653
51761    52       179      499
50221    44       123      422
36095    32       38       307
27500    16       29       154
57088    56       141      538
54475    56       141      538
33969    28       48       269
31309    24       29       230
23444    24       10       230
24269    12       56       115
53479    52       131      499
33543    20       38       192
33056    24       29       230
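As a rough illustration (in Python rather than SAS), the pairwise correlations and VIFs for this data set can be computed as sketched below; the very strong linear relationship between the number of ports (x_1) and the port speed (x_3) shows up immediately:

```python
import numpy as np

# Ports, bandwidth and port speed columns from the table above
X = np.array([
    [68, 58, 653], [52, 179, 499], [44, 123, 422], [32, 38, 307],
    [16, 29, 154], [56, 141, 538], [56, 141, 538], [28, 48, 269],
    [24, 29, 230], [24, 10, 230], [12, 56, 115], [52, 131, 499],
    [20, 38, 192], [24, 29, 230],
], dtype=float)

print(np.corrcoef(X, rowvar=False))    # pairwise correlations between x1, x2, x3

# VIF for each x_j: regress x_j on the other x's and compute 1 / (1 - R^2)
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Xd, X[:, j], rcond=None)[0]
    resid = X[:, j] - Xd @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
    print(f"VIF for x{j + 1}: {1 / (1 - r2):.1f}")
```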

    Changes in Functional Form

What if there is a non-linear relationship between Y and x? E.g. quadratic, cubic, logarithmic, etc.

We can still use linear regression just as before, but with the independent variables transformed appropriately

Changes in Functional Form: Example 1

Example with a quadratic term

Changes in Functional Form: Example 2

Example with an ln term (log base e)

Interpretation: \beta_1 is the expected change in y for a one unit increase in \ln x

This can also be expressed in terms of a change in x:

\beta_1 is the expected change in y when x is multiplied by e = 2.718, that is, when x increases by 171.8%

More generally, the expected change in y for a \delta% increase in x would be \beta_1 \ln\left(\frac{100 + \delta}{100}\right)

Thus the expected change in y for a 10% increase in x would be 0.0953\beta_1

For small \delta, \ln\left(\frac{100 + \delta}{100}\right) \approx \frac{\delta}{100}, and so we can say, approximately, that \frac{\beta_1}{100} is the expected change in y for a 1% increase in x
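The two ideas above, transforming a column of the design matrix and interpreting a log term, can be sketched as follows (illustrative Python; the function names are assumptions):

```python
import numpy as np

# A quadratic or logarithmic relationship is handled by transforming
# columns of the design matrix before running ordinary least squares
def design_with_transforms(x1, x2):
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.column_stack([np.ones(len(x1)), x1, x1**2, np.log(x2)])   # intercept, x1, x1^2, ln(x2)

# Interpreting a log term: expected change in y for a delta% increase in x
def log_term_effect(beta_log, delta_percent):
    return beta_log * np.log((100 + delta_percent) / 100)

print(log_term_effect(1.0, 10))   # about 0.0953, i.e. 0.0953 * beta_1 for a 10% increase
```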


    Transformations of the Dependent Variable

Used to make the data fit a normal distribution better

Used to resolve the problem of non-constant variance

Common transformations include:

y^* = \ln(y)

y^* = \sqrt{y}

The Box-Cox Transformation is a method used to choose the best transformation for y


    Box-Cox Transformation

The Box-Cox Transformation consists of estimating a new parameter \lambda

(This \lambda has nothing to do with the Poisson distribution)

The value of \lambda is the best 'power' to use in transforming y; for instance:

If \lambda = 2, we use the transformation y^* = y^2

If \lambda = \frac{1}{2}, we use the transformation y^* = y^{1/2} = \sqrt{y}

In the special case \lambda = 0 we use the transformation y^* = \ln(y)

SAS can estimate the parameter \lambda for us
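Outside of SAS, the same estimate can be obtained in Python; the sketch below uses scipy.stats.boxcox on some artificial positive, right-skewed data (the data themselves are made up purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)   # positive, right-skewed example data

y_transformed, lam = stats.boxcox(y)               # lambda is chosen by maximum likelihood when not supplied
print(round(lam, 2))      # a value near 0 suggests the log transformation y* = ln(y)
```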

Box-Cox Transformation: Example

Interaction Terms

    Dummy Variables

    Do two-category only; save rest for econometrics

5 Logistic Regression

Different kinds of Dependent Variables

Throughout our study of linear regression models, we have assumed that the dependent variable is a normally distributed random variable

However, in practice we may want to build models for data that are not normally distributed

For the rest of the module we will be looking at some of these models

Categorical Dependent Variable

We already studied models with dummy (categorical) independent variables

But what if the dependent variable is categorical?

If the dependent variable has two possible values (like a Bernoulli random variable), then it is called binary

A Bernoulli random variable is a binomial random variable where the number of trials is n = 1

For example, the dependent variable could be:

Y_i = 1 if the i-th product is defective, 0 if the i-th product is OK

Or:

Y_i = 1 if the i-th patient recovers, 0 if the i-th patient dies

We can construct models for this kind of dependent variable

They will be quite different from linear regression models, but still have some key similarities, since both types of models are classified as Generalized Linear Models


    Generalized Linear Models

Generalized Linear Models are a class of models, some of the properties of which are:

1. We have n independent response observations y_1, y_2, \ldots, y_n with theoretical means \mu_1, \mu_2, \ldots, \mu_n

2. The observation y_i is a random variable with a probability distribution from the exponential family (which basically means its probability mass function or probability density function has an e in it)

3. The mean response vector is related to a linear predictor \eta = x'\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k

4. The relationship between \eta_i and \mu_i is expressed by a link function g, so that \eta_i = g(\mu_i), i = 1, 2, \ldots, n

By taking the inverse of this function we can also write \mu_i = E(y_i) = g^{-1}(\eta_i) = g^{-1}(x_i'\beta)

In the case of linear regression:

The link function is g(\mu_i) = \mu_i, so E(Y_i) = \mu_i = \eta_i = x_i'\beta

The dependent variable follows a normal distribution

In summary, Y_i \sim N(x_i'\beta, \sigma^2) (this is a way of writing the model without \epsilon_i)

    Logistic Regression Model

If each Y_i follows a Bernoulli distribution (binomial with n = 1), with probability of success Pr(Y_i = 1) = p_i and probability of failure 1 - p_i, then \mu_i = E(Y_i) = p_i

If we again used the identity link function g(\mu_i) = \mu_i, then our model would be p_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k

It is easy to see that this is a bad idea, because the predicted values of the model would not necessarily be between 0 and 1

A better model uses the link function g(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right)

The quantity \frac{p_i}{1 - p_i} is called an odds: it is the ratio of the probability of success to the probability of failure

Thus the link function gives the log odds, also known as the logit or logistic function

This means the model can be expressed as follows:

\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k


By taking the inverse of the link function we can also express the model like this:

E(Y_i) = p_i = \frac{1}{1 + e^{-x_i'\beta}}

where x_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}]

Notice that there is no error term \epsilon_i in this model

Remember that the p_i are probabilities and thus range between 0 and 1

A graph of g(p_i) is as follows (it is undefined at 0 and 1):

    Parameter Estimation in Logistic Regression

Just like in linear regression, our first task is to estimate the parameter vector

\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]'

However, we can no longer use the Method of Least Squares (Why?)

Instead we use the Method of Maximum Likelihood


We will not explain the details of this method

Unfortunately this method requires an iterative procedure and cannot easily be calculated by hand

However, computer software such as SAS can compute the estimates \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k quite easily

    Interpreting Parameters in Logistic Regression

More important for our purpose is to be able to interpret what the parameter estimates tell us

The parameter estimates themselves are interpreted as log-odds ratios, while e^{\hat{\beta}_1}, for instance, would be interpreted as an odds ratio

It is best to illustrate what these terms mean using an example

    Logistic Regression Example

Consider a data set of 200 people admitted to the intensive care unit at a hospital

The dependent variable is whether they died:

y_i = 1 if the person died, 0 if the person survived

The first independent variable is the type of admission to ICU:

x_{i1} = 1 if they were admitted via emergency services, 0 if they were self-admitted

The second independent variable x_{i2} is the person's systolic blood pressure in mm Hg

The estimated model is:

\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}

which can also be written as:

Pr(Y_i = 1) = p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}}


We estimate the parameters in SAS and our fitted equation is:

\ln\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = -1.33 + 2.022 x_{i1} - 0.014 x_{i2}

Or:

\hat{Pr}(Y_i = 1) = \hat{p}_i = \frac{1}{1 + e^{-(-1.33 + 2.022 x_{i1} - 0.014 x_{i2})}}
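As a quick illustration (assumed Python code, not part of the SAS output), the fitted equation can be used to compute the estimated probability of death; the patient values below are hypothetical, chosen only to show the calculation:

```python
import math

# Fitted logistic model from the ICU example above (coefficients as quoted in the notes)
b0, b1, b2 = -1.33, 2.022, -0.014

def predicted_probability(emergency, systolic_bp):
    """Estimated Pr(Y = 1), i.e. probability of death, for given covariate values."""
    eta = b0 + b1 * emergency + b2 * systolic_bp      # linear predictor (log odds)
    return 1 / (1 + math.exp(-eta))

# Hypothetical patient: emergency admission (x1 = 1), systolic blood pressure 120 mm Hg
print(round(predicted_probability(1, 120), 3))
```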

Now to interpret the parameters: as in linear regression, \beta_0 represents the case when all independent variables take a value of 0

In this case, if x_{i1} = 0 (meaning the person was self-admitted) and t