sta 200 b article
TRANSCRIPT
-
8/22/2019 Sta 200 b Article
1/95
Faculty of Applied Sciences, Department of Mathematics and Physics
Statistical Methods 2B Lecture Notes
Lecturer: Mr. T. Farrar
Contents
1 Review of Random Variables and Probability Distributions
2 Correlation Analysis of Paired Data Sets
3 Simple Linear Regression Analysis
4 Multiple Linear Regression
5 Logistic Regression
6 Poisson Regression
1 Review of Random Variables and Probability Distributions
What you will be expected to already know
1. Descriptive Statistics
2. Basic Probability concepts
3. Graphical methods of displaying data (line graph, scatter plot, histogram)
4. Random Variables and Probability Distributions (Discrete and continuous)
5. Special probability distributions (binomial, Poisson, normal)
6. Hypothesis Testing (t-tests, F-tests, χ² tests, nonparametric tests, p-values)
7. Basic calculus
8. Matrices
Discrete Random Variables
Definition: A random variable is a variable which takes on its values by chance.
Definition: The sample space S (a.k.a. support) is the set of possible values that a random variable may take.
A random variable is discrete if it can take only a finite or countably infinite number of distinct values. Usually a discrete random variable only takes on integer values.
E.g. Number of defective television sets in a shipment of 100 sets: S = {0, 1, 2, . . . , 100}
E.g. Number of visits to a website in one year: S = {0, 1, 2, . . .}
We use an uppercase letter such as Y to denote a random variable, and a lowercase letter such as y to denote a particular value that the random variable may assume.
Discrete Probability Distributions
We may denote the probability that Y takes on the value y by Pr(Y = y). This probability is subject to the following restrictions:
1. 0 ≤ Pr(Y = y) ≤ 1 for all y (all probabilities must be between 0 and 1)
2. Σ_{y∈S} Pr(Y = y) = 1 (the sum of probabilities over the whole sample space must be 1)
E.g. Rolling a six-sided die: let Y be the number that comes up. Pr(Y = y) = 1/6, y = 1, 2, 3, 4, 5, 6.
It is easy to see that both restrictions hold.
The probability distribution of the lengths of patent lives for new drugs is given below. The patent life refers to the number of years a company has to make a profit from the drug after it is approved before competitors may produce the same drug.
Years, y     3    4    5    6    7    8    9    10   11   12   13
Pr(Y = y)   .03  .05  .07  .10  .14  .20  .18  .12  .07  .03  .01
The function that maps all values in the sample space to their probabilities is called a probability mass function.
It may be expressed in a table (as above) or as a mathematical formula.
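As a quick check, the patent-life table above can be verified against both restrictions on a probability mass function. A minimal Python sketch (with the table transcribed as a dict):

```python
# Patent-life pmf from the table above, keyed by years y
pmf = {3: .03, 4: .05, 5: .07, 6: .10, 7: .14, 8: .20,
       9: .18, 10: .12, 11: .07, 12: .03, 13: .01}

# Restriction 1: every probability lies between 0 and 1
print(all(0 <= p <= 1 for p in pmf.values()))  # True

# Restriction 2: the probabilities sum to 1 over the sample space
total = sum(pmf.values())
print(round(total, 10))  # 1.0
```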
We can use a graph to represent the probability mass function:
Suppose the law dictates that the sentence (in years) for a particular crime must be between 5 and 10 years in prison. By looking at past cases a lawyer is able to construct the following probability distribution for the number of years to which a person convicted of the crime is sentenced:
f(y) = 0.4471/√y , y = 5, 6, 7, 8, 9, 10
Hence the probability that a person convicted of this crime receives a 6-year sentence is
f_Y(6) = 0.4471/√6 = 0.1825
As an exercise, graph this probability mass function and verify that it satisfies the two restrictions on probability mass functions.
Expected Value of a Discrete Random Variable
We can define the expected value of a random variable as follows:
E(Y) = Σ_{y∈S} y f(y)
If f(y) accurately characterises the population described by the random variable Y, then E(Y) = μ, the population mean.
In our prison sentencing example:
E(Y) = Σ_{y=5}^{10} y × (0.4471/√y)
= Σ_{y=5}^{10} 0.4471 √y
= 0.4471 (√5 + √6 + √7 + √8 + √9 + √10)
= 7.298
Thus, we would expect the average sentence to be 7.3 years.
It can also be shown that for any real-valued function g(Y), the expected value of g(Y) is given by:
E(g(Y)) = Σ_{y∈S} g(y) f(y)
Variance of a Discrete Random Variable
We can define the variance of a random variable as follows:
σ² = Var(Y) = E[(Y − μ)²]
= E(Y²) − μ² (why?)
= Σ_{y∈S} y² f(y) − [E(Y)]²
In our prison sentencing example:
Var(Y) = Σ_{y=5}^{10} y² (0.4471/√y) − [E(Y)]²
= Σ_{y=5}^{10} 0.4471 y^(3/2) − 7.298²
= 0.4471 [5^(3/2) + 6^(3/2) + 7^(3/2) + 8^(3/2) + 9^(3/2) + 10^(3/2)] − 7.298²
= 56.177 − 53.261 = 2.916
Thus, the variance of Y is 2.916 and the standard deviation is √2.916 = 1.71.
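The sentencing calculations above can be reproduced numerically. A short sketch recomputing E(Y), Var(Y) and σ from f(y) = 0.4471/√y (the variance comes out as roughly 2.92 at full precision; the notes' 2.916 arises from squaring the rounded mean 7.298):

```python
import math

# Sentencing pmf: f(y) = 0.4471 / sqrt(y), y = 5, ..., 10
f = {y: 0.4471 / math.sqrt(y) for y in range(5, 11)}

mean = sum(y * p for y, p in f.items())    # E(Y)
ey2 = sum(y**2 * p for y, p in f.items())  # E(Y^2)
var = ey2 - mean**2                        # Var(Y) = E(Y^2) - E(Y)^2
sd = math.sqrt(var)

print(round(mean, 3))  # 7.298
print(round(sd, 2))    # 1.71
```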
Properties of Expected Value
Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then E(aY) = a E(Y).
Proof:
E(aY) = Σ_{y∈S} a y f(y)
= a Σ_{y∈S} y f(y)
= a E(Y)
As an exercise, prove that if b is a constant, then E(b) = b. As a further exercise, if Y1 and Y2 are two random variables, prove that E(Y1 + Y2) = E(Y1) + E(Y2).
Properties of Variance
Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then Var(aY) = a² Var(Y).
Proof:
Var(aY) = E(a²Y²) − [E(aY)]²
= a² E(Y²) − a² [E(Y)]²
= a² (E(Y²) − [E(Y)]²)
= a² Var(Y)
As an exercise, prove that if b is a constant, then Var(b) = 0.
Special Discrete Probability Distributions
Binomial Distribution
The binomial distribution relates to a binomial experiment, which has the following five properties:
1. The experiment consists of a fixed number of trials, n
2. Each trial results in one of two outcomes, called success and failure (denoted 1 and 0)
3. The probability of success in each trial is equal to p and the probability of failure is 1 − p (sometimes called q)
4. All the trials are independent of one another
5. The random variable of interest is Y, the total number of successes observed in the n trials
The probability mass function for the binomial distribution is as follows:
f(y) = (n choose y) p^y (1 − p)^(n−y) , y = 0, 1, 2, . . . , n and 0 ≤ p ≤ 1
We can derive this function using the multiplicative probability rule for independent events and the concept of combinations.
We have y successes and n − y failures, and there are n!/(y!(n − y)!) = (n choose y) ways to arrange them in order.
Here is a graph of the binomial probability mass function where n = 15 and p = 0.4:
As an exercise, draw the binomial probability mass function where n = 9 and p = 0.8.
Mean and Variance of Binomial Distribution
The mean of a binomially distributed random variable is E(Y) = np. The variance of a binomially distributed random variable is Var(Y) = np(1 − p).
Binomial Example
There is an English saying, "Don't count your chickens before they hatch."
A farmer is breeding chickens. He has 15 hens that each lay one egg per day. The eggs are then placed in incubators.
He has observed that there is an 80% hatchability rate, that is, an 80% probability that an egg will hatch into a live chick.
1. How many live chicks should the farmer expect per day?
E(Y) = np = 15 × 0.8 = 12
2. What is the probability that at least 13 eggs from a given day will hatch?
Pr(Y ≥ 13) = Pr(Y = 13) + Pr(Y = 14) + Pr(Y = 15)
= (15 choose 13)(0.8)^13 (1 − 0.8)^(15−13) + (15 choose 14)(0.8)^14 (1 − 0.8)^(15−14) + (15 choose 15)(0.8)^15 (1 − 0.8)^(15−15)
= 0.2309 + 0.1319 + 0.0352 = 0.398
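The hatchability answers can be checked with a short script; `math.comb` supplies the binomial coefficient:

```python
import math

def binom_pmf(y, n, p):
    # Pr(Y = y) = C(n, y) p^y (1 - p)^(n - y)
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 15, 0.8
expected = n * p                                         # E(Y) = np
p_at_least_13 = sum(binom_pmf(y, n, p) for y in range(13, 16))

print(expected)                 # 12.0
print(round(p_at_least_13, 3))  # 0.398
```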
Negative Binomial Probability Distribution
While a binomial random variable measures the number of successes in n trials of a binomial experiment where n is fixed, a negative binomial random variable measures the number of trials Y required for k successes to occur.
We could think of this as the event A ∩ B where A is the event that the first y − 1 trials contain k − 1 successes and B is the event that the yth trial results in a success.
f(y) = Pr(A ∩ B) = Pr(A) Pr(B) (since A and B are independent)
Pr(A) = (y−1 choose k−1) p^(k−1) q^(y−k) , y ≥ k (by the binomial distribution); Pr(B) = p
Thus f(y) = (y−1 choose k−1) p^k q^(y−k) , y = k, k+1, k+2, . . .
Negative Binomial Distribution
Here is a graph of the negative binomial probability mass function where k = 3 and p = 0.6 (going as far as y = 17):
As an exercise, draw the negative binomial probability mass function where k = 2 and p = 0.5, up to y = 10.
Mean and Variance of Negative Binomial Distribution
The mean of a negative binomial random variable is E(Y) = k/p.
The variance of a negative binomial random variable is Var(Y) = k(1 − p)/p².
Negative Binomial Distribution Example
Each time a fisherman casts his line into the water there is a probability of 1/8 that he will catch a fish.
Today he has decided that he will continue casting his line until he catches 5 fish.
1. What is the expected number of casts required to catch 5 fish?
E(Y) = k/p = 5/0.125 = 40
2. What is the standard deviation of the number of casts required to catch 5 fish?
Var(Y) = 5(1 − 0.125)/0.125² = 280
σ = √Var(Y) = √280 = 16.73
4. What is the probability that he will need exactly 50 casts?
Pr(Y = 50) = (50−1 choose 5−1)(0.125)^5 (1 − 0.125)^(50−5) = 0.0159
5. What is the probability that he will need more than 8 casts?
Pr(Y > 8) = 1 − Σ_{y=5}^{8} (y−1 choose 5−1)(0.125)^5 (1 − 0.125)^(y−5)
= 1 − [(4 choose 4)(0.125)^5(0.875)^0 + (5 choose 4)(0.125)^5(0.875)^1 + (6 choose 4)(0.125)^5(0.875)^2 + (7 choose 4)(0.125)^5(0.875)^3]
= 1 − (0.0000 + 0.0001 + 0.0004 + 0.0007)
= 1 − 0.0012 = 0.9988
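The fisherman's probabilities can likewise be verified numerically, a sketch built directly on the negative binomial pmf derived above:

```python
import math

def nbinom_pmf(y, k, p):
    # Pr(k-th success occurs on trial y) = C(y-1, k-1) p^k (1-p)^(y-k)
    return math.comb(y - 1, k - 1) * p**k * (1 - p)**(y - k)

k, p = 5, 0.125
mean = k / p                               # expected number of casts
sd = math.sqrt(k * (1 - p) / p**2)         # standard deviation
p_50 = nbinom_pmf(50, k, p)                # exactly 50 casts
p_gt_8 = 1 - sum(nbinom_pmf(y, k, p) for y in range(k, 9))  # more than 8 casts

print(mean)              # 40.0
print(round(sd, 2))      # 16.73
print(round(p_50, 4))    # 0.0159
print(round(p_gt_8, 4))  # 0.9988
```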
Poisson Distribution
The Poisson Distribution can be thought of as a limiting case of the binomial distribution.
Suppose we are interested in the number of car accidents Y that occur at a busy intersection during one week.
We could divide the week into n intervals of time, with each interval being so small that at most one accident could occur in that interval.
We define p as the probability that an accident occurs in a particular sub-interval and 1 − p as the probability that no accident occurs.
We could then think of this as a binomial experiment. It can then be shown that:
lim_{n→∞} (n choose y) p^y (1 − p)^(n−y) = (np)^y e^(−np) / y!
If we let λ = np then we have the probability mass function of the Poisson distribution:
f(y) = λ^y e^(−λ) / y! , y = 0, 1, 2, . . .
Here is a graph of the Poisson probability mass function where λ = 3.3 (going as far as y = 12):
As an exercise, draw the Poisson probability mass function where λ = 1, up to y = 6.
Mean and Variance of the Poisson Distribution
The Poisson Distribution is used to model the counting of rare events that occur with a certain average rate per unit of time or space.
For the Poisson Distribution, E(Y) = λ and Var(Y) = λ.
The expected value and variance are equal!
Poisson Distribution Example
The number of complaints that a busy laundry facility receives per day is a random variable Y having a Poisson distribution with λ = 3.3.
1. What is the probability that the facility will receive less than two complaints on a particular day?
Pr(Y < 2) = Pr(Y = 0) + Pr(Y = 1) = e^(−3.3) + 3.3 e^(−3.3) = 0.1586
If the number of complaints per day has a Poisson distribution with parameter λ then the number of complaints in five days has a Poisson distribution with parameter 5λ. Thus, if we let W be the number of complaints per week, then:
E(W) = 5λ = 16.5
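Both answers can be checked with the Poisson pmf written out directly:

```python
import math

def poisson_pmf(y, lam):
    # Pr(Y = y) = lam^y e^(-lam) / y!
    return lam**y * math.exp(-lam) / math.factorial(y)

lam = 3.3
p_lt_2 = poisson_pmf(0, lam) + poisson_pmf(1, lam)  # Pr(Y < 2)
weekly_mean = 5 * lam                               # E(W) for a five-day week

print(round(p_lt_2, 4))  # 0.1586
print(weekly_mean)       # 16.5
```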
Continuous Random Variables
A random variable is continuous if it can take on any value in an interval (e.g., between 0 and 5). In other words, continuous random variables take on real-numbered values.
There is no such thing as a probability mass function for a continuous random variable. Instead, we have a probability density function which allows us to find probabilities over an interval.
If Y is a continuous random variable, and f(y) is the probability density function, then:
Pr(a ≤ Y ≤ b) = ∫_a^b f(y) dy
What we are actually doing is finding the area under the curve between a and b.
Properties of a Probability Density Function
1. f(y) ≥ 0 for all y, −∞ < y < ∞
2. ∫_{−∞}^{∞} f(y) dy = 1
E.g. Suppose the proportion Y of people who pay their income tax on time has density f(y) = 3y² for 0 ≤ y ≤ 1 (and 0 elsewhere). Verify that f(y) is a valid probability density function.
First we note that 3y² ≥ 0 for all 0 ≤ y ≤ 1, so the first condition is satisfied.
Second:
∫_{−∞}^{∞} f(y) dy = ∫_0^1 f(y) dy (since the function is 0 elsewhere)
= ∫_0^1 3y² dy
= [y³]_0^1
= 1³ − 0³ = 1
Thus the second condition is also satisfied.
Find the probability that between 60% and 90% of people pay their income tax on time.
Pr(0.6 ≤ Y ≤ 0.9) = ∫_{0.6}^{0.9} 3y² dy
= [y³]_{0.6}^{0.9}
= 0.9³ − 0.6³ = 0.513
Thus, according to this model, the probability that between 60% and 90% of people pay their income tax on time is 0.513.
Note that it does not matter whether we use < or ≤ with continuous random variables.
Expected Value and Variance of a Continuous Random Variable
The expected value of a continuous random variable Y is defined as follows:
μ = E(Y) = ∫_{−∞}^{∞} y f(y) dy
Similarly the variance is defined thus:
σ² = Var(Y) = E(Y²) − μ² = ∫_{−∞}^{∞} y² f(y) dy − μ²
These have the same properties as in the discrete case.
Find the expected value of the proportion of people who pay their income tax on time.
μ = E(Y) = ∫_0^1 y × 3y² dy
= ∫_0^1 3y³ dy
= [(3/4) y⁴]_0^1
= 3/4 = 0.75
Find the standard deviation of the proportion of people who pay their income tax on time.
σ² = Var(Y) = ∫_0^1 y² × 3y² dy − μ²
= ∫_0^1 3y⁴ dy − 0.75²
= [(3/5) y⁵]_0^1 − 0.75²
= 3/5 − 0.75² = 0.6 − 0.5625 = 0.0375
Hence σ = √0.0375 = 0.194
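The integrals in this example can be checked numerically. The sketch below uses a simple midpoint-rule integrator (pure Python, no libraries assumed) on f(y) = 3y²:

```python
def integrate(g, a, b, steps=100_000):
    # midpoint-rule approximation of the integral of g from a to b
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

f = lambda y: 3 * y**2                      # density on [0, 1]

total = integrate(f, 0, 1)                  # should be 1 (valid density)
prob = integrate(f, 0.6, 0.9)               # Pr(0.6 <= Y <= 0.9)
mean = integrate(lambda y: y * f(y), 0, 1)  # E(Y)
var = integrate(lambda y: y**2 * f(y), 0, 1) - mean**2  # Var(Y)

print(round(total, 4))  # 1.0
print(round(prob, 4))   # 0.513
print(round(mean, 4))   # 0.75
print(round(var, 4))    # 0.0375
```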
Special Continuous Probability Distributions
Uniform Distribution
Suppose that Y can take on any value between θ1 and θ2 with equal probability. Then Y follows the continuous uniform distribution and its probability density function is as follows:
f(y) = 1/(θ2 − θ1) , θ1 ≤ y ≤ θ2
f(y) = 0 , elsewhere
We can use integrals to compute probabilities, but in this case we don't need to because we are actually just finding the area of a rectangle! It can be shown that E(Y) = (θ1 + θ2)/2 and Var(Y) = (θ2 − θ1)²/12.
Uniform Distribution Example
An insurance company provides roadside assistance to its clients. To save costs they want to dispatch the nearest possible tow truck.
Along a particular highway which is 100 km long, breakdowns occur at uniformly distributed locations.
Towing Company A is the nearest for the first 70 km of the highway and Towing Company B is the nearest for the final 30 km of the highway.
1. What is the expected location of the next breakdown?
E(Y) = (θ1 + θ2)/2 = (0 + 100)/2 = 50
We expect the next breakdown to occur at the 50 km mark
2. What is the probability that the next breakdown will be attended by Company B?
Here f(y) = 1/100 for 0 ≤ y ≤ 100, and 0 elsewhere.
We need to find the area under f(y) between 70 and 100.
We could calculate ∫_{70}^{100} f(y) dy
Or we can simply calculate the area of this rectangle:
The area of a rectangle is length × width. Thus:
Pr(70 ≤ Y ≤ 100) = 30 × (1/100) = 0.30
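The same answer as a few lines of code, a trivial sketch that mirrors the rectangle argument:

```python
theta1, theta2 = 0, 100            # highway runs from 0 km to 100 km
height = 1 / (theta2 - theta1)     # uniform density f(y) = 1/100
mean = (theta1 + theta2) / 2       # expected breakdown location
p_company_b = (100 - 70) * height  # rectangle area over [70, 100]

print(mean)         # 50.0
print(p_company_b)
```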
Normal Distribution
A random variable Y is said to have a normal distribution with parameters −∞ < μ < ∞ and σ > 0 if its probability density function is:
f(y) = (1/(σ√(2π))) e^(−(y−μ)²/(2σ²)) , −∞ < y < ∞
Even more good news: any normally distributed random variable Y with mean μ and standard deviation σ can be transformed to a Standard Normal random variable Z using this simple transformation:
Z = (Y − μ)/σ
This graph shows how the transformation works:
Using the Z Table to Calculate Probabilities
The Z table provides us with Pr(Z < z) for any z value that we choose, up to 2 decimal places.
Suppose we want to know Pr(Z ≥ z) = 1 − Pr(Z < z).
If we want to find Pr(Z < z) for a negative z value, we can use the fact that the Standard Normal Distribution is symmetric:
Pr(Z < −z) = 1 − Pr(Z < z)
5. t_observed = 10.50 > 2.228, thus we reject H0
6. We conclude at the 5% significance level that the correlation is significantly different from 0
The Fisher Transformation
What if we want to test whether ρ = ρ0 for any value −1 < ρ0 < 1? What if we want a confidence interval for ρ? The Fisher Transformation allows us to do both (approximately):
z_r = (1/2) ln[(1 + r)/(1 − r)]
This quantity has an approximate Normal distribution with a mean of (1/2) ln[(1 + ρ)/(1 − ρ)] and a variance of 1/(n − 3). From this we get the following test statistic, which has a standard normal distribution under the null hypothesis:
Z = { (1/2) ln[(1 + r)/(1 − r)] − (1/2) ln[(1 + ρ0)/(1 − ρ0)] } / √(1/(n − 3))
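The test statistic above is easy to wrap in a small helper. The numbers in the call below are purely illustrative (they are not taken from the notes' data):

```python
import math

def fisher_z(r, rho0, n):
    # Z = (z_r - z_rho0) / sqrt(1/(n-3)); approximately N(0, 1) under H0
    zr = 0.5 * math.log((1 + r) / (1 - r))
    z0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    return (zr - z0) / math.sqrt(1 / (n - 3))

# Hypothetical example: r = 0.96 observed from n = 12 pairs, testing H0: rho = 0.8
z = fisher_z(0.96, 0.8, 12)
print(round(z, 3))  # 2.542
```

A value this large would be compared against the standard normal critical value for the chosen α.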
Pearson's Correlation Coefficient: General Hypothesis Test Example
Suppose we want to find out whether the correlation is less than 0.99 in our ice cream sales vs. temperature example.
1. H0: ρ = 0.99 vs. HA: ρ < 0.99
Spearman's Rank Correlation Coefficient
What if one or both of X and Y are not normally distributed?
Suppose we have the Statistics FISA marks and number of hours of TV watched per week for n = 8 students:
FISA Marks vs. Hours of TV per week
Hours of TV per week (xi) FISA Mark (yi)
3 73
11 50
7 87
38 31
13 62
20 61
22 46
34 59
Spearman's Rank Correlation Coefficient
In this case we can instead use Spearman's Rank Correlation Coefficient ρs, which is based on the ranks of the xi and yi rather than the values themselves.
It is a general measure of association rather than a measure of linear dependence.
R(xi) are the ranks of the x values; thus the lowest value has a rank of 1, the second lowest a rank of 2, etc.
R(yi) is computed the same way for the y values. The sample estimator of ρs is:
rs = [n Σ_{i=1}^n R(xi)R(yi) − Σ_{i=1}^n R(xi) Σ_{i=1}^n R(yi)] / √{ [n Σ_{i=1}^n R(xi)² − (Σ_{i=1}^n R(xi))²] [n Σ_{i=1}^n R(yi)² − (Σ_{i=1}^n R(yi))²] }
If there are no ties in x or y, this reduces to a simpler formula:
rs = 1 − 6 Σ_{i=1}^n di² / [n(n² − 1)] , where di = R(xi) − R(yi)
FISA Marks vs. TV hours per week
Hours of TV per week (xi)   FISA Mark (yi)   R(xi)   R(yi)   di   di²
3    73   1   7   −6   36
11   50   3   3    0    0
7    87   2   8   −6   36
38   31   8   1    7   49
13   62   4   6   −2    4
20   61   5   5    0    0
22   46   6   2    4   16
34   59   7   4    3    9
Σ di² = 150
Spearman's Rank Correlation Coefficient Example
In our FISA marks vs. TV hours example:
We can now compute the sample Spearman correlation coefficient:
rs = 1 − (6 × 150)/(8(8² − 1)) = −0.786
This suggests that there is a negative association between hours spent watching TV and FISA mark.
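The ranks and rs can be reproduced as follows (the ranking helper assumes no ties, which holds for this data):

```python
def ranks(values):
    # rank 1 = smallest value; assumes no ties, as in this example
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

tv = [3, 11, 7, 38, 13, 20, 22, 34]       # hours of TV per week (x)
marks = [73, 50, 87, 31, 62, 61, 46, 59]  # FISA marks (y)

rx, ry = ranks(tv), ranks(marks)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(tv)
rs = 1 - 6 * d2 / (n * (n**2 - 1))        # no-ties formula

print(d2)            # 150
print(round(rs, 3))  # -0.786
```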
Spearman's Rank Correlation Coefficient: Hypothesis Testing
We may want to test the null hypothesis H0: ρs = 0 against some alternative to see if there is a significant association between x and y.
If n is large (and there are no ties) then the statistic t = rs √[(n − 2)/(1 − rs²)] has approximately a t distribution with n − 2 degrees of freedom. If n is small we use rs as our test statistic and use a table of critical values (see appendix).
For our student marks vs. TV hours example, suppose we want to check if the association between these two variables is significant at the 5% significance level.
Spearman's Rank Correlation Coefficient: Hypothesis Testing Example
1. H0: ρs = 0 vs. HA: ρs ≠ 0
2. α = 0.05
3. Test statistic is rs
4. Critical value is r_{s, α/2} for n = 8, i.e. 0.738, so we reject H0 if |rs observed| > 0.738
5. |rs observed| = |−0.786| = 0.786 > 0.738, so we reject H0
6. We conclude there is a (negative) association between hours spent watching TV per week and FISA mark
Spearman's Rank Correlation Coefficient: General Hypothesis Tests and Confidence Intervals
The Fisher Transformation that was done on the Pearson Correlation Coefficient also applies to the Spearman Rank Correlation Coefficient. Thus we can use the very same formulas based on the standard normal distribution to carry out general hypothesis tests such as H0: ρs = 0.6 vs. HA: ρs ≠ 0.6, as well as to construct confidence intervals for ρs.
Of course we need to use rs instead of r in these formulas, but everything else stays the same.
Limitations of Correlation Analysis
Two of the limitations of correlation analysis are:
1. It does not allow us to compare more than two variables at a time
2. It does not allow us to make predictions
We now turn to linear regression analysis, which enables us to do both of these.
3 Simple Linear Regression Analysis
Equation of a Line
The equation of a line is often expressed as y = mx + c
m is the slope of the line, the change in y for a one-unit change in x
c is the intercept of the line, the value of y when x = 0 (and the point where the line crosses the vertical axis)
Often when we compare observations from two variables, we see what appears to be an approximately linear relationship.
We must decide logically which is the independent variable (x) and which is the dependent variable (y).
For example, the scatter plot of ice cream sales vs. temperatures (which is dependent on the other?)
Line Fitting
If we have only two points, we can fit a line that goes right through them both
E.g. if we have the points (x1 = 2, y1 = 4) and (x2 = 6, y2 = 6):
m = (y2 − y1)/(x2 − x1) = (6 − 4)/(6 − 2) = 1/2
m = (y − y1)/(x − x1)
1/2 = (y − 4)/(x − 2)
2(y − 4) = x − 2
2y − 8 = x − 2
2y = x + 6
y = (1/2)x + 3
Line Fitting
However, as soon as we have three or more points, we usually can't fit them perfectly with a straight line.
Consider the following scatter plot:
There is no line that describes this relationship perfectly. So how do we model a relationship that is kind of linear?
The Simple Linear Regression Model
We could assume that the yi observations depend on the xi observations in a linear way but also contain some unexplained variation.
We model this unexplained variation or error as a random variable εi. This means Y is a random variable since it depends on a random variable.
Thus we have Y = β0 + β1x + ε. Or, for individual observations, yi = β0 + β1xi + εi for i = 1, 2, . . . , n.
We have simply changed the name of m to β1 and c to β0, switched their order, and added the error term.
Model Assumptions
The most important assumptions of a simple linear regression model are as follows:
1. The x values are fixed, not random (thus we write x in lower case and Y, a random variable, in upper case)
2. All error terms have a zero mean, i.e. E(εi) = 0 for all i
3. All error terms have the same fixed variance, i.e. Var(εi) = σ² for all i
4. All observations are independent of each other
5. The error terms follow the normal distribution
The Problem
Even if our model and its assumptions are correct, we have a problem: we don't know the values of β0, β1 or εi.
In order to know them we would have to have data from the whole population of x and y, which is usually impossible.
We can only estimate β0, β1 and εi as best we can. But how?
Line Fitting
If we asked three people to draw the line that best fits the points, we might get three different results:
How would we know which line is the best? As statisticians we want to use a statistic to quantify this! But how?
The Least Squares Method
Suppose we have observations (xi, yi) for i = 1, 2, . . . , n, and we fit a line with equation ŷi = β̂0 + β̂1 xi
We have simply changed the name of m to β̂1 and c to β̂0, and switched their order.
The hat on ŷ, β̂0 and β̂1 reminds us that these are estimates of the relationship.
We can determine how far each individual yi value is from the line using the formula ei = yi − ŷi = yi − (β̂0 + β̂1 xi). The ei values are called residuals.
The residuals ei are our best estimate of the unknown errors εi. They also provide us with a clue of how to find the estimated line that best fits the data.
Overall, we want the errors to be as small as possible. However, we can't just minimize the sum of errors because the positive errors (points above the line) and negative errors (points below the line) will cancel each other out!
Instead we minimize the sum of squared errors ε²i because these will all be positive:
SS_Error = Σ_{i=1}^n ε²i
This quantifies the overall distance between the points and the line, similar to how the variance gives an indication of the distance between data points and their mean.
We will choose the values of β̂0 and β̂1 that minimize the sum of squared errors.
How do we do this? Calculus! The Sum of Squared Errors is a function of β̂0 and β̂1:
SS_Error = S(β̂0, β̂1) = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²
So our method is as follows:
1. Take partial derivatives of the SS_Error function with respect to β̂0 and β̂1
2. Set the derivatives equal to zero
3. Solve this system of equations for β̂0 and β̂1 to get the values which minimize the function
Deriving the Least Squares Estimators
∂S(β̂0, β̂1)/∂β̂0 = −2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = 0   (1)
∂S(β̂0, β̂1)/∂β̂1 = −2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi) xi = 0   (2)
This is the system of equations we must solve in terms of β̂0 and β̂1. We simplify them as follows:
−2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = 0
Σ_{i=1}^n yi − Σ_{i=1}^n β̂0 − Σ_{i=1}^n β̂1 xi = 0
Σ_{i=1}^n yi − β̂0 Σ_{i=1}^n 1 − β̂1 Σ_{i=1}^n xi = 0
n ȳ − n β̂0 − n β̂1 x̄ = 0
β̂0 = ȳ − β̂1 x̄
−2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi) xi = 0
Σ_{i=1}^n yi xi − Σ_{i=1}^n β̂0 xi − Σ_{i=1}^n β̂1 xi² = 0
Σ_{i=1}^n yi xi − β̂0 Σ_{i=1}^n xi − β̂1 Σ_{i=1}^n xi² = 0
Σ_{i=1}^n xi yi − (ȳ − β̂1 x̄) Σ_{i=1}^n xi − β̂1 Σ_{i=1}^n xi² = 0
Σ_{i=1}^n xi yi − n x̄ ȳ + n β̂1 x̄² − β̂1 Σ_{i=1}^n xi² = 0
β̂1 (Σ_{i=1}^n xi² − n x̄²) = Σ_{i=1}^n xi yi − n x̄ ȳ
β̂1 = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²)
Least Squares Estimation Formula
Thus the least squares estimates of β0 and β1 can be calculated using the following formulas:
β̂1 = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²)
β̂0 = ȳ − β̂1 x̄
It turns out that β̂1 and β̂0 are Minimum Variance Unbiased Estimators (MVUE) of β1 and β0.
This means that:
1. E(β̂0) = β0 and E(β̂1) = β1 (unbiased)
2. β̂0 and β̂1 can be proven to have the smallest variance (greatest precision) of any linear estimators of β0 and β1
Proof that β̂1 is an Unbiased Estimator of β1
We first need to derive E(Yi) and E(Ȳ).
We will also use our assumptions that the x values are fixed and that E(εi) = 0.
E(Yi) = E(β0 + β1 xi + εi)
= E(β0) + E(β1 xi) + E(εi)
= β0 + β1 xi + 0 (since the first two are constants)
= β0 + β1 xi
E(Ȳ) = E[(1/n) Σ_{i=1}^n yi]
= (1/n) Σ_{i=1}^n E(yi)
= (1/n) Σ_{i=1}^n (β0 + β1 xi)
= (1/n)(n β0 + β1 n x̄)
= β0 + β1 x̄
E(β̂1) = E[ (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²) ]
= [1/(Σ_{i=1}^n xi² − n x̄²)] E(Σ_{i=1}^n xi yi − n x̄ ȳ) (since x is fixed, the denominator is constant)
= [1/(Σ_{i=1}^n xi² − n x̄²)] [Σ_{i=1}^n xi E(yi) − n x̄ E(ȳ)]
= [1/(Σ_{i=1}^n xi² − n x̄²)] [Σ_{i=1}^n xi (β0 + β1 xi) − n x̄ (β0 + β1 x̄)] (see results proved above)
= [1/(Σ_{i=1}^n xi² − n x̄²)] [β0 n x̄ + β1 Σ_{i=1}^n xi² − n x̄ β0 − n x̄² β1]
= [1/(Σ_{i=1}^n xi² − n x̄²)] β1 (Σ_{i=1}^n xi² − n x̄²)
= β1
Proof that β̂0 is an Unbiased Estimator of β0
As an exercise, try to prove that E(β̂0) = β0.
The proof is much shorter than the proof for β̂1.
Prediction with Simple Linear Regression
Once we have calculated the least squares estimates β̂1 and β̂0, we can write out the fitted regression equation:
ŷ = β̂0 + β̂1 x
We can now use this equation to predict the most likely value of y for a particular value of x.
This is one of the most useful things about this model! However we must be careful to only make predictions for values of x in the domain of our data.
We cannot extrapolate, since the relationship may not be linear outside of the domain of the data.
The Riskiness of Extrapolation
Suppose we fit a line to a set of data points with xi values ranging from 0 to 6. Now we use our fitted line to predict the value of y for x = 10.
The Riskiness of Extrapolation
What if modeling the relationship between y and x as a straight line is only appropriate between x = 0 and x = 6?
Can you see how far off the prediction would appear to be if we had data for larger x values like this?
Simple Linear Regression Example
Various doses of a toxic substance were given to groups of 25 rats and the results were observed (see table below).
Rat Deaths vs. Doses
Dose in mg (x) Number of Deaths (y)
4 1
6 3
8 6
10 8
12 14
14 16
16 20
1. Find the fitted simple linear regression equation for this data
2. Use the model to predict the number of deaths in a group of 25 rats who receive a 7 mg dose of the toxin
Rat Deaths vs. Doses
xi    yi    xi²    xi yi
4     1     16     4
6     3     36     18
8     6     64     48
10    8     100    80
12    14    144    168
14    16    196    224
16    20    256    320
Σ xi = 70   Σ yi = 68   Σ xi² = 812   Σ xi yi = 862
x̄ = 10   ȳ = 9.714
β̂1 = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²)
= (862 − 7 × 10 × 9.714) / (812 − 7 × 10²)
= 182.0/112
= 1.625
β̂0 = ȳ − β̂1 x̄ = 9.714 − 1.625 × 10 = −6.536
Note that it is important not to round numbers off until you have the final regression equation, otherwise your answer may be inaccurate.
Thus the fitted regression equation is ŷ = −6.54 + 1.63x. Predicting the number of deaths for a dose of 7 mg:
ŷ = −6.54 + 1.63x = −6.54 + 1.63 × 7 = 4.9
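The least squares fit for the rat data can be reproduced directly from the formulas. Keeping full precision, the 7 mg prediction comes out as about 4.84 rather than the 4.9 obtained from the rounded coefficients:

```python
dose = [4, 6, 8, 10, 12, 14, 16]   # x: dose in mg
deaths = [1, 3, 6, 8, 14, 16, 20]  # y: deaths out of 25

n = len(dose)
xbar, ybar = sum(dose) / n, sum(deaths) / n
b1 = (sum(x * y for x, y in zip(dose, deaths)) - n * xbar * ybar) / \
     (sum(x * x for x in dose) - n * xbar**2)  # slope estimate
b0 = ybar - b1 * xbar                          # intercept estimate
pred_7mg = b0 + b1 * 7                         # predicted deaths at 7 mg

print(round(b1, 3))        # 1.625
print(round(b0, 3))        # -6.536
print(round(pred_7mg, 2))  # 4.84
```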
Simple Linear Regression Exercise
Calculate the equation of the line of best fit for the temperature (x) vs. ice cream sales (y) example.
Use the equation to predict the ice cream sales on a day on which the temperature is 20.
Inferences from a Simple Linear Regression
The two unknown parameters involved in a simple linear regression model are β0 and β1.
σ², the variance of the error terms, is also unknown.
We may be interested in knowing whether it is reasonable to conclude that one of these unknowns is equal to (or not equal to) a particular value.
Most often we are interested in whether β1 = 0, since this determines whether x and y have a positive relationship, a negative relationship or no relationship.
Like in correlation analysis! To use hypothesis testing to make inferences about these unknowns we need an appropriate test statistic.
Inferences on β1
Inferences about β1 will be based on how far the estimated value β̂1 is from the null hypothesis value.
As always, we also take into account the standard error of the estimate and its probability distribution.
We already proved that E(β̂1) = β1.
Let:
SSx = Σ_{i=1}^n xi² − n x̄²
SSy = Σ_{i=1}^n yi² − n ȳ²
SSxy = Σ_{i=1}^n xi yi − n x̄ ȳ
Notice that, expressed in these terms, β̂1 = SSxy/SSx.
Subject to our model assumptions, it can be proven that Var(β̂1) = σ²/SSx.
However, because we do not know the value of σ² we must use the best estimate, which turns out to be σ̂² = [1/(n − 2)] Σ_{i=1}^n ei² = SS_Residual/(n − 2) = MS_Residual.
Thus V̂ar(β̂1) = σ̂²/SSx.
It can be proven that [β̂1 − E(β̂1)] / √V̂ar(β̂1) has a t distribution with n − 2 degrees of freedom.
Thus t = (β̂1 − β1) / √(σ̂²/SSx) has a t distribution with n − 2 degrees of freedom.
Since SS_Residual = SSy − β̂1 SSxy, we can express this as:
t = (β̂1 − β1) / √[(SSy − β̂1 SSxy) / ((n − 2) SSx)]
If we replace β1 with a hypothesized value β1*, this becomes our test statistic for testing H0: β1 = β1*.
Hypothesis Testing Review
For such a t test, our decision rules would be as follows:
H0: β1 = β1* vs. HA: β1 ≠ β1*: reject H0 if |t_observed| > t_{α/2, n−2}
H0: β1 = β1* vs. HA: β1 < β1*: reject H0 if t_observed < −t_{α, n−2}
H0: β1 = β1* vs. HA: β1 > β1*: reject H0 if t_observed > t_{α, n−2}
The p-value Approach
Instead of using critical values to decide whether to reject H0, one can also use p-values.
A p-value is defined as the probability of obtaining a result at least as extreme as the observed data, given that H0 is true.
For such a t test, our decision rules would be as follows:
H0: β1 = β1* vs. HA: β1 ≠ β1*: reject H0 if 2 Pr(t > |t_observed| given that β1 = β1*) < α
H0: β1 = β1* vs. HA: β1 < β1*: reject H0 if Pr(t < t_observed given that β1 = β1*) < α
H0: β1 = β1* vs. HA: β1 > β1*: reject H0 if Pr(t > t_observed given that β1 = β1*) < α
Note that p-values cannot usually be computed by hand. As an example, the third p-value involves computing p = ∫_{t_observed}^{∞} f(y) dy where f(y) is the probability density function of the t distribution.
However, p-values can be easily calculated with a computer, and are the quickest way to reach a decision about a hypothesis test when using statistical software packages.
Confidence Interval for β1
Using the t statistic above, we can derive a (1 )100% confidence intervalfor 1 as follows:
Pr1 t/2,n2 SSy1SSxy(n 2) SSx < 1< 1+t/2,n2 SSy1SSxy(n 2) SSx = 1 Thus the C.I. for 1 is:1 t/2,n2
SSy1SSxy
(n 2) SSx ,1+t/2,n2
SSy1SSxy
(n 2) SSx
Inference on β1 Example
Suppose we want to test H0: β1 = 0 vs. HA: β1 ≠ 0 for the rat death vs. dosage example, at the α = 0.05 significance level.
Our test statistic is t ~ t(n − 2) as defined above.
Our critical region is |t_observed| > t_{α/2,n−2} = t_{0.025,5} = 2.571.
We have already calculated that SSxy = 182 and SSx = 112. We further can calculate that SSy = 301.4286.
t = (β̂1 − β1) / √[(SSy − β̂1 SSxy)/((n − 2) SSx)]
= (1.625 − 0) / √[(301.4286 − 1.625 × 182)/((7 − 2) × 112)]
= 1.625 / √0.01014
= 1.625 / 0.1007 = 16.14
|t_observed| > 2.571, thus we reject H0 and conclude that β1 ≠ 0; the slope of the regression model is statistically significant.
A 95% Confidence Interval for β1 is given by:
( β̂1 − t_{α/2,n−2} √[(SSy − β̂1 SSxy)/((n − 2) SSx)] , β̂1 + t_{α/2,n−2} √[(SSy − β̂1 SSxy)/((n − 2) SSx)] )
(1.625 − 2.571 × 0.1007 , 1.625 + 2.571 × 0.1007)
(1.37 , 1.88)
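Everything in this test except the critical value (which comes from a t table) can be reproduced end to end from the raw data:

```python
import math

dose = [4, 6, 8, 10, 12, 14, 16]
deaths = [1, 3, 6, 8, 14, 16, 20]
n = len(dose)
xbar, ybar = sum(dose) / n, sum(deaths) / n

ssx = sum(x * x for x in dose) - n * xbar**2
ssy = sum(y * y for y in deaths) - n * ybar**2
ssxy = sum(x * y for x, y in zip(dose, deaths)) - n * xbar * ybar

b1 = ssxy / ssx
se_b1 = math.sqrt((ssy - b1 * ssxy) / ((n - 2) * ssx))  # estimated std. error
t = (b1 - 0) / se_b1                                    # H0: beta1 = 0

print(round(ssy, 4))    # 301.4286
print(round(se_b1, 4))  # 0.1007
print(round(t, 2))      # 16.14
```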
Inference on β0
In a similar way it can be proven that:
E(β̂0) = β0
Var(β̂0) = σ² [1/n + x̄²/SSx]
If we estimate σ² with σ̂² then t = (β̂0 − β0) / √(σ̂² [1/n + x̄²/SSx]) has a t distribution with n − 2 degrees of freedom.
We can also express t as:
t = (β̂0 − β0) / √{ [(SSy − β̂1 SSxy)/(n − 2)] [1/n + x̄²/SSx] }
Confidence Interval for β0
A (1 − α)100% Confidence Interval for β0 is given by:
( β̂0 − t_{α/2,n−2} √{ [(SSy − β̂1 SSxy)/(n − 2)] [1/n + x̄²/SSx] } , β̂0 + t_{α/2,n−2} √{ [(SSy − β̂1 SSxy)/(n − 2)] [1/n + x̄²/SSx] } )
Inference on β0 Example
With our dosage vs. rat deaths example, suppose we are interested in whether β0 < 0.
Inference on σ²
It is also possible to perform hypothesis tests and confidence intervals concerning σ² using the χ² distribution.
However we will not cover these in this module.
Predicting the Mean Response
One of the advantages of the linear regression model is that we can use x to predict Y.
Suppose we want to estimate the mean value of Y when x = x*, i.e. E(Y | x = x*).
We know that E(Y | x = x*) = β0 + β1 x*.
Our best estimate of E(Y | x = x*) is ŷ = β̂0 + β̂1 x*.
The variance of this estimator is Var(ŷ) = σ² [1/n + (x* − x̄)²/SSx]
Since σ² is unknown, we can use the following estimate:
V̂ar(ŷ) = σ̂² [1/n + (x* − x̄)²/SSx] = [(SSy − β̂1 SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx]
It can also be shown that t = [ŷ − E(Y | x = x*)] / √V̂ar(ŷ) ~ t(n − 2)
Confidence Interval for Mean Response

Thus a (1 − α)100% Confidence Interval for E(Y | x = x*) is given by:

β̂0 + β̂1x* ± t_{α/2,n−2} √[ (SSy − β̂1·SSxy)/(n−2) × (1/n + (x* − x̄)²/SSx) ]

If we want the interval to be as narrow as possible (a more precise estimate), then n should be large, SSx should be large, and x* should be near x̄. That is, we should gather data on a wide range of x values.
Predicting a New Response

Suppose we want to predict the response value y* for a new observation x = x*.

Our best estimate would be ŷ* = β̂0 + β̂1x*.

E(ŷ*) = β0 + β1x*

The variance of the prediction error is Var(y* − ŷ*) = σ²(1 + 1/n + (x* − x̄)²/SSx).

Thus:

V̂ar(y* − ŷ*) = σ̂²(1 + 1/n + (x* − x̄)²/SSx) = (SSy − β̂1·SSxy)/(n−2) × (1 + 1/n + (x* − x̄)²/SSx)

It can be shown that t = (y* − ŷ*) / √V̂ar(y* − ŷ*) ~ t(n − 2).
Prediction Interval for an Individual Response

A (1 − α)100% Prediction Interval for y* is given by:

β̂0 + β̂1x* ± t_{α/2,n−2} √[ (SSy − β̂1·SSxy)/(n−2) × (1 + 1/n + (x* − x̄)²/SSx) ]

It is called a prediction interval rather than a confidence interval because y* is a random variable, not an unknown parameter.

Notice that the prediction interval for y* is always wider than the confidence interval for E(Y | x = x*): it is more difficult to predict the value of an individual observation than the mean of many observations.
Example

Consider our Temperature vs. Ice Cream Sales example. We want a confidence interval for the average ice cream sales when the temperature is 20, and a prediction interval for the ice cream sales on a particular day when the temperature is 20.
1. Confidence Interval for E(Y | x = 20)

β̂0 + β̂1x* ± t_{α/2,n−2} √[ (SSy − β̂1·SSxy)/(n−2) × (1/n + (x* − x̄)²/SSx) ]

= −159.474 + 30.088(20) ± t_{0.025,10} √[ (174754.9 − 30.088 × 5325.025)/(12 − 2) × (1/12 + (20 − 18.675)²/176.9825) ]

= 442.286 ± 2.228 √135.549

= 442.286 ± 25.94 = (416.35, 468.23)
2. Prediction Interval for y* when x = 20

β̂0 + β̂1x* ± t_{α/2,n−2} √[ (SSy − β̂1·SSxy)/(n−2) × (1 + 1/n + (x* − x̄)²/SSx) ]

= −159.474 + 30.088(20) ± t_{0.025,10} √[ (174754.9 − 30.088 × 5325.025)/(12 − 2) × (1 + 1/12 + (20 − 18.675)²/176.9825) ]

= 442.286 ± 2.228 √1589.10

= 442.286 ± 88.82 = (353.47, 531.11)
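Both intervals can be reproduced from the quoted summary statistics; a minimal sketch (any small differences come from rounding in those inputs):

```python
import math

# Summaries from the temperature vs. ice cream sales example (n = 12)
n, xbar = 12, 18.675
SSx, SSxy, SSy = 176.9825, 5325.025, 174754.9
b0, b1 = -159.474, 30.088
t_crit = 2.228                  # t_{0.025,10} from the t table

x_star = 20.0
y_hat = b0 + b1 * x_star                    # point estimate at x* = 20
s2 = (SSy - b1 * SSxy) / (n - 2)            # estimate of sigma^2
lever = 1 / n + (x_star - xbar) ** 2 / SSx  # 1/n + (x* - xbar)^2 / SSx

ci_half = t_crit * math.sqrt(s2 * lever)        # mean-response CI half-width
pi_half = t_crit * math.sqrt(s2 * (1 + lever))  # prediction-interval half-width
print(round(y_hat, 3), round(ci_half, 2), round(pi_half, 2))
```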
Assessing the Fit of a Regression Line

While testing the hypothesis H0 : β1 = 0 can give us a yes-or-no answer on whether the model is appropriate, we would like a statistic that can quantify how good the model is.

One method is to calculate what proportion of the total variation in y is explained by our model.

The total variation in y is SSy = Σᵢ(yi − ȳ)² = Σᵢ yi² − nȳ².

The variation not explained by the model is SSResidual = Σᵢ(yi − ŷi)².

Thus the variation explained by the model is the difference SSy − SSResidual.

Our goodness-of-fit statistic, called the Coefficient of Determination, is the ratio of the variation explained by the model to the total variation:

r² = (SSy − SSResidual)/SSy = 1 − SSResidual/SSy
We call this statistic r² because it turns out to be the square of Pearson's sample correlation coefficient r.

Proof:

r² = 1 − SSResidual/SSy
   = 1 − (SSy − β̂1·SSxy)/SSy
   = β̂1·SSxy/SSy
   = (SSxy/SSx)(SSxy/SSy)
   = SSxy²/(SSx·SSy)
   = (r)²
Goodness of Fit: Example

In our dosage vs. rat deaths example:

r² = SSxy²/(SSx·SSy) = 182²/(112 × 301.4286) = 0.981

Thus in this case we can say that 98.1% of the variation in rat deaths can be explained by the dosage given.
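A one-line check of this calculation:

```python
# Summary statistics from the dosage vs. rat deaths example
SSx, SSxy, SSy = 112.0, 182.0, 301.4286

# Coefficient of determination via r^2 = SSxy^2 / (SSx * SSy)
r2 = SSxy ** 2 / (SSx * SSy)
print(round(r2, 3))
```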
4 Multiple Linear Regression

Multiple Linear Regression Model Specification

Until now we have used models with only one independent variable xi. What if we want to investigate the relationship between a single dependent variable Y and two independent variables x1 and x2? The multiple linear regression model allows us to do this.
Motivational Example

An experiment was conducted to determine the effect of pressure and temperature on the yield of a chemical. Two levels of pressure (in kPa) and three levels of temperature (in °C) were used, and the results were as follows:
Yield (yi)  Pressure (xi1)  Temperature (xi2)
21          350             40
23          350             90
26          350             150
22          550             40
23          550             90
28          550             150
3D Scatter Plot

If we want to represent the relationship graphically we would need a three-dimensional scatter plot. Instead of a line of best fit, we now need a plane of best fit.
Multiple Linear Regression Model

The multiple linear regression model allows us to investigate the relationship between a single dependent variable Y and two independent variables x1 and x2.

The model is specified as follows:

Y = β0 + β1x1 + β2x2 + ε

Or, in terms of observations:

yi = β0 + β1x1i + β2x2i + εi
This is the equation of a plane, not a line.

β0 is still the intercept (the point where the plane crosses the vertical axis, at x1 = x2 = 0). β1 is the slope of the plane in the x1 direction, and β2 is the slope of the plane in the x2 direction. β1 and β2 are sometimes referred to as partial slope coefficients.

This model relies on the same assumptions as the simple linear regression model, with one addition: x1 and x2 must not be collinear (highly correlated with one another).

The fitted regression equation in this case is:

Ŷ = β̂0 + β̂1x1 + β̂2x2
Multiple Linear Regression Model: Deriving Least Squares Parameter Estimates

We can again use the Method of Least Squares to estimate the parameters β0, β1 and β2.

We still have our sum of squared error function, which is now a function of three variables:

SSError = S(β0, β1, β2) = Σᵢ εi² = Σᵢ (yi − β0 − β1x1i − β2x2i)²

We can still use the same steps:

1. Take partial derivatives of the SSError function with respect to β0, β1 and β2
2. Set the derivatives equal to zero
3. Solve this system of equations for β0, β1 and β2 to get the values which minimize the function

∂S(β0, β1, β2)/∂β0 = −2 Σᵢ (yi − β0 − β1x1i − β2x2i) = 0
∂S(β0, β1, β2)/∂β1 = −2 Σᵢ (yi − β0 − β1x1i − β2x2i) x1i = 0
∂S(β0, β1, β2)/∂β2 = −2 Σᵢ (yi − β0 − β1x1i − β2x2i) x2i = 0
Solving this system of equations for β0, β1 and β2 is possible, but it is tedious and the formulas are complicated. An alternative is to use matrix notation, which is more compact.

Multiple Linear Regression Model: Matrix Notation

We can specify the regression model in matrix notation as follows:

y = Xβ + ε, where

y is an n × 1 matrix:

y = [y1, y2, . . . , yn]′

X is an n × 3 matrix:

X = [ 1  x11  x21
      1  x12  x22
      ⋮
      1  x1n  x2n ]

β is a 3 × 1 matrix:

β = [β0, β1, β2]′

ε is an n × 1 matrix:

ε = [ε1, ε2, . . . , εn]′
Quick Review of Matrices

For any matrices A and B, where A′ is the transpose of A:

(A′)′ = A
(A + B)′ = A′ + B′
(AB)′ = B′A′
Additionally, the inverse of a square matrix A (which is like the matrix equivalent of division) is the matrix A⁻¹ such that AA⁻¹ = I, where I is the identity matrix, e.g.

I = [ 1 0 0
      0 1 0
      0 0 1 ]

To find the inverse of a matrix we can use the following method (similar to Gauss-Jordan elimination). Suppose

A = [ 1 2 3
      0 4 5
      1 0 6 ]

We form the augmented matrix [A | I] and row-reduce until the left block becomes I; the right block is then A⁻¹:

[ 1 2 3 | 1 0 0 ]
[ 0 4 5 | 0 1 0 ]
[ 1 0 6 | 0 0 1 ]

Applying R3 → R3 − R1, then R3 → 2R3 + R2, then clearing the remaining off-diagonal entries and rescaling each row, we arrive at:

[ 1 0 0 | 12/11  −6/11  −1/11 ]
[ 0 1 0 |  5/22   3/22  −5/22 ]
[ 0 0 1 | −2/11   1/11   2/11 ]

Thus A⁻¹ = [ 12/11  −6/11  −1/11
              5/22   3/22  −5/22
             −2/11   1/11   2/11 ]
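In practice we let software do the inversion. As a sketch, the hand-computed inverse above can be verified numerically with NumPy:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [0., 4., 5.],
              [1., 0., 6.]])

A_inv = np.linalg.inv(A)            # numerical inverse of A

# Verify A A^{-1} = I, and compare with the hand-computed first row
print(np.allclose(A @ A_inv, np.eye(3)))
print(np.allclose(A_inv[0], [12/11, -6/11, -1/11]))
```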
Deriving Least Squares Estimates in Matrix Notation

Our sum of squared error function in matrix notation is:

S(β) = Σᵢ εi² = ε′ε = (y − Xβ)′(y − Xβ)
     = (y′ − β′X′)(y − Xβ)
     = y′y − β′X′y − y′Xβ + β′X′Xβ
Now, in β′X′y we are multiplying a 1 × 3 matrix by a 3 × n matrix by an n × 1 matrix, so the result will be a 1 × 1 matrix, i.e. a scalar. Similarly, in y′Xβ we are multiplying a 1 × n matrix by an n × 3 matrix by a 3 × 1 matrix, so the result will again be a scalar. Notice also that (β′X′y)′ = y′Xβ, and the transpose of a scalar is itself. Thus, since these two terms are scalars, they are equal, and we can simplify our equation to:

S(β) = y′y − 2β′X′y + β′X′Xβ

We now differentiate this function using vector calculus and set the result equal to 0:

∂S/∂β = −2X′y + 2X′Xβ = 0
X′Xβ = X′y
β̂ = (X′X)⁻¹X′y

Thus, in matrix form, the least squares estimators of β are given by β̂ = (X′X)⁻¹X′y.

This estimator exists as long as the inverse of X′X exists, which it does as long as our assumption of no linear dependence between x1 and x2 holds true.

The estimators have the same Minimum Variance Unbiased Estimator property as β̂0 and β̂1 do in the simple linear regression case.

In matrix form, the fitted regression equation is ŷ = Xβ̂. In matrix form, the residuals are e = y − ŷ.
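As a sketch of the matrix formula in action, we can fit the chemical-yield data from the motivational example; solving the normal equations X′Xβ̂ = X′y directly is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Chemical yield example: pressure (kPa) and temperature (deg C)
y = np.array([21., 23., 26., 22., 23., 28.])
X = np.column_stack([np.ones(6),
                     [350., 350., 350., 550., 550., 550.],  # x1: pressure
                     [40., 90., 150., 40., 90., 150.]])     # x2: temperature

# Least squares estimates: solve X'X beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta            # e = y - X beta
print(np.round(beta, 4))
```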
Multiple Linear Regression Example
We have the following data from ten species of mammal:
Species Name   Gestation Period in days (y)   Body Weight in kg (x1)   Avg. Litter Size (x2)
Rat            23    0.05   7.3
Tree Squirrel  38    0.33   3
Dog            63    8.5    4
Porcupine      112   11     1.2
Pig            115   190    8
Bush Baby      135   0.7    1
Goat           150   49     2.4
Hippo          240   1400   1
Fur Seal       254   250    1
Human          270   65     1
Here, our individual matrices are as follows:

y = [23, 38, 63, 112, 115, 135, 150, 240, 254, 270]′

X = [ 1  0.05  7.3
      1  0.33  3
      1  8.5   4
      1  11    1.2
      1  190   8
      1  0.7   1
      1  49    2.4
      1  1400  1
      1  250   1
      1  65    1 ]
We first check if our y values appear to be normally distributed:
Looks okay
Our X′X matrix is as follows:

X′X = [ 10       1974.58      29.9
        1974.58  2065419.851  3401.855
        29.9     3401.855     153.49 ]

To find the inverse of this matrix we could use Gauss-Jordan elimination as above; however, in the age of technology it is much quicker to use computer software such as MATLAB.

We find that

(X′X)⁻¹ = [  0.3021        −1.9913×10⁻⁴   −5.4428×10⁻²
            −1.9913×10⁻⁴    6.3378×10⁻⁷    2.4744×10⁻⁵
            −5.4428×10⁻²    2.4744×10⁻⁵    1.6569×10⁻² ]

We multiply this matrix by X′ and then by y to get our parameter estimates:

β̂ = (X′X)⁻¹X′y = [178.7, 0.07569, −17.93]′
Thus our fitted regression equation is Ŷ = 178.7 + 0.07569x1 − 17.93x2. We interpret this as follows:

The intercept means that (according to the model) a mammal with a body weight of 0 kg and an average litter size of 0 would have a gestation period of about 179 days. (Note that the intercept does not always make practical sense!)
For every kg of body weight, gestation period increases by 0.07569 days. For every additional baby in the average litter, gestation period decreases by 17.93 days.

Remember, we cannot assume the relationships are causal.

It can be dangerous to extrapolate outside the region of x1 and x2 values in the data, even if a point is within the range of each variable individually. The intercept may be an example of this!
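The estimates above can be reproduced from the raw data table; a sketch (expect small differences from the rounded values quoted in the notes):

```python
import numpy as np

# Mammal data: gestation period (days), body weight (kg), avg. litter size
y = np.array([23., 38., 63., 112., 115., 135., 150., 240., 254., 270.])
body = np.array([0.05, 0.33, 8.5, 11., 190., 0.7, 49., 1400., 250., 65.])
litter = np.array([7.3, 3., 4., 1.2, 8., 1., 2.4, 1., 1., 1.])

X = np.column_stack([np.ones(10), body, litter])
beta = np.linalg.solve(X.T @ X, X.T @ y)   # beta-hat = (X'X)^{-1} X'y
print(np.round(beta, 4))
```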
Multiple Linear Regression with k Independent Variables

Using our matrix notation we can generalise the multiple linear regression model from 2 independent variables to k independent variables.

The model is specified as follows:

Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε

Or, in terms of observations:

yi = β0 + β1x1i + β2x2i + · · · + βkxki + εi

Note that p = k + 1 is the total number of parameters in the model (k independent variables plus one intercept).
Hence y = Xβ + ε, where y is an n × 1 matrix, X is an n × p matrix, β is a p × 1 matrix, and ε is an n × 1 matrix.

This model relies on the same assumptions as the simple linear regression model, along with the assumption of no multicollinearity: none of the independent variables are collinear (highly correlated with one another).
Multiple Linear Regression Example
Data was collected from 195 American universities on the following variables:
Graduation Rate (the proportion of students in Bachelor's degree programmes who graduate after four years)
Admission Rate (the proportion of applicants to the university who areaccepted)
Student-to-Faculty Ratio (the number of students per lecturer) Average Debt (the average student debt level at graduation, in US dol-
lars)
A few observations from the data are displayed below:
Grad Rate (y)  Admission Rate (x1)  S/F Ratio (x2)  Avg Debt (x3)
0.65           0.35                 14              11156
0.81           0.39                 16              13536
0.8            0.35                 12              19762
0.46           0.65                 13              12906
0.5            0.58                 21              14449
0.47           0.65                 11              16645
0.18           0.59                 14              17221
0.52           0.6                  13              14791
0.39           0.79                 15              14382
⋮              ⋮                    ⋮               ⋮
In this case we have k = 3 independent variables and p = 4 parameters to estimate.

The model equation is as follows: yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi

Using computer software we determine that (rows and columns indexed by j = 0, 1, 2, 3):

(X′X)⁻¹ = [ 0.1059        0.01782       3.0672×10⁻³   3.2823×10⁻⁶
            0.01782       0.1906        5.7407×10⁻³   5.7146×10⁻⁷
            3.0672×10⁻³   5.7407×10⁻³   4.6400×10⁻⁴   2.1002×10⁻⁹
            3.2823×10⁻⁶   5.7146×10⁻⁷   2.1002×10⁻⁹   2.3045×10⁻¹⁰ ]

We further determine that:

β̂ = (X′X)⁻¹X′y = [1.1095, −0.3798, −0.02789, 5.1687×10⁻⁷]′

Thus our sample regression function is:

ŷ = 1.1095 − 0.3798x1 − 0.02789x2 + 5.1687×10⁻⁷x3
Interpretation:

For every 0.01 unit increase in admission rate, there is an expected 0.003798 unit decrease in graduation rate (we can't really talk about the usual one-unit increase in x1, since it is a proportion and ranges only from 0 to 1).

For every one unit increase in the student-to-lecturer ratio, there is an expected 0.02789 unit decrease in graduation rate.

For every $1 increase in average student debt, there is an expected 5.1687×10⁻⁷ unit increase in graduation rate.

Inferences from a Multiple Linear Regression

Just like in simple linear regression, we often want to do hypothesis testing for multiple linear regression.
There are three main types of hypothesis tests to consider:

1. Inferences on Individual Parameters
2. Inferences on the Full Model (all parameters)
3. Inferences on Subsets of Parameters
Inferences on Individual Parameters

The logic is the same as in simple linear regression, but we now use a matrix approach.

It can be proven that E(β̂) = β. It can also be proven that the covariance matrix of β̂ is:

Cov(β̂) = σ²(X′X)⁻¹

This means that for each individual element β̂j of β̂:

E(β̂j) = βj
Var(β̂j) = σ²Cjj

where Cjj is the diagonal element of (X′X)⁻¹ corresponding to β̂j.

This is the multivariate equivalent of our result in simple linear regression that Var(β̂1) = σ²·SSx⁻¹.
Now, we face the same problem as before in that we don't usually know the value of σ².

Remember, before we estimated σ² with

σ̂² = (1/(n−2)) Σᵢ ei² = SSResidual/(n−2)

In the multivariate case, we have to divide by n − p instead of n − 2 (we subtract the number of parameters to be estimated, which was 2 in that case).

Our sum of squared residuals can be expressed as follows:

SSResidual = Σᵢ ei² = e′e
           = (y − ŷ)′(y − ŷ)
           = (y − Xβ̂)′(y − Xβ̂)
           = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂
           = y′y − 2β̂′X′y + β̂′X′Xβ̂
           = y′y − β̂′X′y,  since X′Xβ̂ = X′y
Therefore, σ̂² = SSResidual/(n − p) = (1/(n − p))(y′y − β̂′X′y)

The test statistic for testing the null hypothesis H0 : βj = βj* is thus:

t = (β̂j − βj*)/√(σ̂²Cjj) = (β̂j − βj*)/√[(y′y − β̂′X′y)Cjj/(n − p)]

Under the null hypothesis, t follows a t distribution with n − p degrees of freedom.

Our decision rules will be the same as for inferences on β1 in the simple linear regression model (depending on whether we have a two-tailed, lower-tail or upper-tail test).

Note that this formula can be used for any βj, including β0. If we set βj* = 0 then we are testing for the significance of an individual coefficient, that is, whether there is a linear relationship between Y and xj.
Inferences on Individual Parameters: Example

Suppose we want to test whether average student debt has a significant effect on the graduation rate.

1. H0 : β3 = 0 vs. HA : β3 ≠ 0
2. α = 0.05
3. t = β̂3/√[(y′y − β̂′X′y)C33/(n − p)] ~ t(n − p)
4. Critical region: |t_observed| > t_{α/2,n−p} = t_{0.025,191} ≈ 1.984
5. t_observed = 5.1687×10⁻⁷/√[4.7691 × 2.3045×10⁻¹⁰/(195 − 4)] = 0.215
   |t_observed| < 1.984, thus we do not reject H0
6. We conclude that average student debt has no significant effect on graduation rate.
Inference on the Whole Regression Model

One way to test the usefulness of a particular multiple linear regression model with k independent variables is to test the following:

H0 : β1 = β2 = · · · = βk = 0
HA : βj ≠ 0 for at least one j

If we reject H0, this implies that at least one of the independent variables x1, x2, . . . , xk contributes significantly to the model.

To develop this test, remember the following from our r² calculations:

SSy = Σᵢ(yi − ȳ)² = Σᵢ yi² − nȳ² = y′y − nȳ²
SSResidual = y′y − β̂′X′y

Hence SSModel = SSy − SSResidual = β̂′X′y − nȳ²

It can be shown that under H0, SSModel/σ² ~ χ²(p − 1) and SSResidual/σ² ~ χ²(n − p). From this we can develop a test statistic which compares the variation explained by the model to the variation not explained by the model:

F = [SSModel/(p − 1)] / [SSResidual/(n − p)]

Under H0, F ~ F(p − 1, n − p), and so we use the F distribution table to determine whether or not to reject the null hypothesis.

In this case we always have a one-sided, upper-tail test. Our decision rule is: Reject H0 if F_observed > F_{α,p−1,n−p}.
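As a sketch, the F statistic is simple arithmetic once the sums of squares are known; the numbers here are taken from the graduation-rate example worked in these notes:

```python
# Graduation-rate example: n = 195 observations, p = 4 parameters
n, p = 195, 4
ss_model = 6.102          # beta'X'y - n*ybar^2
ss_residual = 4.769       # y'y - beta'X'y

# F statistic: explained vs. unexplained variation, each per degree of freedom
F = (ss_model / (p - 1)) / (ss_residual / (n - p))
print(round(F, 2))
```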
Inference on the Whole Regression Model: Example

For our graduation rate example:

1. H0 : β1 = β2 = β3 = 0 vs. HA : βj ≠ 0 for at least one j = 1, 2, 3
2. α = 0.05
3. Test statistic: F = [SSModel/(p − 1)] / [SSResidual/(n − p)] ~ F(p − 1, n − p)
4. Critical region: F_observed > F_{α,p−1,n−p} = F_{0.05,3,191} ≈ 2.65
5. F_observed = [(β̂′X′y − nȳ²)/(p − 1)] / [(y′y − β̂′X′y)/(n − p)] = [6.102/(4 − 1)] / [4.769/(195 − 4)] = 81.46 > 2.65, so we reject H0
6. We conclude that at least one of the independent variables contributes significantly to the model.

Inference on a Subset of the Parameters
It is also possible to carry out a test of significance on a subset of the parameters, but we will not cover this.

Confidence Intervals for Individual Coefficients

By rearranging our test statistic for an individual coefficient parameter, we can obtain the following (1 − α)100% Confidence Interval for βj, for any j = 0, 1, 2, . . . , k:

Pr( β̂j − t_{α/2,n−p}√(σ̂²Cjj) ≤ βj ≤ β̂j + t_{α/2,n−p}√(σ̂²Cjj) ) = 1 − α

where σ̂² = SSResidual/(n − p) = (y′y − β̂′X′y)/(n − p)
Confidence Intervals for Individual Coefficients: Example

Let us construct a confidence interval for β3 in the graduation rate example.

First let's calculate σ̂². If y′y = 68.9714 and β̂′X′y = 64.20232, then SSResidual = 4.769. Thus σ̂² = SSResidual/(n − p) = 4.769/(195 − 4) = 0.02497.

We know that β̂3 = 5.1687×10⁻⁷ and C33 = 2.3045×10⁻¹⁰.

Thus our confidence interval is given by:

β̂j ± t_{α/2,n−p}√(σ̂²Cjj)
= 5.1687×10⁻⁷ ± t_{0.025,191}√(0.02497 × 2.3045×10⁻¹⁰)
= 5.1687×10⁻⁷ ± 1.984√(0.02497 × 2.3045×10⁻¹⁰)
= 5.1687×10⁻⁷ ± 4.759×10⁻⁶
= (−4.24×10⁻⁶, 5.28×10⁻⁶)
Thus we can say with 95% confidence that the change in graduation rate for a $1 increase in average student debt is between −4.24×10⁻⁶ and 5.28×10⁻⁶.

Notice that the confidence interval contains the value 0, which agrees with the conclusion of our hypothesis test earlier.
Confidence Region for All Coefficients

One can also construct a joint confidence region for all parameters. For a simple linear regression model, the confidence region for (β0, β1) would have the shape of a two-dimensional ellipse. This is outside the scope of this course, however.

Confidence Interval for the Mean Response
As we did in simple linear regression, we can construct a confidence interval for the mean response at a particular point, say, x*:

x* = [1, x01, x02, . . . , x0k]′

The mean response at this point is E(Y | x = x*) = x*′β.

The estimated mean response at this point is ŷ* = x*′β̂.

A (1 − α)100% Confidence Interval for E(Y | x = x*) is given by:

Pr( ŷ* − t_{α/2,n−p}√[σ̂²x*′(X′X)⁻¹x*] ≤ E(Y | x = x*) ≤ ŷ* + t_{α/2,n−p}√[σ̂²x*′(X′X)⁻¹x*] ) = 1 − α
Confidence Interval for the Mean Response: Example

Let's find a confidence interval for the average graduation rate of universities which have an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000.

In this case, x*′ = [1, 0.5, 20, 20000], a 1 × 4 matrix. Our point estimate is:

ŷ* = x*′β̂ = [1, 0.5, 20, 20000] [1.1095, −0.3798, −0.02789, 5.1687×10⁻⁷]′
   = 1.1095 − 0.3798(0.5) − 0.02789(20) + 5.1687×10⁻⁷(20000)
   = 0.3721
Thus we would predict that such universities would have an average graduation rate of 37.21%.

The only thing left to calculate in our confidence interval formula is x*′(X′X)⁻¹x*. Using matrix multiplication we see this is equal to 0.03492.

Thus our 95% confidence interval for E(Y | x = x*) is:

ŷ* ± t_{α/2,n−p}√[σ̂²x*′(X′X)⁻¹x*]
= 0.3721 ± 1.984√(0.02497 × 0.03492)
= 0.3721 ± 0.0586 = (0.3135, 0.4307)
Prediction Interval for a New Response

Also, as in simple linear regression, we can predict the value of the response Y for a new observation x* and obtain an interval for it.

The predicted value is ŷ* = x*′β̂ (actually the same as the estimated mean response above).

A (1 − α)100% Prediction Interval for Y is:

Pr( ŷ* − t_{α/2,n−p}√[σ̂²(1 + x*′(X′X)⁻¹x*)] ≤ Y ≤ ŷ* + t_{α/2,n−p}√[σ̂²(1 + x*′(X′X)⁻¹x*)] ) = 1 − α

As in the simple linear regression case, we can see from the "1 +" that this prediction interval is wider than the confidence interval for the mean response.

Prediction Interval for a New Response: Example

Let us obtain a prediction interval at a particular university which has an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000.

Our point estimate is ŷ*, the same as above; it equals 0.3721. Our 95% prediction interval is as follows:

ŷ* ± t_{α/2,n−p}√[σ̂²(1 + x*′(X′X)⁻¹x*)]
= 0.3721 ± 1.984√(0.02497(1 + 0.03492))
= 0.3721 ± 0.3189 = (0.0532, 0.691)

We can see that this is a very wide (and not very useful) prediction interval.
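A quick check of both half-widths from the quoted quantities:

```python
import math

# Graduation-rate example at x* = (1, 0.5, 20, 20000)
y_hat = 0.3721
s2 = 0.02497                # sigma^2 estimate = SSResidual / (n - p)
quad = 0.03492              # x*' (X'X)^{-1} x*
t_crit = 1.984              # t_{0.025,191}

ci_half = t_crit * math.sqrt(s2 * quad)        # mean-response CI half-width
pi_half = t_crit * math.sqrt(s2 * (1 + quad))  # prediction-interval half-width
print(round(ci_half, 4), round(pi_half, 4))
```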
Assessing Goodness of Fit of a Multiple Linear Regression Model

We can define r² just as we did for the simple linear regression model:

r² = 1 − SSResidual/SSy = 1 − (y′y − β̂′X′y)/(y′y − nȳ²)

In this case it is referred to as the Multiple Coefficient of Determination.

One of the disadvantages of this statistic is that it will always increase as more independent variables are added to the model. This will suggest that the fit is getting better even if the new variables are not significant. This problem led to the development of an alternative goodness-of-fit statistic for multiple linear regression called Adjusted r².

Adjusted r²

Adjusted r², written as r̄², imposes a penalty for adding more terms to the model. It will thus decrease when we add an independent variable that does not contribute much explanatory power.

r̄² = 1 − [SSResidual/(n − p)] / [SSy/(n − 1)] = 1 − [(n − 1)/(n − p)](1 − r²)
r² and r̄² for Multiple Linear Regression Model: Example

In our university graduation rates example, we calculate r² as follows:

r² = 1 − (y′y − β̂′X′y)/(y′y − nȳ²) = 1 − (68.9714 − 64.20232)/(68.9714 − 58.09986) = 1 − 0.4387 = 0.5613

This suggests that 56% of the variation in graduation rates can be explained by the three factors in the model.
Now we calculate r̄² as follows:

r̄² = 1 − [(n − 1)/(n − p)](1 − r²) = 1 − [(195 − 1)/(195 − 4)](1 − 0.5613) = 1 − 0.4456 = 0.5544

In this case there is not much difference between the two, because the sample size n is very large compared to the number of parameters p.
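Both goodness-of-fit statistics can be checked from the quoted quadratic forms:

```python
# Graduation-rate example: quadratic forms quoted in the notes
n, p = 195, 4
yty = 68.9714             # y'y
bXty = 64.20232           # beta'X'y
n_ybar2 = 58.09986        # n * ybar^2

r2 = 1 - (yty - bXty) / (yty - n_ybar2)       # multiple coeff. of determination
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)     # adjusted r^2
print(round(r2, 4), round(r2_adj, 4))
```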
Model Selection Algorithms

Various algorithms (procedures) have been proposed for selecting which variables to include in a model. This is particularly important when there are many possible independent variables to choose from: we do not want to miss out on variables that contribute significantly to the model, but we also don't want to include unnecessary variables which make our estimates less precise.

The three most common algorithms in use are:

1. Backward Elimination
2. Forward Selection
3. Stepwise Selection
Backward Elimination

Backward Elimination starts with a full model consisting of all possible independent variables, and cuts it down until the "best" model is achieved.

The algorithm proceeds as follows:

1. Begin with a model including all possible independent variables
2. Estimate the model and take note of the t_observed statistic values for the individual coefficients (not including β̂0)
3. Choose the coefficient with the smallest |t_observed|; call it β̂j
4. Carry out the test of hypothesis H0 : βj = 0 vs. HA : βj ≠ 0 at the α significance level
5. If the null hypothesis is rejected, we accept this as our final model
6. If the null hypothesis is not rejected, we remove the variable xj from the model and repeat from step (2)
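The steps above can be sketched in code. This is a minimal illustration (the function name and toy data are our own, not from the notes): at each pass it refits the model, finds the smallest |t|, and drops that variable if it is not significant:

```python
import numpy as np
from scipy import stats

def backward_eliminate(x_vars, y, alpha=0.05):
    """Backward elimination: drop the weakest variable until all are significant.

    x_vars: dict of name -> 1-D array (no intercept column); returns kept names.
    """
    names = list(x_vars)
    y = np.asarray(y, dtype=float)
    while names:
        # Design matrix: intercept plus the variables still in the model
        X = np.column_stack([np.ones(len(y))] + [x_vars[v] for v in names])
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        e = y - X @ beta
        s2 = e @ e / (n - p)                           # sigma^2 estimate
        t_obs = beta / np.sqrt(s2 * np.diag(XtX_inv))  # t statistic per coefficient
        j = 1 + np.argmin(np.abs(t_obs[1:]))           # smallest |t|, skip intercept
        if abs(t_obs[j]) > stats.t.ppf(1 - alpha / 2, n - p):
            break                                      # all remaining terms significant
        names.pop(j - 1)                               # drop x_j and refit
    return names

# Toy data: y depends strongly on x1; x2 is irrelevant
x1 = np.arange(20.0)
x2 = np.cos(x1)
y = 2 + 3 * x1 + 0.5 * np.sin(x1)   # deterministic perturbation stands in for noise
kept = backward_eliminate({"x1": x1, "x2": x2}, y)
print(kept)
```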
Forward Selection

Forward Selection works in the opposite direction: it begins with an empty model and adds variables until the "best" model is achieved.

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable
2. Identify the independent variable with the highest |t_observed| value in its simple linear regression with y
3. Carry out the test of hypothesis H0 : βj = 0 vs. HA : βj ≠ 0 at the α significance level in this simple linear regression model
4. If we reject H0, we add xj to the multiple linear regression model, proceed to the independent variable with the next highest |t_observed| in its simple linear regression with y, and repeat from step (3)
5. If the null hypothesis is not rejected, we conclude xj is not significant to the model, so we do not add it. We also realise that none of the other independent variables with smaller |t_observed| will be significant; thus the model is final and we are done
Stepwise Selection

Stepwise Selection combines elements of both Backward Elimination and Forward Selection.

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable
2. Identify the independent variable with the highest |t_observed| value in its simple linear regression with y
3. Carry out the test of hypothesis H0 : βj = 0 vs. HA : βj ≠ 0 at the α significance level in this simple linear regression model
4. If we reject H0, we add xj to the multiple linear regression model

So far the algorithm is exactly like Forward Selection; but now it changes:

5. Carry out a t test from the multiple linear regression model for the significance of each βj in the model so far
6. If the null hypothesis is not rejected for some βj, we delete that xj from the model
7. Proceed to the independent variable with the next highest |t_observed| in its simple linear regression with y, and repeat from step (3)
8. Once we reach a point where all the variables in the model are significant, and none of the variables outside the model are significant, this is our final model
Model Selection Algorithms: Example

It is easier to see an example in the tutorial using SAS, since these algorithms are very tedious to carry out by hand.

In the case of our Graduation Rate example, all three algorithms lead to the same result: we keep x1 and x2 in the model and drop x3.

Note: there are other model selection algorithms, but we will not cover them.
Residual Analysis

Revisiting Model Assumptions

Remember that the assumptions of the multiple linear regression model include the following:

• All error terms have a zero mean, i.e. E(εi) = 0 for all i
• All error terms have the same fixed variance, i.e. Var(εi) = σ² for all i
• All observations are independent of each other
• The error terms follow the normal distribution
• None of the x variables are highly correlated with one another

Whenever we are applying a multiple linear regression model it is important to check these assumptions.

Model Adequacy

The first four of these assumptions can be assessed using residual analysis: that is, looking at the residuals of the model.

There are two basic ways to do this:

• Graphical Analysis
• Hypothesis Tests

In this module we will only look at graphical analysis (the hypothesis-testing approach will be taught in Econometrics in third year).
Graphical Residual Analysis

Remember that the residuals are defined as e = y − ŷ, that is, ei = yi − ŷi.

To calculate the residuals we first determine the least squares regression fit and then obtain the predicted value ŷi for each xi in the sample; then we subtract these predicted values from the observed yi values.

Once we have the residuals we can plot them (vertical axis) against the predicted values (horizontal axis). One can gain a lot of information about the model by looking at this plot.
Plot of Residuals vs. Predicted Values

The main things to look for in the plot are patterns or unusual points. Ideally, the points should be evenly distributed above and below zero and should appear completely random.

In this plot we can see that the points appear random.
Do you see anything different in this plot?

The variance of the residuals appears to increase as ŷ increases, which violates the constant variance assumption.
Normal Quantile-Quantile Plot

A normal quantile-quantile (QQ) plot is a useful tool for checking whether the residuals are normally distributed. If they are, the points should fall approximately on a straight line.

Does this QQ plot look normally distributed?
How about this one?
Histogram of Residuals

Another way to check normality is to plot a histogram of the residuals and see if it is bell shaped.

How about this one?
Summary of Graphical Analysis of Residuals

Graphical analysis of residuals is a useful diagnostic tool for determining model adequacy. However, it has limitations: often the results can be inconclusive. This is especially true for small sample sizes.
Outlier Diagnostics

We can also use the residuals to look for outliers: observations which the model predicts extremely badly.

While we could simply look at the residuals themselves, it is better to scale them in some way. (Analogy to z-scores from STA100A: we don't only want to know how far an observation is from its mean; we want to know how many standard deviations away it is.)

A basic way to scale the residuals would be to divide them by their estimated standard deviation:

di = ei/σ̂

This is called the standardized residual. Since these residuals should be approximately normally distributed with mean 0 and variance 1, they should almost always lie in the range −3 ≤ di ≤ 3. Thus we could define an outlier as any observation whose standardized residual is > 3 or < −3.

A refinement that also accounts for the leverage hii of each observation (a diagonal element of the Hat matrix, defined below) is the internally studentized residual:

ri = ei/√(σ̂²(1 − hii))
Outlier Diagnostics: Externally Studentized Residuals

The only weakness of the internally studentized residual is that the variance estimate σ̂² used in calculating ri is influenced by the ith observation. It may be thrown off by an outlier; thus ri is not ideal for outlier detection.

Instead, for each observation, we could estimate the variance using the data set of n − 1 observations with the ith observation removed, and use this estimate S²(i) in the scaling formula.

It can be shown that:

S²(i) = [(n − p)σ̂² − ei²/(1 − hii)] / (n − p − 1)

If we replace σ̂² with S²(i) in the internally studentized residual formula we get:

ti = ei/√(S²(i)(1 − hii))

This is known as the externally studentized residual and is the best way of scaling residuals.
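All three scaled residuals can be computed in a few lines. This sketch uses the eight observations from the worked example that follows (with y₈ = 15, the value consistent with the example's residual column):

```python
import numpy as np

# The n = 8 observations from the outlier example
x = np.array([8., 7., 10., 9., 14., 7., 7., 10.])
y = np.array([19., 17., 23., 22., 33., 18., 16., 15.])

X = np.column_stack([np.ones(8), x])
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverages h_ii

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                              # residuals
s2 = e @ e / (n - p)                          # sigma^2 estimate

d = e / np.sqrt(s2)                           # standardized residuals
r = e / np.sqrt(s2 * (1 - h))                 # internally studentized
S2i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(S2i * (1 - h))                # externally studentized
print(np.round(t, 4))
```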
Hypothesis Test for Outliers

A further advantage is that, under the model assumptions, ti ~ t(n − p − 1). One could carry out a hypothesis test on each observation to check whether it is an outlier:

1. H0: The ith observation is not an outlier vs. HA: The ith observation is an outlier
2. α = 0.05
3. Test statistic is |ti|
4. Rejection rule: Reject H0 if |ti| > t_{α/(2n),n−p−1}
5. Compute ti observed and reach a decision
6. State the conclusion

The reason why we use α/(2n) instead of α/2 is that we are running the hypothesis test n times, so we are dividing up the overall type I error probability among the n individual tests (this is known as the Bonferroni approach).
yi   xi   ŷi      ei      di       ri       ti
19   8    18.325  0.675   0.2008   0.2178   0.1997
17   7    16.275  0.725   0.2157   0.2450   0.2248
23   10   22.425  0.575   0.1711   0.1856   0.1699
22   9    20.375  1.625   0.4835   0.5169   0.4827
33   14   30.625  2.375   0.7067   1.4133   1.5796
18   7    16.275  1.725   0.5133   0.5830   0.5480
16   7    16.275  −0.275  −0.0818  −0.0929  −0.0849
15   10   22.425  −7.425  −2.2092  −2.3962  −10.5468
Outlier Diagnostics: Example

Suppose we have the set of data (n = 8) shown in the first two columns of the table above.

When we estimate the simple linear regression model yi = β0 + β1xi + εi using the least squares method, we get:

β̂0 = 1.925, β̂1 = 2.05

We can substitute each of our xi for x in the fitted equation ŷ = 1.925 + 2.05x to obtain the predicted values ŷi in the third column of the table above.

We can then calculate the residuals ei = yi − ŷi (see the fourth column of the table).

To calculate the standardized residuals we first need to calculate σ̂²:

σ̂² = (1/(n−2)) Σᵢ ei² = (1/6)[0.675² + 0.725² + · · · + (−7.425)²] = 11.296

Now we have di = ei/σ̂ (see the calculated values in the fifth column).

Next we can calculate the internally studentized residuals. We first need to calculate the Hat matrix H = X(X′X)⁻¹X′.
In this case, X = [ 1  8
                    1  7
                    1  10
                    1  9
                    1  14
                    1  7
                    1  7
                    1  10 ]
Taking the diagonal elements hii of H and using them in the formula ri = ei/√(σ̂²(1 − hii)), we get the values in the sixth column of the table above.

Next we calculate the externally studentized residuals. We first need to calculate

S²(i) = [(n − p)σ̂² − ei²/(1 − hii)] / (n − p − 1)

Then we plug these into the following formula to get the values in the seventh column:

ti = ei/√(S²(i)(1 − hii))

It is now apparent for the first time that the 8th observation is an outlier.
Hypothesis Test for Outliers: Example
We conduct the hypothesis test described above for each of the 8 observations, at the α = 0.05 level
In every case, our rejection rule is: reject H0 if |ti| > t(α/(2n), n−p−1) = t(0.003125, 5)
We don't have a column for 0.003125 in our t table, so we can take the average of the entries in the 0.005 and 0.001 columns to get an approximation: (4.030 + 5.876)/2 = 4.953
We reject H0 for all observations for which |ti| > 4.953; in this case we reject only for the 8th observation
Thus we conclude that the 8th observation is an outlier and none of the others are
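The notes compute these diagnostics in SAS; the whole chain can be sketched in Python with numpy. This recomputes everything from the raw (x, y) data, so individual decimals may differ from the table above, but it flags the same single outlier:

```python
import numpy as np

# Raw data from the example (n = 8)
x = np.array([8, 7, 10, 9, 14, 7, 7, 10], dtype=float)
y = np.array([19, 17, 23, 22, 33, 18, 16, 19], dtype=float)

n, p = len(y), 2
X = np.column_stack([np.ones(n), x])        # design matrix with intercept column

beta = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimates (b0, b1)
e = y - X @ beta                            # residuals
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # diagonal of the hat matrix

s2 = e @ e / (n - p)                        # sigma-hat^2
d = e / np.sqrt(s2)                         # standardized residuals
r = e / np.sqrt(s2 * (1 - h))               # internally studentized residuals
S2i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(S2i * (1 - h))              # externally studentized residuals

outliers = np.abs(t) > 4.953                # Bonferroni critical value from the notes
print(np.where(outliers)[0])                # only the 8th observation (index 7) is flagged
```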
Influence Diagnostics
Sometimes, a small subset of observations (even one observation) exerts a disproportionate influence on the fitted regression model
In other words, the parameter estimates depend more on these few observations than on the majority of the data
We would like to be able to locate these influential observations and possibly eliminate them
Leverage
The elements of the Hat Matrix hij describe the amount of influence exerted by yj on ŷi
Thus a basic measure of the influence of an observation, known as the leverage, is given by hii
The properties of the Hat Matrix H include that the sum of all n diagonal elements is equal to p, that is: Σ hii = p (summing over i = 1, . . . , n)
Therefore, the average hii value would be p/n
As a rule of thumb, any observation i such that hii > 2p/n would be called a high-leverage observation
Cook's Distance

The leverage only takes into account the location of an x observation
A more sophisticated measure of influence would take into account the location of the x and y values of an observation. Cook's Distance is one such measure
Let β̂ be the usual least squares parameter estimates from all n observations, and let β̂(i) be the least squares parameter estimates where the ith observation has been deleted from the data
Then Cook's Distance is defined as:

Di = (β̂(i) − β̂)′ X′X (β̂(i) − β̂) / (p · MSResidual)

The Cook's Distance formula can also be expressed in terms of the internally studentized residuals:

Di = (ri²/p) · (hii/(1 − hii))

In general, if Di > 1 we say that the ith observation is influential

Influence Diagnostics: Example
With the outlier data set used above, the hii values are:

hii = [0.15, 0.225, 0.15, 0.125, 0.75, 0.225, 0.225, 0.15]

In this case 2p/n = 2(2)/8 = 0.5. Since h55 = 0.75 > 0.5, we can say that the 5th observation is a high-leverage observation
We can calculate the Cook's Distance using the formula Di = (ri²/p) · (hii/(1 − hii))
In this case, Di = [0.0042, 0.0087, 0.0030, 0.0191, 2.9961, 0.0493, 0.0013, 0.5066]
Since D5 > 1 we can again say that the 5th observation is influential
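The same checks can be scripted; a sketch in numpy (again recomputed from the raw data, so D5 comes out near 3 rather than exactly 2.9961, but the same observation is flagged):

```python
import numpy as np

# Same data as the outlier example
x = np.array([8, 7, 10, 9, 14, 7, 7, 10], dtype=float)
y = np.array([19, 17, 23, 22, 33, 18, 16, 19], dtype=float)
n, p = len(y), 2
X = np.column_stack([np.ones(n), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages h_ii

s2 = e @ e / (n - p)                             # MSResidual
r = e / np.sqrt(s2 * (1 - h))                    # internally studentized residuals
D = (r**2 / p) * (h / (1 - h))                   # Cook's Distance

print(h > 2 * p / n)   # high leverage: only the 5th observation (index 4)
print(D > 1)           # influential: only the 5th observation
```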
Multicollinearity
Multicollinearity occurs when two or more of the x variables have a strong linear relationship with each other
This makes the estimates less precise
In fact, if two or more x variables have a perfect linear relationship, we cannot use the method of least squares
Technically this is because the X′X matrix is not invertible
In most cases the multicollinearity will not be perfect; but if it is strong, it can still ruin the model
How do we know if there is multicollinearity?
Detecting Multicollinearity
The simplest way to detect multicollinearity is to calculate the Pearson correlation coefficient between each pair of independent variables xs and xt
A rule of thumb says that if any of these correlation coefficients is higher than 0.7 in absolute value, there is serious multicollinearity
SAS can also provide us with variance inflation factor (VIF) estimates, which tell us by what factor the error variance increases due to multicollinearity in a particular independent variable
A rule of thumb says that if the VIF > 5 for any independent variable, there is serious multicollinearity involving that variable
The simplest way of resolving multicollinearity is to remove one of the offending x variables
Multicollinearity: Example
The table below gives the cost of adding a new communications node to a network, along with three independent variables thought to explain this cost: the number of ports available for access (x1), the bandwidth (x2), and the port speed (x3)
When we estimate the model Yi = β0 + β1x1i + β2x2i + β3x3i + εi using Ordinary Least Squares, we get the fitted equation:

ŷ = 17487 − 14168x1 + 81.39x2 + 1523.7x3
Continue from SAS project
yi x1i x2i x3i
52388 68 58 653
51761 52 179 499
50221 44 123 422
36095 32 38 307
27500 16 29 154
57088 56 141 538
54475 56 141 538
33969 28 48 269
31309 24 29 230
23444 24 10 230
24269 12 56 115
53479 52 131 499
33543 20 38 192
33056 24 29 230
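The notes obtain VIFs from SAS; an equivalent check can be sketched in numpy using the data from the table above. Each VIF is computed by regressing one x variable on the other two and taking 1/(1 − R²):

```python
import numpy as np

# x1 = ports, x2 = bandwidth, x3 = port speed (from the table above)
x1 = np.array([68, 52, 44, 32, 16, 56, 56, 28, 24, 24, 12, 52, 20, 24], float)
x2 = np.array([58, 179, 123, 38, 29, 141, 141, 48, 29, 10, 56, 131, 38, 29], float)
x3 = np.array([653, 499, 422, 307, 154, 538, 538, 269, 230, 230, 115, 499, 192, 230], float)

# Pairwise Pearson correlations between the independent variables
corr = np.corrcoef([x1, x2, x3])
print(corr[0, 2])          # ports vs port speed: nearly perfect correlation

# VIF for each x: regress it on the other two, then VIF = 1 / (1 - R^2)
Xall = np.column_stack([x1, x2, x3])
n, k = Xall.shape
vifs = []
for j in range(k):
    yj = Xall[:, j]
    Xj = np.column_stack([np.ones(n), np.delete(Xall, j, axis=1)])
    b = np.linalg.lstsq(Xj, yj, rcond=None)[0]
    resid = yj - Xj @ b
    r2 = 1 - resid @ resid / ((yj - yj.mean()) @ (yj - yj.mean()))
    vifs.append(1 / (1 - r2))
print(vifs)                # x1 and x3 have huge VIFs: serious multicollinearity
```

Here x3 is almost exactly a constant multiple of x1, so by both rules of thumb we would drop one of the two.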
Changes in Functional Form
What if there is a non-linear relationship between Y and x? E.g. quadratic, cubic, logarithmic, etc.
We can still use linear regression just as before, but with the independent variables transformed appropriately

Changes in Functional Form: Example 1

Example with quadratic term

Changes in Functional Form: Example 2
Example with ln term (log base e)
Interpretation: β1 is the expected change in y for a one-unit increase in ln x
This can also be expressed in terms of a change in x: β1 is the expected change in y when x is multiplied by e = 2.718, that is, when x increases by 171.8%
More generally, the expected change in y for a δ% increase in x would be β1 ln((100 + δ)/100)
Thus the expected change in y for a 10% increase in x would be β1 ln(1.1) = 0.0953β1
For small δ, ln((100 + δ)/100) ≈ δ/100 and so, we can say approximately that β1/100 is the expected change in y for a 1% increase in x
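The arithmetic behind these interpretations is easy to verify; a quick check of the log factors (per unit of β1):

```python
import math

# Expected change in y for a d% increase in x, per unit of beta1,
# in a model of the form y = beta0 + beta1 * ln(x)
def log_effect(d):
    return math.log((100 + d) / 100)

print(log_effect(171.8))   # multiplying x by e: effect is beta1 * 1
print(log_effect(10))      # 10% increase: effect is about 0.0953 * beta1
print(log_effect(1))       # 1% increase: effect is about 0.01 * beta1
```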
Transformations of the Dependent Variable
Used to make the data fit a normal distribution better
Used to resolve the problem of non-constant variance
Common transformations include: y* = ln(y), y* = √y
The Box-Cox Transformation is a method used to choose the best transformation for y
Box-Cox Transformation
The Box-Cox Transformation consists of estimating a new parameter λ (this λ has nothing to do with the Poisson distribution)
The value of λ is the best power to use in transforming y; for instance:
If λ = 2, we use the transformation y* = y²
If λ = 1/2, we use the transformation y* = y^(1/2) = √y
In the special case λ = 0 we use the transformation y* = ln(y)
SAS can estimate the parameter λ for us
Box-Cox Transformation: Example
Interaction Terms
Dummy Variables
Do two-category only; save rest for econometrics
5 Logistic Regression

Different kinds of Dependent Variables
Throughout our study of linear regression models, we have assumed that thedependent variable is a normally distributed random variable
However, in practice we may want to build models for data that are not nor-mally distributed
For the rest of the module we will be looking at some of these models

Categorical Dependent Variable
We already studied models with dummy (categorical) independent variables But what if the dependent variable is categorical?
If the dependent variable has two possible values (like a Bernoulli random variable), then it is called binary
A Bernoulli random variable is a binomial random variable where the number of trials is n = 1
For example, the dependent variable could be:
Yi = 1 if the ith product is defective, 0 if the ith product is ok

Or:

Yi = 1 if the ith patient recovers, 0 if the ith patient dies
We can construct models for this kind of dependent variable They will be quite different from linear regression models, but still have some
key similarities since both types of models are classified as Generalized Lin-ear Models
Generalized Linear Models
Generalized Linear Models are a class of models, some of the properties of which are:

1. We have n independent response observations y1, y2, . . . , yn with theoretical means μ1, μ2, . . . , μn
2. The observation yi is a random variable with a probability distribution from the exponential family (which basically means its probability mass function or probability density function has an e in it)
3. The mean response vector is related to a linear predictor η = x′β = β0 + β1x1 + β2x2 + . . . + βkxk
4. The relationship between ηi and μi is expressed by a link function g so that ηi = g(μi), i = 1, 2, . . . , n

By taking the inverse of this function we can also write μi = E(yi) = g⁻¹(ηi) = g⁻¹(xi′β)
In the case of linear regression:
The link function is g(μi) = μi, so E(Yi) = μi = ηi = xi′β
The dependent variable follows a normal distribution
In summary, Yi ~ N(xi′β, σ²) (this is a way of writing the model without εi)
Logistic Regression Model
If each Yi follows a Bernoulli distribution (binomial with n = 1), with probability of success Pr(Yi = 1) = pi and probability of failure 1 − pi, then μi = E(Yi) = pi
If we again used the identity link function g(μi) = μi then our model would be pi = β0 + β1x1 + β2x2 + . . . + βkxk
It is easy to see that this is a bad idea, because the predicted values of the model would not necessarily be between 0 and 1
A better model uses the link function g(pi) = ln(pi/(1 − pi))
The quantity pi/(1 − pi) is called an odds: it is the ratio of the probability of success to the probability of failure
Thus the link function gives the log odds, also known as the logit or logistic function
This means the model can be expressed as follows:

ln(pi/(1 − pi)) = β0 + β1x1 + β2x2 + . . . + βkxk
By taking the inverse of the function we can also express the model like this:

E(Yi) = pi = 1/(1 + e^(−xi′β))

where xi′ = [1, x1i, x2i, . . . , xki]

Notice that there is no error term εi in this model
Remember that the pi are probabilities and thus range between 0 and 1
A graph of g(pi) is as follows (it is undefined at 0 and 1):
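The link function and its inverse can also be checked numerically; a minimal sketch:

```python
import math

def logit(p):
    # The link function g(p) = ln(p / (1 - p)); undefined at p = 0 and p = 1
    return math.log(p / (1 - p))

def inv_logit(eta):
    # The inverse link: maps any real-valued linear predictor into (0, 1)
    return 1 / (1 + math.exp(-eta))

print(logit(0.5))             # 0.0: probability 0.5 means even odds
print(inv_logit(logit(0.9)))  # the round trip recovers 0.9
```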
Parameter Estimation in Logistic Regression
Just like in linear regression, our first task is to estimate the parameter vector

β = [β0, β1, β2, . . . , βk]′
However we can no longer use the Method of Least Squares (Why?)
Instead we use the Method of Maximum Likelihood
We will not explain the details of this method
Unfortunately this method requires an iterative procedure and cannot easily be calculated by hand
However, computer software such as SAS can compute the estimates β̂0, β̂1, β̂2, . . . , β̂k quite easily
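The iterative procedure is a form of Newton-Raphson (iteratively reweighted least squares). A minimal sketch on a small made-up data set (the data and variable names here are illustrative, not from the notes):

```python
import numpy as np

# Illustrative data: y is binary, x is a single predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones(len(x)), x])

beta = np.zeros(2)
for _ in range(25):                     # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))     # current fitted probabilities
    W = np.diag(p * (1 - p))            # Bernoulli variance weights
    # Update step: beta + (X'WX)^(-1) X'(y - p)
    beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))

print(beta)   # maximum likelihood estimates (intercept, slope)
```

At convergence the gradient X′(y − p) is essentially zero, which is the maximum likelihood condition.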
Interpreting Parameters in Logistic Regression
More important for our purpose is to be able to interpret what the parameter estimates tell us
The parameter estimates themselves are interpreted as log-odds ratios, while e^β̂1, for instance, would be interpreted as an odds ratio
It is best to illustrate what these terms mean using an example
Logistic Regression Example
Consider a data set of 200 people admitted to the intensive care unit at a hospital
The dependent variable is whether they died:

yi = 1 if the person died, 0 if the person survived

The first independent variable is the type of admission to ICU:

xi1 = 1 if they were admitted via emergency services, 0 if they were self-admitted

The second independent variable xi2 is the person's systolic blood pressure in mm Hg
The estimated model is:

ln(pi/(1 − pi)) = β0 + β1xi1 + β2xi2

which can also be written as:

Pr(Yi = 1) = pi = 1/(1 + e^(−(β0 + β1xi1 + β2xi2)))
We estimate the parameters in SAS and our fitted equation is:

ln(p̂i/(1 − p̂i)) = −1.33 + 2.022xi1 − 0.014xi2

Or: Pr(Yi = 1) = p̂i = 1/(1 + e^(−(−1.33 + 2.022xi1 − 0.014xi2)))
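Plugging values into the fitted equation shows how it is used; a short sketch (the blood pressure of 120 mm Hg is an illustrative choice, not from the notes):

```python
import math

def p_death(x1, x2):
    # Fitted logistic equation from the ICU example
    eta = -1.33 + 2.022 * x1 - 0.014 * x2
    return 1 / (1 + math.exp(-eta))

p_emergency = p_death(1, 120)   # emergency admission, systolic BP 120
p_self = p_death(0, 120)        # self-admitted, same blood pressure
print(p_emergency, p_self)      # emergency admission carries the higher risk
```

Note that at any fixed blood pressure, the ratio of the two odds is exactly e^2.022, which is the odds-ratio interpretation of β̂1.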
Now to interpret the parameters: as in linear regression, β0 represents the case when all independent variables take a value of 0
In this case, if xi1 = 0 (meaning the person was self-admitted) and t