
    Faculty of Applied Sciences, Department of Mathematics and Physics

    Statistical Methods 2B Lecture Notes

    Lecturer: Mr. T. Farrar

    Contents

    1 Review of Random Variables and Probability Distributions

    2 Correlation Analysis of Paired Data Sets

    3 Simple Linear Regression Analysis

    4 Multiple Linear Regression

    5 Logistic Regression

    6 Poisson Regression

    1 Review of Random Variables and Probability Distributions

    What you will be expected to already know

    1. Descriptive Statistics

    2. Basic Probability concepts

    3. Graphical methods of displaying data (line graph, scatter plot, histogram)

    4. Random Variables and Probability Distributions (Discrete and continuous)

    5. Special probability distributions (binomial, Poisson, normal)

    6. Hypothesis Testing (t-tests, F tests, χ² tests, nonparametric tests, p-values)

    7. Basic calculus

    8. Matrices


    Discrete Random Variables

    Definition: A random variable is a variable which takes on its values by chance.

    Definition: The sample space S (a.k.a. support) is the set of possible values that a random variable may take.

    A random variable is discrete if it can take only a finite or countably infinite number of distinct values. Usually a discrete random variable only takes on integer values.

    E.g. Number of defective television sets in a shipment of 100 sets: S = {0, 1, 2, . . . , 100}

    E.g. Number of visits to a website in one year: S = {0, 1, 2, 3, . . .}

    We use an uppercase letter such as Y to denote a random variable, and a lowercase letter such as y to denote a particular value that the random variable may assume.

    Discrete Probability Distributions

    We may denote the probability that Y takes on the value y by Pr(Y = y). This probability is subject to the following restrictions:

    1. 0 ≤ Pr(Y = y) ≤ 1 for all y (all probabilities must be between 0 and 1)

    2. ∑_{y∈S} Pr(Y = y) = 1 (sum of probabilities over the whole sample space must be 1)

    E.g. Rolling a six-sided die: let Y be the number that comes up. Pr(Y = y) = 1/6, y = 1, 2, 3, 4, 5, 6

    It is easy to see that both restrictions hold.

    The probability distribution of the lengths of patent lives for new drugs is given below. The patent life refers to the number of years a company has to make a profit from the drug after it is approved before competitors may produce the same drug.

    Years, y 3 4 5 6 7 8 9 10 11 12 13

    Pr (Y =y) .03 .05 .07 .10 .14 .20 .18 .12 .07 .03 .01

    The function that maps all values in the sample space to their probabilities is called a probability mass function.

    It may be expressed in a table (as above) or as a mathematical formula
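    For example, the patent-life table above can be stored and checked against the two restrictions with a short Python sketch (Python is not part of these notes; this is just an illustration):

        # Patent-life probability mass function from the table above
        patent_pmf = {3: 0.03, 4: 0.05, 5: 0.07, 6: 0.10, 7: 0.14, 8: 0.20,
                      9: 0.18, 10: 0.12, 11: 0.07, 12: 0.03, 13: 0.01}

        # Restriction 1: every probability lies between 0 and 1
        assert all(0 <= p <= 1 for p in patent_pmf.values())

        # Restriction 2: the probabilities sum to 1 over the whole sample space
        assert abs(sum(patent_pmf.values()) - 1) < 1e-9

        # Example query: probability that the patent life is at least 10 years
        print(sum(p for y, p in patent_pmf.items() if y >= 10))  # 0.23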


    We can use a graph to represent the probability mass function:

    Suppose the law dictates that the sentence (in years) for a particular crime must be between 5 and 10 years in prison. By looking at past cases a lawyer is able to construct the following probability distribution for the number of years to which a person convicted of the crime is sentenced:

    f(y) = 0.4471/√y , y = 5, 6, 7, 8, 9, 10

    Hence the probability that a person convicted of this crime receives a 6 year sentence is

    f_Y(6) = 0.4471/√6 = 0.1825

    As an exercise, graph this probability mass function and verify that it satisfies the two restrictions on probability mass functions.

    Expected Value of a Discrete Random Variable

    We can define the expected value of a random variable as follows:

    E(Y) = ∑_{y∈S} y f(y)

    If f(y) accurately characterises the population described by the random variable Y, then E(Y) = μ, the population mean.


    In our prison sentencing example:

    E(Y) = ∑_{y=5}^{10} y (0.4471/√y)

    = ∑_{y=5}^{10} 0.4471 √y

    = 0.4471 (√5 + √6 + √7 + √8 + √9 + √10)

    = 7.298

    Thus, we would expect the average sentence to be 7.3 years.

    It can also be shown that for any real-valued function g(Y), the expected value of g(Y) is given by:

    E(g(Y)) = ∑_{y∈S} g(y) f(y)

    Variance of a Discrete Random Variable

    We can define the variance of a random variable as follows:

    σ² = Var(Y) = E[(Y − μ)²]

    = E(Y²) − μ² (why?)

    = ∑_{y∈S} y² f(y) − E(Y)²

    In our prison sentencing example:

    Var(Y) = ∑_{y=5}^{10} y² (0.4471/√y) − E(Y)²

    = ∑_{y=5}^{10} 0.4471 y^(3/2) − 7.298²

    = 0.4471 (5^(3/2) + 6^(3/2) + 7^(3/2) + 8^(3/2) + 9^(3/2) + 10^(3/2)) − 7.298²

    = 56.177 − 53.261 = 2.916


    Thus, the variance of Y is 2.916 and the standard deviation is √2.916 = 1.71.
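    The same sentencing calculations can be reproduced with a short Python sketch (not part of the notes):

        from math import sqrt

        support = range(5, 11)
        f = {y: 0.4471 / sqrt(y) for y in support}   # f(y) = 0.4471/sqrt(y)

        mean = sum(y * f[y] for y in support)                    # approx 7.298
        variance = sum(y**2 * f[y] for y in support) - mean**2   # approx 2.916
        print(mean, variance, sqrt(variance))                    # approx 7.298, 2.916, 1.71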

    Properties of Expected Value

    Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then E(aY) = a E(Y).

    Proof:

    E(aY) = ∑_{y∈S} a y f(y)

    = a ∑_{y∈S} y f(y)

    = a E(Y)

    As an exercise, prove that if b is a constant, then E(b) = b. As a further exercise, if Y1 and Y2 are two random variables, prove that E(Y1 + Y2) = E(Y1) + E(Y2).

    Properties of Variance

    Let Y be a discrete random variable with probability mass function f(y) and let a be a constant. Then Var(aY) = a² Var(Y).

    Proof:

    Var(aY) = E(a²Y²) − E(aY)²

    = a² E(Y²) − a² E(Y)²

    = a² [E(Y²) − E(Y)²]

    = a² Var(Y)

    As an exercise, prove that if b is a constant, then Var(b) = 0.

    Special Discrete Probability Distributions

    Binomial Distribution

    The binomial distribution relates to a binomial experiment which has the following five properties:

    1. The experiment consists of a fixed number of trials, n


    2. Each trial results in one of two outcomes, called success and failure (denoted 1 and 0)

    3. The probability of success in each trial is equal to p and the probability of failure is 1 − p (sometimes called q)

    4. All the trials are independent of one another

    5. The random variable of interest is Y, the total number of successes observed in the n trials

    The probability mass function for the binomial distribution is as follows:

    f(y) = C(n, y) p^y (1 − p)^(n−y) , y = 0, 1, 2, . . . , n and 0 ≤ p ≤ 1

    We can derive this function using the multiplicative probability rule for independent events and the concept of combinations.

    We have y successes and n − y failures, and there are n!/(y!(n − y)!) = C(n, y) ways to arrange them in order.

    Here is a graph of the binomial probability mass function where n = 15 and p = 0.4:

    As an exercise, draw the binomial probability mass function where n = 9 and p = 0.8.

    Mean and Variance of Binomial Distribution

    The mean of a binomially distributed random variable is E(Y) = np. The variance of a binomially distributed random variable is Var(Y) = np(1 − p).


    Binomial Example

    There is an English saying, "Don't count your chickens before they hatch."

    A farmer is breeding chickens. He has 15 hens that each lay one egg per day. The eggs are then placed in incubators. He has observed that there is an 80% hatchability rate, that is, an 80% probability that an egg will hatch into a live chick.

    1. How many live chicks should the farmer expect per day?

    E(Y) = np = 15 × 0.8 = 12

    2. What is the probability that at least 13 eggs from a given day will hatch?

    Pr(Y ≥ 13) = Pr(Y = 13) + Pr(Y = 14) + Pr(Y = 15)

    = C(15, 13) (0.8)^13 (1 − 0.8)^(15−13) + C(15, 14) (0.8)^14 (1 − 0.8)^(15−14) + C(15, 15) (0.8)^15 (1 − 0.8)^(15−15)

    = 0.2309 + 0.1319 + 0.0352 = 0.398
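    As a quick check (not part of the notes), the same binomial calculation can be done in Python:

        from math import comb

        n, p = 15, 0.8
        print(n * p)  # expected number of live chicks per day: 12.0

        prob_at_least_13 = sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(13, 16))
        print(round(prob_at_least_13, 3))  # 0.398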

    Negative Binomial Probability Distribution

    While a binomial random variable measures the number of successes in n trials of a binomial experiment where n is fixed, a negative binomial random variable measures the number of trials y required for k successes to occur.

    We could think of this as the event A ∩ B where A is the event that the first y − 1 trials contain k − 1 successes and B is the event that the yth trial results in a success.

    f(y) = Pr(A ∩ B) = Pr(A) Pr(B) (since A and B are independent)

    Pr(A) = C(y − 1, k − 1) p^(k−1) q^(y−k) , y ≥ k (by the binomial distribution)

    Pr(B) = p

    Thus f(y) = C(y − 1, k − 1) p^k q^(y−k) , y = k, k + 1, k + 2, . . .


    Negative Binomial Distribution

    Here is a graph of the negative binomial probability mass function where k = 3 and p = 0.6 (going as far as y = 17):

    As an exercise, draw the negative binomial probability mass function where k = 2 and p = 0.5, up to y = 10.

    Mean and Variance of Negative Binomial Distribution

    The mean of a negative binomial random variable is E(Y) = k/p

    The variance of a negative binomial random variable is Var(Y) = k(1 − p)/p²

    Negative Binomial Distribution Example

    Each time a fisherman casts his line into the water there is a probability of 1/8 that he will catch a fish.

    Today he has decided that he will continue casting his line until he catches 5 fish.

    1. What is the expected number of casts required to catch 5 fish?

    E(Y) = k/p = 5/0.125 = 40

    2. What is the standard deviation of the number of casts required to catch 5 fish?

    Var(Y) = 5(1 − 0.125)/0.125² = 280

    σ = √Var(Y) = √280 = 16.73


    4. What is the probability that he will need exactly 50 casts?

    Pr(Y = 50) = C(50 − 1, 5 − 1) (0.125)^5 (1 − 0.125)^(50−5)

    = 0.0159

    5. What is the probability that he will need more than 8 casts?

    Pr(Y > 8) = 1 − ∑_{y=5}^{8} C(y − 1, 5 − 1) (0.125)^5 (1 − 0.125)^(y−5)

    = 1 − [C(4, 4)(0.125)^5(1 − 0.125)^0 + C(5, 4)(0.125)^5(1 − 0.125)^1 + C(6, 4)(0.125)^5(1 − 0.125)^2 + C(7, 4)(0.125)^5(1 − 0.125)^3]

    = 1 − (0.0000 + 0.0001 + 0.0004 + 0.0007) = 1 − 0.0011 = 0.999
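    A Python sketch (not part of the notes) reproducing the fisherman example with the negative binomial pmf above:

        from math import comb, sqrt

        k, p = 5, 0.125
        q = 1 - p

        def neg_binom_pmf(y):
            # f(y) = C(y-1, k-1) p^k q^(y-k)
            return comb(y - 1, k - 1) * p**k * q**(y - k)

        print(k / p, sqrt(k * q / p**2))          # mean 40.0, standard deviation approx 16.73
        print(round(neg_binom_pmf(50), 4))        # Pr(Y = 50) approx 0.0159
        print(round(1 - sum(neg_binom_pmf(y) for y in range(5, 9)), 3))  # Pr(Y > 8) approx 0.999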

    Poisson Distribution

    The Poisson Distribution can be thought of as a limiting case of the binomial distribution.

    Suppose we are interested in the number of car accidents Y that occur at a busy intersection during one week.

    We could divide the week into n intervals of time, with each interval being so small that at most one accident could occur in that interval.

    We define p as the probability that an accident occurs in a particular sub-interval and 1 − p as the probability that no accident occurs.

    We could then think of this as a binomial experiment. It can then be shown that:

    lim_{n→∞} C(n, y) p^y (1 − p)^(n−y) = (np)^y e^(−np) / y!

    If we let λ = np then we have the probability mass function of the Poisson distribution:

    f(y) = λ^y e^(−λ) / y! , y = 0, 1, 2, . . .


    Here is a graph of the Poisson probability mass function where λ = 3.3 (going as far as y = 12):

    As an exercise, draw the Poisson probability mass function where λ = 1, up to y = 6.

    Mean and Variance of the Poisson Distribution

    The Poisson Distribution is used to model the counting of rare events that occur with a certain average rate per unit of time or space.

    For the Poisson Distribution, E(Y) = λ and Var(Y) = λ

    The expected value and variance are equal!

    Poisson Distribution Example

    The number of complaints that a busy laundry facility receives per day is a random variable Y having a Poisson distribution with λ = 3.3

    1. What is the probability that the facility will receive less than two complaints on a particular day?

    Pr(Y < 2) = Pr(Y = 0) + Pr(Y = 1) = e^(−3.3) + 3.3 e^(−3.3) = 0.0369 + 0.1217 = 0.1586


    If the number of complaints per day has a Poisson distribution with parameter λ then the number of complaints in five days has a Poisson distribution with parameter 5λ. Thus, if we let W be the number of complaints per week, then:

    E(W) = 5λ = 16.5
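    A small Python sketch (not from the notes) for the laundry-complaints example, using the Poisson pmf directly:

        from math import exp, factorial

        lam = 3.3
        def poisson_pmf(y):
            # f(y) = lam^y e^(-lam) / y!
            return lam**y * exp(-lam) / factorial(y)

        print(round(poisson_pmf(0) + poisson_pmf(1), 4))  # Pr(Y < 2) approx 0.1586
        print(5 * lam)                                    # expected complaints in five days: 16.5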

    Continuous Random Variables

    A random variable is continuous if it can take on any value in an interval (e.g., between 0 and 5). In other words, continuous random variables take on real-numbered values.

    There is no such thing as a probability mass function for a continuous random variable. Instead, we have a probability density function which allows us to find probabilities over an interval.

    If Y is a continuous random variable, and f(y) is the probability density function, then:

    Pr(a ≤ Y ≤ b) = ∫_a^b f(y) dy

    What we are actually doing is finding the area under the curve between a and b.

    Properties of a Probability Density Function

    1. f(y) ≥ 0 for all y, −∞ < y < ∞

    2. ∫_{−∞}^{∞} f(y) dy = 1

    E.g. Suppose the proportion of people who pay their income tax on time is a random variable Y with probability density function f(y) = 3y² for 0 ≤ y ≤ 1 and f(y) = 0 elsewhere. Verify that this is a valid probability density function.


    First we note that 3y² ≥ 0 for all 0 ≤ y ≤ 1, so the first condition is satisfied.

    Second:

    ∫_{−∞}^{∞} f(y) dy = ∫_0^1 f(y) dy (since the function is 0 elsewhere)

    = ∫_0^1 3y² dy

    = [y³]_0^1

    = 1³ − 0³ = 1

    Thus the second condition is also satisfied.

    Find the probability that between 60% and 90% of people pay their income tax on time.

    Pr(0.6 ≤ Y ≤ 0.9) = ∫_{0.6}^{0.9} 3y² dy

    = [y³]_{0.6}^{0.9}

    = 0.9³ − 0.6³ = 0.513

    Thus there is a 51.3% probability that between 60% and 90% of people pay their income tax on time according to this model.

    Note that it does not matter whether we use < or ≤ with continuous random variables.

    Expected Value and Variance of a Continuous Random Variable

    The expected value of a continuous random variable Y is defined as follows:

    μ = E(Y) = ∫_{−∞}^{∞} y f(y) dy

    Similarly the variance is defined thus:

    σ² = Var(Y) = E(Y²) − μ² = ∫_{−∞}^{∞} y² f(y) dy − μ²

    These have the same properties as in the discrete case.


    Find the expected value of the proportion of people who pay their income tax on time.

    μ = E(Y) = ∫_0^1 y (3y²) dy

    = ∫_0^1 3y³ dy

    = [3y⁴/4]_0^1

    = 3/4 = 0.75

    Find the standard deviation of the proportion of people who pay their income tax on time.

    σ² = Var(Y) = ∫_0^1 y² (3y²) dy − μ²

    = ∫_0^1 3y⁴ dy − 0.75²

    = [3y⁵/5]_0^1 − 0.75²

    = 3/5 − 0.75² = 0.6 − 0.5625 = 0.0375

    Hence σ = √0.0375 = 0.194
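    These integrals can be verified symbolically with a short Python/sympy sketch (not part of the notes):

        import sympy as sp

        y = sp.symbols('y')
        f = 3 * y**2                                      # density on [0, 1]

        print(sp.integrate(f, (y, 0, 1)))                 # total probability: 1
        print(sp.integrate(f, (y, 0.6, 0.9)))             # Pr(0.6 <= Y <= 0.9) = 0.513
        mu = sp.integrate(y * f, (y, 0, 1))               # 3/4
        var = sp.integrate(y**2 * f, (y, 0, 1)) - mu**2   # 3/80 = 0.0375
        print(mu, var, sp.sqrt(var))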

    Special Continuous Probability Distributions

    Uniform Distribution

    Suppose that Y can take on any value between θ₁ and θ₂ with equal probability. Then Y follows the continuous uniform distribution and its probability density function is as follows:

    f(y) = 1/(θ₂ − θ₁) , θ₁ ≤ y ≤ θ₂

    f(y) = 0 , elsewhere


    We can use integrals to compute probabilities, but in this case we don't need to because we are actually just finding the area of a rectangle! It can be shown that E(Y) = (θ₁ + θ₂)/2 and Var(Y) = (θ₂ − θ₁)²/12

    Uniform Distribution Example

    An insurance company provides roadside assistance to its clients. To save costs they want to dispatch the nearest possible tow truck.

    Along a particular highway which is 100 km long, breakdowns occur at uniformly distributed locations.

    Towing Company A is the nearest for the first 70 km of the highway and Towing Company B is the nearest for the final 30 km of the highway.

    1. What is the expected location of the next breakdown?

    E(Y) = (θ₁ + θ₂)/2 = (0 + 100)/2 = 50

    We expect the next breakdown to occur at the 50 km mark.

    3. What is the probability that the next breakdown will be attended by company B?

    Here f(y) = 1/100, 0 ≤ y ≤ 100, and 0 elsewhere.

    We need to find the area under f(y) between 70 and 100.

    We could calculate ∫_{70}^{100} f(y) dy

    Or we can simply calculate the area of this rectangle:


    The area of a rectangle is length × width. Thus:

    Pr(70 ≤ Y ≤ 100) = 30 × (1/100) = 0.30

    Normal Distribution

    A random variable Y is said to have a normal distribution with parameters μ and σ (−∞ < μ < ∞, σ > 0) if its probability density function is:

    f(y) = [1/(σ√(2π))] e^(−(y−μ)²/(2σ²)) , −∞ < y < ∞


    Even more good news: any normally distributed random variable Y with mean μ and standard deviation σ can be transformed to a Standard Normal random variable Z using this simple transformation:

    Z = (Y − μ)/σ

    This graph shows how the transformation works:

    Using the Z Table to Calculate Probabilities

    The Z Table provides us with Pr(Z < z) for any z value that we choose up to 2 decimal places.


    Suppose we want to know Pr(Z ≥ z) = 1 − Pr(Z < z)

    If we want to find Pr(Z < z) for a negative z value, we can use the fact that the Standard Normal Distribution is symmetric:

    Pr(Z < −z) = 1 − Pr(Z < z)

    5. t_observed = 10.50 > 2.228, thus we reject H0

    6. We conclude at the 5% significance level that the correlation is significantly different from 0

    The Fisher Transformation

    What if we want to test whether ρ = ρ₀ for any value −1 < ρ₀ < 1? What if we want a confidence interval for ρ? The Fisher Transformation allows us to do both (approximately).

    z_r = (1/2) ln[(1 + r)/(1 − r)]

    This quantity has an approximate normal distribution with a mean of (1/2) ln[(1 + ρ)/(1 − ρ)] and a variance of 1/(n − 3). From this we get the following test statistic, which has a standard normal distribution under the null hypothesis:

    distribution under the null hypothesis:

    Z=

    12ln

    1+r1r

    12ln

    1+0101n3
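    This statistic is easy to compute with a short Python sketch (not part of the notes; the sample values r = 0.96 and n = 12 below are hypothetical, chosen only to illustrate the call):

        from math import atanh, sqrt

        def fisher_z(r, rho0, n):
            # atanh(r) equals (1/2) ln[(1+r)/(1-r)], the Fisher transformation
            return (atanh(r) - atanh(rho0)) / sqrt(1 / (n - 3))

        # Hypothetical illustration: r = 0.96 from n = 12 pairs, testing H0: rho = 0.99
        print(round(fisher_z(0.96, 0.99, 12), 2))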


    Pearson's Correlation Coefficient: General Hypothesis Test Example

    Suppose we want to find out whether the correlation is less than 0.99 in our ice cream sales vs. temperature example.

    1. H0 : ρ = 0.99 vs. HA : ρ < 0.99


    Spearman's Rank Correlation Coefficient

    What if one or both of X and Y are not normally distributed?

    Suppose we have the Statistics FISA marks and number of hours of TV watched per week for n = 8 students:

    FISA Marks vs. Hours of TV per week

    Hours of TV per week (xi) FISA Mark (yi)

    3 73

    11 50

    7 87

    38 31

    13 62

    20 61

    22 46

    34 59

    Spearman's Rank Correlation Coefficient

    In this case we can instead use Spearman's Rank Correlation Coefficient ρ_s, which is based on the ranks of the xi and yi rather than the values themselves.

    It is a general measure of association rather than a measure of linear dependence.

    R(xi) are the ranks of the x values; thus the lowest value has a rank of 1, the second lowest a rank of 2, etc.

    R(yi) is computed the same way for the y values. The sample estimator of ρ_s is:

    r_s = [n ∑ R(xi)R(yi) − ∑ R(xi) ∑ R(yi)] / √{ [n ∑ R(xi)² − (∑ R(xi))²] [n ∑ R(yi)² − (∑ R(yi))²] }

    If there are no ties in x or y, this reduces to a simpler formula:

    r_s = 1 − 6 ∑_{i=1}^{n} di² / [n(n² − 1)] , where di = R(xi) − R(yi)


    FISA Marks vs. TV hours per week

    Hours of TV per week (xi)   FISA Mark (yi)   R(xi)   R(yi)   di   di²

    3    73   1   7   −6   36
    11   50   3   3    0    0
    7    87   2   8   −6   36
    38   31   8   1    7   49
    13   62   4   6   −2    4
    20   61   5   5    0    0
    22   46   6   2    4   16
    34   59   7   4    3    9

    ∑ di² = 150

    Spearman's Rank Correlation Coefficient Example

    In our FISA marks vs. TV hours example:

    We can now compute the sample Spearman correlation coefficient:

    r_s = 1 − (6 × 150)/(8(8² − 1)) = −0.786

    This suggests that there is a negative association between hours spent watching TV and FISA mark.
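    A Python sketch (not from the notes) that computes the same coefficient from the ranks, using the no-ties formula:

        tv_hours = [3, 11, 7, 38, 13, 20, 22, 34]
        marks = [73, 50, 87, 31, 62, 61, 46, 59]

        def ranks(values):
            # rank 1 = smallest value; this simple version assumes no ties
            order = sorted(values)
            return [order.index(v) + 1 for v in values]

        d2 = [(rx - ry)**2 for rx, ry in zip(ranks(tv_hours), ranks(marks))]
        n = len(tv_hours)
        rs = 1 - 6 * sum(d2) / (n * (n**2 - 1))
        print(sum(d2), round(rs, 3))  # 150, -0.786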

    Spearman's Rank Correlation Coefficient: Hypothesis Testing

    We may want to test the null hypothesis H0: ρ_s = 0 against some alternative to see if there is a significant association between x and y.

    If n is large (and there are no ties) then the statistic t = r_s √[(n − 2)/(1 − r_s²)] has approximately a t distribution with n − 2 degrees of freedom. If n is small we use r_s as our test statistic and use a table of critical values (see appendix).

    For our student marks vs. TV hours example, suppose we want to check if the association between these two variables is significant at the 5% significance level.


    Spearman's Rank Correlation Coefficient: Hypothesis Testing Example

    1. H0: ρ_s = 0 vs. HA: ρ_s ≠ 0

    2. α = 0.05

    3. Test statistic is r_s

    4. Critical value is r_{s,α/2,8} = 0.738, so we reject H0 if |r_s observed| > 0.738

    5. |r_s observed| = |−0.786| = 0.786 > 0.738, so we reject H0

    6. We conclude there is a (negative) association between hours spent watching TV per week and FISA mark

    Spearman's Rank Correlation Coefficient: General Hypothesis Tests and Confidence Intervals

    The Fisher Transformation that was done on the Pearson Correlation Coefficient also applies to the Spearman Rank Correlation Coefficient. Thus we can use the very same formulas based on the standard normal distribution to carry out general hypothesis tests such as H0 : ρ_s = 0.6 vs. HA : ρ_s ≠ 0.6, as well as to construct confidence intervals for ρ_s.

    Of course we need to use r_s instead of r in these formulas, but everything else stays the same.

    Limitations of Correlation Analysis

    Two of the limitations of correlation analysis are:

    1. It does not allow us to compare more than two variables at a time

    2. It does not allow us to make predictions

    We now turn to linear regression analysis which enables us to do both of these

    3 Simple Linear Regression Analysis

    Equation of a Line

    The equation of a line is often expressed as y = mx + c

    m is the slope of the line, the change in y for a one unit change in x. c is the intercept of the line, the value of y when x = 0 (and the point where the line crosses the vertical axis).

    Often when we compare observations from two variables, we see what appears to be an approximately linear relationship.

    We must decide logically which is the independent variable (x) and which is the dependent variable (y).

    For example, the scatter plot of ice cream sales vs. temperatures (which is dependent on the other?)


    Line Fitting

    If we have only two points, we can fit a line that goes right through them both.

    E.g. if we have the points (x₁ = 2, y₁ = 4) and (x₂ = 6, y₂ = 6):

    m = (y₂ − y₁)/(x₂ − x₁) = (6 − 4)/(6 − 2) = 1/2

    m = (y − y₁)/(x − x₁)

    1/2 = (y − 4)/(x − 2)

    2(y − 4) = x − 2

    2y − 8 = x − 2

    2y = x + 6

    y = (1/2)x + 3


    Line Fitting

    However, as soon as we have three or more points, we usually can't fit them perfectly with a straight line.

    Consider the following scatter plot:

    There is no line that describes this relationship perfectly. So how do we model a relationship that is "kind of" linear?

    The Simple Linear Regression Model

    We could assume that the yi observations depend on the xi observations in a linear way but also contain some unexplained variation.

    We model this unexplained variation or error as a random variable εi. This means Y is a random variable since it depends on a random variable. Thus we have Y = β₀ + β₁x + ε or, for individual observations, yi = β₀ + β₁xi + εi for i = 1, 2, . . . , n

    We have simply changed the name of m to β₁ and c to β₀, switched their order, and added the error term.


    Model Assumptions

    The most important assumptions of a simple linear regression model are as follows:

    The x values are fixed, not random (thus we write x in lower case and Y, a random variable, in upper case)

    All error terms have a zero mean, i.e. E(εi) = 0 for all i

    All error terms have the same fixed variance, i.e. Var(εi) = σ² for all i

    All observations are independent of each other

    The error terms follow the normal distribution

    The Problem

    Even if our model and its assumptions are correct, we have a problem: we don't know the values of β₀, β₁ or εi.

    In order to know them we would have to have data from the whole population of x and y, which is usually impossible.

    We can only estimate β₀, β₁ and εi as best as we can. But how?

    Line Fitting

    If we asked three people to draw the line that best fits the points, we might get three different results:

    How would we know which line is the best? As statisticians we want to use a statistic to quantify this! But how?


    The Least Squares Method

    Suppose we have observations (xi, yi) for i = 1, 2, . . . , n, and we fit a line with equation ŷi = β̂₀ + β̂₁xi

    We have simply changed the name of m to β̂₁ and c to β̂₀, and switched their order.

    The hats on ŷ, β̂₀ and β̂₁ remind us that these are estimates of the relationship.

    We can determine how far each individual yi value is from the line using the formula ei = yi − ŷi = yi − (β̂₀ + β̂₁xi). The ei values are called residuals.


    The residuals ei are our best estimate of the unknown errors εi. They also provide us with a clue of how to find the estimated line that best fits the data.

    Overall, we want the errors to be as small as possible. However, we can't just minimize the sum of errors because the positive errors (points above the line) and negative errors (points below the line) will cancel each other out!

    Instead we minimize the sum of squared errors εi² because these will all be positive:

    SSError = ∑_{i=1}^{n} εi²

    This quantifies the overall distance between the points and the line, similar to how the variance gives an indication of the distance between data points and their mean.


    We will choose the values of β̂₀ and β̂₁ that minimize the sum of squared errors.

    How do we do this? Calculus! The sum of squared errors is a function of β̂₀ and β̂₁:

    SSError = S(β̂₀, β̂₁) = ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi)²

    So our method is as follows:

    1. Take partial derivatives of the SSError function with respect to β̂₀ and β̂₁

    2. Set the derivatives equal to zero

    3. Solve this system of equations for β̂₀ and β̂₁ to get the values which minimize the function

    Deriving the Least Squares Estimators

    ∂S(β̂₀, β̂₁)/∂β̂₀ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi) = 0 (1)

    ∂S(β̂₀, β̂₁)/∂β̂₁ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁xi) xi = 0 (2)

    This is the system of equations we must solve in terms of β̂₀ and β̂₁. We simplify them as follows:

    −2 ∑ (yi − β̂₀ − β̂₁xi) = 0

    ∑ yi − ∑ β̂₀ − ∑ β̂₁xi = 0

    ∑ yi − nβ̂₀ − β̂₁ ∑ xi = 0

    nȳ − nβ̂₀ − nβ̂₁x̄ = 0

    β̂₀ = ȳ − β̂₁x̄


    −2 ∑ (yi − β̂₀ − β̂₁xi) xi = 0

    ∑ xiyi − ∑ β̂₀xi − ∑ β̂₁xi² = 0

    ∑ xiyi − β̂₀ ∑ xi − β̂₁ ∑ xi² = 0

    ∑ xiyi − (ȳ − β̂₁x̄) ∑ xi − β̂₁ ∑ xi² = 0

    ∑ xiyi − nx̄ȳ + nβ̂₁x̄² − β̂₁ ∑ xi² = 0

    β̂₁ (∑ xi² − nx̄²) = ∑ xiyi − nx̄ȳ

    β̂₁ = (∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)

    Least Squares Estimation Formula

    Thus the least squares estimates of β₀ and β₁ can be calculated using the following formulas:

    β̂₁ = (∑_{i=1}^{n} xiyi − nx̄ȳ) / (∑_{i=1}^{n} xi² − nx̄²)

    β̂₀ = ȳ − β̂₁x̄

    It turns out that β̂₁ and β̂₀ are Minimum Variance Unbiased Estimators (MVUE) of β₁ and β₀.

    This means that:

    1. E(β̂₀) = β₀ and E(β̂₁) = β₁ (unbiased)

    2. β̂₀ and β̂₁ can be proven to have the smallest variance (greatest precision) of any linear estimators of β₀ and β₁


    Proof that β̂₁ is an Unbiased Estimator of β₁

    We first need to derive E(Yi) and E(Ȳ).

    We will also use our assumptions that the x values are fixed and that E(εi) = 0.

    E(Yi) = E(β₀ + β₁xi + εi)

    = E(β₀) + E(β₁xi) + E(εi)

    = β₀ + β₁xi + 0 (since the first two are constants)

    = β₀ + β₁xi

    E(Ȳ) = E[(1/n) ∑_{i=1}^{n} yi]

    = (1/n) ∑_{i=1}^{n} E(yi)

    = (1/n) ∑_{i=1}^{n} (β₀ + β₁xi)

    = (1/n)(nβ₀ + β₁nx̄)

    = β₀ + β₁x̄


    E(β̂₁) = E[(∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)]

    = [1/(∑ xi² − nx̄²)] E(∑ xiyi − nx̄ȳ) (since x is fixed, the denominator is constant)

    = [1/(∑ xi² − nx̄²)] [∑ xi E(yi) − nx̄ E(ȳ)]

    = [1/(∑ xi² − nx̄²)] [∑ xi(β₀ + β₁xi) − nx̄(β₀ + β₁x̄)] (see results proved above)

    = [1/(∑ xi² − nx̄²)] [β₀nx̄ + β₁ ∑ xi² − nx̄β₀ − nx̄²β₁]

    = β₁ (∑ xi² − nx̄²) / (∑ xi² − nx̄²)

    = β₁

    Proof that β̂₀ is an Unbiased Estimator of β₀

    As an exercise, try to prove that E(β̂₀) = β₀.

    The proof is much shorter than the proof for β̂₁.

    Prediction with Simple Linear Regression

    Once we have calculated the least squares estimates β̂₁ and β̂₀, we can write out the fitted regression equation:

    ŷ = β̂₀ + β̂₁x

    We can now use this equation to predict the most likely value of y for a particular value of x.


    This is one of the most useful things about this model! However, we must be careful to only make predictions for values of x in the domain of our data.

    We cannot extrapolate since the relationship may not be linear outside of the domain of the data.

    The Riskiness of Extrapolation

    Suppose we fit a line to a set of data points with xi values ranging from 0 to 6. Now we use our fitted line to predict the value of y for x = 10.

    The Riskiness of Extrapolation

    What if modeling the relationship between y and x as a straight line is only appropriate between x = 0 and x = 6?

    Can you see how far off the prediction would appear to be if we had data for larger x values like this?

    Simple Linear Regression Example

    Various doses of a toxic substance were given to groups of 25 rats and the results were observed (see table below).


    Rat Deaths vs. Doses

    Dose in mg (x) Number of Deaths (y)

    4 1

    6 3

    8 6

    10 8

    12 14

    14 16

    16 20

    1. Find the fitted simple linear regression equation for this data

    2. Use the model to predict the number of deaths in a group of 25 rats who receive a 7 mg dose of the toxin


    Rat Deaths vs. Doses

    xi   yi   xi²   xiyi

    4     1    16     4
    6     3    36    18
    8     6    64    48
    10    8   100    80
    12   14   144   168
    14   16   196   224
    16   20   256   320

    ∑xi = 70   ∑yi = 68   ∑xi² = 812   ∑xiyi = 862

    x̄ = 10   ȳ = 9.714

    β̂₁ = (∑ xiyi − nx̄ȳ) / (∑ xi² − nx̄²)

    = (862 − 7 × 10 × 9.714) / (812 − 7 × 10²)

    = 182.02/112

    = 1.625

    β̂₀ = ȳ − β̂₁x̄ = 9.714 − 1.625 × 10 = −6.536

    Note that it is important not to round numbers off until you have the final regression equation, otherwise your answer may be inaccurate.

    Thus the fitted regression equation is ŷ = −6.54 + 1.63x. Predicting the number of deaths for a dose of 7 mg:

    ŷ = −6.54 + 1.63x = −6.54 + 1.63 × 7 = 4.9
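    The same fit can be reproduced with a short Python sketch (not part of the notes), using the least squares formulas above:

        x = [4, 6, 8, 10, 12, 14, 16]
        y = [1, 3, 6, 8, 14, 16, 20]
        n = len(x)

        xbar, ybar = sum(x) / n, sum(y) / n
        b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
             (sum(xi**2 for xi in x) - n * xbar**2)
        b0 = ybar - b1 * xbar
        print(round(b1, 3), round(b0, 3))  # 1.625, -6.536
        # prediction at a 7 mg dose, about 4.8 with unrounded coefficients
        # (the notes round the coefficients first and get 4.9)
        print(round(b0 + b1 * 7, 2))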


    Simple Linear Regression Exercise

    Calculate the equation of the line of best fit for the temperature (x) vs. ice cream sales (y) example.

    Use the equation to predict the ice cream sales on a day on which the temperature is 20.

    Inferences from a Simple Linear Regression

    The two unknown parameters involved in a simple linear regression model are β₀ and β₁.

    σ², the variance of the error terms, is also unknown.

    We may be interested in knowing whether it is reasonable to conclude that one of these unknowns is equal to (or not equal to) a particular value.

    Most often we are interested in whether β₁ = 0, since this determines whether x and y have a positive relationship, a negative relationship or no relationship.

    Like in correlation analysis! To use hypothesis testing to make inferences about these unknowns we need an appropriate test statistic.

    Inferences on β₁

    Inferences about β₁ will be based on how far the estimated value β̂₁ is from the null hypothesis value.

    As always, we also take into account the standard error of the estimate and its probability distribution.

    We already proved that E(β̂₁) = β₁.

    Let:

    SSx = ∑_{i=1}^{n} xi² − nx̄²

    SSy = ∑_{i=1}^{n} yi² − nȳ²

    SSxy = ∑_{i=1}^{n} xiyi − nx̄ȳ

    Notice that, expressed in these terms, β̂₁ = SSxy/SSx.

    Subject to our model assumptions, it can be proven that Var(β̂₁) = σ²/SSx.


    However, because we do not know the value of σ², we must use the best estimate, which turns out to be

    σ̂² = [1/(n − 2)] ∑_{i=1}^{n} ei² = SSResidual/(n − 2) = MSResidual

    Thus V̂ar(β̂₁) = σ̂²/SSx.

    It can be proven that [β̂₁ − E(β̂₁)] / √V̂ar(β̂₁) has a t distribution with n − 2 degrees of freedom.

    Thus t = (β̂₁ − β₁) / √(σ̂²/SSx) has a t distribution with n − 2 degrees of freedom.

    Since SSResidual = SSy − β̂₁SSxy, we can express this as:

    t = (β̂₁ − β₁) / √[(SSy − β̂₁SSxy) / ((n − 2) SSx)]

    If we replace β₁ with β₁₀ (the hypothesised value) this becomes our test statistic for testing H0 : β₁ = β₁₀.

    Hypothesis Testing Review

    For such a t test, our decision rules would be as follows:

    H0: β₁ = β₁₀ vs. HA: β₁ ≠ β₁₀

    Reject H0 if |t_observed| > t_{α/2,n−2}

    H0: β₁ = β₁₀ vs. HA: β₁ < β₁₀

    Reject H0 if t_observed < −t_{α,n−2}

    H0: β₁ = β₁₀ vs. HA: β₁ > β₁₀

    Reject H0 if t_observed > t_{α,n−2}


    The p-value Approach

    Instead of using critical values to decide whether to reject H0, one can also use p-values.

    A p-value is defined as the probability of obtaining a result at least as extreme as the observed data, given that H0 is true.

    For such a t test, our decision rules would be as follows:

    H0: β₁ = β₁₀ vs. HA: β₁ ≠ β₁₀

    Reject H0 if 2 Pr(t > |t_observed| given that β₁ = β₁₀) < α

    H0: β₁ = β₁₀ vs. HA: β₁ < β₁₀

    Reject H0 if Pr(t < t_observed given that β₁ = β₁₀) < α

    H0: β₁ = β₁₀ vs. HA: β₁ > β₁₀

    Reject H0 if Pr(t > t_observed given that β₁ = β₁₀) < α

    Note that p-values cannot usually be computed by hand. As an example, the third p-value involves computing

    p = ∫_{t_observed}^{∞} f(y) dy

    where f(y) is the probability density function of the t distribution.

    However, p-values can be easily calculated with a computer, and are the quickest way to reach a decision about a hypothesis test when using statistical software packages.

    Confidence Interval for β₁

    Using the t statistic above, we can derive a (1 − α)100% confidence interval for β₁ as follows:

    Pr( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] < β₁ < β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] ) = 1 − α

    Thus the C.I. for β₁ is:

    ( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] , β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] )


    Inference on β₁ Example

    Suppose we want to test H0 : β₁ = 0 vs. HA : β₁ ≠ 0 for the rat death vs. dosage example, at the α = 0.05 significance level.

    Our test statistic is t ~ t(n − 2) as defined above. Our critical region is |t_observed| > t_{α/2,n−2} = t_{0.025,5} = 2.570. We have already calculated that SSxy = 182 and SSx = 112. We can further calculate that SSy = 301.4286.

    t = (β̂₁ − β₁) / √[(SSy − β̂₁SSxy)/((n − 2)SSx)]

    = (1.625 − 0) / √[(301.4286 − 1.625 × 182)/((7 − 2) × 112)]

    = 1.625/√0.01014

    = 1.625/0.1007 = 16.14

    |t_observed| > 2.570, thus we reject H0 and conclude that β₁ ≠ 0; the slope of the regression model is statistically significant.

    A 95% Confidence Interval for β₁ is given by:

    ( β̂₁ − t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] , β̂₁ + t_{α/2,n−2} √[(SSy − β̂₁SSxy)/((n − 2)SSx)] )

    (1.625 − 2.570 × 0.1007 , 1.625 + 2.570 × 0.1007)

    (1.37 , 1.88)
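    The t statistic and interval can be checked with a small Python sketch (not part of the notes), working from the summary quantities quoted above:

        from math import sqrt

        n, b1 = 7, 1.625
        SSx, SSy, SSxy = 112, 301.4286, 182
        t_crit = 2.571  # t_{0.025, 5}; the notes quote 2.570

        se_b1 = sqrt((SSy - b1 * SSxy) / ((n - 2) * SSx))
        t_obs = b1 / se_b1
        print(round(se_b1, 4), round(t_obs, 2))  # approx 0.1007, 16.14
        print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))  # approx 1.37, 1.88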

    Inference on β₀

    In a similar way it can be proven that:

    E(β̂₀) = β₀

    Var(β̂₀) = σ² (1/n + x̄²/SSx)

    If we estimate σ² with σ̂² then t = (β̂₀ − β₀) / √[σ̂² (1/n + x̄²/SSx)] has a t distribution with n − 2 degrees of freedom.


    We can also express t as:

    t = (β̂₀ − β₀) / √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) }

    Confidence Interval for β₀

    A (1 − α)100% Confidence Interval for β₀ is given by:

    ( β̂₀ − t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) } , β̂₀ + t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] (1/n + x̄²/SSx) } )

    Inference on β₀ Example

    With our dosage vs. rat deaths example, suppose we are interested in whether β₀ < 0


    Inference on σ²

    It is also possible to perform hypothesis tests and confidence intervals concerning σ² using the χ² distribution.

    However we will not cover these in this module.

    Predicting the Mean Response

    One of the advantages of the linear regression model is that we can use x to predict Y.

    Suppose we want to estimate the mean value of Y when x = x*, i.e. E(Y | x = x*).

    We know that E(Y | x = x*) = β₀ + β₁x*. Our best estimate of E(Y | x = x*) is ŷ* = β̂₀ + β̂₁x*.

    The variance of this estimator is Var(ŷ*) = σ² [1/n + (x* − x̄)²/SSx]

    Since σ² is unknown, we can use the following estimate:

    V̂ar(ŷ*) = σ̂² [1/n + (x* − x̄)²/SSx] = [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx]

    It can also be shown that t = [ŷ* − E(Y | x = x*)] / √V̂ar(ŷ*) ~ t(n − 2)

    Confidence Interval for Mean Response

    Thus a (1 − α)100% Confidence Interval for E(Y | x = x*) is given by:

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx] }

    If we want the interval to be as narrow as possible (a more accurate prediction), then n should be large, SSx should be large, and x* should be near x̄.

    That is, we should gather data on a wide range of x values.


    Predicting a New Response

    Suppose we want to predict the response value y* for a new observation x = x*.

    Our best estimate would be ŷ* = β̂₀ + β̂₁x*

    E(ŷ*) = β₀ + β₁x*

    Var(y* − ŷ*) = σ² [1 + 1/n + (x* − x̄)²/SSx]

    Thus:

    V̂ar(y* − ŷ*) = σ̂² [1 + 1/n + (x* − x̄)²/SSx] = [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx]

    It can be shown that t = (y* − ŷ*) / √V̂ar(y* − ŷ*) ~ t(n − 2)

    Prediction Interval for an Individual Response

    A (1 − α)100% Prediction Interval for y* is given by:

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx] }

    It is called a prediction interval rather than a confidence interval because y* is a random variable, not an unknown parameter.

    Notice that the prediction interval for y* is always wider than the confidence interval for E(Y | x = x*).

    It is more difficult to predict the value of an individual observation than the mean of many observations.

    Example

    Consider our Temperature vs. Ice Cream Sales example. We want a confidence interval for the average ice cream sales when the temperature is 20 and a prediction interval for the ice cream sales on a particular day when the temperature is 20.


    1. Confidence Interval for E(Y | x = 20)

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1/n + (x* − x̄)²/SSx] }

    −159.474 + 30.088(20) ± t_{0.025,10} √{ [(174754.9 − 30.088(5325.025))/(12 − 2)] [1/12 + (20 − 18.675)²/176.9825] }

    442.286 ± 2.228 √135.549

    442.286 ± 25.94 = (416.35, 468.23)

    2. Prediction Interval for y* when x = 20

    β̂₀ + β̂₁x* ± t_{α/2,n−2} √{ [(SSy − β̂₁SSxy)/(n − 2)] [1 + 1/n + (x* − x̄)²/SSx] }

    −159.474 + 30.088(20) ± t_{0.025,10} √{ [(174754.9 − 30.088(5325.025))/(12 − 2)] [1 + 1/12 + (20 − 18.675)²/176.9825] }

    442.286 ± 2.228 √1589.10

    442.286 ± 88.82 = (353.47, 531.11)
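    Both intervals can be reproduced with a short Python sketch (not part of the notes), using only the summary quantities quoted above for the ice cream data:

        from math import sqrt

        n, xbar = 12, 18.675
        b0, b1 = -159.474, 30.088
        SSx, SSy, SSxy = 176.9825, 174754.9, 5325.025
        t_crit = 2.228  # t_{0.025, 10}
        x_star = 20

        mse = (SSy - b1 * SSxy) / (n - 2)
        y_hat = b0 + b1 * x_star
        ci_half = t_crit * sqrt(mse * (1 / n + (x_star - xbar)**2 / SSx))
        pi_half = t_crit * sqrt(mse * (1 + 1 / n + (x_star - xbar)**2 / SSx))
        print(round(y_hat, 3), round(ci_half, 2), round(pi_half, 2))
        # approx 442.286, 25.94 and 88.82, giving (416.35, 468.23) and (353.47, 531.11)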

    Assessing the Fit of a Regression Line

    While testing the hypothesis H0 : β₁ = 0 can give us a yes or no answer on whether the model is appropriate, we would like a statistic that can quantify how good the model is.

    One method is to calculate what proportion of the total variation in y is explained by our model.

    The total variation in y is SSy = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi² − nȳ²

    The variation not explained by the model is SSResidual = ∑_{i=1}^{n} (yi − ŷi)²

    Thus the variation explained by the model is the difference SSy − SSResidual. Our goodness of fit statistic, called the Coefficient of Determination, is the ratio of the variation explained by the model to the total variation:

    r² = (SSy − SSResidual)/SSy = 1 − SSResidual/SSy


    We call this statistic r² because it turns out that it is the square of Pearson's sample correlation coefficient r.

    Proof:

    r² = 1 − SSResidual/SSy

    = 1 − (SSy − β̂₁SSxy)/SSy

    = 1 − 1 + β̂₁SSxy/SSy

    = β̂₁ SSxy/SSy

    = (SSxy/SSx)(SSxy/SSy)

    = SSxy²/(SSx SSy)

    = (r)²

    Goodness of Fit Example

    In our dosage vs. rat deaths example:

    r² = SSxy²/(SSx SSy)

    = 182²/(112 × 301.4286) = 0.981

    Thus in this case we can say that 98.1% of the variation in rat deaths can be explained by the dosage given.

    4 Multiple Linear Regression

    Multiple Linear Regression Model Specification

    Before now we have used models with only one independent variable xi. What if we want to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂?

    The multiple linear regression model allows us to do this.

    Motivational Example

    An experiment was conducted to determine the effect of pressure and temperature on the yield of a chemical. Two levels of pressure (in kPa) and three levels of temperature (in °C) were used and the results were as follows:


    Yield (yi)   Pressure (xi1)   Temperature (xi2)

    21   350    40
    23   350    90
    26   350   150
    22   550    40
    23   550    90
    28   550   150

    3D Scatter Plot

    If we want to represent the relationship graphically we would need a three dimensional scatter plot.

    Instead of a line of best fit, we now need a plane of best fit.

    Multiple Linear Regression Model

    The multiple linear regression model allows us to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂.

    The model is specified as follows:

    Y = β₀ + β₁x₁ + β₂x₂ + ε

    Or, in terms of observations, as follows:

    yi = β₀ + β₁x1i + β₂x2i + εi


    This is the equation of a plane, not a line.

    β₀ is still the intercept (the point where the plane crosses the vertical axis, x₁ = x₂ = 0)

    β₁ is the slope of the plane in the x₁ direction

    β₂ is the slope of the plane in the x₂ direction

    β₁ and β₂ are sometimes referred to as partial slope coefficients.

    This model relies on the same assumptions as the simple linear regression model, with one addition:

    x₁ and x₂ must not be collinear (highly correlated with one another)

    The fitted regression equation in this case is:

    Ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂

    Multiple Linear Regression Model: Deriving Least Squares Parameter Estimates

    We can again use the Method of Least Squares to estimate the parameters β₀, β₁ and β₂.

    We still have our sum of squared error function, which is now a function of three variables:

    SSError = S(β̂₀, β̂₁, β̂₂) = ∑_{i=1}^{n} εi² = ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i)²

    We can still use the same steps:

    1. Take partial derivatives of the SSError function with respect to β̂₀, β̂₁ and β̂₂

    2. Set the derivatives equal to zero

    3. Solve this system of equations for β̂₀, β̂₁ and β̂₂ to get the values which minimize the function

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₀ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) = 0

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₁ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) x1i = 0

    ∂S(β̂₀, β̂₁, β̂₂)/∂β̂₂ = −2 ∑_{i=1}^{n} (yi − β̂₀ − β̂₁x1i − β̂₂x2i) x2i = 0


    Solving this system of equations for β̂₀, β̂₁ and β̂₂ is possible but it will take long and the formula will be complicated.

    An alternative is to use matrix notation, which is more compact.

    Multiple Linear Regression Model: Matrix Notation

    We can specify the regression model in matrix notation as follows:

    y = Xβ + ε where

    y is an n × 1 matrix:

    y = [y1, y2, . . . , yn]′

    X is an n × 3 matrix:

    X = [ 1  x11  x21
          1  x12  x22
          ⋮   ⋮    ⋮
          1  x1n  x2n ]

    β is a 3 × 1 matrix:

    β = [β₀, β₁, β₂]′

    ε is an n × 1 matrix:

    ε = [ε1, ε2, . . . , εn]′

    Quick Review of Matrices

    For any matrices A and B, where A′ is the transpose of A:

    (A′)′ = A

    (A + B)′ = A′ + B′

    (AB)′ = B′A′


    Additionally, the inverse of a square matrix A (which is like the matrix equivalent of division) is the matrix A⁻¹ such that AA⁻¹ = I, where I is the identity matrix, e.g.

    I = [ 1 0 0
          0 1 0
          0 0 1 ]

    To find the inverse of a matrix we can use the following method (similar to Gauss-Jordan elimination):

    Suppose

    A = [ 1 2 3
          0 4 5
          1 0 6 ]

    Then, starting from the augmented matrix [A | I] and row-reducing until the left block is the identity:

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 1 0 6 | 0 0 1 ]

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 −2 3 | −1 0 1 ]

    [ 1 2 3 | 1 0 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 2 0 1 | 2 −1 0 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 22 0 0 | 24 −12 −2 ]
    [ 0 4 5 | 0 1 0 ]
    [ 0 0 11 | −2 1 2 ]

    [ 22 0 0 | 24 −12 −2 ]
    [ 0 44 0 | 10 6 −10 ]
    [ 0 0 11 | −2 1 2 ]

    [ 1 0 0 | 12/11 −6/11 −1/11 ]
    [ 0 1 0 | 5/22 3/22 −5/22 ]
    [ 0 0 1 | −2/11 1/11 2/11 ]

    Thus A⁻¹ = [ 12/11  −6/11  −1/11
                 5/22    3/22  −5/22
                −2/11   1/11   2/11 ]

    Deriving Least Squares Estimates in Matrix Notation

    Our sum of squared error function in matrix notation is:

    S(β̂) = ∑_{i=1}^{n} εi² = ε′ε = (y − Xβ̂)′(y − Xβ̂)

    = (y′ − (Xβ̂)′)(y − Xβ̂)

    = (y′ − β̂′X′)(y − Xβ̂)

    = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂


    Now, in β̂′X′y we are multiplying a 1 × 3 matrix by a 3 × n matrix by an n × 1 matrix, so the result will be a 1 × 1 matrix, i.e. a scalar number.

    Similarly, in y′Xβ̂ we are multiplying a 1 × n matrix by an n × 3 matrix by a 3 × 1 matrix, so the result will again be a 1 × 1 matrix, i.e. a scalar.

    Notice also that (β̂′X′y)′ = y′Xβ̂. The transpose of a scalar is itself. Thus, since these matrices are both scalars, they are equal, and we can simplify our equation to:

    S(β̂) = y′y − 2β̂′X′y + β̂′X′Xβ̂

    We now differentiate this function using vector calculus and set it equal to 0:

    ∂S/∂β̂ = −2X′y + 2X′Xβ̂ = 0

    X′Xβ̂ = X′y

    β̂ = (X′X)⁻¹X′y

    Thus in matrix form, the least squares estimators of β are given by β̂ = (X′X)⁻¹X′y.

    This matrix exists as long as the inverse of X′X exists, which it does as long as our assumption of no linear dependence between x₁ and x₂ holds true.

    The estimators have the same Minimum Variance Unbiased Estimator property as β̂₀ and β̂₁ do in the simple linear regression case.

    In matrix form, the fitted regression equation is ŷ = Xβ̂. In matrix form, the residuals are e = y − ŷ.

    Multiple Linear Regression Example

    We have the following data from ten species of mammal:


    Species Name     Gestation Period in days (y)   Body Weight in kg (x1)   Avg. Litter size (x2)

    Rat              23     0.05    7.3
    Tree Squirrel    38     0.33    3
    Dog              63     8.5     4
    Porcupine        112    11      1.2
    Pig              115    190     8
    Bush Baby        135    0.7     1
    Goat             150    49      2.4
    Hippo            240    1400    1
    Fur seal         254    250     1
    Human            270    65      1

    Here, our individual matrices are as follows:

    y = [23, 38, 63, 112, 115, 135, 150, 240, 254, 270]′

    X = [ 1  0.05   7.3
          1  0.33   3
          1  8.5    4
          1  11     1.2
          1  190    8
          1  0.7    1
          1  49     2.4
          1  1400   1
          1  250    1
          1  65     1 ]

    We first check if our y values appear to be normally distributed:


    Looks okay

    Our X′X matrix is as follows:

    X′X = [ 10        1974.58       29.9
            1974.58   2065419.851   3401.855
            29.9      3401.855      153.49 ]

    To find the inverse of this matrix we would use Gauss-Jordan Elimination as above.

    However, in the age of technology it's much quicker to use computer software such as MATLAB.

    We find that

    (X′X)⁻¹ = [ 0.3021           −1.9913 × 10⁻⁴   −5.4428 × 10⁻²
                −1.9913 × 10⁻⁴    6.3378 × 10⁻⁷    2.4744 × 10⁻⁵
                −5.4428 × 10⁻²    2.4744 × 10⁻⁵    1.6569 × 10⁻² ]

    We multiply this matrix by X′ and then by y to get our parameter estimates:

    β̂ = [178.7, 0.07569, −17.93]′

    Thus our fitted regression equation is Ŷ = 178.7 + 0.07569x₁ − 17.93x₂. We interpret this as follows:

    The intercept means that (according to the model) a mammal with body weight of 0 kg which has an average litter size of 0 babies would have a gestation period of 179 days.

    (Note that the intercept does not always make practical sense!)


    For every kg of body weight, gestation period increases by 0.07569 days.

    For every baby in the average litter, gestation period decreases by 17.93 days.

    Remember, we cannot assume the relationships are causal.

    It can be dangerous to extrapolate outside the region of x₁ and x₂ values in the data even if it is within range of individual values.

    The intercept may be an example of this! See the graph below.
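    The matrix calculation β̂ = (X′X)⁻¹X′y can be reproduced with a short numpy sketch (Python is not used elsewhere in these notes; this is only an illustration using the mammal data above):

        import numpy as np

        X = np.array([[1, 0.05, 7.3], [1, 0.33, 3], [1, 8.5, 4], [1, 11, 1.2],
                      [1, 190, 8], [1, 0.7, 1], [1, 49, 2.4], [1, 1400, 1],
                      [1, 250, 1], [1, 65, 1]], dtype=float)
        y = np.array([23, 38, 63, 112, 115, 135, 150, 240, 254, 270], dtype=float)

        XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^(-1)
        beta_hat = XtX_inv @ X.T @ y       # least squares estimates
        print(beta_hat)                    # approx [178.7, 0.0757, -17.93]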

    Multiple Linear Regression with k Independent Variables

    Using our matrix notation we can generalise the multiple linear regression model from 2 independent variables to k independent variables.

    The model is specified as follows:

    Y = β₀ + β₁x₁ + β₂x₂ + · · · + βₖxₖ + ε

    Or, in terms of observations, as follows:

    yi = β₀ + β₁x1i + β₂x2i + · · · + βₖxki + εi

    Note that p = k + 1 is the total number of parameters in the model (k independent variables plus one intercept).


    Hence y = Xβ + ε where:

    y is an n × 1 matrix, X is an n × p matrix, β is a p × 1 matrix, and ε is an n × 1 matrix.

    This model relies on the same assumptions as the simple linear regression model, along with the assumption of no multicollinearity:

    None of the independent variables are collinear (highly correlated with one another).

    Multiple Linear Regression Example

    Data was collected from 195 American universities on the following variables:

    Graduation Rate (the proportion of students in Bachelor's degree programmes who graduate after four years)

    Admission Rate (the proportion of applicants to the university who are accepted)

    Student-to-Faculty Ratio (the number of students per lecturer)

    Average Debt (the average student debt level at graduation, in US dollars)

    A few observations from the data are displayed below:

    Grad Rate (y)   Admission Rate (x1)   S/F Ratio (x2)   Avg Debt (x3)

    0.65   0.35   14   11156
    0.81   0.39   16   13536
    0.80   0.35   12   19762
    0.46   0.65   13   12906
    0.50   0.58   21   14449
    0.47   0.65   11   16645
    0.18   0.59   14   17221
    0.52   0.60   13   14791
    0.39   0.79   15   14382
    . . .


    In this case we have k = 3 independent variables and p = 4 parameters to estimate.

    The model equation is as follows:

    yi = β₀ + β₁xi1 + β₂xi2 + β₃xi3 + εi

    Using computer software we determine that (X′X)⁻¹ is:

              j = 0            j = 1            j = 2            j = 3
    j = 0     0.1059           0.01782          3.0672 × 10⁻³    3.2823 × 10⁻⁶
    j = 1     0.01782          0.1906           5.7407 × 10⁻³    5.7146 × 10⁻⁷
    j = 2     3.0672 × 10⁻³    5.7407 × 10⁻³    4.6400 × 10⁻⁴    2.1002 × 10⁻⁹
    j = 3     3.2823 × 10⁻⁶    5.7146 × 10⁻⁷    2.1002 × 10⁻⁹    2.3045 × 10⁻¹⁰

    We further determine that:

    β̂ = (X′X)⁻¹X′y = [1.1095, −0.3798, −0.02789, 5.1687 × 10⁻⁷]′

    Thus our sample regression function is:

    ŷ = 1.1095 − 0.3798x₁ − 0.02789x₂ + 5.1687 × 10⁻⁷ x₃

    Interpretation:

    For every 0.01 unit increase in admission rate, there is an expected 0.003798 unit decrease in graduation rate (we can't really talk about the usual 1 unit increase in x₁ since it is a proportion and ranges only from 0 to 1)

    For every one unit increase in student-to-lecturer ratio, there is an expected 0.02789 unit decrease in graduation rate

    For every $1 increase in average student debt, there is an expected 5.1687 × 10⁻⁷ unit increase in graduation rate

    Inferences from a Multiple Linear Regression

    Just like in simple linear regression, we often want to do hypothesis testing for multiple linear regression.

    There are three main types of hypothesis tests to consider:

    1. Inferences on Individual Parameters

    2. Inferences on the Full Model (all parameters)

    3. Inferences on Subsets of Parameters


    Inferences on Individual Parameters

    The logic is the same as in simple linear regression but we now use a matrix approach.

    It can be proven that E(β̂) = β. It can also be proven that the covariance matrix of β̂ is:

    Cov(β̂) = σ²(X′X)⁻¹

    This means that for each individual element β̂ⱼ of β̂:

    E(β̂ⱼ) = βⱼ

    Var(β̂ⱼ) = σ²Cjj

    where Cjj is the diagonal element of (X′X)⁻¹ corresponding to β̂ⱼ.

    This is the multivariate equivalent of our result in simple linear regression that Var(β̂₁) = σ²/SSx.

    Now, we face the same problem as before in that we don't usually know the value of σ².

    Remember, before we estimated σ² with σ̂² = [1/(n − 2)] ∑_{i=1}^{n} ei² = SSResidual/(n − 2)

    In the multivariate case, we have to divide by n − p instead of n − 2 (we subtract the number of parameters to be estimated, which was 2 in that case).

    Our sum of squared residuals can be expressed as follows:

    SSResidual = ∑_{i=1}^{n} ei² = e′e

    = (y − ŷ)′(y − ŷ)

    = (y − Xβ̂)′(y − Xβ̂)

    = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂

    = y′y − 2β̂′X′y + β̂′X′Xβ̂

    = y′y − β̂′X′y since X′Xβ̂ = X′y


    Therefore, σ̂² = SSResidual/(n − p) = [1/(n − p)](y′y − β̂′X′y)

    The test statistic for testing the null hypothesis H0: βⱼ = βⱼ₀ is thus:

    t = (β̂ⱼ − βⱼ₀)/√(σ̂²Cjj) = (β̂ⱼ − βⱼ₀)/√[(y′y − β̂′X′y)Cjj/(n − p)]

    Under the null hypothesis, t follows a t distribution with n − p degrees of freedom.

    Our decision rules will be the same as for inferences on β₁ in the simple linear regression model (depending whether we have a two-tailed, lower tail or upper tail test).

    Note that this formula can be used for any βⱼ including β₀. If we set βⱼ₀ = 0 then we are testing for the significance of an individual coefficient, that is, whether there is a linear relationship between Y and xⱼ.

    Inferences on Individual Parameters: Example

    Suppose we want to test whether the average student debt has a significant effect on the graduation rate.

    1. H0 : β₃ = 0 vs. HA : β₃ ≠ 0

    2. α = 0.05

    3. t = β̂₃/√[(y′y − β̂′X′y)C33/(n − p)] ~ t(n − p)

    4. Critical region: |t_observed| > t_{α/2,n−p} = t_{0.025,195−4} = t_{0.025,191} ≈ 1.984

    5. t_observed = 5.169 × 10⁻⁷ / √(4.769 × 2.3045 × 10⁻¹⁰ / (195 − 4)) = 0.215

    |t_observed| < 1.984, thus we do not reject H0

    6. We conclude that average student debt has no significant effect on graduation rate

    Inference on the Whole Regression Model

    One way to test the usefulness of a particular multiple linear regression model with k independent variables is to test the following:

    H0 : β₁ = β₂ = · · · = βₖ = 0

    HA : βⱼ ≠ 0 for at least one j

    If we reject H0, this implies that at least one of the independent variables x₁, x₂, . . . , xₖ contributes significantly to the model.

    To develop this test, remember the following from our r² calculations:

    SSy = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi² − nȳ² = y′y − nȳ²

    SSResidual = y′y − β̂′X′y

    Hence SSModel = SSy − SSResidual = β̂′X′y − nȳ²

    It can be shown that under H0, SSModel/σ² ~ χ²(p − 1) and SSResidual/σ² ~ χ²(n − p). From this we can develop a test statistic which compares the variation explained by the model to the variation not explained by the model:

    F = [SSModel/(p − 1)] / [SSResidual/(n − p)]

    Under H0, F ~ F(p − 1, n − p) and so we use the F distribution table to determine whether or not to reject the null hypothesis.

    In this case we always have a one-sided, upper tail test. Our decision rule is:

    Reject H0 if F_observed > F_{α,p−1,n−p}


    Inference on the Whole Regression Model: Example

    For our graduation rate example:

    1. H0 : β₁ = β₂ = β₃ = 0 vs. HA : βⱼ ≠ 0 for at least one j = 1, 2, 3

    2. α = 0.05

    3. Test statistic: F = [SSModel/(p − 1)] / [SSResidual/(n − p)] ~ F(p − 1, n − p)

    4. Critical region: F_observed > F_{α,p−1,n−p} = F_{0.05,3,191} ≈ 2.65

    5. F_observed = [(β̂′X′y − nȳ²)/(p − 1)] / [(y′y − β̂′X′y)/(n − p)] = [6.102/(4 − 1)] / [4.769/(195 − 4)] = 81.47 > 2.65, so we reject H0

    6. We conclude that at least one of the independent variables contributes significantly to the model.

    Inference on a Subset of the Parameters

    It is also possible to carry out a test of significance on a subset of the parameters, but we will not cover this.

    Confidence Intervals for Individual Coefficients

    By rearranging our test statistic for an individual coefficient parameter, we can obtain the following (1 − α)100% Confidence Interval for βⱼ for any j = 0, 1, 2, . . . , k:

    Pr( β̂ⱼ − t_{α/2,n−p}√(σ̂²Cjj) ≤ βⱼ ≤ β̂ⱼ + t_{α/2,n−p}√(σ̂²Cjj) ) = 1 − α

    where σ̂² = SSResidual/(n − p) = (y′y − β̂′X′y)/(n − p)

    Confidence Intervals for Individual Coefficients: Example

    Let us construct a confidence interval for β₃ in the graduation rate example. First let's calculate σ̂².

    If y′y = 68.9714 and β̂′X′y = 64.20232, then SSResidual = 4.769.

    Thus σ̂² = SSResidual/(n − p) = 4.769/(195 − 4) = 0.02497

    We know that β̂₃ = 5.1687 × 10⁻⁷ and C33 = 2.3045 × 10⁻¹⁰

    Thus our confidence interval is given by:

    β̂ⱼ ± t_{α/2,n−p}√(σ̂²Cjj)

    5.1687 × 10⁻⁷ ± t_{0.025,195−4}√(0.02497 × 2.3045 × 10⁻¹⁰)

    5.1687 × 10⁻⁷ ± 1.984√(0.02497 × 2.3045 × 10⁻¹⁰)

    5.1687 × 10⁻⁷ ± 4.759 × 10⁻⁶

    = (−4.24 × 10⁻⁶ , 5.28 × 10⁻⁶)


Thus we can say with 95% confidence that the change in graduation rate for a $1 increase in average student debt is between -4.24 \times 10^{-6} and 5.28 \times 10^{-6}

Notice that the confidence interval contains the value 0, which agrees with the conclusion of our hypothesis test earlier
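The interval can be reproduced with a short calculation such as the sketch below (assumed Python code, using the values quoted above; the exact critical value from software differs slightly from the table value 1.984).

```python
import math
from scipy import stats

beta_hat_3, C_33 = 5.1687e-7, 2.3045e-10     # values from the example above
sigma2_hat = 4.769 / (195 - 4)               # SS_Residual / (n - p)
t_val = stats.t.ppf(0.975, df=195 - 4)       # about 1.97

half_width = t_val * math.sqrt(sigma2_hat * C_33)
print(beta_hat_3 - half_width, beta_hat_3 + half_width)   # roughly (-4.2e-06, 5.3e-06)
```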

    Confidence Region for All Coefficients

One can also construct a joint confidence region for all the parameters

For a simple linear regression model, the joint confidence region for (\beta_0, \beta_1) would have the shape of a two-dimensional ellipse

This is outside the scope of this course, however

Confidence Interval for the Mean Response

As we did in simple linear regression, we can construct a confidence interval for the mean response at a particular point, say x^*, where

x^* = [1, x_1^*, x_2^*, \ldots, x_k^*]'

The mean response at this point is E(Y \mid x = x^*) = x^{*\prime}\beta

The estimated mean response at this point is \hat{y}^* = x^{*\prime}\hat{\beta}

A (1 - \alpha)100% Confidence Interval for E(Y \mid x = x^*) is given by:

Pr\left( \hat{y}^* - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*} \le E(Y \mid x = x^*) \le \hat{y}^* + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*} \right) = 1 - \alpha

    Confidence Interval for the Mean Response: Example

Let's find a confidence interval for the average graduation rate of universities which have an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

In this case, x^{*\prime} = [1, 0.5, 20, 20000], a 1 \times 4 matrix

Our point estimate is:

\hat{y}^* = x^{*\prime}\hat{\beta} = [1, 0.5, 20, 20000] \, [1.1095, -0.3798, -0.02789, 5.1687 \times 10^{-7}]'

= 1.1095 - 0.3798(0.5) - 0.02789(20) + 5.1687 \times 10^{-7}(20000)

= 0.3721


Thus we would predict that such universities would have an average graduation rate of 37.21%

The only thing left to calculate in our confidence interval formula is x^{*\prime}(X'X)^{-1}x^*

Using matrix multiplication we see this is equal to 0.03492

Thus our 95% confidence interval for E(Y \mid x = x^*) is:

\hat{y}^* \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 x^{*\prime}(X'X)^{-1}x^*}

0.3721 \pm 1.984\sqrt{0.02497(0.03492)}

0.3721 \pm 0.0586

= (0.3135, 0.4307)

    Prediction Interval for a New Response

Also, as in simple linear regression, we can predict the value of the response Y for a new observation x^* and obtain an interval estimate for it

The predicted value is \hat{y}^* = x^{*\prime}\hat{\beta} (actually the same as \hat{y}^* above)

A (1 - \alpha)100% Prediction Interval for Y^* is:

Pr\left( \hat{y}^* - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)} \le Y^* \le \hat{y}^* + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)} \right) = 1 - \alpha

As in the simple linear regression case, we can see from the '1 +' that this prediction interval is wider than the confidence interval for the mean response

    Prediction Interval for a New Response: Example

Let us obtain a prediction interval for a particular university which has an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

Our point estimate is \hat{y}^*, which is actually the same as above; it equals 0.3721

Our 95% prediction interval is as follows:

\hat{y}^* \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 (1 + x^{*\prime}(X'X)^{-1}x^*)}

0.3721 \pm 1.984\sqrt{0.02497(1 + 0.03492)}

0.3721 \pm 0.3189

= (0.0532, 0.691)

We can see that this is a very wide (and not very useful) prediction interval
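Both intervals can be checked with the short sketch below (illustrative Python, using the quantities computed in the two examples above; the only difference between the two intervals is the '1 +' inside the square root).

```python
import math
from scipy import stats

y_hat_star = 0.3721      # point estimate x*' beta_hat
quad_form  = 0.03492     # x*' (X'X)^-1 x*
sigma2_hat = 0.02497     # SS_Residual / (n - p)
t_val = stats.t.ppf(0.975, df=195 - 4)

ci_half = t_val * math.sqrt(sigma2_hat * quad_form)         # mean response
pi_half = t_val * math.sqrt(sigma2_hat * (1 + quad_form))   # new response

print(y_hat_star - ci_half, y_hat_star + ci_half)   # close to (0.3135, 0.4307)
print(y_hat_star - pi_half, y_hat_star + pi_half)   # close to (0.0532, 0.6910)
```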


    Assessing Goodness of Fit of a Multiple Linear Regression Model

We can define r^2 just as we did for the simple linear regression model:

r^2 = 1 - \frac{SS_{Residual}}{SS_y} = 1 - \frac{y'y - \hat{\beta}'X'y}{y'y - n\bar{y}^2}

In this case it is referred to as the Multiple Coefficient of Determination

One of the disadvantages of this statistic is that it will always increase as more independent variables are added to the model

This will suggest that the fit is getting better even if the new variables are not significant

This problem led to the development of an alternative goodness of fit statistic for multiple linear regression called Adjusted r^2

    Adjusted r2

Adjusted r^2, written as \bar{r}^2, imposes a penalty for adding more terms to the model

It will thus decrease when we add an independent variable that does not contribute much explanatory power

\bar{r}^2 = 1 - \frac{SS_{Residual}/(n - p)}{SS_y/(n - 1)} = 1 - \frac{n - 1}{n - p}(1 - r^2)

r^2 and \bar{r}^2 for the Multiple Linear Regression Model: Example

In our university graduation rates example, we calculate r^2 as follows:

r^2 = 1 - \frac{y'y - \hat{\beta}'X'y}{y'y - n\bar{y}^2} = 1 - \frac{68.9714 - 64.20232}{68.9714 - 58.09986} = 1 - 0.4387 = 0.5613

This suggests that 56% of the variation in graduation rates can be explained by the three factors in the model


Now we calculate \bar{r}^2 as follows:

\bar{r}^2 = 1 - \frac{n - 1}{n - p}(1 - r^2) = 1 - \frac{195 - 1}{195 - 4}(1 - 0.5613) = 1 - 0.4456 = 0.5544

In this case, there is not much difference between the two, because the sample size n is very large compared to the number of parameters p
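Both statistics follow directly from the sums of squares; a small sketch (illustrative Python) using the numbers above:

```python
yty, bXy, n_ybar2 = 68.9714, 64.20232, 58.09986   # y'y, beta_hat' X'y, n * ybar^2
n, p = 195, 4

ss_resid = yty - bXy          # SS_Residual
ss_y     = yty - n_ybar2      # SS_y

r2     = 1 - ss_resid / ss_y
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
print(round(r2, 4), round(r2_adj, 4))   # 0.5613 and 0.5544
```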

    Model Selection Algorithms

Various algorithms (procedures) have been proposed for selecting which variables to include in a model

This is particularly important when there are many possible independent variables to choose from

We do not want to miss out on variables that contribute significantly to the model, but we also don't want to include unnecessary variables which make our estimates less precise

The three most common algorithms that are used are:

1. Backward Elimination

2. Forward Selection

3. Stepwise Selection

    Backward Elimination

Backward Elimination starts with a full model consisting of all possible independent variables, and cuts it down until the 'best' model is achieved (a rough code sketch of the procedure follows the list below)

The algorithm proceeds as follows:

1. Begin with a model including all possible independent variables

2. Estimate the model and take note of the t_{observed} statistic values for the individual coefficients (not including \beta_0)

3. Choose the coefficient with the smallest |t_{observed}|; call it \hat{\beta}_j

4. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level

5. If the null hypothesis is rejected, we accept this as our final model

6. If the null hypothesis is not rejected, we remove the variable x_j from the model and repeat from step (2)
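The sketch below is one rough way the procedure could be coded in Python with NumPy and SciPy (the function and variable names are illustrative, and this is not the SAS procedure used in the tutorial).

```python
import numpy as np
from scipy import stats

def backward_elimination(X, y, names, alpha=0.05):
    """Rough sketch: drop the least significant variable until all remaining ones are significant."""
    keep = list(range(X.shape[1]))
    while keep:
        Xd = np.column_stack([np.ones(len(y)), X[:, keep]])    # design matrix with intercept
        n, p = Xd.shape
        XtX_inv = np.linalg.inv(Xd.T @ Xd)
        beta = XtX_inv @ Xd.T @ y                              # least squares estimates
        resid = y - Xd @ beta
        sigma2 = resid @ resid / (n - p)                       # error variance estimate
        t_stats = beta / np.sqrt(sigma2 * np.diag(XtX_inv))    # t statistic for each coefficient
        j = 1 + int(np.argmin(np.abs(t_stats[1:])))            # slope with the smallest |t| (skip the intercept)
        if abs(t_stats[j]) > stats.t.ppf(1 - alpha / 2, n - p):
            return [names[i] for i in keep], beta              # smallest |t| is significant: stop
        del keep[j - 1]                                        # otherwise drop that variable and refit
    return [], None                                            # no variable survived
```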


    Forward Selection

Forward Selection works in the opposite direction: it begins with an empty model and adds variables until the 'best' model is achieved

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable

2. Identify the independent variable with the highest |t_{observed}| value in its simple linear regression with y

3. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level in this simple linear regression model

4. If we reject H_0, we add x_j to the multiple linear regression model, proceed to the independent variable with the next highest |t_{observed}| in its simple linear regression with y, and repeat from step (3)

5. If the null hypothesis is not rejected, we conclude x_j is not significant to the model, so we do not add it. We also realise that none of the other independent variables with smaller |t_{observed}| will be significant; thus the model is final and we are done

    Stepwise Selection

Stepwise Selection combines elements of both Backward Elimination and Forward Selection

The algorithm proceeds as follows:

1. Run simple linear regressions between y and each possible x variable

2. Identify the independent variable with the highest |t_{observed}| value in its simple linear regression with y

3. Carry out the test of hypothesis H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 at the \alpha significance level in this simple linear regression model

4. If we reject H_0, we add x_j to the multiple linear regression model

So far the algorithm is exactly like Forward Selection; but now it changes

5. Carry out a t test from the multiple linear regression model for the significance of each \beta_j in the model so far

6. If the null hypothesis is not rejected for any of these \beta_j, we delete the corresponding x_j from the model

7. Proceed to the independent variable with the next highest |t_{observed}| in its simple linear regression with y, and repeat from step (3)

8. Once we reach a point where all the variables in the model are significant, and none of the variables outside the model are significant, this is our final model


    Model Selection Algorithms: Example

It is easier to see an example in the tutorial using SAS, since these algorithms are very tedious to carry out by hand

In the case of our Graduation Rate example, all three algorithms lead to the same result: we keep x_1 and x_2 in the model and drop x_3

Note: there are other model selection algorithms, but we will not cover them

    Residual Analysis

    Revisiting Model Assumptions

Remember that the assumptions of the multiple linear regression model include the following:

All error terms have a zero mean, i.e. E(\epsilon_i) = 0 for all i

All error terms have the same fixed variance, i.e. Var(\epsilon_i) = \sigma^2 for all i

All observations are independent of each other

The error terms follow the normal distribution

None of the x variables are highly correlated with one another

Whenever we are applying a multiple linear regression model it is important to check these assumptions

    Model Adequacy

The first four of these assumptions can be assessed using residual analysis: that is, looking at the residuals of the model

There are two basic ways to do this: Graphical Analysis and Hypothesis Tests

In this module we will only look at graphical analysis (the hypothesis testing approach will be taught in Econometrics in third year)

    Graphical Residual Analysis

Remember that the residuals are defined as e = y - \hat{y}, that is, e_i = y_i - \hat{y}_i

To calculate the residuals we first determine the least squares regression equation and then obtain the predicted value for each x_i in the sample; then we subtract these predicted values from the observed y_i values in the sample

Once we have the residuals we can plot the residuals (vertical axis) against the predicted values (horizontal axis)

One can gain a lot of information about the model by looking at this plot
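A possible sketch of such a plot in Python with matplotlib (illustrative only; the function name and styling are assumptions, not part of the module's SAS material):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(X, y):
    """Fit by least squares, then plot residuals against predicted values."""
    Xd = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # least squares estimates
    fitted = Xd @ beta
    resid = y - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, linestyle="--")                   # reference line at zero
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.show()
```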


    Plot of Residuals vs. Predicted Values

The main things to look for in the plot are patterns or unusual points

Ideally, the points should be evenly distributed above and below zero and should appear completely random

In this plot we can see that the points appear random


Do you see anything different in this plot?

The variance of the residuals appears to increase as \hat{y} increases

    Normal Quantile-Quantile Plot

A normal quantile-quantile plot is a useful tool for checking whether the residuals are normally distributed

If so, the points should fall approximately in a straight line

    Does this QQ plot look normally distributed?


    How about this one?


    Histogram of Residuals

Another way to check normality is to plot a histogram of the residuals and see if it is bell shaped

    How about this one?
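For reference, a Q-Q plot and a histogram of the residuals could be produced with a sketch like the following (illustrative Python; the function name is an assumption):

```python
import matplotlib.pyplot as plt
from scipy import stats

def normality_plots(resid):
    """Normal Q-Q plot and histogram of the residuals (resid is a 1-D array)."""
    fig, (ax1, ax2) = plt.subplots(1, 2)
    stats.probplot(resid, dist="norm", plot=ax1)     # points near a straight line suggest normality
    ax2.hist(resid, bins=10)                         # should look roughly bell shaped
    ax2.set_xlabel("Residual")
    plt.show()
```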


    Summary of Graphical Analysis of Residuals

Graphical analysis of residuals is a useful diagnostic tool for determining model adequacy

However it has limitations: often the results can be inconclusive

This is especially true for small sample sizes

    Outlier Diagnostics

We can also use the residuals to look for outliers: values which the model predicts extremely badly

While we could simply look at the residuals themselves, it is better to scale them in some way

Analogy to z-scores from STA100A: we don't only want to know how far an observation is from its mean; we want to know how many standard deviations away it is

A basic way to scale the residuals would be to divide them by their estimated standard deviation:

d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}}

This is called the standardized residual

Since these residuals should be approximately normally distributed with mean 0 and variance 1, they should almost always lie in the range -3 \le d_i \le 3

Thus we could define an outlier as any observation whose standardized residual is > 3 or < -3

A refinement, which accounts for the fact that the residuals do not all have the same variance, divides instead by \sqrt{\hat{\sigma}^2(1 - h_{ii})}, where h_{ii} is the i-th diagonal element of the hat matrix H = X(X'X)^{-1}X'; this gives the internally studentized residual r_i = e_i / \sqrt{\hat{\sigma}^2(1 - h_{ii})}


    Outlier Diagnostics: Externally Studentized Residuals

The only weakness of the internally studentized residual is that the variance estimate \hat{\sigma}^2 used in calculating r_i is influenced by the i-th observation

It may be thrown off by an outlier; thus r_i is not ideal for outlier detection

Instead, for each observation, we could estimate the variance using a data set of n - 1 observations with the i-th observation removed, and use this estimate S_{(i)}^2 in the scaling formula

It can be shown that:

S_{(i)}^2 = \frac{(n - p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}

If we replace \hat{\sigma}^2 with S_{(i)}^2 in the internally studentized residual formula we get:

t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1 - h_{ii})}}

This is known as the externally studentized residual and is the best way of scaling residuals

    Hypothesis Test for Outliers

A further advantage is that, under the model assumptions, t_i \sim t(n - p - 1)

One could carry out a hypothesis test on each observation to check if it is an outlier:

1. H_0: the i-th observation is not an outlier vs. H_A: the i-th observation is an outlier

2. \alpha = 0.05

3. Test statistic is |t_i|

4. Rejection rule: Reject H_0 if |t_i| > t_{\alpha/(2n), n-p-1}

5. Compute the observed t_i and reach a decision

6. State the conclusion

The reason why we have \alpha/(2n) instead of \alpha/2 is that we are running the hypothesis test n times, so we are basically dividing up the overall type I error probability among the n individual tests (this is known as the Bonferroni approach)


y_i    x_i    \hat{y}_i    e_i        d_i        r_i        t_i
19     8      18.325       0.675      0.2008     0.2178     0.1997
17     7      16.275       0.725      0.2157     0.2450     0.2248
23     10     22.425       0.575      0.1711     0.1856     0.1699
22     9      20.375       1.625      0.4835     0.5169     0.4827
33     14     30.625       2.375      0.7067     1.4133     1.5696
18     7      16.275       1.725      0.5133     0.5830     0.5480
16     7      16.275      -0.275     -0.0818    -0.0929    -0.0849
15     10     22.425      -7.425     -2.2092    -2.3962   -10.5468

    Outlier Diagnostics: Example

Suppose we have the set of data shown in the first two columns of the table above (n = 8)

When we estimate the simple linear regression model y_i = \beta_0 + \beta_1 x_i + \epsilon_i using the least squares method, we get:

\hat{\beta}_0 = 1.925, \hat{\beta}_1 = 2.05

We can substitute each of our x_i for x in the fitted equation \hat{y} = 1.925 + 2.05x to obtain the predicted values \hat{y}_i, which are in the third column of the table above

We can then calculate the residuals: e_i = y_i - \hat{y}_i (see the fourth column of the table)

To calculate the standardized residuals we first need to calculate \hat{\sigma}^2:

\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i=1}^{n} e_i^2 = \frac{1}{6}\left[0.675^2 + 0.725^2 + \cdots + (-7.425)^2\right] = 11.296

Now we have d_i = e_i / \sqrt{\hat{\sigma}^2} (see the calculated values in the fifth column)

Next we can calculate the internally studentized residuals. We first need to calculate the hat matrix H = X(X'X)^{-1}X'

In this case, X is the 8 \times 2 matrix

X =
1   8
1   7
1   10
1   9
1   14
1   7
1   7
1   10


Taking the diagonal elements of H and using them in the formula r_i = e_i / \sqrt{\hat{\sigma}^2(1 - h_{ii})}, we get the values in the sixth column of the table above

Next we calculate the externally studentized residuals. We first need to calculate

S_{(i)}^2 = \frac{(n - p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}

Then we plug these into the following formula to get the values in the seventh column:

t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1 - h_{ii})}}

It is now apparent for the first time that the 8th observation is an outlier

    Hypothesis Test for Outliers: Example

We conduct the hypothesis test described above for each of the 8 observations, at the \alpha = 0.05 level

In every case, our rejection rule is: reject H_0 if |t_i| > t_{\alpha/(2n), n-p-1} = t_{0.003125, 5}

We don't have a column for 0.003125 in our t table, so we can take the average of the entries in the 0.005 and 0.001 columns to get an approximation: (4.030 + 5.876)/2 = 4.953

We reject H_0 for all observations for which |t_i| > 4.953; in this case we reject only for the 8th observation

Thus we conclude that the 8th observation is an outlier and none of the others are
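For reference, the sketch below recomputes these diagnostics in Python for the data in the table (illustrative only; because of rounding the values may differ from the table in the last digit, but only the 8th observation is flagged):

```python
import numpy as np
from scipy import stats

# The x and y columns of the table above
x = np.array([8, 7, 10, 9, 14, 7, 7, 10], dtype=float)
y = np.array([19, 17, 23, 22, 33, 18, 16, 15], dtype=float)

X = np.column_stack([np.ones_like(x), x])             # design matrix for the simple linear model
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                               # least squares estimates
e = y - X @ beta                                       # raw residuals
h = np.diag(X @ XtX_inv @ X.T)                         # leverages h_ii (diagonal of the hat matrix)
sigma2 = e @ e / (n - p)                               # estimate of sigma^2

d = e / np.sqrt(sigma2)                                # standardized residuals
r = e / np.sqrt(sigma2 * (1 - h))                      # internally studentized residuals
s2_del = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))                      # externally studentized residuals

t_crit = stats.t.ppf(1 - 0.05 / (2 * n), df=n - p - 1)     # Bonferroni cut-off
print(np.where(np.abs(t) > t_crit)[0])                 # only index 7 (the 8th observation) is flagged
```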

    Influence Diagnostics

Sometimes, a small subset of observations (even one observation) exerts a disproportionate influence on the fitted regression model

In other words, the parameter estimates depend more on these few observations than on the majority of the data

We would like to be able to locate these influential observations and possibly eliminate them

    Leverage

The elements of the hat matrix, h_{ij}, describe the amount of influence exerted by y_j on \hat{y}_i

Thus a basic measure of the influence of an observation, known as the leverage, is given by h_{ii}


The properties of the hat matrix H include that the sum of all n diagonal elements is equal to p, that is:

\sum_{i=1}^{n} h_{ii} = p

Therefore, the average h_{ii} value would be p/n

As a rule of thumb, any observation i such that h_{ii} > 2p/n would be called a high-leverage observation

Cook's Distance

The leverage only takes into account the location of an observation's x values

A more sophisticated measure of influence would take into account the location of both the x and y values of an observation

The Cook's Distance is one such measure

Let \hat{\beta} be the usual least squares parameter estimates from all n observations, and let \hat{\beta}_{(i)} be the least squares parameter estimates where the i-th observation has been deleted from the data

Then the Cook's Distance is defined as:

D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{p \, MS_{Residual}}

The Cook's Distance formula can also be expressed in terms of the internally studentized residuals:

D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}

In general, if D_i > 1 we say that the i-th observation is influential

Influence Diagnostics: Example

With the outlier data set used above, the h_{ii} values are:

h_{ii} = [0.15, 0.225, 0.15, 0.125, 0.75, 0.225, 0.225, 0.15]

In this case 2p/n = 2(2)/8 = 0.5. Since h_{55} = 0.75 > 0.5, we can say that the 5th observation is a high-leverage observation

We can calculate the Cook's Distance using the formula D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}

In this case, D_i = [0.0042, 0.0087, 0.0030, 0.0191, 2.9961, 0.0493, 0.0013, 0.5066]

Since D_5 > 1 we can again say that the 5th observation is influential
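A small helper like the one sketched below (illustrative Python; the function name is an assumption) computes both measures from quantities we have already met:

```python
import numpy as np

def influence_measures(X_design, e, sigma2):
    """Leverage and Cook's distance from the design matrix, residuals and sigma^2 estimate."""
    n, p = X_design.shape
    h = np.diag(X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T)   # leverages h_ii
    r = e / np.sqrt(sigma2 * (1 - h))          # internally studentized residuals
    D = (r**2 / p) * (h / (1 - h))             # Cook's distance
    return h, h > 2 * p / n, D, D > 1          # leverages, high-leverage flags, D_i, influential flags
```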


    Multicollinearity

Multicollinearity occurs when two or more of the x variables have a strong linear relationship with each other

This makes the estimates less precise

In fact, if two or more x variables have a perfect linear relationship, we cannot use the method of least squares at all

Technically this is because the X'X matrix is not invertible

In most cases the multicollinearity will not be perfect; but if it is strong, it can still ruin the model

How do we know if there is multicollinearity?

    Detecting Multicollinearity

The simplest way to detect multicollinearity is to calculate the Pearson correlation coefficient between each pair of independent variables x_s and x_t

A rule of thumb says that if any of these correlation coefficients is higher than 0.7 in absolute value, there is serious multicollinearity

SAS can also provide us with variance inflation factor (VIF) estimates, which tell us by what factor the error variance increases due to multicollinearity in a particular independent variable

A rule of thumb says that if the VIF > 5 for any independent variable, there is serious multicollinearity involving that variable

The simplest way of resolving multicollinearity is to remove one of the offending x variables

    Multicollinearity: Example

The table below gives the cost of adding a new communications node to a network, along with three independent variables thought to explain this cost: the number of ports available for access (x_1), the bandwidth (x_2), and the port speed (x_3)

When we estimate the model Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i using Ordinary Least Squares, we get the fitted equation:

\hat{y} = 17487 - 14168 x_1 + 81.39 x_2 + 1523.7 x_3

Continue from SAS project


y_i      x_{1i}   x_{2i}   x_{3i}
52388    68       58       653
51761    52       179      499
50221    44       123      422
36095    32       38       307
27500    16       29       154
57088    56       141      538
54475    56       141      538
33969    28       48       269
31309    24       29       230
23444    24       10       230
24269    12       56       115
53479    52       131      499
33543    20       38       192
33056    24       29       230
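As a rough illustration (in Python rather than SAS), the pairwise correlations and VIFs for this data set can be computed as sketched below; the very strong linear relationship between the number of ports (x_1) and the port speed (x_3) shows up immediately:

```python
import numpy as np

# Ports, bandwidth and port speed columns from the table above
X = np.array([
    [68, 58, 653], [52, 179, 499], [44, 123, 422], [32, 38, 307],
    [16, 29, 154], [56, 141, 538], [56, 141, 538], [28, 48, 269],
    [24, 29, 230], [24, 10, 230], [12, 56, 115], [52, 131, 499],
    [20, 38, 192], [24, 29, 230],
], dtype=float)

print(np.corrcoef(X, rowvar=False))    # pairwise correlations between x1, x2, x3

# VIF for each x_j: regress x_j on the other x's and compute 1 / (1 - R^2)
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Xd, X[:, j], rcond=None)[0]
    resid = X[:, j] - Xd @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
    print(f"VIF for x{j + 1}: {1 / (1 - r2):.1f}")
```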

    Changes in Functional Form

What if there is a non-linear relationship between Y and x? E.g. quadratic, cubic, logarithmic, etc.

We can still use linear regression just as before, but with the independent variables transformed appropriately

Changes in Functional Form: Example 1

Example with a quadratic term

Changes in Functional Form: Example 2

Example with an ln term (log base e)

Interpretation: \beta_1 is the expected change in y for a one unit increase in \ln x

This can also be expressed in terms of a change in x:

\beta_1 is the expected change in y when x is multiplied by e = 2.718, that is, when x increases by 171.8%

More generally, the expected change in y for a \delta% increase in x would be \beta_1 \ln\left(\frac{100 + \delta}{100}\right)

Thus the expected change in y for a 10% increase in x would be 0.0953\beta_1

For small \delta, \ln\left(\frac{100 + \delta}{100}\right) \approx \frac{\delta}{100}, and so we can say, approximately, that \frac{\beta_1}{100} is the expected change in y for a 1% increase in x
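The two ideas above, transforming a column of the design matrix and interpreting a log term, can be sketched as follows (illustrative Python; the function names are assumptions):

```python
import numpy as np

# A quadratic or logarithmic relationship is handled by transforming
# columns of the design matrix before running ordinary least squares
def design_with_transforms(x1, x2):
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.column_stack([np.ones(len(x1)), x1, x1**2, np.log(x2)])   # intercept, x1, x1^2, ln(x2)

# Interpreting a log term: expected change in y for a delta% increase in x
def log_term_effect(beta_log, delta_percent):
    return beta_log * np.log((100 + delta_percent) / 100)

print(log_term_effect(1.0, 10))   # about 0.0953, i.e. 0.0953 * beta_1 for a 10% increase
```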


    Transformations of the Dependent Variable

Used to make the data fit a normal distribution better

Used to resolve the problem of non-constant variance

Common transformations include:

y^* = \ln(y)

y^* = \sqrt{y}

The Box-Cox Transformation is a method used to choose the best transformation for y


    Box-Cox Transformation

The Box-Cox Transformation consists of estimating a new parameter \lambda

(This \lambda has nothing to do with the Poisson distribution)

The value of \lambda is the best 'power' to use in transforming y; for instance:

If \lambda = 2, we use the transformation y^* = y^2

If \lambda = \frac{1}{2}, we use the transformation y^* = y^{1/2} = \sqrt{y}

In the special case \lambda = 0 we use the transformation y^* = \ln(y)

SAS can estimate the parameter \lambda for us
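Outside of SAS, the same estimate can be obtained in Python; the sketch below uses scipy.stats.boxcox on some artificial positive, right-skewed data (the data themselves are made up purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)   # positive, right-skewed example data

y_transformed, lam = stats.boxcox(y)               # lambda is chosen by maximum likelihood when not supplied
print(round(lam, 2))      # a value near 0 suggests the log transformation y* = ln(y)
```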

Box-Cox Transformation: Example

Interaction Terms

    Dummy Variables

    Do two-category only; save rest for econometrics

5 Logistic Regression

Different kinds of Dependent Variables

Throughout our study of linear regression models, we have assumed that the dependent variable is a normally distributed random variable

However, in practice we may want to build models for data that are not normally distributed

For the rest of the module we will be looking at some of these models

Categorical Dependent Variable

We already studied models with dummy (categorical) independent variables

But what if the dependent variable is categorical?

If the dependent variable has two possible values (like a Bernoulli random variable), then it is called binary

A Bernoulli random variable is a binomial random variable where the number of trials is n = 1

For example, the dependent variable could be:

Y_i = 1 if the i-th product is defective, 0 if the i-th product is OK

Or:

Y_i = 1 if the i-th patient recovers, 0 if the i-th patient dies

We can construct models for this kind of dependent variable

They will be quite different from linear regression models, but still have some key similarities, since both types of models are classified as Generalized Linear Models


    Generalized Linear Models

Generalized Linear Models are a class of models, some of the properties of which are:

1. We have n independent response observations y_1, y_2, \ldots, y_n with theoretical means \mu_1, \mu_2, \ldots, \mu_n

2. The observation y_i is a random variable with a probability distribution from the exponential family (which basically means its probability mass function or probability density function has an e in it)

3. The mean response vector is related to a linear predictor \eta = x'\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k

4. The relationship between \eta_i and \mu_i is expressed by a link function g, so that \eta_i = g(\mu_i), i = 1, 2, \ldots, n

By taking the inverse of this function we can also write \mu_i = E(y_i) = g^{-1}(\eta_i) = g^{-1}(x_i'\beta)

In the case of linear regression:

The link function is g(\mu_i) = \mu_i, so E(Y_i) = \mu_i = \eta_i = x_i'\beta

The dependent variable follows a normal distribution

In summary, Y_i \sim N(x_i'\beta, \sigma^2) (this is a way of writing the model without \epsilon_i)

    Logistic Regression Model

If each Y_i follows a Bernoulli distribution (binomial with n = 1), with probability of success Pr(Y_i = 1) = p_i and probability of failure 1 - p_i, then \mu_i = E(Y_i) = p_i

If we again used the identity link function g(\mu_i) = \mu_i, then our model would be p_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k

It is easy to see that this is a bad idea, because the predicted values of the model would not necessarily be between 0 and 1

A better model uses the link function g(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right)

The quantity \frac{p_i}{1 - p_i} is called an odds: it is the ratio of the probability of success to the probability of failure

Thus the link function gives the log odds, also known as the logit or logistic function

This means the model can be expressed as follows:

\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k


By taking the inverse of the link function we can also express the model like this:

E(Y_i) = p_i = \frac{1}{1 + e^{-x_i'\beta}}

where x_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}]

Notice that there is no error term \epsilon_i in this model

Remember that the p_i are probabilities and thus range between 0 and 1

A graph of g(p_i) is as follows (it is undefined at 0 and 1):

    Parameter Estimation in Logistic Regression

Just like in linear regression, our first task is to estimate the parameter vector

\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]'

However, we can no longer use the Method of Least Squares (Why?)

Instead we use the Method of Maximum Likelihood


We will not explain the details of this method

Unfortunately this method requires an iterative procedure and cannot easily be calculated by hand

However, computer software such as SAS can compute the estimates \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k quite easily

    Interpreting Parameters in Logistic Regression

More important for our purpose is to be able to interpret what the parameter estimates tell us

The parameter estimates themselves are interpreted as log-odds ratios, while e^{\hat{\beta}_1}, for instance, would be interpreted as an odds ratio

It is best to illustrate what these terms mean using an example

    Logistic Regression Example

Consider a data set of 200 people admitted to the intensive care unit at a hospital

The dependent variable is whether they died:

y_i = 1 if the person died, 0 if the person survived

The first independent variable is the type of admission to ICU:

x_{i1} = 1 if they were admitted via emergency services, 0 if they were self-admitted

The second independent variable x_{i2} is the person's systolic blood pressure in mm Hg

The estimated model is:

\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}

which can also be written as:

Pr(Y_i = 1) = p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}}


We estimate the parameters in SAS and our fitted equation is:

\ln\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = -1.33 + 2.022 x_{i1} - 0.014 x_{i2}

Or:

\hat{Pr}(Y_i = 1) = \hat{p}_i = \frac{1}{1 + e^{-(-1.33 + 2.022 x_{i1} - 0.014 x_{i2})}}
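As a quick illustration (assumed Python code, not part of the SAS output), the fitted equation can be used to compute the estimated probability of death; the patient values below are hypothetical, chosen only to show the calculation:

```python
import math

# Fitted logistic model from the ICU example above (coefficients as quoted in the notes)
b0, b1, b2 = -1.33, 2.022, -0.014

def predicted_probability(emergency, systolic_bp):
    """Estimated Pr(Y = 1), i.e. probability of death, for given covariate values."""
    eta = b0 + b1 * emergency + b2 * systolic_bp      # linear predictor (log odds)
    return 1 / (1 + math.exp(-eta))

# Hypothetical patient: emergency admission (x1 = 1), systolic blood pressure 120 mm Hg
print(round(predicted_probability(1, 120), 3))
```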

Now to interpret the parameters: as in linear regression, \beta_0 represents the case when all independent variables take a value of 0

In this case, if x_{i1} = 0 (meaning the person was self-admitted) and t