
Upload: nigel-phillips

Post on 18-Dec-2015


Chapter 10: Inferential for Regression

http://jonfwilkins.blogspot.com/2011_08_01_archive.html


10.1: Simple Linear Regression
10.2: More Detail about Simple Linear Regression

Goals
• Describe the simple linear regression model (review – Ch. 2).
• Be able to perform the method (with the output from software packages, Lab 8).
• Use diagnostic plots to check the assumptions.
• Be able to perform inference on the slope (confidence interval and hypothesis test).
• Be able to determine if there is an association between the response and explanatory variables.
• Be able to perform a hypothesis test using the correlation coefficient.
• Be able to state the similarities and differences between a confidence interval for a mean response and a prediction interval, and in which situations each would be used (if there is time).


Conditions for Linear Regression

• We have n (x, y) pairs.
• For any fixed x, $y \sim N(\mu_y, \sigma)$.
• Each $y_i$ is independent of the other $y_j$’s.
• $\mu_y = \beta_0 + \beta_1 x$


Model for Linear Regression

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Data = Fit + Error


Linear Regression

$\hat{y} = b_0 + b_1 x$

• $\hat{y}$ is an unbiased estimator for $\mu_y$
• $b_0$ is an unbiased estimator for $\beta_0$
• $b_1$ is an unbiased estimator for $\beta_1$


Linear Regression

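$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{SS_{xy}}{SS_{xx}} = r\,\dfrac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$

$e_i = y_i - \hat{y}_i$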


Other SS and df

• Total: $SST = \sum (y_i - \bar{y})^2$, dft = n - 1
• Model: $SSM = \sum (\hat{y}_i - \bar{y})^2$, dfm = 1


ANOVA table for Linear Regression

Source              df      SS                                  MS
Model (Regression)  1       $\sum (\hat{y}_i - \bar{y})^2$      SSM / dfm
Error               n – 2   $\sum (y_i - \hat{y}_i)^2$          SSE / dfe
Total               n – 1   $\sum (y_i - \bar{y})^2$            SST / dft


Conditions for Linear Regression

• SRS
• Observations are independent of each other.
• The relationship is linear in the population.
• The response, y, is normally distributed around the population regression line.
• The standard deviation of the response is constant.
• Important plots:
  – Scatter plot
  – Residual plot
  – Histogram/Normal quantile plot of the residuals.


Residual Plots


Example: Linear Regression 1

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. Therefore a way of predicting this number is wanted. The data on the next slide are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.

a) Verify the assumptions required for linear regression.
b) Determine the equation of the fitted line.
c) What is a point estimate of the true average cetane number for biofuels whose iodine value is 100?
d) Estimate the value of σ.
e) What proportion of the observed variation in cetane number can be attributed to the iodine value?


Example: Linear Regression 1 (cont.)

x: 132.0  129.0  120.0  113.2  105.0   92.0   84.0
y:  46.0   48.0   51.0   52.1   54.0   52.0   59.0

x:  83.2   88.4   59.0   80.0   81.5   71.0   69.2
y:  58.7   61.6   64.0   61.4   54.6   58.8   58.0


Example: SLR 1 - Scatterplot


Example: SLR 1 – Residual Plot


Example: SLR 1 – Normality


Example: SLR 1 – Fitted Line

$r = -0.88925$, $s_x = 22.8755$, $s_y = 5.3864$, $\bar{y} = 55.657$, $\bar{x} = 93.393$
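
Part (b) can be checked numerically. The following is a minimal sketch, not the course's software output: it applies the least-squares formulas for the slope and intercept to the 14 (iodine value, cetane number) pairs listed for this example, and compares the slope with the shortcut r·sy/sx built from the summary statistics. The function name `fit_line` is a hypothetical helper introduced for illustration.

```python
# Cetane example data as listed on the slides (x = iodine value, y = cetane number).
x = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
     83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]
y = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
     58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


b0, b1 = fit_line(x, y)

# Same slope from the summary statistics: b1 = r * sy / sx.
b1_from_summary = -0.88925 * 5.3864 / 22.8755

# Residuals e_i = y_i - y-hat_i; they sum to (numerically) zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}, b1 via r*sy/sx = {b1_from_summary:.5f}")
```

The fitted coefficients should agree with the parameter estimates reported in the regression output later in these slides (intercept about 75.21, slope about -0.209).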


Example: SLR – fitted line


Example: SLR - s

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1       298.25443    298.25443    45.35  <.0001
Error            12        78.91986      6.57665
Corrected Total  13       377.17429
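
Parts (d) and (e) follow from the sums of squares. As a rough sketch under the assumption that the slide data are the data behind this ANOVA output, the code below rebuilds SSM, SSE and SST from the raw pairs, then estimates σ by s = √MSE and the proportion of explained variation by r² = SSM/SST. Variable names are illustrative only.

```python
import math

x = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
     83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]
y = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
     58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares from the ANOVA table.
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)            # model, df = 1
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error, df = n - 2
sst = sum((yi - y_bar) ** 2 for yi in y)                # total, df = n - 1

mse = sse / (n - 2)
s = math.sqrt(mse)       # (d) estimate of sigma
r_squared = ssm / sst    # (e) proportion of variation explained

print(f"SSM = {ssm:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"s = {s:.4f}, r^2 = {r_squared:.4f}")
```

With the table values, s = √6.57665 is roughly 2.56 and r² = 298.25443 / 377.17429 is roughly 0.79, so about 79% of the observed variation in cetane number is attributed to the iodine value.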


Confidence Interval

• Point estimates
  – b0 is an unbiased estimator for $\beta_0$
  – b1 is an unbiased estimator for $\beta_1$
• Assumptions
  – SRS
  – Linearity
  – Constant standard deviation of residuals
  – Normality
    • If y is normal, then both b0 and b1 are normal.
    • If y is not normal, the CLT still applies: b0 and b1 are approximately normal for large n.


Standard deviation for b1

(Bonus on HW)


Confidence Interval for $\beta_1$

$b_1 \pm t^*_{\text{column}}(df)\, SE_{b_1} = b_1 \pm t^*_{\text{column}}(df)\sqrt{\dfrac{MSE}{S_{xx}}}$


Example: SLR 1 - Inference

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. Therefore a way of predicting this number is wanted. The data are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.

e) What is the 95% confidence interval for the population slope?
f) Is the model useful (that is, is there a useful linear relationship between iodine value and cetane number)?


Example: SLR 1

From the ANOVA output: MSE = 6.57665 with dfe = 12.

$b_1 = -0.209$, $S_{xx} = 6802.77$


Example: SLR 1 – CI.

We are 95% confident that the population slope is between -0.277 and -0.141.
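
The interval can be reproduced from the summary values. This is a small sketch, assuming the critical value t* = 2.179 (the usual t-table entry for 95% confidence with 12 degrees of freedom, not printed on the slides) and the unrounded slope -0.20939 from the regression output.

```python
import math

# Values from the regression output for the cetane example.
b1 = -0.20939      # estimated slope
mse = 6.57665      # mean square error
s_xx = 6802.77     # sum of squared deviations of x
t_star = 2.179     # assumed: t-table value, 95% confidence, df = n - 2 = 12

se_b1 = math.sqrt(mse / s_xx)    # standard error of the slope
margin = t_star * se_b1
lower, upper = b1 - margin, b1 + margin

print(f"SE(b1) = {se_b1:.5f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The standard error matches the 0.03109 shown in the parameter estimates table. The endpoints can differ from the slide in the third decimal place because the slide rounds the slope to -0.209 before computing.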


LR Hypothesis Test: Summary

Null hypothesis: H0: $\beta_1 = \Delta$

Test statistic: $t = \dfrac{b_1 - \Delta}{SE_{b_1}}$, with df = n − 2

Note: A two-sided test with Δ = 0 is called a model utility test.

Alternative Hypothesis              P-Value
Upper-tailed  Ha: $\beta_1 > \Delta$      P(T ≥ t)
Lower-tailed  Ha: $\beta_1 < \Delta$      P(T ≤ t)
Two-sided     Ha: $\beta_1 \neq \Delta$   2P(T ≥ |t|)


Example: SLR 1 - HT

The data provide strong support ($P = 2.13 \times 10^{-5}$) for the claim that there is a linear relationship between iodine value and cetane number.
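
The model utility test can be sketched with the standard library alone. The code below is an illustration, not the software used in the course: it computes t = b1 / SE(b1) from the reported estimates and approximates the two-sided P-value by integrating the t density with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are hypothetical.

```python
import math


def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Approximate 2 * P(T >= |t|) using Simpson's rule on [0, |t|]."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3            # P(0 <= T <= |t|)
    return 1 - 2 * central


# Reported estimates for the cetane example; H0: beta1 = 0.
b1, se_b1, df = -0.20939, 0.03109, 12
t = (b1 - 0) / se_b1
p = two_sided_p(t, df)

print(f"t = {t:.2f}, two-sided P-value = {p:.2e}")
```

The P-value comes out on the order of 2 × 10^-5, in line with the slide; small differences in the last digit depend on how much the inputs were rounded.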


LR Hypothesis Test: Summary

Null hypothesis: H0: there is no association between x and y
Test statistic: $F = MSM / MSE$
P-value: P = P(F > Ftest), dfn = dfm, dfd = dfe


Example: LR - Inference

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. Therefore a way of predicting this number is wanted. The data are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.

g) Perform the hypothesis test using the F test statistic.
h) Perform the hypothesis test using the population correlation coefficient.


Example: LR – Inference - ANOVA

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1            75.21243         2.98363    25.21    <.0001
iodine      1            -0.20939         0.03109    -6.73    <.0001

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1       298.25443    298.25443    45.35  <.0001
Error            12        78.91986      6.57665
Corrected Total  13       377.17429


Example: LR – Inference (cont)

The data provide strong support ($P = 2.09 \times 10^{-5}$) for the claim that there is a linear relationship between iodine value and cetane number.
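
As a worked check of the F statistic using the ANOVA values above: in simple linear regression the F statistic equals the square of the slope's t statistic, so the F test and the two-sided t test on the slope are the same test. The slightly different P-values quoted for the two tests on these slides are therefore consistent with rounding of the inputs.

```latex
F = \frac{MSM}{MSE} = \frac{298.25443}{6.57665} \approx 45.35,
\qquad
t^2 = (-6.73)^2 \approx 45.3 \approx F,
\qquad
df_n = 1,\; df_d = 12
```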


Inference for Correlation: Assumptions

• (x, y) are independent
• (x, y) is normal
• Linear relationship between x and y
• Constant variance for the residuals


LR Hypothesis Test: Summary

Null hypothesis: H0: $\rho = 0$

Test statistic: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with df = n − 2

Alternative Hypothesis          P-Value
Upper-tailed  Ha: $\rho > 0$       P(T ≥ t)
Lower-tailed  Ha: $\rho < 0$       P(T ≤ t)
Two-sided     Ha: $\rho \neq 0$    2P(T ≥ |t|)


Example: LR - Inference

The data provide strong support ($P = 2.12 \times 10^{-5}$) for the claim that there is a linear relationship between iodine value and cetane number.
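
The correlation-based statistic can be computed directly from r and n. A short sketch using the sample correlation reported for this example:

```python
import math

r = -0.88925   # sample correlation from the summary statistics
n = 14         # number of (x, y) pairs

# Test statistic for H0: rho = 0, with df = n - 2.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2

print(f"t = {t:.3f} with df = {df}")
```

The result matches the t value for the slope in the parameter estimates table (about -6.73), which is why the test on the correlation and the test on the slope lead to the same conclusion.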


$SE_{\hat{\mu}_h}$

$SE_{\hat{\mu}_h} = \sqrt{MSE\left[\dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}$


Example: LR - Inference

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. Therefore a way of predicting this number is wanted. The data are x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.

i) What is the 95% confidence interval for the mean cetane number when the iodine value is 100?
j) Predict the cetane number for the next sample of biofuel that has an iodine value of 100.


Example: LR – Inference

From the ANOVA output: MSE = 6.57665 with dfe = 12.

$\hat{y} = 54.313$, $S_{xx} = 6802.77$, $\bar{x} = 93.393$


Example: SLR (cont)

We are 95% confident that the population mean cetane number is between 52.754 and 55.872 when the iodine value is 100.
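
A minimal sketch of the mean-response interval at an iodine value of 100, assuming the critical value t* = 2.179 (t table, 95% confidence, df = 12) and the unrounded coefficients from the parameter estimates table. Because the slide's point estimate 54.313 appears to come from a rounded slope, the endpoints here land about 0.04 below the slide's values while the margin of error is the same.

```python
import math

# Reported estimates and summary values for the cetane example.
b0, b1 = 75.21243, -0.20939
mse, n = 6.57665, 14
x_bar, s_xx = 93.393, 6802.77
t_star = 2.179     # assumed: t-table value, 95% confidence, df = 12
x_h = 100.0        # iodine value of interest

y_hat = b0 + b1 * x_h                                   # point estimate
se_mean = math.sqrt(mse * (1 / n + (x_h - x_bar) ** 2 / s_xx))
margin = t_star * se_mean
ci = (y_hat - margin, y_hat + margin)

print(f"y-hat = {y_hat:.3f}, SE = {se_mean:.4f}, margin = {margin:.3f}")
print(f"95% CI for the mean response: ({ci[0]:.3f}, {ci[1]:.3f})")
```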


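$SE_{\hat{y}}$

Variance components of the predicted value
1) Variance associated with the mean response: $MSE\left[\dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]$

2) Variance associated with the observation: $MSE$

$SE_{\hat{y}} = \sqrt{MSE\left[1 + \dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}$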


Example: SLR (cont)

We are 95% confident that the next cetane number is between 48.512 and 60.114 when the iodine value is 100.

Mean response: (52.754, 55.872)
Prediction interval: (48.512, 60.114)
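
The two intervals differ only in the extra MSE term for a single new observation. The sketch below computes both at an iodine value of 100 so the widths can be compared. It assumes t* = 2.179 (t table, df = 12) and the unrounded coefficients, so the centers sit about 0.04 below the slide's values. The function name `interval` is a hypothetical helper.

```python
import math

# Reported estimates and summary values for the cetane example.
b0, b1 = 75.21243, -0.20939
mse, n = 6.57665, 14
x_bar, s_xx = 93.393, 6802.77
t_star = 2.179     # assumed: t-table value, 95% confidence, df = 12


def interval(x_h, predict):
    """95% interval at x_h: prediction interval if predict, else CI for the mean."""
    y_hat = b0 + b1 * x_h
    variance = mse * (1 / n + (x_h - x_bar) ** 2 / s_xx)   # mean-response part
    if predict:
        variance += mse                                    # single-observation part
    margin = t_star * math.sqrt(variance)
    return y_hat - margin, y_hat + margin


ci = interval(100.0, predict=False)
pi = interval(100.0, predict=True)

print(f"CI for mean response: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Prediction interval:  ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is always wider than the confidence interval for the mean response at the same x, because it adds the variability of an individual observation around the line.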


CI for mean response
Prediction interval


Example: Confidence/Prediction Band


Multiple Regression: Examples 1

1) A portrait studio operates in cities of medium size and specializes in portraits of children. They want to open a store in another similar community, but want to be able to predict sales.

2) So that only students who will succeed are accepted into college, the registrar’s office wants to be able to predict the GPA of entering high school students.

3) A researcher studied the effects of the charge rate and the temperature on the life of a new type of power cell in a preliminary small-scale experiment.


Multiple Regression: Examples 2

4) An experiment was run to investigate the yield of tomato plants as a function of water level. A series of plots were randomized to different water levels and, at the end of the season, the yield of the plants was determined.

5) Fernandez-Juricic et al. (2003) examined the effect of human disturbance on the nesting of house sparrows (Passer domesticus). They counted breeding sparrows per hectare in 18 parks in Madrid, Spain, and also counted the number of people per minute walking through each park (both measurement variables).