Regression: Motivation
One dimensional data
(Summary by Mean)
10 20 30 40 50
X      (X-a)^2
10     (10-a)^2
20     (20-a)^2
30     (30-a)^2
40     (40-a)^2
50     (50-a)^2
Sum    150    T
T (the sum of the squared differences) is at its minimum when a = mean = 30
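The claim that the mean minimizes the sum of squared differences can be checked numerically. The sketch below is only an illustration: the function name and the grid of candidate values for a are assumptions, not part of the slides.

```python
# The five values from the slide.
data = [10, 20, 30, 40, 50]

def total_squared_deviation(a, values=data):
    """T(a) = sum of (x - a)^2 over all values."""
    return sum((x - a) ** 2 for x in values)

mean = sum(data) / len(data)  # 150 / 5

# Try every whole number from 0 to 60 as the summary value a
# (an illustrative grid) and keep the one with the smallest T.
candidates = range(0, 61)
best_a = min(candidates, key=total_squared_deviation)

print("mean:", mean)
print("a with smallest T:", best_a)
print("T at that a:", total_squared_deviation(best_a))
```

Moving a away from the mean in either direction increases T, which is the sense in which the mean is "closest" to the data.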
Regression
Estriol (mg/24 hr)   Birthweight (g/100)
7     25
9     25
9     25
12    27
14    27
14    30
15    32
15    34
15    34
15    35
16    27
16    24
16    30
16    31
16    32
16    35
17    30
17    32
17    36
18    35
18    37
19    31
19    34
20    38
21    30
22    40
24    28
24    43
25    32
25    39
27    34
Regression
• Concerns
  – Data summarization (as in one dimensional data)
  – Prediction of low birthweight babies (for special prenatal care to those at high risk)
Scatter plot
[Figure: scatter plot of birthweight against estriol]
Lines through scatter plot to represent the data
[Figure: scatter plot of birthweight (g/100) against estriol (mg/24 hr) with candidate lines drawn through it (Line 2, Line 3, Line 4, Line 5)]
Regression line: the best line, the best representation of the data
[Fig Reg 1.6: regression line through scatter plot of birthweight (g/100) against estriol (mg/24 hr)]
What is this with a line and numbers anyway?
• They could be the same thing in two different forms or languages
• But lines require less space to record, are easier to remember, and are easy to comprehend
• Lines can be a pictorial or a mathematical representation of numerical data
• A line: Y = 2 + 3X
  Intercept = 2
  Slope = 3
  (interpretation ??)
Numbers generated by the line
x     y
0     2
1     5
2     8
…     …
50    152
…     …
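As a small illustration of "numbers generated by the line", the sketch below evaluates Y = 2 + 3X at the x values listed on the slide. The function name is an assumption made for the example.

```python
def line(x, intercept=2, slope=3):
    """Value of Y on the line Y = intercept + slope * X."""
    return intercept + slope * x

# Reproduce the rows shown on the slide.
for x in (0, 1, 2, 50):
    print(x, line(x))
```

The intercept is the value of Y at X = 0, and the slope is the change in Y for each one-unit increase in X.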
Representation of bivariate measurements in different forms
• Equation: Y = 2 + 3X
• Data/Numbers
  x     y
  0     2
  1     5
  2     8
  …     …
  50    152
  …     …
• Picture/Graph
  [Figure: graph of the line on X and Y axes, with 0 and 3 marked on the X axis and 2 and 11 marked on the Y axis]
Straight lines
[Figure: A Straight Line, with the intercept marked on the Y axis]
[Figure: Two Straight Lines with the Same Slope but Different Intercepts]
Straight lines
[Figure: Two Straight Lines with the Same Intercept but Different Slopes]
[Figure: Straight Line with Zero Slope and Zero Intercept]
Regression: what line will generate the data?
[Figure: scatter plot of birthweight against estriol]
Which is the best line?
[Figure: scatter plot of birthweight (g/100) against estriol (mg/24 hr) with candidate lines drawn through it (Line 1, Line 2, Line 3, Line 4, Line 5)]
The best line: Birthweight = 21.52 + 0.608 Estriol
[Figure: regression line through scatter plot of birthweight (g/100) against estriol (mg/24 hr)]
Computer output
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients                      95% Confidence Interval for B
Model            B          Std. Error         Beta                        t        Sig.      Lower Bound    Upper Bound
1  (Constant)    21.523     2.620                                          8.214    .000      16.164         26.883
   ESTRIOL       .608       .147               .610                        4.143    .000      .308           .908
a. Dependent Variable: BWEIGHT
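The intercept and slope in the output can be reproduced from the 31 data pairs with the usual least-squares formulas. The sketch below uses plain Python and the estriol and birthweight values listed in this deck. The variable names are assumptions, and this is not the statistical package that produced the output above.

```python
# Estriol (mg/24 hr) and birthweight (g/100) for the 31 women in the deck.
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
x_mean = sum(estriol) / n
y_mean = sum(bweight) / n

# Least-squares slope = Sxy / Sxx; intercept = y_mean - slope * x_mean.
sxx = sum((x - x_mean) ** 2 for x in estriol)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(estriol, bweight))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

print("intercept:", round(intercept, 3))
print("slope:", round(slope, 3))
```

The two printed values correspond to the B column for (Constant) and ESTRIOL.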
Regression
The Saga continues
Out of curiosity
How did this accomplish what we wanted (i.e. data summarization and identifying women who might need special prenatal care)?
1. We end up with the line Birthweight = 21.52 + 0.608 Estriol, hoping that this line will generate the original data.
2. In the univariate case, the mean is closest to the data in a sense. In a similar way, the regression line is the closest line to the data. In that sense it summarizes the data.
Recall
One dimensional data
(Summary by Mean)
10 20 30 40 50

X      (X-a)^2       Bweight   (Bweight-L)^2
10     (10-a)^2      25        (25-L)^2
20     (20-a)^2      25        (25-L)^2
30     (30-a)^2      25        (25-L)^2
40     (40-a)^2      27        (27-L)^2
50     (50-a)^2      …         …
a = mean = 30 minimizes the sum of (X-a)^2.
L = 21.52 + 0.608 Estriol minimizes the sum of (Bweight-L)^2. This is the regression line.
Prediction
• Women that need special care
• If low birthweight is defined as < 2500 g (25 in units of g/100), then women with an estriol level < 5.72 are at high risk of having low birthweight babies.
• So is everything fine and dandy?
• Not necessarily
  – How closely does the regression line generate the data?
  – How much is estriol responsible for birthweight?
  – Was there something that would have better predicted women at risk?
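The estriol cutoff of 5.72 quoted above comes from solving the regression equation for the estriol level at which the predicted birthweight equals 25 (2500 g in units of g/100). A minimal sketch, with variable and function names chosen only for illustration:

```python
# Regression line from the deck: Birthweight = 21.52 + 0.608 * Estriol
intercept = 21.52
slope = 0.608

low_bw_cutoff = 25.0  # 2500 g expressed in g/100

# Solve 25 = 21.52 + 0.608 * estriol for estriol.
estriol_threshold = (low_bw_cutoff - intercept) / slope

def predicted_bweight(estriol):
    """Birthweight (g/100) generated by the regression line."""
    return intercept + slope * estriol

print("estriol threshold:", round(estriol_threshold, 2))
print("predicted birthweight at threshold:", predicted_bweight(estriol_threshold))
```

Because the slope is positive, any estriol level below the threshold gives a predicted birthweight below 25.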
Birthweights generated from two lines, and squared differences from the observed data

Obs. No.   Estriol   Observed Data   Line 1.1   Line 1.2   Line 1.1       Line 1.2
(a)        (b)       (c)             (d)        (e)        [(c)-(d)]^2    [(c)-(e)]^2
1          7         25              20.5       25.776     20.25          0.6022
2          9         25              23.5       26.992     2.25           3.9681
3          9         25              23.5       26.992     2.25           3.9681
4          12        27              28.0       28.816     1.00           3.2979
5          14        27              31.0       30.032     16.00          9.1930
6          14        30              31.0       30.032     1.00           0.0010
7          15        32              32.5       30.640     0.25           1.8496
8          15        34              32.5       30.640     2.25           11.2896
9          15        34              32.5       30.640     2.25           11.2896
10         15        35              32.5       30.640     6.25           19.0096
11         16        27              34.0       31.248     49.00          18.0455
12         16        24              34.0       31.248     100.00         52.5335
13         16        30              34.0       31.248     16.00          1.5575
14         16        31              34.0       31.248     9.00           0.0615
15         16        32              34.0       31.248     4.00           0.5655
16         16        35              34.0       31.248     1.00           14.0775
17         17        30              35.5       31.856     30.25          3.4447
18         17        32              35.5       31.856     12.25          0.0207
19         17        36              35.5       31.856     0.25           17.1727
20         18        35              37.0       32.464     4.00           6.4313
21         18        37              37.0       32.464     0.00           20.5753
22         19        31              38.5       33.072     56.25          4.2932
23         19        34              38.5       33.072     20.25          0.8612
24         20        38              40.0       33.680     4.00           18.6624
25         21        30              41.5       34.288     132.25         18.3869
26         22        40              43.0       34.896     9.00           26.0508
27         24        28              46.0       36.112     324.00         65.8045
28         24        43              46.0       36.112     9.00           47.4445
29         25        32              47.5       36.720     240.25         22.2784
30         25        39              47.5       36.720     72.25          5.1984
31         27        34              50.5       37.936     272.25         15.4921

Sum        534.00    992.00          1111.00    992.00     1419.00        423.43
Mean       17.23     32.00           35.84      32.00      -              -
Variance   22.58     22.47           50.81      8.35       -              -
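The two sums of squared differences in the table (1419.00 for Line 1.1 and 423.43 for Line 1.2) can be recomputed as a check. In the sketch below, Line 1.2 is the regression line 21.52 + 0.608 Estriol. The equation for Line 1.1 is not stated in the slides; 10 + 1.5 Estriol is an assumption inferred from its tabulated values (20.5 at estriol 7, 23.5 at estriol 9).

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

def line_1_1(e):
    # Assumed equation, inferred from the tabulated Line 1.1 values.
    return 10 + 1.5 * e

def line_1_2(e):
    # Regression line reported in the deck.
    return 21.52 + 0.608 * e

def sum_squared_differences(line):
    """Sum over all women of (observed birthweight - birthweight from the line)^2."""
    return sum((y - line(x)) ** 2 for x, y in zip(estriol, bweight))

sse_1_1 = sum_squared_differences(line_1_1)
sse_1_2 = sum_squared_differences(line_1_2)

print("Line 1.1:", round(sse_1_1, 2))
print("Line 1.2:", round(sse_1_2, 2))
```

The regression line gives the smaller sum, in the same way that the mean gives the smallest sum of squared differences for one dimensional data.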
E (estriol), BW (observed birthweight), Pred (birthweight predicted by the regression line), Diff (BW minus Pred)

E        BW       Pred        Diff
7.00     25.00    25.78076    -.78076
9.00     25.00    26.99714    -1.99714
9.00     25.00    26.99714    -1.99714
12.00    27.00    28.82171    -1.82171
14.00    27.00    30.03810    -3.03810
14.00    30.00    30.03810    -.03810
15.00    32.00    30.64629    1.35371
15.00    34.00    30.64629    3.35371
15.00    34.00    30.64629    3.35371
15.00    35.00    30.64629    4.35371
16.00    27.00    31.25448    -4.25448
16.00    24.00    31.25448    -7.25448
16.00    30.00    31.25448    -1.25448
16.00    31.00    31.25448    -.25448
16.00    32.00    31.25448    .74552
16.00    35.00    31.25448    3.74552
17.00    30.00    31.86267    -1.86267
17.00    32.00    31.86267    .13733
17.00    36.00    31.86267    4.13733
18.00    35.00    32.47086    2.52914
18.00    37.00    32.47086    4.52914
19.00    31.00    33.07905    -2.07905
19.00    34.00    33.07905    .92095
20.00    38.00    33.68724    4.31276
21.00    30.00    34.29543    -4.29543
22.00    40.00    34.90362    5.09638
24.00    28.00    36.12000    -8.12000
24.00    43.00    36.12000    6.88000
25.00    32.00    36.72819    -4.72819
25.00    39.00    36.72819    2.27181
27.00    34.00    37.94457    -3.94457
How good is the regression
[Fig Reg 1.6: regression line through scatter plot of birthweight (g/100) against estriol (mg/24 hr)]
How good is the regression
• R^2 = 0.372
  – Estriol explains about 37.2% of the variation in the birthweights. The remaining 62.8% is due to other factors.
  – At estriol 16, we have several birthweights (24, 27, 30, 31, 32 and 35). If estriol were the only factor for birthweight, we would not see this variation.
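R^2 can be computed as 1 minus the ratio of the squared differences left over around the regression line to the total squared differences of birthweight around its mean. The sketch below refits the line from the data in this deck. The variable names are assumptions made for the example.

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
x_mean = sum(estriol) / n
y_mean = sum(bweight) / n

# Least-squares fit.
sxx = sum((x - x_mean) ** 2 for x in estriol)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(estriol, bweight))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Total variation of birthweight around its mean.
total_ss = sum((y - y_mean) ** 2 for y in bweight)
# Variation left over around the regression line.
residual_ss = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(estriol, bweight))

r_squared = 1 - residual_ss / total_ss
print("R^2:", round(r_squared, 3))
```

The same ratio can be read off the summary table: the variance of the Line 1.2 values (8.35) divided by the variance of the observed birthweights (22.47).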
How good is the regression
Regression line and 95% confidence intervals around predicted values
[Figure: birthweight against estriol, showing the regression line with upper and lower 95% confidence limits]
Other factors
Multiple Regression
Regression Diagnostics
Residual Analysis
Diagnostics
• Residual for a patient (observation)
  – Difference between the observed birthweight and the birthweight the regression line would generate (predict)
• Example (for the first patient)
  – Observed birthweight = 25
  – Generated = 21.52 + 0.608 Estriol = 21.52 + 0.608(7) = 25.776
  – Residual = 25 - 25.776 = -0.776
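The arithmetic for the first patient can be written out as a short sketch. The function names are illustrative assumptions.

```python
def generated_bweight(estriol, intercept=21.52, slope=0.608):
    """Birthweight (g/100) generated by the regression line from the deck."""
    return intercept + slope * estriol

def residual(observed_bweight, estriol):
    """Observed birthweight minus the birthweight generated by the line."""
    return observed_bweight - generated_bweight(estriol)

# First patient: estriol 7, observed birthweight 25.
first_generated = generated_bweight(7)
first_residual = residual(25, 7)
print(round(first_generated, 3), round(first_residual, 3))
```

A negative residual means the observed birthweight lies below the line; a positive one means it lies above.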
Diagnostics
• Residual plots
• Plot of residuals against predicted values
• For checking assumptions
  – Normality, linearity and homoscedasticity
[Figures: example residual plots showing non-normality, heteroscedasticity and nonlinearity]
Diagnostics
• Influential patients (observations)
  – Influence: the change in the estimated parameters (slope and intercept) when the analysis is redone without the patient in question
  – Patients with high leverage and a large residual will have greater influence.
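One way to see influence is to refit the line with each patient left out in turn and record how far the slope and intercept move. The sketch below does this for the deck's data and also computes leverage with the standard simple-regression formula 1/n + (x - mean of x)^2 / Sxx. The helper names are assumptions, and this is a sketch rather than the output of any particular package.

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

def fit(xs, ys):
    """Least-squares (intercept, slope) for a simple regression of ys on xs."""
    m = len(xs)
    xm, ym = sum(xs) / m, sum(ys) / m
    sxx = sum((x - xm) ** 2 for x in xs)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ym - b1 * xm, b1

n = len(estriol)
full_intercept, full_slope = fit(estriol, bweight)

# Leverage: how far each patient's estriol is from the mean estriol.
x_mean = sum(estriol) / n
sxx_full = sum((x - x_mean) ** 2 for x in estriol)
leverage = [1 / n + (x - x_mean) ** 2 / sxx_full for x in estriol]

# Influence: change in the estimates when one patient is left out.
changes = []
for i in range(n):
    b0, b1 = fit(estriol[:i] + estriol[i + 1:], bweight[:i] + bweight[i + 1:])
    changes.append((b0 - full_intercept, b1 - full_slope))

for i in range(n):
    print(i + 1, estriol[i], bweight[i], round(leverage[i], 3),
          round(changes[i][0], 3), round(changes[i][1], 4))
```

Patients at the extremes of estriol have the highest leverage; whether they actually move the line also depends on how large their residual is.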
Diagnostics
• Standardized and studentized (or jackknife) residuals
  – Patients with large values of these residuals are potential outliers
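A sketch of how such residuals can be computed for the deck's data follows. Definitions and names differ between textbooks and software. The ones used here are assumptions: "standardized" is the residual divided by the residual standard deviation s, "studentized" also divides by sqrt(1 - leverage), and "jackknife" is the studentized residual rescaled as if s had been estimated without the patient in question.

```python
import math

estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
x_mean = sum(estriol) / n
y_mean = sum(bweight) / n
sxx = sum((x - x_mean) ** 2 for x in estriol)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(estriol, bweight))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

residuals = [y - (intercept + slope * x) for x, y in zip(estriol, bweight)]
leverage = [1 / n + (x - x_mean) ** 2 / sxx for x in estriol]

# Residual standard deviation; two parameters (intercept, slope) were estimated.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

standardized = [e / s for e in residuals]
studentized = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
# Jackknife (externally studentized) residual via the usual identity.
jackknife = [r * math.sqrt((n - 3) / (n - 2 - r ** 2)) for r in studentized]

# Patient with the largest studentized residual in absolute value.
worst = max(range(n), key=lambda i: abs(studentized[i]))
print("patient", worst + 1, "estriol", estriol[worst], "birthweight", bweight[worst])
print(round(standardized[worst], 2), round(studentized[worst], 2),
      round(jackknife[worst], 2))
```

A common rule of thumb treats absolute values above about 2 as worth a closer look, but any such cutoff is a convention rather than something stated in these slides.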