elementary statistics for the biological and life sciences stat 205 university of south carolina...
Post on 19-Dec-2015
Elementary Statistics for the Biological and Life Sciences
STAT 205
University of South Carolina, Columbia, SC
© 2010, University of South Carolina. All rights reserved, except where previous rights exist. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photoreproduction, recording, or scanning) without the prior written consent of the University of South Carolina.
STAT205 – Elementary Statistics for the Biological and Life Sciences 2
Chapter 9: Inferences for Paired Samples
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.
Independence Violations
In some settings, the independence between samples in the 2-sample t-test is violated, invalidating the methods used in Chapter 7.
Secs. 7.9–7.10 go into more detail on model violations.
One special case where we can provide a solution is that of PAIRED DATA.
Paired Data
Suppose the effect of some treatment or stimulus is studied on the same subjects (say, “before”–“after”, “right”–“left”, etc.).
Independence is clearly violated!
But (!), since the data are so clearly “paired,” the differences
$$d = Y_1 - Y_2$$
can still inform us about the treatment effect.
Paired Data Model
Suppose $Y_{i1} \sim$ i.i.d. $N(\mu_1, \sigma_1^2)$ is paired with $Y_{i2} \sim$ i.i.d. $N(\mu_2, \sigma_2^2)$ at each $i = 1, \ldots, n$.
Then, for $d_i = Y_{i1} - Y_{i2}$, we know from Rule E1 in Ch. 3 that
$$\mu_d = E\{d_i\} = E\{Y_{i1} - Y_{i2}\} = E\{Y_{i1}\} - E\{Y_{i2}\} = \mu_1 - \mu_2.$$
In fact, under this model $d_i \sim$ i.i.d. $N(\mu_d, \sigma_d^2)$ ($\sigma_d^2$ is a complicated function of the model parameters).
Sample Mean Difference
If $d_i \sim$ i.i.d. $N(\mu_d, \sigma_d^2)$, $i = 1, \ldots, n$, then
$$\bar{d} \sim N\left(\mu_d, \frac{\sigma_d^2}{n}\right),$$
from which we can make inferences on $\mu_d$ using our previous application of the t-distribution:
$$\frac{\bar{d} - \mu_d}{SE(\bar{d})} \sim t(n-1),$$
where
$$SE(\bar{d}) = \frac{S_d}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2}.$$
Conf. Interval on $\mu_d$
Using the t-distribution feature for $\bar{d}$ yields our typical form of confidence interval on $\mu_d$:
$$\bar{d} \pm t_{\alpha/2}\, SE(\bar{d}),$$
where df = n – 1 = (# pairs) – 1.
Example 9.3
Ex. 9.3: $Y_1$ = wt. loss after appetite inhib.; $Y_2$ = wt. loss in same woman after placebo.
Example 9.3 – Conf. Interval
We have df = n – 1 = 9 – 1 = 8, so for a 95% conf. interval on $\mu_d$, we employ $t_{.025} = 2.306$ (from Table 4).
The 95% conf. interval is
$$\bar{d} \pm t_{.025}\, SE(\bar{d}) = \bar{d} \pm t_{.025} \frac{S_d}{\sqrt{n}} = 1.00 \pm (2.306)\frac{0.72}{\sqrt{9}} = 1.00 \pm (2.306)(0.24) = 1.00 \pm 0.55$$
or 0.45 < $\mu_d$ < 1.55 kg.
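The interval arithmetic of Example 9.3 can be sketched in a few lines of Python. This is a minimal sketch, assuming only the summary values quoted on the slide (mean difference 1.00, $S_d$ = 0.72, n = 9) and the Table 4 critical value 2.306 supplied by hand; the function name `paired_ci` is hypothetical and not part of the course materials.

```python
import math

def paired_ci(d_bar, s_d, n, t_crit):
    """Confidence interval for mu_d: d_bar +/- t_crit * s_d / sqrt(n).

    t_crit is the upper alpha/2 critical point of t(n - 1), looked up by
    the caller (e.g., from a t table); it is not computed here.
    """
    se = s_d / math.sqrt(n)          # SE(d_bar) = S_d / sqrt(n)
    half_width = t_crit * se
    return d_bar - half_width, d_bar + half_width

# Summary values from Example 9.3; 2.306 is t_.025 with df = 8.
lo, hi = paired_ci(d_bar=1.00, s_d=0.72, n=9, t_crit=2.306)
print(f"{lo:.2f} < mu_d < {hi:.2f}")
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; with one available, the table look-up could be replaced by a quantile call.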
Hypothesis Tests on $\mu_d$
To test $H_o: \mu_d = 0$, find
$$t_s = \frac{\bar{d} - 0}{SE(\bar{d})}.$$
Then, reject $H_o$ vs.
• $H_A: \mu_d \neq 0$, when $P = 2P\{t(n-1) > |t_s|\} \leq \alpha$
• $H_A: \mu_d > 0$, when $P = P\{t(n-1) > t_s\} \leq \alpha$
• $H_A: \mu_d < 0$, when $P = P\{t(n-1) < t_s\} \leq \alpha$
t-Test Rejection Regions
To test $H_o: \mu_d = 0$ using rejection regions, reject $H_o$ vs.
• $H_A: \mu_d \neq 0$, when $|t_s| \geq t_{\alpha/2}$ (with df = n – 1)
• $H_A: \mu_d > 0$, when $t_s \geq t_{\alpha}$ (with df = n – 1)
• $H_A: \mu_d < 0$, when $t_s \leq -t_{\alpha}$ (with df = n – 1)
Example 9.6
Ex. 9.6: $Y_1$ = squirrel dist. to person chasing; $Y_2$ = squirrel dist. to nearest tree (n = 11).
Same squirrel each time, so data are paired.
Example 9.6 (cont’d)
(Note: in Fig. 9.3 we find that $Y_2$ does not appear normal, but the differences $d_i$ do. So, we continue with the t-test.)
Set $\alpha = 0.10$. Test $H_o: \mu_d = 0$ vs. $H_A: \mu_d \neq 0$.
We find
$$t_s = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{72}{148/\sqrt{11}} = \frac{72}{44.62} = 1.613.$$
Apply P-value approach: find $P = 2P\{t(n-1) > |t_s|\} = 2P\{t(10) > 1.613\}$.
Example 9.6 – P-value
From Table 4:
$P\{t(10) > 1.812\} = 0.05$
$P\{t(10) > 1.613\}$ = between 0.05 and 0.10
$P\{t(10) > 1.372\} = 0.10$
Doubling these one-tailed bounds for the two-sided test gives $\alpha = 0.10 < P < 0.20$, so we fail to reject $H_o$ and conclude there is no significant difference in mean distances.
Can find exact P = 0.1382 via TI-84 or R.
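The table bracketing in Example 9.6 can be checked numerically. The sketch below is one possible approach, not the TI-84 or R routine the slide mentions: it integrates the t density with Simpson's rule using only the Python standard library. The helper names (`t_pdf`, `t_upper_tail`, `paired_t_test`) are hypothetical, and the inputs are the summary values from the slide (mean difference 72, $s_d$ = 148, n = 11).

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P{t(df) > t} for t >= 0: one half minus the integral of the density on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps                      # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3

def paired_t_test(d_bar, s_d, n):
    """Two-sided test of H_o: mu_d = 0 from summary statistics."""
    t_s = d_bar / (s_d / math.sqrt(n))
    p = 2 * t_upper_tail(abs(t_s), n - 1)
    return t_s, p

t_s, p = paired_t_test(d_bar=72, s_d=148, n=11)
print(f"t_s = {t_s:.3f}, two-sided P = {p:.3f}")
```

Because the inputs are rounded summary statistics, the computed P-value can differ in the last digit or so from the exact value quoted on the slide, but it should fall inside the bracket 0.10 < P < 0.20.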
More on Paired Design
As $n \to \infty$, the CLT allows use of the t-distribution for paired data, so these inferences are available in large samples.
In small samples, a distribution-free approach is possible (as we’ll see in Sec. 9.4).
Additional features of the paired design are discussed in Sec. 9.3.
Sign Test
In small samples with non-normal paired differences, a distribution-free approach is available, known as the SIGN TEST.
• For paired data $Y_{i1}, Y_{i2}$, find $d_i = Y_{i1} - Y_{i2}$ and take $W_i$ = {sign of $d_i$}.
Under $H_o$: no difference between $Y_{i1}$ & $Y_{i2}$, we expect $d_i \approx 0$ such that $P\{W_i > 0\} = P\{W_i < 0\} = 1/2$.
• Ignore any $d_i = 0$. Let $n_d$ = # non-zero $d_i$’s.
Sign Test (cont’d)
To set up sign test:
a) Select $\alpha$.
b) Determine $H_A$ from subject-matter principles. Possibilities are
“directional”:
$H_A$: effect in group 1 > effect in group 2
$H_A$: effect in group 1 < effect in group 2
“non-directional”:
$H_A$: effect in group 1 ≠ effect in group 2
Sign Test Statistic
The test statistic is $B_s$, and it depends on $H_A$. Let $N_+$ = {# $W_i > 0$} and $N_-$ = {# $W_i < 0$}. Then,
$$B_s = \begin{cases} N_+ & \text{if } H_A: Y_1 > Y_2 \\ N_- & \text{if } H_A: Y_1 < Y_2 \\ \max\{N_+, N_-\} & \text{if } H_A: Y_1 \neq Y_2 \end{cases}$$
Reject $H_o$ in favor of $H_A$ when $B_s$ meets or exceeds a critical point from Table 7 (e.g., if $H_A: Y_1 > Y_2$, reject when $B_s \geq b_{\alpha}$).
(Portion of) Table 7, p. 684
Sign Test P-value
Notice that this is a BInS setting: $B_s$ is the number of “successes” among $n_d$ binary trials where, under $H_o$, P{success} = $\frac{1}{2}$.
So, if $H_o$ is true, $B_s \sim \text{Bin}(n_d, \frac{1}{2})$. Thus for
$H_A$: effect 1 > effect 2, set $P = P\{\text{Bin}(n_d, \frac{1}{2}) \geq B_s\}$,
$H_A$: effect 1 < effect 2, set $P = P\{\text{Bin}(n_d, \frac{1}{2}) \geq B_s\}$,
$H_A$: effect 1 ≠ effect 2, set $P = 2P\{\text{Bin}(n_d, \frac{1}{2}) \geq B_s\}$,
and reject $H_o$ when $P \leq \alpha$.
Example 9.12
Ex. 9.12: Y = skin graft survival (days).
Group 1: HL-antigen compatibility “close”
Group 2: HL-antigen compatibility “poor”
Example 9.12 (cont’d)
With this small data set, normality is brought into question (and the data have a ‘censored’ feature; see patients #3 and #10). So, a sign test is used.
Set $\alpha = 0.05$. Take $H_o$: “close” = “poor” vs. $H_A$: “close” > “poor” (since we expect poorer survival in the “poor” group).
In Table 9.7 we see $N_+ = 9$ (and $N_- = 2$). So take $B_s = 9$.
Example 9.12 – P-value
Let $B \sim \text{Bin}(11, 0.5)$, so that the P-value here is
$$\begin{aligned}
P = P\{B \geq 9\} &= P\{B = 9\} + P\{B = 10\} + P\{B = 11\} \\
&= {}_{11}C_{9} \left(\tfrac{1}{2}\right)^{9} \left(\tfrac{1}{2}\right)^{2} + {}_{11}C_{10} \left(\tfrac{1}{2}\right)^{10} \left(\tfrac{1}{2}\right)^{1} + {}_{11}C_{11} \left(\tfrac{1}{2}\right)^{11} \left(\tfrac{1}{2}\right)^{0} \\
&= \frac{11!}{9!\,2!} \left(\tfrac{1}{2}\right)^{11} + \frac{11!}{10!\,1!} \left(\tfrac{1}{2}\right)^{11} + \frac{11!}{11!\,0!} \left(\tfrac{1}{2}\right)^{11} \\
&= (55 + 11 + 1) \left(\tfrac{1}{2}\right)^{11} = \frac{67}{2^{11}} = 0.033.
\end{aligned}$$
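The binomial tail sum above is easy to reproduce programmatically. The following is a minimal sketch, assuming the counts $N_+ = 9$ and $N_- = 2$ from Example 9.12; the function name `sign_test_p` and its `alternative` argument are hypothetical conveniences, and the two-sided P-value is capped at 1 as an implementation choice.

```python
from math import comb

def sign_test_p(n_plus, n_minus, alternative="greater"):
    """Sign-test statistic B_s and P-value from the counts of +/- differences.

    alternative: "greater" (effect 1 > effect 2), "less", or "two-sided".
    Zero differences are assumed to have been dropped already.
    """
    n_d = n_plus + n_minus
    if alternative == "greater":
        b_s = n_plus
    elif alternative == "less":
        b_s = n_minus
    elif alternative == "two-sided":
        b_s = max(n_plus, n_minus)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    # Upper tail of Bin(n_d, 1/2): sum of C(n_d, k) / 2^n_d over k >= B_s.
    upper = sum(comb(n_d, k) for k in range(b_s, n_d + 1)) / 2 ** n_d
    p = 2 * upper if alternative == "two-sided" else upper
    return b_s, min(p, 1.0)

b_s, p = sign_test_p(n_plus=9, n_minus=2, alternative="greater")
print(b_s, round(p, 3))
```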
Example 9.12 (concluded)
Since P = 0.033 < 0.05 = $\alpha$, we reject $H_o$ and conclude that graft survival is significantly higher in the “close” group.
(Note that the Binomial P-value can be computed via TI-84.)
To use the rejection region approach for these data: reject $H_o$ if $B_s \geq b_{.05} = 9$ from Table 7. Since $B_s = 9 \geq 9$, we still reject $H_o$.
Chapter 10: Categorical Data and Contingency Table Analysis
(Coverage order: Secs. 10.7 → 10.2 → 10.3 → 10.1)
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.
Sec. 10.7: 2-Sample Proportion Data
Returning to the independent (two-)sample case, suppose now the data are from a BInS setting: $Y_1 \sim \text{Bin}(n_1, p_1)$ indep. of $Y_2 \sim \text{Bin}(n_2, p_2)$.
Of interest is the difference $p_1 - p_2$.
A good point estimator for $p_1 - p_2$ is the difference in sample proportions
$$\hat{p}_1 - \hat{p}_2 = \frac{Y_1}{n_1} - \frac{Y_2}{n_2}.$$
Conf. Intervals for $p_1 - p_2$
But (!) for building conf. intervals on $p_1 - p_2$ we apply our previous AC strategy and start with
$$\tilde{p}_1 - \tilde{p}_2 = \frac{Y_1 + 1}{n_1 + 2} - \frac{Y_2 + 1}{n_2 + 2}.$$
Then, find
$$SE(\tilde{p}_1 - \tilde{p}_2) = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}}.$$
Agresti-Caffo Conf. Intervals
DEF’N: When $Y_1 \sim \text{Bin}(n_1, p_1)$ indep. of $Y_2 \sim \text{Bin}(n_2, p_2)$, the 95% AGRESTI-CAFFO CONFIDENCE INTERVAL for $p_1 - p_2$ is
$$\tilde{p}_1 - \tilde{p}_2 \pm z_{\alpha/2}\, SE(\tilde{p}_1 - \tilde{p}_2) = \frac{Y_1 + 1}{n_1 + 2} - \frac{Y_2 + 1}{n_2 + 2} \pm z_{\alpha/2} \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}},$$
where at $\alpha = 0.05$ we use $z_{0.025} = 1.96$. (Generalizations exist for other values of $\alpha$.)
Example 10.37
Ex. 10.37 (from Ex. 10.11 – see below):
$Y_1$ = # patients angina-free after Timolol trt.
$Y_2$ = # patients angina-free after placebo.
Data from Ex. 10.11 (Table 10.4): $Y_1 = 44$ of $n_1 = 160$ (Timolol) and $Y_2 = 19$ of $n_2 = 147$ (placebo).
Example 10.37 (cont’d)
The Agresti-Caffo point estimator is
$$\tilde{p}_1 - \tilde{p}_2 = \frac{44 + 1}{160 + 2} - \frac{19 + 1}{147 + 2} = \frac{45}{162} - \frac{20}{149} = .278 - .134 = .144$$
Associated SE is
$$SE(\tilde{p}_1 - \tilde{p}_2) = \sqrt{\frac{(.278)(.722)}{162} + \frac{(.134)(.866)}{149}}$$
Example 10.37 (concluded)
From this the 95% conf. interval is
$$\tilde{p}_1 - \tilde{p}_2 \pm z_{\alpha/2}\, SE(\tilde{p}_1 - \tilde{p}_2) = 0.144 \pm (1.96)\sqrt{\frac{(.278)(.722)}{162} + \frac{(.134)(.866)}{149}} = 0.144 \pm (1.96)(0.0449) = 0.144 \pm 0.088$$
or 0.056 < $p_1 - p_2$ < 0.232.
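The Agresti-Caffo calculation of Example 10.37 can be sketched as follows. This is an illustrative sketch using the angina counts from the slides (44 of 160 on Timolol, 19 of 147 on placebo); the function name `agresti_caffo_ci` is hypothetical, and the default `z=1.96` corresponds to the 95% level only.

```python
import math

def agresti_caffo_ci(y1, n1, y2, n2, z=1.96):
    """Agresti-Caffo interval for p1 - p2 (95% when z = 1.96).

    Adds one success and one failure to each sample before forming
    the usual difference of proportions and its standard error.
    """
    p1 = (y1 + 1) / (n1 + 2)
    p2 = (y2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se

lo, hi = agresti_caffo_ci(y1=44, n1=160, y2=19, n2=147)
print(f"{lo:.3f} < p1 - p2 < {hi:.3f}")
```

The code carries full precision throughout, so its lower limit can differ from the slide's 0.056 in the third decimal place; the slide rounds the adjusted proportions to three digits before combining them.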
Sec. 10.2: Testing $p_1$ vs. $p_2$
For testing $H_o: p_1 = p_2$, we introduce a new construction: the contingency table.
DEF’N: A 2×2 CONTINGENCY TABLE is a tabular arrangement of count data representing how the success & failure frequencies relate to an explanatory factor.
For testing $H_o: p_1 = p_2$, the column factor delineates Group 1 vs. Group 2 and the row factor delineates success vs. failure.
2×2 Contingency Table
Basic structure of a 2×2 contingency table:
               Grp. 1     Grp. 2
# Success      Y1         Y2
# Failures     n1–Y1      n2–Y2
(Col.) Total   n1         n2
Notice that we can read the sample proportions straight from the table:
$$\hat{p}_1 = \frac{Y_1}{n_1}, \quad \hat{p}_2 = \frac{Y_2}{n_2}$$
Example 10.11
Ex. 10.11: (Ex. 10.37, cont’d) Angina expt.
                Timolol   Placebo   (Row) Tot.
# Angina-free   44        19        63
# Angina        116       128       244
(Col.) Tot.     160       147       307
Of interest is testing whether Angina status is associated with Timolol trt., i.e., do the row and column factors “interact”?
Testing $p_1$ vs. $p_2$
To test $H_o: p_1 = p_2$, there are many available approaches. We employ the contingency table since it can be extended to more than 2 row or column levels (see Sec. 10.5).
The table allows for construction of a statistic that compares the “observed” data against their “expected” values under a pre-specified model, say, the model under $H_o: p_1 = p_2$.
Pearson’s $\chi^2$ Statistic
DEF’N: PEARSON’S $\chi^2$ (CHI-SQUARE) STATISTIC is
$$X_s^2 = \sum \frac{(O - E)^2}{E}.$$
$X_s^2$ is sometimes called a “goodness-of-fit” statistic (for reasons explained in Sec. 10.1).
For application in a 2×2 contingency table, the “O” values are the four counts in the table ($Y_1$, $Y_2$, $n_1 - Y_1$, $n_2 - Y_2$), and the “E” values are their expected values under $H_o: p_1 = p_2$.
“E” values
But, what are the “E” values under $H_o: p_1 = p_2$?
Well, if $H_o$ is true, we expect both $Y_1/n_1$ and $Y_2/n_2$ to estimate the same value, say, p.
We can estimate this common p using a weighted (“pooled”) estimator:
$$\hat{p}_{pool} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{n_1 (Y_1/n_1) + n_2 (Y_2/n_2)}{n_1 + n_2} = \frac{Y_1 + Y_2}{n_1 + n_2}$$
“E” successes
Now, if there are $n_1$ total obsv’ns for Grp. 1, then we “expect” $n_1 \hat{p}_{pool}$ of these to be successes. This is
$$n_1 \hat{p}_{pool} = \frac{n_1 (Y_1 + Y_2)}{n_1 + n_2}.$$
Similarly, with $n_2$ total obsv’ns in Grp. 2 we expect $n_2 \hat{p}_{pool}$ successes:
$$n_2 \hat{p}_{pool} = \frac{n_2 (Y_1 + Y_2)}{n_1 + n_2}.$$
“E” failures
For the expected # of failures, just subtract the “E” successes from each total, $n_j$:
$$n_1 - n_1 \hat{p}_{pool} = n_1 \cdot \frac{n_1 + n_2 - Y_1 - Y_2}{n_1 + n_2} = \frac{n_1 (n_1 - Y_1 + n_2 - Y_2)}{n_1 + n_2}$$
and
$$n_2 - n_2 \hat{p}_{pool} = \frac{n_2 (n_1 - Y_1 + n_2 - Y_2)}{n_1 + n_2}.$$
Expected 2×2 Table
The result is an “expected” 2×2 contingency table:
             Grp. 1                      Grp. 2                      Row tot.
# success    n1(Y1+Y2)/(n1+n2)           n2(Y1+Y2)/(n1+n2)           Y1+Y2
# failure    n1(n1–Y1+n2–Y2)/(n1+n2)     n2(n1–Y1+n2–Y2)/(n1+n2)     n1–Y1+n2–Y2
Col. tot.    n1                          n2                          n1+n2
Notice the similar structure of each “E”:
E = (Row Total)(Col. Total)/(Grand Total)
Examples 10.14-10.15
Exs. 10.14-10.15 (10.11 cont’d): Angina expt.
“O” table was
                Timolol   Placebo   Row Tot.
# Angina-free   44        19        63
# Angina        116       128       244
Col. Tot.       160       147       307
“E” table is (cf. Table 10.7)
                Timolol   Placebo   Row Tot.
# Angina-free   32.83     30.17     63
# Angina        127.17    116.83    244
Col. Tot.       160       147       307
e.g., E = (63)(160)/307 = 32.83
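The rule E = (Row Total)(Col. Total)/(Grand Total) translates directly into code. The sketch below is illustrative only; the function name `expected_counts` is hypothetical, and the observed counts are those of the angina experiment.

```python
def expected_counts(table):
    """Expected counts E = (row total)(col. total)/(grand total) for each cell."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    return [[r * c / grand for c in col_tot] for r in row_tot]

observed = [[44, 19],      # angina-free: Timolol, Placebo
            [116, 128]]    # angina:      Timolol, Placebo
expected = expected_counts(observed)
for row in expected:
    # Rounded for display only; unrounded values should be used in X^2.
    print([round(e, 2) for e in row])
```

Summing each row and column of `expected` should return the original margins (63, 244, 160, 147), which gives a quick check on the arithmetic.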
2×2 Table
In the 2×2 table of expected counts, note that:
• the “E” values need not be integers, and we do NOT round them;
• the row and column totals do not change (they are designed not to).
This is a quick way to double-check the calculations!
$\chi^2(\nu)$ Distribution
To find the P-value, we need the null reference distribution of $X_s^2$.
DEF’N (p. 394): The $\chi^2(\nu)$ DISTRIBUTION with $\nu$ df is the limiting distribution of Pearson’s $X_s^2$ statistic under $H_o$.
NOTATION: $X_s^2 \sim \chi^2(\nu)$
In the special case of a 2×2 contingency table, $\nu = 1$.
Properties of $\chi^2(\nu)$
The $\chi^2(\nu)$ dist’n
• is always ≥ 0
• is skewed right
• has integer df’s
• has upper-$\alpha$ critical point $\chi^2_{\alpha}(\nu)$, given in Table 9 (must bracket P-values)
• is computable in DoStat
(Portion of) Table 9, p. 686
Rejecting $H_o$
So, to reject $H_o: p_1 = p_2$ vs. $H_A: p_1 \neq p_2$, set $\alpha$ and find $X_s^2 = \sum (O - E)^2/E$.
P-value approach: find $P = P\{\chi^2(1) \geq X_s^2\}$ via computer or bracket via Table 9 and reject $H_o$ if $P \leq \alpha$.
Rejection region approach: find $\chi^2_{\alpha}(1)$ from Table 9 and reject $H_o$ if $X_s^2 \geq \chi^2_{\alpha}(1)$.
(Notice: a 1-tailed table look-up for a 2-sided test!)
One-sided testing
To find a one-sided P-value:
for $H_A: p_1 > p_2$, use
$$P = \tfrac{1}{2} P\{\chi^2(1) \geq X_s^2\} \quad \text{if } \hat{p}_1 > \hat{p}_2$$
(otherwise, report P > 0.50).
for $H_A: p_1 < p_2$, use
$$P = \tfrac{1}{2} P\{\chi^2(1) \geq X_s^2\} \quad \text{if } \hat{p}_1 < \hat{p}_2$$
(otherwise, report P > 0.50).
Example 10.16
Ex. 10.16 (10.11 cont’d): Angina expt. Set $\alpha = 0.01$. From the O and E values computed in Ex. 10.15, we find
$$X_s^2 = \sum \frac{(O - E)^2}{E} = \frac{(44 - 32.83)^2}{32.83} + \frac{(116 - 127.17)^2}{127.17} + \frac{(19 - 30.17)^2}{30.17} + \frac{(128 - 116.83)^2}{116.83} = 10.0$$
Example 10.16 (cont’d)
To test $H_o: p_1 = p_2$ vs. $H_A: p_1 \neq p_2$, we bracket $P = P\{\chi^2(1) \geq 10.0\}$ from Table 9:
$P\{\chi^2(1) \geq 6.63\} = 0.01$
$P\{\chi^2(1) \geq 10.0\}$ = between 0.01 and 0.001
$P\{\chi^2(1) \geq 10.83\} = 0.001$
So, 0.001 < P < 0.01 (two-sided).
Example 10.16 (concluded)
Since P < 0.01 = $\alpha$, we reject $H_o$ and conclude there is a significant difference in angina response after Timolol trt.
Can find exact P = 0.0016 via TI-84/R.
(A one-sided $H_A$ is not unreasonable here, but it’s easy to mess up the P-value, so be careful!)
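The whole test of Example 10.16 can be sketched without a statistics package. The sketch below assumes the angina counts from the slides and uses the fact that a $\chi^2(1)$ variable is the square of a standard normal, so its upper tail can be written with the complementary error function from Python's `math` module. This is one possible route, not necessarily how the TI-84 or R computes it; the function name `two_sample_chi_square` is hypothetical.

```python
import math

def two_sample_chi_square(y1, n1, y2, n2):
    """Pearson X^2 and two-sided P-value for H_o: p1 = p2 in a 2x2 table."""
    p_pool = (y1 + y2) / (n1 + n2)                 # pooled estimate of common p
    observed = [y1, n1 - y1, y2, n2 - y2]
    expected = [n1 * p_pool, n1 * (1 - p_pool),
                n2 * p_pool, n2 * (1 - p_pool)]    # unrounded "E" values
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # chi^2(1) = Z^2, so P{chi^2(1) >= x} = 2 P{Z >= sqrt(x)} = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p

x2, p = two_sample_chi_square(y1=44, n1=160, y2=19, n2=147)
print(f"X2 = {x2:.2f}, P = {p:.4f}")
```

Because the expected counts are not rounded here, the statistic comes out slightly below the slide's 10.0, which was computed from E values rounded to two decimals. The `erfc` shortcut applies only to 1 df; larger tables would need a general chi-square tail function.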
Pearson’s $X^2$ for 2×2 Table
When using the Pearson $X^2$ statistic in 2×2 tables, note that:
• $X_s^2 \sim \chi^2(1)$ is only an approximation in finite samples. To be valid, a standard rule-of-thumb is to require E ≥ 1 for every cell, and average $\bar{E} \geq 5$ (i.e., here $\{n_1 + n_2\}/4 \geq 5$).
• This method is antiquated for testing $p_1 = p_2$ (esp. against 1-sided alternatives). A better method is Fisher’s Exact test; see Sec. 10.4.
Sec. 10.3: Testing Association
The layout of the 2×2 table can apply to more than just tests of $p_1 = p_2$.
What if the row factor represents more than just success-vs.-failure? (Not uncommon!)
In this case, we have a single sample with n observations and with two explanatory factors (each having two levels).
General 2×2 Table
General 2×2 table (cf. Table 10.13):
                         column factor
                         level C1   level C2   row tot.
row factor   level R1    a          b          a+b
             level R2    c          d          c+d
             col. tot.   a+c        b+d        n
Notice that n = a + b + c + d.
Natural question: does the column factor affect the row factor, and/or vice versa?
Testing Association
Statistically, asking if the 2 factors interrelate is an issue of “association”:
• $H_o$: there is no association between the row and column factors
• $H_A$: there is some association between the row and column factors
(An older term for “no association” is “independence,” but don’t confuse this with statistical independence from Chap. 3.)
Pearson’s $X^2$ for Association
We can test the association hypotheses using Pearson’s statistic, $X_s^2 = \sum (O - E)^2/E$.
Here, the “O” terms are just a, b, c, and d.
The “E” terms are calculated as in Sec. 10.2:
$$E = \frac{(\text{Row Total})(\text{Col. Total})}{\text{Grand Total}}$$
For instance, in the (C1,R1) cell we have $E_{11} = (a+b)(a+c)/n$, etc.
Rejecting $H_o$
As in Sec. 10.2, $X_s^2 \sim \chi^2(1)$ under $H_o$, so for fixed $\alpha$, reject $H_o$ as follows:
P-value approach: find $P = P\{\chi^2(1) \geq X_s^2\}$ via computer or bracket via Table 9 and reject $H_o$ if $P \leq \alpha$.
Rejection region approach: find $\chi^2_{\alpha}(1)$ from Table 9 and reject $H_o$ if $X_s^2 \geq \chi^2_{\alpha}(1)$.
(Again: a 1-tailed table look-up for a 2-sided test.)
Example 10.21
Ex. 10.21: Hair color & eye color in n = 6800 German males.
“O” values are given in Table 10.11.
Example 10.21 (cont’d)
We find the “E” values as:
             Dark hair   Light hair   row total
Dark eye     485.84      371.16       857
Light eye    3,369.16    2,573.84     5,943
col. total   3,855       2,945        6,800
Set $\alpha = 0.05$. Of interest is testing whether a significant association exists between hair color and eye color.

Example 10.21 – $X^2$ Statistic
Given the O’s and the E’s, the test statistic is
$X_s^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(726 - 485.84)^2}{485.84} + \cdots + \dfrac{(2814 - 2573.84)^2}{2573.84} = 313.63$
The P-value is $P = P\{\chi^2(1) \ge 313.63\}$.

Example 10.21 – P-value
From Table 9:
$P\{\chi^2(1) \ge 313.63\}$ = below 0.0001
$P\{\chi^2(1) \ge 15.14\}$ = 0.0001
So, $P < 0.0001 < 0.05 = \alpha$: we reject $H_0$ and conclude that a significant association exists between hair color and eye color in these males.
Can find P < 0.0001 via TI-84/R.
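As a rough software cross-check of Ex. 10.21, the sketch below recomputes the E values, $X_s^2$, and the P-value in plain Python. Table 10.11 is not reproduced in this transcript, so the observed counts are partly an assumption: 726 and 2814 appear in the statistic above, and the other two cells (131 and 3129) are back-calculated from the row and column totals. The function name `pearson_2x2` is illustrative only.

```python
import math

def pearson_2x2(a, b, c, d):
    """Pearson's X^2 for the 2x2 table [[a, b], [c, d]] with its chi-square(1) P-value."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # E = (row total)(column total) / (grand total) for each cell
    expected = [[r * col / n for col in col_totals] for r in row_totals]
    x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))
    # chi-square(1) is a squared N(0,1), so P{chi2(1) >= x} = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(x2 / 2.0))
    return expected, x2, p_value

# Rows: dark eye, light eye; columns: dark hair, light hair.
# 131 and 3129 are derived from the margins (assumption, see text).
expected, x2, p_value = pearson_2x2(726, 131, 3129, 2814)
print([[round(e, 2) for e in row] for row in expected])
print(round(x2, 2), p_value < 0.0001)
```

The `erfc` shortcut only works for 1 df; for other df a statistics package (or Table 9) is needed.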

Notes on the $\chi^2$ Test
Some notes:
• $\chi^2(1)$ is only an approximation in finite samples. The rule-of-thumb $E \ge 1$ for every cell, and $\bar{E} \ge 5$ (here, $n/4 \ge 5$) still applies.
• By contrast, Ex. 10.21 illustrates that $X_s^2$ is very sensitive when n is large.
• The 2×2 table structure allows for a variety of “conditional” probability descriptions of the data; see pp. 413-416.

Notes on the $\chi^2$ Test (cont’d)
• We can extend the 2×2 table into an r×c CONTINGENCY TABLE for cases when more than 2 levels exist for either factor. $X_s^2$ is still useful here; see Sec. 10.5.
• Many variants exist of Pearson’s $X_s^2$. One seen often is $G_s^2 = 2\sum\sum O \ln\{O/E\}$, known as the ‘Likelihood-ratio test,’ ‘LR test,’ or ‘G test.’ While this has useful properties, it usually performs worse than $X_s^2$ for contingency tables and so is NOT recommended.

Phi-Divergence Statistic
A competitor to Pearson’s $X_s^2$ for r×c tables that can be recommended is known as the PHI-DIVERGENCE STATISTIC:
$C_s^2 = \dfrac{8}{3}\sum\sum\left\{O\sqrt{O/E} - O\right\}$
In the 2×2 case, $C_s^2 \sim \chi^2(1)$ under $H_0$ and so is used in the same fashion as $X_s^2$. (It can also be extended to the r×c case.)

Sec. 10.1: Goodness-of-Fit
Pearson’s original idea was to use $X_s^2$ to assess divergence in the “O”s against a modeled value for “E”.
Any model could be proposed, not just for r×c tables. In this sense, $X_s^2$ measures the goodness-of-fit of the model for “E”.

Example 10.1
Ex. 10.1: In genetics, we believe that offspring characters appear in regular “ratios.” E.g., in snapdragons, crossing two pink (hybrid) parents produces offspring with
$P\{\text{Red}\} = \tfrac{1}{4}$, $P\{\text{Pink}\} = \tfrac{1}{2}$, $P\{\text{White}\} = \tfrac{1}{4}$
(i.e., “1:2:1”). Or, so we think!
Can this model be supported by data? (We’ll see, later…)

Testing Goodness-of-Fit
To use $X_s^2$ to test a model’s goodness-of-fit:
• Set $\alpha$.
• Designate K > 1 categories that the model can predict (e.g., Red/Pink/White ⇒ K = 3).
• Collect data (the $O_k$’s) from a sample of size n.
• Determine the $E_k$’s for each k-th category using the model’s predictions.
• Calculate $X_s^2 = \sum (O_k - E_k)^2 / E_k$.

Testing Goodness-of-Fit (cont’d)
• The pertinent hypotheses are $H_0$: model fit is adequate vs. $H_A$: model fit is poor
• Under $H_0$, $X_s^2 \sim \chi^2(K-1)$, so the P-value is $P = P\{\chi^2(K-1) \ge X_s^2\}$. Reject $H_0$ if $P \le \alpha$.
• Or, use the Rejection Region approach: reject $H_0$ if $X_s^2 \ge \chi^2_\alpha(K-1)$.

Examples 10.4–10.5
Exs. 10.4–10.5 (10.1 cont’d): Set $\alpha = 0.10$. Suppose a sample of n = 234 snapdragons crossed from pink parents yields:
Color      Red    Pink   White
Observed   54     122    58
Expected   58.5   117    58.5
$E_R = n \times P(\text{Red}) = (234)(0.25) = 58.5$
$E_P = n \times P(\text{Pink}) = (234)(0.5) = 117$
$E_W = n \times P(\text{White}) = (234)(0.25) = 58.5$

Examples 10.4–10.5 (cont’d)
Test the goodness-of-fit of the (Mendelian) hypotheses:
$H_0$: model fit to Mendelian ratios is adequate
vs.
$H_A$: model fit to Mendelian ratios is poor

Example 10.5 – $X^2$ Statistic
We calculate
$X_s^2 = \sum_{k=1}^{3} \dfrac{(O_k - E_k)^2}{E_k} = \dfrac{(54 - 58.5)^2}{58.5} + \dfrac{(122 - 117)^2}{117} + \dfrac{(58 - 58.5)^2}{58.5} = 0.56$
Reject $H_0$ if $P = P\{\chi^2(K-1) \ge X_s^2\} = P\{\chi^2(2) \ge 0.56\}$ is less than or equal to $\alpha = 0.10$.

Example 10.5 – P-value
From Table 9:
$P\{\chi^2(2) \ge 3.22\}$ = 0.20
$P\{\chi^2(2) \ge 0.56\}$ = above 0.20
So, $P > 0.20 > 0.10 = \alpha$: we fail to reject $H_0$ and conclude the model fit appears adequate.
Can find exact P = 0.756 via TI-84/R.
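The goodness-of-fit steps for the snapdragon data can be sketched in a few lines of Python. The helper name `goodness_of_fit` is made up for illustration, and the P-value uses the closed form for 2 df, $P\{\chi^2(2) \ge x\} = e^{-x/2}$, which does not carry over to other df.

```python
import math

def goodness_of_fit(observed, probs):
    """Expected counts and Pearson's X^2 for a model with category probabilities `probs`."""
    n = sum(observed)
    expected = [n * p for p in probs]          # E_k = n * P(category k)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, x2

observed = [54, 122, 58]        # Red, Pink, White
probs = [0.25, 0.50, 0.25]      # Mendelian 1:2:1 ratios
expected, x2 = goodness_of_fit(observed, probs)

# K - 1 = 2 df, so the upper-tail probability is exp(-x2 / 2)
p_value = math.exp(-x2 / 2.0)
print(expected, round(x2, 2), round(p_value, 3))
```

With the unrounded statistic (about 0.564) this gives P of roughly 0.754; the 0.756 quoted above corresponds to plugging in the rounded value 0.56. Either way P is far above $\alpha = 0.10$.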

Caveats
Some warnings: Goodness-of-fit tests require
• categorical data (i.e., counts, not continuous measurements)
• large n
• objectively defined categories
So, they cannot be applied haphazardly!

Chapter 12: Linear Regression and Correlation
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Predictor Variables
In Chap. 10 we introduced the idea that a (categorical) response, Y, could depend on levels of an external variable.
Why not extend this idea to when Y is a continuous (normal) measurement?
We say Y is a RESPONSE VARIABLE, dependent upon an explanatory PREDICTOR VARIABLE, X.

Simple Linear Model
DEF’N: The SIMPLE LINEAR MODEL relating Y and X is
$Y = b_0 + b_1 X$.
• $b_0$ is the Y-INTERCEPT of the model, the point where the line crosses the Y-axis.
• $b_1$ is the SLOPE of the model, the change in Y for a given unit change in X (“rise” over “run”).

[Figure: the line $Y = b_0 + b_1 X$ plotted against X, with intercept $b_0$ on the Y-axis and the slope shown as $\Delta Y = b_1$ for $\Delta X = 1$.]
Linear RegressionLinear Regression
DEF’NDEF’N: The : The LEAST SQUARES (LS) LEAST SQUARES (LS) REGRESSION LINEREGRESSION LINE is a data-dependent fit is a data-dependent fit of a linear model. It has coefficientsof a linear model. It has coefficients
(slope) b1 = (xi - x)(yi - y)
i=1
n
(xi - x)2i=1
n
(intercept) b0 = y - b1x

Example 12.3
Ex. 12.3: Y = snake weight (g), X = snake length (cm)
Notice that the data appear as $(x_i, y_i)$ pairs:

Example 12.4
Ex. 12.4 (12.3 cont’d): Snake data. Scatterplot shows a clear linear relation:

Example 12.4 (cont’d)
Table 12.3 summarizes the LS calculations:

Example 12.4 – LS Coefficients
From Table 12.3 we see
$\bar{x} = 63$, $\bar{y} = 152$, $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 1237$ and $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 172$
so that the LS coefficients are
$b_1 = 1237/172 = 7.192$ and
$b_0 = 152 - (7.192)(63) = -301.096$.
Thus, the LS line is $-301.096 + 7.192X$. Note that these operations are available in TI-84/R.
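The same arithmetic can be sketched in Python. The snake measurements themselves are not reproduced in this transcript, so the first call starts from the Table 12.3 summary values quoted above; the function names and the three toy points are assumptions made purely for illustration.

```python
def ls_from_sums(x_bar, y_bar, sxy, sxx):
    """LS slope and intercept from the summary quantities on the slide."""
    b1 = sxy / sxx               # slope: sum of cross-products over sum of squares
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1

def ls_from_data(xs, ys):
    """Same fit, starting from raw (x_i, y_i) pairs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return ls_from_sums(x_bar, y_bar, sxy, sxx)

# Snake data, from the summary values in Table 12.3
b0, b1 = ls_from_sums(x_bar=63, y_bar=152, sxy=1237, sxx=172)
print(round(b1, 3), round(b0, 3))

# Toy (made-up) points lying exactly on Y = 1 + 2X, only to exercise ls_from_data
toy_b0, toy_b1 = ls_from_data([1, 2, 3], [3, 5, 7])
print(toy_b0, toy_b1)
```

Carrying full precision in $b_1$ gives an intercept near -301.087; the slide’s -301.096 comes from rounding $b_1$ to 7.192 before multiplying by 63.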

Example 12.4 – Interpretations
Interpretation of LS coefficients:
• $b_1 = 7.192$ indicates that a 1 cm increase in snake length leads to an estimated 7.192 g increase in snake weight.
• $b_0$ is the estimated weight of a snake whose length is 0 cm.
Clearly, this is a poor EXTRAPOLATION (p. 546) away from the bulk of the data.
Indeed, would we ever see Y < 0?

Residuals
DEF’N: A PREDICTED VALUE (a.k.a. a FITTED VALUE) is an estimate of $y_i$ based on a prediction/fitted regression equation, $b_0 + b_1 x_i$.
NOTATION: $\hat{y}_i$
DEF’N: A RESIDUAL is the departure from Y of a fitted value: $\text{Resid}_i = y_i - \hat{y}_i$
See Figure 12.6

Figure 12.6

SS(Resid.)
DEF’N: The RESIDUAL SUM OF SQUARES (a.k.a. SUM OF SQUARED ERRORS, or SSE), is
$\text{SS(Resid.)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
DEF’N: The LEAST SQUARES CRITERION states that the optimal fit of a model to data occurs when SS(Resid.) is minimized.
Notice that under our linear model,
$\text{SS(Resid.)} = \sum (y_i - b_0 - b_1 x_i)^2$.

Example 12.5
Ex. 12.5 (12.3 cont’d): Snake data. Table 12.4 shows the calculations that lead to SS(Resid.):

Std. Deviation
We can use SS(Resid.) to update the accuracy of our measure of variability. Recall that to estimate the variation of Y we used the SD
$SD = S_Y = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
But, this ignores any effect X has on Y. Since SS(Resid.) incorporates the effect of X, it serves as a basis for more accurate estimates of variation.

Residual SD
DEF’N: The RESIDUAL STANDARD DEVIATION from an LS fit is
$S_{Y|X} = \sqrt{\dfrac{\text{SS(Resid.)}}{n - 2}} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$
Notice in $S_{Y|X}$ that the df have changed from n – 1 to n – 2, now “incorporating” the fitting of 2 model parameters, $b_0$ & $b_1$.

Example 12.6
Ex. 12.6 (12.3 cont’d): Snake data. With n = 9 data pairs, we found SS(Resid.) = 1093.66. Thus
$S_{Y|X} = \sqrt{\dfrac{1093.66}{7}} = \sqrt{156.238} = 12.499$ g
Compare this with the larger SD
$S_Y = \sqrt{\dfrac{9990}{9 - 1}} = \sqrt{1248.75} = 35.338$ g
for these data.
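A small Python sketch of the two standard deviations in Ex. 12.6, using only the sums of squares quoted above (the function names are illustrative):

```python
import math

def residual_sd(ss_resid, n):
    """S_{Y|X}: residual SD, with n - 2 df because b0 and b1 were both fitted."""
    return math.sqrt(ss_resid / (n - 2))

def sample_sd(ss_total, n):
    """S_Y: ordinary SD of Y, with n - 1 df (ignores X)."""
    return math.sqrt(ss_total / (n - 1))

s_yx = residual_sd(1093.66, 9)   # SS(Resid.) for the snake data
s_y = sample_sd(9990, 9)         # sum of (y_i - y_bar)^2 for the snake data
print(round(s_yx, 3), round(s_y, 3))
```

The residual SD is roughly a third of $S_Y$, which is the sense in which accounting for length sharpens the estimate of variation in weight.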

The Linear Statistical Model
To perform inferences in a linear regression, we need a statistical model. We start with:
DEF’N: A CONDITIONAL MEAN is the expected value of a variable, Y, conditional on another variable, X.
NOTATION: $\mu_{Y|X}$
DEF’N: A CONDITIONAL STD. DEVIATION is the SD of a variable, Y, conditional on another variable, X.
NOTATION: $\sigma_{Y|X}$

Linear (Regression) Model
DEF’N: The LINEAR (REGRESSION) MODEL of Y on X assumes
$Y = \mu_{Y|X} + \varepsilon$,
where the conditional mean is linear:
$\mu_{Y|X} = \beta_0 + \beta_1 X$
and $\varepsilon$ is a random error term with
$\mu_\varepsilon = 0$ and $\sigma_\varepsilon = \sigma_{Y|X}$

Linear Prediction
In the linear regression model, we use the LS coefficients, $b_0$ & $b_1$, to estimate $\beta_0$ & $\beta_1$, and $S_{Y|X}$ to estimate $\sigma_{Y|X}$.
Thus, in principle we could estimate (or “predict”) $\mu_{Y|X}$ at any X = x via
$\hat{\mu}_{Y|X=x} = b_0 + b_1 x$
Careful: when making predictions on $\mu_{Y|X}$ outside the range of x, the “extrapolation” can be very poor!

Example 12.12
Ex. 12.12 (12.3 cont’d): For the snake data, we saw $b_0 = -301.096$ and $b_1 = 7.192$ (and $S_{Y|X} = 12.499$).
If, e.g., we wished to predict the weight of a snake x = 68 cm long, we would use
$\hat{\mu}_{Y|X=68} = -301.096 + (7.192)(68) = 187.96$ g

Normal Error Model
$\beta_1$ indicates how Y changes with (unit) increases in X, and thus has important biological interest.
To make inferences on $\beta_1$ we update the linear statistical model:
$Y = \mu_{Y|X} + \varepsilon$
$\mu_{Y|X} = \beta_0 + \beta_1 X$ (conditional mean is linear)
$\varepsilon \sim N(0, \sigma_{Y|X}^2)$ (conditional SD is constant)

Figure 12.9

Confidence Interval for $\beta_1$
Under the normal error model, $b_1$ is unbiased for $\beta_1$, with
$SE(b_1) = \dfrac{S_{Y|X}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \dfrac{S_{Y|X}}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}}$
Using this, a $1 - \alpha$ conf. interval for $\beta_1$ is
$b_1 \pm t_{\alpha/2}\, SE(b_1)$
where $t_{\alpha/2}$ has df = n–2 (same as $S_{Y|X}$).

Example 12.18
Ex. 12.18 (12.3 cont’d). For the Snake Data (n = 9), we had $b_1 = 7.19$ and $S_{Y|X} = 12.499$. For a 95% conf. interval on $\beta_1$, we need $t_{.05/2} = t_{.025} = 2.365$ (df = 9–2 = 7). Also, from Table 12.3,
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 172$
The 95% interval is then
$b_1 \pm t_{.025}\, SE(b_1) = 7.19 \pm (2.365)\dfrac{12.499}{\sqrt{172}} = 7.19 \pm (2.365)(0.953) = 7.19 \pm 2.25$
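The interval in Ex. 12.18 can be reproduced with a short Python sketch. The critical value 2.365 is taken from the slide (a table look-up with 7 df) rather than computed, and the function name is an illustrative choice.

```python
import math

def slope_conf_interval(b1, s_yx, sxx, t_crit):
    """Confidence interval b1 +/- t * SE(b1), with SE(b1) = S_{Y|X} / sqrt(sum (x_i - x_bar)^2)."""
    se_b1 = s_yx / math.sqrt(sxx)
    margin = t_crit * se_b1
    return se_b1, b1 - margin, b1 + margin

# Snake data: b1 = 7.19, S_{Y|X} = 12.499, sum of squares of x = 172, t_.025(7) = 2.365
se_b1, lower, upper = slope_conf_interval(b1=7.19, s_yx=12.499, sxx=172, t_crit=2.365)
print(round(se_b1, 3), round(lower, 2), round(upper, 2))
```

So the 95% interval runs from about 4.94 to 9.44 g per cm; it excludes 0, in line with the test of $\beta_1 = 0$ for the same data.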

Testing $\beta_1$
• Similarly, we can use the t-dist’n to test $H_0$: $\beta_1 = 0$. (Why 0? At $\beta_1 = 0$, there is no effect of X on Y.)
• The test statistic is
$t_s = \dfrac{b_1 - 0}{SE(b_1)}$
• Under $H_0$, this has $t_s \sim t(n-2)$.
We can use either the P-value approach or the rejection region approach.

P-values for Testing $\beta_1$
To test $H_0$: $\beta_1 = 0$ vs.
• $H_A$: $\beta_1 \ne 0$, reject $H_0$ when $P = 2P\{t(n-2) > |t_s|\} \le \alpha$
• $H_A$: $\beta_1 > 0$, reject $H_0$ when $P = P\{t(n-2) > t_s\} \le \alpha$
• $H_A$: $\beta_1 < 0$, reject $H_0$ when $P = P\{t(n-2) < t_s\} \le \alpha$

Rejection Regions for Testing $\beta_1$
To test $H_0$: $\beta_1 = 0$ using rejection regions, if:
• $H_A$: $\beta_1 \ne 0$, reject $H_0$ when $|t_s| \ge t_{\alpha/2}$ (with df = n–2)
• $H_A$: $\beta_1 > 0$, reject $H_0$ when $t_s \ge t_\alpha$ (with df = n–2)
• $H_A$: $\beta_1 < 0$, reject $H_0$ when $t_s \le -t_\alpha$ (with df = n–2)

Example 12.18 (cont’d)
Ex. 12.18 (cont’d): For the Snake Data, set $\alpha = 0.05$. A natural alternative to $H_0$: $\beta_1 = 0$ is $H_A$: $\beta_1 > 0$ (why?), so if the linear regression model is valid, we find
$t_s = \dfrac{b_1}{SE(b_1)} = \dfrac{7.19}{12.499/\sqrt{172}} = \dfrac{7.19}{0.953} = 7.545$
with df = n–2 = 9–2 = 7.

Example 12.18 – P-value
For the P-value, use Table 4:
$P\{t(7) \ge 7.545\}$ = below 0.0005
$P\{t(7) \ge 5.408\}$ = 0.0005
So, $P < 0.0005 < 0.05 = \alpha$: we reject $H_0$ and conclude that mean snake weight increases significantly with increasing snake length.
Can find P = 0.00007 via TI-84/R.
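For readers without a TI-84 or R at hand, the one-sided P-value can be approximated in plain Python by integrating the t density numerically. This is only a sketch: the helper names are made up, and Simpson’s rule stands in for the library routine a statistics package would use.

```python
import math

def t_density(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + u * u / df) ** (-(df + 1) / 2.0)

def t_upper_tail(t, df, steps=20000):
    """P{t(df) >= t} for t >= 0: one half minus the Simpson's-rule area over [0, t]."""
    h = t / steps                                  # steps must be even
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3.0

se_b1 = 12.499 / math.sqrt(172)      # SE(b1) for the snake data
t_s = 7.19 / se_b1                   # test statistic for H0: beta1 = 0
p_value = t_upper_tail(t_s, df=7)    # one-sided, for HA: beta1 > 0
print(round(t_s, 2), p_value)
```

The result should land near the 0.00007 quoted above, and calling `t_upper_tail(5.408, 7)` should return roughly 0.0005, matching the Table 4 entry.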

$r^2$
DEF’N: The COEFFICIENT OF DETERMINATION is
$r^2 = \dfrac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}$
Properties of $r^2$:
• $0 \le r^2 \le 1$
• Interpret as % variation in Y that is explained by variation in X
• BADLY over-used

Example 12.22
Ex. 12.22 (12.3 cont’d): Snake data. From Table 12.3 we can find
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 1237$, $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 172$ and $\sum_{i=1}^{n} (y_i - \bar{y})^2 = 9990$
so $r^2 = (1237)^2 / [(172)(9990)] = 0.8905$ (about 89% of variation in snake weight is explained by variation in snake length).

Random Predictors
The linear regression model with $\mu_{Y|X} = \beta_0 + \beta_1 X$ is conditional on X. This is an important distinction.
If X is itself random, which is not uncommon in biology, inferences on $\beta_1$ and/or prediction on $\mu_Y$ are invalid (uninterpretable, really) unless we impose the conditioning.
For cases when we center on the joint relation between X and Y, rather than predicting Y from X, we need a different statistical model.

Bivariate Model
DEF’N: The BIVARIATE RANDOM SAMPLING MODEL views the pairs $(X_i, Y_i)$ as joint random variables, sampled from a popl’n of pairs with means $\mu_X$, $\mu_Y$, SD’s $\sigma_X$, $\sigma_Y$, and a population correlation parameter, $\rho$.
In this model, $-1 \le \rho \le 1$ such that $\rho$ measures the level of (linear) dependence between X and Y:
• $\rho$ near $\pm 1$ ⇒ X & Y highly dependent/related
• $\rho = 0$ ⇒ X & Y uncorrelated (and independent if the pairs are bivariate normal)

Sample Correlation
To estimate $\rho$ we use the SAMPLE CORRELATION COEFFICIENT
$r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Computing formula:
$r = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]}}$

Properties of $r$
Properties of the sample correlation coefficient, $r$:
• $r = \pm\sqrt{r^2}$
• as $n \to \infty$, $E[r] \approx \rho$
• related to LS regression coeffs.: $b_1 = r\, S_Y / S_X$
• test of $H_0$: $\beta_1 = 0$ numerically equivalent to test of $H_0$: $\rho = 0$; use
$t_s = \dfrac{b_1}{SE(b_1)} = r\sqrt{\dfrac{n-2}{1 - r^2}}$
• see plotted illustrations in Fig. 12.15

Figure 12.15

Example 12.27
Ex. 12.27 (from Ex. 12.19): Is calcium in blood related to blood pressure?
Y = calcium conc. in blood platelets
X = b.p. (avg. of diastolic & systolic)

Example 12.27 (cont’d)
We are told that
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 2792.5$, $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 2397.5$ and $\sum_{i=1}^{n} (y_i - \bar{y})^2 = 9562.97$
So,
$r = \dfrac{2792.5}{\sqrt{(2397.5)(9562.97)}} = 0.5832$
estimates the population correlation $\rho$.
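A quick Python check of this calculation, using the three sums quoted above; the second call applies the same (illustratively named) function to the snake sums from Ex. 12.22 to recover $r^2$.

```python
import math

def sample_correlation(sxy, sxx, syy):
    """r = sum of cross-products / sqrt(product of the two sums of squares)."""
    return sxy / math.sqrt(sxx * syy)

# Calcium / blood pressure data (Ex. 12.27)
r = sample_correlation(sxy=2792.5, sxx=2397.5, syy=9562.97)
print(round(r, 4))

# Snake data (Ex. 12.22): squaring r gives the coefficient of determination
r_snake = sample_correlation(sxy=1237, sxx=172, syy=9990)
print(round(r_snake ** 2, 4))
```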

Regression vs. Correlation
The best way to contrast regression and correlation is to:
• use (conditional) regression analysis when prediction of Y from X is desired, but
• use correlation analysis when association between Y and X is under study.

Bivariate Normal Model
We can build $1 - \alpha$ conf. intervals on $\rho$ if we extend the model to include bivariate normality.
Assume $Y \sim N(\mu_Y, \sigma_Y^2)$, $X \sim N(\mu_X, \sigma_X^2)$, with $\text{Corr}(X,Y) = \rho$.
Unfortunately, there is no easy way to build good intervals directly on $\rho$. Instead, we transform between different scales for $\rho$.

Fisher Z-Transform
DEF’N: The FISHER Z-TRANSFORM is
$Z(r) = \dfrac{1}{2} \ln\left(\dfrac{1 + r}{1 - r}\right)$
with INVERSE Z-TRANSFORM
$r = \dfrac{e^{2Z} - 1}{e^{2Z} + 1}$
Under the bivariate normal model, approximately,
$Z(r) \sim N\left(Z(\rho), \dfrac{1}{n-3}\right)$

Confidence Interval on $\rho$
Using the Z-transform, we can build a conf. interval on $Z(\rho)$:
$Z(r) \pm z_{\alpha/2}\sqrt{\dfrac{1}{n-3}}$
Then, invert this into a $1 - \alpha$ conf. intv’l on $\rho$:
$\dfrac{\exp\left\{2\left[Z(r) - z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} - 1}{\exp\left\{2\left[Z(r) - z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} + 1} < \rho < \dfrac{\exp\left\{2\left[Z(r) + z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} - 1}{\exp\left\{2\left[Z(r) + z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} + 1}$

Example 12.30
Ex. 12.30 (12.27 cont’d): Calcium/b.p. data. n = 38 with $r = 0.5832$. The Z-transform gives
$Z(.5832) = \dfrac{1}{2}\ln\left(\dfrac{1.5832}{0.4168}\right) = \dfrac{1}{2}\ln(3.7985) = \dfrac{1.3346}{2} = 0.6673$
So, a 95% conf. interval on $Z(\rho)$ is
$0.6673 \pm z_{.025}\sqrt{\dfrac{1}{35}} = 0.6673 \pm (1.96)(0.169) = 0.6673 \pm 0.3313$, or $.3360 < Z(\rho) < .9986$.
Don’t stop here!

Example 12.30 – Conf. Limits
Now apply the inverse Z-transform:
lower limit on $\rho$ is $\dfrac{e^{2(.3360)} - 1}{e^{2(.3360)} + 1} = \dfrac{1.958 - 1}{1.958 + 1} = \dfrac{0.958}{2.958} = 0.32$
upper limit on $\rho$ is $\dfrac{e^{2(.9986)} - 1}{e^{2(.9986)} + 1} = \dfrac{7.368 - 1}{7.368 + 1} = \dfrac{6.368}{8.368} = 0.76$
So, report $0.32 < \rho < 0.76$.
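The whole Fisher-Z calculation for Ex. 12.30 fits in a few lines of Python, since `math.atanh` is the Z-transform and `math.tanh` is its inverse. The function name and return layout below are illustrative assumptions; 1.96 is the usual $z_{.025}$ value used above.

```python
import math

def fisher_z_interval(r, n, z_crit=1.96):
    """Conf. interval on rho via the Fisher Z-transform (assumes bivariate normality)."""
    z = math.atanh(r)                        # same as 0.5 * ln((1 + r) / (1 - r))
    half_width = z_crit / math.sqrt(n - 3)   # z_{alpha/2} * sqrt(1 / (n - 3))
    z_lo, z_hi = z - half_width, z + half_width
    # tanh is the inverse Z-transform: (e^{2Z} - 1) / (e^{2Z} + 1)
    return z, (z_lo, z_hi), (math.tanh(z_lo), math.tanh(z_hi))

# Calcium / blood pressure data: r = 0.5832, n = 38
z, z_limits, rho_limits = fisher_z_interval(r=0.5832, n=38)
print(round(z, 4))
print([round(v, 4) for v in z_limits])
print([round(v, 2) for v in rho_limits])
```

Note that the interval on $\rho$ is not symmetric about $r = 0.5832$; that asymmetry is a consequence of back-transforming from the Z scale.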

Notes on $r$
Some final notes on $r$:
• Always plot the data! Why? Because $r$ is VERY sensitive to extreme observations and outliers (see Fig. 12.19), so BE CAREFUL!
• $r$ is also known as the Pearson Product-Moment Correlation Coefficient.
• A distribution-free version of $r$ exists, known as Spearman’s Rank Correlation Coefficient.

Figure 12.19