
Elementary Statistics for the Biological and Life Sciences

STAT 205

University of South Carolina
Columbia, SC

© 2010, University of South Carolina. All rights reserved, except where previous rights exist. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photoreproduction, recording, or scanning) without the prior written consent of the University of South Carolina.


Chapter 9: Inferences for Paired Samples

Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Independence Violations

In some settings, the independence between samples in the 2-sample t-test is violated, invalidating the methods used in Chapter 7.

Secs. 7.9–7.10 go into more detail on model violations.

One special case where we can provide a solution is that of PAIRED DATA.

Paired Data

Suppose the effect of some treatment or stimulus is studied on the same subjects (say, “before”–“after”, “right”–“left”, etc.).

Independence is clearly violated!

But (!), since the data are so clearly “paired,” the differences

$$d = Y_1 - Y_2$$

can still inform us about the treatment effect.

Paired Data Model

Suppose $Y_{i1}$ ~ i.i.d. $N(\mu_1, \sigma_1^2)$ is paired with $Y_{i2}$ ~ i.i.d. $N(\mu_2, \sigma_2^2)$ at each i = 1, …, n.

Then, for $d_i = Y_{i1} - Y_{i2}$, we know from Rule E1 in Ch. 3 that

$$\mu_d = E\{d_i\} = E\{Y_{i1} - Y_{i2}\} = E\{Y_{i1}\} - E\{Y_{i2}\} = \mu_1 - \mu_2.$$

In fact, under this model $d_i$ ~ i.i.d. $N(\mu_d, \sigma_d^2)$ ($\sigma_d^2$ is a complicated function of the model parameters).

Sample Mean Difference

If $d_i$ ~ i.i.d. $N(\mu_d, \sigma_d^2)$, i = 1, …, n, then

$$\bar{d} \sim N(\mu_d, \sigma_d^2/n),$$

so we can make inferences on $\mu_d$ using our previous application of the t-distribution:

$$\frac{\bar{d} - \mu_d}{SE(\bar{d})} \sim t(n-1)$$

where

$$SE(\bar{d}) = \frac{S_d}{\sqrt{n}}, \qquad S_d^2 = \frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2.$$

Conf. Interval on $\mu_d$

Using the t-distribution feature for $(\bar{d} - \mu_d)/SE(\bar{d})$ yields our typical form of confidence interval on $\mu_d$:

$$\bar{d} \pm t_{\alpha/2}\,SE(\bar{d})$$

where df = n – 1 = (# pairs) – 1.

Example 9.3

Ex. 9.3: $Y_1$ = wt. loss after appetite inhibitor; $Y_2$ = wt. loss in same woman after placebo. (Data table not reproduced here.)

Example 9.3 – Conf. Interval

We have df = n – 1 = 9 – 1 = 8, so for a 95% conf. interval on $\mu_d$, we employ $t_{.025}$ = 2.306 (from Table 4).

The 95% conf. interval is

$$\bar{d} \pm t_{.025}\,SE(\bar{d}) = \bar{d} \pm t_{.025}\frac{S_d}{\sqrt{n}} = 1.00 \pm (2.306)\frac{0.72}{\sqrt{9}} = 1.00 \pm (2.306)(0.24) = 1.00 \pm 0.55$$

or 0.45 < $\mu_d$ < 1.55 kg.
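The interval arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `paired_t_interval` is hypothetical, the summary values (mean difference 1.00, $S_d$ = 0.72, n = 9) are the ones quoted on the slide, and the critical value 2.306 is typed in from Table 4 rather than computed.

```python
import math

def paired_t_interval(d_bar, s_d, n, t_crit):
    """Return (lower, upper) for d_bar +/- t_crit * s_d / sqrt(n)."""
    se = s_d / math.sqrt(n)      # standard error of the mean difference
    margin = t_crit * se         # half-width of the interval
    return d_bar - margin, d_bar + margin

# Summary values from Example 9.3; t_crit is t_.025 with df = 8 (Table 4).
lower, upper = paired_t_interval(d_bar=1.00, s_d=0.72, n=9, t_crit=2.306)
print(f"95% CI for mu_d: {lower:.2f} to {upper:.2f} kg")
```

Passing the critical value as an argument keeps the sketch free of any statistics library; with a different confidence level only `t_crit` changes.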

Hypothesis Tests on $\mu_d$

To test $H_o: \mu_d = 0$, find

$$t_s = \frac{\bar{d} - 0}{SE(\bar{d})}$$

Then, reject $H_o$ vs.

• $H_A: \mu_d \ne 0$, when $P = 2P\{t(n-1) > |t_s|\} \le \alpha$

• $H_A: \mu_d > 0$, when $P = P\{t(n-1) > t_s\} \le \alpha$

• $H_A: \mu_d < 0$, when $P = P\{t(n-1) < t_s\} \le \alpha$

t-Test Rejection Regions

To test $H_o: \mu_d = 0$ using rejection regions, reject $H_o$ vs.

• $H_A: \mu_d \ne 0$, when $|t_s| \ge t_{\alpha/2}$ (with df = n – 1)

• $H_A: \mu_d > 0$, when $t_s \ge t_{\alpha}$ (with df = n – 1)

• $H_A: \mu_d < 0$, when $t_s \le -t_{\alpha}$ (with df = n – 1)

Example 9.6

Ex. 9.6: $Y_1$ = squirrel dist. to person chasing; $Y_2$ = squirrel dist. to nearest tree (n = 11).

Same squirrel each time, so data are paired. (Data table not reproduced here.)

Example 9.6 (cont’d)

(Note: in Fig. 9.3 we find that $Y_2$ does not appear normal, but the differences $d_i$ do. So, we continue with the t-test.)

Set $\alpha$ = 0.10. Test $H_o: \mu_d = 0$ vs. $H_A: \mu_d \ne 0$.

We find

$$t_s = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{72}{148/\sqrt{11}} = \frac{72}{44.62} = 1.613$$

Apply P-value approach: find $P = 2P\{t(n-1) > |t_s|\} = 2P\{t(10) > 1.613\}$.

Example 9.6 – P-value

From Table 4:
P{t(10) > 1.812} = 0.05
P{t(10) > 1.613} = between 0.05 and 0.10
P{t(10) > 1.372} = 0.10

Doubling the one-tailed bracket gives 0.10 < P < 0.20. Since P > 0.10 = $\alpha$, we fail to reject $H_o$ and conclude there is no significant difference in mean distances.

Can find exact P = 0.1382 via TI-84 or R.
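As a rough check of the bracketing, the tail area can also be computed numerically. The sketch below uses only the Python standard library: it integrates the t density with Simpson's rule (a stand-in chosen here for illustration, not the method used by the TI-84 or R), and the helper names `t_pdf` and `t_upper_tail` are hypothetical. It starts from the rounded summary values on the slide (mean difference 72, $s_d$ = 148, n = 11), so the result lands near 0.138 rather than matching the quoted 0.1382 to every digit.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P{t(df) > t} for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3   # half the area lies above 0 by symmetry

# Rounded summary values from Example 9.6.
d_bar, s_d, n = 72, 148, 11
t_s = d_bar / (s_d / math.sqrt(n))
p_two_sided = 2 * t_upper_tail(abs(t_s), n - 1)
print(f"t_s = {t_s:.3f}, two-sided P = {p_two_sided:.4f}")
```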

More on Paired Design

As n → ∞, the CLT allows use of the t-distribution for paired data, so these inferences are available in large samples.

In small samples, a distribution-free approach is possible (as we’ll see in Sec. 9.4).

Additional features of the paired design are discussed in Sec. 9.3.

Sign Test

In small samples with non-normal paired differences, a distribution-free approach is available, known as the SIGN TEST.

• For paired data $Y_{i1}$, $Y_{i2}$, find $d_i = Y_{i1} - Y_{i2}$ and take $W_i$ = {sign of $d_i$}.

Under $H_o$: no difference between $Y_{i1}$ & $Y_{i2}$, we expect $d_i \approx 0$ such that $P\{W_i > 0\} = P\{W_i < 0\} = 1/2$.

• Ignore any $d_i = 0$. Let $n_d$ = # non-zero $d_i$’s.

Sign Test (cont’d)

To set up sign test:

a) Select $\alpha$.

b) Determine $H_A$ from subject-matter principles. Possibilities are

“directional”:
$H_A$: effect in group 1 > effect in group 2
$H_A$: effect in group 1 < effect in group 2

“non-directional”:
$H_A$: effect in group 1 ≠ effect in group 2

Sign Test Statistic

The test statistic is $B_s$, and it depends on $H_A$. Let $N_+$ = {# $W_i > 0$} and $N_-$ = {# $W_i < 0$}. Then,

$$B_s = \begin{cases} N_+ & \text{if } H_A: Y_1 > Y_2 \\ N_- & \text{if } H_A: Y_1 < Y_2 \\ \max\{N_+, N_-\} & \text{if } H_A: Y_1 \ne Y_2 \end{cases}$$

Reject $H_o$ in favor of $H_A$ when $B_s$ meets or exceeds a critical point from Table 7 (e.g., if $H_A: Y_1 > Y_2$, reject when $B_s \ge b_\alpha$).

(Portion of) Table 7, p. 684

Sign Test P-value

Notice that this is a BInS setting: $B_s$ is the number of “successes” among $n_d$ binary trials where, under $H_o$, P{success} = 1/2.

So, if $H_o$ is true, $B_s \sim \text{Bin}(n_d, 1/2)$. Thus for

$H_A$: effect 1 > effect 2 (where $B_s = N_+$), set $P = P\{\text{Bin}(n_d, 1/2) \ge B_s\}$,

$H_A$: effect 1 < effect 2 (where $B_s = N_-$), set $P = P\{\text{Bin}(n_d, 1/2) \ge B_s\}$,

$H_A$: effect 1 ≠ effect 2 (where $B_s = \max\{N_+, N_-\}$), set $P = 2P\{\text{Bin}(n_d, 1/2) \ge B_s\}$,

and reject $H_o$ when $P \le \alpha$.

Ex. 9.12: Y = skin graft survival (days).
Group 1: HL-antigen compatibility “close”
Group 2: HL-antigen compatibility “poor”

With this small data set, normality is brought into question (and the data have a ‘censored’ feature; see patients #3 and #10). So, a sign test is used.

Set $\alpha$ = 0.05. Take $H_o$: “close” = “poor” vs. $H_A$: “close” > “poor” (since we expect poorer survival in the “poor” group).

In Table 9.7 we see $N_+$ = 9 (and $N_-$ = 2). So take $B_s$ = 9.

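Let B ~ Bin(11, 0.5), so that the P-value here is

$$P = P\{B \ge 9\} = P\{B = 9\} + P\{B = 10\} + P\{B = 11\}$$

$$= {}_{11}C_{9}\left(\tfrac12\right)^{9}\left(\tfrac12\right)^{2} + {}_{11}C_{10}\left(\tfrac12\right)^{10}\left(\tfrac12\right)^{1} + {}_{11}C_{11}\left(\tfrac12\right)^{11}\left(\tfrac12\right)^{0}$$

$$= \left[\frac{11!}{9!\,2!} + \frac{11!}{10!\,1!} + \frac{11!}{11!\,0!}\right]\left(\tfrac12\right)^{11}$$

$$= (55 + 11 + 1)\left(\tfrac12\right)^{11} = \frac{67}{2^{11}} = 0.033.$$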
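The same binomial tail sum can be sketched in Python with the standard library. The function name `sign_test_p` and its `two_sided` flag are illustrative choices, not part of the course materials; capping the doubled tail at 1 is an added safeguard so the two-sided value stays a valid probability.

```python
from math import comb

def sign_test_p(b_s, n_d, two_sided=False):
    """Upper-tail P{Bin(n_d, 1/2) >= b_s}; doubled (capped at 1) if two-sided."""
    tail = sum(comb(n_d, k) for k in range(b_s, n_d + 1)) / 2 ** n_d
    return min(1.0, 2 * tail) if two_sided else tail

# Example 9.12: B_s = N+ = 9 among n_d = 11 non-zero differences, directional H_A.
p = sign_test_p(b_s=9, n_d=11)
print(f"P = {p:.3f}")   # equals 67/2048
```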

Example 9.12 (concluded)

Since P = 0.033 < 0.05 = $\alpha$, we reject $H_o$ and conclude that graft survival is significantly higher in the “close” group.

(Note that the Binomial P-value can be computed via TI-84.)

To use the rejection region approach for these data: reject $H_o$ if $B_s \ge b_{.05}$ = 9 from Table 7. Since $B_s$ = 9 ≥ 9, we still reject $H_o$.

Chapter 10: Categorical Data and Contingency Table Analysis

(Coverage order: Secs. 10.7 → 10.2 → 10.3 → 10.1)

Sec. 10.7: 2-Sample Proportion Data

Returning to the independent (two-)sample case, suppose now the data are from a BInS setting: $Y_1 \sim \text{Bin}(n_1, p_1)$ indep. of $Y_2 \sim \text{Bin}(n_2, p_2)$.

Of interest is the difference $p_1 - p_2$.

A good point estimator for $p_1 - p_2$ is the difference in sample proportions

$$\hat{p}_1 - \hat{p}_2 = \frac{Y_1}{n_1} - \frac{Y_2}{n_2}$$

Conf. Intervals for $p_1 - p_2$

But (!) for building conf. intervals on $p_1 - p_2$ we apply our previous AC strategy and start with

$$\tilde{p}_1 - \tilde{p}_2 = \frac{Y_1 + 1}{n_1 + 2} - \frac{Y_2 + 1}{n_2 + 2}$$

Then, find

$$SE(\tilde{p}_1 - \tilde{p}_2) = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2 + 2}}$$

Agresti-Caffo Conf. Intervals

DEF’N: When $Y_1 \sim \text{Bin}(n_1, p_1)$ indep. of $Y_2 \sim \text{Bin}(n_2, p_2)$, the 95% AGRESTI-CAFFO CONFIDENCE INTERVAL for $p_1 - p_2$ is

$$\tilde{p}_1 - \tilde{p}_2 \pm z_{\alpha/2}\,SE(\tilde{p}_1 - \tilde{p}_2) = \frac{Y_1 + 1}{n_1 + 2} - \frac{Y_2 + 1}{n_2 + 2} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2 + 2}}$$

where at $\alpha$ = 0.05 we use $z_{0.025}$ = 1.96. (Generalizations exist for other values of $\alpha$.)

Ex. 10.37 (from Ex. 10.11 – see below):
$Y_1$ = # patients angina-free after Timolol trt.
$Y_2$ = # patients angina-free after placebo.

Data from Ex. 10.11 (Table 10.4): $Y_1$ = 44 of $n_1$ = 160 (Timolol); $Y_2$ = 19 of $n_2$ = 147 (placebo).

The Agresti-Caffo point estimator is

$$\tilde{p}_1 - \tilde{p}_2 = \frac{44 + 1}{160 + 2} - \frac{19 + 1}{147 + 2} = \frac{45}{162} - \frac{20}{149} = .278 - .134 = .144$$

Associated SE is

$$SE(\tilde{p}_1 - \tilde{p}_2) = \sqrt{\frac{(.278)(.722)}{162} + \frac{(.134)(.866)}{149}}$$

From this the 95% conf. interval is

$$\tilde{p}_1 - \tilde{p}_2 \pm z_{\alpha/2}\,SE(\tilde{p}_1 - \tilde{p}_2) = 0.144 \pm (1.96)\sqrt{\frac{(.278)(.722)}{162} + \frac{(.134)(.866)}{149}} = 0.144 \pm (1.96)(0.0449) = 0.144 \pm 0.088$$

or 0.056 < $p_1 - p_2$ < 0.232.
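A minimal Python sketch of the Agresti-Caffo calculation follows, assuming the counts quoted in the example (44 of 160 and 19 of 147). The function name `agresti_caffo_interval` is hypothetical. Because the code carries full precision instead of rounding intermediate values to three decimals, its lower limit comes out near 0.055 rather than exactly the slide's 0.056.

```python
import math

def agresti_caffo_interval(y1, n1, y2, n2, z=1.96):
    """Return (difference, SE, (lower, upper)) using the add-1/add-2 adjustment."""
    p1 = (y1 + 1) / (n1 + 2)     # adjusted proportion, group 1
    p2 = (y2 + 1) / (n2 + 2)     # adjusted proportion, group 2
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

diff, se, (lower, upper) = agresti_caffo_interval(44, 160, 19, 147)
print(f"diff = {diff:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```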

Sec. 10.2: Testing $p_1$ vs. $p_2$

For testing $H_o: p_1 = p_2$, we introduce a new construction: the contingency table.

DEF’N: A 2×2 CONTINGENCY TABLE is a tabular arrangement of count data representing how the success & failure frequencies relate to an explanatory factor.

For testing $H_o: p_1 = p_2$, the column factor delineates Group 1 vs. Group 2 and the row factor delineates success vs. failure.

2×2 Contingency Table

Basic structure of a 2×2 contingency table:

                 Grp. 1      Grp. 2
# Successes      Y1          Y2
# Failures       n1 – Y1     n2 – Y2
(Col.) Total     n1          n2

Notice that we can read the sample proportions straight from the table:

$$\hat{p}_1 = \frac{Y_1}{n_1}, \qquad \hat{p}_2 = \frac{Y_2}{n_2}$$

Ex. 10.11: (Ex. 10.37, cont’d) Angina expt.

                 Timolol   Placebo   (Row) Tot.
# Angina-free    44        19        63
# Angina         116       128       244
(Col.) Tot.      160       147       307

Of interest is testing whether Angina status is associated with Timolol trt., i.e., do the row and column factors “interact”?

Testing $p_1$ vs. $p_2$

To test $H_o: p_1 = p_2$, there are many available approaches. We employ the contingency table since it can be extended to more than 2 row or column levels (see Sec. 10.5).

The table allows for construction of a statistic that compares the “observed” data against their “expected” values under a pre-specified model, say, the model under $H_o: p_1 = p_2$.

Pearson’s $\chi^2$ Statistic

DEF’N: PEARSON’S $\chi^2$ (CHI-SQUARE) STATISTIC is

$$X_s^2 = \sum \frac{(O - E)^2}{E}$$

$X_s^2$ is sometimes called a “goodness-of-fit” statistic (for reasons explained in Sec. 10.1).

For application in a 2×2 contingency table, the “O” values are the four counts in the table ($Y_1$, $Y_2$, $n_1 - Y_1$, $n_2 - Y_2$), and the “E” values are their expected values under $H_o: p_1 = p_2$.

“E” values

But, what are the “E” values under $H_o: p_1 = p_2$?

Well, if $H_o$ is true, we expect both $Y_1/n_1$ and $Y_2/n_2$ to estimate the same value, say, p.

We can estimate this common p using a weighted (“pooled”) estimator:

$$\hat{p}_{pool} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{n_1(Y_1/n_1) + n_2(Y_2/n_2)}{n_1 + n_2} = \frac{Y_1 + Y_2}{n_1 + n_2}$$

“E” successes

Now, if there are $n_1$ total obsv’ns for Grp. 1, then we “expect” $n_1\hat{p}_{pool}$ of these to be successes. This is

$$n_1\hat{p}_{pool} = \frac{n_1(Y_1 + Y_2)}{n_1 + n_2}$$

Similarly, with $n_2$ total obsv’ns in Grp. 2 we expect $n_2\hat{p}_{pool}$ successes:

$$n_2\hat{p}_{pool} = \frac{n_2(Y_1 + Y_2)}{n_1 + n_2}$$

“E” failures

For the expected # of failures, just subtract the “E” successes from each total, $n_j$:

$$n_1 - n_1\hat{p}_{pool} = n_1\,\frac{n_1 + n_2 - Y_1 - Y_2}{n_1 + n_2} = \frac{n_1(n_1 - Y_1 + n_2 - Y_2)}{n_1 + n_2}$$

$$n_2 - n_2\hat{p}_{pool} = \frac{n_2(n_1 - Y_1 + n_2 - Y_2)}{n_1 + n_2}$$

Expected 2×2 Table

The result is an “expected” 2×2 contingency table:

             Grp. 1                      Grp. 2                      Row tot.
# success    n1(Y1+Y2)/(n1+n2)           n2(Y1+Y2)/(n1+n2)           Y1 + Y2
# failure    n1(n1–Y1+n2–Y2)/(n1+n2)     n2(n1–Y1+n2–Y2)/(n1+n2)     n1–Y1 + n2–Y2
Col. tot.    n1                          n2                          n1 + n2

Notice the similar structure of each “E”:

E = (Row Total)(Col. Total)/(Grand Total)

Examples 10.14-10.15

Exs. 10.14-10.15 (10.11 cont’d): Angina expt.

“O” table was

                 Timolol   Placebo   Row Tot.
# Angina-free    44        19        63
# Angina         116       128       244
Col. Tot.        160       147       307

“E” table is (cf. Table 10.7)

                 Timolol   Placebo   Row Tot.
# Angina-free    32.83     30.17     63
# Angina         127.17    116.83    244
Col. Tot.        160       147       307

e.g., E = (63)(160)/307 = 32.83
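The rule E = (Row Total)(Col. Total)/(Grand Total) is easy to sketch in Python. The helper name `expected_counts` is hypothetical; the observed counts are those of the angina table above, and the code works for any rectangular table of counts.

```python
def expected_counts(observed):
    """Expected counts under no association: row total * column total / grand total."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    grand = sum(row_tot)
    return [[r * c / grand for c in col_tot] for r in row_tot]

# Rows: angina-free, angina; columns: Timolol, placebo.
observed = [[44, 19], [116, 128]]
expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Rounding happens only when printing; the unrounded values are kept for any later calculation, in line with the advice not to round the “E” values.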

2×2 Table

In the 2×2 table of expected counts, note that:

• the “E” values need not be integers, and we do NOT round them;

• the row and column totals do not change (they are designed not to). This is a quick way to double-check the calculations!

$\chi^2(\nu)$ Distribution

To find the P-value, we need the null reference distribution of $X_s^2$.

DEF’N (p. 394): The $\chi^2(\nu)$ DISTRIBUTION with $\nu$ df is the limiting distribution of Pearson’s statistic under $H_o$.

NOTATION: $X_s^2 \sim \chi^2(\nu)$

In the special case of a 2×2 contingency table, $\nu$ = 1.

Properties of $\chi^2(\nu)$

The $\chi^2(\nu)$ dist’n

• is always ≥ 0

• is skewed right

• has integer df’s

• has upper-$\alpha$ critical point $\chi^2_\alpha(\nu)$, given in Table 9 (so P-values must be bracketed)

• is computable in DoStat

(Portion of) Table 9, p. 686

Rejecting $H_o$

So, to test $H_o: p_1 = p_2$ vs. $H_A: p_1 \ne p_2$, set $\alpha$ and find $X_s^2 = \sum (O-E)^2/E$.

P-value approach: find $P = P\{\chi^2(1) \ge X_s^2\}$ via computer or bracket via Table 9, and reject $H_o$ if $P \le \alpha$.

Rejection region approach: find $\chi^2_\alpha(1)$ from Table 9 and reject $H_o$ if $X_s^2 \ge \chi^2_\alpha(1)$.

(Notice: a 1-tailed table look-up for a 2-sided test!)

One-sided testing

To find a one-sided P-value:

• for $H_A: p_1 > p_2$, use $P = \tfrac12 P\{\chi^2(1) \ge X_s^2\}$ if $\hat{p}_1 > \hat{p}_2$ (otherwise, report P > 0.50).

• for $H_A: p_1 < p_2$, use $P = \tfrac12 P\{\chi^2(1) \ge X_s^2\}$ if $\hat{p}_1 < \hat{p}_2$ (otherwise, report P > 0.50).

Ex. 10.16 (10.11 cont’d): Angina expt. Set $\alpha$ = 0.01. From the O and E values computed in Ex. 10.15, we find

$$X_s^2 = \sum \frac{(O - E)^2}{E} = \frac{(44-32.83)^2}{32.83} + \frac{(116-127.17)^2}{127.17} + \frac{(19-30.17)^2}{30.17} + \frac{(128-116.83)^2}{116.83} = 10.0$$

To test $H_o: p_1 = p_2$ vs. $H_A: p_1 \ne p_2$, we bracket $P = P\{\chi^2(1) \ge 10.0\}$ from Table 9:

P{$\chi^2$(1) ≥ 6.63} = 0.01
P{$\chi^2$(1) ≥ 10.0} = between 0.01 and 0.001
P{$\chi^2$(1) ≥ 10.83} = 0.001

So, 0.001 < P < 0.01 (two-sided).

Since P < 0.01 = $\alpha$, we reject $H_o$ and conclude there is a significant difference in angina response after Timolol trt.

Can find exact P = 0.0016 via TI-84/R.

(A one-sided $H_A$ is not unreasonable here, but it’s easy to mess up the P-value, so be careful!)
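For readers without a TI-84 or R at hand, the statistic and its P-value can be sketched with the Python standard library. The function names are hypothetical. The tail formula relies on the fact that a $\chi^2(1)$ variable is the square of a standard normal, so $P\{\chi^2(1) \ge x\} = \text{erfc}(\sqrt{x/2})$; this shortcut is valid for 1 df only. With unrounded “E” values the statistic comes out slightly below 10.0 (about 9.98), and the P-value rounds to 0.0016.

```python
import math

def pearson_chi2(observed):
    """Return (X2, expected) for a table of counts, with E = row * col / grand."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    grand = sum(row_tot)
    expected = [[r * c / grand for c in col_tot] for r in row_tot]
    x2 = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
    return x2, expected

def chi2_1df_upper_tail(x):
    """P{chi2(1) >= x}; uses chi2(1) = Z**2, so it applies to 1 df only."""
    return math.erfc(math.sqrt(x / 2))

x2, _ = pearson_chi2([[44, 19], [116, 128]])
p = chi2_1df_upper_tail(x2)
print(f"X2 = {x2:.2f}, P = {p:.4f}")
```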

Pearson’s $X^2$ for 2×2 Table

When using the Pearson $X^2$ statistic in 2×2 tables, note that:

• $X_s^2 \sim \chi^2(1)$ is only an approximation in finite samples. To be valid, a standard rule-of-thumb is to require E ≥ 1 for every cell, and average E ≥ 5 (i.e., here $\{n_1 + n_2\}/4 \ge 5$).

• This method is antiquated for testing $p_1 = p_2$ (esp. against 1-sided alternatives). A better method is Fisher’s Exact test; see Sec. 10.4.

Sec. 10.3: Testing Association

The layout of the 2×2 table can apply to more than just tests of $p_1 = p_2$.

What if the row factor represents more than just success-vs.-failure? (Not uncommon!)

In this case, we have a single sample with n observations and with two explanatory factors (each having two levels).

General 2×2 Table

General 2×2 table (cf. Table 10.13):

                           column factor
                           level C1   level C2   row tot.
row factor   level R1      a          b          a+b
             level R2      c          d          c+d
             col. tot.     a+c        b+d        n

Notice that n = a + b + c + d.

Natural question: does the column factor affect the row factor, and/or vice versa?

Testing Association

Statistically, asking if the 2 factors interrelate is an issue of “association”:

• $H_o$: there is no association between the row and column factors

• $H_A$: there is some association between the row and column factors

(An older term for “no association” is “independence,” but don’t confuse this with statistical independence from Chap. 3.)

Pearson’s $X^2$ for Association

We can test the association hypotheses using Pearson’s statistic, $X_s^2 = \sum (O-E)^2/E$.

Here, the “O” terms are just a, b, c, and d.

The “E” terms are calculated as in Sec. 10.2:

$$E = \frac{(\text{Row Total})(\text{Col. Total})}{\text{Grand Total}}$$

For instance, in the (C1,R1) cell we have $E_{11} = (a+b)(a+c)/n$, etc.

Rejecting $H_o$

As in Sec. 10.2, $X_s^2 \sim \chi^2(1)$ under $H_o$, so for fixed $\alpha$, reject $H_o$ as follows:

P-value approach: find $P = P\{\chi^2(1) \ge X_s^2\}$ via computer or bracket via Table 9, and reject $H_o$ if $P \le \alpha$.

Rejection region approach: find $\chi^2_\alpha(1)$ from Table 9 and reject $H_o$ if $X_s^2 \ge \chi^2_\alpha(1)$.

(Again: a 1-tailed table look-up for a 2-sided test.)

Ex. 10.21: Hair color & eye color in n = 6800 German males.

“O” values in Table 10.11 (table not reproduced here).

We find the “E” values as:

              Dark hair   Light hair   row total
Dark eye      485.84      371.16       857
Light eye     3,369.16    2,573.84     5,943
col. total    3,855       2,945        6,800

Set $\alpha$ = 0.05. Of interest is testing whether a significant association exists between hair color and eye color.

Example 10.21 – $X^2$ Statistic

Given the O’s and the E’s, the test statistic is

$$X_s^2 = \sum \frac{(O - E)^2}{E} = \frac{(726 - 485.84)^2}{485.84} + \cdots + \frac{(2814 - 2573.84)^2}{2573.84} = 313.63$$

The P-value is $P = P\{\chi^2(1) \ge 313.63\}$.

From Table 9:

P{$\chi^2$(1) ≥ 15.14} = 0.0001
P{$\chi^2$(1) ≥ 313.63} = below 0.0001

So, P < 0.0001 < 0.05 = $\alpha$: we reject $H_o$ and conclude that a significant association exists between hair color and eye color in these males.

Can find P < 0.0001 via TI-84/R.
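The “O” table itself is not shown in this transcript, but two observed cells (726 and 2814) and all the margins are quoted, so the other two cells can be recovered by subtraction. The Python sketch below does that as an assumption, not as a copy of Table 10.11, and then recomputes the statistic; the variable names are illustrative, and the erfc tail shortcut applies to 1 df only.

```python
import math

# Quoted in the example: two observed cells and the table margins.
o11, o22 = 726, 2814                 # dark eye/dark hair, light eye/light hair
row_tot = (857, 5943)                # dark eye, light eye
col_tot = (3855, 2945)               # dark hair, light hair
n = 6800

# Assumption: the two unquoted cells follow from the margins by subtraction.
o12 = row_tot[0] - o11               # dark eye, light hair
o21 = col_tot[0] - o11               # light eye, dark hair
assert o12 + o22 == col_tot[1] and o21 + o22 == row_tot[1]

observed = [[o11, o12], [o21, o22]]
x2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_tot[i] * col_tot[j] / n          # E = row * col / grand
        x2 += (observed[i][j] - e) ** 2 / e

p = math.erfc(math.sqrt(x2 / 2))     # P{chi2(1) >= x2}, valid for 1 df only
print(f"O = {observed}, X2 = {x2:.2f}, P < 0.0001: {p < 0.0001}")
```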

Notes on the $\chi^2$ Test

Some notes:

• $X_s^2 \sim \chi^2(1)$ is only an approximation in finite samples. The rule-of-thumb E ≥ 1 for every cell, and average E ≥ 5 (here, n/4 ≥ 5), still applies.

• By contrast, Ex. 10.21 illustrates that $X_s^2$ is very sensitive when n is large.

• The 2×2 table structure allows for a variety of “conditional” probability descriptions of the data; see pp. 413-416.

Notes on the $\chi^2$ Test (cont’d)

• We can extend the 2×2 table into an r×c CONTINGENCY TABLE for cases when more than 2 levels exist for either factor. $X_s^2$ is still useful here; see Sec. 10.5.

• Many variants exist of Pearson’s $X_s^2$. One seen often is the statistic $2\sum\sum O\,\ln\{O/E\}$, known as the ‘Likelihood-ratio test,’ ‘LR test,’ or ‘G test.’ While this has useful properties, it usually performs worse than $X_s^2$ for contingency tables and so is NOT recommended.

Phi-Divergence Statistic

An alternative competitor to Pearson’s $X_s^2$ for r×c tables that can be recommended is known as the PHI-DIVERGENCE STATISTIC:

$$C_s^2 = \frac{8}{3}\sum\sum O\left\{\sqrt{O/E} - 1\right\}$$

In the 2×2 case, $C_s^2 \sim \chi^2(1)$ under $H_o$ and so is used in the same fashion as $X_s^2$. (It can also be extended to the r×c case.)

Sec. 10.1: Goodness-of-Fit

Pearson’s original idea was to use $X_s^2$ to assess divergence in the “O”s against a modeled value for “E”.

Any model could be proposed, not just for r×c tables. In this sense, $X_s^2$ measures the goodness-of-fit of the model for “E”.

Example 10.1

Ex. 10.1: In genetics, we believe that offspring characters appear in regular “ratios.” E.g., in snapdragons, a cross of two pink (hybrid) parents produces offspring with

$$P\{\text{Red}\} = \tfrac14, \quad P\{\text{Pink}\} = \tfrac12, \quad P\{\text{White}\} = \tfrac14$$

(i.e., “1:2:1”). Or, so we think!

Can this model be supported by data? (We’ll see, later…)

Testing Goodness-of-Fit

To use $X_s^2$ to test a model’s goodness-of-fit:

• Set $\alpha$.

• Designate K > 1 categories that the model can predict (e.g., Red/Pink/White ⇒ K = 3).

• Collect data (the $O_k$’s) from a sample of size n.

• Determine the $E_k$’s for each kth category using the model’s predictions.

• Calculate $X_s^2 = \sum (O_k - E_k)^2/E_k$.

Testing Goodness-of-Fit (cont’d)

• The pertinent hypotheses are
$H_o$: model fit is adequate vs. $H_A$: model fit is poor

• Under $H_o$, $X_s^2 \sim \chi^2(K-1)$, so the P-value is $P = P\{\chi^2(K-1) \ge X_s^2\}$. Reject $H_o$ if P ≤ α.

• Or, use the Rejection Region approach: reject $H_o$ if $X_s^2 \ge \chi^2_{\alpha}(K-1)$.

Examples 10.4–10.5

Exs. 10.4–10.5 (10.1 cont’d): Set α = 0.10. Suppose a sample of n = 234 snapdragons crossed from pink parents yields:

Color      Red    Pink   White
Observed   54     122    58
Expected   58.5   117    58.5

$E_R$ = n P(Red) = (234)(0.25) = 58.5

$E_P$ = n P(Pink) = (234)(0.5) = 117

$E_W$ = n P(White) = (234)(0.25) = 58.5

Examples 10.4–10.5 (cont’d)

Test the goodness-of-fit of the (Mendelian) hypotheses:

$H_o$: model fit to Mendelian ratios is adequate

vs.

$H_A$: model fit to Mendelian ratios is poor

Example 10.5 – $X^2$ Statistic

We calculate

$$X_s^2 = \sum_{k=1}^{3}\frac{(O_k - E_k)^2}{E_k} = \frac{(54 - 58.5)^2}{58.5} + \frac{(122 - 117)^2}{117} + \frac{(58 - 58.5)^2}{58.5} = 0.56$$

Reject $H_o$ if $P = P\{\chi^2(K-1) \ge X_s^2\} = P\{\chi^2(2) \ge 0.56\}$ is less than or equal to α = 0.10.

From Table 9:

$P\{\chi^2(2) \ge 3.22\} = 0.20$

$P\{\chi^2(2) \ge 0.56\}$ = above 0.20

So, P > 0.20 > 0.10 = α ⇒ we fail to reject $H_o$ and conclude the model fit appears adequate.

Can find exact P = 0.756 via TI-84/R.
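The calculation in Exs. 10.4–10.5 can be sketched in a few lines of Python. This is only an illustration, not the TI-84/R route the slides mention; the function name `chi_square_gof` is made up for this sketch. For df = 2 (and only then) the χ² upper-tail probability has the closed form exp(–x/2), so no statistics library is needed. The unrounded statistic gives a P-value slightly different from the 0.756 quoted above, which comes from the rounded value 0.56.

```python
import math

def chi_square_gof(observed, probs):
    """Return (X^2 statistic, expected counts) for a goodness-of-fit test.

    observed: list of category counts O_k
    probs:    model probabilities for the same categories (should sum to 1)
    """
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

observed = [54, 122, 58]        # Red, Pink, White counts from the example
mendel = [0.25, 0.50, 0.25]     # the 1:2:1 model
x2, expected = chi_square_gof(observed, mendel)

# Closed form valid ONLY for df = 2: P{chi^2(2) >= x} = exp(-x/2)
p_value = math.exp(-x2 / 2)

alpha = 0.10
print("X^2 =", round(x2, 2), " expected =", expected)
print("reject H_o" if p_value <= alpha else "fail to reject H_o")
```

For K ≠ 3 categories the closed form no longer applies, and a table or a statistics package would be needed for the P-value.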

Caveats

Some warnings: Goodness-of-fit tests require

• categorical data (i.e., counts, not continuous measurements)

• large n

• objectively defined categories

So, they cannot be applied haphazardly!

Chapter 12: Linear Regression and Correlation

Predictor Variables

In Chap. 10 we introduced the idea that a (categorical) response, Y, could depend on levels of an external variable.

Why not extend this idea to when Y is a continuous (normal) measurement?

We say Y is a RESPONSE VARIABLE, dependent upon an explanatory PREDICTOR VARIABLE, X.

Simple Linear Model

DEF’N: The SIMPLE LINEAR MODEL relating Y and X is

$$Y = b_0 + b_1 X.$$

• $b_0$ is the Y-INTERCEPT of the model, the point where the line crosses the Y-axis.

• $b_1$ is the SLOPE of the model, the change in Y for a given unit change in X (“rise” over “run”).

[Figure: the line $Y = b_0 + b_1 X$, where a run of ∆X = 1 gives a rise of ∆Y = $b_1$.]

Linear Regression

DEF’N: The LEAST SQUARES (LS) REGRESSION LINE is a data-dependent fit of a linear model. It has coefficients

$$\text{(slope)}\quad b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\text{(intercept)}\quad b_0 = \bar{y} - b_1\bar{x}$$
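As a minimal sketch of the LS formulas, the slope and intercept can be computed directly from raw $(x_i, y_i)$ pairs. The function name `least_squares` and the toy data below are hypothetical (the points are chosen to lie exactly on y = 1 + 2x so the answer is easy to check by hand); they are not the snake data.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through the pairs (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares(xs, ys)
print(b0, b1)
```

With real data the points will not fall on a line, and the same function returns the line that minimizes the sum of squared vertical departures.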

Ex. 12.3: Y = snake weight (g), X = snake length (cm)

Notice that the data appear as $(x_i, y_i)$ pairs.

Ex. 12.4 (12.3 cont’d): Snake data. The scatterplot shows a clear linear relation.

Table 12.3 summarizes the LS calculations.

Example 12.4 – LS Coefficients

From Table 12.3 we see

$$\bar{x} = 63,\quad \bar{y} = 152,\quad \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 1237 \quad\text{and}\quad \sum_{i=1}^{n}(x_i - \bar{x})^2 = 172$$

so that the LS coefficients are

$b_1$ = 1237/172 = 7.192 and

$b_0$ = 152 – (7.192)(63) = –301.096.

Thus, the LS line is –301.096 + 7.192X. Note that these operations are available in TI-84/R.

Example 12.4 – Interpretations

Interpretation of LS coefficients:

• $b_1$ = 7.192 indicates that a 1 cm increase in snake length leads to an estimated 7.192 g increase in snake weight.

• $b_0$ is the estimated weight of a snake whose length is 0 cm.

Clearly, the latter is a poor EXTRAPOLATION (p. 546) away from the bulk of the data.

Indeed, would we ever see Y < 0?

Residuals

DEF’N: A PREDICTED VALUE (a.k.a. a FITTED VALUE) is an estimate of $y_i$ based on a prediction/fitted regression equation, $b_0 + b_1 x_i$.

NOTATION: $\hat{y}_i = b_0 + b_1 x_i$

DEF’N: A RESIDUAL is the departure of an observed $y_i$ from its fitted value: $\text{Resid}_i = y_i - \hat{y}_i$

See Figure 12.6.

Figure 12.6

SS(Resid.)

DEF’N: The RESIDUAL SUM OF SQUARES (a.k.a. SUM OF SQUARED ERRORS, or SSE), is

$$\text{SS(Resid.)} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

DEF’N: The LEAST SQUARES CRITERION states that the optimal fit of a model to data occurs when SS(Resid.) is minimized.

Notice that under our linear model,

$$\text{SS(Resid.)} = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2.$$

Ex. 12.5 (12.3 cont’d): Snake data. Table 12.4 shows the calculations that lead to SS(Resid.).

Std. Deviation

We can use SS(Resid.) to update the accuracy of our measure of variability. Recall that to estimate the variation of Y we used the SD

$$SD = S_Y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}$$

But, this ignores any effect X has on Y. Since SS(Resid.) incorporates the effect of X, it serves as a basis for more accurate estimates of variation.

Residual SD

DEF’N: The RESIDUAL STANDARD DEVIATION from an LS fit is

$$S_{Y|X} = \sqrt{\frac{\text{SS(Resid.)}}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}}$$

Notice in $S_{Y|X}$ that the df have changed from n – 1 to n – 2, now “incorporating” the fitting of 2 model parameters, $b_0$ & $b_1$.

Ex. 12.6 (12.3 cont’d): Snake data. With n = 9 data pairs, we found SS(Resid.) = 1093.66. Thus

$$S_{Y|X} = \sqrt{\frac{1093.66}{7}} = \sqrt{156.238} = 12.499 \text{ g}$$

Compare this with the larger SD

$$S_Y = \sqrt{\frac{9990}{9 - 1}} = \sqrt{1248.75} = 35.338 \text{ g}$$

for these data.
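The comparison in Ex. 12.6 can be reproduced from the two sums of squares quoted on the slides. This is a small sketch with illustrative variable names; the only inputs are n = 9, SS(Resid.) = 1093.66, and $\sum (y_i - \bar{y})^2$ = 9990.

```python
import math

n = 9                 # number of (x, y) pairs in the snake data
ss_resid = 1093.66    # residual sum of squares, SS(Resid.)
ss_y = 9990           # total sum of squares of Y about its mean

# Residual SD uses n - 2 df (two fitted parameters, b0 and b1)
s_y_given_x = math.sqrt(ss_resid / (n - 2))

# Ordinary SD of Y uses n - 1 df and ignores X entirely
s_y = math.sqrt(ss_y / (n - 1))

print(round(s_y_given_x, 3), round(s_y, 3))
```

The residual SD is roughly a third of the ordinary SD here, which is one way to see how much of the variation in weight the length variable accounts for.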

The Linear Statistical Model

To perform inferences in a linear regression, we need a statistical model. We start with:

DEF’N: A CONDITIONAL MEAN is the expected value of a variable, Y, conditional on another variable, X.

NOTATION: $\mu_{Y|X}$

DEF’N: A CONDITIONAL STD. DEVIATION is the SD of a variable, Y, conditional on another variable, X.

NOTATION: $\sigma_{Y|X}$

Linear (Regression) Model

DEF’N: The LINEAR (REGRESSION) MODEL of Y on X assumes

$$Y = \mu_{Y|X} + \varepsilon,$$

where the conditional mean is linear:

$$\mu_{Y|X} = \beta_0 + \beta_1 X$$

and ε is a random error term with

$$\mu_{\varepsilon} = 0 \quad\text{and}\quad \sigma_{\varepsilon} = \sigma_{Y|X}$$

Linear Prediction

In the linear regression model, we use the LS coefficients, $b_0$ & $b_1$, to estimate $\beta_0$ & $\beta_1$, and $S_{Y|X}$ to estimate $\sigma_{Y|X}$.

Thus, in principle we could estimate (or “predict”) $\mu_{Y|X}$ at any X = x via

$$\hat{\mu}_{Y|X=x} = b_0 + b_1 x$$

Careful: when making predictions on $\mu_{Y|X}$ outside the range of x, the “extrapolation” can be very poor!

Ex. 12.12 (12.3 cont’d): For the snake data, we saw $b_0$ = –301.096 and $b_1$ = 7.192 (and $S_{Y|X}$ = 12.499).

If, e.g., we wished to predict the weight of a snake x = 68 cm long, we would use

$$\hat{\mu}_{Y|X=68} = -301.096 + (7.192)(68) = 187.96 \text{ g}$$
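A prediction helper can also carry the extrapolation warning with it. The sketch below is one possible design, not something from the slides: `predict_mean` is a made-up name, and the length range (54, 69) cm is an assumed range for illustration only, since the observed lengths are not listed in these slides.

```python
def predict_mean(x, b0, b1, x_range=None):
    """Estimate mu_{Y|X} at X = x.

    Returns (estimate, extrapolating), where extrapolating is True when
    x_range is supplied and x falls outside it.
    """
    extrapolating = x_range is not None and not (x_range[0] <= x <= x_range[1])
    return b0 + b1 * x, extrapolating

b0, b1 = -301.096, 7.192          # LS coefficients from the snake example
assumed_range = (54, 69)          # ASSUMED range of observed lengths (cm)

weight_68, flag_68 = predict_mean(68, b0, b1, x_range=assumed_range)
weight_0, flag_0 = predict_mean(0, b0, b1, x_range=assumed_range)

print(round(weight_68, 2), flag_68)   # inside the assumed range
print(round(weight_0, 2), flag_0)     # far outside it: a negative "weight"
```

The second call reproduces the earlier point about $b_0$: at a length of 0 cm the line returns a negative weight, which is exactly the kind of result the flag is meant to catch.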

Normal Error Model

$\beta_1$ indicates how Y changes with (unit) increases in X, and thus has important biological interest.

To make inferences on $\beta_1$ we update the linear statistical model:

$$Y = \mu_{Y|X} + \varepsilon$$

$$\mu_{Y|X} = \beta_0 + \beta_1 X \qquad\qquad \varepsilon \sim N(0, \sigma_{Y|X}^2)$$

(conditional mean is linear) (conditional SD is constant)

Confidence Interval for $\beta_1$

Under the normal error model, $b_1$ is unbiased for $\beta_1$, with

$$SE(b_1) = \frac{S_{Y|X}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} = \frac{S_{Y|X}}{\sqrt{\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2}}$$

Using this, a 1 – α conf. interval for $\beta_1$ is

$$b_1 \pm t_{\alpha/2}\,SE(b_1)$$

where $t_{\alpha/2}$ has df = n–2 (same as $S_{Y|X}$).

Ex. 12.18 (12.3 cont’d). For the Snake Data (n = 9), we had $b_1$ = 7.19 and $S_{Y|X}$ = 12.499. For a 95% conf. interval on $\beta_1$, we need $t_{.05/2} = t_{.025}$ = 2.365 (df = 9–2 = 7). Also, from Table 12.3,

$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = 172$$

The 95% interval is then

$$b_1 \pm t_{.025}\,SE(b_1) = 7.19 \pm (2.365)\frac{12.499}{\sqrt{172}} = 7.19 \pm (2.365)(0.953) = 7.19 \pm 2.25$$
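The interval arithmetic above can be checked with a short sketch. The function name is illustrative, and the critical value 2.365 is taken from the slide (a t table lookup) rather than computed, since the Python standard library has no t quantile function.

```python
import math

def slope_confidence_interval(b1, s_y_given_x, sxx, t_crit):
    """Return (SE(b1), (lower, upper)) for b1 +/- t_crit * SE(b1).

    sxx is the sum of squared deviations of x about its mean.
    """
    se_b1 = s_y_given_x / math.sqrt(sxx)
    margin = t_crit * se_b1
    return se_b1, (b1 - margin, b1 + margin)

# Snake data values from the slides; t_.025 with df = 7 is 2.365
se_b1, (lower, upper) = slope_confidence_interval(7.19, 12.499, 172, 2.365)
print(round(se_b1, 3), round(lower, 2), round(upper, 2))
```

Written as limits, 7.19 ± 2.25 corresponds to roughly 4.94 < $\beta_1$ < 9.44 g per cm, an interval that excludes 0.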

Testing $\beta_1$

• Similarly, we can use the t-dist’n to test $H_o$: $\beta_1$ = 0. (Why 0? At $\beta_1$ = 0, there is no effect of X on Y.)

• The test statistic is

$$t_s = \frac{b_1 - 0}{SE(b_1)}$$

• Under $H_o$, this has $t_s \sim t(n-2)$.

We can use either the P-value approach or the rejection region approach.

P-values for Testing $\beta_1$

To test $H_o$: $\beta_1$ = 0 vs.

• $H_A$: $\beta_1$ ≠ 0, reject $H_o$ when P = 2P{t(n–2) > |$t_s$|} ≤ α

• $H_A$: $\beta_1$ > 0, reject $H_o$ when P = P{t(n–2) > $t_s$} ≤ α

• $H_A$: $\beta_1$ < 0, reject $H_o$ when P = P{t(n–2) < $t_s$} ≤ α

Rejection Regions for Testing $\beta_1$

To test $H_o$: $\beta_1$ = 0 using rejection regions, if:

• $H_A$: $\beta_1$ ≠ 0, reject $H_o$ when |$t_s$| ≥ $t_{\alpha/2}$ (with df = n–2)

• $H_A$: $\beta_1$ > 0, reject $H_o$ when $t_s$ ≥ $t_{\alpha}$ (with df = n–2)

• $H_A$: $\beta_1$ < 0, reject $H_o$ when $t_s$ ≤ –$t_{\alpha}$ (with df = n–2)

Ex. 12.18 (cont’d): For the Snake Data, set α = 0.05. A natural alternative to $H_o$: $\beta_1$ = 0 is $H_A$: $\beta_1$ > 0 (why?), so if the linear regression model is valid, we find

$$t_s = \frac{b_1}{SE(b_1)} = \frac{7.19}{12.499/\sqrt{172}} = \frac{7.19}{0.953} = 7.545$$

with df = n–2 = 9–2 = 7.

For the P-value, use Table 4:

P{t(7) ≥ 7.545} = below 0.0005

P{t(7) ≥ 5.408} = 0.0005

So, P < 0.0005 < 0.05 = α ⇒ we reject $H_o$ and conclude that mean snake weight increases significantly with increasing snake length.

Can find P = 0.00007 via TI-84/R.
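The same test can be run with the rejection region approach. In this sketch the critical value $t_{.05}$ = 1.895 for df = 7 is a standard t-table value supplied here as an assumption (the slides quote only the P-value route), and the function name is illustrative.

```python
import math

def slope_t_statistic(b1, s_y_given_x, sxx):
    """Return t_s = (b1 - 0) / SE(b1) for testing H_o: beta_1 = 0."""
    se_b1 = s_y_given_x / math.sqrt(sxx)
    return b1 / se_b1

t_s = slope_t_statistic(7.19, 12.499, 172)

# ASSUMPTION: upper 5% point of t(7), read from a standard t table
t_crit = 1.895

# One-sided alternative H_A: beta_1 > 0, so reject when t_s >= t_crit
reject_h0 = t_s >= t_crit
print(round(t_s, 2), reject_h0)
```

Both approaches lead to the same decision, as they must: the statistic is far beyond the critical value, matching the very small P-value.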

DEF’N: The COEFFICIENT OF DETERMINATION is

$$r^2 = \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

Properties of $r^2$:

• 0 ≤ $r^2$ ≤ 1

• Interpret as % variation in Y that is explained by variation in X

• BADLY over-used

Ex. 12.22 (12.3 cont’d): Snake data. From Table 12.3 we can find

$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 1237,\quad \sum_{i=1}^{n}(x_i - \bar{x})^2 = 172 \quad\text{and}\quad \sum_{i=1}^{n}(y_i - \bar{y})^2 = 9990$$

so $r^2 = (1237)^2/[(172)(9990)] = 0.8905$ (⇒ 89% of variation in snake weight is explained by variation in snake length).
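The “% variation explained” reading can be made concrete with the numbers already on the slides. The sketch below computes $r^2$ from the three sums and then again as 1 – SS(Resid.)/$\sum (y_i - \bar{y})^2$, using SS(Resid.) = 1093.66 from Ex. 12.6. The second form is a standard identity for the LS line rather than something stated on the slides, and the variable names are illustrative.

```python
sxy = 1237        # sum of cross-products of deviations
sxx = 172         # sum of squared x deviations
syy = 9990        # sum of squared y deviations
ss_resid = 1093.66  # residual sum of squares quoted in Ex. 12.6

# Definition from the slide
r_squared = sxy ** 2 / (sxx * syy)

# Same quantity as the share of Y's variation NOT left in the residuals
r_squared_from_resid = 1 - ss_resid / syy

print(round(r_squared, 4), round(r_squared_from_resid, 4))
```

The two values agree up to rounding in the quoted SS(Resid.), which is why $r^2$ is read as the proportion of variation in Y accounted for by the fitted line.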

Random Predictors

The linear regression model with $\mu_{Y|X} = \beta_0 + \beta_1 X$ is conditional on X. This is an important distinction.

If X is itself random (which is not uncommon in biology), inferences on $\beta_1$ and/or prediction on $\mu_Y$ are invalid (uninterpretable, really) unless we impose the conditioning.

For cases when we center on the joint relation between X and Y, rather than predicting Y from X, we need a different statistical model.

Bivariate Model

DEF’N: The BIVARIATE RANDOM SAMPLING MODEL views the pairs $(X_i, Y_i)$ as joint random variables, sampled from a popl’n of pairs with means $\mu_X$, $\mu_Y$, SD’s $\sigma_X$, $\sigma_Y$, and a population correlation parameter, ρ.

In this model, –1 ≤ ρ ≤ 1 such that ρ measures the level of dependence between X and Y:

• ρ ≈ ±1 ⇒ X & Y highly dependent/related

• ρ ≈ 0 ⇒ X & Y uncorrelated (independent, if the pairs are bivariate normal)

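Sample Correlation

To estimate ρ we use the SAMPLE CORRELATION COEFFICIENT

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Computing formula:

$$r = \frac{\sum_{i=1}^{n}x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{\sqrt{\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]\left[\sum_{i=1}^{n}y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}y_i\right)^2\right]}}$$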

Properties of r

Properties of the sample correlation coefficient, r:

• $r = \pm\sqrt{r^2}$

• as n → ∞, E[r] ≈ ρ

• related to LS regression coeffs.: $b_1 = r\,S_Y/S_X$

• test of $H_o$: $\beta_1$ = 0 numerically equivalent to test of $H_o$: ρ = 0; use

$$t_s = \frac{b_1}{SE(b_1)} = r\sqrt{\frac{n-2}{1 - r^2}}$$

• see plotted illustrations in Fig. 12.15
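The numerical equivalence of the two test statistics can be checked on the snake data. This sketch works from the unrounded sums (n = 9, 1237, 172, 9990), so the result may differ in the third decimal from the 7.545 obtained earlier with rounded inputs. It also uses the identity SS(Resid.) = $\sum (y_i-\bar{y})^2$ – [$\sum (x_i-\bar{x})(y_i-\bar{y})$]²/$\sum (x_i-\bar{x})^2$, which is a standard fact about the LS line rather than something stated on the slides.

```python
import math

# Summary sums for the snake data, as quoted from Table 12.3
n, sxy, sxx, syy = 9, 1237, 172, 9990

# Route 1: slope divided by its standard error
b1 = sxy / sxx
ss_resid = syy - sxy ** 2 / sxx              # LS identity for SS(Resid.)
s_y_given_x = math.sqrt(ss_resid / (n - 2))
t_from_slope = b1 / (s_y_given_x / math.sqrt(sxx))

# Route 2: correlation form; r takes the sign of the slope
r = math.copysign(math.sqrt(sxy ** 2 / (sxx * syy)), b1)
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))

# Also check b1 = r * S_Y / S_X
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
b1_from_r = r * s_y / s_x

print(round(t_from_slope, 3), round(t_from_r, 3), round(b1_from_r, 3))
```

Because the two routes are algebraically identical, any difference between them in practice is floating-point noise.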

Example 12.27

Ex. 12.27 (from Ex. 12.19): Is calcium in blood related to blood pressure?

Y = calcium conc. in blood platelets
X = b.p. (avg. of diastolic & systolic)

We are told that

$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 2792.5,\quad \sum_{i=1}^{n}(x_i - \bar{x})^2 = 2397.5 \quad\text{and}\quad \sum_{i=1}^{n}(y_i - \bar{y})^2 = 9562.97$$

So,

$$r = \frac{2792.5}{\sqrt{(2397.5)(9562.97)}} = 0.5832$$

estimates the population correlation ρ.

Regression vs. Correlation

The best way to contrast regression and correlation is to:

• use (conditional) regression analysis when prediction of Y from X is desired, but

• use correlation analysis when association between Y and X is under study.

Bivariate Normal Model

We can build 1 – α conf. intervals on ρ if we extend the model to include bivariate normality.

Assume Y ~ N($\mu_Y$, $\sigma_Y^2$), X ~ N($\mu_X$, $\sigma_X^2$), with Corr(X,Y) = ρ.

Unfortunately, there is no easy way to build good intervals directly on ρ. Instead, we transform between different scales for ρ.

Fisher Z-Transform

DEF’N: The FISHER Z-TRANSFORM is

$$Z(r) = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$

with INVERSE Z-TRANSFORM

$$r = \frac{e^{2Z} - 1}{e^{2Z} + 1}$$

Under the bivariate normal model, approximately,

$$Z(r) \sim N\left(Z(\rho),\ \frac{1}{n-3}\right)$$

Confidence Interval on ρ

Using the Z-transform, we can build a conf. interval on Z(ρ):

$$Z(r) \pm z_{\alpha/2}\sqrt{\frac{1}{n-3}}$$

Then, invert this into a 1 – α conf. intv’l on ρ:

$$\frac{\exp\left\{2\left[Z(r) - z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} - 1}{\exp\left\{2\left[Z(r) - z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} + 1} < \rho < \frac{\exp\left\{2\left[Z(r) + z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} - 1}{\exp\left\{2\left[Z(r) + z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right]\right\} + 1}$$

Ex. 12.30 (12.27 cont’d): Calcium/b.p. data. n = 38 with r = 0.5832. The Z-transform gives

$$Z(.5832) = \frac{1}{2}\ln\left(\frac{1.5832}{0.4168}\right) = \frac{1}{2}\ln(3.7985) = \frac{1.3346}{2} = 0.6673$$

So, a 95% conf. interval on Z(ρ) is

$$0.6673 \pm z_{.025}\sqrt{\frac{1}{35}} = 0.6673 \pm (1.96)(0.169) = 0.6673 \pm 0.3313$$

or .3360 < Z(ρ) < .9986.

Don’t stop here!

Example 12.30 – Conf. Limits

Now apply the inverse Z-transform:

$$\text{lower limit on } \rho \text{ is}\quad \frac{e^{2(.3360)} - 1}{e^{2(.3360)} + 1} = \frac{1.958 - 1}{1.958 + 1} = \frac{0.958}{2.958} = 0.32$$

$$\text{upper limit on } \rho \text{ is}\quad \frac{e^{2(.9986)} - 1}{e^{2(.9986)} + 1} = \frac{7.368 - 1}{7.368 + 1} = \frac{6.368}{8.368} = 0.76$$

So, report 0.32 < ρ < 0.76.
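The transform, interval, and back-transform steps can be collected into one short sketch. The function names are illustrative, and $z_{.025}$ = 1.96 is passed in as a default rather than computed.

```python
import math

def fisher_z(r):
    """Fisher Z-transform: Z(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Inverse Z-transform: (e^{2z} - 1) / (e^{2z} + 1)."""
    e2z = math.exp(2 * z)
    return (e2z - 1) / (e2z + 1)

def correlation_ci(r, n, z_crit=1.96):
    """Approximate conf. interval on rho under the bivariate normal model."""
    half_width = z_crit * math.sqrt(1 / (n - 3))
    z = fisher_z(r)
    # Build the interval on the Z scale, then back-transform each limit
    return inverse_fisher_z(z - half_width), inverse_fisher_z(z + half_width)

# Calcium / blood pressure data: r = 0.5832 from n = 38 pairs
lower, upper = correlation_ci(0.5832, 38)
print(round(fisher_z(0.5832), 4), round(lower, 2), round(upper, 2))
```

The inverse transform is the hyperbolic tangent, so `math.tanh(z)` could replace `inverse_fisher_z(z)`; the explicit form is kept here to match the slide’s formula. Note that the resulting interval is not symmetric about r.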

Notes on r

Some final notes on r:

• Always plot the data! Why? Because r is VERY sensitive to extreme observations and outliers (see Fig. 12.19), so BE CAREFUL!

• r is also known as the Pearson Product-Moment Correlation Coefficient.

• A distribution-free version of r exists, known as Spearman’s Rank Correlation Coefficient.
