Bayes factors as a replacement for t-tests

Jeffrey N. Rouder

University of Missouri

September, 2008

Collaborators

- Paul Speckman

- Dongchu Sun

- Richard Morey

- Geoff Iverson

The De Facto Rule of Data Analysis

p < .05 → Good

p ≥ .05 → Bad

Mission of This Talk

Provide a practical and useful alternative to the De Facto Rule in Data Analysis.

1. Invariances, as opposed to differences, are the heart of science

2. Significance tests are ill-suited for assessing invariances

3. Significance tests are ill-suited for assessing differences

4. Bayes factor approach

5. Easy-to-use web applet to speed adoption

Johannes Kepler (1571-1630)

- Planets varied greatly in the speed & direction of their paths through the sky.

- Kepler extracted the invariants of celestial motion (e.g., elliptical orbits, equal area circumscribed in equal time).

Invariances At The Heart of Science

1. Usually, observables change across conditions.

2. What relations among observables remain constant or invariant?

3. These invariances form phenomena to be explained by theory.

- Conservation Laws: e.g., F = MA implies that if two objects are dropped from the same height, $F_1 / M_1 = F_2 / M_2$. Moreover, conservation laws hold nearly exactly across everyday situations.

- In genetics, adenine binds to thymine, guanine binds to cytosine, across all DNA in all species.

- In chemistry, mechanisms of covalent bonding are the same across all atoms.

Invariances in Psychology?

p < .05 → Good

p ≥ .05 → Bad

Effects are valued; the lack of an effect is not.

Examples of Invariances in Psychology

Gender (Shibley Hyde, 2005, 2007)

- Performance on many tasks is invariant to gender

- How can these invariances come about?

Weber’s Law (1860)

- $\Delta \propto I$

- For two different backgrounds, $\Delta_1 / I_1 = \Delta_2 / I_2$

Choice Rule (Clarke, 1957; Luce, 1959; Shepard, 1957)

- Choice probabilities:
    - Gin & Tonic .5
    - Beer .4
    - Wine .1

- Out of gin:
    - Beer .8
    - Wine .2

- Invariance in the ratio of choice probabilities (4:1)
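The drink example can be checked with a few lines of arithmetic. The sketch below assumes the simplest reading of the choice rule: when an option is removed, the remaining probabilities are rescaled to sum to one. The function name `renormalize` is made up for this illustration.

```python
def renormalize(choice_probs, unavailable):
    """Drop unavailable options and rescale the rest so they sum to 1."""
    remaining = {k: v for k, v in choice_probs.items() if k not in unavailable}
    total = sum(remaining.values())
    return {k: v / total for k, v in remaining.items()}


# Choice probabilities from the slide.
full_menu = {"gin & tonic": 0.5, "beer": 0.4, "wine": 0.1}

# Out of gin: only beer and wine remain.
no_gin = renormalize(full_menu, {"gin & tonic"})

# The beer:wine ratio is the quantity that stays invariant.
ratio_before = full_menu["beer"] / full_menu["wine"]
ratio_after = no_gin["beer"] / no_gin["wine"]

print(no_gin)
print(ratio_before, ratio_after)
```

The individual probabilities change (.4 becomes .8, .1 becomes .2), but the ratio of beer to wine does not.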

Law-like Phenomena: Psychometric Functions Shift (Watson & Pelli, 1983)

[Figure: psychometric functions plotted against Intensity (axis ticks from 10 to 5000).]

Cowan’s K model (2001)

- H = Pr(“change” | change)

- F = Pr(“change” | same)

- $\frac{H_1 - F_1}{H_2 - F_2} = \frac{N_2}{N_1}$, for $N_1, N_2 > K$
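The ratio invariance is what one would expect if capacity is estimated as K = N(H − F) for set sizes above capacity. That formula is not stated on the slide and is assumed here; the hit and false-alarm rates below are invented for illustration so that both conditions imply K = 3.

```python
def cowan_k(set_size, hit_rate, false_alarm_rate):
    """Capacity estimate K = N * (H - F); assumed form, valid for N > K."""
    return set_size * (hit_rate - false_alarm_rate)


# Hypothetical conditions: (set size N, hit rate H, false-alarm rate F).
n1, h1, f1 = 6, 0.60, 0.10
n2, h2, f2 = 8, 0.475, 0.10

k1 = cowan_k(n1, h1, f1)
k2 = cowan_k(n2, h2, f2)

# If K is constant, (H1 - F1) / (H2 - F2) must equal N2 / N1.
lhs = (h1 - f1) / (h2 - f2)
rhs = n2 / n1

print(k1, k2)
print(lhs, rhs)
```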

Selective Influence, Sternberg (1969)

- For a given model, a manipulation should affect some parameters and not others.

- Example: Process Dissociation. Dividing attention at test should affect recollection but not automatic activation.

Note 1: Double Dissociations

[Figure: two panels, Divided Attention and Perceptual Similarity; labels Conscious and Familiar; legend Low, High.]

Note 1: Double Invariance

[Figure: two panels, Divided Attention and Perceptual Similarity; labels Conscious and Familiar; legend Low, High.]

Note 2: Maybe there are no invariances

- Antinull View; championed by Meehl, Cohen

- Examples: Newtonian mechanics.

- Constructive Challenge:

- Invariances may exist at a platonic level but be perturbed in the real world.

- The goal is to find platonic rather than actual invariances.

- Anticipates subjectivity in inference.

- Pragmatics: Emphasis on model selection rather than truth

Invariances Are Difficult To Assess

- Invariances are null hypotheses

- Significance tests cannot provide evidence for the null.

Simplest Problem: One-Sample Design

- Is there a difference?

- Difference Scores: $y_1, y_2, \ldots, y_N$

- Null: $y_i \sim \text{Normal}(0, \sigma^2)$

- Alternative: $y_i \sim \text{Normal}(\mu, \sigma^2)$, $\mu \neq 0$

- Test: Paired t-test. Calculate p. Is p < .05?
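For reference, the de facto procedure can be sketched with the standard library alone. The difference scores below are invented, and the p-value is obtained by integrating the t density numerically with Simpson's rule rather than by calling a statistics package; both choices are conveniences for this illustration.

```python
import math
import statistics


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the t-density mass in [-|t|, |t|].

    Uses Simpson's rule on [0, |t|]; steps must be even.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = 2 * total * h / 3
    return max(0.0, 1.0 - central_mass)


def paired_t_test(differences):
    """Return (t, df, two-sided p) for a one-sample test of mean zero."""
    n = len(differences)
    mean = statistics.fmean(differences)
    sd = statistics.stdev(differences)
    t = mean / (sd / math.sqrt(n))
    return t, n - 1, two_sided_p(t, n - 1)


# Hypothetical difference scores (ms), one per participant.
diffs = [12, -3, 25, 8, -10, 30, 5, 14, -2, 21]
t_value, df, p_value = paired_t_test(diffs)
print(t_value, df, p_value)
print("Good" if p_value < 0.05 else "Bad")
```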

The t-test

[Figure: x-axis t-value, range −4 to 8.]

Significance Tests

[Figure: panels for Difference, t-value, and p-value, shown for N = 10 and N = 100.]

Fantasy J-Value Statistic

[Figure: x-axis J-value, range 0 to 1.]

Resulting Critiques of Significance Tests

- Consequence #1: Can’t gain evidence for the null.

- Consequence #2: Overstates the evidence against the null.

- Berger & Berry (1988): Miscalibration

- Meehl (1978): Design flaw from the faulty reasoning of Popper and Fisher

Bias to Overstate the Evidence Against the Null

- Example: N = 100, ȳ = 10 ms, t = 2.38, p ≈ .02.

- Consider the alternative µ = 30 ms, which is a typical effect, e.g., priming.

- Which is more likely given the data, the null (µ = 0) or the typical alternative (µ = 30)?

- Compute the likelihood ratio:

$\frac{L(\mu = 0; \text{data})}{L(\mu = 30; \text{data})} \approx 3800$

[Figure: x-axis Alternative (ms), range 0 to 40.]
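The order of magnitude of this likelihood ratio can be reproduced with a short calculation. The sketch assumes a normal likelihood for the sample mean and treats σ as known and equal to the value implied by t = 2.38. The slide does not say how its figure of 3800 was computed, so the number printed here lands in the same range (thousands to one in favor of the null) but is not expected to match it exactly.

```python
import math


def log_likelihood_of_mean(sample_mean, mu, sigma, n):
    """Normal log-likelihood of mu given the sample mean, dropping terms free of mu."""
    standard_error = sigma / math.sqrt(n)
    return -0.5 * ((sample_mean - mu) / standard_error) ** 2


n, sample_mean, t = 100, 10.0, 2.38

# Assumption: sigma is known and equals the SD implied by the reported t-value.
sigma = sample_mean * math.sqrt(n) / t

log_lr = (log_likelihood_of_mean(sample_mean, 0.0, sigma, n)
          - log_likelihood_of_mean(sample_mean, 30.0, sigma, n))
likelihood_ratio = math.exp(log_lr)

print(likelihood_ratio)  # null (mu = 0) relative to the alternative (mu = 30)
```

A result that is "significant" at p ≈ .02 therefore favors the null by a wide margin once it is compared with a concrete alternative of typical size.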

Truths About Testing

1. The following methods overstate the evidence against the null: p-values, confidence intervals, p-rep, Neyman-Pearson (with fixed α), statistical equivalence, Akaike Information Criterion (AIC).

2. Principled inference is only possible with a priori specification of the alternatives. This holds for both Bayesian and classical hypothesis testing.

3. Hypothesis testing is inherently subjective, but don’t freak out yet.

Bayes Factors to the Rescue

Intellectual Legacy:

- Bayes (1760s): Places probability directly on hypotheses

- Laplace (1810s): Proposes using odds for inference

- Jeffreys (1960): Formalizes Bayes factor; proposes specification of alternative that we adopt here.

- Zellner & Siow (1980): Reformulate Jeffreys’s work broadly for linear models.

- Berger, Bayarri & colleagues (2001-2008): Showed Jeffreys-Zellner-Siow Bayes factors have desirable mathematical properties.

Bayes Theorem

$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$

$\Pr(H \mid \text{data}) = \frac{\Pr(\text{data} \mid H) \times \Pr(H)}{\Pr(\text{data})}$

Bayes Factor

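$\Omega = \frac{\Pr(H_0 \mid \text{data})}{\Pr(H_1 \mid \text{data})} = \frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)} \times \frac{\Pr(H_0)}{\Pr(H_1)} = B_{01} \times \frac{\Pr(H_0)}{\Pr(H_1)},$

$B_{01} = \frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)} = \frac{M_0}{M_1}$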
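The relation between the Bayes factor and the posterior odds is a single multiplication, which the following sketch spells out. The function names and the example Bayes factor of 3 are invented for illustration.

```python
def posterior_odds(bayes_factor_01, prior_odds_01=1.0):
    """Posterior odds of H0 over H1: Bayes factor times prior odds."""
    return bayes_factor_01 * prior_odds_01


def odds_to_probability(odds):
    """Convert odds in favor of H0 into Pr(H0 | data)."""
    return odds / (1.0 + odds)


# Hypothetical: the data are 3 times more probable under H0 than under H1.
b01 = 3.0

even_prior = posterior_odds(b01, prior_odds_01=1.0)       # reader with no prior preference
skeptical_prior = posterior_odds(b01, prior_odds_01=0.2)  # reader who doubted H0 at 1:5

print(even_prior, odds_to_probability(even_prior))
print(skeptical_prior, odds_to_probability(skeptical_prior))
```

The Bayes factor is the part of the calculation that depends on the data; readers with different prior odds update from different starting points but by the same factor.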

Simplest Example

- Alternative: $y_i \sim \text{Normal}(\mu_1, \sigma^2)$, $\mu_1 \neq 0$

- Researcher specifies $\mu_1$ beforehand.

- Let’s also assume $\sigma^2$ is known.

Simplest Example

[Figure: Null and Alternative ($\mu_1 = 40$) plotted against Effect µ (ms).]

Simplest Example

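$M_0 = L(y; \mu = 0) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} \exp\left(-\frac{\sum_i y_i^2}{2\sigma^2}\right)$

$M_1 = L(y; \mu = \mu_1) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} \exp\left(-\frac{\sum_i (y_i - \mu_1)^2}{2\sigma^2}\right)$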

Simplest Example

[Figure: curves for ȳ = 0, ȳ = 0.2σ, and ȳ = 0.35σ plotted against the Alternative $\mu_1$ (units of σ).]
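With a point alternative and known σ, the ratio of the two likelihoods simplifies because the normalizing constants cancel: log B01 = −N µ1 (2ȳ − µ1) / (2σ²). The sketch below evaluates that expression for the three sample means shown in the figure. The sample size N = 100 and the alternative µ1 = 0.4σ are assumptions made for this illustration, since the slide does not give them.

```python
import math


def bf01_point_alternative(sample_mean, n, mu1, sigma):
    """B01 = L(mu = 0) / L(mu = mu1) for normal data with known sigma."""
    log_bf = -n * mu1 * (2.0 * sample_mean - mu1) / (2.0 * sigma ** 2)
    return math.exp(log_bf)


n = 100        # hypothetical sample size
sigma = 1.0    # work in units of sigma
mu1 = 0.4      # hypothetical point alternative, in units of sigma

for sample_mean in (0.0, 0.2, 0.35):
    print(sample_mean, bf01_point_alternative(sample_mean, n, mu1, sigma))
```

The Bayes factor equals 1 when the sample mean sits halfway between 0 and µ1, favors the null when the mean is closer to 0, and favors the alternative when it is closer to µ1.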

Next Example

Let’s assume $\sigma^2$ is known.

- $\mu \sim \text{Normal}(0, \sigma_0^2)$

- Researcher needs to specify $\sigma_0^2$

Next Example

[Figure: Alternative ($\sigma_0 = 30$) plotted against Effect µ (ms).]

Next Example

$M_0 = L(y; \mu = 0)$

$M_1 = \int L(y; \mu)\, p(\mu)\, d\mu$

[Figure: curves for ȳ = 0, ȳ = 0.15σ, and ȳ = 0.22σ plotted against Prior Standard Deviation σµ (units of σ), log axis from 0.01 to 1e+05.]

Lessons

- As the alternative includes a greater percentage of unrealistically large values, the Bayes factor favors the null.

- If the alternative can include a wide range of values, it is penalized by the Bayes factor.

- Penalty for complexity
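The penalty can be seen numerically in the known-σ case with a Normal(0, σ0²) prior on µ. Because the sample mean is sufficient for µ, the Bayes factor reduces to a ratio of two normal densities for ȳ: variance σ²/N under the null and σ²/N + σ0² under the alternative. The sample size N = 100 is an assumption made for this illustration.

```python
import math


def normal_pdf(x, variance):
    """Density at x of a zero-mean normal with the given variance."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def bf01_normal_prior(sample_mean, n, sigma, prior_sd):
    """B01 for known sigma when mu ~ Normal(0, prior_sd**2) under the alternative."""
    var_null = sigma ** 2 / n             # sampling variance of the mean
    var_alt = var_null + prior_sd ** 2    # plus prior spread of mu
    return normal_pdf(sample_mean, var_null) / normal_pdf(sample_mean, var_alt)


n, sigma = 100, 1.0   # hypothetical sample size; units of sigma
sample_mean = 0.22    # one of the values shown in the figure

for prior_sd in (0.01, 0.1, 1.0, 10.0, 100.0, 1000.0):
    print(prior_sd, bf01_normal_prior(sample_mean, n, sigma, prior_sd))
```

For a fixed sample mean, the Bayes factor moves toward the null as the prior standard deviation grows, so an arbitrarily diffuse alternative makes the null win regardless of the data.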

What is the Right Value for Variance $\sigma_0^2$?

- If $\sigma^2$ is small, then $\sigma_0^2$ should be small (perception)

- If $\sigma^2$ is large, then $\sigma_0^2$ should be large (clinical applications)

- Idea: Place prior on effect size instead: $\delta = \mu / \sigma$.

$y_i \sim \text{Normal}(\mu, \sigma^2) = \text{Normal}(\delta\sigma, \sigma^2)$

- $\delta \sim \text{Normal}(0, \sigma_\delta^2)$

- Default: $\sigma_\delta^2 = 1$.

[Figure: Alternative plotted against Effect Size δ.]

Jeffreys-Zellner-Siow Prior

- Place prior distribution on $\sigma_\delta^2$

[Figure: prior density plotted against Prior Variance $\sigma_\delta^2$.]

[Figure: Cauchy and Normal priors plotted against Effect Size δ.]

- We recommend the JZS as a noninformative default.

- Normal prior is defensible too
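For readers who want to see what such a calculation involves, here is a minimal sketch of the one-sample JZS Bayes factor computed from a t-value and a sample size. It assumes the commonly given integral form, in which the variance of δ (written g below) has an inverse chi-square prior with one degree of freedom, so that δ is marginally Cauchy. The integral is approximated with a plain midpoint rule after mapping (0, ∞) onto (0, 1). The `scale` argument is an assumption about how a scale setting would enter (it multiplies the Cauchy scale), and none of this is taken from the applet's own code.

```python
import math


def jzs_bf01(t, n, scale=1.0, steps=20000):
    """One-sample JZS Bayes factor B01 (null over alternative).

    t: observed t-value; n: sample size; scale: assumed Cauchy scale on delta.
    """
    df = n - 1
    null_term = (1.0 + t * t / df) ** (-(df + 1) / 2.0)

    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps             # midpoint of the i-th cell in (0, 1)
        g = u / (1.0 - u)                 # maps (0, 1) onto (0, infinity)
        jacobian = 1.0 / (1.0 - u) ** 2   # dg/du
        a = 1.0 + n * scale * scale * g
        likelihood = a ** -0.5 * (1.0 + t * t / (a * df)) ** (-(df + 1) / 2.0)
        # Inverse chi-square(1) density for g.
        prior = (2.0 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1.0 / (2.0 * g))
        total += likelihood * prior * jacobian
    alt_term = total / steps

    return null_term / alt_term


print(jzs_bf01(2.24, 80))   # t(79) = 2.24
print(jzs_bf01(2.39, 48))   # t(47) = 2.39
print(jzs_bf01(0.0, 80))    # no observed effect: evidence for the null
```

With the default scale, the first two calls should land close to the values reported on the Questionable Effects slide; small numerical differences from the applet are possible.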

How to Calculate a Bayes Factor

pcl.missouri.edu/bayesfactor

Scaling Alternative

[Figure: alternatives with Scale .4 and Scale 1.0 plotted against Effect Size δ.]

Scaling Alternative

- If small effect sizes are of interest, use small scales.

- If moderate or large effect sizes are of interest, use the default value of 1.0.

[Figure: t-value plotted against Sample Size (5 to 5000); legend: JZS BF, Unit BF, BIC, p-value.]

Questionable Effects

- Grider & Malmberg (2008) claim emotional words are better remembered than neutral ones.
    - .76 vs. .80, t(79) = 2.24
    - B01 = 1.02
    - No evidence for either hypothesis.

- Plant & Peruche (2005) claim sensitivity training reduced shooter bias.
    - F(1, 47) = 5.70
    - t(47) = 2.39
    - B01 = .66 or 1.6 : 1 in favor of alternative
    - Minuscule evidence for the claim

What about really small effects (δ = .02)?

[Figure: x-axis Sample Size.]

Conclusions: Bayes Factor

- Bayes factor gives researchers a principled way of accepting and rejecting the null

- Bayes factor requires specification of alternatives.

- The JZS prior is a good noninformative choice

- Extensions to factorial designs are coming (NIH willing)

Conclusions: Hypothesis Testing

- Do hypothesis testing only if (1) it is necessary and (2) you are willing to accept the null. Alternative: Explore data for structure.

- Hypothesis testing is necessarily subjective, but not to an alarming degree.

- We have the communal infrastructure to evaluate subjective elements in science.

Interpretation Specification of Alternatives
