intro to hypothesis testing - umdmmazurek/634-slides/13-stats1.pdf•in general: dv varies with iv...
Intro to hypothesis testing
With material from Howard Seltman, Blase Ur
BACKGROUND: WHAT IS HYPOTHESIS TESTING?
“Classical” hypothesis testing
• Frame a mathematical model describing relationship between input and output variables
• Specify null and alternative hypotheses within this model
• Choose a statistic that (hopefully) can discriminate between the null and alternative hypotheses (probabilistically)
Running example: Password meters
• Do different password meters help users to create better passwords?
Framing a model and hypotheses
• In general: DV varies with IV
– Do you expect a direction?
– Categorical vs. numeric inputs and outputs
• Null hypothesis: IV *does not* influence DV
• Alt. hypothesis: IV *does* influence DV
– As framed in model
(Running example)
• Password meter that changes colors will produce stronger passwords than all-green meter
– Has a direction: color >> green
– IV: categorical; DV: numeric (guess score)
• H0: No difference in strength between meters
• H1: Multicolor >> Green
• What does it mean for one set to be “stronger”?
What does a statistical test do?
• Compares tendencies in the data set
– Central tendency: mean, median, etc.
– Start w/ mean b/c simplest to talk about
• We have samples; they have errors
– Measurement errors
– Sampling errors
– Random noise, variation in people, etc.
Running example
• H0: muM = muG (mean guess score is the same for the multicolor and green meters)
• H1: muM > muG
What does a statistical test do?
• Find evidence to reject the null
– Or not!
• Does not find evidence to support the null
– In practice: a test can give evidence that things are different, but failing to find evidence of a difference != evidence that things are the same
What kind of evidence?
• Calculate the theoretical sampling distribution of the test statistic under the null
• p-value = area under that curve that is *at least as extreme* as your observed statistic
• Decision rule: reject H0 if p < alpha (typically alpha = 0.05)
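To make the "area more extreme than observed" idea concrete, here is a minimal sketch that swaps the theoretical null distribution for a simulated one (a permutation test, not the method on the slide). The function name and the guess scores are made up for illustration.

```python
import random
import statistics


def permutation_p_value(treatment, control, n_resamples=5000, seed=0):
    """One-sided p-value for H1: mean(treatment) > mean(control).

    Under H0 the group labels are interchangeable, so shuffling the
    labels approximates the null distribution of the mean difference.
    """
    rng = random.Random(seed)
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_t]) - statistics.mean(pooled[n_t:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Hypothetical guess scores (illustrative numbers, not study data).
multicolor = [14.1, 15.3, 13.8, 16.0, 15.1, 14.7, 15.9, 14.4]
green = [12.9, 13.5, 14.0, 12.2, 13.1, 13.8, 12.7, 13.3]

alpha = 0.05
p = permutation_p_value(multicolor, green)
print(f"one-sided p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

The fraction of shuffled differences at least as large as the observed one plays the role of the tail area; the decision rule is the same comparison against alpha.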
Example: Independent samples t-test
• Assumes DV is normally distributed for each condition
• Assumes common variance between conditions
• Assumes means mu, mu+delta
– H0: delta = 0, or muA = muB
T-test continued
• Calculate sampling distribution under H0
• T-statistic (for null hyp):
– (mA – mB) / sqrt(varP/nA + varP/nB), where mA, mB are the sample means and varP is the pooled sample variance
– Denominator based on variance, sample size
– Follows t-distribution based on assumptions, degrees of freedom (nA + nB – 2)
• P-value = area under curve at least as extreme as observed T-stat
– Tailedness: our running example predicts a direction, so it is one-tailed
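The statistic can be computed by hand. The sketch below assumes the common-variance (pooled) form implied by the test's assumptions; the function name and sample numbers are illustrative only.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Independent-samples t statistic assuming a common variance.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var / n_a + pooled_var / n_b)
    return (mean_a - mean_b) / standard_error, df


# Tiny made-up samples, just to exercise the arithmetic.
t_stat, df = pooled_t_statistic([2, 4, 6], [1, 2, 3])
print(f"t = {t_stat:.4f}, df = {df}")
```

The p-value is then the tail area of a t-distribution with `df` degrees of freedom beyond `t_stat`; the standard library has no t-distribution, so a statistics package would be needed for that last step.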
Interpreting p-values
• Model assumptions matter!
• A p-value says NOTHING about the chance that H0 is true: it comes from a distribution that assumes H0 is true!
• Type I error: We found a difference that isn’t real
• Type II error: We failed to find a real difference
• P-values ONLY BOUND TYPE I error
Interpreting p-values
• Small p-value: evidence against H0 (in favor of H1)
– Otherwise, bad luck: a wonky sample (Type I error)
• Large p-value: H0 may be true, or Type II error
– Can use power analysis to interrogate this a bit
– “Statistical power”: prob. of rejecting the null when you should (more later)
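A rough sketch of what power means, assuming a one-sided two-sample z-test with a known common standard deviation (a simplification of the t-test; the function name and the numbers are illustrative):

```python
import math
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n_per_group, alpha=0.05):
    """Probability of rejecting H0 when the true mean difference is delta.

    Assumes equal group sizes, known sigma, and a one-sided z-test.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    # How far the true difference shifts the statistic, in standard errors.
    shift = delta / (sigma * math.sqrt(2 / n_per_group))
    return 1 - std_normal.cdf(z_crit - shift)


for n in (10, 50, 200):
    print(f"n per group = {n:3d}: power = {power_one_sided_z(0.5, 1.0, n):.3f}")
```

With no true difference (`delta = 0`) the rejection probability equals alpha; for a fixed real difference, power grows with sample size.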
• P-values don’t PROVE anything
Interpreting p-values
• Generally, reject null w/ p < 0.05
• A p-value is not magic, just probability, and the threshold is arbitrary
• But significance is reported as TRUE or FALSE: you don’t say something is “more significant” because the p-value is lower
Defining significance
• Statistically significant: It would be unlikely to observe this data if the underlying distributions were the same (e.g. if they had the same mean)
• This doesn’t mean the difference is meaningful!
– Effect size == strength of the effect
– Sufficiently large samples can find real but small effects
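One common way to put a number on effect size for a difference in means is Cohen's d (the slides do not name a specific measure, so this choice is an assumption). A minimal sketch:

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized mean difference: (mean_a - mean_b) / pooled std. dev."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Made-up samples for illustration.
print(f"d = {cohens_d([2, 4, 6], [1, 2, 3]):.3f}")
```

Unlike the p-value, d does not shrink or grow just because the sample size changes, which is why it is reported alongside the test result.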
Running example
• We find that muM >> muG (H0 is rejected)
– Attacker has to guess 2 million more passwords?
– Attacker has to guess 2 more passwords?
– Etc.
P values and multiple testing
• P-values bound Type I error (false positive)
– If H0 is true, you expect this to happen 5% of the time with α = 0.05
• What happens if you conduct a lot of statistical tests in one experiment?
• Your cumulative probability of a Type I error can increase dramatically!
• You can correct for this – more details later
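The growth in cumulative Type I error is easy to compute under a simplifying assumption: all k tests are independent and every null hypothesis is true. The function name is illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:2d} tests at alpha = 0.05: P(at least one Type I error) = "
          f"{familywise_error_rate(0.05, k):.3f}")
```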
p-values and confidence intervals
• 95% CI: over many repeated experiments, the parameter of interest (e.g. mean) will be within the CI 95% of the time
– Calculated from your sample, with assumptions
• Related to p-value: if 95% CIs do not overlap, p < 0.05 (the converse does not hold: overlapping CIs can still give p < 0.05)
• Human judgment about whether this is narrow/wide and how to interpret result
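The "repeated experiments" reading of a confidence interval can be checked by simulation. This sketch assumes normally distributed data with a known standard deviation (so a z-interval applies); all parameter values are made up.

```python
import math
import random
import statistics


def ci_coverage(true_mean=10.0, sigma=2.0, n=30, n_experiments=2000, seed=1):
    """Fraction of simulated 95% CIs that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
    half_width = z * sigma / math.sqrt(n)       # known-sigma interval (assumption)
    hits = 0
    for _ in range(n_experiments):
        sample_mean = statistics.fmean(
            rng.gauss(true_mean, sigma) for _ in range(n)
        )
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_experiments


print(f"coverage over repeated experiments: {ci_coverage():.3f}")
```

Each simulated experiment produces a different interval; roughly 95% of them should capture the fixed true mean.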
CHOOSING THE RIGHT TEST
Planning
• Choose the test(s) before you collect the data
– Especially before you look at it!
• Interplay between: what question am I asking? How could I demonstrate a result? What data must be collected for that to work? Etc.
• But, do exploratory analysis / visualization to sanity check your plan and results
What kind of data do you have?
• For input and outcome variables
• Quantitative
– Discrete (Number of caffeine pills taken by each pony)
– Continuous (Weight of each pony)
• Categorical
– Binary (Is it or isn’t it a pony?)
– Nominal: No order (Color of the pony)
– Ordinal: Ordered (Is the pony super cool, cool, partly cool, or uncool)
• How many of each?!
What kind of data do you have?
• Does your dependent variable follow a normal distribution? (You can test this!)
• Choice of test depends on normality
– If so, use parametric tests.
– If not, use non-parametric tests.
What kind of data do you have?
• Are your data independent?
– Within vs. between subjects; group effects; time series
– If not, repeated-measures, mixed models, etc.
– Can you make them independent? E.g., after – before
– Independence is usually the least robust assumption!
• Other assumptions (and robustness) about distribution, errors, variance, etc.
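The "after – before" trick can be sketched as follows: each participant contributes one difference, and the differences are then treated as a single independent sample (this is the paired t-test). The function name and data are hypothetical.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """t statistic on per-participant differences (after - before).

    Returns (t, degrees_of_freedom); H0 is that the mean difference is 0.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Made-up within-subjects measurements for four participants.
t_stat, df = paired_t_statistic(before=[1, 2, 3, 4], after=[2, 4, 5, 8])
print(f"t = {t_stat:.4f}, df = {df}")
```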
What is your hypothesis?
• One-tailed vs. two-tailed
– Which side the p-value area is calculated under
– Doesn’t apply to all tests
– Make sure your observed data goes in the predicted direction
– No cheating – must select this *ahead of time*
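To see why tailedness must be fixed in advance, here is a small sketch that converts one test statistic into both kinds of p-value. It assumes a normal approximation (a z statistic) to stay within the standard library; the value 1.8 is made up.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """One-tailed (H1: greater) and two-tailed p-values for a z statistic."""
    upper_tail = 1 - NormalDist().cdf(z)
    return {
        "one_tailed_greater": upper_tail,
        "two_tailed": 2 * min(upper_tail, 1 - upper_tail),
    }


result = p_values_from_z(1.8)
print(f"one-tailed p = {result['one_tailed_greater']:.4f}")
print(f"two-tailed p = {result['two_tailed']:.4f}")
```

The same data can fall on either side of alpha = 0.05 depending on the choice, so picking the tail after seeing the result inflates the Type I error rate.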
Reporting a statistical test
• Make it clear what the IV and DVs were
– Fails surprisingly often
• Be clear which test was used and why
• Supply:
– p-value
– Test statistic (sometimes redundant but people like it)
– df / sample size
– Effect size whenever possible
From here, for a while
• Different experimental designs and tests that go with them
• Assumptions and assumption checking