data mining - purdue university€¦ · 1. p-values can indicate how incompatible the data are with...

41
Data Mining CS57300 Purdue University March 20, 2018 1

Upload: others

Post on 04-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Data Mining

CS57300

Purdue University

March 20, 2018

1

Page 2: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Select 50% users to see headline A

• Unlimited Clean Energy: Cold

Fusion has Arrived

• Select 50% users to see headline B

• Wedding War

• Do people click more on headline A or B?

A/B Testing

2

© 2016 Bruno Ribeiro

Page 3: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Can you guess which page has a higher conversion rate (buying customers)

and whether the difference is significant?

• When “upgraded” from the A to B the site lost 90% of their revenue

• Why? “There maybe discount coupons out there that I do not have. The price

may be too high. I should try to find these coupons.” [Kumar et al. 2009]

3

A/B Testing on Websites

A BKumar et al. 2009

© 2016 Bruno Ribeiro

Page 4: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Testing Hypotheses over Two Populations

12

?

?

1049

?

?

34

?

Are the averages different?

Which one has the largest average?

?

?

4

Average μ1 Average μ2

Population 2Population 1

© 2016 Bruno Ribeiro

Page 5: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

The two-sample t-test

Is difference in averages between two groups more than we would expect

based on chance alone?

PS: Same as alien identification problem: we don’t know how to model

“average is different”

5

© 2016 Bruno Ribeiro

Page 6: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

t-Test (Independent Samples)

H0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

The goal is to evaluate if the average difference between two populations is zero

In the t-test we make the following assumptions

• The averages and follow a normal distribution (we will see why)

• Observations are independent

Two hypotheses:

6

population 1 average

vectors

© 2016 Bruno Ribeiro

Page 7: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

General t formula

t = sample statistic - hypothesized population difference

estimated standard error

Independent samples t

Empirical averages

Empirical standard deviation (formula later)

t-Test Calculation

7

© 2016 Bruno Ribeiro

Page 8: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• What is the p-value?

• Can we test H1?

• Can we ever directly accept hypothesis H1 ?

• No, we can’t test H1, we can only reject H0 in favor of H1

8

t-Statistics p-value

H0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

© 2016 Bruno Ribeiro

Random variables

Page 9: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Null hypothesis H0 Alternative hypothesis H1 No. Tails

μ1 - μ2 = d μ1 - μ2 ≠ d 2

μ1 - μ2 = d μ1 - μ2 < d 1

μ1 - μ2 = d μ1 - μ2 > d 1

9

Two Sample Tests (Fisher)

12

?

?

109

?

?

4

??

?

Average μ1 Average μ2

© 2016 Bruno Ribeiro

Page 10: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• E.g. software updates

• Perform incremental A/B testing before rolling ANY big system change on

a website that should have no effect on users

(even if users don’t directly see the change)

• What is the hypothesis we want to test?

• H0 = no difference in [engagement, purchases, delay, transaction

time,…]

• How?

• Start with 0.1% of visitors (machines) and

grow until 50% of visitors (machines)

• If at any time H0 is rejected, stop the roll out

• Must account for testing multiple hypotheses (next class)

(more precisely, this is sequential analysis)

10

Less Obvious Applications

© 2016 Bruno Ribeiro

Page 11: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Fisher’s test

• Test can only reject H0 (we never accept a hypothesis)

• H0 is likely wrong in real-life, so rejection depends on the amount of data

• More data, more likely we will reject H0

• Neyman-Pearson’s test

• Compare H0 to alternative H1

• E.g.: H0: μ = μ0 and H1: μ = μ1

• P[Data | H0 ] / P[Data | H1]

• Bayesian test

• Compute probability P[H0 | Data] and compare against P[H1 | Data]

• More precisely, test P[H0 | Data] / P[H1 | Data]

• >1 implies H0 is more likely

• <1 implies H1 is more likely

• Neyman-Pearson’s test = Bayes factor when H0 and H1 have same priors

11

Types of Hypothesis Tests

© 2016 Bruno Ribeiro

Page 12: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

12

Back to Fisher’s test

(no priors)

© 2016 Bruno Ribeiro

Page 13: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

1) Compute the pooled empirical standard error

where,

and

(assumes both populations have equal variance)

13

How to Compute Two-sample t-test (1)

Sample variance of x(i)

Number of observations in x(i)

© 2016 Bruno Ribeiro

Page 14: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

2) Compute the degrees of freedom

3) Compute test statistic (t-score, also known as Welsh’s t)

where d is the Null hypothesis difference.

4) Compute p-value (depends on H1)

• p = P[TDF < -|td|] + P[TDF > |td|] (Two-Tailed Test H1: μ1 - μ2 ≠ d)

• p = P[TDF > td] (One-Tailed Test for H1 : μ1 - μ2 > d)

• Important: H0 is always μ1 - μ2 = d even when H1 : μ1 - μ2 > d !!

Testing H0: μ1 - μ2 ≤ d is harder and “has same power” as H0:μ1 - μ2 = d

14

How to Compute Two-sample t-test (2)

© 2016 Bruno Ribeiro

Page 15: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Back to step 4 of slide 16:

15

Rejecting H0 in favor of H1

4) Compute p-value (depends on H1)

p = P[TDF < -|td|] + P[TDF > |td|] (Two-Tailed Test H1: μ1 - μ2 ≠ d)

p = P[TDF > td] (One-Tailed Test for H1 : μ1 - μ2 > d)

Reject H0 with 95% confidence if p < 0.05

p=0.025 p=0.025 p=0.05

© 2016 Bruno Ribeiro

Page 16: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Observations of X1 and X2 are independent and identically distributed (i.i.d.)

• Central Limit Theorem (Classical CLT)

• If: E[Xk(i)] = μi and Var[Xk

(i)] = σi2 < ∞

More generally, the real CLT is about stable distributions

16

Some assumptions about X1 and X2

(here ∞ is with respect to ni)

© 2016 Bruno Ribeiro

Page 17: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• CLT: If we have enough independent observations with small variance we can

approximate the distribution of their average with a normal distribution

17

Central Limit Theorem

© 2016 Bruno Ribeiro

Page 18: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• approximation not too useful if we don’t know

• We can estimate with ni observations of

• But we cannot just plug-in estimate on the normal

• It has some variability if ni < ∞

• is Chi-Squared distributed

• The t-distribution is a convolution of the standard normal with a Chi-

Square distribution to compute

18

But we don’t know the variance of X(1) or X(2)

© 2016 Bruno Ribeiro

Page 19: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• If results are 0 or 1 (buy, not buy) we can use Bernoulli random variables

rather than the Normal approximation

19

For small samples we can use the Binomial distribution

© 2016 Bruno Ribeiro

Page 20: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

20

What about

false positives and

false negatives

of a test?

© 2016 Bruno Ribeiro

Page 21: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

21

Hypothesis Test Possible Outcomes

Type I error (false positive)

Type II error (false negative)

- Reject H0 given H0 is true

- Accept H0 given H0 is false

Errors:

In medicine our “goal” is to reject H0

(drug, food has no effect / not sick), thus a “positive” result rejects H0

© 2016 Bruno Ribeiro

Page 22: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Statistical power is probability of rejecting H0

when H0 is indeed false

• Statistical Power ⇒ Number of Observations Needed

• Standard value is 0.80 but can go up to 0.95

• E.g.: H0 is μ1 - μ2 = 0 , where μi = true average of population i

• Define n = n1 = n2 such that statistical power is 0.8 under assumption |μ1 - μ2| = Δ:

• P[Test Rejects | |μ1 - μ2| = Δ] = 0.8

where Test Rejects = 1{P[x(1) , x(2) | μ1 - μ2 = 0] < 0.05}

which gives

22

Statistical Power

© 2016 Bruno Ribeiro

Page 23: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

More Broadly: Hypothesis Testing Procedures

Hypothesis Testing

Procedures

Parametric

Z Test t Test Cohen's d

Nonparametric

Wilcoxon Rank Sum

Test

Kruskal-WalliH-Test

Kolmogorov-Smirnov test

23

© 2016 Bruno Ribeiro

Page 24: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Parametric Test Procedures

• Tests Population Parameters (e.g. Mean)

• Distribution Assumptions (e.g. Normal distribution)

• Examples: Z Test, t-Test, 2 Test, F test

24

© 2016 Bruno Ribeiro

Page 25: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

25

Effect Size

© 2016 Bruno Ribeiro

Page 26: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

General t formula

t = sample statistic - hypothesized population difference

estimated standard error

Independent samples t

Empirical averages

Estimated standard deviation

Testing Effect Sizes

26

t-Test tests only if the difference is zero or not?

© 2016 Bruno Ribeiro

Page 27: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Cohen’s d often used to complement t-test when reporting

effect sizes

where S is the pooled variance

Effect Size: Good practice

27

© 2016 Bruno Ribeiro

Page 28: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

28

Important Warning

Page 29: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

1. p-values can indicate how incompatible the data are with a specified

statistical model.

2. p-values do not measure the probability that the studied hypothesis is true,

or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be

based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect

or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence

regarding a model or hypothesis.

29

American Statistical Association Statement On Statistical

Significance And p-values

Page 30: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

30

Bayesian Approach

© 2016 Bruno Ribeiro

Page 31: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Probability of hypothesis given data

• The Bayes factor

• Reject H0 if is less than some value

31

Bayesian Approach

© 2016 Bruno Ribeiro

Page 32: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Have Aliens visited Earth and government keeping secret?• 21% of U.S. voters say a UFO crashed in Roswell, NM in 1947 and the US government covers it up

• Priors:

• H0 : At least 21% of U.S. voters are irrational, will believe in alien story without evidence

• P[H0] = 1010/(1010+1) [Ribeiro’s prior]

• P[H0] ~ Beta(1010,1) [Prior can also be a random variable, better models uncertainty]

• H1 : Aliens can travel faster than the speed of light and, despite that, can’t drive and are easily captured by humans.

• Because either H0 or H1 must be true: P[H1]=1- P[H0]

• What is the data?

• Data:

• 15% of U.S. voters say the government or the media adds mind-controlling technology to TV broadcast signals (a.k.a., the Tinfoil Hat crowd)

• 20% of U.S. voters believe there is a link between childhood vaccines and autism, despite scientific evidence there is no such link

• 15% of U.S. voters think the medical industry and the pharmaceutical industry “create” new diseases to make money (Ebola, Zika,…)

• 14% of U.S. voters say the CIA was instrumental in creating the crack cocaine epidemic

• Bayesian Disadvantage : Often hard to define P[Data | H1]

• Bayesian Advantage: Prior helps encode your uncertainty and beliefs about the world32

Bayesian Hypothesis Tests Need Assumptions

Source: Public Policy Polling, 2013© 2016 Bruno Ribeiro

Page 33: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

Nonparametric Test Procedures

• Not Related to Population Parameters

Example: Probability Distributions, Independence

• Data Values not Directly Used

Uses Ordering of Data

Examples:

Wilcoxon Rank Sum Test , Komogorov-Smirnov Test

33

© 2016 Bruno Ribeiro

Page 34: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

34

Example of Nonparametric Test

© 2016 Bruno Ribeiro

Page 35: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Two-sample Kolmogorov-Smirnov Test

• Do X(0) and X(1) come from same underlying distribution?

• Hypothesis (same distribution)

rejected at level p if

35

Nonparametric Testing of Distributions

Wikipedia

Em

piric

al

Confidence interval factor

Sample size correction

The K-S test is less

sensitive when the

differences between

curves is greatest at the

beginning or the end of

the distributions. Works

Good reading:

M. Tygert, Statistical tests for whether a given set of

independent, identically distributed draws comes from a

specified probability density. PNAS 2010

© 2016 Bruno Ribeiro

Page 36: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

36

Are Two User Features Independent?

© 2016 Bruno Ribeiro

Page 37: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Twitter users can have gender and number of tweets.

• We want to determine whether gender is related to number of tweets.

• Use chi-square test for independence

Chi-Squared Test

37

© 2016 Bruno Ribeiro

Page 38: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• When to use chi-square test for independence:

• Uniform sampling design

• Categorical features

• Population is significantly larger than sample

• State the hypotheses:

• H0 ?

• H1 ?

When to use Chi-Squared test

38

© 2016 Bruno Ribeiro

Page 39: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

men = c(300, 100, 40)

women = c(350, 200, 90)

data = as.data.frame(rbind(men, women))

names(data) = c('low', 'med', 'large')

data

chisq.test(data)

Reject H0 (p<0.05) means …

Example Chi-Squared Test

39

© 2016 Bruno Ribeiro

Page 40: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

40

Deciding Headlines

© 2016 Bruno Ribeiro

Page 41: Data Mining - Purdue University€¦ · 1. p-values can indicate how incompatible the data are with a specified statistical model. 2. p-values do not measure the probability that

• Select 50% users to see headline A

• Titanic Sinks

• Select 50% users to see headline B

• Ship Sinks Killing Thousands

• Assign half the readers to headline A and half to headline B?

• Yes?

• No?

• Which test to use?

What happens A is MUCH better than B?

Revisiting The New York Times Dilemma

41

© 2016 Bruno Ribeiro