Statistics for CEGEP Biology

Making sense out of data (aka doing statistics)

Upload: corey-chivers

Post on 06-May-2015

TRANSCRIPT

Page 1: Statistics for CEGEP Biology

Making sense out of data (aka doing statistics)

Page 2: Statistics for CEGEP Biology

Things you will need

Page 3: Statistics for CEGEP Biology

Who I am and what I do

Corey Chivers

PhD Student in Biology at McGill

I study biological invasions using statistics

Page 4: Statistics for CEGEP Biology

What is a Statistician?

Page 9: Statistics for CEGEP Biology

What is a Statistician?

A statistician is someone who:

● Turns data into insights.

● Answers questions about the world.

● Isn't fun to talk to at a party?

Page 10: Statistics for CEGEP Biology

Statistics is very cool

Page 11: Statistics for CEGEP Biology

Data is Everywhere

Page 13: Statistics for CEGEP Biology

Statisticians are in demand

Page 14: Statistics for CEGEP Biology

Portrait of a Statistician

Page 15: Statistics for CEGEP Biology

Portrait of a Statistician

?

Page 17: Statistics for CEGEP Biology

Portrait of a Statistician

The cool kids are calling themselves Data Scientists

Name: Hilary Mason

Title: Chief Data Scientist at bit.ly

Member of Mayor Bloomberg’s Technology and Innovation Advisory Council

From her web bio: “I <3 data and cheeseburgers.”

Page 18: Statistics for CEGEP Biology

What do you know about statistics?

● On a piece of paper, make a list of all the words you know about statistics.

● I'll start:

– Average (mean)

– Variance

– Normal distribution

– ...

Page 20: Statistics for CEGEP Biology

Despite how exciting we are, statisticians always start by assuming the world is boring

The Null Hypothesis, or H0, is this boring world.

Usually something like “there is no effect of caption size on the lulzyness of LOLcats”

Page 22: Statistics for CEGEP Biology

Looking for evidence against the Null Hypothesis

● The alternative hypothesis (Ha) is that something interesting is going on.

– Ex: “Bigger captions are, on average, funnier”

● How would we know?

● To the internetz!

Page 23: Statistics for CEGEP Biology

Collect some sample data!

Big caption, quite funny

Small caption, fairly humorous

Small caption, funny-ish

Big caption, peed in pants a little

Page 24: Statistics for CEGEP Biology

Dealing with variability

Some small caption images are funny, and some large caption images are not funny.

There is variance in the data.

But we want to know if there is a difference on average. We'll need to take variance into account.

Page 25: Statistics for CEGEP Biology

Descriptive Statistics

Measures of Variability

Variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Where $x_i$ = the i-th value of a distribution, $n$ = number of values in the sample, $\bar{x}$ = sample mean

Page 26: Statistics for CEGEP Biology

Descriptive Statistics

Measures of Variability

Variance and Standard Deviation

1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Therefore, variance of our dist’n (w/ mean = 5):

Step 1
1 − 5 = −4
2 − 5 = −3
3 − 5 = −2
3 − 5 = −2
[…]
9 − 5 = 4

Step 2
(−4)² = 16
(−3)² = 9
(−2)² = 4
(−2)² = 4
[…]
4² = 16

Step 3
16 + 9 + 4 + 4 + […] + 16 = 72

Step 4 (Variance)
72/18 = 4

Step 5 (Std Deviation)
√4 = 2
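The five steps of this worked example can be sketched in Python. This is only an illustration using the 19 values from the slide; the variable names are my own choice.

```python
import math

# The 19 values from the slide.
values = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

n = len(values)
mean = sum(values) / n                     # 95 / 19 = 5.0

deviations = [x - mean for x in values]    # Step 1: subtract the mean
squared = [d ** 2 for d in deviations]     # Step 2: square each difference
sum_of_squares = sum(squared)              # Step 3: add the squares up
variance = sum_of_squares / (n - 1)        # Step 4: divide by n - 1
std_dev = math.sqrt(variance)              # Step 5: take the square root

print(mean, sum_of_squares, variance, std_dev)  # 5.0 72.0 4.0 2.0
```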

Page 27: Statistics for CEGEP Biology

Your turn

Calculate the variance of the heights in your group.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

1) Write down your heights ($x_i$)

2) Calculate the average ($\sum x_i / n$)

3) Subtract the average from each height and square it

4) Add them all up and divide by n−1
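If you want to check your hand calculation, here is a small sketch that follows the same four steps. The heights are made up for illustration (they are not from the slides), and `sample_variance` is a hypothetical helper name. The result is compared against `statistics.variance` from Python's standard library, which also divides by n−1.

```python
import math
import statistics


def sample_variance(xs):
    """Sample variance using the n - 1 denominator (steps 2 to 4)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(xs) / n                                   # step 2
    return sum((x - mean) ** 2 for x in xs) / (n - 1)    # steps 3 and 4


# Step 1: write down the heights (hypothetical values, in cm).
heights_cm = [160, 165, 170, 175, 180]

by_hand = sample_variance(heights_cm)
from_library = statistics.variance(heights_cm)

print(by_hand)  # 62.5
print(math.isclose(by_hand, from_library))  # True
```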

Page 28: Statistics for CEGEP Biology

Variance

Page 29: Statistics for CEGEP Biology

Measures of Central Tendency

Calculating the Mean

Using the following distribution of values:

1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9

(Arithmetic) Mean – the average of a distribution of values

Mean = (sum of values in dist’n) / (number of values in dist’n)

(1+2+3+3+4+4+4+5+5+5+5+5+6+6+6+7+7+8+9) / 19 = 95 / 19 = 5

or

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
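As a quick sketch, the same calculation in Python, using the values from the slide. Note that the mean divides the sum by n (the number of values), whereas the sample variance divides by n−1.

```python
# The 19 values from the slide.
values = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

total = sum(values)   # sum of values in the distribution
n = len(values)       # number of values in the distribution
mean = total / n      # divide by n, not n - 1

print(total, n, mean)  # 95 19 5.0
```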

Page 30: Statistics for CEGEP Biology

Could the difference be due to chance?

Remember, we started by assuming that there was no difference (the Null Hypothesis).

If the Null Hypothesis is true, what are the chances that we observed this amount of difference between groups?

How do we decide whether the difference is due to chance or not?

By vote???

Page 32: Statistics for CEGEP Biology

A better way: (formal) Hypothesis testing

● Determine in advance the level of error you are willing to put up with.

– We cannot avoid the chance of errors, but we can decide how often we are willing to have them happen.

● Biologists like to use 0.05 (a 1 in 20 chance).

● We call this α (alpha).

Ronald Fisher: The man behind the idea of NHST

Page 34: Statistics for CEGEP Biology

A better way: (formal) Hypothesis testing

● Calculate how likely data at least as extreme as yours would be if the null were true (the p-value).

● If it is less than α, we say that we reject the null hypothesis.

● If we reject the null, we say the results are statistically significant.

“The world is not boring after all!”
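One way to see what "how likely under the null" means is to shuffle the group labels. The sketch below is an exact permutation test, which is a different technique from the t-test used later in these slides, and the funniness scores are invented for illustration. If captions do not matter (the null), every way of splitting the six scores into two groups of three is equally likely, so we count how many splits give a difference at least as big as the one observed.

```python
from itertools import combinations

# Hypothetical funniness scores (not from the slides).
big = [7, 8, 9]      # big-caption images
small = [4, 5, 6]    # small-caption images


def mean(xs):
    return sum(xs) / len(xs)


observed = abs(mean(big) - mean(small))

pooled = big + small
indices = range(len(pooled))
count_extreme = 0
total = 0
# Try every way of relabelling 3 of the 6 scores as "big caption".
for chosen in combinations(indices, len(big)):
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in indices if i not in chosen]
    total += 1
    # Small tolerance guards against floating-point rounding.
    if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
        count_extreme += 1

p_value = count_extreme / total   # chance of a difference this big under the null
alpha = 0.05
reject_null = p_value < alpha

print(total, count_extreme, p_value, reject_null)  # 20 2 0.1 False
```

With these made-up scores, 2 of the 20 possible splits are as extreme as the observed one, so the chance is 0.1. That is not less than α = 0.05, so we would not reject the null.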

Page 35: Statistics for CEGEP Biology

Let's do it!

● To calculate how likely our data are under the null hypothesis (i.e. the difference is due to chance), we need a statistic.

● But first, some Beer!

Page 36: Statistics for CEGEP Biology

Student's t-test

● William Sealy Gosset figured out how to test if a batch of beer was significantly different than the standard.

While working for the Guinness brewing company, he was forbidden to publish academic research, so he published his method under the pseudonym 'Student'.

Page 37: Statistics for CEGEP Biology

Student's t-test

The t-value is calculated using the following equation:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Where $\bar{x}_1$ and $\bar{x}_2$ are the means of the experimental and control groups;

$s_1^2$ and $s_2^2$ are the variances of the experimental and control groups;

$n_1$ and $n_2$ are the sample sizes for the experimental and control groups.

Page 39: Statistics for CEGEP Biology

t-test

α = 0.05

If the null hypothesis is true, there is a 5% chance that the t-test will wrongly detect a difference between the means.

State your alpha level

Page 40: Statistics for CEGEP Biology

Calculating your t-value

                      Generic-brand (Group 2)    Name-brand (Group 1)
Mean # of chips       x̄2 = 11.2                  x̄1 = 15.3
Standard Deviation    s2 = 4.3                   s1 = 2.4
n (sample size)       n2 = 3                     n1 = 3

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

According to the data above:

calculated t = 1.4
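Plugging the chip data into the equation can be sketched as follows. The function name `t_value` is my own. Note that the table gives standard deviations, so they are squared to get the variances the equation asks for.

```python
import math


def t_value(mean1, sd1, n1, mean2, sd2, n2):
    """t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Group 1 = name-brand, Group 2 = generic-brand (values from the slide).
t = t_value(mean1=15.3, sd1=2.4, n1=3, mean2=11.2, sd2=4.3, n2=3)

print(round(t, 1))  # 1.4
```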

Page 41: Statistics for CEGEP Biology

Alternate Hypothesis

You can only test ONE possible alternate hypothesis at any one time. The one chosen depends on what you are looking to find.

Alternative hypothesis: 2 types

2-tailed

Non-directional (general): not specifying a direction.

“The two groups are not the same”

1-tailed

Directional (specific): specify direction

“Group A is greater than group B.”

Page 42: Statistics for CEGEP Biology

Look up the Critical t-value

In order to find your critical t-value, you need 3 pieces of information:

1. Whether the alternate hypothesis is 1- or 2-tailed

2. Alpha level (usually = 0.05)

3. Degrees of freedom (df = n-1)

Calculating degrees of freedom (df)

Degrees of Freedom = n-1

What if you have 2 different sample sizes (n1 and n2)… which do you pick to calculate your degrees of freedom?

A: df = the smaller of (n1 − 1) and (n2 − 1)
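The lookup can be sketched with a tiny hard-coded table. The critical values below are the usual α = 0.05 entries from a standard t-table for df 1 to 5 only; they are not taken from the slides, and the names `CRITICAL_T_05`, `degrees_of_freedom` and `critical_t` are hypothetical.

```python
# Critical t-values for alpha = 0.05 from a standard t-table (df 1 to 5 only).
CRITICAL_T_05 = {
    # df: (1-tailed, 2-tailed)
    1: (6.314, 12.706),
    2: (2.920, 4.303),
    3: (2.353, 3.182),
    4: (2.132, 2.776),
    5: (2.015, 2.571),
}


def degrees_of_freedom(n1, n2):
    """Use the smaller of (n1 - 1) and (n2 - 1), as on the slide."""
    return min(n1 - 1, n2 - 1)


def critical_t(df, tails):
    """Look up the critical t-value for a 1- or 2-tailed test at alpha = 0.05."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    one_tailed, two_tailed = CRITICAL_T_05[df]
    return one_tailed if tails == 1 else two_tailed


# Two groups of 3, as in the chip example.
df = degrees_of_freedom(3, 3)
print(df, critical_t(df, tails=2))  # 2 4.303
```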

Page 43: Statistics for CEGEP Biology

Looking up your Critical t-value

Page 44: Statistics for CEGEP Biology

Compare your ‘calculated’ t-value with your ‘critical’ t-value

It is the comparison between your calculated t-value and the critical t-value that determines whether you reject or fail to reject your null hypothesis

a) If ‘calculated’ > ‘critical’, then: reject null hyp.

“My observed data are really unlikely under the null hypothesis, therefore I reject the null hypothesis!”

b) If ‘calculated’ < ‘critical’, then: do NOT reject null hyp.

“My observed data are consistent with the null hypothesis, therefore I have no reason to believe that it is not true.”
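The comparison rule can be sketched in a few lines of Python. The numbers are assumptions for illustration: 1.4 is the calculated t from the chip example, and 4.303 is the 2-tailed critical value for α = 0.05 and df = 2 from a standard t-table, which the slides do not state. Taking the absolute value of t for a 2-tailed test is also my assumption about how to handle a negative t.

```python
def decide(calculated_t, critical_t, two_tailed=True):
    """Return the conclusion of the test as a short string."""
    # For a 2-tailed test only the size of t matters, not its sign.
    statistic = abs(calculated_t) if two_tailed else calculated_t
    if statistic > critical_t:
        return "reject null"
    return "do NOT reject null"


decision = decide(calculated_t=1.4, critical_t=4.303)
print(decision)  # do NOT reject null
```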

Page 45: Statistics for CEGEP Biology
Page 46: Statistics for CEGEP Biology

What if we are measuring a category, rather than a number?

● The t-test lets us compare the value of some attribute between two groups.

– Do mutant fruit flies live longer than wild type?

– Does IQ differ between Dawson and Laurier students?

– Does drug x decrease blood pressure?

● The dependent variable is quantitative:

– Life span

– IQ

– Blood pressure

Page 47: Statistics for CEGEP Biology

What if we are measuring a category, rather than a number?

● Chi-squared test lets us test hypotheses about categories.

– Are there more cars of a certain colour getting speeding tickets?

– Is the ratio of dominant to recessive phenotypes 3:1?

– Do chromosomes assort independently?

● The dependent variable is categorical:

– Car colour

– Phenotype

– Chromosome donor

Page 48: Statistics for CEGEP Biology

Chi-square or T-test???

How do you know which one you need?

T-Test
• the dependent variable is quantitative (e.g. height, weight, etc.)
• data can be organized as two lists of numbers
Example:

Room temp (bpm)    Cold temp (bpm)
178                86
169                89
192                55

(dependent variable: heart rate)

Chi-square Test
• the dependent variable is qualitative (aka nominal data) (e.g. gender, colour, etc.)
• data can be easily tabulated as counts
Example:

Male      98
Female    102

(dependent variable: gender)

Page 49: Statistics for CEGEP Biology

Steps to performing a chi-square test

1. State your null hypothesis

2. State your alternate hypothesis

3. State your alpha level (usually α = 0.05)

4. Calculate your ‘calculated chi-square value’

5. Look up your ‘critical chi-square value’ (from chi-square table)

6. Compare your ‘calculated’ and ‘critical’ values

a) If ‘calculated’ > ‘critical’, conclusion: reject null hyp.

b) If ‘calculated’ < ‘critical’, conclusion: do NOT reject null hyp.

7. State your conclusion

Page 50: Statistics for CEGEP Biology

Sample hypotheses for chi-square

Sex ratio in our class

1. Null hypothesis: There is no difference between the frequency of men and women in the class

2. Alternative hypothesis: There is a difference between the frequency of men and women in the class

Chi-square can only test non-directional alt. hyp.

Page 51: Statistics for CEGEP Biology

Calculating Chi-square

‘Calculated’ chi-square values are calculated using the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O = observed
E = expected

Calculating the chi-square is easier using the following table:

Gender    O    E    O−E    (O−E)²    (O−E)²/E
Female
Male

χ² = sum of last column =
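Filling in the table can be sketched in Python. As an assumption for illustration, the observed counts reuse the 98 men and 102 women from the earlier chi-square example, and the expected counts assume the null hypothesis of equal frequencies (100 each).

```python
# Observed counts (reusing the example counts; an assumption for illustration).
observed = {"Female": 102, "Male": 98}

# Under the null (no difference), each category expects an equal share.
total = sum(observed.values())
expected = {gender: total / len(observed) for gender in observed}

chi_square = 0.0
print("Gender  O  E  O-E  (O-E)^2  (O-E)^2/E")
for gender, o in observed.items():
    e = expected[gender]
    contribution = (o - e) ** 2 / e          # last column of the table
    print(gender, o, e, o - e, (o - e) ** 2, contribution)
    chi_square += contribution               # chi-square = sum of last column

print("chi-square =", round(chi_square, 2))  # chi-square = 0.08
```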

Page 52: Statistics for CEGEP Biology

Looking up the Critical χ²

To find the critical χ², you need the alpha level and the df.

Df for a χ² test = (# of categories) – 1

In our example, df = 2-1 = 1

Page 53: Statistics for CEGEP Biology

Compare your ‘calculated’ chi-sq with your ‘critical’ chi-sq

It is the comparison between the calculated chi-sq and the critical chi-sq that determines whether you reject or fail to reject your null hypothesis

a) If ‘calculated’ > ‘critical’, then: reject null hyp.

“My observed data are really unlikely under the null hypothesis, therefore I reject the null hypothesis!”

b) If ‘calculated’ < ‘critical’, then: do NOT reject null hyp.

“My observed data are consistent with the null hypothesis, therefore I have no reason to believe that it is not true.”
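The lookup and comparison can be sketched together. The critical values are the usual α = 0.05 entries from a standard chi-square table for df 1 to 4, not taken from the slides. The calculated value 0.08 is what you get for hypothetical counts of 98 men and 102 women when 100 of each are expected. The names `CRITICAL_CHI2_05` and `chi_square_decision` are my own.

```python
# Critical chi-square values for alpha = 0.05 from a standard table (df 1 to 4).
CRITICAL_CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_decision(calculated, n_categories):
    """Return (df, critical value, whether to reject the null)."""
    df = n_categories - 1                 # df = (# of categories) - 1
    critical = CRITICAL_CHI2_05[df]
    return df, critical, calculated > critical


# Two categories (male, female), hypothetical calculated chi-square of 0.08.
df, critical, reject = chi_square_decision(calculated=0.08, n_categories=2)
print(df, critical, reject)  # 1 3.841 False
```

Here 0.08 is far below 3.841, so the data are consistent with the null hypothesis of no difference.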

Page 54: Statistics for CEGEP Biology

Statistics just might save your life

Page 55: Statistics for CEGEP Biology

Questions for Corey

● You can email me! [email protected]

● I blog about statistics:

bayesianbiologist.com

● I tweet about statistics:

@cjbayesian