how to use statistics in your research


Upload: idalia

Post on 21-Feb-2016


Page 1: HOW TO USE STATISTICS  IN  YOUR RESEARCH

HOW TO USE STATISTICS IN YOUR RESEARCH

LIES, DAMNED LIES AND STATISTICS!

Page 2: HOW TO USE STATISTICS  IN  YOUR RESEARCH

What we will cover

WHY
• Why do statistics
• Descriptive Statistics
• Distributions
• Sampling & Hypotheses
• Presenting Results
  – Chart junk

HOW
• Graphpad
• EXCEL

Page 3: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Why do you need statistics?

• Why are you doing a research project?
• Also very important in everyday life

• Measure things
• Examine relationships
• Make predictions
• Test hypotheses
• Explore issues
• Explain activities or attitudes
• Make comparisons
• Draw conclusions based on samples
• Develop new theories
• …

Page 4: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Misuse of statistics

Design
• Ignoring some ‘inconvenient data points’
• Focus on certain variables and exclude others
• Alter scales to present your data in a more positive way
• Present correlation as causation

[Figure: two scatter plots on the same axes, x from 0 to 12 and y from 0 to 25]

Page 5: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Misuse of statistics

[Figure: two charts of the same pair of values (x categories 1 and 2); the first y-axis runs from 0 to 40, the second starts at 10 and runs to 40]


Page 7: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Misuse is generally accidental

• Bias
  – Need to be particularly careful in ‘questionnaire’ type research
  – Also when sampling
• Using the wrong statistical tests
• Making incorrect inferences
  – In going from your sample to the general case
• Incorrectly drawing conclusions based on correlations

Page 8: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Descriptive Statistics

• Used to describe or summarise what your data shows

• Not used to draw any conclusions that extend beyond your own data

• Mean
• Median
• Mode
• Variance
• Standard Deviation

Page 9: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Mean (Average)

• Imagine you have collected some data
  – From running an algorithm on a problem
  – By measuring execution time
  – By asking opinions
• You want to summarise your data
  – Don’t present all the results

$$m = \frac{1}{n}\sum_{i=1}^{n} x_i$$

mean {-30, 1, 2, 3, 4} = -4
mean {0, 1, 2, 3, 4} = 2

Measures centrality

Excel: =AVERAGE(A1:A10)

Graphpad
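The mean formula can be checked on the two example sets with a few lines of Python. This is only a minimal sketch; the helper name `mean` is chosen here for illustration and is not part of the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# The two example sets from the slide.
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]

print(mean(with_outlier))     # -4.0
print(mean(without_outlier))  # 2.0
```

A single extreme value (-30 instead of 0) is enough to drag the mean from 2 down to -4.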

Page 10: HOW TO USE STATISTICS  IN  YOUR RESEARCH

The mean is not the whole story...

Emma’s Algorithm: 50, 49, 50, 50, 50, 50, 51, 51, 49 (MEAN 50)
Malcolm’s Algorithm: 75, 25, 99, 1, 45, 55, 50, 60, 40 (MEAN 50)

Page 11: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Standard Deviation

• Standard Deviation measures something about the spread of your data

• Important as it gives you some indication of reliability or variability of your results

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (m - x_i)^2}$$

sd {-30, 1, 2, 3, 4} = 14.6
sd {0, 1, 2, 3, 4} = 1.6

Measures spread

Excel: =STDEV(A1:A10)
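The standard deviation formula can be written out step by step in Python. This is a sketch that assumes the sample form with the n - 1 divisor, which is what Excel's STDEV computes; the function name `sample_sd` is hypothetical.

```python
import math

def sample_sd(values):
    """Sample standard deviation using the n - 1 divisor."""
    n = len(values)
    m = sum(values) / n                                   # the mean
    squared_deviations = sum((m - x) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

print(round(sample_sd([-30, 1, 2, 3, 4]), 1))  # 14.6
print(round(sample_sd([0, 1, 2, 3, 4]), 1))    # 1.6
```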

Page 12: HOW TO USE STATISTICS  IN  YOUR RESEARCH

The mean is not the whole story...

Emma’s Algorithm: 50, 49, 50, 50, 50, 50, 51, 51, 49 (MEAN 50, STD 0.71)
Malcolm’s Algorithm: 75, 25, 99, 1, 45, 55, 50, 60, 40 (MEAN 50, STD 28.07)
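The comparison can be reproduced with Python's standard `statistics` module. The sketch below assumes the slide's run results are nine values per algorithm (our reading of the flattened slide text, with the leading 50 taken as the stated mean); with that reading, both the means and the STD values on the slide come out as shown.

```python
import statistics

# Run results as read from the slide (nine runs each; this split is an assumption).
emma = [50, 49, 50, 50, 50, 50, 51, 51, 49]
malcolm = [75, 25, 99, 1, 45, 55, 50, 60, 40]

for name, runs in (("Emma", emma), ("Malcolm", malcolm)):
    m = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample SD, n - 1 divisor
    print(f"{name}: mean = {m:.0f}, sd = {sd:.2f}")
```

Both algorithms have the same mean, but Malcolm's results are far less consistent from run to run.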

Page 13: HOW TO USE STATISTICS  IN  YOUR RESEARCH

True or False ?

The majority of Scots have more than the average number of legs

Page 14: HOW TO USE STATISTICS  IN  YOUR RESEARCH

TRUE!

Most Scots have more than the average number of legs!

• (None have 3 legs)
• Most have 2 legs
• Some have 1 leg
• Some have 0 legs
• The average < 2 (~1.9)

• The mean is not a relevant measure!

Page 15: HOW TO USE STATISTICS  IN  YOUR RESEARCH

When can I use the mean?

• The data that you are sampling should follow a normal distribution
• Most values are close to the mean, and a few lie at either extreme
• 68% of values within 1 SD of mean
• 95% of values within 2 SD
• In practice, a lot of data does follow this kind of distribution
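The 68% and 95% figures can be checked against the normal distribution itself. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `fraction_within` is ours.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(f"within 1 SD: {fraction_within(1):.1%}")
print(f"within 2 SD: {fraction_within(2):.1%}")
```

The exact values are slightly above the rounded figures quoted on the slide.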

Page 16: HOW TO USE STATISTICS  IN  YOUR RESEARCH

But not all data has a normal distribution

• majority of the data is < m
• more than half the population has less than the mean value
• more than half the population is “below average”!

[Figure: a skewed distribution on an x-axis from -2 to 10 with m, m - sd and m + sd marked, and a histogram with x-axis 0 to 14 and frequency axis 0 to 70]

Page 17: HOW TO USE STATISTICS  IN  YOUR RESEARCH

The Median

• Median: item with average rank
• Rank the items in order, and pick the middle one
  – median {-30, 1, 2, 3, 4} = 2
  – median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2

EXCEL: =MEDIAN(A1:A10)

[Figure: the skewed distribution on an x-axis from -2 to 10 with the median marked]

Page 18: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Example: Mean vs Median

Suppose we ask 7 students how much money they have on them:

Person  Money (£)
John    2
Ann     3
Bob     1
Mary    10
Sue     5
Carol   2
Ken     999

Mean: £146
Median: £3

The median is much less affected by outliers in the data

It is more representative of the sample
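The effect of the outlier is easy to see in code. The sketch below uses the amounts from the table with Python's `statistics` module, then repeats the calculation without Ken (a variation added here for illustration).

```python
import statistics

money = {"John": 2, "Ann": 3, "Bob": 1, "Mary": 10, "Sue": 5, "Carol": 2, "Ken": 999}
amounts = list(money.values())

print(f"mean   = £{statistics.mean(amounts):.0f}")    # mean   = £146
print(f"median = £{statistics.median(amounts):.0f}")  # median = £3

# Dropping the outlier changes the mean enormously and the median only slightly.
without_ken = [v for name, v in money.items() if name != "Ken"]
print(f"mean without Ken   = £{statistics.mean(without_ken):.2f}")
print(f"median without Ken = £{statistics.median(without_ken):.2f}")
```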

Page 19: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Measuring Spread in non-normal data

• Quartiles (25th and 75th percentiles) are a nonparametric measure of spread
  – first quartile (Q1) = lower quartile = cuts off lowest 25% of data
  – third quartile (Q3) = upper quartile = cuts off lowest 75%

[Figure: the skewed distribution on an x-axis from -2 to 10 with q1, med and q3 marked]
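Quartiles can be computed with `statistics.quantiles` from the Python standard library. The sketch below reuses the nine-value set from the median slide. Note that tools differ in how they interpolate between data points, so Excel, GraphPad and Python may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 splits the data into four equal-probability groups and
# returns the three cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"interquartile range = {q3 - q1}")
```

The interquartile range (Q3 minus Q1) plays the role for skewed data that the standard deviation plays for normal data.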

Page 20: HOW TO USE STATISTICS  IN  YOUR RESEARCH

SAMPLING AND EXPERIMENTS

Page 21: HOW TO USE STATISTICS  IN  YOUR RESEARCH

• Bag contains 1000 balls
• They are either red or black
• How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag?
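One way to explore this question is to simulate it. The sketch below builds a hypothetical bag (the 300 red / 700 black split is an assumption made up for the demonstration; in a real experiment it is unknown) and estimates the red proportion from samples of different sizes.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical bag: the true split is an assumption for illustration only.
bag = ["red"] * 300 + ["black"] * 700

def estimate_red_proportion(bag, sample_size):
    """Draw balls without replacement and return the fraction that are red."""
    sample = random.sample(bag, sample_size)
    return sample.count("red") / sample_size

for size in (10, 50, 200):
    print(f"sample of {size}: estimated proportion red = {estimate_red_proportion(bag, size):.2f}")
```

Larger samples tend to give estimates closer to the true proportion, but any single sample can still be off.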

Page 22: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Sampling

• Most experiments involve taking samples from a much larger “population” of data
  – 20 people asked to rate a website
  – An algorithm run 10 times to benchmark speed
  – A measure of quality of service on 10 consecutive days from a network

We want to assume that our sample is representative of the larger population

Page 23: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Sampling

• Imagine throwing a die 600 times…
  – We know what the distribution of outcomes should be theoretically
• Assume we throw the die 30 times
  – We might take ‘good’ samples

[Figure: bar chart of Frequency against die face, 1 to 6]


Page 25: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Sampling

• Imagine throwing a die 600 times…
  – We know what the distribution of outcomes should be theoretically
• Assume we throw the die 30 times
  – We might take ‘good’ samples
• Or we might be ‘unlucky’ with our samples

[Figure: bar chart of Frequency against die face, 1 to 6]
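The difference between 600 throws and 30 throws can be seen by simulating a fair die. This is an illustrative sketch; the seed and the function name `throw_die` are arbitrary choices.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

def throw_die(times):
    """Throw a fair six-sided die and count how often each face (1 to 6) comes up."""
    throws = [random.randint(1, 6) for _ in range(times)]
    counts = Counter(throws)
    return [counts[face] for face in range(1, 7)]

# In theory each face appears 1/6 of the time: about 100 per face in
# 600 throws, and 5 per face in 30 throws.
print("600 throws:", throw_die(600))
print("30 throws: ", throw_die(30))
```

With 30 throws the counts typically stray much further from the theoretical value, in relative terms, than with 600 throws.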

Page 26: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Sampling

• Now imagine we have a weighted die…
• We make 30 throws
• The results look a lot like the ‘unlucky’ results from our previous sample…
• How can we tell whether the die is really different or whether we were just unlucky during our sampling?
• (In most experiments we don’t know what the underlying distribution actually is)

[Figure: two bar charts of Frequency against die face, 1 to 6]


Page 28: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Statistical Tests – Student T-Test

• The t-test tells us how likely it is that we would see a difference at least as large as the one observed if the two sets of data came from the same underlying distribution
• If that probability (the p value) is very small (< 5%) then we assume the samples come from DIFFERENT distributions
• We can then say that the difference between the two experiments is statistically significant
• But…
  – If > 5%, you cannot rule out that both samples came from the same distribution
  – Any differences in mean or standard deviation may be due only to random sampling
  – There is no significant difference between the samples

Excel: =TTEST(Range1, Range2, tails, type)
  – Range1: first set of data
  – Range2: second set of data
  – Tails: set this to 2 (assume a 2-tailed distribution)
  – Type: set this to 2 (an unpaired t-test, assuming equal variances)

Graphpad

Page 29: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Statistical Tests – Student T-Test

• Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s.
• They each run their algorithms 20 times on the same machine and record the results and some descriptive statistics.
• John claims she is wrong: his algorithm is definitely faster
• Is he right?

        Mean    SD
Mary    83.25   10.58
John    79.45   10.17

Two-tailed p value = 0.25

• If both samples came from the same distribution, there would be a 25% chance of seeing a difference at least this large
• Therefore the difference in results could easily be down to random variation in sampling
• There is no statistically significant difference in performance between John’s and Mary’s algorithms
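The p value for Mary and John can be reproduced from the summary statistics alone. The sketch below is one way to do it using only the Python standard library: it assumes an unpaired, equal-variance t-test (Excel's type 2) with 20 runs each, and finds the tail area of the t distribution by numerical integration. All function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus the area under the density between -|t| and |t|.

    The area from 0 to |t| is found with Simpson's rule (steps must be even)
    and doubled, because the density is symmetric about 0.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def unpaired_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance (pooled) unpaired t-test from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, two_tailed_p(t, df)

# Mary: mean 83.25, SD 10.58; John: mean 79.45, SD 10.17; 20 runs each.
t, p = unpaired_t_test(83.25, 10.58, 20, 79.45, 10.17, 20)
print(f"t = {t:.2f}, two-tailed p = {p:.2f}")
```

The result should agree with the two-tailed p value of about 0.25 quoted on the slide, well above the 0.05 threshold.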

Page 30: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Another Example

• Mary and John both roll a die many times and record the mean score.
• Mary claims that John’s die is biased
• Is she right?

        Mean    SD
Mary    3.12    1.68
John    5.19    1.5

Two-tailed p value = 0.00002

• If both samples came from the same distribution, there would be only a 0.002% chance of seeing a difference at least this large
• Therefore the difference in results is statistically significant
• We can conclude that John’s die is different to Mary’s

Page 31: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Some words of caution…

• Strictly speaking, the t-test should only be used if the underlying data distribution is normal

• If you don’t think it is, there are similar tests you can use:
  – Wilcoxon
  – RankSum

Page 32: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Some more tests

• For some experiments, we might have a hypothesis:
  – Students have no preference as to which of 3 browsers they use when they go in the JKCC
• From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis was true
  – A researcher goes into the lab and records which of 3 browsers is being used by 60 students
  – Would expect to see 20 students using each browser
  – He records the actual results observed

Page 33: HOW TO USE STATISTICS  IN  YOUR RESEARCH

CHITEST

• The CHITEST asks:
  – What is the probability of finding the observed results if the hypothesis was true?
• It generates a number called the p-value
  – If p < 0.05, we REJECT the hypothesis
  – If p > 0.05, we ACCEPT the hypothesis (strictly, we fail to reject it)
• In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!)
• In EXCEL: =CHITEST(actualValues, expectedValues)

Page 34: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Chi test

• Students have no preference as to which browser they use when they go in the JKCC

                    EXPECTED  ACTUAL
FIREFOX             20        27
INTERNET EXPLORER   20        15
CHROME              20        18

The two-tailed P value equals 0.1423. By conventional criteria, this difference is considered to be not statistically significant.

P > 0.05 so we ACCEPT the hypothesis

If students really had no preference, there would be a 14% chance of seeing counts at least this far from the expected ones

This is NOT statistically significant (the value must be < 0.05 to be significant), so we have to assume that students have NO preference as to which browser they use, i.e. the data are consistent with the theory
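The browser example can be worked through by hand in Python. This is a sketch using only the standard library: the chi-square statistic is computed from the table, and the p value uses a shortcut that is valid only for 2 degrees of freedom (3 categories minus 1). Function names are hypothetical.

```python
import math

expected = {"Firefox": 20, "Internet Explorer": 20, "Chrome": 20}
actual = {"Firefox": 27, "Internet Explorer": 15, "Chrome": 18}

def chi_square_statistic(actual, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((actual[k] - expected[k]) ** 2 / expected[k] for k in expected)

def p_value_two_degrees_of_freedom(chi2):
    """Upper-tail probability of the chi-square distribution with exactly 2 degrees of freedom.

    Only in the 2-degrees-of-freedom case does the tail probability
    simplify to exp(-chi2 / 2); other cases need a statistics library.
    """
    return math.exp(-chi2 / 2)

chi2 = chi_square_statistic(actual, expected)
p = p_value_two_degrees_of_freedom(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")  # chi-square = 3.90, p = 0.1423
```

The result matches the P value of 0.1423 reported on the slide.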

Page 35: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Linear Regression

• Sometimes you want to find correlations between two variables:
  – QualityOfS & SizeOfNetwork
  – LinesOfCode & SpeedOfExecution
  – Age & TimeSpentOnSocialMedia
• Show trends
• Use to predict future values

[Figure: scatter plot of Mark (y-axis, 0 to 100) against an x-axis from 0 to 100, with fitted line f(x) = 0.695757575757576 x + 4.93333333333334, R² = 0.981336859850719]

Page 36: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Understanding the Graph

[Figure: scatter plot of Mark (y-axis, 0 to 100) against attendance (x-axis, 0 to 100), with fitted line f(x) = 0.695757575757576 x + 4.93333333333334, R² = 0.981336859850719, annotated as follows]

y = mx + c
  – y: variable on y axis (dependent variable)
  – x: variable on x axis (independent variable)
  – m: slope of line
  – c: y-intercept

mark = 0.6958 × attendance + 4.9333
R² = 0.98134: measure of quality of fit (maximum = 1)

Prediction: what mark would a student who attended 75% of the time get?

Mark = 0.6958 × 75 + 4.9333 ≈ 57.12
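The trendline that Excel draws is an ordinary least-squares fit, which can be sketched in a few lines of Python. The attendance and mark values below are invented for illustration (the slide's underlying data is not given); only the final prediction uses the coefficients reported on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    m = sxy / sxx                           # slope
    c = mean_y - m * mean_x                 # y-intercept
    r_squared = (sxy * sxy) / (sxx * syy)   # quality of fit, at most 1
    return m, c, r_squared

def predict(m, c, x):
    """Value of the fitted line y = m*x + c at x."""
    return m * x + c

# Hypothetical attendance (%) and mark data, invented for illustration only.
attendance = [10, 30, 50, 70, 90]
marks = [12, 25, 41, 52, 68]
m, c, r2 = fit_line(attendance, marks)
print(f"mark = {m:.4f} * attendance + {c:.4f}, R^2 = {r2:.4f}")

# Prediction using the rounded coefficients reported on the slide.
print(round(predict(0.6958, 4.9333, 75), 2))  # 57.12
```

An R² close to 1 means the points lie close to the fitted line; it does not by itself show that attendance causes higher marks.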

Page 37: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Doing this in Excel

• Scatterplot of data
  – Make sure it is in columns, with independent variable first (x)
• Chart Layout:
  – Add linear trendline
  – Trendline Options: choose show R² value and equation

[Figure: three Excel charts of Mark (y-axis, 0 to 80) against an x-axis from 0 to 120: the raw scatter plot, the plot with a linear trendline added, and the plot with the trendline equation f(x) = 0.695757575757576 x + 4.93333333333334 and R² = 0.981336859850719 displayed]

Page 38: HOW TO USE STATISTICS  IN  YOUR RESEARCH

PRESENTING YOUR RESULTS

And finally

Page 39: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Chart Junk

“The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.”

“Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”

Page 40: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Examples of chart junk

Page 41: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Much better!

Page 42: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Other bad examples

Page 43: HOW TO USE STATISTICS  IN  YOUR RESEARCH

Too much information!

• Too much info

Page 44: HOW TO USE STATISTICS  IN  YOUR RESEARCH

SUMMARY

• Remember you need to use statistics to properly analyse your work
• Make sure you use the right statistic
• Make sure you present your data/statistics well
• Don’t lie with statistics!

Page 45: HOW TO USE STATISTICS  IN  YOUR RESEARCH

HTTP://BIT.LY/MRW1K4

HTTP://BIT.LY/1FANFQ3

Dropbox links to slides and a workbook