how to use statistics in your research

HOW TO USE STATISTICS IN YOUR RESEARCH

LIES, DAMNED LIES AND STATISTICS!

What we will cover

WHY
• Why do statistics
• Descriptive Statistics
• Distributions
• Sampling & Hypotheses
• Presenting Results
  – Chart junk

HOW
• Graphpad
• EXCEL

Why do you need statistics?

• Why are you doing a research project?
• Also very important in everyday life

• Measure things
• Examine relationships
• Make predictions
• Test hypotheses
• Explore issues
• Explain activities or attitudes
• Make comparisons
• Draw conclusions based on samples
• Develop new theories
• …

Misuse of statistics

Design
• Ignoring some ‘inconvenient data points’
• Focus on certain variables and exclude others
• Alter scales to present your data in a more positive way
• Present correlation as causation


Misuse is generally accidental
• Bias
  – Need to be particularly careful in ‘questionnaire’ type research
  – Also when sampling
• Using the wrong statistical tests
• Making incorrect inferences
  – In going from your sample to the general case
• Incorrectly drawing conclusions based on correlations

Descriptive Statistics

• Used to describe or summarise what your data shows

• Not used to draw any conclusions that extend beyond your own data

• Mean
• Median
• Mode
• Variance
• Standard Deviation

Mean (Average)

• Imagine you have collected some data
  – From running an algorithm on a problem
  – By measuring execution time
  – By asking opinions
• You want to summarise your data
  – Don’t present all the results

$$m = \frac{1}{n} \sum_{i=1}^{n} x_i$$

mean {-30, 1, 2, 3, 4} = -4
mean {0, 1, 2, 3, 4} = 2

Measures centrality

Excel: =AVERAGE(A1:A10)

Graphpad

http://graphpad.com/quickcalcs/CImean1.cfm
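For readers working outside Excel or Graphpad, the same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the function name `arithmetic_mean` and the variable names are made up for this example, and the data are the two sets from the slide.

```python
from statistics import mean

# The two example data sets from the slide.
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]


def arithmetic_mean(values):
    """Add up the values and divide by how many there are."""
    return sum(values) / len(values)


mean_with_outlier = arithmetic_mean(with_outlier)
mean_without_outlier = arithmetic_mean(without_outlier)

print(mean_with_outlier)     # -4.0
print(mean_without_outlier)  # 2.0

# The standard library's statistics.mean gives the same values.
print(mean(with_outlier) == mean_with_outlier)
```

A single extreme value (-30) is enough to pull the mean from 2 down to -4.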


The mean is not the whole story...

Emma’s Algorithm: 50, 49, 50, 50, 50, 50, 51, 51, 49 (MEAN 50)
Malcolm’s Algorithm: 75, 25, 99, 1, 45, 55, 50, 60, 40 (MEAN 50)

Standard Deviation

• Standard Deviation measures something about the spread of your data

• Important as it gives you some indication of reliability or variability of your results

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (m - x_i)^2}$$

sd {-30, 1, 2, 3, 4} = 14.6
sd {0, 1, 2, 3, 4} = 1.6

Measures spread

Excel: =STDEV(A1:A10)

The mean is not the whole story...

Emma’s Algorithm: 50, 49, 50, 50, 50, 50, 51, 51, 49 (MEAN 50, STD 0.71)
Malcolm’s Algorithm: 75, 25, 99, 1, 45, 55, 50, 60, 40 (MEAN 50, STD 28.07)
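The two STD figures can be checked with a short Python sketch. It assumes the run-together digits on the slide are nine results per algorithm (a reading that reproduces the stated mean of 50 and both STD values), and it uses the sample standard deviation with n - 1 in the denominator, which is what Excel's STDEV computes. The function name `sample_sd` is made up for this example.

```python
from math import sqrt
from statistics import stdev

# Assumed reading of the slide: nine results per algorithm.
emma = [50, 49, 50, 50, 50, 50, 51, 51, 49]
malcolm = [75, 25, 99, 1, 45, 55, 50, 60, 40]


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((m - x) ** 2 for x in values) / (n - 1))


emma_sd = sample_sd(emma)
malcolm_sd = sample_sd(malcolm)

for name, results, sd in (("Emma", emma, emma_sd), ("Malcolm", malcolm, malcolm_sd)):
    print(f"{name}: mean = {sum(results) / len(results):.0f}, sd = {sd:.2f}")

# statistics.stdev uses the same n - 1 definition.
print(abs(stdev(malcolm) - malcolm_sd) < 1e-9)
```

Both algorithms have a mean of 50, but the spread shows that Emma's results are far more consistent than Malcolm's.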

True or False ?

The majority of Scots have more than the average number of legs

TRUE!

Most Scots have more than the average number of legs!

• (None have 3 legs)
• Most have 2 legs
• Some have 1 leg
• Some have 0 legs
• The average < 2 (~1.9)

• The mean is not a relevant measure!

When can I use the mean?

• The data that you are sampling should follow a normal distribution
  – Most values are close to the mean, and a few lie at either extreme
  – 68% of values lie within 1 SD of the mean
  – 95% of values lie within 2 SD
• In practice, a lot of data does follow this kind of distribution
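The 68% and 95% figures follow from the shape of the normal curve and can be checked with Python's standard library. This sketch uses `statistics.NormalDist`; the helper name `fraction_within` is made up for this example.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1_sd = fraction_within(1)
within_2_sd = fraction_within(2)

print(f"Within 1 SD: {within_1_sd:.1%}")  # 68.3%
print(f"Within 2 SD: {within_2_sd:.1%}")  # 95.4%
```

The same fractions apply to any normal distribution, whatever its mean and SD, because the calculation is expressed in units of SD.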

But not all data has a normal distribution

• The majority of the data is < m
• More than half the population has less than the mean value
• More than half the population is “below average”!

[Figure: histogram of a non-normal distribution, with m - sd, m and m + sd marked on the x axis]

The Median
• Median: the item with the middle rank
• Rank the items in order, and pick the middle one
  – median {-30, 1, 2, 3, 4} = 2
  – median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2

EXCEL: =MEDIAN(A1:A10)

[Figure: the distribution with the median marked on the x axis]

Example: Mean vs Median

Suppose we ask 7 students how much money they have on them:

Person   Money (£)
John     2
Ann      3
Bob      1
Mary     10
Sue      5
Carol    2
Ken      999

Mean: £146
Median: £3

The median is much less affected by outliers in the data, so here it is more representative of the sample.
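The effect of the outlier can be reproduced in a few lines of Python. This is a sketch using the seven amounts from the slide and the standard library's `statistics` module; the variable names are made up for this example.

```python
from statistics import mean, median

# Amounts in pounds, as listed on the slide.
money = {"John": 2, "Ann": 3, "Bob": 1, "Mary": 10, "Sue": 5, "Carol": 2, "Ken": 999}

amounts = list(money.values())
mean_money = mean(amounts)
median_money = median(amounts)
print(f"All 7 students: mean = £{mean_money:.0f}, median = £{median_money:.0f}")

# Leave out the one outlier (Ken) and see how much each summary moves.
without_ken = [amount for name, amount in money.items() if name != "Ken"]
mean_without_ken = mean(without_ken)
median_without_ken = median(without_ken)
print(f"Without Ken: mean = £{mean_without_ken:.2f}, median = £{median_without_ken:.2f}")
```

Dropping Ken moves the mean from £146 to under £4, while the median only moves from £3 to £2.50.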

Measuring Spread in non-normal data

• Quartiles (25th and 75th percentiles) are a nonparametric measure of spread
  – first quartile (Q1) = lower quartile = cuts off lowest 25% of data
  – third quartile (Q3) = upper quartile = cuts off lowest 75%

[Figure: the distribution with q1, the median and q3 marked on the x axis]
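Quartiles can be computed with `statistics.quantiles` from the Python standard library. The sketch below reuses the nine-value data set from the median slide as an illustration. Note that tools interpolate between data points in slightly different ways, so the two methods shown give different answers on a small data set like this one.

```python
from statistics import quantiles

# Data set reused from the median example.
data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 asks for the three cut points that split the data into quarters.
# The default method is "exclusive".
q1, med, q3 = quantiles(data, n=4)
interquartile_range = q3 - q1
print(q1, med, q3)            # 1.5 2.0 7.0
print(interquartile_range)    # 5.5

# The "inclusive" method treats the data as the whole population.
inclusive_quartiles = quantiles(data, n=4, method="inclusive")
print(inclusive_quartiles)    # [2.0, 2.0, 4.0]
```

Whichever convention is used, the distance between Q1 and Q3 describes the spread of the middle half of the data without assuming a normal distribution.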

SAMPLING AND EXPERIMENTS

• Bag contains 1000 balls
• They are either red or black
• How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag?

Sampling

• Most experiments involve taking samples from a much larger “population” of data
  – 20 people asked to rate a website
  – An algorithm run 10 times to benchmark speed
  – A measure of quality of service on 10 consecutive days from a network

We want to assume that our sample is representative of the larger population
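The bag-of-balls question can be explored with a small simulation. The sketch below is illustrative only: the true split (300 red, 700 black), the sample size of 100 and the function name `estimate_red_proportion` are all assumptions made for this example, since in a real experiment the true proportion is unknown.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical bag: 1000 balls, 300 of them red.
bag = ["red"] * 300 + ["black"] * 700


def estimate_red_proportion(balls, sample_size):
    """Draw a random sample without replacement and return the fraction that is red."""
    sample = random.sample(balls, sample_size)
    return sample.count("red") / sample_size


estimate = estimate_red_proportion(bag, 100)
print(f"Estimated proportion of red balls: {estimate:.2f}")
```

The estimate will rarely equal 0.30 exactly, but a random sample of 100 usually lands close to it. A sample that is not drawn at random (for example, only taking balls from the top of the bag) gives no such assurance.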

Sampling

• Imagine throwing a die 600 times…
  – We know what the distribution of outcomes should be theoretically
• Assume we throw the die 30 times
  – We might take ‘good’ samples
  – Or we might be ‘unlucky’ with our samples

[Figure: bar charts of frequency against die face (1 to 6)]

Sampling
• Now imagine we have a weighted die…
• We make 30 throws
• The results look a lot like the ‘unlucky’ results from our previous sample…
• How can we tell whether the die is really different or whether we were just unlucky during our sampling?
• (In most experiments we don’t know what the underlying distribution actually is)

[Figure: two bar charts of frequency against die face (1 to 6)]
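The die experiment is easy to simulate, which makes the problem concrete. In this sketch the weighting of the loaded die (a six is three times as likely as any other face) and the function name `throw_die` are assumptions chosen for illustration.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

FACES = [1, 2, 3, 4, 5, 6]


def throw_die(n_throws, weights=None):
    """Throw a die n_throws times and return the frequency of each face, 1 to 6."""
    throws = random.choices(FACES, weights=weights, k=n_throws)
    counts = Counter(throws)
    return [counts[face] for face in FACES]


fair_600 = throw_die(600)                                # large sample, fair die
fair_30 = throw_die(30)                                  # small sample, fair die
weighted_30 = throw_die(30, weights=[1, 1, 1, 1, 1, 3])  # hypothetical loaded die

print("Fair die, 600 throws:    ", fair_600)
print("Fair die, 30 throws:     ", fair_30)
print("Weighted die, 30 throws: ", weighted_30)
```

With 600 throws the frequencies are close to 100 each. With only 30 throws, a fair die can produce lopsided counts that are hard to tell apart from the weighted die by eye, and that is the gap a statistical test fills.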

Statistical Tests – Student’s t-test
• The t-test gives a p-value: the probability of seeing a difference between the two sample means at least as large as the one observed, if both sets of data came from the same underlying distribution
• If the p-value is very small (< 5%) then we conclude the samples come from DIFFERENT distributions
• We can then say that the difference between the two experiments is statistically significant
• But…
  – If the p-value is > 5%, we cannot reject the possibility that both samples came from the same distribution
  – Any differences in mean or standard deviation may be due only to random sampling
  – There is no significant difference between the samples

Excel: =TTEST(Range1, Range2, tails, type)

Range1 – first set of data
Range2 – second set of data
Tails: set this to 2 (a 2-tailed test)
Type: set this to 2 (an unpaired t-test assuming equal variances)

Graphpad

http://www.graphpad.com/quickcalcs/ttest1/?Format=C

Statistical Tests – Student’s t-test
• Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s.
• They each run their algorithms 20 times on the same machine and record the results and some descriptive statistics.
• John claims she is wrong: his algorithm is definitely faster
• Is he right?

        Mean    SD
Mary    83.25   10.58
John    79.45   10.17

Two-tailed p value = 0.25

• If both samples came from the same distribution, there would be a 25% chance of seeing a difference at least this large
• Therefore the difference in results could easily be down to random variation in sampling
• There is no statistically significant difference in performance between John’s and Mary’s algorithms
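The p-value in this example can be reproduced from the summary statistics alone. The sketch below is one possible way to do it in plain Python, not the method used by Excel or Graphpad internally. It assumes 20 runs per algorithm (as stated on the slide) and an unpaired, equal-variance test (Excel's type 2). Because the standard library has no t distribution, the tail area is found by numerically integrating the t density with Simpson's rule. All function names are made up for this example.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # the density is symmetric about 0
    return 1 - central_area


def unpaired_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Mary: mean 83.25, SD 10.58; John: mean 79.45, SD 10.17; 20 runs each.
t_stat, df = unpaired_t(83.25, 10.58, 20, 79.45, 10.17, 20)
p_value = two_tailed_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df}, two-tailed p = {p_value:.2f}")
```

The result is well above 0.05, so the difference of 3.8 between the two means is not statistically significant given this much run-to-run variation.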

Another Example

• Mary and John both roll a die many times and record the mean score.
• Mary claims that John’s die is biased
• Is she right?

        Mean    SD
Mary    3.12    1.68
John    5.19    1.5

Two-tailed p value = 0.00002

• If both samples came from the same distribution, there would be only a 0.002% chance of seeing a difference at least this large
• Therefore the difference in results is statistically significant
• We can conclude that John’s die is different to Mary’s

Some words of caution…

• Strictly speaking, the t-test should only be used if the underlying data distribution is normal

• If you don’t think it is, there are similar tests you can use:
  – Wilcoxon
  – RankSum

Some more tests

• For some experiments, we might have a hypothesis:
  – Students have no preference as to which of 3 browsers they use when they go in the JKCC
• From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis was true
  – A researcher goes into the lab and records which of 3 browsers is being used by 60 students
  – Would expect to see 20 students using each browser
  – He records the actual results observed

CHITEST

• The CHITEST asks:
  – What is the probability of finding results at least as far from the expected values as those observed, if the hypothesis was true?
• It generates a number called the p-value
  – If p < 0.05, we REJECT the hypothesis
  – If p > 0.05, we cannot reject the hypothesis
• In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!)
• In EXCEL: =CHITEST(actualValues, expectedValues)

Chi test
• Students have no preference as to which browser they use when they go in the JKCC

                    EXPECTED   ACTUAL
FIREFOX             20         27
INTERNET EXPLORER   20         15
CHROME              20         18

The two-tailed P value equals 0.1423. By conventional criteria, this difference is considered to be not statistically significant.

P > 0.05 so we cannot reject the hypothesis

If students really had no preference, there would be a 14% chance of seeing results at least this far from the expected values

This is NOT statistically significant (the value must be < 0.05 to be significant), so we have no evidence that students prefer one browser over another
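The P value of 0.1423 can be checked by hand. The chi-squared statistic is the sum of (actual - expected)² / expected over the three browsers. With three categories there are 2 degrees of freedom, and for exactly 2 degrees of freedom the tail probability has the simple closed form exp(-x/2). The sketch below relies on that special case, so it would need a different tail calculation for any other number of categories.

```python
from math import exp

# Counts from the slide.
expected = {"Firefox": 20, "Internet Explorer": 20, "Chrome": 20}
actual = {"Firefox": 27, "Internet Explorer": 15, "Chrome": 18}

# Chi-squared statistic: how far the observed counts are from the expected ones.
chi_squared = sum(
    (actual[browser] - expected[browser]) ** 2 / expected[browser]
    for browser in expected
)
degrees_of_freedom = len(expected) - 1

# Closed form valid only when degrees_of_freedom == 2.
p_value = exp(-chi_squared / 2)

print(f"chi-squared = {chi_squared:.2f}, df = {degrees_of_freedom}, p = {p_value:.4f}")
# chi-squared = 3.90, df = 2, p = 0.1423
```

Since 0.1423 is greater than 0.05, the observed counts are not far enough from 20 each to reject the hypothesis of no preference.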

Linear Regression

• Sometimes you want to find correlations between two variables:
  – QualityOfS & SizeOfNetwork
  – LinesOfCode & SpeedOfExecution
  – Age & TimeSpentOnSocialMedia
• Show trends
• Use to predict future values

[Figure: scatter plot of Mark with fitted line f(x) = 0.6958x + 4.9333, R² = 0.9813; both axes run from 0 to 100]

Understanding the Graph

[Figure: scatter plot of Mark (y axis, 0 to 100) against attendance (x axis, 0 to 100), with fitted line f(x) = 0.6958x + 4.9333, R² = 0.9813]

y = mx + c
• y: variable on y axis (dependent variable)
• x: variable on x axis (independent variable)
• m: slope of line
• c: y-intercept

mark = 0.6958 × attendance + 4.9333
R² = 0.98134 (measure of quality of fit, maximum = 1)

Prediction: what mark would a student who attended 75% of the time get?

Mark = 0.6958 × 75 + 4.9333 = 57.12
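A least-squares line and its R² can also be computed directly. In the sketch below, the attendance and mark values are hypothetical (the data behind the slide's chart are not given), and the function name `fit_line` is made up for this example. Only the final prediction uses the slope and intercept quoted on the slide.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: 1 minus (unexplained variation / total variation).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical attendance (%) and marks, NOT the data behind the slide's chart.
attendance = [10, 30, 50, 70, 90]
marks = [12, 25, 41, 52, 68]

slope, intercept, r_squared = fit_line(attendance, marks)
print(f"mark = {slope:.4f} * attendance + {intercept:.4f}, R^2 = {r_squared:.4f}")

# Prediction for 75% attendance using the coefficients quoted on the slide.
predicted_mark = 0.6958 * 75 + 4.9333
print(f"Predicted mark at 75% attendance: {predicted_mark:.1f}")
```

A prediction like this is only trustworthy inside the range of attendance values that were actually observed, and a high R² describes how well the line fits, not whether attendance causes the marks.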

Doing this in Excel

• Scatterplot of data
  – Make sure it is in columns, with independent variable first (x)
• Chart Layout:
  – Add linear trendline
  – Trendline Options: choose to show the R² value and the equation

[Figure: three Excel charts of Mark: the scatter plot, the plot with a linear trendline, and the plot with the trendline equation f(x) = 0.6958x + 4.9333 and R² = 0.9813]

PRESENTING YOUR RESULTS

And finally

Chart Junk

“The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.”

“Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”

Examples of chart junk

Much better!

Other bad examples

Too much information!

SUMMARY

• Remember you need to use statistics to properly analyse your work

• Make sure you use the right statistic
• Make sure you present your data/statistics well
• Don’t lie with statistics!

HTTP://BIT.LY/MRW1K4

HTTP://BIT.LY/1FANFQ3

Dropbox links to slides and a workbook
