how to use statistics in your research
HOW TO USE STATISTICS IN YOUR RESEARCH
LIES, DAMNED LIES AND STATISTICS!
What we will cover
WHY
• Why do statistics
• Descriptive Statistics
• Distributions
• Sampling & Hypotheses
• Presenting Results
– Chart junk
HOW
• Graphpad
• EXCEL
Why do you need statistics ?
• Why are you doing a research project!
• Also very important in everyday life
• Measure things
• Examine relationships
• Make predictions
• Test hypotheses
• Explore issues
• Explain activities or attitudes
• Make comparisons
• Draw conclusions based on samples
• Develop new theories
• …
Misuse of statistics
Design
• Ignoring some ‘inconvenient data points’
• Focus on certain variables and exclude others
• Alter scales to present your data in a more positive way
• Present correlation as causation
[Figure: two charts of the same categories (1 and 2), one with the y-axis running from 0 to 40 and one with the y-axis running from 10 to 40]
Misuse generally accidental
• Bias
– Need to be particularly careful in ‘questionnaire’ type research
– Also when sampling
• Using the wrong statistical tests
• Making incorrect inferences
– In going from your sample to the general case
• Incorrectly drawing conclusions based on correlations
Descriptive Statistics
• Used to describe or summarise what your data shows
• Not used to draw any conclusions that extend beyond your own data
• Mean
• Median
• Mode
• Variance
• Standard Deviation
Mean (Average)
• Imagine you have collected some data
– From running an algorithm on a problem
– By measuring execution time
– By asking opinions
• You want to summarise your data
– Don’t present all the results
$m = \frac{1}{n}\sum_{i=1}^{n} x_i$
mean {-30, 1, 2, 3, 4} = -4
mean {0, 1, 2, 3, 4} = 2
Measures centrality
Excel: = AVERAGE(A1:A10)
Graphpad
The mean is not the whole story..
Emma’s Algorithm: 50, 49, 50, 50, 50, 50, 51, 51, 49 (MEAN 50)
Malcolm’s Algorithm: 75, 25, 99, 1, 45, 55, 50, 60, 40 (MEAN 50)
Standard Deviation
• Standard Deviation measures something about the spread of your data
• Important as it gives you some indication of reliability or variability of your results
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (m - x_i)^2}$
sd {-30, 1, 2, 3, 4} = 14.6
sd {0, 1, 2, 3, 4} = 1.6
Measures spread
Excel: = STDEV(A1:A10)
The mean is not the whole story..
Emma’s Algorithm: MEAN 50, STD 0.71
Malcolm’s Algorithm: MEAN 50, STD 28.07
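The mean and standard deviation formulas can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `mean` and `sample_sd` are illustrative, and the two lists of run results are the values as read from the slide.

```python
import math

def mean(xs):
    """m = (1/n) * sum of x_i"""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """s = sqrt((1/(n-1)) * sum of (m - x_i)^2), the sample standard deviation."""
    m = mean(xs)
    return math.sqrt(sum((m - x) ** 2 for x in xs) / (len(xs) - 1))

# The two small example sets from the slides
for data in ([-30, 1, 2, 3, 4], [0, 1, 2, 3, 4]):
    print(data, "mean =", mean(data), "sd =", round(sample_sd(data), 1))

# Run results for the two algorithms, as read from the slide
emma = [50, 49, 50, 50, 50, 50, 51, 51, 49]
malcolm = [75, 25, 99, 1, 45, 55, 50, 60, 40]
for name, data in (("Emma", emma), ("Malcolm", malcolm)):
    print(name, "mean =", mean(data), "sd =", round(sample_sd(data), 2))
```

Both algorithms have the same mean, but the standard deviations differ by a factor of about 40, which is the point of the slide.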
True or False ?
The majority of Scots have more than the average number of legs
TRUE!
Most Scots have more than the average number of legs!
• (None have 3 legs)
• Most have 2 legs
• Some have 1 leg
• Some have 0 legs
• The average < 2 (~1.9)
• The mean is not a relevant measure!
When can I use the mean?
• The data that you are sampling should follow a normal distribution
• Most values are close to the mean, and a few lie at either extreme
• 68% of values within 1 SD of mean
• 95% of values within 2 SD
• In practice, a lot of data does follow this kind of distribution
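The 68% and 95% figures can be checked by simulation. The sketch below draws made-up normally distributed values; the mean of 50, the SD of 10, the sample size and the seed are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical measurements drawn from a normal distribution
values = [random.gauss(50, 10) for _ in range(20_000)]

m = statistics.mean(values)
s = statistics.stdev(values)

# Fraction of values lying within 1 and 2 standard deviations of the mean
within_1sd = sum(abs(v - m) <= s for v in values) / len(values)
within_2sd = sum(abs(v - m) <= 2 * s for v in values) / len(values)

print(f"within 1 SD: {within_1sd:.1%}")
print(f"within 2 SD: {within_2sd:.1%}")
```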
But not all data has a normal distribution
• majority of the data is < m
• more than half the population has less than the mean value
• more than half the population is “below average”!
[Figure: distribution with m, m - sd and m + sd marked]
The Median
• Median : item with average rank
• Rank the items in order, and pick the middle one
– median {-30, 1, 2, 3, 4} = 2
– median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2
EXCEL: =MEDIAN(A1:A10)
[Figure: distribution with the median marked]
Example: Mean vs Median
Suppose we ask 7 students how much money they have on them:
Person Money
John 2
Ann 3
Bob 1
Mary 10
Sue 5
Carol 2
Ken 999
Mean: £146
Median: £3
The median is much less affected by outliers in the data
It is more representative of the sample
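The effect of the outlier can be reproduced with Python's standard `statistics` module. This is a sketch using the table above; the variable names, and the step of dropping Ken to show the sensitivity, are additions for illustration.

```python
import statistics

# Money (in pounds) from the table
money = {"John": 2, "Ann": 3, "Bob": 1, "Mary": 10, "Sue": 5, "Carol": 2, "Ken": 999}

amounts = list(money.values())
mean_all = statistics.mean(amounts)
median_all = statistics.median(amounts)
print("all 7 students: mean =", mean_all, "median =", median_all)

# Drop the outlier (Ken) and see how much each measure moves
without_ken = [v for name, v in money.items() if name != "Ken"]
mean_rest = statistics.mean(without_ken)
median_rest = statistics.median(without_ken)
print("without Ken:    mean =", round(mean_rest, 2), "median =", median_rest)
```

Removing one person moves the mean from £146 to under £4, while the median barely changes.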
Measuring Spread in non-normal data
• Quartiles (25th and 75th percentiles) are a nonparametric measure of spread
– first quartile (Q1) = lower quartile = cuts off lowest 25% of data
– third quartile (Q3) = upper quartile = cuts off lowest 75%
[Figure: distribution with q1, med and q3 marked]
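Quartiles can be computed with `statistics.quantiles` (Python 3.8 or later). The sketch below reuses the nine-value set from the median slide. Note that tools differ in how they interpolate between data points; the `"inclusive"` method is chosen here as an assumption, and other conventions can give slightly different cut points.

```python
import statistics

data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 splits the data into four equal parts and returns the three cut points
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the spread of the middle 50% of the data

print("Q1 =", q1, "median =", med, "Q3 =", q3, "IQR =", iqr)
```

The two large values (10 and 27) barely influence the quartiles, in the same way that they barely influence the median.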
SAMPLING AND EXPERIMENTS
• Bag contains 1000 balls
• They are either red or black
• How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag ?
Sampling
• Most experiments involve taking samples from a much larger “population” of data
– 20 people asked to rate a website
– An algorithm run 10 times to benchmark speed
– A measure of quality of service on 10 consecutive days from a network
We want to assume that our sample is representative of the larger population
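The bag-of-balls question can be simulated. In this sketch the true contents of the bag (300 red, 700 black), the sample size of 100 and the seed are all assumptions made up for illustration; in a real experiment the true proportion is exactly what is unknown.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical bag of 1000 balls (the 300/700 split is an assumption)
bag = ["red"] * 300 + ["black"] * 700

# Draw 100 balls without replacement and estimate the proportion of red
sample = random.sample(bag, 100)
estimate = sample.count("red") / len(sample)

print(f"estimated proportion red: {estimate:.2f} (true value in this simulation: 0.30)")
```

A different seed gives a different sample and a slightly different estimate, which is why the sample has to be representative for the estimate to be trusted.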
Sampling
• Imagine throwing a die 600 times…
– We know what the distribution of outcomes should be theoretically
• Assume we throw the die 30 times
– We might take ‘good’ samples
– Or we might be ‘unlucky’ with our samples
[Figure: bar charts of Frequency against die face (1 to 6)]
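The difference between 600 throws and 30 throws is easy to see in a simulation of a fair die. This is a sketch; the helper name `throw_counts` and the seed are illustrative choices.

```python
import random
from collections import Counter

random.seed(2)  # fixed seed so the run is repeatable

def throw_counts(n_throws):
    """Throw a fair six-sided die n_throws times; return the count for faces 1 to 6."""
    counts = Counter(random.randint(1, 6) for _ in range(n_throws))
    return [counts[face] for face in range(1, 7)]

counts_600 = throw_counts(600)  # theory says about 100 of each face
counts_30 = throw_counts(30)    # theory says about 5 of each face

print("600 throws:", counts_600)
print(" 30 throws:", counts_30)
```

With only 30 throws the counts usually look much more uneven relative to their expected value than they do with 600, even though the die is fair.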
Sampling
• Now imagine we have a weighted die…
• We make 30 throws
• The results look a lot like the ‘unlucky’ results from our previous sample…
• How can we tell whether the die is really different or whether we were just unlucky during our sampling…
• (In most experiments we don’t know what the underlying distribution actually is)
[Figure: two bar charts of Frequency against die face (1 to 6)]
Statistical Tests – Student TTest
• The t-test gives a p-value: the probability of seeing a difference between the two sets of data at least as large as the one observed, if both came from the same underlying distribution
• If the probability is very small (< 5%) then we assume the samples come from DIFFERENT distributions
• We can then say that the difference between the two experiments is statistically significant
• But…
– If >5%, we cannot conclude that the samples came from different distributions
– Any differences in mean or standard deviation could be due to random sampling alone
– There is no significant difference between the samples
Excel: =TTEST(Range1, Range2, tails, type)
Range 1 – first set of data
Range 2 – second set of data
Tails: set this to 2 (a two-tailed test)
Type: set this to 2 (an unpaired t-test assuming equal variances)
Graphpad
Statistical Tests – Student TTest
• Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s.
• They each run their algorithms 20 times on the same machine and record the results and some descriptive statistics.
• John claims she is wrong – his algorithm is definitely faster
• Is he right ?
Mean SD
Mary 83.25 10.58
John 79.45 10.17
Two-tailed p value = 0.25
• If there were no real difference between the algorithms, there would be a 25% chance of seeing a difference at least this large
• Therefore the difference in results could be down to random sampling variation alone
• There is no statistically significant difference in performance between John’s and Mary’s algorithms
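The p-value for Mary and John can be reproduced from the summary statistics alone. The sketch below assumes an unpaired, equal-variance t-test (Excel's type 2) with 20 runs each, and computes the tail probability by numerically integrating the t density with Simpson's rule so that only the standard library is needed. The function names are illustrative and not taken from any package.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=10_000):
    """P(|T| >= |t|): one minus the area under the density between -|t| and |t|."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = 2 * a / steps  # Simpson's rule needs an even number of steps
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    return 1 - total * h / 3

def unpaired_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance (pooled) two-sample t-test from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    return t, df, two_tailed_p(t, df)

# Mary: mean 83.25, SD 10.58; John: mean 79.45, SD 10.17; 20 runs each
t, df, p = unpaired_t_test(83.25, 10.58, 20, 79.45, 10.17, 20)
print(f"t = {t:.2f}, df = {df}, two-tailed p = {p:.2f}")
```

In practice a statistics package would be used for the tail probability; the integration is written out here only to keep the example self-contained.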
Another Example
• Mary and John both roll a die many times and record the mean score.
• Mary claims that John’s die is biased
• Is she right ?
Mean SD
Mary 3.12 1.68
John 5.19 1.5
Two-tailed p value = 0.00002
• If the two dice behaved the same, there would be only a 0.002% chance of seeing a difference at least this large
• Therefore the difference in results is statistically significant
• We can conclude that John’s die is different to Mary’s
Some words of caution…
• Strictly speaking, the t-test should only be used if the underlying data distribution is normal
• If you don’t think it is, there are similar tests you can use:
– Wilcoxon
– RankSum
Some more tests
• For some experiments, we might have a hypothesis:
– Students have no preference as to which of 3 browsers they use when they go in the JKCC
• From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis was true
– A researcher goes into the lab and records which of 3 browsers is being used by 60 students
– Would expect to see 20 students using each browser
– He records the actual results observed
CHITEST
• The CHITEST asks:
– What is the probability of finding results at least as far from the expected ones as those observed, if the hypothesis was true ?
• It generates a number called the p-value
– If p < 0.05, we REJECT the hypothesis
– If p > 0.05, we ACCEPT the hypothesis (strictly, we fail to reject it)
• In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!)
• In EXCEL: =CHITEST(actualValues, expectedValues)
Chi test
• Students have no preference as to which browser they use when they go in the JKCC

                   EXPECTED  ACTUAL
FIREFOX            20        27
INTERNET EXPLORER  20        15
CHROME             20        18
The two-tailed P value equals 0.1423. By conventional criteria, this difference is considered to be not statistically significant.
P > 0.05 so we ACCEPT the hypothesis
If students had no preference, there would be a 14% chance of seeing results at least this far from the expected ones
This is NOT statistically significant (the value must be < 0.05 to be significant) – we have to assume that students have NO preference as to which browser they use, i.e. the hypothesis stands
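The P value of 0.1423 can be reproduced by hand. The sketch below computes the chi-square statistic from the expected and actual counts. With three categories there are 2 degrees of freedom, and for exactly 2 degrees of freedom the tail probability has the closed form exp(-χ²/2), so no statistics package is needed. That shortcut does not apply to other numbers of categories.

```python
import math

expected = {"Firefox": 20, "Internet Explorer": 20, "Chrome": 20}
actual = {"Firefox": 27, "Internet Explorer": 15, "Chrome": 18}

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected
chi2 = sum((actual[b] - expected[b]) ** 2 / expected[b] for b in expected)

df = len(expected) - 1  # 3 categories give 2 degrees of freedom

# Closed form for the tail probability, valid only when df == 2
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```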
Linear Regression
• Sometimes you want to find correlations between two variables:
– QualityOfS & SizeOfNetwork
– LinesOfCode & SpeedOfExecution
– Age & TimeSpentOnSocialMedia
• Show trends
• Use to predict future values
[Figure: scatter plot of Mark with linear trendline, f(x) = 0.695757575757576 x + 4.93333333333334, R² = 0.981336859850719]
Understanding the Graph
[Figure: the same scatter plot of Mark against attendance, annotated]
y = mx + c
• y: variable on y axis (dependent variable)
• x: variable on x axis (independent variable)
• m: slope of line
• c: y-intercept
mark = 0.6958 × attendance + 4.9333, R² = 0.98134
• R²: measure of quality of fit (maximum = 1)
Prediction: what mark would a student who attended 75% of the time get ?
Mark = 0.6958*75 + 4.9333 = 57.12
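The slope, intercept and R² that a trendline reports come from an ordinary least-squares fit, sketched below. The slides do not give the underlying attendance and mark values, so the data points here are invented for illustration and will not reproduce the slide's equation; the final prediction uses the coefficients quoted on the slide. The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx               # slope
    c = mean_y - m * mean_x     # y-intercept
    r_squared = sxy ** 2 / (sxx * syy)
    return m, c, r_squared

# Hypothetical attendance (%) and mark data, made up for this example
attendance = [10, 30, 50, 70, 90]
mark = [12, 25, 41, 52, 68]

m, c, r2 = fit_line(attendance, mark)
print(f"mark = {m:.4f} * attendance + {c:.4f}, R^2 = {r2:.4f}")

# Prediction for 75% attendance using the coefficients quoted on the slide
predicted = 0.6958 * 75 + 4.9333
print(f"predicted mark at 75% attendance: {predicted:.2f}")
```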
Doing this in Excel
• Scatterplot of data
– Make sure it is in columns, with independent variable first (x)
• Chart Layout:
– Add linear trendline
– Trendline Options: choose show R² value and equation
[Figure: the scatter plot of Mark at three stages: data only, with the linear trendline added, and with the equation and R² value displayed]
And finally
PRESENTING YOUR RESULTS
Chart Junk
“The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.”
“Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”
Examples of chart junk
Much better!
Other bad examples
Too much information!
• Too much info
SUMMARY
• Remember you need to use statistics to properly analyse your work
• Make sure you use the right statistic
• Make sure you present your data/statistics well
• Don’t lie with statistics !
HTTP://BIT.LY/MRW1K4
HTTP://BIT.LY/1FANFQ3
Dropbox links to slides and a workbook