Introduction to Statistics by Harry
8/14/2019 Introduction to Statistics by Harry
http://slidepdf.com/reader/full/introduction-to-statistics-by-harry 1/24
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Gentle, J (2002) Elements of Computational Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Who am I?
• Dr. Harry Erwin BS MA PhD MIET MBCS
• My PhD was awarded in bioinformatics. Although my research interests are in neuroscience, I’ve had the coursework and understand current research directions in computational biology and statistics. I’ve also had the coursework for a PhD in mathematics.
• I teach computing and neuroscience here at the University of Sunderland.
Doing Statistics
• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.
• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.
• We’ll look at modelling in detail later.
Statistical Models
• Often expressed as a set of equations relating data elements.
• Can include probability distributions for the elements. If this is the case, you have a stochastic model.
• The model should be free to evolve based on data mining.
Common Stochastic Models
• Parameterized statistical distributions, such as the normal distribution, binomial distribution, or the chi-squared distribution.
• Sometimes more complicated, where you might need to use simulation, resampling, and visualization to determine the parameters of the model.
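As a rough illustration of the simulation and resampling idea on this slide, the sketch below estimates the mean of a sample and uses bootstrap resampling to attach a standard error to it. The data are simulated from a normal distribution with made-up parameters (mean 10, standard deviation 2), and the function name `bootstrap_mean` is hypothetical; none of this comes from the lecture itself.

```python
import random
import statistics


def bootstrap_mean(data, n_resamples=2000, seed=0):
    """Return (sample mean, bootstrap standard error of the mean)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = [
        statistics.fmean(rng.choices(data, k=n))
        for _ in range(n_resamples)
    ]
    return statistics.fmean(data), statistics.stdev(means)


# Simulated data: the "true" parameters are assumptions for illustration.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(50)]

estimate, std_error = bootstrap_mean(sample)
print(f"estimated mean = {estimate:.2f}, bootstrap SE = {std_error:.2f}")
```

The spread of the resampled means stands in for the uncertainty of the estimate, which is useful when no convenient formula is available for the model at hand.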
Visualization
• Multiple views are necessary, particularly for multivariate data.
• Be able to zoom in on the data, as a few points can obscure the interesting structure.
• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.
• Watch out for time-ordered or location-ordered data, particularly if time or location is not explicitly reported.
Plots
• Use simple plots to start with.
• Watch for rounded data, shown by horizontal strata in the data. That often signals other problems.
• There are a number of plotting tutorials; consult them.
Statistical Activities
• Data collection (ideally the statistician has a say on how the data are collected)
• Description of a dataset
– Averages
– Spreads
– Extreme points
• Inference within a model or collection of models
• Model selection
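The "description of a dataset" step can be sketched in a few lines. The example below computes averages, spreads, and extreme points for a small made-up set of measurements; the function name `describe` and the numbers are assumptions for illustration only.

```python
import statistics


def describe(values):
    """Summarise a dataset by its averages, spreads, and extreme points."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up measurements, with one extreme point (9.8).
measurements = [4.1, 3.9, 4.4, 4.0, 4.2, 9.8, 4.3]
summary = describe(measurements)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Note how the single extreme point pulls the mean well above the median, which is one reason to report more than one kind of average.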
How to Do It
• Start by determining what sort of statistical analysis you will be doing. You need to know:
– Which variable is the response variable?
– Which are the explanatory variables?
– What kind are the explanatory variables?
– What kind of response variable do you have?
• If you have multiple response variables, you need to do multivariate analysis (more advanced).
Basic Methods
• If all explanatory variables are continuous, plan on a regression analysis.
• If all explanatory variables are categorical, plan for an analysis of variance (ANOVA).
• If you have a mix, plan for an analysis of covariance (ANCOVA).
Effect of the Response Variable
• If the response variable is continuous, then plan on a normal regression, ANOVA, or ANCOVA.
• If the response variable is a proportion, do a logistic regression.
• If a count, you need a log-linear model.
• If binary, you need a binary logistic analysis.
• If time to event or time at death, you will be doing a survival analysis.
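The rules of thumb on the "Basic Methods" and "Effect of the Response Variable" slides amount to a small decision procedure, sketched below. The function name `plan_analysis` and the string labels for the kinds of variable are invented for this sketch; it only restates the slides and is not a substitute for judgement.

```python
def plan_analysis(response_kind, explanatory_kinds):
    """Suggest an analysis from the kinds of variables involved.

    The string labels ("continuous", "categorical", "proportion", ...)
    are conventions invented for this sketch.
    """
    other_responses = {
        "proportion": "logistic regression",
        "count": "log-linear model",
        "binary": "binary logistic analysis",
        "time to event": "survival analysis",
    }
    if response_kind in other_responses:
        return other_responses[response_kind]
    if response_kind != "continuous":
        raise ValueError(f"unknown response kind: {response_kind!r}")

    # Continuous response: the explanatory variables decide the method.
    kinds = set(explanatory_kinds)
    if not kinds or not kinds <= {"continuous", "categorical"}:
        raise ValueError("explanatory kinds must be 'continuous' or 'categorical'")
    if kinds == {"continuous"}:
        return "regression"
    if kinds == {"categorical"}:
        return "ANOVA"
    return "ANCOVA"  # a mix of continuous and categorical


print(plan_analysis("continuous", ["continuous", "categorical"]))
print(plan_analysis("count", ["categorical"]))
```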
Variation
• You want to understand how the response is dependent on variation in the explanatory variables, but you are also interested in lack of dependence.
• Design the simplest model that explains the data adequately.
Significance
• You have to determine what the probability of a false alarm will be, that is, the chance that you will think something is significant which really is not.
• Typical values are 5%, 1%, and 0.1%.
• Don’t test every hypothesis. Some will appear significant purely by chance.
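To see why testing every hypothesis is risky, the sketch below computes the chance of at least one false alarm when several independent tests are run and every null hypothesis is actually true. The function name is hypothetical, and independence of the tests is an assumption of this sketch.

```python
def chance_of_false_alarm(alpha, n_tests):
    """Probability of at least one false alarm in n independent tests
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    p = chance_of_false_alarm(0.05, n_tests)
    print(f"{n_tests:>3} tests at 5%: P(at least one false alarm) = {p:.2f}")
```

With 20 tests at the 5% level, the chance of at least one spurious "significant" result is already about 64%.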
Good and Bad Hypotheses
• ‘There are vultures in the local park.’
• ‘There are no vultures in the local park.’
• Which is testable?
• Discuss…
Answer
• The ‘null hypothesis’ is testable.
• ‘There are no vultures in the local park.’
• You test it by taking measurements and showing that if the null hypothesis were true, the chance of those measurements would be close to zero.
• Discuss further…
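The logic of the test can be sketched with a different, deliberately simple example: a coin that is suspected of being biased. The null hypothesis is that the coin is fair, and the code computes how likely the observed result (or a more extreme one) would be if that were true. The data (9 heads in 10 tosses) and the function name are made up for illustration.

```python
from math import comb


def upper_tail_probability(successes, trials, p_null=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 9 heads in 10 tosses, null hypothesis "the coin is fair".
p_value = upper_tail_probability(9, 10)
print(f"P(9 or more heads | fair coin) = {p_value:.4f}")
```

The probability is 11/1024, roughly 1%, so measurements like these would be surprising under the null hypothesis.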
Experimental Design
• Replication
– Increases reliability, so be thorough. Often the answer is ‘30’.
– Discuss why.
• Randomization
– Reduces systematic bias, so do it properly
– Almost never done properly
– Discuss why.
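One way to randomize properly is to let a random number generator, rather than the experimenter, decide which unit receives which treatment. The sketch below shuffles the experimental units and deals them out to the treatments; the function name `randomize`, the twelve plots, and the treatment labels are hypothetical.

```python
import random


def randomize(units, treatments, seed=None):
    """Randomly assign units to treatments in (near-)equal groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the treatments in turn.
    return {
        treatment: sorted(shuffled[i::len(treatments)])
        for i, treatment in enumerate(treatments)
    }


plots = range(1, 13)  # twelve hypothetical experimental plots
assignment = randomize(plots, ["control", "treated"], seed=1)
for treatment, members in assignment.items():
    print(treatment, members)
```

Fixing the seed makes the assignment reproducible for the record; in a real experiment the seed should be chosen before looking at the units.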
Controls
• “No controls, no conclusions.”
• A ‘control experiment’ is one where you don’t apply the treatment or don’t enable the part of your experiment that is supposed to produce the different outcome.
• You’re comparing the results when the treatment is applied to the results with no treatment.
Replication
• Must be independent
• Not part of a time series
• Not grouped together in space
• Of an appropriate spatial scale
• Covers the normal variation in initial conditions.
Error Types
Decision | Null hypothesis actually true | Null hypothesis actually false
Accept null hypothesis | Correct (no paper but no embarrassment) | Type II (β) error (further experiments can change this)
Reject null hypothesis | Type I (α) error (can result in a paper you have to withdraw) | Correct (a publishable paper)
Typical α and β values
• You usually want the probability of rejecting the null hypothesis when it is true (α) to be less than 5%.
• You usually want the probability of accepting the null hypothesis when it is false (β) to be less than 20%.
• The power of a test is 1 - β, or greater than 80% in this case.
• Rule of Thumb: the number of replicates to reject the null hypothesis with probability 80% is about 8s²/d², where s² is the variance in the response and d is the size of the difference to be detected in a single sample.
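The rule of thumb is easy to turn into a small calculator. The sketch below applies n ≈ 8s²/d² and rounds up to a whole number of replicates; the function name and the example numbers are hypothetical.

```python
import math


def replicates_needed(variance, difference):
    """Rule-of-thumb replicate count, n ~ 8 * s^2 / d^2, rounded up."""
    if difference == 0:
        raise ValueError("difference to detect must be non-zero")
    return math.ceil(8.0 * variance / difference**2)


# Hypothetical numbers: response variance 4.0, difference to detect 1.0 or 2.0.
print(replicates_needed(4.0, 1.0))  # 32
print(replicates_needed(4.0, 2.0))  # 8
```

Halving the difference you want to detect quadruples the number of replicates, which is why small effects are expensive to demonstrate.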
Inference
• Strong inference
– A clear hypothesis
– An acceptable test
• Weak inference
– Natural experiments
• Conclusions from natural experiments are hypotheses. Can still produce good papers.
• Discuss
How Long to Go On?
• To stop the experiment as soon as a pleasing result is obtained?
• To keep going until the theoretically correct result is obtained?
• Discuss.
• Gregor Mendel’s experiments with peas.
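As one input to the discussion, the simulation below suggests what happens to the false alarm rate if you stop as soon as a pleasing result appears. Every simulated experiment has a true null hypothesis, and a simple z-test with known standard deviation is checked after each new observation. The test, the sample sizes, and the function name are all assumptions of this sketch, not part of the lecture.

```python
import math
import random


def false_alarm_rates(n_experiments=2000, min_n=10, max_n=100, seed=0):
    """Compare stopping at the first 'significant' result with a fixed sample size.

    The null hypothesis (mean 0, standard deviation 1) is true in every
    simulated experiment, so any significant result is a false alarm.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold for a z-test
    peeking = fixed = 0
    for _ in range(n_experiments):
        total = 0.0
        ever_significant = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)
            if n >= min_n and abs(z) > z_crit:
                ever_significant = True  # a "pleasing" result at some look
        peeking += ever_significant
        fixed += abs(z) > z_crit  # only the final look counts
    return peeking / n_experiments, fixed / n_experiments


peeking_rate, fixed_rate = false_alarm_rates()
print(f"stop at first significant result: {peeking_rate:.1%} false alarms")
print(f"fixed sample size of 100:         {fixed_rate:.1%} false alarms")
```

Under these assumptions the fixed-size experiment stays near the nominal 5%, while stopping at the first significant look produces false alarms far more often.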