Chapter 1
Why Statistics?
Learning can result from:
– Critical thinking
– Asking an authority
– Religious experience
However, collecting DATA is the surest way to learn about the world
Data in the Sciences are messy
At first glance, data often look like an incoherent jumble of numbers
How do we make sense of data?
Statistical procedures are tools for learning about the world by Learning from Data.
Real Data!
To help you understand the power and usefulness of statistics, we will explore two real and interesting data sets:
– “The Smoking Study”
– “The Maternity Study”
The Smoking Study
From the University of Wisconsin Center for Tobacco Research and Intervention
608 participants provided data on smoking, addiction, withdrawal, and how best to quit smoking
The full data set is provided on the CD; a description of the data collected is provided in the appendices of the book
The Maternity Study
From the Wisconsin Maternity Leave and Health Project
244 families provided data on marital satisfaction, child-rearing styles, and other household events
The full data set is provided on the CD; a description of the data collected is provided in the appendices of the book
Variability
Why are data messy? Consider a concrete example:
Depression scores (“CESD”) for participants in the Smoking Study
Some participants (each has a different ID number) have CESD scores of 0, while others have scores of 2, 11 or 7, or some other value
These data are messy in that the scores are different from one another
Variability is the statistical term for the degree to which scores (such as the depression scores) differ from one another.
Sources of Variability
It is easy to see that depression scores are variable, but why?
– Individual differences
  Some people are more depressed than others
  Some people have difficulty reading and understanding the questions on the test
  Some people answer the questions more honestly than others
– Procedure
  Differences in the ways the data were collected
– Conditions or Treatments
  The conditions that are imposed on the participants of the study
Populations and Samples
Statistical Population – a collection or set of measurements of a variable that share some common characteristic
Sample – a subset of measurements from a population
Random sample – a sample selected such that every score in the population has an equal chance of being included
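The definition of a random sample can be illustrated with a short Python sketch. The population below is made up for illustration (608 invented scores, not the Smoking Study data), and the function name `draw_random_sample` is hypothetical. The sketch assumes sampling without replacement, which is one common reading of the definition.

```python
import random

# Hypothetical population: 608 invented scores standing in for real measurements.
rng = random.Random(1)  # fixed seed so the sketch gives the same result each run
population = [rng.randint(0, 50) for _ in range(608)]


def draw_random_sample(scores, n, rng):
    """Draw n scores without replacement.

    random.Random.sample gives every score in the list the same
    chance of being included in the sample.
    """
    return rng.sample(scores, n)


sample = draw_random_sample(population, 60, rng)
print(len(sample), "scores sampled from a population of", len(population))
```

Every score in `sample` comes from `population`, so the sample is a subset of the population in the sense defined above.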
Chapter 2
Frequency Distributions and Percentiles
Variability (revisited)
Collecting Data means measuring a variable
Those measurements differ (vary) from one another
One way to organize and summarize a set of measurements is to construct a frequency distribution
These methods can be applied to both populations and samples
Example
5 13 17 20 19 35 21 28 3 22
26 13 30 30 30 32 40 27 14 4
27 33 28 45 29 25 38 35 33 39
5 4 20 24 25 27 16 25 38 9
36 20 18 11 12 23 22 27 32 49
22 30 0 32 4 23 9 29 22 23
YRSMK – Number of Years Smoking Daily From the First 60 Participants in the Smoking Study
Example
0 3 4 4 4 5 5 9 9 10
11 13 13 14 16 17 18 19 20 20
20 21 22 22 22 22 23 23 23 24
25 25 25 26 27 27 27 27 28 28
29 29 30 30 30 30 32 32 32 33
33 35 35 36 38 38 39 40 45 49
YRSMK – Number of Years Smoking Daily From the First 60 Participants in the Smoking Study
A Better Summary?
Class Interval  Frequency  Relative Frequency  Cumulative Frequency  Cumulative Proportion
0 - 4           5          .083                5                     .083
5 - 9           4          .067                9                     .150
10 - 14         5          .083                14                    .233
15 - 19         4          .067                18                    .300
20 - 24         12         .200                30                    .500
25 - 29         12         .200                42                    .700
30 - 34         9          .150                51                    .850
35 - 39         6          .100                57                    .950
40 - 44         1          .017                58                    .967
45 - 49         2          .033                60                    1.000
Total (n)       60         1.000
YRSMK – Number of Years Smoking Daily From the First 60 Participants in the Smoking Study
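As a rough check on the table, the frequency distribution can be rebuilt from the 60 unsorted scores listed in the first example. This is a minimal Python sketch, not the procedure used in the book. The function name `frequency_table` is hypothetical, and it assumes class intervals of width 5 starting at 0, as in the table.

```python
# YRSMK scores for the first 60 participants, copied from the unsorted example.
yrsmk = [
    5, 13, 17, 20, 19, 35, 21, 28, 3, 22,
    26, 13, 30, 30, 30, 32, 40, 27, 14, 4,
    27, 33, 28, 45, 29, 25, 38, 35, 33, 39,
    5, 4, 20, 24, 25, 27, 16, 25, 38, 9,
    36, 20, 18, 11, 12, 23, 22, 27, 32, 49,
    22, 30, 0, 32, 4, 23, 9, 29, 22, 23,
]


def frequency_table(scores, width=5):
    """Return one row per class interval:
    (low, high, frequency, relative frequency,
     cumulative frequency, cumulative proportion).
    Assumes the first interval starts at 0.
    """
    n = len(scores)
    rows = []
    cumulative = 0
    for low in range(0, max(scores) + 1, width):
        high = low + width - 1
        freq = sum(low <= s <= high for s in scores)
        cumulative += freq
        rows.append((low, high, freq, freq / n, cumulative, cumulative / n))
    return rows


table = frequency_table(yrsmk)
for low, high, freq, rel, cum, cum_prop in table:
    print(f"{low:2d} - {high:2d}  {freq:3d}  {rel:.3f}  {cum:3d}  {cum_prop:.3f}")
```

The relative frequency is the class frequency divided by n = 60, and each cumulative entry adds the current class to everything below it.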
Graphing Distributions
Percentiles
We have been focusing on distributions rather than individual scores
Sometimes, individual scores are of great importance
Computing Percentiles, when n=608
The 50th percentile is the “middle” score. It is the 304th sorted score (608 × 0.50 = 304).
The 32nd percentile is at position 608 × 0.32 = 194.56, which rounds up to the 195th sorted score.
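The position arithmetic above can be sketched in Python. The function name `percentile_position` is hypothetical, and the round-up rule is inferred from the two worked cases on this slide. Other textbooks and software use different percentile conventions, so treat this as one possible reading.

```python
def percentile_position(n, pct):
    """1-based position of the pct-th percentile among n sorted scores.

    Computes n * pct / 100 and rounds up to the next whole position.
    pct is a whole-number percent (e.g. 32 for the 32nd percentile);
    integer arithmetic avoids floating-point rounding surprises.
    """
    return (n * pct + 99) // 100


print(percentile_position(608, 50))  # 304
print(percentile_position(608, 32))  # 195
```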
Percentile Rank
The percentile rank of a score is the percent (the proportion times 100) of the measurements in the distribution below that score value
Computing percentile rank for YRSMK:
Sort the variable, called YRSMK_sorted
The proportion of scores below 9 is 50/608 = 0.082, so the percentile rank of 9 is about 8 (the 8th percentile)
The proportion of scores below 21 is 246/608 = 0.405, so the percentile rank of 21 is about 40 (the 40th percentile)
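The same calculation can be sketched in Python on the 60 YRSMK scores listed earlier in this chapter. The full 608-score variable is not reproduced here, so the results differ from the 608-participant figures above. The function name `percentile_rank` is hypothetical.

```python
import bisect

# YRSMK scores for the first 60 participants, copied from the unsorted example.
yrsmk = [
    5, 13, 17, 20, 19, 35, 21, 28, 3, 22,
    26, 13, 30, 30, 30, 32, 40, 27, 14, 4,
    27, 33, 28, 45, 29, 25, 38, 35, 33, 39,
    5, 4, 20, 24, 25, 27, 16, 25, 38, 9,
    36, 20, 18, 11, 12, 23, 22, 27, 32, 49,
    22, 30, 0, 32, 4, 23, 9, 29, 22, 23,
]


def percentile_rank(scores, value):
    """Percent of scores strictly below value."""
    ordered = sorted(scores)
    # bisect_left returns how many sorted scores are less than value.
    below = bisect.bisect_left(ordered, value)
    return 100 * below / len(ordered)


print(percentile_rank(yrsmk, 21))  # 35.0 (21 of the 60 scores are below 21)
print(round(percentile_rank(yrsmk, 9), 1))  # 11.7 (7 of the 60 scores are below 9)
```

Sorting first mirrors the "sort the variable" step; counting the scores below the value and dividing by n gives the proportion, and multiplying by 100 gives the percentile rank.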
Graphing Distributions
Graphing distributions is a very valuable tool for highlighting features of the data
– Shape
– Range
– Central Tendency
– Variability
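One quick way to graph a distribution is a text histogram with one row per class interval. The sketch below uses the 60 YRSMK scores listed earlier in this chapter. The function name `text_histogram` is hypothetical, and the class width of 5 starting at 0 is an assumption carried over from the frequency table.

```python
# YRSMK scores for the first 60 participants, copied from the unsorted example.
yrsmk = [
    5, 13, 17, 20, 19, 35, 21, 28, 3, 22,
    26, 13, 30, 30, 30, 32, 40, 27, 14, 4,
    27, 33, 28, 45, 29, 25, 38, 35, 33, 39,
    5, 4, 20, 24, 25, 27, 16, 25, 38, 9,
    36, 20, 18, 11, 12, 23, 22, 27, 32, 49,
    22, 30, 0, 32, 4, 23, 9, 29, 22, 23,
]


def text_histogram(scores, width=5):
    """Return one line per class interval, with one '*' per score in it."""
    lines = []
    for low in range(0, max(scores) + 1, width):
        high = low + width - 1
        freq = sum(low <= s <= high for s in scores)
        lines.append(f"{low:2d}-{high:2d} | " + "*" * freq)
    return lines


print("\n".join(text_histogram(yrsmk)))
```

The longest rows show where scores pile up, and the first and last non-empty rows show the range.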
Shape
We classify the shape of distributions in three ways:
– Symmetry – is one half a mirror image of the other half?
– Skew – are there high/low frequencies of low/high scores?
– Modality – how many humps or modes?
Symmetry
Is one half of the distribution a mirror image of the other (along a vertical axis)?
[Figure: three examples of symmetrical distributions]
Skew
Positive – high frequencies of low values and low frequencies of high values
Negative – low frequencies of low values and high frequencies of high values
Modality
How many humps (or modes)?
– Unimodal – one hump
– Bimodal – two humps
Characterizing Shape
[Figure: two example distributions, one labeled asymmetric, negatively skewed, bimodal; the other labeled asymmetric, positively skewed, unimodal]
Central Tendency and Variability
In addition to shape, distributions differ in terms of:
– Central Tendency – scores near the center of the distribution; where the scores “tend” to be
– Variability – the degree to which scores differ from one another; the “spread” of the scores
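As a rough illustration, both ideas can be put into numbers for the 60 YRSMK scores listed earlier in this chapter. The choice of measures is an assumption: the slides have not yet named specific statistics, so the mean and median stand in for central tendency, and the range and sample standard deviation stand in for variability.

```python
import statistics

# YRSMK scores for the first 60 participants, copied from the unsorted example.
yrsmk = [
    5, 13, 17, 20, 19, 35, 21, 28, 3, 22,
    26, 13, 30, 30, 30, 32, 40, 27, 14, 4,
    27, 33, 28, 45, 29, 25, 38, 35, 33, 39,
    5, 4, 20, 24, 25, 27, 16, 25, 38, 9,
    36, 20, 18, 11, 12, 23, 22, 27, 32, 49,
    22, 30, 0, 32, 4, 23, 9, 29, 22, 23,
]

# Central tendency: where the scores "tend" to be.
mean_years = statistics.mean(yrsmk)
median_years = statistics.median(yrsmk)

# Variability: how much the scores differ from one another.
score_range = max(yrsmk) - min(yrsmk)
std_dev = statistics.stdev(yrsmk)  # sample standard deviation

print("mean:", mean_years)
print("median:", median_years)
print("range:", score_range)
print("standard deviation:", round(std_dev, 2))
```

Two distributions can share a center but differ in spread, or the reverse, which is why both kinds of summary are reported.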
Comparing Distributions
It is very useful to be able to compare and contrast distributions (name their similarities and differences)
Distributions can differ in terms of shape, central tendency, and variability
Comparing Distributions
How do these distributions differ?