Review (BPS chapter 1): Picturing Distributions with Graphs
• What is statistics?
• Individuals and variables
• Two types of data: categorical and quantitative
• Ways to chart categorical data: bar graphs and pie charts
• Ways to chart quantitative data: histograms and stem plots
• Interpreting histograms
• Time plots
Example (BPS chapter 1): Indicate whether each of the following variables is categorical or quantitative.
a. We have data on 20 individuals measuring the amount of time it takes to climb five flights of stairs.
Answer: Quantitative
b. During a clinical trial, an experimental pain relief drug is administered to individuals. Each individual is then asked whether s/he experienced any pain relief.
Answer: Categorical
Objectives (BPS chapter 2): Describing distributions with numbers
• Measure of center: mean and median
• Measure of spread: quartiles and standard deviation
• The five-number summary and boxplots
• IQR and outliers
• Choosing among summary statistics
The mean or arithmetic average
To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.”
Heights of 25 women (inches): 58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2, 66.7, 67.1, 67.8, 68.9, 69.6
Sum of heights is 1598.3
Divided by 25 women = 63.9 inches
Measure of center: the mean
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1598.3}{25} = 63.9$
Mathematical notation:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.5
i = 3       x3 = 60.7       i = 16      x16 = 64.1
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.9      i = 24      x24 = 68.9
i = 12      x12 = 63.1      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25      Σ = 1598.3
Learn right away how to get the mean using your calculators.
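For anyone working in software rather than on a calculator, the same calculation can be sketched in Python. This is only an illustration: the variable names are ours, and the values are the 25 heights from the table above.

```python
# Heights (inches) of the 25 women, in the order listed in the table.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]

n = len(heights)      # number of individuals
total = sum(heights)  # add all values
mean = total / n      # divide by the number of individuals

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.1f}")
```

The order of the values does not matter for the mean, so the list does not need to be sorted first.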
Measure of center: the median
The median (M) is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort observations from smallest to largest (n = number of observations).
2. Find the location of the median: L = (n + 1)/2.
(1) If n is odd, the median is observation (n + 1)/2 down the list.
Sorted data: 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
n = 25, L = (n + 1)/2 = 26/2 = 13, M = 3.4
(2) If n is even, the median is the mean of the two center observations.
Sorted data: 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6
n = 24, L = (n + 1)/2 = 12.5, M = (3.3 + 3.4)/2 = 3.35
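The two cases of the location rule can be sketched as a small Python function. This is a minimal illustration rather than a prescribed method; the names `median`, `data_24` and `data_25` are ours, and the data are the values used on the slides.

```python
def median(values):
    """Median via the location rule L = (n + 1) / 2."""
    ordered = sorted(values)  # step 1: sort smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # n odd: L is a whole number, so take that observation
        # (position L counting from 1 is index mid counting from 0).
        return ordered[mid]
    # n even: L falls halfway between two observations; average them.
    return (ordered[mid - 1] + ordered[mid]) / 2


data_24 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
           3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6]
data_25 = data_24 + [6.1]

print(round(median(data_24), 2))  # even case, n = 24
print(median(data_25))            # odd case, n = 25
```

Because the function sorts its input, the observations can be entered in any order.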
[Figure: position of the mean and median for a symmetric distribution and for skewed distributions (left skew, right skew)]
Comparing the mean and the median
If the distribution is exactly symmetric, the mean and the median are the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median. The median is a measure of center that is resistant to skew and outliers. The mean is not.
Mean and median of a distribution with outliers
[Figure: percent of people dying, shown without the outliers and with the outliers]
Without the outliers: $\bar{x} = 3.4$
With the outliers: $\bar{x} = 4.2$
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
Impact of skewed data
Mean and median of a symmetric distribution and a right-skewed distribution
Disease X: $\bar{x} = 3.4$, $M = 3.4$. Mean and median are the same.
Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$. The mean is pulled toward the skew.
Example: STAT 200 Midterm Score
Midterm: 30, 35, 40, 40, 40, 40, 45, 45, 45, 45, 50, 50, 55, 55, 60, 65, 65, 70, 100, 100
Descriptive Statistics: Midterm
Variable  N   Mean   StDev  Minimum  Q1     Median  Q3     Maximum
Midterm   20  53.75  18.98  30.00    40.00  47.50   63.75  100.00
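A short Python sketch can show how much the two scores of 100 move each measure of center. Leaving those two scores out is our own comparison, not part of the slide, and the variable names are ours; `statistics.mean` and `statistics.median` come from the Python standard library.

```python
import statistics

midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]

mean_all = statistics.mean(midterm)
median_all = statistics.median(midterm)

# Same scores with the two scores of 100 left out, for comparison only.
trimmed = [score for score in midterm if score < 100]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"all 20 scores: mean = {mean_all:.2f}, median = {median_all:.2f}")
print(f"without 100s:  mean = {mean_trimmed:.2f}, median = {median_trimmed:.2f}")
```

With all 20 scores the mean (53.75) sits well above the median (47.5). Dropping the two high scores moves the mean by about 5 points and the median by 2.5, which is the resistance property described above.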
Measure of spread: quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it.
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it.
Sorted data (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
M = median = 3.4
Q1 = first quartile = 2.2, the median of the 12 observations below M: (2.1 + 2.3)/2
Q3 = third quartile = 4.35, the median of the 12 observations above M: (4.2 + 4.5)/2
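The quartile values on the slide (2.2 and 4.35) are reproduced by taking the median of the observations below and above the overall median. The sketch below assumes that rule; statistical software often uses slightly different interpolation rules and may give slightly different quartiles. The function names are ours.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations below the median's location
    # When n is odd, the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(data)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```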
Center and spread in boxplots
“Five-number summary”
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7]
Boxplots for skewed data
[Figure: comparing box plots for a normal and a right-skewed distribution (Disease X and multiple myeloma); vertical axis: years until death, 0 to 15]
Boxplots remain true to the data and clearly depict symmetry or skewness.
IQR and outliers
The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot): IQR = Q3 − Q1.
An outlier is an individual value that falls outside the overall pattern.
• How far outside the overall pattern does a value have to fall to be considered an outlier?
• The 1.5 × IQR rule for outliers:
Low outlier: any value < Q1 − 1.5 × IQR
High outlier: any value > Q3 + 1.5 × IQR
Example: STAT 200 Midterm Score
IQR = Q3 − Q1 = 63.75 − 40.00 = 23.75
Low outlier: any value < Q1 − 1.5 × IQR = 40.00 − 1.5(23.75) = 4.375
High outlier: any value > Q3 + 1.5 × IQR = 63.75 + 1.5(23.75) = 99.375
Midterm: 30, 35, 40, 40, 40, 40, 45, 45, 45, 45, 50, 50, 55, 55, 60, 65, 65, 70, 100, 100
The two scores of 100 are outliers.
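The 1.5 × IQR rule for the midterm scores can be sketched in Python as follows. The sketch uses `statistics.quantiles` from the standard library (Python 3.8 or later); its default "exclusive" method happens to reproduce the Q1 and Q3 in the output above for these data, but other software may use other quartile rules. Variable names are ours.

```python
import statistics

midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]

# Three cut points that split the data into four parts: Q1, median, Q3.
q1, med, q3 = statistics.quantiles(midterm, n=4)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr   # values below this are low outliers
high_fence = q3 + 1.5 * iqr  # values above this are high outliers

outliers = [x for x in midterm if x < low_fence or x > high_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {low_fence} and {high_fence}")
print(f"outliers: {outliers}")
```

Only values strictly beyond a fence are flagged, matching the "<" and ">" in the rule.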
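Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
1) First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2) Then take the square root to get the standard deviation $s$:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked]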
Calculations …
We’ll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator.
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
[Figure: women’s height (inches)]
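As a cross-check on the arithmetic, the two steps (variance, then square root) can be sketched in Python. The 14 individual heights behind this example are not listed in the transcript, so the sketch first reuses the summary numbers given above and then applies the function to the midterm scores, whose standard deviation (18.98) appears in the earlier output. Function and variable names are ours.

```python
import math


def sample_variance(values):
    """Variance s^2: sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1)


def sample_std(values):
    """Standard deviation s: square root of the variance."""
    return math.sqrt(sample_variance(values))


# Step by step from the summary numbers for the heights example.
variance_heights = 85.2 / 13                # sum of squares / df
std_heights = math.sqrt(variance_heights)

# Full calculation on the midterm scores listed earlier.
midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]
std_midterm = sample_std(midterm)

print(f"heights: s^2 = {variance_heights:.2f}, s = {std_heights:.2f}")
print(f"midterm: s = {std_midterm:.2f}")
```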
Choosing among summary statistics
• Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. Plot the mean and use the standard deviation for error bars.
• Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: height of 30 women, shown as a box plot and as mean ± s.d.; vertical axis: height in inches, 58 to 69]
Example 1
Suppose a sample of twelve lab rats is found to have the following glucose levels:
3 4 4 6 6 6 8 8 9 10 12 15
1. Find the five-number summary of the data and construct a box plot.
2. Based on the box plot, the data set is
a. skewed to the left b. roughly symmetric c. skewed to the right
Five-number summary: Min = 3, Q1 = 5, M = 7, Q3 = 9.5, Max = 15
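The five-number summary for the glucose data can be checked with a short Python sketch. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, which matches the values given here; the function names are ours.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])


glucose = [3, 4, 4, 6, 6, 6, 8, 8, 9, 10, 12, 15]
summary = five_number_summary(glucose)
print(summary)
```

These five values are exactly what a box plot draws: the box runs from Q1 to Q3 with a line at M, and the whiskers extend to the minimum and maximum.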
Example 2
Suppose a researcher is recording fifty values in a database. Suppose she records every value correctly except the lowest value, which is supposed to be “2” but which she incorrectly types as “200”.
In the above scenario, the effect of the researcher’s error on the mean and median is:
a. Her calculated mean will be lower than it would have been without the error, but her calculated median will remain unchanged.
b. Her calculated mean will be higher than it would have been without the error, but her calculated median will remain unchanged.
c. Her calculated mean will remain unchanged, but her calculated median will be lower than it would have been without the error.
d. Her calculated mean will remain unchanged, but her calculated median will be higher than it would have been without the error.
Example 2
In the above scenario, the effect of the researcher’s error on the standard deviation is:
a. The error will not affect standard deviation.
b. Her calculated standard deviation will be smaller than it would have been without the error.
c. Her calculated standard deviation will be larger than it would have been without the error.
d. The error is likely to make the calculated standard deviation negative.
Example 3
There are three children in a room, ages 3, 4, and 5. If a four-year-old child enters the room, the
a. mean age and variance will stay the same.
b. mean age and variance will increase.
c. mean age will stay the same but the variance will increase.
d. mean age will stay the same but the variance will decrease.
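The arithmetic behind this question can be checked with a few lines of Python. The sketch uses `statistics.mean` and `statistics.variance` (the sample variance, dividing by n − 1) from the standard library, and also shows `statistics.pvariance` (dividing by n) in case the population version is intended; the question does not say which. Variable names are ours.

```python
import statistics

before = [3, 4, 5]     # ages of the three children already in the room
after = [3, 4, 4, 5]   # ages after the four-year-old enters

mean_before = statistics.mean(before)
mean_after = statistics.mean(after)

# Sample variance (divides by n - 1).
var_before = statistics.variance(before)
var_after = statistics.variance(after)

# Population variance (divides by n), for comparison.
pvar_before = statistics.pvariance(before)
pvar_after = statistics.pvariance(after)

print(f"mean: {mean_before} -> {mean_after}")
print(f"sample variance: {var_before:.3f} -> {var_after:.3f}")
print(f"population variance: {pvar_before:.3f} -> {pvar_after:.3f}")
```

The new value equals the mean, so it adds nothing to the sum of squared deviations while increasing the divisor.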