Post on 26-Dec-2015
Objectives
1.2 Describing distributions with numbers
Measures of center: mean, median
Mean versus median
Measures of spread: quartiles, standard deviation
Five-number summary and boxplot
Choosing among summary statistics
Changing the unit of measurement
Numerical descriptions of distributions
Describe the shape, center, and spread of a distribution…
Center: mean, median and mode.
Spread: range, IQR, standard deviation (SD).
We treat these as aids to understanding the distribution of the variable at hand…
The mean is often called the "average" and is in fact the arithmetic average ("add all the values and divide by the number of observations").
The mean or arithmetic average
To calculate the average, or mean, add all
values, then divide by the number of
individuals. It is the “center of mass.”
Measure of center: sample mean: Example 1
Heights (inches): 58.2, 59.5, 60.7, 60.9, 61.9
Sum of heights is 301.2
Mean = 301.2 divided by 5 women = 301.2/5 = 60.24 inches
Mathematical notation (sample mean):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
For the 25 heights listed below: $\bar{x} = \frac{1598.3}{25} \approx 63.9$ inches
woman (i)  height (x)    woman (i)  height (x)
i = 1 x1= 58.2 i = 14 x14= 64.0
i = 2 x2= 59.5 i = 15 x15= 64.5
i = 3 x3= 60.7 i = 16 x16= 64.1
i = 4 x4= 60.9 i = 17 x17= 64.8
i = 5 x5= 61.9 i = 18 x18= 65.2
i = 6 x6= 61.9 i = 19 x19= 65.7
i = 7 x7= 62.2 i = 20 x20= 66.2
i = 8 x8= 62.2 i = 21 x21= 66.7
i = 9 x9= 62.4 i = 22 x22= 67.1
i = 10 x10= 62.9 i = 23 x23= 67.8
i = 11 x11= 63.9 i = 24 x24= 68.9
i = 12 x12= 63.1 i = 25 x25= 69.6
i = 13 x13= 63.9
n = 25, sum of heights = 1598.3
Learn right away how to get the mean using your calculators.
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
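As a quick check of the two mean calculations, here is a minimal Python sketch. The list `heights` is copied from the table above, and `sample_mean` is a helper name chosen for this illustration; it is not part of the original slides.

```python
# Heights (in inches) of the 25 women, copied from the table above.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]


def sample_mean(values):
    """Add all the values and divide by the number of observations."""
    return sum(values) / len(values)


# Example 1 uses only the first five women; Example 2 uses all 25.
print(round(sample_mean(heights[:5]), 2))  # 60.24
print(round(sample_mean(heights), 1))      # 63.9
```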
Measure of center: sample mean: Example 2
Your numerical summary must be meaningful!
The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.
[Figure: Height of 25 women in a class, $\bar{x} = 63.9$]
The median (M) is often called the "middle" value and is the value at the midpoint of the observations when they are ranked from smallest to largest value.
Steps to get the median:
arrange the data from smallest to largest;
if n is odd, then the median is the single observation in the center (at the (n+1)/2 position in the ordering);
if n is even, then the median is the average of the two middle observations (at positions n/2 and n/2 + 1; the (n+1)/2 position falls in between them).
E.g1: 5, 1, 7, 4, 3
E.g2: 5, 1, 7, 4, 3, 8
Note: for a median, about 50% of the data are less than it and about 50% of the data are bigger than it.
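The steps above can be sketched in Python as follows. The function name `median_by_steps` is an assumption made for this illustration; the two data sets are E.g1 and E.g2 from the text.

```python
def median_by_steps(values):
    """Median via the steps in the text: sort, then take the middle."""
    ordered = sorted(values)          # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: single observation at position (n+1)/2 (index mid, 0-based)
        return ordered[mid]
    # n even: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_steps([5, 1, 7, 4, 3]))     # 4
print(median_by_steps([5, 1, 7, 4, 3, 8]))  # 4.5
```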
Example1: with the data listed below, what are the mean and median? 2, 3, 5, 1.
Example2: with the data listed below, what are the mean and median? 2, 3, 5, 1, 100.
Example3: with the data listed below, what are the mean and median? -100, 2, 3, 5, 1, 100.
Question: What can we conclude from the examples above?
Measure of center: the median
The mean is sensitive to outliers; the median is robust to outliers.
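One way to see this is to compute both statistics for the three example data sets. The sketch below uses Python's standard `statistics` module; the dictionary name `examples` is just a label for this illustration.

```python
from statistics import mean, median

# The three small data sets from Example1 to Example3.
examples = {
    "Example 1": [2, 3, 5, 1],
    "Example 2": [2, 3, 5, 1, 100],
    "Example 3": [-100, 2, 3, 5, 1, 100],
}

for name, data in examples.items():
    # The mean moves a lot when 100 or -100 is added; the median barely moves.
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

Adding the single value 100 moves the mean from 2.75 to 22.2 while the median only moves from 2.5 to 3.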
Measure of center: the median
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort observations by size. n = number of observations
2.a. If n is odd, the median is observation (n+1)/2 down the list.
Sorted data (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
n = 25, (n+1)/2 = 26/2 = 13, Median = 3.4
2.b. If n is even, the median is the mean of the two middle observations.
Sorted data (n = 24): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6
n = 24, n/2 = 12, Median = (3.3+3.4)/2 = 3.35
Mean and median of a distribution with outliers
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
[Figure: percent of people dying. Without the outliers, $\bar{x} = 3.4$; with the outliers, $\bar{x} = 4.2$]
Impact of skewed data
Mean and median of a symmetric distribution
Disease X: $\bar{x} = 3.4$, $M = 3.4$. Mean and median are the same.
… and a right-skewed distribution
Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$. The mean is pulled toward the skew.
We can describe the shape, center and spread of a density curve in the same way we describe data… e.g.,
the median of a density curve is the “equal-areas” point - the point on the horizontal axis that divides the area under the density curve into two equal (.5 each) parts.
The mean of the density curve is the balance point - the point on the horizontal axis where the curve would balance if it were made of a solid material. (See figures 1.24b and 1.25 below)
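Skewness: The mean is pulled toward the skew.
SYMMETRIC: Mode = Mean = Median
SKEWED LEFT (negatively): the mean lies to the left of the median, in the direction of the long left tail
SKEWED RIGHT (positively): the mean lies to the right of the median, in the direction of the long right tail
[Figure: positions of the mean, median, and mode for each shape]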
Spread: percentiles, quartiles (Q1 and Q3), IQR, 5-number summary (and boxplots), range, standard deviation
The pth percentile of a variable is a data value such that p% of the values of the variable fall at or below it.
The lower (Q1) and upper (Q3) quartiles are special percentiles dividing the data into quarters (fourths). Get them by finding the medians of the lower and upper halves of the data.
IQR = interquartile range = Q3 - Q1 = spread of the middle 50% of the data. IQR is used with the so-called 1.5*IQR criterion for outliers - know this!
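To make the percentile definition concrete, here is a small Python sketch using the "nearest-rank" convention: the smallest data value with at least p% of the values at or below it. This is only one of several percentile conventions and is an assumption of this sketch; it can give slightly different quartiles than the medians-of-halves rule used in these notes. The function name and the data `1, ..., 10` are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest data value with at least p% of the values at or below it."""
    ordered = sorted(values)
    # Rank (1-based) of the first value that reaches p% of the data.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


data = list(range(1, 11))                 # hypothetical data: 1, 2, ..., 10
print(percentile_nearest_rank(data, 25))  # 3  (3 of 10 values are <= 3)
print(percentile_nearest_rank(data, 50))  # 5
print(percentile_nearest_rank(data, 90))  # 9
```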
Measure of spread: the quartiles
Eg1: Dataset: 3, 2, 1, 5, 6.
1) Find the Median, Q1, Q3 and IQR.
2) Find the 5-# summary.
3) Draw a Boxplot for Eg1.
Examples to find 5-# summary and Boxplot
Eg2: Dataset: 3, 2, 1, 5, 6, 8.
1) Find the Median, Q1, Q3 and IQR.
2) Find the 5-# summary.
3) Draw a Boxplot for Eg2.
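Parts 1) and 2) of Eg1 and Eg2 can be checked with a short Python sketch. It assumes the quartile rule used in these notes (Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the median itself when n is odd). The function names are chosen for this illustration only.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # lower half, excluding M when n is odd
    upper = ordered[(n + 1) // 2:]   # upper half, excluding M when n is odd
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])


for data in ([3, 2, 1, 5, 6], [3, 2, 1, 5, 6, 8]):
    smallest, q1, m, q3, largest = five_number_summary(data)
    print(data, "->", (smallest, q1, m, q3, largest), "IQR =", q3 - q1)
```

For Eg1 this gives the summary 1, 1.5, 3, 5.5, 6 and for Eg2 it gives 1, 2, 4, 6, 8; the IQR is 4 in both cases.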
Definition, pg 35, Introduction to the Practice of Statistics, Sixth Edition
© 2009 W.H. Freeman and Company
Measure of spread: the quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M).
Sorted data (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
M = median = 3.4 (observation 13)
Q1 = first quartile = 2.2 (median of observations 1 to 12: (2.1 + 2.3)/2)
Q3 = third quartile = 4.35 (median of observations 14 to 25: (4.2 + 4.5)/2)
Definition, pg 37, Introduction to the Practice of Statistics, Sixth Edition
Definition, pg 38a, Introduction to the Practice of Statistics, Sixth Edition
Five-number summary and boxplot
Five-number summary: min, Q1, M, Q3, max
Smallest = min = 0.6
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
Largest = max = 6.1
[Figure: boxplot for Disease X; vertical axis: years until death, 0 to 7]
Boxplots for skewed data
Comparing box plots for a normal and a right-skewed distribution
Boxplots remain true to the data and depict clearly symmetry or skew.
[Figure: side-by-side boxplots for Disease X and Multiple Myeloma; vertical axis: years until death, 0 to 15]
5-number summary: min, Q1, median, Q3, max. When plotted, the 5-number summary is a boxplot. We can also do a modified boxplot to show outliers (mild and extreme). Boxplots have less detail than histograms and are often used for comparing distributions… e.g., Fig. 1.17, p.47 and below...
Suspected outliers: how to detect outliers
Outliers are troublesome data points, and it is important to be able to identify them.
One way to raise the flag for a suspected outlier is to compare the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (distance between Q1 and Q3).
We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”
Modified boxplot (helps detect outliers)
Calculate 1.5*IQR, then the bounds Q1 – 1.5*IQR and Q3 + 1.5*IQR.
Draw box and line (similar to before).
Draw whiskers to the minimum and maximum observations within (Q1 – 1.5*IQR, Q3 + 1.5*IQR).
Observations outside this range should be plotted as dots separately.
Q1 = 2.2
Q3 = 4.35
Sorted data (n = 25, largest value 7.9): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9
Modified Boxplot
Question 1: Are there any suspected outliers?
Question 2: If yes, then find the following values:
Calculate 1.5*IQR;
Lower bound = Q1 – 1.5*IQR;
Upper bound = Q3 + 1.5*IQR;
Find Min* = min within lower/upper bounds;
Find Max* = max within lower/upper bounds.
Question 3: Can we verify any outliers?
Question 4: Now draw the Modified Boxplot: draw Min* and Max*, Q1, Med, Q3. All observations outside this range should be plotted as dots separately.
Interquartile range: Q3 – Q1 = 4.35 − 2.2 = 2.15
Distance to Q3: 7.9 − 4.35 = 3.55
Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 * IQR = 3.225 years. Thus, individual #25 is a suspected outlier by our 1.5 * IQR rule.
[Figure: modified boxplot for Disease X; vertical axis: years until death, 0 to 8]
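The same check can be sketched in Python. The list `years` holds the Disease X values with 7.9 as the largest observation, and the quartiles are taken as given in the text (Q1 = 2.2, Q3 = 4.35). The variable names are chosen for this illustration only.

```python
# Sorted survival times (years) for Disease X, with 7.9 as the largest value.
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]
q1, q3 = 2.2, 4.35                      # quartiles as given in the text

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr            # anything below is a suspected outlier
upper_bound = q3 + 1.5 * iqr            # anything above is a suspected outlier

outliers = [x for x in years if x < lower_bound or x > upper_bound]
inside = [x for x in years if lower_bound <= x <= upper_bound]
whisker_low, whisker_high = min(inside), max(inside)   # Min* and Max*

print(f"IQR = {iqr:.2f}, 1.5*IQR = {1.5 * iqr:.3f}")
print(f"bounds = ({lower_bound:.3f}, {upper_bound:.3f})")
print("suspected outliers:", outliers)
print("whiskers drawn to:", whisker_low, "and", whisker_high)
```

Only 7.9 falls outside the bounds, so in the modified boxplot the upper whisker stops at 6.1 and 7.9 is plotted as a separate dot.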
The standard deviation “s” is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.
1. First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation s:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Measure of spread: the standard deviation
[Figure: mean $\bar{x}$ ± 1 s.d.]
Example 1: to calculate sample SD
For data: 1, 2, 3, 4, 5. Q: Find the sample variance and sample SD.
$s = \sqrt{\frac{1}{df}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Mean = 3
Sum of squared deviations from mean = 10
Degrees freedom (df) = (n − 1) = 4
$s^2$ = sample variance = 10/4 = 2.5
s = sample standard deviation = √2.5 = 1.58
Make sure to know how to get the standard deviation using your calculator.
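The hand calculation in Example 1 can be mirrored step by step in Python. This is a minimal sketch; `sample_variance` and `sample_sd` are helper names assumed for this illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by df = n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 2, 3, 4, 5]
print(sample_variance(data))       # 2.5
print(round(sample_sd(data), 2))   # 1.58
```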
Example 2: Calculate by hand the sample SD for the following data set: 3, 4, 5, 8.
1. First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation s:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
How to use calculator to find statistics…
To find the sample mean, sample SD, and 5-# summary, we can use a calculator as follows:
Stat, Edit, choose 1: Edit…, input your data into L1;
Stat, Calc, choose 1: 1-Var Stats, Enter, Enter.
Read your outputs carefully.
Note: X-bar means sample mean; Sx means sample SD; n means sample size.
Q: find the sample mean, sample SD, and 5-# summary for the following data:
Example 1: Data are: 3, 4, 5, 8.
Example 2: Data are: 1, 3, 5, 6, 7, 8.
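For checking calculator output, the same quantities can be computed with Python's standard `statistics` module. This is a sketch, not a description of how any calculator works internally: the function name `one_var_stats` and the dictionary keys are labels chosen for this illustration, and the quartiles use the medians-of-halves rule from these notes, which a given calculator or software package may or may not follow.

```python
from statistics import mean, median, stdev


def one_var_stats(values):
    """Sample mean, sample SD, n, and the 5-number summary of the data."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # lower half, excluding M when n is odd
    upper = ordered[(n + 1) // 2:]   # upper half, excluding M when n is odd
    return {
        "x_bar": mean(ordered),      # sample mean
        "Sx": stdev(ordered),        # sample SD (divides by n - 1)
        "n": n,
        "min": ordered[0],
        "Q1": median(lower),
        "Med": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }


for data in ([3, 4, 5, 8], [1, 3, 5, 6, 7, 8]):
    print(data, one_var_stats(data))
```

For 3, 4, 5, 8 the mean is 5, Sx is about 2.16, and the 5-number summary is 3, 3.5, 4.5, 6.5, 8. For 1, 3, 5, 6, 7, 8 the mean is 5, Sx is about 2.61, and the 5-number summary is 1, 3, 5.5, 7, 8.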
Definition, pg 43a, Introduction to the Practice of Statistics, Sixth Edition
ALWAYS PLOT DATA BEFORE DECIDING ON A NUMERICAL SUMMARY.
How to choose summary statistics? Use the 5-number summary rather than the mean and s.d. for skewed data; use the mean & s.d. for symmetric data.
How to perform data analysis: