stats lecture 02 descriptive stats
Post on 06-Apr-2018
TRANSCRIPT
-
8/3/2019 Stats Lecture 02 Descriptive Stats
1/51
Descriptive Statistics
Chapter 2
Quantitative Methods for Economics
Dr. Katherine Sauer
Metropolitan State College of Denver
-
Chapter Overview:
I. Working With Raw Data
II. Working With Grouped Data
III. Measures of Dispersion for Raw Data
IV. Measures of Dispersion for Grouped Data
V. Other Measures of Dispersion
-
I. Working with Raw Data (mean, median and mode)
Suppose you are a manager preparing a report on hours worked
by your 49 staff members.
You might like to know the average number of hours worked.
$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1592.5}{49} = 32.5$
20.0 37.3 54.2 25.3 59.6 24.5 29.7
18.0 38.8 42.1 39.5 56.8 16.9 28.5
45.5 42.0 39.5 42.6 40.0 44.2 40.1
44.0 56.4 30.2 20.0 22.7 37.8 23.4
26.0 20.2 36.1 18.3 19.7 36.8 26.5
24.0 23.4 15.4 20.0 38.9 42.1 24.1
41.0 18.5 21.3 22.6 37.2 42.9 17.9
Hours worked in a given week by 49 staff members
-
You might also like to know the median hours worked.
- sort the data in ascending order
15.4 20.0 23.4 26.5 37.3 40.1 44.0
16.9 20.0 23.4 28.5 37.8 41.0 44.2
17.9 20.0 24.0 29.7 38.8 42.0 45.5
18.0 20.2 24.1 30.2 38.9 42.1 54.2
18.3 21.3 24.5 36.1 39.5 42.1 56.4
18.5 22.6 25.3 36.8 39.5 42.6 56.8
19.7 22.7 26.0 37.2 40.0 42.9 59.6
Hours worked in a given week by 49 staff members
The mode and median can be determined from the sorted data.
Are there any outliers we should make note of?
mean: 32.5 hours
median: 30.2 hours
mode: 20 hours
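As a quick check, the three statistics above can be reproduced with Python's standard statistics module. This is only a sketch; the variable names (hours, avg, mid, modes) are illustrative and not part of the lecture.

```python
from statistics import mean, median, multimode

# Hours worked in a given week by 49 staff members (from the table above)
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

avg = round(mean(hours), 1)   # sum of the values divided by how many there are
mid = median(hours)           # middle (25th) value of the sorted data
modes = multimode(hours)      # most frequent value(s), returned as a list

print(avg, mid, modes)
```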
-
One final calculation we might like to make is arranging the data
into quartiles.
The lower quartile (Q1) is the item closest to position 0.25(n + 1) in the sorted data.
Q1: 0.25(49 + 1) = 12.5
There is no 12.5th position, so we'll average the values in the 12th and 13th positions.
-
So, Q1 = (21.3 + 22.6) / 2 = 21.95
We've already found Q2: it is the median, 30.2.
To find the upper quartile (Q3), use the value of the item closest to position 0.75(n + 1).
Q3: 0.75(50) = 37.5
-
There is no 37.5th position, so Q3 is the average of the 37th and 38th values:
Q3 = (41.0 + 42.0) / 2 = 41.5
Q1 = 21.95 Q2 = 30.2 Q3 = 41.5
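The position rule for quartiles can be sketched in a few lines of Python. The helper name quartile_by_position is hypothetical, and averaging the two neighbouring values for any fractional position is a simplifying assumption that matches the .5 positions used here. Note that (21.3 + 22.6) / 2 works out to 21.95.

```python
import math

def quartile_by_position(sorted_values, fraction):
    """Return the value at position fraction * (n + 1), counting from 1.

    If the position is not a whole number, average the two neighbouring
    values, as done on the slides for positions 12.5 and 37.5.
    """
    position = fraction * (len(sorted_values) + 1)
    below = sorted_values[math.floor(position) - 1]
    above = sorted_values[math.ceil(position) - 1]
    return (below + above) / 2

hours = sorted([
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
])

q1 = quartile_by_position(hours, 0.25)   # average of the 12th and 13th values
q2 = quartile_by_position(hours, 0.50)   # the 25th value (the median)
q3 = quartile_by_position(hours, 0.75)   # average of the 37th and 38th values
print(round(q1, 2), q2, q3)
```

The sketch does not guard against positions below 1 or above n, which can occur for very small data sets.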
-
Sometimes the mean is not a good representation of the data.
- a representative statistic is fairly typical of most of the
data
Outliers can skew the mean.
Ex: Suppose we have the following data on the ages of students taking piano lessons:
5, 6, 7, 7, 7, 8, 9, 9, 32
Calculate the mean, median and mode:
mean = 10, median = 7, mode = 7
Drop the outlier (32) and re-calculate the mean, median and mode:
mean = 7.25, median = (7 + 7)/2 = 7, mode = 7
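The effect of the outlier can be checked with a short Python sketch using the standard statistics module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]
trimmed = [a for a in ages if a != 32]   # drop the outlier

with_outlier = (mean(ages), median(ages), mode(ages))
without_outlier = (mean(trimmed), median(trimmed), mode(trimmed))

print(with_outlier)      # mean, median, mode with the 32-year-old included
print(without_outlier)   # only the mean moves once the outlier is dropped
```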
-
Graphically, skewed data has a long tail extending to the outlier.
- low outliers produce skewed to the left graphs
- high outliers produce skewed to the right graphs
For low outliers, the value of the mean will be less than the
value of the median.
For high outliers, the value of the mean will be more than the
value of the median.
-
II. Working with Grouped Data (mean, median and mode)
Many times it would be impractical to list all of the raw data.
Often data is first put into groups.
Example: employment data in the farming, fishing and forestry
industry
Age Group 1991 1996
15-19 4,585 2,826
20-24 11,872 9,319
25-34 27,171 24,492
35-44 31,299 28,210
45-54 31,626 30,902
55-64 33,477 25,846
65 and over 23,519 19,030
Total 163,549 140,625
Employment in the Farming, Fishing and Forestry Industry
-
Note: We are assuming that the values within each interval vary
uniformly between the lowest and highest values for the interval.
The mid-interval value is the midpoint of an interval. Under the assumption above, it equals the average value of the data in that interval.
- used to represent the group numerically
Mid-Interval Value for 15-19: (15 + 19) / 2 = 17
The age of each person in the interval is assumed to be 17.
-
Back to our hours worked example
15.4 20.0 23.4 26.5 37.3 40.1 44.0
16.9 20.0 23.4 28.5 37.8 41.0 44.2
17.9 20.0 24.0 29.7 38.8 42.0 45.5
18.0 20.2 24.1 30.2 38.9 42.1 54.2
18.3 21.3 24.5 36.1 39.5 42.1 56.4
18.5 22.6 25.3 36.8 39.5 42.6 56.8
19.7 22.7 26.0 37.2 40.0 42.9 59.6
Hours worked in a given week by 49 staff members
Let's group this data into a frequency distribution table.
- choose between 5 and 20 intervals
Data starts at 15.4 and goes to 59.6.
Grouping hours by 5s or 10s makes sense.
For our data, by 5s will be more revealing.
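A minimal Python sketch of the grouping step follows. It assumes intervals of the form 15 < 20, 20 < 25, and so on up to 55 < 60 (lower bound included, upper bound excluded); that convention and the names START, WIDTH and N_INTERVALS are assumptions made for illustration.

```python
from collections import Counter

hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

START, WIDTH, N_INTERVALS = 15, 5, 9   # assumed intervals: 15 < 20, ..., 55 < 60

# Index of the interval each observation falls into (0 for 15 < 20, and so on)
counts = Counter(int((h - START) // WIDTH) for h in hours)

# One row per interval: (lower bound, upper bound, frequency)
table = [(START + k * WIDTH, START + (k + 1) * WIDTH, counts[k])
         for k in range(N_INTERVALS)]

for low, high, freq in table:
    print(f"{low} < {high}: {freq}")
```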
-
Hours Worked Frequency
15
-
Let's calculate the mid-interval values and add them to our table.
Hours Worked Frequency Mid Interval Value
15
-
Let's calculate the total hours worked for each interval and add it to the table.
sub-group total = frequency x mid-interval value
Hours Worked Frequency Mid Interval Value Sub-Group Total Hours Worked
15
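The grouped mean can be sketched as below. The frequencies are an assumption: they were tallied from the 49 raw values using intervals of width 5 starting at 15, because the table itself did not survive in this transcript. All names are illustrative.

```python
# Assumed frequency table, tallied from the raw hours data (intervals 15 < 20, ..., 55 < 60)
lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]
WIDTH = 5

# Mid-interval value: halfway between the lower and upper bound of each interval
midpoints = [low + WIDTH / 2 for low in lower_bounds]

# Sub-group total hours worked = frequency x mid-interval value
subtotals = [f * m for f, m in zip(frequencies, midpoints)]

grouped_mean = sum(subtotals) / sum(frequencies)
print(sum(subtotals), round(grouped_mean, 3))
```

Under these assumptions the result agrees with the grouped mean of 32.602 reported in the comparison of raw and grouped statistics.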
-
To find the mode, we simply need our frequencies and intervals.
Hours Worked Frequency
15
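The modal interval is simply the one with the highest frequency. To get a single number, one common interpolation formula is mode = L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) x width, where f1 is the modal frequency and f0, f2 are its neighbours. Whether the lecture used this exact formula is an assumption, as are the frequencies below (tallied from the raw data), but it does reproduce the grouped mode of 22.08 reported later.

```python
# Assumed frequency table, tallied from the raw hours data
lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]
WIDTH = 5

k = frequencies.index(max(frequencies))   # position of the modal interval
f_modal = frequencies[k]
f_prev = frequencies[k - 1] if k > 0 else 0
f_next = frequencies[k + 1] if k + 1 < len(frequencies) else 0

# Interpolate inside the modal interval using the neighbouring frequencies
grouped_mode = lower_bounds[k] + (f_modal - f_prev) / (
    (f_modal - f_prev) + (f_modal - f_next)
) * WIDTH

print(lower_bounds[k], round(grouped_mode, 2))
```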
-
Now let's calculate the median and quartiles.
We'll first need to compute the cumulative frequency and add it to our table.
Hours Worked Frequency Less Than Cumulative Frequency
15
-
To determine the value of Q1:
- From the 7 items in the preceding interval, 5.5 more are needed to reach the 12.5th position.
- There are 12 items in the interval that contains Q1.
From this we get: 5.5 / 12 = 0.46
Take this times the size of the interval to get: 0.46 x 5 = 2.3
Add this to the beginning of the interval to get: 2.3 + 20 = 22.3 = Q1
Hours Worked Frequency Less Than Cumulative Frequency
15
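The Q1 interpolation steps above can be written out directly. The numbers are the ones quoted in the steps; the variable names are illustrative.

```python
position = 0.25 * (49 + 1)   # 12.5th position
items_before = 7             # cumulative frequency of the preceding interval
items_in_interval = 12       # frequency of the interval containing Q1
interval_start = 20          # beginning of that interval
interval_width = 5

# Share of the interval that must be covered to reach the target position
fraction = (position - items_before) / items_in_interval

q1 = interval_start + fraction * interval_width
print(round(fraction, 2), round(q1, 1))
```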
-
Hours Worked
         Raw Data   Grouped Data
mean     32.5       32.602
median   30.2       32.5
mode     20         22.08
Q1       21.95      22.3
Q2       30.2       32.5
Q3       41.5       41.75
-
Let's calculate the average price per bottle for each option.
Bundle 1:
(8 + 10 + 12 + 55 + 150) / 5 = $47 per bottle
Bundle 2:
(8(8) + 10(8) + 12(8) + 55(8) + 150(8)) / 5(8) = $47 per bottle
Bundle 2 is a weighted average, but all the weights are the same.
-
For Bundle 3, the weights will be different.
Bundle 3:
(8(123) + 10(62) + 12(32) + 55(2) + 150(1)) / (123 + 62 + 32 + 2 + 1)
= 2248 / 220
= $10.22 per bottle
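The three bundle calculations are all instances of one weighted-average formula, sketched here in Python. The function name weighted_mean is hypothetical, and the quantities for Bundles 1 and 2 (one and eight bottles at each price) are read off the arithmetic above.

```python
def weighted_mean(prices, quantities):
    """Total cost divided by total number of bottles."""
    total_cost = sum(p * q for p, q in zip(prices, quantities))
    return total_cost / sum(quantities)

prices = [8, 10, 12, 55, 150]

bundle1 = weighted_mean(prices, [1, 1, 1, 1, 1])      # one bottle at each price
bundle2 = weighted_mean(prices, [8, 8, 8, 8, 8])      # equal weights: same answer
bundle3 = weighted_mean(prices, [123, 62, 32, 2, 1])  # mostly cheap bottles

print(bundle1, bundle2, round(bundle3, 2))
```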
-
Quick Summary:
A summary statistic is used to represent a typical value of our data.
- mean
- median
- mode
- quartiles
We can calculate summary statistics for raw data and grouped data.
-
III. Measures of Dispersion for Raw Data
A summary statistic gives no indication about the dispersion of
values within a set of data.
Ex: You are a tour operator planning activities for two different
tour groups. You are told the average age for each group is 50
years old.
When the tourists arrive you discover the ages of the individuals in
each group are as follows:
group 1: 48, 50, 52, 51, 49
group 2: 22, 85, 72, 27, 64, 39, 41
-
The range is the difference between the highest and lowest
value in the data set.
group 1 range = 52 - 48 = 4
group 2 range = 85 - 22 = 63
A smaller number indicates all data values are closer together.
A larger number could indicate:
1. data are dispersed
2. there are outliers
-
Variance is a way of measuring how much each data point
varies from the mean value.
Let's calculate the difference between each data point and the mean: $(x_i - \mu)$ or $(x_i - \bar{x})$.
Then, calculate the sum of the differences for each group: $\sum (x_i - \mu)$ or $\sum (x_i - \bar{x})$.
Group 1            Group 2
xi    xi - 50      xi    xi - 50
48    -2           22    -28
50    0            85    35
52    2            72    22
51    1            27    -23
49    -1           64    14
                   39    -11
                   41    -9
Total 0            Total 0
-
To overcome the problem of the differences from the mean summing to zero, square each difference and then sum: $\sum (x_i - \mu)^2$ or $\sum (x_i - \bar{x})^2$.
Group 1                          Group 2
xi    xi - 50   (xi - 50)^2      xi    xi - 50   (xi - 50)^2
48    -2        4                22    -28       784
50    0         0                85    35        1225
52    2         4                72    22        484
51    1         1                27    -23       529
49    -1        1                64    14        196
                                 39    -11       121
                                 41    -9        81
Total 0         10               Total 0         3420
We can see that there is much larger variation from the mean in
group 2 data.
-
However, because our data sets are of unequal size, we should
adjust for that.
Divide the sum of squared differences by the number of observations.
group 1: 10 / 5 = 2
group 2: 3420 / 7 = 488.57
This statistic is called the variance.
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
If n - 1 is used in the defining formula for the sample variance, then it is possible to prove that the average value of the sample variance equals the true (population) variance.
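The two variance formulas differ only in the divisor, as this minimal Python sketch shows for the tour-group ages. The function names are illustrative, not from the lecture.

```python
group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

def sum_of_squares(values):
    """Sum of squared differences from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def population_variance(values):
    return sum_of_squares(values) / len(values)          # divide by N

def sample_variance(values):
    return sum_of_squares(values) / (len(values) - 1)    # divide by n - 1

for name, data in (("group 1", group1), ("group 2", group2)):
    print(name, sum_of_squares(data),
          round(population_variance(data), 2),
          round(sample_variance(data), 2))
```

The values 2 and 488.57 quoted above are the population (divide by N) versions; the n - 1 versions are somewhat larger.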
-
The square root of the variance is called the standard deviation.
- it is another way to measure the dispersion around the mean
- it is measured in the same units as the data
- unless data is a percent, then standard deviation is in percentage points
Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
For our example:
group 1: 1.41
group 2: 22.1
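Python's standard statistics module provides both versions of the standard deviation: pstdev divides by N and stdev divides by n - 1. The sketch below suggests the figures quoted for the example are the population version.

```python
from statistics import pstdev, stdev

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

pop_sd1, pop_sd2 = pstdev(group1), pstdev(group2)   # square root of the N-divisor variance
samp_sd1, samp_sd2 = stdev(group1), stdev(group2)   # square root of the (n - 1)-divisor variance

print(round(pop_sd1, 2), round(pop_sd2, 1))
print(round(samp_sd1, 2), round(samp_sd2, 1))
```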
-
In the same way that a mean can be skewed by outliers, so can the variance and standard deviation.
Looking at the median and quartiles may be informative.
The interquartile range (IQR) is the difference between the upper and lower quartiles: IQR = Q3 - Q1.
The quartile deviation (QD), also called the semi-interquartile range, is the interquartile range divided by 2.
-
Let's arrange our raw data into quartiles:
First, order the data:
group 1: 48,50,52,51,49 becomes 48, 49, 50, 51, 52
Then, find Q1, Q2, Q3:
Q2 = median = 50
Q1: 0.25(5+1) = 1.5
so average the 1st and 2nd values Q1 = 48.5
Q3: 0.75(5+1) = 4.5
so average the 4th and 5th values Q3 = 51.5
Now, find the IQR and QD:
IQR = 51.5 - 48.5 = 3
QD = 3/2 = 1.5
-
First, order the data:
group 2: 22,85,72,27,64,39,41 becomes 22, 27, 39, 41, 64, 72, 85
Then, find Q1, Q2, Q3:
Q2 = median = 41
Q1: 0.25(7+1) = 2
so take the 2nd value Q1 = 27
Q3: 0.75(7+1) = 6
so take the 6th value Q3 = 72
Now, find the IQR and QD:
IQR = 72 - 27 = 45
QD = 45/2 = 22.5
Group 1 has a much lower IQR and QD than Group 2.
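The quartile steps for both tour groups can be sketched as follows, taking the IQR as Q3 - Q1 (so 72 - 27 = 45 for group 2) and the QD as half of that. The helper names are hypothetical, and averaging neighbouring values at fractional positions is an assumption that matches the .5 positions used here.

```python
import math

def quartile_by_position(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting from 1; average neighbours if fractional."""
    position = fraction * (len(sorted_values) + 1)
    below = sorted_values[math.floor(position) - 1]
    above = sorted_values[math.ceil(position) - 1]
    return (below + above) / 2

def iqr_and_qd(values):
    ordered = sorted(values)
    q1 = quartile_by_position(ordered, 0.25)
    q3 = quartile_by_position(ordered, 0.75)
    iqr = q3 - q1            # interquartile range
    return q1, q3, iqr, iqr / 2   # last item is the quartile deviation

group1_stats = iqr_and_qd([48, 50, 52, 51, 49])
group2_stats = iqr_and_qd([22, 85, 72, 27, 64, 39, 41])
print(group1_stats)
print(group2_stats)
```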
-
             Group 1   Group 2
Mean         50        50
Median       50        41
Range        4         63
Variance     2         488.57
Stand. Dev.  1.41      22.1
Q1           48.5      27
Q2           50        41
Q3           51.5      72
IQR          3         45
QD           1.5       22.5
-
IV. Measures of Dispersion for Grouped Data
Suppose we have the following frequency distribution table for
swimmers and their ages.
Ages      frequency (fi)
17 < 19   14
19 < 21   19
21 < 23   11
23 < 25   4
25 < 27   1
27 < 29   1
Total     50
To calculate the mean, we'll need the mid-interval values.
Let's calculate the mid-interval values.
-
Ages      frequency (fi)   Mid-Interval Value (xi)
17 < 19   14               18
19 < 21   19               20
21 < 23   11               22
23 < 25   4                24
25 < 27   1                26
27 < 29   1                28
Total     50               na
The mean is given by $\mu = \frac{\sum f_i x_i}{\sum f_i}$
We know the sum of the frequencies. We need to calculate the
product of the frequencies and mid-interval value and then sum.
-
Ages fi xi (fi)(xi)
17 < 19 14 18 252
19 < 21 19 20 380
21 < 23 11 22 242
23 < 25 4 24 96
25 < 27 1 26 26
27 < 29 1 28 28
Total 50 na 1024
So the mean for this grouped data is:
1024 / 50 = 20.48
Now that we have the mean, we can calculate the dispersion
around the mean for each mid-interval value. Then square.
- instead of taking each data point minus the mean, we
are using the mid-interval value
-
Multiply the squared terms by the frequency. Then sum.
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2
17 < 19 14 18 252 -2.48 6.1504
19 < 21 19 20 380 -0.48 0.2304
21 < 23 11 22 242 1.52 2.3104
23 < 25 4 24 96 3.52 12.3904
25 < 27 1 26 26 5.52 30.4704
27 < 29 1 28 28 7.52 56.5504
Total 50 na 1024 na na
-
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2
17 < 19 14 18 252 -2.48 6.1504 86.1056
19 < 21 19 20 380 -0.48 0.2304 4.3776
21 < 23 11 22 242 1.52 2.3104 25.4144
23 < 25 4 24 96 3.52 12.3904 49.5616
25 < 27 1 26 26 5.52 30.4704 30.4704
27 < 29 1 28 28 7.52 56.5504 56.5504
Total 50 na 1024 na na 252.48
We can now use our grouped data variance formula:
Variance = $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i}$ = 252.48 / 50 = 5.0496
The standard deviation is 2.247
-
There is an alternative formula for calculating the variance for
grouped data:
$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$
Let's calculate the mid-interval value squared and then multiply it by the frequency. Then sum.
-
Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2 fi(xi)^2
17 < 19 14 18 252 -2.48 6.1504 86.1056 4536
19 < 21 19 20 380 -0.48 0.2304 4.3776 7600
21 < 23 11 22 242 1.52 2.3104 25.4144 5324
23 < 25 4 24 96 3.52 12.3904 49.5616 2304
25 < 27 1 26 26 5.52 30.4704 30.4704 676
27 < 29 1 28 28 7.52 56.5504 56.5504 784
Total 50 na 1024 na na 252.48 21224
$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$
Variance = 21224/50 - (1024/50)^2
= 424.48 - 419.4304
= 5.0496
Same answer as other formula!
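A short Python sketch confirms that the two grouped-variance formulas agree for the swimmers table. Variable names are illustrative; small floating-point differences between the two results are expected.

```python
import math

midpoints = [18, 20, 22, 24, 26, 28]     # mid-interval values xi
frequencies = [14, 19, 11, 4, 1, 1]      # fi

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Definition: frequency-weighted squared deviations from the mean
var_definition = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n

# Alternative: mean of the squares minus the square of the mean
var_shortcut = sum(f * x ** 2 for f, x in zip(frequencies, midpoints)) / n - mean ** 2

std_dev = math.sqrt(var_definition)
print(round(mean, 2), round(var_definition, 4), round(var_shortcut, 4), round(std_dev, 3))
```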
-
Finally, let's calculate the inter-quartile range and the quartile deviation. We'll need the cumulative frequency to do this.
Q1: 0.25(50 + 1) = 12.75th position
From the 0 items in the preceding interval, 12.75 more are needed to reach the 12.75th position.
There are 14 items in the interval that contains Q1.
From this we get: 12.75 / 14 = 0.91
Take this times the size of the interval to get: 0.91 x 2 = 1.82
Add this to the beginning of the interval to get: 1.82 + 17 = 18.82 = Q1
Ages fi cumulative
17 < 19 14 14
19 < 21 19 33
21 < 23 11 44
23 < 25 4 48
25 < 27 1 49
27 < 29 1 50
-
Q2: 0.5(50 + 1) = 25.5th position
From the 14 items in the preceding interval, 11.5 more are needed to reach the 25.5th position.
There are 19 items in the interval that
contains Q2.
From this we get: 11.5 / 19 = 0.605
Take this times the size of the interval to get: 0.605 x 2 = 1.21
Add this to the beginning of the interval to get: 1.21 + 19 = 20.21
= Q2
-
Q3: 0.75(50 + 1) = 38.25th position
From the 33 items in the preceding interval, 5.25 more are needed to reach the 38.25th position.
There are 11 items in the interval that
contains Q3.
From this we get: 5.25 / 11 = 0.4772
Take this times the size of the interval to get: 0.4772 x 2 = 0.954
Add this to the beginning of the interval to get: 0.954 + 21 = 21.95
= Q3
-
The IQR = Q3 - Q1 = 21.95 - 18.82 = 3.13
The QD = 3.13 / 2 = 1.565
Summary of our Grouped data:
mean 20.48
variance 5.0496
st. dev. 2.247
median 20.21
Q1 18.82
Q2 20.21
Q3 21.95
IQR 3.13
QD 1.57
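The interpolation used for Q1, Q2 and Q3 follows one pattern, which can be sketched as a single Python function. The name grouped_quantile and its parameters are hypothetical; the position rule fraction x (n + 1) is the one used in the steps above.

```python
def grouped_quantile(fraction, lower_bounds, frequencies, width):
    """Interpolate the value at position fraction * (n + 1) within its interval."""
    position = fraction * (sum(frequencies) + 1)
    cumulative = 0   # items in all preceding intervals
    for lower, freq in zip(lower_bounds, frequencies):
        if cumulative + freq >= position:
            # Share of this interval needed to reach the position, scaled by its width
            return lower + (position - cumulative) / freq * width
        cumulative += freq
    raise ValueError("position lies beyond the last interval")

lower_bounds = [17, 19, 21, 23, 25, 27]
frequencies = [14, 19, 11, 4, 1, 1]

q1 = grouped_quantile(0.25, lower_bounds, frequencies, 2)
q2 = grouped_quantile(0.50, lower_bounds, frequencies, 2)
q3 = grouped_quantile(0.75, lower_bounds, frequencies, 2)
iqr = q3 - q1
qd = iqr / 2
print(round(q1, 2), round(q2, 2), round(q3, 2), round(iqr, 2), round(qd, 2))
```

Carrying full precision gives a QD of about 1.57; dividing the rounded IQR of 3.13 by 2 gives 1.565.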
-
V. Other Descriptive Statistics
The coefficient of variation (CV) is useful for comparing two
sets of data when
- the means are close but the variances are different
- the means are different but the variances are close
CV is independent of the units of measurement.
$CV = \frac{\sigma}{\mu} \times 100$
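As an illustration, the coefficient of variation for the two tour groups (same mean, very different spread) can be sketched as below. Using the population standard deviation (pstdev) is an assumption; the function name is illustrative.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (population version assumed)."""
    return pstdev(values) / mean(values) * 100

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

cv1 = coefficient_of_variation(group1)
cv2 = coefficient_of_variation(group2)
print(round(cv1, 2), round(cv2, 2))
```

Because the standard deviation is divided by the mean, the units cancel, which is why CV does not depend on the units of measurement.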
-
Pearson's coefficient of skewness (sk) gives a measure of the degree of skewness in a dataset.
- independent of units of measure
sk = 3(mean - median) / standard deviation
A negative value means the data is skewed to the left.
A positive value means the data is skewed to the right.
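A minimal sketch of Pearson's coefficient for the two tour groups follows. It assumes the population standard deviation (pstdev); the function name is illustrative.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """3 x (mean - median) / standard deviation (population version assumed)."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

group1 = [48, 50, 52, 51, 49]                # mean 50, median 50
group2 = [22, 85, 72, 27, 64, 39, 41]        # mean 50, median 41

sk1 = pearson_skewness(group1)
sk2 = pearson_skewness(group2)
print(sk1, round(sk2, 2))
```

Group 1 is symmetric (sk = 0), while group 2 has a mean above its median and so comes out positive, i.e. skewed to the right.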
-
A box plot is a graphical display of the symmetry or skewness
of a dataset.
The middle bar in the box represents the median.
Each end of the box is Q1 and Q3.
The whiskers extend to the minimum and maximum data values.
- as long as the value lies within (1.5)(IQR) of the nearer quartile
- otherwise the value is an outlier and is marked with an *
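The whisker rule can be sketched with the piano-lesson ages from earlier. This assumes the usual reading of the rule, namely that a value counts as an outlier when it lies more than 1.5 x IQR below Q1 or above Q3. The helper name and the quartile position rule are carried over as assumptions.

```python
import math

def quartile_by_position(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting from 1; average neighbours if fractional."""
    position = fraction * (len(sorted_values) + 1)
    below = sorted_values[math.floor(position) - 1]
    above = sorted_values[math.ceil(position) - 1]
    return (below + above) / 2

ages = sorted([5, 6, 7, 7, 7, 8, 9, 9, 32])

q1 = quartile_by_position(ages, 0.25)
q3 = quartile_by_position(ages, 0.75)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr     # values below this are outliers
high_fence = q3 + 1.5 * iqr    # values above this are outliers

inside = [a for a in ages if low_fence <= a <= high_fence]
outliers = [a for a in ages if a < low_fence or a > high_fence]
whiskers = (min(inside), max(inside))   # whisker ends: most extreme non-outliers

print(q1, q3, whiskers, outliers)
```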
-
Chapter Skills:
Given raw data you should be able to calculate:
mean                       median
mode                       quartiles
variance                   standard deviation
coefficient of variation   Pearson's coefficient
box plot
Given raw data you should be able to construct a frequency distribution table and cumulative frequency.
From grouped data you should be able to calculate:
mean                       median
mode                       quartiles
variance                   standard deviation
coefficient of variation   Pearson's coefficient
box plot