stats lecture 02 descriptive stats

8/3/2019 Stats Lecture 02 Descriptive Stats

1/51

Descriptive Statistics

Chapter 2

Quantitative Methods for Economics

Dr. Katherine Sauer

Metropolitan State College of Denver


2/51

Chapter Overview:

I. Working With Raw Data

II. Working With Grouped Data

III. Measures of Dispersion for Raw Data

IV. Measures of Dispersion for Grouped Data

V. Other Measures of Dispersion


3/51

I. Working with Raw Data (mean, median and mode)

Suppose you are a manager preparing a report on hours worked

by your 49 staff members.

You might like to know the average number of hours worked.

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1592.5}{49} = 32.5$$

20.0 37.3 54.2 25.3 59.6 24.5 29.7

18.0 38.8 42.1 39.5 56.8 16.9 28.5

45.5 42.0 39.5 42.6 40.0 44.2 40.1

44.0 56.4 30.2 20.0 22.7 37.8 23.4

26.0 20.2 36.1 18.3 19.7 36.8 26.5

24.0 23.4 15.4 20.0 38.9 42.1 24.1

41.0 18.5 21.3 22.6 37.2 42.9 17.9

Hours worked in a given week by 49 staff members


4/51

You might also like to know the median hours worked.

- sort the data in ascending order (the table below reads down each column)

15.4 20.0 23.4 26.5 37.3 40.1 44.0

16.9 20.0 23.4 28.5 37.8 41.0 44.2

17.9 20.0 24.0 29.7 38.8 42.0 45.5

18.0 20.2 24.1 30.2 38.9 42.1 54.2

18.3 21.3 24.5 36.1 39.5 42.1 56.4

18.5 22.6 25.3 36.8 39.5 42.6 56.8

19.7 22.7 26.0 37.2 40.0 42.9 59.6


The mode and median can be determined from the sorted data.

Are there any outliers we should make note of?

mean: 32.5 hours

median: 30.2 hours

mode: 20 hours
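The three summary values above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names (`hours`, `avg`, `mid`, `most_common`) are illustrative and not part of the lecture.

```python
from statistics import mean, median, mode

# Hours worked by the 49 staff members, copied from the raw data table.
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

avg = round(mean(hours), 1)   # sum of all values divided by N
mid = median(hours)           # middle (25th) value of the sorted data
most_common = mode(hours)     # value that occurs most often

print(avg, mid, most_common)
```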


5/51


One final calculation we might like to make is arranging the data into quartiles.

The lower quartile (Q1) is the item closest to position 0.25(n + 1).

Q1: 0.25(49 + 1) = 12.5

There is no 12.5th position, so we'll average the items in the 12th and 13th positions.


6/51


So, Q1 = (21.3 + 22.6) / 2 = 21.95

We've already found Q2: 30.2

To find the upper quartile (Q3), use the value of the item closest to position 0.75(n + 1).

Q3: 0.75(50) = 37.5


7/51


So, Q3 has a value of (41 + 42) / 2 = 41.5

Q1 = 21.95   Q2 = 30.2   Q3 = 41.5
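The position rule for quartiles can be sketched in code. This is a hedged illustration, not a standard library routine: the helper names `value_at_position` and `quartiles` are hypothetical, and the handling of fractional positions other than .5 (take the closest item) is an assumption based on the wording "closest to position".

```python
import math

def value_at_position(sorted_data, pos):
    """Value at a 1-based position in sorted data."""
    lower = math.floor(pos)
    if pos == lower:                       # whole-number position
        return sorted_data[lower - 1]
    if pos - lower == 0.5:                 # halfway: average the two neighbours
        return (sorted_data[lower - 1] + sorted_data[lower]) / 2
    return sorted_data[round(pos) - 1]     # assumption: otherwise take the closest item

def quartiles(data):
    """Q1, Q2, Q3 using positions 0.25(n+1), 0.5(n+1), 0.75(n+1)."""
    s = sorted(data)
    n = len(s)
    return tuple(value_at_position(s, p * (n + 1)) for p in (0.25, 0.5, 0.75))

hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

q1, q2, q3 = quartiles(hours)
print(q1, q2, q3)
```

Note that statistical packages use several different quartile conventions, so library functions may return slightly different values from this position rule.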


8/51

Sometimes the mean is not a good representation of the data.

- a representative statistic is fairly typical of most of the

data

Outliers can skew the mean.

Ex: Suppose we have the following data on the ages of students taking piano lessons: 5, 6, 7, 7, 7, 8, 9, 9, 32

Calculate the mean, median and mode:

mean = 10, median = 7, mode = 7

Drop the outlier and re-calculate the mean, median and mode:

mean = 7.25, median = (7 + 7)/2 = 7, mode = 7
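The effect of the outlier can be reproduced with a few lines of Python. A minimal sketch; the list names `ages` and `without_outlier` are illustrative.

```python
from statistics import mean, median, mode

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]
without_outlier = [a for a in ages if a != 32]   # drop the single high outlier

for label, data in (("with outlier", ages), ("without outlier", without_outlier)):
    print(label, mean(data), median(data), mode(data))
```

Only the mean moves noticeably when the outlier is dropped; the median and mode stay at 7.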


9/51

Graphically, skewed data has a long tail extending to the outlier.

- low outliers produce graphs that are skewed to the left

- high outliers produce graphs that are skewed to the right

For low outliers, the value of the mean will be less than the

value of the median.

For high outliers, the value of the mean will be more than the

value of the median.


10/51

II. Working with Grouped Data (mean, median and mode)

Many times it would be impractical to list all of the raw data.

Often data is first put into groups.

Example: employment data in the farming, fishing and forestry

industry

Age Group         1991       1996
15-19            4,585      2,826
20-24           11,872      9,319
25-34           27,171     24,492
35-44           31,299     28,210
45-54           31,626     30,902
55-64           33,477     25,846
65 and over     23,519     19,030
Total          163,549    140,625

Employment in the Farming, Fishing and Forestry Industry


11/51

Note: We are assuming that the values within each interval vary

uniformly between the lowest and highest values for the interval.

The mid-interval value is the average value of the data in

any interval.

- used to represent the group numerically

Mid-Interval Value for 15-19: (15 + 19) / 2 = 17

The age of each person in the interval is assumed to be 17.



12/51

Back to our hours worked example

15.4 20.0 23.4 26.5 37.3 40.1 44.0

16.9 20.0 23.4 28.5 37.8 41.0 44.2

17.9 20.0 24.0 29.7 38.8 42.0 45.5

18.0 20.2 24.1 30.2 38.9 42.1 54.2

18.3 21.3 24.5 36.1 39.5 42.1 56.4

18.5 22.6 25.3 36.8 39.5 42.6 56.8

19.7 22.7 26.0 37.2 40.0 42.9 59.6


Let's group this data into a frequency distribution table.

- choose between 5 and 20 intervals

Data starts at 15.4 and goes to 59.6.

Grouping hours by 5s or 10s makes sense.

For our data, by 5s will be more revealing.
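Grouping by 5s can be sketched as follows. This assumes intervals of the form 15 ≤ x < 20, 20 ≤ x < 25, and so on, starting at 15; the boundary convention and the names `WIDTH`, `START`, and `freq_table` are assumptions made for illustration.

```python
from collections import Counter

hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

WIDTH = 5    # interval width: grouping "by 5s"
START = 15   # assumed lower bound of the first interval

# Map each value to the lower bound of the interval [lower, lower + WIDTH) containing it.
counts = Counter(START + WIDTH * int((h - START) // WIDTH) for h in hours)

freq_table = [(lower, lower + WIDTH, counts[lower]) for lower in sorted(counts)]
for lower, upper, freq in freq_table:
    print(f"{lower} < {upper}: {freq}")
```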


13/51

[Table: Hours Worked | Frequency]


14/51

Let's calculate the mid-interval values and add them to our table.

[Table: Hours Worked | Frequency | Mid-Interval Value]


15/51

Let's calculate the total hours worked for each interval and add it to the table:

frequency x mid-interval value

[Table: Hours Worked | Frequency | Mid-Interval Value | Sub-Group Total Hours Worked]


16/51

[Table: Hours Worked | Frequency | Mid-Interval Value | Sub-Group Total Hours Worked]
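The grouped mean is the sum of the sub-group totals divided by the number of observations. The sketch below assumes the frequencies obtained by tallying the raw hours data into intervals of width 5 starting at 15 (lower bound included, upper bound excluded); the interval convention and the names `intervals` and `grouped_mean` are illustrative assumptions.

```python
# (lower bound, upper bound, frequency), tallied from the raw hours data
# under the assumed convention lower <= x < upper.
intervals = [
    (15, 20, 7), (20, 25, 12), (25, 30, 5),
    (30, 35, 1), (35, 40, 9), (40, 45, 10),
    (45, 50, 1), (50, 55, 1), (55, 60, 3),
]

n = sum(freq for _, _, freq in intervals)

# Sub-group total = frequency x mid-interval value; add them up over all intervals.
total_hours = sum(freq * (lower + upper) / 2 for lower, upper, freq in intervals)

grouped_mean = total_hours / n
print(n, total_hours, round(grouped_mean, 3))
```

Under these assumptions the result agrees with the grouped mean of 32.602 reported in the comparison of raw and grouped results.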


17/51

To find the mode, we simply need our frequencies and intervals.

[Table: Hours Worked | Frequency]


18/51

Now let's calculate the median and quartiles.

We'll first need to compute the cumulative frequency and add it to our table.

[Table: Hours Worked | Frequency | Less Than | Cumulative Frequency]


19/51

To determine the value of Q1:

- From the 7 items in the preceding interval, 5.5 more are needed to reach the 12.5th position.

- There are 12 items in the interval that contains Q1.

From this we get: 5.5 / 12 = 0.46

Take this times the size of the interval to get: 0.46 x 5 = 2.3

Add this to the beginning of the interval to get: 2.3 + 20 = 22.3 = Q1
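The interpolation steps above generalise to any position in grouped data. The function below is a minimal sketch; `grouped_quantile` is a hypothetical helper name, and the interval frequencies are those tallied from the raw hours data assuming intervals of width 5 starting at 15.

```python
def grouped_quantile(intervals, position):
    """Interpolate the value at a 1-based position within grouped data.

    intervals: list of (lower bound, upper bound, frequency) in ascending order.
    """
    cumulative = 0
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= position:
            # Fraction of this interval needed to reach the position.
            fraction = (position - cumulative) / freq
            return lower + fraction * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds the number of observations")

# Assumed frequency table for the hours data (lower <= x < upper).
hours_intervals = [
    (15, 20, 7), (20, 25, 12), (25, 30, 5),
    (30, 35, 1), (35, 40, 9), (40, 45, 10),
    (45, 50, 1), (50, 55, 1), (55, 60, 3),
]

n = 49
q1 = grouped_quantile(hours_intervals, 0.25 * (n + 1))   # position 12.5
q3 = grouped_quantile(hours_intervals, 0.75 * (n + 1))   # position 37.5
print(round(q1, 2), round(q3, 2))
```

Without intermediate rounding, Q1 comes out at about 22.29; the hand calculation rounds 5.5 / 12 to 0.46 first, which gives 22.3.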




20/51




21/51




22/51

Hours Worked

           Raw Data    Grouped Data
mean       32.5        32.602
median     30.2        32.5
mode       20          22.08
Q1         21.95       22.3
Q2         30.2        32.5
Q3         41.5        41.75


23/51


24/51

Let's calculate the average price per bottle for each option.

Bundle 1: (8 + 10 + 12 + 55 + 150) / 5 = $47 per bottle

Bundle 2: [8(8) + 10(8) + 12(8) + 55(8) + 150(8)] / 5(8) = $47 per bottle

Bundle 2 is a weighted average, but all the weights are the same.


25/51

For Bundle 3, the weights will be different.

Bundle 3: [8(123) + 10(62) + 12(32) + 55(2) + 150(1)] / (123 + 62 + 32 + 2 + 1)

= 2248 / 220

= $10.22 per bottle
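The Bundle 3 calculation is a weighted average in which each price is weighted by the number of bottles at that price. A minimal sketch; the names `prices` and `bottles` are illustrative.

```python
prices = [8, 10, 12, 55, 150]     # price per bottle, in dollars
bottles = [123, 62, 32, 2, 1]     # number of bottles at each price (the weights)

total_cost = sum(p * q for p, q in zip(prices, bottles))
total_bottles = sum(bottles)
weighted_average = total_cost / total_bottles

print(total_cost, total_bottles, round(weighted_average, 2))
```

Setting every weight to the same number (as in Bundle 2) reduces this to the ordinary mean of the prices.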


26/51

Quick Summary:

A summary statistic is used to represent a typical value of our data.

- mean

- median

- mode

- quartiles

We can calculate summary statistics for raw data and grouped data.


27/51

III. Measures of Dispersion for Raw Data

A summary statistic gives no indication about the dispersion of

values within a set of data.

Ex: You are a tour operator planning activities for two different

tour groups. You are told the average age for each group is 50

years old.

When the tourists arrive you discover the ages of the individuals in

each group are as follows:

group 1: 48, 50, 52, 51, 49

group 2: 22, 85, 72, 27, 64, 39, 41


28/51

The range is the difference between the highest and lowest

value in the data set.

group 1 range = 52 - 48 = 4

group 2 range = 85 - 22 = 63

A smaller number indicates all data values are closer together.

A larger number could indicate:

1. data are dispersed

2. there are outliers


29/51

Variance is a way of measuring how much each data point varies from the mean value.

Let's calculate the difference between each data point and the mean: $(x_i - \mu)$ or $(x_i - \bar{x})$

Then, calculate the sum of the differences for each group: $\sum (x_i - \mu)$ or $\sum (x_i - \bar{x})$

Group 1              Group 2
xi     xi - 50       xi     xi - 50
48     -2            22     -28
50      0            85      35
52      2            72      22
51      1            27     -23
49     -1            64      14
                     39     -11
                     41      -9
Total   0            Total    0


30/51

To overcome the problem of the differences from the mean summing to zero:

square each difference and then sum: $\sum (x_i - \mu)^2$ or $\sum (x_i - \bar{x})^2$

Group 1                              Group 2
xi     xi - 50   (xi - 50)^2         xi     xi - 50   (xi - 50)^2
48     -2        4                   22     -28       784
50      0        0                   85      35       1225
52      2        4                   72      22       484
51      1        1                   27     -23       529
49     -1        1                   64      14       196
                                     39     -11       121
                                     41      -9       81
Total   0        10                  Total    0       3420

We can see that there is much larger variation from the mean in

group 2 data.


31/51

However, because our data sets are of unequal size, we should adjust for that.

Divide the sum of squared differences by the number of observations.

group 1: 10 / 5 = 2

group 2: 3420 / 7 = 488.57

This statistic is called the variance.

Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

If n - 1 is used in the defining formula for the sample variance, then it is possible to prove that the average value of the sample variance equals the true variance.


32/51

The square root of the variance is called the standard deviation.

- it is another way to measure the dispersion around the mean

- it is measured in the same units as the data

- unless the data is a percent, in which case the standard deviation is in percentage points

Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

For our example:

group 1: 1.41

group 2: 22.1
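The variance and standard deviation figures for the two tour groups can be reproduced with Python's `statistics` module. In this sketch `pvariance` and `pstdev` divide by N (the convention used in the worked example), while `variance` divides by n - 1; the variable names are illustrative.

```python
from statistics import pstdev, pvariance, variance

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

for name, ages in (("group 1", group1), ("group 2", group2)):
    print(
        name,
        "population variance:", round(pvariance(ages), 2),   # divide by N
        "population st. dev.:", round(pstdev(ages), 2),
        "sample variance:", round(variance(ages), 2),         # divide by n - 1
    )
```

The sample versions are always a little larger than the population versions, and the gap shrinks as the number of observations grows.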


33/51

In the same way that a mean can be skewed by outliers, so can the variance and standard deviation.

Looking at the median and quartiles may be informative.

The interquartile range (IQR) is the difference between the upper and lower quartiles.

The quartile deviation (QD), also called the semi-interquartile range, is the interquartile range divided by 2.


34/51

Let's arrange our raw data into quartiles:

First, order the data:

group 1: 48, 50, 52, 51, 49 becomes 48, 49, 50, 51, 52

Then, find Q1, Q2, Q3:

Q2 = median = 50

Q1: 0.25(5+1) = 1.5, so average the 1st and 2nd values: Q1 = 48.5

Q3: 0.75(5+1) = 4.5, so average the 4th and 5th values: Q3 = 51.5

Now, find the IQR and QD:

IQR = 51.5 - 48.5 = 3

QD = 3/2 = 1.5


35/51

First, order the data:

group 2: 22,85,72,27,64,39,41 becomes 22, 27, 39, 41, 64, 72, 85

Then, find Q1, Q2, Q3:

Q2 = median = 41

Q1 position: 0.25(7+1) = 2, so Q1 = 27

Q3 position: 0.75(7+1) = 6, so Q3 = 72

Now, find the IQR and QD:

IQR = 72 - 27 = 45

QD = 45/2 = 22.5

Group 1 has a much lower IQR and QD than Group 2.
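The IQR and QD for both groups can be computed with the same position rule. This is a sketch under one simplifying assumption, stated in the code: any fractional position is handled by averaging its two neighbours, which is all that is needed for these two groups. The helper names are hypothetical.

```python
def quartile(sorted_data, p):
    """Value at position p(n + 1), counting from 1."""
    pos = p * (len(sorted_data) + 1)
    lower = int(pos)
    if pos == lower:
        return sorted_data[lower - 1]
    # Assumption: a fractional position averages the two neighbouring values.
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2

def iqr_and_qd(data):
    s = sorted(data)
    iqr = quartile(s, 0.75) - quartile(s, 0.25)   # upper quartile minus lower quartile
    return iqr, iqr / 2                            # QD is half the IQR

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

print("group 1:", iqr_and_qd(group1))
print("group 2:", iqr_and_qd(group2))
```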


36/51

               Group 1    Group 2
Mean           50         50
Median         50         41
Range          4          63
Variance       2          488.57
Stand. Dev.    1.41       22.1
Q1             48.5       27
Q2             50         41
Q3             51.5       72
IQR            3          45
QD             1.5        22.5


37/51

IV. Measures of Dispersion for Grouped Data

Suppose we have the following frequency distribution table for

swimmers and their ages.

Ages       Frequency (fi)
17 < 19    14
19 < 21    19
21 < 23    11
23 < 25    4
25 < 27    1
27 < 29    1
Total      50

To calculate the mean, we'll need the mid-interval values.

Let's calculate the mid-interval values.


38/51

Ages       Frequency (fi)   Mid-Interval Value (xi)
17 < 19    14               18
19 < 21    19               20
21 < 23    11               22
23 < 25    4                24
25 < 27    1                26
27 < 29    1                28
Total      50               na

The mean is given by $\mu = \frac{\sum f_i x_i}{\sum f_i}$

We know the sum of the frequencies. We need to calculate the

product of the frequencies and mid-interval value and then sum.


39/51

Ages fi xi (fi)(xi)

17 < 19 14 18 252

19 < 21 19 20 380

21 < 23 11 22 242

23 < 25 4 24 96

25 < 27 1 26 26

27 < 29 1 28 28

Total 50 na 1024

So the mean for this grouped data is:

1024 / 50 = 20.48

Now that we have the mean, we can calculate the dispersion

around the mean for each mid-interval value. Then square.

- instead of taking each data point minus the mean, we

are using the mid-interval value


40/51

Multiply the squared terms by the frequency. Then sum.

Ages       fi   xi   (fi)(xi)   (xi - mean)   (xi - mean)^2
17 < 19    14   18   252        -2.48         6.1504
19 < 21    19   20   380        -0.48         0.2304
21 < 23    11   22   242         1.52         2.3104
23 < 25    4    24   96          3.52         12.3904
25 < 27    1    26   26          5.52         30.4704
27 < 29    1    28   28          7.52         56.5504
Total      50   na   1024        na           na


41/51

Ages       fi   xi   (fi)(xi)   (xi - mean)   (xi - mean)^2   fi(xi - mean)^2
17 < 19    14   18   252        -2.48         6.1504          86.1056
19 < 21    19   20   380        -0.48         0.2304          4.3776
21 < 23    11   22   242         1.52         2.3104          25.4144
23 < 25    4    24   96          3.52         12.3904         49.5616
25 < 27    1    26   26          5.52         30.4704         30.4704
27 < 29    1    28   28          7.52         56.5504         56.5504
Total      50   na   1024        na           na              252.48

We can now use our grouped data variance formula.

Variance: $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} = \frac{252.48}{50} = 5.0496$

The standard deviation is 2.247
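The grouped mean, variance, and standard deviation for the swimmers can be computed directly from the frequency table. A minimal sketch in which every observation is represented by its mid-interval value; the names `swimmers`, `grouped_mean`, and `grouped_variance` are illustrative.

```python
import math

# (lower bound, upper bound, frequency) for the swimmers' ages.
swimmers = [(17, 19, 14), (19, 21, 19), (21, 23, 11), (23, 25, 4), (25, 27, 1), (27, 29, 1)]

n = sum(freq for _, _, freq in swimmers)
mids = [((lower + upper) / 2, freq) for lower, upper, freq in swimmers]

# Mean: sum of frequency x mid-interval value, divided by the total frequency.
grouped_mean = sum(freq * x for x, freq in mids) / n

# Variance: frequency-weighted squared deviations of the mid-interval values from the mean.
grouped_variance = sum(freq * (x - grouped_mean) ** 2 for x, freq in mids) / n
grouped_sd = math.sqrt(grouped_variance)

print(grouped_mean, round(grouped_variance, 4), round(grouped_sd, 3))
```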


42/51

There is an alternative formula for calculating the variance for

grouped data:

$$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2$$

Let's calculate the mid-interval value squared and then multiply it by the frequency. Then sum.


43/51

Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2 fi(xi)^2

17 < 19 14 18 252 -2.48 6.1504 86.1056 4536

19 < 21 19 20 380 -0.48 0.2304 4.3776 7600

21 < 23 11 22 242 1.52 2.3104 25.4144 5324

23 < 25 4 24 96 3.52 12.3904 49.5616 2304

25 < 27 1 26 26 5.52 30.4704 30.4704 676

27 < 29 1 28 28 7.52 56.5504 56.5504 784

Total 50 na 1024 na na 252.48 21224

$$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2$$

Variance = 21224/50 - (1024/50)^2

= 424.48 - 419.4304

= 5.0496

Same answer as other formula!
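That the two variance formulas agree can also be checked numerically. A short sketch using the swimmers' table; the variable names are illustrative, and `math.isclose` is used because floating-point arithmetic can differ in the last few digits.

```python
import math

# (mid-interval value, frequency) for the swimmers' ages.
table = [(18, 14), (20, 19), (22, 11), (24, 4), (26, 1), (28, 1)]

n = sum(f for _, f in table)
sum_fx = sum(f * x for x, f in table)          # sum of f_i * x_i
sum_fx2 = sum(f * x ** 2 for x, f in table)    # sum of f_i * x_i^2
mean = sum_fx / n

variance_definition = sum(f * (x - mean) ** 2 for x, f in table) / n
variance_shortcut = sum_fx2 / n - mean ** 2

print(variance_definition, variance_shortcut)
print(math.isclose(variance_definition, variance_shortcut))
```

The shortcut form avoids computing each deviation from the mean, which is convenient by hand, although subtracting two large, nearly equal numbers can lose precision in floating-point arithmetic.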


44/51

Finally, let's calculate the inter-quartile range and the quartile deviation. We'll need the cumulative frequency to do this.

Ages       fi   cumulative
17 < 19    14   14
19 < 21    19   33
21 < 23    11   44
23 < 25    4    48
25 < 27    1    49
27 < 29    1    50

Q1: 0.25(50 + 1) = 12.75th position

From the 0 items in the preceding interval, 12.75 more are needed to reach the 12.75th position.

There are 14 items in the interval that contains Q1.

From this we get: 12.75 / 14 = 0.91

Take this times the size of the interval to get: 0.91 x 2 = 1.82

Add this to the beginning of the interval to get: 1.82 + 17 = 18.82 = Q1


45/51

Q2: 0.5(50 + 1) = 25.5th position

From the 14 items in the preceding interval, 11.5 more are needed to reach the 25.5th position.

There are 19 items in the interval that contains Q2.

From this we get: 11.5 / 19 = 0.605

Take this times the size of the interval to get: 0.605 x 2 = 1.21

Add this to the beginning of the interval to get: 1.21 + 19 = 20.21 = Q2


46/51

Q3: 0.75(50 + 1) = 38.25th position

From the 33 items in the preceding interval, 5.25 more are needed to reach the 38.25th position.

There are 11 items in the interval that contains Q3.

From this we get: 5.25 / 11 = 0.4772

Take this times the size of the interval to get: 0.4772 x 2 = 0.95 (rounded)

Add this to the beginning of the interval to get: 0.95 + 21 = 21.95 = Q3


47/51

The IQR = Q3 - Q1 = 21.95 - 18.82 = 3.13

The QD = 3.13 / 2 = 1.565
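The three grouped quartiles and the resulting IQR and QD can be sketched with one interpolation function. `grouped_quantile` is a hypothetical helper written for this illustration; it carries no intermediate rounding, so the last digit can differ slightly from the hand calculation.

```python
def grouped_quantile(intervals, position):
    """Interpolate the value at a 1-based position within grouped data."""
    cumulative = 0
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= position:
            fraction = (position - cumulative) / freq   # share of the interval needed
            return lower + fraction * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds the number of observations")

# (lower bound, upper bound, frequency) for the swimmers' ages.
swimmers = [(17, 19, 14), (19, 21, 19), (21, 23, 11), (23, 25, 4), (25, 27, 1), (27, 29, 1)]
n = sum(freq for _, _, freq in swimmers)

q1, q2, q3 = (grouped_quantile(swimmers, p * (n + 1)) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1
qd = iqr / 2

print(round(q1, 2), round(q2, 2), round(q3, 2), round(iqr, 2), round(qd, 2))
```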

Summary of our Grouped data:

mean       20.48
variance   5.0496
st. dev.   2.247
median     20.21
Q1         18.82
Q2         20.21
Q3         21.95
IQR        3.13
QD         1.57


48/51

V. Other Descriptive Statistics

The coefficient of variation (CV) is useful for comparing two

sets of data when

- the means are close but the variances are different

- the means are different but the variances are close

CV is independent of the units of measurement.

$$CV = \frac{\sigma}{\mu} \times 100$$
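As an illustration, the coefficient of variation can be computed for the two tour groups, which share a mean of 50 but differ widely in spread. This sketch takes CV as the standard deviation divided by the mean, times 100, and assumes the population standard deviation (`pstdev`); the function name is hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """Standard deviation as a percentage of the mean (population form assumed)."""
    return pstdev(data) / mean(data) * 100

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

cv1 = coefficient_of_variation(group1)
cv2 = coefficient_of_variation(group2)
print(round(cv1, 2), round(cv2, 1))
```

Because the standard deviation and the mean carry the same units, the ratio is unit-free, which is what makes it usable for comparing data sets measured on different scales.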


49/51

Pearson's coefficient of skewness (sk) gives a measure of the degree of skewness in a dataset.

- independent of units of measure

sk = 3(mean - median) / standard deviation

A negative value means the data is skewed to the left.

A positive value means the data is skewed to the right.
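Pearson's coefficient can be sketched in a few lines. The example below applies it to the two tour groups and assumes the population standard deviation; the function name `pearson_skewness` is illustrative.

```python
from statistics import mean, median, pstdev

def pearson_skewness(data):
    """sk = 3(mean - median) / standard deviation (population form assumed)."""
    return 3 * (mean(data) - median(data)) / pstdev(data)

group1 = [48, 50, 52, 51, 49]            # mean 50, median 50: symmetric
group2 = [22, 85, 72, 27, 64, 39, 41]    # mean 50, median 41: mean pulled upward

sk1 = pearson_skewness(group1)
sk2 = pearson_skewness(group2)
print(sk1, round(sk2, 2))
```

A coefficient of zero indicates that the mean and median coincide; the sign follows the sign of (mean - median).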


50/51

A box plot is a graphical display of the symmetry or skewness

of a dataset.

The middle bar in the box represents the median.

Each end of the box is Q1 and Q3.

The whiskers extend to the minimum and maximum data values.

- as long as the value is within (1.5)(IQR) of the box, that is, no more than 1.5 x IQR below Q1 or above Q3

- otherwise the value is marked individually with an *
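The numbers behind a box plot can be computed without drawing anything. The sketch below uses the piano-lesson ages and the position rule for quartiles, and reads "within (1.5)(IQR)" as within 1.5 x IQR of the ends of the box, which is the usual convention but an interpretation of the wording here. The function and key names are hypothetical.

```python
def box_plot_summary(data):
    s = sorted(data)
    n = len(s)

    def at(pos):
        # 1-based position; assumption: fractional positions average the two neighbours.
        lower = int(pos)
        return s[lower - 1] if pos == lower else (s[lower - 1] + s[lower]) / 2

    q1, q2, q3 = (at(p * (n + 1)) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "box": (q1, q2, q3),                    # ends of the box and the middle bar
        "whiskers": (inside[0], inside[-1]),    # most extreme values inside the fences
        "outliers": [x for x in s if x < low_fence or x > high_fence],  # marked with *
    }

summary = box_plot_summary([5, 6, 7, 7, 7, 8, 9, 9, 32])
print(summary)
```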



51/51

Chapter Skills:

Given raw data you should be able to calculate:

mean                        median
mode                        quartiles
variance                    standard deviation
coefficient of variation    Pearson's coefficient
box plot

Given raw data you should be able to construct a frequency distribution table and cumulative frequency.

From grouped data you should be able to calculate:

mean                        median
mode                        quartiles
variance                    standard deviation
coefficient of variation    Pearson's coefficient
box plot
