measures of central tendency -...

MAT 121 Spring 2013 Fisher

Sections Covered: 3.1 – 3.2

Now that we have learned how to collect and organize data, it is time to summarize what we have gathered. During the last class, we discussed the idea of a distribution of data. In particular, we looked at uniform, bell-shaped, skewed-left, and skewed-right distributions.

When we look at a distribution of data, we must consider three characteristics:

Shape

Center

Spread

The center and spread are numerical summaries of the data collected. One common misconception to watch for concerns the idea of average: average is most often, and erroneously, equated with the mean. In truth, there are several ways to describe the average value of a distribution of data. The most appropriate measures of center and spread depend on the shape of the distribution. Once we know the three (3) characteristics mentioned above, we can provide more accurate and effective analysis of our data.

Measures of Central Tendency

A quick bit of review…

Population versus Sample

Parameter versus Statistic

Qualitative versus Quantitative

Simple Random Sampling

As mentioned above, the word average is used quite frequently, especially in the media. We hear about stats and surveys that tell us that Americans, on average… (fill in the blank). We often interpret average as the mean, but average can describe other measures of central tendency as well. The three most commonly used measures of central tendency are the mean, median, and mode. As we will discuss, these measurements can sometimes differ greatly.

We will first start by focusing on the arithmetic mean.

The arithmetic mean is computed by adding all the values of a variable in the data and dividing the sum by the

number of observations.



We will make a distinction between the arithmetic mean of a population and the arithmetic mean of a sample.

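Population Arithmetic Mean: Represented by the symbol μ, the population arithmetic mean is computed using ALL the individuals in a population. Therefore the population arithmetic mean is a parameter.

Formula: $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{\sum x_i}{N}$, where N is the number of individuals in the population.

Sample Arithmetic Mean: Represented by the symbol x̄, the sample mean is computed using a sample of data. Therefore the sample arithmetic mean is a statistic.

Formula: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x_i}{n}$, where n is the number of observations in the sample.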

*For sake of not getting tongue-tied every 3 seconds, we will refer to the arithmetic mean as simply the mean.*

EXAMPLE: Below are the market-values of a single block of houses in a suburb of the Finger Lakes region.

$185,000 $168,000 $179,500 $216,500

$280,000 $158,250 $198,000 $173,850

$150,000 $299,900 $165,100 $170,000

a) Find the population mean, μ.

b) Find the sample mean, x̄, of a sample of size n = 5.

Think about the numbers calculated in the example above. How do μ and x̄ compare?
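The calculation in parts a) and b) can be sketched in Python. This is only an illustration: the function name `arithmetic_mean` and the random seed are our own choices, and the sample drawn here is just one of many possible samples of size 5, so your own sample mean will likely differ.

```python
import random

# Market values from the example above; the whole block is treated as the population.
house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

def arithmetic_mean(values):
    """Add all the values and divide by the number of observations."""
    return sum(values) / len(values)

# Population mean: uses ALL N = 12 houses, so it is a parameter.
mu = arithmetic_mean(house_values)

# Sample mean: uses n = 5 houses chosen by simple random sampling, so it is a statistic.
# The seed is an arbitrary (assumed) value used only to make the sketch repeatable.
rng = random.Random(121)
sample = rng.sample(house_values, 5)
x_bar = arithmetic_mean(sample)

print(f"population mean = {mu:,.2f}")
print(f"sample          = {sample}")
print(f"sample mean     = {x_bar:,.2f}")
```

Changing the seed draws a different sample, which is a quick way to see that the sample mean varies from sample to sample while the population mean stays fixed.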



Here is a frequency table, with corresponding histogram, of the data presented in the table.

Sometimes the mean doesn’t really do justice in describing the data set completely. If the mean doesn’t tell the story accurately, then we need another measurement that will.

The next measure we will explore is the median. The median of a variable is the value that lies in the middle of

the data. The data must be arranged in ascending order prior to finding the middle value. Median is often

denoted by a capital m, M.

Steps in Finding the Median of a Data Set

1) Arrange the data in ascending order.

2) Determine the number of observations, n.

3) Determine the observation in the middle of the data set.

If n is odd, the median is exactly in the middle of the data set, specifically the value in the $\frac{n+1}{2}$ position.

If n is even, the median is the mean of the middle two values in the data set, that is, the mean of the values in the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions.
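The steps for finding the median can be sketched in Python as follows. This is a minimal illustration rather than part of the course materials; the function name `median` is our own, and Python counts positions from 0, so position (n + 1)/2 becomes index n // 2.

```python
def median(values):
    """Return the median M by following the three steps listed above."""
    ordered = sorted(values)          # Step 1: arrange in ascending order
    n = len(ordered)                  # Step 2: number of observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2                   # Step 3: locate the middle
    if n % 2 == 1:
        # Odd n: position (n + 1)/2, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: mean of positions n/2 and n/2 + 1 (indices middle - 1 and middle).
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))      # odd number of observations
print(median([7, 1, 5, 3]))   # even number of observations
```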



EXAMPLE REVISITED: Let’s look back at the real estate example from the mean problem laid out above. Find the

median of the data set.

$185,000 $168,000 $179,500 $216,500

$280,000 $158,250 $198,000 $173,850

$150,000 $299,900 $165,100 $170,000

Example Expanded: Suppose a new home is built in the same neighborhood as above with a market-value of

$185,000; find the median of the data.
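As a way to check your work on both versions of this example, here is a short sketch using Python's standard `statistics` module. The variable names are our own; the data are the twelve market values above plus the new $185,000 home.

```python
import statistics

house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# n = 12 is even, so the median is the mean of the 6th and 7th ordered values.
median_original = statistics.median(house_values)

# Adding the new home makes n = 13 (odd), so the median is the 7th ordered value.
expanded_values = house_values + [185000]
median_expanded = statistics.median(expanded_values)

print(sorted(house_values))
print(f"median of the 12 homes = {median_original:,.0f}")
print(f"median of the 13 homes = {median_expanded:,.0f}")
```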

Depending on the shape of the distribution, we can make some generalizations about how mean and median

compare to each other. The following table from page 132 in your text summarizes the generalities nicely.

One aspect of these measures we need to be aware of is the idea of resistance. A numerical summary of data is

resistant IF extreme values (known as outliers) relative to the data fail to affect its value substantially, if at all.

Median is resistant due to the fact that it is found by position, regardless of what the actual numbers are.

However, mean is not resistant! Mean is calculated by adding numbers together and dividing by the number of

observations. Let’s pretend for a moment that the most expensive house in the example above is actually worth

$500,000 rather than $299,900. The significantly higher number will make the mean significantly higher, giving

the impression that the neighborhood is “wealthier” than it actually is.
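The effect described above can be checked directly. The sketch below swaps the $299,900 value for the hypothetical $500,000 and compares both measures before and after; the variable names are our own.

```python
import statistics

house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# Hypothetical change from the paragraph above: the top house is worth $500,000.
with_outlier = [500000 if value == 299900 else value for value in house_values]

mean_before = statistics.mean(house_values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(house_values)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:,.2f} -> {mean_after:,.2f}")
print(f"median: {median_before:,.2f} -> {median_after:,.2f}")
```

The mean rises by 200,100 / 12 = 16,675 dollars, while the median does not move at all, because the middle positions are still occupied by the same two houses.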

Relation between Mean, Median, and Distribution Shape

Distribution Shape     Mean vs. Median

Skewed Left            Mean substantially less than median

Symmetric              Mean roughly equal to median

Skewed Right           Mean substantially larger than median



The final measure of central tendency we will explore is mode. Mode, simply put, is the value (if any) of the variable that appears most often. A data set can have 0, 1, or more than 1 mode. Mode, like median and mean, can be very useful in helping summarize a quantitative data set. However, mode can also be used to describe a qualitative data set, which mean and median cannot realistically do.
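A small sketch of finding the mode in Python is shown below. The function name `modes` is our own, and it assumes the common textbook convention that a data set in which every value appears exactly once has no mode; other sources may treat that case differently.

```python
from collections import Counter

def modes(values):
    """Return every value that appears most often; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: if nothing repeats, no value stands out as a mode.
        return []
    return [value for value, count in counts.items() if count == highest]

print(modes([72, 80, 85, 93, 78, 80]))          # one mode
print(modes([1, 1, 2, 2, 3]))                   # more than one mode
print(modes([1, 2, 3]))                         # no mode
print(modes(["red", "blue", "red", "green"]))   # qualitative data
```

The last call shows why mode works for qualitative data: it only counts how often each value appears and never needs to add or order the values.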

When should you use mean, median, or mode?

Mean – When data are quantitative AND the frequency distribution is approximately symmetric.

Median – When the data are quantitative AND the frequency distribution is skewed left or right.

Mode – When the data are qualitative OR if the most frequent observation is the desired measure.

Try This! Ms. Mosher recorded the math test scores of six students in the table below. Determine the mean of the

student scores, to the nearest tenth. Determine the median of the student scores. Describe the effect on the mean

and the median if Ms. Mosher adds 5 bonus points to each of the six students’ scores.

Student Score

Andrew 72

John 80

George 85

Amber 93

Betty 78

Robert 80
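Once you have worked the problem by hand, a sketch like the following can be used to check the effect of the bonus points. The variable names are our own; the scores come from the table above.

```python
import statistics

scores = {
    "Andrew": 72, "John": 80, "George": 85,
    "Amber": 93, "Betty": 78, "Robert": 80,
}

original = list(scores.values())
with_bonus = [score + 5 for score in original]   # 5 bonus points for every student

mean_original = statistics.mean(original)
mean_bonus = statistics.mean(with_bonus)
median_original = statistics.median(original)
median_bonus = statistics.median(with_bonus)

print(f"mean:   {mean_original:.1f} -> {mean_bonus:.1f}")
print(f"median: {median_original} -> {median_bonus}")
```

Adding the same constant to every observation shifts both the mean and the median by that constant.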

Try This! The prices of seven race cars sold last week are listed in the table below. What is the mean value of these

race cars, in dollars? What is the median value of these race cars, in dollars? State which of these measures of

central tendency best represents the value of the seven race cars. Justify your answer.

Price of Car Frequency

$126,000 1

$140,000 2

$180,000 1

$400,000 2

$819,000 1
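When data arrive as a frequency table, each value must be counted as many times as its frequency. The sketch below shows one way to do that in Python; the variable names are our own, and the interpretation question is still yours to answer.

```python
import statistics

# (price, frequency) pairs from the table above
price_frequency = [
    (126000, 1),
    (140000, 2),
    (180000, 1),
    (400000, 2),
    (819000, 1),
]

# Mean from a frequency table: sum of (value * frequency) divided by the total frequency.
total_count = sum(freq for _, freq in price_frequency)
weighted_sum = sum(price * freq for price, freq in price_frequency)
mean_price = weighted_sum / total_count

# Median: expand the table back into individual observations, then find the middle one.
all_prices = [price for price, freq in price_frequency for _ in range(freq)]
median_price = statistics.median(all_prices)

print(f"number of cars = {total_count}")
print(f"mean price     = {mean_price:,.0f}")
print(f"median price   = {median_price:,.0f}")
```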



One last problem before we move on…

The given data represent the fossil-fuel carbon dioxide (CO2) emissions (in thousands of

metric tons) of the top 10 emitters in 2007.

Country Emissions Per Capita Emissions

China 1,783,029 1.35

United States 1,591,756 5.20

India 439,695 0.39

Russia 419,241 2.95

Japan 342,118 2.71

Germany 214,872 2.61

Canada 151,988 4.61

United Kingdom 147,155 2.41

South Korea 137,257 2.82

Iran 135,257 1.88

A) Determine the mean and median CO2 emissions of the top 10 countries.

B) Why are a country’s total emissions not necessarily the best gauge of CO2 emissions? Why might per capita emissions be a better gauge?

C) Determine the mean and median per capita CO2 emissions of the top 10 countries.

Which measure of central tendency is an environmentalist likely to use to support

their position that per capita emissions are too high?



Measures of Dispersion

In the last section, we focused primarily on the center and the shape of the distribution. Now we shift our attention to the spread of the data, which is known as dispersion.

Knowing where our data are centered is great, but it is also beneficial to know how fast or slow our values move

away from that central spot. Are the values clumped up close to the middle or do they taper out slowly? We will

look at three ways to gauge dispersion of data:

Range

Standard Deviation

Variance

The most basic and easiest to calculate is range. Range, R, is simply the difference between the highest and

lowest value.

The range is a quick way to determine how spread out the data is relative to the numbers in the data set. For

instance, if I have a set of numbers ranging from 10 to 34 then my range is 24. This is a considerable range for

such small numbers; therefore the data is quite spread out. On the contrary, if the numbers went from 100 to 124

I still have a range of 24 but this time the range is small compared to the numbers in my data set. This data would

be grouped rather closely.
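The comparison above can be illustrated with two small made-up data sets. The values below are hypothetical, chosen only so that one set runs from 10 to 34 and the other from 100 to 124; dividing the range by the mean is an informal way to express "range relative to the size of the numbers" and is not a standard measure from the text.

```python
def data_range(values):
    """Range R: the difference between the highest and lowest value."""
    return max(values) - min(values)

small_numbers = [10, 15, 22, 34]       # hypothetical values from 10 to 34
large_numbers = [100, 108, 117, 124]   # hypothetical values from 100 to 124

for label, values in (("small", small_numbers), ("large", large_numbers)):
    r = data_range(values)
    mean = sum(values) / len(values)
    # Informal comparison only: how big is the range next to a typical value?
    print(f"{label}: R = {r}, mean = {mean:.2f}, R / mean = {r / mean:.2f}")
```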

Not as easy to calculate, but a little more revealing about dispersion, is standard deviation. As with the mean, we can calculate the standard deviation of a population (σ) or of a sample (s). To give you an appreciation for, and hopefully a better understanding of, what the technology is doing, we will look at the formulas used to calculate population and sample standard deviation. Before we do, let’s understand what we are actually measuring.

Standard deviation measures how far the observed values typically fall from the mean. In essence, it is a kind of average distance between the observations and the population/sample mean; more precisely, it is the square root of the average squared difference from the mean. And now for the formulas:

Population Standard Deviation (σ) = $\sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$

Sample Standard Deviation (s) = $\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
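To see what the technology is doing, here is a minimal sketch of the two standard formulas written out step by step. The function names and the small data set are our own illustration. The only difference between the two functions is the divisor: N for a population, n − 1 for a sample.

```python
import math

def population_sd(values):
    """sigma: square root of the sum of squared deviations from mu, divided by N."""
    N = len(values)
    mu = sum(values) / N
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / N)

def sample_sd(values):
    """s: square root of the sum of squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample standard deviation needs at least two observations")
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values with mean 5
print(population_sd(data))        # squared deviations sum to 32; 32 / 8 = 4; sqrt = 2.0
print(sample_sd(data))            # 32 / 7, then the square root
```

For the same data, the sample formula always gives a slightly larger result than the population formula because it divides by a smaller number.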



Example: We will return to the house value example from the last section to practice finding standard deviation

without letting Excel do everything. Just as a reminder of what we are dealing with, here are the values again:

$185,000 $168,000 $179,500 $216,500

$280,000 $158,250 $198,000 $173,850

$150,000 $299,900 $165,100 $170,000

Example continued: Now let’s practice with a sample instead of a population. Use the sample created in the first

example we did regarding sample mean.

***Once you have calculated standard deviation, it is time to make sense of the number you have found. In short,

the higher the standard deviation the more dispersion the distribution has.

Now that you can calculate standard deviation, variance is just one simple calculation away. Population variance is represented by σ² and sample variance by s². Take note, variance is simply standard deviation squared! So once you have standard deviation, just square it and you will have the variance. Variance is less often reported as a descriptive summary because its units are hard to interpret. For example, if standard deviation is measured in dollars then variance would be in square dollars. Huh???

Just so we can say we did some work with variance, calculate the population variance and sample variance based on

the values we calculated for standard deviation above.
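The relationship between standard deviation and variance can be checked with Python's standard `statistics` module. In this sketch the twelve house values are treated as the population; the sample is simply the first five values, which is a hypothetical stand-in for whatever sample you drew earlier.

```python
import math
import statistics

house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# Hypothetical sample of n = 5 (the first five values); your own sample will differ.
sample = house_values[:5]

sigma = statistics.pstdev(house_values)             # population standard deviation
sigma_squared = statistics.pvariance(house_values)  # population variance
s = statistics.stdev(sample)                        # sample standard deviation
s_squared = statistics.variance(sample)             # sample variance

print(f"sigma   = {sigma:,.2f} dollars")
print(f"sigma^2 = {sigma_squared:,.2f} square dollars")
print(f"s       = {s:,.2f} dollars")
print(f"s^2     = {s_squared:,.2f} square dollars")

# Squaring each standard deviation reproduces the matching variance.
print(math.isclose(sigma ** 2, sigma_squared))
print(math.isclose(s ** 2, s_squared))
```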

Empirical Rule

The Empirical Rule is a nice rule of thumb that can be used to determine the percentage of data that will lie within a certain number of standard deviations of the mean. It is crucial to understand that the Empirical Rule only applies to data with a bell-shaped (or approximately bell-shaped) distribution. This rule does NOT apply to skewed data.



The Empirical Rule is as follows:

If a distribution is roughly bell shaped, then:

1) ~68% of the data will fall within 1 standard deviation of the mean (between μ − σ and μ + σ)

2) ~95% of the data will fall within 2 standard deviations of the mean (between μ − 2σ and μ + 2σ)

3) ~99.7% of the data will fall within 3 standard deviations of the mean (between μ − 3σ and μ + 3σ)

A visual of what was stated above:

NOTE: Nearly 100% of all data falls within 3

standard deviations of the mean. We can never say,

theoretically, that ALL data is accounted for!
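The three percentages in the Empirical Rule come from the normal (bell-shaped) curve. As a sketch, the proportions can be recovered with `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the helper name `proportion_within` is our own.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
```

The value for k = 3 is very close to 1 but never reaches it, which is the point made in the NOTE above.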

Some practice with dispersion…

The following data represent the number of pods on a sample of soybean plants for two different plot types.

Plot Type Pods

Liberty 32 31 36 35 44 31 39 37 38

No Till 35 31 32 30 43 33 37 42 40

A) Find the range for each plot type. Does this measure of dispersion help us determine which plot type is

superior?

B) Find the sample standard deviation for each plot type. Does either plot appear to be better using s as your

gauge?

C) Just for practice, find the sample variance for each plot type.
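After working parts A) through C) by hand, a sketch such as the following can be used to check the arithmetic. The dictionary layout and names are our own; interpreting which plot type looks better is still left to you.

```python
import statistics

pods = {
    "Liberty": [32, 31, 36, 35, 44, 31, 39, 37, 38],
    "No Till": [35, 31, 32, 30, 43, 33, 37, 42, 40],
}

summary = {}
for plot_type, counts in pods.items():
    summary[plot_type] = {
        "range": max(counts) - min(counts),       # part A
        "s": statistics.stdev(counts),            # part B: sample standard deviation
        "s_squared": statistics.variance(counts), # part C: sample variance
    }
    row = summary[plot_type]
    print(f"{plot_type}: R = {row['range']}, s = {row['s']:.2f}, s^2 = {row['s_squared']:.2f}")
```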



What if the distribution is not bell shaped? Then we use the one-size-fits-all approach, Chebyshev’s Inequality. Chebyshev’s Inequality works for ALL distributions, even bell-shaped ones. Furthermore, it can be used on sample data as well as on entire populations. Chebyshev’s Inequality states that at least $\left(1 - \dfrac{1}{k^2}\right) \cdot 100\%$ of the observations lie within k standard deviations of the mean. Important! k must be greater than 1.

Example: A class of second graders has mean height of five feet with standard deviation of one inch. At least

what percent of the class must be between 4’10” and 5’2”?
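The class-height example can be set up in Python as a sketch. The function name `chebyshev_lower_bound` is our own, and heights are converted to inches (5 feet = 60 inches, 4’10” = 58 inches, 5’2” = 62 inches) so that k can be computed as the distance from the mean divided by the standard deviation.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

mean_height = 60          # five feet, in inches
standard_deviation = 1    # one inch
lower, upper = 58, 62     # 4'10" and 5'2", in inches

# The interval is symmetric about the mean, so either endpoint gives the same k.
k = (upper - mean_height) / standard_deviation

print(f"k = {k}")
print(f"at least {chebyshev_lower_bound(k):.0%} of the class")
```

With k = 2, the bound is 1 − 1/4 = 0.75, so at least 75% of the class must fall in that interval regardless of the shape of the distribution.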

Try This! Computers from a particular company are found to last on average for three years without any

hardware malfunction, with standard deviation of two months. At least what percent of the computers last

between 31 months and 41 months?
