measures of central tendency -...
MAT 121 Spring 2013 Fisher
1 Sections Covered: 3.1 – 3.2
Now that we have learned how to collect and organize data, it is time to summarize what we have
gathered. During the last class, we discussed the idea of the distribution of data. In particular, we looked at
uniform, bell-shaped, skewed-left, and skewed-right distributions.
When we look at a distribution of data, we must consider three characteristics:
Shape
Center
Spread
The center and spread are in fact numerical summaries of the data collected. One common misconception to
watch for concerns the idea of average: it is most often, and erroneously, treated as a synonym for the mean.
The truth is that there are several ways to describe the average value of a distribution of data. The most
appropriate measure of center and spread depends on the shape of the distribution. Once we know the three
(3) characteristics mentioned above, we can provide more accurate and effective analysis of our data.
Measures of Central Tendency
A quick bit of review…
Population versus Sample
Parameter versus Statistic
Qualitative versus Quantitative
Simple Random Sampling
As mentioned above, the word average is used quite frequently, especially in the media. We hear about stats
and surveys that tell us that Americans, on average…fill in the blank. Often we interpret average as the mean,
but as we will discuss, average can be used to describe other measures of central tendency. The three most
commonly used measures of central tendency are mean, median, and mode. In our own
experience, when we hear average we generally go to the mean, but average can just as well refer to the
median or the mode. As we will see, these different measurements can sometimes vary greatly.
We will first start by focusing on the arithmetic mean.
The arithmetic mean is computed by adding all the values of a variable in the data and dividing the sum by the
number of observations.
We will make a distinction between the arithmetic mean of a population and the arithmetic mean of a sample.
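Population Arithmetic Mean: Represented by the symbol μ, the population arithmetic mean is computed using
ALL the individuals in a population. Therefore the population arithmetic mean is a parameter.
Formula: \[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N} \] where N is the number of individuals in the population.
Sample Arithmetic Mean: Represented by the symbol x̄, the sample arithmetic mean is computed using a sample of data.
Therefore the sample arithmetic mean is a statistic.
Formula: \[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n} \] where n is the number of observations in the sample.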
*For sake of not getting tongue-tied every 3 seconds, we will refer to the arithmetic mean as simply the mean.*
EXAMPLE: Below are the market-values of a single block of houses in a suburb of the Finger Lakes region.
$185,000 $168,000 $179,500 $216,500
$280,000 $158,250 $198,000 $173,850
$150,000 $299,900 $165,100 $170,000
a) Find the population mean, μ.
b) Find the sample mean, x̄, of a sample of size n = 5.
Think about the numbers calculated in the example above. How do μ and x̄ compare?
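If you would like to check your hand calculations with technology, here is a minimal Python sketch of the two means. It assumes the twelve house values are the entire population. The names house_values and sample, and the seed used to draw the sample, are our own choices for illustration and are not part of the text, so your sample of size 5 will almost certainly be different.

```python
import random

# Market values from the example, treated here as the whole population.
house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# Population mean: add every value and divide by N.
mu = sum(house_values) / len(house_values)

# Sample mean: draw n = 5 houses without replacement, then average them.
rng = random.Random(121)  # arbitrary seed (an assumption) so the sketch is repeatable
sample = rng.sample(house_values, 5)
x_bar = sum(sample) / len(sample)

print(f"population mean = {mu:,.2f}")
print(f"sample          = {sample}")
print(f"sample mean     = {x_bar:,.2f}")
```

The population mean is the same no matter who computes it, while the sample mean changes from one sample to the next.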
Here is a frequency table, with corresponding histogram, of the data presented in the table.
Sometimes the mean doesn’t really do justice in describing the data set completely. If the mean doesn’t tell
the story accurately, then we need another measurement that will.
The next measure we will explore is the median. The median of a variable is the value that lies in the middle of
the data. The data must be arranged in ascending order prior to finding the middle value. Median is often
denoted by a capital m, M.
Steps in Finding the Median of a Data Set
1) Arrange the data in ascending order.
2) Determine the number of observations, n.
3) Determine the observation in the middle of the data set.
If n is odd, the median is exactly in the middle of the data set, specifically the value in the
(n + 1)/2 position.
If n is even, the median is the mean of the middle two values in the data set, that is, the mean of the values in
the n/2 and (n/2) + 1 positions.
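The steps for finding the median can be sketched in Python as follows. This is just one way to write it; the function name median_by_steps and the two short lists of numbers are made up for illustration.

```python
def median_by_steps(data):
    """Find the median by sorting, counting, and locating the middle."""
    ordered = sorted(data)   # Step 1: arrange the data in ascending order
    n = len(ordered)         # Step 2: determine the number of observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    # Step 3: positions in the notes count from 1, but Python lists count
    # from 0, so every position is shifted down by one.
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # value in position (n + 1)/2
    lower = ordered[n // 2 - 1]            # value in position n/2
    upper = ordered[n // 2]                # value in position n/2 + 1
    return (lower + upper) / 2


print(median_by_steps([7, 3, 9, 1, 5]))       # odd n:  5
print(median_by_steps([7, 3, 9, 1, 5, 11]))   # even n: 6.0
```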
EXAMPLE REVISITED: Let’s look back at the real estate example from the mean problem laid out above. Find the
median of the data set.
$185,000 $168,000 $179,500 $216,500
$280,000 $158,250 $198,000 $173,850
$150,000 $299,900 $165,100 $170,000
Example Expanded: Suppose a new home is built in the same neighborhood as above with a market-value of
$185,000; find the median of the data.
Depending on the shape of the distribution, we can make some generalizations about how mean and median
compare to each other. The following table from page 132 in your text summarizes the generalities nicely.
One aspect of these measures we need to be aware of is the idea of resistance. A numerical summary of data is
resistant IF extreme values (known as outliers) relative to the data fail to affect its value substantially, if at all.
Median is resistant due to the fact that it is found by position, regardless of what the actual numbers are.
However, mean is not resistant! Mean is calculated by adding numbers together and dividing by the number of
observations. Let’s pretend for a moment that the most expensive house in the example above is actually worth
$500,000 rather than $299,900. The significantly higher number will make the mean significantly higher, giving
the impression that the neighborhood is “wealthier” than it actually is.
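To see resistance in action, here is a small Python sketch that recomputes the mean and the median after the pretend change from $299,900 to $500,000. It leans on Python's built-in statistics module, and the list name altered is our own label.

```python
import statistics

house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# Pretend change from the paragraph above: the top house is worth $500,000.
altered = [500000 if value == 299900 else value for value in house_values]

for label, values in (("original", house_values), ("altered", altered)):
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    print(f"{label:>8}: mean = {mean_value:,.2f}   median = {median_value:,.2f}")
```

The mean climbs by several thousand dollars while the median does not move at all, because the position of the middle values is unchanged.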
Relation between Mean, Median, and Distribution Shape
Distribution Shape    Mean vs. Median
Skewed Left           Mean substantially less than median
Symmetric             Mean roughly equal to median
Skewed Right          Mean substantially larger than median
The final measure of central tendency we will explore is mode. Mode, simply put, is the value (if any) of the
variable that appears most often. A data set can have 0, 1, or more than 1 mode. Mode, like median and mean,
can be very useful in helping summarize a quantitative data set. However, mode can also be used to describe
a qualitative data set, which mean and median cannot realistically do.
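Here is a minimal Python sketch for finding the mode or modes of a data set. It assumes the convention used above that a data set in which no value repeats has no mode. The function name modes and the example data are made up for illustration.

```python
from collections import Counter


def modes(data):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([2, 3, 3, 5, 7]))                  # one mode: [3]
print(modes(["red", "blue", "red", "blue"]))   # two modes, qualitative data
print(modes([1, 2, 3]))                        # no mode: []
```

Notice that the second call works on colors, where a mean or median would make no sense.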
When should you use mean, median, or mode?
Mean – When data are quantitative AND the frequency distribution is approximately symmetric.
Median – When the data are quantitative AND the frequency distribution is skewed left or right.
Mode – When the data are qualitative OR if the most frequent observation is the desired measure.
Try This! Ms. Mosher recorded the math test scores of six students in the table below. Determine the mean of the
student scores, to the nearest tenth. Determine the median of the student scores. Describe the effect on the mean
and the median if Ms. Mosher adds 5 bonus points to each of the six students’ scores.
Student Score
Andrew 72
John 80
George 85
Amber 93
Betty 78
Robert 80
Try This! The prices of seven race cars sold last week are listed in the table below. What is the mean value of these
race cars, in dollars? What is the median value of these race cars, in dollars? State which of these measures of
central tendency best represents the value of the seven race cars. Justify your answer.
Price of Car Frequency
$126,000 1
$140,000 2
$180,000 1
$400,000 2
$819,000 1
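When data arrive in a frequency table, each value has to be counted as many times as its frequency. One way to check your work on the race car problem is sketched below in Python; the variable names are our own, and the median line relies on the number of cars being odd.

```python
# (price, frequency) pairs copied from the table above.
price_table = [
    (126000, 1),
    (140000, 2),
    (180000, 1),
    (400000, 2),
    (819000, 1),
]

# Expand the table so each price appears as many times as its frequency.
prices = [price for price, freq in price_table for _ in range(freq)]

n = len(prices)
mean_price = sum(prices) / n

ordered = sorted(prices)
median_price = ordered[(n + 1) // 2 - 1]  # n = 7 is odd, so use position (n + 1)/2

print(f"n = {n}")
print(f"mean   = {mean_price:,.2f}")
print(f"median = {median_price:,}")
```

Deciding which of the two numbers best represents the seven cars, and why, is still up to you.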
One last problem before we move on…
The given data represent the fossil-fuel carbon dioxide (CO2) emissions (in thousands of
metric tons) of the top 10 emitters in 2007.
Country Emissions Per Capita Emissions
China 1,783,029 1.35
United States 1,591,756 5.20
India 439,695 0.39
Russia 419,241 2.95
Japan 342,118 2.71
Germany 214,872 2.61
Canada 151,988 4.61
United Kingdom 147,155 2.41
South Korea 137,257 2.82
Iran 135,257 1.88
A) Determine the mean and median CO2 emissions of the top 10 countries.
B) Why are the total emissions of a country not necessarily the best gauge of CO2
emissions? Why are per capita emissions a better gauge?
C) Determine the mean and median per capita CO2 emissions of the top 10 countries.
Which measure of central tendency is an environmentalist likely to use to support
their position that per capita emissions are too high?
Measures of Dispersion
In the last section, we focused primarily on the center and the shape of the distribution. Now we shift our attention
to the spread of the data, which is known as dispersion.
Knowing where our data are centered is great, but it is also beneficial to know how fast or slow our values move
away from that central spot. Are the values clumped up close to the middle or do they taper out slowly? We will
look at three ways to gauge dispersion of data:
Range
Standard Deviation
Variance
The most basic and easiest to calculate is range. Range, R, is simply the difference between the highest and
lowest value.
The range is a quick way to determine how spread out the data is relative to the numbers in the data set. For
instance, if I have a set of numbers ranging from 10 to 34 then my range is 24. This is a considerable range for
such small numbers; therefore the data is quite spread out. On the contrary, if the numbers went from 100 to 124
I still have a range of 24 but this time the range is small compared to the numbers in my data set. This data would
be grouped rather closely.
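The two situations just described can be sketched in Python. Only the endpoints (10 to 34 and 100 to 124) come from the discussion above; the values in between are made up, and dividing the range by the smallest value is just one informal way to judge how large the range is relative to the numbers.

```python
def data_range(values):
    """Range R = largest value minus smallest value."""
    return max(values) - min(values)


# Made-up data sets whose endpoints match the two cases described above.
small_numbers = [10, 15, 21, 28, 34]
large_numbers = [100, 105, 111, 118, 124]

for values in (small_numbers, large_numbers):
    r = data_range(values)
    # Same range in both cases, but very different relative to the data.
    print(f"range = {r}, range / smallest value = {r / min(values):.2f}")
```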
Not as easy to calculate but a little more revealing about dispersion is standard deviation. Like mean, we can
calculate standard deviation of a population (σ) or of a sample (s). Just to give you an appreciation and hopefully
better understand what the technology is doing, we will look at the formulas used to calculate population and
sample standard deviation. Before we do, let’s understand what we are actually measuring.
Standard deviation measures how far the observed values typically are from the mean. In essence, standard deviation is the
square root of the mean of the squared differences between each observed value and the population/sample mean. And now for the formulas:
Population Standard Deviation: \[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]
Sample Standard Deviation: \[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
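To see what the technology is doing behind the scenes, here is a Python sketch that computes both standard deviations straight from the usual definitions: the population version divides the sum of squared deviations by N, while the sample version divides by n − 1. The function names and the eight data values are made up for illustration.

```python
import math


def population_sd(values):
    """Square root of the mean squared deviation from the population mean."""
    size = len(values)
    mu = sum(values) / size
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / size)


def sample_sd(values):
    """Same idea, but the sum of squared deviations is divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers with mean 5

print(population_sd(data))         # 2.0
print(round(sample_sd(data), 3))   # 2.138
```

For the same list of numbers, the sample standard deviation always comes out a little larger because of the smaller divisor.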
Example: We will return to the house value example from the last section to practice finding standard deviation
without letting Excel do everything. Just as a reminder of what we are dealing with, here are the values again:
$185,000 $168,000 $179,500 $216,500
$280,000 $158,250 $198,000 $173,850
$150,000 $299,900 $165,100 $170,000
Example continued: Now let’s practice with a sample instead of a population. Use the sample created in the first
example we did regarding sample mean.
***Once you have calculated standard deviation, it is time to make sense of the number you have found. In short,
the higher the standard deviation the more dispersion the distribution has.
Now that you can calculate standard deviation, variance is just one simple calculation away. Population variance
is represented by σ² and sample variance by s². Take note, variance is simply standard deviation squared! So once
you have standard deviation, just square it and you will have the variance. Variance is not widely used due to its
lack of commonsense application. For example, if standard deviation is measured in dollars then variance would
be square dollars. Huh???
Just so we say we did some work with variance, calculate the population variance and sample variance based on
the values we calculated for standard deviation above.
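If you want to confirm the square relationship with technology, the sketch below uses Python's built-in statistics module on the house values. The sample shown is only a placeholder (the first five values in the list, not a random draw), so swap in the sample you actually used.

```python
import statistics

house_values = [
    185000, 168000, 179500, 216500,
    280000, 158250, 198000, 173850,
    150000, 299900, 165100, 170000,
]

# Population measures: all twelve houses.
sigma = statistics.pstdev(house_values)
sigma_squared = statistics.pvariance(house_values)

# Sample measures: placeholder sample, replace with your own five houses.
sample = house_values[:5]
s = statistics.stdev(sample)
s_squared = statistics.variance(sample)

print(f"population: sd = {sigma:,.2f}   variance = {sigma_squared:,.2f}")
print(f"sample:     sd = {s:,.2f}   variance = {s_squared:,.2f}")
```

The standard deviations come out in dollars, while the variances are in square dollars, which is why those numbers look so enormous.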
Empirical Rule
The Empirical Rule is a nice rule of thumb that can be used to determine the percentage of data that will lie within
a certain number of standard deviations of the mean. It is crucial to understand that the Empirical Rule only
applies to data with a bell-shaped (symmetric or approximately symmetric) distribution. This rule does NOT apply to skewed data.
The Empirical Rule is as follows:
If a distribution is roughly bell shaped, then:
1) ~68% of the data will fall within 1 standard deviation of the mean (between μ − σ and μ + σ)
2) ~95% of the data will fall within 2 standard deviations of the mean (between μ − 2σ and μ + 2σ)
3) ~99.7% of the data will fall within 3 standard deviations of the mean (between μ − 3σ and μ + 3σ)
A visual of what was stated above:
NOTE: Nearly 100% of all data falls within 3
standard deviations of the mean. We can never say,
theoretically, that ALL data is accounted for!
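Where do 68%, 95%, and 99.7% come from? If we assume the idealized bell shape is a normal distribution (an assumption on our part, since the rule above only says "roughly bell shaped"), the percentages can be computed with Python's built-in statistics.NormalDist. The function name share_within is our own.

```python
from statistics import NormalDist

# Idealized bell curve: a normal distribution with mean 0 and sd 1 (assumed).
bell_curve = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of the bell curve within k standard deviations of the mean."""
    return bell_curve.cdf(k) - bell_curve.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

Even at 3 standard deviations the proportion stays below 100%, which matches the note above.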
Some practice with dispersion…
The following data represent the number of pods on a sample of soybean plants for two different plot types.
Plot Type Pods
Liberty 32 31 36 35 44 31 39 37 38
No Till 35 31 32 30 43 33 37 42 40
A) Find the range for each plot type. Does this measure of dispersion help us determine which plot type is
superior?
B) Find the sample standard deviation for each plot type. Does either plot appear to be better using s as your
gauge?
C) Just for practice, find the sample variance for each plot type.
What if the distribution is not bell shaped? Then we use the one-size-fits-all approach, Chebyshev’s Inequality.
Chebyshev’s Inequality works for ALL distributions, even bell shaped. Furthermore, it can be used on sample
data as well as on entire populations. Chebyshev’s Inequality states that at least
\[ \left(1 - \frac{1}{k^2}\right) \cdot 100\% \]
of the observations lie within k standard deviations of the mean. Important! k must be greater than 1.
Example: A class of second graders has mean height of five feet with standard deviation of one inch. At least
what percent of the class must be between 4’10” and 5’2”?
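Here is a short Python sketch of how the second-grader example can be set up, assuming the standard form of Chebyshev’s Inequality: at least 1 − 1/k² of the observations lie within k standard deviations of the mean. The function name chebyshev_lower_bound and the variable names are our own.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


# Second-grader example, with every height converted to inches.
mean_height = 60     # five feet
sd_height = 1        # one inch
low, high = 58, 62   # 4'10" and 5'2"

# The interval reaches the same distance below and above the mean,
# so k is that distance measured in standard deviations.
k_below = (mean_height - low) / sd_height
k_above = (high - mean_height) / sd_height
assert k_below == k_above
k = k_above

print(f"k = {k}")
print(f"at least {chebyshev_lower_bound(k):.0%} of the class")
```

The same function can be reused on any problem where the interval is centered on the mean; only the value of k changes.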
Try This! Computers from a particular company are found to last on average for three years without any
hardware malfunction, with standard deviation of two months. At least what percent of the computers last
between 31 months and 41 months?