Summarizing Data by a Single Number
Post on 16-Jul-2015
Presented by:
Ahmad Imair
A "measure of central tendency" is a single value that attempts to describe a set of data by identifying the central position within that set. The term refers to the mean, the median and the mode.
Measures of central tendency are sometimes called measures of central location.
Mean: The average. The sum of a set of data values divided by the number of values.
Median: The middle value, or the mean of the middle two values, when the data are arranged in numerical order.
Mode: The value (number) that appears most often. It is possible to have more than one mode, and it is possible to have no mode.
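The three definitions above can be sketched in Python. The data values and the helper name `modes` are made up for illustration. The helper assumes a non-empty list and follows the convention stated above: when no value repeats, the set has no mode.

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # every value appears once: treated here as "no mode"
    return [value for value, count in counts.items() if count == top]


scores = [2, 3, 3, 5, 7, 10]      # hypothetical data, already in numerical order

average = mean(scores)            # 30 / 6
middle = median(scores)           # mean of the middle two values, 3 and 5
most_common = modes(scores)       # only 3 repeats

print(average, middle, most_common)
print(modes([1, 1, 2, 2, 3]))     # two modes
print(modes([1, 2, 3]))           # no mode
```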
Consider two sets of test score values that are identical except for the first score.
The set on the left shows the actual scores.
The set on the right shows what would happen if one of the scores were far out of range relative to the other scores.
Such a value is called an outlier.
With the outlier, the mean changed.
With the outlier, the median did NOT change.
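The score values from the original slide are not available in this transcript, so the following sketch uses invented scores to show the same effect: replacing the first score with an outlier moves the mean but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical scores; the two lists differ only in the first value.
actual_scores = [65, 70, 75, 80, 85]
with_outlier = [5, 70, 75, 80, 85]

mean_actual = mean(actual_scores)      # 375 / 5
mean_outlier = mean(with_outlier)      # 315 / 5
median_actual = median(actual_scores)
median_outlier = median(with_outlier)

print("mean:  ", mean_actual, "->", mean_outlier)
print("median:", median_actual, "->", median_outlier)
```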
How do I know which measure of central tendency to use?
What will happen to the measures of central
tendency if we add the same amount to all data
values, or multiply each data value by the same
amount?
When added: Since all values are shifted by the same amount, the measures of central tendency are all shifted by that amount. If you add 3 to each data value, you add 3 to the mean, mode and median.
When multiplied: Since all values are scaled by the same factor, the measures of central tendency feel the same effect. If you multiply each data value by 2, you multiply the mean, mode and median by 2.
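A quick check of both rules, using a small made-up data set (the values are an assumption for illustration only):

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7]                 # hypothetical data values


def summary(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)


original = summary(data)
shifted = summary([x + 3 for x in data])   # add 3 to every value
scaled = summary([x * 2 for x in data])    # multiply every value by 2

print(original)
print(shifted)
print(scaled)
```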
Summary of when to use the mean, median and mode

Type of Variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median
Below are some common questions that are asked regarding measures of central tendency:
What is the best measure of central tendency?
In a strongly skewed distribution, what is the best indicator of central
tendency?
Does all data have a median, mode and mean?
When is the mean the best measure of central tendency?
When is the mode the best measure of central tendency?
When is the median the best measure of central tendency?
What is the most appropriate measure of central tendency when the data
has outliers?
In a normally distributed data set, which is greatest: mode, median or
mean?
For any data set, which measures of central tendency have only one value?
When we summarize performance, we prefer to have a single number for comparison.
Consider a situation in which it is difficult to compare the performance of two machines:
For program 1, computer A is 10 times faster than B.
For program 2, computer B is 10 times faster than A.
Using these measurements alone, the relative performance of computers A and B is unclear.
The simplest method to summarize relative performance is to use the total execution time of the two programs. By that measure, computer B is 1001/110 = 9.1 times faster than A for programs 1 and 2.
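The table of execution times did not survive in this transcript. The sketch below assumes per-program times that are consistent with the statements above (each machine 10 times faster on one program, totals of 1001 and 110 seconds); the individual values are an assumption, not taken from the slides.

```python
# Assumed execution times in seconds, chosen to match the quoted totals.
times = {
    "A": {"program 1": 1.0, "program 2": 1000.0},
    "B": {"program 1": 10.0, "program 2": 100.0},
}

# Total execution time per machine.
totals = {machine: sum(per_program.values()) for machine, per_program in times.items()}

# How many times faster B is than A when judged by total time.
speedup_b_over_a = totals["A"] / totals["B"]

print(totals)
print(round(speedup_b_over_a, 1))
```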
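The average of the execution times is the Arithmetic Mean (AM). This is the definition of "average" you are most familiar with:
$$\mathrm{AM} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Time}_i$$
where $\mathrm{Time}_i$ is the execution time of the $i$-th of $n$ programs.
If performance is expressed as a rate, then the average that tracks total execution time is the Harmonic Mean (HM). This is a different definition of "average" that you are less familiar with:
$$\mathrm{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\mathrm{Rate}_i}}$$
where $\mathrm{Rate}_i$ is the rate measured for the $i$-th program.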
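To see why the harmonic mean is the right average for rates, here is a minimal sketch with invented numbers. It assumes every program does the same amount of work (100 million instructions, a made-up figure), so the overall rate is total work divided by total time.

```python
from statistics import harmonic_mean, mean

work = 100.0                          # hypothetical: million instructions per program
times = [1.0, 4.0]                    # hypothetical execution times in seconds
rates = [work / t for t in times]     # million instructions per second: 100 and 25

am_time = mean(times)                             # average execution time
hm_rate = harmonic_mean(rates)                    # harmonic mean of the rates
overall_rate = work * len(times) / sum(times)     # total work / total time
am_rate = mean(rates)                             # arithmetic mean of the rates

print(am_time, hm_rate, overall_rate, am_rate)
```

With these numbers the harmonic mean of the rates equals the overall rate, while the arithmetic mean of the rates overstates it.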
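Problems with Arithmetic Mean
1. Applications do not all have the same probability of being run. (The AM assumes that each program in the set is run an equal number of times.)
2. Longer programs weigh more heavily in the average.
To deal with these problems, we calculate the Weighted Execution Time, giving more weight to the execution times of the more frequently used programs.
Weighted Arithmetic Mean:
$$\mathrm{WAM} = \sum_{i=1}^{n} w_i \, \mathrm{Time}_i$$
Weighted Harmonic Mean:
$$\mathrm{WHM} = \frac{1}{\sum_{i=1}^{n} \frac{w_i}{\mathrm{Rate}_i}}$$
where $w_i$ is the frequency of the $i$-th program in the workload and the weights sum to 1.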
Using a Weighted Sum (or weighted average)
Weighting each program by its use:
Weighted sum of Machine A = (0.20 * 2 + 0.80 * 12) = 0.4 + 9.6 = 10 seconds
Weighted sum of Machine B = (0.20 * 4 + 0.80 * 8) = 0.8 + 6.4 = 7.2 seconds
This allows us to determine relative performance: 10/7.2 = 1.39, so Machine B is 1.39 times faster than Machine A.
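The weighted sums above can be reproduced in a few lines. The weights and times are the ones given in the text; the function name `weighted_time` is only an illustrative choice.

```python
weights = [0.20, 0.80]       # fraction of use for program 1 and program 2
times_a = [2.0, 12.0]        # execution times on Machine A, in seconds
times_b = [4.0, 8.0]         # execution times on Machine B, in seconds


def weighted_time(weights, times):
    """Weighted arithmetic mean of execution times (weights must sum to 1)."""
    return sum(w * t for w, t in zip(weights, times))


weighted_a = weighted_time(weights, times_a)
weighted_b = weighted_time(weights, times_b)
relative = weighted_a / weighted_b    # how many times faster B is than A

print(round(weighted_a, 2), round(weighted_b, 2), round(relative, 2))
```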
Another method of presenting machine performance is to normalize execution times to a reference machine, and then take the average of the normalized execution times.
However, if we compute the arithmetic mean of the normalized execution times, the result depends on the choice of the reference machine.
When we normalize to machine A, the Arithmetic Mean indicates that A is faster than B by a factor of 5.05.
When we normalize to machine B, the Arithmetic Mean indicates that B is faster than A by a factor of 5.05.
At most one of these results can be correct.
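The contradiction can be reproduced with a short sketch. The per-program times are assumed (1 s and 1000 s on A, 10 s and 100 s on B); they are not given in this transcript but are consistent with the 10x statements, the 1001 and 110 second totals, and the 5.05 figure quoted above.

```python
from statistics import mean

# Assumed execution times in seconds for program 1 and program 2.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`, per program."""
    return [t / r for t, r in zip(times[machine], times[reference])]


# Normalized to A: B's normalized times are 10 and 0.1.
am_b_relative_to_a = mean(normalized("B", "A"))
# Normalized to B: A's normalized times are 0.1 and 10.
am_a_relative_to_b = mean(normalized("A", "B"))

print(am_b_relative_to_a)   # B appears slower than A by this factor
print(am_a_relative_to_b)   # A appears slower than B by this factor
```

Each machine looks about 5.05 times slower than whichever machine is chosen as the reference, which is why the arithmetic mean of normalized times is misleading.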
Instead of using the Arithmetic Mean, the normalized execution times should be combined with the Geometric Mean (GM):
$$\mathrm{GM} = \sqrt[n]{\prod_{i=1}^{n} \mathrm{ExecutionTimeRatio}_i}$$
where $\mathrm{ExecutionTimeRatio}_i$ is the execution time of the $i$-th program, normalized to the reference computer.
The Geometric Mean is independent of which data series we use for normalization, because it has the following property:
$$\frac{\mathrm{GM}(X_i)}{\mathrm{GM}(Y_i)} = \mathrm{GM}\!\left(\frac{X_i}{Y_i}\right)$$
Thus the geometric mean produces the same result whether we normalize to machine A or to machine B.
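A minimal sketch of the geometric mean on normalized times, again using assumed per-program times (1 s and 1000 s on A, 10 s and 100 s on B) that are consistent with the figures quoted in the text. It relies on `statistics.geometric_mean`, available in Python 3.8 and later.

```python
from statistics import geometric_mean

# Assumed execution times in seconds for program 1 and program 2.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`, per program."""
    return [t / r for t, r in zip(times[machine], times[reference])]


gm_b_relative_to_a = geometric_mean(normalized("B", "A"))   # sqrt(10 * 0.1)
gm_a_relative_to_b = geometric_mean(normalized("A", "B"))   # sqrt(0.1 * 10)

# Property check: GM(X) / GM(Y) equals GM(X / Y).
ratio_of_gms = geometric_mean(times["B"]) / geometric_mean(times["A"])

print(gm_b_relative_to_a, gm_a_relative_to_b, ratio_of_gms)
```

All three values come out at approximately 1, regardless of which machine serves as the reference.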
The advantages of the Geometric Mean:
1. It is independent of the running times of the individual programs.
2. It doesn't matter which machine is used for normalization.
The disadvantage of using Geometric Means:
1. They do not predict execution time.
The geometric means in the table suggest that for programs 1 and 2 the performance is the same for machines A and B, whereas the arithmetic mean of the execution times suggests that machine B is 9.1 times faster than machine A.