summarizing data by a single number

Post on 16-Jul-2015

Category:

Technology

Summarizing Data by a Single Number

Presented by:

Ahmad Imair

The term "measures of central tendency" refers to single values that attempt to describe a set of data by identifying the central position within that set of data: the mean, the median and the mode.

Measures of central tendency are sometimes called measures of central location.

Mean: The average. The sum of a set of data divided by the number of data values.

Median: The middle value, or the mean of the middle two values, when the data are arranged in numerical order.

Mode: The value (number) that appears most often. It is possible to have more than one mode, and it is possible to have no mode.
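The three definitions can be checked with a short sketch. It uses Python's standard `statistics` module; the data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical data, chosen only to illustrate the three definitions.
scores = [2, 3, 3, 5, 7, 10]

score_mean = mean(scores)        # sum of the values divided by how many there are
score_median = median(scores)    # even count, so the mean of the middle two (3 and 5)
score_modes = multimode(scores)  # list of the most frequent value(s)

# Two values tie for most frequent: more than one mode.
two_modes = multimode([1, 1, 2, 2, 3])

# Every value appears once: multimode returns all of them,
# which corresponds to "no mode" in the sense used above.
no_mode = multimode([1, 2, 3, 4])

print(score_mean, score_median, score_modes, two_modes, no_mode)
```

`multimode` is used rather than `mode` because it returns every tied value, which makes the "more than one mode" case visible.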

Consider this set of test score values:

The two sets of scores above are identical except for the first score.

The set on the left shows the actual scores.

The set on the right shows what would happen if one of the scores were far out of range relative to the other scores.

Such a value is called an outlier.

With the outlier, the mean changed.

With the outlier, the median did NOT change.
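A minimal sketch of the same effect follows. The scores are hypothetical stand-ins, not the values from the slide; the two lists differ only in the first score.

```python
from statistics import mean, median

# Hypothetical scores (not the slide's data).
actual = [65, 70, 72, 78, 80, 85, 90]
with_outlier = [5] + actual[1:]  # same scores, but the first one is far out of range

mean_actual = mean(actual)
mean_outlier = mean(with_outlier)
median_actual = median(actual)
median_outlier = median(with_outlier)

print("mean:  ", mean_actual, "->", mean_outlier)      # the mean is pulled down
print("median:", median_actual, "->", median_outlier)  # the median stays the same
```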

How do I know which measure of central tendency to use?

What will happen to the measures of central tendency if we add the same amount to all data values, or multiply each data value by the same amount?

When added: Since all values are shifted by the same amount, the measures of central tendency all shift by that amount. If you add 3 to each data value, you will add 3 to the mean, mode and median.

When multiplied: Since all values are multiplied by the same factor, the measures of central tendency are affected in the same way. If you multiply each data value by 2, you will multiply the mean, mode and median by 2.
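The shift and scale rules can be verified directly. This is a small sketch on hypothetical data; `summary` is a helper name introduced here for illustration.

```python
from statistics import mean, median, mode

# Hypothetical data with a single mode (4).
data = [2, 4, 4, 7, 13]

def summary(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (mean(values), median(values), mode(values))

base = summary(data)
shifted = summary([x + 3 for x in data])  # add 3 to every value
scaled = summary([x * 2 for x in data])   # multiply every value by 2

print(base, shifted, scaled)
```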

Summary of when to use the mean, median and mode

Type of variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

Below are some common questions that are asked regarding measures of central tendency:

What is the best measure of central tendency?

In a strongly skewed distribution, what is the best indicator of central tendency?

Does all data have a median, mode and mean?

When is the mean the best measure of central tendency?

When is the mode the best measure of central tendency?

When is the median the best measure of central tendency?

What is the most appropriate measure of central tendency when the data has outliers?

In a normally distributed data set, which is greatest: mode, median or mean?

For any data set, which measures of central tendency have only one value?

When we summarize performance, we prefer to have a single number to compare.

Consider a situation in which it is difficult to compare the performance of two machines.

For program 1, computer A is 10 times faster than B.

For program 2, computer B is 10 times faster than A.

Using these measurements alone, the relative performance of computers A and B is unclear.

The simplest method to summarize relative performance is to use the total execution time of the two programs. Computer B is 1001/110 = 9.1 times faster than A for programs 1 and 2.
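The total-time comparison can be sketched as follows. The per-program times are an assumption: they are the values (in seconds) consistent with the two 10x ratios and the totals 1001 and 110 quoted above, since the original table is not reproduced in this transcript.

```python
# Assumed execution times in seconds: [program 1, program 2].
# They satisfy the stated ratios (10x each way) and the totals 1001 and 110.
times = {"A": [1, 1000], "B": [10, 100]}

total_a = sum(times["A"])  # 1001
total_b = sum(times["B"])  # 110

# Relative performance based on total execution time.
b_over_a = total_a / total_b

print(f"B is {b_over_a:.1f} times faster than A on the two programs combined")
```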

The average of the execution times is the Arithmetic Mean (AM):

AM = (1/n) * (Time_1 + Time_2 + ... + Time_n)

This is the definition of "average" you are most familiar with.

If performance is expressed as a rate, then the average that tracks total execution time is the Harmonic Mean (HM):

HM = n / (1/Rate_1 + 1/Rate_2 + ... + 1/Rate_n)

This is a different definition of "average", one you are less familiar with.
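A short sketch shows why the harmonic mean of rates tracks total execution time. The times are assumed values for one computer, and the sketch further assumes that each program performs the same amount of work (`work` is a hypothetical unit).

```python
import math
from statistics import mean, harmonic_mean

times = [1.0, 1000.0]  # assumed execution times in seconds, one per program
work = 1.0             # hypothetical: each program does one unit of work

# Performance expressed as a rate (work per second) for each program.
rates = [work / t for t in times]

am_time = mean(times)            # arithmetic mean of the execution times
hm_rate = harmonic_mean(rates)   # harmonic mean of the rates

# Overall rate actually achieved: total work divided by total time.
overall_rate = (work * len(times)) / sum(times)

print(am_time, hm_rate, overall_rate)
print(math.isclose(hm_rate, overall_rate))  # the HM matches the overall rate
```

Taking the arithmetic mean of the rates instead would give roughly 0.5 units per second, far above the rate the machine actually sustains over both programs.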

Problems with the Arithmetic Mean

1. Applications do not have the same probability of being run. (The AM assumes that the programs in the set are each run an equal number of times.)

2. Longer programs weigh more heavily in the average.

To deal with these problems, we calculate the weighted execution time, giving more weight to the execution times of the more frequently used programs.

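Weighted Arithmetic Mean = w_1 * Time_1 + w_2 * Time_2 + ... + w_n * Time_n

Weighted Harmonic Mean = 1 / (w_1/Rate_1 + w_2/Rate_2 + ... + w_n/Rate_n)

Here w_i is the weight (frequency of use) of program i, and the weights sum to 1.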

Using a Weighted Sum (or weighted average)

Weighted sum for Machine A = ( 0.20 * 2 + 0.80 * 12 ) = 0.4 + 9.6 = 10 seconds

Weighted sum for Machine B = ( 0.20 * 4 + 0.80 * 8 ) = 0.8 + 6.4 = 7.2 seconds

This allows us to determine relative performance: 10/7.2 ≈ 1.39, so Machine B is about 1.39 times faster than Machine A.

Weighting each program by its use
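The weighted sums above can be reproduced with a few lines of code. The weights and times are the ones from the example; `weighted_time` is a helper name introduced here for illustration.

```python
import math

def weighted_time(weights, times):
    """Weighted arithmetic mean of execution times; weights are assumed to sum to 1."""
    return sum(w * t for w, t in zip(weights, times))

weights = [0.20, 0.80]                      # fraction of use of each program
machine_a = weighted_time(weights, [2, 12])  # about 10 seconds
machine_b = weighted_time(weights, [4, 8])   # about 7.2 seconds

# Relative performance: how many times faster B is than A.
speedup = machine_a / machine_b

print(f"A: {machine_a:.1f} s, B: {machine_b:.1f} s, B is {speedup:.2f}x faster")
```

Floating-point rounding means the sums are only approximately 10 and 7.2, so comparisons are best made with a tolerance.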

Another method of presenting machine performance is to normalize execution times to a reference machine, and then take the average of the normalized execution times. However, if we compute the arithmetic mean of the normalized execution time values, the result will depend on the choice of the machine we use as a reference.

When we normalize to machine A, the Arithmetic Mean indicates that A is faster than B by a factor of 5.05.

When we normalize to machine B, the Arithmetic Mean indicates that B is faster than A by a factor of 5.05.

Only one of these results can be correct.
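The contradiction can be reproduced in a short sketch. The execution times are assumed values (in seconds) consistent with the ratios and totals quoted earlier; `normalized` is a helper name introduced here for illustration.

```python
import math
from statistics import mean

# Assumed execution times in seconds: [program 1, program 2].
times = {"A": [1, 1000], "B": [10, 100]}

def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`, per program."""
    return [t / r for t, r in zip(times[machine], times[reference])]

# Normalized to A: A scores 1.0 by definition, B scores the mean of [10.0, 0.1].
am_b_norm_a = mean(normalized("B", "A"))

# Normalized to B: B scores 1.0 by definition, A scores the mean of [0.1, 10.0].
am_a_norm_b = mean(normalized("A", "B"))

print(am_b_norm_a)  # about 5.05, so B looks 5.05 times slower than A
print(am_a_norm_b)  # about 5.05, so A looks 5.05 times slower than B
```

Each machine appears about five times slower than the other, depending only on which one was chosen as the reference.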

Instead of using the Arithmetic Mean, the normalized execution times should be combined with the Geometric Mean (GM):

GM = (Execution time ratio_1 * Execution time ratio_2 * ... * Execution time ratio_n)^(1/n)

where Execution time ratio_i is the execution time of the i-th program, normalized to the reference computer.

The Geometric Mean is independent of which data series we use for normalization, because it has the following property:

GM(X_i) / GM(Y_i) = GM(X_i / Y_i)

Thus the geometric mean produces the same result whether we normalize to machine A or to machine B.
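A minimal sketch of the geometric mean on normalized times follows. The execution times are assumed values (in seconds) consistent with the ratios and totals quoted earlier, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import math
from statistics import geometric_mean

# Assumed execution times in seconds: [program 1, program 2].
times_a = [1, 1000]
times_b = [10, 100]

# Execution time ratios with each machine in turn as the reference.
b_norm_a = [b / a for a, b in zip(times_a, times_b)]  # [10.0, 0.1]
a_norm_b = [a / b for a, b in zip(times_a, times_b)]  # [0.1, 10.0]

gm_b_norm_a = geometric_mean(b_norm_a)
gm_a_norm_b = geometric_mean(a_norm_b)

# Property: GM(X_i) / GM(Y_i) equals GM(X_i / Y_i).
ratio_of_gms = geometric_mean(times_b) / geometric_mean(times_a)

print(gm_b_norm_a, gm_a_norm_b, ratio_of_gms)
```

Both normalizations give a geometric mean of about 1, so the two machines are rated as equal regardless of the reference, in contrast to the arithmetic mean of the same ratios.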

The advantages of the Geometric Mean:

1. It is independent of the running times of the individual programs.

2. It doesn't matter which machine is used for normalization.

The disadvantage of using Geometric Means:

1. They do not predict execution time.

The geometric means in the table suggest that for programs 1 and 2 the performance is the same for machines A and B, whereas the arithmetic mean of the execution times suggests that machine B is 9.1 times faster than machine A.
