
Post on 11-Jan-2016


Initial Data Analysis: Central Tendency

Notation

When we describe a set of data corresponding to the values of some variable, we will refer to that set using an uppercase letter such as X or Y.

When we want to talk about specific data points within that set, we specify those points by adding a subscript to the uppercase letter, such as $X_1$.

Example

Values: 5, 8, 12, 3, 6, 8, 7

Labels: $X_1, X_2, X_3, X_4, X_5, X_6, X_7$

Summation

The Greek letter sigma, which looks like $\Sigma$, means “add up” or “sum” whatever follows it.

Thus, $\sum X_i$ means “add up all the $X_i$s.”

If we use the $X_i$s from the previous example, $\sum X_i = 49$ (or just $\sum X$).
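The subscript and summation notation maps directly onto list indexing and `sum`. The sketch below uses the example values above; the variable names are illustrative only, and note that Python counts positions from 0 where the slide counts from 1.

```python
# The example data set X from the notation slide.
X = [5, 8, 12, 3, 6, 8, 7]

# Python lists are 0-indexed, so X_1 in the slide's notation is X[0]
# and X_7 is X[6].
x_1 = X[0]
x_7 = X[6]

# Sigma X_i: add up every X_i.
total = sum(X)

print(x_1, x_7, total)  # 5 7 49
```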

Example

Student   Pred. Score (X)   Actual Score (Y)
1         82                84
2         66                51
3         70                72
4         81                56
5         61                73

Example (cont.)

$\sum X = 82 + 66 + 70 + 81 + 61 = 360$

$\sum Y = 84 + 51 + 72 + 56 + 73 = 336$

$\sum (X - Y) = (82 - 84) + (66 - 51) + (70 - 72) + (81 - 56) + (61 - 73) = -2 + 15 + (-2) + 25 + (-12) = 24$

$\sum X^2 = 82^2 + 66^2 + 70^2 + 81^2 + 61^2 = 6724 + 4356 + 4900 + 6561 + 3721 = 26262$

One can also see it as $\sum (X^2)$.

$(\sum X)^2 = 360^2 = 129600$
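The five sums in this example can be checked with a few lines of Python. This is only a sketch with illustrative variable names; the point it makes is the order of operations, since the sum of squares squares each score before adding while the squared sum adds before squaring.

```python
X = [82, 66, 70, 81, 61]  # predicted scores
Y = [84, 51, 72, 56, 73]  # actual scores

sum_x = sum(X)                                # ΣX
sum_y = sum(Y)                                # ΣY
sum_diff = sum(x - y for x, y in zip(X, Y))   # Σ(X - Y)
sum_of_squares = sum(x ** 2 for x in X)       # ΣX²: square each score, then add
square_of_sum = sum(X) ** 2                   # (ΣX)²: add the scores, then square

print(sum_x, sum_y, sum_diff)         # 360 336 24
print(sum_of_squares, square_of_sum)  # 26262 129600
```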

Your turn

X   Y
3   5
6   7
2   3

$\sum (XY) =$

$(\sum (X - Y))^2 =$

$\sqrt{\dfrac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} =$

Your turn (cont.)

X   Y
3   5
6   7
2   3

$\sum (XY) = 15 + 42 + 6 = 63$

$(\sum (X - Y))^2 = [(-2) + (-1) + (-1)]^2 = 16$

$\sqrt{\dfrac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} = 2.08$
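The three answers can be reproduced in Python. The third expression is garbled in the transcript, so this sketch assumes it is the computational formula $\sqrt{(\sum X^2 - (\sum X)^2 / N) / (N - 1)}$ applied to X, which is the reading that reproduces the stated answer of 2.08. Variable names are illustrative.

```python
import math
import statistics

X = [3, 6, 2]
Y = [5, 7, 3]
n = len(X)

# Σ(XY): multiply each pair, then add.
sum_xy = sum(x * y for x, y in zip(X, Y))

# (Σ(X - Y))²: add the differences, then square the total.
squared_sum_diff = sum(x - y for x, y in zip(X, Y)) ** 2

# Assumed reading of the third expression (see the lead-in).
sum_sq = sum(x ** 2 for x in X)
s = math.sqrt((sum_sq - sum(X) ** 2 / n) / (n - 1))

print(sum_xy, squared_sum_diff, round(s, 2))  # 63 16 2.08

# Under that reading the result agrees with the sample standard deviation.
print(math.isclose(s, statistics.stdev(X)))  # True
```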

Measures of Central Tendency

While distributions provide an overall picture of some data set, it is sometimes desirable to represent the entire data set using descriptive statistics.

The first descriptive statistics we will discuss are those used to indicate where the center of the distribution lies.

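[Figure: frequency histogram of height. x-axis: Height (Inches), bins centered at 60.5, 62.5, 64.5, 66.5, 68.5, 70.5, 72.5, 74.5 and 76.5; y-axis: Frequency, 0 to 14. Bin frequencies: 60.5: 3, 62.5: 8, 64.5: 7, 66.5: 12, 68.5: 7, 70.5: 6, 72.5: 4, 74.5: 0, 76.5: 1.]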

The Mode

There are different measures of central tendency, each with its own advantages and disadvantages.

The first of these is called the mode.

The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.

The Mode (cont.)

Note that if you have drawn a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.

However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).

Finding the mode

• Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.

• Example: Class height.

• n = 48

Value   Freq      Value   Freq
61      3         69      3
62      4         70      2
63      4         71      4
64      4         72      4
65      3         73      0
66      7         74      0
67      5         75      0
68      4         76      1
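Reading the mode off a non-grouped frequency table amounts to finding the value with the largest count. A minimal sketch using the class height table, with the table stored as a dictionary (an illustrative choice, not something the slides prescribe):

```python
# Class height frequency table: value (inches) -> frequency.
height_freq = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}

# The frequencies should add up to the sample size.
n = sum(height_freq.values())

# The mode is the value whose frequency is greatest.
# If several values tied, max() would return only the first of them.
mode = max(height_freq, key=height_freq.get)

print(n, mode, height_freq[mode])  # 48 66 7
```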

Mode

Advantages
• Very quick and easy to determine
• Is an actual value of the data
• Not affected by extreme scores

Disadvantages
• Sometimes not very informative (e.g. cigarettes smoked in a day)
• Can change dramatically from sample to sample
• Might be more than one (which is more representative?)

The Median

A second measure of central tendency is called the median.

The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).

The Median (cont.)

To find the median, the data points must first be sorted into either ascending or descending numerical order.

The position of the median value can then be calculated using the following formula:

$\text{Median Location} = \dfrac{N + 1}{2}$

Examples

• If there is an odd number of data points:

(1, 2, 2, 3, 3, 4, 4, 5, 6)

$\text{Median Location} = \dfrac{9 + 1}{2} = 5$

• The median is the item in the fifth position of the ordered data set, therefore the median is 3.

• If there is an even number of data points:

(1, 2, 2, 3, 3, 4, 4, 5, 6, 793)

The formula would tell us to look in the 5.5th place, which we can’t really do. However, we can take the average of the 5th and 6th values to give us the median. In the above scenario, 3 is in the fifth place and 4 is in the sixth place, so we can use 3.5 as our median.
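The median-location rule, including the averaging step for an even number of points, can be sketched as a small function. The function name is illustrative; Python's standard library offers `statistics.median`, which gives the same results for these two data sets.

```python
import statistics

def median(values):
    """Median via the (N + 1) / 2 location rule from the slides."""
    ordered = sorted(values)           # the data must be sorted first
    n = len(ordered)
    if n % 2 == 1:
        location = (n + 1) // 2        # a whole-number position, counted from 1
        return ordered[location - 1]   # convert to a 0-based index
    # Even N: the location falls halfway between two positions,
    # so average the two values on either side of it.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_data = [1, 2, 2, 3, 3, 4, 4, 5, 6]
even_data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 793]

print(median(odd_data))   # 3
print(median(even_data))  # 3.5

# The extreme score 793 barely moves the median but dominates the mean.
print(sum(even_data) / len(even_data))  # 82.3
```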

Median (Advantage/Disadvantage)

Advantage: Resistant to outliers

Disadvantage: May not be so informative:

(1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10)

Does the value of 2 really represent this sample as a whole very well?

The Mean

Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and µ for a population).

The mean is the same as what most of us call the average, and it is calculated in the following manner:

$\bar{X} = \dfrac{\sum X}{N}$

The Mean

• For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be:

$\bar{X} = \dfrac{\sum X}{N} = \dfrac{30}{9} = 3.33$

• Similarly, the mean height of a statistics class, as indicated by the previous sample, would be:

$\bar{X} = \dfrac{\sum X}{N} = \dfrac{3198}{48} = 66.625$
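Both means can be verified with a short sketch. For the height data the sum is taken straight from the frequency table by weighting each value by its frequency; the names below are illustrative.

```python
# Odd-number example from the median slides.
odd_data = [1, 2, 2, 3, 3, 4, 4, 5, 6]
mean_odd = sum(odd_data) / len(odd_data)
print(sum(odd_data), len(odd_data), round(mean_odd, 2))  # 30 9 3.33

# Class height frequency table: value (inches) -> frequency.
height_freq = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}

# ΣX from a frequency table: each value counts as many times as it occurs.
sum_heights = sum(value * freq for value, freq in height_freq.items())
n_heights = sum(height_freq.values())
mean_height = sum_heights / n_heights
print(sum_heights, n_heights, mean_height)  # 3198 48 66.625
```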

Mode vs. Median vs. Mean

• In our height example, the mode and median were the same, and the mean was fairly close to the mode and median.

• This was the case because the height distribution was fairly symmetrical.

• However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.

Example: Slices of Pizza Eaten Last Week

Value   Freq      Value   Freq
0       4         8       5
1       2         10      2
2       8         15      1
3       6         16      1
4       6         20      1
5       6         40      1
6       5

Mode = 2 slices per week

Median = 4 slices per week

Mean = 5.7 slices per week

• This raises the issue of which measure is best.
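The three measures for the pizza example can be computed by expanding the frequency table into individual observations and using Python's `statistics` module. The table below is read from the flattened transcript (values 0 to 6 in the left column, 8 to 40 in the right), so treat that reading as an assumption. With it, the mode and median match the slide, and the mean comes out at 271/48, about 5.65, close to the 5.7 reported on the slide.

```python
import statistics

# Slices of pizza eaten last week: value -> frequency (as read from the transcript).
pizza_freq = {
    0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
    8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1,
}

# Expand the table into one entry per respondent.
slices = [value for value, freq in pizza_freq.items() for _ in range(freq)]

n = len(slices)
mode = statistics.mode(slices)
median = statistics.median(slices)
mean = statistics.mean(slices)

print(n, mode, median, round(mean, 2))  # 48 2 4.0 5.65
```

The long right tail (a few people reporting 15 to 40 slices) pulls the mean above the median, which in turn sits above the mode.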

Some Visual Demos

Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.

As you use the demo, you should easily be able to think about how these changes are also affecting the mode, right?

Note that in a skewed distribution the order goes mode, median, mean in the direction the tail is pointing.

Your turn

Find the mean, median and mode of the following dataset:

7 3 4 3 5 2 4 6 1 7 3 6 3 3 4

Mean =     Median =     Mode =

Mean = 4.07 Median = 4 Mode = 3
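These answers can be checked with the `statistics` module from Python's standard library; the variable names are illustrative.

```python
import statistics

data = [7, 3, 4, 3, 5, 2, 4, 6, 1, 7, 3, 6, 3, 3, 4]

mean = statistics.mean(data)      # ΣX / N = 61 / 15
median = statistics.median(data)  # 8th value of the 15 sorted values
mode = statistics.mode(data)      # 3 occurs five times

print(round(mean, 2), median, mode)  # 4.07 4 3
```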

Other measures of central tendency (preview)

Trimmed mean
• Created by “trimming” some percentage of the high and low ends of the data

M-estimators
• Extreme values are given less weight than those closer to the center of the distribution.
• May be more robust than mean or median for certain types of “funky” data
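A trimmed mean can be sketched in a few lines. The function name and the trimming convention here (drop the same whole number of points, `int(N * proportion)`, from each end of the sorted data) are assumptions made for illustration; the slides do not specify a convention, and statistical packages differ in how they round the cut.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end.

    Assumed convention: int(N * proportion) points are cut from the low
    end and the same number from the high end.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)        # points to drop per end
    trimmed = ordered[k:len(ordered) - k]
    return sum(trimmed) / len(trimmed)

# The even-number median example, which contains the extreme score 793.
data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 793]

print(sum(data) / len(data))     # 82.3  (ordinary mean)
print(trimmed_mean(data, 0.10))  # 3.625 (1 and 793 are dropped)
```

With 10% trimmed from each end, the single extreme score no longer dominates, and the result lands near the median of 3.5.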
