Initial Data Analysis Central Tendency

Upload: amice-banks

Post on 11-Jan-2016


Page 1: Initial Data Analysis Central Tendency. Notation  When we describe a set of data corresponding to the values of some variable, we will refer to that

Initial Data Analysis: Central Tendency


Notation

When we describe a set of data corresponding to the values of some variable, we will refer to that set using an uppercase letter such as $X$ or $Y$.

When we want to talk about specific data points within that set, we specify those points by adding a subscript to the uppercase letter, as in $X_1$.


Example

Data:    5,     8,     12,    3,     6,     8,     7

Labels:  $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $X_6$, $X_7$


Summation

The Greek letter sigma, which looks like $\Sigma$, means “add up” or “sum” whatever follows it.

Thus, $\sum X_i$ means “add up all the $X_i$s.”

If we use the $X_i$s from the previous example, $\sum X_i = 49$ (or just $\sum X$).
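
The summation over the example data can be checked with a minimal sketch. Python is our choice here, not something the slides prescribe, and the list name `X` simply mirrors the notation.

```python
# Example data from the slide. Python counts from 0, so X[0] plays the role of X_1.
X = [5, 8, 12, 3, 6, 8, 7]

first_point = X[0]   # X_1 = 5
total = sum(X)       # "add up all the X_i"

print(first_point, total)  # 5 49
```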


Example

Student   Pred. Score (X)   Actual Score (Y)
   1            82                 84
   2            66                 51
   3            70                 72
   4            81                 56
   5            61                 73


Example (cont.)

$\sum X = 82 + 66 + 70 + 81 + 61 = 360$

$\sum Y = 84 + 51 + 72 + 56 + 73 = 336$

$\sum (X - Y) = (82-84) + (66-51) + (70-72) + (81-56) + (61-73) = -2 + 15 + (-2) + 25 + (-12) = 24$

$\sum X^2 = 82^2 + 66^2 + 70^2 + 81^2 + 61^2 = 6724 + 4356 + 4900 + 6561 + 3721 = 26262$ (one can also write this as $\sum (X^2)$)

$(\sum X)^2 = 360^2 = 129600$
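
The distinction between the sum of squares and the square of the sum is easy to lose in notation, so here is a small sketch of the same calculations. The variable names are ours, chosen only for illustration.

```python
X = [82, 66, 70, 81, 61]  # predicted scores
Y = [84, 51, 72, 56, 73]  # actual scores

sum_x = sum(X)
sum_y = sum(Y)
sum_diff = sum(x - y for x, y in zip(X, Y))  # sum of the paired differences

sum_of_squares = sum(x ** 2 for x in X)  # square each score first, then add
square_of_sum = sum(X) ** 2              # add the scores first, then square

print(sum_x, sum_y, sum_diff)         # 360 336 24
print(sum_of_squares, square_of_sum)  # 26262 129600
```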


Your turn

X   Y
3   5
6   7
2   3

$\sum (XY) =$

$\left(\sum (X - Y)\right)^2 =$

$\sqrt{\dfrac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} =$


Your turn

X   Y
3   5
6   7
2   3

$\sum (XY) = 15 + 42 + 6 = 63$

$\left(\sum (X - Y)\right)^2 = [(-2) + (-1) + (-1)]^2 = 16$

$\sqrt{\dfrac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} = 2.08$
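
The three answers can be reproduced with a short sketch. One assumption is flagged: the third expression is garbled in the transcript, and it is read here as the square root of $(\sum X^2 - (\sum X)^2 / N) / (N - 1)$ applied to the X column, because that reading yields the stated 2.08.

```python
import math

X = [3, 6, 2]
Y = [5, 7, 3]
N = len(X)

sum_xy = sum(x * y for x, y in zip(X, Y))            # 15 + 42 + 6
sq_sum_diff = sum(x - y for x, y in zip(X, Y)) ** 2  # (-4) squared

# Assumed reading of the third expression (see the note above).
sum_x_sq = sum(x ** 2 for x in X)  # 9 + 36 + 4 = 49
third = math.sqrt((sum_x_sq - sum(X) ** 2 / N) / (N - 1))

print(sum_xy, sq_sum_diff, round(third, 2))  # 63 16 2.08
```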


Measures of Central Tendency

While distributions provide an overall picture of some data set, it is sometimes desirable to represent the entire data set using descriptive statistics.

The first descriptive statistics we will discuss are those used to indicate where the center of the distribution lies.


[Figure: frequency histogram of class height. x-axis: Height (Inches); y-axis: Frequency]

Bin midpoint   Freq
    60.5         3
    62.5         8
    64.5         7
    66.5        12
    68.5         7
    70.5         6
    72.5         4
    74.5         0
    76.5         1


The Mode

There are different measures of central tendency, each with its own advantages and disadvantages.

The first of these is called the mode.

The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.


The Mode (cont.)

Note that if you have plotted a frequency histogram, you can often identify the mode simply by finding the value with the tallest bar.

However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).


Finding the mode

• Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.

• Example: Class height.

• n=48

Value  Freq    Value  Freq
  61     3       69     3
  62     4       70     2
  63     4       71     4
  64     4       72     4
  65     3       73     0
  66     7       74     0
  67     5       75     0
  68     4       76     1
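
As a rough sketch, the mode can be read off the frequency table programmatically. The second half illustrates the earlier point about grouping: it assumes two-inch groups labelled by their midpoints (60.5, 62.5, ...), which is our reading of the histogram, and shows that grouping yields only a modal group, not a modal value.

```python
from collections import Counter

# Non-grouped frequency table from the slide: height in inches -> frequency.
freq = {61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
        69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1}

n = sum(freq.values())
# The value with the greatest frequency (on a tie, max returns the first one seen).
mode = max(freq, key=freq.get)

# Assumed grouping: pairs 60-61, 62-63, ... labelled by midpoint.
grouped = Counter()
for value, count in freq.items():
    midpoint = value - (value % 2) + 0.5
    grouped[midpoint] += count
modal_group = max(grouped, key=grouped.get)

print(n, mode, modal_group)  # 48 66 66.5
```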


Mode

Advantages
• Very quick and easy to determine
• Is an actual value of the data
• Not affected by extreme scores

Disadvantages
• Sometimes not very informative (e.g. cigarettes smoked in a day)
• Can change dramatically from sample to sample
• Might be more than one (which is more representative?)


The Median

A second measure of central tendency is called the median.

The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).


The Median (cont.)

To find the median, the data points must first be sorted into either ascending or descending numerical order.

The position of the median value can then be calculated using the following formula:

$\text{Median Location} = \frac{N + 1}{2}$


Examples

• If there are an odd number of data points:

(1, 2, 2, 3, 3, 4, 4, 5, 6)

$\text{Median Location} = \frac{9 + 1}{2} = 5$

• The median is the item in the fifth position of the ordered data set, therefore the median is 3.


If there are an even number of data points:

(1, 2, 2, 3, 3, 4, 4, 5, 6, 793)

The formula would tell us to look in the 5.5th place, which we can’t really do. However, we can take the average of the 5th and 6th values to give us the median. In the above scenario 3 is in the fifth place and 4 is in the sixth place, so we can use 3.5 as our median.
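
The location rule for both the odd and even cases can be sketched as a small function. `median_by_location` is a hypothetical helper name of ours, and the function assumes a non-empty list of numbers.

```python
import math

def median_by_location(data):
    """Median via the (N + 1) / 2 location rule; assumes data is non-empty."""
    ordered = sorted(data)                        # sort first, as the rule requires
    location = (len(ordered) + 1) / 2             # 1-based position, may end in .5
    lower = ordered[math.floor(location) - 1]     # value at or just below the location
    upper = ordered[math.ceil(location) - 1]      # value at or just above the location
    return (lower + upper) / 2                    # same value twice when N is odd

odd_median = median_by_location([1, 2, 2, 3, 3, 4, 4, 5, 6])
even_median = median_by_location([1, 2, 2, 3, 3, 4, 4, 5, 6, 793])

print(odd_median, even_median)  # 3.0 3.5
```

Note that the extreme value 793 moves the median only from 3 to 3.5.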


Median (Advantage/Disadvantage)

Advantage: Resistant to outliers

Disadvantage: May not be so informative:

(1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10)

Does the value of 2 really represent this sample as a whole very well?


The Mean

Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).

The mean is the same as what most of us call the average, and it is calculated in the following manner:

$\bar{X} = \frac{\sum X}{N}$


The Mean

• For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be:

$\bar{X} = \frac{\sum X}{N} = \frac{30}{9} = 3.33$

• Similarly, the mean height of a statistics class, as indicated by the previous sample, would be:

$\bar{X} = \frac{\sum X}{N} = \frac{3198}{48} = 66.625$
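
Both means can be verified with a brief sketch. For the height data, the sum is taken over the frequency table (each value times its frequency) rather than over 48 individual scores; the variable names are ours.

```python
# Odd-number example from the median slides.
odd_example = [1, 2, 2, 3, 3, 4, 4, 5, 6]
mean_odd = sum(odd_example) / len(odd_example)  # 30 / 9

# Class-height frequency table: height in inches -> frequency.
freq = {61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
        69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1}

total_height = sum(value * count for value, count in freq.items())  # sum of X
n = sum(freq.values())                                              # N
mean_height = total_height / n

print(round(mean_odd, 2))            # 3.33
print(total_height, n, mean_height)  # 3198 48 66.625
```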


Mode vs. Median vs. Mean

• In our height example, the mode and median were the same, and the mean was fairly close to the mode and median.

• This was the case because the height distribution was fairly symmetrical.

• However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.


Example: Slices of Pizza Eaten Last Week

Value  Freq    Value  Freq
   0     4        8     5
   1     2       10     2
   2     8       15     1
   3     6       16     1
   4     6       20     1
   5     6       40     1
   6     5

Mode = 2 slices per week

Median = 4 slices per week

Mean = 5.7 slices per week

• This raises the issue of which measure is best.


Some Visual Demos

Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.

As you use the demo, you should also be able to work out how these changes affect the mode.

Note that in a skewed distribution the order typically goes mode, median, mean in the direction the tail is pointing.
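
As an illustrative check of that ordering, the sketch below uses the pizza frequency table (as we read it from the flattened slide, so treat the exact counts as an assumption). The long tail points toward high values, and the three measures line up in that direction.

```python
import statistics

# Slices of pizza eaten last week: value -> frequency (our reading of the slide's table).
pizza_freq = {0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
              8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1}

# Expand the table back into one entry per respondent.
slices = [value for value, count in pizza_freq.items() for _ in range(count)]

mode = statistics.mode(slices)
median = statistics.median(slices)
mean = statistics.mean(slices)

# Right-skewed data: the mean is pulled furthest toward the tail.
print(mode < median < mean)  # True
```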


Your turn

Find the mean, median and mode of the following dataset:

7 3 4 3 5 2 4 6 1 7 3 6 3 3 4

Mean =

Median =

Mode =


Mean = 4.07

Median = 4

Mode = 3
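
One way to check these answers is Python's standard `statistics` module. This is a sketch for verification, not the method the slides intend you to use by hand.

```python
import statistics

data = [7, 3, 4, 3, 5, 2, 4, 6, 1, 7, 3, 6, 3, 3, 4]

mean = statistics.mean(data)      # 61 / 15
median = statistics.median(data)  # 8th value of the 15 sorted scores
mode = statistics.mode(data)      # 3 occurs five times

print(round(mean, 2), median, mode)  # 4.07 4 3
```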


Other measures of central tendency (preview)

Trimmed mean
• Created by “trimming” some percentage of the high and low ends of the data

M-estimators
• Extreme values are given less weight than those closer to the center of the distribution.
• May be more robust than mean or median for certain types of “funky” data
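
A trimmed mean can be sketched in a few lines. `trimmed_mean` is a hypothetical helper of ours, and it makes one simplifying assumption: the number of points cut from each end is the trimming proportion times N, rounded down. Other conventions exist.

```python
def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the points from each end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)    # points removed from each end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Even-number example from the median slides, including the extreme value 793.
data = [1, 2, 2, 3, 3, 4, 4, 5, 6, 793]

plain = sum(data) / len(data)         # ordinary mean, dragged upward by 793
trimmed = trimmed_mean(data, 0.10)    # drops the lowest and highest point

print(plain, trimmed)  # 82.3 3.625
```

With the 10% trim, the result sits close to the median of 3.5, whereas the ordinary mean does not.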