Descriptive Measures
Measures Of Central Tendency
Indicates where the center or most typical value of a data set lies.
http://www.getstats.org.uk/2013/01/21/it-all-depends-what-you-mean-by-average/
The Sample Mean
The MEAN of a data set is defined as the sum of the observations divided by the number of observations.
Notation: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where: x is a random variable; n is the number of data items in a sample; $x_i$ is the ith observation.
Chapter 3: Descriptive Statistics Page -2- Class Notes to accompany: Introductory Statistics, By Neil A. Weiss Prepared by: Nina Kajiji
Example:
i     xi Values
1     200
2     200
3     840
4     350
5     300
6     300
7     200
8     200
9     950
10    200
n = 10; sum of the observations = 3740; Mean = 3740 / 10 = 374.0
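The mean calculation above can be sketched in Python. This is a minimal illustration, not part of the original notes; the name `sample_mean` is chosen here for convenience.

```python
# The ten observations from the example above.
values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]

def sample_mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)

mean = sample_mean(values)
print(mean)  # 374.0
```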
The Sample Median
If n is odd: the Median is the data value exactly in the middle of an ordered list.
If n is even: the Median is the mean of the two middle data values of an ordered list.
The Median is often called the 50th percentile.
NOTATION: An ordered list is denoted as: $x_{(1)} \le x_{(2)} \le x_{(3)} \le \dots \le x_{(n)}$
For n odd: $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
For n even: $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$
EXAMPLE
For the data set presented previously, n is even. Therefore,
$\text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{200 + 300}{2} = 250$
The calculations are shown below:
i     xi Values   x(i) Sorted
1     200         200
2     200         200
3     840         200
4     350         200
5     300         200   <--- (200 + 300) / 2
6     300         300        = 250 (Median)
7     200         300
8     200         350
9     950         840
10    200         950
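The odd/even rule for the median can be sketched as follows. The function name `sample_median` is illustrative only; note that Python lists are indexed from 0, so position x(k) in the notes corresponds to index k - 1.

```python
def sample_median(data):
    """Median of an ordered list: the middle value (n odd) or the mean of the two middle values (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # x((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of x(n/2) and x(n/2 + 1)

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
median = sample_median(values)
print(median)  # 250.0
```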
The Sample Mode
The Mode is the data value that occurs most frequently in a data set. If all values occur only once, the data set has NO mode. If two or more values tie for the greatest frequency, the data set has multiple modes. For the above data set the mode is: 200.
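A short Python sketch of the mode, assuming that a mode is any value occurring with the greatest frequency and that a data set whose values all occur once has no mode. The name `sample_modes` is illustrative.

```python
from collections import Counter

def sample_modes(data):
    """Return all modes in ascending order, or an empty list if every value occurs only once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == top)

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
modes = sample_modes(values)
print(modes)  # [200]
```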
Comparison Of Mean, Median, & Mode
1. If data is symmetric (with a single peak), then Mean = Median = Mode.
2. If data is right-skewed, then typically Mode < Median < Mean.
3. If data is left-skewed, then typically Mean < Median < Mode.
4. Mean is sensitive to fluctuations in data values, especially extreme ones; Median is not.
5. Mean accounts for the numerical value of each piece of data; Median does not.
The Midrange
The midrange is the value halfway between the highest and lowest scores. It is found by adding the highest score to the lowest score and then dividing the sum by 2. Thus for our example: Midrange = (950 + 200) / 2 = 575
The Weighted Mean
A weight is a value corresponding to how many times a particular score occurs in the data set. The weighted mean is:
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
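As an illustration of the weighted mean, the sketch below treats each distinct value of the earlier ten-observation example as a score and its frequency as the weight. The function name `weighted_mean` and this way of grouping the data are assumptions made for the example.

```python
def weighted_mean(scores, weights):
    """Sum of weight * score divided by the sum of the weights."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * x for x, w in zip(scores, weights)) / sum(weights)

# Distinct values of the example data and how many times each occurs.
distinct_values = [200, 300, 350, 840, 950]
frequencies = [5, 2, 1, 1, 1]

w_mean = weighted_mean(distinct_values, frequencies)
print(w_mean)  # 374.0
```

With frequencies as weights, the weighted mean reproduces the ordinary sample mean of the ungrouped data.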
Review Example:
i     Xi    X(i)
1     37    24
2     37    24
3     24    27
4     28    28
5     43    28
6     44    33
7     36    36
8     41    37
9     27    37
10    33    41
11    28    43
12    24    44
Sum   402
$\bar{X}$ = 402 / 12 = 33.5 years
Median = (x(6) + x(7)) / 2 = (33 + 36) / 2 = 34.5 years
Mode = 24, 28, 37
Midrange = (24 + 44) / 2 = 34
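The review example can be checked with Python's standard `statistics` module. This is a sketch for verification only; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

ages = [37, 37, 24, 28, 43, 44, 36, 41, 27, 33, 28, 24]

age_mean = statistics.mean(ages)                  # 33.5
age_median = statistics.median(ages)              # 34.5
age_modes = sorted(statistics.multimode(ages))    # [24, 28, 37]
age_midrange = (max(ages) + min(ages)) / 2        # 34.0

print(age_mean, age_median, age_modes, age_midrange)
```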
Measures Of Dispersion
The Sample Range
Range = Largest Value - Smallest Value
Features:
1. A lot of information is ignored
2. Simple to calculate
The Sample Standard Deviation
Corrects for the problem with the range by using every data value. Estimates how far the data values are from the mean. A larger standard deviation implies higher variation in the data.
Example:
a) Standard deviation = 7.4; Mean = 50
b) Standard deviation = 14.2; Mean = 50
With the same mean, data set (b) is more spread out than data set (a).
Steps Method I
1. Calculate the sample mean.
2. Calculate the deviations from the mean. That is, subtract the mean from each data value.
3. Square the deviations.
4. Obtain the sum of the squared deviations.
5. Obtain the SAMPLE VARIANCE. That is, divide the sum of the squared deviations by n - 1.
6. The square root of the sample variance is the STANDARD DEVIATION.
Notation For Method I
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $s = \sqrt{s^2}$
Example
i     $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1     200     -174              30276
2     200     -174              30276
3     840     466               217156
4     350     -24               576
5     300     -74               5476
6     300     -74               5476
7     200     -174              30276
8     200     -174              30276
9     950     576               331776
10    200     -174              30276
Sum   3740                      711840
$\bar{x}$ ---> 3740 / 10 = 374
$s^2$ ---> 711840 / 9 = 79093.33
$s$ = 281.24
Excel functions to compute:
the mean: AVERAGE(list)
the standard deviation: STDEV(list)
the sample variance: VAR(list)
the population variance: VARP(list)
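The six steps of Method I can be sketched in Python as follows. The function name `sample_variance` is illustrative; the divisor is n - 1, as in the notation for the sample variance.

```python
import math

def sample_variance(data):
    """Method I: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n                                   # step 1
    squared_deviations = [(x - mean) ** 2 for x in data]   # steps 2 and 3
    return sum(squared_deviations) / (n - 1)               # steps 4 and 5

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
variance = sample_variance(values)
std_dev = math.sqrt(variance)                              # step 6
print(round(variance, 2), round(std_dev, 2))  # 79093.33 281.24
```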
Steps Method II
1. Obtain the square of the sum of the data series.
2. Obtain the sum of squares for the data set and multiply by the total number of data items (n).
3. Subtract the value in step 1 from the value in step 2 and divide by n*(n-1). This is the sample variance.
4. The square root of the sample variance is the standard deviation.
Notation For Method II
Variance: $s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$
Standard Deviation: $s = \sqrt{s^2}$
Example:
i     Xi     Sqr(Xi)
1     200    40000
2     200    40000
3     840    705600
4     350    122500
5     300    90000
6     300    90000
7     200    40000
8     200    40000
9     950    902500
10    200    40000
Sum   3740   2110600
Xbar ---> 3740 / 10 = 374
[2] n*SumOfSq ---> 10 * 2110600 = 21106000
[1] SqofSum ---> Sqr(3740) = 13987600
Var ---> ([2] - [1]) / (10 * 9) = 79093.33
Std ---> Sqrt(Var) = 281.24
Z-Score X10: (200 - 374) / 281.24 = -0.62
Z-Score X9: (950 - 374) / 281.24 = 2.05
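Method II needs only the sum and the sum of squares of the data, which the following sketch makes explicit. The function name `sample_variance_shortcut` is illustrative.

```python
def sample_variance_shortcut(data):
    """Method II: (n * sum of squares - square of the sum) / (n * (n - 1))."""
    n = len(data)
    sum_x = sum(data)                        # 3740 for the example
    sum_x_squared = sum(x * x for x in data) # 2110600 for the example
    return (n * sum_x_squared - sum_x ** 2) / (n * (n - 1))

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
shortcut_variance = sample_variance_shortcut(values)
print(round(shortcut_variance, 2))  # 79093.33
```

Both methods give the same variance; Method II avoids computing each deviation from the mean.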
Interpretation Of Standard Deviation
Z-Scores or Standard Scores
A z-score is used to interpret the standard deviation and to compare data values from different data sets or different data series.
The higher the absolute value of the z-score for a data value, the further away it is from the mean of the data series.
If the data value has a large positive z-score, then the data value is larger than most of the other data values.
Large negative z-scores imply that the data value is smaller than most of the other data values.
Generally, a z-score is presented as x.xx standard deviations from the mean.
Notation: $z = \frac{x_i - \bar{x}}{s}$
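A minimal sketch of the z-score formula, using the sample mean and sample standard deviation from Python's `statistics` module. The function name `z_score` is illustrative.

```python
import statistics

def z_score(x, data):
    """Number of sample standard deviations that x lies above (+) or below (-) the sample mean."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
z_low = z_score(200, values)
z_high = z_score(950, values)
print(round(z_low, 2), round(z_high, 2))  # -0.62 2.05
```

Here 950 lies about 2.05 standard deviations above the mean, while 200 lies about 0.62 standard deviations below it.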
Chebychev's Rule
For any number k > 1, at least $1 - 1/k^2$ of the data lies within k standard deviations to either side of the mean. That is, the lower and upper bounds of that interval are:
$(\bar{x} - ks,\ \bar{x} + ks)$
Example: Assume: n = 5; Mean = 4; s = 2.45
For k = 2; using the formula we have:
= (4 - 2*2.45, 4 + 2*2.45)
= (-0.9, 8.9)
Proportion = 1 - 1/k^2 = 1 - 1/4 = 0.75 = 75%
For k = 3; using the formula we have:
= (4 - 3*2.45, 4 + 3*2.45)
= (-3.35, 11.35)
Proportion = 1 - 1/k^2 = 1 - 1/9 = 0.89 = 89%
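The interval and the minimum proportion in Chebychev's rule can be computed together. This is a sketch; the function name `chebyshev_interval` is illustrative.

```python
def chebyshev_interval(mean, s, k):
    """Return ((lower, upper), minimum proportion) for k standard deviations, k > 1."""
    if k <= 1:
        raise ValueError("Chebychev's rule requires k > 1")
    interval = (mean - k * s, mean + k * s)
    min_proportion = 1 - 1 / k ** 2
    return interval, min_proportion

interval_2, proportion_2 = chebyshev_interval(4, 2.45, 2)
interval_3, proportion_3 = chebyshev_interval(4, 2.45, 3)
print([round(b, 2) for b in interval_2], proportion_2)            # [-0.9, 8.9] 0.75
print([round(b, 2) for b in interval_3], round(proportion_3, 2))  # [-3.35, 11.35] 0.89
```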
In summary: Chebychev's rule, which is valid for all data sets, implies that:
At least 93.75% of the observations lie within four standard deviations of the mean.
At least 89% of the observations lie within three standard deviations of the mean.
At least 75% of the observations lie within two standard deviations of the mean.
In contrast, the empirical rule, which applies to data sets that have approximately a bell-shaped distribution, states that:
Approximately 99.7% of the observations lie within three standard deviations of the mean.
Approximately 95% of the observations lie within two standard deviations of the mean.
Approximately 68% of the observations lie within one standard deviation of the mean.
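To see that Chebychev's rule gives a lower bound rather than an exact figure, the sketch below compares it with the observed fraction for the ten-observation example used earlier in these notes. The function name `fraction_within` is illustrative.

```python
import statistics

def fraction_within(data, k):
    """Observed fraction of data values within k sample standard deviations of the sample mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - mean) <= k * s]
    return len(inside) / len(data)

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
observed_2 = fraction_within(values, 2)   # only 950 falls outside two standard deviations
observed_3 = fraction_within(values, 3)
print(observed_2, observed_3)  # 0.9 1.0
```

Both observed fractions exceed the guaranteed minimums of 75% and 89%. This right-skewed data set is not bell-shaped, so the empirical rule is not expected to describe it well.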
Group Data Analysis Pg: 119-120
Sample Mean for Grouped Data
The formula is: $\bar{x} = \frac{\sum f x}{n}$
Where: x = class mark; f = class frequency; n = sample size.
Sample Variance for Grouped Data
The formula for Method I is:
$s^2 = \frac{\sum f (x - \bar{x})^2}{n - 1}$
The formula for Method II is:
$s^2 = \frac{n \sum f x^2 - \left(\sum f x\right)^2}{n(n-1)}$
Where all terms are as defined before.
Example
Days To Maturity   Freq. (f)   Class Mark (x)   Sqr(x)     fx        f * sqr(x)
30-39              3           34.5             1190.25    103.50    3570.75
40-49              1           44.5             1980.25    44.50     1980.25
50-59              8           54.5             2970.25    436.00    23762.00
60-69              10          64.5             4160.25    645.00    41602.50
70-79              7           74.5             5550.25    521.50    38851.75
80-89              7           84.5             7140.25    591.50    49981.75
90-99              4           94.5             8930.25    378.00    35721.00
Totals             40                                      2720.00   195470.00
The mean is 2720 / 40 = 68
The variance = [40(195470) - Sqr(2720)] / (40*39) = 269.49
Std.Dev = 16.42
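The grouped-data calculation above (Method II) can be sketched in Python. The function name `grouped_mean_variance` is illustrative; the class marks and frequencies are taken from the table.

```python
import math

class_marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [3, 1, 8, 10, 7, 7, 4]

def grouped_mean_variance(marks, freqs):
    """Grouped-data mean and sample variance (Method II) from class marks and class frequencies."""
    n = sum(freqs)                                        # sample size
    sum_fx = sum(f * x for x, f in zip(marks, freqs))     # 2720.0
    sum_fx2 = sum(f * x * x for x, f in zip(marks, freqs))  # 195470.0
    mean = sum_fx / n
    variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
    return mean, variance

g_mean, g_variance = grouped_mean_variance(class_marks, frequencies)
g_std = math.sqrt(g_variance)
print(g_mean, round(g_variance, 2), round(g_std, 2))  # 68.0 269.49 16.42
```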
Estimating Population Parameters
Mean
$\mu = \frac{\sum x}{N}$
Variance
The formula for Method I is:
Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The formula for Method II is:
Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$
Where: N is the population size; $\mu$ is the population parameter for mean; $\sigma^2$ is the population parameter for variance.
Z-Score or Standardized Variable
$z = \frac{x_i - \mu}{\sigma}$
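The two population variance formulas can be compared numerically. In this sketch the ten values from the earlier example are treated as a complete population, which is an assumption made purely for illustration; the function names are illustrative as well.

```python
def population_variance_definition(data):
    """Method I: mean of the squared deviations from the population mean (divide by N)."""
    N = len(data)
    mu = sum(data) / N
    return sum((x - mu) ** 2 for x in data) / N

def population_variance_shortcut(data):
    """Method II: mean of the squares minus the square of the mean."""
    N = len(data)
    mu = sum(data) / N
    return sum(x * x for x in data) / N - mu ** 2

population = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
var_1 = population_variance_definition(population)
var_2 = population_variance_shortcut(population)
print(var_1, var_2)  # 71184.0 71184.0
```

The divisor here is N rather than n - 1, so the result is smaller than the sample variance of the same ten values.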
Sectioning Data
Five Number Summary
o Min
o Q1
o Q2
o Q3
o Max
Quartiles
A data set has three quartiles. The quartiles divide the data set into fourths. That is,
Q1 = Divides the bottom 25% from the top 75%
Q2 = Divides the data set in half
Q3 = Divides the bottom 75% from the top 25%
Steps (Sample size a multiple of 4)
1. Arrange the data in ascending (increasing) order.
2. Divide the data into quarters of n / 4 values each.
3. Find the numbers dividing the quarters: the median of the whole data set is Q2, and the medians of the lower and upper halves are Q1 and Q3.
Example: n = 16 (divisible by 4)
Sorted x
3.3
5.7
6.6
7.7     (7.7 + 8.3) / 2
8.3     = 8.0 <-- Q1
8.6
8.9
9.2     (9.2 + 10.2) / 2
10.2    = 9.7 <-- Q2
10.3
10.6
11.8    (11.8 + 12.0) / 2
12.0    = 11.9 <-- Q3
12.7
13.7
15.0
Steps (Sample size not a multiple of 4)
1. Arrange the data in ascending order.
2. Determine the median position (m) of the data set:
   if n is odd: m = (n + 1)/2
   if n is even: m = n/2
3. Q1 position is calculated as: (m + 1)/2 counting in from the top (left).
4. Q3 position is calculated as: (m + 1)/2 counting in from the bottom (right).
Review Example:
i     Xi Values   X(i) Sorted
1     200         200
2     200         200
3     840         200   Q1
4     350         200
5     300         200   <--- (200 + 300) / 2
6     300         300        = 250 (Median) = Q2
7     200         300
8     200         350   Q3
9     950         840
10    200         950
n = 10, so n is even.
Median = (X(5) + X(6)) / 2 = (200 + 300) / 2 = $250
m = n/2 = 5
Q1: position = (m + 1)/2 = (5 + 1)/2 = 3 from top. Thus Q1 = $200.
Q3: position = 3 from bottom. Thus Q3 = $350.
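The position method for quartiles can be sketched in Python. The function names `value_at` and `quartiles` are illustrative. One assumption goes beyond the steps in the notes: when the position (m + 1)/2 ends in .5, the sketch averages the two neighbouring values, which agrees with the n = 16 example worked earlier.

```python
import statistics

def value_at(ordered, position):
    """Value at a 1-based position; a position ending in .5 averages its two neighbours (assumption)."""
    lower = int(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median position m and the quartile position (m + 1)/2."""
    ordered = sorted(data)
    n = len(ordered)
    m = (n + 1) // 2 if n % 2 == 1 else n // 2   # median position
    q_position = (m + 1) / 2
    q1 = value_at(ordered, q_position)           # count in from the top (left)
    q3 = value_at(ordered[::-1], q_position)     # count in from the bottom (right)
    return q1, statistics.median(ordered), q3

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
q1, q2, q3 = quartiles(values)
print(q1, q2, q3)  # 200 250.0 350
```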
Interquartile Range
Difference between Q3 and Q1. That is,
IQR = Q3 - Q1
For the previous example, IQR = $350 - $200 = $150
Deciles
When the data set is divided into TENTHS, each part is a DECILE.
NOTE: The fifth decile is the Median.
Percentiles
When the data set is divided into HUNDREDTHS, each part is a PERCENTILE.
NOTE: The fiftieth percentile is the Median.
Percentile of x = 100 * (# of elements less than x) / n
kth percentile = (Pk) = ?
Compute: L = (k / 100) n
Where: k = percentile to find; n = sample size
Therefore, the position of P25 in a sample of 106 observations is (25/100) * 106 = 26.5, rounded up to 27. That is, position 27 in an ordered set.
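The position calculation for the kth percentile can be sketched as follows. The function names are illustrative. The notes only show the case where L is fractional and is rounded up; how to treat a whole-number L is not stated, so this sketch simply uses that position, which is an assumption.

```python
import math

def percentile_position(k, n):
    """Position L = (k / 100) * n in the ordered list, rounded up to a whole position."""
    # k * n / 100 is computed in this order to limit floating-point error before rounding up.
    return max(1, math.ceil(k * n / 100))

def percentile_value(data, k):
    """Value at the kth percentile position of the ordered data."""
    ordered = sorted(data)
    return ordered[percentile_position(k, len(ordered)) - 1]

position_p25 = percentile_position(25, 106)
print(position_p25)  # 27

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
p25 = percentile_value(values, 25)
print(p25)  # 200
```

For the ten-observation example, L = 2.5 rounds up to position 3, so P25 = 200, the same value as Q1.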
Box & Whisker Diagrams
Another term: Boxplots
Shows graphically the dispersion in the data set. Invented by John Tukey.
Steps
1. Determine the quartiles for the data. The first and third quartiles are also called hinges.
2. Find the smallest and largest data values.
3. Draw a horizontal axis on which the values obtained in steps 1 & 2 are located. Mark the quartiles & the smallest and largest data values with vertical lines.
4. Connect the quartiles to each other to make a box.
5. Connect the box to the largest and smallest data values by lines called whiskers.
For Example: In the above example we found that the five number summary is:
Min = 200; Q1 = 200; Q2 = 250; Q3 = 350; Max = 950
There is no variation in the first quarter: (Q1 - Min) = 0
There is a little variation in the second and third quarters: (Q2 - Q1) = 50 and (Q3 - Q2) = 100
The most variation is in the fourth quarter: (Max - Q3) = 600
Conclusion: The data is right skewed.
[Figure: Box Plot; horizontal axis from 0 to 1000]
Outlier Detection - Modified Boxplots
Compute Inner and Outer Fences:
Inner fences: Q1 - 1.5*IQR (lower limit); Q3 + 1.5*IQR (upper limit)
Outer fences: Q1 - 3*IQR; Q3 + 3*IQR
Data values that lie between the inner and outer fences are considered possible or potential outliers; those that lie outside the outer fences are considered probable outliers or extreme values.
For the example data:
Inner fences are: 200 - 1.5*150 = -25 and 350 + 1.5*150 = 575
Outer fences are: 200 - 3*150 = -250 and 350 + 3*150 = 800
Thus there are two probable outliers: 840 & 950
We can now construct a modified boxplot indicating the outliers. But first, we need to determine the adjacent values. That is, the most extreme values in the data set that are still within the inner fences (between the lower and upper limits). In our example the minimum value (200) is still within the limits (between -25 and 575), but the maximum value is outside the upper limit. Thus, the only adjacent value we need to find is on the upper side: 350, the largest value within the limits. Since the adjacent values coincide with Q1 = 200 and Q3 = 350, the modified boxplot shown below has no whiskers. The outliers are as shown at 840 and 950.
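The fence calculation and the outlier classification can be sketched in Python. The function name `classify_outliers` and the dictionary layout are illustrative; Q1 and Q3 are passed in as found above rather than recomputed.

```python
def classify_outliers(data, q1, q3):
    """Compute inner and outer fences, potential and probable outliers, and adjacent values."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Between the inner and outer fences: possible (potential) outliers.
    potential = sorted(x for x in data
                       if outer[0] <= x < inner[0] or inner[1] < x <= outer[1])
    # Outside the outer fences: probable outliers (extreme values).
    probable = sorted(x for x in data if x < outer[0] or x > outer[1])
    # Adjacent values: the most extreme data values still within the inner fences.
    within = [x for x in data if inner[0] <= x <= inner[1]]
    adjacent = (min(within), max(within))
    return {"inner": inner, "outer": outer, "potential": potential,
            "probable": probable, "adjacent": adjacent}

values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
result = classify_outliers(values, q1=200, q3=350)
print(result["inner"], result["outer"])   # (-25.0, 575.0) (-250, 800)
print(result["probable"])                 # [840, 950]
print(result["adjacent"])                 # (200, 350)
```

Because the adjacent values (200 and 350) equal Q1 and Q3, the whiskers of the modified boxplot have zero length.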