
Descriptive Measures

Measures Of Central Tendency

A measure of central tendency indicates where the center, or most typical value, of a data set lies.

http://www.getstats.org.uk/2013/01/21/it-all-depends-what-you-mean-by-average/

The Sample Mean

The MEAN of a data set is defined as the sum of the observations divided by the number of observations.

Notation: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Where:
x is a random variable.
n is the number of data items in the sample.
x_i is the ith observation.


Chapter 3: Descriptive Statistics. Class Notes to accompany: Introductory Statistics, by Neil A. Weiss. Prepared by: Nina Kajiji

Example:

i     x_i
1     200
2     200
3     840
4     350
5     300
6     300
7     200
8     200
9     950
10    200

n = 10
Sum of the observations = 3740
Mean = 3740 / 10 = 374.0
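The definition above can be checked with a few lines of Python. This is a minimal sketch; the names `values` and `sample_mean` are illustrative and do not come from the notes.

```python
# The ten observations from the example above.
values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]


def sample_mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)


print(len(values), sum(values), sample_mean(values))  # 10 3740 374.0
```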


The Sample Median

If n is odd, the median is the data value exactly in the middle of the ordered list.

If n is even, the median is the mean of the two middle data values of the ordered list.

The median is often called the 50th percentile.

NOTATION: An ordered list is denoted as: $x_{(1)} \le x_{(2)} \le x_{(3)} \le \dots \le x_{(n)}$

For n odd: $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$

For n even: $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$


EXAMPLE

For the data set presented previously, n = 10 is even. Therefore,

$\text{Median} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{200 + 300}{2} = 250$

The calculations are shown below:

i     x_i    x_(i) (sorted)
1     200    200
2     200    200
3     840    200
4     350    200
5     300    200   <--- (200 + 300) / 2
6     300    300        = 250 (Median)
7     200    300
8     200    350
9     950    840
10    200    950
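The odd and even cases of the median can be sketched in Python as follows. The function name `sample_median` is illustrative; the index arithmetic simply converts the 1-based positions used in the notes to Python's 0-based list indices.

```python
def sample_median(data):
    """Median of a data set, following the odd/even rule in the notes."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n + 1) / 2 is index n // 2 in 0-based terms.
        return ordered[mid]
    # n even: average positions n/2 and n/2 + 1 (indices mid - 1 and mid).
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
print(sample_median(values))  # 250.0
```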


The Sample Mode

The mode is the data value that occurs most frequently in a data set.
If every value occurs only once, the data set has NO mode.
If two or more values tie for the greatest frequency, the data set has multiple modes.
For the above data set the mode is 200.
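A minimal Python sketch of the mode rule, including the "no mode" and "multiple modes" cases. The function name `sample_modes` and the choice to return an empty list for "no mode" are assumptions made for this illustration.

```python
from collections import Counter


def sample_modes(data):
    """Return every value tied for the greatest frequency.

    An empty list means "no mode" (every value occurs only once).
    """
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
print(sample_modes(values))     # [200]
print(sample_modes([1, 2, 3]))  # []
```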

Comparison Of Mean, Median, & Mode
1. If the data are symmetric (with a single peak), Mean = Median = Mode.
2. If the data are right-skewed, then typically Mode < Median < Mean.
3. If the data are left-skewed, then typically Mean < Median < Mode.
4. The mean is sensitive to extreme data values; the median is not.
5. The mean uses the numerical value of every piece of data; the median depends only on the middle of the ordered list.


The Midrange

The midrange is the value halfway between the highest and lowest scores. It is found by adding the highest score to the lowest score and then dividing the sum by 2. Thus for our example: Midrange = (950 + 200) / 2 = 575

The Weighted Mean

A weight is a value corresponding to how many times a particular score occurs in the data set. The weighted mean is:

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
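As an illustration of the weighted mean, the sketch below treats each distinct value of the earlier ten-observation example as a score and its frequency as the weight. The lists `scores` and `weights` were tallied by hand from that example for this illustration, and the function name is hypothetical. With frequencies as weights, the weighted mean reproduces the ordinary mean of 374.

```python
def weighted_mean(xs, ws):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


# Distinct values of the earlier example and how often each occurs.
scores = [200, 300, 350, 840, 950]
weights = [5, 2, 1, 1, 1]

print(weighted_mean(scores, weights))  # 374.0
```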


Review Example:

i     X_i    X_(i) (sorted)
1     37     24
2     37     24
3     24     27
4     28     28
5     43     28
6     44     33
7     36     36
8     41     37
9     27     37
10    33     41
11    28     43
12    24     44
Sum   402

Mean = 402 / 12 = 33.5 years

Median = (x(6) + x(7)) / 2 = (33 + 36) / 2 = 34.5 years

Modes: 24, 28, 37

Midrange = (24 + 44) / 2 = 34 years


Measures Of Dispersion

The Sample Range
Range = Largest Value - Smallest Value
Features:
1. A lot of information is ignored, because only the two extreme values are used.
2. Simple to calculate.

The Sample Standard Deviation
Addresses the main weakness of the range by using every observation. It estimates how far the data values are from the mean. A larger standard deviation implies more variation in the data.

Example:

a) Standard deviation = 7.4; Mean = 50

b) Standard deviation = 14.2; Mean = 50

Both data sets are centered at 50, but the values in (b) are spread roughly twice as widely around the mean as those in (a).


Steps Method I

1. Calculate the sample mean.

2. Calculate the deviations from the mean. That is, subtract the mean from each data value.

3. Square the deviations.

4. Obtain the sum of the squared deviations.

5. Obtain the SAMPLE VARIANCE. That is, divide the sum of the squared deviations by n - 1 (not by n).

6. The square root of the sample variance is the STANDARD DEVIATION.

Notation For Method I

Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Standard Deviation: $s = \sqrt{s^2}$


Example

i     x_i     x_i - xbar    (x_i - xbar)^2
1     200     -174          30276
2     200     -174          30276
3     840      466          217156
4     350      -24          576
5     300      -74          5476
6     300      -74          5476
7     200     -174          30276
8     200     -174          30276
9     950      576          331776
10    200     -174          30276
Sum   3740                  711840

xbar ---> 3740 / 10 = 374
s^2 ---> 711840 / 9 = 79093.33
s = Sqrt(79093.33) = 281.24
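The six steps of Method I map directly onto code. The sketch below follows them one at a time; the function name `sample_variance_method1` is illustrative only.

```python
import math


def sample_variance_method1(data):
    """Sample variance via deviations from the mean (divides by n - 1)."""
    n = len(data)
    mean = sum(data) / n                       # step 1
    deviations = [x - mean for x in data]      # step 2
    squared = [d * d for d in deviations]      # step 3
    total = sum(squared)                       # step 4
    return total / (n - 1)                     # step 5


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
variance = sample_variance_method1(values)
std_dev = math.sqrt(variance)                  # step 6
print(round(variance, 2), round(std_dev, 2))   # 79093.33 281.24
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also use the n - 1 divisor and can serve as a cross-check.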

Excel functions to compute:
the mean: AVERAGE(list)
the sample standard deviation: STDEV(list)
the sample variance: VAR(list)
the population variance: VARP(list)


Steps Method II

1. Obtain the square of the sum of the data series.

2. Obtain the sum of the squares of the data values and multiply it by the total number of data items (n).

3. Subtract the value in step 1 from the value in step 2 and divide by n(n - 1). This is the sample variance.

4. The square root of the sample variance is the standard deviation.

Notation For Method II

Variance: $s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$

Standard Deviation: $s = \sqrt{s^2}$


Example:

i     X_i     Sqr(X_i)
1     200     40000
2     200     40000
3     840     705600
4     350     122500
5     300     90000
6     300     90000
7     200     40000
8     200     40000
9     950     902500
10    200     40000
Sum   3740    2110600

Xbar ---> 3740 / 10 = 374
[1] SqofSum ---> Sqr(3740) = 13987600
[2] n*SumOfSq ---> 10 * 2110600 = 21106000
Var ---> ([2] - [1]) / (10 * 9) = 79093.33
Std ---> Sqrt(Var) = 281.24

Z-Score X10: (200 - 374) / 281.24 = -0.62

Z-Score X9: (950 - 374) / 281.24 = 2.05
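Method II can be sketched the same way. The function below returns the intermediate quantities labelled [1] and [2] in the example so they can be compared with the hand calculation; the function name and the returned tuple layout are choices made for this illustration.

```python
import math


def sample_variance_method2(data):
    """Sample variance via the computing formula.

    Returns (square_of_sum, n_times_sum_of_squares, variance).
    """
    n = len(data)
    square_of_sum = sum(data) ** 2                       # step 1
    n_sum_of_squares = n * sum(x * x for x in data)      # step 2
    variance = (n_sum_of_squares - square_of_sum) / (n * (n - 1))  # step 3
    return square_of_sum, n_sum_of_squares, variance


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
sq_of_sum, n_sum_sq, var2 = sample_variance_method2(values)
print(sq_of_sum, n_sum_sq, round(var2, 2), round(math.sqrt(var2), 2))
# 13987600 21106000 79093.33 281.24
```

Both methods give the same variance in exact arithmetic. With floating-point data whose mean is large relative to its spread, the subtraction in Method II can lose precision, so Method I is usually the safer choice in software.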


Interpretation Of Standard Deviation

Z-Scores or Standard Scores

A z-score expresses a data value as a number of standard deviations from the mean, so that values from different data sets or different data series can be compared.

The higher the absolute value of the z-score for a data value, the further away it is from the mean of the data series.

If a data value has a large positive z-score, then the data value is larger than most of the other data values.

A large negative z-score implies that the data value is smaller than most of the other data values.

Generally presented as x.xx standard deviations from the mean.

Notation: $z_i = \frac{x_i - \bar{x}}{s}$
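A short sketch of the z-score formula applied to the ten-observation example used earlier in these notes. It relies on `statistics.mean` and `statistics.stdev` from the Python standard library (the latter uses the n - 1 divisor); the function name `z_scores` is illustrative.

```python
import statistics


def z_scores(data):
    """z_i = (x_i - mean) / s for every observation, with s the sample std. dev."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
z = z_scores(values)
print(round(z[9], 2), round(z[8], 2))  # -0.62 2.05
```

The two printed values are the z-scores of the tenth observation (200) and the ninth observation (950).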


Chebychev's Rule

For any number k > 1, at least $1 - 1/k^2$ of the data lie within k standard deviations to either side of the mean. That is, within the interval:

$(\bar{x} - ks,\ \bar{x} + ks)$

Example: Assume: n = 5; Mean = 4; s = 2.45

For k = 2, using the formula we have:

= (4 - 2*2.45, 4 + 2*2.45) = (-0.9, 8.9)

Proportion = 1 - 1/k^2 = 1 - 1/4 = 0.75 = 75%

For k = 3, using the formula we have:

= (4 - 3*2.45, 4 + 3*2.45)

= (-3.35, 11.35)

Proportion = 1 - 1/k^2 = 1 - 1/9 = 0.89 = 89%
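The interval and the guaranteed proportion can be computed together. This is a minimal sketch with a hypothetical function name; it rejects k <= 1 because the rule is stated only for k > 1.

```python
def chebyshev_interval(mean, s, k):
    """Return ((lower, upper), minimum proportion) for k standard deviations."""
    if k <= 1:
        raise ValueError("Chebychev's rule requires k > 1")
    interval = (mean - k * s, mean + k * s)
    proportion = 1 - 1 / k ** 2
    return interval, proportion


for k in (2, 3):
    (lower, upper), p = chebyshev_interval(4, 2.45, k)
    print(k, round(lower, 2), round(upper, 2), round(p, 2))
```

The proportion is a lower bound: at least that fraction of the observations falls inside the interval, whatever the shape of the data.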


In summary: Chebychev's rule, which is valid for all data sets, implies that:

At least 93.75% of the observations lie within four standard deviations of the mean.

At least 89% of the observations lie within three standard deviations of the mean.

At least 75% of the observations lie within two standard deviations of the mean.

In contrast, the empirical rule, which applies to data sets that have approximately a bell-shaped curve, states that:

Approximately 99.7% of the observations lie within three standard deviations of the mean.

Approximately 95% of the observations lie within two standard deviations of the mean.

Approximately 68% of the observations lie within one standard deviation of the mean.


Grouped Data Analysis (Pg: 119-120)

Sample Mean for Grouped Data

The formula is: $\bar{x} = \frac{\sum f x}{n}$

Where:
x = class mark
f = class frequency
n = sample size (the sum of the class frequencies)

Sample Variance for Grouped Data

The formula for Method I is:

$s^2 = \frac{\sum f (x - \bar{x})^2}{n - 1}$

The formula for Method II is:

$s^2 = \frac{n \sum f x^2 - \left(\sum f x\right)^2}{n(n - 1)}$

Where all terms are as defined before.


Example

Days To Maturity   Freq. (f)   Class mark (x)   Sqr(x)     fx         f * sqr(x)
30-39              3           34.5             1190.25    103.50     3570.75
40-49              1           44.5             1980.25    44.50      1980.25
50-59              8           54.5             2970.25    436.00     23762.00
60-69              10          64.5             4160.25    645.00     41602.50
70-79              7           74.5             5550.25    521.50     38851.75
80-89              7           84.5             7140.25    591.50     49981.75
90-99              4           94.5             8930.25    378.00     35721.00
Totals             40                                      2720.00    195470.00

The mean is 2720 / 40 = 68
Variance = [40(195470) - Sqr(2720)] / (40 * 39) = 269.49
Std.Dev = Sqrt(269.49) = 16.42
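The grouped-data table can be reproduced with a short script. The sketch below stores each class as a (class mark, frequency) pair and applies the Method II formula; the variable names are illustrative.

```python
import math

# (class mark x, frequency f) for each Days-To-Maturity class in the table.
classes = [(34.5, 3), (44.5, 1), (54.5, 8), (64.5, 10),
           (74.5, 7), (84.5, 7), (94.5, 4)]

n = sum(f for _, f in classes)               # sample size
sum_fx = sum(f * x for x, f in classes)      # total of the fx column
sum_fx2 = sum(f * x * x for x, f in classes) # total of the f * sqr(x) column

grouped_mean = sum_fx / n
grouped_var = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
grouped_sd = math.sqrt(grouped_var)

print(n, sum_fx, sum_fx2)                    # 40 2720.0 195470.0
print(grouped_mean, round(grouped_var, 2), round(grouped_sd, 2))
# 68.0 269.49 16.42
```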


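Estimating Population Parameters

Mean

$\mu = \frac{\sum x}{N}$

Variance

The formula for Method I is:

Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

The formula for Method II is:

Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$

Where:
N is the population size.
$\mu$ is the population parameter for the mean.
$\sigma^2$ is the population parameter for the variance, and $\sigma = \sqrt{\sigma^2}$ is the population standard deviation.

Z-Score or Standardized Variable

$z = \frac{x_i - \mu}{\sigma}$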
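To contrast the population formulas with the sample formulas, the sketch below treats the ten observations from the earlier example as if they were an entire population. That treatment is an assumption made purely for illustration; the notes use those values as a sample. The function names are hypothetical.

```python
def population_variance_method1(data):
    """Sum of squared deviations from mu, divided by N."""
    N = len(data)
    mu = sum(data) / N
    return sum((x - mu) ** 2 for x in data) / N


def population_variance_method2(data):
    """Mean of the squares minus the square of the mean."""
    N = len(data)
    mu = sum(data) / N
    return sum(x * x for x in data) / N - mu ** 2


# Illustration only: the earlier sample, treated here as a full population.
values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
print(population_variance_method1(values))  # 71184.0
print(population_variance_method2(values))  # 71184.0
```

Dividing by N rather than n - 1 gives 71184, which is smaller than the sample variance of 79093.33 computed from the same values.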


Sectioning Data

Five Number Summary

o Min
o Q1
o Q2
o Q3
o Max

Quartiles

A data set has three quartiles. The quartiles divide the data set into fourths. That is,
Q1 = Divides the bottom 25% from the top 75%
Q2 = Divides the data set in half (the median)
Q3 = Divides the bottom 75% from the top 25%

Steps (Sample size a multiple of 4)
1. Arrange the data in ascending (increasing) order.
2. Divide the data into quarters of n/4 values each.
3. Find the numbers dividing the quarters: Q2 is the median of the whole data set, Q1 is the median of the lower half, and Q3 is the median of the upper half.


Example: n = 16 (divisible by 4)

Sorted x
3.3
5.7
6.6
7.7
        <-- Q1 = (7.7 + 8.3) / 2 = 8.0
8.3
8.6
8.9
9.2
        <-- Q2 = (9.2 + 10.2) / 2 = 9.7
10.2
10.3
10.6
11.8
        <-- Q3 = (11.8 + 12.0) / 2 = 11.9
12.0
12.7
13.7
15.0
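For a sample size that is a multiple of 4, the quartiles are the medians of the lower half, the whole data set, and the upper half. A minimal sketch using `statistics.median` from the standard library (the function name `quartiles_by_halves` is illustrative):

```python
import statistics


def quartiles_by_halves(data):
    """Q1, Q2, Q3 as medians of the lower half, whole set, and upper half.

    Intended for sample sizes that are a multiple of 4, as in the notes.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q2 = statistics.median(ordered)
    q3 = statistics.median(ordered[half:])
    return q1, q2, q3


sample16 = [3.3, 5.7, 6.6, 7.7, 8.3, 8.6, 8.9, 9.2,
            10.2, 10.3, 10.6, 11.8, 12.0, 12.7, 13.7, 15.0]
print([round(q, 1) for q in quartiles_by_halves(sample16)])  # [8.0, 9.7, 11.9]
```

Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different quartiles from this hand method.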


Steps (Sample size not a multiple of 4)
1. Arrange the data in ascending order.
2. Determine the median position (m) of the data set:
   if n is odd: m = (n + 1)/2
   if n is even: m = n/2
3. Q1 position is calculated as: (m + 1)/2 counting in from the top (left).
4. Q3 position is calculated as: (m + 1)/2 counting in from the bottom (right).


Review Example:

i     X_i    X_(i) (sorted)
1     200    200
2     200    200
3     840    200   <--- Q1
4     350    200
5     300    200   <--- (200 + 300) / 2
6     300    300        = 250 (Median) = Q2
7     200    300
8     200    350   <--- Q3
9     950    840
10    200    950

n = 10, so n is even.
Median = (X(5) + X(6)) / 2 = (200 + 300) / 2 = $250
m = n/2 = 5
Q1: position = (m + 1)/2 = (5 + 1)/2 = 3 from the top. Thus Q1 = $200.
Q3: position = 3 from the bottom. Thus Q3 = $350.
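The position rule can be sketched as below. The notes do not say what to do when (m + 1)/2 is not a whole number; averaging the two neighbouring values in that case is an assumption made here, and the function name is hypothetical.

```python
def quartiles_by_position(data):
    """Q1 and Q3 via the (m + 1) / 2 position rule from the notes."""
    ordered = sorted(data)
    n = len(ordered)
    m = (n + 1) // 2 if n % 2 == 1 else n // 2   # median position
    position = (m + 1) / 2                       # 1-based; may end in .5

    def value_at(pos, seq):
        whole = int(pos)
        if pos == whole:
            return seq[whole - 1]
        # Assumption: average the two neighbours for a half position.
        return (seq[whole - 1] + seq[whole]) / 2

    q1 = value_at(position, ordered)         # count in from the top
    q3 = value_at(position, ordered[::-1])   # count in from the bottom
    return q1, q3


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
print(quartiles_by_position(values))  # (200, 350)
```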


Interquartile Range

The difference between Q3 and Q1. That is,

IQR = Q3 - Q1

For the previous example, IQR = $350 - $200 = $150

Deciles

When the data set is divided into TENTHS, each part is a DECILE.

NOTE: The fifth decile is the Median.

Percentiles

When the data set is divided into HUNDREDTHS, each part is a PERCENTILE.

NOTE: The fiftieth percentile is the Median.

Percentile of x = 100 * (# of elements below x) / n

kth percentile = (Pk) = ?
Compute: L = (k / 100) * n
Where:
k = percentile to find
n = sample size
Therefore, the position of P25 in a sample of 106 observations is (25/100) * 106 = 26.5, which rounds up to 27. That is, P25 is the value at position 27 in an ordered set.
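A small sketch of the position calculation. The notes show only the case where L is fractional and is rounded up; returning L itself when it is already a whole number is an assumption made here. The function name is hypothetical, and integer arithmetic is used so that floating-point rounding cannot push a whole-number L up by one.

```python
def percentile_position(k, n):
    """1-based position of the kth percentile: L = (k / 100) * n, rounded up.

    k and n are integers. Assumption: a whole-number L is returned unchanged.
    """
    return (k * n + 99) // 100   # integer ceiling of k * n / 100


print(percentile_position(25, 106))  # 27
```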


Box & Whisker Diagrams

Another term: Boxplots. A boxplot shows graphically the dispersion in the data set. Invented by John Tukey.

Steps

1. Determine the quartiles for the data. The first and third quartiles are also called hinges.

2. Find the smallest and largest data values.

3. Draw a horizontal axis on which the values obtained in steps 1 & 2 are located. Mark the quartiles and the smallest and largest data values with vertical lines.

4. Connect the quartiles to each other to make a box.

5. Connect the box to the largest and smallest data values by lines called whiskers.


For Example: In the above example we found that the five number summary is:
Min = 200; Q1 = 200; Q2 = 250; Q3 = 350; Max = 950

There is no variation in the first quarter: (Q1 - Min) = 0
There is a little variation in the second and third quarters: (Q2 - Q1) = 50 and (Q3 - Q2) = 100
The most variation is in the fourth quarter: (Max - Q3) = 600

Conclusion: The data is right skewed.

[Figure: Box Plot of the example data; horizontal axis from 0 to 1000]


Outlier Detection - Modified Boxplots

Compute Inner and Outer Fences:

Inner fences: Q1 - 1.5*IQR (lower limit); Q3 + 1.5*IQR (upper limit)

Outer fences: Q1 - 3*IQR; Q3 + 3*IQR

Data values that lie between the inner and outer fences are considered possible or potential outliers; those that lie outside the outer fences are considered probable outliers or extreme values.

For the example data:
Inner fences are: 200 - 1.5*150 = -25 and 350 + 1.5*150 = 575
Outer fences are: 200 - 3*150 = -250 and 350 + 3*150 = 800
Thus there are two probable outliers: 840 & 950

We can now construct a modified boxplot indicating the outliers. But first, we need to determine the adjacent values. That is, the most extreme values in the data set that are still within the inner fences (between the lower and upper limits). In our example the minimum value (200) is within the limits (between -25 and 575), so it is the lower adjacent value. The maximum value is outside the upper limit, so the upper adjacent value is 350, the largest data value that does not exceed 575. Since the adjacent values coincide with Q1 = 200 and Q3 = 350, the modified boxplot has no whiskers. The outliers are plotted at 840 and 950.
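The fence calculations and the search for adjacent values can be sketched as follows. The function name and the dictionary keys are illustrative, and Q1 and Q3 are passed in rather than recomputed so that the sketch uses the quartile values found earlier in the notes.

```python
def fence_summary(data, q1, q3):
    """Inner/outer fences, outlier lists, and adjacent values."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Potential outliers: beyond an inner fence but not beyond an outer fence.
    potential = [x for x in data
                 if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Probable outliers: beyond an outer fence.
    probable = [x for x in data if x < outer[0] or x > outer[1]]
    # Adjacent values: most extreme observations still inside the inner fences.
    inside = [x for x in data if inner[0] <= x <= inner[1]]
    adjacent = (min(inside), max(inside))
    return {"inner": inner, "outer": outer, "potential": potential,
            "probable": probable, "adjacent": adjacent}


values = [200, 200, 840, 350, 300, 300, 200, 200, 950, 200]
summary = fence_summary(values, q1=200, q3=350)
print(summary["inner"], summary["outer"])   # (-25.0, 575.0) (-250, 800)
print(summary["probable"], summary["adjacent"])  # [840, 950] (200, 350)
```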
