
Statistical measures Measures of central tendency

DESCRIPTIVE STATISTICS

Dr Alina Gleska

Institute of Mathematics, PUT

April 20, 2018


1 Statistical measures

2 Measures of central tendency

We consider the following statistical measures:

- measures of location (also called measures of central tendency): descriptive measures that indicate where the center or the most typical value of the variable lies in the collected set of measurements;
- measures of statistical dispersion (also called measures of variation): they numerically measure the extent of variation around the center. Two data sets of the same variable may exhibit similar positions of the center but may be remarkably different with respect to variability;
- measures of the shape of the distribution, such as skewness (asymmetry);
- measures of concentration (kurtosis).

We consider two types of measures of central tendency: classical and positional (depending on the position in the ordered series). Classical measures are calculated from the values of all observations. They consist of: the arithmetic mean (average), the geometric mean and the harmonic mean. Positional measures cover: the mode (the dominant value), the median and the quartiles (specifically the first quartile, the second quartile, which is the median, and the third quartile).

ARITHMETIC MEAN - the sum of all observations divided by n. We distinguish different types of arithmetic averages:

(I) the simple arithmetic mean for simple series:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

where $x_i$ is the value of the i-th observation, and $n$ is the total number of observations,

(II) the weighted average, for grouped data, which is defined by the formula:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} x_i n_i, \qquad n = \sum_{i=1}^{k} n_i,$$

where

a) for discrete grouped data $x_i$ is the i-th distinct value of the variable, with frequency $n_i$, and $k$ is the number of distinct values,

b) for continuous data grouped in intervals $x_i$ is the center of the i-th interval (denoted by $x_i^0$), with frequency $n_i$, and $k$ is the number of intervals.
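The weighted formula can be sketched in a few lines of Python. The function name `grouped_mean` and all numbers below are hypothetical, chosen only to illustrate cases a) and b); for interval data the result is an approximation, since each interval is represented by its center.

```python
def grouped_mean(values, frequencies):
    """Weighted arithmetic mean: (1/n) * sum(x_i * n_i), with n = sum(n_i)."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(x * f for x, f in zip(values, frequencies)) / n


# a) discrete grouped data: hypothetical values 0..3 with their frequencies
discrete_mean = grouped_mean([0, 1, 2, 3], [5, 10, 4, 1])  # 21 / 20

# b) interval data: hypothetical intervals [0,10), [10,20), [20,30)
intervals = [(0, 10), (10, 20), (20, 30)]
midpoints = [(lo + hi) / 2 for lo, hi in intervals]  # centers 5, 15, 25
interval_mean = grouped_mean(midpoints, [4, 10, 6])  # 320 / 20 = 16.0
```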

The properties of the arithmetic mean:

- the sum of the values of all observations is equal to the product of the arithmetic mean and the total number of observations:

$$\sum_{i=1}^{n} x_i = n\bar{x},$$

- the mean is the only single number for which the residuals (deviations from the estimate) sum to zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0,$$

- the sample mean is also the best single predictor in the sense of having the lowest root mean squared error:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \min_{c} \sum_{i=1}^{n} (x_i - c)^2.$$
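These three properties can be checked numerically on a small sample. The sketch below uses a hypothetical data set and hypothetical helper names; comparing the sum of squares at a few nearby values is only an illustration of the minimum property, not a proof.

```python
def arithmetic_mean(values):
    """Simple arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def sum_of_squares(values, c):
    """Sum of squared deviations of the observations from a candidate value c."""
    return sum((x - c) ** 2 for x in values)


data = [2, 4, 4, 5, 10]  # hypothetical sample
mean = arithmetic_mean(data)  # 25 / 5 = 5.0

# Property 1: the sum of observations equals n times the mean.
total_matches = sum(data) == len(data) * mean

# Property 2: the residuals sum to zero.
residual_sum = sum(x - mean for x in data)

# Property 3: the sum of squared deviations is smallest at the mean.
sse_at_mean = sum_of_squares(data, mean)  # 9 + 1 + 1 + 0 + 25 = 36
sse_nearby = [sum_of_squares(data, mean + d) for d in (-1.0, -0.1, 0.1, 1.0)]
```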

Remark
- The arithmetic mean for data grouped in intervals is only an approximation of the true average value, because all observations are represented by the centers of the intervals.
- The center of an interval IS NOT the average value of that interval: the differences between the values and the center of the interval can sometimes be quite big.

Remark
- We calculate the mean for closed intervals. If there are outliers, however, we have to leave the last (or the first) interval open.
- In such a case, if the outliers make up less than 5% of the total number of observations, we can close these open intervals and calculate the arithmetic mean.
- The arithmetic mean is not a robust statistic (a statistic is said to be robust if it is not sensitive to outliers).

SUMMARY:

We calculate the arithmetic mean if:

a) the population is homogeneous, there are no outliers, and the distribution is symmetric or only slightly asymmetric;

b) the distribution is unimodal (there is only one maximum).

We do not use the arithmetic mean if:

a) the distribution is very strongly asymmetric;

b) the distribution is bimodal or multimodal;

c) the distribution is U-shaped.

The geometric mean is a classical measure used in special cases, mainly in time series analysis. We calculate it using the formula:

$$\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n},$$

where $x_1, x_2, \ldots, x_n$ denote the observations (so the geometric mean is defined as the n-th root of the product of n numbers). It is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature, such as data on the growth of the human population or interest rates of a financial investment.

The geometric mean of growth over periods yields the equivalent constant growth rate that would yield the same final amount.

Example
Suppose an orange tree yields 100 oranges one year and then 180, 210 and 300 in the following years, so the growth is 80%, 16.6666% and 42.8571% for each year respectively. Growth of 80% corresponds to multiplying by 1.80, so we take the geometric mean of 1.80, 1.166666 and 1.428571, i.e.

$$\sqrt[3]{1.80 \times 1.166666 \times 1.428571} \approx 1.442249;$$

thus the 'average' growth per year is 44.2249%. If we start with 100 oranges and let the number grow by 44.2249% each year, the result after three years is 300 oranges.
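The orange tree example can be reproduced with a short Python sketch. The function name `geometric_mean` is an illustrative choice, and computing the root through logarithms is an implementation decision made here for numerical stability; it is not something the slides prescribe.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive numbers, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive observations")
    return math.exp(sum(math.log(v) for v in values) / len(values))


yields = [100, 180, 210, 300]  # oranges per year, as in the example
# Year-on-year growth factors: 1.8, 1.1666..., 1.4285...
growth_factors = [after / before for before, after in zip(yields, yields[1:])]

average_factor = geometric_mean(growth_factors)  # cube root of 3
# Applying the average factor three times recovers the final yield.
final_amount = yields[0] * average_factor ** len(growth_factors)
```

Python's standard library also offers `statistics.geometric_mean` (Python 3.8 and later), which could replace the hand-written function.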

The harmonic mean (sometimes called the subcontrary mean) can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations:

$$\bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}.$$

This measure is used rather seldom.

For positive values of observations we have the relation:

$$\bar{x}_H \le \bar{x}_g \le \bar{x}.$$
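As a quick numerical illustration of the harmonic mean and of the inequality between the three classical means, here is a minimal sketch on a hypothetical sample of positive values (`math.prod` requires Python 3.8 or later).

```python
import math


def harmonic_mean(values):
    """n divided by the sum of reciprocals; defined here for positive values only."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs positive observations")
    return len(values) / sum(1 / v for v in values)


data = [1, 2, 4]  # hypothetical positive observations

h = harmonic_mean(data)                 # 3 / (1 + 1/2 + 1/4) = 12/7
g = math.prod(data) ** (1 / len(data))  # cube root of 8, i.e. about 2
a = sum(data) / len(data)               # 7/3
```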

Definition
For discrete data, obtain the frequency of each observed value of the variable in a collection and note the greatest frequency.

1 If the greatest frequency is 1 (i.e. no value occurs more than once), then the variable has no mode.

2 If the greatest frequency is 2 or greater, then any value that occurs with that greatest frequency is called a sample mode of the variable.
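The definition above translates almost directly into code. In this sketch the function name `sample_modes` and the sample data are hypothetical, and returning an empty list for "no mode" is a convention chosen here.

```python
from collections import Counter


def sample_modes(values):
    """Return all sample modes in increasing order, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    greatest = max(counts.values())
    if greatest == 1:
        # No value occurs more than once: the variable has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == greatest)


two_modes = sample_modes([1, 2, 2, 3, 3, 5])  # 2 and 3 both occur twice
no_mode = sample_modes([1, 2, 3])             # every value occurs once
one_mode = sample_modes([4, 4, 4, 7])         # 4 occurs three times
```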

For continuous data, we first find the interval with the greatest frequency and then use the interpolation formula:

$$Mo = x_{ld} + \frac{n_s - n_{s-1}}{(n_s - n_{s-1}) + (n_s - n_{s+1})} \cdot d,$$

where:
s - the number of the interval with the greatest frequency,
$x_{ld}$ - the left end of the s-th interval,
d - the width of the interval,
$n_s$ - the frequency of the s-th interval,
$n_{s-1}$, $n_{s+1}$ - the frequencies of the intervals immediately before and after the s-th interval.
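A minimal sketch of the interpolation formula for the mode follows. It assumes equal-width intervals and, as an assumption of this sketch only, treats a missing neighbour as having frequency 0 when the modal interval is the first or the last one; the data are hypothetical.

```python
def interpolated_mode(left_edges, width, frequencies):
    """Mode of interval-grouped data with equal interval width."""
    # Index s of the interval with the greatest frequency (first one on ties).
    s = max(range(len(frequencies)), key=frequencies.__getitem__)
    n_s = frequencies[s]
    # Assumption: a missing neighbour counts as frequency 0.
    n_prev = frequencies[s - 1] if s > 0 else 0
    n_next = frequencies[s + 1] if s + 1 < len(frequencies) else 0
    denominator = (n_s - n_prev) + (n_s - n_next)
    if denominator == 0:
        raise ValueError("no distinct maximum: the mode is not defined")
    return left_edges[s] + (n_s - n_prev) / denominator * width


# Hypothetical intervals [0,10), [10,20), [20,30), [30,40)
left_edges = [0, 10, 20, 30]
frequencies = [3, 8, 12, 7]
mode = interpolated_mode(left_edges, 10, frequencies)  # 20 + 4/9 * 10
```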

The practical rules of calculating the mode:

- finding the mode is legitimate only for unimodal distributions (with one clear maximum);
- the interval of the mode and its two neighbouring intervals should have the same width;
- we do not calculate the mode for multimodal distributions.

How to find the mode graphically?

- Draw a histogram;
- Draw two lines from the upper vertices of the highest rectangle to the vertices of the neighbouring rectangles;
- Find the point where those lines cross and project it onto the X-axis;
- Read an approximation of the mode on the X-axis.

The sample median of a quantitative variable is that value of the variable in a data set that divides the set of observed values in half, so that the observed values in one half are less than or equal to the median value and the observed values in the other half are greater than or equal to the median value. To obtain the median of the variable, we arrange the observed values in a data set in increasing order and then determine the middle value in the ordered list.

How to find the median?
Arrange the observed values of the variable in a data set in increasing order.

1 If the number of observations is odd, then the sample median is the observed value exactly in the middle of the ordered list.

2 If the number of observations is even, then the sample median is the number halfway between the two middle observed values in the ordered list.

In both cases, if we let n denote the number of observations in a data set, then the sample median is at position $\frac{n+1}{2}$ in the ordered list (for even n this position falls halfway between the two middle observations).
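The two cases above can be sketched as follows. The function name `sample_median` and the data are hypothetical; the last call illustrates that a single extreme value does not move the median.

```python
def sample_median(values):
    """Median of a simple series: middle value of the ordered list."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the observation at position (n + 1) / 2 (1-based).
        return ordered[mid]
    # Even n: halfway between the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = sample_median([7, 1, 5])                # ordered: 1, 5, 7
even_case = sample_median([7, 1, 5, 3])            # ordered: 1, 3, 5, 7
with_outlier = sample_median([1, 3, 5, 7, 1000])   # the outlier has no effect
```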

Remark

1 The median is the value of the middle observation, not its frequency.

2 The median is a robust statistic - it is not sensitive to outliers.

3 In the case of asymmetric distributions the median gives more information than the arithmetic mean.

How to find the median in the case of grouped data? For data grouped in categories:

1) find the position of the middle observation as $\frac{n}{2}$ (or $\frac{n+1}{2}$ for a series with an odd number of observations),

2) find the class containing this middle observation; this is the class in which the cumulative frequency reaches $\frac{n}{2}$ for the first time,

3) read off the value of that category.
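The three steps above amount to a scan over the cumulative frequencies. A minimal sketch, with a hypothetical function name and hypothetical data, and assuming the classes are listed in increasing order:

```python
from itertools import accumulate


def grouped_discrete_median(values, frequencies):
    """Value of the class in which the cumulative frequency first reaches n/2."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    for value, cumulative in zip(values, accumulate(frequencies)):
        if cumulative >= half:
            return value
    raise ValueError("values and frequencies do not match")


# Hypothetical classes 0..3 with frequencies 5, 10, 4, 1 (n = 20, n/2 = 10).
# Cumulative frequencies: 5, 15, 19, 20, so n/2 is first reached in class 1.
median_class = grouped_discrete_median([0, 1, 2, 3], [5, 10, 4, 1])
```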

For continuous data we use the interpolation formula:

$$Me = x_{lm} + \frac{\frac{n}{2} - n^{cum}_{m-1}}{n_m} \cdot d_m,$$

where Me - the median, $x_{lm}$ - the left end of the median interval, n - the total number of observations, $n_m$ - the frequency of the median interval, $n^{cum}_{m-1}$ - the cumulative frequency of the interval preceding the median interval, $d_m$ - the width of the median interval.
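The interpolation formula for the median can be sketched as below. The function name, the representation of intervals by a list of `k + 1` edges, and the data are assumptions made for this illustration.

```python
from itertools import accumulate


def interpolated_median(edges, frequencies):
    """Median of interval-grouped data.

    edges: k + 1 increasing interval boundaries; frequencies: k interval counts.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = list(accumulate(frequencies))
    # Median interval m: the first one whose cumulative frequency reaches n/2.
    m = next(i for i, c in enumerate(cumulative) if c >= half)
    cum_before = cumulative[m - 1] if m > 0 else 0
    width = edges[m + 1] - edges[m]
    return edges[m] + (half - cum_before) / frequencies[m] * width


# Hypothetical intervals [0,10), [10,20), [20,30), [30,40); n = 30, n/2 = 15.
edges = [0, 10, 20, 30, 40]
frequencies = [3, 8, 12, 7]  # cumulative: 3, 11, 23, 30
median = interpolated_median(edges, frequencies)  # 20 + (15 - 11) / 12 * 10
```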

How to find the median graphically?

1 Draw a histogram of the cumulative frequency and plot the cumulative frequency line.

2 Mark $\frac{n}{2}$ on the Y-axis.

3 Draw a horizontal line from this point. Mark the point where this line crosses the cumulative frequency line.

4 Project this crossing point onto the X-axis. This is the approximate median.

There are some relations between the arithmetic mean, the mode and the median.

1 For symmetric (unimodal) distributions all those measures are equal:

$$\bar{x} = Mo = Me.$$

2 For right-skewed (asymmetric) distributions the mode is typically the smallest measure and the arithmetic mean is the greatest one:

$$Mo < Me < \bar{x}.$$

3 For left-skewed (asymmetric) distributions the mode is typically the greatest measure and the arithmetic mean is the smallest one:

$$\bar{x} < Me < Mo.$$

We also have the so-called Pearson equation: $Mo = 3Me - 2\bar{x}$. This relation holds (approximately) only for symmetric distributions or for asymmetric distributions that are only slightly skewed. We will say more about it during the lecture on the measures of skewness.

The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data. A quartile is a type of quantile. The first quartile Q1 is defined as the middle number between the smallest number and the median of the data set. The second quartile Q2 is the median of the data. The third quartile Q3 is the middle value between the median and the highest value of the data set.

For discrete distributions, there is no universal agreement on selecting the quartile values.

- For simple series, Q1 is the value at position n/4 and Q3 is the value at position 3n/4 (if a position falls between two observations, we take their arithmetic mean),
- For grouped series with categories, we proceed as for the median: we find the classes in which the cumulative frequency reaches n/4 and 3n/4 for the first time, and then we read off the quartiles,
- For continuous data grouped in intervals, we use the interpolation formulas:

$$Q_1 = x_{lq_1} + \frac{\frac{n}{4} - n^{cum}_{q_1-1}}{n_{q_1}} \cdot d_{q_1},$$

where $Q_1$ - the first quartile, $x_{lq_1}$ - the left end of the first quartile interval, n - the total number of observations, $n_{q_1}$ - the frequency of the first quartile interval, $n^{cum}_{q_1-1}$ - the cumulative frequency of the interval preceding the first quartile interval, $d_{q_1}$ - the width of the first quartile interval.

$$Q_3 = x_{lq_3} + \frac{\frac{3n}{4} - n^{cum}_{q_3-1}}{n_{q_3}} \cdot d_{q_3},$$

where $Q_3$ - the third quartile, $x_{lq_3}$ - the left end of the third quartile interval, n - the total number of observations, $n_{q_3}$ - the frequency of the third quartile interval, $n^{cum}_{q_3-1}$ - the cumulative frequency of the interval preceding the third quartile interval, $d_{q_3}$ - the width of the third quartile interval.
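The formulas for the first and third quartiles differ from the median formula only in the target position (n/4 or 3n/4 instead of n/2), so one function with a proportion parameter covers all three. This is a sketch under that observation; the function name `interpolated_quantile`, the parameter `p` and the data are hypothetical.

```python
from itertools import accumulate


def interpolated_quantile(edges, frequencies, p):
    """Quantile of order p (0 < p <= 1) for interval-grouped data.

    edges: k + 1 increasing interval boundaries; frequencies: k interval counts.
    """
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    target = p * n  # n/4 for Q1, n/2 for the median, 3n/4 for Q3
    cumulative = list(accumulate(frequencies))
    # Quantile interval q: the first one whose cumulative frequency reaches the target.
    q = next(i for i, c in enumerate(cumulative) if c >= target)
    cum_before = cumulative[q - 1] if q > 0 else 0
    width = edges[q + 1] - edges[q]
    return edges[q] + (target - cum_before) / frequencies[q] * width


# Hypothetical intervals [0,10), [10,20), [20,30), [30,40); n = 30.
edges = [0, 10, 20, 30, 40]
frequencies = [3, 8, 12, 7]  # cumulative: 3, 11, 23, 30

q1 = interpolated_quantile(edges, frequencies, 0.25)  # 10 + (7.5 - 3) / 8 * 10
q2 = interpolated_quantile(edges, frequencies, 0.50)  # 20 + (15 - 11) / 12 * 10
q3 = interpolated_quantile(edges, frequencies, 0.75)  # 20 + (22.5 - 11) / 12 * 10
```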