6.describing a distribution
TRANSCRIPT
DESCRIBING A DISTRIBUTION
• A good way to describe the distribution of a quantitative variable is to take the following three steps:
o Report the center of the distribution. [Measures of Central Tendency]
o Report any significant deviations from the center. [Measures of Variation]
o Report the general shape of the distribution. [Measures of Skewness and Peakedness]
MEASURES OF CENTRAL TENDENCY: CENTER OF DISTRIBUTION
AVERAGE
Central tendency is measured by averages. These describe the point about which the various observed values cluster.
In mathematics, an average, or central tendency, of a data set is a measure of the "middle" or "expected" value of the data set.
An average is a single value that is considered the most representative or typical value for a given set of data.
Objectives of averaging
o To get one single value that describes the characteristics of the entire data.
o To facilitate comparison.
CHARACTERISTICS OF A GOOD AVERAGE
Easy to understand
Simple to compute
Based on all observations
Capable of further algebraic treatment
Should not be unduly affected by the presence of extreme values
TOOLS TO COMPUTE THE AVERAGE
Mean Median Mode
MEAN
It is a commonly used measure of central tendency.
The mean is obtained by adding together all observations and dividing the total by the number of observations.
The mean, in most cases, is not an actual data value.
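CALCULATION OF MEAN
For ungrouped series: $\bar{X} = \frac{\sum X}{N}$
For grouped series
o Direct method: $\bar{X} = \frac{\sum fX}{N}$
o Short-cut method: $\bar{X} = A + \frac{\sum fd}{N} \times i$
Where,
o $\bar{X}$ = Mean
o X = Observation (the class mid-point in a grouped series)
o N = Number of observations (the sum of the frequencies in a grouped series)
o f = Frequency of the observation
o A = Assumed mean
o d = Deviation from the assumed mean in class-interval units, $d = (X - A)/i$
o i = Class interval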
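As an illustration, the three mean formulas can be sketched in Python. The function names are illustrative and not part of the original slides. The ungrouped example uses the Class A quiz scores and the grouped example uses the Section B marks table, both of which appear later in this deck, with class mid-points standing in for X.

```python
def mean_ungrouped(values):
    """Ungrouped series: sum(X) / N."""
    return sum(values) / len(values)


def mean_grouped_direct(midpoints, freqs):
    """Grouped series, direct method: sum(f * X) / N, where N = sum(f)."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


def mean_grouped_shortcut(midpoints, freqs, assumed_mean, class_interval):
    """Grouped series, short-cut method: A + (sum(f * d) / N) * i,
    with step deviations d = (X - A) / i."""
    total_fd = sum(
        f * (x - assumed_mean) / class_interval
        for x, f in zip(midpoints, freqs)
    )
    return assumed_mean + total_fd / sum(freqs) * class_interval


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
midpoints = [5, 15, 25, 35, 45, 55, 65]   # mid-points of classes 0-10 ... 60-70
section_b = [6, 15, 20, 10, 5, 3, 1]      # Section B frequencies

print(mean_ungrouped(class_a))                                        # 81.5
print(mean_grouped_direct(midpoints, section_b))                      # 26.0
print(round(mean_grouped_shortcut(midpoints, section_b, 35, 10), 2))  # 26.0
```

The short-cut method gives the same answer for any choice of assumed mean; picking a mid-point near the center just keeps the hand arithmetic small.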
MATHEMATICAL PROPERTIES OF MEAN
The algebraic sum of the deviations of all the observations from the mean is always zero.
The sum of squared deviations of all the observations from the mean is a minimum, i.e. less than the sum of squared deviations from any other value.
If we have the means and numbers of observations of two or more related groups, we can compute the combined mean:
$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + \dots}{N_1 + N_2 + \dots}$
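The three properties can be checked numerically. The sketch below uses the Class A quiz scores from later in this deck; splitting them into two groups is an arbitrary choice made only to exercise the combined-mean formula, and the helper names are illustrative.

```python
from statistics import mean


def combined_mean(means, sizes):
    """Combined mean: (N1*X1 + N2*X2 + ...) / (N1 + N2 + ...)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)


scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
group_1, group_2 = scores[:4], scores[4:]   # arbitrary split for illustration

overall = mean(scores)
pooled = combined_mean(
    [mean(group_1), mean(group_2)],
    [len(group_1), len(group_2)],
)

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - overall for x in scores)


def sum_sq(center):
    """Sum of squared deviations of the scores from an arbitrary center."""
    return sum((x - center) ** 2 for x in scores)


print(overall, pooled)     # 81.5 81.5
print(sum_of_deviations)   # 0.0
# Property 2: any center other than the mean gives a larger sum of squares.
print(sum_sq(overall) < sum_sq(80), sum_sq(overall) < sum_sq(83))  # True True
```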
MERITS AND DEMERITS OF MEAN
Merits
o It possesses the first four of the five characteristics of a good average.
Demerits
o The mean is unduly affected by the presence of extreme values.
o In a continuous series, it is difficult to compute the mean without assuming the mid-point of each class.
o Applicable only to quantitative data.
o Sometimes the mean may not be an observation in the data.
MEDIAN
The median is the measure of central tendency that appears in the ‘middle’ of an ordered sequence of values (in either ascending or descending order).
It divides the whole data set into two equal parts. In other words, 50% of the observations are smaller than the median and 50% are larger than it.
CALCULATION OF MEDIAN
Individual Series: M = Size of the $\left(\frac{N+1}{2}\right)$th item
Discrete Series: M = Size of the $\left(\frac{N}{2}\right)$th item
Continuous Series: M = Size of the $\left(\frac{N}{2}\right)$th item, located within the median class by
$M = L + \frac{\frac{N}{2} - cf}{f} \times i$
Where,
o M = Median
o N = Number of observations
o L = Lower limit of the median class
o cf = Cumulative frequency of the class preceding the median class
o f = Frequency of the median class
o i = Class interval
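A minimal sketch of the median calculation, assuming Python's standard `statistics.median` for an individual series and a hand-written helper (the name `grouped_median` is illustrative) for a continuous series. The data are the Class B quiz scores and the Section B marks table from later in this deck.

```python
from statistics import median


def grouped_median(lower_limits, freqs, class_interval):
    """Continuous series: M = L + ((N/2 - cf) / f) * i."""
    half = sum(freqs) / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cf) / f * class_interval
        cf += f
    raise ValueError("frequency table is empty")


class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]
lower_limits = [0, 10, 20, 30, 40, 50, 60]
section_b = [6, 15, 20, 10, 5, 3, 1]

print(median(class_b))                               # 88.0
print(grouped_median(lower_limits, section_b, 10))   # 24.5
```

With ten scores there is no single middle item, so `median` averages the 5th and 6th ordered values (83 and 93).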
MERITS AND DEMERITS OF MEDIAN
Merits
o It is especially useful in the case of a continuous series because the mid-point is not used for calculation.
o It is not influenced by the presence of extreme values.
o Applicable to quantitative and qualitative data.
Demerits
o Not based on every observation
o Not capable of algebraic treatment
o Tends to be a rather unstable value if the number of observations is small.
MODE
• The mode is defined as the value that occurs the maximum number of times, i.e. the value with the maximum frequency.
• A data set can have more than one mode.
• A data set is said to have no mode if all values occur with equal frequency.
CALCULATION OF MODE
Individual Series: Z = The item that is repeated the greatest number of times
Discrete Series: Z = The item that is repeated the greatest number of times, i.e. the item with the highest frequency
Continuous Series: $Z = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Where,
o Z = Mode
o L = Lower limit of the modal class
o $f_1$ = Frequency of the modal class
o $f_0$ = Frequency of the class preceding the modal class
o $f_2$ = Frequency of the class succeeding the modal class
o i = Class interval
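The mode calculations might be sketched as follows. `statistics.multimode` is part of the Python standard library; `grouped_mode` is an illustrative helper written for this example, and it simply refuses the cases the interpolation formula cannot handle. The data come from the quiz scores and the Section A and C tables later in this deck.

```python
from statistics import multimode


def grouped_mode(lower_limits, freqs, class_interval):
    """Continuous series: Z = L + (f1 - f0) / (2*f1 - f0 - f2) * i."""
    k = freqs.index(max(freqs))   # first class with the highest frequency
    if k == 0 or k == len(freqs) - 1:
        raise ValueError("modal class is at an extreme; f0 or f2 is undefined")
    f0, f1, f2 = freqs[k - 1], freqs[k], freqs[k + 1]
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("neighbouring classes tie with the modal class")
    return lower_limits[k] + (f1 - f0) / denominator * class_interval


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
lower_limits = [0, 10, 20, 30, 40, 50, 60]
section_a = [3, 5, 11, 22, 11, 5, 3]
section_c = [1, 3, 5, 10, 20, 15, 6]

print(multimode(class_a))                                    # [80, 85]
print(grouped_mode(lower_limits, section_a, 10))             # 35.0
print(round(grouped_mode(lower_limits, section_c, 10), 2))   # 46.67
```

Class A shows a data set with more than one mode: both 80 and 85 occur twice.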
MERITS AND DEMERITS OF MODE
Merits
o Not affected by extreme values
o Applicable to quantitative and qualitative data.
o Can be obtained in a continuous series without assuming the mid-point.
Demerits
o Limited utility compared to the mean and median
o The mode cannot be determined if the modal class is at the extreme.
o Difficult to compute in the case of a bimodal distribution
o Possibility of a ‘no mode’ distribution
GENERAL LIMITATION OF AN AVERAGE
Since an average is a single value representing a group of values, it must be properly interpreted; otherwise, there is every possibility of jumping to a wrong conclusion.
An average may give us a value that does not exist in the data.
Sometimes an average may give an absurd result.
Measures of central value fail to give us any idea about the formation of the series. Two or more series may have the same central value but differ widely in composition.
MEASURE OF DISPERSION/VARIATION: DEVIATIONS FROM THE CENTER
TODAY’S QUESTION
Two classes took a recent quiz. There were 10 students in each class, and their scores are as follows:
Class A: 72 76 80 80 81 83 84 85 85 89
Class B: 57 65 83 94 95 96 98 93 71 63
Each class had an average (mean) score of 81.5.
Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?
The answer is… No. The average (mean) does not tell us anything about the distribution or variation in the grades.
So we need some way of measuring not just the average but also the spread of the distribution of our data, i.e. its variation or dispersion.
Variation/dispersion describes how spread out the scores are around the mean.
If many observations are “bunched up” around the mean, the distribution is narrowly spread; otherwise it is widely spread.
The more narrowly spread the distribution, the better your ability to make accurate predictions.
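To make the contrast concrete, the two classes can be compared with a few lines of Python. This sketch uses the standard deviation, which is introduced later in this deck, as the measure of spread; `pstdev` (dividing by N) is chosen because it matches the formulas used in these slides.

```python
from statistics import mean, pstdev

class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    # Same mean in both classes, very different spread around it.
    print(name, "mean =", mean(scores), "SD =", round(pstdev(scores), 2))
# Class A mean = 81.5 SD = 4.63
# Class B mean = 81.5 SD = 15.1
```

The scores in Class B are, on this measure, roughly three times as spread out as those in Class A even though the means are identical.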
MEASURE OF VARIATION
A measure of variation/dispersion is designed to state the extent to which the individual observations differ from the mean.
The measure of variation gives the degree of variation, i.e. the amount of variation.
SIGNIFICANCE OF MEASURING VARIATION
To determine the reliability of an average
To compare two or more series with regard to their variability
To facilitate the use of other statistical measures
HOW CAN WE QUANTIFY DISPERSION?
The mean deviation
The standard deviation
COEFFICIENT OF VARIATION (CV)
All the tools for measuring variation quantify the variation/deviation in the units of the data. The CV expresses the degree of variation relative to the mean, $CV = \sigma / \bar{X}$, often multiplied by 100 and quoted as a percentage.
CV is a measure of relative variability. It is:
o used to measure changes that have occurred in a population over time
o used to compare the variability of two populations that are expressed in different units of measurement
o expressed as a fraction (or percentage) rather than in terms of the units of the particular data
o non-negative for data with a positive mean; it commonly falls between 0 and 1 but exceeds 1 when the standard deviation is larger than the mean
o If the CV is near 0, the degree of variation is low; the larger the CV, the higher the degree of variation.
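A short sketch of the CV as the ratio SD / mean, using the two quiz classes from this deck. The `lopsided` list is hypothetical data invented for this example; it is included only to show that the ratio is not mathematically capped at 1.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = SD / mean, as a fraction (multiply by 100 for a percentage)."""
    return pstdev(values) / mean(values)


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

cv_a = coefficient_of_variation(class_a)
cv_b = coefficient_of_variation(class_b)

# Changing the unit of measurement (here, multiplying by 10) leaves CV unchanged.
cv_a_rescaled = coefficient_of_variation([10 * x for x in class_a])

# Hypothetical data whose SD is larger than its mean.
lopsided = [0, 0, 0, 10]

print(round(cv_a, 3), round(cv_b, 3))                  # 0.057 0.185
print(round(cv_a_rescaled, 3))                         # 0.057
print(round(coefficient_of_variation(lopsided), 3))    # 1.732
```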
RANGE
Range is defined as the difference between the value of the largest observation (L) and the smallest observation (S) in the distribution.
Range = L − S
Coefficient of Range = $\frac{L - S}{L + S}$
Useful for: daily temperature fluctuations or share price movements
It is considered primitive because it uses only the extreme values, which may not be useful indicators of the bulk of the population.
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
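For illustration, the range and its coefficient for the two quiz classes in this deck (the function names are illustrative):

```python
def value_range(values):
    """Range = L - S (largest minus smallest observation)."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Coefficient of Range = (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

print(value_range(class_a), value_range(class_b))   # 17 41
print(round(coefficient_of_range(class_a), 3),
      round(coefficient_of_range(class_b), 3))      # 0.106 0.265
```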
MERITS AND DEMERITS OF RANGE
Merits
o Simple to understand and easy to compute
o Less time consuming
Demerits
o Not based on each and every observation of the distribution
o Cannot be calculated in the case of an open-end distribution
o Fails to reveal the character of the distribution
INTERQUARTILE RANGE OR QUARTILE DEVIATION
Measures the range of the middle 50% of the values only
Is defined as the difference between the upper and lower quartiles
Interquartile range = $Q_3 - Q_1$
Quartile Deviation: $Q.D. = \frac{Q_3 - Q_1}{2}$
Coefficient of Q.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
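A sketch of these measures for the two quiz classes in this deck. The slides do not say which quartile convention they use, so this example assumes the default "exclusive" method of Python's `statistics.quantiles`, which places Q1 at the (N+1)/4-th ordered item; other conventions give slightly different quartiles.

```python
from statistics import quantiles


def quartile_measures(values):
    """Return Q1, Q3, interquartile range, quartile deviation and its coefficient."""
    q1, _median, q3 = quantiles(values, n=4)   # default method is "exclusive"
    return {
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        "QD": (q3 - q1) / 2,
        "coefficient": (q3 - q1) / (q3 + q1),
    }


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

measures_a = quartile_measures(class_a)
measures_b = quartile_measures(class_b)
print(measures_a["Q1"], measures_a["Q3"], measures_a["QD"])   # 79.0 85.0 3.0
print(measures_b["Q1"], measures_b["Q3"], measures_b["QD"])   # 64.5 95.25 15.375
```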
MERITS AND DEMERITS OF QD
Merits
o Superior to the range
o Can also be calculated for open-end classes
o Not affected by the presence of extreme values
Demerits
o Considers only the middle 50% of the observations
o Not capable of mathematical manipulation
o Does not show the scatter around an average
AVERAGE DEVIATION
Average deviation is obtained by calculating the absolute deviation of each observation from the mean or median and then taking the arithmetic mean of these deviations.
Measures the ‘average’ distance of each observation from the mean of the data
Gives an equal weight to each observation
Generally more sensitive than the range or interquartile range, since a change in any value will affect it
CALCULATION OF AVERAGE DEVIATION
For ungrouped series: $AD = \frac{\sum |d|}{N}$
For grouped series: $AD = \frac{\sum f|d|}{N}$
Coefficient of Average Deviation = $\frac{AD}{\bar{X}}$
Where
o AD = Average Deviation
o $d = X - \bar{X}$
o $\bar{X}$ = Mean
o f = Frequency of observation
o N = Number of observations
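The average deviation about the mean might be computed as in the sketch below. The function names are illustrative; the data are the quiz scores and the Section A marks table (class mid-points and frequencies) from this deck.

```python
def average_deviation(values):
    """Ungrouped series: AD = sum(|X - mean|) / N."""
    n = len(values)
    center = sum(values) / n
    return sum(abs(x - center) for x in values) / n


def average_deviation_grouped(midpoints, freqs):
    """Grouped series: AD = sum(f * |X - mean|) / N, where N = sum(f)."""
    n = sum(freqs)
    center = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * abs(x - center) for x, f in zip(midpoints, freqs)) / n


def coefficient_of_ad(values):
    """Coefficient of Average Deviation = AD / mean."""
    return average_deviation(values) / (sum(values) / len(values))


class_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]
midpoints = [5, 15, 25, 35, 45, 55, 65]
section_a = [3, 5, 11, 22, 11, 5, 3]

print(average_deviation(class_a), average_deviation(class_b))   # 3.7 14.0
print(average_deviation_grouped(midpoints, section_a))          # 10.0
print(round(coefficient_of_ad(class_a), 3))                     # 0.045
```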
MERITS AND DEMERITS OF AD
Merits
o Relatively simple to calculate.
o Based on each and every observation of the data
o Less affected by the values of extreme observations
o Since deviations are taken from a central value, comparisons about the formation of different distributions can easily be made.
Demerits
o Algebraic signs are ignored
o May not give an accurate result
STANDARD DEVIATION
The most popular measure of variation.
It was introduced by Karl Pearson in 1893.
It is the square root of the mean of the squared deviations from the arithmetic mean.
Measures the variation of observations from the mean
Works with squares of residuals, not absolute values
If the standard deviation is large, the observations are spread out from their mean.
If the standard deviation is small, the observations are close to their mean.
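CALCULATION OF STANDARD DEVIATION
For ungrouped series: $\sigma = \sqrt{\frac{\sum d^2}{N}}$
For grouped series
o Direct method: $\sigma = \sqrt{\frac{\sum fd^2}{N}}$
o Short-cut method: $\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$, with d measured from an assumed mean A ($d = X - A$)
Coefficient of Variation = $\frac{\sigma}{\bar{X}} \times 100$
Where
o $\sigma$ = Standard Deviation
o $d = X - \bar{X}$
o $\bar{X}$ = Mean
o f = Frequency of observation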
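The standard deviation formulas can be sketched in Python as follows. The function names are illustrative, and the grouped example uses the Section B marks table from the worked problem in this deck. The assumed mean of 25 in the short-cut call is an arbitrary choice; any value gives the same result.

```python
from math import sqrt


def sd_ungrouped(values):
    """Ungrouped series: sqrt(sum(d^2) / N), with d = X - mean."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((x - m) ** 2 for x in values) / n)


def sd_grouped_direct(midpoints, freqs):
    """Grouped series, direct method: sqrt(sum(f * d^2) / N), with d = X - mean."""
    n = sum(freqs)
    m = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sqrt(sum(f * (x - m) ** 2 for x, f in zip(midpoints, freqs)) / n)


def sd_grouped_shortcut(midpoints, freqs, assumed_mean):
    """Short-cut method: sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2), with d = X - A."""
    n = sum(freqs)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(midpoints, freqs))
    sum_fd2 = sum(f * (x - assumed_mean) ** 2 for x, f in zip(midpoints, freqs))
    return sqrt(sum_fd2 / n - (sum_fd / n) ** 2)


midpoints = [5, 15, 25, 35, 45, 55, 65]
section_b = [6, 15, 20, 10, 5, 3, 1]

sd_direct = sd_grouped_direct(midpoints, section_b)
sd_short = sd_grouped_shortcut(midpoints, section_b, 25)
cv_percent = sd_direct / 26.0 * 100   # Section B mean is 26.00

print(round(sd_direct, 2), round(sd_short, 2))   # 13.63 13.63
print(round(cv_percent, 2))                      # 52.41
```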
MATHEMATICAL PROPERTIES OF STANDARD DEVIATION
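Combined Standard Deviation: $\sigma_{12} = \sqrt{\frac{N_1\sigma_1^2 + N_2\sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$, where $d_1 = \bar{X}_1 - \bar{X}_{12}$ and $d_2 = \bar{X}_2 - \bar{X}_{12}$
Standard Deviation of the first N natural numbers: $\sigma = \sqrt{\frac{1}{12}(N^2 - 1)}$
The sum of the squares of the deviations of all the observations from their arithmetic mean is a minimum.
Standard deviation is independent of a change of origin but not of scale.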
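These properties can be verified numerically. In the sketch below the Class A quiz scores from this deck are split into two arbitrary groups, and the combined standard deviation rebuilt from the group summaries is compared with the standard deviation of the pooled data. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def combined_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined SD of two groups from their sizes, means and SDs."""
    combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1, d2 = mean1 - combined_mean, mean2 - combined_mean
    return sqrt(
        (n1 * sd1 ** 2 + n2 * sd2 ** 2 + n1 * d1 ** 2 + n2 * d2 ** 2) / (n1 + n2)
    )


def sd_natural_numbers(n):
    """SD of the first n natural numbers: sqrt((n^2 - 1) / 12)."""
    return sqrt((n ** 2 - 1) / 12)


scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
g1, g2 = scores[:4], scores[4:]   # arbitrary split for illustration

from_groups = combined_sd(len(g1), mean(g1), pstdev(g1),
                          len(g2), mean(g2), pstdev(g2))
pooled = pstdev(scores)
print(round(from_groups, 4) == round(pooled, 4))   # True

print(round(sd_natural_numbers(10), 4),
      round(pstdev(list(range(1, 11))), 4))        # 2.8723 2.8723

# Change of origin leaves the SD unchanged; change of scale multiplies it.
shifted = pstdev([x + 100 for x in scores])
scaled = pstdev([2 * x for x in scores])
print(round(shifted / pooled, 6), round(scaled / pooled, 6))   # 1.0 2.0
```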
MERITS AND DEMERITS OF STANDARD DEVIATION
Merits
o Based on every item of the distribution
o It is possible to calculate the combined standard deviation
o For comparing the variability of two or more distributions, the coefficient of variation (which is based on the standard deviation) is considered the most appropriate
o It is the measure most prominently used in further statistical work.
Demerits
o Compared to the others, it is difficult to compute
o It gives more weight to extreme values and less to those near the mean.
MEASURES OF SKEWNESS AND PEAKEDNESS: SHAPE OF THE DISTRIBUTION
DISTRIBUTION OF DATA
Data can be "distributed" (spread out) in different ways:
o spread out more on the left
o spread out more on the right
o all jumbled up
o around a central value with no bias left or right
NORMAL DISTRIBUTION CURVE [BELL SHAPED CURVE]
CHARACTERISTICS OF THE NORMAL DISTRIBUTION
• The normal distribution curve is bell-shaped.
• It is symmetrical about the mean: 50% of the observations are on one side of the center and the other 50% are on the other side.
• The curve never touches the X-axis.
• The height of the normal curve is at its maximum at the mean.
• The distribution is single peaked, not bimodal or multi-modal.
• Most of the cases fall in the center portion of the curve, and as values of the variable become more extreme they become less frequent, with “outliers” at each of the “tails” of the distribution few in number.
• The Mean, Median, and Mode are the same.
NORMAL DISTRIBUTION & OTHER TOOLS
Symmetrical distribution and Mean/Median/Mode
o Mode = 3 Median − 2 Mean
Symmetrical distribution and standard deviation
o $\bar{X} \pm 1\sigma$ covers 68.27% of observations
o $\bar{X} \pm 2\sigma$ covers 95.45% of observations
o $\bar{X} \pm 3\sigma$ covers 99.73% of observations
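The three coverage percentages follow from the normal distribution's cumulative distribution function. A minimal check, assuming Python's standard `statistics.NormalDist` (the helper name `coverage` is illustrative):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"mean ± {k} SD covers {coverage(k):.2%}")
# mean ± 1 SD covers 68.27%
# mean ± 2 SD covers 95.45%
# mean ± 3 SD covers 99.73%
```

Because the result depends only on the number of standard deviations, the same percentages hold for a normal distribution with any mean and standard deviation.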
SKEWNESS
• The term skewness refers to lack of symmetry or departure from symmetry. When a distribution is not symmetrical it is called a skewed distribution.
• In a symmetrical distribution, the values of the mean, median and mode are alike.
• If the value of the mean is greater than the mode, skewness is said to be positive. A positively skewed distribution contains some values that are much larger than the majority of observations.
• If the value of the mode is greater than the mean, skewness is said to be negative. A negatively skewed distribution contains some values that are much smaller than the majority of observations.
• It is important to emphasize that the skewness of a distribution cannot be determined simply by inspection.
• Point to remember: zero skewness does not mean that the distribution is a normal distribution! [A normal distribution should have a skewness of zero and a peakedness (kurtosis) of 3.]
SKEWNESS
If Mean = Mode, the skewness is zero.
If Mean > Mode, the skewness is positive.
If Mean < Mode, the skewness is negative.
SKEWNESS DISTRIBUTIONS
SKEWNESS DISTRIBUTION
MEASURES OF SKEWNESS
Karl Pearson’s Coefficient of Skewness: $Sk_p = \frac{\text{Mean} - \text{Mode}}{\sigma}$
Bowley’s Coefficient of Skewness: $Sk_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
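Both coefficients are simple ratios, as the sketch below shows. The Pearson example plugs in the Section B summary values (mean 26.00, mode 23.33, SD 13.63) from the worked problem in this deck. The Bowley example uses the Class B quiz scores, with quartiles taken from `statistics.quantiles` under its default "exclusive" convention, which is an assumption since the slides do not name a convention.

```python
from statistics import quantiles


def pearson_skewness(mean_value, mode_value, sd):
    """Karl Pearson's coefficient: (Mean - Mode) / SD."""
    return (mean_value - mode_value) / sd


def bowley_skewness(q1, median_value, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median_value) / (q3 - q1)


print(round(pearson_skewness(26.00, 23.33, 13.63), 2))   # 0.2

class_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]
q1, q2, q3 = quantiles(class_b, n=4)
print(round(bowley_skewness(q1, q2, q3), 2))             # -0.53
```

The negative Bowley value for Class B reflects the handful of low scores pulling the lower quartile far below the median.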
COEFFICIENT OF SKEWNESS
The coefficient of skewness measures the degree of skewness. Bowley’s coefficient always lies between −1 and +1; Karl Pearson’s coefficient usually does so in practice.
If the answer is 0, it indicates a symmetrical distribution.
If the answer is negative, then the distribution is negatively skewed.
o If the answer is close to −1 (say −0.90), then the distribution is highly negatively skewed.
o If the answer is close to 0 (say −0.20), then the distribution is slightly negatively skewed.
If the answer is positive, then the distribution is positively skewed.
o If the answer is close to 1 (say 0.90), then the distribution is highly positively skewed.
o If the answer is close to 0 (say 0.20), then the distribution is slightly positively skewed.
A PROBLEM
The following data relate to the marks scored in statistics by three different sections.
Compute the Mean, Median, Mode, Standard deviation and skewness, and interpret the results.
Number of students in each marks class:
Marks      0-10  10-20  20-30  30-40  40-50  50-60  60-70
Sec A         3      5     11     22     11      5      3
Sec B         6     15     20     10      5      3      1
Sec C         1      3      5     10     20     15      6
SECTION A
Marks      X    f   cf    fX     d    fd    fd²
0-10       5    3    3    15   -30   -90   2700
10-20     15    5    8    75   -20  -100   2000
20-30     25   11   19   275   -10  -110   1100
30-40     35   22   41   770     0     0      0
40-50     45   11   52   495    10   110   1100
50-60     55    5   57   275    20   100   2000
60-70     65    3   60   195    30    90   2700
Total          60       2100     0     0  11600
MEAN 35
MEDIAN 35
MODE 35
SD 13.9
SKEWNESS 0
SECTION B
Marks      X    f   cf    fX     d    fd    fd²
0-10       5    6    6    30   -21  -126   2646
10-20     15   15   21   225   -11  -165   1815
20-30     25   20   41   500    -1   -20     20
30-40     35   10   51   350     9    90    810
40-50     45    5   56   225    19    95   1805
50-60     55    3   59   165    29    87   2523
60-70     65    1   60    65    39    39   1521
Total          60       1560    63     0  11140
MEAN 26.00
MEDIAN 24.50
MODE 23.33
SD 13.63
SKEWNESS 0.20 using (Mean - Mode)/SD; 0.11 using (Mean - Median)/SD
SECTION C
Marks      X    f   cf    fX     d    fd    fd²
0-10       5    1    1     5   -39   -39   1521
10-20     15    3    4    45   -29   -87   2523
20-30     25    5    9   125   -19   -95   1805
30-40     35   10   19   350    -9   -90    810
40-50     45   20   39   900     1    20     20
50-60     55   15   54   825    11   165   1815
60-70     65    6   60   390    21   126   2646
Total          60       2640   -63     0  11140
MEAN 44.00
MEDIAN 45.50
MODE 46.67
SD 13.63
SKEWNESS -0.20 using (Mean - Mode)/SD; -0.11 using (Mean - Median)/SD
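The whole problem can be reproduced with a short script. This is a sketch under the deck's own formulas (class mid-points for X, division by N for the SD, interpolation inside the median and modal classes); the names `SECTIONS` and `summarize` are illustrative. Skewness is reported two ways because the result depends on the formula: (Mean − Mode)/SD gives about ±0.20 for Sections B and C, while (Mean − Median)/SD gives about ±0.11.

```python
from math import sqrt

LOWER_LIMITS = [0, 10, 20, 30, 40, 50, 60]
CLASS_INTERVAL = 10
SECTIONS = {
    "A": [3, 5, 11, 22, 11, 5, 3],
    "B": [6, 15, 20, 10, 5, 3, 1],
    "C": [1, 3, 5, 10, 20, 15, 6],
}


def summarize(freqs):
    """Mean, median, mode, SD and skewness for one frequency table."""
    n = sum(freqs)
    mids = [low + CLASS_INTERVAL / 2 for low in LOWER_LIMITS]
    mean = sum(f * x for x, f in zip(mids, freqs)) / n
    sd = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(mids, freqs)) / n)

    # Median: first class whose cumulative frequency reaches N/2.
    cf = 0
    for low, f in zip(LOWER_LIMITS, freqs):
        if cf + f >= n / 2:
            median = low + (n / 2 - cf) / f * CLASS_INTERVAL
            break
        cf += f

    # Mode: interpolate inside the class with the highest frequency
    # (assumes the modal class is neither the first nor the last class).
    k = freqs.index(max(freqs))
    f0, f1, f2 = freqs[k - 1], freqs[k], freqs[k + 1]
    mode = LOWER_LIMITS[k] + (f1 - f0) / (2 * f1 - f0 - f2) * CLASS_INTERVAL

    return {
        "mean": mean,
        "median": median,
        "mode": mode,
        "sd": sd,
        "sk_mode": (mean - mode) / sd,      # Karl Pearson's (Mean - Mode) / SD
        "sk_median": (mean - median) / sd,  # (Mean - Median) / SD
    }


for name, freqs in SECTIONS.items():
    print(name, {key: round(value, 2) for key, value in summarize(freqs).items()})
```

Interpreting the output: Section A is symmetrical (mean, median and mode all 35), Section B is slightly positively skewed (most students scored low, with a tail of high marks), and Section C is its mirror image, slightly negatively skewed. Sections B and C have the same spread (SD about 13.63) but different centers.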
GRAPHS OF SECTION A, B AND C