descriptive statistics

King Mongkut’s University of Technology North Bangkok, Faculty of Information Technology


ON THE TECHNOLOGICAL FRONTIER WITH ANALYTICAL MIND AND PRACTICE

Descriptive Statistics


A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution.

(Frank & Althoen, “Statistics: Concepts and applications”, 1994)

The discipline of quantitatively describing the main features of a collection of data or the quantitative description itself.

Descriptive Statistic

• Frequency distribution table
• Describe
  • Measures of central tendency – Mode, Median, Mean
  • Dispersion of distribution – Range, SD, Variance
  • Shape of distribution – Skewness, Kurtosis
  • Individuals in distributions – Percentile, Decile, Quartile
• Joint distributions of data
  • Scatter Diagram
  • Correlation Coefficient
  • Linear Regression

Descriptive Statistics

Score      Frequency
0 – 4      0
5 – 9      0
10 – 14    0
15 – 19    0
20 – 24    0
25 – 29    0
30 – 34    1
35 – 39    1
40 – 44    3
45 – 49    6
50 – 54    5
55 – 59    10
60 – 64    10
65 – 69    10
70 – 74    7
75 – 79    4
80 – 84    6
85 – 89    6
90 – 94    3
95 – 99    0
100        0

Frequency Distribution

Raw data: 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, …

Ungrouped

Grouped

• Can be visualized using graphs and charts
• Determining the number of intervals: k = 1 + 3.3 log10(N), where N is the number of data entries
• Interval width = Range / k
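The interval rule above can be sketched in a few lines. This is a minimal illustration, assuming a base-10 logarithm and rounding k up to a whole number; the function name `sturges_intervals` is hypothetical, and only the visible part of the raw data is used.

```python
import math

def sturges_intervals(n: int) -> float:
    """Suggested number of intervals, k = 1 + 3.3 * log10(N)."""
    return 1 + 3.3 * math.log10(n)

# Only the entries visible on the slide; the remainder is elided there.
scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

k = math.ceil(sturges_intervals(len(scores)))  # assumption: round up
data_range = max(scores) - min(scores)         # Range = Max - Min
width = data_range / k                         # Interval width = Range / k

print(k, data_range, width)
```

Rounding k up is a design choice; rounding to the nearest integer is equally common.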

• One-way: one variable – often used with percentages
• Two-way: two variables – shows a rough relation between two variables
• Etc.

Frequency Distribution Table

Plan            Department
                IT              DN              MIS
                Male  Female    Male  Female    Male  Female
Thesis          3     5         4     3         2     3
Master Project  29    34        27    22        32    35


Mode
• The value with the highest frequency
• Applicable to nominal scale (and higher scales)
• Can be more than one value for one set of data
• fx : MODE

Measures of Central Tendency: Mode

• Considered the best among the three
• Sum of the values divided by the total frequency
• Can be affected by extreme values (outliers)
• Changing the value of any entry also changes the mean
• Adding / subtracting a value to / from all entries changes the mean by the same value
• Multiplying / dividing all entries by a value multiplies / divides the mean by the same value
• The sum of the differences between each entry and the mean is always zero
• For grouped data, use the sum of the products of each interval's midpoint and its frequency, divided by the total frequency
• fx : AVERAGE

Measures of Central Tendency: Arithmetic Mean
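The properties of the mean listed on this slide can be checked numerically. The sketch below uses a small dataset and a hypothetical grouped table of (midpoint, frequency) pairs; both are illustrative assumptions, not data from the lecture.

```python
from statistics import mean

data = [10, 20, 30, 40, 50]
m = mean(data)

# Adding a constant shifts the mean by that constant.
shifted_mean = mean([x + 5 for x in data])
# Multiplying by a constant scales the mean by that constant.
scaled_mean = mean([x * 2 for x in data])
# Deviations from the mean always sum to zero.
deviation_sum = sum(x - m for x in data)

# Grouped data: hypothetical (midpoint, frequency) pairs.
grouped = [(2, 3), (7, 5), (12, 2)]
grouped_mean = sum(mid * f for mid, f in grouped) / sum(f for _, f in grouped)

print(m, shifted_mean, scaled_mean, deviation_sum, grouped_mean)
```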

• Better for data with extreme values, e.g. 5, 9, 7, 12, 89
• Ungrouped data
  • The value in the middle of the distribution after sorting
  • N is odd: the value at position (N+1) / 2
  • N is even: the average of the two middle values, at positions N/2 and N/2 + 1
  • fx : MEDIAN
• Grouped data
  • See percentile

Measures of Central Tendency: Median
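A short check of why the median suits data with an extreme value, using the slide's own numbers 5, 9, 7, 12, 89. The even-sized list is an added illustration, not from the lecture.

```python
from statistics import mean, median

values = [5, 9, 7, 12, 89]
mean_value = mean(values)      # pulled upward by the extreme value 89
median_value = median(values)  # middle value after sorting: 5, 7, 9, 12, 89

# Even N: the median is the average of the two middle values (7 and 9).
even_values = [5, 7, 9, 12]
even_median = median(even_values)

print(mean_value, median_value, even_median)
```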

• Percentile
• Quartile
• Decile
• Performed on data sorted in ascending order
• Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Describing Individuals in Distributions

“The percentile rank of any particular score x is the percentage of observations equal to or less than x”

• Divide the sorted data set into 100 parts
  • “cent” = 100, thus “per” “cent” = /100
• Percentile rank of entry x_i = 100 * (cumulative frequency_i / N)
  • e.g. 18, 29, 31, 32, 33
  • Percentile rank of 31 = 100 * (3/5) = 60
• Be careful!
  • Percentile rank determines the rank from a data value
  • Excel uses 0.00 – 1.00 for fx: PERCENTRANK

Percentile Rank
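The percentile-rank definition above translates directly into code. This is a minimal sketch of that definition only; the function name `percentile_rank` is hypothetical, and no claim is made that a spreadsheet function uses the same formula.

```python
def percentile_rank(data, x):
    """Percentage of observations less than or equal to x."""
    at_or_below = sum(1 for v in data if v <= x)  # cumulative frequency of x
    return 100 * at_or_below / len(data)

scores = [18, 29, 31, 32, 33]
rank_of_31 = percentile_rank(scores, 31)
print(rank_of_31)
```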

“The kth percentile is the x-value at or below which fall k percent of observations”

• Roughly: position of the data entry at the kth percentile = k(n+1)/100
  • e.g. 18, 29, 31, 32, 33 (data must first be sorted)
  • 80th percentile: position = (80/100)(5+1) = 4.8 ≈ 5th position
• Be careful!
  • Percentile determines the data value from a percentile rank
  • Excel uses 0.00 – 1.00 for fx: PERCENTILE

Percentile
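The position rule k(n+1)/100 can be sketched as below. The helper `quantile_position` and its `parts` parameter are hypothetical names; with `parts` set to 4 or 10 the same rule gives quartile and decile positions. Rounding the position to the nearest entry follows the slide's "4.8 ≈ 5th position" and is an approximation.

```python
def quantile_position(k, n, parts=100):
    """1-based position of the k-th quantile among n sorted entries: k(n+1)/parts."""
    return k * (n + 1) / parts

sorted_data = sorted([18, 29, 31, 32, 33])
n = len(sorted_data)

pos_p80 = quantile_position(80, n)            # 80th percentile
value_p80 = sorted_data[round(pos_p80) - 1]   # nearest entry, as on the slide

pos_q3 = quantile_position(3, n, parts=4)     # 3rd quartile
pos_d5 = quantile_position(5, n, parts=10)    # 5th decile

print(pos_p80, value_p80, pos_q3, pos_d5)
```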

Determine percentile from frequency distribution table

$$ P_r = L + I \left( \frac{\frac{n r}{100} - f_i}{f_r} \right) $$

L : true lower bound of the interval containing P_r
I : width of the interval
r : percentile in question
n : number of data entries
f_i : cumulative frequency of the intervals below the one containing P_r
f_r : frequency of the interval containing P_r

Determining Percentile in Table

First, determine the interval containing the percentile in question by comparing (n x r)/100 against the cumulative frequency

E.g. Percentile 37
(118*37)/100 = 43.66
Interval 17 – 24

Determining Percentile in Table

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of the interval 17 – 24 = 16.5
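The table lookup and interpolation can be sketched as follows. The code assumes the usual grouped-data formula, value = L + I * (n*k/parts - f_i) / f, and assumes integer class limits so that true bounds lie 0.5 beyond the stated limits. The function name `grouped_quantile` is hypothetical.

```python
def grouped_quantile(intervals, k, parts=100):
    """intervals: ascending list of (lower limit, upper limit, frequency)."""
    n = sum(freq for _, _, freq in intervals)
    target = n * k / parts            # e.g. (118 * 37) / 100 = 43.66
    cumulative = 0                    # frequency accumulated below the interval
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= target:
            true_lower = lower - 0.5  # assumption: integer-valued limits
            width = (upper + 0.5) - true_lower
            return true_lower + width * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("k is outside the range covered by the table")

table = [(0, 8, 7), (9, 16, 15), (17, 24, 26), (25, 32, 25),
         (33, 40, 23), (41, 48, 17), (49, 56, 5)]

p37 = grouped_quantile(table, 37)
print(round(p37, 2))
```

Passing `parts=4` or `parts=10` reuses the same routine for quartiles and deciles.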

“The kth quartile is the x-value at or below which fall k quarters of observations”

• Roughly: position of the data entry at the kth quartile = k(n+1)/4
  • e.g. 18, 29, 31, 32, 33 (data must first be sorted)
  • 3rd quartile: position = (3/4)(5+1) = 4.5, i.e. between the 4th and 5th positions
• fx: QUARTILE

Quartile

Determine quartile from frequency distribution table

$$ Q_k = L + I \left( \frac{\frac{n k}{4} - f_i}{f_k} \right) $$

L : true lower bound of the interval containing Q_k
I : width of the interval
k : quartile in question
n : number of data entries
f_i : cumulative frequency of the intervals below the one containing Q_k
f_k : frequency of the interval containing Q_k

Determining Quartile in Table

First, determine the interval containing the quartile in question by comparing (n x k)/4 against the cumulative frequency

E.g. Quartile 2
(118*2)/4 = 59
Interval 25 – 32

Determining Quartile in Table

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of the interval 25 – 32 = 24.5
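As a worked sketch of this example, assuming the usual grouped-data formula Q_k = L + I (nk/4 - f_i) / f_k with true bounds 24.5 and 32.5 for the interval 25 – 32 (so I = 8), f_i = 48 and f_k = 25:

```latex
Q_2 = 24.5 + 8 \times \frac{59 - 48}{25} = 24.5 + 3.52 = 28.02
```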

“The kth decile is the x-value at or below which fall k tenths of observations”

• Roughly: position of the data entry at the kth decile = k(n+1)/10
  • e.g. 18, 29, 31, 32, 33 (data must first be sorted)
  • 5th decile: position = (5/10)(5+1) = 3rd position
• Excel does not have a direct decile function
  • Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead

Decile

Determine decile from frequency distribution table

$$ D_k = L + I \left( \frac{\frac{n k}{10} - f_i}{f_k} \right) $$

L : true lower bound of the interval containing D_k
I : width of the interval
k : decile in question
n : number of data entries
f_i : cumulative frequency of the intervals below the one containing D_k
f_k : frequency of the interval containing D_k

Determining Decile in Table

First, determine the interval containing the decile in question by comparing (n x k)/10 against the cumulative frequency

E.g. Decile 7
(118*7)/10 = 82.6 ≈ 83
Interval 33 – 40

Determining Decile in Table

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of the interval 33 – 40 = 32.5
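As a worked sketch of this example, assuming the usual grouped-data formula D_k = L + I (nk/10 - f_i) / f_k with true bounds 32.5 and 40.5 for the interval 33 – 40 (so I = 8), f_i = 73 and f_k = 23, and using the unrounded value 82.6:

```latex
D_7 = 32.5 + 8 \times \frac{82.6 - 73}{23} \approx 32.5 + 3.34 = 35.84
```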



Median

• Measures of central tendency cannot tell how data are dispersed
• Two different datasets may have the same mean while their values are very different
  • 10, 20, 30, 40, 50 : mean = 30
  • 5, 5, 0, 120, 20 : mean = 30
• Range
• Interquartile Range and Quartile Deviation
• Standard Deviation
• Variance

Dispersion of Distribution

• Range
  • Ungrouped: Max – Min (fx MAX – fx MIN)
  • Grouped: true highest upper bound – true lowest lower bound
• The true upper bound is the average of the interval's upper bound and the (expected) lower bound of the next higher interval
• The true lower bound is the average of the interval's lower bound and the (expected) upper bound of the next lower interval

Range

• Interquartile Range: IR = Q3 – Q1
• More stable than Range as it is less affected by extreme values
• Quartile Deviation: QD = IR / 2
  • AKA semi-interquartile range
• Use together with the median

Interquartile Range
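A small sketch of the interquartile range and quartile deviation, assuming IR = Q3 – Q1 and reusing the sample 18, 29, 31, 32, 33 from the quartile slide. Python's `statistics.quantiles` with its default "exclusive" method places the kth quartile at position k(n+1)/4 and interpolates between neighbouring entries.

```python
from statistics import quantiles

data = sorted([18, 29, 31, 32, 33])

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1   # interquartile range
qd = iqr / 2    # quartile deviation (semi-interquartile range)

print(q1, q2, q3, iqr, qd)
```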

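• Standard Deviation (or SD, S.D., S) is the most popular measure for describing dispersion
• Square root of the mean of the squared differences between each entry and the arithmetic mean (a higher value means the data are more dispersed)

N >= 30:
$$ S = \sqrt{\frac{\sum (x - \bar{x})^2}{N}} \quad \text{OR} \quad S = \sqrt{\frac{N \sum x^2 - (\sum x)^2}{N^2}} $$

N < 30:
$$ S = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} \quad \text{OR} \quad S = \sqrt{\frac{N \sum x^2 - (\sum x)^2}{N(N - 1)}} $$

Standard Deviation & Variance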


Example

• SD >= 0 always
• An SD of 0 means that all data entries have the same value
• Adding / subtracting a value to / from all entries does not affect SD
• Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m
• Variance (S², SD²) is the square of SD
  • Only the positive value of SD is of interest
• fx : STDEV and VARA

Standard Deviation & Variance
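These properties can be verified numerically. The sketch below uses the two datasets from the dispersion slide and the divide-by-N form (`statistics.pstdev`); the constants 7 and -3 are arbitrary illustrative choices. `statistics.stdev` would give the divide-by-(N-1) form instead.

```python
import math
from statistics import pstdev, pvariance

a = [10, 20, 30, 40, 50]   # mean = 30
b = [5, 5, 0, 120, 20]     # mean = 30, but far more dispersed

sd_a = pstdev(a)
sd_b = pstdev(b)

# Adding a constant leaves SD unchanged.
sd_shifted = pstdev([x + 7 for x in a])
# Multiplying by m scales SD by |m|, even when m is negative.
sd_scaled = pstdev([x * -3 for x in a])
# Variance is the square of SD.
var_a = pvariance(a)

print(sd_a, sd_b, sd_shifted, sd_scaled, var_a)
```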

Skewness
• 0 means there is no skewness (symmetric, e.g. normal distribution)
• Positive value means positive/right skewed
• Negative value means negative/left skewed
• fx : SKEW

Shape of Distribution

20 25 25 30 30 45 45 45 55 60

Positively skewed (skewed to the right)

Example
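The sign of the skewness for this dataset can be checked with a short sketch. It uses the plain moment-based definition (third central moment divided by the 1.5 power of the second), which is an assumption here; spreadsheet functions typically apply a small-sample adjustment, so their value differs slightly while the sign stays the same.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
skew = skewness(data)
print(round(skew, 3))  # positive, so the data are right skewed
```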

Kurtosis
• 0 means normal distribution
• Positive value means very peaked (less dispersed)
• Negative value means less peaked (more dispersed)
• fx : KURT

Shape of Distribution



20 25 25 30 30 45 45 45 55 60

Example
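A sketch of the kurtosis for the same dataset, assuming the moment-based excess kurtosis (fourth central moment over the squared second, minus 3, so that a normal distribution scores 0). Spreadsheet functions usually apply a small-sample adjustment, so their number will differ somewhat.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
kurt = excess_kurtosis(data)
print(round(kurt, 3))  # negative, so flatter than a normal distribution
```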

• Study the relationship between two variables
• Does NOT imply cause and effect
• Pearson Product-Moment Correlation Coefficient
  • Interval scale and ratio scale only
• Spearman Rank Correlation Coefficient
  • Two ordinal-scale variables
• Kendall’s Tau Rank Correlation Coefficient
  • Three or more ordinal-scale variables

Correlation


• r = 0 : the two datasets have no relation
• |r| <= 0.5 : the relation between the two datasets is low
• 0.5 < |r| < 0.8 : the relation between the two datasets is moderate
• |r| >= 0.8 : the relation between the two datasets is high
• |r| = 1 : total relation
• Can take values from -1 to 1
  • Value of 1: the two datasets have a perfect positive relation
  • Value of -1: the two datasets have a perfect negative relation
  • Value of 0: the two datasets have no linear relation

Interpretation



Scatter Diagram

Joint Distribution of Data

[Figure: three scatter diagrams (negatively related, not related, positively related), with an imaginary line showing the relation]


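• Pearson Product-Moment Correlation Coefficient
• Denoted as r_xy or r
• fx: PEARSON (do not use in MS Excel earlier than 2003)
• fx: CORREL

$$ r_{xy} = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - (\sum x)^2 \right] \left[ N \sum y^2 - (\sum y)^2 \right]}} $$

Pearson Product-Moment

• Look familiar? Recall from reliability of tool?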

Find the correlation between scores in mathematics exam (x) and science exam (y) of 5 students

Example

x     x²    y     y²    xy
3     9     2     4     6
4     16    4     16    16
2     4     2     4     4
3     9     3     9     9
2     4     1     1     2
∑ 14  42    12    34    37
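The column sums in this table feed the computational form of Pearson's r, r = (NΣxy - ΣxΣy) / sqrt([NΣx² - (Σx)²][NΣy² - (Σy)²]). A minimal sketch follows; the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the sums used in the table above."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)                 # sum of x, sum of y
    sxx = sum(x * x for x in xs)              # sum of x squared
    syy = sum(y * y for y in ys)              # sum of y squared
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of xy
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

math_scores = [3, 4, 2, 3, 2]
science_scores = [2, 4, 2, 3, 1]

r = pearson_r(math_scores, science_scores)
print(round(r, 3))
```

With these scores the numerator is 5(37) - (14)(12) = 17 and the denominator is sqrt(14 × 26), giving r of about 0.89, a high positive relation under the interpretation scale in this lecture.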

• Correlation between the ranks of two ordinal variables
• Data are sorted and ranked
• If two entries have the same value, assign each the average of their ranks
• D = difference in ranks between the two data sets
• N = number of pairs

Spearman Rank

Find the correlation between the ranks of the theoretical exam and the practice exam

Example

Sample   Exam rank   Prac. rank   D (delta)   D²
1        1           1            0           0
2        2           2            0           0
3        4           3            1           1
4        3           4            -1          1
∑                                             2

$$ r_s = 1 - \frac{6 \sum_{i=1}^{N} D_i^2}{N(N^2 - 1)} $$
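The Spearman formula r_s = 1 - 6ΣD² / (N(N² - 1)) applied to this example can be sketched as below. The function name `spearman_rs` is hypothetical, and the inputs are assumed to be ranks already.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman rank correlation from two equal-length lists of ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))  # sum of D squared
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

exam_rank = [1, 2, 4, 3]
practice_rank = [1, 2, 3, 4]

rs = spearman_rs(exam_rank, practice_rank)
print(rs)
```

Here ΣD² = 2 and N = 4, so r_s = 1 - 12/60 = 0.8.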

Team    Win Ratio    Income (M$)


• Correlation between three or more ordinal variables (or sets of ranks)
• Data are first sorted and ranked
• N = number of ranked entries
• D = absolute difference between an entry's sum of ranks and the mean of the rank sums = |r – r̄|
• k = number of variables (or sets of ranks)

Kendall’s Tau Rank

Find the correlation in school ranking by 3 experts

Example

School   Ranking by expert 1   Ranking by expert 2   Ranking by expert 3   Sum (r)   D   D²
1        2                     1                     1                     4         5   25
2        1                     2                     2                     5         4   16
3        4                     5                     4                     13        4   16
4        5                     4                     5                     14        5   25
5        3                     3                     3                     9         0   0

Σr = 45, r̄ = 9

$$ W = \frac{12 \sum_{i=1}^{N} D_i^2}{k^2 N (N^2 - 1)} $$
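The formula W = 12ΣD² / (k²N(N² - 1)) applied to the three experts' rankings can be sketched as follows. The function name `concordance_w` is hypothetical; the statistic of this form is commonly known as Kendall's coefficient of concordance.

```python
def concordance_w(rankings):
    """rankings: one list of ranks per judge, all over the same N entries."""
    k = len(rankings)       # number of judges (sets of ranks)
    n = len(rankings[0])    # number of ranked entries
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    sum_d2 = sum((r - mean_sum) ** 2 for r in rank_sums)  # sum of D squared
    return 12 * sum_d2 / (k ** 2 * n * (n ** 2 - 1))

expert_1 = [2, 1, 4, 5, 3]
expert_2 = [1, 2, 5, 4, 3]
expert_3 = [1, 2, 4, 5, 3]

w = concordance_w([expert_1, expert_2, expert_3])
print(round(w, 4))
```

With ΣD² = 82, k = 3 and N = 5 this gives W = 984 / 1080, about 0.91, indicating strong agreement among the experts.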

• Describes the relation between two interval-scale variables in the form of a regression equation
  • y = bx + a (straight line)
  • y = a + bx + cx² (parabola)
  • y = ab^x (exponential)
• x: independent variable
• y: dependent variable
• a: Y-intercept (where the line crosses the Y axis)
• b: slope

Linear Regression

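• Find b, then a

$$ b = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2}, \qquad a = \bar{y} - b \bar{x} $$

• Then write the equation y = bx + a
• E.g. b = 31.4, a = 4.52
  • y = 31.4x + 4.52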

Simple Linear Regression

Student   Time   Score
1         70     25
2         85     33
3         45     30
4         150    50
5         100    47
6         90     36
7         135    52
8         120    44
9         60     42
10        180    54

Example
The table shows the time (in minutes) each student spends reading for the exam and his/her score

b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} = 31395 / 162525 = 0.1932
a = 41.3 – (0.1932)(103.5) = 21.3038

y = 0.1932x + 21.3038

Meaning
• Each additional minute of reading increases the predicted score by 0.1932 marks
• If you don’t read at all, the predicted score is 21.3038 marks
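The slope and intercept in this example can be reproduced with a short least-squares sketch, using b = (NΣxy - ΣxΣy) / (NΣx² - (Σx)²) and a = ȳ - b x̄. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = bx + a."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a = sy / n - b * sx / n                        # intercept: mean(y) - b * mean(x)
    return a, b

minutes = [70, 85, 45, 150, 100, 90, 135, 120, 60, 180]
scores = [25, 33, 30, 50, 47, 36, 52, 44, 42, 54]

a, b = fit_line(minutes, scores)
predicted_60 = a + b * 60  # predicted score after 60 minutes of reading

print(round(b, 4), round(a, 4), round(predicted_60, 2))
```

Carrying the unrounded slope through gives an intercept of about 21.3069; the value 21.3038 in the example comes from rounding b to 0.1932 before computing a.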

• More than one independent variable
• Equation
  • Y = a + b1x1 + b2x2 + b3x3 + …
• Requirements
  • Normal distribution
  • No multicollinearity (independent variables do not depend on each other)

Multiple Linear Regression

• Selecting independent variables
  • All Entry – when you are not sure which variables have an effect
  • Stepwise – only use variables tested to be significant
    • Forward
    • Backward (enter all variables, then remove the insignificant ones)
• Sample size must be at least 5 times the number of variables

Multiple Linear Regression

[Figure: regression output annotated with the callouts below]
• Simple correlation
• How much of the dependent variable can be explained by the independent variable
• Is the model good (significant)? (Yes, if Sig. < 0.05)
• Coefficients: a, b; a, b1, b2


Summary

Columns:
• Freq. Distr.: F, %
• Measures of Central Tendency: Mean, Median, Mode
• Individual: P, Q, D
• Dispersion: Range, SD, Variance

Nominal    / / /
Ordinal    / / / / /
Interval   / / /
Ratio      / / /
