descriptive statistics


Upload: keith-lawson

Post on 02-Jan-2016

DESCRIPTION

Descriptive Statistics. A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution (Frank & Althoen, “Statistics: Concepts and Applications”, 1994).

TRANSCRIPT

Page 1: Descriptive Statistics

King Mongkut’s University of Technology North Bangkok
Faculty of Information Technology

ON THE TECHNOLOGICAL FRONTIER WITH ANALYTICAL MIND AND PRACTICE

Descriptive Statistics
Statistics

Page 2: Descriptive Statistics

Descriptive Statistic

A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution.
(Frank & Althoen, “Statistics: Concepts and Applications”, 1994)

More broadly: the discipline of quantitatively describing the main features of a collection of data, or the quantitative description itself.

Page 3: Descriptive Statistics

Descriptive Statistics

• Frequency distribution table
• Describe
  • Measures of central tendency – Mode, Median, Mean
  • Dispersion of distribution – Range, SD, Variance
  • Shape of distribution – Skewness, Kurtosis
  • Individuals in distributions – Percentile, Decile, Quartile
  • Joint distributions of data
    • Scatter Diagram
    • Correlation Coefficient
    • Linear Regression

Page 4: Descriptive Statistics

Frequency Distribution

Raw data: 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, …

• Ungrouped
• Grouped

Score      Frequency
0 – 4      0
5 – 9      0
10 – 14    0
15 – 19    0
20 – 24    0
25 – 29    0
30 – 34    1
35 – 39    1
40 – 44    3
45 – 49    6
50 – 54    5
55 – 59    10
60 – 64    10
65 – 69    10
70 – 74    7
75 – 79    4
80 – 84    6
85 – 89    6
90 – 94    3
95 – 99    0
100        0

• Can be visualized using graphs and charts
• Determining number of intervals: k = 1 + 3.3 log N (base-10 logarithm, N = number of data entries)
• Interval width = Range / k
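The interval rule and the grouping step can be sketched in Python as follows. This is a minimal illustration, not the lecturer's procedure: the function names are hypothetical, and the list holds only the scores visible on the slide (the rest of the raw data is elided there).

```python
import math
from collections import Counter

def sturges_intervals(n):
    """Number of class intervals by the slide's rule k = 1 + 3.3 log10(N)."""
    return 1 + 3.3 * math.log10(n)

def grouped_frequencies(values, width):
    """Count integer scores per class interval [lower, lower + width - 1]."""
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width - 1): counts[lo] for lo in sorted(counts)}

# Only the scores shown on the slide; the remaining entries are not available.
scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

k = sturges_intervals(len(scores))          # suggested number of intervals
width = (max(scores) - min(scores)) / k     # interval width = Range / k
table = grouped_frequencies(scores, 5)      # the slide's table uses width 5
```

In practice k and the width are rounded to convenient whole numbers, as the slide's table does with intervals of width 5.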

Page 5: Descriptive Statistics

Frequency Distribution Table

• One-way: one variable – often used with percentage
• Two-way: two variables – shows rough relation between two variables
• Etc.

                  Department
                  IT              DN              MIS
Plan              Male   Female   Male   Female   Male   Female
Thesis            3      5        4      3        2      3
Master Project    29     34       27     22       32     35

Page 6: Descriptive Statistics

Measures of Central Tendency: Mode

Mode
• The value with the highest frequency
• Applicable to nominal scale (and higher scales)
• There can be more than one mode in one set of data
• fx: MODE
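As a small illustration of the points above, Python's standard `statistics.multimode` returns every value that shares the highest frequency and also works on nominal labels. The sample lists are made up for this sketch and do not come from the slides.

```python
from statistics import multimode

# Hypothetical data: two values share the highest frequency.
scores = [2, 3, 3, 5, 5, 7]
# Hypothetical nominal data: department labels.
departments = ["IT", "DN", "IT", "MIS"]

score_modes = multimode(scores)            # every value with the top frequency
department_modes = multimode(departments)  # works on non-numeric labels too
```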

Page 7: Descriptive Statistics

Measures of Central Tendency: Arithmetic Mean

• Considered the best among the three
• Sum of the values divided by the total frequency
• Can be affected by extreme values
• Changing the value of any entry also changes the mean
• Adding / subtracting a value to / from all entries changes the mean by the same value
• Multiplying / dividing all entries by a value multiplies / divides the mean by the same value
• The sum of the differences between each entry and the mean is always zero
• For grouped data, use the sum of the products of each interval's midpoint and frequency, divided by the total frequency
• fx: AVERAGE
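The listed properties of the mean can be checked numerically. The sketch below uses an illustrative data set and a hypothetical helper `grouped_mean`; the midpoints and frequencies passed to it in practice would come from a frequency distribution table.

```python
from statistics import mean

data = [10, 20, 30, 40, 50]       # illustrative values
m = mean(data)

shifted = mean(v + 5 for v in data)         # adding 5 to every entry
scaled = mean(v * 2 for v in data)          # multiplying every entry by 2
deviation_sum = sum(v - m for v in data)    # deviations from the mean

def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(midpoint * frequency) / total frequency."""
    total = sum(frequencies)
    return sum(mid * f for mid, f in zip(midpoints, frequencies)) / total
```

Here `shifted` equals `m + 5`, `scaled` equals `2 * m`, and `deviation_sum` is zero, matching the properties stated on the slide.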

Page 8: Descriptive Statistics

Measures of Central Tendency: Median

• Better for data with extreme values, e.g. 5, 9, 7, 12, 89
• Ungrouped data
  • The value in the middle of the distribution after sorting
  • N is odd: the value at position (N+1) / 2
  • N is even: the average of the two middle values, at positions N/2 and N/2 + 1
  • fx: MEDIAN
• Grouped data
  • See percentile
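The position rule for ungrouped data can be written out directly. This is a sketch with a hypothetical function name; it uses the slide's example 5, 9, 7, 12, 89 to show why the median resists the extreme value 89 while the mean does not.

```python
from statistics import mean

def median_by_position(values):
    """Median via the slide's rule: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]               # position (N+1)/2, 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions N/2 and N/2 + 1

data = [5, 9, 7, 12, 89]
med = median_by_position(data)   # middle of 5, 7, 9, 12, 89
avg = mean(data)                 # pulled upward by 89
```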

Page 9: Descriptive Statistics

Describing Individuals in Distributions

• Percentile
• Quartile
• Decile
• Performed on data sorted in ascending order
• Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Page 10: Descriptive Statistics

Percentile Rank

“The percentile rank of any particular score x is the percentage of observations equal to or less than x”

• Divide the sorted data set into 100 parts
• “cent” = 100, thus “per” “cent” = /100
• Percentile rank of entry x_i = 100 × (cumulative frequency_i / N)
• e.g. 18, 29, 31, 32, 33: percentile rank of 31 = 100 × (3/5) = 60
• Be careful!
  • Percentile rank determines the rank from a data value
  • Excel uses 0.00 – 1.00 for fx: PERCENTRANK
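The definition quoted above translates into a one-line computation. The function name is hypothetical, and the sketch follows the slide's definition (share of observations equal to or less than x), which is not necessarily what a spreadsheet's built-in function returns.

```python
def percentile_rank(data, x):
    """Percentage of observations equal to or less than x (the slide's definition)."""
    at_or_below = sum(1 for v in data if v <= x)
    return 100 * at_or_below / len(data)

data = [18, 29, 31, 32, 33]
rank_of_31 = percentile_rank(data, 31)   # 3 of the 5 observations are <= 31
```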

Page 11: Descriptive Statistics

Percentile

“The kth percentile is the x-value at or below which fall k percent of observations”

• Roughly: position of the data entry at the kth percentile = k(n+1)/100
• e.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 80th percentile = (80/100)(5+1) = 4.8 ≈ 5th position
• Be careful!
  • Percentile determines the data value from a rank (the reverse of percentile rank)
  • Excel uses 0.00 – 1.00 for fx: PERCENTILE
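The rough position rule can be sketched as below. The helper names are hypothetical, and the way a fractional position is resolved is an assumption read off the slides' examples: a position exactly halfway between two entries is averaged, anything else goes to the nearest position. The same rule covers quartiles and deciles by dividing into 4 or 10 parts instead of 100.

```python
import math

def position(k, n, parts=100):
    """1-based position of the k-th percentile (parts=100), quartile (4) or decile (10)."""
    return k * (n + 1) / parts

def value_at(sorted_data, pos):
    """Value at a possibly fractional 1-based position (assumed rounding rule)."""
    n = len(sorted_data)
    lower = math.floor(pos)
    frac = pos - lower

    def clamp(i):
        return max(1, min(n, i))   # keep positions inside the data

    if math.isclose(frac, 0.5):
        # exactly halfway: average the two neighbouring entries
        return (sorted_data[clamp(lower) - 1] + sorted_data[clamp(lower + 1) - 1]) / 2
    nearest = lower + 1 if frac > 0.5 else lower
    return sorted_data[clamp(nearest) - 1]

data = sorted([18, 29, 31, 32, 33])
p80_pos = position(80, len(data))      # 4.8, taken as the 5th position
p80 = value_at(data, p80_pos)
```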

Page 12: Descriptive Statistics

Determining Percentile in Table

Determine percentile from frequency distribution table

• L : true lower bound of the interval containing P_r
• I : width of interval
• r : percentile in question
• n : number of data entries
• f_i : accumulated frequency of the intervals below the one containing P_r
• f_r : frequency of the interval containing P_r

Page 13: Descriptive Statistics

Determining Percentile in Table

First, determine the interval containing the percentile in question by comparing (n × r)/100 against the accumulated frequency.

E.g. Percentile 37: (118 × 37)/100 = 43.66, which falls in interval 17 – 24

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of interval 17 – 24: 16.5
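The slide's own formula is not reproduced in the text, so the sketch below assumes the standard interpolation consistent with the symbols L, I, n, f_i and f_r: P_r = L + I × ((n × r)/100 − f_i) / f_r. The function name, the tuple layout of the table, and the use of upper − lower + 1 as the interval width for integer class limits are all assumptions of this example. Passing 4 or 10 as `parts` gives quartiles or deciles from the same table.

```python
def grouped_quantile(table, r, parts=100):
    """Interpolated r-th percentile (parts=100), quartile (4) or decile (10).

    table: list of (lower_limit, upper_limit, frequency) in ascending order.
    """
    n = sum(freq for _, _, freq in table)
    target = n * r / parts                 # (n x r) / 100 on the slide
    below = 0                              # f_i: accumulated frequency below
    for i, (lower, upper, freq) in enumerate(table):
        if freq > 0 and below + freq >= target:
            # true lower bound: halfway between this lower limit
            # and the upper limit of the interval below it
            prev_upper = table[i - 1][1] if i > 0 else lower - 1
            true_lower = (lower + prev_upper) / 2
            width = upper - lower + 1      # I, assuming integer class limits
            return true_lower + width * (target - below) / freq
        below += freq
    raise ValueError("r is outside the range covered by the table")

table = [(0, 8, 7), (9, 16, 15), (17, 24, 26), (25, 32, 25),
         (33, 40, 23), (41, 48, 17), (49, 56, 5)]
p37 = grouped_quantile(table, 37)          # target 43.66 falls in 17 - 24
```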

Page 14: Descriptive Statistics

Quartile

The kth quartile is the x-value at or below which fall k quarters of observations

• Roughly: position of the data entry at the kth quartile = k(n+1)/4
• e.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 3rd quartile = (3/4)(5+1) = 4.5, i.e. between the 4th and 5th positions
• fx: QUARTILE

Page 15: Descriptive Statistics

Determining Quartile in Table

Determine quartile from frequency distribution table

• L : true lower bound of the interval containing Q_k
• I : width of interval
• k : quartile in question
• n : number of data entries
• f_i : accumulated frequency of the intervals below the one containing Q_k
• f_k : frequency of the interval containing Q_k

Page 16: Descriptive Statistics

Determining Quartile in Table

First, determine the interval containing the quartile in question by comparing (n × k)/4 against the accumulated frequency.

E.g. Quartile 2: (118 × 2)/4 = 59, which falls in interval 25 – 32

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of interval 25 – 32: 24.5

Page 17: Descriptive Statistics

Decile

The kth decile is the x-value at or below which fall k tenths of observations

• Roughly: position of the data entry at the kth decile = k(n+1)/10
• e.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 5th decile = (5/10)(5+1) = 3rd position
• Excel does not have a direct decile function
  • Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead

Page 18: Descriptive Statistics

Determining Decile in Table

Determine decile from frequency distribution table

• L : true lower bound of the interval containing D_k
• I : width of interval
• k : decile in question
• n : number of data entries
• f_i : accumulated frequency of the intervals below the one containing D_k
• f_k : frequency of the interval containing D_k

Page 19: Descriptive Statistics

Determining Decile in Table

First, determine the interval containing the decile in question by comparing (n × k)/10 against the accumulated frequency.

E.g. Decile 7: (118 × 7)/10 = 82.6, which falls in interval 33 – 40

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

True lower bound of interval 33 – 40: 32.5

Page 20: Descriptive Statistics

Median

Page 21: Descriptive Statistics

Dispersion of Distribution

Measures of central tendency cannot tell how data are dispersed.

Two different datasets may have the same mean while the values are very different:
• 10, 20, 30, 40, 50 : mean = 30
• 5, 5, 0, 120, 20 : mean = 30

Measures of dispersion:
• Range
• Interquartile Range and Quartile Deviation
• Standard Deviation
• Variance
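The two example data sets can be compared numerically. The sketch below uses Python's standard `statistics` module; the choice of the population standard deviation (`pstdev`) is an assumption made only to keep the comparison simple.

```python
from statistics import mean, pstdev

a = [10, 20, 30, 40, 50]
b = [5, 5, 0, 120, 20]

mean_a, mean_b = mean(a), mean(b)                 # both are 30
range_a, range_b = max(a) - min(a), max(b) - min(b)
sd_a, sd_b = pstdev(a), pstdev(b)                 # b is far more dispersed
```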

Page 22: Descriptive Statistics

Range

• Ungrouped: Max – Min (fx: MAX – fx: MIN)
• Grouped: true highest upper bound – true lowest lower bound
  • True upper bound is the average of the upper bound of the interval and the (expected) lower bound of the next higher interval
  • True lower bound is the average of the lower bound of the interval and the (expected) upper bound of the next lower interval

Page 23: Descriptive Statistics

Interquartile Range

• Interquartile Range: IR = Q3 – Q1
• More stable than Range as it is less affected by extreme values
• Quartile Deviation: QD = IR / 2
  • AKA semi-interquartile range
  • Use together with the median

Page 24: Descriptive Statistics

Standard Deviation & Variance

• Standard Deviation (or SD, S.D., S) is the most popular measure for describing dispersion
• Square root of the mean of the squared differences between each entry and the arithmetic mean (a higher value means the data are more dispersed)
• Two equivalent forms of the formula are given, each with one variant for N >= 30 and one for N < 30

Page 25: Descriptive Statistics

Example

Page 26: Descriptive Statistics

Standard Deviation & Variance

• SD >= 0 always
• SD of 0 means that all data entries have the same value
• Adding / subtracting a value to / from all entries does not affect SD
• Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m
• Variance (S², SD²) is equal to the square of SD
• Only the positive value of SD is of interest
• fx: STDEV and VARA
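These properties can be verified with Python's standard `statistics` module. The data are illustrative. The sketch shows both the divisor-N form (`pstdev`) and the divisor-(N − 1) form (`stdev`); mapping these onto the slides' N >= 30 and N < 30 variants is an assumption, since the formulas themselves are not reproduced in the text.

```python
import math
from statistics import pstdev, stdev, variance

data = [10, 20, 30, 40, 50]        # illustrative values
m = -3                             # multiplier used to test scaling

sd_population = pstdev(data)       # divides by N
sd_sample = stdev(data)            # divides by N - 1

sd_shifted = stdev([v + 100 for v in data])   # adding a constant
sd_scaled = stdev([v * m for v in data])      # multiplying by m
var_sample = variance(data)                   # sample variance
```

`sd_shifted` matches `sd_sample`, `sd_scaled` is `abs(m)` times `sd_sample`, and `var_sample` is the square of `sd_sample`.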

Page 27: Descriptive Statistics

Shape of Distribution

Skewness
• 0 means there is no skewness (symmetric, as in a normal distribution)
• Positive value means positive/right skewed
• Negative value means negative/left skewed
• fx: SKEW

Page 28: Descriptive Statistics

Page 29: Descriptive Statistics

Example

20 25 25 30 30 45 45 45 55 60

Positively skewed (skewed to the right)
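The sign of the skewness for this data set can be checked with a short calculation. The sketch assumes the adjusted sample skewness, n / ((n − 1)(n − 2)) × Σ((x − mean) / s)³, which is the form spreadsheet SKEW functions are generally documented to use; the function name is hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)                      # sample SD, divisor n - 1
    cubed = sum(((v - m) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
skew = sample_skewness(data)               # positive: skewed to the right
```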

Page 30: Descriptive Statistics

Shape of Distribution

Kurtosis
• 0 means the same peakedness as a normal distribution
• Positive value means very peaked (less dispersed)
• Negative value means less peaked (more dispersed)
• fx: KURT

Page 31: Descriptive Statistics

Example

20 25 25 30 30 45 45 45 55 60
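For the same data set, the kurtosis can be sketched as below. The formula is an assumption: the adjusted sample excess kurtosis that spreadsheet KURT functions are generally documented to use, for which a normal distribution gives 0. The function name is hypothetical.

```python
from statistics import mean, stdev

def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)                      # sample SD, divisor n - 1
    fourth = sum(((v - m) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
kurt = sample_excess_kurtosis(data)        # negative: flatter than normal
```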

Page 32: Descriptive Statistics

Correlation

• Study the relationship between two variables
• Does NOT imply cause and effect
• Pearson Product-Moment Correlation Coefficient
  • Interval scale and ratio scale only
• Spearman Rank Correlation Coefficient
  • Two ordinal-scale variables
• Kendall’s Tau Rank Correlation Coefficient
  • Three or more ordinal-scale variables

Page 33: Descriptive Statistics

Interpretation

• r can take values from -1 to 1
  • Value of 1: the two data sets have a perfect positive relation
  • Value of -1: the two data sets have a perfect negative relation
  • Value of 0: the two data sets have no linear relation
• r = 0 : the two datasets have no relation
• 0 < |r| <= 0.5 : the relation between the two datasets is low
• 0.5 < |r| < 0.8 : the relation between the two datasets is moderate
• 0.8 <= |r| < 1 : the relation between the two datasets is high
• |r| = 1 : total relation

Page 34: Descriptive Statistics

Joint Distribution of Data

Scatter Diagram

[Figure: three scatter diagrams labelled Negatively related, Not related, and Positively related; an imaginary line shows the relation in the negatively and positively related panels]

Page 35: Descriptive Statistics

Pearson Product-Moment

• Pearson Product-Moment Correlation Coefficient
• Denoted as r_xy or r
• fx: PEARSON (do not use in MS Excel earlier than 2003)
• fx: CORREL

Look familiar? Recall it from the reliability of tools?

Page 36: Descriptive Statistics

Example

Find the correlation between the scores in a mathematics exam (x) and a science exam (y) of 5 students

     x    x²   y    y²   xy
     3    9    2    4    6
     4    16   4    16   16
     2    4    2    4    4
     3    9    3    9    9
     2    4    1    1    2
∑    14   42   12   34   37
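The column sums in the table are the inputs to the usual computational form of Pearson's r, (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The slide's own formula is not reproduced in the text, so this form is an assumption, chosen because it uses exactly the sums tabulated above; the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from raw sums: (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

math_scores = [3, 4, 2, 3, 2]      # x column of the table
science_scores = [2, 4, 2, 3, 1]   # y column of the table
r = pearson_r(math_scores, science_scores)   # 17 / sqrt(364), about 0.89
```

By the interpretation bands given earlier, a value of about 0.89 counts as a high relation.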

Page 37: Descriptive Statistics

Spearman Rank

• Correlation between the ranks of two ordinal variables
• Data are sorted and ranked
• If two entries have the same value, assign each the average of their ranks
• D = difference between the ranks in the two data sets
• N = number of pairs

Page 38: Descriptive Statistics

Example

Find the correlation between the ranks of a theoretical exam and a practice exam

Sample   Exam rank   Prac. rank   D (delta)   D²
1        1           1            0           0
2        2           2            0           0
3        4           3            1           1
4        3           4            -1          1
∑                                             2

$$ r_s = 1 - \frac{6 \sum_{i=1}^{N} D_i^2}{N(N^2 - 1)} $$
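The example can be worked through in a few lines using the formula r_s = 1 − 6ΣD² / (N(N² − 1)). The function name is hypothetical, and the sketch takes ranks as input (it does not rank raw scores or handle ties).

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman rank correlation from two lists of ranks (no tie handling)."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

exam_rank = [1, 2, 4, 3]       # theoretical exam ranks from the table
practice_rank = [1, 2, 3, 4]   # practice exam ranks from the table
rs = spearman_rs(exam_rank, practice_rank)   # 1 - 6*2 / (4*15)
```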

Page 39: Descriptive Statistics

Team Win Ratio Income (M$)

Page 40: Descriptive Statistics

Kendall’s Tau Rank

• Correlation between three or more ordinal variables (or sets of ranks)
• Data are first sorted and ranked
• N = number of entries being ranked
• D = absolute difference between an entry's sum of ranks and the mean of the rank sums = |r – r̄|
• k = number of variables (or sets of ranks)

Page 41: Descriptive Statistics

Example

Find the correlation in school ranking by 3 experts

School   Ranking by expert 1   Ranking by expert 2   Ranking by expert 3   Sum (r)   D   D²
1        2                     1                     1                     4         5   25
2        1                     2                     2                     5         4   16
3        4                     5                     4                     13        4   16
4        5                     4                     5                     14        5   25
5        3                     3                     3                     9         0   0

Σr = 45, r̄ = 9

$$ W = \frac{12 \sum_{i=1}^{N} D_i^2}{k^2 N (N^2 - 1)} $$
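The school-ranking example can be computed with the formula W = 12ΣD² / (k² N (N² − 1)), where k is the number of experts and N the number of schools. The function name is hypothetical and the sketch does not handle tied ranks.

```python
def concordance_w(rankings):
    """W = 12 * sum(D^2) / (k^2 * N * (N^2 - 1)) for k rankings of N entries."""
    k = len(rankings)                      # number of rankers (experts)
    n = len(rankings[0])                   # number of entries ranked (schools)
    rank_sums = [sum(column) for column in zip(*rankings)]
    mean_sum = sum(rank_sums) / n
    d_squared = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * d_squared / (k ** 2 * n * (n ** 2 - 1))

experts = [
    [2, 1, 4, 5, 3],   # expert 1's ranks for schools 1..5
    [1, 2, 5, 4, 3],   # expert 2
    [1, 2, 4, 5, 3],   # expert 3
]
w = concordance_w(experts)   # rank sums 4, 5, 13, 14, 9; sum of D^2 = 82
```

The result, about 0.91, indicates that the three experts rank the schools in close agreement.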

Page 42: Descriptive Statistics

Linear Regression

Describe the relation between two interval-scale variables in the form of a regression equation

• y = bx + a (straight line)
• y = a + bx + cx² (parabola)
• y = ab^x (exponential)
• x: independent variable
• y: dependent variable
• a: Y-intercept (where the line crosses the Y axis)
• b: slope

Page 43: Descriptive Statistics

Page 44: Descriptive Statistics

Page 45: Descriptive Statistics

Simple Linear Regression

• Find b, then a
• Then write the equation y = bx + a
• E.g. b = 31.4, a = 4.52: y = 31.4x + 4.52

Page 46: Descriptive Statistics

Example

The table shows the time (in minutes) each student spends reading for the exam and his/her score

Student   Time   Score
1         70     25
2         85     33
3         45     30
4         150    50
5         100    47
6         90     36
7         135    52
8         120    44
9         60     42
10        180    54

b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} = 31395 / 162525 = 0.1932
a = 41.3 – (0.1932)(103.5) = 21.3038
y = 0.1932x + 21.3038

Meaning
• Each additional minute of reading increases the score by 0.1932 marks
• If you don’t read at all, you should get 21.3038 marks
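The arithmetic in this example follows the least-squares pattern b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and a = ȳ − b x̄, which can be reproduced as below. The function names are hypothetical. Because the sketch keeps b unrounded, its intercept (about 21.307) differs slightly from the slide's 21.3038, which was computed from b rounded to 0.1932.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = bx + a."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n        # a = mean(y) - b * mean(x)
    return b, a

def predict(b, a, x):
    """Predicted score for x minutes of reading."""
    return b * x + a

time_minutes = [70, 85, 45, 150, 100, 90, 135, 120, 60, 180]
score = [25, 33, 30, 50, 47, 36, 52, 44, 42, 54]

b, a = fit_line(time_minutes, score)
score_at_100 = predict(b, a, 100)  # predicted score after 100 minutes
```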

Page 47: Descriptive Statistics

Multiple Linear Regression

• More than one independent variable
• Equation: Y = a + b1x1 + b2x2 + b3x3 + …
• Requirements
  • Normal distribution
  • No multicollinearity (independent variables do not depend on each other)

Page 48: Descriptive Statistics

Multiple Linear Regression

• Selecting independent variables
  • All Entry – when you are not sure which variables have an effect
  • Stepwise – only use variables tested to be significant
    • Forward
    • Backward (enter all variables, then remove the insignificant ones)
• Sample size must be at least 5 times the number of variables

Page 49: Descriptive Statistics

[Figure: annotated regression output. Callouts: simple correlation; how much of the dependent variable can be explained by the independent variable; is the model good (significant)? (yes, if Sig. < 0.05); coefficients a and b, and a, b1 and b2]

Page 50: Descriptive Statistics

Summary

Columns: Freq. Distr. (F, %) | Measures of Central Tendency (Mean, Median, Mode) | Individual (P, Q, D) | Dispersion (Range, SD, Variance)

Nominal:  / / /
Ordinal:  / / / / /
Interval: / / /
Ratio:    / / /