Descriptive Statistics

Post on 02-Jan-2016


DESCRIPTION

Descriptive Statistics. A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and applications”, 1994)

TRANSCRIPT

King Mongkut’s University of Technology North Bangkok, Faculty of Information Technology


ON THE TECHNOLOGICAL FRONTIER WITH ANALYTICAL MIND AND PRACTICE

Descriptive Statistics

Descriptive Statistic

A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and applications”, 1994)

The term also covers the discipline of quantitatively describing the main features of a collection of data, or the quantitative description itself.

Descriptive Statistics

Frequency distribution table
Describe:
  Measures of central tendency – Mode, Median, Mean
  Dispersion of distribution – Range, SD, Variance
  Shape of distribution – Skewness, Kurtosis
  Individuals in distributions – Percentile, Decile, Quartile
Joint distributions of data:
  Scatter Diagram
  Correlation Coefficient
  Linear Regression

Frequency Distribution

Ungrouped (raw data): 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, …

Grouped:

Score      Frequency
0 – 4      0
5 – 9      0
10 – 14    0
15 – 19    0
20 – 24    0
25 – 29    0
30 – 34    1
35 – 39    1
40 – 44    3
45 – 49    6
50 – 54    5
55 – 59    10
60 – 64    10
65 – 69    10
70 – 74    7
75 – 79    4
80 – 84    6
85 – 89    6
90 – 94    3
95 – 99    0
100        0

• Can be visualized using graphs and charts
• Determining the number of intervals: k = 1 + 3.3 log N
• Interval width = Range / k
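The two rules of thumb for grouping can be sketched in a few lines of Python. This is a minimal illustration that assumes the logarithm is base 10 (as is usual for this rule); the function names are hypothetical, and the sample list contains only the raw scores actually listed on the slide.

```python
import math

def number_of_intervals(n):
    """k = 1 + 3.3 log N, assuming a base-10 logarithm."""
    return 1 + 3.3 * math.log10(n)

def interval_width(values, k):
    """Interval width = Range / k, where Range = max - min."""
    return (max(values) - min(values)) / k

# Only the first few raw scores are listed on the slide; the rest are elided.
scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]
k = round(number_of_intervals(len(scores)))  # rounding to a whole number is an assumption
print(k, interval_width(scores, k))
```

In practice the width is usually rounded to a convenient value (the table above uses a width of 5).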

Frequency Distribution Table

One-way: one variable – often used with percentages
Two-way: two variables – shows a rough relation between the two variables
Etc.

                  Department
                  IT              DN              MIS
Plan              Male   Female   Male   Female   Male   Female
Thesis            3      5        4      3        2      3
Master Project    29     34       27     22       32     35

Measures of Central Tendency: Mode

Mode
The value with the highest frequency
Applicable to nominal scale (and higher scales)
There can be more than one mode for one set of data
fx: MODE
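As a small sketch of the points above, Python's standard library can return every mode, including for nominal data. The wrapper name `modes` and the sample values are illustrative assumptions.

```python
from statistics import multimode

def modes(values):
    """Return all values that share the highest frequency."""
    return multimode(values)

print(modes([55, 58, 55, 62, 60, 62]))    # two values tie for the highest frequency
print(modes(["IT", "DN", "IT", "MIS"]))   # works on nominal (non-numeric) data
```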

Measures of Central Tendency: Arithmetic Mean

Considered the best among the three
Sum of the values divided by the total frequency
Can be affected by (very) extreme values
Changing the value of any entry also changes the mean
Adding / subtracting a value to / from all entries changes the mean by the same value
Multiplying / dividing all entries by a value multiplies / divides the mean by the same value
The sum of the differences between each entry and the mean is always zero
For grouped data, use the sum of the products of each interval's midpoint and its frequency, divided by the total frequency
fx: AVERAGE
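The properties listed for the arithmetic mean can be checked numerically. The following is a minimal sketch; `grouped_mean` and the sample data are hypothetical names and values chosen for illustration.

```python
from statistics import mean

def grouped_mean(intervals):
    """Mean of grouped data from (lower, upper, frequency) triples,
    using each interval's midpoint as its representative value."""
    total = sum(freq for _, _, freq in intervals)
    return sum((lo + hi) / 2 * freq for lo, hi, freq in intervals) / total

data = [10, 20, 30, 40, 50]
m = mean(data)
shifted = mean([x + 5 for x in data])   # adding 5 to every entry moves the mean by 5
scaled = mean([x * 2 for x in data])    # doubling every entry doubles the mean
deviations = sum(x - m for x in data)   # differences from the mean sum to zero
print(m, shifted, scaled, deviations)
print(grouped_mean([(0, 4, 1), (5, 9, 3)]))
```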

Measures of Central Tendency: Median

Better for data with very extreme values, e.g. 5, 9, 7, 12, 89

Ungrouped data
  The value in the middle of the distribution after sorting
  N is odd: the value at position (N+1) / 2
  N is even: the average of the values at positions N/2 and N/2 + 1 (the two middle values)
  fx: MEDIAN

Grouped data
  See percentile
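The odd/even rule for ungrouped data can be sketched as follows. The function name is illustrative; the first data set is the one quoted on the slide.

```python
def median(values):
    """Middle value of the sorted data; for even N, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (N+1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2  # 1-based positions N/2 and N/2 + 1

data = [5, 9, 7, 12, 89]
print(median(data), sum(data) / len(data))  # the median is far less affected by 89 than the mean
```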

Describing Individuals in Distributions

Percentile
Quartile
Decile
Performed on data sorted in ascending order
Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Percentile Rank

“The percentile rank of any particular score x is the percentage of observations equal to or less than x”

Divide the sorted data set into 100 parts
“cent” = 100, thus “per” “cent” = /100

Percentile rank of entry x_i = 100 × (cumulative frequency_i / N)
e.g. 18, 29, 31, 32, 33
Percentile rank of 31 = 100 × (3/5) = 60

Be careful!
  Percentile rank determines the rank from a data value
  Excel uses 0.00 – 1.00 for fx: PERCENTRANK
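The percentile-rank formula can be sketched directly from its definition (percentage of observations equal to or less than x). The function name is an illustrative assumption.

```python
def percentile_rank(values, x):
    """100 * (number of observations <= x) / N."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

print(percentile_rank([18, 29, 31, 32, 33], 31))  # the slide's example
```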

Percentile

“The kth percentile is the x-value at or below which fall k percent of observations”

Roughly: position of the data entry at the kth percentile = k(n+1)/100

e.g. 18, 29, 31, 32, 33 (data must first be sorted)
80th percentile: (80/100) × (5+1) = 4.8 ≈ 5th position

Be careful!
  Percentile determines the data value from a percentile rank
  Excel uses 0.00 – 1.00 for fx: PERCENTILE
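The rough position rule k(n+1)/100 can be sketched as below. Rounding the position to the nearest entry and clamping it to the data range are assumptions made for this illustration; spreadsheet functions typically interpolate instead and may return different values.

```python
def percentile_position(k, n):
    """Rough 1-based position of the kth percentile: k(n+1)/100."""
    return k * (n + 1) / 100

def percentile_value(values, k):
    """Value at the rounded position (rounding and clamping are assumptions)."""
    ordered = sorted(values)
    pos = round(percentile_position(k, len(ordered)))
    pos = min(max(pos, 1), len(ordered))
    return ordered[pos - 1]

data = [18, 29, 31, 32, 33]
print(percentile_position(80, len(data)), percentile_value(data, 80))
```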

Determining Percentile in Table

Determine a percentile from a frequency distribution table:

$$P_r = L + I \left( \frac{\frac{n r}{100} - f_i}{f_r} \right)$$

L : true lower bound of the interval containing P_r
I : width of interval
r : percentile in question
n : number of data entries
f_i : accumulated frequency of the intervals below the one containing P_r
f_r : frequency of the interval containing P_r

Determining Percentile in Table

First, determine the interval containing the percentile in question by comparing (n × r)/100 against the accumulated frequency

E.g. Percentile 37
(118 × 37)/100 = 43.66
Interval 17 – 24 (true lower bound 16.5)

Interval    Freq.   Accum. Freq.
0 – 8       7       7
9 – 16      15      22
17 – 24     26      48
25 – 32     25      73
33 – 40     23      96
41 – 48     17      113
49 – 56     5       118
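The table lookup can be sketched in Python using the standard grouped-data interpolation, value = L + I × (target − f_i) / f, where target is n·r/100 for a percentile, n·k/4 for a quartile, or n·k/10 for a decile. The function name is hypothetical, and the code assumes integer class limits so that the true lower bound is the stated lower limit minus 0.5.

```python
def grouped_quantile(table, share):
    """table: (lower, upper, frequency) triples in ascending order.
    share: r/100 for a percentile, k/4 for a quartile, k/10 for a decile."""
    n = sum(freq for _, _, freq in table)
    target = n * share
    below = 0  # accumulated frequency of the intervals below the current one
    for lower, upper, freq in table:
        if freq > 0 and below + freq >= target:
            true_lower = lower - 0.5   # assumes integer class limits
            width = upper - lower + 1
            return true_lower + width * (target - below) / freq
        below += freq
    raise ValueError("share must be between 0 and 1")

table = [(0, 8, 7), (9, 16, 15), (17, 24, 26), (25, 32, 25),
         (33, 40, 23), (41, 48, 17), (49, 56, 5)]
print(grouped_quantile(table, 37 / 100))  # percentile 37, falls in interval 17 – 24
print(grouped_quantile(table, 2 / 4))     # quartile 2, falls in interval 25 – 32
print(grouped_quantile(table, 7 / 10))    # decile 7, falls in interval 33 – 40
```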

Quartile

The kth quartile is the x-value at or below which fall k quarters of observations

Roughly: position of the data entry at the kth quartile = k(n+1)/4

e.g. 18, 29, 31, 32, 33 (data must first be sorted)
3rd quartile: (3/4) × (5+1) = 4.5, i.e. between the 4th and 5th positions
fx: QUARTILE

Determining Quartile in Table

Determine a quartile from a frequency distribution table:

$$Q_k = L + I \left( \frac{\frac{n k}{4} - f_i}{f_k} \right)$$

L : true lower bound of the interval containing Q_k
I : width of interval
k : quartile in question
n : number of data entries
f_i : accumulated frequency of the intervals below the one containing Q_k
f_k : frequency of the interval containing Q_k

Determining Quartile in Table

First, determine the interval containing the quartile in question by comparing (n × k)/4 against the accumulated frequency

E.g. Quartile 2
(118 × 2)/4 = 59
Interval 25 – 32 (true lower bound 24.5)

Interval    Freq.   Accum. Freq.
0 – 8       7       7
9 – 16      15      22
17 – 24     26      48
25 – 32     25      73
33 – 40     23      96
41 – 48     17      113
49 – 56     5       118

Decile

The kth decile is the x-value at or below which fall k tenths of observations

Roughly: position of the data entry at the kth decile = k(n+1)/10

e.g. 18, 29, 31, 32, 33 (data must first be sorted)
5th decile: (5/10) × (5+1) = 3rd position

Excel does not have a direct decile function
  Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead

Determining Decile in Table

Determine a decile from a frequency distribution table:

$$D_k = L + I \left( \frac{\frac{n k}{10} - f_i}{f_k} \right)$$

L : true lower bound of the interval containing D_k
I : width of interval
k : decile in question
n : number of data entries
f_i : accumulated frequency of the intervals below the one containing D_k
f_k : frequency of the interval containing D_k

Determining Decile in Table

First, determine the interval containing the decile in question by comparing (n × k)/10 against the accumulated frequency

E.g. Decile 7
(118 × 7)/10 = 82.6
Interval 33 – 40 (true lower bound 32.5)

Interval    Freq.   Accum. Freq.
0 – 8       7       7
9 – 16      15      22
17 – 24     26      48
25 – 32     25      73
33 – 40     23      96
41 – 48     17      113
49 – 56     5       118


Median

Dispersion of Distribution

Measures of central tendency cannot tell how data are dispersed.

Two different datasets may have the same mean while their values are very different
  10, 20, 30, 40, 50 : mean = 30
  5, 5, 0, 120, 20 : mean = 30

Range
Interquartile Range and Quartile Deviation
Standard Deviation
Variance

Range

Ungrouped: Max – Min (fx: MAX – fx: MIN)
Grouped: true highest upper bound – true lowest lower bound
True upper bound is the average of the upper bound of the interval and the (expected) lower bound of the next higher interval
True lower bound is the average of the lower bound of the interval and the (expected) upper bound of the next lower interval

Interquartile Range

Interquartile Range: IR = Q3 – Q1
More stable than Range as it is less affected by extreme values

Quartile Deviation: QD = IR / 2
AKA semi-interquartile range
Use together with the median

Standard Deviation & Variance

Standard Deviation (or SD, S.D., S) is the most popular measure for describing dispersion

It is the square root of the mean of the squared differences between each entry and the arithmetic mean (a higher value means the data are more dispersed)

N >= 30: $$S = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$
N < 30: $$S = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$

OR

N >= 30: $$S = \sqrt{\frac{N \sum x^2 - (\sum x)^2}{N^2}}$$
N < 30: $$S = \sqrt{\frac{N \sum x^2 - (\sum x)^2}{N (N - 1)}}$$


Example

Standard Deviation & Variance

SD is always >= 0
SD of 0 means that all data entries have the same value
Adding / subtracting a value to / from all entries does not affect SD
Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m

Variance (S², SD²) is equal to the square of SD
Only interested in the positive value of SD
fx: STDEV and VARA
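These properties can be verified with Python's standard library. The sketch below uses the population forms (`pstdev`, `pvariance`, dividing by N); `stdev` and `variance` are the N − 1 counterparts. The two data sets are the ones with equal means from the dispersion slide.

```python
from statistics import pstdev, pvariance, stdev

a = [10, 20, 30, 40, 50]
b = [5, 5, 0, 120, 20]   # same mean as a, but far more dispersed

print(pstdev(a), pstdev(b))              # b has the larger SD
print(pstdev([7, 7, 7]))                 # identical entries give SD = 0
print(pstdev([x + 5 for x in a]))        # shifting all entries leaves SD unchanged
print(pstdev([x * -2 for x in a]))       # scaling by m multiplies SD by |m|
print(pvariance(a), pstdev(a) ** 2)      # variance is the square of SD
print(stdev(a))                          # N - 1 form, slightly larger than pstdev(a)
```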

Shape of Distribution

Skewness
0 means there is no skewness (symmetric, as in a normal distribution)
Positive value means positive / right skewed
Negative value means negative / left skewed
fx: SKEW

Example

20 25 25 30 30 45 45 45 55 60

Positively skewed (skewed to the right)
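A minimal sketch of a skewness calculation for this example follows. It uses the adjusted sample skewness formula, n / ((n−1)(n−2)) × Σ((x − mean)/s)³, which is assumed here to correspond to the spreadsheet SKEW function; the function name is illustrative.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
print(sample_skewness(data))  # positive, so the data are right skewed
```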

Shape of Distribution

Kurtosis
0 means normal distribution
Positive value means very peaked (less dispersed)
Negative value means less peaked (more dispersed)
fx: KURT

Example

20 25 25 30 30 45 45 45 55 60
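A sketch of the kurtosis calculation for the same data is shown below. It implements the bias-adjusted sample excess kurtosis, which is assumed here to match the spreadsheet KURT function; the function name is illustrative.

```python
from statistics import mean, stdev

def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
print(sample_excess_kurtosis(data))  # negative, so flatter than a normal distribution
```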

Correlation

Studies the relationship between two variables
Does NOT imply cause and effect
Pearson Product-Moment Correlation Coefficient
  Interval scale and ratio scale only
Spearman Rank Correlation Coefficient
  Two ordinal-scale variables
Kendall’s Tau Rank Correlation Coefficient
  Three or more ordinal-scale variables

Interpretation

r = 0 : the two datasets have no linear relation
|r| <= 0.5 : the relation between the two datasets is low
0.5 < |r| < 0.8 : the relation between the two datasets is moderate
|r| >= 0.8 : the relation between the two datasets is high
|r| = 1 : perfect relation

Can take a value from -1 to 1
  Value of 1: the two datasets have a perfect positive relation
  Value of -1: the two datasets have a perfect negative relation
  Value of 0: the two datasets have no linear relation

Joint Distribution of Data

Scatter Diagram

[Figure: three scatter diagrams labelled "Negatively related", "Not related", and "Positively related"; the related cases each show an imaginary line indicating the relation]

Pearson Product-Moment

Pearson Product-Moment Correlation Coefficient
Denoted as r_xy or r
fx: PEARSON (do not use in MS Excel earlier than 2003)
fx: CORREL

Look familiar? Recall from reliability of tool?

Example

Find the correlation between the scores in a mathematics exam (x) and a science exam (y) of 5 students

     x     x²    y     y²    xy
     3     9     2     4     6
     4     16    4     16    16
     2     4     2     4     4
     3     9     3     9     9
     2     4     1     1     2
∑    14    42    12    34    37
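The column sums in the table feed the raw-score form of Pearson's r, r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The sketch below applies that standard formula to the five pairs of scores; the function and variable names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Raw-score form of the Pearson product-moment correlation coefficient."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

math_scores = [3, 4, 2, 3, 2]      # x column
science_scores = [2, 4, 2, 3, 1]   # y column
r = pearson_r(math_scores, science_scores)
print(round(r, 3))  # (5*37 - 14*12) / sqrt((5*42 - 14**2) * (5*34 - 12**2))
```

Under the interpretation scale given earlier, a value of |r| >= 0.8 indicates a high relation.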

Spearman Rank

Correlation between the ranks of two ordinal variables
Data are sorted and ranked
If two entries have the same value, assign each the average of their ranks

D = difference (delta) between the ranks of an entry in the two data sets
N = number of pairs

Example

Find the correlation between the ranks of a theoretical exam and a practice exam

Sample   Exam rank   Prac. rank   D (delta)   D²
1        1           1            0           0
2        2           2            0           0
3        4           3            1           1
4        3           4            -1          1
∑                                             2

$$r_s = 1 - \frac{6 \sum_{i=1}^{N} D_i^2}{N (N^2 - 1)}$$
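Applying r_s = 1 − 6ΣD² / (N(N² − 1)) to the four samples can be sketched as follows. The function name is illustrative, and the code assumes the inputs are already ranks.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman rank correlation from two lists of ranks (one pair per sample)."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

exam_rank = [1, 2, 4, 3]
practice_rank = [1, 2, 3, 4]
print(spearman_rs(exam_rank, practice_rank))  # 1 - 6*2 / (4*15)
```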


Team Win Ratio Income (M$)

Kendall’s Tau Rank

Correlation between three or more ordinal variables (or sets of ranks)

Data are first sorted and ranked

N = number of ranked entries
D = absolute difference between an entry's sum of ranks and the mean of the rank sums = |r – r̄|
k = number of variables (or sets of ranks)

Example

Find the correlation in school ranking by 3 experts

School   Ranking by expert 1   Ranking by expert 2   Ranking by expert 3   Sum (r)   D   D²
1        2                     1                     1                     4         5   25
2        1                     2                     2                     5         4   16
3        4                     5                     4                     13        4   16
4        5                     4                     5                     14        5   25
5        3                     3                     3                     9         0   0

Σr = 45, r̄ = 9

$$W = \frac{12 \sum_{i=1}^{N} D_i^2}{k^2 N (N^2 - 1)}$$
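The statistic W = 12ΣD² / (k²N(N² − 1)) used here is commonly known as Kendall's coefficient of concordance. A minimal sketch for the three experts' rankings follows; the function name is illustrative and the code assumes rankings without ties.

```python
def kendalls_w(rankings):
    """rankings: one list of ranks per judge, all covering the same N items."""
    k = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(column) for column in zip(*rankings)]  # Sum (r) per item
    mean_sum = sum(rank_sums) / n
    d_squared = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * d_squared / (k ** 2 * n * (n ** 2 - 1))

expert_1 = [2, 1, 4, 5, 3]
expert_2 = [1, 2, 5, 4, 3]
expert_3 = [1, 2, 4, 5, 3]
w = kendalls_w([expert_1, expert_2, expert_3])
print(round(w, 4))  # 12*82 / (9*5*24)
```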

Linear Regression

Describes the relation between two interval-scale variables in the form of a regression equation

y = bx + a (straight line)
y = a + bx + cx² (parabola)
y = ab^x (exponential)

x: independent variable
y: dependent variable
a: Y-intercept (where the line crosses the Y axis)
b: slope

Simple Linear Regression

Find b, then a
Then write the equation y = bx + a
E.g. b = 31.4, a = 4.52: y = 31.4x + 4.52

Example

The table shows the time (in minutes) each student spends reading for the exam and his/her score

Student   Time   Score
1         70     25
2         85     33
3         45     30
4         150    50
5         100    47
6         90     36
7         135    52
8         120    44
9         60     42
10        180    54

b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} = 31395 / 162525 = 0.1932
a = 41.3 – (0.1932)(103.5) = 21.3038
y = 0.1932x + 21.3038

Meaning
  Each additional minute of reading increases the score by 0.1932 marks
  A student who does not read at all should get 21.3038 marks
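The calculation can be reproduced with a short least-squares sketch: b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and a = ȳ − b·x̄. The function name is illustrative. Because the code does not round b before computing a, its intercept differs slightly in the third decimal place from the slide's 21.3038.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = bx + a."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

time_spent = [70, 85, 45, 150, 100, 90, 135, 120, 60, 180]
score = [25, 33, 30, 50, 47, 36, 52, 44, 42, 54]
a, b = fit_line(time_spent, score)
print(round(b, 4), round(a, 4))
print(a + b * 100)  # predicted score after 100 minutes of reading
```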

Multiple Linear Regression

More than one independent variable
Equation
  Y = a + b1x1 + b2x2 + b3x3 + …
Requirements
  Normal distribution
  No multicollinearity (independent variables do not depend on each other)

Multiple Linear Regression

Selecting independent variables
  All Entry – when you are not sure which variables have an effect
  Stepwise – only use variables tested to be significant
    Forward
    Backward (enter all variables, then remove the insignificant ones)

Sample size must be at least 5 times the number of variables

[Figure: regression output with annotations]

Simple correlation
How much of the dependent variable can be explained by the independent variable
Is the model good (significant)? (yes, Sig. < 0.05)
Coefficient labels: a, b; a, b1, b2

Summary

Columns:
  Freq. Distr.: F, %
  Measures of Central Tendency: Mean, Median, Mode
  Individual: P, Q, D
  Dispersion: Range, SD, Variance

Nominal: / / /
Ordinal: / / / / /
Interval: / / /
Ratio: / / /
