descriptive statistics
King Mongkut’s University of Technology North BangkokFaculty of Information Technology
ON THE TECHNOLOGICAL FRONTIER WITH ANALYTICAL MIND AND PRACTICE
Descriptive Statistics

Descriptive Statistic
A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and Applications”, 1994)
The term also covers the discipline of quantitatively describing the main features of a collection of data, and the quantitative description itself.

Descriptive Statistics
• Frequency distribution table
• Describe
  • Measures of central tendency – Mode, Median, Mean
  • Dispersion of distribution – Range, SD, Variance
  • Shape of distribution – Skewness, Kurtosis
  • Individuals in distributions – Percentile, Decile, Quartile
• Joint distributions of data
  • Scatter Diagram
  • Correlation Coefficient
  • Linear Regression

Frequency Distribution
Ungrouped (raw data): 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, …
Grouped:

Score      Frequency
0 – 4      0
5 – 9      0
10 – 14    0
15 – 19    0
20 – 24    0
25 – 29    0
30 – 34    1
35 – 39    1
40 – 44    3
45 – 49    6
50 – 54    5
55 – 59    10
60 – 64    10
65 – 69    10
70 – 74    7
75 – 79    4
80 – 84    6
85 – 89    6
90 – 94    3
95 – 99    0
100        0

• Can be visualized using graphs and charts
• Determining the number of intervals: k = 1 + 3.3 log N
• Interval width = Range / k
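The interval rule above can be sketched in a few lines of Python. This is only an illustration: it assumes the logarithm is base 10 and that k is rounded up to a whole number, neither of which is stated on the slide, and the function names and the sample size of 100 are made up for the example.

```python
import math

def number_of_intervals(n: int) -> float:
    """k = 1 + 3.3 * log10(N); base 10 is an assumption."""
    return 1 + 3.3 * math.log10(n)

def interval_width(data_range: float, k: int) -> float:
    """Interval width = Range / k."""
    return data_range / k

# Hypothetical case: 100 scores spanning a range of 100 marks.
k_raw = number_of_intervals(100)       # 1 + 3.3 * 2 = 7.6
k = math.ceil(k_raw)                   # rounding up is an assumption
width = interval_width(100, k)
print(k_raw, k, width)
```

Rounding k up rather than down keeps every interval slightly narrower, which is a design choice, not something the slide prescribes.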

Frequency Distribution Table
• One-way: one variable, often used with percentages
• Two-way: two variables, shows a rough relation between the two variables
• Etc.

Plan             Department
                 IT               DN               MIS
                 Male   Female    Male   Female    Male   Female
Thesis           3      5         4      3         2      3
Master Project   29     34        27     22        32     35

Measures of Central Tendency: Mode
• The value with the highest frequency
• Applicable to the nominal scale (and higher scales)
• One data set can have more than one mode
• fx: MODE

Measures of Central Tendency: Arithmetic Mean
• Considered the best of the three measures (mode, median, mean)
• Sum of all values divided by the total frequency (number of entries)
• Can be strongly affected by extreme values
• Changing the value of any entry also changes the mean
• Adding or subtracting a value to/from every entry changes the mean by the same value
• Multiplying or dividing every entry by a value multiplies or divides the mean by the same value
• The sum of the differences between each entry and the mean is always zero
• For grouped data, sum the products of each interval's midpoint and its frequency, then divide by the total frequency
• fx: AVERAGE
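The properties of the mean listed above can be checked numerically. The sketch below uses the first five raw scores from the frequency distribution slide; the `grouped_mean` helper and its small midpoint/frequency input are hypothetical and only illustrate the grouped-data rule.

```python
from statistics import mean

data = [42, 45, 82, 32, 91]          # first five raw scores from the earlier slide
m = mean(data)

# Sum of the differences between each entry and the mean (zero up to rounding).
deviation_sum = sum(x - m for x in data)

# Adding 10 to every entry shifts the mean by 10; doubling every entry doubles it.
shifted = mean([x + 10 for x in data])
scaled = mean([x * 2 for x in data])

def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(midpoint * frequency) / total frequency."""
    return sum(mp * f for mp, f in zip(midpoints, freqs)) / sum(freqs)

# Hypothetical grouped data: intervals 0-4, 5-9, 10-14 with frequencies 1, 2, 1.
print(m, deviation_sum, shifted, scaled, grouped_mean([2, 7, 12], [1, 2, 1]))
```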

Measures of Central Tendency: Median
• Better than the mean for data with extreme values, e.g. 5, 9, 7, 12, 89
• Ungrouped data: the value in the middle of the distribution after sorting
  • N is odd: the value at position (N+1)/2
  • N is even: the average of the two middle values, at positions N/2 and N/2 + 1
  • fx: MEDIAN
• Grouped data: see percentile
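A minimal sketch of the ungrouped median rule, applied to the slide's data 5, 9, 7, 12, 89. The function name is illustrative, and the four-value list is a made-up even-length case.

```python
def median(values):
    """Middle value after sorting; average of the two middle values when N is even."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # position (N+1)/2, 1-based
    return (s[n // 2 - 1] + s[n // 2]) / 2    # positions N/2 and N/2 + 1

data = [5, 9, 7, 12, 89]
mean_value = sum(data) / len(data)
print(median(data), mean_value)               # the single large value pulls only the mean
print(median([5, 9, 7, 12]))                  # hypothetical even-length case
```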

Describing Individuals in Distributions
• Percentile, Quartile, Decile
• Performed on data sorted in ascending order
• Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Percentile Rank
• “The percentile rank of any particular score x is the percentage of observations equal to or less than x”
• Divide the sorted data set into 100 parts (“cent” = 100, thus “per cent” = /100)
• Percentile rank of entry x_i = 100 × (cumulative frequency_i / N)
• E.g. 18, 29, 31, 32, 33: percentile rank of 31 = 100 × (3/5) = 60
• Be careful! Percentile rank determines the rank from a data value
• Excel uses 0.00 – 1.00 for fx: PERCENTRANK
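The percentile rank definition quoted above translates directly into code. This sketch follows the slide's formula (percentage of observations equal to or less than x); Excel's PERCENTRANK uses a different convention, so its results need not match. The function name is illustrative.

```python
def percentile_rank(data, x):
    """100 * (number of observations <= x) / N, as defined on the slide."""
    at_or_below = sum(1 for v in data if v <= x)
    return 100 * at_or_below / len(data)

scores = [18, 29, 31, 32, 33]
print(percentile_rank(scores, 31))
```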

Percentile
• “The kth percentile is the x-value at or below which fall k percent of observations”
• Roughly, position of the data entry at the kth percentile = k(n+1)/100
• E.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 80th percentile = (80/100)(5+1) = 4.8, i.e. about the 5th position
• Be careful! Percentile determines the data value from a percentile rank, the reverse of percentile rank
• Excel uses 0.00 – 1.00 for fx: PERCENTILE

Determining Percentile in Table
Determine a percentile from a frequency distribution table:
$P_r = L + I \left( \frac{\frac{n r}{100} - f_i}{f_r} \right)$
• L: true lower bound of the interval containing P_r
• I: width of the interval
• r: percentile in question
• n: number of data entries
• f_i: accumulated frequency of the intervals below the one containing P_r
• f_r: frequency of the interval containing P_r

Determining Percentile in Table
First, determine the interval containing the percentile in question by comparing (n × r)/100 against the accumulated frequency.
E.g. Percentile 37: (118 × 37)/100 = 43.66, which falls in the interval 17 – 24 (accumulated frequency 48). The true lower bound of that interval is 16.5.

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118
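The lookup in the table can be carried through to a value with a short sketch. It assumes the usual interpolation formula P_r = L + I × ((n × r / 100) − f_i) / f_r, which matches the symbols defined for this method but is not spelled out in the transcript. It also assumes integer scores, so true bounds lie 0.5 outside the stated bounds. The function name is illustrative.

```python
def grouped_percentile(intervals, r):
    """Percentile r (0-100) from (lower, upper, frequency) rows.

    Assumes integer data, so the true bounds are lower - 0.5 and upper + 0.5.
    """
    n = sum(freq for _, _, freq in intervals)
    target = n * r / 100                      # (n x r) / 100
    below = 0                                 # accumulated frequency of lower intervals
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= target:
            true_lower = lower - 0.5
            width = (upper + 0.5) - true_lower
            return true_lower + width * (target - below) / freq
        below += freq
    raise ValueError("r must be between 0 and 100")

table = [(0, 8, 7), (9, 16, 15), (17, 24, 26), (25, 32, 25),
         (33, 40, 23), (41, 48, 17), (49, 56, 5)]
p37 = grouped_percentile(table, 37)           # interval 17-24, L = 16.5
print(round(p37, 2))
```

Passing r = 50 or r = 70 to the same function gives the second quartile and the seventh decile, since both are percentiles under another name.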

Quartile
• The kth quartile is the x-value at or below which fall k quarters of the observations
• Roughly, position of the data entry at the kth quartile = k(n+1)/4
• E.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 3rd quartile = (3/4)(5+1) = 4.5, i.e. between the 4th and 5th positions
• fx: QUARTILE

Determining Quartile in Table
Determine a quartile from a frequency distribution table:
$Q_k = L + I \left( \frac{\frac{n k}{4} - f_i}{f_k} \right)$
• L: true lower bound of the interval containing Q_k
• I: width of the interval
• k: quartile in question
• n: number of data entries
• f_i: accumulated frequency of the intervals below the one containing Q_k
• f_k: frequency of the interval containing Q_k

Determining Quartile in Table
First, determine the interval containing the quartile in question by comparing (n × k)/4 against the accumulated frequency.
E.g. Quartile 2: (118 × 2)/4 = 59, which falls in the interval 25 – 32 (accumulated frequency 73). The true lower bound of that interval is 24.5.

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118

Decile
• The kth decile is the x-value at or below which fall k tenths of the observations
• Roughly, position of the data entry at the kth decile = k(n+1)/10
• E.g. 18, 29, 31, 32, 33 (data must first be sorted): position of the 5th decile = (5/10)(5+1) = 3, i.e. the 3rd position
• Excel does not have a direct decile function; use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead
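The position rule shared by percentiles, quartiles, and deciles, k(n+1)/parts, can be sketched as one helper. Reading a value at a fractional position by linear interpolation is an assumption added here; the slides only round or name the neighbouring positions. Both function names are illustrative.

```python
def position(k, n, parts):
    """Rough 1-based position of the kth percentile (parts=100),
    quartile (parts=4), or decile (parts=10)."""
    return k * (n + 1) / parts

def value_at(sorted_data, pos):
    """Value at a 1-based, possibly fractional position (linear interpolation,
    which is an assumption not taken from the slides)."""
    n = len(sorted_data)
    pos = min(max(pos, 1), n)                 # clamp to the valid range
    lower = int(pos)
    upper = min(lower + 1, n)
    frac = pos - lower
    return sorted_data[lower - 1] + frac * (sorted_data[upper - 1] - sorted_data[lower - 1])

data = sorted([18, 29, 31, 32, 33])
p80 = position(80, len(data), 100)            # 4.8
q3 = position(3, len(data), 4)                # 4.5
d5 = position(5, len(data), 10)               # 3.0
print(p80, q3, d5, value_at(data, q3), value_at(data, d5))
```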

Determining Decile in Table
Determine a decile from a frequency distribution table:
$D_k = L + I \left( \frac{\frac{n k}{10} - f_i}{f_k} \right)$
• L: true lower bound of the interval containing D_k
• I: width of the interval
• k: decile in question
• n: number of data entries
• f_i: accumulated frequency of the intervals below the one containing D_k
• f_k: frequency of the interval containing D_k

Determining Decile in Table
First, determine the interval containing the decile in question by comparing (n × k)/10 against the accumulated frequency.
E.g. Decile 7: (118 × 7)/10 = 82.6 (about 83), which falls in the interval 33 – 40 (accumulated frequency 96). The true lower bound of that interval is 32.5.

Interval   Freq.   Accum. Freq.
0 – 8      7       7
9 – 16     15      22
17 – 24    26      48
25 – 32    25      73
33 – 40    23      96
41 – 48    17      113
49 – 56    5       118
Median

Dispersion of Distribution
• Measures of central tendency cannot tell how the data are dispersed.
• Two different data sets may have the same mean while their values are very different:
  10, 20, 30, 40, 50: mean = 30
  5, 5, 0, 120, 20: mean = 30
• Measures of dispersion: Range; Interquartile Range and Quartile Deviation; Standard Deviation; Variance
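The two data sets above make a quick numerical illustration: their means agree, while the range and standard deviation differ widely. The sketch uses Python's `statistics` module; choosing the population standard deviation (`pstdev`) here is an assumption made only for the comparison.

```python
from statistics import mean, pstdev

def summarize(data):
    """Mean, range, and population standard deviation of a data set."""
    return {"mean": mean(data),
            "range": max(data) - min(data),
            "sd": pstdev(data)}

summary_a = summarize([10, 20, 30, 40, 50])
summary_b = summarize([5, 5, 0, 120, 20])
print(summary_a)
print(summary_b)
```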

Range
• Ungrouped: Max – Min (fx: MAX – fx: MIN)
• Grouped: true upper bound of the highest interval – true lower bound of the lowest interval
• The true upper bound is the average of the interval's upper bound and the (expected) lower bound of the next higher interval
• The true lower bound is the average of the interval's lower bound and the (expected) upper bound of the next lower interval

Interquartile Range
• Interquartile range: IR = Q3 – Q1
• More stable than the range, as it is less affected by extreme values
• Quartile deviation: QD = IR / 2, also known as the semi-interquartile range
• Use together with the median
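
Standard Deviation & Variance
• Standard deviation (SD, S.D., or S) is the most popular measure for describing dispersion
• It is the square root of the mean of the squared differences between each entry and the arithmetic mean (a higher value means the data are more dispersed)
N >= 30: $S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$
N < 30: $S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$
OR
N >= 30: $S = \sqrt{\frac{N \sum x_i^2 - (\sum x_i)^2}{N^2}}$
N < 30: $S = \sqrt{\frac{N \sum x_i^2 - (\sum x_i)^2}{N (N - 1)}}$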

Standard Deviation & Variance
• SD >= 0 always; only the positive root is of interest
• An SD of 0 means that all data entries have the same value
• Adding or subtracting a value to/from all entries does not affect SD
• Multiplying or dividing all entries by a value m multiplies or divides SD by the absolute value of m
• Variance (S², SD²) is the square of SD
• fx: STDEV and VARA
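The standard deviation properties listed above can be verified with a small sketch. It uses `statistics.stdev` and `statistics.variance`, which divide by N − 1 like Excel's STDEV; the data set and the constants 100 and −2 are arbitrary choices for the example.

```python
from statistics import stdev, variance

data = [10, 20, 30, 40, 50]
sd = stdev(data)                                  # sample SD (divides by N - 1)

sd_shifted = stdev([x + 100 for x in data])       # adding a constant
sd_scaled = stdev([x * -2 for x in data])         # multiplying by m = -2
sd_constant = stdev([7, 7, 7, 7])                 # all entries identical

print(sd, sd_shifted, sd_scaled, sd_constant, variance(data))
```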

Shape of Distribution: Skewness
• 0 means there is no skewness (symmetric, as in a normal distribution)
• A positive value means positively (right) skewed
• A negative value means negatively (left) skewed
• fx: SKEW

Example
Data: 20, 25, 25, 30, 30, 45, 45, 45, 55, 60
The distribution is positively skewed (skewed to the right).
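The sign of the skewness for this data set can be checked with a short sketch. It assumes the bias-adjusted sample formula n / ((n−1)(n−2)) × Σ((x − mean)/s)³, which is the one Excel's SKEW is documented to use; the slide itself does not show a formula, and the function name is illustrative.

```python
from math import sqrt

def sample_skewness(data):
    """Bias-adjusted sample skewness (assumed to match Excel's SKEW)."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))   # sample SD
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

scores = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
skew = sample_skewness(scores)
print(round(skew, 4))   # a positive value indicates right skew
```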

Shape of Distribution: Kurtosis
• 0 means the peakedness of a normal distribution
• A positive value means more peaked (less dispersed)
• A negative value means less peaked (more dispersed)
• fx: KURT

Example
Data: 20, 25, 25, 30, 30, 45, 45, 45, 55, 60
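A sketch of the kurtosis for the same ten values follows. It assumes the sample excess-kurtosis formula documented for Excel's KURT, for which 0 corresponds to a normal distribution; the slide does not show the formula, and the function name is illustrative.

```python
from math import sqrt

def sample_excess_kurtosis(data):
    """Sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))   # sample SD
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

scores = [20, 25, 25, 30, 30, 45, 45, 45, 55, 60]
kurt = sample_excess_kurtosis(scores)
print(round(kurt, 3))   # a negative value indicates a flatter, less peaked shape
```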

Correlation
• Studies the relationship between two variables
• Does NOT imply cause and effect
• Pearson Product-Moment Correlation Coefficient: interval scale and ratio scale only
• Spearman Rank Correlation Coefficient: two ordinal-scale variables
• Kendall’s Tau Rank Correlation Coefficient (the formula used in this deck is Kendall’s coefficient of concordance, W): three or more ordinal-scale variables

Interpretation
• r can take values from -1 to 1
  • Value of 1: the two data sets have a perfect positive relation
  • Value of -1: the two data sets have a perfect negative relation
  • Value of 0: the two data sets have no linear relation
• Strength of the relation by |r|
  • |r| <= 0.5: the relation between the two data sets is low
  • 0.5 < |r| < 0.8: the relation is moderate
  • |r| >= 0.8: the relation is high
  • |r| = 1: perfect relation

Joint Distribution of Data: Scatter Diagram
[Figure: three scatter diagrams labeled "Negatively related", "Not related", and "Positively related"; an imaginary line through the points shows the relation in the two related cases.]
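
Pearson Product-Moment
• Pearson Product-Moment Correlation Coefficient, denoted r_xy or r
• $r_{xy} = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - (\sum x)^2 \right] \left[ N \sum y^2 - (\sum y)^2 \right]}}$
• fx: PEARSON (do not use in MS Excel earlier than 2003) or fx: CORREL
• Look familiar? Recall the same formula from the reliability of a tool.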

Example
Find the correlation between scores in a mathematics exam (x) and a science exam (y) of 5 students.

     x    x²   y    y²   xy
     3    9    2    4    6
     4    16   4    16   16
     2    4    2    4    4
     3    9    3    9    9
     2    4    1    1    2
∑    14   42   12   34   37
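The column sums in the table are exactly what the computational form of Pearson's r needs. The sketch below assumes that form, r = (NΣxy − ΣxΣy) / √((NΣx² − (Σx)²)(NΣy² − (Σy)²)); the transcript does not preserve the slide's own formula, and the function name is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from the sums of x, y, x^2, y^2, and xy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

math_scores = [3, 4, 2, 3, 2]      # x column of the table
science_scores = [2, 4, 2, 3, 1]   # y column of the table
r = pearson_r(math_scores, science_scores)
print(round(r, 3))
```

With these sums the numerator is 5 × 37 − 14 × 12 = 17 and the denominator is √(14 × 26), so r is about 0.89, a high positive relation on the interpretation scale given earlier.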

Spearman Rank
• Correlation between the ranks of two ordinal variables
• Data are sorted and ranked
• If two entries have the same value, assign each the average of their ranks
• D = difference (delta) between the ranks in the two data sets
• N = number of pairs

Example
Find the correlation between the ranks of a theoretical exam and a practice exam.

Sample   Exam rank   Prac. rank   D (delta)   D²
1        1           1            0           0
2        2           2            0           0
3        4           3            1           1
4        3           4            -1          1
∑                                             2

$r_s = 1 - \frac{6 \sum_{i=1}^{N} D_i^2}{N (N^2 - 1)}$
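Applying the Spearman formula to the four ranked samples takes only a few lines. This is a minimal sketch using the ranks from the table; the function name is illustrative, and it expects ranks, not raw scores.

```python
def spearman_rs(ranks_a, ranks_b):
    """r_s = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), where D is the rank difference."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

exam_rank = [1, 2, 4, 3]
practice_rank = [1, 2, 3, 4]
rs = spearman_rs(exam_rank, practice_rank)
print(rs)   # 1 - 6*2 / (4*15)
```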

Example table columns: Team, Win Ratio, Income (M$)

Kendall’s Tau Rank
• Correlation between three or more ordinal variables (or sets of ranks); the formula used is Kendall’s coefficient of concordance, W
• Data are first sorted and ranked
• N = number of entries being ranked
• D = absolute value of the difference between an entry's sum of ranks and the mean of the rank sums = $|r - \bar{r}|$
• k = number of variables (or sets of ranks)

Example
Find the correlation in school ranking by 3 experts.

School   Expert 1   Expert 2   Expert 3   Sum (r)   D   D²
1        2          1          1          4         5   25
2        1          2          2          5         4   16
3        4          5          4          13        4   16
4        5          4          5          14        5   25
5        3          3          3          9         0   0
Σr = 45, $\bar{r}$ = 9

$W = \frac{12 \sum_{i=1}^{N} D_i^2}{k^2 N (N^2 - 1)}$
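The school-ranking example can be completed with a short sketch of the W formula. The input layout (one list of ranks per expert) and the function name are choices made for this illustration.

```python
def kendalls_w(rankings):
    """W = 12 * sum(D^2) / (k^2 * N * (N^2 - 1)).

    rankings: one list of ranks per expert, all over the same N entries.
    """
    k = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(expert[i] for expert in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    sum_d2 = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * sum_d2 / (k ** 2 * n * (n ** 2 - 1))

experts = [
    [2, 1, 4, 5, 3],   # expert 1's ranks for schools 1 to 5
    [1, 2, 5, 4, 3],   # expert 2
    [1, 2, 4, 5, 3],   # expert 3
]
w = kendalls_w(experts)
print(round(w, 4))   # 12 * 82 / (9 * 5 * 24)
```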

Linear Regression
• Describes the relation between two interval-scale variables in the form of a regression equation
  • y = bx + a (straight line)
  • y = a + bx + cx² (parabola)
  • y = ab^x (exponential)
• x: independent variable
• y: dependent variable
• a: Y-intercept (where the line crosses the Y axis)
• b: slope
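
Simple Linear Regression
• Find b, then a:
  $b = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2}$, $a = \bar{y} - b \bar{x}$
• Then write the equation y = bx + a
• E.g. b = 31.4, a = 4.52 gives y = 31.4x + 4.52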

Example
The table shows the time (in minutes) each student spends reading for the exam and his/her score.

Student   Time   Score
1         70     25
2         85     33
3         45     30
4         150    50
5         100    47
6         90     36
7         135    52
8         120    44
9         60     42
10        180    54

b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} = 31395 / 162525 = 0.1932
a = 41.3 – (0.1932)(103.5) = 21.3038
y = 0.1932x + 21.3038
Meaning
• Each additional minute of reading increases the predicted score by 0.1932 marks
• A student who does not read at all is predicted to score 21.3038 marks
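The slope and intercept in this example can be reproduced with a short least-squares sketch built on the same sums. The function name is illustrative. Because the code does not round b before computing a, its intercept (about 21.307) differs from the slide's 21.3038 in the third decimal place.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = bx + a using the sums of x, y, xy, and x^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n          # mean(y) - b * mean(x)
    return a, b

time_spent = [70, 85, 45, 150, 100, 90, 135, 120, 60, 180]
score = [25, 33, 30, 50, 47, 36, 52, 44, 42, 54]
a, b = fit_line(time_spent, score)
print(round(b, 4), round(a, 4))

# Predicted score for a hypothetical student who reads for 100 minutes.
print(round(b * 100 + a, 2))
```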

Multiple Linear Regression
• More than one independent variable
• Equation: Y = a + b1x1 + b2x2 + b3x3 + …
• Requirements
  • Normal distribution
  • No multicollinearity (independent variables do not depend on each other)

Multiple Linear Regression
• Selecting independent variables
  • All Entry: when you are not sure which variables have an effect
  • Stepwise: only use variables tested to be significant
    • Forward
    • Backward (enter all variables, then remove the insignificant ones)
• Sample size must be at least 5 times the number of variables

[Figure: regression output with callouts "Simple correlation", "How much of the dependent variable can be explained by the independent variable", and "Is the model good (significant)? (yes, Sig. < 0.05)"; the coefficients a, b and a, b1, b2 are marked.]

Summary
Column groups: Freq. Distr. (F, %); Measures of Central Tendency (Mean, Median, Mode); Individual (P, Q, D); Dispersion (Range, SD, Variance)
Nominal: / / /
Ordinal: / / / / /
Interval: / / /
Ratio: / / /