Post on 22-Dec-2015
Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 1
Statistical Methods in Computer Science
Data 1: Frequency Distributions
Ido Dagan

Concrete Theory: Relates Variables to Each Other
Examples:
Mathematically accurate
  Memory = 2*sizeof(input) + 3
  Runtime = 500 + 30*sizeof(input) + 20
Asymptotically correct
  Memory = O(sizeof(input)) in worst case
  Runtime = O(log(sizeof(input))) in best case
  Accuracy is proportional to run-time
Qualitative
  User performance is increased with reduced cognitive load
  Number of bugs discovered is monotonically decreasing, but positive, if the same programmer is used; otherwise, it increases

Behavior Parameters/Variables (typical of Computer Science)
Hardware parameters
  CPU model and organization, cache organization, latencies in the system
System parameters
  Memory availability, usage
  CPU running time (sometimes approximated by wall-clock time)
  Communication bandwidth, usage
Program characteristics
  requires floating-point, heavy disk usage, integer math, graphics
  large heap, large stack, uses non-local information, ...

Additional Behavior Variables
Algorithm parameters:
  Algorithm choice, correctness/accuracy of results
  Performance curves (accuracy vs. run-time)
  Size of input
  Worst case, best case, average case (!!)
Other
  Development person-hours
  User (programmer) satisfaction, productivity
  Lines of code, number of components, ...
  Robotics: Speed of movement, accuracy of positioning
  Learning: precision and recall

Scales of Measurements
Nominal (also called categorical): No order, just labels
  e.g., “Algorithm Name”
Ordinal (also called rank): Order, but not numerical
  Difference between ranks is not necessarily the same
  e.g., ranks in (hierarchical/military) organization
Interval: Difference between values has same meaning everywhere
  e.g., temperature in Celsius (rise of 10 degrees is the same everywhere)
  But 100C is not twice as hot as 50C, and 0C is not lack of heat
Ratio: Interval + Fixed zero point
  e.g., temperature in Kelvin, robot position, memory usage, run-time

Scale Hierarchy
Nominal < Ordinal < Interval < Ratio
Interval and Ratio together are called “Numerical”
Propositions that are true for some level are true above it
  But not necessarily the other way around
  e.g., we can calculate the mean (average) value for numerical variables
    But not for nominal and ordinal
  e.g., we can calculate the most frequent value for all variables
http://en.wikipedia.org/wiki/Levels_of_measurement
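The two "e.g." statements about the hierarchy can be sketched in Python. This is only an illustration: the helper names `most_frequent` and `mean_value`, the scale labels, and the sample values are assumptions made for the example, not part of the slides.

```python
from collections import Counter
from statistics import mean

# Scales in hierarchy order: Nominal < Ordinal < Interval < Ratio.
SCALES = ["nominal", "ordinal", "interval", "ratio"]


def most_frequent(values):
    """Most frequent value; meaningful on every scale."""
    return Counter(values).most_common(1)[0][0]


def mean_value(values, scale):
    """Mean; only meaningful for numerical (interval or ratio) scales."""
    if SCALES.index(scale) < SCALES.index("interval"):
        raise ValueError(f"mean is not defined for {scale} data")
    return mean(values)


# Hypothetical sample data, chosen only for illustration.
algorithms = ["quicksort", "mergesort", "quicksort"]  # nominal labels
runtimes_ms = [12.0, 15.0, 12.0]                      # ratio scale

print(most_frequent(algorithms))          # works for nominal data
print(most_frequent(runtimes_ms))         # and for numerical data
print(mean_value(runtimes_ms, "ratio"))   # allowed: numerical scale
```

Calling `mean_value(algorithms, "nominal")` raises `ValueError`, mirroring the point that a proposition valid at a higher level need not hold at a lower one.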

Variables
Discrete: Can take on only certain values: symbols, exact numbers
  For ordinal, interval and ratio scales, this means there will be gaps
  e.g., user satisfaction surveys, memory usage
Continuous: Can take on any value within its range: no gaps
  e.g., run-time, CPU temperature, robot velocity and position
  In practice: limited by measurement accuracy
  Up to researcher to determine needed accuracy, approximate carefully

Data
• The collection of values that a variable X took during the measurement

Describing Data
Our task: Describe the data we have collected
  Find ways to characterize it, represent it
  Find properties that are true of the data
So that we can relate the values to those of other variables

Frequency Distribution
Examine the frequency of values
f(x) = # of times variable took on value x.

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

Grade  f
82     2
75     1
60     2
57     1
43     1
32     1
Total  8

Convention (Ordinal/Numerical): Sort by value

Grade  f
82     2
81     0
80     0
...    0
75     1
...    0
60     2
...    0
57     1
...    ...
Total  8
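A minimal Python sketch of f(x), using the eight grades from the student table. Using `collections.Counter` is a choice made for this example, not something the slides prescribe.

```python
from collections import Counter

# Grades of students X1..X8, copied from the table.
grades = [60, 43, 57, 82, 75, 32, 82, 60]

# f[x] = number of times the variable took on value x.
f = Counter(grades)

# Convention for ordinal/numerical data: sort by value, highest first.
for grade in sorted(f, reverse=True):
    print(grade, f[grade])
print("Total", sum(f.values()))
```

A `Counter` returns 0 for values that never occurred (for example `f[81]`), which corresponds to the zero rows in the sorted table.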

Grouped Frequency Distributions
In ordinal/numerical variables, it is possible to group values together
Create Grouped Frequency Distributions

Score  f    Score  f
96     1    78     4
95     0    77     2
94     0    76     1
93     0    75     1
92     0    74     0
91     1    73     1
90     1    72     2
89     3    71     0
88     2    70     3
87     2    69     1
86     6    68     2
85     2    67     1
84     2    66     0
83     1    65     0
82     3    64     0
81     2    63     0
80     2    62     1

Score  f
95-99  1
90-94  2
85-89  15
80-84  10
75-79  10
70-74  6
65-69  4
60-64  2
N = 50, width (i) = 5

Score   f
96-100  1
91-95   1
86-90   14
81-85   10
76-80   11
71-75   4
66-70   7
61-65   2
N = 50, width (i) = 5

Warning: Loss of Information
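The grouping step can be sketched in Python as follows. Two things here are assumptions: the function name `group` is made up for the example, and the per-score table as transcribed has no rows for scores 79 and 61, so the frequencies used for them (2 and 1) are inferred from the grouped totals and N = 50 rather than read from the slide.

```python
from collections import Counter

# Per-score frequencies (zero-frequency scores omitted).
# ASSUMPTION: the entries for 79 and 61 are inferred from the grouped
# totals, since those rows do not appear in the transcribed table.
score_f = {
    96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
    83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
    73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1,
}


def group(score_f, width, start):
    """Group scores into intervals of the given width, the lowest one
    beginning at `start`; returns {(low, high): f}, highest first."""
    grouped = Counter()
    for score, f in score_f.items():
        low = start + ((score - start) // width) * width
        grouped[(low, low + width - 1)] += f
    return dict(sorted(grouped.items(), reverse=True))


print(group(score_f, width=5, start=60))  # intervals 60-64, ..., 95-99
print(group(score_f, width=5, start=61))  # intervals 61-65, ..., 96-100
```

Both calls use the same data and the same width, yet produce different tables. Neither grouped table lets you recover the per-score counts, which is the loss of information the slide warns about.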

Real and Apparent Limits
Continuous values are more difficult to divide into intervals
  Score of 95 falls within 95-99, not within 90-94
  But what about a temperature of 94.87? 94 < 94.87 < 95!
By convention, the real limits of a score are within ½ the measurement resolution
  If our resolution is 0.1, then limits are within 0.05
  If our resolution is 100, then limits are within 50
We break convention only for exceptional cases
  e.g., age: “I am 35” is true of 35.0 .. 36.0 (not including 36).

Real/Apparent Limits
For example:
  Resolution of 0.01: interval 95..99 really covers values 94.995 to 99.005
    Apparent limits: 95..99
    Real limits: 94.995 to 99.005
  Resolution of 10: 740-800 really covers values 735 to 805.
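The half-resolution convention can be written as a small helper. The name `real_limits` is an assumption for this sketch, and it covers only the standard convention, not exceptions such as age.

```python
def real_limits(apparent_low, apparent_high, resolution):
    """Real limits extend half the measurement resolution beyond
    each apparent limit."""
    half = resolution / 2
    return apparent_low - half, apparent_high + half


print(real_limits(95, 99, 0.01))  # roughly 94.995 to 99.005
print(real_limits(740, 800, 10))  # 735.0 to 805.0
print(real_limits(85, 89, 1))     # 84.5 to 89.5, whole-number scores
```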

Relative Frequency Distributions
A frequency count can be misleading
  Algorithm X was fastest on 60,000 trials: Is this good?
  100,000 people voted for candidate A: Is she the winner?
We need a way to compare values, i.e., relate them to each other
Relative frequency distributions: translate f into percentage or ratio
  rel f (proportion) = f/N
  rel f (%) = 100 * f/N
Warning: Can be misleading, if ignoring count magnitude
  50% of all test cases succeeded (with only two cases…)

Relative Frequency Distributions
Example: f (%) = 100 * f/N

Score  f   f (%)
95-99  1   2
90-94  2   4
85-89  15  30
80-84  10  20
75-79  10  20
70-74  6   12
65-69  4   8
60-64  2   4
N =    50  100
i = 5
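The translation from f to relative frequency can be sketched in Python with the grouped scores from the example. The variable names are chosen for this illustration.

```python
# Grouped frequencies from the example table (N = 50, i = 5).
grouped_f = {
    "95-99": 1, "90-94": 2, "85-89": 15, "80-84": 10,
    "75-79": 10, "70-74": 6, "65-69": 4, "60-64": 2,
}

n = sum(grouped_f.values())

# rel f (proportion) = f/N and rel f (%) = 100 * f/N
rel_f = {interval: f / n for interval, f in grouped_f.items()}
rel_pct = {interval: 100 * f / n for interval, f in grouped_f.items()}

for interval in grouped_f:
    print(interval, grouped_f[interval], rel_f[interval], rel_pct[interval])
print("N =", n)
```

Printing N alongside the percentages keeps the count magnitude visible, which addresses the warning about percentages taken from very few cases.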

Cumulative Frequency Distribution
For ordinal/numerical variables
Shows where values stand with respect to others: how many are below or above
Based on the cumulative distribution, we can answer questions such as:
  What percentage of scores fall below 80?
  How many scores fall below 95?
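A sketch of how the cumulative columns could be computed, using the grouped score frequencies that appear elsewhere in the deck (N = 50). Taking "below 80" to mean below the real limit 79.5 of whole-number scores is an assumption of this example.

```python
from itertools import accumulate

# Grouped frequencies, lowest interval first.
intervals = ["60-64", "65-69", "70-74", "75-79",
             "80-84", "85-89", "90-94", "95-99"]
f = [2, 4, 6, 10, 10, 15, 2, 1]

cum_f = list(accumulate(f))            # scores at or below each interval
n = cum_f[-1]
cum_pct = [100 * c / n for c in cum_f]

for label, c, pct in zip(intervals, cum_f, cum_pct):
    print(label, c, pct)

# Scores below 80 are those up to the end of interval 75-79.
i_79 = intervals.index("75-79")
print("below 80:", cum_f[i_79], "scores,", cum_pct[i_79], "%")

# Scores below 95 are those up to the end of interval 90-94.
i_94 = intervals.index("90-94")
print("below 95:", cum_f[i_94], "scores")
```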

Percentiles, Percentile Ranks
Percentile X: Value for which X percent of values are lower
  e.g., baby height
  We use P_X to denote the Xth percentile, e.g., P_98 is in range 90-94.
Percentile rank X: the percent of values that fall below X.
  e.g., percentile rank of the interval 65-69 is 12.

Computing Percentiles, P. Ranks
How do we compute percentiles and percentile ranks from grouped data?
  What is the score which defines the top 20% of scores?
  Is it between 84 and 85?

Score  f   f (%)
95-99  1   2
90-94  2   4
85-89  15  30
80-84  10  20
75-79  10  20
70-74  6   12
65-69  4   8
60-64  2   4
N =    50  100

Computing Percentiles
We want to compute P_80. 80% of 50 cases = 40 cases.
We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).
We need 8 more.

Score  cum f  cum %
95-99  50     100
90-94  49     98
85-89  47     94
80-84  32     64
75-79  22     44
70-74  12     24
65-69  6      12
60-64  2      4
i = 5

The interval 85-89 contains 47 - 32 = 15 cases, starting at real limit 84.5.
These are spread over a width of 5 (= 89.5 - 84.5).
Assume scores are evenly distributed within the interval:
  8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation)
P_80 = 84.5 + 2.67 = 87.17
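The interpolation steps above can be sketched as a Python function. The name `percentile`, the table layout, and the `resolution` parameter are assumptions for this illustration. Like the slide, the sketch assumes scores are evenly distributed within each interval.

```python
# (apparent low, apparent high, f), lowest interval first.
table = [(60, 64, 2), (65, 69, 4), (70, 74, 6), (75, 79, 10),
         (80, 84, 10), (85, 89, 15), (90, 94, 2), (95, 99, 1)]


def percentile(table, p, resolution=1):
    """Score below which p percent of cases fall, using linear
    interpolation inside the interval that contains the target case."""
    n = sum(f for _, _, f in table)
    target = p * n / 100          # number of cases that must lie below
    below = 0                     # cumulative f of the lower intervals
    for low, high, f in table:
        if f > 0 and below + f >= target:
            real_low = low - resolution / 2
            width = (high - low) + resolution
            return real_low + (target - below) / f * width
        below += f
    raise ValueError("p must be at most 100")


p80 = percentile(table, 80)
print(round(p80, 2))  # 84.5 + 8/15 * 5
```

For P_80 the loop stops at interval 85-89 with 32 cases below, needs 8 of that interval's 15 cases, and so adds 8/15 of the width 5 to the real lower limit 84.5.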

Computing Percentile Ranks
We want to compute the percentile rank of 86
  Lies in the interval 85-89, real limits 84.5 to 89.5.
  86 - 84.5 = 1.5 score points.
  Width of interval = 5.
  1.5/5 = 0.3 ==> 30% of scores in interval (0.3*15 = 4.5)
So we have
  32 scores up to 84.5
  4.5 scores from 84.5 to 86.
  Total: 4.5 + 32 = 36.5 scores.
  36.5/50 = 73%. This is the percentile rank of 86.

Score  cum f  cum %
95-99  50     100
90-94  49     98
85-89  47     94
80-84  32     64
75-79  22     44
70-74  12     24
65-69  6      12
60-64  2      4
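The percentile rank computation can be sketched the same way, going from a score to a percentage. The function name `percentile_rank` and the table layout are assumptions for this illustration, and scores are again assumed to be evenly distributed within each interval.

```python
# (apparent low, apparent high, f), lowest interval first.
table = [(60, 64, 2), (65, 69, 4), (70, 74, 6), (75, 79, 10),
         (80, 84, 10), (85, 89, 15), (90, 94, 2), (95, 99, 1)]


def percentile_rank(table, score, resolution=1):
    """Percent of cases falling below `score`, interpolating linearly
    inside the interval whose real limits contain the score."""
    n = sum(f for _, _, f in table)
    below = 0                     # cumulative f of the lower intervals
    for low, high, f in table:
        real_low = low - resolution / 2
        real_high = high + resolution / 2
        if real_low <= score < real_high:
            fraction = (score - real_low) / (real_high - real_low)
            return 100 * (below + fraction * f) / n
        below += f
    raise ValueError("score lies outside the table")


rank_86 = percentile_rank(table, 86)
print(rank_86)  # (32 + 0.3 * 15) / 50, as a percentage
```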

Frequency Distributions and Scales

Displaying Frequency Distributions: Nominal Data

Displaying Frequency Distributions: Ordinal/Numerical Data
Histogram

Score  f   f (%)
95-99  1   2
90-94  2   4
85-89  15  30
80-84  10  20
75-79  10  20
70-74  6   12
65-69  4   8
60-64  2   4
N =    50  100

[Figure: "Scores Histogram". Bars for intervals 60-64 through 95-99 with heights 2, 4, 6, 10, 10, 15, 2, 1; vertical axis from 0 to 16.]
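Since the chart itself is not reproduced here, a plain-text rendering gives a rough idea of its shape. This is only a sketch: the function name `text_histogram` and the choice of `#` as the bar symbol are made up for the example.

```python
intervals = ["60-64", "65-69", "70-74", "75-79",
             "80-84", "85-89", "90-94", "95-99"]
freqs = [2, 4, 6, 10, 10, 15, 2, 1]


def text_histogram(intervals, freqs, symbol="#"):
    """One line per interval: label, a bar of length f, and the count."""
    width = max(len(label) for label in intervals)
    return [f"{label:<{width}} | {symbol * f} {f}"
            for label, f in zip(intervals, freqs)]


lines = text_histogram(intervals, freqs)
print("\n".join(lines))
```

Every bar starts from zero here, so bar length is proportional to frequency.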

Displaying Frequency Distributions: Ordinal/Numerical Data
Histogram: Different Grouping

[Figure: "Scores Histogram". The same scores grouped with width 10: 60-69: 6, 70-79: 16, 80-89: 25, 90-99: 3; vertical axis from 0 to 30.]
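The coarser grouping can be derived from the width-5 table by merging adjacent intervals. A minimal sketch follows; the helper name `merge_adjacent` is an assumption for this example.

```python
# Width-5 grouping, lowest interval first: 60-64, 65-69, ..., 95-99.
freqs_width5 = [2, 4, 6, 10, 10, 15, 2, 1]


def merge_adjacent(freqs, k):
    """Sum every k adjacent intervals into one wider interval."""
    return [sum(freqs[i:i + k]) for i in range(0, len(freqs), k)]


# Merging pairs gives intervals 60-69, 70-79, 80-89, 90-99.
freqs_width10 = merge_adjacent(freqs_width5, 2)
print(freqs_width10)
```

The total N is unchanged, but the wider intervals hide detail that the width-5 histogram shows, such as the gap between 15 scores in 85-89 and 2 in 90-94.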

Lying with Visuals

[Figure: two "Scores Histogram" charts of the same values (108, 110, 111, 112, 111, 110, 109, 108 for intervals 60-64 through 95-99). The first uses a vertical axis from 0 to 120; the second uses a vertical axis from 107 to 115.]
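One way to quantify the effect of the axis choice is to compare how tall two bars are drawn relative to each other under each axis. The sketch below assumes the two charts plot the same values and differ only in where the vertical axis starts (0 versus 107). The function names are made up for the example.

```python
values = [108, 110, 111, 112, 111, 110, 109, 108]


def drawn_height(value, axis_min):
    """Height of a bar as drawn when the vertical axis starts at axis_min."""
    return value - axis_min


def apparent_ratio(a, b, axis_min):
    """How many times taller the bar for `a` looks than the bar for `b`."""
    return drawn_height(a, axis_min) / drawn_height(b, axis_min)


tallest, shortest = max(values), min(values)
print(apparent_ratio(tallest, shortest, axis_min=0))    # about 1.04
print(apparent_ratio(tallest, shortest, axis_min=107))  # 5.0
```

With the axis starting at 0, the tallest bar is drawn about 4% taller than the shortest. With the axis starting at 107 it is drawn five times taller, although the data are identical.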