stochastic systems - random variables
TRANSCRIPT
Stochastic Systems - Random Variables
• Computer Systems
• Traffic Systems
• Financial Systems
• Data Systems
• Class
– Absences
– Exam marks
• Points Won
– Premier League
– Dice, Pokemon, Whack-a-Mole, Assassin,…
• Natural Language
– Google suggest/translate
– Randomly generated music
• Sally Clarke
12/12/2012 ST2004 2012 Week 12 1
Framework - Rows and Columns
• Column Names
– Variables of interest
– Derived variables
• Rows
– Replications
• Column Summaries
– Averages
– Frequency tables
• Subsets
Simplest case replications
Instances Realisations
Possible values Numeric Labels True/False
Expected Values and Variances
• System Level Random Variation
– Challenge is Structure
• Rows/cols
• System level summaries
– Monte Carlo simulation “easy”
• How long to simulate?
– Often Prob dist of system level summaries “hard”
– Often Exp Val, Var system level summaries “easier”
• Often suff to summarise random variation by variance
Common Systems of Random Variables
System level rvs
• Sums/differences of elementary indep rvs
• Linear combs (weighted averages)
– Often Expected Values and Variances easy
– Often Normal distribution useful
– Sometimes recursions easy
• Max/min combs
– Often probs easy
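As a reminder of why these cases are "easy", the standard identities behind them can be written out. The weights $a_i$ and the rvs $X_1,\dots,X_k$ are generic symbols introduced here for illustration; they are not on the slide.

```latex
\begin{align*}
E\Big[\sum_{i=1}^{k} a_i X_i\Big] &= \sum_{i=1}^{k} a_i\,E[X_i]
  && \text{(always)}\\
\operatorname{Var}\Big[\sum_{i=1}^{k} a_i X_i\Big] &= \sum_{i=1}^{k} a_i^{2}\,\operatorname{Var}[X_i]
  && \text{(independent } X_i\text{)}\\
\Pr\big(\max(X_1,\dots,X_k) \le m\big) &= \prod_{i=1}^{k} \Pr(X_i \le m)
  && \text{(independent } X_i\text{)}
\end{align*}
```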
Modelling Mechanics – Populating Rows/Cols
• Data Generating System
– Derived Variables
• Random Number Generation
– Transformations
– Functions/Combinations of random vars
• Thought Experiments
– How many replications
• What Summaries
– Rel Freq
– Averages, Expected Values and Variances
Objective?
Systems – examples
• Teams in league
– Games won by A, B, C: N_A, N_B, N_C
– Sums of binary rvs
• Student attendance at class
– Binary: Name Chosen / Not Chosen
– Given chosen: Presence/Absence
– Sums of binary rvs
• Sets of Dice
– Scores X_1, X_2, X_3; Sum S_3, Max M_3
Model Decompose Splash Bean Machine
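The dice example is small enough to enumerate exactly. The sketch below assumes three fair six-sided dice (the slide does not state the number of faces) and uses S3 and M3 for the sum and the maximum; the function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 6^3 = 216 equally likely outcomes (X1, X2, X3) for three fair dice (assumed).
outcomes = list(product(range(1, 7), repeat=3))
weight = Fraction(1, len(outcomes))

sum_dist = Counter()   # distribution of S3 = X1 + X2 + X3
max_dist = Counter()   # distribution of M3 = max(X1, X2, X3)
for x in outcomes:
    sum_dist[sum(x)] += weight
    max_dist[max(x)] += weight

def mean_var(dist):
    """Expected value and variance of a discrete distribution {value: prob}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var

def max_cdf(m):
    """Pr(M3 <= m) as a product of three identical single-die probabilities."""
    return Fraction(m, 6) ** 3

s_mean, s_var = mean_var(sum_dist)
print("E[S3] =", s_mean, " Var[S3] =", s_var)
print("Pr(M3 <= 4) by enumeration :", sum(p for m, p in max_dist.items() if m <= 4))
print("Pr(M3 <= 4) by product rule:", max_cdf(4))
```

The sum is handled through its expected value and variance, the max through its probabilities, which mirrors the "often easy" split for the two kinds of system rv.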
League as additive system
MiniLeague: 3 teams play each other once

Game       Prob of Winning
(i) AB     Pr(A winner) = 0.7    Pr(B winner) = 0.3
(ii) BC    Pr(B winner) = 0.4    Pr(C winner) = 0.6
(iii) AC   Pr(A winner) = 0.2    Pr(C winner) = 0.8

        Games                   Winner                    Num Wins           Outright
Reps    AB     BC     AC       (i)  (ii)  (iii)   Win    A   B   C    Pts    winner
1      0.500  0.952  0.347      A    C     C      ACC    1   0   2    102    C
2      0.518  0.126  0.708      A    B     C      ABC    1   1   1    111    #N/A
3      0.972  0.358  0.653      B    B     C      BBC    0   2   1    021    B
4      0.985  0.851  0.910      B    C     C      BCC    0   1   2    012    C
5      0.982  0.499  0.735      B    C     C      BCC    0   1   2    012    C
6      0.234  0.587  0.168      A    C     A      ACA    2   0   1    201    A
7      0.584  0.390  0.482      A    B     C      ABC    1   1   1    111    #N/A

Num Wins for each team computed using COUNTIF
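The spreadsheet logic can be mimicked in a few lines of Python. This is only a sketch of the same idea, not the spreadsheet itself: the function names are invented, and `None` stands in for the `#N/A` shown when all three teams win one game each.

```python
import random
from collections import Counter

# (first team, second team, Pr(first team wins)), taken from the game table.
GAMES = [("A", "B", 0.7), ("B", "C", 0.4), ("A", "C", 0.2)]

def play_league(rng):
    """One replication (one spreadsheet row): number of wins per team."""
    wins = {"A": 0, "B": 0, "C": 0}
    for first, second, p_first in GAMES:
        winner = first if rng.random() < p_first else second
        wins[winner] += 1
    return wins

def outright_winner(wins):
    """Team with strictly most wins, or None for a 1-1-1 tie (#N/A)."""
    best = max(wins.values())
    leaders = [team for team, w in wins.items() if w == best]
    return leaders[0] if len(leaders) == 1 else None

def simulate(reps, seed=0):
    """Relative frequency of each outright winner over many replications."""
    rng = random.Random(seed)
    tally = Counter(outright_winner(play_league(rng)) for _ in range(reps))
    return {team: count / reps for team, count in tally.items()}

rel_freq = simulate(20_000)
for team in ("A", "B", "C", None):
    print(team, round(rel_freq.get(team, 0.0), 3))
```

More replications shrink the gap between relative frequencies and probabilities, which is one practical answer to "how long to simulate?".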
League as additive system
Winners (i)(ii)(iii)   Rel Freq   Prob
ABA                     0.051     0.056
ABC                     0.234     0.224
ACA                     0.092     0.084
ACC                     0.348     0.336
BBA                     0.023     0.024
BBC                     0.089     0.096
BCA                     0.020     0.036
BCC                     0.143     0.144
sum                     1         1

Points for ABC         Rel Freq   Prob
210                     0.051     0.056
201                     0.092     0.084
120                     0.023     0.024
111                     0.254     0.260
102                     0.348     0.336
021                     0.089     0.096
012                     0.143     0.144
sum                     1         1

Outright Winner        Rel Freq   Prob
A                       0.143     0.140
B                       0.112     0.120
C                       0.491     0.480
#N/A                    0.254     0.260
sum                     1         1

Prob dist of games won
by      0       1       2      sum
A     0.240   0.620   0.140     1
B     0.420   0.460   0.120     1
C     0.080   0.440   0.480     1

      Exp Val   Var
A       0.9     0.37
B       0.7     0.45
C       1.4     0.4

Game       Prob of Winning
(i) AB     Pr(A winner) = 0.7
(ii) BC    Pr(B winner) = 0.4
(iii) AC   Pr(A winner) = 0.2

Decomposing via events and via sums: Prob Dist for N_A; E[N_A], Var[N_A]
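The Prob columns can be reproduced by enumerating the 8 winner sequences and multiplying game probabilities, assuming the three games are independent. The sketch below uses invented helper names and checks the two decompositions for team A: via events (the distribution of N_A) and via sums (adding the variances of the two binary rvs for A's games).

```python
from itertools import product

GAME_ORDER = ("AB", "BC", "AC")
P_FIRST = {"AB": 0.7, "BC": 0.4, "AC": 0.2}   # Pr(first-named team wins)

def sequence_probs():
    """Probability of each winner sequence such as 'ACC' (games assumed independent)."""
    probs = {}
    for winners in product(*GAME_ORDER):          # one winner letter per game
        p = 1.0
        for game, w in zip(GAME_ORDER, winners):
            p *= P_FIRST[game] if w == game[0] else 1 - P_FIRST[game]
        probs["".join(winners)] = p
    return probs

def wins_dist(team, probs):
    """Decomposing via events: Pr(team wins 0, 1 or 2 games)."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for seq, p in probs.items():
        dist[seq.count(team)] += p
    return dist

def mean_var(dist):
    mean = sum(k * p for k, p in dist.items())
    var = sum((k - mean) ** 2 * p for k, p in dist.items())
    return mean, var

probs = sequence_probs()
dist_a = wins_dist("A", probs)
mean_a, var_a = mean_var(dist_a)

# Decomposing via sums: N_A is the sum of two independent binary rvs.
mean_a_sum = 0.7 + 0.2
var_a_sum = 0.7 * 0.3 + 0.2 * 0.8

print({k: round(p, 3) for k, p in dist_a.items()})
print(round(mean_a, 3), round(var_a, 3), round(mean_a_sum, 3), round(var_a_sum, 3))
```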
Student Absences as additive system
Dist # of absences

              Freq                                  Rel Freq
Num Abs    All   MSISS   JS (Eng/B&C/CSSL)      All    MSISS   JS (Eng/B&C/CSSL)
0           15     7       8                    0.20   0.21    0.20
1           18     9       9                    0.24   0.26    0.22
2           17     6      11                    0.23   0.18    0.27
3           12     4       8                    0.16   0.12    0.20
4            5     3       2                    0.07   0.09    0.05
5            5     2       3                    0.07   0.06    0.07
6            2     2       0                    0.03   0.06    0.00
7            1     1       0                    0.01   0.03    0.00
Total       75    34      41                    1      1       1

           All    MSISS   JS (Eng/B&C/CSSL)
mean       2.03   2.18    1.90
var        2.83   3.73    2.04
SD         1.68   1.93    1.43
Question: Could the observed differences between the classes have arisen by chance? If so, the evidence that there is a real difference is weak.
We show that, under a model where the classes are not different, there is a >20% chance that ‘MSISS’ exceeds ‘Others’ by more than observed (2.18-1.90 = 0.27 before rounding), both by replicating simulations and by theory for infinite reps.
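The mean, var and SD rows follow directly from the frequency table. A minimal sketch, assuming the variance uses divisor n (which is what reproduces the tabulated values); the dictionary keys and function name are illustrative only.

```python
from math import sqrt

NUM_ABS = range(8)   # 0..7 absences
FREQ = {
    "All":   [15, 18, 17, 12, 5, 5, 2, 1],
    "MSISS": [7, 9, 6, 4, 3, 2, 2, 1],
    "Other": [8, 9, 11, 8, 2, 3, 0, 0],   # JS (Eng/B&C/CSSL)
}

def summarise(freq):
    """Return (n, mean, variance, SD) for a frequency table over NUM_ABS."""
    n = sum(freq)
    mean = sum(x * f for x, f in zip(NUM_ABS, freq)) / n
    var = sum((x - mean) ** 2 * f for x, f in zip(NUM_ABS, freq)) / n  # divisor n (assumed)
    return n, mean, var, sqrt(var)

summary = {group: summarise(freq) for group, freq in FREQ.items()}
obs_diff = summary["MSISS"][1] - summary["Other"][1]

for group, (n, mean, var, sd) in summary.items():
    print(f"{group:6s} n={n:2d} mean={mean:.2f} var={var:.2f} SD={sd:.2f}")
print(f"observed mean diff (MSISS - Other) = {obs_diff:.2f}")
```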
Number of Abs Each (MSISS students 1-4, Other students 5-10): 5 5 3 7 | 5 1 3 2 1 4
Total Abs: MSISS 20, Other 16
Abs?
rep Stud 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
1 1st 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2nd 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
3 1st 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2nd 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1st 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 2nd 10 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1
7 1st 7 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 2nd 4 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
9 1st 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 2nd 5 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1
59 1st 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
60 2nd 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
Random Roll Call as sums: simple case
Spreadsheet
• 10 students only; 30 classes; choose 2
• 60 chances to return as Abs
• Columns: Random Student; Student (chosen); Abs?; Student Abs if Chosen; Number of Abs Each; Total Abs (MSISS, Other)
• Pr Abs 0.5; here common Prob Abs
• Here replications are
• Key: sums of binary random variables; Pr(name called and absent) = 0.25(1/10)
• Here, to simplify things, using sampling with replacement. Technical detail only
• Avg: MSISS 20/4, Other 16/6
• System rvs; Elementary rvs
Theory – 10 students, common prob
#Abs each = sum of 60 binary rvs Y
MSISS Total = sum of 240 binary rvs Y
Other Total = sum of 360 binary rvs Y
Pr(Y=1) = q = (1/10)(0.25) = 0.025
E[Y] = q; Var[Y] = q(1-q) ≈ q
E[#each] = 60q; Var[#each] ≈ 60q; #each ~ N(60q, 60q) CLT approx
E[MSISS Tot] = 240q; Var[MSISS Tot] ≈ 240q; MSISS Tot ~ N(240q, 240q) CLT approx
MSISS Avg = Total/4; MSISS Avg ~ N(60q, 15q)
Other Avg = Total/6; Other Avg ~ N(60q, 10q)
Binomial dist for each (sums of binary rvs); good Poisson approx as q small
#each ~ B(60, q) approx P(60q)
MSISS Tot ~ B(240, q) approx P(240q) approx Normal
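How good the Poisson approximation is for small q can be checked numerically. The sketch below compares B(60, q) with P(60q) for q = 0.025 using only the standard library; the helper names are invented for illustration.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, q):
    """Pr(sum of n independent binary rvs, each with Pr(Y=1)=q, equals k)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

n, q = 60, 0.025   # values from the slide: 60 chances, q = (1/10)(0.25)
rows = [(k, binom_pmf(k, n, q), poisson_pmf(k, n * q)) for k in range(7)]
max_gap = max(abs(b - p) for _, b, p in rows)

print(" k  Binomial  Poisson")
for k, b, p in rows:
    print(f"{k:2d}  {b:.4f}    {p:.4f}")
print(f"largest gap for k = 0..6: {max_gap:.4f}")
print(f"mean = {n * q}, Binomial var = {n * q * (1 - q):.4f}, Poisson var = {n * q}")
```

The near-equality of the two variances is the same statement as Var[Y] = q(1-q) ≈ q on the slide.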
Using Simulations to Test
Even with common prob absence, MSISS Avg ≠ Other Avg
• How big a difference is possible?
– 500 replicates of 30 classes 75 students
– 500 differences
– Prop(sim mean diff > observed mean diff) = 0.25
• Conclude: nothing remarkable about observed mean diff, even if classes have common prob abs
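A rough sketch of this kind of simulation test follows. It is not the spreadsheet used for the slide, so its output need not match the quoted 0.25. Assumptions made here: 300 name calls in total (taken as 30 classes with 10 names each), names drawn with replacement from 75 students (34 MSISS, 41 Other), and a common Pr(absent when called) of 0.51.

```python
import random

N_MSISS, N_OTHER = 34, 41          # group sizes from the absences table
N_CALLS = 300                      # assumed: 30 classes x 10 names called
P_ABS = 0.51                       # assumed common Pr(absent | called)
OBS_DIFF = 74 / 34 - 78 / 41       # observed MSISS avg - Other avg, about 0.27

def one_replicate(rng):
    """Simulate one term of roll calls; return MSISS avg minus Other avg."""
    n_students = N_MSISS + N_OTHER
    absences = [0] * n_students
    for _ in range(N_CALLS):
        student = rng.randrange(n_students)    # sampling with replacement
        if rng.random() < P_ABS:
            absences[student] += 1
    msiss_avg = sum(absences[:N_MSISS]) / N_MSISS
    other_avg = sum(absences[N_MSISS:]) / N_OTHER
    return msiss_avg - other_avg

def prop_exceeding(reps=500, seed=1):
    """Proportion of simulated mean diffs that exceed the observed diff."""
    rng = random.Random(seed)
    diffs = [one_replicate(rng) for _ in range(reps)]
    return sum(d > OBS_DIFF for d in diffs) / reps

prop = prop_exceeding()
print(f"Prop(sim mean diff > observed mean diff) = {prop:.2f}")
```

With only 500 replicates the proportion itself varies by a couple of percentage points from run to run, so it should be read as an estimate.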
Summaries of 500 reps
MSISS Other Diff
Mean 3.02 3.01 0.01
SD 0.80 0.57 1.13
Using Probability to Test
Even with common prob absence, MSISS Avg ≠ Other Avg
• How big a difference is possible?
– replicates of 30 classes 75 students
– Use Normal distribution to compute
– Pr(random mean diff > observed mean diff) = 0.21
– Theory follows
[Figure: Histogram of MSISS, Other, Diff with fitted Normal curves; y-axis Frequency (0 to 200), x-axis -3.0 to 4.5. MSISS: Mean 2.961, StDev 0.7345, N 500. Other: Mean 2.999, StDev 0.5716, N 500. Diff: Mean -0.03833, StDev 1.055, N 500.]
Student Absences
Model: No real diff between student groups,
i.e. diff of 0.27 could have arisen by chance.
Compute Pr(mean diff > 0.27 by chance alone);
if not small, no evidence that model is not OK
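Options
1 Many simulations
2 Thought expt with ∞ reps
Theory, with common p taken as 0.51 and q = p × 1/75:
$$\bar{Y}_{MSISS} \sim N\left(300q,\ \tfrac{300q}{34}\right); \qquad \bar{Y}_{Others} \sim N\left(300q,\ \tfrac{300q}{41}\right)$$
$$RandDiff = \bar{Y}_{MSISS} - \bar{Y}_{Others} \sim N\left(0,\ 300q\left(\tfrac{1}{34} + \tfrac{1}{41}\right)\right)$$
$$\Pr(RandDiff > 0.27) = 1 - \Phi\left(\frac{0.27 - 0}{\sqrt{300q\left(\tfrac{1}{34} + \tfrac{1}{41}\right)}}\right) \approx 0.2$$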
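The Normal calculation can be evaluated with the standard library alone. The sketch below assumes, as read from this slide, a common p of 0.51, q = p/75 and 300 name calls, with the variance of each student's count approximated by 300q; the variable names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal distribution function, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p_abs = 0.51                 # common Pr(abs), as taken on the slide
q = p_abs / 75               # Pr(a given student is called and absent)
n_calls = 300
var_each = n_calls * q       # Var[#Abs each student], approximated by 300q

# Difference of two independent group averages over 34 and 41 students.
var_diff = var_each * (1 / 34 + 1 / 41)
z = (0.27 - 0) / sqrt(var_diff)
tail = 1 - normal_cdf(z)

print(f"SD of RandDiff = {sqrt(var_diff):.3f}, z = {z:.3f}")
print(f"Pr(RandDiff > 0.27) = {tail:.2f}")
```

No replications are needed here, which is the sense in which theory plays the role of infinitely many simulations.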
What’s the use of Theory?
• Estimating Pr(abs)
• Testing
– What if student groups not different?
• Avoid simulation
$$p = \Pr(Abs)$$
$$E[\#\text{Abs each student}] = 300 \times \tfrac{1}{75} \times p = 4p$$
$$\hat{p} = \tfrac{1}{4}\,\text{ObsAvg}[\#\text{Abs each student}]$$
Equiv
$$\hat{p} = \text{Avg}[\#\text{Abs each day}] \,/\, \text{Number of Students}$$
Alt, if simulation preferred: Speed-up
Simulate each from B(n, q) – form total
Simulate total directly from B(n × #students, q) – form avg
Simulate avg directly from Normal dist
Validate simulation
Student Absences
• Simplest Model
– Fixed Pr(abs for randomly chosen MSISS/Other)
• Richer model - approached similarly
– Pr(Abs) increases over time
– From “small” to “large”
– Possibly different rate of increase
– Odds(Abs) = α + β_group t
Bootstrap - Alternative Approach
Avoid Theoretical distributions
Draw samples from empirical dist
Sample binary random variables and sum equiv sample from Binomial distribution
Sample from observed distribution equiv resample data with replacement
Opinion Polls Methodology
For all national population opinion polls RED C interview a random sample of 1,000+ adults aged 18+ by telephone. This sample size is the recognised sample required by polling organisations for ensuring accuracy on political voting intention surveys. The accuracy level is estimated to be approximately plus or minus 3 per cent on any given result at 95% confidence levels.
Electorate                  Sample Size for ±3%
Ireland   ~ 2,000,000       1000
USA       ~ 200,000,000     ?
ST2004 2012 Week 11 11/12/2012
Bootstrap Precision
• Sample of 1000 selected randomly
– Voters in Colorado
• 51.1% Democrat
• But, what if repeated..?
• “Data like this, but randomly different”
– Re-sample 1000, from 1000, ‘with replacement’
– Repeat many times
– Replications → Precision, Prob > 0.50
– 95% Conf Interval (cf formula method)
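The re-sampling recipe above can be sketched as follows. The sample is represented as 511 ones (Democrat) and 489 zeros; the number of bootstrap replications (1000), the seed and the percentile form of the interval are choices made here for illustration, not taken from the slide.

```python
import random
from statistics import pstdev

def bootstrap_props(sample, reps, rng):
    """Re-sample len(sample) values with replacement, reps times; return proportions."""
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(reps)]

sample = [1] * 511 + [0] * 489        # 51.1% Democrat in a sample of 1000
rng = random.Random(2012)
props = sorted(bootstrap_props(sample, reps=1000, rng=rng))

sd_hat = pstdev(props)                                    # bootstrap precision (SD)
lower = props[int(0.025 * len(props))]                    # percentile interval,
upper = props[int(0.975 * len(props)) - 1]                # one of several conventions
prob_majority = sum(p > 0.50 for p in props) / len(props)

print(f"bootstrap SD of proportion = {sd_hat:.4f}")
print(f"95% interval approx ({lower:.3f}, {upper:.3f})")
print(f"Prop(re-sampled proportion > 0.50) = {prob_majority:.2f}")
```

The bootstrap SD can be set beside the formula value sqrt(0.511 × 0.489 / 1000) to compare the two methods.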
Precision by Thought Expt
Re-sample 1000, from 1000, ‘with replacement’
Replications → Precision
Bernoulli Process
Define $N$ = #Dem in 1000 sample; $p = 0.511$
$$N \sim\ ?$$
$$E[N] =\ ? \qquad Var[N] =\ ?$$
Define $\hat{p} = N/1000$
$$E[\hat{p}] =\ ? \qquad Var[\hat{p}] =\ ?$$
$$SD[\hat{p}] =\ ?$$
95% Precision $\pm 2\,SD[\hat{p}]$
Precision by Thought Expt
Re-sample 1000, from 1000, ‘with replacement’
Replications → Precision
Bernoulli Process
Define $N$ = #Dem in 1000 sample; $p = 0.511$
$$N \sim B(1000,\ 0.511)$$
$$E[N] = 511 \qquad Var[N] = 250$$
Define $\hat{p} = N/1000$
$$E[\hat{p}] = 0.511 \qquad Var[\hat{p}] = 0.000250$$
$$SD[\hat{p}] = 0.016$$
95% Precision $\pm 2\,SD[\hat{p}]$
Answers
Precision in Opinion Polls
$n = 1000 \Rightarrow$ precision $\pm 3\%$
Define $Y_n$ = Number 'Favourable'; $\hat{p}_n = Y_n/n$ = Prop 'Favourable'
Theory: Model $y$ as realisation of $Y_n \sim B(n, p)$
$$Y_n \sim N\big(np,\ np(1-p)\big) \quad \text{approx}$$
$$\hat{p}_n \sim N\big(p,\ \tfrac{1}{n}p(1-p)\big) \quad \text{approx}$$
95% of values of $\hat{p}_n$ in range $p \pm 2\sqrt{\tfrac{1}{n}p(1-p)}$
$n = 1000,\ p = 0.5$: $0.5 \pm 2\sqrt{\tfrac{1}{1000}\,0.5(1-0.5)} = 0.5 \pm 0.032$
As percentages 50% ± 3.2%
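The 95% range formula is easy to tabulate. A minimal sketch; the function name `precision` and its default multiplier of 2 are choices made here for illustration.

```python
from math import sqrt

def precision(n, p, z=2.0):
    """Half-width z * sqrt(p(1-p)/n) of the approximate range for the sample proportion."""
    return z * sqrt(p * (1 - p) / n)

# n = 1000 with p = 0.5 (the poll slide) and p = 0.511 (the Colorado sample).
for n, p in [(1000, 0.5), (1000, 0.511)]:
    print(f"n={n}, p={p}: {100 * p:.1f}% +/- {100 * precision(n, p):.1f}%")
```

Only n and p enter the formula; the size of the electorate does not appear in it.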
Normal Approx to B(n,p): Sample Size Application
Precision in Opinion Polls
$n = 1000 \Rightarrow$ precision $\pm 3\%$
$n = 100,\ p = 0.5$: $0.5 \pm 2\sqrt{\tfrac{1}{100}\,0.5(1-0.5)} = 0.5 \pm 0.1$; as percentages 50% ± 10%
$n = 400,\ p = 0.5$: 50% ± 5%
$n = 400,\ p = 0.8$: 80% ± 4%
Challenge
What sample size needed for 95% precision ±10%?
What sample size needed for 99.7% precision ±10%?
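One way to approach the challenge is to invert the precision formula for n. The sketch below assumes the worst case p = 0.5 and the usual multipliers of 2 SD for 95% and 3 SD for 99.7%; the function name is invented.

```python
from math import ceil

def sample_size(half_width, p=0.5, z=2.0):
    """Smallest n with z * sqrt(p(1-p)/n) <= half_width."""
    n_exact = z**2 * p * (1 - p) / half_width**2
    return ceil(round(n_exact, 9))   # rounding guards against floating-point noise

n_95 = sample_size(0.10, z=2.0)      # 95% precision of +/- 10%
n_997 = sample_size(0.10, z=3.0)     # 99.7% precision of +/- 10%
print(n_95, n_997)
```

Because n grows with the square of z/half_width, halving the half-width quadruples the required sample size.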
Law of Small Numbers
http://www.greenbookblog.org/2012/05/11/how-myths-are-formed-the-law-of-small-numbers-market-research/
The site discusses how people treat sample sizes as interchangeable. This is most likely due to overuse of percentages when dealing with statistics. For example, many cosmetic advertisements say things like “86% of people agree” or “91% of people said this was their favourite”, which on the surface makes the product seem amazing. On closer inspection, however, the advertisement will read “based on a sample of 87 people”. For most people this won’t start alarm bells ringing, because surely the percentage will apply on a larger scale. The site explains that the smaller the sample, the larger the sampling error, which leads to extreme statements based on conclusions drawn from samples too small to be representative of the majority.