Stochastic Systems - Random Variables

• Computer Systems
• Traffic Systems
• Financial Systems
• Data Systems
• Class
– Absences
– Exam marks
• Points Won
– Premier League
– Dice, Pokemon, Whack-a-Mole, Assassin,…
• Natural Language
– Google suggest/translate
– Randomly generated music
• Sally Clarke

12/12/2012 ST2004 2012 Week 12 1

Framework - Rows and Columns

• Column Names
– Variables of interest
– Derived variables
• Rows
– Replications
• Column Summaries
– Averages
– Frequency tables
• Subsets

Simplest case: replications
Instances, Realisations
Possible values: Numeric, Labels, True/False


Expected Values and Variances

• System Level Random Variation – Challenge is Structure

• Rows/cols

• System level summaries

– Monte Carlo simulation “easy” • How long to simulate?

– Often Prob dist of system level summaries “hard”

– Often Exp Val, Var system level summaries “easier”

• Often suff to summarise random variation by variance


Common Systems of Random Variables

System level rvs
• Sums/differences of elementary indep rvs
• Linear combs (weighted averages)
Often Expected Values and Variances easy
Often Normal distribution useful
Sometimes recursions easy
• Max/min combs
Often probs easy
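The "easy" cases named on this slide rest on a few standard identities. As a reminder (the symbols a, b, X, Y and X_i are generic and not taken from the slides):

```latex
\begin{align*}
E[aX + bY] &= a\,E[X] + b\,E[Y] \\
\mathrm{Var}[aX + bY] &= a^2\,\mathrm{Var}[X] + b^2\,\mathrm{Var}[Y]
  \quad (X,\ Y \text{ independent}) \\
\Pr\bigl(\max(X_1,\dots,X_n) \le x\bigr) &= \prod_{i=1}^{n} \Pr(X_i \le x)
  \quad (X_i \text{ independent})
\end{align*}
```

The first holds for any X and Y; the other two need independence.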


Modelling Mechanics – Populating Rows/Cols

• Data Generating System – Derived Variables

• Random Number Generation – Transformations – Functions/Combinations of random vars

• Thought Experiments – How many replications

• What Summaries – Rel Freq – Averages Values and Variances

Objective?


Systems – examples

• Teams in league
– Games won by N_A, N_B, N_C
– Sums of binary rvs
• Student attendance at class
– Binary: Name Chosen/Not Chosen
– Given chosen: Presence/Absence
– Sums of binary rvs
• Sets of Dice
– Scores X_1, X_2, X_3
– Sums, Max: S_3, M_3


Model Decompose Splash Bean Machine


League as additive system

MiniLeague: 3 teams play each other once

Game        Prob of Winning
(i)   AB    Pr(A winner) = 0.7    Pr(B winner) = 0.3
(ii)  BC    Pr(B winner) = 0.4    Pr(C winner) = 0.6
(iii) AC    Pr(A winner) = 0.2    Pr(C winner) = 0.8

Simulated replications (one random number per game; Num Wins using COUNTIF)

       Games (random numbers)    Winner              Num Wins for            Outright
Reps   AB      BC      AC        (i) (ii) (iii) Win  A   B   C     Pts       winner
1      0.500   0.952   0.347     A   C    C     ACC  1   0   2     102       C
2      0.518   0.126   0.708     A   B    C     ABC  1   1   1     111       #N/A
3      0.972   0.358   0.653     B   B    C     BBC  0   2   1     021       B
4      0.985   0.851   0.910     B   C    C     BCC  0   1   2     012       C
5      0.982   0.499   0.735     B   C    C     BCC  0   1   2     012       C
6      0.234   0.587   0.168     A   C    A     ACA  2   0   1     201       A
7      0.584   0.390   0.482     A   B    C     ABC  1   1   1     111       #N/A
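The spreadsheet logic above can be sketched in Python. This is only an illustration, not the course spreadsheet: the function names (`play_league`, `outright_winner`, `simulate`), the seed and the number of replications are assumptions. Each game draws one uniform random number and awards the win to the first-named team when the number falls below that team's win probability, which is the rule the seven rows above follow.

```python
import random
from collections import Counter

# (first team, second team, Pr(first team wins)), as given on the slide.
GAMES = [("A", "B", 0.7), ("B", "C", 0.4), ("A", "C", 0.2)]

def play_league(rng):
    """One replication: return the winners of games (i), (ii), (iii), e.g. 'ACC'."""
    winners = []
    for first, second, p_first in GAMES:
        u = rng.random()  # plays the role of the AB / BC / AC random-number columns
        winners.append(first if u < p_first else second)
    return "".join(winners)

def outright_winner(winners):
    """Team with 2 wins, or None when every team has 1 win (the #N/A rows)."""
    team, wins = Counter(winners).most_common(1)[0]
    return team if wins == 2 else None

def simulate(n_reps, seed=2012):
    """Relative frequency of each outright winner over n_reps replications."""
    rng = random.Random(seed)
    counts = Counter(outright_winner(play_league(rng)) for _ in range(n_reps))
    return {team: count / n_reps for team, count in counts.items()}

rel_freq = simulate(100_000)
for team in ("A", "B", "C", None):
    print(team, round(rel_freq.get(team, 0.0), 3))
```

With many replications the relative frequencies should settle near the exact probabilities for the outright winner.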

League as additive system

Game        Prob of Winning
(i)   AB    Pr(A winner) = 0.7
(ii)  BC    Pr(B winner) = 0.4
(iii) AC    Pr(A winner) = 0.2

Decomposing via events

Winners         Rel              Points     Rel
(i)(ii)(iii)    Freq    Prob     for ABC    Freq    Prob
ABA             0.051   0.056    210        0.051   0.056
ABC             0.234   0.224    201        0.092   0.084
ACA             0.092   0.084    120        0.023   0.024
ACC             0.348   0.336    111        0.254   0.260
BBA             0.023   0.024    102        0.348   0.336
BBC             0.089   0.096    021        0.089   0.096
BCA             0.020   0.036    012        0.143   0.144
BCC             0.143   0.144
sum             1       1        sum        1       1

Outright    Rel
Winner      Freq    Prob
A           0.143   0.140
B           0.112   0.120
C           0.491   0.480
#N/A        0.254   0.260
sum         1       1

Decomposing via sums

Prob dist of games won (Prob Dist for N_A, N_B, N_C)

by     0       1       2       sum
A      0.240   0.620   0.140   1
B      0.420   0.460   0.120   1
C      0.080   0.440   0.480   1

E[N] and Var[N]

       Exp Val   Var
A      0.9       0.37
B      0.7       0.45
C      1.4       0.4
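The Prob columns in these tables can be checked by enumerating the 8 possible outcomes, since the three games are treated as independent. The sketch below is illustrative; the helper names (`outcome_probs`, `wins_distribution`, `mean_var`) are assumptions, not part of the course material.

```python
from itertools import product

# Pr(first-named team wins) for each game, from the slide.
P_FIRST = {"AB": 0.7, "BC": 0.4, "AC": 0.2}
GAME_ORDER = ("AB", "BC", "AC")

def outcome_probs():
    """Probability of each of the 8 winner strings, e.g. 'ACC'."""
    probs = {}
    for winners in product(*GAME_ORDER):  # one winner chosen from each game
        p = 1.0
        for game, winner in zip(GAME_ORDER, winners):
            p_first = P_FIRST[game]
            p *= p_first if winner == game[0] else 1.0 - p_first
        probs["".join(winners)] = p
    return probs

def wins_distribution(team, probs):
    """Probability distribution of the number of games won by `team`."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for winners, p in probs.items():
        dist[winners.count(team)] += p
    return dist

def mean_var(dist):
    """Expected value and variance of a distribution given as {value: prob}."""
    mean = sum(k * p for k, p in dist.items())
    var = sum((k - mean) ** 2 * p for k, p in dist.items())
    return mean, var

probs = outcome_probs()
for team in "ABC":
    dist = wins_distribution(team, probs)
    mean, var = mean_var(dist)
    print(team, {k: round(p, 3) for k, p in dist.items()}, round(mean, 2), round(var, 2))
```

The same expected values and variances also follow from treating each team's win count as a sum of two independent binary rvs, for example E[N_A] = 0.7 + 0.2 and Var[N_A] = 0.7(0.3) + 0.2(0.8).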

Student Absences as additive system

Dist # of absences

           Freq                                  Rel Freq
Num Abs    All    MSISS   JS (Eng/B&C/CSSL)      All    MSISS   JS (Eng/B&C/CSSL)
0          15     7       8                      0.20   0.21    0.20
1          18     9       9                      0.24   0.26    0.22
2          17     6       11                     0.23   0.18    0.27
3          12     4       8                      0.16   0.12    0.20
4          5      3       2                      0.07   0.09    0.05
5          5      2       3                      0.07   0.06    0.07
6          2      2       0                      0.03   0.06    0.00
7          1      1       0                      0.01   0.03    0.00
Total      75     34      41                     1      1       1

mean       2.03   2.18    1.90
var        2.83   3.73    2.04
SD         1.68   1.93    1.43

Question: Could the observed differences between the classes have arisen by chance? If so, the evidence that there is a real difference is weak.

We show that, under a model where the classes are not different, there is a >20% chance that ‘MSISS’ exceeds ‘Others’ by more than observed (2.18-1.90 = 0.27 before rounding). We show this both by replicating simulations and by theory for infinite reps.

Random Roll Call as sums: simple case

Spreadsheet: 10 students only (MSISS: students 1 to 4; Other: students 5 to 10); 30 classes; choose 2 per class; 60 chances to return as Abs. Here common Prob Abs: Pr Abs = 0.5. Here replications are the rows.

Columns: rep; 1st/2nd name chosen; Random Student; Student chosen (0/1 for students 1 to 10); Abs?; Student Abs if Chosen (0/1 for students 1 to 10); two further 0/1 columns.

rep Stud 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

1 1st 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2nd 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
3 1st 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2nd 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1st 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 2nd 10 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1
7 1st 7 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 2nd 4 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
9 1st 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 2nd 5 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1
...
59 1st 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
60 2nd 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0

Number of Abs Each (students 1 to 10): 5 5 3 7 5 1 3 2 1 4
Total Abs: MSISS 20, Other 16
Avg: MSISS 20/4, Other 16/6

• Elementary rvs: the binary (0/1) columns
• System rvs: the sums and averages

Key: sums of binary random variables. Pr(name called and absent) = 0.25(1/10)

Here, to simplify things, using sampling with replacement. Technical detail only.

https://www.scss.tcd.ie/John.Haslett/st2004/Week 12/PseudoAbsData.xlsx

Theory – 10 students, common prob

#Abs each = sum of 60 binary rvs Y
MSISS Total = sum of 240 binary rvs Y
Other Total = sum of 360 binary rvs Y

Pr(Y=1) = q = 1/10*0.25 = 0.025
E[Y] = q
Var[Y] = q(1-q) ≈ q

E[#each] = 60q; Var[#each] ≈ 60q; #each ~ N(60q, 60q), CLT approx
E[MSISS Tot] = 240q; Var[MSISS Tot] ≈ 240q; MSISS Tot ~ N(240q, 240q), CLT approx
MSISS Avg = Total/4; MSISS Avg ~ N(60q, 15q)
Other Avg = Total/6; Other Avg ~ N(60q, 10q)

Binomial dist for each: sums of binary rvs. Good Poisson approx when q is small.
#each ~ B(60, q), approx P(60q)
MSISS Tot ~ B(240, q), approx P(240q), approx Normal
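How good the Poisson approximation is for small q can be checked numerically. The sketch below compares B(60, q) with P(60q) for q = 0.025, the values used on this slide; the function names are illustrative only.

```python
from math import comb, exp, factorial

N_CHANCES = 60   # chances for each student to return as absent (from the slide)
Q = 0.025        # Pr(name called and absent) = (1/10) * 0.25 (from the slide)

def binom_pmf(k, n, q):
    """Pr(X = k) for X ~ B(n, q)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = N_CHANCES * Q  # Poisson mean 60q
print("k  Binomial  Poisson")
for k in range(7):
    print(k, round(binom_pmf(k, N_CHANCES, Q), 4), round(poisson_pmf(k, lam), 4))

# Largest discrepancy between the two probability functions.
max_diff = max(
    abs(binom_pmf(k, N_CHANCES, Q) - poisson_pmf(k, lam))
    for k in range(N_CHANCES + 1)
)
print("max abs difference:", round(max_diff, 4))
```

The exact binomial variance is 60q(1-q), slightly below the Poisson variance 60q, which is the "≈ q" step in the theory above.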

Using Simulations to Test Even with common prob absence

MSISS Avg ≠ Other Avg

• How big a difference is possible?

– 500 replicates of 30 classes 75 students

– 500 differences

– Prop(sim mean diff > observed mean diff) = 0.25

• Conclude: nothing remarkable about observed mean diff, even if classes have common prob abs


Summaries of 500 reps

MSISS Other Diff

Mean 3.02 3.01 0.01

SD 0.80 0.57 1.13
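A simulation test of this kind can be sketched as follows. This is not the spreadsheet behind the figures above and will not reproduce them exactly. The assumptions, all labelled in the code, are: 34 MSISS and 41 Other students (from the absences table), 300 name calls per replicate, a common Pr(absent when called) of 0.51, and names drawn with replacement.

```python
import random

N_MSISS, N_OTHER = 34, 41   # group sizes from the absences table
N_CALLS = 300               # ASSUMED number of name calls per replicate
P_ABS = 0.51                # ASSUMED common Pr(absent | name called)
OBS_DIFF = 0.27             # observed mean difference, MSISS minus Others

def one_replicate(rng):
    """Simulate one term; return MSISS avg absences minus Other avg absences."""
    n_students = N_MSISS + N_OTHER
    abs_msiss = abs_other = 0
    for _ in range(N_CALLS):
        student = rng.randrange(n_students)   # name drawn with replacement
        if rng.random() < P_ABS:              # same Pr(abs) for both groups
            if student < N_MSISS:
                abs_msiss += 1
            else:
                abs_other += 1
    return abs_msiss / N_MSISS - abs_other / N_OTHER

def prop_exceeding(n_reps=500, seed=12):
    """Proportion of simulated mean differences larger than the observed one."""
    rng = random.Random(seed)
    diffs = [one_replicate(rng) for _ in range(n_reps)]
    return sum(d > OBS_DIFF for d in diffs) / n_reps

print("Prop(sim mean diff > observed mean diff):", prop_exceeding())
```

If this proportion is not small, the observed difference is unremarkable under the common-probability model.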

Using Probability: to test Even with common prob absence

MSISS Avg ≠ Other Avg

• How big a difference is possible?

– Infinitely many replicates of 30 classes, 75 students

– Use Normal distribution to compute

– Pr(random mean diff > observed mean diff) = 0.21

– Theory follows


[Figure: Histogram of MSISS, Other, Diff, with fitted Normal curves; vertical axis Frequency. MSISS: Mean 2.961, StDev 0.7345, N 500. Other: Mean 2.999, StDev 0.5716, N 500. Diff: Mean -0.03833, StDev 1.055, N 500.]

Student Absences


Model: No real diff between student groups,

ie the diff of 0.27 could have arisen by chance.

Compute Pr(mean diff > 0.27 by chance alone);

if this is not small, there is no evidence that the model is not OK.

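Options
1 Many simulations
2 Thought expt with ∞ reps

Theory, with common Pr(Abs) taken as 0.51, so that q = 0.51 × (1/75):

\[
\bar{Y}_{\text{MSISS}} \sim N\!\left(300q,\ \frac{300q}{34}\right); \qquad
\bar{Y}_{\text{Others}} \sim N\!\left(300q,\ \frac{300q}{41}\right)
\]

\[
\text{RandDiff} = \bar{Y}_{\text{MSISS}} - \bar{Y}_{\text{Others}}
\sim N\!\left(0,\ 300q\left(\frac{1}{34} + \frac{1}{41}\right)\right)
\]

\[
\Pr(\text{RandDiff} > 0.27)
= 1 - \Phi\!\left(\frac{0.27 - 0}{\sqrt{300q\left(\frac{1}{34} + \frac{1}{41}\right)}}\right)
\approx 0.2
\]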
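The Normal calculation on this slide can be reproduced in a few lines. The sketch assumes, as read from the slide, 300 chances per student, a common Pr(Abs) of 0.51 and q = 0.51/75; the Normal CDF is built from `math.erf`, so nothing beyond the standard library is needed.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

N_CHANCES = 300             # chances per student (as read from the slide)
Q = 0.51 / 75               # ASSUMED: common Pr(Abs) 0.51 times Pr(called) 1/75
N_MSISS, N_OTHER = 34, 41   # group sizes from the absences table
OBS_DIFF = 0.27             # observed mean difference

# Variance of the difference of two independent group averages.
var_diff = N_CHANCES * Q * (1 / N_MSISS + 1 / N_OTHER)
sd_diff = sqrt(var_diff)

# Pr(RandDiff > observed) under the no-difference model (mean difference 0).
p_exceed = 1.0 - normal_cdf((OBS_DIFF - 0.0) / sd_diff)
print("SD of RandDiff:", round(sd_diff, 3))
print("Pr(RandDiff > 0.27):", round(p_exceed, 2))
```

The result agrees with the "≈ 0.2" quoted for the theory approach.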

What’s the use of Theory?
• Estimating Pr(abs)
• Testing
– What if student groups not different?
• Avoid simulation

\[
p = \Pr(\text{Abs}), \qquad
E[\#\text{Abs each student}] = 300 \cdot \frac{1}{75} \cdot p = 4p
\]

\[
\hat{p} = \frac{1}{4}\,\text{ObsAvg}[\#\text{Abs each student}]
\]

\[
\text{Equiv} \quad \hat{p} = \frac{\text{Avg}[\#\text{Abs each day}]}{\text{Number of Students}}
\]

Alt, if simulation preferred: Speed-up
• Simulate each from B(n, q), then form total
• Simulate total directly from B(n × #students, q), then form avg
• Simulate avg directly from Normal dist

Validate simulation
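A rough way to validate a speed-up is to run the slow and the fast version side by side and compare their summaries. The sketch below does this for the MSISS average only. The values 300 chances, q = 0.51/75 and 34 students are assumptions carried over from the theory slide, and the function names are illustrative. The slow version sums elementary binary rvs; the fast version draws the average directly from the Normal approximation. The intermediate Binomial options are left out because older Python versions have no binomial sampler in the standard library.

```python
import random
from math import sqrt
from statistics import mean, stdev

N_CHANCES = 300      # ASSUMED chances per student
Q = 0.51 / 75        # ASSUMED Pr(called and absent) per chance
N_MSISS = 34         # MSISS group size from the absences table

def avg_by_binary_sums(rng):
    """Slow: sum every elementary binary rv for the group, then average."""
    total = sum(rng.random() < Q for _ in range(N_CHANCES * N_MSISS))
    return total / N_MSISS

def avg_by_normal(rng):
    """Fast: draw the group average directly from its Normal approximation."""
    mu = N_CHANCES * Q
    sd = sqrt(N_CHANCES * Q * (1 - Q) / N_MSISS)
    return rng.gauss(mu, sd)

rng = random.Random(2004)
slow = [avg_by_binary_sums(rng) for _ in range(200)]
fast = [avg_by_normal(rng) for _ in range(200)]

# Validation: the two methods should give similar means and SDs.
print("binary sums: mean", round(mean(slow), 3), "SD", round(stdev(slow), 3))
print("Normal     : mean", round(mean(fast), 3), "SD", round(stdev(fast), 3))
```

The Normal shortcut costs one random draw per replicate instead of thousands, at the price of relying on the CLT approximation.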

Student Absences

• Simplest Model

– Fixed Pr(abs for randomly chosen MSISS/Other)

• Richer model - approached similarly

– Pr(Abs) increases over time

– From “small” to “large”

– Possibly different rate of increase

– Odds(Abs)= + group t


Bootstrap - Alternative Approach

Avoid Theoretical distributions

Draw samples from empirical dist


Sample binary random variables and sum equiv sample from Binomial distribution

Sample from observed distribution equiv resample data with replacement

Opinion Polls Methodology

For all national population opinion polls RED C interview a random sample of 1,000+ adults aged 18+ by telephone. This sample size is the recognised sample required by polling organisations for ensuring accuracy on political voting intention surveys. The accuracy level is estimated to be approximately plus or minus 3 per cent on any given result at 95% confidence levels.


Electorate Sample Size ±3%

Ireland ~ 2,000,000 1000

USA ~200,000,000 ?

ST2004 2012 Week 11 11/12/2012

Bootstrap Precision

• Sample of 1000 selected randomly
– Voters in Colorado
• 51.1% Democrat
• But, what if repeated..?
• “Data like this, but randomly different”
– Re-sample 1000, from 1000, ‘with replacement’
– Repeat many times
– Replications → Precision, Prob > 0.50
– 95% Conf Interval (cf formula method)
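The re-sampling recipe above can be sketched directly. The data are represented as 511 ones (Democrat) and 489 zeros, matching 51.1% of 1000; the number of bootstrap replications (2000), the seed and the percentile form of the interval are assumptions made for illustration.

```python
import random
from statistics import stdev

# Observed sample: 1000 voters, 51.1% Democrat (1 = Democrat, 0 = other).
sample = [1] * 511 + [0] * 489

def bootstrap_props(data, n_boot=2000, seed=11):
    """Re-sample len(data) values with replacement, n_boot times; return proportions."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_boot)]

props = sorted(bootstrap_props(sample))

sd = stdev(props)                                    # precision of the proportion
lower = props[int(0.025 * len(props))]               # 2.5th percentile
upper = props[int(0.975 * len(props)) - 1]           # 97.5th percentile
prob_above_half = sum(p > 0.5 for p in props) / len(props)

print("bootstrap SD:", round(sd, 4))
print("95% interval:", lower, "to", upper)
print("Prop(replications > 0.50):", round(prob_above_half, 2))
```

The bootstrap SD can be compared with the formula method, where the SD of the proportion is the square root of 0.511(1 - 0.511)/1000.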


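Precision by Thought Expt

Re-sample 1000, from 1000, ‘with replacement’

Replications → Precision

Bernoulli Process (to be completed)

\[
\text{Define } N = \#\text{Dem in 1000 sample}; \quad p = 0.511
\]

\[
N \sim\ ? \qquad E[N] =\ ? \qquad \mathrm{Var}[N] =\ ?
\]

\[
\text{Define } \hat{p} = \frac{N}{1000}: \qquad E[\hat{p}] =\ ? \qquad \mathrm{Var}[\hat{p}] =\ ? \qquad SD[\hat{p}] =\ ?
\]

\[
95\%\ \text{Precision} \approx \pm 2\,SD[\hat{p}]
\]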


Precision by Thought Expt: Answers

Re-sample 1000, from 1000, ‘with replacement’

Replications → Precision

Bernoulli Process

\[
\text{Define } N = \#\text{Dem in 1000 sample}; \quad p = 0.511
\]

\[
N \sim B(1000,\ 0.511) \qquad E[N] = 511 \qquad \mathrm{Var}[N] = 250
\]

\[
\text{Define } \hat{p} = \frac{N}{1000}: \qquad E[\hat{p}] = 0.511 \qquad \mathrm{Var}[\hat{p}] = 0.000250 \qquad SD[\hat{p}] = 0.016
\]

\[
95\%\ \text{Precision} \approx \pm 2\,SD[\hat{p}]
\]

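Precision in Opinion Polls

n = 1000 ⇒ precision ≈ 3%

\[
\text{Define } Y = \text{Number 'Favourable'}; \qquad \hat{p} = \frac{Y}{n} = \text{Prop 'Favourable'}
\]

Theory: model y as a realisation of

\[
Y \sim B(n, p)
\]

\[
Y \sim N\bigl(np,\ np(1-p)\bigr) \quad \text{approx}
\]

\[
\hat{p} \sim N\!\left(p,\ \frac{1}{n}\,p(1-p)\right) \quad \text{approx}
\]

\[
95\%\ \text{of values of } \hat{p} \text{ in range } p \pm 2\sqrt{\frac{1}{n}\,p(1-p)}
\]

\[
n = 1000,\ p = 0.5: \qquad 0.5 \pm 2\sqrt{\frac{1}{1000}\,0.5(1-0.5)} = 0.5 \pm 0.032
\]

As percentages: 50% ± 3.2%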

Normal Approx to B(n,p): Sample Size Application

Precision in Opinion Polls

n = 1000 ⇒ precision ≈ 3%

\[
n = 100,\ p = 0.5: \qquad 0.5 \pm 2\sqrt{\frac{1}{100}\,0.5(1-0.5)} = 0.5 \pm 0.1
\]

As percentages: 50% ± 10%

n = 400, p = 0.5: 50% ± 5%

n = 400, p = 0.8: 80% ± 4%
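The precision figures on these slides all come from the same rule, plus or minus 2 standard deviations of the sample proportion. A small helper (the name `precision_95` is illustrative, not from the course) makes it easy to try other n and p:

```python
from math import sqrt

def precision_95(n, p):
    """Approximate 95% precision of a sample proportion: 2 * sqrt(p(1-p)/n)."""
    return 2.0 * sqrt(p * (1.0 - p) / n)

# The (n, p) cases worked on the slides.
for n, p in [(1000, 0.5), (100, 0.5), (400, 0.5), (400, 0.8)]:
    half_width = precision_95(n, p)
    print(f"n = {n:4d}, p = {p}: {100 * p:.0f}% +/- {100 * half_width:.1f}%")
```

Since the precision depends on n only through 1/sqrt(n), quadrupling the sample size halves the margin. The size of the electorate does not enter the formula at all.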

Challenge
• What sample size needed for 95% precision 10%?
• What sample size needed for 99.7% precision 10%?
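One way to approach the challenge is to invert the precision formula: if the margin is k standard deviations, then n = p(1-p)(k/margin)^2. The sketch below assumes the usual Normal rules of thumb (k = 2 for 95%, k = 3 for 99.7%) and the worst case p = 0.5; the function name `sample_size` is hypothetical.

```python
from math import ceil

def sample_size(margin, n_sd, p=0.5):
    """Smallest n with n_sd * sqrt(p(1-p)/n) <= margin.

    margin : desired precision as a proportion (0.10 for 10%)
    n_sd   : number of SDs for the coverage wanted (2 for 95%, 3 for 99.7%)
    p      : assumed true proportion; 0.5 is the worst case
    """
    n_exact = p * (1.0 - p) * (n_sd / margin) ** 2
    # Subtract a tiny amount before rounding up to guard against
    # floating-point round-off pushing an exact answer over the next integer.
    return ceil(n_exact - 1e-9)

print("95%   precision 10%:", sample_size(0.10, n_sd=2))
print("99.7% precision 10%:", sample_size(0.10, n_sd=3))
print("95%   precision  3%:", sample_size(0.03, n_sd=2))
```

The last line shows why polling companies settle on samples of about 1000 for a margin near 3%.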

Law of Small Numbers

http://www.greenbookblog.org/2012/05/11/how-myths-are-formed-the-law-of-small-numbers-market-research/

The site discusses how people treat sample sizes as interchangeable. This is most likely due to overuse of percentages when dealing with statistics. For example, many cosmetic advertisements say things like “86% of people agree” or “91% of people said this was their favourite”, which on the surface makes the product seem amazing. However, on closer inspection the advertisement will read “based on a sample of 87 people”. For most people this won’t start alarm bells ringing, because surely this percentage will apply on a larger scale. The site explains that when a sample is smaller the sampling error is larger, which leads to extreme statements based on conclusions drawn from samples too small to be representative of the majority.
