Statistical Methods Bayesian methods 3 Daniel Thorburn Stockholm University 2012-04-03


Page 1

Statistical Methods Bayesian methods 3

Daniel Thorburn

Stockholm University

2012-04-03

Page 2

Slides presented last time

Page 3

Rational behaviour – one person

• Axiomatic foundation of probability. Type:
– For any two events A and B, exactly one of the following must hold: A < B, A > B or A ~ B (read: A less likely than B, B less likely than A, or the two equally likely).
– If A1, A2, B1 and B2 are four events such that A1 ∩ A2 = B1 ∩ B2 = ∅ (both pairs disjoint), A1 ≥ B1 and A2 ≥ B2, then A1 ∪ A2 ≥ B1 ∪ B2. If further A1 > B1 or A2 > B2, then A1 ∪ A2 > B1 ∪ B2.
– … (see next page)
• If these axioms hold, all events can be assigned probabilities which obey Kolmogorov's axioms (Villegas, Annals of Mathematical Statistics, 1964).
• Axioms for behaviour. Type:
– If you prefer A to B, and B to C, then you must also prefer A to C.
– …
• If you want to behave rationally, then you must behave as if all events were assigned probabilities (Anscombe and Aumann, Annals of Mathematical Statistics, 1963).

Page 4

• Axioms for probability (these six, plus the definition of conditional probability, are enough to prove that a probability following Kolmogorov's axioms can be defined):

– For any two events A and B, exactly one of the following must hold: A < B, A > B or A ~ B (read: A less likely than B, B less likely than A, or the two equally likely).

– If A1, A2, B1 and B2 are four events such that A1 ∩ A2 = B1 ∩ B2 = ∅, A1 ≥ B1 and A2 ≥ B2, then A1 ∪ A2 ≥ B1 ∪ B2. If further A1 > B1 or A2 > B2, then A1 ∪ A2 > B1 ∪ B2.

– If A is any event, then A ≥ ∅ (the impossible, empty event).

– If Ai is a strictly decreasing sequence of events and B a fixed event such that Ai ≥ B for all i, then ∩i Ai (the intersection of all Ai) ≥ B.

– There exists a random variable which has a uniform distribution.

– For any events A, B and D: (A | D) ≤ (B | D) if and only if A ∩ D ≤ B ∩ D.

Page 5

• Further, one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality:

– For any two outcomes A and B, one either prefers A to B, or B to A, or is indifferent.

– If you prefer A to B, and B to C, then you must also prefer A to C.

– If P1 and P2 are two distributions over outcomes, they may be compared, and you are indifferent between A and the distribution with P(A) = 1.

– Two measurability axioms, like: if A is any outcome and P a distribution, then the event that P gives an outcome preferred to A can be compared to other events (more likely, …).

– If P1 is preferred to P2 and A is an event with A > ∅, then the game giving P1 if A occurs is preferred to the game giving P2 if A occurs, provided the results under not-A are the same.

– If you prefer P1 to P and P to P2, then there exist numbers a > 0 and b > 0 such that P1 with probability 1 − a and P2 with probability a is preferred to P, which is preferred to P1 with probability b and P2 with probability 1 − b.

Page 6

There is only one type of numbers, which may be known or unknown.

• Classical inference has a mess of different types of numbers, e.g.:
– Parameters
– Latent variables, as in factor analysis
– Random variables
– Observations
– Independent (explanatory) variables
– Dependent variables
– Constants
– and so on

• Superstition!

Page 7

Outline

10. Inference – intervals and proving scientific results
11. Probability assessment
12. Linear models
13. Predictive distributions

Page 8

10. Inference – intervals and proving scientific results

Page 9

Intervals

• Definition of confidence intervals:
– An interval constructed in this way will in the long run cover the true value in a proportion 1 − α of all cases, if the procedure is repeated many, many times.
– Like a person throwing rings (or horse-shoes) around a peg: if he is skilful he will get the ring around the peg in 95% of all cases. It is a property of the person, not of a particular throw.

• Definition of probability intervals:
– The true value lies with probability 1 − α in the interval in this particular case (given what is known).
– The interval is fixed and known; the probability statement refers to the unknown quantity.
– Synonyms (roughly): credibility intervals, prediction intervals.

Page 10

Probability intervals

• An interval (a, b) such that

  $1 - \alpha = \int_a^b f(x \mid X)\, dx$

• Synonyms: prediction intervals, credibility intervals.
• Highest Posterior Density (HPD) interval (under unimodality): choose (a, b) such that

  $f(a) = f(b) \le f(x)$ for $a \le x \le b$, and
  $f(a) = f(b) \ge f(x)$ for $x < a$ or $x > b$.

Page 11

Shortest possible intervals

• Find two numbers a < b minimising the expected loss, for some positive constants c and d, where the loss is

  $L(a, b, \theta) = c(b - a) + d\, I(\theta \notin (a, b))$

• Minimise

  $E(L(a, b, \theta)) = c(b - a) + d\left(1 - \int_a^b f(\theta)\, d\theta\right)$

• Differentiate with respect to a and b.

• Putting the derivatives equal to 0 gives $d\, f(a) = c$ and $d\, f(b) = c$, which shows that this is an HPD interval (if the density is unimodal).

Note that this interval may be empty if the distribution is widely spread, i.e. if $f(x) < c/d$ for all x.

Page 12

Comment

• One often speaks about HPD regions, which are regions defined by {θ : f(θ) ≥ c} for some positive value c.

• In one dimension the region can consist of several intervals if the distribution is not unimodal.

• In several dimensions it is often used; e.g. for bivariate normal distributions it is an ellipse.

Page 13

”Hypothesis testing”

• One may differentiate between (at least) two problems:

1. Convince yourself

2. Prove the fact – convince all others

• These problems need different approaches and priors.
– Much of the previous discussion has been about your own subjective opinion.

Page 14

Prove scientific facts to others

• It is very easy to convince people who believe in the fact from the beginning.

• It is often fairly simple to convince yourself, even if you are broad-minded.

• But to prove a scientific fact, you must also convince all those who have reasonable doubts.

Page 15

Prove scientific facts

• $P(\theta \mid X, K) \propto P(X \mid \theta)\, P(\theta \mid K)$, where K denotes a person's prior knowledge.

• A person is convinced of a fact when his posterior probability for the fact is close to one.

• But to prove the fact scientifically, this must hold for all reasonable priors P(θ | K), including those describing reasonable doubt.

• Even if there is no actual person whose prior is K', this must hold also for that prior, as long as K' is reasonable.

• I.e. a result is ”proved” if inf { P(θ | X, K) : K reasonable } > 1 − ε for some small ε.

You must check for many possible priors that might be reasonable. Try priors that assign small probabilities to the statement that you intend to prove.
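To make the infimum over reasonable priors concrete, here is a minimal sketch of my own: the fact to be proved is θ > 0.5 for a Bernoulli parameter after 18 successes in 20 trials, and a hand-picked grid of sceptical Beta priors stands in for "reasonable doubt":

```python
from scipy.stats import beta

successes, failures = 18, 2
sceptical_priors = [(1, 1), (1, 5), (1, 20), (0.5, 0.5), (2, 10)]

posteriors = []
for a0, b0 in sceptical_priors:
    post = beta(a0 + successes, b0 + failures)  # conjugate Beta update
    posteriors.append(post.sf(0.5))             # P(theta > 0.5 | data)

print(min(posteriors))  # the infimum over the priors tried
```

If even the most sceptical prior on the grid yields a posterior probability above 1 − ε, the result counts as proved in the slide's sense; otherwise more data (or a better design) is needed.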

Page 16

• Reporting: use vague priors, but also show what the consequences are for some priors with (un)reasonable doubt.

• When you prove something, all available data should be used. Type: meta-analysis. In some fields one study is usually not enough to convince people.

• Designing experiments: design your experiments so that you maximise E( inf { P(θ | X, K) : K reasonable } | K_YOU ) (if you yourself are convinced).

Page 17

What is reasonable doubt? Convincing others

• You have to contemplate what is meant by reasonable doubt.

• It depends on the importance of the subject.

• It can be just putting a very small prior probability on the fact to be proven.

• But you must also try to find the possible flaws in your theory and design your experiments to counter them.

Page 18

Priors with reasonable doubt

• Use priors with reasonable doubt.
– In an experiment to prove telepathic effects you could e.g. use priors like P(log-odds ratio = 0) = 0.9999.
  • If the log-odds ratio is different from 0, it may be modelled as N(0, s2), where s2 may be moderate or large.
  • If the posterior then says e.g. that P(p > 0.5) > 0.95, you may consider the fact as proved. (Roughly, this means that you need about 20 successes in a row, where a random guess has probability ½ of being correct; see the sketch below.)
– Never use the prior P = 1, since you must always be open-minded (only fundamentalists do so; they will never change their opinion whatever the evidence).
– In more standard situations you will probably not need quite so negative priors.
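The "about 20 successes" claim can be checked under simplifying assumptions. The sketch below is my own: it replaces the slide's N(0, s2) log-odds model with a uniform prior on the success probability under the alternative, which is cruder but keeps the arithmetic transparent:

```python
# H0: p = 1/2 (pure guessing) with prior probability 0.9999.
# Assumed alternative: p ~ Uniform(0, 1), so the marginal likelihood
# of n straight successes under H1 is the integral of p^n dp = 1/(n+1).
prior_h1 = 0.0001

for n in range(1, 31):
    ml_h0 = 0.5 ** n          # n successes under pure guessing
    ml_h1 = 1.0 / (n + 1)     # n successes with p ~ Uniform(0, 1)
    post_h1 = prior_h1 * ml_h1 / (prior_h1 * ml_h1 + (1 - prior_h1) * ml_h0)
    if post_h1 > 0.95:
        print(f"n = {n}: P(H1 | data) = {post_h1:.3f}")
        break
```

Under these assumptions the 0.95 threshold is passed at n = 23 straight successes; with a point alternative p = 1 it is passed around n = 18, bracketing the slide's "about 20".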

Page 19

11. Probability assessment

Page 20

Probability assessments

[Two charts were shown here: ”Probability of the outcomes under your assessments and independence” and ”Total square error of your assessments”; in both, the upper dot marks total ignorance.]

Your results (correct answers: the sun shined, OMX30 lower, Sweden beat Canada, no medal, ?, Oslo more to the south, Swedish GDP less than 4 billions).

Abrar Raza Khan was best according to both measures: L = Π pi = 0.081 resp. MSE = 0.82.

The results were not quite as spread out as last time, but still the average was not better than chance (5 resp. 6 did worse than chance).

Page 21

There are many studies on how people assess probability

• Small familiar risks are underestimated (e.g. the risk of lung cancer as judged by smokers, or of driving accidents as judged by drivers).

• Small unfamiliar risks are overestimated (e.g. the health risk of E211 in food is assessed as higher than that of sodium benzoate – the same substance – by people making jam at home, and E160c is rated as riskier than paprika extract).

• People confuse events with conditions:
– E.g. P(the beauty contest was won by a shortsighted librarian) is usually assessed as smaller than P(the beauty contest was won by a young blonde shortsighted librarian in a bathing suit), although the first event contains the second.

• The probability of winning the top prizes in lotteries is usually said to be overestimated, but many of the studies have not taken into account the value of the pleasure in having the chance (speculating about what to do).

Page 22

Continuous priors

• There are many suggested methods in the literature for finding priors.

• Without assumptions on the form:
– Betting (see lecture 1)
– Successively dividing the range into equally probable intervals (see the sketch below):
  • Try to find a value such that the probabilities that the true value is above or below it are equal (= ½).
  • Suppose the value is above this point. Try to find a value such that the probabilities of being above or below are equal (= ½), given that it is above.
  • Suppose the value is below this point. Try to find a value in the similar way, given that it is below.
  • Continue dividing the intervals.
– and so on
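A sketch of what the bisection method produces and how it can be used afterwards. The elicited answers below are hypothetical, standing in for an expert's judgements about an unknown quantity; each bisection step pins down one more quantile of the prior:

```python
import numpy as np

elicited = {          # probability level -> elicited value (made up)
    0.50: 0.30,       # "equally likely above or below 0.30"
    0.25: 0.20,       # median of the lower half
    0.75: 0.45,       # median of the upper half
    0.125: 0.15, 0.375: 0.25, 0.625: 0.36, 0.875: 0.60,
}

levels = sorted(elicited)
values = [elicited[p] for p in levels]

# Piecewise-linear approximation of the prior CDF through the
# elicited quantiles; usable for sampling by inversion.
def sample_prior(n, rng=np.random.default_rng(1)):
    u = rng.uniform(levels[0], levels[-1], size=n)
    return np.interp(u, levels, values)

print(sample_prior(5))
```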

Page 23

Assessing continuous priors

• Of a special form (usually a conjugate):
– Give the most likely value and how many observations are needed for the prior and the posterior to contain the same amount of information. Identify the parameters.
– Give the most likely value and the expected standard deviation of the uncertainty. (This is not advised for persons without some statistical training.)
– Give two values: one upper bound that you are pretty certain the true value does not exceed (you should be surprised if it does), and one lower bound in the similar way. Say that this corresponds to a 95 % probability interval and identify the parameters. (See the sketch below.)
– Give three values: the most likely value and two bounds as above. (Identify the parameters of a skewed distribution, e.g. a Weibull.)
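A minimal sketch of the two-bounds recipe, assuming a normal prior; the bounds are hypothetical examples, not from the slides:

```python
from scipy.stats import norm

lower, upper = 10.0, 30.0          # "surprised if outside (10, 30)"
z = norm.ppf(0.975)                # about 1.96

mu = (lower + upper) / 2           # centre of the elicited interval
sigma = (upper - lower) / (2 * z)  # half-width corresponds to 1.96 sd

print(f"Normal prior: mean {mu:.1f}, sd {sigma:.2f}")
```

The same idea works for other conjugate families (e.g. a Beta or a Gamma prior); one then solves the two quantile equations numerically instead of in closed form.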

Page 24

12. Linear models

Page 25

Theorem

• Let (X, Y) have a bivariate normal distribution with means (mx, my) and variances/covariance σxx, σyy and σxy.

• Then the distribution of Y given X = x is normal with mean my + (σxy/σxx)(x − mx) and variance σyy − σxy²/σxx.

• Apply this to a simple Bayesian model: (θ, X) has a bivariate normal distribution with means (m, m) and variances/covariance τ², τ² + σ² and τ².

• Then the posterior given X = x is normal with mean m + τ²/(τ² + σ²)·(x − m) = (σ²m + τ²x)/(τ² + σ²) and variance τ² − τ⁴/(τ² + σ²) = τ²σ²/(τ² + σ²). (A numerical sketch follows below.)
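A minimal numeric sketch of this normal–normal update (my own illustration; the numbers are arbitrary):

```python
def posterior(m, tau2, sigma2, x):
    """Prior theta ~ N(m, tau2); observation X | theta ~ N(theta, sigma2)."""
    weight = tau2 / (tau2 + sigma2)
    mean = m + weight * (x - m)            # = (sigma2*m + tau2*x)/(tau2+sigma2)
    var = tau2 * sigma2 / (tau2 + sigma2)  # posterior variance
    return mean, var

print(posterior(m=0.0, tau2=4.0, sigma2=1.0, x=2.5))  # -> (2.0, 0.8)
```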

Page 26

Multivariate version

– Let (X, Y) be a pair of row vectors having a multivariate normal distribution with means (mx, my) and covariance blocks Σxx, Σyy and Σxy.

– Then the distribution of Y given X = x is normal with mean my + Σyx Σxx⁻¹ (x − mx) and variance Σyy − Σyx Σxx⁻¹ Σxy.

• This can be applied to a Bayesian linear model:
– Let θ be a vector of parameters with a normal prior with mean m and covariance matrix T.
– Let X have a multivariate normal distribution given θ, with mean Aθ and covariance matrix Σ.
– Then (θ, X) is normal with mean (m, Am) and covariance matrix

  $\begin{pmatrix} T & TA' \\ AT & ATA' + \Sigma \end{pmatrix}$

– The posterior given X = x is normal with mean m + TA'(ATA' + Σ)⁻¹(x − Am) and variance T − TA'(ATA' + Σ)⁻¹AT. (See the sketch below.)
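A numpy sketch of these posterior formulas (my own illustration of the slide's result, not code from the lecture):

```python
import numpy as np

def linear_model_posterior(m, T, A, Sigma, x):
    """Prior theta ~ N(m, T); data X | theta ~ N(A theta, Sigma)."""
    S = A @ T @ A.T + Sigma            # marginal covariance of X
    K = T @ A.T @ np.linalg.inv(S)     # "gain" matrix TA'(ATA' + Sigma)^-1
    post_mean = m + K @ (x - A @ m)
    post_cov = T - K @ A @ T
    return post_mean, post_cov
```

The same function applies directly to the piglet example on the following slides.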

Page 27

Example

• Consider an agricultural study on six piglets, with two explanatory variables which are strongly collinear. The piglets stem from three litters.

• Model: Yi = a + b·x1i + c·x2i + γk(i) + εi, where all error terms are iid Normal(0, 1). (For simplicity, known and equal variances.)

• This is a multilevel study. (Too small to be realistic; it can easily be analysed by good multilevel computer packages like MLwiN.)

Page 28

Data

   y   x1  x2  Litter
  32    1  47  A
 141    2  53  B
 544    6  90  B
 528    6  97  C
 336    4  76  C
 239    3  65  C

Parameters: a = 47, b = 129, c = −3; litter effects γA = 0.87, γB = −0.59, γC = −1.2.

Page 29

Identification of arguments

• Prior: m = (50, 1, 0, 0, 0, 0)
• Prior covariance matrix: T = diag(1000, 1000, 50, 1, 1, 1) (large uncertainty)

• Data: y = (32, 141, 544, 528, 336, 239)
• Design matrix:

  A = | 1  1  47  1  0  0 |
      | 1  2  53  0  1  0 |
      | 1  6  90  0  1  0 |
      | 1  6  97  0  0  1 |
      | 1  4  76  0  0  1 |
      | 1  3  65  0  0  1 |

Page 30

Results

• Estimates: a* = 27.6, b1* = 123.4, b2* = −2.48
• Covariance matrix:

  |  89.9   23.2  -2.45  |
  |  23.2   6.24  -0.65  |
  | -2.45  -0.65  0.067  |

• Note that the correlation between the estimates of the two regression coefficients is very close to −1. The Bayesian analysis tells us that we cannot differentiate between them. Do not worry in the analysis – the apparatus takes care of that.

• We could have done the analysis successively, one observation at a time. The number of observations does not have to be larger than the number of parameters, as long as the prior is proper. (The sketch below reruns the computation.)
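As a check, the posterior formulas can be evaluated with the inputs from the previous slide. A sketch of my own; allowing for rounding, the printed values should be close to the reported estimates and covariance matrix:

```python
import numpy as np

m = np.array([50.0, 1.0, 0.0, 0.0, 0.0, 0.0])
T = np.diag([1000.0, 1000.0, 50.0, 1.0, 1.0, 1.0])
A = np.array([
    [1, 1, 47, 1, 0, 0],
    [1, 2, 53, 0, 1, 0],
    [1, 6, 90, 0, 1, 0],
    [1, 6, 97, 0, 0, 1],
    [1, 4, 76, 0, 0, 1],
    [1, 3, 65, 0, 0, 1],
], dtype=float)
Sigma = np.eye(6)                     # known unit error variance
y = np.array([32.0, 141.0, 544.0, 528.0, 336.0, 239.0])

S = A @ T @ A.T + Sigma               # marginal covariance of the data
K = T @ A.T @ np.linalg.inv(S)
post_mean = m + K @ (y - A @ m)
post_cov = T - K @ A @ T

print(post_mean[:3])                  # estimates of a, b1, b2
print(post_cov[:3, :3])               # their posterior covariance
```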

Page 31

13. Predictive distributions

Page 32

Prediction

• Suppose that we intend to study a new piglet from a new litter, with explanatory values x1 = 5 and x2 = 82.

• Its expected value given the data is a* + b1*·5 + b2*·82 = 27.6 + 123.4·5 − 2.48·82 = 441.3 (computed with unrounded values).

• Its variance is 89.9 + 6.24·25 + 0.067·82² + 2·23.2·5 − 2·2.45·82 − 2·0.65·5·82 + 1 + 1 = 2.99 (computed with unrounded values; the two final 1's are the variances of the new litter effect and the new observation error).

• Thus a 95% prediction interval is 441.3 ± 3.4.
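Continuing the sketch from the Results slide (it assumes post_mean and post_cov from that block are in scope), the prediction follows from the posterior mean and covariance; the new litter's dummy columns are zero, and one unit each of litter and error variance is added:

```python
x_new = np.array([1.0, 5.0, 82.0, 0.0, 0.0, 0.0])  # new litter: no dummy

pred_mean = x_new @ post_mean
pred_var = x_new @ post_cov @ x_new + 1.0 + 1.0    # + litter + error variance
half = 1.96 * np.sqrt(pred_var)
print(f"{pred_mean:.1f} +/- {half:.1f}")           # roughly 441.3 +/- 3.4
```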

Page 33

Predictive distributions

• Parameters do not exist – only observations do. Parameters are only a help in describing their distributions.

• So the interesting thing is how well we can predict future observations by our inference:

  $P(x_{new} \mid data) = \int P(x_{new} \mid \theta, data)\, f(\theta \mid data)\, d\theta = \int P(x_{new} \mid \theta)\, f(\theta \mid data)\, d\theta$

  where the last equality holds if future observations and the old data are independent given θ.

• We just saw one example for the piglet data, and we have also seen it in Bernoulli examples. (A simulation sketch follows below.)
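The integral can be approximated by simulation: draw θ from its posterior, then draw a new observation given θ. A sketch of my own, reusing the simple normal–normal posterior from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
post_mean, post_var = 2.0, 0.8        # posterior of theta (from earlier)
sigma2 = 1.0                          # observation variance

theta = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
x_new = rng.normal(theta, np.sqrt(sigma2))    # one draw per theta

print(np.quantile(x_new, [0.025, 0.975]))     # 95% predictive interval
```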

Page 34

Simulate from posterior and predictive distributions

• In some cases there is no closed expression for the posterior distribution. In most of these cases it is nevertheless possible to simulate from it.

• We shall illustrate it in a simple situation (where it is possible to get the posterior)

• (Y, X) is bivariate normal N((0, 0), 1, 1, ρ) – unit variances and correlation ρ.

Page 35

Algorithm

• Starting value: anything, e.g. x0 = 0.

• Take y1 from the conditional distribution of Y given that X = x0: y1 = Normalrand(ρ·x0, 1 − ρ²).

• Take x1 from the conditional distribution of X given that Y = y1: x1 = Normalrand(ρ·y1, 1 − ρ²).

• Continue like this: take yi+1 from the conditional distribution of Y given that X = xi, yi+1 = Normalrand(ρ·xi, 1 − ρ²), and xi+1 from the conditional distribution of X given that Y = yi+1, xi+1 = Normalrand(ρ·yi+1, 1 − ρ²).

• The pairs (Yi, Xi) form a Markov chain, which converges in distribution to N((0, 0), 1, 1, ρ). (Convergence is exponential, but slow for ρ close to one.)

• Thus, after a ”burn-in” period, the pairs will be a sequence of (dependent) random variables from the correct distribution. (See the Excel sheet, and the sketch below.)

• In this way one can simulate from a multivariate distribution using only the one-dimensional conditional distributions. Histograms and different multidimensional plots may be constructed.

• This was a simple example of an MCMC technique. Usually burn-ins of more than 100 are used, and the chain is run for at least 10 000 iterations. Special techniques are used to speed up the convergence.
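A Python sketch of the same Gibbs sampler, mirroring the slide's Excel illustration: each component is drawn from its exact conditional N(ρ·other, 1 − ρ²):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=10_000, burn_in=100, seed=0):
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1 - rho ** 2)
    x, samples = 0.0, []              # starting value x0 = 0
    for _ in range(n_iter):
        y = rng.normal(rho * x, sd)   # draw Y | X = x
        x = rng.normal(rho * y, sd)   # draw X | Y = y
        samples.append((y, x))
    return np.array(samples[burn_in:])

draws = gibbs_bivariate_normal(rho=0.9)
print(np.corrcoef(draws.T))           # correlation should be near 0.9
```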

Page 36

Predictive distributions – Missing data

• Suppose we have observed 100 x-values, but 5 y-values are missing due to chance.

• Model: Y = a + bx + e, where e is normal.
– Draw a and b B = 100 times from their posterior distribution.
– For each pair (a, b), generate the five missing y-values from their distribution given a, b and the five x-values.
– Fill in the five values. You now have 100 sets of complete data.

• You may now do the analysis that you intended to do, without losing information.
– (The estimate should be the mean of the 100 estimates, and the posterior variance is the sum of the average estimated variance and the between-imputation variance. A sketch follows below.)
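A minimal sketch of this multiple-imputation recipe (my own illustration; the posterior draws, the data and the "analysis" are hypothetical stand-ins, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma = 100, 1.0
x_missing = np.array([1.2, 3.4, 2.2, 5.0, 0.7])    # x's whose y is missing

estimates, variances = [], []
for _ in range(B):
    a, b = rng.normal([0.0, 1.0], [0.1, 0.05])     # stand-in posterior draw of (a, b)
    y_fill = rng.normal(a + b * x_missing, sigma)  # impute the 5 missing y's
    # ... run the intended analysis on the completed data set ...
    est = y_fill.mean()                            # placeholder "analysis"
    estimates.append(est)
    variances.append(sigma ** 2 / len(y_fill))     # its estimated variance

point = np.mean(estimates)                         # mean of the 100 estimates
total_var = np.mean(variances) + np.var(estimates, ddof=1)  # within + between
print(point, total_var)
```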