TRANSCRIPT
Cardiff Feb 16-17 2005

Statistical inference for astrophysics
A short course for astronomers
Cardiff University 16-17 Feb 2005
Graham Woan, University of Glasgow
Lecture Plan
• Why statistical astronomy?
• What is probability?
• Estimating parameter values
  o the Bayesian way
  o the Frequentist way
• Testing hypotheses
  o the Bayesian way
  o the Frequentist way
• Assigning Bayesian probabilities
• Monte-Carlo methods

(Topics split across Lectures 1 & 2 and Lectures 3 & 4.)
Why statistical astronomy?

“There are three types of lies: lies, damned lies and statistics”
(attributed to Benjamin Disraeli / Mark Twain)
Generally, statistics has got a bad reputation
Often for good reason:
Jun 3rd 2004
… two researchers at the University of Girona in Spain, have found that 38% of a sample of papers in Nature contained one or more statistical errors…
The Economist
Why statistical astronomy?
Data analysis methods are often regarded as simple recipes…
http://www.numerical-recipes.com/
Why statistical astronomy?

Data analysis methods are often regarded as simple recipes…
…but in analysis as in life, sometimes the recipes don’t work as you expect:

o Low number counts
o Distant sources
o Correlated ‘residuals’
o Incorrect assumptions

→ Systematic errors and/or misleading results
"The trouble is that what we [statisticians] call modern statistics was developed under strong pressure on the part of biologists. As a result, there is practically nothing done by us which is directly applicable to problems of astronomy."
Jerzy Neyman, founder of frequentist hypothesis testing.
Why statistical astronomy?
…and the tools can be clunky:
Why statistical astronomy?

For example, we can observe only the one Universe:
(From Bennett et al 2003)
The Astrophysicist’s Shopping List
We want tools capable of:
o dealing with very faint sources
o handling very large data sets
o correcting for selection effects
o diagnosing systematic errors
o avoiding unnecessary assumptions
o estimating parameters and testing models
Why statistical astronomy?
Key question:
How do we infer properties of the Universe from incomplete and imprecise astronomical data?
Our goal:
To make the best inference, based on our observed data and any prior knowledge, reserving the right to revise our position if new information comes to light.
Let’s come to this problem afresh with an astrophysicist’s eye, bypassing some of the jargon of orthodox statistics and going right back to plain probability:
Herodotus, c.500 BC
“A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences.”
We should do the best with what we have, not what we wished we had.
Right-thinking gentlemen #1
“Probability theory is nothing but common sense reduced to calculation”
Pierre-Simon Laplace (1749 – 1827)
Right-thinking gentlemen #2
“Frustra fit per plura, quod fieri potest per pauciora.”
“It is vain to do with more what can be done with less.”
Occam’s Razor
William of Occam (1288 – 1348 AD)
Everything else being equal, we favour models which are simple.
Right-thinking gentlemen #3
But what is “probability”?

• There are three popular interpretations of the word, each with an interesting history:
  – Probability as a measure of our degree of belief in a statement
  – Probability as the limiting relative frequency of outcome of a set of identical experiments
  – Probability as the fraction of favourable (equally likely) possibilities
• We will call these the Bayesian, Frequentist and Combinatorial interpretations.
• Note there are signs of trouble here:
  – How do you quantify “degree of belief”?
  – How do you define “relative frequency” without using ideas of probability?
  – What does “equally likely” mean?
• Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms:
Algebra of (Bayesian) probability

• Probability [of a statement, such as “y = 3”, “the mass of a neutron star is 1.4 solar masses” or “it will rain tomorrow”] is a number between 0 and 1, such that

  p = 1 if the statement is true
  p = 0 if the statement is false
  0 < p < 1 if not sure

• For some statement X,

  p(X) + p(X̄) = 1

  where the bar denotes the negation of the statement -- the Sum Rule

• If there are two statements X and Y, then the joint probability is

  p(X,Y) = p(Y) p(X|Y)

  where the vertical line denotes the conditional statement “X given Y is true” -- the Product Rule
Algebra of (Bayesian) probability

• From these we can deduce that

  p(X + Y|I) = p(X|I) + p(Y|I) − p(X,Y|I)

  where “+” denotes “or” and I represents common background information -- the Extended Sum Rule

• …and, because p(X,Y) = p(Y,X), we get

  p(X|Y,I) = p(X|I) p(Y|X,I) / p(Y|I)

  which is called Bayes’ Theorem

• Note that these results are also applicable in Frequentist probability theory, with a suitable change in meaning of “p”.

Thomas Bayes (1702 – 1761 AD)
Bayes’ theorem is the appropriate rule for updating our degree of belief when we have new data:

  p(X|Y,I) = p(X|I) p(Y|X,I) / p(Y|I)

  p(model|data, I) = p(data|model, I) × p(model|I) / p(data|I)

  Posterior = Likelihood × Prior / Evidence

We can usually calculate all these terms.
Algebra of (Bayesian) probability
[note that the word evidence is sometimes used for something else (the ‘log odds’). We will
stick to the p(d|I) definition here.]
Algebra of (Bayesian) probability

• We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from {x1, x2, …, xn} and {y1, y2, …, ym}, then

  p(xi) = Σ_{j=1..m} p(xi|yj) p(yj) = Σ_{j=1..m} p(xi, yj)

  (using Σ_{j=1..m} p(yj) = 1). This gives us the probability of X when we don’t care about Y. In these circumstances, Y is known as a nuisance parameter.

• All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g.

  p(x) = ∫ p(x,y) dy

  where “p(y)dy” is the probability that y lies in the range y to y+dy.
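These rules are easy to verify numerically. A minimal sketch in Python (the joint probability table below is invented for the demo, not taken from the lectures):

```python
# Marginalising a discrete joint distribution (illustrative numbers).
# p_joint[i][j] = p(x_i, y_j)
p_joint = [
    [0.10, 0.20],   # p(x1,y1), p(x1,y2)
    [0.30, 0.40],   # p(x2,y1), p(x2,y2)
]

# Marginals: p(x_i) = sum_j p(x_i, y_j), and p(y_j) = sum_i p(x_i, y_j)
p_x = [sum(row) for row in p_joint]
p_y = [sum(col) for col in zip(*p_joint)]

# Product rule: p(x_i|y_1) = p(x_i, y_1) / p(y_1); conditionals sum to 1
p_x_given_y1 = [p_joint[i][0] / p_y[0] for i in range(len(p_joint))]

print(p_x, p_y, sum(p_x_given_y1))
```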
These equations are the key to Bayesian Inference – the methodology upon which (much) astronomical data analysis is now founded.
Clear introduction by Devinder Sivia (OUP).
(fuller bibliography tomorrow)
Algebra of (Bayesian) probability
See also the free book by Prasenjit Saha (QMW, London).
http://ankh-morpork.maths.qmw.ac.uk/%7Esaha/book
Example…
• A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% when it truly is not.
The burst passes the veto test!! What is the probability we have seen a supernova?
Answer: Denote

  S = “burst really is a supernova”
  G = “burst really is a glitch”
  ✓ = “test says it’s a supernova”
  ✗ = “test says it’s a glitch”

Let I represent the information that the burst seen is typical of those used to deduce the information in the question. Then we are told that:

  p(S|I) = 0.0001  }
  p(G|I) = 0.9999  }  Prior probabilities for S and G

  p(✓|S,I) = 0.95  }
  p(✓|G,I) = 0.01  }  Likelihoods for S and G
Example cont…

• But we want to know the probability that it’s a supernova, given that it passed the veto test: p(S|✓,I).

By Bayes’ theorem,

  p(S|✓,I) = p(S|I) p(✓|S,I) / p(✓|I)

and we are directly told everything on the rhs except p(✓|I), the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch then we can marginalise over these alternatives:

  p(✓|I) = p(✓,S|I) + p(✓,G|I)
         = p(S|I) p(✓|S,I) + p(G|I) p(✓|G,I)
         = 0.0001 × 0.95 + 0.9999 × 0.01

so

  p(S|✓,I) = (0.0001 × 0.95) / (0.0001 × 0.95 + 0.9999 × 0.01) ≈ 0.0094
Example cont…

• So, despite passing the test, there is only a 1% probability that the burst is a supernova! Veto tests have to be blisteringly good if supernovae are rare.
• Why? Because most bursts that pass the test are just instrumental glitches – it really is just common sense reduced to calculation.
• Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01).
• Moral:

  p(✓|S,I) ≠ p(S|✓,I)

  p(✓|S,I) = 0.95 is the probability that a supernova burst gets through the veto. This is the likelihood: how consistent is the data with a particular model?

  p(S|✓,I) ≈ 0.01 is the probability that it’s a supernova burst if it gets through the veto. This is the posterior: how probable is the model, given the data?
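The arithmetic of this example takes only a few lines to reproduce (a sketch, using the numbers quoted on the slides):

```python
# Bayes' theorem for the supernova/veto example.
p_S, p_G = 0.0001, 0.9999        # priors: real supernova vs glitch
p_pass_S, p_pass_G = 0.95, 0.01  # likelihoods of passing the veto test

# Marginalise over the alternatives to get p(pass)
p_pass = p_S * p_pass_S + p_G * p_pass_G

# Posterior probability of a supernova, given a passed test
p_S_given_pass = p_S * p_pass_S / p_pass
print(round(p_S_given_pass, 4))   # about 0.0094, i.e. roughly 1%
```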
• Bayesian probability theory is simultaneously a very old and a very young field:
  • Old: the original interpretation of Bernoulli, Bayes, Laplace…
  • Young: the ‘state of the art’ in (astronomical) data analysis
• But BPT was rejected for several centuries because probability as degree of belief was seen as too subjective
Basis of frequentist probability
Frequentist approach
Probability = ‘long run relative frequency’ of an event

It appears at first that this can be measured objectively, e.g. rolling a die. What is p(1)?

If the die is ‘fair’ we expect

  p(1) = p(2) = … = p(6) = 1/6

These probabilities are fixed (but unknown) numbers.

Can imagine rolling the die M times. The number rolled is a random variable – different outcome each time.
Basis of frequentist probability
• We define

  p(1) = lim_{M→∞} n(1)/M

  If the die is ‘fair’, p(1) = 1/6.

• But objectivity is an illusion: this assumes each outcome is equally likely (i.e. equally probable)
• It also assumes an infinite series of identical trials
• What can we say about the fairness of the die after (say) 5 rolls, or 10, or 100?
Basis of frequentist probability
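A quick Monte Carlo sketch of the limiting-frequency idea (the seed and the sample sizes are arbitrary choices):

```python
import random

# Relative frequency of rolling a 1 with a fair die, for growing M.
random.seed(1)

def relative_frequency_of_one(M):
    return sum(1 for _ in range(M) if random.randint(1, 6) == 1) / M

for M in (10, 1000, 100000):
    print(M, relative_frequency_of_one(M))
# The frequency wanders for small M (the 5- or 10-roll problem above)
# and settles towards 1/6 only as M becomes large.
```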
In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions:
•We take the data as a random sample of size M , drawn from an assumed underlying pdf
•Sampling distribution, derived from the underlying pdf and M
•Define an estimator – function of the sample data that is used to estimate properties of the pdf
But how do we decide what makes an acceptable estimator?
Basis of frequentist probability
Example: measuring a galaxy redshift

• Let the true redshift be z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions p(ẑ1) and p(ẑ2) for the estimators ẑ1 and ẑ2, modelling the errors.

1. Small telescope, low dispersion spectrometer

Unbiased:

  E(ẑ1) = ∫ ẑ1 p(ẑ1|z0) dẑ1 = z0

Repeat the observation a large number of times and the average estimate equals z0. BUT var(ẑ1) is large due to the low dispersion.
Basis of frequentist probability
Example cont…

2. Large telescope, high dispersion spectrometer, but a faulty astronomer! (e.g. wrong calibration)

Biased:

  E(ẑ2) = ∫ ẑ2 p(ẑ2|z0) dẑ2 ≠ z0

BUT var(ẑ2) is small. Which is the better estimator?
Basis of frequentist probability
What about the sample mean?

• Let x1, …, xM be a random sample from a pdf p(x) with mean μ and variance σ². Then the sample mean is

  μ̂ = (1/M) Σ_{i=1..M} xi

Can show that E(μ̂) = μ -- an unbiased estimator.

But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M.

What can we say with a finite number of samples, each of finite size?

Before that…
Basis of frequentist probability
1) Poisson pdf

e.g. the number of photons/second counted by a CCD, or the number of galaxies per square degree counted by a survey:

  p(n) = e^(−λ) λⁿ / n!

  n = number of detections

The Poisson pdf assumes detections are independent, and that there is a constant underlying rate λ.

Can show that

  Σ_{n=0}^{∞} p(n) = 1
Some important pdfs: discrete case
[Figure: the Poisson pdf p(n) plotted against n]
2) Binomial pdf

The number of ‘successes’ r from N observations, with two mutually exclusive outcomes (‘Heads’ and ‘Tails’)

e.g. number of binary stars, Seyfert galaxies, supernovae…

  p(r) = [N! / (r!(N−r)!)] θ^r (1−θ)^(N−r)

  r = number of ‘successes’
  θ = probability of ‘success’ for a single observation

Can show that

  Σ_{r=0}^{N} p(r) = 1
Some important pdfs: discrete case
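Again easy to verify on a grid of r values (N = 20 and θ = 0.3 are arbitrary example values):

```python
import math

# Binomial pdf: r successes in N trials, success probability theta.
def binomial_pmf(r, N, theta):
    return math.comb(N, r) * theta**r * (1 - theta) ** (N - r)

N, theta = 20, 0.3
rs = range(N + 1)
total = sum(binomial_pmf(r, N, theta) for r in rs)
mean = sum(r * binomial_pmf(r, N, theta) for r in rs)
var = sum((r - mean) ** 2 * binomial_pmf(r, N, theta) for r in rs)
print(total, mean, var)   # -> 1.0, N*theta = 6.0, N*theta*(1-theta) = 4.2
```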
1) Uniform pdf

  p(x) = 1/(b−a)  for a ≤ x ≤ b
       = 0        otherwise

[Figure: p(x) is a rectangle of height 1/(b−a) between x = a and x = b]
Some important pdfs: continuous case
2) Central, or Normal pdf (also known as Gaussian)

  p(x) = [1/(σ√(2π))] exp[ −(x−μ)² / (2σ²) ]

[Figure: p(x) against x for μ = 1, σ = 0.5]
Some important pdfs: continuous case
Cumulative distribution function (CDF)

  P(a) = ∫_{−∞}^{a} p(x) dx = Prob(x < a)

[Figure: a pdf p(x) and its CDF P(x)]
The nth moment of a pdf is defined as

  ⟨xⁿ⟩ = Σ_x xⁿ p(x|I)       Discrete case

  ⟨xⁿ⟩ = ∫ xⁿ p(x|I) dx      Continuous case
Measures and moments of a pdf
The 1st moment is called the mean or expectation value:

  E(x) = ⟨x⟩ = Σ_x x p(x|I)       Discrete case

  E(x) = ⟨x⟩ = ∫ x p(x|I) dx      Continuous case
Measures and moments of a pdf
The 2nd moment is called the mean square:

  ⟨x²⟩ = Σ_x x² p(x|I)       Discrete case

  ⟨x²⟩ = ∫ x² p(x|I) dx      Continuous case
Measures and moments of a pdf
The variance is defined as

  var(x) = Σ_x (x − ⟨x⟩)² p(x|I)       Discrete case

  var(x) = ∫ (x − ⟨x⟩)² p(x|I) dx      Continuous case

and is often written σ². σ is called the standard deviation.

In general,

  var(x) = ⟨x²⟩ − ⟨x⟩²

Measures and moments of a pdf
Measures and moments of a pdf

  pdf       form                                      mean      variance σ²
  -------------------------------------------------------------------------
  Poisson   p(r) = e^(−λ) λ^r / r!                    λ         λ            (discrete)
  Binomial  p(r) = [N!/(r!(N−r)!)] θ^r (1−θ)^(N−r)    Nθ        Nθ(1−θ)      (discrete)
  Uniform   p(X) = 1/(b−a), a ≤ X ≤ b                 (a+b)/2   (b−a)²/12    (continuous)
  Normal    p(X) = [1/(σ√(2π))] exp[−(X−μ)²/(2σ²)]    μ         σ²           (continuous)
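The continuous rows of the table can be verified by direct numerical integration (a midpoint-rule sketch; the parameter values are arbitrary):

```python
import math

def moments(pdf, lo, hi, steps=20000):
    # Midpoint-rule estimates of the normalisation, mean and variance.
    dx = (hi - lo) / steps
    xs = [lo + (i + 0.5) * dx for i in range(steps)]
    norm = sum(pdf(x) for x in xs) * dx
    mean = sum(x * pdf(x) for x in xs) * dx
    var = sum((x - mean) ** 2 * pdf(x) for x in xs) * dx
    return norm, mean, var

a, b = 2.0, 5.0
uniform = lambda x: 1.0 / (b - a)
print(moments(uniform, a, b))    # -> 1.0, (a+b)/2 = 3.5, (b-a)^2/12 = 0.75

mu, sigma = 1.0, 0.5
normal = lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
print(moments(normal, mu - 8 * sigma, mu + 8 * sigma))   # -> 1.0, mu, sigma^2
```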
The Median divides the CDF into two equal halves:

  P(x_med) = ∫_{−∞}^{x_med} p(x′) dx′ = 0.5

so that Prob(x < x_med) = Prob(x > x_med) = 0.5

[Figure: the CDF P(x), with x_med marked where P = 0.5]
Measures and moments of a pdf
The Mode is the value of x for which the pdf is a maximum.

For a Normal pdf, mean = median = mode = μ.

[Figure: a Normal pdf with μ = 1, σ = 0.5, peaking at the mode]
Measures and moments of a pdf
  p(model|data, I) ∝ p(data|model, I) × p(model|I)

     Posterior            Likelihood          Prior
  (what we know now)  (influence of our   (what we knew
                       observations)       before)
In the Bayesian approach, we can test our model, in the light of our data (i.e. rolling the die) and see how our degree of belief in its ‘fairness’ evolves, for any sample size, considering only the data that we did actually observe.
Bayesian parameter estimation
We want to know the fraction of Seyfert galaxies that are type 1.
How large a sample do we need to reliably measure this?
Model as a binomial pdf: let θ = the global fraction of Seyfert 1s.

Suppose we sample N Seyferts, and observe r Seyfert 1s. Then

  p(r|θ,N) = [N!/(r!(N−r)!)] θ^r (1−θ)^(N−r)

is the probability of obtaining the observed data, given the model – the likelihood of θ.
Astronomical example #1:
Probability that a galaxy is a Seyfert 1
Bayesian parameter estimation
• But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important.
•We can sidestep this for a moment, realising that if our data are good enough, the choice of prior shouldn’t matter!
  p(model|data, I) ∝ p(data|model, I) × p(model|I)
      Posterior       Likelihood           Prior
                      (dominates)
Bayesian parameter estimation
Consider a simulation of this problem using two different priors:

[Figure: two priors p(θ|I): a flat prior, with all values of θ equally probable, and a Normal prior peaked at θ = 0.5]
Bayesian parameter estimation
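A sketch of such a simulation on a grid of θ values (the sample numbers and the Normal prior width σ = 0.1 are invented for illustration):

```python
import math

# Binomial posterior for theta under two different priors; return the mode.
def posterior_mode(r, N, prior):
    thetas = [i / 1000 for i in range(1, 1000)]
    post = [math.comb(N, r) * t**r * (1 - t) ** (N - r) * prior(t) for t in thetas]
    return thetas[post.index(max(post))]

flat = lambda t: 1.0
normal_prior = lambda t: math.exp(-(t - 0.5) ** 2 / (2 * 0.1**2))  # peaked at 0.5

# Small sample: 3 Seyfert 1s out of 10 -- the choice of prior matters
print(posterior_mode(3, 10, flat), posterior_mode(3, 10, normal_prior))

# Large sample: 300 out of 1000 -- the two posteriors nearly coincide
print(posterior_mode(300, 1000, flat), posterior_mode(300, 1000, normal_prior))
```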
What do we learn from all this?
•As our data improve (e.g. our sample increases), the posterior pdf narrows and becomes less sensitive to our choice of prior.
• The posterior conveys our (evolving) degree of belief in different values of θ, in the light of our data.
• If we want to express our result as a single number we could perhaps adopt the mean, median, or mode.
• We can use the variance of the posterior pdf to assign an uncertainty for θ.
• It is very straightforward to define confidence intervals.
Bayesian parameter estimation
[Figure: the posterior p(θ|data, I), with 95% of the area under the pdf lying between θ1 and θ2]

We are 95% sure that θ lies between θ1 and θ2.

Note: the confidence interval is not unique, but we can define the shortest C.I.
Bayesian confidence intervals
Bayesian parameter estimation
Bayesian parameter estimation
Astronomical example #2:
Flux density of a GRB

Take Gamma Ray Bursts to be equally luminous events, distributed homogeneously in the Universe. We see three gamma ray photons from a GRB in an interval of 1 s. What is the flux of the source, F?

• The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but we can do better than that by including our prior information on luminosity and homogeneity. Call this background information I:

Homogeneity implies that the probability that the source is in any particular volume of space is proportional to the volume, so the prior probability that the source is in a thin shell of radius r is

  p(r|I) dr ∝ 4πr² dr
Bayesian parameter estimation

• But the sources have a fixed luminosity, L, so r and F are directly related by

  F = L / (4πr²),  hence  |dF/dr| ∝ 1/r³

• The prior on F is therefore

  p(F|I) = p(r|I) |dr/dF| ∝ F^(−5/2)

[Figure: the prior p(F|I), falling steeply with increasing F]

Interpretation: low flux sources are intrinsically more probable, as there is more space for them to sit in.

• We now apply Bayes’ theorem to determine the posterior for F after seeing n photons:

  p(F|n,I) ∝ p(F|I) p(n|F,I)
• The likelihood for F comes from the Poisson nature of the photon counts:

  p(n|F,I) = exp(−F) Fⁿ / n!

so finally,

  p(F|n,I) ∝ F^(n−5/2) exp(−F)

For n = 3 the most probable value of F is 0.5 photons/sec.

[Figure: the posterior p(F|n=3,I) against F, with the mode F_mostprob = 0.5 lying well below the naive estimate F = n/T = 3]

• Clearly it is more probable that this is a distant source from which we have seen an unusually high number of photons than that it is an unusually nearby source from which we have seen the expected number of photons. (The most probable value of F is n − 5/2, approaching n for n >> 1.)

Bayesian parameter estimation
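Evaluating this posterior on a grid reproduces the quoted mode (a sketch; the grid range and spacing are arbitrary choices):

```python
import math

# Posterior p(F|n,I) ∝ F**(n - 2.5) * exp(-F) on a flux grid.
n = 3
Fs = [i / 1000 for i in range(1, 10001)]        # 0.001 .. 10 photons/s
post = [F ** (n - 2.5) * math.exp(-F) for F in Fs]

mode = Fs[post.index(max(post))]
print(mode)   # -> 0.5 = n - 5/2, far below the naive estimate F = 3
```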
• Recall that in frequentist (orthodox) statistics, probability is the limiting relative frequency of an outcome, so only random variables can have frequentist probabilities, as only these show variation with repeated measurement. So we can’t talk about the probability of a model parameter, or of a hypothesis. E.g., a measurement of a mass is a random variable, but the mass itself is not.
• So no orthodox probabilistic statement can be interpreted as directly referring to the parameter in question! For example, orthodox confidence intervals do not indicate the range in which we are confident the parameter value lies. That’s what Bayesian intervals do.
• So what do they mean? …
Frequentist parameter estimation
Frequentist parameter estimation

• Orthodox parameter estimation proceeds as follows. Imagine we have some data, {di}, that we want to use to estimate the value of a parameter, a, in a model. These data must depend on the true value of a in a known way, but as they are random variables all we can say is that we know, or can estimate, p({di}|a) for a given a.

1. Use the data to construct a statistic (i.e. a function of the data) that can be used as an estimator for a, called â. A good estimator will have a pdf p(â|a) that depends heavily on a, and which is sharply peaked at, or very close to, a.

[Figure: the sampling pdf p(â|a), sharply peaked near the true value a; the measured value of â is one draw from it]
Frequentist parameter estimation

2. One such estimator is the maximum likelihood (ML) estimator, constructed in the following way: given the distribution from which the data are drawn, p(d|a), construct the sampling distribution p({di}|a), which is just p(d1|a)·p(d2|a)·p(d3|a)… if the data are independent. Interpret this as a function of a, called the likelihood of a, L(a), and call the value of a that maximises it for the given data â_ML. The corresponding sampling distribution, p({di}|â_ML), is the one from which the data were ‘most likely’ drawn.
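A tiny sketch of step 2 for independent gaussian data with known σ (the data values here are invented): the likelihood is maximised at the sample mean.

```python
data = [1.2, 0.8, 1.1, 0.9, 1.3]   # invented measurements
sigma = 0.5                        # assumed known

def log_likelihood(a):
    # log of p(d1|a) p(d2|a) ... for gaussian errors of width sigma
    # (the constant normalisation terms are dropped)
    return sum(-0.5 * ((d - a) / sigma) ** 2 for d in data)

# Crude grid search for the maximising value a_ML
grid = [i / 1000 for i in range(-1000, 3001)]
a_ml = max(grid, key=log_likelihood)

print(a_ml, sum(data) / len(data))   # a_ML coincides with the sample mean
```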
Frequentist parameter estimation

…but remember that â_ML is just one value, albeit drawn from a population p(â_ML|a) that has an attractive average behaviour. And we still haven’t said anything about a.

3. Define a confidence interval around a enclosing, say, 95% of the expected values of â_ML from repeated experiments:

[Figure: the sampling distribution p(â_ML|a), with an interval about a enclosing 95% of the area]
Frequentist parameter estimation

σ, the width of the sampling pdf, may be known or it may have to be estimated too.

4. We can now say that

  Prob(a − 2σ < â_ML < a + 2σ) ≈ 0.95

Note this is a probabilistic statement about the estimator, not about a. However this expression can be restated: 0.95 is the relative frequency with which the statement ‘a lies within 2σ of â_ML’ is true, over many repeated experiments.

[Figure: many different experiments, each producing an interval about its own â_ML; the interval covers the true a in some experiments (Yes, Yes, Yes…) and misses it in others (No, No)]

The disastrous shorthand for this is

  “a = â_ML ± 2σ, with 95% confidence”

Note that this is not a statement about our degree of belief that a lies in the numerical interval generated in our particular experiment.
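The repeated-experiments reading of a confidence interval can be demonstrated by simulation (a sketch: gaussian data with known σ, interval â ± 2σ/√M):

```python
import random
import statistics

# Coverage of the ~95% interval a_hat +/- 2*sigma/sqrt(M) over many
# repeated experiments (gaussian data, sigma known).
random.seed(4)
a_true, sigma, M, trials = 3.0, 1.0, 25, 2000
covered = 0
for _ in range(trials):
    data = [random.gauss(a_true, sigma) for _ in range(M)]
    a_hat = statistics.fmean(data)
    half_width = 2 * sigma / M ** 0.5
    covered += (a_hat - half_width < a_true < a_hat + half_width)
print(covered / trials)   # close to 0.95: the statement is about the
                          # ensemble of experiments, not any single interval
```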
Example: B and F compared
• Two gravitational wave (GW) detectors see 7 simultaneous burst-like signals in the data from one hour, consisting of GW signals and spurious signals. When the two sets of data are offset in time by 10 minutes there are 9 simultaneous signals. What is the true GW burst rate?

(Note: no need to be an expert to realise there is not much GW activity here!)
• A frequentist solution: Take the events as Poisson, characterised by the background rate, rb, and the GW rate, rg. We get

  r̂g + r̂b = 7  and  r̂b = 9,  var(r̂b) = 9

so

  r̂g = 7 − 9 = −2  and  σ = √(7 + 9) = 4

where σ² is the variance of the sampling distribution.
Example: B and F compared

• So we would quote our result as:

  r̂g = −2 ± 4

• But burst rates can’t be negative! What does this mean? In fact it is quite consistent with our definition of a frequentist confidence interval. Our value of r̂g is drawn from its sampling distribution p(r̂g|rg), which will look something like:

[Figure: the sampling distribution p(r̂g|rg), of width ~4, with our particular sample at r̂g = −2 in its lower tail]

So, in the shorthand of the subject, this result is quite correct.
• The Bayesian solution: We’ll go through this carefully. Our job is to determine rg. In Bayesian terms that means we are after p(rg|data). We start by realising there are two experiments here: one to determine the background rate and one to determine the joint rate, so we will also determine p(rb|data).
If the background count comes from a Poisson process of mean rb then the probability of n events is
which is our Bayesian likelihood for rb. We will choose a prior probability for rb proportional to 1/rb. This is the scale invariant prior that encapsulates total ignorance of the non-zero rate (of course in reality we may have something that constrains rb more tightly a priori). See later…
Example: B and F compared
  p(n|rb) = rbⁿ exp(−rb) / n!
Example: B and F compared

• So our posterior for rb, based on the background counts, is

  p(rb|n) ∝ (1/rb) · rbⁿ exp(−rb) / n!

which, normalising and setting n = 9, gives

  p(rb|n=9) = (1/8!) rb⁸ exp(−rb)

[Figure: the background-rate posterior p(rb|n=9), plotted for rb from 0 to 20]

• The probability of seeing m coincident bursts, given the two rates, is

  p(m|rb,rg) = (rb + rg)ᵐ exp[−(rb + rg)] / m!
Example: B and F compared

• And our joint posterior for the rates is, by Bayes’ theorem,

  p(rb,rg|m) ∝ p(rb,rg) × p(m|rb,rg)
                (prior)    (likelihood)

The joint prior in the above can be split into a probability for rb, which we have just evaluated, and a prior for rg. This may be zero, so we will say that we are equally ignorant over whether to expect 0, 1, 2, … counts from this source. This translates into a uniform prior on rg, so our joint prior is

  p(rg,rb) = (1/8!) rb⁸ exp(−rb)

Finally we get the posterior on rg by marginalising over rb:

  p(rg|m=7) ∝ ∫ (1/8!) rb⁸ exp(−rb) · (1/7!) (rb + rg)⁷ exp[−(rb + rg)] drb
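This marginalisation is easy to do numerically (a sketch; the grid spacing and the rb cut-off are arbitrary choices):

```python
import math

# p(rg|m=7) up to normalisation: integrate the rb posterior times the
# coincidence likelihood over rb, on a simple midpoint grid.
def unnorm_posterior(rg, drb=0.02, rb_max=40.0):
    total = 0.0
    for i in range(int(rb_max / drb)):
        rb = (i + 0.5) * drb
        total += (rb**8 * math.exp(-rb) / math.factorial(8)
                  * (rb + rg) ** 7 * math.exp(-(rb + rg)) / math.factorial(7)) * drb
    return total

rgs = [i / 10 for i in range(101)]          # rg = 0 .. 10
post = [unnorm_posterior(rg) for rg in rgs]
norm = sum(post)
post = [p / norm for p in post]
print(rgs[post.index(max(post))])   # largest at rg = 0, but falling away
                                    # slowly: rg = 4 or 5 is still plausible
```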
Example: B and F compared

[Figure: the posterior p(rg|n=9,m=7) plotted against rg, giving our final answer to the problem]

Note that p(rg<0) = 0, due to the prior, and that the true value of rg could very easily be as high as 4 or 5.

Compare with our frequentist result: r̂g = −2 ± 4, to 1-sigma.
Example: B and F compared

Let’s see how this result would change if the background count was 3, rather than 9 (joint count still 7):

[Figure: the posterior p(rg|n=3,m=7) plotted against rg, now with more support at positive rates]

Again, this looks very reasonable.
Bayesian hypothesis testing

• In Bayesian analysis, hypothesis testing can be performed as a generalised application of Bayes’ theorem. Generally a hypothesis differs from a parameter in that many values of the parameter(s) are consistent with one hypothesis. Hypotheses are models that depend on parameters.
• Note however that we cannot define the probability of one hypothesis given some data, d, without defining all the alternatives in its class, i.e.

  prob(H1|d,I) = prob(H1|I) prob(d|H1,I) / Σ_{all possible hypotheses i} prob(Hi|I) prob(d|Hi,I)

and often this is impossible. So questions like “do the data fit a gaussian?” are not well-formed until you list all the curves the data could fit. A well-formed question would be “do the data fit a gaussian better than these other curves?”, or more usually “do the data fit a gaussian better than a lorentzian?”.

• Simple comparisons can be expressed as an odds ratio, O:

  O12 = prob(H1|d,I) / prob(H2|d,I)
      = [prob(H1|I) prob(d|H1,I)] / [prob(H2|I) prob(d|H2,I)]

(the intractable normalising sum cancels in the ratio).
Bayesian hypothesis testing

• The odds ratio can be divided into the prior odds and the Bayes’ factor:

  O12 = prob(H1|d,I) / prob(H2|d,I)
      = [prob(H1|I) / prob(H2|I)] × [prob(d|H1,I) / prob(d|H2,I)]
            (prior odds)                 (Bayes’ factor)

• The prior odds simply express our prior preference for H1 over H2, and are set to 1 if you are indifferent.

• The Bayes’ factor is just the ratio of the evidences, as defined in the earlier lectures. Recall that for a model that depends on a parameter a,

  p(a|d,H1,I) = p(a|H1,I) p(d|a,H1,I) / p(d|H1,I)

so the evidence is simply the joint probability of the parameter(s) and the data, marginalised over all hypothesis parameter values:

  p(d|H1,I) = ∫ p(a|H1,I) p(d|a,H1,I) da
• Example: A spacecraft is sent to a moon of Saturn and, using a penetrating probe, detects a liquid sea deep under the surface at 1 atmosphere pressure and a temperature of -3°C. However, the thermometer has a fault, so that the temperature reading may differ from the true temperature by as much as ±5°C, with a uniform probability within this range.
Determine the temperature of the liquid, assuming it is water (liquid within 0<T<100°C) and then assuming it is ethanol (liquid within -80<T<80°C). What are the odds of it being ethanol?
[based loosely on a problem by John Skilling]
Bayesian hypothesis testing
[Figure: a temperature scale from −80 to 100 °C; ethanol is liquid from −80 to 80 °C, water from 0 to 100 °C, and the measurement sits at −3 °C]
Bayesian hypothesis testing

• Call the water hypothesis H1 and the ethanol hypothesis H2.

For H1: The prior on the temperature is

  p(T|H1) = 0.01  for 0 < T < 100
          = 0     otherwise

[Figure: the flat prior p(T|H1) between 0 and 100]

The likelihood of the temperature is the probability of the data d, given the temperature:

  p(d|T,H1) = 0.1  for |d − T| < 5
            = 0    otherwise

[Figure: thought of as a function of T, for d = −3, the likelihood p(d=−3|T,H1) is a box of height 0.1 spanning −8 < T < 2]
Bayesian hypothesis testing

The posterior for T is the normalised product of the prior and the likelihood, giving

  p(T|d,H1) = 0.5  for 0 < T < 2
            = 0    otherwise

H1 only allows the temperature to be between 0 and 2 °C.

The evidence for water (as we defined it) is

  p(d|H1) = ∫ p(d|T,H1) p(T|H1) dT = 0.002

• For H2: By the same arguments,

  p(T|d,H2) = 0.1  for −8 < T < 2
            = 0    otherwise

and the evidence for ethanol is

  p(d|H2) = ∫ p(d|T,H2) p(T|H2) dT = 0.00625
Bayesian hypothesis testing

• So under the water hypothesis we have a tighter possible range for the liquid’s temperature, but it may not be water. In fact, the odds of it being water rather than ethanol are

  O12 = [prob(H1|I) / prob(H2|I)] × [prob(d|H1,I) / prob(d|H2,I)]
      =     1 (prior odds)        ×  0.002/0.00625 (Bayes’ factor)
      = 0.32

which means about 3:1 in favour of ethanol. Of course this depends on our prior odds too, which we have set to 1. If the choice was between water and whisky under the surface of the moon the result would be very different, though the Bayes’ factor would be roughly the same!

• Why do we prefer the ethanol option? Because too much of the prior for temperature, assuming water, falls at values that are excluded by the data. In other words, the water hypothesis is unnecessarily complicated. Bayes’ factors naturally implement Occam’s razor in a quantitative way.
Bayesian hypothesis testing

[Figure: for each hypothesis, the prior p(T|Hi) and the likelihood p(d|T,Hi) plotted against T. The overlap integral (= evidence) is greater for H2 (ethanol) than for H1 (water):]

  ∫ p(d|T,H1) p(T|H1) dT = 2 × 0.1 × 0.01 = 0.002

  ∫ p(d|T,H2) p(T|H2) dT = 10 × 0.1 × 0.00625 = 0.00625
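The two evidences are just overlap integrals, which we can check numerically (a sketch of the slide’s numbers, using a flat prior and a box likelihood):

```python
# Evidence = integral of likelihood * prior over a temperature grid.
def evidence(prior_lo, prior_hi, d=-3.0, err=5.0, dT=0.001):
    prior_height = 1.0 / (prior_hi - prior_lo)   # flat prior on T
    like_height = 1.0 / (2 * err)                # box likelihood, width 2*err
    total = 0.0
    for i in range(int(round((prior_hi - prior_lo) / dT))):
        T = prior_lo + (i + 0.5) * dT
        if abs(d - T) < err:
            total += like_height * prior_height * dT
    return total

ev_water = evidence(0.0, 100.0)       # H1: water, liquid for 0..100 C
ev_ethanol = evidence(-80.0, 80.0)    # H2: ethanol, liquid for -80..80 C
print(ev_water, ev_ethanol, ev_water / ev_ethanol)
# -> 0.002, 0.00625, Bayes' factor 0.32: about 3:1 in favour of ethanol
```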
Bayesian hypothesis testing

• To look at this a bit more generally, we can split the evidence into two approximate parts, the maximum of the likelihood, L_max, and an “Occam factor”:

  p(d|H) = ∫ p(x|H) p(d|x,H) dx ≈ L_max × (likelihood_range / prior_range)

i.e., evidence ≈ maximum likelihood × Occam factor. The Occam factor penalises models that include wasted parameter space, even if they show a good ML fit.
Bayesian hypothesis testing

• This is a very powerful property of the Bayesian method.

Example: you’re given a time series of 1000 data points comprising a number of sinusoids embedded in gaussian noise. Determine the number of sinusoids and the standard deviation of the noise.

• We could think of this as comparing hypotheses Hn that there are n sinusoids in the data, with n ranging from 0 to nmax. Equivalently, we could consider it as a parameter fitting problem, with n an unknown parameter within the model.

The joint posterior for the n signals, with amplitudes {A}n, frequencies {ω}n and noise variance σ², given the overall model and the data {D}, is

  p(n, σ, {A}n, {ω}n | {D}, I) ∝ p(n, σ, {A}n, {ω}n | I) × p({D} | n, σ, {A}n, {ω}n)

The likelihood term, based on gaussian noise, is

  p({D} | n, σ, {A}n, {ω}n) ∝ (1/σᴺ) exp[ −(1/(2σ²)) Σ_j ( Dj − Σ_{i=0}^{n} Ai sin(ωi tj) )² ]

and we can set the priors as independent and uniform over sensible ranges.
Bayesian hypothesis testing

• Our result for n is just its marginal probability:

  p(n|{D},I) = ∫ p(n, σ, {A}n, {ω}n | {D}, I) d{A}n d{ω}n dσ

and similarly we could marginalise for σ. Recent work, in collaboration with Umstatter and Christensen, has explored this:

[Figure: 1000 data points with 50 embedded signals, σ = 2.6. Result: around 33 signals can be recovered from the data; the rest are indistinguishable from noise, and the inferred σ is consequently higher.]
• This has implications for the analysis of LISA data, which is expected to contain many (perhaps 50,000) signals from white dwarf binaries. The data will contain resolvable binaries and binaries that just contribute to the overall noise (either because they are faint or because their frequencies are too close together).
Bayes can sort these out without having to introduce ad hoc acceptance and rejection criteria, and without needing to know the “true noise level” (whatever that means!).
Bayesian hypothesis testing
Frequentist hypothesis testing
And why we should tread carefully, if at all
• A note on why these should really be avoided!
• The method goes like this:
– To test a hypothesis H1 consider another hypothesis, called the null hypothesis, H0, the truth of which would deny H1. Then argue against H0…
– Use the data you have gathered to compute a test statistic Tobs (e.g. the chi-squared statistic), which has a calculable pdf if H0 is true. This can be calculated analytically or by Monte Carlo methods.
– Look where your observed value of the statistic lies in the pdf, and reject H0 based on how far in the wings of the distribution you have fallen.
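As a sketch of the recipe above: simulate the test statistic many times under H0 and see what fraction of simulated values is at least as extreme as Tobs. A minimal Python illustration (the helper names are mine; 7.81 is the usual 5% critical value for a chi-squared statistic with 3 degrees of freedom):

```python
import random

def mc_p_value(t_obs, simulate_t, n_sims=100_000, seed=42):
    """Monte Carlo p-value: the fraction of test statistics
    simulated under H0 that are at least as extreme as t_obs."""
    rng = random.Random(seed)
    count = sum(simulate_t(rng) >= t_obs for _ in range(n_sims))
    return count / n_sims

def chi2_3(rng):
    """Chi-squared statistic with 3 degrees of freedom under H0:
    the sum of squares of 3 standard normal deviates."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))

p = mc_p_value(7.81, chi2_3)   # should come out near 0.05
```

The same scheme works for any statistic you can simulate, which is exactly why the method is popular even when the pdf of T is analytically intractable.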
Frequentist hypothesis testing – significance tests
[Figure: the pdf p(T|H0), with the far tail marked "Reject H0 if your result lies in here"]
• H0 is rejected at the x% level if x% of the probability lies to the right of the observed value of the statistic (or is ‘worse’ in some other sense):
and makes no reference to how improbable the value is under any alternative hypothesis (not even H1!). So…
Frequentist hypothesis testing – significance tests
"An hypothesis [H0] that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."
— Harold Jeffreys, Theory of Probability, 1939
[Figure: p(T|H0) with Tobs marked on the T axis and X% of the area shaded to its right]
• E.g., you are given a data point Tobs affected by Gaussian noise, drawn either from N(μ=-1, σ=0.5) or N(μ=+1, σ=0.5). H0: the datum comes from μ = -1. H1: the datum comes from μ = +1. Test whether H1 is true.
• Here our statistic is simply the value of the observed datum. We will choose a critical region of T > 0, so that if Tobs > 0 we reject H0 and therefore accept H1.
Frequentist hypothesis testing – significance tests
[Figure: the two Gaussian pdfs, centred on T = -1 (H0) and T = +1 (H1)]
• Formally, this can go wrong in two ways:
– A Type I error occurs when we reject the null hypothesis when it is true.
– A Type II error occurs when we accept the null hypothesis when it is false.
both of which we should strive to minimise.
• The probabilities of these (the error rates α and β) are shown as the coloured areas above (about 2.3% for this problem, since T = 0 lies 2σ from each mean).
• If Tobs > 0 then we 'reject the null hypothesis at the 2.3% level' and therefore accept H1.
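These two error rates can be checked by simulation. A minimal sketch (the function name is mine) for the N(∓1, σ=0.5) example with the critical region T > 0:

```python
import random

def error_rates(mu0=-1.0, mu1=1.0, sigma=0.5, threshold=0.0,
                n_sims=200_000, seed=1):
    """Monte Carlo estimate of the Type I and Type II error rates for
    the test 'reject H0 if T > threshold', where T ~ N(mu0, sigma)
    under H0 and T ~ N(mu1, sigma) under H1."""
    rng = random.Random(seed)
    type_i = sum(rng.gauss(mu0, sigma) > threshold
                 for _ in range(n_sims)) / n_sims
    type_ii = sum(rng.gauss(mu1, sigma) <= threshold
                  for _ in range(n_sims)) / n_sims
    return type_i, type_ii

a, b = error_rates()   # both ≈ Φ(-2) ≈ 0.023 by symmetry
```

By symmetry of the two pdfs about T = 0, the two rates are equal here; moving the threshold trades one error off against the other.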
Frequentist hypothesis testing
[Figure: the two pdfs along the T axis, with the Type I error area (under H0, right of T = 0) and the Type II error area (under H1, left of T = 0) shaded]
• But note that we do not consider the relative probabilities of the hypotheses (we can’t! Orthodox statistics does not allow the idea of hypothesis probabilities), so the results can be misleading.
• For example, let Tobs=0.01. This lies just on the boundary of the critical region, so we reject H0 in favour of H1 at the 2.3% level, despite the fact that we know a value of 0.01 is just as likely under both hypotheses (actually just as unlikely, but it has happened).
• Note that the Bayes’ factor for this same result is ~1, reflecting the intuitive answer that you can’t decide between the hypotheses based on such a datum.
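For this datum the Bayes factor p(Tobs|H1)/p(Tobs|H0) can be written down directly; a quick check (the helper name is mine):

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian pdf N(mu, sigma) evaluated at x."""
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))

t_obs = 0.01
# Ratio of the two likelihoods; the log works out to exactly
# 2 * t_obs * (mu1 - mu0) / (2 * sigma^2) = 0.08 for these numbers
bayes_factor = normal_pdf(t_obs, 1.0, 0.5) / normal_pdf(t_obs, -1.0, 0.5)
# bayes_factor = exp(0.08) ≈ 1.08
```

A Bayes factor this close to 1 says the datum carries essentially no discriminating power, which matches intuition for a point halfway between the two means.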
Moral
The subtleties of p-values are rarely reflected in papers that quote them, and many errors of Type III (misuse!) occur. Always take them with a pinch of salt, and avoid them if possible – there are better tools available.
Frequentist hypothesis testing
Assigning Bayesian probabilities
• We have done much on the manipulation of probabilities, but not much on the initial assignments of likelihoods and priors. Here are a few words…
The principle of insufficient reason (Bernoulli, 1713) helps us out: If we have N mutually exclusive possibilities for an outcome, and no reason to believe one more than another, then each should be assigned the probability 1/N.
• So the probability of throwing a 6 on a die is 1/6, if all you know is it has 6 faces.
• The key is to realise that this is one example of a more general principle. Your state of knowledge is such that the probabilities are invariant under some transformation group (here, the exchange of numbers on faces).
• Using this idea we can extend the principle of insufficient reason to continuous parameters.
• So, a parameter that is not known to within a change of origin (a 'location parameter') should have a uniform probability:

p(x) = p(x + a) = constant.

• A parameter for which you have no knowledge of scale (a 'scale parameter') is a location parameter in its log, so

p(x) ∝ 1/x,

the so-called 'Jeffreys prior'.
• Note that both these priors are improper – you can't normalise them, so their unfettered use is restricted. However, you are rarely that ignorant about the value of a parameter.
Assigning Bayesian probabilities
• More generally, given some information, I, that we wish to use to assign probabilities {p1…pn} to n different possibilities, then the most honest way of assigning the probabilities is the one that maximises
subject to the constraints imposed by I. H is traditionally called the information entropy of the distribution and measures the amount of uncertainty in the distribution in the sense of Shannon (though there are several routes to the same result). Honesty demands we maximise this, otherwise we are implicitly imposing further constraints not contained in I.
• For example, the maximum entropy solution for the probability of seeing k events, given only the information that they are characterised by a single rate constant, is the Poisson distribution. If you are told the first and second moments of a continuous distribution, the maxent solution is a Gaussian, etc.
Assigning Bayesian probabilities
H = -Σ_{i=1}^{n} p_i log p_i
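A quick numerical illustration of H (the two candidate assignments below are invented for illustration): among assignments over six equivalent outcomes, the uniform one maximises the entropy, exactly as the principle of insufficient reason requires:

```python
import math

def entropy(p):
    """Shannon information entropy H = -sum_i p_i log p_i (natural log).
    Terms with p_i = 0 contribute nothing, by the usual convention."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [1.0 / 6] * 6                          # insufficient reason: 1/6 each
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]        # an arbitrary alternative

# entropy(uniform) = log 6 ≈ 1.79; any other assignment scores lower,
# i.e. it smuggles in constraints not contained in I.
```

Maximising H subject only to normalisation recovers the uniform assignment; adding moment constraints, as in the text, pushes the maximum towards Poisson or Gaussian forms.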
1. Uniform random number, U[0,1]
See Numerical Recipes!
http://www.numerical-recipes.com/
Monte Carlo Methods
2. Transformed random variables
Suppose we have x ~ U[0,1].
Let y = y(x).
Then
p(y)|dy| = p(x)|dx|
(the probability of a number between y and y+dy equals the probability of a number between x and x+dx), so
p(y) = p(x(y)) |dx/dy|,
with the modulus because probability must be positive.
Monte Carlo Methods
2. Transformed random variables
Suppose we have x ~ U[0,1].
Let y = a + (b - a)x.
Then y ~ U[a,b].
[Figure: p(y) is flat at height 1/(b-a) between a and b, zero elsewhere]
Monte Carlo Methods
Numerical Recipes uses the transformation method to provide y ~ N(0,1).
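One standard instance of a transformation giving Gaussian deviates from uniforms is the Box-Muller transformation (Numerical Recipes' own routine is a polar variant, so this is an illustrative sketch rather than its exact algorithm):

```python
import math
import random

def box_muller(rng):
    """Transform two independent U[0,1] deviates into two
    independent N(0,1) deviates via the Box-Muller transformation."""
    u1 = 1.0 - rng.random()          # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return (r * math.cos(2.0 * math.pi * u2),
            r * math.sin(2.0 * math.pi * u2))

rng = random.Random(7)
samples = [z for _ in range(50_000) for z in box_muller(rng)]
mean = sum(samples) / len(samples)                      # ≈ 0
var = sum((z - mean) ** 2 for z in samples) / len(samples)  # ≈ 1
```

The sample mean and variance of the output converge to 0 and 1, as they must for N(0,1) deviates.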
3. Probability integral transform
Suppose we can compute the CDF, P(x), of some desired random variable x.
1) Sample y ~ U[0,1]
2) Compute x = P⁻¹(y)
3) Then x ~ p(x)
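A minimal sketch of these three steps for the exponential distribution, where the CDF P(x) = 1 - exp(-λx) inverts analytically to P⁻¹(y) = -ln(1-y)/λ (function name mine):

```python
import math
import random

def sample_exponential(lam, rng):
    """Probability integral transform for p(x) = lam * exp(-lam * x):
    sample y ~ U[0,1], then invert the CDF P(x) = 1 - exp(-lam * x)."""
    y = rng.random()
    return -math.log(1.0 - y) / lam

rng = random.Random(3)
xs = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(xs) / len(xs)    # should approach 1/lam = 0.5
```

The method is exact and uses one uniform deviate per sample, but it needs an invertible CDF – which is exactly what the rejection and MCMC methods below avoid.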
Monte Carlo Methods
4. Rejection Sampling
Suppose we want to sample from some pdf p(x), and we know that

p(x) ≤ q(x) for all x,

where q(x) is a simpler function (an envelope we can sample under) called the proposal distribution.

1) Sample x1 from q(x)
2) Sample y ~ U[0, q(x1)]
3) If y < p(x1), ACCEPT; otherwise REJECT.

The set of accepted values {xi} is a sample from p(x).
[Figure: p(x) beneath the envelope q(x), with the point (x1, y) tested against p(x1)]
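The three steps above, as a minimal sketch (the names and the parabolic target are mine, chosen for illustration, with a flat envelope):

```python
import random

def rejection_sample(p, q_max, lo, hi, rng):
    """Rejection sampling with a flat envelope q(x) = q_max >= p(x)
    on [lo, hi]: draw x ~ U[lo, hi] and y ~ U[0, q_max]; accept x
    when the point (x, y) falls under the curve p(x)."""
    while True:
        x = rng.uniform(lo, hi)
        y = rng.uniform(0.0, q_max)
        if y < p(x):
            return x

# Illustrative target: p(x) = (3/4)(1 - x^2) on [-1, 1], max value 3/4
target = lambda x: 0.75 * (1.0 - x * x)
rng = random.Random(11)
xs = [rejection_sample(target, 0.75, -1.0, 1.0, rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)   # target is symmetric about 0, so mean ≈ 0
```

The acceptance rate is the ratio of the area under p to the area under the envelope (2/3 here); when that ratio is small the loop spins, which is the inefficiency the next slide describes.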
Monte Carlo Methods
4. Rejection Sampling
The method can be very slow if the region between p(x) and the envelope q(x) is too large – particularly in high-dimensional problems, such as the LISA problem considered earlier.
Monte Carlo Methods
5. Metropolis-Hastings Algorithm
Speed this up by letting the proposal distribution depend on the current sample:
o Sample an initial point x^(1)
o Sample a tentative new state x' from Q(x'; x^(1)) (e.g. a Gaussian centred on x^(1))
o Compute

a = [ p(x') Q(x^(1); x') ] / [ p(x^(1)) Q(x'; x^(1)) ]

o If a ≥ 1, accept. Otherwise accept with probability a.
Acceptance: x^(2) = x'
Rejection: x^(2) = x^(1)
The sequence of samples {x^(i)} is a Markov chain. Note that rejected samples appear in the chain as repeated values of the current state.
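A minimal one-dimensional sketch of the algorithm (names mine), with a symmetric Gaussian proposal so the Q terms in the acceptance ratio cancel; rejected proposals repeat the current state, exactly as noted above:

```python
import math
import random

def metropolis(log_p, x0, step, n_steps, rng):
    """Metropolis-Hastings with a symmetric Gaussian proposal
    Q(x'; x) = N(x, step^2).  Since Q(x'; x) = Q(x; x'), the ratio
    a = p(x')Q(x; x') / (p(x)Q(x'; x)) reduces to p(x')/p(x)."""
    chain = [x0]
    x, lp = x0, log_p(x0)
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)       # tentative new state
        lp_new = log_p(x_new)
        # Accept if a >= 1, otherwise accept with probability a
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            x, lp = x_new, lp_new
        chain.append(x)                        # rejections repeat the state
    return chain

# Illustrative target: unnormalised N(3, 1), log p(x) = -(x - 3)^2 / 2
rng = random.Random(5)
chain = metropolis(lambda x: -0.5 * (x - 3.0) ** 2, 0.0, 1.0, 200_000, rng)
burned = chain[10_000:]                        # discard burn-in
mean = sum(burned) / len(burned)               # ≈ 3
var = sum((z - mean) ** 2 for z in burned) / len(burned)   # ≈ 1
```

Working in log-probabilities avoids overflow, and only the unnormalised target is ever needed – which is what makes the method so useful for posteriors whose evidence integral is intractable.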
Monte Carlo Methods
http://www.statslab.cam.ac.uk/~mcmc/pages/links.html
Monte Carlo Methods
A histogram of the contents of the chain for a parameter converges to the marginal pdf for that parameter. In this way, high-dimensional posteriors can be explored and marginalised to return parameter ranges and interrelations inferred from complex data sets (e.g. the WMAP results).
Final thoughts
• Things to keep in mind:
– Priors are fundamental, but not always influential. If you have good data most reasonable priors return very similar posteriors.
– Degree of belief is subjective but not arbitrary. Two people with the same information should assign the same degree of belief, and hence the same probability.
– Write down the probability of everything, and marginalise over those parameters that are of no interest.
– Don’t pre-filter if you can help it. Work with the data, not statistics of the data.
Bayesian bibliography for astronomy
There are an increasing number of good books and articles on Bayesian methods. Here are just a few:
• E.T. Jaynes, Probability Theory: The Logic of Science, CUP, 2003
• D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms, CUP, 2003
• D.S. Sivia, Data Analysis: A Bayesian Tutorial, OUP, 1996
• T.J. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics, 1990, copy at http://bayes.wustl.edu/gregory/articles.pdf
• G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988, copy at http://bayes.wustl.edu/glb/book.pdf
• G. D'Agostini, Bayesian Reasoning in High-Energy Physics: Principles and Applications, http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
• Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, CUP, 2005
Review sites:
• http://bayes.wustl.edu/
• http://www.bayesian.org/