Probabilities, Likelihood, Bayesian inference
Tuesday, April 12, 2011
Taking samples for statistical analysis
Copyright © 2011 Paul O. Lewis
Combining probabilities
• Multiply probabilities if the component events must happen simultaneously (i.e. where you would naturally use the word AND when describing the problem)
Using 2 dice, what is the probability of [one particular face] AND [another particular face]?
(1/6) × (1/6) = 1/36
Combining probabilities
• Add probabilities if the component events are mutually exclusive (i.e. where you would naturally use the word OR in describing the problem)
Using one die, what is the probability of [one particular face] OR [another particular face]?
(1/6) + (1/6) = 1/3
Combining AND and OR
What is the probability that the sum of two dice is 7?
1 and 6
2 and 5
3 and 4
4 and 3
5 and 2
6 and 1
(1/36) + (1/36) + (1/36) + (1/36) + (1/36) + (1/36) = 1/6
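This AND-then-OR calculation is easy to verify by brute force. A small Python sketch (not part of the original slides) enumerates all 36 equally likely outcomes of two dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Each specific pair is an AND event with probability (1/6) * (1/6) = 1/36;
# the pairs summing to 7 are mutually exclusive, so their probabilities add (OR).
p_pair = Fraction(1, 6) * Fraction(1, 6)
p_sum7 = sum(p_pair for d1, d2 in outcomes if d1 + d2 == 7)
print(p_sum7)  # 1/6
```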
Joint probabilities (B = Black, W = White, S = Solid, D = Dotted)
Conditional probabilities
Independence
This is always true:
Pr(A and B) = Pr(A) Pr(B|A)
(joint probability = probability of A times the conditional probability of B given A)
If we can say this:
Pr(B|A) = Pr(B)
...then events A and B are independent and we can express the joint probability as the product of Pr(A) and Pr(B):
Pr(A and B) = Pr(A) Pr(B)
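The independence test above can be checked concretely on the two-dice sample space. In this sketch (the two events are my own choice for illustration, not from the slides), "first die is even" and "the dice sum to 7" turn out to be independent:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    # Probability = (outcomes satisfying the event) / (total outcomes)
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):  # first die shows an even number
    return o[0] % 2 == 0

def B(o):  # the two dice sum to 7
    return o[0] + o[1] == 7

# Pr(A and B) = Pr(A) Pr(B|A) always; independence means Pr(B|A) = Pr(B),
# equivalently Pr(A and B) = Pr(A) Pr(B).
p_joint = prob(lambda o: A(o) and B(o))
print(p_joint == prob(A) * prob(B))  # True: A and B are independent here
```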
The Likelihood Criterion
The probability of the observations computed using a model tells us how surprised we should be. The preferred model is the one that surprises us least.
Suppose I threw 20 dice down on the table and this was the result...

The Fair Dice model
Pr(obs. | fair dice model) = (1/6)^20 = 1/3,656,158,440,062,976
You should have been very surprised at this result because the probability of this event is very small: only 1 in 3.6 quadrillion!
The Trick Dice model (assumes dice each have 5 on every side)
Pr(obs. | trick dice model) = 1^20 = 1
You should not be surprised at all at this result because the observed outcome is certain under this model.
Results

Model        Likelihood                 Surprise level
Fair dice    1/3,656,158,440,062,976    Very, very, very surprised
Trick dice   1.0                        Not surprised at all

The winning model maximizes the likelihood (and thus minimizes surprise).
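The two likelihoods can be reproduced exactly with rational arithmetic. This sketch assumes the observed throw was the outcome the trick dice make certain (twenty 5s), which is how the slides set the example up:

```python
from fractions import Fraction

n_dice = 20  # twenty dice thrown; assume every one landed showing 5

# Fair dice model: each die shows 5 with probability 1/6.
L_fair = Fraction(1, 6) ** n_dice

# Trick dice model: every side shows 5, so the observation is certain.
L_trick = Fraction(1) ** n_dice

print(L_fair)            # 1/3656158440062976
print(L_trick)           # 1
print(L_trick > L_fair)  # True: the trick-dice model maximizes the likelihood
```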
Likelihood and model comparison
• Analyses using likelihoods ultimately involve model comparison
• The models compared can be discrete (as in the fair vs. trick dice example)
• More often the models compared differ continuously:
– Model 1: branch length is 0.05
– Model 2: branch length is 0.06
Rather than having an infinity of models, we instead think of the branch length as a parameter within one model.
Likelihood versus log likelihood
[Figure: the likelihood L (vertical axis roughly 0.05 to 0.25) and the log likelihood Log[L] (vertical axis roughly −14 to −2) plotted against the same parameter over the range 0.2 to 1.0. Both curves reach their maximum at the same parameter value.]
Invariance principle
Coin flipping
Pr(y|p) = C(n,y) p^y (1-p)^(n-y) = L(p|y)
y = observed number of heads
n = number of flips (sample size)
p = (unobserved) proportion of heads
Note that the same formula serves as both the:
- probability of y (if p is fixed)
- likelihood of p (if y is fixed)
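The dual reading of the binomial formula can be demonstrated directly. In this sketch the sample size n = 10 and the observed y = 6 heads are assumed values chosen for illustration:

```python
from math import comb

def binom(y, n, p):
    # Pr(y | p) = C(n, y) p^y (1 - p)^(n - y)
    return comb(n, y) * p**y * (1 - p)**(n - y)

n = 10  # assumed number of flips

# As a probability distribution over the data y (model p fixed at 0.5),
# the values sum to 1 over all possible y:
total = sum(binom(y, n, 0.5) for y in range(n + 1))
print(round(total, 10))  # 1.0

# As a likelihood function of p (data fixed at y = 6 heads),
# the values need not sum to 1; they rank models instead:
print(binom(6, n, 0.6) > binom(6, n, 0.5))  # True: p = 0.6 fits 6/10 better
```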
Outcome   Fair coin model   Two-heads model
H         0.5               1.0
T         0.5               0.0
Sum       1.0               1.0
Likelihood vs. Probability
• The symbol Pr(D|M) is often used to represent the likelihood (I will be following convention in this regard)
Pr(D|M): probabilities are functions of the data (the model is fixed)
L(M|D): likelihoods are functions of models (the data are fixed)
Examples
Mathematica example
Bayesian inference
Bayes’ rule
Probability of "Dotted"
Bayes' rule (cont.)
Pr(D) is the marginal probability of being dotted. To compute it, we marginalize over colors.
Joint probabilities

     B          W
D    Pr(D,B)    Pr(D,W)
S    Pr(S,B)    Pr(S,W)
Marginal probabilities
D: Pr(D,B) + Pr(D,W) = marginal probability of being dotted
S: Pr(S,B) + Pr(S,W) = marginal probability of being solid
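Marginalizing is just summing joint probabilities across a row or down a column. In this sketch the four joint probabilities are invented for illustration (the slides leave them symbolic):

```python
from fractions import Fraction

# Invented joint probabilities for the marble example (illustration only).
joint = {
    ("D", "B"): Fraction(1, 10), ("D", "W"): Fraction(2, 10),
    ("S", "B"): Fraction(3, 10), ("S", "W"): Fraction(4, 10),
}

# Marginalize over color (sum across a row) to get Pr(D) and Pr(S):
pr_dotted = joint[("D", "B")] + joint[("D", "W")]
pr_solid = joint[("S", "B")] + joint[("S", "W")]

# Marginalize over "dottedness" (sum down a column) to get Pr(W):
pr_white = joint[("D", "W")] + joint[("S", "W")]

print(pr_dotted, pr_solid, pr_white)   # 3/10 7/10 3/5
print(pr_dotted + pr_solid == 1)       # True: the marginals sum to 1
```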
Marginalizing over "dottedness"

     B          W
D    Pr(D,B)    Pr(D,W)
S    Pr(S,B)    Pr(S,W)

Pr(D,W) + Pr(S,W) = marginal probability of being a white marble
Bayes' rule (cont.)
Bayes' rule in Statistics
D refers to the "observables" (i.e. the Data)! refers to one or more "unobservables"
(i.e. parameters of a model, or the model itself):– tree model (i.e. tree topology)– substitution model (e.g. JC, F84, GTR, etc.)– parameter of a substitution model (e.g. a branch length,
a base frequency, transition/transversion rate ratio, etc.)– hypothesis (i.e. a special case of a model)– a latent variable (e.g. ancestral state)
Pr(θ|D) = Pr(D|θ) Pr(θ) / Σ_θ Pr(D|θ) Pr(θ)

Posterior probability of hypothesis θ = (likelihood of hypothesis θ) × (prior probability of hypothesis θ) / (marginal probability of the data, marginalizing over hypotheses)
Simple (albeit silly) paternity example

Possibilities        θ1     θ2     Row sum
Genotypes            AA     Aa     ---
Prior                1/2    1/2    1
Likelihood           1      1/2    ---
Prior × Likelihood   1/2    1/4    3/4
Posterior            2/3    1/3    1

θ1 and θ2 are assumed to be the only possible fathers; the child has genotype Aa and the mother has genotype aa, so the child must have received allele A from the true father. Note: the data in this case is the child's genotype (Aa).
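The posterior row of the table follows from normalizing the prior-times-likelihood row. A sketch of that arithmetic:

```python
from fractions import Fraction

# Prior and likelihood columns from the table: the child is Aa and the
# mother is aa, so the true father must have passed allele A.
prior = {"theta1": Fraction(1, 2), "theta2": Fraction(1, 2)}
likelihood = {"theta1": Fraction(1), "theta2": Fraction(1, 2)}  # AA always passes A; Aa half the time

unnorm = {h: prior[h] * likelihood[h] for h in prior}
marginal = sum(unnorm.values())                    # Pr(data) = 3/4
posterior = {h: unnorm[h] / marginal for h in unnorm}

print(posterior["theta1"], posterior["theta2"])    # 2/3 1/3
```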
The prior can be your friend
Suppose the test for a rare disease is 99% accurate:
Pr(+|disease) = 0.99
Pr(+|healthy) = 0.01
(Here the test result is the datum and disease status is the hypothesis. Note that we do not need to consider the case of a negative test result.)
Suppose further I test positive for the disease. How worried should I be?
It is very tempting to (mis)interpret the likelihood as a posterior probability and conclude "There is a 99% chance that I have the disease."
The posterior probability is 0.99 only if the prior probability of having the disease is 0.5:
Pr(disease|+) = Pr(+|disease) (1/2) / [Pr(+|disease) (1/2) + Pr(+|healthy) (1/2)]
             = (0.99)(1/2) / [(0.99)(1/2) + (0.01)(1/2)]
             = 0.99
If, however, the prior odds against having the disease are a million to 1, then the posterior probability is much more reassuring:
Pr(disease|+) = (0.99)(1/1,000,000) / [(0.99)(1/1,000,000) + (0.01)(999,999/1,000,000)]
             ≈ 0.0001
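Both disease calculations are the same formula with different priors. A small sketch (the function name is mine) makes the comparison explicit:

```python
def posterior_disease(prior_disease, p_pos_disease=0.99, p_pos_healthy=0.01):
    # Bayes' rule: Pr(disease | +) for a given prior Pr(disease).
    numerator = p_pos_disease * prior_disease
    denominator = numerator + p_pos_healthy * (1 - prior_disease)
    return numerator / denominator

# With a 50:50 prior, the posterior equals the tempting "99%" reading:
print(round(posterior_disease(0.5), 6))   # 0.99

# With million-to-1 prior odds against the disease, it is far smaller:
print(round(posterior_disease(1e-6), 6))  # about 0.0001
```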
An important caveat
This (rare disease) example involves a tiny amount of data (one observation) and an extremely informative prior, and gives the impression that maximum likelihood (ML) inference is not very reliable.
However, in phylogenetics, we often have lots of data and use much less informative priors, so in phylogenetics ML inference is generally very reliable.
Discrete vs. Continuous
• So far, we've been dealing with discrete hypotheses (e.g. either this father or that father, have disease or don’t have disease)
• In phylogenetics, substitution models represent an infinite number of hypotheses (each combination of parameter values is in some sense a separate hypothesis)
• How do we use Bayes' rule when our hypotheses form a continuum?
Bayes' rule: continuous case
f(θ|D) = f(D|θ) f(θ) / ∫ f(D|θ) f(θ) dθ

Posterior probability density = (likelihood) × (prior probability density) / (marginal probability of the data)
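For a one-parameter model the integral in the denominator can be approximated on a grid. This sketch applies the continuous Bayes' rule to the coin-flipping likelihood with a uniform prior; the data (6 heads in 10 flips) are assumed for illustration:

```python
from math import comb

# Assumed data: y = 6 heads in n = 10 flips; uniform prior density f(p) = 1.
n, y = 10, 6
dp = 1 / 1000
grid = [i / 1000 for i in range(1001)]  # p from 0.0 to 1.0

likelihood = [comb(n, y) * p**y * (1 - p)**(n - y) for p in grid]
prior = [1.0] * len(grid)
unnorm = [l * f for l, f in zip(likelihood, prior)]

# The denominator of Bayes' rule: the integral of f(D|p) f(p) dp,
# approximated as a Riemann sum over the grid.
marginal = sum(unnorm) * dp
posterior = [u / marginal for u in unnorm]

# The posterior density integrates to 1 and peaks at p = y/n = 0.6.
print(round(sum(posterior) * dp, 6))          # 1.0
print(grid[posterior.index(max(posterior))])  # 0.6
```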
If you had to guess...
Not knowing anything about my archery abilities, draw a curve representing your view of the chances of my arrow landing a distance d from the center of the target (if it helps, I'm standing 50 meters away from the target).
[Figure: a 1-meter target; d is the distance from the arrow to the target center.]
Case 1: assume I have talent
An informative prior (low variance) that says most of my arrows will fall within 10 cm of the center (thanks for your confidence!).
[Figure: a narrow density over distance in centimeters from target center; axis 0.0 to 80.0.]
Case 2: assume I have a talent for missing the target!
Also an informative prior, but one that says most of my arrows will fall within a narrow range just outside the entire target!
[Figure: a narrow density concentrated just beyond the target's edge; axis in cm from target center, 0.0 to 60.0.]
Case 3: assume I have no talent
This is a vague prior: its high variance reflects nearly total ignorance of my abilities, saying that my arrows could land nearly anywhere!
[Figure: a broad, nearly flat density; axis in cm from target center, 0.0 to 80.0.]
A matter of scale
Notice that I haven't provided a scale for the vertical axis. What exactly does the height of this curve mean? For example, does the height of the dotted line represent the probability that my arrow lands 60 cm from the center of the target?
Probabilities apply to intervals
Probabilities are attached to intervals (i.e. ranges of values), not individual values. The probability of any given point (e.g. d = 60.0) is zero! However, we can ask about the probability that d falls in a particular range, e.g. 50.0 < d < 65.0.
Probabilities vs. probability densities
Probability density function. Note: the height of this curve does not represent a probability (if it did, it would not exceed 1.0).
Integration of densities
The density curve is scaled so that the value of the integral over its entire range (i.e. the total area under the curve) equals 1.0.
Integration of a probability density yields a probability
The area under the density curve from 0 to 2 is the probability that θ is less than 2.
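The area-under-the-curve statement can be checked numerically. The slides do not name the density, so this sketch assumes a standard exponential density, for which the area from 0 to 2 is exactly 1 − exp(−2):

```python
from math import exp

# Assumed density: f(x) = exp(-x) for x >= 0 (a standard exponential),
# whose total area is 1. The area from 0 to 2 is Pr(theta < 2).
def density(x):
    return exp(-x)

# Midpoint-rule numerical integration from 0 to 2:
steps = 100_000
width = 2 / steps
area = sum(density((i + 0.5) * width) * width for i in range(steps))

print(round(area, 6))         # 0.864665
print(round(1 - exp(-2), 6))  # 0.864665, the exact answer
```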
Archery Priors Revisited
[Figure: three gamma density curves, with mean=2.5, var=3.125; mean=60, var=3; and mean=200, var=40000.]
These density curves are all variations of a gamma probability distribution. We could have used a gamma distribution to specify each of the prior probability distributions for the archery example. Note that higher variance means less informative.
The posterior is generally more informative than the prior (data contains information)
[Figure: a uniform prior density for p alongside a peaked posterior density; a shaded area under the posterior represents a posterior probability (mass).]
Mathematica example
Beta prior gives more flexibility
[Figure: Beta(2,2) prior density and the resulting posterior density for p.]
Posterior probability of p between 0.45 and 0.55 is 0.223.
Note:
• prior is vague but not flat
• posterior has smaller variance (i.e. it is more informative) than the prior
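The interval probability quoted on the slide comes from integrating the posterior density over 0.45 < p < 0.55. The slide's underlying coin data are not shown, so this sketch assumes 6 heads in 10 flips to illustrate the method (it will not reproduce the 0.223 figure exactly):

```python
from math import comb

# Assumed data: y = 6 heads in n = 10 flips (for illustration only).
n, y = 10, 6
dp = 1 / 10000
grid = [i * dp for i in range(1, 10000)]  # open interval (0, 1)

beta22 = [6 * p * (1 - p) for p in grid]  # Beta(2,2) prior density
likelihood = [comb(n, y) * p**y * (1 - p)**(n - y) for p in grid]
unnorm = [f * l for f, l in zip(beta22, likelihood)]
marginal = sum(unnorm) * dp
posterior = [u / marginal for u in unnorm]

# Posterior probability that 0.45 < p < 0.55 (area under the posterior):
prob = sum(d * dp for p, d in zip(grid, posterior) if 0.45 < p < 0.55)
print(round(prob, 3))
```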
Usually there are many parameters...
[The continuous Bayes' rule again, now for a 2-parameter example: posterior probability density = likelihood × prior probability density / marginal probability of the data.]
An analysis of 100 sequences under the simplest model (JC69) requires 197 branch length parameters. The denominator is a 197-fold integral in this case! Now consider summing over all possible tree topologies! It would thus be nice to avoid having to calculate the marginal probability of the data...
ISC5317-Fall 2007-PB Computational Evolutionary Biology
The more data we accumulate, the more the posterior will be influenced by the likelihood. Eventually, the posterior will be determined entirely by the likelihood. For instance, if 10,000 throws still indicated that the coin had a heads probability of 0.45, then the posterior would spike on that value even for a Beta(100, 100) prior distribution that put a lot of probability on fair coins (p = 0.5).
6 Accumulating scientific knowledge
One of the most attractive aspects of Bayesian theory is that it provides a rational and consistent
framework for accumulating scientific knowledge. According to the Bayesian view, knowledge is
accumulated by successively updating probability distributions to take new data into account. In
fact, we can show that a successive series of Bayesian updates based on stepwise accumulation of
evidence is mathematically equivalent to a single update based on all the evidence. Assume that
we start with an initial prior P (θ) and an initial set of observations D1. The posterior probability
distribution after this step would be:
P(θ|D1) = P(θ) P(D1|θ) / P(D1)
If we use this posterior probability distribution as the prior in the next step, where we take a second
set of observations D2 into account, we get:
P(θ|D2, D1) = P(θ|D1) P(D2|θ) / P(D2)
            = P(θ) P(D1|θ) P(D2|θ) / [P(D1) P(D2)]
            = P(θ) P(D1, D2|θ) / P(D1, D2)
The last rearrangement can be made because D1 and D2 are independent sets of observations.
Note that the last equation is the same as applying Bayes’ theorem directly on the collected set
of observations, D1 + D2. In other words, there is a rational procedure for accumulating scientific
knowledge in the Bayesian framework and the end result is the same regardless of the order or
exact way in which the data are added as long as the posterior probability distribution obtained in
one analysis is used as the prior in the next analysis. In maximum likelihood analyses, background
knowledge is typically ignored.
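The stepwise-equals-batch result is easy to see in the conjugate beta-binomial case, where each update simply adds counts to the Beta parameters. The data sets in this sketch are invented for illustration:

```python
# Beta-binomial conjugate updating: a Beta(a, b) prior plus h heads and
# t tails yields a Beta(a + h, b + t) posterior.
def update(prior, heads, tails):
    a, b = prior
    return (a + heads, b + tails)

prior = (1, 1)   # Beta(1,1): uniform prior
d1 = (3, 2)      # first data set: 3 heads, 2 tails
d2 = (6, 9)      # second data set: 6 heads, 9 tails

# Stepwise: the posterior from d1 becomes the prior for d2.
stepwise = update(update(prior, *d1), *d2)

# Batch: a single update on the pooled data.
batch = update(prior, d1[0] + d2[0], d1[1] + d2[1])

print(stepwise == batch)  # True: same posterior either way
print(stepwise)           # (10, 12)
```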
Accumulation of scientific knowledge