Lecture 4: Probability and what it has to do with data analysis

Page 1: Lecture 4 Probability and what it has to do with data analysis

Lecture 4

Probability and what it has to do with data analysis

Page 2: Lecture 4 Probability and what it has to do with data analysis

Please Read

Doug Martinson’s

Chapter 2: ‘Probability Theory’

Available on Courseworks

Page 3: Lecture 4 Probability and what it has to do with data analysis

Abstraction

Random variable, x

it has no set value, until you ‘realize’ it

its properties are described by a distribution, p(x)

Page 4: Lecture 4 Probability and what it has to do with data analysis

When you realize x

the probability that the value you get is

between x and x+dx

is p(x) dx

Probability density distribution

Page 5: Lecture 4 Probability and what it has to do with data analysis

The probability, P, that the value you get is between x1 and x2 is

$P = \int_{x_1}^{x_2} p(x)\,dx$

Note that it is written with a capital P

and represented by a fraction between

0 = never

and

1 = always

Page 6: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with the interval from x1 to x2 shaded]

Probability P that x is between x1 and x2 is proportional to this area

Page 7: Lecture 4 Probability and what it has to do with data analysis

The probability that the value you get is something is unity:

$\int_{-\infty}^{+\infty} p(x)\,dx = 1$

(or whatever the allowable range of x is …)

[Figure: p(x) versus x, with the entire area under the curve shaded]

Probability that x is between -∞ and +∞ is unity, so total area = 1
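
As a quick numerical check (not from the lecture), here is a minimal Python sketch, assuming numpy and scipy are available, that integrates the standard normal pdf to confirm the total area is 1 and computes P for an interval between x1 and x2:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Total area under the pdf should be 1 (the "always" case).
total, _ = quad(norm.pdf, -np.inf, np.inf)

# P that a realization falls between x1 and x2 is the area over that interval.
x1, x2 = -1.0, 2.0
P, _ = quad(norm.pdf, x1, x2)

print(total)  # ~1.0
print(P)      # ~0.819 for the standard normal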

Page 8: Lecture 4 Probability and what it has to do with data analysis

Why all this is relevant …

Any measurement that contains noise is treated as a random variable, x

The distribution p(x) embodies both the ‘true value’ of the quantity being measured and the measurement noise

All quantities derived from a random variable are themselves random variables, so …

The algebra of random variables allows you to understand how measurement noise affects inferences made from the data

Page 9: Lecture 4 Probability and what it has to do with data analysis

Basic Description of Distributions

Page 10: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with the peak of the distribution at x_mode]

Mode

The x at which the distribution has its peak

The most-likely value of x

Page 11: Lecture 4 Probability and what it has to do with data analysis

But modes can be deceptive …

[Figure: histogram of x between 0 and 10, with the peak in the 1-2 bin marked as x_mode]

100 realizations of x:

bin    count
0-1      3
1-2     18
2-3     11
3-4      8
4-5     11
5-6     14
6-7      8
7-8      7
8-9     11
9-10     9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!
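
The same point can be checked in a few lines of Python (a sketch, not part of the lecture; the counts are copied from the table above):

# Counts from the 100-realization histogram above (bin -> count).
counts = {"0-1": 3, "1-2": 18, "2-3": 11, "3-4": 8, "4-5": 11,
          "5-6": 14, "6-7": 8, "7-8": 7, "8-9": 11, "9-10": 9}

mode_bin = max(counts, key=counts.get)        # "1-2", the modal bin
above_2 = sum(n for b, n in counts.items() if b not in ("0-1", "1-2"))

print(mode_bin, counts[mode_bin])  # the 1-2 bin has the most counts (18) ...
print(above_2)                     # ... yet 79 of the 100 values exceed 2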

Page 12: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with x_median dividing the area into two halves of 50% each]

Median

50% chance x is smaller than x_median

50% chance x is bigger than x_median

No special reason the median needs to coincide with the peak

Page 13: Lecture 4 Probability and what it has to do with data analysis

[Figure: bar chart of a discrete distribution over the values x = 1, 2, 3]

Expected value or 'mean'

The x you would get if you took the mean of lots of realizations of x

Let's examine a discrete distribution, for simplicity ...

Page 14: Lecture 4 Probability and what it has to do with data analysis

Hypothetical table of 140 realizations of x:

x       N
1      20
2      80
3      40
Total 140

mean = [ 20·1 + 80·2 + 40·3 ] / 140

     = (20/140)·1 + (80/140)·2 + (40/140)·3

     = p(1)·1 + p(2)·2 + p(3)·3

     = Σi p(xi) xi ≈ 2.14
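
A minimal Python sketch of the same arithmetic (using the hypothetical table above; not part of the lecture):

# mean = sum_i p(x_i) * x_i, with p(x_i) estimated as N_i / N_total
table = {1: 20, 2: 80, 3: 40}           # value -> number of realizations
N = sum(table.values())                 # 140
mean = sum((n / N) * x for x, n in table.items())
print(mean)                             # 300/140 ~ 2.14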

Page 15: Lecture 4 Probability and what it has to do with data analysis

By analogy, for a smooth distribution, the expected value of x is

$E(x) = \int_{-\infty}^{+\infty} x\, p(x)\,dx$

Page 16: Lecture 4 Probability and what it has to do with data analysis

By the way … you can compute the expected (“mean”) value of any function of x this way:

$E(x) = \int_{-\infty}^{+\infty} x\, p(x)\,dx$

$E(x^2) = \int_{-\infty}^{+\infty} x^2\, p(x)\,dx$

and, in general, $E(f(x)) = \int_{-\infty}^{+\infty} f(x)\, p(x)\,dx$

etc.

Page 17: Lecture 4 Probability and what it has to do with data analysis

Beware!

$E(x^2) \neq [E(x)]^2$

and, more generally, $E(f(x)) \neq f(E(x))$ for a nonlinear function f

and so forth …
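
A quick Monte Carlo illustration in Python (a sketch, not from the lecture; it assumes numpy) of why the two quantities differ:

import numpy as np

# E(x^2) and E(x)^2 differ by the variance: E(x^2) = var(x) + E(x)^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # mean 1, variance 1

print(np.mean(x**2))   # ~2.0, an estimate of E(x^2)
print(np.mean(x)**2)   # ~1.0, an estimate of E(x)^2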

Page 18: Lecture 4 Probability and what it has to do with data analysis

Width of a distribution

[Figure: p(x) versus x, with a central interval W50 containing 50% of the area, 25% lying below it and 25% above]

Here's a perfectly sensible way to define the width of a distribution: the interval W50 that contains the central 50% of the probability …

… it's not used much, though

Page 19: Lecture 4 Probability and what it has to do with data analysis

Width of a distribution

Here's another way …

[Figure: p(x) versus x with E(x) marked, overlaid with the parabola [x-E(x)]²]

… multiply p(x) by the parabola [x-E(x)]² and integrate

Page 20: Lecture 4 Probability and what it has to do with data analysis

Variance: $\sigma^2 = \int_{-\infty}^{+\infty} [x - E(x)]^2\, p(x)\,dx$

[Figure: p(x), the parabola [x-E(x)]², and their product [x-E(x)]² p(x), all plotted against x with E(x) marked]

Compute this total area …

The idea is that if the distribution is narrow, then most of the probability lines up with the low spot of the parabola

But if it is wide, then some of the probability lines up with the high parts of the parabola
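
A numerical sketch of this recipe (assuming scipy; the choice of a normal pdf with σ = 0.5 is just for illustration, so the integral should return about 0.25):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 3.0, 0.5
# First get E(x), then integrate the parabola [x - E(x)]^2 against p(x).
Ex, _ = quad(lambda x: x * norm.pdf(x, mu, sigma), -np.inf, np.inf)
var, _ = quad(lambda x: (x - Ex)**2 * norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(Ex, var)   # ~3.0, ~0.25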

Page 21: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with E(x) marked and the width σ indicated]

variance = σ²

A measure of width …

we don't immediately know its relationship to area, though …

Page 22: Lecture 4 Probability and what it has to do with data analysis

The Gaussian or normal distribution

$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\bar{x})^2}{2\sigma^2}\right\}$

where $\bar{x}$ is the expected value and $\sigma^2$ is the variance

Memorize me !
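
A sketch of the formula in Python (the comparison against scipy.stats.norm is just a sanity check and is not part of the lecture):

import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, xbar, sigma):
    """p(x) = exp{ -(x - xbar)^2 / (2 sigma^2) } / (sqrt(2 pi) sigma)"""
    return np.exp(-(x - xbar)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-3.0, 5.0, 9)
print(np.allclose(gaussian_pdf(x, 1.0, 1.0), norm.pdf(x, loc=1.0, scale=1.0)))  # True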

Page 23: Lecture 4 Probability and what it has to do with data analysis

Examples of Normal Distributions

[Figure: two plots of p(x) versus x, one with x̄ = 1, σ = 1 and one with x̄ = 3, σ = 0.5]

Page 24: Lecture 4 Probability and what it has to do with data analysis

Properties of the normal distribution

[Figure: p(x) versus x, with the interval from x̄-2σ to x̄+2σ shaded and labeled 95%]

Expectation = Median = Mode = x̄

95% of the probability lies within 2σ of the expected value
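
A quick check of the 2σ rule with scipy (a sketch; the values x̄ = 3, σ = 0.5 are arbitrary):

from scipy.stats import norm

xbar, sigma = 3.0, 0.5
# Area under the normal pdf between xbar - 2*sigma and xbar + 2*sigma.
P = norm.cdf(xbar + 2*sigma, xbar, sigma) - norm.cdf(xbar - 2*sigma, xbar, sigma)
print(P)   # ~0.954, i.e. about 95%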

Page 25: Lecture 4 Probability and what it has to do with data analysis

Functions of a random variable

any function of a random variable is itself a random variable

Page 26: Lecture 4 Probability and what it has to do with data analysis

If x has distribution p(x)

then y(x) has distribution

$p(y) = p[x(y)]\,\left|\frac{dx}{dy}\right|$

Page 27: Lecture 4 Probability and what it has to do with data analysis

This follows from the rule for transforming integrals …

$1 = \int_{x_1}^{x_2} p(x)\,dx = \int_{y_1}^{y_2} p[x(y)]\,\left|\frac{dx}{dy}\right|\,dy$

with the limits chosen so that y1 = y(x1), etc.

Page 28: Lecture 4 Probability and what it has to do with data analysis

example

Let x have a uniform (white) distribution on [0,1]

[Figure: p(x) = 1 for x between 0 and 1, and zero elsewhere]

Uniform probability that x is anywhere between 0 and 1

Page 29: Lecture 4 Probability and what it has to do with data analysis

Let y = x², then x = y^½

$y(x{=}0)=0, \quad y(x{=}1)=1, \quad \frac{dx}{dy}=\tfrac{1}{2}\,y^{-1/2}, \quad p[x(y)]=1$

So $p(y)=\tfrac{1}{2}\,y^{-1/2}$ on the interval [0,1]

Page 30: Lecture 4 Probability and what it has to do with data analysis

Numerical test: histograms of 1000 random numbers

Histogram of x, generated with Excel's rand() function, which claims to be based upon a uniform distribution. Plausible that it's uniform.

Histogram of x², generated by squaring the x's from above. Plausible that it's proportional to 1/√y.
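
A sketch of the same test using numpy's uniform generator in place of Excel's rand() (not part of the lecture):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1000)    # uniform on [0,1]
y = x**2                                # the transformed variable

hx, _ = np.histogram(x, bins=10, range=(0, 1))
hy, _ = np.histogram(y, bins=10, range=(0, 1))
print(hx)   # roughly flat counts
print(hy)   # counts pile up near 0, falling off like 1/sqrt(y)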

Page 31: Lecture 4 Probability and what it has to do with data analysis

multivariate distributions

Page 32: Lecture 4 Probability and what it has to do with data analysis

example

Liberty Island is inhabited by both pigeons and seagulls

40% of the birds are pigeons and 60% of the birds are gulls

50% of pigeons are white and 50% are tan; 100% of gulls are white

Page 33: Lecture 4 Probability and what it has to do with data analysis

Two variables:

species s takes two values: pigeon p and gull g

color c takes two values: white w and tan t

Of 100 birds,

20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls

Page 34: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has species s and color c ?

(a random bird, that is)

           c
         w      t
  s  p  20%    20%
     g  60%     0%

Note: the sum of all boxes is 100%

Page 35: Lecture 4 Probability and what it has to do with data analysis

This is called the Joint Probability

and is written P(s,c)

Page 36: Lecture 4 Probability and what it has to do with data analysis

Two continuous variables, say x1 and x2,

have a joint probability distribution, written p(x1, x2), with

$\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$

Page 37: Lecture 4 Probability and what it has to do with data analysis

The probability that x1 is between x1 and x1+dx1

and x2 is between x2 and x2+dx2

is p(x1, x2) dx1 dx2

so $\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$

Page 38: Lecture 4 Probability and what it has to do with data analysis

You would contour a joint probability distribution

and it would look something like

[Figure: contours of p(x1, x2) in the (x1, x2) plane]

Page 39: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has color c ?

Start with P(s,c):

           c
         w      t
  s  p  20%    20%
     g  60%     0%
        ----   ----
         80%    20%

and sum the columns to get P(c)

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 40: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has species s ?

Start with P(s,c):

           c
         w      t
  s  p  20%    20%   → 40%
     g  60%     0%   → 60%

and sum the rows to get P(s)

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 41: Lecture 4 Probability and what it has to do with data analysis

These operations make sense with distributions, too

[Figure: contours of p(x1, x2), with the marginal curves p(x1) and p(x2) plotted alongside]

$p(x_1) = \int p(x_1, x_2)\, dx_2$   — the distribution of x1 (irrespective of x2)

$p(x_2) = \int p(x_1, x_2)\, dx_1$   — the distribution of x2 (irrespective of x1)
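
A numerical sketch of marginalization (assuming scipy; the correlated 2D normal evaluated on a grid is just an example distribution):

import numpy as np
from scipy.stats import multivariate_normal

x1 = np.linspace(-4, 4, 201)
x2 = np.linspace(-4, 4, 201)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
p = multivariate_normal(mean=[0, 0], cov=[[1, 0.6], [0.6, 1]]).pdf(np.dstack((X1, X2)))

dx1 = x1[1] - x1[0]
dx2 = x2[1] - x2[0]
p_x1 = p.sum(axis=1) * dx2          # p(x1), irrespective of x2
print(p_x1.sum() * dx1)             # ~1, as a marginal distribution should be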

Page 42: Lecture 4 Probability and what it has to do with data analysis

Given that a bird is species s, what is the probability that it has color c ?

           c
         w      t
  s  p  50%    50%
     g 100%     0%

Note: all rows sum to 100%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 43: Lecture 4 Probability and what it has to do with data analysis

This is called the Conditional Probability of c given s

and is written P(c|s)

similarly …

Page 44: Lecture 4 Probability and what it has to do with data analysis

Given that a bird is color c, what is the probability that it has species s ?

           c
         w      t
  s  p  25%   100%
     g  75%     0%

Note: all columns sum to 100%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

So 25% of white birds are pigeons

Page 45: Lecture 4 Probability and what it has to do with data analysis

This is called the Conditional Probability of s given c

and is written P(s|c)

Page 46: Lecture 4 Probability and what it has to do with data analysis

Beware!  P(c|s) ≠ P(s|c)

  P(c|s):                       P(s|c):

           c                              c
         w      t                       w      t
  s  p  50%    50%              s  p   25%   100%
     g 100%     0%                 g   75%     0%

Page 47: Lecture 4 Probability and what it has to do with data analysis

note

P(s,c) = P(s|c) P(c)

    P(s,c)                   P(s|c)                  P(c)

           c                         c                      c
         w     t                   w      t               w     t
  s  p  20    20    =    s  p    25%   100%    ×         80    20
     g  60     0            g    75%     0%

25% of 80 is 20

Page 48: Lecture 4 Probability and what it has to do with data analysis

and

P(s,c) = P(c|s) P(s)

    P(s,c)                   P(c|s)                  P(s)

           c                         c
         w     t                   w      t
  s  p  20    20    =    s  p    50%    50%    ×    s  p  40
     g  60     0            g   100%     0%            g  60

50% of 40 is 20

Page 49: Lecture 4 Probability and what it has to do with data analysis

and if

P(s,c) = P(s|c) P(c) = P(c|s) P(s)

then P(s|c) = P(c|s) P(s) / P(c)

and

P(c|s) = P(s|c) P(c) / P(s)

… which is called Bayes Theorem
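
A sketch that encodes the bird table as a joint probability matrix and verifies the marginals, conditionals, and Bayes Theorem numerically (assuming numpy; not part of the lecture):

import numpy as np

# Rows: species s = (pigeon, gull); columns: color c = (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_s = P_sc.sum(axis=1)              # [0.4, 0.6]  (sum rows)
P_c = P_sc.sum(axis=0)              # [0.8, 0.2]  (sum columns)

P_c_given_s = P_sc / P_s[:, None]   # rows sum to 1
P_s_given_c = P_sc / P_c[None, :]   # columns sum to 1

# Bayes: P(s|c) recovered from P(c|s), P(s) and P(c)
print(np.allclose(P_s_given_c, P_c_given_s * P_s[:, None] / P_c[None, :]))  # True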

Page 50: Lecture 4 Probability and what it has to do with data analysis

Why Bayes Theorem is important

Consider the problem of fitting a straight line to data, d, where the intercept and slope are given by the vector m.

If we guess m and use it to predict d we are doing something like P(d|m)

But if we observe d and use it to estimate m then we are doing something like P(m|d)

Bayes Theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d)

Page 51: Lecture 4 Probability and what it has to do with data analysis

Expectation

Variance

And

Covariance

Of a multivariate distribution

Page 52: Lecture 4 Probability and what it has to do with data analysis

The expected values of x1 and x2 are calculated in a fashion analogous to the one-variable case:

$E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2 \qquad E(x_2) = \int\!\!\int x_2\, p(x_1, x_2)\, dx_1\, dx_2$

Note:

$E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[\int p(x_1, x_2)\, dx_2\right] dx_1 = \int x_1\, p(x_1)\, dx_1$

So the formula really is just the expectation of a one-variable distribution

Page 53: Lecture 4 Probability and what it has to do with data analysis

The variances of x1 and x2 are calculated in a fashion analogous to the one-variable case, too:

$\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2$, with $\bar{x}_1 = E(x_1)$, and similarly for $\sigma_{x_2}^2$

Note, once again:

$\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2 = \int (x_1 - \bar{x}_1)^2 \left[\int p(x_1, x_2)\, dx_2\right] dx_1 = \int (x_1 - \bar{x}_1)^2\, p(x_1)\, dx_1$

So the formula really is just the variance of a one-variable distribution

Page 54: Lecture 4 Probability and what it has to do with data analysis

Note that in this distribution, if x1 is bigger than x̄1 then x2 tends to be bigger than x̄2, and if x1 is smaller than x̄1 then x2 tends to be smaller than x̄2

[Figure: contours of p(x1, x2), elongated along a line of positive slope, with the expected value (x̄1, x̄2) marked]

This is a positive correlation

Page 55: Lecture 4 Probability and what it has to do with data analysis

Conversely, in this distribution, if x1 is bigger than x̄1 then x2 tends to be smaller than x̄2, and if x1 is smaller than x̄1 then x2 tends to be bigger than x̄2

[Figure: contours of p(x1, x2), elongated along a line of negative slope, with the expected value (x̄1, x̄2) marked]

This is a negative correlation

Page 56: Lecture 4 Probability and what it has to do with data analysis

This correlation can be quantified by multiplying the distribution by a four-quadrant function and then integrating. The function (x1 - x̄1)(x2 - x̄2), which is positive in the quadrants where x1 and x2 deviate from their means in the same direction and negative where they deviate in opposite directions, works fine:

$\mathrm{cov}(x_1, x_2) = \int\!\!\int (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\, p(x_1, x_2)\, dx_1\, dx_2$

This is called the "covariance"
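
A sketch of estimating the covariance from realizations (assuming numpy; the covariance matrix used to generate the samples is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
true_cov = np.array([[1.0, 0.7],
                     [0.7, 2.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=50_000)

d = x - x.mean(axis=0)                       # (x_i - xbar_i) for each realization
sample_cov = (d[:, 0] * d[:, 1]).mean()      # average of the four-quadrant product
print(sample_cov)                            # ~0.7, the covariance used above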

Page 57: Lecture 4 Probability and what it has to do with data analysis

Note that the vector x̄ with elements

$\bar{x}_i = E(x_i) = \int\!\!\int x_i\, p(x_1, x_2)\, dx_1\, dx_2$

is the expectation of x,

and the matrix Cx with elements

$C^x_{ij} = \int\!\!\int (x_i - \bar{x}_i)(x_j - \bar{x}_j)\, p(x_1, x_2)\, dx_1\, dx_2$

has diagonal elements equal to the variance of xi,  $C^x_{ii} = \sigma_{x_i}^2$,

and off-diagonal elements equal to the covariance of xi and xj,  $C^x_{ij} = \mathrm{cov}(x_i, x_j)$

Page 58: Lecture 4 Probability and what it has to do with data analysis

x̄ : the “center” of a multivariate distribution

Cx : the “width” and “correlatedness” of a multivariate distribution

Together they summarize a lot – but not everything – about a multivariate distribution

Page 59: Lecture 4 Probability and what it has to do with data analysis

Functions of a set of random variables, x

A set of N random variables collected in a vector, x

Page 60: Lecture 4 Probability and what it has to do with data analysis

Given y(x), do you remember how to transform the integral?

$\int\!\cdots\!\int p(\mathbf{x})\, d^N x \;=\; \int\!\cdots\!\int \;?\; \, d^N y$

Page 61: Lecture 4 Probability and what it has to do with data analysis

Given y(x), then

$\int\!\cdots\!\int p(\mathbf{x})\, d^N x \;=\; \int\!\cdots\!\int p[\mathbf{x}(\mathbf{y})]\,\left|\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right|\, d^N y$

where $\left|\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right|$ is the Jacobian determinant, that is, the determinant of the matrix J whose elements are $J_{ij} = \partial x_i / \partial y_j$

Page 62: Lecture 4 Probability and what it has to do with data analysis

But here's something that's EASIER …

Suppose y(x) is a linear function, y = Mx

Then we can easily calculate the expectation of y:

$\bar{y}_i = E(y_i) = \int\!\cdots\!\int y_i\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \int\!\cdots\!\int \sum_j M_{ij} x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{\bar{y}_i} = \sum_j M_{ij} \int\!\cdots\!\int x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_j M_{ij}\, E(x_j) = \sum_j M_{ij}\, \bar{x}_j$

So $\bar{\mathbf{y}} = M\,\bar{\mathbf{x}}$

Page 63: Lecture 4 Probability and what it has to do with data analysis

And we can easily calculate the covariance:

$C^y_{ij} = \int\!\cdots\!\int (y_i - \bar{y}_i)(y_j - \bar{y}_j)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \int\!\cdots\!\int \sum_p M_{ip}(x_p - \bar{x}_p) \sum_q M_{jq}(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \sum_p M_{ip} \sum_q M_{jq} \int\!\cdots\!\int (x_p - \bar{x}_p)(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \sum_p M_{ip} \sum_q M_{jq}\, C^x_{pq}$

So $C^y = M\, C^x\, M^T$

Memorize!
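
A Monte Carlo sketch of both rules (assuming numpy; M, x̄ and Cx below are arbitrary example values):

import numpy as np

rng = np.random.default_rng(3)
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
xbar = np.array([1.0, 2.0])
Cx = np.array([[1.0, 0.5],
               [0.5, 2.0]])

x = rng.multivariate_normal(xbar, Cx, size=200_000)   # realizations of x
y = x @ M.T                                           # y = M x, one row per realization

print(np.allclose(y.mean(axis=0), M @ xbar, atol=0.05))               # rule for means
print(np.allclose(np.cov(y, rowvar=False), M @ Cx @ M.T, atol=0.2))   # error propagation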

Page 64: Lecture 4 Probability and what it has to do with data analysis

Note that these rules work regardless of the distribution of x

if y is linearly related to x, y = Mx, then

$\bar{\mathbf{y}} = M\,\bar{\mathbf{x}}$   (rule for means)

$C^y = M\, C^x\, M^T$   (rule for propagating error)