Lecture 4: Probability and what it has to do with data analysis

Page 1: Lecture 4 Probability and what it has to do with data analysis

Lecture 4

Probability and what it has to do with data analysis

Page 2: Lecture 4 Probability and what it has to do with data analysis

Please Read

Doug Martinson’s

Chapter 2: ‘Probability Theory’

Available on Courseworks

Page 3: Lecture 4 Probability and what it has to do with data analysis

Abstraction

Random variable, x

it has no set value, until you ‘realize’ it

its properties are described by a distribution, p(x)

Page 4: Lecture 4 Probability and what it has to do with data analysis

When you realize x

the probability that the value you get is

between x and x+dx

is p(x) dx

Probability density distribution

Page 5: Lecture 4 Probability and what it has to do with data analysis

The probability, P, that the value you get is between x1 and x2 is

$P = \int_{x_1}^{x_2} p(x)\,dx$

Note that it is written with a capital P

and represented by a fraction between

0 = never

and

1 = always

Page 6: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with the interval from x1 to x2 shaded]

Probability P that x is between x1 and x2 is proportional to this area

Page 7: Lecture 4 Probability and what it has to do with data analysis

The probability that the value you get is something is unity:

$\int_{-\infty}^{+\infty} p(x)\,dx = 1$

(or whatever the allowable range of x is …)

[Figure: p(x) versus x, with the entire area under the curve shaded]

Probability that x is between -∞ and +∞ is unity, so total area = 1
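
As a quick numerical check (not from the lecture), here is a minimal Python sketch, assuming numpy and scipy are available, that integrates the standard normal pdf to confirm the total area is 1 and computes P for an interval between x1 and x2:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Total area under the pdf should be 1 (the "always" case).
total, _ = quad(norm.pdf, -np.inf, np.inf)

# P that a realization falls between x1 and x2 is the area over that interval.
x1, x2 = -1.0, 2.0
P, _ = quad(norm.pdf, x1, x2)

print(total)  # ~1.0
print(P)      # ~0.819 for the standard normal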

Page 8: Lecture 4 Probability and what it has to do with data analysis

Why all this is relevant …

Any measurement that contains noise is treated as a random variable, x

The distribution p(x) embodies both the ‘true value’ of the quantity being measured and the measurement noise

All quantities derived from a random variable are themselves random variables, so …

The algebra of random variables allows you to understand how measurement noise affects inferences made from the data

Page 9: Lecture 4 Probability and what it has to do with data analysis

Basic Description of Distributions

Page 10: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with the peak of the distribution at x_mode]

Mode

The x at which the distribution has its peak

The most-likely value of x

Page 11: Lecture 4 Probability and what it has to do with data analysis

But modes can be deceptive …

[Figure: histogram of x between 0 and 10, with the peak in the 1-2 bin marked as x_mode]

100 realizations of x:

bin    count
0-1      3
1-2     18
2-3     11
3-4      8
4-5     11
5-6     14
6-7      8
7-8      7
8-9     11
9-10     9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!
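
The same point can be checked in a few lines of Python (a sketch, not part of the lecture; the counts are copied from the table above):

# Counts from the 100-realization histogram above (bin -> count).
counts = {"0-1": 3, "1-2": 18, "2-3": 11, "3-4": 8, "4-5": 11,
          "5-6": 14, "6-7": 8, "7-8": 7, "8-9": 11, "9-10": 9}

mode_bin = max(counts, key=counts.get)        # "1-2", the modal bin
above_2 = sum(n for b, n in counts.items() if b not in ("0-1", "1-2"))

print(mode_bin, counts[mode_bin])  # the 1-2 bin has the most counts (18) ...
print(above_2)                     # ... yet 79 of the 100 values exceed 2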

Page 12: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with x_median dividing the area into two halves of 50% each]

Median

50% chance x is smaller than x_median

50% chance x is bigger than x_median

No special reason the median needs to coincide with the peak

Page 13: Lecture 4 Probability and what it has to do with data analysis

[Figure: bar chart of a discrete distribution over the values x = 1, 2, 3]

Expected value or 'mean'

The x you would get if you took the mean of lots of realizations of x

Let's examine a discrete distribution, for simplicity ...

Page 14: Lecture 4 Probability and what it has to do with data analysis

Hypothetical table of 140 realizations of x:

x       N
1      20
2      80
3      40
Total 140

mean = [ 20·1 + 80·2 + 40·3 ] / 140

     = (20/140)·1 + (80/140)·2 + (40/140)·3

     = p(1)·1 + p(2)·2 + p(3)·3

     = Σi p(xi) xi ≈ 2.14
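
A minimal Python sketch of the same arithmetic (using the hypothetical table above; not part of the lecture):

# mean = sum_i p(x_i) * x_i, with p(x_i) estimated as N_i / N_total
table = {1: 20, 2: 80, 3: 40}           # value -> number of realizations
N = sum(table.values())                 # 140
mean = sum((n / N) * x for x, n in table.items())
print(mean)                             # 300/140 ~ 2.14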

Page 15: Lecture 4 Probability and what it has to do with data analysis

By analogy, for a smooth distribution, the expected value of x is

$E(x) = \int_{-\infty}^{+\infty} x\, p(x)\,dx$

Page 16: Lecture 4 Probability and what it has to do with data analysis

By the way … you can compute the expected (“mean”) value of any function of x this way:

$E(x) = \int_{-\infty}^{+\infty} x\, p(x)\,dx$

$E(x^2) = \int_{-\infty}^{+\infty} x^2\, p(x)\,dx$

and, in general, $E(f(x)) = \int_{-\infty}^{+\infty} f(x)\, p(x)\,dx$

etc.

Page 17: Lecture 4 Probability and what it has to do with data analysis

Beware!

$E(x^2) \neq [E(x)]^2$

and, more generally, $E(f(x)) \neq f(E(x))$ for a nonlinear function f

and so forth …
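
A quick Monte Carlo illustration in Python (a sketch, not from the lecture; it assumes numpy) of why the two quantities differ:

import numpy as np

# E(x^2) and E(x)^2 differ by the variance: E(x^2) = var(x) + E(x)^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # mean 1, variance 1

print(np.mean(x**2))   # ~2.0, an estimate of E(x^2)
print(np.mean(x)**2)   # ~1.0, an estimate of E(x)^2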

Page 18: Lecture 4 Probability and what it has to do with data analysis

Width of a distribution

[Figure: p(x) versus x, with a central interval W50 containing 50% of the area, 25% lying below it and 25% above]

Here's a perfectly sensible way to define the width of a distribution: the interval W50 that contains the central 50% of the probability …

… it's not used much, though

Page 19: Lecture 4 Probability and what it has to do with data analysis

Width of a distribution

Here's another way …

[Figure: p(x) versus x with E(x) marked, overlaid with the parabola [x-E(x)]²]

… multiply p(x) by the parabola [x-E(x)]² and integrate

Page 20: Lecture 4 Probability and what it has to do with data analysis

Variance: $\sigma^2 = \int_{-\infty}^{+\infty} [x - E(x)]^2\, p(x)\,dx$

[Figure: p(x), the parabola [x-E(x)]², and their product [x-E(x)]² p(x), all plotted against x with E(x) marked]

Compute this total area …

The idea is that if the distribution is narrow, then most of the probability lines up with the low spot of the parabola

But if it is wide, then some of the probability lines up with the high parts of the parabola
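
A numerical sketch of this recipe (assuming scipy; the choice of a normal pdf with σ = 0.5 is just for illustration, so the integral should return about 0.25):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 3.0, 0.5
# First get E(x), then integrate the parabola [x - E(x)]^2 against p(x).
Ex, _ = quad(lambda x: x * norm.pdf(x, mu, sigma), -np.inf, np.inf)
var, _ = quad(lambda x: (x - Ex)**2 * norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(Ex, var)   # ~3.0, ~0.25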

Page 21: Lecture 4 Probability and what it has to do with data analysis

[Figure: p(x) versus x, with E(x) marked and the width σ indicated]

variance = σ²

A measure of width …

we don't immediately know its relationship to area, though …

Page 22: Lecture 4 Probability and what it has to do with data analysis

The Gaussian or normal distribution

$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\bar{x})^2}{2\sigma^2}\right\}$

where $\bar{x}$ is the expected value and $\sigma^2$ is the variance

Memorize me !
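
A sketch of the formula in Python (the comparison against scipy.stats.norm is just a sanity check and is not part of the lecture):

import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, xbar, sigma):
    """p(x) = exp{ -(x - xbar)^2 / (2 sigma^2) } / (sqrt(2 pi) sigma)"""
    return np.exp(-(x - xbar)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-3.0, 5.0, 9)
print(np.allclose(gaussian_pdf(x, 1.0, 1.0), norm.pdf(x, loc=1.0, scale=1.0)))  # True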

Page 23: Lecture 4 Probability and what it has to do with data analysis

Examples of Normal Distributions

[Figure: two plots of p(x) versus x, one with x̄ = 1, σ = 1 and one with x̄ = 3, σ = 0.5]

Page 24: Lecture 4 Probability and what it has to do with data analysis

Properties of the normal distribution

[Figure: p(x) versus x, with the interval from x̄-2σ to x̄+2σ shaded and labeled 95%]

Expectation = Median = Mode = x̄

95% of the probability lies within 2σ of the expected value
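
A quick check of the 2σ rule with scipy (a sketch; the values x̄ = 3, σ = 0.5 are arbitrary):

from scipy.stats import norm

xbar, sigma = 3.0, 0.5
# Area under the normal pdf between xbar - 2*sigma and xbar + 2*sigma.
P = norm.cdf(xbar + 2*sigma, xbar, sigma) - norm.cdf(xbar - 2*sigma, xbar, sigma)
print(P)   # ~0.954, i.e. about 95%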

Page 25: Lecture 4 Probability and what it has to do with data analysis

Functions of a random variable

any function of a random variable is itself a random variable

Page 26: Lecture 4 Probability and what it has to do with data analysis

If x has distribution p(x)

then y(x) has distribution

$p(y) = p[x(y)]\,\left|\frac{dx}{dy}\right|$

Page 27: Lecture 4 Probability and what it has to do with data analysis

This follows from the rule for transforming integrals …

$1 = \int_{x_1}^{x_2} p(x)\,dx = \int_{y_1}^{y_2} p[x(y)]\,\left|\frac{dx}{dy}\right|\,dy$

with the limits chosen so that y1 = y(x1), etc.

Page 28: Lecture 4 Probability and what it has to do with data analysis

example

Let x have a uniform (white) distribution on [0,1]

[Figure: p(x) = 1 for x between 0 and 1, and zero elsewhere]

Uniform probability that x is anywhere between 0 and 1

Page 29: Lecture 4 Probability and what it has to do with data analysis

Let y = x², then x = y^½

$y(x{=}0)=0, \quad y(x{=}1)=1, \quad \frac{dx}{dy}=\tfrac{1}{2}\,y^{-1/2}, \quad p[x(y)]=1$

So $p(y)=\tfrac{1}{2}\,y^{-1/2}$ on the interval [0,1]

Page 30: Lecture 4 Probability and what it has to do with data analysis

Numerical test: histograms of 1000 random numbers

Histogram of x, generated with Excel's rand() function, which claims to be based upon a uniform distribution. Plausible that it's uniform.

Histogram of x², generated by squaring the x's from above. Plausible that it's proportional to 1/√y.
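
A sketch of the same test using numpy's uniform generator in place of Excel's rand() (not part of the lecture):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1000)    # uniform on [0,1]
y = x**2                                # the transformed variable

hx, _ = np.histogram(x, bins=10, range=(0, 1))
hy, _ = np.histogram(y, bins=10, range=(0, 1))
print(hx)   # roughly flat counts
print(hy)   # counts pile up near 0, falling off like 1/sqrt(y)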

Page 31: Lecture 4 Probability and what it has to do with data analysis

multivariate distributions

Page 32: Lecture 4 Probability and what it has to do with data analysis

example

Liberty Island is inhabited by both pigeons and seagulls

40% of the birds are pigeons and 60% of the birds are gulls

50% of pigeons are white and 50% are tan; 100% of gulls are white

Page 33: Lecture 4 Probability and what it has to do with data analysis

Two variables:

species s takes two values: pigeon p and gull g

color c takes two values: white w and tan t

Of 100 birds,

20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls

Page 34: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has species s and color c ?

(a random bird, that is)

           c
         w      t
  s  p  20%    20%
     g  60%     0%

Note: the sum of all boxes is 100%

Page 35: Lecture 4 Probability and what it has to do with data analysis

This is called the Joint Probability

and is written P(s,c)

Page 36: Lecture 4 Probability and what it has to do with data analysis

Two continuous variables, say x1 and x2,

have a joint probability distribution, written p(x1, x2), with

$\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$

Page 37: Lecture 4 Probability and what it has to do with data analysis

The probability that x1 is between x1 and x1+dx1

and x2 is between x2 and x2+dx2

is p(x1, x2) dx1 dx2

so $\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$

Page 38: Lecture 4 Probability and what it has to do with data analysis

You would contour a joint probability distribution

and it would look something like

[Figure: contours of p(x1, x2) in the (x1, x2) plane]

Page 39: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has color c ?

Start with P(s,c):

           c
         w      t
  s  p  20%    20%
     g  60%     0%
        ----   ----
         80%    20%

and sum the columns to get P(c)

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 40: Lecture 4 Probability and what it has to do with data analysis

What is the probability that a bird has species s ?

Start with P(s,c):

           c
         w      t
  s  p  20%    20%   → 40%
     g  60%     0%   → 60%

and sum the rows to get P(s)

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 41: Lecture 4 Probability and what it has to do with data analysis

These operations make sense with distributions, too

[Figure: contours of p(x1, x2), with the marginal curves p(x1) and p(x2) plotted alongside]

$p(x_1) = \int p(x_1, x_2)\, dx_2$   — the distribution of x1 (irrespective of x2)

$p(x_2) = \int p(x_1, x_2)\, dx_1$   — the distribution of x2 (irrespective of x1)
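
A numerical sketch of marginalization (assuming scipy; the correlated 2D normal evaluated on a grid is just an example distribution):

import numpy as np
from scipy.stats import multivariate_normal

x1 = np.linspace(-4, 4, 201)
x2 = np.linspace(-4, 4, 201)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
p = multivariate_normal(mean=[0, 0], cov=[[1, 0.6], [0.6, 1]]).pdf(np.dstack((X1, X2)))

dx1 = x1[1] - x1[0]
dx2 = x2[1] - x2[0]
p_x1 = p.sum(axis=1) * dx2          # p(x1), irrespective of x2
print(p_x1.sum() * dx1)             # ~1, as a marginal distribution should be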

Page 42: Lecture 4 Probability and what it has to do with data analysis

Given that a bird is species s, what is the probability that it has color c ?

           c
         w      t
  s  p  50%    50%
     g 100%     0%

Note: all rows sum to 100%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

Page 43: Lecture 4 Probability and what it has to do with data analysis

This is called the Conditional Probability of c given s

and is written P(c|s)

similarly …

Page 44: Lecture 4 Probability and what it has to do with data analysis

Given that a bird is color c, what is the probability that it has species s ?

           c
         w      t
  s  p  25%   100%
     g  75%     0%

Note: all columns sum to 100%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

So 25% of white birds are pigeons

Page 45: Lecture 4 Probability and what it has to do with data analysis

This is called the Conditional Probability of s given c

and is written P(s|c)

Page 46: Lecture 4 Probability and what it has to do with data analysis

Beware!  P(c|s) ≠ P(s|c)

  P(c|s):                       P(s|c):

           c                              c
         w      t                       w      t
  s  p  50%    50%              s  p   25%   100%
     g 100%     0%                 g   75%     0%

Page 47: Lecture 4 Probability and what it has to do with data analysis

note

P(s,c) = P(s|c) P(c)

    P(s,c)                   P(s|c)                  P(c)

           c                         c                      c
         w     t                   w      t               w     t
  s  p  20    20    =    s  p    25%   100%    ×         80    20
     g  60     0            g    75%     0%

25% of 80 is 20

Page 48: Lecture 4 Probability and what it has to do with data analysis

and

P(s,c) = P(c|s) P(s)

    P(s,c)                   P(c|s)                  P(s)

           c                         c
         w     t                   w      t
  s  p  20    20    =    s  p    50%    50%    ×    s  p  40
     g  60     0            g   100%     0%            g  60

50% of 40 is 20

Page 49: Lecture 4 Probability and what it has to do with data analysis

and if

P(s,c) = P(s|c) P(c) = P(c|s) P(s)

then P(s|c) = P(c|s) P(s) / P(c)

and

P(c|s) = P(s|c) P(c) / P(s)

… which is called Bayes Theorem
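
A sketch that encodes the bird table as a joint probability matrix and verifies the marginals, conditionals, and Bayes Theorem numerically (assuming numpy; not part of the lecture):

import numpy as np

# Rows: species s = (pigeon, gull); columns: color c = (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_s = P_sc.sum(axis=1)              # [0.4, 0.6]  (sum rows)
P_c = P_sc.sum(axis=0)              # [0.8, 0.2]  (sum columns)

P_c_given_s = P_sc / P_s[:, None]   # rows sum to 1
P_s_given_c = P_sc / P_c[None, :]   # columns sum to 1

# Bayes: P(s|c) recovered from P(c|s), P(s) and P(c)
print(np.allclose(P_s_given_c, P_c_given_s * P_s[:, None] / P_c[None, :]))  # True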

Page 50: Lecture 4 Probability and what it has to do with data analysis

Why Bayes Theorem is important

Consider the problem of fitting a straight line to data, d, where the intercept and slope are given by the vector m.

If we guess m and use it to predict d we are doing something like P(d|m)

But if we observe d and use it to estimate m then we are doing something like P(m|d)

Bayes Theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d)

Page 51: Lecture 4 Probability and what it has to do with data analysis

Expectation

Variance

And

Covariance

Of a multivariate distribution

Page 52: Lecture 4 Probability and what it has to do with data analysis

The expected values of x1 and x2 are calculated in a fashion analogous to the one-variable case:

$E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2 \qquad E(x_2) = \int\!\!\int x_2\, p(x_1, x_2)\, dx_1\, dx_2$

Note:

$E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[\int p(x_1, x_2)\, dx_2\right] dx_1 = \int x_1\, p(x_1)\, dx_1$

So the formula really is just the expectation of a one-variable distribution

Page 53: Lecture 4 Probability and what it has to do with data analysis

The variances of x1 and x2 are calculated in a fashion analogous to the one-variable case, too:

$\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2$, with $\bar{x}_1 = E(x_1)$, and similarly for $\sigma_{x_2}^2$

Note, once again:

$\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2 = \int (x_1 - \bar{x}_1)^2 \left[\int p(x_1, x_2)\, dx_2\right] dx_1 = \int (x_1 - \bar{x}_1)^2\, p(x_1)\, dx_1$

So the formula really is just the variance of a one-variable distribution

Page 54: Lecture 4 Probability and what it has to do with data analysis

Note that in this distribution, if x1 is bigger than x̄1 then x2 tends to be bigger than x̄2, and if x1 is smaller than x̄1 then x2 tends to be smaller than x̄2

[Figure: contours of p(x1, x2), elongated along a line of positive slope, with the expected value (x̄1, x̄2) marked]

This is a positive correlation

Page 55: Lecture 4 Probability and what it has to do with data analysis

Conversely, in this distribution, if x1 is bigger than x̄1 then x2 tends to be smaller than x̄2, and if x1 is smaller than x̄1 then x2 tends to be bigger than x̄2

[Figure: contours of p(x1, x2), elongated along a line of negative slope, with the expected value (x̄1, x̄2) marked]

This is a negative correlation

Page 56: Lecture 4 Probability and what it has to do with data analysis

This correlation can be quantified by multiplying the distribution by a four-quadrant function and then integrating. The function (x1 - x̄1)(x2 - x̄2), which is positive in the quadrants where x1 and x2 deviate from their means in the same direction and negative where they deviate in opposite directions, works fine:

$\mathrm{cov}(x_1, x_2) = \int\!\!\int (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\, p(x_1, x_2)\, dx_1\, dx_2$

This is called the "covariance"
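
A sketch of estimating the covariance from realizations (assuming numpy; the covariance matrix used to generate the samples is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
true_cov = np.array([[1.0, 0.7],
                     [0.7, 2.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=50_000)

d = x - x.mean(axis=0)                       # (x_i - xbar_i) for each realization
sample_cov = (d[:, 0] * d[:, 1]).mean()      # average of the four-quadrant product
print(sample_cov)                            # ~0.7, the covariance used above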

Page 57: Lecture 4 Probability and what it has to do with data analysis

Note that the vector x̄ with elements

$\bar{x}_i = E(x_i) = \int\!\!\int x_i\, p(x_1, x_2)\, dx_1\, dx_2$

is the expectation of x,

and the matrix Cx with elements

$C^x_{ij} = \int\!\!\int (x_i - \bar{x}_i)(x_j - \bar{x}_j)\, p(x_1, x_2)\, dx_1\, dx_2$

has diagonal elements equal to the variance of xi,  $C^x_{ii} = \sigma_{x_i}^2$,

and off-diagonal elements equal to the covariance of xi and xj,  $C^x_{ij} = \mathrm{cov}(x_i, x_j)$

Page 58: Lecture 4 Probability and what it has to do with data analysis

x̄ : the “center” of a multivariate distribution

Cx : the “width” and “correlatedness” of a multivariate distribution

Together they summarize a lot – but not everything – about a multivariate distribution

Page 59: Lecture 4 Probability and what it has to do with data analysis

Functions of a set of random variables, x

A set of N random variables collected in a vector, x

Page 60: Lecture 4 Probability and what it has to do with data analysis

Given y(x), do you remember how to transform the integral?

$\int\!\cdots\!\int p(\mathbf{x})\, d^N x \;=\; \int\!\cdots\!\int \;?\; \, d^N y$

Page 61: Lecture 4 Probability and what it has to do with data analysis

Given y(x), then

$\int\!\cdots\!\int p(\mathbf{x})\, d^N x \;=\; \int\!\cdots\!\int p[\mathbf{x}(\mathbf{y})]\,\left|\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right|\, d^N y$

where $\left|\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right|$ is the Jacobian determinant, that is, the determinant of the matrix J whose elements are $J_{ij} = \partial x_i / \partial y_j$

Page 62: Lecture 4 Probability and what it has to do with data analysis

But here's something that's EASIER …

Suppose y(x) is a linear function, y = Mx

Then we can easily calculate the expectation of y:

$\bar{y}_i = E(y_i) = \int\!\cdots\!\int y_i\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \int\!\cdots\!\int \sum_j M_{ij} x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{\bar{y}_i} = \sum_j M_{ij} \int\!\cdots\!\int x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_j M_{ij}\, E(x_j) = \sum_j M_{ij}\, \bar{x}_j$

So $\bar{\mathbf{y}} = M\,\bar{\mathbf{x}}$

Page 63: Lecture 4 Probability and what it has to do with data analysis

And we can easily calculate the covariance:

$C^y_{ij} = \int\!\cdots\!\int (y_i - \bar{y}_i)(y_j - \bar{y}_j)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \int\!\cdots\!\int \sum_p M_{ip}(x_p - \bar{x}_p) \sum_q M_{jq}(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \sum_p M_{ip} \sum_q M_{jq} \int\!\cdots\!\int (x_p - \bar{x}_p)(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$

$\phantom{C^y_{ij}} = \sum_p M_{ip} \sum_q M_{jq}\, C^x_{pq}$

So $C^y = M\, C^x\, M^T$

Memorize!
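
A Monte Carlo sketch of both rules (assuming numpy; M, x̄ and Cx below are arbitrary example values):

import numpy as np

rng = np.random.default_rng(3)
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
xbar = np.array([1.0, 2.0])
Cx = np.array([[1.0, 0.5],
               [0.5, 2.0]])

x = rng.multivariate_normal(xbar, Cx, size=200_000)   # realizations of x
y = x @ M.T                                           # y = M x, one row per realization

print(np.allclose(y.mean(axis=0), M @ xbar, atol=0.05))               # rule for means
print(np.allclose(np.cov(y, rowvar=False), M @ Cx @ M.T, atol=0.2))   # error propagation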

Page 64: Lecture 4 Probability and what it has to do with data analysis

Note that these rules work regardless of the distribution of x

if y is linearly related to x, y = Mx, then

$\bar{\mathbf{y}} = M\,\bar{\mathbf{x}}$   (rule for means)

$C^y = M\, C^x\, M^T$   (rule for propagating error)