
Bayesian stat

MATH

Background for Lesson 1

1 Rules of Probability

Lesson 1 will assume that you have previously seen probability notation and the basic rules

of probability. For reference, we review those here.

Probabilities are defined for events. An event is some outcome that we could potentially

or hypothetically observe or experience, such as the result of rolling a fair six-sided die.

(English grammar note: die is singular, dice is plural.) In mathematical notation, we often

write an event as a capital letter, for example, A is the event that we roll a “4” on a fair

six-sided die. This event has probability 1/6, so we would write P (A) = 1/6. We might

want to represent the numerical result of the die roll as the random variable X, and then we

could also write P (X = 4) = 1/6.

Probabilities must be between zero and one, i.e., 0 ≤ P(A) ≤ 1 for any event A.

Probabilities add to one, i.e., if we add up the probabilities of all possible events, those probabilities must add to one. For the die rolling example:

\sum_{i=1}^{6} P(X = i) = 1 .

Recall that the symbol \sum_{i=1}^{6} denotes adding up the entries as i goes from 1 to 6.
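To make the rule concrete, here is a tiny Python sketch (not part of the original notes) that stores the six equally likely outcomes of a fair die and checks that their probabilities add to one:

```python
from fractions import Fraction

# A fair six-sided die: every face i = 1, ..., 6 has probability 1/6.
probs = {i: Fraction(1, 6) for i in range(1, 7)}

print(probs[4])             # P(X = 4) = 1/6
print(sum(probs.values()))  # probabilities of all possible outcomes add to 1
```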

The complement of an event, Aᶜ, means that the event does not happen. Since probabilities must add to one, P(Aᶜ) = 1 − P(A).

If A and B are two events, the probability that A or B happens (this is an inclusive or, meaning that either A, or B, or both happen) is the probability of the union of the events:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where ∪ represents union (“or”) and ∩ represents intersection (“and”).


If a set of events Ai for i = 1, . . . , m are mutually exclusive (only one can happen), then

P\left( \bigcup_{i=1}^{m} A_i \right) = \sum_{i=1}^{m} P(A_i) .

2 Odds

Probabilities can be re-expressed in terms of odds. Suppose again that we denote rolling a “4” on a fair six-sided die as the event A. Then P(A) = 1/6. The odds for event A, denoted O(A), is defined as O(A) = P(A)/P(Aᶜ) = P(A)/(1 − P(A)). Hence, in this example,

O(A) = \frac{1/6}{5/6} = \frac{1}{5} .

This can also be expressed as 1:5 (or 5:1 “odds against”). Thus, an event with probability

3/10 has 3:7 odds (7:3 odds against) and an event with probability 4/5 has 4:1 odds.

Note that we can also calculate probabilities from odds. If an event B has a : b odds (with a > 0 and b > 0), then P(B)/(1 − P(B)) = a/b ⇒ P(B) · b = a − P(B) · a ⇒ P(B) = a/(a + b). Thus, an event with 2:5 odds has probability 2/7.
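Both conversions are easy to script. The following Python sketch uses exact fractions; the helper function names are my own, not from the notes:

```python
from fractions import Fraction

def prob_to_odds(p):
    """Return (a, b) such that an event with probability p (a Fraction) has a:b odds."""
    o = p / (1 - p)                 # O = P / (1 - P)
    return o.numerator, o.denominator

def odds_to_prob(a, b):
    """Probability of an event with a:b odds: p = a / (a + b)."""
    return Fraction(a, a + b)

print(prob_to_odds(Fraction(1, 6)))   # (1, 5)  -> 1:5 odds
print(prob_to_odds(Fraction(3, 10)))  # (3, 7)  -> 3:7 odds
print(odds_to_prob(2, 5))             # 2/7
```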

3 Expectation

The expected value of a random variable X is a weighted average of values X can take, with

weights given by the probabilities of those values. If X can take on only a finite number of

values (say, x1, x2, . . . , xn), we can calculate the expected value as

E(X) = \sum_{i=1}^{n} x_i · P(X = x_i) .

For example, the expected value of a fair six-sided die would be

E(X) = \sum_{i=1}^{6} i · P(X = i) = \sum_{i=1}^{6} i · \frac{1}{6} = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5 .

Note that the die cannot achieve this value, but if you were to roll the die many times and

average the values, the result would likely be close to 3.5.
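A short simulation illustrates this long-run average interpretation (a sketch; the number of rolls is an arbitrary choice):

```python
import random

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]  # simulate many fair die rolls
print(sum(rolls) / len(rolls))  # the sample average should be close to the expected value 3.5
```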


Supplementary material for Lesson 2

The simple form of Bayes Theorem involves two discrete events:

P(A|B) = \frac{P(A ∩ B)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)}

When there are three possible outcomes A1, A2, and A3 such that exactly one of these must

happen, then Bayes Theorem expands to:

P(A_1|B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_3)P(A_3)}

If the events A1, . . . , Am form a partition of the space (exactly one of the Ai’s must occur, i.e., the Ai’s are mutually exclusive and \sum_{i=1}^{m} P(A_i) = 1), then we can write Bayes Theorem as:

P(A_1|B) = \frac{P(B|A_1)P(A_1)}{\sum_{i=1}^{m} P(B|A_i)P(A_i)} .
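As a numerical sketch of the partition form, the priors and likelihoods below are made-up values chosen only for illustration:

```python
# Hypothetical partition A1, A2, A3 with priors P(Ai) and likelihoods P(B | Ai).
priors = [0.5, 0.3, 0.2]          # must sum to 1
likelihoods = [0.10, 0.40, 0.70]  # P(B | A1), P(B | A2), P(B | A3)

# Denominator of Bayes Theorem: P(B) = sum_i P(B | Ai) P(Ai)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior P(Ai | B) for each event in the partition
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]
print(posteriors, sum(posteriors))  # the posteriors also sum to 1
```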

For continuous distributions, the sum gets replaced with an integral, as we’ll see in the next

lesson.


Supplementary material for Lesson 3

1 Indicator Functions

The concept of an indicator function is a really useful one. This is a function that takes the

value one if its argument is true, and the value zero if its argument is false. Sometimes these

functions are called Heaviside functions or unit step functions. I write an indicator function

as I{A}(x), although sometimes they are written 1{A}(x). If the context is obvious, we can

also simply write I{A}.

Example:

I_{x>3}(x) = \begin{cases} 0 & x ≤ 3 \\ 1 & x > 3 \end{cases}

Indicator functions are useful for making sure that you don’t take the log of a negative

number, and things like that. Indicator functions are always first in the order of operations—

if the indicator function is zero, you don’t try to evaluate the rest of the expression. When

taking derivatives they just go along for the ride. When taking integrals, they may affect

the range over which the integral is evaluated.

Example: The density function for the exponential distribution (see Section 4.1) can be

written as f(x) = λ exp(−λx) I_{x≥0}(x). We can integrate the density and show that it is indeed a proper density and integrates to one:

\int_{−∞}^{∞} λ \exp(−λx) I_{x≥0}(x)\, dx = \int_{0}^{∞} λ \exp(−λx)\, dx = −\exp(−λx) \Big|_{0}^{∞} = −(0 − 1) = 1.

The derivative of the density function would be:

\frac{d}{dx} λ \exp(−λx) I_{x≥0}(x) = −λ^2 \exp(−λx) I_{x≥0}(x).
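A quick numerical check of the integration step (a sketch using scipy; the rate λ = 2 is an arbitrary choice):

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0  # an arbitrary rate lambda, chosen only for illustration

# The indicator restricts the integral to x >= 0, so integrate the density from 0 to infinity.
density = lambda x: lam * np.exp(-lam * x)
total, _ = quad(density, 0, np.inf)
print(total)  # ~1.0: the exponential density integrates to one
```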

2 Expected Values

The expected value, also known as the expectation or mean, of a random variable X is

denoted E(X). It is the weighted average of all values X could take, with weights given by


the probabilities of those values. If X is discrete-valued, then

E(X) = \sum_{x} x · P(X = x) = \sum_{x} x · f(x) .

If X is a continuous random variable with probability density function (PDF) f(x), we

replace the summation with an integral

E(X) = \int_{−∞}^{∞} x · f(x)\, dx .

The expectation of a random variable can be thought of as the average value. If, for example,

we observed many realizations of a random variable X and took their average, it would be

close to E(X).

One nice property of expected values is that they are easy to compute for linear functions

of random variables. To see this, let X and Y be random variables with E(X) = µX and

E(Y ) = µY . Suppose we are interested in a new random variable Z = aX+bY +c where a, b,

and c are any real constants. The mean of Z is easy to compute: E(Z) = E(aX + bY + c) =

aE(X) + bE(Y ) + c = aµX + bµY + c.

We can also compute expectations of functions of X. For example, suppose g(X) = 2/X.

Then we have E(g(X)) = \int_{−∞}^{∞} g(x) f(x)\, dx = \int_{−∞}^{∞} \frac{2}{x} f(x)\, dx. Note however that in general, E(g(X)) ≠ g(E(X)).

Example: Let’s say continuous random variable X has PDF f(x) = 3x² I_{0≤x≤1}(x). We want to find E(X) and E(X²). First,

E(X) = \int_{−∞}^{∞} x · 3x^2 I_{0≤x≤1}(x)\, dx = \int_{0}^{1} x · 3x^2\, dx = \int_{0}^{1} 3x^3\, dx = \frac{3}{4} x^4 \Big|_{x=0}^{x=1} = \frac{3}{4}(1 − 0) = \frac{3}{4} ,    (1)

and second,

E(X^2) = \int_{−∞}^{∞} x^2 · 3x^2 I_{0≤x≤1}(x)\, dx = \int_{0}^{1} x^2 · 3x^2\, dx = \int_{0}^{1} 3x^4\, dx = \frac{3}{5} x^5 \Big|_{x=0}^{x=1} = \frac{3}{5}(1 − 0) = \frac{3}{5} .    (2)


2.1 Variance

The variance of a random variable measures how spread out its values are. If X is a random variable with mean E(X) = µ, then the variance is E[(X − µ)²]. In words, the variance is the expected value of the squared deviation of X from its mean. If X is discrete, this is

calculated as

Var(X) = \sum_{x} (x − µ)^2 · P(X = x)

and if X is continuous, it is

Var(X) = \int_{−∞}^{∞} (x − µ)^2 · f(x)\, dx .

For both discrete and continuous X, a convenient formula for the variance is Var(X) = E[X²] − (E[X])². The square root of the variance is called the standard deviation.

Variance has a linear property similar to expectation. Again, let X and Y be random variables with Var(X) = σ_X² and Var(Y) = σ_Y². It is also necessary to assume that X and Y are independent. Suppose we are interested in a new random variable Z = aX + bY + c where a, b, and c are any real constants. The variance of Z is then Var(Z) = Var(aX + bY + c) = a²Var(X) + b²Var(Y) + 0 = a²σ_X² + b²σ_Y². Because c is constant, it has variance 0.

Example: Continuing the previous example, let’s say continuous random variable X has PDF f(x) = 3x² I_{0≤x≤1}(x). We found in Equations 1 and 2 that E(X) = 3/4 and E(X²) = 3/5. Then, Var(X) = E[X²] − (E[X])² = 3/5 − (3/4)² = 3/80.
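All three numbers from this worked example are easy to verify numerically; the sketch below integrates the same density with scipy:

```python
from scipy.integrate import quad

f = lambda x: 3 * x**2   # the density on [0, 1]; the indicator limits every integral to (0, 1)

ex,  _ = quad(lambda x: x * f(x), 0, 1)     # E(X)   = 3/4
ex2, _ = quad(lambda x: x**2 * f(x), 0, 1)  # E(X^2) = 3/5
var = ex2 - ex**2                           # Var(X) = 3/5 - (3/4)^2 = 3/80
print(ex, ex2, var)                         # 0.75, 0.6, 0.0375
```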

3 Additional Discrete Distributions

3.1 Geometric

The geometric distribution is the number of trials needed to get the first success, i.e., the

number of Bernoulli events until a success is observed, such as the first head when flipping

a coin. It takes values on the positive integers starting with one (since at least one trial is


needed to observe a success).

X ∼ Geo(p)
P(X = x | p) = p(1 − p)^{x−1}   for x = 1, 2, . . .
E[X] = 1/p

If the probability of getting a success is p, then the expected number of trials until the first

success is 1/p.

Example: What is the probability that we flip a fair coin four times and don’t see any heads?

This is the same as asking what is P(X > 4) where X ∼ Geo(1/2). P(X > 4) = 1 − P(X = 1) − P(X = 2) − P(X = 3) − P(X = 4) = 1 − (1/2) − (1/2)(1/2) − (1/2)(1/2)² − (1/2)(1/2)³ = 1/16. Of course, we could also have just computed it directly, but here we see an example

of using the geometric distribution and we can also see that we got the right answer.
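The same answer can be checked with scipy, whose geom distribution uses the same "number of trials until the first success" convention (a sketch):

```python
from scipy.stats import geom

p = 0.5
X = geom(p)                # number of flips needed to see the first head
print(X.sf(4))             # P(X > 4) = 1/16 = 0.0625
print(sum(X.pmf(k) for k in range(1, 5)))  # P(X <= 4) = 15/16, the complement
```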

3.2 Multinomial

Another generalization of the Bernoulli and the binomial is the multinomial distribution,

which is like a binomial when there are more than two possible outcomes. Suppose we have

n trials and there are k different possible outcomes which occur with probabilities p1, . . . , pk.

For example, if we are rolling a six-sided die that might be loaded so that the sides are not

equally likely, then n is the total number of rolls, k = 6, p1 is the probability of rolling a one,

and we denote by x1, . . . , x6 a possible outcome for the number of times we observe rolls of

each of one through six, where \sum_{i=1}^{6} x_i = n and \sum_{i=1}^{6} p_i = 1.

f(x_1, . . . , x_k | p_1, . . . , p_k) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k} .

Recall that n! stands for n factorial, which is the product of n times n − 1 times . . . 1, e.g., 4! = 4 · 3 · 2 · 1 = 24. The expected number of observations in category i is npi.
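A sketch of the multinomial pmf in scipy; the probabilities below describe a hypothetical loaded die and are not from the notes:

```python
import numpy as np
from scipy.stats import multinomial

n = 12
p = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # a hypothetical loaded die: "6" is more likely
loaded_die = multinomial(n, p)

counts = [1, 1, 2, 1, 1, 6]          # one possible outcome of 12 rolls (counts sum to n)
print(loaded_die.pmf(counts))        # probability of observing exactly these counts
print(np.array(p) * n)               # expected count in each category: n * p_i
```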


3.3 Poisson

The Poisson distribution is used for counts, and arises in a variety of situations. The parameter λ > 0 is the rate at which we expect to observe the thing we are counting.

X ∼ Pois(λ)
P(X = x | λ) = \frac{λ^x \exp(−λ)}{x!}   for x = 0, 1, 2, . . .
E[X] = λ
Var[X] = λ

A Poisson process is a process wherein events occur on average at rate λ, events occur one at a time, and events occur independently of each other.

Example: Significant earthquakes occur in the Western United States approximately following a Poisson process with rate of two earthquakes per week. What is the probability there will be at least 3 earthquakes in the next two weeks? Answer: the rate per two weeks is 2 × 2 = 4, so let X ∼ Pois(4) and we want to know P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − P(X = 0) − P(X = 1) − P(X = 2) = 1 − e^{−4} − 4e^{−4} − \frac{4^2 e^{−4}}{2} = 1 − 13e^{−4} ≈ 0.762. Note that 0! = 1 by definition.
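Checking the earthquake calculation with scipy (a sketch):

```python
from scipy.stats import poisson

rate = 4                          # two earthquakes per week, over a two-week window
print(1 - poisson.cdf(2, rate))   # P(X >= 3) = 1 - P(X <= 2), approximately 0.762
print(poisson.sf(2, rate))        # the same quantity via the survival function
```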

4 Continuous Distributions

4.1 Exponential

The exponential distribution is often used to model the waiting time between random events.

Indeed, if the waiting times between successive events are independent from an Exp(λ) distribution, then for any fixed time window of length t, the number of events occurring in that window will follow a Poisson distribution with mean tλ.

X ∼ Exp(λ)
f(x | λ) = λ e^{−λx} I_{x≥0}(x)
E[X] = 1/λ
Var[X] = 1/λ²


Similar to the Poisson distribution, the parameter λ is interpreted as the rate at which the events occur.

4.2 Gamma

If X1, X2, . . . , Xn are independent (and identically distributed Exp(λ)) waiting times between successive events, then the total waiting time for all n events to occur, Y = \sum_{i=1}^{n} X_i, will follow a gamma distribution with shape parameter α = n and rate parameter β = λ.

Y ∼ Gamma(α, β)
f(y | α, β) = \frac{β^α}{Γ(α)} y^{α−1} e^{−βy} I_{y≥0}(y)
E[Y] = α/β
Var[Y] = α/β²

where Γ(·) is the gamma function, a generalization of the factorial function which can accept non-integer arguments. If n is a positive integer, then Γ(n) = (n − 1)!. Note also that α > 0 and β > 0.

The exponential distribution is a special case of the gamma distribution with α = 1. The

gamma distribution commonly appears in statistical problems, as we will see in this course.

It is used to model positive-valued, continuous quantities whose distribution is right-skewed.

As α increases, the gamma distribution more closely resembles the normal distribution.
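The "sum of exponential waiting times" interpretation can be illustrated by simulation (a sketch; the rate λ and the number of events n are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.5, 5                       # arbitrary rate and number of events

# Sum n independent Exp(lambda) waiting times, many times over.
totals = rng.exponential(scale=1/lam, size=(100_000, n)).sum(axis=1)

# Compare simulated moments with Gamma(alpha=n, beta=lam): mean alpha/beta, variance alpha/beta^2.
print(totals.mean(), n / lam)         # both ~3.33
print(totals.var(), n / lam**2)       # both ~2.22
```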

4.3 Uniform

The uniform distribution is used for random variables whose possible values are equally likely

over an interval. If the interval is (a, b), then the uniform probability density function (PDF)

f(x) is flat for all values in that interval and 0 everywhere else.

X ∼ Uniform(a, b)
f(x | a, b) = \frac{1}{b − a} I_{a≤x≤b}(x)
E[X] = \frac{a + b}{2}
Var[X] = \frac{(b − a)^2}{12}

The standard uniform distribution is obtained when a = 0 and b = 1.


4.4 Beta

The beta distribution is used for random variables which take on values between 0 and 1.

For this reason (and other reasons we will see later in the course), the beta distribution is

commonly used to model probabilities.

X ∼ Beta(α, β)
f(x | α, β) = \frac{Γ(α + β)}{Γ(α)Γ(β)} x^{α−1} (1 − x)^{β−1} I_{0<x<1}(x)
E[X] = \frac{α}{α + β}
Var[X] = \frac{αβ}{(α + β)^2 (α + β + 1)}

where Γ(·) is the gamma function introduced with the gamma distribution. Note also that α > 0 and β > 0. The standard Uniform(0, 1) distribution is a special case of the beta distribution with α = β = 1.
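A quick check of the beta mean and variance formulas with scipy (a sketch; the shape parameters are arbitrary):

```python
from scipy.stats import beta

a, b = 2.0, 5.0                     # arbitrary shape parameters alpha and beta
dist = beta(a, b)

print(dist.mean(), a / (a + b))                        # alpha / (alpha + beta) = 2/7
print(dist.var(), a * b / ((a + b)**2 * (a + b + 1)))  # the variance formula above
```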

4.5 Normal

The normal, or Gaussian distribution is one of the most important distributions in statistics.

It arises as the limiting distribution of sums (and averages) of random variables. This is

due to the Central Limit Theorem, introduced in Section 5. Because of this property, the

normal distribution is often used to model the “errors,” or unexplained variation of individual

observations in regression models.

The standard normal distribution is given by

Z ∼ N(0, 1)
f(z) = \frac{1}{\sqrt{2π}} \exp\left( −\frac{z^2}{2} \right)
E[Z] = 0
Var[Z] = 1

Now consider X = σZ + µ where σ > 0 and µ is any real constant. Then E(X) = E(σZ + µ) = σE(Z) + µ = σ · 0 + µ = µ and Var(X) = Var(σZ + µ) = σ²Var(Z) + 0 = σ² · 1 = σ². Then, X follows a normal distribution with mean µ and variance σ² (standard deviation σ)


denoted as

X ∼ N(µ, σ²)
f(x | µ, σ²) = \frac{1}{\sqrt{2πσ^2}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right)

The normal distribution is symmetric about the mean µ, and is often described as a “bell-

shaped” curve. Although X can take on any real value (positive or negative), more than

99% of the probability mass is concentrated within three standard deviations of the mean.

The normal distribution has several desirable properties. One is that if X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) are independent, then X1 + X2 ∼ N(µ1 + µ2, σ1² + σ2²). Consequently, if we take the average of n independent and identically distributed (iid) normal random variables,

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i ,

where the Xi are iid N(µ, σ²) for i = 1, 2, . . . , n, then

\bar{X} ∼ N\left( µ, \frac{σ^2}{n} \right) .    (3)
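Equation (3) is easy to see in simulation (a sketch; µ, σ, and n are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 2.0, 25        # arbitrary mean, standard deviation, and sample size

# Many samples of size n; take the mean of each sample.
xbar = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)

print(xbar.mean(), mu)                  # the sample means center at mu
print(xbar.std(), sigma / np.sqrt(n))   # and have standard deviation sigma / sqrt(n) = 0.4
```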

4.6 t

If we have normal data, we can use Equation 3 to help us estimate the mean µ. Reversing

the transformation from the previous section, we get

\frac{\bar{X} − µ}{σ/\sqrt{n}} ∼ N(0, 1) .    (4)

However, we may not know the value of σ. If we estimate it from data, we can replace it with S = \sqrt{\sum_i (X_i − \bar{X})^2/(n − 1)}, the sample standard deviation. This causes the expression (4) to no longer be distributed as standard normal, but as a standard t distribution with ν = n − 1 degrees of freedom.

Y ∼ t_ν
f(y) = \frac{Γ\left(\frac{ν+1}{2}\right)}{Γ\left(\frac{ν}{2}\right) \sqrt{νπ}} \left( 1 + \frac{y^2}{ν} \right)^{−\frac{ν+1}{2}}
E[Y] = 0   if ν > 1
Var[Y] = \frac{ν}{ν − 2}   if ν > 2


The t distribution is symmetric and resembles the normal distribution, but with thicker tails.

As the degrees of freedom increase, the t distribution looks more and more like the standard

normal distribution.
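The "thicker tails" remark can be quantified by comparing tail probabilities (a sketch using scipy):

```python
from scipy.stats import norm, t

# P(|Y| > 2) under the standard normal and under t distributions with increasing degrees of freedom.
print(2 * norm.sf(2))              # ~0.0455
for nu in (3, 10, 30, 100):
    print(nu, 2 * t.sf(2, df=nu))  # larger than the normal tail, shrinking toward it as nu grows
```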

5 Central Limit Theorem

The Central Limit Theorem is one of the most important results in statistics, basically saying

that with sufficiently large sample sizes, the sample average approximately follows a normal

distribution. This underscores the importance of the normal distribution, as well as most

of the methods commonly used which make assumptions about the data being normally

distributed.

Let’s first stop and think about what it means for the sample average to have a distribu-

tion. Imagine going to the store and buying a bag of your favorite brand of chocolate chip

cookies. Suppose the bag has 24 cookies in it. Will each cookie have the exact same number

of chocolate chips in it? It turns out that if you make a batch of cookies by adding chips

to dough and mixing it really well, then putting the same amount of dough onto a baking

sheet, the number of chips per cookie closely follows a Poisson distribution. (In the limiting

case of chips having zero volume, this is exactly a Poisson process.) Thus we expect there to

be a lot of variability in the number of chips per cookie. We can model the number of chips

per cookie with a Poisson distribution. We can also compute the average number of chips

per cookie in the bag. For the bag we have, that will be a particular number. But there may

be more bags of cookies in the store. Will each of those bags have the same average number

of chips? If all of the cookies in the store are from the same industrial-sized batch, each

cookie will individually have a Poisson number of chips. So the average number of chips in

one bag may be different from the average number of chips in another bag. Thus we could

hypothetically find out the average number of chips for each bag in the store. And we could

think about what the distribution of these averages is, across the bags in the store, or all the

bags of cookies in the world. It is this distribution of averages that the central limit theorem

says is approximately a normal distribution, with the same mean as the distribution for the

individual cookies, but with a standard deviation that is divided by the square root of the

number of samples in each average (i.e., the number of cookies per bag).

In formal mathematical notation, the Central Limit Theorem says: Let X1, . . . , Xn be

independent and identically distributed with E[Xi] = µ and Var(Xi) = σ², 0 < σ² < ∞. Then

\frac{\sqrt{n}(\bar{X} − µ)}{σ} ⇒ N(0, 1) .

That is, X̄ is approximately normally distributed with mean µ and variance σ²/n, or standard deviation σ/√n.
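To connect the formal statement back to the cookie example, here is a simulation sketch; the chip rate and bag size are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
chips_per_cookie_rate = 8      # hypothetical Poisson rate of chips per cookie
cookies_per_bag = 24
bags = 100_000

# Number of chips in each cookie of each bag, then the per-bag average.
chips = rng.poisson(chips_per_cookie_rate, size=(bags, cookies_per_bag))
bag_averages = chips.mean(axis=1)

# CLT: the bag averages are approximately normal with the same mean as a single cookie
# and standard deviation sqrt(variance / cookies per bag).
print(bag_averages.mean(), chips_per_cookie_rate)
print(bag_averages.std(), np.sqrt(chips_per_cookie_rate / cookies_per_bag))
```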

6 Bayes Theorem for continuous distributions

When dealing with a continuous random variable θ, we can write the conditional density for θ given y as:

f(θ | y) = \frac{f(y | θ) f(θ)}{\int f(y | θ) f(θ)\, dθ} .

This expression does the same thing that the versions of Bayes’ theorem from Lesson 2 do.

Because θ is continuous, we integrate over all possible values of θ in the denominator rather

than take the sum over these values. The continuous version of Bayes’ theorem will play a

central role from Lesson 5 on.


Background for Lesson 4

1 Products and Exponents

In Lesson 1, we introduced the summation notation \sum_{i=1}^{n} x_i = x_1 + x_2 + . . . + x_n. Similarly, we can define product notation as

\prod_{i=1}^{n} x_i = x_1 · x_2 · . . . · x_n .

Example: We can re-write the factorial function as n! = \prod_{i=1}^{n} i for n ≥ 1.

Example: Suppose f is a function which returns f(x) = 3x + 1. Suppose x can take on only the discrete values x ∈ {−1, 2, 4}. Then

\prod_{x} f(x) = (3 · (−1) + 1) · (3 · (2) + 1) · (3 · (4) + 1) = (−2) · (7) · (13) = −182 .

Exponents are of the form a^x where a (called the base) and x (called the exponent) are any real numbers. Recall that a^0 = 1. Exponents have the following useful properties:

1. a^x · a^y = a^{x+y}

2. (a^x)^y = a^{x·y}

Note that the first property requires that both terms have the same base a. Thus we cannot simplify a^x · b^y if a ≠ b.

One common base is the number e, which is approximately equal to 2.7183. The function e^x is so common in mathematics that it has its own symbol e^x = exp(x). Because e > 0, we have e^x > 0 for all real numbers x, although lim_{x→∞} e^{−x} = 0.

Example: Using Property 1 above, we have \prod_{x=1}^{5} e^x = \exp\left( \sum_{x=1}^{5} x \right) = e^{15}.


2 Natural Logarithm

Logarithms can be defined as the inverse of exponential functions. That is, if y = a^x then log_a(y) = x. The natural logarithm function has base e and is written without the subscript: log_e(y) = log(y). Because e^x > 0 for all x, log(y) is only defined for y > 0. We always have exp(log(y)) = log(exp(y)) = y.

We can use the properties of exponents from the previous section to obtain some impor-

tant properties of logarithms:

1. log(x · y) = log(x) + log(y)

2. log(x/y) = log(x) − log(y)

3. log(x^b) = b log(x)

4. log(1) = 0

Because the natural logarithm is a monotonically increasing one-to-one function, finding the

x which maximizes any positive-valued function f(x) is equivalent to maximizing log(f(x)).

This is useful because we often take derivatives to maximize functions. If f(x) has product

terms, then log(f(x)) will have summation terms, which are usually simpler when taking

derivatives.

Example: log(5²/10) = 2 log(5) − log(10) ≈ 0.916.

3 Argmax

When we want to maximize a function f(x), there are two things we may be interested in:

1. The value f(x) achieves when it is maximized, which we denote max_x f(x).

2. The x-value that results in maximizing f(x), which we denote x̂ = argmax_x f(x).

Thus max_x f(x) = f(x̂).

Example: Suppose f(x) = exp(−x²). Then log(f(x)) = −x², which is maximized at x = 0. Hence, argmax_x f(x) = x̂ = 0 and max_x f(x) = f(x̂) = f(0) = 1.
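A numerical version of this example (a sketch; a simple grid search stands in for the calculus):

```python
import numpy as np

f = lambda x: np.exp(-x**2)

xs = np.linspace(-3, 3, 10001)   # a fine grid of candidate x values
ys = f(xs)

x_hat = xs[np.argmax(ys)]        # argmax_x f(x): the x that maximizes f
print(x_hat, ys.max())           # ~0.0 and f(0) = 1.0
```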


Supplementary material for Lesson 4

1 Derivative of a Likelihood

Recall that the likelihood function is viewed as a function of the parameters θ, and the data y are considered fixed. Thus, when we take the derivative of the log-likelihood function, it is with respect to θ only.

Example: Consider the normal likelihood where only the mean µ is unknown:

f(y | µ) = \prod_{i=1}^{n} \frac{1}{\sqrt{2πσ^2}} \exp\left[ −\frac{1}{2σ^2} (y_i − µ)^2 \right] = \frac{1}{(\sqrt{2πσ^2})^n} \exp\left[ −\frac{1}{2σ^2} \sum_{i=1}^{n} (y_i − µ)^2 \right]

which yields the log-likelihood

ℓ(µ) = −\frac{n}{2} \log(2πσ^2) − \frac{1}{2σ^2} \sum_{i=1}^{n} (y_i − µ)^2

     = −\frac{n}{2} \log(2πσ^2) − \frac{1}{2σ^2} \left[ \sum_{i=1}^{n} (y_i^2 − 2µ y_i + µ^2) \right]

     = −\frac{n}{2} \log(2πσ^2) − \frac{1}{2σ^2} \left[ \sum_{i=1}^{n} y_i^2 − 2µ \sum_{i=1}^{n} y_i + nµ^2 \right] .

We take the derivative of ℓ(µ) with respect to µ to obtain

\frac{dℓ(µ)}{dµ} = \frac{2 \sum_{i=1}^{n} y_i}{2σ^2} − \frac{2nµ}{2σ^2} = \frac{\sum_{i=1}^{n} y_i}{σ^2} − \frac{nµ}{σ^2} .    (1)

If we set the expression in (1) equal to 0, we obtain the MLE µ̂ = \frac{1}{n} \sum_{i=1}^{n} y_i = ȳ.
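The claim that ȳ maximizes ℓ(µ) can be checked numerically (a sketch with made-up data; σ² is treated as known and fixed at 1):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.3, 1.9, 3.1, 2.7, 2.0])   # made-up data
sigma2 = 1.0                              # treat the variance as known

def neg_log_lik(mu):
    # Negative of the log-likelihood above (constants kept for completeness).
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - mu)**2) / (2 * sigma2)

res = minimize_scalar(neg_log_lik)
print(res.x, y.mean())   # the numerical maximizer matches the sample mean y-bar (2.4)
```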

2 Products of Indicator Functions

Because 0 · 1 = 0, the product of indicator functions can be combined into a single indicator

function with a modified condition.

Examples: I_{x<5} · I_{x≥0} = I_{0≤x<5}, and \prod_{i=1}^{n} I_{x_i<2} = I_{x_i<2 for all i} = I_{max_i x_i<2}.
