
Lecture 2.

Bayesian Decision Theory

Bayes Decision Rule

Loss function

Decision surface

Multivariate normal and Discriminant Function

Bayes Decision

It is decision making when all underlying probability distributions are known; it is optimal given that the distributions are known.

For two classes ω1 and ω2 ,

Prior probabilities for an unknown new observation:

P(ω1): the new observation belongs to class 1

P(ω2): the new observation belongs to class 2

P(ω1) + P(ω2) = 1

These priors reflect our prior knowledge. When no feature on the new object is available, the decision rule is: classify as class 1 if P(ω1) > P(ω2).

Bayes Decision

We observe features on each object. P(x | ω1) and P(x | ω2): class-specific densities.

The Bayes rule:
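Combining the prior and the class-specific density gives the posterior:

$$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\, P(\omega_j)}{P(x)}, \qquad P(x) = \sum_{j=1}^{2} P(x \mid \omega_j)\, P(\omega_j)$$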

Bayes Decision

Likelihood of observing x given class label.

Bayes Decision

Posterior probabilities.

Loss function

Loss function:

probability statement --> decision

some classification mistakes can be more costly than others.

The set of c classes: {ω1, ω2, …, ωc}

The set of possible actions: {α1, α2, …, αa}

αi: deciding that an observation belongs to ωi

Loss when taking action αi given the observation belongs to hidden class ωj: λ(αi | ωj)

Loss function

The expected loss:

Given an observation with feature vector x, the conditional risk is $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$.

Our final goal is to minimize the total risk over all x.
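In symbols, for a decision rule $\alpha(x)$ that assigns an action to every x, the total risk is

$$R = \int R(\alpha(x) \mid x)\, p(x)\, dx .$$

Choosing, for each x, the action with the smallest conditional risk minimizes R; this is the Bayes decision rule.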

Loss function

The zero-one loss:

All errors are equally costly.

The conditional risk is:

“The risk corresponding to this loss function is the average probability of error.”

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

Loss function

Let λij denote the loss for deciding class i when the true class is j.

In minimizing the risk, we decide class 1 if R(α1 | x) < R(α2 | x).

Rearranging, we have the likelihood-ratio rule below.

Loss function

Decide ω1 if:

$$(\lambda_{21} - \lambda_{11})\, P(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, P(x \mid \omega_2)\, P(\omega_2)$$

Equivalently, in likelihood-ratio form, decide ω1 if:

$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda$$

Example:

$$\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \;\Rightarrow\; \theta_\lambda = \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$

$$\lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix} \;\Rightarrow\; \theta_\lambda = \frac{2\,P(\omega_2)}{P(\omega_1)} = \theta_b$$
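As a quick numerical sketch of this threshold rule (the densities, priors, and loss matrix below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

# Hypothetical 1-D example with Gaussian class-conditional densities.
P1, P2 = 0.6, 0.4                      # priors P(w1), P(w2)
loss = np.array([[0.0, 2.0],           # lambda_ij: loss of deciding class i
                 [1.0, 0.0]])          # when the true class is j

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decide(x, mu1=0.0, mu2=2.0, sigma=1.0):
    """Minimum-risk decision via the likelihood-ratio threshold."""
    theta = ((loss[0, 1] - loss[1, 1]) / (loss[1, 0] - loss[0, 0])) * P2 / P1
    ratio = gauss(x, mu1, sigma) / gauss(x, mu2, sigma)
    return 1 if ratio > theta else 2

print(decide(0.5))   # near mu1 -> class 1
print(decide(1.8))   # near mu2 -> class 2
```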

Loss function

Figure: the likelihood ratio P(x | ω1)/P(x | ω2), with decision threshold θa for the zero-one loss function and threshold θb if misclassifying ω2 is penalized more.

Discriminant function & decision surface

Features -> discriminant functions gi(x), i=1,…,c

Assign class i if gi(x) > gj(x) ∀ j ≠ i

Decision surface defined by gi(x) = gj(x)

Decision surface

The discriminant functions partition the feature space into c decision regions (not necessarily contiguous). Our interest is in estimating the boundaries between the regions.

Minimax

Minimizing the maximum possible loss.

What happens when the priors change?

Normal density

Reminder: the covariance matrix is symmetric and positive semidefinite.

Entropy - the measure of uncertainty

The normal distribution has the maximum entropy among all distributions with a given mean and variance.
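For reference, the entropy of a density p is

$$H(p) = -\int p(x)\, \ln p(x)\, dx ,$$

and for a univariate normal with variance $\sigma^2$ it equals $\tfrac{1}{2}\ln(2\pi e \sigma^2)$.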

Reminder of some results for random vectors

Let Σ be a k×k symmetric matrix; then it has k pairs of eigenvalues and eigenvectors, and Σ can be decomposed as:

$$\Sigma = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \cdots + \lambda_k e_k e_k' = P \Lambda P'$$

Positive-definite matrix:

$$x' \Sigma x > 0, \quad \forall x \neq 0$$

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$$

Note: $x' \Sigma x = \lambda_1 (x' e_1)^2 + \cdots + \lambda_k (x' e_k)^2$

Normal density

Whitening transform:

P: eigenvector matrix

Λ: diagonal eigenvalue matrix

$$A_w = P \Lambda^{-1/2}$$

Using $\Sigma = P \Lambda P^t$ and $P^t P = I$:

$$A_w^t \Sigma A_w = \Lambda^{-1/2} P^t \Sigma P \Lambda^{-1/2} = \Lambda^{-1/2} P^t P \Lambda P^t P \Lambda^{-1/2} = I$$
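A small numerical check of the whitening transform; the covariance matrix is an illustrative assumption:

```python
import numpy as np

# Illustrative covariance matrix (assumed for this sketch)
Sigma = np.array([[4.0, 1.2],
                  [1.2, 2.0]])

# Eigendecomposition: Sigma = P @ diag(lam) @ P.T
lam, P = np.linalg.eigh(Sigma)

# Whitening transform A_w = P * Lambda^{-1/2}
A_w = P @ np.diag(lam ** -0.5)

# A_w' Sigma A_w should be (numerically) the identity
print(np.round(A_w.T @ Sigma @ A_w, 6))
```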

Normal density

To make a minimum error rate classification (zero-one loss), we use discriminant functions:

This is the log of the numerator in the Bayes formula; the log posterior differs from it only by ln P(x), which is the same for every class. The log is used because we only compare the gi's, and the log is monotone.

When normal density is assumed:

We have:
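With $x \in \mathbb{R}^d$ and $p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$:

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i)\right)$$

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) = -\tfrac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$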

Discriminant function for normal density

(1) Σi = σ²I

Linear discriminant function:

Note: the terms highlighted in blue boxes on the slide (those that do not depend on i) are irrelevant and can be dropped.
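Dropping those terms leaves the linear form:

$$g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^2}, \qquad w_{i0} = -\frac{\mu_i^t \mu_i}{2\sigma^2} + \ln P(\omega_i)$$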

Discriminant function for normal density

The decision surface is where gi(x) = gj(x).

With equal priors, x0 is the midpoint between the two means. The decision surface is a hyperplane, perpendicular to the line between the means.
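In formulas, for $\Sigma_i = \sigma^2 I$ the surface is the hyperplane through $x_0$ with normal $w$:

$$w^t (x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$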

Discriminant function for normal density

“Linear machine”: decision surfaces are hyperplanes.

Discriminant function for normal density

With unequal prior probabilities, the decision boundary shifts away from the more likely mean, toward the mean of the less likely class.

Discriminant function for normal density

(2) Σi = Σ

Discriminant function for normal density

Set:

The decision boundary is:
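For $\Sigma_i = \Sigma$, the standard choices and the resulting boundary are:

$$g_i(x) = w_i^t x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\mu_i^t \Sigma^{-1}\mu_i + \ln P(\omega_i)$$

$$w^t(x - x_0) = 0, \qquad w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$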

Discriminant function for normal density

The hyperplane is generally not perpendicular to the line between the means.

Discriminant function for normal density

(3) Σi is arbitrary

The decision boundary is a hyperquadric: a hyperplane, a pair of hyperplanes, a hypersphere, a hyperellipsoid, a hyperparaboloid, or a hyperhyperboloid.

$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$

$$W_i = -\tfrac{1}{2} \Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1} \mu_i, \qquad w_{i0} = -\tfrac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \tfrac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)$$
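A short sketch of these quadratic discriminants in NumPy; the means, covariances, and priors below are illustrative assumptions:

```python
import numpy as np

# Hypothetical two-class setup with arbitrary covariances (illustrative values)
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.array([[1.0, 0.3], [0.3, 1.0]]),
          np.array([[2.0, -0.5], [-0.5, 1.5]])]
priors = [0.5, 0.5]

def g(x, mu, Sigma, prior):
    """Quadratic discriminant g_i(x) = x' W_i x + w_i' x + w_i0."""
    Sinv = np.linalg.inv(Sigma)
    W  = -0.5 * Sinv
    w  = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

def classify(x):
    scores = [g(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores)) + 1   # classes numbered 1..c

print(classify(np.array([0.2, -0.1])))  # near the first mean -> 1
print(classify(np.array([2.1, 1.2])))   # near the second mean -> 2
```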

Discriminant function for normal density

Extension to multiple classes.

Discriminant function for discrete features

Discrete features: x = [x1, x2, …, xd]t, xi ∈ {0, 1}

pi = P(xi = 1 | ω1)

qi = P(xi = 1 | ω2)

The likelihood will be:
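Assuming the d binary features are conditionally independent given the class (this independence is what makes the linear form below possible), the likelihoods factor as:

$$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}, \qquad P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$$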

Discriminant function for discrete features

The discriminant function:

The likelihood ratio:

$$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$$

$$w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad i = 1, \ldots, d$$

$$w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

Discriminant function for discrete features

So the decision surface is again a hyperplane.
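A minimal sketch of this linear rule for d = 3 binary features; the values of pi, qi, and the priors are illustrative assumptions:

```python
import numpy as np

# Illustrative parameters (assumed for this sketch)
p = np.array([0.8, 0.6, 0.3])   # p_i = P(x_i = 1 | w1)
q = np.array([0.2, 0.5, 0.7])   # q_i = P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5               # priors

w  = np.log(p * (1 - q) / (q * (1 - p)))               # feature weights
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

def g(x):
    """Linear discriminant; decide w1 if g(x) > 0."""
    return w @ x + w0

print(g(np.array([1, 1, 0])))   # positive -> class 1
print(g(np.array([0, 0, 1])))   # negative -> class 2
```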

Optimality

Consider a two-class case.

Two ways to make a mistake in the classification:

Misclassifying an observation from class 2 to class 1;

Misclassifying an observation from class 1 to class 2.

The feature space is partitioned into two regions by any classifier: R1 and R2
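For such a partition, the probability of error is

$$P(\text{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = \int_{R_2} P(x \mid \omega_1)\, P(\omega_1)\, dx + \int_{R_1} P(x \mid \omega_2)\, P(\omega_2)\, dx$$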

Optimality

In the multi-class case, there are numerous ways to make mistakes. It is easier to calculate the probability of correct classification.
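With decision regions $R_1, \ldots, R_c$,

$$P(\text{correct}) = \sum_{i=1}^{c} P(x \in R_i, \omega_i) = \sum_{i=1}^{c} \int_{R_i} P(x \mid \omega_i)\, P(\omega_i)\, dx$$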

The Bayes classifier maximizes P(correct); any other partitioning yields a probability of error that is at least as high.

The result is not dependent on the form of the underlying distributions.