Bayesian Decision Theory, Chapter 2 (Jan 11, 18, 23, 25). Bayes decision theory is a fundamental statistical approach to pattern classification. Assumption: the decision problem is posed in probabilistic terms and all relevant probability values are known.


Page 1:

Bayesian Decision Theory

Chapter 2 (Jan 11, 18, 23, 25)

• Bayes decision theory is a fundamental statistical approach to pattern classification

• Assumption: the decision problem is posed in probabilistic terms and all relevant probability values are known

Page 2:
Page 3:
Page 4:

Decision Making (taxonomy of approaches)

• Probabilistic model known: Bayes Decision Theory (Chapter 2) - "optimal" rules
• Probabilistic model unknown:
  • Supervised learning:
    • Parametric approach (Chapter 3) - plug-in rules
    • Nonparametric approach (Chapters 4, 6) - density estimation, k-NN, neural networks
  • Unsupervised learning:
    • Parametric approach (Chapter 10) - mixture models
    • Nonparametric approach (Chapter 10) - cluster analysis

Page 5:

Sea Bass vs. Salmon Classification

• Each fish appearing on the conveyor belt is either sea bass or salmon; these are the two "states of nature"

• Let ω denote the state of nature: ω1 = sea bass and ω2 = salmon; ω is a random variable that must be described probabilistically

• A priori (prior) probabilities: P(ω1) and P(ω2); P(ω1) is the probability that the next fish observed is a sea bass

• If no other types of fish are present, then P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

• P(ω1) = P(ω2) (uniform priors)

• The prior reflects our prior knowledge of how likely we are to observe a sea bass or a salmon; the prior may depend on the time of year or the fishing area!

Page 6:

• Case 1: Suppose we are asked to make a decision without observing the fish; we have only the prior information

• Bayes decision rule given only prior information: decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Error rate = min{P(ω1), P(ω2)}

• Suppose now we are allowed to measure a feature of the fish, say its lightness value x

• Define the class-conditional probability density function (pdf) of feature x; x is a random variable

• p(x | ωi) is the density of x given class ωi, i = 1, 2; p(x | ωi) ≥ 0 and the area under each pdf is 1
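The Case 1 rule (decide from the priors alone) can be sketched in a few lines; the function name and the example priors below are illustrative, not values from the text:

```python
def decide_from_priors(p_w1, p_w2):
    """Bayes rule with prior information only: always pick the class with
    the larger prior; the error rate is then the smaller of the two priors."""
    decision = "w1" if p_w1 > p_w2 else "w2"
    return decision, min(p_w1, p_w2)

# Hypothetical priors for illustration:
decision, err = decide_from_priors(0.7, 0.3)  # -> ('w1', 0.3)
```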

Page 7:

The less the class-conditional densities overlap, the better the feature.

Page 8:

• Case 2: Suppose we have only the class-conditional densities and no prior information

• Maximum likelihood decision rule: assign input pattern x to class ω1 if p(x | ω1) > p(x | ω2); otherwise decide ω2

• p(x | ω1) is also called the likelihood of class ω1 given the feature value x

• Case 3: We have both the priors and the class-conditional densities

• How does the feature x influence our attitude (prior) concerning the true state of nature?

• Bayes decision rule

Page 9:

• The posterior probability is a function of the likelihood and the prior

• Joint density: p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)

• Bayes rule:

P(ωj | x) = p(x | ωj) P(ωj) / p(x), j = 1, 2

where p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)

• Posterior = (Likelihood × Prior) / Evidence

• The evidence p(x) can be viewed as a scale factor that guarantees the posterior probabilities sum to 1

• p(x) is also called the unconditional density of feature x
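Bayes rule is a one-liner numerically; a minimal sketch (the helper name `posteriors` is mine, not from the text):

```python
def posteriors(likelihoods, priors):
    """P(w_j | x) = p(x | w_j) P(w_j) / p(x), where the evidence
    p(x) = sum_j p(x | w_j) P(w_j) normalizes the posteriors to sum to 1."""
    joint = [l * p for l, p in zip(likelihoods, priors)]  # p(x|w_j) P(w_j)
    evidence = sum(joint)                                 # p(x)
    return [j / evidence for j in joint]
```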

Page 10:
Page 11:

• P(ω1 | x) is the probability that the state of nature is ω1 given that feature value x has been observed

• A decision based on the posterior probabilities is called the "optimal" Bayes decision rule. What does optimal mean?

For a given observation (feature value) x:

decide ω1 if P(ω1 | x) > P(ω2 | x)

decide ω2 if P(ω1 | x) < P(ω2 | x)

To justify this rule, calculate the probability of error:

P(error | x) = P(ω1 | x) if we decide ω2

P(error | x) = P(ω2 | x) if we decide ω1

Page 12:

• So, for a given x, we can minimize the probability of error by deciding ω1 if P(ω1 | x) > P(ω2 | x), and ω2 otherwise. Therefore:

P(error | x) = min[P(ω1 | x), P(ω2 | x)]

• For each observation x, the Bayes decision rule minimizes the probability of error

• Unconditional error: P(error) is obtained by integrating P(error | x) over all possible observations x, weighted by p(x)
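The unconditional error can be checked numerically for a simple special case; the sketch below assumes two univariate Gaussian class-conditional densities (the function names and the trapezoid-rule setup are mine):

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_error(mu1, mu2, sigma, p1=0.5, p2=0.5, lo=-20.0, hi=20.0, n=20000):
    """P(error) = integral of min(P(w1|x), P(w2|x)) p(x) dx
                = integral of min(p(x|w1)P(w1), p(x|w2)P(w2)) dx,
    approximated here with the trapezoid rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        val = min(p1 * normal_pdf(x, mu1, sigma), p2 * normal_pdf(x, mu2, sigma))
        total += val * (0.5 if i in (0, n) else 1.0)
    return total * h

# Two unit-variance classes whose means are two standard deviations apart:
err = bayes_error(0.0, 2.0, 1.0)
```

With equal priors this agrees with the closed-form value Φ(−1) ≈ 0.159 for means two standard deviations apart.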

Page 13:

• Optimal Bayes decision rule: decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

• Special cases:

(i) P(ω1) = P(ω2): decide ω1 if p(x | ω1) > p(x | ω2), otherwise ω2

(ii) p(x | ω1) = p(x | ω2): decide ω1 if P(ω1) > P(ω2), otherwise ω2

Page 14:

Bayesian Decision Theory – Continuous Features

• Generalization of the preceding formulation:

• Use of more than one feature (d features)

• Use of more than two states of nature (c classes)

• Allowing actions other than deciding on the state of nature

• Introduce a "loss function"; minimizing the "risk" is more general than minimizing the probability of error

Page 15:

• Allowing actions other than classification primarily allows the possibility of "rejection"

• Rejection: an input pattern is rejected when it is difficult to decide between two classes or the pattern is too noisy!

• The loss function specifies the cost of each action

Page 16:

• Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories" or "classes")

• Let {α1, α2, …, αa} be the set of a possible actions that can be taken for an input pattern x

• Let λ(αi | ωj) be the loss incurred for taking action αi when the true state of nature is ωj

• Decision rule: α(x) specifies which action to take for every possible observation x

Page 17:

Conditional Risk

For a given x, suppose we take action αi:

• If the true state is ωj, we incur the loss λ(αi | ωj)

• P(ωj | x) is the probability that the true state is ωj

• Any one of the c states may be the true one for a given x

Conditional risk: R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)

Overall risk: R = expected value of R(α(x) | x) with respect to p(x)

Minimizing R: minimize R(αi | x) for i = 1, …, a
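The conditional risk is a loss-weighted sum over posteriors; a minimal sketch (helper names are mine):

```python
def conditional_risk(loss, posteriors):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x); returns one risk per action."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def min_risk_action(loss, posteriors):
    """Index of the action with minimum conditional risk."""
    risks = conditional_risk(loss, posteriors)
    return min(range(len(risks)), key=lambda i: risks[i])

# With the 0-1 loss, the risk of action i is 1 - P(w_i | x):
loss01 = [[0, 1], [1, 0]]
```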

Page 18:

Select the action αi for which R(αi | x) is minimum

• This action minimizes the overall risk

• The resulting risk is called the Bayes risk

• It is the best classification performance that can be achieved given the priors, the class-conditional densities, and the loss function!

Page 19:

• Two-category classification

α1: decide ω1

α2: decide ω2

λij = λ(αi | ωj): loss incurred in deciding ωi when the true state of nature is ωj

Conditional risks:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Page 20:

The Bayes decision rule is stated as: if R(α1 | x) < R(α2 | x), take action α1 ("decide ω1")

This rule is equivalent to: decide ω1 if

(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2);

decide ω2 otherwise
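The rearranged two-category rule can be written directly in code; a minimal sketch (the function name and argument order are mine):

```python
def decide_min_risk(l11, l12, l21, l22, px_w1, px_w2, p1, p2):
    """Two-category minimum-risk rule in the rearranged form:
    decide w1 iff (l21 - l11) p(x|w1) P(w1) > (l12 - l22) p(x|w2) P(w2)."""
    return "w1" if (l21 - l11) * px_w1 * p1 > (l12 - l22) * px_w2 * p2 else "w2"
```

With the 0-1 loss (l11 = l22 = 0, l12 = l21 = 1) this reduces to comparing p(x|ω1)P(ω1) against p(x|ω2)P(ω2), i.e. the posterior comparison.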

Page 21:

In terms of the likelihood ratio (LR), the preceding rule is equivalent to the following rule:

if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1)

then take action α1 (decide ω1); otherwise take action α2 (decide ω2)

The "threshold" term on the right-hand side involves only the prior and the loss function

Page 22:

Interpretation of the Bayes decision rule: if the likelihood ratio of class ω1 to class ω2 exceeds a threshold value (independent of the input pattern x), the optimal action is: decide ω1

The maximum likelihood decision rule is a special case of the minimum-risk decision rule:

• Threshold value = 1

• 0-1 loss function

• Equal class prior probabilities

Page 23:

Bayesian Decision Theory (Sections 2.3-2.5)

• Minimum Error Rate Classification

• Classifiers, Discriminant Functions and Decision Surfaces

• Multivariate Normal (Gaussian) Density

Page 24:

Minimum Error Rate Classification

• Actions are decisions on classes: if action αi is taken and the true state of nature is ωj, the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error (the error rate)

Page 25:

• Zero-one (0-1) loss function: no loss for a correct decision and unit loss for an incorrect decision:

λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c

The conditional risk then simplifies to:

R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)

"The risk corresponding to the 0-1 loss function is the average probability of error"

Page 26:

• Minimizing the risk under the 0-1 loss function requires maximizing the posterior probability P(ωi | x), since R(αi | x) = 1 − P(ωi | x)

• For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

Page 27:

• Decision boundaries and decision regions

• Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); then decide ω1 if p(x | ω1) / p(x | ω2) > θλ

• If λ is the 0-1 loss function, the threshold involves only the priors:

if λ = [0 1; 1 0], then θa = P(ω2) / P(ω1)

• If instead λ = [0 2; 1 0], then θb = 2 P(ω2) / P(ω1)

Page 28:
Page 29:

Classifiers, Discriminant Functions and Decision Surfaces

• Many different ways to represent classifiers or decision rules;

• One of the most useful is in terms of “discriminant functions”

• The multi-category case

• Set of discriminant functions gi(x), i = 1, …, c

• The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i

Page 30:

Network Representation of a Classifier

Page 31:

• The Bayes classifier can be represented this way, but the choice of discriminant function is not unique

• gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk)

• For minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!)

• gi(x) ∝ p(x | ωi) P(ωi)

• gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural log)

Page 32:

• A decision rule partitions the feature space into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri; in region Ri the input pattern x is assigned to class ωi

• Two-category case: here the classifier is a "dichotomizer" with two discriminant functions g1 and g2

Let g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2

Page 33:

• A "dichotomizer" computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not

• Computation of g(x) = g1(x) − g2(x); two equivalent choices are:

g(x) = P(ω1 | x) − P(ω2 | x)

g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]

Page 34:
Page 35:

The Normal Density

• Univariate density: N(μ, σ²)

• The normal density is analytically tractable

• Continuous density with two parameters (mean and variance)

• Many processes are asymptotically Gaussian (central limit theorem)

• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern

p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − μ)/σ)² ],  −∞ < x < ∞

where:

μ = mean (or expected value) of x

σ² = variance (or expected squared deviation) of x
Page 36:
Page 37:

• Multivariate density: N(μ, Σ)

• Multivariate normal density in d dimensions:

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − μ)^t Σ⁻¹ (x − μ) ]

where:

x = (x1, x2, …, xd)^t ("t" stands for the transpose of a vector)

μ = (μ1, μ2, …, μd)^t is the mean vector

Σ = d×d covariance matrix

|Σ| and Σ⁻¹ are the determinant and inverse of Σ, respectively

• The covariance matrix is symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive

• The multivariate normal density is completely specified by d + d(d+1)/2 parameters

• If variables x1 and x2 are statistically independent, then the covariance of x1 and x2 is zero
Page 38:

Multivariate Normal Density

• Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and its shape by the covariance matrix

• The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of Σ

• r² = (x − μ)^t Σ⁻¹ (x − μ) is the squared Mahalanobis distance from x to μ

Page 39:

Transformation of Normal Variables

Linear combinations of jointly normal random variables have a normal distribution

Linear transformation can convert an arbitrary multivariate normal distribution into a spherical one (“Whitening”)
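Whitening can be sketched for the simplest case, a diagonal covariance, where the transform just rescales each coordinate by 1/√variance (the general case uses the eigendecomposition Σ = ΦΛΦ^t and A_w = ΦΛ^(−1/2)); the helper name and example values are mine:

```python
import math

def whiten_diag(samples, variances):
    """Whitening for a DIAGONAL covariance: scale coordinate i by
    1/sqrt(variances[i]), so each whitened feature has unit variance."""
    scales = [1.0 / math.sqrt(v) for v in variances]
    return [[xi * s for xi, s in zip(x, scales)] for x in samples]
```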

Page 40:

Bayesian Decision Theory

(Sections 2.6 to 2.9)

• Discriminant Functions for the Normal Density

• Bayes Decision Theory – Discrete Features

Page 41:

Discriminant Functions for the Normal Density

• Minimum error-rate classification can be achieved using the discriminant functions

gi(x) = ln p(x | ωi) + ln P(ωi), i = 1, 2, …, c

• In the case of multivariate normal densities:

gi(x) = −(1/2)(x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Page 42:

• Case 1: Σi = σ²I (I is the identity matrix)

Features are statistically independent and each feature has the same variance, irrespective of the class

gi(x) = wi^t x + wi0 (a linear discriminant function)

where:

wi = μi / σ²

wi0 = −(1/(2σ²)) μi^t μi + ln P(ωi)

(wi0 is called the threshold or bias for the i-th category!)
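The linear discriminant for this case is a dot product plus a bias; a minimal sketch (the function name is mine):

```python
import math

def linear_discriminant(x, mu_i, sigma2, prior_i):
    """g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2 and
    w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)   (case Sigma_i = sigma^2 I)."""
    w = [m / sigma2 for m in mu_i]
    w0 = -sum(m * m for m in mu_i) / (2 * sigma2) + math.log(prior_i)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0
```

With equal priors, the two discriminants are equal at the midpoint between the means, so the boundary passes through it, as expected.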

Page 43:

• A classifier that uses linear discriminant functions is called "a linear machine"

• The decision surfaces (boundaries) for a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x)

Page 44:

• The hyperplane separating Ri and Rj passes through the point

x0 = (1/2)(μi + μj) − [σ² / ‖μi − μj‖²] ln [P(ωi)/P(ωj)] (μi − μj)

and is orthogonal to the line linking the means!

• If P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)

Page 45:
Page 46:
Page 47:
Page 48:

• Case 2: Σi = Σ (covariance matrices of all classes are identical, but otherwise arbitrary!)

• The hyperplane separating Ri and Rj passes through

x0 = (1/2)(μi + μj) − [ln(P(ωi)/P(ωj)) / ((μi − μj)^t Σ⁻¹ (μi − μj))] (μi − μj)

• The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!

• To classify a feature vector x (with equal priors), measure the squared Mahalanobis distance from x to each of the c means and assign x to the category of the nearest mean
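The nearest-Mahalanobis-mean rule can be sketched for the special case of a shared diagonal covariance (the general case replaces the per-feature division by a full Σ⁻¹ quadratic form); the helper names are mine:

```python
def mahalanobis2_diag(x, mu, variances):
    """Squared Mahalanobis distance (x-mu)^t Sigma^-1 (x-mu) for a
    diagonal shared covariance (illustrative special case)."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, variances))

def classify_nearest_mean(x, means, variances):
    """With equal priors and a shared covariance, assign x to the class
    whose mean is nearest in Mahalanobis distance."""
    d2 = [mahalanobis2_diag(x, mu, variances) for mu in means]
    return min(range(len(means)), key=lambda i: d2[i])
```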

Page 49:
Page 50:
Page 51:

• Case 3: Σi = arbitrary

• The covariance matrices are different for each category

gi(x) = x^t Wi x + wi^t x + wi0 (a quadratic discriminant function)

where:

Wi = −(1/2) Σi⁻¹

wi = Σi⁻¹ μi

wi0 = −(1/2) μi^t Σi⁻¹ μi − (1/2) ln |Σi| + ln P(ωi)

• In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids
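The quadratic discriminant can be sketched for a diagonal Σi, where the quadratic form reduces to per-feature terms (the helper name and the diagonal restriction are mine):

```python
import math

def quadratic_discriminant_diag(x, mu, variances, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for a DIAGONAL Sigma_i:
    W_i = -1/2 Sigma_i^-1, w_i = Sigma_i^-1 mu_i,
    w_i0 = -1/2 mu_i^t Sigma_i^-1 mu_i - 1/2 ln|Sigma_i| + ln P(w_i)."""
    quad = -0.5 * sum(xi * xi / v for xi, v in zip(x, variances))
    lin = sum(xi * mi / v for xi, mi, v in zip(x, mu, variances))
    w0 = (-0.5 * sum(mi * mi / v for mi, v in zip(mu, variances))
          - 0.5 * math.log(math.prod(variances)) + math.log(prior))
    return quad + lin + w0
```

Up to the class-independent constant (d/2) ln 2π, this equals ln p(x | ωi) + ln P(ωi), which is a useful sanity check.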

Page 52:

Discriminant Functions for 1D Gaussian

Page 53:

Discriminant Functions for the Normal Density

Page 54:
Page 55:

Discriminant Functions for the Normal Density

Page 56:

Discriminant Functions for the Normal Density

Page 57:

Decision Regions for Two-Dimensional Gaussian Data

Decision boundary: x2 = 3.514 − 1.125 x1 + 0.1875 x1²

Page 58:

Error Probabilities and Integrals

• 2-class problem: there are two types of errors (deciding ω1 when the truth is ω2, and vice versa)

• Multi-class problem: it is simpler to compute the probability of being correct (there are more ways to be wrong than to be right)

Page 59:

Error Probabilities and Integrals

Bayes optimal decision boundary in 1-D case

Page 60:

Error Rate of Linear Discriminant Function (LDF)

• Assume a 2-class problem with p(x | ω1) ~ N(μ1, Σ) and p(x | ω2) ~ N(μ2, Σ)

• gi(x) = ln P(ωi) − (1/2)(x − μi)^t Σ⁻¹ (x − μi) (terms common to both classes are dropped)

• Due to the symmetry of the problem (identical Σ), the two types of errors are identical

• Decide ω1 if g1(x) > g2(x), which reduces to

(μ1 − μ2)^t Σ⁻¹ x − (1/2)(μ1 − μ2)^t Σ⁻¹ (μ1 + μ2) > t, where t = ln [P(ω2)/P(ω1)]

Page 61:

Error Rate of LDF

• Let h(x) = (μ1 − μ2)^t Σ⁻¹ x − (1/2)(μ1 − μ2)^t Σ⁻¹ (μ1 + μ2)

• Compute the expected values and variances of h(x) when x comes from ω1 and from ω2

E[h(x) | ω1] = (μ1 − μ2)^t Σ⁻¹ μ1 − (1/2)(μ1 − μ2)^t Σ⁻¹ (μ1 + μ2) = (1/2) Δ²

where Δ² = (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) is the squared Mahalanobis distance between μ1 and μ2

Page 62:

Error Rate of LDF

• Similarly, E[h(x) | ω2] = −(1/2)(μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) = −(1/2) Δ²

• Var[h(x) | ωi] = (μ1 − μ2)^t Σ⁻¹ (μ1 − μ2) = Δ² for both classes

• Hence p(h(x) | ω1) ~ N(Δ²/2, Δ²) and p(h(x) | ω2) ~ N(−Δ²/2, Δ²)

Page 63:

Error Rate of LDF

P(error | ω1) = P(g1(x) < g2(x) | ω1) = P(h(x) < t | ω1)

= (1 / (√(2π) Δ)) ∫_{−∞}^{t} exp[ −(h − Δ²/2)² / (2Δ²) ] dh

= 1/2 − (1/2) erf[ Δ/(2√2) − t/(√2 Δ) ]

Page 64:

Error Rate of LDF

• erf(r) = (2/√π) ∫_0^r e^(−x²) dx

• With t = ln [P(ω2)/P(ω1)], similarly:

P(error | ω2) = P(h(x) > t | ω2) = 1/2 − (1/2) erf[ Δ/(2√2) + t/(√2 Δ) ]

• Total probability of error: P(error) = P(ω1) P(error | ω1) + P(ω2) P(error | ω2)

Page 65:

Error Rate of LDF

• For equal priors, P(ω1) = P(ω2) = 1/2 and t = 0, so

P(error) = 1/2 − (1/2) erf[ Δ / (2√2) ]

• The Mahalanobis distance Δ is a good measure of the separation between the classes:

(i) No class separation: μ1 = μ2, so Δ = 0, erf(0) = 0, and P(error) = 1/2

(ii) Perfect class separation: Δ → ∞, erf(·) → 1, and P(error) → 0
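The equal-prior error rate P(error) = 1/2 − (1/2) erf(Δ/(2√2)) is a one-liner to evaluate; a sketch assuming equal priors and a shared covariance (the function name is mine):

```python
import math

def ldf_error_rate(delta):
    """Equal-prior error rate of the LDF for two Gaussians with shared
    covariance, where delta is the Mahalanobis distance between the means:
    P(error) = 1/2 - 1/2 * erf(delta / (2 * sqrt(2)))."""
    return 0.5 - 0.5 * math.erf(delta / (2 * math.sqrt(2)))
```

The two limiting cases above fall out directly: Δ = 0 gives an error of 1/2, and large Δ drives the error toward 0.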

Page 66:

Error Bounds for Normal Densities

• The exact calculation of the error for the general Gaussian case (Case 3) is extremely difficult

• However, in the 2-category case the general error can be approximated analytically to give us an upper bound on the error

Page 67:

Chernoff Bound

• To derive a bound for the error, we need the following inequality:

min(a, b) ≤ a^β b^(1−β), for a, b ≥ 0 and 0 ≤ β ≤ 1

• Applying it to P(error) = ∫ min[P(ω1) p(x | ω1), P(ω2) p(x | ω2)] dx gives

P(error) ≤ P(ω1)^β P(ω2)^(1−β) ∫ p(x | ω1)^β p(x | ω2)^(1−β) dx

• Assume the class-conditional densities are normal; then

∫ p(x | ω1)^β p(x | ω2)^(1−β) dx = e^(−k(β))

where

k(β) = [β(1−β)/2] (μ2 − μ1)^t [βΣ2 + (1−β)Σ1]⁻¹ (μ2 − μ1) + (1/2) ln{ |βΣ2 + (1−β)Σ1| / (|Σ1|^(1−β) |Σ2|^β) }

Page 68:

Chernoff Bound

The Chernoff bound on P(error) is found by determining the value of β ∈ [0, 1] that minimizes e^(−k(β))

Page 69:

Error Bounds for Normal Densities

• Bhattacharyya Bound

• Assume β = 1/2:

• computationally simpler

• slightly less tight bound

• Eq. (73) then takes the form

P(error) ≤ √(P(ω1) P(ω2)) ∫ √(p(x | ω1) p(x | ω2)) dx = √(P(ω1) P(ω2)) e^(−k(1/2))

where

k(1/2) = (1/8)(μ2 − μ1)^t [(Σ1 + Σ2)/2]⁻¹ (μ2 − μ1) + (1/2) ln{ |(Σ1 + Σ2)/2| / √(|Σ1| |Σ2|) }

• When the two covariance matrices are equal, k(1/2) is 1/8 of the squared Mahalanobis distance between the two means

Page 70:

Error Bounds for Gaussian Distributions

Chernoff bound and Bhattacharyya bound (β = 1/2)

• 2-category problem, 2D data

• True error using numerical integration = 0.0021

• Best Chernoff error bound = 0.008190

• Bhattacharyya error bound = 0.008191

The bounds are P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−k(β)) for 0 ≤ β ≤ 1; the Bhattacharyya bound is the value at β = 1/2
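The Bhattacharyya bound is easy to evaluate in one dimension, where the covariance "matrices" are scalar variances; a sketch (the function name and the 1D restriction are mine):

```python
import math

def bhattacharyya_bound_1d(mu1, var1, mu2, var2, p1=0.5, p2=0.5):
    """P(error) <= sqrt(P(w1) P(w2)) * exp(-k(1/2)), where in 1D
    k(1/2) = (mu2-mu1)^2 / (4*(var1+var2))
             + 1/2 * ln( ((var1+var2)/2) / sqrt(var1*var2) )."""
    k = ((mu2 - mu1) ** 2 / (4 * (var1 + var2))
         + 0.5 * math.log(((var1 + var2) / 2) / math.sqrt(var1 * var2)))
    return math.sqrt(p1 * p2) * math.exp(-k)
```

For identical classes the bound is 1/2 (vacuous but valid), and it decays rapidly as the means separate.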

Page 71:

Signal Detection Theory

• We are interested in detecting a single weak pulse, e.g. a radar reflection; the internal signal x in the detector has mean μ1 when the pulse is absent and μ2 when it is present:

p(x | ω1) ~ N(μ1, σ²), p(x | ω2) ~ N(μ2, σ²)

• The detector uses a threshold x* to determine the presence of the pulse. For a given threshold, define:

P(x > x* | ω2): hit

P(x > x* | ω1): false alarm

P(x < x* | ω2): miss

P(x < x* | ω1): correct rejection

• Discriminability: the ease of determining whether the pulse is present or not, d′ = |μ2 − μ1| / σ
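The four rates follow from the two Gaussian models and the threshold; a minimal sketch (the helper names are mine):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def detection_rates(x_star, mu1, mu2, sigma):
    """Hit = P(x > x* | pulse present), false alarm = P(x > x* | pulse absent),
    for p(x | w_i) ~ N(mu_i, sigma^2); d' = |mu2 - mu1| / sigma."""
    hit = 1.0 - phi((x_star - mu2) / sigma)
    false_alarm = 1.0 - phi((x_star - mu1) / sigma)
    d_prime = abs(mu2 - mu1) / sigma
    return hit, false_alarm, d_prime
```

Sweeping x_star and plotting hit against false-alarm rate traces out the ROC curve discussed next.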

Page 72:

Receiver Operating Characteristic (ROC)

• Experimentally compute hit and false alarm rates for fixed x*

• Changing x* will change the hit and false alarm rates

• A plot of hit and false alarm rates is called the ROC curve

Performance is shown at different operating points

Page 73:

Operating Characteristic

• In practice, distributions may not be Gaussian and will be multidimensional; ROC curve can still be plotted

• Vary a single control parameter for the decision rule and plot the resulting hit and false alarm rates

Page 74:

Bayes Decision Theory: Discrete Features

• The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm

• Case of independent binary features for the 2-category problem:

Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities:

pi = P(xi = 1 | ω1)

qi = P(xi = 1 | ω2)

Page 75

• The discriminant function in this case is:

g(x) = Σᵢ₌₁ᵈ wi xi + w0

where:

wi = ln [ pi (1 − qi) / ( qi (1 − pi) ) ],  i = 1, …, d

and:

w0 = Σᵢ₌₁ᵈ ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0

Page 76

Bayesian Decision for 3-dim Binary Data

Left figure: pi=.8 and qi=.5. Right figure: p3=q3 (feature 3 does not provide any discriminatory information) so the decision surface is parallel to x3 axis

• A 2-class problem; 3 independent binary features; class priors are equal; pi = 0.8 and qi = 0.5, i = 1, 2, 3

• wi = ln 4 = 1.3863, i = 1, 2, 3; w0 = 3 ln 0.4 ≈ −2.75

• Decision surface g(x) = 0 is shown below
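The weights above can be transcribed directly into code; a sketch using the slide's example (d = 3, pi = 0.8, qi = 0.5, equal priors; the function name is mine):

```python
import math

def binary_discriminant(x, p, q, prior1=0.5, prior2=0.5):
    """g(x) = sum_i w_i x_i + w_0 for independent binary features."""
    d = len(x)
    w = [math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i]))) for i in range(d)]
    w0 = (sum(math.log((1 - p[i]) / (1 - q[i])) for i in range(d))
          + math.log(prior1 / prior2))
    return sum(w[i] * x[i] for i in range(d)) + w0

p, q = [0.8] * 3, [0.5] * 3                 # slide example: each w_i = ln 4
g = binary_discriminant([1, 1, 0], p, q)    # g > 0: decide omega1
```

With these numbers the surface g(x) = 0 separates vertices of the unit cube with at least two features "on" from the rest.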

Page 77

Neyman-Pearson Rule

“Classification, Estimation and Pattern recognition” by Young and Calvert
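The slides defer the details to Young and Calvert; in brief, the Neyman-Pearson rule fixes the false-alarm probability at a level α and maximizes the hit rate subject to that constraint. For a unit-variance Gaussian pair this reduces to choosing the threshold x* with P(x > x* | ω1) = α; a sketch under assumed illustrative parameters:

```python
import math

def tail(x_star, mu, sigma):
    """P(x > x*) for x ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x_star - mu) / (sigma * math.sqrt(2)))

def np_threshold(alpha, mu0, sigma, lo=-10.0, hi=10.0, iters=80):
    """Bisect for x* satisfying P(x > x* | omega1) = alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tail(mid, mu0, sigma) > alpha:  # false-alarm rate still too high
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed toy setup: absent ~ N(0,1), present ~ N(2,1), alpha = 0.05
x_star = np_threshold(0.05, 0.0, 1.0)
power = tail(x_star, 2.0, 1.0)  # detection probability at the constrained rate
```

The resulting x* is the familiar one-sided Gaussian critical value (≈ 1.645 for α = 0.05).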


Page 83

Missing Feature Values

• n × d pattern matrix; n × n (dis)similarity matrix

• Suppose it is not possible to measure a certain feature for a given pattern

• Possible solutions:

• Reject the pattern

• Approximate the missing value

• Replace missing value by mean for that feature

• Marginalize over distribution of the missing feature
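The marginalization option can be sketched on a toy discrete model (the joint tables below are made-up numbers): with feature x2 unobserved, sum the class-conditional p(x1, x2 | ωc) over x2's values and apply Bayes rule to the remaining margin.

```python
# Toy 2-feature discrete model; each feature takes values in {0, 1}.
# p_joint[c][(x1, x2)] = p(x1, x2 | omega_c); priors assumed equal.
p_joint = {
    1: {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1},
    2: {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4},
}
priors = {1: 0.5, 2: 0.5}

def posterior_with_missing_x2(x1):
    """P(omega_c | x1) when x2 is missing: marginalize the joint over x2."""
    scores = {c: priors[c] * sum(p_joint[c][(x1, v)] for v in (0, 1))
              for c in p_joint}
    z = sum(scores.values())          # evidence p(x1)
    return {c: s / z for c, s in scores.items()}
```

For example, observing only x1 = 0 here favors ω1, since its marginal p(x1 = 0 | ω1) is larger.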

Page 84

Handling Missing Feature Values

Page 85

Other Topics

• Compound Bayes Decision Theory & Context

– Consecutive states of nature may be dependent; state of next fish may depend on state of the previous fish

– Exploit such statistical dependence to gain improved performance (use of context)

– Compound decision vs. sequential compound decision

– Markov dependence

• Sequential Decision Making

– Feature measurement process is sequential

– Feature measurement cost

– Minimize a combination of feature measurement cost and classification error

Page 86

Context in Text Recognition