Transcript
Page 1:

Pattern Classification

Page 2:

Chapter 2 (Part 1): Bayesian Decision Theory

(Sections 2.1-2.2)

• Introduction

• Bayesian Decision Theory – Continuous Features

Page 3:

Introduction

• The sea bass/salmon example

• State of nature, prior

• State of nature is a random variable

• The catch of salmon and sea bass is equiprobable

• P(ω1) = P(ω2) (uniform priors)

• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

Page 4:

• Decision rule with only the prior information

• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Use of the class–conditional information

• P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon

Page 5:

Page 6:

• Posterior, likelihood, evidence

• P(ωj | x) = P(x | ωj) P(ωj) / P(x)   (Bayes rule)

• Posterior = (Likelihood × Prior) / Evidence

• Where, in the case of two categories:

P(x) = Σj=1..2 P(x | ωj) P(ωj)
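As a quick numeric illustration of Bayes' rule above, here is a minimal Python sketch; the likelihood and prior values are made-up placeholders, not taken from the slides.

def posterior(likelihoods, priors):
    # P(wj | x) = P(x | wj) P(wj) / P(x), with P(x) = sum_j P(x | wj) P(wj)
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# hypothetical lightness reading: P(x | w1) = 0.6, P(x | w2) = 0.2, equal priors
print(posterior([0.6, 0.2], [0.5, 0.5]))   # -> [0.75, 0.25]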

Page 7:

Page 8:

• Intuitive decision rule given the posterior probabilities:

Given x:

if P(ω1 | x) > P(ω2 | x), true state of nature = ω1

if P(ω1 | x) < P(ω2 | x), true state of nature = ω2

Why do this? Whenever we observe a particular x, the probability of error is:

P(error | x) = P(ω1 | x) if we decide ω2

P(error | x) = P(ω2 | x) if we decide ω1

Page 9:

• Minimizing the probability of error

• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Therefore:

P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]

(Bayes decision)

Page 10:

Bayesian Decision Theory – Continuous Features

Generalization of the preceding ideas

• Use of more than one feature

• Use of more than two states of nature

• Allowing actions, not only deciding on the state of nature

• Introducing a loss function that is more general than the probability of error

• Allowing actions other than classification primarily allows the possibility of rejection

• Refusing to make a decision in close or bad cases!

• Letting the loss function state how costly each action is

Page 11:

• Let {ω1, ω2, …, ωc} be the set of c states of nature (or “classes”)

• Let {α1, α2, …, αa} be the set of possible actions

• Let λ(αi | ωj) be the loss for action αi when the state of nature is ωj

Bayesian Decision Theory – Continuous Features

Page 12:

What is the expected loss for action αi?

Conditional risk = expected loss

For any given x, the expected loss is:

R(αi | x) = Σj=1..c λ(αi | ωj) P(ωj | x)

Page 13:

Overall risk

R = the expected loss over all x when the decision rule α(x) is used (R(α(x) | x) averaged over x, weighted by P(x))

Minimizing R ⇔ minimizing R(αi | x) for each x, i = 1, …, a

Conditional risk:

R(αi | x) = Σj=1..c λ(αi | ωj) P(ωj | x)

Page 14:

Select the action αi for which R(αi | x) is minimum

R is then minimum, and R in this case is called the Bayes risk = the best performance that can be achieved!
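To make the action-selection step concrete, here is a short Python sketch of minimizing the conditional risk with a general loss matrix; the loss matrix and posterior values below are illustrative assumptions.

def bayes_action(loss, posteriors):
    # R(alpha_i | x) = sum_j loss[i][j] * P(w_j | x); pick the action with smallest risk
    risks = [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

loss = [[0.0, 2.0],    # action alpha_1: no loss if w_1, loss 2 if w_2
        [1.0, 0.0]]    # action alpha_2: loss 1 if w_1, no loss if w_2
print(bayes_action(loss, [0.7, 0.3]))   # risks [0.6, 0.7] -> choose alpha_1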

Page 15:

Two-Category Classification

α1 : deciding ω1

α2 : deciding ω2

λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Page 16:

Our rule is the following:

if R(α1 | x) < R(α2 | x)

action α1: “decide ω1” is taken

This results in the equivalent rule:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise

Page 17:

[Figure: decision quantities scaled by (λ21 − λ11) and (λ12 − λ22)]

Page 18:

Two-Category Decision Theory: Chopping Machine

α1 = chop

α2 = DO NOT chop

ω1 = NO hand in machine

ω2 = hand in machine

λ11 = λ(α1 | ω1) = $ 0.00

λ12 = λ(α1 | ω2) = $ 100.00

λ21 = λ(α2 | ω1) = $ 0.01

λ22 = λ(α2 | ω2) = $ 0.01

Therefore our rule becomes:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

0.01 P(x | ω1) P(ω1) > 99.99 P(x | ω2) P(ω2)
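A small Python check of this rule with the stated losses; the priors and the class-conditional likelihood values for the observation x are assumptions made only for illustration.

lam = {(1, 1): 0.00, (1, 2): 100.00, (2, 1): 0.01, (2, 2): 0.01}   # lambda(alpha_i | w_j) in dollars
P1, P2 = 0.9, 0.1          # assumed priors: a hand is rarely in the machine
px_w1, px_w2 = 0.5, 0.4    # assumed likelihoods P(x | w1), P(x | w2) for some sensor reading x

left = (lam[(2, 1)] - lam[(1, 1)]) * px_w1 * P1    # 0.01 * P(x | w1) P(w1)
right = (lam[(1, 2)] - lam[(2, 2)]) * px_w2 * P2   # 99.99 * P(x | w2) P(w2)
print("chop" if left > right else "do NOT chop")   # here: do NOT chop
# the loss for chopping a hand is so large that even a modest P(x | w2) P(w2) blocks the chop action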

Page 19:

Our rule is the following:

if R(α1 | x) < R(α2 | x)

action α1: “decide ω1” is taken

This results in the equivalent rule:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise

Page 20:

[Figure: the same decision quantities, here scaled by 0.01 and 99.99]

α1 = chop, α2 = DO NOT chop

ω1 = NO hand in machine, ω2 = hand in machine

Page 21:

Exercise

Select the optimal decision where:

Ω = {ω1, ω2}

P(x | ω1) ~ N(2, 0.5)   (normal distribution)

P(x | ω2) ~ N(1.5, 0.2)

P(ω1) = 2/3

P(ω2) = 1/3
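One way to work the exercise numerically is sketched below in Python. It reads N(m, s) as mean m and standard deviation s; if the second parameter is meant as a variance, replace s with its square root.

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

def decide(x):
    g1 = normal_pdf(x, 2.0, 0.5) * (2.0 / 3.0)   # P(x | w1) P(w1)
    g2 = normal_pdf(x, 1.5, 0.2) * (1.0 / 3.0)   # P(x | w2) P(w2)
    return "w1" if g1 > g2 else "w2"

for x in (1.0, 1.5, 2.0):
    print(x, decide(x))   # prints the optimal decision at a few sample points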

Page 22:

Chapter 2 (Part 2): Bayesian Decision Theory

(Sections 2.3-2.5)

• Minimum-Error-Rate Classification

• Classifiers, Discriminant Functions and Decision Surfaces

• The Normal Density

Page 23:

Minimum-Error-Rate Classification

• Actions are decisions on classes. If action αi is taken and the true state of nature is ωj, then

the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error, which is the error rate

Page 24:

• Introduction of the zero-one loss function:

λ(αi, ωj) = 0 if i = j, and 1 if i ≠ j,   for i, j = 1, …, c

Therefore, the conditional risk for each action is:

R(αi | x) = Σj≠i P(ωj | x) = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average (or expected) probability of error”

Page 25:

• Minimizing the risk requires maximizing P(ωi | x)

(since R(αi | x) = 1 − P(ωi | x))

• For minimum error rate:

• Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
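In code, the minimum-error-rate rule is just an argmax over the posteriors; here is a tiny sketch with illustrative posterior values.

def min_error_decision(posteriors):
    # decide the class w_i with the largest P(w_i | x); P(error | x) = 1 - max posterior
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i, 1.0 - posteriors[i]

print(min_error_decision([0.2, 0.5, 0.3]))   # -> (1, 0.5): pick class index 1, error probability 0.5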

Page 26:

Two-Category Classification

α1 : deciding ω1

α2 : deciding ω2

λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Page 27:

Our rule is the following:

if R(α1 | x) < R(α2 | x)

action α1: “decide ω1” is taken

This results in the equivalent rule:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise

Page 28:

Likelihood ratio:

The preceding rule is equivalent to the following rule:

if   P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

then take action α1 (decide ω1)

Otherwise take action α2 (decide ω2)
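The same rule written as a likelihood-ratio test in Python; the losses, priors, and likelihood values passed in are illustrative assumptions.

def decide_by_likelihood_ratio(px_w1, px_w2, P1, P2, lam11, lam12, lam21, lam22):
    threshold = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)
    return "alpha_1: decide w1" if px_w1 / px_w2 > threshold else "alpha_2: decide w2"

print(decide_by_likelihood_ratio(0.6, 0.2, 0.5, 0.5,
                                 lam11=0.0, lam12=1.0, lam21=1.0, lam22=0.0))
# zero-one loss and equal priors give threshold 1; the ratio 3 exceeds it, so decide w1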

Page 29:

• Regions of decision and zero-one loss function, therefore:

Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1)

then decide ω1 if:   P(x | ω1) / P(x | ω2) > θλ

• If λ is the zero-one loss function, which means

λ = | 0  1 |
    | 1  0 |

then θa = P(ω2) / P(ω1)

If instead

λ = | 0  2 |
    | 1  0 |

then θb = 2 P(ω2) / P(ω1)

Page 30:

Page 31:

Classifiers, Discriminant Functions and Decision Surfaces

• The multi-category case

• Set of discriminant functions gi(x), i = 1,…, c

• The classifier assigns a feature vector x to class ωi

if:

gi(x) > gj(x) for all j ≠ i

Page 32:

Page 33:

• Let gi(x) = − R(αi | x)

(max. discriminant corresponds to min. risk!)

• For the minimum error rate, we take gi(x) = P(ωi | x)

(max. discriminant corresponds to max. posterior!)

gi(x) ≡ P(x | ωi) P(ωi)

gi(x) = ln P(x | ωi) + ln P(ωi)

(ln: natural logarithm!)

Page 34:

• Feature space divided into c decision regions

if gi(x) > gj(x) for all j ≠ i, then x is in Ri

(Ri means: assign x to ωi)

• The two-category case

• A classifier is a “dichotomizer” that has two discriminant functions g1 and g2

Let g(x) ≡ g1(x) − g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2

Page 35:

• The computation of g(x):

g(x) = P(ω1 | x) − P(ω2 | x)

or, equivalently (both choices have the same sign, hence give the same decisions):

g(x) = ln [ P(x | ω1) / P(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
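A dichotomizer based on the second form of g(x) can be sketched in a few lines of Python; the likelihoods and priors are placeholders.

from math import log

def g(px_w1, px_w2, P1, P2):
    # g(x) = ln P(x | w1)/P(x | w2) + ln P(w1)/P(w2); decide w1 when g(x) > 0
    return log(px_w1 / px_w2) + log(P1 / P2)

print("w1" if g(0.3, 0.1, 0.4, 0.6) > 0 else "w2")   # ln 3 + ln(2/3) = ln 2 > 0 -> w1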

Page 36:

On to higher dimensions!

Page 37:

The Normal Density

• Univariate density

• Density which is analytically tractable

• Continuous density

• A lot of processes are asymptotically Gaussian

• Handwritten characters and speech sounds can be viewed as an ideal or prototype corrupted by a random process (central limit theorem)

P(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − μ) / σ)² ]

where: μ = mean (or expected value) of x, σ² = expected squared deviation, or variance
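A direct Python implementation of the univariate normal density above, for reference:

from math import exp, pi, sqrt

def univariate_normal(x, mu, sigma):
    # p(x) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma) ** 2)
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

print(univariate_normal(0.0, 0.0, 1.0))   # ~0.3989, the standard normal at its mean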

Page 38:

Page 39:

• Multivariate density

• Multivariate normal density in d dimensions is:

P(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − μ)ᵗ Σ⁻¹ (x − μ) ]

where:

x = (x1, x2, …, xd)ᵗ  (t stands for the transpose vector form)

μ = (μ1, μ2, …, μd)ᵗ = mean vector

Σ = d×d covariance matrix

|Σ| and Σ⁻¹ are its determinant and inverse, respectively
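The same d-dimensional formula, written out with NumPy; the example mean vector and covariance matrix are arbitrary choices.

import numpy as np

def multivariate_normal(x, mu, Sigma):
    # p(x) = exp(-0.5 (x - mu)^t Sigma^-1 (x - mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
print(multivariate_normal(x, mu, Sigma))   # ~0.1592 = 1 / (2 pi)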

Page 40:

Chapter 2 (Part 3): Bayesian Decision Theory

(Sections 2.6, 2.9)

• Discriminant Functions for the Normal Density

• Bayes Decision Theory – Discrete Features

Page 41:

Discriminant Functions for the Normal Density

• We saw that the minimum error-rate classification can be achieved by the discriminant function

gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of multivariate normal

gi(x) = −(1/2) (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

Page 42:

Case Σi = σ²I   (I stands for the identity matrix)

gi(x) = wiᵗ x + wi0   (linear discriminant function)

where:

wi = μi / σ²

wi0 = −(1 / (2σ²)) μiᵗ μi + ln P(ωi)

(wi0 is called the threshold for the ith category!)
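A short NumPy sketch of this linear discriminant; the means, variance, and priors used in the example call are assumptions.

import numpy as np

def linear_discriminant(x, mu_i, sigma2, prior_i):
    # g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2 and
    # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    w_i = mu_i / sigma2
    w_i0 = -(mu_i @ mu_i) / (2 * sigma2) + np.log(prior_i)
    return float(w_i @ x + w_i0)

x = np.array([1.0, 1.0])
print(linear_discriminant(x, np.array([2.0, 0.0]), 1.0, 0.5),
      linear_discriminant(x, np.array([0.0, 2.0]), 1.0, 0.5))   # equal by symmetry: x lies on the boundary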

Page 43:

• A classifier that uses linear discriminant functions is called “a linear machine”

• The decision surfaces for a linear machine are pieces of hyperplanes defined by:

gi(x) = gj(x)

Page 44:

Page 45:

The hyperplane separating Ri and Rj is given by

wᵗ (x − x0) = 0

where:

w = μi − μj

x0 = (1/2)(μi + μj) − [σ² / ‖μi − μj‖²] ln [ P(ωi) / P(ωj) ] (μi − μj)

The hyperplane is always orthogonal to the line linking the means!

If P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)

Page 46:

Page 47:

Page 48:

Case Σi = Σ   (the covariance matrices of all classes are identical but arbitrary!)

Hyperplane separating Ri and Rj:

wᵗ (x − x0) = 0

where:

w = Σ⁻¹ (μi − μj)

x0 = (1/2)(μi + μj) − [ ln ( P(ωi) / P(ωj) ) / ( (μi − μj)ᵗ Σ⁻¹ (μi − μj) ) ] (μi − μj)

Here the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!

Page 49:

Page 50:

Page 51:

Case Σi = arbitrary

The covariance matrices are different for each category.

gi(x) = xᵗ Wi x + wiᵗ x + wi0

where:

Wi = −(1/2) Σi⁻¹

wi = Σi⁻¹ μi

wi0 = −(1/2) μiᵗ Σi⁻¹ μi − (1/2) ln |Σi| + ln P(ωi)

Here the separating surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids
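A sketch of the resulting quadratic discriminant in NumPy; the example mean, covariance, and prior are arbitrary.

import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    # g_i(x) = x^t W_i x + w_i^t x + w_i0 with W_i, w_i, w_i0 as defined above
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * (mu @ Sigma_inv @ mu) - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return float(x @ W @ x + w @ x + w0)

x = np.array([1.0, 0.0])
print(quadratic_discriminant(x, np.array([0.0, 0.0]), np.diag([2.0, 0.5]), 0.5))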

Page 52:

Page 53:

Page 54:

Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued; x can take only one of m discrete values

v1, v2, …, vm

• Case of independent binary features in the two-category problem

Let x = [x1, x2, …, xd]ᵗ where each xi is either 0 or 1, with probabilities:

pi = P(xi = 1 | ω1)

qi = P(xi = 1 | ω2)

Page 55:

• The discriminant function in this case is:

g(x) = Σi=1..d wi xi + w0

where:

wi = ln [ pi (1 − qi) / ( qi (1 − pi) ) ],   i = 1, …, d

and:

w0 = Σi=1..d ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
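A compact Python sketch of this discriminant for independent binary features; the p_i, q_i, priors, and feature vector below are illustrative.

from math import log

def discrete_g(x, p, q, P1, P2):
    # w_i = ln[p_i (1 - q_i) / (q_i (1 - p_i))];  w_0 = sum_i ln[(1 - p_i)/(1 - q_i)] + ln[P(w1)/P(w2)]
    w = [log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + log(P1 / P2)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

p = [0.8, 0.7, 0.6]   # p_i = P(x_i = 1 | w1)
q = [0.3, 0.4, 0.5]   # q_i = P(x_i = 1 | w2)
val = discrete_g([1, 0, 1], p, q, 0.5, 0.5)
print(val, "-> decide w1" if val > 0 else "-> decide w2")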


Top Related