Bayesian Decision Theory (Sections 2.1-2.2)


TRANSCRIPT

Page 1: Bayesian Decision Theory (Sections 2.1-2.2)

Bayesian Decision Theory

(Sections 2.1-2.2)

• Decision problem posed in probabilistic terms

• Bayesian Decision Theory–Continuous Features

• All the relevant probability values are known

Page 2: Bayesian Decision Theory (Sections 2.1-2.2)
Page 3: Bayesian Decision Theory (Sections 2.1-2.2)
Page 4: Bayesian Decision Theory (Sections 2.1-2.2)
Page 5: Bayesian Decision Theory (Sections 2.1-2.2)

Probability Density

Jain CSE 802, Spring 2013

Page 6: Bayesian Decision Theory (Sections 2.1-2.2)

Course Outline

MODEL INFORMATION

• COMPLETE → Bayes Decision Theory → "Optimal" Rules

• INCOMPLETE →

Supervised Learning

• Parametric Approach → Plug-in Rules

• Nonparametric Approach → Density Estimation, Geometric Rules (K-NN, MLP)

Unsupervised Learning

• Parametric Approach → Mixture Resolving

• Nonparametric Approach → Cluster Analysis (Hard, Fuzzy)

Page 7: Bayesian Decision Theory (Sections 2.1-2.2)

Introduction

• From sea bass vs. salmon example to “abstract” decision making problem

• State of nature; a priori (prior) probability

• State of nature (which type of fish will be observed next) is unpredictable, so it is a random variable

• The catch of salmon and sea bass is equiprobable

• P(ω1) = P(ω2) (uniform priors)

• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

• Prior prob. reflects our prior knowledge about how likely we are to observe a sea bass or salmon; these probabilities may depend on time of the year or the fishing area!

Page 8: Bayesian Decision Theory (Sections 2.1-2.2)

• Bayes decision rule with only the prior information

• Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2

• Error rate = min {P(ω1), P(ω2)}

• Suppose now we have a measurement or feature on the state of nature - say the fish lightness value

• Use of the class-conditional probability density

• p(x | ω1) and p(x | ω2) describe the difference in the lightness feature between the populations of sea bass and salmon

Page 9: Bayesian Decision Theory (Sections 2.1-2.2)

Amount of overlap between the densities determines the “goodness” of feature

Page 10: Bayesian Decision Theory (Sections 2.1-2.2)

• Maximum likelihood decision rule

• Assign input pattern x to class ω1 if

p(x | ω1) > p(x | ω2), otherwise ω2

• How does the feature x influence our attitude (prior) concerning the true state of nature?

• Bayes decision rule

Page 11: Bayesian Decision Theory (Sections 2.1-2.2)

• Posterior probability, likelihood, evidence

• p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)

• Bayes formula

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

where p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)

• Posterior = (Likelihood × Prior) / Evidence

• Evidence p(x) can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1

• p(x | ωj) is called the likelihood of ωj with respect to x; the category ωj for which p(x | ωj) is large is more likely to be the true category
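To make the formula concrete, here is a minimal Python sketch (not part of the original slides) that computes the two posteriors from priors and likelihoods; the numbers are made up for illustration.

    # Minimal sketch of Bayes formula for two classes (illustrative numbers only).
    priors = [0.6, 0.4]        # P(w1), P(w2)
    likelihoods = [0.3, 0.7]   # p(x | w1), p(x | w2) at the observed x

    evidence = sum(p * l for p, l in zip(priors, likelihoods))            # p(x)
    posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # P(wj | x)

    print(posteriors)          # [0.391..., 0.608...]; the posteriors sum to 1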

Page 12: Bayesian Decision Theory (Sections 2.1-2.2)
Page 13: Bayesian Decision Theory (Sections 2.1-2.2)

• P(ω1 | x) is the probability of the state of nature being ω1 given that feature value x has been observed

• Decision based on the posterior probabilities is called the optimal Bayes decision rule

For a given observation (feature value) x:

if P(ω1 | x) > P(ω2 | x) decide ω1

if P(ω1 | x) < P(ω2 | x) decide ω2

To justify the above rule, calculate the probability of error:

P(error | x) = P(ω1 | x) if we decide ω2

P(error | x) = P(ω2 | x) if we decide ω1

Page 14: Bayesian Decision Theory (Sections 2.1-2.2)

• So, for a given x, we can minimize the probability of error: decide ω1 if

P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Therefore:

P(error | x) = min [P(ω1 | x), P(ω2 | x)]

• Thus, for each observation x, the Bayes decision rule minimizes the probability of error

• Unconditional error: P(error) is obtained by integrating P(error | x) over all x, weighted by p(x)
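As a numerical sketch of this integration (not from the slides), the snippet below assumes two 1-D Gaussian class-conditional densities with illustrative means, variances, and priors, and integrates min[P(ω1)p(x|ω1), P(ω2)p(x|ω2)] over x.

    import numpy as np
    from scipy.stats import norm

    # Illustrative 1-D two-class problem (all numbers are assumptions).
    P1, P2 = 0.5, 0.5                        # priors
    x = np.linspace(-8.0, 8.0, 4001)
    dx = x[1] - x[0]
    px1 = norm.pdf(x, loc=-1.0, scale=1.0)   # p(x | w1)
    px2 = norm.pdf(x, loc=+1.0, scale=1.0)   # p(x | w2)

    # P(error) = integral of min[P(w1) p(x|w1), P(w2) p(x|w2)] dx
    p_error = np.sum(np.minimum(P1 * px1, P2 * px2)) * dx
    print(p_error)   # about 0.1587 for unit-variance Gaussians whose means are 2 apart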

Page 15: Bayesian Decision Theory (Sections 2.1-2.2)

• Optimal Bayes decision rule

Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

• Special cases:

(i) P(ω1) = P(ω2): decide ω1 if

p(x | ω1) > p(x | ω2), otherwise ω2

(ii) p(x | ω1) = p(x | ω2): decide ω1 if

P(ω1) > P(ω2), otherwise ω2

Page 16: Bayesian Decision Theory (Sections 2.1-2.2)

Bayesian Decision Theory – Continuous Features

• Generalization of the preceding formulation

• Use of more than one feature (d features)

• Use of more than two states of nature (c classes)

• Allowing other actions besides deciding on the state of nature

• Introduce a loss function which is more general than the probability of error

Page 17: Bayesian Decision Theory (Sections 2.1-2.2)

• Allowing actions other than classification primarily allows the possibility of rejection

• Refusing to make a decision when it is difficult to decide between two classes or in noisy cases!

• The loss function specifies the cost of each action

Page 18: Bayesian Decision Theory (Sections 2.1-2.2)

• Let {ω1, ω2,…, ωc} be the set of c states of nature

(or "categories")

• Let {α1, α2,…, αa} be the set of a possible actions

• Let λ(αi | ωj) be the loss incurred for taking

action αi when the true state of nature is ωj

• General decision rule α(x) specifies which action to take for every possible

observation x

Page 19: Bayesian Decision Theory (Sections 2.1-2.2)

Conditional Risk

Overall risk

R = expected value of R(α(x) | x) w.r.t. p(x)

Minimizing R ⇔ minimize R(αi | x) for i = 1,…, a

Conditional risk

R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)

For a given x, suppose we take the action αi; if the true state is ωj, we incur the loss λ(αi | ωj). P(ωj | x) is the probability that the true state is ωj, but any one of the c states is possible for the given x.
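A minimal numerical sketch of this computation follows (not from the slides); the loss matrix, which includes a hypothetical reject action, and the posteriors are illustrative numbers.

    import numpy as np

    # Illustrative loss matrix lambda(alpha_i | w_j): rows = actions, cols = true states.
    # Actions: decide w1, decide w2, reject (all numbers are assumptions).
    loss = np.array([[0.0, 2.0],
                     [1.0, 0.0],
                     [0.3, 0.3]])
    posterior = np.array([0.35, 0.65])     # P(w1 | x), P(w2 | x)

    cond_risk = loss @ posterior           # R(alpha_i | x) for each action
    best_action = np.argmin(cond_risk)
    print(cond_risk, best_action)          # risks [1.3, 0.35, 0.3]; the reject action has minimum risk here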

Page 20: Bayesian Decision Theory (Sections 2.1-2.2)

Select the action αi for which R(αi | x) is minimum

The overall risk R is minimized and the resulting risk is called the Bayes risk; it is the best performance that can be achieved!

Page 21: Bayesian Decision Theory (Sections 2.1-2.2)

• Two-category classification

α1 : deciding ω1

α2 : deciding ω2

λij = λ(αi | ωj)

loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Page 22: Bayesian Decision Theory (Sections 2.1-2.2)

Bayes decision rule is stated as:

if R(α1 | x) < R(α2 | x)

take action α1: "decide ω1"

This results in the equivalent rule:

decide ω1 if:

(λ21 - λ11) p(x | ω1) P(ω1) > (λ12 - λ22) p(x | ω2) P(ω2)

and decide ω2 otherwise

Page 23: Bayesian Decision Theory (Sections 2.1-2.2)

Likelihood ratio:

The preceding rule is equivalent to the following rule: if

p(x | ω1) / p(x | ω2) > [(λ12 - λ22) / (λ21 - λ11)] · P(ω2) / P(ω1)

then take action α1 (decide ω1); otherwise take action α2 (decide ω2)

Note that the posterior probabilities are scaled by the loss differences.

Page 24: Bayesian Decision Theory (Sections 2.1-2.2)

Interpretation of the Bayes decision rule:

"If the likelihood ratio of class ω1 and class ω2

exceeds a threshold value (that is independent of the input pattern x), the optimal action is to decide ω1"

Maximum likelihood decision rule: the special case in which the threshold value is 1, obtained with a 0-1 loss function and equal class prior probabilities
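A small Python sketch of this likelihood-ratio test follows (not from the slides); all losses, priors, and density values are illustrative assumptions.

    # Two-category likelihood-ratio test (illustrative numbers).
    lam11, lam12 = 0.0, 1.0     # loss(decide w1 | true w1), loss(decide w1 | true w2)
    lam21, lam22 = 1.0, 0.0     # loss(decide w2 | true w1), loss(decide w2 | true w2)
    P1, P2 = 0.3, 0.7           # priors
    px1, px2 = 0.5, 0.2         # p(x | w1), p(x | w2) at the observed x

    threshold = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)
    ratio = px1 / px2
    decision = 'w1' if ratio > threshold else 'w2'
    print(ratio, threshold, decision)   # 2.5 vs 2.33..., so decide w1

With the 0-1 loss used here the threshold reduces to P(ω2)/P(ω1), matching the special case named above.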

Page 25: Bayesian Decision Theory (Sections 2.1-2.2)

Bayesian Decision Theory (Sections 2.3-2.5)

• Minimum Error Rate Classification

• Classifiers, Discriminant Functions and Decision Surfaces

• The Normal Density

Page 26: Bayesian Decision Theory (Sections 2.1-2.2)

Minimum Error Rate Classification

• Actions are decisions on classes. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error, i.e., the error rate

Page 27: Bayesian Decision Theory (Sections 2.1-2.2)

• Zero-one (0-1) loss function: no loss for a correct decision and a unit loss for any error

λ(αi | ωj) = 0 if i = j, 1 if i ≠ j,  for i, j = 1,…, c

• The conditional risk can now be simplified as:

R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 - P(ωi | x)

"The risk corresponding to the 0-1 loss function is the average probability of error"

Page 28: Bayesian Decision Theory (Sections 2.1-2.2)

• Minimizing the risk requires maximizing the posterior probability P(ωi | x), since

R(αi | x) = 1 - P(ωi | x)

• For minimum error rate:

• Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

Page 29: Bayesian Decision Theory (Sections 2.1-2.2)

• Decision boundaries and decision regions

• If λ is the 0-1 loss function, then the threshold involves only the priors:

Let θλ = [(λ12 - λ22) / (λ21 - λ11)] · P(ω2) / P(ω1); then decide ω1 if p(x | ω1) / p(x | ω2) > θλ

If λ = [0 1; 1 0] (0-1 loss), then θa = P(ω2) / P(ω1)

If λ = [0 2; 1 0], then θb = 2 P(ω2) / P(ω1)

Page 30: Bayesian Decision Theory (Sections 2.1-2.2)
Page 31: Bayesian Decision Theory (Sections 2.1-2.2)

Classifiers, Discriminant Functions and Decision Surfaces

• Many different ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions

• The multi-category case

• Set of discriminant functions gi(x), i = 1,…, c

• Classifier assigns a feature vector x to class ωi if:

gi(x) > gj(x) for all j ≠ i
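A minimal sketch of this "argmax over discriminants" view follows (not from the slides); the two toy discriminants are illustrative stand-ins, not actual posteriors.

    import numpy as np

    def classify(x, discriminants):
        """Assign x to the class whose discriminant g_i(x) is largest."""
        scores = [g(x) for g in discriminants]
        return int(np.argmax(scores))

    # Illustrative discriminants: any monotone function of the posterior would do.
    g = [lambda x: -(x - 0.0) ** 2,      # class 0 "prefers" x near 0
         lambda x: -(x - 3.0) ** 2]      # class 1 "prefers" x near 3
    print(classify(1.2, g), classify(2.0, g))    # 0, then 1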

Page 32: Bayesian Decision Theory (Sections 2.1-2.2)

Network Representation of a Classifier

Page 33: Bayesian Decision Theory (Sections 2.1-2.2)

• Bayes classifier can be represented in this way, but the choice of discriminant function is not unique

• gi(x) = - R(αi | x)

(maximum discriminant corresponds to minimum risk!)

• For the minimum error rate, we take gi(x) = P(ωi | x)

(maximum discriminant corresponds to maximum posterior!)

Other equivalent choices:

gi(x) = p(x | ωi) P(ωi)

gi(x) = ln p(x | ωi) + ln P(ωi)

(ln: natural logarithm!)

Page 34: Bayesian Decision Theory (Sections 2.1-2.2)

• Effect of any decision rule is to divide the feature space into c decision regions

if gi(x) > gj(x) for all j ≠ i, then x is in Ri

(Region Ri means: assign x to ωi)

• The two-category case

• Here a classifier is a "dichotomizer" that has two discriminant functions g1 and g2

Let g(x) ≡ g1(x) - g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2

Page 35: Bayesian Decision Theory (Sections 2.1-2.2)

• So, a “dichotomizer” computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not.

• Computation of g(x) = g1(x) – g2(x)

g(x) = P(ω1 | x) - P(ω2 | x)

or, equivalently (same sign),

g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]

Page 36: Bayesian Decision Theory (Sections 2.1-2.2)
Page 37: Bayesian Decision Theory (Sections 2.1-2.2)

The Normal Density

• Univariate density: N(μ, σ²)

• Normal density is analytically tractable

• Continuous density

• A number of processes are asymptotically Gaussian

• Patterns (e.g., handwritten characters, speech signals ) can be viewed as randomly corrupted versions of a single typical or prototype (Central Limit theorem)

p(x) = [1 / (√(2π) σ)] exp[ -(1/2) ((x - μ) / σ)² ]

where:

μ = mean (or expected value) of x

σ² = variance (or expected squared deviation) of x
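A quick numerical check of this density (illustrative μ and σ), evaluated both directly and with scipy.stats.norm, neither of which appears in the original slides:

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 1.0, 2.0            # illustrative mean and standard deviation
    x = 0.5

    # p(x) = (1 / (sqrt(2*pi)*sigma)) * exp(-0.5 * ((x - mu)/sigma)**2)
    p_manual = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    p_scipy = norm.pdf(x, loc=mu, scale=sigma)
    print(p_manual, p_scipy)        # both ~0.1933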

Page 38: Bayesian Decision Theory (Sections 2.1-2.2)
Page 39: Bayesian Decision Theory (Sections 2.1-2.2)

• Multivariate density: N(μ, Σ)

• Multivariate normal density in d dimensions (see below), where:

x = (x1, x2, …, xd)^t (t stands for the transpose of a vector)

μ = (μ1, μ2, …, μd)^t is the mean vector

Σ is the d×d covariance matrix

|Σ| and Σ^-1 are the determinant and inverse of Σ, respectively

• The covariance matrix Σ is always symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive

• Multivariate normal density is completely specified by [d + d(d+1)/2] parameters

• If variables x1 and x2 are statistically independent then the covariance of x1 and x2 is zero.

p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ -(1/2) (x - μ)^t Σ^-1 (x - μ) ]
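The same kind of numerical check for the multivariate case (illustrative μ, Σ, and x), comparing the formula above with scipy.stats.multivariate_normal; this is an added sketch, not slide material.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])                       # illustrative mean vector
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])                  # illustrative covariance matrix
    x = np.array([1.0, 0.0])
    d = len(mu)

    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff       # squared Mahalanobis distance
    p_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    print(p_manual, p_scipy)                        # identical values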

Page 40: Bayesian Decision Theory (Sections 2.1-2.2)

Multivariate Normal density

r² = (x - μ)^t Σ^-1 (x - μ)  (squared Mahalanobis distance from x to μ)

Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and the shape by the covariance matrix

The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of Σ

Page 41: Bayesian Decision Theory (Sections 2.1-2.2)

Transformation of Normal Variables

Linear combinations of jointly normally distributed random variables are normally distributed

Coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one
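One standard choice of such a transformation is the whitening transform Aw = Φ Λ^(-1/2), built from the eigenvectors Φ and eigenvalues Λ of Σ. The sketch below (mean and covariance are illustrative, not from the slides) checks that whitened samples have approximately identity covariance.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[4.0, 1.5],
                      [1.5, 1.0]])

    X = rng.multivariate_normal(mu, Sigma, size=20000)   # samples from N(mu, Sigma)

    # Whitening transform A_w = Phi * Lambda^{-1/2}, with Sigma = Phi Lambda Phi^t
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    A_w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals))

    Y = (X - mu) @ A_w                                   # transformed (whitened) samples
    print(np.cov(Y, rowvar=False))                       # approximately the identity matrix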

Page 42: Bayesian Decision Theory (Sections 2.1-2.2)

Bayesian Decision Theory (Sections 2.6-2.9)

• Discriminant Functions for the Normal Density

• Bayes Decision Theory – Discrete Features

Page 43: Bayesian Decision Theory (Sections 2.1-2.2)

Discriminant Functions for the Normal Density

• The minimum error-rate classification can be achieved by the discriminant function

gi(x) = ln p(x | ωi) + ln P(ωi)

• In case of multivariate normal densities:

gi(x) = -(1/2) (x - μi)^t Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)

Page 44: Bayesian Decision Theory (Sections 2.1-2.2)

• Case 1: Σi = σ²I (I is the identity matrix)

Features are statistically independent and each feature has

the same variance

gi(x) = wi^t x + wi0  (linear discriminant function)

where:

wi = μi / σ²

wi0 = -(1 / (2σ²)) μi^t μi + ln P(ωi)

(wi0 is called the threshold or bias for the i-th category!)
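A small sketch of this case-1 linear discriminant (σ², the class means, and the priors are illustrative assumptions, not slide values):

    import numpy as np

    sigma2 = 1.0                                        # common variance (assumption)
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    priors = [0.5, 0.5]

    def g(i, x):
        """Linear discriminant for the case Sigma_i = sigma^2 * I."""
        w = means[i] / sigma2
        w0 = -means[i] @ means[i] / (2 * sigma2) + np.log(priors[i])
        return w @ x + w0

    x = np.array([1.0, 1.2])
    scores = [g(i, x) for i in range(2)]
    print(scores, int(np.argmax(scores)))               # class 0 wins for this x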

Page 45: Bayesian Decision Theory (Sections 2.1-2.2)

• A classifier that uses linear discriminant functions is called “a linear machine”

• The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations:

gi(x) = gj(x)

Page 46: Bayesian Decision Theory (Sections 2.1-2.2)

• The hyperplane separating Ri and Rj

is orthogonal to the line linking the means!

The hyperplane passes through the point x0:

x0 = (1/2)(μi + μj) - [σ² / ||μi - μj||²] ln [P(ωi) / P(ωj)] (μi - μj)

If P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)

Page 47: Bayesian Decision Theory (Sections 2.1-2.2)
Page 48: Bayesian Decision Theory (Sections 2.1-2.2)
Page 49: Bayesian Decision Theory (Sections 2.1-2.2)
Page 50: Bayesian Decision Theory (Sections 2.1-2.2)

• Case 2: Σi = Σ (covariance matrices of all classes are identical but otherwise arbitrary!)

• Hyperplane separating Ri and Rj

• The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!

• To classify a feature vector x, measure the squared Mahalanobis distance from x to each of the c means; assign x to the category of the nearest mean

x0 = (1/2)(μi + μj) - [ ln (P(ωi) / P(ωj)) / ((μi - μj)^t Σ^-1 (μi - μj)) ] (μi - μj)
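A sketch of the equal-prior, shared-covariance rule described above: assign x to the class whose mean is nearest in squared Mahalanobis distance (all numbers are illustrative).

    import numpy as np

    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])                       # shared covariance (assumption)
    Sigma_inv = np.linalg.inv(Sigma)
    means = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([-3.0, 1.0])]

    def mahalanobis_sq(x, mu):
        d = x - mu
        return d @ Sigma_inv @ d

    x = np.array([1.5, 1.0])
    d2 = [mahalanobis_sq(x, m) for m in means]
    print(d2, int(np.argmin(d2)))                        # assign x to the nearest mean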

Page 51: Bayesian Decision Theory (Sections 2.1-2.2)
Page 52: Bayesian Decision Theory (Sections 2.1-2.2)
Page 53: Bayesian Decision Theory (Sections 2.1-2.2)

Discriminant Functions for 1D Gaussian

Page 54: Bayesian Decision Theory (Sections 2.1-2.2)

• Case 3: Σi = arbitrary

• The covariance matrices are different for each category

In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids

gi(x) = x^t Wi x + wi^t x + wi0

where:

Wi = -(1/2) Σi^-1

wi = Σi^-1 μi

wi0 = -(1/2) μi^t Σi^-1 μi - (1/2) ln |Σi| + ln P(ωi)
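A sketch of this case-3 quadratic discriminant (class means, covariances, and priors are illustrative):

    import numpy as np

    params = [  # (mean, covariance, prior) for each class; all values illustrative
        (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
        (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
    ]

    def g(x, mu, Sigma, prior):
        """Quadratic discriminant for arbitrary class covariances."""
        Sinv = np.linalg.inv(Sigma)
        W = -0.5 * Sinv
        w = Sinv @ mu
        w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
        return x @ W @ x + w @ x + w0

    x = np.array([1.0, 0.5])
    scores = [g(x, *p) for p in params]
    print(scores, int(np.argmax(scores)))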

Page 55: Bayesian Decision Theory (Sections 2.1-2.2)

Discriminant Functions for the Normal Density

Page 56: Bayesian Decision Theory (Sections 2.1-2.2)
Page 57: Bayesian Decision Theory (Sections 2.1-2.2)

Discriminant Functions for the Normal Density

Page 58: Bayesian Decision Theory (Sections 2.1-2.2)

Discriminant Functions for the Normal Density

Page 59: Bayesian Decision Theory (Sections 2.1-2.2)

Decision Regions for Two-Dimensional Gaussian Data

Decision boundary: x2 = 3.514 - 1.125 x1 + 0.1875 x1²

Page 60: Bayesian Decision Theory (Sections 2.1-2.2)

Error Probabilities and Integrals

• 2-class problem

• There are two types of errors

• Multi-class problem: it is simpler to compute the probability of being correct (there are more ways to be wrong than to be right)

Page 61: Bayesian Decision Theory (Sections 2.1-2.2)

Error Probabilities and Integrals

Bayes optimal decision boundary in 1-D case

Page 62: Bayesian Decision Theory (Sections 2.1-2.2)

Error Bounds for Normal Densities

• The exact calculation of the error for the general Gaussian case (case 3) is extremely difficult

• However, in the 2-category case the general error can be approximated analytically to give us an upper bound on the error

Page 63: Bayesian Decision Theory (Sections 2.1-2.2)

Error Rate of Linear Discriminant Function (LDF)

• Assume a 2-class problem

• Due to the symmetry of the problem (identical Σ), the two types of errors are identical

• Assume p(x | ω1) ~ N(μ1, Σ) and p(x | ω2) ~ N(μ2, Σ), with

gi(x) = ln P(ωi) - (1/2) (x - μi)^t Σ^-1 (x - μi)

• Decide ω1 if g1(x) > g2(x), i.e., if

-(1/2) (x - μ1)^t Σ^-1 (x - μ1) + ln P(ω1) > -(1/2) (x - μ2)^t Σ^-1 (x - μ2) + ln P(ω2)

or, equivalently, if

(μ1 - μ2)^t Σ^-1 x - (1/2) (μ1^t Σ^-1 μ1 - μ2^t Σ^-1 μ2) > ln [P(ω2) / P(ω1)]

Page 64: Bayesian Decision Theory (Sections 2.1-2.2)

• Let

h(x) = (μ1 - μ2)^t Σ^-1 x - (1/2) (μ1^t Σ^-1 μ1 - μ2^t Σ^-1 μ2)

• Compute the expected values and variances of h(x) when x ∈ ω1 and when x ∈ ω2

E[h(x) | x ∈ ω1] = (μ1 - μ2)^t Σ^-1 μ1 - (1/2) (μ1^t Σ^-1 μ1 - μ2^t Σ^-1 μ2) = (1/2) Δ

where Δ = (μ1 - μ2)^t Σ^-1 (μ1 - μ2) = squared Mahalanobis distance

between μ1 & μ2

Error Rate of LDF

Page 65: Bayesian Decision Theory (Sections 2.1-2.2)

• Similarly,

E[h(x) | x ∈ ω2] = (μ1 - μ2)^t Σ^-1 μ2 - (1/2) (μ1^t Σ^-1 μ1 - μ2^t Σ^-1 μ2) = -(1/2) Δ

Var[h(x) | x ∈ ωi] = (μ1 - μ2)^t Σ^-1 Σ Σ^-1 (μ1 - μ2) = Δ,  i = 1, 2

• Since h(x) is a linear function of the Gaussian vector x, it is itself Gaussian:

p( h(x) | x ∈ ω1 ) ~ N( Δ/2, Δ )

p( h(x) | x ∈ ω2 ) ~ N( -Δ/2, Δ )

Error Rate of LDF

Page 66: Bayesian Decision Theory (Sections 2.1-2.2)

• Probability of an error of the first kind (x from ω1 classified as ω2), with threshold t = ln [P(ω2) / P(ω1)]:

P(ε1) = P( g1(x) < g2(x) | x ∈ ω1 ) = P( h(x) < t | x ∈ ω1 )

      = ∫_{-∞}^{t} [1 / √(2πΔ)] exp[ -(h - Δ/2)² / (2Δ) ] dh

      = 1/2 - (1/2) erf( (Δ/2 - t) / √(2Δ) )

Error Rate of LDF

Page 67: Bayesian Decision Theory (Sections 2.1-2.2)

• Similarly, the probability of an error of the second kind is

P(ε2) = P( h(x) > t | x ∈ ω2 ) = 1/2 - (1/2) erf( (Δ/2 + t) / √(2Δ) )

where t = ln [P(ω2) / P(ω1)] and erf(r) = (2/√π) ∫_0^r e^(-x²) dx

• Total probability of error:

P(error) = P(ω1) P(ε1) + P(ω2) P(ε2)

Error Rate of LDF

Page 68: Bayesian Decision Theory (Sections 2.1-2.2)

Error Rate of LDF

• With equal priors, P(ω1) = P(ω2) = 1/2, the threshold is t = 0 and

P(error) = 1/2 - (1/2) erf( √Δ / (2√2) )

• The Mahalanobis distance Δ is a good measure of the separation between the classes:

(i) No class separation: μ1 = μ2 ⇒ Δ = 0 ⇒ erf(0) = 0 ⇒ P(error) = 1/2

(ii) Perfect class separation: Δ → ∞ ⇒ erf( √Δ / (2√2) ) → 1 ⇒ P(error) → 0
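As a numerical cross-check (not from the slides), the sketch below compares the closed-form error 1/2 [1 - erf(√Δ/(2√2))] for equal priors with a Monte-Carlo estimate under an equal-covariance Gaussian model; all parameters are illustrative.

    import numpy as np
    from scipy.special import erf

    # Equal-covariance, equal-prior two-class Gaussian problem (illustrative parameters).
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.2],
                      [0.2, 1.0]])

    delta = (mu1 - mu2) @ np.linalg.inv(Sigma) @ (mu1 - mu2)   # squared Mahalanobis distance
    p_error_theory = 0.5 * (1.0 - erf(np.sqrt(delta) / (2.0 * np.sqrt(2.0))))

    # Monte-Carlo check with the equal-prior minimum-Mahalanobis-distance rule.
    rng = np.random.default_rng(1)
    n = 200000
    X = np.vstack([rng.multivariate_normal(mu1, Sigma, n),
                   rng.multivariate_normal(mu2, Sigma, n)])
    labels = np.repeat([0, 1], n)
    Sinv = np.linalg.inv(Sigma)
    d1 = np.einsum('ij,jk,ik->i', X - mu1, Sinv, X - mu1)
    d2 = np.einsum('ij,jk,ik->i', X - mu2, Sinv, X - mu2)
    pred = (d2 < d1).astype(int)
    print(p_error_theory, np.mean(pred != labels))             # the two numbers should agree closely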

Page 69: Bayesian Decision Theory (Sections 2.1-2.2)

Chernoff Bound

• To derive a bound on the error, we need the following inequality:

min[a, b] ≤ a^β b^(1-β),  for a, b ≥ 0 and 0 ≤ β ≤ 1

• Assume the conditional probabilities are normal; then the integral in the resulting bound can be evaluated analytically and has the form e^(-k(β)), where k(β) is given in the formulas on the "Error Bounds for Gaussian Distributions" slide below

Page 70: Bayesian Decision Theory (Sections 2.1-2.2)

Chernoff Bound

The Chernoff bound on P(error) is found by determining the value of β (0 ≤ β ≤ 1) that minimizes exp(-k(β))

Page 71: Bayesian Decision Theory (Sections 2.1-2.2)

Error Bounds for Normal Densities

• Bhattacharyya Bound

• Assume β = 1/2:

• computationally simpler

• slightly less tight bound

• Now, Eq. (73) has the form

When the two covariance matrices are equal, k(1/2) is proportional to the squared Mahalanobis distance between the two means

Page 72: Bayesian Decision Theory (Sections 2.1-2.2)

Error Bounds for Gaussian Distributions

Chernoff Bound

Bhattacharyya Bound (β = 1/2)

2-category, 2-D data

True error using numerical integration = 0.0021

Best Chernoff error bound is 0.008190

Bhattacharyya error bound is 0.008191

P(error) ≤ P(ω1)^β P(ω2)^(1-β) ∫ p(x | ω1)^β p(x | ω2)^(1-β) dx,  0 ≤ β ≤ 1

∫ p(x | ω1)^β p(x | ω2)^(1-β) dx = e^(-k(β))

k(β) = [β(1-β)/2] (μ2 - μ1)^t [βΣ1 + (1-β)Σ2]^-1 (μ2 - μ1) + (1/2) ln [ |βΣ1 + (1-β)Σ2| / (|Σ1|^β |Σ2|^(1-β)) ]

Bhattacharyya bound (β = 1/2):

P(error) ≤ √(P(ω1) P(ω2)) ∫ √( p(x | ω1) p(x | ω2) ) dx = √(P(ω1) P(ω2)) e^(-k(1/2))

k(1/2) = (1/8) (μ2 - μ1)^t [ (Σ1 + Σ2)/2 ]^-1 (μ2 - μ1) + (1/2) ln [ |(Σ1 + Σ2)/2| / √(|Σ1| |Σ2|) ]
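A short sketch (illustrative means, covariances, and priors; not slide material) that evaluates k(1/2) and the resulting Bhattacharyya bound on P(error):

    import numpy as np

    # Bhattacharyya bound for two Gaussian class-conditional densities (illustrative values).
    P1, P2 = 0.5, 0.5
    mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
    S1 = np.array([[2.0, 0.5], [0.5, 2.0]])
    S2 = np.array([[5.0, 4.0], [4.0, 5.0]])

    S = 0.5 * (S1 + S2)
    diff = mu2 - mu1
    k_half = ((diff @ np.linalg.inv(S) @ diff) / 8.0
              + 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    bound = np.sqrt(P1 * P2) * np.exp(-k_half)
    print(k_half, bound)          # P(error) <= bound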

Page 73: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

“Classification, Estimation and Pattern recognition” by Young and Calvert

Page 74: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

Page 75: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

Page 76: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

Page 77: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

Page 78: Bayesian Decision Theory (Sections 2.1-2.2)

Neyman-Pearson Rule

Page 79: Bayesian Decision Theory (Sections 2.1-2.2)

We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal x in the detector has mean μ1 when the pulse is absent and mean μ2 when it is present

Signal Detection Theory

Discriminability: ease of determining whether the pulse is present or not

The detector uses a threshold x* to determine the presence of pulse

P(x > x* | x ∈ ω2): hit

P(x > x* | x ∈ ω1): false alarm

P(x < x* | x ∈ ω2): miss

P(x < x* | x ∈ ω1): correct rejection

For given threshold, define hit, false alarm, miss and correct rejection

p(x | ω1) ~ N(μ1, σ²)

p(x | ω2) ~ N(μ2, σ²)

Discriminability: d' = |μ2 - μ1| / σ

Page 80: Bayesian Decision Theory (Sections 2.1-2.2)

Receiver Operating Characteristic (ROC)

• Experimentally compute hit and false alarm rates for fixed x*

• Changing x* will change the hit and false alarm rates

• A plot of hit and false alarm rates is called the ROC curve

Performance shown at different operating points
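A minimal sketch of how an ROC curve can be traced for the Gaussian internal-signal model above (not from the slides); μ1, μ2, σ, and the threshold grid are illustrative choices.

    import numpy as np
    from scipy.stats import norm

    mu1, mu2, sigma = 0.0, 2.0, 1.0           # illustrative signal-absent / signal-present means
    d_prime = abs(mu2 - mu1) / sigma          # discriminability

    thresholds = np.linspace(-4.0, 6.0, 101)  # sweep the decision threshold x*
    hit_rate = 1.0 - norm.cdf(thresholds, loc=mu2, scale=sigma)          # P(x > x* | w2)
    false_alarm_rate = 1.0 - norm.cdf(thresholds, loc=mu1, scale=sigma)  # P(x > x* | w1)

    # Each (false_alarm_rate, hit_rate) pair is one operating point on the ROC curve.
    for fa, hit in list(zip(false_alarm_rate, hit_rate))[::20]:
        print(f"false alarm = {fa:.3f}   hit = {hit:.3f}")
    print("d' =", d_prime)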

Page 81: Bayesian Decision Theory (Sections 2.1-2.2)

Operating Characteristic

• In practice, distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted

• Vary a single control parameter for the decision rule and plot the resulting hit and false alarm rates

Page 82: Bayesian Decision Theory (Sections 2.1-2.2)

Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued; x can take only one of m discrete values

v1, v2, …,vm

• Case of independent binary features for 2-category problem

Let x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with probabilities:

pi = P(xi = 1 | ω1)

qi = P(xi = 1 | ω2)

Page 83: Bayesian Decision Theory (Sections 2.1-2.2)

• The discriminant function in this case is:

g(x) = Σ_{i=1}^{d} wi xi + w0

where:

wi = ln [ pi (1 - qi) / (qi (1 - pi)) ],  i = 1,…, d

and:

w0 = Σ_{i=1}^{d} ln [ (1 - pi) / (1 - qi) ] + ln [ P(ω1) / P(ω2) ]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
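A small sketch of this discriminant for independent binary features; the pi, qi, and priors below are illustrative and are not the values used in the example on the next slide.

    import numpy as np

    # Independent binary features, two categories (all probabilities are illustrative).
    p = np.array([0.9, 0.7, 0.6])     # p_i = P(x_i = 1 | w1)
    q = np.array([0.2, 0.4, 0.5])     # q_i = P(x_i = 1 | w2)
    P1, P2 = 0.5, 0.5

    w = np.log(p * (1 - q) / (q * (1 - p)))                       # feature weights
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)      # bias term

    x = np.array([1, 0, 1])            # an observed binary feature vector
    g = w @ x + w0
    print(w, w0, g, 'decide w1' if g > 0 else 'decide w2')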

Page 84: Bayesian Decision Theory (Sections 2.1-2.2)

Bayesian Decision for Three-dimensional Binary Data

Decision boundary for 3D binary features. Left figure shows the case when pi=.8 and qi=.5. Right figure shows case when p3=q3 (Feature 3 is not providing any discriminatory information) so decision surface is parallel to x3 axis

• Consider a 2-class problem with three independent binary features; class priors are equal and pi = 0.8 and qi = 0.5, i = 1, 2, 3

• wi = 1.3863

• w0 = 1.2

• The decision surface g(x) = 0 is shown in the figure described above

Page 85: Bayesian Decision Theory (Sections 2.1-2.2)

Handling Missing Features

• Suppose it is not possible to measure a certain feature for a given pattern

• Possible solutions:

• Reject the pattern

• Approximate the missing feature

• Mean of all the available values for the missing feature

• Marginalize over the distribution of the missing feature
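A sketch of the last option (assuming a 2-D Gaussian class-conditional model with illustrative parameters, which is not the setting of the slides): the posterior given the observed feature x1 is obtained by numerically integrating the joint density over the missing feature x2.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Two classes with 2-D Gaussian class-conditional densities (illustrative parameters).
    priors = [0.5, 0.5]
    means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    cov = np.array([[1.0, 0.6],
                    [0.6, 1.0]])

    x1_observed = 1.2                 # feature x1 is measured, feature x2 is missing
    x2_grid = np.linspace(-8, 12, 2001)
    dx2 = x2_grid[1] - x2_grid[0]

    # p(x1 | wi) = integral over x2 of p(x1, x2 | wi)
    marg = []
    for mu in means:
        pts = np.column_stack([np.full_like(x2_grid, x1_observed), x2_grid])
        marg.append(np.sum(multivariate_normal.pdf(pts, mean=mu, cov=cov)) * dx2)

    posteriors = np.array([P * m for P, m in zip(priors, marg)])
    posteriors /= posteriors.sum()
    print(posteriors)                 # P(w1 | x1), P(w2 | x1) after marginalizing x2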

Page 86: Bayesian Decision Theory (Sections 2.1-2.2)

Handling Missing Features

Page 87: Bayesian Decision Theory (Sections 2.1-2.2)

Other Topics

• Compound Bayes Decision Theory & Context

– Consecutive states of nature might not be statistically independent; in sorting two types of fish, arrival of next fish may not be independent of the previous fish

– Can we exploit such statistical dependence to gain improved performance (use of context)?

– Compound decision vs. sequential compound decision problems

– Markov dependence

• Sequential Decision Making

– Feature measurement process is sequential (as in medical diagnosis)

– Feature measurement cost

– Minimize the no. of features to be measured while achieving a sufficient accuracy; minimize a combination of feature measurement cost & classification accuracy

Page 88: Bayesian Decision Theory (Sections 2.1-2.2)

Context in Text Recognition