Pattern Classification, Chapter 2 (Part 1): Bayesian Decision Theory (Sections 2.1-2.2) - Introduction


  • Slide 1
  • Pattern Classification
  • Slide 2
  • Chapter 2 (Part 1): Bayesian Decision Theory (Sections 2.1-2.2): Introduction; Bayesian Decision Theory - Continuous Features
  • Slide 3
  • Introduction: the sea bass/salmon example. State of nature, prior. The state of nature is a random variable. If the catch of salmon and sea bass is equiprobable, then P(ω1) = P(ω2) (uniform priors), and P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity).
  • Slide 4
  • Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2), otherwise decide ω2. Use of the class-conditional information: P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
  • Slide 5
  • (figure-only slide; no recoverable text)
  • Slide 6
  • Posterior, likelihood, evidence: P(ωj | x) = P(x | ωj) P(ωj) / P(x) (Bayes rule), i.e. Posterior = (Likelihood × Prior) / Evidence, where in the case of two categories the evidence is P(x) = P(x | ω1) P(ω1) + P(x | ω2) P(ω2).
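A minimal sketch of Bayes rule for two categories in Python; the likelihood values below are illustrative numbers, not taken from the slides:

```python
# Hypothetical likelihoods and priors for a single observed x (illustrative values only).
likelihoods = [0.6, 0.2]          # P(x | w1), P(x | w2)
priors = [0.5, 0.5]               # P(w1), P(w2) -- the "equiprobable" case

# Evidence: P(x) = sum_j P(x | wj) P(wj)
evidence = sum(l * p for l, p in zip(likelihoods, priors))

# Posteriors: P(wj | x) = P(x | wj) P(wj) / P(x)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(posteriors)                 # [0.75, 0.25]
```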
  • Slide 7
  • (figure-only slide; no recoverable text)
  • Slide 8
  • Intuitive decision rule given the posterior probabilities. Given x: if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1; if P(ω1 | x) < P(ω2 | x), decide the true state of nature is ω2. Why do this? Whenever we observe a particular x, the probability of error is P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.
  • Slide 9
  • Minimizing the probability of error: decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2. Therefore P(error | x) = min [P(ω1 | x), P(ω2 | x)] (Bayes decision).
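Continuing the sketch above, the Bayes decision and its conditional error probability follow directly from the posteriors (again with illustrative numbers):

```python
# Posteriors computed as in the previous sketch.
posteriors = [0.75, 0.25]

# Bayes decision: pick the category with the larger posterior.
decision = 1 if posteriors[0] > posteriors[1] else 2   # category index (1 or 2)

# Probability of error given x is the posterior of the category we did NOT choose.
p_error = min(posteriors)
print(decision, p_error)   # 1 0.25
```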
  • Slide 10
  • Bayesian Decision Theory - Continuous Features. Generalization of the preceding ideas: use of more than one feature; use of more than two states of nature; allowing actions and not only deciding on the state of nature; introducing a loss function which is more general than the probability of error. Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision in close or bad cases! The loss function states how costly each action taken is.
  • Slide 11
  • Bayesian Decision Theory - Continuous Features (cont.). Let {ω1, ω2, ..., ωc} be the set of c states of nature (or classes). Let {α1, α2, ..., αa} be the set of a possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
  • Slide 12
  • What is the expected loss for action αi? Conditional risk = expected loss. For any given x the expected loss (conditional risk) is R(αi | x) = Σj=1..c λ(αi | ωj) P(ωj | x).
  • Slide 13
  • Overall risk: R = ∫ R(α(x) | x) p(x) dx, the expected loss associated with the decision rule α(x). Minimizing R is achieved by minimizing the conditional risk R(αi | x) for every x, i.e. choosing, for each x, the action αi, i = 1, ..., a, with the smallest conditional risk R(αi | x) = Σj=1..c λ(αi | ωj) P(ωj | x).
  • Slide 14
  • Select the action αi for which R(αi | x) is minimum. Then R is minimum, and R in this case is called the Bayes risk: the best performance that can be achieved!
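A minimal sketch of the conditional-risk computation, assuming a hypothetical loss matrix and posterior vector (the numbers are illustrative, not from the slides):

```python
# loss[i][j] = lambda(alpha_i | omega_j): cost of action i when the true class is j.
loss = [[0.0, 2.0],
        [1.0, 0.0]]                      # hypothetical 2-action, 2-class loss matrix
posteriors = [0.75, 0.25]                # P(w1 | x), P(w2 | x), as computed earlier

# Conditional risk: R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) * P(omega_j | x)
risks = [sum(l_ij * p for l_ij, p in zip(row, posteriors)) for row in loss]

# Bayes decision: take the action with minimum conditional risk.
best_action = min(range(len(risks)), key=lambda i: risks[i])
print(risks, best_action)                # [0.5, 0.75] 0
```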
  • Slide 15
  • Two-Category Classification. α1: deciding ω1; α2: deciding ω2. λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risks: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).
  • Slide 16
  • Our rule is the following: if R(α1 | x) < R(α2 | x), action α1 (decide ω1) is taken. This results in the equivalent rule: decide ω1 if (λ21 - λ11) P(x | ω1) P(ω1) > (λ12 - λ22) P(x | ω2) P(ω2), and decide ω2 otherwise.
  • Slide 17
  • (figure-only slide: the quantities (λ21 - λ11)·P(x | ω1) P(ω1) and (λ12 - λ22)·P(x | ω2) P(ω2) plotted against x)
  • Slide 18
  • Two-Category Decision Theory: Chopping Machine. Actions: α1 = chop, α2 = DO NOT chop. States of nature: ω1 = NO hand in machine, ω2 = hand in machine. Losses: λ11 = λ(α1 | ω1) = $0.00; λ12 = λ(α1 | ω2) = $100.00; λ21 = λ(α2 | ω1) = $0.01; λ22 = λ(α2 | ω2) = $0.01. Therefore our rule (λ21 - λ11) P(x | ω1) P(ω1) > (λ12 - λ22) P(x | ω2) P(ω2) becomes: decide α1 (chop) if 0.01 P(x | ω1) P(ω1) > 99.99 P(x | ω2) P(ω2).
  • Slide 19
  • (Recap) If R(α1 | x) < R(α2 | x), action α1 (decide ω1) is taken; equivalently, decide ω1 if (λ21 - λ11) P(x | ω1) P(ω1) > (λ12 - λ22) P(x | ω2) P(ω2), and decide ω2 otherwise.
  • Slide 20
  • (figure-only slide: the scaled quantities 0.01 P(x | ω1) P(ω1) and 99.99 P(x | ω2) P(ω2) plotted against x; α1 = chop, α2 = DO NOT chop, ω1 = NO hand in machine, ω2 = hand in machine)
  • Slide 21
  • Exercise: select the optimal decision, where Ω = {ω1, ω2}, P(x | ω1) ~ N(2, 0.5) (normal distribution), P(x | ω2) ~ N(1.5, 0.2), P(ω1) = 2/3, P(ω2) = 1/3.
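A sketch of how the exercise could be evaluated numerically, assuming the second parameter of N(·,·) is the variance (the slide does not say whether it is the variance or the standard deviation) and assuming zero-one loss:

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def decide(x):
    # Class-conditional densities and priors from the exercise statement.
    p1 = normal_pdf(x, 2.0, 0.5) * (2.0 / 3.0)   # P(x | w1) P(w1)
    p2 = normal_pdf(x, 1.5, 0.2) * (1.0 / 3.0)   # P(x | w2) P(w2)
    return 1 if p1 > p2 else 2                   # minimum-error-rate decision

# Decisions for a few values of x (illustrative only).
for x in (1.0, 1.5, 2.0, 2.5):
    print(x, decide(x))
```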
  • Slide 22
  • Chapter 2 (Part 2): Bayesian Decision Theory (Sections 2.3-2.5): Minimum-Error-Rate Classification; Classifiers, Discriminant Functions and Decision Surfaces; The Normal Density
  • Slide 23
  • Minimum-Error-Rate Classification. Actions are decisions on classes: if action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j. We seek a decision rule that minimizes the probability of error, i.e. the error rate.
  • Slide 24
  • Introduction of the zero-one loss function: λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j (i, j = 1, ..., c). Therefore, the conditional risk for each action is R(αi | x) = Σj≠i P(ωj | x) = 1 - P(ωi | x). The risk corresponding to this loss function is the average (or expected) probability of error.
  • Slide 25
  • Minimizing the risk therefore requires maximizing P(ωi | x) (since R(αi | x) = 1 - P(ωi | x)). For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.
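A small sketch showing that, under zero-one loss, picking the minimum conditional risk and picking the maximum posterior give the same answer (the posterior values are illustrative):

```python
# Illustrative posteriors P(w_i | x) for a three-class problem.
posteriors = [0.2, 0.5, 0.3]

# Zero-one loss: R(alpha_i | x) = 1 - P(w_i | x)
risks = [1.0 - p for p in posteriors]

best_by_risk = min(range(3), key=lambda i: risks[i])
best_by_posterior = max(range(3), key=lambda i: posteriors[i])
assert best_by_risk == best_by_posterior      # both select class index 1
print(best_by_risk)
```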
  • Slide 26
  • (Recap) Two-Category Classification. α1: deciding ω1; α2: deciding ω2. λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risks: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).
  • Slide 27
  • (Recap) If R(α1 | x) < R(α2 | x), action α1 (decide ω1) is taken; equivalently, decide ω1 if (λ21 - λ11) P(x | ω1) P(ω1) > (λ12 - λ22) P(x | ω2) P(ω2), and decide ω2 otherwise.
  • Slide 28
  • Likelihood ratio: the preceding rule is equivalent to the following rule: if P(x | ω1) / P(x | ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)], then take action α1 (decide ω1); otherwise take action α2 (decide ω2).
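A sketch of the likelihood-ratio form of the rule; the loss matrix, priors, and likelihood values are hypothetical:

```python
# Hypothetical losses, priors, and class-conditional likelihoods at some x.
lam = [[0.0, 100.0],
       [0.01, 0.01]]                  # lam[i][j] = lambda(alpha_i | omega_j)
priors = [0.9, 0.1]                   # P(w1), P(w2)
lik = [0.4, 0.05]                     # P(x | w1), P(x | w2)

# Threshold on the likelihood ratio:
#   theta = (lam12 - lam22) / (lam21 - lam11) * P(w2) / P(w1)
theta = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * priors[1] / priors[0]

ratio = lik[0] / lik[1]               # P(x | w1) / P(x | w2)
decision = 1 if ratio > theta else 2
print(ratio, theta, decision)
```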
  • Slide 29
  • Regions of decision and the zero-one loss function: if λ is the zero-one loss function, i.e. λ(αi | ωj) = 0 for i = j and 1 for i ≠ j, then (λ21 - λ11) = (λ12 - λ22) = 1 and the likelihood-ratio threshold reduces to θ = P(ω2) / P(ω1): decide ω1 if P(x | ω1) / P(x | ω2) > P(ω2) / P(ω1).
  • Slide 30
  • (figure-only slide; no recoverable text)
  • Slide 31
  • Classifiers, Discriminant Functions and Decision Surfaces. The multi-category case: a set of discriminant functions gi(x), i = 1, ..., c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
  • Slide 32
  • (figure-only slide; no recoverable text)
  • Slide 33
  • Let gi(x) = -R(αi | x) (maximum discriminant corresponds to minimum risk!). For the minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!). Equivalently, gi(x) = P(x | ωi) P(ωi), or gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm!).
  • Slide 34
  • The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi). The two-category case: a classifier is a dichotomizer that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
  • Slide 35
  • The computation of g(x): g(x) = P(ω1 | x) - P(ω2 | x), or equivalently g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)].
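A sketch of a dichotomizer built from the log-likelihood-ratio form of g(x); the Gaussian class-conditional densities and priors below are assumptions for illustration:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def g(x, priors=(0.5, 0.5)):
    """g(x) = ln P(x|w1)/P(x|w2) + ln P(w1)/P(w2); decide w1 if g(x) > 0."""
    l1 = normal_pdf(x, 2.0, 0.5)       # assumed P(x | w1)
    l2 = normal_pdf(x, 1.5, 0.2)       # assumed P(x | w2)
    return math.log(l1 / l2) + math.log(priors[0] / priors[1])

x = 1.8
print(1 if g(x) > 0 else 2)
```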
  • Slide 36
  • On to higher dimensions!
  • Slide 37
  • The Normal Density - univariate density. A density which is analytically tractable; a continuous density; a lot of processes are asymptotically Gaussian; handwritten characters and speech sounds can be seen as an ideal or prototype corrupted by a random process (central limit theorem). p(x) = [1 / (√(2π) σ)] exp[-(x - μ)² / (2σ²)], where μ = mean (or expected value) of x, and σ² = expected squared deviation, or variance.
  • Slide 38
  • (figure-only slide; no recoverable text)
  • Slide 39
  • Multivariate density: the multivariate normal density in d dimensions is p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[-(1/2) (x - μ)^t Σ⁻¹ (x - μ)], where x = (x1, x2, ..., xd)^t (t stands for the transpose), μ = (μ1, μ2, ..., μd)^t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse respectively.
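A minimal NumPy sketch of evaluating the d-dimensional normal density from this formula (the mean vector and covariance matrix below are assumptions for illustration):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    """Evaluate the d-dimensional normal density at x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Illustrative 2-D example.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, sigma))
```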
  • Slide 40
  • Chapter 2 (Part 3): Bayesian Decision Theory (Sections 2.6-2.9): Discriminant Functions for the Normal Density; Bayes Decision Theory - Discrete Features
  • Slide 41
  • Discriminant Functions for the Normal Density. We saw that the minimum-error-rate classification can be achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi). In the case of multivariate normal densities, gi(x) = -(1/2) (x - μi)^t Σi⁻¹ (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi).
  • Slide 42
  • Case Σi = σ²I (I stands for the identity matrix): the discriminant functions are linear, gi(x) = wi^t x + wi0, with wi = μi / σ² and wi0 = -μi^t μi / (2σ²) + ln P(ωi) (wi0 is the threshold, or bias, of the i-th category).
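A sketch of this linear discriminant for the Σi = σ²I case, with assumed means, shared variance, and priors:

```python
import numpy as np

def linear_discriminant(mu_i, sigma2, prior_i):
    """Return (w_i, w_i0) for g_i(x) = w_i . x + w_i0 when Sigma_i = sigma^2 I."""
    w_i = mu_i / sigma2
    w_i0 = -mu_i @ mu_i / (2 * sigma2) + np.log(prior_i)
    return w_i, w_i0

# Assumed two-class, 2-D setup (illustrative numbers only).
mus = [np.array([2.0, 2.0]), np.array([0.0, 0.0])]
sigma2 = 1.0
priors = [0.5, 0.5]

x = np.array([1.3, 0.9])
scores = [w @ x + w0 for w, w0 in (linear_discriminant(m, sigma2, p)
                                   for m, p in zip(mus, priors))]
print(int(np.argmax(scores)) + 1)     # class assignment (1 or 2)
```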
  • Slide 43
  • A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x).
  • Slide 44
  • (figure-only slide; no recoverable text)
  • Slide 45
  • The hyperplane separating Ri and Rj is given by w^t (x - x0) = 0, with w = μi - μj and x0 = (1/2)(μi + μj) - [σ² / ||μi - μj||²] ln [P(ωi) / P(ωj)] (μi - μj); it is always orthogonal to the line linking the means!
  • Slide 46
  • (figure-only slide; no recoverable text)
  • Slide 47
  • (figure-only slide; no recoverable text)
  • Slide 48
  • Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!). The hyperplane separating Ri and Rj is given by w^t (x - x0) = 0, with w = Σ⁻¹ (μi - μj) and x0 = (1/2)(μi + μj) - [ln (P(ωi) / P(ωj)) / ((μi - μj)^t Σ⁻¹ (μi - μj))] (μi - μj). Here the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!
  • Slide 49
  • (figure-only slide; no recoverable text)
  • Slide 50
  • (figure-only slide; no recoverable text)
  • Slide 51
  • Case Σi = arbitrary: the covariance matrices are different for each category. The discriminant functions are quadratic, and the separating surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
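A sketch of the general (quadratic) discriminant gi(x) = -(1/2)(x - μi)^t Σi⁻¹ (x - μi) - (1/2) ln |Σi| + ln P(ωi) (the class-independent (d/2) ln 2π term is dropped), using assumed per-class means and covariances:

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """g_i(x) for a normal class-conditional density with its own covariance."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Assumed two-class setup with different covariance matrices (illustrative).
mus = [np.array([2.0, 2.0]), np.array([0.0, 0.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 1.0])
scores = [quadratic_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
print(int(np.argmax(scores)) + 1)     # assigned class (1 or 2)
```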
  • Slide 52
  • (figure-only slide; no recoverable text)
  • Slide 53
  • (figure-only slide; no recoverable text)
  • Slide 54
  • Bayes Decision Theory - Discrete Features. The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, ..., vm. Case of independent binary features in a two-category problem: let x = [x1, x2, ..., xd]^t, where each xi is either 0 or 1, with probabilities pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2).
  • Slide 55
  • The discriminant function in this case is g(x) = Σi=1..d wi xi + w0, where wi = ln [pi (1 - qi) / (qi (1 - pi))] for i = 1, ..., d, and w0 = Σi=1..d ln [(1 - pi) / (1 - qi)] + ln [P(ω1) / P(ω2)]; decide ω1 if g(x) > 0 and ω2 otherwise.
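A sketch of this linear discriminant for independent binary features, with assumed per-feature probabilities and priors:

```python
import math

def binary_feature_discriminant(x, p, q, prior1, prior2):
    """g(x) = sum_i w_i x_i + w0 for independent binary features."""
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    w0 += math.log(prior1 / prior2)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Assumed probabilities p_i = P(x_i = 1 | w1) and q_i = P(x_i = 1 | w2).
p = [0.8, 0.7, 0.6]
q = [0.2, 0.4, 0.5]
x = [1, 0, 1]
g = binary_feature_discriminant(x, p, q, 0.5, 0.5)
print(1 if g > 0 else 2, g)
```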