
Bayesian Decision Theory
Machine Learning for Context Aware Computing

Christian Schüller, 15.04.03


index of contents

introduction (Chapter 1 – Pattern Classification, Duda/Hart/Stork)
Bayesian decision theory (Chapter 2 – Pattern Classification)

> machine perception
- pattern recognition systems
- the design cycle
- learning and adaptation
- conclusion
- an example


machine perception

build a machine that can recognize patterns:
- speech recognition
- fingerprint identification
- DNA sequence identification
- OCR (optical character recognition)

- accurate pattern recognition is immensely useful
- we gain a deeper understanding by solving such problems
- algorithm and hardware design is influenced by knowledge of how these problems are solved in nature


index of contents

introduction to ML

Bayesian decision theory

- machine perception

- pattern recognition systems

- the design cycle

- learning and adaptation

- conclusion

> an example


an example – fish packing plant (1/8)

aims
- “sorting incoming fish on a conveyor according to species using optical sensing”
- pilot project: separate sea bass from salmon

problem analysis
- the plant wants to automate the processing of incoming fish
- using optical sensing: take sample pictures
- extract features
  > length, width
  > lightness
  > number and shape of fins
- notice noise or variations in the images
  > variation in lighting
  > position of the fish on the conveyor


an example – fish packing plant (2/8)

preprocessing
- use a segmentation operation to isolate fishes from
  > one another
  > the background

feature extraction
- information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features
- the features are passed to a classifier

model
- differences between the populations call for different models
- hypothesize a class of models
- choose the best corresponding model


an example – fish packing plant (3/8)

overview
- processing chain: preprocessing → feature extraction → classification → “salmon” / “sea bass”

training samples
- length is an obvious feature; try to classify by the length L
- obtain training samples by making length measurements
- suppose a sea bass is generally longer than a salmon

classification
- evaluates the evidence
- makes the final decision


an example – fish packing plant (4/8)

feature: length
- length alone is a poor feature

[figure: histograms of count vs. length for salmon and sea bass with a candidate decision threshold L; the two class distributions overlap considerably]

- select the lightness as a possible feature


an example – fish packing plant (5/8)

feature: lightness
- careful elimination of variations in illumination
- the classes are much better separated

[figure: histograms of count vs. lightness for salmon and sea bass with decision threshold x]

decision boundary and cost relationship
- move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
- this is a task of decision theory
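to make the cost argument concrete, here is a minimal sketch that sweeps a lightness threshold and keeps the one with the smallest total cost; the per-error costs, the Gaussian lightness samples, and the assumption that sea bass are lighter than salmon are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up lightness samples: salmon darker, sea bass lighter (assumption)
salmon   = rng.normal(4.0, 1.0, 200)
sea_bass = rng.normal(7.0, 1.0, 200)

# asymmetric costs: a sea bass sold as salmon is the expensive mistake
COST_BASS_AS_SALMON = 2.0
COST_SALMON_AS_BASS = 1.0

def total_cost(threshold):
    # rule: lightness < threshold -> "salmon", otherwise -> "sea bass"
    bass_as_salmon = np.sum(sea_bass < threshold)
    salmon_as_bass = np.sum(salmon >= threshold)
    return (COST_BASS_AS_SALMON * bass_as_salmon
            + COST_SALMON_AS_BASS * salmon_as_bass)

thresholds = np.linspace(2.0, 9.0, 200)
best = thresholds[np.argmin([total_cost(t) for t in thresholds])]
print(best)  # lands below the symmetric midpoint 5.5, as the slide argues
```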


an example – fish packing plant (6/8)

decision theory
- make decision rules so as to minimize cost
- add width as a new feature to classify with
- the problem is to partition the feature space into two regions
  ⇒ x^T = [x1, x2], where x1 = lightness and x2 = width
- add other features that are not correlated with the ones we already have
- a precaution should be taken not to reduce the performance by adding such “noisy features”

[figure: scatter plot of width vs. lightness for salmon and sea bass with a linear decision boundary]


an example – fish packing plant (7/8)

best decision boundary
- the best decision boundary should be the one which provides optimal performance, such as in the following figure
- but such satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input
  ⇒ the issue of generalization

[figure: width vs. lightness with a complex decision boundary that separates all training samples perfectly]


an example – fish packing plant (8/8)

generalized decision boundary

[figure: width vs. lightness with a simpler, smoother decision boundary chosen to generalize better to novel input]


index of contents

introduction to ML

Bayesian decision theory

- machine perception

> pattern recognition systems

- the design cycle

- learning and adaptation

- conclusion

- an example


pattern recognition systems (1/3)

overview
- processing chain: input → sensing → segmentation → feature extraction → classification → post-processing → decision
- with feedback adjustments: costs, adjustments for context, adjustments for missing features


pattern recognition systems (2/3)

sensing
- use of a transducer (e.g. a camera)
- the pattern recognition system depends on:
  > the bandwidth
  > the resolution, sensitivity and distortion of the transducer

segmentation and grouping
- patterns should:
  > be well separated
  > not overlap


pattern recognition systems (3/3)

feature extraction
- the boundary between feature extraction and classification is somewhat arbitrary
- seek features that are invariant with respect to translation, rotation and scale

classification
- use the feature vector provided by the feature extractor to assign the object to a category

post-processing
- exploit context: input-dependent information other than the target pattern itself, to improve performance


index of contents

introduction to ML

Bayesian decision theory

- machine perception

- pattern recognition systems

> the design cycle

- learning and adaptation

- conclusion

- an example


the design cycle (1/3)

overview
- start → collect data → choose features → choose model → train classifier → evaluate classifier → end
- prior knowledge (e.g. invariances) feeds into the choice of features and model


the design cycle (2/3)

data collection
- how do we know when we have collected an adequately large and representative set of examples for training and testing the system?

feature choice
- depends on the characteristics of the problem domain: features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise

model choice
- unsatisfied with the performance of our fish classifier, we may want to jump to another class of model
- how do we combine prior knowledge and empirical data?


the design cycle (3/3)

training
- use the data to determine the classifier
- there are different procedures for training classifiers and choosing models

evaluation
- measure the error rate (or performance), and possibly switch from one set of features to another

solving problems
- no universal method has been found for solving all of these sub-problems


index of contents

introduction to ML

Bayesian decision theory

- machine perception

- pattern recognition systems

- the design cycle

> learning and adaptation

- conclusion

- an example


learning and adaptation

supervised learning
- a teacher provides a category label or cost for each pattern in the training set

unsupervised learning
- the system forms clusters or “natural groupings” of the input patterns

learning
- pattern recognition problems are too hard to guess the best classification decision directly; the decision rule must be learned from data


index of contents

introduction to ML

Bayesian decision theory

- machine perception

- pattern recognition systems

- the design cycle

- learning and adaptation

> conclusion

- an example


conclusion

overwhelmed?
- one may feel overwhelmed by the number, complexity and magnitude of the sub-problems of pattern recognition

problems
- many of these sub-problems can indeed be solved

unresolved problems
- many fascinating unsolved problems still remain
- mathematical theories solving some of these problems have been discovered


index of contents

introduction to ML

Bayesian decision theory

> introduction

- minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

- the normal density

- discriminant functions for the normal density

- Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


introduction (1/4)

assumptions
- the sequence of types of fish appears to be random
- decision-theoretic terminology: as each fish emerges, nature is in one or the other of two possible states

state of nature
- let ω denote the state of nature
- ω1 = sea bass and ω2 = salmon
- the catch of salmon and sea bass is equiprobable:
  P(ω1) = P(ω2) (uniform priors)
  P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

a priori probability
- P(ω1) = prior probability that the next fish is a sea bass


introduction (2/4)

example
- classification problem: sea bass vs. salmon by lightness
- assume the a priori probabilities are not equal
  > e.g. assume P(sea bass) > P(salmon)
  > if you don’t get a chance to see the fish, always decide “sea bass”
- if you do see the lightness of the fish:
  question: P(sea bass | lightness) = ? and P(salmon | lightness) = ?
  > class-conditional densities: p(lightness | sea bass) and p(lightness | salmon)
  > posterior probabilities: P(sea bass | lightness) and P(salmon | lightness)


introduction (3/4)

Bayes’ rule

  P(ωj | x) = p(x | ωj) P(ωj) / p(x)

where, in the case of two categories,

  p(x) = Σ_{j=1..2} p(x | ωj) P(ωj)

decision given the posterior probabilities
- x is an observation for which:
  > P(ω1 | x) > P(ω2 | x) ⇒ true state of nature = ω1
  > P(ω1 | x) < P(ω2 | x) ⇒ true state of nature = ω2
- whenever we observe a particular x, the probability of error is:
  > P(error | x) = P(ω1 | x) if we decide ω2
  > P(error | x) = P(ω2 | x) if we decide ω1
- this is the maximum a posteriori (MAP) classifier, or Bayes classifier
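a minimal numerical sketch of Bayes’ rule and the MAP decision; the likelihood and prior values are made up for illustration, not taken from the slides:

```python
import numpy as np

# hypothetical class-conditional likelihoods p(x | w) at one observed x,
# and unequal priors P(w): order is [sea bass, salmon]
likelihoods = np.array([0.30, 0.05])
priors      = np.array([2 / 3, 1 / 3])

evidence   = np.sum(likelihoods * priors)     # p(x) = sum_j p(x|wj) P(wj)
posteriors = likelihoods * priors / evidence  # Bayes' rule

decision = np.argmax(posteriors)              # MAP / Bayes decision
p_error  = 1.0 - posteriors[decision]         # P(error | x)
print(posteriors, decision, p_error)
```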


introduction (4/4)

minimizing the probability of error
- P(error | x) = min [P(ω1 | x), P(ω2 | x)] (Bayes decision)
- decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

maximum-likelihood classifier
- if P(ω1) = P(ω2), the posteriors are proportional to the likelihoods, which gives a simpler decision rule
- decide ω1 if p(x | ω1) > p(x | ω2); otherwise decide ω2


index of contents

introduction to ML

Bayesian decision theory

- introduction

- minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

- the normal density

- discriminant functions for the normal density

> Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


Bayesian decision theory / continuous features (1/4)

generalization of the preceding ideas
- use more than one feature
- use more than two states of nature
- allow actions, not only decisions on the state of nature
- introduce a loss function which is more general than the probability of error

feature space
- replace the scalar x by the feature vector x

loss function
- states how costly each action is
- lets us treat situations in which some kinds of classification mistakes are more costly than others


Bayesian decision theory / continuous features (2/4)

definitions
- let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)
- let {α1, α2, …, αa} be the set of possible actions
- let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

risk := expected loss (cost)
- the conditional risk, i.e. the expected loss of taking action αi:

  R(αi | x) = Σ_{j=1..c} λ(αi | ωj) P(ωj | x)

  with the posterior probability P(ωj | x) = p(x | ωj) P(ωj) / p(x)

- select the action αi for which R(αi | x) is minimum
- the overall risk R is then minimum; this R is called the Bayes risk = the best performance that can be achieved!
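a minimal sketch of this rule for c = 2 categories and a = 2 actions; the loss matrix and the posterior values are illustrative:

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | omega_j), illustrative values
loss = np.array([[0.0, 2.0],   # alpha_1: decide omega_1
                 [1.0, 0.0]])  # alpha_2: decide omega_2
posteriors = np.array([0.4, 0.6])  # P(omega_j | x) for the observed x

# conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x)
risks = loss @ posteriors

best_action = np.argmin(risks)  # Bayes decision rule: minimize the risk
print(risks, best_action)       # [1.2 0.4] -> take alpha_2 (decide omega_2)
```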


Bayesian decision theory / continuous features (3/4)

two-category classification
- α1: deciding ω1
- α2: deciding ω2
- λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj

conditional risk
- R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
- R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

rule
- if R(α1 | x) < R(α2 | x), action α1 (“decide ω1”) is taken
- this results in the equivalent rule: decide ω1 if
  (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
  and decide ω2 otherwise


Bayesian decision theory / continuous features (4/4)

likelihood
- the preceding rule is equivalent to the following: if

  p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

- then take action α1 (decide ω1), otherwise take action α2 (decide ω2)

likelihood ratio
- the left-hand side is the likelihood ratio of the class-conditional probability density functions
- the decision boundary is determined by the threshold on the right-hand side
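a minimal sketch of this likelihood-ratio test; the loss values, priors and density values are illustrative:

```python
import numpy as np

lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])       # lam[i, j] = lambda_ij, illustrative
priors = np.array([2 / 3, 1 / 3])  # P(omega_1), P(omega_2)

# threshold = (lam12 - lam22) / (lam21 - lam11) * P(omega_2) / P(omega_1)
theta = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * priors[1] / priors[0]

def decide(p_x_w1, p_x_w2):
    """Return 1 (omega_1) if the likelihood ratio exceeds theta, else 2."""
    return 1 if p_x_w1 / p_x_w2 > theta else 2

print(theta, decide(0.30, 0.05))  # theta = 1.0, ratio 6.0 -> decide omega_1
```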


index of contents

introduction to ML

Bayesian decision theory

- introduction

> minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

- the normal density

- discriminant functions for the normal density

- Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


minimum-error-rate classification (1/2)

actions are decisions on classes
- if action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j

decision rule
- seek a decision rule that minimizes the probability of error, i.e. the error rate

introduction of the zero-one loss function

  λ(αi | ωj) = 0 if i = j, 1 if i ≠ j   (i, j = 1, …, c)

- therefore, the conditional risk is:

  R(αi | x) = Σ_{j=1..c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)


minimum-error-rate classification (2/2)

minimizing the risk therefore requires maximizing P(ωi | x)

for minimum error rate
- decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
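a minimal numerical check that, under the zero-one loss, choosing the largest posterior is the same as choosing the smallest conditional risk; the posteriors are illustrative:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])  # P(omega_j | x), c = 3, sums to 1

zero_one_loss = 1.0 - np.eye(3)         # lambda(alpha_i | omega_j)
risks = zero_one_loss @ posteriors      # R(alpha_i | x) = 1 - P(omega_i | x)

assert np.argmin(risks) == np.argmax(posteriors)
print(risks)  # [0.8 0.5 0.7]
```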


index of contents

introduction to ML

Bayesian decision theory

- introduction

- minimum-error-rate classification

> classifiers, discriminant functions and decision surfaces

- the normal density

- discriminant functions for the normal density

- Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


classifiers, discriminant functions and decision surfaces (1/4)

the multi-category case
- use a set of discriminant functions gi(x), i = 1, …, c

functional structure of a general statistical pattern classifier
- the classifier assigns a feature vector x to class ωi if: gi(x) > gj(x) ∀ j ≠ i


classifiers, discriminant functions and decision surfaces (2/4)

the Bayes classifier can be represented in this way
- gi(x) = −R(αi | x): minimum conditional risk
- gi(x) = P(ωi | x): minimum error rate

the selection of a discriminant function is not unique
- for the minimum-error classifier, one may equivalently choose:

  gi(x) = p(x | ωi) P(ωi)
  gi(x) = ln p(x | ωi) + ln P(ωi)

discriminant functions
- discriminant functions can take different forms, but the effect of the decision rules is the same: the same decision boundaries
- decide x is in region Ri if: gi(x) > gj(x) ∀ j ≠ i
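a minimal sketch of a minimum-error-rate classifier built from gi(x) = ln p(x | ωi) + ln P(ωi), assuming hypothetical one-dimensional Gaussian class-conditional densities for lightness:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2 / 3, 1 / 3])  # P(omega_i), illustrative
means  = np.array([7.0, 4.0])      # per-class mean lightness, illustrative
sigmas = np.array([1.0, 1.0])

def classify(x):
    # g_i(x) = ln p(x | omega_i) + ln P(omega_i); largest discriminant wins
    g = norm.logpdf(x, means, sigmas) + np.log(priors)
    return int(np.argmax(g))

print([classify(x) for x in (3.0, 5.0, 6.0, 8.0)])  # [1, 1, 0, 0]
```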


classifiers, discriminant functions and decision surfaces (4/4)

two-dimensional two-category classifier


index of contents

introduction to ML

Bayesian decision theory

- introduction

- minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

> the normal density

- discriminant functions for the normal density

- Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


the normal density (1/2)

univariate normal density

  p(x) = 1 / (√(2π) σ) · exp[ −(1/2) ((x − μ)/σ)² ]

- μ = mean (expected value)
- σ² = expected squared deviation, or variance

multivariate normal density in d dimensions

  p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp[ −(1/2) (x − μ)^t Σ^(−1) (x − μ) ]

- x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
- μ = (μ1, μ2, …, μd)^t is the mean vector
- Σ = d-by-d covariance matrix
- |Σ| and Σ^(−1) are its determinant and inverse, respectively


the normal density (2/2)

[figures: univariate and multivariate normal distributions]

- the covariance matrix determines the shape of the Gaussian curve
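a minimal sketch evaluating the d-dimensional normal density directly from the formula above; the mean vector and covariance matrix are illustrative:

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Evaluate the multivariate normal density p(x) at x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

mu  = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.8],   # an elongated covariance stretches the
                [0.8, 1.0]])  # Gaussian along the correlated direction
print(mvn_pdf(np.array([1.0, 0.5]), mu, cov))
```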


index of contents

introduction to ML

Bayesian decision theory

- introduction

- minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

- the normal density

> discriminant functions for the normal density

- Bayesian decision theory / continuous features

- Bayesian decision theory / discrete features


discriminant functions for the normal density (1/4)

minimum error-rate classification can be achieved by the discriminant function

  gi(x) = ln p(x | ωi) + ln P(ωi)

in the case of multivariate normal densities, the discriminant function is

  gi(x) = −(1/2) (x − μi)^t Σi^(−1) (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

case 1: Σi = σ²I (statistically independent features with equal variance σ²)

  gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)

expanding the quadratic form,

  gi(x) = −(1/(2σ²)) [ x^t x − 2 μi^t x + μi^t μi ] + ln P(ωi)

and dropping the class-independent term x^t x leaves a discriminant that is linear in x
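a minimal sketch of the resulting linear classifier for case 1; the class means, shared variance and priors are illustrative values:

```python
import numpy as np

means  = np.array([[2.0, 3.0],
                   [5.0, 1.0]])  # mu_i per class, d = 2, illustrative
sigma2 = 1.5                     # shared variance sigma^2
priors = np.array([0.5, 0.5])

# linear discriminant: g_i(x) = (mu_i / sigma^2) . x
#                              - mu_i . mu_i / (2 sigma^2) + ln P(omega_i)
W  = means / sigma2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    return int(np.argmax(W @ x + w0))  # largest discriminant wins

print(classify(np.array([3.0, 2.5])))
```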


discriminant functions for the normal density (2/4)

case 1: Σi = σ²I (statistically independent features, equal variance); the resulting decision boundaries are hyperplanes


discriminant functions for the normal density (3/4)

case 2: Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!); the discriminant is again linear in x


discriminant functions for the normal density (4/4)

case 3: Σi arbitrary; the discriminant function is quadratic in x:

  gi(x) = −(1/2) (x − μi)^t Σi^(−1) (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)


index of contents

introduction to ML

Bayesian decision theory

- introduction

- minimum-error-rate classification

- classifiers, discriminant functions and decision surfaces

- the normal density

- discriminant functions for the normal density

- Bayesian decision theory / continuous features

> Bayesian decision theory / discrete features


Bayesian decision theory / discrete features (1/2)

- the components of x are binary or integer valued; x can take on only one of m discrete values v1, v2, …, vm

case of independent binary features in a 2-category problem
- x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities
  > pi = P(xi = 1 | ω1)
  > qi = P(xi = 1 | ω2)
- probability densities are replaced by probability masses: ∫ p(x | ωj) dx is replaced by Σ_x P(x | ωj)
- the fundamental Bayes decision rule remains the same


Bayesian decision theory / discrete features (2/2)

the discriminant function in this case is

  g(x) = Σ_{i=1..d} wi xi + w0

where

  wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],  i = 1, …, d

and

  w0 = Σ_{i=1..d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

decide ω1 if g(x) > 0 and decide ω2 if g(x) ≤ 0
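a minimal sketch computing these weights and applying the sign rule; the pi, qi and priors are illustrative values:

```python
import numpy as np

# p_i = P(x_i = 1 | omega_1), q_i = P(x_i = 1 | omega_2), illustrative
p = np.array([0.8, 0.6, 0.7])
q = np.array([0.3, 0.5, 0.2])
prior1, prior2 = 0.5, 0.5

w  = np.log(p * (1 - q) / (q * (1 - p)))  # feature weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def decide(x):
    """Return 1 (omega_1) if g(x) > 0, else 2 (omega_2)."""
    g = w @ x + w0
    return 1 if g > 0 else 2

print(decide(np.array([1, 0, 1])), decide(np.array([0, 1, 0])))
```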