machine learning - bayesian decision theory and …vishy/fall2016/notes/bayesiandecision.pdf ·...

Machine Learning
Bayesian Decision Theory and Classification

S.V.N. (vishy) Vishwanathan

University of California, Santa Cruz

October 21, 2016

S.V.N. Vishwanathan (UCSC) CMPS242

Binary Classification

Outline

1 Binary Classification

2 Generative Models
    Gaussian Generative Model
    Naive Bayes

3 Discriminative Classifiers
    Logistic Regression



Problem Setting: Binary Classification

Data: $x = (x_1, x_2, \ldots, x_N)^\top$

Labels: $t = (t_1, t_2, \ldots, t_N)^\top$ with $t_i \in \{0, 1\}$; $t = 1$ implies class $C_1$ and $t = 0$ implies class $C_2$.

Let us call the two classes $C_1$ and $C_2$, and let $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$.



Basic Idea

Estimate $p(C_i \mid x)$ and predict $C_1$ if $p(C_1 \mid x) > p(C_2 \mid x)$.

Key Problem: How to estimate $p(C_i \mid x)$?

Two philosophies:

Generative
Discriminative


Generative Models


$$p(C_i \mid x) = \frac{p(x \mid C_i)\, p(C_i)}{p(x)}$$

Decision function: predict $C_1$ if

$$p(C_1 \mid x) > p(C_2 \mid x)$$



Equivalently, predict $C_1$ if

$$\frac{p(x \mid C_1)\, p(C_1)}{p(x)} > \frac{p(x \mid C_2)\, p(C_2)}{p(x)}$$



Since $p(x) > 0$ appears on both sides, it cancels:

$$p(x \mid C_1)\, p(C_1) > p(x \mid C_2)\, p(C_2)$$



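Taking logarithms, which preserves the inequality because $\ln$ is monotonically increasing:

$$\ln p(x \mid C_1) + \ln p(C_1) > \ln p(x \mid C_2) + \ln p(C_2)$$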



Substituting the priors $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$:

$$\ln p(x \mid C_1) + \ln \pi > \ln p(x \mid C_2) + \ln(1 - \pi)$$
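The log-space decision rule can be sketched in a few lines of Python. This is only an illustration: the function name `predict_class` and its arguments are hypothetical, and the class-conditional log-likelihoods are assumed to be supplied by whatever generative model is in use.

```python
import math

def predict_class(log_lik_c1, log_lik_c2, prior_c1):
    """Return 1 (class C1) if ln p(x|C1) + ln(pi) > ln p(x|C2) + ln(1 - pi), else 2.

    Assumes 0 < prior_c1 < 1 so that both logarithms are finite.
    """
    score_c1 = log_lik_c1 + math.log(prior_c1)
    score_c2 = log_lik_c2 + math.log(1.0 - prior_c1)
    return 1 if score_c1 > score_c2 else 2

# With equal likelihoods, the prior alone decides the prediction.
label = predict_class(math.log(0.2), math.log(0.2), prior_c1=0.7)
```

Working with sums of logarithms rather than products of probabilities avoids underflow when the likelihoods are very small.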


Generative Models Gaussian Generative Model

Class-Conditional Gaussian Distribution

$$p(x \mid C_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$



Written out for $x \in \mathbb{R}^D$:

$$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}$$
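A minimal NumPy sketch of the logarithm of this density may help make the formula concrete. The function name `gaussian_log_density` is an illustrative choice, and the covariance is assumed to be symmetric positive definite.

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """ln N(x | mu, Sigma) for a single D-dimensional point x.

    Assumes Sigma is a symmetric positive definite (D, D) matrix.
    """
    D = x.shape[0]
    diff = x - mu
    # slogdet returns (sign, ln|det Sigma|), which avoids overflow in det().
    _, logdet = np.linalg.slogdet(Sigma)
    # Solve Sigma z = diff rather than forming the inverse explicitly.
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)
```

The three terms inside the parentheses correspond to the $(2\pi)^{D/2}$ factor, the determinant factor, and the quadratic form in the exponent.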



Decision Rule

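Predict $C_1$ if

$$\ln\left( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma_1|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right\} \right) + \ln p(C_1) > \ln\left( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma_2|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2) \right\} \right) + \ln p(C_2)$$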



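Expanding the quadratic forms and collecting terms in $x$, predict $C_1$ if

$$\frac{1}{2} x^\top \left( \Sigma_2^{-1} - \Sigma_1^{-1} \right) x + \mu_1^\top \Sigma_1^{-1} x - \mu_2^\top \Sigma_2^{-1} x + b > 0$$

where

$$b = \ln\left( \frac{\pi}{1 - \pi} \right) - \frac{1}{2} \ln \frac{|\Sigma_1|}{|\Sigma_2|} + \frac{1}{2} \mu_2^\top \Sigma_2^{-1} \mu_2 - \frac{1}{2} \mu_1^\top \Sigma_1^{-1} \mu_1$$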
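The quadratic rule can be evaluated directly. The sketch below is a hedged illustration: `quadratic_discriminant` is a hypothetical name, and both covariances are assumed to be invertible. It returns the left-hand side of the inequality, so a positive value means "predict $C_1$".

```python
import numpy as np

def quadratic_discriminant(x, pi, mu1, Sigma1, mu2, Sigma2):
    """Left-hand side of the quadratic decision rule; predict C1 if it is > 0."""
    P1 = np.linalg.inv(Sigma1)  # Sigma_1^{-1}
    P2 = np.linalg.inv(Sigma2)  # Sigma_2^{-1}
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    # Offset b: prior log-odds, log-determinant ratio, and the mean terms.
    b = (np.log(pi / (1.0 - pi))
         - 0.5 * (logdet1 - logdet2)
         + 0.5 * mu2 @ P2 @ mu2
         - 0.5 * mu1 @ P1 @ mu1)
    return 0.5 * x @ (P2 - P1) @ x + mu1 @ P1 @ x - mu2 @ P2 @ x + b
```

The returned value equals the difference of the two log-joint scores, $\ln p(x \mid C_1) + \ln \pi - \ln p(x \mid C_2) - \ln(1 - \pi)$, because the $(2\pi)^{D/2}$ factors cancel.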



Special Case: Σi = Σ

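With a shared covariance the quadratic term vanishes, and the rule becomes linear in $x$:

$$\underbrace{(\mu_1 - \mu_2)^\top \Sigma^{-1}}_{w^\top} x + b > 0, \quad \text{that is,} \quad w^\top x + b > 0$$

where

$$b = \ln\left( \frac{\pi}{1 - \pi} \right) + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1$$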
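For the shared-covariance case, the weight vector and offset can be computed as in the following sketch. The function name `linear_discriminant_params` is an assumption made for illustration, and $\Sigma$ is assumed to be symmetric and invertible, so that $w = \Sigma^{-1}(\mu_1 - \mu_2)$.

```python
import numpy as np

def linear_discriminant_params(pi, mu1, mu2, Sigma):
    """Return (w, b) such that the rule 'predict C1 if w @ x + b > 0' holds."""
    Sinv_mu1 = np.linalg.solve(Sigma, mu1)  # Sigma^{-1} mu_1
    Sinv_mu2 = np.linalg.solve(Sigma, mu2)  # Sigma^{-1} mu_2
    w = Sinv_mu1 - Sinv_mu2
    b = np.log(pi / (1.0 - pi)) + 0.5 * mu2 @ Sinv_mu2 - 0.5 * mu1 @ Sinv_mu1
    return w, b

# Symmetric toy setup: equal priors, identity covariance, means at (+1, 0) and (-1, 0).
w, b = linear_discriminant_params(0.5, np.array([1.0, 0.0]),
                                  np.array([-1.0, 0.0]), np.eye(2))
# Here w = [2, 0] and b = 0, so the boundary is the vertical line x_1 = 0.
```

In this toy setup the decision depends only on the sign of the first coordinate, as one would expect from the symmetry of the two classes.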



Parameter Estimation via MLE

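$$p(x, t \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi \cdot \mathcal{N}(x_n \mid \mu_1, \Sigma) \right]^{t_n} \cdot \left[ (1 - \pi) \cdot \mathcal{N}(x_n \mid \mu_2, \Sigma) \right]^{1 - t_n}$$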



Taking logarithms:

$$\ln p(x, t \mid \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \Big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) + (1 - t_n) \ln \mathcal{N}(x_n \mid \mu_2, \Sigma) \Big]$$



Focus on π

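The terms of the log-likelihood that depend on $\pi$ are

$$\sum_{n=1}^{N} \Big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \Big]$$

Take gradients and set to zero:

$$\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$

where $N_1$ and $N_2$ are the numbers of points in classes $C_1$ and $C_2$.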
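As a worked step, the stationarity condition for $\pi$ can be written out as follows, using $N_1 = \sum_n t_n$ and $N_2 = N - N_1$:

$$\frac{\partial}{\partial \pi} \sum_{n=1}^{N} \Big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \Big] = \frac{N_1}{\pi} - \frac{N_2}{1 - \pi} = 0 \;\Longrightarrow\; N_1 (1 - \pi) = N_2\, \pi \;\Longrightarrow\; \pi = \frac{N_1}{N_1 + N_2}$$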



Focus on µ1

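The terms of the log-likelihood that depend on $\mu_1$ are

$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma)$$

Take gradients and set to zero:

$$\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n$$

A similar calculation for $\mu_2$ gives

$$\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n$$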
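The gradient step for $\mu_1$ can be spelled out using the fact that the gradient of $-\frac{1}{2}(x_n - \mu_1)^\top \Sigma^{-1} (x_n - \mu_1)$ with respect to $\mu_1$ is $\Sigma^{-1}(x_n - \mu_1)$ for symmetric $\Sigma$:

$$\nabla_{\mu_1} \sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) = \Sigma^{-1} \sum_{n=1}^{N} t_n (x_n - \mu_1) = 0 \;\Longrightarrow\; \mu_1 = \frac{\sum_{n} t_n x_n}{\sum_{n} t_n} = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n$$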



Focus on Σ

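The terms of the log-likelihood that depend on $\Sigma$ are

$$\sum_{n=1}^{N} \Big[ t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) + (1 - t_n) \ln \mathcal{N}(x_n \mid \mu_2, \Sigma) \Big]$$

Maximizing with respect to $\Sigma$ gives

$$\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$$

$$S_1 = \frac{1}{N_1} \sum_{n:\, t_n = 1} (x_n - \mu_1)(x_n - \mu_1)^\top$$

$$S_2 = \frac{1}{N_2} \sum_{n:\, t_n = 0} (x_n - \mu_2)(x_n - \mu_2)^\top$$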
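The four maximum-likelihood estimates can be collected into one short NumPy routine. This is a sketch under stated assumptions: the name `fit_shared_covariance_gaussians` is hypothetical, the inputs are stacked as rows of a matrix `X`, and both classes are assumed to contain at least one point.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """MLE of (pi, mu1, mu2, Sigma) for two Gaussians with a shared covariance.

    X: (N, D) array with one input per row.
    t: (N,) array of labels, 1 for class C1 and 0 for class C2.
    Assumes both classes are non-empty.
    """
    t = np.asarray(t, dtype=float)
    N = X.shape[0]
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t @ X) / N1          # mean of the points with t_n = 1
    mu2 = ((1.0 - t) @ X) / N2  # mean of the points with t_n = 0
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1         # per-class covariance of C1
    S2 = d2.T @ d2 / N2         # per-class covariance of C2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma
```

The shared covariance is the class-size-weighted average of the two per-class covariances, which is the same as pooling all centered points and dividing by $N$.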


Generative Models Naive Bayes

Class-Conditional Distribution

For simplicity, let each component $x_i \in \{0, 1\}$, and assume the components are conditionally independent given the class:

$$p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{(1 - x_i)}$$



Decision Rule

Predict $C_1$ if

$$\sum_{i=1}^{D} \Big[ x_i \ln \mu_{1i} + (1 - x_i) \ln(1 - \mu_{1i}) \Big] + \ln p(C_1) > \sum_{i=1}^{D} \Big[ x_i \ln \mu_{2i} + (1 - x_i) \ln(1 - \mu_{2i}) \Big] + \ln p(C_2)$$



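Moving everything to the left-hand side:

$$\sum_{i=1}^{D} \left( x_i \ln \frac{\mu_{1i}}{\mu_{2i}} + (1 - x_i) \ln\left( \frac{1 - \mu_{1i}}{1 - \mu_{2i}} \right) \right) + \ln\left( \frac{\pi}{1 - \pi} \right) > 0$$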



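Collecting the terms that multiply $x_i$ shows that the rule is again linear in $x$:

$$\underbrace{\sum_{i=1}^{D} x_i \cdot \ln \frac{\mu_{1i} \cdot (1 - \mu_{2i})}{\mu_{2i} \cdot (1 - \mu_{1i})}}_{w^\top x} + \underbrace{\sum_{i=1}^{D} \ln\left( \frac{1 - \mu_{1i}}{1 - \mu_{2i}} \right) + \ln\left( \frac{\pi}{1 - \pi} \right)}_{b} > 0$$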
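The linear form of the Naive Bayes rule can be sketched as follows. The name `naive_bayes_linear_params` is an illustrative assumption, and every Bernoulli parameter is assumed to lie strictly between 0 and 1 so that all logarithms are finite.

```python
import numpy as np

def naive_bayes_linear_params(pi, mu1, mu2):
    """Return (w, b) for the Bernoulli Naive Bayes rule 'predict C1 if w @ x + b > 0'.

    mu1[i] = p(x_i = 1 | C1) and mu2[i] = p(x_i = 1 | C2), each strictly in (0, 1).
    """
    # Per-feature weight: log odds ratio between the two classes.
    w = np.log(mu1 * (1.0 - mu2)) - np.log(mu2 * (1.0 - mu1))
    # Offset: the x-independent terms summed over features, plus the prior log-odds.
    b = np.sum(np.log(1.0 - mu1) - np.log(1.0 - mu2)) + np.log(pi / (1.0 - pi))
    return w, b
```

For a binary feature vector `x`, the quantity `w @ x + b` equals the difference of the two log-joint scores from the unsimplified decision rule.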


Discriminative Classifiers


Rewriting the Model

$$p(C_1 \mid x) = \frac{p(x \mid C_1) \cdot p(C_1)}{p(x)}$$



Expanding $p(x)$ as a sum over the two classes:

$$p(C_1 \mid x) = \frac{p(x \mid C_1) \cdot p(C_1)}{p(x \mid C_1) \cdot p(C_1) + p(x \mid C_2) \cdot p(C_2)}$$



In terms of exponentials:

$$p(C_1 \mid x) = \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)}$$

where $a_k = \ln\big( p(x \mid C_k) \cdot p(C_k) \big)$



Dividing numerator and denominator by $\exp(a_1)$:

$$p(C_1 \mid x) = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

where $a = a_1 - a_2$
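The equivalence between the two-class normalized form and the sigmoid can be checked numerically. The helper names `posterior_two_class` and `sigmoid` below are illustrative; both are written to avoid overflow for large arguments.

```python
import math

def posterior_two_class(a1, a2):
    """p(C1 | x) = exp(a1) / (exp(a1) + exp(a2))."""
    m = max(a1, a2)  # subtracting the max leaves the ratio unchanged
    e1, e2 = math.exp(a1 - m), math.exp(a2 - m)
    return e1 / (e1 + e2)

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)), evaluated without overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

# The two expressions agree up to floating-point rounding.
a1, a2 = -3.2, -4.1
gap = abs(posterior_two_class(a1, a2) - sigmoid(a1 - a2))
```

Only the difference $a_1 - a_2$ matters: adding the same constant to both $a_1$ and $a_2$ leaves the posterior unchanged.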



Key Idea

Recall that in the Gaussian case with $\Sigma_i = \Sigma$

$$a = \ln \frac{p(x \mid C_1) \cdot p(C_1)}{p(x \mid C_2) \cdot p(C_2)}$$



$$a = \ln p(x \mid C_1) + \ln p(C_1) - \ln p(x \mid C_2) - \ln p(C_2)$$



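$$a = \underbrace{(\mu_1 - \mu_2)^\top \Sigma^{-1}}_{w^\top} \cdot x + b = \underbrace{\mu_1^\top \Sigma^{-1}}_{w_1^\top} \cdot x - \underbrace{\mu_2^\top \Sigma^{-1}}_{w_2^\top} \cdot x + b$$

where

$$b = \ln\left( \frac{\pi}{1 - \pi} \right) + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1$$

Why not model $a$ directly as $w^\top x + b$ for some arbitrary $w$?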
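Modeling $a$ directly means treating $w$ and $b$ as free parameters and computing $p(C_1 \mid x) = \sigma(w^\top x + b)$. The sketch below shows only this prediction step; the function name `predict_proba_c1` is hypothetical, and how $w$ and $b$ would be fitted is not covered here.

```python
import numpy as np

def predict_proba_c1(X, w, b):
    """p(C1 | x) = sigma(w @ x + b) for each row x of the (N, D) array X."""
    a = X @ w + b
    return 1.0 / (1.0 + np.exp(-a))

# Any (w, b) defines a valid model. As one example, these values are the ones
# the shared-covariance Gaussian model yields for means (+1, 0) and (-1, 0),
# identity covariance, and equal priors.
w = np.array([2.0, 0.0])
b = 0.0
probs = predict_proba_c1(np.array([[0.0, 0.0], [1.0, 3.0]]), w, b)
```

With these particular values the output matches the Gaussian posterior exactly; with other values of `w` and `b` the model is still a valid classifier, but it no longer corresponds to any specific pair of class-conditional densities.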



Questions?

