CS 2750 Machine Learning
Lecture 8
Milos Hauskrecht
[email protected]
Sennott Square
Classification: Logistic regression.
Generative classification model.
Binary classification

• Two classes: $Y = \{0, 1\}$
• Our goal is to learn to classify correctly two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn $f: X \to \{0, 1\}$
• Zero-one error (loss) function:
$$\mathrm{Error}_1(f(\mathbf{x}_i, \mathbf{w}), y_i) = \begin{cases} 1 & \text{if } f(\mathbf{x}_i, \mathbf{w}) \neq y_i \\ 0 & \text{if } f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$$
• Error we would like to minimize: $E_{(\mathbf{x}, y)}\left[\mathrm{Error}_1(f(\mathbf{x}, \mathbf{w}), y)\right]$
• First step: we need to devise a model of the function $f$
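The zero-one loss is easy to state directly in code. A minimal numpy sketch (the labels and predictions below are toy values chosen for illustration):

```python
import numpy as np

def zero_one_error(y_pred, y_true):
    """Zero-one loss: 1 where the prediction disagrees with the label, else 0."""
    return (y_pred != y_true).astype(int)

# Averaging the loss over a sample approximates the expected error E[Error_1].
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(zero_one_error(y_pred, y_true).mean())  # 0.25
```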
Discriminant functions

• One way to represent a classifier is by using
  – Discriminant functions
• Works for binary and multi-way classification
• Idea:
  – For every class $i = 0, 1, \ldots, k$ define a function $g_i(\mathbf{x})$ mapping $X \to \mathbb{R}$
  – When the decision on input $\mathbf{x}$ should be made, choose the class with the highest value of $g_i(\mathbf{x})$ (a sketch of this rule follows below)
• So what happens with the input space? Assume a binary case.
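As a sketch of the decision rule above, with two hypothetical linear discriminants standing in for learned ones:

```python
import numpy as np

def decide(x, discriminants):
    """Return the index of the class whose discriminant scores x highest."""
    return int(np.argmax([g(x) for g in discriminants]))

# Toy discriminant functions g_0, g_1 for illustration only.
g0 = lambda x: -x[0] + x[1]
g1 = lambda x: x[0] - x[1]
print(decide(np.array([1.0, 0.0]), [g0, g1]))  # prints 1
```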
Discriminant functions

[Figure: two-dimensional input space with data points in the region where $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$]
Discriminant functions

[Figure: the input space splits into a region where $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ and a region where $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$]
Discriminant functions

[Figure: both classes of data points shown, with the regions $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ and $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$ labeled]
Discriminant functions

• Define decision boundary: the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$

[Figure: the regions $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ and $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$, separated by the decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$]
Quadratic decision boundary

[Figure: a quadratic decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ separating the region $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ from the region $g_1(\mathbf{x}) \leq g_0(\mathbf{x})$]
Logistic regression model

• Defines a linear decision boundary
• Discriminant functions:
  $$g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x}) \qquad g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function
• Model output: $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$

[Figure: network diagram of the model — input vector $\mathbf{x} = (x_1, \ldots, x_d)$ plus a constant input 1, weights $w_0, w_1, \ldots, w_d$, a summation unit producing $z = \mathbf{w}^T \mathbf{x}$, and a logistic unit producing $f(\mathbf{x}, \mathbf{w})$]
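A minimal numpy sketch of the model; it assumes the input vector is augmented with a leading constant 1 so that w[0] plays the role of the bias weight $w_0$ in the diagram:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w):
    """f(x, w) = g_1(x) = g(w^T x); the other discriminant is g_0(x) = 1 - f(x, w)."""
    return logistic(w @ x)

w = np.array([0.5, 2.0, -1.0])  # w_0 (bias), w_1, w_2 -- arbitrary example weights
x = np.array([1.0, 0.3, 0.8])   # the leading 1 is the constant bias input
print(f(x, w))                  # output lies in (0, 1)
```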
Logistic function

$$g(z) = \frac{1}{1 + e^{-z}}$$

• Is also referred to as a sigmoid function
• Replaces the threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0,1]

[Figure: plot of the logistic function — an S-shaped curve rising from 0 to 1, crossing 0.5 at $z = 0$]
Logistic regression model

• Discriminant functions:
  $$g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x}) \qquad g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$$
• Values of discriminant functions vary in [0,1]
  – Probabilistic interpretation:
  $$f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$$

[Figure: the same network diagram, with the output now interpreted as $p(y = 1 \mid \mathbf{x}, \mathbf{w})$]
Logistic regression

• We learn a probabilistic function $f: X \to [0, 1]$
  – where $f$ describes the probability of class 1 given $\mathbf{x}$:
  $$f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$$
• Note that: $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$
• Transformation to binary class values:
  If $p(y = 1 \mid \mathbf{x}) \geq 1/2$ then choose 1, else choose 0
Linear decision boundary

• Logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare two discriminant functions.
• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$
• For the boundary it must hold:
$$\log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})} = \log \frac{g(\mathbf{w}^T \mathbf{x})}{1 - g(\mathbf{w}^T \mathbf{x})} = 0$$
$$\log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})} = \log \frac{\dfrac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}}{\dfrac{\exp(-\mathbf{w}^T \mathbf{x})}{1 + \exp(-\mathbf{w}^T \mathbf{x})}} = \log \exp(\mathbf{w}^T \mathbf{x}) = \mathbf{w}^T \mathbf{x} = 0$$
• So the boundary is the set $\mathbf{w}^T \mathbf{x} = 0$, which is linear in $\mathbf{x}$.
Logistic regression model. Decision boundary

• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: blue and red data points in the plane separated by a linear decision boundary]
Logistic regression: parameter learning

Likelihood of outputs
• Let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and
  $$\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$$
• Then
  $$L(\mathbf{w}, D) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$
• Find weights $\mathbf{w}$ that maximize the likelihood of outputs
  – Apply the log-likelihood trick. The optimal weights are the same for both the likelihood and the log-likelihood:
  $$l(\mathbf{w}, D) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$$
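A sketch of this log-likelihood in numpy, assuming the rows of X are already augmented with the bias input 1; the small eps guard is a numerical convenience, not part of the model:

```python
import numpy as np

def log_likelihood(X, y, w):
    """l(w, D) = sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ],
    where mu_i = g(w^T x_i)."""
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12  # avoid log(0) when mu saturates at 0 or 1
    return np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
```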
Logistic regression: parameter learning

• Log-likelihood:
  $$l(\mathbf{w}, D) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$$
• Derivatives of the log-likelihood:
  $$-\nabla_{\mathbf{w}} \, l(\mathbf{w}, D) = -\sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T \mathbf{x}_i) \right) = -\sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)$$
  $$\frac{\partial}{\partial w_j} \left[ -l(\mathbf{w}, D) \right] = -\sum_{i=1}^{n} x_{i,j} \left( y_i - g(z_i) \right)$$
  Nonlinear in weights!
• Gradient descent:
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k) \, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(k-1)}}$$
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} \mathbf{x}_i \left[ y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}) \right]$$
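A minimal sketch of the batch update above; a constant learning rate stands in for the schedule $\alpha(k)$, which is an assumption of this sketch:

```python
import numpy as np

def fit_logistic_gd(X, y, alpha=0.1, n_steps=1000):
    """Gradient descent on -l(D, w), i.e. ascent on the log-likelihood:
    w <- w + alpha * sum_i x_i (y_i - g(w^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-X @ w))  # mu_i = f(x_i, w)
        w += alpha * (X.T @ (y - mu))      # full-batch gradient step
    return w
```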
Logistic regression. Online gradient descent

• On-line component of the log-likelihood (as an error to minimize):
  $$J_{\text{online}}(D_i, \mathbf{w}) = -\left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right]$$
• On-line learning update for weight $\mathbf{w}$, with $D_k = \langle \mathbf{x}_k, y_k \rangle$:
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k) \, \nabla_{\mathbf{w}} J_{\text{online}}(D_k, \mathbf{w}) \Big|_{\mathbf{w}^{(k-1)}}$$
• $k$th update for the logistic regression:
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \left[ y_k - f(\mathbf{x}_k, \mathbf{w}^{(k-1)}) \right] \mathbf{x}_k$$
Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, \ldots, w_d)$
  for i = 1 : number of iterations
    do select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from D
       set $\alpha = 1/i$
       update weights (in parallel):
       $\mathbf{w} \leftarrow \mathbf{w} + \alpha(i) \left[ y_i - f(\mathbf{x}_i, \mathbf{w}) \right] \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$
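A direct transcription of the pseudocode into Python, with the $\alpha = 1/i$ schedule above; "select a data point" is read here as cycling through D in order, which is one reasonable interpretation:

```python
import numpy as np

def online_logistic_regression(X, y, n_iterations):
    """Online updates: w <- w + alpha(i) [y_i - f(x_i, w)] x_i."""
    w = np.zeros(X.shape[1])
    for i in range(1, n_iterations + 1):
        k = (i - 1) % len(y)                  # cycle through the data set
        alpha = 1.0 / i                       # annealed learning rate
        mu = 1.0 / (1.0 + np.exp(-w @ X[k]))  # f(x_k, w)
        w += alpha * (y[k] - mu) * X[k]
    return w
```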
Online algorithm. Example.

[Figures: snapshots from an example run of the online logistic regression algorithm, shown over three slides]
Derivation of the gradient

• Log-likelihood:
  $$l(\mathbf{w}, D) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$$
• Derivatives of the log-likelihood:
  $$\nabla_{\mathbf{w}} \, l(\mathbf{w}, D) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T \mathbf{x}_i) \right) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)$$
• Via the chain rule:
  $$\frac{\partial}{\partial w_j} l(\mathbf{w}, D) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$$
  $$\frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log (1 - g(z_i)) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$$
• Derivative of a logistic function:
  $$\frac{\partial g(z_i)}{\partial z_i} = g(z_i) \left( 1 - g(z_i) \right)$$
• Substituting:
  $$= y_i \left( 1 - g(z_i) \right) - (1 - y_i) \, g(z_i) = y_i - g(z_i)$$
• And finally:
  $$\frac{\partial z_i}{\partial w_j} = x_{i,j}$$
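The derivation can be checked numerically: the analytic gradient $\sum_i \mathbf{x}_i (y_i - g(z_i))$ should agree with a finite-difference gradient of $l(\mathbf{w}, D)$. A sketch (best run at moderate weights, where the log terms stay finite):

```python
import numpy as np

def grad_check(X, y, w, h=1e-6):
    """Compare the analytic gradient with central finite differences."""
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    analytic = X.T @ (y - mu)  # sum_i x_i (y_i - g(z_i))
    def ll(v):
        m = 1.0 / (1.0 + np.exp(-X @ v))
        return np.sum(y * np.log(m) + (1 - y) * np.log(1 - m))
    numeric = np.array([(ll(w + h * e) - ll(w - h * e)) / (2 * h)
                        for e in np.eye(len(w))])
    return np.max(np.abs(analytic - numeric))  # should be tiny
```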
Generative approach to classification

Idea:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Use it to define probabilistic discriminant functions
   E.g. $g_0(\mathbf{x}) = p(y = 0 \mid \mathbf{x})$, $g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$

Typical model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y) \, p(y)$
• $p(\mathbf{x} \mid y)$ = class-conditional distributions (densities)
  binary classification: two class-conditional distributions $p(\mathbf{x} \mid y = 0)$ and $p(\mathbf{x} \mid y = 1)$
• $p(y)$ = priors on classes – probability of class $y$
  binary classification: Bernoulli distribution, $p(y = 0) + p(y = 1) = 1$

[Figure: graphical model with node $y$ pointing to node $\mathbf{x}$]
Generative approach to classification

Example:
• Class-conditional distributions
  – multivariate normal distributions $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:
  $$\mathbf{x} \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \ \text{for} \ y = 0 \qquad \mathbf{x} \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \ \text{for} \ y = 1$$
  Multivariate normal density:
  $$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$
• Priors on classes (class 0, 1)
  – Bernoulli distribution, $y \sim \text{Bernoulli}(\theta)$, $y \in \{0, 1\}$:
  $$p(y, \theta) = \theta^y (1 - \theta)^{1-y}$$
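A sketch of evaluating the two class-conditional densities, using scipy's multivariate normal (scipy assumed available; the parameter values are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters of the two class-conditional Gaussians.
mu0, Sigma0 = np.array([0.0, 0.0]), np.eye(2)
mu1, Sigma1 = np.array([1.5, 1.5]), np.eye(2)

x = np.array([1.0, 1.0])
p_x_given_0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma0)  # p(x | y = 0)
p_x_given_1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma1)  # p(x | y = 1)
print(p_x_given_0, p_x_given_1)
```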
Learning of parameters of the model

Density estimation in statistics
• We see examples – we do not know the parameters of the Gaussians (class-conditional densities):
  $$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$
• ML estimate of parameters of a multivariate normal $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for a set of $n$ examples of $\mathbf{x}$. Optimize the log-likelihood:
  $$l(D, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \log \prod_{i=1}^{n} p(\mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
  $$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T$$
• How about class priors?
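A sketch of the ML estimates above; the class-prior function answers the slide's closing question with the standard Bernoulli ML result (the empirical class frequency):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates for a multivariate normal fitted to the rows of X."""
    mu = X.mean(axis=0)       # mu_hat = (1/n) sum_i x_i
    C = X - mu
    Sigma = C.T @ C / len(X)  # Sigma_hat = (1/n) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma

def ml_prior(y):
    """Standard Bernoulli ML estimate: theta_hat = (# examples with y_i = 1) / n."""
    return y.mean()
```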
Generative model

[Figure: data points generated from the model, with the region where $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$ marked]
2 Gaussian class-conditional densities

[Figure: the two Gaussian class-conditional densities]
Making class decision

Basically we need to design discriminant functions $g_1(\mathbf{x})$, $g_0(\mathbf{x})$. Two possible choices:
• Likelihood of data – choose the class (Gaussian) that explains the input data $\mathbf{x}$ better (likelihood of the data):
  If $p(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) > p(\mathbf{x} \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ then $y = 1$, else $y = 0$
• Posterior of a class – choose the class with the better posterior probability:
  If $p(y = 1 \mid \mathbf{x}) > p(y = 0 \mid \mathbf{x})$ then $y = 1$, else $y = 0$
  where
  $$p(y = 1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \, p(y = 1)}{p(\mathbf{x} \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \, p(y = 0) + p(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \, p(y = 1)}$$
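A sketch combining the pieces: the posterior decision rule with Gaussian class-conditionals and a Bernoulli prior (scipy assumed; all parameters are inputs):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_decision(x, mu0, Sigma0, mu1, Sigma1, theta):
    """Choose y = 1 iff p(y=1 | x) > p(y=0 | x). The shared normalizer
    cancels, so it suffices to compare the joint terms p(x | y) p(y)."""
    joint0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma0) * (1 - theta)
    joint1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma1) * theta
    return int(joint1 > joint0)
```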
2 Gaussians: Quadratic decision boundary

[Figure: contours of the two class-conditional densities with different covariance matrices]
2 Gaussians: Quadratic decision boundary

[Figure: the resulting quadratic decision boundary between the two classes]
2 Gaussians: Linear decision boundary

• When covariances are the same:
  $$\mathbf{x} \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) \ \text{for} \ y = 0 \qquad \mathbf{x} \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \ \text{for} \ y = 1$$
2 Gaussians: Linear decision boundary

[Figure: contours of the two class-conditional densities sharing the same covariance matrix]
2 Gaussians: Linear decision boundary

[Figure: the resulting linear decision boundary between the two classes]