Non-Bayes classifiers: linear discriminants, neural networks.
Discriminant functions (1)

Bayes classification rule:

$$P(w_1|x) - P(w_2|x) \gtrless 0 \;\Rightarrow\; x \in w_1 \text{ or } w_2$$

Instead, we might try to find a function

$$f_{w_1,w_2}(x) \gtrless 0 \;\Rightarrow\; x \in w_1 \text{ or } w_2$$

$f_{w_1,w_2}(x)$ is called a discriminant function.

$\{x \mid f_{w_1,w_2}(x) = 0\}$ is the decision surface.
Discriminant functions (2)

[Figure: two 2-D examples of samples from Class 1 and Class 2 separated by a line.]

Linear discriminant function:

$$f_{w_1,w_2}(x) = w^T x + w_0$$

The decision surface is the hyperplane $w^T x + w_0 = 0$.
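As a concrete illustration, here is a minimal sketch of classifying points with a linear discriminant; the weights and sample points are arbitrary, chosen only for the example.

```python
import numpy as np

# Arbitrary weights of a 2-D linear discriminant f(x) = w^T x + w0.
w = np.array([1.0, -2.0])
w0 = 0.5

def classify(x):
    """Assign x to class 1 if f(x) > 0, otherwise to class 2."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([3.0, 1.0])))  # f = 3 - 2 + 0.5 = 1.5 > 0 -> class 1
print(classify(np.array([0.0, 1.0])))  # f = 0 - 2 + 0.5 = -1.5 < 0 -> class 2
```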
Linear discriminant – perceptron cost function

Replace $w$ with $[w^T, w_0]^T$ and $x$ with $[x^T, 1]^T$.

Thus now the decision function is $f_{w_1,w_2}(x) = w^T x$ and the decision surface is $w^T x = 0$.

Perceptron cost function:

$$J(w) = \sum_x \delta_x w^T x$$

where

$$\delta_x = \begin{cases} -1 & \text{if } x \in w_1 \text{ and } w^T x < 0 \\ +1 & \text{if } x \in w_2 \text{ and } w^T x > 0 \\ 0 & \text{if } x \text{ is correctly classified} \end{cases}$$
Linear discriminant – perceptron cost function

Perceptron cost function:

$$J(w) = \sum_x \delta_x w^T x$$

[Figure: samples of Class 1 and Class 2 on either side of the decision surface.]

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$. Otherwise $J(w) > 0$, and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use the gradient descent algorithm.
Linear discriminant – Perceptron algorithm

Gradient descent:

$$w(t+1) = w(t) - \rho_t \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(t)}$$

At points where $J(w)$ is differentiable,

$$\frac{\partial J(w)}{\partial w} = \sum_{x\ \text{misclassified}} \delta_x x$$

Thus

$$w(t+1) = w(t) - \rho_t \sum_{x\ \text{misclassified}} \delta_x x$$

The perceptron algorithm converges when the classes are linearly separable, under some conditions on $\rho_t$.
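Below is a minimal sketch of this algorithm, assuming a fixed learning rate `rho` and the samples already split into two arrays; the function name and defaults are illustrative, not part of the original slides.

```python
import numpy as np

def perceptron(X1, X2, rho=0.1, max_iter=1000):
    """Perceptron algorithm for two classes.

    X1, X2: arrays of shape (n1, d) and (n2, d) holding the samples of
    class 1 and class 2. Samples are augmented with a constant 1 so the
    bias w0 becomes part of w and the decision surface is w^T x = 0.
    """
    X = np.vstack([np.hstack([X1, np.ones((len(X1), 1))]),
                   np.hstack([X2, np.ones((len(X2), 1))])])
    # delta_x = -1 for class-1 samples, +1 for class-2 samples.
    delta = np.concatenate([-np.ones(len(X1)), np.ones(len(X2))])

    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ w
        # Misclassified: class 1 needs w^T x > 0, class 2 needs w^T x < 0.
        mis = ((delta < 0) & (scores <= 0)) | ((delta > 0) & (scores >= 0))
        if not mis.any():
            break  # J(w) = 0: the classes are perfectly separated
        # w(t+1) = w(t) - rho * sum over misclassified x of delta_x * x
        w -= rho * (delta[mis, None] * X[mis]).sum(axis=0)
    return w
```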
Sum of error squares estimation

Want to find a discriminant function $f_{w_1,w_2}(x) = w^T x$ whose output is similar to the desired one.

Let $y(x)$ denote the desired output function: $y(x) = 1$ for one class and $y(x) = -1$ for the other.

Use the sum of error squares as the similarity criterion:

$$J(w) = \sum_{i=1}^{N} (y_i - x_i^T w)^2, \qquad \hat{w} = \arg\min_w J(w)$$
Sum of error squares estimation

Minimize the mean square error:

$$\frac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i (y_i - x_i^T w) = 0$$

Thus

$$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i$$
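A minimal sketch of this closed-form estimate; it assumes the rows of `X` are the (augmented) samples and `y` holds the desired ±1 outputs.

```python
import numpy as np

def lse_discriminant(X, y):
    """Sum-of-error-squares estimate: w = (sum_i x_i x_i^T)^{-1} sum_i x_i y_i.

    X: (N, d) matrix whose rows are the samples x_i (append a constant-1
       column if a bias term w0 is wanted).
    y: (N,) vector of desired outputs, +1 for one class, -1 for the other.
    """
    # X.T @ X equals sum_i x_i x_i^T and X.T @ y equals sum_i x_i y_i.
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit inverse; the matrix $\sum_i x_i x_i^T$ must be nonsingular for the estimate to exist.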
Neurons

Artificial neuron.

[Figure: inputs $x_1, x_2, \ldots, x_l$ with weights $w_1, w_2, \ldots, w_l$, a bias weight $w_0$, and a threshold function $f$.]

The figure above represents an artificial neuron calculating:

$$y = f\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$$
Artificial neuron. Threshold functions f:

Step function:

$$f(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$$

Logistic function:

$$f(x) = \frac{1}{1 + e^{-ax}}$$
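A minimal sketch of the two threshold functions and the neuron computation above; the slope parameter `a` of the logistic is a free choice.

```python
import numpy as np

def step(v):
    """Step threshold: 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def logistic(v, a=1.0):
    """Logistic threshold: f(v) = 1 / (1 + exp(-a*v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron(x, w, w0, f=logistic):
    """Artificial neuron: y = f(sum_i w_i * x_i + w0)."""
    return f(w @ x + w0)
```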
Combining artificial neurons

[Figure: inputs $x_1, x_2, \ldots, x_l$ feeding a multilayer perceptron with 3 layers.]
Discriminating ability of multilayer perceptron

Since a 3-layer perceptron can approximate any smooth function, it can approximate the optimal discriminant function of two classes:

$$F(x) = P(w_1|x) - P(w_2|x)$$
Training of multilayer perceptron

[Figure: neurons of layer r-1 and layer r. The outputs $y_k^{r-1}$ of layer r-1 are weighted by $w_{jk}^r$ to form the activation $v_j^r$ of neuron j in layer r, whose output is $y_j^r = f(v_j^r)$.]
Training and cost function

Desired network output: $x(i) \to y(i)$

Trained network output: $x(i) \to \hat{y}(i)$

Cost function for one training sample:

$$E(i) = \frac{1}{2} \sum_{m=1}^{k_L} (y_m(i) - \hat{y}_m(i))^2$$

Total cost function:

$$J = \sum_{i=1}^{N} E(i)$$

Goal of the training: find the values of $w_{jk}^r$ which minimize the cost function $J$.
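These two cost functions translate directly into code; a minimal sketch, assuming `Y` and `Y_hat` stack the desired and actual outputs, one row per training sample:

```python
import numpy as np

def sample_cost(y, y_hat):
    """E(i) = 1/2 * sum_m (y_m(i) - y_hat_m(i))^2 for one training sample."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def total_cost(Y, Y_hat):
    """J = sum_{i=1}^{N} E(i) over all training samples."""
    return sum(sample_cost(y, y_hat) for y, y_hat in zip(Y, Y_hat))
```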
Gradient descent

Denote:

$$w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{jk_{r-1}}^r]^T$$

Gradient descent:

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \frac{\partial J}{\partial w_j^r}$$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \frac{\partial E(i)}{\partial w_j^r}$$
Gradient descent

Chain rule for differentiating composite functions:

$$\frac{\partial E(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \frac{\partial v_j^r(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)}\, y^{r-1}(i)$$

Denote:

$$\delta_j^r(i) = \frac{\partial E(i)}{\partial v_j^r(i)}$$
Backpropagation

If r = L, then

$$\delta_j^L(i) = \frac{\partial E(i)}{\partial v_j^L(i)} = \frac{\partial}{\partial v_j^L(i)}\, \frac{1}{2} \sum_{m=1}^{k_L} (f(v_m^L(i)) - y_m(i))^2 = (\hat{y}_j(i) - y_j(i))\, f'(v_j^L(i)) = e_j(i)\, f'(v_j^L(i))$$

If r < L, then

$$\delta_j^{r-1}(i) = \frac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_k \frac{\partial E(i)}{\partial v_k^r(i)} \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \left[ \sum_k \delta_k^r(i)\, w_{kj}^r \right] f'(v_j^{r-1}(i))$$
Backpropagation algorithm

• Initialization: initialize all weights with random values.
• Forward computations: for each training vector x(i) compute all $v_j^r(i)$, $y_j^r(i)$.
• Backward computations: for each i, j and r = L, L-1, ..., 2 compute $\delta_j^{r-1}(i)$.
• Update weights:

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \sum_{i=1}^{N} \delta_j^r(i)\, y^{r-1}(i)$$
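Below is a minimal sketch of these four steps for a network with a single hidden layer and logistic thresholds; the layer size, learning rate, epoch count, and the batch update over all N samples are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def logistic(v):
    """Logistic threshold function with a = 1."""
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp(X, Y, hidden=8, rho=0.1, epochs=1000, seed=0):
    """Backpropagation for an MLP with one hidden layer.

    X: (N, d) training inputs; Y: (N, k_L) desired outputs.
    """
    rng = np.random.default_rng(seed)
    # Initialization: small random weights; the +1 column handles biases.
    W1 = rng.normal(0.0, 0.1, (hidden, X.shape[1] + 1))
    W2 = rng.normal(0.0, 0.1, (Y.shape[1], hidden + 1))

    Xa = np.hstack([X, np.ones((len(X), 1))])  # augmented inputs
    for _ in range(epochs):
        # Forward computations: activations v and outputs y for each layer.
        v1 = Xa @ W1.T                          # (N, hidden)
        y1 = np.hstack([logistic(v1), np.ones((len(X), 1))])
        v2 = y1 @ W2.T                          # (N, k_L)
        y2 = logistic(v2)

        # Backward computations: delta^L, then delta^{L-1}.
        d2 = (y2 - Y) * y2 * (1.0 - y2)         # e_j(i) * f'(v_j^L), f' = y(1-y)
        h = y1[:, :-1]                          # hidden outputs without bias
        d1 = (d2 @ W2[:, :-1]) * h * (1.0 - h)  # [sum_k delta_k w_kj] * f'

        # Update weights: w <- w - rho * sum_i delta(i) * y^{r-1}(i).
        W2 -= rho * d2.T @ y1
        W1 -= rho * d1.T @ Xa
    return W1, W2
```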
MLP issues

• What is the best network configuration?
• How should the learning parameter $\rho$ be chosen?
• When should training be stopped?
• Should a different threshold function f or cost function J be chosen?