INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
Lecture Slides for
CHAPTER 11:
Multilayer Perceptrons
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Neural Networks
Networks of processing units (neurons) with connections (synapses) between them
Large number of neurons: 10^10
Large connectivity: 10^5 connections per neuron
Parallel processing
Distributed computation/memory
Robust to noise and failures
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: from hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: update by training/experience
Perceptron
(Rosenblatt, 1962)
Output of a perceptron with inputs $x_j$ and connection weights $w_j$:

$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$

where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$ ($x_0 = +1$ is the bias unit).
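The dot product above can be sketched in a few lines. This is a minimal illustration; `perceptron_output` is a hypothetical helper name, and the weight and input values are arbitrary:

```python
def perceptron_output(w, x):
    """w = [w0, w1, ..., wd]; x = [x1, ..., xd].
    Returns w^T [1, x1, ..., xd], i.e. the bias w0 times the fixed input x0 = +1
    plus the weighted sum of the actual inputs."""
    augmented = [1.0] + list(x)
    return sum(wj * xj for wj, xj in zip(w, augmented))

# 0.5*1 + 2.0*1.0 + (-1.0)*3.0 = -0.5
y = perceptron_output([0.5, 2.0, -1.0], [1.0, 3.0])
```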
What a Perceptron Does
Regression: $y = wx + w_0$
Classification: $y = 1(wx + w_0 > 0)$

With the bias absorbed through an extra input $x_0 = +1$, the output is $y = \mathbf{w}^T \mathbf{x}$. For a probabilistic classification output, pass it through a sigmoid:

$y = \text{sigmoid}(\mathbf{w}^T \mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$

[Figure: the same unit drawn for regression, thresholded classification, and sigmoid output.]
K Outputs
Regression (linear outputs):

$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}, \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$

Classification (softmax outputs):

$o_i = \mathbf{w}_i^T \mathbf{x}, \qquad y_i = \dfrac{\exp o_i}{\sum_k \exp o_k}$

choose $C_i$ if $y_i = \max_k y_k$
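The softmax classifier above can be sketched as follows. `softmax_outputs` is a hypothetical helper; the weight values are arbitrary, and the max subtraction is a standard numerical-stability trick rather than part of the slide's formula:

```python
import math

def softmax_outputs(W, x):
    """W: K weight vectors, each storing its bias first; x: input vector.
    Computes o_i = w_i^T [1, x] and y_i = exp(o_i) / sum_k exp(o_k)."""
    aug = [1.0] + list(x)
    o = [sum(wj * xj for wj, xj in zip(w, aug)) for w in W]
    m = max(o)  # subtract the max before exponentiating for stability
    e = [math.exp(oi - m) for oi in o]
    s = sum(e)
    return [ei / s for ei in e]

# Both linear outputs are 0 here, so the two classes tie at y = [0.5, 0.5];
# the predicted class is argmax_i y_i.
y = softmax_outputs([[0.0, 1.0], [0.0, -1.0]], [0.0])
```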
Training
Online (instances seen one by one) vs batch (whole sample) learning:
No need to store the whole sample
Problem may change in time
Wear and degradation in system components
Stochastic gradient-descent: Update after a single pattern
Generic update rule (LMS rule):
Update = LearningFactor · (DesiredOutput − ActualOutput) · Input

$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t$
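The LMS rule above can be sketched as a single online update. `lms_update` is a hypothetical name; the numbers are illustrative:

```python
def lms_update(w, x, r, y, eta=0.1):
    """Delta w_j = eta * (r - y) * x_j, with x augmented by the bias input x0 = +1.
    Returns the updated weight vector."""
    aug = [1.0] + list(x)
    return [wj + eta * (r - y) * xj for wj, xj in zip(w, aug)]

# err = r - y = 1.0; w0 += 0.5*1.0*1 = 0.5, w1 += 0.5*1.0*2 = 1.0
w = lms_update([0.0, 0.0], [2.0], r=1.0, y=0.0, eta=0.5)
```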
Training a Perceptron: Regression
Regression (linear output):

$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left(r^t - \mathbf{w}^T \mathbf{x}^t\right)^2$

$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t$
Classification
Single sigmoid output:

$y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)$

$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - \left(1 - r^t\right) \log \left(1 - y^t\right)$

$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t$

K > 2 softmax outputs:

$y_i^t = \dfrac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}$

$E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$

$\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t$
Learning Boolean AND
XOR
No $w_0, w_1, w_2$ satisfy (Minsky and Papert, 1969):

$w_0 \le 0$  ($x_1 = 0, x_2 = 0 \Rightarrow y = 0$)
$w_2 + w_0 > 0$  ($x_1 = 0, x_2 = 1 \Rightarrow y = 1$)
$w_1 + w_0 > 0$  ($x_1 = 1, x_2 = 0 \Rightarrow y = 1$)
$w_1 + w_2 + w_0 \le 0$  ($x_1 = 1, x_2 = 1 \Rightarrow y = 0$)
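A quick numerical search illustrates the infeasibility (this is an illustration on a finite grid, not the proof; adding the two strict inequalities gives $w_1 + w_2 + 2w_0 > 0$, contradicting the sum of the other two):

```python
import itertools

def satisfies_xor(w0, w1, w2):
    """Check the four XOR constraints for one linear threshold unit."""
    return (w0 <= 0 and             # (0,0) -> 0
            w2 + w0 > 0 and         # (0,1) -> 1
            w1 + w0 > 0 and         # (1,0) -> 1
            w1 + w2 + w0 <= 0)      # (1,1) -> 0

# No triple on this coarse grid satisfies all four constraints.
grid = [k / 2.0 for k in range(-10, 11)]
found = any(satisfies_xor(w0, w1, w2)
            for w0, w1, w2 in itertools.product(grid, repeat=3))
# found stays False
```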
Multilayer Perceptrons
(Rumelhart et al., 1986)
$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \dfrac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \quad h = 1, \ldots, H$

$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
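The decomposition above can be implemented directly with two hidden threshold units and an OR output unit. The weights below are set by hand and are one of many valid choices, not values from the slides:

```python
def step(a):
    """Threshold activation: 1(a > 0)."""
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    """Hidden unit z1 computes x1 AND NOT x2, z2 computes NOT x1 AND x2;
    the output unit ORs them, giving XOR."""
    z1 = step(1.0 * x1 - 1.0 * x2 - 0.5)    # x1 AND NOT x2
    z2 = step(-1.0 * x1 + 1.0 * x2 - 0.5)   # NOT x1 AND x2
    return step(z1 + z2 - 0.5)              # z1 OR z2

# Reproduces XOR on all four inputs -> [0, 1, 1, 0]
results = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```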
Backpropagation
$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \dfrac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \quad h = 1, \ldots, H$

$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$

The error is backpropagated through the chain rule:

$\dfrac{\partial E}{\partial w_{hj}} = \dfrac{\partial E}{\partial y_i} \dfrac{\partial y_i}{\partial z_h} \dfrac{\partial z_h}{\partial w_{hj}}$
Regression:

$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$

Forward:

$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t)$

Backward:

$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$

$\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t$
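The forward and backward passes above can be sketched as a stochastic-gradient training loop. This is a minimal illustration under assumptions: `train_backprop` and `predict` are hypothetical helper names, the updates here are per pattern rather than summed over the sample, and the hyperparameter values are arbitrary:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(W, v, x):
    """Forward pass: y = v0 + sum_h v_h * sigmoid(w_h^T [1, x])."""
    aug = [1.0] + list(x)
    z = [sigmoid(sum(w * a for w, a in zip(row, aug))) for row in W]
    return v[0] + sum(vh * zh for vh, zh in zip(v[1:], z))

def train_backprop(X, R, H=2, eta=0.5, epochs=3000, seed=0):
    """Single-output regression MLP trained by stochastic gradient descent.
    W[h] stores [w_h0, ..., w_hd]; v stores [v_0, ..., v_H]."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(H)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(H + 1)]
    for _ in range(epochs):
        for x, r in zip(X, R):
            aug = [1.0] + list(x)
            z = [sigmoid(sum(w * a for w, a in zip(row, aug))) for row in W]
            y = v[0] + sum(vh * zh for vh, zh in zip(v[1:], z))
            err = r - y
            # Backward: Delta w_hj = eta*err*v_h*z_h*(1-z_h)*x_j,
            # Delta v_h = eta*err*z_h, Delta v_0 = eta*err
            for h in range(H):
                g = eta * err * v[h + 1] * z[h] * (1 - z[h])
                W[h] = [w + g * a for w, a in zip(W[h], aug)]
            v = [v[0] + eta * err] + [v[h + 1] + eta * err * z[h]
                                      for h in range(H)]
    return W, v

# Fit two points (x=0 -> 0, x=1 -> 1)
W, v = train_backprop([[0.0], [1.0]], [0.0, 1.0])
```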
Regression with Multiple Outputs
[Figure: network with inputs $x_j$, first-layer weights $w_{hj}$, hidden units $z_h$, second-layer weights $v_{ih}$, outputs $y_i$.]

$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2$

$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$

$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$

$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$
[Figure: hidden unit activations $z_h = \text{sigmoid}(w_h x + w_{h0})$ and their weighted contributions $v_h z_h$ summed to form the output.]
Two-Class Discrimination
One sigmoid output $y^t$ estimates $P(C_1 \mid \mathbf{x}^t)$; then $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$

$y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$

$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log \left(1 - y^t\right)$

$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$

$\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t$
K>2 Classes
$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$

$y_i^t = \dfrac{\exp o_i^t}{\sum_k \exp o_k^t} = \hat{P}(C_i \mid \mathbf{x}^t)$

$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$

$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$

$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$
Multiple Hidden Layers
MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1$

$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2$

$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$
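The two-hidden-layer forward pass can be sketched as follows. `forward_two_hidden` is a hypothetical helper; each weight row stores its bias first, and the all-zero weights in the example are just for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward_two_hidden(W1, W2, v, x):
    """z1_h = sigmoid(w1_h^T [1, x]); z2_l = sigmoid(w2_l^T [1, z1]);
    y = v^T [1, z2]."""
    a = [1.0] + list(x)
    z1 = [sigmoid(sum(w * ai for w, ai in zip(row, a))) for row in W1]
    b = [1.0] + z1
    z2 = [sigmoid(sum(w * bi for w, bi in zip(row, b))) for row in W2]
    c = [1.0] + z2
    return sum(vi * ci for vi, ci in zip(v, c))

# With all weights zero except v_1 = 1: z1 = z2 = sigmoid(0) = 0.5, y = 0.5
y = forward_two_hidden([[0.0, 0.0]], [[0.0, 0.0]], [0.0, 1.0], [0.0])
```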
Improving Convergence
Momentum:

$\Delta w_i^t = -\eta \dfrac{\partial E^t}{\partial w_i} + \alpha \, \Delta w_i^{t-1}$

Adaptive learning rate:

$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\eta & \text{otherwise} \end{cases}$
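Both heuristics can be sketched directly. `momentum_step` and `adapt_eta` are hypothetical names, and the constants `eta`, `alpha`, `a`, `b` are illustrative values:

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Delta w_i^t = -eta * dE/dw_i + alpha * Delta w_i^(t-1).
    Returns the updated weights and the step taken (for the next call)."""
    delta = [-eta * g + alpha * pd for g, pd in zip(grad, prev_delta)]
    return [wi + di for wi, di in zip(w, delta)], delta

def adapt_eta(eta, E_new, E_old, a=0.01, b=0.5):
    """Increase eta additively when the error fell over the last interval,
    otherwise decrease it multiplicatively (eta + Delta eta = eta - b*eta)."""
    return eta + a if E_new < E_old else eta * (1 - b)

# First step with no previous momentum: delta = -0.1 * grad
w, d = momentum_step([0.0], [1.0], [0.0])
```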
Overfitting/Overtraining
Number of weights: $H(d + 1) + (H + 1)K$
Structured MLP
(Le Cun et al., 1989)
Weight Sharing
Hints
Invariance to translation, rotation, size
Virtual examples
Augmented error: $E' = E + \lambda_h E_h$
If $x'$ and $x$ are the "same": $E_h = \left[g(x \mid \theta) - g(x' \mid \theta)\right]^2$
Approximation hint (Abu-Mostafa, 1995): if $g(x \mid \theta)$ is known to lie in $[a, b]$,

$E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in [a, b] \\ \left(g(x \mid \theta) - a\right)^2 & \text{if } g(x \mid \theta) < a \\ \left(g(x \mid \theta) - b\right)^2 & \text{if } g(x \mid \theta) > b \end{cases}$
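The piecewise penalty can be written directly (`hint_error` is a hypothetical name; `g` is the network output at the hint point):

```python
def hint_error(g, a, b):
    """Approximation-hint penalty: zero when g lies inside [a, b],
    quadratic distance to the nearest bound otherwise."""
    if g < a:
        return (g - a) ** 2
    if g > b:
        return (g - b) ** 2
    return 0.0

# Inside the interval there is no penalty; outside it grows quadratically.
vals = [hint_error(g, 0.0, 1.0) for g in (0.5, -1.0, 3.0)]
```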
Tuning the Network Size
Destructive: weight decay

$\Delta w_i = -\eta \dfrac{\partial E}{\partial w_i} - \lambda w_i$

$E' = E + \dfrac{\lambda}{2} \sum_i w_i^2$

Constructive: growing networks (Ash, 1989; Fahlman and Lebiere, 1989)
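The weight-decay update can be sketched as one step (`weight_decay_step` is a hypothetical name; the decay term $-\lambda w_i$ is the gradient of the quadratic penalty on $E'$, up to learning-rate scaling):

```python
def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Delta w_i = -eta * dE/dw_i - lambda * w_i:
    each weight is pulled toward zero in addition to the error gradient."""
    return [wi - eta * gi - lam * wi for wi, gi in zip(w, grad)]

# With zero error gradient the weight simply shrinks: 1.0 -> 0.99
w = weight_decay_step([1.0], [0.0])
```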
Bayesian Learning
Consider weights $w_i$ as random variables with prior $p(w_i)$
Weight decay, ridge regression, regularization:
cost = data-misfit + $\lambda$ · complexity
$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w} \mid \mathcal{X})$

$\log p(\mathbf{w} \mid \mathcal{X}) = \log p(\mathcal{X} \mid \mathbf{w}) + \log p(\mathbf{w}) + C$

where $p(w_i) = c \cdot \exp\left[-\dfrac{w_i^2}{2(1/2\lambda)}\right]$, so that

$E' = E + \lambda \|\mathbf{w}\|^2$
Dimensionality Reduction
Learning Time
Applications:
Sequence recognition: speech recognition
Sequence reproduction: time-series prediction
Sequence association

Network architectures:
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time