Neural Networks I
CMPUT 466/551
Nilanjan Ray
Outline
• Projection Pursuit Regression
• Neural Network
  – Background
  – Vanilla Neural Networks
  – Back-propagation
  – Examples
Projection Pursuit Regression
• Additive model with non-linear functions g_m
• The features X are projected onto direction vectors w_m, which we have to learn from the training data
• A precursor to neural networks
  f(X) = \sum_{m=1}^{M} g_m(w_m^T X)
Fitting a PPR Model
• Minimize the squared-error loss function:

  \min_{\{w_m, g_m\}_{m=1}^{M}} \; \sum_{i=1}^{N} \Big[ y_i - \sum_{m=1}^{M} g_m(w_m^T x_i) \Big]^2

• Proceed in forward stages: M = 1, 2, ... etc.
  – At each stage, estimate g given w (say, by fitting a spline function)
  – Estimate w given g (details provided in the next slide)
• The value of M is decided by cross-validation (a small sketch of evaluating the model and its loss follows below)
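A minimal sketch of evaluating a fitted PPR model and the squared-error criterion above; the ridge functions are passed as callables, and all names (ppr_predict, ppr_loss, w_list, g_list) are illustrative rather than from the slides.

```python
import numpy as np

def ppr_predict(X, w_list, g_list):
    """f(X) = sum_m g_m(w_m^T X): additive model of non-linear ridge functions."""
    return sum(g(X @ w) for w, g in zip(w_list, g_list))

def ppr_loss(X, y, w_list, g_list):
    """Squared-error criterion minimized over {g_m, w_m}."""
    return np.sum((y - ppr_predict(X, w_list, g_list)) ** 2)
```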
Fitting a PPR Model…
• At stage m, given g, compute w (Gauss-Newton search)
Linearize g about the current estimate w_old:

  g(w^T x_i) ≈ g(w_old^T x_i) + g'(w_old^T x_i) (w - w_old)^T x_i

Substituting into the criterion gives

  \sum_{i=1}^{N} \big[ z_i - g(w^T x_i) \big]^2
    ≈ \sum_{i=1}^{N} g'(w_old^T x_i)^2 \Big[ \Big( w_old^T x_i + \frac{z_i - g(w_old^T x_i)}{g'(w_old^T x_i)} \Big) - w^T x_i \Big]^2

with weights g'(w_old^T x_i)^2 and adjusted responses w_old^T x_i + (z_i - g(w_old^T x_i)) / g'(w_old^T x_i).

So this is a weighted least-squares technique (a minimal per-stage sketch follows below).

Here z_i = y_i - \sum_{k=1}^{m-1} g_k(w_k^T x_i) is the residual after stage m-1.
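A minimal sketch of fitting one PPR stage to the current residuals by alternating the two steps above; as an assumption, a cubic polynomial fit (np.polyfit) stands in for the spline smoother, and fit_ppr_stage and its arguments are illustrative names.

```python
import numpy as np

def fit_ppr_stage(X, z, n_alt=5, degree=3, eps=1e-8):
    """Fit one term g(w^T x) to residuals z by alternating
    (i) g given w via a polynomial smoother and
    (ii) w given g via the Gauss-Newton weighted least-squares step."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_alt):
        v = X @ w                                       # w_old^T x_i
        g = np.poly1d(np.polyfit(v, z, degree))         # estimate g given w
        dgv = g.deriv()(v)
        dgv = np.where(np.abs(dgv) > eps, dgv, eps)     # guard against g' = 0
        weights = dgv ** 2                              # g'(w_old^T x_i)^2
        adjusted = v + (z - g(v)) / dgv                 # adjusted responses
        # Weighted least squares: min_w sum_i weights_i * (adjusted_i - w^T x_i)^2
        XtWX = X.T @ (weights[:, None] * X)
        w = np.linalg.lstsq(XtWX, X.T @ (weights * adjusted), rcond=None)[0]
        w /= np.linalg.norm(w)
    return w, g
```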
Vanilla Neural Network
  Z_m = \sigma(\alpha_{0m} + \alpha_m^T X),  m = 1, 2, ..., M
  T_k = \beta_{0k} + \beta_k^T Z,  k = 1, 2, ..., K
  f_k(X) = g_k(T),  k = 1, 2, ..., K

where X = (X_1, X_2, ..., X_p), Z = (Z_1, Z_2, ..., Z_M), T = (T_1, T_2, ..., T_K)

For regression: g_k(T) = T_k
For classification: g_k(T) = exp(T_k) / \sum_{l=1}^{K} exp(T_l)  (softmax)

\sigma is the sigmoid, \sigma(v) = 1 / (1 + e^{-v})

[Figure: network diagram with input layer, hidden layer, and output layer]
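A minimal sketch of these forward equations for a single hidden layer, with identity outputs for regression and a softmax output for classification; the array shapes and names (forward, alpha0, alpha, beta0, beta) are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(X, alpha0, alpha, beta0, beta, task="classification"):
    """X: (N, p); alpha: (M, p), alpha0: (M,); beta: (K, M), beta0: (K,)."""
    Z = sigmoid(alpha0 + X @ alpha.T)   # Z_m = sigma(alpha_0m + alpha_m^T X)
    T = beta0 + Z @ beta.T              # T_k = beta_0k + beta_k^T Z
    if task == "regression":
        return T                        # g_k(T) = T_k
    expT = np.exp(T - T.max(axis=1, keepdims=True))    # numerically stable softmax
    return expT / expT.sum(axis=1, keepdims=True)      # g_k(T) = exp(T_k)/sum_l exp(T_l)
```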
The Sigmoid Function
[Figure: plots of σ(0.5v) and σ(10v)]

σ(sv) = 1/(1 + exp(-sv))
• a smooth (regularized) threshold function
• s controls the activation rate: as s↑ the activation becomes hard (step-like); as s↓ it becomes close to the identity function (see the small demo below)
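A small demo of the effect of the scale s in the formula above; the function name sigmoid_s is an illustrative choice.

```python
import numpy as np

def sigmoid_s(v, s=1.0):
    """sigma(s*v) = 1 / (1 + exp(-s*v))."""
    return 1.0 / (1.0 + np.exp(-s * v))

v = np.linspace(-3, 3, 7)
print(np.round(sigmoid_s(v, 0.5), 3))   # gentle slope, nearly linear near 0
print(np.round(sigmoid_s(v, 10.0), 3))  # close to a hard threshold at 0
```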
Multilayer Feed Forward NN
http://www.teco.uni-karlsruhe.de/~albrecht/neuro/html/node18.html
Example architectures
NN: Universal Approximator
• A NN with one hidden layer can approximate arbitrarily well any continuous functional mapping from one finite-dimensional space to another, provided the number of hidden units is sufficiently large.
• The proof is based on the Fourier expansion of a function (see Bishop).
NN: Kolmogorov's Theorem
• Any continuous mapping f(x) of d input variables can be represented by a neural network with two hidden layers of nodes. The first layer contains d(2d+1) nodes, and the second layer contains (2d+1) nodes.
• So why bother about topology at all? This 'universal' architecture is impractical because the functions represented by the hidden units will be non-smooth and unsuitable for learning (see Bishop for more).
The XOR Problem and NN
Activation functions are hard thresholds at 0 (one hand-picked set of weights that solves XOR is sketched below).
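A minimal sketch of such a solution; the particular weights and thresholds are an illustrative choice, not the unique one.

```python
def step(v):
    """Hard threshold at 0."""
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit acting as OR
    h2 = step(x1 + x2 - 1.5)        # hidden unit acting as AND
    return step(h1 - 2 * h2 - 0.5)  # fires only when exactly one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # prints 0, 1, 1, 0
```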
Fitting Neural Networks
• Parameters to learn from training data:

  \{\alpha_{0m}, \alpha_m ;  m = 1, 2, ..., M\}   (M(p+1) weights)
  \{\beta_{0k}, \beta_k ;  k = 1, 2, ..., K\}   (K(M+1) weights)

• Cost functions (both are sketched below)
  – Sum-of-squared errors for regression:

    R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2

  – Cross-entropy errors for classification:

    R(\theta) = - \sum_{k=1}^{K} \sum_{i=1}^{N} y_{ik} \log f_k(x_i)
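A minimal sketch of both cost functions, assuming y is a one-hot (N, K) target array and f holds the network outputs f_k(x_i) of the same shape; the names are illustrative.

```python
import numpy as np

def sum_of_squares(y, f):
    """R(theta) = sum_k sum_i (y_ik - f_k(x_i))^2, used for regression."""
    return np.sum((y - f) ** 2)

def cross_entropy(y, f, eps=1e-12):
    """R(theta) = -sum_k sum_i y_ik * log f_k(x_i), used for classification."""
    return -np.sum(y * np.log(f + eps))
```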
Gradient descent: Back-propagation
Fitting criterion (squared error):

  R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2

Derivatives with respect to the parameters \beta and \alpha:

  \frac{\partial R_i}{\partial \beta_{km}} = -2 (y_{ik} - f_k(x_i)) \, g_k'(\beta_k^T z_i) \, z_{mi} = \delta_{ki} z_{mi}

  \frac{\partial R_i}{\partial \alpha_{ml}} = - \sum_{k=1}^{K} 2 (y_{ik} - f_k(x_i)) \, g_k'(\beta_k^T z_i) \, \beta_{km} \, \sigma'(\alpha_m^T x_i) \, x_{il} = s_{mi} x_{il}

where \delta_{ki} and s_{mi} are the errors at the output and hidden units, related by the back-propagation equation

  s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}

Gradient descent update at iteration r, with learning rate \gamma_r:

  \beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}}

  \alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}}
Back-propagation: Implementation
• Step 1: Initialize the parameters (weights) of the NN
• Iterate:
  – Forward pass: compute f_k(X) for the current parameter values, starting at the input layer and moving all the way up to the output layer.
  – Backward pass: start at the output layer and compute the errors δ_ki; go down one layer at a time and compute s_mi, all the way down to the input layer.
  – Update the weights by the gradient-descent rule (a minimal sketch follows below).
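Below is a minimal sketch of one such iteration for the squared-error case with identity output units (so g_k' = 1); the full training loop, the learning-rate schedule, and all names here are illustrative assumptions rather than the slides' own code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha0, alpha, beta0, beta, gamma=0.01):
    """One forward pass, one backward pass, and one gradient-descent update.
    X: (N, p), Y: (N, K); alpha: (M, p), alpha0: (M,); beta: (K, M), beta0: (K,)."""
    # Forward pass
    A = alpha0 + X @ alpha.T            # alpha_0m + alpha_m^T x_i
    Z = sigmoid(A)                      # hidden units z_mi
    F = beta0 + Z @ beta.T              # outputs f_k(x_i), identity g_k
    # Backward pass
    delta = -2.0 * (Y - F)              # output errors delta_ki
    S = (delta @ beta) * Z * (1.0 - Z)  # s_mi = sigma'(.) * sum_k beta_km * delta_ki
    # Gradient-descent update, summing the per-case gradients over i
    beta   -= gamma * (delta.T @ Z)     # dR/dbeta_km  = sum_i delta_ki * z_mi
    beta0  -= gamma * delta.sum(axis=0)
    alpha  -= gamma * (S.T @ X)         # dR/dalpha_ml = sum_i s_mi * x_il
    alpha0 -= gamma * S.sum(axis=0)
    return alpha0, alpha, beta0, beta
```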
Issues in Training Neural Networks
• Initial values of parameters
  – Back-propagation only finds a local minimum
• Overfitting
  – Neural networks have too many parameters
  – Use early stopping and regularization
• Scaling of the inputs
  – Inputs are typically scaled to have zero mean and unit standard deviation
• Number of hidden units and layers
  – Better to have too many than too few
  – With 'traditional' back-propagation a deep NN gets stuck in local minima and does not learn well
Avoiding Overfitting
• Weight decay: minimize R(\theta) + \lambda J(\theta) with the cost function

  J(\theta) = \sum_{k,m} \beta_{km}^2 + \sum_{m,l} \alpha_{ml}^2

• Weight elimination penalty function:

  J(\theta) = \sum_{k,m} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{m,l} \frac{\alpha_{ml}^2}{1 + \alpha_{ml}^2}

(Both penalties are sketched below.)
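A minimal sketch of the two penalties and of how weight decay enters the gradient (it simply adds 2λβ and 2λα to the respective derivatives); all names are illustrative.

```python
import numpy as np

def weight_decay_penalty(alpha, beta, lam):
    """lambda * J(theta) with J = sum beta_km^2 + sum alpha_ml^2."""
    return lam * (np.sum(beta ** 2) + np.sum(alpha ** 2))

def weight_elimination_penalty(alpha, beta, lam):
    """lambda * J(theta) with J = sum b^2/(1+b^2) + sum a^2/(1+a^2)."""
    return lam * (np.sum(beta ** 2 / (1.0 + beta ** 2)) +
                  np.sum(alpha ** 2 / (1.0 + alpha ** 2)))

# With weight decay, the back-propagation gradients pick up extra terms:
#   dR/dbeta_km  + 2 * lam * beta_km
#   dR/dalpha_ml + 2 * lam * alpha_ml
```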
Example
Example: ZIP Code Recognition
Some Architectures for ZIP Code Recognition
Architectures and Parameters
• Net-1: No hidden layer, equivalent to multinomial logistic regression
• Net-2: One hidden layer, 12 hidden units fully connected
• Net-3: Two hidden layers, locally connected
• Net-4: Two hidden layers, locally connected with weight sharing
• Net-5: Two hidden layers, locally connected, two levels of weight sharing
Networks built around this kind of weight sharing are known as convolutional neural networks.
More on Architectures and Results

Network   Architecture         Links   Weights   %Correct
Net-1     Single layer          2570      2570       80.0
Net-2     Two layer             3214      3214       87.0
Net-3     Locally connected     1226      1226       88.5
Net-4     Constrained           2266      1132       94.0
Net-5     Constrained           5194      1060       98.4
Net-1: #Links = #Weights = 16*16*10 + 10 = 2570
Net-2: #Links = #Weights = 16*16*12 + 12 + 12*10 + 10 = 3214
Net-3: #Links = #Weights = 8*8*3*3 + 8*8 + 4*4*5*5 + 4*4 + 10*4*4 + 10 = 1226
Net-4: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 2266
       #Weights = 2*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 1132
Net-5: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 5194
       #Weights = 2*3*3 + 2*8*8 + 4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 1060
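A small snippet that simply re-evaluates the arithmetic above, so the table entries can be checked directly.

```python
counts = {
    "Net-1 links = weights": 16*16*10 + 10,
    "Net-2 links = weights": 16*16*12 + 12 + 12*10 + 10,
    "Net-3 links = weights": 8*8*3*3 + 8*8 + 4*4*5*5 + 4*4 + 10*4*4 + 10,
    "Net-4 links":   2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10,
    "Net-4 weights": 2*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10,
    "Net-5 links":   2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10,
    "Net-5 weights": 2*3*3 + 2*8*8 + 4*5*5*2 + 4*4*4 + 4*4*4*10 + 10,
}
for name, value in counts.items():
    print(name, "=", value)   # 2570, 3214, 1226, 2266, 1132, 5194, 1060
```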
Performance vs. Training Time
Some References
• C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, 1996. (For good understanding)
• S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009. (For very basic reading, lots of examples etc.)
• Prominent researchers:
  – Yann LeCun (http://yann.lecun.com/)
  – G. E. Hinton (http://www.cs.toronto.edu/~hinton/)
  – Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html)