Supervised Learning in Neural Networks
Sumio Watanabe, Tokyo Institute of Technology
Advanced Topics in Mathematical Information Sciences II
April 24, May 1, 2015
Quick Review
2015/5/1 Mathematical Learning Theory
Supervised Learning

[Diagram] A supervisor produces samples X1, X2, …, Xn with answers Y1, Y2, …, Yn; the learner models the relation as Y = f(x, w).
Mathematics of Supervised Learning

[Diagram] A true information source q(x, y) generates the training samples X1, X2, …, Xn with Y1, Y2, …, Yn and the test samples X, Y; the neural network y = f(x, w) is trained on the former and evaluated on the latter.
One Neuron Model

[Diagram] Inputs x1, x2, x3, …, xN enter through synapse weights w1, w2, w3, …, wN; the neuron forms the weighted sum and applies the activation with bias θ:

Output = σ( Σ_{i=1}^{N} w_i x_i + θ )
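The one-neuron model above can be sketched in a few lines. The slides leave σ abstract; the logistic sigmoid used here is an assumption, and the input and weight values are made up for illustration:

```python
import numpy as np

def neuron(x, w, theta):
    """One-neuron model: output = sigma(sum_i w_i * x_i + theta).
    sigma is taken as the logistic sigmoid (an assumption; the
    slides do not fix the activation function)."""
    s = np.dot(w, x) + theta          # weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-s))   # squashing nonlinearity

# Toy example with N = 3 inputs (values are hypothetical)
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, -0.4, 0.1])
out = neuron(x, w, theta=0.0)
```

With a sigmoid activation the output always lies strictly between 0 and 1, and a zero weighted sum gives exactly 0.5.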
Three-Layered Neural Network

[Diagram] Input layer x1, x2, …, xM; hidden layer; output layer f1, f2, …, fN.
Contents

1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Deep neural network (DNN)

Recently, neural networks with deep layers have been studied intensively. It is reported that DNNs have better generalization performance.

[Diagram] Networks of increasing depth, from 1960 through 1985 to 2015, each mapping inputs x1, x2, …, xM to outputs f1, f2, …, fN.
Definition

It is easy to define a deep network.

Simple perceptron:
f_i = σ( Σ_{j=1}^{M} u_ij x_j + θ_i )

Three-layer neural network:
f_i = σ( Σ_{j=1}^{H} u_ij σ( Σ_{k=1}^{M} w_jk x_k + θ_j ) + φ_i )

DNN:
f_i = σ( Σ_{j=1}^{H1} u_ij σ( Σ_{k=1}^{H2} w_jk σ( Σ_{l=1}^{M} v_kl ( … ) ) + θ_j ) + φ_i )
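The nested sums above are just repeated application of one layer map. A minimal forward-pass sketch, assuming a logistic sigmoid for σ and arbitrary toy dimensions (both are assumptions, not fixed by the slides):

```python
import numpy as np

def sigma(s):
    # Logistic sigmoid (assumed; the slides leave sigma abstract)
    return 1.0 / (1.0 + np.exp(-s))

def layer(W, theta, x):
    # One layer: sigma(W x + theta), matching one level of the nested sums
    return sigma(W @ x + theta)

def forward(params, x):
    """Stack layers: one entry gives a simple perceptron, two give the
    three-layer network, more give a DNN."""
    for W, theta in params:
        x = layer(W, theta, x)
    return x

# Toy DNN with widths M=4 -> H1=5 -> H2=3 -> N=2 (hypothetical sizes)
rng = np.random.default_rng(0)
M, H1, H2, N = 4, 5, 3, 2
dnn = [(rng.normal(size=(H1, M)), rng.normal(size=H1)),
       (rng.normal(size=(H2, H1)), rng.normal(size=H2)),
       (rng.normal(size=(N, H2)), rng.normal(size=N))]
y = forward(dnn, rng.normal(size=M))
```

Each additional `(W, theta)` pair in the list adds one level of nesting in the formula.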
Learning and Generalization

Training Error:
E(w) = (1/n) Σ_{i=1}^{n} ( Y_i − f(X_i, w) )²

Generalization Error:
G(w) = ∫∫ ( y − f(x, w) )² q(x, y) dx dy

The main purpose of learning is to minimize G(w), but we have only training samples. Minimizing E(w) is not equivalent to minimizing G(w).
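A small numerical sketch of the two errors: E(w) is computed on the n training samples, while G(w), which has no closed form in general, is approximated here by a large held-out sample from the same source. The true source, model, and sample sizes below are all hypothetical:

```python
import numpy as np

def training_error(f, w, X, Y):
    # E(w) = (1/n) * sum_i (Y_i - f(X_i, w))^2
    return np.mean((Y - f(X, w)) ** 2)

# Hypothetical setup: true source y = sin(x) + noise, model f(x, w) = w * x
rng = np.random.default_rng(1)
f = lambda X, w: w * X
X_train = rng.uniform(-1, 1, 50)
Y_train = np.sin(X_train) + 0.1 * rng.normal(size=50)

# Monte Carlo approximation of G(w) via a large independent sample
X_test = rng.uniform(-1, 1, 100000)
Y_test = np.sin(X_test) + 0.1 * rng.normal(size=100000)

E = training_error(f, 1.0, X_train, Y_train)
G = training_error(f, 1.0, X_test, Y_test)
```

In practice only E(w) is observable; the gap between E and G is exactly the point the slide makes.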
Steepest Descent: Error Back-Propagation

Square error:
E(w) = (1/2) Σ_{i=1}^{N} ( f_i − y_i )²

Inference:
o_j = σ( Σ_{k=1}^{M} w_jk o_k + θ_j ),   f_i = σ( Σ_{j=1}^{H} u_ij o_j + φ_i )

Gradient:
∂E/∂w_jk = Σ_{i=1}^{N} ( f_i − y_i ) ∂f_i/∂w_jk

∂f_i/∂w_jk = ( ∂f_i/∂o_j ) ( ∂o_j/∂w_jk )

All parameters can be optimized by steepest descent of E(w) by applying this chain rule recursively.
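The chain-rule recursion can be sketched for a three-layer network. Logistic-sigmoid units (so σ′ = σ(1 − σ)), the learning rate eta, and the toy dimensions are all assumptions for illustration:

```python
import numpy as np

def sigma(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(W, theta, U, phi, x, y, eta=0.1):
    """One steepest-descent step for o = sigma(W x + theta),
    f = sigma(U o + phi), E = (1/2) * sum_i (f_i - y_i)^2."""
    o = sigma(W @ x + theta)
    f = sigma(U @ o + phi)
    # Output layer: dE/dU_ij = (f_i - y_i) * f_i * (1 - f_i) * o_j
    delta_f = (f - y) * f * (1 - f)
    # Hidden layer: propagate the error back through U (chain rule)
    delta_o = (U.T @ delta_f) * o * (1 - o)
    # Steepest-descent updates for all parameters
    U -= eta * np.outer(delta_f, o)
    phi -= eta * delta_f
    W -= eta * np.outer(delta_o, x)
    theta -= eta * delta_o
    return 0.5 * np.sum((f - y) ** 2)

# Fit a single toy sample (dimensions M=3, H=4, N=2 are hypothetical)
rng = np.random.default_rng(0)
M, H, N = 3, 4, 2
W, theta = rng.normal(size=(H, M)), np.zeros(H)
U, phi = rng.normal(size=(N, H)), np.zeros(N)
x, y = rng.normal(size=M), np.array([0.0, 1.0])
errs = [backprop_step(W, theta, U, phi, x, y) for _ in range(200)]
```

Repeating the step drives the square error down, illustrating that the recursion computes a usable gradient for every layer.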
Regularization: Ridge and Lasso

E(w) = (1/n) Σ_{i=1}^{n} ( Y_i − f(X_i, w) )² + R(w)

Ridge: R(w) = λ Σ_j |w_j|²
Lasso: R(w) = λ Σ_j |w_j|

λ > 0 is a hyperparameter.

A DNN has many parameters to be optimized, so regularization terms are necessary.

Remark. It is still difficult to find the optimal hyperparameter.
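The two penalty terms are one-liners; the weight vector and λ below are hypothetical values for illustration:

```python
import numpy as np

def ridge(w, lam):
    # Ridge penalty: R(w) = lambda * sum_j |w_j|^2
    return lam * np.sum(w ** 2)

def lasso(w, lam):
    # Lasso penalty: R(w) = lambda * sum_j |w_j|
    return lam * np.sum(np.abs(w))

w = np.array([0.5, -2.0, 0.0])   # toy weight vector
r = ridge(w, 0.1)                # 0.1 * (0.25 + 4.0 + 0.0) = 0.425
l = lasso(w, 0.1)                # 0.1 * (0.5 + 2.0 + 0.0) = 0.25
```

Either penalty is simply added to the training error E(w) before taking gradients; the lasso's absolute value is what tends to push individual weights exactly to zero.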
Steepest Descent?

[Diagram] A deep network with inputs x1, x2, …, xM and outputs f1, f2, …, fN; the supervised data are given at the outputs, which are far from the inputs.

Minimize this error by optimizing all parameters. Mathematically speaking, all parameters can be optimized by steepest descent, but it is difficult for a neural network to find the nonlinear relation between distant inputs and outputs.

We need a methodology for building a deep neural network.
Contents

1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Deep Learning Methodology

Three methods are being studied:

(1) Sequential layer learning
(2) Auto-encoder
(3) Convolution network
(1) Sequential Layer Learning

[Diagram] Three networks of increasing depth, each mapping x1, x2, …, xM to f1, f2, …, fN and each trained with the supervisor; arrows labeled "copy" lead from one network to the next.

Synapse weights in the lower layers are copied from a trained shallow network to a deeper one.
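The copy-and-grow step can be sketched as follows. The representation of a network as a list of `(W, theta)` pairs and the initialization scale are assumptions for illustration; training itself is omitted:

```python
import numpy as np

def grow_network(trained_layers, new_hidden, rng):
    """Sequential layer learning (sketch): keep the trained lower
    layers as-is and stack a freshly initialized layer on top, which
    is then trained next."""
    H_prev = trained_layers[-1][0].shape[0]   # width of the current top layer
    new_W = rng.normal(scale=0.1, size=(new_hidden, H_prev))
    new_theta = np.zeros(new_hidden)
    # Lower weights are copied (shared), only the new layer is fresh
    return trained_layers + [(new_W, new_theta)]

rng = np.random.default_rng(0)
# One already-trained hidden layer of width 5 over 4 inputs (toy values)
shallow = [(rng.normal(size=(5, 4)), np.zeros(5))]
deeper = grow_network(shallow, new_hidden=3, rng=rng)
```

Repeating `grow_network` and retraining after each step builds the deep network layer by layer, rather than optimizing all layers from scratch.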
Parameter Space

[Diagram] The error surface E(w); w lies in a high-dimensional Euclidean space with many local and complicated structures.

E(w) is minimized at |w| = infinity, so we need an appropriate finite and local parameter. Sequential layer learning may lead the training result to some appropriate point.
(2) Auto-encoder

First, a bottleneck network is trained, then its weights are copied.

[Diagram] A network maps inputs X1, X2, …, XM through a hidden layer smaller than M back to X1, X2, …, XM; the input itself serves as the supervisor. The trained lower layers are then reused in a network with outputs f1, f2, …, fN.
Bottleneck Neural Network

[Diagram] Inputs X1, X2, …, XM lying on a K-dimensional manifold in the M-dimensional Euclidean space; the network reproduces the same inputs at its output.

If the inputs lie on a K-dimensional manifold in the M-dimensional Euclidean space, then their essential coordinates can be extracted automatically. This is nonlinear principal component analysis.
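The bottleneck structure can be sketched with one encoding and one decoding map. A sigmoid encoder, a linear decoder, and random (untrained) weights are assumptions here; in practice both maps are trained to minimize the reconstruction error:

```python
import numpy as np

def sigma(s):
    return 1.0 / (1.0 + np.exp(-s))

def autoencode(x, W_enc, W_dec):
    """Bottleneck network: encode M inputs down to K < M hidden units,
    then decode back to M outputs; the input is its own supervisor."""
    h = sigma(W_enc @ x)     # K-dimensional code (essential coordinates)
    return W_dec @ h, h      # reconstruction and code

rng = np.random.default_rng(0)
M, K = 6, 2                  # bottleneck: K is smaller than M
W_enc = rng.normal(size=(K, M))
W_dec = rng.normal(size=(M, K))
x = rng.normal(size=M)
x_hat, code = autoencode(x, W_enc, W_dec)
recon_error = np.sum((x - x_hat) ** 2)  # quantity minimized during training
```

When the inputs really do lie on a K-dimensional manifold, minimizing `recon_error` over many samples forces the K hidden units to act as coordinates on that manifold.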
Example

Input: 5 × 5 images of "0" and "6". Training samples: 2000. Test samples: 2000.

Network: Input 25 – Hidden 8 – Hidden 6 – Hidden 4 – Output 2.
(0) Only Error Back-Propagation

Training Error: mean 213.5, std 414.7
Generalization Error: mean 265.5, std 388.0

Training results strongly depend on the initial synapse weights.
(1) Sequential Layer Learning

Training error: mean 4.1, std 1.8
Test error: mean 61.6, std 7.0
(2) Auto-encoder

Training error: mean 5.3, std 3.4
Test error: mean 61.3, std 8.1
Contents

1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Data Structure

In several kinds of data, such as images and time series, neighborhoods have local covariance.

Image: a pixel depends on its neighbors.
Time series: a future value can be predicted from the past.

A convolutional network is useful for analyzing such data:

f_i = σ( Σ_{|i−j|<3} u_ij σ( Σ_{|j−k|<3} w_jk x_k + θ_j ) + φ_i )

Synapse weights outside the neighborhood are zero.
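The constraint w_jk = 0 for |j − k| ≥ 3 amounts to a banded weight matrix, which can be enforced with a mask. The matrix size and neighborhood width below are the slide's width of 3 on toy dimensions:

```python
import numpy as np

def banded_mask(rows, cols, width=3):
    """Mask that is 1 where |j - k| < width and 0 elsewhere, enforcing
    the convolutional constraint that weights outside the neighborhood
    are zero."""
    j = np.arange(rows)[:, None]
    k = np.arange(cols)[None, :]
    return (np.abs(j - k) < width).astype(float)

# Apply the mask to a dense random weight matrix (toy 6x6 layer)
W = np.random.default_rng(0).normal(size=(6, 6)) * banded_mask(6, 6)
```

Multiplying gradients by the same mask during training keeps the far-off entries at zero, so each unit only ever sees its local neighborhood.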
Convolutional Network

In image analysis, a network is made by nonlinear convolution processing from local information to global information.
Multi-Resolution Analysis

Multi-resolution analysis (MRA) is a method of analyzing images by integration from local to global data. A convolution network can be understood as a kind of MRA.
Time Delay Neural Network

Human speech contains local abbreviation, expansion, and contraction. A layered neural network was proposed to adapt to such local nonlinear changes; this is called the time delay neural network (TDNN).

[Diagram] Speech sound over time is mapped to a recognition result over time.
Example: Time Series

Time series prediction problem: find a nonlinear function

x(t) = f( x(t−1), x(t−2), …, x(t−27) ) + noise,

where { x(t) } is the set of monthly prices of hakusai (a Japanese vegetable like cabbage) for 1970–2013.

Before processing, a linear prediction was optimized:

x(t) = a_1 x(t−1) + a_2 x(t−2) + … + a_27 x(t−27).

Linear prediction: Training Error 1.29, Generalization Error 1.55.
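The linear baseline above is an ordinary least-squares fit of the lag coefficients. Since the hakusai price series is not reproduced here, the sketch uses a synthetic seasonal series and a smaller order (12 rather than 27); both are assumptions:

```python
import numpy as np

def fit_linear_predictor(x, order):
    """Least-squares fit of x(t) = a_1 x(t-1) + ... + a_p x(t-p),
    the linear baseline on the slide."""
    # Each row holds the most recent `order` past values for one target
    rows = [x[t - order:t][::-1] for t in range(order, len(x))]
    A, b = np.array(rows), x[order:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# Synthetic monthly-looking series with period 12 (hypothetical data)
rng = np.random.default_rng(0)
t = np.arange(300)
x = np.sin(2 * np.pi * t / 12) + 0.05 * rng.normal(size=300)

a = fit_linear_predictor(x, order=12)
pred = np.array([x[s - 12:s][::-1] @ a for s in range(12, 300)])
mse = np.mean((x[12:] - pred) ** 2)
```

On a strongly seasonal series the linear predictor already captures most of the structure; the deep and convolutional models on the next slides are attempts to improve on exactly this baseline.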
Example

[Figure] Training result and test result: price vs. month; red: true, blue: prediction.

The data in e-Stat of the Japanese Government are used: http://www.e-stat.go.jp/SG1/estat/eStatTopPortal.do
Comparison of DNN and ConvNN

[Figure] Predictions over time for a deep neural network and a convolution network.

Deep neural network: Training Error 1.01, Generalization Error 1.56
Convolution network: Training Error 1.28, Generalization Error 1.35
Deep Learning and Feature Extraction

(1) Automatic extraction of features. By using a deep neural network, the optimal feature representation may be found; discovery of unknown structure enables us to "mine data". However, it may be difficult, and even when possible it needs heavy computational costs.

(2) Preparing features by hand. If an appropriate feature is found by a human before training, then the computational cost of learning can be reduced. However, discovery of unknown features does not occur.
Summary
(1) Supervised learning in neural networks was introduced (April 24th):
  (a) Definitions of training and generalization errors
  (b) Steepest descent as a learning algorithm

(2) Methodology of deep neural networks (May 1st):
  (a) Sequential layer learning
  (b) Auto-encoder
  (c) Convolution network