
Page 1: Inf2b - Learning

Inf2b - Learning
Lecture 11: Single layer Neural Networks (1)

Hiroshi Shimodaira (Credit: Iain Murray and Steve Renals)

Centre for Speech Technology Research (CSTR), School of Informatics
University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/inf2b/

https://piazza.com/ed.ac.uk/spring2020/infr08028

Office hours: Wednesdays at 14:00-15:00 in IF-3.04

Jan-Mar 2020


Page 2: Inf2b - Learning

Today’s Schedule

1 Discriminant functions (recap)

2 Decision boundary of linear discriminants (recap)

3 Discriminative training of linear discriminants (Perceptron)

4 Structures and decision boundaries of Perceptron

5 LSE Training of linear discriminants

6 Appendix - calculus, gradient descent, linear regression


Page 3: Inf2b - Learning

Discriminant functions (recap)

yk(x) = ln( p(x|Ck) P(Ck) )

      = −(1/2) (x − µk)^T Σk^−1 (x − µk) − (1/2) ln|Σk| + ln P(Ck)

      = −(1/2) x^T Σk^−1 x + µk^T Σk^−1 x − (1/2) µk^T Σk^−1 µk − (1/2) ln|Σk| + ln P(Ck)
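As a concrete illustration, here is a minimal numpy sketch of evaluating this discriminant and picking the class with the largest value; the means, covariances and priors are made-up example values, not anything from the lecture.

    import numpy as np

    def gaussian_discriminant(x, mu_k, Sigma_k, prior_k):
        # y_k(x) = -1/2 (x-mu_k)^T Sigma_k^-1 (x-mu_k) - 1/2 ln|Sigma_k| + ln P(C_k)
        d = x - mu_k
        Sigma_inv = np.linalg.inv(Sigma_k)
        return (-0.5 * d @ Sigma_inv @ d
                - 0.5 * np.log(np.linalg.det(Sigma_k))
                + np.log(prior_k))

    # hypothetical 2-class, 2-dimensional example
    x = np.array([1.0, 0.5])
    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    Sigmas = [np.eye(2), 2.0 * np.eye(2)]
    priors = [0.6, 0.4]
    scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    print("predicted class:", 1 + int(np.argmax(scores)))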



Page 4: Inf2b - Learning

Linear discriminants for a 2-class problem

y1(x) = w1^T x + w10
y2(x) = w2^T x + w20

Combined discriminant function:

y(x) = y1(x) − y2(x) = (w1 − w2)^T x + (w10 − w20)
     = w^T x + w0

Decision:

C = 1, if y(x) ≥ 0
    2, if y(x) < 0
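A small numpy sketch of this decision rule (w1, w2 and the biases are made-up values, purely for illustration):

    import numpy as np

    # hypothetical per-class discriminants y_k(x) = w_k^T x + w_k0
    w1, w10 = np.array([1.0, 2.0]), -0.5
    w2, w20 = np.array([0.5, -1.0]), 0.3

    # combined discriminant y(x) = w^T x + w0
    w, w0 = w1 - w2, w10 - w20

    def classify(x):
        y = w @ x + w0
        return 1 if y >= 0 else 2      # class 1 if y(x) >= 0, else class 2

    print(classify(np.array([0.2, 0.4])))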


Page 5: Inf2b - Learning

Decision boundary of linear discriminants

Decision boundary:

y(x) = wTx + w0 = 0

Dimension   Decision boundary
2           line:        w1 x1 + w2 x2 + w0 = 0
3           plane:       w1 x1 + w2 x2 + w3 x3 + w0 = 0
D           hyperplane:  (Σ_{i=1}^{D} wi xi) + w0 = 0

NB: w is a normal vector to the hyperplane


Page 6: Inf2b - Learning

Decision boundary of linear discriminant (2D)

y(x) = w1 x1 + w2 x2 + w0 = 0
(equivalently, x2 = −(w1/w2) x1 − w0/w2, when w2 ≠ 0)

[Figure: the decision boundary y(x) = w^T x + w0 = 0 in the (x1, x2) plane, separating regions C1 and C2; w is the normal vector and the line has slope −w1/w2.]


Page 7: Inf2b - Learning

Decision boundary of linear discriminant (3D)

y(x) = w1x1 + w2x2 + w3x3 + w0 = 0

[Figure: a plane in 3D (axes x1, x2, x3) with normal vector w = (w1, w2, w3)^T.]


Page 8: Inf2b - Learning

Approach to linear discriminant functions

Generative models : p(x|Ck)

Discriminant function based on Bayes decision rule

yk(x) = ln p(x|Ck) + lnP(Ck)

↓ Gaussian pdf (model)

yk(x) = −(1/2) (x − µk)^T Σk^−1 (x − µk) − (1/2) ln|Σk| + ln P(Ck)

↓ Equal covariance assumption

yk(x) = wTx + w0

↑ Why not estimate the decision boundary or P(Ck|x) directly?

Discriminative training / models (Logistic regression, Perceptron / Neural network, SVM)


Page 9: Inf2b - Learning

Training linear discriminant functions directly

A discriminant for a two-class problem:

y(x) = y1(x) − y2(x) = (w1 − w2)^T x + (w10 − w20)
     = w^T x + w0

[Figure: two-class training data in the (x1, x2) plane with the decision boundary defined by (w1, w2)^T, and an adjusted boundary (w1, w2)^(new) after training.]


Page 10: Inf2b - Learning

Perceptron error correction algorithm

a(x) = w^T x + w0 = w̃^T x̃

where w̃ = (w0, w^T)^T,  x̃ = (1, x^T)^T

Let's just use w and x to denote w̃ and x̃ from now on!

y(x) = g( a(x) ) = g( w^T x ),  where g(a) = 1, if a ≥ 0
                                             0, if a < 0

g(a): activation / transfer function

Training set: D = {(x1, t1), . . . , (xN, tN)}, where ti ∈ {0, 1} : target value

Modify w if xi was misclassified

w^(new) ← w + η (ti − y(xi)) xi    (0 < η < 1 : learning rate)

NB:  (w^(new))^T xi = w^T xi + η (ti − y(xi)) ‖xi‖²
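A minimal sketch of a single error-correction step (numpy, with made-up numbers), which also checks the identity in the NB numerically:

    import numpy as np

    def g(a):
        return 1 if a >= 0 else 0            # threshold activation

    w   = np.array([0.0, 1.0, -1.0])         # current weights (w0, w1, w2)
    x_i = np.array([1.0, 0.5, 2.0])          # augmented input (1, x1, x2)
    t_i = 1                                  # target
    eta = 0.5                                # learning rate

    y_i = g(w @ x_i)                         # prediction with current weights
    w_new = w + eta * (t_i - y_i) * x_i      # error-correction update

    # the activation changes by eta * (t_i - y_i) * ||x_i||^2
    print(w_new @ x_i)
    print(w @ x_i + eta * (t_i - y_i) * np.dot(x_i, x_i))   # same value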


Page 11: Inf2b - Learning

Geometry of Perceptron’s error correction

y(xi) = g(w^T xi)

w^(new) ← w + η (ti − y(xi)) xi    (0 < η < 1)

ti − y(xi):            y(xi) = 0    y(xi) = 1
          ti = 0           0           −1
          ti = 1           1            0

w^T x = ‖w‖ ‖x‖ cos θ

[Figure: weight vector w, a sample x, and the decision boundary w^T x = 0 separating regions C1 and C0.]


Page 12: Inf2b - Learning

Geometry of Perceptron’s error correction (cont.)

y(xi) = g(w^T xi)

w^(new) ← w + η (ti − y(xi)) xi    (0 < η < 1)

ti − y(xi):            y(xi) = 0    y(xi) = 1
          ti = 0           0           −1
          ti = 1           1            0

w^T x = ‖w‖ ‖x‖ cos θ

[Figure: as above, with the sample x and weight vector w shown relative to the decision boundary w^T x = 0 and the regions C1 and C0.]


Page 13: Inf2b - Learning

Geometry of Perceptron’s error correction (cont.)

y(xi) = g(w^T xi)

w^(new) ← w + η (ti − y(xi)) xi    (0 < η < 1)

ti − y(xi):            y(xi) = 0    y(xi) = 1
          ti = 0           0           −1
          ti = 1           1            0

w^T x = ‖w‖ ‖x‖ cos θ

[Figure: the correction ±η x added to w gives w^(new), rotating the decision boundary so that x falls on the correct side.]


Page 14: Inf2b - Learning

The Perceptron learning algorithm

Incremental (online) Perceptron algorithm:

    for i = 1, . . . , N
        w ← w + η (ti − y(xi)) xi

Batch Perceptron algorithm:

    vsum = 0
    for i = 1, . . . , N
        vsum = vsum + (ti − y(xi)) xi
    w ← w + η vsum
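A runnable sketch of both variants (numpy; the linearly separable data set is made up, and inputs are augmented with a leading 1):

    import numpy as np

    def g(a):
        return (a >= 0).astype(float)                      # threshold activation

    rng = np.random.default_rng(0)
    X_raw = rng.normal(size=(20, 2))                       # hypothetical 2D inputs
    t = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)      # linearly separable targets
    X = np.hstack([np.ones((20, 1)), X_raw])               # prepend the bias input 1
    eta = 0.1

    def train_incremental(X, t, eta, epochs=50):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, t_i in zip(X, t):
                w = w + eta * (t_i - g(w @ x_i)) * x_i     # update after each sample
        return w

    def train_batch(X, t, eta, epochs=50):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            v_sum = ((t - g(X @ w))[:, None] * X).sum(axis=0)
            w = w + eta * v_sum                            # one update per pass
        return w

    print(train_incremental(X, t, eta))
    print(train_batch(X, t, eta))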

What about convergence?
The Perceptron learning algorithm terminates if training samples are linearly separable.


Page 15: Inf2b - Learning

Linearly separable vs linearly non-separable

[Figure: (a-1), (a-2) linearly separable data sets; (b) a linearly non-separable data set.]


Page 16: Inf2b - Learning

Background of Perceptron

(https://en.wikipedia.org/wiki/File:Neuron_Hand-tuned.svg)

[Figure: a biological neuron, and (a) a function unit: inputs xi weighted by wi are summed (Σ) to give the output y.]

1940s  Warren McCulloch and Walter Pitts: 'threshold logic'
       Donald Hebb: 'Hebbian learning'

1957   Frank Rosenblatt: 'Perceptron'


Page 17: Inf2b - Learning

Character recognition by Perceptron

[Figure: a W × H pixel image, from pixel (1, 1) to (W, H); each pixel (i, j) is a weighted input to a single summation unit Σ.]


Page 18: Inf2b - Learning

Perceptron structures and decision boundaries

y(x) = g(a(x)) = g( w^T x )

w = (w0, w1, . . . , wD)^T
x = (1, x1, . . . , xD)^T

where g(a) = 1, if a ≥ 0
             0, if a < 0

[Figure: a single node with inputs (1, x1, x2), weights (w0, w1, w2), and a summation unit Σ, together with its decision boundary in the (x1, x2) plane.]

Example:  a(x) = 1 − x1 + x2 = w0 + w1 x1 + w2 x2,  i.e.  w0 = 1, w1 = −1, w2 = 1

The node outputs 1 in the region x2 ≥ x1 − 1.

NB: A single node/neuron constructs a decision boundary, which splits the input space into two regions.


Page 19: Inf2b - Learning

Perceptron as a logical function

NOT           OR              NAND            XOR
x1  y         x1  x2  y       x1  x2  y       x1  x2  y
 0  1          0   0  0        0   0  1        0   0  0
 1  0          0   1  1        0   1  1        0   1  1
               1   0  1        1   0  1        1   0  1
               1   1  1        1   1  0        1   1  0

[Figure: the functions plotted as points in the (x1, x2) plane, showing which of them can be separated by a single straight line.]

Question: find the weights for each function
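Rather than giving the weights away, here is a small sketch (numpy) that checks a candidate weight vector (w0, w1, w2) against a truth table; the OR weights shown are just one possible choice, used as an illustration:

    import numpy as np

    def perceptron(w, x):
        return 1 if np.dot(w, x) >= 0 else 0          # x = (1, x1, x2)

    def satisfies(w, table):
        # table maps (x1, x2) -> desired output y
        return all(perceptron(w, (1, x1, x2)) == y for (x1, x2), y in table.items())

    OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

    print(satisfies((-0.5, 1, 1), OR))    # True: fires unless both inputs are 0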


Page 20: Inf2b - Learning

Perceptron structures and decision boundaries (cont.)

[Figure: Perceptron structures built from several Σ units with inputs (1, x1, x2).]


Page 21: Inf2b - Learning

Perceptron structures and decision boundaries (cont.)

[Figure: a Perceptron structure of Σ units with inputs (1, x1, x2) and the corresponding decision boundaries in the (x1, x2) plane.]


Page 22: Inf2b - Learning

Perceptron structures and decision boundaries (cont.)

[Figure: a Perceptron structure with an additional Σ unit combining the node outputs, and the decision region it produces in the (x1, x2) plane.]


Page 23: Inf2b - Learning

Perceptron structures and decision boundaries (cont.)

[Figure: a larger structure of Σ units with inputs (1, x1, x2), producing a more complex decision region in the (x1, x2) plane.]


Page 24: Inf2b - Learning

Problems with the Perceptron learning algorithm

No training algorithms for multi-layer Perceptrons

Non-convergence for linearly non-separable data

Weights w are adjusted for misclassified data only (correctly classified data are not considered at all)

Consider not only misclassification (on training data), but also the optimality of the decision boundary

Least squares error training

Large margin classifiers (e.g. SVM)


Page 25: Inf2b - Learning

Training with least squares

Squared error function:

E(w) = (1/2) Σ_{n=1}^{N} (w^T xn − tn)²

Optimisation problem:  min_w E(w)

One way to solve this is to apply gradient descent (steepest descent):

w ← w − η ∇w E (w)

where η : step size (a small positive const.)

∇w E(w) = ( ∂E/∂w0, . . . , ∂E/∂wD )^T

Page 26: Inf2b - Learning

Training with least squares (cont.)

∂E/∂wi = ∂/∂wi [ (1/2) Σ_{n=1}^{N} (w^T xn − tn)² ]

       = Σ_{n=1}^{N} (w^T xn − tn) ∂/∂wi (w^T xn)

       = Σ_{n=1}^{N} (w^T xn − tn) xni

Trainable in the linearly non-separable case

Not robust (sensitive) against erroneous data (outliers) far away from the boundary

More or less a linear discriminant
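A minimal numpy sketch of least-squares training by gradient descent (the two-class data are made up, inputs are augmented with a leading 1, and the 0/1 class labels are used directly as regression targets):

    import numpy as np

    rng = np.random.default_rng(1)
    X_raw = rng.normal(size=(100, 2))                       # hypothetical 2D inputs
    t = (X_raw[:, 0] - X_raw[:, 1] + 0.3 * rng.normal(size=100) > 0).astype(float)
    X = np.hstack([np.ones((100, 1)), X_raw])               # augmented inputs (1, x1, x2)

    w = np.zeros(3)
    eta = 0.001                                             # step size
    for _ in range(2000):
        err = X @ w - t                  # (w^T x_n - t_n) for every sample
        grad = X.T @ err                 # dE/dw_i = sum_n (w^T x_n - t_n) x_ni
        w = w - eta * grad               # gradient descent step

    # with targets 0/1, a natural decision threshold is w^T x >= 0.5
    print("weights:", w)
    print("training accuracy:", np.mean((X @ w >= 0.5) == (t == 1)))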


Page 27: Inf2b - Learning

Appendix – derivatives

Derivatives of functions of one variable:

df/dx = f'(x) = lim_{ε→0} [ f(x + ε) − f(x) ] / ε

e.g., f(x) = 4x³,  f'(x) = 12x²

Partial derivatives of functions of more than one variable:

∂f/∂x = lim_{ε→0} [ f(x + ε, y) − f(x, y) ] / ε

e.g., f(x, y) = y³x²,  ∂f/∂x = 2y³x


Page 28: Inf2b - Learning

Derivative rules

Function                         Derivative
Constant         c               0
                 x               1
Power            x^n             n x^(n−1)
                 1/x             −1/x²
                 √x              (1/2) x^(−1/2)
Exponential      e^x             e^x
Logarithms       ln(x)           1/x
Sum rule         f(x) + g(x)     f'(x) + g'(x)
Product rule     f(x) g(x)       f'(x) g(x) + f(x) g'(x)
Reciprocal rule  1/f(x)          −f'(x) / f(x)²
Quotient rule    f(x)/g(x)       [ f'(x) g(x) − f(x) g'(x) ] / g(x)²
Chain rule       f(g(x))         f'(g(x)) g'(x)
                 z = f(y), y = g(x)      dz/dx = (dz/dy)(dy/dx)


Page 29: Inf2b - Learning

Vectors of derivatives

Consider f (x), where x = (x1, . . . , xD)T

Notation: all partial derivatives put in a vector:

∇x f(x) = ( ∂f/∂x1, ∂f/∂x2, · · · , ∂f/∂xD )^T

Example:  f(x) = x1³ x2²

∇x f(x) = ( 3 x1² x2², 2 x1³ x2 )^T

Fact: f (x) changes most quickly in direction ∇x f (x)
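A quick numerical check of this example (numpy; the evaluation point is arbitrary), comparing the analytic gradient with central finite differences:

    import numpy as np

    def f(x):
        return x[0]**3 * x[1]**2

    def grad_f(x):
        return np.array([3 * x[0]**2 * x[1]**2, 2 * x[0]**3 * x[1]])

    x = np.array([1.5, -2.0])
    eps = 1e-6
    numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
    print(grad_f(x))      # analytic gradient
    print(numeric)        # agrees to roughly 1e-6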


Page 30: Inf2b - Learning

Gradient descent (steepest descent)

First order optimisation algorithm using ∇x f (x)

Optimisation problem: minx f (x)

Useful when analytic solutions (closed forms) are not available or difficult to find

Algorithm:
  1  Set an initial value x0 and set t = 0.
  2  If ‖∇x f(xt)‖ ≈ 0, then stop. Otherwise, do the following.
  3  xt+1 = xt − η ∇x f(xt) for η > 0.
  4  t = t + 1, and go to step 2.

Problem: stops at a local minimum (difficult to find the global minimum).
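A sketch of the algorithm on a simple made-up function, f(x) = (x1 − 1)² + 4 x2², whose minimiser is (1, 0):

    import numpy as np

    def f(x):
        return (x[0] - 1.0)**2 + 4.0 * x[1]**2

    def grad_f(x):
        return np.array([2.0 * (x[0] - 1.0), 8.0 * x[1]])

    x = np.array([4.0, 2.0])               # step 1: initial value x0
    eta = 0.1
    for t in range(1000):
        gr = grad_f(x)
        if np.linalg.norm(gr) < 1e-6:      # step 2: stop when the gradient is ~0
            break
        x = x - eta * gr                   # step 3: move against the gradient
    print(x, f(x))                         # close to (1, 0) and 0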


Page 31: Inf2b - Learning

Linear regression (one variable): least squares line fitting

Training set: D = {(xn, tn)}_{n=1}^{N}

Linear regression: tn = a xn + b

Objective function:  E = Σ_{n=1}^{N} ( tn − (a xn + b) )²

Optimisation problem:  min_{a,b} E

Partial derivatives:

∂E/∂a = 2 Σ_{n=1}^{N} ( tn − (a xn + b) ) (−xn)
      = 2a Σ_{n=1}^{N} xn² + 2b Σ_{n=1}^{N} xn − 2 Σ_{n=1}^{N} tn xn

∂E/∂b = −2 Σ_{n=1}^{N} ( tn − (a xn + b) )
      = 2a Σ_{n=1}^{N} xn + 2b Σ_{n=1}^{N} 1 − 2 Σ_{n=1}^{N} tn


Page 32: Inf2b - Learning

Linear regression (one variable) (cont.)

Letting ∂E/∂a = 0 and ∂E/∂b = 0:

[ Σ xn²   Σ xn ] [ a ]   =  [ Σ tn xn ]
[ Σ xn    Σ 1  ] [ b ]      [ Σ tn    ]

(sums over n = 1, . . . , N)
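A numpy sketch of solving this 2×2 system (the data are synthetic, generated from a made-up line t ≈ 2x + 1 plus noise):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=50)
    t = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

    N = len(x)
    A = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    N        ]])
    rhs = np.array([np.sum(t * x), np.sum(t)])
    a, b = np.linalg.solve(A, rhs)
    print(a, b)        # close to the true slope 2 and intercept 1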


Page 33: Inf2b - Learning

Linear regression (multiple variables)

[Figure 3.1, Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 3: linear least squares fitting with X ∈ R². We seek the linear function of X that minimises the sum of squared residuals from Y.]

Training set: D = {(xn, tn)}_{n=1}^{N}, where xn = (1, x1, . . . , xD)^T

Linear regression: tn = w^T xn

Objective function:

E = Σ_{n=1}^{N} ( tn − w^T xn )²

Optimisation problem:  min_w E



Page 34: Inf2b - Learning

Linear regression (multiple variables) (cont.)

E = Σ_{n=1}^{N} ( tn − w^T xn )²

Partial derivatives:

∂E/∂wi = −2 Σ_{n=1}^{N} ( tn − w^T xn ) xni

Vector/matrix representation (NE):

X = ( x1^T )   ( x10 . . . x1D )        ( t1 )
    ( ...  ) = (      ...      ),   T = ( ... )
    ( xN^T )   ( xN0 . . . xND )        ( tN )

E = (T − Xw)^T (T − Xw)

∂E/∂w = −2 X^T (T − Xw)

Letting ∂E/∂w = 0  ⇒  X^T (T − Xw) = 0

X^T X w = X^T T

w = (X^T X)^{−1} X^T T   · · · analytic solution if the inverse exists
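A numpy sketch of the analytic solution on made-up data; np.linalg.lstsq solves the same least-squares problem in a numerically safer way:

    import numpy as np

    rng = np.random.default_rng(3)
    X_raw = rng.normal(size=(200, 3))
    X = np.hstack([np.ones((200, 1)), X_raw])          # rows x_n^T = (1, x_n1, ..., x_nD)
    w_true = np.array([0.5, 1.0, -2.0, 3.0])           # hypothetical true weights
    T = X @ w_true + 0.1 * rng.normal(size=200)

    w = np.linalg.inv(X.T @ X) @ X.T @ T               # w = (X^T X)^-1 X^T T
    w_lstsq, *_ = np.linalg.lstsq(X, T, rcond=None)    # equivalent, better conditioned
    print(w)
    print(w_lstsq)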


Page 35: Inf2b - Learning

Summary

Training discriminant functions directly (discriminative training)

Perceptron training algorithm

Perceptron error correction algorithm
Least squares error + gradient descent algorithm

Linearly separable vs linearly non-separable

Perceptron structures and decision boundaries

See Notes 11 for a Perceptron with multiple output nodes
