
1

Function Learning and Neural Nets

R&N: Chap. 20, Sec. 20.5

2

Function-Learning Formulation

Goal function f
Training set: (x(i), y(i)), i = 1, …, n, with y(i) = f(x(i))
Inductive inference: find a function h that fits the points well
Same Keep-It-Simple bias

[Figure: sample points of f(x) plotted against x]

3

Least-Squares Fitting

Propose a class of functions g(x, θ) parameterized by θ
Minimize E(θ) = Σi (g(x(i), θ) − y(i))²

[Figure: data points of f(x) vs. x with a candidate fit]

4

Linear Least-Squares

g(x, θ) = x1 θ1 + … + xN θN
Best θ given by θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of the x(i)'s (one per row) and b is the vector of the y(i)'s

[Figure: linear fit g(x, θ) to samples of f(x)]

5

Constant offset

Set x0 = 1, g(x, θ) = x0 θ0 + x1 θ1 + … + xN θN
Best θ given by θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of the x(i)'s and b is the vector of the y(i)'s

[Figure: fit g(x, θ) with constant offset to samples of f(x)]
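As a minimal sketch of the closed form above, for one input with a constant offset (x0 = 1) the normal equations (AᵀA)θ = Aᵀb reduce to a 2×2 system, solved here by Cramer's rule in pure Python. The helper name `fit_line` is illustrative, not from the slides.

```python
# Linear least squares with a constant offset (x0 = 1): solve the
# 2x2 normal equations (A^T A) theta = A^T b by Cramer's rule.
def fit_line(xs, ys):
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of A^T A
    theta0 = (sy * sxx - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

theta0, theta1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
# the data lies exactly on y = 1 + 2x, so theta0 = 1, theta1 = 2
```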

6

Nonlinear Least-Squares

E.g. quadratic: g(x, θ) = θ0 + x θ1 + x² θ2
E.g. exponential: g(x, θ) = exp(θ0 + x θ1)
Any combination: g(x, θ) = exp(θ0 + x θ1) + θ2 + x θ3

[Figure: linear, quadratic, and other fits to samples of f(x)]

7

Performance of Nonlinear Least-squares

Overfitting: too many parameters
Efficient optimization: often can only find a local minimum of the objective E(θ)
Expensive with lots of data!

8

Neural Networks

9

Perceptron (the goal function f is a boolean one)

y = g(Σi=1,…,n wi xi)

[Figure: a unit with inputs x1, …, xn, weights wi, and output y; a 2D plot of + and − examples separated by the line w1 x1 + w2 x2 = 0]

10

Perceptron (the goal function f is a boolean one)

y = g(Σi=1,…,n wi xi)

[Figure: the same unit; a 2D plot of + and − examples and a point marked "?"]

11

Unit (Neuron)

y = g(Σi=1,…,n wi xi)
g(u) = 1/[1 + exp(−u)]

[Figure: a unit with inputs x1, …, xn, weights wi, and output y]
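The unit above can be sketched directly; the example weights and inputs are illustrative assumptions.

```python
import math

# A single sigmoid unit: y = g(sum_i w_i x_i) with g(u) = 1/(1 + exp(-u)).
def unit(weights, inputs):
    u = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-u))

y = unit([1.0, -2.0], [0.5, 0.25])  # weighted sum is 0, so y = g(0) = 0.5
```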

12

A Single Neuron can learn

A disjunction of boolean literals, e.g. x1 ∨ x2 ∨ x3
Majority function
XOR?
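A minimal sketch of a threshold unit learning such a boolean function with the perceptron rule; the bias input x0 = 1, the learning rate, and the epoch count are illustrative assumptions.

```python
# Perceptron sketch: a threshold unit y = 1 if sum_i w_i x_i > 0,
# trained with the perceptron rule on the disjunction x1 OR x2.
def train(examples, epochs=20, lr=0.1):
    w = [0.0, 0.0, 0.0]  # w0 (bias), w1, w2
    for _ in range(epochs):
        for x1, x2, target in examples:
            x = (1, x1, x2)
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            for i in range(3):
                w[i] += lr * (target - y) * x[i]  # perceptron update
    return w

OR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
w = train(OR)

def predict(x1, x2):
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
```

On a linearly separable target like OR the rule converges; on XOR, no weight vector exists, which is the point of the "XOR?" bullet.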

13

Neural Network

Network of interconnected neurons

Acyclic (feed-forward) vs. recurrent networks

[Figure: two connected units, each computing y = g(Σi wi xi)]

14

Two-Layer Feed-Forward Neural Network

[Figure: inputs feeding a hidden layer through weights w1j, which feeds an output layer through weights w2k]
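The forward pass of such a two-layer network can be sketched as below; the layer sizes and weight values are illustrative assumptions.

```python
import math

# Two-layer feed-forward sketch: inputs -> hidden layer (weights w1) ->
# output layer (weights w2), each unit computing g(weighted sum).
def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, w1, w2):
    hidden = [g(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [g(sum(w * h for w, h in zip(row, hidden))) for row in w2]

y = forward([1.0, 0.5],
            [[0.2, -0.4], [0.7, 0.1]],  # w1: two hidden units
            [[0.5, -0.3]])              # w2: one output unit
```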

15

Backpropagation (Principle)

New example: y(k) = f(x(k))
φ(k) = output of the NN with weights w(k−1) on inputs x(k)
Error function: E(k)(w(k−1)) = ||φ(k) − y(k)||²

wij(k) = wij(k−1) − ε ∂E(k)/∂wij   (i.e., w(k) = w(k−1) − ε∇E)

Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
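A minimal sketch of the weight update above for a single sigmoid unit, where E = (g(w·x) − y)² and the chain rule gives dE/dwi = 2(g(u) − y) g(u)(1 − g(u)) xi using g′(u) = g(u)(1 − g(u)); the step size ε and the training point are assumptions.

```python
import math

# One gradient step w_i <- w_i - eps * dE/dw_i for a single sigmoid unit
# with squared error E = (g(w.x) - y)^2.
def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def step(w, x, y, eps=0.5):
    out = g(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [2.0 * (out - y) * out * (1.0 - out) * xi for xi in x]
    return [wi - eps * gi for wi, gi in zip(w, grad)]

w = [0.0, 0.0]
x, y = [1.0, 1.0], 1.0
for _ in range(200):
    w = step(w, x, y)
# repeated steps drive the output g(w.x) toward the target y = 1
```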

16

Understanding Backpropagation

Minimize E(θ) by gradient descent: compute the gradient of E(θ) and take a step proportional to the negative gradient

[Figure: animation of the E(θ) landscape, its gradient, and a descent step]

19

Understanding Backpropagation

Example of stochastic gradient descent
Minimize E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ei(θ) = (g(x(i), θ) − y(i))²
Take a step to reduce one ei at a time

[Figure: animation of successive steps along the gradients of e1, e2, e3]
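The per-example steps above can be sketched with a toy one-parameter model g(x, θ) = θx, so each ei = (θx(i) − y(i))² has gradient 2(θx(i) − y(i))x(i); the learning rate and epoch count are assumptions.

```python
import random

# Stochastic gradient descent for E(theta) = e1 + ... + eN,
# ei = (g(x(i), theta) - y(i))^2, with the toy model g(x, theta) = theta * x.
def sgd(data, theta=0.0, eps=0.05, epochs=100):
    for _ in range(epochs):
        random.shuffle(data)                  # visit examples in random order
        for x, y in data:
            grad = 2.0 * (theta * x - y) * x  # gradient of one ei
            theta -= eps * grad               # step to reduce that ei
    return theta

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
theta = sgd(data)
```

Each step reduces one term rather than the full objective, which is why the trajectory wanders before settling near the minimum.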

25

Stochastic Gradient Descent

[Figure: parameter values over time, converging to a (local) minimum of E]

26

Stochastic Gradient Descent

[Figure: objective function values over time]

27

Caveats

Choosing a convergent “learning rate” can be hard in practice

[Figure: E(θ) landscape]

28

Comments and Issues

How to choose the size and structure of networks?
• If the network is too large, risk of over-fitting (data caching)
• If the network is too small, the representation may not be rich enough

Role of representation: e.g., learning the concept of an odd number

Incremental learning

29

Role of Marketing

Not a good model of a neuron: spiking behavior and recurrence in real NNs

No special properties above other learning techniques

Like other learning techniques, a convenient way to get results without thinking too hard

30

Incremental (“Online”) Function Learning

31

Incremental (“Online”) Function Learning

Data is streaming into the learner: x1, y1, …, xt, yt with yi = f(xi)

Observes xt+1 and must make a prediction for the next time step, yt+1

Brute-force approach: store all data; at step t, run your learner of choice on all data up to time t and predict for time t+1

32

Example: Mean Estimation

yi = μ + error term (no x's)
Current estimate: μt = (1/t) Σi=1…t yi

μt+1 = (1/(t+1)) Σi=1…t+1 yi
     = (1/(t+1)) (yt+1 + Σi=1…t yi)
     = (1/(t+1)) (yt+1 + t μt)

E.g., μ6 = (5/6) μ5 + (1/6) y6

35

Example: Mean Estimation

μt+1 = (1/(t+1)) (yt+1 + t μt)
Only need to store μt and t
Similar formulas for the standard deviation

E.g., μ6 = (5/6) μ5 + (1/6) y6
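The update above can be sketched in a few lines; the sample values are illustrative.

```python
# Incremental mean: mu_{t+1} = (y_{t+1} + t * mu_t) / (t + 1),
# storing only the running mean mu and the count t.
def update_mean(mu_t, t, y_next):
    return (y_next + t * mu_t) / (t + 1)

mu, t = 0.0, 0
for y in [1.0, 2.0, 3.0, 4.0]:
    mu = update_mean(mu, t, y)
    t += 1
# mu is now 2.5, the mean of the four samples
```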

36

Incremental Least Squares

Recall the least-squares estimate θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix of the x(i)'s (laid out in rows) and b is the vector of the y(i)'s:

A = [x(1)ᵀ; x(2)ᵀ; …; x(N)ᵀ]   (N×M)
b = [y(1); y(2); …; y(N)]   (N×1)

37

Incremental Least Squares

Let A(t), b(t) be the A matrix and b vector up to time t

θ(t) = (A(t)ᵀA(t))⁻¹ A(t)ᵀ b(t)

A(t+1) appends the row x(t+1)ᵀ to A(t)   ((t+1)×M)
b(t+1) appends y(t+1) to b(t)   ((t+1)×1)

θ(t+1) = (A(t+1)ᵀA(t+1))⁻¹ A(t+1)ᵀ b(t+1)

A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
A(t+1)ᵀ A(t+1) = A(t)ᵀ A(t) + x(t+1) x(t+1)ᵀ

41

Incremental Least Squares

Let A(t), b(t) be the A matrix and b vector up to time t

θ(t+1) = (A(t+1)ᵀA(t+1))⁻¹ A(t+1)ᵀ b(t+1)

A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
A(t+1)ᵀ A(t+1) = A(t)ᵀ A(t) + x(t+1) x(t+1)ᵀ

Sherman-Morrison update: (Y + xxᵀ)⁻¹ = Y⁻¹ − Y⁻¹ xxᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x)

42

Incremental Least Squares

Putting it all together

Store: p(t) = A(t)ᵀ b(t) and Q(t) = (A(t)ᵀA(t))⁻¹

Update:
p(t+1) = p(t) + y x
Q(t+1) = Q(t) − Q(t) xxᵀ Q(t) / (1 + xᵀ Q(t) x)
θ(t+1) = Q(t+1) p(t+1)
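The stored quantities above can be sketched in pure Python for two parameters. Initializing Q to (1/δ)I with a small regularizer δ is an assumption (the slides start from the batch quantities instead), and the data values are illustrative.

```python
# Recursive least squares for two parameters, maintaining
# p(t) = A(t)^T b(t) and Q(t) = (A(t)^T A(t))^-1 via Sherman-Morrison:
# (Y + x x^T)^-1 = Y^-1 - Y^-1 x x^T Y^-1 / (1 + x^T Y^-1 x).
def rls_step(Q, p, x, y):
    p = [p[i] + y * x[i] for i in range(2)]                  # p <- p + y x
    Qx = [Q[i][0] * x[0] + Q[i][1] * x[1] for i in range(2)]
    denom = 1.0 + x[0] * Qx[0] + x[1] * Qx[1]                # 1 + x^T Q x
    Q = [[Q[i][j] - Qx[i] * Qx[j] / denom for j in range(2)] for i in range(2)]
    theta = [Q[i][0] * p[0] + Q[i][1] * p[1] for i in range(2)]
    return Q, p, theta

delta = 1e-6
Q = [[1.0 / delta, 0.0], [0.0, 1.0 / delta]]  # assumed prior (1/delta) * I
p = [0.0, 0.0]
for x1, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:
    Q, p, theta = rls_step(Q, p, [1.0, x1], y)  # x0 = 1 gives the offset term
# theta approaches [1, 2] since the data lies on y = 1 + 2x
```

Each step costs O(M²) regardless of how many examples have been seen, which is the whole point of the incremental formulation.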

43

Recap

• Function learning with least squares

• Neural nets, backpropagation, and gradient descent

• Incremental learning

44

Reminder

• HW6 due

• HW7 available on Oncourse

45

Machine Learning Classes

• CS659 (Hauser) Principles of Intelligent Robot Motion

• CS657 (Yu) Computer Vision

• STAT520 (Trosset) Introduction to Statistics

• STAT682 (Rocha) Statistical Model Selection
