15-780: Graduate Artificial Intelligence Neural networks


Page 1

15-780: Graduate Artificial Intelligence

Neural networks

Page 2

Schedule note
• Project proposals due this week
• Midterm
  - Open book, notes
  - 8 (short) questions
  - 1.5 hours, in class
  - Review sessions:
    1. Monday, 11/12, 7-9pm, WeH 4625
    2. Thursday, 11/13, 7-9pm, WeH 4625
  - Email us if you have specific questions

Page 3

Midterm: Topics
• Search (A*, BFS, DFS, iterative deepening)
• Path planning (A*, RRTs, c-space, discretization/partitioning)
• Propositional and first-order logic (syntax, semantics/models, transformation rules, normal forms, theorem provers, resolution, unification)
• SAT and CSPs (walksat, DPLL, constraint propagation, NP, reductions, phase transitions)
• Planning (languages like STRIPS, linear planning, POP, GraphPlan)
• Optimization (LP, ILP/MILP, branch & bound, simplex)
• Duality (Lagrange multipliers, cutting planes)
• Game tree search (minimax, alpha-beta)
• Normal form games (equilibria, dominated strategies, bargaining)
• Probabilistic reasoning
• Bayesian networks
• Density estimation
• HMMs
• Decision trees
• Neural networks
• MDPs
• Reinforcement learning

Pages 4-6

Mimicking the brain
• In the early days of AI there was a lot of interest in developing models that could mimic human thinking.
• While no one knew exactly how the brain works (and, even though there has been a lot of progress since, little is still known), some of its basic computational units were known.
• A key component of these units is the neuron.

Page 7

The neuron
• A cell in the brain
• Highly connected to other neurons
• Thought to perform computations by integrating signals from other neurons
• The outputs of these computations may be transmitted to one or more neurons

Page 8

What can we do with NNs?
• Classification
  - We already mentioned many useful applications
• Regression
  - A new concept:
    Input: real-valued variables
    Output: one or more real values
• Examples:
  - Predict the price of Google's stock from Microsoft's stock
  - Predict the distance to an obstacle from various sensors

Page 9

Linear regression
• Given an input x we would like to compute an output y
• In linear regression we assume that y and x are related by the following equation:

  y = wx + ε

  where w is a parameter and ε represents measurement or other noise

(Figure: data points plotted as y vs. x with a fitted line)

Page 10

Linear regression
• Our goal is to estimate w from training data of <xi, yi> pairs
• This could be done using a least squares approach:

$$\hat{w} = \arg\min_w \sum_i (y_i - w x_i)^2$$

• Why least squares?
  - Minimizes the squared distance between the measurements and the predicted line
  - Has a nice probabilistic interpretation
  - Easy to compute

(Figure: data points around the line y = wx + ε)

If the noise is Gaussian with mean 0, then least squares is also the maximum likelihood estimate of w.

Page 11

Solving linear regression
• You should be familiar with this by now …
• We just take the derivative w.r.t. w and set it to 0:

$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = -2\sum_i x_i\,(y_i - w x_i) = 0$$

$$\Rightarrow\quad \sum_i x_i y_i = w \sum_i x_i^{2} \quad\Rightarrow\quad w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}$$
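For concreteness, here is a minimal NumPy sketch of this closed-form fit (an illustration, not part of the original slides; the synthetic data mirrors the w = 2 example on the next slides):

```python
import numpy as np

def fit_w(x, y):
    # w = sum_i x_i*y_i / sum_i x_i^2  (least squares line through the origin)
    return np.dot(x, y) / np.dot(x, x)

# synthetic data: y = 2x + Gaussian noise with std 1
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2 * x + rng.normal(0, 1, size=100)
print(fit_w(x, y))  # should be close to 2
```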

Page 12

Regression example
• Generated: w = 2
• Recovered: w = 2.03
• Noise: std = 1

Page 13

Regression example
• Generated: w = 2
• Recovered: w = 2.05
• Noise: std = 2

Page 14

Regression example
• Generated: w = 2
• Recovered: w = 2.08
• Noise: std = 4

Page 15

Affine regression
• So far we assumed that the line passes through the origin
• What if the line does not?
• No problem, simply change the model to y = w0 + w1x + ε
• Can use least squares to determine w0, w1:

$$w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n} \qquad\qquad w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^{2}}$$

(Figure: fitted line with intercept w0 on the y-axis)

Page 16

(Same slide as above.) Just a second, we will soon give a simpler solution.

Page 17

Multivariate regression
• What if we have several inputs?
  - Stock prices for Yahoo, Microsoft, and Ebay for the Google prediction task
• This becomes a multivariate regression problem
• Again, it's easy to model: y = w0 + w1x1 + … + wkxk + ε

Notation:
• Lower case: variable or parameter (w0)
• Lower case bold: vector (w)
• Upper case bold: matrix (X)

Page 18

Multivariate regression: Least squares
• We are now interested in a vector wT = [w0, w1, …, wk]
• It is useful to represent this in matrix notation, with one row per example and a leading 1 for the intercept:

$$X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ 1 & x_{2,1} & \cdots & x_{2,k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

• We can thus re-write our model as y = Xw + ε
• The solution turns out to be: w = (XTX)-1XTy
• This is an instance of a larger set of computational solutions which are usually referred to as 'generalized least squares'

Page 19

Multivariate regression: Least squares
• We can re-write our model as y = Xw + ε
• The solution turns out to be: w = (XTX)-1XTy
• This is an instance of a larger set of computational solutions which are usually referred to as 'generalized least squares'
• XTX is a k-by-k matrix
• XTy is a vector with k entries

Why is (XTX)-1XTy the right solution?
Hint: multiply both sides of the original equation by (XTX)-1XT
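A corresponding NumPy sketch (illustrative only; the data and the helper name fit_linear are assumptions, and in practice np.linalg.lstsq is usually preferred to forming the normal equations explicitly):

```python
import numpy as np

def fit_linear(X, y):
    # prepend a column of ones so w[0] plays the role of the intercept w0
    Xb = np.column_stack([np.ones(len(X)), X])
    # normal equations: w = (X^T X)^{-1} X^T y
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# usage: 100 examples, 3 inputs, true weights [w0, w1, w2, w3] = [1, 2, -1, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)
print(fit_linear(X, y))  # approximately [1, 2, -1, 0.5]
```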

Page 20

Multivariate regression: Least squares
• We can re-write our model as y = Xw + ε
• The solution turns out to be: w = (XTX)-1XTy
• We need to invert a k-by-k matrix
  - This takes O(k^3)
  - Depending on k this can be rather slow

Page 21

Where we are
• Linear regression – solved!
• But
  - The solution may be slow
  - It does not address general regression problems of the form y = f(wTx)

Page 22

Back to NNs: The perceptron
• The basic processing unit of a neural net

$$y = f\Big(\sum_i w_i x_i\Big)$$

(Figure: inputs 1, x1, x2, …, xk with weights w0, w1, w2, …, wk feeding a single output unit y)
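A minimal Python sketch of such a unit (an illustration, not from the slides; the function f and the bias handling via a constant input of 1 are written out explicitly):

```python
import numpy as np

def perceptron_output(w, x, f=lambda h: h):
    # w: weights [w0, w1, ..., wk]; x: inputs [x1, ..., xk]; the constant 1 supplies the bias
    h = w[0] * 1.0 + np.dot(w[1:], x)
    return f(h)

# with f = identity this is exactly the linear-regression unit on the next slide
print(perceptron_output(np.array([0.5, 2.0, -1.0]), np.array([1.0, 3.0])))  # 0.5 + 2 - 3 = -0.5
```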

Page 23

Linear regression
• Let's start by setting f(∑ wi xi) = ∑ wi xi
• We are back to linear regression
• Unlike our original linear regression solution, for perceptrons we will use a different strategy
• Why?
  - We will discuss this later; for now let's focus on the solution …

(Figure: the same unit with the identity output function, y = ∑ wi xi)

Page 24

Gradient descent

(Figure: the error z = (f(w) - y)^2 plotted against w, with the slope ∂z/∂w and small changes Δz, Δw marked)

• Going in the opposite direction to the slope will lead to a smaller z
• But not too much, otherwise we would go beyond the optimal w

Page 25

Gradient descent
• Going in the opposite direction to the slope will lead to a smaller z
• But not too much, otherwise we would go beyond the optimal w
• We thus update the weights by setting:

$$w \leftarrow w - \lambda \frac{\partial z}{\partial w}$$

where λ is a small constant which is intended to prevent us from passing the optimal w
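As a toy illustration (not from the slides), one gradient step for a single squared-error term z = (wx - y)^2 with a fixed λ; the values of λ, x, and y below are arbitrary:

```python
def gd_step(w, x, y, lam=0.01):
    # z = (w*x - y)^2, so dz/dw = 2*x*(w*x - y)
    grad = 2 * x * (w * x - y)
    return w - lam * grad

w = 0.0
for _ in range(200):          # repeatedly step against the slope
    w = gd_step(w, x=3.0, y=6.0)
print(w)  # converges toward 2, the w that makes 3w = 6
```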

Page 26

Example of choosing the 'right' λ
• We get a monotonically decreasing error as we perform more updates

Page 27

Gradient descent for linear regression
• We compute the gradient w.r.t. each wi:

$$\frac{\partial}{\partial w_i}\Big(y - \sum_k w_k x_k\Big)^2 = -2\,x_i\Big(y - \sum_k w_k x_k\Big)$$

• And if we have n measurements then

$$\frac{\partial}{\partial w_i}\sum_{j=1}^{n}\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)^2 = -2\sum_{j=1}^{n} x_{j,i}\,\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)$$

where x_{j,i} is the i'th value of the j'th input vector

Page 28

Gradient descent for linear regression
• If we have n measurements then

$$\frac{\partial}{\partial w_i}\sum_{j=1}^{n}\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)^2 = -2\sum_{j=1}^{n} x_{j,i}\,\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)$$

• Set

$$\delta_j = y_j - \mathbf{w}^T\mathbf{x}_j$$

• Then our update rule can be written as

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, x_{j,i}$$

Page 29

Gradient descent algorithm for linear regression
1. Choose λ
2. Start with a guess for w
3. Compute δj for all j
4. For all i set

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, x_{j,i}$$

5. If there is no improvement in

$$\sum_{j=1}^{n}\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)^2$$

stop. Otherwise go to step 3.
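A direct Python/NumPy transcription of this loop (an illustrative sketch; λ, the stopping tolerance, the iteration cap, and the toy data are assumptions, not from the slides):

```python
import numpy as np

def gd_linear_regression(X, y, lam=0.001, tol=1e-8, max_iters=10000):
    """Batch gradient descent for y ~ Xw (no intercept term, as on the slides)."""
    w = np.zeros(X.shape[1])               # step 2: initial guess
    prev_err = np.inf
    for _ in range(max_iters):
        delta = y - X @ w                   # step 3: delta_j = y_j - w^T x_j
        w = w + 2 * lam * X.T @ delta       # step 4: w_i += 2*lam*sum_j delta_j * x_{j,i}
        err = np.sum((y - X @ w) ** 2)      # step 5: squared error
        if prev_err - err < tol:            # stop when there is no improvement
            break
        prev_err = err
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(0, 0.1, size=100)
print(gd_linear_regression(X, y))  # close to [2, -3]
```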

Page 30

Example
• w = 2

Page 31

Gradient descent vs. matrix inversion
• Advantages of matrix inversion
  - No iterations
  - No need to specify parameters
  - Closed-form solution in a predictable time
• Advantages of gradient descent
  - Applicable regardless of the number of parameters
  - General: applies to other forms of regression

Page 32

Perceptrons for classification
• So far we discussed regression
• However, perceptrons can also be used for classification
• For example, output 1 if wTx > 0 and -1 otherwise
• Problem?

Page 33

Perceptrons for classification
• So far we discussed regression
• However, perceptrons can also be used for classification
• Outputs are either 0 or 1
• We predict 1 if wTx > 1/2 and 0 otherwise
• Problem?

(Figure: 0/1 labels plotted against x; the best least squares fit differs from the best classifier)

Page 34

The sigmoid function
• To classify using a perceptron we replace the linear function with the sigmoid function:

$$g(h) = \frac{1}{1+e^{-h}}$$

• Using the sigmoid we would minimize

$$\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2$$

• where yj is either 0 or 1 depending on the class

Page 35

Gradient descent with sigmoid
• Once we have defined our target function, we can minimize it using gradient descent
• This involves some math, and relies on the following derivative*:

$$g'(h) = g(h)\big(1 - g(h)\big)$$

• So,

$$\frac{\partial}{\partial w_i}\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2
= -2\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)\,\frac{\partial}{\partial w_i} g(\mathbf{w}^T\mathbf{x}_j)$$

$$= -2\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)\, g'(\mathbf{w}^T\mathbf{x}_j)\,\frac{\partial}{\partial w_i}\,\mathbf{w}^T\mathbf{x}_j
= -2\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)\, g(\mathbf{w}^T\mathbf{x}_j)\big(1 - g(\mathbf{w}^T\mathbf{x}_j)\big)\, x_{j,i}$$

*A derivation of g'(h) is included at the end of the lecture notes.

Page 36

Gradient descent with sigmoid
• Set

$$\delta_j = y_j - g(\mathbf{w}^T\mathbf{x}_j), \qquad g_j = g(\mathbf{w}^T\mathbf{x}_j)$$

• The gradient from the previous slide then becomes

$$\frac{\partial}{\partial w_i}\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2 = -2\sum_{j=1}^{n}\delta_j\, g_j(1 - g_j)\, x_{j,i}$$

• So our update rule is:

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, g_j(1 - g_j)\, x_{j,i}$$

Page 37

Revised algorithm for sigmoid regression
1. Choose λ
2. Start with a guess for w
3. Compute δj for all j
4. For all i set

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, g_j(1 - g_j)\, x_{j,i}$$

5. If there is no improvement in

$$\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2$$

stop. Otherwise go to step 3.
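A minimal sketch of this loop in Python/NumPy (illustrative only; λ, the tolerance, and the toy data are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def gd_sigmoid_regression(X, y, lam=0.1, tol=1e-8, max_iters=20000):
    """Batch gradient descent on sum_j (y_j - g(w^T x_j))^2 with y_j in {0, 1}."""
    w = np.zeros(X.shape[1])                                 # initial guess
    prev_err = np.inf
    for _ in range(max_iters):
        g = sigmoid(X @ w)                                   # g_j = g(w^T x_j)
        delta = y - g                                        # delta_j = y_j - g_j
        w = w + 2 * lam * X.T @ (delta * g * (1 - g))        # update rule from the slide
        err = np.sum((y - sigmoid(X @ w)) ** 2)
        if prev_err - err < tol:
            break
        prev_err = err
    return w

# toy usage: 1-D inputs with a bias column, classes separated around x = 0
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
X = np.column_stack([np.ones(200), x])
y = (x > 0).astype(float)
print(gd_sigmoid_regression(X, y))  # w[1] comes out clearly positive (class 1 lies at x > 0)
```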

Page 38

Multilayer neural networks
• So far we discussed networks with one layer.
• But these networks can be extended to combine several layers, increasing the set of functions that can be represented using a NN

(Figure: inputs 1, x1, x2 feed two hidden units v1 = g(wTx) and v2 = g(wTx) through weights w0,1, w1,1, w2,1 and w0,2, w1,2, w2,2; the hidden values then feed an output unit g(wTv) through weights w1, w2. The middle units are often called the 'hidden layer'.)

Page 39

Learning the parameters for multilayer networks
• Gradient descent works by connecting the output to the inputs.
• But how do we use it for a multilayer network?
• We need to account for both the output weights and the hidden layer weights

(Figure: the same two-layer network as on the previous slide)

Page 40

Learning the parameters for multilayer networks
• It's easy to compute the update rule for the output weights w1 and w2:

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, g_j(1 - g_j)\, v_{j,i}$$

where

$$\delta_j = y_j - g(\mathbf{w}^T\mathbf{v}_j)$$

(Figure: the same two-layer network)

Page 41

(Same slide as above.) But what is the error associated with each of the hidden layer states?

Page 42

Backpropagation
• A method for distributing the error among the hidden layer states
• Using the error for each of these states we can employ gradient descent to update them
• Set

$$\Delta_{j,i} = w_i\,\delta_j\,(1 - g_j)\,g_j$$

where δj is the output error and wi is the weight from hidden unit i to the output.

(Figure: the same two-layer network, with the output error and the output weight labeled)

Page 43

Backpropagation
• A method for distributing the error among the hidden layer states
• Using the error for each of these states we can employ gradient descent to update them
• Set

$$\Delta_{j,i} = w_i\,\delta_j\,(1 - g_j)\,g_j$$

• Our update rule changes to:

$$w_{k,i} \leftarrow w_{k,i} + 2\lambda\sum_{j=1}^{n}\Delta_{j,i}\, g_{j,i}(1 - g_{j,i})\, x_{j,k}$$

Page 44

Backpropagation

$$w_{k,i} \leftarrow w_{k,i} + 2\lambda\sum_{j=1}^{n}\Delta_{j,i}\, g_{j,i}(1 - g_{j,i})\, x_{j,k}$$

The correct error term for each hidden state can be determined by taking the partial derivative of the global error function with respect to each of the hidden layer weight parameters*:

$$Err_j = \Big(y_j - g\big(\mathbf{w}^T g(\mathbf{w}_i^T\mathbf{x}_j)\big)\Big)^2$$

*See the R&N book for details (pages 746-747)

Page 45

Revised algorithm for multilayered neural network
1. Choose λ
2. Start with a guess for w, wi
3. Compute the values v_{i,j} for all hidden layer states i and inputs j
4. Compute δj for all j:

$$\delta_j = y_j - g(\mathbf{w}^T\mathbf{v}_j)$$

5. Compute Δ_{j,i}
6. For all i set

$$w_i \leftarrow w_i + 2\lambda\sum_{j=1}^{n}\delta_j\, g_j(1 - g_j)\, v_{j,i}$$

7. For all k and i set

$$w_{k,i} \leftarrow w_{k,i} + 2\lambda\sum_{j=1}^{n}\Delta_{j,i}\, g_{j,i}(1 - g_{j,i})\, x_{j,k}$$

8. If there is no improvement in

$$\sum_{j=1}^{n}\delta_j^{2} + \sum_{i=1}^{s}\Delta_{j,i}^{2}$$

stop. Otherwise go to step 3.
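As an illustration only, here is a small NumPy sketch of this procedure for one sigmoid hidden layer and a single sigmoid output (the network of the earlier figures); the learning rate, initialization, iteration count, and the XOR toy data are arbitrary choices, not from the slides:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_two_layer(X, y, n_hidden=3, lam=0.5, iters=5000, seed=0):
    """Squared-error backprop: sigmoid hidden layer, single sigmoid output."""
    rng = np.random.default_rng(seed)
    W_hid = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))  # input -> hidden weights w_{k,i}
    w_out = rng.normal(scale=0.5, size=n_hidden)                 # hidden -> output weights w_i
    for _ in range(iters):
        V = sigmoid(X @ W_hid)                            # hidden values v_{j,i}        (step 3)
        g = sigmoid(V @ w_out)                            # network outputs g_j
        delta = y - g                                     # delta_j = y_j - g(w^T v_j)   (step 4)
        Delta = np.outer(delta * g * (1 - g), w_out)      # Delta_{j,i} = w_i*delta_j*g_j*(1-g_j)  (step 5)
        w_out = w_out + 2 * lam * V.T @ (delta * g * (1 - g))       # output-weight update (step 6)
        W_hid = W_hid + 2 * lam * X.T @ (Delta * V * (1 - V))       # hidden-weight update (step 7)
    return W_hid, w_out

# toy usage: XOR, a function a single perceptron cannot represent (leading 1 = bias input)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W_hid, w_out = train_two_layer(X, y)
# outputs should move toward [0, 1, 1, 0], though squared-error backprop can stall in local minima
print(np.round(sigmoid(sigmoid(X @ W_hid) @ w_out), 2))
```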

Page 46

Examples

Figure 1: Feedforward ANN designed and tested for prediction of tactical air combat maneuvers.

Page 47

What you should know
• Linear regression
  - Solving a linear regression problem
• Gradient descent
• Perceptrons
  - Sigmoid functions for classification
• Multilayered neural networks
  - Backpropagation

Page 48

Deriving g'(x)
• Recall that g(x) is the sigmoid function, so

$$g(x) = \frac{1}{1+e^{-x}}$$

• The derivation of g'(x) is below
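The steps of that derivation (a standard chain-rule computation, reconstructed here since they are not included in this transcript):

$$g'(x) = \frac{d}{dx}\,(1+e^{-x})^{-1} = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = g(x)\,\big(1-g(x)\big),$$

since

$$\frac{e^{-x}}{1+e^{-x}} = 1 - \frac{1}{1+e^{-x}} = 1 - g(x).$$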