Computation Graphs - cs.umd.edu

Page 1: Computation Graphs - cs.umd.edu

Computation Graphs

Brian Thompson, slides by Philipp Koehn

2 October 2018

Antonis Anastasopoulos, MTMA, May 27th, 2019
Page 2: Computation Graphs - cs.umd.edu

Neural Network Cartoon

• A common way to illustrate a neural network


Page 3: Computation Graphs - cs.umd.edu

Neural Network Math

• Hidden layer: h = sigmoid(W1 x + b1)

• Final layer: y = sigmoid(W2 h + b2)
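A minimal NumPy sketch of these two equations (the helper names sigmoid and forward are mine, not from the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)   # hidden layer
    y = sigmoid(W2 @ h + b2)   # final layer
    return y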

Page 4: Computation Graphs - cs.umd.edu

Computation Graph

[The same network as a computation graph: x and W1 feed a prod node; its output and b1 feed a sum node; the sum feeds a sigmoid node; that output and W2 feed a second prod node, then a second sum node with b2, and a final sigmoid node produces the output.]

Page 5: Computation Graphs - cs.umd.edu

Simple Neural Network

[Network diagram with two input nodes, two hidden nodes, one output node, and bias nodes fixed to 1. Edge weights: hidden layer 3.7, 2.9 and 3.7, 2.9 with biases −1.5 and −4.6; output layer 4.5 and −5.2 with bias −2.0.]

Page 6: Computation Graphs - cs.umd.edu

Computation Graph

[The graph annotated with the parameter values: W1 = [3.7 3.7; 2.9 2.9], b1 = [−1.5, −4.6], W2 = [4.5, −5.2], b2 = [−2.0].]

Page 7: Computation Graphs - cs.umd.edu

Processing Input

[Parameters as before; the input node is set to x = [1.0, 0.0].]

Page 8: Computation Graphs - cs.umd.edu

Processing Input

[prod node: W1 x = [3.7, 2.9].]

Page 9: Computation Graphs - cs.umd.edu

Processing Input

[sum node: W1 x + b1 = [2.2, −1.6].]

Page 10: Computation Graphs - cs.umd.edu

Processing Input

[sigmoid node: h = [0.900, 0.168].]

Page 11: Computation Graphs - cs.umd.edu

Processing Input

[Second layer: prod = W2 h = [3.18], sum = [1.18], and the final sigmoid output y = [0.765].]
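Plugging the parameter values from the preceding slides into the sketch above reproduces these numbers. One caveat: the diagram lists the second hidden bias as −4.6, but the worked values ([2.2, −1.6] and onwards) are consistent with −4.5, which is what this check uses.

W1 = np.array([[3.7, 3.7], [2.9, 2.9]])
b1 = np.array([-1.5, -4.5])   # -4.5 assumed here; the diagram shows -4.6
W2 = np.array([4.5, -5.2])
b2 = np.array([-2.0])
x  = np.array([1.0, 0.0])

h = sigmoid(W1 @ x + b1)      # W1 @ x = [3.7, 2.9]; + b1 = [2.2, -1.6]; sigmoid -> [0.900, 0.168]
y = sigmoid(W2 @ h + b2)      # W2 @ h = 3.18; + b2 = 1.18; sigmoid -> [0.765]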

Page 12: Computation Graphs - cs.umd.edu

Error Function

• For training, we need a measure of how well we do

⇒ Cost function

also known as objective function, loss, gain, cost, ...

• For instance the L2 norm: error = 1/2 (t − y)²
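A quick check with the values above, taking t = 1.0 as the target used later in the backward-pass slides:

t = np.array([1.0])
error = 0.5 * (t - y) ** 2    # 0.5 * 0.235**2 ≈ [0.0277]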

Page 13: Computation Graphs - cs.umd.edu

Gradient Descent

• We view the error as a function of the trainable parameters: error(λ)

• We want to optimize error(λ) by moving it towards its optimum

[Three plots of error(λ) against λ, with the current λ some distance from the optimal λ; the gradient at the current point is 1, 2, and 0.2 respectively.]

• Why not just set it to its optimum?

– we are updating based on one training example, do not want to overfit to it
– we are also changing all the other parameters, the curve will look different
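A small self-contained sketch of such an update step: move λ by a step of size µ against the gradient instead of jumping to the optimum (the quadratic error curve is only an illustrative stand-in, not from the slides):

def gradient_descent_step(lam, gradient, mu=0.1):
    return lam - mu * gradient

# illustration: error(lambda) = (lambda - 3)**2 has its optimum at lambda = 3
lam = 0.0
for _ in range(25):
    gradient = 2 * (lam - 3.0)            # d/dlambda of (lambda - 3)**2
    lam = gradient_descent_step(lam, gradient)
# lam approaches 3 in small steps rather than being set there directly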

Page 14: Computation Graphs - cs.umd.edu

Calculus Refresher: Chain Rule

• Formula for computing the derivative of a composition of two or more functions

– functions f and g
– the composition f ∘ g maps x to f(g(x))

• Chain rule

(f ∘ g)′ = (f′ ∘ g) · g′

or

F′(x) = f′(g(x)) g′(x)

• Leibniz's notation: dz/dx = dz/dy · dy/dx

if z = f(y) and y = g(x), then dz/dx = dz/dy · dy/dx = f′(y) g′(x) = f′(g(x)) g′(x)
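A numeric sanity check of the chain rule; the choice f = sigmoid and g(x) = x² is mine, purely for illustration:

import math

def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    return x ** 2

def F(x):                                        # F = f o g
    return sigmoid_scalar(g(x))

x = 1.5
s = sigmoid_scalar(g(x))
analytic = s * (1 - s) * 2 * x                   # f'(g(x)) * g'(x)
eps = 1e-6
numeric = (F(x + eps) - F(x - eps)) / (2 * eps)  # finite-difference estimate
# analytic and numeric agree to several decimal places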

Page 15: Computation Graphs - cs.umd.edu

Final Layer Update

• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error (L2 norm): E = 1/2 (t − y)²

• Derivative of the error with respect to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk
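Applied to the running example (t, y, h from the forward-pass sketch), and following the slides' sign convention, which writes the error derivative as t − y rather than the textbook −(t − y):

dE_dy = t - y                    # [0.235]
dy_ds = y * (1 - y)              # sigmoid derivative at s: [0.180]
ds_dw = h                        # [0.900, 0.168]
dE_dw = dE_dy * dy_ds * ds_dw    # [0.0382, 0.00712]: the gradient for the two weights in W2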

Page 16: Computation Graphs - cs.umd.edu

Error Computation in Computation Graph

[The computation graph extended for training: the output of the final sigmoid and the target t feed an L2 node on top of the network (x, W1 → prod → sum with b1 → sigmoid → prod with W2 → sum with b2 → sigmoid → L2 with t).]

Page 17: Computation Graphs - cs.umd.edu

Error Propagation in Computation Graph

[A chain of nodes: A feeds into B, which feeds (possibly via further nodes) into the error E.]

• Compute derivative at node A: dE/dA = dE/dB · dB/dA

• Assume that we already computed dE/dB (backward pass through the graph)

• So now we only have to get the formula for dB/dA

• For instance, B is a square node

– forward computation: B = A²
– backward computation: dB/dA = dA²/dA = 2A
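The square-node example as code: the forward pass remembers its input, and the backward pass multiplies the incoming gradient dE/dB by the local derivative 2A (a minimal sketch; the class is mine, not from the slides):

class SquareNode:
    def forward(self, A):
        self.A = A                      # keep the input for the backward pass
        return A ** 2                   # B = A^2

    def backward(self, dE_dB):
        return dE_dB * 2 * self.A       # dE/dA = dE/dB * dB/dA = dE/dB * 2A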

Page 18: Computation Graphs - cs.umd.edu

Derivatives for Each Node

[Computation graph as before, now annotated with the local derivative of each node.]

dL2/dsigmoid: do/di = d/di 1/2 (t − i)² = t − i

Page 19: Computation Graphs - cs.umd.edu

Derivatives for Each Node

dsigmoid/dsum: do/di = d/di σ(i) = σ(i)(1 − σ(i))

Page 20: Computation Graphs - cs.umd.edu

Derivatives for Each Node

dsum/dprod: do/di1 = d/di1 (i1 + i2) = 1, do/di2 = 1

Page 21: Computation Graphs - cs.umd.edu

Derivatives for Each Node

All local derivatives:

dL2/dsigmoid: do/di = d/di 1/2 (t − i)² = t − i

dsigmoid/dsum: do/di = d/di σ(i) = σ(i)(1 − σ(i))

dsum/dprod: do/di1 = d/di1 (i1 + i2) = 1, do/di2 = 1

dprod: do/di1 = d/di1 (i1 · i2) = i2, do/di2 = i1
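The four rules as small backward helpers, reusing the sigmoid defined in the first sketch (the function names are mine); each takes the node's input(s) and the gradient d_out arriving from above and returns the gradient(s) for its inputs:

def l2_backward(t, i, d_out=1.0):
    return d_out * (t - i)              # slides' convention for d/di of 1/2 (t - i)^2

def sigmoid_backward(i, d_out):
    s = sigmoid(i)
    return d_out * s * (1 - s)

def sum_backward(i1, i2, d_out):
    return d_out, d_out                 # both inputs receive the gradient unchanged

def prod_backward(i1, i2, d_out):
    return d_out * i2, d_out * i1       # each input receives the other input times the gradient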

Page 22: Computation Graphs - cs.umd.edu

Backward Pass: Derivative Computation

[The graph from the forward pass with target t = [1.0]; each node is also labeled with its local derivative (i2 − i1 for L2, σ′(i) for sigmoid, 1, 1 for sum, i2, i1 for prod). The backward pass starts at the top: the error is [0.0277] and its derivative with respect to the final sigmoid output is [0.235].]

Page 23: Computation Graphs - cs.umd.edu

Backward Pass: Derivative Computation

[Next node down, the output sigmoid: its local derivative σ′ evaluates to [0.180]; multiplied by the incoming gradient [0.235] this gives [0.0424].]

Page 24: Computation Graphs - cs.umd.edu

Backward Pass: Derivative Computation

[The sum node passes the gradient [0.0424] on unchanged to both of its inputs, b2 and the prod node.]

Page 25: Computation Graphs - cs.umd.edu

Backward Pass: Derivative Computation

[Continuing down the graph: at the second prod node the gradient with respect to W2 is [0.0382, 0.00712] and with respect to the hidden layer it is [0.191, −0.220]; the first sigmoid turns this into [0.0171, −0.0308]; the first sum node passes it to b1 and to the prod node; the first prod node yields [0.0171 0; −0.0308 0] for W1 and [−0.0260, −0.0260] for the input x.]

Page 26: Computation Graphs - cs.umd.edu

Gradients for Parameter Update

[The gradients computed for the trainable parameters:
W1: [0.0171 0; −0.0308 0]
b1: [0.0171, −0.0308]
W2: [0.0382, 0.00712]
b2: [0.0424]]
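The whole backward pass in NumPy, continuing the variables of the forward-pass sketch and reproducing the gradients on this slide (again with the slides' t − y sign convention):

d_y    = t - y                   # [0.235]   at the output sigmoid
d_sum2 = d_y * y * (1 - y)       # [0.0424]  through the output sigmoid
d_b2   = d_sum2                  # [0.0424]  the sum node passes it on unchanged
d_W2   = d_sum2 * h              # [0.0382, 0.00712]
d_h    = d_sum2 * W2             # [0.191, -0.220]
d_sum1 = d_h * h * (1 - h)       # [0.0171, -0.0308]  through the hidden sigmoid
d_b1   = d_sum1                  # [0.0171, -0.0308]
d_W1   = np.outer(d_sum1, x)     # [[0.0171, 0], [-0.0308, 0]]
d_x    = W1.T @ d_sum1           # [-0.0260, -0.0260]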

Page 27: Computation Graphs - cs.umd.edu

Parameter Update

[Each parameter is updated by a step of size µ in the direction of its gradient:
W1: [3.7 3.7; 2.9 2.9] − µ [0.0171 0; −0.0308 0]
b1: [−1.5, −4.6] − µ [0.0171, −0.0308]
W2: [4.5, −5.2] − µ [0.0382, 0.00712]
b2: [−2.0] − µ [0.0424]]
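And the update itself, continuing from the gradients above; the learning rate 0.1 is only an example value (it is the one used in the Theano code later on):

mu = 0.1
W1 -= mu * d_W1
b1 -= mu * d_b1
W2 -= mu * d_W2
b2 -= mu * d_b2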

Page 28: Computation Graphs - cs.umd.edu

toolkits

Page 29: Computation Graphs - cs.umd.edu

Explosion of Deep Learning Toolkits

• University of Montreal: Theano

• Google: TensorFlow

• Microsoft: CNTK

• Facebook: Torch, PyTorch

• Amazon: MXNet

• CMU: DyNet

• AMU/Edinburgh: Marian

• ... and many more

Page 30: Computation Graphs - cs.umd.edu

Toolkits

• Machine learning architectures built around computation graphs are very powerful

– define a computation graph
– provide data and a training strategy (e.g., batching)
– the toolkit does the rest

Page 31: Computation Graphs - cs.umd.edu

Example: Theano

• Deep learning toolkit for Python

• Included as a library

> import numpy

> import theano

> import theano.tensor as T


Page 32: Computation Graphs - cs.umd.edu

Example: Theano

• Definition of parameters

> x = T.dmatrix()

> W = theano.shared(value=numpy.array([[3.0,2.0],[4.0,3.0]]))

> b = theano.shared(value=numpy.array([-2.0,-4.0]))

• Definition of feed-forward layer

> h = T.nnet.sigmoid(T.dot(x,W)+b)

note: x is a matrix → this processes several training examples at once (a sequence of vectors).

• Define as callable function

> h_function = theano.function([x], h)

• Apply to data

> h_function([[1,0]])

array([[ 0.73105858, 0.11920292]])


Page 33: Computation Graphs - cs.umd.edu

Example: Theano

• Same setup for hidden→output layer

W2 = theano.shared(value=numpy.array([5.0,-5.0]))

b2 = theano.shared(-2.0)

y_pred = T.nnet.sigmoid(T.dot(h,W2)+b2)

• Define as callable function

> predict = theano.function([x], y_pred)

• Apply to data

> predict([[1,0]])

array([[ 0.7425526]])


Page 34: Computation Graphs - cs.umd.edu

Model Training

• First, define the variable for the correct output

> y = T.dvector()

• Definition of a cost function (we use the L2 norm).

> l2 = (y-y_pred)**2

> cost = l2.mean()

• Gradient descent training: computation of the derivative

> gW, gb, gW2, gb2 = T.grad(cost, [W,b,W2,b2])

• Update rule (with learning rate 0.1)

> train = theano.function(inputs=[x,y],outputs=[y_pred,cost],

updates=((W, W-0.1*gW), (b, b-0.1*gb),

(W2, W2-0.1*gW2), (b2, b2-0.1*gb2)))


Page 35: Computation Graphs - cs.umd.edu

Model Training

• Training data

> DATA_X = numpy.array([[0,0],[0,1],[1,0],[1,1]])

> DATA_Y = numpy.array([0,1,1,0])

• Predict output for training data

> predict(DATA_X)

array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961])


Page 36: Computation Graphs - cs.umd.edu

Model Training

• Train with training data

> train(DATA_X,DATA_Y)

[array([ 0.18333462, 0.7425526 , 0.7425526 , 0.33430961]),

array(0.06948320612438118)]

• Prediction after training

> train(DATA_X,DATA_Y)

[array([ 0.18353091, 0.74260499, 0.74321824, 0.33324929]),

array(0.06923193686092949)]


Page 37: Computation Graphs - cs.umd.edu

example: dynet

Page 38: Computation Graphs - cs.umd.edu

DyNet

• Our example: static computation graph, fixed set of data

• But: language requires different computations for different data items (sentences have different lengths)

⇒ Dynamically create a computation graph for each data item

Page 39: Computation Graphs - cs.umd.edu

Example: DyNet

model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))
trainer = dy.SimpleSGDTrainer(model)
for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        dy.renew_cg()
        W = dy.parameter(W_p)
        b = dy.parameter(b_p)
        score_sym = dy.softmax(
            W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
        loss_sym = dy.pickneglogsoftmax(score_sym, out_label)
        loss_sym.forward()
        loss_sym.backward()
        trainer.update()

Page 40: Computation Graphs - cs.umd.edu

Model Parameters

model = dy.Model()
W_p = model.add_parameters((20, 100))
b_p = model.add_parameters(20)
E = model.add_lookup_parameters((20000, 50))

(rest of the code as on the previous slide)

Model holds the values for the weight matrices and weight vectors

Page 41: Computation Graphs - cs.umd.edu

Training Setup

trainer = dy.SimpleSGDTrainer(model)

(rest of the code as on the previous slides)

Defines the model update function (could also be Adagrad, Adam, ...)

Page 42: Computation Graphs - cs.umd.edu

Setting up Computation Graph

dy.renew_cg()
W = dy.parameter(W_p)
b = dy.parameter(b_p)

(inside the training loop; rest of the code as before)

Create a new computation graph. Inform it about the parameters.

Antonis Anastasopoulos: the latest version does not require these; use W_p and b_p directly.
Page 43: Computation Graphs - cs.umd.edu

Operations

score_sym = dy.softmax(
    W*dy.concatenate([E[in_words[0]],E[in_words[1]]])+b)
loss_sym = dy.pickneglogsoftmax(score_sym, out_label)

(inside the training loop; rest of the code as before)

Builds the computation graph by defining operations.

Page 44: Computation Graphs - cs.umd.edu

Training Loop

for epoch in range(num_epochs):
    for in_words, out_label in training_data:
        ...
        loss_sym.forward()
        loss_sym.backward()
        trainer.update()

Process the training data. Computations are done in forward and backward.