
Differentiable Functional Programming
Atılım Güneş Baydin
University of Oxford
http://www.robots.ox.ac.uk/~gunes/

F#unctional Londoners Meetup, April 28, 2016

About me

Current (from 11 April 2016): Postdoctoral researcher, Machine Learning Research Group, University of Oxford
http://www.robots.ox.ac.uk/~parg/

Previously: Brain and Computation Lab, National University of Ireland Maynooth
http://www.bcl.hamilton.ie/
Working primarily with F#, on algorithmic differentiation, functional programming, machine learning

1/36

Today's talk

Derivatives in computer programs
Differentiable functional programming
DiffSharp + Hype libraries
Two demos

2/36

Derivatives in computer programs
How do we compute them?

Manual differentiation

f(x) = sin(exp x)

let f x = sin (exp x)

Calculus 101: differentiation rules
d(fg)/dx = (df/dx) g + f (dg/dx)
d(af + bg)/dx = a (df/dx) + b (dg/dx)
...

f′(x) = cos(exp x) × exp x

let f' x = (cos (exp x)) * (exp x)

3/36


Manual differentiation
It can get complicated

f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²
(4th iteration of the logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x)

let f x =
    64.0*x * (1.0-x) * ((1.0 - 2.0*x) ** 2.0) * ((1.0 - 8.0*x + 8.0*x*x) ** 2.0)

f′(x) = 128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²

let f' x =
    128.0*x * (1.0-x) * (-8.0+16.0*x) * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)
    + 64.0 * (1.0-x) * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)**2.0
    - 64.0*x * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)**2.0
    - 256.0*x * (1.0-x) * (1.0-2.0*x) * (1.0-8.0*x+8.0*x*x)**2.0

4/36


Symbolic differentiation
Computer algebra packages help: Mathematica, Maple, Maxima

But it has some serious drawbacks

5/36


Symbolic differentiation
We get "expression swell"

Logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x

n   l_n                              d/dx l_n
1   x                                1
2   4x(1−x)                          4(1−x) − 4x
3   16x(1−x)(1−2x)²                  16(1−x)(1−2x)² − 16x(1−2x)² − 64x(1−x)(1−2x)
4   64x(1−x)(1−2x)²(1−8x+8x²)²       128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²

[Plot: number of terms in l_n and d/dx l_n versus n, for n = 1..5; the term count grows rapidly, reaching several hundred by n = 5]

6/36

Symbolic differentiation
We are limited to closed-form formulae

You can find the derivative of math expressions:
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²

But not of algorithms, branching, control flow:

let f x n =
    if n = 1 then
        x
    else
        let mutable v = x
        for i = 1 to n do
            v <- 4.0 * v * (1.0 - v)
        v

let a = f x 4

7/36


Numerical differentiation
A very common hack: use the limit definition of the derivative

df/dx = lim_{h→0} (f(x + h) − f(x)) / h

to approximate the numerical value of the derivative:

let diff f x =
    let h = 0.00001
    (f (x + h) - f x) / h

Again, some serious drawbacks

8/36
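A quick sanity check of that approximation (an illustrative snippet, not part of the deck): compare the forward difference against the hand-derived derivative of the running example.

let diff f x =
    let h = 0.00001
    (f (x + h) - f x) / h

let f x = sin (exp x)            // the running example
let f' x = cos (exp x) * exp x   // its exact derivative

// At x = 1.0 the forward difference is close, but not exact
let approx, exact = diff f 1.0, f' 1.0
printfn "approx = %f  exact = %f  error = %g" approx exact (abs (approx - exact))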


Numerical differentiation
We must select a proper value of h, and we face approximation errors

[Plot: error E(h, x*) versus h, for h from 10^-17 to 10^-1 (log-log scale); round-off error dominates for small h, truncation error dominates for large h]

Computed using
E(h, x*) = | (f(x* + h) − f(x*)) / h − f′(x*) |
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²
x* = 0.2

9/36

Numerical differentiation
Better approximations exist

Higher-order finite differences, e.g.
∂f(x)/∂x_i = (f(x + h e_i) − f(x − h e_i)) / (2h) + O(h²)

Richardson extrapolation
Differential quadrature

but they increase rapidly in complexity and never completely eliminate the error

10/36
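For comparison, a central-difference version of the earlier diff helper (an illustrative sketch): its O(h²) truncation error makes it noticeably more accurate for the same h, at the cost of two function evaluations.

// Central difference: (f(x+h) - f(x-h)) / (2h), truncation error O(h^2)
let diffCentral f x =
    let h = 0.00001
    (f (x + h) - f (x - h)) / (2.0 * h)

// e.g. diffCentral (fun x -> sin (exp x)) 1.0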

Numerical differentiation

Poor performance:
for f : R^n → R, approximate the gradient ∇f = (∂f/∂x_1, …, ∂f/∂x_n) using
∂f(x)/∂x_i ≈ (f(x + h e_i) − f(x)) / h,   0 < h ≪ 1

We must repeat the function evaluation n times for getting ∇f

11/36
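A sketch of that n-evaluation gradient approximation (illustrative, not library code): one extra evaluation of f per coordinate, which is what makes numerical gradients expensive in high dimensions.

// Forward-difference gradient of f : float[] -> float
let numGrad (f: float[] -> float) (x: float[]) =
    let h = 0.00001
    let fx = f x
    Array.init x.Length (fun i ->
        let xp = Array.copy x
        xp.[i] <- xp.[i] + h          // perturb only coordinate i
        (f xp - fx) / h)

// e.g. numGrad (fun v -> v.[0] * v.[0] + sin v.[1]) [| 1.0; 2.0 |]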


Algorithmic differentiation (AD)

Algorithmic differentiation
Also known as automatic differentiation (Griewank & Walther, 2008)
Gives numeric code that computes the function AND its derivatives at a given point

f(a, b):
    c = a * b
    d = sin c
    return d

f'(a, a', b, b'):
    (c, c') = (a*b, a'*b + a*b')
    (d, d') = (sin c, c' * cos c)
    return (d, d')

Derivatives propagated at the elementary operation level, as a side effect, at the same time when the function itself is computed
→ Prevents the "expression swell" of symbolic derivatives
Full expressive capability of the host language
→ Including conditionals, looping, branching

12/36
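In F#, the derivative-propagating pseudocode above can be written directly by carrying each value together with its tangent (an illustrative sketch, not DiffSharp code):

// f(a, b) = sin (a * b), primal and tangent computed side by side
let f' (a, a') (b, b') =
    let c, c' = a * b, a' * b + a * b'
    let d, d' = sin c, c' * cos c
    d, d'

// ∂f/∂a at (a, b) = (2, 3): seed a with tangent 1, b with tangent 0
let value, deriv = f' (2.0, 1.0) (3.0, 0.0)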

Function evaluation traces
All numeric evaluations are sequences of elementary operations: a "trace," also called a "Wengert list" (Wengert, 1964)

f(a, b):
    c = a * b
    if c > 0
        d = log c
    else
        d = sin c
    return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return d

(tangent)
a = 2
a' = 1
b = 3
b' = 0
c = a * b = 6
c' = a' * b + a * b' = 3
d = log c = 1.791
d' = c' * (1 / c) = 0.5
return d, d'

i.e., a Jacobian-vector product J_f (1, 0)|_(2,3) = ∂/∂a f(a, b)|_(2,3) = 0.5

This is called the forward (tangent) mode of AD

13/36
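The same idea as a small generic forward-mode sketch (illustrative only; DiffSharp's actual implementation is richer): a dual-number type overloads the elementary operations, so the trace above, including its branch, runs unchanged.

// A dual number carries a primal value P and a tangent T through every elementary op
type Dual =
    { P : float; T : float }
    static member ( * ) (x: Dual, y: Dual) = { P = x.P * y.P; T = x.T * y.P + x.P * y.T }
    static member Log (x: Dual) = { P = log x.P; T = x.T / x.P }
    static member Sin (x: Dual) = { P = sin x.P; T = x.T * cos x.P }

let f a b =
    let c = a * b
    if c.P > 0.0 then log c else sin c   // log/sin resolve to Dual.Log / Dual.Sin

// ∂f/∂a at (2, 3): seed a's tangent with 1 and b's with 0
let d = f { P = 2.0; T = 1.0 } { P = 3.0; T = 0.0 }
// d.P ≈ 1.791, d.T = 0.5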


Function evaluation traces

f(a, b):
    c = a * b
    if c > 0
        d = log c
    else
        d = sin c
    return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return d

(adjoint)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
d' = 1
c' = d' * (1 / c) = 0.166
b' = c' * a = 0.333
a' = c' * b = 0.5
return d, a', b'

i.e., a transposed Jacobian-vector product J_f^T (1)|_(2,3) = ∇f|_(2,3) = (0.5, 0.333)

This is called the reverse (adjoint) mode of AD
Backpropagation is just a special case of the reverse mode: code your neural network objective computation, apply reverse AD

14/36
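And a correspondingly minimal reverse-mode sketch (illustrative only): the forward sweep records each elementary operation on a tape, and the reverse sweep pushes adjoints back through it, reproducing a' = 0.5 and b' ≈ 0.333 from the trace above.

// Each value is a tape node that knows how to push its adjoint back to its parents
type Node = { Value : float; mutable Adj : float; Back : float -> unit }

let tape = System.Collections.Generic.List<Node>()

let record value back =
    let n = { Value = value; Adj = 0.0; Back = back }
    tape.Add n
    n

let leaf v = record v (fun _ -> ())
let mul (x: Node) (y: Node) =
    record (x.Value * y.Value) (fun a ->
        x.Adj <- x.Adj + a * y.Value
        y.Adj <- y.Adj + a * x.Value)
let logn (x: Node) = record (log x.Value) (fun a -> x.Adj <- x.Adj + a / x.Value)
let sinn (x: Node) = record (sin x.Value) (fun a -> x.Adj <- x.Adj + a * cos x.Value)

// Reverse pass: seed the output adjoint with 1, sweep the tape backwards
let backward (output: Node) =
    output.Adj <- 1.0
    for i in (tape.Count - 1) .. -1 .. 0 do
        let n = tape.[i]
        n.Back n.Adj

// f(a, b) from the slide, at (2, 3)
let a, b = leaf 2.0, leaf 3.0
let c = mul a b
let d = if c.Value > 0.0 then logn c else sinn c
backward d
// a.Adj = 0.5, b.Adj = 0.333...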


How is this useful?

Forward vs reverse
In the extreme cases,
for F : R → R^m, forward AD can compute all (∂F_1/∂x, …, ∂F_m/∂x)
for f : R^n → R, reverse AD can compute ∇f = (∂f/∂x_1, …, ∂f/∂x_n)
in just one evaluation

In general, for f : R^n → R^m, the Jacobian J ∈ R^(m×n) takes
O(n × time(f)) with forward AD
O(m × time(f)) with reverse AD

Reverse mode performs better when n ≫ m

15/36


How is this useful?
Traditional application domains of AD in industry and academia (Corliss et al., 2002)

Computational fluid dynamics
Atmospheric chemistry
Engineering design optimization
Computational finance

16/36

Functional AD
or
"Differentiable functional programming"

AD and functional programming
AD has been around since the 1960s (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)
The foundations for AD in a functional framework (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)
With research implementations:

R6RS-AD
https://github.com/qobi/R6RS-AD

Stalingrad
http://www.bcl.hamilton.ie/~qobi/stalingrad/

Alexey Radul's DVL
https://github.com/axch/dysvunctional-language

Recently, my DiffSharp library
http://diffsharp.github.io/DiffSharp/

17/36

Differentiable functional programming
Deep learning: neural network models are assembled from building blocks and trained with backpropagation

Traditional:
Feedforward
Convolutional
Recurrent

18/36


Differentiable functional programming
Newer additions: make algorithmic elements continuous and differentiable → enables use in deep learning

[Figure: NTM on copy task (Graves et al., 2014)]

Neural Turing Machine (Graves et al., 2014) → can infer algorithms: copy, sort, recall
Stack-augmented RNN (Joulin & Mikolov, 2015)
End-to-end memory network (Sukhbaatar et al., 2015)
Stack, queue, deque (Grefenstette et al., 2015)
Discrete interfaces (Zaremba & Sutskever, 2015)

19/36

Differentiable functional programming
Stacking of many layers, trained through backpropagation

[Figure: network architecture diagrams drawn as long chains of conv/pool/fc layers — AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers, deep residual learning (ILSVRC 2015)]

(He, Zhang, Ren, Sun. "Deep Residual Learning for Image Recognition." 2015. arXiv:1512.03385)

20/36

Differentiable functional programming
One way of viewing deep learning systems is "differentiable functional programming"

Two main characteristics:
Differentiability → optimization
Chained function composition → successive transformations → successive levels of distributed representations (Bengio, 2013) → the chain rule of calculus propagates derivatives

21/36

The bigger picture
In a functional interpretation:

Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction
Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)

22/36
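A toy rendering of that correspondence (purely illustrative): applying one layer across a sequence is a map, and a recurrence that reuses the same cell is a fold.

// A "neuron"/layer is just a function; weight-tying = reusing the same function
let step (state: float) (x: float) = tanh (state + x)   // toy RNN cell
let layer (x: float) = tanh (2.0 * x + 1.0)             // toy dense layer

let inputs = [ 0.1; 0.2; 0.3 ]

// Recurrence as a fold, elementwise application as a map
let finalState = List.fold step 0.0 inputs
let mapped     = List.map layer inputs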

The bigger picture
Even when you have complex compositions, differentiability ensures that they can be trained end-to-end with backpropagation

(Vinyals, Toshev, Bengio, Erhan. "Show and tell: a neural image caption generator." 2014. arXiv:1411.4555)

23/36

The bigger picture
Christopher Olah's blog post (September 3, 2015)
http://colah.github.io/posts/2015-09-NN-Types-FP/
"The field does not (yet) have a unifying insight or narrative"

David Dalrymple's essay (January 2016)
http://edge.org/response-detail/26794
"The most natural playground ... would be a new language that can run back-propagation directly on functional programs."

AD in a functional framework is a manifestation of this vision.

24/36


DiffSharp

The ambition
Deeply embedded AD (forward and/or reverse) as part of the language infrastructure
Rich API of differentiation operations as higher-order functions
High-performance matrix operations for deep learning (GPU support, model and data parallelism), gradients, Hessians, Jacobians, directional derivatives, matrix-free Hessian- and Jacobian-vector products

I have been working on these issues with Barak Pearlmutter and created DiffSharp:
http://diffsharp.github.io/DiffSharp/

25/36


DiffSharp
"Generalized AD as a first-class function in an augmented λ-calculus" (Pearlmutter and Siskind, 2008)
Forward, reverse, and any nested combination thereof, instantiated according to usage scenario
Nested lambda expressions with free-variable references

min (λx . (f x) + min (λy . g x y))

let m = min (fun x -> (f x) + min (fun y -> g x y))

Must handle "perturbation confusion" (Manzyuk et al., 2012)

d/dx ( x · (d/dy (x + y))|_{y=1} )|_{x=1}  ?=  1

let d = diff (fun x -> x * (diff (fun y -> x + y) 1.)) 1.

26/36
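Working the nested expression by hand: the inner derivative d/dy (x + y) is 1 for any x, so the outer expression reduces to d/dx (x · 1) = 1. An implementation that does not tag the two nested perturbations lets x's perturbation leak into the inner derivative and reports 2 instead; distinguishing (tagging) perturbations restores the correct answer of 1.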


DiffSharp
Higher-order differentiation API
(AD column: F = forward, R = reverse, combinations such as R-F = reverse-over-forward; Num. = numerical approximation; Sym. = symbolic)

Op.             Value                 Type signature                                      AD      Num.  Sym.

f : R → R
  diff          f′                    (R → R) → R → R                                     X, F    A     X
  diff'         (f, f′)               (R → R) → R → (R × R)                               X, F    A     X
  diff2         f′′                   (R → R) → R → R                                     X, F    A     X
  diff2'        (f, f′′)              (R → R) → R → (R × R)                               X, F    A     X
  diff2''       (f, f′, f′′)          (R → R) → R → (R × R × R)                           X, F    A     X
  diffn         f⁽ⁿ⁾                  N → (R → R) → R → R                                 X, F          X
  diffn'        (f, f⁽ⁿ⁾)             N → (R → R) → R → (R × R)                           X, F          X

f : R^n → R
  grad          ∇f                    (R^n → R) → R^n → R^n                               X, R    A     X
  grad'         (f, ∇f)               (R^n → R) → R^n → (R × R^n)                         X, R    A     X
  gradv         ∇f · v                (R^n → R) → R^n → R^n → R                           X, F    A
  gradv'        (f, ∇f · v)           (R^n → R) → R^n → R^n → (R × R)                     X, F    A
  hessian       Hf                    (R^n → R) → R^n → R^(n×n)                           X, R-F  A     X
  hessian'      (f, Hf)               (R^n → R) → R^n → (R × R^(n×n))                     X, R-F  A     X
  hessianv      Hf v                  (R^n → R) → R^n → R^n → R^n                         X, F-R  A
  hessianv'     (f, Hf v)             (R^n → R) → R^n → R^n → (R × R^n)                   X, F-R  A
  gradhessian   (∇f, Hf)              (R^n → R) → R^n → (R^n × R^(n×n))                   X, R-F  A     X
  gradhessian'  (f, ∇f, Hf)           (R^n → R) → R^n → (R × R^n × R^(n×n))               X, R-F  A     X
  gradhessianv  (∇f · v, Hf v)        (R^n → R) → R^n → R^n → (R × R^n)                   X, F-R  A
  gradhessianv' (f, ∇f · v, Hf v)     (R^n → R) → R^n → R^n → (R × R × R^n)               X, F-R  A
  laplacian     tr(Hf)                (R^n → R) → R^n → R                                 X, R-F  A     X
  laplacian'    (f, tr(Hf))           (R^n → R) → R^n → (R × R)                           X, R-F  A     X

f : R^n → R^m
  jacobian      Jf                    (R^n → R^m) → R^n → R^(m×n)                         X, F/R  A     X
  jacobian'     (f, Jf)               (R^n → R^m) → R^n → (R^m × R^(m×n))                 X, F/R  A     X
  jacobianv     Jf v                  (R^n → R^m) → R^n → R^n → R^m                       X, F    A
  jacobianv'    (f, Jf v)             (R^n → R^m) → R^n → R^n → (R^m × R^m)               X, F    A
  jacobianT     Jf^T                  (R^n → R^m) → R^n → R^(n×m)                         X, F/R  A     X
  jacobianT'    (f, Jf^T)             (R^n → R^m) → R^n → (R^m × R^(n×m))                 X, F/R  A     X
  jacobianTv    Jf^T v                (R^n → R^m) → R^n → R^m → R^n                       X, R
  jacobianTv'   (f, Jf^T v)           (R^n → R^m) → R^n → R^m → (R^m × R^n)               X, R
  jacobianTv''  (f, Jf^T (·))         (R^n → R^m) → R^n → (R^m × (R^m → R^n))             X, R
  curl          ∇ × f                 (R^3 → R^3) → R^3 → R^3                             X, F    A     X
  curl'         (f, ∇ × f)            (R^3 → R^3) → R^3 → (R^3 × R^3)                     X, F    A     X
  div           ∇ · f                 (R^n → R^n) → R^n → R                               X, F    A     X
  div'          (f, ∇ · f)            (R^n → R^n) → R^n → (R^n × R)                       X, F    A     X
  curldiv       (∇ × f, ∇ · f)        (R^3 → R^3) → R^3 → (R^3 × R)                       X, F    A     X
  curldiv'      (f, ∇ × f, ∇ · f)     (R^3 → R^3) → R^3 → (R^3 × R^3 × R)                 X, F    A     X

27/36

DiffSharp
Matrix operations
http://diffsharp.github.io/DiffSharp/api-overview.html

High-performance OpenBLAS backend by default, work on a CUDA-based GPU backend underway
Support for 64- and 32-bit floats (faster on many systems)
Benchmarking tool
http://diffsharp.github.io/DiffSharp/benchmarks.html

A growing collection of tutorials: gradient-based optimization algorithms, clustering, Hamiltonian Monte Carlo, neural networks, inverse kinematics

28/36


Hype

Hype
http://hypelib.github.io/Hype/

An experimental library for "compositional machine learning and hyperparameter optimization", built on DiffSharp
A robust optimization core
highly configurable functional modules
SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp, Newton's method
Use nested AD for gradient-based hyperparameter optimization (Maclaurin et al., 2015)

Researching the differentiable functional programming paradigm for machine learning

29/36

Hype
Extracts from Hype neural network code: use higher-order functions, don't think about gradients or backpropagation
https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs

30/36

Hype
Extracts from Hype optimization code
https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

Optimization and training as higher-order functions → can be composed, nested

31/36
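As a concrete, purely illustrative picture of optimization as a higher-order function (not Hype's actual API): a gradient-descent routine that takes the gradient function itself as an argument, so it composes and nests like any other function.

// Gradient descent as a higher-order function: it receives grad : float[] -> float[]
let gradientDescent (grad: float[] -> float[]) (lr: float) (steps: int) (x0: float[]) =
    let mutable x = Array.copy x0
    for _ in 1 .. steps do
        let g = grad x
        x <- Array.map2 (fun xi gi -> xi - lr * gi) x g
    x

// e.g. minimise f(x, y) = x^2 + y^2 by supplying its gradient (2x, 2y)
let xMin = gradientDescent (fun v -> [| 2.0 * v.[0]; 2.0 * v.[1] |]) 0.1 100 [| 1.0; -2.0 |]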

Hype
User doesn't need to think about derivatives
They are instantiated within the optimization code

32/36

Hype
But they can use derivatives within their models, if needed
→ input sensitivities
→ complex objective functions
→ adaptive PID controllers
→ integrating differential equations

Thanks to nested generalized AD:
you can optimize components that are internally using differentiation
resulting higher-order derivatives propagate via forward/reverse AD as needed

33/36


Hype
We also provide a Torch-like API for neural networks

A cool thing: thanks to AD, we can freely code any F# function as a layer, it just works

34/36


Hype
http://hypelib.github.io/Hype/feedforwardnets.html

We also have some nice additions for F# interactive

35/36

Roadmap

Transformation-based, context-aware AD
F# quotations (Syme, 2006) give us a direct path for deeply embedding AD
Currently experimenting with GPU backends (CUDA, ArrayFire, Magma)
Generalizing to tensors (for elegant implementations of, e.g., ConvNets)

36/36


Demos

Thank You!

References
• Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
• Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
• Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
• Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
• Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
• Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
• He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
• Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
• Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
• Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
• Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
• Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
• Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
• Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM
• Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
• Wengert RE (1964) A simple automatic derivative evaluation program. Communications of the ACM 7:463–4
• Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]