TRANSCRIPT
Differentiable Functional Programming
Atılım Güneş Baydin
University of Oxford, http://www.robots.ox.ac.uk/~gunes/
F#unctional Londoners Meetup, April 28, 2016
About me

Current (from 11 April 2016): Postdoctoral researcher, Machine Learning Research Group, University of Oxford, http://www.robots.ox.ac.uk/~parg/

Previously: Brain and Computation Lab, National University of Ireland Maynooth, http://www.bcl.hamilton.ie/

Working primarily with F#, on algorithmic differentiation, functional programming, machine learning
1/36
Today’s talk

- Derivatives in computer programs
- Differentiable functional programming
- DiffSharp + Hype libraries
- Two demos
2/36
Derivatives in computer programs: how do we compute them?
Manual differentiation

f(x) = sin(exp x)

let f x = sin (exp x)

Calculus 101: differentiation rules
d(fg)/dx = (df/dx) g + f (dg/dx)
d(af + bg)/dx = a (df/dx) + b (dg/dx)
...

f'(x) = cos(exp x) × exp x

let f' x = (cos (exp x)) * (exp x)
3/36
Manual differentiation
It can get complicated:

f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²
(4th iteration of the logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x)

let f x =
    64*x * (1-x) * ((1 - 2*x) ** 2) * ((1 - 8*x + 8*x*x) ** 2)

f'(x) = 128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²

let f' x =
    128*x * (1-x) * (-8+16*x) * (1-2*x)**2 * (1-8*x+8*x*x)
    + 64 * (1-x) * (1-2*x)**2 * (1-8*x+8*x*x)**2
    - 64*x * (1-2*x)**2 * (1-8*x+8*x*x)**2
    - 256*x * (1-x) * (1-2*x) * (1-8*x+8*x*x)**2
4/36
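The hand-derived f' above can be sanity-checked numerically. Here is a Python sketch (function names `f` and `fprime` are illustrative) transcribing the two formulas and comparing against an independent central difference at x = 0.2:

```python
# Sanity check for the hand-derived derivative of the 4th logistic-map iterate.
# f and fprime transcribe the formulas from the slide.

def f(x):
    return 64*x * (1 - x) * (1 - 2*x)**2 * (1 - 8*x + 8*x*x)**2

def fprime(x):
    return (128*x * (1 - x) * (-8 + 16*x) * (1 - 2*x)**2 * (1 - 8*x + 8*x*x)
            + 64 * (1 - x) * (1 - 2*x)**2 * (1 - 8*x + 8*x*x)**2
            - 64*x * (1 - 2*x)**2 * (1 - 8*x + 8*x*x)**2
            - 256*x * (1 - x) * (1 - 2*x) * (1 - 8*x + 8*x*x)**2)

# Central difference as an independent check of the manual derivation.
h, x = 1e-6, 0.2
approx = (f(x + h) - f(x - h)) / (2 * h)
assert abs(approx - fprime(x)) < 1e-4
```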
Symbolic differentiation

Computer algebra packages help: Mathematica, Maple, Maxima

But it has some serious drawbacks.
5/36
Symbolic differentiation
We get “expression swell”

Logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x

n   l_n                            d/dx l_n
1   x                              1
2   4x(1−x)                        4(1−x) − 4x
3   16x(1−x)(1−2x)²                16(1−x)(1−2x)² − 16x(1−2x)² − 64x(1−x)(1−2x)
4   64x(1−x)(1−2x)²(1−8x+8x²)²     128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²

(Plot: number of terms in l_n and d/dx l_n for n = 1 to 5, growing to several hundred.)
6/36
Symbolic differentiation
We are limited to closed-form formulae

You can find the derivative of math expressions:
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²

But not of algorithms, branching, control flow:

let f x n =
    if n = 1 then
        x
    else
        let mutable v = x
        for i = 1 to n do
            v <- 4 * v * (1 - v)
        v

let a = f x 4
7/36
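The iterative form above can be cross-checked against the closed form from the previous slide. A Python sketch (function names are illustrative), iterating v ← 4v(1 − v) to reach the 4th iterate l₄:

```python
# Iterative evaluation of the logistic map l_{n+1} = 4 l_n (1 - l_n), l_1 = x.
# Symbolic differentiation cannot handle this loop directly; AD can.

def logistic(x, n):
    v = x
    for _ in range(n - 1):   # l_1 = x, so n-1 iterations give l_n
        v = 4 * v * (1 - v)
    return v

def l4_closed_form(x):
    return 64*x * (1 - x) * (1 - 2*x)**2 * (1 - 8*x + 8*x*x)**2

x = 0.3
assert abs(logistic(x, 4) - l4_closed_form(x)) < 1e-12
```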
Numerical differentiation
A very common hack: use the limit definition of the derivative

df/dx = lim_{h→0} (f(x + h) − f(x)) / h

to approximate the numerical value of the derivative:

let diff f x =
    let h = 0.00001
    (f (x + h) - f (x)) / h

Again, some serious drawbacks.
8/36
Numerical differentiation
We must select a proper value of h, and we face approximation errors

(Plot: error E vs h for h = 10⁻¹⁷ to 10⁻¹; round-off error dominant for small h, truncation error dominant for large h.)

Computed using
E(h, x*) = | (f(x* + h) − f(x*)) / h − (d/dx) f(x)|_{x*} |
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)², x* = 0.2
9/36
Numerical differentiation
Better approximations exist

Higher-order finite differences, e.g.
∂f(x)/∂x_i = (f(x + h e_i) − f(x − h e_i)) / (2h) + O(h²)

Richardson extrapolation
Differential quadrature

but they increase rapidly in complexity and never completely eliminate the error.
10/36
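The truncation-error claim is easy to demonstrate. A Python sketch (using f(x) = sin(exp x) from earlier, exact derivative cos(exp x) · exp x) comparing the O(h) one-sided difference against the O(h²) central difference:

```python
import math

def f(x):
    return math.sin(math.exp(x))

def exact(x):                   # f'(x) = cos(exp x) * exp x
    return math.cos(math.exp(x)) * math.exp(x)

h, x = 1e-4, 1.0
forward = (f(x + h) - f(x)) / h             # O(h) truncation error
central = (f(x + h) - f(x - h)) / (2 * h)   # O(h^2) truncation error

# Central differences are more accurate, but the error never fully vanishes.
assert abs(central - exact(x)) < abs(forward - exact(x))
assert abs(forward - exact(x)) < 1e-2
```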
Numerical differentiation

Poor performance: for f : Rⁿ → R, approximate the gradient ∇f = (∂f/∂x₁, …, ∂f/∂xₙ) using

∂f(x)/∂x_i ≈ (f(x + h e_i) − f(x)) / h,   0 < h ≪ 1

We must repeat the function evaluation n times to get ∇f.
11/36
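The cost claim can be made concrete by counting calls. A Python sketch (names like `grad_fd` are illustrative): approximating ∇f by one-sided differences takes one baseline evaluation plus one per coordinate, n + 1 in total:

```python
calls = 0

def f(x):                     # example objective: sum of squares, gradient 2x
    global calls
    calls += 1
    return sum(v * v for v in x)

def grad_fd(f, x, h=1e-6):
    fx = f(x)                 # one baseline evaluation
    g = []
    for i in range(len(x)):   # plus one evaluation per coordinate
        xe = list(x)
        xe[i] += h
        g.append((f(xe) - fx) / h)
    return g

x = [1.0, 2.0, 3.0]
g = grad_fd(f, x)
assert calls == len(x) + 1
assert all(abs(gi - 2 * xi) < 1e-4 for gi, xi in zip(g, x))
```

Reverse-mode AD, introduced below, gets the whole gradient at a cost proportional to a single evaluation of f.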
Algorithmic differentiation (AD)

Also known as automatic differentiation (Griewank & Walther, 2008)
Gives numeric code that computes the function AND its derivatives at a given point

f(a, b):
    c = a * b
    d = sin c
    return d

f'(a, a', b, b'):
    (c, c') = (a*b, a'*b + a*b')
    (d, d') = (sin c, c' * cos c)
    return (d, d')

Derivatives are propagated at the elementary operation level, as a side effect, at the same time the function itself is computed
→ Prevents the “expression swell” of symbolic derivatives
Full expressive capability of the host language
→ Including conditionals, looping, branching
12/36
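The f' pseudocode above transcribes directly into running code. A Python sketch (function name is illustrative) propagating (value, derivative) pairs through each elementary operation:

```python
import math

def f_with_tangent(a, da, b, db):
    # Each elementary op carries its value and its derivative forward.
    c, dc = a * b, da * b + a * db          # product rule
    d, dd = math.sin(c), dc * math.cos(c)   # chain rule for sin
    return d, dd

# d/da sin(a*b) at (a, b) = (2, 3) is b * cos(a*b): seed a' = 1, b' = 0.
val, deriv = f_with_tangent(2.0, 1.0, 3.0, 0.0)
assert abs(val - math.sin(6.0)) < 1e-12
assert abs(deriv - 3.0 * math.cos(6.0)) < 1e-12
```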
Function evaluation traces
All numeric evaluations are sequences of elementary operations: a “trace,” also called a “Wengert list” (Wengert, 1964)

f(a, b):
    c = a * b
    if c > 0:
        d = log c
    else:
        d = sin c
    return d

f(2, 3)

Primal:
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

Tangent:
    a = 2, a' = 1
    b = 3, b' = 0
    c = a * b = 6
    c' = a' * b + a * b' = 3
    d = log c = 1.791
    d' = c' * (1 / c) = 0.5
    return d, d'

i.e., a Jacobian-vector product J_f (1, 0)|_{(2,3)} = ∂/∂a f(a, b)|_{(2,3)} = 0.5

This is called the forward (tangent) mode of AD
13/36
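The tangent trace above is exactly what a minimal dual-number implementation computes mechanically. A hedged Python sketch (the class name `Dual` is illustrative, not DiffSharp's API), supporting only the operations the trace needs:

```python
import math

class Dual:
    """Minimal forward-mode AD: carries (primal, tangent) through operations."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)
    __rmul__ = __mul__

def log(x):
    return Dual(math.log(x.primal), x.tangent / x.primal)

# Replay the trace: f(a, b) = log(a * b) at (2, 3), seeding a' = 1, b' = 0
a, b = Dual(2.0, 1.0), Dual(3.0, 0.0)
d = log(a * b)
assert abs(d.primal - 1.791) < 1e-3    # log 6 = 1.791...
assert abs(d.tangent - 0.5) < 1e-12    # d/da log(a*b) = 1/a = 0.5
```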
Function evaluation traces

f(a, b):
    c = a * b
    if c > 0:
        d = log c
    else:
        d = sin c
    return d

f(2, 3)

Primal:
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    return d

Adjoint:
    a = 2
    b = 3
    c = a * b = 6
    d = log c = 1.791
    d' = 1
    c' = d' * (1 / c) = 0.166
    b' = c' * a = 0.333
    a' = c' * b = 0.5
    return d, a', b'

i.e., a transposed Jacobian-vector product J_fᵀ (1)|_{(2,3)} = ∇f|_{(2,3)} = (0.5, 0.333)

This is called the reverse (adjoint) mode of AD
Backpropagation is just a special case of the reverse mode: code your neural network objective computation, apply reverse AD
14/36
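The adjoint trace can likewise be replayed with a tiny tape. A hedged Python sketch (not DiffSharp's implementation; names are illustrative) that records each operation's parents and local derivatives on the forward pass, then sweeps backward:

```python
import math

class Var:
    """Minimal reverse-mode AD node: value, adjoint, and local backward rules."""
    def __init__(self, value, parents=()):
        self.value, self.adjoint, self.parents = value, 0.0, parents

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def log(x):
    return Var(math.log(x.value), [(x, 1.0 / x.value)])

def backward(out):
    # Backward sweep; this simple traversal suffices for a tree-shaped trace.
    out.adjoint = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad
            stack.append(parent)

# Replay the trace: f(a, b) = log(a * b) at (2, 3)
a, b = Var(2.0), Var(3.0)
d = log(a * b)
backward(d)
assert abs(d.value - 1.791) < 1e-3
assert abs(a.adjoint - 0.5) < 1e-12          # c' * b with c' = 1/c
assert abs(b.adjoint - 1.0 / 3.0) < 1e-12    # 0.333...
```

One backward sweep yields both components of the gradient, matching the slide's (0.5, 0.333).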
How is this useful?
Forward vs reverse
In the extreme cases:

for F : R → Rᵐ, forward AD can compute all (∂F₁/∂x, …, ∂Fₘ/∂x)
for f : Rⁿ → R, reverse AD can compute ∇f = (∂f/∂x₁, …, ∂f/∂xₙ)

in just one evaluation.

In general, for f : Rⁿ → Rᵐ, the Jacobian J ∈ R^{m×n} takes
O(n × time(f)) with forward AD
O(m × time(f)) with reverse AD

Reverse mode performs better when n ≫ m.
15/36
How is this useful?
Traditional application domains of AD in industry and academia (Corliss et al., 2002):

- Computational fluid dynamics
- Atmospheric chemistry
- Engineering design optimization
- Computational finance
16/36
Functional AD, or “differentiable functional programming”
AD and functional programming

AD has been around since the 1960s (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)
The foundations for AD in a functional framework (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)

With research implementations:
- R6RS-AD: https://github.com/qobi/R6RS-AD
- Stalingrad: http://www.bcl.hamilton.ie/~qobi/stalingrad/
- Alexey Radul’s DVL: https://github.com/axch/dysvunctional-language
- Recently, my DiffSharp library: http://diffsharp.github.io/DiffSharp/
17/36
Differentiable functional programming
Deep learning: neural network models are assembled from building blocks and trained with backpropagation

Traditional: feedforward, convolutional, recurrent
18/36
Differentiable functional programming
Newer additions: make algorithmic elements continuous and differentiable → enables use in deep learning

(Figure: NTM on copy task, Graves et al. 2014)

- Neural Turing Machine (Graves et al., 2014) → can infer algorithms: copy, sort, recall
- Stack-augmented RNN (Joulin & Mikolov, 2015)
- End-to-end memory network (Sukhbaatar et al., 2015)
- Stack, queue, deque (Grefenstette et al., 2015)
- Discrete interfaces (Zaremba & Sutskever, 2015)
19/36
Differentiable functional programming
Stacking of many layers, trained through backpropagation

(Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers, deep residual learning (ILSVRC 2015).)

(He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)

20/36
Differentiable functional programming
One way of viewing deep learning systems is “differentiable functional programming”

Two main characteristics:
- Differentiability → optimization
- Chained function composition → successive transformations → successive levels of distributed representations (Bengio 2013) → the chain rule of calculus propagates derivatives
21/36
The bigger picture
In a functional interpretation:

- Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction
- Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)
22/36
The bigger picture
Even with complex compositions, differentiability ensures that they can be trained end-to-end with backpropagation
(Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)
23/36
The bigger picture

Christopher Olah’s blog post (September 3, 2015): http://colah.github.io/posts/2015-09-NN-Types-FP/
“The field does not (yet) have a unifying insight or narrative”

David Dalrymple’s essay (January 2016): http://edge.org/response-detail/26794
“The most natural playground ... would be a new language that can run back-propagation directly on functional programs.”

AD in a functional framework is a manifestation of this vision.
24/36
DiffSharp
The ambition

- Deeply embedded AD (forward and/or reverse) as part of the language infrastructure
- Rich API of differentiation operations as higher-order functions
- High-performance matrix operations for deep learning (GPU support, model and data parallelism): gradients, Hessians, Jacobians, directional derivatives, matrix-free Hessian- and Jacobian-vector products

I have been working on these issues with Barak Pearlmutter and created DiffSharp: http://diffsharp.github.io/DiffSharp/
25/36
DiffSharp
“Generalized AD as a first-class function in an augmented λ-calculus” (Pearlmutter and Siskind, 2008)

Forward, reverse, and any nested combination thereof, instantiated according to usage scenario

Nested lambda expressions with free-variable references:
min (λx . (f x) + min (λy . g x y))

let m = min (fun x -> (f x) + min (fun y -> g x y))

Must handle “perturbation confusion” (Manzyuk et al., 2012):
d/dx ( x · (d/dy (x + y))|_{y=1} ) |_{x=1} ?= 1

let d = diff (fun x -> x * (diff (fun y -> x + y) 1.)) 1.

26/36
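As a hedged Python sketch of the idea (not DiffSharp's actual mechanism; all names are illustrative): tagging each differentiation with a fresh epsilon keeps the two perturbations apart in the nested example, so the answer comes out as 1 rather than the confused 2 that untagged dual numbers produce:

```python
import itertools

_tags = itertools.count()

class Dual:
    """Forward-mode dual number tagged with the differentiation it belongs to."""
    def __init__(self, primal, tangent, tag):
        self.primal, self.tangent, self.tag = primal, tangent, tag

    def _coerce(self, other):
        # View `other` w.r.t. self's tag; a foreign value is a constant here.
        if isinstance(other, Dual) and other.tag == self.tag:
            return other
        return Dual(other, 0.0, self.tag)

    def __add__(self, other):
        if isinstance(other, Dual) and other.tag > self.tag:
            return other + self          # the innermost (newest) tag wraps
        o = self._coerce(other)
        return Dual(self.primal + o.primal, self.tangent + o.tangent, self.tag)
    __radd__ = __add__

    def __mul__(self, other):
        if isinstance(other, Dual) and other.tag > self.tag:
            return other * self
        o = self._coerce(other)
        return Dual(self.primal * o.primal,
                    self.tangent * o.primal + self.primal * o.tangent, self.tag)
    __rmul__ = __mul__

def diff(f, x):
    tag = next(_tags)                    # fresh epsilon per differentiation
    y = f(Dual(x, 1.0, tag))
    if isinstance(y, Dual) and y.tag == tag:
        return y.tangent
    return 0.0                           # result did not depend on x

# d/dx ( x * d/dy (x + y)|_{y=1} ) |_{x=1} = d/dx (x * 1) = 1
d = diff(lambda x: x * diff(lambda y: x + y, 1.0), 1.0)
assert d == 1.0
```

Without the tag comparison, the inner `diff` would pick up x's perturbation as well and return 2.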
DiffSharp
Higher-order differentiation API (selection; mode column shows the AD mode used)

For f : R → R
  diff      f′       (R → R) → R → R         forward
  diff2     f″       (R → R) → R → R         forward
  diffn     f⁽ⁿ⁾     N → (R → R) → R → R     forward

For f : Rⁿ → R
  grad          ∇f           (Rⁿ → R) → Rⁿ → Rⁿ             reverse
  gradv         ∇f · v       (Rⁿ → R) → Rⁿ → Rⁿ → R         forward
  hessian       H_f          (Rⁿ → R) → Rⁿ → Rⁿˣⁿ           reverse-on-forward
  hessianv      H_f v        (Rⁿ → R) → Rⁿ → Rⁿ → Rⁿ        forward-on-reverse
  gradhessian   (∇f, H_f)    (Rⁿ → R) → Rⁿ → (Rⁿ × Rⁿˣⁿ)    reverse-on-forward
  laplacian     tr(H_f)      (Rⁿ → R) → Rⁿ → R              reverse-on-forward

For f : Rⁿ → Rᵐ
  jacobian      J_f          (Rⁿ → Rᵐ) → Rⁿ → Rᵐˣⁿ          forward/reverse
  jacobianv     J_f v        (Rⁿ → Rᵐ) → Rⁿ → Rⁿ → Rᵐ       forward
  jacobianT     J_fᵀ         (Rⁿ → Rᵐ) → Rⁿ → Rⁿˣᵐ          forward/reverse
  jacobianTv    J_fᵀ v       (Rⁿ → Rᵐ) → Rⁿ → Rᵐ → Rⁿ       reverse
  curl          ∇ × f        (R³ → R³) → R³ → R³            forward
  div           ∇ · f        (Rⁿ → Rⁿ) → Rⁿ → R             forward
  curldiv       (∇×f, ∇·f)   (R³ → R³) → R³ → (R³ × R)      forward

Each operation also has primed variants (e.g., diff’, grad’) returning the primal value together with the derivative, and many are also available via numerical and symbolic backends.
27/36
DiffSharp
Matrix operations: http://diffsharp.github.io/DiffSharp/api-overview.html

- High-performance OpenBLAS backend by default; work on a CUDA-based GPU backend underway
- Support for 64- and 32-bit floats (faster on many systems)
- Benchmarking tool: http://diffsharp.github.io/DiffSharp/benchmarks.html

A growing collection of tutorials: gradient-based optimization algorithms, clustering, Hamiltonian Monte Carlo, neural networks, inverse kinematics
28/36
Hype
Hype
http://hypelib.github.io/Hype/

An experimental library for “compositional machine learning and hyperparameter optimization”, built on DiffSharp

A robust optimization core:
- highly configurable functional modules
- SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp, Newton’s method
- nested AD for gradient-based hyperparameter optimization (Maclaurin et al., 2015)

Researching the differentiable functional programming paradigm for machine learning
29/36
Hype
Extracts from Hype neural network code: use higher-order functions, don’t think about gradients or backpropagation
https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs
30/36
Hype
Extracts from Hype optimization code:
https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

Optimization and training as higher-order functions → can be composed, nested
31/36
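In the spirit of that design (a hedged Python sketch, not Hype's API; names are illustrative): an optimizer is just a higher-order function taking a gradient function, so any differentiable objective plugs in and optimizers can be composed:

```python
def sgd(grad, x0, lr=0.1, steps=100):
    """Gradient descent as a higher-order function: any `grad` plugs in."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
grad = lambda x: 2 * (x - 3.0)
x_min = sgd(grad, x0=0.0)
assert abs(x_min - 3.0) < 1e-6
```

In Hype the `grad` argument would itself be produced by AD rather than written by hand.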
Hype
The user doesn’t need to think about derivatives; they are instantiated within the optimization code
32/36
Hype
But users can use derivatives within their models, if needed:
→ input sensitivities
→ complex objective functions
→ adaptive PID controllers
→ integrating differential equations

Thanks to nested generalized AD:
- you can optimize components that are internally using differentiation
- resulting higher-order derivatives propagate via forward/reverse AD as needed
33/36
Hype
We also provide a Torch-like API for neural networks

A cool thing: thanks to AD, we can freely code any F# function as a layer; it just works
34/36
Hype
http://hypelib.github.io/Hype/feedforwardnets.html

We also have some nice additions for F# Interactive
35/36
Roadmap
- Transformation-based, context-aware AD: F# quotations (Syme, 2006) give us a direct path for deeply embedding AD
- Currently experimenting with GPU backends (CUDA, ArrayFire, Magma)
- Generalizing to tensors (for elegant implementations of, e.g., ConvNets)
36/36
Demos
Thank You!

References
- Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
- Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
- Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
- Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
- Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
- Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
- He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
- Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
- Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
- Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
- Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
- Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
- Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
- Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM
- Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
- Wengert R (1964) A simple automatic derivative evaluation program. Communications of the ACM 7:463–4
- Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]