Neural Ordinary Differential Equations Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud University of Toronto


Page 1: Neural Ordinary Differential Equations

Neural Ordinary Differential Equations

Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud

University of Toronto

Page 2: Neural Ordinary Differential Equations

Background: Ordinary Differential Equations (ODEs)

- Model the instantaneous change of a state (explicit form):

  dz(t)/dt = f(z(t), t)

- Solving an initial value problem (IVP) corresponds to integration (the solution is a trajectory):

  z(T) = z(0) + ∫_0^T f(z(t), t) dt

- The Euler method approximates the solution with small steps of size h:

  z(t + h) ≈ z(t) + h f(z(t), t)
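To make the Euler step concrete, here is a minimal sketch of a fixed-step forward-Euler integrator; the dynamics function and step count below are illustrative, not from the slides.

```python
import numpy as np

def euler_integrate(f, z0, t0, t1, num_steps):
    """Approximate z(t1) given z(t0) = z0 using forward Euler."""
    h = (t1 - t0) / num_steps          # fixed step size
    z, t = np.asarray(z0, dtype=float), t0
    for _ in range(num_steps):
        z = z + h * f(z, t)            # z(t + h) ≈ z(t) + h f(z(t), t)
        t += h
    return z

# dz/dt = -z has the exact solution z(T) = z(0) * exp(-T) ≈ 0.3679 at T = 1.
print(euler_integrate(lambda z, t: -z, z0=[1.0], t0=0.0, t1=1.0, num_steps=100))
```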


Page 4: Neural Ordinary Differential Equations

Residual Networks Interpreted as an ODE Solver

- Hidden units look like:

  h_{t+1} = h_t + f(h_t, θ_t)

- The final output is the composition of all residual steps:

  h_T = h_0 + Σ_{t=0}^{T-1} f(h_t, θ_t)

- This can be interpreted as an Euler discretization of an ODE. In the limit of smaller steps:

  dh(t)/dt = f(h(t), t, θ)

Haber & Ruthotto (2017); Weinan E (2017).
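The correspondence can be written directly in code: a residual block computes exactly one Euler step of some underlying ODE. This is a sketch with an illustrative MLP for f; nothing here is from the authors' implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual step: h_{t+1} = h_t + f(h_t, θ_t)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, h):
        # A forward-Euler step of dh/dt = f(h), with the step size folded into f.
        return h + self.f(h)

# Stacking T blocks composes T Euler steps along the depth "time" axis.
resnet = nn.Sequential(*[ResBlock(2) for _ in range(4)])
h_T = resnet(torch.randn(8, 2))
```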


Page 6: Neural Ordinary Differential Equations

Deep Learning as Discretized Differential Equations

Many deep learning networks can be interpreted as ODE solvers.

  Network                          Fixed-step numerical scheme
  ResNet, RevNet, ResNeXt, etc.    Forward Euler
  PolyNet                          Approximation to Backward Euler
  FractalNet                       Runge-Kutta
  DenseNet                         Runge-Kutta

Lu et al. (2017); Chang et al. (2018); Zhu et al. (2018).

But:
(1) What are the underlying dynamics?
(2) Adaptive-step-size solvers provide better error handling.


Page 9: Neural Ordinary Differential Equations

“Neural” Ordinary Differential Equations

Instead of computing y = F(x) with a fixed stack of layers, parameterize the dynamics with a neural network,

  dz(t)/dt = f(z(t), t, θ),

and solve y = z(T) given the initial condition z(0) = x.

Solve the dynamics using any black-box ODE solver:

- Adaptive step size.
- Error estimates.
- O(1)-memory learning.
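A minimal sketch of this using the authors' torchdiffeq package (linked at the end of the deck); the MLP dynamics and integration times are illustrative.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # github.com/rtqichen/torchdiffeq

class Dynamics(nn.Module):
    """Illustrative dynamics f(z(t), t, θ): a small MLP."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):           # odeint calls f with (t, z)
        return self.net(z)

f = Dynamics(dim=2)
x = torch.randn(16, 2)                 # initial condition z(0) = x
t = torch.tensor([0.0, 1.0])           # solve from t = 0 to t = T = 1
y = odeint(f, x, t)[-1]                # y = z(T), via an adaptive-step solver
```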


Page 12: Neural Ordinary Differential Equations

Backprop Without Knowledge of the ODE Solver

Ultimately we want to optimize some loss L(z(T)) = L(ODESolve(z(0), f, 0, T, θ)).

Naive approach: know the solver, and backprop through its operations.
- Memory-intensive.
- The family of “implicit” solvers performs an inner optimization.

Our approach: adjoint sensitivity analysis (the continuous-time analogue of reverse-mode autodiff), due to Pontryagin (1962).
+ Compatible with automatic differentiation.
+ O(1) memory in the backward pass.


Page 16: Neural Ordinary Differential Equations

Continuous-time Backpropagation

Residual network:
- Forward:  h_{t+1} = h_t + f(h_t, θ_t)
- Backward: ∂L/∂h_t = ∂L/∂h_{t+1} + (∂L/∂h_{t+1}) ∂f(h_t, θ_t)/∂h_t
- Params:   ∂L/∂θ_t = (∂L/∂h_{t+1}) ∂f(h_t, θ_t)/∂θ_t

Adjoint method:
- Define the adjoint state: a(t) = ∂L/∂z(t)
- Forward:  dz(t)/dt = f(z(t), t, θ)
- Backward (adjoint diffeq): da(t)/dt = -a(t)ᵀ ∂f(z(t), t, θ)/∂z
- Params:   dL/dθ = -∫_T^0 a(t)ᵀ ∂f(z(t), t, θ)/∂θ dt
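In torchdiffeq this is exposed as odeint_adjoint, which solves the adjoint ODE during loss.backward() instead of backpropagating through the solver's internals. A sketch, reusing an illustrative MLP for the dynamics:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint  # O(1)-memory backward pass

class Dynamics(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):
        return self.net(z)

f = Dynamics(dim=2)                       # must be an nn.Module so θ is visible
x = torch.randn(8, 2, requires_grad=True)
t = torch.tensor([0.0, 1.0])

z_T = odeint_adjoint(f, x, t)[-1]         # forward solve for z(T)
loss = z_T.pow(2).sum()                   # some loss L(z(T))
loss.backward()                           # solves the adjoint ODE backward in time
# f's parameters now hold dL/dθ; x.grad holds dL/dz(0).
```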


Page 19: Neural Ordinary Differential Equations

A Differentiable Primitive for AutoDiff

Reversible networks (Gomez et al. 2018) also require only O(1) memory, but need very specific neural network architectures with partitioned dimensions.

With an ODE, we don't need to store layer activations for the reverse pass: just follow the dynamics in reverse!

Page 20: Neural Ordinary Differential Equations

Reverse versus Forward Cost

- Empirically, the reverse pass is roughly half as expensive as the forward pass.

- Computation adapts to instance difficulty.

- The number of function evaluations can be viewed as the number of layers in the network.

NFE = Number of Function Evaluations.

Page 21: Neural Ordinary Differential Equations

Dynamics Become Increasingly Complex

- The dynamics become more demanding to compute as training progresses.

- The solver adapts its computation time to the complexity of the diffeq.

In contrast, Chang et al. (ICLR 2018) explicitly add layers during training.

Page 22: Neural Ordinary Differential Equations

Continuous-time RNNs for Time Series Modeling

- We often want arbitrary measurement times, i.e. irregular time intervals.
- We can do VAE-style inference with a latent ODE.

Page 23: Neural Ordinary Differential Equations

ODEs vs Recurrent Neural Networks (RNNs)

- RNNs learn very stiff dynamics and can have exploding gradients.

- ODE solutions, in contrast, are guaranteed to be smooth.


Page 26: Neural Ordinary Differential Equations

Continuous Normalizing Flows

Instantaneous change of variables (iCOV):

- For a Lipschitz-continuous function f, the log-density evolves as

  ∂ log p(z(t)) / ∂t = -Tr( ∂f / ∂z(t) )

- In other words,

  log p(z(T)) = log p(z(0)) - ∫_0^T Tr( ∂f / ∂z(t) ) dt

Compare with the discrete change of variables for an invertible F:

  log p(y) = log p(x) - log |det ∂F(x)/∂x|
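As a sketch of how the iCOV is used in practice, the state can be augmented with a log-density channel and both integrated together. The MLP is illustrative, and this exact trace costs one autograd call per dimension, so it only suits low-dimensional z:

```python
import torch
import torch.nn as nn

def exact_trace(dz, z):
    """Tr(∂f/∂z), one diagonal entry (and autograd call) per dimension."""
    trace = 0.0
    for i in range(z.shape[1]):
        grad_i = torch.autograd.grad(dz[:, i].sum(), z, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return trace

class CNFDynamics(nn.Module):
    """Augmented dynamics: dz/dt = f(z), d log p / dt = -Tr(∂f/∂z)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, state):
        z, _ = state                       # torchdiffeq can integrate tuple states
        with torch.enable_grad():
            z = z.requires_grad_(True)
            dz = self.f(z)
            dlogp = -exact_trace(dz, z)
        return dz, dlogp
```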

Page 27: Neural Ordinary Differential Equations

Continuous Normalizing Flows

[Figure: 1D and 2D density estimation, comparing data, a discrete normalizing flow, and a CNF.]

Page 28: Neural Ordinary Differential Equations

Is the ODE being correctly solved?


Page 30: Neural Ordinary Differential Equations

Stochastic Unbiased Log Density

We can further reduce time complexity using stochastic estimators: Hutchinson's trace estimator gives an unbiased estimate Tr(∂f/∂z) = E_ε[ εᵀ (∂f/∂z) ε ] with ε ~ N(0, I).

Grathwohl et al. (2019)
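A sketch of Hutchinson's estimator as used in FFJORD-style models: it replaces the per-dimension trace loop with a single vector-Jacobian product per sample, so the cost no longer scales with the dimension of z. The num_samples argument is illustrative.

```python
import torch

def hutchinson_trace(dz, z, num_samples=1):
    """Unbiased estimate of Tr(∂f/∂z) via E_ε[εᵀ (∂f/∂z) ε], ε ~ N(0, I)."""
    estimate = 0.0
    for _ in range(num_samples):
        eps = torch.randn_like(z)          # E[ε εᵀ] = I makes the estimate unbiased
        # One reverse-mode pass gives the vector-Jacobian product εᵀ (∂f/∂z).
        vjp = torch.autograd.grad(dz, z, grad_outputs=eps, create_graph=True)[0]
        estimate = estimate + (vjp * eps).sum(dim=1)
    return estimate / num_samples
```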

Page 31: Neural Ordinary Differential Equations

FFJORD - Stochastic Continuous Flows

Grathwohl et al. (2019)

[Figure: model samples on MNIST and CIFAR10.]

Page 32: Neural Ordinary Differential Equations

Variational Autoencoders with FFJORD

Page 33: Neural Ordinary Differential Equations

ODE Solving as a Modeling Primitive

Adaptive-step solvers with O(1)-memory backprop:

github.com/rtqichen/torchdiffeq

Future directions we’re currently working on:

- Latent stochastic differential equations.
- Network architectures suited for ODEs.
- Regularization of dynamics to require fewer evaluations.

Page 34: Neural Ordinary Differential Equations

Thanks!

Co-authors: Yulia Rubanova, Jesse Bettencourt, David Duvenaud

Page 35: Neural Ordinary Differential Equations

Extra Slides

Page 36: Neural Ordinary Differential Equations

Latent Space Visualizations

Page 37: Neural Ordinary Differential Equations

• Released an implementation of reverse-mode autodiff through black-box ODE solvers.

• The backward pass solves a system of size 2D + K + 1 (D-dimensional state, D-dimensional adjoint, K parameter gradients, plus one for time).

• In contrast, a forward-mode implementation solves a system of size D² + KD.

• TensorFlow has a Dormand–Prince–Shampine Runge–Kutta 5(4) solver implemented, but uses naive autodiff for backpropagation.

Page 39: Neural Ordinary Differential Equations

Explicit Error Control

- More fine-grained control than low-precision floats.

- Cost scales with instance difficulty.

NFE = Number of Function Evaluations.
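In torchdiffeq, this control is exposed through the rtol and atol arguments of odeint (these are real arguments; f, x, and t reuse the names from the earlier sketch):

```python
from torchdiffeq import odeint

# Looser tolerances mean fewer solver steps (lower NFE) but larger error;
# tighter tolerances increase NFE in exchange for accuracy.
z_coarse = odeint(f, x, t, rtol=1e-3, atol=1e-4)
z_fine   = odeint(f, x, t, rtol=1e-7, atol=1e-9)
```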

Page 40: Neural Ordinary Differential Equations

Computation Depends on Complexity of Dynamics

- Time cost is dominated by evaluating the dynamics f.

NFE = Number of Function Evaluations.

Page 41: Neural Ordinary Differential Equations

Why not use an ODE solver as a modeling primitive?

- Solving an ODE is expensive.

Page 42: Neural Ordinary Differential Equations

Future Directions

- Stochastic differential equations and random ODEs (approximate stochastic gradient descent).
- Scaling up ODE solvers with machine learning.
- Partial differential equations.
- Graphics, physics, simulations.