deep learning and differentiable programmingdic.uqam.ca/upload/files/seminaires/deep learning and...

Deep Learning, differentiable programming, and software 2.0

(or white is the new black?)

Mounir Boukadoum, UQAM, Dep. CS

Ruslan Salakhutdinov

Soumith Chintala

Chris Olah

Ian Goodfellow

Andrej Karpathy

Ilya Sutskever

Alex Krizhevsky

Young fields [often] start in a very ad‐hoc manner. Later, the mature field is understood very differently … It seems quite likely that deep learning is in this ad‐hoc state.

Chris Olah, Google Brain
https://colah.github.io/posts/2015-09-NN-Types-FP/

Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. … In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code.

Andrej Karpathy, OpenAI
https://medium.com/@karpathy/software-2-0-a64152b37c35

Deep learning has enabled spectacular achievements in solving complex problems of perception and prediction, but…

[With Deep Neural Networks,] machine learning has become alchemy

Ali Rahimi, Google (talk at NIPS 2017)
https://www.youtube.com/watch?v=Qi1Yry33TQE

Success comes mostly from trial and error; is there a unifying theory behind current knowledge and practices?

So, is there white-box deep learning? At least three ways to approach the issue

• Neuroscience: reproduction of human intelligence (biological analogies)
• Probabilities: inference from available data (latent variable manipulation)
• Data representations: transformations in manifolds? (differential calculus)

Currently, deep learning is mostly the third approach, using trial and error; could there be a white box model behind the black box appearance?

https://colah.github.io/posts/2015‐09‐NN‐Types‐FP/

Artificial neural network (ANN) 101

Loose metaphor of biological neural networks
Interconnected neurons with similar computation types => computational graph

Neuron -> node with I/O edges
Synapse -> weighted connection

A special type of graph

[Figure: 2-bit adder built with NAND gates, next to its ANN equivalent]

But in ANNs:
• The task is automatically learned from the data
  – The neural weights and type(s) of neural outputs set the function
• There is generalization capacity and resilience to imprecise and fragmentary inputs

http://neuralnetworksanddeeplearning.com/chap1.html
Ackerman and Freer, arXiv:1703.09406

Many types of computational graphs exist

Two fundamental topologies

Feedforward architectures are good for static problems, recurrent ones for dynamic/contextual problems (currently studied as "unfolded" feedforward architectures)

[Figure: neural network taxonomy, non-recurrent vs. recurrent (+ BSB, BAM, etc.)]

Three ways to set the neural weights (learning)

All based on the available data:
• Supervised learning: the data are labeled
• Unsupervised learning: the data are not labeled; labeling is done based on patterns/similarities (categorization)
• Reinforcement learning: the data are not labeled; labeling is done based on the generated output value (expectation versus outcome)

Generic two-step operation

1. Training (learning), done in advance
• By programming (C++, Python, Lua, Java, etc.)
• Using a NN simulator (Matlab, SNNS, etc.)
Cross-validation is frequently used for consistent results!
[Diagram: patterns to learn -> training algorithm -> neural weights]

2. Using
[Diagram: pattern to classify + neural weights -> ANN -> corresponding output]

Multi-layer perceptron

Seminal architecture of deep learning
• 1-2 hidden layers: shallow; more than two layers: deep

Essentially a projection operator: given $x$ at the input, provides $y = f(x)$ at the output

Dynamic/contextual problems are handled by recurrent networks that are unfolded

MLP learning process

Builds a persistent and hierarchical representation of the information in the data
• Hidden layers progressively learn deeper intermediate representations

[Figure: learned feature hierarchy over Layers 1 to 3; parts combine to form objects; Layer 3: high-level linguistic representations]

Lee, Largman, Pham & Ng, NIPS 2009
Lee, Grosse, Ranganath & Ng, ICML 2009

MLP learning details

Supervised: tries to minimize the average difference between a labeled training set and its neural representation
• Minimization of the average squared error, expressed as a function of the neural weights: $E = f(w)$

The intuitive way, solving $\nabla E = 0$, doesn't work (it requires knowing the data statistics!); a metaheuristic is used instead, with the assumption that $E$ can be estimated from individual training samples (stochastic gradient descent)
• Many variants exist

In any case, the process requires differentiable error functions!

$E(w + \Delta w) \approx E(w) + \nabla E \cdot \Delta w$ for small $\Delta w$. Therefore, $\Delta E \approx \nabla E \cdot \Delta w$. If $w$ is evolved in the opposite direction to $\nabla E$ for each learning trial, then $\Delta w = -\eta \nabla E$ (with a small step size $\eta > 0$) and $\Delta E \approx -\eta \lVert \nabla E \rVert^2 \le 0$

=> E decreases monotonically!

Only the input and final output of the network are known at each training trial; those of the hidden layers must also be determined
=> Error backpropagation algorithm (based on the chain rule for derivatives)
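To make the chain-rule idea concrete, here is a minimal sketch on a hypothetical two-weight network, y = σ(w2 · σ(w1 · x)), with squared error. The network shape, the names `backprop` and `numeric_grad`, and the sample values are illustrative assumptions, not taken from the slides. The hidden-layer gradient is obtained by propagating the output error backwards, and a finite-difference estimate serves as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # network output
    return h, y

def error(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backprop(w1, w2, x, t):
    """Gradients of the error with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    delta_out = (y - t) * y * (1.0 - y)          # dE/d(w2*h)
    grad_w2 = delta_out * h
    delta_hid = delta_out * w2 * h * (1.0 - h)   # error propagated back to the hidden node
    grad_w1 = delta_hid * x
    return grad_w1, grad_w2

def numeric_grad(w1, w2, x, t, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g1 = (error(w1 + eps, w2, x, t) - error(w1 - eps, w2, x, t)) / (2 * eps)
    g2 = (error(w1, w2 + eps, x, t) - error(w1, w2 - eps, x, t)) / (2 * eps)
    return g1, g2

analytic = backprop(0.5, -1.2, 0.8, 1.0)
numeric = numeric_grad(0.5, -1.2, 0.8, 1.0)
print(analytic, numeric)
```

The two gradient pairs should agree closely, which is the practical meaning of "the hidden-layer errors are determined by the chain rule".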


Stochastic gradient descent
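A minimal sketch of stochastic gradient descent, assuming a one-weight model y = w·x fitted to noise-free samples of t = 2x. The model, the step size `eta`, and the helper names are illustrative assumptions. Each sample moves w against the gradient of its own squared error; in this noise-free toy problem the average error shrinks at every epoch, whereas with noisy data the decrease is only expected on average.

```python
import random

def mean_squared_error(w, data):
    return sum((w * x - t) ** 2 for x, t in data) / len(data)

def sgd(data, w=0.0, eta=0.05, epochs=20, seed=0):
    """Per-sample gradient descent on E = 0.5 * (w*x - t)^2."""
    rng = random.Random(seed)
    samples = list(data)
    history = [mean_squared_error(w, data)]
    for _ in range(epochs):
        rng.shuffle(samples)            # visit the samples in random order
        for x, t in samples:
            grad = (w * x - t) * x      # dE/dw for this single sample
            w -= eta * grad             # step against the gradient
        history.append(mean_squared_error(w, data))
    return w, history

# Hypothetical training set: t = 2x sampled on [-1, 1].
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(-10, 11)]
w, history = sgd(data)
print(round(w, 3), history[0], history[-1])
```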

The more layers, the deeper the learning (or so it seems)

2011: 25.8% error with shallow net
2012: 16.4% error with 8 layers
2014: 7.3% error with 19 layers
2015: 3.57% error with 152 layers

Double the human performance, but black box operation!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.

ImageNet Large Scale Visual Recognition Competition (ILSVRC)
ImageNet: 1000 objects, 1.2 million images

Competition  Model      Depth       Top-5 error (%)
ILSVRC'10               shallow     28.2
ILSVRC'11               shallow     25.8
ILSVRC'12    AlexNet    8 layers    16.4
ILSVRC'13               8 layers    11.7
ILSVRC'14    VGG        19 layers   7.3
ILSVRC'14    GoogLeNet  22 layers   6.7
ILSVRC'15    ResNet     152 layers  3.57

In sum…

Deep learning is essentially many-layer MLPs trained by error backpropagation with (mostly) no side effects

At least three technologies and extensions:
• Autoencoders and deep belief networks (unsupervised learning)
• Convolutional MLPs (supervised learning)
• Generative adversarial networks (supervised learning)
• Extensions (e.g., unfolded recurrent architectures)

No white-box model yet!

[Photos: Bengio (Montréal), Hinton (Toronto), LeCun (New York)]

Black box = fad?

We [must] think through artificial intelligence from foundational principles rather than from the empirics of past data

Martin Reeves & Mihnea Moldoveanu, Scientific American, Sep. 2017

Starting May 25, [2018,] the European Union will require algorithms to explain their output, making deep learning illegal.

Reported by Pedro Domingos, U. Washington Seattle, Jan. 2018

What if there is a white box waiting to be uncovered, after all?

Back to representations…

Each layer processes the output of its predecessor to create a new data representation (function composition!)

If all the nodes are differentiable, task training by error backpropagation is feasible!
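A minimal sketch of "layers as function composition", assuming each layer is the σ(Wx + b) block used later in the talk. The helper names `dense` and `compose` and all weight values are hypothetical; in practice W and b would be learned.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, biases):
    """Return a layer function x -> sigma(Wx + b) for list-based W and b."""
    def layer(x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, biases)]
    return layer

def compose(*layers):
    """Chain layers left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)

# Hypothetical weights for a 2 -> 3 -> 1 network.
hidden = dense([[0.5, -0.3], [0.8, 0.1], [-0.6, 0.9]], [0.0, 0.1, -0.1])
output = dense([[1.0, -1.0, 0.5]], [0.2])
mlp = compose(hidden, output)

y = mlp([1.0, 2.0])
print(y)
```

Because every block is a smooth function, the composite `mlp` is differentiable end to end, which is what makes training by backpropagation feasible.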

Could this be the start of a white box ANN formalism?

The functional programming connection

Three main ANN characteristics:
• Function composition -> output based on embedded transformations
• End-to-end differentiability -> optimization
• Weight-tying -> sub-network reusability

Can it be that deep learning is just functional programming with reusable blocks, configured by error backpropagation training?

How so?

Ng et al., proc. ICML 09, pp 609‐616

Transfer learning

Current ANN models

Rectangle = vector; arrow = function. (a) fixed-sized input to fixed-sized output (e.g., image classification); (b) Sequence output (e.g., image captioning); (c) Sequence input (e.g., sentiment analysis); (d) sequence to sequence (e.g., translation); (e) sync’ed sequence to sequence (e.g., video frame tagging). Green layer length is arbitrary, being the result of unfolding a recurrent architecture.

http://karpathy.github.io/2015/05/21/rnn‐effectiveness/

[Figure: panels (a) to (e), each showing input, hidden/state, and output layers]

LSTM

Special neuron: 1 input, 3 controls, 1 output

[Figure: LSTM memory cell with input, output, and three gates: input gate (input control), forget gate (forget control), output gate (output control)]

The control signals typically come from perceptrons

Long Short‐Term Memory (LSTM) adds a neural structure that enables storing, retrieving or erasing the neural state based on context rather than sequentially

Gated Recurrent Unit (GRU) is a close relative
But the LSTM and GRU access mechanisms are not differentiable!

How about memory?

Making memory access differentiable

Necessary for learning where to write and read
Not obvious, as memory addresses are fundamentally discrete
How about reading and writing everywhere, just to different extents?
• Approach taken in Neural Turing Machines and several other recent models

https://distill.pub/2016/augmented‐rnns/

Making memory differentiable

The idea is to link the memory states to an attention mechanism.

Given a memory context $c_j$ and a sequence of memory items $h_i$, $i = 1..n$:
• A "distance" $a_{ij} = f(h_i, c_j)$ can be defined for each pair $(h_i, c_j)$ ($f$ can be implemented with a basic feed-forward network, making it part of the overall ANN)
• The relative weight (attention) of each $h_i$ with respect to $c_j$ is then $\alpha_i = \exp(a_{ij}) / \sum_{k=1}^{n} \exp(a_{kj})$, and a composite attention of all $h_i$ with respect to $c_j$ can be defined as $c = \sum_{i=1}^{n} \alpha_i h_i$

$c_j$ is no longer associated with a single item $h_i$, and the steps to distribute attention across the whole memory are all differentiable!
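A minimal sketch of this soft read, assuming a plain dot product as a stand-in for the learned scoring function f (the slide suggests a small feed-forward network instead). The names `score` and `attention` and the example vectors are illustrative.

```python
import math

def score(h, c):
    """Stand-in for the learned f(h_i, c_j): a dot product (an assumption)."""
    return sum(hk * ck for hk, ck in zip(h, c))

def attention(memory, context):
    """Return softmax weights over the memory items and their weighted sum."""
    scores = [score(h, context) for h in memory]
    top = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]              # alpha_i, summing to 1
    dim = len(memory[0])
    read = [sum(w * h[k] for w, h in zip(weights, memory)) for k in range(dim)]
    return weights, read

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, read = attention(memory, [2.0, 0.0])
print(weights, read)
```

Every memory item contributes to the result, only to a different extent, so no discrete address selection is needed.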

ANNs as functional graphs

MLPs, CNNs and RNNs are all expressible as graphs where the nodes perform layer computations and the arcs are layer interconnections
Given differentiable nodes, end-to-end graph training by error backpropagation is possible

Two major gains in doing so:
• General-purpose computation systems that are automatically configurable for desired outcomes!
• White box modeling through functional similarities and abstractions

http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/ https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/

Functional similarities

Weight-tying (multiple reuse of the same neuron, as in CNNs and RNNs) resembles function abstraction
Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)

Encoding Recurrent Neural Networks are folds.
fold = Encoding RNN
Haskell: foldl a

Generating Recurrent Neural Networks are unfolds.
unfold = Generating RNN
Haskell: unfoldr a s

http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/

General Recurrent Neural Networks are accumulating maps.

Accumulating Map = RNN
Haskell: mapAccumR a s
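The three recurrent patterns can be sketched with ordinary higher-order functions. This is an illustration only: `cell` is a hypothetical stand-in for a learned RNN cell, the decay factor 0.5 is arbitrary, and the sequence is traversed left to right (the Haskell `mapAccumR` quoted above runs from the right).

```python
from functools import reduce
from itertools import accumulate

def cell(state, x):
    """Hypothetical stand-in for a learned RNN cell: new_state = 0.5*state + x."""
    return 0.5 * state + x

def encode(xs, s0=0.0):
    """fold: consume a whole sequence and keep only the final state."""
    return reduce(cell, xs, s0)

def generate(s0, steps):
    """unfold: emit a sequence of outputs from an initial state."""
    outputs, state = [], s0
    for _ in range(steps):
        outputs.append(state)
        state = 0.5 * state
    return outputs

def run_rnn(xs, s0=0.0):
    """accumulating map: one output per input, with the state threaded through."""
    return list(accumulate(xs, cell, initial=s0))[1:]

print(encode([1.0, 2.0, 3.0]))
print(generate(4.0, 3))
print(run_rnn([1.0, 2.0, 3.0]))
```

The same `cell` is reused at every step, which is the weight-tying that the functional view treats as function abstraction.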

Convolutional Neural Networks are a close relative of map.

Windowed Map = Convolutional Layer
Haskell: zipWith a xs (tail xs)
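The Haskell expression above translates almost literally to Python. In this sketch the shared function `edge_detector` is a hypothetical fixed "neuron" with a window of two; a real convolutional layer would learn its weights.

```python
def windowed_map(a, xs):
    """Apply the same function to each pair of neighbours, like zipWith a xs (tail xs)."""
    return [a(x, y) for x, y in zip(xs, xs[1:])]

def edge_detector(left, right):
    """Hypothetical shared neuron: responds to changes between neighbours."""
    return right - left

signal = [0, 0, 1, 1, 1, 0]
features = windowed_map(edge_detector, signal)
print(features)
```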

Two Dimensional Convolutional Network


Recursive Neural Networks (“TreeNets”) are catamorphisms, a generalization of folds.

Catamorphism = TreeNet
Haskell: cata a
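A minimal sketch of a catamorphism over binary trees, assuming trees are encoded as nested 2-tuples with numbers at the leaves. The functions `embed` and `combine` are hypothetical stand-ins for the learned leaf and node networks of a TreeNet.

```python
def cata(leaf_fn, node_fn, tree):
    """Fold a binary tree bottom-up: leaves via leaf_fn, inner nodes via node_fn."""
    if isinstance(tree, tuple):
        left, right = tree
        return node_fn(cata(leaf_fn, node_fn, left),
                       cata(leaf_fn, node_fn, right))
    return leaf_fn(tree)

def embed(x):
    """Hypothetical leaf network: identity."""
    return x

def combine(left, right):
    """Hypothetical node network, shared by every inner node: average of the children."""
    return 0.5 * (left + right)

tree = ((1.0, 2.0), (3.0, (4.0, 5.0)))
root = cata(embed, combine, tree)
leaf_count = cata(lambda _: 1, lambda l, r: l + r, tree)
print(root, leaf_count)
```

Swapping `embed` and `combine` for other functions reuses the same recursion scheme, which is the sense in which a catamorphism generalizes a fold.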


Examples of building block combinations

English to French translation by combining an encoding RNN and a generating RNN, to essentially perform a fold followed by an unfold (Sutskever et al., 2014).

Image captions with a convolutional network and a generating RNN. The CNN does feature detection and the RNN unfolds the resulting vector into a description sentence (Vinyals et al., 2014).

Functional Names of Common Layers

Deep Learning Name    Functional Name
Learned Vector        Constant
Embedding Layer       List Indexing
Encoding RNN          Fold
Generating RNN        Unfold
General RNN           Accumulating Map
Bidirectional RNN     Zipped Left/Right Accumulating Maps
Conv Layer            "Window Map"
TreeNet               Catamorphism
Inverse TreeNet       Anamorphism


Creating differentiable functional graphs

Make algorithmic elements continuous and differentiable

[Figure: NTM on copy task (Graves et al. 2014)]

Create/implement a functional language where all primitives are differentiable and expressible in neural form (save basic arithmetic operations), so that we have:

y = f(x) = σ(Wx + b)

Structural models already exist (Neural Turing Machine; Stack-augmented RNN; Stack, queue, deque); what is missing is the neural programming language

Adapted from http://www.cs.nuim.ie/~gunes/files/Baydin-MSR-Slides-20160201.pdf

Basic differentiable structures based on y = f(x) = σ(Wx + b)

Functional expressions (no mutable data inside), as in declarative languages (Lisp, Haskell, Erlang, etc.)

function h(x)
    return f(g(x))
endfunction

Neural form: h(x) = σ(W1 σ(W2x + b2) + b1)

function f(x, a)
    if x > 1.0
        return a + 1
    else
        return a
    endif
endfunction

Neural form: +(x, y) = σ(Wx + W'y + b)
             f(x, a) = if(x, 1.0, +(a, 1.0), a)

Needed language constructs

Differentiable if, implemented with a TreeNet neural network

https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/
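One simple way to picture a differentiable `if` is a sigmoid-gated blend of the two branches. Note that this is a swapped-in illustration, not the TreeNet-based construction the slide refers to; the name `soft_if` and the `sharpness` parameter are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_if(x, threshold, then_value, else_value, sharpness=10.0):
    """Smooth stand-in for `if x > threshold`: blends both branches with a sigmoid gate."""
    gate = sigmoid(sharpness * (x - threshold))   # close to 1 when x > threshold, close to 0 otherwise
    return gate * then_value + (1.0 - gate) * else_value

def f(x, a):
    """Smooth version of: if x > 1.0 then a + 1 else a."""
    return soft_if(x, 1.0, a + 1.0, a)

print(f(3.0, 2.0), f(-1.0, 2.0), f(1.0, 2.0))
```

Far from the threshold the result matches the hard `if`; near it the output varies smoothly, so a gradient with respect to `x` exists everywhere.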

Functional language constructs

Primitive functions f: T -> S to carry out the basic σ(Wx + b) building blocks, with W and b learned from the data.

Mechanism to create composite functions from primitive functions, e.g., mlp(x) = f(g(x))

Higher‐level functions that take functions as inputs, generate functions as outputs, or both

Memory constructs (lists? Monads?)

=> λ calculus!

λ calculus syntax

All expressions are of the form:

e ::= x        // variable
    | λx.e1    // function definition
    | e1 e2    // function application
    | (e1)     // disambiguation


Examples of higher-order functions

map(Fun, List)
• Applies Fun to each element of List, returning a list of results that may be of a different type

filter(Pred, List)
• Returns a sublist of List that contains the elements of List that satisfy the predicate Pred

foldl(Fun, Acc, List)
• Calls Fun on successive pairs of elements of List, starting with Acc and returning the same type

Etc.
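For readers less familiar with these functions, a short illustration using Python's built-in equivalents (`functools.reduce` plays the role of foldl); the word list is an arbitrary example.

```python
from functools import reduce

words = ["deep", "learning", "is", "differentiable", "programming"]

# map: apply a function to each element (strings in, integers out)
lengths = list(map(len, words))

# filter: keep the elements that satisfy a predicate
long_words = list(filter(lambda w: len(w) > 4, words))

# foldl: combine the elements pairwise, starting from an accumulator
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, long_words, total)
```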

More higher-order functions

all(Pred, List) any(Pred, List) takewhile(Pred, List)

dropwhile(Pred, List) flatten(DeepList) flatmap(Fun, List)

foreach(Fun, List) partition(Pred,List) zip(List1,List2)

unzip(List) …


Software is dead, long live software?

Current software is imperative (sequence of instructions, each one imparting a behaviour to a point in program space)
• But for most real-world problems, it is easier to state the desired behaviour (e.g., via input-output examples) than to write executable code

V2.0 would be declarative: the "programmer" specifies the outcome and a composition of neural building blocks is searched for to provide it
• Deep learning searches in continuous manifolds (for dimensionality reduction and to make gradient descent possible)

Software should switch from writing programs, maintaining repositories and doing run‐time analysis to collecting, analyzing and preparing data for a neural network

From classical to differentiable machines

Classical program: Sequence of executable instructions to perform a specified task

Differentiable program: Sequence of problem domain declarations on how to perform a specified task
• Functional blocks for white box operation
• Differentiable nodes for auto-configuration by error backpropagation learning

How about existing frameworks?

Currently two types of computational graphs:

Symbolic
• Typical representatives: Theano, TensorFlow, CGT
• Fine-grained
• Graph analysis and optimizations

Modular
• Typical representatives: Torch, Caffe
• Coarse-grained
• Manually designed modules

Similarities
• Model definition using a (constrained) symbolic language
• Automatic handling of backpropagation in the final model (no need to code derivatives along)

(Kenneth Tran, "Evaluation of Deep Learning Toolkits", https://github.com/zer0n/deepframeworks)

You are limited to symbolic graph building, with the mini‐language

You build this symbolic graph:

For example, instead of this in pure Python (for y = A**k):

But no direct functional building as such

http://deeplearning.net/software/theano/library/scan.html

Current efforts

• Neural programmers (a bit similar to genetic programming)
• Neural Programmer-Interpreters (with by-example supervision)
• Neural Turing Machines
• DiffSharp (high-order differentiation)
• Autograd (automatic differentiation of NumPy and Python code)
• DNNGraph (Haskell model to Caffe and Torch scripts)
• Etc.

All in the last couple of years, but is gradient descent really necessary?
How about copying biology?

A biologically-inspired neural building block

A loop-based neural architecture

Gisiger & Boukadoum, Neural networks, 2018

Delayed-response task (DRT)

Tests the ability to respond to stimuli based on short-term memory

Three major steps, repeated over a number of trials:
• Cue: sensory information to retain (e.g., image, dot on a screen, auditory stimulus)
• Delay: the cue is withdrawn for an arbitrary delay
• Response: cue-related action (e.g., identify a cue image in a set, or point to the location where the dot initially appeared)

Although seemingly simple, the task requires complex mental processing:
1. sensing the cue information, say a visual representation (VR);
2. committing the cue information to short-term memory;
3. protecting it from interference by external and internal distractions;
4. using the information stored in working memory to produce the correct motor response (PM);
5. discarding this information at the end of the trial in preparation for the next one (Reset).

Implementing DRT with a loop-based network


LSTM perspective

Many obstacles remain

Need for more parallel processing and better energy efficiency
• Both at the hardware and software level

Need for training with less data
Lift the algebraically expressible data restriction (vectors, matrices, tensors…)
Gradient descent learning is convex optimization; non-convex techniques have not been studied due to apparent NP-hardness
Serious side effects!

Noise effects (and hacker opportunities!)

http://arxiv.org/pdf/1312.6199v4.pdf
https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture
https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196

The algorithm is deceptively simple

1. Feed in the photo to hack
2. Get the neural network's prediction and see how far off it is from the target answer
3. Tweak the photo using back-propagation to make the prediction closer to the target answer
4. Repeat steps 1–3 with the same photo until the network gives us the answer we want

Adding an imperceptibly small vector of the same sign as the gradient of the cost function with respect to the input can drastically change the image classification.
https://arxiv.org/abs/1412.6572

[Figure: $x$ ("panda", 57.7% confidence) $+\ 0.007 \times \mathrm{sign}(\nabla_x J(\theta, x, y))$ ("nematode", 8.2% confidence) $=\ x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$ ("gibbon", 99.3% confidence)]
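The sign-of-gradient perturbation can be sketched on a toy model. The following assumes a hypothetical 4-input logistic classifier with fixed weights and cross-entropy cost J, for which the input gradient is (p − y)·w; it is not the image network from the cited paper, and the ε used here is far from imperceptible because the input has only four dimensions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy cost J with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """Perturb x by epsilon * sign(dJ/dx), the step shown in the figure."""
    grad = input_gradient(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

w = [1.0, -2.0, 3.0, -1.0]    # hypothetical fixed weights
b = 0.0
x = [0.2, -0.1, 0.1, 0.0]     # input classified as class 1
x_adv = fgsm(w, b, x, y=1, epsilon=0.15)

print(predict(w, b, x), predict(w, b, x_adv))
```

Each coordinate moves by at most ε, yet the shifts add up across dimensions; with thousands of pixels, a much smaller ε is enough to flip the decision.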

https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196

Sometimes, it doesn’t work!

Overfitting

https://ml.berkeley.edu/blog/2017/07/13/tutorial‐4/

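[Figure: number puzzle. Given 1 = 1, 2 = 3, 3 = 5, 4 = 7, what is 5?]

An overfitted model: $f(x) = 9055.5x^4 - 90555x^3 + 316942.5x^2 - 452773x + 217331$ passes exactly through the four known points.

Answer: 217341!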
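The five numbers on the slide can be read as the coefficients of a degree-4 polynomial; the alternating signs used below are an assumption, chosen because they reproduce the listed values. The sketch compares that polynomial with the simple rule 2x − 1 (also an assumption about the "intended" pattern).

```python
def fitted_polynomial(x):
    """Degree-4 polynomial built from the coefficients shown on the slide."""
    return (9055.5 * x**4 - 90555 * x**3 + 316942.5 * x**2
            - 452773 * x + 217331)

def simple_rule(x):
    """The pattern most readers would expect: the x-th odd number."""
    return 2 * x - 1

for x in range(1, 6):
    print(x, simple_rule(x), fitted_polynomial(x))
```

Both models agree on the four training points and disagree wildly on the fifth, which is the essence of overfitting: a perfect fit to the known data says little about behaviour elsewhere.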

The consequences can be disastrous!

Data order

• Capture of invariant "spatial patterns" is possible

22 1A a@a 1 aa a1.a 123 aa1

33 2B b@b 2 bb b2.b 234 bb2

44 3C c@c 3 cc c3.c 345 cc3

55 4D d@d 4 dd d4.d 456 dd4

66 5E e@e 5 ee e5.e 567 ee5

77 6F f@f 6 ff f6.f 678 ff6

88 7G g@g 7 gg g7.g 789 gg7

99 8H h@h 8 hh h8.h 890 hh8

111 9I i@i 9 ii i9.i 901 ii9

• Capture of invariant "spatial patterns" is doubtful if the row or column order is arbitrary

In summary…

Efforts are under way to make white the new black
Until then, deep learning remains a black box, and neural network parameter tuning an art
Currently, the choice is between 80-90% accurate non-DL models that we understand, or 99% accurate DL models that we don't!
