Deep Learning, differentiable programming, and software 2.0
(or white is the new black?)
Mounir Boukadoum, UQAM, Dep. CS
Ruslan Salakhutdinov
Soumith Chintala
Chris Olah
Ian Goodfellow
Andrej Karpathy
Ilya Sutskever
Alex Krizhevsky
Young fields [often] start in a very ad‐hoc manner. Later, the mature field is understood very differently … It seems quite likely that deep learning is in this ad‐hoc state.
Chris Olah, Google Brain
https://colah.github.io/posts/2015-09-NN-Types-FP/
Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. … In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code.
Andrej Karpathy, OpenAI
https://medium.com/@karpathy/software-2-0-a64152b37c35
Deep learning has enabled spectacular achievements in solving complex problems of perception and prediction, but…
[With Deep Neural Networks,] machine learning has become alchemy
Ali Rahimi, Google (talk at NIPS 2017)
https://www.youtube.com/watch?v=Qi1Yry33TQE
Success comes mostly from trial and error; is there a unifying theory behind current knowledge and practices?
So, is there white-box deep learning? At least three ways to approach the issue
• Neuroscience: reproduction of human intelligence (biological analogies)
• Probabilities: inference from available data (latent variable manipulation)
• Data representations: transformations in manifolds? (differential calculus)
Currently, deep learning is mostly the third approach, using trial and error; could there be a white box model behind the black box appearance?
https://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
Artificial neural network (ANN) 101
Loose metaphor of biological neural networks Interconnected neurons with similar computation types => computational graph
Neuron ‐> node with I/O edges Synapse ‐> weighted connection
A special type of graph
2‐bit adder with NAND gates ANN equivalent
But in ANNs:
• The task is automatically learned from the data
– The neural weights and type(s) of neural outputs set the function
• There is generalization capacity and resilience to imprecision and fragmentary inputs
http://neuralnetworksanddeeplearning.com/chap1.html
Ackerman and Freer, arXiv:1703.09406
Many types of computational graphs exist
Two fundamental topologies
Feedforward architectures good for static problems, recurrent ones for dynamic/ contextual problems (currently studied as “unfolded” feedforward architectures)
[Figure: neural network taxonomy, non-recurrent vs. recurrent (+ BSB, BAM, etc.)]
Three ways to set the neural weights (learning)
All based on the available data:
• Supervised learning: the data are labeled
• Unsupervised learning: the data are not labeled; labelling is done based on patterns/similarities (categorisation);
• Reinforcement learning: the data are not labeled, labelling is done based on generated output value (expectation versus outcome)
Generic two-step operation
1. Training (learning)
Done in advance
• By programming (C++, Python, Lua, Java, etc.)
• Using a NN simulator (Matlab, SNNS, etc.)
Cross-validation frequently used for consistent results!
[Diagram: patterns to learn -> training algorithm -> neural weights]
2. Using
[Diagram: pattern to classify + neural weights -> ANN -> corresponding output]
Multi-layer perceptron
Seminal architecture of deep learning
• 1-2 hidden layers: shallow; more than two layers: deep
Essentially a projection operator: given a vector x at the input, provides a vector y at the output
Dynamic/contextual problems handled by recurrent networks that are unfolded
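The projection view of the multi-layer perceptron can be sketched in a few lines of plain Python. This is only an illustration: the 2-3-1 layout, the sigmoid activation and all weight values are assumptions chosen for the example, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, x):
    """One dense layer: y_j = sigmoid(sum_i W[j][i] * x[i] + b[j])."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp(params, x):
    """Apply the layers in order: the output of one layer feeds the next."""
    for weights, biases in params:
        x = layer(weights, biases, x)
    return x

# Hypothetical 2-3-1 network with hand-picked weights (illustrative values only).
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.2]),                                 # output layer
]
y = mlp(params, [1.0, 2.0])  # projects a 2-vector onto a 1-vector
```

With one or two such hidden layers the network would count as shallow in the slide's terms; adding more `(weights, biases)` pairs to `params` makes it deep.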
MLP learning process
Builds a persistent and hierarchical representation of the data information
• Hidden layers progressively learn deeper intermediate representations
Lee, Largman, Pham & Ng, NIPS 2009; Lee, Grosse, Ranganath & Ng, ICML 2009
[Figure: features learned at Layer 1, Layer 2 and Layer 3; labels: "Parts combine to form objects", "High-level linguistic representations"]
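MLP learning details
Supervised; tries to minimize the average difference between a labeled training set $\{(x_k, d_k)\}_{k=1}^{N}$ and its neural representation $\{(x_k, y_k)\}_{k=1}^{N}$, where $y_k$ is the network output for input $x_k$
• Minimization of the average squared error, expressed as a function of the neural weights: $E = f(w) = \frac{1}{N}\sum_{k=1}^{N} \lVert d_k - y_k(w) \rVert^2$
The intuitive way, solving $\nabla_w E = 0$, doesn't work (it requires knowing the data statistics!), so a metaheuristic is used, with the assumption that $E$ can be estimated from the current training sample (stochastic gradient descent)
• Many variants exist
In any case, the process requires differentiable error functions!
$E(w + \Delta w) \approx E(w) + \nabla_w E \cdot \Delta w$ for small $\Delta w$
Therefore, $\Delta E \approx \nabla_w E \cdot \Delta w$
If $w$ is evolved in the opposite direction to $\nabla_w E$ for each learning trial, then $\Delta w = -\eta \nabla_w E$ with $\eta > 0$, and $\Delta E \approx -\eta \lVert \nabla_w E \rVert^2 \le 0$
=> E decreases monotonically (for a small enough step $\eta$)!
Only the input and final output of the network are known at each training trial; those of the hidden layers must also be determined
=> Error backpropagation algorithm (based on the chain rule for derivatives)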
Stochastic gradient descent
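A minimal sketch of stochastic gradient descent on the squared error, assuming a single linear neuron y = w*x + b and a noiseless toy data set (the target line d = 2x + 1, the learning rate and the number of trials are all illustrative choices). Each trial moves the weights against the gradient of the error of one randomly drawn sample.

```python
import random

def average_squared_error(w, b, data):
    return sum((d - (w * x + b)) ** 2 for x, d in data) / len(data)

def sgd(data, eta=0.05, trials=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(trials):
        x, d = rng.choice(data)      # one training sample per trial
        err = d - (w * x + b)
        # Gradient of err**2 is -2*err*x w.r.t. w and -2*err w.r.t. b;
        # the weights move in the opposite direction.
        w += eta * 2.0 * err * x
        b += eta * 2.0 * err
    return w, b

# Toy data on the line d = 2x + 1, for x in [-1, 1].
data = [(k / 4.0, 2.0 * (k / 4.0) + 1.0) for k in range(-4, 5)]
before = average_squared_error(0.0, 0.0, data)
w, b = sgd(data)
after = average_squared_error(w, b, data)
```

Here `after` ends up far below `before`. With per-sample updates the average error is only guaranteed to go down on average, not at every single trial.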
The more layers, the deeper the learning (or so it seems)
2011: 25.8% error with shallow net
2012: 16.4% error with 8 layers
2014: 7.3% error with 19 layers
2015: 3.57% error with 152 layers
Better than human performance, but black box operation!
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
ImageNet Large Scale Visual Recognition Competition (ILSVRC)
ImageNet: 1000 objects, 1.2 million images
Competition | Network | Depth | Top-5 error (%)
ILSVRC'10 | - | shallow | 28.2
ILSVRC'11 | - | shallow | 25.8
ILSVRC'12 | AlexNet | 8 layers | 16.4
ILSVRC'13 | - | 8 layers | 11.7
ILSVRC'14 | VGG | 19 layers | 7.3
ILSVRC'14 | GoogLeNet | 22 layers | 6.7
ILSVRC'15 | ResNet | 152 layers | 3.57
In sum…
Deep learning is essentially many-layer MLPs trained by error backpropagation with (mostly) no side effects
At least three technologies and extensions:
• Autoencoders and deep belief networks (unsupervised learning)
• Convolutional MLPs (supervised learning)
• Generative adversarial networks (supervised learning)
• Extensions (e.g., unfolded recurrent architectures)
No white‐box model yet!
Bengio, Montréal
Hinton, Toronto
LeCun, New York
Black box = fad?
We [must] think through artificial intelligence from foundational principles rather than from the empirics of past data
Martin Reeves & Mihnea Moldoveanu, Scientific American, Sep. 2017
Starting May 25, [2018,] the European Union will require algorithms to explain their output, making deep learning illegal.
Reported by Pedro Domingos, U. Washington Seattle, Jan. 2018
What if there is a white box waiting to be uncovered, after all?
Back to representations…
Each layer processes the output of its predecessor to create a new data representation (function composition!)
If all the nodes are differentiable, task training by error backpropagation is feasible!
Could this be the start of a white box ANN formalism?
The functional programming connection
Three main ANN characteristics:
• Function composition → output based on embedded transformations
• End-to-end differentiability → optimization
• Weight-tying → sub-network reusability
Can it be that deep learning is just functional programming with reusable blocks, configured by error backpropagation training?
How so?
Ng et al., proc. ICML 09, pp 609‐616
Transfer learning
Current ANN models
Rectangle = vector; arrow = function. (a) fixed-sized input to fixed-sized output (e.g., image classification); (b) Sequence output (e.g., image captioning); (c) Sequence input (e.g., sentiment analysis); (d) sequence to sequence (e.g., translation); (e) sync’ed sequence to sequence (e.g., video frame tagging). Green layer length is arbitrary, being the result of unfolding a recurrent architecture.
http://karpathy.github.io/2015/05/21/rnn‐effectiveness/
[Figure: panels a) to e), each showing Input, Hidden/State and Output layers]
How about memory?
Long Short-Term Memory (LSTM) adds a neural structure that enables storing, retrieving or erasing the neural state based on context rather than sequentially
• Special neuron: 1 input, 3 controls, 1 output
• The control signals typically come from perceptrons
[Figure: LSTM memory cell with input gate, output gate and forget gate, driven by the input control, output control and forget control]
Gated Recurrent Unit (GRU) is a close relative
But the LSTM and GRU access mechanisms are not differentiable!
Making memory access differentiable
Necessary for learning where to write and read
Not obvious as memory addresses are fundamentally discrete
How about reading and writing everywhere, just to different extents?
• Approach taken in Neural Turing Machines and several other recent models
https://distill.pub/2016/augmented‐rnns/
Making memory differentiable
The idea is to link the memory states to an attention mechanism:
Given a memory context $c_j$ and a sequence of memory items $h_i$, $i = 1..n$:
• A “distance” $a_{ij} = f(h_i, c_j)$ can be defined for each pair $(h_i, c_j)$ (f can be implemented with a basic feed-forward network, making it part of the overall ANN)
• The relative weight (attention) of each $h_i$ with respect to $c_j$ is then $\alpha_i = \exp(a_{ij}) / \sum_{k=1}^{n} \exp(a_{kj})$
and a composite attention of all $h_i$ with respect to $c_j$ can be defined as $c = \sum_{i=1}^{n} \alpha_i h_i$
$c_j$ is no longer associated with a single item $h_i$, and the steps to distribute it across the whole memory are all differentiable!
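The attention steps above can be sketched as follows. The scoring function f is taken here to be a plain dot product, which is an assumption made to keep the example short; the slide suggests a small feed-forward network instead. The function names and the toy memory are hypothetical.

```python
import math

def dot(h, c):
    """Stand-in for the learned scoring function f(h_i, c_j)."""
    return sum(hi * ci for hi, ci in zip(h, c))

def attention(memory, context, score):
    """Return the weights alpha_i and the blended read vector c."""
    a = [score(h, context) for h in memory]        # a_ij = f(h_i, c_j)
    m = max(a)                                     # shift for numerical stability
    exps = [math.exp(ai - m) for ai in a]
    total = sum(exps)
    alpha = [e / total for e in exps]              # softmax over memory items
    dim = len(memory[0])
    c = [sum(al * h[k] for al, h in zip(alpha, memory)) for k in range(dim)]
    return alpha, c

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # items h_1..h_3
alpha, c = attention(memory, [2.0, 0.0], dot)
```

Every memory item contributes to `c`, only to a different extent, which is what keeps the read operation differentiable with respect to the scores.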
ANNs as functional graphs
MLPs, CNNs and RNNs are all expressible as graphs where the nodes perform layer computations and the arcs layer interconnections
Given differentiable nodes, end-to-end graph training by error backpropagation is possible
Two major gains in doing so:
• General purpose computation systems that are automatically configurable for desired outcomes!
• White box modeling through functional similarities and abstractions
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/ https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/
Functional similarities
Weight-tying (multiple reuse of the same neuron as in CNNs and RNNs) resembles function abstraction
Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)
Encoding Recurrent Neural Networks are folds.
fold = Encoding RNN (Haskell: foldl a)
Generating Recurrent Neural Networks are unfolds.
unfold = Generating RNN (Haskell: unfoldr a s)
General Recurrent Neural Networks are accumulating maps.
Accumulating Map = RNN (Haskell: mapAccumR a s)
Convolutional Neural Networks are a close relative of map.
Windowed Map = Convolutional Layer (Haskell: zipWith a xs (tail xs))
[Figure: Two Dimensional Convolutional Network]
Recursive Neural Networks (“TreeNets”) are catamorphisms, a generalization of folds.
Catamorphism = TreeNet (Haskell: cata a)
http://colah.github.io/posts/2015-09-NN-Types-FP/
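The same correspondences can be tried out in Python with standard-library tools. This is a rough sketch: `cell`, `unfold` and the arithmetic inside them are hypothetical stand-ins for learned network blocks, and the accumulating map runs left to right, whereas Haskell's mapAccumR runs right to left.

```python
from functools import reduce
from itertools import accumulate

def cell(state, x):
    """Stand-in for a learned recurrent cell: new_state = g(state, x)."""
    return 0.5 * state + x

xs = [1.0, 2.0, 3.0]

# Encoding RNN = fold: consume the whole sequence, keep only the final state.
encoded = reduce(cell, xs, 0.0)

# General RNN = accumulating map: keep the state after every input.
states = list(accumulate(xs, cell, initial=0.0))[1:]

def unfold(step, state, n):
    """Generating RNN = unfold: emit n outputs from an initial state."""
    out = []
    for _ in range(n):
        y, state = step(state)
        out.append(y)
    return out

generated = unfold(lambda s: (s, 0.5 * s), 4.0, 3)

# Convolutional layer = windowed map, mirroring zipWith a xs (tail xs).
windowed = [b - a for a, b in zip(xs, xs[1:])]
```

In all four cases the same small function is reused at every position, which is the weight-tying the slides refer to.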
Examples of building block combinations
English to French translation by combining an encoding RNN and a generating RNN, to essentially perform a fold followed by an unfold (Sutskever, et al. (2014)).
Image captions with a convolutional network and a generating RNN. The CNN does feature detection and the generating RNN unfolds the resulting vector into a description sentence (Vinyals, et al. (2014)).
Functional Names of Common Layers
Deep Learning Name | Functional Name
Learned Vector | Constant
Embedding Layer | List Indexing
Encoding RNN | Fold
Generating RNN | Unfold
General RNN | Accumulating Map
Bidirectional RNN | Zipped Left/Right Accumulating Maps
Conv Layer | “Window Map”
TreeNet | Catamorphism
Inverse TreeNet | Anamorphism
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
Creating differentiable functional graphs
Make algorithmic elements continuous and differentiable
[Figure: NTM on copy task (Graves et al. 2014)]
Create/implement a functional language where all primitives are differentiable and expressible in neural form (save basic arithmetic operations), so that we have: y = f(x) = σ(Wx + b)
Structural models already exist (Neural Turing Machine; Stack-augmented RNN; Stack, queue, deque); what is missing is the neural programming language
Adapted from http://www.cs.nuim.ie/~gunes/files/Baydin-MSR-Slides-20160201.pdf
Needed language constructs
Basic differentiable structures based on y = f(x) = σ(Wx + b)
Functional expressions (no mutable data inside), as in declarative languages (Lisp, Haskell, Erlang, etc.)

function h(x)
    return f(g(x))
endfunction
Neural form: h(x) = σ(W2 σ(W1x + b1) + b2)

function f(x, a)
    if x > 1.0
        return a + 1
    else
        return a
    endif
endfunction
Neural form: +(x, y) = σ(Wx + W'y + b) and f(x, a) = if(x, 1.0, +(a, 1.0), a)

Differentiable if, implemented with a TreeNet neural network
https://pseudoprofound.wordpress.com/2016/08/03/differentiable‐programming/
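One simple way to see what a differentiable if can look like is to replace the hard branch by a sigmoid-weighted blend of both branches. This is not the TreeNet implementation mentioned on the slide, only a swapped-in, minimal stand-in; `soft_if` and its `sharpness` parameter are hypothetical names introduced for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_if(x, threshold, then_value, else_value, sharpness=10.0):
    """Smooth version of: then_value if x > threshold else else_value."""
    gate = sigmoid(sharpness * (x - threshold))   # close to 1 when x > threshold
    return gate * then_value + (1.0 - gate) * else_value

def f(x, a):
    """Smooth counterpart of f(x, a) = if(x, 1.0, +(a, 1.0), a)."""
    return soft_if(x, 1.0, a + 1.0, a)

far_above = f(3.0, 5.0)    # behaves like the 'then' branch
far_below = f(-1.0, 5.0)   # behaves like the 'else' branch
at_threshold = f(1.0, 5.0) # exactly halfway between the two branches
```

Because the gate is a smooth function of x, a gradient with respect to x (and to anything that produced the threshold) exists everywhere, at the cost of blurring the decision near the threshold.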
Functional language constructs
Primitive functions f: T -> S to carry out the basic σ(Wx + b) building blocks, with W and b learned from the data.
Mechanism to create composite functions from primitive functions, e.g., mlp(x) = f(g(x))
Higher‐level functions that take functions as inputs, generate functions as outputs, or both
Memory constructs (lists? Monads?)
=> λ-calculus!
λ-calculus syntax
All expressions are of the form:
e ::= x        // variable
    | λx.e1    // function definition
    | e1 e2    // function application
    | (e1)     // disambiguation
http://colah.github.io/posts/2015‐09‐NN‐Types‐FP/
Examples of higher-order functions
map(Fun, List)
• Applies Fun to each element of List, returning a list of results that may be of a different type
filter(Pred, List)
• Returns a sublist of List that contains the elements of List that satisfy the predicate Pred
foldl(Fun, Acc, List)
• Calls Fun on each element of List in turn together with the accumulator, starting from Acc, and returns the final accumulator (same type as Acc)
Etc.
More higher-order functions
all(Pred, List) any(Pred, List) takewhile(Pred, List)
dropwhile(Pred, List) flatten(DeepList) flatmap(Fun, List)
foreach(Fun, List) partition(Pred,List) zip(List1,List2)
unzip(List) …
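For readers who want to try these functions out, here is how several of them look with Python's built-ins and standard library. The mapping to the names on the slides is approximate (argument order and laziness differ between languages), and the sample data is arbitrary.

```python
from functools import reduce
from itertools import chain, dropwhile, takewhile

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda n: n * n, numbers))          # map(Fun, List)
evens = list(filter(lambda n: n % 2 == 0, numbers))    # filter(Pred, List)
total = reduce(lambda acc, n: acc + n, numbers, 0)     # foldl(Fun, Acc, List)

all_positive = all(n > 0 for n in numbers)             # all(Pred, List)
any_large = any(n > 4 for n in numbers)                # any(Pred, List)
head = list(takewhile(lambda n: n < 3, numbers))       # takewhile(Pred, List)
tail = list(dropwhile(lambda n: n < 3, numbers))       # dropwhile(Pred, List)
flat = list(chain.from_iterable([[1, 2], [3], []]))    # flatten, one level only
pairs = list(zip(numbers, "abcde"))                    # zip(List1, List2)
```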
Software is dead, long live software?
Current software is imperative (sequence of instructions, each one imparting a behaviour to a point in program space)
• But for most real-world problems, it is easier to state desired behaviour (e.g., via input-output examples) than to write executable code
V2.0 would be declarative: the “programmer” specifies the outcome and a composition of neural building blocks is searched for to provide it
• Deep learning searches in continuous manifolds (for dimensionality reduction and to make gradient descent possible)
Software should switch from writing programs, maintaining repositories and doing run‐time analysis to collecting, analyzing and preparing data for a neural network
From classical to differentiable machines
Classical program: Sequence of executable instructions to perform a specified task
Differentiable program: Sequence of problem domain declarations on how to perform a specified task
• Functional blocks for white box operation
• Differentiable nodes for auto-configuration by error backpropagation learning
How about existing frameworks?
Currently two types of computational graphs:
Symbolic
• Typical representatives: Theano, Tensorflow, CGT
• Fine-grained
• Graph analysis and optimizations
Modular
• Typical representatives: Torch, Caffe
• Coarse-grained
• Manually designed modules
Similarities
• Model definition using a (constrained) symbolic language
• Automatic handling of backpropagation in the final model (no need to code derivatives along)
(Kenneth Tran. “Evaluation of Deep Learning Toolkits”. https://github.com/zer0n/deepframeworks)
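To make "automatic handling of backpropagation" concrete, here is a toy reverse-mode differentiation over a tiny computational graph. It is a teaching sketch under strong simplifications (scalars only, addition and multiplication only) and does not reflect how any of the frameworks named above are implemented; `Var` and `backward` are hypothetical names.

```python
class Var:
    """A scalar graph node that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # pairs (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    """Propagate d(output)/d(node) to every node via the chain rule."""
    order, seen = [], set()

    def visit(node):                # depth-first topological ordering
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # from the output back to the inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Var(3.0), Var(4.0)
z = x * y + x * x                   # dz/dx = y + 2x, dz/dy = x
backward(z)
```

The user only writes the forward expression for `z`; the derivatives in `x.grad` and `y.grad` come out of the recorded graph, which is the convenience both families of frameworks provide.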
You are limited to symbolic graph building, with the mini-language
For example, instead of this in pure Python (for y = A^k): [code listing not captured in the transcript]
You build this symbolic graph: [code listing not captured in the transcript]
But no direct functional building as such
http://deeplearning.net/software/theano/library/scan.html
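The code listings of this slide are missing from the transcript. As a stand-in, the sketch below shows what an imperative pure-Python loop for y = A^k typically looks like, next to a functional version written as a fold. It uses a scalar A for simplicity, and neither function is Theano code; `power_loop` and `power_fold` are hypothetical names.

```python
from functools import reduce

def power_loop(a, k):
    """Imperative version: multiply an accumulator by a, k times."""
    result = 1
    for _ in range(k):
        result = result * a
    return result

def power_fold(a, k):
    """Functional version: the same accumulation expressed as a fold."""
    return reduce(lambda acc, _: acc * a, range(k), 1)

y_loop = power_loop(3, 4)
y_fold = power_fold(3, 4)
```

A symbolic framework has to express this loop through its own looping construct, since an ordinary Python `for` is not part of the graph it differentiates.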
Current efforts
• Neural programmers (a bit similar to genetic programming)
• Neural Programmer-Interpreters (with by-example supervision)
• Neural Turing Machines
• DiffSharp (high-order differentiation)
• Autograd (automatic differentiation of NumPy and Python code)
• DNNGraph (Haskell model to Caffe and Torch scripts)
• Etc.
All in the last couple of years, but is gradient descent really necessary?
How about copying biology?
A biologically-inspired neural building block
A loop-based neural architecture
Gisiger & Boukadoum, Neural networks, 2018
Delayed-response task (DRT)
Tests the ability to respond to stimuli based on short-term memory
Three major steps, repeated over a number of trials:
• Cue: sensory information to retain (e.g., image, dot on a screen, auditory stimulus)
• Delay: the cue is withdrawn for an arbitrary delay
• Response: cue-related action (e.g., identify a cue image in a set, or point to the location where the dot initially appeared)
Although seemingly simple, the task requires complex mental processing:
1. sensing the cue information, say a visual representation (VR);
2. committing the cue information to short-term memory;
3. protecting it from interference by external and internal distractions;
4. using the information stored in working memory to produce the correct motor response (PM);
5. discarding this information at the end of the trial in preparation for the next one (Reset).
Implementing DRT with a loop-based network
LSTM perspective
Many obstacles remain
Need for more parallel processing and better energy efficiency
• Both at the hardware and software level
Need for training with less data
Lift the algebraically expressible data restriction (vectors, matrices, tensors…)
Gradient descent learning is convex optimization; non-convex techniques have not been studied due to apparent NP-hardness
Serious side effects!
Noise effects (and hacker opportunities!)
http://arxiv.org/pdf/1312.6199v4.pdf
https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture
https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196
The algorithm is deceptively simple
1. Feed in the photo to hack
2. Get the neural network’s prediction and see how far off it is from the target answer
3. Tweak the photo using back-propagation to make the prediction closer to the target answer
4. Repeat steps 1–3 with the same photo until the network gives us the answer we want
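The four steps can be sketched on a deliberately tiny model. Everything here is an assumption made for illustration: the "network" is a fixed logistic classifier with three hand-picked weights, the "photo" is a 3-number vector, and `hack`, `step` and `tolerance` are hypothetical names.

```python
import math

def predict(w, b, x):
    """Probability of class 1 for a fixed logistic classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def hack(w, b, x, target, step=0.05, tolerance=0.1, max_iters=1000):
    x = list(x)
    for _ in range(max_iters):
        p = predict(w, b, x)                      # steps 1-2: predict, compare
        if abs(target - p) <= tolerance:          # the network says what we want
            break
        # Step 3: gradient of the cross-entropy cost w.r.t. the input is (p - target) * w.
        x = [xi - step * (p - target) * wi for xi, wi in zip(x, w)]
    return x                                      # step 4 is the loop itself

w, b = [2.0, -3.0, 1.0], 0.0
photo = [0.2, -0.1, 0.1]
p_before = predict(w, b, photo)
hacked = hack(w, b, photo, target=0.0)
p_after = predict(w, b, hacked)
```

The model weights are never changed; only the input is tweaked, using the same gradient machinery that training uses.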
Adding an imperceptibly small vector whose elements have the same sign as the gradient of the cost function with respect to the input can drastically change the image classification.
https://arxiv.org/abs/1412.6572
[Figure: $x$ (“panda”, 57.7% confidence) $+\; 0.007 \times \mathrm{sign}(\nabla_x J(\theta, x, y))$ (“nematode”, 8.2% confidence) $=\; x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$ (“gibbon”, 99.3% confidence)]
https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196
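The sign-of-gradient perturbation can be reproduced on a toy model. This sketch assumes a fixed 3-weight logistic classifier with a cross-entropy cost, for which the gradient with respect to the input is (p - y) * w; the weights, the input and the value of eps are illustrative, and eps is far larger than the 0.007 used on images because the toy input has only three dimensions.

```python
import math

def predict(w, b, x):
    """Probability of class 1 for a fixed logistic classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps):
    """Return x + eps * sign(grad_x J), J being the cross-entropy cost."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.2, -0.1, 0.1], 1.0            # correctly classified as class 1
x_adv = fgsm(w, b, x, y, eps=0.25)

p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
```

Each coordinate moves by at most eps, but all coordinates push the cost in the same direction, so the effect adds up with the number of inputs. This is why a tiny per-pixel change can suffice on a full image.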
Sometimes, it doesn’t work!
Overfitting
https://ml.berkeley.edu/blog/2017/07/13/tutorial‐4/
$f(x) = 9055.5x^4 - 90555x^3 + 316942.5x^2 - 452773x + 217331$
1 = 1
2 = 3
3 = 5
4 = 7
5 = 217341
Answer: 217341!
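The point of the puzzle can be checked numerically: the degree-4 polynomial with the coefficients shown on the slide passes exactly through the four given points, yet its value at 5 is nowhere near what the simple rule 2x - 1 suggests. The function names below are hypothetical.

```python
def simple_rule(x):
    """The pattern most people would infer from 1, 3, 5, 7."""
    return 2 * x - 1

def overfitted(x):
    """Degree-4 polynomial using the coefficients listed on the slide."""
    return (9055.5 * x**4 - 90555 * x**3 + 316942.5 * x**2
            - 452773 * x + 217331)

training_inputs = [1, 2, 3, 4]
fits_simple = [simple_rule(x) for x in training_inputs]
fits_overfitted = [overfitted(x) for x in training_inputs]
prediction_simple = simple_rule(5)
prediction_overfitted = overfitted(5)
```

Both models have zero error on the training points, so the training error alone cannot tell them apart; only held-out data (here, the value at 5) reveals the difference.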
The consequences can be disastrous!
Data order
• Capture of invariant “spatial motifs” possible
22 1A a@a 1 aa a1.a 123 aa1
33 2B b@b 2 bb b2.b 234 bb2
44 3C c@c 3 cc c3.c 345 cc3
55 4D d@d 4 dd d4.d 456 dd4
66 5E e@e 5 ee e5.e 567 ee5
77 6F f@f 6 ff f6.f 678 ff6
88 7G g@g 7 gg g7.g 789 gg7
99 8H h@h 8 hh h8.h 890 hh8
111 9I i@i 9 ii i9.i 901 ii9
• Capture of invariant “spatial motifs” doubtful if the row or column order is arbitrary
In summary…
Efforts are under way to make white the new black
Until then, deep learning remains a black box, and neural network parameter tuning an art
Currently, the choice is between 80-90% accurate, non-DL models that we understand, or 99% accurate DL models that we don’t!