An introduction to Neural Networks and Deep Learning

Talk given at the Department of Mathematics of the University of Bologna

February 20, 2018

Andrea Asperti
DISI - Department of Informatics: Science and Engineering, University of Bologna
Mura Anteo Zamboni 7, 40127, Bologna, ITALY
andrea.asperti@unibo.it
A branch of Machine Learning
What is Machine Learning?
There are problems that are difficult to address with traditional programming techniques:

• classify a document according to some criteria (e.g. spam, sentiment analysis, ...)
• compute the probability that a credit card transaction is fraudulent
• recognize an object in some image (possibly from an unusual viewpoint, in new lighting conditions, in a cluttered scene)
• ...
Typically the result is a weighted combination of a large number of parameters, each one contributing to the solution to a small degree.
The Machine Learning approach
Suppose we have a set of input-output pairs (a training set)

{⟨x_i, y_i⟩}

the problem consists in guessing the map x_i ↦ y_i.
The M.L. approach:
• describe the problem with a model depending on some parameters Θ (i.e. choose a parametric class of functions)
• define a loss function to compare the results of the model with the expected (experimental) values
• optimize (fit) the parameters Θ to reduce the loss to a minimum (sketched below)
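As a concrete illustration of the three steps, a minimal sketch (not from the talk; it assumes NumPy and uses a simple linear model with mean squared error):

import numpy as np

# 1. model: a parametric class of functions, here y = theta[0]*x + theta[1]
def model(theta, x):
    return theta[0] * x + theta[1]

# 2. loss: compare the results of the model with the expected values
def loss(theta, xs, ys):
    return np.mean((model(theta, xs) - ys) ** 2)

# 3. fit: adjust theta to reduce the loss (naive gradient descent)
def fit(theta, xs, ys, lr=0.01, steps=1000):
    for _ in range(steps):
        err = model(theta, xs) - ys
        grad = np.array([np.mean(2 * err * xs), np.mean(2 * err)])
        theta = theta - lr * grad
    return theta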
Why Learning?
Machine Learning problems are in fact optimization problems! So why talk about learning?

The point is that the solution to the optimization problem is not given in analytical form (often there is no closed-form solution).

So we use iterative techniques (typically, gradient descent) to progressively approximate the result.

This form of iteration over data can be understood as progressive learning of the objective function, based on the experience of past observations.
Using gradients
The objective is to minimize some loss function over (fixed) training samples, e.g.

Θ(w) = Σ_i E(o(w, x_i), y_i)

by suitably adjusting the parameters w.

See how Θ changes according to small perturbations Δ(w) of the parameters w: this is the gradient

∇_w Θ = [∂Θ/∂w_1, ..., ∂Θ/∂w_n]

of Θ w.r.t. w.

The gradient is a vector pointing in the direction of steepest ascent.
Gradient descent
Goal: minimize some loss function Θ(w) by suitably adjusting the parameters.

We can reach a minimal configuration for Θ(w) by iteratively taking small steps in the direction opposite to the gradient (gradient descent).
This is a general technique.
Warning: not guaranteed to work:

• it may end up in a local minimum
• it may get lost on a plateau
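A minimal sketch of the iteration (an illustration, not the talk's code), estimating the gradient numerically:

import numpy as np

def numerical_gradient(theta_fn, w, eps=1e-6):
    # estimate each partial derivative by perturbing one coordinate of w
    grad = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        grad[i] = (theta_fn(w + dw) - theta_fn(w - dw)) / (2 * eps)
    return grad

def gradient_descent(theta_fn, w, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * numerical_gradient(theta_fn, w)  # step against the gradient
    return w

# example: Theta(w) = (w0 - 3)^2 + (w1 + 1)^2 has its minimum at (3, -1)
w = gradient_descent(lambda w: (w[0] - 3)**2 + (w[1] + 1)**2, np.zeros(2))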
Next arguments
A bit of taxonomy
Different types of Learning Tasks
• supervised learning: inputs + outputs (labels)
  - classification
  - regression
• unsupervised learning: just inputs
  - clustering
  - component analysis
  - autoencoding
• reinforcement learning: actions and rewards
  - learning long-term gains
  - planning
Classification vs. Regression
Two forms of supervised learning over {⟨x_i, y_i⟩}:

[Figure: for classification, a new input is assigned a class ("Probably a cat!"); for regression, a new input is mapped to an expected value.]

classification: y is discrete, e.g. y ∈ {•, +}
regression: y is (conceptually) continuous
Many different techniques
• Different ways to define the models:
  - decision trees
  - linear models
  - neural networks
  - ...
• Different error (loss) functions:
  - mean squared error
  - logistic loss
  - cross entropy
  - cosine distance
  - maximum margin
  - ...
[Figure: a decision tree testing Outlook (Sunny / Overcast / Rain), Humidity (High / Normal) and Wind (Strong / Weak) with Yes/No leaves, shown next to a neural net; below, two loss functions: mean squared error and maximum margin.]
Next argument
Neural Networks
Neural Network
A network of (artificial) neurons
Artificial neuron
Each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).
The artificial neuron
[Figure: an artificial neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n, a bias b (the weight on a constant input +1), a summation node Σ, and an activation function; the output is f(Σ_i w_i x_i + b).]

The purpose of the activation function is to introduce a thresholding mechanism (similar to the axon hillock of cortical neurons).
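In code, the neuron above amounts to a single line (a sketch, assuming NumPy):

import numpy as np

def neuron(w, b, x, activation):
    # weighted sum of the inputs plus bias, passed through the activation
    return activation(np.dot(w, x) + b)

# a threshold unit on two inputs (weights and bias are arbitrary examples)
out = neuron(np.array([0.5, -0.3]), 0.1, np.array([1.0, 2.0]),
             activation=lambda z: 1.0 if z > 0 else 0.0)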
Different activation functions
The activation function is responsible for threshold triggering.
• threshold: if x > 0 then 1 else 0
• logistic function: 1/(1 + e^(−x))
• hyperbolic tangent: (e^x − e^(−x))/(e^x + e^(−x))
• rectified linear (ReLU): if x > 0 then x else 0
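For reference, the four functions written out in NumPy (a direct transcription of the definitions above):

import numpy as np

def threshold(x): return np.where(x > 0, 1.0, 0.0)
def logistic(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):      return np.tanh(x)           # (e^x - e^-x) / (e^x + e^-x)
def relu(x):      return np.maximum(0.0, x)   # if x > 0 then x else 0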
A comparison with the cortical neuron
Next argument
Networks typology/topology
Layers
A neural network is a collection of artificial neurons connected together. Neurons are usually organized in layers.

If there is more than one hidden layer the network is deep, otherwise it is called a shallow network.
Feed-forward networks
If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the commonest type of networks in practical applications.

Important: composing linear transformations makes no sense, since we still get a linear transformation. What is the source of non-linearity in Neural Networks?
The activation function
Dense networks
The most typical feed-forward network is a dense network, where each neuron at layer k − 1 is connected to each neuron at layer k.

The network is defined by a matrix of parameters (weights) W_k for each layer (+ biases).

The matrix W_k has dimension L_{k−1} × L_k, where L_k is the number of neurons at layer k.
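With this convention, a forward pass through one dense layer is a single matrix product (a sketch under the stated shape assumptions):

import numpy as np

def dense_forward(a_prev, W, b, f=np.tanh):
    # a_prev: activations at layer k-1 (length L_{k-1})
    # W: weight matrix of shape (L_{k-1}, L_k); b: biases of length L_k
    return f(a_prev @ W + b)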
Parameters and hyper-parameters
The weights W_k are the parameters of the model: they are learned during the training phase.

The number of layers and the number of neurons per layer are hyper-parameters: they are chosen by the user and fixed before training starts.

Other important hyper-parameters govern training, such as the learning rate, the batch size, the number of epochs, and many others.
Convolutional networks
Convolutional networks are used with inputs having a topological structure: signal sequences (e.g. sound), or images.

They repeatedly apply a (small) uniform linear transformation, called a kernel, shifting it over the whole input image.
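A naive sketch of this shifting (an illustration, assuming NumPy and ignoring padding and stride):

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # the same (shared) kernel weights are applied at every position
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

For instance, convolve2d(image, np.array([[-1, 0, 1]])) computes a horizontal derivative of the image, as in the example below.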
Example
[Figure: the horizontal kernel [−1 0 1] and its vertical counterpart [−1 0 1]^T applied to an image, computing horizontal and vertical derivatives respectively.]
Computing features
Many interesting kernels (filters) are known from Image Processing:

• first and second order derivatives, image gradients
• Sobel, Prewitt, ...

In Neural Networks, kernels are learned by training.

Since kernels are small and weights are shared, training is relatively fast.
Recurrent Networks
In a recurrent network you may have cycles:

• the dynamics is very complex: it is not even clear that it stabilizes
• difficult to train
• biologically more realistic

Restricted models:

• Long Short-Term Memory models (LSTM)
• Gated Recurrent Units (GRU)
LSTM and GRU
LSTMs are useful to model sequences:

• equivalent to very deep nets with one hidden layer per time slice (net unrolling)
• weights are shared between different time slices
• they can keep information for a long time in an internal state (see the sketch below)
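A minimal Keras sketch of such a sequence model (the shapes are hypothetical, not from the talk):

from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, features, num_classes = 20, 8, 10   # hypothetical sizes

model = Sequential()
model.add(LSTM(128, input_shape=(timesteps, features)))  # state carried across time slices
model.add(Dense(num_classes, activation='softmax'))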
Symmetrically connected networks
Similar to recurrent networks, but connections between units are symmetrical (they have the same weight in both directions).

They have stable configurations corresponding to local minima of a suitable energy function.

Hopfield nets: symmetrically connected nets without hidden units.

Boltzmann machines: symmetrically connected nets with hidden units:

• more powerful models than Hopfield nets
• less powerful than general recurrent networks
• have a nice and simple learning algorithm
What a real network looks like

VGG 16 (Simonyan and Zisserman): 92.7% top-5 accuracy on ImageNet.
Picture by Davi Frossard: VGG in TensorFlow
How do we implement a neural net?
Neural nets look complicated.
How do we implement them?
There exist suitable languages and frameworks:

• Theano, University of Montreal
• TensorFlow, Google Brain
• Caffe, Berkeley Vision
• Keras, F. Chollet
• PyTorch, Facebook
• ...
VGG 16 in Keras
From GitHub
def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    ...
The whole model is defined in 50 lines of code.
But what about training?
So complex ...
fit(x, y, batch_size=32, epochs=10)
• x: input, an array of data (hence, typically, an array of arrays)
• y: labels, an array of target categories
• batch_size: integer, number of samples per gradient update
• epochs: integer, the number of epochs (passes over the data) used to train the model (a full example follows)
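A complete (hypothetical) example of the workflow around this call, with random data just to make it runnable:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(1000, 20)                 # 1000 samples, 20 features each
y = np.random.randint(0, 2, size=(1000, 1))  # binary labels

model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(20,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(x, y, batch_size=32, epochs=10)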
Next arguments
Features and deep features
Features
Any individual measurable property of the data useful for the solution of a specific task is called a feature.
Examples:
• Emergency C-section: age, first pregnancy, anemia, fetus malpresentation, previous premature birth, anomalous ultrasound, ...
• Weather: humidity, pressure, temperature, wind, rain, snow, ...
• Expected lifetime: age, health, annual income, kind of work, ...
Derived (inner) features
New interesting features may be derived as combinations of input features.

Suppose for instance that we are interested in modeling some phenomenon with a cubic function

f(x) = ax³ + bx² + cx + d

We can use x as input or ...

we can precompute x, x² and x³, reducing the problem to a linear model!
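A sketch of the trick (illustrative data, assuming NumPy): once x, x² and x³ are precomputed, the coefficients follow from ordinary linear least squares.

import numpy as np

x = np.linspace(-2, 2, 50)
y = 2*x**3 - x**2 + 0.5*x + 1                           # hypothetical cubic data

X = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)  # derived features
a, b, c, d = np.linalg.lstsq(X, y, rcond=None)[0]       # linear model fit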
Traditional Image Processing
In order to process an image we start by computing interesting derived features on the image:
- first order derivatives
- second order derivatives
- difference of Gaussians
- Laplacian
- ...

[Figure: the original image, Gaussian blurs of radius 10 and 25, and their difference.]

Then we use these derived features to get the desired output.
Deep learning, in deeper sense
Discovering good features is a complex task.
Why not delegate this task to the machine, and learn the features themselves?

Deep learning exploits a hierarchical organization of the learning model, allowing complex features to be computed in terms of simpler ones, through non-linear transformations.
AI, machine learning, deep learning
• Knowledge-based systems: take an expert, ask him how he solves a problem, and try to mimic his approach by means of logical rules
• Traditional Machine Learning: take an expert, ask him which features of the data are relevant to solving a given problem, and let the machine learn the mapping
• Deep Learning: get rid of the expert
Relations between research areas
[Figure: nested research areas: Deep learning (example: MLPs, autoencoders) sits inside Representation learning, which sits inside Machine learning (example: logistic regression), which sits inside Artificial Intelligence (example: knowledge bases).]
Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.
Components trained to learn
[Figure: four pipelines from Input to Output. Rule-based systems: hand-designed program. Classic machine learning: hand-designed features, then a mapping from features. Representation learning: learned features, then a mapping from features. Deep learning: simple features, then more complex features, then a mapping from features. The shaded boxes are the components trained to learn.]
Picture taken from “Deep Learning” by Y. Bengio, I. Goodfellow and A. Courville, MIT Press.
Next arguments
Some successful applications
• MNIST and ImageNet
• Speech Recognition
• Lip reading
• Text generation
• Deep dreams and Inceptionism
• Mimicking style
• Robot navigation
• Game simulation
MNIST
Modified National Institute of Standards and Technology database
• grayscale images of handwritten digits, 28 × 28 pixels each (each digit is normalized to a centered 20 × 20 box)
• 60,000 training images and 10,000 testing images (loading shown below)
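The dataset ships with Keras; loading it is one line (shapes as noted above):

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# x_train: (60000, 28, 28) grayscale images, y_train: (60000,) digit labels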
MNIST
A comparison of different techniques
Classifier                      Error rate (%)
Linear classifier               7.6
K-Nearest Neighbors             0.52
SVM                             0.56
Shallow neural network          1.6
Deep neural network             0.35
Convolutional neural network    0.21
See LeCun's page on the MNIST database for more data.
ImageNet
ImageNet (@Stanford Vision Lab)
• high resolution color images covering 22K object classes
• over 15 million labeled images from the web
ImageNet competition
Annual competition of image classification (since 2010).
• 1.2 million images (1,000 categories)
• make five guesses about the image label, ordered by confidence
ImageNet samples
ImageNet results
Speech recognition
Several stages (similar to optical character recognition):
• Segmentation: convert the sound wave into vectors of acoustic coefficients. Typical sampling: 10 milliseconds.
• The acoustic model: use adjacent vectors of acoustic coefficients to associate probabilities with phonemes.
• Decoding: find the sequence of phonemes that best fits the acoustic data and a model of expected sentences.
Deep neural networks, pioneered by George Dahl and Abdel-rahman Mohamed, are replacing previous machine learning methods.
Speech recognition
Major industries are investing a lot of money in speech recognition: Amazon (with Intel), Google, Microsoft, ...

Achieving Human Parity in Conversational Speech Recognition. Speech & Dialog research group at Microsoft, 2016.

R. Zweig (project manager) attributes the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.
Lip reading
Google's DeepMind AI can lip-read TV shows better than a professional.
Text Generation
See Andrej Karpathy's blog post The Unreasonable Effectiveness of Recurrent Neural Networks.

Examples of fake algebraic documents automatically generated by an RNN.
Deep dreams
Visit the Deep Dream Generator.
Mimicking style
A Neural Algorithm of Artistic Style
L.A. Gatys, A.S. Ecker, M. Bethge
Similar to inceptionism, but with “style” (texture) instead of content.
More examples
Mimicking style: a different approach
Image-to-image translation with Cycle Generative Adversarial Networks
Robot navigation
Quadcopter Navigation in the Forest using Deep Neural Networks
Robotics and Perception Group, University of Zurich, Switzerland &
Institute for Artificial Intelligence (IDSIA), Lugano, Switzerland
Based on Imitation Learning
Atari Games and Q-learning
Google DeepMind’s system playing Atari games (2013)
Recently extended with imagination-augmented agents (2017)
video
Based on:

• deep neural networks
• an innovative reinforcement learning technique called Q-learning (sketched below)
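At its core, Q-learning repeatedly applies the update Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)). A tabular sketch follows (an illustration, not DeepMind's code; in deep Q-learning the table is replaced by a neural network):

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # move the estimate Q[s, a] toward the bootstrapped target
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q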
Atari Games and Q-learning
The same network architecture was applied to all games.

Inputs are screen frames.

Works well for reactive games, not for planning.