Introduction to Neural Networks
Fundamental ideas behind artificial neural networks
Haidar Khan, Bulent Yener
RPI - Computer Science
Outline
- Introduction
- Machine learning framework
- Neural networks
  1. Simple linear models
  2. Nonlinear activations
  3. Gradient descent
- Demos
What are (artificial) neural networks?
- A technique to estimate patterns from data (~1940s)
- Also called "multi-layer perceptrons"
- "Neural": a very crude mimicry of how real biological neurons work
- A large network of simple units produces a complex output
Why do we care about them?
- A key ingredient in real AI
- Useful for industry problems
- Perform best on important tasks
- Yield insights into the biological brain (maybe)
General machine learning framework
- Data: an $N \times M$ matrix $X$; the $N$ rows are observations $x_i$ ($1 \times M$)
- Data labels: an $N \times 1$ vector $y$
- Assume there is some unknown function $f(\cdot)$ that generates the label $y_i$ given $x_i$: $f(x_i) = y_i$
- ML problem: estimate $f(\cdot)$, then use it to generate labels for new observations!
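To make the notation concrete, here is a minimal NumPy sketch with made-up values: $X$ holds one observation per row, and $y$ holds the matching labels.

```python
import numpy as np

# Toy data matrix: N = 6 observations, M = 3 features each (made-up values).
X = np.array([
    [0.5, 1.2, -0.3],
    [1.0, 0.4,  0.8],
    [-0.2, 0.9, 1.5],
    [0.7, -1.1, 0.2],
    [1.3, 0.0, -0.6],
    [-0.8, 0.3, 1.0],
])                                             # shape (N, M) = (6, 3)
y = np.array([1.0, 2.0, 1.5, 0.2, 0.9, 0.7])   # shape (N,) = (6,) labels

# Each row X[i] is one observation x_i; the unknown f maps X[i] -> y[i].
print(X.shape, y.shape)
```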
Some examples…
| Problem | Data | Data labels |
| --- | --- | --- |
| 119 images of cats and dogs (20 × 20 pixels) | 119 × 400 matrix of pixel data (we stretch each image into a long vector) | {Cat, Dog} |
| A 15-question political poll of 139 residents on recent state legislation | 139 × 15 matrix of answers (A-E) | Party affiliation: {Republican, Democrat, Independent} |
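The "stretch each image into a long vector" step from the first row of the table is just a reshape; a minimal sketch using random stand-in pixel values:

```python
import numpy as np

# 119 grayscale images, each 20 x 20 pixels (random stand-ins for real data).
images = np.random.rand(119, 20, 20)

# Stretch each 20 x 20 image into a 400-dimensional row vector.
X = images.reshape(119, 400)   # data matrix: one image per row
print(X.shape)                 # (119, 400)
```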
Recall: Linear regression
Assume the generating function $f(\cdot)$ is linear. Write the label $y_i$ as a linear function of $x_i$: $y_i = x_i w$. In matrix form: $y = Xw$.
What should the $M \times 1$ vector $w$ be? This is the familiar least squares regression:
$$w = (X^T X)^{-1} X^T y$$
We will set up the simplest neural network and show we arrive at this same solution!
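As a sanity check, the closed-form solution is two lines of NumPy; a minimal sketch with made-up data (solving the normal equations directly rather than forming the inverse, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # toy data matrix (N=6, M=3)
y = rng.normal(size=6)        # toy labels

# Solve the normal equations X^T X w = X^T y for w.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                      # the least squares weights
```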
Declare a simple neural network
Recall $x$ is $1 \times M$. One artificial neural unit:
- connects to each input $x_j$ with a weight $w_j$
- produces one output $z$
$$z = \sum_j x_j w_j$$
(Image: http://nikhilbuduma.com)
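The unit is just a dot product; a minimal sketch:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # one observation, M = 3 inputs
w = np.array([0.1, 0.4, -0.2])   # one weight per input

z = x @ w                        # z = sum_j x_j * w_j
print(z)                         # -0.75
```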
Set an objective to learn
Want the network outputs $z_i$ to match the labels $y_i$. Choose a loss function $E$ and optimize it w.r.t. the weights:
$$E = \frac{1}{2} \sum_i (z_i - y_i)^2 = \frac{1}{2} \sum_i (x_i w - y_i)^2$$
How do we minimize $E$ with respect to $w$?
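In code, the loss is a one-liner; a minimal sketch:

```python
import numpy as np

def loss(w, X, y):
    """Squared-error loss E = 1/2 * sum_i (x_i w - y_i)^2."""
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2)
```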
Equivalence to least squares
Take the derivative and set it to zero:
$$\frac{\partial E}{\partial w} = \sum_i (x_i w - y_i)\, x_i^T = 0 \quad\Rightarrow\quad \sum_i x_i^T x_i w - x_i^T y_i = 0$$
Written in matrix form this becomes:
$$X^T X w - X^T y = 0 \quad\Rightarrow\quad w = (X^T X)^{-1} X^T y$$
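A quick numerical check of this equivalence: at the least squares solution, the gradient $X^T(Xw - y)$ vanishes. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)

w = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form least squares weights

grad = X.T @ (X @ w - y)                # dE/dw in matrix form
print(np.allclose(grad, 0))             # True: the gradient vanishes at w
```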
Key idea: compose simple units
Where do we go from here? Use many of these simple units and compose them in layers. Function composition: $f(h(\cdot))$. Each layer learns a new representation of the data. A 3-layer network: $z_i = h_3(h_2(h_1(x_i)))$
(Image: http://neuralnetworksanddeeplearning.com)
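Composition in code is just nested calls; a minimal sketch with made-up (and, for now, still purely linear) layers:

```python
import numpy as np

# Each layer is a function of the previous layer's output.
def h1(x): return x @ np.array([[1.0, 0.5], [-0.5, 1.0], [0.2, 0.3]])  # 3 -> 2
def h2(x): return x @ np.array([[0.7, -0.1], [0.4, 0.9]])              # 2 -> 2
def h3(x): return x @ np.array([0.3, -0.6])                            # 2 -> 1

x = np.array([0.5, -1.0, 2.0])
z = h3(h2(h1(x)))   # a 3-layer network is just function composition
print(z)
```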
Drawback of using only linear units
Recall our earlier assumption that $f(\cdot)$ is linear. This is a very restrictive assumption. Furthermore, composing strictly linear models is itself linear:
$$z_i = h_3(h_2(h_1(x_i))) = W_3 W_2 W_1 x_i = W_{123}\, x_i$$
XOR problem (Minsky & Papert, 1969)
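A quick check that stacked linear layers collapse into a single matrix; a minimal sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(2, 3))   # layer 1: 3 -> 2
W2 = rng.normal(size=(2, 2))   # layer 2: 2 -> 2
W3 = rng.normal(size=(1, 2))   # layer 3: 2 -> 1
x = rng.normal(size=3)

W123 = W3 @ W2 @ W1            # collapse three linear layers into one matrix
print(np.allclose(W3 @ (W2 @ (W1 @ x)), W123 @ x))   # True
```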
XOR problem
Can't learn a simple XOR gate using only one straight line
Key idea: non-linear activations
Solution: add a non-linear function at the output of each layer. What kind of function? Differentiable, at least:
- Hyperbolic tangent: $z = \tanh(w^T x)$
- Sigmoid: $z = \dfrac{1}{1 + e^{-w^T x}}$
- Rectified linear: $z = \max(0, w^T x)$
Why? The labels $y$ can be a non-linear function of the inputs (like XOR).
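The three activations in NumPy; a minimal sketch applied to example pre-activations $w^T x$:

```python
import numpy as np

def tanh(a):    return np.tanh(a)                 # hyperbolic tangent
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))   # sigmoid
def relu(a):    return np.maximum(0.0, a)         # rectified linear

a = np.array([-2.0, 0.0, 2.0])   # pre-activations w^T x for three units
print(tanh(a), sigmoid(a), relu(a))
```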
Examples of non-linear activations
(Image: http://ufldl.stanford.edu)
How do we learn weights now?
With multiple layers and non-linear activation functions we can't simply take the derivative and set it to 0.
We can still set a loss function and:
- randomly try different weights, or
- numerically estimate the derivative:
$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$
Both are terribly inefficient and scale badly with the number of layers…
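A minimal sketch of the finite-difference estimate; note the loop over every weight, which is exactly why this scales badly:

```python
import numpy as np

def numerical_grad(loss, w, h=1e-6):
    """Estimate dE/dw_i by finite differences: one extra loss
    evaluation per weight, which makes this approach expensive."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus = w.copy()
        w_plus[i] += h
        grad[i] = (loss(w_plus) - loss(w)) / h
    return grad
```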
Key idea: gradient descent on loss function
Suppose we could calculate the partial derivative of $E$ w.r.t. each weight $w_i$: $\frac{\partial E}{\partial w_i}$ (the gradient).
Decrease the loss function $E$ by stepping each weight against its gradient, scaled by a small step size $\eta$:
$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$
Repeatedly applying this update is called gradient descent. It leads to a set of weights that correspond to a local minimum of the loss function.
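Gradient descent on the squared-error loss from earlier; a minimal sketch with made-up data and step size, which recovers (approximately) the closed-form least squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w = np.zeros(3)
eta = 0.01                       # step size (learning rate)
for _ in range(1000):
    grad = X.T @ (X @ w - y)     # dE/dw for the squared-error loss
    w -= eta * grad              # step opposite the gradient

print(w)                                   # approaches...
print(np.linalg.solve(X.T @ X, X.T @ y))   # ...the closed-form answer
```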
Backpropagation to estimate gradients
One of the breakthroughs in neural network research: it allows us to calculate the gradients of the network! The core idea behind the algorithm is repeated application of the chain rule of derivatives:
$$F(x) = f(g(x)) \quad\Rightarrow\quad F'(x) = f'(g(x))\, g'(x)$$
Two passes through the network:
- Forward: calculate $g(x)$ and then $f(g(x))$
- Backward: calculate $\nabla f(g(x))$ and then $\nabla g(x)$
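The two passes on a scalar toy example (with made-up $f$ and $g$); a minimal sketch:

```python
import numpy as np

# F(x) = f(g(x)) with f(u) = u**2 and g(x) = sin(x).
x = 1.3

# Forward pass: compute and cache the intermediate values.
u = np.sin(x)        # g(x)
F = u ** 2           # f(g(x))

# Backward pass: apply the chain rule using the cached values.
dF_du = 2 * u        # f'(g(x))
du_dx = np.cos(x)    # g'(x)
dF_dx = dF_du * du_dx

print(dF_dx, 2 * np.sin(x) * np.cos(x))   # identical by the chain rule
```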
Multilayer Backpropagation
Assume the forward pass has given us, at each neuron $j$: the inputs $t_i$ from the previous layer, the weighted sum $z_j = \sum_i w_{ij} t_i$, and the activated output $t_j$. Work backward from the output of the network:
$$E = \frac{1}{2} \sum_{j \in \text{outputs}} (t_j - y_j)^2, \qquad \frac{\partial E}{\partial t_j} = t_j - y_j \quad \text{(for output neurons)}$$
$$\frac{\partial E}{\partial z_j} = \frac{\partial t_j}{\partial z_j} \frac{\partial E}{\partial t_j}$$
$$\frac{\partial E}{\partial t_i} = \sum_j \frac{\partial z_j}{\partial t_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = t_i \frac{\partial E}{\partial z_j}$$
(Image: http://nikhilbuduma.com)
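A minimal sketch of these four equations for one hidden layer of sigmoid units (made-up sizes; for a sigmoid, $\partial t / \partial z = t(1-t)$):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
x = rng.normal(size=3)           # one observation (the t_i of the input layer)
y = np.array([1.0])              # its label
W1 = rng.normal(size=(3, 4))     # input -> hidden weights w_ij
W2 = rng.normal(size=(4, 1))     # hidden -> output weights

# Forward pass: cache z and t at every layer.
z1 = x @ W1;  t1 = sigmoid(z1)   # hidden layer
z2 = t1 @ W2; t2 = sigmoid(z2)   # output layer

# Backward pass: the four equations above, applied layer by layer.
dE_dt2 = t2 - y                   # dE/dt_j at the output
dE_dz2 = t2 * (1 - t2) * dE_dt2   # dE/dz_j via the sigmoid's derivative
dE_dW2 = np.outer(t1, dE_dz2)     # dE/dw_ij = t_i * dE/dz_j
dE_dt1 = W2 @ dE_dz2              # dE/dt_i = sum_j w_ij * dE/dz_j
dE_dz1 = t1 * (1 - t1) * dE_dt1
dE_dW1 = np.outer(x, dE_dz1)
```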
Putting all the pieces together
3 key elements to understanding neural networks:
- Composition of units with simple operations (dot product)
- Non-linear activation functions at unit outputs
- Learning the weights using gradient descent
Using neural networks (see the end-to-end sketch below):
- Set up the data matrix and label vector: $X$ and $y$
- Define a network architecture: number of layers, units per layer
- Choose a loss function to minimize: depends on the task
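An end-to-end sketch tying the three elements together on XOR. Everything here is a made-up choice for illustration (architecture, step size, iteration count, and the added bias terms), and convergence can depend on the random seed:

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

# XOR: the classic task a single linear unit cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # 2 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # 4 hidden -> 1 output
eta = 1.0                                       # step size

for _ in range(5000):
    # Forward pass
    t1 = sigmoid(X @ W1 + b1)
    t2 = sigmoid(t1 @ W2 + b2)
    # Backward pass (backpropagation)
    d2 = (t2 - y) * t2 * (1 - t2)
    d1 = (d2 @ W2.T) * t1 * (1 - t1)
    # Gradient descent updates
    W2 -= eta * t1.T @ d2; b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)

print(t2.round(2).ravel())   # should approach [0, 1, 1, 0]
```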
A couple of demos…
Credits
Images from:
- http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
- http://ufldl.stanford.edu
- http://neuralnetworksanddeeplearning.com