
Neural Networks — Week #5

Machine Learning I 80-629A

Apprentissage Automatique I 80-629

Laurent Charlin — 80-629

This lecture

• Neural Networks

A. Modeling

B. Fitting

C. Deep neural networks

D. In practice

Some of today’s material is (adapted) from Joelle Pineau’s slides


From Linear Classification to Neural Networks

Recall Linear Classification

[Figure: data points in the (x₁, x₂) plane separated by a linear decision boundary]

y(x) = wᵀx + w₀

Decision: predict one class if (wᵀx + w₀) > 0 and the other if (wᵀx + w₀) < 0

What if the data are not linearly separable?

[Figure: the exclusive OR (XOR) problem in the (x₁, x₂) plane]

Idea: use the joint decision of several linear classifiers, e.g., wᵀx + w₀ and w′ᵀx + w′₀.

Combining models

[Figure: the two linear classifiers wᵀx + w₀ and w′ᵀx + w′₀ in the (x₁, x₂) plane]

f(x):  (wᵀx + w₀) > 0 ⇒ one class, (wᵀx + w₀) < 0 ⇒ the other
f′(x): (w′ᵀx + w′₀) > 0 ⇒ one class, (w′ᵀx + w′₀) < 0 ⇒ the other

1. Evaluate each model
2. Combine the outputs of the models:

f″(x) = threshold(w″₁ f(x) + w″₂ f′(x) + w″₀)

Combining models (graphical view)

[Figure: inputs x₁ and x₂ feed two units computing f(x) and f′(x); their outputs feed a third unit computing f″(x), which outputs one of two classes]

Each individual unit is a perceptron (neuron); the combined structure is a neural network.

Feed-forward neural network

• Each arrow denotes a connection
• A signal associated with a weight
• Each node computes the weighted sum of its inputs followed by a non-linear activation
• Connections go left to right
• No connections within a layer
• No backward (recurrent) connections

Input Layer → Hidden Layer(s) → Output Layer (plus a bias input fixed to 1)

Hidden unit: σ(Σᵢ wᵢ⁽¹⁾ xᵢ)    Output unit: σ(Σᵢ wᵢ⁽²⁾ oᵢ⁽¹⁾)

Feed-forward neural network

1. An input layer
• Its size is the number of inputs + 1 (the bias input)
2. One or more hidden layer(s)
• Their size is a hyper-parameter
3. An output layer
• Its size is the number of outputs

Compute a prediction (forward pass)

1. Compute the hidden-unit activations from the inputs: σ(Σᵢ wᵢ⁽¹⁾ xᵢ)
2. Compute the output from the hidden activations: σ(Σᵢ wᵢ⁽²⁾ oᵢ⁽¹⁾)
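As an illustration, a minimal NumPy sketch of this forward pass (not the course's reference code): one hidden layer, sigmoid activations, and a bias at each layer. The shapes and sizes are assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, W1, b1, w2, b2):
        o = sigmoid(W1 @ x + b1)      # 1. hidden units: sigmoid(sum_i w_i x_i)
        y_hat = sigmoid(w2 @ o + b2)  # 2. output unit: sigmoid(sum_i w_i o_i)
        return y_hat

    rng = np.random.default_rng(0)
    x = np.array([0.5, -1.0])         # two inputs x1, x2
    W1 = rng.normal(size=(3, 2))      # three hidden units
    b1 = rng.normal(size=3)
    w2 = rng.normal(size=3)
    b2 = rng.normal()
    print(forward(x, W1, b1, w2, b2))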

Neural Networks

• Flexible model class
• Highly non-linear models
• Good for regression/classification/density estimation
• The models behind "Deep Learning"
• Historical aspects

Learning the Parameters of a Neural Network

Fitting a neural network

How do we estimate the model's parameters?

• No closed-form solution
• Gradient-based optimization
• Threshold functions are not differentiable
• Replace them with the sigmoid (inverse logit), a soft threshold:

sigmoid(a) := 1 / (1 + exp(−a))

Fit the parameters w (backward pass)

• Derive a gradient with respect to the parameters w
• Back-propagation starts from the output node(s) and heads toward the input(s)
• In practice, the order of the computations is important

Squared error: (y − ŷ)²

∂(y − ŷ)²/∂wⱼ = ∂(y − f(Σᵢ wᵢ oᵢ))²/∂wⱼ = ∂(y − f(Σᵢ wᵢ f(Σⱼ wⱼ xⱼ)))²/∂wⱼ
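A hedged sketch of this backward pass for the squared error on the one-hidden-layer network from the forward-pass slide; the sigmoid activations and shapes are assumptions, not the course's exact derivation.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backward(x, y, W1, b1, w2, b2):
        # Forward pass, keeping the intermediate values.
        o = sigmoid(W1 @ x + b1)
        y_hat = sigmoid(w2 @ o + b2)
        # Backward pass: start from the output, head toward the inputs.
        d_yhat = -2.0 * (y - y_hat)           # d loss / d y_hat
        d_out = d_yhat * y_hat * (1 - y_hat)  # through the output sigmoid
        grad_w2 = d_out * o                   # d loss / d w2
        grad_b2 = d_out
        d_o = d_out * w2                      # propagate to the hidden layer
        d_hid = d_o * o * (1 - o)             # through the hidden sigmoids
        grad_W1 = np.outer(d_hid, x)          # d loss / d W1
        grad_b1 = d_hid
        return grad_W1, grad_b1, grad_w2, grad_b2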

Gradient descent

• No closed-form formula
• Repeat the following steps (for t = 0, 1, 2, … until convergence):

1. Calculate a gradient: ∇wᵢⱼᵗ
2. Apply the update: wᵢⱼᵗ⁺¹ = wᵢⱼᵗ − α ∇wᵢⱼᵗ

• Stochastic gradient descent: one example at a time
• Batch gradient descent: all examples at a time
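A sketch of the two regimes, assuming a hypothetical callable grad(w, X, y) that returns the gradient of the loss at w; alpha is the learning rate.

    import numpy as np

    def batch_gd(w, X, y, grad, alpha=0.1, steps=100):
        for t in range(steps):
            w = w - alpha * grad(w, X, y)         # all examples at a time
        return w

    def sgd(w, X, y, grad, alpha=0.1, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(len(X)):     # one example at a time
                w = w - alpha * grad(w, X[i:i+1], y[i:i+1])
        return w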

What can an MLP learn?

1. A single unit (neuron)
• A linear classifier + non-linear output
2. A network with a single hidden layer
• Any continuous function (but it may require exponentially many hidden units as a function of the number of inputs)
3. A network with two (or more) hidden layers
• Any function can be approximated with arbitrary accuracy

The Importance of Representations

From Neural Networks to Deep Neural Networks

[Figure: a neural network vs. a deep neural network with more hidden layers]

"Modern deep learning provides a powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity."

Deep Learning — Part II, p. 163, http://www.deeplearningbook.org/contents/part_practical.html

Another view of deep learning

• Representations are important

[Figures from Honglak Lee (and Graham Taylor)]

Data → Layer 1 → Layer 2 → … → Classifier → Output

Machine Translation

[Figure: French, English, and Spanish encoders map sentences into a universal sentence representation; French, English, and Spanish decoders map it back into each language]

Machine learning: make machines that can learn

Deep learning: a set of machine learning techniques based on neural networks

Representation learning: a machine learning paradigm to discover data representations

Idea: Hugo Larochelle

Deep neural networks

• Several layers of hidden nodes
• Parameters at different levels of representation

[Figure from Yoshua Bengio: "We would like the raw input image to be transformed into gradually higher levels of representation, representing more and more abstract functions of the raw input, e.g., edges, local shapes, object parts, etc. In practice, we do not know in advance what the 'right' representation should be for all these levels of abstractions, although linguistic concepts might help guessing what the higher levels should implicitly represent."]

[Figures from Yoshua Bengio]

Neural Network Hyper-parameters

Hyperparameters

1. Model specific
• Activation functions (output & hidden), network size
2. Optimization objective
• Regularization, early stopping, dropout
3. Optimization procedure
• Momentum, adaptive learning rates

Activation Functions

• Non-linear functions that transform the weighted sum of the inputs, e.g., g(Σᵢ wᵢ xᵢ)
• Non-linearities increase the model's representational power
• Non-linearities increase the difficulty of optimization
• Different functions are used for hidden units and output units

Activation functions — hidden units

• Traditional
• Logistic-like units: g(z) = logit⁻¹(z) = 1 / (1 + exp(−z)), g(z) = tanh(z)
• Saturate on both sides
• Differentiable everywhere

• Rectified linear units (ReLU): g(z) = max{0, z}
• Non-differentiable at a single point
• Now standard: better results / faster training
• Shuts off units
• Leaky ReLU: g(z) = max{0, z} + α min{0, z}

[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf]
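The same activation functions written out in NumPy, as a quick sketch:

    import numpy as np

    def logistic(z):                  # saturates on both sides
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):                      # g(z) = max{0, z}; shuts off units
        return np.maximum(0.0, z)

    def leaky_relu(z, alpha=0.01):    # g(z) = max{0, z} + alpha * min{0, z}
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    z = np.array([-2.0, -0.5, 0.0, 3.0])
    print(logistic(z), np.tanh(z), relu(z), leaky_relu(z))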

Activation functions — output units

sigmoid(a) = 1 / (1 + exp(−a))    softmax(a)ᵢ = exp(aᵢ) / Σⱼ exp(aⱼ)

Output type               Output unit                        Equivalent statistical distribution
Binary (0, 1)             Sigmoid                            Bernoulli
Categorical (0, 1, …, k)  Softmax                            Multinoulli
Continuous                Identity(z) = z                    Gaussian
Multi-modal               Mean, (co-)variance, components    Mixture of Gaussians
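A small sketch of the two output units; subtracting the maximum inside softmax is a standard numerical-stability trick, not something the slide specifies.

    import numpy as np

    def sigmoid(a):                   # binary outputs (Bernoulli)
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):                   # categorical outputs (Multinoulli)
        e = np.exp(a - np.max(a))     # subtract the max for stability
        return e / e.sum()

    a = np.array([1.0, 2.0, 3.0])
    print(sigmoid(a), softmax(a))     # softmax output sums to 1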

Regularization

• Weight decay on the parameters
• L2 penalty on the parameters
• Early stopping of the optimization procedure
• Monitor the validation loss and terminate when it stops improving
• Number of hidden layers and hidden units per layer
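A minimal sketch combining two of these regularizers, assuming hypothetical callables grad(w) (training-loss gradient) and val_loss(w): an L2 (weight-decay) term added to the update, plus early stopping when the validation loss stops improving.

    import numpy as np

    def train(w, grad, val_loss, alpha=0.1, lam=1e-3, patience=5, steps=1000):
        best_w, best_loss, bad = w.copy(), np.inf, 0
        for t in range(steps):
            w = w - alpha * (grad(w) + lam * w)  # L2 penalty adds lam * w
            loss = val_loss(w)                   # monitor the validation loss
            if loss < best_loss:
                best_w, best_loss, bad = w.copy(), loss, 0
            else:
                bad += 1
                if bad >= patience:              # stopped improving: terminate
                    break
        return best_w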

Momentum

• Pro: can allow you to jump over small local optima
• Pro: goes faster through flat areas by using the acquired speed
• Con: can also jump over global optima
• Con: one more hyper-parameter to tune
• More advanced adaptive steps: AdaGrad, Adam

wₜ₊₁ = wₜ − α ∇wₜ               (gradient descent)
v = βv − α ∇wₜ;  wₜ₊₁ = wₜ + v   (gradient descent with momentum)
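The two updates above as a sketch; grad is an assumed callable returning ∇w, and beta is the extra momentum hyper-parameter to tune.

    def gd_step(w, grad, alpha=0.1):
        return w - alpha * grad(w)            # plain gradient descent

    def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
        v = beta * v - alpha * grad(w)        # accumulate "speed" across steps
        return w + v, v                       # w_{t+1} = w_t + v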

Wide or Deep?

[Figure 6.6, Deep Learning book]
[Figure 6.7, Deep Learning book]

Dropout

• A standard regularization technique
• At training: drop a percentage of the units
• Used for non-output layers
• Prevents co-adaptation / a form of bagging
• At test: use the full network
• Normalize the weights

[Figure 7.6, Deep Learning]
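A sketch of one common way to implement this ("inverted" dropout, an assumption here rather than the slide's exact recipe): rescaling the surviving activations by 1/keep at training time plays the role of normalizing the weights, so the full network can be used unchanged at test time.

    import numpy as np

    def dropout(h, keep=0.8, train=True, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        if not train:
            return h                         # at test: use the full network
        mask = rng.random(h.shape) < keep    # drop a fraction (1 - keep) of units
        return h * mask / keep               # rescale surviving activations

    h = np.ones(10)
    print(dropout(h, keep=0.8))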

Neural Network Takeaways

• Very flexible models
• Composed of simple units (neurons)
• Adapt to different types of data
• (Highly) non-linear models
• E.g., can learn to order/rank inputs easily
• Scale to very large datasets
• May require "fiddling" with the model architecture + optimization hyper-parameters
• Standardizing the data can be very important

Where do NNs shine?

• Input is high-dimensional discrete or real-valued
• Output is discrete or real-valued, or a vector of values
• Possibly noisy data
• Form of the target function is unknown
• Human interpretability is not important
• The computation of the output from the input has to be fast


Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples.

Other tasks, that cannot be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.

Deep Learning — Part II, p.163 http://www.deeplearningbook.org/contents/part_practical.html

Neural Networks in Practice

In practice

• Software now derives gradients automatically
• You specify the architecture of the network:
• Connection pattern
• Number of hidden layers
• Number of units per layer
• Activation functions
• Learning rate (and learning-rate updates)
• Dropout
• For intuition: https://playground.tensorflow.org

A selection of standard tools (in Python)

• Scikit-learn
• A machine-learning toolbox
• Includes feed-forward neural networks
• Neural-network-specific tools
• PyTorch, TensorFlow
• Keras
• More specific tools for specific tasks:
• Caffe for computer vision, PySpark for distributed computation, NLTK for natural language processing
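To tie the pieces together, a minimal scikit-learn feed-forward network on toy data; the architecture and hyper-parameter values are illustrative choices, not recommendations.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    # A small non-linearly-separable dataset (two interleaved half-moons).
    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler().fit(X_train)   # standardizing can matter a lot
    clf = MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                        learning_rate_init=0.01, early_stopping=True,
                        random_state=0, max_iter=1000)
    clf.fit(scaler.transform(X_train), y_train)
    print(clf.score(scaler.transform(X_test), y_test))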
