short trip in the valley of deep learning · convolution operation (filtering, sliding) • kernels...

Short Trip In The Valley of Deep Learning

Pantelis Vlachas, Guido Novati

Computational Science and Engineering Lab ETH Zürich

Motivation - What is machine learning?


Classical Machine Learning: data → feature extraction → regression/classification/etc. → result

Deep Learning: data → feature extraction + regression/classification/etc. → result

Deep Learning

Sophisticated Architectures
• LeNet of Yann LeCun et al., 1998
• LSTM, 1997
• GRU, 2014

Algorithms
• Backpropagation
• Backpropagation through time (BPTT)
• Variational Inference (Bayesian)
• GEMM (General Matrix to Matrix Multiplication)

Hardware
• Graphics Processing Units (GPUs)

Convolutional Neural Networks

• Heavily based on GEMM (General Matrix to Matrix Multiplication)

• Parametric models suited for image processing (classification, object detection, etc.)

• Applications in self-driving cars, robotics, healthcare, physics, image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, financial time series, etc.


Biological Intuition

• Very roughly, biological brains have neurons that activate when they recognize a triggering pattern in their input

• Each unit does “simple” pattern recognition

• Complexity emerges from sheer numbers

Convolutional Model of a part of a Fruit-Fly’s brain, Jonathan Schneider et al., 2018

Convolutional Neural Network

What is a Convolution ?

• Input is a matrix: $d_{IY} \times d_{IX} \times d_{IC}$
• Parameters are a tensor: $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$ (we have $d_{KC}$ filters, indexed $KC = 1, 2, 3, 4, \dots$)
• Output is a matrix: $d_{OY} \times d_{OX} \times d_{KC}$
• Mapping an image to another image
• Feature sizes $d_{IC}$, $d_{KC}$ can be any numbers
• Parameters are called “filters” or “kernels”

Convolution Operation (filtering, sliding)

• Kernels $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$
• Sliding a kernel along the spatial dimensions $IX$ and $IY$ (iterating along $IX$ and $IY$)
• At each position and for each filter in the $KC$ dimension, we compute the scalar product between the filter (of size $d_{KY} \times d_{KX} \times d_{IC}$) and a “patch” of the image of size $d_{KY} \times d_{KX} \times d_{IC}$
• The output of the scalar product is a number which is written in a single color pixel (channel) of the output image
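The sliding scalar product can be sketched in plain Python. This is a minimal illustration only, assuming no padding and a stride of 1; the function name `conv2d` and the nested-list layouts `image[y][x][c]` and `kernels[ky][kx][c][k]` are choices made for this example. As in most deep-learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernels):
    """Naive convolution without padding, stride 1.

    image[y][x][c]        has size d_IY x d_IX x d_IC
    kernels[ky][kx][c][k] has size d_KY x d_KX x d_IC x d_KC
    """
    d_iy, d_ix, d_ic = len(image), len(image[0]), len(image[0][0])
    d_ky, d_kx, d_kc = len(kernels), len(kernels[0]), len(kernels[0][0][0])
    d_oy, d_ox = d_iy - d_ky + 1, d_ix - d_kx + 1
    output = [[[0.0] * d_kc for _ in range(d_ox)] for _ in range(d_oy)]
    for oy in range(d_oy):              # slide along IY
        for ox in range(d_ox):          # slide along IX
            for k in range(d_kc):       # one output channel per filter
                acc = 0.0
                # scalar product between filter k and the image patch
                for ky in range(d_ky):
                    for kx in range(d_kx):
                        for c in range(d_ic):
                            acc += image[oy + ky][ox + kx][c] * kernels[ky][kx][c][k]
                output[oy][ox][k] = acc
    return output


# Hypothetical 3 x 3 image with one channel and a single 2 x 2 filter of ones
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
kernels = [[[[1.0]], [[1.0]]],
           [[[1.0]], [[1.0]]]]
result = conv2d(image, kernels)
print(result)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

Each output pixel is the sum of one 2 × 2 patch, so the 3 × 3 input maps to a 2 × 2 output with one channel.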

1D Convolution, 1D Filter

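[Animation: the filter $(1, 0, -1)$ slides over the input $(1, -1, 2, 0, 1)$. At each position the overlapping entries are multiplied (×) and summed (+), producing the outputs $-1$, $-1$, $1$ one at a time.]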
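The 1D case is small enough to check by hand. The sketch below reads the slide's numbers as the input 1, −1, 2, 0, 1 and the filter 1, 0, −1 (an interpretation of the figure); the function name `conv1d` is chosen for this example.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal`; each output is the scalar product with one patch."""
    n_out = len(signal) - len(kernel) + 1
    return [sum(s * k for s, k in zip(signal[i:i + len(kernel)], kernel))
            for i in range(n_out)]


signal = [1, -1, 2, 0, 1]
kernel = [1, 0, -1]
output = conv1d(signal, kernel)
print(output)  # [-1, -1, 1]
```

The first output is 1·1 + 0·(−1) + (−1)·2 = −1. Without padding, an input of length 5 and a filter of length 3 give 5 − 3 + 1 = 3 outputs.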

Padding

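• What if we want to keep the output equal to the input in the spatial dimensions, $d_{IX} = d_{OX}$, $d_{IY} = d_{OY}$ ?
• Size of the image is extended in both directions by $d_{PY}$ and $d_{PX}$
• Usually zero padding

[Figure: without padding, the filter $(1, 0, -1)$ applied to the input $(1, -1, 2, 0, 1)$ gives $(-1, -1, 1)$; with zero padding the input becomes $(0, 1, -1, 2, 0, 1, 0)$.]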

Stride

• Convolution does not have to be computed by increments of $1$ pixel
• Stride (skip, stepping) $d_{SY}$ and $d_{SX}$
• Here padding $1$, stride $2$

[Figure: the filter $(1, 0, -1)$ moves over the zero-padded input $(0, 1, -1, 2, 0, 1, 0)$ in steps of $d_S = 2$.]
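Padding and stride can be added to a 1D convolution with two extra arguments. This is a sketch under the assumption of zero padding applied equally on both sides; the function name and the parameter names `padding` and `stride` are chosen for this example.

```python
def conv1d(signal, kernel, padding=0, stride=1):
    """1D convolution with zero padding on both sides and a configurable stride."""
    padded = [0] * padding + list(signal) + [0] * padding
    n_out = (len(padded) - len(kernel)) // stride + 1
    return [sum(p * k for p, k in zip(padded[i * stride:i * stride + len(kernel)], kernel))
            for i in range(n_out)]


signal = [1, -1, 2, 0, 1]
kernel = [1, 0, -1]
same = conv1d(signal, kernel, padding=1)               # output as long as the input
strided = conv1d(signal, kernel, padding=1, stride=2)  # every second position only
print(same)     # [1, -1, -1, 1, 0]
print(strided)  # [1, -1, 0]
```

With padding 1 the output has the same length as the input (5). With stride 2 only the positions 0, 2 and 4 of the padded input are evaluated.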

2-D Convolutions

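• If the input is $d_{IY} \times d_{IX} \times d_{IC}$
• The filters (kernels) are $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$
• With strides $d_{SY}$, $d_{SX}$
• Padding $d_{PY}$, $d_{PX}$
• The output image has size:

$$d_{OY} = \frac{d_{IY} - d_{KY} + 2 d_{PY}}{d_{SY}} + 1$$

$$d_{OX} = \frac{d_{IX} - d_{KX} + 2 d_{PX}}{d_{SX}} + 1$$

$$d_{OC} = d_{KC}$$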
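The output-size formula is easy to evaluate for concrete numbers. The helper below is a sketch; the slide does not say how to round when the division is not exact, so integer (floor) division, the usual convention, is assumed here. The function name and the example sizes are hypothetical.

```python
def conv_output_size(d_i, d_k, d_p, d_s):
    """Output length along one spatial dimension: (d_I - d_K + 2 d_P) / d_S + 1.

    Floor division is an assumption for the case where the stride
    does not divide the numerator exactly.
    """
    return (d_i - d_k + 2 * d_p) // d_s + 1


# Length-5 input, length-3 filter
print(conv_output_size(5, 3, 0, 1))  # 3: no padding shrinks the output
print(conv_output_size(5, 3, 1, 1))  # 5: padding 1 keeps the size
print(conv_output_size(5, 3, 1, 2))  # 3: stride 2 roughly halves it

# Hypothetical 32 x 32 image with 5 x 5 kernels, no padding, stride 1
d_oy = conv_output_size(32, 5, 0, 1)
d_ox = conv_output_size(32, 5, 0, 1)
print(d_oy, d_ox)  # 28 28
```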

Pooling Operations (Subsampling)

Short detour in classification…


Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT (0/1, binary)

Logistic Regression

INPUT $x \in \mathbb{R}^{d_x}$ → ALGORITHM (parameters $W$) → OUTPUT $y \in \{0, 1\}$

• Training examples $\{(x_1, y_1), \dots, (x_{N_{train}}, y_{N_{train}})\}$
• Testing examples $\{(x_1, y_1), \dots, (x_{N_{test}}, y_{N_{test}})\}$

Logistic Regression

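$x \in \mathbb{R}^{d_x}$ → ALGORITHM (parameters $W$) → $\hat{y} = f_W(x) \in \mathbb{R}$

[Figure: axes $x_1$, $x_2$; sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$]

• Output is a real number (one class): $\hat{y} = f_W(x)$
• Ideally we want $\hat{y} = P(y = 1 \mid x)$
• Training data $\{(x_1, y_1), \dots, (x_{N_{train}}, y_{N_{train}})\}$, $y_n \in \{0, 1\}$
• Model 1: linear regression: $\hat{y} = f_w(x) = w^T x + b$
• Model 2: Sigmoid output layer: $\hat{y} = f_w(x) = \sigma(w^T x + b)$
• $W^\star = \operatorname{argmin}_W L(\hat{y}, y)$, $L(\hat{y}, y) = \frac{1}{2} (\hat{y} - y)^2$ ?
• Cross entropy loss: $L(\hat{y}, y) = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$
• If $y = 1$: $L(\hat{y}, y) = -y \log(\hat{y}) = -\log P(y = 1 \mid x)$. Maximum log likelihood !
• If $y = 0$: $L(\hat{y}, y) = -\log(1 - \hat{y}) = -\log P(y = 0 \mid x)$. Maximum log likelihood !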
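A sigmoid output trained with the cross entropy loss can be sketched in a few lines. This is an illustration only: the plain stochastic gradient descent loop, the learning rate, the number of epochs and the toy data set are all assumptions made for this example. It uses the fact that, for a sigmoid output with cross entropy loss, the gradient with respect to $w^T x + b$ is $\hat{y} - y$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Model 2: y_hat = sigma(w^T x + b), read as P(y = 1 | x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def train(xs, ys, lr=0.1, epochs=1000):
    """Plain stochastic gradient descent on the cross entropy loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # dL/d(w^T x + b)
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Toy data, made up for illustration: class 1 when x1 + x2 is large
xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [2.0, 3.0], [3.0, 2.0]]
ys = [0, 0, 0, 1, 1, 1]

w, b = train(xs, ys)
predictions = [1 if predict(w, b, x) > 0.5 else 0 for x in xs]
mean_loss = sum(cross_entropy(predict(w, b, x), y) for x, y in zip(xs, ys)) / len(xs)
print(predictions, mean_loss)
```

Thresholding $\hat{y}$ at 0.5 turns the predicted probability into a 0/1 decision.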

Logistic Regression

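$x \in \mathbb{R}^{d_x}$ → ALGORITHM (parameters $W$) → $\hat{y} = f_W(x) \in \mathbb{R}$

[Figure: axes $x_1$, $x_2$; sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$]

• Output is a real number (one class): $\hat{y} = f_W(x)$
• Ideally we want $\hat{y} = P(y = 1 \mid x)$
• Training data $\{(x_1, y_1), \dots, (x_{N_{train}}, y_{N_{train}})\}$, $y_n \in \{0, 1\}$
• Model 2: Sigmoid output layer: $\hat{y} = f_w(x) = \sigma(w^T x + b)$
• $W^\star = \operatorname{argmin}_W L(\hat{y}, y)$
• Cross entropy loss: $L(\hat{y}, y) = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$
• How can we classify an object to more than 2 classes ?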

Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT (multi-class)

[Figure: output vector with entries 0, 1, 0, 0 for the classes Dog, Cat, Mouse, Elephant.]

Back in CNNs…


Classification on Images

Input to the network is an image → CNN → class probabilities

Probability that image is:
digit        0     1     2     3     4     5     6     7     8     9
probability  0.02  0.03  0.01  0.01  0.70  0.02  0.02  0.01  0.06  0.12

Output of the network is the probability of the input image being one of the digits (belonging to one of the target classes)

Classification Layer

• SoftMax output layer: $f(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{10} \exp(x_j)}$
• Sum of outputs is equal to 1
• Represent probabilities for the target classes
• Loss function ? Cross entropy loss: $L(f, \tilde{f}) = -\sum_{i=1}^{10} \tilde{f}_i \log f(x_i)$
• Target $\tilde{f} = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0] \iff$ the image shows the digit 4
• Measure of dissimilarity between distributions

[Figure: $x \to \dots \to h^l \to o = \mathrm{softmax}(W h^l + b)$]
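The softmax layer and its cross entropy loss can be checked numerically. The sketch below reuses the ten class probabilities from the digit example; the function names and the example logits are chosen for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something stated on the slides.

```python
import math


def softmax(x):
    """f(x_i) = exp(x_i) / sum_j exp(x_j)."""
    m = max(x)  # shifting by the max leaves the result unchanged but avoids overflow
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """L = -sum_i target_i * log(probs_i); terms with target_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Softmax turns arbitrary scores (hypothetical logits) into probabilities
probs_from_logits = softmax([1.0, 2.0, 3.0])
print(sum(probs_from_logits))  # 1 up to floating-point rounding

# Network output from the digit example and a one-hot target for the digit 4
probs = [0.02, 0.03, 0.01, 0.01, 0.70, 0.02, 0.02, 0.01, 0.06, 0.12]
target = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
loss = cross_entropy(target, probs)
print(loss)  # -log(0.70), roughly 0.357
```

With a one-hot target, only the predicted probability of the correct class enters the loss.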

Automatic Feature Detection

• Low level features (edges, circles, mesh, text, etc.)
• Second layer features
• High level features

Architectures

• LeNet by Yann LeCun et al., 1998
• Alex-Net by Alex Krizhevsky et al., 2012
• VGG Net by Oxford’s Visual Geometry Group, 2014
• GoogLeNet by Christian Szegedy et al., 2014
• ResNet (Residual Network) by Kaiming He et al., 2015
• DenseNet by Gao Huang et al., 2016


Alex-Net of Alex Krizhevsky et al., 2012

Heuristics for Deep Learning

Data Preprocessing
• Scaling (e.g. zero mean, unit variance)
• Random cropping
• Flipping data
• PCA whitening
• Noise

Initialization of Weights
• Scale the weights of each layer by the inverse of the square root of the number of input neurons, $\frac{1}{\sqrt{N_l}}$
• Xavier initialization
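The $1/\sqrt{N_l}$ scaling rule can be sketched as follows. Drawing the weights from a Gaussian, the function name `init_weights` and the layer sizes are assumptions made for this example; the slides only state the scale.

```python
import math
import random


def init_weights(n_in, n_out, seed=0):
    """Draw an n_out x n_in weight matrix with standard deviation 1 / sqrt(n_in)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_in)  # inverse square root of the number of input neurons
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]


# Hypothetical layer with 400 inputs and 50 outputs
weights = init_weights(n_in=400, n_out=50)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(std)  # close to 1 / sqrt(400) = 0.05
```

Xavier (Glorot) initialization follows the same idea but balances the number of input and output neurons, using a variance of $2 / (N_{in} + N_{out})$.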

Activation Functions
• tanh

• sigmoid

• ReLU

• ELU
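For reference, the four activation functions listed above can be written out using their standard textbook definitions. The `alpha` parameter of ELU and its default of 1.0 are a common convention, assumed here.

```python
import math


def tanh(x):
    return math.tanh(x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-2.0, 0.0, 2.0):
    print(x, tanh(x), sigmoid(x), relu(x), elu(x))
```

tanh and sigmoid saturate for large |x|, while ReLU and ELU stay linear for positive inputs.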

Regularization
• Dropout

[Figure panels: Fully-connected, Dropout]

Operating on Sequences
• In many applications, the data have temporal order (language, time series, etc.)
• Fully-connected networks and CNNs do not take this order into account and have fixed input and output sizes
• Recurrent Neural Networks: networks with feedback loops

Operating on Sequences

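[Figure: an RNN unrolled over the steps $t-1$, $t$, $t+1$. Each cell takes the input $x_t$ and the previous hidden state $h_{t-1}$ and produces $h_t$ through a $\tanh$.]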

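Weight Sharing in Time

[Figure: the RNN unrolled from $h_0$ over the inputs $x_1, x_2, x_3, \dots$, with hidden states $h_1, h_2, h_3, \dots, h_T$ and outputs $y_1, y_2, y_3, \dots, y_T$. Every step uses the same cell RNN$_W$ with the same parameters $W, b$.]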

Weight Sharing in Time

[Figure: the same unrolled RNN, now with a loss $L_t$ attached to every output $y_t$, giving $L_1, L_2, L_3, \dots, L_T$; the first is given as $L_1 = |y_1 - y_2|^2$.]

$$L = \frac{1}{T} \sum_{t=1}^{T} L_t$$
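Weight sharing in time means that one set of parameters is applied at every step. The sketch below assumes a common formulation, $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$ with a linear readout $y_t = W_y h_t$, and a squared-error loss per step; the split of $W$ into `w_h`, `w_x`, `w_y`, the toy sizes and the targets are all hypothetical.

```python
import math


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def rnn_forward(xs, h0, w_h, w_x, b, w_y):
    """Apply the same cell, with the same weights, at every time step."""
    h, hs, ys = h0, [], []
    for x in xs:
        pre = [a + c + d for a, c, d in zip(matvec(w_h, h), matvec(w_x, x), b)]
        h = [math.tanh(p) for p in pre]
        hs.append(h)
        ys.append(matvec(w_y, h))
    return hs, ys


def average_loss(ys, targets):
    """L = (1/T) * sum_t L_t, with L_t the squared error at step t."""
    per_step = [sum((y - t) ** 2 for y, t in zip(y_t, t_t))
                for y_t, t_t in zip(ys, targets)]
    return sum(per_step) / len(per_step)


# Toy setup: 1 input, 2 hidden units, 1 output, sequence length T = 3
w_h = [[0.5, 0.0], [0.0, 0.5]]
w_x = [[1.0], [-1.0]]
b = [0.0, 0.0]
w_y = [[1.0, 0.0]]
xs = [[1.0], [0.5], [-1.0]]
targets = [[0.5], [0.5], [-0.5]]

hs, ys = rnn_forward(xs, [0.0, 0.0], w_h, w_x, b, w_y)
loss = average_loss(ys, targets)
print(ys, loss)
```

The loop body never changes between steps, which is exactly what allows the network to process sequences of any length.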

Backpropagation Through Time (BPTT)

• FORWARD PASS - entire sequence, compute loss $L$
• BACKWARD PASS - entire sequence, compute gradients

Truncated BPTT
• “Carry” hidden state forever
• BACKWARD PASS - on some smaller number of steps

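[Figure: a chain of four RNN steps. Each step applies $\tanh(W \cdot)$ to the previous hidden state and the current input: $(h_0, x_1) \to h_1$, $(h_1, x_2) \to h_2$, $(h_2, x_3) \to h_3$, $(h_3, x_4) \to h_4$.]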

Vanishing Gradients Problem

• Computing the gradient of the loss w.r.t. $h_0$ involves many factors of $W$ and repeated $\tanh$
• In case of a linear activation and no bias, you would have factors like $W(W(\dots(W h_0)))$
• The gradient vanishes if the largest singular value of $W$ is $< 1$ and can explode if it is $> 1$
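The effect of repeated factors of $W$ can be seen numerically in the linear, bias-free case, where the Jacobian of $h_T$ with respect to $h_0$ is simply $W^T$. The two diagonal matrices below are made-up examples, chosen so that their singular values can be read off the diagonal.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def frobenius_norm(m):
    return math.sqrt(sum(v * v for row in m for v in row))


def jacobian_norm(w, steps):
    """Norm of d h_T / d h_0 = W^steps for a linear RNN without bias."""
    n = len(w)
    jac = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        jac = matmul(w, jac)
    return frobenius_norm(jac)


small = [[0.5, 0.0], [0.0, 0.9]]  # largest singular value 0.9 < 1
large = [[1.1, 0.0], [0.0, 0.5]]  # largest singular value 1.1 > 1

vanishing = jacobian_norm(small, 50)
exploding = jacobian_norm(large, 50)
print(vanishing)  # about 0.005
print(exploding)  # about 117
```

After only 50 steps the two gradients already differ by more than four orders of magnitude.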

Gating Architectures

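[Figure: Long Short-Term Memory Cell, with input gate $g^i_t$, forget gate $g^f_t$, output gate $g^o_t$, cell state $c_{t-1} \to c_t$, hidden state $h_{t-1} \to h_t$ and output $o_t$.]

[Figure: Gated Recurrent Unit, with gates $r_t$ and $z_t$, hidden state $h_{t-1} \to h_t$ and output $o_t$.]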

Gating Architectures

[Figure: four LSTM cells unrolled in time, with inputs $x_1, \dots, x_4$, hidden states $h_0, \dots, h_4$ and cell states $C_0, \dots, C_4$.]

Uninterrupted gradient flow !

RNN structure

• One-To-One, e.g. classification
• Many-To-One, e.g. sentiment analysis
• One-To-Many, e.g. image captioning, video generation
• Many-To-Many, e.g. machine translation, time-series prediction

Prediction of Chaotic Dynamics

• Forecasting the state of the Kuramoto-Sivashinsky equation

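$$\frac{\partial u}{\partial t} = -\nu \frac{\partial^4 u}{\partial x^4} - \frac{\partial^2 u}{\partial x^2} - u \frac{\partial u}{\partial x}$$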

• RNNs can be chaotic !

• They are dynamical systems

Word embeddings

• Words can be represented by numbers (vectors) that encode semantic meaning
• E.g. Word2Vec
• Input: LARGE CORPUS OF TEXT
• Learns a vector space where each word is assigned a vector
• How ? Predict a word (target) from its neighboring words (context) or vice versa
• Encodes context information

[Figure: 2-D embedding space with axes $x_1$, $x_2$ showing the words France, Paris, Greece and Athens (closest word).]
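The "closest word" idea can be illustrated with vector arithmetic on toy embeddings. The 2-D vectors below are invented for this example (real Word2Vec embeddings are learned from a corpus and have many more dimensions), and using cosine similarity as the closeness measure is an assumption.

```python
import math

# Hypothetical 2-D embeddings, made up for illustration only
embeddings = {
    "France": [1.0, 1.0],
    "Paris": [1.0, 3.0],
    "Greece": [3.0, 1.1],
    "Athens": [3.0, 3.2],
    "Italy": [5.0, 1.0],
    "Rome": [5.0, 3.0],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_word(query, exclude):
    """Return the word whose vector is most similar to `query`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))


# Paris is to France as ? is to Greece: query = Paris - France + Greece
query = [p - f + g for p, f, g in
         zip(embeddings["Paris"], embeddings["France"], embeddings["Greece"])]
answer = closest_word(query, exclude={"Paris", "France", "Greece"})
print(answer)  # Athens
```

The offset Paris − France acts as a "capital of" direction, so adding it to Greece lands near Athens.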


Applications

Source: The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches, 2018

• Object detection
• Object localisation
• Image/Video segmentation
• Autonomous driving
• Speech recognition
• Machine translation
• Image/Video captioning
• Medicine/Biology: brain cancer detection, skin cancer recognition

Outlook


Why Deep Learning ?
• Universal approach for learning problems
• Robust approach, does not require “much” expert knowledge
• Generalization, Scalability

Challenges ?
• Big data and scalability
• Generalization, transfer learning, multi-task learning
• Generate new “artificial” datasets, for applications where data is scarce (Generative models)
• Understanding/Explainable models, incorporating physics
• Causality and not plain pattern recognition/correlations
• Energy efficient implementations on mobiles/FPGAs, etc.

[Figure: performance versus amount of data, with one curve for Classical ML and one for Deep Learning.]
