
MATHEMATICS OF DEEP LEARNING

RAJA GIRYES

TEL AVIV UNIVERSITY

Mathematics of Deep Learning Tutorial

UAI, Tel Aviv, Israel

July 22, 2019


DEEP LEARNING IMPACT

• ImageNet dataset

• 1,400,000 images

• 1,000 categories

• 150,000 images for testing

• 50,000 for validation

Today we get about 3.5% top-5 error with a 152-layer network (ResNet-152) [He et al., 2016].

WHY DO THINGS WORK BETTER TODAY?

• More data – larger datasets, more access (internet)

• Better hardware (GPU)

• Better learning regularization (dropout)

• Deep learning's impact and success are not unique to image classification.

• But it is still unclear why deep neural networks are so remarkably successful and how they achieve it.

CUTTING-EDGE PERFORMANCE IN MANY OTHER APPLICATIONS

• Disease diagnosis [Zhou, Greenspan & Shen, 2016].

• Language translation [Sutskever et al., 2014].

• Video classification [Karpathy et al., 2014].

• Handwriting recognition [Poznanski & Wolf, 2016].

• Sentiment classification [Socher et al., 2013].

• Image denoising [Remez et al., 2017].

• Depth Reconstruction [Haim et al., 2017].

• Super-resolution [Kim et al., 2016], [Bruna et al., 2016].

• Error correcting codes [Nachmani et al., 2016].

• Many other applications…

CLASS AWARE DENOISING

[Remez, Litani, Giryes, Bronstein, 2017]

(Figure: visual comparison of class-aware denoising vs. class-agnostic denoising.)

DEPTH ESTIMATION BY PHASE CODED CUES

[Haim, Elmalem, Bronstein, Marom, Giryes, 2017]


ALL-IN-FOCUS BY PHASE CODED CUES

[Elmalem, Marom, Giryes, 2018]

COMPRESSED COLOR LIGHT FIELD

[Yovel, Nabati, Mendelovic, Giryes, 2018]

EXOPLANET DETECTION

[Zucker & Giryes, 2018]

DEEP ISP

[E. Schwartz, R. Giryes & A. M. Bronstein, 2018]

PARTIAL SHAPE ALIGNMENT

[Hanocka et al., 2018]

Alignment is performed by a free-form deformation generated by a network.

MESH CNN

• A neural network for mesh data.

• Performs a different mesh simplification for different tasks.

[Hanocka et al., 2019]

ASAP - NETWORK ARCHITECTURE SEARCH

[Noy et al., 2019]

DEEP NEURAL NETWORKS (DNN)

• One layer of a neural net: $X \in \mathbb{R}^d \mapsto \psi(WX) \in \mathbb{R}^m$, where $W$ is a linear operation and $\psi$ is a non-linear function.

• Concatenating (composing) the layers creates the whole net:

$\Phi(W_1, W_2, \dots, W_K) = \psi(W_K \cdots \psi(W_2\,\psi(W_1 X)))$
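To make the composition concrete, here is a minimal NumPy sketch (an illustration only: the ReLU choice for ψ, the layer widths and the random weights are assumptions, not part of the slides):

import numpy as np

def psi(v):
    # One possible choice for the non-linearity psi: ReLU
    return np.maximum(v, 0.0)

def dnn(x, weights):
    # Phi(W_1, ..., W_K): psi(W_K ... psi(W_2 psi(W_1 X)))
    for W in weights:
        x = psi(W @ x)
    return x

rng = np.random.default_rng(0)
d, widths = 8, [16, 16, 4]              # input dimension and layer widths (arbitrary)
Ws, prev = [], d
for m in widths:
    Ws.append(rng.standard_normal((m, prev)) / np.sqrt(prev))
    prev = m
x = rng.standard_normal(d)
print(dnn(x, Ws).shape)                 # -> (4,)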


CONVOLUTIONAL NEURAL NETWORKS (CNN)

• In many cases, 𝑊 is selected to be a convolution.

• This operator is shift invariant.

• CNNs are commonly used with images, as images are typically shift invariant.

One layer: $X \in \mathbb{R}^d \mapsto \psi(WX) \in \mathbb{R}^m$, where $W$ is a linear operation and $\psi$ is a non-linear function.


THE NON-LINEAR PART

• Usually 𝜓 = 𝑔 ∘ 𝑓.

• 𝑓 is the (point-wise) activation function

• 𝑔 is a pooling or an aggregation operator.

• ReLU: $f(x) = \max(x, 0)$

• Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$

• Hyperbolic tangent: $f(x) = \tanh(x)$

Pooling/aggregation over a group of values $X_1, X_2, \dots, X_n$:

• Max pooling: $\max_i X_i$

• Mean pooling: $\frac{1}{n}\sum_{i=1}^{n} X_i$

• $\ell_p$ pooling: $\left(\sum_{i=1}^{n} X_i^{p}\right)^{1/p}$
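A small NumPy sketch of these building blocks (the non-overlapping window grouping used for pooling is a simplifying assumption for illustration):

import numpy as np

def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)

def pool(x, n, kind="max", p=2):
    # g: aggregate non-overlapping windows of size n
    windows = x.reshape(-1, n)
    if kind == "max":
        return windows.max(axis=1)
    if kind == "mean":
        return windows.mean(axis=1)
    if kind == "lp":
        return (np.abs(windows) ** p).sum(axis=1) ** (1.0 / p)
    raise ValueError(kind)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
print(pool(relu(x), n=3, kind="max"))   # psi = g o f with f = ReLU, g = max pooling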


A SAMPLE OF EXISTING THEORY FOR DEEP LEARNING

WHY DO DNNs WORK?

• What is so special about the DNN structure?

• What is the role of the depth of a DNN?

• What is the role of pooling?

• What is the role of the activation function?

• How many training samples do we need?

• What is the capability of a DNN?

• What happens to the data throughout the layers?

DEEP LEARNING THEORY SURVEY


SAMPLE OF RELATED EXISTING THEORY

• Universal approximation of any Borel measurable function [Hornik et al., 1989], [Cybenko, 1989], [Daniely et al., 2017]

• Depth of a network provides an exponential gain in complexity relative to the number of parameters [Montúfar et al. 2014, Cohen et al. 2016, Eldan & Shamir, 2016] and invariance to more complex deformations [Bruna & Mallat, 2013]

• Number of training samples scales as the number of parameters [Shalev-Shwartz & Ben-David 2014] or the norm of the weights in the DNN [Neyshabur et al. 2015]

• Pooling relation to shift invariance and phase retrieval [Bruna et al. 2013, 2014]

• Deeper networks have more local minima that are close to the global one and fewer saddle points [Saxe et al. 2014], [Dauphin et al. 2014], [Choromanska et al. 2015], [Haeffele & Vidal, 2015], [Soudry & Hoffer, 2017]

• Relation to dictionary learning [Papyan et al. 2016].

• Information bottleneck [Shwartz-Ziv & Tishby, 2017], [Tishby & Zaslavsky 2017].

• Invariant representation for certain tasks [Soatto & Chiuso, 2016]

• Bayesian deep learning [Kendall & Gal, 2017], [Patel, Nguyen & Baraniuk, 2016]

REPRESENTATION POWER

• Neural nets serve as universal approximators of Borel measurable functions [Cybenko 1989, Hornik 1991].

• In particular, let the non-linearity $\psi$ be a bounded, non-constant continuous function, $I_d$ be the $d$-dimensional hypercube, and $C(I_d)$ be the space of continuous functions on $I_d$. Then for any $f \in C(I_d)$ and $\epsilon > 0$, there exist $m > 0$, $X \in \mathbb{R}^{d \times m}$, $B \in \mathbb{R}^m$ and $W \in \mathbb{R}^m$ such that the neural network

$F(V) = \psi(VX + B)\,W^T$

approximates $f$ with precision $\epsilon$:

$|F(V) - f(V)| < \epsilon, \quad \forall V \in I_d$
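As a rough numerical illustration of this single-hidden-layer form (not from the slides: the target function, the tanh non-linearity and the random-feature least-squares fit are assumptions), one can fix random X and B and solve only for the outer weights W:

import numpy as np

rng = np.random.default_rng(0)
d, m = 1, 200                                 # input dimension and number of hidden units
V = np.linspace(-1, 1, 400).reshape(-1, d)    # points in the hypercube I_d
f = np.sin(3 * np.pi * V[:, 0])               # target continuous function

X = rng.standard_normal((d, m)) * 5.0         # hidden-layer weights (fixed at random)
B = rng.uniform(-5, 5, m)                     # biases
H = np.tanh(V @ X + B)                        # psi(VX + B) with psi = tanh
W, *_ = np.linalg.lstsq(H, f, rcond=None)     # fit the outer weights W

err = np.max(np.abs(H @ W - f))
print(f"max approximation error: {err:.3f}")  # shrinks as m grows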


ESTIMATION ERROR

• The estimation error of a function $f$ by a neural network scales as [Barron, 1994]

$O\!\left(\frac{C_f}{N}\right) + O\!\left(\frac{N d}{L}\log L\right)$

where $C_f$ measures the smoothness of the approximated function, $N$ is the number of neurons in the DNN, $L$ is the number of training examples and $d$ is the input dimension.

DEPTH OF THE NETWORK

• Depth allows representing functions that a shallow restricted Boltzmann machine would need an exponential number of parameters to represent [Montúfar & Morton, 2015].

• Each DNN layer with ReLU divides the space by a hyper-plane, folding one part of it.

• Thus, the depth of the network folds the space into an exponential number of sets relative to the number of parameters [Montúfar, Pascanu, Cho & Bengio, 2014].

DEPTH EFFICIENCY OF CNN

• A function realized by a CNN of polynomial size, with ReLU and max-pooling, may require a super-polynomial size to be approximated by a shallow network [Telgarsky, 2016], [Cohen et al., 2016].

• The standard convolutional network design has a learning bias towards the statistics of natural images [Cohen et al., 2016].

ROLE OF POOLING

• The pooling stage provides shift invariance [Boureau et al. 2010], [Bruna, LeCun & Szlam, 2013].

• A connection is drawn between the pooling stage and phase retrieval methods [Bruna, Szlam & LeCun, 2014].

• This allows calculating Lipschitz constants of each DNN layer $\psi(\cdot\,X)$ and empirically recovering the input of a layer from its output.

• However, the Lipschitz constants calculated are very loose and no theoretical guarantees are given for the recovery.

SUFFICIENT STATISTIC AND INVARIANCE

• Given a certain task at hand:

• Minimal sufficient statistic guarantees that we can replace raw data with a representation with smallest complexity and no performance loss.

• Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data.

• CNNs are shown to have these properties for many tasks [Soatto & Chiuso, 2016].

• Good structures of deep networks can generate representations that are good for learning with a small number of examples [Anselmi et al., 2016].


SCATTERING TRANSFORMS

• Scattering transform - a cascade of wavelet transform convolutions with nonlinear modulus and averaging operators.

• Scattering coefficients are stable encodings of geometry and texture [Bruna & Mallat, 2013]

(Figure: an original image with $d$ pixels; recovery from the first scattering moments, $O(\log d)$ coefficients; recovery from the 1st & 2nd scattering moments, $O(\log^2 d)$ coefficients. Images from slides of Joan Bruna, ICCV 2015 tutorial.)

SCATTERING TRANSFORMS AND DNN

• More layers create features that can be made invariant to increasingly more complex deformations.

• Deep layers in DNN encode complex, class-specific geometry.

• Deeper architectures are able to better capture invariant properties of objects and scenes in images [Bruna & Mallat, 2013], [Wiatowski & Bölcskei, 2016].

SCATTERING TRANSFORMS AS A METRIC

• Scattering transforms may be used as a metric.

• Inverse problems can be solved by minimizing the distance in the scattering transform domain.

• This leads to remarkable results in super-resolution [Bruna, Sprechmann & LeCun, 2016].

SCATTERING SUPER RESOLUTION

(Figure: visual comparison of the original image, the best linear estimate, a state-of-the-art method and the scattering estimate. Images from slides of Joan Bruna, CVPR 2016 tutorial.)

[Bruna, Sprechmann & LeCun, 2016]

MINIMIZATION

• The local minima in deep networks are not far from the global minimum.

• Saddle points are the main problem of deep learning optimization.

• Deeper networks have more local minima but fewer saddle points [Saxe, McClelland & Ganguli, 2014], [Dauphin, Pascanu, Gulcehre, Cho, Ganguli & Bengio, 2014], [Choromanska, Henaff, Mathieu, Ben Arous & LeCun, 2015].

[Choromanska et al., 2015]


GLOBAL OPTIMALITY IN DEEP LEARNING

• Deep learning is a positively homogeneous factorization problem, i.e., there exists $p \ge 0$ such that for all $\alpha \ge 0$ the DNN obeys

$\Phi(\alpha X_1, \alpha X_2, \dots, \alpha X_K) = \alpha^{p}\,\Phi(X_1, X_2, \dots, X_K).$

• With proper regularization, local minima are global.

• If the network is large enough, global minima can be found by local descent.

(Figure: guarantees of the proposed framework.)

[Haeffele & Vidal, 2015]

OUR THEORY

• DNN Classification is affected by the angles in the data [Giryes et al. 2016].

• Generalization error of neural network [Sokolic, Giryes, Sapiro & Rodrigues, 2017].

• Relationship between invariance and generalization in deep learning [Sokolic, Giryes, Sapiro & Rodrigues, 2017].

• Solving optimization problems with neural networks [Giryes, Eldar, Bronstein & Sapiro, 2018].

• Robustness to adversarial examples [Jakubovitz & Giryes, 2018].

• Robustness to label noise [Drory, Avidan & Giryes, 2019].

• Observation of k-NN behavior in neural networks to explain the coexistence of memorization and generalization [Cohen, Sapiro & Giryes, 2018].

• Relationship between ETF and dropout [Bank & Giryes, 2019].

• Lautum information for transfer learning [Jakubovitz & Giryes, 2018].


Outline

• Generalization error depends on the DNN input margin

• DNN may solve optimization problems

• Robustness of neural networks to adversarial attacks

Generalization Error

• Generalization error depends on the DNN input margin

• DNN may solve optimization problems

• Robustness of neural networks to adversarial attacks

GENERALIZATION ERROR SURVEY


ASSUMPTIONS

• The network $\Phi(W_1, W_2, \dots, W_K)$ consists of layers $W_1, \dots, W_K$ with a general non-linearity $\psi$ (ReLU, pooling, …), followed by a softmax / linear classifier.

• Two classes, separated in the feature space by the hyperplane $w^T \Phi(W_1, W_2, \dots, W_K) = 0$.

• The input data lies in a set $\Upsilon$.

(Diagram: the mapping from the input space to the feature space.)

GENERALIZATION ERROR (GE)

• In training, we reduce the classification error $\ell_{\mathrm{training}}$ on the training data as the number of training examples $L$ increases.

• However, we are interested in reducing the error $\ell_{\mathrm{test}}$ on the (unknown) testing data as $L$ increases.

• The difference between the two is the generalization error

$\mathrm{GE} = \ell_{\mathrm{test}} - \ell_{\mathrm{training}}$

• It is important to understand the GE of DNNs.

ESTIMATION ERROR

• The estimation error of a function $f$ by a neural network scales as [Barron, 1994]

$O\!\left(\frac{C_f}{N}\right) + O\!\left(\frac{N d}{L}\log L\right)$

where $C_f$ measures the smoothness of the approximated function, $N$ is the number of neurons in the DNN, $L$ is the number of training examples and $d$ is the input dimension.

REGULARIZATION TECHNIQUES

• Weight decay – penalizing DNN weights [Krogh & Hertz, 1992].

• Dropout - randomly drop units (along with their connections) from the neural network during training [Hinton et al., 2012], [Baldi & Sadowski, 2013], [Srivastava et al., 2014].

• DropConnect – dropout extension [Wan et al., 2013]

• Batch normalization [Ioffe & Szegedy, 2015].

• Stochastic gradient descent (SGD) [Hardt, Recht & Singer, 2016].

• Path-SGD [Neyshabur et al., 2015].

• And more [Rifai et al., 2011], [Salimans & Kingma, 2016], [Sun et al, 2016].
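As a hedged illustration of the first two techniques in the list above, a minimal PyTorch sketch; the architecture, dropout rate and weight-decay coefficient are arbitrary choices:

import torch
import torch.nn as nn

# Small MLP with dropout between layers
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
# Weight decay = L2 penalty on the DNN weights, handled by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()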


A SAMPLE OF GE BOUNDS

• Using the VC dimension it can be shown that

$\mathrm{GE} \le O\!\left(\frac{\#\text{DNN params} \cdot \log L}{L}\right)$

[Shalev-Shwartz and Ben-David, 2014].

• The GE was also bounded by the DNN weights:

$\mathrm{GE} \le \frac{2^{K}}{\sqrt{L}}\,\|w\|_2 \prod_i \|X_i\|_{2,2}$

[Neyshabur et al., 2015].

($L$ is the number of training samples.)

A SAMPLE OF GE BOUNDS

• Using the VC dimension it can be shown that

$\mathrm{GE} \le O\!\left(\frac{\#\text{DNN params} \cdot \log L}{L}\right)$

[Shalev-Shwartz and Ben-David, 2014].

• The GE was also bounded by the DNN weights:

$\mathrm{GE} \le \frac{2^{K-1}}{\sqrt{L}}\,\|w\|_2 \prod_i \|X_i\|_F$

[Neyshabur et al., 2015].

• In [Golowich et al., 2018] a Rademacher complexity (RC) bound was provided that is independent of the network size.

RETHINKING GENERALIZATION

• Networks with the same architecture may generalize well with structured data but overfit if the data is given with random labels [Zhang et al., 2017].

• This phenomenon is affected by explicit regularization.

• This shows that taking into account only the network structure for bounding the generalization error is misleading.

• We need to seek an alternative to the Rademacher complexity and VC-dimension based bounds.

COMPRESSION APPROACH

• Weights in neural networks are very redundant.

• One may compress the network and still get approximately the same performance.

• One may calculate the RC- or VC-dimension-based bounds using the number of neurons in the compressed network.

• This leads to significantly tighter GE bounds [Neyshabur et al., 2018].

OPTIMIZATION AND GENERALIZATION

• In [Hardt et al., 2016] it is shown that SGD serves as a regularizer in the training of neural networks.

• In [Brutzkus et al., 2018] it is proven formally that a two-layer neural network trained with SGD (under some assumptions) converges to the global minimum and generalizes well.

• Training using the softmax is shown to lead to a large margin in linear networks [Soudry et al., 2018].

DNN INPUT MARGIN

• Theorem 6: If for every training sample $X_i$ the input margin satisfies $\gamma_{in}(X_i) > \gamma$, then

$\mathrm{GE} \le \sqrt{\frac{N_{\gamma/2}(\Upsilon)}{L}}$

• $N_{\gamma/2}(\Upsilon)$ is the covering number of the data set $\Upsilon$ at radius $\gamma/2$.

• $N_{\gamma/2}(\Upsilon)$ gets smaller as $\gamma$ gets larger.

• The bound is independent of depth.

• Our theory relies on the robustness framework [Xu & Mannor, 2012].

[Sokolic, Giryes, Sapiro, Rodrigues, 2017]


INPUT MARGIN BOUND

• Maximizing the input margin directly is hard.

• Our strategy: relate the input margin to the output margin $\gamma_{out}(X_i)$ and other DNN properties.

• Theorem 7:

$\gamma_{in}(X_i) \ \ge\ \frac{\gamma_{out}(X_i)}{\sup_{X \in \Upsilon} \|J(X)\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_F}$

[Sokolic, Giryes, Sapiro, Rodrigues, 2017]
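A hedged PyTorch sketch of the right-most lower bound of Theorem 7, using the product of the layers' spectral norms (the toy architecture and the two-class output-margin definition are illustrative assumptions, and the input is assumed to be correctly classified):

import torch
import torch.nn as nn

layers = [nn.Linear(20, 50), nn.Linear(50, 50), nn.Linear(50, 2)]
model = nn.Sequential(layers[0], nn.ReLU(), layers[1], nn.ReLU(), layers[2])

def input_margin_lower_bound(x, y):
    # Output margin: score of the correct class minus the best competing class
    scores = model(x)[0]
    mask = torch.arange(scores.shape[0]) != y
    gamma_out = (scores[y] - scores[mask].max()).item()
    # Product of spectral norms of the layer weights (the last layer plays the role
    # of the classifier); this upper bounds the network's Jacobian spectral norm
    prod = 1.0
    for layer in layers:
        prod *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return gamma_out / prod    # meaningful when gamma_out > 0

x = torch.randn(1, 20)
print(input_margin_lower_bound(x, y=0))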


OUTPUT MARGIN

• Theorem 7:

$\gamma_{in}(X_i) \ \ge\ \frac{\gamma_{out}(X_i)}{\sup_{X \in \Upsilon} \|J(X)\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_F}$

• The output margin is easier to maximize – an SVM problem.

• It is maximized by many cost functions, e.g., the hinge loss.

GE AND WEIGHT DECAY

• Theorem 7:

$\gamma_{in}(X_i) \ \ge\ \frac{\gamma_{out}(X_i)}{\sup_{X \in \Upsilon} \|J(X)\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_F}$

• Bounding the weights increases the input margin.

• Weight decay regularization decreases the GE.

• Related to the regularization used by [Haeffele & Vidal, 2015].

JACOBIAN BASED REGULARIZATION

• Theorem 7:

$\gamma_{in}(X_i) \ \ge\ \frac{\gamma_{out}(X_i)}{\sup_{X \in \Upsilon} \|J(X)\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_2} \ \ge\ \frac{\gamma_{out}(X_i)}{\prod_{1 \le i \le K} \|W_i\|_F}$

• $J(X)$ is the Jacobian of the DNN at the point $X$.

• $J(\cdot)$ is piecewise constant.

• Using the Jacobian of the DNN leads to a better bound.

• This suggests a new regularization technique.

RESULTS

• Better performance with fewer training samples.

• CCE: the categorical cross entropy.

• WD: weight decay regularization.

• LM: Jacobian based regularization for large margin.

• Note that hinge loss generalizes better than CCE and that LM is better than WD as predicted by our theory.

MNIST Dataset

[Sokolic, Giryes, Sapiro, Rodrigues, 2017]


INVARIANCE

• Our theory also extends to the study of the relation between invariance in the data and invariance in the network.

• Invariance may improve generalization by a factor of $\sqrt{T}$, where $T$ is the number of transformations.

• We have also proposed a new strategy to enforce invariance in the network [Sokolic, Giryes, Sapiro, Rodrigues, 2017].

INVARIANCE SLICE

• Use transformations 𝑻𝟏, … , 𝑻𝑵 to transform the input [Dieleman et al., 2016]

• Average the features before the soft-max layer

(Diagram: the input $X$ is transformed by $T_1, \dots, T_N$; each transformed input passes through the same network (layers $W_1, \dots, W_K$ with non-linearity $\psi$ and biases $b_1, \dots, b_K$); the resulting features are combined by averaging/voting before the soft-max, producing $\phi(y)$.)

INVARIANCE BY REGULARIZATION

• Use transformations 𝑻𝟏, … , 𝑻𝑵 to transform the input [Sokolic et al., 2017]

• Force features to be similar

(Diagram: the input $X$ and its transformed versions $T_1 X, \dots, T_N X$ pass through the same network (layers $W_1, \dots, W_K$ with non-linearity $\psi$ and biases $b_1, \dots, b_K$); training minimizes the distance between the features $\phi(y)$ of the original and the transformed inputs.)
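A minimal PyTorch sketch of this regularization idea (the network, the toy transformation set and the penalty weight are assumptions for illustration):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

def transforms(x):
    # A toy transformation set T_1, ..., T_N: identity, horizontal flip, small shifts
    return [x, torch.flip(x, dims=[-1]), torch.roll(x, 1, dims=-1), torch.roll(x, -1, dims=-1)]

def invariance_penalty(x):
    feats = [model(t) for t in transforms(x)]
    ref = feats[0]
    # Force the features of transformed inputs to stay close to the original's
    return sum(((f - ref) ** 2).sum(dim=1).mean() for f in feats[1:])

x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y) + 0.1 * invariance_penalty(x)
loss.backward()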


INVARIANCE

• Designing invariant DNNs reduces the GE.

[Sokolić, Giryes, Sapiro & Rodrigues, 2017]


Adversarial Examples

• Generalization error depends on the DNN input margin

• DNN may solve optimization problems

• Robustness of neural networks to adversarial attacks

ADVERSARIAL ATTACKS

Deep neural networks can easily be fooled by small perturbations in the input, commonly referred to as Adversarial Attacks.

The problem of “no common sense” – a network can perform its task well (e.g. image classification), but can easily be fooled in a way a human cannot.

Adversarial attacks are deliberate input perturbations – random noise is not as likely to fool the network.

ADVERSARIAL EXAMPLES

*These examples were generated using the DeepFool attack on ResNet-34, ImageNet classification.


ADVERSARIAL ATTACKS

Adversarial examples are highly transferable – an attack that successfully fooled one network is very likely to fool another network as well.

Very little knowledge of the network’s architecture is necessary to attack it, i.e. grey-box attacks are very efficient as well.

Several explanations to this phenomenon have been suggested:

• Adversarial examples finely tile the space like the rational numbers amongst the reals: they are common but occur only at very precise locations (“pockets”).

• Positively curved decision boundaries are more susceptible to universal adversarial perturbations.


DEFENSE AND ATTACK METHODS

Several attack and defense methods have been proposed to counter this problem.

Defense methods aim at either increasing the robustness to attacks, or detecting that an attack has been performed.

Attack methods:

• FGSM (Fast Gradient Sign Method)

• JSMA (Jacobian-based Saliency Map Approach)

• DeepFool

• Carlini & Wagner
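For concreteness, a minimal PyTorch sketch of the first attack in the list, FGSM, following the standard formulation x_adv = x + ε · sign(∇_x loss); the placeholder classifier and the value of ε are assumptions:

import torch
import torch.nn as nn

def fgsm(model, x, y, eps):
    # Fast Gradient Sign Method: one signed gradient step in the input space
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder classifier
x = torch.rand(8, 1, 28, 28)                                  # inputs in [0, 1]
y = torch.randint(0, 10, (8,))
x_adv = fgsm(model, x, y, eps=0.1)
print((x_adv - x).abs().max())                                # perturbation bounded by eps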


DEFENSE AND ATTACK METHODS

Defense methods:

• Adversarial training

• Network Distillation

• Input Gradient regularization

• Cross-Lipschitz regularization

• Jacobian regularization


JACOBIAN REGULARIZATION

We proposed a novel approach to enhance a network’s robustness to adversarial attacks – Jacobian regularization.

A network's Jacobian matrix for $z^{(L)}$ - the output of the network's last fully connected layer before the softmax (i.e. the logits) - where $D$ is the input dimension and $K$ is the output dimension, is

$J(x_i) \triangleq J^{(L)}(x_i) = \begin{pmatrix} \frac{\partial z_1^{(L)}(x_i)}{\partial x_{(1)}} & \cdots & \frac{\partial z_1^{(L)}(x_i)}{\partial x_{(D)}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_K^{(L)}(x_i)}{\partial x_{(1)}} & \cdots & \frac{\partial z_K^{(L)}(x_i)}{\partial x_{(D)}} \end{pmatrix} \in \mathbb{R}^{K \times D}$

JACOBIAN REGULARIZATION

We regularize the Frobenius norm of the network's Jacobian:

$\|J(x_i)\|_F^2 = \sum_{d=1}^{D} \sum_{k=1}^{K} \left(\frac{\partial z_k^{(L)}(x_i)}{\partial x_{(d)}}\right)^{2} = \sum_{k=1}^{K} \big\|\nabla_x z_k^{(L)}(x_i)\big\|_2^2$

We perform post-processing training, i.e. our regularization is applied to already-trained networks:

• Reducing the computational overhead (the Jacobian computation requires an additional back-propagation step).

• Allowing the usage of pre-existing networks.

[Sokolić, Giryes, Sapiro & Rodrigues, 2017]
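A hedged PyTorch sketch of such a Jacobian regularizer, computing ‖J(x)‖_F² by back-propagating each logit (the toy model and the regularization weight are assumptions; looping over all K logits is the naive approach and may be approximated in practice):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))   # toy classifier

def jacobian_frobenius_sq(x):
    # ||J(x)||_F^2 = sum_k ||grad_x z_k(x)||_2^2, averaged over the batch
    x = x.clone().detach().requires_grad_(True)
    z = model(x)                                   # logits, shape (B, K)
    total = 0.0
    for k in range(z.shape[1]):
        grad_k, = torch.autograd.grad(z[:, k].sum(), x, create_graph=True)
        total = total + (grad_k ** 2).sum(dim=1)   # per-sample squared row norm
    return total.mean()

x = torch.randn(16, 20)
y = torch.randint(0, 5, (16,))
loss = nn.functional.cross_entropy(model(x), y) + 0.01 * jacobian_frobenius_sq(x)
loss.backward()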


WHY DOES IT WORK?

Adversarial examples are essentially cases in which similar network inputs result in very different network outputs. Regularizing the Jacobian constrains this behavior: a smaller Jacobian Frobenius norm means a smoother classification function.

Let $[x, x_{pert}]$ be the $D$-dimensional line segment in the input space connecting $x$ and $x_{pert}$. According to the mean value theorem there exists some $x' \in [x, x_{pert}]$ such that

$\frac{\big\|z^{(L)}(x_{pert}) - z^{(L)}(x)\big\|_2^2}{\|x_{pert} - x\|_2^2} \ \le\ \sum_{k=1}^{K} \big\|\nabla_x z_k^{(L)}(x')\big\|_2^2 \ =\ \|J(x')\|_F^2$

WHY DOES IT WORK?

Empirical motivation – the average Jacobian Frobenius norm of perturbed images is larger:

Defense method                              (1/N) Σ ‖J(x_i)‖_F    (1/N) Σ ‖J(x_i^pert)‖_F
No defense                                  0.14                  0.1877
Adversarial training                        0.141                 0.143
Jacobian regularization                     0.0315                0.055
Jacobian reg. & Adversarial training        0.0301                0.0545

[Jakubovitz & Giryes, 2018]


WHY DOES IT WORK?

Generally, an attack method seeks the closest decision boundary to cross in order to cause a misclassification, such that the perturbation of the input is minimal.

The distance $d^*$ is a first-order approximation of the distance between an input $x$ and the closest decision boundary. Its relation to the network's Jacobian matrix ($k^*$ is the original class of the input $x$ and $k = 1, \dots, K$ is the class index) is

$d^* \ \ge\ \frac{1}{\sqrt{2}\,\|J^{(L)}(x)\|_F}\;\min_{k \neq k^*}\left( z_{k^*}^{(L)}(x) - z_k^{(L)}(x) \right)$

Encouraging a smaller Frobenius norm of the network's Jacobian therefore encourages a larger minimal distance between the original input and a perturbed input that would cause a misclassification.

[Jakubovitz & Giryes, 2018]


WHY DOES IT WORK?

It has been shown that positively curved decision boundaries create an enhanced vulnerability to adversarial examples:

*Illustration taken from “Analysis of universal adversarial perturbations”, Moosavi-Dezfooli et al.

Jacobian regularization encourages the network’s learned decision boundaries to be less positively curved.


WHY DOES IT WORK?

The decision boundary separating the classes $k_1$ and $k_2$ is the hyper-surface $F_{k_1,k_2}(x) = z_{k_1}^{(L)}(x) - z_{k_2}^{(L)}(x) = 0$.

Using the approximation $H_k(x) = \frac{\partial^2 z_k^{(L)}(x)}{\partial x^2} \approx J_k(x)^T J_k(x)$ (outer product of gradients, where $J_k(x)$ is the $k$-th row of the matrix $J(x)$), we get that the curvature of the decision boundary $F_{k_1,k_2}(x) = 0$ at the input point $x$ can be upper bounded by

$x^T \left( H_{k_1} - H_{k_2} \right) x \ \le\ \sum_{k=1}^{K} \left( J_k(x)\,x \right)^2 \ \le\ \|J(x)\|_F^2\,\|x\|_2^2$

For this reason, Jacobian regularization promotes a less positive curvature of the decision boundaries in the neighborhood of the input samples.

EXPERIMENTAL RESULTS

We examined the performance of our method under different attack methods, and compared them to 3 other prominent defense methods – Input Gradient regularization, Cross-Lipschitz regularization and adversarial training.

Results under the DeepFool attack on CIFAR-10 ($\rho_{adv}$ is the average ratio between the norm of the minimal perturbation necessary to fool the network and the norm of the corresponding original input):

Defense method                                 Test accuracy    ρ_adv
No defense                                     88.79%           1.21 x 10^-2
Adversarial Training                           88.88%           1.23 x 10^-2
Input Gradient regularization                  88.56%           1.43 x 10^-2
Input Gradient reg. & Adversarial Training     88.49%           2.17 x 10^-2
Cross-Lipschitz regularization                 88.91%           2.08 x 10^-2
Cross-Lipschitz reg. & Adversarial Training    88.49%           4.04 x 10^-2
Jacobian regularization                        89.16%           3.42 x 10^-2
Jacobian reg. & Adversarial Training           88.49%           6.03 x 10^-2

[Jakubovitz & Giryes, 2018]


EXPERIMENTAL RESULTS - FGSM

Results under the FGSM attack on CIFAR-10 (test accuracy for a test set of adversarial examples; ε is the attack magnitude):

(Figure: test accuracy as a function of the attack magnitude ε for Cross-Lipschitz regularization, Jacobian regularization, no defense, and Input Gradient regularization, each with and without adversarial training.)

[Jakubovitz & Giryes, 2018]


EXPERIMENTAL RESULTS - JSMA

Results under the JSMA attack on CIFAR-10 (test accuracy for a test set of adversarial examples; ε is the attack magnitude):

(Figure: test accuracy as a function of the attack magnitude ε for the same defense configurations as in the FGSM experiment.)

[Jakubovitz & Giryes, 2018]


Optimization by DNN

• Generalization error depends on the DNN input margin

• DNN may solve optimization problems

• Robustness of neural networks to adversarial attacks

INVERSE PROBLEMS

• We are given $X = AZ + E$, where $X$ is the given set of measurements, $A$ is a linear operator, $Z$ is the unknown signal and $E$ is noise. $Z$ resides in a low-dimensional set $\Upsilon$.

• Standard technique for recovery:

$\min_{Z} \|X - AZ\|_2 \quad \text{s.t.} \quad Z \in \Upsilon$

• Unconstrained form ($f$ is a penalty function and $\lambda$ is a regularization parameter):

$\min_{Z} \|X - AZ\|_2^2 + \lambda f(Z)$

ℓ𝟏 MINIMIZATION CASE

• Unconstrained form:

$\min_{Z} \|X - AZ\|_2^2 + \lambda \|Z\|_1$

• Can be solved by a proximal gradient method, e.g., the iterative shrinkage and thresholding algorithm (ISTA):

$Z_{t+1} = \psi_{\lambda\mu}\!\left( Z_t + \mu A^T (X - A Z_t) \right)$

where $\psi_{\lambda\mu}$ is the soft-thresholding operation with threshold $\lambda\mu$ and $\mu$ is the step size.
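A minimal NumPy sketch of ISTA for this ℓ1 problem (the problem sizes, sparsity level and the values of λ and μ are arbitrary illustration choices):

import numpy as np

def soft_threshold(v, thr):
    # psi_thr(v): shrink each entry towards zero by thr
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def ista(X, A, lam, n_iter=200):
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # step size 1 / ||A||^2
    Z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        Z = soft_threshold(Z + mu * A.T @ (X - A @ Z), lam * mu)
    return Z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200)) / np.sqrt(50)
Z_true = np.zeros(200); Z_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
X = A @ Z_true + 0.01 * rng.standard_normal(50)
Z_hat = ista(X, A, lam=0.05)
print(np.linalg.norm(Z_hat - Z_true))             # reconstruction error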


ISTA CONVERGENCE

• Reconstruction mean squared error (MSE) as a function of the number of iterations

(Figure: $E\|Z - Z_t\|$ as a function of the iteration number $t$.)

LISTA

• ISTA:

$Z_{t+1} = \psi_{\lambda\mu}\!\left( Z_t + \mu A^T (X - A Z_t) \right)$

• Rewriting ISTA:

$Z_{t+1} = \psi_{\lambda\mu}\!\left( (I - \mu A^T A) Z_t + \mu A^T X \right)$

• Learned ISTA (LISTA):

$Z_{t+1} = \psi_{\lambda}\!\left( W Z_t + S X \right)$

where $W$ and $S$ are learned operators.
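A hedged PyTorch sketch of a LISTA network that unrolls a fixed number of such iterations with learned W, S and threshold (the initialization from A, the number of unrolled iterations and the training setup are assumptions):

import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, A, n_iter=10, lam=0.05):
        super().__init__()
        mu = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2
        self.S = nn.Parameter(mu * A.T.clone())                        # init: mu * A^T
        self.W = nn.Parameter(torch.eye(A.shape[1]) - mu * A.T @ A)    # init: I - mu * A^T A
        self.thr = nn.Parameter(lam * mu * torch.ones(1))              # learned threshold
        self.n_iter = n_iter

    def forward(self, X):
        Z = X.new_zeros(X.shape[0], self.W.shape[0])
        for _ in range(self.n_iter):
            V = Z @ self.W.T + X @ self.S.T
            Z = torch.sign(V) * torch.clamp(V.abs() - self.thr, min=0.0)  # soft thresholding
        return Z

A = torch.randn(50, 200) / 50 ** 0.5
net = LISTA(A)
X = torch.randn(8, 50)
Z_hat = net(X)             # trainable end-to-end with a reconstruction or supervised loss
print(Z_hat.shape)         # -> torch.Size([8, 200])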


LISTA CONVERGENCE

• Replacing $I - \mu A^T A$ and $\mu A^T$ in ISTA with the learned $W$ and $S$ improves convergence [Gregor & LeCun, 2010].

• Extensions to other models [Sprechmann, Bronstein & Sapiro, 2015], [Remez, Litani & Bronstein, 2015], [Tompson, Schlachter, Sprechmann & Perlin, 2016].

(Figure: reconstruction error $E\|Z - Z_t\|$ as a function of the iteration number $t$ for ISTA and LISTA.)

LISTA AS A NEURAL NETWORK

(Diagram: the measurements $X = AZ + E \in \mathbb{R}^d$ enter through the learned linear operator $S$; the recurrent branch applies the learned linear operator $W$ and the soft-thresholding operation $\psi_{\lambda}$, producing an estimate $\hat{Z}$ of $Z \in \Upsilon$ [Gregor & LeCun, 2010].)

ISTA

Iterative soft-thresholding algorithm (ISTA) [Daubechies, Defrise & Mol, 2004], [Beck & Teboulle, 2009]:

(Diagram: the measurements $X = AZ + E \in \mathbb{R}^d$ enter through $\mu A^T$; the recurrent branch applies $I - \mu A^T A$ followed by the soft-thresholding operation $\psi_{\lambda\mu}$. The output is a minimizer of $\min_{\hat{Z}} \|X - A\hat{Z}\|_2^2 + \lambda \|\hat{Z}\|_1$. The step size $\mu$ obeys $1/\mu \ge \|A\|_2^2$.)

PROJECTED GRADIENT DESCENT (PGD)

(Diagram: as in ISTA, the measurements $X = AZ + E \in \mathbb{R}^d$ enter through $\mu A^T$ and the recurrent branch applies $I - \mu A^T A$, but here $\psi$ projects onto the set $\Upsilon = \{\hat{Z} : f(\hat{Z}) \le R\}$; $\mu$ is the step size.)

The output is an estimate of $Z$, aiming at solving

$\min_{\hat{Z}} \|X - A\hat{Z}\| \quad \text{s.t.} \quad f(\hat{Z}) \le R$
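A NumPy sketch of PGD where the projection ψ is taken as hard thresholding onto the set of k-sparse vectors (this particular choice of Υ, the value of k and the problem sizes are illustrative assumptions):

import numpy as np

def project_k_sparse(z, k):
    # Projection onto Upsilon = {z : ||z||_0 <= k}: keep the k largest-magnitude entries
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def pgd(X, A, k, n_iter=100):
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # step size
    Z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        Z = project_k_sparse(Z + mu * A.T @ (X - A @ Z), k)
    return Z

rng = np.random.default_rng(1)
A = rng.standard_normal((80, 300)) / np.sqrt(80)
Z_true = np.zeros(300); Z_true[rng.choice(300, 8, replace=False)] = rng.standard_normal(8)
X = A @ Z_true
print(np.linalg.norm(pgd(X, A, k=8) - Z_true))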


THEORY FOR PGD

• Theorem 8: Let $Z \in \mathbb{R}^d$, $f:\mathbb{R}^d \to \mathbb{R}$ a proper function with $f(Z) \le R$, $C_f(Z)$ the tangent cone of $f$ at the point $Z$, $A \in \mathbb{R}^{m \times d}$ a random Gaussian matrix and $X = AZ + E$. Then the estimate of PGD at iteration $t$, $Z_t$, obeys

$\|Z_t - Z\| \le \kappa_f\,\rho^{t}\,\|Z\|,$

where $\rho = \sup_{U,V \in C_f(Z) \cap \mathcal{B}^d} U^T (I - \mu A^T A) V$ and $\kappa_f = 1$ if $f$ is convex and $\kappa_f = 2$ otherwise. [Oymak, Recht & Soltanolkotabi, 2016]

PGD CONVERGENCE RATE

• $\rho = \sup_{U,V \in C_f(Z) \cap \mathcal{B}^d} U^T (I - \mu A^T A) V$ is the convergence rate of PGD.

• Let $\omega$ be the Gaussian mean width of $C_f(Z) \cap \mathcal{B}^d$.

• If $\mu = \frac{1}{(\sqrt{m} + \sqrt{d})^2} \simeq \frac{1}{d}$ then $\rho = 1 - O\!\left(\frac{\sqrt{m} - \omega}{\sqrt{m} + \sqrt{d}}\right)$.

• If $\mu = \frac{1}{m}$ then $\rho = O\!\left(\frac{\omega}{\sqrt{m}}\right)$.

• For the $k$-sparse model, $\omega^2 = O(k \log d)$.

• For a GMM with $k$ Gaussians, $\omega^2 = O(k)$.

• How may we cause $\omega$ to become smaller in order to get a better convergence rate?

INACCURATE PROJECTION

• The PGD iterations project onto $\Upsilon = \{Z : f(Z) \le R\}$.

• A smaller $\Upsilon$ implies a smaller $\omega$, hence faster convergence, as $\rho = 1 - O\!\left(\frac{\sqrt{m} - \omega}{\sqrt{m} + \sqrt{d}}\right)$ or $O\!\left(\frac{\omega}{\sqrt{m}}\right)$.

• Let us assume that our signal belongs to a smaller set $\tilde{\Upsilon} = \{Z : \tilde{f}(Z) \le \tilde{R}\}$ with $\tilde{\omega} \ll \omega$.

• Ideally, we would like to project onto $\tilde{\Upsilon}$ instead of $\Upsilon$.

• This would lead to faster convergence.

• What if such a projection is not feasible?

INACCURATE PROJECTION

• We will estimate the projection onto $\tilde{\Upsilon}$ by:
  • a linear projection $P$,
  • followed by a projection onto $\Upsilon$.

• Assumption: $\|\wp_{\Upsilon}(PZ) - Z\| \le \epsilon$, i.e., projecting the target vector $Z$ onto $P$ and then onto $\Upsilon$ almost recovers $Z$.

INACCURATE PGD (IPGD)

(Diagram: as in PGD, but the input passes through $\mu P A^T$ and the recurrent branch applies $P(I - \mu A^T A)$; $\psi$ projects onto the set $\Upsilon = \{\hat{Z} : f(\hat{Z}) \le R\}$ and $\mu$ is the step size.)

The output is an estimate of $Z$, aiming at solving

$\min_{\hat{Z}} \|X - A\hat{Z}\| \quad \text{s.t.} \quad f(\hat{Z}) \le R$

THEORY FOR IPGD

• Theorem 9: Let $Z \in \mathbb{R}^d$, $f:\mathbb{R}^d \to \mathbb{R}$ a proper convex* function with $f(Z) \le R$, $C_f(Z)$ the tangent cone of $f$ at the point $Z$, $A \in \mathbb{R}^{m \times d}$ a random Gaussian matrix and $X = AZ + E$. Then the estimate of IPGD at iteration $t$, $Z_t$, obeys

$\|Z_t - Z\| \ \le\ \left( \rho_P^{\,t} + \frac{1 - \rho_P^{\,t}}{1 - \rho_P}\,\tilde{\epsilon} \right) \|Z\|,$

where $\rho_P = \sup_{U,V \in C_f(Z) \cap \mathcal{B}^d} U^T P (I - \mu A^T A) P V$ and $\tilde{\epsilon} = (2 + \rho_P)\epsilon$. [Giryes, Eldar, Bronstein & Sapiro, 2016]

*We have a version of this theorem also when $f$ is a non-proper or non-convex function.

CONVERGENCE RATE COMPARISON

• PGD convergence: $\rho^{t}$.

• IPGD convergence:

$\rho_P^{\,t} + \frac{1 - \rho_P^{\,t}}{1 - \rho_P}(2 + \rho_P)\epsilon \ \overset{(a)}{\simeq}\ \rho_P^{\,t} + \epsilon \ \overset{(b)}{\simeq}\ \rho_P^{\,t} \ \overset{(c)}{\ll}\ \rho^{\,t}$

(a) $\epsilon$ is negligible compared to $\rho_P$. (b) For small values of $t$ (early iterations). (c) Faster convergence as $\rho_P \ll \rho$ (because $\omega_P \ll \omega$).

MODEL BASED COMPRESSED SENSING

• $\tilde{\Upsilon}$ is the set of sparse vectors with sparsity patterns that obey a tree structure.

• Projecting onto $\tilde{\Upsilon}$ improves the convergence rate compared to projecting onto the set of sparse vectors $\Upsilon$ [Baraniuk et al., 2010].

• The projection onto $\tilde{\Upsilon}$ is more demanding than the projection onto $\Upsilon$.

• Note that the probability of selecting atoms from the lower tree levels is smaller than from the upper ones.

• $P$ will be a projection onto certain tree levels – zeroing the values at the lower levels.

(Figure: a tree of atoms with selection probability 1 at the root level, 0.5 at the second level and 0.25 at the third level.)

MODEL BASED COMPRESSED SENSING

The non-zero entries are drawn from a zero-mean Gaussian distribution with variance:
• 1 at the first two levels,
• $0.5^2$ at the third level,
• $0.2^2$ at the rest of the levels.

(Figure: $\|Z - Z_t\|$ as a function of the iteration $t$.)

SPECTRAL COMPRESSED SENSING

• $\Upsilon$ is the set of vectors with a sparse representation in a 2-times redundant DCT dictionary such that:

• The active atoms are selected uniformly at random such that the minimum distance between neighboring atoms is 20.

• The value of each representation coefficient is $\sim N(0,1)$ i.i.d.

• We set the neighboring coefficients at distance 2 from each active atom to be $\sim N(0, 0.01^2)$.

• We set 𝑃 to be a pooling-like operation that keeps in each window of size 3 only the largest value.


SPECTRAL COMPRESSED SENSING

(Figure: $\|Z - Z_t\|$ as a function of the iteration $t$.)

SPECTRAL COMPRESSED SENSING

• Υ is the set of vectors with sparse representation in a 4-times redundant DCT dictionary such that:

• The active atoms are selected uniformly at random such that the minimum distance between neighboring atoms is 5.

• The value of each representation coefficient is $\sim N(0,1)$ i.i.d.

• We set the neighboring coefficients at distances 1 and 2 from each active atom to be $\sim N(0, 0.05^2)$.

• We set 𝑃 to be a pooling-like operation that keeps in each window of size 5 only the largest value.


SPECTRAL COMPRESSED SENSING

(Figure: $\|Z - Z_t\|$ as a function of the iteration $t$.)

LEARNING THE PROJECTION

• If we have no explicit information about Υ it might be desirable to learn the projection.

• Instead of learning $P$, it is possible to replace $P(I - \mu A^T A)$ and $\mu P A^T$ with two learned matrices $W$ and $S$, respectively.

• This leads to a scheme very similar to that of LISTA and provides a theoretical foundation for the success of LISTA.


LEARNED IPGD

(Diagram: as in LISTA, the measurements $X = AZ + E \in \mathbb{R}^d$ enter through the learned linear operator $S$ and the recurrent branch applies the learned linear operator $W$; $\psi$ projects onto the set $\Upsilon = \{\hat{Z} : f(\hat{Z}) \le R\}$.)

The output is an estimate of $Z$, aiming at solving

$\min_{\hat{Z}} \|X - A\hat{Z}\| \quad \text{s.t.} \quad f(\hat{Z}) \le R$

SUPER RESOLUTION

• A popular super-resolution technique uses a pair of low-res and high-res dictionaries [Zeyde et al. 2012]

• The original work uses OMP with sparsity 3 to decode the representation of the patches in the low-res image.

• Then the representation is used to reconstruct the patches of the high-res image.

• We replace OMP with learned IPGD (LIPGD) with 3 levels but a higher target sparsity.

• This leads to better reconstruction results (with up to 0.5 dB improvement).

LISTA

• $\psi$ is a proximal mapping:

$\psi(U) = \underset{\hat{Z} \in \mathbb{R}^d}{\operatorname{argmin}}\ \|U - \hat{Z}\|_2^2 + \lambda f(\hat{Z})$

(Diagram: the measurements $X = AZ + E \in \mathbb{R}^d$ enter through the learned linear operator $S$, the recurrent branch applies the learned linear operator $W$, and $\psi$ is the proximal mapping above. The output is an estimate of $Z$, aiming at solving $\min_{\hat{Z}} \|X - A\hat{Z}\|_2^2 + \lambda f(\hat{Z})$.)

LISTA MIXTURE MODEL

• Approximating the projection onto $\Upsilon$ with one linear projection may not be accurate enough.

• This requires more LISTA layers/iterations.

• Instead, one may use several LISTA networks, where each approximates a different part of $\Upsilon$.

• Training multiple LISTA networks accelerates the convergence further.

LISTA MIXTURE MODEL

(Figure: $\|X - A\hat{Z}_t\|_2^2 + \|\hat{Z}_t\|_1 - \|X - AZ\|_2^2 - \|Z\|_1$ as a function of the iteration.)

RELATED WORKS

• In [Bruna et al. 2017] it is shown that learning may give a gain due to a better preconditioning of $A$.

• In [Xin et al. 2016] a relation to the restricted isometry property (RIP) is drawn

• In [Borgerding & Schniter, 2016] a connection is drawn to approximate message passing (AMP).

• In [Chen et al., 2018] and [Liu et al., 2019] tied and analytical weights are studied showing exponential convergence under some conditions.

• All these works consider only the sparsity case


Take Home Message

• Generalization error depends on the DNN input margin

• DNN may solve optimization problems

• Robustness of neural networks to adversarial attacks

ACKNOWLEDGEMENTS

Guillermo Sapiro

Duke University

Alex M. Bronstein

Technion

Jure Sokolic

UCL

Miguel Rodrigues

UCL

Yonina C. Eldar

Weizmann


Daniel Jakubovitz

Tel Aviv University

QUESTIONS?

WEB.ENG.TAU.AC.IL/~RAJA


FULL REFERENCES 1

• A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal, vol. 3, no. 3, pp. 535–554, 1959.

• D. H. Hubel & T. N. Wiesel, “Receptive fields of single neurones in the cat's striate cortex”, J Physiol., vol. 148, no. 3, pp. 574-591, 1959.

• D. H. Hubel & T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex”, J Physiol., vol. 160, no. 1, pp. 106-154, 1962.

• K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”, Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.

• Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard & L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition”, Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.

• Y. LeCun, L. Bottou, Y. Bengio & P. Haffner, “Gradient Based Learning Applied to Document Recognition”, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

• C. Farabet, C. Couprie, L. Najman & Y. LeCun, “Learning Hierarchical Features for Scene Labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1915-1929, Aug. 2013.

• O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge”, International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015

• A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, NIPS, 2012.

• K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image recognition”, CVPR, 2016.


FULL REFERENCES 2

• M.D. Zeiler & R. Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV, 2014.

• D. Yu & L. Deng, “Automatic Speech Recognition: A Deep Learning Approach”, Springer, 2014.

• J. Bellegarda & C. Monz, “State of the art in statistical methods for language and speech processing,” Computer Speech and Language, vol. 35, pp. 163–184, Jan. 2016.

• C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke & A. Rabinovich, “Going Deeper with Convolutions”, CVPR, 2015.

• V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra & M. Riedmiller, “Playing Atari with Deep Reinforcement Learning”, NIPS deep learning workshop, 2013.

• V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg & D. Hassabis, “Human-level control through deep reinforcement learning”, Nature vol. 518, pp. 529–533, Feb. 2015.

• D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel & D. Hassabis, “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature, vol. 529, pp. 484–489, 2016.

• S. K. Zhou, H. Greenspan, D. Shen, “Deep Learning for Medical Image Analysis”, Academic Press, 2017.

• I. Sutskever, O. Vinyals & Q. Le, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014.

• A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks”, CVPR, 2014.


FULL REFERENCES 3

• F. Schroff, D. Kalenichenko & J. Philbin, “FaceNet: A Unified Embedding for Face Recognition and Clustering”, CVPR, 2015.

• A. Poznanski & L. Wolf, “CNN-N-Gram for Handwriting Word Recognition”, CVPR, 2016.

• R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng & C. Potts, “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”, EMNLP, 2013.

• H. C. Burger, C. J. Schuler & S. Harmeling, “Image denoising: Can plain Neural Networks compete with BM3D?”, CVPR, 2012.

• J. Kim, J. K. Lee, K. M. Lee, “Accurate Image Super-Resolution Using Very Deep Convolutional Networks”, CVPR, 2016.

• J. Bruna, P. Sprechmann, and Y. LeCun, “Super-Resolution with Deep Convolutional Sufficient Statistics”, ICLR, 2016.

• V. Nair & G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines”, ICML, 2010.

• L. Deng & D. Yu, “Deep Learning: Methods and Applications”, Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.

• Y. Bengio, “Learning Deep Architectures for AI”, Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

• Y. LeCun, Y. Bengio, & G. Hinton. Deep learning. Nature, vol. 521, no. 7553, pp. 436–444, 2015.

• J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural Networks, vol. 61, pp. 85–117, Jan. 2015.

• I. Goodfellow, Y. Bengio & A. Courville, “Deep Learning”, MIT Press, 2016.


FULL REFERENCES 4

• G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Control Signals Systems, vol. 2, pp. 303–314, 1989.

• K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Netw., vol. 4, no. 2, pp. 251–257, 1991.

• A. R. Barron, “Approximation and estimation bounds for artificial neural networks”, Machine Learning, vol. 14, no. 1, pp. 115–133, Jan. 1994.

• G. F. Montúfar & J. Morton, “When does a mixture of products contain a product of mixtures?”, SIAM Journal on Discrete Mathematics (SIDMA), vol. 29, no. 1, pp. 321–347, 2015.

• G. F. Montúfar, R. Pascanu, K. Cho & Y. Bengio, “On the number of linear regions of deep neural networks,” NIPS, 2014.

• N. Cohen, O. Sharir & A. Shashua, “Deep SimNets,” CVPR, 2016.

• N. Cohen, O. Sharir & A. Shashua, “On the Expressive Power of Deep Learning: A Tensor Analysis,” COLT, 2016.

• N. Cohen & A. Shashua, “Convolutional Rectifier Networks as Generalized Tensor Decompositions,” ICML, 2016.

• M. Telgarsky, “Benefits of depth in neural networks,” COLT, 2016.

• R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” COLT, 2016.

• N. Cohen and A. Shashua, “Inductive Bias of Deep Convolutional Networks through Pooling Geometry,” arXiv abs/1605.06743, 2016.

• J. Bruna, Y. LeCun, & A. Szlam, “Learning stable group invariant representations with convolutional networks,” ICLR, 2013.

• Y-L. Boureau, J. Ponce, Y. LeCun, “Theoretical Analysis of Feature Pooling in Visual Recognition”, ICML, 2010.


FULL REFERENCES 5

• J. Bruna, A. Szlam, & Y. LeCun, “Signal recovery from lp pooling representations”, ICML, 2014.

• S. Soatto & A. Chiuso, “Visual Representations: Defining properties and deep approximation”, ICLR 2016.

• F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, “Unsupervised learning of invariant representations in hierarchical architectures,” Theoretical Computer Science, vol. 663, no. C, pp. 112-121, Jun. 2016.

• J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1872–1886, Aug 2013.

• T. Wiatowski and H. Bölcskei, “A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction,” arXiv abs/1512.06293, 2016.

• A. Saxe, J. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, ICLR, 2014.

• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high dimensional non-convex optimization,” NIPS, 2014.

• A. Choromanska, M. B. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

• B. D. Haeffele and R. Vidal, “Global Optimality in Tensor Factorization, Deep Learning, and Beyond”, arXiv abs/1506.07540, 2015.

• S. Arora, A. Bhaskara, R. Ge, and T. Ma, “Provable bounds for learning some deep representations,” in Int. Conf. on Machine Learning (ICML), 2014, pp. 584–592.


FULL REFERENCES 6

• A. M. Bruckstein, D. L. Donoho, & M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images”, SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.

• G. Yu, G. Sapiro & S. Mallat, “Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity”, IEEE Trans. on Image Processing, vol. 21, no. 5, pp. 2481 –2499, May 2012.

• N. Srebro & A. Shraibman, “Rank, trace-norm and max-norm,” COLT, 2005.

• E. Candès & B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.

• R. G. Baraniuk, V. Cevher & M. B. Wakin, "Low-Dimensional Models for Dimensionality Reduction and Signal Recovery: A Geometric Perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, 2010.

• Y. Plan and R. Vershynin, “Dimension reduction by random hyperplane tessellations,” Discrete and Computational Geometry, vol. 51, no. 2, pp. 438–461, 2014.

• Y. Plan and R. Vershynin, “Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach,” IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 482–494, Jan. 2013.

• R. Giryes, G. Sapiro and A. M. Bronstein, “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?”, IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3444–3457, Jul. 2016.

• A. Choromanska, K. Choromanski, M. Bojarski, T. Jebara, S. Kumar, Y. LeCun, “Binary embeddings with structured hashed projections”, ICML, 2016.

• J. Masci, M. M. Bronstein, A. M. Bronstein and J. Schmidhuber, “Multimodal Similarity-Preserving Hashing”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 4, pp. 824-830, April 2014.


FULL REFERENCES 7

• H. Lai, Y. Pan, Y. Liu & S. Yan, “Simultaneous Feature Learning and Hash Coding With Deep Neural Networks”, CVPR, 2015.

• A. Mahendran & A. Vedaldi, “Understanding deep image representations by inverting them,” CVPR, 2015.

• K. Simonyan & A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, ICLR, 2015.

• A. Krogh & J. A. Hertz, “A Simple Weight Decay Can Improve Generalization”, NIPS, 1992.

• P. Baldi & P. Sadowski, “Understanding dropout”, NIPS, 2013.

• N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, & R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

• L. Wan, M. Zeiler, S. Zhang, Y. LeCun & R. Fergus, “Regularization of Neural Networks using DropConnect”, ICML, 2013.

• S. Ioffe & C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.

• M. Hardt, B. Recht & Y. Singer, “Train faster, generalize better: Stability of stochastic gradient descent”, ICML, 2016.

• B. Neyshabur, R. Salakhutdinov & N. Srebro, “Path-SGD: Path-normalized optimization in deep neural networks,” NIPS, 2015.

• S. Rifai, P. Vincent, X. Muller, X. Glorot, & Y. Bengio. “Contractive auto-encoders: explicit invariance during feature extraction,” ICML, 2011.


FULL REFERENCES 8

• T. Salimans & D. Kingma, “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks”, arXiv abs/1602.07868, 2016.

• S. Sun, W. Chen, L. Wang, & T.-Y. Liu, “Large margin deep neural networks: theory and algorithms”, AAAI, 2016.

• S. Shalev-Shwartz & S. Ben-David. “Understanding machine learning: from theory to algorithms”, Cambridge University Press, 2014.

• P. L. Bartlett & S. Mendelson, “Rademacher and Gaussian complexities: risk bounds and structural results”, The Journal of Machine Learning Research (JMLR), vol. 3, pp. 463–482, 2002.

• B. Neyshabur, R. Tomioka, and N. Srebro, “Norm-based capacity control in neural networks,” COLT, 2015.

• J. Sokolic, R. Giryes, G. Sapiro, M. R. D. Rodrigues, “Margin Preservation of Deep Neural Networks”, arXiv, abs/1605.08254, 2016.

• H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.

• J. Huang, Q. Qiu, G. Sapiro, R. Calderbank, “Discriminative Geometry-Aware Deep Transform”, ICCV, 2015.

• J. Huang, Q. Qiu, G. Sapiro, R. Calderbank, “Discriminative Robust Transformation Learning”, NIPS, 2016.

• T. Blumensath & M. E. Davies, “Iterative hard thresholding for compressed sensing”, Appl. Comput. Harmon. Anal., vol. 27, no. 3, pp. 265–274, 2009.

• I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint”, Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.


FULL REFERENCES 9

• A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems”, SIAM J. Img. Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.

• K. Gregor & Y. LeCun, “Learning fast approximations of sparse coding”, ICML, 2010.

• P. Sprechmann, A. M. Bronstein & G. Sapiro, “Learning efficient sparse and low rank models”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1821–1833, Sept. 2015.

• T. Remez, O. Litany, & A. M. Bronstein, “A picture is worth a billion bits: Real-time image reconstruction from dense binary pixels”, ICCP, 2015.

• J. Tompson, K. Schlachter, P. Sprechmann & K. Perlin, “Accelerating Eulerian Fluid Simulation With Convolutional Networks”, arXiv, abs/1607.03597, 2016.

• S. Oymak, B. Recht, & M. Soltanolkotabi, “Sharp time–data tradeoffs for linear inverse problems”, arXiv, abs/1507.04793, 2016.

• R. Giryes, Y. C. Eldar, A. M. Bronstein, G. Sapiro, “Tradeoffs between Convergence Speed and Reconstruction Accuracy in Inverse Problems”, arXiv, abs/1605.09232, 2016.

• R.G. Baraniuk, V. Cevher, M.F. Duarte & C. Hegde, “Model-based compressive sensing”, IEEE Trans. Inf. Theory, vol. 56, no. 4, pp. 1982–2001, Apr. 2010.

• M. F. Duarte & R. G. Baraniuk, “Spectral compressive sensing”, Appl. Comput. Harmon. Anal., vol. 35, no. 1, pp. 111 – 129, 2013.

• J. Bruna & T. Moreau, “Adaptive Acceleration of Sparse Coding via Matrix Factorization”, arXiv abs/1609.00285, 2016.
