![Page 1: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/1.jpg)
DATA STRUCTURE BASED THEORY FOR
DEEP LEARNING
RAJA GIRYES
TEL AVIV UNIVERSITY
Mathematics of Deep Learning Tutorial
EUSIPCO Conference, Budapest, Hungary
August 29, 2016
![Page 2: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/2.jpg)
AGENDA
• History of deep learning.
• A sample of existing theory for deep learning.
• Neural networks with random Gaussian weights.
• Generalization error of deep neural networks.
• Deep Learning as metric learning.
• Solving minimization problems with deep learning.
2
![Page 3: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/3.jpg)
FIRST LEARNING PROGRAM
3
1956
“field of study that gives computers the ability to learn without being explicitly programmed”. [Arthur Samuel, 1959]
![Page 4: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/4.jpg)
RELATION TO VISION
4
Wiesel and Hubel, 1959
![Page 5: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/5.jpg)
HUMAN VISUAL SYSTEM
5
In the visual cortex there are two types of neurons:
Simple and complex
![Page 6: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/6.jpg)
IMITATING THE HUMAN BRAIN
6
Fukushima 1980
![Page 7: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/7.jpg)
CONVOLUTIONAL NEURAL NETWORKS
• Introduction of convolutional neural networks [LeCun et al., 1989]
• Training by backpropagation
7
![Page 8: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/8.jpg)
SCENE PARSING
[Farabet et al., 2012, 2013]
• Deep Learning usage before 2012:
8
![Page 9: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/9.jpg)
2012 IMAGENET DEEP LEARNING BREAKTHROUGH
• ImageNet dataset
• 1.4 Million images
• 1000 categories
• 1.2 Million for training
• 150000 for testing
• 50000 for validation
9
Today deep learning achieves a 3.5% top-5 error with 152 layers [He, Zhang, Ren & Sun, 2016]
[Krizhevsky, Sutskever & Hinton, 2012]
![Page 10: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/10.jpg)
DEEP NETWORK STRUCTURE
• What does each layer of the network learn?
[Krizhevsky, Sutskever & Hinton, 2012]
10
![Page 11: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/11.jpg)
LAYERS STRUCTURE
• First layers detect simple patterns that correspond to simple objects
[Zeiler & Fergus, 2014]
11
![Page 12: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/12.jpg)
LAYERS STRUCTURE
• Deeper layers detect more complex patterns corresponding to more complex objects.
[Zeiler & Fergus, 2014]
12
![Page 13: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/13.jpg)
LAYERS STRUCTURE
[Zeiler & Fergus, 2014]
13
![Page 14: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/14.jpg)
WHY DO THINGS WORK BETTER TODAY?
• More data
• Better Hardware (GPU)
• Better learning regularization (dropout)
• The impact and success of deep learning are not unique to image classification.
14
![Page 15: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/15.jpg)
DEEP LEARNING FOR SPEECH RECOGNITION
15
![Page 16: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/16.jpg)
OBJECT DETECTION
[Szegedy et al., 2015]
16
![Page 17: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/17.jpg)
GAME PLAYING
[Mnih et al., 2013, 2015]
17
![Page 18: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/18.jpg)
GO GAME
• AlphaGo - First computer program to ever beat a professional player at the game of Go
• Program created by Google DeepMind
• Game strategy learned using deep learning [Silver et al., 2016].
18
![Page 19: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/19.jpg)
DEEP NEURAL NETWORKS (DNN)
• One layer of a neural net
• Concatenation of the layers creates the whole net
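$\Phi(X_1, X_2, \dots, X_K) = \psi(\psi(\psi(V X_1) X_2) \cdots X_K)$
Single layer: $V \in \mathbb{R}^d \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
$X$ is a linear operation
$\psi$ is a non-linear function
Whole net: $V \in \mathbb{R}^d \rightarrow X_1 \rightarrow \psi \rightarrow \cdots \rightarrow X_i \rightarrow \psi \rightarrow \cdots \rightarrow X_K \rightarrow \psi$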
19
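The layer composition on this slide can be sketched in a few lines of plain Python. This is only an illustration: the ReLU is assumed for $\psi$, the helper names `linear` and `forward` are hypothetical, and the toy weights are made up so the arithmetic is easy to follow.

```python
def relu(values):
    """Element-wise ReLU, one possible choice for the non-linearity psi."""
    return [max(v, 0.0) for v in values]


def linear(v, X):
    """Row vector v (length d) times matrix X (d rows, m columns): returns VX."""
    m = len(X[0])
    return [sum(v[i] * X[i][j] for i in range(len(v))) for j in range(m)]


def forward(v, layers, psi=relu):
    """Compute psi(... psi(psi(V X_1) X_2) ... X_K) for layers = [X_1, ..., X_K]."""
    for X in layers:
        v = psi(linear(v, X))
    return v


# Hypothetical toy weights, not taken from the slides.
X1 = [[1.0, 0.0],
      [0.0, 1.0]]   # 2 x 2
X2 = [[2.0],
      [3.0]]        # 2 x 1
V = [1.0, -2.0]

output = forward(V, [X1, X2])
print(output)
```

Here the first layer maps $V$ to $(1, 0)$ after the ReLU, and the second layer maps that to a single non-negative number.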
![Page 20: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/20.jpg)
CONVOLUTIONAL NEURAL NETWORKS (CNN)
• In many cases, 𝑋 is selected to be a convolution.
• This operator is shift invariant.
• CNN are commonly used with images as they are typically shift invariant.
$V \in \mathbb{R}^d \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
$X$ is a linear operation
$\psi$ is a non-linear function
20
![Page 21: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/21.jpg)
THE NON-LINEAR PART
• Usually 𝜓 = 𝑔 ∘ 𝑓.
• 𝑓 is the (point-wise) activation function
• 𝑔 is a pooling or an aggregation operator.
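Activation functions:
ReLU: $f(x) = \max(x, 0)$
Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent: $f(x) = \tanh(x)$
Pooling operators over the inputs $V_1, V_2, \dots, V_n$:
Max pooling: $\max_i V_i$
Mean pooling: $\frac{1}{n} \sum_{i=1}^{n} V_i$
$\ell_p$ pooling: $\sqrt[p]{\sum_{i=1}^{n} V_i^p}$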
21
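A minimal sketch of $\psi = g \circ f$ in plain Python follows. The function names are hypothetical, and the $\ell_p$ pooling takes absolute values before raising to the power $p$, which is an assumption added here so the root stays defined for negative inputs.

```python
import math


def relu(x):
    return max(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def max_pool(values):
    return max(values)


def mean_pool(values):
    return sum(values) / len(values)


def lp_pool(values, p):
    # Assumption: absolute values are used so the p-th root is always defined.
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def psi(values, f=relu, g=max_pool):
    """Apply the point-wise activation f first, then the pooling g."""
    return g([f(v) for v in values])


print(psi([-1.0, 2.0, 0.5]))                       # ReLU then max pooling
print(psi([-1.0, 2.0, 0.5], f=tanh, g=mean_pool))  # tanh then mean pooling
```

Swapping `f` and `g` is enough to move between the combinations listed on the slide.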
![Page 22: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/22.jpg)
WHY DO DNN WORK?
• What is so special about the DNN structure?
• What is the role of the depth of DNN?
• What is the role of pooling?
• What is the role of the activation function?
• How many training samples do we need?
• What is the capability of DNN?
• What happens to the data throughout the layers?
22
![Page 23: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/23.jpg)
REPRESENTATION POWER
• Neural nets serve as universal approximators for any Borel measurable function [Cybenko 1989, Hornik 1991].
• In particular, let the non-linearity $\psi$ be a bounded, non-constant continuous function, $I_d$ be the $d$-dimensional hypercube, and $C(I_d)$ be the space of continuous functions on $I_d$. Then for any $f \in C(I_d)$ and $\epsilon > 0$, there exist $m > 0$, $X \in \mathbb{R}^{d \times m}$, $B \in \mathbb{R}^m$ and $W \in \mathbb{R}^m$ such that the neural network
$F(V) = \psi(VX + B) W^T$
approximates $f$ with a precision $\epsilon$:
$|F(V) - f(V)| < \epsilon, \quad \forall V \in I_d$
23
![Page 24: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/24.jpg)
ESTIMATION ERROR
• The estimation error of a function $f$ by a neural network scales as [Barron 1992]
$O\left(\frac{C_f}{N}\right) + O\left(\frac{N d}{n} \log(n)\right)$
where:
$C_f$: smoothness of the approximated function
$N$: number of neurons in the DNN
$n$: number of training examples
$d$: input dimension
24
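To see the trade-off in this bound, one can balance its two terms over the number of neurons $N$. The sketch below treats the hidden constants in both $O(\cdot)$ terms as 1, which is an assumption made only for illustration:

```latex
\begin{aligned}
g(N) &= \frac{C_f}{N} + \frac{N d}{n}\log(n), \\
g'(N) &= -\frac{C_f}{N^2} + \frac{d}{n}\log(n) = 0
\;\Longrightarrow\;
N^\star = \sqrt{\frac{C_f\, n}{d \log(n)}}, \\
g(N^\star) &= 2\sqrt{\frac{C_f\, d \log(n)}{n}}.
\end{aligned}
```

Under this simplification, more neurons reduce the first term but inflate the second, and the balanced error decays roughly like $\sqrt{\log(n)/n}$ in the number of training examples.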
![Page 25: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/25.jpg)
DEPTH OF THE NETWORK
• DNN allow representing restricted Boltzmann machines with a number of parameters exponentially greater than the number of the network parameters [Montúfar & Morton, 2014]
• Each DNN layer with ReLU divides the space by a hyper-plane.
• Therefore the depth of the network divides the space into an exponential number of sets compared to the number of parameters [Montúfar, Pascanu, Cho & Bengio, 2014]
25
![Page 26: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/26.jpg)
DEPTH EFFICIENCY OF CNN
• A function realized by a CNN of polynomial size, with ReLU and max-pooling, requires a super-polynomial size to be approximated by a shallow network [Cohen et al., 2016].
• The standard convolutional network design has a learning bias towards the statistics of natural images [Cohen et al., 2016].
26
![Page 27: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/27.jpg)
ROLE OF POOLING
• The pooling stage provides shift invariance [Bruna, LeCun & Szlam, 2013].
• A connection is drawn between the pooling stage and the phase retrieval methods [Bruna, Szlam & LeCun, 2014].
• This allows calculating Lipschitz constants of each DNN layer $\psi(\cdot X)$ and empirically recovering the input of a layer from its output.
• However, the Lipschitz constants calculated are very loose and no theoretical guarantees are given for the recovery.
27
![Page 28: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/28.jpg)
SUFFICIENT STATISTIC AND INVARIANCE
• Given a certain task at hand:
• Minimal sufficient statistic guarantees that we can replace raw data with a representation with smallest complexity and no performance loss.
• Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data.
• CNN are shown to have these properties for many tasks [Soatto & Chiuso, 2016].
28
![Page 29: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/29.jpg)
SCATTERING TRANSFORMS
• Scattering Transforms - a cascade of wavelet transform convolutions with nonlinear modulus and averaging operators.
• Scattering coefficients are stable encodings of geometry and texture [Bruna & Mallat, 2013]
29
Original image with 𝑑 pixels
Recovery from scattering transform with 1 layer
Recovery from scattering transform with 2 layers
Images from slides of Joan Bruna in ICCV 2015 tutorial
![Page 30: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/30.jpg)
SCATTERING TRANSFORMS AND DNN
• More layers create features that can be made invariant to increasingly more complex deformations.
• Layers in a DNN encode complex, class-specific geometry.
• Deeper architectures are able to better capture invariant properties of objects and scenes in images [Bruna & Mallat, 2013]
30
![Page 31: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/31.jpg)
SCATTERING TRANSFORMS AS A METRIC
• Scattering transforms may be used as a metric.
• Inverse problems can be solved by minimizing distance at the scattering transform domain.
• Leads to remarkable results in super-resolution [Bruna, Sprechmann & LeCun, 2016]
31
![Page 32: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/32.jpg)
SCATTERING SUPER RESOLUTION
Original Best Linear Estimate State-of-the-art Scattering estimate
Images from slides of Joan Bruna in CVPR 2016 tutorial
[Bruna, Sprechmann & LeCun, 2016]
32
![Page 33: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/33.jpg)
MINIMIZATION
• The local minima in deep networks are not far from the global minimum.
• Saddle points are the main problem of deep learning optimization.
• Deeper networks have more local minima but fewer saddle points. [Saxe, McClelland & Ganguli, 2014], [Dauphin, Pascanu, Gulcehre, Cho, Ganguli & Bengio, 2014], [Choromanska, Henaff, Mathieu, Ben Arous & LeCun, 2015]
33
[Choromanska et al., 2015]
![Page 34: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/34.jpg)
GLOBAL OPTIMALITY IN DEEP LEARNING
• Deep learning is a positively homogeneous factorization problem, i.e., $\exists p \ge 0$ such that $\forall \alpha \ge 0$ DNN obey
$\Phi(\alpha X_1, \alpha X_2, \dots, \alpha X_K) = \alpha^p \Phi(X_1, X_2, \dots, X_K)$.
• With proper regularization, local minima are global.
• If the network is large enough, global minima can be found by local descent.
Guarantees of proposed framework
[Haeffele & Vidal, 2015]
34
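Positive homogeneity is easy to check numerically. The sketch below assumes ReLU non-linearities and no bias terms (both assumptions made here for illustration), in which case scaling every weight matrix by $\alpha \ge 0$ scales the output by $\alpha^K$, so $p = K$. The helper names and toy weights are hypothetical.

```python
import math


def forward(v, layers):
    """ReLU network without biases: psi(... psi(V X_1) ... X_K)."""
    for X in layers:
        pre = [sum(v[i] * X[i][j] for i in range(len(v))) for j in range(len(X[0]))]
        v = [max(value, 0.0) for value in pre]
    return v


def scale(X, alpha):
    """Multiply every entry of the matrix X by alpha."""
    return [[alpha * x for x in row] for row in X]


# Hypothetical toy weights, not taken from the slides.
X1 = [[1.0, -1.0],
      [0.5, 2.0]]
X2 = [[2.0],
      [1.0]]
V = [1.0, 2.0]
layers = [X1, X2]
alpha = 3.0
K = len(layers)

base = forward(V, layers)
scaled = forward(V, [scale(X, alpha) for X in layers])

print(base, scaled, alpha ** K)
```

With these weights the scaled output equals $\alpha^2$ times the original one, as the homogeneity relation predicts for $K = 2$.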
![Page 35: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/35.jpg)
TAKE HOME MESSAGE
• DNN keep the important information of the data.
• Gaussian mean width is a good measure for the complexity of the data.
• Random Gaussian weights are good for classifying the average points in the data.
• Important goal of training: classify the boundary points between the different classes in the data.
• Generalization error depends on the DNN input margin.
• Deep learning can be viewed as metric learning.
• DNN may solve optimization problems.
35
![Page 36: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/36.jpg)
ASSUMPTIONS – GAUSSIAN WEIGHTS
• Infusion of random weights reveals internal properties of a system
$V \in \mathbb{R}^d \rightarrow X_1 \rightarrow \psi \rightarrow \cdots \rightarrow X_i \rightarrow \psi \rightarrow \cdots \rightarrow X_K \rightarrow \psi$
$X_1, \dots, X_i, \dots, X_K$ are random Gaussian matrices
Compressed Sensing
Phase Retrieval
Sparse Recovery
Deep Learning [Saxe et al. 2014] [Dauphin et al. 2014] [Choromanska et al. 2015] [Arora et al. 2014]
36
![Page 37: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/37.jpg)
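ASSUMPTIONS – NO POOLING
• Pooling provides invariance [Boureau et al. 2010, Bruna et al. 2013].
We assume that all equivalent points in the data were merged together and omit this stage.
This reveals the role of the other components in the DNN.
$V \in \mathbb{R}^d \rightarrow X_1 \rightarrow \psi \rightarrow \cdots \rightarrow X_i \rightarrow \psi \rightarrow \cdots \rightarrow X_K \rightarrow \psi$
$\psi$ is an element-wise activation function: $\max(v, 0)$, $\frac{1}{1 + e^{-v}}$ or $\tanh(v)$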
37
![Page 38: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/38.jpg)
ASSUMPTIONS – LOW DIMENSIONAL DATA
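$\Upsilon$ is a low dimensional set
$V \in \Upsilon \rightarrow X_1 \rightarrow \psi \rightarrow \cdots \rightarrow X_i \rightarrow \psi \rightarrow \cdots \rightarrow X_K \rightarrow \psi$
Gaussian Mixture Models (GMM)
Signals with Sparse Representations
Low Dimensional Manifolds
Low Rank Matrices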
38
![Page 39: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/39.jpg)
Gaussian Mean Width
39
![Page 40: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/40.jpg)
WHAT HAPPENS TO SPARSE DATA IN DNN?
• Let Υ be sparsely represented data
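• Example: $\Upsilon = \{V \in \mathbb{R}^3 : \|V\|_0 \le 1\}$
• $\Upsilon X$ is still sparsely represented data
• Example: $\Upsilon X = \{V \in \mathbb{R}^3 : \exists W \in \mathbb{R}^3, V = XW, \|W\|_0 \le 1\}$
• $\psi(\Upsilon X)$ is not sparsely represented
• But it is still low dimensional
[Figure: the sets $\Upsilon$, $\Upsilon X$ and $\psi(\Upsilon X)$]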
40
![Page 41: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/41.jpg)
GAUSSIAN MEAN WIDTH
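• Gaussian mean width: $\omega(\Upsilon) = E \sup_{V, W \in \Upsilon} \langle V - W, g \rangle, \quad g \sim N(0, I)$.
The width of the set $\Upsilon$ in the direction of $g$: $\sup_{V, W \in \Upsilon} \langle V - W, g \rangle$
[Figure: a set $\Upsilon$ with two points $V$, $W$ and a direction $g$]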
41
![Page 42: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/42.jpg)
MEASURE FOR LOW DIMENSIONALITY
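• Gaussian mean width: $\omega(\Upsilon) = E \sup_{V, W \in \Upsilon} \langle V - W, g \rangle, \quad g \sim N(0, I)$.
• $\omega^2(\Upsilon)$ is a measure for the dimensionality of the data.
• Examples:
If $\Upsilon \subset \mathbb{B}^d$ is a Gaussian Mixture Model with $k$ Gaussians then $\omega^2(\Upsilon) = O(k)$
If $\Upsilon \subset \mathbb{B}^d$ is data with $k$-sparse representations then $\omega^2(\Upsilon) = O(k \log d)$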
42
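For a finite set of points the Gaussian mean width can be estimated by Monte Carlo, since the supremum over pairs reduces to the largest projection minus the smallest one. The sketch below is illustrative only: the function name `mean_width`, the sample count and the toy set are assumptions, not part of the slides.

```python
import math
import random


def mean_width(points, num_samples=20000, seed=0):
    """Monte Carlo estimate of E sup_{V,W} <V - W, g> for a finite set of points."""
    rng = random.Random(seed)
    d = len(points[0])
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        projections = [sum(p_i * g_i for p_i, g_i in zip(p, g)) for p in points]
        # sup over pairs of <V - W, g> = max projection - min projection
        total += max(projections) - min(projections)
    return total / num_samples


# Toy set: two antipodal unit vectors in R^5.
d = 5
antipodal = [[1.0] + [0.0] * (d - 1), [-1.0] + [0.0] * (d - 1)]

estimate = mean_width(antipodal)
exact = 2.0 * math.sqrt(2.0 / math.pi)  # E[2 |g_1|] for a standard Gaussian g_1
print(estimate, exact)
```

For this two-point set the width in direction $g$ is $2|g_1|$, so the estimate can be compared against the closed form $2\sqrt{2/\pi}$.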
![Page 43: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/43.jpg)
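GAUSSIAN MEAN WIDTH IN DNN
Theorem 1: a small $\frac{\omega^2(\Upsilon)}{m}$ implies $\omega^2(\Upsilon) \approx \omega^2(\psi(VX))$
$\Upsilon \subset \mathbb{R}^d \rightarrow X \rightarrow \Upsilon X \subset \mathbb{R}^m \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
$X$ is a linear operation
$\psi$ is a non-linear function
Small $\omega^2(\Upsilon)$ $\Rightarrow$ small $\omega^2(\psi(VX))$
It is sufficient to provide proofs only for a single layer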
43
![Page 44: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/44.jpg)
Stability
44
![Page 45: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/45.jpg)
ASSUMPTIONS
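$X$ is a random Gaussian matrix
$\psi$ is an element-wise activation function: $\max(v, 0)$, $\frac{1}{1 + e^{-v}}$ or $\tanh(v)$
$V \in \Upsilon$
$V \in \mathbb{S}^{d-1} \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
$m = O(\delta^{-6} \omega^2(\Upsilon))$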
45
![Page 46: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/46.jpg)
ISOMETRY IN A SINGLE LAYER
Theorem 2: $\psi(\cdot X)$ is a $\delta$-isometry in the Gromov-Hausdorff sense between the sphere $\mathbb{S}^{d-1}$ and the Hamming cube [Plan & Vershynin, 2014, Giryes, Sapiro & Bronstein 2016].
$V \in \mathbb{S}^{d-1} \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
• If two points belong to the same tile then their distance < $\delta$
• Each layer of the network keeps the main information of the data
The rows of $X$ create a tessellation of the space.
This stands in line with [Montúfar et al. 2014].
This structure can be used for hashing.
46
![Page 47: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/47.jpg)
DNN AND HASHING
• A single layer performs locality-sensitive hashing.
• Deep network with random weights may be designed to do better [Choromanska et al., 2016].
• It is possible to train DNN for hashing, which provides cutting-edge results [Masci et al., 2012], [Lai et al., 2015].
47
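The hashing view of a single random layer can be illustrated with sign codes. The sketch below assumes that the non-linearity is a sign function (an assumption made here, since only then does the output live on the Hamming cube) and that each hyperplane normal is an independent Gaussian vector. For two points with angle $\theta$ between them, the expected fraction of differing bits is $\theta / \pi$. The names `sign_hash` and `hamming_fraction` are hypothetical.

```python
import math
import random


def sign_hash(v, hyperplanes):
    """One bit per random hyperplane: 1 if v lies on its non-negative side."""
    return [1 if sum(a * b for a, b in zip(v, h)) >= 0.0 else 0 for h in hyperplanes]


def hamming_fraction(code_a, code_b):
    """Fraction of positions in which two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b)) / len(code_a)


rng = random.Random(0)
d, m = 3, 4000
hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

V = [1.0, 0.0, 0.0]
W = [0.0, 1.0, 0.0]  # orthogonal to V, so the angle is pi / 2

fraction = hamming_fraction(sign_hash(V, hyperplanes), sign_hash(W, hyperplanes))
expected = (math.pi / 2) / math.pi
print(fraction, expected)
```

Points that fall in the same tile of the tessellation receive identical codes, which is what makes the layer usable as a hash.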
![Page 48: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/48.jpg)
DNN STABLE EMBEDDING
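$V \in \mathbb{S}^{d-1} \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
Theorem 3: There exists an algorithm $\mathcal{A}$ such that
$\|V - \mathcal{A}(\psi(VX))\| < O\left(\frac{\omega(\Upsilon)}{\sqrt{m}}\right) = O(\delta^3)$
[Plan & Vershynin, 2013, Giryes, Sapiro & Bronstein 2016].
After $K$ layers we have an error $O(K \delta^3)$
Stands in line with [Mahendran and Vedaldi, 2015].
DNN keep the important information of the data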
48
![Page 49: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/49.jpg)
RECOVERY FROM DNN OUTPUT
[Mahendran and Vedaldi, 2015].
49
![Page 50: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/50.jpg)
DNN with Gaussian Weights
50
![Page 51: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/51.jpg)
ASSUMPTIONS
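$X$ is a random Gaussian matrix
$\psi$ is the ReLU: $\max(v, 0)$
$V \in \Upsilon$
$V \in \mathbb{R}^d \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
$m = O(\delta^{-4} \omega^4(\Upsilon))$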
51
![Page 52: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/52.jpg)
DISTANCE DISTORTION
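$V \in \Upsilon \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
Theorem 4: for $V, W \in \Upsilon$
$\left| \|\psi(VX) - \psi(WX)\|^2 - \left( \frac{1}{2}\|V - W\|^2 - \frac{\|V\| \|W\|}{\pi} \left( \sin\angle(V, W) - \angle(V, W) \cos\angle(V, W) \right) \right) \right| \le \delta$
The smaller $\angle(V, W)$, the smaller the distance we get between the points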
52
![Page 53: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/53.jpg)
ANGLE DISTORTION
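$V \in \Upsilon \rightarrow X \rightarrow VX \rightarrow \psi \rightarrow \psi(VX) \in \mathbb{R}^m$
Theorem 5: for $V, W \in \Upsilon$, the following quantity is small:
$\left| \cos\angle(\psi(VX), \psi(WX)) - \cos\angle(V, W) - \frac{1}{\pi} \left( \sin\angle(V, W) - \angle(V, W) \cos\angle(V, W) \right) \right|$
[Figure: behavior of $\angle(\psi(VX), \psi(WX))$]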
53
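The distance and angle behavior of a random ReLU layer can be checked empirically. The sketch below is a rough numerical illustration under stated assumptions: the entries of $X$ are drawn as independent Gaussians with variance $1/m$ (a normalization chosen here so that squared distances are of order one), the test points are two orthogonal unit vectors, and the predicted values are the expectations of the corresponding quantities for Gaussian weights. All function names are hypothetical.

```python
import math
import random


def relu_layer(v, columns):
    """psi(VX) with ReLU, where each entry of `columns` is one column of X."""
    return [max(sum(a * b for a, b in zip(v, col)), 0.0) for col in columns]


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def angle(a, b):
    cos_value = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cos_value)))


def correction(theta):
    """(1 / pi) * (sin(theta) - theta * cos(theta))."""
    return (math.sin(theta) - theta * math.cos(theta)) / math.pi


rng = random.Random(0)
d, m = 3, 20000
# Assumption: entries are N(0, 1/m).
columns = [[rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(d)] for _ in range(m)]

V = [1.0, 0.0, 0.0]
W = [0.0, 1.0, 0.0]
theta = angle(V, W)

out_v, out_w = relu_layer(V, columns), relu_layer(W, columns)

measured_cos = math.cos(angle(out_v, out_w))
predicted_cos = math.cos(theta) + correction(theta)

measured_sq_dist = sum((a - b) ** 2 for a, b in zip(out_v, out_w))
diff_sq = sum((a - b) ** 2 for a, b in zip(V, W))
predicted_sq_dist = 0.5 * diff_sq - norm(V) * norm(W) * correction(theta)

print(measured_cos, predicted_cos)
print(measured_sq_dist, predicted_sq_dist)
```

For orthogonal inputs the output angle is visibly smaller than 90 degrees, which matches the claim that the layer pulls points together.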
![Page 54: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/54.jpg)
DISTANCE AND ANGLES DISTORTION
Points with small angles between them become closer than points with larger angles between them
[Figure: Class I and Class II before and after applying $X$ and $\psi$]
54
![Page 55: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/55.jpg)
POOLING AND CONVOLUTIONS
• We empirically test this behavior on convolutional neural networks (CNN) with random weights and the MNIST, CIFAR-10 and ImageNet datasets.
• The behavior predicted in the theorems also holds in the presence of pooling and convolutions.
55
![Page 56: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/56.jpg)
TRAINING DATA SIZE
• Stability in the network implies that close points in the input are close also at the output
• Having a good network for an $\varepsilon$-net of the input set $\Upsilon$ guarantees a good network for all the points in $\Upsilon$.
• Using Sudakov minoration, the number of data points is $\exp(\omega^2(\Upsilon) / \varepsilon^2)$.
• Though this is not a tight bound, it introduces the Gaussian mean width 𝜔 Υ as a measure for the complexity of the input data and the required number of training samples.
56
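As a worked instance of this bound, one can plug in data with $k$-sparse representations, for which the Gaussian mean width satisfies $\omega^2(\Upsilon) = O(k \log d)$. The constant $c$ below is hypothetical and stands for whatever constant the $O(\cdot)$ hides:

```latex
\exp\left(\frac{\omega^2(\Upsilon)}{\varepsilon^2}\right)
\le \exp\left(\frac{c\, k \log d}{\varepsilon^2}\right)
= d^{\,c k / \varepsilon^2}.
```

Under this assumption the number of points grows polynomially in the ambient dimension $d$, with an exponent set by the sparsity $k$ and the resolution $\varepsilon$.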
![Page 57: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/57.jpg)
Role of Training
57
![Page 58: DATA STRUCTURE BASED THEORY FOR DEEP … · DATA STRUCTURE BASED THEORY FOR DEEP LEARNING ... Mathematics of Deep Learning Tutorial EUSIPCO Conference, Budapest, Hungary August 29](https://reader034.vdocument.in/reader034/viewer/2022052609/5b05a8607f8b9ac33f8b9faa/html5/thumbnails/58.jpg)
ROLE OF TRAINING
• Having a theory for Gaussian weights, we test the behavior of DNN after training.
• We looked at the MNIST, CIFAR-10 and ImageNet datasets.
• We will present here only the ImageNet results.
• We use a state-of-the-art pre-trained network for ImageNet [Simonyan & Zisserman, 2014].
• We compute inter and intra class distances.
INTER BOUNDARY POINTS DISTANCE RATIO
• $V$ is a random point and $W$ its closest point from a different class.
• $\bar{V}$ is the output of $V$ and $\bar{Z}$ the closest point to $\bar{V}$ at the output from a different class.
• Compute the distance ratio: $\|\bar{V} - \bar{Z}\| / \|W - V\|$.
[Figure: Class I and Class II at the input and at the output of the network $X_1, \psi, \dots, X_i, \psi, \dots, X_K, \psi$.]
INTRA BOUNDARY POINTS DISTANCE RATIO
• Let $V$ be a point and $W$ its farthest point from the same class.
• Let $\bar{V}$ be the output of $V$ and $\bar{Z}$ the farthest point from $\bar{V}$ at the output from the same class.
• Compute the distance ratio: $\|\bar{V} - \bar{Z}\| / \|W - V\|$.
[Figure: Class I and Class II at the input and at the output of the network $X_1, \psi, \dots, X_i, \psi, \dots, X_K, \psi$.]
BOUNDARY DISTANCE RATIO
[Figure: boundary distance ratio $\|\bar{V} - \bar{Z}\| / \|W - V\|$; panels: Inter-class, Intra-class.]
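AVERAGE POINTS DISTANCE RATIO
• $V$, $W$ and $Z$ are three random points.
• $\bar{V}$, $\bar{W}$ and $\bar{Z}$ are the outputs of $V$, $W$ and $Z$, respectively.
• Compute the distance ratios: $\|\bar{V} - \bar{W}\| / \|V - W\|$ and $\|\bar{V} - \bar{Z}\| / \|V - Z\|$.
[Figure: Class I and Class II at the input and at the output of the network $X_1, \psi, \dots, X_i, \psi, \dots, X_K, \psi$.]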
AVERAGE DISTANCE RATIO
[Figure: average distance ratios $\|\bar{V} - \bar{W}\| / \|V - W\|$ and $\|\bar{V} - \bar{Z}\| / \|V - Z\|$; panels: Inter-class, Intra-class.]
ROLE OF TRAINING
• On average distances are preserved in the trained and random networks.
• The difference is with respect to the boundary points.
• The inter distances become larger.
• The intra distances shrink.
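The boundary distance ratios used in this experiment can be sketched as follows. This is a minimal illustration, not the code behind the ImageNet results: the function names are hypothetical, the "network" in the usage lines is a single random Gaussian layer followed by a ReLU, and every class is assumed to contain at least two distinct points.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=2)

def boundary_distance_ratios(x_in, x_out, labels):
    """Return (inter, intra) output/input distance ratios, one value per point."""
    d_in = pairwise_distances(x_in)
    d_out = pairwise_distances(x_out)
    same = labels[:, None] == labels[None, :]
    # Inter-class: closest point from a different class, searched separately
    # at the input and at the output.
    inter = (np.where(same, np.inf, d_out).min(axis=1)
             / np.where(same, np.inf, d_in).min(axis=1))
    # Intra-class: farthest point from the same class, searched separately.
    intra = (np.where(same, d_out, -np.inf).max(axis=1)
             / np.where(same, d_in, -np.inf).max(axis=1))
    return inter, intra

# Toy usage with an untrained layer (illustrative data, not from the tutorial).
rng = np.random.default_rng(0)
x = rng.standard_normal((40, 10))
y = np.repeat([0, 1], 20)
weights = rng.standard_normal((10, 30)) / np.sqrt(30)
features = np.maximum(x @ weights, 0.0)
inter_ratio, intra_ratio = boundary_distance_ratios(x, features, y)
```

Comparing histograms of `inter_ratio` and `intra_ratio` for a random and a trained network is one way to reproduce the kind of comparison summarized above.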
[Overview slide, repeated; the same seven points appear under "Take Home Message" at the end.]
Generalization Error
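ASSUMPTIONS
• Two classes.
• $\psi$ is the ReLU.
• The input $V \in \Upsilon$ is mapped from the input space to the feature space by the network $X_1, \psi, \dots, X_i, \psi, \dots, X_K, \psi$.
• A linear classifier $w$ acts on the feature space, with decision boundary $w^T \Phi(X_1, X_2, \dots, X_K) = 0$.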
GENERALIZATION ERROR (GE)
• In training, we reduce the classification error $\ell_{\text{training}}$ of the training data as the number of training examples $L$ increases.
• However, we are interested in reducing the error $\ell_{\text{test}}$ of the (unknown) testing data as $L$ increases.
• The difference between the two is the generalization error
$$\mathrm{GE} = |\ell_{\text{training}} - \ell_{\text{test}}|$$
• It is important to understand the GE of DNN.
REGULARIZATION TECHNIQUES
• Weight decay – penalizing DNN weights [Krogh & Hertz, 1992].
• Dropout - randomly drop units (along with their connections) from the neural network during training [Hinton et al., 2012, Srivastava et al., 2014].
• DropConnect – dropout extension [Wan et al., 2013]
• Batch normalization [Ioffe & Szegedy, 2015].
• Stochastic gradient descent (SGD) [Hardt, Recht & Singer, 2016].
• Path-SGD [Neyshabur et al., 2015].
A SAMPLE OF GE BOUNDS
• Using the VC dimension it can be shown that
$$\mathrm{GE} \le O\left(\text{DNN params} \cdot \frac{\log(L)}{L}\right)$$
[Shalev-Shwartz and Ben-David, 2014].
• The GE was bounded also by the DNN weights
$$\mathrm{GE} \le \frac{1}{\sqrt{L}} 2^K \|w\|_2 \prod_i \|X_i\|_{2,2}$$
[Neyshabur et al., 2015].
• Note that in both cases the GE grows with the depth.
DNN INPUT MARGIN
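• Theorem 6: If for every training sample $V_i$ the input margin satisfies $\gamma_{in}(V_i) > \gamma$, then $\mathrm{GE} \le \sqrt{N_{\gamma/2}(\Upsilon) / m}$ (here $m$ denotes the number of training samples).
• $N_{\gamma/2}(\Upsilon)$ is the covering number of the data $\Upsilon$.
• $N_{\gamma/2}(\Upsilon)$ gets smaller as $\gamma$ gets larger.
• Bound is independent of depth.
• Our theory relies on the robustness framework [Xu & Mannor, 2015].
[Figure: a training sample $V_i$ and its input margin.]
[Sokolic, Giryes, Sapiro, Rodrigues, 2016]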
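To make the dependence of the bound on the margin concrete, the sketch below evaluates $\sqrt{N_{\gamma/2}/m}$ with a greedy cover of a finite sample set standing in for the covering number of $\Upsilon$. The greedy cover only upper-bounds the covering number of the sampled points, so this is an illustrative proxy under that assumption; the function names are hypothetical.

```python
import numpy as np

def greedy_cover_size(points, radius):
    """Number of balls of the given radius, centred at sample points,
    chosen greedily until every row of `points` is covered."""
    remaining = np.ones(len(points), dtype=bool)
    centers = 0
    while remaining.any():
        idx = np.flatnonzero(remaining)[0]
        dist = np.linalg.norm(points - points[idx], axis=1)
        remaining &= dist > radius  # drop everything inside the new ball
        centers += 1
    return centers

def margin_ge_bound(points, gamma, num_train):
    """Evaluate sqrt(N_{gamma/2} / m) with the greedy cover as a proxy for N."""
    return float(np.sqrt(greedy_cover_size(points, gamma / 2) / num_train))
```

A larger margin `gamma` gives a coarser cover and hence a smaller value, which mirrors the statement that $N_{\gamma/2}(\Upsilon)$ shrinks as $\gamma$ grows.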
INPUT MARGIN BOUND
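• Maximizing the input margin directly is hard.
• Our strategy: relate the input margin to the output margin $\gamma_{out}(V_i)$ and other DNN properties.
• Theorem 7:
$$\gamma_{in}(V_i) \ge \frac{\gamma_{out}(V_i)}{\sup_{V \in \Upsilon} \left\| \frac{V}{\|V\|_2} J(V) \right\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_F}$$
[Figure: a sample $V_i$ in the input space and its feature $\Phi(V_i)$.]
[Sokolic, Giryes, Sapiro, Rodrigues, 2016]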
OUTPUT MARGIN
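• Theorem 7:
$$\gamma_{in}(V_i) \ge \frac{\gamma_{out}(V_i)}{\sup_{V \in \Upsilon} \left\| \frac{V}{\|V\|_2} J(V) \right\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_F}$$
• The output margin is easier to maximize (an SVM problem).
• It is maximized by many cost functions, e.g., hinge loss.
[Figure: a sample $V_i$ in the input space and its feature $\Phi(V_i)$.]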
GE AND WEIGHT DECAY
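• Theorem 7:
$$\gamma_{in}(V_i) \ge \frac{\gamma_{out}(V_i)}{\sup_{V \in \Upsilon} \left\| \frac{V}{\|V\|_2} J(V) \right\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_F}$$
• Bounding the weights increases the input margin.
• Weight decay regularization decreases the GE.
• Related to regularization used by [Haeffele & Vidal, 2015].
[Figure: a sample $V_i$ in the input space and its feature $\Phi(V_i)$.]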
JACOBIAN BASED REGULARIZATION
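• Theorem 7:
$$\gamma_{in}(V_i) \ge \frac{\gamma_{out}(V_i)}{\sup_{V \in \Upsilon} \left\| \frac{V}{\|V\|_2} J(V) \right\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_2} \ge \frac{\gamma_{out}(V_i)}{\prod_{1 \le i \le K} \|X_i\|_F}$$
• $J(V)$ is the Jacobian of the DNN at point $V$.
• $J(\cdot)$ is piecewise constant.
• Using the Jacobian of the DNN leads to a better bound.
• New regularization technique.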
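The chain of denominators in Theorem 7 can be checked numerically on a toy network. The sketch below assumes a bias-free fully connected ReLU network in the row-vector convention $\Phi(V) = \psi(\cdots\psi(V X_1)\cdots X_K)$, for which the Jacobian at $V$ is $X_1 D_1 X_2 D_2 \cdots X_K D_K$ with $D_k$ the diagonal 0/1 activation pattern of layer $k$. The function name, layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def relu_net_with_jacobian(v, weights):
    """Forward pass of a bias-free ReLU network and its Jacobian at v."""
    h = v
    jac = np.eye(v.shape[0])
    for x in weights:
        pre = h @ x
        mask = (pre > 0).astype(float)  # activation pattern D_k
        jac = (jac @ x) * mask          # right-multiply by X_k, then by diag(mask)
        h = pre * mask                  # ReLU output
    return h, jac

rng = np.random.default_rng(0)
weights = [rng.standard_normal((6, 8)), rng.standard_normal((8, 5))]
v = rng.standard_normal(6)
phi, jac = relu_net_with_jacobian(v, weights)

jacobian_norm = np.linalg.norm(jac, 2)
spectral_product = np.prod([np.linalg.norm(x, 2) for x in weights])
frobenius_product = np.prod([np.linalg.norm(x, "fro") for x in weights])
```

Because the Jacobian only changes when the activation pattern changes, it is piecewise constant in $V$, and `jacobian_norm` is never larger than `spectral_product`, which in turn is never larger than `frobenius_product`.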
RESULTS
• Better performance with fewer training samples.
• CCE: the categorical cross entropy.
• WD: weight decay regularization.
• LM: Jacobian based regularization for large margin.
• Note that hinge loss generalizes better than CCE and that LM is better than WD, as predicted by our theory.
[Figure: results on the MNIST dataset.]
[Sokolic, Giryes, Sapiro, Rodrigues, 2016]
[Overview slide, repeated; the same seven points appear under "Take Home Message" at the end.]
DNN as Metric Learning
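ASSUMPTIONS
• $X$ is fully connected and trained.
• $\psi$ is the hyperbolic tangent.
• The input $V \in \mathbb{R}^d$ passes through $X_1$, $\psi$, $X_2$, $\psi$ to give the output $\bar{V}$.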
METRIC LEARNING BASED TRAINING
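• Cosine Objective:
$$\min_{X_1, X_2} \sum_{i,j \in \text{Training Set}} \left( \frac{\bar{V}_i^T \bar{V}_j}{\|\bar{V}_i\| \|\bar{V}_j\|} - \vartheta_{i,j} \right)^2$$
$$\vartheta_{i,j} = \begin{cases} \lambda + (1 - \lambda) \dfrac{V_i^T V_j}{\|V_i\| \|V_j\|}, & i, j \in \text{same class} \\ -1, & i, j \in \text{different class} \end{cases}$$
• In $\vartheta_{i,j}$, $\lambda$ is the classification term and $(1 - \lambda) \frac{V_i^T V_j}{\|V_i\| \|V_j\|}$ is the metric preservation term.
[Figure: the network $V_i \in \mathbb{R}^d \to X_1 \to \psi \to X_2 \to \psi \to \bar{V}_i$.]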
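A minimal sketch of how the cosine objective could be evaluated for given inputs and network outputs is shown below. It only computes the loss value (no training loop), sums over all ordered pairs including $i = j$ (which contribute zero), and uses hypothetical function names.

```python
import numpy as np

def cosine_matrix(a):
    """Cosine similarity between every pair of rows of `a`."""
    unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    return unit @ unit.T

def cosine_objective(v_in, v_out, labels, lam):
    """Sum over pairs of (cos(out_i, out_j) - target_ij)^2."""
    same = labels[:, None] == labels[None, :]
    # Same class: lam (classification) + (1 - lam) * input cosine (metric preservation).
    # Different class: target cosine of -1.
    target = np.where(same, lam + (1.0 - lam) * cosine_matrix(v_in), -1.0)
    return float(np.sum((cosine_matrix(v_out) - target) ** 2))
```

With `lam = 1` the same-class target is a cosine of 1 (pure classification), while `lam = 0` asks same-class outputs to keep their input angles.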
METRIC LEARNING BASED TRAINING
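• Euclidean Objective:
$$\min_{X_1, X_2} \frac{\lambda}{|\text{Training Set}|} \sum_{i,j \in \text{Training Set}} \left[ l_{ij} \left( \|\bar{V}_i - \bar{V}_j\| - t_{ij} \right) \right]_+ + \frac{1 - \lambda}{|\text{Neighbours}|} \sum_{V_i, V_j \text{ are neighbours}} \left| \|\bar{V}_i - \bar{V}_j\| - \|V_i - V_j\| \right|$$
$$l_{ij} = \begin{cases} 1, & i, j \in \text{same class} \\ -1, & i, j \in \text{different class} \end{cases} \qquad t_{ij} = \begin{cases} \text{average intra-class distance}, & i, j \in \text{same class} \\ \text{average inter-class distance}, & i, j \in \text{different class} \end{cases}$$
• The first sum is the classification term and the second sum is the metric learning term.
[Figure: the network $V_i \in \mathbb{R}^d \to X_1 \to \psi \to X_2 \to \psi \to \bar{V}_i$.]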
ROBUSTNESS OF THIS NETWORK
• Metric learning objectives impose stability.
• Similar to what we have in the random case.
• Close points at the input are close at the output.
• Using the theory of $(T, \epsilon)$-robustness, the generalization error scales as $\sqrt{T / |\text{Training set}|}$.
• $T$ is the covering number.
• Also here, the number of training samples scales as $\exp(\omega^2(\Upsilon)/\epsilon^2)$.
RESULTS
• Better performance with fewer training samples.

MNIST Dataset:

| #Training/class | 30 | 50 | 70 | 100 |
|---|---|---|---|---|
| original pixels | 81.91% | 86.18% | 86.86% | 88.49% |
| LeNet | 87.51% | 89.89% | 91.24% | 92.75% |
| Proposed 1 | 92.32% | 94.45% | 95.67% | 96.19% |
| Proposed 2 | 94.14% | 95.20% | 96.05% | 96.21% |

[Figure: ROC curve, Faces in the wild.]
[Huang, Qiu, Sapiro, Calderbank, 2015]
[Overview slide, repeated; the same seven points appear under "Take Home Message" at the end.]
DNN as Metric Learning
ASSUMPTIONS
• $Z \in \Upsilon$, $V \in \mathbb{R}^d$, $V = ZA + E$.
• $X$ and $S$ are linear operators.
• $\psi$ is a projection onto $\Upsilon$.
• The output $\hat{Z}$ is an estimate of $Z$.
ℓ0-MINIMIZATION
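• $Z$ is a $k$-sparse vector, $V \in \mathbb{R}^d$, $V = ZA + E$.
• Iterative hard thresholding algorithm (IHT), with linear operators $\mu A^T$ and $I - \mu A A^T$; $\mu$ is the step size.
• $\psi$ is the hard thresholding operation: it keeps the largest $k$ entries.
• The output $\hat{Z}$ is a $k$-sparse estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\|$ s.t. $\|\tilde{Z}\|_0 \le k$.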
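The IHT block diagram corresponds to the iteration $\hat{Z}_{t+1} = \psi(\mu V A^T + \hat{Z}_t (I - \mu A A^T))$. The sketch below is a minimal NumPy version under the slide's row-vector convention, with $A$ stored as a (signal length) x (measurement length) array. The function names, the zero initialization and the magnitude-based selection of the largest entries are assumptions made for illustration.

```python
import numpy as np

def hard_threshold(u, k):
    """Keep the k largest-magnitude entries of u and zero the rest (k >= 1)."""
    out = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-k:]
    out[keep] = u[keep]
    return out

def iht(v, a, k, mu, num_iters=50):
    """Iterative hard thresholding for v = z @ a + noise."""
    z = np.zeros(a.shape[0])
    for _ in range(num_iters):
        # z + mu * (v - z a) a^T equals mu * v a^T + z (I - mu a a^T)
        z = hard_threshold(z + mu * (v - z @ a) @ a.T, k)
    return z
```

Written this way, each iteration is one "layer" consisting of two fixed linear operators followed by the non-linearity $\psi$.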
ℓ1-MINIMIZATION
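• $\|Z\|_1 \le R$, $V \in \mathbb{R}^d$, $V = ZA + E$.
• Projected gradient descent algorithm for ℓ1 minimization, with linear operators $\mu A^T$ and $I - \mu A A^T$; $\mu$ is the step size.
• $\psi$ projects onto the ℓ1 ball of radius $R$.
• The output $\hat{Z}$ is an estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\|$ s.t. $\|\tilde{Z}\|_1 \le R$.
[Figure: the ℓ1 ball with vertices at $\pm R$ on each axis.]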
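For the ℓ1 case the non-linearity $\psi$ is the Euclidean projection onto the ℓ1 ball. The sketch below uses the standard sort-based projection, which is one common way to compute it and is not necessarily what was used for the tutorial. The function names and the zero initialization are illustrative, and `radius` is assumed positive.

```python
import numpy as np

def project_l1_ball(u, radius):
    """Euclidean projection of u onto {z : ||z||_1 <= radius}, radius > 0."""
    if np.abs(u).sum() <= radius:
        return u.copy()
    mags = np.sort(np.abs(u))[::-1]
    cumsum = np.cumsum(mags)
    ks = np.arange(1, len(u) + 1)
    # Largest index whose magnitude stays positive after the shift theta.
    rho = np.nonzero(mags * ks > cumsum - radius)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1)
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def pgd_l1(v, a, radius, mu, num_iters=100):
    """Projected gradient descent for v = z @ a + noise with ||z||_1 <= radius."""
    z = np.zeros(a.shape[0])
    for _ in range(num_iters):
        z = project_l1_ball(z + mu * (v - z @ a) @ a.T, radius)
    return z
```

The projection amounts to a soft thresholding with a data-dependent threshold `theta`, which is why the ℓ1-ball network and the soft-thresholding network look so similar.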
UNCONSTRAINED ℓ1-MINIMIZATION
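• $V \in \mathbb{R}^d$, $V = ZA + E$.
• Iterative soft thresholding algorithm (ISTA) [Daubechies, Defrise, Mol, 2004], with linear operators $\mu A^T$ and $I - \mu A A^T$.
• $\psi$ is the soft thresholding operation with threshold $\lambda \mu$.
• $\mu$ is the step size, with $\frac{1}{\mu} \ge \|A\|$.
• The output $\hat{Z}$ is the minimizer of $\min_{\tilde{Z}} \|V - \tilde{Z} A\| + \lambda \|\tilde{Z}\|_1$.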
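A minimal ISTA sketch in the same row-vector convention follows. It assumes the usual objective $\frac{1}{2}\|V - \tilde{Z}A\|_2^2 + \lambda\|\tilde{Z}\|_1$ and, as a conservative choice that is an assumption of this sketch, the step size $\mu = 1/\|A\|_2^2$ (squared spectral norm). Function names and the zero initialization are illustrative.

```python
import numpy as np

def soft_threshold(u, thresh):
    """Shrink every entry of u toward zero by `thresh`."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def ista(v, a, lam, num_iters=200):
    """Iterative soft thresholding for v = z @ a + noise."""
    mu = 1.0 / np.linalg.norm(a, 2) ** 2  # assumed step size
    z = np.zeros(a.shape[0])
    for _ in range(num_iters):
        # Gradient step (mu * v a^T + z (I - mu a a^T)), then threshold lam * mu.
        z = soft_threshold(z + mu * (v - z @ a) @ a.T, lam * mu)
    return z
```

Each loop iteration matches one pass through the block diagram: the two linear operators, the addition, and the soft thresholding with threshold $\lambda\mu$.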
ISTA CONVERGENCE
• Reconstruction error as a function of the number of iterations
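LEARNED ISTA (LISTA)
• $Z \in \Upsilon$, $V \in \mathbb{R}^d$, $V = ZA + E$.
• $X$ and $S$ are learned linear operators [Gregor & LeCun, 2010].
• $\psi$ is the soft thresholding operation with threshold $\lambda$.
• The output $\hat{Z}$ is an estimate of $Z$.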
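The LISTA forward pass can be sketched as a fixed number of layers $\hat{Z}_{t+1} = \psi(V X + \hat{Z}_t S)$. In the sketch below nothing is trained: `ista_init` is a hypothetical helper that sets $X = \mu A^T$, $S = I - \mu A A^T$ and the threshold to $\lambda\mu$, so the untrained network reproduces the corresponding number of ISTA iterations started from zero. The step size $\mu = 1/\|A\|_2^2$ and all names are assumptions of this sketch; in LISTA proper, $X$, $S$ and the threshold would then be learned from data.

```python
import numpy as np

def soft_threshold(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def lista_forward(v, x, s, thresh, num_layers):
    """Unrolled network: z <- psi(v X + z S), starting from z = 0."""
    z = soft_threshold(v @ x, thresh)
    for _ in range(num_layers - 1):
        z = soft_threshold(v @ x + z @ s, thresh)
    return z

def ista_init(a, lam):
    """ISTA-equivalent initialization of (X, S, threshold)."""
    mu = 1.0 / np.linalg.norm(a, 2) ** 2
    return mu * a.T, np.eye(a.shape[0]) - mu * a @ a.T, lam * mu
```

Starting training from this initialization means the learned network can only match or improve on truncated ISTA on the training distribution, at least in terms of the training loss.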
LISTA CONVERGENCE
• Replacing $\mu A^T$ and $I - \mu A A^T$ in ISTA with the learned $X$ and $S$, respectively, improves convergence [Gregor & LeCun, 2010].
• Extensions to other models [Sprechmann, Bronstein & Sapiro, 2015], [Remez, Litani & Bronstein, 2015], [Tompson, Schlachter, Sprechmann & Perlin, 2016].
[Figure: LISTA convergence plot; annotated values 100 and 20.]
PROJECTED GRADIENT DESCENT (PGD)
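• $f(Z) \le R$, $V \in \mathbb{R}^d$, $V = ZA + E$.
• $\psi$ projects onto the set $\Upsilon = \{\tilde{Z} : f(\tilde{Z}) \le R\}$.
• The linear operators are $\mu A^T$ and $I - \mu A A^T$; $\mu$ is the step size.
• The output $\hat{Z}$ is an estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\|$ s.t. $f(\tilde{Z}) \le R$.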
THEORY FOR PGD
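• Theorem 8: Let $Z \in \mathbb{R}^d$, $f: \mathbb{R}^d \to \mathbb{R}$ a proper function, $f(Z) \le R$, $C_f(x)$ the tangent cone of $f$ at point $x$, $A \in \mathbb{R}^{d \times m}$ a random Gaussian matrix and $V = ZA + E$. Then the estimate of PGD at iteration $t$, $\hat{Z}_t$, obeys
$$\|\hat{Z}_t - Z\| \le (\kappa_f \rho)^t \|Z\|,$$
where $\rho = \sup_{U, W \in C_f(x) \cap \mathcal{B}^d} U (I - \mu A A^T) W^T$ and $\kappa_f = 1$ if $f$ is convex and $\kappa_f = 2$ otherwise. [Oymak, Recht & Soltanolkotabi, 2016].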
PGD CONVERGENCE RATE
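• $\rho = \sup_{U, W} U (I - \mu A A^T) W^T$ is the convergence rate of PGD.
• Let $\omega$ be the Gaussian mean width of $C_f(x) \cap \mathcal{B}^d$.
• If $\mu = \frac{1}{(\sqrt{m} + \sqrt{d})^2} \simeq \frac{1}{d}$ then $\rho = 1 - O\left(\frac{\sqrt{m} - \omega}{\sqrt{m} + \sqrt{d}}\right)$.
• If $\mu = \frac{1}{m}$ then $\rho = O\left(\frac{\omega}{\sqrt{m}}\right)$.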
INACCURATE PROJECTION
• PGD iterations project onto $\Upsilon = \{Z : f(Z) \le R\}$.
• Smaller $\Upsilon$ ⇒ smaller $\omega$ ⇒ faster convergence, as $\rho = 1 - O\left(\frac{\sqrt{m} - \omega}{\sqrt{m} + \sqrt{d}}\right)$ or $O\left(\frac{\omega}{\sqrt{m}}\right)$.
• Let us assume that our signal belongs to a smaller set $\hat{\Upsilon} = \{Z : \hat{f}(Z) \le \hat{R}\}$ with $\hat{\omega} \ll \omega$.
• Ideally, we would like to project onto $\hat{\Upsilon}$ instead of $\Upsilon$.
• This will lead to faster convergence.
• What if such a projection is not feasible?
INACCURATE PROJECTION
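• We will estimate the projection onto $\hat{\Upsilon}$ by
• A linear projection $P$
• Followed by a projection onto $\Upsilon$
• Assumptions:
• $\|P(Z) - Z\| \le \epsilon$
• $\|\wp_{C_{\hat{f}}(Z)}(U) - \wp_{C_f(ZP)}(UP)\| \le \epsilon$, $\forall U \in \mathbb{R}^d$, where $\wp_{C_{\hat{f}}(Z)}(U)$ is the projection of $U$ onto the tangent cone of $\hat{f}$ at point $Z$ and $\wp_{C_f(ZP)}(UP)$ is the projection of $UP$ onto the tangent cone of $f$ at point $ZP$.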
INACCURATE PGD (IPGD)
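• $V \in \mathbb{R}^d$, $V = ZA + E$.
• $\psi$ projects onto the set $\Upsilon = \{\tilde{Z} : f(\tilde{Z}) \le R\}$.
• The linear operators are $\mu A^T P$ and $(I - \mu A A^T) P$; $\mu$ is the step size.
• The output $\hat{Z}$ is an estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\|$ s.t. $f(\tilde{Z}) \le R$.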
THEORY FOR IPGD
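• Theorem 9: Let $Z \in \mathbb{R}^d$, $f: \mathbb{R}^d \to \mathbb{R}$ a proper function, $f(Z) \le R$, $C_f(x)$ the tangent cone of $f$ at point $x$, $A \in \mathbb{R}^{d \times m}$ a random Gaussian matrix and $V = ZA + E$. Then the estimate of IPGD at iteration $t$, $\hat{Z}_t$, obeys
$$\|\hat{Z}_t - Z\| \le \left( \left(\kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)\right)^t + \frac{1 - \left(\kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)\right)^t}{1 - \kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)} \epsilon \right) \|Z\|,$$
where $\hat{\rho} = \sup_{U, W \in C_{\hat{f}}(x) \cap \mathcal{B}^d} U (I - \mu A A^T) W^T$, $\gamma = \|I - \mu A A^T\|$ and $\kappa_{\hat{f}}$ is as in Theorem 8. [Giryes, Eldar, Bronstein & Sapiro, 2016]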
CONVERGENCE RATE COMPARISON
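• PGD: $(\kappa_f \rho)^t$
• IPGD:
$$\left(\kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)\right)^t + \frac{1 - \left(\kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)\right)^t}{1 - \kappa_{\hat{f}} (\hat{\rho} + \epsilon \gamma)} \epsilon \overset{(a)}{\simeq} (\kappa_{\hat{f}} \hat{\rho})^t + \epsilon \overset{(b)}{\ll} (\kappa_f \rho)^t$$
(a) Assuming that $\epsilon$ is negligible compared to $\hat{\rho}$.
(b) For small values of $t$.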
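The trade-off between the two bounds can be seen by evaluating them for a few iteration counts. The sketch below simply codes the two expressions from this slide; the function names and the numbers in the usage lines ($\rho = 0.9$, $\hat{\rho} = 0.5$, $\epsilon = 0.01$, $\gamma = 1$, $\kappa = 1$) are hypothetical values chosen for illustration, not results from the tutorial.

```python
def pgd_bound(kappa, rho, t):
    """Relative error bound (kappa * rho)^t of PGD after t iterations."""
    return (kappa * rho) ** t

def ipgd_bound(kappa, rho_hat, eps, gamma, t):
    """Relative error bound of IPGD after t iterations (requires r != 1)."""
    r = kappa * (rho_hat + eps * gamma)
    return r ** t + (1.0 - r ** t) / (1.0 - r) * eps

# Hypothetical numbers: IPGD wins early, PGD wins asymptotically.
early_pgd, early_ipgd = pgd_bound(1, 0.9, 5), ipgd_bound(1, 0.5, 0.01, 1.0, 5)
late_pgd, late_ipgd = pgd_bound(1, 0.9, 200), ipgd_bound(1, 0.5, 0.01, 1.0, 200)
```

For small $t$ the IPGD bound is far below the PGD bound, while for large $t$ it saturates near $\epsilon / (1 - \kappa_{\hat{f}}(\hat{\rho} + \epsilon\gamma))$ and the PGD bound keeps decreasing.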
MODEL BASED COMPRESSED SENSING
• $\hat{\Upsilon}$ is the set of sparse vectors with sparsity patterns that obey a tree structure.
• Projecting onto $\hat{\Upsilon}$ improves the convergence rate compared to projecting onto the set of sparse vectors $\Upsilon$ [Baraniuk et al., 2010].
• The projection onto $\hat{\Upsilon}$ is more demanding than onto $\Upsilon$.
• Note that the probability of selecting atoms from lower tree levels is smaller than from upper ones.
• $P$ will be a projection onto certain tree levels, zeroing the values at lower levels.
[Figure: tree with selection probability 1 at the root, 0.5 at the second level and 0.25 at the third level.]
MODEL BASED COMPRESSED SENSING
The non-zero entries that are picked have a zero-mean Gaussian distribution with variance:
• 1 at the first two levels
• $0.1^2$ at the third level
• $0.01^2$ at the rest of the levels
SPECTRAL COMPRESSED SENSING
• $\hat{\Upsilon}$ is the set of vectors with sparse representation in a 4-times redundant DCT dictionary such that:
• The active atoms are selected uniformly at random such that the minimum distance between neighboring atoms is 5.
• The value of each representation coefficient is $\sim N(0, 1)$ i.i.d.
• We set the neighboring coefficients at distance 1 and 2 of each active atom to be $\sim N(0, 0.1^2)$ and $\sim N(0, 0.01^2)$, respectively.
• We set $P$ to be a pooling-like operation that keeps in each window of size 5 only the largest value.
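The pooling-like operation $P$ can be sketched as below. The slide does not say whether windows overlap or whether "largest" refers to magnitude; this sketch assumes non-overlapping windows and picks the largest-magnitude entry, and the function name is hypothetical.

```python
import numpy as np

def keep_window_max(z, window=5):
    """Zero all but the largest-magnitude entry in each non-overlapping window."""
    out = np.zeros_like(z)
    for start in range(0, len(z), window):
        block = z[start:start + window]
        j = start + int(np.argmax(np.abs(block)))
        out[j] = z[j]
    return out
```

Since the small neighboring coefficients at distance 1 and 2 are dominated by the active atom, this operation removes them while keeping one dominant coefficient per window.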
SPECTRAL COMPRESSED SENSING
LEARNING THE PROJECTION
• If we have no explicit information about the set $\hat{\Upsilon}$, it might be desirable to learn the projection.
• Instead of learning $P$, it is possible to replace $(I - \mu A A^T) P$ and $\mu A^T P$ with two learned matrices $S$ and $X$, respectively.
• This leads to a scheme very similar to that of LISTA and provides a theoretical foundation for the success of LISTA.
LEARNED IPGD
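• $V \in \mathbb{R}^d$, $V = ZA + E$.
• $\psi$ projects onto the set $\Upsilon = \{\tilde{Z} : f(\tilde{Z}) \le R\}$.
• $X$ and $S$ are learned linear operators.
• $\mu$ is the step size.
• The output $\hat{Z}$ is an estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\|$ s.t. $f(\tilde{Z}) \le R$.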
LISTA
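• $V \in \mathbb{R}^d$, $V = ZA + E$.
• $\psi$ is a proximal mapping: $\psi(U) = \arg\min_{\tilde{Z}} \|U - \tilde{Z}\| + \lambda f(\tilde{Z})$.
• $X$ and $S$ are learned linear operators.
• $\mu$ is the step size.
• The output $\hat{Z}$ is an estimate of $Z$. It aims at solving $\min_{\tilde{Z}} \|V - \tilde{Z} A\| + \lambda f(\tilde{Z})$.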
LISTA MIXTURE MODEL
• Approximation of the projection onto $\Upsilon$ with one linear projection may not be accurate enough.
• This requires more LISTA layers/iterations.
• Instead, one may use several LISTA networks, where each approximates a different part of $\Upsilon$.
• Training 18 LISTA networks, each with 3 layers, provides the same accuracy as 1 LISTA network with 10 layers.
RELATED WORKS
• In [Bruna et al. 2016] a different route is taken to explain the faster convergence of LISTA. It is shown that learning may give a gain due to better preconditioning of $A$.
Take Home Message
• DNN keep the important information of the data.
• Gaussian mean width is a good measure for the complexity of the data.
• Random Gaussian weights are good for classifying the average points in the data.
• Important goal of training: classify the boundary points between the different classes in the data.
• Generalization error depends on the DNN input margin.
• Deep learning can be viewed as metric learning.
• DNN may solve optimization problems.
ACKNOWLEDGEMENTS
Guillermo Sapiro
Duke University
Robert Calderbank
Duke University
Qiang Qiu
Duke University
Jiaji Huang
Duke University
Alex M. Bronstein
Technion
Jure Sokolic
University College London
Miguel Rodrigues
University College London
Yonina C. Eldar
Technion
QUESTIONS?
WEB.ENG.TAU.AC.IL/~RAJA
REFERENCES
R. GIRYES, G. SAPIRO, A. M. BRONSTEIN, DEEP NEURAL NETWORKS WITH RANDOM GAUSSIAN WEIGHTS: A UNIVERSAL CLASSIFICATION STRATEGY?
J. HUANG, Q. QIU, G. SAPIRO, R. CALDERBANK, DISCRIMINATIVE GEOMETRY-AWARE DEEP TRANSFORM
J. HUANG, Q. QIU, G. SAPIRO, R. CALDERBANK, DISCRIMINATIVE ROBUST TRANSFORMATION LEARNING
J. SOKOLIC, R. GIRYES, G. SAPIRO, M. R. D. RODRIGUES, MARGIN PRESERVATION OF DEEP NEURAL NETWORKS
R. GIRYES, Y. C. ELDAR, A. M. BRONSTEIN, G. SAPIRO, TRADEOFFS BETWEEN CONVERGENCE SPEED AND RECONSTRUCTION ACCURACY IN INVERSE PROBLEMS