TRANSCRIPT
Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks
Celebrating 75 Years of Mathematics of Computation, ICERM
Nov 2, 2018
Adam Oberman, McGill Dept of Math and Stats
supported by NSERC, Simons Fellowship, AFOSR FA9550-18-1-0167
Background: AI
• Artificial Intelligence is loosely defined as intelligence exhibited by machines
• Operationally: R&D in CS academic sub-disciplines: Computer Vision, Natural Language Processing (NLP), Robotics, etc
AlphaGo uses DL to beat world champion at Go
• AI: specific tasks.
• AGI: general cognitive abilities.
• AGI is a small research area within AI: build machines that can successfully perform any task that a human might do
• So far, no progress on AGI.
Artificial General Intelligence (AGI)
Deep Learning vs. traditional Machine Learning
• Machine Learning (ML) has been around for some time.
• Deep Learning is a newer branch of ML which uses Deep Neural Networks.
• ML has theory: error estimates and convergence proofs.
• DL has less theory, but it can effectively solve substantially larger-scale problems.
• ImageNet: total number of classes: m = 21,841
• Total number of images: n = 14,197,122
• Color images: d = 3 × 256 × 256 = 196,608
Facebook used 256 GPUs, working in parallel, to train ImageNet.
Still an academic dataset; the total number of images on Facebook is much larger.
What are DNNs (in Math Language)?
What is the function? Looking for a map from images to labels.
Figure: x in M = manifold of images; f(x) maps into the graph of the list of word labels.
Doing the impossible?
In theory, due to the curse of dimensionality, it is impossible to accurately interpolate a high-dimensional function.
In practice, it is possible using a Deep Neural Network architecture, trained to fit the data with SGD. However, we don't know why it works.
We can train a computer to caption images more accurately than humans.
Loss Functions versus Error
Classification problem: map image to discrete label space {1,2,3,…,10}
In practice: map to a probability vector, then assign the label of the arg max.
Classification error is not differentiable, so in order to train we use a loss function as a surrogate.
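For concreteness, a minimal sketch of the surrogate idea, assuming a standard cross-entropy loss (the talk does not fix the particular surrogate):

```python
import torch
import torch.nn.functional as F

# f(x; w) outputs class scores (logits); softmax turns them into a probability vector.
logits = torch.tensor([[2.0, 0.5, -1.0]])   # scores for 3 classes
label  = torch.tensor([0])                   # correct class index

# Classification (0-1) error: the arg max, not differentiable in the scores.
pred = logits.argmax(dim=1)
error = (pred != label).float().mean()

# Cross-entropy surrogate: differentiable, so it can be used for training.
loss = F.cross_entropy(logits, label)
print(error.item(), loss.item())
```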
DNNs in Math Language: high dimensional function fitting
Data fitting problem: f is a parameterized map from images to probability vectors on labels; y is the correct label. We try to fit the data by minimizing the loss.
Training: minimize expected loss, by taking stochastic (approximate) gradients
Note: train on an empirical distribution sampled from the density rho.
$$\min_w \ \mathbb{E}_{x\sim\rho_n}\,\ell\big(f(x;w),\, y(x)\big) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;w),\, y(x_i)\big)$$
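A minimal sketch of this setup as a stochastic-gradient training loop; the model, data, and hyperparameters below are placeholders for illustration, not those used in the talk:

```python
import torch
import torch.nn as nn

# Placeholder model f(x; w) and synthetic data standing in for the pairs (x_i, y(x_i)).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
X = torch.randn(1000, 20)            # n = 1000 samples from the empirical distribution rho_n
y = torch.randint(0, 10, (1000,))    # labels y(x_i)

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 32):               # mini-batches of size 32
        idx = perm[i:i + 32]
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])    # stochastic estimate of the expected loss
        loss.backward()                          # approximate (mini-batch) gradient
        opt.step()                               # w <- w - h * gradient
```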
Generalization. Training set and test set
Goal: generalization: hope that training error is a good estimate of the generalization loss, which is the expected loss on unseen images drawn from the same distribution.
$$\mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\, y(x)\big) \;=\; \int \ell\big(f(x;w),\, y(x)\big)\, d\rho(x)$$
Testing: reserve some data and approximate the generalization loss/error on the test data, which is a surrogate for the true expected error on the full density.
$$\mathbb{E}_{x\sim\rho_{\mathrm{test}}}\,\ell\big(f(x;w),\, y(x)\big) \;=\; \frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}} \ell\big(f(x_i;w),\, y(x_i)\big)$$
Figure: orange curve: overtrained; green curve: better generalization.
Challenges for deep learning
“It is not clear that the existing AI paradigm is immediately amenable to any sort of software engineering validation and verification. This is a serious issue, and is a potential roadblock to DoD’s use of these modern AI systems, especially when considering the liability and accountability of using AI”
JASON report
Mary Shaw’s evolution of software engineering discipline
Better theory improves reliability, and the discipline evolves.
Entropy-SGD
Pratik Chaudhari UCLA (now Amazon/U Penn)
Stefano Soatto UCLA Comp Sci.
Stanley Osher, UCLA Math
Guillaume Carlier, CEREMADE, U. Paris IX Dauphine
• 2017: UCLA PhD student (at time of research)
• 2018 (present): Amazon research
• Fall 2019: Faculty in ESE at U Penn
Deep Relaxation: partial differential equations for optimizing deep neural networks. Pratik Chaudhari, Adam M. Oberman, Stanley Osher, Stefano Soatto, Guillaume Carlier, 2017.
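A schematic sketch of the Entropy-SGD update, as I read the paper above: an inner Langevin (SGLD) loop estimates the mean of a local Gibbs measure around the current weights, and the outer step moves the weights toward that mean, i.e. along the gradient of the smoothed (local-entropy) loss. The loop length and step-size constants below are illustrative placeholders, not the paper's settings.

```python
import torch

def entropy_sgd_step(w, stochastic_grad, eta=0.1, gamma=1e-4,
                     L=20, eta_prime=0.1, eps=1e-4, alpha=0.75):
    """One outer step of a sketch of Entropy-SGD.

    w               : parameter tensor
    stochastic_grad : function returning a mini-batch gradient at a point
    """
    x = w.clone()
    mu = w.clone()
    for _ in range(L):                        # inner SGLD loop around w
        g = stochastic_grad(x)
        noise = torch.randn_like(x)
        x = x - eta_prime * (g + gamma * (x - w)) + eps * noise
        mu = alpha * mu + (1 - alpha) * x     # running average: estimate of the local Gibbs mean
    return w - eta * gamma * (w - mu)         # outer step toward the local mean
```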
Entropy SGD results in Deep Neural Networks (Pratik)
Visualization of the improvement in training loss (left) and the improvement in validation error (right).
dimension = 1.67 million
Different Interpretation: Regularization using Viscous Hamilton-Jacobi PDE
Solution of the PDE in one dimension. Cartoon: the algorithm only solves the PDE for a time depending on Hf(x).
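In the inviscid limit, the Hamilton-Jacobi solution is given by the Hopf-Lax formula u(x,t) = min_y [ f(y) + |x - y|^2 / (2t) ], a Moreau envelope of the loss; a minimal 1-D numerical sketch of this smoothing (the wiggly test function is made up for illustration):

```python
import numpy as np

xs = np.linspace(-3, 3, 601)
f = 0.1 * xs**2 + 0.3 * np.cos(8 * xs)        # non-convex 1-D "loss" for illustration

def hopf_lax(f, xs, t):
    # u(x, t) = min_y [ f(y) + (x - y)^2 / (2 t) ], evaluated on the grid
    diff2 = (xs[:, None] - xs[None, :]) ** 2
    return np.min(f[None, :] + diff2 / (2 * t), axis=1)

u_short = hopf_lax(f, xs, 0.05)   # short time: local wiggles are smoothed slightly
u_long  = hopf_lax(f, xs, 0.5)    # longer time: stronger smoothing of the landscape
```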
Expected Improvement Theorem in continuous time
Adaptive-SGD, joint with PhD student Mariana Prazeres
mini-batch: randomly choose a smaller set I of points from the data.
Full gradient: cost is N. Mini-batch gradient: cost is |I|.
Model for mini-batch gradients: k = 1 means.
For k > 1 means, the same calculation applies if we restrict to the active indices. This leads to smaller active batch sizes and higher variance.
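A minimal NumPy sketch of the cost difference, reading the k = 1 means model as the average of squared distances to the data points (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 2
Y = rng.normal(size=(N, d))                    # data points y_i
x = np.array([2.0, -1.0])                      # current iterate

# k = 1 means model: f(x) = (1/N) * sum_i 0.5 * ||x - y_i||^2
full_grad = x - Y.mean(axis=0)                 # exact gradient, cost ~ N

I = rng.choice(N, size=64, replace=False)      # mini-batch of size |I| = 64
mb_grad = x - Y[I].mean(axis=0)                # unbiased but noisy, cost ~ |I|
```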
Motivation: the quality of the mini-batch gradient g^MB depends on x.
• Far from x*, mini-batch gradients point in a good direction.
• Near x*, require more samples, or small steps (so that directions average in time).
Adapt in Space instead of Time
The ideal learning rate/batch size combination should depend on x (space) rather than k (time).
Figure labels: use MB = 10; use MB = 60.
Adaptive SGD
• Adaptively, depending on x, decide on:
• MB size, or
• learning rate.
• Use the following formula (derived later)
- f large: learning rate large (OK to use small MB)
- g small: var(MB) restricts the learning rate (so use large MB)
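The formula itself is not reproduced in the transcript; the sketch below uses an illustrative rule of the same flavor (a large loss value allows a large step, a large mini-batch variance shrinks it or calls for a larger batch), purely as an assumption about where an x-dependent choice would enter:

```python
import numpy as np

def adaptive_choice(mb_grads, f_value, h_max=1.0, c=1.0):
    """Illustrative x-dependent rule (an assumption, not the talk's derived formula)."""
    g = mb_grads.mean(axis=0)                                    # averaged mini-batch gradient
    signal = float(np.dot(g, g))                                 # |g|^2
    noise = float(mb_grads.var(axis=0).sum()) / len(mb_grads)    # variance of the averaged gradient
    h = min(h_max, c * f_value / (1.0 + noise))                  # f large -> large step; noisy -> small step
    grow_batch = noise > signal                                  # alternatively, take more samples
    return h, grow_batch
```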
Benchmarks: Fix MB and adapt h
Figure: f(x) - f* versus iteration (log scale), comparing Scheduled, Scheduled 1/t, and Adaptive. Left: a typical run. Right: averaged over 40 runs.
Left: not too clear what is happening with one path. Right: average over several runs to see the trend.
Figure: paths of Scheduled and Adaptive SGD.
The variance of the paths is clear from this figure
Proof of Convergence with Rate
The rate is the same order as SGD, but with a better constant.
Proof of Convergence and Generalization for Lipschitz Regularized DNNs, joint with Jeff Calder
Lipschitz regularized Deep Neural Networks converge and generalize O. and Jeff Calder; 2018
Background on the generalization/convergence result
Problem: traditional ML theory does not apply to Deep Learning
A new idea is needed to make Deep Learning more reliable.
“Understanding Deep Learning requires rethinking generalization” Zhang et al. (2016)
Inspiration: an old idea: Total Variation Denoising [1992], Rudin-Osher-Fatemi, used in early, high-profile image reconstruction of video images.
Stanley Osher
• Minimize a variational functional: a combination of a loss term (fidelity to the original noisy image) and a regularization term.
• Regularization is large on noise, small on images
Regularization: from images to maps
Figure: x in M = manifold of images; f(x) maps to word labels.
the learned map: well-behaved on data manifold, but very bad off the manifold (without regularization)
What is new in our result?
• Bartlett proved generalization under the assumption of Lipschitz regularity.
• However, DNNs are not uniformly Lipschitz.
• By adding regularization to the objective function in training, we obtain uniform Lipschitz bounds.
Approaches to regularization
A. Machine Learning: learn data using an appropriate (smooth) parameterized class of functions
B. Algorithmic: use an algorithm which selects the best solution (e.g. Stochastic Gradient Descent as a regularizer, adversarial training)
C. Inverse problems: allow for a broad class of functions, but modify the loss to choose the right one
$$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\, y(x)\big)$$
$$w_{k+1} \;=\; w_k - h_k\, \nabla^{MB}\ell(\dots, w_k)$$
$$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\, y(x)\big) + \lambda\, \|\nabla_x f\|_{L^p(X,\rho)}$$
Comment for math audience
• Our result may not be surprising to math experts.
• However, it is a new approach to generalization theory.
• Speaking personally, the hard work was giving a math interpretation of the problem (1.5 years)
• Once the model was set up correctly, and we realized we could implement it in a DNN, the math was relatively easy.
• Paper and proof was done in about 6 weeks.
Convergence: two cases
Clean labels:
• relevant in benchmark data sets and applications
• simpler proof, since the clean label function is a minimizer
• regime of perfect data interpolation possible with DNNs
Noisy labels:
• relevant in applications
• familiar setting for calculus of variations
Statement and proof sketch of the generalization/convergence result
Lipschitz Regularization of DNNs
Data function: augment the expected loss function on the data with a Lipschitz regularization term
where L0 is the Lipschitz constant of the data, and n is the number of data points. We are interested in the limit as we sample more points. The limiting functional is given by:
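The formulas themselves did not survive in the transcript; a plausible sketch, assuming the regularizer penalizes the excess of the Lipschitz constant of f over the data constant L0 (the exact form in the paper may differ):

$$J_n[f] \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i),\, y(x_i)\big) \;+\; \lambda\,\big(\operatorname{Lip}(f) - L_0\big)^{+},
\qquad
J[f] \;=\; \mathbb{E}_{x\sim\rho}\,\ell\big(f(x),\, y(x)\big) \;+\; \lambda\,\big(\operatorname{Lip}(f) - L_0\big)^{+}.$$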
Convergence theorem for Noisy Labels
Convergence on the data manifold; Lipschitz regularity off the manifold.
Convergence for clean labels (with a rate)
• Rate of convergence, on the data manifold, of the minimizers.
• The rate depends on n, the number of data points sampled, and m, the number of labels.
• Probabilistic bound: we obtain a given error with high probability.
• Uniform sampling vs. random sampling: the log term and the probability go away.
Proof
Generalization follows
Lipschitz Regularization improves Adversarial Robustness
Chris Finlay (current PhD student)
Improved robustness to adversarial examples using Lipschitz regularization of the loss
Chris Finlay, O., Bilal Abbasi; Oct 2018; arxiv
Bilal Abbasi (former PhD now working in AI)
Adversarial Attacks
Adversarial Attacks on the loss
Scale measures visible attacks
DNNs are vulnerable to attacks which are invisible to the human eye. Undefended networks have a 100% error rate at 0.1 (in the max norm).
Implementation of Lipschitz Regularization of the Loss
Robustness bounds from the Lipschitz constant of the Loss
So training the model to have a better Lipschitz constant will improve the adversarial robustness bounds.
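A minimal PyTorch sketch of one way to implement this: penalize the norm of the gradient of the loss with respect to the input during training. This is a generic gradient-penalty implementation; the exact penalty, norm, and finite-difference details of the paper may differ.

```python
import torch

def lipschitz_regularized_loss(model, loss_fn, x, y, lam=0.1):
    """loss + lam * E[ ||grad_x loss|| ]: a sketch of Lipschitz regularization of the loss."""
    x = x.clone().requires_grad_(True)                           # x is a batch of inputs
    loss = loss_fn(model(x), y)
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)    # gradient w.r.t. the input
    penalty = grad_x.flatten(1).norm(dim=1).mean()               # average 2-norm over the batch
    return loss + lam * penalty

# Usage inside an ordinary training loop (illustrative):
#   total = lipschitz_regularized_loss(model, torch.nn.CrossEntropyLoss(), x_batch, y_batch)
#   total.backward(); optimizer.step()
```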
Arms race of attack methods and defences
We tested against a toolbox of attacks and plotted the error curve as a function of the adversarial perturbation size.
Strongest attacks:
1. Iterative l2-projected gradient
2. Iterative Fast Gradient Sign Method (FGSM)
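For reference, a minimal sketch of the single-step version of the second attack (FGSM: one signed-gradient step of size eps in the max norm); the iterative variants used in the benchmark repeat such steps with projection:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps=0.1):
    """One FGSM step: perturb x by eps in the direction sign(grad_x loss)."""
    x_adv = x.clone().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()
```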
Adversarial Training: interpretation as regularization
Adversarial Training augmented with Lipschitz Regularization
AT + Tulip Results (2-norm)
Significant improvements over state-of-the-art results come from augmenting AT with Lipschitz regularization.
AT + Tulip Results (2-norm vs max-norm)
2-Lip > AT-2 > AT-1 > baseline (for all noise levels on both datasets)
Other current areas of interest in AI with connections to mathematics
We are looking for collaborators: these are possible new projects
Reinforcement Learning
• Related to dynamic programming
• Computationally intensive and unstable
Related math: dynamic programming, optimal control
Recurrent NN
Generative Networks (GANs)
Wasserstein GANs: optimal transportation (OT) mapping between random noise (Gaussians) and a target distribution of images.
Related math: Optimal Transportation algorithms and convergence (Peyré-Cuturi).
Squeeze Nets
Inference (evaluating the data and assigning a label) is costly (typically 0.1 second on a power-hungry, high-memory GPU) in terms of:
• Memory (to store the weights)
• Computation (multiplying the matrices times the vectors)
• Power (the energy, in Joules, used by the chip)
• Time
Research effort to make lean NNs. How?
• Quantization: low-bit number representation and arithmetic (related math: non-smooth optimization, when the ReLUs are also quantized)
• Pruning: trim off the small weights, and retrain
• Hyperparameter optimization: train over multiple architectures and params
Mostly an engineering effort, but could be combined with more math on the training.