CENG 783 On Optimization — user.ceng.metu.edu.tr/~emre/resources/courses/AdvancedDL... · 2020. 9. 22.
TRANSCRIPT
CENG 793
On Machine Learning and Optimization
Sinan Kalkan
Now
• Introduction to ML
– Problem definition
– Classes of approaches
– K-NN
– Support Vector Machines
– Softmax classification / logistic regression
– Parzen Windows

• Optimization
– Gradient Descent approaches
– A flavor of other approaches
Introduction to Machine Learning
Uses many figures and material from the following website:
http://www.byclb.com/TR/Tutorials/neural_networks/ch1_1.htm
Overview

• Training: data with labels (label: human, animal, etc.) → extract features (size, texture, color, histogram of oriented gradients, SIFT, etc.) → “learn” → learned models or classifiers.
• Testing: test input → extract features → predict with the learned models → “human”, “animal”, …
Content
• Problem definition
• General approaches
• Popular methods
– kNN
– Linear classification
– Support Vector Machines
– Parzen windows
Problem Definition
• Given:
– Data: a set of instances $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ sampled from a space $\mathbf{X} \subseteq \mathbb{R}^d$.
– Labels: the corresponding label $y_i$ for each $\mathbf{x}_i$, with $y_i \in \mathbf{Y} \subseteq \mathbb{R}$.
• Goal:
– Learn a mapping from the space of data to the space of labels, i.e., $\mathcal{M}: \mathbf{X} \to \mathbf{Y}$, or $y = f(\mathbf{x})$.
Issues in Machine Learning
• Hypothesis space
• Loss / cost / objective function
• Optimization
• Bias vs. variance
• Test / evaluate
• Overfitting, underfitting
General Approaches

• Discriminative: find a separating line (in general, a hyperplane).
• Regression: fit a function to the data.
• Generative: learn a model for each class.
General Approaches (cont’d)

• Generative:
– Regression
– Markov Random Fields
– Bayesian Networks/Learning
– Clustering via Gaussian Mixture Models, Parzen Windows, etc.
– …
• Discriminative:
– Support Vector Machines
– Artificial Neural Networks
– Conditional Random Fields
– K-NN
General Approaches (cont’d)

• Supervised: instances with labels (e.g., apple [“elma”], pear [“armut”]) → extract features → learn a model (e.g., SVM).
• Unsupervised: instances without labels → extract features → learn a model (e.g., k-means clustering).
General Approaches (cont’d)

| | Generative | Discriminative |
| --- | --- | --- |
| Supervised | Regression, Markov Random Fields, Bayesian Networks | Support Vector Machines, Neural Networks, Conditional Random Fields, Decision Tree Learning |
| Unsupervised | Gaussian Mixture Models, Parzen Windows | K-means, Self-Organizing Maps |
Now

| | Generative | Discriminative |
| --- | --- | --- |
| Supervised | | Support Vector Machines, Softmax/Logistic Regression |
| Unsupervised | Parzen Windows | |
Feature Space
General pitfalls/problems
• Overfitting
• Underfitting
• Occam’s razor
K-NEAREST NEIGHBOR
A very simple algorithm
• Advantages:
– No training
– Simple
• Disadvantages:
– Slow at test time (more efficient versions exist)
– Needs a lot of memory
Wikipedia
http://cs231n.github.io/classification/
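The algorithm described above can be sketched in a few lines. This is a minimal illustrative implementation, not code from the slides; the function and variable names are my own.

```python
# Minimal k-NN classifier sketch: no training step, prediction is a
# majority vote among the k nearest training samples.
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3))  # a query point near class 1
```

Note that every prediction scans the whole training set, which is exactly the “slow at test time / needs a lot of memory” drawback listed above.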
PARZEN WINDOWS
A non-parametric method
Probability and density
• A continuous probability density function $p(x)$ satisfies the following:
– Probability to be between two values: $P(a < x < b) = \int_a^b p(x)\, dx$
– $p(x) \geq 0$ for all real $x$.
– And $\int_{-\infty}^{\infty} p(x)\, dx = 1$.
• In 2-D:
– Probability for $\mathbf{x}$ to be inside a region $R$: $P = \int_R p(\mathbf{x})\, d\mathbf{x}$
– $p(\mathbf{x}) \geq 0$ for all real $\mathbf{x}$.
– And $\int p(\mathbf{x})\, d\mathbf{x} = 1$.
Density Estimation
• Basic idea: $P = \int_R p(\mathbf{x})\, d\mathbf{x}$
• If $R$ is small enough that $p(\mathbf{x})$ is almost constant in $R$:
– $P = \int_R p(\mathbf{x})\, d\mathbf{x} \approx p(\mathbf{x}) \int_R d\mathbf{x} = p(\mathbf{x})\, V$
– $V$: volume of region $R$.
• If $k$ out of $n$ samples fall into $R$, then $P = k/n$.
• From which we can write:
$\frac{k}{n} = p(\mathbf{x})\, V \;\Rightarrow\; p(\mathbf{x}) = \frac{k/n}{V}$
Parzen Windows
• Assume that:
– $R$ is a hypercube centered at $\mathbf{x}$.
– $h$ is the length of an edge of the hypercube.
– Then $V = h^2$ in 2D, $V = h^3$ in 3D, etc.
• Let us define the following window function:
$w\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \begin{cases} 1, & |\mathbf{x}_i - \mathbf{x}|/h < 1/2 \\ 0, & \text{otherwise} \end{cases}$
• Then, we can write the number of samples falling into $R$ as follows:
$k = \sum_{i=1}^{n} w\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)$
Parzen Windows (cont’d)
• Remember: $p(\mathbf{x}) = \frac{k/n}{V}$.
• Using this definition of $k$, we can rewrite $p(\mathbf{x})$ (in 2D):
$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^2}\, w\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)$
• Interpretation: the density estimate is the sum of window functions centered at each sample!
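The estimator above can be written directly in code. A sketch following the slide’s formula $p(\mathbf{x}) = \frac{1}{n}\sum_i \frac{1}{h^d} w((\mathbf{x}_i - \mathbf{x})/h)$; the function and variable names are my own.

```python
# Parzen-window density estimate with a hypercube window.
import numpy as np

def parzen_hypercube(X, x, h):
    """Estimate p(x) from samples X (n x d) using a hypercube of edge h."""
    n, d = X.shape
    inside = np.all(np.abs((X - x) / h) < 0.5, axis=1)  # window function per sample
    k = np.sum(inside)                                   # samples falling into the cube
    return k / (n * h**d)                                # (k/n)/V with V = h^d

X = np.array([[0.0], [0.1], [0.2], [2.0]])               # four 1-D samples
print(parzen_hypercube(X, np.array([0.1]), h=1.0))       # 3 of 4 samples inside → 0.75
```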
Parzen Windows (cont’d)
• The type of the window function can add different flavors.
• If we use a Gaussian for the window function:
$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}\|^2}{2\sigma^2}\right)$
• Interpretation: Fit a Gaussian to each sample.
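The Gaussian-window version of the estimator, in the 1-D case, is a one-liner. A sketch matching the slide’s formula; names are my own.

```python
# Parzen-window density estimate with a Gaussian window (1-D):
# p(x) = (1/n) Σ_i 1/(σ√(2π)) exp(−(x_i − x)² / (2σ²)).
import numpy as np

def parzen_gaussian_1d(X, x, sigma):
    z = (X - x) / sigma
    return np.mean(np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi)))

X = np.array([-1.0, 0.0, 1.0])
# The estimate is a smooth bump around each sample; smaller sigma → spikier estimate.
print(parzen_gaussian_1d(X, 0.0, sigma=0.5))
```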
LINEAR CLASSIFICATION
Linear classification

• Linear classification relies on the following score function:
$f(x_i; W, b) = W x_i + b$
or
$f(x_i; W, b) = W x_i$
• where the bias is implicitly represented in $W$ and $x_i$.
• $W$ has one row per class.

http://cs231n.github.io/linear-classify/
Linear classification
• We can rewrite the score function as $f(x_i; W, b) = W x_i$,
• where the bias is implicitly represented in $W$ and $x_i$: append a constant 1 to $x_i$ and the bias vector as an extra column of $W$.

http://cs231n.github.io/linear-classify/
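The bias trick is easy to verify numerically. A small sketch with made-up shapes (3 classes, 4 features); the values are random and purely illustrative.

```python
# Score function f(x; W, b) = Wx + b, and the bias trick that folds b into W.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))      # one row of weights per class
b = rng.standard_normal(3)
x = rng.standard_normal(4)

scores = W @ x + b                   # explicit bias

W_ext = np.hstack([W, b[:, None]])   # bias folded in as the last column of W
x_ext = np.append(x, 1.0)            # x extended with a constant 1
scores_trick = W_ext @ x_ext

print(np.allclose(scores, scores_trick))  # the two formulations agree
```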
Linear classification: One interpretation

• Since an image can be thought of as a vector, we can consider images as points in a high-dimensional space.
• One interpretation of $f(x_i; W, b) = W x_i + b$:
– Each row of $W$ describes a line (in general, a hyperplane) for a class, and $b$ sets its offset from the origin.

http://cs231n.github.io/linear-classify/
Linear classification: Another interpretation

• Each row in $W$ can be interpreted as a template of that class.
– $f(x_i; W, b) = W x_i + b$ calculates the inner product with each template to find which one best fits $x_i$.
– Effectively, we are doing Nearest Neighbor with the “prototype” images of each class.

http://cs231n.github.io/linear-classify/
Loss function

• A function which measures how good our weights are.
– Other names: cost function, objective function.
• Let $s_j = f(x_i; W)_j$.
• An example loss function:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
or equivalently:
$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$
• This pushes the score of the correct class to exceed the other class scores by more than $\Delta$ (the margin).

http://cs231n.github.io/linear-classify/
Example

• Consider our scores for $x_i$ to be $s = [13, -7, 11]$, with the first class correct, and assume $\Delta = 10$.
• Then, $L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = 0 + 8 = 8$.

http://cs231n.github.io/linear-classify/
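The worked example above can be checked in code. A sketch of the multiclass hinge loss for a single sample; the function name is my own.

```python
# Multiclass hinge loss for one sample: L_i = Σ_{j≠y} max(0, s_j − s_y + Δ).
import numpy as np

def multiclass_hinge(s, y, delta):
    margins = np.maximum(0, s - s[y] + delta)  # per-class margin violations
    margins[y] = 0                              # skip the correct class j = y
    return margins.sum()

# Scores [13, −7, 11], correct class 0, margin Δ = 10: only the third class
# violates the margin (11 − 13 + 10 = 8), so the loss is 8.
print(multiclass_hinge(np.array([13.0, -7.0, 11.0]), y=0, delta=10.0))  # → 8.0
```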
Regularization

• In practice, there are many possible solutions leading to the same loss value.
– Based on the requirements of the problem, we might want to penalize certain solutions.
• E.g.,
$R(W) = \sum_i \sum_j W_{i,j}^2$
– which penalizes large weights.
• Why do we want to do that?

http://cs231n.github.io/linear-classify/
Combined Loss Function

• The loss function becomes: $L = \frac{1}{N} \sum_i L_i + \lambda R(W)$
• If you expand it:
$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\!\left(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_i \sum_j W_{i,j}^2$
• $\Delta$ and $\lambda$ are hyperparameters (estimated using a validation set).

http://cs231n.github.io/linear-classify/
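The full objective above, averaged over a dataset, is straightforward to vectorize. A sketch with random illustrative data; function names and hyperparameter values are my own.

```python
# Full training objective: mean multiclass hinge loss plus L2 regularization.
import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=0.1):
    scores = X @ W.T                                   # (N, C) class scores
    correct = scores[np.arange(len(y)), y][:, None]    # score of the true class
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0                  # exclude j = y_i
    data_loss = margins.sum(axis=1).mean()             # (1/N) Σ_i L_i
    reg_loss = lam * np.sum(W**2)                      # λ Σ W²
    return data_loss + reg_loss

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 4))
y = np.array([0, 1, 2, 0, 1])
print(svm_loss(W, X, y))                               # a single scalar loss value
```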
Hinge Loss, or Max-Margin Loss

$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\!\left(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_i \sum_j W_{i,j}^2$

http://cs231n.github.io/linear-classify/
Interactive Demo
http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
SUPPORT VECTOR MACHINES
An alternative formulation of the max-margin classifier.

Borrowed mostly from the slides of:
- Machine Learning Group, University of Texas at Austin
- Mingyue Tan, The University of British Columbia
Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
– Decision boundary: $w^T x + b = 0$
– $w^T x + b > 0$ on one side, $w^T x + b < 0$ on the other
– Classifier: $f(x) = \mathrm{sign}(w^T x + b)$
Linear Separators
• Which of the linear separators is optimal?
Classification Margin
• Distance from example $x_i$ to the separator is $r = \frac{w^T x_i + b}{\|w\|}$.
• Examples closest to the hyperplane are support vectors.
• Margin $\rho$ of the separator is the distance between support vectors.
Maximum Margin Classification
• Maximizing the margin is good according to intuition.
• Implies that only support vectors matter; other training examples are ignorable.
Linear SVM Mathematically

What we know:
• $w \cdot x^+ + b = +1$
• $w \cdot x^- + b = -1$
• $w \cdot (x^+ - x^-) = 2$

Margin width:
$\rho = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}$
Linear SVMs Mathematically (cont.)

• Then we can formulate the optimization problem:

Find $w$ and $b$ such that $\rho = \frac{2}{\|w\|}$ is maximized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \geq 1$

Which can be reformulated as:

Find $w$ and $b$ such that $\Phi(w) = \|w\|^2 = w^T w$ is minimized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \geq 1$
Lagrange Multipliers
• Given the following optimization problem:
minimize $f(x, y)$ subject to $g(x, y) = c$
• We can formulate it as:
$L(x, y, \lambda) = f(x, y) + \lambda \left(g(x, y) - c\right)$
• and set the derivative to zero:
$\nabla_{x, y, \lambda}\, L(x, y, \lambda) = 0$
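The recipe above can be made concrete with a small worked example (my own choice of $f$ and $g$, not from the slides): minimize $f(x,y) = x^2 + y^2$ subject to $g(x,y) = x + y = 1$.

```latex
L(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1) \\
\frac{\partial L}{\partial x} = 2x + \lambda = 0, \qquad
\frac{\partial L}{\partial y} = 2y + \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = x + y - 1 = 0 \\
\Rightarrow \; x = y = \tfrac{1}{2}, \quad \lambda = -1, \quad f = \tfrac{1}{2}
```

At the solution, $\nabla f = (1, 1)$ and $\nabla g = (1, 1)$ are indeed parallel, matching the intuition on the next slide.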
Lagrange Multipliers
• Main intuition:
– The gradients of f and g are parallel at the maximum
𝛻𝑓 = 𝜆𝛻𝑔
Fig: http://mathworld.wolfram.com/LagrangeMultiplier.html
Lagrange Multipliers
• See the following for a proof:
– http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/part-c-lagrange-multipliers-and-constrained-differentials/session-40-proof-of-lagrange-multipliers/MIT18_02SC_notes_22.pdf
• A clear example:
– http://tutorial.math.lamar.edu/Classes/CalcIII/LagrangeMultipliers.aspx
• More intuitive explanation:
– http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a higher-dimensional space (e.g., $x \mapsto (x, x^2)$)?
SOFTMAX OR LOGISTIC CLASSIFIERS
Cross-entropy
• Uses cross-entropy ($p$: target probabilities, $q$: estimated probabilities):
$H(p, q) = E_p[-\log q] = -\sum_j p_j \log q_j$
• Wikipedia:
– “the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an ‘unnatural’ probability distribution $q$, rather than the ‘true’ distribution $p$”
Softmax classifier – cross-entropy loss

• Cross-entropy: $H(p, q) = E_p[-\log q] = -\sum_j p_j \log q_j$
• In our case,
– $p$ denotes the correct probabilities of the categories; i.e., $p_j = 1$ for the correct label and $p_j = 0$ for the other categories.
– $q$ denotes the estimated probabilities of the categories.
• But our scores are not probabilities!
– One solution is the softmax function: $f(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
– It maps arbitrary real-valued scores to probabilities.
• Using the normalized values, we can now define the cross-entropy loss for the classification problem:
$L_i = -\log\!\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log \sum_j e^{f_j}$

http://cs231n.github.io/
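Both forms of the loss on this slide are easy to compute directly. A sketch with illustrative score values; names are my own.

```python
# Softmax + cross-entropy loss for one sample: L_i = −f_y + log Σ_j e^{f_j}.
import numpy as np

def softmax_ce_loss(f, y):
    return -f[y] + np.log(np.sum(np.exp(f)))

f = np.array([2.0, 1.0, 0.1])                # raw class scores
probs = np.exp(f) / np.sum(np.exp(f))        # softmax: non-negative, sums to 1
print(probs)
print(softmax_ce_loss(f, y=0))               # equals −log(probs[0])
```

The second form avoids computing the normalized probabilities explicitly, which matters for the numerical-stability trick discussed later.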
Logistic loss

• A special case of cross-entropy for binary classification:
$H(p, q) = -\sum_j p_j \log q_j = -p \log q - (1 - p) \log(1 - q)$
• The softmax function reduces to the logistic function:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
• And the loss becomes:
$L_i = -\log \frac{1}{1 + e^{-f}}$

http://cs231n.github.io/
Why take logarithm of probabilities?
• Maps probability space to logarithmic space
• Multiplication becomes addition
– Multiplication is a very frequent operation with probabilities
• Speed:
– Addition is more efficient
• Accuracy:
– Considering the precision lost when representing real numbers, addition is friendlier than multiplying many small numbers.
• Since log-probabilities are negative, we usually negate them to work with positive numbers.
Softmax classifier: One interpretation

• Information theory:
– Cross-entropy between a true distribution and an estimated one:
$H(p, q) = -\sum_x p(x) \log q(x)$
– In our case, $p = [0, \ldots, 1, 0, \ldots, 0]$, containing a single 1 at the correct label.
– Since $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$, and $H(p) = 0$ for such a one-hot $p$, we are minimizing the Kullback–Leibler divergence.

http://cs231n.github.io/
Softmax classifier: Another interpretation

• Probabilistic view:
$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$
• In our case, we are minimizing the negative log likelihood.
• Therefore, this corresponds to Maximum Likelihood Estimation (MLE).

http://cs231n.github.io/
Numerical Stability
• Exponentials may become very large. A trick:
$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C\, e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$
• Set $\log C = -\max_j f_j$.

http://cs231n.github.io/
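The shift trick in code: softmax is invariant to subtracting a constant from all scores, so subtracting the maximum prevents overflow without changing the result. A sketch; names are my own.

```python
# Numerically stable softmax: shift scores by their maximum (log C = −max_j f_j).
import numpy as np

def softmax_stable(f):
    shifted = f - np.max(f)       # largest shifted score is 0, so e^shifted ≤ 1
    e = np.exp(shifted)
    return e / e.sum()

f = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(f) would overflow to inf
print(softmax_stable(f))                 # finite, well-defined probabilities
```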
SVM loss vs. cross-entropy loss
• SVM is happy once the classification satisfies the margin.
– Ex: with $\Delta = 1$, score values $[10, 9, 9]$ and $[10, -10, -10]$ both give zero loss.
• SVM loss is happy if the margin is satisfied.
– SVM is local.
• Cross-entropy always wants better.
– Cross-entropy is global.
0-1 Loss
• Minimize the number of cases where the prediction is wrong:
$L = \sum_i \mathbf{1}\!\left(f(x_i; W, b) \neq y_i\right)$
Or equivalently, for binary labels $y_i \in \{-1, +1\}$:
$L = \sum_i \mathbf{1}\!\left(y_i\, f(x_i; W, b) < 0\right)$
Absolute Value Loss, Squared Error Loss
$L_i = \sum_j |s_j - y_j|^q$

• $q = 1$: absolute value loss
• $q = 2$: squared error loss

Bishop
MORE ON LOSS FUNCTIONS
Visualizing Loss Functions
• Look at one of the example loss functions:
$L_i = \sum_{j \neq y_i} \max(0,\; w_j^T x_i - w_{y_i}^T x_i + 1)$
• Since $W$ has too many dimensions, this is difficult to plot.
• We can visualize the loss along one weight direction, though, which can give us some intuition about the shape of the function.
– E.g., start from an arbitrary $W_0$, choose a direction $W_1$, and plot $L(W_0 + \alpha W_1)$ for different values of $\alpha$.

http://cs231n.github.io/
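The procedure just described, sampling $L(W_0 + \alpha W_1)$ over a range of $\alpha$, can be sketched as follows. The data, direction, and hinge-loss choice are illustrative; names are my own.

```python
# Sample the loss along one weight direction: L(W0 + α W1) for several α.
import numpy as np

def hinge_data_loss(W, X, y, delta=1.0):
    scores = X @ W.T
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0
    return margins.sum(axis=1).mean()

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 4))
y = rng.integers(0, 3, size=20)
W0 = rng.standard_normal((3, 4))      # arbitrary starting point
W1 = rng.standard_normal((3, 4))      # arbitrary direction
alphas = np.linspace(-2, 2, 9)
curve = [hinge_data_loss(W0 + a * W1, X, y) for a in alphas]
print(curve)                           # a 1-D slice of the loss surface
```

Plotting `curve` against `alphas` yields the kind of piecewise-linear, bowl-shaped slice shown on the following slides.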
Visualizing Loss Functions
• Example: if we consider just $w_0$, each max term is a linear function of $w_0$, and the loss is a sum of such hinged linear functions.

http://cs231n.github.io/
Visualizing Loss Functions
• You can see that this is a convex function.
– Nice and easy for optimization.
• When you combine many of them in a neural network, it becomes non-convex.

(Figure: loss along one direction; loss along two directions; loss along two directions, averaged over many samples.)

http://cs231n.github.io/
Another approach for visualizing loss functions

• 0-1 loss:
$L = \mathbf{1}(f(x) \neq y)$
or equivalently:
$L = \mathbf{1}(y f(x) < 0)$
• Square loss:
$L = (f(x) - y)^2$
in the binary case:
$L = (1 - y f(x))^2$
• Hinge loss:
$L = \max(1 - y f(x),\, 0)$
• Logistic loss:
$L = \frac{1}{\ln 2} \ln(1 + e^{-y f(x)})$

Rosasco et al., 2004
SUMMARY OF LOSS FUNCTIONS
Sum up
• 0-1 loss is not differentiable, so it is not helpful at training
– It is used in testing
• Other losses try to cover the “weakness” of 0-1 loss
• Hinge loss imposes a weaker constraint compared to cross-entropy
• For classification: use hinge-loss or cross-entropy loss
• For regression: use squared-error loss, or absolute difference loss
SO, WHAT DO WE DO WITH A LOSS FUNCTION?
Optimization strategies
• We want to find 𝑊 that minimizes the loss function.
– Remember that 𝑊 has lots of dimensions.
• Naïve idea: random search
– For a number of iterations:
• Select a 𝑊 randomly
• If it leads to a better loss than the previous ones, keep it.
– This yields 15.5% accuracy on CIFAR-10 after 1000 iterations (chance: 10%)
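The random-search strategy above can be sketched as follows (`loss_fn` and the quadratic toy loss are placeholders for whatever loss and dataset are used):

```python
import numpy as np

def random_search(loss_fn, shape, iters=1000, seed=0):
    """Draw random weight matrices and keep the best one seen so far."""
    rng = np.random.default_rng(seed)
    best_W, best_loss = None, float("inf")
    for _ in range(iters):
        W = rng.standard_normal(shape) * 0.001  # random candidate
        loss = loss_fn(W)
        if loss < best_loss:                    # keep only improvements
            best_W, best_loss = W, loss
    return best_W, best_loss

# toy example: "minimize" a simple quadratic loss
W, L = random_search(lambda W: float(np.sum(W ** 2)), shape=(3, 4))
print(L)  # small, but random search has no sense of direction
```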
Optimization strategies
• Second idea: random local search
– Start at an arbitrary position (weight 𝑊)
– Try a random perturbation 𝑊 + 𝛿𝑊 and see if it leads to a better loss
• If yes, move there
– This leads to 21.4% accuracy after 1000 iterations
– This is actually a variation of simulated annealing
• A better idea: follow the gradient
A quick reminder on gradients / partial derivatives
– In one dimension:
𝑑𝑓(𝑥)/𝑑𝑥 = lim_(ℎ→0) [𝑓(𝑥 + ℎ) − 𝑓(𝑥)] / ℎ
– In practice (numerically):
• 𝑑𝑓[𝑛]/𝑑𝑛 ≈ (𝑓[𝑛 + ℎ] − 𝑓[𝑛]) / ℎ
• 𝑑𝑓[𝑛]/𝑑𝑛 ≈ (𝑓[𝑛 + ℎ] − 𝑓[𝑛 − ℎ]) / (2ℎ) (centered difference – works better)
– In many dimensions:
1. Compute gradient numerically with finite differences
– Slow
– Easy
– Approximate
2. Compute the gradient analytically
– Fast
– Exact
– Error-prone to implement
• In practice: implement (2) and check against (1) before testing
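The numerical check can be sketched with the centered-difference formula (a sketch; the quadratic used to verify it stands in for an actual loss):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Centered-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        fp = f(W)                      # f evaluated at W + h along this coordinate
        W[idx] = old - h
        fm = f(W)                      # f evaluated at W - h
        W[idx] = old                   # restore the entry
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# check against the analytic gradient of f(W) = sum(W**2), which is 2W
W = np.array([[1.0, -2.0], [0.5, 3.0]])
num = numerical_gradient(lambda W: float(np.sum(W ** 2)), W)
print(np.allclose(num, 2 * W, atol=1e-4))  # True
```

This is exactly the "implement (2), check against (1)" workflow: the slow numerical gradient validates the fast analytic one.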
A quick reminder on gradients / partial derivatives
• If you have a many-variable function, e.g., 𝑓(𝑥, 𝑦) = 𝑥 + 𝑦, you can take its derivative wrt either 𝑥 or 𝑦:
– 𝑑𝑓(𝑥, 𝑦)/𝑑𝑥 = lim_(ℎ→0) [𝑓(𝑥 + ℎ, 𝑦) − 𝑓(𝑥, 𝑦)] / ℎ = 1
– Similarly, 𝑑𝑓(𝑥, 𝑦)/𝑑𝑦 = 1
– In fact, we should denote them as 𝜕𝑓/𝜕𝑥 and 𝜕𝑓/𝜕𝑦, since they are “partial derivatives”, the gradients on 𝑥 or 𝑦.
• A partial derivative tells you the rate of change along a single dimension at a point.
– E.g., if 𝜕𝑓/𝜕𝑥 = 1, a small change ℎ in 𝑥 leads to (approximately) the same amount of change in the value of the function.
• The gradient is the vector of partial derivatives:
– 𝛻𝑓 = (𝜕𝑓/𝜕𝑥, 𝜕𝑓/𝜕𝑦)
A quick reminder on gradients / partial derivatives
• A simple example:
– 𝑓(𝑥, 𝑦) = max(𝑥, 𝑦)
– 𝛻𝑓 = (𝟏(𝑥 ≥ 𝑦), 𝟏(𝑦 ≥ 𝑥))
• Chaining:
– What if we have a composition of functions?
– E.g., 𝑓(𝑥, 𝑦, 𝑧) = 𝑞(𝑥, 𝑦) · 𝑧 with 𝑞(𝑥, 𝑦) = 𝑥 + 𝑦
– 𝜕𝑓/𝜕𝑥 = (𝜕𝑓/𝜕𝑞) (𝜕𝑞/𝜕𝑥) = 𝑧
– 𝜕𝑓/𝜕𝑦 = (𝜕𝑓/𝜕𝑞) (𝜕𝑞/𝜕𝑦) = 𝑧
– 𝜕𝑓/𝜕𝑧 = 𝑞(𝑥, 𝑦) = 𝑥 + 𝑦
• Back-propagation
– Local processing to improve the system globally
– Each gate locally determines whether to increase or decrease its inputs
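The chained example above can be written as a tiny forward/backward pass in plain Python (the numeric values are just an illustration):

```python
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # add gate: q = 3
f = q * z          # multiply gate: f = -12

# backward pass: propagate df/df = 1 back through each gate
dq = z             # multiply gate routes z to its other input: df/dq
dz = q             # ... and q to this one: df/dz
dx = dq * 1.0      # add gate: dq/dx = 1, chain rule gives df/dx
dy = dq * 1.0      # add gate: dq/dy = 1, chain rule gives df/dy

print(dx, dy, dz)  # -4.0 -4.0 3.0
```

Each gate only needs its local derivative and the gradient flowing in from above, which is the whole point of back-propagation.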
Partial derivatives and backprop
• Which has many gates:
In fact, that was the sigmoid function
• We will combine all these gates into a single gate and call it a neuron
Intuitive effects
• Commonly used operations
– Add gate
– Max gate
– Multiply gate
• Due to the effect of the multiply gate, when one of the inputs is large, the small input gets the large gradient
– This has a negative effect in NNs
– When one of the weights is unusually large, it affects the other weights.
– Therefore, it is better to normalize the inputs / weights
Optimization strategies: gradient
• Move along the negative gradient (since we wish to go down)
– 𝑊 ← 𝑊 − 𝑠 · 𝜕𝐿(𝑊)/𝜕𝑊
– 𝑠: step size
• Gradient tells us the direction
– Choosing how much to move along that direction is difficult to determine
– This is also called the learning rate
– If it is small: too slow to converge
– If it is big: you may overshoot and skip the minimum
Optimization strategies: gradient
• Take our loss function:
𝐿𝑖 = Σ_(𝑗≠𝑦𝑖) max(0, 𝑤𝑗ᵀ𝑥𝑖 − 𝑤_(𝑦𝑖)ᵀ𝑥𝑖 + 1)
• Its gradient wrt 𝑤𝑗 (for 𝑗 ≠ 𝑦𝑖) is:
𝜕𝐿𝑖/𝜕𝑤𝑗 = 𝟏(𝑤𝑗ᵀ𝑥𝑖 − 𝑤_(𝑦𝑖)ᵀ𝑥𝑖 + 1 > 0) 𝑥𝑖
• For the “winning” weight 𝑤_(𝑦𝑖), this is:
𝜕𝐿𝑖/𝜕𝑤_(𝑦𝑖) = −(Σ_(𝑗≠𝑦𝑖) 𝟏(𝑤𝑗ᵀ𝑥𝑖 − 𝑤_(𝑦𝑖)ᵀ𝑥𝑖 + 1 > 0)) 𝑥𝑖
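These gradient expressions can be sketched for a single example (a sketch; here `W` holds one row of weights per class):

```python
import numpy as np

def svm_loss_and_grad(W, x, y):
    """Multiclass SVM (hinge) loss and its gradient for one example.

    W: (num_classes, dim), x: (dim,), y: index of the correct class.
    """
    scores = W @ x                          # w_j^T x for all classes j
    margins = scores - scores[y] + 1.0      # w_j^T x - w_y^T x + 1
    margins[y] = 0.0                        # the sum skips j == y
    loss = np.sum(np.maximum(margins, 0.0))

    dW = np.zeros_like(W)
    positive = margins > 0                  # indicator 1(margin > 0)
    dW[positive] = x                        # gradient wrt each violating w_j
    dW[y] = -np.sum(positive) * x           # gradient wrt the winning w_y
    return loss, dW

W = np.zeros((3, 4)); x = np.array([1.0, 2.0, -1.0, 0.5]); y = 2
loss, dW = svm_loss_and_grad(W, x, y)
print(loss)  # 2.0: with zero weights, both wrong classes violate the margin by 1
```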
Gradient Descent
• Update the weight:
𝑊_(𝑛𝑒𝑤) ← 𝑊 − 𝑠 · 𝜕𝐿(𝑊)/𝜕𝑊
• This computes the gradient after seeing all examples to update the weights.
– Examples can be on the order of millions or billions
• Alternatives:
– Mini-batch gradient descent: update the weights after, e.g., 256 examples
– Stochastic (or online) gradient descent: update the weights after each example
– People usually use mini-batches and still call it stochastic gradient descent.
– Performing 100 single-example updates is more expensive than one update over a batch of 100 examples, because batched matrix/vector operations are efficient.
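A mini-batch loop can be sketched as follows (`loss_grad` is a placeholder returning the gradient on a batch; the linear-regression toy at the end is only there to exercise it):

```python
import numpy as np

def minibatch_sgd(loss_grad, W, X, Y, lr=0.01, batch_size=4, epochs=50, seed=0):
    """Generic mini-batch gradient descent: W <- W - lr * dL/dW."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)              # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            W = W - lr * loss_grad(W, X[idx], Y[idx])
    return W

# toy problem: gradient of the mean squared error of a linear model
def mse_grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

X = np.random.default_rng(1).standard_normal((64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = minibatch_sgd(mse_grad, np.zeros(3), X, y, lr=0.1)
print(np.allclose(w, true_w, atol=1e-2))  # recovers the true weights
```

Setting `batch_size=1` gives online SGD and `batch_size=n` gives full-batch gradient descent, so the three variants in the bullets above are the same loop with different batch sizes.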
A GENERAL LOOK AT OPTIMIZATION
(Figure: the optimization hierarchy – Mathematical Optimization contains Convex Optimization, which includes Least-squares and LP, alongside Nonlinear Optimization)
Rong Jin
Mathematical Optimization
Convex Optimization
Interpretation
The function’s value is below the line connecting any two points on its graph
Mark Schmidt
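In symbols, the interpretation above is the standard definition of a convex function:

```latex
f\big(\theta x + (1-\theta)\, y\big) \;\le\; \theta f(x) + (1-\theta)\, f(y),
\qquad \forall\, x, y \in \operatorname{dom} f,\;\; \theta \in [0, 1]
```

The right-hand side traces the chord between the points (𝑥, 𝑓(𝑥)) and (𝑦, 𝑓(𝑦)); convexity says the graph never rises above that chord.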
Another interpretation
Example convex functions
Operations that conserve convexity
‖𝑥 + 𝑦‖ₚ ≤ ‖𝑥‖ₚ + ‖𝑦‖ₚ
A KTEC Center of Excellence
Why convex optimization?
• Can’t solve most optimization problems (OPs)
– E.g., NP-hard; even high-degree polynomial time is too slow
• Convex OPs:
– (Generally) no analytic solution
– Efficient algorithms to find the (global) solution
– Interior-point methods (basically iterated Newton) can be used:
~[10–100] · max{p³, p²m, F} operations, where F is the cost of evaluating the objective and constraint functions
– At worst, solve with general interior-point methods (e.g., CVX); specialized solvers are faster
Convex Function
• Easy to see why convexity allows for efficient solution
• Just “slide” down the objective function as far as possible and you will reach a minimum
Convex vs. Non-convex Ex.
• Convex: the minimum is easy to find
• Affine: the border case of convexity
Convex vs. Non-convex Ex.
• Non-convex, easy to get stuck in a local min.
• Can’t rely on only local search techniques
Non-convex
• Some non-convex problems are highly multi-modal, or NP-hard
• Could be forced to search all solutions, or hope stochastic search is successful
• Cannot guarantee the best solution; inefficient
• Harder to make performance guarantees with approximate solutions
Least-squares
• Analytical solution
• Good algorithms and software
• High accuracy and high reliability
• Time complexity: 𝐶𝑛²𝑘
A mature technology!
Linear Programming (LP)
• No analytical solution
• Algorithms and software
• Reliable and efficient
• Time complexity: 𝐶𝑛²𝑚
Also a mature technology!
Convex Optimization
• No analytical solution
• Algorithms and software
• Reliable and efficient
• Time complexity (roughly): max{𝑛³, 𝑛²𝑚, 𝐹}
Almost a mature technology!
Nonlinear Optimization
• Sadly, no effective methods to solve
• Only approaches with some compromise
• Local optimization: “more art than technology”
• Global optimization: greatly compromised efficiency
• Help from convex optimization: 1) initialization, 2) heuristics, 3) bounds
Far from a technology! (something to avoid)
Why Study Convex Optimization
“If not, … there is little chance you can solve it.”
– Section 1.3.2, p. 8, Convex Optimization (Boyd & Vandenberghe)
Recognizing Convex Optimization Problems
Least-squares
Analytical Solution of Least-squares
• Set the derivative to zero:
– 𝑑𝑓₀(𝑥)/𝑑𝑥 = 0
– 2𝐴ᵀ𝐴𝑥 − 2𝐴ᵀ𝑏 = 0
– 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏
• Solve this system of linear equations
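The normal equations 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏 can be solved directly; a minimal NumPy sketch with a made-up overdetermined system:

```python
import numpy as np

# overdetermined system: more equations (rows) than unknowns
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # least-squares fit, here a line b ~ x[0] + x[1]*t

# same answer from the dedicated least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_lstsq))  # True
```

In practice `np.linalg.lstsq` (based on a factorization of 𝐴) is preferred over forming 𝐴ᵀ𝐴 explicitly, since squaring the matrix squares its condition number.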
Linear Programming (LP)
To sum up
• We introduced some important concepts in machine learning and optimization
• We introduced popular machine learning methods
• We talked about loss functions and how we can optimize them using gradient descent