Machine Learning Basics Lecture 7: Multiclass Classification
Princeton University COS 495
Instructor: Yingyu Liang
2016-09-01
Example: image classification
(Figure: images classified as indoor vs. outdoor; the example shown is labeled Indoor)
Example: image classification (multiclass)
ImageNet figure borrowed from vision.stanford.edu
Multiclass classification
• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷, with 𝑥𝑖 ∈ ℝ^𝑑 and 𝑦𝑖 ∈ {1, 2, …, 𝐾}
• Find 𝑓(𝑥): ℝ^𝑑 → {1, 2, …, 𝐾} that outputs correct labels
• What kind of 𝑓?
Approaches for multiclass classification
Approach 1: reduce to regression
• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷
• Find 𝑓𝑤(𝑥) = 𝑤ᵀ𝑥 that minimizes 𝐿(𝑓𝑤) = (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑤ᵀ𝑥𝑖 − 𝑦𝑖)²
• Reduces to linear regression, ignoring the fact that 𝑦 ∈ {1, 2, …, 𝐾}
• Bad idea even for binary classification
Approach 1: reduce to regression
Figure from Pattern Recognition and Machine Learning, Bishop
Bad idea even for binary classification
Approach 2: one-versus-the-rest
• Find 𝐾 − 1 classifiers 𝑓1, 𝑓2, …, 𝑓𝐾−1
• 𝑓1 classifies 1 vs {2, 3, …, 𝐾}
• 𝑓2 classifies 2 vs {1, 3, …, 𝐾}
• …
• 𝑓𝐾−1 classifies 𝐾 − 1 vs {1, 2, …, 𝐾 − 2}
• Points not classified to any of the classes {1, 2, …, 𝐾 − 1} are put into class 𝐾
• Problem of ambiguous regions: some points may be classified to more than one class
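The decision rule above can be sketched in a few lines (a minimal illustration with hypothetical scores, not trained classifiers):

```python
# Decision rule for one-versus-the-rest, given K - 1 binary scores.
# The function and scores here are made up for illustration.
import numpy as np

def one_vs_rest_predict(scores, K):
    """scores: the K - 1 values f_1(x), ..., f_{K-1}(x).
    Returns the set of classes claiming x (more than one = ambiguous)."""
    claimed = {k + 1 for k, s in enumerate(scores) if s > 0}
    return claimed if claimed else {K}  # unclaimed points default to class K

print(one_vs_rest_predict(np.array([-1.0, 2.0, -0.5]), K=4))  # {2}: unambiguous
print(one_vs_rest_predict(np.array([1.0, 2.0, -0.5]), K=4))   # {1, 2}: ambiguous region
print(one_vs_rest_predict(np.array([-1.0, -2.0, -0.5]), K=4)) # {4}: default class
```

The second call shows the ambiguity problem directly: two classifiers both claim the point.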
Approach 2: one-versus-the-rest
Figure from Pattern Recognition and Machine Learning, Bishop
Approach 3: one-versus-one
• Find 𝐾(𝐾 − 1)/2 classifiers 𝑓(1,2), 𝑓(1,3), …, 𝑓(𝐾−1,𝐾)
• 𝑓(1,2) classifies 1 vs 2
• 𝑓(1,3) classifies 1 vs 3
• …
• 𝑓(𝐾−1,𝐾) classifies 𝐾 − 1 vs 𝐾
• Computationally expensive: think of 𝐾 = 1000, which requires 499,500 pairwise classifiers
• Problem of ambiguous regions remains
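A quick sketch of the pairwise-voting rule and the classifier count (the pairwise outcomes below are made up for illustration):

```python
# One-versus-one: K(K-1)/2 pairwise classifiers vote for a class.
# `winner` below is a hypothetical stand-in for the trained pairwise classifiers.
from itertools import combinations
from collections import Counter

def one_vs_one_predict(pairwise_winner, K):
    """pairwise_winner(i, j) returns which of classes i, j wins f_(i,j).
    Predict by majority vote over all K(K-1)/2 pairs."""
    votes = Counter(pairwise_winner(i, j)
                    for i, j in combinations(range(1, K + 1), 2))
    return votes.most_common(1)[0][0]

# The computational cost the slide warns about:
K = 1000
n_classifiers = K * (K - 1) // 2
print(n_classifiers)  # 499500 pairwise classifiers for K = 1000

# Toy example with K = 3, where class 2 beats every other class:
winner = lambda i, j: 2 if 2 in (i, j) else min(i, j)
print(one_vs_one_predict(winner, K=3))  # 2
```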
Approach 3: one-versus-one
Figure from Pattern Recognition and Machine Learning, Bishop
Approach 4: discriminant functions
• Find 𝐾 scoring functions 𝑠1, 𝑠2, …, 𝑠𝐾
• Classify 𝑥 to class 𝑦 = argmax𝑖 𝑠𝑖(𝑥)
• Computationally cheap
• No ambiguous regions
Linear discriminant functions
• Find 𝐾 discriminant functions 𝑠1, 𝑠2, …, 𝑠𝐾
• Classify 𝑥 to class 𝑦 = argmax𝑖 𝑠𝑖(𝑥)
• Linear discriminant: 𝑠𝑖(𝑥) = 𝑤𝑖ᵀ𝑥, with 𝑤𝑖 ∈ ℝ^𝑑
Linear discriminant functions
• Linear discriminant: 𝑠𝑖(𝑥) = 𝑤𝑖ᵀ𝑥, with 𝑤𝑖 ∈ ℝ^𝑑
• Leads to a convex region for each class: under 𝑦 = argmax𝑖 𝑤𝑖ᵀ𝑥, each class region is an intersection of half-spaces {𝑥: 𝑤𝑖ᵀ𝑥 ≥ 𝑤𝑗ᵀ𝑥}
Figure from Pattern Recognition and Machine Learning, Bishop
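The argmax decision rule is one line with NumPy; the weight matrix and inputs below are arbitrary illustrative values:

```python
# Linear discriminant prediction: stack the K weight vectors as rows of W
# and take the argmax score. W and x are made-up values.
import numpy as np

def predict(W, x):
    """W: (K, d) matrix with rows w_i; returns argmax_i w_i^T x (0-indexed)."""
    return int(np.argmax(W @ x))

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])  # K = 3 classes, d = 2
print(predict(W, np.array([2.0, 0.5])))    # 0: first row scores highest
print(predict(W, np.array([-1.0, -1.0])))  # 2: third row scores highest
```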
Conditional distribution as discriminant
• Find 𝐾 discriminant functions 𝑠1, 𝑠2, …, 𝑠𝐾
• Classify 𝑥 to class 𝑦 = argmax𝑖 𝑠𝑖(𝑥)
• Conditional distributions: 𝑠𝑖(𝑥) = 𝑝(𝑦 = 𝑖|𝑥)
• Parametrize by 𝑤𝑖: 𝑠𝑖(𝑥) = 𝑝𝑤𝑖(𝑦 = 𝑖|𝑥)
Multiclass logistic regression
Review: binary logistic regression
• Sigmoid
σ(𝑤ᵀ𝑥 + 𝑏) = 1 / (1 + exp(−(𝑤ᵀ𝑥 + 𝑏)))
• Interpret as conditional probability
𝑝𝑤(𝑦 = 1|𝑥) = σ(𝑤ᵀ𝑥 + 𝑏)
𝑝𝑤(𝑦 = 0|𝑥) = 1 − 𝑝𝑤(𝑦 = 1|𝑥) = 1 − σ(𝑤ᵀ𝑥 + 𝑏)
• How to extend to multiclass?
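For concreteness, a minimal NumPy sketch of the sigmoid and the two probabilities it induces (w, b, x are arbitrary illustrative values):

```python
# The sigmoid and the binary class probabilities it defines.
# w, b, x are made-up values for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.5, -0.25]), 0.1
x = np.array([1.0, 2.0])
p1 = sigmoid(w @ x + b)  # p_w(y = 1 | x)
p0 = 1.0 - p1            # p_w(y = 0 | x)
print(p0 + p1)           # 1.0 -- the two probabilities sum to one
print(sigmoid(0.0))      # 0.5 -- the decision boundary w^T x + b = 0
```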
Review: binary logistic regression
• Suppose we model the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) and the class probabilities 𝑝(𝑦 = 𝑖)
• Conditional probability by Bayes' rule:
𝑝(𝑦 = 1|𝑥) = 𝑝(𝑥|𝑦 = 1)𝑝(𝑦 = 1) / [𝑝(𝑥|𝑦 = 1)𝑝(𝑦 = 1) + 𝑝(𝑥|𝑦 = 2)𝑝(𝑦 = 2)] = 1 / (1 + exp(−𝑎)) = σ(𝑎)
where we define
𝑎 ≔ ln [𝑝(𝑥|𝑦 = 1)𝑝(𝑦 = 1) / (𝑝(𝑥|𝑦 = 2)𝑝(𝑦 = 2))] = ln [𝑝(𝑦 = 1|𝑥) / 𝑝(𝑦 = 2|𝑥)]
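The identity 𝑝(𝑦 = 1|𝑥) = σ(𝑎) can be checked numerically; the density and prior values below are arbitrary stand-ins for the quantities at a fixed 𝑥:

```python
# Check that Bayes' rule gives p(y=1|x) = sigma(a) with
# a = ln[p(x|y=1)p(y=1) / (p(x|y=2)p(y=2))].
# The densities and priors are made-up numbers.
import numpy as np

px_y1, px_y2 = 0.3, 0.05  # hypothetical p(x|y=1), p(x|y=2) at some fixed x
py1, py2 = 0.4, 0.6       # hypothetical priors p(y=1), p(y=2)

bayes = px_y1 * py1 / (px_y1 * py1 + px_y2 * py2)
a = np.log((px_y1 * py1) / (px_y2 * py2))
sigma_a = 1.0 / (1.0 + np.exp(-a))
print(bool(np.isclose(bayes, sigma_a)))  # True
```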
Review: binary logistic regression
• Suppose we model the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) and the class probabilities 𝑝(𝑦 = 𝑖)
• 𝑝(𝑦 = 1|𝑥) = σ(𝑎) = σ(𝑤ᵀ𝑥 + 𝑏) is equivalent to setting the log odds
𝑎 = ln [𝑝(𝑦 = 1|𝑥) / 𝑝(𝑦 = 2|𝑥)] = 𝑤ᵀ𝑥 + 𝑏
• Why linear log odds?
Review: binary logistic regression
• Suppose the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) are normal:
𝑝(𝑥|𝑦 = 𝑖) = 𝑁(𝑥|𝜇𝑖, 𝐼) = (1 / (2π)^{𝑑/2}) exp{−(1/2)‖𝑥 − 𝜇𝑖‖²}
• The log odds is then linear:
𝑎 = ln [𝑝(𝑥|𝑦 = 1)𝑝(𝑦 = 1) / (𝑝(𝑥|𝑦 = 2)𝑝(𝑦 = 2))] = 𝑤ᵀ𝑥 + 𝑏
where
𝑤 = 𝜇1 − 𝜇2,  𝑏 = −(1/2)𝜇1ᵀ𝜇1 + (1/2)𝜇2ᵀ𝜇2 + ln [𝑝(𝑦 = 1) / 𝑝(𝑦 = 2)]
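This linearity is easy to verify numerically; the means, priors, and input below are randomly generated illustrative values:

```python
# Check that unit-covariance Gaussian class conditionals give a linear
# log odds: a = w^T x + b with w = mu1 - mu2 and
# b = -0.5 mu1^T mu1 + 0.5 mu2^T mu2 + ln(p1/p2). All values are made up.
import numpy as np

d = 3
rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
p1, p2 = 0.3, 0.7
x = rng.normal(size=d)

def log_gauss(x, mu):  # log N(x | mu, I), including the normalizing constant
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

a_direct = (log_gauss(x, mu1) + np.log(p1)) - (log_gauss(x, mu2) + np.log(p2))
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
print(bool(np.isclose(a_direct, w @ x + b)))  # True
```

The Gaussian normalizing constants cancel in the ratio, which is why only the linear and constant terms survive.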
Multiclass logistic regression
• Suppose we model the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) and the class probabilities 𝑝(𝑦 = 𝑖)
• Conditional probability by Bayes' rule:
𝑝(𝑦 = 𝑖|𝑥) = 𝑝(𝑥|𝑦 = 𝑖)𝑝(𝑦 = 𝑖) / Σ𝑗 𝑝(𝑥|𝑦 = 𝑗)𝑝(𝑦 = 𝑗) = exp(𝑎𝑖) / Σ𝑗 exp(𝑎𝑗)
where we define 𝑎𝑖 ≔ ln [𝑝(𝑥|𝑦 = 𝑖)𝑝(𝑦 = 𝑖)]
Multiclass logistic regression
• Suppose the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) are normal:
𝑝(𝑥|𝑦 = 𝑖) = 𝑁(𝑥|𝜇𝑖, 𝐼) = (1 / (2π)^{𝑑/2}) exp{−(1/2)‖𝑥 − 𝜇𝑖‖²}
• Then
𝑎𝑖 ≔ ln [𝑝(𝑥|𝑦 = 𝑖)𝑝(𝑦 = 𝑖)] = −(1/2)𝑥ᵀ𝑥 + 𝑤𝑖ᵀ𝑥 + 𝑏𝑖
where
𝑤𝑖 = 𝜇𝑖,  𝑏𝑖 = −(1/2)𝜇𝑖ᵀ𝜇𝑖 + ln 𝑝(𝑦 = 𝑖) + ln [1 / (2π)^{𝑑/2}]
Multiclass logistic regression
• Suppose the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) are normal:
𝑝(𝑥|𝑦 = 𝑖) = 𝑁(𝑥|𝜇𝑖, 𝐼) = (1 / (2π)^{𝑑/2}) exp{−(1/2)‖𝑥 − 𝜇𝑖‖²}
• The term −(1/2)𝑥ᵀ𝑥 is shared by all classes, so it cancels out, and we have
𝑝(𝑦 = 𝑖|𝑥) = exp(𝑎𝑖) / Σ𝑗 exp(𝑎𝑗),  𝑎𝑖 ≔ 𝑤𝑖ᵀ𝑥 + 𝑏𝑖
where
𝑤𝑖 = 𝜇𝑖,  𝑏𝑖 = −(1/2)𝜇𝑖ᵀ𝜇𝑖 + ln 𝑝(𝑦 = 𝑖) + ln [1 / (2π)^{𝑑/2}]
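The cancellation can be checked numerically: adding the shared quadratic term shifts every 𝑎𝑖 equally, leaving the softmax unchanged (W, b, x below are random illustrative values):

```python
# Check that the shared -0.5 x^T x term cancels in the softmax:
# softmax over a_i = -0.5 x^T x + w_i^T x + b_i equals softmax over
# w_i^T x + b_i alone. W, b, x are made-up values.
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / np.sum(e)

rng = np.random.default_rng(1)
K, d = 4, 3
W, b, x = rng.normal(size=(K, d)), rng.normal(size=K), rng.normal(size=d)

a_full = -0.5 * x @ x + W @ x + b  # includes the shared quadratic term
a_lin = W @ x + b                   # quadratic term dropped
print(bool(np.allclose(softmax(a_full), softmax(a_lin))))  # True
```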
Multiclass logistic regression: conclusion
• Suppose the class-conditional densities 𝑝(𝑥|𝑦 = 𝑖) are normal:
𝑝(𝑥|𝑦 = 𝑖) = 𝑁(𝑥|𝜇𝑖, 𝐼) = (1 / (2π)^{𝑑/2}) exp{−(1/2)‖𝑥 − 𝜇𝑖‖²}
• Then
𝑝(𝑦 = 𝑖|𝑥) = exp(𝑤𝑖ᵀ𝑥 + 𝑏𝑖) / Σ𝑗 exp(𝑤𝑗ᵀ𝑥 + 𝑏𝑗)
which is the hypothesis class for multiclass logistic regression
• It is a softmax applied to a linear transformation; it can be used to derive the negative log-likelihood loss (cross entropy)
Softmax
• A way to squash 𝑎 = (𝑎1, 𝑎2, …, 𝑎𝑖, …) into a probability vector 𝑝:
softmax(𝑎) = (exp(𝑎1) / Σ𝑗 exp(𝑎𝑗), exp(𝑎2) / Σ𝑗 exp(𝑎𝑗), …, exp(𝑎𝑖) / Σ𝑗 exp(𝑎𝑗), …)
• Behaves like max: when 𝑎𝑖 ≫ 𝑎𝑗 for all 𝑗 ≠ 𝑖, we get 𝑝𝑖 ≈ 1 and 𝑝𝑗 ≈ 0
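In code, a numerically stable implementation subtracts the maximum score before exponentiating (a standard trick, not specific to these slides):

```python
# Numerically stable softmax: subtracting the max score changes nothing
# mathematically (it cancels in the ratio) but prevents overflow in exp.
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

p = softmax(np.array([10.0, 0.0, -2.0]))
print(bool(p[0] > 0.999))              # True -- dominant score takes almost all mass
print(bool(np.isclose(p.sum(), 1.0)))  # True -- a valid probability vector
```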
Cross entropy for conditional distribution
• Let 𝑝data(𝑦|𝑥) denote the empirical distribution of the data
• Negative log-likelihood
−(1/𝑛) Σ_{𝑖=1}^{𝑛} log 𝑝(𝑦 = 𝑦𝑖|𝑥𝑖) = −E_{𝑝data(𝑦|𝑥)} [log 𝑝(𝑦|𝑥)]
is the cross entropy between 𝑝data and the model output 𝑝
• Information theory viewpoint: KL divergence
𝐷(𝑝data ‖ 𝑝) = E_{𝑝data} [log (𝑝data / 𝑝)] = E_{𝑝data} [log 𝑝data] − E_{𝑝data} [log 𝑝]
where the first term is the (negative) entropy of 𝑝data, a constant, and the second term is the cross entropy
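The decomposition can be verified on a small example (both distributions below are made up):

```python
# Check of D(p_data || p) = E_{p_data}[log p_data] - E_{p_data}[log p]:
# KL divergence = cross entropy minus entropy. Both distributions are
# made-up three-class examples.
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])   # empirical distribution
p_model = np.array([0.5, 0.3, 0.2])  # model output

kl = np.sum(p_data * np.log(p_data / p_model))
cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))  # constant in the model
print(bool(np.isclose(kl, cross_entropy - entropy)))  # True
```

Since the entropy term does not depend on the model, minimizing cross entropy is the same as minimizing the KL divergence.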
Cross entropy for full distribution
• Let 𝑝data(𝑥, 𝑦) denote the empirical distribution of the data
• Negative log-likelihood
−(1/𝑛) Σ_{𝑖=1}^{𝑛} log 𝑝(𝑥𝑖, 𝑦𝑖) = −E_{𝑝data(𝑥,𝑦)} [log 𝑝(𝑥, 𝑦)]
is the cross entropy between 𝑝data and the model output 𝑝
Multiclass logistic regression: summary
Last hidden layer ℎ → Linear: (𝑤𝑗)ᵀℎ + 𝑏𝑗 → Convert to probability: softmax gives 𝑝𝑗 → Loss: cross entropy against label 𝑦𝑖
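The whole pipeline fits in a few lines (a sketch with made-up weights and label, not the course's reference implementation):

```python
# Last hidden layer h -> linear scores -> softmax -> cross-entropy loss.
# W, b, h, and the label y are made-up illustrative values.
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

rng = np.random.default_rng(0)
K, d = 5, 8
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
h = rng.normal(size=d)  # last hidden layer
y = 3                   # true label (0-indexed)

scores = W @ h + b      # linear: (w_j)^T h + b_j
p = softmax(scores)     # convert to probability: p_j
loss = -np.log(p[y])    # cross entropy / negative log-likelihood
print(bool(np.isclose(p.sum(), 1.0)))  # True
print(bool(loss > 0))                  # True -- positive unless p[y] = 1
```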