review for final - github pages for final.pdf · •what is the entropy of a group in which all...
TRANSCRIPT
![Page 1: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/1.jpg)
Review for the FinalShuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/VE445/index.html
1
![Page 2: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/2.jpg)
Exam code
• Exam on Dec 6, 8:00-9:40 at Dong Xia Yuan 102 (lecture classroom)
• Finish the exam paper by yourself
• Allowed:• Calculator
• Not allowed:• Books, materials, cheat sheet, …
• Phones, any smart device
• No entering after 8:25
• Early submission period: 8:30--9:25
2
![Page 3: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/3.jpg)
Coverage of final
• Basics
• Supervised learning• Regression
• SVM and Kernel methods
• Decision Tree
• Deep learning• Neural Networks
• Backpropagation
• Convolutional Neural Network
• Recurrent Neural Network
• Unsupervised learning• K-means, Agglomerative clustering,
BFR, CURE
• PCA, SVD, Autoencoder, Feature selection
• EM, GMM
• HMM
• Learning theory• Bias-variance decomposition
3
![Page 4: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/4.jpg)
Exam contents
• 4 questions • 3 with 30 marks
• Computation tasks
• Concept understanding
• 1 with 10 marks• Algorithm description
4
![Page 5: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/5.jpg)
Supervised Learning
5
![Page 6: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/6.jpg)
Machine Learning Categories
• Unsupervised learning• No labeled data
• Supervised learning• Use labeled data to predict on unseen points
• Semi-supervised learning• Use labeled data and unlabeled data to predict on unlabeled/unseen points
• Reinforcement learning• Sequential prediction and receiving feedbacks
6
![Page 7: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/7.jpg)
Supervised learning example
7
![Page 8: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/8.jpg)
8
![Page 9: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/9.jpg)
Regression example
9
![Page 10: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/10.jpg)
Model Evaluations
10
![Page 11: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/11.jpg)
Classification -- Model evaluations
• Confusion Matrix• TP – True Positive ; FP – False Positive
• FN – False Negative; TN – True Negative
Predicted Class
Actual
Class
Class = Yes Class = No
Class = Yes a (TP) b (FN)
Class = No c (FP) d (TN)
FNFPTNTP
TNTP
dcba
da
+++
+=
+++
+= Accuracy
11
![Page 12: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/12.jpg)
Classification -- Model evaluations
• Limitation of Accuracy• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
• If a “stupid” model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
• The accuracy is misleading because the model does not detect any example in class 1
12
![Page 13: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/13.jpg)
Classification -- Model evaluations
• Cost-sensitive measures
cba
a
pr
rp
ba
a
FNTP
TP
ca
a
FPTP
TP
++=
+=
+=
+=
+=
+=
2
22(F) measure-F
(r) Recall
(p) Precision
Harmonic mean of Precision and Recall (Why not just average?)
Predicted Class
Actual
Class
Class = Yes Class = No
Class = Yes a (TP) b (FN)
Class = No c (FP) d (TN)
13
![Page 14: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/14.jpg)
Example
• Given 30 human photographs, a computer predicts 19 to be male, 11 to be female. Among the 19 male predictions, 3 predictions are not correct. Among the 11 female predictions, 1 prediction is not correct.
Predicted Class
Actual
Class
Male Female
Male a = TP = 16 b = FN = 1
Female c = FP = 3 d = TN = 10
14
![Page 15: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/15.jpg)
Example
• Accuracy = (16 + 10) / (16 + 3 + 1 + 10) = 0.867
• Precision = 16 / (16 + 3) = 0.842
• Recall = 16 / (16 + 1) = 0.941
• F-measure = 2 (0.842)(0.941) / (0.842 + 0.941)
= 0.889
Predicted Class
Actual
Class
Male Female
Male a = TP = 16 b = FN = 1
Female c = FP = 3 d = TN = 10
15
![Page 16: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/16.jpg)
Decision Tree
16
![Page 17: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/17.jpg)
Tree models
• Tree models• Intermediate node for splitting data
• Leaf node for label prediction
• Discrete/categorical data example
17
![Page 18: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/18.jpg)
Tree models
• Tree models• Intermediate node for splitting data
• Leaf node for label prediction
• Discrete/categorical data example
• Key questions for decision trees• How to select node splitting conditions?
• How to make prediction?
• How to decide the tree structure?
18
![Page 19: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/19.jpg)
Node splitting
• Which node splitting condition to choose?
• Choose the features with higher classification capacity• Quantitatively, with higher information gain
19
![Page 20: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/20.jpg)
Information Theory
20
![Page 21: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/21.jpg)
Motivating example 1
• Suppose you are a police officer and there was a robbery last night. There are several suspects and you want to find the criminal from them by asking some questions.
• You may ask: where are you last night?
• You are not likely to ask: what is your favorite food?
• Why there is a preference for the policeman? Because the first one can distinguish the guilty from the innocent. It is more informative.
21
![Page 22: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/22.jpg)
Motivating example 2
• Suppose we have a dataset of two classes of people. Which split is better?
• We prefer the right split because there is no outliers and it is more certain.
22
FemaleMale EmployedUnemployed
![Page 23: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/23.jpg)
Entropy
• How to measure the level of informative (first example) and level of certainty (second example) in mathematics?
• Entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message
• Suppose 𝑋 is a random variable with 𝑛 discrete values𝑃 𝑋 = 𝑥𝑖 = 𝑝𝑖
then the entropy is
𝐻 𝑋 = −
𝑖=1
𝑛
𝑝𝑖log2 𝑝𝑖
• Easy to verify 𝐻 𝑋 = −σ𝑖=1𝑛 𝑝𝑖log2 𝑝𝑖 ≤ −σ𝑖=1
𝑛 1
𝑛log2
1
𝑛= log 𝑛
23
![Page 24: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/24.jpg)
Illustration
• Entropy for binary distribution
𝐻 𝑋 = −𝑝1 log2 𝑝1 − 𝑝0log2 𝑝0
24
![Page 25: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/25.jpg)
Entropy examples
• What is the entropy of a group in which all examples belong to the same class?• Entropy = − 1 log21 = 0
• What is the entropy of a group with 50% in either class?• Entropy = −0.5 log20.5 – 0.5 log20.5 = 1
25
Minimum uncertainty
Maximum uncertainty
![Page 26: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/26.jpg)
Conditional entropy
• Specific conditional entropy of 𝑋 given 𝑌 = 𝑦
𝐻 𝑋|𝑌 = 𝑦 = −
𝑖=1
𝑛
𝑃 𝑋 = 𝑖|𝑌 = 𝑦 log 𝑃 𝑋 = 𝑖|𝑌 = 𝑦
• Conditional entropy of 𝑋 given 𝑌
𝐻 𝑋 𝑌 =
𝑦
𝑃(𝑌 = 𝑦)𝐻 𝑋|𝑌 = 𝑦
• Information gain of 𝑋 given 𝑌𝐼 𝑋, 𝑌 = 𝐻 𝑋 − 𝐻(𝑋|𝑌)
26
![Page 27: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/27.jpg)
Interpretation from coding perspective
• Usually entropy denotes the minimal average message length of the best coding (theoretically)
• When the probability distribution is composed of 1
2𝑖, then the average
length of Hoffman code is the entropy
27
![Page 28: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/28.jpg)
Decision Tree
28
![Page 29: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/29.jpg)
29
![Page 30: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/30.jpg)
Node splitting
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned
• Information gain tells us how important a given attribute of the feature vectors is• Is used to decide the ordering of attributes in the nodes of a decision tree
• Information gain of 𝑋 given 𝑌𝐼 𝑋, 𝑌 = 𝐻 𝑋 − 𝐻(𝑋|𝑌)
30
![Page 31: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/31.jpg)
Example
• Given a dataset of 8 students about whether they like the famous movie Gladiator, calculate the entropy in this dataset
31
Like
Yes
No
Yes
No
No
Yes
No
Yes
![Page 32: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/32.jpg)
Example (cont.)
• Given a dataset of 8 students about whether they like the famous movie Gladiator, calculate the entropy in this dataset
32
𝐸 𝐿𝑖𝑘𝑒 = −4
8log
4
8−−
4
8log
4
8=1
Like
Yes
No
Yes
No
No
Yes
No
Yes
![Page 33: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/33.jpg)
Example (cont.)
• Suppose we now also know the gender of these 8 students, what is the conditional entropy on gender?
33
Gender Like
Male Yes
Female No
Male Yes
Female No
Female No
Male Yes
Male No
Female Yes
![Page 34: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/34.jpg)
Example (cont.)
• Suppose we now also know the gender of these 8 students, what is the conditional entropy on gender?
• The labels are divided into two small dataset based on the gender
34P(Yes | male) = 0.75 P(Yes | female) = 0.25
Gender Like
Male Yes
Female No
Male Yes
Female No
Female No
Male Yes
Male No
Female Yes
Like (male)
Yes
Yes
Yes
No
Like(female)
No
No
No
Yes
![Page 35: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/35.jpg)
Example (cont.)
• Suppose we now also know the gender of these 8 students, what is the conditional entropy on gender?• P(Yes|male) = 0.75
• P(Yes|female) = 0.25
• H Like|male
= −1
4𝑙𝑜𝑔
1
4−
3
4𝑙𝑜𝑔
3
4= −0.25 ∗ −2 − 0.75 ∗ −0.41 = 0.81
• H Like|female
= −3
4𝑙𝑜𝑔
3
4−
1
4𝑙𝑜𝑔
1
4= −0.75 ∗ −0.41 − 0.25 ∗ −2 = 0.81
35
Gender Like
Male Yes
Female No
Male Yes
Female No
Female No
Male Yes
Male No
Female Yes
![Page 36: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/36.jpg)
Example (cont.)
• Suppose we now also know the gender of these 8 students, what is the conditional entropy on gender?• E Like|Gender= 𝐸 𝐿𝑖𝑘𝑒 𝑚𝑎𝑙𝑒 ∗ 𝑃 𝑚𝑎𝑙𝑒+𝐸 𝐿𝑖𝑘𝑒 𝑓𝑒𝑚𝑎𝑙𝑒 ∗ 𝑃 𝑓𝑒𝑚𝑎𝑙𝑒
= 0.5 ∗ 0.81 + 0.5 ∗ 0.81 = 0.81
• I Like, Gender = E Like − E Like Gender= 1 − 0.81 = 0.19
36
Gender Like
Male Yes
Female No
Male Yes
Female No
Female No
Male Yes
Male No
Female Yes
![Page 37: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/37.jpg)
Example (cont.)
• Suppose we now also know the major of these 8 students, what about the conditional entropy on major?
37
Major Like
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
![Page 38: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/38.jpg)
Example (cont.)
• Suppose we now also know the major of these 8 students, what about the conditional entropy on major?
• Three datasets are created based on major
38
Major Like
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math YesP(Yes|history) = 0 P(Yes|cs) = 1P(Yes|math) = 0.5
Like (math)
Yes
No
No
Yes
Like(history)
No
No
Like(cs)
Yes
Yes
![Page 39: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/39.jpg)
Example (cont.)
• Suppose we now also know the major of these 8 students, what about the new Entropy
• H Like|Math = −2
4log
2
4−
2
4log
2
4=1
• H Like|CS = −2
2log
2
2−
0
2log
0
2=0
• H Like|history = −2
2log
2
2−
0
2log
0
2=0
• H Like|Major= H Like math × 𝑃 𝑚𝑎𝑡ℎ
+H Like History × 𝑃 𝐻𝑖𝑠𝑡𝑜𝑟𝑦+H Like cs × 𝑃 𝑐𝑠
= 0.5 ∗ 1 + 0.25 ∗ 0 + 0.25 ∗ 0 = 0.5
• I Like,Major = E Like − E Like Major= 1 − 0.5 = 0.5 39
Major Like
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
![Page 40: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/40.jpg)
Example (cont.)
• Compare gender and major
• As we have computed• I(Like, Gender) = E Like −E Like Gender = 1 − 0.81 = 0.19
• I(Like,Major) = E Like −E Like Major = 1 − 0.5 = 0.5
• Major is the better feature to predict the label “like”
40
Gender Major Like
Male Math Yes
Female History No
Male CS Yes
Female Math No
Female Math No
Male CS Yes
Male History No
Female Math Yes
![Page 41: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/41.jpg)
Example (cont.)
• Major is used as the decision condition and it splits the dataset into three small one based on the answer
41
Gender Like
Male Yes
Female No
Female No
Female Yes
Gender Like
Female No
Male No
Major?
Math CS
Gender Like
Male Yes
Female Yes
History
![Page 42: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/42.jpg)
Example (cont.)
• The history and CS subset contain only one label, so we only need to further expand the math subset
42
Gender Like
Male Yes
Female No
Female No
Female Yes
Gender Like
Female No
Male No
Major?
Math CS
Gender Like
Male Yes
Female Yes
History
![Page 43: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/43.jpg)
Example (cont.)
43
Major?
Math CSHistory
Like
Yes
Like
No
Gender?
female
Like
Y/N
Like
Yes
male
![Page 44: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/44.jpg)
Example (cont.)
• In the stage of testing, suppose there come a female students from the CS department, how can we predict whether she like the movie Gladiator?• Based on the major of CS, we will directly predict she like the movie.
• What about a male student and a female student from math department?
44
![Page 45: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/45.jpg)
Decision tree building: ID3 algorithm
• Algorithm framework• Start from the root node with all data
• For each node, calculate the information gain of all possible features
• Choose the feature with the highest information gain
• Split the data of the node according to the feature
• Do the above recursively for each leaf node, until• There is no information gain for the leaf node
• Or there is no feature to select
• Testing• Pass the example through the tree to the leaf node for a label
45
![Page 46: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/46.jpg)
Unsupervised Learning
46
![Page 47: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/47.jpg)
Machine learning categories
• Unsupervised learning• No labeled data
• Supervised learning• Use labeled data to predict on unseen points
• Semi-supervised learning• Use labeled data and unlabeled data to predict on unlabeled/unseen points
• Reinforcement learning• Sequential prediction and receiving feedbacks
47
![Page 48: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/48.jpg)
Unsupervised learning example
48
![Page 49: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/49.jpg)
49
![Page 50: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/50.jpg)
50
![Page 51: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/51.jpg)
51
![Page 52: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/52.jpg)
Clustering
52
![Page 53: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/53.jpg)
Clustering
• Unsupervised learning
• Requires data, but no labels
• Detect patterns e.g. in• Group emails or search results
• Customer shopping patterns
• Regions of images
• Useful when don’t know what you’re looking for
• But: can get gibberish
53
![Page 54: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/54.jpg)
Clustering (cont.)
• Goal: Automatically segment data into groups of similar points• Question: When and why would we want to do this?• Useful for:
• Automatically organizing data• Understanding hidden structure in some data• Representing high-dimensional data in a low-dimensional space
• Examples: Cluster• customers according to purchase histories• genes according to expression profile• search results according to topic• Facebook users according to interests• a museum catalog according to image similarity
54
![Page 55: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/55.jpg)
Intuition
• Basic idea: group together similar instances
• Example: 2D point patterns
55
![Page 56: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/56.jpg)
Intuition (cont.)
• Basic idea: group together similar instances
• Example: 2D point patterns
56
![Page 57: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/57.jpg)
Intuition (cont.)
• Basic idea: group together similar instances
• Example: 2D point patterns
• What could “similar” mean?
• – One option: small Euclidean distance (squared)
• – Clustering results are crucially dependent on the measure of similarity (or distance) between “points” to be clustered
57
![Page 58: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/58.jpg)
Set-up
• Given the data:
• Each data point x is 𝑑-dimensional: 𝑥𝑖 = (𝑥𝑖,1, … , 𝑥𝑖,𝑑)
• Define a distance function between data:
• Goal: segment the data into 𝐾 groups
58
![Page 59: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/59.jpg)
Clustering algorithms
• Partition algorithms (flat clustering)• K-means
• Mixture of Gaussian
• Spectral Clustering
• Hierarchical algorithms• Bottom up-agglomerative
• Top down-divisive
59
![Page 60: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/60.jpg)
Example
• Image segmentation
• Goal: Break up the image into meaningful or perceptually similar regions
60
![Page 61: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/61.jpg)
Example 2
• Gene expression data clustering• Activity level of genes across time
61
![Page 62: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/62.jpg)
K-Means
62
![Page 63: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/63.jpg)
K-Means
• An iterative clustering algorithm
• Initialize: Pick 𝐾 random points as cluster centers
• Alternate:• Assign data points to closest cluster
center
• Change the cluster center to the average of its assigned points
• Stop: when no points’ assignments change
63
![Page 64: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/64.jpg)
K-Means (cont.)
• An iterative clustering algorithm
• Initialize: Pick 𝐾 random points as cluster centers
• Alternate:• Assign data points to closest cluster
center
• Change the cluster center to the average of its assigned points
• Stop: when no points’ assignments change
64
![Page 65: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/65.jpg)
K-Means (cont.)
• An iterative clustering algorithm
• Initialize: Pick 𝐾 random points as cluster centers
• Alternate:• Assign data points to closest cluster
center
• Change the cluster center to the average of its assigned points
• Stop: when no points’ assignments change
65
![Page 66: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/66.jpg)
Example
• Pick 𝐾 random points as cluster centers (means)
• Shown here for 𝐾 = 2
66
![Page 67: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/67.jpg)
Example (cont.)
• Iterative step 1
• Assign data points to closest cluster center
67
![Page 68: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/68.jpg)
Example (cont.)
• Iterative step 2
• Change the cluster center to the average of the assigned points
68
![Page 69: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/69.jpg)
Example (cont.)
• Repeat until convergence
• Convergence means that the differences of the center positions in two continuous loops is smaller than a threshold
69
![Page 70: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/70.jpg)
Example (cont.)
• Repeat until convergence
• Convergence means that the differences of the center positions in two continuous loops is smaller than a threshold
70
![Page 71: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/71.jpg)
Example (cont.)
• Repeat until convergence
• Convergence means that the differences of the center positions in two continuous loops is smaller than a threshold
71
![Page 72: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/72.jpg)
Example 2
• 𝐾 = 4
72
![Page 73: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/73.jpg)
Example 2 (cont.)
• 𝐾 = 4
73
![Page 74: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/74.jpg)
Example 2 (cont.)
• 𝐾 = 4
74
![Page 75: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/75.jpg)
Example 2 (cont.)
• 𝐾 = 4
75
![Page 76: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/76.jpg)
Example 2 (cont.)
• 𝐾 = 4
76
![Page 77: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/77.jpg)
Example 2 (cont.)
• 𝐾 = 4
77
![Page 78: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/78.jpg)
Remained Questions in K-Means
78
![Page 79: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/79.jpg)
Remained questions in K-means
• Although the workflow of K-means is straight forward, there are some important questions that need to be discussed
• How to choose the hyper-parameter 𝐾?
• How to initialize?
79
![Page 80: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/80.jpg)
How to choose K?
• K is the most important hyper-parameter in K-means which strongly affects its performance. In some situation, it’s not an easy task to find the proper K
• The solution includes:• The elbow method
• The silhouette method
80
![Page 81: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/81.jpg)
The elbow method
• Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of 𝐾, and choose the 𝐾 for which WSS stops dropping significantly. In the plot of WSS-versus-k, this is visible as an elbow
• Example: 𝐾 = 3
81
![Page 82: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/82.jpg)
The silhouette method
• The problem of the elbow method is that in many situations the most suitable 𝐾 cannot be unambiguously identified. So we need the silhouette method
• The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation). The range of the silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster
82
![Page 83: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/83.jpg)
How to initialize center positions?
• The positions of the centers in the stage of initialization are also very important in K-means algorithms. In some situations it can produce totally different clustering results
• Example:
83
![Page 84: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/84.jpg)
How to initialize center positions? (cont.)
• The positions of the centers in the stage of initialization are also very important in K-means algorithms. In some situations it can produce totally different clustering results
• Example:
84
![Page 85: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/85.jpg)
A possible solution
• Pick one point at random, then 𝐾 − 1other points, each as far away as possible from the previous points• OK, as long as there are no outliers (points
that are far from any reasonable cluster)
85
![Page 86: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/86.jpg)
K-means++
1. The first centroid is chosen uniformly at random from the data points that we want to cluster. This is similar to what we do in K-Means, but instead of randomly picking all the centroids, we just pick one centroid here
2. Next, we compute the distance 𝑑𝑥 is the nearest distance from data point 𝑥 to the centroids that have already been chosen
3. Then, choose the new cluster center from the data points with the probability of 𝑥 being proportional to 𝑑𝑥
2
4. We then repeat steps 2 and 3 until 𝐾 clusters have been chosen
86
![Page 87: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/87.jpg)
Example
• Suppose we have the following points and we want to make 3 clusters here:
87
![Page 88: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/88.jpg)
Example (cont.)
• First step is to randomly pick a data point as a cluster centroid:
88
![Page 89: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/89.jpg)
Example (cont.)
• Calculate the distance 𝑑𝑥 of each data point with this centroid:
89
![Page 90: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/90.jpg)
Example (cont.)
• The next centroid will be sampled with the probability proportional to 𝑑𝑥2
• Say the sampled is the red one
90
![Page 91: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/91.jpg)
Example (cont.)
• To select the last centroid, compute 𝑑𝑥, which is the distance to its closest centroid
91
![Page 92: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/92.jpg)
Example (cont.)
• Sample the one with the probability proportional to 𝑑𝑥2
• Say, the blue one
92
![Page 93: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/93.jpg)
Properties of K-Means
93
![Page 94: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/94.jpg)
How to measure the performance
• K-means can be evaluated by the sum of distance from points to corresponding centers, or the WSS
• The loss will approach zero when increase 𝐾
94
![Page 95: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/95.jpg)
Properties of the K-means algorithm
• Guaranteed to converge in a finite number of iterations
• Running time per iteration:1. Assign data points to closest cluster center
O(KN) time
2. Change the cluster center to the average of its assigned points
O(N)
95
![Page 96: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/96.jpg)
Convergence of K-means
Not guaranteed to converge to optimal
96
![Page 97: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/97.jpg)
Application: Segmentation
• Goal of segmentation is to partition an image into regions each of which has reasonably homogenous visual appearance
• Cluster the colors
97
![Page 98: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/98.jpg)
Agglomerative Clustering
98
![Page 99: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/99.jpg)
Agglomerative clustering
• Agglomerative clustering:• First merge very similar instances
• Incrementally build larger clusters out of smaller clusters
• Algorithm:• Maintain a set of clusters
• Initially, each instance in its own cluster
• Repeat:• Pick the two closest clusters
• Merge them into a new cluster
• Stop when there’s only one cluster left
99
![Page 100: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/100.jpg)
Agglomerative clustering
• Produces not one clustering, but a family of clusterings represented by a dendrogram
100
![Page 101: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/101.jpg)
Example
• Different heights give different clustering
101
![Page 102: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/102.jpg)
Closeness
• How should we define “closest” for clusters with multiple elements?
• Many options:• Closest pair (single-link clustering)
• Farthest pair (complete-link clustering)
• Average of all pairs
• Different choices create different clustering behaviors
102
![Page 103: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/103.jpg)
Closeness example
103
![Page 104: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/104.jpg)
Closeness example 2
104
![Page 105: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/105.jpg)
Dimensionality Reduction
105
![Page 106: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/106.jpg)
Example – MNIST dataset
106
![Page 107: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/107.jpg)
Principal Components AnalysisPCA
107
![Page 108: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/108.jpg)
Principal components analysis (PCA)
• Principal components analysis (PCA) is a technique that can be used to simplify a dataset
• It is usually a linear transformation that chooses a new coordinate system for the data set such that• greatest variance by any projection of the dataset comes to lie on the first axis
(then called the first principal component)
• the second greatest variance on the second axis, and so on
• PCA can be used for reducing dimensionality by eliminating the later principal components
108
![Page 109: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/109.jpg)
Example
• Consider the following 3D points
• If each component is stored in a byte, we need 18 = 3 × 6 bytes
109
1
2
3
2
4
6
4
8
12
3
6
9
5
10
15
6
12
18
![Page 110: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/110.jpg)
Example (cont.)
• Looking closer, we can see that all the points are related geometrically• they are all in the same direction, scaled by a factor:
110
1
2
3
2
4
6
4
8
12
3
6
9
5
10
15
6
12
18
1
2
3
= 1 ×
1
2
3
= 4 ×
1
2
3
= 5 ×
1
2
3
= 2 ×
1
2
3
= 3 ×
1
2
3
= 6 ×
![Page 111: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/111.jpg)
Example (cont.)
• They can be stored using only 9 bytes (50% savings!): • Store one direction (3 bytes) + the multiplying constants (6 bytes)
111
1
2
3
2
4
6
4
8
12
3
6
9
5
10
15
6
12
18
1
2
3
= 1 ×
1
2
3
= 4 ×
1
2
3
= 5 ×
1
2
3
= 2 ×
1
2
3
= 3 ×
1
2
3
= 6 ×
![Page 112: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/112.jpg)
Geometrical interpretation
• View points in 3D space
• In this example, all the points happen to lie on one line• a 1D subspace of the original 3D space
112
![Page 113: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/113.jpg)
Geometrical interpretation
• Consider a new coordinate system where the first axis is along the direction of the line
• In the new coordinate system, every point has only one non-zero coordinate• we only need to store the direction of the line (a 3 bytes point) and the
nonzero coordinates for each point (6 bytes) 113
![Page 114: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/114.jpg)
Back to PCA
• Given a set of points, how can we know if they can be compressed similarly to the previous example?• We can look into the correlation between the points by the tool of PCA
114
![Page 115: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/115.jpg)
From example to theory
• In previous example, PCA rebuilds the coordination system for the data by selecting • the direction with largest variance as the first new base direction
• the direction with the second largest variance as the second new base direction
• and so on
• Then how can we find the direction with largest variance?• By the eigenvector for the covariance matrix of the data
115
![Page 116: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/116.jpg)
Review – Variance
• Variance is the expectation of the squared deviation of a random variable from its mean• Informally, it measures how far a set of (random) numbers are spread out
from their average value
116
![Page 117: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/117.jpg)
Review – Covariance
• Covariance is a measure of the joint variability of two random variables• If the greater values of one variable mainly correspond with the greater
values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive• E.g. as the number of hours studied increases, the marks in that subject increase
• In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative
• The sign of the covariance therefore shows the tendency in the linearrelationship between the variables
• The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation 117
![Page 118: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/118.jpg)
Review – Covariance (cont.)
• Sample covariance
118
covariance 𝑋, 𝑌 =σ𝑖=1𝑁 (𝑥𝑖 − ҧ𝑥)(𝑦𝑖 − ത𝑦)
𝑁 − 1
![Page 119: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/119.jpg)
PCA
• PCA tries to identify the subspace in which the data approximately lies in
• PCA uses an orthogonal transformation on the coordinate system to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components• The number of principal components is less than or equal to min 𝑑,𝑁
119
![Page 120: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/120.jpg)
Covariance matrix
• Suppose there are 3 dimensions, denoted as 𝑋, 𝑌, 𝑍. The covariance matrix is
• Note the diagonal is the covariance of each dimension with respect to itself, which is just the variance of each random variable
• Also 𝐶𝑂𝑉(𝑋, 𝑌) = 𝐶𝑂𝑉(𝑌, 𝑋)• hence matrix is symmetric about the diagonal
• 𝑑-dimensional data will result in a 𝑑 × 𝑑 covariance matrix120
𝐶𝑂𝑉 =
𝐶𝑂𝑉(𝑋, 𝑋) 𝐶𝑂𝑉(𝑋, 𝑌) 𝐶𝑂𝑉(𝑋, 𝑍)𝐶𝑂𝑉(𝑌, 𝑋) 𝐶𝑂𝑉(𝑌, 𝑌) 𝐶𝑂𝑉(𝑌, 𝑍)𝐶𝑂𝑉(𝑍, 𝑋) 𝐶𝑂𝑉(𝑍, 𝑌) 𝐶𝑂𝑉(𝑍, 𝑍)
![Page 121: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/121.jpg)
Covariance in the covariance matrix
• Diagonal, or the variance, measures the deviation from the mean for data points in one dimension
• Covariance measures how one dimension random variable varies w.r.t.another, or if there is some linear relationship among them
121
![Page 122: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/122.jpg)
Data processing
• Given the dataset 𝐷 = 𝑥(𝑖)𝑖=1
𝑁
• Let ҧ𝑥 =1
𝑁σ𝑖=1𝑁 𝑥(𝑖)
𝑋 =
𝑥(1) − ҧ𝑥⊤
𝑥(2) − ҧ𝑥⊤
⋮
𝑥(𝑁) − ҧ𝑥⊤
∈ ℝ𝑁×𝑑
• Move the center of the data set to 0
122
![Page 123: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/123.jpg)
Data processing (cont.)
• 𝑄 = 𝑋⊤𝑋 = 𝑥(1) − ҧ𝑥 𝑥(2) − ҧ𝑥 ⋯ 𝑥(𝑁) − ҧ𝑥
𝑥(1) − ҧ𝑥⊤
𝑥(2) − ҧ𝑥⊤
⋮
𝑥(𝑁) − ҧ𝑥⊤
• 𝑄 is square with 𝑑 dimension
• 𝑄 is symmetric
• 𝑄 is the covariance matrix [aka scatter matrix]
• 𝑄 can be very large (in vision, 𝑑 is often the number of pixels in an image!)• For a 256 × 256 image, 𝑑 = 65536!!
• Don’t want to explicitly compute 𝑄
123
![Page 124: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/124.jpg)
PCA
• By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest variation in the dataset
• This is the principal component
• Application:• face recognition, image compression
• finding patterns in data of high dimension
124
![Page 125: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/125.jpg)
PCA theorem
• Theorem:
• Each 𝑥(𝑖) can be written as: 𝑥(𝑖) = ҧ𝑥 + σ𝑗=1𝑑 𝑔𝑖𝑗𝑒𝑗
where 𝑒𝑗 are the 𝑑 eigenvectors of 𝑄 with non-zero eigenvalues
• Notes:1. The eigenvectors 𝑒1𝑒2⋯𝑒𝑑 span an eigenspace2. 𝑒1𝑒2⋯𝑒𝑑 are 𝑑 × 1 orthonormal vectors (directions in 𝑑-Dimensional
space)3. The scalars 𝑔𝑖𝑗 are the coordinates of 𝑥(𝑖) in the space
𝑔𝑖𝑗 = 𝑥(𝑖) − ҧ𝑥, 𝑒𝑗
125
![Page 126: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/126.jpg)
Using PCA to compress data
• Expressing 𝑥 in terms of 𝑒1𝑒2⋯𝑒𝑑 doesn’t change the size of the data
• However, if the points are highly correlated, many of the new coordinates of 𝑥 will become zero or close to zero
• Sort the eigenvectors 𝑒𝑖 according to their eigenvalue𝜆1 ≥ 𝜆2 ≥ ⋯ ≥ 𝜆𝑑
• Assume 𝜆𝑗 ≈ 0 if 𝑗 > 𝑘. Then
𝑥(𝑖) ≈ ҧ𝑥 +
𝑗=1
𝑘
𝑔𝑖𝑗𝑒𝑗
126
![Page 127: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/127.jpg)
Example – STEP 1
127
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
mean
this becomes thenew origin of thedata from now on
![Page 128: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/128.jpg)
Example – STEP 2
• Calculate the covariance matrix
Cov =0.616555556 0.6154444440.615444444 0.716555556
• since cov(𝑋, 𝑌) is positive, it is expect that 𝑥 and 𝑦 increase together
128
![Page 129: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/129.jpg)
Example – STEP 3
• Calculate the eigenvectors and eigenvalues of the covariance matrix
• eigenvalues =0.04908339891.28402771
• eigenvectors =−0.735178656 −0.6778733990.677873399 −0.735178656
129
![Page 130: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/130.jpg)
Example – STEP 3 (cont.)
• Eigenvectors are plotted as diagonal dotted lines on the plot
• Note they are perpendicular to each other
• Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit
• The second eigenvector gives us the other, less important, pattern in the data, that all the points follow the main line, but are off to the side of the main line by some amount
130
![Page 131: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/131.jpg)
Example – STEP 4
• Feature vector = 𝑒1 𝑒2 ⋯ 𝑒𝑑
• We can either form a feature vector with both of the eigenvectors:−0.735178656 −0.6778733990.677873399 −0.735178656
• or, we can choose to delete the smaller, less significant component:−0.677873399−0.735178656
131
![Page 132: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/132.jpg)
Example – STEP 5
• Deriving new data coordinates
• RowZeroMeanData is the mean-adjusted data, i.e. the data items are in each row, with each column representing a separate dimension
• RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
• Note: We rotate the coordinate axes so high-variance axis comes first
132
FinalData = RowZeroMeanData x RowFeatureVector
FinalData𝑁×𝑑 = 𝑔 𝑥(1)
⊤
⋮
𝑔 𝑥(𝑁)⊤
𝑁×𝑑
𝑒1⊤
⋮𝑒𝑑⊤
𝑑×𝑑
![Page 133: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/133.jpg)
Example – STEP 5 (cont.)
• The plot of the PCA results using both the two eigenvector
133
![Page 134: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/134.jpg)
Example – Final approximation
134
2D point cloudApproximation using one eigenvector basis
![Page 135: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/135.jpg)
Example – Final approximation
FinalData𝑁×𝑑 =𝑔 𝑥(1)
⊤
⋮
𝑔 𝑥(𝑁)⊤
𝑁×𝑑
𝑒1⊤
⋮𝑒𝑑⊤
𝑑×𝑑
≈
𝑔 𝑥(1)1
… 𝑔 𝑥(1)𝑘
… 𝑔 𝑥(1)𝑑
⋮
𝑔 𝑥(𝑁)1
… 𝑔 𝑥(𝑁)𝑘
… 𝑔 𝑥(𝑁)𝑑 𝑁×𝑑
𝑒1⊤
⋮𝑒𝑘⊤
⋮𝑒𝑑⊤
𝑑×𝑑
135
Zero!!
![Page 136: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/136.jpg)
Revisit the eigenvectors in PCA
• It is critical to notice that the direction of maximum variance in the input space happens to be same as the principal eigenvector of the covariance matrix
• Why?
136
![Page 137: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/137.jpg)
Revisit the eigenvectors in PCA (cont.)
• The projection of each point 𝑥 to a direction 𝑢 (with 𝑢 = 1) is𝑥⊤𝑢
• The variance of the projection is
𝑖=1
𝑁
𝑥(𝑖) − ҧ𝑥⊤𝑢
2
= 𝑢⊤𝑄𝑢
which is maximized when 𝑢 is the eigenvector with the largest eigenvalue
• 𝑄 = σ𝑗=1𝑑 𝜆𝑗𝑒𝑗𝑒𝑗
⊤ = 𝐸Λ𝐸⊤ with Λ =𝜆1 ⋯⋮ ⋱ ⋮
⋯ 𝜆𝑑
137
![Page 138: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/138.jpg)
Review – Total/Explained variance
• 𝑅2 =𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑣𝑎𝑟𝑖𝑎𝑛𝑐𝑒
𝑡𝑜𝑡𝑎𝑙 𝑣𝑎𝑟𝑖𝑎𝑛𝑐𝑒
• Total variance:
• Explained variance:
• Or, it can be computed as:
where
138
![Page 139: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/139.jpg)
Total variance and PCA
• Note that 𝐼 = 𝑒1𝑒1⊤ +⋯+ 𝑒𝑑𝑒𝑑
⊤
• Total variance is
• σ𝑖=1𝑁 𝑥(𝑖) − ҧ𝑥
⊤𝑥(𝑖) − ҧ𝑥
• = σ𝑖=1𝑁 𝑥(𝑖) − ҧ𝑥
⊤𝑒1𝑒1
⊤ +⋯+ 𝑒𝑑𝑒𝑑⊤ 𝑥(𝑖) − ҧ𝑥
• = σ𝑗=1𝑑 𝑒𝑗
⊤𝑄 𝑒𝑗 = 𝜆1 +⋯+ 𝜆𝑑
139
![Page 140: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/140.jpg)
Total variance and PCA (cont.)• Approximation of each 𝑥(𝑖) − ҧ𝑥 ≈ σ𝑗=1
𝑘 𝑔𝑖𝑗𝑒𝑗 =: 𝑥(𝑖) − ҧ𝑥
• Then the explained variance is
• σ𝑖=1𝑁 𝑥(𝑖) − ҧ𝑥
⊤𝑥(𝑖) − ҧ𝑥
• = σ𝑖=1𝑁 𝑥(𝑖) − ҧ𝑥
⊤𝑒1𝑒1
⊤ +⋯+ 𝑒𝑑𝑒𝑑⊤ 𝑥(𝑖) − ҧ𝑥
• = σ𝑗=1𝑑 𝑒𝑗
⊤ ෨𝑄𝑒𝑗 = 𝜆1 +⋯+ 𝜆𝑘
• where ෨𝑄 = σ𝑖=1𝑁 𝑥(𝑖) − ҧ𝑥 𝑥(𝑖) − ҧ𝑥
⊤= 𝐸 ෨Λ𝐸⊤ with
• ෨Λ =
𝜆1⋱
𝜆𝑘0
0 140
![Page 141: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/141.jpg)
Eigenvalues and Eigenvectors
141
![Page 142: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/142.jpg)
Properties
• Scalar 𝜆 and vector 𝑣 are eigenvalues and eigenvectors of 𝐴𝐴𝑣 = 𝜆𝑣
• Visually, 𝐴𝑣 lies along the same line as the eigenvector of 𝑣
142
![Page 143: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/143.jpg)
Example
143
![Page 144: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/144.jpg)
How to solve eigenvalues and eigenvectors?
• If 𝐴 − 𝜆𝐼 𝑥 = 0 has a nonzero solution for 𝜆, then 𝐴 − 𝜆𝐼 is not invertible. Then the determinant of 𝐴 − 𝜆𝐼 must be zero
• 𝜆 is an eigenvalue of 𝐴 if and only if 𝐴 − 𝜆𝐼 is singular:det 𝐴 − 𝜆𝐼 = 0
144
![Page 145: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/145.jpg)
Example
• Find the eigenvalues and eigenvectors of 𝐴 =1 22 4
• Need to solve det 𝐴 − 𝜆𝐼 = 0
• 𝐴 − 𝜆𝐼 =1 − 𝜆 22 4 − 𝜆
• det 𝐴 − 𝜆𝐼 = 1 − 𝜆 4 − 𝜆 − 2 × 2 = 𝜆2 − 5𝜆
• det 𝐴 − 𝜆𝐼 = 0 would imply 𝜆 = 0 and 𝜆 = 5
145
![Page 146: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/146.jpg)
Example (cont.)
• Then solve 𝐴 − 𝜆𝐼 𝑥 = 0 to get the eigenvectors for each of the eigenvalues
• 𝐴𝑥 = 0 has a nonzero solution of 2−1
, or 2/ 5
−1/ 5
• 𝐴 − 5𝐼 𝑥 = 0 has a nonzero solution of 12
, or 1/ 5
2/ 5
• Note: Eigenvectors for different eigenvalues are orthogonal
146
![Page 147: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/147.jpg)
Singular Value DecompositionSVD
147
![Page 148: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/148.jpg)
SVD
• Singular Value Decomposition (SVD) is a factorization method of matrix. It states that any 𝑚 × 𝑛 matrix 𝐴 can be written as the product of 3 matrices:
• Where:• U is 𝑚 ×𝑚 and its columns are orthonormal eigenvectors of 𝐴𝐴⊤
• V is 𝑛 × 𝑛 and its columns are orthonormal eigenvectors of 𝐴⊤𝐴
• S is 𝑚 × 𝑛 is a diagonal matrix with 𝑟 elements equal to the root of the positive eigenvalues of 𝐴𝐴⊤ or 𝐴⊤𝐴 (both matrices have the same positive eigenvalues anyway)
148
![Page 149: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/149.jpg)
In full matrix form
149
![Page 150: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/150.jpg)
Example
• Let’s assume:
• We can have:
150
![Page 151: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/151.jpg)
Example (cont.)
• Compute U and V respectively:
151
![Page 152: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/152.jpg)
Example (cont.)
• Finally, we have:
152
![Page 153: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/153.jpg)
Visualization
• The eigenvector 𝑣𝑗 of 𝑉 is transformed into 𝐴𝑣𝑗 = 𝜎𝑗𝑢𝑗
153
![Page 154: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/154.jpg)
Insight
154
![Page 155: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/155.jpg)
SVD and PCA
• 𝑋 = 𝑈𝑆𝑁×𝑑𝑉⊤
• 𝑄 = 𝑋⊤𝑋 = 𝑉 𝑆⊤ 𝑑×𝑁𝑈⊤ ∙ 𝑈𝑆𝑁×𝑑𝑉
⊤ = 𝑉 𝑆⊤𝑆 𝑑×𝑑𝑉⊤
• This explains why singular value 𝜎𝑗 is the square root of the eigenvalue 𝜆𝑗 of 𝑄
• 𝜆𝑗 = 𝜎𝑗2
• Also 𝑉 contains eigenvectors of 𝑋⊤𝑋
• Similarly, 𝑋𝑋⊤ = 𝑈𝑆𝑁×𝑑𝑉⊤ ∙ 𝑉 𝑆⊤ 𝑑×𝑁𝑈
⊤ = 𝑈 𝑆𝑆⊤ 𝑁×𝑁𝑈⊤
155
𝑆⊤𝑆 𝑑×𝑑 =𝜎12 ⋯⋮ ⋱ ⋮
⋯ 𝜎𝑑2
𝑆𝑆⊤ 𝑁×𝑁 =
𝜎12 ⋯⋮ ⋱ ⋮
⋯𝜎𝑑2
⋱0
![Page 156: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/156.jpg)
Application
• Matrix factorization in recommendation system
• Suppose in the database of Taobao, there are 𝑚 users and 𝑛 items, and an 𝑚 × 𝑛 binary matrix 𝐴• Each entry indicates whether a user has bought an item or not• As each user only buys very few items among all items, the matrix is very
sparse
• As the manager of Taobao, how can you predict the likelihood that a user will buy a given item?
• SVD could help
156
![Page 157: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/157.jpg)
Gaussian Mixture Models
157
![Page 158: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/158.jpg)
Gaussian Mixture Models
• Is a clustering algorithms
• Difference with K-means• K-means outputs the label of a sample
• GMM outputs the probability that a sample belongs to a certain class
• GMM can also be used to generate new samples!
158
![Page 159: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/159.jpg)
K-means vs GMM
159
![Page 160: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/160.jpg)
High-dimensional Gaussian distribution
• The probability density of Gaussian distribution on 𝑥 = 𝑥1, … , 𝑥𝑑⊤
is
𝒩 𝑥|𝜇, σ =
exp −12
𝑥 − 𝜇 ⊤σ−1 𝑥 − 𝜇
2𝜋 𝑑 σ• where 𝜇 is the mean vector
• σ is the symmetric covariance matrix (positive semi-definite)
• E.g. the Gaussian distribution with
160
𝑓 𝑥|𝜇, 𝜎2 =1
2𝜋𝜎2𝑒−
𝑥−𝜇 2
2𝜎2
![Page 161: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/161.jpg)
Mixture of Gaussian
• The probability given in a mixture of K Gaussians is:
𝑝 𝑥 =
𝑗=1
𝐾
𝑤𝑗 ∙ 𝒩(𝑥|𝜇𝑗 , Σ𝑗)
where 𝑤𝑗 is the prior probability of the j-th Gaussian
• Example
161
![Page 162: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/162.jpg)
Examples
162
Observation
![Page 163: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/163.jpg)
Data generation
• Let the parameter set 𝜃 = {𝑤𝑗 , 𝜇𝑗 , Σ𝑗: 𝑗}, then the probability density of mixture Gaussian can be written as
𝑝 𝑥 𝜃 =
𝑗=1
𝐾
𝑤𝑗 ∙ 𝒩(𝑥|𝜇𝑗 , Σ𝑗)
• Equivalent to generate data points in two steps• Select which component 𝑗 the data point belongs to according to the
categorical (or multinoulli) distribution of 𝑤1, … , 𝑤𝐾
• Generate the data point according to the probability of 𝑗-th component
163
![Page 164: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/164.jpg)
Learning task
• Given a dataset 𝑋 = 𝑥(1), 𝑥(2),∙∙∙, 𝑥(𝑁) to train the GMM model
• Find the best 𝜃 that maximizes the probability ℙ 𝑋 𝜃
• Maximal likelihood estimator (MLE)
164
![Page 165: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/165.jpg)
Introduce latent variable
• For data points 𝑥(𝑖), 𝑖 = 1, … , 𝑁, let’s write the probability as
ℙ 𝑥(𝑖) 𝜃 =
𝑗=1
𝐾
𝑤𝑗 ∙ 𝒩(𝑥(𝑖)|𝜇𝑗 , Σ𝑗)
where σ𝑗=1𝐾 𝑤𝑗 = 1
• Introduce latent variable• 𝑧(𝑖) is the Gaussian cluster ID indicates which Gaussian 𝑥(𝑖) comes from
• ℙ 𝑧(𝑖) = 𝑗 = 𝑤𝑗 “Prior”
• ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜃 = 𝒩(𝑥(𝑖)|𝜇𝑗 , Σ𝑗)
• ℙ 𝑥(𝑖) 𝜃 = σ𝑗=1𝐾 ℙ 𝑧(𝑖) = 𝑗 ∙ ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜃
165
![Page 166: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/166.jpg)
Maximal likelihood• Let 𝑙 𝜃 = σ𝑖=1
𝑁 logℙ(𝑥 𝑖 ; 𝜃) = σ𝑖=1𝑁 logσ𝑗ℙ(𝑥
𝑖 , 𝑧 𝑖 = 𝑗; 𝜃) be the log-likelihood
• We want to solve
argmax 𝑙 𝜃 = argmax
𝑖=1
𝑁
log
𝑗=1
𝐾
ℙ 𝑧(𝑖) = 𝑗 ∙ ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜃
= argmax
𝑖=1
𝑁
log
𝑗=1
𝐾
𝑤𝑗 ∙ 𝒩(𝑥(𝑖)|𝜇𝑗 , Σ𝑗)
• No closed solution by solving𝜕𝑙 𝑋|𝜃
𝜕𝑤= 0,
𝜕𝑙 𝑋|𝜃
𝜕𝜇= 0,
𝜕𝑙 𝑋|𝜃
𝜕Σ= 0
166
![Page 167: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/167.jpg)
Likelihood maximization• If we know 𝑧(𝑖) for all 𝑖, the problem becomes
argmax 𝑙 𝜃 = argmax
𝑖=1
𝑁
log ℙ 𝑥(𝑖) 𝜃
= argmax
𝑖=1
𝑁
logℙ 𝑥(𝑖), 𝑧(𝑖) 𝜃
= argmax
𝑖=1
𝑁
log ℙ 𝑥(𝑖) 𝜃, 𝑧(𝑖) + logℙ 𝑧(𝑖) 𝜃
= argmax
𝑖=1
𝑁
log𝒩(𝑥(𝑖)|𝜇𝑧(𝑖) , Σ𝑧(𝑖)) + log𝑤𝑧(𝑖)
167
The solution is
• 𝑤𝑗 =1
𝑁σ𝑖=1𝑁 𝟏 𝑧(𝑖) = 𝑗
• 𝜇𝑗 =σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗 𝑥(𝑖)
σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗
• Σ𝑗 =σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗 𝑥(𝑖)−𝜇𝑗 𝑥(𝑖)−𝜇𝑗
⊤
σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗
Average over each cluster
![Page 168: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/168.jpg)
Likelihood maximization (cont.)
• Given the parameter 𝜃 = {𝑤𝑗 , 𝜇𝑗 , Σ𝑗: 𝑗}, the posterior distribution of each latent variable 𝑧(𝑖) can be inferred
• ℙ 𝑧(𝑖) = 𝑗 𝑥(𝑖); 𝜃 =ℙ 𝑥(𝑖), 𝑧(𝑖) = 𝑗 𝜃
ℙ 𝑥(𝑖) 𝜃
• =ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜇𝑗 , Σ𝑗 ℙ 𝑧(𝑖) = 𝑗 𝑤
σ𝑗′=1𝐾 ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗′; 𝜇𝑗′ , Σ𝑗′ ℙ 𝑧(𝑖) = 𝑗′ 𝑤
• Or 𝑤𝑗(𝑖)
= ℙ 𝑧(𝑖) = 𝑗 𝑥(𝑖); 𝜃 ∝ ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜇𝑗 , Σ𝑗 ℙ 𝑧(𝑖) = 𝑗 𝑤
168
![Page 169: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/169.jpg)
Likelihood maximization (cont.)
• For every possible values of 𝑧(𝑖)’s
• Take the expectation on probability of 𝑧(𝑖)
169
Which is equivalent to
• 𝑤𝑗 =1
𝑁σ𝑖=1𝑁 𝟏 𝑧(𝑖) = 𝑗
• σ𝑖=1𝑁 𝟏 𝑧(𝑖) = 𝑗 𝜇𝑗 = σ𝑖=1
𝑁 𝟏 𝑧(𝑖) = 𝑗 𝑥(𝑖)
• σ𝑖=1𝑁 𝟏 𝑧(𝑖) = 𝑗 Σ𝑗 = σ𝑖=1
𝑁 𝟏 𝑧(𝑖) = 𝑗 𝑥(𝑖) − 𝜇𝑗 𝑥(𝑖) − 𝜇𝑗⊤
Take expectation on two sides
• 𝑤𝑗 =1
𝑁σ𝑖=1𝑁 ℙ(𝑧(𝑖) = 𝑗)
• σ𝑖=1𝑁 ℙ(𝑧 𝑖 = 𝑗) 𝜇𝑗 = σ𝑖=1
𝑁 ℙ(𝑧 𝑖 = 𝑗)𝑥(𝑖)
• σ𝑖=1𝑁 ℙ(𝑧 𝑖 = 𝑗) Σ𝑗 = σ𝑖=1
𝑁 ℙ(𝑧 𝑖 = 𝑗) 𝑥(𝑖) − 𝜇𝑗 𝑥(𝑖) − 𝜇𝑗⊤
The solution is
• 𝑤𝑗 =1
𝑁σ𝑖=1𝑁 𝟏 𝑧(𝑖) = 𝑗
• 𝜇𝑗 =σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗 𝑥(𝑖)
σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗
• Σ𝑗 =σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗 𝑥(𝑖)−𝜇𝑗 𝑥(𝑖)−𝜇𝑗
⊤
σ𝑖=1𝑁 𝟏 𝑧(𝑖)=𝑗
𝑤𝑗(𝑖)
= ℙ 𝑧(𝑖) = 𝑗 𝑥(𝑖); 𝜃
The solution is
• 𝑤𝑗 =1
𝑁σ𝑖=1𝑁 𝑤𝑗
(𝑖)
• 𝜇𝑗 =σ𝑖=1𝑁 𝑤𝑗
(𝑖)𝑥(𝑖)
σ𝑖=1𝑁 𝑤
𝑗(𝑖)
• Σ𝑗 =σ𝑖=1𝑁 𝑤𝑗
(𝑖)𝑥(𝑖)−𝜇𝑗 𝑥(𝑖)−𝜇𝑗
⊤
σ𝑖=1𝑁 𝑤
𝑗(𝑖)
Take expectation
![Page 170: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/170.jpg)
Likelihood maximization (cont.)
• If we know the distribution of 𝑧(𝑖)’s, then it is equivalent to maximize
𝑙 ℙ𝑧(𝑖) , 𝜃 =
𝑖=1
𝑁
𝑗=1
𝐾
ℙ 𝑧(𝑖) = 𝑗 logℙ 𝑥(𝑖), 𝑧 𝑖 = 𝑗|𝜃
ℙ 𝑧(𝑖) = 𝑗
• Compared with previous if we know the values of 𝑧(𝑖)
argmax
𝑖=1
𝑁
logℙ 𝑥(𝑖), 𝑧(𝑖) 𝜃
• Note that it is hard to solve if we directly maximize
𝑙 𝜃 =
𝑖=1
𝑁
logℙ 𝑥(𝑖) 𝜃 =
𝑖=1
𝑁
log
𝑗=1
𝐾
ℙ 𝑧(𝑖) = 𝑗|𝜃 ℙ 𝑥(𝑖) 𝑧(𝑖) = 𝑗; 𝜃
170
is a lower bound of
![Page 171: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/171.jpg)
Expectation maximization methods
• E-step: • Infer the posterior distribution of the latent variables given the model
parameters
• M-step: • Tune parameters to maximize the data likelihood given the latent variable
distribution
• EM methods• Iteratively execute E-step and M-step until convergence
171
![Page 172: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/172.jpg)
EM for GMM
• Repeat until convergence:{(E-step) For each 𝑖, 𝑗, set
𝑤𝑗(𝑖)
= ℙ 𝑧(𝑖) = 𝑗 𝑥(𝑖), 𝑤, 𝜇, Σ(M-step) Update the parameters
𝑤𝑗 =1
𝑁
𝑖=1
𝑁
𝑤𝑗(𝑖), 𝜇𝑗 =
σ𝑖=1𝑁 𝑤𝑗
(𝑖)𝑥(𝑖)
σ𝑖=1𝑁 𝑤𝑗
(𝑖)
Σ𝑗 =σ𝑖=1𝑁 𝑤𝑗
(𝑖)𝑥(𝑖) − 𝜇𝑗 𝑥(𝑖) − 𝜇𝑗
⊤
σ𝑖=1𝑁 𝑤𝑗
(𝑖)
}172
![Page 173: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/173.jpg)
Example
• Hidden variable: for each point, which Gaussian generates it?
173
𝑝 𝑥 𝜃 =
𝑗=1
𝐾
𝑤𝑗 ∙ 𝒩(𝑥|𝜇𝑗 , Σ𝑗)
![Page 174: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/174.jpg)
Example (cont.)
• E-step: for each point, estimate the probability that each Gaussian component generated it
174
![Page 175: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/175.jpg)
Example (cont.)
• M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data
175
![Page 176: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/176.jpg)
Example
176
𝐾 = 2 𝐾 = 16
400 new data points generated from the 16-GMM
![Page 177: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/177.jpg)
Remaining issues (cont.)
• Simplification of the covariance matrices
177
![Page 178: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/178.jpg)
Expectation and Maximization
178
![Page 179: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/179.jpg)
Background: Latent variable models
• Some of the variables in the model are not observed
• Examples: mixture model, Hidden Markov Models, LDA, etc
179
![Page 180: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/180.jpg)
Background: Marginal likelihood
• Joint model ℙ 𝑥, 𝑧|𝜃 , 𝜃 is the model parameter
• With 𝑧 unobserved, we marginalize out 𝑧 and use the marginal log-likelihood for learning
ℙ 𝑥|𝜃 =
𝑧
ℙ 𝑥, 𝑧|𝜃
• Example: mixture model
ℙ 𝑥|𝜃 =
𝑘
ℙ 𝑥|𝑧 = 𝑘, 𝜃𝑘 ℙ 𝑧 = 𝑘|𝜃𝑘 =
𝑘
𝜋𝑘ℙ 𝑥|𝑧 = 𝑘, 𝜃𝑘
where 𝜋𝑘 is the mixing proportions
180
![Page 181: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/181.jpg)
Examples
• Mixture of Bernoulli
• Mixture of Gaussians
• Hidden Markov Model
181
![Page 182: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/182.jpg)
Learning the hidden variable 𝑍
• If all 𝑧 observed, the likelihood factorizes, and learning is relatively easy
• If 𝑧 not observed, we have to handle a sum inside log• Idea 1: ignore this problem and simply take derivative and follow the gradient
• Idea 2: use the current 𝜃 to estimate 𝑧, fill them in and do fully-observed learning
• Is there a better way?
182
![Page 183: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/183.jpg)
General EM methods
• Repeat until convergence{(E-step) For each data point 𝑖, set
𝑞𝑖 𝑗 = ℙ 𝑧(𝑖) = 𝑗|𝑥 𝑖 ; 𝜃(M-step) Update the parameters
argmax𝜃 𝑙 𝑞𝑖 , 𝜃 =
𝑖=1
𝑁
𝑗
𝑞𝑖 𝑗 logℙ(𝑥 𝑖 , 𝑧 𝑖 = 𝑗; 𝜃)
𝑞𝑖 𝑗
}
183
![Page 184: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/184.jpg)
Convergence of EM
• Denote 𝜃(𝑡) and 𝜃(𝑡+1) as the parameters of two successive iterations of EM, we want to prove that
𝑙 𝜃(𝑡) ≤ 𝑙 𝜃 𝑡+1
where note that
𝑙 𝜃 =
𝑖=1
𝑁
logℙ(𝑥 𝑖 ; 𝜃) =
𝑖=1
𝑁
log
𝑗
ℙ(𝑥 𝑖 , 𝑧 𝑖 = 𝑗; 𝜃)
• This shows EM always monotonically improves the log-likelihood, thus ensures EM will at least converge to a local optimum
184
![Page 185: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/185.jpg)
Proof• Start from 𝜃(𝑡), we choose the posterior of latent variable
𝑞𝑖(𝑡)
𝑗 = 𝑝 𝑧(𝑖) = 𝑗|𝑥 𝑖 ; 𝜃(𝑡)
• By Jensen’s inequality on the log function
• 𝑙 𝑞, 𝜃(𝑡) = σ𝑖=1𝑁 σ𝑗 𝑞𝑖 𝑗 log
𝑝(𝑥 𝑖 ,𝑧 𝑖 =𝑗;𝜃(𝑡))
𝑞𝑖 𝑗
≤
𝑖=1
𝑗
log
𝑗
𝑞𝑖 𝑗 ∙𝑝 𝑥 𝑖 , 𝑧 𝑖 = 𝑗; 𝜃 𝑡
𝑞𝑖 𝑗
=
𝑖=1
𝑗
log
𝑗
𝑝 𝑥 𝑖 , 𝑧 𝑖 = 𝑗; 𝜃(𝑡) = 𝑙 𝜃(𝑡)
• When 𝑞𝑖(𝑡)
𝑗 = 𝑝 𝑧(𝑖) = 𝑗|𝑥 𝑖 ; 𝜃(𝑡) , the above inequality is equality185
𝑙 𝜃(𝑡) = 𝑙 𝑞𝑖(𝑡), 𝜃(𝑡) ≥ 𝑙 𝑞, 𝜃(𝑡)
![Page 186: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/186.jpg)
Proof (cont.)
• Then the parameter 𝜃(𝑡+1) is obtained by maximizing
𝑙 𝑞𝑖(𝑡), 𝜃 =
𝑖=1
𝑁
𝑗
𝑞𝑖(𝑡)
log𝑝(𝑥 𝑖 , 𝑧 𝑖 = 𝑗; 𝜃)
𝑞𝑖(𝑡)
• Thus
𝑙 𝜃(𝑡+1) ≥ 𝑙 𝑞𝑖(𝑡), 𝜃(𝑡+1) ≥ 𝑙 𝑞𝑖
(𝑡), 𝜃(𝑡) = 𝑙 𝜃(𝑡)
• Since the likelihood is at most 1, EM will converge (to a local optimum)
186
![Page 187: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/187.jpg)
Example
• Follow previous example, suppose ℎ = 20, 𝑐 = 10, 𝑑 = 10, 𝜇(0) = 0
• Then
• Convergence is generally linear: error decreases by a constant factor each time step
187
![Page 188: Review for Final - GitHub Pages for final.pdf · •What is the entropy of a group in which all examples belong to the same class? •Entropy = −1log21=0 •What is the entropy](https://reader033.vdocument.in/reader033/viewer/2022052022/603691cd55ac793b3759f0bc/html5/thumbnails/188.jpg)
K-means and EM
• After initialize the position of the centers, K-means interactively execute the following two operations:• Assign the label of the points based on their distances to the centers
• Specific posterior distribution of latent variable
• E-step
• Update the positions of the centers based on the labeling results• Given the labels, optimize the model parameters
• M-step
188