Course 495: Advanced Statistical Machine Learning/Pattern Recognition
Stefanos Zafeiriou · 2015-11-12 · Transcript
Deterministic Component Analysis
• Goal (Lecture): To present standard and modern Component Analysis (CA) techniques, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and graph/neighbourhood-based component analysis.
• Goal (Tutorials): To provide students with the mathematical tools necessary for a deep understanding of the CA techniques.
Materials

• C. M. Bishop, Pattern Recognition and Machine Learning, Chapter 12.
• M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
• P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
• X. He et al., "Face recognition using Laplacianfaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):328-340, 2005.
• M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, 15(6):1373-1396, 2003.
Deterministic Component Analysis

Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e., observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) that is relevant to a task.

[Figure: observations $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ mapped to latent points $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$.]
Latent Variable Models

[Figure: an $m \times n$ image is vectorised into an observation of dimension $F = mn$; a latent representation of dimension $d \ll F$ is obtained via a linear or a non-linear mapping with learned parameters.]
What to have in mind?

• What are the properties of my latent space?
• How do I find it (linear or non-linear)?
• What is the cost function?
• How do I solve the problem?
A first example

• Let's assume that we want to find a descriptive latent space, i.e., one that best describes the population as a whole.
• How do I define it mathematically?
• Idea! This is a statistical machine learning course, isn't it?
• Hence, I will try to preserve global statistical properties.
A first example

• What are the data statistics that can be used to describe the variability of my observations?
• One such statistic is the variance of the population (how much the data deviate around the mean).
• Attempt: I want to find a low-dimensional latent space in which the "majority" of the variance is preserved (or, in other words, maximised).
PCA (one dimension)

• We want to find the latent values $\{y_1, y_2, \dots, y_N\}$ that maximise the variance of the latent space:

$$\{y_1^o, \dots, y_N^o\} = \arg\max_{\{y_1, \dots, y_N\}} \sigma_y^2, \qquad \sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$$

• Via a linear projection $\mathbf{w}$, i.e., $y_i = \mathbf{w}^T \mathbf{x}_i$.
• But we are missing something … the way to do it.
PCA (geometric interpretation of projection)

$$\cos(\theta) = \frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{x}\|\,\|\mathbf{w}\|} \;\Rightarrow\; \|\mathbf{x}\|\cos(\theta) = \frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|}, \qquad \|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}, \quad \|\mathbf{w}\| = \sqrt{\mathbf{w}^T\mathbf{w}}$$

That is, $\mathbf{x}^T\mathbf{w}/\|\mathbf{w}\|$ is the length of the projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$ (for a unit-norm $\mathbf{w}$, simply $\mathbf{x}^T\mathbf{w}$).
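As a quick numeric check of the projection formula, here is a minimal NumPy sketch (the vectors are arbitrary choices of mine, not from the slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
w = np.array([1.0, 1.0])

cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
proj_length = (x @ w) / np.linalg.norm(w)   # |x| cos(theta)

print(cos_theta)     # ~0.99
print(proj_length)   # ~4.95, the length of x projected onto the direction of w
```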
PCA (one dimension)

Assuming the latent space is generated by a linear projection, $y_i = \mathbf{w}^T\mathbf{x}_i$:

$$\{y_1, \dots, y_N\} = \{\mathbf{w}^T\mathbf{x}_1, \dots, \mathbf{w}^T\mathbf{x}_N\}, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$

$$\begin{aligned}
\mathbf{w}_o = \arg\max_{\mathbf{w}} \sigma_y^2 &= \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu}\big)^2 \\
&= \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\big)^2 \\
&= \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w} \\
&= \arg\max_{\mathbf{w}} \mathbf{w}^T\Big(\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\Big)\mathbf{w} \\
&= \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\,\mathbf{w}
\end{aligned}$$
PCA (one dimension)

$$\mathbf{w}_o = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w} \ge 0, \qquad \mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$

• There is a trivial solution: let $\mathbf{w}$ grow without bound ($\|\mathbf{w}\| \to \infty$).
• We can avoid it by adding an extra constraint, a fixed magnitude on $\mathbf{w}$: $\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$.

$$\mathbf{w}_o = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{w} = 1$$
PCA (one dimension)

Formulate the Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

Using

$$\frac{\partial\, \mathbf{w}^T\mathbf{S}_t\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_t\mathbf{w}, \qquad \frac{\partial\, \mathbf{w}^T\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{w}$$

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_t\mathbf{w} = \lambda\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to its largest eigenvalue.
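A minimal NumPy sketch of this one-dimensional PCA, assuming toy Gaussian data of my own choosing (`X`, `w`, `y` follow the slide notation):

```python
import numpy as np

# Toy data: N = 100 observations of dimension F = 5 (arbitrary choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # rows are the observations x_i
mu = X.mean(axis=0)
Xc = X - mu                              # centred data

St = Xc.T @ Xc / len(X)                  # total scatter S_t
eigvals, eigvecs = np.linalg.eigh(St)    # eigh, since S_t is symmetric
w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

y = Xc @ w                               # latent space y_i = w^T (x_i - mu)
print(y.var(), eigvals[-1])              # projected variance equals lambda_max
```

The projected variance matching the top eigenvalue is exactly the property derived above.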
PCA (properties of $\mathbf{S}_t$)

$$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T, \qquad \mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_N - \boldsymbol{\mu}]$$

• $\mathbf{S}_t$ is a symmetric matrix, hence all its eigenvalues are real.
• $\mathbf{S}_t$ is positive semi-definite, i.e., $\forall\, \mathbf{w} \ne \mathbf{0}: \mathbf{w}^T\mathbf{S}_t\mathbf{w} \ge 0$ (all eigenvalues are non-negative).
• $\operatorname{rank}(\mathbf{S}_t) = \min(N-1, F)$.
• Eigendecomposition: $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, with $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{U}\mathbf{U}^T = \mathbf{I}$.
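These properties are easy to verify numerically; a short sketch, with an arbitrary case where $N - 1 < F$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 6, 10                                       # deliberately N - 1 < F
X = rng.normal(size=(F, N))
X = X - X.mean(axis=1, keepdims=True)              # X = [x_1 - mu, ..., x_N - mu]

St = X @ X.T / N                                   # S_t = (1/N) X X^T
print(np.allclose(St, St.T))                       # symmetric -> True
print(np.linalg.eigvalsh(St).min() >= -1e-12)      # PSD: eigenvalues non-negative
print(np.linalg.matrix_rank(St) == min(N - 1, F))  # rank S_t = min(N-1, F)
```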
PCA (more dimensions)

• How can we find a latent space with more than one dimension? We need to find a set of projections $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$:

$$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}_d^T\mathbf{x}_i \end{bmatrix} = \mathbf{W}^T\mathbf{x}_i, \qquad \mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$
PCA (more dimensions)

Maximise the variance in each dimension:

$$\begin{aligned}
\mathbf{W}_o &= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\frac{1}{N}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 \\
&= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_k^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w}_k \\
&= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]
\end{aligned}$$
PCA (more dimensions)

$$\mathbf{W}_o = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]$$

Using

$$\frac{\partial\, \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]}{\partial \mathbf{W}} = 2\mathbf{S}_t\mathbf{W}, \qquad \frac{\partial\, \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]}{\partial \mathbf{W}} = 2\mathbf{W}\boldsymbol{\Lambda}$$

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$$

Does it ring a bell?
PCA (more dimensions)

• Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest non-zero eigenvalues: $\mathbf{W} = \mathbf{U}_d$, so that

$$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d]$$

Example: let $\mathbf{U}$ be $5 \times 5$ and $\mathbf{W} = \mathbf{U}_3$ be $5 \times 3$. Then, by orthonormality of the eigenvectors $\mathbf{u}_i$,

$$\mathbf{W}^T\mathbf{U} = \begin{bmatrix}
\mathbf{u}_1^T\mathbf{u}_1 & \mathbf{u}_1^T\mathbf{u}_2 & \mathbf{u}_1^T\mathbf{u}_3 & \mathbf{u}_1^T\mathbf{u}_4 & \mathbf{u}_1^T\mathbf{u}_5 \\
\mathbf{u}_2^T\mathbf{u}_1 & \mathbf{u}_2^T\mathbf{u}_2 & \mathbf{u}_2^T\mathbf{u}_3 & \mathbf{u}_2^T\mathbf{u}_4 & \mathbf{u}_2^T\mathbf{u}_5 \\
\mathbf{u}_3^T\mathbf{u}_1 & \mathbf{u}_3^T\mathbf{u}_2 & \mathbf{u}_3^T\mathbf{u}_3 & \mathbf{u}_3^T\mathbf{u}_4 & \mathbf{u}_3^T\mathbf{u}_5
\end{bmatrix} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{bmatrix}$$
PCA (more dimensions)

$$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda_1 & 0 & 0 & 0 & 0 \\
0 & \lambda_2 & 0 & 0 & 0 \\
0 & 0 & \lambda_3 & 0 & 0
\end{bmatrix} \quad \text{and} \quad \mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{bmatrix}$$

Hence the maximum is

$$\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i$$
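A short NumPy sketch of the $d$-dimensional case, checking that the objective attains $\sum_{i=1}^{d}\lambda_i$ (data, $N$, $F$, $d$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, F, d = 200, 10, 3
X = rng.normal(size=(F, N))                 # columns are the observations
X = X - X.mean(axis=1, keepdims=True)       # centre: X = [x_1 - mu, ..., x_N - mu]

St = X @ X.T / N                            # S_t = (1/N) X X^T
eigvals, U = np.linalg.eigh(St)             # eigenvalues in ascending order
W = U[:, -d:]                               # U_d: the top-d eigenvectors

Y = W.T @ X                                 # latent space y_i = W^T x_i
# tr[W^T S_t W] equals the sum of the d largest eigenvalues.
print(np.trace(W.T @ St @ W), eigvals[-d:].sum())
```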
PCA (another perspective)

• We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after projection:

$$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i, \qquad \tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i$$
PCA (another perspective)

Let us assume, for simplicity, centred data (zero mean).

• Reconstructed data: $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{W}^T\mathbf{X}$

$$\mathbf{W}_o = \arg\min_{\mathbf{W}} \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ij} - \tilde{x}_{ij})^2 = \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \quad (1)$$

$$\text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}$$
PCA (another perspective)

$$\begin{aligned}
\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 &= \operatorname{tr}\big[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\big] \\
&= \operatorname{tr}[\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} + \mathbf{X}^T\mathbf{W}\underbrace{\mathbf{W}^T\mathbf{W}}_{=\,\mathbf{I}}\mathbf{W}^T\mathbf{X}] \\
&= \underbrace{\operatorname{tr}[\mathbf{X}^T\mathbf{X}]}_{\text{constant}} - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}]
\end{aligned}$$

Hence, minimising (1) is equivalent to

$$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}]$$

i.e., the same trace-maximisation problem as before.
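To see the equivalence numerically, a sketch comparing the reconstruction error of $\mathbf{W} = \mathbf{U}_d$ against the sum of the discarded eigenvalues (setup as in the previous sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
N, F, d = 200, 10, 3
X = rng.normal(size=(F, N))
X = X - X.mean(axis=1, keepdims=True)     # centred data, columns are observations

St = X @ X.T / N
eigvals, U = np.linalg.eigh(St)
W = U[:, -d:]                             # top-d eigenvectors

X_rec = W @ (W.T @ X)                     # reconstruction W W^T X
err = np.linalg.norm(X - X_rec, 'fro')**2 / N

# The (1/N-scaled) reconstruction error equals the sum of the discarded eigenvalues.
print(err, eigvals[:-d].sum())
```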
PCA (applications)

[Figure (TEXTURE): the mean texture and the first four principal components (1st-4th PC).]
PCA (applications)

[Figure (SHAPE): the mean shape and the first four principal components (1st-4th PC).]
PCA (applications)

[Figure: texture and shape components shown separately for identity and for expression variation.]
Linear Discriminant Analysis

• PCA: an unsupervised approach, good for data compression and data reconstruction; a good statistical prior.
• PCA: not explicitly defined for classification problems (i.e., the case where the data come with labels).
• How do we define a latent space in this case (i.e., one that helps in data classification)?
Linear Discriminant Analysis

• We need to properly define statistical properties that may help us in classification.
• Intuition: we want to find a space in which (a) the data constituting each class look more like each other, while (b) the data of separate classes look more dissimilar.
Linear Discriminant Analysis

• How do I make the data in each class look more similar? Minimise the variability in each class (minimise the per-class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$).

[Figure: the space of $\mathbf{x}$ mapped to the space of $y$, with the spread of each class reduced.]
Linear Discriminant Analysis

• How do I make the data between classes look dissimilar? Move the data of different classes further away from each other (increase the distance between their means, $(\mu_y(c_1) - \mu_y(c_2))^2$).

[Figure: the space of $\mathbf{x}$ mapped to the space of $y$, with the class means pushed apart.]
Linear Discriminant Analysis

A bit more formally, I want a latent space $y$ such that:

• $\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum, and
• $(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum.

How do I combine them? Minimise

$$\frac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$$

or, equivalently, maximise

$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}$$
Linear Discriminant Analysis

How can I find my latent space? Again via a linear projection, $y_i = \mathbf{w}^T\mathbf{x}_i$.
Linear Discriminant Analysis

With the class mean $\boldsymbol{\mu}(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\mathbf{x}_i$:

$$\begin{aligned}
\sigma_y^2(c_1) &= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\big(y_i - \mu_y(c_1)\big)^2 \\
&= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\big)^2 \\
&= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\mathbf{w} \\
&= \mathbf{w}^T\Big(\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\Big)\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}
\end{aligned}$$

Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$.
Linear Discriminant Analysis

$$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T\underbrace{(\mathbf{S}_1 + \mathbf{S}_2)}_{\mathbf{S}_w}\mathbf{w}, \qquad \mathbf{S}_w: \text{the within-class scatter matrix}$$

$$\big(\mu_y(c_1) - \mu_y(c_2)\big)^2 = \mathbf{w}^T\underbrace{(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T}_{\mathbf{S}_b}\mathbf{w}, \qquad \mathbf{S}_b: \text{the between-class scatter matrix}$$

Hence the objective becomes

$$\frac{\big(\mu_y(c_1) - \mu_y(c_2)\big)^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$
Linear Discriminant Analysis

$$\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$$

Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1)$$

Using

$$\frac{\partial\, \mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w}, \qquad \frac{\partial\, \mathbf{w}^T\mathbf{S}_w\mathbf{w}}{\partial \mathbf{w}} = 2\mathbf{S}_w\mathbf{w}$$

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ corresponding to its largest eigenvalue; for two classes this gives

$$\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$$
Linear Discriminant Analysis

Compute the LDA projection for the following 2D dataset:

$$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}, \qquad c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$
Linear Discriminant Analysis

Solution (by hand):

• The class statistics are

$$\boldsymbol{\mu}_1 = [3.0 \;\; 3.6]^T, \qquad \boldsymbol{\mu}_2 = [8.4 \;\; 7.6]^T$$

$$\mathbf{S}_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$

• The within- and between-class scatter matrices are

$$\mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad \mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}$$
Linear Discriminant Analysis

The LDA projection is then obtained as the solution of the generalised eigenvalue problem:

$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w} \;\Rightarrow\; \big|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}\big| = 0 \;\Rightarrow\; \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$

$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65 \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$$

Or directly, using the two-class solution:

$$\mathbf{w}^* = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = [-0.91 \;\; -0.39]^T$$
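The hand computation above is easily reproduced in NumPy; a minimal sketch (variable names are mine):

```python
import numpy as np

# The two classes from the worked example.
c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)

# Per-class scatter with the 1/N_c normalisation used on the slides.
S1 = (c1 - mu1).T @ (c1 - mu1) / len(c1)
S2 = (c2 - mu2).T @ (c2 - mu2) / len(c2)
Sw = S1 + S2                                # within-class scatter
Sb = np.outer(mu1 - mu2, mu1 - mu2)         # between-class scatter

# Largest eigenvector of Sw^{-1} Sb ...
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real
print(w)                                    # ~ +/-[0.91, 0.39]

# ... or directly, for the two-class case:
w_direct = np.linalg.solve(Sw, mu1 - mu2)
print(w_direct / np.linalg.norm(w_direct))  # same direction (up to sign)
```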
LDA (Multiclass & Multidimensional case)

We now seek a set of projections $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$.
LDA (Multiclass & Multidimensional case)

Within-class scatter matrix:

$$\mathbf{S}_w = \sum_{j=1}^{C}\mathbf{S}_j = \sum_{j=1}^{C}\frac{1}{N_{c_j}}\sum_{\mathbf{x}_i \in c_j}(\mathbf{x}_i - \boldsymbol{\mu}(c_j))(\mathbf{x}_i - \boldsymbol{\mu}(c_j))^T$$

Between-class scatter matrix:

$$\mathbf{S}_b = \sum_{j=1}^{C}(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})^T$$
LDA (Multiclass & Multidimensional case)

$$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]$$

Using

$$\frac{\partial\,\operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}]}{\partial \mathbf{W}} = 2\mathbf{S}_b\mathbf{W}, \qquad \frac{\partial\,\operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]}{\partial \mathbf{W}} = 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$$

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$$

$\mathbf{W}$ has as columns the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to the largest eigenvalues.
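Finally, a minimal multiclass sketch under exactly these definitions (the three-class toy data are my own; this is an illustration, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
C, F, d = 3, 4, 2                           # note: rank(S_b) <= C - 1, so d <= C - 1
classes = [rng.normal(loc=3.0 * j, size=(30, F)) for j in range(C)]

mu = np.vstack(classes).mean(axis=0)        # global mean
Sw = np.zeros((F, F))
Sb = np.zeros((F, F))
for Xc in classes:
    mu_c = Xc.mean(axis=0)
    Dc = Xc - mu_c
    Sw += Dc.T @ Dc / len(Xc)               # S_w = sum_j S_j
    Sb += np.outer(mu_c - mu, mu_c - mu)    # S_b = sum_j (mu_j - mu)(mu_j - mu)^T

# Columns of W: eigenvectors of Sw^{-1} Sb with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:d]].real

Y = np.vstack(classes) @ W                  # projected data y_i = W^T x_i
print(W.shape, Y.shape)                     # (4, 2) (90, 2)
```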