TRANSCRIPT
Machine Learning for Data Mining: Feature Generation
Andres Mendez-Vazquez
July 16, 2015
Outline
1 Introduction: What do we want?
2 Fisher Linear Discriminant: The Rotation Idea, Solution, Scatter Measure, The Cost Function
3 Principal Component Analysis
What do we want?
What
Given a set of measurements, the goal is to discover compact and informative representations of the obtained data.
Our Approach
We want to “squeeze” the data into a relatively small number of features, leading to a reduction of the necessary feature space dimension.
Properties
Thus removing information redundancies, which are usually produced by the measurement process.
What Methods we will see?
Fisher Linear Discriminant
1 Squeezing to the maximum.
2 From many dimensions to one dimension.
Principal Component Analysis
1 Not so much squeezing.
2 You are willing to lose some information.
Singular Value Decomposition
1 Decompose an m × n data matrix A into A = U S V^T, where U and V are orthonormal matrices and S contains the singular values (see the sketch below).
2 You can read more about it in “Singular Value Decomposition - Dimensionality Reduction”.
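A minimal NumPy sketch of the decomposition above (the matrix values and the rank k are made-up examples, not from the lecture); numpy.linalg.svd returns the singular values in the vector s:

import numpy as np

# A made-up 5 x 3 data matrix (rows = samples, columns = features).
A = np.array([[2.0, 0.5, 1.0],
              [1.5, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.5, 0.0, 2.0]])

# Thin SVD: A = U S V^T, with U and V having orthonormal columns.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a rank-k approximation of A.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("singular values:", s)
print("rank-2 reconstruction error:", np.linalg.norm(A - A_k))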
Rotation
Projecting
Projecting well-separated samples onto an arbitrary line usually produces a confused mixture of samples from all of the classes, and thus produces poor recognition performance.
Something Notable
However, moving and rotating the line around might result in an orientation for which the projected samples are well separated.
Fisher Linear Discriminant (FLD)
It is a discriminant analysis that seeks directions that are efficient for discriminating in the binary classification problem.
Example
[Figure: From Left to Right, the Improvement]
This is actually coming from...
Classifier as
A machine for dimensionality reduction.
Initial Setup
We have:
N d-dimensional samples x_1, x_2, ..., x_N.
N_i is the number of samples in class C_i, for i = 1, 2.
Then, we ask for the projection of each x_i onto the line by means of
y_i = w^T x_i   (1)
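To make Equation (1) concrete, a minimal NumPy sketch (the two-class data and the direction w are made-up values for illustration, not from the lecture):

import numpy as np

# Made-up data: N = 6 samples in d = 2 dimensions, two classes of 3 samples each.
X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2
X = np.vstack([X1, X2])

# An arbitrary (not yet optimal) projection direction w.
w = np.array([1.0, -1.0])

# Equation (1): each sample is reduced to a single scalar y_i = w^T x_i.
y = X @ w
print(y)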
Use the mean of each Class
Then
Select w such that class separation is maximized.
We then define the mean sample for each class:
1 C_1 ⇒ m_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} x_i
2 C_2 ⇒ m_2 = \frac{1}{N_2} \sum_{i=1}^{N_2} x_i
Ok!!! This is giving us a measure of distance
Thus, we want to maximize the distance between the projected means:
\tilde{m}_1 - \tilde{m}_2 = w^T (m_1 - m_2)   (2)
where \tilde{m}_k = w^T m_k for k = 1, 2.
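Continuing the same made-up two-class data, a small sketch of the class means and of the projected-mean difference in Equation (2):

import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2
w = np.array([1.0, -1.0])

# Class means m_1 and m_2 in the original d-dimensional space.
m1 = X1.mean(axis=0)
m2 = X2.mean(axis=0)

# Projected means and their distance, Equation (2).
m1_tilde = w @ m1
m2_tilde = w @ m2
print(m1_tilde - m2_tilde, "==", w @ (m1 - m2))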
However
We could simply seek
\max_w \; w^T (m_1 - m_2)
\text{s.t.} \; \sum_{i=1}^{d} w_i = 1
After all
We do not care about the magnitude of w.
Example
[Figure: Here, we have the problem]
Fixing the Problem
To obtain good separation of the projected data
The difference between the means should be large relative to some measure of the standard deviations for each class.
We define a SCATTER measure (based on the sample variance)
s_k^2 = \sum_{x_i \in C_k} \left( w^T x_i - \tilde{m}_k \right)^2 = \sum_{y_i = w^T x_i \in C_k} (y_i - \tilde{m}_k)^2   (3)
We then define the within-class variance for the whole data set as
s_1^2 + s_2^2   (4)
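A small sketch of the scatter measure in Equation (3) and the within-class variance in Equation (4), on the same made-up data (the helper function scatter is an assumption for illustration):

import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2
w = np.array([1.0, -1.0])

def scatter(Xk, w):
    """Equation (3): scatter of class k around its projected mean."""
    yk = Xk @ w                      # projected samples
    mk_tilde = yk.mean()             # projected class mean
    return np.sum((yk - mk_tilde) ** 2)

s1_sq = scatter(X1, w)
s2_sq = scatter(X2, w)
within_class = s1_sq + s2_sq         # Equation (4)
print(within_class)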
Finally, a Cost Function
The between-class variance
(\tilde{m}_1 - \tilde{m}_2)^2   (5)
The Fisher criterion
\frac{\text{between-class variance}}{\text{within-class variance}}   (6)
Finally
J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}   (7)
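A small sketch that evaluates the Fisher criterion of Equation (7) for two candidate directions on the made-up data (the function name fisher_criterion is an assumption):

import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2

def fisher_criterion(w, X1, X2):
    """Equation (7): between-class variance over within-class variance."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

print(fisher_criterion(np.array([1.0, -1.0]), X1, X2))
print(fisher_criterion(np.array([1.0, 0.0]), X1, X2))   # a different direction for comparison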
We use a transformation to simplify our life
First
J(w) = \frac{\left( w^T m_1 - w^T m_2 \right)^2}{\sum_{y_i = w^T x_i \in C_1} (y_i - \tilde{m}_1)^2 + \sum_{y_i = w^T x_i \in C_2} (y_i - \tilde{m}_2)^2}   (8)
Second
J(w) = \frac{\left( w^T m_1 - w^T m_2 \right) \left( w^T m_1 - w^T m_2 \right)^T}{\sum_{x_i \in C_1} \left( w^T x_i - \tilde{m}_1 \right) \left( w^T x_i - \tilde{m}_1 \right)^T + \sum_{x_i \in C_2} \left( w^T x_i - \tilde{m}_2 \right) \left( w^T x_i - \tilde{m}_2 \right)^T}   (9)
Third
J(w) = \frac{w^T (m_1 - m_2) \left( w^T (m_1 - m_2) \right)^T}{\sum_{x_i \in C_1} w^T (x_i - m_1) \left( w^T (x_i - m_1) \right)^T + \sum_{x_i \in C_2} w^T (x_i - m_2) \left( w^T (x_i - m_2) \right)^T}   (10)
Transformation
Fourth
J(w) = \frac{w^T (m_1 - m_2) (m_1 - m_2)^T w}{\sum_{x_i \in C_1} w^T (x_i - m_1) (x_i - m_1)^T w + \sum_{x_i \in C_2} w^T (x_i - m_2) (x_i - m_2)^T w}   (11)
Fifth
J(w) = \frac{w^T (m_1 - m_2) (m_1 - m_2)^T w}{w^T \left[ \sum_{x_i \in C_1} (x_i - m_1) (x_i - m_1)^T + \sum_{x_i \in C_2} (x_i - m_2) (x_i - m_2)^T \right] w}   (12)
Now Rename
J(w) = \frac{w^T S_B w}{w^T S_w w}   (13)
Here S_B = (m_1 - m_2)(m_1 - m_2)^T is the between-class scatter matrix and S_w is the within-class scatter matrix in the brackets above.
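A small sketch that builds S_B and S_w explicitly and evaluates Equation (13); it should agree with the direct evaluation of J(w) from the earlier sketch (same made-up data):

import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter: S_B = (m1 - m2)(m1 - m2)^T.
d = (m1 - m2).reshape(-1, 1)
S_B = d @ d.T

# Within-class scatter: sum of the per-class scatter matrices.
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.array([1.0, -1.0])
J = (w @ S_B @ w) / (w @ S_w @ w)    # Equation (13)
print(J)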
Differentiate with respect to w
Thus
\frac{dJ(w)}{dw} = \frac{d \left[ \left( w^T S_B w \right) \left( w^T S_w w \right)^{-1} \right]}{dw} = 0   (14)
Then
\frac{dJ(w)}{dw} = \left( S_B w + S_B^T w \right) \left( w^T S_w w \right)^{-1} - \left( w^T S_B w \right) \left( w^T S_w w \right)^{-2} \left( S_w w + S_w^T w \right) = 0   (15)
Now, because of the symmetry of S_B and S_w,
\frac{dJ(w)}{dw} = \frac{S_B w \left( w^T S_w w \right) - \left( w^T S_B w \right) S_w w}{\left( w^T S_w w \right)^2} = 0   (16)
Differentiate with respect to w
Thus
\frac{dJ(w)}{dw} = \frac{S_B w \left( w^T S_w w \right) - \left( w^T S_B w \right) S_w w}{\left( w^T S_w w \right)^2} = 0   (17)
Then
\left( w^T S_w w \right) S_B w = \left( w^T S_B w \right) S_w w   (18)
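Note that Equation (18) can be read as a generalized eigenvalue problem: since w^T S_w w and w^T S_B w are scalars, dividing (18) by w^T S_w w gives

S_B w = \frac{w^T S_B w}{w^T S_w w} S_w w = J(w)\, S_w w,

that is, S_B w = \lambda S_w w with \lambda = J(w), so the maximizing w is the generalized eigenvector with the largest \lambda. The tricks below avoid solving this system explicitly.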
Now, Several Tricks!!!
First
S_B w = (m_1 - m_2) (m_1 - m_2)^T w = \alpha (m_1 - m_2)   (19)
Where \alpha = (m_1 - m_2)^T w is a simple constant
It means that S_B w is always in the direction of m_1 - m_2!!!
In addition
w^T S_w w and w^T S_B w are scalar constants.
Now, Several Tricks!!!
Finally
S_w w \propto (m_1 - m_2) \Rightarrow w \propto S_w^{-1} (m_1 - m_2)   (20)
Once the data is transformed into y_i
Use a threshold y_0 ⇒ x ∈ C_1 iff y(x) ≥ y_0, or x ∈ C_2 iff y(x) < y_0.
Or maximum likelihood with a Gaussian can be used to classify the new transformed data using Naive Bayes (Central Limit Theorem, since y = w^T x is a sum of random variables). A minimal end-to-end sketch follows.
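A minimal end-to-end sketch of Equation (20): solve S_w w = m_1 - m_2 for the Fisher direction and classify with a threshold; taking y_0 as the midpoint of the projected class means is an illustrative assumption, not prescribed by the slide:

import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])   # class C1
X2 = np.array([[4.0, 0.5], [4.5, 1.0], [3.8, 0.2]])   # class C2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_w.
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Equation (20): the Fisher direction (its scale does not matter).
w = np.linalg.solve(S_w, m1 - m2)

# A simple threshold: midpoint of the projected class means (an assumption).
y0 = 0.5 * (w @ m1 + w @ m2)

def classify(x):
    return "C1" if w @ x >= y0 else "C2"

print([classify(x) for x in np.vstack([X1, X2])])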
Please
Your Reading Material (it covers the multiclass case)
Section 4.1.6, “Fisher’s discriminant for multiple classes”, in “Pattern Recognition and Machine Learning” by Bishop.
Principal Component Analysis: Also Known as the Karhunen-Loeve Transform
Setup
Consider a data set of observations {x_n} with n = 1, 2, ..., N and x_n ∈ R^d.
Goal
Project the data onto a space with dimensionality m < d (we assume m is given).
Use the 1-Dimensional Variance
Remember the Sample Variance
VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) (x_i - \bar{x})}{N - 1}   (21)
You can do the same in the case of two variables X and Y
COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})}{N - 1}   (22)
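A small sketch of Equations (21) and (22) on made-up one-dimensional data; note the N - 1 denominator matches NumPy's ddof=1:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
N = len(x)

var_x = np.sum((x - x.mean()) * (x - x.mean())) / (N - 1)   # Equation (21)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)  # Equation (22)

print(var_x, np.var(x, ddof=1))            # should agree
print(cov_xy, np.cov(x, y, ddof=1)[0, 1])  # should agree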
Now, Define
Given the data
x_1, x_2, ..., x_N   (23)
where each x_i is a column vector.
Construct the sample mean
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i   (24)
Build the new (centered) data
x_1 - \bar{x}, x_2 - \bar{x}, ..., x_N - \bar{x}   (25)
Build the Sample Covariance
The Covariance Matrix
S = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x}) (x_i - \bar{x})^T   (26)
Properties
1 The ijth entry of S is the sample covariance σ_ij between dimensions i and j.
2 The iith entry of S is the sample variance σ_i^2 of dimension i.
3 What else? Look at the data in a plane: centering and rotating!!! (see the sketch below)
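A small sketch of the sample mean (24), the centered data (25), and the covariance matrix (26); the data matrix is a made-up example with one sample per row:

import numpy as np

# Made-up data: N = 5 samples in d = 2 dimensions, one sample per row.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
N = X.shape[0]

x_bar = X.mean(axis=0)          # Equation (24): sample mean
X_centered = X - x_bar          # Equation (25): centered data

# Equation (26): S = (1 / (N - 1)) * sum_i (x_i - x_bar)(x_i - x_bar)^T
S = (X_centered.T @ X_centered) / (N - 1)
print(S)
print(np.cov(X, rowvar=False))  # should match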
Using S to Project Data
As in Fisher
We want to project the data onto a line...
For this we use a vector u_1
with u_1^T u_1 = 1.
Question
What is the sample variance of the projected data?
Thus we have
Variance of the projected data
\frac{1}{N - 1} \sum_{i=1}^{N} \left( u_1^T x_i - u_1^T \bar{x} \right)^2 = u_1^T S u_1   (27)
Use Lagrange multipliers to maximize
u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right)   (28)
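A small sketch checking the identity in Equation (27): the sample variance of the projected data equals the quadratic form u_1^T S u_1 (same made-up X as above; the particular unit vector u_1 is an arbitrary assumption):

import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
N = X.shape[0]
S = np.cov(X, rowvar=False)

# An arbitrary unit vector u_1 (normalized so that u_1^T u_1 = 1).
u1 = np.array([1.0, 2.0])
u1 = u1 / np.linalg.norm(u1)

y = X @ u1                                   # projected data
lhs = np.sum((y - y.mean()) ** 2) / (N - 1)  # left-hand side of (27)
rhs = u1 @ S @ u1                            # right-hand side of (27)
print(lhs, rhs)                              # should agree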
Differentiate with respect to u_1
We get
S u_1 = \lambda_1 u_1   (29)
Then
u_1 is an eigenvector of S.
If we left-multiply by u_1^T
u_1^T S u_1 = \lambda_1   (30)
Thus
The variance will be maximized when
u_1^T S u_1 = \lambda_1   (31)
is set to the largest eigenvalue; this direction u_1 is also known as the First Principal Component.
By Induction
It is possible, for an M-dimensional projection space, to define M eigenvectors u_1, u_2, ..., u_M of the data covariance S, corresponding to the M largest eigenvalues \lambda_1, \lambda_2, ..., \lambda_M, that maximize the variance of the projected data.
Computational Cost
1 Full eigenvector decomposition: O(d^3).
2 Power Method: O(M d^2) (Golub and Van Loan, 1996).
3 Use the Expectation Maximization algorithm.
A minimal sketch of the full eigendecomposition and the power method follows.
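A minimal sketch contrasting the full eigendecomposition with the power method for the leading eigenvector (the data, the iteration budget, and the tolerance are illustrative assumptions):

import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
S = np.cov(X, rowvar=False)

# Full eigendecomposition, O(d^3): eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
u1_full = eigvecs[:, -1]          # first principal component
print(eigvals[-1], u1_full)

# Power method for the leading eigenvector only.
u = np.ones(S.shape[0])
for _ in range(100):              # fixed iteration budget (an assumption)
    u_new = S @ u
    u_new = u_new / np.linalg.norm(u_new)
    if np.linalg.norm(u_new - u) < 1e-10:
        break
    u = u_new
print(u @ S @ u, u)               # should match the largest eigenvalue and ±u1_full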
Example
[Figures: examples from Bishop]