Machine Learning for Data Science (CS4786), Lecture 4
Canonical Correlation Analysis (CCA)
Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016fa/
Announcement
• We are grading HW0 and you will be added to CMS by Monday
• HW1 will be posted tonight on the webpage (homework tab)
• HW1 is on CCA and PCA (due in a week)
QUIZ
Let $X$ be the $n \times d$ data matrix whose rows are $x_1^\top, \dots, x_n^\top$.

Assume points are centered. Which of the following are equal to the covariance matrix?

A. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t^\top x_t$   B. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t x_t^\top$   C. $\Sigma = X^\top X$   D. $\Sigma = X X^\top$
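As a sanity check, here is a minimal numpy sketch on synthetic data (variable names like `Sigma_B` are ours) verifying that option B coincides with $\frac{1}{n} X^\top X$ for centered data, while $X X^\top$ is $n \times n$ and cannot be the $d \times d$ covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                        # center the points

Sigma_B = sum(np.outer(x, x) for x in X) / n  # option B: (1/n) sum_t x_t x_t^T
Sigma_C = X.T @ X / n                         # (1/n) X^T X, a d-by-d matrix

print(np.allclose(Sigma_B, Sigma_C))          # True
print((X @ X.T).shape)                        # (100, 100): X X^T is n-by-n
```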
Example: Students in classroom

[Figure: 3D scatter of students with axes x, y, z; the two equivalent PCA objectives are labeled "Maximize Spread" and "Minimize Reconstruction Error".]
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components
Top K principal components are the eigenvectors with the K largest eigenvalues
Projection = Data × Top K eigenvectors
Reconstruction = Projection × Transpose of top K eigenvectors
Independently discovered by Pearson in 1901 and Hotelling in 1933.
1. $\Sigma = \mathrm{cov}(X)$
2. $W = \mathrm{eigs}(\Sigma, K)$
3. $Y = (X - \mu)\,W$

RECONSTRUCTION

4. $\hat{X} = Y\,W^\top + \mu$
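Put together, the four steps above are only a few lines of numpy. The sketch below is a minimal, hypothetical implementation (the function name `pca` is ours):

```python
import numpy as np

def pca(X, K):
    """PCA by eigendecomposition. X is an n-by-d data matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)                     # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    W = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # step 2: top-K eigenvectors
    Y = Xc @ W                                     # step 3: projection
    X_hat = Y @ W.T + mu                           # step 4: reconstruction
    return Y, X_hat
```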
WHEN d >> n
If $d \gg n$ then $\Sigma$ is a large $d \times d$ matrix, but we only need the top $K$ eigenvectors. Idea: use the SVD.
$X - \mu = U D V^\top$

Then note that $n\Sigma = (X - \mu)^\top (X - \mu) = V D^2 V^\top$ (the $1/n$ factor only rescales the eigenvalues).

Hence, the matrix $V$ is the same as the matrix $W$ obtained from the eigendecomposition of $\Sigma$, and the eigenvalues are the diagonal elements of $D^2$.

Alternative algorithm: $[U, D, V] = \mathrm{SVD}(X - \mu, K)$, $W = V$, where $U^\top U = I$ and $V^\top V = I$.
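A minimal sketch of this SVD route in numpy (the function name `pca_svd` is ours; `np.linalg.svd` returns singular values in decreasing order, so the first $K$ rows of $V^\top$ are the top-$K$ directions):

```python
import numpy as np

def pca_svd(X, K):
    """PCA via the SVD of the centered data; avoids forming the d-by-d Sigma."""
    mu = X.mean(axis=0)
    U, D, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:K].T          # top-K right singular vectors = top-K eigenvectors
    eigvals = D[:K] ** 2  # eigenvalues of (X - mu)^T (X - mu) are D^2
    return (X - mu) @ W, W, eigvals
```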
WHEN TO USE PCA?
When data naturally lies in a low dimensional linear subspace
To minimize reconstruction error
Find directions where data is maximally spread
Canonical Correlation Analysis
[Figure: the students example as two views; one view has spatial coordinates with axes x, y, z, the other has attributes Age, Gender, Angle.]
TWO VIEW DIMENSIONALITY REDUCTION
Data comes in pairs $(x_1, x'_1), \dots, (x_n, x'_n)$, where the $x_t$'s are $d$-dimensional and the $x'_t$'s are $d'$-dimensional

Goal: compress, say, view one into $y_1, \dots, y_n$, which are $K$-dimensional vectors
Retain information redundant between the two views
Eliminate “noise” specific to only one of the views
EXAMPLE I: SPEECH RECOGNITION
Audio might have background sounds uncorrelated with video
Video might have lighting changes uncorrelated with audio
Redundant information between two views: the speech
EXAMPLE II: COMBINING FEATURE EXTRACTIONS
Method A and Method B are both equally good feature extraction techniques
Concatenating the two features blindly yields a large-dimensional feature vector with redundancy
Applying techniques like CCA extracts the key information between the two methods
Removes extra unwanted information
How do we get the right direction? (say K = 1)
[Figure: the two-view students example again, with attributes Age, Gender, Angle.]
WHICH DIRECTION TO PICK?

[Figure: scatter plots of View I and View II; captions: "PCA direction", "Direction has large covariance".]

How do we pick the right direction to project to?
MAXIMIZING CORRELATION COEFFICIENT
Say $w_1$ and $v_1$ are the directions we choose to project in views 1 and 2 respectively. We want these directions to maximize
$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)$$
s.t. $\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2 = 1$

where $y_t[1] = w_1^\top x_t$ and $y'_t[1] = v_1^\top x'_t$
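As a concrete reading of these definitions, here is a small numpy helper (hypothetical, not from the lecture) that computes the empirical correlation of the two projected views for candidate directions $w_1$ and $v_1$; when the unit-variance constraints hold, the denominator is 1 and this reduces to the covariance objective above:

```python
import numpy as np

def projected_correlation(X, Xp, w1, v1):
    """Empirical correlation of the w1-projection of view one (X, n-by-d)
    with the v1-projection of view two (Xp, n-by-d')."""
    y, yp = X @ w1, Xp @ v1              # y_t[1] = w1^T x_t, y'_t[1] = v1^T x'_t
    y, yp = y - y.mean(), yp - yp.mean() # center the projections
    return (y * yp).mean() / (y.std() * yp.std())
```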
What is the problem with the above?
WHY NOT MAXIMIZE COVARIANCE
[Figure: the relevant information shared between the two views.]

Say $\frac{1}{n}\sum_{t=1}^{n} x_t[2]\cdot x'_t[2] > 0$.

By scaling up this coordinate we can blow up the covariance.
MAXIMIZING CORRELATION COEFFICIENT
Say $w_1$ and $v_1$ are the directions we choose to project in views 1 and 2 respectively. We want these directions to maximize
$$\frac{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)}{\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2}\,\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2}}$$
BASIC IDEA OF CCA
Normalize the variance in the chosen direction to be constant (say 1)
Then maximize covariance
This is the same as maximizing the "correlation coefficient" (recall from last class).
COVARIANCE VS CORRELATION
$\mathrm{Covariance}(A,B) = \mathbb{E}[(A - \mathbb{E}[A])\cdot(B - \mathbb{E}[B])]$
This depends on the scale of $A$ and $B$: if $B$ is rescaled, the covariance is rescaled with it.

$\mathrm{Correlation}(A,B) = \dfrac{\mathbb{E}[(A - \mathbb{E}[A])\cdot(B - \mathbb{E}[B])]}{\sqrt{\mathrm{Var}(A)}\,\sqrt{\mathrm{Var}(B)}}$
This is scale free.
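A quick numerical illustration of this contrast (synthetic data; the setup is ours): rescaling $B$ by 100 multiplies the covariance by 100 but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=1000)
B = A + rng.normal(size=1000)

for scale in (1.0, 100.0):
    Bs = scale * B
    cov = np.mean((A - A.mean()) * (Bs - Bs.mean()))
    corr = cov / (A.std() * Bs.std())
    print(f"scale={scale}: covariance={cov:.3f}, correlation={corr:.3f}")
# covariance grows 100x under the rescaling; correlation is unchanged
```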
CANONICAL CORRELATION ANALYSIS
Hence we want to solve for projection vectors $w_1$ and $v_1$ that

maximize $\frac{1}{n}\sum_{t=1}^{n} w_1^\top(x_t - \mu)\cdot v_1^\top(x'_t - \mu')$

subject to $\frac{1}{n}\sum_{t=1}^{n}\left(w_1^\top(x_t - \mu)\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(v_1^\top(x'_t - \mu')\right)^2 = 1$

where $\mu = \frac{1}{n}\sum_{t=1}^{n} x_t$ and $\mu' = \frac{1}{n}\sum_{t=1}^{n} x'_t$
CANONICAL CORRELATION ANALYSIS
Hence we want to solve for projection vectors $w_1$ and $v_1$ that

maximize $w_1^\top \Sigma_{1,2}\, v_1$

subject to $w_1^\top \Sigma_{1,1}\, w_1 = v_1^\top \Sigma_{2,2}\, v_1 = 1$

Writing the Lagrangian, taking derivatives, and equating them to 0, we get
$$\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\, w_1 = \lambda^2 \Sigma_{1,1} w_1 \quad\text{and}\quad \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\, v_1 = \lambda^2 \Sigma_{2,2} v_1$$
or equivalently
$$\left(\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\right) w_1 = \lambda^2 w_1 \quad\text{and}\quad \left(\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right) v_1 = \lambda^2 v_1$$

Here $\Sigma = \mathrm{cov}\left(\begin{bmatrix} X & X' \end{bmatrix}\right) = \begin{bmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{bmatrix}$ is the covariance of the concatenated data.
SOLUTION
$W_1 = \mathrm{eigs}\left(\Sigma_{1,1}^{-1}\,\Sigma_{1,2}\,\Sigma_{2,2}^{-1}\,\Sigma_{2,1},\ K\right)$

$W_2 = \mathrm{eigs}\left(\Sigma_{2,2}^{-1}\,\Sigma_{2,1}\,\Sigma_{1,1}^{-1}\,\Sigma_{1,2},\ K\right)$
CCA ALGORITHM
Write $\tilde{x}_t = \begin{bmatrix} x_t \\ x'_t \end{bmatrix}$, the $(d + d')$-dimensional concatenated vectors.

Calculate the covariance matrix of the joint data points
$$\Sigma = \begin{bmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{bmatrix}$$

Calculate $\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}$. The top $K$ eigenvectors of this matrix give us the projection matrix for view I.

Calculate $\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}$. The top $K$ eigenvectors of this matrix give us the projection matrix for view II.
In matrix form: let $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ be the $n \times (d_1 + d_2)$ concatenated data matrix.

1. $\Sigma = \mathrm{cov}(X)$
2. Partition $\Sigma = \begin{bmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{bmatrix}$
3. $W_1 = \mathrm{eigs}\left(\Sigma_{1,1}^{-1}\,\Sigma_{1,2}\,\Sigma_{2,2}^{-1}\,\Sigma_{2,1},\ K\right)$
4. $Y_1 = (X_1 - \mu_1)\,W_1$
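Putting the whole pipeline together, here is a minimal numpy sketch of the algorithm above (the function name `cca` and the small ridge `eps`, added so the block inverses exist, are our choices, not from the slides):

```python
import numpy as np

def cca(X1, X2, K, eps=1e-8):
    """CCA sketch. X1: n-by-d1 view one, X2: n-by-d2 view two.
    Returns the K-dimensional projections Y1, Y2 of each view."""
    n = X1.shape[0]
    Xc1 = X1 - X1.mean(axis=0)
    Xc2 = X2 - X2.mean(axis=0)
    # blocks of the covariance matrix of the concatenated data
    S11 = Xc1.T @ Xc1 / n + eps * np.eye(X1.shape[1])  # Sigma_{1,1} (+ ridge)
    S22 = Xc2.T @ Xc2 / n + eps * np.eye(X2.shape[1])  # Sigma_{2,2} (+ ridge)
    S12 = Xc1.T @ Xc2 / n                              # Sigma_{1,2}
    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    M2 = np.linalg.solve(S22, S12.T) @ np.linalg.solve(S11, S12)

    def top_k(M):
        # M need not be symmetric, so use the general eigensolver
        vals, vecs = np.linalg.eig(M)
        order = np.argsort(vals.real)[::-1][:K]
        return vecs[:, order].real

    W1, W2 = top_k(M1), top_k(M2)  # projection matrices for views I and II
    return Xc1 @ W1, Xc2 @ W2
```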