Linear Dimensionality Reduction
Practical Machine Learning (CS294-34)
September 24, 2009
Percy Liang
Lots of high-dimensional data...
• face images
• documents, e.g.:
  "Zambian President Levy Mwanawasa has won a second term in office in an election his challenger Michael Sata accused him of rigging, official results showed on Monday."
  "According to media reports, a pair of hackers said on Saturday that the Firefox Web browser, commonly perceived as the safer and more customizable alternative to market leader Internet Explorer, is critically flawed. A presentation on the flaw was shown during the ToorCon hacker conference in San Diego."
• gene expression data
• MEG readings
Motivation and context
Why do dimensionality reduction?
• Computational: compress data ⇒ time/space efficiency
• Statistical: fewer dimensions ⇒ better generalization
• Visualization: understand structure of data
• Anomaly detection: describe normal data, detect outliers
Dimensionality reduction in this course:
• Linear methods (this week)
• Clustering (last week)
• Feature selection (next week)
• Nonlinear methods (later)
Types of problems
• Prediction x → y: classification, regression
  Applications: face recognition, gene expression prediction
  Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)
• Structure discovery x → z: find an alternative representation z of data x
  Applications: visualization
  Techniques: clustering, linear dimensionality reduction
• Density estimation p(x): model the data
  Applications: anomaly detection, language modeling
  Techniques: clustering, linear dimensionality reduction
Basic idea of linear dimensionality reduction
Represent each face as a high-dimensional vector x ∈ R^361
x ∈ R^361  →  z = U^T x  →  z ∈ R^10
How do we choose U?
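As a concrete sketch (not from the slides), the encode step z = U^T x looks like this; here U is just an arbitrary orthonormal basis obtained via QR, since how to choose U well is exactly what the rest of the lecture derives:

```python
import numpy as np

# Encode step z = U^T x with an arbitrary orthonormal U (illustrative only;
# PCA's particular choice of U is derived in the following slides).
rng = np.random.default_rng(0)
d, k = 361, 10                    # e.g. a 19x19 face image flattened into R^361
x = rng.standard_normal(d)        # stand-in for a face vector

# QR factorization of a random matrix gives U with orthonormal columns
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
z = U.T @ x                       # the 10-dimensional representation
```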
Outline
• Principal component analysis (PCA)
– Basic principles
– Case studies
– Kernel PCA
– Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary
Dimensionality reduction setup
Given n data points in d dimensions: x_1, ..., x_n ∈ R^d
X = ( x_1 · · · x_n ) ∈ R^{d×n}
Want to reduce dimensionality from d to k
Choose k directions u_1, ..., u_k
U = ( u_1 · · · u_k ) ∈ R^{d×k}
For each u_j, compute "similarity" z_j = u_j^T x
Project x down to z = (z_1, ..., z_k)^T = U^T x
How to choose U?
PCA objective 1: reconstruction error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x
• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j
Want the reconstruction error ‖x − x̃‖ to be small
Objective: minimize total squared reconstruction error
    min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − U U^T x_i‖²
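A minimal sketch of this objective with synthetic data (the shapes and random inputs are assumptions for illustration): encode, decode, and sum the squared reconstruction errors.

```python
import numpy as np

# Encode z = U^T x, decode x~ = U z, and measure the total squared
# reconstruction error sum_i ||x_i - U U^T x_i||^2 for a fixed orthonormal U
# (not necessarily the minimizer).
rng = np.random.default_rng(1)
d, k, n = 20, 3, 200
X = rng.standard_normal((d, n))                   # columns are data points
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # some orthonormal U

Z = U.T @ X                   # encode: k x n
X_hat = U @ Z                 # decode: d x n reconstructions
recon_err = np.sum((X - X_hat) ** 2)
```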
PCA objective 2: projected variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): E[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var[f(x)] + (E[f(x)])² = E[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)²
Assume the data is centered: E[x] = 0 (what's E[U^T x]?)
Objective: maximize variance of projected data
    max_{U ∈ R^{d×k}, U^T U = I} E[‖U^T x‖²]
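A sketch of this objective on synthetic data (sizes are arbitrary assumptions). It also answers the slide's question: once E[x] = 0, the projected mean E[U^T x] = U^T E[x] = 0 as well.

```python
import numpy as np

# After centering, the projected variance E[||U^T x||^2] is just the average
# squared norm of the projections.
rng = np.random.default_rng(2)
d, k, n = 20, 3, 500
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)             # center: empirical E[x] = 0

U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # any orthonormal U
proj_mean = (U.T @ X).mean(axis=1)                # E[U^T x] = U^T E[x] = 0
proj_var = np.mean(np.sum((U.T @ X) ** 2, axis=0))  # E[||U^T x||^2]
```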
Equivalence in two objectives
Key intuition:
    variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
Pythagorean decomposition: x = U U^T x + (I − U U^T) x
The two terms are orthogonal: legs of length ‖U U^T x‖ and ‖(I − U U^T) x‖, with hypotenuse of length ‖x‖.
Take expectations; note the rotation U doesn't affect length:
    E[‖x‖²] = E[‖U^T x‖²] + E[‖x − U U^T x‖²]
Minimize reconstruction error ↔ Maximize captured variance
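The identity is easy to verify numerically (a sketch on synthetic data; it holds for any U with orthonormal columns):

```python
import numpy as np

# Numeric check of E[||x||^2] = E[||U^T x||^2] + E[||x - U U^T x||^2].
rng = np.random.default_rng(3)
d, k, n = 15, 4, 300
X = rng.standard_normal((d, n))
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

total    = np.mean(np.sum(X ** 2, axis=0))                    # E[||x||^2]
captured = np.mean(np.sum((U.T @ X) ** 2, axis=0))            # E[||U^T x||^2]
residual = np.mean(np.sum((X - U @ (U.T @ X)) ** 2, axis=0))  # reconstruction error
```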
Finding one principal component
Input data: X = ( x_1 · · · x_n )
Objective: maximize variance of projected data
    = max_{‖u‖=1} E[(u^T x)²]
    = max_{‖u‖=1} (1/n) Σ_{i=1}^n (u^T x_i)²
    = max_{‖u‖=1} (1/n) ‖u^T X‖²
    = max_{‖u‖=1} u^T ( (1/n) X X^T ) u
    = largest eigenvalue of C := (1/n) X X^T
(C is the covariance matrix of the data)
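A sketch of this result on synthetic centered data: take the top eigenvector of C as the first principal component, and check that it attains the largest variance u^T C u among unit vectors.

```python
import numpy as np

# The first principal component as the top eigenvector of C = (1/n) X X^T.
rng = np.random.default_rng(4)
d, n = 10, 400
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

C = (X @ X.T) / n                      # covariance matrix, d x d
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # top eigenvector = first PC
lam_max = eigvals[-1]                  # variance captured by u1
```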
How many principal components?
• Similar to the question "How many clusters?"
• The magnitude of the eigenvalues indicates the fraction of variance captured.
• [Figure: eigenvalue spectrum λ_i for i = 2, ..., 11 on a face image dataset; λ ranges from roughly 287 to 1353 and decays with i]
• Eigenvalues typically drop off sharply, so you don't need that many components.
• Of course, variance isn't everything...
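One common recipe for picking k (an assumption on my part, not prescribed by the slides) is to keep the smallest k whose top eigenvalues capture a target fraction of the total variance:

```python
import numpy as np

# Pick k from the cumulative fraction of variance captured by the top-k
# eigenvalues. Synthetic data with three dominant directions, so the
# spectrum drops off sharply, as on the slide.
rng = np.random.default_rng(6)
d, n = 30, 1000
scales = np.concatenate([[10.0, 5.0, 3.0], 0.1 * np.ones(d - 3)])
X = scales[:, None] * rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

eigvals = np.linalg.eigvalsh((X @ X.T) / n)[::-1]  # descending
frac = np.cumsum(eigvals) / np.sum(eigvals)        # variance captured by top k
k = int(np.searchsorted(frac, 0.95)) + 1           # smallest k with >= 95%
```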
Computing PCA
Method 1: eigendecomposition
U are the eigenvectors of the covariance matrix C = (1/n) X X^T
Computing C already takes O(n d²) time (very expensive)

Method 2: singular value decomposition (SVD)
Find X = U_{d×d} Σ_{d×n} V^T_{n×n}
where U^T U = I_{d×d}, V^T V = I_{n×n}, and Σ is diagonal
Computing the top k singular vectors takes only O(n d k) time

Relationship between eigendecomposition and SVD:
the left singular vectors are the principal components (C = (1/n) U Σ² U^T)
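This relationship is easy to check numerically (synthetic data for illustration): the squared singular values of X, divided by n, are exactly the eigenvalues of the covariance matrix.

```python
import numpy as np

# The left singular vectors of X diagonalize C = (1/n) X X^T,
# with eigenvalues sigma_i^2 / n.
rng = np.random.default_rng(7)
d, n = 8, 100
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V^T
eig_desc = np.linalg.eigvalsh((X @ X.T) / n)[::-1]    # eigenvalues, descending
```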
Eigen-faces [Turk and Pentland, 1991]
• d = number of pixels
• Each xi ∈ Rd is a face image
• xji = intensity of the j-th pixel in image i
Xd×n u Ud×k Zk×n
( . . . ) u ( ) ( z1 . . . zn )Idea: zi more “meaningful” representation of i-th face than xi
Can use zi for nearest-neighbor classification
Much faster: O(dk + nk) time instead of O(dn) when n, d� k
Why no time savings for linear classifier?
Principal component analysis (PCA) / Case studies 18
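The nearest-neighbor speedup can be sketched as follows (random data stands in for face images; all names are illustrative): project each stored image once, then answer queries in the k-dimensional code space.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 400, 50, 10                  # "pixels", "faces", components
X = rng.normal(size=(d, n))
mean = X.mean(axis=1, keepdims=True)
U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
U = U[:, :k]
Z = U.T @ (X - mean)                   # k x n codes, computed once

def nearest_face(x_query):
    """Project the query (O(dk)), then scan the n codes (O(nk))."""
    z = U.T @ (x_query - mean[:, 0])
    return int(np.argmin(np.linalg.norm(Z - z[:, None], axis=0)))

assert nearest_face(X[:, 7]) == 7      # a stored face is its own neighbor
```

For a linear classifier there is no analogous saving: scoring w^T x is already O(d) per query.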
![Page 59: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/59.jpg)
Latent Semantic Analysis [Deerwester, 1990]

• d = number of words in the vocabulary
• Each x_i ∈ R^d is a vector of word counts
• x_{ji} = frequency of word j in document i

X_{d×n} ≈ U_{d×k} Z_{k×n}
(columns of X are word-count vectors, e.g. stocks: 2, chairman: 4, the: 8, . . . , wins: 0, game: 1)

How to measure similarity between two documents?
z_1^T z_2 is probably better than x_1^T x_2

Applications: information retrieval
Note: no computational savings; the original x is already sparse

Principal component analysis (PCA) / Case studies 19
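A toy numpy sketch of comparing documents via z rather than x (the 5-word, 4-document count matrix below is made up, with two "finance" and two "sports" documents):

```python
import numpy as np

# Rows = words, columns = documents
X = np.array([
    [2., 3., 0., 0.],   # stocks
    [4., 2., 1., 0.],   # chairman
    [8., 7., 7., 8.],   # the
    [0., 0., 3., 2.],   # wins
    [1., 0., 4., 3.],   # game
])
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = np.diag(S[:k]) @ Vt[:k]            # k x n document representations

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# In the latent space, the two finance documents (columns 0, 1)
# sit closer to each other than to the sports documents (columns 2, 3)
```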
![Page 63: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/63.jpg)
Network anomaly detection [Lakhina, ’05]

x_{ji} = amount of traffic on link j in the network during time interval i

Model assumption: total traffic is the sum of flows along a few “paths”
Apply PCA: each principal component intuitively represents a “path”
Anomaly when traffic deviates from the first few principal components

Principal component analysis (PCA) / Case studies 20
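The anomaly criterion can be sketched as follows (synthetic traffic; the residual comparison is illustrative, real systems would set a threshold): flag a measurement whose residual outside the top-k principal subspace is large.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 20, 200, 3
paths = rng.normal(size=(d, k))               # a few underlying "paths"
X = paths @ rng.normal(size=(k, n))           # normal traffic: sum of path flows
mean = X.mean(axis=1, keepdims=True)
U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
U = U[:, :k]

def residual(x):
    """Norm of the part of x not explained by the principal subspace."""
    v = x - mean[:, 0]
    return np.linalg.norm(v - U @ (U.T @ v))

normal = X[:, 0]
anomaly = X[:, 0] + 5.0 * rng.normal(size=d)  # traffic off the learned paths
assert residual(anomaly) > residual(normal)
```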
![Page 68: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/68.jpg)
Unsupervised POS tagging [Schutze, ’95]

Part-of-speech (POS) tagging task:
Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each x_i is (the context distribution of) a word.
x_{ji} = number of times word i appeared in context j

Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster word types using their contexts
Problem: contexts are too sparse
Solution: run PCA first, then cluster using the new representation

Principal component analysis (PCA) / Case studies 21
![Page 71: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/71.jpg)
Multi-task learning [Ando & Zhang, ’05]

• Have n related tasks (e.g., classify documents for various users)
• Each task has a linear classifier with weights x_i
• Want to share structure between the classifiers

One step of their procedure: given n linear classifiers x_1, . . . , x_n, run PCA to identify shared structure:
X = ( x_1 . . . x_n ) ≈ UZ
Each principal component is an eigen-classifier

Other step of their procedure: retrain the classifiers, regularizing towards the subspace U

Principal component analysis (PCA) / Case studies 22
![Page 76: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/76.jpg)
PCA summary

• Intuition: capture the variance of the data, or equivalently minimize the reconstruction error
• Algorithm: eigendecomposition of the covariance matrix, or SVD of X
• Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity
• Advantages: simple, fast
• Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

Principal component analysis (PCA) / Case studies 23
![Page 81: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/81.jpg)
Roadmap
• Principal component analysis (PCA)
– Basic principles
– Case studies
– Kernel PCA
– Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary
Principal component analysis (PCA) / Kernel PCA 24
![Page 82: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/82.jpg)
Limitations of linearity

PCA is effective                PCA is ineffective

The problem is that the PCA subspace is linear:
S = {x = Uz : z ∈ R^k}
In this example:
S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1}

Principal component analysis (PCA) / Kernel PCA 25
![Page 88: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/88.jpg)
Going beyond linearity: quick solution

Broken solution                Desired solution

We want the desired solution: S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1^2}
We can get it: S = {φ(x) = Uz} with φ(x) = (x_1^2, x_2)^T

Linear dimensionality reduction in φ(x) space
⇔
Nonlinear dimensionality reduction in x space

In general, can set φ(x) = (x_1, x_1^2, x_1 x_2, sin(x_1), . . . )^T

Problems: (1) ad hoc and tedious; (2) φ(x) is large, so computation is expensive

Principal component analysis (PCA) / Kernel PCA 26
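A sketch of the quick solution on a parabola-shaped dataset (made up for illustration): after the hand-chosen feature map φ(x) = (x_1², x_2), the data is exactly linear, so one principal component captures everything.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.uniform(-1, 1, size=200)
x2 = 2.0 * x1 ** 2                    # data on a parabola: linear PCA fails
X = np.vstack([x1, x2])               # 2 x n original data

Phi = np.vstack([x1 ** 2, x2])        # feature space: the data is now linear
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(Phi_c, full_matrices=False)

# The second singular value vanishes: one component explains everything
assert S[1] < 1e-8 * S[0]
```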
![Page 94: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/94.jpg)
Towards kernels

Representer theorem: the PCA solution is a linear combination of the x_i’s
Why? Recall the PCA eigenvalue problem: X X^T u = λ u
Notice that u = Xα = Σ_{i=1}^n α_i x_i for some weights α
Analogy with SVMs: weight vector w = Xα

Key fact: PCA only needs the inner products K = X^T X
Why? Use the representer theorem on the PCA objective:
max_{‖u‖=1} u^T X X^T u = max_{α^T X^T X α = 1} α^T (X^T X)(X^T X) α

Principal component analysis (PCA) / Kernel PCA 27
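The representer claim can be checked numerically (random data, more dimensions than points so the column span of X is a strict subspace): any top eigenvector of X X^T is exactly a linear combination of the data points.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 10, 6                          # fewer points than dimensions
X = rng.normal(size=(d, n))
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
u = eigvecs[:, -1]                    # top eigenvector (nonzero eigenvalue)

# Solve u = X alpha in the least-squares sense; the fit is exact
alpha, *_ = np.linalg.lstsq(X, u, rcond=None)
assert np.allclose(X @ alpha, u)      # u lies in the span of the x_i
```

This follows from the eigenvalue equation itself: u = (1/λ) X (X^T u), so α = (1/λ) X^T u works.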
![Page 100: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/100.jpg)
Kernel PCA

Kernel function: k(x_1, x_2) such that K, the kernel matrix formed by K_{ij} = k(x_i, x_j), is positive semi-definite

Examples:
Linear kernel: k(x_1, x_2) = x_1^T x_2
Polynomial kernel: k(x_1, x_2) = (1 + x_1^T x_2)^2
Gaussian (RBF) kernel: k(x_1, x_2) = e^{−‖x_1 − x_2‖^2}

Treat data points x as black boxes, accessed only via k
k intuitively measures “similarity” between two inputs

Mercer’s theorem (using kernels is sensible):
there exists a high-dimensional feature space φ such that
k(x_1, x_2) = φ(x_1)^T φ(x_2) (like the quick solution earlier!)

Principal component analysis (PCA) / Kernel PCA 28
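The three example kernels can be checked numerically (random data; a quick sanity check, not a proof): each yields a symmetric, positive semi-definite kernel matrix, as the definition requires.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 10))          # 10 points in R^3, stored as columns

kernels = {
    "linear": lambda a, b: a @ b,
    "poly":   lambda a, b: (1 + a @ b) ** 2,
    "rbf":    lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2),
}

for name, kfun in kernels.items():
    K = np.array([[kfun(X[:, i], X[:, j]) for j in range(10)]
                  for i in range(10)])
    assert np.allclose(K, K.T)                       # symmetric
    assert np.linalg.eigvalsh(K).min() > -1e-8       # PSD up to rounding
```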
![Page 111: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/111.jpg)
Solving kernel PCA
Direct method:
Kernel PCA objective:
max α>K²α subject to α>Kα = 1
⇒ kernel PCA eigenvalue problem: Kα = λ′α (since K = X>X)
Modular method (if you don’t want to think about kernels):
Find vectors x′1, . . . ,x′n such that
x′i>x′j = Kij = φ(xi)>φ(xj)
Key: use any vectors that preserve inner products
One possibility is Cholesky decomposition K = X>X
Principal component analysis (PCA) / Kernel PCA 29
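The direct method can be sketched in a few lines of NumPy — my own illustration, not the lecture's code (the function name `kernel_pca` and the explicit centering step are assumptions on top of the slide): eigendecompose the centered kernel matrix and rescale each eigenvector α so that α>Kα = 1.

```python
import numpy as np

def kernel_pca(K, k):
    """Project training points onto top-k kernel principal components.

    K: n x n kernel matrix with K[i, j] = k(x_i, x_j).
    Returns an n x k matrix of projected coordinates.
    """
    n = K.shape[0]
    # Center the kernel matrix in feature space: Kc = H K H with H = I - 11^T / n
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    # Kernel PCA eigenvalue problem: Kc @ alpha = lambda' * alpha
    lams, vecs = np.linalg.eigh(Kc)
    order = np.argsort(lams)[::-1][:k]
    lams, alphas = lams[order], vecs[:, order]
    # Rescale so each component satisfies alpha^T Kc alpha = 1
    alphas = alphas / np.sqrt(np.maximum(lams, 1e-12))
    return Kc @ alphas
```

With a linear kernel K = X>X this reproduces ordinary PCA scores up to sign, which is one way to sanity-check an implementation.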
![Page 112: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/112.jpg)
Roadmap
• Principal component analysis (PCA)
– Basic principles
– Case studies
– Kernel PCA
– Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary
Principal component analysis (PCA) / Probabilistic PCA 30
![Page 118: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/118.jpg)
Probabilistic modeling
So far, we have dealt with objective functions:
min_U f(X,U)
Probabilistic modeling: max_U p(X | U)
Invent a generative story of how the data X arose
Play detective: infer the parameters U that produced X
Advantages:
• Model reports estimates of uncertainty
• Natural way to handle missing data
• Natural way to introduce prior knowledge
• Natural way to incorporate in a larger model
Example from last lecture: k-means ⇒ GMMs
Principal component analysis (PCA) / Probabilistic PCA 31
![Page 124: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/124.jpg)
Probabilistic PCA
Generative story [Tipping and Bishop, 1999]:
For each data point i = 1, . . . , n:
Draw the latent vector: zi ∼ N(0, Ik×k)
Create the data point: xi ∼ N(Uzi, σ²Id×d)
PCA finds the U that maximizes the likelihood of the data
Advantages:
• Handles missing data (important for collaborative filtering)
• Extension to factor analysis: allow non-isotropic noise (replace σ²Id×d with an arbitrary diagonal matrix)
Principal component analysis (PCA) / Probabilistic PCA 32
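The maximum-likelihood U has a closed form (Tipping and Bishop, 1999): take the top-k eigenvectors of the sample covariance, scale them by the excess variance above the noise level, and set the noise variance σ² to the average of the discarded eigenvalues. A NumPy sketch of that closed form (`ppca_fit` is an illustrative name, not from the lecture):

```python
import numpy as np

def ppca_fit(X, k):
    """Closed-form ML estimates for probabilistic PCA.

    X: n x d data matrix (rows are points). Returns (U, sigma2).
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                      # d x d sample covariance
    lam, V = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    sigma2 = lam[k:].mean()                # noise = average discarded eigenvalue
    # Columns of U span the same subspace as the top-k PCA directions
    U = V[:, :k] * np.sqrt(np.maximum(lam[:k] - sigma2, 0.0))
    return U, sigma2
```

Note that the likelihood is invariant to rotating U on the right, so the ML solution is only determined up to such a rotation.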
![Page 132: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/132.jpg)
Probabilistic latent semantic analysis (pLSA)
Motivation: in text analysis, X contains word counts; PCA (LSA) is a bad model as it allows negative counts; pLSA fixes this
Generative story for pLSA [Hofmann, 1999]:
For each document i = 1, . . . , n:
Repeat M times (number of word tokens in the document):
Draw a latent topic: z ∼ p(z | i)
Choose the word token: x ∼ p(x | z)
Set xji to be the number of times word j was chosen
Learning using hard EM (analog of k-means):
E-step: fix parameters, choose the best topics
M-step: fix topics, optimize the parameters
More sophisticated methods: EM, Latent Dirichlet Allocation
Comparison to a mixture model for clustering:
Mixture model: assume a single topic for the entire document
pLSA: allow multiple topics per document
Principal component analysis (PCA) / Probabilistic PCA 33
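A toy hard-EM loop for pLSA might look as follows. This is an illustrative sketch only, not Hofmann's recipe: the per-cell hard assignment, the smoothing constants, and the random initialization are all my own assumptions.

```python
import numpy as np

def plsa_hard_em(X, k, iters=20, seed=0):
    """Hard-EM sketch for pLSA on a word-count matrix.

    X: W x D matrix of counts (rows = words, columns = documents).
    Returns p(z|doc) as D x k and p(word|z) as k x W.
    """
    rng = np.random.default_rng(seed)
    W, D = X.shape
    p_z_d = rng.dirichlet(np.ones(k), size=D)      # D x k: p(z | doc i)
    p_w_z = rng.dirichlet(np.ones(W), size=k)      # k x W: p(word | topic z)
    for _ in range(iters):
        # Hard E-step: pick the single best topic for each (word, doc) cell
        scores = (np.log(p_z_d + 1e-12)[None, :, :]
                  + np.log(p_w_z + 1e-12).T[:, None, :])   # W x D x k
        z_best = scores.argmax(axis=2)
        # M-step: re-estimate both distributions from the hard counts
        for z in range(k):
            counts = np.where(z_best == z, X, 0.0)
            p_w_z[z] = counts.sum(axis=1) + 1e-3
            p_z_d[:, z] = counts.sum(axis=0) + 1e-3
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Replacing the argmax with posterior-weighted (soft) counts gives the ordinary EM variant mentioned above.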
![Page 133: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/133.jpg)
Roadmap
• Principal component analysis (PCA)
– Basic principles
– Case studies
– Kernel PCA
– Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary
Canonical correlation analysis (CCA) 34
![Page 137: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/137.jpg)
Motivation for CCA [Hotelling, 1936]
Often, each data point consists of two views:
• Image retrieval: for each image, have the following:
– x: Pixels (or other visual features)
– y: Text around the image
• Time series:
– x: Signal at time t
– y: Signal at time t + 1
• Two-view learning: divide features into two sets
– x: Features of a word/object, etc.
– y: Features of the context in which it appears
Goal: reduce the dimensionality of the two views jointly
Canonical correlation analysis (CCA) 35
![Page 141: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/141.jpg)
An example
Setup:
Input data: (x1,y1), . . . , (xn,yn) (matrices X,Y)
Goal: find pair of projections (u,v)
In the figure, x and y are paired by brightness
Dimensionality reduction solutions:
Independent Joint
Canonical correlation analysis (CCA) 36
![Page 145: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/145.jpg)
From PCA to CCA
PCA on views separately: no covariance term
max_{u,v} (u>XX>u)/(u>u) + (v>YY>v)/(v>v)
PCA on concatenation (X>,Y>)>: includes covariance term
max_{u,v} (u>XX>u + 2u>XY>v + v>YY>v)/(u>u + v>v)
Maximum covariance: drop variance terms
max_{u,v} (u>XY>v)/(√(u>u) √(v>v))
Maximum correlation (CCA): divide out variance terms
max_{u,v} (u>XY>v)/(√(u>XX>u) √(v>YY>v))
Canonical correlation analysis (CCA) 37
![Page 152: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/152.jpg)
Canonical correlation analysis (CCA)
Definitions:
Variance: var(u>x) = u>XX>u
Covariance: cov(u>x,v>y) = u>XY>v
Correlation: corr(u>x,v>y) = cov(u>x,v>y)/(√var(u>x) √var(v>y))
Objective: maximize correlation between projected views
max_{u,v} corr(u>x,v>y)
Properties:
• Focus on how variables are related, not how much they vary
• Invariant to any rotation and scaling of data
Solved via a generalized eigenvalue problem (Aw = λBw)
Canonical correlation analysis (CCA) 38
![Page 156: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/156.jpg)
Regularization is important
Extreme examples of degeneracy:
• If x = Ay, then any (u,v) with u = Av is optimal(correlation 1)
• If x and y are independent, then any (u,v) is optimal(correlation 0)
Problem: if X or Y has rank n, then any (u,v) is optimal (correlation 1) with u = X†>Yv ⇒ CCA is meaningless!
Solution: regularization (interpolate between
maximum covariance and maximum correlation)
max_{u,v} (u>XY>v)/(√(u>(XX> + λI)u) √(v>(YY> + λI)v))
Canonical correlation analysis (CCA) 41
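The regularized objective leads to a generalized eigenvalue problem Aw = ρBw, where A holds the cross-covariance blocks XY> and YX>, and B is block-diagonal with the regularized within-view covariances. A NumPy sketch under those assumptions (`cca` is an illustrative name; data points are columns and are assumed centered):

```python
import numpy as np

def cca(X, Y, lam=0.1):
    """First pair of regularized canonical directions.

    X: dx x n and Y: dy x n, columns are paired (centered) data points.
    Returns (u, v) maximizing the regularized correlation objective.
    """
    dx, dy = X.shape[0], Y.shape[0]
    A = np.zeros((dx + dy, dx + dy))
    A[:dx, dx:] = X @ Y.T              # cross-covariance blocks
    A[dx:, :dx] = Y @ X.T
    B = np.zeros_like(A)
    B[:dx, :dx] = X @ X.T + lam * np.eye(dx)
    B[dx:, dx:] = Y @ Y.T + lam * np.eye(dy)
    # Solve A w = rho B w by whitening: eigendecompose B^{-1/2} A B^{-1/2}
    w_b, V_b = np.linalg.eigh(B)
    Bis = V_b @ np.diag(1.0 / np.sqrt(np.maximum(w_b, 1e-12))) @ V_b.T
    rho, W = np.linalg.eigh(Bis @ A @ Bis)
    w = Bis @ W[:, -1]                 # eigenvector with largest correlation
    return w[:dx], w[dx:]
```

Taking λ toward 0 recovers plain CCA (and its degeneracies); larger λ pushes the solution toward maximum covariance, as the slide describes.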
![Page 161: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/161.jpg)
Kernel CCA
Two kernels: kx and ky
Direct method:
(some math)
Modular method:
1. Transform xi into x′i ∈ Rn satisfying
k(xi,xj) = x′i>x′j (do the same for y)
2. Perform regular CCA
Regularization is especially important for kernel CCA!
Canonical correlation analysis (CCA) 42
![Page 162: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/162.jpg)
Roadmap
• Principal component analysis (PCA)
– Basic principles
– Case studies
– Kernel PCA
– Probabilistic PCA
• Canonical correlation analysis (CCA)
• Fisher discriminant analysis (FDA)
• Summary
Fisher discriminant analysis (FDA) 43
![Page 167: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/167.jpg)
Motivation for FDA [Fisher, 1936]
What is the best linear projection with these labels?
PCA solution FDA solution
Goal: reduce the dimensionality given labels
Idea: want projection to maximize overall interclass variancerelative to intraclass variance
Fisher discriminant analysis (FDA) 44
![Page 168: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/168.jpg)
Motivation for FDA [Fisher, 1936]
What is the best linear projection with these labels?
PCA solution FDA solution
Goal: reduce the dimensionality given labels
Idea: want projection to maximize overall interclass variancerelative to intraclass variance
Linear classifiers (logistic regression, SVMs) have similar feel:
Find one-dimensional subspace w,
e.g., to maximize margin between different classes
Fisher discriminant analysis (FDA) 44
![Page 169: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/169.jpg)
Motivation for FDA [Fisher, 1936]
What is the best linear projection with these labels?
(figures: PCA solution vs. FDA solution)
Goal: reduce the dimensionality given labels
Idea: want the projection to maximize interclass variance relative to intraclass variance
Linear classifiers (logistic regression, SVMs) have a similar feel:
they find a one-dimensional subspace w,
e.g., to maximize the margin between different classes
FDA handles multiple classes and allows multiple projection dimensions
Fisher discriminant analysis (FDA) 44
![Page 175: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/175.jpg)
FDA objective function
Setup: x_i ∈ R^d, y_i ∈ {1, ..., m}, for i = 1, ..., n
Objective: maximize (interclass variance)/(intraclass variance) = (total variance)/(intraclass variance) − 1
Total variance: (1/n) Σ_i (u^T (x_i − µ))²
Mean of all points: µ = (1/n) Σ_i x_i
Intraclass variance: (1/n) Σ_i (u^T (x_i − µ_{y_i}))²
Mean of points in class y: µ_y = (1/|{i : y_i = y}|) Σ_{i : y_i = y} x_i
Reduces to a generalized eigenvalue problem.
Kernel FDA: use modular method
Fisher discriminant analysis (FDA) 45
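The generalized eigenvalue problem the slide refers to is S_B u = λ S_W u, where S_B and S_W are the interclass and intraclass scatter matrices. A minimal NumPy sketch (an illustrative implementation, not the lecture's code; `fda_directions` and the `reg` ridge term are assumptions) that solves it by reducing to an ordinary eigenproblem:

```python
import numpy as np

def fda_directions(X, y, k=1, reg=1e-6):
    # Sketch of FDA: maximize u^T S_B u / u^T S_W u by solving the
    # generalized eigenvalue problem S_B u = lambda S_W u.
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))                    # intraclass scatter
    Sb = np.zeros((d, d))                    # interclass scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Reduce to an ordinary eigenproblem; reg keeps S_W invertible.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + reg * np.eye(d), Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:k]]           # top-k projection directions

# Two classes separated along the first axis: the top FDA direction
# should be dominated by that axis.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([5, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
u = fda_directions(X, y)
assert abs(u[0, 0]) > abs(u[1, 0])
```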
![Page 182: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/182.jpg)
Other linear methods
Random projections:
Randomly project the data onto k = O(log n) dimensions
All pairwise distances are preserved with high probability:
‖U^T x_i − U^T x_j‖² ≈ ‖x_i − x_j‖² for all i, j
Trivial to implement
Kernel dimensionality reduction:
One type of sufficient dimensionality reduction
Find a subspace that contains all information about the labels:
y ⊥⊥ x | U^T x
Capturing information is stronger than capturing variance
Hard nonconvex optimization problem
Fisher discriminant analysis (FDA) 47
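The "trivial to implement" claim for random projections is easy to demonstrate: draw U with i.i.d. Gaussian entries scaled by 1/√k and check the pairwise distances directly. A minimal NumPy sketch (the specific n, d, k and the 20% tolerance are illustrative choices, not from the slides):

```python
import numpy as np

def pairwise_dists(A):
    # Euclidean distance matrix via the Gram matrix (avoids n^2 * d memory).
    sq = (A * A).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.clip(D2, 0.0, None))

rng = np.random.default_rng(0)
n, d, k = 50, 5000, 1000
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, k)) / np.sqrt(k)   # random Gaussian projection
Z = X @ U                                      # project into R^k

orig = pairwise_dists(X)
proj = pairwise_dists(Z)
mask = ~np.eye(n, dtype=bool)                  # ignore zero self-distances
ratio = proj[mask] / orig[mask]
# ||U^T x_i - U^T x_j|| is within 20% of ||x_i - x_j|| for all i, j here;
# the Johnson-Lindenstrauss lemma makes this quantitative.
assert np.all(np.abs(ratio - 1.0) < 0.2)
```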
![Page 186: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang](https://reader035.vdocument.in/reader035/viewer/2022062920/5f02847a7e708231d404a811/html5/thumbnails/186.jpg)
Summary
Framework: z = U^T x, x ≈ Uz
Criteria for choosing U:
• PCA: maximize projected variance
• CCA: maximize projected correlation
• FDA: maximize projected (interclass variance)/(intraclass variance)
Algorithm: a generalized eigenvalue problem
Extensions:
non-linear via kernels (within the same linear framework)
probabilistic, sparse, robust (hard optimization)
Fisher discriminant analysis (FDA) 48