Metric and Kernel Learning
Inderjit S. Dhillon, University of Texas at Austin
Cornell University, Oct 30, 2007
Joint work with Jason Davis, Prateek Jain, Brian Kulis and Suvrit Sra
Metric Learning
Goal: Learn Distance Metric between Data
Important problem in Data Mining & Machine Learning
Can govern the success or failure of a data mining algorithm
Metric Learning: Example I
Similarity by Person (identity) or by Pose
Metric Learning: Example II
Consider a set of text documents
Each document is a review of a classical music piece
Might be clusterable in one of two ways
By Composer (Beethoven, Mozart, Mendelssohn)
By Form (Symphony, Sonata, Concerto)
Similarity by Composer or by Form
Mahalanobis Distances
We restrict ourselves to learning Mahalanobis distances:
Distance parameterized by a positive definite matrix Σ:
d_Σ(x_1, x_2) = (x_1 − x_2)^T Σ (x_1 − x_2)
Often Σ is the inverse of the covariance matrix
Generalizes squared Euclidean distance (Σ = I)
Rotates and scales input data
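To make the definition concrete, here is a minimal numpy sketch (not from the talk; names are mine) of the squared Mahalanobis distance, including the classical inverse-covariance choice of Σ:

```python
import numpy as np

def mahalanobis_sq(x1, x2, Sigma):
    """Squared Mahalanobis distance (x1 - x2)^T Sigma (x1 - x2)."""
    diff = x1 - x2
    return diff @ Sigma @ diff

X = np.random.randn(100, 3)            # toy data, one point per row
Sigma = np.linalg.inv(np.cov(X.T))     # classical choice: inverse covariance
x1, x2 = X[0], X[1]
print(mahalanobis_sq(x1, x2, Sigma))
# Sigma = I recovers the squared Euclidean distance:
print(mahalanobis_sq(x1, x2, np.eye(3)), np.sum((x1 - x2) ** 2))
```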
Example: Four Blobs
[Figure: four clusters ("blobs") of points in the plane]
[Figure: the four blobs grouped by horizontal position]
Want to learn:
Σ = (1 0; 0 0), a metric that ignores the vertical coordinate
[Figure: the four blobs grouped by vertical position]
Want to learn:
Σ = (0 0; 0 1), a metric that ignores the horizontal coordinate
Problem Formulation
Metric Learning Goal:
min_Σ loss(Σ, Σ_0)
s.t. (x_i − x_j)^T Σ (x_i − x_j) ≤ u if (i, j) ∈ S [similarity constraints]
     (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ if (i, j) ∈ D [dissimilarity constraints]
Learn an SPD (symmetric positive definite) matrix Σ that is close to the baseline SPD matrix Σ_0
Other linear constraints on Σ are possible
Constraints can arise from various scenarios:
Unsupervised: click-through feedback
Semi-supervised: must-link and cannot-link constraints
Supervised: points in the same class have small distance, etc.
QUESTION: What should the loss be?
LogDet Divergence
We choose loss(Σ, Σ_0) to be the Log-Determinant (LogDet) divergence:
D_ld(Σ, Σ_0) = tr(Σ Σ_0^{-1}) − log det(Σ Σ_0^{-1}) − d
Our Goal:
min_Σ D_ld(Σ, Σ_0)
s.t. (x_i − x_j)^T Σ (x_i − x_j) ≤ u if (i, j) ∈ S [similarity constraints]
     (x_i − x_j)^T Σ (x_i − x_j) ≥ ℓ if (i, j) ∈ D [dissimilarity constraints]
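As a sketch (my notation, numpy for illustration), the divergence can be evaluated directly from this definition:

```python
import numpy as np

def logdet_div(Sigma, Sigma0):
    """D_ld(Sigma, Sigma0) = tr(Sigma Sigma0^{-1}) - log det(Sigma Sigma0^{-1}) - d."""
    d = Sigma.shape[0]
    M = Sigma @ np.linalg.inv(Sigma0)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

print(logdet_div(np.eye(3), np.eye(3)))      # 0.0: zero iff the arguments agree
print(logdet_div(np.eye(3), 2 * np.eye(3)))  # > 0 for distinct PD matrices
```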
Preview
Salient points of our Approach:
Metric Learning is equivalent to Kernel Learning
Generalizes to Unseen Data Points
Can improve upon an input metric or kernel
No expensive eigenvector computation or semi-definite programming
Most existing methods fail to satisfy one or more of the above
Brief Digression
Bregman Divergences
Let φ: S → R be a differentiable, strictly convex function of Legendre type (S ⊆ R^d)
The Bregman divergence D_φ: S × relint(S) → R is defined as
D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y)
Example: φ(z) = ½ z^T z gives D_φ(x, y) = ½ ‖x − y‖²
Squared Euclidean distance is a Bregman divergence
Example: φ(z) = z log z gives D_φ(x, y) = x log(x/y) − x + y
Relative entropy (or KL-divergence) is another Bregman divergence
Example: φ(z) = −log z gives D_φ(x, y) = x/y − log(x/y) − 1
The Itakura-Saito distance (used in signal processing) is also a Bregman divergence
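The three examples above can be checked numerically from the definition alone; a small sketch (function names are mine):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

x = np.array([0.3, 0.5])
y = np.array([0.4, 0.2])

# phi(z) = 1/2 z^T z  ->  squared Euclidean distance
print(bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y),
      0.5 * np.sum((x - y) ** 2))

# phi(z) = sum z log z  ->  relative entropy (KL-divergence)
print(bregman(lambda z: np.sum(z * np.log(z)),
              lambda z: np.log(z) + 1, x, y),
      np.sum(x * np.log(x / y) - x + y))

# phi(z) = -sum log z  ->  Itakura-Saito distance
print(bregman(lambda z: -np.sum(np.log(z)),
              lambda z: -1.0 / z, x, y),
      np.sum(x / y - np.log(x / y) - 1))
```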
Examples of Bregman Divergences
Function Name      φ(x)                           D_φ(x, y)
Squared norm       ½ x²                           ½ (x − y)²
Shannon entropy    x log x − x                    x log(x/y) − x + y
Bit entropy        x log x + (1 − x) log(1 − x)   x log(x/y) + (1 − x) log((1 − x)/(1 − y))
Burg entropy       −log x                         x/y − log(x/y) − 1
Hellinger          −(1 − x²)^{1/2}                (1 − xy)(1 − y²)^{−1/2} − (1 − x²)^{1/2}
ℓ_p quasi-norm     −x^p (0 < p < 1)               −x^p + p x y^{p−1} − (p − 1) y^p
Bregman Matrix Divergences
Define
D_φ(X, Y) = φ(X) − φ(Y) − tr((∇φ(Y))^T (X − Y))
Squared Frobenius norm: φ(X) = ½ ‖X‖²_F. Then
D_φ(X, Y) = ½ ‖X − Y‖²_F
von Neumann divergence: For X ≻ 0, φ(X) = tr(X log X). Then
D_φ(X, Y) = tr(X log X − X log Y − X + Y)
Also called the quantum relative entropy
LogDet divergence: For X ≻ 0, φ(X) = −log det X. Then
D_ld(X, Y) = tr(XY^{-1}) − log det(XY^{-1}) − d
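A sketch of all three matrix divergences (assuming scipy is available for the matrix logarithm; names are mine):

```python
import numpy as np
from scipy.linalg import logm

def frobenius_div(X, Y):
    # phi(X) = 1/2 ||X||_F^2
    return 0.5 * np.linalg.norm(X - Y, 'fro') ** 2

def von_neumann_div(X, Y):
    # phi(X) = tr(X log X); X, Y positive definite
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    # phi(X) = -log det X
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

# sanity check on random positive definite matrices
A = np.random.randn(4, 4); X = A @ A.T + np.eye(4)
B = np.random.randn(4, 4); Y = B @ B.T + np.eye(4)
for div in (frobenius_div, von_neumann_div, logdet_div):
    assert div(X, X) < 1e-8 and div(X, Y) > 0   # zero iff equal, else positive
```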
LogDet Divergence: Properties I
D_ld(X, Y) = tr(XY^{-1}) − log det(XY^{-1}) − d
Properties:
Not symmetric
Triangle inequality does not hold
Can be unbounded
Convex in the first argument (not in the second)
Pythagorean property holds: D_ld(X, Y) ≥ D_ld(X, P_C(Y)) + D_ld(P_C(Y), Y), where P_C(Y) is the Bregman projection of Y onto a convex set C containing X
Divergence between inverses: D_ld(X, Y) = D_ld(Y^{-1}, X^{-1})
LogDet Divergence: Properties II
For eigendecompositions X = VΛV^T and Y = UΘU^T (eigenvalues λ_i, θ_j; eigenvectors v_i, u_j):
D_ld(X, Y) = tr(XY^{-1}) − log det(XY^{-1}) − d
           = Σ_{i=1}^d Σ_{j=1}^d (v_i^T u_j)² (λ_i/θ_j − log(λ_i/θ_j) − 1)
Properties:
Scale-invariance: D_ld(αX, αY) = D_ld(X, Y), α > 0
In fact, for any invertible M:
D_ld(X, Y) = D_ld(M^T X M, M^T Y M)
The definition can be extended to rank-deficient matrices via the eigendecomposition form, with the sums running over the r nonzero eigenvalues
Finiteness: D_ld(X, Y) is finite iff X and Y have the same range space
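These invariance properties are easy to confirm numerically; a quick sketch (random SPD test matrices, my helper names):

```python
import numpy as np

def logdet_div(X, Y):
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

A = np.random.randn(3, 3); X = A @ A.T + np.eye(3)
B = np.random.randn(3, 3); Y = B @ B.T + np.eye(3)
M = np.random.randn(3, 3) + 3 * np.eye(3)   # (almost surely) invertible

base = logdet_div(X, Y)
assert abs(logdet_div(2.5 * X, 2.5 * Y) - base) < 1e-6          # scale-invariance
assert abs(logdet_div(M.T @ X @ M, M.T @ Y @ M) - base) < 1e-6  # congruence-invariance
assert abs(logdet_div(np.linalg.inv(Y), np.linalg.inv(X)) - base) < 1e-6  # inverses
assert abs(logdet_div(X, Y) - logdet_div(Y, X)) > 1e-6          # not symmetric (a.s.)
```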
Information-Theoretic Interpretation
Differential relative entropy between two multivariate Gaussians with the same mean equals the LogDet divergence (here N(x; μ, Σ) denotes the Gaussian whose Mahalanobis matrix, i.e., inverse covariance, is Σ):
∫ N(x; μ, Σ_0) log [N(x; μ, Σ_0) / N(x; μ, Σ)] dx = ½ D_ld(Σ, Σ_0)
Thus, the following two problems are equivalent:
Relative Entropy Formulation:
min_Σ ∫ N(x; μ, Σ_0) log [N(x; μ, Σ_0) / N(x; μ, Σ)] dx
s.t. tr(Σ (x_i − x_j)(x_i − x_j)^T) ≤ u for (i, j) ∈ S
     tr(Σ (x_i − x_j)(x_i − x_j)^T) ≥ ℓ for (i, j) ∈ D
     Σ ⪰ 0
LogDet Formulation:
min_Σ D_ld(Σ, Σ_0)
subject to the same constraints
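A numeric sanity check of this equivalence (assuming, as noted above, that Σ plays the role of the precision matrix; the closed-form Gaussian KL is standard):

```python
import numpy as np

def logdet_div(X, Y):
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d

def gaussian_kl(C0, C1):
    """KL( N(mu, C0) || N(mu, C1) ) for covariances C0, C1 (same mean)."""
    d = C0.shape[0]
    M = np.linalg.inv(C1) @ C0
    return 0.5 * (np.trace(M) - d - np.linalg.slogdet(M)[1])

# Mahalanobis matrices act as precision (inverse covariance) matrices
A = np.random.randn(3, 3); Sigma  = A @ A.T + np.eye(3)
B = np.random.randn(3, 3); Sigma0 = B @ B.T + np.eye(3)

kl = gaussian_kl(np.linalg.inv(Sigma0), np.linalg.inv(Sigma))
assert abs(kl - 0.5 * logdet_div(Sigma, Sigma0)) < 1e-6
```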
Stein's Loss
The LogDet divergence is known as Stein's loss in the statistics community
Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator
Quasi-Newton Optimization
The LogDet divergence arises in the BFGS and DFP updates
Quasi-Newton methods approximate the Hessian of the function to be minimized
[Fletcher, 1991]: the BFGS update can be written as
min_B D_ld(B, B_t)
subject to B s_t = y_t (secant equation)
where s_t = x_{t+1} − x_t and y_t = ∇f_{t+1} − ∇f_t
Closed-form solution:
B_{t+1} = B_t − (B_t s_t s_t^T B_t)/(s_t^T B_t s_t) + (y_t y_t^T)/(s_t^T y_t)
A similar form holds for the DFP update
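A minimal sketch of this closed-form update (my names); the one-line check confirms the secant equation holds after the update:

```python
import numpy as np

def bfgs_update(B, s, y):
    """B_{t+1} = B - (B s s^T B)/(s^T B s) + (y y^T)/(s^T y)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

B = np.eye(3)
s = np.random.randn(3)
y = s + 0.1 * np.random.randn(3)   # assume the curvature condition s^T y > 0
B_next = bfgs_update(B, s, y)
assert np.allclose(B_next @ s, y)  # secant equation: B_{t+1} s_t = y_t
```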
Algorithm: Bregman Projections for LogDet
Algorithm: cyclic Bregman projections (project successively onto each linear constraint); converges to the globally optimal solution
Use Bregman projections to update the Mahalanobis matrix:
min_Σ D_ld(Σ, Σ_t)
s.t. (x_i − x_j)^T Σ (x_i − x_j) ≤ u
Can be solved by a rank-one update:
Σ_{t+1} = Σ_t + β_t Σ_t (x_i − x_j)(x_i − x_j)^T Σ_t
Advantages:
Automatic enforcement of positive semidefiniteness
Simple, closed-form projections
No eigenvector calculation
Easy to incorporate slack for each constraint
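A sketch of a single projection step, assuming one unslacked constraint; the closed-form β below follows from applying Sherman-Morrison to Σ^{-1} + λ vv^T and is my derivation for illustration, not code from the talk:

```python
import numpy as np

def logdet_projection(Sigma, v, bound, upper=True):
    """Exact LogDet Bregman projection of Sigma onto the single constraint
    v^T Sigma v <= bound (upper=True) or v^T Sigma v >= bound (upper=False),
    where v = x_i - x_j. No-op if the constraint already holds."""
    p = v @ Sigma @ v
    if (p <= bound) if upper else (p >= bound):
        return Sigma
    beta = (bound - p) / (p * p)             # closed-form multiplier
    Sv = Sigma @ v
    return Sigma + beta * np.outer(Sv, Sv)   # rank-one update, stays PSD
```

Cycling this projection over the constraints in S and D, until convergence, is exactly the Bregman-projection loop described above.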
Example: Noisy XOR
[Figure: noisy XOR data, four blobs in the plane with diagonally opposite blobs sharing a class]
No linear transformation achieves the XOR grouping
Kernel Methods
Map input data to a higher-dimensional feature space:
x → φ(x)
Idea: run the machine learning algorithm in feature space
Noisy XOR example: x → (x_1², √2 x_1 x_2, x_2²)^T
Kernel function: K(x, y) = ⟨φ(x), φ(y)⟩
Kernel trick: no need to explicitly form the high-dimensional features
Noisy XOR example: ⟨φ(x), φ(y)⟩ = (x^T y)²
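A one-assert check that the explicit XOR feature map realizes this kernel (a sketch; φ written out by hand):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the noisy XOR example."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.random.randn(2)
y = np.random.randn(2)
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)   # <phi(x), phi(y)> = (x^T y)^2
```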
Connection to Kernel Learning
LogDet Formulation (1):
min_Σ D_ld(Σ, Σ_0)
s.t. tr(Σ (x_i − x_j)(x_i − x_j)^T) ≤ u, (i, j) ∈ S
     tr(Σ (x_i − x_j)(x_i − x_j)^T) ≥ ℓ, (i, j) ∈ D
     Σ ⪰ 0
Kernel Formulation (2):
min_K D_ld(K, K_0)
s.t. tr(K (e_i − e_j)(e_i − e_j)^T) ≤ u, (i, j) ∈ S
     tr(K (e_i − e_j)(e_i − e_j)^T) ≥ ℓ, (i, j) ∈ D
     K ⪰ 0
(1) optimizes over the d × d Mahalanobis matrix Σ; (2) optimizes over the N × N kernel matrix K
Let K_0 = X^T Σ_0 X, where X is the input data matrix
Let Σ* be the optimal solution to (1) and K* the optimal solution to (2)
Theorem: K* = X^T Σ* X
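The key identity behind the equivalence is that, with K = X^T Σ X, each linear constraint takes the same value in both problems; a toy numeric check (my notation):

```python
import numpy as np

n, d = 10, 4
X = np.random.randn(d, n)                    # data as columns, so K0 = X^T Sigma0 X
A = np.random.randn(d, d); Sigma = A @ A.T   # any PSD Mahalanobis matrix
K = X.T @ Sigma @ X                          # corresponding kernel matrix

i, j = 2, 7
e = np.zeros(n); e[i] = 1; e[j] = -1         # e_i - e_j
v = X[:, i] - X[:, j]                        # x_i - x_j
assert np.isclose(v @ Sigma @ v, e @ K @ e)  # identical constraint values
```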
Kernelization
Metric learning in kernel space:
Assume an input kernel function κ(x, y) = φ(x)^T φ(y)
Want to learn
d_Σ(φ(x), φ(y)) = (φ(x) − φ(y))^T Σ (φ(x) − φ(y))
Equivalently: learn a new kernel function of the form
κ̃(x, y) = φ(x)^T Σ φ(y)
How to learn this using only κ(x, y)?
The learned kernel can be shown to be of the form
κ̃(x, y) = κ(x, y) + Σ_i Σ_j σ_ij κ(x, x_i) κ(y, x_j)
Can update the σ_ij parameters while optimizing the kernel formulation
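A sketch of evaluating a kernel of this form on new points x, y, given only the base kernel κ and a coefficient matrix S = (σ_ij); the random S here is a stand-in for coefficients produced by the optimization, not a learned model:

```python
import numpy as np

def learned_kernel(kappa, X_train, S, x, y):
    """kappa_new(x, y) = kappa(x, y) + sum_ij S[i, j] kappa(x, x_i) kappa(y, x_j)."""
    kx = np.array([kappa(x, xi) for xi in X_train])
    ky = np.array([kappa(y, xj) for xj in X_train])
    return kappa(x, y) + kx @ S @ ky

kappa = lambda a, b: (a @ b) ** 2            # quadratic base kernel, as above
X_train = list(np.random.randn(5, 2))
S = np.random.randn(5, 5); S = S + S.T       # symmetric placeholder coefficients
print(learned_kernel(kappa, X_train, S, np.random.randn(2), np.random.randn(2)))
```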
Related Work
Distance Metric Learning [Xing, Ng, Jordan & Russell, 2002]
Large-Margin Nearest Neighbor (LMNN) [Weinberger, Blitzer & Saul, 2005]
Collapsing Classes (MCML) [Globerson & Roweis, 2005]
Online Metric Learning (POLA) [Shalev-Shwartz, Singer & Ng, 2004]
Many others!
Experimental Results
Framework:
k-nearest neighbor (k = 4)
ℓ and u determined by the 5th and 95th percentiles of the distance distribution
20c² constraints (c = number of classes), chosen randomly
2-fold cross-validation
Algorithms:
Information-Theoretic Metric Learning (ITML; offline and online)
Large-Margin Nearest Neighbors (LMNN) [Weinberger et al.]
Metric Learning by Collapsing Classes (MCML) [Globerson and Roweis]
Baseline metrics: Euclidean and inverse covariance
Results: UCI Data Sets
Ran ITML with Σ_0 = I (ITML-MaxEnt) and with Σ_0 set to the inverse covariance (Inverse Covariance)
Ran the online algorithm for 10^5 iterations
[Figure: k-NN error bars (0.0 to 0.4) on Wine, Ionosphere, Scale, Iris, and Soybean for ITML-MaxEnt, ITML-Online, Euclidean, Inverse Covariance, MCML, and LMNN]
Application 1: Clarify
Clarify improves error reporting for software that uses black-box components
Motivation: black-box components complicate error messaging
Solution: error diagnosis via machine learning
Representation: the system collects program features during run time:
Function counts
Call-site counts
Counts of program paths
A program execution is represented as a vector of counts
Class labels: program execution errors
Nearest-neighbor software support:
Match program executions with others
The underlying distance measure should reflect this similarity
Results: Clarify
Very high dimensionality; feature selection reduces the number of features to 20
[Figure: k-NN error bars (0.0 to 0.3) on Latex, Mpg321, Foxpro, and Iptables for ITML-MaxEnt, ITML-Inverse Covariance, Euclidean, Inverse Covariance, MCML, and LMNN]
Application 2: Learning Image Similarity
Goal: learn a metric to compare images
Start with a baseline measure:
Use the pyramid match kernel (PMK) [Grauman and Darrell]
Compares sets of image features
An efficient and effective measure of similarity between images
An application of metric learning in kernel space:
Other metric learning methods (LMNN, etc.) cannot be applied
Does metric learning work in kernel space?
Caltech 101 Results
Data Set: Caltech 101
Standard benchmark for multi-category image recognition
101 classes of images
Wide variance in pose, etc.
Challenging data set
Experimental setup:
15 images per class in the training set; the rest in the test set (2454 images)
Performed 1-NN using the original PMK and the learned PMK
[Figure: k-NN accuracy (0.35 to 0.55) vs. number of constraint points (700 to 1500); the learned PMK curve lies above the original PMK curve]
When constraints are drawn from all training data, k-NN accuracy is 52%, versus 32% for the original PMK [Jain, Kulis & Grauman, 2007]
The data set is well studied; the best reported performance with 15 training images per class is 60%
That method uses different features (geometric blur)
Metric Learning in High Dimensions
Text analysis & software analysis: feature sets larger than 1,000
Learning a full distance matrix requires over 1 million parameters!
Overfitting problems; intractable
Solution: learn low-rank Mahalanobis matrices
The LogDet divergence can be generalized to low-rank matrices:
D_ld(X, Y) is finite iff X and Y have the same range space
Extending ITML to the low-rank case:
If Σ_0 is low-rank, the learned Σ is low-rank
Clustered Mahalanobis matrices:
If Σ_0 is a block matrix, then the learned Σ will also be a block matrix
Low-rank Metric Learning: Preliminary Results
Classification accuracy for Low-Rank ITML
Classic3: Text dataset, PCA basis
Mpg321, Gcc: Software analysis, Clustered basis
[Figure: three panels (Classic3, Mpg321, Gcc) plotting classification accuracy vs. rank for LowRank ITML, LMNN, MCML, and ITML: roughly 0.5 to 0.9 accuracy over ranks 2 to 10 on Classic3, 0.75 to 0.85 over ranks 4 to 12 on Mpg321, and 0.7 to 0.85 over ranks 10 to 50 on Gcc]
Conclusions
Metric Learning Formulation
Uses the LogDet divergence
Information-theoretic interpretation
Equivalent to a kernel learning problem
Can improve upon an input metric or kernel
Generalizes to unseen data points
Algorithm
Bregman projections result in simple rank-one updates
Can be kernelized
Online variant has provable regret bounds
Empirical Evaluation
Method is competitive with existing techniques
Scalable to large data sets
Applied to nearest-neighbor software support & image recognition
References
J. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-Theoretic Metric Learning. International Conference on Machine Learning (ICML), pages 209-216, June 2007.
B. Kulis, M. Sustik, and I. S. Dhillon. Learning Low-Rank Kernel Matrices. International Conference on Machine Learning (ICML), pages 505-512, July 2006 (longer version submitted to JMLR).