Machine Learning for Textual Information Access: Results from the SMART project
Nicola Cancedda, Xerox Research Centre Europe
First Forum for Information Retrieval Evaluation
Kolkata, India, December 12th-14th, 2008
• Statistical Multilingual Analysis for Retrieval and Translation (SMART)
• Information Society Technologies Programme
• Sixth Framework Programme, “Specific Targeted Research Project” (STReP)
• Start date: October 1, 2006
• Duration: 3 years
• Objective: bring Machine Learning researchers to work on Machine Translation and CLIR
The SMART Project
The SMART Consortium
Premise and Outline
• Two classes of methods for CLIR investigated in SMART
  – Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR
  – Latent semantic methods based on Canonical Correlation Analysis
• Initial plan (reflected in abstract): to present both
  – ...but it would take too long, so:
• Outline:
  – (Longish) introduction to state of the art in Canonical Correlation Analysis
  – A number of advances obtained by the SMART project
• For lexicon adaptation methods: check out deliverable D5.1 from the project website!
Background: Canonical Correlation Analysis
Canonical Correlation Analysis
Abstract view:
• Word-vector representations of documents (or queries, or any other text span) are only superficial manifestations of a deeper vector representation based on concepts.
  – Since they cannot be observed directly, these concepts are latent
• If two spans are the translation of one another, their deep representation in terms of concepts is the same.
• Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation?
  – CCA:
    • Assume the mapping from deep to superficial representation is linear
    • Estimate the mapping from empirical data
Five documents in the world of concepts
[Figure: five documents plotted in the latent concept space, with axes c_1 and c_2]

Z = [z_1; z_2; z_3; z_4; z_5]
The same five documents in two languages
[Figure: the same five documents plotted in the two surface spaces, with axes (e_1, e_2) and (f_1, f_2) respectively]

X = [x_1; x_2; x_3; x_4; x_5],  x_i ∈ ℝ^{n_x}        Y = [y_1; y_2; y_3; y_4; y_5],  y_i ∈ ℝ^{n_y}
Finding the first Canonical Variates
[Figure: the five documents in both surface spaces, together with their projections onto the first pair of canonical directions]

(w_x^1, w_y^1) = argmax_{w_x, w_y}  E[w_x' x y' w_y] / √( E[w_x' x x' w_x] · E[w_y' y y' w_y] )
Finding the first Canonical Variates
Find the two directions, one for each language, such that the projections of the documents are maximally correlated. Assuming the data matrices X and Y are (row-wise) centered:

(w_x^1, w_y^1) = argmax_{w_x, w_y}  w_x' X Y' w_y / √( w_x' X X' w_x · w_y' Y Y' w_y )

• Numerator: maximal covariance, to work back the rotation (c_1 expressed in the bases of X and Y respectively)
• Denominator: normalization by the variances, to adjust for “stretched” dimensions
Finding the first Canonical Variates
Find the two directions, one for each language, such that the projections of the documents are maximally correlated:

(w_x^1, w_y^1) = argmax_{w_x, w_y}  w_x' X Y' w_y
s.t.  w_x' X X' w_x = 1,   w_y' Y Y' w_y = 1

This turns out to be equivalent to finding the largest eigen-pair of a Generalized Eigenvalue Problem (GEP):

[ 0      C_xy ] [w_x]       [ C_xx   0    ] [w_x]
[ C_yx   0    ] [w_y]  = λ  [ 0      C_yy ] [w_y]        (1)

where C_xx = X X', C_yy = Y Y', C_xy = X Y', C_yx = Y X'.

Complexity: O((n_x + n_y)^3)
Finding further Canonical Variates
Assume we have already found i−1 pairs of Canonical Variates. Finding the next pair turns out to be equivalent to finding the other eigen-pairs of the same GEP:

(w_x^i, w_y^i) = argmax_{w_x, w_y}  w_x' X Y' w_y
s.t.  w_x^i' X X' w_x^i = 1,   w_y^i' Y Y' w_y^i = 1
      w_x^i' X X' w_x^j = 0,   w_y^i' Y Y' w_y^j = 0,   ∀ j < i
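As an aside, the GEP formulation above is straightforward to solve numerically. The following is a minimal numpy/scipy sketch on toy data (the data sizes and the small ridge added for numerical stability are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Toy setup mirroring the slides: five documents generated from a shared
# 2-d latent "concept" representation, observed in two surface spaces.
Z = rng.normal(size=(2, 5))              # latent concepts (columns = documents)
X = rng.normal(size=(4, 2)) @ Z          # first language,  n_x = 4
Y = rng.normal(size=(3, 2)) @ Z          # second language, n_y = 3
X = X - X.mean(axis=1, keepdims=True)    # row-wise centering
Y = Y - Y.mean(axis=1, keepdims=True)

nx, ny = X.shape[0], Y.shape[0]
Cxx, Cyy, Cxy = X @ X.T, Y @ Y.T, X @ Y.T

# GEP from the slide: [0 Cxy; Cyx 0] w = lambda [Cxx 0; 0 Cyy] w
A = np.block([[np.zeros((nx, nx)), Cxy],
              [Cxy.T, np.zeros((ny, ny))]])
B = np.block([[Cxx, np.zeros((nx, ny))],
              [np.zeros((ny, nx)), Cyy]])
B += 1e-6 * np.eye(nx + ny)              # tiny ridge: toy covariances are rank-deficient

vals, vecs = eigh(A, B)                  # generalized symmetric eigensolver
w = vecs[:, np.argmax(vals)]             # largest eigen-pair = first canonical variates
wx, wy = w[:nx], w[nx:]

# The two projections are (near-)perfectly correlated, since both views
# are exact linear images of the same latent representation.
corr = np.corrcoef(wx @ X, wy @ Y)[0, 1]
print(abs(corr))
```

The largest eigenvalue is the first canonical correlation, and the corresponding eigenvector concatenates w_x and w_y.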
Examples from the Hansard Corpus
Kernel CCA
• Cubic complexity in the number of dimensions soon becomes intractable, especially with text
• Also, it may be better to use similarity measures other than the inner product of (possibly weighted) document vectors
Kernel CCA: move from the primal to the dual formulation, since it can be proved that w_x^i (resp. w_y^i) lies in the span of the columns of X (resp. Y).
Kernel CCA
The computation is again done by solving a GEP:

[ 0         K_x K_y ] [β_x]       [ K_x^2   0     ] [β_x]
[ K_y K_x   0       ] [β_y]  = λ  [ 0       K_y^2 ] [β_y]        (1)

Complexity: O(m^3)

(β_x^i, β_y^i) = argmax_{β_x, β_y}  β_x' K_x K_y β_y
s.t.  β_x' K_x^2 β_x = 1,   β_y' K_y^2 β_y = 1
      β_x^i' K_x^2 β_x^j = 0,   β_y^i' K_y^2 β_y^j = 0,   ∀ j < i
Overfitting
Problem: if m ≤ n_x and m ≤ n_y then there are (infinitely many) trivial solutions with perfect correlation: OVERFITTING.

[Figure: e.g. two (centered) points in ℝ^2]

Given an arbitrary direction in the first space, we can find one with perfect correlation in the second:

∀ β_x s.t. β_x' K_x^2 β_x = 1, set β_y = K_y^{-1} K_x β_x.

Then  β_y' K_y^2 β_y = β_x' K_x K_y^{-1} K_y^2 K_y^{-1} K_x β_x = β_x' K_x^2 β_x = 1     (unit variances)
and   β_y' K_y K_x β_x = β_x' K_x K_y^{-1} K_y K_x β_x = β_x' K_x^2 β_x = 1             (unit covariance)

Perfect correlation... no matter what the direction!
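The trivial construction above is easy to verify numerically. A small sketch, with two completely unrelated random "views" (the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# m documents, feature dimensions >> m, and the two "languages" are
# independent noise: there is no real correlation to find.
m = 10
X = rng.normal(size=(m, 200))            # rows = documents
Y = rng.normal(size=(m, 300))
X = X - X.mean(axis=0)                   # center the features
Y = Y - Y.mean(axis=0)
Kx, Ky = X @ X.T, Y @ Y.T                # linear kernels, m x m

beta_x = rng.normal(size=m)              # an arbitrary direction
# beta_y = Ky^{-1} Kx beta_x from the slide; pinv handles the rank
# deficiency (m - 1) introduced by centering.
beta_y = np.linalg.pinv(Ky) @ (Kx @ beta_x)

# Projections of the documents onto the two dual directions:
px, py = Kx @ beta_x, Ky @ beta_y
corr = np.corrcoef(px, py)[0, 1]
print(corr)  # 1.0 up to rounding: perfect correlation from pure noise
```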
Regularized Kernel CCA
We can regularize the objective function by trading correlation against a good account of variance in the two spaces:

K~ = (1 − κ) K + κ I,   κ ∈ [0, 1]

(β_x^i, β_y^i) = argmax_{β_x, β_y}  β_x' K_x K_y β_y
s.t.  β_x' K~_x K_x β_x = 1,   β_y' K~_y K_y β_y = 1
      β_x^i' K~_x K_x β_x^j = 0,   β_y^i' K~_y K_y β_y^j = 0,   ∀ j < i

[ 0         K_x K_y ] [β_x]       [ K~_x K_x   0         ] [β_x]
[ K_y K_x   0       ] [β_y]  = λ  [ 0          K~_y K_y  ] [β_y]
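The regularized GEP can be sketched in a few lines of numpy/scipy. The paired data, the value of κ, and the use of linear kernels are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Paired "documents" sharing a 2-d latent signal, with feature dimension
# larger than m, so the unregularized problem would overfit as on the
# previous slide.
m = 60
z = rng.normal(size=(m, 2))
X = np.hstack([z, rng.normal(size=(m, 70))])
Y = np.hstack([z @ rng.normal(size=(2, 2)), rng.normal(size=(m, 80))])
Kx, Ky = X @ X.T, Y @ Y.T                # linear kernels; any kernel works here

kappa = 0.1                              # K~ = (1 - kappa) K + kappa I
Kx_t = (1 - kappa) * Kx + kappa * np.eye(m)
Ky_t = (1 - kappa) * Ky + kappa * np.eye(m)

# Regularized GEP from the slide:
#   [0 KxKy; KyKx 0] b = lambda [K~xKx 0; 0 K~yKy] b
A = np.block([[np.zeros((m, m)), Kx @ Ky],
              [Ky @ Kx, np.zeros((m, m))]])
B = np.block([[Kx_t @ Kx, np.zeros((m, m))],
              [np.zeros((m, m)), Ky_t @ Ky]])
vals, vecs = eigh(A, B)

top = np.argmax(vals)
bx, by = vecs[:m, top], vecs[m:, top]    # first pair of dual canonical variates
corr = np.corrcoef(Kx @ bx, Ky @ by)[0, 1]
print(f"correlation of first canonical variates: {corr:.3f}")
```

In practice the kernels are usually normalized before mixing in κI, so that the regularizer is on a comparable scale.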
Multiview CCA
(K)CCA can take advantage of the “mutual information” between two languages...
[Figure: the five documents in the latent concept space and in two surface spaces]
...but what if we have more than two? Can we benefit from multiple views? Also known as Generalised CCA.
Multiview CCA
There are many possible ways to combine pairwise correlations between views (e.g. sum, product, min, ...).
Chosen approach: SUMCOR [Horst-61]. With a slightly different regularization than above, this is:

(β_1^i, ..., β_k^i) = argmax_{β_1, ..., β_k}  Σ_{p<q} β_p' K_p K_q β_q
s.t.  β_p' K~_p^2 β_p = 1,   ∀ p
      β_p' K~_p^2 β_p^j = 0,   ∀ j < i

[ A_{1,1}  ...  A_{1,k} ] [β_1]     [ λ_1 β_1 ]
[   ...    ...    ...   ] [ ... ] = [   ...   ]
[ A_{k,1}  ...  A_{k,k} ] [β_k]     [ λ_k β_k ]

a Multivariate Eigenvalue Problem (MEP)
Multiview CCA
• Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs:
  – [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding the set of first canonical variates only
  – Naïve implementations would be quadratic in the number of documents, and scale up to no more than a few thousand documents
Innovations from SMART
• Extensions of the Horst algorithm [Rupnik and Shawe-Taylor]
  – Efficient implementation, linear in the number of documents
  – Version for finding many sets of canonical variates
• New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor]
• Sparse KCCA [Hussain and Shawe-Taylor]
Efficient implementation of the Horst algorithm
The Horst algorithm starts with a random set of vectors (β_{1,0}, ..., β_{k,0}), then iteratively multiplies by the MEP matrix and renormalizes until convergence:

[β_{1,t+1}]     [ A_{1,1}  ...  A_{1,k} ] [β_{1,t}]
[   ...   ]  =  [   ...    ...    ...   ] [  ...   ]
[β_{k,t+1}]     [ A_{k,1}  ...  A_{k,k} ] [β_{k,t}]

Inner loop: k^2 matrix-vector multiplications, each O(m^2).
Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop.
Extension (2): exploiting the sparseness of the document vectors, one can replace each (vector) multiplication with a kernel matrix (O(m^2)) with two multiplications with the document matrix (O(ms) each, where s is the maximum number of non-zero components in the document vectors). Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear system resolutions.
The inner loop can thus be made O(kms) instead of O(k^2 m^2).
Extended Horst algorithm for finding many sets of canonical variates
The Horst algorithm only finds the first set of k canonical variates.
Extension (3): maintain projection matrices P_i^t that, at each iteration, project the β_{i,t} onto the subspace orthogonal to all previous canonical variates for space i.
Finding d sets of canonical variates can be done in O(d^2 mks). This scales up!
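The flavor of the iteration, and of the O(ms) trick from extension (2), can be conveyed in a short numpy sketch. This is a simplified version (identity diagonal blocks, plain Euclidean renormalization instead of the exact regularized SUMCOR constraints, synthetic data), not the project's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# k views of m documents sharing a 1-d latent signal. D[p] plays the role
# of the (in practice sparse) document matrix of view p; sizes and signal
# strength are illustrative assumptions.
m, k = 200, 3
z = 3.0 * rng.normal(size=(m, 1))
D = [np.hstack([z, rng.normal(size=(m, 20))]) for _ in range(k)]

def K_times(p, v):
    # Extension (2): compute K_p v = D_p (D_p' v) as two O(ms) products
    # with the document matrix, instead of one O(m^2) product with a
    # precomputed kernel matrix.
    return D[p] @ (D[p].T @ v)

# Horst-style power iteration for the first set of canonical variates.
beta = [rng.normal(size=m) for _ in range(k)]
for _ in range(100):
    new = []
    for p in range(k):
        v = beta[p].copy()                       # diagonal (identity) block
        for q in range(k):
            if q != p:                           # off-diagonal blocks K_p K_q
                v = v + K_times(p, K_times(q, beta[q]))
        new.append(v / np.linalg.norm(v))
    beta = new

# Projections of the documents end up correlated across every pair of views.
proj = [K_times(p, beta[p]) for p in range(k)]
corrs = [np.corrcoef(proj[p], proj[q])[0, 1]
         for p in range(k) for q in range(p + 1, k)]
print([f"{c:.2f}" for c in corrs])
```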
MCCA: Experiments
• Experiments: mate retrieval with Europarl
• 10 languages
• 100,000 10-way aligned sentences for training
• 7,873 10-way aligned sentences for testing
• Document vectors: uni-, bi- and tri-grams (~200k features for each language); TF·IDF weighting and length normalization
• MCCA used to extract d = 100-dimensional subspaces
• Baseline alternatives for selecting the new basis:
  – k-means clustering centroids on concatenated multilingual document vectors
  – CL-LSI, i.e. LSI on concatenated vectors
Some example latent vectors
MCCA experiment results
Measure: recall in Top 10, averaged over 9 languages

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.7486    0.9129   0.9883
SP                 0.7450    0.9131   0.9855
GE                 0.5927    0.8545   0.9778
IT                 0.7448    0.9022   0.9836
DU                 0.7136    0.9021   0.9835
DA                 0.5357    0.8540   0.9874
SW                 0.5312    0.8623   0.9880
PT                 0.7511    0.9000   0.9874
FR                 0.7334    0.9116   0.9888
FI                 0.4402    0.7737   0.9830
MCCA experiment results
More realistic experiment: pseudo-queries formed with the top 5 TF·IDF-scoring components in each sentence

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.1319    0.2348   0.4413
SP                 0.1258    0.2226   0.4109
GE                 0.1333    0.2492   0.4158
IT                 0.1330    0.2343   0.4373
DU                 0.1339    0.2408   0.4369
DA                 0.1376    0.2517   0.4232
SW                 0.1376    0.2499   0.4038
PT                 0.1274    0.2187   0.4075
FR                 0.1300    0.2262   0.3931
FI                 0.1340    0.2490   0.4179
Extension (4): Regression - CCA
Given a query q in one language, find the target-language vector w which is maximally correlated to it:

w* = argmax_w  q' X Y' w
s.t.  (1/2) w' ((1 − κ) Y Y' + κ I) w = 1

Solution:  w* = ((1 − κ) Y Y' + κ I)^{-1} (Y X' q)

Given this “query translation” we can then find the closest target documents using the standard cosine measure.
Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but it cannot take the thesaurus into account, so MAP is still not competitive with the best systems.
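A minimal numpy sketch of this closed-form "query translation" on synthetic paired data (the matrices, κ, and the mate-retrieval check are illustrative assumptions, not the CLEF/GIRT setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired training documents in two languages (columns = documents),
# sharing a latent representation.
m, d = 100, 20
Z = rng.normal(size=(d, m))
X = rng.normal(size=(50, d)) @ Z + 0.1 * rng.normal(size=(50, m))
Y = rng.normal(size=(40, d)) @ Z + 0.1 * rng.normal(size=(40, m))

kappa = 0.1
q = X[:, 0]                                   # a query in the source language

# Closed-form solution from the slide:
#   w* = ((1 - kappa) Y Y' + kappa I)^{-1} (Y X' q)
w = np.linalg.solve((1 - kappa) * Y @ Y.T + kappa * np.eye(Y.shape[0]),
                    Y @ X.T @ q)

# Rank target-language documents by cosine similarity to the translated
# query; the query's mate (column 0 of Y) should land near the top.
cos = (Y.T @ w) / (np.linalg.norm(Y, axis=0) * np.linalg.norm(w))
rank_of_mate = int(np.argsort(-cos).tolist().index(0))
print(rank_of_mate)
```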
Extension (5): Sparse - KCCA
• Seeking sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents
  – Improved efficiency
  – Alternative regularization
• The same set of indices i is used for both languages
Sparse - KCCA
For a fixed set of indices i:

(β_x, β_y) = argmax_{β_x, β_y}  β_x' K_x[i,:] K_y[:,i] β_y
s.t.  β_x' K_x^2[i,i] β_x = 1,   β_y' K_y^2[i,i] β_y = 1

[ 0           K_xy[i,i] ] [β_x]       [ K_x^2[i,i]   0          ] [β_x]
[ K_yx[i,i]   0         ] [β_y]  = λ  [ 0            K_y^2[i,i] ] [β_y]
But how do we select i ?
Sparse – KCCA: Algorithms
Algorithm 1
1. Initialize
2. For i = 1 to d do: deflate the kernel matrices
3. End for
4. Solve the GEP for index set i

Algorithm 2
• Set i to the index of the top d values of
• Solve the GEP for index set i

Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space.
Sparse – KCCA: Mate retrieval experiments
Europarl, English-Spanish

            Train (sec.)   Test (sec.)
KCCA        24693          27733
SKCCA (1)   5242           698
SKCCA (2)   1873           695
SMART - Website
Project presentation and deliverables: http://www.smart-project.eu
• D5.1 on lexicon-based methods
• D5.2 on CCA
SMART - Dissemination and Exploitation
Platforms for showcasing developed tools:
Thank you!
Shameless plug
Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.
References
[Hardoon and Shawe-Taylor] David Hardoon and John Shawe-Taylor, “Sparse CCA for Bilingual Word Generation”, 20th Mini-EURO Conference of the Continuous Optimization and Knowledge Based Technologies, Neringa, Lithuania, 2008.
[Hussain and Shawe-Taylor] Zakria Hussain and John Shawe-Taylor, “Theory of Matching Pursuit”, Neural Information Processing Systems (NIPS), Vancouver, BC, 2008.
[Rupnik and Shawe-Taylor] Jan Rupnik and John Shawe-Taylor, contribution to SMART deliverable D5.2, “Multilingual Latent Language-Independent Analysis Applied to CLTIA Tasks” (http://www.smart-project.eu/files/D52.pdf).
Self-introduction
• Natural Language Generation
• Grammar Learning
• Text Categorization
• Machine Learning (kernels for text)
• (Statistical) Machine Translation (ca. 2004)