Machine Learning for Textual Information Access: Results from the SMART project
Nicola Cancedda, Xerox Research Centre Europe
First Forum for Information Retrieval Evaluation
Kolkata, India, December 12th-14th, 2008
• Statistical Multilingual Analysis for Retrieval and Translation (SMART)
• Information Society Technologies Programme
• Sixth Framework Programme, “Specific Targeted Research Project” (STReP)
• Start date: October 1, 2006
• Duration: 3 years
• Objective: bring Machine Learning researchers to work on Machine Translation and CLIR
The SMART Project
The SMART Consortium
Premise and Outline
• Two classes of methods for CLIR investigated in SMART
  – Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR
  – Latent semantic methods based on Canonical Correlation Analysis
• Initial plan (reflected in abstract): to present both
  – ...but it would take too long, so:
• Outline:
  – (Longish) introduction to state of the art in Canonical Correlation Analysis
  – A number of advances obtained by the SMART project
• For lexicon adaptation methods: check out deliverable D5.1 from the project website!
Background: Canonical Correlation Analysis
Canonical Correlation Analysis
Abstract view:
• Word-vector representations of documents (or queries, or any other text span) are only superficial manifestations of a deeper vector representation based on concepts.
  – Since they cannot be observed directly, these concepts are latent
• If two spans are the translation of one another, their deep representation in terms of concepts is the same.
• Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation?
  – CCA:
    • Assume the mapping from deep to superficial representation is linear
    • Estimate the mapping from empirical data
Five documents in the world of concepts
[Figure: five documents plotted in the latent concept space, with axes c_1 and c_2]

Z = [z_1; z_2; z_3; z_4; z_5]
The same five documents in two languages
[Figure: the same five documents plotted in the two surface spaces, with axes (e_1, e_2) and (f_1, f_2) respectively]

X = [x_1; x_2; x_3; x_4; x_5],  x_i ∈ ℝ^{n_x}        Y = [y_1; y_2; y_3; y_4; y_5],  y_i ∈ ℝ^{n_y}
Finding the first Canonical Variates
[Figure: the five documents in both surface spaces, together with their projections onto the first pair of canonical directions]

(w_x^1, w_y^1) = argmax_{w_x, w_y}  E[w_x' x y' w_y] / √( E[w_x' x x' w_x] · E[w_y' y y' w_y] )
Finding the first Canonical Variates
Find the two directions, one for each language, such that the projections of the documents are maximally correlated. Assuming the data matrices X and Y are (row-wise) centered:

(w_x^1, w_y^1) = argmax_{w_x, w_y}  w_x' X Y' w_y / √( w_x' X X' w_x · w_y' Y Y' w_y )

• Numerator: maximal covariance, to work back the rotation (c_1 expressed in the bases of X and Y respectively)
• Denominator: normalization by the variances, to adjust for “stretched” dimensions
Finding the first Canonical Variates
Find the two directions, one for each language, such that the projections of the documents are maximally correlated:

(w_x^1, w_y^1) = argmax_{w_x, w_y}  w_x' X Y' w_y
s.t.  w_x' X X' w_x = 1,   w_y' Y Y' w_y = 1

This turns out to be equivalent to finding the largest eigen-pair of a Generalized Eigenvalue Problem (GEP):

[ 0      C_xy ] [w_x]       [ C_xx   0    ] [w_x]
[ C_yx   0    ] [w_y]  = λ  [ 0      C_yy ] [w_y]        (1)

where C_xx = X X', C_yy = Y Y', C_xy = X Y', C_yx = Y X'.

Complexity: O((n_x + n_y)^3)
Finding further Canonical Variates
Assume we have already found i−1 pairs of Canonical Variates. Finding the next pair turns out to be equivalent to finding the other eigen-pairs of the same GEP:

(w_x^i, w_y^i) = argmax_{w_x, w_y}  w_x' X Y' w_y
s.t.  w_x^i' X X' w_x^i = 1,   w_y^i' Y Y' w_y^i = 1
      w_x^i' X X' w_x^j = 0,   w_y^i' Y Y' w_y^j = 0,   ∀ j < i
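As an aside, the GEP formulation above is straightforward to solve numerically. The following is a minimal numpy/scipy sketch on toy data (the data sizes and the small ridge added for numerical stability are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Toy setup mirroring the slides: five documents generated from a shared
# 2-d latent "concept" representation, observed in two surface spaces.
Z = rng.normal(size=(2, 5))              # latent concepts (columns = documents)
X = rng.normal(size=(4, 2)) @ Z          # first language,  n_x = 4
Y = rng.normal(size=(3, 2)) @ Z          # second language, n_y = 3
X = X - X.mean(axis=1, keepdims=True)    # row-wise centering
Y = Y - Y.mean(axis=1, keepdims=True)

nx, ny = X.shape[0], Y.shape[0]
Cxx, Cyy, Cxy = X @ X.T, Y @ Y.T, X @ Y.T

# GEP from the slide: [0 Cxy; Cyx 0] w = lambda [Cxx 0; 0 Cyy] w
A = np.block([[np.zeros((nx, nx)), Cxy],
              [Cxy.T, np.zeros((ny, ny))]])
B = np.block([[Cxx, np.zeros((nx, ny))],
              [np.zeros((ny, nx)), Cyy]])
B += 1e-6 * np.eye(nx + ny)              # tiny ridge: toy covariances are rank-deficient

vals, vecs = eigh(A, B)                  # generalized symmetric eigensolver
w = vecs[:, np.argmax(vals)]             # largest eigen-pair = first canonical variates
wx, wy = w[:nx], w[nx:]

# The two projections are (near-)perfectly correlated, since both views
# are exact linear images of the same latent representation.
corr = np.corrcoef(wx @ X, wy @ Y)[0, 1]
print(abs(corr))
```

The largest eigenvalue is the first canonical correlation, and the corresponding eigenvector concatenates w_x and w_y.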
Examples from the Hansard Corpus
Kernel CCA
• Cubic complexity in the number of dimensions soon becomes intractable, especially with text
• Also, it may be better to use similarity measures other than the inner product of (possibly weighted) document vectors
Kernel CCA: move from the primal to the dual formulation, since it can be proved that w_x^i (resp. w_y^i) lies in the span of the columns of X (resp. Y).
Kernel CCA
The computation is again done by solving a GEP:

[ 0         K_x K_y ] [β_x]       [ K_x^2   0     ] [β_x]
[ K_y K_x   0       ] [β_y]  = λ  [ 0       K_y^2 ] [β_y]        (1)

Complexity: O(m^3)

(β_x^i, β_y^i) = argmax_{β_x, β_y}  β_x' K_x K_y β_y
s.t.  β_x' K_x^2 β_x = 1,   β_y' K_y^2 β_y = 1
      β_x^i' K_x^2 β_x^j = 0,   β_y^i' K_y^2 β_y^j = 0,   ∀ j < i
Overfitting
Problem: if m ≤ n_x and m ≤ n_y then there are (infinitely many) trivial solutions with perfect correlation: OVERFITTING.

[Figure: e.g. two (centered) points in ℝ^2]

Given an arbitrary direction in the first space, we can find one with perfect correlation in the second:

∀ β_x s.t. β_x' K_x^2 β_x = 1, set β_y = K_y^{-1} K_x β_x.

Then  β_y' K_y^2 β_y = β_x' K_x K_y^{-1} K_y^2 K_y^{-1} K_x β_x = β_x' K_x^2 β_x = 1     (unit variances)
and   β_y' K_y K_x β_x = β_x' K_x K_y^{-1} K_y K_x β_x = β_x' K_x^2 β_x = 1             (unit covariance)

Perfect correlation... no matter what the direction!
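The trivial construction above is easy to verify numerically. A small sketch, with two completely unrelated random "views" (the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# m documents, feature dimensions >> m, and the two "languages" are
# independent noise: there is no real correlation to find.
m = 10
X = rng.normal(size=(m, 200))            # rows = documents
Y = rng.normal(size=(m, 300))
X = X - X.mean(axis=0)                   # center the features
Y = Y - Y.mean(axis=0)
Kx, Ky = X @ X.T, Y @ Y.T                # linear kernels, m x m

beta_x = rng.normal(size=m)              # an arbitrary direction
# beta_y = Ky^{-1} Kx beta_x from the slide; pinv handles the rank
# deficiency (m - 1) introduced by centering.
beta_y = np.linalg.pinv(Ky) @ (Kx @ beta_x)

# Projections of the documents onto the two dual directions:
px, py = Kx @ beta_x, Ky @ beta_y
corr = np.corrcoef(px, py)[0, 1]
print(corr)  # 1.0 up to rounding: perfect correlation from pure noise
```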
Regularized Kernel CCA
We can regularize the objective function by trading correlation against a good account of variance in the two spaces:

K~ = (1 − κ) K + κ I,   κ ∈ [0, 1]

(β_x^i, β_y^i) = argmax_{β_x, β_y}  β_x' K_x K_y β_y
s.t.  β_x' K~_x K_x β_x = 1,   β_y' K~_y K_y β_y = 1
      β_x^i' K~_x K_x β_x^j = 0,   β_y^i' K~_y K_y β_y^j = 0,   ∀ j < i

[ 0         K_x K_y ] [β_x]       [ K~_x K_x   0         ] [β_x]
[ K_y K_x   0       ] [β_y]  = λ  [ 0          K~_y K_y  ] [β_y]
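The regularized GEP can be sketched in a few lines of numpy/scipy. The paired data, the value of κ, and the use of linear kernels are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Paired "documents" sharing a 2-d latent signal, with feature dimension
# larger than m, so the unregularized problem would overfit as on the
# previous slide.
m = 60
z = rng.normal(size=(m, 2))
X = np.hstack([z, rng.normal(size=(m, 70))])
Y = np.hstack([z @ rng.normal(size=(2, 2)), rng.normal(size=(m, 80))])
Kx, Ky = X @ X.T, Y @ Y.T                # linear kernels; any kernel works here

kappa = 0.1                              # K~ = (1 - kappa) K + kappa I
Kx_t = (1 - kappa) * Kx + kappa * np.eye(m)
Ky_t = (1 - kappa) * Ky + kappa * np.eye(m)

# Regularized GEP from the slide:
#   [0 KxKy; KyKx 0] b = lambda [K~xKx 0; 0 K~yKy] b
A = np.block([[np.zeros((m, m)), Kx @ Ky],
              [Ky @ Kx, np.zeros((m, m))]])
B = np.block([[Kx_t @ Kx, np.zeros((m, m))],
              [np.zeros((m, m)), Ky_t @ Ky]])
vals, vecs = eigh(A, B)

top = np.argmax(vals)
bx, by = vecs[:m, top], vecs[m:, top]    # first pair of dual canonical variates
corr = np.corrcoef(Kx @ bx, Ky @ by)[0, 1]
print(f"correlation of first canonical variates: {corr:.3f}")
```

In practice the kernels are usually normalized before mixing in κI, so that the regularizer is on a comparable scale.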
Multiview CCA
(K)CCA can take advantage of the “mutual information” between two languages...
[Figure: the five documents in the latent concept space and in two surface spaces]
...but what if we have more than two? Can we benefit from multiple views? Also known as Generalised CCA.
Multiview CCA
There are many possible ways to combine pairwise correlations between views (e.g. sum, product, min, ...).
Chosen approach: SUMCOR [Horst-61]. With a slightly different regularization than above, this is:

(β_1^i, ..., β_k^i) = argmax_{β_1, ..., β_k}  Σ_{p<q} β_p' K_p K_q β_q
s.t.  β_p' K~_p^2 β_p = 1,   ∀ p
      β_p' K~_p^2 β_p^j = 0,   ∀ j < i

[ A_{1,1}  ...  A_{1,k} ] [β_1]     [ λ_1 β_1 ]
[   ...    ...    ...   ] [ ... ] = [   ...   ]
[ A_{k,1}  ...  A_{k,k} ] [β_k]     [ λ_k β_k ]

a Multivariate Eigenvalue Problem (MEP)
Multiview CCA
• Multivariate Eigenvalue Problems (MEPs) are much harder to solve than GEPs:
  – [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding the set of first canonical variates only
  – Naïve implementations would be quadratic in the number of documents, and scale up to no more than a few thousand documents
Innovations from SMART
• Extensions of the Horst algorithm [Rupnik and Shawe-Taylor]
  – Efficient implementation, linear in the number of documents
  – Version for finding many sets of canonical variates
• New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor]
• Sparse KCCA [Hussain and Shawe-Taylor]
Efficient implementation of the Horst algorithm
The Horst algorithm starts with a random set of vectors (β_{1,0}, ..., β_{k,0}), then iteratively multiplies by the MEP matrix and renormalizes until convergence:

[β_{1,t+1}]     [ A_{1,1}  ...  A_{1,k} ] [β_{1,t}]
[   ...   ]  =  [   ...    ...    ...   ] [  ...   ]
[β_{k,t+1}]     [ A_{k,1}  ...  A_{k,k} ] [β_{k,t}]

Inner loop: k^2 matrix-vector multiplications, each O(m^2).
Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop.
Extension (2): exploiting the sparseness of the document vectors, one can replace each (vector) multiplication with a kernel matrix (O(m^2)) with two multiplications with the document matrix (O(ms) each, where s is the maximum number of non-zero components in the document vectors). Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical linear system resolutions.
The inner loop can thus be made O(kms) instead of O(k^2 m^2).
Extended Horst algorithm for finding many sets of canonical variates
The Horst algorithm only finds the first set of k canonical variates.
Extension (3): maintain projection matrices P_i^t that, at each iteration, project the β_{i,t} onto the subspace orthogonal to all previous canonical variates for space i.
Finding d sets of canonical variates can be done in O(d^2 mks). This scales up!
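The flavor of the iteration, and of the O(ms) trick from extension (2), can be conveyed in a short numpy sketch. This is a simplified version (identity diagonal blocks, plain Euclidean renormalization instead of the exact regularized SUMCOR constraints, synthetic data), not the project's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# k views of m documents sharing a 1-d latent signal. D[p] plays the role
# of the (in practice sparse) document matrix of view p; sizes and signal
# strength are illustrative assumptions.
m, k = 200, 3
z = 3.0 * rng.normal(size=(m, 1))
D = [np.hstack([z, rng.normal(size=(m, 20))]) for _ in range(k)]

def K_times(p, v):
    # Extension (2): compute K_p v = D_p (D_p' v) as two O(ms) products
    # with the document matrix, instead of one O(m^2) product with a
    # precomputed kernel matrix.
    return D[p] @ (D[p].T @ v)

# Horst-style power iteration for the first set of canonical variates.
beta = [rng.normal(size=m) for _ in range(k)]
for _ in range(100):
    new = []
    for p in range(k):
        v = beta[p].copy()                       # diagonal (identity) block
        for q in range(k):
            if q != p:                           # off-diagonal blocks K_p K_q
                v = v + K_times(p, K_times(q, beta[q]))
        new.append(v / np.linalg.norm(v))
    beta = new

# Projections of the documents end up correlated across every pair of views.
proj = [K_times(p, beta[p]) for p in range(k)]
corrs = [np.corrcoef(proj[p], proj[q])[0, 1]
         for p in range(k) for q in range(p + 1, k)]
print([f"{c:.2f}" for c in corrs])
```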
MCCA: Experiments
• Experiments: mate retrieval with Europarl
• 10 languages
• 100,000 10-way aligned sentences for training
• 7,873 10-way aligned sentences for testing
• Document vectors: uni-, bi- and tri-grams (~200k features for each language); TF·IDF weighting and length normalization
• MCCA used to extract d = 100-dimensional subspaces
• Baseline alternatives for selecting the new basis:
  – k-means clustering centroids on concatenated multilingual document vectors
  – CL-LSI, i.e. LSI on concatenated vectors
Some example latent vectors
MCCA experiment results
Measure: recall in Top 10, averaged over 9 languages

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.7486    0.9129   0.9883
SP                 0.7450    0.9131   0.9855
GE                 0.5927    0.8545   0.9778
IT                 0.7448    0.9022   0.9836
DU                 0.7136    0.9021   0.9835
DA                 0.5357    0.8540   0.9874
SW                 0.5312    0.8623   0.9880
PT                 0.7511    0.9000   0.9874
FR                 0.7334    0.9116   0.9888
FI                 0.4402    0.7737   0.9830
MCCA experiment results
More realistic experiment: pseudo-queries formed with the top 5 TF·IDF-scoring components in each sentence

“Query” Language   K-means   CL-LSI   MCCA
EN                 0.1319    0.2348   0.4413
SP                 0.1258    0.2226   0.4109
GE                 0.1333    0.2492   0.4158
IT                 0.1330    0.2343   0.4373
DU                 0.1339    0.2408   0.4369
DA                 0.1376    0.2517   0.4232
SW                 0.1376    0.2499   0.4038
PT                 0.1274    0.2187   0.4075
FR                 0.1300    0.2262   0.3931
FI                 0.1340    0.2490   0.4179
Extension (4): Regression - CCA
Given a query q in one language, find the target-language vector w which is maximally correlated to it:

w* = argmax_w  q' X Y' w
s.t.  (1/2) w' ((1 − κ) Y Y' + κ I) w = 1

Solution:  w* = ((1 − κ) Y Y' + κ I)^{-1} (Y X' q)

Given this “query translation” we can then find the closest target documents using the standard cosine measure.
Promising initial results on the CLEF/GIRT dataset: better than standard CCA, but it cannot take the thesaurus into account, so MAP is still not competitive with the best systems.
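A minimal numpy sketch of this closed-form "query translation" on synthetic paired data (the matrices, κ, and the mate-retrieval check are illustrative assumptions, not the CLEF/GIRT setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired training documents in two languages (columns = documents),
# sharing a latent representation.
m, d = 100, 20
Z = rng.normal(size=(d, m))
X = rng.normal(size=(50, d)) @ Z + 0.1 * rng.normal(size=(50, m))
Y = rng.normal(size=(40, d)) @ Z + 0.1 * rng.normal(size=(40, m))

kappa = 0.1
q = X[:, 0]                                   # a query in the source language

# Closed-form solution from the slide:
#   w* = ((1 - kappa) Y Y' + kappa I)^{-1} (Y X' q)
w = np.linalg.solve((1 - kappa) * Y @ Y.T + kappa * np.eye(Y.shape[0]),
                    Y @ X.T @ q)

# Rank target-language documents by cosine similarity to the translated
# query; the query's mate (column 0 of Y) should land near the top.
cos = (Y.T @ w) / (np.linalg.norm(Y, axis=0) * np.linalg.norm(w))
rank_of_mate = int(np.argsort(-cos).tolist().index(0))
print(rank_of_mate)
```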
Extension (5): Sparse - KCCA
• Seeking sparsity in the dual solution: the first canonical variates are expressed as linear combinations of only relatively few documents
  – Improved efficiency
  – Alternative regularization
• The same set of indices i is used for both languages
Sparse - KCCA
For a fixed set of indices i:

(β_x, β_y) = argmax_{β_x, β_y}  β_x' K_x[i,:] K_y[:,i] β_y
s.t.  β_x' K_x^2[i,i] β_x = 1,   β_y' K_y^2[i,i] β_y = 1

[ 0           K_xy[i,i] ] [β_x]       [ K_x^2[i,i]   0          ] [β_x]
[ K_yx[i,i]   0         ] [β_y]  = λ  [ 0            K_y^2[i,i] ] [β_y]
But how do we select i ?
Sparse – KCCA: Algorithms
Algorithm 1
1. Initialize
2. For i = 1 to d do: deflate the kernel matrices
3. End for
4. Solve the GEP for index set i

Algorithm 2
• Set i to the index of the top d values of
• Solve the GEP for index set i

Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space.
Sparse – KCCA: Mate retrieval experiments
Europarl, English-Spanish

            Train (sec.)   Test (sec.)
KCCA        24693          27733
SKCCA (1)   5242           698
SKCCA (2)   1873           695
SMART - Website
Project presentation and deliverables: http://www.smart-project.eu
• D5.1 on lexicon-based methods
• D5.2 on CCA
SMART - Dissemination and Exploitation
Platforms for showcasing developed tools:
Thank you!
Shameless plug
Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.
References
[Hardoon and Shawe-Taylor] David Hardoon and John Shawe-Taylor, “Sparse CCA for Bilingual Word Generation”, 20th Mini-EURO Conference of the Continuous Optimization and Knowledge Based Technologies, Neringa, Lithuania, 2008.
[Hussain and Shawe-Taylor] Zakria Hussain and John Shawe-Taylor, “Theory of Matching Pursuit”, Neural Information Processing Systems (NIPS), Vancouver, BC, 2008.
[Rupnik and Shawe-Taylor] Jan Rupnik and John Shawe-Taylor, contribution to SMART deliverable D5.2, “Multilingual Latent Language-Independent Analysis Applied to CLTIA Tasks” (http://www.smart-project.eu/files/D52.pdf).
Self-introduction
• Natural Language Generation
• Grammar Learning
• Text Categorization
• Machine Learning (kernels for text)
• (Statistical) Machine Translation (ca. 2004)