
Information Retrieval

Latent Semantic Indexing

Speeding up cosine computation

What if we could take our vectors and “pack” them into fewer dimensions (say from 50,000 down to 100) while preserving distances?

Two methods: “Latent semantic indexing” and random projection.

Two approaches

LSI is data-dependent: create a k-dim subspace by eliminating redundant axes and pulling together “related” axes – hopefully car and automobile.

Random projection is data-independent: choose a k-dim subspace that guarantees, with high probability, good stretching properties between pairs of points.

Notions from linear algebra

Matrix A, vector v. Matrix transpose (A^t). Matrix product. Rank. Eigenvalues λ and eigenvectors v: A v = λ v.
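As a quick refresher, here is a minimal NumPy sketch of these notions; the matrix A and vector v below are made-up toy data, not anything from the slides.

```python
import numpy as np

# Toy symmetric matrix (made-up data, just to illustrate the notions)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

At = A.T                                  # matrix transpose A^t
AAt = A @ At                              # matrix product
rank = np.linalg.matrix_rank(A)           # rank

# Eigenvalues and eigenvectors: A v = lambda v
eigvals, eigvecs = np.linalg.eig(A)
v, lam = eigvecs[:, 0], eigvals[0]

print(rank)                               # 2
print(np.allclose(A @ v, lam * v))        # True: A v = lambda v
```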

Overview of LSI

Pre-process docs using a technique from linear algebra called Singular Value Decomposition

Create a new (smaller) vector space

Queries handled in this new vector space

Example

16 terms × 17 docs

Intuition (contd)

More than dimension reduction: derive a set of new uncorrelated features (roughly, artificial concepts), one per dimension.

Docs with lots of overlapping terms stay together; terms also get pulled together onto the same dimension.

Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts.

Ex.: car and automobile get pulled together, since they co-occur in docs with tires, radiator, cylinder, …

Here comes the “semantic”!!!

Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A. A has rank r ≤ min(m, n).

Define the term-term correlation matrix T = A A^t. T is a square, symmetric m × m matrix. Let P be the m × r matrix of the eigenvectors of T.

Define the doc-doc correlation matrix D = A^t A. D is a square, symmetric n × n matrix. Let R be the n × r matrix of the eigenvectors of D.
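A small NumPy check of these definitions, on a made-up term-doc count matrix: T and D are symmetric, their non-zero eigenvalues coincide (they are the squared singular values of A), and their eigenvector matrices play the roles of P and R.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                                   # 6 terms, 4 docs (toy sizes)
A = rng.integers(0, 3, size=(m, n)).astype(float)   # made-up term-doc counts

T = A @ A.T                                   # term-term correlation, m x m, symmetric
D = A.T @ A                                   # doc-doc correlation,  n x n, symmetric

eigT, P = np.linalg.eigh(T)                   # columns of P: eigenvectors of T
eigD, R = np.linalg.eigh(D)                   # columns of R: eigenvectors of D

# The non-zero eigenvalues of T and D coincide: the squared singular values of A
print(np.allclose(np.sort(eigT)[-n:], np.sort(eigD)))                                   # True
print(np.allclose(np.sort(eigT)[-n:], np.sort(np.linalg.svd(A, compute_uv=False))**2))  # True
```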

A’s decomposition

There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit norm, pairwise zero dot-product).

It turns out that A = P Σ R^t, where Σ is an r × r diagonal matrix containing the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.

A = P Σ R^t
(m × n) = (m × r) (r × r) (r × n)

For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial].

Denote by Σ_k this new version of Σ, having rank k.

Typically k is about 100, while r (A’s rank) is > 10,000

A_k = P Σ_k R^t
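A hedged NumPy sketch of this truncation on a toy matrix: numpy.linalg.svd returns exactly the factors P, Σ (as a vector of singular values, already in decreasing order) and R^t.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 5
A = rng.random((m, n))                        # toy term-doc matrix

# Thin SVD: A = P @ diag(sigma) @ Rt, sigma sorted in decreasing order
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2                                         # keep only the k biggest singular values
sigma_k = sigma.copy()
sigma_k[k:] = 0.0                             # zero out all the others

A_k = P @ np.diag(sigma_k) @ Rt               # Sigma_k gives the rank-k matrix A_k
print(np.linalg.matrix_rank(A_k))             # 2
print(np.allclose(A, A_k))                    # False: A_k only approximates A
```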

Dimensionality reduction

A_k = P Σ_k R^t, with P of size m × r, Σ_k of size r × r, and R^t of size r × n. The columns of P and rows of R^t beyond the k-th are useless, since they hit the 0-columns/0-rows of Σ_k: effectively, A_k reduces to the product of an m × k matrix, a k × k diagonal matrix, and a k × n matrix of reduced documents.
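The same fact in NumPy: because of those zero rows/columns, the product built from the m × k and k × n blocks gives exactly the same A_k (toy data, same setup as above).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))                        # toy term-doc matrix
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2

# Full product with Sigma_k (all but the k biggest singular values zeroed out)
sigma_k = np.where(np.arange(sigma.size) < k, sigma, 0.0)
A_k_full = P @ np.diag(sigma_k) @ Rt

# Economical product: only the first k columns of P and first k rows of Rt survive
A_k_small = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]
print(np.allclose(A_k_full, A_k_small))       # True
```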

Guarantee

A_k is a pretty good approximation to A: relative distances are (approximately) preserved.

Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:

min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2

Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
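These identities are easy to verify numerically on toy data; σ_{k+1}, …, σ_r are the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((8, 5))                        # toy matrix
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# Spectral-norm error equals the first discarded singular value
print(np.isclose(np.linalg.norm(A - A_k, 2), sigma[k]))                       # True

# Squared Frobenius-norm error equals the sum of the discarded squared singular values
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(sigma[k:]**2)))    # True

# ||A||_F^2 equals the sum of all squared singular values
print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(sigma**2)))              # True
```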

Reduction

X_k = Σ_k R^t is the doc matrix reduced to k < n dimensions.

Take the doc-correlation matrix: it is D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t).

Approximating Σ with Σ_k, we thus get A^t A ≈ X_k^t X_k.

We use X_k to approximate A: X_k = Σ_k R^t = P_k^t A. This means that to reduce a doc/query vector it is enough to multiply it by P_k^t (i.e., a k × m matrix).

Cost of sim(q,d), for all d, is O(kn+km) instead of O(mn)

R, P are formed by orthonormal eigenvectors of the matrices D, T.
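A sketch of the whole reduction step in NumPy, on a toy matrix and a toy query: X_k = Σ_k R^t coincides with P_k^t A, and the query is mapped to the concept space by the same k × m matrix P_k^t.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 8, 5, 2
A = rng.random((m, n))                        # toy term-doc matrix (m terms, n docs)
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

P_k = P[:, :k]                                # m x k
X_k = np.diag(sigma[:k]) @ Rt[:k, :]          # k x n reduced doc matrix (= Sigma_k R^t)
print(np.allclose(X_k, P_k.T @ A))            # True: X_k = P_k^t A

# Reduce a (toy) query vector with the same k x m matrix P_k^t: cost O(km)
q = rng.random(m)
q_k = P_k.T @ q

# Cosine similarity of the reduced query against all n reduced docs: cost O(kn)
sims = (q_k @ X_k) / (np.linalg.norm(q_k) * np.linalg.norm(X_k, axis=0))
print(sims.round(3))
```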

Which are the concepts ?

The c-th concept is the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c]; note that its size is m = #terms.

P_k^t[c][i] = strength of association between the c-th concept and the i-th term.

Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
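A hedged sketch of inspecting the concepts; the vocabulary and the term-doc counts below are hypothetical toy data, chosen so that the car/automobile intuition can show up.

```python
import numpy as np

# Hypothetical toy vocabulary and term-doc count matrix (m = 6 terms, n = 4 docs)
terms = ["car", "automobile", "tires", "ship", "boat", "ocean"]
A = np.array([[2, 0, 1, 0],      # car
              [0, 2, 1, 0],      # automobile
              [1, 1, 1, 0],      # tires
              [0, 0, 0, 2],      # ship
              [0, 0, 0, 1],      # boat
              [0, 0, 0, 1]],     # ocean
             dtype=float)

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
Pk_t = P[:, :k].T                             # k x m: row c is the c-th concept

# Pk_t[c][i] = strength of association between concept c and term i
for c in range(k):
    top = np.argsort(-np.abs(Pk_t[c]))[:3]
    print(f"concept {c}: {[terms[i] for i in top]}")

# Projected documents: d'_j = P_k^t d_j (column j holds the concept strengths of doc j)
docs_reduced = Pk_t @ A                       # k x n
print(docs_reduced.round(2))
```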

Information Retrieval

Random Projection

An interesting math result!

Setting v=0 we also get a bound on f(u)’s stretching!!!

What about the cosine-distance ?

f(u)’s, f(v)’s stretching

Defining the projection matrix

R’s columns: k of them.

Concentration bound!!!

Is R a JL-embedding?

Gaussians are good!!

NOTE: every column of R is a unit vector uniformly distributed over the unit sphere; moreover, the k columns of R are orthonormal on average.

A practical-theoretical idea !!!

E[r_{i,j}] = 0, Var[r_{i,j}] = 1
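A minimal sketch of such a projection, assuming i.i.d. standard Gaussian entries (so E[r_{i,j}] = 0 and Var[r_{i,j}] = 1) and the usual 1/√k scaling; the vectors u, v are made up, and the point is only that their distance is approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 50_000, 100                            # original and reduced dimensionality

# Random projection matrix: i.i.d. entries with E[r_ij] = 0, Var[r_ij] = 1
R = rng.standard_normal((m, k))

def project(u: np.ndarray) -> np.ndarray:
    # JL-style map f(u) = R^t u / sqrt(k)
    return (R.T @ u) / np.sqrt(k)

# Two made-up vectors: their distance is approximately preserved with high probability
u, v = rng.random(m), rng.random(m)
d_orig = np.linalg.norm(u - v)
d_proj = np.linalg.norm(project(u) - project(v))
print(d_proj / d_orig)                        # close to 1 with high probability
```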

Question !!

Various theoretical results known. What about practical cases?