
Page 1:

Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 6

May 7, 2006

http://www.ee.technion.ac.il/courses/049011

Page 2:

Principal Eigenvector Computation

E: n × n matrix; |λ1| > |λ2| ≥ |λ3| ≥ … ≥ |λn|: the eigenvalues of E

Suppose λ1 > 0. v1,…,vn: the corresponding eigenvectors

The eigenvectors form a basis. Suppose ||v1||2 = 1.

Input: the matrix E and a unit vector u that is not in span(v2,…,vn)

Goal: compute λ1 and v1

Page 3:

The Power Method
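The algorithm itself appeared as a figure on the original slide. Below is a minimal sketch of power iteration in Python (NumPy), assuming E is an n × n array and u is the starting unit vector from the previous slide; the parameter num_iters is a hypothetical stopping rule.

    import numpy as np

    def power_method(E, u, num_iters=100):
        """Estimate the principal eigenvalue/eigenvector of E by repeated multiplication.

        E: n x n matrix; u: starting unit vector not in span(v2,...,vn).
        """
        w = u / np.linalg.norm(u)
        for _ in range(num_iters):
            w = E @ w
            w = w / np.linalg.norm(w)     # re-normalize after every step
        lam = w @ (E @ w)                 # Rayleigh quotient estimates lambda_1
        return lam, w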

Page 4:

Why Does It Work?

Theorem: As t → ∞, the iterate w converges to ±v1.

Intuition: writing u = c1 v1 + … + cn vn gives E^t u = c1 λ1^t v1 + … + cn λn^t vn = λ1^t (c1 v1 + Σi≥2 ci (λi/λ1)^t vi); since |λi/λ1| < 1 for i ≥ 2, after normalization every component other than v1 vanishes.

• Convergence rate: proportional to (|λ2|/|λ1|)^t.

• The larger the “spectral gap” |λ1| − |λ2|, the faster the convergence.

Page 5:

Spectral Methods in Information Retrieval

Page 6:

Outline

Motivation: synonymy and polysemy

Latent Semantic Indexing (LSI)

Singular Value Decomposition (SVD)

LSI via SVD

Why does LSI work?

HITS and SVD

Page 7:

Synonymy and Polysemy

Synonymy: multiple terms with (almost) the same meaning. Ex: cars, autos, vehicles. Harms recall.

Polysemy: a term with multiple meanings. Ex: java (programming language, coffee, island). Harms precision.

Page 8:

Traditional Solutions

Query expansion

Synonymy: OR on all synonyms. Manual/automatic use of thesauri. Too few synonyms: recall still low. Too many synonyms: harms precision.

Polysemy: AND on the term and additional specializing terms. Ex: +java +”programming language”. Too broad terms: precision still low. Too narrow terms: harms recall.

Page 9:

Syntactic Indexing

D: document collection, |D| = n

T: term space, |T| = m

A[t,d]: “weight” of term t in document d (e.g., TF-IDF)

A^T A: pairwise document similarities

A A^T: pairwise term similarities

[Figure: A is an m × n matrix whose rows are indexed by terms and whose columns are indexed by documents]
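As a concrete illustration of the syntactic index, here is a toy example in Python; the corpus and the raw-count weighting below are stand-in assumptions (TF-IDF would refine the weights).

    import numpy as np

    # Toy corpus: columns of A are documents, rows are terms.
    docs = ["cars and autos", "java programming language", "java coffee island"]
    terms = sorted({t for d in docs for t in d.split()})

    # A[t, d] = raw count of term t in document d.
    A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

    doc_similarities  = A.T @ A   # n x n: entry (i, j) compares documents i and j
    term_similarities = A @ A.T   # m x m: entry (i, j) compares terms i and j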

Page 10:

Latent Semantic Indexing (LSI) [Deerwester et al. 1990]

C: concept space, |C| = r

Documents and queries: “mixtures” of concepts

Given a query, find the most similar documents

Bridges the syntax-semantics gap

[Figure: B is an r × n matrix whose rows are indexed by concepts and whose columns are indexed by documents]

Page 11:

Fourier Transform

Time domain: the signal is written as a weighted sum of base signals, e.g. signal = 3 × (base signal 1) + 1.1 × (base signal 2).

Frequency domain: the same signal is represented by its coefficients over the base signals (here 3 and 1.1).

Compact discrete representation. Effective for noise removal.

Page 12:

Latent Semantic Indexing

Documents, queries ~ signals: vectors in R^m

Concepts ~ base signals: an orthonormal basis for span(columns of A)

Semantic indexing of a document ~ Fourier transform of a signal: the representation of the document in the concept basis

Advantages: space-efficient, better handling of synonymy and polysemy, removal of “noise”

Page 13:

Open Questions

How to choose the concept basis?

How to transform the syntactic index into a semantic index?

How to filter out “noisy concepts”?

Page 14:

Singular Values

A: m × n real matrix

Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv. u and v are called singular vectors.

Ex: σ = ||A||2 = max over ||x||2 = 1 of ||Ax||2. Corresponding singular vectors: the x that maximizes ||Ax||2 and y = Ax / ||A||2.

Note: A^T A v = σ^2 v and A A^T u = σ^2 u, so σ^2 is an eigenvalue of both A^T A and A A^T; v is an eigenvector of A^T A and u is an eigenvector of A A^T.
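A quick numerical sanity check of these identities, using NumPy's built-in SVD on a small random matrix (an illustrative sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    # The largest singular value equals the spectral norm ||A||_2.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    sigma, u, v = s[0], U[:, 0], Vt[0, :]

    assert np.isclose(sigma, np.linalg.norm(A, 2))
    assert np.allclose(A @ v, sigma * u)               # Av = sigma * u
    assert np.allclose(A.T @ u, sigma * v)             # A^T u = sigma * v
    assert np.allclose(A.T @ A @ v, sigma**2 * v)      # v is an eigenvector of A^T A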

Page 15:

Singular Value Decomposition (SVD)

Theorem: For every m × n real matrix A, there exists a singular value decomposition:

A = U Σ V^T

σ1 ≥ … ≥ σr > 0 (r = rank(A)): the singular values of A

Σ = Diag(σ1,…,σr)

U: column-orthonormal m × r matrix (U^T U = I)

V: column-orthonormal n × r matrix (V^T V = I)

[Figure: A = U × Σ × V^T, with A of size m × n, U of size m × r, Σ of size r × r, and V^T of size r × n]
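A minimal sketch of the decomposition in NumPy (the reduced/"economy" form, so the shapes match the m × r and n × r statement above); the random matrix is just an assumed example:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 4))

    # Reduced SVD: U is m x r, Sigma is r x r, Vt is r x n.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Sigma = np.diag(s)

    assert np.allclose(A, U @ Sigma @ Vt)                  # A = U Sigma V^T
    assert np.allclose(U.T @ U, np.eye(U.shape[1]))        # columns of U are orthonormal
    assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))     # columns of V are orthonormal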

Page 16:

Singular Values vs. Eigenvalues

A = U Σ V^T

σ1,…,σr: the singular values of A

σ1^2,…,σr^2: the non-zero eigenvalues of A^T A and A A^T

u1,…,ur: the columns of U; an orthonormal basis for span(columns of A); the left singular vectors of A; eigenvectors of A A^T

v1,…,vr: the columns of V; an orthonormal basis for span(rows of A); the right singular vectors of A; eigenvectors of A^T A

Page 17:

LSI as SVD

A = U Σ V^T, so U^T A = Σ V^T

u1,…,ur: the concept basis

B = Σ V^T: the LSI matrix (semantic index)

Ad: the d-th column of A; Bd: the d-th column of B

Bd = U^T Ad, so Bd[c] = uc^T Ad
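A sketch of building the semantic index from the SVD, assuming A is the NumPy term-document matrix from earlier; the helper name lsi_index is hypothetical:

    import numpy as np

    def lsi_index(A):
        """Return the semantic index B = Sigma @ Vt = U^T A for a term-document matrix A."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        B = np.diag(s) @ Vt      # r x n: column d is the concept representation of document d
        # Equivalently: B = U.T @ A
        return U, s, B

    # One common way to handle a query q (a vector in term space), not spelled out on the
    # slides, is to map it into concept space the same way: U.T @ q.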

Page 18:

Noisy Concepts

B = U^T A = Σ V^T

Bd[c] = σc · vc[d]

If σc is small, then Bd[c] is small for all d.

k = the largest i s.t. σi is “large”

For all c = k+1,…,r and for all d, c is a low-weight concept in d.

Main idea: filter out all the concepts c = k+1,…,r.

Space efficient: # of index terms = k (vs. r or m)

Better retrieval: noisy concepts are filtered out across the board.

Page 19:

Low-rank SVD

B = U^T A = Σ V^T

Uk = (u1,…,uk)

Vk = (v1,…,vk)

Σk = the upper-left k × k sub-matrix of Σ

Ak = Uk Σk Vk^T

Bk = Σk Vk^T

rank(Ak) = rank(Bk) = k
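A sketch of the rank-k truncation, again assuming a NumPy term-document matrix A; the function name low_rank_svd is hypothetical:

    import numpy as np

    def low_rank_svd(A, k):
        """Return A_k = U_k Sigma_k V_k^T and the truncated index B_k = Sigma_k V_k^T."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # keep the k largest singular values
        Ak = Uk @ np.diag(sk) @ Vtk
        Bk = np.diag(sk) @ Vtk                     # k x n semantic index
        return Ak, Bk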

Page 20:

Low Dimensional Embedding

Theorem (informal): If k is chosen so that the discarded singular values σk+1,…,σr are small, then for “most” pairs of documents d, d’, the similarity between the d-th and d’-th columns of Ak is close to the similarity between the corresponding columns of A.

Consequence: Ak preserves pairwise similarities among documents at least as well as A does for retrieval purposes.

Page 21:

Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]

LSI summary: documents are embedded in a low-dimensional space (m → k); pairwise similarities are preserved; the index is more space-efficient.

But why is retrieval better? Synonymy. Polysemy.

Page 22:

Generative Model

A corpus model M = (T, C, W, D)

T: term space, |T| = m

C: concept space, |C| = k. A concept is a distribution over terms.

W: topic space. A topic is a distribution over concepts.

D: document distribution: a distribution over W × N

A document d is generated as follows (see the sketch below):

Sample a topic w and a length n according to D

Repeat n times: sample a concept c from C according to w, then sample a term t from T according to c
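A minimal sketch of this sampling process in Python; the concrete terms, concept and topic tables, the uniform topic choice, and the Poisson length distribution below are stand-in assumptions, not part of the model on the slide:

    import numpy as np

    rng = np.random.default_rng(2)

    terms = ["cars", "autos", "java", "coffee", "island"]
    # Each concept is a distribution over terms (rows sum to 1).
    concepts = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],    # a "vehicles" concept
                         [0.0, 0.0, 0.6, 0.2, 0.2]])   # a "java" concept
    # Each topic is a distribution over concepts.
    topics = np.array([[0.9, 0.1],
                       [0.1, 0.9]])

    def generate_document():
        w = rng.integers(len(topics))    # sample a topic (uniform stand-in for D)
        n = rng.poisson(8) + 1           # sample a length (stand-in for D's length part)
        doc = []
        for _ in range(n):
            c = rng.choice(len(concepts), p=topics[w])   # concept ~ topic w
            t = rng.choice(len(terms), p=concepts[c])    # term ~ concept c
            doc.append(terms[t])
        return w, doc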

Page 23:

Simplifying Assumptions

Every document has a single topic (W = C)

For every two concepts c, c’, ||c − c’|| ≥ 1 − ε

The probability of every term under a concept c is at most some small constant.

Page 24:

LSI Works

A: an m × n term-document matrix, representing n documents generated according to the model

Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d’:

If topic(d) = topic(d’), then (Ak)d and (Ak)d’ are parallel.

If topic(d) ≠ topic(d’), then (Ak)d and (Ak)d’ are orthogonal.

Page 25:

Proof

For simplicity, assume ε = 0. We want to show:

If topic(d) = topic(d’), then (Ak)d ∥ (Ak)d’

If topic(d) ≠ topic(d’), then (Ak)d ⊥ (Ak)d’

Dc: the documents whose topic is the concept c

Tc: the terms in supp(c)

Since ||c − c’|| = 1, Tc ∩ Tc’ = Ø

A has non-zeroes only in blocks B1,…,Bk, where Bc is the sub-matrix of A with rows in Tc and columns in Dc

A^T A is a block-diagonal matrix with blocks B1^T B1,…, Bk^T Bk

The (i,j)-th entry of Bc^T Bc is the term similarity between the i-th and j-th documents whose topic is the concept c

Bc^T Bc: the adjacency matrix of a bipartite (multi-)graph Gc on Dc

Page 26:

Proof (cont.)

Gc is a “random” graph, so the first and second eigenvalues of Bc^T Bc are well separated.

For all c, c’, the second eigenvalue of Bc^T Bc is smaller than the first eigenvalue of Bc’^T Bc’.

Hence the top k eigenvalues of A^T A are the principal eigenvalues of Bc^T Bc for c = 1,…,k.

Let u1,…,uk be the corresponding eigenvectors.

For every document d on topic c, Ad is orthogonal to all of u1,…,uk except uc.

Therefore (Ak)d is a scalar multiple of uc.

Page 27:

Extensions [Azar et al. 2001]

A more general generative model

Also explains the improved treatment of polysemy

Page 28:

Computing SVD

Compute the singular values of A by computing the eigenvalues of A^T A.

Compute U and V by computing the eigenvectors of A A^T and A^T A, respectively.

Running time: O(m^2 n + m n^2). Not practical for huge corpora.

Sub-linear time algorithms for estimating Ak [Frieze, Kannan, Vempala 1998]
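A sketch of the eigenvalue route (the straightforward dense computation, not the sub-linear estimation algorithm), assuming a small NumPy matrix; the function name svd_via_eigs is hypothetical:

    import numpy as np

    def svd_via_eigs(A):
        """Recover singular values/vectors of A from the eigen-decomposition of A^T A."""
        eigvals, V = np.linalg.eigh(A.T @ A)       # A^T A is symmetric PSD
        order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
        eigvals, V = eigvals[order], V[:, order]
        s = np.sqrt(np.clip(eigvals, 0, None))     # singular values are the square roots
        nz = s > 1e-12                             # drop (numerically) zero singular values
        s, V = s[nz], V[:, nz]
        U = (A @ V) / s                            # left singular vectors: u_i = A v_i / sigma_i
        return U, s, V.T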

Page 29:

HITS and SVD

A: the adjacency matrix of a web (sub-)graph G

a: authority vector; h: hub vector

a is the principal eigenvector of A^T A; h is the principal eigenvector of A A^T

Therefore a and h give A1, the rank-1 SVD of A.

Generalization: using Ak, we can get k authority vectors and k hub vectors, corresponding to other topics in G.
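A sketch of this generalization, assuming A is the adjacency matrix as a NumPy array; hits_via_svd is a hypothetical helper name:

    import numpy as np

    def hits_via_svd(A, k=1):
        """Hub and authority vectors from the top-k SVD of the adjacency matrix A.

        Columns of U are hub vectors (eigenvectors of A A^T);
        columns of V are authority vectors (eigenvectors of A^T A).
        """
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        hubs = U[:, :k]
        authorities = Vt[:k, :].T
        return hubs, authorities

    # k = 1 recovers the classical HITS vectors (up to sign); larger k gives
    # hub/authority pairs for additional "topics" in the graph.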

Page 30:

End of Lecture 6