
Page 1:

Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 6

May 7, 2006

http://www.ee.technion.ac.il/courses/049011

Page 2:

Principal Eigenvector Computation

E: n × n matrix; |λ1| > |λ2| ≥ |λ3| ≥ … ≥ |λn|: the eigenvalues of E

Suppose λ1 > 0. v1,…,vn: the corresponding eigenvectors

The eigenvectors form a basis. Suppose ||v1||2 = 1.

Input: the matrix E and a unit vector u that is not in span(v2,…,vn)

Goal: compute λ1 and v1

Page 3:

The Power Method
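The algorithm itself appeared as a figure on the original slide. Below is a minimal sketch of power iteration in Python (NumPy), assuming E is an n × n array and u is the starting unit vector from the previous slide; the parameter num_iters is a hypothetical stopping rule.

    import numpy as np

    def power_method(E, u, num_iters=100):
        """Estimate the principal eigenvalue/eigenvector of E by repeated multiplication.

        E: n x n matrix; u: starting unit vector not in span(v2,...,vn).
        """
        w = u / np.linalg.norm(u)
        for _ in range(num_iters):
            w = E @ w
            w = w / np.linalg.norm(w)     # re-normalize after every step
        lam = w @ (E @ w)                 # Rayleigh quotient estimates lambda_1
        return lam, w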

Page 4:

Why Does It Work?

Theorem: As t → ∞, the iterate w converges to ±v1.

Intuition: writing u = c1 v1 + … + cn vn gives E^t u = c1 λ1^t v1 + … + cn λn^t vn = λ1^t (c1 v1 + Σi≥2 ci (λi/λ1)^t vi); since |λi/λ1| < 1 for i ≥ 2, after normalization every component other than v1 vanishes.

• Convergence rate: proportional to (|λ2|/|λ1|)^t.

• The larger the “spectral gap” |λ1| − |λ2|, the faster the convergence.

Page 5:

Spectral Methods in Information Retrieval

Page 6:

Outline

Motivation: synonymy and polysemy

Latent Semantic Indexing (LSI)

Singular Value Decomposition (SVD)

LSI via SVD

Why does LSI work?

HITS and SVD

Page 7:

Synonymy and Polysemy

Synonymy: multiple terms with (almost) the same meaning. Ex: cars, autos, vehicles. Harms recall.

Polysemy: a term with multiple meanings. Ex: java (programming language, coffee, island). Harms precision.

Page 8:

Traditional Solutions

Query expansion

Synonymy: OR on all synonyms. Manual/automatic use of thesauri. Too few synonyms: recall still low. Too many synonyms: harms precision.

Polysemy: AND on the term and additional specializing terms. Ex: +java +”programming language”. Too broad terms: precision still low. Too narrow terms: harms recall.

Page 9:

Syntactic Indexing

D: document collection, |D| = n

T: term space, |T| = m

A[t,d]: “weight” of term t in document d (e.g., TF-IDF)

A^T A: pairwise document similarities

A A^T: pairwise term similarities

[Figure: A is an m × n matrix whose rows are indexed by terms and whose columns are indexed by documents]
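As a concrete illustration of the syntactic index, here is a toy example in Python; the corpus and the raw-count weighting below are stand-in assumptions (TF-IDF would refine the weights).

    import numpy as np

    # Toy corpus: columns of A are documents, rows are terms.
    docs = ["cars and autos", "java programming language", "java coffee island"]
    terms = sorted({t for d in docs for t in d.split()})

    # A[t, d] = raw count of term t in document d.
    A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

    doc_similarities  = A.T @ A   # n x n: entry (i, j) compares documents i and j
    term_similarities = A @ A.T   # m x m: entry (i, j) compares terms i and j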

Page 10:

Latent Semantic Indexing (LSI) [Deerwester et al. 1990]

C: concept space, |C| = r

Documents and queries: “mixtures” of concepts

Given a query, find the most similar documents

Bridges the syntax-semantics gap

[Figure: B is an r × n matrix whose rows are indexed by concepts and whose columns are indexed by documents]

Page 11:

Fourier Transform

Time domain: the signal is written as a weighted sum of base signals, e.g. signal = 3 × (base signal 1) + 1.1 × (base signal 2).

Frequency domain: the same signal is represented by its coefficients over the base signals (here 3 and 1.1).

Compact discrete representation. Effective for noise removal.

Page 12:

Latent Semantic Indexing

Documents, queries ~ signals: vectors in R^m

Concepts ~ base signals: an orthonormal basis for span(columns of A)

Semantic indexing of a document ~ Fourier transform of a signal: the representation of the document in the concept basis

Advantages: space-efficient, better handling of synonymy and polysemy, removal of “noise”

Page 13:

Open Questions

How to choose the concept basis?

How to transform the syntactic index into a semantic index?

How to filter out “noisy concepts”?

Page 14:

Singular Values

A: m × n real matrix

Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv. u and v are called singular vectors.

Ex: σ = ||A||2 = max over ||x||2 = 1 of ||Ax||2. Corresponding singular vectors: the x that maximizes ||Ax||2 and y = Ax / ||A||2.

Note: A^T A v = σ^2 v and A A^T u = σ^2 u, so σ^2 is an eigenvalue of both A^T A and A A^T; v is an eigenvector of A^T A and u is an eigenvector of A A^T.
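A quick numerical sanity check of these identities, using NumPy's built-in SVD on a small random matrix (an illustrative sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    # The largest singular value equals the spectral norm ||A||_2.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    sigma, u, v = s[0], U[:, 0], Vt[0, :]

    assert np.isclose(sigma, np.linalg.norm(A, 2))
    assert np.allclose(A @ v, sigma * u)               # Av = sigma * u
    assert np.allclose(A.T @ u, sigma * v)             # A^T u = sigma * v
    assert np.allclose(A.T @ A @ v, sigma**2 * v)      # v is an eigenvector of A^T A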

Page 15:

Singular Value Decomposition (SVD)

Theorem: For every m × n real matrix A, there exists a singular value decomposition:

A = U Σ V^T

σ1 ≥ … ≥ σr > 0 (r = rank(A)): the singular values of A

Σ = Diag(σ1,…,σr)

U: column-orthonormal m × r matrix (U^T U = I)

V: column-orthonormal n × r matrix (V^T V = I)

[Figure: A = U × Σ × V^T, with A of size m × n, U of size m × r, Σ of size r × r, and V^T of size r × n]
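A minimal sketch of the decomposition in NumPy (the reduced/"economy" form, so the shapes match the m × r and n × r statement above); the random matrix is just an assumed example:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 4))

    # Reduced SVD: U is m x r, Sigma is r x r, Vt is r x n.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Sigma = np.diag(s)

    assert np.allclose(A, U @ Sigma @ Vt)                  # A = U Sigma V^T
    assert np.allclose(U.T @ U, np.eye(U.shape[1]))        # columns of U are orthonormal
    assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))     # columns of V are orthonormal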

Page 16:

Singular Values vs. Eigenvalues

A = U Σ V^T

σ1,…,σr: the singular values of A

σ1^2,…,σr^2: the non-zero eigenvalues of A^T A and A A^T

u1,…,ur: the columns of U; an orthonormal basis for span(columns of A); the left singular vectors of A; eigenvectors of A A^T

v1,…,vr: the columns of V; an orthonormal basis for span(rows of A); the right singular vectors of A; eigenvectors of A^T A

Page 17:

LSI as SVD

A = U Σ V^T, so U^T A = Σ V^T

u1,…,ur: the concept basis

B = Σ V^T: the LSI matrix (semantic index)

Ad: the d-th column of A; Bd: the d-th column of B

Bd = U^T Ad, so Bd[c] = uc^T Ad
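A sketch of building the semantic index from the SVD, assuming A is the NumPy term-document matrix from earlier; the helper name lsi_index is hypothetical:

    import numpy as np

    def lsi_index(A):
        """Return the semantic index B = Sigma @ Vt = U^T A for a term-document matrix A."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        B = np.diag(s) @ Vt      # r x n: column d is the concept representation of document d
        # Equivalently: B = U.T @ A
        return U, s, B

    # One common way to handle a query q (a vector in term space), not spelled out on the
    # slides, is to map it into concept space the same way: U.T @ q.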

Page 18:

Noisy Concepts

B = U^T A = Σ V^T

Bd[c] = σc · vc[d]

If σc is small, then Bd[c] is small for all d.

k = the largest i s.t. σi is “large”

For all c = k+1,…,r and for all d, c is a low-weight concept in d.

Main idea: filter out all the concepts c = k+1,…,r.

Space efficient: # of index terms = k (vs. r or m)

Better retrieval: noisy concepts are filtered out across the board.

Page 19:

Low-rank SVD

B = U^T A = Σ V^T

Uk = (u1,…,uk)

Vk = (v1,…,vk)

Σk = the upper-left k × k sub-matrix of Σ

Ak = Uk Σk Vk^T

Bk = Σk Vk^T

rank(Ak) = rank(Bk) = k
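A sketch of the rank-k truncation, again assuming a NumPy term-document matrix A; the function name low_rank_svd is hypothetical:

    import numpy as np

    def low_rank_svd(A, k):
        """Return A_k = U_k Sigma_k V_k^T and the truncated index B_k = Sigma_k V_k^T."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # keep the k largest singular values
        Ak = Uk @ np.diag(sk) @ Vtk
        Bk = np.diag(sk) @ Vtk                     # k x n semantic index
        return Ak, Bk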

Page 20:

Low Dimensional Embedding

Theorem (informal): If k is chosen so that the discarded singular values σk+1,…,σr are small, then for “most” pairs of documents d, d’, the similarity between the d-th and d’-th columns of Ak is close to the similarity between the corresponding columns of A.

Consequence: Ak preserves pairwise similarities among documents at least as well as A does for retrieval purposes.

Page 21:

Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]

LSI summary: documents are embedded in a low-dimensional space (m → k); pairwise similarities are preserved; the index is more space-efficient.

But why is retrieval better? Synonymy. Polysemy.

Page 22:

Generative Model

A corpus model M = (T, C, W, D)

T: term space, |T| = m

C: concept space, |C| = k. A concept is a distribution over terms.

W: topic space. A topic is a distribution over concepts.

D: document distribution: a distribution over W × N

A document d is generated as follows (see the sketch below):

Sample a topic w and a length n according to D

Repeat n times: sample a concept c from C according to w, then sample a term t from T according to c
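A minimal sketch of this sampling process in Python; the concrete terms, concept and topic tables, the uniform topic choice, and the Poisson length distribution below are stand-in assumptions, not part of the model on the slide:

    import numpy as np

    rng = np.random.default_rng(2)

    terms = ["cars", "autos", "java", "coffee", "island"]
    # Each concept is a distribution over terms (rows sum to 1).
    concepts = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],    # a "vehicles" concept
                         [0.0, 0.0, 0.6, 0.2, 0.2]])   # a "java" concept
    # Each topic is a distribution over concepts.
    topics = np.array([[0.9, 0.1],
                       [0.1, 0.9]])

    def generate_document():
        w = rng.integers(len(topics))    # sample a topic (uniform stand-in for D)
        n = rng.poisson(8) + 1           # sample a length (stand-in for D's length part)
        doc = []
        for _ in range(n):
            c = rng.choice(len(concepts), p=topics[w])   # concept ~ topic w
            t = rng.choice(len(terms), p=concepts[c])    # term ~ concept c
            doc.append(terms[t])
        return w, doc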

Page 23:

Simplifying Assumptions

Every document has a single topic (W = C)

For every two concepts c, c’, ||c − c’|| ≥ 1 − ε

The probability of every term under a concept c is at most some small constant.

Page 24:

LSI Works

A: an m × n term-document matrix, representing n documents generated according to the model

Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d’:

If topic(d) = topic(d’), then (Ak)d and (Ak)d’ are parallel.

If topic(d) ≠ topic(d’), then (Ak)d and (Ak)d’ are orthogonal.

Page 25:

Proof

For simplicity, assume ε = 0. We want to show:

If topic(d) = topic(d’), then (Ak)d ∥ (Ak)d’

If topic(d) ≠ topic(d’), then (Ak)d ⊥ (Ak)d’

Dc: the documents whose topic is the concept c

Tc: the terms in supp(c)

Since ||c − c’|| = 1, Tc ∩ Tc’ = Ø

A has non-zeroes only in blocks B1,…,Bk, where Bc is the sub-matrix of A with rows in Tc and columns in Dc

A^T A is a block-diagonal matrix with blocks B1^T B1,…, Bk^T Bk

The (i,j)-th entry of Bc^T Bc is the term similarity between the i-th and j-th documents whose topic is the concept c

Bc^T Bc: the adjacency matrix of a bipartite (multi-)graph Gc on Dc

Page 26:

Proof (cont.)

Gc is a “random” graph, so the first and second eigenvalues of Bc^T Bc are well separated.

For all c, c’, the second eigenvalue of Bc^T Bc is smaller than the first eigenvalue of Bc’^T Bc’.

Hence the top k eigenvalues of A^T A are the principal eigenvalues of Bc^T Bc for c = 1,…,k.

Let u1,…,uk be the corresponding eigenvectors.

For every document d on topic c, Ad is orthogonal to all of u1,…,uk except uc.

Therefore (Ak)d is a scalar multiple of uc.

Page 27:

Extensions [Azar et al. 2001]

A more general generative model

Also explains the improved treatment of polysemy

Page 28:

Computing SVD

Compute the singular values of A by computing the eigenvalues of A^T A.

Compute U and V by computing the eigenvectors of A A^T and A^T A, respectively.

Running time: O(m^2 n + m n^2). Not practical for huge corpora.

Sub-linear time algorithms for estimating Ak [Frieze, Kannan, Vempala 1998]
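A sketch of the eigenvalue route (the straightforward dense computation, not the sub-linear estimation algorithm), assuming a small NumPy matrix; the function name svd_via_eigs is hypothetical:

    import numpy as np

    def svd_via_eigs(A):
        """Recover singular values/vectors of A from the eigen-decomposition of A^T A."""
        eigvals, V = np.linalg.eigh(A.T @ A)       # A^T A is symmetric PSD
        order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
        eigvals, V = eigvals[order], V[:, order]
        s = np.sqrt(np.clip(eigvals, 0, None))     # singular values are the square roots
        nz = s > 1e-12                             # drop (numerically) zero singular values
        s, V = s[nz], V[:, nz]
        U = (A @ V) / s                            # left singular vectors: u_i = A v_i / sigma_i
        return U, s, V.T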

Page 29:

HITS and SVD

A: the adjacency matrix of a web (sub-)graph G

a: authority vector; h: hub vector

a is the principal eigenvector of A^T A; h is the principal eigenvector of A A^T

Therefore a and h give A1, the rank-1 SVD of A.

Generalization: using Ak, we can get k authority vectors and k hub vectors, corresponding to other topics in G.
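A sketch of this generalization, assuming A is the adjacency matrix as a NumPy array; hits_via_svd is a hypothetical helper name:

    import numpy as np

    def hits_via_svd(A, k=1):
        """Hub and authority vectors from the top-k SVD of the adjacency matrix A.

        Columns of U are hub vectors (eigenvectors of A A^T);
        columns of V are authority vectors (eigenvectors of A^T A).
        """
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        hubs = U[:, :k]
        authorities = Vt[:k, :].T
        return hubs, authorities

    # k = 1 recovers the classical HITS vectors (up to sign); larger k gives
    # hub/authority pairs for additional "topics" in the graph.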

Page 30:

End of Lecture 6