Page 1:

Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 5

April 23, 2006

http://www.ee.technion.ac.il/courses/049011

Page 2:

Ranking Algorithms

Page 3:

PageRank [Page, Brin, Motwani, Winograd 1998]

Motivating principles:
Rank of p should be proportional to the rank of the pages that point to p.
(Compare recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva.)
Rank of p should depend on the number of pages “co-cited” with p.
(Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth.)

Page 4:

PageRank, Attempt #1

B = the normalized adjacency matrix: Bp,q = 1/out(p) if p links to q, and 0 otherwise.
Rank equation: r(p) = Σ_{q → p} r(q)/out(q) for every page p, i.e., r^T = r^T B.
Additional conditions: r is non-negative (r ≥ 0) and normalized (||r||1 = 1).
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1.

Page 5:

PageRank, Attempt #1

A solution exists only if B has 1 as an eigenvalue.
Problem: B may not have 1 as an eigenvalue, because some of its rows are all 0 (pages with no outlinks).

Page 6:

PageRank, Attempt #2

α = normalization constant.
Rank equation: r(p) = α · Σ_{q → p} r(q)/out(q), i.e., r^T = α · r^T B.
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1/α.

Page 7:

PageRank, Attempt #2

Any nonzero eigenvalue λ of B may give a solution: α = 1/λ and r = any non-negative normalized left eigenvector of B with eigenvalue λ.
Which solution to pick? Pick a “principal eigenvector”, i.e., one corresponding to the maximal λ.
How to find a solution? Power iterations.

Page 8:

PageRank, Attempt #2

Problem #1: The maximal eigenvalue may have multiplicity > 1, so there are several possible solutions. This happens, for example, when the graph is disconnected.
Problem #2: Rank accumulates at sinks. Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass.

Page 9:

PageRank, Final Definition

e = “rank source” vector. Standard setting: e(p) = ε/n for all p (ε < 1).
1 = the all-1’s vector.
Rank equation: r(p) = α · (Σ_{q → p} r(q)/out(q) + e(p)), i.e., r^T = α · r^T (B + 1e^T), using r^T 1 = ||r||1 = 1.
Then: r is a non-negative normalized left eigenvector of (B + 1e^T) with eigenvalue 1/α.

Page 10:

PageRank, Final Definition

Any nonzero eigenvalue of (B + 1e^T) may give a solution. Pick r to be a principal left eigenvector of (B + 1e^T).
Will show:
The principal eigenvalue has multiplicity 1, for any graph.
There exists a non-negative left eigenvector.
Hence, PageRank always exists and is uniquely defined.
Due to the rank source vector, rank no longer accumulates at sinks.

Page 11:

An Alternative View of PageRank: The Random Surfer Model

When visiting a page p, a “random surfer”:
With probability 1 − d, selects a random outlink p → q and goes to visit q (“focused browsing”).
With probability d, jumps to a random web page q (“loss of interest”).
If p has no outlinks, assume it has a self loop.
P: the probability transition matrix of this random walk.
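
The random-surfer view translates directly into a short power-iteration sketch. Below is a minimal illustration (not code from the lecture): the graph is assumed to be given as a dict of out-link lists, d is the jump probability, and sinks are given a self loop as the slide prescribes.

```python
import numpy as np

def pagerank(outlinks, d=0.15, iters=100):
    """Stationary distribution of the random-surfer chain (illustrative sketch)."""
    pages = sorted(outlinks)                      # every link target must also be a key
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)

    # Build the row-stochastic transition matrix P of the random surfer.
    P = np.zeros((n, n))
    for p, targets in outlinks.items():
        if not targets:
            targets = [p]                         # sink: assume a self loop
        for q in targets:
            P[idx[p], idx[q]] += (1 - d) / len(targets)   # follow a random outlink
    P += d / n                                    # random jump to any page

    r = np.full(n, 1.0 / n)                       # start from the uniform distribution
    for _ in range(iters):
        r = r @ P                                 # one step of the chain (left multiplication)
    return dict(zip(pages, r))

# Tiny example: page "c" is a sink.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": []}))
```

The vector r that this iteration converges to is the stationary distribution discussed on the next two slides.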

Page 12:

PageRank & Random Surfer Model

For a suitable choice of the rank source vector e (determined by the jump probability d), r is a principal left eigenvector of (B + 1e^T) if and only if it is a principal left eigenvector of P.

Page 13:

PageRank & Markov Chains

The PageRank vector is the normalized principal left eigenvector of (B + 1e^T).
Hence, the PageRank vector is also a principal left eigenvector of P.
Conclusion: PageRank is the unique stationary distribution of the random surfer Markov chain.
PageRank(p) = r(p) = the probability that the random surfer visits page p, in the limit.
Note: the “random jump” guarantees that the Markov chain is ergodic.

Page 14:

HITS: Hubs and Authorities [Kleinberg, 1997]

HITS: Hyperlink Induced Topic Search.
Main principle: every page p is associated with two scores:
Authority score: how “authoritative” a page is about the query’s topic.
Ex: query “IR”; authorities: scientific IR papers.
Ex: query “automobile manufacturers”; authorities: the Mazda, Toyota, and GM web sites.
Hub score: how good the page is as a “resource list” about the query’s topic.
Ex: query “IR”; hubs: surveys and books about IR.
Ex: query “automobile manufacturers”; hubs: KBB, car link lists.

Page 15:

Mutual Reinforcement

HITS principles:
p is a good authority if it is linked to by many good hubs.
p is a good hub if it points to many good authorities.

Page 16:

HITS: Algebraic Form

a: authority vector; h: hub vector; A: adjacency matrix.
Then: a = A^T h and h = A a (up to normalization).
Therefore:
a is a principal eigenvector of A^T A.
h is a principal eigenvector of A A^T.
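
A minimal sketch of the resulting iteration, assuming A is given as a NumPy 0/1 adjacency matrix (an illustration of the two update rules, not the lecture's code):

```python
import numpy as np

def hits(A, iters=50):
    """HITS mutual-reinforcement iteration (sketch)."""
    n = A.shape[0]
    a = np.ones(n)                   # authority scores
    h = np.ones(n)                   # hub scores
    for _ in range(iters):
        a = A.T @ h                  # a good authority is linked to by good hubs
        h = A @ a                    # a good hub points to good authorities
        a /= np.linalg.norm(a)       # normalize to keep the scores bounded
        h /= np.linalg.norm(h)
    return a, h                      # converge to principal eigenvectors of A^T A and A A^T
```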

Page 17:

Co-Citation and Bibliographic Coupling

A^T A: the co-citation matrix.
(A^T A)p,q = # of pages that link to both p and q.
Thus: authority scores propagate through co-citation.
A A^T: the bibliographic coupling matrix.
(A A^T)p,q = # of pages that both p and q link to.
Thus: hub scores propagate through bibliographic coupling.
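
A tiny NumPy illustration of the two matrices on a hypothetical four-page graph:

```python
import numpy as np

# A[p, q] = 1 if page p links to page q.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])

cocitation = A.T @ A    # (p, q): number of pages that link to both p and q
coupling   = A @ A.T    # (p, q): number of pages that both p and q link to

print(cocitation[2, 3])  # 1: only page 1 links to both page 2 and page 3
print(coupling[0, 1])    # 1: page 2 is the only page that both 0 and 1 link to
```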


Page 18:

Principal Eigenvector Computation

E: an n × n matrix.
|λ1| > |λ2| ≥ |λ3| ≥ … ≥ |λn|: the eigenvalues of E. Suppose λ1 > 0.
v1,…,vn: the corresponding eigenvectors. The eigenvectors form an orthonormal basis.
Input: the matrix E and a unit vector u that is not orthogonal to v1.
Goal: compute λ1 and v1.

Page 19:

The Power Method
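
The algorithm can be sketched as follows, under the assumptions of the previous slide (E has an orthonormal eigenbasis and a unique eigenvalue of largest magnitude; the function name is illustrative):

```python
import numpy as np

def power_method(E, u, iters=100):
    """Estimate the principal eigenpair (lambda_1, v_1) of E by repeated multiplication."""
    w = u
    for _ in range(iters):
        w = E @ w                    # the component along v_1 grows fastest
        w /= np.linalg.norm(w)       # rescale back to a unit vector
    lam = w @ (E @ w)                # Rayleigh quotient estimate of lambda_1
    return lam, w
```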

Page 20:

Why Does It Work?

Theorem: As t → ∞, w → c · v1 (c is a constant).
Convergence rate: proportional to (λ2/λ1)^t.
The larger the “spectral gap” λ1 − λ2, the faster the convergence.

Page 21:

Spectral Methods in Information Retrieval

Page 22:

Outline

Motivation: synonymy and polysemy
Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD)
LSI via SVD
Why LSI works
HITS and SVD

Page 23:

Synonymy and Polysemy

Synonymy: multiple terms with (almost) the same meaning.
Ex: cars, autos, vehicles.
Harms recall.
Polysemy: a term with multiple meanings.
Ex: java (programming language, coffee, island).
Harms precision.

Page 24:

Traditional Solutions

Query expansion.
Synonymy: OR on all synonyms.
Manual/automatic use of thesauri.
Too few synonyms: recall still low.
Too many synonyms: harms precision.
Polysemy: AND on the term and additional specializing terms.
Ex: +java +”programming language”.
Too broad terms: precision still low.
Too narrow terms: harms recall.

Page 25:

Syntactic Space

D: document collection, |D| = n.
T: term space, |T| = m.
At,d: the “weight” of term t in document d (e.g., TF-IDF).
A^T A: pairwise document similarities.
A A^T: pairwise term similarities.
(Figure: A is the m × n term-document matrix, with rows indexed by terms and columns by documents.)

Page 26:

Syntactic Indexing

Index keys: terms.
Limitations:
Synonymy: (near-)identical rows.
Polysemy.
Space inefficiency: the matrix is usually not full rank.
Gap between syntax and semantics: the information need is semantic, but the index and query are syntactic.

Page 27:

Semantic Space

C: concept space, |C| = r.
Bc,d: the “weight” of concept c in document d.
A change of basis; compare to wavelet and Fourier transforms.
(Figure: B is the r × n concept-document matrix, with rows indexed by concepts and columns by documents.)

Page 28:

Latent Semantic Indexing (LSI) [Deerwester et al. 1990]

Index keys: concepts.
Documents & queries: mixtures of concepts.
Given a query, find the most similar documents.
Bridges the syntax-semantics gap.
Space-efficient: concepts are orthogonal, and the matrix is full rank.
Questions:
What is the concept space?
What is the transformation from the syntax space to the semantic space?
How to filter out “noise concepts”?

Page 29:

Singular Values

A: m×n real matrix.
Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv.
u and v are called singular vectors.
Ex: σ = ||A||2 = max over ||x||2 = 1 of ||Ax||2. Corresponding singular vectors: the x that maximizes ||Ax||2 and y = Ax / ||A||2.
Note: A^T A v = σ²v and A A^T u = σ²u, so σ² is an eigenvalue of both A^T A and A A^T, v is an eigenvector of A^T A, and u is an eigenvector of A A^T.

Page 30:

Singular Value Decomposition (SVD)

Theorem: For every m×n real matrix A there exists a singular value decomposition

A = U Σ V^T

σ1 ≥ … ≥ σr > 0 (r = rank(A)): the singular values of A.
Σ = Diag(σ1,…,σr).
U: a column-orthonormal m×r matrix (U^T U = I).
V: a column-orthonormal n×r matrix (V^T V = I).
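
A quick numerical check of the theorem on a random matrix (a sketch; note that np.linalg.svd returns V^T rather than V):

```python
import numpy as np

A = np.random.rand(5, 3)                        # arbitrary 5 x 3 real matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)                                        # sigma_1 >= ... >= sigma_r > 0
print(np.allclose(U @ np.diag(s) @ Vt, A))      # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3)))          # U is column-orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))        # V is column-orthonormal
```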

Page 31:

Singular Values vs. Eigenvalues

A = U Σ V^T.
σ1,…,σr: the singular values of A.
σ1²,…,σr²: the non-zero eigenvalues of A^T A and A A^T.
u1,…,ur: the columns of U. They form an orthonormal basis for span(columns of A), are the left singular vectors of A, and are eigenvectors of A A^T.
v1,…,vr: the columns of V. They form an orthonormal basis for span(rows of A), are the right singular vectors of A, and are eigenvectors of A^T A.

Page 32:

LSI as SVD

A = U Σ V^T, hence U^T A = Σ V^T.
u1,…,ur: the concept basis.
B = Σ V^T: the LSI matrix.
Ad: the d-th column of A. Bd: the d-th column of B.
Bd = U^T Ad, so Bd[c] = u_c^T Ad.
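
A small sketch of this change of basis on a toy matrix, computing B both as U^T A and as Σ V^T:

```python
import numpy as np

A = np.random.rand(6, 4)                        # toy term-document matrix: 6 terms, 4 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B = U.T @ A                                     # documents expressed over the concept basis
print(np.allclose(B, np.diag(s) @ Vt))          # B = Sigma V^T
print(B[:, 0])                                  # B_0: concept weights of document 0
```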

Page 33:

Noisy Concepts

B = U^T A = Σ V^T.
Bd[c] = σ_c · v_c[d]: the weight of concept c in document d.
If σ_c is small, then Bd[c] is small for all d.
Let k be the largest i such that σ_i is “large”.
For every c = k+1,…,r and every d, c is a low-weight concept in d.
Main idea: filter out all concepts c = k+1,…,r.
Space-efficient: # of index terms = k (vs. r or m).
Better retrieval: noisy concepts are filtered out across the board.

Page 34:

Low-rank SVD

B = U^T A = Σ V^T.
Uk = (u1,…,uk), Vk = (v1,…,vk).
Σk = the upper-left k×k sub-matrix of Σ.
Ak = Uk Σk Vk^T.
Bk = Σk Vk^T.
rank(Ak) = rank(Bk) = k.
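
A minimal sketch of the truncation (the helper name low_rank is illustrative):

```python
import numpy as np

def low_rank(A, k):
    """Return the rank-k approximation A_k and the reduced LSI matrix B_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]    # keep only the k largest singular values
    Ak = Uk @ np.diag(sk) @ Vtk                 # A_k = U_k Sigma_k V_k^T
    Bk = np.diag(sk) @ Vtk                      # B_k = Sigma_k V_k^T: k index terms per document
    return Ak, Bk

Ak, Bk = low_rank(np.random.rand(6, 4), 2)
print(np.linalg.matrix_rank(Ak), Bk.shape)      # rank(A_k) = 2; B_k is 2 x 4
```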

Page 35:

Low Dimensional Embedding

Frobenius norm: ||A||F = (Σ_{i,j} Ai,j²)^{1/2}.
Fact: ||A − Ak||F² = σ_{k+1}² + … + σ_r².
Therefore, if σ_{k+1}² + … + σ_r² is small, then for “most” pairs of documents d, d′ the inner product of the d-th and d′-th columns of Ak is close to that of the corresponding columns of A.
Ak therefore preserves pairwise similarities among documents, and is at least as good as A for retrieval.
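
A short numerical check of the stated fact on a random matrix:

```python
import numpy as np

A = np.random.rand(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

lhs = np.linalg.norm(A - Ak, "fro") ** 2        # ||A - A_k||_F^2
rhs = np.sum(s[k:] ** 2)                        # sigma_{k+1}^2 + ... + sigma_r^2
print(np.isclose(lhs, rhs))                     # True: the two quantities agree
```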

Page 36:

Computing SVD

Compute the singular values of A by computing the eigenvalues of A^T A.
Compute V and U by computing the eigenvectors of A^T A and A A^T, respectively.
Running time is not too good: O(m²n + mn²), which is not practical for huge corpora.
There are sub-linear time algorithms for estimating Ak [Frieze, Kannan, Vempala 1998].
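
A sketch of the first step, recovering the singular values of a random matrix from the eigenvalues of A^T A:

```python
import numpy as np

A = np.random.rand(6, 4)
eigvals = np.linalg.eigvalsh(A.T @ A)           # eigenvalues of A^T A, in ascending order
sigma = np.sqrt(eigvals[::-1])                  # sigma_i = sqrt(lambda_i), descending

print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))   # matches the SVD routine
```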

Page 37:

HITS and SVD

A: the adjacency matrix of a web (sub-)graph G.
a: authority vector; h: hub vector.
a is a principal eigenvector of A^T A; h is a principal eigenvector of A A^T.
Therefore: a and h give A1, the rank-1 SVD of A.
Generalization: using Ak, we can get k authority vectors and k hub vectors, corresponding to other topics in G.

Page 38:

Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]

LSI summary:
Documents are embedded in a low dimensional space (m dimensions reduced to k).
Pairwise similarities are preserved.
More space-efficient.
But why is retrieval better?
Synonymy.
Polysemy.

Page 39:

Generative Model

A corpus model M = (T, C, W, D):
T: term space, |T| = m.
C: concept space, |C| = k. A concept is a distribution over terms.
W: topic space. A topic is a distribution over concepts.
D: document distribution, a distribution over W × N.
A document d is generated as follows:
Sample a topic w and a length n according to D.
Repeat n times:
Sample a concept c from C according to w.
Sample a term t from T according to c.
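
A hedged sketch of this sampling process; the dict-based data structures here are illustrative choices, not part of the model:

```python
import random

def sample_document(topics, concepts, doc_dist):
    """Generate one document from the corpus model described above (sketch).

    topics:   topic -> {concept: probability}     (a topic is a distribution over concepts)
    concepts: concept -> {term: probability}      (a concept is a distribution over terms)
    doc_dist: list of ((topic, length), probability) pairs, a distribution over W x N
    """
    pairs, weights = zip(*doc_dist)
    w, n = random.choices(pairs, weights=weights)[0]      # sample a topic w and a length n from D
    terms = []
    for _ in range(n):                                    # repeat n times:
        c = random.choices(list(topics[w]), list(topics[w].values()))[0]      # concept c ~ w
        t = random.choices(list(concepts[c]), list(concepts[c].values()))[0]  # term t ~ c
        terms.append(t)
    return terms

# Tiny example with two concepts and a single topic.
concepts = {"cars": {"auto": 0.5, "vehicle": 0.5}, "java": {"coffee": 0.5, "island": 0.5}}
topics = {"transport": {"cars": 0.9, "java": 0.1}}
print(sample_document(topics, concepts, [(("transport", 5), 1.0)]))
```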

Page 40:

Simplifying Assumptions

Every document has a single topic (W = C).
For every two concepts c, c′: ||c − c′|| ≥ 1 − ε.
The probability of every term under a concept c is at most some constant τ.

Page 41:

LSI Works

A: an m×n term-document matrix representing n documents generated according to the model.
Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d′:
If topic(d) = topic(d′), then the d-th and d′-th columns of Ak are (nearly) parallel.
If topic(d) ≠ topic(d′), then the d-th and d′-th columns of Ak are (nearly) orthogonal.

Page 42:

Proof

For simplicity, assume ε = 0. We want to show:
If topic(d) = topic(d′), the d-th and d′-th columns of Ak are parallel.
If topic(d) ≠ topic(d′), the d-th and d′-th columns of Ak are orthogonal.
Dc: the documents whose topic is the concept c.
Tc: the terms in supp(c).
Since ||c − c′|| = 1, Tc ∩ Tc′ = Ø.
A has non-zero entries only in blocks B1,…,Bk, where Bc is the sub-matrix of A with rows in Tc and columns in Dc.
A^T A is a block diagonal matrix with blocks B1^T B1,…, Bk^T Bk.
The (i,j)-th entry of Bc^T Bc is the term similarity between the i-th and j-th documents whose topic is the concept c.
Bc^T Bc is the adjacency matrix of a bipartite (multi-)graph Gc on Dc.

Page 43:

Proof (cont.)

Gc is a “random” graph, so the first and second eigenvalues of Bc^T Bc are well separated.
For all c, c′, the second eigenvalue of Bc^T Bc is smaller than the first eigenvalue of Bc′^T Bc′.
Hence, the top k eigenvalues of A^T A are the principal eigenvalues of Bc^T Bc for c = 1,…,k.
Let u1,…,uk be the corresponding eigenvectors.
For every document d on topic c, Ad is orthogonal to all of u1,…,uk except uc.
Hence the d-th column of Ak is a scalar multiple of uc.

Page 44:

Extensions [Azar et al. 2001]

A more general generative model.
Also explains the improved treatment of polysemy.

Page 45:

End of Lecture 5