(C) 2003, The University of Michigan 1

Information Retrieval

Handout #8

February 25, 2005

Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M 11-12 & Th 12-1 or via email

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

Models of the Web

Size

• The Web is the largest repository of data and it grows exponentially.

– 320 million Web pages [Lawrence & Giles 1998]

– 800 million Web pages, 15 TB [Lawrence & Giles 1999]

– 8 billion Web pages indexed [Google 2005]

• Amount of data

– roughly 200 TB [Lyman et al. 2003]

Bow-tie model of the Web

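[Figure: bow-tie diagram of the Web graph]

• SCC (strongly connected core): 56 M pages

• IN: 44 M pages

• OUT: 44 M pages

• TEND (tendrils): 44 M pages

• DISC (disconnected components): 17 M pages

• 24% of pages are reachable from a given page

Broder & al. WWW 2000, Dill & al. VLDB 2001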

Power laws

• Web site size (Huberman and Adamic 1999)

• Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree

• Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdős, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.

Small-world networks

• Diameter = average length of the shortest path between all pairs of nodes. Example…

• Milgram experiment (1967)

– Kansas/Omaha --> Boston (42/160 letters)

– diameter = 6

• Albert et al. 1999: the average distance between two vertices is d = 0.35 + 2.06 log10(n). For n = 10^9, d = 18.89.

• Six degrees of separation
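
The arithmetic behind the Albert et al. figure can be checked with a few lines of Python. This is only a sketch of the fitted formula quoted above; the function name `average_distance` is made up for illustration.

```python
import math

def average_distance(n: int) -> float:
    """Empirical fit quoted on the slide: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

# n = 10^9 pages, as on the slide
d = average_distance(10**9)
print(f"d = {d:.2f}")  # 0.35 + 2.06 * 9 = 18.89
```

Because the fit is logarithmic in n, multiplying the number of pages by ten adds only about two clicks to the average distance.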

Clustering coefficient

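• Cliquishness (C): the fraction of the k_v(k_v – 1)/2 possible pairs of neighbors of a node v that are themselves connected.

• Examples (n = number of nodes, k = average degree, d = average shortest-path length, C = clustering coefficient; the "rand" columns are for a random graph with the same n and k):

             n        k      d      d_rand   C      C_rand
Actors       225226   61     3.65   2.99     0.79   0.00027
Power grid   4941     2.67   18.7   12.4     0.08   0.005
C. elegans   282      14     2.65   2.25     0.28   0.05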
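
As a minimal sketch of the cliquishness definition, the following Python computes the clustering coefficient of a node in an undirected graph stored as an adjacency dictionary. The toy graph and the function name are hypothetical, not data from the table above.

```python
from itertools import combinations

def clustering_coefficient(adj: dict, v) -> float:
    """Fraction of the k_v(k_v - 1)/2 neighbor pairs of v that are linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: no neighbor pairs, so no cliquishness
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected toy graph: triangle a-b-c plus a pendant node d
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

for node in sorted(adj):
    print(node, round(clustering_coefficient(adj, node), 3))
```

Node a has three neighbor pairs of which only (b, c) is connected, so its coefficient is 1/3; averaging the per-node values gives the graph-level C reported in tables like the one above.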

Models of the Web

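[Figure: example graphs; node labels A, B, a, b]

Random graph (Poisson degree distribution): $\langle k \rangle = pN$, $P(k) = e^{-\langle k \rangle} \langle k \rangle^{k} / k!$

Power-law degree distribution: $P(k) \sim k^{-\gamma}$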

• Erdős/Rényi 59, 60

• Barabási/Albert 99

• Watts/Strogatz 98

• Kleinberg 98

• Menczer 02

• Radev 03

• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
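
To make the contrast between the first two models concrete, here is a small Python sketch comparing the Poisson degree distribution of an Erdős/Rényi random graph (mean degree pN) with a normalized power law. The parameter values (N = 1000, p = 0.005, gamma = 2.1, cutoff 1000) are assumptions chosen only for illustration.

```python
import math

def poisson_degree_pmf(k: int, n: int, p: float) -> float:
    """P(k) for a random graph with n nodes and edge probability p (mean p*n)."""
    mean = p * n
    return math.exp(-mean) * mean**k / math.factorial(k)

def power_law_pmf(k: int, gamma: float, k_max: int) -> float:
    """P(k) proportional to k^-gamma, normalized over k = 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm

# Assumed illustrative parameters, not values from the lecture
N, P_EDGE, GAMMA, K_MAX = 1000, 0.005, 2.1, 1000

for k in (1, 5, 20, 50):
    print(k, poisson_degree_pmf(k, N, P_EDGE), power_law_pmf(k, GAMMA, K_MAX))
```

The Poisson probabilities collapse very quickly past the mean degree, while the power law keeps a heavy tail; that tail is what the high-degree hubs of the Web correspond to.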

Self-triggerability across hyperlinks

• Document closures for information retrieval

• Self-triggerability [Mosteller&Wallace 84] Poisson distribution

• Two-Poisson [Bookstein&Swanson 74]

• Negative Binomial, K-mixture [Church&Gale 95]

• Triggerability across hyperlinks?

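$p = \Pr(w \in p_j)$

$p' = \Pr(w \in p_j \mid w \in p_i,\ p_i \to p_j)$

[Figure: two plots of p’ against p, one for the words "by", "with", "from" and one for the words "photo", "dream", "path"]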

Evolving Word-based Web

• Observations:

– Links are made based on topics

– Topics are expressed with words

– Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)

• Model

– Pick n

– Generate n lengths according to a power-law distribution

– Generate n documents using a trigram model

• Model (cont’d)

– Pick words in decreasing order of r.

– Generate hyperlinks with random directionality

• Outcome

– Generates power-law degree distributions

– Generates topical communities

– Natural variation of PageRank: LexRank

Social network analysis for IR

Social networks

• Induced by a relation

• Symmetric or not

• Examples:

– Friendship networks

– Board membership

– Citations

– Power grid of the US

– WWW

Krebs 2004

Prestige and centrality

• Degree centrality: how many neighbors each node has.

• Closeness centrality: how close a node is to all of the other nodes

• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes

• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.

• Prestige = same as centrality but for directed graphs.
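
The first two measures are simple enough to sketch directly. The Python below computes degree centrality and closeness centrality on an undirected graph; closeness is taken here as (number of reachable nodes) / (sum of shortest-path distances), which is one common normalization and an assumption on our part. The path graph is a hypothetical example.

```python
from collections import deque

def bfs_distances(adj: dict, source) -> dict:
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj: dict) -> dict:
    """Number of neighbors of each node."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def closeness_centrality(adj: dict) -> dict:
    """(reachable nodes) / (sum of distances); 0 for isolated nodes."""
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

# Hypothetical path graph a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degree_centrality(adj))
print(closeness_centrality(adj))
```

On this path the two inner nodes score higher on both measures (degree 2, closeness 0.75) than the end nodes (degree 1, closeness 0.5).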

Graph-based representations

[Figure: graph G(V, E) with nodes 1-8]

Square connectivity (incidence) matrix of G, with rows and columns indexed 1-8. Number of 1 entries per row:

row    1  2  3  4  5  6  7  8
ones   2  1  2  1  4  2  0  0

Markov chains

• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.

• Path = sequence (x0, x1, …, xn), where xi = xi-1 * E.

• The probability of a path can be computed as a product of probabilities for each step i.

• Random walk = find xj given x0, E, and j.
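
A minimal sketch of these definitions in plain Python, with the kernel E stored as a row-stochastic list of lists. The 3-state kernel and the function names are hypothetical.

```python
def step(x, E):
    """One step of the chain: returns the row vector x * E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def walk(x0, E, j):
    """Distribution x_j after j steps, starting from the distribution x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(x0, E, path):
    """Probability of a state sequence: initial probability times each step."""
    prob = x0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= E[a][b]
    return prob

# Hypothetical 3-state kernel; every row sums to 1
E = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
x0 = [1.0, 0.0, 0.0]  # start in state 0 with certainty

print(walk(x0, E, 2))                    # [0.25, 0.5, 0.25]
print(path_probability(x0, E, [0, 1, 2]))  # 1.0 * 0.5 * 0.5 = 0.25
```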

Stationary solutions

• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:

– E is stochastic

– E is irreducible

– E is aperiodic

• To make these conditions true:

– All rows of E add up to 1 (and no value is negative)

– Make sure that E is strongly connected

– Make sure that E is not bipartite

• Example: PageRank [Brin and Page 1998]: use “teleportation”
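
One way to satisfy all three conditions at once is to mix the link matrix with a uniform jump, which is the idea behind teleportation. The sketch below is an illustration under stated assumptions: the damping value 0.85 is a commonly quoted setting rather than something given in this handout, rows with no outgoing links are replaced by a uniform row, and the function name `make_ergodic` is made up.

```python
def make_ergodic(adjacency, damping=0.85):
    """Turn a 0/1 link matrix into a stochastic, irreducible, aperiodic kernel.

    Each row is normalized to sum to 1 (rows without links become uniform),
    then mixed with the uniform distribution: damping * E + (1 - damping) / n.
    """
    n = len(adjacency)
    kernel = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            link_row = [1.0 / n] * n          # assumed handling of dead ends
        else:
            link_row = [a / out_degree for a in row]
        kernel.append([damping * e + (1 - damping) / n for e in link_row])
    return kernel

# Hypothetical 3-page chain 0 -> 1 -> 2, where page 2 has no outgoing links
kernel = make_ergodic([[0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])
for row in kernel:
    print([round(e, 4) for e in row])
```

Every entry of the result is strictly positive, so any state can reach any other in one step (irreducible) and every state has a self-loop (aperiodic), while each row still sums to 1 (stochastic).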

Example

[Figure: graph with nodes 1-8]

This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

[Figure: bar charts of PageRank (y-axis, 0 to 1) for nodes 1-8 (x-axis) at t=0 and t=1]

Eigenvectors

• An eigenvector is an implicit “direction” for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.

• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.

• For λ1, the left (principal) eigenvector is p, and the right eigenvector is 1 (the all-ones vector).

• In other words, $E^T p = p$.

Computing the stationary distribution

Solution for the stationary distribution:

$(I - E^T)\,p = 0$, i.e., $p = E^T p$

function PowerStatDist(E):
begin
    p(0) = u;   (or p(0) = [1, 0, …, 0])
    i = 0;
    repeat
        i = i + 1;
        p(i) = E^T p(i-1);
        L = ||p(i) - p(i-1)||_1;
    until L < ε   (ε: convergence threshold)
    return p(i)
end
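
The power method above translates almost line for line into Python. The following is a sketch rather than a tuned implementation: the tolerance `eps`, the iteration cap `max_iter`, and the 2-state kernel are assumptions added for the example, and the start vector is the uniform distribution.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10_000):
    """Power iteration p <- E^T p until the L1 change drops below eps."""
    n = len(E)
    p = [1.0 / n] * n  # uniform start
    for _ in range(max_iter):
        # (E^T p)[j] = sum_i E[i][j] * p[i]
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 norm
        p = new_p
        if change < eps:
            break
    return p

# Hypothetical 2-state stochastic kernel; its stationary distribution is [5/6, 1/6]
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = power_stat_dist(E)
print([round(v, 6) for v in p])
```

For this kernel the balance equation 0.1 * p[0] = 0.5 * p[1] together with p[0] + p[1] = 1 gives [5/6, 1/6], which the iteration reaches in a few dozen steps.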

Example

[Figure: graph with nodes 1-8, and bar charts of PageRank (y-axis, 0 to 1) for nodes 1-8 (x-axis) at t=0, t=1, and t=10]

How Google works

• Crawling

• Anchor text

• Fast query processing

• PageRank

More about PageRank

• Named after Larry Page, co-founder of Google (and UM alum)

• Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.

• Independent of the query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).

HITS

• Query-dependent model (Kleinberg 97)

• Hubs and authorities (e.g., cars, Honda)

• Algorithm

– obtain the root set using the input query

– expand the root set by radius one

– run iterations on the hub and authority scores together

– report top-ranking authorities and hubs

$a' = E^T h$

$h' = E a$
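
The iteration step of the algorithm can be sketched as follows, with E as the 0/1 link matrix of the expanded root set (E[i][j] = 1 if page i links to page j). The normalization to unit Euclidean length, the fixed iteration count, the choice to compute hubs from the freshly updated authority scores, and the 4-page example are all assumptions made for this illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def hits(E, iterations=50):
    """Iterate a = E^T h and h = E a, normalizing after each round."""
    n = len(E)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # authority of j: sum of hub scores of pages linking to j
        a = normalize([sum(E[i][j] * h[i] for i in range(n)) for j in range(n)])
        # hub of i: sum of authority scores of pages that i links to
        h = normalize([sum(E[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h

# Hypothetical expanded root set: pages 0 and 1 link to 2; page 0 also links to 3
E = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(E)
print("authorities:", [round(x, 3) for x in a])
print("hubs:       ", [round(x, 3) for x in h])
```

Page 2 comes out as the top authority because both hubs point to it, and page 0 as the top hub because it points to both authorities.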

The link-content hypothesis

• Topical locality: a page is similar ($\sigma$) to the page that points to it ($\delta$).

• Davison (TF*IDF, 100K pages)

– 0.31 same domain

– 0.23 linked pages

– 0.19 sibling

– 0.02 random

• Menczer (373K pages, non-linear least squares fit): $\sigma(\delta) \approx \sigma_\infty + (1 - \sigma_\infty)\, e^{-\alpha_1 \delta^{\alpha_2}}$, with $\alpha_1 = 1.8$, $\alpha_2 = 0.6$, $\sigma_\infty = 0.03$

• Chakrabarti (focused crawling) - prob. of losing the topic

Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001

Document closures for Q&A

[Figure: two pages (P) joined by a link (L); terms shown: "capital", "spain", "Madrid"]

Document closures for IR

[Figure: two pages (P) joined by a link (L); terms shown: "Physics", "Physics Department", "University of Michigan", "Michigan"]

Language models

• Conditional probability distributions over word sequences

• Example: p(“Paris” ∈ dj) = ?  p(“Paris” ∈ dj | dj on Europe) = ?

• Training models: assume a parametric form, then maximize the probability of an existing text
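
As a small illustration of the training step, the sketch below assumes the simplest parametric form, a unigram model, for which the maximum-likelihood estimate is just the relative frequency of each word. The eight-word training text and the function names are made up for the example.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram model: p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def log_likelihood(model, tokens):
    """Log-probability of the token sequence under a unigram model."""
    return sum(math.log(model[word]) for word in tokens)

# Hypothetical training text
text = "paris is in france paris is in europe".split()
mle = train_unigram(text)
uniform = {word: 1 / len(mle) for word in mle}  # same vocabulary, equal weights

print(mle["paris"])                   # 2 / 8 = 0.25
print(log_likelihood(mle, text))      # higher (less negative) ...
print(log_likelihood(uniform, text))  # ... than under the uniform model
```

Any other unigram distribution over the same vocabulary, such as the uniform one here, assigns the training text a lower probability, which is what "maximize the probability of an existing text" means in practice.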

Link-based language models

• In the absence of other information, $p(w_i \in p) = 1/d(w_j)$

• Link information:

$p(w_i \in p \mid p_1 \to p,\ w_i \in p_1) = p(w_i \in p) \cdot R_i$

conjecture: $R_i > 1$

Experimental setup

• 2-Gigabyte wt2g corpus

• 247,491 Web documents

• 3,118,248 links

• 948,036 unique words (after Porter-style stemming)

• ALE (automatic link extrapolator)

Experiment one: setup

• For each stemmed word in wt2g, we compute the following numbers:

– PagesContainingWord = how many pages in the collection contain the word

– OutgoingLinks = the total number of outgoing links in all the pages that contain the word

– LinkedPagesContainingWord = how many of the linked pages contain the word

• For the latter two measures, only the links inside the collection were considered
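
The three counts can be sketched on a toy collection as follows. The pages, links, and function name are hypothetical, and one interpretation is assumed: LinkedPagesContainingWord is counted once per link, so that it can be divided by OutgoingLinks.

```python
def link_statistics(pages, links, word):
    """Compute the three per-word counts for a collection.

    pages: dict mapping page id -> set of stemmed words on that page
    links: list of (source, target) pairs; targets may lie outside the collection
    """
    containing = {pid for pid, words in pages.items() if word in words}
    # keep only links that start at a page containing the word
    # and that point inside the collection
    inside = [(s, t) for s, t in links if s in containing and t in pages]
    return {
        "PagesContainingWord": len(containing),
        "OutgoingLinks": len(inside),
        "LinkedPagesContainingWord": sum(1 for _, t in inside if t in containing),
    }

# Hypothetical toy collection of four pages; "X" is outside the collection
pages = {
    "A": {"physic", "michigan"},
    "B": {"physic", "depart"},
    "C": {"michigan"},
    "D": {"photo"},
}
links = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "D"), ("A", "X")]

stats = link_statistics(pages, links, "physic")
print(stats)
```

For the word "physic", two pages contain it, three in-collection links leave those pages, and two of those links land on a page that also contains the word.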

The link effect R

• The word “each”

– p = 55654/247491 = .225

– p’ = 15815/46163 = .343

– R = p’/p = .343/.225 = 1.524
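
The same arithmetic in Python, using the four counts quoted for “each”. The function and argument names are illustrative only.

```python
def link_effect(pages_with_word, total_pages, linked_with_word, outgoing_links):
    """Return (p, p', R) where R = p'/p is the link effect for a word."""
    p = pages_with_word / total_pages
    p_linked = linked_with_word / outgoing_links
    return p, p_linked, p_linked / p

p, p_linked, r = link_effect(55654, 247491, 15815, 46163)
print(f"p = {p:.3f}, p' = {p_linked:.3f}, R = {r:.3f}")
```

Computed from the unrounded fractions the ratio is about 1.523; the value 1.524 above comes from dividing the rounded figures .343 and .225.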

Establishing values for R

IDF ≈ 3.0                                  IDF ≈ 4.0
sorted by IDF       sorted by R            sorted by IDF        sorted by R
word      IDF       word      R            word       IDF       word       R
human     2.981     close     1.675        centuri    3.988     extend     2.085
accord    2.983     among     1.770        interact   3.990     beyond     2.477
perform   2.984     further   1.796        introduct  3.993     front      2.606
close     2.985     expect    1.864        front      3.994     centuri    2.713
press     2.992     accord    1.922        travel     3.997     elimin     2.753
applic    2.992     assist    1.962        elimin     4.009     damag      2.757
expect    2.997     human     2.093        opinion    4.013     introduct  2.843
among     2.998     perform   2.095        damag      4.017     opinion    2.984
assist    3.004     applic    2.203        beyond     4.019     travel     3.491
further   3.011     press     2.388        extend     4.021     interact   3.527

IDF rank        p       p’      IDF      R          sample words
1-100           0.4047  0.5293  1.6761   1.3639     the of make and
101-200         0.2141  0.3574  2.3803   1.6745     under go between copyright
201-300         0.1688  0.3209  2.6896   1.9047     market subject special mean
301-400         0.1386  0.2876  2.9513   2.0750     administr put establish ask
401-500         0.1192  0.2588  3.1548   2.1750     understand social hand share
501-600         0.1046  0.2426  3.3326   2.3179     prevent staff risk north
601-700         0.0934  0.2246  3.4879   2.4085     trade class size california
701-800         0.0839  0.2201  3.6354   2.6233     global drug letter softwar
801-900         0.0752  0.2004  3.7884   2.6668     sound tool monitor transport
901-1000        0.0669  0.2024  3.9499   3.0200     permit target east normal
1001-1100       0.0605  0.1823  4.0909   3.0149     approxim telephon danger europ
1101-1200       0.0548  0.1710  4.2292   3.1213     favor richard map pictur
1201-1300       0.0498  0.1752  4.3635   3.5210     professor earth english republican
1301-1400       0.0454  0.1652  4.4934   3.6366     medicin doctor church color
1401-1500       0.0416  0.1630  4.6166   3.9224     permiss agenda programm prioriti
1501-1600       0.0385  0.1508  4.7262   3.9165     prospect broadcast acquir feedback
1601-1700       0.0358  0.1450  4.8306   4.0517     temperatur florida percentag membership
1701-1800       0.0332  0.1462  4.9358   4.4050     alcohol lake crisi china
1801-1900       0.0310  0.1386  5.0363   4.4735     francisco disciplin film medium
1901-2000       0.0287  0.1379  5.1454   4.8090     entertain psycholog anticip arrest
…               …       …       …        …          …
100001-100100   0.0000  0.0642  12.4774  363.7331   sinker surmont thong undergrowth
500001-500100   0.0000  0.0215  16.2970  2658.9231  scheflin schena schendel scheriff

Linear fit for the 2000 lowest-IDF words

[Figure: plot of p’ against p with a linear fit]

Cluster One

[Figure: plot of p’ against p for the words "by", "with", "from"]

Cluster Two

[Figure: plot of p’ against p for the words "photo", "dream", "path"]