
(C) 2003, The University of Michigan

Information Retrieval

Handout #8

February 25, 2005


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M 11-12 & Th 12-1 or via email

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall


Models of the Web


Size

• The Web is the largest repository of data and it grows exponentially.

– 320 million Web pages [Lawrence & Giles 1998]

– 800 million Web pages, 15 TB [Lawrence & Giles 1999]

– 8 billion Web pages indexed [Google 2005]

• Amount of data

– roughly 200 TB [Lyman et al. 2003]


Bow-tie model of the Web

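[Figure: bow-tie diagram of the Web graph, with the number of pages in each region]

– SCC (strongly connected component): 56 M

– IN: 44 M

– OUT: 44 M

– TEND (tendrils): 44 M

– DISC (disconnected components): 17 M

• 24% of pages reachable from a given page

Broder & al. WWW 2000, Dill & al. VLDB 2001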


Power laws

• Web site size (Huberman and Adamic 1999)

• Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree

• Others: call graphs among telephone carriers, citation networks (Redner 1998), collaboration graphs (e.g., Erdős; actors), metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of the exponent gamma are around 2-3.
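
A power-law exponent such as the ones quoted above can be estimated from observed degrees. The sketch below is only an illustration: it assumes a continuous power law p(k) ~ k^(-gamma) for k >= k_min (real degrees are integers), draws synthetic samples, and recovers gamma with the standard maximum-likelihood formula. The function names, sample size, and seed are all hypothetical.

```python
import math
import random

def sample_power_law(n, gamma, k_min=1.0, seed=0):
    """Draw n samples from a continuous power law p(k) ~ k^(-gamma), k >= k_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling: k = k_min * (1 - u)^(-1 / (gamma - 1)).
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))
            for _ in range(n)]

def estimate_gamma(samples, k_min=1.0):
    """Maximum-likelihood estimate of gamma (continuous approximation)."""
    tail = [k for k in samples if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)

# Synthetic "in-degrees" generated with the exponent 2.1 quoted above.
degrees = sample_power_law(20000, gamma=2.1)
gamma_hat = estimate_gamma(degrees)
print(round(gamma_hat, 2))
```

With this many samples the estimate should land close to the exponent used to generate the data; on real degree data the choice of k_min matters a great deal.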


Small-world networks

• Diameter = average length of the shortest path between all pairs of nodes. Example…

• Milgram experiment (1967)

– Kansas/Omaha --> Boston (42/160 letters)

– diameter = 6

• Albert et al. 1999: average distance between two vertices is d = 0.35 + 2.06 log10(n). For n = 10^9, d = 18.89.

• Six degrees of separation
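
The figure for the Albert et al. fit can be checked with a few lines of arithmetic. The function name below is made up for this sketch; the constants are the ones on the slide.

```python
import math

def average_distance(n):
    """Fitted average shortest-path length for a Web graph with n pages."""
    return 0.35 + 2.06 * math.log10(n)

d = average_distance(10**9)
print(round(d, 2))  # 18.89
```

Because the fit is logarithmic in n, multiplying the number of pages by ten adds only about two clicks to the average distance.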


Clustering coefficient

• Cliquishness (C): the fraction of the kv (kv – 1)/2 possible pairs of neighbors of a node v that are themselves connected.

• Examples:

             n        k      d      drand   C      Crand
Actors       225226   61     3.65   2.99    0.79   0.00027
Power grid   4941     2.67   18.7   12.4    0.08   0.005
C. elegans   282      14     2.65   2.25    0.28   0.05
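
As an illustration of the definition, here is a minimal sketch that computes the clustering coefficient of one node in an undirected graph stored as a dictionary of neighbor sets. The toy graph and the convention of returning 0 for nodes with fewer than two neighbors are assumptions made for this example.

```python
def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are linked to each other."""
    neighbors = list(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs to check
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return links / (k * (k - 1) / 2)

# Hypothetical toy graph: a triangle a-b-c plus a pendant node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(toy, "a"))  # one linked pair out of three
print(clustering_coefficient(toy, "b"))  # its single neighbor pair is linked
```

Averaging this value over all nodes gives a graph-level C comparable to the column in the table.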


Models of the Web

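[Figure: degree distributions for two network models, panels A/a and B/b]

$\langle k \rangle = pN$

$P(k) = e^{-\langle k \rangle} \frac{\langle k \rangle^{k}}{k!}$

$P(k) \sim k^{-\gamma}$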

• Erdős/Rényi 59, 60

• Barabási/Albert 99

• Watts/Strogatz 98

• Kleinberg 98

• Menczer 02

• Radev 03

• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology


Self-triggerability across hyperlinks

• Document closures for information retrieval

• Self-triggerability [Mosteller&Wallace 84] Poisson distribution

• Two-Poisson [Bookstein&Swanson 74]

• Negative Binomial, K-mixture [Church&Gale 95]

• Triggerability across hyperlinks?

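$p' = \Pr(w \in p_j \mid w \in p_i,\; p_i \rightarrow p_j)$

[Figure: page $p_i$ linking to page $p_j$]

[Figure: two plots with axes p and p’; labeled words: by, with, from (first plot) and photo, dream, path (second plot)]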


Evolving Word-based Web

• Observations:

– Links are made based on topics

– Topics are expressed with words

– Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)

• Model

– Pick n

– Generate n lengths according to a power-law distribution

– Generate n documents using a trigram model

• Model (cont’d)

– Pick words in decreasing order of r.

– Generate hyperlinks with random directionality

• Outcome

– Generates power-law degree distributions

– Generates topical communities

– Natural variation of PageRank: LexRank


Social network analysis for IR


Social networks

• Induced by a relation

• Symmetric or not

• Examples:

– Friendship networks

– Board membership

– Citations

– Power grid of the US

– WWW


Krebs 2004


Prestige and centrality

• Degree centrality: how many neighbors each node has.

• Closeness centrality: how close a node is to all of the other nodes

• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes

• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.

• Prestige = same as centrality but for directed graphs.
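
Two of these measures are easy to sketch for a small undirected graph. The code below is an illustration only: the graph is hypothetical, and closeness is computed with one common normalization (number of reachable nodes divided by the sum of shortest-path distances), which is an assumption rather than the definition used in the lecture.

```python
from collections import deque

def degree_centrality(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj, v):
    """(Number of reachable nodes) / (sum of shortest-path distances from v)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Hypothetical undirected path graph 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(degree_centrality(path))
print(closeness_centrality(path, 2))  # distances 1, 1, 2 -> 3 / 4
print(closeness_centrality(path, 1))  # distances 1, 2, 3 -> 3 / 6
```

The interior nodes of the path score higher on both measures than the endpoints, which matches the intuition behind centrality.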


Graph-based representations

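Graph G (V, E)

[Figure: graph with nodes 1-8]

Square connectivity (adjacency) matrix

[Figure: 8 × 8 matrix with rows and columns indexed by nodes 1-8; rows 1-6 contain 2, 1, 2, 1, 4, and 2 ones respectively, and rows 7 and 8 contain none]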


Markov chains

• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.

• Path = sequence (x0, x1, …, xn). Xi = xi-1 * E

• The probability of a path can be computed as a product of probabilities for each step i.

• Random walk = find Xj given x0, E, and j.
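
A small sketch may make the last two bullets concrete. It assumes the kernel E is stored as a row-stochastic list of lists and that distributions are row vectors; the two-state kernel and the function names are hypothetical.

```python
def step(x, E):
    """One step of the chain: returns the row vector x * E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def random_walk(x0, E, j):
    """Distribution after j steps, starting from the initial distribution x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(x0, E, states):
    """Probability of a state sequence: initial probability times each transition."""
    prob = x0[states[0]]
    for a, b in zip(states, states[1:]):
        prob *= E[a][b]
    return prob

# Hypothetical 2-state kernel; each row sums to 1.
E = [[0.9, 0.1],
     [0.5, 0.5]]
x0 = [1.0, 0.0]
print(random_walk(x0, E, 2))             # approximately [0.86, 0.14]
print(path_probability(x0, E, [0, 0, 1]))  # 1.0 * 0.9 * 0.1
```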


Stationary solutions

• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:

– E is stochastic

– E is irreducible

– E is aperiodic

• To make these conditions true:

– All rows of E add up to 1 (and no value is negative)

– Make sure that E is strongly connected

– Make sure that E is not bipartite

• Example: PageRank [Brin and Page 1998]: use “teleportation”
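
One way to picture teleportation is as a mixture of the link kernel with a uniform kernel. The sketch below assumes E is already row-stochastic (pages without outgoing links are not handled) and uses a damping value of 0.85, a commonly quoted choice that is an assumption here, not something stated on the slide.

```python
def add_teleportation(E, damping=0.85):
    """Mix the row-stochastic kernel E with the uniform kernel (1/n everywhere)."""
    n = len(E)
    return [[damping * E[i][j] + (1.0 - damping) / n for j in range(n)]
            for i in range(n)]

# Hypothetical 3-node cycle 0 -> 1 -> 2 -> 0: strongly connected but periodic.
E = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
G = add_teleportation(E)
for row in G:
    print([round(x, 3) for x in row])
```

Every entry of the mixed kernel is positive, so the chain is irreducible and aperiodic, and each row still sums to 1.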


Example

[Figure: example graph with nodes 1-8]

This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

[Figure: bar charts of PageRank (vertical axis 0 to 1) for nodes 1-8, at t=0 and t=1]


Eigenvectors

• An eigenvector is an implicit “direction” for a matrix. Mv = λv, where v is non-zero, though λ can be any complex number in principle.

• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.

• For λ1, the left (principal) eigenvector is p; the right eigenvector is 1 (the all-ones vector).

• In other words, $E^{T} p = p$.


Computing the stationary distribution

Solution for the stationary distribution

$p = E^{T} p$

$(I - E^{T})\, p = 0$

function PowerStatDist (E):
begin
    p(0) = u; (or p(0) = [1,0,…,0])
    i = 1;
    repeat
        p(i) = E^T p(i-1);
        L = ||p(i) - p(i-1)||_1;
        i = i + 1;
    until L < ε
    return p(i-1)
end
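
The power method on this slide can be sketched in runnable form as follows. This is an illustration under assumptions: E is a row-stochastic list of lists, the start vector is uniform (one reading of u), and the tolerance, iteration cap, and example kernel are hypothetical.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10000):
    """Power method for the stationary distribution p satisfying p = E^T p."""
    n = len(E)
    p = [1.0 / n] * n  # p(0): uniform start (assumed meaning of u)
    for _ in range(max_iter):
        # (E^T p)[j] = sum over i of E[i][j] * p[i]
        nxt = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))  # L1 norm of the change
        p = nxt
        if diff < eps:
            break
    return p

# Hypothetical 2-state kernel; each row sums to 1.
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = power_stat_dist(E)
print([round(x, 4) for x in p])  # close to [5/6, 1/6]
```

For this kernel the balance condition 0.1 * p[0] = 0.5 * p[1] gives p = (5/6, 1/6), which is what the iteration converges to.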


Example

[Figure: example graph with nodes 1-8]

[Figure: bar charts of PageRank (vertical axis 0 to 1) for nodes 1-8, at t=0, t=1, and t=10]


How Google works

• Crawling

• Anchor text

• Fast query processing

• PageRank


More about PageRank

• Named after Larry Page, co-founder of Google (and UM alum)

• Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.

• Independent of the query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).


HITS

• Query-dependent model (Kleinberg 97)

• Hubs and authorities (e.g., cars, Honda)

• Algorithm

– obtain root set using input query

– expand the root set by radius one

– run iterations on the hub and authority scores together

– report top-ranking authorities and hubs

$a' = E^{T} h$, $h' = E a$
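
The iteration step can be sketched as below, assuming E[i][j] = 1 when page i links to page j within the expanded root set. The normalization to unit length, the fixed iteration count, and the four-page example are assumptions made for this illustration.

```python
import math

def hits(E, iterations=50):
    """Alternate a = E^T h and h = E a, rescaling both vectors to unit length."""
    n = len(E)
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub scores of the pages i that link to j.
        auth = [sum(E[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub score of i: sum of authority scores of the pages j that i links to.
        hubs = [sum(E[i][j] * auth[j] for j in range(n)) for i in range(n)]
        a_norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hubs)) or 1.0
        auth = [a / a_norm for a in auth]
        hubs = [h / h_norm for h in hubs]
    return hubs, auth

# Hypothetical link matrix: page 0 links to pages 2 and 3, page 1 links to page 2.
E = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
hubs, auth = hits(E)
print([round(h, 3) for h in hubs])
print([round(a, 3) for a in auth])
```

In this toy graph page 2 comes out as the top authority (both hubs point to it) and page 0 as the top hub (it points to both authorities).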


The link-content hypothesis

• Topical locality: a page is similar (σ) to the page that points to it (δ).

• Davison (TF*IDF, 100K pages)

– 0.31 same domain

– 0.23 linked pages

– 0.19 sibling

– 0.02 random

• Menczer (373K pages, non-linear least squares fit)

$\sigma(\delta) \approx \sigma_{\infty} + (1 - \sigma_{\infty})\, e^{-\alpha_1 \delta^{\alpha_2}}$, with $\alpha_1 = 1.8$, $\alpha_2 = 0.6$, $\sigma_{\infty} = 0.03$

• Chakrabarti (focused crawling) - prob. of losing the topic

Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001


Document closures for Q&A

[Figure: P L P diagram with the terms capital, spain, and Madrid]


Document closures for IR

[Figure: P L P diagram with the terms Physics, Physics Department, University of Michigan, and Michigan]


Language models

• Conditional probability distributions over word sequences

• Example: p(“Paris” ∈ dj) = ? p(“Paris” ∈ dj | dj on Europe) = ?

• Training models: assume a parametric form, then maximize the probability of an existing text
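
As a minimal illustration of the training bullet, the sketch below assumes the simplest parametric form, a unigram model, for which maximizing the probability of the training text reduces to relative frequencies. The training sentence and function name are hypothetical.

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram model: p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical training text of nine tokens.
text = "paris is in france and paris is in europe".split()
model = train_unigram(text)
print(model["paris"])  # 2 occurrences out of 9 tokens
```

Richer forms (bigram, trigram, or the link-conditioned models discussed next) follow the same recipe with more conditioning context.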


Link-based language models

• In the absence of other information, $p(w_i \in p) = 1/d(w_j)$

• Link information:

$p(w_i \in p \mid p_1 \rightarrow p,\; w_i \in p_1) = p(w_i \in p) \cdot R_i$

conjecture: Ri > 1


Experimental setup

• 2-Gigabyte wt2g corpus

• 247,491 Web documents

• 3,118,248 links

• 948,036 unique words (after Porter-style stemming)

• ALE (automatic link extrapolator)


Experiment one: setup

• For each stemmed word in wt2g, we compute the following numbers:

– PagesContainingWord = how many pages in the collection contain the word

– OutgoingLinks = the total number of outgoing links in all the pages that contain the word

– LinkedPagesContainingWord = how many of the linked pages contain the word

• For the latter two measures, only the links inside the collection were considered
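
The three counts can be sketched on a toy collection. The data structures here are assumptions for illustration: pages map an identifier to a set of stemmed words, links are (source, target) pairs, and linked pages are counted once per link. This is not the ALE implementation.

```python
def link_counts(pages, links, word):
    """Return (PagesContainingWord, OutgoingLinks, LinkedPagesContainingWord)."""
    containing = {pid for pid, words in pages.items() if word in words}
    # Keep only links whose source and target are both inside the collection.
    internal = [(s, t) for s, t in links if s in pages and t in pages]
    outgoing = [(s, t) for s, t in internal if s in containing]
    linked_with_word = [(s, t) for s, t in outgoing if t in containing]
    return len(containing), len(outgoing), len(linked_with_word)

# Hypothetical four-page collection; page "x" lies outside it.
pages = {
    "a": {"web", "graph"},
    "b": {"web", "search"},
    "c": {"music"},
    "d": {"web"},
}
links = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "a"), ("d", "x")]
n_pages, n_out, n_linked = link_counts(pages, links, "web")
print(n_pages, n_out, n_linked)  # 3 3 2
```

Here the link from "d" to "x" is ignored because "x" is not in the collection, matching the restriction in the last bullet.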


The link effect R

• The word “each”

p = 55654/247491 = .225

p’ = 15815/46163 = .343

R = p’/p = .343/.225 = 1.524
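
The arithmetic for “each” can be reproduced directly. The variable names follow the measures defined on the previous slide; reading 46163 as OutgoingLinks and 15815 as LinkedPagesContainingWord is an inference from those definitions.

```python
pages_containing_word = 55654
total_pages = 247491
linked_pages_containing_word = 15815
outgoing_links = 46163

p = pages_containing_word / total_pages                  # baseline probability
p_prime = linked_pages_containing_word / outgoing_links  # probability across a link
R = p_prime / p                                          # link effect
print(round(p, 3), round(p_prime, 3), round(R, 2))
```

The unrounded ratio differs slightly in the third decimal from the slide's 1.524, which divides the already rounded values .343 and .225.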


Establishing values for R

IDF 3.0                                   IDF 4.0

sorted by IDF       sorted by R           sorted by IDF        sorted by R
word      IDF       word      R           word       IDF       word       R
human     2.981     close     1.675       centuri    3.988     extend     2.085
accord    2.983     among     1.770       interact   3.990     beyond     2.477
perform   2.984     further   1.796       introduct  3.993     front      2.606
close     2.985     expect    1.864       front      3.994     centuri    2.713
press     2.992     accord    1.922       travel     3.997     elimin     2.753
applic    2.992     assist    1.962       elimin     4.009     damag      2.757
expect    2.997     human     2.093       opinion    4.013     introduct  2.843
among     2.998     perform   2.095       damag      4.017     opinion    2.984
assist    3.004     applic    2.203       beyond     4.019     travel     3.491
further   3.011     press     2.388       extend     4.021     interact   3.527


IDF rank         p        p’       IDF       R          sample words
1-100            0.4047   0.5293   1.6761    1.3639     the of make and
101-200          0.2141   0.3574   2.3803    1.6745     under go between copyright
201-300          0.1688   0.3209   2.6896    1.9047     market subject special mean
301-400          0.1386   0.2876   2.9513    2.0750     administr put establish ask
401-500          0.1192   0.2588   3.1548    2.1750     understand social hand share
501-600          0.1046   0.2426   3.3326    2.3179     prevent staff risk north
601-700          0.0934   0.2246   3.4879    2.4085     trade class size california
701-800          0.0839   0.2201   3.6354    2.6233     global drug letter softwar
801-900          0.0752   0.2004   3.7884    2.6668     sound tool monitor transport
901-1000         0.0669   0.2024   3.9499    3.0200     permit target east normal
1001-1100        0.0605   0.1823   4.0909    3.0149     approxim telephon danger europ
1101-1200        0.0548   0.1710   4.2292    3.1213     favor richard map pictur
1201-1300        0.0498   0.1752   4.3635    3.5210     professor earth english republican
1301-1400        0.0454   0.1652   4.4934    3.6366     medicin doctor church color
1401-1500        0.0416   0.1630   4.6166    3.9224     permiss agenda programm prioriti
1501-1600        0.0385   0.1508   4.7262    3.9165     prospect broadcast acquir feedback
1601-1700        0.0358   0.1450   4.8306    4.0517     temperatur florida percentag membership
1701-1800        0.0332   0.1462   4.9358    4.4050     alcohol lake crisi china
1801-1900        0.0310   0.1386   5.0363    4.4735     francisco disciplin film medium
1901-2000        0.0287   0.1379   5.1454    4.8090     entertain psycholog anticip arrest
…                …        …        …         …          …
100001-100100    0.0000   0.0642   12.4774   363.7331   sinker surmont thong undergrowth
500001-500100    0.0000   0.0215   16.2970   2658.9231  scheflin schena schendel scheriff


Linear fit for the 2000 lowest-IDF words

[Figure: plot with axes p and p’]


Cluster One

[Figure: plot with axes p and p’; labeled words: by, with, from]


Cluster Two

[Figure: plot with axes p and p’; labeled words: photo, dream, path]
