1

Uniform Sampling from the Web via Random Walks

Ziv Bar-Yossef, Alexander Berg

Steve Chien, Jittat Fakcharoenphol

Dror Weitz

University of California at Berkeley

2

Motivation: Web Measurements

• Main goal: Develop a cheap method to sample uniformly from the Web

• Use a random sample of web pages to approximate:

– search engine coverage

– domain name distribution (.com, .org, .edu)

– percentage of porn pages

– average number of links in a page

– average page length

• Note: A web page is a static html page

3

The Structure of the Web (Broder et al., 2000)

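[Figure: bow-tie diagram of the Web with regions labeled "large strongly connected component", "left side", "right side", and "tendrils & isolated regions", each marked 1/4; a further label marks the "indexable web"]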

4

Why is Web Sampling Hard?

• Obvious solution: sample from an index of all pages

• Maintaining an index of Web pages is difficult

– Requires extensive resources (storage, bandwidth)

– Hard to implement

• There is no consistent index of all Web pages

– Difficult to get complete coverage

– Takes a month to crawl/index most of the Web

– Web is changing every minute

5

Our Approach: Random Walks for Random Sampling

• Random walk on a graph provides a sample of nodes

• If the graph is undirected and regular, the sample is uniform

– Problem: The Web is neither undirected nor regular

• Our solution

– Incrementally create an undirected regular graph with the same nodes as the Web

– Perform the walk on this graph

6

Related Work

• Monika Henzinger, et al. (2000)

– Random walk produces pages distributed by Google’s PageRank.

– Weight these pages to produce a nearly uniform sample.

• Krishna Bharat & Andrei Broder (1998)

– Measured relative size and overlap of search engines using random queries.

• Steve Lawrence & Lee Giles (1998, 1999)

– Size of the web by probing IP addresses and crawling servers.

– Search engine coverage in response to certain queries.

7

Random Walks: Definitions

Probability distribution $q_t$: $q_t(v)$ = prob. that v is visited at step t

Transition matrix $A$: $q_{t+1} = q_t A$

Stationary distribution: limit of $q_t$ as t grows, if it exists and is independent of $q_0$

Mixing time: # of steps required to approach the stationary distribution

Markov process: the probability of a transition depends only on the current state

Walk step: from node v, pick any outgoing edge with equal probability; go to u.

[Figure: a single edge from v to u]
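The definitions above can be made concrete with a small sketch. It builds the transition matrix of the "pick any outgoing edge with equal probability" walk and iterates $q_{t+1} = q_t A$. The three-node graph `toy` and the function names are illustrative assumptions, not part of the talk.

```python
def transition_matrix(out_links):
    """Row-stochastic matrix A: from node v, each out-link is equally likely."""
    nodes = sorted(out_links)
    index = {v: i for i, v in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for v, targets in out_links.items():
        for u in targets:
            A[index[v]][index[u]] += 1.0 / len(targets)
    return nodes, A


def step(q, A):
    """One step of the walk: q_{t+1} = q_t A (row vector times matrix)."""
    n = len(q)
    return [sum(q[i] * A[i][j] for i in range(n)) for j in range(n)]


# Hypothetical toy graph (directed): a -> b, a -> c, b -> c, c -> a.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes, A = transition_matrix(toy)

q = [1.0, 0.0, 0.0]          # q_0: start at node "a"
for _ in range(200):         # iterate until q_t stops changing noticeably
    q = step(q, A)

stationary = dict(zip(nodes, q))
print(stationary)
```

On this toy graph the limit is not uniform: nodes "a" and "c" each receive twice the probability of "b", which is the kind of bias a plain directed walk introduces.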

8

Straightforward Random Walk on the Web

• Gets stuck in sinks and in dense Web communities

• Biased towards popular pages

• Converges slowly, if at all

Follow a random out-link at each step

[Figure: example walk over pages including netscape.com, amazon.com, and www.cs.berkeley.edu/~zivi, with labels numbered 1 to 9]

9

WebWalker: Undirected Regular Random Walk on the Web

Fact:

A random walk on a connected undirected regular graph converges to a uniform stationary distribution.

Self-loop weight: $w(v) = \deg_{\max} - \deg(v)$

Follow a random out-link or a random in-link at each step

Use weighted self loops to even out pages’ degrees

[Figure: example graph over netscape.com, amazon.com, and www.cs.berkeley.edu/~zivi, with numeric labels ranging from 0 to 5]
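As a minimal sketch of the self-loop idea, the code below treats links as undirected, adds a self loop of weight deg_max - deg(v) to every node, and checks numerically that the walk's distribution approaches uniform. The four-node edge list is a made-up example, and deg_max is taken here as the largest degree in that toy graph (an assumption made for illustration).

```python
def regularized_transition(edges):
    """Transition matrix of the walk on the undirected graph with self loops
    of weight w(v) = deg_max - deg(v) added to every node v."""
    neighbors = {}
    for u, v in edges:                      # each link is usable in both directions
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    nodes = sorted(neighbors)
    deg_max = max(len(neighbors[v]) for v in nodes)
    index = {v: i for i, v in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        w = deg_max - len(neighbors[v])     # self-loop weight
        A[index[v]][index[v]] += w / deg_max
        for u in neighbors[v]:
            A[index[v]][index[u]] += 1.0 / deg_max
    return nodes, A


def step(q, A):
    """q_{t+1} = q_t A."""
    n = len(q)
    return [sum(q[i] * A[i][j] for i in range(n)) for j in range(n)]


# Hypothetical toy graph: degrees are a=3, b=1, c=2, d=2.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("c", "d")]
nodes, A = regularized_transition(edges)

q = [1.0, 0.0, 0.0, 0.0]                    # start at "a"
for _ in range(500):
    q = step(q, A)

print(dict(zip(nodes, q)))                  # every entry approaches 1/4
```

Every row of this matrix sums to 1 and the matrix is symmetric, so the uniform distribution is stationary regardless of the individual degrees.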

10

WebWalker: Mixing Time

Theorem [Markov chain folklore]:

A random walk’s mixing time is at most $\log(N) / (\lambda_1 - \lambda_2)$

where N = size of the graph

$\lambda_1 - \lambda_2$ = eigenvalue gap of the transition matrix

Experiment (using an extensive Alexa crawl of the web from 1996)

WebWalker’s eigenvalue gap: $\lambda_1 - \lambda_2 \approx 10^{-5}$

Result: WebWalker’s mixing time is 3.1 million steps

• Self loop steps are free

• Only 1 in 30,000 steps is not a self loop step ($\deg_{\max} \approx 3 \times 10^5$, $\deg_{avg} = 10$)

Result: WebWalker’s actual mixing time is only 100 steps!
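The step from 3.1 million to roughly 100 is simple arithmetic on the figures quoted on this slide; the short sketch below reproduces it. The variable names are illustrative, and the calculation assumes a page of average degree, as the slide's "1 in 30,000" figure does.

```python
deg_max = 3e5                    # maximum degree quoted on the slide
deg_avg = 10                     # average degree quoted on the slide
nominal_mixing_steps = 3.1e6     # mixing time counting self-loop steps

# At an average page only deg_avg out of deg_max edge choices leave the page,
# so about one step in deg_max / deg_avg is not a self loop.
steps_per_real_move = deg_max / deg_avg

# Self-loop steps cost nothing (no page is fetched), so only real moves count.
actual_mixing_steps = nominal_mixing_steps / steps_per_real_move

print(steps_per_real_move, round(actual_mixing_steps))
```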

11

WebWalker: Mixing Time (cont.)

• Mixing time on the current Web may be similar

– Some evidence that the structure of the Web today is similar to the structure in 1996 (Kumar et al., 1999, Broder et al., 2000)

12

WebWalker: Realization (1)

Problems

• The in-links of v are not available

• deg(v) is not available

Partial sources of in-links:

• Previously visited nodes

• Reverse link services of search engines

WebWalker(v):

• Spend expected $\deg_{\max}/\deg(v)$ steps at v

• Pick a random link incident to v (either v → u or u → v)

• WebWalker(u)
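The WebWalker(v) procedure above can be sketched as a loop, under the assumption that the incident links of every page are already known. The time spent at v is drawn from a geometric distribution with mean deg_max/deg(v), which is one way to realize "spend expected deg_max/deg(v) steps"; the names `neighbors`, `geometric_stay`, and `webwalker`, and the toy graph, are hypothetical.

```python
import random


def geometric_stay(deg_v, deg_max, rng):
    """Steps spent at v, counting the step that leaves: geometric with
    mean deg_max / deg_v (each step leaves with probability deg_v / deg_max)."""
    p_leave = deg_v / deg_max
    steps = 1
    while rng.random() >= p_leave:
        steps += 1
    return steps


def webwalker(neighbors, start, deg_max, total_steps, rng):
    """Simulate the regularized walk and count the time spent at each node.
    Assumes every node has at least one incident link."""
    visits = {v: 0 for v in neighbors}
    v, t = start, 0
    while t < total_steps:
        stay = min(geometric_stay(len(neighbors[v]), deg_max, rng), total_steps - t)
        visits[v] += stay                  # self-loop steps need no page fetch
        t += stay
        v = rng.choice(neighbors[v])       # random incident link (in- or out-link)
    return visits


# Hypothetical undirected toy graph given as incident-link lists.
toy = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a", "d"], "d": ["a", "c"]}
total = 200_000
visits = webwalker(toy, "a", deg_max=3, total_steps=total, rng=random.Random(0))
fractions = {v: n / total for v, n in visits.items()}
print(fractions)
```

Counting time per node is only a way to inspect the walk here; to draw a sample one would instead record the node occupied after enough steps.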

13

WebWalker: Realization (2)

• WebWalker uses only available links:

– out-links

– in-links from previously visited pages

– first r in-links returned from the search engines

• WebWalker walks on a sub-graph of the Web

– sub-graph induced by available links

– to ensure consistency: as soon as a page is visited its incident edge list is fixed for the rest of the walk
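One possible reading of the consistency rule is sketched below: a page's incident list is computed once, on its first visit, and never changes afterwards. The callables `fetch_out_links` and `search_in_links` are hypothetical stand-ins for page parsing and a search engine's reverse-link query. Dropping links to already-visited pages whose fixed lists do not contain the new page is an assumption made here to keep the sub-graph undirected; the slides do not spell out that detail.

```python
class AvailableLinks:
    """Incident edge lists that are fixed as soon as a page is first visited."""

    def __init__(self, fetch_out_links, search_in_links, r):
        self.fetch_out_links = fetch_out_links    # hypothetical: page -> out-links
        self.search_in_links = search_in_links    # hypothetical: page -> in-links from an engine
        self.r = r                                # use only the first r engine results
        self.fixed = {}                           # visited page -> frozen incident list

    def incident(self, page):
        if page in self.fixed:                    # already visited: list never changes
            return self.fixed[page]
        candidates = list(self.fetch_out_links(page))
        candidates += list(self.search_in_links(page))[: self.r]
        # Links to pages not yet visited are always usable.
        links = [u for u in candidates if u not in self.fixed and u != page]
        # A visited page is a neighbor only if its fixed list already contains
        # this page (assumption: this keeps every edge usable in both directions).
        links += [u for u, lst in self.fixed.items() if page in lst]
        self.fixed[page] = list(dict.fromkeys(links))   # de-duplicate, keep order
        return self.fixed[page]


# Hypothetical toy data: a -> b and c -> a; the engine knows that c links to a.
out = {"a": ["b"], "b": [], "c": ["a"]}
engine = {"a": ["c"], "b": ["a"], "c": []}

with_engine = AvailableLinks(out.get, engine.get, r=1)
order = [with_engine.incident(p) for p in ("a", "b", "c")]

without_engine = AvailableLinks(out.get, engine.get, r=0)
order_r0 = [without_engine.incident(p) for p in ("a", "b", "c")]
print(order, order_r0)
```

With r = 0 the link c → a is lost, because a's list was fixed before c was discovered; with r = 1 the engine reveals it in time. This illustrates why the induced sub-graph can miss some of the Web's edges.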

14

WebWalker: Example

[Figure: the Web graph (nodes v1 to v6 and w) shown next to WebWalker’s induced sub-graph (nodes v1 to v6); legend: covered by search engines, not covered by search engines, available link, non-available link]

15

WebWalker: Bad News

• WebWalker becomes a true random walk only after its induced sub-graph “stabilizes”

• Induced sub-graph is random

• Induced sub-graph misses some of the nodes

• Eigenvalue gap analysis does not hold anymore

16

WebWalker: Good News

• WebWalker eventually converges to a uniform distribution on the nodes of its induced sub-graph

• WebWalker is a “close approximation” of a random walk much before the sub-graph stabilizes

• Theorem: WebWalker’s induced sub-graph is guaranteed to eventually cover the whole indexable Web.

• Corollary: WebWalker can produce uniform samples from the indexable Web.

17

Evaluation of WebWalker’s Performance

Questions to address in experiments:

• Structure of induced sub-graphs

• Mixing time

• Potential bias in early stages of the walk:

– towards high degree pages

– towards the search engines

– towards the starting page’s neighborhood

18

WebWalker: Evaluation Experiments

• Run WebWalker on the 1996 copy of the Web

– 37.5 million pages

– 15 million indexable pages

– $\deg_{avg} = 7.15$

– $\deg_{\max} = 300{,}000$

• Designate a fraction p of the pages as the search engine index

• Use WebWalker to generate a sample of 100,000 pages

• Check the resulting sample against the actual values
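The "check against actual values" step can be illustrated with a sketch in which the designated index is known exactly, so an estimate from a sample can be compared with the true fraction p. An idealized uniform sample stands in for WebWalker's output here; the page ids, sizes, and function name are assumptions chosen for illustration.

```python
import random


def estimate_index_fraction(sample, index):
    """Fraction of sampled pages that belong to the designated index."""
    return sum(1 for page in sample if page in index) / len(sample)


rng = random.Random(0)
num_pages = 100_000                     # hypothetical page ids 0 .. num_pages - 1
p = 0.3                                 # designated search engine size

# Designate a fraction p of the pages as the search engine index.
index = set(rng.sample(range(num_pages), int(p * num_pages)))

# Stand-in for the walk's output: an idealized uniform sample of pages.
sample = [rng.randrange(num_pages) for _ in range(10_000)]

estimate = estimate_index_fraction(sample, index)
print(f"true size {p:.0%}, estimated {estimate:.1%}")
```

A sampler that is biased towards the search engines would produce an estimate noticeably above p, which is what the evaluation looks for.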

19

Evaluation: Bias towards High Degree Nodes

[Figure: percent of nodes from the walk, by deciles of nodes ordered by degree (high degree to low degree)]

20

Evaluation: Bias towards the Search Engines

[Figure: estimate of search engine size, for search engine sizes of 30% and 50%]

21

Evaluation: Bias towards the Starting Node’s Neighborhood

[Figure: percent of nodes from the walk, by deciles of nodes ordered by distance from the starting node (close to the starting node to far from the starting node)]

22

WebWalker: Experiments on the Web

• Run WebWalker on the actual Web

• Two runs of 34,000 pages each

• Dates: July 8, 2000 - July 15, 2000

• Used four search engines for reverse links:

– AltaVista, HotBot, Lycos, Go

23

Domain Name Distribution

[Figure: bar chart of the share of pages (axis 0% to 60%) per domain: com, org, edu, net, de, jp, uk, us, gov, ch, ca, fr, tw, au, mil; series: WebWalker (70,000), Henzinger et al. Walk (2 million), Henzinger et al. Crawl (80 million), Inktomi Crawl (1 billion)]

24

Search Engine Coverage

[Figure: bar chart of search engine coverage (axis 0% to 80%); bar values and labels in chart order: Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%]

25

Web Page Parameters

• Average page size: 8,390 Bytes

• Average # of images on a page: 9.3 Images

• Average # of hyperlinks on a page: 15.6 Links

26

Conclusions

• Uniform sampling of Web pages by random walks

• Good news:

– walk provably converges to a uniform distribution

– easy to implement and run with few resources

– encouraging experimental results

• Bad news:

– no theoretical guarantees on the walk’s mixing time

– some biases towards high degree nodes and the search engines

• Future work:

– obtain a better theoretical analysis

– eliminate biases

– deal with dynamic content

27

Thank You!
