Download - Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining

Ranking Web Sites with Real User Traffic

Mark Meiss, Filippo Menczer, Santo Fortunato

Alessandro Flammini, Alessandro Vespignani

Web Search and Data Mining, Stanford, California, February 11, 2008

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Sources for Ranking Data: The Link Graph

Sources for Ranking Data: Dynamic Sources

• Network flow data

• Web server logs

• Toolbars and plugins

ISP

~100 K users

Sources for Ranking Data: Packet Inspection

Data Collection

[Diagram: data collection pipeline]

• HTTP (80), 30% @ peak

• anonymizer

• GET requests from IU only

• Fields per request: Host, Path, Referer, User-Agent, Timestamp

• FULL: h/p/r/a/t

• HUMAN: h/p/r/a/t
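A minimal sketch of how captured requests might be aggregated into a weighted host graph. The record keys `host` and `referer`, the function name `build_host_graph`, and the sample hosts are hypothetical; the slides only state which header fields were captured. Requests with an empty (or unparseable) referrer are counted separately here as jumps, which is an assumption about how such requests would be treated.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_host_graph(requests):
    """Aggregate (referer host -> requested host) pairs into edge weights.

    `requests` is an iterable of dicts with hypothetical keys "host" and
    "referer". Requests without a usable referrer are tallied as jumps.
    """
    weights = Counter()  # (source_host, target_host) -> number of clicks
    jumps = Counter()    # target_host -> requests with an empty referrer
    for req in requests:
        target = req["host"].lower()
        referer = req.get("referer") or ""
        source = urlsplit(referer).hostname  # None when referrer is empty
        if source is None:
            jumps[target] += 1
        else:
            weights[(source, target)] += 1
    return weights, jumps


# Illustrative records only; not data from the study.
sample = [
    {"host": "example.org", "referer": "http://search.example.com/?q=x"},
    {"host": "example.org", "referer": "http://search.example.com/?q=y"},
    {"host": "news.example.net", "referer": ""},
]
weights, jumps = build_host_graph(sample)
```

Summing `weights` over sources or targets would then give per-site traffic totals, and the individual entries correspond to link traffic.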

Structural properties: Degree

Caveat: Sampling Bias

Structural properties: Strength (Site Traffic)

Structural properties: Weights (Link Traffic)

Behavioral patterns (HUMAN)

(Proportion of total out-strength)

• Empty referrer: 54%

• Search: 5%

• Other: 40%

• Webmail: 1%

Ratios are stable

[Figure: requests (x 10^6) and percentage shares (0% to 100%) by month, Sep06 to May07]

[Figure: requests (x 10^6) and percentage shares (0% to 100%) by hour of day, 0 to 23]

Validation of PageRank

• PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph

• Compare with actual site traffic (in-strength)

• From an application perspective, we care about the resulting ranking of sites rather than the actual values

Kendall’s Rank Correlation
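As an illustration of comparing two rankings of the same sites, here is a small sketch of Kendall's tau. It assumes the tau-a variant (tied pairs count as neither concordant nor discordant) and a simple O(n^2) pair loop; the slides do not say which variant or algorithm was used, and the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif sign < 0:
            discordant += 1   # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the two score lists order every pair of sites identically, and -1 means every pair is reversed.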

PageRank Assumptions

1. Equal probability of teleporting to each of the nodes

2. Equal probability of teleporting from each of the nodes

3. Equal probability of following each link from any given node

$$PR_W(j) = \frac{\alpha}{N} + (1-\alpha) \sum_{i:\, w_{ij} > 0} \frac{w_{ij}}{s_{out}(i)}\, PR_W(i)$$
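A weighted PageRank, in which each link is followed with probability proportional to its traffic weight instead of uniformly, could be sketched by power iteration as follows. The jump probability `alpha=0.15`, the fixed iteration count, and the uniform redistribution of rank from nodes with no outgoing traffic are assumptions of this sketch, not values taken from the slides; `weighted_pagerank` is a hypothetical name.

```python
def weighted_pagerank(weights, alpha=0.15, iterations=100):
    """Power iteration for a PageRank whose link choice follows edge weights.

    `weights` maps (i, j) to a non-negative weight w_ij.
    """
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)
    s_out = {u: 0.0 for u in nodes}
    for (i, _j), w in weights.items():
        s_out[i] += w
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes without out-links is spread uniformly.
        dangling = sum(pr[u] for u in nodes if s_out[u] == 0)
        nxt = {u: alpha / n + (1 - alpha) * dangling / n for u in nodes}
        for (i, j), w in weights.items():
            if w > 0:
                nxt[j] += (1 - alpha) * (w / s_out[i]) * pr[i]
        pr = nxt
    return pr


# Illustrative weights only: "a" sends three times more traffic to "b" than to "c".
ranks = weighted_pagerank({("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 1, ("c", "a"): 1})
```

Setting all weights on a node's links equal recovers the usual assumption that each link is followed with equal probability.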

Kendall’s Rank Correlation

Local Link Heterogeneity

HH index of concentration or disparity:

$$Y_i = \sum_j \left( \frac{w_{ij}}{s_{out}(i)} \right)^2$$

[Figure: reference lines labeled "perfect concentration" and "perfect homogeneity"]
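The concentration index of each node's outgoing traffic can be computed directly from the link weights. This is a sketch with a hypothetical function name `hh_index` and made-up weights: a node whose traffic is split evenly over k links gets 1/k, and a node that sends everything over one link gets 1.

```python
from collections import defaultdict


def hh_index(weights):
    """Return Y_i = sum_j (w_ij / s_out(i))**2 for every node with out-traffic."""
    s_out = defaultdict(float)
    for (i, _j), w in weights.items():
        s_out[i] += w
    y = defaultdict(float)
    for (i, _j), w in weights.items():
        if s_out[i] > 0:
            y[i] += (w / s_out[i]) ** 2
    return dict(y)


# Illustrative weights only.
y = hh_index({("a", "b"): 1, ("a", "c"): 1, ("h", "x"): 5})
```

Here node "a" splits its traffic evenly over two links and node "h" uses a single link.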

Teleportation Target Heterogeneity

Teleportation Source Heterogeneity (“hubness”)

• s_out < s_in: teleport sources, browsing sinks

• s_out > s_in: popular hubs

[Figure annotation: -2]

Navigation vs. Jumps: Sources of Popularity

Temporal patterns

How predictable are traffic patterns?

-- Cache refreshing (e.g. proxies)

-- Capacity allocation (e.g. peering and provisioning for spikes)

-- Site design (e.g. expose content based on time of day)

• Predict future host graph (clicks) from current one, as a function of delay

• Generalized temporal precision and recall:

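$$R(\delta) = \left\langle \frac{\sum_{ij} \min\left[ w_{ij}(t),\, w_{ij}(t+\delta) \right]}{\sum_{ij} w_{ij}(t+\delta)} \right\rangle_{t \in T}$$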

Temporal patterns

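$$P(\delta) = \left\langle \frac{\sum_{ij} \min\left[ w_{ij}(t),\, w_{ij}(t+\delta) \right]}{\sum_{ij} w_{ij}(t)} \right\rangle_{t \in T}$$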
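One reading of generalized precision and recall for weighted graphs is sketched below for a single pair of snapshots. It assumes the earlier graph serves as the prediction, that the overlap on each link is the minimum of the two weights, that precision normalizes by the predicted traffic, and that recall normalizes by the observed traffic; averaging over many start times is left out. The function name and sample weights are hypothetical.

```python
def overlap_precision_recall(predicted, actual):
    """Weighted precision and recall between two edge-weight dictionaries."""
    overlap = sum(min(w, actual.get(edge, 0)) for edge, w in predicted.items())
    total_predicted = sum(predicted.values())
    total_actual = sum(actual.values())
    precision = overlap / total_predicted if total_predicted else 0.0
    recall = overlap / total_actual if total_actual else 0.0
    return precision, recall


# Illustrative weights only: clicks per link in an earlier and a later hour.
earlier = {("a", "b"): 4, ("a", "c"): 1}
later = {("a", "b"): 2, ("b", "c"): 2}
precision, recall = overlap_precision_recall(earlier, later)
```

With unit weights this reduces to the usual set-based precision and recall over links.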

HUMAN host graph (FULL is about 10% more predictable)

Summary

•Heterogeneity: incoming and outgoing site traffic, link traffic

• Less than half of traffic is from following links

•Only 5% of traffic is directly from search engines

•High temporal regularity

•PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated

Next

•Sampling bias and search bias

•From host graph to page graph

•Modeling traffic: Beyond random walk?

THANKS!

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Vespignani

Alessandro Flammini CNLL

??
