Ranking Web Sites with Real User Traffic
Mark Meiss, Filippo Menczer, Santo Fortunato,
Alessandro Flammini, Alessandro Vespignani
Web Search and Data Mining, Stanford, California, February 11, 2008
Outline
•Data collection
•Structural properties
•Behavioral patterns
•PageRank validation
•Temporal patterns
Sources for Ranking Data: The Link Graph
Sources for Ranking Data: Dynamic Sources
• Network flow data
• Web server logs
• Toolbars and plugins
ISP
~100 K users
Sources for Ranking Data: Packet Inspection
Data Collection
• HTTP (80), 30% @ peak
• GET requests from IU only
• Fields: Host, Path, Referer, User-Agent, Timestamp
• Anonymizer
• Data sets: FULL (h/p/r/a/t) and HUMAN (h/p/r/a/t)
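A minimal sketch of how captured requests could be turned into a weighted host graph of clicks. It assumes each record has already been reduced to a (host, referer) pair and that a click is an edge from the referring host to the requested host; the names `host_of` and `build_host_graph` are hypothetical and not taken from the talk.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host of a URL, or None if it has none."""
    if not url:
        return None
    host = urlsplit(url).hostname
    return host.lower() if host else None


def build_host_graph(requests):
    """Count clicks per (referring host, target host) pair.

    `requests` is an iterable of (host, referer) pairs. Requests whose
    referer is empty or unparsable have no source host, so they are
    counted separately instead of becoming edges.
    """
    weights = Counter()
    no_referer = 0
    for host, referer in requests:
        src = host_of(referer)
        if src is None:
            no_referer += 1
        else:
            weights[(src, host.lower())] += 1
    return weights, no_referer
```

The returned `weights` mapping plays the role of the link traffic w_ij used in the later slides.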
Structural properties: Degree
Caveat: Sampling Bias
Structural properties: Strength (Site Traffic)
Structural properties: Weights (Link Traffic)
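The three structural quantities above can be illustrated with a short sketch. It assumes the host graph is stored as a mapping from (source, target) pairs to click counts; that layout and the function name `degree_and_strength` are assumptions made for illustration.

```python
from collections import defaultdict


def degree_and_strength(weights):
    """Compute degree and strength for every site.

    `weights` maps (src, dst) -> click count (link traffic).
    Degree counts distinct links; strength sums the traffic on them.
    """
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (src, dst), w in weights.items():
        k_out[src] += 1   # one more outgoing link
        k_in[dst] += 1    # one more incoming link
        s_out[src] += w   # traffic leaving src
        s_in[dst] += w    # traffic arriving at dst (site traffic)
    return k_in, k_out, s_in, s_out
```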
Behavioral patterns (HUMAN)
(Proportion of total out-strength)
Empty Referrer: 54%
Search: 5%
Other: 40%
Webmail: 1%
Ratios are stable
[Figure: requests (x 10^6) and shares (0% to 100%) by month, Sep06 to May07]
Ratios are stable
[Figure: requests (x 10^6) and shares (0% to 100%), x-axis 0 to 23]
Validation of PageRank
• PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph
• Compare with actual site traffic (in-strength)
• From an application perspective, we care about the resulting ranking of sites rather than the actual values
Kendall’s Rank Correlation
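As an illustration of comparing two rankings, here is a small sketch of Kendall's tau in its simplest form (tau-a, where tied pairs count as neither concordant nor discordant). Whether the talk used this variant or a tie-corrected one is not stated, so treat the choice as an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists.

    A pair of sites is concordant when both lists order it the same way
    and discordant when they disagree; tied pairs contribute nothing.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `x` could hold PageRank values and `y` the measured in-strength of the same sites, in the same order. The quadratic loop is fine for a sketch but too slow for a full host graph.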
PageRank Assumptions
1. Equal probability of teleporting to each of the nodes
2. Equal probability of teleporting from each of the nodes
3. Equal probability of following each link from any given node
$$PR_W(j) = \frac{\alpha}{N} + (1 - \alpha) \sum_{i:\, w_{ij} > 0} \frac{w_{ij}}{s_{out}(i)} \, PR_W(i)$$
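A hedged sketch of a weighted PageRank computed by power iteration, in which the walker follows a link from i to j with probability w_ij / s_out(i) instead of choosing uniformly among links. The default jump probability of 0.15, the fixed iteration count, and the uniform redistribution of rank from sites with no outgoing traffic are assumptions of this sketch, not details from the talk.

```python
def weighted_pagerank(weights, alpha=0.15, iterations=100):
    """Weighted PageRank by power iteration.

    `weights` maps (src, dst) -> link traffic w_ij (non-negative).
    `alpha` is the teleportation (jump) probability.
    """
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)
    s_out = dict.fromkeys(nodes, 0.0)
    for (src, _dst), w in weights.items():
        s_out[src] += w

    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Assumption: rank held by sites without outgoing traffic
        # is spread uniformly over all sites.
        dangling = sum(rank[v] for v in nodes if s_out[v] == 0)
        new = dict.fromkeys(nodes, alpha / n + (1 - alpha) * dangling / n)
        for (src, dst), w in weights.items():
            if w > 0:
                new[dst] += (1 - alpha) * rank[src] * w / s_out[src]
        rank = new
    return rank
```

Setting every weight to 1 recovers the equal-probability link choice of the third PageRank assumption.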
Kendall’s Rank Correlation
Local Link Heterogeneity
HH index of concentration or disparity:
$$Y_i = \sum_j \left( \frac{w_{ij}}{s_{out}(i)} \right)^2$$
[Figure labels: perfect concentration; perfect homogeneity]
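The disparity measure for one site can be sketched as the sum of squared traffic shares over its outgoing links. The function name `hh_index` and the list-of-weights input are assumptions made for illustration.

```python
def hh_index(out_weights):
    """Concentration of a site's outgoing traffic.

    `out_weights` lists the traffic w_ij on each outgoing link of site i.
    The result is 1 when all traffic uses a single link and 1/k when it is
    spread evenly over k links.
    """
    total = sum(out_weights)
    if total <= 0:
        raise ValueError("site has no outgoing traffic")
    return sum((w / total) ** 2 for w in out_weights)
```

The two extremes correspond to the "perfect concentration" and "perfect homogeneity" cases.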
Teleportation Target Heterogeneity
Teleportation Source Heterogeneity (“hubness”)
[Figure labels: s_out < s_in; s_out > s_in; teleport sources; browsing sinks; popular hubs; -2]
Navigation vs. Jumps: Sources of Popularity
Temporal patterns
How predictable are traffic patterns?
-- Cache refreshing (e.g. proxies)
-- Capacity allocation (e.g. peering and provisioning for spikes)
-- Site design (e.g. expose content based on time of day)
• Predict future host graph (clicks) from current one, as a function of delay
• Generalized temporal precision and recall:
$$R(\delta) = \left\langle \frac{\sum_{ij} \min[w_{ij}(t), w_{ij}(t+\delta)]}{\sum_{ij} w_{ij}(t+\delta)} \right\rangle_{t \in T}$$
Temporal patterns
$$P(\delta) = \left\langle \frac{\sum_{ij} \min[w_{ij}(t), w_{ij}(t+\delta)]}{\sum_{ij} w_{ij}(t)} \right\rangle_{t \in T}$$
HUMAN host graph (FULL is about 10% more predictable)
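A rough sketch of generalized temporal precision and recall, treating the current host graph as a prediction of the graph observed `delay` steps later. It assumes the overlap of two graphs is the sum over links of the smaller of the two click counts, that precision divides by current traffic and recall by future traffic, and that results are averaged over all start times. The snapshot layout and function names are hypothetical.

```python
def overlap(current, future):
    """Traffic shared by two snapshots: sum of min(w_ij) over links."""
    return sum(min(w, future.get(edge, 0)) for edge, w in current.items())


def precision_recall(snapshots, delay):
    """Average precision and recall of predicting t + delay from t.

    `snapshots` is a time-ordered list of dicts mapping (src, dst) to
    click counts; every snapshot is assumed to be non-empty.
    """
    pairs = [
        (snapshots[t], snapshots[t + delay])
        for t in range(len(snapshots) - delay)
    ]
    if not pairs:
        raise ValueError("delay is too large for the number of snapshots")
    precisions, recalls = [], []
    for current, future in pairs:
        shared = overlap(current, future)
        precisions.append(shared / sum(current.values()))  # predicted clicks that recur
        recalls.append(shared / sum(future.values()))      # future clicks that were predicted
    return sum(precisions) / len(pairs), sum(recalls) / len(pairs)
```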
Summary
•Heterogeneity: incoming and outgoing site traffic, link traffic
• Less than half of traffic is from following links
•Only 5% of traffic is directly from search engines
•High temporal regularity
•PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated
Next
•Sampling bias and search bias
•From host graph to page graph
•Modeling traffic: Beyond random walk?
THANKS!
Mark Meiss
Filippo Menczer
Santo Fortunato
Alessandro Vespignani
Alessandro Flammini CNLL