The Web Changes Everything: Understanding the Dynamics of Web Content

Post on 23-Feb-2016

The Web Changes Everything: Understanding the Dynamics of Web Content

Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas
University of Washington, Microsoft Research, and Carnegie Mellon University
WSDM’09

Who Cares About Web Change?

• Revisitation
  – Monitoring
• Page Structure
  – Fragility
• Dynamic language
  – Search engine design

Quantifying Change

• Dynamics of the Web is well researched
  – Fetterly et al., (150 million pages), 65% stay the same
  – Koehler et al., (5 years), stabilization
  – Ntoulas et al., (turnover), 50% new content a year
  – And many others (see the paper for a summary)
• But: eye towards systems issues
  – Crawl rates, indexing, storage needs, etc.
  – Always random samples
• What about the visited Web?
  – Slow (every day at best)

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
  – Basic change behavior
  – Page evolution
  – Text changes
  – DOM level changes
• Applications

Behavior Driven Sampling
• Can we measure dynamics of the actually used Web?
• Usage Logs
  – Live Toolbar
    • 600k from August of ’06
    • Subset of total

[Figure: URL sampling. Labels: 468 (avg), 650 (med); × 120 = 54788; all crawlable, min 2 users, 2 times]

Sampling URLs

Inter-arrival time

Unique Users (popularity)

Visits Per User

Full details: Adar et al., CHI08

Behavior Driven Sampling
• Usage Logs
  – Sampled URLs
    • Around 55k (use the 40k that had revisits in May/June)
    • Crawled hourly (and sub-hourly) for a year
      – May/June ’07

URL Annotations

• Visitation properties
  – Revisits, popularity, etc.
• Broad type
  – News, Sports, Personal, Adult, etc.
• Structural location
  – Top level page or deep within site?

Basic measures of change

[Figure: timeline showing page version 1 followed by page version 2]

• How long? (inter-version time)
• How much?
  – Dice: $\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$
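The Dice measure above can be sketched in a few lines. This is a minimal illustration, assuming each page version is reduced to a set of lowercased, whitespace-separated words; the slides do not state the tokenization, and `words` and `dice` are hypothetical names.

```python
def words(text):
    """Assumed tokenization: lowercase, split on whitespace, keep unique words."""
    return set(text.lower().split())


def dice(a, b):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for two sets.

    Two empty sets are treated as identical (1.0) to avoid dividing by zero.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


version_1 = words("home news sports weather")
version_2 = words("home news sports markets")
similarity = dice(version_1, version_2)  # 3 shared words out of 4 + 4
```

A value of 1 means the two versions share all their words; values near 0 mean almost nothing survived between versions.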

66% displayed change in 5 week sample (every 123 hours on average)

Random web: 35% change after 11 weeks

Average Inter-version Time by Page Popularity

[Figure: average hours to change (y-axis: hours, 0 to 150) as a function of number of visitors (x-axis bins: 2, 3-6, 7-38, 39+)]

More visitors = faster change

Average Inter-version Time by Page “Depth”

[Figure: hours per change (y-axis: hours, 0 to 250) as a function of URL depth (x-axis: 5+, 4, 3, 2, 1, 0)]

More shallow (closer to homepage) = faster change

Change Plot by Type

[Figure: mean Dice coefficient (y-axis, 0.5 to 0.95) versus mean inter-version time in hours (x-axis, 0 to 250), with points labeled Industry/Trade, Music, Adult, Personal Pages, Sports/Recreation, and News/Magazine]

Inter-Version Distribution

Sub-hourly crawls

• Over 60% of pages displayed some change when crawled every 60 minutes.

• What is the “true” change rate of the page?

Sub-hourly crawls

• Round-robin crawling

• 8 samples over 3 (week)days
• shifted by at least 4 hours

[Figure: crawl setup. Labels: controller, original crawl 1, original crawl 2, 2 minute delay, 16 minute delay, 32 minute delay, 60 minute delay]

Range of Changes in Sub-hourly crawls

[Figure: number of pages (0 to 40,000) by crawl delay (0, 2, 16, 32, and 60 minutes); series: “At least once” and “Change every sample”; data labels: 6%, 11%, 9%, 19%, 11%, 23%, 42%, 66%, 12%, 24%; secondary axis: mean Dice (0.8 to 0.95)]

“623 Users Online”
“Page generated in .6 ms”
“Served to IP address…”

Measuring change

[Figure: Dice (y-axis) versus time in hours (x-axis), with page versions at t0, t1, t2, t3, t4, t5]

• Pages are equally (dis)similar
• Similarity based on
  – navigation elements
  – base language model

Two Segment Model
• 2 segment (linear) – hockey stick

[Figure: two-segment change curve. Labels: knot point; dynamic versus static; steady state; time at which proportion of dynamic to static remains constant]

Calculating the Knot Point

• Optimization problem

Knot point
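The slide only states that finding the knot point is an optimization problem. As a hedged sketch, one common way to fit a two-segment (hockey stick) curve is to try each sample time as the knot, fit a least-squares line to each side, and keep the knot with the smallest total squared error. The objective and the names `fit_error` and `find_knot` are assumptions for illustration, not the authors' stated method.

```python
def fit_error(xs, ys):
    """Sum of squared residuals of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def find_knot(times, values):
    """Return the sample time that best splits the curve into two line segments.

    Both segments include the candidate knot sample, so each has at least
    two points. Returns None when there are fewer than three samples.
    """
    best_error, best_knot = None, None
    for i in range(1, len(times) - 1):
        total = (fit_error(times[:i + 1], values[:i + 1])
                 + fit_error(times[i:], values[i:]))
        if best_error is None or total < best_error:
            best_error, best_knot = total, times[i]
    return best_knot


# Synthetic curve: Dice falls for five hours, then stays flat.
hours = list(range(10))
dice_values = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5]
knot = find_knot(hours, dice_values)
```

This brute-force search is quadratic in the number of samples, which is fine for a single change curve of a few hundred points.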

Types of Change Curves

• 3 main types
  – Knotted (two-segment)
  – Sloped
  – Unchanging
• Automatic classification (93% accuracy*)
  – 70% are knotted
    • 145 hours mean, 92 median
  – 28% sloped
  – 2% unchanging (flat)

*Consistent with the proportions of hand labeled data

Change curves

Different stable segments: different ratios of dynamic to stable content

[Figure: change curves for http://www.nytimes.com and http://www.allrecipes.com]

Change curves

[Figure: Dice (y-axis, .4 to 1) versus hours (x-axis, 10 to 40), with curves labeled AK (Craigslist, Anchorage, AK) and LA (Craigslist, Los Angeles, CA)]

Nature of the Text

What terms vanish here?

Or are still here?

Term Longevity Plot

[Figure: term longevity plot; x-axis: time (Sep., Oct., Nov., Dec.)]

• Term level representation of change curve
  – Pick a vertical (t0)
  – Compare overlap of terms to next vertical

Features of Terms

• Divergence
  – Which terms distinguish current document from the collection (at a point in time)
  – $\mathrm{Div}(w, D) = P(w \mid D)\,\log \frac{P(w \mid D)}{P(w \mid C)}$
• Staying power (σ)
  – Likelihood of observing a word (w) at two different times, t and t+α in document D
  – $\sigma(w, D) \approx P(t)\,P(\alpha)\,P(w \mid D_t, D_{t+\alpha})$
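Both term features can be sketched under simple assumptions. For divergence, the sketch uses maximum-likelihood word frequencies and assumes the collection contains the document, so P(w|C) is never zero for a word in D. For staying power, it assumes uniform P(t) and P(α) and treats P(w|D_t, D_{t+α}) as an indicator that the word appears in both versions; the slides do not specify these estimates, and all names here are hypothetical.

```python
import math
from collections import Counter


def divergence(word, doc_tokens, collection_tokens):
    """Div(w, D) = P(w|D) * log(P(w|D) / P(w|C)) with frequency estimates.

    Assumes the collection includes the document's tokens, so P(w|C) > 0
    whenever P(w|D) > 0.
    """
    p_doc = Counter(doc_tokens)[word] / len(doc_tokens)
    if p_doc == 0:
        return 0.0
    p_coll = Counter(collection_tokens)[word] / len(collection_tokens)
    return p_doc * math.log(p_doc / p_coll)


def staying_power(word, versions, max_alpha):
    """Fraction of (t, t + alpha) pairs, 1 <= alpha <= max_alpha, where the
    word appears in both versions. `versions` is a time-ordered list of
    term sets; uniform weights over t and alpha are an assumption.
    """
    pairs = [(t, t + alpha)
             for alpha in range(1, max_alpha + 1)
             for t in range(len(versions) - alpha)]
    if not pairs:
        return 0.0
    hits = sum(1 for t, u in pairs if word in versions[t] and word in versions[u])
    return hits / len(pairs)


doc = ["bbq", "pork", "home", "home"]
collection = doc + ["home"] * 4 + ["search"] * 4
snapshots = [{"home", "bbq"}, {"home", "pork"}, {"home", "salads"}, {"home", "bbq"}]
```

In this toy data, "bbq" is distinctive for the document but short-lived across snapshots, while "home" is persistent but not distinctive.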

Distribution of terms by staying power (σ)

Low staying power (allrecipes.com)
• High Div.: bbq, salads, sandwiches, pork, cheese, cool

High staying power (allrecipes.com)
• cooks, cookbooks, ingredient, desserts, home, you, search… (labels: High Div., Low Div.)

DOM Level Changes

• How long does structure hold?
• Applications with assumed stability
  – Programming by Demonstration (PbD)
  – Mashups
  – Scrapers, etc.

[UIST08] Adar et al., “Zoetrope: Interacting with the Ephemeral Web”

DOM Structure

Tree Isomorphism

• The “general” approach:
  – Compare the DOM structure of 2 trees
  – Produces alignment, edit distances, etc. [Grandi’04]
  – But: somewhat inefficient for large scale
• We want:
  – A method for comparing many (1000s) of versions of the same page at the same time

The Idea

• Serialize each DOM structure

@time = 0: <a>foo <b>bar</b></a>
@time = 1: <a>foo <b>jar</b></a>
@time = 2: <a><b>jar</b></a>

full path     type path    node hash    subtree hash    version
/a[0]/b[0]    (/a/b)[0]    H(“bar”)     H(“bar”)        0
/a[0]         (/a)[0]      H(“foo”)     H(“foo bar”)    0
/a[0]/b[0]    (/a/b)[0]    H(“jar”)     H(“jar”)        1
/a[0]         (/a)[0]      H(“foo”)     H(“foo jar”)    1
/a[0]/b[0]    (/a/b)[0]    H(“jar”)     H(“jar”)        2
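A minimal sketch of this serialization, assuming well-formed markup that Python's standard `xml.etree.ElementTree` can parse. The hash function H is not specified on the slides, so a truncated SHA-1 stands in for it. To match the table above, a row is emitted only for elements that carry their own text, which is an inference from the example and not a stated rule. `serialize` and `h` are hypothetical names.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter


def h(text):
    """Stand-in for H: a short SHA-1 digest of the text (assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def serialize(markup, version):
    """Return (full path, type path, node hash, subtree hash, version) rows."""
    root = ET.fromstring(markup)
    rows = []
    type_counts = Counter()  # occurrences of each tag path seen so far

    def visit(node, full_parent, type_parent, index):
        full_path = f"{full_parent}/{node.tag}[{index}]"
        tag_path = f"{type_parent}/{node.tag}"
        type_path = f"({tag_path})[{type_counts[tag_path]}]"
        type_counts[tag_path] += 1
        # Own text: text directly inside the element, before its first child.
        own_text = (node.text or "").strip()
        # Subtree text: all text below the element, whitespace-normalized.
        subtree_text = " ".join("".join(node.itertext()).split())
        if own_text:
            rows.append((full_path, type_path, h(own_text), h(subtree_text), version))
        sibling_counts = Counter()  # index children per tag name
        for child in node:
            visit(child, full_path, tag_path, sibling_counts[child.tag])
            sibling_counts[child.tag] += 1

    visit(root, "", "", 0)
    return rows


table = []
for version, markup in enumerate(["<a>foo <b>bar</b></a>",
                                  "<a>foo <b>jar</b></a>",
                                  "<a><b>jar</b></a>"]):
    table.extend(serialize(markup, version))
```

Real pages are rarely well-formed XML, so a production version would need an HTML parser that repairs markup before walking the tree.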

Operators on Serialized Data

• sort(columns)
  – Sorts by the variables
• reduce(columns)
  – Generates a set of sets
• Look familiar?

sort(full_path, version)
S = reduce(full_path)
foreach s in S: calculate the difference between the minimum version id and last reported id

Input rows (full path … version):
/a[0]/b[0] … 0
/a[0] … 0
/a[0]/b[0] … 1
/a[0] … 1
/a[0]/b[0] … 2

After sort and reduce, with the difference per set:
/a[0]/b[0] … 0, /a[0]/b[0] … 1, /a[0]/b[0] … 2 → 2
/a[0] … 0, /a[0] … 1 → 1
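The sort and reduce steps above map directly onto `sorted` and `itertools.groupby` in Python. This is a small in-memory sketch; `survival_spans` is a hypothetical name, and rows are reduced to (full path, version) pairs for brevity.

```python
from itertools import groupby
from operator import itemgetter


def survival_spans(rows):
    """For each full path, return last version id minus minimum version id.

    rows: iterable of (full_path, version) pairs in any order.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))          # sort(full_path, version)
    spans = {}
    for path, group in groupby(ordered, key=itemgetter(0)):  # reduce(full_path)
        versions = [version for _, version in group]
        spans[path] = versions[-1] - versions[0]
    return spans


rows = [
    ("/a[0]/b[0]", 0),
    ("/a[0]", 0),
    ("/a[0]/b[0]", 1),
    ("/a[0]", 1),
    ("/a[0]/b[0]", 2),
]
spans = survival_spans(rows)
```

Because each step is a plain sort or a group-by-key reduction, the same computation can be distributed across machines for thousands of page versions.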

Structure Survival Over Time

[Figure: percentage of structure surviving (y-axis, 0% to 100%) at 2 hours, 1 day, 1 week, 2 weeks, 4 weeks, and 5 weeks]

Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%

Frequencies and Motion

• Frequency of change of DOM elements
• Motion of elements on a page
  – Can we predict the motion of a page element?

Revisitation and Change (to appear [CHI’09])

[Figure: charts over time (y-axis, 0 to 1.2) labeled NYT, Woot, and Costco]

[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”

• Implying interest in the newest news
• Implying interest in the newest deal (once every 24 hours)
• Implying interest in the stable (slow changing) content

Inferred intent
• Filter content by removing content changing faster or slower than peak revisitation

Improve Search Over Time

• Changing content
  – Likelihood that document still contains term
  – Improved ranking
• Different term weights based on page dynamics
• Differential snippet generation
  – Take into account survivability of terms

Summary

• Behavior driven sampling gives us different statistics than what we’ve seen
• Fine grained analyses (time, terms, DOM)
• Data is useful for modeling and simulation (today)
• The techniques are useful going forward
  – Change curves and the hockey-stick model
  – Language models and term longevity plots
  – Comparison of many DOM structures simultaneously

Thanks!

?

Sampling URLs

Inter-arrival time

Unique Users (popularity)

Visits Per User

[Figure: visits over time. Labels: User 1 = 10 times, User 2 = 12 times, 523 unique users]

Full details: Adar et al., CHI08

Change curve implementation

• Which t0 do we pick?
• Any specific t0 results in different (noisy) curve
  – Solution: pick multiple t0 at random, calculate average change curve

[Figure axis label: Log minutes]
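The averaging step can be sketched as follows, assuming evenly spaced snapshots stored as term sets and Dice as the similarity measure. The number of random starting points, the seed, and the names `average_change_curve` and `dice` are illustrative assumptions.

```python
import random


def dice(a, b):
    """Dice coefficient of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def average_change_curve(versions, max_lag, n_starts, seed=0):
    """Average Dice(version[t0], version[t0 + lag]) over random t0 values.

    versions: time-ordered list of term sets, one per crawl, with
              len(versions) > max_lag.
    Returns a list whose entry at index `lag` is the mean similarity
    at that lag.
    """
    rng = random.Random(seed)
    last_start = len(versions) - max_lag - 1
    starts = [rng.randint(0, last_start) for _ in range(n_starts)]
    curve = []
    for lag in range(max_lag + 1):
        sims = [dice(versions[t0], versions[t0 + lag]) for t0 in starts]
        curve.append(sum(sims) / len(sims))
    return curve


# Toy page: two stable navigation terms plus one story that changes every crawl.
versions = [{"nav", "home", "story%d" % i} for i in range(10)]
curve = average_change_curve(versions, max_lag=3, n_starts=5)
```

Averaging over several starting points smooths out the noise that any single t0 introduces, at the cost of more pairwise comparisons.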

Operators on Serialized Data

• Combining sorts/reduces and intermediate outputs can answer many questions

• Can’t identify “changes” in text – shingles can help
  – Can identify when a hash “vanishes” and a new one appears
    • Test only on those
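As a hedged illustration of how shingles can help: when a hash vanishes and a new one appears, the old and new text can be compared by the overlap of their word shingles to tell a small edit from a full replacement. The shingle width, the Dice-style overlap, and the names `shingles` and `shingle_overlap` are assumptions; the slides do not give these details.

```python
def shingles(text, w=3):
    """Set of contiguous w-word sequences from the text (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Text shorter than one shingle: use the whole text as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_overlap(old_text, new_text, w=3):
    """Dice-style overlap of the two shingle sets, between 0 and 1."""
    a, b = shingles(old_text, w), shingles(new_text, w)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


small_edit = shingle_overlap("the quick brown fox jumps",
                             "the quick brown fox sleeps")
replacement = shingle_overlap("the quick brown fox jumps",
                              "completely different words appear here")
```

Running this only on nodes whose hash changed keeps the more expensive text comparison off the unchanged majority of the page.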

Overall Distributions (Dice)

Change Plot for a Page

Change Plot by Domain

[Figure: mean Dice coefficient (y-axis, 0.74 to 0.9) versus mean inter-version time (x-axis, 0 to 200), with points labeled .gov, .edu, .com, .net, and .org]
