The Web Changes Everything: Understanding the Dynamics of Web Content Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas University of Washington, Microsoft Research, and Carnegie Mellon University WSDM’09


Page 1: The Web Changes Everything: Understanding the Dynamics of Web Content

The Web Changes Everything: Understanding the Dynamics of Web Content

Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas
University of Washington, Microsoft Research, and Carnegie Mellon University
WSDM’09

Page 2: The Web Changes Everything: Understanding the Dynamics of Web Content

Who Cares About Web Change?

• Revisitation
– Monitoring
• Page Structure
– Fragility
• Dynamic language
– Search engine design

Page 3: The Web Changes Everything: Understanding the Dynamics of Web Content

Quantifying Change

• Dynamics of the Web is well researched
– Fetterly et al. (150 million pages), 65% stay the same
– Koehler et al. (5 years), stabilization
– Ntoulas et al. (turnover), 50% new content a year
– And many others (see the paper for a summary)
• But: eye towards systems issues
– Crawl rates, indexing, storage needs, etc.
– Always random samples
• What about the visited Web?
– Slow (every day at best)

Page 4: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 6: The Web Changes Everything: Understanding the Dynamics of Web Content

Behavior Driven Sampling

• Can we measure dynamics of the actually used Web?
• Usage Logs
– Live Toolbar
• 600k from August of ’06
• Subset of total

Page 7: The Web Changes Everything: Understanding the Dynamics of Web Content

Sampling URLs

[Figure: sampling dimensions: inter-arrival time, unique users (popularity), visits per user]

468 (avg), 650 (med) × 120 = 54788
All crawlable, min 2 users, 2 times

Full details: Adar et al., CHI08

Page 8: The Web Changes Everything: Understanding the Dynamics of Web Content

Behavior Driven Sampling

• Can we measure dynamics of the actually used Web?
• Usage Logs
– Live Toolbar
• 600k from August of ’06
• Subset of total
– Sampled URLs
• Around 55k (use the 40k that had revisits in May/June)
• Crawled hourly (and sub-hourly) for a year
– May/June ’07

Page 9: The Web Changes Everything: Understanding the Dynamics of Web Content

URL Annotations

• Visitation properties
– Revisits, popularity, etc.
• Broad type
– News, Sports, Personal, Adult, etc.
• Structural location
– Top level page or deep within site?

Page 10: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 11: The Web Changes Everything: Understanding the Dynamics of Web Content

Basic measures of change

[Diagram: timeline running from page version 1 to page version 2]

How long? (inter-version time)
How much? Dice: 2*|A ∩ B| / (|A| + |B|)
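The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes A and B are the sets of words in two page versions, and that a page is represented as a list of hourly snapshots. The names `dice` and `inter_version_times` are hypothetical.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|); 1.0 means identical sets."""
    if not a and not b:
        return 1.0  # assumption: two empty versions count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def inter_version_times(snapshots):
    """Hours between successive distinct versions.

    snapshots: list of (hour, set_of_words) pairs ordered by hour.
    """
    times = []
    last_hour, last_words = snapshots[0]
    for hour, words in snapshots[1:]:
        if words != last_words:  # a new version was observed
            times.append(hour - last_hour)
            last_hour, last_words = hour, words
    return times


v1 = {"web", "changes", "everything"}
v2 = {"web", "changes", "daily"}
print(dice(v1, v2))
print(inter_version_times([(0, v1), (1, v1), (2, v2), (3, v2), (5, v1)]))
```

Because a crawl only sees the page at the sampled hours, the inter-version times computed this way are upper bounds on how quickly the page really changes.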

Page 12: The Web Changes Everything: Understanding the Dynamics of Web Content

66% displayed change in 5 week sample (every 123 hours on average)

Random web: 35% change after 11 weeks

Page 13: The Web Changes Everything: Understanding the Dynamics of Web Content

More visitors = faster change

[Figure: Average Inter-version Time by Page Popularity. x-axis: visitors (2, 3-6, 7-38, 39+); y-axis: average hours to change (0 to 150)]

Page 14: The Web Changes Everything: Understanding the Dynamics of Web Content

More shallow (closer to homepage) = faster change

[Figure: Average Inter-version Time by Page “Depth”. x-axis: URL depth (5+, 4, 3, 2, 1, 0); y-axis: hours per change (0 to 250)]

Page 15: The Web Changes Everything: Understanding the Dynamics of Web Content

Change Plot by Type

[Figure: mean Dice coefficient (0.5 to 0.95) vs. mean inter-version time in hours (0 to 250), by page type: Industry/Trade, Music, Adult, Personal Pages, Sports/Recreation, News/Magazine]

Page 16: The Web Changes Everything: Understanding the Dynamics of Web Content

Inter-Version Distribution

Page 17: The Web Changes Everything: Understanding the Dynamics of Web Content

Sub-hourly crawls

• Over 60% of pages displayed some change when crawled every 60 minutes.

• What is the “true” change rate of the page?

Page 18: The Web Changes Everything: Understanding the Dynamics of Web Content

Sub-hourly crawls

• Round-robin crawling
• 8 samples over 3 (week)days
• shifted by at least 4 hours

[Diagram: controller dispatching original crawl 1, original crawl 2, and re-crawls at 2, 16, 32, and 60 minute delays]

Page 19: The Web Changes Everything: Understanding the Dynamics of Web Content

Range of Changes in Sub-hourly crawls

[Figure: number of pages (0 to 40000) changing at each crawl delay (0, 2, 16, 32, 60 minutes); series: “At least once” and “Change every sample”; secondary axis: mean Dice (0.8 to 0.95). Bar labels as extracted: 6%, 11%, 9%, 19%, 11%, 23%, 42%, 66%, 12%, 24%]

Page 20: The Web Changes Everything: Understanding the Dynamics of Web Content

Range of Changes in Sub-hourly crawls

[Figure: number of pages (0 to 40000) changing at each crawl delay (0, 2, 16, 32, 60 minutes), with example page text:]

“623 Users Online”
“Page generated in .6 ms”
“Served to IP address…”

Page 21: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 23: The Web Changes Everything: Understanding the Dynamics of Web Content

Measuring change

[Figure: Dice vs. time (hours) for page versions t0 through t5]

• Pages are equally (dis)similar
• Similarity based on
– navigation elements
– base language model

Page 24: The Web Changes Everything: Understanding the Dynamics of Web Content

Two Segment Model

• 2 segment (linear) – hockey stick

[Figure: change curve with labels: knot point, dynamic versus static, steady state]

Time at which proportion of dynamic to static remains constant

Page 25: The Web Changes Everything: Understanding the Dynamics of Web Content

Calculating the Knot Point

• Optimization problem

Knot point
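One way to pose the knot-point optimization is as a search over candidate breakpoints of a two-segment linear fit. The slides do not spell out the objective, so the sketch below is an assumption: it fits an ordinary least-squares line on each side of every candidate knot (both segments share the knot sample) and keeps the candidate with the smallest total squared error. `fit_line` and `find_knot` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_knot(hours, dice_values):
    """Return the hour whose two-segment fit has the smallest total error."""
    best_hour, best_error = None, float("inf")
    # Each segment needs at least two samples; both include the knot sample.
    for k in range(1, len(hours) - 1):
        _, _, left = fit_line(hours[: k + 1], dice_values[: k + 1])
        _, _, right = fit_line(hours[k:], dice_values[k:])
        if left + right < best_error:
            best_hour, best_error = hours[k], left + right
    return best_hour


# Synthetic hockey stick: similarity drops for 4 hours, then levels off.
hours = list(range(10))
curve = [1.0, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
print(find_knot(hours, curve))
```

The exhaustive search is quadratic in the number of samples, which is acceptable for a curve with a few hundred hourly points.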

Page 29: The Web Changes Everything: Understanding the Dynamics of Web Content

Types of Change Curves

• 3 main types
– Knotted (two-segment)
– Sloped
– Unchanging
• Automatic classification (93% accuracy*)
– 70% are knotted
• 145 hours mean, 92 median
– 28% sloped
– 2% unchanging (flat)

*Consistent with the proportions of hand labeled data

Page 30: The Web Changes Everything: Understanding the Dynamics of Web Content

Change curves

Different stable segments: different ratios of dynamic to stable content

http://www.nytimes.com http://www.allrecipes.com

Page 31: The Web Changes Everything: Understanding the Dynamics of Web Content

Change curves

[Figure: change curves (Dice, 0.4 to 1, vs. hours, 10 to 40) for Craigslist, Anchorage, AK and Craigslist, Los Angeles, CA]

Page 32: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 34: The Web Changes Everything: Understanding the Dynamics of Web Content

Nature of the Text

What terms vanish here?

Or are still here?


Page 35: The Web Changes Everything: Understanding the Dynamics of Web Content

Term Longevity Plot

[Figure: term longevity plot; time axis: Sep., Oct., Nov., Dec.]

Page 36: The Web Changes Everything: Understanding the Dynamics of Web Content

Term Longevity Plot

• Term level representation of change curve
– Pick a vertical (t0)
– Compare overlap of terms to next vertical

Page 37: The Web Changes Everything: Understanding the Dynamics of Web Content

Features of Terms

• Divergence
– Which terms distinguish current document from the collection (at a point in time)
– $\mathrm{Div}(w, D) = P(w \mid D)\,\log \frac{P(w \mid D)}{P(w \mid C)}$
• Staying power (σ)
– Likelihood of observing a word (w) at two different times, t and t+α in document D
– $\sigma(w, D) \approx P(t)\,P(\alpha)\,P(w \mid D_t, D_{t+\alpha})$
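Both term features can be estimated from crawled versions of a page. The sketch below is illustrative only and rests on stated assumptions: P(w|D) and P(w|C) are unsmoothed maximum-likelihood unigram estimates, and staying power weights every valid (t, α) pair equally rather than using the priors P(t) and P(α) from the slide. The function names are hypothetical.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def divergence(word, doc_tokens, collection_tokens):
    """Div(w, D) = P(w|D) * log(P(w|D) / P(w|C))."""
    p_doc = unigram(doc_tokens).get(word, 0.0)
    p_col = unigram(collection_tokens).get(word, 0.0)
    if p_doc == 0.0 or p_col == 0.0:
        return 0.0  # assumption: no smoothing, unseen words contribute nothing
    return p_doc * math.log(p_doc / p_col)


def staying_power(word, versions, max_alpha):
    """Fraction of (t, t+alpha) version pairs that both contain the word.

    versions: list of word sets, one per crawl, in time order.
    """
    hits = pairs = 0
    for alpha in range(1, max_alpha + 1):
        for t in range(len(versions) - alpha):
            pairs += 1
            if word in versions[t] and word in versions[t + alpha]:
                hits += 1
    return hits / pairs if pairs else 0.0


versions = [{"home", "bbq"}, {"home", "pork"}, {"home", "bbq"}, {"home", "salads"}]
print(staying_power("home", versions, 2), staying_power("bbq", versions, 2))
```

In this toy data "home" behaves like a navigation term (present in every version), while "bbq" comes and goes with the featured content.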

Page 38: The Web Changes Everything: Understanding the Dynamics of Web Content

Distribution of terms by staying power (σ)

[Figure: distribution of terms by staying power (σ)]

Low staying power (allrecipes.com)
– High Div.: bbq, salads, sandwiches, pork, cheese, cool

High staying power (allrecipes.com)
– cooks, cookbooks, ingredient, desserts, home, you, search… (labeled High Div. and Low Div. in the figure)

Page 39: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 41: The Web Changes Everything: Understanding the Dynamics of Web Content

DOM Level Changes

• How long does structure hold?
• Applications with assumed stability
– Programming by Demonstration (PbD)
– Mashups
– Scrapers, etc.

[Figure: DOM Structure]

[UIST08] Adar et al., “Zoetrope: Interacting with the Ephemeral Web”

Page 42: The Web Changes Everything: Understanding the Dynamics of Web Content

Tree Isomorphism

• The “general” approach:
– Compare the DOM structure of 2 trees
– Produces alignment, edit distances, etc. [Grandi’04]
– But: somewhat inefficient for large scale
• We want:
– A method for comparing many (1000s) of versions of the same page at the same time

Page 43: The Web Changes Everything: Understanding the Dynamics of Web Content

The Idea

• Serialize each DOM structure

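[Diagram: DOM tree with root a, which contains the text “foo” and a child b containing the text “bar”]

<a>foo <b>bar</b></a> @time = 0

full path | type path | node hash | subtree hash | version
/a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0
/a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0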

Page 44: The Web Changes Everything: Understanding the Dynamics of Web Content

The Idea

• Serialize each DOM structure

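[Diagram: DOM tree with root a, which contains the text “foo” and a child b containing the text “jar”]

<a>foo <b>jar</b></a> @time = 1

full path | type path | node hash | subtree hash | version
/a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0
/a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0
/a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 1
/a[0] | (/a)[0] | H(“foo”) | H(“foo jar”) | 1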

Page 45: The Web Changes Everything: Understanding the Dynamics of Web Content

The Idea

• Serialize each DOM structure

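[Diagram: DOM tree with root a and a child b containing the text “jar”]

<a><b>jar</b></a> @time = 2

full path | type path | node hash | subtree hash | version
/a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0
/a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0
/a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 1
/a[0] | (/a)[0] | H(“foo”) | H(“foo jar”) | 1
/a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 2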
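The serialization step can be sketched with Python's standard XML parser. This is a rough illustration under several assumptions: the page is well-formed XML (real HTML would need a tolerant parser), H is a truncated MD5 digest, a node's own text is only the text before its first child, and rows are emitted in document order. `node_hash` and `serialize` are hypothetical names, not the authors' code.

```python
import hashlib
import xml.etree.ElementTree as ET


def node_hash(text):
    """Short content hash standing in for H(...) on the slides."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


def serialize(markup, version):
    """Return rows of (full path, type path, node hash, subtree hash, version)."""
    root = ET.fromstring(markup)
    rows = []
    type_counts = {}  # how many nodes share each type path so far

    def visit(node, full_path, type_path):
        own_text = (node.text or "").strip()
        subtree_text = " ".join("".join(node.itertext()).split())
        type_index = type_counts.get(type_path, 0)
        type_counts[type_path] = type_index + 1
        rows.append((full_path, f"({type_path})[{type_index}]",
                     node_hash(own_text), node_hash(subtree_text), version))
        sibling_counts = {}  # index of each child among same-tag siblings
        for child in node:
            i = sibling_counts.get(child.tag, 0)
            sibling_counts[child.tag] = i + 1
            visit(child, f"{full_path}/{child.tag}[{i}]",
                  f"{type_path}/{child.tag}")

    visit(root, f"/{root.tag}[0]", f"/{root.tag}")
    return rows


rows_v0 = serialize("<a>foo <b>bar</b></a>", 0)
rows_v1 = serialize("<a>foo <b>jar</b></a>", 1)
for row in rows_v0 + rows_v1:
    print(row)
```

Once every version is flattened into rows like these, comparing thousands of versions becomes a matter of sorting and grouping rows instead of aligning trees pairwise.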

Page 46: The Web Changes Everything: Understanding the Dynamics of Web Content

Operators on Serialized Data

• sort(columns)
– Sorts by the variables
• reduce(columns)
– Generates a set of sets
• Look familiar?

Page 47: The Web Changes Everything: Understanding the Dynamics of Web Content

sort(full_path, version)
S = reduce(full_path)
foreach s in S: calculate the difference between the minimum version id and last reported id

Serialized rows (full path … version):
/a[0]/b[0] … 0
/a[0] … 0
/a[0]/b[0] … 1
/a[0] … 1
/a[0]/b[0] … 2

After sort(full_path, version):
/a[0]/b[0] … 0
/a[0]/b[0] … 1
/a[0]/b[0] … 2
/a[0] … 0
/a[0] … 1

After reduce(full_path), with the version difference for each set:
/a[0]/b[0]: versions 0, 1, 2 → 2
/a[0]: versions 0, 1 → 1
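The sort/reduce pipeline above maps directly onto sorting and grouping in Python. The sketch below reuses the five rows from the slide, keeping only the full path and version columns; `survival` is a hypothetical name, and `itertools.groupby` stands in for the reduce operator.

```python
from itertools import groupby
from operator import itemgetter

# (full path, version) pairs from the running example.
rows = [
    ("/a[0]/b[0]", 0),
    ("/a[0]", 0),
    ("/a[0]/b[0]", 1),
    ("/a[0]", 1),
    ("/a[0]/b[0]", 2),
]


def survival(rows):
    """Map each full path to (last version seen) - (first version seen)."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # sort(full_path, version)
    result = {}
    for path, group in groupby(ordered, key=itemgetter(0)):  # reduce(full_path)
        versions = [version for _, version in group]
        result[path] = versions[-1] - versions[0]
    return result


print(survival(rows))
```

`groupby` only merges adjacent rows, which is why the sort has to come first; the same constraint applies to the sort-then-reduce formulation on the slide.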

Page 48: The Web Changes Everything: Understanding the Dynamics of Web Content

Structure Survival Over Time

[Figure: percentage of DOM structure surviving (0% to 100%) after 2 hours, 1 day, 1 week, 2 weeks, 4 weeks, and 5 weeks]

Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%

Page 49: The Web Changes Everything: Understanding the Dynamics of Web Content

Frequencies and Motion

• Frequency of change of DOM elements

Page 50: The Web Changes Everything: Understanding the Dynamics of Web Content

Frequencies and Motion

• Frequency of change of DOM elements
• Motion of elements on a page
– Can we predict the motion of a page element?

Page 51: The Web Changes Everything: Understanding the Dynamics of Web Content

Outline

• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications

Page 53: The Web Changes Everything: Understanding the Dynamics of Web Content

Revisitation and Change (to appear [CHI’09])

[Figure: revisitation curves over time for NYT, Woot, and Costco]

• NYT: implying interest in the newest news
• Woot: implying interest in the newest deal (once every 24 hours)
• Costco: implying interest in the stable (slow changing) content

[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”

Page 54: The Web Changes Everything: Understanding the Dynamics of Web Content

[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”

Inferred intent

• Filter content by removing content changing faster or slower than peak revisitation

Page 55: The Web Changes Everything: Understanding the Dynamics of Web Content

Improve Search Over Time

• Changing content
– Likelihood that document still contains term
– Improved ranking
• Different term weights based on page dynamics
• Differential snippet generation
– Take into account survivability of terms

Page 56: The Web Changes Everything: Understanding the Dynamics of Web Content

Summary

• Behavior driven sampling gives us different statistics than what we’ve seen
• Fine grained analyses (time, terms, DOM)
• Data is useful for modeling and simulation (today)
• The techniques are useful going forward
– Change curves and the hockey-stick model
– Language models and term longevity plots
– Comparison of many DOM structures simultaneously

Page 57: The Web Changes Everything: Understanding the Dynamics of Web Content

Thanks!

?

Page 58: The Web Changes Everything: Understanding the Dynamics of Web Content

Sampling URLs

[Figure: page visits on a timeline, illustrating inter-arrival time, unique users (popularity), and visits per user; example labels: User 1 = 10 times, User 2 = 12 times, 523 unique users]

Full details: Adar et al., CHI08

Page 59: The Web Changes Everything: Understanding the Dynamics of Web Content

Sampling URLs

Inter-arrival time

Unique Users (popularity)

Visits Per User

Full details: Adar et al., CHI08

Page 60: The Web Changes Everything: Understanding the Dynamics of Web Content

Change curve implementation

• Which t0 do we pick?

• Any specific t0 results in a different (noisy) curve
– Solution: pick multiple t0 at random, calculate average change curve

[Figure: change curve; x-axis: log minutes]
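The averaging step can be sketched as follows. It is a minimal illustration, assuming each crawled version is a set of words, lags are measured in crawl steps, and similarity is the Dice coefficient; the function names and the fixed random seed are assumptions made for reproducibility.

```python
import random


def dice(a, b):
    """Dice coefficient between two word sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def change_curve(versions, t0, max_lag):
    """Similarity of the version at t0 to each of the next max_lag versions."""
    return [dice(versions[t0], versions[t0 + lag]) for lag in range(1, max_lag + 1)]


def average_change_curve(versions, max_lag, samples, seed=0):
    """Average the change curves of several randomly chosen starting points."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(versions) - max_lag) for _ in range(samples)]
    curves = [change_curve(versions, t0, max_lag) for t0 in starts]
    return [sum(curve[i] for curve in curves) / len(curves) for i in range(max_lag)]


# Toy page: stable navigation words plus one story that changes every crawl.
versions = [{"nav", "home", f"story{i}"} for i in range(10)]
print(average_change_curve(versions, max_lag=3, samples=5))
```

In the toy data every pair of distinct versions shares only the two navigation words, so the averaged curve is flat at 2/3 regardless of which starting points are drawn.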

Page 61: The Web Changes Everything: Understanding the Dynamics of Web Content

Operators on Serialized Data

• Combining sorts/reduces and intermediate outputs can answer many questions

• Can’t identify “changes” in text – shingles can help
– Can identify when a hash “vanishes” and a new one appears
• Test only on those

Page 62: The Web Changes Everything: Understanding the Dynamics of Web Content

Overall Distributions (Dice)

Page 63: The Web Changes Everything: Understanding the Dynamics of Web Content

Change Plot for a Page

Page 64: The Web Changes Everything: Understanding the Dynamics of Web Content

Change Plot by Domain

[Figure: mean Dice coefficient (0.74 to 0.9) vs. mean inter-version time (0 to 200), by top-level domain: .gov, .edu, .com, .net, .org]