
Lazy Preservation: Reconstructing Websites from the Web Infrastructure

Frank McCown
Advisor: Michael L. Nelson

Old Dominion University
Computer Science Department

Norfolk, Virginia, USA

Dissertation Defense
October 19, 2007

2

Outline

• Motivation
• Lazy preservation and the Web Infrastructure
• Web repositories
• Responses to 10 research questions
• Contributions and Future Work

3

Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

4

Preservation: Fortress Model

5 easy steps for preservation:

1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”

Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt

6

…I was doing a little “maintenance” on one of my sites and accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH!

Web Infrastructure

Lazy Preservation

• How much preservation can be had for free? (Little to no effort for web producer/publisher before website is lost)

• High-coverage preservation of works of unknown importance

• Built atop unreliable, distributed members which cannot be controlled

• Usually limited to crawlable web

8

Dissertation Objective

9

To demonstrate the feasibility of using the WI as a preservation service – lazy preservation – and to evaluate how effectively this previously unexplored service can be utilized for reconstructing lost websites.

Research Questions (Dissertation p. 3)

1. What types of resources are typically stored in the WI search engine caches, and how up-to-date are the caches?

2. How successful is the WI at preserving short-lived web content?

3. How much overlap is there between what is found in search engine caches and the Internet Archive?

4. What interfaces are necessary for a member of the WI (a web repository) to be used in website reconstruction?

5. How does a web-repository crawler work, and how can it reconstruct a lost website from the WI?

10

Research Questions cont.

6. What types of websites do people lose, and how successful have they been recovering them from the WI?

7. How completely can websites be reconstructed from the WI?

8. What website attributes contribute to the success of website reconstruction?

9. Which members of the WI are the most helpful for website reconstruction?

10. What methods can be used to recover the server-side components of websites from the WI?

11

WI Preliminaries: Web Repositories

12

13

How much of the Web is indexed?

Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05):

Google: 8 billion pages
Yahoo: 6.6 billion pages
MSN: 5 billion pages
Indexable Web: 11.5 billion pages
Internet Archive: ?

14

15

Cached Image

16

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

[Figure: the canonical PDF shown alongside the MSN, Yahoo, and Google cached versions]

Types of Web Repositories

• Depth of holdings
  – Flat – only maintain last version of resource crawled
  – Deep – maintain multiple versions, each with a timestamp

• Access to holdings
  – Dark – no outside access to resources
  – Light – minimal access restrictions
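As an aside not on the slide: elsewhere in the talk the search engine caches behave as flat, light repositories, while the Internet Archive is a deep, light repository. A minimal Perl sketch of how that classification might be recorded (illustration only, not dissertation code):

#!/usr/bin/perl
# Illustration only: record each repository's depth and access type
# using the taxonomy above (flat/deep, dark/light).
use strict;
use warnings;

my %repositories = (
    'Google cache'     => { depth => 'flat', access => 'light' },
    'Yahoo cache'      => { depth => 'flat', access => 'light' },
    'MSN cache'        => { depth => 'flat', access => 'light' },
    'Internet Archive' => { depth => 'deep', access => 'light' },
);

for my $name (sort keys %repositories) {
    my $r = $repositories{$name};
    printf "%-16s depth=%-4s access=%s\n", $name, $r->{depth}, $r->{access};
}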

18

Accessing the WI

• Screen-scraping the web user interface (WUI)

• Application programming interface (API)

• WUIs and APIs do not always produce the same responses; the APIs may be pulling from smaller indexes [1]

19

[1] McCown & Nelson, Agreeing to Disagree: Search Engines and their Public Interfaces, JCDL 2007
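As a rough illustration of what screen-scraping a WUI involves — this is only a sketch: the cache URL below is a made-up placeholder, not any real search engine endpoint, and the banner-stripping heuristic is simplistic:

#!/usr/bin/perl
# Sketch of WUI screen-scraping: fetch a cached copy of a page over HTTP
# and strip the banner a cache typically prepends to the original page.
use strict;
use warnings;
use LWP::UserAgent;

my $target    = 'http://foo.edu/~joe/index.html';
my $cache_url = "http://searchengine.example/cache?url=$target";   # hypothetical endpoint

my $ua  = LWP::UserAgent->new(agent => 'warrick-sketch/0.1', timeout => 30);
my $res = $ua->get($cache_url);

if ($res->is_success) {
    my $html = $res->decoded_content;
    $html =~ s/\A.*?(?=<html)//is;   # drop everything before the original <html> tag
    print $html;
} else {
    warn "Could not fetch cached copy: ", $res->status_line, "\n";
}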

Research Questions 1-3: Characterizing the WI

• Experiment 1: Observe the WI finding and caching new web content that is decaying.

• Experiment 2: Examine the contents of the WI by randomly sampling URLs

20

21

Timeline of Web Resource

22

Web Caching Experiment

• May – Sept 2005
• Create 4 websites composed of HTML, PDFs, and images
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/
• Remove pages each day
• Query Google, MSN, and Yahoo (GMY) every day using identifiers

McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

23

24

25

26

Observations

• Internet Archive found nothing

• Google was the most useful web repository from a preservation perspective
  – Quick to find new content
  – Consistent access to cached content
  – Lost content reappeared in cache long after it was removed

• Images are slow to be cached, and duplicate images are not cached

27

28

Experiment: Sample Search Engine Caches

• Feb 2007

• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo

• Randomly selected 1 result from first 100

• Downloaded each resource and its cached page

• Checked for overlap with the Internet Archive

McCown and Nelson, Characterization of Search Engine Caches, Archiving 2007.

29

Distribution of Top Level Domains

30

Cached Resource Size Distributions

[Figure: cached resource size distributions for the four search engines, with observed size cut-offs of 976 KB, 977 KB, 1 MB, and 215 KB]

31

Cache Freshness and Staleness

[Timeline: a resource is fresh from the time it is crawled and cached until it changes on the web server; it is stale from that change until it is crawled and cached again]

Staleness = max(0, Last-Modified HTTP header – cached date)
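A small Perl sketch of this calculation (the dates below are made-up example values, not data from the study):

#!/usr/bin/perl
# Sketch of the staleness measure above: the time by which the server's
# copy is newer than the cached copy, floored at zero.
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical example values (epoch seconds) for one resource.
my $last_modified = timegm(0, 0, 12, 10, 1, 2007);  # server copy: 2007-02-10 12:00 UTC
my $cached_date   = timegm(0, 0, 12,  8, 1, 2007);  # cached copy: 2007-02-08 12:00 UTC

my $staleness = $last_modified - $cached_date;
$staleness = 0 if $staleness < 0;                    # a fresh copy has staleness 0

printf "Staleness: %.1f days\n", $staleness / 86400; # prints "Staleness: 2.0 days"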

32

Cache Staleness

• 46% of resources had a Last-Modified header

• 71% also had cached date

• 16% were at least 1 day stale

33

Overlap with Internet Archive

34

Overlap with Internet Archive

On average, 46% of the URLs sampled from the search engines were also archived.

Research Question 4 of 10: Repository Interfaces

Minimum interface requirement:

“What resource r do you have stored for the URI u?”

r ← getResource(u)

35

Deep Repositories

“What resource r do you have stored for the URI u at datestamp d?”

r ← getResource(u, d)

36

Lister Queries

“What resources R do you have stored from the site s?”

R ← getAllUris(s)

37

38

Other Interface Commands

• Get list of dates D stored for URI u

D ← getResourceList(u)

• Get crawl date d for URI u

d ← getCrawlDate(u)
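To make the five operations concrete, here is a sketch of how they might be expressed as a Perl interface. This is an assumed structure for illustration only, not the actual Warrick code; the method names mirror the operations above.

#!/usr/bin/perl
# Sketch of the web-repository interface described above.  A wrapper for a
# particular repository would subclass this and fill in the stub methods.
package WebRepository;
use strict;
use warnings;

sub new { my ($class) = @_; return bless {}, $class; }

# r <- getResource(u): last stored copy of the resource at URI u (flat repos)
sub get_resource      { my ($self, $u) = @_; die "not implemented" }

# r <- getResource(u, d): copy of u stored at datestamp d (deep repos)
sub get_resource_at   { my ($self, $u, $d) = @_; die "not implemented" }

# R <- getAllUris(s): lister query - all URIs stored for site s
sub get_all_uris      { my ($self, $s) = @_; die "not implemented" }

# D <- getResourceList(u): all datestamps stored for URI u (deep repos)
sub get_resource_list { my ($self, $u) = @_; die "not implemented" }

# d <- getCrawlDate(u): date the repository crawled URI u
sub get_crawl_date    { my ($self, $u) = @_; die "not implemented" }

1;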

39

Research Question 5 of 10: Web-Repository Crawling

40

[Diagram: in web crawling, a crawler pulls resources from the World Wide Web into a single repository; in web-repository crawling, a crawler pulls resources from many repositories (Repo1, Repo2, …, Repon) back out to rebuild a website on the Web]

41

Web-repository Crawler

42

• Written in Perl

• First version completed in Sept 2005

• Made available to the public in Jan 2006

• Run as a command-line program:

  warrick.pl --recursive --debug --output-file log.txt http://foo.edu/~joe/

• Or on-line using the Brass queuing system

http://warrick.cs.odu.edu/

43

Research Question 6 of 10: Warrick usage

44

45

[Figure: recovery success of user-requested reconstructions; average of 38.2%]

46

Research Questions 7 and 8:Reconstruction Effectiveness

• Problem with usage data: difficult to determine how successful reconstructions actually are
  – Brass tells Warrick to recover all resources, even if not part of the “current” website
  – When were websites actually lost?
  – Were URLs spelled correctly? Spam?
  – Need the actual website to compare against the reconstruction, especially to determine which factors affect a website’s recoverability

47

48

49

Measuring the Difference

Recovery vector (rc, rm, ra) — changed, missing, added

Apply a recovery vector to each resource, then compute a difference vector for the website.
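A Perl sketch of this aggregation. The per-resource outcomes and the toy site are hypothetical, and the denominators (original site size for changed/missing, reconstructed site size for added) are an assumption made for illustration; the difference-vector and success definitions appear in the appendix slides.

#!/usr/bin/perl
# Sketch: turn per-resource recovery outcomes into a website-level
# difference vector D = (dc, dm, da).
use strict;
use warnings;

# Hypothetical recovery outcomes: URI => identical | changed | missing | added
my %outcome = (
    '/index.html'    => 'identical',
    '/about.html'    => 'changed',
    '/img/logo.png'  => 'missing',
    '/old/page.html' => 'added',     # recovered, but not part of the current site
);

my %count;
$count{$_}++ for values %outcome;

my $original      = ($count{identical} // 0) + ($count{changed} // 0) + ($count{missing} // 0);
my $reconstructed = ($count{identical} // 0) + ($count{changed} // 0) + ($count{added} // 0);

my $dc = ($count{changed} // 0) / $original;       # fraction of the original site that changed
my $dm = ($count{missing} // 0) / $original;       # fraction of the original site that is missing
my $da = ($count{added}   // 0) / $reconstructed;  # fraction of the reconstruction that was added

printf "D = (%.2f, %.2f, %.2f)\n", $dc, $dm, $da;  # (0.33, 0.33, 0.33) for this toy site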

50

Reconstruction Diagram

[Reconstruction diagram: identical 50%, changed 33%, missing 17%, added 20%]

51

McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006

52

Reconstruction Experiment

• 300 websites chosen randomly from Open Directory Project (dmoz.org)

• Crawled and reconstructed each website every week for 14 weeks

• Examined change rates, age, decay, growth, recoverability

McCown and Nelson, Factors Affecting Website Reconstruction from the Web Infrastructure, JCDL 2007

53

Success of website recovery each week

*On average, 61% of a website was recovered on any given week.

54

Recovery by TLD

55

Which Factors Are Significant?

• External backlinks
• Internal backlinks
• Google’s PageRank
• Hops from root page
• Path depth
• MIME type
• Query string params
• Age
• Resource birth rate
• TLD
• Website size
• Size of resources

56

Regression Analysis

• No surprises: all variables are significant, but the overall model explains only about half of the observations

• Three most significant variables: PageRank, hops, and age (R-squared = 0.1496)

57

Observations

• Most of the sampled websites were relatively stable
  – One third of the websites never lost a single resource
  – Half of the websites never added any new resources

• The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)

• How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs

Research Question 9 of 10: Web Repository Contributions

58

[Figures: each web repository’s contribution to reconstructions, from real usage data and from experimental results]

Research Question 10 of 10: Recovering the web server’s components

59

[Diagram: a web server uses server-side components — a database, Perl scripts, and config files — to generate dynamic pages. The web infrastructure sees only the crawlable output: static files (HTML files, PDFs, images, style sheets, Javascript, etc.) and dynamic pages are recoverable from the WI, while the server-side components are not recoverable]

60

Injecting Server Components into Crawlable Pages

[Diagram: server components are erasure-coded into blocks that are injected into the site’s HTML pages; recovering at least m of the blocks is enough to rebuild the components]
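A simplified Perl sketch of the injection step. The component content, block size, and comment format are invented for illustration, and true erasure coding (which lets any m of the n blocks suffice) is replaced here by a plain split that needs every block:

#!/usr/bin/perl
# Simplified sketch: encode a server component, split it into blocks, and
# hide one block per HTML page inside a comment a crawler would preserve.
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $component  = "CREATE TABLE articles (id INT, body TEXT);\n";  # hypothetical server-side content
my $block_size = 32;

# Encode and split into fixed-size blocks.
my $encoded = encode_base64($component, '');
my @blocks;
push @blocks, substr($encoded, 0, $block_size, '') while length $encoded;

# Inject one block per HTML page.
my $n = scalar @blocks;
for my $i (0 .. $#blocks) {
    my $comment = sprintf "<!-- recovery-block %d/%d: %s -->", $i + 1, $n, $blocks[$i];
    print "page", $i + 1, ".html gets: $comment\n";
}

# Recovery: concatenate the blocks scraped from cached pages and decode.
my $recovered = decode_base64(join '', @blocks);
print "Recovered component:\n$recovered";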

Server Encoding Experiment

• Create a digital library using Eprints software and populate it with 100 research papers

• Monarch DL: http://blanche-03.cs.odu.edu/

• Encode Eprints server components (Perl scripts, MySQL database, config files) and inject into all HTML pages

• Reconstruct each week

61

62

63

Web resources recovered each week

64

Contributions

1. Novel solution to the pervasive problem of website loss: lazy preservation, after-the-fact recovery requiring little to no work from the content creator

2. The WI is characterized: its behavior in consuming and retaining new web content, the types of resources it contains, and the overlap between flat and deep repositories

65

Contributions cont.

3. A model of resource availability is developed, covering a resource from its initial creation to its potential unavailability

4. Developed a new type of crawler, the web-repository crawler: its architecture, interfaces for crawling web repositories, rules for canonicalizing URLs, and three crawling policies are evaluated

66

Contributions cont.

5. Developed a statistical model for measuring a reconstructed website and a reconstruction diagram for summarizing reconstruction success

6. Discovered the three most significant variables that determine how successfully a web resource will be recovered from the WI: Google’s PageRank, hops from the root page, and resource age

67

Contributions cont.

7. Proposed and experimentally validated a novel solution for recovering a website’s server components from the WI

8. Created a website reconstruction service that is currently being used by the public to reconstruct more than 100 lost websites a month

68

Future Work

• Improvements to Warrick: increase the number of repositories used, improve URL discovery, handle soft 404s

• Determining or predicting loss – save websites when they are detected to be about to disappear or to have already disappeared

• Investigate other sources of lazy preservation: browser caches

• More extensive overlap studies of WI

69

Related Publications

• Deep web
  – IEEE Internet Computing 2006
• Link rot
  – IWAW 2005
• Lazy Preservation / WI
  – D-Lib Magazine 2006
  – WIDM 2006
  – Archiving 2007
  – Dynamics of Search Engines: An Introduction (chapter)
  – Content Engineering (chapter)
  – International Journal on Digital Libraries 2007
• Search engine contents and interfaces
  – ECDL 2005
  – WWW 2007
  – JCDL 2007
• Obsolete web file formats
  – IWAW 2005
• Warrick
  – HYPERTEXT 2006
  – Archiving 2007
  – JCDL 2007
  – IWAW 2007
  – Communications of the ACM 2007 (to appear)

70

71

Thank You

“Can’t wait until I’m old enough to run Warrick!”

72

74

75

Some Difference Vectors

D = (changed, missing, added)

(0,0,0) – Perfect recovery

(1,0,0) – All resources are recovered but changed

(0,1,0) – All resources are lost

(0,0,1) – All recovered resources are at new URIs

76

How Much Change is a Bad Thing?

[Figure: a lost page compared with its recovered version]

77

How Much Change is a Bad Thing?

[Figure: a lost page compared with its recovered version]

78

Assigning Penalties

Penalty adjustment (Pc, Pm, Pa) — applied to each resource, or to the difference vector

79

Defining Success

success = 1 – dm

Equivalent to the percentage of recovered resources

[Scale: 0 = less successful, 1 = more successful]
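Continuing the difference-vector sketch from earlier (same made-up numbers), a minimal illustration of the success measure:

#!/usr/bin/perl
# Sketch of the success measure above: success = 1 - dm, the fraction of
# the original site's resources that were recovered at all.
use strict;
use warnings;

my ($dc, $dm, $da) = (0.33, 0.33, 0.33);   # hypothetical difference vector from the earlier sketch
my $success = 1 - $dm;

printf "success = %.2f\n", $success;        # 0.67 of the original resources were recovered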

80

Recovery of Textual Resources

81

Birth and Decay

82

Recovery of HTML Resources

83

Recovery by Age

84

Mild Correlations

• Hops and
  – website size (0.428)
  – path depth (0.388)

• Age and # of query params (−0.318)

• External links and
  – PageRank (0.339)
  – website size (0.301)
  – hops (0.320)

86

Similarity vs. Staleness