TRANSCRIPT
Lazy Preservation: Reconstructing Websites by
Crawling the Crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen
Old Dominion University, Norfolk, Virginia, USA
WIDM 2006, Arlington, Virginia, November 10, 2006
2
Outline
• Web page threats
• Web Infrastructure
• Web caching experiment
• Web repository crawling
• Website reconstruction experiment
3
[Images illustrating threats to web pages. Credits:]
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
4
How much of the Web is indexed?
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
[Diagram of search engine index size estimates:]
Google: 8 billion pages
Yahoo: 6.6 billion pages
MSN: 5 billion pages
Indexable Web: 11.5 billion pages
5-7
Cached Image / Cached PDF
[Screenshots: the MSN, Yahoo, and Google cached versions of
http://www.fda.gov/cder/about/whatwedo/testtube.pdf shown alongside the canonical copy]
Web Repository Characteristics

Type                             | MIME type                     | File ext | Google | Yahoo | MSN | IA
HTML text                        | text/html                     | html     | C      | C     | C   | C
Plain text                       | text/plain                    | txt, ans | M      | M     | M   | C
Graphic Interchange Format       | image/gif                     | gif      | M      | M     | ~R  | C
Joint Photographic Experts Group | image/jpeg                    | jpg      | M      | M     | ~R  | C
Portable Network Graphic         | image/png                     | png      | M      | M     | ~R  | C
Adobe Portable Document Format   | application/pdf               | pdf      | M      | M     | M   | C
JavaScript                       | application/javascript        | js       | M      | M     |     | C
Microsoft Excel                  | application/vnd.ms-excel      | xls      | M      | ~S    | M   | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint | ppt      | M      | M     | M   | C
Microsoft Word                   | application/msword            | doc      | M      | M     | M   | C
PostScript                       | application/postscript        | ps       | M      | ~S    |     | C

C  = Canonical version is stored
M  = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~R = Indexed but not retrievable
~S = Indexed but not stored
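Read as data, the table above is just a lookup from MIME type to a per-repository storage code. A minimal Python sketch of that lookup; the repository ordering, the None entries for the slide's blank cells, and the helper name are illustrative assumptions:

    # Storage codes mirror the slide: "C" = canonical copy is stored,
    # "M" = modified copy, "~R" = indexed but not retrievable,
    # "~S" = indexed but not stored, None = cell left blank on the slide.
    STORAGE = {
        # MIME type: (Google, Yahoo, MSN, Internet Archive)
        "text/html":                     ("C", "C", "C", "C"),
        "text/plain":                    ("M", "M", "M", "C"),
        "image/gif":                     ("M", "M", "~R", "C"),
        "image/jpeg":                    ("M", "M", "~R", "C"),
        "image/png":                     ("M", "M", "~R", "C"),
        "application/pdf":               ("M", "M", "M", "C"),
        "application/javascript":        ("M", "M", None, "C"),
        "application/vnd.ms-excel":      ("M", "~S", "M", "C"),
        "application/vnd.ms-powerpoint": ("M", "M", "M", "C"),
        "application/msword":            ("M", "M", "M", "C"),
        "application/postscript":        ("M", "~S", None, "C"),
    }

    REPOS = ("google", "yahoo", "msn", "ia")

    def repos_with_canonical(mime_type):
        """Return the repositories that store a canonical copy of this type."""
        codes = STORAGE.get(mime_type, (None,) * 4)
        return [repo for repo, code in zip(REPOS, codes) if code == "C"]

    print(repos_with_canonical("image/png"))  # ['ia']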
10
Timeline of a Web Resource
11
Web Caching Experiment
• Create 4 websites composed of HTML, PDF, and image resources:
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/
• Remove pages from each site every day
• Query Google, MSN, and Yahoo (GMY) each day using unique identifiers (see the sketch below)
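A minimal sketch of the daily probing step, assuming each test page carries a unique identifier string that can be submitted as a search query. The site/identifier data is an illustrative subset, and query_cache() is a hypothetical stand-in for however each engine is actually queried:

    import datetime

    # Illustrative subset of the four test sites and their per-page IDs.
    SITES = {
        "http://www.owenbrau.com/": ["owenbrau-page-001", "owenbrau-page-002"],
        "http://www.cs.odu.edu/~fmccown/lazy/": ["lazy-page-001"],
    }

    def query_cache(engine, identifier):
        """Hypothetical stand-in: return True if the engine's cache
        returns a hit for this unique identifier."""
        raise NotImplementedError

    def daily_probe(engines=("google", "msn", "yahoo")):
        """Record, for today, which page identifiers each engine still caches."""
        today = datetime.date.today().isoformat()
        rows = []
        for site, identifiers in SITES.items():
            for ident in identifiers:
                for engine in engines:
                    rows.append((today, site, ident, engine,
                                 query_cache(engine, ident)))
        return rows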
16
Crawling the Web and web repositories
[Diagram: a web crawler pulls pages from the World Wide Web into a single repository (web crawling); a web-repository crawler instead pulls cached pages out of multiple repositories, Repo 1, Repo 2, ..., Repo n (web-repository crawling)]
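A minimal sketch of the web-repository crawling loop suggested by the diagram: instead of fetching live pages from the Web, it asks each repository for a cached copy of a URL and follows the links found in recovered pages. fetch_from_repo() and extract_links() are hypothetical placeholders, and the first-hit policy is a simplification:

    from collections import deque

    def fetch_from_repo(repo, url):
        """Hypothetical stand-in: return repo's cached copy of url, or None."""
        raise NotImplementedError

    def extract_links(page, base_url):
        """Hypothetical stand-in: return absolute URLs linked from an HTML page."""
        raise NotImplementedError

    def reconstruct(start_url, repos=("google", "msn", "yahoo", "ia")):
        recovered = {}
        frontier = deque([start_url])
        seen = {start_url}
        while frontier:
            url = frontier.popleft()
            # Ask each repository in turn and keep the first copy found.
            # (A real crawler could prefer canonical copies, per the
            # repository-characteristics table above.)
            for repo in repos:
                copy = fetch_from_repo(repo, url)
                if copy is None:
                    continue
                recovered[url] = copy
                for link in extract_links(copy, url):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
                break
        return recovered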
17
Warrick
• First developed in fall of 2005
• Available for download at http://www.cs.odu.edu/~fmccown/warrick/
• www2006.org – first lost website reconstructed (Nov 2005)
• DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially endorses Warrick (mid Mar 2006)
18
How Much Did We Reconstruct?
[Diagram: the "lost" web site contains resources A, B, C, D, E, and F; the reconstructed web site contains A, B', C', G, and E. The reconstruction is missing the link to D and points to old resource G; F can't be found.]

Four categories of recovered resources (see the sketch below):
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
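A minimal sketch of the four-way classification, assuming each site is represented as a dict mapping URL to a content hash (the hash values below are illustrative stand-ins):

    def classify(lost, reconstructed):
        """Split resources into identical, changed, missing, and added."""
        identical = [u for u in lost if reconstructed.get(u) == lost[u]]
        changed = [u for u in lost
                   if u in reconstructed and reconstructed[u] != lost[u]]
        missing = [u for u in lost if u not in reconstructed]
        added = [u for u in reconstructed if u not in lost]
        return identical, changed, missing, added

    # The slide's example: D and F were not recovered, B and C came back
    # as older versions (B', C'), and G is an old resource not in the lost site.
    lost = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    recon = {"A": 1, "B": 20, "C": 30, "E": 5, "G": 70}
    print(classify(lost, recon))
    # (['A', 'E'], ['B', 'C'], ['D', 'F'], ['G'])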
19
Reconstruction Diagram
[Figure: an example reconstruction summary: identical 50%, changed 33%, missing 17%, added 20%]
20
Reconstruction Experiment
• Crawl and reconstruct 24 sites of various sizes:
  1. small (1-150 resources)
  2. medium (151-499 resources)
  3. large (500+ resources)
• Perform 5 reconstructions for each website:
  – One using all four repositories together
  – Four using each repository separately
• Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%) (see the sketch below)
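A minimal sketch of computing the reconstruction vector, under the assumption that changed% and missing% are measured against the original site and added% against the reconstructed one; the slides do not spell out the denominators, so treat them as an assumption:

    def reconstruction_vector(identical, changed, missing, added):
        """Return (changed%, missing%, added%) from resource counts.
        identical% is implied: 100 - changed% - missing%."""
        n_orig = identical + changed + missing    # resources in the lost site
        n_recon = identical + changed + added     # resources recovered
        return (100.0 * changed / n_orig,
                100.0 * missing / n_orig,
                100.0 * added / n_recon)

    # Counts from the earlier diagram: 2 identical (A, E), 2 changed
    # (B', C'), 2 missing (D, F), 1 added (G).
    print(reconstruction_vector(2, 2, 2, 1))  # approx (33.3, 33.3, 20.0)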
21
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.
22
Recovery Success by MIME Type
23
Repository Contributions
24
Current & Future Work
• Building a web interface for Warrick
• Currently crawling & reconstructing 300 randomly sampled websites each week
  – Move from a descriptive model to a prescriptive & predictive model
• Injecting server-side functionality into the WI (Web Infrastructure)
  – Recover the PHP code, not just the HTML (see the sketch below)
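A highly speculative sketch of the injection idea, not the authors' actual mechanism: encode the server-side source into the HTML that crawlers see (here, base64 inside an HTML comment with a made-up WARRICK-PHP marker) so that a later reconstruction can decode the code from any cached copy:

    import base64
    import re

    def embed(php_source, html):
        """Append the PHP source, base64-encoded, as an HTML comment."""
        payload = base64.b64encode(php_source.encode()).decode()
        return html + "\n<!-- WARRICK-PHP:" + payload + " -->\n"

    def recover(cached_html):
        """Decode the PHP source from a cached copy of the page, if present."""
        m = re.search(r"<!-- WARRICK-PHP:([A-Za-z0-9+/=]+) -->", cached_html)
        return base64.b64decode(m.group(1)).decode() if m else None

    page = embed("<?php echo 'hello'; ?>", "<html><body>hi</body></html>")
    print(recover(page))  # <?php echo 'hello'; ?>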