TRANSCRIPT
Lazy Preservation: Reconstructing Websites by
Crawling the Crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen
Old Dominion University, Norfolk, Virginia, USA
WIDM 2006, Arlington, Virginia, November 10, 2006
2
Outline
• Web page threats
• Web Infrastructure
• Web caching experiment
• Web repository crawling
• Website reconstruction experiment
3
[Images illustrating threats to web pages. Credits:]
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
4
How much of the Web is indexed?
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
[Diagram of search engine index size estimates:]
Google: 8 billion pages
Yahoo: 6.6 billion pages
MSN: 5 billion pages
Indexable Web: 11.5 billion pages
5-7
Cached Image / Cached PDF
[Screenshots: the MSN, Yahoo, and Google cached versions of
http://www.fda.gov/cder/about/whatwedo/testtube.pdf shown alongside the canonical copy]
Web Repository Characteristics

Type                             | MIME type                     | File ext | Google | Yahoo | MSN | IA
HTML text                        | text/html                     | html     | C      | C     | C   | C
Plain text                       | text/plain                    | txt, ans | M      | M     | M   | C
Graphic Interchange Format       | image/gif                     | gif      | M      | M     | ~R  | C
Joint Photographic Experts Group | image/jpeg                    | jpg      | M      | M     | ~R  | C
Portable Network Graphic         | image/png                     | png      | M      | M     | ~R  | C
Adobe Portable Document Format   | application/pdf               | pdf      | M      | M     | M   | C
JavaScript                       | application/javascript        | js       | M      | M     |     | C
Microsoft Excel                  | application/vnd.ms-excel      | xls      | M      | ~S    | M   | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint | ppt      | M      | M     | M   | C
Microsoft Word                   | application/msword            | doc      | M      | M     | M   | C
PostScript                       | application/postscript        | ps       | M      | ~S    |     | C

C  = Canonical version is stored
M  = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~R = Indexed but not retrievable
~S = Indexed but not stored
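Read as data, the table above is just a lookup from MIME type to a per-repository storage code. A minimal Python sketch of that lookup; the repository ordering, the None entries for the slide's blank cells, and the helper name are illustrative assumptions:

    # Storage codes mirror the slide: "C" = canonical copy is stored,
    # "M" = modified copy, "~R" = indexed but not retrievable,
    # "~S" = indexed but not stored, None = cell left blank on the slide.
    STORAGE = {
        # MIME type: (Google, Yahoo, MSN, Internet Archive)
        "text/html":                     ("C", "C", "C", "C"),
        "text/plain":                    ("M", "M", "M", "C"),
        "image/gif":                     ("M", "M", "~R", "C"),
        "image/jpeg":                    ("M", "M", "~R", "C"),
        "image/png":                     ("M", "M", "~R", "C"),
        "application/pdf":               ("M", "M", "M", "C"),
        "application/javascript":        ("M", "M", None, "C"),
        "application/vnd.ms-excel":      ("M", "~S", "M", "C"),
        "application/vnd.ms-powerpoint": ("M", "M", "M", "C"),
        "application/msword":            ("M", "M", "M", "C"),
        "application/postscript":        ("M", "~S", None, "C"),
    }

    REPOS = ("google", "yahoo", "msn", "ia")

    def repos_with_canonical(mime_type):
        """Return the repositories that store a canonical copy of this type."""
        codes = STORAGE.get(mime_type, (None,) * 4)
        return [repo for repo, code in zip(REPOS, codes) if code == "C"]

    print(repos_with_canonical("image/png"))  # ['ia']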
10
Timeline of a Web Resource
11
Web Caching Experiment
• Create 4 websites composed of HTML, PDF, and image resources:
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/
• Remove pages from each site every day
• Query Google, MSN, and Yahoo (GMY) each day using unique identifiers (see the sketch below)
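A minimal sketch of the daily probing step, assuming each test page carries a unique identifier string that can be submitted as a search query. The site/identifier data is an illustrative subset, and query_cache() is a hypothetical stand-in for however each engine is actually queried:

    import datetime

    # Illustrative subset of the four test sites and their per-page IDs.
    SITES = {
        "http://www.owenbrau.com/": ["owenbrau-page-001", "owenbrau-page-002"],
        "http://www.cs.odu.edu/~fmccown/lazy/": ["lazy-page-001"],
    }

    def query_cache(engine, identifier):
        """Hypothetical stand-in: return True if the engine's cache
        returns a hit for this unique identifier."""
        raise NotImplementedError

    def daily_probe(engines=("google", "msn", "yahoo")):
        """Record, for today, which page identifiers each engine still caches."""
        today = datetime.date.today().isoformat()
        rows = []
        for site, identifiers in SITES.items():
            for ident in identifiers:
                for engine in engines:
                    rows.append((today, site, ident, engine,
                                 query_cache(engine, ident)))
        return rows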
16
Crawling the Web and web repositories
[Diagram: a web crawler pulls pages from the World Wide Web into a single repository (web crawling); a web-repository crawler instead pulls cached pages out of multiple repositories, Repo 1, Repo 2, ..., Repo n (web-repository crawling)]
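A minimal sketch of the web-repository crawling loop suggested by the diagram: instead of fetching live pages from the Web, it asks each repository for a cached copy of a URL and follows the links found in recovered pages. fetch_from_repo() and extract_links() are hypothetical placeholders, and the first-hit policy is a simplification:

    from collections import deque

    def fetch_from_repo(repo, url):
        """Hypothetical stand-in: return repo's cached copy of url, or None."""
        raise NotImplementedError

    def extract_links(page, base_url):
        """Hypothetical stand-in: return absolute URLs linked from an HTML page."""
        raise NotImplementedError

    def reconstruct(start_url, repos=("google", "msn", "yahoo", "ia")):
        recovered = {}
        frontier = deque([start_url])
        seen = {start_url}
        while frontier:
            url = frontier.popleft()
            # Ask each repository in turn and keep the first copy found.
            # (A real crawler could prefer canonical copies, per the
            # repository-characteristics table above.)
            for repo in repos:
                copy = fetch_from_repo(repo, url)
                if copy is None:
                    continue
                recovered[url] = copy
                for link in extract_links(copy, url):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
                break
        return recovered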
17
Warrick
• First developed in fall of 2005
• Available for download at http://www.cs.odu.edu/~fmccown/warrick/
• www2006.org – first lost website reconstructed (Nov 2005)
• DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially endorses Warrick (mid Mar 2006)
18
How Much Did We Reconstruct?
[Diagram: the "lost" web site contains resources A, B, C, D, E, and F; the reconstructed web site contains A, B', C', G, and E. The reconstruction is missing the link to D and points to old resource G; F can't be found.]

Four categories of recovered resources (see the sketch below):
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
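A minimal sketch of the four-way classification, assuming each site is represented as a dict mapping URL to a content hash (the hash values below are illustrative stand-ins):

    def classify(lost, reconstructed):
        """Split resources into identical, changed, missing, and added."""
        identical = [u for u in lost if reconstructed.get(u) == lost[u]]
        changed = [u for u in lost
                   if u in reconstructed and reconstructed[u] != lost[u]]
        missing = [u for u in lost if u not in reconstructed]
        added = [u for u in reconstructed if u not in lost]
        return identical, changed, missing, added

    # The slide's example: D and F were not recovered, B and C came back
    # as older versions (B', C'), and G is an old resource not in the lost site.
    lost = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    recon = {"A": 1, "B": 20, "C": 30, "E": 5, "G": 70}
    print(classify(lost, recon))
    # (['A', 'E'], ['B', 'C'], ['D', 'F'], ['G'])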
19
Reconstruction Diagram
[Figure: an example reconstruction summary: identical 50%, changed 33%, missing 17%, added 20%]
20
Reconstruction Experiment
• Crawl and reconstruct 24 sites of various sizes:
  1. small (1-150 resources)
  2. medium (151-499 resources)
  3. large (500+ resources)
• Perform 5 reconstructions for each website:
  – One using all four repositories together
  – Four using each repository separately
• Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%) (see the sketch below)
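A minimal sketch of computing the reconstruction vector, under the assumption that changed% and missing% are measured against the original site and added% against the reconstructed one; the slides do not spell out the denominators, so treat them as an assumption:

    def reconstruction_vector(identical, changed, missing, added):
        """Return (changed%, missing%, added%) from resource counts.
        identical% is implied: 100 - changed% - missing%."""
        n_orig = identical + changed + missing    # resources in the lost site
        n_recon = identical + changed + added     # resources recovered
        return (100.0 * changed / n_orig,
                100.0 * missing / n_orig,
                100.0 * added / n_recon)

    # Counts from the earlier diagram: 2 identical (A, E), 2 changed
    # (B', C'), 2 missing (D, F), 1 added (G).
    print(reconstruction_vector(2, 2, 2, 1))  # approx (33.3, 33.3, 20.0)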
21
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.
22
Recovery Success by MIME Type
23
Repository Contributions
24
Current & Future Work
• Building a web interface for Warrick
• Currently crawling & reconstructing 300 randomly sampled websites each week
  – Move from a descriptive model to a prescriptive & predictive model
• Injecting server-side functionality into the WI (Web Infrastructure)
  – Recover the PHP code, not just the HTML (see the sketch below)
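A highly speculative sketch of the injection idea, not the authors' actual mechanism: encode the server-side source into the HTML that crawlers see (here, base64 inside an HTML comment with a made-up WARRICK-PHP marker) so that a later reconstruction can decode the code from any cached copy:

    import base64
    import re

    def embed(php_source, html):
        """Append the PHP source, base64-encoded, as an HTML comment."""
        payload = base64.b64encode(php_source.encode()).decode()
        return html + "\n<!-- WARRICK-PHP:" + payload + " -->\n"

    def recover(cached_html):
        """Decode the PHP source from a cached copy of the page, if present."""
        m = re.search(r"<!-- WARRICK-PHP:([A-Za-z0-9+/=]+) -->", cached_html)
        return base64.b64decode(m.group(1)).decode() if m else None

    page = embed("<?php echo 'hello'; ?>", "<html><body>hi</body></html>")
    print(recover(page))  # <?php echo 'hello'; ?>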