Lazy Preservation, Warrick, and the Web Infrastructure. Frank McCown, Old Dominion University, Computer Science Department, Norfolk, Virginia, USA. JCDL 2007, Vancouver, BC, June 19, 2007.
![Page 1: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/1.jpg)
Lazy Preservation, Warrick, and the Web Infrastructure
Frank McCown
Old Dominion University
Computer Science Department
Norfolk, Virginia, USA
JCDL 2007
Vancouver, BC
June 19, 2007
![Page 2: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/2.jpg)
2
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 3: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/3.jpg)
3
![Page 4: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/4.jpg)
4
Web Infrastructure
![Page 5: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/5.jpg)
5
Alternative Models of Preservation
• Lazy Preservation
  – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  – Wait for it to disappear first, then find a “good enough” version
• Shared Infrastructure Preservation
  – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  – Use Apache modules to create archival-ready resources
![Page 6: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/6.jpg)
6
![Page 7: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/7.jpg)
7
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
![Page 8: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/8.jpg)
8
Crawling the Crawlers
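[Figure: in web crawling, a crawler copies resources from the World Wide Web into a repository (Repo); in web-repository crawling, a crawler instead collects resources from the repositories Repo 1, Repo 2, ..., Repo n.]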
![Page 9: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/9.jpg)
9
![Page 10: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/10.jpg)
10
![Page 11: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/11.jpg)
11
Cached Image
![Page 12: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/12.jpg)
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Figure: the canonical PDF shown alongside the MSN version, Yahoo version, and Google version]
![Page 13: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/13.jpg)
13
Web-repository Crawler
![Page 14: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/14.jpg)
14
• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
![Page 15: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/15.jpg)
15
What Types of Websites Are Lost?
Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.
![Page 16: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/16.jpg)
16
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 17: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/17.jpg)
17
Understanding the WI
• How quickly do search engines acquire and purge their caches?
• Do search engines prefer caching one type of resource over another?
• How much overlap is there between the search engines’ caches and IA holdings?
• How successfully can we reconstruct a lost website?
• Are some resources more recoverable than others?
![Page 18: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/18.jpg)
18
Timeline of Web Resource
![Page 19: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/19.jpg)
19
Web Caching Experiment
• Create 4 websites composed of HTML, PDFs, and images
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/
• Remove pages each day
• Query Google, MSN, and Yahoo (GMY) every day using identifiers
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
![Page 20: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/20.jpg)
20
![Page 21: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/21.jpg)
21
![Page 22: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/22.jpg)
22
![Page 23: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/23.jpg)
23
![Page 24: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/24.jpg)
24
Where is the Internet Archive?
• No crawls from Alexa, IA’s crawl provider
• Even if they had crawled us, the content would not be accessible from IA for 6-12 months
• Short-lived web content is likely to be lost for good
![Page 25: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/25.jpg)
25
2005 Reconstruction Experiment
• Crawl and reconstruct 24 sites of various sizes:
  1. small (1-150 resources)
  2. medium (151-499 resources)
  3. large (500+ resources)
• Perform 5 reconstructions for each website
  – One using all four repositories together
  – Four using each repository separately
• Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)
![Page 26: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/26.jpg)
26
How Much Did We Reconstruct?
[Figure: two site graphs side by side, listed here level by level]
“Lost” web site: A; B, C; D, E, F
Reconstructed web site: A; B’, C’; G, E; F
Annotations: missing link to D, points to old resource G; F can’t be found
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
![Page 27: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/27.jpg)
27
Reconstruction Diagram
added 20%
identical 50%
changed 33%
missing 17%
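A minimal sketch of how a reconstruction vector (changed%, missing%, added%) could be computed from resource counts. The slides do not spell out the denominators, so this assumes that changed and missing are fractions of the original site and that added is a fraction of the reconstructed site, a reading that reproduces the 33%, 17%, and 20% shown in the diagram. The class name, field names, and example counts (6 original resources, 5 reconstructed, 2 changed, 1 missing, 1 added) are illustrative assumptions chosen to match those percentages.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReconstructionCounts:
    """Resource counts for one reconstruction (hypothetical field names)."""
    original_total: int       # resources in the lost site
    reconstructed_total: int  # resources in the reconstructed site
    changed: int              # recovered, but different from the original
    missing: int              # in the original, not recovered
    added: int                # in the reconstruction, not in the original


def reconstruction_vector(c: ReconstructionCounts) -> Tuple[float, float, float]:
    """Return (changed%, missing%, added%).

    Assumption: changed and missing are measured against the original
    site, added against the reconstructed site.
    """
    if c.original_total <= 0 or c.reconstructed_total <= 0:
        raise ValueError("site sizes must be positive")
    changed_pct = 100.0 * c.changed / c.original_total
    missing_pct = 100.0 * c.missing / c.original_total
    added_pct = 100.0 * c.added / c.reconstructed_total
    return (changed_pct, missing_pct, added_pct)


# Assumed counts chosen to match the percentages in the diagram.
example = ReconstructionCounts(
    original_total=6, reconstructed_total=5, changed=2, missing=1, added=1
)
print(tuple(round(v) for v in reconstruction_vector(example)))  # (33, 17, 20)
```

Under this reading the identical share is whatever remains of the original site after changed and missing are subtracted, which is why identical, changed, and missing sum to 100% while added is reported on its own scale.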
![Page 28: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/28.jpg)
28
Recovery Success by MIME Type
![Page 29: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/29.jpg)
29
Repository Contributions
![Page 30: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/30.jpg)
30
2006 Reconstruction Experiment
• 300 websites chosen randomly from Open Directory Project (dmoz.org)
• Crawled and reconstructed each website every week for 14 weeks
• Examined change rates, age, decay, growth, recoverability
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
![Page 31: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/31.jpg)
31
Success of website recovery each week
*On average, we recovered 61% of a website in any given week.
![Page 32: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/32.jpg)
32
![Page 33: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/33.jpg)
33
Statistics for Repositories
![Page 34: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/34.jpg)
34
Experiment: Sample Search Engine Caches
• Feb 2006
• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
• Randomly selected 1 result from first 100
• Downloaded the resource and its cached page
• Checked for overlap with the Internet Archive
McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
![Page 35: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/35.jpg)
35
Distribution of Top Level Domains
![Page 36: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/36.jpg)
36
Cached Resource Size Distributions
[Figure: cached resource size distributions; sizes annotated on the plots: 976 KB, 977 KB, 1 MB, 215 KB]
![Page 37: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/37.jpg)
37
Cache Freshness
[Timeline figure: a resource is crawled and cached (fresh), later changed on the web server (stale), then crawled and cached again (fresh).]
Staleness = max(0, Last-Modified HTTP header - cached date)
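The staleness formula can be sketched in a few lines of Python. This is an illustration, not the code used in the experiment: the function name is hypothetical, the result is expressed in days, and the cached date is assumed to be already available as a timezone-aware `datetime` (how it was extracted from each search engine's cached page is not described here).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def staleness_days(last_modified_header: str, cached_date: datetime) -> float:
    """Staleness = max(0, Last-Modified - cached date), in days.

    last_modified_header: value of the HTTP Last-Modified header.
    cached_date: timezone-aware datetime of the cached copy (assumed input).
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    if last_modified.tzinfo is None:
        # Treat a header without zone information as UTC (assumption).
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    delta = last_modified - cached_date
    return max(0.0, delta.total_seconds() / 86400.0)


cached = datetime(2006, 2, 10, 12, 0, tzinfo=timezone.utc)
# Resource changed 5 days after it was cached: the cached copy is stale.
print(staleness_days("Wed, 15 Feb 2006 12:00:00 GMT", cached))  # 5.0
# Resource last changed before it was cached: the cached copy is fresh.
print(staleness_days("Sun, 05 Feb 2006 12:00:00 GMT", cached))  # 0.0
```

The `max(0, ...)` clamp means a copy cached after the last modification always has staleness zero, regardless of how old it is.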
![Page 38: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/38.jpg)
38
Cache Staleness
• 46% of resources had a Last-Modified header
• 71% also had a cached date
• 16% were at least 1 day stale
![Page 39: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/39.jpg)
39
Similarity vs. Staleness
![Page 40: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/40.jpg)
40
How much of the Web is indexed?
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Yahoo
MSN
Indexable Web
8 billion pages
6.6 billion pages
5 billion pages
11.5 billion pages
Internet Archive?
![Page 41: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/41.jpg)
41
Overlap with Internet Archive
![Page 42: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/42.jpg)
42
Overlap with Internet Archive
![Page 43: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/43.jpg)
43
Distribution of Sampled URLs
![Page 44: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/44.jpg)
44
Problem:
WI currently only stores the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI.
![Page 45: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/45.jpg)
45
Outline
• What is the Web Infrastructure (WI)?
• How can the WI be used for preservation?
• Web-repository crawling with Warrick
• Understanding the WI
  – Caching experiment
  – Reconstruction experiments
  – Search engine sampling and IA overlap experiment
• Recovering web server components from the WI
• Brass: Queueing manager for Warrick
![Page 46: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/46.jpg)
46
[Figure labels: Web Server containing Database, Perl script, config, and static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Dynamic page; Web Infrastructure. Legend: Recoverable, Not Recoverable.]
![Page 47: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/47.jpg)
47
Injecting Server Components into Crawlable Pages
[Figure labels: Erasure codes; HTML pages; Recover at least m blocks]
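The slide does not say which erasure code is used or how the blocks are embedded in HTML pages. As a minimal sketch of the "recover at least m blocks" idea, the example below swaps in the simplest possible erasure code: m data blocks plus a single XOR parity block, so any m of the m + 1 blocks are enough to rebuild the original bytes. All function names are hypothetical, and a code that tolerates more lost blocks (for example Reed-Solomon) would likely be needed in practice.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, m: int) -> List[bytes]:
    """Split data into m equal-size blocks and append one XOR parity block."""
    if m < 1:
        raise ValueError("m must be at least 1")
    block_size = max(1, -(-len(data) // m))  # ceiling division
    padded = data.ljust(block_size * m, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(m)]
    parity = reduce(xor_blocks, blocks)
    return blocks + [parity]


def decode(blocks: List[Optional[bytes]], original_length: int) -> bytes:
    """Rebuild the data from m + 1 slots of which at most one is None."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one lost block")
    repaired = list(blocks)
    if missing:
        present = [block for block in blocks if block is not None]
        # XOR of all surviving blocks equals the lost block.
        repaired[missing[0]] = reduce(xor_blocks, present)
    data_blocks = repaired[:-1]  # drop the parity block
    return b"".join(data_blocks)[:original_length]


script = b"#!/usr/bin/perl\nprint 'hello';\n"
received: List[Optional[bytes]] = list(encode(script, m=4))
received[2] = None  # pretend the page carrying this block was never cached
assert decode(received, len(script)) == script
```

The original length is passed to `decode` because the last data block is zero-padded; a real scheme would have to store that length alongside the blocks.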
![Page 48: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/48.jpg)
48
Brass: A Queueing Manager for Warrick
• Warrick requires some technical expertise to download, install, and run
• Warrick uses search engine APIs which allow limited requests per IP address (or key)
• Google no longer provides new keys for accessing their API
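The slides do not describe how Brass works internally. The sketch below only illustrates the general idea that the bullets motivate: reconstruction jobs wait in a queue and are served in order, spending a limited daily budget of search engine queries. The class names, the first-in-first-out policy, and the quota numbers are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Job:
    site: str          # website to reconstruct
    queries_left: int  # repository queries still needed (estimated)


class QuotaQueue:
    """FIFO queue of reconstruction jobs served under a daily query quota."""

    def __init__(self, daily_quota: int) -> None:
        if daily_quota <= 0:
            raise ValueError("daily_quota must be positive")
        self.daily_quota = daily_quota
        self.jobs: Deque[Job] = deque()

    def submit(self, site: str, queries_needed: int) -> None:
        self.jobs.append(Job(site, queries_needed))

    def run_day(self) -> List[str]:
        """Spend one day's quota; return the sites finished that day."""
        budget = self.daily_quota
        finished: List[str] = []
        while self.jobs and budget > 0:
            job = self.jobs[0]
            spent = min(job.queries_left, budget)
            job.queries_left -= spent
            budget -= spent
            if job.queries_left == 0:
                finished.append(self.jobs.popleft().site)
        return finished


queue = QuotaQueue(daily_quota=1000)
queue.submit("a.example", 600)
queue.submit("b.example", 900)
print(queue.run_day())  # ['a.example']
print(queue.run_day())  # ['b.example']
```

A job that does not fit in the remaining budget stays at the head of the queue and resumes the next day, so large sites take several days but are never starved.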
![Page 49: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/49.jpg)
49
![Page 50: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/50.jpg)
50
![Page 51: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/51.jpg)
51
Thank You
Frank McCown
[email protected]
http://www.cs.odu.edu/~fmccown/
Can’t wait until I’m old enough to recover a website!
![Page 55: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/55.jpg)
56
Web Repository Characteristics

| Type | MIME type | File ext | Google | Yahoo | Live | IA |
|---|---|---|---|---|---|---|
| HTML text | text/html | html | C | C | C | C |
| Plain text | text/plain | txt, ans | M | M | M | C |
| Graphic Interchange Format | image/gif | gif | M | M | M | C |
| Joint Photographic Experts Group | image/jpeg | jpg | M | M | M | C |
| Portable Network Graphic | image/png | png | M | M | M | C |
| Adobe Portable Document Format | application/pdf | pdf | M | M | M | C |
| JavaScript † | application/javascript | js | M | M | | C |
| Microsoft Excel | application/vnd.ms-excel | xls | M | ~S | M | C |
| Microsoft PowerPoint | application/vnd.ms-powerpoint | ppt | M | M | M | C |
| Microsoft Word | application/msword | doc | M | M | M | C |
| PostScript † | application/postscript | ps | M | ~S | | C |

C = Canonical version is stored
M = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = Indexed but not stored
† Only three values survive for this row; they are shown in their original order with the Live cell left empty, which is an assumption.
![Page 56: Lazy Preservation, Warrick, and the Web Infrastructure](https://reader035.vdocument.in/reader035/viewer/2022062518/56814017550346895dab61f9/html5/thumbnails/56.jpg)
57
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.