Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Long Tail of Archives
● The 400B+ web pages at the Internet Archive (IA) do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
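A minimal sketch of how a profile could drive Memento query routing, assuming each profile is simply a set of URI-Keys and that the query key was generated with the same policy as the profile. The archive names, the keys, and the `route` helper are hypothetical and are not taken from the archive_profiler code.

```python
def route(query_key, profiles):
    """Return the names of archives whose profile contains query_key."""
    return sorted(name for name, keys in profiles.items() if query_key in keys)


# Hypothetical profiles: archive name -> set of URI-Keys it claims to hold.
profiles = {
    "archive_a": {"uk,co,bbc)/", "com,cnn)/"},
    "archive_b": {"uk,co,bbc)/"},
    "archive_c": {"org,example)/"},
}

# Only archives whose profile mentions the key are queried.
print(route("uk,co,bbc)/", profiles))
```

Archives that do not list the key are skipped, which is where the savings in routing come from.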
Profiling Strategies
● Complete URI-R Profiling
  ○ bbc.co.uk/images/logo.png?w=90
  ○ cnn.com/2014/03/15/?id=128734
● TLD-only Profiling
  ○ com)/
  ○ uk)/
● Middle Ground
  ○ uk,co)/
  ○ uk,co,bbc)/images
  ○ uk,co,bbc)/0/2/1
  ○ com,cnn)/ 201309 ar
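The path-prefix style of key above can be sketched as follows. This is an illustrative approximation, assuming a key is the reversed host segments joined by commas, followed by ")/" and a limited number of path segments. The `uri_key` function and its `max_host` and `max_path` parameters are hypothetical names, and the sketch ignores ports, query strings, and other canonicalization a real profiler would need.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_host=None, max_path=0):
    """Build a SURT-like key from a URI.

    max_host: keep at most this many host segments (None keeps all).
    max_path: keep at most this many path segments.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    host_segments = host.split(".")[::-1]  # bbc.co.uk -> uk, co, bbc
    if max_host is not None:
        host_segments = host_segments[:max_host]
    path_segments = [s for s in parts.path.split("/") if s][:max_path]
    return ",".join(host_segments) + ")/" + "/".join(path_segments)


uri = "bbc.co.uk/images/logo.png?w=90"
print(uri_key(uri, max_host=1))  # TLD only
print(uri_key(uri, max_host=2))  # two host segments
print(uri_key(uri, max_path=1))  # full host plus one path segment
```

Keeping fewer segments produces fewer distinct keys and a smaller profile, at the price of less specific predictions.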
Profile Types
● URI-R based
  ○ Complete URI-R
  ○ TLD only
  ○ URI-Key (middle ground)
● Language
● Datetime
● Many more...
Sample Query Sets
Sample        Size  In Archive-It  In UKWA
DMOZ          1M    4.097%         1.912%
MementoProxy  1M    4.182%         0.179%
IAWayback     1M    3.716%         0.231%
Evaluation
● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Precision vs. Relative Cost
Routing Precision
● Search precision with respect to the TLD-only profile
  ○ Double for registered domain policy
  ○ Fivefold for HxP1 policy
Precision vs Cost
● Up to 22% routing precision with <5% cost
● <0.3% of sample URIs from MementoProxy and IAWayback logs are present in UKWA
● Shallow crawling of UKWA results in higher cost
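The slides do not spell out how these metrics are computed, so the following is only a sketch under stated assumptions: routing precision is taken to be the fraction of sample URI-Rs routed to an archive that the archive actually holds, and relative cost is taken to be the number of URI-Keys in the profile divided by the number of URI-Rs in the archive. The function names and the toy data are hypothetical.

```python
def routing_precision(routed, present):
    """Fraction of routed URI-Rs that the archive really holds (assumed definition)."""
    if not routed:
        return 0.0
    return len(routed & present) / len(routed)


def false_negatives(routed, present):
    """URI-Rs the archive holds that the profile failed to route to it."""
    return present - routed


def relative_cost(num_uri_keys, num_uri_rs):
    """Profile size relative to the archive's URI-R count (assumed definition)."""
    return num_uri_keys / num_uri_rs


# Toy data: five sample URI-Rs were routed to the archive, one is really there.
routed = {"a", "b", "c", "d", "e"}
present = {"a"}

print(routing_precision(routed, present))
print(false_negatives(routed, present))
print(relative_cost(5, 100))
```

Under these assumptions, a profile built from the archive's own holdings never misses a URI-R that is present, so only precision and cost trade off against each other.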
Future Work
● Generating sample URI sets
● Profiling via sampling
● Language profiles
● Evaluation of combination profiles such as URI-Key along with Datetime
● Profiles for usage other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
Conclusions
● Generated profiles with different policies for two archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 22% routing precision with <5% relative cost without any false negatives
● Code @ GitHub:/oduwsdl/archive_profiler