![Page 1: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/1.jpg)
Web Archive Profiling Through Fulltext Search
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the IIPC and NSF 1526700
![Page 2: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/2.jpg)
Unorganized Collections
![Page 3: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/3.jpg)
Organized Collections
![Page 4: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/4.jpg)
Collection Understanding
![Page 5: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/5.jpg)
Memento Aggregator
![Page 6: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/6.jpg)
Memento Aggregator
![Page 7: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/7.jpg)
Memento Aggregator
![Page 8: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/8.jpg)
Memento Aggregator
![Page 9: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/9.jpg)
Memento Aggregator
![Page 10: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/10.jpg)
Memento Aggregator
![Page 11: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/11.jpg)
From: Michael Nelson [mailto:[email protected]]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one of my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/. Can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not actually be scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <[email protected]>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <[email protected]>, Sawood Alam <[email protected]>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
![Page 12: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/12.jpg)
Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful, both clients and archives suffer
![Page 13: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/13.jpg)
Memento Routing
![Page 14: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/14.jpg)
Routing Pros & Cons
● Pros
  ○ Minimizes traffic and resource consumption
  ○ Improves throughput
● Cons
  ○ Upfront profile maintenance cost
  ○ May miss Mementos (false negatives)
![Page 15: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/15.jpg)
Why Small Archives Matter?
![Page 16: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/16.jpg)
Why Small Archives Matter?
● 400B+ web pages at IA do not cover everything
● Top three archives after IA produce full TimeMap 52% of the time (AlSum, et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
![Page 17: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/17.jpg)
While the Internet Archive was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
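The pipeline above keeps the first four characters of every TimeMap line, drops metadata lines that start with `@`, and counts what remains per year. A rough Python equivalent is sketched below. The function name and the sample lines are made up for illustration, and the sketch assumes every data line begins with a 14-digit datetime.

```python
from collections import Counter


def mementos_per_year(cdxj_lines):
    """Count data lines per year, keyed by the first four characters."""
    counts = Counter()
    for line in cdxj_lines:
        # Skip blank lines and metadata lines (those starting with "@").
        if not line.strip() or line.startswith("@"):
            continue
        counts[line[:4]] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample input, not real MemGator output.
sample = [
    '@meta {"note": "illustrative metadata line"}',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20021129103000 {"uri": "http://archive.example/20021129103000/http://example.org/"}',
    '20050301000000 {"uri": "http://archive.example/20050301000000/http://example.org/"}',
]

for year, count in mementos_per_year(sample).items():
    print(f"{count:7d} {year}")
```

Unlike `uniq -c`, which only merges adjacent duplicates, the counter does not depend on the lines being sorted by datetime.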
![Page 18: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/18.jpg)
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
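To make the routing use case concrete, the sketch below treats each profile as a plain set of keys and sends a lookup only to archives whose profile contains the key of the requested URI-R. Everything here is an assumption for illustration: the archive names, the `route` helper, and the choice of the lowercased hostname as the key. Real profiles use richer keys and carry statistics.

```python
from urllib.parse import urlsplit


def host_key(uri):
    """Simplified lookup key: the lowercased hostname of the URI-R."""
    return (urlsplit(uri).hostname or "").lower()


def route(uri, profiles):
    """Return the archives whose profile predicts holdings for this URI-R."""
    key = host_key(uri)
    return sorted(name for name, keys in profiles.items() if key in keys)


# Hypothetical profiles: archive name -> set of keys it is known to hold.
profiles = {
    "archive-a": {"example.org", "bbc.co.uk"},
    "archive-b": {"example.org"},
}

print(route("http://EXAMPLE.org/page", profiles))   # both archives
print(route("https://bbc.co.uk/news", profiles))    # archive-a only
print(route("http://unknown.test/", profiles))      # no archive is queried
```

An empty result means no archive is polled at all, which is where routing saves traffic compared with broadcasting, at the risk of a false negative when a profile is incomplete.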
![Page 19: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/19.jpg)
Profiling Strategies
● Sample URI Profiling (AlSum, et al., TPDL 2013)
● CDX Profiling (Alam, et al., TPDL 2015)
● Response Cache Profiling (Bornand, et al., JCDL 2016)
● Fulltext Search Profiling
![Page 20: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/20.jpg)
Methodology
Top Nouns: time, year, people, way, man, day, thing, child, mr, government
Random Dict: analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew
Dynamic Words Discovery: the وكالة war angry the أنباء arab العربي middle news east الغاضب service on arabica politics poetry source war art
![Page 21: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/21.jpg)
Random Searcher Model (RSM)
Flowchart: START → Seed Vocabulary → NextWord() → Search() → Store search results → Termination condition reached?
● Yes: GenerateProfile() → STOP
● No: Vocabulary seeding needed?
  ○ No: NextWord()
  ○ Yes: Select a random link from the search results → Fetch the contents of the selected document → ExtractWords() → NextWord()
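The flowchart steps can be sketched as a small loop. This is one reading of the diagram under stated assumptions, not the authors' implementation: `search` and `fetch` are hypothetical callables standing in for an archive's fulltext search and document retrieval, the termination condition is a simple search budget, and new words are harvested only when the vocabulary runs out.

```python
import random
import re


def extract_words(text):
    """ExtractWords(): lowercase word tokens found in a document."""
    return re.findall(r"\w+", text.lower())


def random_searcher(seed_words, search, fetch, max_searches, seed=None):
    """Return the set of URIs discovered through keyword searches."""
    rng = random.Random(seed)
    vocabulary = list(dict.fromkeys(seed_words))  # unused words, de-duplicated
    known_words = set(vocabulary)
    discovered = set()
    searches = 0
    # Termination: search budget spent, or no words left to try.
    while vocabulary and searches < max_searches:
        word = vocabulary.pop(rng.randrange(len(vocabulary)))  # NextWord()
        results = list(search(word))                           # Search()
        searches += 1
        discovered.update(results)                             # store results
        if not vocabulary and results:                         # seeding needed?
            document = fetch(rng.choice(results))              # random link
            for token in extract_words(document):
                if token not in known_words:
                    known_words.add(token)
                    vocabulary.append(token)
    return discovered  # input for a GenerateProfile() step


# Toy corpus standing in for an archive (hypothetical data).
corpus = {
    "http://a.example/1": "web archive profile",
    "http://a.example/2": "archive routing memento",
    "http://a.example/3": "memento aggregator",
}


def search(word):
    return [uri for uri, text in corpus.items() if word in text.split()]


def fetch(uri):
    return corpus[uri]


found = random_searcher(["web"], search, fetch, max_searches=50, seed=1)
print(sorted(found))
```

How far the walk gets depends on which document is picked for seeding, so repeated runs with different seeds can discover different portions of the toy corpus.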
![Page 22: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/22.jpg)
RSM Illustration
Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012
![Page 23: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/23.jpg)
RSM Modes
● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after every search attempt and consider term frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary after every search attempt and ignore term frequency for selecting next search keyword
● Conservative: Discover new words only when the Vocabulary is exhausted
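The difference between PopularityBiased and EqualOpportunity comes down to how the next keyword is drawn from the current vocabulary. A minimal sketch follows, assuming the vocabulary is held as a `Counter` of term frequencies. The `next_word` helper and its weighting scheme are illustrative guesses, not the exact selection rule used in the study.

```python
import random
from collections import Counter


def next_word(term_counts, mode, rng):
    """Pick the next search keyword from a Counter of term frequencies."""
    words = sorted(term_counts)
    if not words:
        raise ValueError("vocabulary is empty")
    if mode == "PopularityBiased":
        # Frequent terms are proportionally more likely to be chosen.
        weights = [term_counts[w] for w in words]
        return rng.choices(words, weights=weights, k=1)[0]
    if mode == "EqualOpportunity":
        # Every distinct term is equally likely, regardless of frequency.
        return rng.choice(words)
    raise ValueError(f"unsupported mode: {mode}")


rng = random.Random(42)
vocabulary = Counter("the archive of the web is the archive".split())
print(next_word(vocabulary, "PopularityBiased", rng))
print(next_word(vocabulary, "EqualOpportunity", rng))
```

Static and Conservative modes need no frequency information, since they draw from a fixed list or from words harvested only when the list runs out.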
![Page 24: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/24.jpg)
Profiling Policies & Archive-It Dataset

| Policy | # Keys | Example |
|---|---|---|
| URIR | 30,800,406 | uk,co,bbc,news,)/Images/Logo.png?height=80&width=200 |
| HxP1 | 1,724,284 | uk,co,bbc,news,)/Images |
| DDom | 91,629 | uk,co,bbc,)/ |
| H1P0 | 212 | uk,)/ |

Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40

For a detailed list of profiling policies please refer to: Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
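The example keys can be approximated with a few lines of Python. This is a sketch that follows the pattern visible in the examples (reversed, lowercased host labels with a leading `www` dropped, then a truncated path), not the canonicalizer used to build the dataset. The `profile_key` function and its `max_host` / `max_path` parameters are hypothetical names. DDom is left out because finding the registered domain needs a public suffix list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def profile_key(uri, max_host=None, max_path=None):
    """Build a key from a URI, keeping at most max_host host labels
    and max_path path segments (None keeps everything)."""
    parts = urlsplit(uri)
    labels = (parts.hostname or "").split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    host = list(reversed(labels))
    if max_host is not None:
        host = host[:max_host]
    segments = [s for s in parts.path.split("/") if s]
    if max_path is not None:
        segments = segments[:max_path]
    key = ",".join(host) + ",)/" + "/".join(segments)
    # Keep the (sorted) query string only for the full URIR-style key.
    if max_path is None and parts.query:
        pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
        key += "?" + urlencode(pairs)
    return key


uri = "https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40"
print(profile_key(uri))                          # URIR-style key
print(profile_key(uri, max_path=1))              # HxP1-style key
print(profile_key(uri, max_host=1, max_path=0))  # H1P0-style key
```

The query values in the URIR-style key simply follow the input URI, with parameters sorted by name.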
![Page 25: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/25.jpg)
Searches vs Coverage
100% in 11K searches
100% in 27K searches
100% in 337K searches
100% in 1.9M searches
![Page 26: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/26.jpg)
RSM Operation Mode Costs

| Mode | Query Cost | HTTP Cost | Remarks |
|---|---|---|---|
| Static | C | C | Suitable for specialized collections with known top keywords |
| PopularityBiased | C | 2 * C | Human-like model, but costly |
| EqualOpportunity | C | 2 * C | Human-like model, but costly |
| Conservative | C | C + (where << C) | Suitable for any collection and works without any supplementary materials with very little overhead |
![Page 27: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/27.jpg)
Routing Confusion Matrix

| Predicted \ Actual | Present in the Archive | Not in the Archive |
|---|---|---|
| Routed to the Archive | True Positive (TP) | False Positive (FP) |
| Not Routed to the Archive | False Negative (FN) | True Negative (TN) |

Metrics: Recall, Accuracy
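The formulas for the two metrics did not survive in the slide text. The sketch below uses the standard definitions, Recall = TP / (TP + FN) and Accuracy = (TP + TN) / (TP + FP + FN + TN), on the assumption that the slide follows them. The function name and the sample counts are illustrative.

```python
def routing_metrics(tp, fp, fn, tn):
    """Return (recall, accuracy) for one archive's routing decisions."""
    total = tp + fp + fn + tn
    # Recall: share of URIs actually held that were routed to the archive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: share of all routing decisions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    return recall, accuracy


# Made-up counts for illustration only.
recall, accuracy = routing_metrics(tp=90, fp=10, fn=10, tn=90)
print(f"recall={recall:.2f} accuracy={accuracy:.2f}")
```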
![Page 28: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/28.jpg)
Accuracy, Recall, & Coverage (10-100%)
Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy
Low Accuracy (high FP) => Archives & Aggregator suffer
Low Recall (high FN) => Users suffer
![Page 29: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/29.jpg)
Profile Policy Recommendations
● IF complete CDX is available THEN
  ○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
  ○ Generate DDom profile
● ELSE
  ○ Generate H1P0 or other smaller profiles using Sample URIs
Note: It is possible to perform less detailed queries on more specific (higher order) profiles, but not the other way around.
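The note about querying more specific profiles can be illustrated by truncating a detailed key down to a coarser one. The reverse is impossible because the dropped segments cannot be recovered. This is a sketch with a hypothetical `roll_up` helper, assuming keys in the `host,labels,)/path` form used in the policy examples.

```python
def roll_up(key, max_host, max_path):
    """Reduce a detailed key to at most max_host host labels
    and max_path path segments."""
    host_part, _, path_part = key.partition(")/")
    hosts = [h for h in host_part.split(",") if h][:max_host]
    path = path_part.split("?", 1)[0]  # a coarser key carries no query string
    segments = [s for s in path.split("/") if s][:max_path]
    return ",".join(hosts) + ",)/" + "/".join(segments)


detailed = "uk,co,bbc,news,)/Images/Logo.png?height=80&width=200"
print(roll_up(detailed, max_host=4, max_path=1))  # uk,co,bbc,news,)/Images
print(roll_up(detailed, max_host=3, max_path=0))  # uk,co,bbc,)/
print(roll_up(detailed, max_host=1, max_path=0))  # uk,)/
```

Keeping three host labels happens to match the DDom example for this particular host. In general the registered domain has to be found with a public suffix list.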
![Page 30: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/30.jpg)
RSM Mode Recommendations
● IF the collection is about a specific topic in a specific language AND a suitable top keywords list is available THEN
  ○ Use Static mode
● ELSE
  ○ Use Conservative mode
![Page 31: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/31.jpg)
Who Knows Term Frequency for Estonian Nouns?
https://en.wiktionary.org/wiki/Category:Estonian_nouns
![Page 32: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/32.jpg)
Future Work
● Evaluation of combination profiles such as URI-Key along with Datetime
● Utilize archive profiles to generate a rank-ordered list of archives
● Profiles for usage other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog, etc.)
![Page 33: Web Archive Profiling Through Fulltext Search](https://reader034.vdocument.in/reader034/viewer/2022042520/58841cc11a28ab485c8b49e5/html5/thumbnails/33.jpg)
Conclusions
● Evaluated the search cost as a function of archive holdings’ coverage and profiling policy
● Developed the Random Searcher Model
● Correctly routed 80% of requests while maintaining 0.9 Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile