ALEX THURMAN Columbia University Libraries PCC Participants Meeting
ALA Annual June 30, 2013
Overview
• Web archiving context (Who)
• Benefits of curated web archives (Why)
• Columbia University Libraries Web Resources Collection Program (What/How)
o Background
o Selection
o Permissions
o Harvesting
o Description
o Access
Web archiving context
International Internet Preservation Consortium
http://netpreserve.org/
IIPC member type #
national/regional libraries 29
university libraries 7
other non-profits 4
archives 2
commercial 2
IIPC member region #
Europe 23
North America 13
Asia 4
Oceania 2
North Africa 2
Wow!
• Crawling the public Web since 1996
• Over 280 billion URLs
• Archive searchable by URL via Wayback Machine, which contains 5 petabytes of crawl data from 1996-2012
• Wayback Machine gets 1,000 visits per second
• Web archives from 1996-2007 backed up at mirror site at Bibliotheca Alexandrina
• Also collects millions of digitized books, movies, audio and concert recordings
• Over 10 petabytes of data in total
But!
• IA web collections are vast but not comprehensive
• General Internet crawls take about 3 months; many websites change faster, are short-lived, or are not well-linked/discoverable
• Depth of capture of individual websites varies
• Archive too huge to be indexed for full-text search
• Robots.txt restrictions are obeyed, so many sites are fully or partially blocked from archiving
Internet Archive
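To make the robots.txt point concrete, here is a minimal sketch of how a crawler that obeys robots.txt decides whether a URL may be captured. It uses Python's standard `urllib.robotparser`. The robots.txt rules, the host example.org, and the user-agent names are invented for illustration and are not taken from any site or crawler discussed in this talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: archivebot-example
Disallow: /
"""


def may_capture(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A generic crawler is blocked only from /private/ (partial block).
print(may_capture(ROBOTS_TXT, "genericbot-example", "http://example.org/about/"))
print(may_capture(ROBOTS_TXT, "genericbot-example", "http://example.org/private/report.pdf"))
# The crawler named in the second rule group is blocked from the whole site (full block).
print(may_capture(ROBOTS_TXT, "archivebot-example", "http://example.org/about/"))
```

The two rule groups correspond to the "partially blocked" and "fully blocked" cases mentioned above.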
Benefits of curated web archives
• Precise selection of desired web resources (documents, websites, "web spheres")
• Fuller website captures
• Control of frequency of website capture
• On-demand website capture
• Fill gaps in library print holdings or provide electronic access to existing print holdings
• Metadata
• Full-text searching
• Data mining
• Internet Archive
o Archive-It
255 partners, including 106 colleges/universities as well as state archives and libraries, public libraries, museum and art libraries, law libraries, NGOs, and K-12 program schools
• California Digital Library
o WAS (Web Archiving Service)
22 partners, mostly university libraries
Web archiving services
Andrew W. Mellon Foundation support for CUL web archiving
Grant projects
• Collection Building for Web Resources (2008-2009)
1 FTE: project librarian
• Web Resources Collection Program Development (2009-2012)
3 FTE: 2 web curators, 1 programmer
• Web Resources Archiving Collaboration (2013-2015)
2 FTE: 1 project librarian, 1 bibliographic assistant
• Avery Library Historic Preservation and Urban Planning
60+ websites, semiannual
• Burke Library New York City Religions
225 websites, semiannual
• Human Rights
520 websites, quarterly
• Rare Book and Manuscript Library
30 websites, semiannual
• University Archives
most of columbia.edu domain, plus 75+ affiliated sites with "external" URLs
• General
CUL Web Archive Collections
Permissions
• No explicit US Copyright Act libraries exception for web archiving
• CUL policy is to request permission from website owners to harvest their websites and provide access to archived versions
o Permission request email sent to contact info from website
o If no response after 3 weeks, follow-up request with notification of intent to archive website
• Statistics
o 818 requests sent
o 415 responded Yes
o 5 responded No
o 398 did not respond
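As a quick arithmetic check on the permission statistics, the short sketch below confirms that the three outcomes account for all 818 requests and expresses each as a share of the total. The counts come from the slide; the variable and function names are only illustrative.

```python
# Counts as reported on the slide.
requests_sent = 818
outcomes = {"Yes": 415, "No": 5, "No response": 398}

# The three outcomes should account for every request sent.
assert sum(outcomes.values()) == requests_sent


def percent(part: int, whole: int) -> float:
    """Share of whole represented by part, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of all requests falling into each outcome.
shares = {label: percent(count, requests_sent) for label, count in outcomes.items()}

# Among owners who replied at all, how many said Yes?
responded = outcomes["Yes"] + outcomes["No"]
yes_among_responders = percent(outcomes["Yes"], responded)

for label, share in shares.items():
    print(f"{label}: {share}% of requests")
print(f"Yes among those who responded: {yes_among_responders}%")
```

In other words, about half of the requests drew an explicit Yes, almost half drew no reply, and refusals were well under 1% of requests.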
Harvesting via web crawling
Benefits
• Best case scenario of fully navigable archived version of site reproducing look and feel of original website on particular date
• Demonstrating evolution of a website over time
• Preserving access to websites no longer available on live web
• Ability to provide researchers three primary study objects: document/file; website; "web sphere"
Challenges
• Imperfect capture/rendering of certain content (JavaScript, Flash, password-protected, query-dependent)
• Crawler traps
• Scoping decisions for duplicative content
• Effect of data/document budget on scoping and crawl frequency decisions
Successful capture
Archived version of http://www.malao.org/
http://www.malao.org/ today
Archive-It test crawl host report
http://www.equalityhumanrights.com/
http://www.equalityhumanrights.com/?colr=black
http://www.equalityhumanrights.com/?colr=orange
http://www.equalityhumanrights.com/?colr=yellow
http://www.equalityhumanrights.com/?size=425
http://www.equalityhumanrights.com/?size=625
http://www.equalityhumanrights.com/?size=825
http://www.equalityhumanrights.com/about-us/
http://www.equalityhumanrights.com/about-us/?colr=black
http://www.equalityhumanrights.com/about-us/?colr=orange
http://www.equalityhumanrights.com/about-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/?size=425
http://www.equalityhumanrights.com/about-us/?size=625
http://www.equalityhumanrights.com/about-us/?size=825
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=black
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=orange
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=yellow
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=425
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=625
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=825
http://www.equalityhumanrights.com/about-us/contact-us/
http://www.equalityhumanrights.com/about-us/contact-us/?colr=black
http://www.equalityhumanrights.com/about-us/contact-us/?colr=orange
http://www.equalityhumanrights.com/about-us/contact-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/contact-us/?size=425
http://www.equalityhumanrights.com/about-us/contact-us/?size=625
http://www.equalityhumanrights.com/about-us/contact-us/?size=825
Crawl Host URLs report: 4 web pages, each with 7 versions
Crawl scoping challenge: Website design generates multiple versions of all pages http://www.equalityhumanrights.com/
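The duplication in the host report can be sketched in a few lines: each page appears once plain and six more times with a `colr` or `size` query parameter that only changes presentation. The example below is a rough illustration, not Archive-It's scoping mechanism; it rebuilds the 28 reported URLs and collapses them to the underlying pages by dropping those two parameters. The function names and the choice of which parameters count as "display only" are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to affect presentation only (based on the host report).
DISPLAY_PARAMS = {"colr", "size"}

BASE = "http://www.equalityhumanrights.com/"
PAGES = ["", "about-us/", "about-us/advice-from-our-helpline/", "about-us/contact-us/"]
VARIANTS = ["", "?colr=black", "?colr=orange", "?colr=yellow",
            "?size=425", "?size=625", "?size=825"]

# The URLs listed in the crawl host report: 4 pages x 7 versions.
REPORTED_URLS = [BASE + page + variant for page in PAGES for variant in VARIANTS]


def canonical_url(url: str) -> str:
    """Drop display-only query parameters, keeping everything else unchanged."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in DISPLAY_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


def unique_pages(urls):
    """Collapse URLs to their canonical forms, preserving first-seen order."""
    return list(dict.fromkeys(canonical_url(url) for url in urls))


print(len(REPORTED_URLS), "URLs crawled")
print(len(unique_pages(REPORTED_URLS)), "distinct pages")
```

In a real crawl the same effect would more likely come from a scoping rule that stops the crawler from fetching the variant URLs in the first place, which also protects the data/document budget.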
More host constraints for crawl scoping
Description and access for archived websites at CUL
• Archive-It.org site-level metadata (All thematic collections, DCMI, copied from MARC records if possible)
• CLIO collection-level MARC records (Human rights, Avery, Burke)
• CLIO site-level MARC records (Human rights, Avery)
• Document-level MARC records (selected longish Avery collection reports, pre-existing IRCR records)
• Human Rights Web Archive portal on CUL website (using metadata extracted from MARC records)
Archive-It.org metadata for Burke Library collection website: Dublin Core supplemented with custom fields
CLIO public view of site-level MARC record
Portal Field            MARC Tag
Title                   245
Creator                 1XX/7XX
Organization Type       653 0
Organization Based In   "Country"
Geographic Focus        043 or 965 = "Global focus"
Subjects                650 ($a, $x)
Summary                 520
Languages               "Language" and 041
Seed URLs               856 40
Archived URLs           856 41
Nonpublic identifier    “965hrportal”
Mapping MARC data for portal facets
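One way to picture this mapping is as a lookup table applied to each site-level record. The sketch below covers only a subset of the rows (Title, Creator, Summary, Seed URLs, Archived URLs) and uses a deliberately simplified record format, a list of (tag, indicators, value) tuples, rather than real MARC parsing. The sample record, its values, and all function names are hypothetical; this is not the portal's actual extraction code.

```python
# Subset of the slide's mapping: portal field -> list of (tag pattern, indicators).
# "X" in a tag pattern matches any character; indicators of None mean "any".
PORTAL_FIELD_SOURCES = {
    "Title": [("245", None)],
    "Creator": [("1XX", None), ("7XX", None)],
    "Summary": [("520", None)],
    "Seed URLs": [("856", "40")],
    "Archived URLs": [("856", "41")],
}


def tag_matches(pattern: str, tag: str) -> bool:
    """True if tag fits pattern, treating 'X' in the pattern as a wildcard."""
    return len(pattern) == len(tag) and all(p in ("X", t) for p, t in zip(pattern, tag))


def portal_fields(record):
    """Group a record's values under the portal fields they feed."""
    result = {}
    for field_name, sources in PORTAL_FIELD_SOURCES.items():
        result[field_name] = [
            value
            for tag, indicators, value in record
            if any(tag_matches(pattern, tag) and (wanted is None or wanted == indicators)
                   for pattern, wanted in sources)
        ]
    return result


# Hypothetical, simplified record: (tag, two-character indicators, value).
sample_record = [
    ("110", "2 ", "Example Rights Organization"),
    ("245", "10", "Example Rights Organization website"),
    ("520", "  ", "Website of a hypothetical advocacy group."),
    ("856", "40", "http://www.example.org/"),
    ("856", "41", "http://archive.example.org/http://www.example.org/"),
]

for name, values in portal_fields(sample_record).items():
    print(name, "->", values)
```

Keeping the mapping in one table means a portal facet can be added or re-pointed at a different MARC field without touching the extraction logic.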
http://hrwa.cul.columbia.edu
HRWA portal: Browse by subject (sortable by count)
HRWA portal: Site details view, with links to crawl calendar page(s)
HRWA portal: Search website descriptions (metadata)
HRWA portal: Website description search results
HRWA portal: Search page full text
HRWA portal: Page full text search results
Excerpt from people group synonym table for HRWA search expansion feature
HRWA trial feature: Search expansion (include synonyms of names of peoples)
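The idea behind synonym-based search expansion can be sketched as follows: each query term that appears in a synonym group is replaced by an OR of every name in that group. The synonym groups below are illustrative stand-ins rather than entries from the HRWA table, and the boolean query syntax is an assumption, not a description of the portal's search backend.

```python
# Illustrative synonym groups (assumed; not copied from the HRWA synonym table).
SYNONYM_GROUPS = [
    {"roma", "romani", "romanies"},
    {"sami", "saami"},
]

# Index each name to its full group for quick lookup.
SYNONYM_INDEX = {name: group for group in SYNONYM_GROUPS for name in group}


def expand_terms(query: str):
    """Return one sorted list of variants per query term (a single item if no synonyms)."""
    return [sorted(SYNONYM_INDEX.get(term, {term})) for term in query.lower().split()]


def to_boolean_query(query: str) -> str:
    """Join variants with OR inside each term and AND between terms."""
    clauses = []
    for variants in expand_terms(query):
        if len(variants) > 1:
            clauses.append("(" + " OR ".join(variants) + ")")
        else:
            clauses.append(variants[0])
    return " AND ".join(clauses)


print(to_boolean_query("Roma education"))
print(to_boolean_query("housing"))
```

A search for one name of a people then also retrieves pages that use a variant name, at the cost of somewhat broader result sets.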
Thanks!
Questions/feedback: [email protected]