
ALEX THURMAN Columbia University Libraries PCC Participants Meeting

ALA Annual June 30, 2013

Overview

• Web archiving context (Who)
• Benefits of curated web archives (Why)
• Columbia University Libraries Web Resources Collection Program (What/How)
  o Background
  o Selection
  o Permissions
  o Harvesting
  o Description
  o Access

Web archiving context

International Internet Preservation Consortium

http://netpreserve.org/

IIPC member type #

national/regional libraries 29

university libraries 7

other non-profits 4

archives 2

commercial 2

IIPC member region #

Europe 23

North America 13

Asia 4

Oceania 2

North Africa 2

Internet Archive

Wow!

• Crawling the public Web since 1996
• Over 280 billion URLs
• Archive searchable by URL via Wayback Machine, which contains 5 petabytes of crawl data from 1996-2012
• Wayback Machine gets 1,000 visits per second
• Web archives from 1996-2007 backed up at mirror site at Bibliotheca Alexandrina
• Also collects millions of digitized books, movies, audio and concert recordings
• Over 10 petabytes of data in total

But!

• IA web collections are vast but not comprehensive
• General Internet crawls take about 3 months; many websites change faster, are short-lived, or are not well-linked/discoverable
• Depth of capture of individual websites varies
• Archive too huge to be indexed for full-text search
• Robots.txt restrictions are obeyed, so many sites are fully or partially blocked from archiving

Benefits of curated web archives

• Precise selection of desired web resources (documents, websites, "web spheres")
• Fuller website captures
• Control of frequency of website capture
• On-demand website capture
• Fill gaps in library print holdings or provide electronic access to existing print holdings
• Metadata
• Full-text searching
• Data mining

Web archiving services

• Internet Archive
  o Archive-It: 255 partners, including 106 colleges/universities as well as state archives and libraries, public libraries, museum and art libraries, law libraries, NGOs, and K-12 program schools
• California Digital Library
  o WAS (Web Archiving Service): 22 partners, mostly university libraries

Andrew W. Mellon Foundation support for CUL web archiving

Grant projects
• Collection Building for Web Resources (2008-2009)
  1 FTE: project librarian
• Web Resources Collection Program Development (2009-2012)
  3 FTE: 2 web curators, 1 programmer
• Web Resources Archiving Collaboration (2013-2015)
  2 FTE: 1 project librarian, 1 bibliographic assistant

CUL Web Archive Collections

• Avery Library Historic Preservation and Urban Planning: 60+ websites, semiannual
• Burke Library New York City Religions: 225 websites, semiannual
• Human Rights: 520 websites, quarterly
• Rare Book and Manuscript Library: 30 websites, semiannual
• University Archives: most of columbia.edu domain, plus 75+ affiliated sites with "external" URLs
• General

Permissions

• No explicit US Copyright Act libraries exception for web archiving
• CUL policy is to request permission from website owners to harvest their websites and provide access to archived versions
  o Permission request email sent to contact info from website
  o If no response after 3 weeks, follow-up request with notification of intent to archive website
• Statistics
  o 818 requests sent
  o 415 responded Yes
  o 5 responded No
  o 398 did not respond
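The response statistics above can be turned into rates with a few lines of arithmetic. This is a minimal sketch that uses only the four counts from the slide; the variable and function names are illustrative.

```python
# Counts taken from the permission statistics above; names are illustrative.
requests_sent = 818
responses = {"yes": 415, "no": 5, "no response": 398}


def response_rates(counts, total):
    """Return each outcome as a percentage of all requests sent."""
    return {outcome: 100 * n / total for outcome, n in counts.items()}


# The three outcomes should account for every request sent.
assert sum(responses.values()) == requests_sent

rates = response_rates(responses, requests_sent)
for outcome, pct in rates.items():
    print(f"{outcome}: {pct:.1f}%")
```

Roughly half of the site owners granted permission explicitly, and explicit refusals stayed under 1% of requests.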

Harvesting via web crawling

Benefits
• Best case scenario of fully navigable archived version of site reproducing look and feel of original website on particular date
• Demonstrating evolution of a website over time
• Preserving access to websites no longer available on live web
• Ability to provide researchers three primary study objects: document/file; website; "web sphere"

Challenges
• Imperfect capture/rendering of certain content (JavaScript, Flash, password-protected, query-dependent)
• Crawler traps
• Scoping decisions for duplicative content
• Effect of data/document budget on scoping and crawl frequency decisions

Successful capture: archived version of http://www.malao.org/

http://www.malao.org/ today

Archive-It test crawl host report

http://www.equalityhumanrights.com/
http://www.equalityhumanrights.com/?colr=black
http://www.equalityhumanrights.com/?colr=orange
http://www.equalityhumanrights.com/?colr=yellow
http://www.equalityhumanrights.com/?size=425
http://www.equalityhumanrights.com/?size=625
http://www.equalityhumanrights.com/?size=825
http://www.equalityhumanrights.com/about-us/
http://www.equalityhumanrights.com/about-us/?colr=black
http://www.equalityhumanrights.com/about-us/?colr=orange
http://www.equalityhumanrights.com/about-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/?size=425
http://www.equalityhumanrights.com/about-us/?size=625
http://www.equalityhumanrights.com/about-us/?size=825
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=black
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=orange
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=yellow
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=425
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=625
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=825
http://www.equalityhumanrights.com/about-us/contact-us/
http://www.equalityhumanrights.com/about-us/contact-us/?colr=black
http://www.equalityhumanrights.com/about-us/contact-us/?colr=orange
http://www.equalityhumanrights.com/about-us/contact-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/contact-us/?size=425
http://www.equalityhumanrights.com/about-us/contact-us/?size=625
http://www.equalityhumanrights.com/about-us/contact-us/?size=825

Crawl Host URLs report: 4 web pages, each with 7 versions
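The duplication in this report can be checked by collapsing each URL to a canonical form and counting the variants. The sketch below assumes that `colr` and `size` are display-only parameters, as the report suggests; the function names are illustrative and not part of Archive-It.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to change only the page's appearance.
DISPLAY_PARAMS = {"colr", "size"}


def canonical(url):
    """Drop display-only query parameters so variants collapse to one page."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in DISPLAY_PARAMS]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_variants(urls):
    """Map each canonical page URL to the crawled URLs that resolve to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical(url)].append(url)
    return dict(groups)


# Rebuild the 28 URLs listed in the host report.
BASE = "http://www.equalityhumanrights.com"
PAGES = ["/", "/about-us/", "/about-us/advice-from-our-helpline/", "/about-us/contact-us/"]
VARIANTS = ["", "?colr=black", "?colr=orange", "?colr=yellow",
            "?size=425", "?size=625", "?size=825"]
report = [BASE + page + variant for page in PAGES for variant in VARIANTS]

groups = group_variants(report)
for page, variants in groups.items():
    print(page, len(variants))
```

Grouping this way makes the cost visible: only one in seven captured URLs is a distinct page, and the rest count against the crawl's document budget.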

Crawl scoping challenge: Website design generates multiple versions of all pages

http://www.equalityhumanrights.com/

More host constraints for crawl scoping
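One way to express a constraint of this kind is a rule that rejects any URL matching a pattern. The sketch below is not Archive-It's interface; it only illustrates the idea with a regular expression, and the pattern and the `in_scope` function are assumptions made for the example.

```python
import re

# Hypothetical exclusion pattern: any URL whose query string sets colr or size.
EXCLUDE = re.compile(r"[?&](colr|size)=")


def in_scope(url):
    """Return True when the URL does not match the exclusion pattern."""
    return EXCLUDE.search(url) is None


candidates = [
    "http://www.equalityhumanrights.com/about-us/",
    "http://www.equalityhumanrights.com/about-us/?colr=black",
    "http://www.equalityhumanrights.com/about-us/?size=425",
    "http://www.equalityhumanrights.com/about-us/contact-us/",
]

kept = [url for url in candidates if in_scope(url)]
for url in kept:
    print(url)
```

A pattern rule like this keeps the plain version of each page while skipping the color and text-size variants.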

Description and access for archived websites at CUL

• Archive-It.org site-level metadata (all thematic collections, DCMI, copied from MARC records if possible)
• CLIO collection-level MARC records (Human Rights, Avery, Burke)
• CLIO site-level MARC records (Human Rights, Avery)
• Document-level MARC records (selected longer Avery collection reports, pre-existing IRCR records)
• Human Rights Web Archive portal on CUL website (using metadata extracted from MARC records)

Archive-It.org metadata for Burke Library collection website: Dublin Core supplemented with custom fields

CLIO public view of site-level MARC record

Mapping MARC data for portal facets

Portal Field            MARC Tag
Title                   245
Creator                 1XX/7XX
Organization Type       653 0
Organization Based In   "Country"
Geographic Focus        043 or 965 = "Global focus"
Subjects                650 ($a, $x)
Summary                 520
Languages               "Language" and 041
Seed URLs               856 40
Archived URLs           856 41
Nonpublic identifier    "965hrportal"
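The mapping above can be sketched as a small routine that sorts a record's fields into portal facets. This is only an illustration under stated assumptions: `Field` is a simplified stand-in for a MARC field rather than a type from any MARC library, the sample record is invented, and only the rows keyed on a plain tag (245, 1XX/7XX, 650, 520, 856) are handled.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Simplified stand-in for a MARC field (hypothetical, not a library type)."""
    tag: str         # three-character tag, e.g. "856"
    indicators: str  # two indicator characters, e.g. "40"
    value: str       # display text of the field


def portal_facets(fields):
    """Sort fields into portal facets following the mapping table above."""
    names = ("Title", "Creator", "Subjects", "Summary", "Seed URLs", "Archived URLs")
    facets = {name: [] for name in names}
    for f in fields:
        if f.tag == "245":
            facets["Title"].append(f.value)
        elif f.tag[0] in "17":  # 1XX/7XX
            facets["Creator"].append(f.value)
        elif f.tag == "650":
            facets["Subjects"].append(f.value)
        elif f.tag == "520":
            facets["Summary"].append(f.value)
        elif f.tag == "856" and f.indicators == "40":
            facets["Seed URLs"].append(f.value)
        elif f.tag == "856" and f.indicators == "41":
            facets["Archived URLs"].append(f.value)
    return facets


# Invented sample record, for illustration only.
record = [
    Field("110", "2 ", "Example Rights Organization"),
    Field("245", "10", "Example Rights Organization website"),
    Field("520", "  ", "Website of a fictional advocacy group."),
    Field("650", " 0", "Human rights"),
    Field("856", "40", "http://www.example.org/"),
    Field("856", "41", "http://archive.example.org/www.example.org/"),
]

facets = portal_facets(record)
for name, values in facets.items():
    print(f"{name}: {values}")
```

The two 856 rows show why indicators matter here: the same tag feeds different facets depending on whether the link points to the live seed or to the archived copy.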

http://hrwa.cul.columbia.edu

HRWA portal : Browse by subject (sortable by count)

HRWA portal : Site details view, with links to crawl calendar page(s)

HRWA portal : Search website descriptions (metadata)

HRWA portal : Website description search results

HRWA portal : Search page full text

HRWA portal : Page full text search results

Excerpt from people group synonym table for HRWA search expansion feature

HRWA trial feature : Search expansion (include synonyms of names of peoples)
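Search expansion of this kind can be sketched as a lookup that replaces each query term with its synonym group. The groups below are illustrative spelling variants, not entries from the HRWA table, and the function names and the boolean query syntax are assumptions.

```python
# Illustrative synonym groups; the actual HRWA table is not reproduced here.
SYNONYM_GROUPS = [
    {"roma", "romani", "romany"},
    {"sami", "saami"},
]


def expand_query(query):
    """Replace each term with the sorted list of its synonyms, if any."""
    expanded = []
    for term in query.lower().split():
        group = next((g for g in SYNONYM_GROUPS if term in g), {term})
        expanded.append(sorted(group))
    return expanded


def to_boolean(expanded):
    """Join synonym lists with OR and the terms themselves with AND."""
    parts = []
    for group in expanded:
        if len(group) > 1:
            parts.append("(" + " OR ".join(group) + ")")
        else:
            parts.append(group[0])
    return " AND ".join(parts)


print(to_boolean(expand_query("Roma education")))
```

Terms without a synonym group pass through unchanged, so the expansion widens only the part of the query that names a people.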

Thanks!

Questions/feedback: [email protected]
