ALEX THURMAN Columbia University Libraries PCC Participants Meeting ALA Annual June 30, 2013

ALEX THURMAN Columbia University Libraries PCC Participants Meeting

ALA Annual June 30, 2013

Overview

• Web archiving context (Who)
• Benefits of curated web archives (Why)
• Columbia University Libraries Web Resources Collection Program (What/How)
  o Background
  o Selection
  o Permissions
  o Harvesting
  o Description
  o Access

Web archiving context

International Internet Preservation Consortium

http://netpreserve.org/

IIPC member type               #
national/regional libraries    29
university libraries           7
other non-profits              4
archives                       2
commercial                     2

IIPC member region             #
Europe                         23
North America                  13
Asia                           4
Oceania                        2
North Africa                   2

Internet Archive

Wow!

• Crawling the public Web since 1996
• Over 280 billion URLs
• Archive searchable by URL via Wayback Machine, which contains 5 petabytes of crawl data from 1996-2012
• Wayback Machine gets 1,000 visits per second
• Web archives from 1996-2007 backed up at mirror site at Bibliotheca Alexandrina
• Also collects millions of digitized books, movies, audio and concert recordings
• Over 10 petabytes of data in total

But!

• IA web collections are vast but not comprehensive
• General Internet crawls take about 3 months; many websites change faster, are short-lived, or are not well-linked/discoverable
• Depth of capture of individual websites varies
• Archive too huge to be indexed for full-text search
• Robots.txt restrictions are obeyed, so many sites are fully or partially blocked from archiving
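
How a crawler that obeys robots.txt ends up with a partial capture can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the robots.txt content, the example.org URLs, and the crawler name "example-crawler" are assumptions made up for the sketch, not taken from the slides or from the Internet Archive's crawler configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of urls that a robots.txt-obeying crawler may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical candidate URLs on the imaginary site.
candidates = [
    "http://example.org/about/",
    "http://example.org/private/report.html",
]
fetchable = allowed_urls(ROBOTS_TXT, "example-crawler", candidates)
print(fetchable)
```

Anything under the disallowed path is skipped, so the archived copy of such a site would be missing that section.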

Benefits of curated web archives

• Precise selection of desired web resources (documents, websites, "web spheres")
• Fuller website captures
• Control of frequency of website capture
• On-demand website capture
• Fill gaps in library print holdings or provide electronic access to existing print holdings
• Metadata
• Full-text searching
• Data mining

Web archiving services

• Internet Archive
  o Archive-It
    255 partners, including 106 colleges/universities as well as:
      state archives and libraries
      public libraries
      museum and art libraries
      law libraries
      NGOs
      K-12 program schools
• California Digital Library
  o WAS (Web Archiving Service)
    22 partners, mostly university libraries

Andrew W. Mellon Foundation support for CUL web archiving

Grant projects
• Collection Building for Web Resources (2008-2009)
  1 FTE: project librarian
• Web Resources Collection Program Development (2009-2012)
  3 FTE: 2 web curators, 1 programmer
• Web Resources Archiving Collaboration (2013-2015)
  2 FTE: 1 project librarian, 1 bibliographic assistant

CUL Web Archive Collections

• Avery Library Historic Preservation and Urban Planning
  60+ websites, semiannual
• Burke Library New York City Religions
  225 websites, semiannual
• Human Rights
  520 websites, quarterly
• Rare Book and Manuscript Library
  30 websites, semiannual
• University Archives
  most of columbia.edu domain, plus 75+ affiliated sites with "external" URLs
• General

Permissions

• No explicit US Copyright Act libraries exception for web archiving
• CUL policy is to request permission from website owners to harvest their websites and provide access to archived versions
  o Permission request email sent to contact info from website
  o If no response after 3 weeks, follow-up request with notification of intent to archive website
• Statistics
  o 818 requests sent
  o 415 responded Yes
  o 5 responded No
  o 398 did not respond
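
The permission statistics above are easier to compare as shares of all requests. A minimal sketch of that arithmetic follows; the counts are copied from the slide, while the variable names are just illustrative.

```python
# Counts copied from the Statistics bullet above.
requests_sent = 818
responses = {"yes": 415, "no": 5, "no response": 398}

# Sanity check: the three outcomes account for every request sent.
assert sum(responses.values()) == requests_sent

# Share of all requests, as a percentage rounded to one decimal place.
shares = {
    outcome: round(100 * count / requests_sent, 1)
    for outcome, count in responses.items()
}

# Share of "yes" among site owners who actually replied.
replied = responses["yes"] + responses["no"]
yes_among_replies = round(100 * responses["yes"] / replied, 1)

for outcome, pct in shares.items():
    print(f"{outcome}: {pct}% of requests")
print(f"yes among replies: {yes_among_replies}%")
```

Roughly half of the requests received an explicit yes and almost half received no reply; explicit refusals were under 1% of requests.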

Harvesting via web crawling

Benefits
• Best case scenario of fully navigable archived version of site reproducing look and feel of original website on particular date
• Demonstrating evolution of a website over time
• Preserving access to websites no longer available on live web
• Ability to provide researchers three primary study objects: document/file; website; "web sphere"

Challenges
• Imperfect capture/rendering of certain content (JavaScript, Flash, password-protected, query-dependent)
• Crawler traps
• Scoping decisions for duplicative content
• Effect of data/document budget on scoping and crawl frequency decisions

Successful capture: archived version of http://www.malao.org/

http://www.malao.org/ today

Archive-It test crawl host report

Crawl Host URLs report: 4 web pages, each with 7 versions

http://www.equalityhumanrights.com/
http://www.equalityhumanrights.com/?colr=black
http://www.equalityhumanrights.com/?colr=orange
http://www.equalityhumanrights.com/?colr=yellow
http://www.equalityhumanrights.com/?size=425
http://www.equalityhumanrights.com/?size=625
http://www.equalityhumanrights.com/?size=825
http://www.equalityhumanrights.com/about-us/
http://www.equalityhumanrights.com/about-us/?colr=black
http://www.equalityhumanrights.com/about-us/?colr=orange
http://www.equalityhumanrights.com/about-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/?size=425
http://www.equalityhumanrights.com/about-us/?size=625
http://www.equalityhumanrights.com/about-us/?size=825
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=black
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=orange
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?colr=yellow
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=425
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=625
http://www.equalityhumanrights.com/about-us/advice-from-our-helpline/?size=825
http://www.equalityhumanrights.com/about-us/contact-us/
http://www.equalityhumanrights.com/about-us/contact-us/?colr=black
http://www.equalityhumanrights.com/about-us/contact-us/?colr=orange
http://www.equalityhumanrights.com/about-us/contact-us/?colr=yellow
http://www.equalityhumanrights.com/about-us/contact-us/?size=425
http://www.equalityhumanrights.com/about-us/contact-us/?size=625
http://www.equalityhumanrights.com/about-us/contact-us/?size=825

Crawl scoping challenge: Website design generates multiple versions of all pages
http://www.equalityhumanrights.com/
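
The duplication in the host report can be reproduced with a short sketch that strips the display-only query parameters (colr and size, as seen in the report) and groups the crawled URLs by what remains. This is an illustration of the scoping problem only; it is not how Archive-It scoping rules are configured, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Presentation-only query parameters seen in the host report.
DISPLAY_PARAMS = {"colr", "size"}


def canonical(url):
    """Drop display-only query parameters so page variants map to one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in DISPLAY_PARAMS]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_variants(urls):
    """Map each canonical URL to the list of crawled variants behind it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical(url)].append(url)
    return dict(groups)


# Rebuild the 28 URLs from the report: 4 paths, each in 7 versions.
base = "http://www.equalityhumanrights.com"
paths = ["/", "/about-us/", "/about-us/advice-from-our-helpline/", "/about-us/contact-us/"]
suffixes = ["", "?colr=black", "?colr=orange", "?colr=yellow",
            "?size=425", "?size=625", "?size=825"]
crawled = [base + path + suffix for path in paths for suffix in suffixes]

groups = group_variants(crawled)
print(len(crawled), "crawled URLs ->", len(groups), "distinct pages")
```

A crawl scoped to skip URLs carrying those two parameters would capture one version of each page instead of seven, which matters when the crawl has a fixed data/document budget.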

More host constraints for crawl scoping

Description and access for archived websites at CUL

• Archive-It.org site-level metadata
  (All thematic collections, DCMI, copied from MARC records if possible)
• CLIO collection-level MARC records
  (Human rights, Avery, Burke)
• CLIO site-level MARC records
  (Human rights, Avery)
• Document-level MARC records
  (selected longish Avery collection reports, pre-existing IRCR records)
• Human Rights Web Archive portal on CUL website
  (using metadata extracted from MARC records)

Archive-It.org metadata for Burke Library collection website: Dublin Core supplemented with custom fields

CLIO public view of site-level MARC record

Mapping MARC data for portal facets

Portal Field             MARC Tag
Title                    245
Creator                  1XX/7XX
Organization Type        653 0
Organization Based In    "Country"
Geographic Focus         043 or 965 = "Global focus"
Subjects                 650 ($a, $x)
Summary                  520
Languages                "Language" and 041
Seed URLs                856 40
Archived URLs            856 41
Nonpublic identifier     "965hrportal"
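
The tag-based rows of the mapping above can be sketched as a lookup table plus a small extraction function. This is a simplified illustration, not the portal's actual code: a record is modeled as a plain list of (tag, value) pairs rather than parsed MARC, and indicators, subfields, and the "Country"/"Language" sources are left out. The sample record values are invented.

```python
# Portal field -> MARC tag patterns, transcribed from the tag-only rows above.
# "X" stands for any digit, as in "1XX".
PORTAL_TO_MARC = {
    "Title": ["245"],
    "Creator": ["1XX", "7XX"],
    "Subjects": ["650"],
    "Summary": ["520"],
}


def tag_matches(pattern, tag):
    """True if a 3-character MARC tag fits a pattern such as '1XX'."""
    return len(tag) == 3 and all(p in ("X", t) for p, t in zip(pattern, tag))


def portal_fields(record):
    """Collect portal field values from a list of (tag, value) pairs."""
    out = {}
    for field, patterns in PORTAL_TO_MARC.items():
        values = [
            value
            for tag, value in record
            if any(tag_matches(p, tag) for p in patterns)
        ]
        if values:
            out[field] = values
    return out


# Hypothetical record; every value here is invented for illustration.
sample = [
    ("110", "Example Rights Organization"),
    ("245", "Example Rights Organization website"),
    ("520", "Illustrative summary text."),
    ("650", "Human rights"),
    ("710", "Example Partner Group"),
]
result = portal_fields(sample)
print(result)
```

Keeping the mapping in one table means a change to the catalog record flows through to the portal without editing the extraction logic.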

http://hrwa.cul.columbia.edu

HRWA portal: Browse by subject (sortable by count)

HRWA portal: Site details view, with links to crawl calendar page(s)

HRWA portal: Search website descriptions (metadata)

HRWA portal: Website description search results

HRWA portal: Search page full text

HRWA portal: Page full text search results

Excerpt from people group synonym table for HRWA search expansion feature

HRWA trial feature: Search expansion (include synonyms of names of peoples)

HRWA trial feature: Search expansion (include synonyms of names of peoples)
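
The idea behind synonym-based search expansion can be sketched as follows. The slides do not show how HRWA implements it, so this is an assumption-laden illustration: the synonym groups are generic examples rather than entries from the HRWA table, and the function names and the boolean query syntax are hypothetical.

```python
# Hypothetical synonym groups; illustrative only, not the HRWA synonym table.
PEOPLE_SYNONYMS = [
    {"roma", "romani", "romany"},
    {"sami", "saami"},
]


def expand_query(query, synonym_groups=PEOPLE_SYNONYMS):
    """For each query word, return the sorted list of equivalent terms."""
    expanded = []
    for word in query.lower().split():
        group = next((g for g in synonym_groups if word in g), {word})
        expanded.append(sorted(group))
    return expanded


def to_boolean(expanded):
    """Render expanded terms as an AND of OR-groups."""
    clauses = []
    for terms in expanded:
        if len(terms) > 1:
            clauses.append("(" + " OR ".join(terms) + ")")
        else:
            clauses.append(terms[0])
    return " AND ".join(clauses)


expanded = expand_query("Roma education")
boolean_query = to_boolean(expanded)
print(boolean_query)
```

A search for one name of a people then also retrieves pages that use a variant name, at the cost of some added noise in the results.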

Thanks!

Questions/feedback: [email protected]