Joanne Archer University of Maryland Kate Odell Archive-It Abbie Grotke Library of Congress Tessa Fallon Columbia University Creating and Maintaining Web Archives

Post on 25-Dec-2015

Page 1: Joanne Archer University of Maryland Kate Odell Archive-It Abbie Grotke Library of Congress Tessa Fallon Columbia University Creating and Maintaining Web

Joanne Archer, University of Maryland

Kate Odell, Archive-It

Abbie Grotke, Library of Congress

Tessa Fallon, Columbia University

Creating and Maintaining Web Archives

Page 2

Session Goals

• Provide an overview of web archiving and the tasks involved

• Discuss workflow management and copyright issues

• Talk about collection strategies and collection development for web archives

• Analyze the different options for web archiving

• Discuss some of the commonly encountered technical challenges and problems

• Examine methods of access and description

Page 3

What is web archiving? 

Web archiving is the capture, management, and preservation of websites and web resources.

Page 4

Web Archiving Initiatives

Prominent web archiving initiatives include:

• Internet Archive

• International Internet Preservation Consortium

• Large National Libraries:

– Australia

– United Kingdom

– United States

– Denmark

• Web at Risk Project

Page 5

Workflow Management

Page 6

Copyright/Permissions

– Legal deposit requirement only applies to “published works” (§ 407)

– § 108 of the Copyright Act provides library exceptions but doesn’t address digital preservation and web archiving

– Varying approaches taken:

• Crawl permissions

• Access permissions

• Notification of crawling

• Respecting robots.txt (or not!)

– Risk and web archiving policies should be determined by each institution - talk to your lawyers!
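For institutions that choose to respect robots.txt, the check can be automated before a URL is crawled. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-archiver` user agent, and the `allowed_by_robots` helper are hypothetical and only illustrate the idea. It is not how any particular crawler or service implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example usage with made-up URLs.
for candidate in ("http://example.org/index.html",
                  "http://example.org/private/a.html"):
    print(candidate, allowed_by_robots(ROBOTS_TXT, "example-archiver", candidate))
```

Whether a "disallowed" result actually stops the crawl is the policy decision discussed above; the code only reports what the site owner has requested.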

Page 7

Collection Strategies

• Whole Domain

– used by some national libraries and by the Internet Archive

– capture everything within a geographic domain; in the case of Sweden, all sites within the .se domain

• Selective Archiving

– capture certain portions of the web based on predefined criteria or collection policies

• Thematic

– event driven (September 11) or theme driven (human rights)

• Deposit

• Combination
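The difference between whole-domain and selective collecting can be expressed as two scope tests on a URL. This is a rough sketch only: the function names and example hosts are assumptions made for illustration, and real crawlers apply much richer scoping rules.

```python
from urllib.parse import urlsplit


def in_whole_domain_scope(url: str, suffix: str = ".se") -> bool:
    """Whole-domain strategy: accept any host under a country domain such as .se."""
    host = urlsplit(url).hostname or ""
    return host.endswith(suffix)


def in_selective_scope(url: str, selected_hosts: set) -> bool:
    """Selective strategy: accept only hosts a curator has explicitly chosen."""
    host = urlsplit(url).hostname or ""
    return host in selected_hosts


# Example usage with made-up hosts.
print(in_whole_domain_scope("http://www.example.se/page"))
print(in_selective_scope("http://example.org/a", {"example.org"}))
```

A combination strategy would simply accept a URL when either test passes.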

Page 8

Collection Development: Topical

Page 9

Collection Development: Technical

Page 10

Page 11

Page 12

Page 13

Page 14

Collection Development Policies/Guidelines

• Collection Development Policies or Similar Documents:

– Center for Human Rights Documentation and Research, Human Rights Web Archive

• http://library.columbia.edu/indiv/humanrights/hrwa.html

– Library of Congress

• http://www.loc.gov/acq/devpol/webarchive.pdf

– Tamiment Library Web Archive

• http://www.nyu.edu/library/bobst/research/tam/webarchive.html

– University of Michigan Bentley Historical Library

• http://bentley.umich.edu/uarphome/webarchives/BHL_WebArchives_Policy.pdf

– National Library of Ireland general election 2011 web archive

• http://www.nli.ie/GetAttachment.aspx?id=8f6b68db-e19c-411c-b041-aa8b741d2e10

Page 15

Tools: HTTrack

Page 16

Tools: HTTrack

Page 17

Tools: In-House Program Web Curator Tool

Page 18

Tools: In-House Program DigiBoard

Page 19

Tools: Subscriptions, Web Archiving Service

Page 20

Tools: Subscriptions, Archive-It

Page 21

How does web archiving work?

• Curator selects websites (seeds) to archive

• Curator specifies scope (how much of each website is archived)

• Seeds and scoping are sent to the crawler (usually Heritrix)

• Crawler visits seed sites and archives the URLs that are discovered (following the scoping rules)

• Archived content is processed and stored (.warc format)

• Access tools (Wayback) allow archived content to be viewed and browsed
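The seed, scope, and crawl steps above can be sketched as a small queue-driven loop. This is an illustration under stated assumptions, not Heritrix: the "web" is a hypothetical in-memory dictionary so the example runs without network access, the scope rule is simply "same host as a seed", and the archive is a plain dictionary rather than a .warc file.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.org/": ("home", ["http://example.org/about",
                                     "http://other.example.com/"]),
    "http://example.org/about": ("about", ["http://example.org/"]),
    "http://other.example.com/": ("elsewhere", []),
}


def in_scope(url, seed_hosts):
    """Scope rule assumed for this sketch: stay on the hosts of the seeds."""
    return urlsplit(url).hostname in seed_hosts


def crawl(seeds, web):
    """Visit seeds, follow discovered links that are in scope, store each page once."""
    seed_hosts = {urlsplit(seed).hostname for seed in seeds}
    queue = deque(seeds)
    archive = {}
    while queue:
        url = queue.popleft()
        if url in archive or url not in web or not in_scope(url, seed_hosts):
            continue
        text, links = web[url]
        archive[url] = text   # stand-in for writing an archive record
        queue.extend(links)   # discovered URLs are checked against scope later
    return archive


print(sorted(crawl(["http://example.org/"], FAKE_WEB)))
```

Changing the seeds or the `in_scope` rule and running the loop again mirrors how scoping decisions change what ends up in the archive.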

Page 22

Quality Review

Quality Review is different for everyone. Why?

• The tool(s) being used for harvesting and access

• Your institution’s goals, needs, and preferences

• How much time you have

Review Reports

• Was there any blocked content, or were there any unreachable sites?

• Did you get more content than expected? Less?

Review Archived Web Pages

• Some issues can only be found with the human eye (for now!)

• Was look-and-feel properly captured?

Make Desired Changes

• Scoping, Seeds, Crawl Settings, etc.

Crawl Again
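The "Review Reports" questions above lend themselves to a quick tally. The sketch below assumes a hypothetical report made of (URL, outcome) pairs with the outcomes "captured", "blocked", and "unreachable"; actual report formats differ from tool to tool, so treat the field names as placeholders.

```python
from collections import Counter

# Hypothetical crawl report rows: (URL, outcome).
REPORT = [
    ("http://example.org/", "captured"),
    ("http://example.org/about", "captured"),
    ("http://example.org/private/", "blocked"),
    ("http://example.org/old", "unreachable"),
]


def summarize(report, expected_min):
    """Count outcomes, list URLs needing a closer look, and flag a thin capture."""
    counts = Counter(outcome for _, outcome in report)
    needs_review = [url for url, outcome in report if outcome != "captured"]
    return {
        "counts": dict(counts),
        "needs_review": needs_review,
        "fewer_than_expected": counts["captured"] < expected_min,
    }


print(summarize(REPORT, expected_min=3))
```

A tally like this only covers the report side of quality review; checking look-and-feel still needs a person viewing the archived pages.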

Page 23

Common Problems – “The Web is a Mess”

• Some web technologies can be tricky (though not impossible!) to capture or to view in the archived version:

– Database driven sites

– JavaScript (only sometimes)

– Flash (only sometimes)

– Certain video formats

• Websites change – what archived perfectly yesterday might not after today’s redesign

Page 24

Access and Description

Access Options:

• Subscription Service Access Page (e.g. Archive-It website)

• Website of Your Organization or Project (e.g. Human Rights Web Portal, LOC’s Web Archives site)

• OPAC (e.g. Columbia’s CLIO)

• OCLC’s WorldCat

Examples of Description:

• Columbia University

– Dublin Core

– MARC

– Internet Resource Cataloging Request (IRCR)

• Library of Congress

– Creates MODS records for each “site”

– Collection level records in MARC (for the OPAC)

• Archive-It

– Dublin Core

– Coming soon: Automated transformation to MARC, MODS, and more.
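To give a sense of what a Dublin Core description looks like in practice, here is a minimal sketch that serializes a few fields as XML using Python's standard library. The `record` wrapper element, the `dublin_core_record` helper, and all field values are hypothetical; only the Dublin Core element namespace is standard. It does not reproduce the record layout of Archive-It, Columbia, or any other institution.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a simple XML record from a dict of Dublin Core element names to values."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Example usage with made-up values.
print(dublin_core_record({
    "title": "Example Web Collection",
    "identifier": "http://example.org/",
    "type": "InteractiveResource",
}))
```

Because the fields are held in a plain dictionary before serialization, the same data could in principle be mapped to other schemas such as MARC or MODS with a different output function.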

Page 25

Archive-It Partner Page

Page 26

Library of Congress Web Archives Page

http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

Page 27

Library of Virginia

http://www.virginiamemory.com/collections/archival_web_collections

Page 28

CLIO Record (public view)

Page 29

WorldCat

Link back to the Archive-It collection

http://www.worldcat.org/title/north-africa-the-middle-east-2011/oclc/756767371

Page 30

Staffing

Staff needed include:

• Project Management

• Selectors/Curators

• Technical staff for Seed URL preparation (scoping), Quality Review, analysis of reports, etc.

• Catalogers

Training for Staff:

• Use of Tools

• Selection - and how what can and cannot be archived affects that

• Permissions

• Quality Review

Helpful skills: comfortable with the web (not all are, in our experience!), flexibility, good sense of humor

Page 31

Taking the First Steps…

• Is there web content within your collection scope?

– Your organization’s website(s)

– Print material that has migrated to web publication

– Subject related websites

– Websites related to manuscript or archival collections

– State or local government websites

• Research and talk to similar organizations

• Talk to subscription services about trial accounts

• Try out some of the lower barrier tools (e.g. HTTrack)

• Get involved with collaborative web archiving efforts

• Just do it! Jump in!

Page 32

NDSA Web Archiving Survey

The National Digital Stewardship Alliance (NDSA) Content Working Group [http://www.digitalpreservation.gov/ndsa/working_groups/content.html] is sponsoring this survey of organizations in the United States that are actively involved in or planning to archive content from the web.

http://www.surveymonkey.com/s/USWebArchiving

The survey will close October 31, 2011.

Page 33

Questions? Comments? Suggestions?

Joanne Archer • [email protected]

Tessa Fallon • [email protected]

Abbie Grotke • [email protected]

Kate Odell • [email protected]