Selecting Content from the Web: Challenges and Experiences of the Library of Congress. Abbie Grotke, Web Capture Team, Office of Strategic Initiatives. CRL Workshop, February 27, 2006.

Selecting Content from the Web: Challenges and Experiences of the Library of Congress

Abbie Grotke

Web Capture Team

Office of Strategic Initiatives

CRL Workshop, February 27, 2006

Agenda

• Why Collect the Web?
• Web Collections at the Library of Congress
• Policy Issues and Technical Activities
• Project: Selecting and Managing Content Captured from the Web
• Our Partnerships and International Collaborations

Why Collect the Web? Digital Preservation Goals of the Library of Congress

• Preserve our nation’s history and culture
• Identify and preserve at-risk digital content
• Support development of tools, models, and methods for digital preservation

The Early Days

• Feb 2000: “how do we collect the Web?” led to MINERVA prototype (www.loc.gov/minerva)
• Special project team initially formed: cataloging, legal, public services, technology services staff
• Early partnerships:
  – Internet Archive (www.archive.org)
  – WebArchivist.org
• From project to program…
  – 2003: Web Capture team formed
  – 2004: Some of MINERVA team joined Web Capture

Web Collections 2000-2006

*Election 2000: 767 seed URLs
*September 11th: 30,000+
2002 Winter Olympics: 70
Sept 11 Remembrance: 1,800
*Election 2002: 3,000
107th-109th Congress: 588
Iraq War: 300
Election 2004: 2,000
Papal Transition: 200
Katrina: 818
Supreme Court: 285

*public access available through www.loc.gov/minerva

Current Library Collecting Efforts

• Iraq War (ongoing)
• 109th Congress (ongoing)
• Darfur

Over 40 TB of data collected to date!

The Web Capture Process at LC

[Process diagram; stages shown:]
• Collection Planning
• Selection
• Notification/Permissions
• Technical Review
• Crawl & QA
• Cataloging
• Interface Development
• Legal Review
• Access
• Store & Manage

Technical Activities

• Current activity in areas of:
  – Selection and permission gathering
    • Web Collection Management System
  – Acquisition: crawling and collection
    • Heritrix
  – Access and display
    • Full text searching, Wayback replacement
  – Collection analysis and preservation

Policy Issues

• Need to seek clear and consistent intellectual property protocols for crawling
  – Section 108 Study Group may provide hope: http://www.loc.gov/section108/

• What content should we now be collecting? How long should we collect it?

• Once we collect it, how do we make it available to our staff and public users?

• Do we share collecting efforts (costs, time) with partners? If so, how?

Various Web Collection Strategies

• Entire Web domain -- Internet Archive
• National domain (.se) -- Sweden, France, others
• Selective (individual URLs) and thematic -- Australia
• Thematic or event based -- Library of Congress

Other strategies LC is exploring

• Acquire collections gathered by others
• Establish relationships with producers to acquire their content

Selection

• LC’s Collection Policy Statements
• Collection planning defines:
  – Collection scope
    • Description
    • Types of sites
    • Frequency
  – Categories of sites
    • X category of site gets Y type permission
    • Reporting
    • Possible other uses – cataloging, access points

Other considerations

• What does the recommender want?
  – complete site
  – single document, page, or section
• Can we get it and provide access to it?
  – crawler and access tool limitations
  – deep web
  – scoping
  – permission

Selecting and Managing Content Captured from the Web

• One-year project to address:
  – Roles and responsibilities for lifecycle management of archived Web content
  – Single-site collecting vs. thematic collecting
  – Copyright permissions and notifications
  – Exploring how technical aspects of Web sites affect selection criteria
  – Expanding staff participation

Additional Objectives

• Learn by doing
  – Practical experience is key
  – Collection planning
  – Permissions planning
  – Content collection
  – Quality review: did we get what was wanted?

• Further document resource requirements and workflow (staff/time)

• Inform and educate other Library staff

Project Participants

• Four Content Groups
  – Darfur
  – Visual Image
  – Manuscript Organizations
  – Single Site
• Bibliographic and Lifecycle Subgroups
• Management Oversight Committee

Training

• Workshops
  – Selection
  – Technology of Web Capture
  – Copyright and Permissions
  – Access tools overview
• Tools training
  – For Recommenders: How to nominate a URL for archiving
  – For Selection Coordinators: How to use the tool to move through selection and permissions process

• Ongoing support, refreshers as needed

Some Big Challenges

• Defining new roles and responsibilities (and actually doing them)

• Resource limitations: everyone is busy and selection and permissions take a lot of time

• Finding the geek balance: too much vs. too little technical information

• Do LC’s traditional selection policies fit Web content?

Crisis in Darfur, Sudan

Crisis in Darfur, Sudan

• Approximately 200 seed URLs selected
  – Sampling of news reports
  – Scholarly reports and studies
  – Responses of
    • Government
    • Public (Web logs, etc.)
    • Key organizations and their Web sites, some formed in response to crisis
  – About 25 sites in other languages, mostly Arabic
• Started crawling February 20, 2006
  – Weekly, Monthly, One time

Upcoming tasks

• Review results of crawl
  – Technical Team Quality Review
  – Curator QA Quality Review
• Initiate permissions and collecting of Manuscript, Visual Image, and Single Site collections
• Full-text indexing search testing
• Further explore lifecycle management issues

A Web of Archiving Initiatives

[Diagram of organizations and initiatives; labels shown: CDL, UNT, LC, IA, UIUC, IIPC, BL, OCLC, RLG, Archive-it Partners, NLA, BNF, UKWAC, NYU, NARA, Norway, Finland, Denmark, Sweden, Collecting Partners]

National Partnerships and Collaborations

• University of California Digital Library
  – The Web at Risk: A Distributed Approach to Preserving our Nation’s Political Cultural Heritage
• Internet Archive
  – Testing the storage, data maintenance and access of collected Web content
• Information sharing with other US government agencies
  – Government Printing Office
  – National Archives and Records Administration

International Collaborations

• International Internet Preservation Consortium (IIPC)
  – Collect and preserve a rich body of Internet content from around the world
  – To foster the development and use of common tools, techniques and standards
  – To encourage and support national libraries everywhere to address Internet collecting and preservation
  – Share experience and best practices

IIPC Members

• France (lead)
• Italy
• Denmark
• Finland
• Iceland
• Canada
• Norway
• Australia
• Sweden
• United Kingdom
• Internet Archive, USA
• Library of Congress, USA

http://netpreserve.org/

Upcoming Directions

• Better tools for supporting selection
• Improving access tools
• Better crawl management
• Large-scale collection storage approach: Repository

Questions?

Abbie Grotke

[email protected]