Post on 20-Dec-2015
Strategies for Collecting and Preserving Open Access Materials on the Web
William Y. Arms
Cornell University
Federal Library and Information Center Committee
Open Access Materials on the Web
The Library of Congress: the Web Preservation Project
Library of Congress collects cultural and intellectual output of today for the benefit of future generations.
An ever-increasing amount of this material is born digital.
The library has:
• privileged legal position
• generous public funding
... but cannot do everything!
Step 1: Open Access Materials on the Web
Approaches to Preservation of the Web
OPEN ACCESS
• Bulk collection of open access web: automated processes
• Selective collection of open access web: librarianship in a new domain
CLOSED ACCESS
• Partnership with publishers: publishers and libraries as partners
Example: Web Preservation Project Pilot
• Small number of web sites nominated by selection officers. Three chosen for close study:
http://www.whitehouse.gov/
http://www.algore2000.com/
http://www.georgewbush.com/
• Copies downloaded using the HTTrack mirroring program and inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software and loaded into the Library of Congress's ILS.
• Trial web site developed to evaluate user interfaces.
Example: The Internet Archive
Example: National Library of Australia
Example: National Library of Sweden
Selection and Collection
Collecting: Making a Snapshot
[Diagram: Web site -> Download -> Snapshot -> Archive]
A web site is downloaded, using a mirroring program. A snapshot is stored in an archive.
Collecting: Periodic Snapshots
At scheduled time intervals additional snapshots are made.
[Diagram: Web site -> Snapshot 1, Snapshot 2, Snapshot 3 -> Archive]
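One possible way to hold periodic snapshots is a separate timestamped directory per snapshot. The sketch below assumes the mirroring step has already produced the files as an in-memory mapping. The names `store_snapshot`, `list_snapshots`, and the directory layout are illustrative assumptions, not part of the project described here.

```python
from datetime import datetime
from pathlib import Path


def store_snapshot(archive_root: Path, site_id: str,
                   files: dict[str, bytes], taken_at: datetime) -> Path:
    """Write one snapshot of a site into its own timestamped directory.

    `files` maps a relative path to its downloaded content. The layout
    archive_root/site_id/timestamp/ is an assumption for illustration;
    `taken_at` is assumed to be expressed in UTC.
    """
    stamp = taken_at.strftime("%Y%m%dT%H%M%SZ")
    snapshot_dir = archive_root / site_id / stamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for relative_path, content in files.items():
        target = snapshot_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return snapshot_dir


def list_snapshots(archive_root: Path, site_id: str) -> list[str]:
    """Return the snapshot timestamps held for a site, oldest first."""
    site_dir = archive_root / site_id
    if not site_dir.is_dir():
        return []
    return sorted(p.name for p in site_dir.iterdir() if p.is_dir())
```

Because the timestamp format sorts lexicographically, listing a site's directory gives its snapshots in collection order.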
Selection Decisions
Which sites to collect
• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots
• Monthly, weekly, or depending on circumstances
Which content to collect
• HTML pages only
• Text and images only
• Everything
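The three content choices above could be applied as a filter on each fetched resource. This is a minimal sketch that assumes the decision is made from the resource's MIME type. `ContentScope` and `should_collect` are hypothetical names introduced for illustration.

```python
from enum import Enum


class ContentScope(Enum):
    """The three content options listed above (names are illustrative)."""
    HTML_ONLY = "html"
    TEXT_AND_IMAGES = "text+images"
    EVERYTHING = "everything"


def should_collect(mime_type: str, scope: ContentScope) -> bool:
    """Decide whether a fetched resource falls inside the chosen scope."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = mime_type.split(";")[0].strip().lower()
    if scope is ContentScope.EVERYTHING:
        return True
    if scope is ContentScope.HTML_ONLY:
        return base_type == "text/html"
    # TEXT_AND_IMAGES: accept any text/* or image/* type.
    return base_type.startswith(("text/", "image/"))
```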
Examples of Selection Decisions
Project            Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Web Preservation   selective   irregular   all
Legal Issues
Legal position of archives that download open access materials is unclear
• Preservation is in the national interest
• See the discussion in The Digital Dilemma
• Crucial factor is economic impact on copyright owners
• Library of Congress has no special position except via copyright deposit
Legal Issues: Thoughts and Actions
• Presumption is that downloading open access materials is permitted by the publisher unless other indication is given, e.g., robot exclusion using a robots.txt file
• Different parties to consider
=> Library of Congress
=> other national libraries
=> partners of the Library of Congress and national libraries
=> independent archives
U.S. Copyright Office has offered to help with clarification
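The robot exclusion check mentioned above can be sketched with Python's standard `urllib.robotparser` module. The function name `may_download` and the user-agent string used in the example are assumptions. The sketch takes the robots.txt text as an argument so that it does not depend on network access.

```python
from urllib.robotparser import RobotFileParser


def may_download(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits fetching `url`.

    This only reflects the publisher's stated robot exclusion rules;
    it says nothing about the wider legal position.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything is open except the /private/ area.
EXAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
```

With the example rules, a crawler would be allowed to fetch `http://www.example.org/index.html` but not `http://www.example.org/private/memo.html`.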
Access to Collections
Access: Analysis by Computer
[Diagram: Archive (Snapshot 1, Snapshot 2, Snapshot 3) -> Analysis by computer]
Access: Analysis by Patron
[Diagram: Web site -> Archive (Snapshot 1, Snapshot 2, Snapshot 3) -> Access 1, Access 2, Access 3 -> Analysis by patron; the snapshots are also available for Analysis by computer]
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of Web access version by patron
Editing
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
Examples of Access Decisions
Project            Style        Editing
Internet Archive   computer     none
Pandora            researcher   some
Kulturarw3         ?            ?
Web Preservation   researcher   some
Information Discovery
Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
=> Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., Web Preservation Project)
=> Record for individual site or group of sites
=> Access through library catalog
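The simplest option above, access by URL + date, could be sketched as an index that maps each URL to its sorted snapshot dates and returns the most recent snapshot on or before a requested date. `SnapshotIndex` is a hypothetical class for illustration and does not describe how any of the named archives implement lookup.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional


class SnapshotIndex:
    """Illustrative URL + date lookup over a list of collected sites."""

    def __init__(self) -> None:
        self._dates: dict[str, list[date]] = {}  # URL -> sorted snapshot dates

    def add(self, url: str, taken_on: date) -> None:
        """Register a snapshot date for a URL, keeping the list sorted."""
        dates = self._dates.setdefault(url, [])
        pos = bisect_right(dates, taken_on)
        if pos == 0 or dates[pos - 1] != taken_on:  # ignore exact duplicates
            dates.insert(pos, taken_on)

    def latest_on_or_before(self, url: str, wanted: date) -> Optional[date]:
        """Return the newest snapshot date not later than `wanted`, if any."""
        dates = self._dates.get(url, [])
        pos = bisect_right(dates, wanted)
        return dates[pos - 1] if pos else None
```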
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into Library of Congress's ILS.
Observations
• Catalog effort similar to other electronic files
• Continual changes between snapshots
• Some similarities to serials
• No significant workflow difficulties
Storage
Storage: Preservation Versions
[Diagram: successive preservation versions of Snapshot 1 and Access 1]
Over time, other versions of a snapshot will be made for preservation.
Storage Decisions: Size
Each Web site will be stored many times
• Repeated snapshots
• Access versions
• Preservation versions
Saving space
• Many files are repeated (e.g., video clips)
• Storing a single copy saves space, but leads to more complex computer systems
• Compressing files saves space, but leads to more complex computer systems
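Both space-saving ideas above can be illustrated together: store each distinct file once, keyed by a hash of its content, and compress what is stored. The sketch below keeps everything in memory, and `DedupStore` is a hypothetical name. The extra bookkeeping it needs (a manifest per snapshot) is a small instance of the added system complexity noted above.

```python
import hashlib
import zlib


class DedupStore:
    """Illustrative single-copy, compressed store for repeated snapshots."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}               # digest -> compressed content
        self._manifests: dict[str, dict[str, str]] = {}  # snapshot id -> {path: digest}

    def add_snapshot(self, snapshot_id: str, files: dict[str, bytes]) -> None:
        """Record a snapshot, storing only content not already held."""
        manifest: dict[str, str] = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self._blobs:
                self._blobs[digest] = zlib.compress(content)
            manifest[path] = digest
        self._manifests[snapshot_id] = manifest

    def read(self, snapshot_id: str, path: str) -> bytes:
        """Return the original bytes of one file in one snapshot."""
        digest = self._manifests[snapshot_id][path]
        return zlib.decompress(self._blobs[digest])

    def stored_blob_count(self) -> int:
        """Number of distinct file contents actually stored."""
        return len(self._blobs)
```

If two monthly snapshots share an unchanged video clip, the clip is stored once and both manifests point to it.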
Very Rough Estimates of Size and Cost
Public web sites (OCLC, February 2000)          2,900,000
Library of Congress collects 1%                 30,000
Average size of site                            60 Mbytes
Size of 30,000 sites                            1.8 terabytes
Storage requirements/year (monthly snapshot)    21.6 terabytes
Storage requirements (no duplicates)            5.0 terabytes
Cost per year ($25,000 per terabyte)            $125,000
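The arithmetic behind these estimates can be checked with a few lines. The sketch assumes decimal units (1 terabyte = 1,000,000 Mbytes), which is what the round figures imply, and takes the 5.0 terabyte figure for storage without duplicates as given rather than deriving it. Note that 1% of 2,900,000 is 29,000, which the table rounds to 30,000. Function names are illustrative.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption consistent with the figures


def raw_storage_mb(sites: int, avg_site_mb: int, snapshots_per_year: int) -> int:
    """Storage in Mbytes if every snapshot of every site is kept in full."""
    return sites * avg_site_mb * snapshots_per_year


def yearly_cost_dollars(stored_tb: float, dollars_per_tb: int) -> float:
    """Yearly storage cost at a flat price per terabyte."""
    return stored_tb * dollars_per_tb


if __name__ == "__main__":
    one_pass = raw_storage_mb(30_000, 60, 1)    # one snapshot of each site
    monthly = raw_storage_mb(30_000, 60, 12)    # twelve snapshots per year
    print(one_pass / MB_PER_TB, "terabytes for one pass")
    print(monthly / MB_PER_TB, "terabytes per year with monthly snapshots")
    print(yearly_cost_dollars(5.0, 25_000), "dollars per year without duplicates")
```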
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
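The identification and provenance items above might be carried in a small record attached to each stored version. This is a sketch only: `VersionRecord` and its field names are hypothetical and are not drawn from any cataloging standard.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class VersionRecord:
    """Illustrative provenance record for one stored version of a site."""
    site_identifier: str    # persistent identifier, e.g. a Handle or PURL
    source_url: str         # URL at collection time (the site may move later)
    collected_at: datetime  # collection date and time
    history: list[str] = field(default_factory=list)  # changes made since collection

    def record_change(self, when: datetime, description: str) -> None:
        """Append one dated entry to the history of changes."""
        self.history.append(f"{when.isoformat()} {description}")
```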
Workflow
[Diagram components: Web site, Web Crawler, Process, snapshot, Accession Control, Archive, Catalog, External Access, Analysis by patron, Analysis by computer]
Preservation
Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Analysis by human researcher
• Viewed by human researcher
Process of Preservation
[Diagram: Version 1 (Time 0) -> Version 2 (Time 1) -> Version 3 (Time 2)]
This process may be applied to either the snapshot or the access version.
Preservation: Refreshing
Each version is created from the previous by exactly copying the bits.
• Keeps the exact files for all time
• Preserves bits and content, but not always in an accessible form
• Later computers and software are unlikely to support today's protocols, formats, languages, etc.
Keeping the unedited snapshot files by repeated refreshing should be a basic part of any preservation strategy.
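Refreshing as described above amounts to copying a file and confirming that the copy is bit-for-bit identical. A minimal sketch, assuming SHA-256 checksums are an acceptable test of identity; `refresh` and `sha256_of` are illustrative names.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy `source` to `destination` and verify the bits are unchanged.

    Returns the checksum, which can be kept with the version's
    provenance information for later refresh cycles.
    """
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != expected:
        raise IOError(f"refresh of {source} produced a different bit stream")
    return expected
```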
Preservation: Automatic Migration of Individual Files
As protocols, formats, languages, etc. become obsolete, convert individual files to new standards.
• Can be carried out automatically
• Preserves content and helps toward preservation of experience
• Effectiveness depends on availability of conversion tools and the complexity and quality of original source
• Migrated versions will steadily diverge from original
• Web sites will eventually cease to function
Automated migration of individual files is the basic technique for keeping web sites functional at moderate cost.
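Automatic migration can be pictured as a table of converters applied file by file. The sketch below is purely illustrative: the format labels are made up, and re-encoding Latin-1 text as UTF-8 stands in for whatever real conversion tools are available. Files in obsolete formats that have no converter are kept unchanged and reported, which reflects the dependence on conversion tools noted above.

```python
from typing import Callable

Converter = Callable[[bytes], bytes]


def latin1_to_utf8(content: bytes) -> bytes:
    """Stand-in converter: re-encode Latin-1 text as UTF-8."""
    return content.decode("latin-1").encode("utf-8")


# Hypothetical registry: obsolete format label -> (new label, converter).
CONVERTERS: dict[str, tuple[str, Converter]] = {
    "latin1-text": ("utf8-text", latin1_to_utf8),
}


def migrate(files: dict[str, tuple[str, bytes]],
            obsolete: set[str]) -> tuple[dict[str, tuple[str, bytes]], list[str]]:
    """Convert files in obsolete formats where a converter exists.

    `files` maps path -> (format label, content). Returns the migrated
    files and the paths that are obsolete but could not be converted.
    """
    migrated: dict[str, tuple[str, bytes]] = {}
    unconverted: list[str] = []
    for path, (fmt, content) in files.items():
        if fmt in obsolete and fmt in CONVERTERS:
            new_fmt, convert = CONVERTERS[fmt]
            migrated[path] = (new_fmt, convert(content))
        else:
            migrated[path] = (fmt, content)  # keep as-is
            if fmt in obsolete:
                unconverted.append(path)
    return migrated, unconverted
```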
Preservation: Automatic Migration with Manual Editing
In conjunction with automatic migration, web sites are reviewed by a librarian and edited as necessary to preserve functionality.
• The only method that can be expected to preserve the experience of using web sites
• Migrated versions will steadily diverge from original
• Some web sites will be impossible to edit without changing the experience
Manual editing is very expensive and is therefore suitable for only a small number of particularly important sites.
Acknowledgements
The members of the Web Preservation Project are:
Roger Adkin
Cassy Ammen
William Arms
Allene Hayes
Melissa Levine
Diane Kresh
Barbara Tillett