NISO Two-Part Webinar: Sustainable Information Part 1: Digital Preservation for Text
DESCRIPTION
Wednesday, December 10, 2014
NISO Two-Part Webinar: Sustainable Information Part 1: Digital Preservation for Text
National Digital Stewardship Alliance (NDSA) Levels of Preservation
Trevor Owens, Digital Archivist, National Digital Information Infrastructure and Preservation Program (NDIIPP), Office of Strategic Initiatives, Library of Congress
Preserving the Law: Digital Curation in a Law Library Setting
Leah Prescott, Associate Law Librarian for Digital Initiatives and Special Collections, Georgetown University Law Library
Rosetta digital preservation system: Enabling institutions to preserve and provide access to their digital collections
Edward M. Corrado, Director of Library Technology, Binghamton University Libraries
TRANSCRIPT
NISO Two-Part Webinar Sustainable Information, Part 1:
Digital Preservation for Text
Wednesday, December 10, 2014
Speakers:
Trevor Owens, Digital Archivist, National Digital Information Infrastructure and Preservation Program (NDIIPP), Office of Strategic Initiatives, Library of Congress
Leah Prescott, Associate Law Librarian for Digital Initiatives and Special Collections, Georgetown University Law Library
Edward M. Corrado, Director of Library Technology, Binghamton University Libraries
http://www.niso.org/news/events/2014/webinars/text_preservation/
Using the NDSA Levels of Digital Preservation
Webinar, December 2014
Overview
• Version One of the Levels, Review
• Use Examples
• Discussion of uses
Common Need
• Simple, practical, documented levels of preservation services reflecting best practices, broadly useful
– For those just starting out & those with mature programs
– Independent of formats, storage systems
– Useful to educators & implementers
Niche
Personal Archiving Advice … Levels of Digital Preservation … Formal Certifications & Audits
Levels of Digital Preservation, v1
Level 1
Level 2
Level 3
Level 4
Category 1
Category 2
Category 3
Category 4
Category 5
Levels of Digital Preservation, v1
Level 1
Level 2
Level 3
Level 4
Category 1: Level 1 Actions for Category 1 / Level 2 Actions for Category 1 / … / …
Category 2: Level 1 Actions for Category 2 / Level 2 Actions for Category 2 / … / …
Category 3: … / … / … / …
Category 4: … / … / … / …
Category 5: … / … / … / …
Levels of Digital Preservation, v1
Level 1 (Protect your data)
Level 2 (Know your data)
Level 3 (Monitor your data)
Level 4 (Repair your data)
Storage and Geographic Location
Level 1
- Two complete copies that are not collocated
- For data on heterogeneous media (optical discs, hard drives, etc.) get the content off the medium and into your storage system
Level 2
- At least three complete copies
- At least one copy in a different geographic location
- Document your storage system(s) and storage media and what you need to use them
Level 3
- At least one copy in a geographic location with a different disaster threat
- Obsolescence monitoring process for your storage system(s) and media
Level 4
- At least three copies in geographic locations with different disaster threats
- Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems
File Fixity and Data Integrity
Level 1
- Check file fixity on ingest if it has been provided with the content
- Create fixity info if it wasn’t provided with the content
Level 2
- Check fixity on all ingests
- Use write-blockers when working with original media
- Virus-check high risk content
Level 3
- Check fixity of content at fixed intervals
- Maintain logs of fixity info; supply audit on demand
- Ability to detect corrupt data
- Virus-check all content
Level 4
- Check fixity of all content in response to specific events or activities
- Ability to replace/repair corrupted data
- Ensure no one person has write access to all copies
Information Security
Level 1
- Identify who has read, write, move and delete authorization to individual files
- Restrict who has those authorizations to individual files
Level 2
- Document access restrictions for content
Level 3
- Maintain logs of who performed what actions on files, including deletions and preservation actions
Level 4
- Perform audit of logs
Metadata
Level 1
- Inventory of content and its storage location
- Ensure backup and non-collocation of inventory
Level 2
- Store administrative metadata
- Store transformative metadata and log events
Level 3
- Store standard technical and descriptive metadata
Level 4
- Store standard preservation metadata
File Formats
Level 1
- When you can give input into the creation of digital files encourage use of a limited set of known open formats and codecs
Level 2
- Inventory of file formats in use
Level 3
- Monitor file format obsolescence issues
Level 4
- Perform format migrations, emulation and similar activities as needed
Storage and Geographic Location
Level 1: Protect your data
- Two complete copies that are not collocated
- For data on heterogeneous media (optical discs, hard drives, etc.) get the content off the medium and into your storage system
Level 2: Know your data
- At least three complete copies
- At least one copy in a different geographic location
- Document your storage system(s) and storage media and what you need to use them
Level 3: Monitor your data
- At least one copy in a geographic location with a different disaster threat
- Obsolescence monitoring for your storage system(s) and media
Level 4: Repair your data
- At least three copies in geographic locations with different disaster threats
- Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems
File Fixity and Data Integrity
Level 1: Protect your data
- Check file fixity on ingest if it has been provided with the content
- Create fixity info if it wasn’t provided with the content
Level 2: Know your data
- Check fixity on all ingests
- Use write-blockers when working with original media
- Virus-check high risk content
Level 3: Monitor your data
- Check fixity of content at fixed intervals
- Maintain logs of fixity info; supply audit on demand
- Ability to detect corrupt data
- Virus-check all content
Level 4: Repair your data
- Check fixity of all content in response to specific events or activities
- Ability to replace/repair corrupted data
- Ensure no one person has write access to all copies
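The fixity actions on this slide (create fixity info when none is provided, then check it at ingest and at intervals) can be illustrated with a short Python sketch. This is only one possible approach, not something prescribed by the Levels: the choice of SHA-256, the manifest layout (relative path mapped to checksum), and the function names are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_manifest(root: Path) -> dict[str, str]:
    """Create fixity info (relative path -> checksum) for every file under root."""
    return {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare files on disk with a stored manifest; return path -> problem found."""
    problems = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif file_checksum(target) != expected:
            problems[rel_path] = "checksum mismatch"
    return problems
```

In this sketch the manifest would be stored apart from the content and `check_fixity` rerun on a schedule; an empty result means every listed file still matches its recorded checksum.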
Information Security
Level 1: Protect your data
- Identify who has read, write, move and delete authorization to individual files
- Restrict who has those authorizations to individual files
Level 2: Know your data
- Document access restrictions for content
Level 3: Monitor your data
- Maintain logs of who performed what actions on files, including deletions and preservation actions
Level 4: Repair your data
- Perform audit of logs
Metadata
Level 1: Protect your data
- Inventory of content and its storage location
- Ensure backup and non-collocation of inventory
Level 2: Know your data
- Store administrative metadata
- Store transformative metadata and log events
Level 3: Monitor your data
- Store standard technical and descriptive metadata
Level 4: Repair your data
- Store standard preservation metadata
File Formats
Level 1: Protect your data
- When you can give input into the creation of digital files, encourage use of a limited set of known open formats and codecs
Level 2: Know your data
- Inventory of file formats in use
Level 3: Monitor your data
- Monitor file format obsolescence issues
Level 4: Repair your data
- Perform format migrations, emulation and similar activities as needed
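As a rough illustration of the "inventory of file formats in use" action, the Python sketch below counts files by extension and lists extensions outside a locally preferred set. The function names and the idea of a `preferred` set are assumptions for the example. Note that a file extension is only a hint about the actual format, so a production inventory would more likely rely on a tool that inspects file signatures.

```python
from collections import Counter
from pathlib import Path


def inventory_formats(root: Path) -> Counter:
    """Count files under root by lowercase extension ('' means no extension)."""
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


def outside_preferred(inventory: Counter, preferred: set[str]) -> dict[str, int]:
    """Return the extensions (with counts) that are not in the preferred set."""
    return {ext: count for ext, count in inventory.items() if ext not in preferred}
```

Rerunning the inventory over time and comparing the results is one simple way to see which formats are entering the collection and which may need attention.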
Usage Contexts
• Inform Local Guidelines Development: Educate and develop guidelines for content creators and contributors (USGS)
• Self Assessments – how do we compare with best practices? What should we improve next? Where do we excel? How will we improve after project X? How have we improved over time? (Harvard & ARTstor)
• Developing requirements for third-party preservation service providers
Self-assessment example
Columns: Level One, Level Two, Level Three, Level Four
Rows: Storage & Geographic Location; File Fixity and Data Integrity; Information Security; Metadata; File Formats
Cell legend:
= satisfied with implementation
= will be satisfied with implementation after current enhancement project
= implemented but could be improved
= not implemented
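A self-assessment grid like this one can be kept as a small data structure so that it can be compared over time. The Python sketch below is a hypothetical illustration: the `Status` names mirror the four legend entries, the example ratings are invented, and treating the levels as cumulative (a level counts only if all lower levels are satisfied) is an assumption of the sketch.

```python
from enum import Enum


class Status(Enum):
    SATISFIED = "satisfied with implementation"
    AFTER_PROJECT = "will be satisfied after current enhancement project"
    COULD_IMPROVE = "implemented but could be improved"
    NOT_IMPLEMENTED = "not implemented"


def highest_level_satisfied(row):
    """Count consecutive SATISFIED cells starting at Level 1 (0 if Level 1 is not met)."""
    level = 0
    for status in row:
        if status is not Status.SATISFIED:
            break
        level += 1
    return level


def improvement_targets(assessment):
    """Map each category to the first level (1-4) not yet SATISFIED, or None if all are."""
    targets = {}
    for category, row in assessment.items():
        met = highest_level_satisfied(row)
        targets[category] = met + 1 if met < len(row) else None
    return targets


# Hypothetical ratings for illustration only; not results from any institution.
EXAMPLE = {
    "Storage & Geographic Location": [
        Status.SATISFIED, Status.SATISFIED, Status.COULD_IMPROVE, Status.NOT_IMPLEMENTED,
    ],
    "File Fixity and Data Integrity": [
        Status.SATISFIED, Status.AFTER_PROJECT, Status.NOT_IMPLEMENTED, Status.NOT_IMPLEMENTED,
    ],
    "Metadata": [
        Status.SATISFIED, Status.SATISFIED, Status.SATISFIED, Status.SATISFIED,
    ],
}
```

Calling `improvement_targets(EXAMPLE)` answers the "what should we improve next?" question per category under these assumptions.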
December 10, 2014
Leah Prescott Associate Law Librarian for Digital Initiatives
and Special Collections
Preserving the Law:
Digital Curation
in a Law Library Setting
Digitized vs. Born Digital
Born-digital: Files that are created natively on electronic devices such as computers, cell phones, digital cameras, and digital audio and video recorders
Digitized: Analog objects that are transferred to a digital format through some conversion process
Digitization and digital preservation are not the same thing
Digitization creates new digital objects which need to be preserved
It also creates metadata that needs to be preserved
Digital Curation Centre
Life-cycle Model
Bits and Bytes
The bit is the smallest measurement for digital information
8 bits equal a byte
1,024 bytes equal a kilobyte (KB)
1,024 kilobytes equal a megabyte (MB)
1,024 megabytes equal a gigabyte (GB)
1,024 gigabytes equal a terabyte (TB)
1,024 terabytes equal a petabyte (PB)
Beyond that: exabyte, zettabyte, yottabyte (names such as xonabyte, wekabyte and vundabyte are informal, not standardized)
In everyday use, the 1,024 multiplier is usually rounded to 1,000
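The difference between the 1,024-based units and the rounded 1,000-based ones grows with each step up the scale. The following Python sketch works through the arithmetic for one terabyte; the helper names are made up for the example.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def to_bytes(value, unit, base=1024):
    """Convert a value in the given unit to bytes (base 1024, or 1000 when rounded)."""
    return value * base ** UNITS.index(unit)


def to_bits(byte_count):
    """There are 8 bits in every byte."""
    return byte_count * 8


binary_tb = to_bytes(1, "TB")          # 1024 ** 4 = 1,099,511,627,776 bytes
rounded_tb = to_bytes(1, "TB", 1000)   # 1000 ** 4 = 1,000,000,000,000 bytes
shortfall = 1 - rounded_tb / binary_tb  # fraction by which the rounded figure is smaller
```

At the terabyte scale the rounded figure is roughly 9 percent smaller than the 1,024-based one, which matters when estimating storage needs.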
Bits and Bytes
File extensions
Threats to digital objects
File corruption
File format obsolescence
Media obsolescence
Human intervention/accidental human error
Metadata problems
Computer failure
Natural disaster
Misconceptions
Institutional support (or lack thereof)
What are ways that we can mitigate these
threats?
File corruption
Physical media should be stored in appropriate environmental conditions.
Take care in handling of media.
Maintain integrity of bit stream through security, checksums, periodic sampling & other validation
Ensure the integrity of the bit stream through actions such as checksum comparison
Periodically refresh & reformat.
Digital Preservation Coalition
Checksums: http://digitalporr.niu.edu/tool-grid/
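Checksum comparison becomes most useful when several copies of a file exist: a copy that no longer matches the recorded checksum can be replaced from one that still does. The Python sketch below shows that idea under stated assumptions: SHA-256 as the checksum, a trusted `expected` value recorded at ingest, and hypothetical function names. It reads each file fully into memory, which is fine for a demonstration but not for very large files.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file (whole file read into memory)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_copies(copies: list[Path], expected: str) -> list[Path]:
    """Overwrite copies that fail the checksum with one that passes; return those repaired."""
    good = [c for c in copies if c.is_file() and sha256_of(c) == expected]
    if not good:
        raise RuntimeError("no intact copy available; cannot repair")
    repaired = []
    for copy in copies:
        if copy not in good:
            copy.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(good[0], copy)
            repaired.append(copy)
    return repaired
```

If no copy matches the recorded checksum the sketch raises an error rather than guessing, since at that point repair needs human judgment.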
Addressing threats to digital objects
File format obsolescence
Use open formats
Images – tif, jpg, jpg2000, png, gif, vfz (talk about raw)
Text – xml, txt, rtf, pdf/a, csv, odt
A/V – aiff (flac codec), wav, bwf, mp3, avi, mj2, mjp2, mov – (uncompressed)
• Migration
• Emulation
• Technology Preservation
• Reinterpretation/Canonicalization
Addressing threats to digital objects
File corruption
File format obsolescence
Media obsolescence
Human intervention/accidental human error
Metadata problems: Duke Data Accessioner
Computer failure
Natural disaster
Misconceptions
Institutional support (or lack thereof)
"Those who forget the past are condemned to reload it."
Nick Montfort, July 2000
Thank you for your attention
Any questions?
Joseph Ducreux, Yawning (Self-Portrait), before 1783, J. Paul Getty Museum
Rosetta digital preservation system: Enabling institutions to preserve and provide access to their digital collections
Edward M. Corrado
Director of Library Technology
Binghamton University Libraries
About Me
Director of Library Technology at Binghamton University
MLS, Rutgers University
Co-author of Digital Preservation for Libraries, Archives, and Museums
Co-editor of Getting Started with Cloud Computing: A LITA Guide
Binghamton University (SUNY)
Binghamton University, one of four comprehensive doctoral research universities within the State University of New York, is recognized for stellar academics, an international focus, high graduation rates and overall value.
• Undergraduates: 12,356
• Graduate students: 2,952
• Average SAT score range: 1220-1385
• Average ACT score range: 27-30
• Top 25% of high school class: 87.9%
• In the past 10 years, 91% of freshmen returned for their sophomore year
• Students of color: 28%
• International students: 15%
Students come from 45 states and 100 countries
What we were trying to solve
We had other digital library systems that didn’t include preservation functionality
We wanted to preserve things, and are expected to preserve things. Some of those things are digital.
University Archives has a mandate to preserve various content
Content includes:
Born Digital
Digitized
Note: Not only the Libraries; the need is university-wide
Systems were silos
Rosetta
Scalable
Expandable
Architecture is open
Standards-based
“Based on the Open Archival Information System (OAIS) model and conforming to trusted digital repository (TDR) requirements.”
Designed by Ex Libris in collaboration with the National Library of New Zealand
Used by about 30 institutions around the world
Rosetta: A digital preservation solution
Complete preservation solution allowing collection, archiving and preservation of digital materials of any type. Rosetta ensures data integrity and provides access over time to digital materials.
http://www.exlibrisgroup.com/category/RosettaOverview
[Diagram labels: Identify Risks; Evaluate Alternatives; Execute Preservation Actions; Migration Action; Operational Storage; Permanent Storage]
Text-based digital preservation projects at Binghamton University
Non-text projects
Photographs
Datasets
Videos
Oral histories
Text-based projects
Capstone projects
Dateline
Inside BU
University Bulletin
Websites (mostly text)
Process
One size does not fit all
Depends on staffing, needs, importance, content, funding, etc.
Collection development policies are important for selection
Metadata librarians act as project managers
Provide guidance and training
Decide on appropriate descriptive metadata fields
Develop and/or provide specialized terminology (such as LCSH, TGM, TGN)
Review descriptive metadata and deposits as appropriate
Metadata is often created by people who are subject specialists (researchers, graduate assistants, special collections) or by student workers
Capstones
College of Community and Public Affairs (CCPA) Capstone Projects (students in the Master's program)
Basic workflow
Get documents from the college
Convert files to PDF/A
Create descriptive metadata
A metadata librarian is currently doing this, but in the past we worked with students studying Library & Information Science at the University of Missouri to create metadata
Submit to digital preservation system
Harvest metadata into Primo for search and discovery
Dateline & Inside BU
Dateline is a daily e-mail newsletter; Inside BU is a web-based magazine
Similar processes
Born-digital Text on Websites
Various methods and tools to harvest websites
Popular tools include
Heritrix
HTTrack
Wget
WARCreate
If we were to do this at scale, it would probably make sense to partner with Internet Archive’s Archive-It service.
Can also be processed manually
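As an illustration of harvesting with one of the tools listed, the Python sketch below assembles a Wget command that mirrors a site and writes the result to a WARC file. The URL and WARC name are placeholders, and the particular option set (`--mirror`, `--page-requisites`, `--wait`, `--warc-file`) is just one reasonable combination, not the workflow used at Binghamton. It assumes a GNU Wget build with WARC support is installed.

```python
import subprocess


def build_wget_command(url: str, warc_name: str, wait_seconds: int = 1) -> list[str]:
    """Build a Wget command line that mirrors a site into a WARC file."""
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch the CSS, images, etc. pages need
        f"--wait={wait_seconds}",     # pause between requests to be polite
        f"--warc-file={warc_name}",   # write a WARC alongside the mirrored files
        url,
    ]


def harvest(url: str, warc_name: str) -> None:
    """Run the harvest; raises CalledProcessError if Wget exits with an error."""
    subprocess.run(build_wget_command(url, warc_name), check=True)


# Placeholder values for illustration only.
EXAMPLE_COMMAND = build_wget_command("https://example.edu/newsletter/", "newsletter-2014-12")
```

Building the argument list separately from running it makes it easy to log or review exactly what was harvested and how.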
Challenges
Many of these text-based items are either e-mail or database generated, which is not always preservation friendly
Do you harvest as a website?
Pros
Can help maintain the experience, look, and feel
Easier to automate
Cons
Harvesting works well for simple websites, not always so great for complicated, dynamically generated ones
Issues with remote CSS
JavaScript and other programming languages
What exactly is the experience, look, and feel?
May vary by browser, device, etc.
Typical workflow for electronic newsletters / magazines
Create a file (or pages) for each story
Note: Creating a different Cascading Style Sheet (CSS) for archiving may be an option in some cases (think of print.css)
Combine them into one file and add a table of contents
Convert file to PDF/A
Create descriptive metadata using locally created templates and thesauri (which often contain LCSH, TGN, etc. values)
Deposit file and metadata into the digital preservation system
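The "combine them into one file and add a table of contents" step could be scripted along the following lines. This Python sketch is an assumption about how such a step might look, not the process actually used: it takes each story as a headline plus an HTML fragment and produces one HTML document with a linked table of contents. Converting that document to PDF/A is left to a separate tool and is not shown.

```python
from html import escape


def combine_stories(title: str, stories: list[tuple[str, str]]) -> str:
    """Merge (headline, html_body) pairs into one HTML page with a table of contents."""
    toc = []
    sections = []
    for number, (headline, body) in enumerate(stories, start=1):
        anchor = f"story-{number}"
        toc.append(f'<li><a href="#{anchor}">{escape(headline)}</a></li>')
        sections.append(
            f'<section id="{anchor}"><h2>{escape(headline)}</h2>{body}</section>'
        )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head><body>\n"
        f"<h1>{escape(title)}</h1>\n"
        "<ol>\n" + "\n".join(toc) + "\n</ol>\n"
        + "\n".join(sections)
        + "\n</body></html>\n"
    )
```

Headlines are escaped because they are treated as plain text, while story bodies are inserted unchanged on the assumption that they are already valid HTML fragments.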
Metadata in Rosetta
Over 50 Dublin Core fields (w/ qualifiers)
Unqualified DC only has 15 fields
Need for best practices
Harvesting Metadata from Rosetta into Primo:
By default, metadata is harvested using OAI-PMH (which uses a limited number of Dublin Core fields)
In many cases, metadata from Rosetta maps to the same fields in Primo as does Aleph cataloging records
This applies to both search and display labels
Limited to 50 custom fields
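For readers unfamiliar with OAI-PMH, the Python sketch below shows the general shape of a harvest: a `ListRecords` request asking for the `oai_dc` (unqualified Dublin Core) format, and extraction of the Dublin Core elements from the response. The base URL and set name are placeholders, and this is not a description of how Rosetta or Primo implement harvesting internally.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_dc(xml_text: str) -> list[dict[str, list[str]]]:
    """Return one dict per record mapping DC element names to their values."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        fields: dict[str, list[str]] = {}
        for element in record.iter():
            if element.tag.startswith(f"{{{DC_NS}}}") and element.text:
                name = element.tag.split("}", 1)[1]
                fields.setdefault(name, []).append(element.text.strip())
        records.append(fields)
    return records
```

Because `oai_dc` carries only the 15 unqualified elements, any richer qualified fields have to be mapped down to them before they appear in a harvest like this.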
Conclusions
Again, one size does not fit all
The process we use for electronic newsletters is more manual than harvesting but it produces a more sustainable (and useable) digital object for preservation
The technology for digital preservation is available, but policies and administrative issues are still a challenge
Sustainable funding
Staffing
Policies and (local) documentation
Prioritization
Follow standards whenever possible
Exit strategy, interoperability
Memorandum of Understanding (MOU) when working with people outside of the Library
Thank You
Special thanks to Rachel Jaffe, Metadata Librarian at UC Santa Cruz,
and colleagues from Binghamton University for help with some of the
content presented here.
Edward M. Corrado / [email protected]
NISO Webinar • December 10, 2014
Questions?
All questions will be posted with presenter answers on the NISO website following the webinar:
http://www.niso.org/news/events/2014/webinars/text_preservation/
NISO Two-Part Webinar
Sustainable Information, Part 1:
Digital Preservation for Text
Thank you for joining us today.
Please take a moment to fill out the brief online survey.
We look forward to hearing from you!
THANK YOU