![Page 1: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/1.jpg)
Challenges of Digital Preservation
MA / CS 109
April 22, 2011
Andrea Goethals
Manager of Digital Preservation & Repository Services
Harvard Library
![Page 2: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/2.jpg)
![Page 3: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/3.jpg)
“Digital Content”?
Digitized (born-analog)
Born-digital
◦ Tweets
◦ Web sites
◦ Email
◦ Documents
    PDF
    Word, OpenOffice …
    Spreadsheets
◦ Data sets
![Page 4: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/4.jpg)
Digital content is not new
1957: 1st digital image
1969: ARPAnet
1971: 1st email sent
1972: 1st consumer-level video game
1975: 1st digital camera
Image: Russell Kirsch’s son (source: NIST)
![Page 5: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/5.jpg)
But has only recently exploded
1998: 1st Google index
◦ 26 million pages
2000: Google index
◦ 1 billion pages
2008: Google link processors
◦ 1 trillion unique URIs
◦ “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
![Page 6: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/6.jpg)
The coming tsunami
2010: estimated at 1.2 ZB (1 ZB is 1 billion TB)
◦ DVDs stacked from Earth to the Moon and back
2020: expected to grow by a factor of 44 (from 2009 levels) to 35 ZB
◦ DVDs stacked halfway to Mars
Source: 2010 IDC Digital Universe Study sponsored by EMC
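The unit arithmetic behind these figures can be checked with a short sketch. It assumes SI decimal prefixes (1 TB = 10^12 bytes, 1 ZB = 10^21 bytes); the function name `zb_to_tb` is illustrative only.

```python
# SI decimal prefixes are assumed here; binary prefixes would give
# slightly different figures.
TB = 10**12  # bytes in one terabyte
ZB = 10**21  # bytes in one zettabyte


def zb_to_tb(zettabytes: float) -> float:
    """Convert a size in zettabytes to terabytes."""
    return zettabytes * ZB / TB


# Number of terabytes in one zettabyte (integer arithmetic, exact).
tb_per_zb = ZB // TB

# Baseline implied by "grow by a factor of 44 to 35 ZB".
baseline_zb = 35 / 44

print(f"1 ZB   = {tb_per_zb:,} TB")
print(f"1.2 ZB = {zb_to_tb(1.2):,.0f} TB")
print(f"35 ZB / 44 = {baseline_zb:.2f} ZB")
```

The implied baseline of roughly 0.8 ZB is smaller than the 1.2 ZB estimate for 2010, which suggests the factor of 44 is measured from an earlier year than 2010.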
![Page 7: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/7.jpg)
Outpacing storage
Source: 2009 IDC Digital Universe Study sponsored by EMC
![Page 8: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/8.jpg)
Why do we care?
![Page 9: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/9.jpg)
May be historically significant
Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)
![Page 11: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/11.jpg)
May be an important reference
Only available in digital form
![Page 13: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/13.jpg)
Who cares?
Cultural heritage institutions
◦ Libraries, archives
◦ Museums, historical societies
◦ Academic institutions
Governments
Entertainment, news and media industry
Scientific community
Funding bodies (NSF, NIH)
You?
![Page 14: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/14.jpg)
Preservation historically
Archives and libraries have been preserving all kinds of analog material for centuries using:
◦ Environmental control
◦ Conservation treatments
Can store away until resources allow processing
◦ Benign neglect approach works well
![Page 15: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/15.jpg)
Analog content is fairly durable
Even damaged, may still be identifiable, readable, usable
Image: Anatolian Cuneiform Tablet, circa 1850 BCE
![Page 16: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/16.jpg)
In contrast, digital content is
Easily destroyed
Transient
Hidden
Requires more active attention – benign neglect approach doesn’t work
![Page 17: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/17.jpg)
Digital content is easily destroyed
Bad people
Hardware or software failures
Human mistakes
◦ The slip of a finger can lead to catastrophic results
◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”
![Page 18: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/18.jpg)
Digital content is transient
Average lifespan of a Web site is between 44 and 100 days
Captured April 8, 2009
Visited October 13, 2010
![Page 20: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/20.jpg)
Digital content is hidden
Both. Use helps, but it’s not enough to detect corruption.
![Page 21: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/21.jpg)
But is it usable???
It’s not enough to preserve the digital bits
◦ AppleWorks?
◦ WordStar?
◦ Excel 1.0?
To use digital content we need software that can read the format
![Page 22: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/22.jpg)
Reading formats
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...
![Page 23: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/23.jpg)
Reading formats
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...
SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 183x512
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2
...
![Page 25: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/25.jpg)
Access to information
[Figure: two layered stacks compared]
Analog book (unmediated access): information, content, language, symbols, HW (paper)
Digital book (technology-mediated access): information, content, formats, bits, SW, HW
![Page 26: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/26.jpg)
Formats are key to digital preservation
[Figure: layers information, content, bits, formats, SW, HW, labeled “digital content” and “supporting technologies”]
If the format of our content is unsupported by technology, we can’t access the content’s information!
![Page 27: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/27.jpg)
Dependent on fleeting technology
We are dependent on technology to interpret (render, play, etc.) digital content
No technology sticks around – it all ages and disappears
Eventually all digital content in its original format becomes unusable!
![Page 28: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/28.jpg)
Format obsolescence
Kodak PhotoCD
◦ Used by libraries in the 1990s and into the 2000s as a preservation format
◦ Best decoders were from Kodak and are no longer supported
◦ Very few software decoders remaining – soon images in this format will be unusable
◦ Harvard’s Digital Repository Service has 7,243 of these
![Page 29: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/29.jpg)
Two sub-problems
Keep the bits safe
Keep the information usable as technology changes
![Page 30: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/30.jpg)
Safe bits
Infrastructure, policies, practices and professional staff to counter risks
◦ High quality storage
◦ Redundancy (multiple copies, multiple locations)
◦ Media refreshing (replacing)
◦ Security and access restrictions
◦ Content recovery
◦ Integrity monitoring (check for corruption)
…
![Page 31: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/31.jpg)
Integrity monitoring
Message digests – unique signatures for digital content
◦ Fixed-size bit strings
    6326ec82b3200df4a87fc54356d2cb73
◦ Calculated by cryptographic hash functions, e.g. MD5, SHA1, …
Any changes to a file result in a changed message digest
Useful for detecting corruption
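A minimal sketch of digest-based integrity checking, using Python's standard `hashlib`. The sample content and the helper names `digest` and `is_intact` are made up for illustration; a real repository would read files from storage and keep the expected digests in its metadata.

```python
import hashlib


def digest(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex message digest of data under the named hash function."""
    return hashlib.new(algorithm, data).hexdigest()


def is_intact(data: bytes, expected: str, algorithm: str = "md5") -> bool:
    """Compare a freshly computed digest against the one recorded earlier."""
    return digest(data, algorithm) == expected


# Hypothetical content; the digest is recorded when the content is stored.
original = b"Challenges of Digital Preservation"
recorded = digest(original)

# Simulate silent corruption by flipping a single bit.
corrupted = bytearray(original)
corrupted[0] ^= 0x01

print(len(recorded))                          # 32 hex digits for MD5
print(is_intact(original, recorded))          # True
print(is_intact(bytes(corrupted), recorded))  # False
```

MD5 is the default here only because the digest on the slide has MD5's length of 32 hex digits; passing `"sha1"` or `"sha256"` as `algorithm` works the same way.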
![Page 32: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/32.jpg)
Usable information
People have to be able to find it
People must be able to manage it
Document what’s important (description, context, ownership, processing history)
Know what you are preserving (formats)
…
![Page 33: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/33.jpg)
A TIFF is a TIFF?
Tiff 4.0
Tiff 5.0
Tiff 6.0
Tiff 6.0 extension YCbCr (Class Y)
TIFF/IT (ISO 12639:2003)
TIFF/EP (ISO 12234-2:2001)
RichTIFF
EXIF 2.0
EXIF 2.1 (JEIDA-49-1998)
EXIF 2.2 (JEITA CP-3451)
GeoTIFF 1.0
TIFF-FX (RFC 2301)
Class F (RFC 2306)
RFC 1314
Canon RAW (.crw, .cr2, .tif)
Nikon RAW (.nef)
DNG (Adobe Digital Negative)
![Page 34: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/34.jpg)
Identifying formats
Techniques: “magic numbers”, full parse
Few tools
◦ Support limited number of formats
◦ Accuracy varies
Some improvements
◦ File Information Tool Set (FITS) fits.google.code
◦ NARA-sponsored research
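The “magic numbers” technique can be sketched as a lookup on the first few bytes of a file. The signature table below is a small, partial sample and the function name `identify` is illustrative; this is not how FITS or any particular tool is implemented.

```python
# Partial table of well-known leading byte signatures ("magic numbers").
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify(header: bytes) -> str:
    """Guess a format from the leading bytes of a file."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "unknown"


print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # JPEG
print(identify(b"II*\x00\x08\x00\x00\x00"))             # TIFF (little-endian)
print(identify(b"plain text"))                          # unknown
```

A signature match only narrows the family. TIFF-based formats such as DNG begin with the same bytes as plain TIFF, so telling variants apart requires parsing further into the file, which is where the full-parse technique comes in.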
![Page 35: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/35.jpg)
Usable information
Make sure there’s technology to support the formats! (technology watch)
Preservation strategies
◦ Technology preservation
◦ Creation of viewing software
◦ Emulation & variations:
    Universal Virtual Machine
    Universal Virtual Computer
◦ Format normalization
◦ Format migrations
…
![Page 36: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/36.jpg)
Key format migration considerations
What can’t be lost in the transformation? “Significant properties”
◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links
◦ How important is each of these properties? – weighted criteria
To what format? “Preservable” formats
What else must be changed? Ex: Links
How many versions to keep?
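One way to read “weighted criteria” is as a weighted sum over significant properties. The sketch below is an assumption about how such scoring might look, not a method described in the slides; the weights, candidate formats, and preservation scores are all hypothetical.

```python
# Hypothetical weights: how much each significant property matters (sum to 1.0).
WEIGHTS = {
    "color": 0.4,
    "embedded metadata": 0.3,
    "resolution": 0.2,
    "links": 0.1,
}

# Hypothetical candidates: the fraction of each property that survives
# migration to that format (0.0 = lost, 1.0 = fully preserved).
CANDIDATES = {
    "format A": {"color": 1.0, "embedded metadata": 0.5, "resolution": 1.0, "links": 0.0},
    "format B": {"color": 0.5, "embedded metadata": 1.0, "resolution": 1.0, "links": 1.0},
}


def weighted_score(preserved: dict[str, float], weights: dict[str, float]) -> float:
    """Sum each property's preservation fraction scaled by its weight."""
    return sum(w * preserved.get(prop, 0.0) for prop, w in weights.items())


def rank(candidates: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Order candidate formats from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


for name in rank(CANDIDATES, WEIGHTS):
    print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

With these made-up numbers, format B ranks first (0.8 versus 0.75) even though it loses half the color information, because it keeps metadata and links. Changing the weights changes the ranking, which is why agreeing on the weights matters.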
![Page 37: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/37.jpg)
Preservation lifecycle – a series of hand-offs
Create or acquire digital content
Ingest into a preservation repository
◦ Continuous cycle of:
    Monitoring
    Planning
    Intervention
◦ Subject to collection management decisions
Transfer to next generation of the repository or to a different repository
![Page 38: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/38.jpg)
Ongoing commitment
Requires continual proactive program
◦ You can’t just start and stop
◦ Time frames are MUCH shorter than for preservation of analog material
Requires ongoing investment in infrastructure and staff
![Page 39: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/39.jpg)
Can’t do it alone
Digital preservation activities must be shared across institutions
Even collectively we don’t have adequate resources or understanding
![Page 40: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/40.jpg)
Preservation community
Collaborative organizations (NDSA, IIPC, OPF)
Collaborative projects
Standards and best practices
Shared infrastructure and tools
◦ Formats registry
◦ Repository software
◦ Preservation planning tools
◦ Format tools
![Page 41: Challenges of Digital Preservation](https://reader035.vdocument.in/reader035/viewer/2022070423/568165a3550346895dd884dd/html5/thumbnails/41.jpg)