EUDAT 3rd Conference Sustainability Session: Economic Sustainability of Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries - Wednesday 24th September 2014, Amsterdam, the Netherlands
![Page 1: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/1.jpg)
Economic Sustainability of Digital Preservation
David S. H. Rosenthal
LOCKSS Program
Stanford University Libraries
http://www.lockss.org/
http://blog.dshr.org/
© 2014 David S. H. Rosenthal
![Page 2: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/2.jpg)
Journals move to the Web
● Access for current readers better:
  ● Links, search, data spreadsheets behind graphs, ...
  ● No need to go to the library
● Access for future readers worse:
  ● Not purchase but rental: no rent payment, no access
  ● Not many copies, but one on short-lived rewritable media
![Page 3: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/3.jpg)
Paper Libraries
● Interesting example of fault tolerance:
  ● Loosely coupled network of many independent peers
  ● Each storing a selection of available content
  ● On durable, somewhat tamper-evident media
  ● Market in copies: fewer copies → more care
● Easy to find a copy, hard to find all copies
  ● Interlibrary loan & copy to repair loss or damage
![Page 4: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/4.jpg)
LOCKSS Program
● LOCKSS box acts as persistent Web cache:
  ● Crawls Web to preload with subscribed content
  ● If can't get publisher copy, readers get library copy
  ● Boxes cooperate to detect, repair loss & damage
● Timeline:
  ● 1998 NSF funded prototype
  ● 1999 NSF, Sun funded alpha: 1 journal, 15 boxes
  ● 2000 Mellon, Sun funded beta: ~40 libraries
  ● 2004 Production
  ● 2005 Mellon matching grant
  ● 2007 Sustainability!
![Page 5: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/5.jpg)
LOCKSS: Businesses
● Develop & support use of LOCKSS software:
  ● Free & open-source, but pay for support (cf. Red Hat)
  ● ~150 libraries using the software
● Under contract, run CLOCKSS network:
  ● Dark archive of e-journals & e-books
  ● Not-for-profit managed jointly by publishers and libraries
  ● 12 nodes worldwide
  ● Triggered if unavailable from any publisher, CC license
  ● Certified “Trustworthy Repository” score 13/15
  ● Technologies, Technical Infrastructure, Security – 5/5
![Page 6: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/6.jpg)
The Half-Empty Archive
● E-journals: less than half preserved
  ● ARL vs. Keepers: ~40% of serials preserved
  ● Faria et al.: <50% of serials preserved
● Public web pages, Ainsworth et al.:
  ● Search engine sampled URLs: ~2/3 preserved
  ● Bit.ly random URLs: ~1/3 preserved
● Choices:
  ● Do nothing
  ● Double the budget
  ● Halve the cost per unit content
![Page 7: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/7.jpg)
Cost Data?
● Lots of research into preservation costs:
  ● CMDP, LIFE, KRDS, PrestoPrime, ENSURE, ...
  ● Serious lack of usable data
  ● Inconsistent accounting, hidden costs, content variability
● My rule of thumb summarizing the research:
  ● Ingest 1/2, preservation 1/3, access 1/6 of lifetime cost
● 4C project: please submit cost data to:
  ● http://www.4cproject.eu/
  ● Curation Cost Exchange
![Page 8: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/8.jpg)
Kryder's Law
● Bit density on disk platters:
  ● Doubles every 18 months
● Thus $ per GB:
  ● Drops 30-40% per year
● If you can afford to store stuff for a few years
  ● You can afford to store it forever
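The reasoning behind "a few years means forever" can be sketched as a geometric series. The snippet below is an illustrative sketch only, not something from the slides: it assumes each year's storage cost is the previous year's cost reduced by a constant fraction, ignores interest, labour, and migration costs, and uses a hypothetical helper name `lifetime_cost_multiple`.

```python
def lifetime_cost_multiple(annual_price_drop, years=None):
    """Return total storage cost as a multiple of the first year's cost.

    Assumption (not from the slides): each year's cost equals the previous
    year's cost times (1 - annual_price_drop), with no interest or other costs.
    """
    if not 0.0 < annual_price_drop < 1.0:
        raise ValueError("annual_price_drop must be between 0 and 1")
    keep = 1.0 - annual_price_drop
    if years is None:
        # Infinite geometric series: 1 + keep + keep**2 + ... = 1 / (1 - keep)
        return 1.0 / annual_price_drop
    # Finite geometric series over the given number of years
    return (1.0 - keep ** years) / annual_price_drop


for drop in (0.30, 0.40):
    forever = lifetime_cost_multiple(drop)
    decade = lifetime_cost_multiple(drop, years=10)
    print(f"{drop:.0%} drop/yr: forever = {forever:.2f}x first-year cost, "
          f"first decade = {decade:.2f}x")
```

Under these assumptions, storing data forever costs about 3.3 times the first year's cost at a 30% annual drop and 2.5 times at 40%. The conclusion holds only while the price drop continues at that rate.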
![Page 9: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/9.jpg)
![Page 10: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/10.jpg)
Source: Preeti Gupta, SSRC, UC Santa Cruz
![Page 11: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/11.jpg)
Stored Safe in the Cloud?
● Cloud storage sold as “cheaper”:
  ● If all charges accounted for, not cheaper for preservation
  ● It's made of the same disks you use locally
  ● Economies of scale captured by the provider
● Cloud storage locks you in:
  ● Free to store, costs to access
  ● Changing providers slow, expensive – you will be gouged
  ● Not a competitive market – dominated by Amazon
● To avoid lock-in, must keep a copy yourself
  ● To allow you to change providers without paying arm+leg
![Page 12: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/12.jpg)
Blue Ribbon Task Force
● Sustainable Digital Preservation & Access:
  ● 2-year study, report in 2010
  ● NSF, Mellon, Library of Congress, JISC, CLIR, NARA
● Preservation has to be justified by access:
  ● D'oh!
  ● Dark archives (e.g. CLOCKSS) hard to sustain
  ● Scholars don't like to, no budget to, pay for access to data
![Page 13: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/13.jpg)
“Big Data”
● Research on past access to archives:
  ● Rare, sparse, except for integrity checks
  ● “Cold” data
● Future access will be different:
  ● Scholars want to data-mine from archive collections
  ● Access much more intense, expensive
  ● Data “warm” to “hot”
● How much more expensive?
  ● Compare S3 (warm) vs Glacier (cold)
  ● S3 2.5 times more expensive
![Page 14: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/14.jpg)
Cloud For Access?
● Cloud ideal for data-mining from collections:
  ● Spiky demand
  ● Charging mechanism
● Amazon Free Public Datasets:
  ● No charge to data owner
  ● Amazon charges readers for compute they use for access
● Library of Congress & Twitter feed (public):
  ● Store copy in Amazon Reduced Redundancy Storage
  ● Charge scholars for access to pay storage cost of copy
  ● Scholars pay Amazon for compute to access copy
![Page 15: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/15.jpg)
Sustaining Open Source Preservation
● Open source essential for preservation:
  ● No “just trust me” like closed-source encryption
  ● … or cloud storage
● Niche market – not like Linux, Apache, ...:
  ● No foundation with large industry sponsors
  ● Red Hat needs frequent, visible upgrades to motivate $
  ● Hard to devote resources to infrastructure improvements
● Mellon recognizes this problem:
  ● Small grant for infrastructure
  ● AJAX crawler, Shibboleth support, protocol improvements
![Page 16: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/16.jpg)
A Petabyte for a Century
● Black Box:
  ● Put PB in, wait 100yrs, take PB out
  ● Whatever media, replication, algorithms you like inside
  ● 50% chance every bit undamaged
![Page 17: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/17.jpg)
● This defines bit half-life:
  ● Approx 60M times the age of the Universe
  ● No feasible benchmark of adequate reliability
● Stuff will get lost or damaged:
  ● Only question is “how much damage for how many $?”
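The "60M times the age of the Universe" figure can be checked with a short calculation. This is a sketch under assumptions not stated on the slide: bits fail independently with exponential decay, a petabyte is taken as 10^15 bytes (8 × 10^15 bits), and the age of the Universe is taken as roughly 1.38 × 10^10 years. The function name `required_bit_half_life` is hypothetical.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15       # assumption: decimal petabyte, 10**15 bytes
AGE_OF_UNIVERSE_YEARS = 1.38e10      # assumption: roughly 13.8 billion years


def required_bit_half_life(bits, years, survival_probability=0.5):
    """Half-life (in years) each bit needs so that ALL bits survive
    `years` with the given probability, assuming independent failures.

    One bit survives t years with probability 0.5 ** (t / half_life), so all
    bits survive with probability 0.5 ** (bits * years / half_life). Setting
    that equal to survival_probability and solving for half_life gives:
    """
    return bits * years * math.log(0.5) / math.log(survival_probability)


half_life = required_bit_half_life(BITS_PER_PETABYTE, years=100)
multiple = half_life / AGE_OF_UNIVERSE_YEARS
print(f"required bit half-life: {half_life:.2e} years")
print(f"that is about {multiple / 1e6:.0f} million times the age of the Universe")
```

With a 50% target the logarithms cancel, so the required half-life is simply the number of bits multiplied by the number of years.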
![Page 18: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/18.jpg)
Threat Model
Media failure
Hardware failure
Software failure
Network failure
Obsolescence
Natural Disaster
![Page 19: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/19.jpg)
Operator error
External Attack
Insider Attack
Economic Failure
Organization Failure
![Page 20: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/20.jpg)
Is More Reliable Better?
● Two systems, same budget for a decade:
  ● A) zero loss rate
  ● B) 1%/yr loss rate, 50% less $/yr than A per unit content
  ● B's loss rate is clearly unacceptable
![Page 21: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/21.jpg)
● After a decade:
  ● B preserves 1.89 times as much at the same cost
● After 3 decades:
  ● B preserves more than 5 times as much
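The decade figure can be reproduced with a simple accumulation model. The model is an assumption, since the slide does not spell it out: each year the same budget buys one unit of new content on system A and two units on system B (half the cost per unit), and B then loses 1% of everything it holds. The helper name `preserved_after` is hypothetical, and the three-decade figure is not reproduced here because it depends on modelling choices the slide does not state.

```python
def preserved_after(years, units_per_year, annual_loss_rate):
    """Content held after `years`, adding `units_per_year` each year and then
    losing `annual_loss_rate` of the accumulated total (assumed model)."""
    total = 0.0
    for _ in range(years):
        total = (total + units_per_year) * (1.0 - annual_loss_rate)
    return total


# System A: zero loss, 1 unit per year for the budget.
system_a = preserved_after(10, units_per_year=1.0, annual_loss_rate=0.0)
# System B: half the cost per unit, so 2 units per year, but 1%/yr loss.
system_b = preserved_after(10, units_per_year=2.0, annual_loss_rate=0.01)

print(f"A holds {system_a:.2f} units, B holds {system_b:.2f} units")
print(f"B / A after a decade: {system_b / system_a:.2f}")
```

Under this model the ratio after ten years rounds to 1.89, matching the slide.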
![Page 22: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/22.jpg)
The Good News
● Sustainable digital preservation possible:
  ● LOCKSS is an example
![Page 23: Economic Sustainabilityof Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries](https://reader033.vdocument.in/reader033/viewer/2022052820/54985553b47959514d8b5497/html5/thumbnails/23.jpg)
The Bad News
● Expectations way out of line with reality:
  ● Can't preserve as much as people assume is being preserved
  ● Nor as reliably as people assume it is being preserved
● Mismatch will get worse:
  ● Expect lots more data, no more money
  ● Costs expected to drop rapidly; experts say slowly, if at all
● Technology won't save us:
  ● Research data, libraries, archives: niche market
  ● Hard problems, no big payoff for solution, so little research
  ● Build systems from stuff designed to do something else