niso webinar: software preservation and use: i saved the files but can i run them?

NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run Them? Wednesday, May 13, 2015 Speakers: Michael Lesk Professor of Library and Information Science, Rutgers University Euan Cochrane Digital Preservation Manager, Yale University Library Jon Ippolito Professor of New Media and Director of the Digital Curation Graduate Program, University of Maine http://www.niso.org/news/events/2015/webinars/software/

Upload: national-information-standards-organization-niso

Post on 18-Jul-2015

NISO Webinar: Software Preservation and Use:

I Saved the Files But Can I Run Them?

Wednesday, May 13, 2015

Speakers:

Michael Lesk, Professor of Library and Information Science, Rutgers University

Euan Cochrane, Digital Preservation Manager, Yale University Library

Jon Ippolito, Professor of New Media and Director of the Digital Curation Graduate Program, University of Maine

http://www.niso.org/news/events/2015/webinars/software/

Software preservation

Michael Lesk

Prof. of Library and Information Science
Rutgers University
New Brunswick, NJ 08901


Software preservation

The hard problem is not bad tape; it’s obsolescence. There are two common answers to the obsolescence problem.

Migration or emulation?

Migration: Convert the old information to a new format, e.g., BMP to JPEG.

Emulation: Use old information on a new version of an old machine, e.g. using a website that looks like an arcade game platform.

Why might old software be lost?

All the copies were thrown away.

The copies still exist, but the media have worn out.

The media are OK, but we have no device to read them.

We can read the bits, but we don’t know what they mean.

We understand the bits, but have no software to process them.

We have software but nothing to run it on.

The software depends on an environment that no longer exists.

We could process the bits, but we lack legal permission.

Discarded

We know the first telegram:

“What hath God wrought?” May 24, 1844; Samuel F. Morse, in Washington, to Alfred Vail, in Baltimore.

We know the first telephone call:

“Mr. Watson—Come here—I want to see you.” March 10, 1876. Alexander Graham Bell to Thomas A. Watson, in Boston.

We don’t know the first email message. It was in the spring of 1964, in either Cambridge (UK), Cambridge (Mass.), or Pittsburgh; but whatever it was, it was thrown out, and nobody kept good records.

The solution to this problem is multiple copies. Digital copies are perfect and cheap; use them.

Media fragility

In the 1970s Brazil stored Landsat space photography of their country on magnetic tape. These tapes were stored in humid conditions and deteriorated until they were unreadable. Magnetic tape is often fragile; audio tape is lost as well. It helps to start with better quality tape, and linear tape (audio) is better than helical tape (VHS cassettes). Sometimes it helps to heat the tape, once; hence one of the great titles in preservation literature, “If I knew you were coming, I’d have baked a tape” (Eddie Ciletti).

Again the solution to this is multiple copies, regularly inspected.

Note projects like LOCKSS: Lots of copies keep stuff safe.
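"Multiple copies, regularly inspected" can be sketched as a fixity audit: hash every copy and flag the ones that disagree with the majority. This is only an illustration of the idea, not how LOCKSS itself works; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import Counter
from pathlib import Path
from typing import List, Tuple


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(copies: List[Path]) -> Tuple[str, List[Path]]:
    """Return the majority digest and the copies that disagree with it."""
    digests = {path: sha256_of(path) for path in copies}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    damaged = [path for path, value in digests.items() if value != majority]
    return majority, damaged
```

A copy reported as damaged would then be replaced from one of the copies that still agree. With only two copies a disagreement cannot be resolved by voting, which is one practical argument for keeping three or more.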

Devices gone

Where today would you find a diskette drive? And that’s an easy one: what about a paper tape reader?

The answer to those is eBay, but what about special-purpose technology that failed in the marketplace, such as kinds of 12” writeable optical disk from the early 1990s?

Again, the answer is multiple copies on current devices. Even if your organization thinks it’s prepared to keep its 1980-vintage DEC computer running for a long time, where would you find spare parts when it broke? Or a technician who knew what to do with them?

Forgot the format

It is possible to have a format and not know what it is. Suppose you have a file made by Volkswriter, marketed by Lifetime Software (which, despite its name, ceased operating independently in 1991). How would you find out the control codes?

If you can’t find documentation, it may be easier to view this as a decipherment problem: if you find a funny symbol at “plus ?a change, plus c’est …” it’s the French ç character.
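A first step in treating an undocumented format as a decipherment problem is to list every byte that is not ordinary printable ASCII together with the text around it, so that a human can guess its meaning from context. The sketch below does that; the byte values in the usage note are invented for illustration and are not Volkswriter's actual control codes.

```python
from collections import defaultdict
from typing import Dict, List


def unknown_bytes_in_context(data: bytes, width: int = 12) -> Dict[int, List[str]]:
    """Map each byte outside printable ASCII to the text snippets around it."""
    contexts = defaultdict(list)
    for i, value in enumerate(data):
        if 32 <= value < 127 or value in (9, 10, 13):
            continue  # ordinary printable ASCII, tab, or line ending
        before = data[max(0, i - width):i]
        after = data[i + 1:i + 1 + width]
        # The unknown byte is shown as [hex]; any other non-ASCII bytes in the
        # surrounding snippet are rendered as U+FFFD by the "replace" handler.
        snippet = (before.decode("ascii", "replace")
                   + "[" + format(value, "02X") + "]"
                   + after.decode("ascii", "replace"))
        contexts[value].append(snippet)
    return dict(contexts)
```

Given the hypothetical input `b"plus \x87a change, plus c'est la m\x88me chose"`, the entry for byte 0x87 contains the snippet `plus [87]a change, pl`, from which a French reader would guess ç.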

Now we’re into the real issues: is it better to try to find a copy of the software or to convert the file to a current standard, like Word? In this case (word processing) conversion is probably easier.

Solution: use standard formats. Preferably public ones.

No software is available

Again, the vendor who wrote the software originally used for your file might have gone out of business. If your file is in a public format, there is probably an alternative. But if it was in a proprietary format, it may be difficult to find something that reads it. There was a time, for example, when Microsoft deliberately arranged for old MS-Word documents to be unreadable on newer versions so that customers would be forced to upgrade continuously. And in those days, Microsoft tried to prohibit other vendors from selling software that read and translated the “.doc” format; some of them did it anyway, and Microsoft gave up.

The solution is public formats and current formats; for example the newer “.docx” files in Microsoft Word have a public description.

No machine to run the software

Now we’re into the hard part of the problem: you might have some kind of program, but it was coded to run on a long-gone machine (Commodore 64, anyone?). Your choice is between:

Finding a machine for sale on eBay – but you can’t get parts to fix it, and you may have trouble finding out how to make it work.

Migrating whatever this is to a modern platform, ideally expressing it in public standard terms.

Finding an emulator for the old machine: something that will run the old code as it was.

Migration vs. Emulation

Migration means converting files to newer formats, for example Amiga graphics to TIFF or JPEG. If you migrate to a public standard you minimize the chance of having to do it again. It’s hard to guess which commercial formats will survive: if you had asked me in the 1990s whether a Kodak image format would survive, I would have said yes. The cost is that you have to do the conversion separately for every format; the benefit is that you get modern capabilities with the converted files.
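As a minimal sketch of migration, the function below re-encodes a text file from an old code page into UTF-8, a public standard. The choice of code page 437 (the original IBM PC character set) as the default and the function name are assumptions for the example; image migration such as BMP to JPEG follows the same read-convert-write pattern but needs an imaging library.

```python
from pathlib import Path


def migrate_text(source: Path, target: Path, legacy_encoding: str = "cp437") -> None:
    """Re-encode a legacy text file as UTF-8, leaving the original untouched."""
    # Decoding is strict by default, so bytes that are invalid in the
    # chosen legacy encoding raise an error instead of being silently lost.
    text = source.read_bytes().decode(legacy_encoding)
    target.write_bytes(text.encode("utf-8"))
```

Keeping the source file alongside the migrated one means a wrong guess about the legacy encoding can still be corrected later.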

Emulation means programming a current machine to behave like an old machine. This is a difficult task, but emulators exist for many common machines, particularly game platforms. A notable project is Olive (olivearchive.org) which is aimed at preservation of intellectual content beyond video games (CMU, IBM, and others). You get only the old behavior of the program.
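The core of any emulator is a loop that fetches an instruction of the old machine and reproduces its effect in software. The toy below emulates a made-up 8-bit accumulator machine with five instructions; the instruction set is invented for this example and corresponds to no real hardware, and real emulators must also model timing, graphics, sound, and peripherals.

```python
from typing import List, Tuple


def run(program: List[Tuple[str, int]], memory_size: int = 16) -> List[int]:
    """Interpret a program for a made-up accumulator machine; return its memory."""
    memory = [0] * memory_size
    acc = 0   # accumulator register
    pc = 0    # program counter
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "LOAD":       # acc <- constant
            acc = arg & 0xFF
        elif op == "ADD":      # acc <- acc + memory[arg], wrapping at 8 bits
            acc = (acc + memory[arg]) & 0xFF
        elif op == "STORE":    # memory[arg] <- acc
            memory[arg] = acc
        elif op == "JNZ":      # jump to instruction number arg if acc is non-zero
            if acc != 0:
                pc = arg
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory
```

Adding 200 and 100 on this machine stores 44, not 300, because the emulator reproduces the 8-bit overflow of the imagined hardware. That is the sense in which you get only the old behavior.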

Examples

Migration:

JSTOR, and many old journal systems: the early issues, whatever their original formats, are now in PDF. Often they were just OCR’d from the printed version, rather than translated digitally (high proofreading cost but minimal programming complexity). You can use all modern PDF tools on the articles.

Emulation:

The Internet Arcade is a collection of 1970s-80s arcade games that you can run in an emulator: https://archive.org/details/internetarcade

Some very special cases

Colossus, 1942. Colossus re-build, 1996

Charles Babbage’s Difference Engine, as rebuilt by the Science Museum (London), 2002.

Analogy

Consider performing early music. Should you play it on old instruments or modern ones? Old instruments are more authentic, but have a different effect on the modern ear. Bach’s listeners had not heard a piano and the organ did not sound “old fashioned”.

Emulation is finding an old church (there are some in Germany whose architecture and organ pipes are not changed from Bach’s day) and using old-fashioned performance techniques.

Migration is using a piano (and keyed flutes and trumpets, etc) but trying to produce the same emotional effect.

Similarly with old books: Caslon and Baskerville did not look old to people who had never seen Helvetica.

If you lack source code

In general, you can’t migrate a piece of software without the source code, since you want to recompile it on a new machine. There are disassemblers, but the result is going to be a real pain to understand. So if you have only the object code, you may be driven to emulation. Since many software vendors keep source code very secret, and did so in the past as well, it’s not uncommon to have only the binary form of some program.

A legal warning: if you can’t find the vendor (out of business) and get permission, you may not have permission even to use the binary code, although this may depend on the terms of the original purchase. It may or may not have allowed transferring the program to a new user.

Features in old and new versions

Suppose you take an ancient word processor file and migrate it to a modern format. Then you can do things like export HTML, or PDF. Any tool that will use the modern format can work with your old file. But the tool will give a modern result – it will run faster, use modern display fonts, and the like.

If you are using an emulator, you get the old behavior. If the program only displayed green on black, you get green on black. This is “authentic” but you may not like it. And you may not be able to create HTML or PDF from the program. If you are trying to merge many such older documents into a digital library, the format incompatibilities will make things worse.

Metadata

If you really want to preserve a complex software object, it helps to know exactly what programs were used to create it. That means not just the name, but the exact version. Other issues that are more serious for digital preservation include provenance: where did this come from? This is relevant for answering questions about the material, or finding the people who might know the answer. Similarly it may assist with rights metadata, or technical metadata. Modern formats sometimes have technical metadata included in the file (e.g., in a JPEG header) but older formats often don’t.

Again, it is easiest if you use well-known and common formats.
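To illustrate technical metadata carried inside a file, the sketch below reads the fixed-size header of a GIF image, which records the format version and the pixel dimensions. GIF is used here instead of JPEG only because its header is simpler to parse; the function name and the shape of the returned dictionary are choices made for the example.

```python
import struct
from typing import Dict, Union


def gif_header_metadata(data: bytes) -> Dict[str, Union[str, int]]:
    """Extract format version and logical screen size from a GIF header."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Bytes 6-9 hold the width and height as little-endian 16-bit integers.
    width, height = struct.unpack("<HH", data[6:10])
    return {
        "format": "GIF",
        "version": data[3:6].decode("ascii"),
        "width": width,
        "height": height,
    }
```

An older format with no such header gives a preservation system nothing to read, so the equivalent facts have to be recorded by hand when the file is acquired.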

Standards

“The good thing about standards is that you have so much choice.”

Even ASCII (ISO 646) is ambiguous. The UK changed the “#” character to mean “£” and Germany changed “}” to “ü” .

Particularly worrisome are “wrapper” formats. TIFF may contain different kinds of image compression algorithms (such as G4 fax, or Lempel-Ziv), and thus a TIFF reader may not be able to read all TIFF images. Some image viewers understand progressive images in GIF or JPEG; some don’t. PDF can include the kitchen sink (e.g., 3-D viewers).

Solution: emphasize the best and most public formats.
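The ISO 646 ambiguity mentioned above can be made concrete: the same seven-bit bytes decode to different text depending on which national variant the writer used. The tables below list only code points the author is confident of (the UK pound sign and the German DIN 66003 replacements); they are a partial illustration, not complete definitions of those variants.

```python
# Positions where a national ISO 646 variant differs from US-ASCII.
VARIANTS = {
    "us": {},
    "uk": {0x23: "£"},
    "de": {
        0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
        0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
    },
}


def decode_iso646(data: bytes, variant: str) -> str:
    """Decode seven-bit text under the named national variant."""
    overrides = VARIANTS[variant]
    chars = []
    for value in data:
        if value > 0x7F:
            raise ValueError("ISO 646 is a seven-bit code")
        chars.append(overrides.get(value, chr(value)))
    return "".join(chars)
```

The bytes `b"#5 f}r"` read as `#5 f}r` in the US variant, `£5 f}r` in the UK variant, and `#5 für` in the German one. Nothing in the file says which reading was intended, which is why the variant has to be recorded as metadata.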

Missing environment

What would it mean to preserve the “Amazon home page”? It is different for every person using it and for each instance: it is synthesized from the browsing and order history of the user, the current incentives for sales at Amazon, and lots else (geography, source computer, etc.). There are many pieces of software that depend on almost everything around them. Think about all the install scripts that ask “we want to use your location,” “we want to use your browser history,” and so on. (And of course many programs don’t ask, they just use them.)

No good answer for this. You have to judge what you mean by preserving the object – what will the users want the behavior to be?

Protection from abuse

If you run a general-purpose preservation operation, you need to think about whether anything in your preservation files is dangerous or doubtful in some way. People might try to use your system to distribute malware (viruses) or to enable software piracy.

Thus, unfortunately, you may want to put out calls like “please send in examples of early APL software” but you can’t just accept anything, and can’t rely on statements made by unknown volunteers about what they are submitting.

Legal permission

You may have an object, and know what to do with it, but not have legal permission to preserve it. For example, many of the video game companies object to attempts to imitate the old games – to them, this is creating competition for new games.

Unfortunately, given the copyright trolls out there, who try to make a living by finding people who have downloaded something they shouldn’t have, and then threatening them with lawsuits, this is not an area where it is easier to get forgiveness than permission.

Libraries are often justifiably paranoid.

There is of course the preservation exception in the law; but it limits a library to on-premises use.

Good and bad

Why software preservation is hard: the material is not self-describing, there were many early products that vanished without adequate documentation, software can be very complex, it requires special hardware to run, and so on….

Why software preservation is easy: as with all digital information, it can be copied without error; if one person has migrated a format or emulated a machine, that can be used by others; and computers are new enough that there is probably no computer without some user who is still alive. I learned to program on a Univac I; that doesn’t mean I have a tape drive that uses its steel tapes (yes, steel), but at least I know what they are.

Conclusion

The biggest technical choice is migration vs. emulation. I would generally say:

migration for static formats

emulation for executable programs

There are some ambitious programs: the Computer History Museum in Mountain View has been able to salvage old machines like the Xerox Alto.

But the industry does a lot less than we would like; it is more common to have legal problems in salvaging software than to get financial help from its original marketer.

Emulation in Practice

Emulation as a Service at Yale University Library; lessons learnt and plans for the future

Euan Cochrane, Digital Preservation Manager, Yale University Library

Overview

1. Why should we care about emulation?

2. What is emulation?

3. How do we do emulation?

4. What is Emulation as a Service (EaaS)?

5. How we use EaaS

6. Lessons learnt using EaaS

7. Future work at Yale University Library (YUL)

Emulation – Why?

Why? - Executable content

• Video games

• Research data workflows

• Digital Art

• Software as artifact

• Digital artifact museums (preserving the tools and infrastructure of the digital age)

Why? – Software dependent content

Content that requires software in order to be rendered or interacted with:

• Office files (documents, spreadsheets, slide sets, etc)

• CAD files

• Outlook inboxes

• eBooks with note taking capability

• Desktop environments

• Code

• Any proprietary, or effectively proprietary, formats

[Figure: Operating System Usage Over Time, 2003 to 2015; vertical axis from 0.00% to 80.00%; series: Win8, Win7, Vista, Win2003, Older Win, WinXP, W2000, Win98, Win95, WinNT, Linux, Mac, Mobile]

Why? – Software dependent content

Old software is required to authentically render old content

Original content in original software (WordPerfect in Windows 95)

Original content in newer software (LibreOffice Writer in Windows Vista)

Research results are at risk of loss without original software

Original content in original software (WordStar for DOS in Microsoft DOS)

[NB: equation predicting tree growth rates includes exponents documented using upper line of text]

Original content in newer software (LibreOffice Writer in Windows Vista)

[NB: equation layout and meaning

Emulation – How?

How? – Emulation and virtualization software tools

• An emulation software package (“emulator”) is used to create a virtual version of one computer within another computer that has different hardware

• Old software can be run on the “emulated” computer hardware just like it was running on the original physical computer.

• Many emulators were originally developed to run old video games

How? – Software tools

• Emulation is often used to support old hardware devices that require obsolete software (e.g. assembly line management software, scientific instruments, industrial machinery, etc.)

• Emulation is widely used by mobile phone application developers to develop software for phone hardware using desktop PC hardware (i.e. phone hardware is emulated on desktop PCs to build phone-compatible applications)

• Virtualization = emulation but with compatible hardware (some of the host machine’s hardware is used directly by the “virtualized” computer)

Virtualization bridges the gap between departure of recently obsolete hardware and the arrival of hardware powerful enough to emulate it

How? – Preserving software and dependencies

• We need to curate and preserve operating systems to support access to assets that depend on them

• We need to curate and preserve software applications to support access to content that depends on them

• We need to curate and preserve fonts, scripts, plug-ins and other dependencies to support access to content that requires them

• We need to preserve whole desktop environments (e.g. Salman Rushdie’s desktop at Emory University) to support access to the experience of interacting with them

• We need to curate and preserve pre-configured disk images with software already installed on them – for running on emulated hardware

How? - Documentation

• We need unique, persistent identifiers for software

• We need software catalogues

• We need unique, persistent identifiers for disk images (installed environments/virtual hard drives)

• We need disk image/virtual hard drive catalogues

• We need unique, persistent identifiers for emulated/virtualized hardware configurations

• We need hardware configuration catalogues

We don’t have these (yet!)*

*Mostly: the Internet Archive is doing great work, as are NIST and PRONOM
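One way to picture the catalogues called for on this slide is a registry in which software, disk images, and hardware configurations each get an identifier and can point at the entries they depend on. Everything in the sketch below is hypothetical: the class names, the fields, and the choice to derive identifiers from a SHA-256 hash of the content are assumptions, not a description of any existing registry.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CatalogueEntry:
    identifier: str
    kind: str                         # "software", "disk-image" or "hardware-config"
    title: str
    version: str
    depends_on: Tuple[str, ...] = ()  # identifiers of other entries


def content_identifier(payload: bytes) -> str:
    """Derive a stable identifier from the bytes themselves (an assumed scheme)."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


class Catalogue:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogueEntry] = {}

    def register(self, kind: str, title: str, version: str, payload: bytes,
                 depends_on: Iterable[str] = ()) -> CatalogueEntry:
        entry = CatalogueEntry(content_identifier(payload), kind, title,
                               version, tuple(depends_on))
        self._entries[entry.identifier] = entry
        return entry

    def requirements(self, identifier: str) -> List[str]:
        """List everything an entry depends on, directly or indirectly."""
        seen = set()
        stack = list(self._entries[identifier].depends_on)
        titles = []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            entry = self._entries[current]
            titles.append(f"{entry.title} {entry.version}")
            stack.extend(entry.depends_on)
        return sorted(titles)
```

Recording the exact version in each entry is what lets a disk image be cited unambiguously: two institutions that register the same bytes arrive at the same identifier under this scheme.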

How? – Configuring emulated hardware

• Admins configure an emulator

• Admins install and/or configure the emulated software

• Requires various emulator specific, technically challenging tools

How? – Accessing emulated environments at libraries and archives

• Users access emulated environments via dedicated machines

• Use dedicated software

• At libraries and archives this is mostly restricted to reading rooms

How? – This is too hard!

Emulation as a Service

Emulation as a Service – What is it?

Remote access to pre-configured emulated and virtualized environments via any modern web browser

Abstracts configuration challenges away from end-users

Changes to environments can be saved or discarded at the end of a session (a fresh/unchanged version is always available)

Interactivity can be restricted where appropriate (e.g. limited ability to download or copy content to local computer)

Relatively simple way to provide custom online environments (virtual reading rooms?)

Emulation as a Service (EaaS) – Why?

• A lot of old digital content can only be properly accessed using emulation tools

• Emulation is technically specialized

• Old software can be challenging for modern users to understand

• Modern users don’t expect to have to come into a reading room to access digital content

• Maintain control over content: users can’t copy data in or out unless authorized (screenshots are inevitably excluded)

Emulation as a Service (EaaS) – Why?

• Strong separation between environments, objects and emulators/configurations

• Emulation can be provided remotely (outsourced), with disk image archives and/or content maintained locally

• Small derivative environments can be created from base environments, saving space

• Standard environments can be reused and customized

• Provides ability to cite environments

EaaS usage Examples

• Puppet Motel

• Hebrew Texts

• Companies Data

• See: http://blogs.loc.gov/digitalpreservation/2014/08/emulation-as-a-service-eaas-at-yale-university-library/

EaaS – How it works Architecture and design

EaaS – How it works (For Technical Administrators)

• Admins configure an emulator on local PC

• Admins configure the emulated software on a local PC

• Configured environment gets saved as a “disk image” with configuration metadata

• Admins confirm the software environment stored on the disk image works on local PC

• Admins/Archivists/Librarians ingest it into the EaaS service:

EaaS – How it works (For Librarians/Archivists)

• Pre-configured software environments (e.g. a Windows 95 + Office 95 environment) can have files added to them and be saved as a variant or as a stand-alone new environment

• Only difference (delta) between base-environments and customized environment retained – saving space by not duplicating virtual hard drive content

• CD-ROMs and other software can be ingested, installed/configured on top of a base environment, and tested using an online interface

• Newly customized environment can be stored for future use and further

EaaS – How it works (For Librarians/Archivists)

• Librarians/Archivists can also ingest disk images captured from machines they have acquired (e.g. authors’/politicians’ desktops)
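The space-saving "delta" idea described for derived environments can be sketched as a copy-on-write overlay: reads fall through to the shared base image unless a block has been changed, and only changed blocks are stored. This is a simplified illustration with hypothetical names and a deliberately tiny block size; it is not the bwFLA implementation.

```python
from typing import Dict

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


class DerivedImage:
    """A customized environment stored as changes on top of a base image."""

    def __init__(self, base: bytes) -> None:
        self._base = base                  # shared and never modified
        self._delta: Dict[int, bytes] = {}  # block index -> replacement block

    def read_block(self, index: int) -> bytes:
        if index in self._delta:
            return self._delta[index]
        start = index * BLOCK_SIZE
        return self._base[start:start + BLOCK_SIZE]

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block long")
        self._delta[index] = data

    def discard_changes(self) -> None:
        """Return to the fresh, unchanged base environment."""
        self._delta.clear()

    def delta_size(self) -> int:
        """Bytes stored for this derivative, excluding the shared base."""
        return len(self._delta) * BLOCK_SIZE
```

The same mechanism covers both behaviours described for user sessions: saving a session means keeping the delta as a new variant, and discarding it means throwing the delta away while the base stays intact.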

EaaS – How it works (For end-users)

• Users can click on links in a catalogue/finding aid to access environments/content

EaaS – How it works (For developers and system integrators)

• Provides generic access to the functionality of many emulators and virtualization tools via a web service and REST API

• Emulation functionality can be incorporated into existing workflows

• Emulated (or virtualized) environments can be embedded into web pages for online access and online exhibitions

• Emulated environment citations, thumbnails, and URIs/URLs enable easy integration with existing catalogues and finding aids

• One-click “image-disk-and-emulate” workflows being developed (collaborating with digital forensics initiatives)

• Open Source (currently available on request, code will be published in the future)

EaaS – Background

• bwFLA EaaS project from University of Freiburg in Germany (http://bw-fla.uni-freiburg.de)

• Personally collaborated with bwFLA at Freiburg while at Archives New Zealand

• Now at Yale University Library and brought collaboration along

• Yale University Library has (/had!) the only installation outside of Germany

EaaS Demo

(Semi-)Public Demo

https://demo.bw-fla.uni-freiburg.de

Username: bwfla

Password: demo

Related work

• Olive Archive https://olivearchive.org/

• Internet Archive https://archive.org/details/software

• Keep Emulation framework http://emuframework.sourceforge.net/

• QEMU http://wiki.qemu.org/Main_Page

EaaS at Yale

• Testing and providing requirements for ongoing development

• Imaging general collections digital media & Trialing access via EaaS

• Investigating workflow integration (virtual reading rooms?)

• Finding gaps in supporting infrastructure

Lessons learnt

• It works and we can do this now!*

*with caveats

• Software licensing needs to be solved (abandonware and out-of-cart software are huge problems)

• Scale is manageable through standardization and sharing

• Archivists and Librarians can use EaaS with relatively little training

• The possibilities of using EaaS in workflows are huge

• If EaaS becomes an assumption, creators may change their methods

Future work at Yale University Library

• Move EaaS into production

• Increase software archiving

• Develop standard shareable environment images

• Collaborate with others to maximize efficiency of software archiving

• Develop emulation testing standards and frameworks

• Explore options for preserving networked environments

• Make progress on the licensing issues

Thank you

https://demo.bw-fla.uni-freiburg.de

Username: bwfla

Password: demo

NISO Webinar • May 13, 2015

Questions?

All questions will be posted with presenter answers on the NISO website following the webinar:

http://www.niso.org/news/events/2015/webinars/software/

Thank you for joining us today.

Please take a moment to fill out the brief online survey.

We look forward to hearing from you!

THANK YOU