Page 1: I say emulate

I Say Emulate; He Says Migrate

Are emulation or migration feasible preservation strategies?

National Library of Australia
Prepared by: Andrew Stawowczyk Long
Presented by: David Pearson

Page 2: I say emulate

Archiving the Web

• Many institutions actively harvest the web
• Collecting scale varies
• Preservation practices are not well understood or implemented
• Collecting intent may differ depending on the institution

Page 3: I say emulate

Web Archives

• Type
  • Text oriented
  • Multimedia (video/audio) oriented
  • Picture oriented
  • Databases
  • Combination of all types
• Storage
  • Uncompressed
  • Compressed (WARC)
  • Combination

Page 4: I say emulate

Web Objects and Elements

• Challenge: Web archives may contain any type of digital object
• Common objects
  • HTML/XML and related (htm, html, xml, css, etc.)
  • Images (raster images – JPEG, GIF, PNG)
• Media
  • Audio files (au, wav, aiff, midi, mp3)
  • Video files (mov, mpg, wmv, rm)
• Other objects
  • File archives (usually compressed – zip, tar, gz, arc, sit)
  • Images (raster images – bmp, tiff)
  • Images (vector images – SVG)
  • Text files (txt, csv, rtf)
• Document files
  • PDF
  • Microsoft Word, Excel, PowerPoint

Page 5: I say emulate

Comparative statistics of NLA web collections

PANDORA (selective)
Size: 3.26 TB
Files: 73 million

.au Domain Harvests
Size: 78.75 TB
Files: 2.3 billion

Domain Harvest   2005          2006          2007          2008
Size             6.69 TB       19.04 TB      18.47 TB      34.55 TB
Hosts crawled    811,523       1,046,038     1,247,614     3,038,658
Unique files     185 million   596 million   516 million   1 billion

Page 6: I say emulate

What are we preserving? Preservation Intent

• Preservation of:
  • Physical media?
  • Bit-stream (logical form of data)?
  • Action (rendering data into something useful to user)?
  • User experience?
• Important Considerations
  • Creator’s perceived intent
  • Institution’s preservation intent

Based on Heslop and Davis (2002)

Page 7: I say emulate

What are we preserving? Properties

• Object Properties (properties regarded as important will vary depending on the intention of the collecting institution)
  • Derived from file format
  • High-level – e.g. layout, formatting
  • Measured – identified directly by computer
  • Intended – set by the collecting body

or WEB

Page 8: I say emulate

Possible Preservation Actions 1

• Emulation
  The original environment is recreated on contemporary hardware using specialised software (an emulator) and the original software.

• Renderers
  Specialised software, operating in the contemporary environment and used to access (render) original files. It is similar to emulation.

Page 9: I say emulate

Possible Preservation Actions 2

• Migration
  Original file formats are migrated (converted) to another format that is supported by current hardware/software.
  e.g. MS Word 3.0 to MS Word 2008

Page 10: I say emulate

Possible Preservation Actions 3
Not long-term sustainable

• Technological Museum
  Collect and maintain the original hardware and software
• Take No Action
  Do nothing

Page 11: I say emulate

Digital Preservation: Preliminaries

• Collection objects need to be correctly recognised and identified
• Preservation intent(s) need to be defined
• High-level preservation actions need to be defined (e.g. shall we use emulation or migration?)
• Practical-level preservation actions need to be defined

Object Format + Preservation Intent = Appropriate Action

Dilemma:

How to properly migrate data if preservation intent(s) are unknown or not defined?
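
The formula "Object Format + Preservation Intent = Appropriate Action" can be read as a lookup keyed on both inputs. The sketch below is a minimal illustration only: the rule table, the intent names and the function `choose_action` are hypothetical and are not taken from the project. It also shows the dilemma: without a defined intent, no action can be selected.

```python
from typing import Optional

# Hypothetical rules for illustration only. The keys and the chosen
# actions are assumptions, not recommendations from the project.
RULES = {
    ("html", "bit-stream"): "store unchanged",
    ("html", "user experience"): "emulate original browser",
    ("jpeg", "content"): "migrate to png",
}


def choose_action(object_format: str, intent: Optional[str]) -> str:
    """Return a preservation action, or flag the case where none can be chosen."""
    if intent is None:
        # The dilemma from the slide: intent is unknown or not defined.
        return "undecided: preservation intent not defined"
    return RULES.get((object_format, intent), "undecided: no rule for this pair")
```

Calling `choose_action("jpeg", None)` returns the "undecided" result, which is the situation the dilemma describes.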

Page 12: I say emulate

Tools Required for Emulation

• Emulators
  • Fast, stable, flexible, extendable
• Licensed Operating Systems
• Various drivers
• Web browsers
• Browser plug-ins
• Other programs as required (e.g. Java, Adobe Acrobat Reader)

Page 13: I say emulate

Tools Required in Migration

• Format identifiers
• Format converters
• Link updaters
• QA automatons

CAMiLEON project – Migration on Request Tool

XENA
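
To illustrate what a format identifier does, the sketch below matches the first bytes of a file against a handful of well-known signatures ("magic numbers"). It is a simplified illustration and not how any tool named in this presentation is implemented. The `SIGNATURES` table is a small assumed sample, and `identify` and `identify_file` are hypothetical names.

```python
# A few widely documented leading byte signatures. Real identification
# tools rely on far larger signature registries than this sample.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
]


def identify(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A leading signature says nothing about what a container holds, so a check of this kind cannot describe the streams inside a container file.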

Page 14: I say emulate

Project Tests: General Testing Environment

• Large slice of uncompressed PANDORA archive (random selection)
• The whole Domain Harvest archive has not been included in tests (WARC files)
• Multiple hardware combinations
• Multiple OS combinations
• Multiple Web Browsers

Page 15: I say emulate

Project Tests: Material Sample

Testing the industrial scale tools
• PANDORA slice
  • 861 GB
  • 18,019,172 files
  • 2,379,326 folders

Testing object properties
• Smaller slice of PANDORA slice
  • 20 objects of each selected type
    • Audio, html, images, pdf, video, zip, MS documents

Page 16: I say emulate

Project Tests: Methodology

• Large sample testing (861 GB, 18,019,172 files)
  • Attempt to identify objects in the sample using DROID
  • Attempt to migrate jpeg images to png and update links
• Small sample testing
  • Select smaller sub-sample, with objects mostly created before year 2000
  • Identify objects in the sample
  • View and experience selected objects in contemporary environments using various platforms, OS and browsers
  • View and experience selected objects in old environments using emulations on various platforms, using different OS and browsers
  • Migrate selected objects and review them in various environments
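
The "update links" step of the jpeg-to-png migration can be sketched as a rewrite of HTML attribute values. This is a minimal sketch under stated assumptions, not the tool used in the tests: it only handles quoted `src` and `href` values that end in `.jpg` or `.jpeg`, and the name `update_links` is hypothetical. Converting the image files themselves is left to a format converter.

```python
import re

# Matches quoted src="..." or href="..." values that end in .jpg or .jpeg.
_LINK = re.compile(
    r'''(?P<attr>\b(?:src|href)\s*=\s*)(?P<q>["'])(?P<url>[^"']*?)\.jpe?g(?P=q)''',
    re.IGNORECASE,
)


def update_links(html: str) -> str:
    """Point quoted src/href values at the .png version of each JPEG."""
    return _LINK.sub(
        lambda m: f"{m['attr']}{m['q']}{m['url']}.png{m['q']}",
        html,
    )
```

For example, `update_links('<img src="pics/cat.JPG">')` returns `'<img src="pics/cat.png">'`. Unquoted attributes, URLs with query strings, and references inside CSS or scripts are not covered, which is one reason link updating needs a QA step.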

Page 17: I say emulate

Project Tests: Tools tested

• Emulation
  • QEMU
  • Bochs
  • MS Virtual PC (not exactly an emulator)
  • Dioscuri
• Migration
  • ImageMagick
  • MediaCoder
  • Swf>>avi
  • OpenOffice Tools
  • XENA
• Common
  • DROID
  • JHOVE
  • TrID
  • File Identifier
  • Lister (dev. in-house)
• OS
  – MS Win XP Pro
  – MS Win 3.1
  – MS Win 98SE
  – Ubuntu 9.04
• Web Browsers
  – MS IE 7
  – Firefox 3
  – Arachne 1.2
  – Mosaic 2
  – Netscape 4

Page 18: I say emulate

Project Tests: Control – Current Environment

• Properties observed in selected files
  Object Basic Characteristics (based on Emulation Project by KB)

1. Content: the text, images, etc. from the object
2. Structure: the cohesion between different parts of the object
3. Context: the meaning of the object
4. Appearance: the way an object is presented to the user
5. Behaviour: the interaction of the object with the user or system

E.g. for HTML pages:
• Rendering of text, images, media files
  • Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
• Object dependencies
• Mouse & keyboard behaviour
• Data extraction

Page 19: I say emulate

Project Tests: Emulated Environments

• Hardware
  • Dell Optiplex GX620, P4, 4.4GHz x 3.39GHz, 3.5Gb RAM
  • Power Mac G4

EMULATORS:

• Bochs
  • Host: WinXP Pro v2002 SP3, Ubuntu 9.04
  • Client: Win 3.1, MS DOS 6.2, WinXP Pro SP2
• Dioscuri 0.4.0
  • Host: WinXP Pro v2002 SP3
  • Client: Win3.1, MS DOS 6.2

Page 20: I say emulate

Project Tests: Emulated Environments

• Qemu
  • Host: MS WinXP Pro v2002 SP3
    • Clients: MS Win98SE, MS Win 3.1, MS DOS 6.2, Ubuntu 9.04
  • Host: Ubuntu 9.04
    • Clients: MS WinXP Pro SP2 (P4, 12.92GHz, 256Mb RAM), MS Win98SE, MS Win 3.1
• Microsoft Virtual PC
  • Host: MS WinXP Pro v2002 SP3
    • Clients: MS Win 3.1, MS Win98SE

Page 21: I say emulate

Tests - Summary: Emulation

• Setting up emulators was relatively simple
• Additional software (especially for working with disk images) proved to be extremely useful
• Licensing was at times a big obstacle (e.g. it is impossible to emulate a Macintosh environment legally)
• Many dependencies exist. It is a complex task to make programs work correctly.
  • e.g. Windows XP requires internet or over-the-phone activation after 30 days

Page 22: I say emulate

Tests – Summary: Emulation

• All
  • Some of the DLL libraries in Win 3.1 conflicted with the 16-bit Netscape and Mosaic programs
• Bochs 2.3.7 for Windows
  • Extremely slow in GUI environments
  • No full screen mode. Limited end-user experience.
• Dioscuri
  • Sluggish at times
  • Had problems with some of the disk images created in WinImage
• Qemu 0.9.0 for Windows and Linux
  • Much faster but still sluggish at times
  • Win98SE couldn't run in hi-res, hi-colour mode
• Microsoft Virtual PC
  • Relatively fast (it is virtualisation software on a PC) but still sluggish at times

Page 23: I say emulate

Tests - Summary: Migration Environment

• Dell Optiplex GX620
• MS Windows XP Pro v2002 SP3
• Networked drive with PANDORA sample

Page 24: I say emulate

Tests - Summary: Migration

• Available tools are imperfect and slow
  • e.g. DROID took more than two weeks to examine slightly over 18 million files, and many of them were not recognised
• It is very difficult to examine the contents of container formats (e.g. avi or rm)
• Network connections need to be as fast as possible
• It is difficult to make an informed decision about migration without a clearly defined preservation intent
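
To put the DROID run in perspective, the figures reported in this presentation give a rough bound on throughput. The calculation below is a back-of-the-envelope sketch: it assumes exactly two weeks, whereas the run took longer, so the real rate was lower than the one computed.

```python
FILES = 18_019_172           # files in the PANDORA slice (Material Sample slide)
SECONDS = 14 * 24 * 60 * 60  # two weeks, a lower bound on the reported run time

files_per_second = FILES / SECONDS
ms_per_file = 1000 * SECONDS / FILES

# prints: at most 14.9 files/s, at least 67 ms per file
print(f"at most {files_per_second:.1f} files/s, at least {ms_per_file:.0f} ms per file")
```

At that rate, the 2.3 billion files of the .au Domain Harvests would be well beyond the reach of a single run of this kind.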

Page 25: I say emulate

Tests - General Comments

• No proven methods exist
  Real-world testing is needed
• Most documented approaches are ad hoc, with no commodity solutions
• Tools are few and inadequate

Page 26: I say emulate

Tests - General Comments

• Preservation policies, especially about preservation intent, are needed

• Significant resources are needed to practically tackle the problem

Page 27: I say emulate

Andrew Stawowczyk Long
Strategist
Digital Preservation
[email protected]

David Pearson
Director (Acting)
Web Archiving and Digital Preservation
[email protected]

The Project Report is due at the end of October 2009