To Preserve Or Not To Preserve?

To Preserve Or Not To Preserve? The Challenges in Appraising Electronic Records. Peter Bajcsy, PhD - Research Scientist, NCSA - Adjunct Assistant Professor ECE & CS at UIUC - Associate Director Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. Date: January 21st, 2009


DESCRIPTION

The Challenges in Appraising Electronic Records

TRANSCRIPT

Page 1: To Preserve Or Not To Preserve?

To Preserve Or Not To Preserve? The Challenges in Appraising Electronic Records

Peter Bajcsy, PhD
- Research Scientist, NCSA
- Adjunct Assistant Professor ECE & CS at UIUC
- Associate Director Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Date: January 21st, 2009

Page 2: To Preserve Or Not To Preserve?

Acknowledgement

• This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners.

• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government.

• Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, William McFadden, Sang-Chul Lee, David Clutter and Alex Yahja

Imaginations unbound

Page 3: To Preserve Or Not To Preserve?

Outline

• Introduction
• Stakeholders
• Conceptual Challenges
• Some Open Problems
• Research Examples Illustrating Open Problems
• Summary, Observations and Future Vision

Page 4: To Preserve Or Not To Preserve?

Introduction

• Two Trends in the Context of Decision Processes (Government, Medical, Natural Disasters, …)
  • Decision processes are moving from paper based to electronic record based (~ computer assisted decision processes)
  • Electronic records depend on rapidly changing information technology
• Decisions are optimal depending on knowledge
  • Any learning from electronic records depends on preservation and reconstruction of the records, as well as on quality and granularity of the information

Page 5: To Preserve Or Not To Preserve?

Fundamental Problems

• Limited learning from historical records today
  • It is often due to missing information and high uncertainty/low quality of historical records.
• Lack of understanding how to preserve and reconstruct data and decision processes.
  • It is due to insufficient forecasting/simulation capabilities.

Page 6: To Preserve Or Not To Preserve?

To Be Preserved!

[Figure labels: Digital representation of information & knowledge; Preservation; Information transfer ?; AGENCY; ARCHIVES]

Page 7: To Preserve Or Not To Preserve?

Motivation

• The problems related to preservation of electronic records are only going to become more serious
  • Information becomes more heterogeneous and complex
    • More data types
    • Higher dimensional data
    • New file formats
  • Volumes of electronic records have been increasing and will continue to grow
    • The model of a paperless office (4 years of Bush’s email > 8 years of Clinton’s email)
    • The paradigm shift to eScience
  • Digital information technology has been changing faster than any previous preservation media
    • The time scale of electronic media is ephemeral in comparison with paper or clay tablets

Page 8: To Preserve Or Not To Preserve?

Example of Preservation Needs in Medicine

• Short term:
  • Medical practice requires comparing patients’ records acquired today with the patients’ records from 5, 10, 50, or 70 years ago in order to assess functional, structural or low level biological changes due to diseases, treatments and/or aging.
• Long term:
  • Genealogy studies compare data sets over several hundreds and thousands of years

Page 9: To Preserve Or Not To Preserve?

Who Are the Stakeholders?

• Multiple institutions and organizations are active in the area of medical record preservation
  • National Library of Medicine (NLM)
  • Research Information Network (RIN)
  • Medical Research Council (MRC) in UK
  • National Archives and Records Administration (NARA)
• Identified common goals:
  • Seamless, uninterrupted access to expanding collections of biomedical data, medical knowledge, and health information
  • Preserve medical record collections in highly usable forms and contribute to comprehensive strategies for preservation of biomedical information in the U.S. and worldwide.

Page 10: To Preserve Or Not To Preserve?

Other Stakeholders

• Government agencies
  • Prediction of patterns signaling natural disasters based on historical measurements
  • Detection of terrorist attacks based on past experience
  • Learning about other planets from past space shuttle missions
  • Preservation of cultural heritage
• Companies
  • Preservation of engineering drawings and architectural designs – Boeing, John Deere, GM
  • Preservation of simulation results – Caterpillar, Ford
  • Backward compatibility of hardware/software – GE

Page 11: To Preserve Or Not To Preserve?

NARA as One of the Key Stakeholders

• According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”
  • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible”
  • “D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”

Page 12: To Preserve Or Not To Preserve?

Conceptual Challenges

• Learning Requires Reusing Electronic Records
  • How to enable and support preservation and reconstruction of electronic records?
• Advancing Sensors and Instruments Leads to New Types of High Dimensional Data and Large Volumes
  • How to design preservation methodologies that scale well?
• Process to Enable Learning over Time from Electronic Records Requires Large Financial Investments
  • How to minimize computational hardware, software, and storage cost and maximize the amount of preserved information?

Page 13: To Preserve Or Not To Preserve?

What Are The Key Open Problems?


Page 14: To Preserve Or Not To Preserve?

Some Open Problems -> Intellectual Merit

• Appraisal Methodology
  • Appraisal by Visual Exploration
  • Support of Appraisals by Enabling Comparisons
  • Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records
• Support of Archival Decisions
  • Simulate Preservation Costs as a Function of Information Granularity and Information Technology
  • Optimal Utilization of Computational and Human Resources
• Automation of Processing for Preservation
  • Discovery of Relationships Among Electronic Records
  • Information Preserving Conversions of Electronic Records
  • Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records

Page 15: To Preserve Or Not To Preserve?

Broader Impacts

[Diagram labels: Process to Enable Learning Over Time; Electronic Records; Knowledge; Optimal Decision Making; +$; -$]

Page 16: To Preserve Or Not To Preserve?

Concrete Research Examples Illustrating Open Problems

Page 17: To Preserve Or Not To Preserve?

Open Problems Related to Appraisal Methodology

1. Appraisal by Visual Exploration
2. Support of Appraisals by Enabling Comparisons
3. Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records

Page 18: To Preserve Or Not To Preserve?

Definition of Appraisal in Archival Context

• Appraisal -- the process of determining the value and thus the final disposition of Federal records, making them either temporary or permanent.
  • See http://www.archives.gov/records-mgmt/initiatives/appraisal.html
• The basis of appraisal decisions may include
  • the records' provenance and content,
  • the records' authenticity and reliability,
  • the records' order and completeness,
  • the records' condition and costs to preserve them, and
  • the records' intrinsic value

Page 19: To Preserve Or Not To Preserve?

Open Problem 1: Appraisal by Visual Exploration

• How to visualize the transition from raw data to information?
  • Raw data (Byte stream) -> Information, e.g., 0F0 -> (R,G,B) -> GREEN
• How to encode and represent heterogeneous information for visual exploration and for computer-assisted operations?
  • Encoding (e.g., shape consisting of a set of Bezier curves is encoded by a set of straight lines)
  • Representation (e.g., colors are represented by an ordered sequence of intensity values from all bands)
• How to summarize representations for visual exploration?
  • Frequency of occurrence of primitives
  • Local and global summarizations
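The slide's "0F0 -> (R,G,B) -> GREEN" example can be sketched in a few lines. This is only an illustration: it assumes that "0F0" is a three-digit hexadecimal shorthand in which each digit expands to one 8-bit band, and the function names and the small color-name table are hypothetical, not part of the presented system.

```python
def decode_rgb_shorthand(code):
    """Expand a 3-digit hex string such as '0F0' into 8-bit (R, G, B) values.

    Assumption: each hex digit d stands for the byte dd (F -> FF), i.e. d * 17.
    """
    if len(code) != 3:
        raise ValueError("expected exactly three hex digits")
    return tuple(int(digit, 16) * 17 for digit in code)


def name_color(rgb):
    """Map an (R, G, B) triple to a human-readable name (tiny illustrative table)."""
    names = {
        (255, 0, 0): "RED",
        (0, 255, 0): "GREEN",
        (0, 0, 255): "BLUE",
    }
    return names.get(rgb, "UNKNOWN")


# Raw data -> representation -> information
rgb = decode_rgb_shorthand("0F0")
label = name_color(rgb)
```

The two steps are kept separate on purpose: the first is a representation (an ordered sequence of band intensities), the second is the interpretation an appraiser would actually read.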

Page 20: To Preserve Or Not To Preserve?

Example: Adobe Portable Document Format (PDF)

• Why PDF? - PDF is just an example of a container
  • Office environment (Adobe PDF, PS, MS Word, HTML, …)
  • Satellite measurements (HDF, netCDF, …)

[Figure labels: 3D; Adobe Library 6.0; Movie; Adobe Library 7.0]

Page 21: To Preserve Or Not To Preserve?

Exploration of PDF Documents Using PDF Viewer

• PDF Viewer presents information as a set of pages with their layouts
• PDF Viewer renders layers of internal objects (components) and hence only the top layer is visible

Page 22: To Preserve Or Not To Preserve?

Needed Exploration of PDF Components

• There is no support for archival appraisals that would include visual exploration of components in a document (a container of components)
• Needed viewers for appraisal analyses that present information stored in a container (e.g., PDF) as a set of components and their characteristics
  • Text – word frequency
  • Images (rasters) – color frequency (histogram)
  • Vector graphics – line frequency
• Exploration for appraisal analyses needs to include visible and invisible objects
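The three component summaries listed above (word frequency, color frequency, line frequency) could be computed along the following lines. This is a minimal sketch that assumes the components have already been extracted from the container; PDF parsing itself is not shown, and the function names and input shapes (plain text, a list of pixel tuples, a list of line segments) are illustrative assumptions.

```python
import re
from collections import Counter


def word_frequency(text, ignore=()):
    """Count word occurrences, skipping a user-supplied 'ignore' list."""
    ignored = {word.lower() for word in ignore}
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(word for word in words if word not in ignored)


def color_frequency(pixels):
    """Histogram of colors; each pixel is an (R, G, B) tuple."""
    return Counter(pixels)


def line_orientation_frequency(segments):
    """Count vertical/horizontal/other segments given ((x0, y0), (x1, y1)) pairs."""
    counts = Counter()
    for (x0, y0), (x1, y1) in segments:
        if y0 == y1 and x0 != x1:
            counts["horizontal"] += 1
        elif x0 == x1 and y0 != y1:
            counts["vertical"] += 1
        else:
            counts["other"] += 1
    return counts
```

The `ignore` parameter mirrors the "Ignore" words and "Ignore" colors controls that appear in the screenshots on the following slides.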

Page 23: To Preserve Or Not To Preserve?

Exploration of Text Components

[Screenshot labels: LOADED FILES; Occurrence of numbers; Occurrence of words; “Ignore” words]

Page 24: To Preserve Or Not To Preserve?

Exploration of Image Components

[Screenshot labels: LOADED FILES; List of images; Preview; Occurrence of colors; “Ignore” colors]

Page 25: To Preserve Or Not To Preserve?

Exploration of Vector Graphics Components

[Screenshot labels: LOADED FILES; Preview; Occurrence of v/h lines]

Page 26: To Preserve Or Not To Preserve?

Exploration of Visible And Invisible Objects

Objects intersected at the mouse click location

Page 27: To Preserve Or Not To Preserve?

Open Problem 2: Support of Appraisals by Enabling Comparisons

• How to compare containers with heterogeneous information?
  • Methodology
  • Metrics
  • Weighting factors for fusion
• How to quantify differences between the same type of information?
  • Encodings and Representations
  • Metrics
  • Local versus global differences

Page 28: To Preserve Or Not To Preserve?

Comparisons


Page 29: To Preserve Or Not To Preserve?

Methodology

[Diagram labels: Partial solutions in literature (Ref. CAPTCHA, …); Open problems; Relationship to Permanent Records]

Page 30: To Preserve Or Not To Preserve?

Experimental Example

INPUT = 10 PDF docs (4 & 6 Groups)

UNIQUE ID = 1,2,3,4
UNIQUE ID = 5,6,7,8,9,10

Page 31: To Preserve Or Not To Preserve?

Comparative Experimental Results

INPUT = 10 PDF docs (6 & 4 members in each Group)

[Figure panels: Vector-based similarity; Text-based similarity; Image-based similarity]

Page 32: To Preserve Or Not To Preserve?

Comparative Experimental Results

[Figure panel titles: Vector Graphics Similarity and Word Similarity Combined; Portion of Document Surface Allotted to Each Document Feature; Comparison Using Combination of Document Features in Proportion to Coverage]
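One way to read "combination of document features in proportion to coverage" is a weighted average in which each feature's similarity score is weighted by the share of the document surface it occupies. The sketch below assumes exactly that reading; the function name, the dictionary layout, and the example numbers are hypothetical and are not taken from the reported experiments.

```python
def fuse_similarities(similarity, coverage):
    """Coverage-weighted average of per-feature similarity scores.

    similarity: feature name -> score in [0, 1]
    coverage:   feature name -> portion of document surface (any positive scale)
    """
    total = sum(coverage.values())
    if total <= 0:
        raise ValueError("total coverage must be positive")
    return sum(similarity[name] * coverage[name] / total for name in coverage)


# Illustrative values only: text covers half of the page surface.
example_similarity = {"text": 1.0, "image": 0.5, "vector": 0.0}
example_coverage = {"text": 0.5, "image": 0.25, "vector": 0.25}
combined = fuse_similarities(example_similarity, example_coverage)
```

Normalizing by the total coverage keeps the fused score in [0, 1] even when the coverage values are given as areas rather than fractions.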

Page 33: To Preserve Or Not To Preserve?

Accuracy Comparisons

Method                     Average Similarity   Average Similarity   Average Similarity
                           of Group 1           of Group 2           Across Group 1 & 2
TEXT ONLY                  1                    0.489                0
TEXT & IMAGE & GRAPHICS    0.906                0.520                0.075

One refers to high similarity & zero refers to low similarity

Conclusions:
• Differences in similarity are up to 10% of the score
• Documents in Group 2 would likely be misclassified as 0.5 similarity would be the threshold between similar and dissimilar documents
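The two conclusions can be checked against the table with a few lines of arithmetic. The scores below are copied from the table; the 0.5 threshold is the one named in the conclusions. How the "up to 10%" figure was computed is not stated on the slide, so dividing by the larger of the two scores is an assumption, as are the function names.

```python
THRESHOLD = 0.5  # threshold between similar and dissimilar documents (from the slide)

# Average similarities from the table above.
scores = {
    "TEXT ONLY": {"group1": 1.0, "group2": 0.489, "across": 0.0},
    "TEXT & IMAGE & GRAPHICS": {"group1": 0.906, "group2": 0.520, "across": 0.075},
}


def is_similar(score, threshold=THRESHOLD):
    """Classify a similarity score against the threshold."""
    return score >= threshold


def relative_difference(a, b):
    """Difference between two scores relative to the larger one (assumed definition)."""
    return abs(a - b) / max(a, b)


group1_change = relative_difference(scores["TEXT ONLY"]["group1"],
                                    scores["TEXT & IMAGE & GRAPHICS"]["group1"])
group2_change = relative_difference(scores["TEXT ONLY"]["group2"],
                                    scores["TEXT & IMAGE & GRAPHICS"]["group2"])
```

Under these assumptions both changes stay below 10%, and the Group 2 average falls on opposite sides of the 0.5 threshold depending on which features are used (0.489 for text only, 0.520 for the combination).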

Page 34: To Preserve Or Not To Preserve?

Open Problem 3: Scalability of Appraisals

• Scalability of appraisals with increasing heterogeneity of information, dimensionality of data and volume of electronic records
  • How should appraisal process change as 3D data is added to file containers?
  • How should appraisal process change as 3D+time, 2D+spectrum, 3D+time+spectrum, nD, …
  • How should appraisal operations be designed to accommodate growing volume of electronic records?

Page 35: To Preserve Or Not To Preserve?

Approaches to Computational Scalability of Document Appraisals

• Options for parallel processing
  • message-passing interface (MPI)
    • MPI is designed for the coordination of a program running as multiple processes in a distributed memory environment by passing control messages.
  • open multi-processing (OpenMP)
    • OpenMP is intended for shared memory machines. It uses a multithreading approach where the master thread forks any number of slave threads.
  • Map Reduce parallel programming paradigm for commodity clusters
    • It lets programmers write a simple Map function and a Reduce function, which are then automatically parallelized without requiring the programmers to code the details of parallel processes and communications
• Specialized Hardware: FPGA, Cell processors, GPU
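To make the Map Reduce option concrete, here is a small word-count sketch in plain Python. It is not Hadoop code: the `run_map_reduce` driver below is a hypothetical, single-process stand-in for the framework, included only to show that the programmer supplies a Map function and a Reduce function while grouping and scheduling are left to the runtime.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_map_reduce(documents, mapper, reducer):
    """Serial stand-in for the framework: map, group by key, then reduce.

    A real framework would run mapper and reducer calls in parallel on
    different machines; the user-written functions stay the same.
    """
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_map_reduce(
    {"doc1": "records archive records", "doc2": "archive"},
    map_words,
    reduce_counts,
)
```

Because each mapper call sees one document and each reducer call sees one key, neither function needs to know how many machines are involved, which is the property the slide highlights.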

Page 36: To Preserve Or Not To Preserve?

Computational Requirements for Executing the Methodology

[Diagram: Yellow indicates computations; labels: Relationship to Permanent Records; Appraisal & Sampling]

Page 37: To Preserve Or Not To Preserve?

Hardware & Software Dependencies with Hadoop

• Test data: 15 PDF files from the Columbia investigation web site at http://caib.nasa.gov/.
• Software configuration: Linux OS (Ubuntu flavor) and the Hadoop implementation of Map and Reduce functionalities
• Hardware configuration: homogeneous & heterogeneous machines

[Figure: Hadoop Average Speed; y-axis: seconds (0 to 60); x-axis: #machines (1 to 5); panels: Homogeneous Hardware, Heterogeneous Hardware]

Page 38: To Preserve Or Not To Preserve?

Open Problems Related to Archival Decisions

• Simulate Preservation Costs as a Function of Information Granularity and Information Technology
• Optimal Utilization of Computational and Human Resources

Page 39: To Preserve Or Not To Preserve?

Open Problem: Archival Decision Support

• Decision support for forecasting preservation costs
  • How to predict computational and storage requirements of preservation as a function of technology variables and information granularity?
  • How to optimize computational hardware, software, storage, and networking investments?

Page 40: To Preserve Or Not To Preserve?

Basic Questions About Information to be Preserved

Page 41: To Preserve Or Not To Preserve?

Challenges in Forecasting

• Volatility of software/hardware/storage media
  • Updates: Windows operating systems since 2000: Two major new releases, two minor service pack updates, around fifty security patches since SP2
  • Upgrades: Microsoft Office Pro for Windows 95/98/ME/2000/XP/2003/2007
  • Media life expectancy: Optical ~ 5 years, Disk ~ 15 years, Microfiche ~ 100, microfilm ~ 300, newspaper ~ 50, clay tablet ~ 10,000 (life expectancy vs. information density – [P. Conway, 1996])
• Cost of software/hardware/storage media
  • Operating System: Windows 3.1/95/98/NT/2000/XP/Vista: Windows 95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista = $399 -> $319 (2008)
  • 128 MB of SDRAM: Year 1999 ~ $120 -> $40 -> $200-250 due to Earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96 (flash card) - www.pricewatch.com (1TB ~ $109.95 as of 01/15/2009)
  • High performance computers: 2006: DARPA awards approximately $500 million to Cray and IBM; 2007 NSF $200 million to NCSA/IBM
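The media life expectancies quoted above translate directly into how many generations of a medium are needed to keep a record for a given period. The sketch below uses the slide's approximate figures (read as years) and a simplifying assumption that a copy is replaced exactly when its medium reaches its life expectancy; the function name and the 100-year horizon are illustrative.

```python
import math

# Approximate life expectancies in years, as quoted on the slide.
MEDIA_LIFE_YEARS = {
    "optical": 5,
    "disk": 15,
    "newspaper": 50,
    "microfiche": 100,
    "microfilm": 300,
    "clay tablet": 10000,
}


def media_generations(horizon_years, life_years):
    """Number of successive copies needed to cover the retention horizon."""
    if horizon_years <= 0 or life_years <= 0:
        raise ValueError("years must be positive")
    return math.ceil(horizon_years / life_years)


# Generations of each medium needed to keep a record for 100 years.
generations_for_century = {
    medium: media_generations(100, life)
    for medium, life in MEDIA_LIFE_YEARS.items()
}
```

Under this assumption a century of retention takes 20 generations of optical media and 7 of disk, but a single microfiche, microfilm, or clay tablet, which is one way to see why media volatility dominates long-term cost forecasts.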

Page 42: To Preserve Or Not To Preserve?

Archival Decision Support

• Lack of forecasting models to predict preservation costs

• Our work: Understand the tradeoffs between information value and computational/storage costs by providing simulation frameworks
  • Information granularity, organization, compression, encryption, document format, ...
  • Versus
  • Cost of CPU for gathering information, for processing and for input/output operations; cost of storage media, upgrades, storage room, …
• Prototype simulation framework: Image Provenance To Learn available for downloading from http://isda.ncsa.uiuc.edu
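A toy version of the storage side of such a tradeoff is sketched below. It is not the Image Provenance To Learn framework: the function, the event counts, and the per-event sizes are hypothetical placeholders, and only the storage price is derived from a figure quoted earlier in the talk (1TB ~ $109.95 as of 01/15/2009).

```python
PRICE_PER_GB = 109.95 / 1000  # dollars per GB, from the 1TB price quoted in the talk


def storage_cost(event_counts, bytes_per_event, price_per_gb=PRICE_PER_GB):
    """Return (total bytes, dollars) for storing the given number of events.

    event_counts:    event name -> number of recorded events
    bytes_per_event: event name -> bytes saved per event at the chosen granularity
    """
    total_bytes = sum(
        event_counts[name] * bytes_per_event[name] for name in event_counts
    )
    return total_bytes, total_bytes / 1e9 * price_per_gb


# Hypothetical session: many small interaction events, a few large snapshots.
hypothetical_counts = {"Mouse Clicked": 1000, "New Image": 10}
hypothetical_sizes = {"Mouse Clicked": 200, "New Image": 1_000_000}
total_bytes, dollars = storage_cost(hypothetical_counts, hypothetical_sizes)
```

Rerunning the same calculation with different per-event sizes (for example, interpreted events versus raw data versus snapshots) is the simplest form of the cost versus information granularity comparison the slide describes.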

Page 43: To Preserve Or Not To Preserve?

Simulation Framework

[Diagram labels: Decision Maker; Information Gathering and Storage; Learning; Information Retrieval and Process Reconstruction; Provenance Information; Preservation; Value; Image Viewer; Information Gathering System; Process Reconstruction System; Cost / Information Granularity Analysis (plot of Value versus Cost (memory, CPU); curves: observed, linear)]

Page 44: To Preserve Or Not To Preserve?

Image Event Category Tracker

[Screenshot labels: Events; Summary of Events; Viewed Area; Storage; Time]

Page 45: To Preserve Or Not To Preserve?

Information Granularity


Page 46: To Preserve Or Not To Preserve?

Storage vs. Information Organization Tradeoffs: Test Case

• Information granules include interpreted, raw and snapshots
• Files were not compressed

[Figure: Saved Size per Event Name for the RDF and Key Pair metadata models; x-axis: Bytes (log scale, 1 to 10,000,000); event names: Mouse Clicked, Add Annotation, Change RGB Band, Change Gray Scale, Change Auto Zoom, Window Shown, Change Gamma, Window Hidden, Change Selection, Magnification, Window Created, Change Zoom Factor, Change Visible Region, New Image]

- RDF = Resource Description Framework Metadata Model
- Key pair = XML Metadata Model

Page 47: To Preserve Or Not To Preserve?

Open Problems Related to Automating Archival Processing for Preservation

1. Discovery of Relationships Among Electronic Records
2. Information Preserving Conversions of Electronic Records
3. Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records

Page 48: To Preserve Or Not To Preserve?

Open Problem 1: Discovering Relationships Among Files

• How should one establish relationships among electronic records coming from disparate sources or from the same source at multiple time instances?
  • How to extract metadata?
  • What ontology to use to represent the extracted metadata?
  • How to automate metadata extraction from multiple data types, e.g., 2D drawings and 3D CAD models?
  • How to discover relationships between electronic records corresponding to the same physical objects but different multidimensional observations?
• Need to Understand the Complexity of the Problem

Page 49: To Preserve Or Not To Preserve?

Metadata Extraction: Complexity & Size

The Crandon Mine Reports from 1981 till 2003
http://digicoll.library.wisc.edu/cgi-bin/EcoNatRes/EcoNatRes-idx?type=browse&scope=ECONATRES.CRANDONMINE

RDF triples extracted using Aperture and visualized using RDF-Gravity (red – edges, green – literal values, violet – properties)
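The idea of representing extracted metadata as triples and then looking for relationships can be illustrated without any RDF tooling. The sketch below uses plain Python tuples as a stand-in for RDF triples; it does not call Aperture or RDF-Gravity, and the function names, file names, and property names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def metadata_to_triples(subject, metadata):
    """Turn a metadata dictionary into (subject, property, value) triples."""
    return [(subject, prop, value) for prop, value in sorted(metadata.items())]


def related_records(triples):
    """Link every pair of subjects that share the same property value.

    Returns a set of (subject_a, subject_b, shared_property) tuples.
    """
    subjects_by_value = defaultdict(set)
    for subject, prop, value in triples:
        subjects_by_value[(prop, value)].add(subject)
    links = set()
    for (prop, _value), subjects in subjects_by_value.items():
        for a, b in combinations(sorted(subjects), 2):
            links.add((a, b, prop))
    return links


# Hypothetical metadata for two files.
triples = (
    metadata_to_triples("report_1981.pdf", {"creator": "Agency A", "year": 1981})
    + metadata_to_triples("report_2003.pdf", {"creator": "Agency A", "year": 2003})
)
links = related_records(triples)
```

Even this naive pairing grows quadratically with the number of records that share a value, which hints at the complexity and size concerns raised on this slide.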

Page 50: To Preserve Or Not To Preserve?

Relationships Among Multiple Data Types

• Example Data: Torpedo Weapon Retriever 841
  • 784 existing 2D image drawings and N>22 3D CAD models
• How to establish relationships among the 3D CAD models and 2D image drawings during a product lifecycle?

[Figure: Hypothetical Distribution of 3D CAD models for TWR 841]

Page 51: To Preserve Or Not To Preserve?

Understanding Challenges in Automation

[Diagram labels: OCR; Relationship Discovery; Descriptors (metadata); Representation]

Page 52: To Preserve Or Not To Preserve?

Open Problem 2: Conversions of Electronic Records

• Conversions of electronic records are needed because
  • Visual exploration depends on various software packages
  • Many formats are retired (deprecated) over time
  • A subset of formats is selected for preservation purposes
• How to measure the degree of information preservation when files are converted from format A to format B?
  • During conversions, information could be lost, added or modified
  • What is the importance of each byte, object, etc.?
• How to introduce a framework for measuring the quality of conversion and visualization software?
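One simple way to quantify "lost, added or modified" is to describe a file before and after conversion as a mapping from object identifiers to object content and then compare the two mappings. The sketch below assumes such mappings are available; the function name, the object identifiers, and the unweighted preservation ratio are illustrative choices, and the importance of each object (a question raised above) is deliberately left out.

```python
def compare_conversion(before, after):
    """Compare object inventories of a file before and after format conversion.

    before, after: object id -> comparable description of the object
    Returns the ids that were lost, added, or modified, plus the fraction of
    original objects that survived unchanged.
    """
    before_ids, after_ids = set(before), set(after)
    lost = before_ids - after_ids
    added = after_ids - before_ids
    modified = {k for k in before_ids & after_ids if before[k] != after[k]}
    preserved = len(before_ids) - len(lost) - len(modified)
    ratio = preserved / len(before_ids) if before_ids else 1.0
    return {"lost": lost, "added": added, "modified": modified,
            "preservation_ratio": ratio}


# Hypothetical inventories of a 3D model before and after a round trip.
original = {"mesh": 1200, "color": "red", "texture": "wood", "name": "part-7"}
converted = {"mesh": 1200, "color": "gray", "name": "part-7", "comment": "exported"}
report = compare_conversion(original, converted)
```

A weighted variant would multiply each object by an importance factor before computing the ratio, which is where the "importance of each byte, object" question enters.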

Page 53: To Preserve Or Not To Preserve?

Example: Conversion of X3D to STEP to X3D

[Diagram: conversion chain among X3D, WRL and STEP files; software labels: X3dToVrml97, A3D Reviewer, Vrml97ToX3d, “Nothing!”]

Page 54: To Preserve Or Not To Preserve?

Automation of 3D File Format Mapping & Conversion

Page 55: To Preserve Or Not To Preserve?

Open Problem 3: Sampling, Integrity and Authenticity

• Given finite resources and increasing amounts of electronic records, automation of sampling, integrity and authenticity verification is very much needed
• What are the criteria for sampling a collection of temporally changing versions of ‘the same’ document?
  • Authenticity
  • Integrity
  • Information content
• How to measure a degree of authenticity?
  • Computers might assign inaccurate time stamps to records
• How to detect integrity failures?
  • A record containing a female patient with prostate cancer
• How to incorporate constraints into sampling?
  • Storage space, compression computational cost, etc.

Page 56: To Preserve Or Not To Preserve?

Example: Temporal Ranking and Integrity Verification

• Chronological ranking based on time stamps of files
  • Last modification (current implementation)
• Ranking can be changed by a human
• Content referring to dates can be used for integrity verification

[Figure axis: TIME]
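Chronological ranking by last modification time can be sketched with the standard library alone. This is an illustration of the idea rather than the presented implementation; the function name is hypothetical, and, as noted on the previous slide, file system time stamps may be inaccurate, so the result is a starting order that a human can still change.

```python
import os


def rank_by_modification_time(paths):
    """Return the given file paths ordered from oldest to newest modification."""
    return sorted(paths, key=os.path.getmtime)
```

For example, `rank_by_modification_time(["v3.pdf", "v1.pdf", "v2.pdf"])` returns the three (hypothetical) versions in the order in which they were last modified on disk.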

Page 57: To Preserve Or Not To Preserve?

Rules and Attributes for Integrity Verification

• Document integrity attributes?
  • appearance or disappearance of document images
  • appearance and disappearance of dates embedded in documents
  • file size
  • count of image groups
  • number of sentences
  • average value of dates found in a document
• Rules?
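A few of the listed attributes can be computed from plain text, and a candidate rule can be expressed over them. The sketch below is an assumption-laden illustration: it treats four-digit numbers starting with 19 or 20 as embedded years, splits sentences on terminal punctuation, and proposes one hypothetical rule (content should not mention a year later than the file's time stamp). None of these choices come from the slides, which leave the rules as an open question.

```python
import re


def integrity_attributes(text):
    """Compute simple integrity attributes of a document's text."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "size_bytes": len(text.encode("utf-8")),
        "sentence_count": len(sentences),
        "years": years,
        "mean_year": sum(years) / len(years) if years else None,
    }


def violates_timestamp_rule(attributes, file_year):
    """Hypothetical rule: flag documents that mention a year after file_year."""
    return any(year > file_year for year in attributes["years"])


attributes = integrity_attributes("Report issued in 2003. Revised in 2008.")
```

Comparing these attributes across versions of 'the same' document (for instance, a date that appears in one version and disappears in the next) is the kind of check the attribute list points toward.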

Page 58: To Preserve Or Not To Preserve?

Summary

• Introduced a set of open problems related to
  • Appraisal of electronic records
  • Archival forecasting of preservation costs
  • Automation of processing for preservation
• Examples used for illustrating the open problems from our research just scratch the surface of some of the open problems

Page 59: To Preserve Or Not To Preserve?

Observations

• Many stakeholders are already aware of some of the open problems including government agencies and companies
• As all government agencies have been computerized, the continuity and functioning of the agencies depend on preservation and reconstruction of electronic records
• Right now, we are at the beginning of the exponential growth of electronic records (many more electronic records will be coming)
• Some scientific fields are already facing real time decisions about preserving electronic records (e.g., astronomers)

Page 60: To Preserve Or Not To Preserve?

Future Vision

• It is envisioned that the preservation and reconstruction of electronic records have to follow different paradigms that incorporate
  • Scalability (heterogeneity, dimensionality and volume)
  • Forecasting of preservation costs
  • New level of automation and quality control in processing for preservation purposes
• The field of electronic record management and preservation needs forward looking solutions to stay abreast with the dynamics of digital information

Page 61: To Preserve Or Not To Preserve?

References to Presented Research

• Bajcsy P., R. Kooper and S-C. Lee, “Understanding Preservation and Reconstruction Requirements for Computer Assisted Decision Processes,” ACM Journal on Computers and Cultural Heritage (JOCCH), (submitted October 2008).

• Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography Compass, Volume 2, Issue 6 (p 2040-2061), 2008, Blackwell Publishing Ltd, URL: http://www3.interscience.wiley.com/cgi-bin/fulltext/121478978/PDFSTART

• Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In: Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12, 2009; URL: https://www.novapublishers.com/catalog/product_info.php?products_id=8011

• McHenry K. and P. Bajcsy, “An Overview of 3D Data Content, File Formats and Viewers,” Technical Report NCSA-ISDA08-002, October 31, 2008.

• McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, “Advanced Information Systems for Archival Appraisals of Contemporary Documents,” the 4th IEEE International Conference on e-Science, December 8-12, 2008, Indianapolis, IN.

• Lee S-C., W. McFadden and P. Bajcsy, “Text, Image and Vector Graphics Based Appraisal of Contemporary Documents,” The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San Diego, CA.

• Bajcsy P. and S-C. Lee, “Computer Assisted Appraisal of Contemporary PDF Documents,” ARCHIVES 2008: Archival R/Evolution & Identities, 72nd Annual Meeting Pre-conference Programs: August 24-27, 2008, San Francisco, CA.

• Lee S-C. and P. Bajcsy, “Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical Decision Processes,” the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007 International Conference on Machine Learning and Application (ICMLA07), Cincinnati, Ohio, December 13-15, 2007.

• Bajcsy P. and D. Clutter, “Gathering and Analyzing Information about Decision Making Processes Using Geospatial Electronic Records,” the 2006 Winter Federation of Earth Science Information Partners (“Federation”) Conference, poster, January 4-6, 2006 in Washington, DC.

Page 62: To Preserve Or Not To Preserve?

Questions

• Project URL: http://isda.ncsa.uiuc.edu/NARA/index.html and http://isda.ncsa.uiuc.edu/CompTradeoffs/

• Publications – see our URL at http://isda.ncsa.uiuc.edu/publications

• Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu