US GPO AIP Independence Test CS 496A – Senior Design Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong Faculty advisor: Dr. Russ Abbott GPO contact: Kate Zwaard


Post on 07-Jan-2016


Page 1: US GPO AIP Independence Test

US GPO AIP Independence Test

CS 496A – Senior Design

Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong

Faculty advisor: Dr. Russ Abbott

GPO contact: Kate Zwaard

Page 2: US GPO AIP Independence Test

Overview

Background
  US GPO
  FDsys
  Project objectives
  A note on deliverables
File formats (AIP)
  METS, MODS, and PREMIS
Hardware interface
XML parsing
Solution Strategy
Repositories
Testing
Conclusion

Page 3: US GPO AIP Independence Test

US GPO

The United States Government Printing Office (GPO) is in charge of producing and archiving documents for every branch of the federal government.

“The U.S Government Printing Office (GPO) provides publishing & dissemination services for the official & authentic government publications to Congress, Federal agencies, Federal depository libraries, & the American public.” (http://www.gpo.gov/about/)

Page 4: US GPO AIP Independence Test

FDsys

GPO is developing the Federal Digital System (FDsys), a new content management system (CMS) designed to manage all of its digital data.

“The U.S. Government Printing Office’s (GPO) Future Digital System (FDsys) will ingest, authenticate, preserve and provide access to digital content from all three branches of the U.S. Government. FDsys, which is in public beta testing, is intended to preserve digital content free from dependence on specific hardware or software.” (project description)

Page 5: US GPO AIP Independence Test

Project Objectives

“The objective of this project is to test whether the AIPs in FDsys are truly independent of the surrounding content management system. The CSULA team aims to either confirm or reject the claim that, with help from resources commonly available to the digital curation community, an interested party could fully reconstruct the archive using only the content data.”

Page 6: US GPO AIP Independence Test

Project Objectives

“GPO will supply a set of content data from its archival storage. This data will include content files, metadata files (in XML according to the standards referenced above), and METS binding files (in XML) that describe how all of the objects are related. The CSULA team will inspect the information and, using the METS standard, determine whether the information in XML is sufficient for a user to make sense of the data and ingest it to another repository. Because the data is stored in arbitrary folders, scripts would have to be written to assemble the content packages from the locations specified in the METS file.”
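As a rough illustration of the first step such a script might take, the sketch below reads a METS document and collects the file locations it references. It assumes the locations are recorded as `xlink:href` attributes on `FLocat` elements, which is the usual METS convention. The sample document, its folder names, and the class name `MetsFileLocator` are invented for this example and do not come from the FDsys data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class MetsFileLocator {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Invented sample; the paths are placeholders, not real FDsys locations.
    static final String SAMPLE =
        "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\""
        + " xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
        + "<mets:fileSec><mets:fileGrp>"
        + "<mets:file ID=\"f1\"><mets:FLocat LOCTYPE=\"URL\""
        + " xlink:href=\"folderA/doc.pdf\"/></mets:file>"
        + "<mets:file ID=\"f2\"><mets:FLocat LOCTYPE=\"URL\""
        + " xlink:href=\"folderB/mods.xml\"/></mets:file>"
        + "</mets:fileGrp></mets:fileSec>"
        + "</mets:mets>";

    /** Returns the xlink:href of every FLocat element, in document order. */
    static List<String> listFileLocations(String metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)));

            List<String> locations = new ArrayList<>();
            NodeList flocats = doc.getElementsByTagNameNS(METS_NS, "FLocat");
            for (int i = 0; i < flocats.getLength(); i++) {
                Element flocat = (Element) flocats.item(i);
                // getAttributeNS returns an empty string when the attribute is absent.
                String href = flocat.getAttributeNS(XLINK_NS, "href");
                if (!href.isEmpty()) {
                    locations.add(href);
                }
            }
            return locations;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse METS XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listFileLocations(SAMPLE));
    }
}
```

Each returned location would then have to be resolved against wherever the archive folders actually live on disk.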

Page 7: US GPO AIP Independence Test

Project Objectives

This project simulates FDsys breaking down due to some catastrophic attack or error.

We are attempting to categorize and reconstruct a set of sample data from FDsys outside the context of the actual CMS.

The only references we have available, other than the actual files in the archive, are publicly defined standards.

It is our hope that this project will help GPO improve the robustness of its file system.

Page 8: US GPO AIP Independence Test

A Note on Deliverables

This is not a typical computer science design project because our aim is not to design software. Instead, we will be conducting scripted tests on real data and forming conclusions based on the results.

Deliverables will most likely include:
  a written report of our findings and recommendations
  a reorganized version of the input data

Page 9: US GPO AIP Independence Test

AIP

Archival Information Package

Defines how a digital object and its associated metadata are packaged using XML-based files:
  METS (binding file)
  MODS
  PREMIS

Page 10: US GPO AIP Independence Test

METS Schema

XML file format

Seven major sections

Page 11: US GPO AIP Independence Test

METS Schema

5 Major Sections
  1) METS Header
  2) Descriptive Metadata
  3) Administrative Metadata
  4) File Section
  5) Structural Map

The remaining two of the seven sections are Structural Links and Behavior.
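Each of the five sections listed above corresponds to a child element of the METS root: `metsHdr`, `dmdSec`, `amdSec`, `fileSec`, and `structMap`. The sketch below counts how many of each appear in a document, as a quick way to see what a given binding file contains. The sample XML and the class name `MetsSectionCheck` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

class MetsSectionCheck {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String[] SECTIONS =
        {"metsHdr", "dmdSec", "amdSec", "fileSec", "structMap"};

    // Invented sample: two descriptive sections and no administrative section.
    static final String SAMPLE =
        "<mets xmlns=\"http://www.loc.gov/METS/\">"
        + "<metsHdr/><dmdSec ID=\"d1\"/><dmdSec ID=\"d2\"/>"
        + "<fileSec/><structMap><div/></structMap>"
        + "</mets>";

    /** Counts the direct children of the root that match each section name. */
    static Map<String, Integer> countSections(String metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)));

            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String section : SECTIONS) {
                counts.put(section, 0);
            }
            for (Node child = doc.getDocumentElement().getFirstChild();
                 child != null; child = child.getNextSibling()) {
                boolean isMetsElement = child.getNodeType() == Node.ELEMENT_NODE
                    && METS_NS.equals(child.getNamespaceURI());
                if (isMetsElement) {
                    // Names outside SECTIONS are ignored by computeIfPresent.
                    counts.computeIfPresent(child.getLocalName(), (k, v) -> v + 1);
                }
            }
            return counts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse METS XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countSections(SAMPLE));
    }
}
```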

Page 12: US GPO AIP Independence Test

MODS

A MODS file will be used to encode descriptive metadata.

A MODS file can be used as an extension schema to METS.

MODS consists of top-level elements that are mandatory, recommended, or optional.

Page 13: US GPO AIP Independence Test

MODS

Page 14: US GPO AIP Independence Test

PREMIS

A PREMIS file will be used to encode preservation metadata.

Preservation metadata consists of the following:
  Provenance
  Authenticity
  Preservation activity
  Technical environment
  Rights management

Page 15: US GPO AIP Independence Test

PREMIS

The PREMIS data model includes the following:
  Intellectual Entity
  Object Entity
  Event Entity
  Agent Entity
  Rights Entity*

Object, Event, and Agent Entities are described using mandatory and optional elements.

Page 16: US GPO AIP Independence Test

PREMIS

Page 17: US GPO AIP Independence Test

Hardware Interface

PC

External hard drive

Page 18: US GPO AIP Independence Test

XML Parsing

As described above, all metadata is in the form of XML files. Hence, using code to read XML files is integral to the project.

We plan to use the Java programming language for our scripting needs.
  Java API for XML Processing (JAXP): the standard Java library for handling XML.
  It provides several ways to represent XML, including an in-memory tree (DOM) and event streams (SAX and StAX).
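To make the difference between JAXP representations concrete, here is a minimal sketch that counts elements with a given tag name in two ways: once by building a DOM tree in memory, and once by reacting to SAX events as the parser streams through the input. The sample XML and the class name `JaxpComparison` are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class JaxpComparison {
    // Invented sample, not taken from the project data.
    static final String SAMPLE =
        "<record><title>Sample</title><note>one</note><note>two</note></record>";

    private static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** DOM: the whole document is loaded into a tree, then queried. */
    static int countWithDom(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(stream(xml));
            return doc.getElementsByTagName(tag).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** SAX: the parser calls back for each element; nothing is kept in memory. */
    static int countWithSax(String xml, String tag) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("DOM: " + countWithDom(SAMPLE, "note"));
        System.out.println("SAX: " + countWithSax(SAMPLE, "note"));
    }
}
```

DOM tends to be more convenient when a script needs to move around the document, while SAX suits very large files that should not be held in memory at once.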

Page 19: US GPO AIP Independence Test

Solution Strategy

The data submitted to us are AIPs, not SIPs (Submission Information Packages). Repository software cannot ingest AIPs, only SIPs. We must therefore write scripts that parse the AIPs and construct SIPs from the arbitrary file structure, then ingest those SIPs with repository software to create new AIPs.
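One piece of this strategy, gathering scattered files into a single staging folder, might be sketched as follows. This is only an outline under several assumptions: the list of relative paths is taken to have been read from the METS file already, files are flattened into one folder by name (so name collisions would need handling in practice), and what a particular repository expects in a SIP is not addressed. The class name `SipAssembler`, the `demo` method, and all folder names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SipAssembler {
    /**
     * Copies each referenced file from under archiveRoot into sipDir.
     * Returns the references that could not be found, so that gaps in
     * the archive are reported rather than silently skipped.
     */
    static List<String> assemble(Path archiveRoot, List<String> hrefs, Path sipDir) {
        List<String> missing = new ArrayList<>();
        Path root = archiveRoot.toAbsolutePath().normalize();
        try {
            Files.createDirectories(sipDir);
            for (String href : hrefs) {
                Path source = root.resolve(href).normalize();
                // Reject references that escape the archive root or do not exist.
                if (!source.startsWith(root) || !Files.isRegularFile(source)) {
                    missing.add(href);
                    continue;
                }
                Files.copy(source, sipDir.resolve(source.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return missing;
    }

    /** Builds a tiny throwaway archive in a temp folder and assembles it. */
    static String demo() {
        try {
            Path root = Files.createTempDirectory("aip");
            Files.createDirectories(root.resolve("folderA"));
            Files.write(root.resolve("folderA/doc.txt"),
                        "content".getBytes(StandardCharsets.UTF_8));
            Path sip = root.resolve("sip");
            List<String> missing = assemble(root,
                Arrays.asList("folderA/doc.txt", "folderC/missing.txt"), sip);
            return Files.exists(sip.resolve("doc.txt")) + " " + missing;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```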

Page 20: US GPO AIP Independence Test

Repositories

We have also looked into third-party repository software to help parse and organize data:
  DSpace, Fedora Commons, EPrints

Unfortunately, so far none of them seems ideal for the task.

Page 21: US GPO AIP Independence Test

Testing

After parsing and organizing the data, it will be important to perform checks to ensure that the reconstruction is accurate.
  We may send a preliminary report to GPO for verification.

The exact testing procedure is still undefined, as we haven’t had a chance to investigate the data in depth yet.
  Our goals should be clearer once we understand exactly what type of data we are dealing with.
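Although the procedure is not yet defined, one check that could plausibly be part of it is a fixity comparison: recompute a file's checksum and compare it with a recorded value. METS allows `CHECKSUM` and `CHECKSUMTYPE` attributes on file entries, but whether the supplied data actually carries them is an assumption here. The class name `FixityCheck` is hypothetical, and the input is a short literal string rather than a real content file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FixityCheck {
    /** Returns the lowercase hex digest of data under the named algorithm. */
    static String hexDigest(byte[] data, String algorithm) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown algorithm: " + algorithm, e);
        }
    }

    /** True when the recomputed digest equals the recorded one, ignoring case. */
    static boolean matches(byte[] data, String algorithm, String expectedHex) {
        return hexDigest(data, algorithm).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) {
        // For a real file the bytes could come from Files.readAllBytes(path);
        // a literal is used here to keep the sketch self-contained.
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(matches(data, "MD5", "5d41402abc4b2a76b9719d911017c592"));
    }
}
```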

Page 22: US GPO AIP Independence Test

Conclusion

Our thanks to Kate, Dr. Abbott, and Dr. Pamula for their support.