
Page 1: Building a Reference Implementation for Long-Term Preservation · 2014-09-09

Building a Reference Implementation for Long-Term Preservation

Richard Marciano

Lead Scientist, Sustainable Archives & Library Technologies (SALT) lab
Director, Data Intensive Cyber Environment (DICE) group

[email protected]
http://www.DiceResearch.org

GOAL: Building a Preservation Reference Implementation (1/2)

• Go beyond preservation and access & address the full lifecycle:
  • Appraisal & Disposition
  • Accessioning
  • Arrangement
  • Description
  • Preservation policy enforcement
  • Preservation trustworthiness assessments

Developing Skills for Preservation

• SAA 2007 Summer Camp: week-long hands-on training. Topics included:
  • Electronic Records
  • Components of an eRecords program
    – Policies / Mandate
    – Technical Infrastructure
    – Social Infrastructure
  • Infrastructure Independence
  • Appraisal & Disposition
  • Accessioning
  • Arrangement
  • Description
  • Preservation
  • Access
  • Scalability

SAA e-Records Summer Camp 2007

Participant categories (diagram labels): Federal, State, Local, International, Non-profit, College, University, USER, Commercial

• Archives: CA, WA, NC
• Libraries: AZ

• US Navy

• City of Richmond Archives, Canada
• Sacramento Archives & Museum Collection Center

• Marist Brothers of Canada

• Cigna
• Ford Motor Company
• Preservation Partners

• National Fire Protection Association
• History Associates, Inc.

• UCSD Libraries, Spec. Coll.
• UC Irvine Spec. Coll.
• Princeton U., S.G.M. Man. Lib.
• U. San Diego, Copley Library
• Harvard Business School, B. Lib.
• University of New Mexico, Pol. Arch.
• U. of Texas at Arlington, Spec. Coll.
• Occidental College, Periodicals Dept.

• University of Illinois Urbana-Champaign
• University of Wisconsin-Madison

GOAL: Building a Preservation Reference Implementation (2/2)

• The reference implementation consists of:
  • the record management environment
  • the preservation management rules
  • the management processes that implement preservation services, and
  • the rules that verify compliance with assessment criteria.
• The resulting system can be shown to be provably correct

iRODS Collaborations

• Notre Dame University - porting of the Parrot interface on top of iRODS, unifies access to GridFTP, iRODS, file systems
• University of Texas, Austin - creation of a Common TeraGrid Software Stack kit for iRODS, simplifies installation of iRODS on TeraGrid sites
• Vanderbilt - integration of iRODS with LStore and Logistical Networking, integrates a distributed metadata catalog for the TDLC data grid
• MIT - integration of DSpace with iRODS, funded by NARA
• Fedora Commons - integration of Fedora with iRODS in support of NSDL
• Stanford - proposal to use LOCKSS reliability technology for guarantees on distributed iRODS rule base

iRODS Collaborations

• SHAMAN - integration of Cheshire and Multivalent browsing into iRODS micro-services for parsing data objects
• CASPAR - representation information for data and Trusted Repository Audit checklist assessment of iRODS rules
• James Cook University - porting of Python, Perl, PHP load libraries on iRODS
• UK ASPIS project - integration of Shibboleth authentication with iRODS
• ATOS - use of iRODS in Bibliotheque Nationale de France
• DEISA - use of iRODS in European supercomputer centers
• D-Grid - iRODS beta test site
• Archer - creation of preservation rules (TRAC)

iRODS Interested Parties

• Royal British Columbia Museum - iRODS rules for fixity
• Globus - integration of iRODS and GridFTP
• Aerospace Corp - data interoperability
• Merrill Lynch - rule-based data management
• University of York - DAME distributed analysis systems
• IBM - integration with object based storage devices
• SNIA - integration with XAM technology
• Mitre - support for real-time data streams
• JPL - Planetary Data System

iRODS Tutorials - 2008

• January 31 - SDSC
• April 8 - ISGC, Taipei
• May 13 - China, National Academy of Science
• May 27-30 - UK eScience, Edinburgh
• June 5 - OGF23, Barcelona
• July 7-11 - SAA, SDSC
• August 4-8 - SAA, SDSC
• August 25 - SAA, San Francisco

iRODS: the Latest Generation of Data Grids

Data Grids are middleware services
• Sitting between the applications and data providers
• Providing transparent and uniform access
• To diverse types of digital assets
  • Files, databases, streams, web, programs,…
  • Documents, images, data, sensor packets, tables,…
• From heterogeneous resources
  • File Systems, tape archives, sensor streams,…
• Distributed over a wide area network
  • Multiple administrative and security domains
• With users unaware of physical attributes of the data access
  • System addresses, paths, protocols,…
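
The idea that users stay unaware of system addresses, paths, and protocols can be illustrated with a small lookup from a logical name to a physical replica. This is only a sketch, not iRODS code: the catalog contents, resource names, and the `resolve` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    resource: str  # storage system name (hypothetical)
    protocol: str  # access protocol used by that resource
    path: str      # physical path on that resource


# Hypothetical catalog: logical name -> physical replicas.
CATALOG = {
    "/archive/census/1930.tif": [
        Replica("tape-archive", "hpss", "/hpss/vol7/a91f.dat"),
        Replica("disk-cache", "posix", "/data/cache/1930.tif"),
    ],
}

# Lower number = preferred resource; unknown resources sort last.
PREFERENCE = {"disk-cache": 0, "tape-archive": 1}


def resolve(logical_name: str) -> Replica:
    """Return the preferred physical replica for a logical name."""
    replicas = CATALOG.get(logical_name)
    if not replicas:
        raise KeyError(f"no replica registered for {logical_name}")
    return min(replicas, key=lambda r: PREFERENCE.get(r.resource, len(PREFERENCE)))
```

A client would call `resolve("/archive/census/1930.tif")` and never handle the tape path or protocol directly; moving the data to another resource changes only the catalog.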

Data Grids are Trust Relationships

• Data-level Trust
  • Virtualization for integrity, authenticity, access provision, availability, data and metadata organization and management, community ownership and curation
• User-level Trust
  • Virtualization of authentication, authorization, auditing and accounting
• Resource-level Trust
  • Virtualization of administration and maintenance, appropriation (quota), availability and accessibility
• These are Data Grid 1.0 level trusts

Data Grids are Trust Relationships

• Policy-level Trust
  • Virtualization of Management, Organizational and Community Rules
• Service-level Trust
  • Virtualization of Operations and Services
• Execution-level Trust
  • Virtualization of distributed, parallel, asynchronous, delayed and/or remote execution
• These are Data Grid 2.0 level trusts

User Base & Diversity of Applications

• Collections at SDSC:
  • 1+ PetaBytes, 170+ Million files
  • Multi-disciplinary Scientific Data
    – Astronomy, Cosmology
    – Neuro Science, Cell-Signalling & other Bio-medical Informatics
    – Environmental & Ecological Data
    – Educational (web) & Research Data (Chem, Phys,…)
    – Archival & Library Collections
    – Earthquake Data, Seismic Simulations
    – Real-time Sensor Data
  • Growing at 1TB a day
  • Supporting large projects: TeraGrid, NVO, SCEC, SEEK/Kepler, GEON, ROADNet, JCSG, AfCS, SIOExplorer, SALK, PAT, UCSD Library, …

What is iRODS?

• It is a data grid system – data virtualization
  • A distributed file system, based on a client-server architecture.
  • Allows users to access files seamlessly across a distributed environment, based upon their attributes rather than just their names or physical locations.
  • It replicates, syncs and archives data, connecting heterogeneous resources in a logical and abstracted manner.
• It is a distributed workflow system – policy/service virtualization
  • Policy can be coded as functions (micro-services)
  • Remote micro-services can be chained
  • The chains (workflows) are interpreted at run-time
  • The chains can be triggered on an event and condition (rules)
  • They can also be recursive.
  • Micro-services communicate through parameters, shared contexts, and out-of-band message queues.

Similar to SRB
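
The event, condition, and chain structure described above can be sketched in a few lines. This is an illustration in Python rather than the iRODS rule language; the `Rule` class, the `fire` function, and the two micro-services are hypothetical names chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]               # shared context passed along the chain
MicroService = Callable[[Context], None]  # a policy step coded as a function


def checksum(ctx: Context) -> None:
    # Record a SHA-256 digest of the ingested bytes in the shared context.
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def replicate(ctx: Context) -> None:
    # Stand-in for a real copy: just note where replicas would be placed.
    ctx["replicas"] = ["primary", "backup"]


@dataclass
class Rule:
    event: str                            # triggering event name
    condition: Callable[[Context], bool]  # guard evaluated at run time
    chain: List[MicroService]             # micro-services run in order


RULES = [
    Rule("put", lambda c: str(c["path"]).startswith("/archive/"), [checksum, replicate]),
    Rule("put", lambda c: True, [checksum]),  # fallback for all other puts
]


def fire(event: str, ctx: Context) -> Context:
    """Run the chain of the first rule whose event and condition match."""
    for rule in RULES:
        if rule.event == event and rule.condition(ctx):
            for service in rule.chain:
                service(ctx)
            break
    return ctx
```

Calling `fire("put", {"path": "/archive/a.txt", "data": b"x"})` runs both micro-services, while a put outside `/archive/` falls through to the checksum-only rule.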

Policy Virtualization with iRODS

• Micro-Services
  • Functions with well-defined semantics
  • Transactional - recovery
  • Context of application
  • Message Queues
• Rules
  • Triggered by events
  • Conditional execution of alternative rule declarations
  • System constructs: loops, recursion, branching
• Workflows
  • Distributed Execution
  • Immediate, Deferred, Periodic

[Diagram: a User Application with execution at SIO, MBARI, and Woods Hole]
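
The "Transactional - recovery" property can be pictured as pairing each micro-service with a recovery function that undoes it when a later step fails. The sketch below assumes that pairing; `run_chain` and the example steps are hypothetical and do not reproduce iRODS internals.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
# Each step is (action, recovery): recovery undoes the action's effect.
Step = Tuple[Callable[[Context], None], Callable[[Context], None]]


def run_chain(steps: List[Step], ctx: Context) -> bool:
    """Run actions in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, recovery in steps:
        try:
            action(ctx)
        except Exception:
            for undo in reversed(completed):
                undo(ctx)
            return False
        completed.append(recovery)
    return True


def register(ctx: Context) -> None:
    ctx["registered"] = True


def unregister(ctx: Context) -> None:
    ctx["registered"] = False


def failing_copy(ctx: Context) -> None:
    raise IOError("storage offline")  # simulated remote failure


def noop(ctx: Context) -> None:
    pass
```

With `[(register, unregister), (failing_copy, noop)]` the chain returns `False` and the catalog registration is rolled back, so a failed workflow leaves no half-applied state.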

Rule-based Data Management

• Administrator-controlled rules to implement management policies
  • Administrative - adding / deleting users, resources
  • Data ingestion - pre-processing, post-processing
  • Data transport / deletion - parallel I/O streams, disposition
  • Data retention policies – expiration, over-writes, versions
  • Data Reliability Policies – copies, formats, migration, checking,…

Distributed Management System

[Diagram components:]
• Rule Engine
• Execution Control
• Messaging System
• Execution Engine
• Virtualization
• Server Side Workflow
• Scheduling
• Policy Management
• Data Transport
• Metadata Catalog
• Persistent State information

Management Virtualization

• Examples of management policies
  • Integrity
    – Validation of checksums
    – Synchronization of replicas
    – Data distribution
    – Data retention
    – Access controls
  • Authenticity
    – Chain of custody - audit trails
    – Track required preservation metadata - templates
    – Generation of Archival Information Packages
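
Two of the integrity policies above, validation of checksums and synchronization of replicas, combine naturally: find a replica that still matches the recorded checksum and use it to repair the ones that do not. The following is a minimal in-memory sketch; replicas are modelled as byte strings keyed by a hypothetical site name, and SHA-256 is an assumed choice of digest.

```python
import hashlib
from typing import Dict, List


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def validate_and_sync(replicas: Dict[str, bytes], expected: str) -> List[str]:
    """Repair replicas whose checksum differs from the recorded one.

    Returns the names of the repaired replicas; raises ValueError when
    no replica matches, since nothing can serve as a repair source.
    """
    good = [name for name, data in replicas.items() if sha256(data) == expected]
    if not good:
        raise ValueError("no replica matches the recorded checksum")
    source = replicas[good[0]]
    repaired = []
    for name in replicas:
        if name not in good:
            replicas[name] = source  # overwrite the corrupted copy
            repaired.append(name)
    return repaired
```

The recorded checksum would normally come from the metadata catalog at ingest time; here it is simply passed in as `expected`.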

Rule-based Data Management

• Associate rules with combinations of name spaces
  • Rule set for a particular collection
  • Rule set for a particular user group
  • Rule set for a particular user group when accessing a particular collection
  • Rule set for a particular storage system
  • Rule set for a particular micro-service
  • Generic rules based on SRB operations
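
One way to picture rule sets attached to combinations of name spaces is a table keyed by (collection, user group, storage system), where an unset field acts as a wildcard and more specific combinations are applied first. The table contents, rule names, and the "most specific first" ordering are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Key fields: (collection, user group, storage system); None is a wildcard.
Key = Tuple[Optional[str], Optional[str], Optional[str]]

RULE_SETS: Dict[Key, List[str]] = {
    ("/archive/state", "archivists", None): ["require_description", "audit_access"],
    ("/archive/state", None, None): ["verify_checksum_on_ingest"],
    (None, "archivists", None): ["audit_access"],
    (None, None, "tape"): ["stage_before_read"],
    (None, None, None): ["default_access_control"],  # generic rules
}


def applicable_rules(collection: str, group: str, storage: str) -> List[str]:
    """Collect rule sets whose set fields all match, most specific first."""
    request = (collection, group, storage)
    matches = [
        (sum(field is not None for field in key), rules)
        for key, rules in RULE_SETS.items()
        if all(k is None or k == r for k, r in zip(key, request))
    ]
    matches.sort(key=lambda m: -m[0])  # stable sort keeps table order on ties
    ordered: List[str] = []
    for _, rules in matches:
        for rule in rules:
            if rule not in ordered:    # drop duplicates, keep first position
                ordered.append(rule)
    return ordered
```

For an archivist reading `/archive/state` from disk, the combined collection-and-group rule set is applied before the single-namespace and generic ones.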

TPAP (1/2)
National Archives and Records Administration
Transcontinental Persistent Archive Prototype

Federation of Seven Independent Data Grids, each with its own MCAT:
• U Md
• SDSC
• Georgia Tech
• NARA II
• NARA I
• Rocket Center
• U NC

Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.

TPAP (2/2)
Transcontinental Persistent Archive Prototype

• Distributed Data Management Concepts
  • Data virtualization
    – Storage system independence
  • Trust virtualization
    – Administration independence
• Risk mitigation
  • Federation of multiple independent data grids
    – Operation independence

PAT: Persistent Archives Testbed

[Diagram labels:]
• Kentucky Grid Brick
• SDSC Archive
• MCAT
• Michigan Grid Brick
• Minnesota Grid Brick
• Ohio Grid Brick
• SLAC Storage
• Local Storage Resources
• Shared Preservation Environment
• Metadata Catalog (Oracle)
• Archival Storage (HPSS, Sam-QFS)

PAT Project Participants

The Evolution of PAT: “archives on rules”
What the “DCP Center” project will try to do…

• Automate curation processes
  • e.g. design reusable curation workflows
• Enforce curation policies
  • e.g. enforce retention/disposition schedules
• Verify assertions about curation results
  • e.g. periodically verify checksums
  • e.g. parse audit trails to verify accesses
  • e.g. RLG/NARA Trustworthiness Assessment Criteria
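
Parsing audit trails to verify accesses could look roughly like the following. The pipe-delimited record format (`timestamp|user|action|path`) and the `AUTHORIZED` table are invented for this sketch; a real audit trail would have its own layout.

```python
from typing import Dict, List, Set

# Hypothetical policy: users allowed to touch each collection.
AUTHORIZED: Dict[str, Set[str]] = {"/archive/state": {"alice", "bob"}}


def unauthorized_accesses(lines: List[str]) -> List[str]:
    """Return audit records that touch a protected collection without rights."""
    violations = []
    for line in lines:
        record = line.strip()
        timestamp, user, action, path = record.split("|")
        for collection, users in AUTHORIZED.items():
            if path.startswith(collection + "/") and user not in users:
                violations.append(record)
    return violations
```

An empty result supports the assertion that every recorded access to the protected collection was authorized; any returned record is a candidate for follow-up.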

DCP: Distributed Custodial Preservation Center

Purpose:
Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services

Distributed partnership of 35 participants across 11 institutions:
* STATES:
  - California
  - Kansas
  - Michigan
  - Kentucky
  - North Carolina
  - New York
* UNIVERSITIES:
  - Tufts University
  - West Virginia University
* CULTURAL ENTITIES:
  - Getty Research Institute
* INTERNATIONAL PARTNERS:
  - Carleton University (Geomatics and Cartographic Research Centre)

Building a Preservation Reference Implementation

• Go beyond preservation and access
• SAA 2007 Summer Camp: week-long hands-on training. Topics included:
  • Electronic Records
  • Components of an eRecords program
    – Policies / Mandate
    – Technical Infrastructure
    – Social Infrastructure
• Address the full lifecycle:
  • Appraisal & Disposition
  • Accessioning
  • Arrangement
  • Description
  • Preservation policy enforcement
  • Preservation trustworthiness assessments

Preservation Reference Implementation (cont. 2/3)

• NARA ERA capabilities list, and the assessment criteria are based on the Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist.
• For each identified capability, the required operations are encapsulated in micro-services that are executed at the storage location, under the control of rules that implement the management policies needed to enforce TRAC criteria.

Preservation Reference Implementation (cont. 3/3)

• Rules are also defined that periodically query the system to verify compliance, and automate recovery procedures when problems are found.
• The reference implementation then consists of:
  • the record management environment
  • the preservation management rules
  • the management processes that implement preservation services, and
  • the rules that verify compliance with assessment criteria.
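
A rule that queries the system to verify compliance and then automates recovery might be sketched as a single verification pass that a scheduler runs on an interval. The minimum-replica policy, the resource names, and `verify_and_recover` are assumptions for illustration; the catalog is a plain dictionary standing in for persistent state.

```python
from typing import Dict, List

MIN_REPLICAS = 2                    # assumed policy threshold
RESOURCES = ["sdsc", "umd", "unc"]  # hypothetical storage resources


def verify_and_recover(catalog: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """One verification pass over the catalog.

    For every record holding fewer than MIN_REPLICAS copies, add copies on
    resources that do not yet hold one. Returns {record: resources added}.
    """
    actions: Dict[str, List[str]] = {}
    for record, held in catalog.items():
        added = []
        for resource in RESOURCES:
            if len(held) >= MIN_REPLICAS:
                break  # record is compliant, stop adding copies
            if resource not in held:
                held.append(resource)  # stand-in for a real replication
                added.append(resource)
        if added:
            actions[record] = added
    return actions
```

Running the pass twice in a row shows the intended behaviour: the first call repairs under-replicated records, and the second returns an empty dictionary because the catalog is now compliant.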

For More Information

Richard Marciano
DICE Group

[email protected]

http://www.DiceResearch.org