Architecting Extensible Digital Repository Services

Robert Chavez, Robert Dockins, Anoop Kumar, Matthew Mcvey, Ranjani Saigal, Nikolai Schwertner

Tufts University, Medford, MA

Fedora Users Conference, Rutgers University, May 13 2005



An Overview

Digital Collections at Tufts
Reasons for developing Tufts Digital Repository (TDR)
Some design requirements and goals
The TDR architecture and services
Applications that interface with TDR
– Tufts Digital Library
– VUE
Future Directions

A Brief History of Digital Collections at Tufts

Pre-existing Digital Projects/Libraries/Collections
– Perseus Digital Library
– Tufts University Science Knowledgebase (TUSK-Medicine)
– Artifact Image Library (Art History)
– Miscellaneous projects: Crime and Punishment, Faculty Publications, Faculty Datasets
– Many and varied content management systems

Digital Collections and Archives (DCA)
– Steward of the University's permanently valuable digital records and collections
– Many and varied digital collections
– University records

Why TDR?

Digital collections and materials are continually growing, with content being added in a variety of formats.

Original architectures and systems were not built to accommodate such expansion.

Original architectures and systems were not built to facilitate interoperability or sharing of resources.

Needed a university-wide digital repository that could manage the ever-increasing content while continuing to serve discipline-specific needs and leveraging existing and new tools and services.

Need for DCA to support digital data warehouse services and digital archival storage services for digital content of enduring value.

Who?

Digital Collections and Archives (DCA), Academic Technology (AT)

– partnered to create a digital repository and digital library application for managing content while supporting teaching and learning at the university.

Roles (a bit over-simplified):
– DCA: content developers, collection and deposit policy creators, managers of repository
– AT: content developers, applications and overall system architects and developers

Design Requirements

Persistence:
– Enforce unique persistent identifiers
– Manage identifiers for multiple projects
– Assurance that the data will be preserved and retrievable over time

Ingest:
– Enforce archival standards
– Ability to incorporate appraisal
– Automated ingest workflow

Management:
– Use of information packages to facilitate storage and dissemination
– Incorporate content models
– Rights/access management

Access/Interoperability:
– Digital resources should be accessible to multiple applications and systems
– Authorization policies must be enforced

Scalability

(Re)Usability
– Leverage existing and new tools and services

Requirements | System Services
Unique and persistent identification of materials | Naming Service
Adherence to the concept of Archival Information Packages (AIP) | Digital Object Provider (DOP) Service -- Fedora
Adherence to the concept of Submission Information Packages (SIP) | Drop Box, Ingestion Service
Adherence to the concept of Dissemination Information Packages (DIP) | DOP Service -- Fedora
Authentication and integrity checking | DOP Service, Ingestion Service
Dissemination | Disseminators, Caching Service, TDL, Search Service
Access | TDL and other applications

TDR Architecture

[Diagram. Components shown: Drop Box, Ingestion Service, Naming Service, Fedora Repository Service, Fedora Client, Indexing Service, Search Index, Search Service, Search Interface, Caching Service, Interfacing Services, Application Interfaces. Legend: P = Data Provider, A = Administrator, U = User. Arrows represent flow of data.]

Services of TDR

Component | Role
Drop Box and Ingestion Service | Validation, preprocessing, appraisal, transfer/deposit
Naming Service | Unique persistent identifiers (URNs) mapped to objects; management of URNs; management of repositories; mapping between existing URN schemas and the Fedora schema
Fedora Repository Service | Management and access framework for digital objects
Indexing and Search Services | Metadata and full-text index creation; search API and application
Bridge Services | Provides mechanisms for external applications to interface with the repository

Current System Architecture

TDL Application

How it all fits together: a working application
– http://dl.tufts.edu

General TDL application search transaction process

[Diagram. Components shown, with the implementing technology in brackets; U marks the user.]
– TDL App: Search Interface [JSP]
– Search Service: Oracle Query Builder [Java App.]
– Search Index: XML index [Oracle]
– Search Index: Main Index [Oracle]
– Search Service: Results Collation [Java App.]
– TDL App: Search Results [Search Interface]
– Naming Service: URN-PID resolution [MySQL]
– Repository Service: Object Dissemination [Fedora]
– TDL App: Disseminator Viewer [JSP]
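One plausible reading of this diagram is a chain: a query goes to the Search Service, matching URNs are resolved to Fedora PIDs by the Naming Service, and the Repository Service disseminates each object for display. The Python sketch below illustrates that chain only; the ordering is an assumption, the dictionaries are in-memory stand-ins for Oracle, MySQL, and Fedora, and every name and sample record is hypothetical.

```python
# Hypothetical in-memory stand-ins for the components named in the diagram.
SEARCH_INDEX = {            # stands in for the Oracle-backed index: term -> URNs
    "boston": ["tufts:demo:coll:1", "tufts:demo:coll:2"],
    "harbor": ["tufts:demo:coll:2"],
}
URN_TO_PID = {              # stands in for the MySQL-backed Naming Service
    "tufts:demo:coll:1": "demo:101",
    "tufts:demo:coll:2": "demo:102",
}
OBJECT_LABELS = {           # stands in for a Fedora dissemination such as getLabel
    "demo:101": "Map of Boston",
    "demo:102": "Boston Harbor photograph",
}


def search(term):
    """Search Service step: return the URNs indexed under a term."""
    return SEARCH_INDEX.get(term.lower(), [])


def resolve(urn):
    """Naming Service step: map a URN to a repository PID."""
    return URN_TO_PID[urn]


def disseminate(pid):
    """Repository Service step: fetch a displayable value for the object."""
    return OBJECT_LABELS[pid]


def search_transaction(term):
    """Run the whole chain and collate (URN, PID, label) rows for display."""
    rows = []
    for urn in search(term):
        pid = resolve(urn)
        rows.append((urn, pid, disseminate(pid)))
    return rows


if __name__ == "__main__":
    for row in search_transaction("Boston"):
        print(row)
```

Keeping URN resolution as its own step means the application never stores Fedora PIDs, so objects could move between repositories without breaking search results.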

TDL Architecture

Drop Box and Ingestion Service
Naming Service
Fedora Repository Service at Tufts
Indexing and Search Services
Interfacing Services

Drop Box and Ingestion Service

Automate the process of preparing materials for ingest
Validate materials before ingest
Primarily for large-scale ingests
Not an object factory (i.e., not a tool for building individual objects)


Naming Service

Assigns, reserves and resolves URNs
– The URN has a very flexible structure that can be tailored to suit the special needs of a particular naming convention.
– Example: namespace1:namespace2:namespace3:object_id

Manages repositories
– Multiple production repositories, backup repositories, etc.

Tufts URN format examples
– tufts:dca:central:MS102:33.1345
– Perseus:text:1999.04.000697.5224.77-1729-47

URN Properties
– Provides a unique ID to objects deposited into the repository
– Service assures resolution to a unique resource

Implementation
– MySQL, Java class, JSP management console

Tufts Naming Service
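The assign/reserve/resolve cycle described for the Naming Service can be illustrated with a small in-memory registry. This is a Python sketch under stated assumptions, not the Tufts implementation (which the slides describe as MySQL plus a Java class): the class name, method names, error handling, and sample URN and PID are all hypothetical, and the only structure assumed for a URN is colon-separated namespaces followed by an object ID.

```python
class NamingService:
    """Toy registry that assigns, reserves, and resolves colon-delimited URNs."""

    def __init__(self):
        self._pids = {}  # URN -> PID, or None while the URN is only reserved

    @staticmethod
    def parse(urn):
        """Split a URN into (list of namespaces, object_id)."""
        parts = urn.split(":")
        if len(parts) < 2 or any(part == "" for part in parts):
            raise ValueError(f"malformed URN: {urn!r}")
        return parts[:-1], parts[-1]

    def reserve(self, urn):
        """Claim a URN before its object has been deposited."""
        self.parse(urn)
        if urn in self._pids:
            raise KeyError(f"URN already in use: {urn}")
        self._pids[urn] = None

    def assign(self, urn, pid):
        """Bind a new or reserved URN to a repository PID."""
        self.parse(urn)
        if self._pids.get(urn) is not None:
            raise KeyError(f"URN already assigned: {urn}")
        self._pids[urn] = pid

    def resolve(self, urn):
        """Return the PID bound to a URN, or raise if none is bound yet."""
        pid = self._pids.get(urn)
        if pid is None:
            raise LookupError(f"no object bound to {urn}")
        return pid


if __name__ == "__main__":
    names = NamingService()
    names.reserve("tufts:demo:collection:0001")         # hypothetical URN
    names.assign("tufts:demo:collection:0001", "demo:1")  # hypothetical PID
    print(names.resolve("tufts:demo:collection:0001"))
```

Refusing to rebind an assigned URN is one simple way to honor the requirement that a URN always resolves to a unique resource.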


Fedora Repository Service

Fedora met many of our critical needs:
– Modular nature of the repository service
– Management of digital content over time (versioning, etc.)
– Aggregation of mixed, possibly distributed, data into complex objects
– The ability to specify multiple content disseminations of these objects
– The ability to associate rights management schemes with these disseminations

Fedora Repository Service, cont…

Tufts Implementation Details:
– External data stores
– Modeling behaviors and content
– Piece of a larger architecture; not an out-of-the-box solution

Tufts Repository Models/Policies
– Fedora @ Tufts serves several purposes

Archival/institutional repository
– Guarantee functional preservation

Data warehouse
– Guarantee bitstream preservation

Active Repository
– Active workspace; constantly updated content (e.g. faculty data sets, faculty pubs, content mapping)

Behavior Definitions

Atomic units: sets of standardized behaviors

Building blocks of content models

Allow for flexible reuse of data

Contributes to inter-repository sharing of objects

Dissemination of standard output: XML, plain text, binary format

Rendering/processing of disseminations is the responsibility of applications implemented over the repository.

BDefs | Methods
tuftsAssetDef | getPreview, getLabel, getDescription, getFullView, getDefaultContent, getDescMetadata, getAdminMetadata
tuftsText | getTOC, getChuckList, getChunk, getHeader
tuftsBasicImage | getThumbnail, getScreensize, getMaxSize, getDynamicView

Content Models

Unique content models built from content modeling components.

Digital Objects that subscribe to a given content model inherit all methods established by a particular behavior.

Digital objects can subscribe to content models that suit their type or class.

Functional, not presentation-specific
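The relationship between behavior definitions and content models can be sketched as plain data. In the Python sketch below, the behavior definition and method names are copied from the BDefs listing earlier in the deck, while the two content model names and their composition are hypothetical, chosen only to show how an object inherits methods by subscribing to a model.

```python
# Behavior definitions and their methods, as listed in the BDefs slide.
BDEFS = {
    "tuftsAssetDef": ["getPreview", "getLabel", "getDescription", "getFullView",
                      "getDefaultContent", "getDescMetadata", "getAdminMetadata"],
    "tuftsText": ["getTOC", "getChuckList", "getChunk", "getHeader"],
    "tuftsBasicImage": ["getThumbnail", "getScreensize", "getMaxSize",
                        "getDynamicView"],
}

# Hypothetical content models: each is just a list of behavior definitions.
CONTENT_MODELS = {
    "textDocument": ["tuftsAssetDef", "tuftsText"],
    "basicImage": ["tuftsAssetDef", "tuftsBasicImage"],
}


def methods_for(content_model):
    """Return every method an object inherits by subscribing to a content model."""
    methods = []
    for bdef in CONTENT_MODELS[content_model]:
        methods.extend(BDEFS[bdef])
    return methods


def supports(content_model, method):
    """Tell an application whether it may call a method on such an object."""
    return method in methods_for(content_model)


if __name__ == "__main__":
    print(methods_for("basicImage"))
    print(supports("textDocument", "getThumbnail"))
```

Because both hypothetical models include tuftsAssetDef, an application can call methods such as getPreview on any object without knowing its type, which is the reuse the behavior definitions are meant to allow.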

Implementation Challenges

Processing large (>10MB) XML Documents
– XML databases

Processing large images
– Imaging servers

Streaming Media
GIS data
Modeling Collections
Advanced Searching
“Shopping cart” searching
Caching Disseminations


Indexing and Search Service

Indexing
– Digital objects piped through from ingestion service
– Metadata index
– Full-text index
– Specialized XML index

Implementation
– Java indexing application
– Oracle database

Supported Types of Search
– Basic full-text
– Basic metadata
– Advanced metadata

Accessing the service
– HTTP GET/POST
– SOAP
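Access over HTTP GET amounts to encoding the search type and query into a URL. The Python sketch below shows only that encoding step; the endpoint URL and the parameter names (q, type, max) are hypothetical, since the slides do not document the actual search API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; the real search service URL is not given in the slides.
SEARCH_ENDPOINT = "http://repository.example.edu/search"


def build_search_url(query, search_type="fulltext", max_results=20):
    """Encode a search request as an HTTP GET URL (parameter names are assumed)."""
    params = {"q": query, "type": search_type, "max": max_results}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_search_url("boston harbor", search_type="metadata")
    print(url)
    # Decode the query string again to show the round trip.
    print(parse_qs(urlsplit(url).query))
```

The same parameters could be sent as a form-encoded POST body when a query is too long for a URL.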


Interfacing Services

An important design requirement for TDR was to allow current digital library applications to easily interface with TDR and provide access to the content in the digital repository within their own environments in a seamless fashion.

Current applications like VUE can interface with this service to allow their tools to disseminate the content that resides in TDL.

The service is being designed not only to support current applications but also to accommodate the needs of future, yet-to-be-defined applications such as course management systems, learning tools, and portals.

Fedora OKI Bridge

Fedora | OKI
PID | Shared.Id
DR | DigitalRepository
FedoraObject | Asset
FedoraObjectIterator | AssetIterator
BehaviorInfoStructure | InfoStructure
Behavior | InfoRecord
DisseminationInfoPart | InfoPart
Dissemination | InfoField
ParameterInfoPart | InfoPart
Parameter | InfoField
DataOutputStreamInfoPart | InfoPart
DataOutputStream(MIMETypeStream) | InfoField
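The bridge mapping above can be treated as a lookup table that tells a developer which OKI type wraps each Fedora-side concept. The Python sketch below records the pairs exactly as listed on the slide; the two helper functions are hypothetical conveniences and not part of the bridge itself.

```python
# Fedora-side concept -> OKI type, copied from the Fedora OKI Bridge slide.
FEDORA_TO_OKI = {
    "PID": "Shared.Id",
    "DR": "DigitalRepository",
    "FedoraObject": "Asset",
    "FedoraObjectIterator": "AssetIterator",
    "BehaviorInfoStructure": "InfoStructure",
    "Behavior": "InfoRecord",
    "DisseminationInfoPart": "InfoPart",
    "Dissemination": "InfoField",
    "ParameterInfoPart": "InfoPart",
    "Parameter": "InfoField",
    "DataOutputStreamInfoPart": "InfoPart",
    "DataOutputStream(MIMETypeStream)": "InfoField",
}


def oki_type(fedora_concept):
    """Return the OKI type that represents a Fedora-side concept."""
    return FEDORA_TO_OKI[fedora_concept]


def fedora_concepts_for(oki_name):
    """List the Fedora-side concepts that share one OKI type, in slide order."""
    return [fedora for fedora, oki in FEDORA_TO_OKI.items() if oki == oki_name]


if __name__ == "__main__":
    print(oki_type("FedoraObject"))
    print(fedora_concepts_for("InfoPart"))
```

The reverse lookup makes visible that disseminations, parameters, and output streams all reuse the same two OKI types, InfoPart and InfoField.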

Applications Accessing TDR Content

Tufts Digital Library Application
– http://dl.tufts.edu/

Visual Understanding Environment (VUE)
– http://vue.tccs.tufts.edu/

VUE Overview

Learning Theories
– Constructivism
– Active Learning
– Individualized Learning

Support
– Faculty needs
– Learners' needs

Extend
– Digital Libraries
– OKI Standards

Technical Infrastructure

[Diagram. Labels shown: VUE, OKI DR API, DigitalRepository, OKI-FEDORA Bridge, FEDORA, DR Implementations.]

Future Directions

Revised search service (Zebra?)
XML database for metadata and XML objects (eXist)
Customization and enhancement to address a wide variety of needs (e.g. University Records)
Object factory: a workbench for building certain classes of objects
Automated browsing service for the repository
Authentication and authorization modules
Asset Definitions
Collection Modeling
Federation

Asset Definitions

The purpose of the Fedora Asset Definition is to define and expose the content types and methods of objects/assets in a repository in a standard way. The goal is to facilitate access between applications and digital repositories, between one digital repository and another, etc.

Some of the questions that we asked ourselves during our repository and application development helped us form the concept of an “Asset Definition.” For example:

How can an application find out which objects/assets exist within a particular repository, and how does it refer to them?

If one has an object/asset in a repository, how does one describe it so that other applications can understand what they can do with it?

Asset Definitions, cont…

getFullAssetDefintion
getPreview
getDescription
getFullView
getDefaultContent
getDescMetadata
getAdminMetadata
getThumbnail
getScreenSize
getMaxSize
getDynamicView

Collection Modeling


Object Relationships
– Extend Fedora RDF to create collection networks
– Recursive disseminators to track paths in the network
– Facilitate access to sets of materials
– Facilitate management of digital objects
– Facilitate browsing of sets of materials
– http://nikolai.tccs.tufts.edu:1980/fedora/get/demo:collectionAll/demo:Collection/viewMembers/
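The idea of recursively tracking paths through a collection network, and the shape of the example link, can be sketched in Python. The membership graph below is hypothetical sample data, and the recursive walk is an illustration of the concept, not the Tufts disseminator; the URL helper simply reproduces the base/get/PID/behavior-definition-PID/method pattern visible in the link above.

```python
def dissemination_url(base, pid, bdef_pid, method):
    """Build a URL shaped like the example link: <base>/get/<PID>/<bDef PID>/<method>/."""
    return f"{base.rstrip('/')}/get/{pid}/{bdef_pid}/{method}/"


# Hypothetical collection network: collection PID -> direct member PIDs.
MEMBERS = {
    "demo:collectionAll": ["demo:collectionA", "demo:collectionB"],
    "demo:collectionA": ["demo:obj1", "demo:obj2"],
    "demo:collectionB": ["demo:obj2", "demo:collectionA"],
}


def all_members(pid, seen=None):
    """Recursively collect everything reachable from a collection, once each."""
    seen = set() if seen is None else seen
    found = []
    for member in MEMBERS.get(pid, []):
        if member in seen:      # already visited: avoids duplicates and cycles
            continue
        seen.add(member)
        found.append(member)
        found.extend(all_members(member, seen))
    return found


if __name__ == "__main__":
    print(all_members("demo:collectionAll"))
    print(dissemination_url("http://nikolai.tccs.tufts.edu:1980/fedora",
                            "demo:collectionAll", "demo:Collection", "viewMembers"))
```

The visited set matters because a network, unlike a strict hierarchy, lets an object belong to several collections and lets collections reference one another.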