Architecting Extensible Digital Repository Services
Robert Chavez, Robert Dockins, Anoop Kumar, Matthew Mcvey, Ranjani Saigal, Nikolai Schwertner
Tufts University, Medford, MA
Fedora Users Conference, Rutgers University, May 13 2005
An Overview
Digital Collections at Tufts
Reasons for developing the Tufts Digital Repository (TDR)
Some design requirements and goals
The TDR architecture and services
Applications that interface with TDR
– Tufts Digital Library
– VUE
Future Directions
A Brief History of Digital Collections at Tufts
Pre-existing Digital Projects/Libraries/Collections
– Perseus Digital Library
– Tufts University Science Knowledgebase (TUSK-Medicine)
– Artifact Image Library (Art History)
– Miscellaneous projects: Crime and Punishment, Faculty Publications, Faculty Datasets
– Many and varied content management systems
Digital Collections and Archives (DCA)
– Steward of the University's permanently valuable digital records and collections
– Many and varied digital collections
– University records
Why TDR?
Digital collections and materials are continually growing, with content being added in a variety of formats.
Original architectures and systems were not built to accommodate such expansion.
Original architectures and systems were not built to facilitate interoperability or sharing of resources.
Needed a university-wide digital repository that could manage the ever-increasing content while continuing to serve discipline-specific needs and leveraging existing and new tools and services.
Need for DCA to support digital data warehouse services and digital archival storage services for digital content of enduring value.
Who?
Digital Collections and Archives (DCA) and Academic Technology (AT)
– Partnered to create a digital repository and digital library application for managing content while supporting teaching and learning at the university.
Roles (a bit over-simplified):
– DCA: content developers, collection and deposit policy creators, managers of repository
– AT: content developers, applications and overall system architects and developers
Design Requirements
Persistence:
– Enforce unique persistent identifiers
– Manage identifiers for multiple projects
– Assurance that the data will be preserved and retrievable over time
Ingest:
– Enforce archival standards
– Ability to incorporate appraisal
– Automated ingest workflow
Management:
– Use of information packages to facilitate storage and dissemination
– Incorporate content models
– Rights/access management
Access/Interoperability:
– Digital resources should be accessible to multiple applications and systems
– Authorization policies must be enforced
Scalability
(Re)Usability
– Leverage existing and new tools and services
Requirements | System Services
Unique and persistent identification of materials | Naming Service
Adherence to the concept of Archival Information Packages (AIP) | Digital Object Provider (DOP) Service -- Fedora
Adherence to the concept of Submission Information Packages (SIP) | Drop Box, Ingestion Service
Adherence to the concept of Dissemination Information Packages (DIP) | DOP Service -- Fedora
Authentication and integrity checking | DOP Service, Ingestion Service
Dissemination | Disseminators, Caching Service, TDL, Search Service
Access | TDL and other applications
TDR Architecture
[Diagram components: Drop Box, Ingestion Service, Naming Service, Fedora Repository Service, Indexing Service, Search Index, Search Service, Caching Service, Interfacing Services, Search Interface, Application Interfaces, Fedora Client]
[Diagram legend: P - Data Provider, A - Administrator, U - User. Arrows represent flow of data.]
Services of TDR
Component | Role
Drop Box and Ingestion Service | Validation, Preprocessing, Appraisal, Transfer/Deposit
Naming Service | Unique persistent identifiers (URNs) mapped to objects; management of URNs; management of repositories; mapping between existing URN schemas and the Fedora schema
Fedora Repository Service | Management and access framework for digital objects
Indexing and Search Services | Metadata and full-text index creation; search API and application
Bridge Services | Provide mechanisms for external applications to interface with the repository
TDL Application
How it all fits together: a working application
– http://dl.tufts.edu
General TDL application search transaction process
[Diagram components; U marks the user:]
– TDL App: Search Interface [JSP]
– Search Service: Oracle Query Builder [Java App.]
– Search Index: XML index [Oracle]
– Search Index: Main Index [Oracle]
– Search Service: Results Collation [Java App.]
– TDL App: Search Results [Search Interface]
– Naming Service: URN-PID resolution [MySQL]
– Repository Service: Object Dissemination [Fedora]
– TDL App: Disseminator Viewer [JSP]
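One way to read the search transaction is as a pipeline: query the indexes, collate the hits, resolve a chosen URN to a repository PID, then request a dissemination. The Python sketch below illustrates that reading only. The ordering of the steps is an assumption. The dictionaries are in-memory stand-ins for the Oracle indexes, the MySQL naming table and Fedora. All function names, URNs and PIDs are hypothetical.

```python
# Hypothetical in-memory stand-ins for the real back ends.
METADATA_INDEX = {"urn:a": "roman coins catalogue", "urn:b": "greek vases"}
FULLTEXT_INDEX = {"urn:b": "a vase showing roman soldiers", "urn:c": "field notes"}
URN_TO_PID = {"urn:a": "tufts:1", "urn:b": "tufts:2", "urn:c": "tufts:3"}
REPOSITORY = {"tufts:1": "<coins/>", "tufts:2": "<vases/>", "tufts:3": "<notes/>"}


def query_index(index, term):
    """Return the URNs whose indexed text contains the term."""
    return [urn for urn, text in index.items() if term.lower() in text.lower()]


def collate(*result_lists):
    """Merge hits from several indexes, dropping duplicates, keeping order."""
    seen, merged = set(), []
    for hits in result_lists:
        for urn in hits:
            if urn not in seen:
                seen.add(urn)
                merged.append(urn)
    return merged


def search(term):
    """Query both indexes and return one collated list of URNs."""
    return collate(query_index(METADATA_INDEX, term),
                   query_index(FULLTEXT_INDEX, term))


def disseminate(urn):
    """Resolve a URN to its PID, then fetch the object's content."""
    pid = URN_TO_PID[urn]
    return REPOSITORY[pid]
```

In this sketch the application only ever handles URNs. The PID stays an internal detail between the naming step and the repository step.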
TDL Architecture
Drop Box and Ingestion Service
Naming Service
Fedora Repository Service at Tufts
Indexing and Search Services
Interfacing Services
Drop Box and Ingestion Service
Automate the process of preparing materials for ingest
Validate materials before ingest
Primarily for large-scale ingests
Not an object factory (i.e., not a tool for building individual objects)
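As an illustration of validating a batch before ingest, here is a minimal Python sketch. The slides do not describe the actual checks, so the required fields, the `sha256` checksum field and both function names are assumptions made for the example.

```python
import hashlib

# Hypothetical required fields for one submitted item.
REQUIRED_FIELDS = ("urn", "title", "content")


def check_item(item):
    """Return a list of problems found in one submitted item (empty if none)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not item.get(field)]
    content = item.get("content") or b""
    expected = item.get("sha256")
    # Integrity check: compare the supplied checksum with one computed here.
    if expected and hashlib.sha256(content).hexdigest() != expected:
        problems.append("checksum mismatch")
    return problems


def validate_batch(items):
    """Split a batch into items ready for ingest and items needing attention."""
    accepted, rejected = [], []
    for item in items:
        problems = check_item(item)
        if problems:
            rejected.append((item.get("urn"), problems))
        else:
            accepted.append(item)
    return accepted, rejected
```

Checking the whole batch up front suits large-scale ingests, because problems are reported before anything is deposited.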
Naming Service
Assigns, reserves and resolves URNs
– The URN has a very flexible structure that can be tailored to suit the special needs of a particular naming convention.
– Example: namespace1:namespace2:namespace3:object_id
Manages repositories
– Multiple production repositories, backup repositories, etc.
Tufts URN format examples
– tufts:dca:central:MS102:33.1345
– Perseus:text:1999.04.000697.5224.77-1729-47
URN Properties
– Provides a unique ID for objects deposited into the repository
– Service assures resolution to a unique resource.
Implementation
– MySQL, Java class, JSP management console
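The assign, reserve and resolve operations can be sketched in a few lines of Python. This is an illustration, not the Tufts implementation (which the slides describe as MySQL, a Java class and a JSP console). The class, its method names and the PID `tufts:42` are hypothetical, and a dictionary stands in for the database.

```python
class NamingService:
    """Toy URN registry; a dict stands in for the database tables."""

    def __init__(self):
        self._pids = {}  # URN -> repository PID, or None while only reserved

    @staticmethod
    def parse(urn):
        """Split a URN into its namespace components and trailing object ID."""
        *namespaces, object_id = urn.split(":")
        if not namespaces or not object_id:
            raise ValueError(f"malformed URN: {urn!r}")
        return namespaces, object_id

    def reserve(self, urn):
        """Hold a URN so nobody else can claim it before an object exists."""
        self.parse(urn)
        if urn in self._pids:
            raise KeyError(f"URN already in use: {urn}")
        self._pids[urn] = None

    def assign(self, urn, pid):
        """Bind a new or reserved URN to a repository PID."""
        self.parse(urn)
        if self._pids.get(urn) is not None:
            raise KeyError(f"URN already assigned: {urn}")
        self._pids[urn] = pid

    def resolve(self, urn):
        """Return the PID for a URN, failing if none is bound yet."""
        pid = self._pids.get(urn)
        if pid is None:
            raise LookupError(f"no object registered for {urn}")
        return pid
```

Treating everything before the last colon as namespaces keeps the parser agnostic about how many namespace levels a given project uses.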
Fedora Repository Service
Fedora met many of our critical needs:
– Modular nature of the repository service
– Management of digital content over time (versioning, etc.)
– Aggregation of mixed, possibly distributed, data into complex objects
– The ability to specify multiple content disseminations of these objects
– The ability to associate rights management schemes with these disseminations
Fedora Repository Service, cont…
Tufts Implementation Details:
– External data stores
– Modeling behaviors and content
– Piece of a larger architecture; not an out-of-the-box solution
Tufts Repository Models/Policies
– Fedora @ Tufts serves several purposes
Archival/institutional repository
– Guarantee functional preservation
Data warehouse
– Guarantee bitstream preservation
Active Repository
– Active workspace; constantly updated content (e.g., faculty data sets, faculty pubs, content mapping)
Behavior Definitions
Atomic units: sets of standardized behaviors
Building blocks of content models
Allow for flexible reuse of data
Contribute to inter-repository sharing of objects
Dissemination of standard output: XML, plain text, binary format
Rendering/processing of disseminations is the responsibility of applications implemented over the repository.
BDefs | Methods
tuftsAssetDef | getPreview, getLabel, getDescription, getFullView, getDefaultContent, getDescMetadata, getAdminMetadata
tuftsText | getTOC, getChunkList, getChunk, getHeader
tuftsBasicImage | getThumbnail, getScreensize, getMaxSize, getDynamicView
Content Models
Unique content models built from content modeling components.
Digital Objects that subscribe to a given content model inherit all methods established by a particular behavior.
Digital objects can subscribe to content models that suit their type or class.
Functional, not presentation-specific
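The relationship between behavior definitions and content models can be sketched as simple composition. In the Python sketch below, the BDef names and methods come from the slides (abbreviated to a subset), while the content model names `text` and `image` and both helper functions are hypothetical.

```python
# Behavior definitions and a subset of their methods, as listed in the slides.
BDEFS = {
    "tuftsAssetDef": ["getPreview", "getLabel", "getDescription", "getFullView"],
    "tuftsText": ["getTOC", "getChunk", "getHeader"],
    "tuftsBasicImage": ["getThumbnail", "getMaxSize", "getDynamicView"],
}

# Hypothetical content models, each built from behavior definitions.
CONTENT_MODELS = {
    "text": ["tuftsAssetDef", "tuftsText"],
    "image": ["tuftsAssetDef", "tuftsBasicImage"],
}


def methods_for(content_model):
    """List every method an object inherits by subscribing to a content model."""
    return [method
            for bdef in CONTENT_MODELS[content_model]
            for method in BDEFS[bdef]]


def supports(content_model, method):
    """Tell an application whether it may call a method on such an object."""
    return method in methods_for(content_model)
```

Because both models include `tuftsAssetDef`, an application can ask any object for a preview without knowing whether it is a text or an image.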
Implementation Challenges
Processing large (>10MB) XML Documents
– XML databases
Processing large images
– Imaging servers
Streaming Media
GIS data
Modeling Collections
Advanced Searching
“Shopping cart” searching
Caching Disseminations
Indexing and Search Service
Indexing
– Digital objects piped through from ingestion service
– Metadata index
– Full-text index
– Specialized XML index
Implementation
– Java indexing application
– Oracle database
Supported Types of Search
– Basic full-text
– Basic metadata
– Advanced metadata
Accessing the service
– HTTP GET/POST
– SOAP
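For the HTTP GET route, a client would encode the search type and query in the request URL. The slides do not give the real endpoint or parameter names, so everything in this Python sketch (the host, the path, and the `q`, `type` and `max` parameters) is a placeholder chosen for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real service address is not given in the slides.
SEARCH_ENDPOINT = "http://repository.example.edu/search"

# The three search types named in the slides, under made-up parameter values.
SEARCH_TYPES = {"fulltext", "metadata", "advanced"}


def build_search_url(query, search_type="fulltext", max_results=20):
    """Build a GET URL for one search request."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unsupported search type: {search_type}")
    params = {"q": query, "type": search_type, "max": max_results}
    # urlencode escapes spaces and reserved characters in the query text.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```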
Interfacing Services
An important design requirement for TDR was to allow current digital library applications to easily interface with TDR and provide access to the content in the digital repository within their own environments in a seamless fashion.
Current applications like VUE can interface with this service to allow their tools to disseminate the content that resides in TDL.
The service is being designed not only to support current applications but also to accommodate the needs of future, yet-to-be-defined applications such as course management systems, learning tools, and portals.
Fedora OKI Bridge
Fedora | OKI
PID | Shared.Id
DR | DigitalRepository
FedoraObject | Asset
FedoraObjectIterator | AssetIterator
BehaviorInfoStructure | InfoStructure
Behavior | InfoRecord
DisseminationInfoPart | InfoPart
Dissemination | InfoField
ParameterInfoPart | InfoPart
Parameter | InfoField
DataOutputStreamInfoPart | InfoPart
DataOutputStream (MIMETypeStream) | InfoField
Applications Accessing TDR Content
Tufts Digital Library Application
– http://dl.tufts.edu/
Visual Understanding Environment (VUE)
– http://vue.tccs.tufts.edu/
VUE Overview
Learning Theories
– Constructivism
– Active Learning
– Individualized Learning
Technical Infrastructure
Support
– Faculty needs
– Learners' needs
Extend
– Digital Libraries
– OKI Standards
[Diagram labels: VUE, OKI, DRAPI, DigitalRepository, OKI-FEDORA Bridge, FEDORA, DR Implementations]
Future Directions
Revised search service (Zebra?)
XML database for metadata and XML objects (eXist)
Customization and enhancement to address a wide variety of needs (e.g., University Records)
Object factory: a workbench for building certain classes of objects
Automated browsing service for Repository
Authentication and authorization modules
Asset Definitions
Collection Modeling
Federation
Asset Definitions
The purpose of the Fedora Asset Definition is to define and expose the content types and methods of objects/assets in a repository in a standard way. The goal is to facilitate access between applications and digital repositories, between one digital repository and another, etc.
Some of the questions that we asked ourselves during our repository and application development helped us form the concept of an “Asset Definition.” For example:
How can an application find out which objects/assets are in a particular repository, and how does one refer to these objects?
If one has an object/asset in a repository, how does one describe it so that other applications can understand what they can do with it?
Asset Definitions, cont…
getFullAssetDefinition
getPreview
getDescription
getFullView
getDefaultContent
getDescMetadata
getAdminMetadata
getThumbnail
getScreenSize
getMaxSize
getDynamicView
Collection Modeling
Object Relationships
– Extend Fedora RDF to create collection networks
– Recursive disseminators to track paths in the network
– Facilitate access to sets of materials
– Facilitate management of digital objects
– Facilitate browsing of sets of materials
– http://nikolai.tccs.tufts.edu:1980/fedora/get/demo:collectionAll/demo:Collection/viewMembers/
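Tracking paths through a collection network is at heart a recursive graph walk. The Python sketch below shows that idea under stated assumptions: the membership graph is a made-up in-memory dictionary rather than Fedora RDF, the function name is hypothetical, and apart from `demo:collectionAll` (which appears in the URL above) every PID is invented.

```python
# Hypothetical membership graph: collection PID -> member PIDs.
MEMBERS = {
    "demo:collectionAll": ["demo:collectionA", "demo:collectionB"],
    "demo:collectionA": ["demo:obj1", "demo:obj2"],
    "demo:collectionB": ["demo:obj2", "demo:collectionA"],
}


def member_paths(pid, path=()):
    """Yield the path from the starting collection to every leaf object."""
    path = path + (pid,)
    children = MEMBERS.get(pid)
    if not children:
        # No members recorded: treat this PID as a leaf object.
        yield path
        return
    for child in children:
        if child in path:
            # Guard against cycles in the network.
            continue
        yield from member_paths(child, path)
```

Returning whole paths rather than just members lets a browsing interface show how an object was reached, which matters when the same object belongs to several collections.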