A Multi-Tiered Architecture for Distributed Data Collection and Centralized Data Delivery
Stacy Kowalczyk and James Halliday
April 28, 2008
Project Overview
IN Harmony is
• An IMLS-funded grant
• Awarded in Fall 2004
• To be completed in Fall 2008
• A partnership of
  • Indiana University Digital Library Program
  • Indiana University Lilly Library
  • Indiana State Library
  • Indiana State Museum
  • Indiana Historical Society
April 28, 2008 | IN Harmony – DLP Spring Forum 2008
Project Goals
1. To provide a model for fostering collaborative digital library development by partnering with institutions with complementary collections;
2. To digitize a portion of the sheet music from these collections and offer access to these materials free of charge on the web;
3. To bring these materials and their attendant metadata together on a single web site, offering both federated searching of the entire collection and searching of one or more selected collections.
Deliverables
• Tools to
  • Process the images
  • Capture metadata
  • Provide search and display functions
• 10,000 pieces of sheet music scanned and cataloged
  • 4,000 Indiana University Lilly Library
  • 2,000 Indiana State Library
  • 2,000 Indiana State Museum
  • 2,000 Indiana Historical Society
Cataloging and Imaging Workflow Goals
• Data integrity
  • Quality of the scans
  • Quality of the metadata
  • Accuracy of the links between page images
  • Accuracy of the links between metadata and images
• Simplicity of use
• Balance of flexibility and constraints
Cataloging and Imaging Use Cases
1. Catalog first
2. Scanning first
3. Metadata created in another system and imported into IN Harmony
Digitizing Quality Control
• Two-phase quality control process
• Automated QC process verifies:
  • All TIFF tags of every digital file
  • TIFF must be uncompressed
  • File names
  • Embedded profile appropriate to its bit depth
  • Consistency of pixel dimensions within a score
  • Appropriate resolution
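The automated checks listed above can be sketched roughly as follows. This is an illustration only, not the project's Perl implementation: the `PageScan` record, the file-name pattern, the minimum resolution, the profile names, and the dimension tolerance are all hypothetical values chosen for the example. The one fixed fact it relies on is that a TIFF Compression tag value of 1 means "no compression".

```python
import re
from dataclasses import dataclass

UNCOMPRESSED = 1  # TIFF Compression tag value meaning "no compression"

# Hypothetical policy values for illustration; not taken from the project.
NAME_PATTERN = re.compile(r"^[a-z]+-\d{7}-\d{3}\.tif$")
MIN_DPI = 400
PROFILE_FOR_DEPTH = {8: "gray", 24: "rgb"}
DIMENSION_TOLERANCE = 0.05  # allowed relative difference between pages


@dataclass
class PageScan:
    """Tag values already read from one TIFF file (hypothetical record)."""
    filename: str
    compression: int
    width: int
    height: int
    dpi: int
    bits_per_pixel: int
    profile: str


def check_score(pages):
    """Return a list of human-readable problems for one scanned score."""
    problems = []
    for p in pages:
        if p.compression != UNCOMPRESSED:
            problems.append(f"{p.filename}: TIFF is compressed")
        if not NAME_PATTERN.match(p.filename):
            problems.append(f"{p.filename}: unexpected file name")
        if PROFILE_FOR_DEPTH.get(p.bits_per_pixel) != p.profile:
            problems.append(f"{p.filename}: profile does not match bit depth")
        if p.dpi < MIN_DPI:
            problems.append(f"{p.filename}: resolution too low")
    # Compare every later page against the first page of the score.
    if pages:
        ref = pages[0]
        for p in pages[1:]:
            if (abs(p.width - ref.width) > DIMENSION_TOLERANCE * ref.width
                    or abs(p.height - ref.height) > DIMENSION_TOLERANCE * ref.height):
                problems.append(f"{p.filename}: pixel dimensions differ from {ref.filename}")
    return problems
```

An empty list from `check_score` would correspond to a score that passes the automated phase and can move on to manual QC.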
Digitizing Quality Control (2)
• Manual QC – at 100% pixel display, verify:
  • Correct page orientation and order
  • Correct color balance
  • Sharp and in-focus scan
  • No digital artifacts
• When all QC is passed, derivative files are created
  • Large and small JPGs for screen delivery
  • PDF sized for 8.5 x 11 printing
Digitizing Quality Control Software
Designing the metadata model
• User studies
• Work with the partners
• Define fields
• Write cataloging guidelines with partner input
• Representation in MODS
Types of fields
• Title elements
• Name elements
• Publication elements
• Subject elements
• Identification elements
• Note elements
• Cover information
Metadata Collection Tool
Public Search and Discovery System
Demo
ARCHITECTURE OVERVIEW
JIM HALLIDAY
IN Harmony Technical Overview
[Architecture diagram. Labels: Fedora, Web Browser, SRU and HTTP, Mass Storage System, Oracle, Cataloging Client, Quality Control, Scanner, Authentication Service, Java Swing, MODS Export, FTP, Perl Web Application]
Getting Data Into IN Harmony
Two primary data sources
• Cataloging client
• Image QC/upload application
Other data sources
• XML data exported from other cataloging systems
• Score images exported from older systems
Image QC/upload application
1. User scans scores and uploads to IN Harmony server
2. User accesses Perl-based web application to initiate automated quality control
3. A second user proceeds with manual QC, then uses web application to signal that manual QC is finished
4. The application moves and backs up the files, creates derivatives, and alerts both Fedora and the internal database that the process is complete
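The four steps above amount to a small state machine. The sketch below models it in Python purely for illustration (the real application is Perl-based); the class, stage names, and log messages are hypothetical, and the rule that manual QC must be signed off by someone other than the user who started automated QC is one reading of "a second user".

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Stage(Enum):
    UPLOADED = auto()        # step 1: scans are on the server
    AUTO_QC_DONE = auto()    # step 2: automated QC has run
    MANUAL_QC_DONE = auto()  # step 3: a second user signed off
    COMPLETE = auto()        # step 4: files moved, derivatives made


@dataclass
class ScoreBatch:
    """Hypothetical record tracking one uploaded score through QC."""
    score_id: str
    stage: Stage = Stage.UPLOADED
    auto_qc_user: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def _require(self, expected):
        if self.stage is not expected:
            raise RuntimeError(f"expected {expected.name}, but stage is {self.stage.name}")

    def start_automated_qc(self, user):
        self._require(Stage.UPLOADED)
        self.auto_qc_user = user
        self.log.append(f"automated QC initiated by {user}")
        self.stage = Stage.AUTO_QC_DONE

    def finish_manual_qc(self, user):
        self._require(Stage.AUTO_QC_DONE)
        if user == self.auto_qc_user:
            raise PermissionError("manual QC must be done by a second user")
        self.log.append(f"manual QC finished by {user}")
        self.stage = Stage.MANUAL_QC_DONE

    def finalize(self):
        self._require(Stage.MANUAL_QC_DONE)
        # Placeholders for the real work the application performs.
        for step in ("move and back up files", "create derivatives",
                     "notify Fedora", "notify internal database"):
            self.log.append(step)
        self.stage = Stage.COMPLETE
```

Keeping the stage check in one helper means a step called out of order fails loudly instead of silently skipping QC.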
IN Harmony Derivatives
• Three sizes of JPGs produced per page
  • Full (1200px high)
  • Screen (600px high)
  • Thumb (200px high)
• Multi-page, playable PDF
• Approx. 1MB for an average score
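As a small worked example of the three JPG sizes, the sketch below computes the pixel dimensions of each derivative from a master image's dimensions. It assumes the aspect ratio is preserved and the width is rounded to the nearest pixel; the slides state only the target heights, so both of those are assumptions, and the function name is hypothetical.

```python
# Target heights from the slide; the names mirror the slide's labels.
DERIVATIVE_HEIGHTS = {"full": 1200, "screen": 600, "thumb": 200}


def derivative_sizes(width, height):
    """Return {name: (width, height)} for each derivative of one page.

    Assumes aspect ratio is preserved (not stated in the source).
    """
    sizes = {}
    for name, target_height in DERIVATIVE_HEIGHTS.items():
        scaled_width = max(1, round(width * target_height / height))
        sizes[name] = (scaled_width, target_height)
    return sizes


# A 3600 x 4800 pixel master scales to 900x1200, 450x600, and 150x200.
print(derivative_sizes(3600, 4800))
```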
IN Harmony cataloging client
• Standalone Java Swing based client
• Connects to Oracle database and outputs MODS for Fedora ingestion
• Implemented as a client-server application via web services using Axis
• Specialized UI components (such as ‘smart’ combo boxes) assist with quick, correct data entry
Internal IN Harmony database
• Oracle database stores record and user data in our own internal format
• Communicates with the upload/QC application and the cataloging client
• Cataloging client and internal scripts can output to MODS format for ingestion into Fedora
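To make the MODS export step concrete, here is a minimal sketch of turning one internal record into a MODS document. The shape of the internal record (a dict with `id`, `title`, `names`, `publisher`, `date`) is hypothetical, since the slides do not describe the internal format, and the sample values are placeholders. The MODS namespace and element names (`titleInfo`, `name`, `namePart`, `originInfo`, `identifier`) come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def record_to_mods(record):
    """Serialize a (hypothetical) internal record dict as a MODS string."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record["title"]
    for person in record.get("names", []):
        name = ET.SubElement(mods, q("name"), {"type": "personal"})
        ET.SubElement(name, q("namePart")).text = person
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("publisher")).text = record["publisher"]
    ET.SubElement(origin, q("dateIssued")).text = record["date"]
    ET.SubElement(mods, q("identifier"), {"type": "local"}).text = record["id"]
    return ET.tostring(mods, encoding="unicode")


# Placeholder data, not a real catalog record.
example = {
    "id": "example-0000001",
    "title": "Example Waltz",
    "names": ["Composer, Example"],
    "publisher": "Example Music Co.",
    "date": "1900",
}
print(record_to_mods(example))
```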
IN Harmony authentication
• CAS (IU’s Central Authentication Service) is used to authenticate all users
• Non-IU users must create IU Guest Accounts to authenticate
• All account/password maintenance in user’s control
Fedora and IN Harmony
• Fedora used as a single storage and infrastructure solution for Digital Library Program projects at IU
• Data (score images and metadata) ingested into Fedora and referenced as METS objects
• Master images sent to IU’s mass storage system
• Derivatives stored internally
• Objects indexed using Lucene for SRU-based searching
Fedora Object Model
Collection
Sheet music
Copy
Page
IN Harmony end-user interface
- Java Struts based web application
- Offers searching, browsing, and record display
- Each partner institution is offered a personalized view of their data only
Interaction with Fedora
- Application sends CQL queries to Fedora and retrieves MODS data which is transformed via XSLT
- PURLs (persistent URLs) are used to access image derivatives
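For readers unfamiliar with SRU, the sketch below shows roughly what sending a CQL query over HTTP looks like: the query travels as a URL parameter of a `searchRetrieve` request. The parameter names are standard SRU 1.1; the base URL, the CQL index name, and the function name are hypothetical, and the real application is a Java Struts program, so this Python version is illustrative only. The XSLT step that turns the returned MODS into HTML is not shown.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query string."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and index name, for illustration only.
print(sru_search_url("http://fedora.example.org/sru", 'title = "waltz"'))
```

Because `urlencode` escapes the quotes and spaces in the CQL string, the caller can pass the query exactly as written.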
METS Navigator
• METS Navigator is used to page through scores online
• Uses METS structMap to facilitate navigation
• Allows views of multiple sizes of images
• Released by IU as open source – see http://metsnavigator.sourceforge.net
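To illustrate how a structMap can drive paging, the sketch below reads page order from a tiny METS document. It is not METS Navigator's code: the sample document, its `TYPE` values, and the file IDs are invented for the example. The METS namespace and the `structMap`, `div`, `fptr`, `ORDER`, `LABEL`, and `FILEID` names are from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Invented sample; pages are deliberately listed out of order.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="physical">
    <mets:div TYPE="score">
      <mets:div TYPE="page" ORDER="2" LABEL="Page 1"><mets:fptr FILEID="p2"/></mets:div>
      <mets:div TYPE="page" ORDER="1" LABEL="Cover"><mets:fptr FILEID="p1"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_sequence(mets_xml):
    """Return (order, label, [file ids]) tuples sorted by ORDER."""
    root = ET.fromstring(mets_xml)
    pages = []
    for div in root.iter(f"{{{METS_NS}}}div"):
        if div.get("TYPE") != "page":
            continue
        file_ids = [f.get("FILEID") for f in div.findall(f"{{{METS_NS}}}fptr")]
        pages.append((int(div.get("ORDER")), div.get("LABEL"), file_ids))
    return sorted(pages, key=lambda page: page[0])


for order, label, file_ids in page_sequence(SAMPLE):
    print(order, label, file_ids)
```

A viewer could step through this list for "next" and "previous", resolving each file ID to whichever image size the user has selected.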
IN Harmony Links
• IN Harmony Public Interface
• IN Harmony Project Information
• Cataloging Tool Release date – June 2008
Questions?