A Multi-Tiered Architecture for Distributed Data Collection and Centralized Data Delivery

Stacy Kowalczyk and James Halliday
April 28, 2008


Project Overview

IN Harmony is
• An IMLS-funded grant
• Awarded in Fall 2004
• To be completed in Fall 2008
• A partnership of
  • Indiana University Digital Library Program
  • Indiana University Lilly Library
  • Indiana State Library
  • Indiana State Museum
  • Indiana Historical Society

April 28, 2008 | IN Harmony – DLP Spring Forum 2008


Project Goals

1. To provide a model for fostering collaborative digital library development by partnering with institutions with complementary collections;
2. To digitize a portion of the sheet music from these collections and offer access to these materials free of charge on the web;
3. To bring these materials and their attendant metadata together on a single web site, offering both federated searching of the entire collection and searching of one or more selected collections.


Deliverables

• Tools to
  • Process the images
  • Capture metadata
  • Provide search and display functions
• 10,000 pieces of sheet music scanned and cataloged
  • 4,000 Indiana University Lilly Library
  • 2,000 Indiana State Library
  • 2,000 Indiana State Museum
  • 2,000 Indiana Historical Society


Cataloging and Imaging Workflow Goals

• Data integrity
  • Quality of the scans
  • Quality of the metadata
  • Accuracy of the links between page images
  • Accuracy of the links between metadata and images
• Simplicity of use
• Balance of flexibility and constraints


Cataloging and Imaging Use Cases

1. Catalog first

2. Scanning first

3. Metadata created in another system and imported into IN Harmony


Digitizing Quality Control

• Two-phase quality control process
• Automated QC process verifies:
  • All TIFF tags of every digital file
  • TIFF must be uncompressed
  • File names
  • Embedded profile appropriate to its bit depth
  • Consistency of pixel dimensions within a score
  • Appropriate resolution
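The automated checks listed above could be expressed roughly as follows. This is a minimal sketch, not the project's actual QC code: the `PageScan` record, the file-name pattern, the minimum resolution, the bit-depth-to-profile table, and the dimension tolerance are all hypothetical values chosen for illustration, and reading the properties out of the TIFF files is not shown. The one fixed fact it relies on is that a TIFF Compression tag value of 1 means uncompressed.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

UNCOMPRESSED = 1  # TIFF Compression tag (259): 1 means no compression

# Hypothetical project rules; the slides do not state the real values.
NAME_PATTERN = re.compile(r"^[a-z]+-\d{7}-\d{3}\.tif$")
MIN_DPI = 400
EXPECTED_PROFILE = {8: "gray", 24: "rgb"}  # bit depth -> profile family
MAX_DIMENSION_DRIFT = 0.05  # allowed relative difference between pages


@dataclass
class PageScan:
    """Properties already read from one TIFF file (the reader is not shown)."""
    file_name: str
    compression: int
    bit_depth: int
    profile: Optional[str]
    width: int
    height: int
    dpi: int


def check_score(pages: List[PageScan]) -> List[str]:
    """Return the problems found in one score; an empty list means it passed."""
    problems = []
    for page in pages:
        if not NAME_PATTERN.match(page.file_name):
            problems.append(f"{page.file_name}: unexpected file name")
        if page.compression != UNCOMPRESSED:
            problems.append(f"{page.file_name}: TIFF is compressed")
        expected = EXPECTED_PROFILE.get(page.bit_depth)
        if expected is None or expected != page.profile:
            problems.append(f"{page.file_name}: profile does not match bit depth")
        if page.dpi < MIN_DPI:
            problems.append(f"{page.file_name}: resolution too low")
    # Compare every later page against the first page of the score.
    for page in pages[1:]:
        first = pages[0]
        if (abs(page.width - first.width) > first.width * MAX_DIMENSION_DRIFT
                or abs(page.height - first.height) > first.height * MAX_DIMENSION_DRIFT):
            problems.append(f"{page.file_name}: pixel dimensions differ from first page")
    return problems
```

Returning a list of problems rather than stopping at the first failure lets a scanning operator fix everything in one pass before resubmitting.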


Digitizing Quality Control (2)

• Manual QC – at 100% pixel display, verify:
  • Correct page orientation and order
  • Correct color balance
  • Sharp and in-focus scan
  • No digital artifacts
• When all QC is passed, derivative files are created
  • Large and small JPGs for screen delivery
  • PDF sized for 8.5 x 11 printing


Digitizing Quality Control Software


Designing the metadata model

• User studies
• Work with the partners
• Define fields
• Write cataloging guidelines with partner input
• Representation in MODS


Types of fields

• Title elements
• Name elements
• Publication elements
• Subject elements
• Identification elements
• Note elements
• Cover information
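To make the field groups above concrete, here is a minimal sketch of how a record might be serialized to MODS. The mapping from field group to MODS element is an assumption (the project's cataloging guidelines are not reproduced in the slides), the `record` dictionary keys are hypothetical, and cover information is left out because the slides do not say how it is represented. The element names and namespace are those of the MODS version 3 schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_mods(record: dict) -> ET.Element:
    """Build a small MODS element tree from a hypothetical record dict."""
    mods = ET.Element(q("mods"))

    # Title elements
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record["title"]

    # Name elements
    for person in record.get("names", []):
        name = ET.SubElement(mods, q("name"))
        ET.SubElement(name, q("namePart")).text = person

    # Publication elements
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("publisher")).text = record["publisher"]
    ET.SubElement(origin, q("dateIssued")).text = record["date"]

    # Subject elements
    for topic in record.get("subjects", []):
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = topic

    # Identification elements
    ET.SubElement(mods, q("identifier"), {"type": "local"}).text = record["identifier"]

    # Note elements
    for note in record.get("notes", []):
        ET.SubElement(mods, q("note")).text = note

    return mods
```

Calling `ET.tostring(build_mods(record), encoding="unicode")` on a record with placeholder values yields a `mods:mods` document that can be inspected by hand.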


Metadata Collection Tool


Public Search and Discovery System

Demo


ARCHITECTURE OVERVIEW

JIM HALLIDAY


IN Harmony Technical Overview

[Architecture diagram. Labels: Scanner, Quality Control, Cataloging Client, Java Swing, Perl Web Application, Authentication Service, Oracle, Fedora, Mass Storage System, Web Browser, FTP, MODS Export, SRU and http]


Getting Data Into IN Harmony

2 primary data sources
• Cataloging client
• Image QC/upload application

Other data sources
• XML data exported from other cataloging systems
• Score images exported from older systems


Image QC/upload application

1. User scans scores and uploads to IN Harmony server
2. User accesses Perl-based web application to initiate automated quality control
3. A second user proceeds with manual QC, then uses web application to signal that manual QC is finished
4. The application moves and backs up the files, creates derivatives, and alerts both Fedora and the internal database that the process is complete
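The four steps above amount to a small state machine. The sketch below models it in Python for illustration only; the real application is Perl-based, and the class, stage names, and log messages here are hypothetical. Treating "a second user" as an enforced rule (the reviewer must differ from the uploader) is also an assumption.

```python
from enum import Enum


class Stage(Enum):
    UPLOADED = 1
    AUTO_QC_PASSED = 2
    MANUAL_QC_PASSED = 3
    COMPLETE = 4


class ScoreBatch:
    """Tracks one uploaded score through the QC workflow."""

    def __init__(self, score_id: str, uploaded_by: str):
        self.score_id = score_id
        self.uploaded_by = uploaded_by
        self.stage = Stage.UPLOADED
        self.log = []

    def _advance(self, expected: Stage, new: Stage, event: str) -> None:
        # Refuse to skip or repeat a step.
        if self.stage is not expected:
            raise ValueError(f"cannot record '{event}' in stage {self.stage.name}")
        self.stage = new
        self.log.append(event)

    def pass_automated_qc(self) -> None:
        self._advance(Stage.UPLOADED, Stage.AUTO_QC_PASSED, "automated QC passed")

    def finish_manual_qc(self, reviewer: str) -> None:
        # Assumption: manual QC must be signed off by someone other than the uploader.
        if reviewer == self.uploaded_by:
            raise ValueError("manual QC must be done by a second user")
        self._advance(Stage.AUTO_QC_PASSED, Stage.MANUAL_QC_PASSED, "manual QC passed")

    def finalize(self) -> None:
        self._advance(Stage.MANUAL_QC_PASSED, Stage.COMPLETE, "files moved and backed up")
        self.log += ["derivatives created", "Fedora notified", "database notified"]
```

Keeping the transitions in one helper makes it hard for derivatives to be created before both QC phases have passed.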


IN Harmony Derivatives

• Three sizes of JPGs produced per page
  • Full (1200px high)
  • Screen (600px high)
  • Thumb (200px high)
• Multi-page, playable PDF
  • Approx. 1MB for an average score
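Because the three JPG sizes are specified by height only, the width of each derivative has to be computed from the master image. A minimal sketch of that arithmetic, assuming the aspect ratio is preserved (the slides do not say so explicitly) and using a hypothetical function name:

```python
# Target heights in pixels, taken from the slide.
TARGET_HEIGHTS = {"full": 1200, "screen": 600, "thumb": 200}


def derivative_sizes(master_width: int, master_height: int) -> dict:
    """Return (width, height) for each derivative, preserving aspect ratio."""
    sizes = {}
    for name, target_height in TARGET_HEIGHTS.items():
        width = round(master_width * target_height / master_height)
        sizes[name] = (max(1, width), target_height)
    return sizes
```

For example, a 4800 x 6000 pixel master would give a full image of 960 x 1200, a screen image of 480 x 600, and a thumbnail of 160 x 200.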


IN Harmony cataloging client

• Standalone Java Swing based client
• Connects to Oracle database and outputs MODS for Fedora ingestion
• Implemented as a client-server application via web services using Axis
• Specialized UI components (such as ‘smart’ combo boxes) assist with quick, correct data entry
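The slides do not describe what makes the combo boxes 'smart'. One plausible reading is that they suggest existing vocabulary values as the cataloger types, which keeps entries consistent. The sketch below shows only that matching logic, in Python rather than the client's Java Swing; the function name, the ranking rule, and the sample vocabulary are all assumptions.

```python
def suggest(typed: str, vocabulary: list, limit: int = 5) -> list:
    """Suggest vocabulary values for partly typed text, case-insensitively.

    Values that start with the typed text come first, followed by values
    that merely contain it.
    """
    needle = typed.strip().casefold()
    if not needle:
        return []
    prefix = [v for v in vocabulary if v.casefold().startswith(needle)]
    inner = [v for v in vocabulary
             if needle in v.casefold() and v not in prefix]
    return (sorted(prefix) + sorted(inner))[:limit]
```

With a made-up vocabulary of `["Waltz", "March", "Ragtime", "War songs"]`, typing `wa` would offer "Waltz" and "War songs", so the cataloger picks an existing heading instead of retyping it.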


Internal IN Harmony database

• Oracle database stores record and user data in our own internal format
• Communicates with upload/QC application and cataloging client
• Cataloging client and internal scripts can output to MODS format for ingestion into Fedora


IN Harmony authentication

• CAS (IU’s Central Authentication Service) is used to authenticate all users
• Non-IU users must create IU Guest Accounts to authenticate
• All account/password maintenance in user’s control


Fedora and IN Harmony

• Fedora used as a single storage and infrastructure solution for Digital Library Program projects at IU
• Data (score images and metadata) ingested into Fedora and referenced as METS objects
• Master images sent to IU’s mass storage system
• Derivatives stored internally
• Objects indexed using Lucene for SRU-based searching


Fedora Object Model

[Diagram. Object types shown: Collection, Sheet music, Copy, Page]


IN Harmony end-user interface

- Java Struts based web application
- Offers searching, browsing, and record display
- Each partner institution is offered a personalized view of their data only

Interaction with Fedora

- Application sends CQL queries to Fedora and retrieves MODS data which is transformed via XSLT
- PURLs (persistent URLs) are used to access image derivatives
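As an illustration of the CQL-over-SRU interaction described above, the sketch below builds a searchRetrieve request URL. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) come from the SRU 1.1 protocol; the base URL, the `collection` index used to restrict results to one partner, and both function names are hypothetical, since the slides do not give the actual endpoint or index names.

```python
from urllib.parse import urlencode


def cql_query(term: str, collection: str = None) -> str:
    """Build a CQL query for a search term, optionally limited to one collection."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    query = f'"{escaped}"'
    if collection:
        # 'collection' is a hypothetical index name.
        query += f' and collection = "{collection}"'
    return query


def sru_url(base_url: str, cql: str, start: int = 1, maximum: int = 20) -> str:
    """Build an SRU 1.1 searchRetrieve URL that asks for MODS records."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql,
        "startRecord": start,
        "maximumRecords": maximum,
        "recordSchema": "mods",
    }
    return base_url + "?" + urlencode(params)
```

A call such as `sru_url("http://example.org/sru", cql_query("moon", "lilly"))` produces a URL whose response could then be passed through XSLT for display; a partner-specific view would simply always supply its own collection value.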


METS Navigator

• METS Navigator is used to page through scores online
• Uses METS structMap to facilitate navigation
• Allows views of multiple sizes of images
• Released by IU as open source – see http://metsnavigator.sourceforge.net
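To show how a structMap can drive page-by-page navigation, here is a minimal sketch that reads page order and file pointers from a METS document. It is not METS Navigator's code. The METS namespace and the `structMap`, `div`, and `fptr` elements with their `ORDER`, `LABEL`, `TYPE`, and `FILEID` attributes are part of the METS schema; the sample document, the `TYPE="page"` convention, and the file identifiers are invented for illustration.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
NS = {"mets": METS}

# Invented sample: two pages, each pointing at two image sizes.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="physical">
    <mets:div TYPE="score" LABEL="Example score">
      <mets:div TYPE="page" ORDER="2" LABEL="Page 1">
        <mets:fptr FILEID="p2-thumb"/><mets:fptr FILEID="p2-screen"/>
      </mets:div>
      <mets:div TYPE="page" ORDER="1" LABEL="Cover">
        <mets:fptr FILEID="p1-thumb"/><mets:fptr FILEID="p1-screen"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_sequence(mets_xml: str) -> list:
    """Return (label, [file ids]) for each page div, sorted by ORDER."""
    root = ET.fromstring(mets_xml)
    pages = []
    for struct_map in root.findall("mets:structMap", NS):
        for div in struct_map.iter(f"{{{METS}}}div"):
            if div.get("TYPE") == "page":
                pages.append(div)
    pages.sort(key=lambda d: int(d.get("ORDER")))
    return [
        (d.get("LABEL"), [f.get("FILEID") for f in d.findall("mets:fptr", NS)])
        for d in pages
    ]
```

Sorting on `ORDER` rather than document position means the viewer shows pages in the intended sequence even if the divs were written out of order, and the several `fptr` entries per page are what allow switching between image sizes.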


IN Harmony Technical Overview

[Architecture diagram. Labels: Scanner, Quality Control, Cataloging Client, Java Swing, Perl Web Application, Authentication Service, Oracle, Fedora, Mass Storage System, Web Browser, FTP, MODS Export, SRU and http]


IN Harmony Links

• IN Harmony Public Interface
• IN Harmony Project Information
• Cataloging Tool Release date – June 2008


Questions?
