Dec 9-11, 2003 ICADL 2003 1

Challenges in Building Federation Services over Harvested Metadata

Hesham Anan, Jianfeng Tang, Kurt Maly, Michael Nelson, Mohammad Zubair, and Zhao Yang

Digital Library Group, Old Dominion University

Norfolk, VA 23529


Outline

• Motivation

• Overview

• Process Automation

• Web Services and Applications

• Performance

• Conclusions and Future Work


Motivation

• Harvesting provides only the basic services to get metadata from repositories.

• Processing these data or retrieving related metadata is not part of the OAI-PMH.

• Dynamic harvesting makes it challenging to keep specialized services consistent as new metadata records are ingested.


Motivation

• There is a growing use of the Web Services standard. Hence providing services compliant with this standard will increase the usability of our digital library.

• Using web services enables third parties to provide services that enhance our native services on top of our federated collection.


Overview

Archon is a federation of physics digital libraries. Its architecture provides services to both humans and machines:

• Basic Services (for humans)
  – a search and discovery service
  – a service that allows searching on equations embedded in the metadata
  – a cross-archive citation service

• OAI Services (for machines)
  – a storage service for the metadata of collected archives
  – a harvester service to collect data from digital libraries using OAI-PMH
  – a data provider service to expose metadata to OAI-PMH harvesters

• Web Services (for machines)
  – a focus library for personal use
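The harvester service collects records with OAI-PMH. As a rough illustration of what such a harvest loop involves, here is a minimal sketch of a ListRecords request that follows resumption tokens. It is not Archon's harvester: the function name `harvest` and the injected `fetch` callable are assumptions made for this example, while the verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 response namespace


def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield every <record> element, following resumption tokens.

    `fetch` is a hypothetical callable mapping a URL to the response body
    (str); it is injected so the loop can be exercised without a network.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iter(OAI_NS + "record")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # an absent or empty token marks the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing the transport in as `fetch` keeps the paging logic separate from HTTP concerns such as retries and timeouts, which a production harvester would need to handle.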


Archon Architecture

[Figure: Archon architecture. Components shown: User Interface; Search Engine (Servlet); JDBC; Relational Database (Oracle); Data Normalization; History Harvest; Daily Harvest; Harvester; Data Provider; Data Provider Cache; Extended Services. Actors shown: search users; publishing users.]


Process Automation

• At the core of Archon are high-level services that require post-processing of harvested metadata.

• We implemented Archon’s post-harvesting processes as tasks that can be run incrementally and automatically.

• The Archon post-processing consists of tasks for citation and equation processing, normalization, and a subject resolver.
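One way to picture tasks that run incrementally is to give each task its own checkpoint, so that a run touches only records that arrived since the task last ran. The sketch below illustrates that idea under stated assumptions: records carry increasing integer ids, and the `Task` and `Pipeline` names are hypothetical, not Archon's classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    run: Callable[[dict], None]  # processes one harvested record
    last_seen: int = 0           # highest record id already processed


@dataclass
class Pipeline:
    tasks: List[Task] = field(default_factory=list)

    def run_incremental(self, records: Dict[int, dict]) -> Dict[str, int]:
        """Run each task only on records newer than its own checkpoint."""
        processed = {}
        for task in self.tasks:
            new_ids = sorted(i for i in records if i > task.last_seen)
            for rec_id in new_ids:
                task.run(records[rec_id])
            if new_ids:
                task.last_seen = new_ids[-1]  # advance the checkpoint
            processed[task.name] = len(new_ids)
        return processed
```

Keeping a checkpoint per task, rather than one global marker, lets a failed task be rerun later without repeating the work of tasks that already succeeded.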


Harvest Post Processing - Citation Processing

• The reference-linking service provides the user with a list of the references for each metadata record.

• Where possible the service provides links to the documents at external source archives and within Archon.
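Linking a parsed reference to a document in the federation comes down to matching it against the bibliographic data of the harvested records. The slides do not describe the matching rule, so the sketch below assumes a simple one: two citations match when their normalized journal name, volume, and page agree. The function names and record fields are hypothetical.

```python
import re


def citation_key(journal, volume, page):
    """Normalize a citation to a comparable key (assumed matching rule)."""
    journal_key = re.sub(r"[^a-z]", "", journal.lower())  # drop dots, spaces, case
    return (journal_key, str(volume).strip(), str(page).strip())


def resolve(references, holdings):
    """Return the id of the matching holding for each reference, or None."""
    index = {
        citation_key(h["journal"], h["volume"], h["page"]): h["id"]
        for h in holdings
    }
    return [
        index.get(citation_key(r["journal"], r["volume"], r["page"]))
        for r in references
    ]
```

References that come back as `None` are the ones for which no internal link can be offered; they could still be linked to an external source archive by other means.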


Harvest Post Processing - Citation Processing

[Figure: citation processing workflow. Labels shown: OAI Harvester; other harvesters; Raw Reference; Parser (extract references); Get archive; Reference Resolver; Reference Collector; References; Raw Bibliographic; Parser; Normalization; Bibliographic; Bibliographic Collector; Old Link Adjustment; External Link; Crosslink; DC; Reference Process.]


Harvest Post Processing - Citation Processing: Data for Resolved References

Archive   Total       External    Internal    Linked      Resolved
arXiv     4,838,158   2,191,419   1,257,367   2,790,904   2,900,347
APS       686,521     427,601     195,187     432,604     520,843
CERN      58,105      24,345      9,115       25,513      27,753
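To put the reference counts in proportion, the short calculation below divides the Resolved column by the Total column for each archive. The numbers are copied from the slide; the variable and function names are only for this illustration.

```python
# archive: (total references, resolved references), as reported on the slide
REFERENCE_COUNTS = {
    "arXiv": (4_838_158, 2_900_347),
    "APS": (686_521, 520_843),
    "CERN": (58_105, 27_753),
}


def resolved_share(counts):
    """Fraction of each archive's references that were resolved."""
    return {name: resolved / total for name, (total, resolved) in counts.items()}


shares = resolved_share(REFERENCE_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of references resolved")
```

By this measure roughly 60% of arXiv references, 76% of APS references, and 48% of CERN references were resolved.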


Harvest Post Processing - Equation Processing

• We represent the equations as images and display these images when the metadata records are displayed. This requires the following tasks to be performed after harvesting new metadata records:
  – Identifying equations
  – Filtering equations
  – Equation storage
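The first two tasks can be sketched as follows. The slides do not say how equations are marked up in the metadata, so this example assumes LaTeX inline math delimited by dollar signs, and it assumes a simple filter that discards fragments that are very short or have no operator or LaTeX command. Both rules, and the function names, are illustrative assumptions rather than Archon's actual logic.

```python
import re

# Assumption: equations appear as LaTeX inline math, e.g. $E = mc^2$.
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def identify_equations(text):
    """Return every candidate equation found in a metadata field."""
    return [match.strip() for match in INLINE_MATH.findall(text)]


def filter_equations(equations, min_length=4):
    """Keep candidates that are long enough and show some structure."""
    kept = []
    for eqn in equations:
        # A backslash signals a LaTeX command; the others are common operators.
        has_structure = "\\" in eqn or any(op in eqn for op in "=<>^_")
        if len(eqn) >= min_length and has_structure:
            kept.append(eqn)
    return kept
```

Filtering before storage avoids rendering images for isolated symbols such as a lone variable name, which would add little to an equation search.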


Harvest Post Processing - Equation Processing

[Figure: equation processing components. Labels shown: EqnExtractor; EqnCleaner; EqnFilter (Formula Filter); EqnRecorder; Eqn2Gif; Img2Gif (Image Converter); cHotEqn; MathEqn; Acme.JPM.Encoders.GifEncoder. Data shown: DC Metadata; Eqn Data.]


Harvest Post Processing - Subject Resolvers

• Our subject resolver tries to fill in the subject field for APS and arXiv DC records.

[Figure: subject resolver workflow. Steps shown: get parallel metadata; parse to get PACS code; map code to subject string (using the PACS spec); guess subject; write the subject to the DC record.]
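The parse-and-map steps can be illustrated with a small sketch. It assumes PACS codes appear in the parallel metadata in their usual printed form (for example 03.65.-w) and that the first two number groups are enough to look up a subject string. The three table entries are an illustrative subset that should be checked against the PACS specification, and the function name is hypothetical.

```python
import re

# Illustrative subset only; a real resolver would load the full PACS scheme.
PACS_SUBJECTS = {
    "03.65": "Quantum mechanics",
    "04.20": "Classical general relativity",
    "98.80": "Cosmology",
}

# Two digits, dot, two digits, dot, two trailing characters (e.g. 03.65.-w).
PACS_CODE = re.compile(r"\b(\d{2}\.\d{2})\.[\w+-]{2}")


def resolve_subjects(parallel_metadata):
    """Map each PACS code found in the text to a subject string."""
    subjects = []
    for prefix in PACS_CODE.findall(parallel_metadata):
        subject = PACS_SUBJECTS.get(prefix)
        if subject and subject not in subjects:
            subjects.append(subject)
    return subjects
```

A record for which this returns an empty list is one whose subject would have to be guessed or left unresolved.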


Harvest Post Processing - Statistics

Historical harvest

Archive   #records   #refs
APS       39,064     686,521
arXiv     229,076    4,838,158
CERN      17,055     58,105
NASA      38,688     N/A
Emilio    3,480      N/A

Incremental harvest (rows: APS, arXiv, CERN, NASA; columns: #records, #refs, #Equation, #subject resolved). The cell alignment was lost, so the values are given as extracted:

4,05249607
66,096 0*594 12
37 581
25 48

* Due to a lack of parallel metadata or parse errors in the parallel metadata. Equations are not processed for records whose subject is not resolved.

Archon collection

Unique Authors: 346,315

Unique Subjects: 9,889

Equations (all): 330,503


Web Services and Applications

• We created web services that allow students and teachers to create personal collections.

• These services follow Web Services standards, using SOAP requests and responses for communication between the clients and the services.

• Examples of these services include:
  – Search Service
  – Book Shelf Service
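To make the SOAP exchange concrete, the sketch below builds a request envelope that a client of a search service might send. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `search` operation, and its `query` and `maxResults` parameters are hypothetical, since the slides do not give the Archon service interface.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
ARCHON_NS = "urn:example:archon-search"                 # hypothetical service namespace


def build_search_request(query, max_results=10):
    """Serialize a SOAP 1.1 envelope for a hypothetical `search` operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    operation = ET.SubElement(body, f"{{{ARCHON_NS}}}search")  # hypothetical operation
    ET.SubElement(operation, "query").text = query
    ET.SubElement(operation, "maxResults").text = str(max_results)
    return ET.tostring(envelope, encoding="unicode")
```

A client would POST this document to the service endpoint and parse the response envelope in the same way. In practice a SOAP toolkit generates this plumbing from the service's WSDL description.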


Web Services and Applications

• Book Shelf Service
  – allows each user to have a personalized collection that is a subset of the federation
  – enables teachers to collect course materials and package them in a personalized collection
  – enables students researching a topic to build a special collection that contains all the related documents

• Search Service
  – provides access to all search functionality without the need to use the Archon interface
  – allows each user (e.g., a teacher) to provide a customized client for the collections, with special features tailored to a course’s needs


Conclusions and Future Work

• In our collections, we have gathered about 300K DC metadata records for documents from APS, CERN, arXiv, Emilio, and NASA.

• We also collected 30K parallel metadata records from APS.

• We have also resolved the data of 5.5M references that are cited by the above documents.

• Our performance analysis shows that we can comfortably schedule the OAI harvester to run about once a day and still leave a safety margin for human intervention should the automatic process break down.
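A back-of-the-envelope check of the daily schedule: the calculation below adds up the total times of the batches measured in the performance slides and compares the sum with a 24-hour cycle. It assumes the tasks run one after another and that the measured batches are comparable to a day's increment; neither assumption is stated in the slides.

```python
# Total seconds per measured batch, copied from the performance slides.
BATCH_TOTALS_S = {
    "harvest NCSTRL-NCSU": 80.5,
    "harvest arXiv (from ARC)": 3858.3,
    "harvest APS (DC)": 22.5,
    "harvest APS (parallel)": 125.6,
    "citations APS": 868.9,
    "citations arXiv": 1221.3,
    "citations CERN": 1029.6,
    "subject resolving APS": 68.5,
}

SECONDS_PER_DAY = 24 * 60 * 60


def daily_budget(batch_totals):
    """Return the summed run time and its share of one day."""
    total = sum(batch_totals.values())
    return total, total / SECONDS_PER_DAY


total_s, share = daily_budget(BATCH_TOTALS_S)
print(f"{total_s:.1f} s = {total_s / 3600:.2f} h, {share:.1%} of a 24-hour cycle")
```

Under these assumptions the measured batches add up to about two hours, under a tenth of a day, which leaves most of the cycle as slack.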


Conclusions and Future Work

• We have developed Web Services that can be used for search and discovery of our collections.

• The web services can be used by other developers who want to provide customized or enhanced services, or who want to build additional services on top of those currently provided.

• We have also developed sample client applications such as a bookshelf client that can store a collection of documents and can be used to export them as references (in user defined formats) to help authors in writing research papers.
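The bookshelf idea, a stored collection that can be exported as references in a user-defined format, can be sketched with a small data structure. The class names, fields, and the use of `string.Template` placeholders for the user-defined format are assumptions for illustration, not the design of the actual client.

```python
from dataclasses import dataclass, field
from string import Template
from typing import List


@dataclass
class Document:
    authors: str
    title: str
    year: int


@dataclass
class Bookshelf:
    owner: str
    documents: List[Document] = field(default_factory=list)

    def add(self, doc: Document) -> None:
        """Store a document once, ignoring exact duplicates."""
        if doc not in self.documents:
            self.documents.append(doc)

    def export(self, template: str) -> List[str]:
        """Render each document with a user-supplied $placeholder format."""
        fmt = Template(template)
        return [
            fmt.substitute(authors=d.authors, title=d.title, year=d.year)
            for d in self.documents
        ]
```

For example, `shelf.export("$authors ($year). $title.")` produces one formatted reference per stored document, and a different template string yields a different citation style without changing the stored data.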


Conclusions and Future Work

• We have almost finished putting the federation of CERN, arXiv, and APS into production service. We have partially completed adding NASA, and we plan to collaborate with AIP (American Institute of Physics) to include their collections as well. Once all of these are federated and operating dynamically at the high service level, the Web services should prove attractive, particularly to authors of papers, who can use them to maintain their own bibliographies.


Future Work

• Collections have overlapping holdings, so a strong de-duplication service is needed

• Expand the personalization effort to allow students and researchers to integrate the DL information into their writing of reports and papers

• Test a role-based access system that allows each contributing collection to have different policies for different organizations


Harvest Performance - Harvesting from NCSTRL-NCSU

Operation     Operation Time (s)   Number of Times   Average Time (s)
Identify      0.6                  1                 0.6
DB            7.0                  143               0.1
Resumption    0.1                  2                 0.1
ListRecords   46.2                 2                 23.1
ListSets      24.8                 1                 24.8
Total         80.5                 143               0.6

[1] An entry of ‘0.1’ means a time of less than 0.1 s.


Harvest Performance - Harvesting from arXiv (from ARC)

Table 1. arXiv (from ARC)

Operation     Operation Time (s)   Number of Times   Average Time (s)
Identify      0.17                 1                 0.2
DB            46                   1,000             0.1
Resumption    0.50                 10                0.1
ListRecords   3,805.8              10                380.6
ListSets      5.7                  1                 5.7
Total         3,858.3              1,000             3.9


Harvest Performance - Harvesting from APS (DC)

Table 1. APS (DC)

Operation     Operation Time (s)   Number of Times   Average Time (s)
Identify      0.2                  1                 0.2
DB            9.7                  220               0.1
Resumption    0.1                  4                 0.1
ListRecords   10.3                 11                0.9
ListSets      0.6                  1                 0.6
Total         22.5                 220               0.1


Harvest Performance - Parallel Harvesting from APS

Table 1. APS (Parallel)

Operation     Operation Time (s)   Number of Times   Average Time (s)
Identify      0.2                  1                 0.2
DB            45.1                 906               0.1
Resumption    0.4                  10                0.1
ListRecords   72.1                 11                6.6
ListSets      0.6                  1                 0.6
Total         125.6                906               0.1


Citation Processing Performance - Citation Processing for APS

Table 1. Citation Processing for APS

Operation              Operation Time (s)   Number of Times   Average Time (s)
Adjustment             13.2                 895               0.1
Biblio Parsing         21.8                 906               0.1
Biblio Normalization   N/A                  N/A               N/A
Ref Parsing            106.4                902               0.1
Ref Resolving          624.6                13,129            0.1
Cross-linking          103                  13,129            0.1
Total                  868.9                906               1.0


Citation Processing Performance - Citation Processing for arXiv

Table 1. Citation Processing for arXiv

Operation              Operation Time (s)   Number of Times   Average Time (s)
Adjustment             8.2                  614               0.1
Biblio Parsing         10.1                 614               0.1
Biblio Normalization   21.9                 453               0.1
Ref Parsing            134.2                614               0.2
Ref Resolving          923.7                18,797            0.1
Cross-linking          123.1                18,563            0.1
Total                  1,221.3              614               2


Citation Processing Performance - Citation Processing for CERN

Table 1. Citation Processing for CERN

Operation        Operation Time (s)   Number of Times   Average Time (s)
Adjustment       13.6                 972               0.1
Download HTML    327.9                256               1.3
Download PDF     468.8                181               2.6
Ref Extraction   117.7                179               0.7
Ref Resolving    75                   1,397             0.1
Cross-linking    9.8                  1,397             0.1
DB               16.8                 2,369             0.1
Total            1,029.6              972               1.1


Subject Resolving Performance - APS Subject Resolving

Table 1. APS Subject Resolving

Operation               Operation Time (s)   Number of Times   Average Time (s)
Initial operation       0.1                  1                 0.1
Get parallel metadata   14.3                 1,000             0.1
Parse metadata          7.1                  996               0.1
Map                     2.1                  514               0.1
Update                  19.5                 996               0.1
Flag                    7.1                  996               0.1
Index                   18.5                 1                 18.5
Total                   68.5                 1,000             0.1
