Complex Data Transformations in Digital Libraries with Spatio-Temporal Information B. Martins, N. Freire, J. Borbinha Instituto Superior Técnico, Technical University of Lisbon 2008 International Conference on Asia-Pacific Digital Libraries


Complex Data Transformations in Digital Libraries

with Spatio-Temporal Information

B. Martins, N. Freire, J. Borbinha

Instituto Superior Técnico, Technical University of Lisbon

2008 International Conference on Asia-Pacific Digital Libraries


Introduction and Motivation

• The DIGMAP project addressed the development of a digital library for materials related to old maps

– Collecting metadata from different providers (e.g. OAI-PMH servers)

– Processing the metadata and enriching it with inferred spatio-temporal information

• Challenges in handling heterogeneous metadata

– Transforming the original sources into the DIGMAP format (i.e., TEL profile)

– Dealing with data inconsistency, non-uniformity, incorrectness and incompleteness

– Handling the spatio-temporal information (e.g. dates and geospatial coordinates)

• Challenges in DIGMAP service interoperability

– Using the results from DIGMAP services to enrich the metadata

• DIGMAP required appropriate XML processing technology for dealing with the above challenges


The Proposed Solution

• Use XML processing languages like XSLT and XQuery

• Extend the XPath 2.0 function library

– Functions for managing geospatial information

– Functions for managing temporal information

– Functions for text processing

– Other miscellaneous functions

• All the advantages of declarative languages like XSLT and XQuery, together with powerful methods for handling complex transformations


Outline

• Introduction

• Proposed Extensions to the XPath Function Library

• Implementation Issues

• Test Cases Within the DIGMAP Project

• Conclusions and Future Work


The Proposed Extensions

• Extensions for geospatial data handling

– Combining spatial elements according to geospatial predicates such as distance or intersection

– Input given in GML, KML or textual strings with geospatial coordinates

• Extensions for temporal reasoning

– Combining temporal information according to the predicates of Allen’s algebra for temporal intervals

– Input given in GML or string encodings (e.g. the ISO 8601 formats)

• Extensions for text mining

– Keyword matching and textual similarity

– Standard text mining operations (e.g. language recognition)

• Other miscellaneous extensions

– Handling JDBC calls and calls to external Web services


Geospatial Data Handling

• Operators for performing geospatial analysis based on the OGC Simple Features and Filter Encoding specifications

– Distance, union, intersection or difference between two geometries

– Validity of a given spatial filter

– Check if two geometries are spatially related (e.g. containment or overlap)

– Check if two geometries fall below a given distance threshold

– Area, length, buffer, centroid, boundary or envelope of a geometry

– Geometric computations (e.g. translation or scaling) over a geometry

– Conversion between GML, KML, C-Square, Geohash or WKT encodings

– Transformations on the coordinate systems used in geometries
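
Two of the operators listed above can be sketched in a few lines. The slides describe the actual functions as XPath extensions backed by Java libraries, so the Python below is only an illustration under simplifying assumptions: geometries are reduced to single WGS84 points, the distance is a great-circle (haversine) distance, and the function names `haversine_km`, `within_distance` and `geohash_encode` are hypothetical, not the names used in DIGMAP.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(p, q, threshold_km):
    """True if points p and q, given as (lat, lon), are at most threshold_km apart."""
    return haversine_km(p[0], p[1], q[0], q[1]) <= threshold_km

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet

def geohash_encode(lat, lon, length=8):
    """Encode a point as a Geohash string by interleaving longitude and latitude bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # the first bit always refines the longitude range
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# Usage: two points given as (lat, lon); the coordinates are illustrative values.
lisbon, porto = (38.7223, -9.1393), (41.1579, -8.6291)
print(within_distance(lisbon, porto, 300.0))
print(geohash_encode(57.64911, 10.40744, 3))  # "u4p"
```

A production implementation would operate on full geometries (lines, polygons) and on explicit coordinate reference systems, which is what a geometry library provides and this sketch does not.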


Temporal Data Handling

• Operators for temporal analysis based on Allen's interval algebra

– Distance, union, intersection or difference between temporal intervals

– Check if two intervals are related (e.g. containment or overlap)

• Other operators for temporal data handling

– Compute lengths for temporal intervals (e.g. return seconds or years)

– Conversion between GML and string encodings
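
The temporal operators above can be illustrated with a small sketch of Allen's thirteen interval relations. It assumes intervals are given as ISO 8601 `start/end` strings of calendar dates and that each interval has a start strictly before its end; the function names (`parse_interval`, `allen_relation`, `intersection`, `length_days`) are hypothetical and do not come from the slides.

```python
from datetime import date

def parse_interval(text):
    """Parse an ISO 8601 'start/end' interval of calendar dates."""
    start_text, end_text = text.split("/")
    start, end = date.fromisoformat(start_text), date.fromisoformat(end_text)
    if not start < end:
        raise ValueError("interval start must precede its end")
    return start, end

def allen_relation(a, b):
    """Name the Allen relation that holds between interval a and interval b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only the two partial-overlap cases remain at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"

def intersection(a, b):
    """Return the shared sub-interval of a and b, or None if they share no span."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None

def length_days(interval):
    """Length of an interval in whole days."""
    return (interval[1] - interval[0]).days

# Usage: the date ranges are illustrative values.
a = parse_interval("1500-01-01/1600-01-01")
b = parse_interval("1550-01-01/1650-01-01")
print(allen_relation(a, b))  # "overlaps"
print(intersection(a, b))
```

Dates here are interpreted in Python's proleptic Gregorian calendar, which is a simplification for historical materials such as old maps.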


Textual Data Handling

• Keyword matching and textual similarity

– Tokenization and keyword-based search

– Phonetic similarity (Soundex and Double Metaphone)

– String similarity (e.g. Edit Distance, Jaro, Jaro-Winkler, Q-grams, …)

• Standard text mining operations

– Language recognition

– Keyword extraction (statistically significant keywords)

– Named entity recognition (regexp, dictionaries or machine learning)

– Text classification (machine learning)
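
As an illustration of the similarity measures named above, the following sketch implements edit (Levenshtein) distance and American Soundex from scratch. It is a minimal version that assumes plain ASCII names; the function names are hypothetical, and the project itself is described as wrapping existing libraries rather than reimplementing these measures.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance normalised to the range [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit

def soundex(word):
    """American Soundex code: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = _SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants that share a code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code after a vowel is kept
    return (first + "".join(encoded) + "000")[:4]

# Usage: place-name variants of the kind found in heterogeneous metadata.
print(edit_distance("Lisboa", "Lisbon"))      # 1
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Phonetic codes such as Soundex were designed for English surnames, so their usefulness for multilingual place names is something to evaluate rather than assume.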


Miscellaneous Functions

• Calling external Web services (REST and SOAP)

• Conversion from XML to JavaScript Object Notation (JSON)

• Handling Java DataBase Connectivity (JDBC) calls

• Reading malformed HTML

• Converting MARC formats into XML (MarcXml or MarcXchange)

• …
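
The XML-to-JSON conversion mentioned above can be sketched as follows. There is no single standard mapping between the two formats, and the slides do not say which one was used; this sketch assumes a common convention in which attributes get an `@` prefix, mixed text goes under `#text`, and repeated child elements become a list. The names `element_to_dict` and `xml_to_json` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Map an element to a dict, or to a plain string if it only holds text."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag turns the entry into a list of values.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        if not node:
            return text  # text-only element
        node["#text"] = text
    return node

def xml_to_json(xml_string):
    """Convert a well-formed XML document into a JSON string."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)}, ensure_ascii=False)

# Usage: a made-up metadata record, not an actual DIGMAP record.
record = ('<record id="1"><title>Mapa de Portugal</title>'
          '<subject>maps</subject><subject>Portugal</subject></record>')
print(xml_to_json(record))
```

This version ignores namespaces, tail text and element order across different tags, all of which a complete converter would need to address.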


Implementation Issues

• Proposed extensions implemented on top of SAXON

– SAXON is an open source XSLT/XQuery processor

– Extension functions coded in Java (static methods)

– Extension functions called by binding the Java class to a specific namespace

– SAXON takes care of converting the arguments to make the functions fit

• Most extensions are wrappers over existing open-source libraries

– GeoTools and Java Topology Suite (JTS) for the geospatial functions

– Lucene and Nux for keyword matching

– SimPack for textual similarity

– NGramJ and LingPipe for text mining

– MARC4J for metadata crosswalks (i.e. handling MARC formats)

– Apache AXIS for external Web service calls


Test Cases Within DIGMAP

• Conversion between different metadata standards

– Converting UNIMARC, MARC21 and other formats into the DIGMAP format

– Geospatial coordinates were often given originally in general textual fields

– DIGMAP currently indexes over 40,000 metadata records from different sources

• Wrappers around DIGMAP XML service interfaces

– The DIGMAP Gazetteer uses formats like Alexandria DL Gazetteer Service format, KML, geoRSS, …

– The DIGMAP GeoParser uses formats like SpatialML, geoRSS, OGC GeoParser, …

– Converting between the different formats and calling the services for processing the metadata records

• Internal development of several DIGMAP services

– Data integration within the DIGMAP Gazetteer

– Converting different input sources into the Alexandria DL Gazetteer Content Standard

– Handling duplicates and small corrections to the data

• The proposed approach was found to be expressive and computational performance was within acceptable bounds
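
The first test case mentions coordinates buried in general textual fields. A minimal sketch of how such values might be pulled out is given below. It assumes one particular notation (hemisphere letter, degrees, optional minutes and seconds, as in `W 9°30'`); real records vary much more widely, and the names `extract_coordinates` and `bounding_box` are hypothetical.

```python
import re

# Hemisphere letter, degrees, then optional minutes and seconds.
_DMS = re.compile(
    r"([NSEW])\s*(\d{1,3})\s*°"            # hemisphere and degrees
    r"(?:\s*(\d{1,2})\s*['′])?"            # optional minutes
    r'(?:\s*(\d{1,2}(?:\.\d+)?)\s*["″])?'  # optional seconds
)

def extract_coordinates(text):
    """Return (hemisphere, signed decimal degrees) pairs found in free text."""
    found = []
    for hem, deg, minutes, seconds in _DMS.findall(text):
        value = int(deg) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
        if hem in "SW":
            value = -value  # south and west are negative by convention
        found.append((hem, value))
    return found

def bounding_box(text):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if either axis is missing."""
    coords = extract_coordinates(text)
    lons = [v for hem, v in coords if hem in "EW"]
    lats = [v for hem, v in coords if hem in "NS"]
    if not lons or not lats:
        return None
    return (min(lons), min(lats), max(lons), max(lats))

# Usage: a made-up cartographic statement in a general note field.
note = "Scale 1:500 000 (W 9°30'--W 6°00'/N 42°15'--N 36°45')"
print(bounding_box(note))  # (-9.5, 36.75, -6.0, 42.25)
```

Once a bounding box is available in decimal degrees, it can be serialised into whichever geometry encoding the target metadata format expects.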


An Example XQuery

An XQuery for reading gazetteer data from an HTML source and converting the data into the Alexandria DL Gazetteer Content format


Conclusions

• Data transformations in Digital Libraries can be very complex

– Standard XML processing technology is often not enough

– But simple extensions can add the required extra functionality

• We propose using extension functions to the XPath 2.0 library

– Declarative syntax of XSLT and XQuery is not affected

– Extension functions add the required extra functionality

• Used in DIGMAP collection building and service composition

– Converting between different metadata formats

– Handling the spatio-temporal information included in the metadata

– Calling DIGMAP services to enrich the metadata records


Currently Ongoing Work

• Implementing a visual interface for encoding the metadata transformations

• Visual “pipelines” converted into XQuery instructions

• Hide the complexity of the XSLT/XQuery languages from non-expert users


Thanks for your attention.

www.digmap.eu

http://transform.digmap.eu