Visit to HP Labs, 22/10/2002

Heterogeneous information integration

Alex Poulovassilis

Database and Web Technologies Group

School of Computer Science and Information Systems

Birkbeck, University of London

Research in CS & IS at Birkbeck

Main groups:

• Database and Web Technologies

• Computational Intelligence

• Bioinformatics

• Software Engineering

Main research funding sources: EPSRC, BBSRC, EU, Wellcome Trust, HEFCE, industry

URL http://www.dcs.bbk.ac.uk/~research/groups.html

Teaching in CS & IS at Birkbeck

• Foundation Degree in IT (part-time)

• BSc Computing (pt)

• BSc Information Systems and Management (pt)

• MSc Computing Science (ft and pt)

• MSc in Advanced Information Systems (ft and pt)

• MRes in Computer Science (ft and pt)

• MPhil/PhD in Computer Science (ft and pt)

URL http://www.dcs.bbk.ac.uk/~courses/

[Figure: three schemas combined into an integrated schema]

Background

In earlier work with Peter McBrien (ER’97, IS’98, DKE’98) we developed a new framework to support transformation and integration of heterogeneous database schemas.

Our framework consisted of:

• a new notion of schema equivalence

• a set of primitive schema transformations which can be composed to define unconditional or conditional equivalences between schemas

Background

We represent the modelling constructs of higher-level data models (e.g. relational, object-oriented, semi-structured, XML) in terms of a hypergraph data model (HDM)

The HDM common data model provides a unifying semantics for such higher-level modelling constructs
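
As a rough illustration of the idea (not the AutoMed implementation), a low-level schema can be held as sets of nodes, edges and constraints, and a relational table can be expressed in those terms. The class name `HDMSchema`, the function `relational_table_to_hdm`, the restriction to binary edges and the encoding of a table as one node per table and per column are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class HDMSchema:
    """Hypothetical container for a schema made of nodes, edges and constraints."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)        # (edge_name, node_a, node_b)
    constraints: set = field(default_factory=set)  # free-form tuples


def relational_table_to_hdm(table, key, columns):
    """Encode a table as a node for the table plus one node and one edge per column."""
    schema = HDMSchema()
    schema.nodes.add(table)
    for col in columns:
        col_node = f"{table}:{col}"
        schema.nodes.add(col_node)
        schema.edges.add((col, table, col_node))
    # Assumed constraint encoding: the key column identifies each table instance.
    schema.constraints.add(("key", table, key))
    return schema


programme = relational_table_to_hdm("Programme", key="id", columns=["id", "category"])
print(sorted(programme.nodes))
```

Constructs from other modelling languages (classes, XML elements) would be encoded with the same three ingredients, which is what lets them sit side by side in one intermediate schema.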

Background

Our schema transformations allow constructs from different modelling languages to be mixed within the same intermediate schema (CAiSE’99)

Our schema transformations are automatically reversible, setting up a two-way transformation pathway between pairs of schemas:

addSubClass Film Prog

addSubClass Doc Prog

addSubClass Series Prog

addClass Series [p | (p,S) ∈ category]

addClass Doc [p | (p,D) ∈ category]

addClass Film [p | (p,F) ∈ category]

addClass Prog [p | (p,c) ∈ category]

delRel category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

delSubClass Film Prog

delSubClass Doc Prog

delSubClass Series Prog

delClass Series [p | (p,S) ∈ category]

delClass Doc [p | (p,D) ∈ category]

delClass Film [p | (p,F) ∈ category]

delClass Prog [p | (p,c) ∈ category]

addRel category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]
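
The relationship between the forward and reverse pathways can be sketched in a few lines: each add step becomes the matching del step with the same arguments, and the steps are applied in the opposite order. This is only an illustration. The tuple encoding of steps, the function names and the step ordering chosen for `forward` are assumptions, and the queries are treated as opaque strings.

```python
def reverse_step(step):
    """Swap the add/del prefix of one primitive step, keeping its arguments."""
    op, *args = step
    if op.startswith("add"):
        return ("del" + op[3:], *args)
    if op.startswith("del"):
        return ("add" + op[3:], *args)
    raise ValueError(f"unknown primitive: {op}")


def reverse_pathway(pathway):
    """Invert every step and apply the steps in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]


# Assumed ordering: classes first, then subclass links, then removal of the relation.
forward = [
    ("addClass", "Prog", "[p | (p,c) ∈ category]"),
    ("addClass", "Film", "[p | (p,F) ∈ category]"),
    ("addClass", "Doc", "[p | (p,D) ∈ category]"),
    ("addClass", "Series", "[p | (p,S) ∈ category]"),
    ("addSubClass", "Film", "Prog"),
    ("addSubClass", "Doc", "Prog"),
    ("addSubClass", "Series", "Prog"),
    ("delRel", "category",
     "[(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]"),
]

backward = reverse_pathway(forward)
for step in backward:
    print(*step)
```

Reversing `backward` again yields `forward`, which is the sense in which a single pathway sets up a two-way mapping between the schemas.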

addConstraint subset Film Prog

addConstraint subset Doc Prog

addConstraint subset Series Prog

addNode Series [p | (p,S) ∈ category]

addNode Doc [p | (p,D) ∈ category]

addNode Film [p | (p,F) ∈ category]

addNode Prog [p | (p,c) ∈ category]

delEdge category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

delNode Programme Prog

delNode Category [F,D,S]

delConstraint subset Film Prog

delConstraint subset Doc Prog

delConstraint subset Series Prog

delNode Series [p | (p,S) ∈ category]

delNode Doc [p | (p,D) ∈ category]

delNode Film [p | (p,F) ∈ category]

delNode Prog [p | (p,c) ∈ category]

addEdge category [(p,F) | p ∈ Film] U [(p,D) | p ∈ Doc] U [(p,S) | p ∈ Series]

addNode Programme Prog

addNode Category [F,D,S]

Query and Data Translation

These pathways can thus be used to automatically translate data and queries between schemas (ER’99)

From a pathway T: S → S’ we:

• compose the queries in the add steps to derive a definition of each construct in S’ as a view over S, and

• compose the queries in the del steps to derive a definition of each construct in S as a view over S’

These view definitions can then be used to automatically translate data and queries between S and S’
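
A minimal sketch of this idea, using the programme example from the earlier slides: the queries in the add steps give each target construct as a view over the source, and the query in the del step gives the source construct as a view over the target. The sample rows and the function names are invented for illustration, and Python set comprehensions stand in for the queries.

```python
# Invented sample extent of the source construct category(programme, code).
category = {("Alien", "F"), ("Planet Earth", "D"), ("Friends", "S")}


def target_views(category):
    """Views of the target constructs over the source, taken from the add steps."""
    return {
        "Prog": {p for (p, c) in category},
        "Film": {p for (p, c) in category if c == "F"},
        "Doc": {p for (p, c) in category if c == "D"},
        "Series": {p for (p, c) in category if c == "S"},
    }


def source_view(target):
    """View of the source construct over the target, taken from the del step."""
    return (
        {(p, "F") for p in target["Film"]}
        | {(p, "D") for p in target["Doc"]}
        | {(p, "S") for p in target["Series"]}
    )


target = target_views(category)   # translate data from S to S'
restored = source_view(target)    # translate it back from S' to S
print(restored == category)
```

Because both directions are available from the one pathway, data and queries can be moved either way without writing a second mapping by hand.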

Both-As-View integration

Our schema transformation pathways capture at least the information available from global-as-view (GAV) or local-as-view (LAV) view definitions

We discuss this in a forthcoming paper (ICDE’03) and term our integration approach both-as-view (BAV)

In particular, we discuss how

• GAV and LAV view definitions can be derived from a BAV specification

• a BAV specification can be partially derived from a set of GAV or LAV view definitions

Schema Evolution

Unlike GAV and LAV, our framework readily supports the evolution of both local and global schemas (CAiSE’02, ICDE’03)

The first step is to define the evolution of the global or local schema as a schema transformation pathway from the old to the new schema

There is then a systematic way of evolving, as opposed to re-generating, the transformation pathways – and perhaps the global schema in the case of a local schema evolution

Schema Evolution

In particular (see CAiSE’02 and ICDE’03 for details):

• if the evolved schema is semantically equivalent to the original schema, then the transformation network can be repaired automatically

• if the evolved schema is a contraction of the original schema, the transformation network can again be repaired automatically

• if the evolved schema is an extension of the original schema, then domain knowledge may be required (but again the network is evolved rather than regenerated)

The AutoMed Project (funded by EPSRC, at Birkbeck and Imperial College)

The aims of the AutoMed project are to investigate:

• how our theoretical framework can be practically applied to real data integration problems

• how much of a mediator’s global query processing functionality can be automatically generated from our transformation pathways

• evolutionary and heuristic techniques for schema improvement and global query optimisation

AutoMed Architecture

[Figure: architecture diagram with the following components]

• Global Query Processor

• Global Query Optimiser

• Schema Evolution Tool

• Schema Transformation and Integration Tool

• Model Definition Tool

• Schema and Transformation Repository

• Model Definitions Repository

Query Processing and Optimisation

We are handling query language heterogeneity by translation into and from a functional intermediate query language, IQL (work by Edgar Jasper)

A query Q expressed in a high-level query language on a global schema S is first translated into IQL

GAV view definitions are derived from the transformation pathways from the local schemas to S, and are used to reformulate the query into an IQL query over the local schema constructs

A LAV query processing approach would also be possible
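
The GAV reformulation step can be pictured as view unfolding: every global construct in the query is replaced by its definition over local constructs. The sketch below is not IQL. The nested-tuple query encoding, the function names `unfold` and `evaluate`, the source names `src1.Film` and `src2.Doc` and the sample data are all assumptions made for illustration.

```python
def unfold(query, views):
    """Replace every global construct in the query by its GAV view definition."""
    kind = query[0]
    if kind == "construct":
        name = query[1]
        # Local constructs have no view definition and are left as they are.
        return unfold(views[name], views) if name in views else query
    if kind == "union":
        return ("union", *(unfold(q, views) for q in query[1:]))
    if kind == "select":
        return ("select", query[1], unfold(query[2], views))
    raise ValueError(f"unknown query form: {kind}")


def evaluate(query, extents):
    """Evaluate a query that mentions only local constructs."""
    kind = query[0]
    if kind == "construct":
        return set(extents[query[1]])
    if kind == "union":
        return set().union(*(evaluate(q, extents) for q in query[1:]))
    if kind == "select":
        return {x for x in evaluate(query[2], extents) if query[1](x)}
    raise ValueError(f"unknown query form: {kind}")


def starts_with_a(title):
    return title.startswith("A")


# Hypothetical GAV view: the global construct Prog over two local sources.
views = {"Prog": ("union", ("construct", "src1.Film"), ("construct", "src2.Doc"))}
local_extents = {"src1.Film": {"Alien", "Brazil"}, "src2.Doc": {"Arena", "Horizon"}}

query = ("select", starts_with_a, ("construct", "Prog"))
reformulated = unfold(query, views)
result = evaluate(reformulated, local_extents)
print(sorted(result))
```

After unfolding, the query refers only to local constructs, so each leaf can be handed to the wrapper for the corresponding source.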

Query Processing and Optimisation

Query optimisation and query evaluation then occur

Specific issues for query optimisation in AutoMed include:

• optimising the view definitions derived from the transformation pathways, and

• handling heterogeneous modelling constructs appearing within these view definitions

For query evaluation, wrappers will undertake translation of IQL sub-queries into the local query language, and translation of results back into the IQL type system. Further post-processing is possible.

XML Data Sources

As well as integration of structured data sources, we have done some preliminary work on translating and integrating XML data (CAiSE’01)

We have defined a representation of XML in terms of the nodes, edges and constraints of the HDM

We capture the ordering of XML elements by an order node and a hyperedge to it from the edge representing the parent-child relationship

Translating XML into HDM

<customer name="Jones">
  <account number="A14"/>
  <account number="B37"/>
</customer>

<customer name="Smith">
  <account number="C514"/>
  <account number="D438"/>
</customer>

[Figure: HDM graph for this fragment, with nodes labelled root, customer, name, account and number, plus two order nodes]
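
To make the translation concrete, the sketch below walks the customer fragment with Python's standard `xml.etree.ElementTree` and records each element as a node and each parent-child link as an edge carrying the child's position among its siblings. The position stands in for the order node described earlier. The function name `xml_to_graph`, the path-style identifiers and the tuple layout are assumptions of this sketch, and the fragment is wrapped in a `<root>` element so that it parses as a single document.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, wrapped in a <root> element so that it parses.
XML = """
<root>
  <customer name="Jones">
    <account number="A14"/>
    <account number="B37"/>
  </customer>
  <customer name="Smith">
    <account number="C514"/>
    <account number="D438"/>
  </customer>
</root>
"""


def xml_to_graph(element, path="root", nodes=None, edges=None):
    """Collect element instances and (parent, child, order) edge instances."""
    nodes = [] if nodes is None else nodes
    edges = [] if edges is None else edges
    nodes.append((path, dict(element.attrib)))
    for position, child in enumerate(element, start=1):
        # The identifier is just a unique label, not an XPath expression.
        child_path = f"{path}/{child.tag}[{position}]"
        # The third component records sibling order.
        edges.append((path, child_path, position))
        xml_to_graph(child, child_path, nodes, edges)
    return nodes, edges


nodes, edges = xml_to_graph(ET.fromstring(XML.strip()))
for edge in edges:
    print(edge)
```

Keeping the order on the edge rather than on the child node means the same element could, in principle, be ordered differently under different parents.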

XML Data Sources

We have also defined a set of primitive transformations on XML (in terms of the underlying transformations on the equivalent HDM representation)

XML documents are then translated into a simple ER representation, which allows them to be integrated with each other and with other structured data sources

The above has been implemented by Tanvir Faqueer

He is now looking at automatic or semi-automatic transformation and integration of the ER models arising from XML documents

Unstructured Text Sources

We are also working on extracting structure from unstructured text sources – Dean Williams

The aim here is to integrate information extracted from unstructured text with structured or semi-structured information available from other sources

We are using existing IE technology (the GATE tool) for text annotation. Natural language and domain ontologies will extend these annotations

The extracted information will be matched with existing information in the to derive new facts and perhaps new global schema constructs

Materialised integration

As well as virtual integration of data sources, we are also investigating using the AutoMed framework for materialised integration, i.e. a data warehousing approach

In particular, we are looking at incremental view maintenance and data lineage tracing using the AutoMed schema transformation pathways – Hao Fan
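
As a small illustration of incremental view maintenance in general (not of the AutoMed algorithms), the sketch below updates a materialised Film view from the changed rows of category only, rather than recomputing it from the whole relation. It assumes Film is defined as [p | (p,F) ∈ category] with category a set. The function name and sample titles are invented.

```python
def maintain_film_view(film_view, inserted=(), deleted=()):
    """Apply row-level changes on category to the materialised view Film.

    Film is assumed to be defined as [p | (p,F) in category], with category a set.
    """
    updated = set(film_view)
    updated |= {p for (p, c) in inserted if c == "F"}
    updated -= {p for (p, c) in deleted if c == "F"}
    return updated


film = {"Alien"}
film = maintain_film_view(film, inserted=[("Brazil", "F"), ("Horizon", "D")])
film = maintain_film_view(film, deleted=[("Alien", "F")])
print(film)
```

Only rows whose code is F touch the view, so the documentary row inserted above is ignored.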

Event-Condition-Action Rules for XML

XML is becoming a standard means of storing and exchanging information on the Web

XML repositories are increasingly being used in dynamic applications where actions need to be taken in a timely fashion in response to updates to the data

Periodic querying is not sufficient – may be too infrequent, or too frequent

Thus, there is a need for reactive functionality on XML repositories: event-condition-action (ECA) rules are a natural candidate

ECA Rules for XML

ECA rules take the form: on event if condition do action

[Figure: ECA processing diagram with components Users/Apps, Event Detection, Condition Evaluation and Action Execution]

ECA Rules for XML

We are currently developing an ECA rule language for XML, with James Bailey and Peter Wood (WWW’2002):

ON INSERT path | DELETE path

IF condition

DO INSERT subdocument BELOW path |

DELETE path
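
The behaviour of such a rule can be approximated in a few lines of Python over an `xml.etree.ElementTree` document. This is a toy sketch, not the rule language itself: the `Rule` class, the `insert` helper, the shop document and the alert rule are all invented, events are matched on the inserted element's tag rather than on a path expression, and actions do not trigger further rules.

```python
import xml.etree.ElementTree as ET


class Rule:
    """Hypothetical ECA rule: event tag, condition predicate, action callback."""

    def __init__(self, on_insert_tag, condition, action):
        self.on_insert_tag = on_insert_tag
        self.condition = condition
        self.action = action


def insert(doc, parent_path, new_element, rules):
    """Insert new_element below parent_path, then fire every matching rule once."""
    parent = doc if parent_path == "." else doc.find(parent_path)
    parent.append(new_element)
    for rule in rules:
        if rule.on_insert_tag == new_element.tag and rule.condition(doc, new_element):
            rule.action(doc, new_element)


doc = ET.fromstring("<shop><orders/><alerts/></shop>")


# ON INSERT order IF quantity > 100 DO INSERT <alert/> BELOW alerts
def large_order(doc, order):
    return int(order.get("quantity", "0")) > 100


def raise_alert(doc, order):
    ET.SubElement(doc.find("alerts"), "alert", {"order": order.get("id")})


rules = [Rule("order", large_order, raise_alert)]
insert(doc, "orders", ET.Element("order", {"id": "o1", "quantity": "250"}), rules)
insert(doc, "orders", ET.Element("order", {"id": "o2", "quantity": "5"}), rules)
print(ET.tostring(doc, encoding="unicode"))
```

Only the first insertion satisfies the condition, so a single alert element is added. A fuller engine would also need to decide what happens when an action itself raises events.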

SeLeNe – Self e-Learning Networks (EU FP5)

We are planning to extend this work to ECA rules on RDF, as part of the SeLeNe project

SeLeNe is a technical feasibility study in using Semantic Web technology for dynamically integrating metadata from heterogeneous and autonomous learning resources, and for creating personalised views over this Knowledge Grid.

ECA rules will be used for incremental maintenance of derived learning objects defined as views over source learning objects