
1

Grid Based Data Integration with Automatic Wrapper Generation

Xuan Zhang Gagan Agrawal

Ohio State University

2

Overall Goal

Tools for data integration driven by:
  Data explosion
    Data size & number of data sources
  New analysis tools and need for workflows
  Autonomous resources
    Heterogeneous data representation & various interfaces

3

Motivation (Contd.)

Other Issues:
  Frequent updates to data formats
  Flat-file datasets
  Ad-hoc sharing of data

4

Current Approaches

Manually written wrappers
  Problems
    O(N^2) wrappers needed; a single format update requires rewriting O(N) of them
    Portability of wrappers in a distributed environment
Mediator-based integration systems
  Problems
    Need a common intermediate format
    Unnecessary data transformation
Integration using web/grid services
  Needs all tools to be web services (all data in XML?)

5

Our Approach

Automatically generate wrappers
  One layout descriptor per resource
  Stand-alone wrapper programs
  For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
  No domain- or format-specific heuristics
  Layout information provided by users

6

Our Approach (Contd.)

Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
Particularly attractive for
  Data grid environments and workflows
  Flat-file datasets
  Ad hoc data sharing

7

Our Approach: Advantages

No need to write wrappers while integrating data or creating workflows

Only one descriptor per resource needed

No unnecessary transformations / storage

New resources can be integrated on-the-fly
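To make the counting argument behind "one descriptor per resource" concrete, here is a small sketch. It assumes, as an illustration only, that hand-written wrappers are needed for every ordered pair of formats; the function names are hypothetical and not part of the system described in these slides.

```python
def pairwise_wrappers(n: int) -> int:
    """Hand-written wrappers for every ordered pair of n formats: n * (n - 1)."""
    return n * (n - 1)


def wrappers_touched_by_update(n: int) -> int:
    """Wrappers that read or write one changed format: 2 * (n - 1)."""
    return 2 * (n - 1)


def layout_descriptors(n: int) -> int:
    """With generated wrappers, one layout descriptor per resource."""
    return n


n = 10  # illustrative number of formats, not a figure from the talk
print(pairwise_wrappers(n), wrappers_touched_by_update(n), layout_descriptors(n))
```

Under this assumption, a format change means rewriting every wrapper that touches that format, whereas with descriptors only the one descriptor for the changed resource is edited.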

8

Our Approach: Challenges

Description language
  Format and logical view of data in flat files
  Easy to interpret and write
Wrapper generation and execution
  Correspondence between data items
  Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  What data mining techniques to use? (DILS 2005, BIBE 2005)

9

Wrapper Generation System Overview

[Figure: system architecture diagram. Labels recovered: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset.]

10

Suitability for a Grid Environment

Wrapper analysis can be implemented as a grid service
  Very low execution costs
Wrapper execution modules are task-independent
  Just need to port three modules to different systems

11

Assumptions for the Current Prototype

One dataset is tabular, the other semi-structured
Both datasets are stored record-wise
Order of records not disturbed
Suitable for bioinformatics
  Semi-structured tabular

12

Layout Description Language

Goal
  To describe data in arbitrary flat file format
  Easy to interpret and write
Components:
  1. Schema description
  2. Layout description
Example: FASTA

13

Layout Description Language

Key observations on data layout
  Strings of variable length
  Delimiters widely used
  Data fields divided into variables
  Repetitive structures
Key tokens
  “constant string”
  LINESIZE
  [optional]
  <repeating>
  …

Sample data:
  …
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …

14


Component I: Schema Description

  [FASTA]               //Schema Name
  ID = string           //Data type definitions
  DESCRIPTION = string
  SEQ = string

Sample data:
  …
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …

15


Component II: Layout Description

  …
  LOOP ENTRY 1:EOF:1 {
      “>” ID “ ” DESCRIPTION
      < “\n” SEQ >
      “\n” | EOF
  }
  …

Sample data:
  …
  >seq1 comment1 \n
  ASTPGHTIIYEAVCLHNDRTTIP \n
  >seq2 comment2 \n
  ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
  NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
  KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
  >seq3 …
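As a rough illustration of what the FASTA layout description expresses, the following is a hand-written reader that follows the same pattern: a ">" marker, an ID, a space, a DESCRIPTION, then one or more newline-separated SEQ lines per entry. It is only a sketch, not the wrapper the system generates; the function name `read_fasta_entries` is hypothetical, and the second sequence in the sample is shortened for brevity.

```python
import re
from typing import Dict, Iterator


def read_fasta_entries(text: str) -> Iterator[Dict[str, str]]:
    """Yield one {ID, DESCRIPTION, SEQ} record per ">" entry."""
    # Split at every line that starts with ">"; drop anything before the first entry.
    for chunk in re.split(r"^>", text, flags=re.MULTILINE)[1:]:
        header, _, body = chunk.partition("\n")        # header line vs. sequence lines
        ident, _, description = header.partition(" ")  # ">" ID " " DESCRIPTION
        # < "\n" SEQ >: the repeated sequence lines are joined into one SEQ value.
        seq = "".join(line.strip() for line in body.splitlines())
        yield {"ID": ident, "DESCRIPTION": description.strip(), "SEQ": seq}


sample = (
    ">seq1 comment1\n"
    "ASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\n"
    "ASQKRP\n"   # shortened sequence lines, for illustration only
    "NMYKDS\n"
)
records = list(read_fasta_entries(sample))
for record in records:
    print(record)
```

The point of the descriptor approach is that this parsing logic does not have to be written by hand for each format; the sketch only shows which fields the descriptor identifies.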

16

Mapping Cardinality

Source (TRANSFAC):
  …
  FA factor1_name
  …
  RA reference1.1_authors
  …
  RA reference1.2_authors
  …
  RA reference1.3_authors
  …

Target (Reference table):
  …  FA            …  RA                    …
  …  …             …  …                     …
  …  factor1_name  …  reference1.1_authors  …
  …  factor1_name  …                        …
  …  factor1_name  …                        …
  …  …             …  …                     …

Labels in the figure: One-to-multiple data field; One-to-one data field
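The cardinality issue can be sketched in a few lines: a single FA value in the source entry has to be repeated in every target row, while each RA value is used once. The code below is an illustrative sketch only; it assumes each source line is a two-letter tag followed by whitespace and a value, as in the simplified entry on this slide, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def transfac_to_reference_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit one (FA, RA) row per RA line, repeating the most recent FA value."""
    current_factor = None
    for line in lines:
        tag, _, value = line.strip().partition(" ")
        if tag == "FA":
            current_factor = value.strip()          # remembered for later rows
        elif tag == "RA" and current_factor is not None:
            yield (current_factor, value.strip())   # FA repeated, RA consumed once


entry = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]
rows = list(transfac_to_reference_rows(entry))
for row in rows:
    print(row)
```

Each of the three rows carries the same factor name, which is the repetition visible in the FA column of the target table.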

17

Analyzing Application Goals - WRAPINFO

Summarize all application-related information necessary for the wrapper

Represent the information in look-up tables and constant parameters

Represent the information in a platform-independent format, XML

18

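Wrapper Generated

[Figure: structure of the generated wrapper. Labels recovered: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_multiple_values, one_to_one_values), DataWriter, Output dataset, Synchronizer. Other labels in the figure: load, run, FA, RA, run, RA, halt.]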

19

Wrapper Generated

Suitable for data grid
Three general modules
  DataReader
    Extract one data field value
    Write value to the value buffer if useful
  DataWriter
    Write one data field value
    Remove value from list in the value buffer
  Synchronizer
    Switch between calling DataReader and DataWriter
    Manage dataset buffer
Application-specific information in WRAPINFO
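The division of labor among the three modules might look roughly like the sketch below. This is a simplified illustration under several assumptions, not the generated wrapper itself: the class interfaces are hypothetical, the set of useful tags stands in for what WRAPINFO would supply, the "ZZ" line is a made-up field that the target does not need, and dataset-buffer management is not modeled.

```python
from collections import deque


class DataReader:
    """Extracts one field value per call; buffers it only if the target needs it."""

    def __init__(self, lines, useful_tags, value_buffer):
        self._lines = iter(lines)
        self._useful = useful_tags
        self._buffer = value_buffer

    def read_one(self):
        line = next(self._lines, None)
        if line is None:
            return None  # end of input
        tag, _, value = line.strip().partition(" ")
        if tag in self._useful:
            self._buffer.setdefault(tag, deque()).append(value.strip())
        return tag


class DataWriter:
    """Writes one field value per call and removes it from the value buffer."""

    def __init__(self, value_buffer, output):
        self._buffer = value_buffer
        self._output = output

    def write_one(self, tag):
        self._output.append((tag, self._buffer[tag].popleft()))


class Synchronizer:
    """Switches between DataReader and DataWriter until the input is exhausted."""

    def __init__(self, reader, writer, value_buffer):
        self._reader = reader
        self._writer = writer
        self._buffer = value_buffer

    def run(self):
        while True:
            tag = self._reader.read_one()
            if tag is None:
                break
            if self._buffer.get(tag):  # a value is waiting for this field
                self._writer.write_one(tag)


value_buffer, output = {}, []
source = ["ZZ ignored_value", "FA factor1_name", "RA reference1.1_authors"]
reader = DataReader(source, {"FA", "RA"}, value_buffer)
writer = DataWriter(value_buffer, output)
Synchronizer(reader, writer, value_buffer).run()
print(output)
```

None of the three classes mentions a particular format or task; in this sketch everything task-specific enters through the constructor arguments, which mirrors the slide's point that application-specific information lives in WRAPINFO.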

20

Experimental Results

TRANSFAC-to-Reference Problem

[Figure: results plot not recovered. Axis labels include "(in logarithm)".]

21

Experimental Results

SWISSPROT-to-FASTA Problem

22

Summary

Automatically generated wrappers can perform well
Wrapper task analysis and wrapper execution can be separated
Key Open Question:
  How hard is it to write layout descriptors?
  Can we make the process semi-automatic?
  Data mining techniques seem quite promising (DILS 2005, BIBE 2005)
