Grid Based Data Integration with Automatic Wrapper Generation

Xuan Zhang, Gagan Agrawal, Ohio State University


Post on 19-Jan-2016



Page 1: Grid Based Data Integration with Automatic Wrapper Generation


Xuan Zhang Gagan Agrawal

Ohio State University

Page 2: Grid Based Data Integration with Automatic Wrapper Generation

Overall Goal

- Tools for data integration, driven by:
  - Data explosion
    - Data size and number of data sources
  - New analysis tools and need for workflows
  - Autonomous resources
    - Heterogeneous data representation and various interfaces

Page 3: Grid Based Data Integration with Automatic Wrapper Generation

Motivation (Contd.)

- Other issues:
  - Frequent updates to data formats
  - Flat-file datasets
  - Ad hoc sharing of data

Page 4: Grid Based Data Integration with Automatic Wrapper Generation

Current Approaches

- Manually written wrappers
  - Problems:
    - O(N^2) wrappers needed, O(N) for a single update
    - Portability of wrappers in a distributed environment
- Mediator-based integration systems
  - Problems:
    - Need a common intermediate format
    - Unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)

Page 5: Grid Based Data Integration with Automatic Wrapper Generation

Our Approach

- Automatically generate wrappers
  - One layout descriptor per resource
  - Stand-alone wrapper programs
  - For integrated DBs, (grid) workflow systems
- Transform data in files of arbitrary formats
  - No domain- or format-specific heuristics
  - Layout information provided by users

Page 6: Grid Based Data Integration with Automatic Wrapper Generation

Our Approach (Contd.)

- Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
- Particularly attractive for:
  - Data grid environments and workflows
  - Flat-file datasets
  - Ad hoc data sharing

Page 7: Grid Based Data Integration with Automatic Wrapper Generation

Our Approach: Advantages

- No need to write wrappers while integrating data or creating workflows
- Only one descriptor per resource needed
- No unnecessary transformations / storage
- New resources can be integrated on the fly
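As a rough illustration of the wrapper counts quoted for manually written wrappers versus one descriptor per resource, the sketch below tabulates both. It assumes that every format may need conversion to every other format and that each direction needs its own hand-written wrapper; the function names are hypothetical and not part of the system described in these slides.

```python
def pairwise_wrappers(n_formats: int) -> int:
    """Hand-written wrappers when each ordered (source, target) pair needs one: O(N^2)."""
    return n_formats * (n_formats - 1)


def wrappers_touched_by_update(n_formats: int) -> int:
    """Wrappers to revise when one format changes (those reading it plus those writing it): O(N)."""
    return 2 * (n_formats - 1)


def layout_descriptors(n_formats: int) -> int:
    """With generated wrappers, one layout descriptor per resource: O(N)."""
    return n_formats


if __name__ == "__main__":
    # Columns: number of formats, pairwise wrappers, wrappers touched by one update, descriptors.
    for n in (5, 20, 100):
        print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), layout_descriptors(n))
```

Under these assumptions, an update to one format means revising a single descriptor rather than every wrapper that reads or writes that format.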

Page 8: Grid Based Data Integration with Automatic Wrapper Generation

Our Approach: Challenges

- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use? (DILS 2005, BIBE 2005)

Page 9: Grid Based Data Integration with Automatic Wrapper Generation

Wrapper Generation System Overview

[Figure: system architecture diagram. Labeled components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset.]

Page 10: Grid Based Data Integration with Automatic Wrapper Generation

Suitability for a Grid Environment

- Wrapper analysis can be implemented as a grid service
  - Very low execution costs
- Wrapper execution modules are task-independent
  - Only three modules need to be ported to different systems

Page 11: Grid Based Data Integration with Automatic Wrapper Generation

Assumptions for the Current Prototype

- One dataset is tabular, the other semi-structured
- Both datasets are stored record-wise
- Order of records is not disturbed
- Suitable for bioinformatics

[Figure labels: Semi-structured; tabular]

Page 12: Grid Based Data Integration with Automatic Wrapper Generation

Layout Description Language

- Goal
  - To describe data in arbitrary flat-file formats
  - Easy to interpret and write
- Components:
  1. Schema description
  2. Layout description
- Example: FASTA

Page 13: Grid Based Data Integration with Automatic Wrapper Generation

Layout Description Language

- Key observations on data layout
  - Strings of variable length
  - Delimiters widely used
  - Data fields divided into variables
  - Repetitive structures
- Key tokens
  - "constant string"
  - LINESIZE
  - [optional]
  - <repeating>
  - …

Sample FASTA data:

…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Page 14: Grid Based Data Integration with Automatic Wrapper Generation

Layout Description Language

Component I: Schema Description

[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string

Sample FASTA data:

…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
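A schema description of this shape is simple to read mechanically. The following is a minimal sketch, not the parser used by the system: it assumes one `[NAME]` line, `FIELD = type` lines, and `//` comments, as seen in the FASTA example, and the function name `parse_schema` is hypothetical.

```python
def parse_schema(text: str) -> tuple[str, dict[str, str]]:
    """Read a schema description into (schema name, {field: data type})."""
    name = None
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop a trailing // comment
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()  # schema name, e.g. FASTA
        elif "=" in line:
            field, dtype = (part.strip() for part in line.split("=", 1))
            fields[field] = dtype
        else:
            raise ValueError(f"unrecognized schema line: {raw!r}")
    if name is None:
        raise ValueError("schema name line such as [FASTA] is missing")
    return name, fields


FASTA_SCHEMA = """\
[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
"""

if __name__ == "__main__":
    print(parse_schema(FASTA_SCHEMA))
```

The dictionary keeps the declaration order of the fields, which is convenient when the same names are later referenced from the layout description.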

Page 15: Grid Based Data Integration with Automatic Wrapper Generation

Layout Description Language

Component II: Layout Description

…
LOOP ENTRY 1:EOF:1 {
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
}
…

Sample FASTA data:

…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
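To show what the FASTA layout description says about the file, here is a hand-written reader that follows the same structure: a ">" marker, an ID up to the first space, a DESCRIPTION up to the end of the line, then one or more SEQ lines. It is only an illustration of how the descriptor reads; it is not the generated wrapper, and the `Entry` class and `read_entries` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Entry:
    id: str
    description: str
    seq: list[str] = field(default_factory=list)  # one item per repeated SEQ line


def read_entries(text: str) -> Iterator[Entry]:
    """Yield one Entry per ">" header, collecting the sequence lines that follow it."""
    current = None
    for line in text.split("\n"):
        if line.startswith(">"):
            if current is not None:
                yield current
            # ">" ID " " DESCRIPTION
            ident, _, desc = line[1:].partition(" ")
            current = Entry(ident, desc.strip())
        elif line.strip() and current is not None:
            # < "\n" SEQ > : repeated sequence lines
            current.seq.append(line.strip())
    if current is not None:
        yield current  # the final entry ends at EOF


SAMPLE = (
    ">seq1 comment1\n"
    "ASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\n"
    "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\n"
    "NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR\n"
    "KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES"
)

if __name__ == "__main__":
    for entry in read_entries(SAMPLE):
        print(entry.id, entry.description, len(entry.seq))
```

Code like this is format-specific and would have to be rewritten for each new format, which is the cost the layout descriptor is meant to remove.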

Page 16: Grid Based Data Integration with Automatic Wrapper Generation

Mapping Cardinality

TRANSFAC entry:

…
FA factor1_name
…
RA reference1.1_authors
…
RA reference1.2_authors
…
RA reference1.3_authors
…

Reference table:

…  FA            …  RA                    …
…  …             …  …                     …
…  factor1_name  …  reference1.1_authors
…  factor1_name  …  reference1.2_authors
…  factor1_name  …  reference1.3_authors
…  …             …  …                     …

Figure labels: one-to-multiple data field; one-to-one data field
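The effect of the mapping in this example can be sketched in a few lines: the factor name appears once in the source entry and is repeated in every output row, while each RA line produces exactly one row. The sketch assumes that each source line starts with a two-letter tag followed by a space and that a line containing only `//` ends a record; the function name `flatten_references` is hypothetical.

```python
def flatten_references(lines: list[str]) -> list[tuple[str, str]]:
    """Turn FA/RA lines of TRANSFAC-like entries into (factor, authors) rows."""
    rows = []
    factor = None
    for line in lines:
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "//":      # assumed end-of-record marker
            factor = None
        elif tag == "FA":    # seen once per entry, reused for every row
            factor = value
        elif tag == "RA":    # each reference line yields one output row
            rows.append((factor, value))
        # all other tags are ignored by this mapping
    return rows


ENTRY_LINES = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
    "//",
]

if __name__ == "__main__":
    for row in flatten_references(ENTRY_LINES):
        print(row)
```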

Page 17: Grid Based Data Integration with Automatic Wrapper Generation

Analyzing Application

- Goals (WRAPINFO):
  - Summarize all application-related information necessary for the wrapper
  - Represent the information in look-up tables and constant parameters
  - Represent the information in a platform-independent format, XML
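The slides do not show the actual WRAPINFO layout, so the following is only a sketch of how look-up tables and constant parameters could be written to and read from XML. Every element and attribute name here (`wrapinfo`, `constants`, `param`, `lookup`, `entry`) and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_wrapinfo(constants: dict[str, str], lookup: dict[str, str]) -> str:
    """Serialize constant parameters and one look-up table to an XML string."""
    root = ET.Element("wrapinfo")  # hypothetical root element
    consts = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(consts, "param", name=name, value=value)
    table = ET.SubElement(root, "lookup", name="field_mapping")
    for source, target in lookup.items():
        ET.SubElement(table, "entry", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def load_lookup(xml_text: str) -> dict[str, str]:
    """Read the look-up table back, as a wrapper on another platform might."""
    root = ET.fromstring(xml_text)
    return {e.get("source"): e.get("target") for e in root.iter("entry")}


if __name__ == "__main__":
    xml_text = build_wrapinfo({"record_end": "//"}, {"FA": "FA", "RA": "RA"})
    print(xml_text)
    print(load_lookup(xml_text))
```

Because the file is plain XML, the analysis step and the execution step only need to agree on this file, not on a programming language or platform.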

Page 18: Grid Based Data Integration with Automatic Wrapper Generation

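Wrapper Generated

[Figure: structure of the generated wrapper. Labeled components: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_multiple_values, one_to_one_values), DataWriter, Output dataset, Synchronizer. Other labels in the figure: load, run, halt, FA, RA.]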

Page 19: Grid Based Data Integration with Automatic Wrapper Generation

Wrapper Generated

- Suitable for data grid
- Three general modules
  - DataReader
    - Extract one data field value
    - Write value to the value buffer if useful
  - DataWriter
    - Write one data field value
    - Remove value from list in the value buffer
  - Synchronizer
    - Switch between calling DataReader and DataWriter
    - Manage dataset buffer
- Application-specific information in WRAPINFO
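The division of labor among the three modules can be sketched as below. This is a simplified illustration under stated assumptions, not the generated wrapper: the `wrapinfo` dictionary with its `held` and `trigger` keys is a hypothetical stand-in for WRAPINFO, source lines are assumed to start with a tag followed by a space, and the dataset buffer is not modeled.

```python
from collections import deque


class DataReader:
    """Extracts one field value per call and buffers it if the target needs it."""

    def __init__(self, lines, buffer):
        self._lines = iter(lines)
        self._buffer = buffer

    def read_one(self):
        line = next(self._lines, None)
        if line is None:
            return None  # end of input
        tag, _, value = line.partition(" ")
        if tag in self._buffer:  # "useful" means the target uses this field
            self._buffer[tag].append(value.strip())
        return tag


class DataWriter:
    """Writes one field value per call, removing it from the buffer unless kept."""

    def __init__(self, buffer, rows):
        self._buffer, self._rows, self._row = buffer, rows, []

    def write_one(self, tag, keep=False):
        values = self._buffer[tag]
        self._row.append(values[0] if keep else values.popleft())

    def end_row(self):
        self._rows.append(tuple(self._row))
        self._row = []


class Synchronizer:
    """Switches between reader and writer using task information only."""

    def __init__(self, reader, writer, buffer, wrapinfo):
        self._reader, self._writer = reader, writer
        self._buffer, self._info = buffer, wrapinfo

    def run(self):
        held, trigger = self._info["held"], self._info["trigger"]
        while (tag := self._reader.read_one()) is not None:
            if tag == held and len(self._buffer[held]) > 1:
                self._buffer[held].popleft()  # a new record replaces the held value
            elif tag == trigger:
                self._writer.write_one(held, keep=True)  # reused across rows
                self._writer.write_one(trigger)          # consumed once
                self._writer.end_row()


def run_wrapper(lines, wrapinfo):
    buffer = {wrapinfo["held"]: deque(), wrapinfo["trigger"]: deque()}
    rows = []
    reader, writer = DataReader(lines, buffer), DataWriter(buffer, rows)
    Synchronizer(reader, writer, buffer, wrapinfo).run()
    return rows
```

For example, `run_wrapper(["FA factor1_name", "RA reference1.1_authors"], {"held": "FA", "trigger": "RA"})` produces one row pairing the factor name with the reference authors. None of the three classes mentions FA or RA, which mirrors the point that the modules are general and the task-specific details are supplied separately.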

Page 20: Grid Based Data Integration with Automatic Wrapper Generation

Experimental Results

TRANSFAC-to-Reference Problem

[Figure: results plot; the axis annotations read "(in logarithm)".]

Page 21: Grid Based Data Integration with Automatic Wrapper Generation


Experimental Results

SWISSPROT-to-FASTA Problem

Page 22: Grid Based Data Integration with Automatic Wrapper Generation

Summary

- Automatically generated wrappers can perform well
- Wrapper task analysis and wrapper execution can be separated
- Key open questions:
  - How hard is it to write layout descriptors?
  - Can we make the process semi-automatic?
  - Data mining techniques seem quite promising (DILS 2005, BIBE 2005)