grid based data integration with automatic wrapper generation
1
Grid Based Data Integration with Automatic Wrapper Generation
Xuan Zhang Gagan Agrawal
Ohio State University
2
Overall Goal
Tools for data integration driven by:
- Data explosion
  - Data size & number of data sources
- New analysis tools and need for workflows
- Autonomous resources
  - Heterogeneous data representation & various interfaces
3
Motivation (Contd.)
Other issues:
- Frequent updates to data formats
- Flat-file datasets
- Ad hoc sharing of data
4
Current Approaches
- Manually written wrappers
  - Problems
    - O(N^2) wrappers needed, O(N) to revise for a single update
    - Portability of wrappers in a distributed environment
- Mediator-based integration systems
  - Problems
    - Need a common intermediate format
    - Unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)
5
Our Approach
- Automatically generate wrappers
  - One layout descriptor per resource
  - Stand-alone wrapper programs
  - For integrated DBs, (grid) workflow systems
- Transform data in files of arbitrary formats
  - No domain- or format-specific heuristics
  - Layout information provided by users
6
Our Approach (Contd.)
- Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
- Particularly attractive for
  - Data grid environments and workflows
  - Flat-file datasets
  - Ad hoc data sharing
7
Our Approach: Advantages
- No need to write wrappers while integrating data or creating workflows
- Only one descriptor per resource needed
- No unnecessary transformations / storage
- New resources can be integrated on-the-fly
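
As a rough illustration of the scaling argument, the "one descriptor per resource" advantage can be set next to the O(N^2) count of hand-written wrappers from the Current Approaches slide. The sketch below assumes that every ordered (source, target) pair of resources needs its own hand-written wrapper, which is only one reading of the O(N^2) figure; the function names are hypothetical.

```python
def pairwise_wrappers(n: int) -> int:
    """Hand-written wrappers if every ordered (source, target) pair needs one."""
    return n * (n - 1)


def wrappers_touched_by_update(n: int) -> int:
    """Wrappers that read from or write to one resource whose format changed."""
    return 2 * (n - 1)


def layout_descriptors(n: int) -> int:
    """Descriptors needed when each resource is described exactly once."""
    return n


for n in (5, 20, 100):
    print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), layout_descriptors(n))
```

Under these assumptions, a format update touches a number of wrappers that grows linearly with N, whereas the descriptor-based approach requires revising a single descriptor.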
8
Our Approach: Challenges
- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use? (DILS 2005, BIBE 2005)
9
Wrapper Generation System Overview
[Figure: system overview diagram. Components shown: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset]
10
Suitability for a Grid Environment
- Wrapper analysis can be implemented as a grid service
  - Very low execution costs
- Wrapper execution modules are task-independent
  - Just need to port three modules on different systems
11
Assumptions for the Current Prototype
- Of the two datasets, one is tabular and the other semi-structured
- Both datasets are stored record-wise
- Order of records not disturbed
- Suitable for bioinformatics
  - Semi-structured and tabular datasets
12
Layout Description Language
- Goal
  - To describe data in arbitrary flat-file formats
  - Easy to interpret and write
- Components:
  1. Schema description
  2. Layout description
- Example: FASTA
13
Layout Description Language
- Key observations on data layout
  - Strings of variable length
  - Delimiters widely used
  - Data fields divided into variables
  - Repetitive structures
- Key tokens
  - “constant string”
  - LINESIZE
  - [optional]
  - <repeating>
  - …

FASTA sample:
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
14
Layout Description Language
Component I: Schema Description
[FASTA]               // Schema name
ID = string           // Data type definitions
DESCRIPTION = string
SEQ = string
15
Layout Description Language
Component II: Layout Description
…
LOOP ENTRY 1:EOF:1 {
    “>” ID “ ” DESCRIPTION < “\n” SEQ >
    “\n” | EOF
}
…
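
To make the layout description concrete, here is a minimal hand-written Python sketch of a reader that follows the same structure: a “>” marker, an ID, a space, a DESCRIPTION, then a repeating group of newline-separated SEQ lines, ending at the next entry or at EOF. This is not the generated wrapper and not an interpreter for the description language; `FastaEntry` and `read_entries` are hypothetical names chosen for illustration.

```python
from typing import Iterator, List, NamedTuple, Optional


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: List[str]  # one string per SEQ line, mirroring < "\n" SEQ >


def read_entries(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per '>' header, collecting the SEQ lines that follow it."""
    entry_id: Optional[str] = None
    description = ""
    seq: List[str] = []
    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if line.startswith(">"):
            if entry_id is not None:
                yield FastaEntry(entry_id, description, seq)
            # ">" ID " " DESCRIPTION
            entry_id, _, description = line[1:].partition(" ")
            seq = []
        elif line and entry_id is not None:
            seq.append(line)
    if entry_id is not None:  # the last entry is terminated by EOF
        yield FastaEntry(entry_id, description, seq)


SAMPLE = (
    ">seq1 comment1\n"
    "ASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\n"
    "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\n"
    "NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR\n"
)

for entry in read_entries(SAMPLE):
    print(entry.id, entry.description, len(entry.seq))
```

The point of the descriptor approach is that this kind of format-specific code would not need to be written by hand; the sketch only shows what the descriptor encodes.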
16
Mapping Cardinality

TRANSFAC (source):
  …
  FA factor1_name
  …
  RA reference1.1_authors
  …
  RA reference1.2_authors
  …
  RA reference1.3_authors
  …

Reference table (target):
  … | FA           | … | RA                   | …
  … | factor1_name | … | reference1.1_authors | …
  … | factor1_name | … | reference1.2_authors | …
  … | factor1_name | … | reference1.3_authors | …

One-to-multiple data field: FA (one source value is repeated in several target rows)
One-to-one data field: RA (each source value fills exactly one target row)
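
The cardinality distinction can be sketched in a few lines of Python. The sketch assumes a simplified source format in which each line is a two-letter tag followed by a space and a value, as in the slide's example; real TRANSFAC entries contain many more fields, and `to_reference_rows` is a hypothetical name.

```python
from typing import Iterable, Iterator, Optional, Tuple


def to_reference_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit one (FA, RA) row per RA line, repeating the most recent FA value."""
    factor: Optional[str] = None
    for line in lines:
        tag, _, value = line.partition(" ")
        if tag == "FA":
            # One-to-multiple: remembered and reused for every following RA.
            factor = value.strip()
        elif tag == "RA" and factor is not None:
            # One-to-one: each RA value is consumed by exactly one row.
            yield (factor, value.strip())


SOURCE = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]

rows = list(to_reference_rows(SOURCE))
for row in rows:
    print(row)
```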
17
Analyzing Application
Goals - WRAPINFO
- Summarize all application-related information necessary for the wrapper
- Represent the information in look-up tables and constant parameters
- Represent the information in a platform-independent format, XML
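
The slides do not show the actual WRAPINFO schema, so the following is only a sketch of the general idea: look-up tables and constant parameters serialized to XML so that any platform can read them. Every element name, attribute name, and parameter here (`wrapinfo`, `lookup_table`, `field`, `constants`, `param`, `fields_per_row`) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_wrapinfo(lookup: Dict[str, str], constants: Dict[str, object]) -> str:
    """Serialize a field look-up table and constant parameters as an XML string."""
    root = ET.Element("wrapinfo")  # hypothetical root element
    table = ET.SubElement(root, "lookup_table")
    for field_name, cardinality in lookup.items():
        ET.SubElement(table, "field", name=field_name, cardinality=cardinality)
    params = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(params, "param", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


xml_text = build_wrapinfo(
    {"FA": "one_to_multiple", "RA": "one_to_one"},
    {"fields_per_row": 2},
)
print(xml_text)

# Any XML parser on the execution side can recover the same tables.
parsed = ET.fromstring(xml_text)
```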
18
Wrapper Generated
[Figure: structure of the generated wrapper. Labels shown: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_multiple_values, one_to_one_values), DataWriter, Output dataset, Synchronizer; control labels: load, run, halt; field labels: FA, RA]
Wrapper Generated
- Suitable for data grid
- Three general modules
  - DataReader
    - Extract one data field value
    - Write value to the value buffer if useful
  - DataWriter
    - Write one data field value
    - Remove value from list in the value buffer
  - Synchronizer
    - Switch between calling DataReader and DataWriter
    - Manage dataset buffer
- Application-specific information in WRAPINFO
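
The division of labor among the three modules can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `FIELD_KINDS` dictionary is a hypothetical stand-in for WRAPINFO, the input is assumed to be lines of the form "TAG value", the writer emits a whole (FA, RA) row rather than a single field, and dataset buffer management is omitted.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

# Hypothetical stand-in for WRAPINFO: which source tags matter, and how they map.
FIELD_KINDS = {"FA": "one_to_multiple", "RA": "one_to_one"}


class ValueBuffer:
    def __init__(self) -> None:
        self.one_to_multiple_values: Dict[str, str] = {}
        self.one_to_one_values: Deque[str] = deque()


class DataReader:
    def __init__(self, lines: Iterable[str], buffer: ValueBuffer) -> None:
        self._lines = iter(lines)
        self._buffer = buffer

    def read_one(self) -> bool:
        """Extract one field value; buffer it only if it is useful. False at EOF."""
        line = next(self._lines, None)
        if line is None:
            return False
        tag, _, value = line.partition(" ")
        kind = FIELD_KINDS.get(tag)
        if kind == "one_to_multiple":
            self._buffer.one_to_multiple_values[tag] = value
        elif kind == "one_to_one":
            self._buffer.one_to_one_values.append(value)
        return True


class DataWriter:
    def __init__(self, buffer: ValueBuffer, output: List[Tuple[str, str]]) -> None:
        self._buffer = buffer
        self._output = output

    def write_one(self) -> None:
        """Write one row and remove the consumed one-to-one value from the buffer."""
        reference = self._buffer.one_to_one_values.popleft()
        factor = self._buffer.one_to_multiple_values.get("FA", "")
        self._output.append((factor, reference))


class Synchronizer:
    def __init__(self, reader: DataReader, writer: DataWriter, buffer: ValueBuffer) -> None:
        self._reader, self._writer, self._buffer = reader, writer, buffer

    def run(self) -> None:
        """Alternate between reading and writing until the input is exhausted."""
        more_input = True
        while more_input:
            more_input = self._reader.read_one()
            while self._buffer.one_to_one_values:
                self._writer.write_one()


lines = ["FA factor1_name", "XX ignored", "RA ref1.1", "RA ref1.2",
         "FA factor2_name", "RA ref2.1"]
buffer, output = ValueBuffer(), []
Synchronizer(DataReader(lines, buffer), DataWriter(buffer, output), buffer).run()
print(output)
```

Because the task-specific knowledge lives in the look-up data rather than in the three classes, the same modules could in principle be reused for a different source and target pair by swapping that data.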
20
Experimental Results
TRANSFAC-to-Reference Problem
[Figure: results plot; both axes are labeled "(in logarithm)"]
21
Experimental Results
SWISSPROT-to-FASTA Problem
22
Summary
- Automatically generated wrappers can perform well
- Wrapper task analysis and wrapper execution can be separated
- Key open question:
  - How hard is it to write layout descriptors?
  - Can we make the process semi-automatic?
  - Data mining techniques seem quite promising (DILS 2005, BIBE 2005)