Rapidly Constructing Integrated Applications from Online Sources Craig A. Knoblock Information Sciences Institute University of Southern California

Post on 19-Dec-2015

Page 1:

Rapidly Constructing Integrated Applications from Online Sources

Craig A. Knoblock

Information Sciences Institute

University of Southern California

Page 2:

Motivating Example

BiddingForTravel.com

Priceline

Map

Orbitz

?

Page 3:
Page 4:
Page 5:

Outline
  Extracting data from unstructured and ungrammatical sources
  Automatically discovering models of sources
  Dynamically building integration plans
  Efficiently executing the integration plans


Page 7:

Ungrammatical & Unstructured Text

Page 8:

Ungrammatical & Unstructured Text

For simplicity, we call such text “posts”.

Goal:

<price>$25</price> <hotelName>holiday inn sel.</hotelName> <hotelArea>univ. ctr.</hotelArea>

Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)

NLP-based IE does not apply (e.g., Rapier)

Page 9:

Reference Sets

IE infused with outside knowledge: “Reference Sets”
  Collections of known entities and the associated attributes
  Online (offline) set of docs
    CIA World Fact Book
  Online (offline) database
    Comics Price Guide, Edmunds, etc.

Page 10:

Algorithm Overview – Use of Ref Sets

Post:
  $25 winning bid at holiday inn sel. univ. ctr.

Reference Set:
  Ref_hotelName        Ref_hotelArea
  Holiday Inn Select   University Center
  Hyatt Regency        Downtown

Record Linkage
  Post:           $25 winning bid at holiday inn sel. univ. ctr.
  Matched record: Holiday Inn Select University Center
  Post tokens:    “$25”, “winning”, “bid”, …

Extraction
  $25 winning bid … <price> $25 </price> <hotelName> holiday inn sel. </hotelName> <hotelArea> univ. ctr. </hotelArea> <Ref_hotelName> Holiday Inn Select </Ref_hotelName> <Ref_hotelArea> University Center </Ref_hotelArea>

Page 11:

Our Record Linkage Problem

Post:
  “$25 winning bid at holiday inn sel. univ. ctr.”

Reference Set (hotel name, hotel area):
  Holiday Inn Greentree
  Holiday Inn Select University Center
  Hyatt Regency Downtown

Posts are not yet decomposed into attributes.
Extra tokens in the post match nothing in the Ref Set.

Page 12:

Our Record Linkage Solution

P = “$25 winning bid at holiday inn sel. univ. ctr.”

Record Level Similarity + Field Level Similarities

V_RL = < RL_scores(P, “Hyatt Regency Downtown”), RL_scores(P, “Hyatt Regency”), RL_scores(P, “Downtown”) >

Binary Rescoring

Best matching member of the reference set for the post
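
The score vector on this slide can be illustrated with a minimal Python sketch. It is not the Phoebus implementation: `rl_score` is an assumed stand-in for RL_scores built on `difflib.SequenceMatcher`, and summing the vector is only a placeholder for the binary rescoring and classification steps. The reference records are the ones shown on the earlier slides.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase whitespace tokens with surrounding periods stripped."""
    return [t.strip(".").lower() for t in text.split()]


def best_token_sim(ref_token, post_tokens):
    """Best character-level similarity between a reference token and any post token."""
    return max(SequenceMatcher(None, ref_token, p).ratio() for p in post_tokens)


def rl_score(post, ref_text):
    """Assumed stand-in for RL_scores: mean best-match similarity of the reference tokens."""
    post_tokens = tokens(post)
    ref_tokens = tokens(ref_text)
    return sum(best_token_sim(r, post_tokens) for r in ref_tokens) / len(ref_tokens)


def v_rl(post, record):
    """One record-level score followed by one score per field."""
    name, area = record
    return [rl_score(post, f"{name} {area}"), rl_score(post, name), rl_score(post, area)]


def best_match(post, reference_set):
    # Summing the vector is a placeholder for rescoring plus a learned classifier.
    return max(reference_set, key=lambda rec: sum(v_rl(post, rec)))


post = "$25 winning bid at holiday inn sel. univ. ctr."
reference_set = [
    ("Holiday Inn Select", "University Center"),
    ("Hyatt Regency", "Downtown"),
]
match = best_match(post, reference_set)
```

Scoring the concatenated record and each field separately is what lets a non-decomposed post be compared against a decomposed reference record.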

Page 13:

Extraction Algorithm

Post:
  $25 winning bid at holiday inn sel. univ. ctr.

Generate V_IE
  V_IE = < common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), … >

Multiclass SVM
  price:      $25
  hotel name: holiday inn sel.
  hotel area: univ. ctr.

Clean Whole Attribute
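
A rough Python sketch of the per-token vector follows. Everything in it is an assumption made for illustration: `common_scores` is reduced to a single price-pattern feature, `ie_score` is a simple abbreviation test against the attributes of the matched reference record, and picking the non-zero feature replaces the multiclass SVM.

```python
import re

# Attributes of the reference record already linked to the post.
MATCHED_RECORD = {"hotelName": "Holiday Inn Select", "hotelArea": "University Center"}


def common_scores(token):
    """Assumed generic feature: 1.0 if the token looks like a dollar amount."""
    return [1.0 if re.fullmatch(r"\$\d+(\.\d+)?", token) else 0.0]


def is_abbreviation(short, word):
    """True if `short` starts like `word` and its letters occur in `word` in order."""
    if not short or short[0] != word[0]:
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in short)


def ie_score(token, attr_value):
    """Assumed stand-in for IE_scores: does the token abbreviate a word of the attribute?"""
    word = token.strip(".").lower()
    return 1.0 if any(is_abbreviation(word, w.lower()) for w in attr_value.split()) else 0.0


def v_ie(token):
    return common_scores(token) + [ie_score(token, v) for v in MATCHED_RECORD.values()]


def label(token):
    """Placeholder for the multiclass SVM: return the attribute with the top score."""
    names = ["price"] + list(MATCHED_RECORD)
    scores = v_ie(token)
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best] if scores[best] > 0 else None


post = "$25 winning bid at holiday inn sel. univ. ctr."
extracted = {}
for tok in post.split():
    attr = label(tok)
    if attr is not None:
        extracted.setdefault(attr, []).append(tok)
```

Tokens such as "winning", "bid", and "at" score zero on every feature and are left unlabeled, which mirrors the extra tokens that match nothing in the reference set.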

Page 14:

Experimental Data Sets

Hotels
  Posts
    1125 posts from www.biddingfortravel.com
    Pittsburgh, Sacramento, San Diego
    Star rating, hotel area, hotel name, price, date booked
  Reference Set
    132 records
    Special posts on BFT site.
    Per area – list any hotels ever bid on in that area
    Star rating, hotel area, hotel name

Page 15:

Comparison to Existing Systems

Record Linkage
  WHIRL
    RL allows non-decomposed attributes
Information Extraction
  Simple Tagger (CRF)
    State-of-the-art IE
  Amilcare
    NLP based IE

Page 16:

Record linkage results

10 trials – 30% train, 70% test

Hotel      Prec.   Recall  F-Measure
Phoebus    93.60   91.79   92.68
WHIRL      83.52   83.61   83.13
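
For reference, the F-measure column is conventionally the harmonic mean of precision and recall. The short sketch below applies that standard formula to the reported precision and recall; it assumes the balanced F-measure is meant, and because the slide's figures are presumably averaged over the 10 trials, the recomputed values need not agree with the table to the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide (percentages).
phoebus_f = f_measure(93.60, 91.79)
whirl_f = f_measure(83.52, 83.61)
```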

Page 17:

Token level Extraction results: Hotel domain

Attr.   System          Prec.   Recall  F-Measure  Freq
Area    Phoebus         89.25   87.50   88.28      809.7
        Simple Tagger   92.28   81.24   86.39
        Amilcare        74.2    78.16   76.04
Date    Phoebus         87.45   90.62   88.99      751.9
        Simple Tagger   70.23   81.58   75.47
        Amilcare        93.27   81.74   86.94
Name    Phoebus         94.23   91.85   93.02      1873.9
        Simple Tagger   93.28   93.82   93.54
        Amilcare        83.61   90.49   86.90
Price   Phoebus         98.68   92.58   95.53      850.1
        Simple Tagger   75.93   85.93   80.61
        Amilcare        89.66   82.68   85.86
Star    Phoebus         97.94   96.61   97.84      766.4
        Simple Tagger   97.16   97.52   97.34
        Amilcare        96.50   92.26   94.27

Not Significant

Page 18:

Outline
  Extracting data from unstructured and ungrammatical sources
  Automatically discovering models of sources
  Dynamically building integration plans
  Efficiently executing the integration plans

Page 19:

Discovering Models of Sources Required for Integration

Provide uniform access to heterogeneous sources
Source definitions are used to reformulate queries
New service, no source model, no integration!
Can we discover models automatically?

Query:
  SELECT MIN(price) FROM flight WHERE depart=“MXP” AND arrive=“PIT”

Mediator
  Source Definitions:
    - United
    - Lufthansa
    - Qantas

Reformulated Queries:
  lowestFare(“MXP”,“PIT”)
  calcPrice(“MXP”,“PIT”,“economy”)

Web Services:
  United
  Lufthansa
  Qantas

New service:
  Alitalia (?)

Page 20:

Inducing Source Definitions: A Simple Example

Step 1: use metadata to classify input types
Step 2: invoke service and classify output types

Mediator
  Semantic Types:
    currency {USD, EUR, AUD}
    rate {1936.2, 1.3058, 0.53177}
  Predicates:
    exchange(currency,currency,rate)

known source
  LatestRates($country1,$country2,rate) :- exchange(country1,country2,rate)

new source
  RateFinder($fromCountry,$toCountry,val) :- ?
  Inputs classified as: currency
  Invocation results: {<EUR,USD,1.30799>,<USD,EUR,0.764526>,…}
  Output classified as: rate

Page 21:

Inducing Source Definitions: A Simple Example

Step 3: generate plausible source definitions
Step 4: reformulate in terms of other sources
Step 5: invoke service and compare output

Mediator
  Predicates:
    exchange(currency,currency,rate)

new source
  RateFinder($fromCountry,$toCountry,val) :- ?

Plausible definitions:
  def_1($from, $to, val) :- exchange(from,to,val)
  def_2($from, $to, val) :- exchange(to,from,val)

Reformulated in terms of the known source:
  def_1($from, $to, val) :- LatestRates(from,to,val)
  def_2($from, $to, val) :- LatestRates(to,from,val)

Input (currency)   RateFinder (rate)   Def_1      Def_2
<EUR,USD>          1.30799             1.30772    0.764692
<USD,EUR>          0.764526            0.764692   1.30772
<EUR,AUD>          1.68665             1.68979    0.591789

match: RateFinder and Def_1
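
Step 5 amounts to scoring each candidate by how closely its predictions track the observed outputs of the new source. The Python sketch below uses the three rows from this slide; the 1% relative tolerance and the hit-fraction score are assumptions chosen for illustration, since the slide does not say how near-equal rates are judged to match.

```python
# Observed outputs of the new source and the predictions of each candidate
# definition, copied from the comparison on the slide.
observations = {
    ("EUR", "USD"): {"RateFinder": 1.30799, "def_1": 1.30772, "def_2": 0.764692},
    ("USD", "EUR"): {"RateFinder": 0.764526, "def_1": 0.764692, "def_2": 1.30772},
    ("EUR", "AUD"): {"RateFinder": 1.68665, "def_1": 1.68979, "def_2": 0.591789},
}


def relative_error(observed, predicted):
    return abs(observed - predicted) / abs(observed)


def score(candidate, tolerance=0.01):
    """Fraction of inputs whose prediction lies within `tolerance` of the observation."""
    hits = sum(
        relative_error(row["RateFinder"], row[candidate]) <= tolerance
        for row in observations.values()
    )
    return hits / len(observations)


scores = {name: score(name) for name in ("def_1", "def_2")}
best = max(scores, key=scores.get)
```

A tolerance is needed because two live rate services rarely return identical numbers, as the small differences between the RateFinder and Def_1 columns show.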

Page 22:

The Framework

Intuition: Services often have similar semantics, so we should be able to use what we know to induce that which we don’t

Two phase algorithm
For each operation provided by the new service:
1. Classify its input/output data types
     Classify inputs based on metadata similarity
     Invoke operation & classify outputs based on data
2. Induce a source definition
     Generate candidates via Inductive Logic Programming
     Test individual candidates by reformulating them

Page 23:

Use Case: Zip Code Data

Single real zip-code service with multiple operations
The first operation is defined as:

  getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
      centroid(zip1, lat1, long1),
      centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance).

Goal is to induce definition for a second operation:

  getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
      centroid(zip1, lat1, long1),
      centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance2),
      (distance2 ≤ distance1),
      (distance1 ≤ 300).

Same service so no need to classify inputs/outputs or match constants!
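
To make the two definitions concrete, here is a small Python sketch that evaluates them over a toy `centroid` relation. The zip codes and coordinates are rough illustrative values, not data from the talk, and `distance_in_miles` uses the haversine formula as an assumed implementation of distanceInMiles.

```python
import math

# Toy centroid relation: zip -> (lat, long). Approximate, illustrative values only.
CENTROID = {
    "90292": (33.98, -118.45),
    "90089": (34.02, -118.29),
    "15213": (40.44, -79.95),
}

EARTH_RADIUS_MILES = 3958.8


def distance_in_miles(lat1, long1, lat2, long2):
    """Great-circle distance (haversine), assumed here for distanceInMiles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(long2 - long1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def get_distance_between_zip_codes(zip1, zip2):
    lat1, long1 = CENTROID[zip1]
    lat2, long2 = CENTROID[zip2]
    return distance_in_miles(lat1, long1, lat2, long2)


def get_zip_codes_within(zip1, distance1):
    """Return (zip2, distance2) pairs that satisfy the target definition."""
    if distance1 > 300:  # (distance1 <= 300)
        return []
    return [
        (zip2, distance2)
        for zip2 in CENTROID
        for distance2 in [get_distance_between_zip_codes(zip1, zip2)]
        if distance2 <= distance1  # (distance2 <= distance1)
    ]


nearby = get_zip_codes_within("90292", 50)
```

Read literally, the definition also returns zip1 itself at distance 0, so the sketch keeps that tuple.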

Page 24:

Generating definitions: ILP

Want to induce source definition for:
  getZipCodesWithin($zip1, $distance1, zip2, distance2)

Predicates available for generating definitions:
  {centroid, distanceInMiles, ≤, =}

New type signature contains that of known source
Use known definition as starting point for local search:
  getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
      centroid(zip1, lat1, long1),
      centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance).

#   Plausible Source Definition
1   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 = d1)
2   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 ≤ d1)               INVALID: d2 unbound!
3   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1)
4   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ d2)
5   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ #d)               #d is a constant
6   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (lt1 ≤ d1)              UNCHECKABLE: lt1 inaccessible!
…
n   cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1), (d1 ≤ #d)    contained in defs 2 & 4
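
The INVALID and UNCHECKABLE annotations suggest a cheap syntactic filter that can run before any candidate is tested against the service. The Python sketch below is one possible reading, not the authors' algorithm: a variable counts as bound if it is a head input, appears in a relational literal, or is equated to a bound variable, and a comparison counts as uncheckable if it mentions a variable that is not part of the operation's inputs or outputs. The encoding of literals as tuples is also an assumption.

```python
HEAD_INPUTS = {"z1", "d1"}    # $zip1, $distance1
HEAD_OUTPUTS = {"z2", "d2"}   # zip2, distance2
RELATIONS = {"cen", "dIM"}    # relational literals bind every argument


def is_constant(term):
    return term.startswith("#")


def classify(body):
    """Return "ok", "INVALID", or "UNCHECKABLE" for a candidate body."""
    bound = set(HEAD_INPUTS)
    for pred, args in body:
        if pred in RELATIONS:
            bound.update(args)
    for pred, args in body:
        if pred == "=" and any(a in bound for a in args):
            bound.update(args)  # equality binds its other side
    head_vars = HEAD_INPUTS | HEAD_OUTPUTS
    for pred, args in body:
        if pred == "<=":
            variables = [a for a in args if not is_constant(a)]
            if any(v not in bound for v in variables):
                return "INVALID"
            if any(v not in head_vars for v in variables):
                return "UNCHECKABLE"
    return "ok" if HEAD_OUTPUTS <= bound else "INVALID"


def candidate(dist_var, comparison):
    """Build a body that shares the cen/cen/dIM prefix used by every row."""
    return [
        ("cen", ("z1", "lt1", "lg1")),
        ("cen", ("z2", "lt2", "lg2")),
        ("dIM", ("lt1", "lg1", "lt2", "lg2", dist_var)),
        comparison,
    ]


candidates = [
    candidate("d1", ("=", ("d2", "d1"))),    # row 1
    candidate("d1", ("<=", ("d2", "d1"))),   # row 2
    candidate("d2", ("<=", ("d2", "d1"))),   # row 3
    candidate("d2", ("<=", ("d1", "d2"))),   # row 4
    candidate("d2", ("<=", ("d1", "#d"))),   # row 5
    candidate("d2", ("<=", ("lt1", "d1"))),  # row 6
]
verdicts = [classify(body) for body in candidates]
```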

Page 25:

Preliminary Results

Settings:
  Number of zip code constants initially available: 6
  Number of samples performed per trial: 20
  Number of candidate definitions in search space: 5

Results:
  Converged on “almost correct” definition!!!
  Number of iterations to convergence: 12

  getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
      centroid(zip1, lat1, long1),
      centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance2),
      (distance2 ≤ distance1),
      (distance1 ≤ 243).

Page 26:

Related Work

Classifying Web Services (Hess & Kushmerick 2003), (Johnston & Kushmerick 2004)
  Classify input/output/services using metadata/data
  We learn semantic relationships between inputs & outputs
Category Translation (Perkowitz & Etzioni 1995)
  Learn functions describing operations available on internet
  We concentrate on a relational modeling of services
CLIO (Yan et al. 2001)
  Helps users define complex mappings between schemas
  They do not automate the process of discovering mappings
iMAP (Dhamankar et al. 2004)
  Automates discovery of certain complex mappings
  Our approach is more general (ILP) & tailored to web sources
  We must deal with problem of generating valid input tuples

Page 27:

Outline
  Extracting data from unstructured and ungrammatical sources
  Automatically discovering models of sources
  Dynamically building integration plans
  Efficiently executing the integration plans

Page 28:

Dynamically Building Integration Plans

Traditional Data Integration Techniques

Query to Mediator:
  Find information about all proteins that participate in Transcription process

Results:
  (1). SwissProtein: P36246
  (2). GeneBank: AAS60665.1
  ………

Page 29:

Dynamically Building Integration Plans (Cont’d)

Problem Solved Here

Request to Mediator:
  Create a web service that accepts a name of a biological process, <bname>, and returns information about proteins that participate in it

Result:
  New web service

Page 30:

Problem Statement (Cont’d)

Assumption
  Information-producing web service operations
Applicability
  Biological data web services
  Geospatial services (WMS, WFS)
  Other applications that do not focus on transactions

Page 31:

Query-based Web Service Composition

Query-based approach
  View web service operations as source relations with binding restrictions
    Can be inferred from WSDL
  Create domain ontology
  Describe source relations in terms of domain relations
    Combined Global-as-View / Local-as-View approach
  Use data integration system to answer user queries

Page 32:

Template-based Web Service Composition

Our goal is to compose new web services
  We need to answer template queries, not specific queries
Template-based Query Approach
  Generate plans to take into account general parameter values, i.e. Universal Plan [Schoppers, et al.]
  Easy to generate universal plan
    Plans that answer template query as opposed to specific query
  But, plans can be very inefficient
Need to generate optimized “universal integration plans”

Page 33:

Example Scenario Sources

Protein
  HSProtein($id, name, location, function, seq, pubmedid)
  MMProtein($id, name, location, function, seq, pubmedid)
  TranducerProtein($id, name, location, taxonid, seq, pubmedid)
  MembraneProtein($id, name, location, taxonid, seq, pubmedid)
  DipProtein($id, name, location, taxonid, function)

Protein-Protein Interactions
  HSProteinInteractions($fromid, toid, source, verified)
  MMProteinInteractions($fromid, toid, source, verified)

Page 34:

Example Rules and Query

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified),
    (taxonid=9606)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified),
    (taxonid=10090)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)

Q(fromid, toid, taxonid, source, verified) :-
    fromid = !fromid,
    taxonid = !taxonid,
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified)
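
The recursive third rule makes ProteinProteinInteractions a transitive closure over the two source relations. As a minimal sketch of what the rules compute, the Python below evaluates them bottom-up with a naive fixpoint loop. The source tuples are made up for illustration, and this is plain datalog-style evaluation rather than the plan the mediator would generate.

```python
# Hypothetical source contents: (fromid, toid, source, verified).
HS_INTERACTIONS = {("P1", "P2", "dip", True), ("P2", "P3", "dip", True)}
MM_INTERACTIONS = {("M1", "M2", "dip", True)}


def protein_protein_interactions():
    # Base rules: each source fixes the taxon id of the tuples it contributes.
    facts = {(f, t, 9606, s, v) for f, t, s, v in HS_INTERACTIONS}
    facts |= {(f, t, 10090, s, v) for f, t, s, v in MM_INTERACTIONS}
    # Recursive rule: join on the intermediate protein until nothing new appears.
    while True:
        derived = {
            (f, t, tax, s, v)
            for (f, mid, tax, s, v) in facts
            for (mid2, t, tax2, s2, v2) in facts
            if mid == mid2 and (tax, s, v) == (tax2, s2, v2)
        }
        if derived <= facts:
            return facts
        facts |= derived


def query(fromid, taxonid):
    """Q with the template parameters !fromid and !taxonid filled in."""
    return {
        row
        for row in protein_protein_interactions()
        if row[0] == fromid and row[2] == taxonid
    }


reachable = {row[1] for row in query("P1", 9606)}
```

Because the rule reuses the same taxonid, source, and verified variables in both body literals, the join only chains tuples that agree on all three.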

Page 35:

Unoptimized Plan

Page 36:

Optimized Plan

Exploit constraints in source description to filter queries to sources

Page 37:

Example Scenario

Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :-
    fromid = !fromproteinid,
    Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1),
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)

Composed Plan (diagram labels):
  Operators: Protein-ProteinInteractions, Protein, Protein, Join
  Input: Fromproteinid
  Intermediate data: Fromproteinid, Toproteinid; Fromproteinid, fromseq; Fromproteinid, Toproteinid, toseq
  Output: Fromproteinid, fromseq, Toproteinid, toseq

Page 38:

Example Integration Plan

Page 39:

Adding Sensing Operations for Tuple-level Filtering

Compute original plan for a template query
For each constraint on the sources
  Introduce constraint into the query
  Rerun inverse rules algorithm
  Compare cost of new plan to original plan
  Save plan with lowest cost

Page 40:

Optimized Universal Integration Plan

Page 41:

Outline
  Extracting data from unstructured and ungrammatical sources
  Automatically discovering models of sources
  Dynamically building integration plans
  Efficiently executing the integration plans

Page 42:

Dataflow-style, Streaming Execution

Map datalog plans into streaming, dataflow execution system (e.g., network query engine)
We use the Theseus execution system since it supports recursion
Key challenges
  Mapping non-recursive plans
  Mapping recursive plans
    Data processing
    Loop detection
    Query results update
    Termination check
    Recursive callback
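
The sub-steps listed for recursive plans can be pictured with an ordinary Python generator. This is only an analogy and does not use Theseus or its plan language: `lookup` is a hypothetical stand-in for a web service call, and the interaction graph is invented so that it contains a cycle.

```python
from collections import deque

# Hypothetical interaction source with a cycle: P1 -> P2 -> P3 -> P1.
INTERACTIONS = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}


def lookup(protein_id):
    """Stand-in for one call to an interaction web service."""
    return INTERACTIONS.get(protein_id, [])


def stream_interactions(start):
    """Yield (start, reachable) pairs as soon as each one is discovered."""
    seen = set()              # loop detection
    pending = deque([start])  # queued recursive callbacks
    while pending:            # termination check: stop when no work remains
        current = pending.popleft()
        for nxt in lookup(current):   # data processing
            if nxt not in seen:
                seen.add(nxt)
                yield (start, nxt)    # query results update, streamed downstream
                pending.append(nxt)   # recursive callback on the new protein


results = list(stream_interactions("P1"))
```

Yielding each pair immediately lets a downstream consumer start work before the recursion finishes, and the `seen` set is what keeps the cycle from running forever.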

Page 43:

Example Translation

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified),
    (taxonid=9606)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified),
    (taxonid=10090)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)

Q(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    (fromid = !fromproteinid),
    (taxonid = !taxonid)

Page 44:

Example Theseus Plan

Page 45:

Bio-informatics Domain Results

Experiments in Bio-informatics domain where we have 60 real web services provided by NCI
We varied number of domain relations in a query from 1-30 and report composition time with execution time

[Chart: Execution Time and Composition Time, Time in Milliseconds (0 to 16000) vs. # of Relations in Query (1 to 8)]

Page 46:

Tuple-level Filtering

Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%

Page 47:

Improvement due to Theseus

Theseus can improve the execution time of the generated web service with complex plans by up to 33.6%

Page 48:

Discussion

Huge number of sources available
Need tools and systems that support the dynamic integration of these sources
In this talk, I described techniques for:
  Extracting data from unstructured and ungrammatical sources
  Discovering models of online sources required for integration
  Dynamic and efficient integration of web sources
  Efficient execution of integration plans
Much work still left to be done…

Page 49:

More information…

http://www.isi.edu/~knoblock

Matthew Michelson and Craig A. Knoblock. Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.

Mark James Carman and Craig A. Knoblock. Inducing source descriptions for automated web service composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.

Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, optimizing, and executing plans for bioinformatics web services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330--353, Sep 2005.