Interactive Query Formulation over Web Service-Accessed Sources

Michalis Petropoulos, Alin Deutsch, Yannis Papakonstantinou. ACM SIGMOD, June 2006


TRANSCRIPT

Page 1: Interactive Query Formulation over Web Service-Accessed Sources

Interactive Query Formulation over Web Service-Accessed Sources

Michalis Petropoulos, Alin Deutsch, Yannis Papakonstantinou

ACM SIGMOD, June 2006

Page 2: Interactive Query Formulation over Web Service-Accessed Sources

Large-Scale Data Integration Systems

[Architecture diagram spanning four domains]
• Source Domain (Source Owner): Data Sources, each with a Source Schema, accessed through Web Services
• Integration Domain (Integration Engineer): Mediator with an Integrated Schema
• Application Domain (Developer): Applications
• Web Domain (End User): Web Forms & Reports

Examples listed in the diagram
• Dell Computers, Cisco Routers, HP Printers
• Dell Computers by CPU, Cisco Routers by Rate, HP Printers by Speed
• Compatible Combinations of Computers, Routers and Printers
• CNET Computer, PCWorld Portals
• CNET’s Top Combinations, CNET’s Search Desktops, PCWorld’s Product Finder

Page 3: Interactive Query Formulation over Web Service-Accessed Sources

Large-Scale Data Integration Systems

What queries can the mediator answer for me?

[Same architecture diagram as on Page 2, with CLIDE added.]

Page 4: Interactive Query Formulation over Web Service-Accessed Sources

Running Example

Dell

Schema
Computers(cid, cpu, ram, price)
NetCards(cid, rate, standard, interface)

Parameterized Views

V1 ComByCpu(cpu) → (Computer)*
Computers for a given cpu
SELECT DISTINCT Com1.*
FROM Computers Com1
WHERE Com1.cpu=cpu

V2 ComNetByCpuRate(cpu, rate) → (Computer, NetCard)*
Computers & NetCards for a given cpu & rate
SELECT DISTINCT Com1.*, Net1.*
FROM Computers Com1, NetCards Net1
WHERE Com1.cid=Net1.cid
AND Com1.cpu=cpu
AND Net1.rate=rate

Cisco

Schema
Routers(rate, standard, price, type)

Views

V3 RouWired() → (Router)*
Wired Routers
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wired'

V4 RouWireless() → (Router)*
Wireless Routers
SELECT DISTINCT Rou1.*
FROM Routers Rou1
WHERE Rou1.type='Wireless'

Conjunctive Queries CQ
• Equality & Comparison Conditions
• Parameters
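As a rough illustration of what the four views expose, the sketch below wraps each one in a Python function over an in-memory SQLite database. The function names, the column types, and the use of SQLite are assumptions made for this example only; the sample rows are the ones that appear in the example results on Page 6. The real sources are reached through web services, not through a local database.

```python
import sqlite3

# Hypothetical local stand-in for the Dell and Cisco sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Computers(cid TEXT, cpu TEXT, ram INTEGER, price INTEGER);
CREATE TABLE NetCards(cid TEXT, rate TEXT, standard TEXT, interface TEXT);
CREATE TABLE Routers(rate TEXT, standard TEXT, price INTEGER, type TEXT);
INSERT INTO Computers VALUES ('A123','P4',512,400), ('B123','P4',1024,550);
INSERT INTO NetCards VALUES ('A123','10','.11b','USB'), ('B123','54','.11g','USB');
INSERT INTO Routers VALUES ('10','.11b',50,'Wireless'), ('54','.11g',120,'Wireless');
""")


def com_by_cpu(cpu):
    """V1 ComByCpu(cpu): Computers for a given cpu."""
    sql = "SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu = ?"
    return conn.execute(sql, (cpu,)).fetchall()


def com_net_by_cpu_rate(cpu, rate):
    """V2 ComNetByCpuRate(cpu, rate): Computers & NetCards for a cpu and rate."""
    sql = (
        "SELECT DISTINCT Com1.*, Net1.* "
        "FROM Computers Com1, NetCards Net1 "
        "WHERE Com1.cid = Net1.cid AND Com1.cpu = ? AND Net1.rate = ?"
    )
    return conn.execute(sql, (cpu, rate)).fetchall()


def rou_wired():
    """V3 RouWired(): Wired Routers (no parameters)."""
    sql = "SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type = 'Wired'"
    return conn.execute(sql).fetchall()


def rou_wireless():
    """V4 RouWireless(): Wireless Routers (no parameters)."""
    sql = "SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type = 'Wireless'"
    return conn.execute(sql).fetchall()
```

Note that no function returns all Computers: every access to the Computers table requires a cpu value, which is what makes some application queries infeasible.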

Page 5: Interactive Query Formulation over Web Service-Accessed Sources

Running Example

• The integrated schema puts together the Dell and Cisco schemas

Attribute Associations
• (Computers.cid, NetCards.cid)
• (NetCards.rate, Routers.rate)
• (NetCards.standard, Routers.standard)

[Diagram: the Developer's Application queries the Mediator's Integrated Schema; the Mediator reaches Dell and Cisco through views V1, V2, V3 and V4.]

Page 6: Interactive Query Formulation over Web Service-Accessed Sources

Sophisticated Mediators Make Feasible Queries Hard to Predict

Feasible Queries FQ
• Equivalent CQ query rewritings using the views
• Might involve more than one view
• Order might matter

Feasible
Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers

[Diagram, steps A to E: the Mediator calls V4 RouWireless(), then calls V2 ComNetByCpuRate(‘P4’, ‘10’) and ComNetByCpuRate(‘P4’, ‘54’), and combines the results.]

Routers.* returned by RouWireless()
10 .11b 50 Wireless
54 .11g 120 Wireless

Computers.* NetCards.* returned by ComNetByCpuRate
A123 P4 512 400 A123 10 .11b USB
B123 P4 1024 550 B123 54 .11g USB

Computers.* NetCards.* Routers.* (query result)
A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless
B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless

Infeasible
Query: Get all Computers
[Diagram: the Mediator has only V1 for this query, and V1 requires a cpu value.]
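The feasible plan in the example above can be sketched as a dependent join: the parameterless view is called first, and its output supplies the parameter of the second view. This is a minimal sketch under stated assumptions, not the mediator's actual plan generator; the Python function names and the in-memory lists standing in for the sources are hypothetical.

```python
# In-memory stand-ins for the sources, using the sample rows from the slides.
COMPUTERS = [("A123", "P4", 512, 400), ("B123", "P4", 1024, 550)]
NETCARDS = [("A123", "10", ".11b", "USB"), ("B123", "54", ".11g", "USB")]
ROUTERS = [("10", ".11b", 50, "Wireless"), ("54", ".11g", 120, "Wireless")]


def rou_wireless():
    """V4 RouWireless(): takes no parameters."""
    return [r for r in ROUTERS if r[3] == "Wireless"]


def com_net_by_cpu_rate(cpu, rate):
    """V2 ComNetByCpuRate(cpu, rate): both parameters must be bound."""
    return [
        c + n
        for c in COMPUTERS
        for n in NETCARDS
        if c[0] == n[0] and c[1] == cpu and n[1] == rate
    ]


def p4_with_wireless_routers():
    """'P4' Computers with their NetCards and compatible 'Wireless' Routers."""
    result = []
    for router in rou_wireless():  # V4 goes first: it needs no input
        rate, standard = router[0], router[1]
        # The router's rate binds V2's rate parameter (NetCards.rate = Routers.rate).
        for com_net in com_net_by_cpu_rate("P4", rate):
            # Index 6 is NetCards.standard; enforce NetCards.standard = Routers.standard.
            if com_net[6] == standard:
                result.append(com_net + router)
    return result
```

The order matters here: V2 cannot be called before V4, because nothing else in the query supplies a value for its rate parameter. The query "Get all Computers" has no such plan, since V1 and V2 both demand a cpu value that the query does not provide.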

Page 7: Interactive Query Formulation over Web Service-Accessed Sources

Problem

1. Large number of sources
2. Large number of views (web services)
3. Mediator capabilities

The developer formulates an application query. Is the application query feasible? If not, how does the developer know which queries are feasible?

Previous options:
– The developer had to browse the view definitions and somehow formulate a feasible query
– Or formulate queries until a feasible one is found (trial and error)

No system-provided guidance

Page 8: Interactive Query Formulation over Web Service-Accessed Sources

The CLIDE Solution

A query formulation interface, which interactively guides the developer toward feasible queries by employing a coloring scheme

[Diagram: the Developer formulates the Application's queries through CLIDE, on top of the Mediator's Integrated Schema and the views V1 to V4 over Dell and Cisco.]

Page 9: Interactive Query Formulation over Web Service-Accessed Sources

QBE-Like Interfaces

[Screenshot: Microsoft SQL Server]

Page 10: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Interface

• Table, selection, projection and join actions
• Feasibility Flag
• Color-based suggestions

[Screenshot of the CLIDE interface, with callouts for the Selection Boxes and the Feasibility Flag]

Page 11: Interactive Query Formulation over Web Service-Accessed Sources

Example Interaction

Snapshot 1

• Yellow: required action
– All feasible queries require this action
• White: optional action
– Feasible queries can be formulated with or without these actions

Page 12: Interactive Query Formulation over Web Service-Accessed Sources

Example Interaction

Snapshot 2

• Blue: required choice of action
– At least one of them must be taken in order to reach a feasible query

Join Lines:
• Only yellow and blue are displayed
• Must appear in Attribute Associations

Page 13: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Properties

• Completeness of Suggestions
– Every feasible query can be formulated by performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness
• Rapid Convergence By Following Suggestions
– The shortest sequence of actions from a query to any feasible query consists of suggested actions

Page 14: Interactive Query Formulation over Web Service-Accessed Sources

Example Interaction

Snapshot 3

• *: any other constant
• Red: prohibited action
– Does not appear in any feasible query
– Leads to a “Dead End” state

Page 15: Interactive Query Formulation over Web Service-Accessed Sources

Example Interaction

Snapshot 4

[Diagram, steps labeled A, B, D, E and F: the Mediator calls V4 RouWireless() to obtain Routers.*, calls V2 ComNetByCpuRate(‘P4’, rate) to obtain Computers.* and NetCards.*, and combines the results.]

Result
ram price rate interface price
512 400 10 USB 50
1024 550 54 USB 120

Page 16: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Properties

• Completeness of Suggestions
– Every feasible query can be formulated by performing yellow and blue actions at every step
• Minimality of Suggestions
– At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness
• Rapid Convergence By Following Suggestions
– The shortest sequence of actions from a query to any feasible query consists of suggested actions

Page 17: Interactive Query Formulation over Web Service-Accessed Sources

Interaction Graph

• Nodes are queries
– One for each q ∈ CQ
• Edges are actions
– Table, selection, projection and join actions
• Green nodes are feasible queries
• Infinitely big structure
– All CQ queries
– All possible combinations of actions formulating them

[Diagram: a fragment of the graph with callouts for a Table Action, a Selection Action and a Join Action; edge labels include Com1, Net1, Rou1, Com1.cpu=‘P4’, Com1.cid=Net1.cid, Com1.ram and Com1.price.]

Page 18: Interactive Query Formulation over Web Service-Accessed Sources

Interaction Graph: Colors

• Yellow action
– Every path from the current node n to a feasible node contains the action
• Blue action
– At least one path from the current node n to a feasible node nF contains the action
– There is no path from n to nF that contains an intermediate feasible node
• Red action
– No path to a feasible node contains the action

[Diagram: the interaction graph around the Current Node. Edge labels include Net1, Com2, Rou1, Com1.cid=Net1.cid, Com2.cid=Net1.cid, Net1.rate=Rou1.rate, Net1.rate=‘54Mbps’, Com2.cpu=‘P4’, Com1.cpu=*, Com1.price=*, Com1.ram=*, Com1.cid=*, Com1.cid and Com1.cpu.]

Page 19: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Properties

• Rapid Convergence By Following Suggestions
– The shortest path from the current node to any feasible node consists of suggested actions

Page 20: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Architecture

• The Back-End is invoked every time the user performs an action
– i.e., the user arrives at a new node in the interaction graph

[Architecture diagram]
• Front-End: the User's Actions send the Current Query to the Back-End, which returns Colored Actions + Feasibility Flag
• Back-End components: Closest Feasible Queries Algorithm, Maximally-Contained Rewriter, Aliases Collapse Rule, Color Algorithm, Parameters Algorithm
• Inputs: Schemas, Views, Attribute Associations
• Intermediate results: Seed Queries SQ, Minimal Feasible Extension Queries, Closest Feasible Queries FQC

Page 21: Interactive Query Formulation over Web Service-Accessed Sources

Color Determined By a Finite Set of Feasible Queries

Challenge: Infinitely Many Feasible Queries

Solution:
• Finite Set of Feasible Queries
• Closest Feasible Queries FQC
• FQC is sufficient to color actions

Challenge: How far can the Closest Feasible Queries FQC be?

[Diagram: the current node n at the center of a region of unknown Radius (?) that contains the Closest Feasible Queries FQC.]
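One way to picture the closest feasible queries is a toy model in which a query is just the set of actions taken so far, and a feasible query is "closest" to the current one when it extends it and no other feasible query lies strictly in between. The sketch below is only that toy model: the function name, the action strings, and the finite list of feasible action sets are hypothetical, and it enumerates a given finite list rather than exploring the infinite interaction graph or calling a rewriter as CLIDE does.

```python
def closest_feasible(current, feasible):
    """Return the feasible action sets that strictly extend `current` and
    contain no other feasible extension of `current` as a proper subset."""
    current = frozenset(current)
    extensions = [frozenset(f) for f in feasible if current < frozenset(f)]
    return {f for f in extensions if not any(g < f for g in extensions)}


# Hypothetical action sets, loosely modeled on views V1 to V4.
FEASIBLE = [
    {"Com1", "Com1.cpu=*"},
    {"Com1", "Net1", "Com1.cid=Net1.cid", "Com1.cpu=*", "Net1.rate=*"},
    {"Rou1", "Rou1.type='Wired'"},
    {"Rou1", "Rou1.type='Wireless'"},
]

# From a query that only contains the Com1 table, the V1-like query is the
# single closest feasible query; the V2-like query lies beyond it.
from_com1 = closest_feasible({"Com1"}, FEASIBLE)

# Once Net1 is also in the query, only the V2-like query remains.
from_com1_net1 = closest_feasible({"Com1", "Net1"}, FEASIBLE)
```

In this model the set of closest feasible queries is always finite when the input list is, which mirrors the point of the slide: a finite set is enough to decide the colors.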

Page 22: Interactive Query Formulation over Web Service-Accessed Sources

Closest Feasible Queries FQC Algorithm

Solution: Based on Maximally Contained Queries FQMC
• Compute maximally contained queries FQMC
• Radius pL is the length of the longest path to a maximally contained query
• Theorem: All FQC queries are reachable via a path of length p ≤ pL

Challenge: The set of nodes within pL can be too many to explore

[Diagram: the current node n at the center of a region of Radius pL that contains the Maximally Contained Queries FQMC and the Closest Feasible Queries FQC.]

Page 23: Interactive Query Formulation over Web Service-Accessed Sources

Closest Feasible Queries FQC Algorithm

Solution: Find the Closest Feasible Queries Directly
• Theorem: All queries in FQMC are in FQC
• But not all queries in FQC are in FQMC
• Collapse Aliases to compute FQC \ FQMC

[Diagram: the current node n, the Maximally Contained Queries FQMC, the Closest Feasible Queries FQC, and more feasible nodes.]

Page 24: Interactive Query Formulation over Web Service-Accessed Sources

Remaining Issues of the Algorithm

• Color remaining actions white or red
• Handling projections
• Dealing with parameterized queries

Page 25: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Implementation Issues

• Maximally contained queries need to be minimized (this affects CLIDE’s rapid convergence and minimality)
– Implemented the minimization module from scratch
– MiniCon computes mappings between the query and the views
– Exposed these mappings to guide minimization
• Optimizations to bring the response time below 3 sec
– Amortize the computation across multiple interaction steps by observing that the query is changed minimally at every step
– Instead of calling MiniCon from scratch at every step, we incrementally compute the closest feasible queries

Page 26: Interactive Query Formulation over Web Service-Accessed Sources

Thank you

Play with our demo: http://www.clide.info

Page 27: Interactive Query Formulation over Web Service-Accessed Sources

Maximally Contained Queries FQMC

• Assuming fixed SELECT clause (projection list)
• Covered extensively in literature
– MiniCon, Bucket, InverseRules Algorithms
• FQMC is finite

Example queries
• Q1: Get all Computers
• Q2: Get all Computers with a given cpu
• Q3: Get all Computers with a given cpu & ram
• Q4: Get all Computers with a given ram

[Diagram: the four queries above, two of them labeled “Maximally Contained Query” and one labeled “Not Maximally Contained”.]

Page 28: Interactive Query Formulation over Web Service-Accessed Sources

Closest Feasible Queries FQC Algorithm

Solution: Collapse Aliases
• Collapse Aliases to compute FQC \ FQMC
• Check satisfiability

[Diagram: the current node n, the Maximally Contained Feasible Queries FQMC and the Closest Feasible Queries FQC.]

Page 29: Interactive Query Formulation over Web Service-Accessed Sources

Color Algorithm

Yellow and Blue
• An action is colored based on which closest feasible queries it appears in
• Yellow, if it appears in all queries in FQC
• Blue, if it appears in at least one (but not all) query in FQC

White and Red
• Attach Maximum Projection Lists to Closest Feasible Queries
– Projections that can be added to a feasible query without compromising feasibility
• A projection is white if it is in the maximum projection list
• Color selections based on projections
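The yellow and blue rules above amount to counting in how many closest feasible queries an action appears. The sketch below assumes that queries are modeled as sets of action strings and that FQC has already been computed; the function name and the example data are hypothetical. The white and red decision is deliberately left open, since it depends on the maximum projection lists, which this sketch does not model.

```python
def color_actions(current, fqc, candidate_actions):
    """Color each candidate action not yet in `current` against FQC."""
    current = frozenset(current)
    colors = {}
    for action in candidate_actions:
        if action in current:
            continue  # already part of the query, nothing to suggest
        hits = sum(1 for query in fqc if action in query)
        if fqc and hits == len(fqc):
            colors[action] = "yellow"  # appears in all queries in FQC
        elif hits > 0:
            colors[action] = "blue"  # appears in some, but not all
        else:
            # White or red; deciding which needs the maximum projection lists.
            colors[action] = "white-or-red"
    return colors


# Hypothetical FQC restricted to the two router views, for an empty query.
fqc = [
    frozenset({"Rou1", "Rou1.type='Wired'"}),
    frozenset({"Rou1", "Rou1.type='Wireless'"}),
]
candidates = ["Rou1", "Rou1.type='Wired'", "Rou1.type='Wireless'", "Com1"]
colors = color_actions(set(), fqc, candidates)
```

In this example the Rou1 table action comes out yellow because both closest feasible queries need it, while the two type selections come out blue because the developer must pick one of them.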

Page 30: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Implementation & Optimizations

• View expansion introduces redundancy
– Affects CLIDE’s rapid convergence and minimality
• An efficient containment test is crucial to redundancy removal

[Diagram: internals of the Maximally-Contained Rewriter, shown next to the CLIDE architecture diagram from Page 20]
• Stages: MiniCon, Containment Mappings Logging, Views Expansion, Redundant Actions Removal, Redundant Queries Removal
• Data passed between stages:
– Maximally-Contained Feasible Queries over Views + Containment Mappings
– Feasible Extension Queries + Maximum Projection Lists
– Maximally-Contained Feasible Extension Queries + Maximum Projection Lists
– Minimal Feasible Extension Queries FQME + Maximum Projection Lists

Page 31: Interactive Query Formulation over Web Service-Accessed Sources

CLIDE Performance

[Chart: Time (sec), on an axis from 0 to 7, against (# of Joins, # of Selections) in Current Query, with one series each for 112 Views, 168 Views, 224 Views and 280 Views.]

Chains of Stars
• Schema: [diagram: A linked to B1 … BK and C1 … CL, continuing through Bi … Ci]
• Views: [diagram: A linked to B1 … BK and C1 … CL, with Bi1 … BiM and Ci1 … CiM]
• Queries: A-span = 7, B-span = 3, Selections = 4, 6, 8, 10