CSE 636 Data Integration Limited Source Capabilities Slides by Hector Garcia-Molina


Page 1: CSE 636 Data Integration

CSE 636 Data Integration

Limited Source Capabilities

Slides by Hector Garcia-Molina

Page 2: CSE 636 Data Integration

Heterogeneous Databases

[Figure: a distributed database system on top of four heterogeneous sources, each with its own data: DBMS1, DBMS2, a legacy system, and a web site]

Page 3: CSE 636 Data Integration


Limited Capabilities

Page 4: CSE 636 Data Integration

Example: Amazon.com

[Figure: search form with the fields author, title, subject, format, and price. Callouts on the form: "must specify at least one of these"; "this attribute not returned"; "cannot query on this attribute"; "menu of choices"]

Page 5: CSE 636 Data Integration

Example: BarnesAndNoble.com

[Figure: search form with the fields author, title, subject, format, and price. Callouts on the form: "must specify at least one of these"; "can query if one of other attributes specified"; "menu of choices"]

Page 6: CSE 636 Data Integration

Why Limited Capabilities?

• Search forms
• Security
• Indexes
• Legacy

Page 7: CSE 636 Data Integration

Capability vs. Content

• Capability description
  – Can only search for subject = “art,” “history,” “science”
• Content description
  – Source only contains subject = “art,” “history,” “science”

Page 8: CSE 636 Data Integration

Outline

• Describing source capabilities
• Extending source capabilities
• How mediators cope with limited capabilities
• Mediator capabilities
• Other topics

[Figure: a mediator above two wrappers, each wrapping one source]

Page 9: CSE 636 Data Integration

Describing Query Capabilities

R(X, Y, ... Z)

Adornments:
• f: may or may not be specified
• u: cannot be specified
• b: must be specified
• c[S]: must be specified, chosen from list S
• o[S]: optional; if specified, chosen from list S

Page 10: CSE 636 Data Integration

Describing Query Capabilities

R(X, Y, ... Z)

Adornments (as before): f, u, b, c[S], o[S]

With output restriction (a prime marks an attribute that is not returned in the answer):
• f’
• u’
• b’
• c’[S]
• o’[S]

Page 11: CSE 636 Data Integration

Example

• Relation R(X, Y, Z)
• Description templates: bu’f, uf’c[z1, z2]
• Answerable queries: R(x1, Y, Z), R(X, Y, z1)
• Unanswerable queries: R(X, y1, Z), R(X, Y, z3)
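
The answerable/unanswerable split in this example can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Adornment` class and the helpers `position_ok`, `matches`, and `answerable` are hypothetical names, a query is modeled as a tuple with `None` for an unspecified attribute, and a primed adornment is modeled by `returned=False` (it affects only the output, not whether the query is accepted).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Adornment:
    kind: str                              # one of "f", "u", "b", "c", "o"
    choices: FrozenSet[str] = frozenset()  # list S, used by "c" and "o" only
    returned: bool = True                  # False models a primed adornment


def position_ok(adornment: Adornment, value: Optional[str]) -> bool:
    """Check one attribute; value is a constant, or None if unspecified."""
    if adornment.kind == "f":
        return True
    if adornment.kind == "u":
        return value is None
    if adornment.kind == "b":
        return value is not None
    if adornment.kind == "c":
        return value in adornment.choices
    if adornment.kind == "o":
        return value is None or value in adornment.choices
    raise ValueError(f"unknown adornment kind: {adornment.kind}")


def matches(template: Sequence[Adornment], query: Sequence[Optional[str]]) -> bool:
    """A query matches a template if every attribute position is acceptable."""
    return len(template) == len(query) and all(
        position_ok(a, v) for a, v in zip(template, query)
    )


def answerable(templates, query) -> bool:
    """A query is answerable if at least one template accepts it."""
    return any(matches(t, query) for t in templates)


# The two description templates for R(X, Y, Z): bu'f and uf'c[z1, z2]
TEMPLATES = [
    [Adornment("b"), Adornment("u", returned=False), Adornment("f")],
    [Adornment("u"), Adornment("f", returned=False),
     Adornment("c", frozenset({"z1", "z2"}))],
]

print(answerable(TEMPLATES, ("x1", None, None)))  # R(x1, Y, Z)
print(answerable(TEMPLATES, (None, None, "z1")))  # R(X, Y, z1)
print(answerable(TEMPLATES, (None, "y1", None)))  # R(X, y1, Z)
print(answerable(TEMPLATES, (None, None, "z3")))  # R(X, Y, z3)
```

The first two calls should report the queries as answerable and the last two as unanswerable, in line with the slide.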

Page 12: CSE 636 Data Integration

Other Description Mechanisms

• Tsimmis
  – Query templates
• Information Manifold
  – capability records (# bound attrs, conditions ok, ...)
• Disco
• Garlic
  – black box
• Context-free grammars

Page 13: CSE 636 Data Integration

Extending Source Capabilities

[Figure: a wrapper on top of the amazon source]

Query: author = “Freud” AND price > 10

Source: R(author, price, ...)
Template: b, u, ...

Page 14: CSE 636 Data Integration

Extending Source Capabilities

[Figure: a wrapper on top of the amazon source]

Source: R(author, price, ...)
Template: b, u, ...

Query: author = “Freud” AND price > 10

Source Query: author = “Freud”

Wrapper Filter: price > 10
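
The split into a source query and a wrapper filter can be sketched in a few lines. This is an illustration only: `BOOKS` is made-up data, and `source_query` and `wrapper_query` are hypothetical stand-ins for the source and its wrapper, assuming the template (b, u, ...) means the author must be given and the price cannot be queried.

```python
# Hypothetical catalogue standing in for the source's data.
BOOKS = [
    {"author": "Freud", "title": "Title A", "price": 8},
    {"author": "Freud", "title": "Title B", "price": 15},
    {"author": "Jung", "title": "Title C", "price": 20},
]


def source_query(author):
    """Stand-in for the source: author must be bound, price cannot be given."""
    if author is None:
        raise ValueError("author must be specified")
    return [row for row in BOOKS if row["author"] == author]


def wrapper_query(author, min_price):
    """Answer author = ... AND price > ... on top of the limited source."""
    rows = source_query(author)                            # source query
    return [row for row in rows if row["price"] > min_price]  # wrapper filter


result = wrapper_query("Freud", 10)
print([row["title"] for row in result])
```

The wrapper ships only the supported condition to the source and applies the unsupported one locally, so the caller sees a source that appears to accept both conditions.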

Page 15: CSE 636 Data Integration

Another Example

[Figure: a wrapper on top of the Barnes&Noble source]

Query: (author = “Freud” OR author = “Jung”) AND price < 10

Source: R(author, price, …)
No disjunctive conditions; price can only be specified with author

Page 16: CSE 636 Data Integration

Another Example

[Figure: a wrapper on top of the Barnes&Noble source]

Query: (author = “Freud” OR author = “Jung”) AND price < 10

Source: R(author, price, …)
No disjunctive conditions; price can only be specified with author

Q1: author = “Freud” AND price < 10
Q2: author = “Jung” AND price < 10

Union Operation
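
A rough sketch of this query splitting follows. The data in `BOOKS` and the functions `source_query` and `wrapper_query` are hypothetical; the source stand-in accepts one author per call (no disjunction) and takes a price bound only together with an author, as the slide describes.

```python
# Hypothetical (author, title, price) rows standing in for the source's data.
BOOKS = [
    ("Freud", "Title A", 8),
    ("Freud", "Title B", 15),
    ("Jung", "Title C", 9),
    ("Adler", "Title D", 5),
]


def source_query(author, max_price=None):
    """Stand-in for the source: exactly one author, optional price bound."""
    if author is None:
        raise ValueError("price can only be specified together with an author")
    return [
        row for row in BOOKS
        if row[0] == author and (max_price is None or row[2] < max_price)
    ]


def wrapper_query(authors, max_price):
    """Answer (author = a1 OR author = a2 ...) AND price < max_price."""
    seen = set()
    result = []
    for author in authors:                            # one sub-query per disjunct
        for row in source_query(author, max_price):   # Q1, Q2, ...
            if row not in seen:                       # union without duplicates
                seen.add(row)
                result.append(row)
    return result


print(wrapper_query(["Freud", "Jung"], 10))
```

Each disjunct becomes its own conjunctive source query, and the wrapper performs the union that the source cannot.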

Page 17: CSE 636 Data Integration

Extending Source Capabilities

• General scheme:
  – try many query rewritings
  – check if query fragments supported by source
  – check if wrapper can combine answer fragments
  – do all this very efficiently!!
  – H. Garcia-Molina, W. Labio, R. Yerneni: Capability-Sensitive Query Processing on Internet Sources, ICDE 1999
• Tsimmis, Info Manifold: no disjunctive queries
• DISCO: no query splitting
• Garlic: only CNF queries
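
To make the general scheme concrete, here is a deliberately naive sketch for conjunctive queries only. It is not the algorithm of the cited paper: it enumerates every way of pushing a subset of conditions to the source (exponential, so it ignores the efficiency requirement), and it assumes the wrapper can filter on any leftover condition. The capability test `source_supports` and the function `plan` are hypothetical.

```python
from itertools import combinations


def source_supports(conds):
    """Hypothetical capability: exactly one equality condition on author."""
    return len(conds) == 1 and conds[0][0] == "author" and conds[0][1] == "="


def plan(conditions):
    """Return (pushed_to_source, filtered_in_wrapper), or None if infeasible.

    Tries the rewritings that push the most conditions first, so the first
    supported rewriting found leaves the least work for the wrapper.
    """
    for size in range(len(conditions), 0, -1):
        for pushed in combinations(conditions, size):
            if source_supports(pushed):
                rest = [c for c in conditions if c not in pushed]
                return list(pushed), rest
    return None


query = [("author", "=", "Freud"), ("price", ">", 10)]
print(plan(query))
print(plan([("price", ">", 10)]))
```

For the query above, the sketch should push the author condition to the source and leave the price condition for the wrapper; a query with only a price condition has no feasible rewriting under this capability.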

Page 18: CSE 636 Data Integration

Mediator Processing

R(X, Y, Z), adornments f, f, b
T(Z, W, U), adornments f, u, b

M(X, Y, Z, W, U) = Join(R, T)

Query: M(5, Y, Z, W, 3)

[Figure: a mediator above two wrappers, each wrapping one source]

Page 19: CSE 636 Data Integration

Plan 1

R(X, Y, Z), adornments f, f, b
T(Z, W, U), adornments f, u, b

M(X, Y, Z, W, U) = Join(R, T)

Query: M(5, Y, Z, W, 3)

[Figure: a mediator above two wrapped sources, annotated with the steps below]

(1) R(5, Y, Z)
(2) T(Z, W, 3)
(3) Join answers
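
Whether Plan 1 is feasible can be checked against the adornments. The sketch below is an illustration with a hypothetical helper `feasible`, assuming the reading used earlier in the deck (b: must be specified, u: cannot be specified, f: either) and using `None` for an unbound attribute.

```python
def feasible(adornments, bindings):
    """Check a source query against per-attribute adornments f, u, b."""
    for adornment, value in zip(adornments, bindings):
        if adornment == "b" and value is None:
            return False      # a required binding is missing
        if adornment == "u" and value is not None:
            return False      # a binding was given where none is allowed
    return True


R_ADORNMENTS = ("f", "f", "b")   # R(X, Y, Z)
T_ADORNMENTS = ("f", "u", "b")   # T(Z, W, U)

step1_ok = feasible(R_ADORNMENTS, (5, None, None))   # (1) R(5, Y, Z)
step2_ok = feasible(T_ADORNMENTS, (None, None, 3))   # (2) T(Z, W, 3)
print(step1_ok, step2_ok)
```

Under this reading, step (2) is acceptable but step (1) is not, because R requires Z to be bound and Plan 1 sends R(5, Y, Z) with Z free. That suggests Plan 1 is not feasible as written, which motivates a plan that obtains Z values first.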

Page 20: CSE 636 Data Integration

Plan 2

R(X, Y, Z), adornments f, f, b
T(Z, W, U), adornments f, u, b

M(X, Y, Z, W, U) = Join(R, T)

Query: M(5, Y, Z, W, 3)

[Figure: a mediator above two wrapped sources, annotated with the steps below]

(1) P = T(Z, W, 3)
(2) for each (z, w, u) ∈ P: R(5, Y, z)
(3) Join answers
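
Plan 2 is a dependent join: the answers from T supply the bindings that R needs. A minimal sketch on made-up data follows; `R_DATA`, `T_DATA`, `query_R`, `query_T`, and `plan2` are hypothetical names. Each R call binds R's third attribute, the join attribute Z, to a z value taken from P.

```python
# Made-up contents for the two sources.
R_DATA = [(5, "y1", "z1"), (5, "y2", "z2"), (6, "y3", "z1")]   # R(X, Y, Z)
T_DATA = [("z1", "w1", 3), ("z2", "w2", 4), ("z3", "w3", 3)]   # T(Z, W, U)


def query_R(x=None, y=None, z=None):
    """Source R with adornments f, f, b: Z must be bound."""
    if z is None:
        raise ValueError("R requires Z to be specified")
    return [
        t for t in R_DATA
        if (x is None or t[0] == x) and (y is None or t[1] == y) and t[2] == z
    ]


def query_T(z=None, u=None):
    """Source T with adornments f, u, b: U must be bound, W cannot be given."""
    if u is None:
        raise ValueError("T requires U to be specified")
    return [t for t in T_DATA if (z is None or t[0] == z) and t[2] == u]


def plan2(x, u):
    """Answer M(x, Y, Z, W, u) where M = Join(R, T) on Z."""
    p = query_T(u=u)                            # (1) P = T(Z, W, u)
    answers = []
    for z, w, u_val in p:                       # (2) one R query per tuple of P
        for x_val, y, _ in query_R(x=x, z=z):
            answers.append((x_val, y, z, w, u_val))   # (3) join answers
    return answers


print(plan2(5, 3))
```

Every call to `query_R` has Z bound, so the plan respects both sources' adornments, at the cost of one R query per tuple in P.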

Page 21: CSE 636 Data Integration

Mediator Plan Generation

• Need feasible and efficient plan
• Search space is huge
• Tsimmis, Info Manifold, Garlic:
  – exponential algorithms
• Polynomial algorithms:
  – often find optimal or near-optimal plan
  – bounded performance
  – R. Yerneni, C. Li, J. D. Ullman, H. Garcia-Molina: Optimizing Large Join Queries in Mediation Systems, ICDT 1999

Page 22: CSE 636 Data Integration

Conclusion

• Not all sources are created equal!
• Need to
  – describe what sources can do
  – efficiently process queries with limited sources
  – describe what mediators can do
  – exploit content information
  – deal with unavailable sources

Page 23: CSE 636 Data Integration

References

• Computing Capabilities of Mediators
  – Ramana Yerneni, Chen Li, Hector Garcia-Molina, Jeffrey D. Ullman
  – SIGMOD Conference 1999
• Describing and Using Query Capabilities of Heterogeneous Sources
  – Vasilis Vassalos, Yannis Papakonstantinou
  – VLDB 1997