CSE 636 Data Integration Data Integration Approaches

Upload: marianna-patchin

Post on 14-Dec-2015


Page 1: CSE 636 Data Integration Data Integration Approaches

CSE 636 Data Integration

Data Integration Approaches

Virtual Integration Architecture

• Leave the data in the sources
• When a query comes in:
  – Determine the sources relevant to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
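The three run-time steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SOURCES` registry, the `fetch` callables standing in for wrappers, and the `answer` function are all hypothetical names, and the data is made up.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical sources: each advertises the global relation it contributes to
# and exposes a fetch function that stands in for a wrapper call.
SOURCES = [
    {"name": "S1", "relation": "Movie",
     "fetch": lambda: [{"title": "A", "year": 1955}, {"title": "B", "year": 1972}]},
    {"name": "S2", "relation": "Movie",
     "fetch": lambda: [{"title": "B", "year": 1972}, {"title": "C", "year": 1980}]},
    {"name": "S3", "relation": "Schedule",
     "fetch": lambda: [{"cinema": "Rex", "title": "B"}]},
]


def answer(relation: str, predicate: Callable[[Row], bool]) -> List[Row]:
    # Step 1: determine the sources relevant to the query.
    relevant = [s for s in SOURCES if s["relation"] == relation]
    combined: List[Row] = []
    for source in relevant:
        # Step 2: send a sub-query to each source (here: fetch everything).
        rows = source["fetch"]()
        # Step 3: filter at the mediator and combine, dropping duplicates.
        for row in rows:
            if predicate(row) and row not in combined:
                combined.append(row)
    return combined


movies_after_1960 = answer("Movie", lambda r: r["year"] > 1960)
```

Because nothing is copied out of the sources ahead of time, every call to `answer` sees whatever the sources hold at that moment, which is the "data is fresh" property.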

Virtual Integration Architecture

[Figure: Virtual Integration Architecture, built up over six slides that add step markers 1 through 6. Labels: End User, Query, Result, Mediator, Global Schema, Design-Time (Mediation Language, Mapping Tool), Run-Time (Query Reformulation, Optimization & Execution), Wrapper (×2), Data Source (×2), Local Schema (×2), XML, Web Services.]


Virtual Integration Approaches

Dimensions to Consider:
• How many sources are we accessing?
• How autonomous are they?
• Meta-data about sources?
• Is the data structured?
• Queries or also updates?
• Requirements: accuracy, completeness, performance, handling inconsistencies.
• Closed world assumption vs. open world?

Mediation Languages

[Figure: the relations listed below, a box labeled "Logic", and the Global Schema.]

Source relations:
  Authors(ISBN, FirstName, LastName)
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)

Global Schema:
  CD(ASIN, Title, Genre, …)
  Artist(ASIN, Name, …)

Desiderata from Source Descriptions

• Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data sources.
• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

Reformulation Problem

Given:
• A query Q posed over the global schema
• Descriptions of the data sources
Find:
• A query Q’ over the data source relations, such that:
  – Q’ provides only correct answers to Q, and
  – Q’ provides all possible answers to Q given the sources.

Languages for Schema Mapping

[Figure: a query Q is posed over the Mediated (Global) Schema at the Mediator and reformulated into queries Q’ over five Sources, each with its own Local Schema. Mapping languages shown: GAV, LAV, GLAV.]

Global-as-View (GAV)

Global Schema: Movie(title, dir, year, genre)
               Schedule(cinema, title, time)

Integrating View:
Create View Movie AS
  SELECT * FROM S1                              [S1(title,dir,year,genre)]
  union
  SELECT * FROM S2                              [S2(title,dir,year,genre)]
  union
  SELECT S3.title, S3.dir, S4.year, S4.genre
  FROM S3, S4                                   [S3(title,dir),
  WHERE S3.title = S4.title                      S4(title,year,genre)]
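To see the integrating view at work, it can be loaded into an in-memory SQLite database through Python's standard `sqlite3` module. This is a sketch, not part of the slides: the four source tables sit in one database only for convenience, and all rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Stand-ins for the four sources, with made-up rows.
CREATE TABLE S1(title, dir, year, genre);
CREATE TABLE S2(title, dir, year, genre);
CREATE TABLE S3(title, dir);
CREATE TABLE S4(title, year, genre);
INSERT INTO S1 VALUES ('A', 'Dir1', 1970, 'Comedy');
INSERT INTO S2 VALUES ('B', 'Dir2', 1985, 'Drama');
INSERT INTO S3 VALUES ('C', 'Dir3');
INSERT INTO S4 VALUES ('C', 1992, 'Comedy');

-- The GAV mapping: the global relation is a view over the sources.
CREATE VIEW Movie AS
  SELECT * FROM S1
  UNION
  SELECT * FROM S2
  UNION
  SELECT S3.title, S3.dir, S4.year, S4.genre
  FROM S3, S4
  WHERE S3.title = S4.title;
""")

# A query over the global schema; the engine unfolds the view into the sources.
comedies = con.execute(
    "SELECT title, dir FROM Movie WHERE genre = 'Comedy' ORDER BY title"
).fetchall()
```

The user query mentions only `Movie`; replacing `Movie` by its definition is all the reformulation GAV needs.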

Global-as-View: Example 2

Global Schema: Movie(title, dir, year, genre)
               Schedule(cinema, title, time)

Integrating View:
Create View Movie AS
  SELECT title, dir, year, NULL
  FROM S1                              [S1(title,dir,year)]
  union
  SELECT title, dir, NULL, genre
  FROM S2                              [S2(title,dir,genre)]

Global-as-View: Example 3

Global Schema: Movie(title, dir, year, genre)
               Schedule(cinema, title, time)

Integrating Views:
Create View Movie AS
  SELECT NULL, NULL, NULL, genre
  FROM S4                              [S4(cinema, genre)]

Create View Schedule AS
  SELECT cinema, NULL, NULL
  FROM S4                              [S4(cinema, genre)]

But what if we want to find which cinemas are playing comedies?
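A small experiment shows why the question is awkward under these two views. The sketch below uses Python's `sqlite3` with invented rows; column aliases are added to the view definitions so that the global query can name the columns. Joining `Movie` and `Schedule` on `title` compares NULL with NULL, which SQL never treats as equal, so the link between cinema and genre that S4 holds is lost.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S4(cinema, genre);
INSERT INTO S4 VALUES ('Rex', 'Comedy');
INSERT INTO S4 VALUES ('Odeon', 'Drama');

-- The two GAV views from the slide, with explicit column aliases.
CREATE VIEW Movie AS
  SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
CREATE VIEW Schedule AS
  SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# Query over the global schema: cinemas playing comedies.
via_global = con.execute("""
    SELECT S.cinema FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy'
""").fetchall()

# The same question asked of the source directly.
direct = con.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()
```

Here `via_global` comes back empty while `direct` finds the cinema, even though both read the same source table.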

Global-as-View Summary

• Query reformulation boils down to view unfolding.
• Very easy conceptually.
• Can build hierarchies of global schemas.
• You sometimes lose information. Not always natural.
• Adding sources is hard. Need to consider all other sources that are available.

Local-as-View (LAV)

[Figure: the Mediator's Global (Mediated) Schema above Source 1 through Source 5, each with its own Local Schema.]

Global Schema:
  Book(ISBN, Title, Genre, Year)
  Author(ISBN, Name)

Local Schemas:
  R1(ISBN, Title, Name)    Books before 1970
  R5(ISBN, Title)          Humor Books

Create View R1 AS
  SELECT B.ISBN, B.Title, A.Name
  FROM Book B, Author A
  WHERE A.ISBN = B.ISBN AND B.Year < 1970

Create View R5 AS
  SELECT B.ISBN, B.Title
  FROM Book B
  WHERE B.Genre = ‘Humor’

Query Reformulation

[Figure: the Mediator's Global (Mediated) Schema, Book(ISBN, Title, Genre, Year) and Author(ISBN, Name), above Source 1 through Source 5. Local schemas shown: R1(ISBN, Title, Name), "Books before 1970", and R5(ISBN, Title), "Humor Books".]

Query: Find authors of humor books

Plan: R1 Join R5
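The plan can be tried out on made-up data. In this sketch the global tables `Book` and `Author` exist only to populate the simulated sources; in a real LAV setting the mediator never sees them. The join on ISBN is an assumption, since the slide says only "R1 Join R5".

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical global data, used only to fill the simulated sources.
CREATE TABLE Book(ISBN, Title, Genre, Year);
CREATE TABLE Author(ISBN, Name);
INSERT INTO Book VALUES (1, 'Old Jokes', 'Humor', 1950);
INSERT INTO Book VALUES (2, 'New Jokes', 'Humor', 1990);
INSERT INTO Book VALUES (3, 'Old Tears', 'Drama', 1940);
INSERT INTO Author VALUES (1, 'Ann');
INSERT INTO Author VALUES (2, 'Bob');
INSERT INTO Author VALUES (3, 'Cy');

-- Source descriptions from the LAV slide.
CREATE VIEW R1 AS
  SELECT B.ISBN AS ISBN, B.Title AS Title, A.Name AS Name
  FROM Book B, Author A
  WHERE A.ISBN = B.ISBN AND B.Year < 1970;
CREATE VIEW R5 AS
  SELECT B.ISBN AS ISBN, B.Title AS Title
  FROM Book B
  WHERE B.Genre = 'Humor';
""")

# The plan: join the two sources (on ISBN, by assumption).
plan = con.execute("""
    SELECT DISTINCT R1.Name FROM R1 JOIN R5 ON R1.ISBN = R5.ISBN
    ORDER BY R1.Name
""").fetchall()

# What the query would return with full access to the global data.
ideal = con.execute("""
    SELECT DISTINCT A.Name FROM Book B JOIN Author A ON A.ISBN = B.ISBN
    WHERE B.Genre = 'Humor' ORDER BY A.Name
""").fetchall()
```

Every name in `plan` is a correct answer, but authors of humor books from 1970 onward are missing, because R1 is the only described source that carries author names.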

Query Reformulation

[Figure: the Mediator's Global (Mediated) Schema, Book(ISBN, Title, Genre, Year) and Author(ISBN, Name), above Source 1 through Source 5. Local schemas shown: R1(ISBN, Title, Name), "Books before 1970", and R5(ISBN, Title), "Humor Books".]

Query: Find authors of humor books before 1960

Plan: Can’t do it!
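One way to see why no plan exists: neither R1 nor R5 exposes the year, so two different global databases can look identical through the sources and still disagree on the query. The sketch below builds such a pair with invented data; the function names are hypothetical.

```python
# Book rows are (isbn, title, genre, year); Author rows are (isbn, name).

def r1(books, authors):
    # Source R1: books before 1970, with author names, without the year.
    return sorted((b[0], b[1], a[1]) for b in books for a in authors
                  if a[0] == b[0] and b[3] < 1970)


def r5(books):
    # Source R5: humor books, without the year.
    return sorted((b[0], b[1]) for b in books if b[2] == "Humor")


def humor_authors_before_1960(books, authors):
    # The user query, evaluated with full access to the global data.
    return sorted({a[1] for b in books for a in authors
                   if a[0] == b[0] and b[2] == "Humor" and b[3] < 1960})


authors = [(1, "Ann")]
world_a = [(1, "Old Jokes", "Humor", 1950)]  # published before 1960
world_b = [(1, "Old Jokes", "Humor", 1965)]  # published after 1960

same_sources = (r1(world_a, authors) == r1(world_b, authors)
                and r5(world_a) == r5(world_b))
answer_a = humor_authors_before_1960(world_a, authors)
answer_b = humor_authors_before_1960(world_b, authors)
```

Since the sources return the same rows in both worlds while the true answers differ, no query over R1 and R5 alone can be guaranteed correct here.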

Local-as-View: Example 1

Global Schema: Movie(title, dir, year, genre)
               Schedule(cinema, title, time)

Source Views:
Create Source S1 AS                    [S1(title, dir, year, genre)]
  SELECT * FROM Movie

Create Source S3 AS                    [S3(title, dir)]
  SELECT title, dir FROM Movie

Create Source S5 AS                    [S5(title, dir, year)]
  SELECT title, dir, year
  FROM Movie
  WHERE year > 1960 AND genre=‘Comedy’

Local-as-View: Example 2

Global Schema: Movie(title, dir, year, genre)
               Schedule(cinema, title, time)

Source Views:
Create Source S4 AS                    [S4(cinema, genre)]
  SELECT cinema, genre
  FROM Movie M, Schedule S
  WHERE M.title=S.title

Now if we want to find which cinemas are playing comedies, there is hope!
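The "hope" can be made concrete with a small simulation. In this sketch the global tables are populated with invented rows only so that S4 has contents; the rewriting itself touches nothing but S4.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical global data, used only to fill the simulated source.
CREATE TABLE Movie(title, dir, year, genre);
CREATE TABLE Schedule(cinema, title, time);
INSERT INTO Movie VALUES ('A', 'Dir1', 1970, 'Comedy');
INSERT INTO Movie VALUES ('B', 'Dir2', 1980, 'Drama');
INSERT INTO Schedule VALUES ('Rex', 'A', '20:00');
INSERT INTO Schedule VALUES ('Odeon', 'B', '21:00');
INSERT INTO Schedule VALUES ('Rex', 'B', '22:00');

-- LAV description of S4: a view over the global schema.
CREATE VIEW S4 AS
  SELECT S.cinema AS cinema, M.genre AS genre
  FROM Movie M, Schedule S
  WHERE M.title = S.title;
""")

# Rewriting of "which cinemas are playing comedies" that uses only S4.
rewriting = con.execute(
    "SELECT DISTINCT cinema FROM S4 WHERE genre = 'Comedy' ORDER BY cinema"
).fetchall()

# The same query against the global schema, for comparison.
over_global = con.execute("""
    SELECT DISTINCT S.cinema FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy' ORDER BY S.cinema
""").fetchall()
```

Because the source description keeps the join between `Movie` and `Schedule`, the pairing of cinema and genre survives, unlike in the GAV views of Example 3.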

Local-as-View Summary

• Very flexible. You have the power of the entire query language to define the contents of the source.
• Hence, can easily distinguish between contents of closely related sources.
• Adding sources is easy: they’re independent of each other.
• Query reformulation: answering queries using views!

The General Problem

• Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
• Many, many papers on this problem
• The best performing algorithm: The MiniCon Algorithm (Pottinger & Halevy, VLDB 2000)
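A building block used when answering queries using views is checking that a candidate rewriting, once its view atoms are expanded into their definitions, is contained in Q. For conjunctive queries without comparisons this reduces to finding a homomorphism from Q into the expansion. The sketch below is a brute-force version of that check, not the MiniCon algorithm; the query encoding (a head tuple plus a list of atoms, variables written with a leading `?`) is an assumption made for the example.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def extend(mapping, src_args, dst_args):
    """Extend mapping so src_args maps onto dst_args; None on a clash."""
    new = dict(mapping)
    for s, d in zip(src_args, dst_args):
        if is_var(s):
            if new.setdefault(s, d) != d:
                return None
        elif s != d:  # constants must match exactly
            return None
    return new


def has_homomorphism(src_atoms, dst_atoms, mapping):
    if not src_atoms:
        return True
    (rel, args), rest = src_atoms[0], src_atoms[1:]
    for dst_rel, dst_args in dst_atoms:
        if dst_rel == rel and len(dst_args) == len(args):
            new = extend(mapping, args, dst_args)
            if new is not None and has_homomorphism(rest, dst_atoms, new):
                return True
    return False


def contained_in(q1, q2):
    """True if every answer of q1 is an answer of q2 (q1 is contained in q2)."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    start = extend({}, head2, head1)
    return start is not None and has_homomorphism(body2, body1, start)


# Q(t, d) :- Movie(t, d, y, 'Comedy')
Q = (("?t", "?d"), [("Movie", ("?t", "?d", "?y", "Comedy"))])
# Expansion of Q'(t, d) :- S1(t, d, y1, 'Comedy'), with S1 = SELECT * FROM Movie
via_s1 = (("?t", "?d"), [("Movie", ("?t", "?d", "?y1", "Comedy"))])
# Expansion of Q'(t, d) :- S3(t, d), with S3 = SELECT title, dir FROM Movie
via_s3 = (("?t", "?d"), [("Movie", ("?t", "?d", "?y3", "?g3"))])

s1_is_sound = contained_in(via_s1, Q)
s3_is_sound = contained_in(via_s3, Q)
```

With the source descriptions from Local-as-View: Example 1, the rewriting over S1 passes the check, while S3 alone does not, since S3 says nothing about genre.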

Local Completeness Information

• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for years after 1960, or for American directors.
• Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?

Example

• Movie(title, director, year)
  – complete after 1960
• Show(title, theater, city, hour)
• Query: find movies (and directors) playing in Seattle:
    SELECT M.title, M.director
    FROM Movie M, Show S
    WHERE M.title=S.title
    AND city=‘Seattle’
• Complete or not?
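One way to explore the question is to build a possible counterexample. The sketch below assumes, beyond what the slide states, that Show is complete, and uses invented rows. A source that honors "complete after 1960" is still free to omit a 1955 movie, and nothing in the query keeps such a movie from playing in Seattle.

```python
def seattle_movies(movie_rows, show_rows):
    # Movie rows: (title, director, year); Show rows: (title, theater, city, hour).
    return sorted({(t, d)
                   for (t, d, _year) in movie_rows
                   for (st, _theater, city, _hour) in show_rows
                   if t == st and city == "Seattle"})


REAL_MOVIES = [("Old", "Dir1", 1955), ("New", "Dir2", 1995)]
SHOWS = [("Old", "Rex", "Seattle", "19:00"), ("New", "Rex", "Seattle", "21:00")]

# A source that is complete after 1960 but happens to hold nothing earlier.
SOURCE_MOVIES = [m for m in REAL_MOVIES if m[2] > 1960]

from_source = seattle_movies(SOURCE_MOVIES, SHOWS)
from_world = seattle_movies(REAL_MOVIES, SHOWS)
```

In this made-up instance the source-based answer misses the older movie, which suggests the query answer is not guaranteed to be complete under the stated completeness information.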

Example #2

• Movie(title, director, year), Oscar(title, year)
• Query: find directors whose movies won Oscars after 1965:
    SELECT M.director
    FROM Movie M, Oscar O
    WHERE M.title=O.title
    AND M.year=O.year
    AND O.year > 1965
• Complete or not?
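The same experiment can be repeated for this query. The sketch assumes two things the slide does not state: that Movie is again complete after 1960, and that Oscar is complete. All rows are invented. The conditions M.year=O.year and O.year > 1965 restrict the movies that matter to years after 1965, which lies inside the region where the Movie source is complete.

```python
def oscar_directors(movie_rows, oscar_rows):
    # Movie rows: (title, director, year); Oscar rows: (title, year).
    return sorted({d
                   for (t, d, y) in movie_rows
                   for (ot, oy) in oscar_rows
                   if t == ot and y == oy and oy > 1965})


REAL_MOVIES = [("Old", "Dir1", 1955), ("Mid", "Dir2", 1968), ("New", "Dir3", 1995)]
OSCARS = [("Old", 1955), ("Mid", 1968)]

# A source that is complete after 1960 but holds nothing earlier.
SOURCE_MOVIES = [m for m in REAL_MOVIES if m[2] > 1960]

from_source = oscar_directors(SOURCE_MOVIES, OSCARS)
from_world = oscar_directors(REAL_MOVIES, OSCARS)
```

Dropping the pre-1960 movie changes nothing here. A single instance is an illustration rather than a proof, but it matches the intuition that the query only needs the complete part of Movie.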

References

• Information Integration
  – Maurizio Lenzerini
  – Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003
  – Invited Tutorial
• Data Integration: a Status Report
  – Alon Halevy
  – German Database Conference (BTW), 2003
  – Invited Talk