data integration

59
Data Integration An Overview

Upload: germane-nieves

Post on 04-Jan-2016

33 views

Category:

Documents


1 download

DESCRIPTION

Data Integration. An Overview. What is Information Integration and Why is it important. Some of the upcoming slides are from William Cohen’s tutorial on information integration (WebDB 2005). To illustrate the problems, we focus here on this aspect. Linkage. Queries. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Integration

Data Integration

An Overview

Page 2: Data Integration

What is Information Integration and Why is it important

Some of the upcoming slides are from William Cohen’s tutorial on information integration(WebDB 2005)

Page 3: Data Integration

3

Information Integration

• Discovering information sources (e.g. deep web modeling, schema learning, …)

• Gathering data (e.g., wrapper learning & information extraction, federated search, …)

Queries

• Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)

• Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …)

Linkage

• Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database

To illustrate the problems, we focus here on this aspect

Page 4: Data Integration

4

[Science 1959]

Record linkage: bringing together of two or more separately recorded pieces of information concerning a particular individual or family (Dunn, 1946; Marshall, 1947).

Page 5: Data Integration

5

Motivations for Record Linkage c. 1959

Record linkage is motivated by certain problems faced by a small number of scientists doing data analysis for obscure reasons.

Page 6: Data Integration

6

Information integration in 1959• Many of the basic principles of modern integration work are

recognizable.• Manual engineering of distance features (e.g., last names as

Soundex codes) that are then matched probabilistically.– DB1 + DB2 + Pr(matches) + elbowGrease DB12

• Applied to records from pairs of datasets– “Smallest possible scale” for integration (one one dimension)

• Computationally expensive– Relative to ordinary database operations

• Narrowly used– Only for scientists in certain narrow areas (e.g., public health)

• How can this process be fully automated?• Why should we care?

Page 7: Data Integration

7

Ted Kennedy's “Airport Adventure” [2004]

Washington -- Sen. Edward "Ted" Kennedy said Thursday that he was stopped and questioned at airports on the East Coast five times in March because his name appeared on the government's secret "no-fly" list…Kennedy was stopped because the name "T. Kennedy" has been used as an alias by someone on the list of terrorist suspects.

“…privately they [FAA officials] acknowledged being embarrassed that it took the senator and his staff more than three weeks to get his name removed.”

Page 8: Data Integration

8

Florida Felon List [2000,2004]

The purge of felons from voter rolls has been a thorny issue since the 2000 presidential election. A private company hired to identify ineligible voters before the election produced a list with scores of errors, and elections supervisors used it to remove voters without verifying its accuracy…

The new list … contained few people identified as Hispanic; of the nearly 48,000 people on the list created by the Florida Department of Law Enforcement, only 61 were classified as Hispanics.

Gov. Bush said the mistake occurred because two databases that were merged to form the disputed list were incompatible. … when voters register in Florida, they can identify themselves as Hispanic. But the potential felons database has no Hispanic category…

The glitch in a state that President Bush won by just 537 votes could have been significant — because of the state's sizable Cuban population, Hispanics in Florida have tended to vote Republican… The list had about 28,000 Democrats and around 9,500 Republicans…

Page 9: Data Integration

9

Information dealing with such matters as violent crime, organized crime, fraud and other white-collar crime may take days to be shared throughout the law enforcement community, according to an FBI official.

The new software program was supposed to allow agents to pass along intelligence and criminal information in real time….

In a response contained in the inspector general's report, the FBI pointed to its Investigative Data Warehouse…that provides … access to 47 sources of counterterrorism data, including information from FBI files, other government agencies and open-source news feeds.

Page 10: Data Integration

10

..counter asymmetric threats by achieving total information awareness…

Page 11: Data Integration

11

Chinese Embassy Bombing [1999]• May 7, 1999: NATO bombs the Chinese Embassy in Belgrade with five

precision-guided bombs—sent to the wrong address—killing three.

“The Chinese embassy was mistaken for the intended target…located just 200 yards from the embassy. Reliance on an outdated map, aerial photos, and the extrapolation of the address of the federal directorate from number patterns on surrounding streets were cited … as causing the tragic error…despite the elaborate system of checks built-into the targeting protocol, the coordinates did not trigger an alarm because the three databases used in the process all had the old address.” [US-China Policy Foundation summary of the investigation]

“BEIJING, June 17 –– China today publicly rejected the U.S. explanation … [and] said the U.S. report ‘does not hold water.’” [Washington Post]

“The Chinese embassy was clearly marked on tourist maps that are on sale internationally, including in the English language. … Its address is listed in the Belgrade telephone directory…. For the CIA to have made such an elementary blunder is simply not plausible.” [World Socialist Web Site]

“Many observers believe that the bombing was deliberate…it if you believe that the bombing was an accident, you already believe in the far-fetched” [disinfo.com, July 2002].

Page 12: Data Integration

12

Information integration in 2005• Apparently, we still have work to do.

– Why is this problem so hard?

– The airport adventure: When can you tell if “T. Kennedy” the same person as “Ted Kennedy?” When can you accept an answer of “I don’t know”? What sorts of information can you use in deciding: structured data, text, images, … ?

– The embassy bombing: When are multiple sources that agree really useful? When have you looked at enough? What are the implications of looking at many sources?

– The felon list: If you act on uncertain matches, what kind of errors will you make? will they cancel out, or accumulate?

Page 13: Data Integration

13

Information integration in 2005• It is hard to give Definitions: What do we really

mean when we say “X is the same as Y”? does every user mean the same thing?

• Is “X is the same as Y” transitive?• What conclusions follow from “X is the same as

Y”? – Is it true that: Istanbul = Constantinople?– Does it follow that: The capital of Byzantium =

Istanbul?

Page 14: Data Integration

14

When are two entities are the same?

Page 15: Data Integration

15

Information integration in 2005• Apparently, we still have work to do.• We fail to integrate information correctly

– “Ted Kennedy (senator)”≠ “T. Kennedy (terrorist)”

• Crucial decisions are affected by these errors– Who can/can’t vote (felon list)– Where bombs are sent (Chinese embassy)

• Storing, linking, and analyzing information is a double-edged sword:– Loss of privacy and “fishing expeditions”

Page 16: Data Integration

16

Information Integration: today and tomorrow

• Discovering information sources: based on standards and free-text metadata.

• Data providers will be even more numerous.

• Gathering data: will get cheaper and cheaper

Queries

• Querying integrated information sources may be done in radically different query models

• Data mining & analyzing integrated information will be the norm, not the exception

Linkage

• Cleaning data to form a single virtual database will be guided by a user or group of users, and by characteristics of all the data

Page 17: Data Integration

INTEGRATING WEB INFORMATION

Page 18: Data Integration

Problem

18

Web

Intranet

Databases

EnterpriseInformation

Systems

Search for… The inventor of the world wide web Gothic and romantic churches that are located in the same place Movies with an actor who is the governor of California

Page 19: Data Integration

Arising Questions• What do we know about the structure of the data?• Why do we care what is the structure of the data?

• Can we “structure” data? How?

• Does John Doe also works in Apple?

• How do we interpret links?19

<worker> <name> Roy Smith</name> <company>Apple</company></worker>

Roy Smith works in Apple

<person> Roy Smith</person> works in <company>Apple</company>

//worker[company=“Apple”]/name

XPath Query:

<company>Apple</company> employs <person> John Doe</person>

Page 20: Data Integration

20

What are the publications of Max Planck?

Example query #1

Max Planck should be instance of concept person, not of concept institute

Concept Awareness

Page 21: Data Integration

21

Example query #2Conferences about XML in Norway 2005?

Context Awareness

Information is not present on a single page, but distributed across linked pages

VLDB Conference 2005, Trondheim, Norway

Call for Papers…XML…

Page 22: Data Integration

22

Goal: Increase Recall and Precision

Recall

The percentage of correct answers

that were retrieved, w.r.t. all

correct answers

Precision

The percentage of correct answers

among the retrieved answers

Page 23: Data Integration

INTEGRATING RELATIONAL DATA

Page 24: Data Integration

Mediation Languages

Goal:

Mediated Schema

SourceSource Source Source Source

Language forSpecifyingSemanticRelationships (not full FOL)

Q

Q’ Q’ Q’ Q’ Q’

Assume: data at the sources is structure (or seems so).

Page 25: Data Integration

Global-as-View (GAV)

Mediated Schema

SourceSource Source Source Source

R1 R2 R3 R4 R5

Title, Actor, …

Actor(x,y) :- R1(x,y,z)Actor(x,y) :- R2(x,z), R3(z,y)

Page 26: Data Integration

Local-as-View (LAV,GLAV)

Mediated Schema

SourceSource Source Source Source

R1 R2 R3 R4 R5

Title, Actor …

R1(x,y,z) :- Title(x,y), Actor(x,z), y< 1970

R5(x,y,z) :- Movie(x,y,”French”)

Page 27: Data Integration

LAV vs. GAV

• What are the advantages of LAV?• What are the advantages of GAV?• How are queries over the entire data being answered

in each approach?– GAV – Unfolding (easy)– LAV - Answering queries using views (NP-hard)

Page 28: Data Integration

Queries in LAV

• Suppose that we have the following mapping rules:ActingInfo(title, aname, year) Actor(aname, address)

ActingInfo(title, aname, year) Movie(title, year, director)

• How does the data look like?• We need to deal with incomplete information!• How can we answer queries?

ActorInfo(n, a)Actor(n, a)Titles(t,y) Movie(t,y,d)

Page 29: Data Integration

Dealing with Incomplete Information

• Given an incomplete database D’ (i.e., there are predicates with null values), we consider all the possible completions DD of D’

• Given a query Q over D’, a certain answer A of Q is an answer that is given for any possible completion, i.e., for any database of DD

• We consider query answering as the set of all certain answers

• How do we deal with negation (e.g., not exists)?

Page 30: Data Integration

Maximal Answers

• One approach to deal with missing values is be computing maximal answers:– Full disjunction in the relational case– Different semantics of maximal matching in the

case of matching graph queries to graph databases

– In both cases, computation is intricate

Page 31: Data Integration

Schema/Ontology Matching

Schema heterogeneity: a key roadblock for information integration– Different data sources speak their own schema– Mapping is key to any data sharing architecture

MediatorMediator

ConsumerConsumer

Data SourceData Source

Data SourceData Source

Data SourceData Source

Hotel, GaststätteBrauerei,

Kathedrale

Lodges, Restaurants

Beaches, Volcanoes

Hotel, Restaurant,AdventureSports,

HistoricalSites

Page 32: Data Integration

Schema Matching

Schema Matching: Schema Matching: Discovering correspondences between similar elementsDiscovering correspondences between similar elementsEventuallyEventually…… BooksAndMusic(x:Title, BooksAndMusic(x:Title,……) = Books(x:Title,) = Books(x:Title,……) ) CDs(x:Album, CDs(x:Album,……) )

BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords

Books TitleISBNPriceDiscountPriceEdition

CDs AlbumASINPriceDiscountPriceStudio

BookCategoriesISBNCategory

CDCategoriesASINCategory

ArtistsASINArtistNameGroupName

AuthorsISBNFirstNameLastName

Inventory Database A

Inventory Database B

Page 33: Data Integration

Typical Approaches• Multiple sources of evidences in the schemas

– Schema element names• BooksAndCDs/Categories ~ BookCategories/Category

– Descriptions and documentation• ItemID: unique identifier for a book or a CD• ISBN: unique identifier for any book

– Data types, data instances• DateTime Integer, • addresses have similar formats

– Schema structure• All books have similar attributes

– Use domain knowledge

Combine multiple techniques to exploit all available evidence

In isolation, techniques are incomplete or brittle

Page 34: Data Integration

XML AND SEMISTRUCTURED DATA

Page 35: Data Integration

XML

• In XML the is no strict schema• Integration is easier: you simply take XML

from different sources and put them in a single repository

• Well, actually the main problem of linking related pieces of information remains!

• And, additional new problems emerge (to whom is it good?)

Page 36: Data Integration

Querying and Searching in XML

• Some challenges arise:– How to deal with variations in the structure of the

XML?– How to deal with incomplete information?– How to find meaningful relationships among

elements? An important example – keyword search.

Page 37: Data Integration

An example

author(9)

bibliography(1)

bib(2)bib(11)

year(3)year(12)

book (4)article(7)

title(5) author(6)

title(8)

author(10)

book(13) article(16)

title(14)

author(15)title(17)

author(18)19992000

XML Bob

XMLMary

DatabaseCodd

C++John

Joe

Query: What are the titles and years of the publications, of which Mary is an author?

Page 38: Data Integration

INTEGRATING GEOGRAPHIC INFORMATION

Page 39: Data Integration

Integration of Geographic Data

• The goal: Matching objects that represent the same real-world entity in different maps

Page 40: Data Integration

40

The Goal: Matching Objects that Represent the Same Real-World Entity

Example: three data sources that provide information about hotels in Tel-Aviv

SOI: Survey Of Israel

MAPA: commercial corporation

MUNI: Municipally of Tel-Aviv

Page 41: Data Integration

41Each data source provides data that the other sources do not provideA join enables us to utilize the different perspectives of the data sources

The Goal: Matching Objects that Represent the Same Real-World Entity

Hotel Rank

polygons points

SOI: cadastral and building information

MAPA: tourist information

MUNI: Municipal information

Radison Moria

Is there a nearby parking lot?

Page 42: Data Integration

42

The Integration ProcessFirst step: overlaying the maps

Second step:generating the sets

Third step:fusing the objects

+

Page 43: Data Integration

Questions about Integration of Geographic Data

• How can we integrate efficiently and effectively geographical datasets?

• How does the existence of road networks affect the integration?

• Can a schema or ontology help us?

Page 44: Data Integration

44

Using Locations for Matching Objects

• There are no global keys to identify objects that should be joined

• Names cannot be used– Change often

– May be missing

– May be in different languages

• It seems that locations are keys:– Each spatial object contains location attributes

– In a “perfect world,” two objects that represent the same entity have the same location

Global key = common identifier in the different sources

Page 45: Data Integration

45

Locations are Inaccurate

• In real maps, locations are inaccurate

• The map on the left is an overlay of the three data sources about hotels in Tel-Aviv

For example, the Basel Hotel has three different locations, in the three data sources

Page 46: Data Integration

THE SEMANTIC WEB

Page 47: Data Integration

Semantic Web

“Most of the Web's content today is designed for humans to read, not for computer programs to manipulate meaningfully.”

Berners-Lee, T, Hendler, J & Lassila, O ‘The semantic web’, Scientific American, May 2001

Page 48: Data Integration

The Semantic Web

“For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.”

Berners-Lee, T, Hendler, J & Lassila, O ‘The semantic web’, Scientific American, May 2001

Page 49: Data Integration

The Semantic Web

• The main idea: Add semantics and reasoning instead of applying artificial intelligence

• Basic standards being developed: XML, XSchema, RDF, RDFS, OWL

• Is the Semantic Web the holly grail of integration?

Page 50: Data Integration

PRIVACY

Page 51: Data Integration

Privacy

• How can we publish information and yet, guarantee that integration won’t reveal sensitive data?

Page 52: Data Integration

52

What is Privacy ?• Society is experiencing exponential growth in the number and

variety of data collections containing person-specific information.

• Sharing these collected information is valuable both in research and business. Publishing the data may put person privacy in risk.

• Objective: Maximize data utility while limiting disclosure risk to an acceptable level

• Note :– There is no clear definition for disclosure and acceptable level– Not the traditional security of data e.g. access control, theft, hacking

etc.

Page 53: Data Integration

53

Example

• For medical research (e.g., Gene, infection diseases) a hospital has some person-specific patient data which it wants to publish

• It wants to publish such that:– Information remains practically useful– Identity of an individual cannot be determined

• Adversary might infer the secret/sensitive data from the published database

Page 54: Data Integration

54

Example – cont.

• The data contains:– Identifiers - {name, ssn}– Non-Sensitive data - {zip-code, nationality, age}– Sensitive data - { medical condition, salary, location }

Identifiers Non-Sensitive dataNon-Sensitive data Sensitive data

## Name ZipZip AgeAge NationalityNationality Condition

11 Kumar 1305313053 2828 IndianIndian Heart Disease

22 Bob 1306713067 2929 AmericanAmerican Heart Disease

33 Ivan 1305313053 3535 CanadianCanadian Viral Infection

44 Umeko 1306713067 3636 JapaneseJapanese Cancer

Page 55: Data Integration

55

Example – cont [SW02-A]

Non-Sensitive DataNon-Sensitive Data Sensitive DataSensitive Data

## ZipZip AgeAge NationalityNationality ConditionCondition

11 1305313053 2828 IndianIndian Heart Disease

22 1306713067 2929 AmericanAmerican Heart Disease

33 1305313053 3535 CanadianCanadian Viral Infection

44 1306713067 3636 JapaneseJapanese Cancer

PublishedData

ChrisChrisBobBobJohnJohnNameName

AmericanAmerican2323130531305333AmericanAmerican2929130671306722AmericanAmerican2828130531305311NationalityNationalityAgeAgeZipZip##

Voter List

Data leak! Do we have a privacy violation ?Do we have a privacy violation ?

Page 56: Data Integration

56

• The Group Insurance Commission (GIC) in Massachusetts sold a believed to be anonymous data of state employees health.

• Voter registration list for Cambridge Massachusetts – sold for 20$

• William Weld was governor of Massachusetts-– Lived in Cambridge Massachusetts– Six people had his particular birth date– Three of them were men– He was the only with 5-digit ZIP code.

Example – cont[SW02-A]

ZipBirthdateGender

EthnicityVisit dateDiagnosisProcedureMedicationTotal charge

NameAddressDate registeredParty affiliationDate last voted

Medical data Voter List

Quasi Identifier)QI)

Page 57: Data Integration

57

Example-2 – AOL (2006)Anon

ID Query QueryTimeItemRank ClickURL

1326 konig wheels 18/04/2006 13:29 1http://www.konigwheels.com

1326 jet blue airlines 27/04/2006 15:29    

1326 coats tire equipment 28/04/2006 15:53    

1326 coats tire equipment 03/05/2006 19:15    

1326 verizon wireless 09/05/2006 00:09    

1326 www.crazyradiodeals.com 23/05/2006 18:00    

1337 uslandrecords.com 01/03/2006 11:50 1 http://www.seda-cog.org

1337 titlesourcein.com 14/03/2006 15:45    

1337 titlesourceinc 14/03/2006 15:45 1http://www.titlesourceinc.com

1337 select business services 14/03/2006 15:51    

1337 select business services title 14/03/2006 15:52    

1337 cbc companies 14/03/2006 15:52 2http://www.cbc-companies.com

1337 cbc companies 14/03/2006 15:52 3http://www.cbc-companies.com

1337national real estate settlement services 14/03/2006 15:59 1 http://www.realtms.com

Page 58: Data Integration

Example2 – cont.

Page 59: Data Integration

Example-3