KD2R: a Key Discovery method for semantic Reference Reconciliation. Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs, LRI (University Paris-Sud). WOD’2013, June 3rd


TRANSCRIPT

  • Slide 1

KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD’2013, June 3rd

  • Slide 2

Data Linking
More and more heterogeneous RDF sources are available.
Links can be asserted between them.
sameAs is one of the most important types of links: it combines information given in different data sources.
LOD: the number of already existing links is very small.
How to create links automatically?
Danai Symeonidou, WOD’2013

  • Slide 3

Data Linking Problem (Dataset1, Dataset2)
P1: FirstName: George, LastName: Thomson, SSN: 011223456, Job: Artist
P2: FirstName: George, LastName: Thomson, SSN: 444223456, Job: Professor
P3: FirstName: George, LastName: Thomson, SSN: 011223456, Age: 45

  • Slide 4

Data Linking Problem (same three records as Slide 3, now with a SameAs link)

  • Slide 5

Data Linking Problem (same three records as Slide 3, with a SameAs link)

  • Slide 6

Data Linking with or without key constraints
No knowledge given about the properties: all the properties have the same importance.
Knowledge given by an expert:
  Specific expert rules [Arasu et al. 09, Low et al. 01, Volz et al. 09 (Silk)]
    Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
  Key constraints [Saïs, Pernelle and Rousset 09]
    Example: hasKey(Museum (museumName) (museumAddress))
OWL2 key for a class expression: a combination of (inverse) properties which uniquely identifies an entity
  hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
Example: hasKey(Museum (museumName) (museumAddress)) expresses:
  Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y) ∧ museumAddress(x1, w) ∧ museumAddress(x2, w) → sameAs(x1, x2)

  • Slide 7

Key Discovery Problem
Problem: when data sources contain numerous data and/or complex ontologies:
  Some keys are not obvious to find.
  Erroneous keys can be given by the expert.
Aim: automatic discovery of a complete set of keys from data.
Naïve automatic way to discover keys: examine all the possible combinations of properties.
  Example: given an instance described by 15 properties, the number of candidate keys is 2^15 - 1 = 32767.
  For each candidate key we have to scan all the instances of the data.
Objective: find keys efficiently by:
  Reducing the combinations
  Partially scanning the data

  • Slide 8

Key Discovery Problem
RDF data sources (conforming to an OWL 2 ontology)
Mappings between classes and properties of the different ontologies
Open world assumption (incomplete data) and multivalued properties may exist
How to discover keys when we do not know if:
  i1 =?= i2 =?= i3 =?= i4
  hasFriend(i1, i4), hasFriend(i2, i3), ... ?
  firstName(i1, Elodie) ?

  id  lastName  firstName  hasFriend
  i1  Tompson   Manuel     i2, i3
  i2  Tompson   Maria
  i3  David     George     i2, i4
  i4  Solgar    Michel

  • Slide 9

Key Discovery Problem: Assumptions
Unique Name Assumption (UNA): two different URIs refer to distinct entities (data sources generated from relational databases, Yago)
  i1 ≠ i2 ≠ i3 ≠ i4
Two literals that are syntactically different are semantically different (e.g. Napoleon Bonaparte ≠ Napoleon)

  • Slide 10

Key Discovery: Heuristics (same table as Slide 8)
Heuristic 1 - Pessimistic:
  Not instantiated property → all the values are possible
    Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
  Instantiated property → only given values are considered
    Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}

  • Slide 11

Key Discovery: Heuristics (same table as Slide 8)
Heuristic 2 - Optimistic:
  Not instantiated property → the value is not one of the already existing ones
    Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
  Instantiated property → only given values are considered
    Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}

  • Slide 12

KD2R approach
Topological sort of the classes (subsumption)
Key Finder
  Discover non keys
    Ex: {lastName}, {hasFriend}
  Derive keys using non keys
    Ex: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
Key Merge
  Cartesian product of minimal key sets in S1, S2
    Ex: K_s1 = {firstName}, K_s2 = {hasFriend}, K_s1-s2 = {firstName, hasFriend}
Technical report available: https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf

  • Slide 13

KD2R approach: Key Finder
Computation of maximal non keys and undetermined keys
Represent data in a prefix-tree (a compact representation of the data of one class)

  • Slide 14

Validation of approach
Datasets where KD2R has been tested:

  Dataset                                   RDF file      #instances  Optimistic / Pessimistic
  OAEI Restaurants Dataset                  Restaurant1   339         Yes
                                            Restaurant2   1390        Yes
  OAEI Persons Dataset                      Person11      1000        Yes
                                            Person12      1000        Yes
                                            Person21      1200        Yes
  DBpedia Dataset (properties instantiated  Person        763644      Yes / No
  in at least 80% of the data)              NaturalPlace  78400       Yes / No
                                            BodyOfWater   34008       Yes / No
                                            Lake          33348       Yes / No
  googleFusion Dataset                      G_Restaurant  372813      Yes
  ChefMoz Dataset                           C_Restaurant  1047        Yes

  • Slide 15

Demo
Ontologies
  Data conforming to one ontology
RDF data
  DBpedia NaturalPlace dataset (78400 instances)
  OAEI Person dataset (2000 instances)
Data linking
  Link data using LN2R
  Measure quality of linking using: recall, precision, f-measure

  • Slide 16

QUESTIONS???

  • Slide 17

THANK YOU!!!
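The naïve enumeration from the Key Discovery Problem slide and the pessimistic and optimistic heuristics can be illustrated on the four-instance example table. The following is a minimal sketch under stated assumptions, not the KD2R algorithm itself (which, per the slides, works on a prefix tree and computes maximal non keys): it simply checks every pair of instances. The names `DATA`, `candidate_sets`, `compare` and `classify` are hypothetical helpers introduced for this illustration, and the rule "a pair sharing a value on every property of a set makes that set a non key" is a simplified reading of the slides.

```python
from itertools import combinations

# Example table from the slides. A property that is not instantiated is
# simply absent from the instance (open world assumption).
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": {"i2", "i4"}},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}},
}
PROPS = ["lastName", "firstName", "hasFriend"]


def candidate_sets(props):
    """Naive enumeration: every non-empty subset, i.e. 2^n - 1 candidates."""
    props = list(props)
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            yield frozenset(combo)


def compare(a, b, prop):
    """Compare two instances on one (possibly multivalued) property."""
    va, vb = a.get(prop), b.get(prop)
    if va is None or vb is None:
        return "unknown"          # property not instantiated on one side
    return "share" if va & vb else "differ"


def classify(props, data=DATA, pessimistic=True):
    """Return 'non key', 'undetermined' or 'key' for a set of properties."""
    undetermined = False
    for a, b in combinations(data.values(), 2):
        results = {compare(a, b, p) for p in props}
        if "differ" in results:
            continue              # this pair is told apart by some property
        if results == {"share"}:
            return "non key"      # two distinct instances agree everywhere
        # Remaining case: no known difference, at least one missing value.
        # Pessimistic: the missing value could match, so we cannot decide.
        # Optimistic: the missing value is assumed different, so the pair is fine.
        if pessimistic:
            undetermined = True
    return "undetermined" if undetermined else "key"


for candidate in candidate_sets(PROPS):
    print(sorted(candidate), classify(candidate, pessimistic=True),
          classify(candidate, pessimistic=False))
```

On this toy table the pairwise check costs one pass over all instance pairs per candidate set, which is exactly the cost the slides aim to avoid by reducing the combinations and scanning the data only partially.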