Summary Models for Routing Keywords to Linked Data Sources


Upload: thanh-tran

Post on 11-May-2015


Page 1: Summary Models for Routing Keywords to Linked Data Sources

Thanh Tran, AIFB Institute, KIT, [email protected] KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

Summary Models for Routing Keywords to Linked Data Sources

Thanh Tran, Lei Zhang, Rudi Studer

AIFB Institute, KIT

Page 2: Summary Models for Routing Keywords to Linked Data Sources


Agenda

Introduction

Opportunities & challenges

Contributions

Problem Definition

LOD Data

Keyword Query Answer

Keyword Query Routing

Summary Models

Keyword sets

Element-level vs. schema-level vs. source-level Summary

Validity of Results vs. complexity

Theo. / Exp. Results

Conclusions

Page 3: Summary Models for Routing Keywords to Linked Data Sources


Semantic Data

- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
- As of 09-2010
- Plus other data (e.g. LON, ontologies, RDFa)
- Increasing rapidly...

More Data

More Links

Page 4: Summary Models for Routing Keywords to Linked Data Sources

Opportunities

“Articles from awarded researchers at Stanford”

Freebase contains data about people
DBPedia contains information about awards
DBLP contains bibliographic data

More Data

More Links

More complex information needs
More precise results
More integrated results

Page 5: Summary Models for Routing Keywords to Linked Data Sources


Problems

“Articles from awarded researchers at Stanford”

(x, y, z). prizes(x, Turing Award) ∧ worksAt(x, y) ∧ name(y, Stanford) ∧ publication(x, z)

Formulating queries is a hard task!
• Which data sources?
• Which schema elements?

Processing queries is expensive!
• Process against all data sources?

Large number of unknown & irrelevant sources!
• What is in there?
• What is relevant?

USABILITY SCALABILITY

Page 6: Summary Models for Routing Keywords to Linked Data Sources


Keyword Query Routing

Given the needs expressed as sets of keywords, are there “corresponding answers” in linked data? And what combination of data sources can be used to produce them?

Identify valid combinations of sources using keywords

Present schema elements for the user to formulate a query

Let the user choose a combination of sources

Process only relevant combinations of sources

Page 7: Summary Models for Routing Keywords to Linked Data Sources

Contributions

Introduce the novel problem of keyword query routing

Introduce various summary models, which aim to compactly represent the search space.

Investigate the resulting trade-offs between result quality and efficiency through theoretical analysis and practical experiments using publicly available linked data sources.

Propose the multi-level relationship graph to capture its search space.

Page 8: Summary Models for Routing Keywords to Linked Data Sources


Agenda

Introduction

Opportunities & challenges

Contributions

Problem Definition

LOD Data

Keyword Query Answer

Keyword Query Routing

Summary Models

Keyword sets

Element-level vs. schema-level vs. source-level Summary

Validity of Results vs. complexity

Theo. / Exp. Results

Conclusions

Page 9: Summary Models for Routing Keywords to Linked Data Sources

LOD Element-level Graph

[Figure: element-level data graph spanning the sources DBLP, Freebase and DBPedia. Entity nodes: per1, per2, per3, per4, uni1, prize1, prize2, pub1, pub2, pub3. Literal values: "Stanford University", "John McCarthy" (also spelled "John Mccarthy"), "John Smith", "Turing Award", "Music Award", and a title "…John.". Edge labels: author, name, label, title, employ, sameAs, prizes.]

Web data modeled as a set of interlinked data graphs
Each data graph represents a source
Element-level graph vs. schema-level graph vs. source-level graph

Page 10: Summary Models for Routing Keywords to Linked Data Sources

LOD Schema-level Graph

[Figure: schema-level graph over the sources DBLP, Freebase and DBPedia. Classes: Author, Article, Written Work, University, Person (shown twice), Prize. Edge labels: author, employ, sameAs, prizes.]

Web data modeled as a set of interlinked data graphs
Each data graph represents a source
Element-level graph vs. schema-level graph vs. source-level graph

Page 11: Summary Models for Routing Keywords to Linked Data Sources

LOD Source-level Graph

Web data modeled as a set of interlinked data graphs
Each data graph represents a source
Element-level graph vs. schema-level graph vs. source-level graph

[Figure: source-level graph with the nodes DBLP, Freebase and DBPedia, linked by sameAs and author edges.]

Page 12: Summary Models for Routing Keywords to Linked Data Sources

“Corresponding” Answers

), dD,Q,F,R(q ji

User information need: “stanford article award”

[Figure: the element-level data graph over DBLP, Freebase and DBPedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3; edge labels author, name, label, title, employ, sameAs, prizes), extended with an Article class node attached via a type edge.]

Page 13: Summary Models for Routing Keywords to Linked Data Sources


Problem Definition

A keyword query result (also called a Steiner graph) is a subgraph of the union of the data-level and schema-level graphs that contains a matching element for every keyword, such that these elements are pairwise connected by a path.

A d-max Steiner graph is a Steiner graph in which every path between keyword elements has length d-max or less.

Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources produce non-empty keyword query results.
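The d-max condition above can be illustrated with a small sketch. It is a simplification under stated assumptions: the graph is treated as undirected, shortest paths are measured in the whole graph rather than inside a chosen subgraph, and the edge list is a hypothetical toy loosely modelled on the running example (the names `TOY_EDGES`, `build_adjacency`, and `pairwise_connected_within` are not from the talk).

```python
from collections import deque
from itertools import combinations

# Hypothetical toy edges; the exact layout of the slide figure is assumed.
TOY_EDGES = [
    ("uni1", "per1"),    # employ
    ("per1", "per2"),    # sameAs
    ("per2", "per3"),    # sameAs
    ("per3", "prize1"),  # prizes
    ("pub1", "per2"),    # author
]

def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_path_length(adj, start, goal):
    """Return the BFS hop count from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def pairwise_connected_within(adj, keyword_elements, d_max):
    """True if every pair of keyword elements is joined by a path of length <= d_max."""
    for a, b in combinations(keyword_elements, 2):
        dist = shortest_path_length(adj, a, b)
        if dist is None or dist > d_max:
            return False
    return True

ADJ = build_adjacency(TOY_EDGES)
# Elements assumed to match "stanford", "article" and "award" respectively.
MATCHES = ["uni1", "pub1", "prize1"]
```

In this toy graph the longest pairwise distance is four hops (uni1 to prize1), so the three matches qualify for d-max = 4 but not for d-max = 3.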

Page 14: Summary Models for Routing Keywords to Linked Data Sources

A Valid Keyword Routing Plan

), dD,Q,F,R(q ji

User information need: “stanford article award”

[Figure: the element-level data graph over DBLP, Freebase and DBPedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3; edge labels author, name, label, title, employ, sameAs, prizes), extended with an Article class node attached via a type edge.]

Page 15: Summary Models for Routing Keywords to Linked Data Sources


The Search Space

Multi-level inter-relationship graphs capture the entire search space
Relationships between elements and between different levels

Search space is too large!
Naïve solution not applicable: apply existing approaches to keyword search for computing Steiner graphs
Steiner graphs might span several linked sources
Search space grows exponentially with the number of sources and their associated links

Page 16: Summary Models for Routing Keywords to Linked Data Sources


Agenda

Introduction

Opportunities & challenges

Contributions

Problem Definition

LOD Data

Keyword Query Answer

Keyword Query Routing

Summary Models

Keyword sets

Element-level vs. schema-level vs. source-level KERG

Validity of Results vs. complexity

Theo. / Exp. Results

Conclusions

Page 17: Summary Models for Routing Keywords to Linked Data Sources

Keyword Sets

[Figure: the element-level data graph over DBLP, Freebase and DBPedia, with one keyword set attached to each source. Keywords shown: Stanford, University, John, McCarthy, Turing, Award, Smith, Music.]

One keyword set for every data source
Elements stand for distinct keywords mentioned in a source
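As a rough illustration of the keyword-set (KS) model, the sketch below builds one keyword set per source and looks up which sources mention each query keyword. The assignment of literals to sources in `SOURCE_LITERALS` is an assumption made for the example, and the function names are hypothetical.

```python
# Hypothetical literals per source; the real distribution in the figure is assumed.
SOURCE_LITERALS = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP": ["John McCarthy", "John Smith"],
    "DBPedia": ["John McCarthy", "Turing Award", "Music Award"],
}

def build_keyword_sets(source_literals):
    """Map each source to the set of distinct lower-cased keywords it mentions."""
    return {
        source: {token.lower() for text in texts for token in text.split()}
        for source, texts in source_literals.items()
    }

def candidate_sources(keyword_sets, query):
    """For each query keyword, list the sources whose keyword set contains it."""
    return {
        keyword: sorted(
            source
            for source, keywords in keyword_sets.items()
            if keyword.lower() in keywords
        )
        for keyword in query
    }

KEYWORD_SETS = build_keyword_sets(SOURCE_LITERALS)
ROUTES = candidate_sources(KEYWORD_SETS, ["stanford", "award"])
```

A keyword set records only that a keyword occurs somewhere in a source. It says nothing about whether the matching elements are connected, which is the information the KERG models add.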

Page 18: Summary Models for Routing Keywords to Linked Data Sources

Element-level Keyword-Element Relationship Graph (E-KERG)

[Figure: the element-level data graph over DBLP, Freebase and DBPedia with its keywords (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), and the derived E-KERG over the data elements uni1, per1, per2, per3, per4, prize1, prize2 and pub4 (visible keyword labels: John, Award).]

A keyword-element captures a keyword k and the data element mentioning k
A relationship between two keyword-elements exists iff there is a path between their associated data elements
In a d-max KERG, the paths to be considered have length d-max or less
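The E-KERG definition can be sketched as follows. This is a minimal illustration, not the algorithm from the paper: it assumes an undirected adjacency map, a hypothetical `MENTIONS` table from data elements to the keywords they mention, and a bounded breadth-first search to find elements within d-max hops.

```python
from collections import deque

# Hypothetical toy chain: uni1 - per1 - per2 - per3 - prize1.
ADJ = {
    "uni1": ["per1"],
    "per1": ["uni1", "per2"],
    "per2": ["per1", "per3"],
    "per3": ["per2", "prize1"],
    "prize1": ["per3"],
}
# Assumed keyword mentions of two of the elements.
MENTIONS = {
    "uni1": {"stanford", "university"},
    "prize1": {"turing", "award"},
}

def within_distance(adj, start, d_max):
    """Return {node: hops} for every node reachable from start in <= d_max hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def build_e_kerg(adj, mentions, d_max):
    """Return (nodes, edges): keyword-elements and their d-max relationships."""
    nodes = {(kw, el) for el, kws in mentions.items() for kw in kws}
    edges = set()
    for el, kws in mentions.items():
        for other in within_distance(adj, el, d_max):
            if other not in mentions:
                continue
            for k1 in kws:
                for k2 in mentions[other]:
                    a, b = (k1, el), (k2, other)
                    if a != b:
                        edges.add(frozenset((a, b)))
    return nodes, edges

NODES, EDGES_D4 = build_e_kerg(ADJ, MENTIONS, 4)
_, EDGES_D3 = build_e_kerg(ADJ, MENTIONS, 3)
```

Because uni1 and prize1 are four hops apart in this toy chain, the relationship between ("stanford", "uni1") and ("award", "prize1") appears only when d-max is at least 4.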

Page 19: Summary Models for Routing Keywords to Linked Data Sources

Schema-level Keyword-Element Relationship Graph (S-KERG)

[Figure: the element-level data graph over DBLP, Freebase and DBPedia with its keywords (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), and the derived S-KERG, in which the keyword-elements over uni1, per1, per2, per3, per4, prize1, prize2 and pub4 are grouped under the classes University, Person, Author, Article and Prize.]

A keyword-element captures a keyword k and the schema element that contains some instances (data elements) mentioning k

A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements

Groups elements (relationships) when they capture the same pair of keywords in the same class (the same keyword relationship between the same pair of classes)

Page 20: Summary Models for Routing Keywords to Linked Data Sources

Data-Source-level Keyword-Element Relationship Graph (D-KERG)

[Figure: the element-level data graph over DBLP, Freebase and DBPedia with its keywords (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), and the derived D-KERG, in which the keyword-elements over uni1, per1, per2, per3, per4, prize1, prize2 and pub4 are grouped by source. The class labels University, Person, Author, Article and Prize are also shown.]

A keyword-element captures a keyword k and the source that contains some instances (data elements) mentioning k

A relationship between two keyword-elements exists if there is a path between some instances of their associated sources

Groups elements (relationships) when they capture the same pair of keywords in the same source (the same keyword relationship between the same pair of sources)
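The grouping step shared by S-KERG and D-KERG can be sketched as a single collapse over element-level relationships. The code assumes the element-level relationships are already computed; `E_EDGES`, `SOURCE_OF`, `CLASS_OF` and `summarize_kerg` are hypothetical names, and the assignment of elements to sources and classes is invented for the example.

```python
# Hypothetical element-level relationships: pairs of (keyword, element).
E_EDGES = [
    (("stanford", "uni1"), ("john", "per1")),
    (("john", "per3"), ("award", "prize1")),
    (("john", "per4"), ("award", "prize2")),
]
# Assumed element-to-source and element-to-class assignments.
SOURCE_OF = {
    "uni1": "Freebase", "per1": "Freebase",
    "per3": "DBPedia", "prize1": "DBPedia",
    "per4": "DBPedia", "prize2": "DBPedia",
}
CLASS_OF = {
    "uni1": "University", "per1": "Person",
    "per3": "Person", "prize1": "Prize",
    "per4": "Person", "prize2": "Prize",
}

def summarize_kerg(e_edges, group_of):
    """Collapse element-level relationships to (keyword, group) relationships.

    Relationships that agree on both keywords and both groups merge into one.
    """
    grouped = set()
    for (k1, e1), (k2, e2) in e_edges:
        a, b = (k1, group_of[e1]), (k2, group_of[e2])
        grouped.add(tuple(sorted((a, b))))  # order-independent edge key
    return grouped

D_KERG = summarize_kerg(E_EDGES, SOURCE_OF)  # source-level summary
S_KERG = summarize_kerg(E_EDGES, CLASS_OF)   # schema-level summary
```

In this toy data the two "john"/"award" relationships (per3 with prize1, per4 with prize2) merge into a single edge in both summaries. After merging, the summary can no longer tell which person holds which prize, which gives some intuition for why a plan read off a summary is only a candidate.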

Page 21: Summary Models for Routing Keywords to Linked Data Sources


Agenda

Introduction

Opportunities & challenges

Contributions

Problem Definition

LOD Data

Keyword Query Answer

Keyword Query Routing

Summary Models

Keyword sets

Element-level vs. schema-level vs. source-level KERG

Validity of Results vs. complexity

Theo. / Exp. Results

Conclusions

Page 22: Summary Models for Routing Keywords to Linked Data Sources


Theoretical Results

When Steiner graphs can be found for K in the data, then there is a keyword routing plan that can be found in the KERG.

Keyword routing plans derived from the summary are not necessarily valid, i.e., there might be no corresponding Steiner graph in the data.

Detailed results + algorithms + complexity results in the paper!

Page 23: Summary Models for Routing Keywords to Linked Data Sources


Experiments

Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings

30 manually crafted valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources
Example keywords: Town River America Beijing Conference Database 2007

Page 24: Summary Models for Routing Keywords to Linked Data Sources


Validity

P@k measures the percentage of plans that are valid out of the top-k plans
P@5 up to 100% for E-KERG (dmax = 4), P@5 for KS only 6%
More valid plans were computed when a higher value was used for dmax
dmax = 3 seems to be a good tradeoff
Queries with a larger number of keywords resulted in lower precision

[Figure: two plots of P@5 (0.0 to 1.0) for E-KERG, D-KERG, S-KERG and KS: one over the number of keywords |K| (2 to 5), one over dmax (0 to 4).]
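The P@k measure used in this slide can be computed as in the sketch below. Two details are assumptions rather than statements from the talk: plans are represented as frozensets of source names, and when fewer than k plans are returned the denominator is the number of plans actually returned. The ranked list and the validity labels are invented for illustration.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of the top-k ranked plans that are valid.

    Assumption: if fewer than k plans exist, divide by the number returned.
    """
    top = ranked_plans[:k]
    if not top:
        return 0.0
    return sum(1 for plan in top if is_valid(plan)) / len(top)

# Hypothetical ranked routing plans and ground-truth validity labels.
RANKED = [
    frozenset({"Freebase", "DBPedia"}),
    frozenset({"DBLP"}),
    frozenset({"Freebase", "DBLP", "DBPedia"}),
    frozenset({"DBPedia"}),
    frozenset({"DBLP", "DBPedia"}),
]
VALID = {
    frozenset({"Freebase", "DBPedia"}),
    frozenset({"Freebase", "DBLP", "DBPedia"}),
    frozenset({"DBLP", "DBPedia"}),
}

P_AT_5 = precision_at_k(RANKED, lambda plan: plan in VALID, 5)
```

With three valid plans among the five returned, `P_AT_5` is 0.6 for this invented ranking.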

Page 25: Summary Models for Routing Keywords to Linked Data Sources

Performance

Times increased with higher values for dmax
• Sharp for E-KERG and S-KERG
• Relatively stable for D-KERG

Times increased with the number of keywords
• All models but D-KERG had poor performance w.r.t. complex queries
• E-KERG needed more than 100s for queries with more than 2 keywords
• Time for D-KERG was no more than 10ms on average

[Figure: two plots of query processing time in ms (log scale, 1 to 1,000,000) for S-KERG, D-KERG, KS and E-KERG: one over dmax (0 to 4), one over the number of keywords |K| (2 to 5).]

Page 26: Summary Models for Routing Keywords to Linked Data Sources


Conclusions

Keyword query routing helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs

Summarizing relationships is essential for dealing with the large-scale linked data Web (E-KERG achieved poor performance, requiring more than 100s for complex queries)

Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid

However, validity is still low for complex queries (<30% with 4 keywords)

Baseline approaches for a novel problem
Further improve validity and consider relevance!
Combine keyword query routing with source and structured query processing to compute final results!

Page 27: Summary Models for Routing Keywords to Linked Data Sources


Thanks for Your Attention!

Institute AIFB, KIT

[email protected]
