
Structure
• Query Processing
  – Data models
  – Query models
  – Approaches
  – Challenges
• Keyword query processing on RDF
• Structured query processing on RDF
• Structured query processing on the Web
  – Routing needs to linked data sources
  – Linked data query processing

Query Processing
[Figure: a query is matched against data]

Data / Data Models
• Textual
  – Bag-of-words
  – Represent documents, text in structured data, …, real-world objects (captured as structured data)
  – Miss “structured information”
    • in text, e.g. linguistic structure, hyperlinks, (positional information)
    • in structured data
Example text: In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.
Bag-of-words terms (with term statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, …

Data / Data Models
• Textual
• Structured
  – Resource Description Framework (RDF)
  – Represent real-world objects, services, applications, …, documents
  – Resource attribute values and relationships between resources
  – Schema

Data / Data Models
• Textual
• Structured
• Hybrid
  – Textual and structured data

Query / Query Models
• Unstructured
• Fully-structured
• Hybrid: unstructured + structured

Query / Query Models
• Unstructured
  – NL
  – Keywords: book price 30

Query / Query Models
• Unstructured
• Fully-structured
  – SQL: select, from, where
      SELECT title, price
      FROM Books
      WHERE price < 30

Query / Query Models
• Unstructured
• Fully-structured
  – SQL: select, from, where
  – SPARQL: BGP, filter, optional, union, select, construct, ask, describe
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX ns: <http://example.org/ns#>
      SELECT ?title ?price
      WHERE {
        { ?x dc:title ?title .
          OPTIONAL { ?x ns:price ?price . FILTER (?price < 30) } }
        UNION
        { ?book dc:title ?title .
          ?book dc:creator ?author }
      }

Query / Query Models
• Unstructured
• Fully-structured
  – SQL
  – SPARQL
  – Conjunctive queries, e.g., graph patterns (BGP)

Query / Query Models
• Fully-structured
• Unstructured
• Hybrid: content and structure constraints

Query Processing
• Matching queries against data

Approaches – Taxonomy (1)
[Figure: a query is matched against data]
• Complete • Sound
• Approximate • Not complete • Not sound
• Ranked • Best effort • Top-k
Query processing focuses on efficiency, whereas ranking deals with result quality!

Approaches – Taxonomy (2)
                      Textual Data                                  Structured Data
Unstructured Query    Keyword query on textual data (Standard IR)   Keyword query on structured data
Structured Query      Structured query on textual data              Structured query on structured data (standard DB)
Hybrid                Hybrid query (XML IR)

Keyword Query / Textual Data
• Retrieve documents
  • Inverted list (inverted index): keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
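As an illustration of the inverted index and the AND-semantics above, here is a minimal Python sketch. The toy documents and the function names `build_inverted_index` and `and_query` are assumptions made for this example; scores and the top-k join mentioned on the slide are left out, so the conjunction is a plain posting-list intersection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions]} (a simple posting list)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def and_query(index, keywords):
    """Return ids of documents that contain every keyword (AND-semantics)."""
    postings = [set(index.get(k.lower(), {})) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical toy collection, not taken from the slides.
docs = {
    "doc1": "cheap book price list",
    "doc2": "book review",
    "doc3": "price of a rare book",
}
index = build_inverted_index(docs)
result = and_query(index, ["book", "price"])
print(result)
```

A ranked variant would keep a score per posting and stop once k results are certain, which is what the top-k join refers to.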

Structured Query / Structured Data
• Retrieve data for triple patterns
  • Index on tables
  • Multiple “redundant” indexes to cover different access patterns (e.g. SP-index, PO-index)
• Join (conjunction of triples)
  • Blocking, e.g. linear merge join (requires sorted input)
  • Non-blocking, e.g. symmetric hash join
  • Materialized join indexes
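The combination of redundant triple indexes and a blocking merge join can be sketched as follows. This is a simplified illustration, assuming in-memory dictionaries for the SP- and PO-indexes and hypothetical toy triples; real RDF stores use sorted on-disk structures.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object).
triples = [
    ("per1", "worksAt", "uni1"),
    ("per2", "worksAt", "uni1"),
    ("per1", "author", "pub1"),
    ("per2", "author", "pub2"),
    ("per3", "author", "pub3"),
]

# Two redundant indexes over the same data, one per access pattern.
sp_index = defaultdict(list)  # (subject, predicate) -> objects
po_index = defaultdict(list)  # (predicate, object)  -> subjects
for s, p, o in triples:
    sp_index[(s, p)].append(o)
    po_index[(p, o)].append(s)

def merge_join(left, right):
    """Linear merge join of two sorted lists of distinct join keys."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

# Query: ?x worksAt uni1 . ?x author ?pub
employees = sorted(po_index[("worksAt", "uni1")])
# Scanning the SP keys stands in for a predicate-only index here.
authors = sorted({s for (s, p) in sp_index if p == "author"})
joined = merge_join(employees, authors)
answers = [(x, pub) for x in joined for pub in sp_index[(x, "author")]]
print(answers)
```

The merge join is blocking in the sense that both inputs must be completely available and sorted before the first result can be produced.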

Keyword Query / Structured Data
• Retrieve keyword elements
  • Using inverted index: keyword → {<el1, score, ...>, <el2, score, ...>, …}
• Exploration / “Join”
  • Data indexes for triple lookup
  • Materialized index (paths up to graphs)
  • Top-k Steiner tree search, top-k subgraph exploration

References
• Günter Ladwig, Thanh Tran: Combining Query Translation with Query Answering for Efficient Keyword Search. ESWC 2010:288-303
• Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. ICDE 2009:405-416
• Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. SIGMOD 2008:903-914
• Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer: Ontology-Based Interpretation of Keywords for Semantic Search. ISWC/ASWC 2007:523-536
• Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD 2007:305-316
• Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar: Bidirectional Expansion For Keyword Search on Graph Databases. VLDB 2005:505-516

Structured Query / Textual Data
• Based on offline IE (for the offline case, see Peter’s slides)
• Based on online IE, i.e., retrieval proceeds as follows
  • Derive keywords to retrieve relevant documents
  • On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
  • Retrieve extracted data for the structured part
  • Retrieve documents for derived text patterns, e.g. sequences, windows, regular expressions (hybrid case)
• Index
  • Inverted index for document retrieval and pattern matching
  • Join index: inverted index for storing materialized joins between keywords
  • Neighborhood indexes for phrase patterns

References
• Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni: Structured Querying of Web Text Data: A Technical Challenge. CIDR 2007:225-234
• S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In WWW, pages 717–726, 2006.
• S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti: Scalable ad-hoc entity extraction from text collections. PVLDB, 1(1), 2008.
• M. J. Cafarella. Extracting and querying a comprehensive web database. In CIDR, 2009.
• G. Ramakrishnan, S. Balakrishnan, and S. Joshi. Entity annotation using inverse index operations. In EMNLP, 2006.
• M. Cafarella and O. Etzioni. A search engine for natural language applications. In WWW, 2006.

Query Processing – Main Tasks
• Retrieval
  – Documents, data elements, triples, paths, graphs
  – Inverted index, …, but also other indexes (B+ tree)
  – Index documents, triples, materialized join paths
• Join
  – Different join implementations; efficiency depends on availability of indexes
  – Non-blocking join is good for early result reporting and for the “unpredictable” linked data scenario
[Figure: a query is matched against data]

Query Processing – More Tasks
• Disjunction, aggregation, grouping
• Join order optimization
• Approximate
  – Approximate the search space
  – Retrieve only some results
  – Approximate the join
• Parallelization
• Top-k
  – Use only some entries in the input streams to produce k results
• Multiple sources
  – On-the-fly mapping, similarity join
  – Federation, routing
• Hybrid
  – Join text and data
[Figure: a query is matched against data]

Query Processing on the Web: Research Challenges and Opportunities
• Large amount of semantic data
• Data inconsistent, redundant, and low quality
• Large amount of data embedded in text
• Large amount of sources
• Large amount of links between sources
• Optimization, parallelization
• Approximation
• Hybrid querying and data management
• Federation, routing
• Online schema mappings
• Similarity join

Approaches
                      Textual Data                                  Structured Data
Unstructured Query    Keyword query on textual data (Standard IR)   Keyword query on structured data (IR-DB)
Structured Query      Structured query on textual data (DB – IR)    Structured query on structured data (standard DB)
Annotations: Routing, Approximation, Adaptive Optimization; Search Space Approximation

Keyword Query Processing on Graph-Structured RDF Data

Keyword Search in DBs / Keyword Translation (Kacholia et al., VLDB05)
$D, Q, F, R(q_i, d_j)$
User information need: „stanford article turing award“
[Figure labels: Translation, Specification]
• Keywords might produce a large number of matching elements in the data graphs
• The data graphs might be large in size
• Search complexity increases substantially with the size of the data graphs
• Large number of results

Query Space (Tran et al., ICDE 2009)
[Figure: schema graph derived from the data graph; query space = keyword elements connected with schema elements]
• Main idea
  – Query space: a more compact representation of the data graph
• Online construction of the query space out of the schema graph
  – Match keywords against labels of resources to find keyword elements
  – Connect keyword elements with elements of the schema graph to obtain the query space
• Online top-k query graph exploration
Exploration on a much reduced summary model called the query space
Substantially decreases complexity
Top-k procedure for graph exploration computes only the top-k most relevant results

Top-k Query Graph Exploration on Query Space
[Figure: query space, three paths from keyword-matching elements, and costs of elements]
• Cost-directed exploration of minimal Steiner graphs
• Explore all possible distinct paths starting from keyword elements
• At each exploration step, take the current path with the lowest cost
• When a connecting element is found, merge paths to obtain a candidate
• Top-k terminates when
  • highest cost in the candidate list (the cost of the k-ranked query graph) < lowest possible cost that can be achieved with paths in the queues
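The exploration loop and its termination test can be sketched in Python as below. This is a simplified illustration, not the algorithm of the cited paper: it keeps only the cheapest path per (keyword, element) pair instead of all distinct paths, it uses one shared priority queue, and a candidate's cost is taken to be the sum of its path costs. The toy query space, the unit costs and the name `top_k_connections` are assumptions.

```python
import heapq

def top_k_connections(adj, cost, keyword_elements, k):
    """Cost-directed exploration from keyword elements.

    adj: element -> neighbouring elements; cost: element -> non-negative cost;
    keyword_elements: keyword -> list of start elements.
    Returns up to k (total cost, connecting element) pairs, cheapest first.
    """
    keywords = list(keyword_elements)
    queue = []
    for kw in keywords:
        for el in keyword_elements[kw]:
            heapq.heappush(queue, (cost[el], kw, el))
    best = {}        # (keyword, element) -> cheapest path cost found
    candidates = []  # sorted list of (total cost, connecting element)
    while queue:
        path_cost, kw, el = heapq.heappop(queue)
        # Termination: the k-th candidate is cheaper than the cheapest open path.
        if len(candidates) >= k and candidates[k - 1][0] < path_cost:
            break
        if (kw, el) in best:
            continue
        best[(kw, el)] = path_cost
        # A connecting element has been reached from every keyword.
        if all((other, el) in best for other in keywords):
            total = sum(best[(other, el)] for other in keywords)
            candidates.append((total, el))
            candidates.sort()
        for nxt in adj.get(el, ()):
            if (kw, nxt) not in best:
                heapq.heappush(queue, (path_cost + cost[nxt], kw, nxt))
    return candidates[:k]

# Hypothetical query space with unit element costs.
adj = {
    "stanford_value": ["University"],
    "University": ["stanford_value", "Person"],
    "Person": ["University", "Article", "Prize"],
    "Article": ["Person"],
    "Prize": ["Person", "turing_value"],
    "turing_value": ["Prize"],
}
cost = {element: 1 for element in adj}
keyword_elements = {
    "stanford": ["stanford_value"],
    "article": ["Article"],
    "award": ["turing_value"],
}
top1 = top_k_connections(adj, cost, keyword_elements, 1)
print(top1)
```

The termination test is safe because every candidate found later must contain at least one path that is still in the queue, so its total cost cannot fall below the cheapest queued path.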

Structured Query Processing on Graph-Structured RDF Data

Query Processing
• Structured query: conjunctive queries
  – Processing conjunctive queries on graph-structured data amounts to graph-pattern matching
Determining matches requires exponential time
Search complexity increases substantially with the size of the graph
The size of the graph is very large on the Web of linked data
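To make graph-pattern matching concrete, the following Python sketch evaluates a conjunctive query by extending variable bindings one triple pattern at a time with nested loops. It is a naive baseline under assumed toy data (the resource names follow the running example, the triples themselves are hypothetical), and it shows why the cost grows quickly with graph size: every pattern scans all triples for every partial binding.

```python
def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple; return None on conflict."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variables must be bound consistently across patterns.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def evaluate_bgp(patterns, triples):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = match_pattern(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings

# Hypothetical data graph as (subject, predicate, object) triples.
triples = [
    ("per1", "worksAt", "uni1"),
    ("uni1", "name", "Stanford"),
    ("per1", "prizes", "Turing Award"),
    ("per1", "publication", "pub1"),
    ("per4", "prizes", "Music Award"),
    ("per4", "publication", "pub4"),
]
query = [
    ("?x", "prizes", "Turing Award"),
    ("?x", "worksAt", "?y"),
    ("?y", "name", "Stanford"),
    ("?x", "publication", "?z"),
]
solutions = evaluate_bgp(query, triples)
print(solutions)
```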

Answer Space (Tran et al., SemData@VLDB 2010)
[Figure: an extended example of the data graph and the resulting answer space]
• Construction of the answer space is based on bisimulation
• Answer space
  – Comprises classes (extensions) and relations between them
  – Resources in an extension exhibit the same structure, i.e., have the same (incoming and outgoing) paths
  – Is a structural description finer-grained than a schema
Summary model for general data graphs
Structure-based data partitioning to store together data that share structures
Structure-aware processing to filter candidates and prune queries using a smaller answer space

Structure-aware Matching Using Answer Space
[Figure: the answer space and an example query]
• Match query against answer space
  – Answer space matches contain elements satisfying the query structure
• Focus on answer space matches to compute final answers
  – Prune query parts containing non-distinguished variables only
  – Match remaining query against data graph (i.e., focus on elements in the answer space matches identified and loaded before)
• Advantages: reduction in I/O cost and in the number of unions and joins

Query Processing on the Linked Data Web

Query Processing on the Web
• Routing
  • Find combinations of sources
• Federation
  • Query parts to sources
  • Combining results from different sources
• Online schema mappings
• Similarity join

Linked Data
- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
- As of 09-2010, plus other linked data not covered by the LOD cloud
[Figure labels: More Data, More Links]

Challenges
“Articles from awarded researchers at Stanford”
$(x, y, z).\ \mathit{prizes}(x, \text{Turing Award}) \land \mathit{worksAt}(x, y) \land \mathit{name}(y, \text{Stanford}) \land \mathit{publication}(x, z)$
Formulating queries is a hard task!
• Which data sources?
• Which schema elements?
Processing queries is expensive!
• Process against all data sources?
• Large number of unknown, unprocessed & irrelevant sources!
  – What is in there?
  – What is out there?
  – What is relevant?
USABILITY, SCALABILITY

Searching Linked Data
• Given the needs (expressed as sets of keywords),
  – are there answers in processed linked data?
  – what combinations of data sources produce them?
  – how to incorporate related unprocessed linked sources?
• Identify valid combinations of sources
• Identify schema elements
• Let the user choose a combination of sources
• Focus on this combination of sources and related linked sources
[Figure labels: Keyword Query Routing, Linked Data Query Processing]

Keyword Query Routing (Tran et al., ISWC 2010)

Keyword Query Routing
• Linked data (schema and data are linked)
• Routing based on keywords
  • Find combinations of sources

LOD Data Graph
[Figure: example data graph spanning the sources DBLP, Freebase and DBpedia. Node labels include per1–per4, uni1, prize1, prize2, pub1–pub3, “Stanford University”, “John McCarthy”, “John Smith”, “Turing Award” and “Music Award”; edge labels include author, employ, prizes, name, label, title and sameAs.]
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

LOD Schema Graph
[Figure: schema graph for the sources DBLP, Freebase and DBpedia. Class labels include Author, University, Person, Prize, Written Work and Article; edge labels include author, employ, prizes and sameAs.]
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

LOD Source Graph
[Figure: source graph with the sources DBLP, Freebase and DBpedia; edge labels include sameAs and author.]
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

Keyword Query Answers
$D, Q, F, R(q_i, d_j)$
User information need: „stanford article award“
[Figure: the example LOD data graph over DBLP, Freebase and DBpedia, extended with a type edge to Article]

Problem Definition
• A keyword query result (also called a Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called a keyword element), and in which these elements are pairwise connected over a path.
• A d-max Steiner graph is a Steiner graph in which every path between keyword elements has length d-max or less.
• Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources can be combined to produce non-empty keyword query results.

A Valid Keyword Routing Plan
$D, Q, F, R(q_i, d_j)$
User information need: „stanford article award“
[Figure: the example LOD data graph over DBLP, Freebase and DBpedia, extended with a type edge to Article]

The Search Space
• Multi-level inter-relationship graphs capture the entire search space
  • Relationships between elements
  • and between different levels
A solution: apply existing approaches to keyword search for computing Steiner graphs
Steiner graphs might span several linked sources
Search space grows exponentially with the number of sources and their associated links
Search space is too large!

Keyword Sets
[Figure: the example LOD data graph over DBLP, Freebase and DBpedia, annotated with keyword sets containing keywords such as Stanford, University, John, McCarthy, Turing, Award, Smith and Music]
• One keyword set for every data source
• Elements stand for distinct keywords mentioned in a source

Element-level Keyword-Element Relationship Graph (E-KERG)
[Figure: the example LOD data graph with its keyword sets, and the derived E-KERG; labels in the E-KERG include uni1, per1–per4, prize1, prize2, pub4, John and Award]
• A keyword-element captures a keyword k and the data element mentioning k
• A relationship between two keyword-elements exists iff there is a path between their associated data elements
• In a d-max KERG, the paths to be considered have length d-max or less
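The definition of a d-max E-KERG can be illustrated with a short Python sketch. It assumes an undirected view of the data graph, measures path length in hops, and uses hypothetical toy edges and keyword mentions loosely modelled on the running example; `build_e_kerg` is a name chosen for this illustration.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency lists (an assumption: edge direction is ignored)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def within_distance(adj, start, d_max):
    """Breadth-first search: {element: hops} for elements at most d_max hops away."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def build_e_kerg(edges, mentions, d_max):
    """Relate two keyword-elements iff their data elements are within d_max hops."""
    adj = build_adjacency(edges)
    nodes = sorted((kw, el) for el, kws in mentions.items() for kw in kws)
    reach = {el: within_distance(adj, el, d_max) for el in mentions}
    return {
        (n1, n2)
        for n1, n2 in combinations(nodes, 2)
        if n2[1] in reach[n1[1]]
    }

# Hypothetical toy data: edges between data elements and keywords they mention.
edges = [
    ("uni1", "per2"),    # employ
    ("per2", "per1"),    # sameAs
    ("per1", "per3"),    # sameAs
    ("per3", "prize1"),  # prizes
    ("per1", "pub1"),    # author
]
mentions = {
    "uni1": {"stanford", "university"},
    "per2": {"john", "mccarthy"},
    "prize1": {"turing", "award"},
}
kerg3 = build_e_kerg(edges, mentions, 3)
kerg4 = build_e_kerg(edges, mentions, 4)
print(len(kerg3), len(kerg4))
```

In this toy graph uni1 and prize1 are four hops apart, so their keyword-elements are related only in the 4-max KERG.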

Schema-level Keyword-Element Relationship Graph (S-KERG)
[Figure: the example LOD data graph with its keyword sets, the E-KERG, and the derived S-KERG over the classes University, Person, Author, Article and Prize]
• A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
• Groups elements (relationships) when they capture the same keyword (relationships between the same classes)

Data-Source-level Keyword-Element Relationship Graph (D-KERG)
[Figure: the example LOD data graph with its keyword sets, the E-KERG and its grouping by class, from which the D-KERG over the sources DBLP, Freebase and DBpedia is derived]
• A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated sources
• Groups elements (relationships) when they capture the same keyword (relationships between the same sources)

Routing Plan Computation
• Keyword sets
  – Retrieve elements for every keyword k in K
  – Retrieve associated sources and put them into SK
  – Compute all |K|-combinations of SK (KRPs)
• KERG models
  – Compute all 2-combinations of K to get all keyword pairs
  – Retrieve matching KERG relationships for each pair and join them to produce matching subgraphs (KRPs)
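The KERG-based computation can be sketched as follows for a source-level KERG. This is a brute-force illustration rather than the join strategy of the cited paper: it enumerates one source per keyword and keeps an assignment only if every keyword pair is covered by a KERG relationship. The relationships and the source name `OtherSource` are hypothetical.

```python
from itertools import combinations, product

def routing_plans(keywords, relationships):
    """Return every valid keyword routing plan as a frozenset of sources.

    relationships: set of frozensets {(keyword1, source1), (keyword2, source2)}.
    """
    sources = {k: set() for k in keywords}
    for relationship in relationships:
        for keyword, source in relationship:
            if keyword in sources:
                sources[keyword].add(source)
    plans = set()
    # One source per keyword; keep the assignment if all keyword pairs are related.
    for assignment in product(*(sorted(sources[k]) for k in keywords)):
        nodes = list(zip(keywords, assignment))
        if all(frozenset(pair) in relationships for pair in combinations(nodes, 2)):
            plans.add(frozenset(assignment))
    return plans

def rel(a, b):
    """Helper for writing an undirected KERG relationship."""
    return frozenset((a, b))

# Hypothetical D-KERG relationships between (keyword, source) pairs.
d_kerg = {
    rel(("stanford", "Freebase"), ("article", "DBLP")),
    rel(("stanford", "Freebase"), ("award", "DBpedia")),
    rel(("article", "DBLP"), ("award", "DBpedia")),
    rel(("article", "DBLP"), ("award", "OtherSource")),
}
plans = routing_plans(["stanford", "article", "award"], d_kerg)
print(plans)
```

The assignment that uses `OtherSource` for "award" is rejected because no relationship connects it with the keyword-element for "stanford".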

Mixed, Corrective and Stream-based Linked Data Query Processing

Structured Query Processing on the Web
• Linked data (schema and data are linked)
• Federation
  • Query parts to sources
  • Combining results from different sources
• Exploration
• Mixed

Top-down Query Evaluation (Harth et al., WWW 2010)
• Local index of sources, assumed to be complete
  – Used for source selection
  – Maps triple and join patterns to source URIs
• Statistics for ranking of sources and query optimization
  – Performed once at compile-time
  – Only a fixed number of top-ranked sources is considered
• No run-time discovery
• Fast, only relevant sources are retrieved
• Not up-to-date
• Index size may become very large

Bottom-up Query Evaluation (Hartig et al., ISWC 2009)
• Sources discovered at run-time through links from other, already retrieved sources
• No local index of sources
• Slower, as unnecessary sources are retrieved
• Always up-to-date
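Run-time source discovery can be sketched as a breadth-first traversal over links. The Python example below simulates the Web with a dictionary from URIs to triples, which is an assumption made so that the code runs offline; the URIs, the predicate filter and the name `traverse_links` are hypothetical and not part of the cited approach.

```python
from collections import deque

def traverse_links(seeds, web, predicates):
    """Retrieve sources breadth-first, following URIs in triples that use query predicates.

    web: URI -> list of (subject, predicate, object) triples, standing in for HTTP lookups.
    Returns the retrieval order and all triples collected.
    """
    seen = set(seeds)
    queue = deque(seeds)
    order, data = [], []
    while queue:
        uri = queue.popleft()
        order.append(uri)
        for s, p, o in web.get(uri, []):
            data.append((s, p, o))
            if p not in predicates:
                continue
            # Newly discovered URIs become sources to retrieve later.
            for term in (s, o):
                if term in web and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return order, data

# Hypothetical simulated Web of linked data.
web = {
    "ex:kit": [("ex:alice", "worksAt", "ex:kit")],
    "ex:alice": [("ex:alice", "knows", "ex:bob"), ("ex:alice", "likes", "ex:jazz")],
    "ex:bob": [("ex:bob", "name", "Bob")],
    "ex:jazz": [("ex:jazz", "label", "Jazz")],
}
order, data = traverse_links(["ex:kit"], web, {"worksAt", "knows", "name"})
print(order)
```

No index is consulted, so the traversal always sees current data, but which sources are useful only becomes clear after they have been retrieved.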

Mixed Strategy (Ladwig et al., ISWC 2010)
• Combination of top-down and bottom-up strategies
  – Partial local index of sources, not assumed to be complete
  – New sources are discovered at run-time
• Addresses volume and dynamics of Linked Data
• Corrective Source Ranking
  – Deal with heterogeneous source descriptions
• Stream-based Query Processing
  – Deal with unpredictable nature of Linked Data access

Stream-based Query Processing
• Compile-time
  – Construct query plan
  – Probe local index for sources
• Run-time
  – Retrieve sources
  – Push data into query plan
  – Discover new sources
  – Rank sources
• Network latency
  – Do not block!
  – Evaluation driven by incoming data
[Figure: architecture with the parts Query Plan and Source Retrieval. The query plan joins the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n) and produces Results. Further components: Source Retriever 1, Source Retriever 2, ..., Source Ranker (e.g. Source 1 with score 1.0, Source 2 with score 0.7), Local source index and Linked Data. Arrow labels: Push, Retrieve source, Source discovered, Samples.]

Push-based Symmetric Hash Join
• Operation
  – Maintains a hash table for each input
  – Each arriving tuple is inserted into the hash table of its own input, and then the other table is probed for join combinations
• Results reported as soon as input tuples arrive
• Tuples can arrive on all inputs in any order
• Push-based
  – Tuples are pushed into operators from the leaves to the root of the query plan
  – Execution driven by incoming tuples instead of results
[Figure: example. Left input hash table: a → t1, t3; b → t2. Right input hash table: b → t4, t5; c → t6. Tuple t7 with key b is pushed on the left: it is inserted into the left table (b → t2, t7), the right table is probed, and the outputs t7t4 and t7t5 are pushed.]
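The operator described above can be sketched in a few lines of Python. The class name `SymmetricHashJoin` and the `emit` callback are assumptions for this illustration; the tuples and keys follow the example on the slide.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Push-based symmetric hash join with one hash table per input."""

    def __init__(self, emit):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.emit = emit  # callback that receives each join result

    def push(self, side, key, tup):
        other = "right" if side == "left" else "left"
        # Insert into the table of the tuple's own input ...
        self.tables[side][key].append(tup)
        # ... then probe the other table and push results downstream at once.
        for match in self.tables[other].get(key, []):
            pair = (tup, match) if side == "left" else (match, tup)
            self.emit(pair)

results = []
join = SymmetricHashJoin(results.append)

# Build the state shown on the slide; tuples may arrive in any order.
for key, tup in [("a", "t1"), ("b", "t2"), ("a", "t3")]:
    join.push("left", key, tup)
for key, tup in [("b", "t4"), ("b", "t5"), ("c", "t6")]:
    join.push("right", key, tup)

before = len(results)
join.push("left", "b", "t7")   # the step illustrated on the slide
new_results = results[before:]
print(new_results)
```

Because neither input has to be complete before results appear, a slow source delays only its own tuples, which fits the advice not to block on network latency.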

Corrective Source Ranking
• Prefer more relevant sources
• Relevancy of a source is based on
  – Current query
  – Any available intermediate results
  – Overall optimization goal
• Define a set of source features and derive concrete source metrics
  – Not all metrics are available for all sources (heterogeneity)
• Refine previously computed metrics using newly discovered information

Source Features and Metrics
– Source is more relevant if it contains data that contributes to answers of the query
– Triple Pattern Cardinality
– Join Pattern Cardinality

Metric Correction and Refinement
• During query processing new information becomes available: intermediate join results, links
  – Refine and correct previously computed metrics
  – Important in the case of non-discriminative patterns
• Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates
• Example
[Figure: intermediate results in the SHJ operator are sampled, and the samples are used to perform triple pattern cardinality lookups]
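One way to read the sampling step is as a scaled estimate: look up the cardinality of the instantiated triple pattern for a sample of intermediate bindings and extrapolate to all bindings. The Python sketch below makes that reading concrete. The estimator, the function name `estimate_join_size` and the toy cardinalities are assumptions, not the metric definition of the cited paper.

```python
import random

def estimate_join_size(bindings, lookup_cardinality, sample_size, seed=0):
    """Estimate a join's result size for one source from a sample of bindings.

    bindings: intermediate values for the join variable (e.g. from the SHJ tables).
    lookup_cardinality: value -> cardinality of the instantiated triple pattern.
    """
    if not bindings:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(bindings, min(sample_size, len(bindings)))
    matched = sum(lookup_cardinality(value) for value in sample)
    # Scale the sampled total up to the full set of intermediate results.
    return matched * len(bindings) / len(sample)

# Hypothetical per-source statistics for the pattern knows(?x, ?y) with ?x bound.
cardinalities = {"per1": 3, "per2": 0, "per3": 1}
bindings = ["per1", "per2", "per3", "per4"]

def lookup(value):
    return cardinalities.get(value, 0)

exact = estimate_join_size(bindings, lookup, sample_size=4)
rough = estimate_join_size(bindings, lookup, sample_size=2)
print(exact, rough)
```

With a sample as large as the binding set the estimate equals the true total; smaller samples trade accuracy for fewer lookups.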

Conclusions
• Query processing: which kinds of data, queries?
  – Focus: textual & structured queries and semantic data
• Web of linked data creates opportunities and challenges
  – Optimization
  – Approximation
  – Routing
  – Top-k… and ranking
• Web is linked data + a large amount of text
  – Hybrid management & integrated search

References
• Thanh Tran, Günter Ladwig: Structure Index for RDF. SemData@VLDB 2010
• Thanh Tran, Lei Zhang, Rudi Studer: Routing Keywords to Linked Data Sources. ISWC 2010
• Günter Ladwig, Thanh Tran: Linked Data Query Processing Strategies. ISWC 2010
• Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, Jürgen Umbrich: Data summaries for on-demand queries over linked data. WWW 2010:411-420
• Olaf Hartig, Christian Bizer, Johann Christoph Freytag: Executing SPARQL Queries over the Web of Linked Data. ISWC 2009:293-309