CroRDF: Optimization for RDF Query on monetary cost via Crowdsourcing
Depeng Dang, Member, IEEE
Abstract—The proliferation of structured data and the advances in knowledge graphs have enabled the construction of knowledge bases that use the RDF data model to represent various resources and their relationships. However, some RDF queries cannot be answered completely from existing data. In this paper, we present CroRDF, a query system that provides users with low-cost query services based on existing and crowdsourced RDF data. We propose crowdsourcing query plan (CQP) enumeration optimization algorithms that enumerate the CQPs in the search space based on the selection of high-scoring acquisition rules for each triple pattern in the basic graph pattern (BGP). To find the optimal CQP, we describe a monetary cost estimation algorithm. Together, these algorithms reduce the total time required to traverse the search space and improve the optimization efficiency. The cost estimation algorithm considers the relationships between triple patterns and is combined with the multiple choices of crowdsourcing direction to estimate the monetary cost of a query. To evaluate CroRDF, we create different queries on the DBpedia dataset; the crowd contributes its knowledge through Amazon Mechanical Turk. Experimental results clearly show that our solution achieves low monetary cost by combining crowdsourcing platforms with existing data.
Index Terms—Crowdsourcing, RDF, Monetary cost estimation, Crowdsourcing cost optimization
—————————— ◆ ——————————
1 INTRODUCTION
Since Google optimized its search services with knowledge graphs, knowledge graphs have grown rapidly, and a variety of semantic knowledge bases have emerged in both industry and academia, such as DBpedia1, YAGO-NAGA2, Freebase3 and GeoNames4. The Resource Description Framework5 (RDF) is a W3C standard for describing network resources. It is widely used to represent
various resources and their relationships in the knowledge
graph. RDF is a semi-structured data model where entities
are represented as resources; connection between resources
are described as triples composed of subjects, predicates and
objects[1]. Many semantic knowledge bases use the RDF
semantic model to express millions of fact entities and their
relations. Rich and substantial knowledge bases provide not
1. https://wiki.dbpedia.org
2. https://datahub.io/collections/yago
3. https://developers.google.com/freebase
4. http://www.geonames.org
5. https://www.w3.org/RDF
_________________________
• D. Dang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• W. Yu is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• S. Wang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
• N. Wang is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China. E-mail: [email protected].
only a backend basis for a variety of applications but also
intelligent services.
RDF data model technologies are especially serviceable for expressing the knowledge of the Internet. RDF clearly represents the subject, predicate and object of a sentence in the form of triples. Moreover, SPARQL allows applications to perform complex queries on distributed RDF databases and is supported by various competing frameworks. Yet SPARQL can only query the data already in the knowledge base, so the quality of that data determines whether the results of a query are good or bad. If the data in the knowledge base are incomplete, the missing facts cannot be queried.
Existing methods and technologies for acquiring RDF data cannot guarantee data integrity. Translating from text data or XML documents and mining from the semantic web [2] are both inadequate because the data resources are limited. Moreover, both are offline approaches; neither can provide a complete answer to a query immediately. Therefore, acquiring RDF data that meet the completeness requirement in real time remains challenging.
Recently, with the rapid development of the network, people are scrambling to make full use of network resources, and many projects are attracted by the power of people on the Internet. Over the past decade, multinational corporations in developed countries turned their attention to the low-cost labor markets of China and India. Now, however, it no longer matters where the labor force comes from: workers can live next door or as far away as Indonesia, as long as they have access to the Internet. Crowdsourcing integrates the advantages of machines and manpower to effectively solve complex problems [6][7][8][9], such as the evaluation of search results [10], tagging of pictures [11], and
xxxx-xxxx/0x/$xx.00 © 2018 IEEE Published by the IEEE Computer Society
filtering [12]. Data can be acquired in real time and on
demand through crowdsourcing systems. Collecting data
through crowdsourcing alleviates the problems while
guaranteeing RDF semantic integrity and realizing flexible
queries in real time. So, many researchers combine man and
machine to achieve the desired result.
Previous research in this area has stressed the importance of collecting knowledge from the crowd to complete missing values; examples include CrowdQ [3], HARE [4] and [5]. However, these hybrid human-machine approaches have so far focused on the data itself and have not minimized the monetary cost.
The monetary cost has not been addressed by previous research. In this paper, we acquire RDF data through crowdsourcing. However, collecting data through crowdsourcing is not free, so the goal of the present work is to obtain RDF data through crowdsourcing at the least monetary cost. Traditional database query optimization aims to reduce CPU, I/O, and communication costs rather than monetary cost. A few studies have addressed queries on structured data through crowdsourcing with the goal of reducing monetary cost, but these works are not applicable to RDF data, which are semi-structured in nature. In addition, no study has sought to obtain RDF data by crowdsourcing while minimizing the monetary cost.
In this paper, we describe CroRDF, a crowdsourcing knowledge system that provides a query service by combining existing data and crowdsourced data. We focus on the crowdsourcing query optimization of the CroRDF system to minimize the monetary cost.
Collecting the query answers from the existing data is “free”. Thus, given a query from an end user, CroRDF first searches for answers in the existing data; we call this the search phase. When the answers do not meet the requirements, the system collects the remaining answers through crowdsourcing; we call this the collect phase. In this phase, the system first generates a search space that contains all possible CQPs. Then, a cost estimation algorithm evaluates each CQP to obtain its monetary cost. Finally, the CQP with the least monetary cost is chosen as the final query plan.
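The two-phase strategy above can be sketched as follows. This is a minimal, self-contained illustration: the triple store, the candidate CQPs, and the cost model are stand-in stubs (our assumptions), not the actual CroRDF components.

```python
# A sketch of CroRDF's two-phase execution: answer from existing data first
# (the "free" search phase), then fall back to the cheapest crowdsourcing plan.

def search_existing(store, pattern):
    """Search phase: match a (s, p, o) pattern; None acts as a variable."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

def answer_query(store, pattern, threshold_n, candidate_cqps, estimate_cost):
    answers = search_existing(store, pattern)
    if len(answers) >= threshold_n:
        return answers[:threshold_n], 0.0        # no crowdsourcing needed
    # Collect phase: pick the CQP with the least estimated monetary cost.
    best = min(candidate_cqps, key=estimate_cost)
    return answers, estimate_cost(best)

store = [("wang1", "WorkIn", "Jishuitan Hospital"),
         ("wang2", "WorkIn", "Beijing Hospital")]
pattern = (None, "WorkIn", None)                 # ?doctor WorkIn ?hospital
answers, cost = answer_query(store, pattern, 5, ["planA", "planB"],
                             {"planA": 3.0, "planB": 1.5}.get)
```

With only two existing answers and a threshold of 5, the sketch falls through to the collect phase and selects the cheaper of the two hypothetical plans.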
In CroRDF, an ordered BGP graph can be assigned different sets of acquisition rules, resulting in different crowdsourcing plans and different monetary costs. Therefore, in the optimization process, we define two reasonable evaluation scores that are used to determine the candidate set of optimal acquisition rules. Then, we detail a naive algorithm and an improved, efficient enumeration algorithm to enumerate all CQPs in the search space. During this process, we cache the number of possible result tuples needed (PossiNum) while computing the relationships between the triples, which improves the efficiency.
We attribute the cost estimation problem to the
PossiNum estimation problem. The PossiNum estimation is
holistic. The PossiNum of a sub-plan depends on the entire
plan. For a triple-pattern sequence, we consider the
association types between the triple patterns and the search
direction of the basic graph pattern (BGP) to estimate the
PossiNum of each triple pattern and the cost of the CQP.
Finally, we select the optimized CQP with the lowest
monetary cost. We conduct an experimental evaluation of
our system in terms of the accuracy of the cost estimation and
the effectiveness of the acquisition rule scores and the two
plan enumeration algorithms.
Our main contributions are as follows:
1. We propose the design of CroRDF, including a two-phase execution strategy for RDF crowdsourcing, i.e., a search phase and a collect phase.
2. We describe the evaluation scores of acquisition rules and the plan enumeration algorithms.
3. We traverse the candidate crowdsourcing query plans to find the one with the lowest cost.
4. We demonstrate through experiments on a real dataset that our approach has lower cost.
The remainder of this paper is structured as follows: Section 2 reviews the related work. Section 3 introduces the basic architecture of the CroRDF system. Section 4 elaborates the search phase and explains how queries are first answered from the RDF database. Section 5 defines the search space of the CQPs in CroRDF, proposes a plan enumeration optimization process that enumerates all the CQPs in the search space based on the selection of acquisition rules, and expands on the cost estimation algorithm that CroRDF uses to estimate the execution cost of a CQP. Section 6 presents the experimental evaluation of our system. Finally, Section 7 summarizes the paper and discusses areas we have identified for future improvement.
2 RELATED WORKS
In recent years, crowdsourcing has been widely used in various fields as an efficient and cheap problem-solving model, with demonstrated advantages in human resources [16][17]. The survey in [18] offers an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing; here, we deal specifically with answering RDF queries with crowdsourcing, which is the most promising approach to the integrity issues researchers face in RDF querying. Some data-oriented processing systems have adopted a declarative approach to crowdsourcing, integrating crowdsourcing process control into the data collection process. Approaches such as CrowdDB [19], Deco [21][22][23][24], HARE [20], CoEx Deco [26] and CrowdOp [25] target scenarios in which existing microtask platforms are directly embedded in query processing systems.
There are three important problems in this field: quality control, cost control and latency control [27]. Here, we briefly review the approaches mentioned above. CrowdDB [19] uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It exploits the extensibility of the iterator-based query processing paradigm to add crowd functionality to a DBMS. CrowdDB supports two types of user interfaces that allow users to input the primary key of the search. It highlights two cases where human input is needed: (a) unknown or incomplete data, and (b) subjective comparisons, and it extends SQL to address these cases.
Similarly, Deco [21][22][23][24] is a database system for declarative crowdsourcing. Syntactically, Deco's query language is a simple extension to SQL. Building on CrowdDB [19], Deco proposes the notions of fetch and resolution rules, which provide powerful mechanisms for describing crowd access methods. Fetch rules specify how data in the conceptual relations can be obtained from external sources (including humans). Resolution rules are used to reconcile inconsistent or uncertain values obtained from external sources.
CrowdOp [25] supports cost-based query optimization. It is capable of finding the query plan with low latency given a user-defined budget constraint, which nicely balances users' cost and time requirements. CrowdOp develops efficient algorithms for optimizing three types of queries: selection queries, join queries, and complex selection-join queries. CoEx Deco [26] is a system that lets the user submit queries in the form of comments. CoEx Deco allows the user to comment on anything about a specific noun in the form of a triple and seamlessly integrates user-entered data with data collected from the crowd. When a query is evaluated, the input RDF graph is matched against the variables inside the triple patterns.
HARE [20] identifies the parts of SPARQL queries that are affected by incomplete portions of RDF data sets, crowdsources potential missing values, and then efficiently combines the crowd answers with the results from the data set during query execution. It uses a crowd knowledge base that captures crowd answers about missing values in the RDF dataset, and a microtask manager that exploits the semantics encoded in the dataset's RDF properties to crowdsource SPARQL sub-queries as microtasks and update the crowd knowledge base with the results from the crowd.
CrowdDB [19] proposes using microtask-based crowdsourcing to answer queries that cannot otherwise be answered, but no cost control is involved. The Deco [21][22][23][24] prototype does not yet perform sophisticated query optimization. Although HARE can enhance the answer to a SPARQL query, it concentrates on automatically identifying the completeness of a query against RDF data and does not consider the optimization of the crowdsourcing cost of a query, which is our specific target. The CoEx Deco system [26] answers user queries over a Simple Protocol and RDF Query Language (SPARQL) query on RDF while obtaining data from the crowd in the form of triples, but it mainly aims to make SPARQL queries more expressive and does not optimize the crowdsourcing process.
Our work focuses on semi-structured RDF data, which include the associated relationships between the triple patterns in a SPARQL query. In that respect, our work concerns cost control for crowdsourcing the missing data of a SPARQL query on an RDF database, and it extends the SPARQL language. Our system builds upon widely used crowdsourcing platforms, such as AMT. We consider the mutual influence between existing data and crowdsourced data and embed crowd computing features in the query execution. In conclusion, CroRDF is a crowdsourcing query system that considers the characteristics of RDF data and the crowdsourcing mode, shows the universality and flexibility of crowdsourcing in data collection, optimizes the crowdsourcing query process, effectively replenishes missing RDF data, and provides a crowdsourcing query service.
3 SYSTEM ARCHITECTURE
The architecture of query processing in CroRDF is illustrated in Fig. 1. An application issues requests using CroSparql, a moderate extension of standard SPARQL. Users use CroSparql to call the CroRDF query API and obtain answers from CroRDF. CroRDF consists of two components: the Search phase and the Collect phase.
Fig. 1. Architecture of CroRDF
In the Search phase, we present a flexible and extensible data model and a predicate index to store the RDF graph data. Then, we search for the results of a query over the existing RDF graph data by graph exploration. If the results of the Search phase do not satisfy the query target, they are passed to the Collect phase.
After the Search phase, we enter the Collect phase. CroRDF generates crowdsourcing questions according to the specific acquisition rules in each crowdsourcing plan. Then, it loads the answers from the crowdsourcing platform and uses resolution rules to filter them. Finally, the crowdsourcing results are converted to RDF format and returned to the knowledge base. CroRDF combines the crowdsourcing results with the results of the Search phase and returns them to the user. The overall framework of the crowdsourcing query optimization and its module functions are presented below, and the two-phase query execution process of the CroRDF system is briefly described.
3.1 Data Model
RDF is a graph-based data model, which uses directed
edges to connect different nodes. An RDF tuple is
composed of three parts: the subject, predicate, and object.
Each tuple represents a fact. The subject generally
represents an information entity (or concept) on the Web
by a Universal Resource Identifier (URI). The predicate
describes the relevant properties of the entity, and the
object represents the attribute value corresponding to the
subject. The formal representation is as follows [28]:
Given a set of URI I, a blank node set B, a literal
description set L, and an RDF tuple (s, p, o), the
information represented by the tuple is as follows. (𝑠, 𝑝, 𝑜) ∈ (𝐼 ∪ 𝐵) × 𝐼 × (𝐼 ∪ 𝐵 ∪ 𝐿)
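The membership constraint above can be checked mechanically. The following is an illustrative sketch only, modelling I, B and L as plain Python sets; the example identifiers are hypothetical.

```python
# Check (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L): the subject is a URI or blank
# node, the predicate is a URI, and the object may additionally be a literal.

def is_valid_triple(s, p, o, I, B, L):
    return s in I | B and p in I and o in I | B | L

I = {"ex:wang1", "ex:WorkIn", "ex:JishuitanHospital"}   # URIs (assumed)
B = {"_:b0"}                                            # blank nodes
L = {"3"}                                               # literals
```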
A group of RDF triples can be regarded as a directed graph G = (V, E, L) [29], where V is the node set representing the subjects and objects, E is the directed edge set representing the predicates, and L is the label set, with L = Lv ∪ Lp, where Lv is the label set of the nodes and Lp is the label set of the edges. We construct the RDF graph based on a key-value storage of node structures (id, value), where each node represents an RDF entity and is stored as a key-value pair of the specific form
(id, (in-adjacency-list, out-adjacency-list))
The node id is the key, and the value holds the entities reached by adjacent edges. The adjacency lists are divided into two categories according to the direction of the connected edges, with (predicate, id) pairs as the basic element. For each node, we can thus search its adjacent nodes.
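The key-value layout above can be sketched with an ordinary dictionary. The node and predicate ids mirror the Fig. 2 example; the helper names are ours, not CroRDF's API.

```python
# (id, (in-adjacency-list, out-adjacency-list)) storage; each adjacency entry
# is a (predicate, node-id) pair, split by edge direction.

graph = {
    "n0": {"in": [], "out": [("l1", "n1"), ("l2", "n2"), ("l3", "n3"),
                             ("l4", "n4"), ("l5", "n5")]},
    "n1": {"in": [("l1", "n0")], "out": []},
}

def neighbors(graph, node_id, direction="out"):
    """All adjacent node ids of node_id along the given edge direction."""
    return [nid for _, nid in graph[node_id][direction]]

def neighbors_by_predicate(graph, node_id, predicate, direction="out"):
    """Adjacent node ids reached via a specific predicate."""
    return [nid for p, nid in graph[node_id][direction] if p == predicate]
```

Looking up a node's adjacency lists by its id is a single dictionary access, which is what makes the graph-exploration search cheap.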
An example is shown in Fig. 2. Fig. 2(a) shows the RDF graph data, where ni is the node id and li is the predicate. Fig. 2(b) shows the key-value storage of node n0.
There are several other components in CroRDF's data model:
BGP (Basic Graph Pattern): a sequence of triple patterns <subject, predicate, object>.
Solutions: the results of an extended SPARQL query.
Acquisition rules: specify how data in the knowledge base can be obtained from external sources (including humans).
Resolution rules: used to reconcile inconsistent or uncertain values obtained from external sources.
Crowdsourcing Query Plan (CQP): determined by the ordered BGP graphs, the acquisition rules and the enumerated plans; it includes the process order and the crowdsourcing direction.
We illustrate each of these data model components informally in the following sections.
3.2 Query Extension
We used SPARQL to complete a select query on the RDF
graph data in the CroRDF. In contrast to the commonly used
join graph representation of BGPs in which each triple
pattern is an ordinary directed edge from a subject node to an
object node. The formal syntax of the BGP is expressed as Q:
SELECT ?V1 ... ?Vm WHERE {Q1 ... Qn}, where {Q1 ... Qn}
represents a set of triple patterns and ?V1 ...? Vm represents a
set of variables that appear in {Q1 ... Qn} and defines the
format of the query output.
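The BGP form above can be represented directly as data. A minimal sketch, assuming triple patterns are (subject, predicate, object) tuples and '?'-prefixed terms are variables:

```python
# Represent the WHERE clause {Q1 ... Qn} as a list of triple patterns and
# recover the variable set ?V1 ... ?Vm that defines the query output.

def variables_of(patterns):
    """Collect the variables appearing in a set of triple patterns."""
    return {t for pat in patterns for t in pat if t.startswith("?")}

patterns = [("?doctor", "WorkIn", "?hospital"),
            ("?hospital", "Has_level", "3")]
```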
To meet the needs of a query by data collection, we use an extended SPARQL query language, CroSparql, which supports two types of query targets via the crowdsourcing platform:
a. Given a threshold n on the number of answers, CroRDF returns n answers at the least cost.
b. CroRDF returns the maximum number of answers within a fixed cost.
For example, if the threshold n is set to 5, CroRDF first returns β exact solutions from the knowledge base in the query parser. If β is less than 5, CroRDF collects the remaining solutions from the crowd in the Collect phase. Consider the following example.
Example 1. A user wants to find a doctor and the doctor's rating, where the doctor is a professor who works in a hospital, the hospital has a major field, and the level of the hospital is three. The answer can be obtained by the following SPARQL query, namely QF, whose query graph is shown in Fig. 3.
Fig. 3. Query QF and its BGP graph
The query first enters the Search phase of CroRDF, where query requests are initialized according to the query target. The graph exploration module explores the existing knowledge according to one query plan and returns partial results.
Fig. 2. Example of the key-value storage structure. (a) RDF graph data: node n0 connects to nodes n1–n5 via predicates l1–l5. (b) Key-value storage of n0: (n0, (in-adjacency-list, out-adjacency-list)) with out-entries (l1, n1), (l2, n2), (l3, n3), (l4, n4), (l5, n5).
QF: SELECT ?doctor ?hospital ?field ?score WHERE {
  ?doctor Has_rate ?score .
  ?doctor PositionTitle PROFESSOR .
  ?doctor WorkIn ?hospital .
  ?hospital MajorIn ?field .
  ?hospital Has_level 3
  MinTuples 5 }
(Fig. 3 depicts the BGP graph of QF, with triple patterns q1 (Has_rate), q2 (PositionTitle), q3 (WorkIn), q4 (MajorIn) and q5 (Has_level).)
The existing data in the knowledge base are shown in Fig. 4. If β is less than the target of 5, the query process switches to the Collect phase. In this phase, based on the partial results obtained in the Search phase, the TPGenerate processor generates ordered BGP graphs according to certain rules, i.e., different execution sequences of the triples. The Acquire processor determines the crowdsourcing direction and acquisition rule set of each triple pattern according to the acquisition rule scores to generate candidate optimal CQPs in the effective search space. Then, the CostEst module estimates the crowdsourcing cost and helps find the optimal plan. Finally, the CreateQuestions and LoadAnswer processors in the crowdsourcing module publish the crowdsourcing questions and collect the results.
This paper focuses on crowdsourcing query optimization; therefore, the specific query optimization process of the Search phase is not discussed. In Section 5, we explain the cost estimation for a CQP and the acquisition rule evaluation algorithms used in the Collect phase.
4 SEARCH PHASE
In this phase, the SPARQL query process is transformed into a sub-graph matching problem using graph exploration [31]. The triple patterns in the SPARQL query are sorted into a process order {q1, ..., qn}, and the matching set of the i-th triple pattern qi is calculated over the whole graph. According to the matching set of qi, qi+1 is then mapped through graph exploration. In an ordered set of triple patterns, the interactions between the triples take effect: each matching step builds on the previous results, which reduces the intermediate result sets and improves the query performance.
Algorithm 1 illustrates the main process of the Search phase. Here, q⃗ denotes a triple pattern with the crowdsourcing direction from the subject to the object, indicating that the nodes shared with another triple pattern act as the subject, and q⃖ denotes the direction from the object to the subject. We call the source of q⃗ or q⃖ "src" and their target "tgt"; "p" denotes the predicate, and "dir" denotes the correspondence between "src" and "tgt". When src is a variable, LoadNodes initializes the candidate set B(src) using the predicate indexes; when src is a constant, B(src) is initialized to that constant. Then, for each candidate item in B(src), SelectByPredicate searches for the suitable candidate set of tgt. A result is added to R only when the tgt matches B(tgt).
In Example 1, assume that the existing data in the knowledge base are as shown in Fig. 4. For q⃗ {?doctor WorkIn ?hospital}, Algorithm 1 produces 4 matching results in R: {(wang3, WorkIn, Chinese Medicine Hospital), (wang1, WorkIn, Jishuitan Hospital), (wang1, WorkIn, Beiyi Hospital), (wang2, WorkIn, Beijing Hospital)}.
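The pattern-matching step can be sketched as follows. This is a simplified, self-contained reading of Algorithm 1 that scans a flat list of triples instead of the paper's indexed key-value store; the data mirrors the WorkIn facts of Fig. 4.

```python
# Match a (src, p, tgt) triple pattern against a triple list; '?x' terms are
# variables, optionally restricted by a candidate set in `bindings` (the role
# of B(src)/B(tgt) in Algorithm 1).

def match_pattern(triples, pattern, bindings=None):
    bindings = bindings or {}
    results = []
    for s, p, o in triples:
        for term, value in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                if term in bindings and value not in bindings[term]:
                    break                      # candidate set rules it out
            elif term != value:
                break                          # constant term mismatch
        else:
            results.append((s, p, o))          # no break: the triple matches
    return results

data = [("wang3", "WorkIn", "Chinese Medicine Hospital"),
        ("wang1", "WorkIn", "Jishuitan Hospital"),
        ("wang1", "WorkIn", "Beiyi Hospital"),
        ("wang2", "WorkIn", "Beijing Hospital"),
        ("wang1", "PositionTitle", "PROFESSOR")]
```

On this data, the pattern (?doctor, WorkIn, ?hospital) yields exactly the 4 matches listed above.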
5 COLLECT PHASE
In this phase, based on the result set and the query target from the Search phase, the query engine triggers the optimal acquisition rules and dynamically generates candidate crowdsourcing plans and questions. Then, the crowdsourcing platform handles the crowdsourcing questions, and the new data are collected later.
5.1 Generate Ordered BGP Graphs
For a SPARQL query Q, we first construct a BGP graph
to describe the structural relationship between the triple
patterns. Then, all possible ordered BGP graphs of the triple
patterns that describe the process orders are determined.
Based on the BGP graphs, we construct all possible logical
plans.
Definition 1. A Logical Plan is a sequence of triple
patterns corresponding to an ordered BGP graph.
Assume that the triple pattern set TP1 = {q1, q2, ..., qn} appearing in the query is the initial ordered BGP graph. Based on TP1, pairs of triple patterns are exchanged according to the rule position(qi) ↔ position(qj) (i ≠ j) to form different triple pattern sequences corresponding to different ordered BGP graphs. When there are n triple patterns, n! sequences are generated. The generation process is shown in Algorithm 2.
Different ordered BGP graphs may have different
crowdsourcing costs. When the number of triple
Fig. 4. Existing data in the knowledge base: doctors wang1, wang2 and wang3 with their position titles (PROFESSOR, ASSOCIATE PROFESSOR), ratings, the hospitals they work in (Chinese Medicine Hospital, Jishuitan Hospital, Beiyi Hospital, Beijing Hospital), hospital levels, and major fields (Orthopedics, Dermatology).
Algorithm 1 MatchPattern
Input: triple pattern e (e = q⃖ or e = q⃗)
Output: the matching set R
1: initialize src, tgt, p and dir from e
2: if src is a variable then
3:   B(src) = LoadNodes(p, dir)
4: else if src is a constant then
5:   B(src) = {src}
6: for each s in B(src) do
7:   Id_ListSet = LoadNeighbors(s, dir)  // get the adjacency list of s
8:   N = SelectByPredicate(Id_ListSet, p)
9:   for each n in N ∩ B(tgt) do
10:    R = R ∪ {(s, p, n)}
11: return R
patterns in the TP set is large, a large number of TP sequences will be generated, which may affect the cost estimation and the efficiency of the crowdsourcing optimization process. Therefore, a pruning step is necessary. Given that the triple patterns are crowdsourced in a certain order, there exists a binding set of associated values among them that limits and reduces unnecessary acquisition rule generation. Therefore, when generating TP sequences, we keep only the sequences (line 6) that have an association between every two triple patterns, which effectively reduces the candidate ordered BGP graphs.
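The enumeration-with-pruning step can be sketched as follows. This is our reading of the filter: of the n! orderings, keep only those in which each triple pattern shares a variable with some earlier pattern in the sequence, so that each crowdsourcing step can bind values from the previous ones.

```python
# Enumerate ordered BGP graphs (Algorithm 2's effect) and prune sequences
# without an association between the patterns.

from itertools import permutations

def shares_var(p1, p2):
    """True if two triple patterns share at least one variable."""
    vars1 = {t for t in p1 if t.startswith("?")}
    return any(t in vars1 for t in p2 if t.startswith("?"))

def enumerate_bgp(patterns):
    kept = []
    for seq in permutations(patterns):
        if all(any(shares_var(seq[i], seq[j]) for j in range(i))
               for i in range(1, len(seq))):
            kept.append(seq)
    return kept

patterns = [("?doctor", "WorkIn", "?hospital"),
            ("?doctor", "Has_rate", "?score"),
            ("?hospital", "Has_level", "3")]
```

For these three patterns, 4 of the 6 permutations survive: the two sequences that start with Has_rate followed by Has_level (or vice versa) are pruned, since those pairs share no variable.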
5.2 Evaluate Acquisition Rules
Based on the ordered BGP graphs, there are different acquisition rules for each triple pattern that generate different crowdsourcing questions.
5.2.1 Acquisition Rules
Definition 2. An Acquisition Rule is the rule extracted from a triple pattern in a BGP graph that defines how to generate crowdsourcing questions and acquire data from crowdsourcing platforms.
The general form of the acquisition rule is Predicate(subject, object). There are two specific forms when generating acquisition rules: one is Predicate(?, object), with a known object and an unknown subject; the other is Predicate(subject, ?), with a known subject and an unknown object. The acquisition process obtains an unknown value according to a known value. We can set a certain reward for each acquisition rule based on the predicate and pay workers when they complete the crowdsourcing question generated by the acquisition rule later. We take the hospital system as an example. Some acquisition rules are as follows:
Is(?, doctor): Ask a doctor's name.
WorkTime(NAME, ?): Ask the working time
according to the name of the doctor.
A triple pattern in the WHERE clause of a SPARQL query can generate a specific set of acquisition rules. The triple pattern is formally expressed as ?_var1 <P> ?_var2/CONST, where ?_var1 and ?_var2 represent variables (subject or object) and the object may also be a constant. According to the definition of the acquisition rules, we can generate the following three types of acquisition rules: I: P(?_var1, CONST); II: Is(?_var1, var1), Is(?_var2, var2); III: P(VAR1, ?_var2), P(?_var1, VAR2). Here, var1 and var2 represent the categories to which the subject and the object node belong, and VAR1 and VAR2 denote the corresponding values of the subject and the object. Different acquisition rules can be selected under different conditions, and the data for the corresponding triple pattern can then be acquired.
5.2.2 Acquisition Rules Selection
Definition 3. A Physical Plan is a sequence of
acquisition rules. It is converted from a logical plan by
choosing the crowdsourcing direction for each triple
pattern in the logical plan and determining the
acquisition rule for the corresponding triple pattern.
We first compute a set of candidate acquisition
rules for each triple pattern and select the
possible-complete acquisition rules (possible-complete
means that data crowdsourced according to the
acquisition rules match the triple pattern completely).
The complete acquisition rules of all triple patterns are
combined to produce a physical plan.
We consider a triple pattern q: ?_var1 <P> ?_var2/CONST, whose candidate acquisition rule set is fets = {P(?_var1, CONST); Is(?_var1, var1); Is(?_var2, var2); P(VAR1, ?_var2); P(?_var1, VAR2)}. A set of minimum complete acquisition rules includes three types of rule sets:
A: P(?_var1, CONST)
B: {Is(?_var2, var2); P(?_var1, VAR2)}
C: {Is(?_var1, var1); P(VAR1, ?_var2)}
Sets A and B determine that the crowdsourcing direction of the triple pattern q is from the object to the subject, denoted q⃖. Set C determines that the crowdsourcing direction of q is from the subject to the object, denoted q⃗. When the object is a constant, the result of ?_var2 obtained according to the acquisition rules in set C must be filtered by the constant.
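The three minimal complete rule sets and the crowdsourcing direction each implies can be sketched as follows. The string templates are illustrative; the category names var1/var2 are assumed to come from a schema.

```python
# Build the minimal complete acquisition-rule sets (A, B, C) for a triple
# pattern ?_var1 <P> ?_var2/CONST, tagged with the crowdsourcing direction
# ("<-" object-to-subject, "->" subject-to-object).

def complete_rule_sets(p, var1, var2, obj_const=None):
    if obj_const is not None:
        # Set A: the object is a constant, so ask for the subject directly.
        return [("<-", [f"{p}(?_var1, {obj_const})"])]
    return [
        # Set B: first ask for an object instance, then the matching subject.
        ("<-", [f"Is(?_var2, {var2})", f"{p}(?_var1, VAR2)"]),
        # Set C: first ask for a subject instance, then the matching object.
        ("->", [f"Is(?_var1, {var1})", f"{p}(VAR1, ?_var2)"]),
    ]
```

For example, WorkIn with no constant object yields sets B and C, while Has_level with the constant 3 yields only set A.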
Assuming that the knowledge base contains the data shown in Fig. 4, we take the query in Example 1 and consider the following two CQPs as examples: A: {q⃖2, q⃗1, q⃗3, q⃗4, q⃗5} and B: {q⃗2, q⃗1, q⃗3, q⃗4, q⃗5}; the order of the BGP graph and
Algorithm 2 EnumerateBGP
Input: initial ordered BGP graph TP1
Output: TP sequence set TPset
1: TPset ← {TP1}
2: for i ∈ [1, n] do
3:   for j ∈ [i, n] do
4:     for TPi ∈ TPset do
5:       TPnew ← TPi with position(qi) ↔ position(qj) (i ≠ j)
6:       if filter(TPnew) then
7:         TPset ← TPset ∪ {TPnew}
8: return TPset
the acquisition rules are shown in Fig. 5. All acquisition rules conform to one of the three sets described above. Different physical plans in the search space are formed by combining different complete acquisition rules of the triple patterns.
Enumerating and combining all possible-complete rules in the candidate acquisition rule sets of each triple pattern results in a huge number (i.e., O(2^n · n!)) of physical plans, which can hurt the query efficiency. We define two types of evaluation scores for the acquisition rules to calculate their respective contributions to the whole result and select the rules with high contribution scores to reduce the number of physical plans.
The first score of the acquisition rule f_hk is calculated as follows:
score1(f_hk) = ∑_{i=1}^{n} [∃ j: c(i, j) = h] × (1/p_i)
where n denotes the number of triple patterns, p_i denotes the number of variables in the i-th triple pattern, f_hk denotes the k-th acquisition rule in the complete set of the h-th triple pattern, and c(i, j) indicates whether or not f_hk contributes to the i-th triple pattern.
The second score considers the number of acquisition rules ∑_{j=1}^{p_i} q(i, j), where q(i, j) is the number of acquisition rules required for the j-th variable. The score of the acquisition rule f_hk is calculated as follows:
score2(f_hk) = ∑_{i=1}^{n} [∃ j: c(i, j) = h] × (1 / ∑_{j=1}^{p_i} q(i, j))
By calculating and comparing the scores of different acquisition rules, rules with high scores are selected for each triple pattern and combined to generate the possibly optimal physical plans. Cost estimation can then be conducted on these plans to determine the optimal CQP.
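Under our reading of the bracket [∃j : c(i, j) = h], the two scores can be sketched in Python; the small `c`, `p`, and `q` inputs below are assumed values for illustration, not data from the paper:

```python
def contributes(h, c_i):
    """Iverson bracket [∃j : c(i, j) = h] over one pattern's variables."""
    return any(cij == h for cij in c_i)

def score1(h, c, p):
    """score1 of a rule of pattern h: each pattern it contributes to
    adds 1/p_i, where p_i is that pattern's variable count."""
    return sum(1.0 / p[i] for i in range(len(p)) if contributes(h, c[i]))

def score2(h, c, q):
    """score2 replaces 1/p_i by the inverse of sum_j q(i, j), the number
    of acquisition rules required for pattern i's variables."""
    return sum(1.0 / sum(q[i]) for i in range(len(q)) if contributes(h, c[i]))

# Two patterns: pattern 0 has two variables, pattern 1 has one (assumed).
c = [[0, 1], [0]]   # which pattern's rules cover each variable
p = [2, 1]          # variable counts
q = [[1, 1], [2]]   # acquisition rules required per variable
# score1(0, c, p) -> 1/2 + 1/1 = 1.5; score2(0, c, q) -> 1/2 + 1/2 = 1.0
```

Rules whose scores rank highest for a triple pattern are the ones kept when combining physical plans.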
4.3 Candidate Crowdsourcing Plan
4.3.1 Search Space
Definition 4. A Crowdsourcing Query Plan (CQP)
consists of an ordered BGP graph and the
corresponding physical plan.
This section mainly explains the search space of
the possible CQPs considered by the crowdsourcing
query optimizer of the CroRDF system. An ordered
BGP graph specifies the process order of the triple
patterns that form the BGP graph. Different orders
affect the crowdsourcing process. In Section 4.1, we discussed in detail how to generate an ordered BGP graph.
acquisition rules to collect data that match the triple
pattern. An acquisition rule corresponds to a
crowdsourcing direction of the triple pattern. In
Section 4.2, we have discussed the set of candidate
acquisition rules for an ordered BGP graph in detail
and the need to select better acquisition rules with
higher scores, i.e., that may contribute more to the
result. By using an ordered BGP graph, we can
construct a logical plan and extend it to different
?doctor
?score?hospital
?field 3
Has_rateWorkIn
MajorIn Has_level
q3
q5q4
PROFESSORPostitionalTitle
q2
?doctor
?score?hospital
?field 3
Has_rateWorkIn
MajorIn Has_level
q1 q3
q5q4
PROFESSORPostitionalTitle
q2Plan A: Acquisition Rulesq2: PositionalTitle(?doctor, PROFESSOR)q1: Has_rate(doctor, ?score)q3: WorkIn(doctor, ?hospital)q4: MajorIn(hospital, ?field)q5: Has_level(hospital, ?level)
Plan B: Acquisition Rulesq2: Is(?, doctor) PositionalTitle(doctor, ?positionalTitle)q1: Has_rate(doctor, ?score}q3: WorkIn(doctor, ?hospital)q4: MajorIn(hospital, ?field)q5: Has_level(hospital, ?level)
Plan A Plan B
q1
Fig. 5. CQPs and acquisition rules for plans A and B
Algorithm 3:SearchBestPlanOriginal Procedure
1 bestPlan <- NULL
2 minCost <- ∞
3 for each seqBGP do
4 for each fetchRuleSet do
5 plan <- GeneratePlan(seqBGP, fetchRuleSet)
6 plan.TriplePossEst()
7 cost <- plan.CostEst(plan.poss)
8 if cost < minCost then
9
bestPlan <- plan
1
0
return bestPlan
8
executable physical plans by selecting different
acquisition rules.
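The exhaustive loop of Algorithm 3 can be sketched in Python; `estimate_cost` stands in for the PossiNum-based cost model, and the toy cost table is an assumed example rather than the system's actual API:

```python
def search_best_plan(ordered_bgps, rule_sets, estimate_cost):
    """Exhaustively scan every (ordered BGP, acquisition-rule set) pair and
    keep the candidate CQP with the lowest estimated monetary cost."""
    best_plan, min_cost = None, float("inf")
    for seq in ordered_bgps:
        for rules in rule_sets(seq):
            cost = estimate_cost(seq, rules)
            if cost < min_cost:
                best_plan, min_cost = (seq, rules), cost
    return best_plan, min_cost

# Toy search space with fixed, assumed costs:
costs = {("s1", "rA"): 3.0, ("s1", "rB"): 2.5,
         ("s2", "rA"): 4.0, ("s2", "rB"): 6.0}
plan, cost = search_best_plan(
    ["s1", "s2"],                       # two ordered BGP graphs
    lambda seq: ["rA", "rB"],           # candidate rule sets per graph
    lambda seq, rules: costs[(seq, rules)],
)
# plan -> ("s1", "rB"), cost -> 2.5
```

The sketch makes the quadratic blow-up visible: every ordered BGP is paired with every candidate rule set, which is what the score-based pruning of Section 4.2 reduces.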
4.3.2 Enumeration Algorithms
Definition 5. PossiNum is the number of possible result tuples needed for each candidate acquisition rule for a triple pattern, which is related to the cost of the corresponding crowdsourcing plan. The details of how to estimate the PossiNum are discussed in Section 5.
Note: Different physical plans have different
acquisition rules, and different acquisition rules have
different turns ratios, which indicates that the
generated one-to-one crowdsourcing questions need
different numbers of result tuples (PossiNum) to find
the right answer. The number of result tuples needed is
directly related to the monetary cost of crowdsourcing.
We now consider the problem of efficiently
enumerating all CQPs in the search space. In CroRDF,
the same logical plan may correspond to different
physical plans, resulting in different crowdsourcing
costs. Thus, the PossiNum estimation is applied at the
physical plan level to help select the optimal CQP.
Moreover, the CroRDF PossiNum estimation is holistic
and is based on an ordered triple pattern sequence in
which the PossiNum of each triple pattern partly
depends on the other parts of the CQP and affects the
other triple patterns. Therefore, the goal of the
enumeration algorithm is to generate a complete CQP
in the search space while maximally reusing the
common triple pattern subsequence. First, we propose a naive enumeration algorithm. Then, we propose an
improved efficient enumeration algorithm based on
reuse. The performance of the two enumeration
algorithms is compared in the experiment.
4.3.2.1 Naive Algorithm
The naive enumeration algorithm iteratively generates all valid CQPs in the search space.
Algorithm 3 illustrates the whole process. First, all
ordered BGP graphs (line 3) are enumerated using the
EnumerateBGP algorithm in Section 4.1. For one
ordered BGP, a set of complete acquisition rules is
generated and combined according to the evaluation
scores proposed in Section 4.2, which constructs a
candidate CQP (lines 4 and 5). The optimal CQP is then
selected by using the PossiNum estimation and cost
model (lines 6-9).
4.3.2.2 Improved Algorithm
The naive enumeration algorithm processes each CQP independently. Since different CQPs may have
common triple pattern subsequences, it is possible to
generate a duplicate estimation for the same
subsequence. To improve the enumeration efficiency,
we can record the estimated results of these common
triple subsequences. Note that because there are associated values between the triple patterns, we cannot directly save the estimated PossiNum itself; however, saving the PossiNum calculation relationship between the triples is feasible. Consequently, the algorithm does not have to repeatedly determine the relationship between two triple patterns and can perform the calculation directly based on the input parameters.
For a SPARQL query with n triple patterns,
although the computational complexity increases with
the value of n, the computational time is reduced
compared to repeatedly calculating the triple patterns
of all CQPs. Therefore, we can enumerate the physical plans by using combinations of every two triple patterns, taking the acquisition rules into account. In contrast, the naive enumeration algorithm first selects an ordered BGP graph and then enumerates the physical plans by selecting the rules for the triple patterns; all possible CQPs are thus enumerated.
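The reuse idea can be sketched as caching the pairwise relationship computation so that CQPs sharing a common subsequence do not repeat it; `pair_relation` and `core_estimate` below are hypothetical stand-ins for the system's internals:

```python
from functools import lru_cache

def make_memoized_estimator(pair_relation, core_estimate):
    """Cache the relationship calculation between triple-pattern pairs so
    plans sharing a common subsequence reuse it (sketch only)."""
    @lru_cache(maxsize=None)
    def relation(a, b):                      # computed once per pattern pair
        return pair_relation(a, b)

    def estimate(seq, target):
        total, prev = 0.0, None
        for tp in seq:
            rel = relation(prev, tp) if prev is not None else None
            poss = core_estimate(target, rel, tp)   # per-pattern PossiNum
            total += poss
            target, prev = poss, tp                 # output feeds the next pattern
        return total
    return estimate

# Toy demonstration: the (q1, q2) relationship is computed only once even
# though two CQPs share that prefix.
calls = []
est = make_memoized_estimator(
    lambda a, b: calls.append((a, b)) or "R2",  # record each real computation
    lambda target, rel, tp: target * 2,         # dummy per-pattern estimate
)
cost1 = est(("q1", "q2", "q3"), 1)   # computes (q1,q2) and (q2,q3)
cost2 = est(("q1", "q2", "q4"), 1)   # reuses (q1,q2), computes (q2,q4)
# len(calls) -> 3 rather than 4
```

The saved work grows with the number of CQPs that share prefixes, which matches the experimental gap between the two enumeration algorithms reported in Section 6.2.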
5 MONETARY COST ESTIMATION
This section describes how the CroRDF system
estimates the cost of a CQP. Assume that each
acquisition rule has a fixed cost that can be set by the
CroRDF system. Although the cost may vary across acquisition rules, we make the simplifying assumption that the cost of each acquisition rule does not depend on the specific predicates. Therefore, we convert the
cost estimation into a PossiNum estimation, which is
the number of possible result tuples needed for the
acquisition rule that each triple pattern in the SPARQL
query needs to generate to satisfy the overall query
target. Therefore, the cost estimation formula is as
follows:
$\mathit{estimated\ cost} = \sum_{q_i \in TP} \sum_{f_{ij} \in F_i} c_{ij} \times f_{ij} \quad (1 \le i \le n,\ 1 \le j \le m_i)$

where TP is the set of triple patterns in the SPARQL query, qi is a triple pattern, Fi is the set of candidate acquisition rules generated for qi, fij is the PossiNum of the j-th acquisition rule in Fi, and cij is the cost of the acquisition rule corresponding to fij. To estimate the PossiNum of a triple pattern, we must fully consider the associations and restrictions among the triple patterns.
5.1 PossiNum Estimation
When executing a SPARQL query, CroRDF generates a BGP graph composed of triple patterns. A CQP corresponds to an ordered BGP graph that indicates the order in which the triple pattern is executed. Therefore, the PossiNum estimation algorithm can be regarded as a graph exploration and traversal process that considers the association among triple patterns. Based on the resolution rule turns ratio and predicate density, the whole process starts from the extended query target, estimates the result tuples that each triple pattern needs to deliver to the next triple pattern, and computes the PossiNum of each triple pattern until the entire BGP graph traversal is complete and returns the calculation result.
5.2 Important Parameters
In the PossiNum estimation, the resolution rule turns ratio and predicate density can be applied to estimate the PossiNum.
5.2.1 Resolution Rules
Resolution rules are applied to eliminate the ambiguity and inconsistency of crowdsourcing result triples, and the results are returned to the knowledge base. The form of the resolution rule is Rule(S->O, predicate), where S and O represent the subject and the object (S can be empty), respectively, and predicate is the predicate involved in the rule. The specific process groups all crowdsourcing result tuples by S, whereas for each group it regards the set of values in O as the input and outputs a result according to a specific resolution rule. Each resolution rule limits the number of inputs as a minimum or average number, and more inputs are needed if they are insufficient for the limitation. The number of inputs can be used for the query cost estimation. The resolution rules involved in the query process include distinct, majority, average, etc. In the example of the hospital system, there may be some resolution rules as follows:
Distinct(∅->hospital, Is): Distinct.
Average-3(doctor->score, Has_rate):
Calculate the average of three scores.
Majority-3(doctor->hospital, WorkIn): Take
most items of the three results.
5.2.2 Resolution Rule Turns Ratio
The resolution rule turns ratio estimates the average number of output tuples per input tuple. For example, the resolution rule Average-n returns the average of n values, so the turns ratio is 1/n; Majority-n returns the majority of n results, and the turns ratio is between 1/3 and 1/2 when n = 3 (1/2 when the first two results are consistent; 1/3 when they are inconsistent).
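These turns ratios can be sketched as follows; taking the midpoint of the Majority-3 interval is our own assumption (the experiments in this paper use 0.3 for Average-3 and 0.4 for Majority-3):

```python
def turns_ratio(rule):
    """Hedged estimate of a resolution rule's output/input ratio.

    Average-n  : one output per n inputs                -> 1/n
    Majority-3 : between 1/3 (all three answers needed)
                 and 1/2 (first two agree); midpoint assumed here
    Distinct   : assumed to pass every distinct tuple   -> 1.0
    """
    kind, _, n = rule.partition("-")
    if kind == "Average":
        return 1.0 / int(n)
    if kind == "Majority" and int(n) == 3:
        return (1 / 3 + 1 / 2) / 2      # midpoint of the stated bounds
    if kind == "Distinct":
        return 1.0
    raise ValueError("unknown resolution rule: " + rule)
```

In the cost model, dividing a target tuple count by the turns ratio yields the number of crowdsourced inputs that must be collected.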
5.2.3 Predicate Density
The predicate density in an acquisition rule is the probability that a candidate RDF resource owns the predicate. The predicate density is related to the predicate category; for example, for the acquisition rules Is(?, doctor) and WorkIn(?doctor, “Beiyi Hospital”), the possible predicate densities are 1 and 0.1, respectively.
5.3 Calculate the PossiNum
First, we define four types of relationships
between triple patterns, as shown in Table 1. The
crowdsourcing process for each triple pattern has a
direction, which refers to the direction between the
source and target, represented by src and tgt,
respectively. The source and target differ from the
subject and object. The right arrow ‘→’ represents the
matching direction from subject to object, whereas the
left arrow ‘←’ indicates from object to subject. For
example, for q2←, src represents the object of the triple, whereas for q2→, src indicates the subject of the triple.
Table 1. Relationships between the triple patterns
  Relationship | Example (in Fig. 3)
  R1: src-src  | q1→ and q3→
  R2: tgt-src  | q3→ and q4→
  R3: src-tgt  | q1→ and q2←
  R4: tgt-tgt  | q1← and q3←
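The four relationship types of Table 1 depend only on which end of one directed pattern matches which end of the next, which can be sketched directly; the tuple encoding of a pattern is our own assumption:

```python
def relationship(prev, cur):
    """Classify the Table-1 association between two directed triple
    patterns, each given as (subject, predicate, object, forward);
    forward=True means the crowdsourcing direction runs subject -> object,
    so src is the subject and tgt the object (and vice versa otherwise)."""
    src = lambda tp: tp[0] if tp[3] else tp[2]   # source end of the pattern
    tgt = lambda tp: tp[2] if tp[3] else tp[0]   # target end of the pattern
    if src(prev) == src(cur): return "R1"        # src-src
    if tgt(prev) == src(cur): return "R2"        # tgt-src
    if src(prev) == tgt(cur): return "R3"        # src-tgt
    if tgt(prev) == tgt(cur): return "R4"        # tgt-tgt
    return None                                  # no association

# Patterns from the running example, with the directions used in Table 1:
q1f = ("?doctor", "Has_rate", "?score", True)
q2r = ("?doctor", "PositionalTitle", "?pos", False)
q3f = ("?doctor", "WorkIn", "?hospital", True)
q4f = ("?hospital", "MajorIn", "?field", True)
q1r = ("?doctor", "Has_rate", "?score", False)
q3r = ("?doctor", "WorkIn", "?hospital", False)
```

For instance, q1→ and q3→ share their source variable ?doctor (R1), while q3→ feeds its target ?hospital into the source of q4→ (R2).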
Now, we explain the TriplePossEst PossiNum
estimation algorithm in terms of the four types of
relationships between triple patterns. The basic
process unit of the algorithm is a single triple pattern.
In the implementation process, two input parameters
are involved:
target: The number of target tuples to be
output for one triple pattern.
binding: The candidate set of association
values between the triples.
According to the input parameters and a CQP,
the TriplePossEst algorithm estimates the PossiNum
for a specific triple pattern, and the output is passed
as the target input of the next triple pattern. Then, the
total estimated cost of all tuples is calculated
cumulatively. Four local variables are referenced in
each triple pattern estimation:
fets: The acquisition rule set of a triple
pattern.
preds: The predicate set with the density of
the involved triple pattern.
res_sel: The resolution rule set and its
turns-ratio.
poss: The PossiNum of the current triple
pattern.
Algorithm 4 illustrates the basic process of the TriplePossEst algorithm. The input is the CQP, including the process order and crowdsourcing direction of each triple pattern. The output is the estimated PossiNum of the CQP, which is the number of possible result tuples needed for all acquisition rules in the CQP.

Algorithm 4: TriplePossEst
Input: Crowdsourcing Query Plan (CQP)
Output: estimated PossiNum (EstPoss)
1 target <- n-N or 1
2 binding <- GraphExplore(DataBase)
3 for tpi in TP do
4   poss <- TriplePossEstCore(target, binding, tpi)
5   target <- poss
6   associate_attribute <- Relation(tpi, tpi+1)
7   binding <- binding(associate_attribute) ∪ GraphExplore(DataBase)
8   EstPoss <- EstPoss + poss
9 return EstPoss

Algorithm 5: TriplePossEstCore
Input: target, binding, tp
Output: poss
1 fets, preds, res_sel <- Initialize(tp)
2 r_type <- Relationship(tp, previous tp)
3 poss <- target - |binding.existingpartialdata(tp)|
4 posss <- {poss, poss, …, poss}
5 {f1, …, fn} <- sort(fets)
6 if r_type = R1 or R2 then
7   tp.src <- binding
8   for fi in fets do
9     if Mapping(fi.src, tp.src) then
10      for pred in {preds ∪ res_sel(fi)} do
11        posss[i] <- posss[i] / pred.density
12    else if Mapping(fi.tgt, tp.src) then
13      c <- (1 - |tp.src| / posss[i]) × tp(associate_attribute as src).preds.density
14      posss[i] <- posss[i] / c
15 else if r_type = R3 or R4 then
16   tp.tgt <- binding
17   for fi in fets do
18     if Mapping(fi.tgt, tp.tgt) then
19       c <- (1 - |tp.tgt| / posss[i]) × tp(associate_attribute as tgt).preds.density
20       posss[i] <- posss[i] / c
21     else if NoMapping(fi, tp.tgt) then
22       for pred in res_sel(fi) do
23         posss[i] <- posss[i] / pred.density
24 else if r_type = NULL then
25   for fi in fets do
26     for pred in {preds ∪ res_sel(fi)} do
27       posss[i] <- posss[i] / pred.density
28 return poss <- sum(posss)

First,
according to the query result in the Search phase, the
algorithm initializes the parameters target and binding
(lines 1 and 2). For MinTuples n, the parameter target is
initialized as the number of result tuples required to
satisfy the query target n; for MaxCost c, target is
initialized to 1.
Then, the algorithm calls the TriplePossEstCore
algorithm to calculate the PossiNum of each triple
pattern, updates the input parameters of the next triple
pattern, and sums the PossiNum estimation
cumulatively (lines 4-7). Algorithm 5 illustrates the
TriplePossEstCore algorithm procedure, which aims to
estimate the PossiNum of the current triple pattern.
The inputs are the current triple pattern tp, the number
of results to be output, and the value set associated
with the previous tp. If the current tp is the first one in
the CQP, the binding set is initialized by the
TriplePossEst algorithm. The output is the PossiNum
of the current tp. The initialization is processed in lines
1-5 to obtain the following information about tp: fets,
preds, and res_sel. The fets set determines the
association type between the current tp and the
previous tp and initializes the PossiNum. Then,
according to the association types, three cases are
handled separately. The first case (lines 6-14) is applied
to the R1 and R2 association types. In this case, the
binding set limits the range of the src of tp. Therefore,
when the acquisition rules in the fets set are
crowdsourced in a certain order, it is unnecessary to
crowdsource the variable node values that match the
binding set to collect new data. In terms of the matching
type between the acquisition rules and the candidate
set of the association values of tp, the algorithm
estimates the number of other possible crowdsourcing
questions. The second case (lines 15-23) aims to handle
the R3 and R4 association types. In this case, the
binding set limits the range of the tgt of tp. Similarly, the
PossiNum of each acquisition rule is calculated in
terms of different matching types. When the
acquisition rule obtains the unassociated values before
knowing the associated values, the algorithm must
re-calculate the number of acquisition rules required to
obtain the associated values based on the estimated
cost and the binding set (lines 13 and 14 and lines 19
and 20). The third case occurs when tp is the first triple
pattern in the CQP or when there is no association
between the two tps. The PossiNum of acquisition
rules can be estimated directly based on the density of
predicates and the turns-ratio of resolution rules (lines
24-27). Finally, the PossiNum values of all acquisition
rules are combined as the output result.
For the target MinTuples n, the parameter target is
assigned to the number of results still to be
crowdsourced, considering the partial query results
generated by the existing knowledge. For the target
MaxCost c, the principle of the algorithm is to return as
many results as possible within the range of cost c,
based on returning at least n query results (n is the
system default). Therefore, the PossiNum estimation process is completed in three steps. First, it sets the parameter target to the number of partial result tuples from the Search phase and calculates the cost used to
return the missing values in the partial result tuples. If
the number limit is satisfied or the budget has been
exceeded, the process returns the results directly and is
ended; otherwise, it proceeds to the next step. Second,
the process sets the target to 1 to calculate the cost for
returning one result tuple. Third, according to the cost
c, it repeats the calculation until the budget is
exhausted and then returns the number of tuples in the
result.
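The MaxCost budget loop just described can be sketched as follows; this is a deliberate simplification in which the per-tuple cost from the target = 1 estimate is treated as a constant `unit_cost`:

```python
def maxcost_tuples(budget, partial_cost, unit_cost):
    """How many result tuples fit into budget c: first spend the cost of
    completing the partial result tuples (step 1), then repeatedly add the
    estimated cost of one more tuple (steps 2 and 3) until c is exhausted.

    Returns the number of additional result tuples affordable (sketch)."""
    if partial_cost > budget:
        return 0                       # budget exceeded before any new tuple
    spent, tuples = partial_cost, 0
    while spent + unit_cost <= budget:
        spent += unit_cost             # buy one more estimated result tuple
        tuples += 1
    return tuples

# With a $5 budget, $1 to complete the partial tuples, and $1.5 per extra
# tuple, two additional tuples are affordable.
```

A full implementation would re-run the PossiNum estimation for each additional tuple rather than assume a constant unit cost.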
5.4 Example of Cost Estimation
We illustrate the PossiNum estimation algorithm
in Section 5.3 with two simple examples. Taking the
query in Section 4.2.2 for example, for simplicity, we
assume that the predicates PositionalTitle =
'PROFESSOR' and Has_level = 3 have a density of 0.2
and that the other predicates have a density of 1. The
turns ratios of the resolution rules Distinct, Average-3,
and Majority-3 are 1.0, 0.3, and 0.4, respectively. The cost of each acquisition rule is assumed to be $0.05. The resolution rules involved in CQPs A and B are Distinct(∅->doctor, Is), Average-3(doctor->score, Has_rate), Majority-3(doctor->hospital, WorkIn), Majority-3(hospital->field, MajorIn), and Majority-3(hospital->level, Has_level).

TriplePossEst({q2←, q1→, q3→, q4→, q5→})
  q2.TriplePossEstCore(4, {wang1})                                            q2.poss = 3
  q1.TriplePossEstCore(3, {wang1, 9} ∪ binding(doctor))                       q1.poss = 10
  q3.TriplePossEstCore(3, {wang1, 9, Beiyi Hospital} ∪ binding(doctor))       q3.poss = 7.5
  q4.TriplePossEstCore(8.5, {wang1, 9, Beiyi Hospital} ∪ binding(hospital))   q4.poss = 21.25
  q5.TriplePossEstCore(7.5, {wang1, 9, Beiyi Hospital, 3} ∪ binding(hospital)) q5.poss = 18.75
Fig. 6(a) PossiNum estimation process of Plan A
Plan A: Fig. 6(a) shows the PossiNum estimation
process of plan A. First, we consider the impact of the
existing data. In our case, the partial query results are
the tuples of {wang1, Jishuitan Hospital, orthopedics, 8}
and {wang1, Beiyi Hospital, ?, 9}. Therefore, the target
parameter is initialized with 4, and the binding set is
{wang1} (the first processed triple pattern is q2). Then,
TriplePossEstCore (4, {wang1}) is called to process q2.
For the acquisition rule, PositionalTitle(?doctor,
PROFESSOR), all results satisfy the predicate, and
q2.poss = 4-1 = 3. Then, TriplePossEstCore (3, {wang1,9}
∪ binding (doctor)) is called to process q1. Since the
predicate density is 1, the resolution rule Average-3
turns ratio is 0.3, and the acquisition rule is
Has_rate(doctor, ?score), q1.poss = 3 / 0.3 = 10.
Similarly, q3.poss = 3 / 0.4 = 7.5. Because a ‘field’ value is missing from the partial results, q4.binding = {wang1,
9, Beiyi Hospital} ∪ binding (hospital). The PossiNum
calculation of q4 should consider supplying the
missing data; therefore, q4.poss = (7.5 + 1) / 0.4 = 21.25.
Similarly, q5.poss = 7.5 / 0.4 = 18.75. The final estimated
PossiNum is 3 + 10 + 7.5 + 21.25 + 18.75 = 60.5, and the
estimated cost is $0.05 × 60.5 = $3.025.
Plan B: Fig. 6(b) shows the PossiNum estimation
process of plan B. The difference from plan A is the
crowdsourcing direction of q2. The initialization is the
same as in plan A. When calling TriplePossEstCore (4,
{wang1}) to process q2, q2.poss = 2 × (4-1) /0.2=30
owing to the density of the predicate PositionTitle =
'PROFESSOR'. Then, TriplePossEstCore (30, {wang1,9}
∪ binding (doctor)) is called to process q1, q1.poss = 30 /
0.3 = 100. Similarly, q3.poss = 30 / 0.4 = 75, q4.poss = (75 + 1) / 0.4 = 190, and q5.poss = 75 / 0.4 = 187.5. The final estimated PossiNum is 30 + 100 + 75 + 190 + 187.5 = 582.5, and the estimated cost is $0.05 × 582.5 ≈ $29.13.
As shown above, plan A costs less than plan B.
Therefore, to optimize the query cost, plan A is a better
choice than plan B.
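The arithmetic of both plans can be replayed directly, using the densities, turns ratios, and $0.05 unit cost assumed in this section:

```python
UNIT = 0.05                          # assumed cost per acquisition rule ($)

# Plan A: q2 is crowdsourced in the reverse direction first.
q2 = 4 - 1                           # one professor already in the partial data
q1 = q2 / 0.3                        # Average-3 turns ratio -> 10
q3 = q2 / 0.4                        # Majority-3 turns ratio -> 7.5
q4 = (q3 + 1) / 0.4                  # +1 supplies the missing 'field' value
q5 = q3 / 0.4
plan_a = q2 + q1 + q3 + q4 + q5      # ≈ 60.5, i.e. ≈ $3.025

# Plan B: forward q2 must overcome the 0.2 density of PositionalTitle=PROFESSOR.
q2b = 2 * (4 - 1) / 0.2              # 30
q1b, q3b = q2b / 0.3, q2b / 0.4      # 100, 75
q4b, q5b = (q3b + 1) / 0.4, q3b / 0.4
plan_b = q2b + q1b + q3b + q4b + q5b # ≈ 582.5, i.e. ≈ $29.13
```

The roughly tenfold gap comes almost entirely from q2's crowdsourcing direction: in plan B every downstream PossiNum inherits the inflated q2 estimate.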
6 EXPERIMENTAL EVALUATION
In this section, we evaluate experimentally the
performance of the CroRDF system’s crowdsourcing
query optimizer, focusing on the accuracy of the cost
estimation algorithm. We only consider the query
target MinTuples n (because the target MaxCost c is
also based on the TriplePossEst algorithm). First, we
evaluate the performance of the cost estimation
algorithm with different settings. Then, we validate the
effectiveness of the acquisition rule scores and two
plan enumeration algorithms.
6.1 Accuracy of the Cost Estimation
To evaluate the accuracy of the CroRDF cost
model, we designed three experiments to compare the
actual cost with the estimated cost: no data in the
knowledge base (Experiment 1), partial data
(Experiment 2), and partial data with different logic
query plans (Experiment 3). For Experiment 1, we
adopted a real crowdsourcing platform (Amazon
Mechanical Turk) to execute different CQPs and
acquire the actual crowdsourcing cost for comparison
with the experimental result. To perform repeated
experiments without incurring actual monetary cost, we built a
crowdsourcing simulator that returns results by
selecting from a predefined set of values. We could set
the simulator to either always return correct answers
or return wrong answers with a certain probability.
Experiment 1: No data. For the SPARQL query in
Section 4.2.2, we considered the query target
MinTuples 5 by adopting the following two CQPs:
Plan A {q2←, q1→, q3→, q4→, q5→} and Plan B {q2→, q1→, q3→, q4→, q5→}.
The acquisition rules of plans A and B are shown in Fig.
5. Assume that the cost of each acquisition rule is $0.05
and that the crowdsourcing start situation is no data.
The actual costs of the two crowdsourcing plans are
$4.5 and $45.25, respectively.
TriplePossEst({q2→, q1→, q3→, q4→, q5→})
  q2.TriplePossEstCore(4, {wang1})                                            q2.poss = 30
  q1.TriplePossEstCore(30, {wang1, 9} ∪ binding(doctor))                      q1.poss = 100
  q3.TriplePossEstCore(30, {wang1, 9, Beiyi Hospital} ∪ binding(doctor))      q3.poss = 75
  q4.TriplePossEstCore(76, {wang1, 9, Beiyi Hospital} ∪ binding(hospital))    q4.poss = 190
  q5.TriplePossEstCore(75, {wang1, 9, Beiyi Hospital, 3} ∪ binding(hospital)) q5.poss = 187.5
Fig. 6(b) PossiNum estimation process of Plan B
The experimental parameter settings were the
same as those in the example in Section 5.4. The
estimated results were $4.835 and $48.33, respectively.
Fig. 7 illustrates the comparison between the estimated
costs and the actual costs of the two plans. As shown in
the figure, the overall estimated costs were very close
to the actual costs, although there were still minor
errors (7.4% and 6.8%, respectively) for two main
reasons. First, our turns ratio and density settings were
not sufficiently accurate. For example, the resolution
rule Majority-3 did not necessarily require three inputs
as expected. In the experimental result, the actual turns
ratio was estimated as 0.48. Furthermore, for Plan B,
three doctor values were finally obtained from 33
crowdsourcing results, and therefore the turns ratios of
the resolution rule Distinct and the predicate
PositionalTitle=‘PROFESSOR’ were 0.85 and 0.3,
respectively. Second, our PossiNum estimation
algorithm uses some simple assumptions. For the
acquisition rule P(?var, CONST), it is assumed that the
results always satisfy a constant restriction, but this is
often not the case. For example, we assumed that the
crowdsourcing results of the rule PositionTitle(?doctor,
PROFESSOR) always satisfied the predicate
PositionalTitle=‘PROFESSOR’, but in fact, the
crowdsourcing workers were likely to return an
unmatched answer. To solve this problem, we can
adjust the turns ratio of the resolution rules associated
with these acquisition rules to accommodate
real-world uncertainties.
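For reference, the reported error percentages follow from the usual relative-error formula:

```python
def relative_error(estimated, actual):
    """Relative estimation error used to compare estimated and actual costs."""
    return abs(estimated - actual) / actual

# Experiment 1: estimated $4.835 vs actual $4.50 (Plan A),
#               estimated $48.33 vs actual $45.25 (Plan B)
err_a = relative_error(4.835, 4.50)    # ≈ 0.074, i.e. 7.4%
err_b = relative_error(48.33, 45.25)   # ≈ 0.068, i.e. 6.8%
```

The same formula is applied (averaged over runs) in Experiments 2 and 3.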
Experiment 2: Partial data. Because of the
crowdsourcing cost and latency arising from the
repeated experiments performed on a real
crowdsourcing platform, we adopted a crowdsourcing
simulator to simulate the crowdsourcing platform to
collect data for the following experiments. We mainly
considered the target MinTuples n and observed the
estimated results and actual results under different
existing data distributions. Consider two different
types of SPARQL queries (star structure and chain
structure):
Query I: select ?doctor, ?hospital, ?position, ?score where {q1: ?doctor WorkIn ?hospital, q2: ?doctor PositionalTitle ?position, q3: ?doctor Has_rate ?score}. The query plan is {q1→, q2→, q3→}, and the acquisition rules are {q1: Is(?, doctor), WorkIn(doctor, ?hospital); q2: PositionalTitle(doctor, ?position); q3: Has_rate(doctor, ?score)}.
Query II: select ?doctor, ?hospital, ?field where {q1: ?doctor Has_rate 9, q2: ?doctor WorkIn ?hospital, q3: ?hospital MajorIn ?field}. The query plan is {q1←, q2→, q3→}, and the acquisition rules are {q1: Has_rate(?doctor, 9); q2: WorkIn(doctor, ?hospital); q3: MajorIn(hospital, ?field)}.
Suppose that the resolution rule of the hospital,
position, and field is Majority-3; the resolution rule of
the doctor is Distinct; the resolution rule of the score is
Average-3; and the turns ratios are 0.4, 1 and 0.3,
respectively. Fig. 8 shows the comparison of the
estimated and actual results when N results were
obtained. In the experiment, we set three different
initial states of the existing data and randomly selected
0, 100, and 200 different values. The query results of
the existing data were obtained through graph
exploration in the Search phase. The crowdsourcing
query was performed based on the partial result tuples.
The results showed that the estimated costs were very
close to the actual costs. Under the three data
distributions, the average relative errors of query I and
query II were 3.75%, 10%, and 34.95% and 9.31%,
14.18%, and 13.17%, respectively. The estimation
algorithm could distinguish between the existing data
and the crowdsourcing data, and different initial states
were reflected in the cost estimation.
Plan A Plan B
Fig. 7. MinTuples: Accuracy of the cost estimation without data
Experiment 3: Partial data with different logic
query plans. For the query in Section 4.2.2, we
considered the following two logical query plans and
the corresponding acquisition rules:
Plan A {q2←, q5←, q3→, q1→, q4→}:
q2: PositionTitle(?doctor, ‘PROFESSOR’);
q5: Has_level(?hospital, 3);
q3: WorkIn(doctor, ?hospital);
q1: Has_rate(doctor, ?score);
q4: MajorIn(hospital, ?field).
Plan B {q5←, q4→, q3←, q1→, q2→}:
q5: Has_level(?hospital, 3);
q4: MajorIn(hospital, ?field);
q3: WorkIn(doctor, ?hospital);
q1: Has_rate(doctor, ?score);
q2: PositionTitle(doctor, ?position).
The resolution rules and turns ratio were the same
as those in Experiment 1. Considering the three
different initial states of the existing data, Fig. 9 shows
the comparison between the estimated results and the
actual results of the two query plans in the case of the
target, MinTuples. In Fig. 9(a) and (b), the estimated
cost is close to the actual cost. The average relative
errors of plans A and B were 14.37% and 11.9%,
respectively. The result of plan B in Fig. 9(c) illustrates a poor situation in which the cost model could not predict the best execution plan. This failure occurred because we controlled the generation of the doctor and hospital data to meet the query predicate requirements, which led to a difference between the default predicate density and the actual one, resulting in inaccurate estimates.
(a) Initial state: 0 value (b) Initial state: 100 values (c) Initial
state: 200 values
Fig. 8. MinTuples: Accuracy of the cost estimation with partial data
Fig. 11. Performance comparison between the two enumeration algorithms
(a) Initial state 1 (b) Initial state 2
(c) Initial state 3
Fig. 9. MinTuples: Accuracy of the cost estimation with different logical plans
(a) Initial state 1 (b) Initial state 2
(c) Initial state 3
Fig. 10. Cost comparison of different evaluation scores
6.2 Validity of the Enumeration Algorithms
Experiment 4: Evaluation scores of the acquisition
rules. The selection of different acquisition rules for the
same logical plan results in different physical plans.
The experiment validated the effectiveness of the two
evaluation scores in optimizing the selection of the
acquisition rules proposed in Section 4.2 and evaluated
the effect of enumerating the CQPs. Considering query
II in Experiment 2 and the target, MinTuples, Fig. 10
illustrates the actual costs of score1, score2, and
random selection for three different initial states of the
existing data (the number of values was set to 100, 200,
and 300 different values). The experiment assumed
that the crowdsourcing simulator only returns correct
results. From the experimental results, we can
conclude that the cost of optimizing the selection of
acquisition rules in terms of score1 and score2 was
reduced by an average of 28.2% and 33.9%,
respectively, compared to random selection. Therefore, selecting rules according to the scores accelerates locating the candidate physical plans, reduces the enumeration space, and finds the best crowdsourcing physical plan more quickly. Moreover,
as the state of the data changed, the optimization gap
between score1 and score2 decreased gradually
because an acquisition rule can fill more values in the
partial results as the existing data increase.
Experiment 5: Enumeration algorithms. To
evaluate the effectiveness of the enumeration process,
we compared the two enumeration algorithms
considering the overall optimization time, i.e., the time
from the crowdsourcing query to finding the optimal
crowdsourcing plan, as the evaluation criterion. Since
the search space depends on the number of triple
patterns in the SPARQL query to some extent, we
generated a series of queries with varying numbers of
triple patterns t and calculated the optimization time of
each query execution. Fig. 11 shows the comparison
results of the two enumeration algorithms. The
performance of the improved enumeration algorithm was much better than that of the naive enumeration algorithm. With increasing t, the optimization effect became more obvious: when t = 9, the improved enumeration algorithm was 2.3 times faster than the naive algorithm.
8 CONCLUSION
This paper presented CroRDF to complete RDF
queries via crowdsourcing with a crowdsourcing
query plan optimizer that finds the optimal CQP based
on the estimated monetary cost. According to the
characteristics of the RDF data and the query
requirements, we defined the data model and
extended the SPARQL query statement. We proposed
a plan enumeration algorithm based on triple pattern
sequences and acquisition rule selection and a
monetary cost estimation algorithm. Through the
comparison of actual data and simulation data, the
accuracy of our cost estimation algorithm and the
validity of the plan enumeration algorithm were
verified.
In future work, we will study how to optimize multiple SPARQL crowdsourcing queries by integrating a reasoning module and extracting the common query substructure, turning multiple queries into a single crowdsourcing query to effectively reduce the crowdsourcing cost.
ACKNOWLEDGMENTS
This research is supported by the National Natural
Science Foundation of China under Grant No. 61672102,
No. 61073034, No. 61370064 and No. 60940032; the
Program for New Century Excellent Talents in the
University of Ministry of Education of China under
Grant No. NCET-10-0239; the Science Foundation of
Ministry of Education of China and China Mobile
Communications Corporation under Grant No.
MCM20130371; and the Open Project Sponsor of
Beijing Key Laboratory of Intelligent Communication
Software and Multimedia under Grant No. ITSM201493.
Corresponding author. Tel.: +86 13121915269. E-mail
address: [email protected] (D. Dang).
REFERENCES
[1] Acosta, Maribel, et al. "HARE: An Engine for
Enhancing Answer Completeness of SPARQL
Queries via Crowdsourcing." Companion of the The
Web Conference 2018 on The Web Conference
2018. International World Wide Web Conferences
Steering Committee, 2018.
[2] Preda, Nicoleta; Kasneci, Gjergji; Suchanek, Fabian
M.; Neumann, Thomas; Yuan, Wenjun; Weikum,
Gerhard. Active Knowledge: Dynamically Enriching
RDF Knowledge Bases by Web Services. Proceedings
of the ACM SIGMOD International Conference on
Management of Data, p 399-410, 2010.
[3] Demartini, Gianluca, et al. "CrowdQ:
Crowdsourced Query Understanding." CIDR. 2013.
[5] Acosta, Maribel, et al. "Enhancing answer
completeness of SPARQL queries via
crowdsourcing." Web Semantics: Science, Services
and Agents on the World Wide Web 45 (2017):
41-62.
[6] Doan, Anhai; Ramakrishnan, Raghu; Halevy, Alon
Y. Crowdsourcing Systems on the World-Wide Web.
Communications of the ACM, v 54, n 4, p 86-96,
April 2011.
[7] Huang, Shih-Wen; Fu, Wai-Tat. Enhancing
Reliability Using Peer Consistency Evaluation in
Human Computation. Proceedings of the ACM
Conference on Computer Supported Cooperative
Work, CSCW, p 639-647, 2013, CSCW 2013.
[8] Chittilappilly, A. I.; Chen, L.; Amer-Yahia, S. A
Survey of General-Purpose Crowdsourcing
Techniques. IEEE Transactions on Knowledge &
Data Engineering, v 28, n 9, p 2246-2266, 2016.
[9] Li, Guoliang; Zheng, Yudian; Fan, Ju; Wang, Jianan;
Cheng, Reynold. Crowdsourced Data Management:
Overview and Challenges. Proceedings of the 2017
ACM International Conference on Management of
Data, ACM, p 1711-1716, 2017.
[10] Kaler, Kamaljot S., et al. "Crowdsourcing
Evaluation of Ureteroscopic Videos Using the
Post-Ureteroscopic Lesion Scale to Assess Ureteral
Injury." Journal of Endourology 32.4 (2018):
275-281.
[11] Katsurai, Marie. "Bursty research topic detection
from scholarly data using dynamic Co-word
networks: A preliminary investigation." Big Data
Analysis (ICBDA), 2017 IEEE 2nd International
Conference on. IEEE, 2017.
[12] Liu, Xi, Yiju Zhan, and Jian Cen. "An
Energy-efficient Crowd-sourcing-based Indoor
Automatic Localization System." IEEE Sensors
Journal (2018).
[13] Wang Hong, et al. "Research on Domain Ontology
Storage Method Based on Neo4j." Computer
Application Research 8 (2017): 039.
[14] Nguyen, Vinh, et al. "A Formal Graph Model for
RDF and Its Implementation." (2016).
[15] Saleem, Muhammad, et al. "Costfed: Cost-based
query optimization for sparql endpoint
federation." Procedia Computer Science 137 (2018):
163-174.
[16] Park, Hyunjung; Widom, Jennifer. CrowdFill:
Collecting structured data from the crowd.
Proceedings of the ACM SIGMOD International
Conference on Management of Data, p 577-588,
2014.
[17] Nicholson, Bryce; Sheng, Victor S.; Zhang, Jing.
Label noise correction and application in
crowdsourcing. Expert Systems with Applications,
v 66, p 149-162, December 30, 2016.
[18] Chittilappilly, Anand Inasu, Lei Chen, and Sihem
Amer-Yahia. "A survey of general-purpose
crowdsourcing techniques." IEEE Transactions on
Knowledge and Data Engineering 28.9 (2016):
2246-2266.
[19] M. Franklin, D. Kossmann, T. Kraska, S. Ramesh,
R. Xin, CrowdDB: answering queries with
crowdsourcing, in: SIGMOD, 2011, pp. 61–72.
[20] Acosta, Maribel, et al. "HARE: An engine for
enhancing answer completeness of SPARQL
queries via crowdsourcing." (2018): 501-505.
[21] Parameswaran, Aditya Ganesh, et al. "Deco:
declarative crowdsourcing." Proceedings of the 21st
ACM international conference on Information and
knowledge management. ACM, 2012.
[22] Chaudhuri, Surajit, and Kyuseok Shim. "Query
optimization in the presence of foreign functions."
VLDB. Vol. 93. 1993.
[23] Park, Hyunjung, et al. "Deco: A system for
declarative crowdsourcing." Proceedings of the
VLDB Endowment 5.12 (2012):1990-1993.
[24] Park, Hyunjung, and Jennifer Widom. "Query
optimization over crowdsourced data." Proceedings
of the VLDB Endowment 6.10 (2013): 781-792.
[25] Fan, Ju; Zhang, Meihui; Kok, Stanley; Lu, Meiyu;
Ooi, Beng Chin. CrowdOp: Query Optimization for
Declarative Crowdsourcing Systems. IEEE Trans.
Knowl. Data Eng. 27(8) (2015): 2078-2092.
[26] Shaukat, Kamran, and Usman Shaukat. "Comment
extraction using declarative crowdsourcing (CoEx
Deco)." 2016 International Conference on
Computing, Electronic and Electrical Engineering
(ICE Cube). IEEE, 2016.
[27] Li, Guoliang, et al. "Crowdsourced Data
Management: A Survey." 2017 IEEE 33rd
International Conference on Data Engineering
(ICDE). IEEE, 2017.
[28] Pérez, Jorge, Marcelo Arenas, and Claudio
Gutierrez. "Semantics and complexity of
SPARQL." ACM Transactions on Database
Systems (TODS) 34.3 (2009): 16.
[29] Preda, Nicoleta, et al. "Active knowledge:
dynamically enriching RDF knowledge bases by
web services." Proceedings of the 2010 ACM
SIGMOD International Conference on Management
of data. ACM, 2010.
[30] Gerber, Daniel; Hellmann, Sebastian; Bühmann,
Lorenz; Soru, Tommaso; Usbeck, Ricardo; Ngonga
Ngomo, Axel-Cyrille. Real-time RDF extraction
from unstructured data streams. The Semantic Web,
ISWC 2013-12th International Semantic Web
Conference, v 8218 LNCS, n PART 1, p 135-150,
2013.
[31] Zeng, Kai; Yang, Jiacheng; Wang, Haixun; Shao,
Bin; Wang, Zhongyuan. A distributed graph engine
for web scale RDF data. Proceedings of the VLDB
Endowment, v 6, n 4, p 265-276, 2013.
[1] Kaoudi, Zoi; Manolescu, Ioana. RDF in
the clouds: a survey. VLDB Journal, v 24, n 1, p
67-91, July 11, 2014.
[2] Özsu, M. Tamer. A survey of RDF data
management systems. Frontiers of Computer
Science, v 10, n 3, p 418-432, 2016.
[3] Jiang, Tao; Tan, Ah-Hwee. Mining
RDF Metadata for Generalized Association Rules:
Knowledge Discovery in the Semantic Web Era.
Proceedings of the 15th International Conference on
World Wide Web, p 951-952, 2006.
[4] Hollenbach, James; Presbrey, Joe;
Berners-Lee, Tim. Using RDF Metadata To Enable
Access Control on the Social Semantic Web.
Proceedings of the Workshop on Collaborative
Construction, Management and Linking of
Structured Knowledge, CK 2009 - Collocated with
the 8th International Semantic Web Conference,
ISWC 2009, v 514, 2009.
[5] Jenkins, Charlotte; Jackson, Mike;
Burden, Peter; Wallis, Jon. Automatic RDF metadata
generation for resource discovery. Computer
Networks, v 31, n 11, p 1305-1320, May 17, 1999.
[6] Papamarkos, George; Poulovassilis,
Alexandra; Wood, Peter T. Event-condition-action
rules on RDF metadata in P2P environments.
Computer Networks, v 50, n 10, p 1513-1532, July
14, 2006.
[7] Destefano, R.J.; Tao, Lixin; Gai, Keke.
Improving Data Governance in Large Organizations
through Ontology and Linked Data. 3rd IEEE
International Conference on Cyber Security and
Cloud Computing, CSCloud 2016 and 2nd IEEE
International Conference of Scalable and Smart
Cloud, SSC 2016, p 279-284, August 16, 2016.
[8] Asano, Yu; Koide, Seiji; Iwayama,
Makoto; Kato, Fumihiro; Kobayashi, Iwao; Mima,
Tadashi; Ohmukai, Ikki; Takeda, Hideaki.
Constructing a Site for Publishing Open Data of the
Ministry of Economy, Trade, and Industry — A
Practice for 5-Star Open Data —. New Generation
Computing, v 34, n 4, p 341-366, October 1, 2016.
[9] Thuy, Pham Thi Thu; Lee, Young-Koo;
Lee, Sungyoung. A Semantic Approach for
Transforming XML Data into RDF Ontology.
Wireless Personal Communications, v 73, n 4, p
1387-1402, December 2013.
[10] McClure, John. The Legal-RDF Ontology. A Generic
Model for Legal Documents. Proceedings of the 2nd Workshop on Legal
Ontologies and Artificial Intelligence Techniques, v 321, p 25-42, 2007.
[11] Jung, Hyosook; Yoo, Sujin; Kim, Doyeon; Park, Seongbin.
A grammar based approach to introduce the Semantic Web to novice users.
Multimedia Tools and Applications, v 75, n 23, p 15587-15600, December
1, 2016.
[12] Auer, Sören; Bizer, Christian; Kobilarov, Georgi; Lehmann,
Jens; Cyganiak, Richard; Ives, Zachary; DBpedia: A nucleus for a Web of
open data. The Semantic Web - 6th International Semantic Web Conference
- 2nd Asian Semantic Web Conference, v 4825 LNCS, p 722-735, 2007.
[13] Suchanek, Fabian M.; Kasneci, Gjergji; Weikum, Gerhard.
YAGO: A Core of Semantic Knowledge Unifying WordNet and
Wikipedia. Source: 16th International World Wide Web Conference,
WWW2007, p 697-706, 2007.
[14] Preda, Nicoleta; Kasneci, Gjergji; Suchanek, Fabian M.;
Neumann, Thomas; Yuan, Wenjun; Weikum, Gerhard. Active Knowledge:
Dynamically Enriching RDF Knowledge Bases by Web Services.
Proceedings of the ACM SIGMOD International Conference on
Management of Data, p 399-410, 2010.
[15] Gerber, Daniel; Hellmann, Sebastian; Bühmann, Lorenz;
Soru, Tommaso; Usbeck, Ricardo; Ngonga Ngomo, Axel-Cyrille.
Real-time RDF extraction from unstructured data streams. The Semantic
Web, ISWC 2013 - 12th International Semantic Web Conference, v 8218
LNCS, n PART 1, p 135-150, 2013.
[16] Bouquet, Paolo; Serafini, Luciano; Stoermer, Heiko.
Introducing Context into RDF Knowledge Bases. SWAP 2005 - Semantic
Web Applications and Perspectives, Proceedings of the 2nd Italian
Semantic Web Workshop, v 166, 2005.
[17] Stoermer, Heiko; Palmisano, Ignazio; Redavid, Domenico;
Iannone, Luigi; Bouquet, Paolo; Semeraro, Giovanni. Contextualization of
a RDF Knowledge Base in the VIKEF Project. Digital Libraries:
Achievements, Challenges and Opportunities - 9th International
Conference on Asian Digital Libraries, v 4312 LNCS, p 101-110, 2006.
[19] Das Sarma, Anish; Parameswaran, Aditya; Garcia-Molina,
Hector; Halevy, Alon. Crowd-Powered Find Algorithms. 2014 IEEE 30th
International Conference on Data Engineering, p 964-975, 2014.
[20] Whang, Steven Euijong; Lofgren, Peter; Garcia-Molina,
Hector. Question Selection for Crowd Entity Resolution. Proceedings of
the VLDB Endowment, v 6, n 6, p 349-360, 2013.
[21] Parameswaran, Aditya G.; Garcia-Molina, Hector; Park,
Hyunjung; Polyzotis, Neoklis; Ramesh, Aditya; Widom, Jennifer.
CrowdScreen: Algorithms for Filtering Data with Humans. Proceedings of
the ACM SIGMOD International Conference on Management of Data, p
361-372, 2012.
[22] Bönström, V.; Hinze, A.; Schweppe, H. Storing RDF as a
graph. Proceedings - 1st Latin American Web Congress: Empowering our
Web, LA-WEB 2003, p 27-36, 2003.
[23] Shao, Bin; Wang, Haixun; Li, Yatao. Trinity: A distributed
graph engine on a memory cloud. Proceedings of the ACM SIGMOD
International Conference on Management of Data, p 505-516, 2013.
[24] Zeng, Kai; Yang, Jiacheng; Wang, Haixun; Shao, Bin;
Wang, Zhongyuan. A distributed graph engine for web scale RDF data.
Proceedings of the VLDB Endowment, v 6, n 4, p 265-276, 2013.
[25] E. Prud’hommeaux and A. Seaborne. SPARQL Query
Language for RDF. Technical report, W3C, 2008.
[26] Pérez, Jorge; Arenas, Marcelo; Gutierrez, Claudio.
Semantics and complexity of SPARQL. Source: ACM Transactions on
Database Systems, v 34, n 3, August 1, 2009.
[27] Park, Hyunjung; Widom, Jennifer. CrowdFill: Collecting
structured data from the crowd. Proceedings of the ACM SIGMOD
International Conference on Management of Data, p 577-588, 2014.
[28] Guo, Stephen; Parameswaran, Aditya; Garcia-Molina,
Hector. So who won? Dynamic max discovery with the crowd.
Proceedings of the ACM SIGMOD International Conference on
Management of Data, p 385-396, 2012.
[29] Nicholson, Bryce; Sheng, Victor S.; Zhang, Jing. Label
noise correction and application in crowdsourcing. Expert Systems with
Applications, v 66, p 149-162, December 30, 2016.
[30] Simon C Warby, Sabrina L Wendt, Peter Welinder, Emil
G S Munk, Oscar Carrillo, Helge B D Sorensen, Poul Jennum, Paul E
Peppard, Pietro Perona, Emmanuel Mignot. Sleep-spindle detection:
crowdsourcing and evaluating performance of experts, non-experts and
automated methods. Nature Methods 11 385-392 2014.
[31] Li, G; Wang, J; Zheng, Y; et al. Crowdsourced Data
Management: A Survey. IEEE Transactions on Knowledge and Data
Engineering, v 28, n 9, p 2296-2319, 2016.
[32] Feng, J; Li, G; Wang, H; et al. Incremental Quality
Inference in Crowdsourcing. Database Systems for Advanced Applications,
p 453-467, 2014.
[33] Ipeirotis, P, G; Provost, F; Wang, J. Quality management
on Amazon Mechanical Turk. ACM SIGKDD Workshop on Human
Computation, p 64-67, 2010.
[34] Feng, J; et al. QASCA: A Quality-Aware Task Assignment
System for Crowdsourcing Applications. ACM SIGMOD International
Conference on Management of Data, p 1031-1046, 2015.
[35] Gao, J; Liu, X; Ooi, B, C; et al. An online cost sensitive
decision-making method in crowdsourcing systems. ACM SIGMOD
International Conference on Management of Data, p 217-228, 2013.
[36] Gruenheid, A; Kossmann, D; Sukriti, R; et al.
Crowdsourcing Entity Resolution: When is A=B?. ETH Zurich, Department
of Computer Science, Systems Group, 2012.
[37] Vesdapunt, N; Bellare, K; Dalvi, N. Crowdsourcing
algorithms for entity resolution. Proceedings of the VLDB Endowment, v 7, n
12, p 1071-1082, 2014.
[38] Kaplan, H; Lotosh, I; Milo, T; et al. Answering planning
queries with the crowd. Proc. VLDB Endowment, v 6, n 9, p 697-708, 2013.
[39] Faradani, S; Hartmann, B; Ipeirotis, P, G. What’s the Right
Price? Pricing Tasks for Finishing on Time. Proc. 11th Nat. Conf. Artif.
Intell. Workshop, p 26–31, 2011.
[40] V, Verroios; P, Lofgren; H, Garcia-Molina. TDP: An
optimal-latency budget allocation strategy for crowdsourced MAXIMUM
operations. Proc. ACM SIGMOD Int. Conf. Manage. Data, p 1047–1062,
2015.
[41] Davidson, Susan B.; Khanna, Sanjeev; Milo, Tova; Roy,
Sudeepa. Using the crowd for top-k and group-by queries. ICDT 2013 -
16th International Conference on Database Theory, p 225-236, 2013.
[42] Marcus, Adam; Karger, David; Madden, Samuel; Miller,
Robert; Oh, Sewoong. Counting with the crowd. Proceedings of the VLDB
Endowment, v 6, n 2, p 109-120, December 2012.
[43] Marcus, Adam; Wu, Eugene; Karger, David; Madden,
Samuel; Miller, Robert. Human-powered sorts and joins. Proceedings of
the VLDB Endowment, v 5, n 1, p 13-24, September 2011.
[44] Park, Hyunjung; Pang, Richard; Parameswaran, Aditya;
Garcia-Molina, Hector; Polyzotis, Neoklis; Widom, Jennifer. Deco: A
system for declarative crowdsourcing. Proceedings of the VLDB
Endowment, v 5, n 12, p 1990-1993, August 2012.
[45] Park, Hyunjung; Widom, Jennifer. Query Optimization
over Crowdsourced Data. Proceedings of the VLDB Endowment, v 6, n 10,
p 781-792, August 2013.
[46] Fan, Ju; Zhang, Meihui; Kok, Stanley; Lu, Meiyu; Ooi,
Beng Chin. CrowdOp: Query Optimization for Declarative Crowdsourcing
Systems. IEEE Transactions on Knowledge and Data Engineering, v 27, n
8, p 2078-2092, August 1, 2015.
[47] Franklin, Michael J.; Kossmann, Donald; Kraska, Tim;
Ramesh, Sukriti; Xin, Reynold. CrowdDB: Answering queries with
crowdsourcing. Proceedings of the ACM SIGMOD International
Conference on Management of Data, p 61-72, 2011.
[48] Demartini, Gianluca, Djellel Eddine Difallah, and Philippe
Cudré-Mauroux. ZenCrowd: leveraging probabilistic reasoning and
crowdsourcing techniques for large-scale entity linking. Proceedings of the
21st international conference on World Wide Web. ACM, p 469-478, 2012.
[49] Demartini, Gianluca, Djellel Eddine Difallah, and Philippe
Cudré-Mauroux. "Large-scale linked data integration using probabilistic
reasoning and crowdsourcing." The VLDB Journal 22.5, p 665-687, 2013.
[50] Sarasua, Cristina, Elena Simperl, and Natalya F. Noy.
"Crowdmap: Crowdsourcing ontology alignment with microtasks."
International Semantic Web Conference. Springer, Berlin, Heidelberg, p
525-541, 2012.
[51] Acosta, M; Zaveri, A; Simperl, E; et al. Crowdsourcing
Linked Data Quality Assessment. International Semantic Web Conference.
Springer-Verlag New York, Inc, p 260-276, 2013.
[52] Shaukat, K; Shaukat, U. Comment extraction using
declarative crowdsourcing(CoEx Deco). International Conference on
Computing, Electronic and Electrical Engineering, p 74-78, 2016.
[53] Maribel, Acosta; Elena, Simperl; Fabian, Flöck;
Maria-Esther, Vidal. Enhancing answer completeness of SPARQL queries
via crowdsourcing. Web Semantics: Science, Services and Agents on the
World Wide Web, v 45, p 41-62, 2017.
Depeng Dang received his PhD degree in Computer
Science and Technology from Huazhong University of Science
and Technology, China, in 2003. From Jul. 2003 to Jun. 2005, he
did his postdoctoral research in the Department of Computer
Science and Technology, Tsinghua University, China. He is
now a full professor and Ph.D. supervisor in Computer
Science and Technology at Beijing Normal University, China.
To date, he has chaired four NSFC projects. His research
interests include crowdsourcing computing and RDF data
management.
Wenhui Yu received her Bachelor’s
degree in Computer Science and Technology
from Beijing Normal University. She is currently
studying at the College of Information Science
and Technology, Beijing Normal University,
China. Her research interests include RDF data
management and crowdsourcing computing.
Shaofei Wang received her Master's degree in computer software and theory from Northwestern Polytechnical University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.
Nan Wang received her Bachelor's degree in Computer
Science and Technology from Beijing Normal University. She is currently studying at the College of Information Science and Technology, Beijing Normal University, China. Her research interests include crowdsourcing computing and RDF data management.