Post on 18-Dec-2015
Probabilistic RDF
Octavian Udrea (1), V.S. Subrahmanian (1), Zoran Majkić (2)
(1) University of Maryland, College Park
(2) University “La Sapienza”, Rome, Italy
Motivation
Not all information on the Web is easily expressible in “classic” models (e.g., relational)
RDF extraction from text
  STORY is the first, very successful prototype
Need to extend RDF with temporal and uncertainty components
Goal: build a logical model of RDF with uncertainty and provide query algorithms
The Probabilistic RDF idea
An RDF theory is a set of triples (subject, property, value)
  (USA hasCapital Washington DC)
  (Washington DC hasPopulation 500,000)
Probabilistic RDF extends this model with uncertainty over the set of values.
  (USA hasCapital {(Washington DC, 0.95), (State of Washington, 0.05)})
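As a rough illustration of the idea above, the two models can be written down as plain Python data. The dictionary layout and the helper name `probability` are assumptions made for this sketch only; they are not taken from the slides.

```python
# A classic RDF theory: a set of (subject, property, value) triples.
rdf_theory = {
    ("USA", "hasCapital", "Washington DC"),
    ("Washington DC", "hasPopulation", "500,000"),
}

# Hypothetical pRDF encoding: (subject, property) -> {value: probability}.
prdf_theory = {
    ("USA", "hasCapital"): {
        "Washington DC": 0.95,
        "State of Washington": 0.05,
    },
}


def probability(theory, subject, prop, value):
    """Return the probability attached to (subject, prop, value), or 0.0 if absent."""
    return theory.get((subject, prop), {}).get(value, 0.0)


print(probability(prdf_theory, "USA", "hasCapital", "Washington DC"))
```

Keying on (subject, property) keeps the whole value distribution in one place, which mirrors the way the slide attaches probabilities to the set of values rather than to separate triples.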
Probabilistic RDF example
[Figure: example pRDF graph of medical conditions. Node labels: Condition, Infection, BacterialInfection, ViralInfection, Respiratory, Digestive, Sleeping, FoodPoisoning, Botulism, E-ColiPoisoning, Flu, AcuteBronchitis, Pneumonia, Middle Ear Infection, Emphysema, Cor pulmonale, Symptom, Metabolic, Mental, Fatigue. Edge labels: subClassOf, rdf:type, hasComplication, associatedWith, causeOf, subPropertyOf; some edges carry probabilities (e.g., subClassOf .85 / .15, hasComplication .7, associatedWith .65).]
Extracted based on www.wrongdiagnosis.com
Probabilistic RDF syntax
Schema uncertainty: (c subClassOf (C, δ)), with $\sum_{d \in C} \delta(d) \le 1$
Class-instance uncertainty: (x rdf:type (C, δ)), with $\sum_{d \in C} \delta(d) \le 1$
Instance-based uncertainty: (x p (Y, δ)), with $\sum_{y \in Y} \delta(y) \le 1$
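All three kinds of uncertainty share the same constraint: the probabilities attached to a value set sum to at most 1. A minimal check might look like the sketch below. The function name, the tolerance parameter and the class names in the example are assumptions for illustration.

```python
import math


def is_valid_distribution(delta, tol=1e-9):
    """Check that every delta(d) lies in [0, 1] and that the values sum to at most 1.

    `delta` maps each element of the value set to its probability.
    `tol` absorbs floating-point rounding and is an assumption of this sketch.
    """
    values = list(delta.values())
    in_range = all(0.0 <= v <= 1.0 for v in values)
    return in_range and math.fsum(values) <= 1.0 + tol


# Illustrative class names; the probabilities echo the .85 / .15 labels in the example graph.
print(is_valid_distribution({"ClassA": 0.85, "ClassB": 0.15}))
print(is_valid_distribution({"ClassA": 0.7, "ClassB": 0.4}))
```

Note that the constraint is "at most 1", not "exactly 1", so a distribution may leave some probability mass unassigned.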
Probabilistic RDF syntax
Sanity requirements
  (c subClassOf (C1, δ1)) and (c subClassOf (C2, δ2)) => (C1 = C2 and δ1 = δ2) or C1 ∩ C2 = Ø
  Same applies for other types of uncertainty
Transitive properties
  Simple inferential capability
  Examples: associatedWith, controlledBy
P-path: A set of triples connected by transitive properties
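The sanity requirement can be read as a pairwise test: two probabilistic triples with the same subject and property must either carry the identical value set and distribution, or have disjoint value sets. The sketch below encodes a triple as `(subject, property, delta)`, where `delta` is a dict from value to probability. That encoding and the name `compatible` are assumptions of this example.

```python
def compatible(triple_a, triple_b):
    """Return True if two probabilistic triples respect the sanity requirement.

    Triples about different (subject, property) pairs never conflict.
    Otherwise they must be identical (same values, same probabilities)
    or mention disjoint sets of values.
    """
    s1, p1, delta1 = triple_a
    s2, p2, delta2 = triple_b
    if (s1, p1) != (s2, p2):
        return True
    return delta1 == delta2 or set(delta1).isdisjoint(delta2)


# Hypothetical class names, used only to exercise the check.
t1 = ("c", "subClassOf", {"A": 0.85, "B": 0.15})
t2 = ("c", "subClassOf", {"D": 0.5})
t3 = ("c", "subClassOf", {"A": 0.2})
print(compatible(t1, t2), compatible(t1, t3))
```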
Example p-path
[Figure: the example pRDF graph of medical conditions shown again, with a p-path highlighted.]
P-path semantics and t-norms
We cannot generally assume independence between triples on a transitive path
  Flu, AcuteBronchitis, Pneumonia
T-norms are used to express the user’s knowledge of the relationship between triples
  $\otimes$ is associative and commutative
  $0 \otimes x = 0$, $1 \otimes x = x$
  $x \le y,\ z \le w \Rightarrow x \otimes z \le y \otimes w$
P-path probability: t-norm applied to individual probabilities on the path
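To make the p-path probability concrete, the sketch below folds a t-norm over the probabilities along a path. The product t-norm is the one used in the slides' example; the minimum and Łukasiewicz t-norms are standard textbook t-norms added here only for comparison. All function names are assumptions.

```python
from functools import reduce


def product_tnorm(x, y):
    """Product t-norm: corresponds to treating the triples as independent."""
    return x * y


def minimum_tnorm(x, y):
    """Minimum (Goedel) t-norm: the largest possible t-norm."""
    return min(x, y)


def lukasiewicz_tnorm(x, y):
    """Lukasiewicz t-norm: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def path_probability(probabilities, tnorm):
    """Combine the probabilities on a p-path with the given t-norm.

    1.0 is the neutral element of every t-norm (1 (x) x = x), so it is a safe start value.
    """
    return reduce(tnorm, probabilities, 1.0)


path = [0.7, 0.65]  # Flu -> AcuteBronchitis -> Pneumonia
for tnorm in (product_tnorm, minimum_tnorm, lukasiewicz_tnorm):
    print(tnorm.__name__, round(path_probability(path, tnorm), 3))
```

With the product t-norm the path evaluates to roughly 0.455, which matches the value quoted in the slides. The other two t-norms give different values for the same path, which is why the choice of t-norm is left to the user.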
Example p-path
[Figure: the example pRDF graph of medical conditions, with a p-path from Flu to Pneumonia highlighted.]
(Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm
pRDF semantics
A world W is a set of simple triples (with no probabilities)
An interpretation I associates a probability to each world
I satisfies a pRDF theory:
  For each (s, p, (V, δ)) and each v in V: $\delta(v) \le \sum_{W \ni (s,p,v)} I(W)$, where the sum ranges over the worlds W that contain (s, p, v)
  Same applies to paths w.r.t. a given t-norm
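The satisfaction condition for plain triples can be checked directly on small examples. In the sketch below an interpretation is a list of `(world, probability)` pairs, a world is a frozenset of simple triples, and a theory is a list of `(s, p, delta)` entries. These encodings and the name `satisfies` are assumptions, and the path condition is not covered.

```python
def satisfies(interpretation, theory, tol=1e-9):
    """Check delta(v) <= sum of I(W) over the worlds W containing (s, p, v).

    Only the condition on individual triples is tested; p-paths are ignored here.
    """
    for s, p, delta in theory:
        for v, lower_bound in delta.items():
            mass = sum(prob for world, prob in interpretation if (s, p, v) in world)
            if mass + tol < lower_bound:
                return False
    return True


theory = [("USA", "hasCapital", {"Washington DC": 0.95, "State of Washington": 0.05})]

w1 = frozenset({("USA", "hasCapital", "Washington DC")})
w2 = frozenset({("USA", "hasCapital", "State of Washington")})

print(satisfies([(w1, 0.95), (w2, 0.05)], theory))  # enough mass on both values
print(satisfies([(w1, 0.5), (w2, 0.5)], theory))    # too little mass on Washington DC
```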
pRDF semantics
A theory is consistent iff it has a satisfying interpretation
  Every pRDF theory is consistent
Entailment: T entails T’ iff every satisfying interpretation of T satisfies T’
Closure of a theory: The entire set of triples entailed by the theory
  Maximal w.r.t. the probability values
pRDF fixpoint semantics
The closure operator Δ adds exactly one entailed triple at each step
  (Flu associatedWith (AcuteBronchitis, .7)) and (AcuteBronchitis associatedWith (Pneumonia, .65)) yield (Flu associatedWith (Pneumonia, 0.455)) w.r.t. the product t-norm
Δ has a fixpoint, which is the theory closure.
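The fixpoint idea can be sketched as a naive loop that adds (or improves) one derived triple per step until nothing changes. This is not the authors' operator Δ, only an illustration for a single transitive property. The edge encoding `(subject, object) -> probability` and the choice to keep the highest probability found are assumptions of this sketch.

```python
def closure(edges, tnorm):
    """Naive fixpoint closure for one transitive property.

    `edges` maps (subject, object) to a probability. Each iteration of the
    outer loop applies exactly one update: a new derived triple, or a higher
    probability for an existing one. The loop stops when no update applies.
    """
    closed = dict(edges)
    while True:
        update = None
        for (a, b), p_ab in closed.items():
            for (b2, c), p_bc in closed.items():
                if b != b2 or a == c:
                    continue  # not a chain a -> b -> c, or a trivial self-loop
                candidate = tnorm(p_ab, p_bc)
                if candidate > closed.get((a, c), 0.0):
                    update = ((a, c), candidate)
                    break
            if update is not None:
                break
        if update is None:
            return closed  # fixpoint reached
        closed[update[0]] = update[1]


associated_with = {
    ("Flu", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "Pneumonia"): 0.65,
}
result = closure(associated_with, lambda x, y: x * y)
print(round(result[("Flu", "Pneumonia")], 3))
```

The loop rescans all pairs of triples on every step, so it is far too slow for large theories. That cost is what motivates avoiding a full closure at query time.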
pRDF query processing
We will consider only simple queries: a triple with a variable term
  Example: (? associatedWith Pneumonia .4)
  What is associated with Pneumonia with probability above .4?
Simple method:
  Compute the closure
  Select any triple in the closure that matches the query
  VERY expensive computationally
pRDF query processing
Set of algorithms for answering simple queries and conjunctions: pRDF_Subject, pRDF_Property, …, pRDF_conjunction
Central idea:
  Apply Δ in only those directions that yield tuples relevant to the query
  Cut off path computations when the threshold can no longer be reached.
  min(current_probability, threshold)
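The central idea can be illustrated with a backward search from the query's value that abandons a path as soon as its probability drops to the threshold or below. The pruning is sound because a t-norm never exceeds either of its arguments, so extending a path cannot raise its probability. This is a sketch under stated assumptions, not the pRDF_Subject algorithm from the slides; `query_subjects` and the edge encoding are hypothetical.

```python
def query_subjects(edges, target, threshold, tnorm):
    """Answer (? p target threshold) for one transitive property p.

    `edges` maps (subject, object) to a probability. Returns a dict from
    subject to the best path probability found that is strictly above
    `threshold`.
    """
    # Index edges by object so the search only walks towards the target.
    incoming = {}
    for (s, o), prob in edges.items():
        incoming.setdefault(o, []).append((s, prob))

    best = {}
    stack = [(target, 1.0)]  # 1.0 is neutral for every t-norm
    while stack:
        node, prob_to_target = stack.pop()
        for s, prob in incoming.get(node, []):
            candidate = tnorm(prob, prob_to_target)
            if candidate <= threshold or s == target:
                continue  # cut off: the threshold can no longer be reached
            if candidate > best.get(s, 0.0):
                best[s] = candidate
                stack.append((s, candidate))
    return best


edges = {
    ("Flu", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "Pneumonia"): 0.65,
    ("Cold", "Flu"): 0.5,  # hypothetical edge, pruned for threshold .4
}
answers = query_subjects(edges, "Pneumonia", 0.4, lambda x, y: x * y)
print(sorted(answers))
```

In this example the hypothetical edge from Cold is never expanded further, because its path probability already falls below the threshold.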
Experimental results
Implementation
  Java, 1700 LOC
  Disk-based storage for pRDF theories
Synthetically generated datasets
  According to varying underlying distributions
Datasets extracted from Web sources
Experimental questions
Does the underlying distribution affect query running time?
From a practical point of view, which are the “fastest” types of queries?
How does running time vary with the number of atoms in a conjunction?
What other theory-dependent factors affect running time?
  Theory width
  Number of properties
Query running time (Poisson)
[Figure: Atomic queries running time (Poisson). x-axis: dataset size [no. of quadruples], 5,000 to 100,000; y-axis: time [ms]; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]
Query running time (Zipf)
[Figure: Atomic queries running time (Zipf). x-axis: dataset size [no. of quadruples], 5,000 to 100,000; y-axis: time [ms]; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]
Conjunctive queries running time
[Figure: Conjunctive queries running time. x-axis: dataset size [no. of quadruples], 5,000 to 100,000; y-axis: time [ms]; series: 5, 10, 20 and 30 queries.]
Dependence on property width
[Figure: Atomic query running time dependence on width. x-axis: dataset width average, 5 to 35; y-axis: time [ms]; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]
Number of properties
[Figure: Atomic query running time dependence on the number of properties. x-axis: number of properties, 20 to 100; y-axis: time [ms]; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]
Take away points
RDF syntax with uncertainty
Model-theory and fixpoint semantics for pRDF
Efficient query algorithms for pRDF