toon calders [email protected] deductive databases
TRANSCRIPT
TOON [email protected]
Deductive databases
Important Information
This material was developed by
Dr. Toon Calders and prof. Dr. Jan Paredaensat University of Technology, Eindhoven.The original document can be accessed at the following
address:http://wwwis.win.tue.nl/~tcalders/teaching/advancedDB/
All changes and adaptations are my own responsability, Rogelio Dávila Pérez
Motivation: Deductive DB
Motivation is two-fold: add deductive capabilities to databases; the
database contains: facts (extensional relations) rules to generate derived facts (intensional relations)
Database is knowledge base Extend the querying
datalog allows for recursion
Motivation: Deductive DB
Datalog as engine of deductive databases similarities with Prolog has facts and rules rules define -possibly recursive- views
Semantics not always clear safety negation recursion
Outline
Syntax of the Datalog languageSemantics of a Datalog programRelational algebra = safe Datalog with
negation and without recursionOptimization techniquesConclusions
Syntax of Datalog
Datalog query/program: facts traditional relational tables rules define intensional views
Rules if-then rules can contain recursion can contain negations
Semantics of program can be ambiguous
Syntax of Datalog
Example
father(X,Y) :- person(X,m), parent(X,Y).
grandson(X,Y) :- parent(Y,Z), parent(Z,X), person(X,m).
hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
person parentName Sex Parent Childingrid f ingrid toonalex m alex toonrita f rita anjan m jan antoon m an berndan f an mattijsbernd m toon berndmattijs m toon mattijs
Syntax of Datalog
Variables: X, YConstants: m, f, rita, …Positive literal: p(t1,…,tn)
p is the name of a relation (EDB or IDB) t1, …, tn constants or variables
Negative literal: not p(t1, …, tn)
Rule: h :- l1, …, ln
h positive literal, l1, …, ln literals
Syntax of Datalog
Variables: X, YConstants: m, f, rita, …Positive literal: p(t1,…,tn)
p is the name of a relation (EDB or IDB) t1, …, tn constants or variables
Negative literal: not p(t1, …, tn)
Rule: h :- l1, …, ln
h positive literal, l1, …, ln literals
In Datalog: Classic logical negation
( In contrast to Prolog’s negation by failure )
Syntax of Datalog
Rules can be recursiveArithmetic operations considered as
special predicates A<B : smaller(A,B) A+B=C : plus(A,B,C)
Outline
Syntax of the Datalog languageSemantics of a Datalog program
non-recursive recursive datalog aggregation
Relational algebra = safe Datalog with negation and without recursion
Optimization techniquesConclusions
Semantics of Non-Recursive Datalog Programs
Ground instantiation of a rule h :- l1, …, ln : replace every variable in the rule by a constant
Example:father(X,Y) :- person(X,m), parent(X,Y)
instantiation:father(toon,an) :- person(toon,m), parent(toon,an).
Semantics of Non-Recursive Datalog Programs
Let I be a set of factsThe body of a rule instantiation R’ is
satisfied by I if: every positive literal in the body of R’ is in I no negative literal in the body of R’ is in I
Example:person(toon,m), parent(toon,an) not satisfied by the facts given before
Semantics of Non-Recursive Datalog Programs
Let I be a set of factsR is a rule h :- l1, …, ln
Infer(R,I) = { h’ : h’ :- l1’, …, ln’ is a ground instantiation of R
l1’ … ln’ is satisfied by I }
R={R1, …, Rn}Infer(R,I) = Infer(R1,I) … Infer(Rn,I)
Semantics of Non-Recursive Datalog Programs
A rule h :- l1, …, ln is in layer 1: l1, …, ln only involve extensional predicates
A rule h :- l1, …, ln is in layer i for all 0<j<i, it is not in layer j l1, …, ln only involve predicates that are
extensional and in the layers 1, …, i-1
Semantics of Non-Recursive Datalog Programs
Let I0 be the facts in a datalog program
Let R1 be the rules at layer 1
…Let Rn be the rules at layer n
I1 = I0 Infer(R1, I0)
I2 = I1 Infer(R2, I1)
…In = In-1 Infer(Rn, In-1)
Semantics of Non-Recursive Datalog Programs
Example:
person parentName Sex Parent Childingrid f ingrid toonalex m alex toonrita f rita anjan m jan antoon m an berndan f an mattijsbernd m toon berndmattijs m toon mattijs
father(X,Y) :- person(X,m), parent(X,Y).grandfather(X,Y) :- father(X,Y), parent(Y,Z).hbrothers(X,Y) :- person(X,m), person(Y,m),
parent(Z,X), parent(Z,Y).
Semantics of Non-Recursive Datalog Programs
person parentName Sex Parent Childingrid f ingrid toonalex m alex toonrita f rita anjan m jan antoon m an berndan f an mattijsbernd m toon berndmattijs m toon mattijs
father(X,Y) :- person(X,m), parent(X,Y).grandfather(X,Y) :- father(X,Y), parent(Y,Z).hbrothers(X,Y) :- person(X,m), person(Y,m),
parent(Z,X), parent(Z,Y).
Stratum 0Valuation:
Semantics of Non-Recursive Datalog Programs
person parentName Sex Parent Childingrid f ingrid toonalex m alex toonrita f rita anjan m jan antoon m an berndan f an mattijsbernd m toon berndmattijs m toon mattijs
father(X,Y) :- person(X,m), parent(X,Y).grandfather(X,Y) :- father(X,Y), parent(Y,Z).hbrothers(X,Y) :- person(X,m), person(Y,m),
parent(Z,X), parent(Z,Y).
Stratum 0 father alex toonjan antoon berndtoon mattijs hbrothersbernd mattijsmattijs berndmattijs mattijsbernd bernd
Stratum 1Valuation:
Semantics of Non-Recursive Datalog Programs
person parentName Sex Parent Childingrid f ingrid toonalex m alex toonrita f rita anjan m jan antoon m an berndan f an mattijsbernd m toon berndmattijs m toon mattijs
father(X,Y) :- person(X,m), parent(X,Y).grandfather(X,Y) :- father(X,Y), parent(Y,Z).hbrothers(X,Y) :- person(X,m), person(Y,m),
parent(Z,X), parent(Z,Y).
Stratum 0 father grandfather alex toon jan mattijsjan an jan berndtoon bernd alex mattijstoon mattijs alex bernd hbrothersbernd mattijsmattijs berndmattijs mattijsbernd bernd
Stratum 1 Stratum 2 Valuation:
Caveat: Correct Negation
Negation in Datalog Negation in PrologProlog negation (Negation by failure):
not(p(X)) “is true if we fail to prove p(X)”
Datalog negation (Correct negation): not(p(X)) “binds X to a value such that p(X) does
not hold.”
Caveat: Correct Negation
Example:
father(a,b). person(a). person(b).nfather(X) :- person(X), not( father (X,Y) ),
person(Y).
Datalog:? nfather(X) { (a), (b) }Prolog:? nfather(X) { (b) }? person(a), not(father(a,a)), person(a) yes
Caveat: Correct Negation
Prolog: Order of the clauses is important nfather(X) :- person(X), not( father (X,Y) ), person(Y). versus nfather(X) :- person(X), person(Y), not( father (X,Y) ). Order of the rules is important
Datalog: Order not important More declarative
Caveat: Correct Negation
Difference is not fundamental:
Prolog:nfather(X) :- person(X), not( father (X,Y) ).
Datalog:nfather(X) :- person(X), not(father_of_someone(X)).
father_of_someone(X) :- father (X,Y).
Caveat: Correct Negation
Difference is not fundamental: Many systems that claim to implement Datalog,
actually implement negation by failure. Debating on whether or not this is correct is
pointless; both perspectives are useful Check on beforehand how an engine implements
negation …
Throughout the course, in all exercises, in the exam, …, we assume correct negation.
Safety
A rule can make no sense if variables appear in funny ways
Examples:S(x) :- R(y)S(x) :- not R(x)S(x) :- R(y), x<yIn each of these cases the result is infinite
even if the relation R is finite
Safety
Even when not leading to infinite relations, such Datalog Programs can be domain-dependent.
Example:s(a,b). s(a,a). r(a). r(b).t(X) :- not(s(X,Y)), r(X).If domain is {a,b}:
only t(b) holds.If domain is {a,b,c}
not only t(b), but also t(a) holds( Ground instantiation t(a) :- not(s(a,c)), r(a). )
Safety
Therefore, we will only consider rules that are safe.
A rule h :- l1, …, ln is safe if:
every variable in the head of the rule: also in a non-arithmetic positive literale in body
every variable in a negative literal of the body: also in some positive literal of the body
Model-Theoretic Semantics
A model M of a Datalog program is:
An instatiation of all intensional relations in the program
That satisfies all rules in the program If the body of a ground instantiation of a rule holds in
M, also the head must hold
Some models are special …
Model-Theoretic Semantics
father(a,b).person(X) :- father(X,Y).person(Y) :- father(X,Y).
M1: father persona b a
bM2: father person
a b ab a ba a
Model-Theoretic Semantics
A model is minimal if « we cannot remove tuples »
M1: father persona b a
b
M2: father persona b ab a ba a
Minimal
Not Minimal
Model-Theoretic Semantics
For non-recursive, safe datalog programs semantics is well defined
The model = all facts that can be derived from the program
Closed-World Assumption: if a fact cannot be derived from the database, then it is not true
Is a minimal model
Model-Theoretic Semantics
Minimal model is, however, not necessarily unique
Example:r(a).t(X) :- r(X), not s(X).
minimal models: M1 M2
r s t r s ta a a a
Outline
Syntax of the Datalog languageSemantics of a Datalog program
non-recursive recursive datalog aggregation
Relational algebra = safe Datalog with negation and without recursion
Optimization techniquesConclusions
Semantics of Recursive Datalog Programs
g(a,b). g(b,c). g(a,d).reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y)reach(X,Y) :- g(X,Y).reach(X,Z) :- reach(X,Y), reach(Y,Z).
Fixpoint of a set of rules R, starting with set of facts I:repeat
Old_I := II := I infer(R,I)
until I = Old_I
Always termination (inflationary fixpoint)
Semantics of Recursive Datalog Programs
g(a,b). g(b,c). g(a,d).reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y)reach(X,Y) :- g(X,Y).reach(X,Z) :- reach(X,Y), reach(Y,Z).
Step 0: reach = { }Step 1: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c),
(a,d) }Step 2: reach = {(a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d),
(a,c) }Step 3: reach = {(a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d),
(a,c) } STOP
Semantics of Recursive Datalog Programs
Datalog without negation: Always a unique minimal model.
Semantics of recursive datalog with negation is less clear.
Example:T(a).R(X) :- T(X), not S(X).S(X) :- T(X), not R(X).
What about R(a)? S(a)?
Semantics of Recursive Datalog Programs
For some classes of Datalog queries with negation still a natural semantics can be defied
Important class: stratified programs T depends on S if some rule with T in the head
contains S or (recursively) some predicate that depends on S, in the body.
Stratified program: If T depends on (not S), then S cannot depend on T or (not T).
Semantics of Recursive Datalog Programs
The program:T(a).R(X) :- T(X), not S(X).S(X) :- T(X), not R(X).
is not stratified;
R depends negatively on SS depends negatively on R
R T
S
Semantics of Recursive Datalog Programs
g(a,b). g(b,c). g(a,d).reach(X,X) :- g(X,Y).reach(Y,Y) :- g(X,Y).reach(X,Y) :- g(X,Y).reach(X,Z) :- reach(X,Y),
reach(Y,Z).node(X) :- g(X,Y).node(Y) :- g(X,Y).unreach(X,Y) :- node(X), node(Y),
not reach(X,Y).
g
reach node
unreach
Semantics of Recursive Datalog Programs
If a program is stratified, the tables in the program can be partitioned into strata: Stratum 0: All database tables. Stratum I: Tables defined in terms of tables in
Stratum I and lower strata. If T depends on (not S), S is in lower stratum than
T.
Semantics of Recursive Datalog Programs
g(a,b). g(b,c). g(a,d).reach(X,X) :- g(X,Y).reach(Y,Y) :- g(X,Y).reach(X,Y) :- g(X,Y).reach(X,Z) :- reach(X,Y),
reach(Y,Z).node(X) :- g(X,Y).node(Y) :- g(X,Y).unreach(X,Y) :- node(X), node(Y),
not reach(X,Y).
g
reach node
unreach
0
1
2
Semantics of Recursive Datalog Programs
Semantics of a stratified program given by: First, compute the least fixpoint of all
tables in Stratum 1. (Stratum 0 tables are fixed.)
Then, compute the least fixpoint of tables in Stratum 2; then the lfp of tables in Stratum 3, and so on, stratum-by-stratum.
Semantics of Recursive Datalog Programs
Fixpoint of a set of rules R, starting with set of facts I:repeat
Old_I := II := I infer(R,I)
until I = Old_I
Fixpoint within one stratum always terminates Due to monotonicity within the strata Only positive dependence between tables in stratum l.
Due to finite program, number of strata isfinite as well
Semantics of Recursive Datalog Programs
Stratum 0: g(a,b). g(b,c). g(a,d).Stratum 1:
node(a), node(b), node(c), node(d),reach(a,a), reach(b,b), reach(c,c), reach(d,d), reach(a,b), reach(b,c), …
Stratum 2:unreach(b,a), unreach(c,a), …
Outline
Syntax of the Datalog languageSemantics of a Datalog program
non-recursive recursive datalog aggregation
Relational algebra = safe Datalog with negation and without recursion
Optimization techniquesConclusions
Aggregate Operators
The < … > notation in the head indicates grouping; the remaining arguments (X, in this example) are the GROUP BY fields.
In order to apply such a rule, must have all of relation g available.
Stratification with respect to use of < … > is similar to negation.
Degree(X, SUM(<Y>)) :- g(X,Y).
Aggregate Operators
bi(X,Y) :- g(X,Y). gbi(Y,X) :- g(X,Y).Degree(X, SUM(<Y>)) :- bi(X,Y). bi
degree
Aggregate Operators
bi(X,Y) :- g(X,Y). gbi(Y,X) :- g(X,Y).Degree(X, SUM(<Y>)) :- bi(X,Y). bi
degree
0
1
2
Aggregate Operators
bi(X,Y) :- g(X,Y). gbi(Y,X) :- g(X,Y).Degree(X, SUM(<Y>)) :- bi(X,Y). bi
degree
Compute stratum by stratumAssume strata 1 k fixed when computing
k+1
0
1
2
Aggregate Operators
r(a,b). r(a,c). s(a,d).t(X,SUM(<Y>)) :- r(X,Y).r(X,Y) :- t(X,Z), Z=2, s(X,Y).
Aggregate Operators
r(a,b). r(a,c). s(a,d). tt(X,SUM(<Y>)) :- r(X,Y).r(X,Y) :- t(X,Z), Z=2, s(X,Y). r s
Aggregate Operators
r(a,b). r(a,c). s(a,d). tt(X,SUM(<Y>)) :- r(X,Y).r(X,Y) :- t(X,Z), Z=2, s(X,Y). r s
“a is aggregating over a moving target”Step 1: t(a,2) is addedStep 2: r(a,d)is addedStep 3: t(a,3) added, t(a,2) no longer true …
hence r(a,d) should not have been added
Outline
Syntax of the Datalog languageSemantics of a Datalog programRelational algebra = Safe Datalog with
negation and without recursionOptimization techniquesConclusions
RA = Non-Recursive Datalog
Every operator of RA can be simulated by non-recursive datalog Project on the first attribute of the ternary relation r:
query (A) :–r(A, B, C).
Cartesian product of relations r1 and r2.
query (X1, X2, ..., Xn, Y1, Y1, Y2, ..., Ym ) :–r1 (X1, X2, ..., Xn ), r2 (Y1, Y2, ...,
Ym ). Union of relations r1 and r2.
query (X1, X2, ..., Xn ) :–r1 (X1, X2, ..., Xn ), query (X1, X2, ..., Xn ) :–r2 (X1, X2, ..., Xn ),
Set difference of r1 and r2.
query (X1, X2, ..., Xn ) :–r1(X1, X2, ..., Xn ), not r2 (X1, X2, ..., Xn )
RA = Non-Recursive Datalog
Every operator of RA can be simulated by non-recursive datalog Result of our construction is always safe, and
equivalent for stratified semantics
$1=$3 (($1R) x R)
query1(A) :- R(A,B).
query2(A,B,C) :- query1(A), R(B,C).
result(A,B,A) :- query2(A,B,A)
RA = Non-Recursive Datalog
Every rule can be expressed by one RA expression: Translate every atom separately
Negation/arithmetic: use “complement” construction Essential: safety
Combine atoms with Cartesian product Do the joins with a selection Project on the relevant attributes
Strata determine the order of evaluationBecause of no recursion: every rule only
executed once.
RA = Non-Recursive Datalog
sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
person(X,f): $2=‘f’ Personparent(Z,X) and parent(Z,Y): Parentnot(X=Y): complement construction
X comes from parent(Z,X) $2 Parent
Y from parent(Z,Y) $2 Parent
not(X=Y) $1$2 ($2 Parent x $2 Parent)
RA = Non-Recursive Datalog
sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
$1,$6
$1=$4, $3=$5, $1=$7, $6=$8
($2=‘f’ Person x Parent x Parent
x $1$2 ($2 Parent x $2
Parent))
RA = Non-Recursive Datalog
Hence, the following two are equivalent in expressive power:
Safe Datalog with negation, without recursion or aggregation, under the stratified semantics
Relational Algebra
Every rule separately can be expressed by a relational algebra expression Makes it very suitable for implementation on top
of a relational database
Outline
Syntax of the Datalog languageSemantics of a Datalog programRelational algebra = Datalog with
negation and without recursionOptimization techniquesConclusions
Evaluation of Datalog Programs
Running example:root(r). child(r,a). child(r,b). child(a,c).child(a,d). child(c,e). child(d,f). child(b,h).
sg(X,Y) :- root(X),root(Y).sg(X,Y) :- child(X,U), sg(U,V),
child(Y,V).
r
a b
c d
fe
h
Evaluation of Datalog Programs: Issues
Repeated inferences: recursive rules are repeatedly applied in the naïve way; same inferences in several iterations.
Unnecessary inferences: if we just want to find sg of a particular node, say e, computing the fixpoint of the sg program and then selecting tuples with e in the first column is wasteful, in that we compute many irrelevant facts.
Evaluation of Datalog Programs
Running example:Query: ? sg(e,X)
1. (r, r)2. (a,a), (b,b), (a,b), (b,a)3. (c,c), (c,d), (c,h), (d,c), (d,d), …4. (e,e), (f,f), (e,f), (f,e)
r
a b
c d
fe
h
Avoiding Repeated Inferences
Seminaive Fixpoint Evaluation: Avoid repeated inferences: at least one of the body facts generated in the most recent iteration. For each recursive table P, use a table delta_P. Rewrite the program to use the delta tables.
A second evaluation of the rule:r(X,Y) :- s(X), t(Y), u(X,Z), v(Y,Z).
only gives new tuples (X,Y) for ground instantiations in which at least one of the atoms is new.
Avoiding Unnecessary Inferences
Still, in the running example: many unnecessary deductions when query is:
? sg(e,X)
Compare with top-down as in Prolog only facts that are connected to the ultimate goal
are being considered
The “Prolog Way”
sg(X,Y) :- root(X),root(Y).sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
? sg(e,X).try: root(e) FAILtry: child(e,U)
U=ctry: sg(c,V)
try: root(e) FAILtry: child(c,U) U=a
…
r
a b
c d
fe
h
The “Prolog Way”
sg(X,Y) :- root(X),root(Y).sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
? sg(e,X).try: root(e) FAILtry: child(e,U)
U=ctry: sg(c,V)
try: root(e) FAILtry: child(c,U) U=a
…
r
a b
c d
fe
h
“Magic Sets” Idea
We want to do something similar for Datalog
Idea: Define a “filter” table: computes all relevant values, restricts the computation of sg(e,X).
sg(X,Y) :- m(X), root(X), root(Y).sg(X,Y) :- m(X), child(X,U), sg(U,V),child(Y,V).m(X) :- m(Y), child(Y,X).m(e).
Magic Sets
It is always possible to do this in such a way that bottom-up becomes as efficient as top-down!
Different proposals exist in literature how to introduce the magic filters
Optimization Techniques
Many other techniques exist as well Standard relational indexing techniques
(partly) materializing intensional relations on beforehand Trade-off memory query time performance (See also the OLAP-part for a similar technique)
Different representations for relations BDD (Stanford)
Outline
Syntax of the Datalog languageSemantics of a Datalog programRelational algebra = Datalog with
negation and without recursionOptimization techniquesConclusions
Conclusions
Datalog adds deductive capabilities to databases extensional relations intensional relations
Datalog without recursion, with negation safety requirement stratification equal in power to relational algebra Closed World Assumption
Conclusions
Datalog without Negation Always a unique minimal model
Datalog with negation and recursion semantics not always clear stratified negation
Evaluation of datalog queries: without negation = RA-optimization with recursion:
semi-naive recursion magic sets
Conclusions
Very nice idea, but …Deductive databases did not make it as
a database paradigm
Yet, many ideas survived recursion in SQL …
And others may re-surface in future. Increasing need for adding meta-information in
databases