provenance-based analysis of data-centric processesmoskovitch1/docs/vldbj15.pdf · provenance-based...
TRANSCRIPT
Noname manuscript No.(will be inserted by the editor)
Provenance-Based Analysis ofData-Centric Processes
Daniel Deutch ·Yuval Moskovitch ·Val Tannen
the date of receipt and acceptance should be inserted later
Abstract We consider in this paper static analysis of
the possible executions of data-dependent applications,
namely applications whose control flow is guided by a
finite state machine, as well as by the state of an un-
derlying database. We note that previous work in this
context has not addressed two important features of
such analysis, namely analysis under hypothetical sce-
narios, such as changes to the application’s state ma-
chine and/or to the underlying database, and the con-
sideration of meta-data, such as cost or access priv-
ileges. Observing that semiring-based provenance has
been proven highly effective in supporting these two
features for database queries, we develop in this paper
a semiring-based provenance framework for the anal-
ysis of data-dependent processes, accounting for hypo-
thetical reasoning and meta-data. The development ad-
dresses two interacting new challenges: (1) combining
provenance annotations for both information that re-
sides in the database and information about external
inputs (e.g., user choices), and (2) finitely capturing in-
finitely many process executions. We have implemented
our framework as part of the PROPOLIS system.
This research was partially supported by the Israeli Ministryof Science, the Israeli Science Foundation, the National Sci-ence Foundation (NSF IIS 1217798), the US-Israel BinationalScience Foundation, the Broadcom Foundation and Tel AvivUniversity Authentication Initiative
Daniel DeutchTel Aviv University
Yuval MoskovitchTel Aviv University
Val TannenUniversity of Pennsylvania
1 Introduction
There are two dominant approaches in the modeling
and specification of applications: the classic, process-
centric, approach focusing on the application control
flow while abstracting away data (see e.g. [32], and a
data-centric approach (e.g. [18,13]), where the database
underlying the process, and queries over it that guide
the course of execution, are explicitly modeled. We fo-
cus here on the data-centric approach, and model such
applications through data-dependent processes (DDPs)
which are finite state machines (FSM) whose transitions
are guided by both external input and by the result of
internal database queries.
As a simple example, consider a typical E-commerce
application where users can navigate through a selec-
tion of products (classified through categories and sub-
categories), and proposed discount deals. The under-
lying process semantics can be captured via an FSM
whose states are associated with web-pages and/or prop-
erties defined by the applications’ business logic. The
transitions between states can be governed both by
user clicks or text input and by queries on the prod-
uct database.
The typical complexity of these applications requires
automatic (static) analysis of their possible executions,
verifying properties of interest. In particular, an analyst
may be interested in questions such as “can a user view
a particular category without being a club member?”,
or “what is the minimum number of clicks allowing a
user to view the daily deals being offered?”. Some as-
pects of such analysis tasks may be captured via tempo-
ral logics; evaluating temporal logic formulas (extended
with first order predicates to query the underlying data
[18]) with respect to the possible executions of data-
centric processes is the subject of extensive investiga-
tion [18,13]. There are, however, two important aspects
that should be addressed as part of the analysis, and are
absent from current solutions: analysis under hypothet-
ical scenarios, and analysis in presence of meta-data.
We next explain these two features which are the focus
of the present paper.
Hypothetical Reasoning The result of evaluating a
temporal logic formula with respect to a process for-
malism is typically boolean, indicating the existence or
inexistence of an execution that satisfies the property
specified by the formula (e.g. “can a user view a particu-
lar category”). However, to obtain a better understand-
ing of the application, an analyst may want to do more
than pose static analysis questions with respect to its
current specification. Indeed, she may also want to test
and explore the effect on analysis results of hypothetical
2 Daniel Deutch et al.
scenarios. E.g. if the answer to the above question is
“no”, then the analyst may wish to explore changes to
the applications that would lead to a positive answer.
The scenarios may involve different modifications to the
business logic (e.g., the application’s link structure) or
to the underlying product data. In realistic cases, there
may be many possible scenarios and their combinations
that are of interest. An analyst would like to interac-
tively explore the different combinations, refining them
according to the analysis results (e.g., gradually remov-
ing links and observing the effect).
Meta-data awareness An additional important fea-
ture absent from previous work is the analysis in pres-
ence of data from meta-domains such as cost (e.g., num-
ber of clicks), trust (e.g., confidence scores), or security
(e.g., access control/clearance levels). Such meta-data
is typically associated with the data as well as with the
control flow logic, and accounting for it is required as
part of analysis tasks such as those exemplified above
(“number of clicks” or “club membership” in the above
examples).
A provenance-based solution A main observation
that stands in the core of our research is that in the
context of relational queries, it has been shown that the
use of provenance annotations, based on the provenance
semiring approach, is effective for the two above chal-
lenges (hypothetical reasoning and accounting for meta-
data). It was shown in [16] that provenance annotations
enable such interactive exploration under combinations
of hypothetical scenarios, a technique referred to as
provisioning. This technique incorporates in the query
answer compactly stored provenance expressions corre-
sponding to the hypothetical scenarios. It was further
shown in [26,21,6] that it suffices to compute prove-
nance and then evaluate it (specialize it) in specific
meta-domains. While provenance was shown to be effec-
tive for these tasks in the context of relational queries,
provenance support for process analysis was not previ-
ously studied. Therefore, the main goal of this paper is
to develop the foundations of provenance tracking for
data-dependent process analysis.
Several conceptual and technical challenges need to
be addressed in this context. We review here these chal-
lenges and our contributions in addressing them.
First, one needs to design a formal model of the pro-
cess specification. As described above, for the former we
introduce a simple notion of DDPs (Data-Dependent
Process), specified by an FSM whose transitions are ei-
ther guarded by a yes/no query on a static underlying
database, or otherwise are considered as non-deterministic.
Intuitively, non-deterministic transitions correspond to
external effects such as user choices, interacting appli-
cations, etc. For database boolean query guards we es-
sentially allow an arbitrary class of queries, as long as
provenance support (captured by “oracle”) is present
for this class, and satisfies some natural desiderata. We
later show an instantiation of the query language with
the quite expressive positive relational algebra with ag-
gregates. Similarly, a formalism for specifying analysis
questions with respect to the process’s possible execu-
tions is required. Here we follow common practice and
use the Linear Temporal Logic (LTL) formalism [32].
LTL analysis of DDP specifications can determine if
certain properties of possible executions hold. We fo-
cus here on analysis of finite executions, following the
perspective used in workflow provenance (see e.g. [14]),
but there may still be infinitely many such executions
if the process logic involves loops.
To support meta-data from various meta-domains
(modeling e.g., cost, access control privileges etc.), as
well as provisioning for analysis under hypothetical sce-
narios, we further develop a provenance model for the
result of LTL formulas with respect to a DDP. The de-
velopment of the provenance model must account in
particular for the different kinds of transition choices
(depending on queries on the data or on external ef-
fects) and for the possibly infinitely many executions.
A starting point towards a solution is the semiring-
based model [26]. However, due to the above features,
the model cannot be used as is, and a novel construc-
tion is required. We start by allowing the separate an-
notation of data tuples and external choices, using two
different semirings. This separation allows us, for exam-
ple, to simultaneously do provisioning for data tuples
and provenance specialization for external choices. An
example of interest might be “what is the minimum
number of clicks to view daily deals, if the database is
modified in a particular way”. Then, we propose a novel
construction that, intuitively, allows capturing pairs of
provenance annotations. The two components of a pair
correspond to external choices and to query answers on
the underlying database (resp.). A novel construction
is needed because completely separating the two kinds
of provenance is not a good idea. In semiring prove-
nance alternatives are modeled by addition and joint
use is modeled by multiplication [26]. The semantics
of DDPs involves alternative paths. If we accumulate
the database provenance separately from the external
effects provenance we lose track of when both happen
along the same path. Therefore instead of just taking
the cartesian product of the two semirings, we need to
factor by certain algebraic congruences on pairs. The
construction is in fact a tensor product of two semirings.
This allows us to both greatly simplify the resulting
Provenance-Based Analysis of Data-Centric Processes 3
expressions and to validate semiring homomorphisms
that compute the meta-domain annotations in prove-
nance specialization. In addition, the new construction
that we develop has to accommodate the tracking of
provenance over possibly infinitely many different ex-
ecutions. This means using a semiring that supports
infinite sums (countable sums suffice). To this end we
construct the tensor product construction so that it ma-
nipulates infinite bags of pairs before factoring through
the desired algebraic congruences. The result is a closed
semiring [39] that supports directly the desired infinite
sums. We then define the provenance for LTL formulas
in this structure as the (possibly infinite) sum over all
the paths that realize the formula; along each path we
compute provenance as a multiplication of provenance
of the individual transitions.
Several properties of the framework are then of in-
terest, for it to be of practical use. We provide here
intuitive descriptions of these properties, which are for-
malized in the sequel.
Semantics properties. A first important prop-
erty of provenance tracking is that it faithfully extends
the semantics of the underlying computation as per-
formed while ignoring provenance. This property was
formalized as set-compatability for positive relational al-
gebra with aggregates in [6], meaning that for a suitable
choice of semiring for annotations (namely the boolean
semiring), the semantics for computing query result (in-
cluding provenance) coincides with the standard set-
semantics for queries. The counterpart here is compati-
bility with the usual boolean semantics of LTL for pro-
cess analysis. Second, we provide an efficient (given an
efficient oracle for the individual queries) algorithm for
computing, given a (provenance-aware) DDP specifica-
tion and an LTL formula, the provenance of the formula
with respect to the specification. The output of our al-
gorithm is a finitely described expression in the closed
tensor product structure that captures the provenance.
The algorithm leverages and combines two techniques:
one for computing the provenance of database queries,
and the second for computing a regular expression that
is equivalent to a given Finite State Machine. Third,
we consider commutation with homomorphism, which
is essential for the soundness of applying provenance to
provisioning and to specialization in meta-domains of
interest. This is because the provenance expressions can
be built using indeterminate parameters (variables) as
input annotations and then both hypothetical scenarios
and meta-domain assumptions correspond to valuations
of these parameters in specific semirings. These valu-
ations determine homomorphisms through which the
evaluation of provenance expressions is done.
We formalize these properties and show that they
hold for our construction, given that counterpart prop-
erties hold for the provenance oracle given for guard-
ing queries. Indeed, such counterparts have been shown
in previous work for expressive languages (specifically
positive relational algebra with aggregates), and so such
expressive languages may be incorporated in our model.
Implementation and experiments. We have im-
plemented our model and algorithms in the context
of a system prototype called PROPOLIS (PROvenance
for data-dependent PrOcess anaLysIS) [17]. PROPOLIS
allows analysts to define, in addition to an LTL for-
mula, annotations (from e.g. cost, trust, and access-
control meta-domains) on the process model. This cap-
tures hypothetical scenarios by parameterizing the pro-
cess specification (its logical flow and/or its underly-
ing database) at points of interest. Then, PROPOLIS
computes (offline) provenance expression for the given
LTL formula with respect to the process specification.
This expression is the compact result of evaluating the
LTL formula with respect to applying all possible com-
binations of specified scenarios (parameters values) to
the process specification. The provenance expression is
passed to a module which allows for rapid exploration
of scenarios as well as provenance specialization, using
the provisioned expression while avoiding further costly
access to the database or costly reevaluation of the LTL
formula. We provide a detailed description of the sys-
tem, accounting for challenges in implementation and
different optimizations. We have further used PROPOLIS
to conduct an extensive experimental study, showing its
effectiveness and scalability.
2 DDPs and Their Analysis
We next define our model for data-dependent processes
(DDPs) and the temporal formalism used in its analy-
sis. Provenance will be introduced in Section 3.
2.1 Data-dependent Processes
We consider a simple model for DDP specifications. The
logical flow is specified by a state machine in which
some transitioning is governed by boolean queries in a
class Q over an underlying Database of schema D. We
will later explain how analysts are able to examine the
effects of changes to the current database state as well
as to the state machine.
We keep the class of queries Q abstract for now, and
later (Section 5) discuss a particular query language of
interest (namely positive relational algebra with aggre-
gates). Our examples use SQL syntax for simplicity.
4 Daniel Deutch et al.
Definition 1 A Data-Dependent Process (DDP) spec-
ification is a tuple (V,E, Vq, Ve, Fq, D) such that (V,E)
is a directed graph referred to as the DDP state ma-
chine, with a distinguished “initial” node vi ∈ Vq⋃Ve
having no ingoing edges, Vq, Ve ⊆ V are subsets of nodes
referred to as query nodes and external effect nodes re-
spectively, such that Vq⋂Ve = φ and Vq
⋃Ve = V . Ev-
ery node vq ∈ Vq has exactly two outgoing edges, and
the end nodes of these edges are denoted true(vq) and
false(vq). Fq : Vq 7→ Q maps query nodes to queries
deciding their transitioning. Finally, D is a database
over the schema D.
Homepage
Newproducts
Cat.
SubCat.
Product
ShoppingCart
DailyDeals
Payment
Exit
PayExit
5
2
5
3
2
if Q1 = 0
if Q1 6= 0 22
3
2 2
2
if Q2 = 0
if Q2 6= 0
Fig. 1: Data-Driven Process
Example 1 Consider the (partial) process logic in Fig-
ure 1. Each node intuitively stands for a web page, and
edges model links. The initial node is HomePage. Some
transitioning is based on the user decision, such as the
one from “Shopping Cart”, depending on the user nav-
igation choice (ignore for now the numbers annotat-
ing some transitions). Other transition choices depend
on the underlying database: for instance the transitionfrom ”Cat.” (standing for a page where the user chooses
a category) to a “SubCat.” page (where the user is
presented a set of sub-categories) or to a “Product”
page (listing relevant products) depends on availability
of sub-categories, modeled as the truth value for the
boolean query “Q1 = 0” (checking for inexistence of
a sub-category of available categories). Q1 is given in
Fig. 3, with respect to the underlying database whose
fragment is given in Fig. 2 (ignore for now the Prov.
column).
A valid execution of a DDP follows query results for
transitioning out of query nodes, while performing ar-
bitrary choices for other nodes. We limit our attention
to finite executions. To simplify definitions, we also as-
sume that an execution terminates in a node from which
there is no outgoing edge (such as Exit and PayExit
in the running example).
Definition 2 (Executions) An execution of a DDP
(V,E, Vq, Ve, Fq, D) is a finite path [v1, v2, ..., vn] in the
AvailableCatCat. Prov.· · · · · ·Cell Phones d1Computers d2Fashion d3· · · · · ·
(a)
PaymentSystemsCat. Prov.· · · · · ·PayPal d5· · · · · ·
(b)
CategoryHierarchyCat SubCat. Prov.· · · · · · · · ·Cell Phones Smartphones d4· · · · · · · · ·
(c)
Fig. 2: Underlying Database
Q1 :SELECT COUNT(*)
FROM CategoryHierarcy CH,
AvailableCat AC
WHERE CH.Cat = AC.Cat
Q2:SELECT COUNT(*)
FROM PaymentSystems PS
Fig. 3: SQL Queries
graph (V,E) such that vn has no outgoing edges, for
every vi ∈ Vq, if the query Fq(vi) is satisfied by the
database D then vi+1 = true(vi) and otherwise vi+1 =
false(vi).
Example 2 Reconsider the DDP specification in Exam-
ple 1. Consider the path P = [Homepage, Cat, SubCat,
Exit], intuitively corresponding to a user making a choice
of a category, then of a sub-category, then exiting with-
out completing a purchase. Cat. is the only query node
(i.e. Cat. ∈ Vq) in this execution, associated with the
query Q1 = 0. Since the current state of D does not
satisfy Q1 = 0, the node that follows Cat must be
SubCat, so P is an execution of the specification but
e.g. P ′ = [HomePage, Cat, Product, ...] is not.
Note that a DDP may admit infinitely many (finite)
executions, if its FSM includes cycles. We next consider
LTL as a formalism that allows to specify properties of
interest with respect to such executions.
2.2 Analysis Formalism
We revisit the syntax of LTL formulas [32], as well as
their semantics with respect to finite process executions
[23] 1. We follow common practice [23] of considering
for finite runs only the next-free fragment of LTL (we
may also account for the next operator, with particular
1 The semantics of LTL is more commonly defined with re-spect to infinite executions; however from a provenance per-spective we will only be interested in finite prefixes, as is alsothe case in [23] that works directly with traces.
Provenance-Based Analysis of Data-Centric Processes 5
choice of semantics for the last state). Formulas are
built up from a finite set of propositional variables P ,
the connectives ¬ and ∨, and the modal operator U
(until). I.e. if p ∈ P then p is an LTL formula, and
for every two LTL formulas f1,f2, it holds that f1∨f2,
¬f1, and f1Uf2 are LTL formulas.
The semantics of LTL formulas defines their satis-
faction with respect to an execution with the assump-
tion that each state can be tested for satisfaction of a
propositional variable in P . In our examples P simply
corresponds to the state names, and a state only satis-
fies the variable corresponding to its name.
Definition 3 [23] Let φ be an LTL formula over a set
of propositional variables P . Let e = v0, v1, ..., vn be an
execution and for 0 ≤ i ≤ n let ei = vi, vi+1, ...vn. We
say that e |= φ if
– φ is a propositional variable p and the state v0 sat-
isfies p.
– φ = ¬φ1 for some LTL formula φ1 and e 6|= φ1.
– φ = φ1 ∨ φ2 for some LTL formulas φ1 and φ2 such
that e |= φ1 or e |= φ2.
– φ = φ1Uφ2 and there exists some 0 ≤ i ≤ n such
that ei |= φ2, and for all 0 ≤ k < i, ek |= φ1.
The basic modal operator U allow the definition of
many derived operators such as “Finally” (Fφ), “Be-
fore” (φ1Bφ2), or “Globally” (Gφ). Fφ states that φ
holds eventually, and it can be written as trueUφ. The
formula φ1Bφ2 states that φ1 holds before φ2 and can
be written as ¬φ2Uφ1; and Gφ states that φ holds
throughout the execution, and can be written as ¬F (¬φ).
Example 3 An analyst of our running example DDP
may be interested in various execution properties, such
as whether the execution involves “A user exiting with-
out viewing the daily deals”. This is captured by the
LTL formula (Exit OR PayExit) B DailyDeals. Other
properties that may be of interest to the analyst are “A
user views product sub-categories for some category”,
“A user views proposed daily deals, and later proceeds
to the payment page”, or “A user never views the daily
deals”. These properties can be captured by the LTL
formulas F SubCat., F (DailyDeals ∧ F Payment), and
G (¬DailyDeals) respectively.
Note that an LTL formula expresses a property of
a given execution, while in general we are interested
in evaluating the formula with respect to the (possibly
infinitely many) possible executions of a given DDP. We
next define the (boolean) semantics of such evaluation.
Definition 4 Given a DDP s and a LTL formula φ, we
say that s |= φ if there exists an execution e of s such
that e |= φ.
The above definition is essentially restricted to ask-
ing for the existence of a path satisfying the formula,
and does not account for neither data in meta-domains
nor hypothetical reasoning. To this end we present a
provenance model. The boolean semantics will serve as
a yardstick in our development: we will generalize it.
3 Provenance Model
We present in this section a semiring-based provenance
model for the LTL-based analysis of DDPs. We first
recall some basic algebraic notions required for our de-
velopments, then define provenance-aware DDPs and a
provenance construction for their executions.
3.1 Semirings and homomorphisms
A commutative monoid is an algebraic structure
(M,+M, 0
M) where +
Mis an associative and commuta-
tive binary operation and 0M
is an identity for +M
.
A monoid homomorphism is a mapping h : M →M ′
where M,M ′ are monoids, and h(0M
) = 0M′ ,h(a +M
b) = h(a) +M ′ h(b). A semiring homomorphism is a
mapping h : K → K ′ where K,K ′ are semirings, and
h(0K
) = 0K′ , h(1
K) = 1
K′ ,h(a +K b) = h(a) +K′ h(b),
h(a ·K b) = h(a) ·K′ h(b).
We will consider database operations on relations
whose tuples are annotated with elements from commu-
tative semirings. These are structures (K,+K, ·
K, 0
K, 1
K)
where (K,+K, 0
K) and (K, ·
K, 1
K) are commutative monoids,
·K
is distributive over +K
, and a ·K
0K
= 0 ·Ka = 0
K.
Example 4 Examples, which we will use in the sequel,
include the boolean semiring ({true, false},∨,∧, false, true)and the tropical semiring (N∞,min,+,∞, 0) (also termed
a cost semiring) where N∞ includes all natural numbers
as well as∞. Given a set X of provenance tokens which
correspond to “atomic” provenance information, e.g.,
tuple identifiers, (N[X],+, ·, 0, 1) is the semiring of poly-
nomials with natural coefficients. For x1, x2, x3 ∈ X,
2x1 + x2 · x3 is an example of an element in the semir-
ing. (N[X],+, ·, 0, 1) was shown in [26] to most generally
capture provenance for positive relational queries and
is thus termed the “provenance semiring”.
3.2 Provenance-Aware DDPs
We next define the notion of a Provenance-Aware DDP
(PADDP for short). We propose to use two semirings
for annotations, accounting for the inherently different
types of transitioning. A first semiring will be used to
6 Daniel Deutch et al.
annotate the database tuples, following [26]; a second
semiring will be used for annotation of external effects.
Definition 5 A Provenance-Aware DDP (PADDP) is
a tuple (S,Kext,Kdata, Aext, Adata) where S is a DDP,
Kext,Kdata are commutative semirings,Aext maps tran-
sitions out of external effects nodes of S to elements of
Kext; Adata maps tuples of the underlying database of
S to elements of Kdata.
For any two commutative semirings K,K ′ we will
use the term (K,K’)-PADDP for a PADDP whose an-
notations for external effect are elements of K and an-
notations for data are elements of K ′. We have not
defined yet the way that these annotations propagate
through executions, but we can already exemplify their
use based on the intuitive correspondence of the semir-
ing + operation with alternative use (of data / transi-
tions), and of the semiring · operation with joint use.
Example 5 Re-consider the running example and sup-
pose that the analyst is interested in questions of the
flavor “What is the minimum user effort2 required for
a user to view the daily deals and later proceed to the
“Payment” page”. We have already shown how this con-
straint on executions can be expressed in LTL, but we
need to also capture user effort. This can be done by
choosing Kext to be the tropical semiring (identifying
“cost” here with user effort), and using Aext to map
transitions to natural numbers corresponding to user
effort associated with them (see weights next to tran-
sitions in Figure 1). Since multiplication in the tropi-
cal semiring corresponds to natural numbers addition,
weights of joint choices along an execution are summed
up; and since addition corresponds to min, provenance
for multiple executions (e.g. all those satisfying the con-
straint that daily deals are viewed) captures their small-
est weight. We will show concrete examples for such
weights in the examples to follow.
Furthermore, the analyst may be interested in per-
forming the analysis assuming the database will change,
e.g. some products or categories are no longer available.
This can be captured by associating provenance with
the database tuples. Specifically, we can use Kdata =
N[D], the semiring of polynomials over the set of prove-
nance tokens D = {d1, ..., d5}. The annotation function
Adata is given as the Prov. column in Figure 2. Intu-
itively, D may be considered as indeterminates, and hy-
pothetical scenarios will be modeled in the sequel via
truth assignments to them. We will show concrete hy-
pothetical scenarios in Example 13.
2 where user effort is some quantification associated withdifferent actions such as following a link, filling in a text boxetc.
We now start defining provenance: first for individ-
ual queries (in an abstract way), then for executions,
and finally for a possibly infinite set of executions con-
forming to a given LTL formula.
3.3 Provenance for Individual Queries
Recall that we have abstracted away the choice of boolean
query language Q. We next define in an abstract way
a notion of a (semiring-based) “provenance oracle” for
queries in Q, specifying desiderata with respect to this
oracle that will allow for provenance tracking for PAD-
DPs. A concrete such oracle will be shown in Section
5.
Semiring-Annotated Relations. We start by recalling
the notion of semiring-annotated databases [26]. We use
the named perspective [1] of the relational model in
which tuples are functions t : U → D with U a finite
set of attributes and D a domain of values. We fix the
domain D and denote the set of all such U -tuples by
U -Tup. (Usual) relations over U are subsets of U -Tup.
Annotations are then modeled by a function from all
possible tuples to elements of a semiring K, with those
tuples not considered to be “in” the relation tagged
with the 0 value of the semiring.
Definition 6 Let K be a semiring. A K-relation over
a finite set of attributes U is a function R : U -Tup →K such that its support defined by supp(R)
def= {t |
R(t) 6= 0} is finite. We say that R is annotated by ele-
ments of K. A K-database is a collection of K-relations.
Example 6 Different semirings may be used for annota-
tion, capturing meta-data of different domains. Perhaps
the simplest example is standard (set-theoretic) rela-
tions, carrying no meta-data, which may be regarded
as relations with annotations from the boolean semir-
ing, with all tuples present in the relations annotated
by true (tuples mapped to “false” are not part of the
support). Similarly, annotations from the cost semir-
ing may be used to associate an access cost with every
tuple, and the provenance semiring is typically used
to associate an “abstract tag” from some domain with
each tuple (which will be further used in the tracking
of provenance for query results, see following discus-
sion). Such an annotated relation, with basic annota-
tions d1, d2, ... is shown in Figure 2 (in Section 3).
Provenance for guarding queries. Let Q be a (rela-
tional) boolean query language, and let ProvOracleK be
a “provenance oracle” (parameterized by a commuta-
tive semiringK) for queries in Q, i.e. ProvOracleK : Q×Kdb 7→ K where Kdb is the class of all K-databases;
Provenance-Based Analysis of Data-Centric Processes 7
and K is a commutative semiring, linked to K through
the notion of homomorphism lift, namely for every ho-
momorphism h : K 7→ K ′ there is a “lift” h : K 7→ K ′.
The specification of this lift, as well as the definition
of how to obtain K from an arbitrary semiring K, is
part of the oracle description. The simplest case is of
course when K = K (and then h = h) but the above
definition allows for further flexibility in the choice of
the “output” semiring.
Three desiderata naturally arise with respect to
ProvOracle and will be important at different stages
of our developments (as will be made explicit below).
The first is set-compatibility which specifies that the se-
mantics of ProvOracleB on B-databases (i.e. standard
set-databases) coincides with the standard set seman-
tics (i.e. is true if and only if the query is satisfied by
the underlying database).
The second is for complexity analysis purposes, and
states that ProvOracleK(Q,D) may be computed in
polynomial time in D for every Q,D,K (and then we
say that ProvOracle is polynomial-time computable).
The third desiderata is commutation with homomor-
phism:
Definition 7 We say that ProvOracle commutes with
homomorphism if for every two semirings K,K ′, every
homomorphism h : K 7→ K ′, every query q in Q and
every K-database D, we have that
ProvOracleK(Q, h(D)) = h(ProvOracleK(Q,D)), where
h(D) is the K ′-database obtained by applying h to ev-
ery annotation from K appearing in D.
As we will show in the sequel, commutation of ho-
momorphism is crucial for hypothetical reasoning, as
hypothetical scenarios may be captured by such homo-
morphisms.
We next introduce a provenance structure that can
accommodate provenance for PADDP executions, given
such an oracle for computing provenance for the indi-
vidual queries associated with transitions.
Notations. The model and construction of prove-
nance for PADDPs execution are then naturally pa-
rameterized by the provenance model and construc-
tion for individual queries annotating its transitions
(i.e. by the ProvOracle). Where clear from context, we
omit ProvOracle from the notations. Furthermore, for
clarity of the examples, we will use expressions of the
sort [Q = 0] or [Q 6= 0] (in addition to the semiring
constants 0 and 1) to denote the provenance of such
boolean queries (comparing the value computed by Q
to 0, in these examples). For now these should only
be considered as symbols which are assumed to be ele-
ments of ˆKdata (the provenance semiring with respect
to “data semiring” of the PADDP); a “semantics” for
these expressions as well as the precise structure ac-
commodating them will be given in Section 5.
3.4 Provenance for PADDP Executions
We start by presenting the construction in a fairly gen-
eral way and will then show how to use it in our context.
A main challenge here is the existence of two, possibly
distinct, semirings – one for the user choices and one
for the result of queries (as computed by a given or-
acle), that need to be combined while accounting for
possibly infinitely many executions. To this end, we will
show that for every two commutative semirings K and
L we can define a tensor product-like structure with
additional closure properties that will be required. This
structure will be introduced in two steps. First, we will
construct a semiring whose elements are possibly infi-
nite bags of elements from K×L. Intuitively each such
pair includes the provenance of a single execution: the
first element of the pair includes the “data” portion of
the provenance and the second one includes the “exter-
nal effect” portion. A bag of such pairs then stands for
the provenance of multiple (possibly infinitely many)
executions. We will show that the obtained structure is
commutative and closed, thus suitable for provenance
tracking, and indeed will use it for defining provenance
for LTL (Sec. 3.5). In the second step (Sec. 3.6) we
will introduce congruences to the structure, that will
be crucial for effective provenance tracking. The result-
ing structure will be called K ⊗ L.
The construction details now follow.
Bags of pairs We start our construction with a sim-
ple step, introducing the set of pairs over items of K
and L, denoted K ×L. Abusing notation we denote its
elements k⊗ l as well as 〈k, l〉 invariably. The intuition
is that a single pair k⊗ l will capture the provenance of
an entire execution: k will capture the “external prove-
nance” of the execution and l will capture its “data
provenance”.
A significant difficulty that is addressed lies in cor-
rectly managing provenance for possibly infinitely many
alternatively executions, accounting for operations on
these pairs. To this end, we consider the set Bag(K×L)
of possibly infinite bags of such pairs. There are two
useful ways of working with bags. One is to consider
them as N∞-valued functions. The other is to observe
that bags, with bag union and the empty bag, form a
commutative monoid whose elements are uniquely rep-
resentable as infinite sums of singleton bags. In this
second perspective, if we abuse notation and denote
singleton bags by the unique element they contain we
8 Daniel Deutch et al.
can write each bag of pairs from K × L as∑i∈I
ki ⊗ li
where i ranges over an infinite set of indices I. More-
over, we can always rename the indices without loss of
generality such that when we work with several bags,
the sets of indices used in their sum representations are
pairwise disjoint.
It will be convenient to already denote bag union by
+K⊗L and the empty bag by 0
K⊗L . That is,
(∑i∈I
ki ⊗ li) +K⊗L (
∑j∈J
kj ⊗ lj) =∑h∈I∪J
kh ⊗ lh
0K⊗L =
∑i∈∅
ki ⊗ li
Recalling the intuitive correspondence between sum-
mation and alternative use, a bag (sum) of pairs corre-
sponds to alternative paths.
Now we define
(∑i∈I
ki⊗li)·K⊗L(∑j∈J
kj⊗lj) =∑
{i∈I}×{j∈J}
(ki·Kkj)⊗(li·L lj)
This is again consistent with the intuition of mul-
tiplication as joint use and summation as alternatives:
following one of the alternatives of the first bag com-
bined with one of the alternatives of the second bag,
corresponds exactly to all alternatives obtained by such
combinations
We have defined a mathematical structure to cap-
ture bags of pairs. For this structure to be consistent
with our intuition of summation capturing alternatives
and product capturing joint use, they should follow the
axioms of commutative semiring. Indeed, we next show
that this is the case.
Proposition 1 (Bag(K ×L),+K⊗L , ·K⊗L , 0K⊗L , 1K ⊗ 1
L)
is a commutative semiring.
Proof
– For b1, b2, b3 ∈ Bag(K × L) and by properties of
bag union we immediately have that b1 + b2 = b2 +
b1 (symmetry of bag union), b1 + 0 = b1 (union
with the empty bag), (b1 + b2) + b3 = b1 + (b2 + b3)
(associativity of bag union).
– For b1, b2 ∈ Bag(K × L) we observe that b1 · b2 =
b2 ·b1, due to the fact ki ·kj = kj ·ki and li · lj = lj · lifor ki, kj ∈ K and li, lj ∈ L (due to the fact that
K,L are commutative semirings). We further have
that (1K⊗1
L) is the neutral element with respect to
multiplication, i.e. b1 ·(1K⊗1L
) = b1, since every pair
element of the bag has its K attribute multiplied by
the neutral 1K
and its L attribute multiplied by 1L
.
– For b1, b2, b3 ∈ Bag(K × L) we have (b1 + b2) · b3 =∑{i∈b1+b2}×{j∈b3}(ki·Kkj)⊗(li·L lj) =
∑{i∈b1}×{j∈b3}(ki·K
kj)⊗ (li ·L lj) +∑{i∈b2}×{j∈b3}(ki ·K kj)⊗ (li ·L lj) =
b1 · b3 + b2 · b3.
Overloading notation we will simply use Bag(K×L)
to denote (Bag(K × L),+K⊗L , ·K⊗L , 0K⊗L , 1K ⊗ 1
L).
Next, we must make sure that the semiring “behaves
properly” with respect to infinite sums. The latter will
intuitively be necessary as we use the structure to cap-
ture possibly infinitely many executions.
We recall the definition of a closed semiring:
Definition 8 [39] A semiring K is closed if for all ele-
ments ofK it holds that (1)∑i∈φ xi = 0, (2)
∑i∈{k} xi =
xk, (3) If there is {Ij} such that I =⋃j∈J Ij is a dis-
joint partition then∑i∈I xi =
∑j∈J(
∑i∈Ij xi) and (4)
b · (∑i ai) =
∑i b · ai = (
∑i ai) · b
The following proposition guarantees that infinite
sums in the structure interact in the “expected” way
with semiring operations. In particular, this means that
they preserve the intuition of joint and alternative use
in executions. The following proposition will guarantee
the above, and will also be the basis for defining sim-
plifications of expressions in the sequel.
Proposition 2 For every two commutative semirings
K and L, it holds that Bag(K×L) is a closed semiring.
Proof We show a stronger result: Bag(K×L) is in fact
ω-continuous. Every ω-continuous semiring is closed (see
[39]). We first recall the property of ω-continuity. Given
a semiring K, define the binary relation v such that
a v b if and only if ∃c ∈ K.a + c = d. We say that
K is ω-continuous if (1) v is a partial order, (2) every
infinite chain a1 v a2... has a suprimum (least upper
bound), and (3) for every semiring element a we have
a+ supai = sup(a+ ai) and a · supai = sup(a · ai).We then note that a partial order relation in (Bag(K×
L),+K⊗L , ·K⊗L , 0K⊗L , 1K ⊗ 1
L) is bag inclusion, namely
a v b if a is bag-included in b. We then show:
1. v is a partial order relation. Observe that (1) a v a,
(2) if a v b and b v a then a = b and (3) if a v b
and b v c then a v c. These claims all follow from
the definition of bag inclusion, and the partial order
relation ’≤’ on N∞.
2. Let a1 v a2 v a3.... be an infinite chain. We claim
that the chain suprimum is an element of Bag(K ×L). To this end, we note that every bag ai on a
domain D (in our case D is a domain of pairs) can
be represented as a function fi : D 7→ N∞, where
fi(d) is the number of occurrences of d in the bag
ai. Now since N∞ is ω-continuous, there exists a
Provenance-Based Analysis of Data-Centric Processes 9
suprimum to f1(d) ≤ f2(d) ≤ ... (the inequalities
follow from the definition of bag inclusion), denoted
sup(d). Let S be the bag defined by the function
f : D 7→ N∞, f(d) = sup(d). S is a suprimum for
{a1, a2, ...}, denoted supiai.
3. Let a ∈ Bag(K × L), then a is a possibly infinite
bag defined by a function fa : D 7→ N∞. Now a +
supiai is, by definition, captured by the function
f ′ = fa + f where f is the function defined in item
(2) of this proof. Now, supi(a+ai) is captured by f ′′
such that f ′′(d) is the suprimum of fa(d) + f1(d) ≤fa(d)+f2(d) ≤ .... Since N∞ is ω-continuous (and in
particular using the additive identity for suprimum
for N∞), we have that f ′′ = f ′.
4. Let a ∈ Bag(K ×L) and let fa as in item (3). Now
a · supiai is the bag defined by
f ′(d) =∑{x1,x2|d=x1·x2} fa(x1) · f(x2) (by x1 · x2
we mean pointwise multiplication of the pairs x1,
x2 and f is the function capturing the bag that is
the suprimum of {ai}). For supia · ai we get a bag
captured by f ′′(d) = supi(∑{x1,x2|d=x1·x2} fa(x1) ·
fi(x2)). Let (x11, x12)...(xn1 , x
n2 ) be all pairs satisfying
x1 · x2 = d. We have:
f ′′(d) = supi{fa(x11) · fi(x21) + ...+ fa(xn1 ) · fi(xn2 )}f ′′(d) = supi{fa(x11) · fi(x21)} + ... + supi{fa(xn1 ) ·fi(x
n2 )}.
Then by repeated application of the multiplicative
identity that follows from the ω-continuity of N∞,
we get that we can push the supi to be applied
only on elements of the sort fi(x). We get f ′′(d) =
fa(x11) · f(x21) + ... + fa(xn1 ) · fi(xn2 ). We thus get
f ′′(p) = f ′(p), as needed.
We obtain that Bag(K × L) is ω-continuous, and
thus is closed.
We are now ready to define provenance for execu-
tions of PADDPs. Provenance for a transition outgoing
an external choice node is annotated with a singleton
bag {〈k, 1Kdata〉} where k ∈ Kext is the provenance as-
sociated with this transition according to the PADDP
specification; and 1Kdatais the neutral value with re-
spect to multiplication in Kdata. Intuitively since there
is no effect with respect to data provenance. Similarly
provenance for a transition out of a query node is de-
fined as {〈1Kext, k′〉} where k′ ∈ Kdata is the annota-
tion (as defined by the oracle ProvOraclefor the corre-
sponding query with respect to the underlying anno-
tated database.
Given a transition t of a PADDP we use Prov(t)
to denote the transition provenance according to the
above. We then define the provenance of an execution
path to be the multiplication of provenance expressions
associated with its transitions.
Definition 9 Given a PADDP execution
e = (v0, v1, ..., vn), the provenance of e, denoted Prov(e) ∈Bag(Kext×Kdata), is defined as
∏(vi,vi+1)∈e Prov((vi, vi+1)).
Note that the provenance of an execution involves
multiplication of bags in Bag(Kext×Kdata). This allows
to “mix” annotations of Kdata and annotations in Kext
in the same expression. To simplify notations, we will
identify 〈k, k′〉 with the singleton bag {〈k, k′〉}.
Example 7 Reconsider the PADDP of Example 5, the
provenance of the path [Home page, Cat., Sub Cat.,
Exit] is
〈5, 1〉 · 〈0, [Q1 6= 0]〉 · 〈2, 1〉
Where [Q1 6= 0] should currently be interpreted
only as a symbol standing for the provenance of Q1
not being equal 0 (and will later be replaced with a
concrete construction for queries). Each pair element
in this multiplication corresponds to a single transi-
tion, and they are multiplied (multiplication is done in
Bag(Kext ×Kdata)) since they are used together in an
execution. For example, the item 〈5, 1〉 stems from the
(external effect) transition from HomePage to Cat and
is shorthand to the singleton bag containing of a pair
whose first element is the natural number 5 in the trop-
ical semiring (signaling user cost of 5), and its second
element 1 is the neutral with respect to multiplication
in Kdata = N[D]. The item 〈0, [Q1 6= 0]〉 originates in
the transition from Cat. to Sub Cat. depending on the
underlying database; note that the natural number 0 is
the neutral with respect to multiplication of the tropical
semiring.
Finally, we can use the definition of bag multipli-cation above to simplify the expression by perform-
ing “point-wise multiplication”, to obtain (·T and ·N[D]
stand for multiplications in the tropical and N[D] semir-
ings, respectively).
〈5 ·T 0 ·T 2, 1 ·N[D] [Q1 6= 0] ·N[D] 1〉 ≡ 〈7, [Q1 6= 0]〉
The last equivalence is due to multiplication in the
tropical semiring corresponding to natural number ad-
ditions. Intuitively, the accumulated cost (“user effort”)
for this path is 7 and the accumulated condition with
respect to the database is [Q1 6= 0].
3.5 Provenance for LTL
Building on the fact that Bag(Kext×Kdata) is a closed
semiring, we next define provenance for an LTL formula
with respect to a PADDP, as the (possibly infinite) sum
of provenances of executions conforming to the formula.
10 Daniel Deutch et al.
Let exec(S) denote the (possibly infinite) set of (finite)
executions of a PADDP S. We define (this definition is
of course parameterized by the choice of ProvOracle):
Definition 10 Given a PADDP S, a provenance oracle
ProvOracle for its guarding queries and an LTL formula
f , we define the result of evaluating f on S (denoted
fProvOracle(S)) as∑{e∈Paths(S)|e|=f} prov(e).
Whenever the particular choice of provenance oracle
is irrelevant to the discussion or clear from context we
omit it from the notation and simply write f(S).
Definition 10 does not give an explicit way of repre-
senting the infinite sums. To this end, we introduce the
Kleene star operation:
a∗ = 1 + a+ a2 + ...
And note that it is well-defined in closed semirings.
Example 8 In the tropical semiring, since multiplica-
tion corresponds to natural number addition, addition
corresponds to min and the tropical 1 is the natural
number 0 (i.e. less or equal to any other number in the
semiring) we get a∗ = 0 (the tropical 1, i.e. the nat-
ural number 0). For the Boolean semiring, we will get
a∗ = true ∨ a ∨ a... = true ∨ a = true.
Proposition 3 For any PADDP S and LTL formula
f , f(S) may be represented as a finite starred expres-
sion.
This proposition will follow from the correctness of
the algorithm for provenance generation, presented in
Section 4. We can already note that intuitively, the ob-
tained starred expression corresponds to a regular ex-pression over the annotation pairs, capturing exactly
those paths that satisfy the LTL formula. Based on this
intuition we exemplify the obtained expressions.
Example 9 Reconsider the LTL formula F (DailyDeals
∧ F Payment) and the running example PADDP. Intu-
itively, to represent alternative paths (alternative ways
of “realizing” the LTL property) we use sum of pairs
where each pair captures the provenance of a single
path. Also, using the introduced axioms, we may in
fact generate sub-expressions for multiple partial exe-
cutions, and then combine them. For instance, observe
that all executions reaching DailyDeals start by reach-
ing Cat. The two partial executions reaching Cat have,
together, as provenance, the sum 〈5, 1〉 + 〈7, 1〉. This
expression may then be multiplied by an expression
capturing the provenance of reaching DailyDeals (and
then Payment) from Cat. The latter will involve star
to capture possible loops. We obtain:((〈5, 1〉+ 〈7, 1〉
)·(〈2, 1〉 · 〈0, [Q1 6= 0]〉+
〈0, [Q1 = 0]〉)· 〈3, 1〉 · 〈2, 1〉
)∗·(〈5, 1〉+ 〈7, 1〉
)·(〈2, 1〉 ·
〈0, [Q1 6= 0]〉 + 〈0, [Q1 = 0]〉)· 〈3, 1〉 · 〈2, 1〉 · 〈3, 1〉 ·(
〈0, [Q2 6= 0]〉+ 〈0, [Q2 = 0]〉)
The obtained expression is quite long but we can al-
ready simplify it using the axioms of our structure.
In particular, we have seen before that we can sim-
plify multiplication expression, to get e.g. 〈3, 1〉 · 〈2, 1〉 ·〈3, 1〉 = 〈8, 1〉. Further simplifications lead to:((〈5, 1〉 + 〈7, 1〉
)·(〈7, [Q1 6= 0]〉 + 〈5, [Q1 = 0]〉
))∗·(
〈5, 1〉+ 〈7, 1〉)·(〈10, [Q1 6= 0]〉+ 〈8, [Q1 = 0]〉
)·(
〈0, [Q2 6= 0]〉+ 〈0, [Q2 = 0]〉)
In this expression, every pair represents “joint” prove-
nance terms in the two domains (Kext being tropical,
and Kdata being N[D]). For instance, 〈5, 1〉 intuitively
means a cost of 5 and no dependency on data. A sum
of such pairs reflects alternative paths, e.g. the sub-
expression (〈5, 1〉 + 〈7, 1〉) corresponds to the two op-
tions of following a transition with cost 5 or following
one with cost 7 (both with no dependency on data). A
product of such sub-terms corresponds to joint use (i.e.
in conjunction with taking either of these transitions,
we also continue the execution). Finally, Kleene star is
applied to provenance of sub-executions appearing in a
loop.
The expression that we get is still quite complex but
we will show later (Example 10), that by introducing
congruence axioms, we can further simplify it.
3.6 Introducing a Congruence
We have defined provenance for LTL formula evaluated
over PADDPs, but observed that their representation
may become quite complex. There are certain equiva-
lence axioms that are “natural” in this setting. For in-
stance if the same data provenance is used repeatedly
in multiple execution paths, one expects to be able to
write an equivalent expression where it appears only
once.
To allow for simplifications, we need to identify el-
ements of Bag(K ×L) with other, “simpler” elements.
This is done via the introduction of a congruence rela-
tion. Since we are dealing with possibly infinite bags,
we need to consider inifinitary congruences:
Definition 11 An equivalence relation ≡ is an inifini-
tary congruence if it satisfies:
– k1 ≡ k3, k2 ≡ k4 =⇒ k1 · k2 ≡ k3 · k4– ∀i ai ∼ bi ⇒ Σiai ∼ Σibi.
Where the second item applies to both finite and infi-
nite sums.
Provenance-Based Analysis of Data-Centric Processes 11
Next, let ∼ be the smallest inifinitary congruence on
Bag(K×L) with respect to +K⊗L and ·
K⊗L that contains
(for all k, k′, l, l′):
(k +Kk′)⊗ l ∼ k ⊗ l +
K⊗L k′ ⊗ l
0K⊗ l ∼ 0
K⊗L
k ⊗ (l +Ll′) ∼ k ⊗ l +
K⊗L k ⊗ l′
k ⊗ 0L∼ 0
K⊗L
We denote by K ⊗ L the set of congruence classes
of bags of pairs modulo ∼ and by 1K⊗L the congruence
class of 1K⊗ 1
L. As usual when we take the quotient by
a congruence the result, (K ⊗ L,+K⊗L , ·K⊗L , 0K⊗L , 1K⊗L),
which we will also denote K⊗L overloading notations,
is also a commutative semiring.
When we define provenance for LTL queries in K⊗Lall constructions and results go through, and in addition
we can perform significant expression simplifications.
Example 10 Reconsider the provenance expression ob-
tained in Example 9. Note that the introduced equiv-
alence axioms allow for simplifications that were not
possible so far. For instance, we have the sub-expression
〈5, 1〉+ 〈7, 1〉, intuitively corresponding to two sub
-executions that involve the same dependency on data
(in this case, no dependency) with different costs. Using
the congruence relation, this expression is equivalent to
〈5+7, 1〉 = 〈5, 1〉. Intuitively we have “factored out” the
common dependency on data, and computed the min-
imal cost out of the two options with respect to user
effort.
Via repeated applications of the congruence relation
we may perform further partial computations, and ob-
tain:(〈5 + 7, 1〉 ·
(〈7, [Q1 6= 0]〉+ 〈5, [Q1 = 0]〉
))∗· 〈5 + 7, 1〉 ·(
〈10, [Q1 6= 0]〉+ 〈8, [Q1 = 0]〉)·(
〈0, [Q2 6= 0]〉+ 〈0, [Q2 = 0]〉)
=(〈12, [Q1 6= 0]〉 + 〈10, [Q1 = 0]〉
)∗ · (〈15, [Q1 6= 0]〉 +
〈13, [Q1 = 0]〉)·(〈0, [Q2 6= 0] + [Q2 = 0]〉
)An important property of the construction thus far
was that the obtained semiring was closed. We can show
that this property still holds for the structure obtained
by taking the quotient by a congruence, and in partic-
ular:
Proposition 4 For every two semirings K,L, it holds
that K ⊗ L is a closed semiring.
Proof Recall that Bag(K×L) is a closed semiring. We
obtained K ⊗ L from Bag(K × L) by adding a con-
gruence, that is, all equalities held on Bag(K ×L) also
holds on K × L. In addition, the definition of closed
semiring consists only of equalities. Therefore K ⊗L is
also a closed semiring.
Intuitively, this means that we can use K ⊗ L to
track both K-annotations and L-annotations; by keep-
ing all K-annotations as 1K we can track the
L-annotations, and by keeping all L-annotations as 1Lwe can track the K-annotations.
4 Properties of the construction
We next analyze the properties of the construction, and
show a correspondence between the desiderata imposed
for ProvOracle, and important properties of our con-
struction.
4.1 Faithful extension of LTL semantics
A basic sanity check is that the definition is consistent
with the semantics of LTL for DDPs without annota-
tion, namely that it in fact faithfully extends it. Indeed
we can show:
Proposition 5 The following property (*) holds if and
only if ProvOracle is set-compatible.
(*) For any LTL formula f and (B,B)-PADDP S,
it holds that fProvOracle(S) ≡ {(true, true)} if and
only if S′ |= f where S′ is a DDP obtained from S by
deleting all tuples and transitions annotated by false,
and keeping the rest with no annotation.
Intuitively, (true, true) corresponds to the boolean
true. We will formalize this intuition in the sequel.
Proof Assume first that ProvOracle is set-compatible.
Further assume that S′ |= f , therefore there is a path p
in the DDP S′ such that p |= f . Since S′ is a DDP ob-
tained from S by deleting all transitions annotated by
false, p is also a path in S. We claim that Prov(p) =
(true, true). The external effects provenance is obvi-
ously true since all such transitions in p are labelled
with true (those labeled with false do not appear in S′).
The data provenance is true since the following holds: p
is a path in S′, and so all queries occurring along p are
satisfied. Since ProvOracle is set-compatible, queries
that are “satisfied” on the corresponding set-database
are associated with provenance value true. Thus the
data provenance is a multiplication of true values which
yields true. Then, the provenance (after simplifications)
of a path e in a (B,B)-PADDP can be one of the follow-
ing: (true, true), (true, false), (false, true) and (false, false).
Now since f(S) =∑{e∈Paths(S)|e|=f} prov(e), and there
exists at least one path in S (call it p) such that Prov(p) =
(true, true) we get f(S) =∑{e∈Paths(S)|e|=f} prov(e) =
(true, true). This is because (true, true)+(true, false) =
(true, true)+(false, true) = (true, true)+(false, false) =
12 Daniel Deutch et al.
(true, true) + (true, true) = (true, true). Now assume
f(S) ≡ {(true, true)}. Therefore there exists a path p in
the (B,B)-PADDP S such that Prov(p) = (true, true)
and p |= f (Otherwise we will not get
f(S) =∑{e∈Paths(S)|e|=f} prov(e) = (true, true)).
Since for the path p = (v0, v1, ..., vn),
Prov(p) =∏
(vi,vi+1)∈e Prov((vi, vi+1)), and Prov(p) =
(true, true), necessarily Prov((vi, vi+1)) = (true, true)
for all 0 ≤ i < n. Namely all the annotation of the path
p are true. Due to set-compatibility (data provenance
is true only if the corresponding query is satisfied), this
implies that p is also a path in S′
In contrast, if ProvOracle is not set-compatible then
there exists a boolean query Q and a B-database in-
stance D such that the provenance of Q with respect to
D is false (true) but Q is (respectively, not) satisfied
by the instance D′ obtained from D by keeping only the
tuples labeled true, and omitting the annotations. Con-
sider a PADDP S with three nodes labeled init (initial),
A and B, such that the transitions from init to A and
B are guarded by Q (A is the true node and B is the
false node). Further consider the LTL formula f = FA
(Finally A), and note that f(S) ≡ {(true, false)} while
S′ |= f or otherwise f(S) ≡ {(true, true)} while S′ 6|= f
depending on the direction in which ProvOracle errs.
4.2 Efficient Provenance Generation
We next show the following crucial proposition (also
proving proposition 3):
Proposition 6 For any LTL formula φ and a PADDP
S and given a polynomial time ProvOracle for prove-nance of guarding queries, we can compute a descrip-
tion of φProvOracle(S) in time polynomial in the size
of the underlying database and exponential in the size
of the state machine of S 3.
Algorithm The high-level flow of an algorithm that
generates provenance for an LTL formula with respect
to a PADDP is given in Algorithm 1. Some important
details, from a practical perspective, on its implemen-
tation are given Section 7. The first step is to compute
provenance expressions for the queries associated with
query nodes by simply applying ProvOracle to each of
them. The obtained annotated structure is named S′.
The LTL formula is compiled (Line 2) into an FSM Sφconsistent with its finite semantics, based on the Algo-
rithm of [23]). Sφ is then (Line 3) intersected with S′
in the standard way of automata intersection (see e.g.
[39]), while maintaining annotations of S′′ consistent
3 We follow common practice of analyzing data complexity,so φ and the guarding queries are considered of constant size.
with those of S′: namely, a transition in the intersec-
tion automata S′′ that has origin (u, ψ) and destina-
tion (v, ψ′) will be annotated by the annotation in S′ of
(u, v). Note that S′′ is a PADDP with some states desig-
nated as accepting. The last step (line 4) is to transform
S′′ into a starred expression, which is an element of the
tensor product semiring. This is done by a direct ap-
plication of Kleene’s algorithm (see e.g. [39]) (which is
a generalization of the standard translation of FSMs to
regular expressions). Given the obtained regular expres-
sion, we may simply interpret the (regular expression)
addition, multiplication and Kleene star operation as
their counterparts in the tensor product semiring.
Correctness We show that for any LTL formula φ
and a PADDP S, the output of Algorithm 1 is equiva-
lent (in the tensor product semiring) to φ(S). For that,
we consider the algorithm’s execution step by step. Af-
ter Line 1 of the algorithm, every transition (non-deterministic
or data-dependent) of S′ is labeled by its corresponding
provenance; this is due to the correctness of computing
provenance for queries with aggregates. The translation
of φ to an FSM follows the algorithm of [23] and Sφ thus
guaranteed to satisfy that the finite executions of the
obtained FSM exactly characterize those that satisfy
the LTL formula; by intersecting Sφ with S′ we thus
get that the labeled sequences obtained for executions
of S′′ are exactly those of S′ satisfying φ. Following the
correctness of Kleene’s algorithm [39], the last step of
the algorithm generates a regular expression that com-
pactly captures this set of labels sequences (which in
turn is the set of label sequences of all executions of S
satisfying φ), i.e. is equivalent to φ(S).
Complexity For any (fixed-size) LTL formula φand PADDP S, the time complexity and output size
of Algorithm 1 is polynomial in the databse size of S,
with the exponent possibly dependent on the state ma-
chine size. To observe that this holds, note that the gen-
eration of provenance expressions for database queries
was shown to be in polynomial time with respect to
the database size. The translation to starred expres-
sion (step 4), however, may in the worst case incur an
(unavoidable [28]) exponential blow up in the size of
the state machine.
Algorithm 1 Algorithm for Expression Generation
Input PADDP specification S; LTL formula φ, Provenanceoracle ProvOracle
Output Provenance expression φ(S)1: S′ := QueriesToProvenance(S,ProvOracle)2: QueryAutomaton := TransformToFSM(φ)3: S′′ := Intersect(S′, QueryAutomaton)4: exp := TranslateToStarredExpression(S′′)5: return exp
Provenance-Based Analysis of Data-Centric Processes 13
4.3 Commutation with homomorphism
Now that we have defined provenance for DDP analy-
sis, and have designed an algorithm for computing it,
we turn to setting the grounds for using it (for hypo-
thetical reasoning). An important principle underlying
the semiring-based provenance framework, and the one
allowing hypothetical reasoning, is that one can com-
pute an “abstract” representation of the provenance
and then “specialize” it in any domain. This “special-
ization” is formalized through the use of semiring ho-
momorphism. Recall that function h is a semiring ho-
momorphism if for every x, y ∈ K we have h(x + y) =
h(x) + h(y) and h(x · y) = h(x) · h(y).
To allow for a similar use of provenance in our set-
ting, we extend the notion of homomorphism to the
tensor product structure, and study properties of the
construction. A fundamental new challenge lies in the
mapping from elements of the tensor (which are essen-
tially bags of pairs), to single semiring elements. Such
mapping “makes sense” in some cases, as we next show.
We first show how to “shift” between meta-domains.
The idea is that two mappings from the individual semir-
ings may be combined into a single homomorphism
whose domain is the tensor structure.
Proposition 7 Let K1,K2,K3,K4 be commutative semir-
ings, and let h1 : K1 7→ K3, h2 : K2 7→ K4 be semiring
homomorphisms. Let h(k1 ⊗ k2) = h1(k1)⊗ h2(k2) and
extend h to a full mapping by defining h(0 ⊗ 0) = 0,
h(s1 +s2) = h(s1) +h(s2) and h(s1 ·s2) = h(s1) ·h(s2).
Then h : K1 ⊗K2 7→ K3 ⊗K4 is a semiring homomor-
phism.
We use h1 | h2 to denote the homomorphism h ob-
tained, according to the above construction, from h1and h2.
Proof We have that h maps 0 to 0 by definition, and
observe that h(1⊗1) = 1⊗1 (the former is the 1 element
of K1⊗K2 and the latter is the 1 element of K3⊗K4).
By definition it further holds that h commutes with
addition and with multiplication.
It remains to show that h is a well-defined mapping,
i.e. to show that if x ∼ y then h(x) ∼ h(y). We show
that this is the case for every axiom of the tensor struc-
ture; since every equivalence is by repeated use of the
axioms, the proposition follows.
– h(k ⊗ 0) = h1(k)⊗ h2(0) = h1(k)⊗ 0 = 0 = h(0)
– h(k1⊗ l+ k2⊗ l) = h(k1⊗ l) + h(k2⊗ l) = h1(k1)⊗h2(l) + h1(k2)⊗ h2(l) ≡ (h1(k1) + h1(k2))⊗ h2(l) ≡h1(k1)⊗h2(l) +h1(k2)⊗h2(l) = (h1(k1) +h1(k2)⊗h2(l) = h((k1 + k2) ⊗ l), and symmetrically for the
other direction of distributivity.
And symmetrically for the other direction of asso-
ciativity. We also have h(k ⊗ 0) = h1(k) ⊗ h2(0) =
h1(k)⊗ 0 = 0 and h(1⊗ 1) = h1(1)⊗ h2(1) = 1⊗ 1 = 1
We can now extend the notion of semiring homo-
morphism to homomorphisms on PADDPs:
Definition 12 Let K1,K2,K3,K4 be 4 semirings, let
h1 : K1 7→ K3, h2 : K2 7→ K4 be semiring homo-
morphisms and let S be a (K1,K2)-PADDP. We use
(h1 | h2)(S) to denote the (K3,K4)-PADDP S′ ob-
tained from S by replacing every annotation k1 ∈ K1
by h1(k1) and every annotation k2 ∈ K2 by h2(k2).
Next we show that provenance propagation com-
mutes with homomorphisms, namely:
Proposition 8 The following (*) holds if and only if
ProvOracle satisfies commutation with homomorphism:
(*) For every LTL formula φ, a (K1,K2)-PADDP
S and semiring homomorphisms h1 : K1 7→ K3, h2 :
K2 7→ K4, it holds that (h1 | h2)(φ(s)) ≡ φ((h1 | h2)s).
Proof By definition 10:
(h1 | h2)(φ(s)) = (h1 | h2)(∑{e∈Paths(S)|e|=φ} prov(e)).
Since h1 | h2 is a homomorphism we have:
(h1 | h2)(∑{e∈Paths(S)|e|=φ} prov(e)) =∑
{e∈Paths(S)|e|=φ}(h1 | h2)(prov(e))
Now assume that ProvOracle satisfies commutation
with homomorphism. We then claim that (h1 | h2)(prov(e)) =
prov((h1 | h2)(e)) where (h1 | h2)(e) (with respect to
an underlying annotated database D) is the annotated
execution obtained by replacing every external effect
annotation a in e by h1(a) and replacing every anno-
tation a′ in D by h2(a′) and denoting the obtained
database by D′. Equality then holds for the external
effect part of the provenance expression by the basic
properties of homomorphisms (h(x+ y) = h(x) + h(y),
h(x · y) = h(x) · h(y)). For the database provenance
part we first use the commutation of ProvOracle (so
the provenance for D′ for each individual query Q is
the same as the result of applying h2 to the provenance
computed for Q with respect to D), and then use again
the the basic properties of homomorphisms to obtain
that the commutation holds for the entire provenance
expression. Finally we can again apply definition 10 to
get φ((h1 | h2)s.
If ProvOracle does not satisfy commutation with ho-
momorphism (failing for some query Q and annotated
database D) then, similarly to the proof of Proposition
8, we construct a PADDP S with three nodes labeled
init (initial), A and B, such that the transitions from
init to A and B are guarded by Q. The provenance
14 Daniel Deutch et al.
for the LTL formula FA is then simply the result of
ProvOracle applied to Q and D, and so commutation
with homomorphism fails for the LTL formula.
We have shown a provenance construction for PAD-
DPs that is parameterized by a provenance oracle for
the queries annotating transitions out of its query nodes,
and have shown that desirable properties of such or-
acle, namely set-compatibility, commutation with ho-
momorphism, and computability in polynomial time,
carry over to the settings of DDPs via our construc-
tion. Such “oracles” (having all three desiderata) were
given in previous work for positive relational algebra
[26] and enriched with aggregates (and a specific form
of difference) in [6]. Thus, any of these models may eas-
ily be “plugged into” our constructions. To complete
the picture, we next give an outline of the construction
for positive relational algebra with aggregates; this also
allows us, in the section that follows, to complete our
examples.
5 Provenance oracle for aggregate queries
Let K be any commutative semiring (intuitively used
for annotations) and M be any commutative monoid,
used for aggregates. To support provenance for aggre-
gate queries we have designed in [6] a structure that
combines (“pairs”) annotations from k with elements
from M , as follows. We start with K ×M , denote its
elements k⊗m instead of 〈k,m〉 and call them “simple
tensors”. Next we consider finite bags of such simple
tensors, which, with bag union and the empty bag, form
a commutative monoid. It will be convenient to denote
bag union by +K⊗M , the empty bag by 0
K⊗M and to
abuse notation denoting singleton bags by the unique
element they contain. Then, every non-empty bag of
simple tensors can be written (repeating summands by
multiplicity) k1⊗m1 +K⊗M · · · +K⊗M kn⊗mn. Now we
define
k ∗K⊗M
∑ki⊗mi =
∑(k ·
Kki)⊗mi
Let ∼ be the smallest congruence w.r.t. +K⊗M and ∗
K⊗M
that satisfies (for all k, k′,m,m′):
(k +Kk′)⊗m ∼ k⊗m+
K⊗M k′⊗m0K⊗m ∼ 0
K⊗M
k⊗ (m+Mm′) ∼ k⊗m+
K⊗M k⊗m′
k⊗ 0M∼ 0
K⊗M
We denote by K⊗M the set of tensors i.e., equivalence
classes of bags of simple tensors modulo ∼.
This structure suffices for defining ”simple” aggre-
gate queries, namely queries where the aggregate op-
eration must appear as the last one (i.e. the aggregate
operation is disallowed in a nested query). For the out-
put relations of our algebra queries, we thus need re-
sults of aggregation (i.e., the elements of K ⊗M) to
also be part of the domain out of which the tuples are
constructed. Thus for the output domain we will as-
sume that K ⊗M ⊆ D, i.e. the result “combines an-
notations with values”. The elements of M (e.g., real
numbers for sum or max aggregation) are still present,
but only via the embedding ι : M → K ⊗M defined
by ι(m) = 1K⊗m. Now we may define AGGM (R) as a
one-attribute relation with one tuple annotation is 1K
and whose content is SetAggK⊗M (ι(R)), which is equal
to
k1 ∗K⊗M ι(m1) +K⊗M · · ·+K⊗M kn ∗K⊗M ι(mn)
= k1⊗m1 +K⊗M · · ·+K⊗M kn⊗mn
We define the annotation of the only tuple in the output
of AGGM to be 1K
, since this tuple is always available.
However, the content of this tuple does depend on R.
Next, we need to support the use of aggregates which
are not the last operation (and in particular to com-
pare the obtained aggregate value to a given value, in
a boolean query, as done in our DDP example). To this
end we have designed in [6] a semiring whose elements
are polynomials, in which equation (and inequality) el-
ements are additional indeterminates. To achieve that,
we introduce for any semiring K and any commutative
monoid M , the “domain” equation K = N[K ∪ {[c1 =
c2] ∪ {[c1 6= c2] | c1, c2 ∈ K⊗M}]. The right-hand-side
is a monotone, in fact continuous w.r.t. the usual set in-
clusion operator, hence this equation has a set-theoretic
least solution (no need for order-theoretic domain the-
ory). The solution also has an obvious commutative
semiring structure induced by that of polynomials. The
solution semiring is K = (X,+K , ·K , 0K , 1K), and we
continue by taking the quotient on K defined by the fol-
lowing axioms. For all k1, k2 ∈ K, c1, c2, c3, c4 ∈ K⊗M :
0K ∼ 0K
1K ∼ 1K
k1 +Kk2 ∼ k1 +Kk2
k1 ·Kk2 ∼ k1 ·K k2
[c1 = c3] ∼ [c2 = c4] (if c1 = K⊗Mc2, c3 = K⊗Mc4)
and the symmetric treatment for the [a 6= b] expres-
sions. If K and M are such that ι defined by ι(m) =
1K⊗m is an isomorphism (and let h be its inverse),
we further take the quotient defined by: for all a, b ∈K ⊗M ,
[a = b] ∼ 1K
(if h(a) =Mh(b))
[a = b] ∼ 0K
(if h(a) 6=Mh(b))
Provenance-Based Analysis of Data-Centric Processes 15
We use KM to denote the semiring obtained by ap-
plying the above construction on a semiring K and a
commutative monoid M .
The extended semiring construction allows us to de-
sign a semantics for general aggregation queries. Intu-
itively, when the existence of a tuple in the result re-
lies on the result of a comparison involving aggregate
values (as in the result of applying selection or joins),
we multiply the tuple annotation by the correspond-
ing equation annotation. In the sequel we assume, to
simplify the definition, that the query aggregates and
compares only values of KM ⊗M (a value m ∈ M is
first replaced by ι(m) = 1K⊗m). In what follows, let
R,R1 and R2 be (M,KM )-relations on an attributes
set U . Recall that for a tuple t, t(u) (where u ∈ U) is
the value of the attribute u in t; also for U ′ ⊆ U , recall
that we use t |U ′ to denote the restriction of t to the
attributes in U ′. Last, we use (KM⊗M)U to denote the
set of all tuples on attributes set U , with values from
KM ⊗M . The semantics follows:
1. empty relation: ∀t φ(t) = 0.
2. union: (R1 ∪R2) (t) =∑t′∈supp(R1)
R1(t′) ·∏u∈U [t′(u) = t(u)] if t ∈ supp(R1)
+∑t′∈supp(R2)
R2(t′) ·∏u∈U [t′(u) = t(u)] ∪supp(R2)
0 Otherwise.
3. projection: Let U ′ ⊆ U , and let T = {t|U ′ | t ∈supp(R)}. Then ΠU ′(t) =∑t′∈Supp(R)R(t′)·
∏u∈U ′ [t(u) = t′(u)] if t ∈ T
0 Otherwise.
4. selection: If P is an equality predicate involving the
equation of some attribute u ∈ U and a value m ∈M then (σP (R)) (t) = R(t)·[t(u) = ι(m)].
5. value based join: We assume for simplicity that R1
and R2 have disjoint sets of attributes, U1 and U2
resp., and that the join is based on comparing a
single attribute of each relation. Let u′1 ∈ U1 and
u′2 ∈ U2 be the attributes to join on. For every t ∈(KM ⊗M)U1∪U2 :
(R1 ./R1.u1=R2.u2R2) (t) =
R1(t|U1)·R2(t|U2)·K
[t(u1) = t(u2)].
6. Aggregation: AGGM
(R)(t) = 1 t(u) =∑t′∈supp(R)R(t′)⊗ t′(u)
0 otherwise
Where t′(u) is the value of t′ in the aggregated col-
umn, or the natural number 1 if the aggregation is
COUNT.
Boolean queries We have used the notation Q = c,
Q 6= c, where Q is an aggregate query whose result
is a single tuple with a single value and c is a constant.
This is just a notation for the corresponding nested ag-
gregate query, comparing the aggregate result to c, and
its provenance definition thus follows immediately.
Example 11 Consider the aggregation query Q1 given
in Figure 3. The result of the query evaluation with
respect to the relations given in Figure 2 consist of a
single tuple with the aggregated value. Since the aggre-
gation function in this example is COUNT, the aggrega-
tion monoid is that of natural numbers, t′(u) = 1 for all
t′ ∈ R (intuitively the contribution of each tuple in R
to the count of tuples is 1). The provenance expression
of the single tuple in the table resulting by the inner
query is (d1 ·d4), therefore, the value of the tuple in the
result is the expression (d1 · d4)⊗ 1.
Thus, the provenance annotation of the nested, boolean,
query Q1 = 0 is simply [(d1 ·d4)⊗ 1 = 0]. Similarly, the
provenance of the query Q2 = 0 is [d5 ⊗ 1 = 0].
Incorporation in the DDP modelThe following proposi-
tion was shown in [6].
Proposition 9 The semantics for positive relational
algebra with (nested) aggregates satisfies polynomial time
computability, set comparability and commutation with
homomorphism.
Using our results, this means that the provenance
construction for positive relational algebra with (nested)
aggregates, following the above semantics, may be in-
corporated as oracle in our construction. The semiring
K is simply KM where K is the semiring used to anno-
tate database tuples, and M is the aggregation monoid,
and the lift of homomorphism h is simply by replac-
ing each element of k appearing in the expression by
h(k) [6]. We note that provenance constructions for ad-
ditional query constructs, such as the one for queries
with difference in [6], may similarly be incorporated as
well.
6 Further examples
Now that we have defined a provenance model for aggre-
gate query, we revisit and complete our running exam-
ple of PADDP provenance, showing both the obtained
full-fledged provenance expression as well as examples
for its usefulness.
We first show the obtained expression:
Example 12 Reconsider the provenance expression given
in Example 10, by substitution of the database queries
16 Daniel Deutch et al.
provenance we obtain the following expression:(〈12, [d1 · d4 6= 0]〉+ 〈10, [d1 · d4 = 0]〉
)∗ · (〈15, [d1 · d4 6=0]〉+ 〈13, [d1 · d4 = 0]〉
)·(〈0, [d5 6= 0] + [d5 = 0]〉
)[d1 ·d4 6= 0] is in fact shorthand to [[(d1 ·d4)⊗1 = 0] = 0]
which is the negation of the provenance for Q1 = 0. In-
tuitively, it means that both tuples annotated with d1and d4 need to be present for the query to be satisfied;
this will next be reflected in the effect of homomor-
phisms.
Then, the expression may be used e.g. for deletion
propagation, through the notion of homomorphisms and
leveraging the commutation property we have estab-
lished:
Example 13 Say that we are interested in finding the
minimal cost of realizing our running example LTL query,
but this time under the assumption that the Cell Phone
category is no longer available. This corresponds to prop-
agating the hypothetical deletion of the tuple annotated
by d1 (and other tuples are kept), and re-performing the
analysis. Instead, Proposition 8 suggests a much more
efficient alternative: given the computed provenance ex-
pression of Example 10, we can use the homomorphism
h2 : N[D] 7→ B, where (in particular) h2(d1) = false,
h2(d4) = true and h2(d5) = true. We use the identity
homomorphism as h1 (assuming the user effort quan-
tification stays intact), and obtain:
(h1 | h2)((〈12, [d1 · d4 6= 0]〉 + 〈10, [d1 · d4 = 0]〉
)∗ ·(〈15, [d1 · d4 6= 0]〉 + 〈13, [d1 · d4 = 0]〉
)·(〈0, [d5 6=
0]〉+ 〈0, [d5 = 0]〉))
=(〈12, false〉 + 〈10, true〉
)∗ · (〈15, false〉 + 〈13, true〉)·
〈0, true〉more simplifications, based on the congruence axioms,
we obtain:
〈13, true〉
Indeed, the minimal effort required to reach the goal
given the deletions is 13, realized e.g. through the execu-
tion [Home page, Cat., Product, Shopping Cart, Daily
deals, Payment, PayExit].
We next show another use-case where the generated
expression and its properties are leveraged
Example 14 As another example, from a different meta-
domain, assume that the application supports three lev-
els of memberships. A user can be a club member in one
of the clubs: silver (S), gold (G) and platinum (P ), and
each product or category is annotated with minimal
level of membership. Namely a silver club member can-
not see a product that is annotated with G, but gold
or platinum club members can. Again, we want to find
the minimal cost of realizing our running example LTL
query, but we want to do it now under the assump-
tion that the user is a gold club member. We can use
the semiring C = ({1c, S,G, P, 0c},min,max, 0c, 1m),
where 1c < S < G < P < 0c, and in a similar way
to the above example, apply the homomorphism h2 :
N[D] 7→ C.
For instance, say that the cell phone category is
available to all users (i.e. h2(d1) = 1c), the computers
category is unavailable (h2(d2) = 0c), the fashion cat-
egory is available only to club members (h2(d3) = S),
silver or higher, the smartphones sub category is avail-
able only to gold club members or higher (h2(d4) = G),
and h2(d5) = 1c. As before, we use the identity homo-
morphism as h1 and obtain:
(h1 | h2)((〈12, [d1 · d4 6= 0]〉 + 〈10, [d1 · d4 = 0]〉
)∗ ·(〈15, [d1 · d4 6= 0]〉 + 〈13, [d1 · d4 = 0]〉
)·(〈0, [d5 6=
0]〉+ 〈0, [d5 = 0]〉))
=((〈12, [1c ·G 6= 0]〉+ 〈10, [1c ·G = 0]〉
)∗ · (〈15, [1c ·G 6=
0]〉+ 〈13, [1c ·G = 0]〉)·(〈0, [1c 6= 0]〉+ 〈0, [1c = 0]〉
))Note that 1c ·G = max(1c, G) = G, thus we obtain:((〈12, [G 6= 0]〉 + 〈10, [G = 0]〉
)∗ · (〈15, [G 6= 0]〉 +
〈13, [G = 0]〉)·(〈0, [1c 6= 0]〉+ 〈0, [1c = 0]〉
))To denote that the user is a gold club member we
apply an additional homomorphism h′2, that maps x ∈C s.t. x ≥ G to 1, and otherwise to 0, for example
h′2(G) = 1 and h′2(S) = 0 (we again use the identity
homomorphism as h′1). The result of the application is
the following expression.((〈12, true〉+ 〈10, false〉
)∗ · (〈15, true〉+ 〈13, false〉)·(
〈0, true〉+ 〈0, false〉))
and via more simplifications, based on the congruence
axioms, we obtain:
〈15, true〉
The minimal effort required to reach the goal for a
gold club member is indeed 15, through the execution
[Home page, Cat., SubCat., Product, Shopping Cart,
Daily deals, Payment, PayExit]
Applying homomorphisms So far we have demon-
strated that provenance computation can be done in
K ⊗ L for arbitrary positive K and L. To then get an
item (not a pair) in K (symmetrically L) we can first
apply the homomorphism h : K ⊗ L 7→ K ⊗ K that
combines (via prop. 7) the two mappings h1 : K 7→ K
defined by h1(k) = k and h2 : L 7→ K defined by
h2(l) = 0K if l = 0 and h2(l) = 1K otherwise. We
get a correct provenance expression in K ⊗K.
Finally, it is desirable that the final provenance out-
come should reflect a single value (e.g. a cost, or a truth
Provenance-Based Analysis of Data-Centric Processes 17
LTL query Engine
Interactive Explorer
Admin Interface
Spec
Analyst Annotations
View LTL query
Provenance Expression
Homomorphism
Results
State Machine
DB (SQL Server)
Provenance Generator
Dashboard
Fig. 4: System Architecture
value) rather than a bag of pairs. For that, we apply a
mapping defined by h(k1 ⊗ k2) = k1 · k2, and extended
to a full mapping by defining h(s1 +s2) = h(s1)+h(s2)
and h(s1 · s2) = h(s1) · h(s2)
Example 15 Reconsider the obtained provenance expres-
sion in Example 13, 〈13, true〉. We use the identity ho-
momorphism h1 and the homomorphism h2 from boolean
to tropical mapping true to the tropical 1 element and
false to the tropical 0 element Composing the map-
ping with h1 | h2, and applying the result to 〈13, true〉,we finally get 13 as final answer under the particular
hypothetical scenario.
7 Prototype Implementation and Optimizations
We have implemented our provenance management frame-
work in the context of the PROPOLIS system. PROPOLIS
is implemented in C] with WPF GUI using .NET frame-
work, and runs on Windows 7. It uses MS SQL server as
its underlying database management system. We start
by a high-level description of the system architecture,
and then focus on the implementation and optimiza-
tion of its main component, namely the generation of
provenance expressions.
7.1 Architecture
The system architecture is shown in Figure 4. First, the
administrator interface allows specification of a DDP
instance (including both the logical flow in the form of
a Finite State Machine, and the underlying Database,
but no annotations). Then, an analyst specifies the an-
notations, both on the database tuples and the user
choice transitions of the state machine. Intuitively, she
could use annotations to parameterize the analysis at
points whose change she would like to further example,
in which case variables are used; other annotations may
be used to quantify certain aspects of the computation
(e.g. cost in our running example), so that the tempo-
ral analysis result will further reflect this quantification.
We thus obtain a PADDP.
After the annotations are set, and the analyst has
further specified an analysis task in LTL, the computa-
tion may begin. The Provenance Generator module is
then invoked, applying our algorithm to compute the
provenance expression for the specified LTL query with
respect to the PADDP specification (see an extended
discussion of the module’s implementation below). The
obtained expression is then fed to the Dashboard. The
latter, in turn, displays to the user a list of all vari-
ables occurring in the expression. Interacting with the
dashboard and by assigning values to the variables, the
analyst can interactively and repeatedly explore the ef-
fect of hypothetical scenarios on the LTL analysis re-
sult. The scenarios are specified, using the dashboard,
through the assignment of different values, from semir-
ings of the analyst’s choice (e.g. boolean, costs etc.) to
the different parameters. In effect, this defines a homo-
morphism of the analyst choice, to a chosen structure. A
simple example involves deciding on a subset of tuples /
transitions that are (hypothetically) deleted and assign-
ing truth values to parameters to reflect the choice of
deletion (as in Examples 13 and 15). A dedicated GUI
allows for such choices. The analyst is then presented
with the result of the LTL query for the obtained sce-
nario, and can repeatedly and interactively change the
scenarios (i.e. change the homomorphism), to explore
the effect on the LTL query result.
7.2 Efficient Provenance Generation
We next describe in detail the implementation and op-
timization of the provenance generator module. This
entails both optimization of the generation algorithm
(which is otherwise highly time-consuming, as discussed
next) as well as optimizing the size of the outputted
provenance; the latter naturally greatly effects the time
of interacting with PROPOLIS through the exploration
of hypothetical scenarios.
7.2.1 Implementation Details
The high-level algorithm for provenance generation was
given in Algorithm 1 and we next provide more details
on its implementation. We start by introducing a data
structure that is useful as intermediate representation.
Equation system of starred expressions. We use here
an idea due to [9]. Let Ri be a new “variable”, whose
values intuitively range over starred expressions. Even-
tually we will store in Ri the starred expression repre-
senting the provenance of all possible qualifying sub-
18 Daniel Deutch et al.
executions that start at state qi. Towards achieving
that, we generate an equation system, with an equation
for each such Ri, of the form:
Ri =∑
qiaj−→qj
ajRj
If qi is an accepting state then we introduce the term
1K⊗L in the righthand side of the equation of Ri.
The idea is that each such equation represents the
starred expression Ri in terms of the starred expressions
Rj for all possible following states qj . After intersecting
the LTL formula with the DDP’s state machine (which
is done in a standard manner, see [32]), we generate
this equation system by simply considering the obtained
FSM state by state, in arbitrary order, generating an
equation for each state based on its outgoing transitions
and immediate neighbors.
Example 16 Consider the PADDP of the running ex-
ample, and the LTL formula f = F (PayExit). After
intersection we get a state machine whose structure is
exactly as the one in Figure 1 with PayExit being the
only final state. The equation system is (see state vari-
ables mapping in Table 1):
R0 = 〈2, 1〉R1 + 〈5, 1〉R2
R1 = 〈5, 1〉R2
R2 = 〈0, [Q1 = 0]〉R4 + 〈0, [Q1 6= 0]〉R3
R3 = 〈2, 1〉R4 + 〈2, 1〉R8
R4 = 〈3, 1〉R5
R5 = 〈2, 1〉R0 + 〈2, 1〉R6 + 〈2, 1〉R7
R6 = 〈3, 1〉R7 + 〈2, 1〉R8
R7 = 〈0, [Q2 = 0]〉R8 + 〈0, [Q2 6= 0]〉R9
R8 = 0
R9 = 1
For example, the equation for R0 is due to its two
out-going transitions: to R1 associated with provenance
〈2, 1〉 and to R2 associated with provenance 〈5, 1〉.
Solving the equations. As pointed out in [9], solving
the equations may be done via standard substitution
of variables (i.e. “plugging in” the righthand side of
the equation for some Ri instead of an occurrence of
Ri), in some order (the choice of order applies both
to the choice of Ri and the choice of equation to plug
it into). After substitution, some Rj may occur in the
righthand side of the equation for Rj (due to loops in
the underlying FSM). We then use the following result:
Proposition 10 (adapted from [9]) Given an equation
Ri = A·Ri+B where multiplication and addition are in
State Variable
Home Page R0New products R1
Cat. R2Sub Cat. R3Products R4
Shopping Cart R5Daily Deals R6Payment R7
Exit R8PayExit R9
Table 1: State variables
a closed semiring K, Ri is a variable and A,B are ar-
bitrary expressions (possibly involving variables and/or
constants), its solution (for Ri, in terms of A and B),
is Ri = A∗ ·B.
Since we have showed that the semiring obtained
in our construction is closed, we can thus leverage this
result.
Example 17 Continuing the above example, the elimi-
nation of R8 and R9 results in the following equations
R3 = 〈2, 1〉R4
R6 = 〈3, 1〉R7
R7 = 〈0, [Q2 6= 0]〉
We use the arithmetics operation of the semiring K⊗Land the congruence axioms, for instance, the substitu-
tion of the expression of R7 in the equation of R6 results
R6 = 〈3, 1〉 · 〈0, [Q2 6= 0]〉 = 〈3, [Q2 6= 0]〉
After the elimination of R1, . . . , R9 we get an equation
for R0
R0 =(〈10, [Q1 = 0]〉+ 〈13, [Q1 6= 0]〉
)R0
+ 〈10, [Q1 = 0] · [Q2 6= 0]〉+ 〈12, [Q1 6= 0] · [Q2 = 0]〉
The regular expression for R0 is thus
R0 =(〈10, [Q1 = 0]〉+ 〈13, [Q1 6= 0]〉
)∗·(〈10, [Q1 = 0] · [Q2 6= 0]〉+ 〈12, [Q1 6= 0] · [Q2 = 0]〉
)7.2.2 Optimizations
To allow for scalability of the algorithm, both in term of
its execution time and size of the outputted expression,
we have employed multiple optimizations in different
parts of the implementation, and we explain them next.
Leveraging Congruences. Beyond the theoretical guar-
antees, the fact that the obtained provenance struc-
ture (K⊗L) forms a semiring and the particular con-
gruence axioms that we have designed allow for sim-
plifications of the provenance expression, reducing its
size. A particularly important axiom in this respect is
(k1 ⊗ l1) · (k2 ⊗ l2) = (k1 · k2)⊗ (l1 · l2) which allows us
Provenance-Based Analysis of Data-Centric Processes 19
to collapse every summand in the provenance expres-
sion into a single paired element. We then check for
the applicability of the other axioms that reduce the
expression size, applying them in a greedy manner.
Delaying Query Evaluation. Recall that the prove-
nance expression combines provenance stemming from
queries on the underlying database, with external prove-
nance. Recall also that Algorithm 1 starts by essen-
tially “replacing” every query with the corresponding
data provenance it admits. A first simple observation is
that there is no need to re-evaluate the same query in
its multiple occurrences, but rather provenance may be
computed only once. A more subtle decision is whether
to compute provenance for the queries, and plug-in the
computed provenance into the overall expression, be-
fore or after simplifications. Computing provenance for
the individual queries before simplifications has the ad-
vantage of possibly allowing for more simplifications
(that involve combining parts of expressions of different
queries), but the disadvantage of manipulating larger
expressions (even when sub-expression sharing is em-
ployed as discussed below). Our experiments show that
the latter is consistently much more significant, and
thus we defer the computation of provenance for the
individual queries, meaning that we first generate a
starred expression with query identifiers as “abstract
tokens”, and only then replace those tokens with their
respective provenance.
Sub-expression Sharing. It is extremely common to
have a sub-expression occurring multiple times in the
overall provenance expression. This in particular hap-
pens when a single transition occurs in multiple paths.
Following ideas from [19], this property is exploited in
the generation of optimized data structure, maintain-
ing pointers to sub-expressions (which are the prove-
nance of queries) rather than the expressions them-
selves; so provenance for every query is computed and
stored only once, with multiple sub-expressions con-
taining the provenance of the query, just pointing at
the stored representation. As explained below, we may
further consider a more complex type of sharing, that
crosses the boundaries of individual queries and looks
at the expression as a whole in an attempt to find and
exploit further commonalities.
Dynamic Programming. When the underlying FSM
is acyclic, we may employ a further optimized dynamic-
programming computation. The computation follows
naturally the structure of the obtained automaton, as
follows. We maintain a table which is indexed by state
and execution length. For each state s and length i, the
entry in the table contains a compact representation
of the provenance of all executions of length i ending
in state s. Given the entries for i − 1 and the possibly
Fig. 5: Dynamic Programming Table
predecessors of s, the entry for i and s is computed,
while avoiding the creation of a copy of the expressions
already in the table, and instead pointing at them as
sub-expressions.
Example 18 Let P be the fragment of the PADDP of
our running example, that consists of the states Home
page, New products, Cat., Sub cat. and Products, and
the transitions between them. P is a PADDP with no
loops. The starred expression capturing all executions
in P that start at Homepage and end at Products can
be captured by the table shown in Figure 5. The entry
[3,Products] for instance, represents the two possible
executions of length 3 ending in Products. By following
the pointers we obtain their provenance, which is 〈2, 1〉·〈5, 1〉 · 〈0, [Q1 = 0]〉 and 〈5, 1〉 · 〈0, [Q1 6= 0]〉 · 〈2, 1〉.The expression for all executions ending in Products is
the sum of the expression in the entries [i,Products] for
0 ≤ i ≤ 4, in this case:
〈5, 1〉 · 〈0, [Q1 = 0]〉+ 〈2, 1〉 · 〈5, 1〉 · 〈0, [Q1 = 0]〉+〈5, 1〉 · 〈0, [Q1 6= 0]〉 · 〈2, 1〉+〈2, 1〉 · 〈5, 1〉 · 〈0, [Q1 6= 0]〉 · 〈2, 1〉
The effect of optimization on the execution time of
the algorithm is discussed in the following section.
8 Experimental Evaluation
We have conducted experiments whose main goals were
examining (1) the scalability of the approach in terms
of the generated provenance size and generation time,
and (2) the extent of usefulness of the approach, namely
the time it takes to specialize the provenance expression
for applications such as those described in this paper.
All experiments were executed on Windows 7, 64-
bit, with 4GB of RAM and Intel Core Duo i5 3.10 GHz
processor.
8.1 Evaluation Benchmark
We have developed a dedicated benchmark that in-
volves both synthetic and real data as follows.
20 Daniel Deutch et al.
(a) Serial (b) Parallel
(c) Dense, fan-out 3, 3 levels
Fig. 6: Arctic Station Finite State Machines [5]
E-commerce We used two different DDPs for E-
commerce process specifications to examine the per-
formance (provenance size, and generation and usage
time) as a function of the underlying database size. The
first dataset uses the fixed topology of the state machine
used in our running example (Figure 1), which demon-
strates all features of the model, including an FSM
with cycles, queries on a database and external choices.
The second DDP is partially based on the proposed
benchmark for e-commerce applications, described in
[24] (which we translated to an FSM and enriched with
queries and external choices). The underlying database
in both cases was populated with synthetically and ran-
domly generated data, of growing size, up to 5M tuples
(to examine scalability with respect to the database
size). We have annotated some transitions and tuples.
We have examined various LTL queries including those
presented as examples and their counterparts for the
DDP based on [24].
Arctic Stations The second set of DDPs that we
have examined is based on the real-life “Arctic stations”
data, used as a benchmark also in [5] (albeit in a differ-
ent context, of tracking provenance of actually running
executions rather than temporally analyzing possible
executions). This benchmark includes a variety of pro-
cesses that model the operation of meteorological sta-
tions in the Russian Arctic. Their flows are based on
three kinds of topologies, serial, parallel, and dense as
shown in Figure 6. The process specifications include
queries with aggregates, and we have synthetically in-
troduced external effect choices. The underlying real
data consists of 25000 tuples and includes monthly me-
teorological observations, collected in 1961-2000. The
number of nodes in the different process specifications
(corresponding to stations) is at most 25. But to exam-
ine scalability with respect to the FSM size, we have
also considered varying FSM sizes of up to 5000 nodes.
While doing that, we followed the topologies. I.e., for
the serial and parallel structure we concatenated copies
of the depicted structure such that the last state of the
i component is the first state of the i + 1 component.
For the dense structure we have varied both the fan-out
and number of levels; we report results obtained for in-
creasing FSM size by setting fan-out value of 10 and
increasing the number of levels.
Scientific Workflows We have further employed
our analysis in the context of Scientific Workflows from
MyExperiment.org [36]. We show the result for three
representative such workflows, named here Workflow1
(No. 16 in [36]), Workflow2 (No. 204 there), Work-
flow3 (No. 72 there) enriched with queries and external
choices and with the underlying database was popu-
lated with synthetically and randomly generated data,
of growing size, up to 10M tuples. We observed that
the number of nodes in these different process specifi-
cations is at most 55, but to examine scalability with
respect to the FSM size, we have also considered vary-
ing FSM sizes of up to 1500 nodes, by chaining copies
of the original state machines.
Business Processes We have also experimented
with (centralized) Business Process specifications from
[8]. Here again we show the results for three represen-
tative structures, named here BP1 (“order processing”
process in [8]), BP2 (“order fulfilment and procure-
ment”), BP3 (“shipment process of hardware retailer”).
We have considered synthetically and randomly gener-
ated databases with up to 10M tuples and have syn-
thetically considered increasingly growing sizes of the
process specification state machines, up to 1000 nodes,
to further study the scalability of our solution.
We next describe our experimental results with re-
spect to provenance size, provenance computation time
and provenance use time. We have examined all three
aspects with respect to growing DB size and growing
state machine size. The results with respect to grow-
ing DB size are summarized in Figures 7, 8, 9, and the
results with respect to growing state machine size are
summarized in Figures 10, 11 and 12.
8.2 Provenance Size
The first set of experiments aims at studying the size
of obtained provenance expressions, as a function of
the DDP size (state machine and underlying database
sizes). In Figure 7(a) we present the expression size ob-
tained using our two E-commerce datasets and varying
the database size from 0 to 5M tuples. We observe a
moderate growth of the provenance size with respect to
growing database sizes, indicating the scalability of the
approach (expression size of about 140MB and 180MB
for database of 5M tuples, and for the two workflows).
We note that the provenance size also depends (in an
expected way) on factors such as the size of join results
for guarding queries. For this dataset we have set the
join result size to be proportional to the input DB size.
Provenance-Based Analysis of Data-Centric Processes 21
0 MB
50 MB
100 MB
150 MB
200 MB
0.0 M 1.0 M 2.0 M 3.0 M 4.0 M 5.0 M 6.0 M
Exp
ress
ion
Siz
e
DB Size
E-commerce Running Example
E-commerce Benchmark
(a) Expression Size
00.0
04.3
08.6
13.0
17.3
21.6
25.9
30.2
34.6
38.9
0.0 M 1.0 M 2.0 M 3.0 M 4.0 M 5.0 M 6.0 M
Ge
ne
rati
on
Tim
e [
sec]
DB Size
E-commerce Running Example
E-commerce Benchmark
(b) Generation Time
00.0
00.9
01.7
02.6
03.5
04.3
05.2
06.0
06.9
07.8
08.6
0.0 M 1.0 M 2.0 M 3.0 M 4.0 M 5.0 M 6.0 M
Usa
ge T
ime
[se
c]
DB Size
E-commerce Running Example
E-commerce Benchmark
Running Example - Competitor
Benchmark - Competitor
(c) Usage Time
Fig. 7: Expression Size, Generation and Usage Time as Function of DB Size (E-commerce dataset)
0 MB
20 MB
40 MB
60 MB
80 MB
100 MB
120 MB
140 MB
160 MB
180 MB
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Exp
ress
ion
Siz
e
DB Size
Workflow1
Workflow2
Workflow3
(a) Expression Size
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Ge
ne
rati
on
Tim
e [
sec]
DB Size
Workflow1
Workflow2
Workflow3
(b) Generation Time
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Usa
ge T
ime
[se
c]
DB Size
Workflow1 Workflow2 Workflow3 Workflow1 - Competitor Workflow2 - Competitor Workflow3 - Competitor
(c) Usage Time
Fig. 8: Expression Size, Generation and Usage Time as Function of DB Size for Scientific Workflows
0 MB
10 MB
20 MB
30 MB
40 MB
50 MB
60 MB
70 MB
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Exp
ress
ion
Siz
e
DB Size
BP 1
BP 2
BP 3
(a) Expression Size
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Ge
ne
rati
on
Tim
e [
sec]
DB Size
BP 1
BP 2
BP 3
(b) Generation Time
0.00
1.00
2.00
3.00
4.00
5.00
6.00
0.0 M 2.0 M 4.0 M 6.0 M 8.0 M 10.0 M 12.0 M
Usa
ge T
ime
[se
c]
DB Size
BP 1
BP 2
BP 3
BP 1 - Competitor
BP 2 - Competitor
BP 3 - Competitor
(c) Usage Time
Fig. 9: Expression Size, Generation and Usage Time as Function of DB Size for Business Processes Workflows
We have varied join result size and observed the ap-
proximately linear dependency of provenance size with
respect to it.
Figures 8(a) and 9(a) shows the expression size as
a function of the Database size using scientific and busi-
ness processes workflows respectively where the Database
size varied from 0 to 10M tuples. The provenance size
grows moderately as a function of the database size, up
to 162MB and 61MB for 10M tuples for the scientific
and business processes workflows respectively.
In Figure 10(a), we have examined the effect of the
state machine size on the provenance expression size,
showing the results for the Arctic Stations dataset. The
figure shows the results for the three topologies, for
number of nodes that is increased up to 5000 (i.e. up to
200 times the real size). The expressions are compactly
represented allowing for scalability even for the dense
structure. Our experiments indicate that for relatively
simple structures commonly found in workflows, the ex-
ponential size theoretical bound is not met and feasibly
small expressions are obtained. We observed similar re-
sults using scientific and business processes workflows
as shown in Figures 11(a) and 12(a).
8.3 Provenance Generation Time
The second set of experiments aims at assessing the
time it takes to generate the provenance expression,
again as a function of the DDP size. Figures 7(b), 8(b)
and 9(b) present the time it takes to generate the ex-
pression with respect to increased database size, and
22 Daniel Deutch et al.
0 MB
5 MB
10 MB
15 MB
20 MB
25 MB
30 MB
0 1000 2000 3000 4000 5000 6000
Exp
ress
ion
Siz
e
States
Serial
Parallel
Dense
(a) Expression Size
00
17
35
52
69
86
104
0 1000 2000 3000 4000 5000 6000
Ge
ne
rati
on
Tim
e [
sec]
States
Serial
Parallel
Dense
(b) Generation Time
00
173
346
518
691
864
1037
0 1000 2000 3000 4000 5000 6000
Usa
ge T
ime
[se
c]
States
Serial
Parallel
Dense
Serial - Competitor
Parallel - Competiror
Dense - Competitor
(c) Usage Time
Fig. 10: Expression Size, Generation and Usage Time as Function of FSM Size (Arctic Stations Dataset)
0 MB
10 MB
20 MB
30 MB
40 MB
50 MB
60 MB
0 200 400 600 800 1000 1200 1400 1600 1800
Exp
ress
ion
Siz
e
States
Workflow 1
Workflow2
Workflow3
(a) Expression Size
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
0 200 400 600 800 1000 1200 1400 1600 1800
Ge
ne
rati
on
Tim
e [
sec]
States
Workflow 1
Workflow2
Workflow3
(b) Generation Time
0.00
1.00
2.00
3.00
4.00
5.00
6.00
0 200 400 600 800 1000 1200 1400 1600 1800
Usa
ge T
ime
[se
c]
States
Workflow 1
Workflow2
Workflow3
Workflow1-Competitor
Workflow2-Competitor
Workflow3-Competitor
(c) Usage Time
Fig. 11: Expression Size, Generation and Usage Time as Function of FSM Size for Scientific Workflows
0 MB
5 MB
10 MB
15 MB
20 MB
25 MB
30 MB
0 200 400 600 800 1000 1200
Exp
ress
ion
Siz
e
States
BP 1
BP 2
BP 3
(a) Expression Size
0.00
5.00
10.00
15.00
20.00
25.00
0 200 400 600 800 1000 1200
Ge
ne
rati
on
Tim
e [
sec]
States
BP 1
BP 2
BP 3
(b) Generation Time
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
0 200 400 600 800 1000 1200
Usa
ge T
ime
[se
c]
States
BP 1
BP 2
BP 3
PB 1-Competitor
BP 2-Competitor
BP 3-Competitor
(c) Usage Time
Fig. 12: Expression Size, Generation and Usage Time as Function of FSM Size for Business Processes Workflows
Figures 10(b) 11(b) and 12(b) present the time it takes
to generate the expression with respect to increased
FSM size.
Figure 7(b) presents the generation time of the prove-
nance expression, on the two E-commerce DDPs, with
database of increasing size from 0 to 5M tuples. Observe
that the generation time is just under 35 and 25 seconds
for DB size of 5M tuples (for the two DDPs). Recall
that the provenance generation is done offline, render-
ing this computation time very reasonable. Figures 8(b)
and 9(b) present the generation time for the scientific
and business processes workflows with database of in-
creasing size up to 10M tuples. The generation time
was up to about 12 seconds for 10M tuples DB in both
cases.
Figure 10(b) shows provenance generation time as a
function of FSM size using the Arctic Stations dataset.
For the “serial” and “parallel” structures, generation
time is very fast even for every large FSMs. For the
complex “dense” structure, the generation time is sig-
nificantly slower due to the complex graph structure
and the needs for simplification, but is still under 1.5
minutes for a very large FSM with 5000 states. The gen-
eration time as a function of FSM size for scientific and
business processes workflows are presented in Figures
11(b) and 12(b) respectively. The observed time was
Provenance-Based Analysis of Data-Centric Processes 23
0MB
2MB
4MB
6MB
8MB
10MB
12MB
14MB
16MB
18MB
0 10 20 30 40 50 60 70
Exp
ress
ion
Siz
e
States
No Optimizations
With Optimizations
(a) Expression Size
0
5
10
15
20
25
30
35
40
0 10 20 30 40 50 60 70
Ge
ne
rati
on
Tim
e [
sec]
States
No Optimizations
With Optimizations
(b) Generation Time
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0 10 20 30 40 50 60 70
Usa
ge T
ime
[se
c]
States
No Optimizations
With Optimizations
(c) Usage Time
Fig. 13: Effect of optimization
about a minute in the worst case for scientific worklows
and 22 seconds for business processes.
8.4 Using Provenance Information
The last set of experiments aims at assessing the use-
fulness of the approach by examining the time it takes
to “interact” with the expression, i.e. apply homomor-
phisms to observe results under hypothetical scenarios.
In particular we have considered deletion propagation,
and compared the observed times with those obtained
through a simple baseline approach (the “competitor”)
of applying the deletions directly to the DDP and per-
forming optimized temporal analysis of the obtained
specification.
Results for deletion propagation using the two ap-
proaches are reported in Figures 7(c), 8(c) and 9(c) for
growing DB size, and Figures 10(c), 11(c) and 12(c) for
growing FSM size. In Fig. 7(c) we see that by usingthe provenance expression we can consistently and sig-
nificantly outperform the baseline approach. For DB of
5M tuples, the gain is about 65% for the benchmark
DDP that follows [24], and 53% for our running exam-
ple DDP. The results also indicate scalability of the ap-
proach, allowing to use the expression in about 2.5 and
3.5 seconds for the benchmark DDP and our running
example (resp.), for DB size of 5M tuples (and much
faster for smaller databases). We note that for our ex-
amples, all input tuples are reflected in the output and
thus in the provenance. Our approach performance, in
all measures and specifically usage time and gain with
respect to the competitor, improves significantly as the
percentage of such tuples drops (e.g. in a database with
higher join selectivity). We observed similar trends in
the scientific workflows and business processes as shown
in Figures 8(c) and 9(c). The gain for 10M tuples was up
to 92% for the scientific workflows and up to 94% for
business processes. The using the provenance expres-
sion required less than 1 second for all business work-
flows we examined. and between 1 to 5 seconds for the
scientific workflows.
The results in Fig. 10(c) indicate that for FSM struc-
tures commonly found in context of workflows, appli-
cation of homomorphism to the computed expression is
very fast. It was instantaneous for all structures of the
Arctic Stations dataset (with their actual sizes). Even
when extending the FSM size to up to 5000 nodes, it
required less than 2.8 seconds for the dense structure,
1 second for the parallel structure and 0.8 seconds for
the serial one. With respect to the competitor, a very
significant gain was achieved for the dense and parallel
structure (about 96% and 58% improvement for 5000
nodes). For the simple serial structure, there was no
significant gain; this is due to the extreme simplicity
of the FSM structure where the pre-processing effect
diminishes. Figures 11(c) and 12(c) shows the gain for
deletion propagation for growing FSM size using the
workflows from [36] and [8]. The average gain was 96%-
98% fir the scientific workflows and 97%-99% for busi-
ness processes. Using the expression required less than
0.3 seconds and less than 0.1 seconds respectively.
8.5 Effect of Optimizations
Last, we have experimentally examined the effect of our
optimizations on all aspects of the computation: prove-
nance generation time, resulting provenance size, and
provenance usage time. Representative results, with and
without the optimizations, for the arctic stations dataset
with dense flow, with fan-out 6 and increasing number
of levels are shown in Figure 13. Figure 13(a) shows
the expression size as a function of the state machine
size. For example, while, without optimization, the ex-
pression size is up to 16MB for a state machine with
60 nodes, the expression size generated by the opti-
mized algorithm for the same state machine was only
34.5KB. We observed similar trends for other datasets,
and for the provenance generation and usage time as
shown in Figures 13(b) and 13(c) respectively. For ex-
24 Daniel Deutch et al.
ample, provenance generation time for FSM with 60
nodes has dropped to only 85 milliseconds using the op-
timized algorithm, compared to over 35 seconds without
optimizations. Usage time has dropped to 1 millisec-
ond from 34 milliseconds for the same input. Overall,
the observed gain obtained using the optimization was
very significant: up to 99% in terms of the expression
size and up to 97% in terms of generation and usage
time.
9 Related Work
Several lines of work are related to the present paper.
Data and Workflow Provenance Provenance for data
transformations has been considered in many papers
(see e.g. [19,11,7,26,10,12]). Provenance was shown use-
ful in a variety of applications, including access con-
trol for data management and provisioning for database
queries [16], and automatic optimization based on hy-
pothetical reasoning for such queries [33,34]. Unique
technical challenges in our settings include accounting
for the workflow in addition to data (including treat-
ment of possibly infinitely many executions) in conjunc-
tion with the combination of different kinds of choices
(external and data-dependent). The work of [31] has
studied combination of annotations, but especially in
the context of dependency between them and only for
the relational algebra.
Workflow Provenance Different approaches for captur-
ing workflow provenance appear in the literature (e.g.
[14,13,3,29,20,38,35]), but none of these models cap-
ture semiring-based provenance for temporal analysis
results of workflow executions. Semiring-based prove-
nance for executions of data-centric workflows was stud-
ied in [5], however the development there (1) tracks
provenance along with the execution rather than sup-
porting analysis over all possible executions, (2) as-
sumes a DAG-based control flow (no cycles), and (3)
does not capture provenance for external effects.
Process analysis and semiring-weighted graphs Tempo-
ral (and specifically LTL) analysis of processes was stud-
ied extensively (see e.g. [32,37] for an overview). The
use of elements of a semiring to weigh transitions has
also been extensively explored, see e.g. [25], and sev-
eral works have further laid the foundations for param-
eterization of temporal analysis based on weights, with
or without the design of semirings that accommodate
these weights (see e.g. [30,15]). The main distinction
from our work is that the process models in these works
are not data-centric, and in particular do not involve
queries to an underlying database. The dependency on
data combined with the “external effects” on control
flow lead to multiple novel challenges in our settings.
These challenges include (1) the need to “correctly”
combine two semirings rather than weigh transitions
based on a single semiring, which is addressed in our
algebraic construction, (2) the formulation and proof of
the properties that hold for the construction and are re-
quired for the usefulness of the approach, and (3) novel
implementation issues.
W3C Prov W3C PROV-family [40] is a specification of
standard for modeling and querying provenance for the
web. Applications include explanations, assessments of
quality, reliability or trustworthiness of data item, etc.
The standard describes a range of transformations for
data on the web, such as data generation, use and re-
vision by multiple agents, as well as a formal model
for provenance validation [41] It also describes different
perspectives on data through a notion of specialization
that resembles inheritance.
We note that provenance for the analysis of possible
executions of a data-dependent process, is studied here
for the first time (through a precise algebraic notion). It
is interesting to explore what aspects described in the
W3C standard may be modeled through the semiring
framework. The aspects that “fit” the framework, may
possibly also be incorporated as part of the PADDP
model, and perhaps accounted for in the analysis. For
instance, trustworthiness may be captured by elements
of the tropical semiring (see [27]); joint or alternative
use of data by different agents may be represented by
the semiring · or + operations. Incorporating into the
semiring framework other concepts such as specializa-
tion/inheritance requires further work.
Business Processes and Web Services There is a wealth
of research, in many different lines, on the analysis of
data-aware (and data-centric) processes. This includes
data-centric Business Processes (e.g. [4,18]) and Web
Services and applications [22,2], and the analysis goals
vary from verification of fully specified processes to the
automatic synthesis of new processes combining a given
set of components. The results obtained in these lines of
research are not applicable to our needs since no prove-
nance model was proposed there. Rather, the focus was
on analyzing a concrete given instance of a (rich) pro-
cess model, and not on accounting for different weight-
ing and for hypothetical changes as in our work.
Since our process model is quite generic, many as-
pects of Business Process (BP) and workflow models
may be captured, and then our results apply to these
models (as demonstrated by our experiments). How-
ever, more work is needed to be able to capture every
Provenance-Based Analysis of Data-Centric Processes 25
aspect of the more complex distributed BP and work-
flow models accounting for communication and paral-
lelism. Developing provenance support for such models
is an intriguing direction for future work.
10 Conclusion
We have presented in this paper a semiring based prove-
nance framework for temporal analysis of data-dependent
processes. We have studied properties of the frame-
work, from theoretical and experimental perspectives,
and have demonstrated its usefulness. There are ad-
ditional important aspects of data-dependent processes
such as communication and parallelism, for which prove-
nance modeling requires further work. We believe that
the foundations laid in this paper will serve as sound
grounds for the incorporation of provenance support for
such features.
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations ofDatabases. Addison-Wesley, 1995.
2. S. Abiteboul, V. Vianu, B. Fordham, and Y. Yesha. Re-lational transducers for electronic commerce. In PODS,1998.
3. A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientificworkflow management by database management. In SS-DBM, 1998.
4. Lakhdar Akroun, Boualem Benatallah, Lhouari Nourine,and Farouk Toumani. Decidability and complexity ofsimulation preorder for data-centric web services. In IC-SOC, pages 535–542, 2014.
5. Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo,J. Stoyanovich, and V. Tannen. Putting lipstick on pig:Enabling database-style workflow provenance. PVLDB,5(4):346–357, 2011.
6. Y. Amsterdamer, D. Deutch, and V. Tannen. Provenancefor aggregate queries. In Proc. of PODS, 2011.
7. O. Benjelloun, A.D. Sarma, A.Y. Halevy, M. Theobald,and J. Widom. Databases with uncertainty and lineage.VLDB J., 17, 2008.
8. http://www.bpmn.org/.9. Janusz A. Brzozowski. Derivatives of regular expressions.
J. ACM, 11(4):481–494, October 1964.10. P. Buneman, J. Cheney, and S. Vansummeren. On the ex-
pressiveness of implicit provenance in query and updatelanguages. ACM Trans. Database Syst., 33(4), 2008.
11. P. Buneman, S. Khanna, and W.C. Tan. Why and where:A characterization of data provenance. In ICDT, 2001.
12. James Cheney, Laura Chiticariu, and Wang Chiew Tan.Provenance in databases: Why, how, and where. Foun-dations and Trends in Databases, 1(4):379–474, 2009.
13. D. Cohn and R. Hull. Business artifacts: A data-centricapproach to modeling business operations and processes.IEEE Data Eng. Bull., 32(3), 2009.
14. S. B. Davidson and J. Freire. Provenance and scientificworkflows: challenges and opportunities. In SIGMOD,2008.
15. Conrado Daws. Symbolic and parametric model checkingof discrete-time markov chains. In Theoretical Aspects ofComputing-ICTAC 2004, pages 280–294. Springer, 2005.
16. D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan:Provisioning for what-if analysis. In CIDR, 2013.
17. D. Deutch, Y. Moskovitch, and V. Tannen. Propolis:Provisioned analysis of data-centric processes (demo). InVLDB, 2013.
18. A. Deutsch, L. Sui, V. Vianu, and D. Zhou. A system forspecification and verification of interactive, data-drivenweb applications. In SIGMOD Conference, 2006.
19. Robert Fink, Larisa Han, and Dan Olteanu. Aggrega-tion in probabilistic databases via knowledge compila-tion. PVLDB, 5(5), 2012.
20. I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera:A virtual data system for representing, querying, and au-tomating data derivation. SSDBM, 2002.
21. J. Nathan Foster, Todd J. Green, and Val Tannen. An-notated xml: queries and provenance. In PODS, pages271–280, 2008.
22. X. Fu, T. Bultan, and J. Su. Wsat: A tool for formalanalysis of web services. In CAV, 2004.
23. Dimitra Giannakopoulou and Klaus Havelund.Automata-based verification of temporal propertieson running programs. In ASE, pages 412–416, 2001.
24. M. Gillmann, R. Mindermann, and G. Weikum. Bench-marking and configuration of workflow management sys-tems. In CoopIS, 2000.
25. Michel Gondran and Michel Minoux. Graphs, Dioids andSemirings: New Models and Algorithms. Springer Pub-lishing Company, Incorporated, 2008.
26. T. J. Green, G. Karvounarakis, and V. Tannen. Prove-nance semirings. In PODS, 2007.
27. Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives,and Val Tannen. Provenance in orchestra. IEEE DataEng. Bull., 33(3):9–16, 2010.
28. H. Gruber and M. Holzer. Finite automata, digraph con-nectivity, and regular expression size. In ICALP, 2008.
29. D. Hull, K. Wolstencroft, R. Stevens, C. Goble,M. Pocock, P. Li, and T. Oinn. Taverna: a tool forbuilding and running workflows of services. Nucleic AcidsRes., 34, 2006.
30. T. Hune, J. Romijn, M. Stoelinga, and F. Vaandrager.Linear parametric model checking of timed automata.Springer, 2001.
31. Egor V. Kostylev and Peter Buneman. Combining depen-dent annotations for relational algebra. In ICDT, pages196–207, 2012.
32. Z. Manna and A. Pnueli. The temporal logic of reactiveand concurrent systems - specification. Springer, 1992.
33. A. Meliou and D. Suciu. Tiresias: the database oracle forhow-to queries. In SIGMOD, 2012.
34. Alexandra Meliou, Wolfgang Gatterbauer, and Dan Su-ciu. Reverse data management. PVLDB, 4(12), 2011.
35. P. Missier, N. Paton, and K. Belhajjame. Fine-grainedand efficient lineage querying of collection-based work-flow provenance. In EDBT, 2010.
36. http://www.myexperiment.org/.37. Amir Pnueli. Applications of temporal logic to the speci-
fication and verification of reactive systems: a survey ofcurrent trends. Springer, 1986.
38. Y. L. Simhan, B. Plale, and D. Gammon. Karma2: Prove-nance management for data-driven workflows. Int. J.Web Service Res., 5(2), 2008.
39. Jeffrey D. Ullman. Principles of Database andKnowledge-Base Systems. Computer Science Press, 1989.
40. Prov-overview, w3c working group note, 2013.http://www.w3.org/TR/prov-overview/.
41. Constraints of the prov data model, w3c working groupnote, 2014. http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.