Provenance for Data Munging Environments
TRANSCRIPT
Paul Groth, Elsevier Labs (@pgroth | pgroth.com)
Information Sciences Institute – August 13, 2015
Outline
• What is data munging and why is it important?
• The role of provenance
• The reality…
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
60% of time is spent on data preparation
What to do?
[Figure: matrix of data sources against entity types (Compound, Disease, Pathway, Target, Tissue), with checkmarks marking which sources cover which types]
Open PHACTS Explorer
Open PHACTS Explorer: but where does this data come from?
Tension: integrated & summarized data vs. transparency & trust
Solution: tracking and exposing provenance*
* "a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data" (W3C PROV)
The PROV Data Model (W3C Recommendation)
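The PROV model can be illustrated with a small sketch. The plain-Python triples below use PROV-DM relation names (prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom); the artifact names (raw.csv, clean.csv, munge-run-1, alice) are hypothetical, and a real system would serialize such a record as RDF or PROV-N rather than Python tuples.

```python
# A minimal PROV-style record as (subject, relation, object) triples.
# All names here are hypothetical; only the prov:* relation names come
# from the W3C PROV data model.
prov_record = [
    ("ex:clean.csv",   "prov:wasGeneratedBy",    "ex:munge-run-1"),
    ("ex:munge-run-1", "prov:used",              "ex:raw.csv"),
    ("ex:munge-run-1", "prov:wasAssociatedWith", "ex:alice"),
    ("ex:clean.csv",   "prov:wasDerivedFrom",    "ex:raw.csv"),
]

def sources_of(artifact, triples):
    """Walk prov:wasDerivedFrom edges back to the ultimate sources."""
    direct = [o for s, r, o in triples
              if s == artifact and r == "prov:wasDerivedFrom"]
    if not direct:
        return {artifact}
    result = set()
    for d in direct:
        result |= sources_of(d, triples)
    return result

print(sources_of("ex:clean.csv", prov_record))  # {'ex:raw.csv'}
```

Even this toy record supports the key query behind transparency and trust: given an output, which inputs does it ultimately derive from?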
explorer.openphacts.org
What if you’re not a large organization?
karma.isi.edu
wings-workflows.org
The reality…
Adventures in word2vec (1)
Adventures in word2vec (2)
The model:
Adventures in word2vec (3)
Look, provenance information!
http://ivory.idyll.org/blog/replication-i.html
DESKTOP DATA MUNGING & PROVENANCE
References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14).
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
Capturing Provenance
Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual Effort
Observed Provenance
– False positives
– Semantic Gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko '12), Trio (Widom '09), Wings (Gil '11), Taverna (Oinn '06), VisTrails (Freire '06)
ES3 (Frew '08), Trec (Vahdat '98), PASSv2 (Holland '08), DTrace Tool (Gessiou '12)
Challenge
• Can we capture provenance
  – with a low false-positive ratio?
  – without manual/obtrusive integration effort?
• We have to rely on observed provenance.
State of the Art
• Observed provenance systems treat programs as black boxes.
• Can't tell if an input file was actually used.
• Can't quantify the influence of input on output.
Taint Tracking
[Figure: side-by-side illustration of taint tracking in geology and computer science]
Our solution: DataTracker
• Captures high-fidelity provenance using taint tracking.
• Key building blocks:
  – libdft (Kemerlis '12): reusable taint-tracking framework.
  – Intel Pin (Luk '05): dynamic instrumentation framework.
• Does not require modification of applications.
• Does not require knowledge of application semantics.
DataTracker Architecture
Evaluation: tackling the n×m problem
• DataTracker is able to track the actual use of the input data.
• Reading data ≠ using data.
• Eliminates false positives present in other observed provenance capture methods.
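A toy sketch of why taint tracking distinguishes "read" from "use" (this is not DataTracker's actual mechanism, which operates at the instruction level via libdft and Pin): each value carries the set of input labels that actually influenced it, so an input that is read into memory but never flows into the output contributes no taint downstream.

```python
# Toy taint propagation: labels flow only through actual data use.
class Tainted:
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Any operation on two tainted values unions their label sets.
        return Tainted(self.value + other.value, self.labels | other.labels)

a = Tainted(10, {"input-A"})
b = Tainted(32, {"input-B"})
c = Tainted(0,  {"input-C"})   # read into memory but never used

out = a + b                    # only A and B actually flow to the output
print(out.value, sorted(out.labels))  # 42 ['input-A', 'input-B']
```

A black-box observer that only watches file reads would report input-C as a source of `out`; byte-level taint tracking correctly excludes it, which is exactly the false positive being eliminated above.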
Evaluation: vim
DataTracker attributes individual bytes of the output to the input.
Demo video: http://bit.ly/dtracker-demo
Can we do "good enough"?
• Can taint tracking
  a. become an "always-on" feature?
  b. be turned on for all running processes?
• What if we want to also run other analysis code?
• Can we pre-determine the right analysis code?
Re-execution
A common tactic in provenance:
• DB: reenactment queries (Glavic '14)
• Distributed systems: Chimera (Foster '02), Hadoop (Logothetis '13), DistTape (Zhao '12)
• Workflows: Pegasus (Groth '09)
• PL: slicing (Perera '12)
• OS: pTrace (Guo '11)
• Desktop: Excel (Asuncion '11)
Record and Replay
Methodology
• Selection
• Provenance analysis
• Instrumentation
• Execution capture
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14)
• Based on the QEMU virtualization platform.
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
  – An initial RAM snapshot.
  – Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
  – Virtual device state is not logged, so a replay cannot "go live".
[Figure: PANDA recording: CPU, RAM, and non-deterministic inputs (interrupts, DMA) are captured as an initial RAM snapshot plus a non-determinism log, which together form the PANDA execution trace]
Prototype Implementation (3/3)
• Analysis plugins
  – Read-only access to the VM state.
  – Invoked per instruction, memory access, context switch, etc.
  – Can be combined to implement complex functionality.
  – Examples: OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: replay-time analysis: the PANDA execution trace is replayed through the emulated CPU/RAM; analysis plugins A, B, and C observe the replay with read-only access and emit provenance into a triple store]
OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
  – Execute code inside the guest OS.
  – Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
The PROV-Tracer Plugin
• Registers for process creation/destruction events.
• Decodes executed system calls.
• Keeps track of which files are used as input/output by each process.
• Emits provenance in an intermediate format when a process terminates.
More Analysis Plugins
• ProcStrMatch plugin.
  – Which processes contained string S in their memory?
• Other possible types of analysis:
  – Taint tracking
  – Dynamic slicing
Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional 1.1x–1.2x slowdown.
Virtualization is the dominant overhead factor.
Overhead (again) (2/2)
• QEMU is a suboptimal virtualization option.
• ReVirt, User Mode Linux (Dunlap '02)
  – Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace, VMWare (Xu '07)
  – Slowdown: 1.05x–2.6x rec. + ??? virt.
Virtualization slowdown is considered acceptable. Recording overhead is fairly low.
Storage Requirements
• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt '14): 17–915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches can reduce/manage storage requirements: compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today's technology.
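The O(10MB/min) figure can be turned into a quick feasibility estimate. The rate is taken from the slide; the arithmetic is illustrative only:

```python
# Back-of-the-envelope storage estimate for 24/7 PANDA-style recording.
mb_per_min = 10                                   # uncompressed rate from the slide
per_day_gb = mb_per_min * 60 * 24 / 1024          # minutes/day -> GB
per_month_tb = per_day_gb * 30 / 1024             # 30-day month -> TB
print(f"{per_day_gb:.1f} GB/day, {per_month_tb:.2f} TB/month uncompressed")
# 14.1 GB/day, 0.41 TB/month uncompressed
```

Under half a terabyte a month, before compression or rotation, is well within a single commodity disk, which is consistent with the slide's claim that always-on recording is within the limits of today's technology.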
Highlights
• Taint-tracking analysis is a powerful method for capturing provenance.
  – Eliminates many false positives.
  – Tackles the "n×m problem".
• Decoupling provenance analysis from execution is possible through VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
DATABASE DATA MUNGING & PROVENANCE
References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. World Wide Web Conference 2014.
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. Executing Provenance-Enabled Queries over Web Data. World Wide Web Conference 2015.
RDF is great for munging data
➢ Ability to arbitrarily add new information (schemaless)
➢ Syntaxes make it easy to concatenate new data
➢ Information has a well-defined structure
➢ Identifiers are distributed but controlled
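These properties can be made concrete with a small sketch: merging two schemaless sources is just set union over triples, and a new predicate needs no schema migration. The URIs and values below are hypothetical, and plain tuples stand in for a real RDF store.

```python
# Two RDF-like sources as sets of (subject, predicate, object) triples.
source_a = {
    ("ex:eiffel", "rdfs:label",   "Eiffel Tower"),
    ("ex:eiffel", "ex:inCountry", "ex:FR"),
}
source_b = {
    ("ex:eiffel", "geo:lat",  "48.8584"),
    ("ex:eiffel", "geo:long", "2.2945"),   # new predicate, no schema change
}

merged = source_a | source_b               # concatenation is just set union
lat = next(o for s, p, o in merged if p == "geo:lat")
print(len(merged), lat)                    # 4 48.8584
```

The distributed-but-controlled identifier (`ex:eiffel`) is what lets the two sources merge cleanly without any alignment step.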
What's the provenance of my query result?
Graph-based query:
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
  graph ?g1 { ?a [] "Eiffel Tower" . }
  graph ?g2 { ?a inCountry FR . }
  graph ?g3 { ?a lat ?lat . }
  graph ?g4 { ?a long ?long . }
}
[Result: every (lat, long) binding is repeated for each combination of named-graph lineages, from (l1, l2, l4, l4) through (l3, l3, l5, l5): 24 rows for a single answer]
Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
Polynomial Operators
➢ Union (⊕)
  ○ a constraint or projection satisfied by multiple sources
  ○ multiple entities satisfy a set of constraints or projections
  ○ example: l1 ⊕ l2 ⊕ l3
➢ Join (⊗)
  ○ sources joined to handle a constraint or a projection
  ○ OS and OO joins between sets of constraints
  ○ example: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
Example Polynomial
select ?lat ?long where {
  ?a [] "Eiffel Tower" .
  ?a inCountry FR .
  ?a lat ?lat .
  ?a long ?long . }
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
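The ⊕ and ⊗ operators can be sketched by representing a polynomial in sum-of-products form as a set of alternative source combinations. This is a simplification for illustration, not how TripleProv physically annotates results: ⊕ is union over alternatives, and ⊗ combines every alternative on one side with every alternative on the other.

```python
# Provenance polynomials in sum-of-products form: a polynomial is a set of
# frozensets, each frozenset being one combination of sources that can
# explain the result.
def union(p, q):                 # the ⊕ operator
    return p | q

def join(p, q):                  # the ⊗ operator
    return {a | b for a in p for b in q}

def src(label):                  # a single source lineage, e.g. l1
    return {frozenset([label])}

# (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5): each result row is explained by one source
# from the first group joined with one from the second.
poly = join(union(union(src("l1"), src("l2")), src("l3")),
            union(src("l4"), src("l5")))
print(len(poly))   # 6 alternative source combinations (3 × 2)
```

This makes the combinatorics of the earlier query result concrete: four constraint groups with 3, 2, 2, and 2 alternatives respectively yield 3 × 2 × 2 × 2 = 24 lineage combinations for one answer.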
System Architecture
Experiments
How expensive is it to trace provenance?
What is the overhead on query execution time?
Datasets
➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
Workloads
➢ 8 queries defined for BTC
  ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
Results
[Figure: overhead of tracking provenance compared to the vanilla version of the system on the BTC dataset, for four storage models: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
TripleProv: Query Execution Pipeline
Input: a provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
Output: the workload query results, restricted to those derived from the data specified by the provenance query
Experiments
What is the most efficient query execution strategy for provenance-enabled queries?
Datasets
➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC), such that the provenance queries do not modify the result sets of the workload queries
Results for BTC
➢ Full Materialization: 44x faster than the vanilla version of the system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization
➢ The Query Rewriting and Post-Filtering strategies perform significantly slower
Data Analysis
➢ How many context values refer to how many triples? How selective is it?
➢ 6,819,826 unique context values in the BTC dataset.
➢ The majority of context values are highly selective.
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
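Multiplying the reported averages gives a rough cross-check (my own arithmetic, not a claim from the slides): ~6.8M context values at 5.8 triples each is on the order of the ~40M-triple sampled subsets used in the second set of experiments.

```python
# Cross-check: do the reported selectivity averages account for the dataset size?
context_values = 6_819_826        # unique context values in BTC (from the slide)
triples_per_context = 5.8         # average selectivity (from the slide)

approx_triples = context_values * triples_per_context
print(f"{approx_triples / 1e6:.1f}M triples")   # ≈ 39.6M
```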
DECLARATIVE DATA MUNGING (?)
References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein. foxPSL: A Fast, Optimized and eXtended PSL Implementation. International Journal of Approximate Reasoning (2015).
Why logic?
- Concise & natural way to represent relations
- Declarative representation:
  - Can reuse, extend, and combine rules
  - Experts can write rules
- First-order logic:
  - Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)
Let the reasoner munge the data.
See Sebastian Riedel et al.'s work on pushing more NLP problems into the reasoner: http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
Statistical Relational Learning
● Several flavors:
  ○ Markov Logic Networks
  ○ Bayesian Logic Programs
  ○ Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010]
● PSL has been successfully applied to:
  ○ entity resolution, link prediction
  ○ ontology alignment, knowledge graph identification
  ○ computer vision, trust propagation, …
Probabilistic Soft Logic (PSL)
● Probabilistic logic with soft truth values ∈ [0,1]
  friends(anna, bob) = 0.8
  votes(anna, demo) = 0.99
● Weighted rules:
  [weight = 0.7] friends(A,B) && votes(A,P) => votes(B,P)
● Inference as constrained convex minimization:
  votes(bob, demo) = 0.8
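PSL's soft semantics can be sketched with Lukasiewicz operators. The snippet below shows only how a single weighted ground rule is scored; full PSL inference minimizes the weighted distances to satisfaction over all ground rules via convex optimization, which this sketch does not do.

```python
# Lukasiewicz soft-logic scoring of one ground PSL rule, using the
# atom values from the slide.
def l_and(a, b):
    """Lukasiewicz conjunction: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(weight, body, head):
    """Penalty incurred when the rule body exceeds the head's truth value."""
    return weight * max(0.0, body - head)

friends_anna_bob = 0.8
votes_anna_demo = 0.99
votes_bob_demo = 0.8        # the value inferred on the slide

# Rule: [0.7] friends(A,B) && votes(A,P) => votes(B,P)
body = l_and(friends_anna_bob, votes_anna_demo)            # 0.79
penalty = distance_to_satisfaction(0.7, body, votes_bob_demo)
print(round(body, 2), round(penalty, 3))                   # 0.79 0.0
```

With votes(bob, demo) = 0.8 the rule body (0.79) does not exceed the head, so the rule is fully satisfied and contributes zero penalty, which is why the optimizer can settle on that value.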
FoxPSL: Fast Optimized eXtended PSL
• Classes, existential quantification (∃), partially grounded rules, optimizations
• DSL: the FoxPSL language
Experiments: comparison with ACO
• SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
• ACO: an implementation of consensus optimization on GraphLab, used for grounded PSL
Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
  • Provenance by stealth (ack. Carole Goble)
  • Separate provenance analysis from instrumentation
• Future:
  • The computer should do the work
Future Research
• Explore optimizations of taint tracking for capturing provenance.
• Provenance analysis of real-world traces (e.g. from rrshare.org).
• Tracking provenance across environments.
• Traces/logs as a central provenance primitive.
• Declarative data munging.