Provenance for Data Munging Environments
TRANSCRIPT
Paul Groth, Elsevier Labs (@pgroth | pgroth.com)
Information Sciences Institute – August 13, 2015
Outline
• What is data munging and why is it important?
• The role of provenance
• The reality…
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
60% of time is spent on data preparation
What to do?
[Figure: matrix of data sources against entity types (Compound, Disease, Pathway, Target, Tissue), with checkmarks marking which sources cover which types]
Open PHACTS Explorer
Open PHACTS Explorer: but where does this data come from?
Tension: integrated & summarized data vs. transparency & trust
Solution: tracking and exposing provenance*
* "a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data" (W3C PROV)
The PROV Data Model (W3C Recommendation)
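The PROV model can be illustrated with a small sketch. The plain-Python triples below use PROV-DM relation names (prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom); the artifact names (raw.csv, clean.csv, munge-run-1, alice) are hypothetical, and a real system would serialize such a record as RDF or PROV-N rather than Python tuples.

```python
# A minimal PROV-style record as (subject, relation, object) triples.
# All names here are hypothetical; only the prov:* relation names come
# from the W3C PROV data model.
prov_record = [
    ("ex:clean.csv",   "prov:wasGeneratedBy",    "ex:munge-run-1"),
    ("ex:munge-run-1", "prov:used",              "ex:raw.csv"),
    ("ex:munge-run-1", "prov:wasAssociatedWith", "ex:alice"),
    ("ex:clean.csv",   "prov:wasDerivedFrom",    "ex:raw.csv"),
]

def sources_of(artifact, triples):
    """Walk prov:wasDerivedFrom edges back to the ultimate sources."""
    direct = [o for s, r, o in triples
              if s == artifact and r == "prov:wasDerivedFrom"]
    if not direct:
        return {artifact}
    result = set()
    for d in direct:
        result |= sources_of(d, triples)
    return result

print(sources_of("ex:clean.csv", prov_record))  # {'ex:raw.csv'}
```

Even this toy record supports the key query behind transparency and trust: given an output, which inputs does it ultimately derive from?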
explorer.openphacts.org
What if you’re not a large organization?
karma.isi.edu
wings-workflows.org
The reality…
Adventures in word2vec (1)
Adventures in word2vec (2)
The model:
Adventures in word2vec (3)
Look, provenance information!
http://ivory.idyll.org/blog/replication-i.html
DESKTOP DATA MUNGING & PROVENANCE
References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14).
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
Capturing Provenance
Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual Effort
Observed Provenance
– False positives
– Semantic Gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko '12), Trio (Widom '09), Wings (Gil '11), Taverna (Oinn '06), VisTrails (Freire '06)
ES3 (Frew '08), Trec (Vahdat '98), PASSv2 (Holland '08), DTrace Tool (Gessiou '12)
Challenge
• Can we capture provenance
  – with a low false-positive ratio?
  – without manual/obtrusive integration effort?
• We have to rely on observed provenance.
State of the Art
• Observed provenance systems treat programs as black boxes.
• Can't tell if an input file was actually used.
• Can't quantify the influence of input on output.
Taint Tracking
[Figure: side-by-side illustration of taint tracking in geology and computer science]
Our solution: DataTracker
• Captures high-fidelity provenance using taint tracking.
• Key building blocks:
  – libdft (Kemerlis '12): reusable taint-tracking framework.
  – Intel Pin (Luk '05): dynamic instrumentation framework.
• Does not require modification of applications.
• Does not require knowledge of application semantics.
DataTracker Architecture
Evaluation: tackling the n×m problem
• DataTracker is able to track the actual use of the input data.
• Reading data ≠ using data.
• Eliminates false positives present in other observed provenance capture methods.
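A toy sketch of why taint tracking distinguishes "read" from "use" (this is not DataTracker's actual mechanism, which operates at the instruction level via libdft and Pin): each value carries the set of input labels that actually influenced it, so an input that is read into memory but never flows into the output contributes no taint downstream.

```python
# Toy taint propagation: labels flow only through actual data use.
class Tainted:
    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Any operation on two tainted values unions their label sets.
        return Tainted(self.value + other.value, self.labels | other.labels)

a = Tainted(10, {"input-A"})
b = Tainted(32, {"input-B"})
c = Tainted(0,  {"input-C"})   # read into memory but never used

out = a + b                    # only A and B actually flow to the output
print(out.value, sorted(out.labels))  # 42 ['input-A', 'input-B']
```

A black-box observer that only watches file reads would report input-C as a source of `out`; byte-level taint tracking correctly excludes it, which is exactly the false positive being eliminated above.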
Evaluation: vim
DataTracker attributes individual bytes of the output to the input.
Demo video: http://bit.ly/dtracker-demo
Can we do "good enough"?
• Can taint tracking
  a. become an "always-on" feature?
  b. be turned on for all running processes?
• What if we want to also run other analysis code?
• Can we pre-determine the right analysis code?
Re-execution
A common tactic in provenance:
• DB: reenactment queries (Glavic '14)
• Distributed systems: Chimera (Foster '02), Hadoop (Logothetis '13), DistTape (Zhao '12)
• Workflows: Pegasus (Groth '09)
• PL: slicing (Perera '12)
• OS: pTrace (Guo '11)
• Desktop: Excel (Asuncion '11)
Record and Replay
Methodology
• Selection
• Provenance analysis
• Instrumentation
• Execution capture
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14)
• Based on the QEMU virtualization platform.
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
  – An initial RAM snapshot.
  – Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
  – Virtual device state is not logged, so a replay cannot "go live".
[Figure: PANDA recording: CPU, RAM, and non-deterministic inputs (interrupts, DMA) are captured as an initial RAM snapshot plus a non-determinism log, which together form the PANDA execution trace]
Prototype Implementation (3/3)
• Analysis plugins
  – Read-only access to the VM state.
  – Invoked per instruction, memory access, context switch, etc.
  – Can be combined to implement complex functionality.
  – Examples: OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: replay-time analysis: the PANDA execution trace is replayed through the emulated CPU/RAM; analysis plugins A, B, and C observe the replay with read-only access and emit provenance into a triple store]
OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
  – Execute code inside the guest OS.
  – Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
The PROV-Tracer Plugin
• Registers for process creation/destruction events.
• Decodes executed system calls.
• Keeps track of which files are used as input/output by each process.
• Emits provenance in an intermediate format when a process terminates.
More Analysis Plugins
• ProcStrMatch plugin.
  – Which processes contained string S in their memory?
• Other possible types of analysis:
  – Taint tracking
  – Dynamic slicing
Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional 1.1x–1.2x slowdown.
Virtualization is the dominant overhead factor.
Overhead (again) (2/2)
• QEMU is a suboptimal virtualization option.
• ReVirt, User Mode Linux (Dunlap '02)
  – Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace, VMWare (Xu '07)
  – Slowdown: 1.05x–2.6x rec. + ??? virt.
Virtualization slowdown is considered acceptable. Recording overhead is fairly low.
Storage Requirements
• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt '14): 17–915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches can reduce/manage storage requirements: compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today's technology.
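The O(10MB/min) figure can be turned into a quick feasibility estimate. The rate is taken from the slide; the arithmetic is illustrative only:

```python
# Back-of-the-envelope storage estimate for 24/7 PANDA-style recording.
mb_per_min = 10                                   # uncompressed rate from the slide
per_day_gb = mb_per_min * 60 * 24 / 1024          # minutes/day -> GB
per_month_tb = per_day_gb * 30 / 1024             # 30-day month -> TB
print(f"{per_day_gb:.1f} GB/day, {per_month_tb:.2f} TB/month uncompressed")
# 14.1 GB/day, 0.41 TB/month uncompressed
```

Under half a terabyte a month, before compression or rotation, is well within a single commodity disk, which is consistent with the slide's claim that always-on recording is within the limits of today's technology.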
Highlights
• Taint-tracking analysis is a powerful method for capturing provenance.
  – Eliminates many false positives.
  – Tackles the "n×m problem".
• Decoupling provenance analysis from execution is possible through VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
DATABASE DATA MUNGING & PROVENANCE
References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. World Wide Web Conference 2014.
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. Executing Provenance-Enabled Queries over Web Data. World Wide Web Conference 2015.
RDF is great for munging data
➢ Ability to arbitrarily add new information (schemaless)
➢ Syntaxes make it easy to concatenate new data
➢ Information has a well-defined structure
➢ Identifiers are distributed but controlled
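These properties can be made concrete with a small sketch: merging two schemaless sources is just set union over triples, and a new predicate needs no schema migration. The URIs and values below are hypothetical, and plain tuples stand in for a real RDF store.

```python
# Two RDF-like sources as sets of (subject, predicate, object) triples.
source_a = {
    ("ex:eiffel", "rdfs:label",   "Eiffel Tower"),
    ("ex:eiffel", "ex:inCountry", "ex:FR"),
}
source_b = {
    ("ex:eiffel", "geo:lat",  "48.8584"),
    ("ex:eiffel", "geo:long", "2.2945"),   # new predicate, no schema change
}

merged = source_a | source_b               # concatenation is just set union
lat = next(o for s, p, o in merged if p == "geo:lat")
print(len(merged), lat)                    # 4 48.8584
```

The distributed-but-controlled identifier (`ex:eiffel`) is what lets the two sources merge cleanly without any alignment step.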
What's the provenance of my query result?
Graph-based query:
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
  graph ?g1 { ?a [] "Eiffel Tower" . }
  graph ?g2 { ?a inCountry FR . }
  graph ?g3 { ?a lat ?lat . }
  graph ?g4 { ?a long ?long . }
}
[Result: every (lat, long) binding is repeated for each combination of named-graph lineages, from (l1, l2, l4, l4) through (l3, l3, l5, l5): 24 rows for a single answer]
Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
Polynomial Operators
➢ Union (⊕)
  ○ a constraint or projection satisfied by multiple sources
  ○ multiple entities satisfy a set of constraints or projections
  ○ example: l1 ⊕ l2 ⊕ l3
➢ Join (⊗)
  ○ sources joined to handle a constraint or a projection
  ○ OS and OO joins between sets of constraints
  ○ example: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
Example Polynomial
select ?lat ?long where {
  ?a [] "Eiffel Tower" .
  ?a inCountry FR .
  ?a lat ?lat .
  ?a long ?long . }
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
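The ⊕ and ⊗ operators can be sketched by representing a polynomial in sum-of-products form as a set of alternative source combinations. This is a simplification for illustration, not how TripleProv physically annotates results: ⊕ is union over alternatives, and ⊗ combines every alternative on one side with every alternative on the other.

```python
# Provenance polynomials in sum-of-products form: a polynomial is a set of
# frozensets, each frozenset being one combination of sources that can
# explain the result.
def union(p, q):                 # the ⊕ operator
    return p | q

def join(p, q):                  # the ⊗ operator
    return {a | b for a in p for b in q}

def src(label):                  # a single source lineage, e.g. l1
    return {frozenset([label])}

# (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5): each result row is explained by one source
# from the first group joined with one from the second.
poly = join(union(union(src("l1"), src("l2")), src("l3")),
            union(src("l4"), src("l5")))
print(len(poly))   # 6 alternative source combinations (3 × 2)
```

This makes the combinatorics of the earlier query result concrete: four constraint groups with 3, 2, 2, and 2 alternatives respectively yield 3 × 2 × 2 × 2 = 24 lineage combinations for one answer.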
System Architecture
Experiments
How expensive is it to trace provenance?
What is the overhead on query execution time?
Datasets
➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
Workloads
➢ 8 queries defined for BTC
  ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
Results
[Figure: overhead of tracking provenance compared to the vanilla version of the system on the BTC dataset, for four storage models: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
TripleProv: Query Execution Pipeline
Input: a provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
Output: the workload query results, restricted to those derived from the data specified by the provenance query
Experiments
What is the most efficient query execution strategy for provenance-enabled queries?
Datasets
➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC), such that the provenance queries do not modify the result sets of the workload queries
Results for BTC
➢ Full Materialization: 44x faster than the vanilla version of the system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization
➢ The Query Rewriting and Post-Filtering strategies perform significantly slower
Data Analysis
➢ How many context values refer to how many triples? How selective is it?
➢ 6,819,826 unique context values in the BTC dataset.
➢ The majority of context values are highly selective.
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
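Multiplying the reported averages gives a rough cross-check (my own arithmetic, not a claim from the slides): ~6.8M context values at 5.8 triples each is on the order of the ~40M-triple sampled subsets used in the second set of experiments.

```python
# Cross-check: do the reported selectivity averages account for the dataset size?
context_values = 6_819_826        # unique context values in BTC (from the slide)
triples_per_context = 5.8         # average selectivity (from the slide)

approx_triples = context_values * triples_per_context
print(f"{approx_triples / 1e6:.1f}M triples")   # ≈ 39.6M
```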
DECLARATIVE DATA MUNGING (?)
References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein. foxPSL: A Fast, Optimized and eXtended PSL Implementation. International Journal of Approximate Reasoning (2015).
Why logic?
- Concise & natural way to represent relations
- Declarative representation:
  - Can reuse, extend, and combine rules
  - Experts can write rules
- First-order logic:
  - Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)
Let the reasoner munge the data.
See Sebastian Riedel et al.'s work on pushing more NLP problems into the reasoner: http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
Statistical Relational Learning
● Several flavors:
  ○ Markov Logic Networks
  ○ Bayesian Logic Programs
  ○ Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010]
● PSL has been successfully applied to:
  ○ entity resolution, link prediction
  ○ ontology alignment, knowledge graph identification
  ○ computer vision, trust propagation, …
Probabilistic Soft Logic (PSL)
● Probabilistic logic with soft truth values ∈ [0,1]
  friends(anna, bob) = 0.8
  votes(anna, demo) = 0.99
● Weighted rules:
  [weight = 0.7] friends(A,B) && votes(A,P) => votes(B,P)
● Inference as constrained convex minimization:
  votes(bob, demo) = 0.8
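PSL's soft semantics can be sketched with Lukasiewicz operators. The snippet below shows only how a single weighted ground rule is scored; full PSL inference minimizes the weighted distances to satisfaction over all ground rules via convex optimization, which this sketch does not do.

```python
# Lukasiewicz soft-logic scoring of one ground PSL rule, using the
# atom values from the slide.
def l_and(a, b):
    """Lukasiewicz conjunction: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(weight, body, head):
    """Penalty incurred when the rule body exceeds the head's truth value."""
    return weight * max(0.0, body - head)

friends_anna_bob = 0.8
votes_anna_demo = 0.99
votes_bob_demo = 0.8        # the value inferred on the slide

# Rule: [0.7] friends(A,B) && votes(A,P) => votes(B,P)
body = l_and(friends_anna_bob, votes_anna_demo)            # 0.79
penalty = distance_to_satisfaction(0.7, body, votes_bob_demo)
print(round(body, 2), round(penalty, 3))                   # 0.79 0.0
```

With votes(bob, demo) = 0.8 the rule body (0.79) does not exceed the head, so the rule is fully satisfied and contributes zero penalty, which is why the optimizer can settle on that value.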
FoxPSL: Fast Optimized eXtended PSL
• Classes, existential quantification (∃), partially grounded rules, optimizations
• DSL: the FoxPSL language
Experiments: comparison with ACO
• SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
• ACO: an implementation of consensus optimization on GraphLab, used for grounded PSL
Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
  • Provenance by stealth (ack. Carole Goble)
  • Separate provenance analysis from instrumentation
• Future:
  • The computer should do the work
Future Research
• Explore optimizations of taint tracking for capturing provenance.
• Provenance analysis of real-world traces (e.g. from rrshare.org).
• Tracking provenance across environments.
• Traces/logs as a central provenance primitive.
• Declarative data munging.