
Page 1: Provenance for Data Munging Environments

Paul Groth, Elsevier Labs (@pgroth | pgroth.com)

Provenance for Data Munging Environments

Information Sciences Institute – August 13, 2015

Page 2: Provenance for Data Munging Environments

Outline

• What is data munging and why is it important?

• The role of provenance
• The reality…
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)

Page 3: Provenance for Data Munging Environments
Page 4: Provenance for Data Munging Environments

60% of time is spent on data preparation

Page 5: Provenance for Data Munging Environments

What to do?

Page 6: Provenance for Data Munging Environments

[Diagram: integrating data sources: Compound, Disease, Pathway, Target, Tissue]

Page 7: Provenance for Data Munging Environments


Open PHACTS Explorer

Page 8: Provenance for Data Munging Environments


Open PHACTS Explorer

?

Page 9: Provenance for Data Munging Environments

Tension:

Integrated & Summarized Data

vs.

Transparency & Trust

Page 10: Provenance for Data Munging Environments

Solution: Tracking and exposing provenance*

* “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data”

The PROV Data Model (W3C Recommendation)
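As an aside (not from the talk): a minimal sketch of building such a record with the Python prov package; the namespace and the dataset/activity/agent names are made up for illustration.

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')

    # Hypothetical munging step: an analyst cleans a raw dataset
    raw = doc.entity('ex:raw-dataset')
    clean = doc.entity('ex:cleaned-dataset')
    cleaning = doc.activity('ex:cleaning')
    analyst = doc.agent('ex:analyst')

    doc.used(cleaning, raw)                   # the activity read the raw data
    doc.wasGeneratedBy(clean, cleaning)       # ...and produced the cleaned data
    doc.wasAssociatedWith(cleaning, analyst)  # who was responsible
    doc.wasDerivedFrom(clean, raw)

    print(doc.get_provn())                    # serialize the record as PROV-N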

Page 11: Provenance for Data Munging Environments

explorer.openphacts.org

Page 12: Provenance for Data Munging Environments
Page 13: Provenance for Data Munging Environments

What if you’re not a large organization?

Page 14: Provenance for Data Munging Environments

karma.isi.edu

Page 15: Provenance for Data Munging Environments

wings-workflows.org

Page 16: Provenance for Data Munging Environments

The reality…

Page 17: Provenance for Data Munging Environments

Adventures in word2vec (1)

Page 18: Provenance for Data Munging Environments

Adventures in word2vec (2)

Page 19: Provenance for Data Munging Environments

The model:

Adventures in word2vec (3)

Page 20: Provenance for Data Munging Environments

The model:

Adventures in word2vec (3)

Look, provenance information!

Page 21: Provenance for Data Munging Environments

http://ivory.idyll.org/blog/replication-i.html

Page 22: Provenance for Data Munging Environments

DESKTOP DATA MUNGING & PROVENANCE

Page 23: Provenance for Data Munging Environments

References

Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14).

Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).

Page 24: Provenance for Data Munging Environments

Capturing Provenance

Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual effort
Examples: CPL (Macko '12), Trio (Widom '09), Wings (Gil '11), Taverna (Oinn '06), VisTrails (Freire '06)

Observed Provenance
– False positives
– Semantic gap
+ Non-intrusive
+ Minimal manual effort
Examples: ES3 (Frew '08), TREC (Vahdat '98), PASSv2 (Holland '08), DTrace tool (Gessiou '12)

Page 25: Provenance for Data Munging Environments

Challenge

• Can we capture provenance
  – with a low false-positive ratio?
  – without manual/obtrusive integration effort?

• We have to rely on observed provenance.

Page 26: Provenance for Data Munging Environments

State of the art

• Observed provenance systems treat programs as black boxes.

• Can't tell if an input file was actually used.
• Can't quantify the influence of inputs on outputs.

Page 27: Provenance for Data Munging Environments

27

Taint Tracking

Geology Computer Science

Page 28: Provenance for Data Munging Environments

28

Our solution: DataTracker

• Captures high-fidelity provenance using Taint Tracking.

• Key building blocks:– libdft (Kemerlis ‘12) ➞ Reusable taint-tracking

framework.– Intel Pin (Luk ‘05) ➞ Dynamic instrumentation

framework.

• Does not require modification of applications.• Does not require knowledge of application

semantics.
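To illustrate the idea only (a toy in Python, not libdft/DataTracker, which work at the instruction level): byte-level taint tracking tags every input byte with its origin and propagates the tags through each operation, so output bytes can be attributed to the exact input bytes that influenced them.

    # Toy taint propagation: each byte carries the set of (source, offset) tags
    # that influenced it.
    def read_input(data, source):
        return [(b, {(source, i)}) for i, b in enumerate(data)]

    def uppercase(tainted):
        # Output bytes inherit the taint of the bytes they were computed from.
        return [(bytes([b]).upper()[0], tags) for b, tags in tainted]

    doc = read_input(b"eiffel tower", "cities.txt")
    out = uppercase(doc[:6]) + read_input(b"!", "punct.txt")

    for byte, tags in out:
        print(chr(byte), sorted(tags))   # each output byte maps back to (file, offset)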

Page 29: Provenance for Data Munging Environments


DataTracker Architecture

Page 30: Provenance for Data Munging Environments

Evaluation: tackling the n×m problem

• DataTracker is able to track the actual use of the input data.

• Read data ≠ Use data.

• Eliminates false positives present in other observed provenance capture methods.

Page 31: Provenance for Data Munging Environments


Evaluation: vim

DataTracker attributes individual bytes of the output to the input.

Demo video: http://bit.ly/dtracker-demo

Page 32: Provenance for Data Munging Environments

Can we do "good enough"?

• Can taint tracking
  a. become an "always-on" feature?
  b. be turned on for all running processes?

• What if we want to also run other analysis code?

• Can we pre-determine the right analysis code?

Page 33: Provenance for Data Munging Environments

Re-execution

Common tactic in provenance:

• DB: reenactment queries (Glavic '14)
• DistSys: Chimera (Foster '02), Hadoop (Logothetis '13), DistTape (Zhao '12)
• Workflows: Pegasus (Groth '09)
• PL: slicing (Perera '12)
• OS: ptrace (Guo '11)
• Desktop: Excel (Asuncion '11)

Page 34: Provenance for Data Munging Environments


Record and Replay

Page 35: Provenance for Data Munging Environments

Methodology

[Diagram: methodology stages: Execution Capture, Selection, Instrumentation, Provenance Analysis]

Page 36: Provenance for Data Munging Environments

Prototype Implementation (1/3)

• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14)

• Based on the QEMU virtualization platform.

Page 37: Provenance for Data Munging Environments

Prototype Implementation (2/3)

• PANDA logs self-contained execution traces:
  – an initial RAM snapshot
  – non-deterministic inputs

• Logging happens at the virtual CPU I/O ports.
  – Virtual device state is not logged, so a replay cannot "go live".

[Diagram: CPU inputs, interrupts, and DMA are written to a non-determinism log; together with the initial RAM snapshot they form the PANDA execution trace.]

Page 38: Provenance for Data Munging Environments

Prototype Implementation (3/3)

• Analysis plugins
  – Read-only access to the VM state.
  – Invoked per instruction, memory access, context switch, etc.
  – Can be combined to implement complex functionality.
  – Examples: OSI Linux, PROV-Tracer, ProcStrMatch.

• Debian Linux guest.
• Provenance is stored as PROV/RDF triples and queried with SPARQL (example query below).

[Diagram: the PANDA execution trace is replayed through PANDA; plugins A, B, and C observe CPU and RAM state and write provenance into a triple store.]
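For example, once the triples are in a store, a question like "which files did each recorded process use?" becomes a small SPARQL query. The sketch below uses rdflib and the standard PROV vocabulary, with a hypothetical provenance.ttl dump rather than PROV-Tracer's actual output.

    import rdflib

    g = rdflib.Graph()
    g.parse("provenance.ttl", format="turtle")   # hypothetical export of the triple store

    query = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?process ?file WHERE {
        ?process a prov:Activity ;
                 prov:used ?file .
    }
    """
    for process, used_file in g.query(query):
        print(process, "used", used_file)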

Page 39: Provenance for Data Munging Environments

OS Introspection

• What processes are currently executing?
• Which libraries are used?
• What files are used?

• Possible approaches:
  – Execute code inside the guest OS.
  – Reproduce guest-OS semantics purely from the hardware state (RAM/registers).

Page 40: Provenance for Data Munging Environments

The PROV-Tracer Plugin

• Registers for process creation/destruction events.
• Decodes executed system calls.
• Keeps track of what files are used as input/output by each process.
• Emits provenance in an intermediate format when a process terminates.
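Schematically, the bookkeeping amounts to something like the sketch below (illustrative Python, not the actual PANDA plugin; event names and fields are placeholders).

    from collections import defaultdict

    # Files observed as inputs/outputs per process id
    inputs, outputs = defaultdict(set), defaultdict(set)

    def on_syscall(pid, syscall, path):
        # decoded system calls tell us which files a process touches
        if syscall in ("open_read", "read"):
            inputs[pid].add(path)
        elif syscall in ("open_write", "write"):
            outputs[pid].add(path)

    def on_process_exit(pid, exe):
        # emit provenance in an intermediate form when the process terminates
        for path in inputs.pop(pid, set()):
            print(f"activity:{exe}[{pid}] used entity:{path}")
        for path in outputs.pop(pid, set()):
            print(f"entity:{path} wasGeneratedBy activity:{exe}[{pid}]")

    on_syscall(42, "open_read", "/data/raw.csv")
    on_syscall(42, "open_write", "/data/clean.csv")
    on_process_exit(42, "clean.py")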

Page 41: Provenance for Data Munging Environments

More Analysis Plugins

• ProcStrMatch plugin
  – Which processes contained string S in their memory? (see the toy sketch below)

• Other possible types of analysis:
  – Taint tracking
  – Dynamic slicing
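In spirit, the ProcStrMatch plugin does something like this toy sketch (the per-process memory contents here are assumed; the real plugin inspects RAM exposed during replay).

    # "Which processes contained string S in their memory?"
    memory_snapshots = {                      # stand-in for per-process RAM contents
        1001: b"...model trained on corpus.txt...",
        1002: b"...unrelated process memory...",
    }

    def proc_str_match(snapshots, needle):
        return [pid for pid, mem in snapshots.items() if needle in mem]

    print(proc_str_match(memory_snapshots, b"corpus.txt"))   # -> [1001]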

Page 42: Provenance for Data Munging Environments

Overhead (again) (1/2)

• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional 1.1x – 1.2x slowdown.

Virtualization is the dominant overhead factor.

Page 43: Provenance for Data Munging Environments

Overhead (again) (2/2)

• QEMU is a suboptimal virtualization option.
• ReVirt – User Mode Linux (Dunlap '02)
  – Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMWare (Xu '07)
  – Slowdown: 1.05x-2.6x rec. + ??? virt.

Virtualization slowdown is considered acceptable.

Recording overhead is fairly low.

Page 44: Provenance for Data Munging Environments

Storage Requirements

• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt '14): 17-915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage storage requirements:
  – compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today's technology (see the arithmetic below).
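A quick back-of-the-envelope check of that claim, using the O(10MB/min) figure above (my arithmetic, not from the slides):

    mb_per_min = 10                            # uncompressed trace rate quoted above
    gb_per_day = mb_per_min * 60 * 24 / 1024
    tb_per_year = gb_per_day * 365 / 1024
    print(f"~{gb_per_day:.0f} GB/day, ~{tb_per_year:.1f} TB/year uncompressed")
    # roughly 14 GB/day and 5 TB/year before compression or trace rotation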

Page 45: Provenance for Data Munging Environments

Highlights

• Taint-tracking analysis is a powerful method for capturing provenance.
  – Eliminates many false positives.
  – Tackles the "n×m problem".

• Decoupling provenance analysis from execution is possible through VM record & replay.

• Execution traces can be used for post-hoc provenance analysis.

Page 46: Provenance for Data Munging Environments

DB DATA MUNGING

Page 47: Provenance for Data Munging Environments

References

Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. World Wide Web Conference 2014.

Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. Executing Provenance-Enabled Queries over Web Data. World Wide Web Conference 2015.

Page 48: Provenance for Data Munging Environments

RDF is great for munging data

➢ Ability to arbitrarily add new information (schemaless)

➢ Syntaxes make it easy to concatenate new data

➢ Information has a well defined structure

➢ Identifiers are distributed but controlled


Page 49: Provenance for Data Munging Environments

What’s the provenance of my query result?


Page 50: Provenance for Data Munging Environments

Graph-based Query

select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
  graph ?g1 { ?a [] "Eiffel Tower" . }
  graph ?g2 { ?a inCountry FR . }
  graph ?g3 { ?a lat ?lat . }
  graph ?g4 { ?a long ?long . }
}

[Result listing: the same lat/long values are returned once per combination of contributing graphs, e.g. (l1, l2, l4, l4), (l1, l2, l4, l5), …, (l3, l3, l5, l5).]

Page 51: Provenance for Data Munging Environments

Provenance Polynomials

➢ Ability to characterize the ways each source contributed

➢ Pinpoint the exact source of each result

➢ Trace back the list of sources and the way they were combined to deliver a result

Page 52: Provenance for Data Munging Environments

Polynomial Operators

➢ Union (⊕)
  ○ a constraint or projection is satisfied by multiple sources
  ○ multiple entities satisfy a set of constraints or projections

  l1 ⊕ l2 ⊕ l3

➢ Join (⊗)
  ○ sources are joined to handle a constraint or a projection
  ○ OS and OO joins between sets of constraints

  (l1 ⊕ l2) ⊗ (l3 ⊕ l4)

Page 53: Provenance for Data Munging Environments

Example Polynomial

select ?lat ?long
where {
  ?a [] "Eiffel Tower" .
  ?a inCountry FR .
  ?a lat ?lat .
  ?a long ?long .
}

(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
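One way to picture such polynomials in code (an illustrative sketch, not TripleProv's internal representation): source lineages are symbols, ⊕ collects the alternatives that satisfy a constraint, and ⊗ records how constraint groups were joined.

    # Illustrative provenance polynomials: ("plus", ...) is union, ("times", ...) is join.
    def plus(*terms):  return ("plus", terms)
    def times(*terms): return ("times", terms)

    def render(p):
        if isinstance(p, str):
            return p
        op, terms = p
        sep = " (+) " if op == "plus" else " (x) "
        return "(" + sep.join(render(t) for t in terms) + ")"

    # The example above, with (+) standing in for ⊕ and (x) for ⊗:
    poly = times(plus("l1", "l2", "l3"), plus("l4", "l5"),
                 plus("l6", "l7"), plus("l8", "l9"))
    print(render(poly))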

Page 54: Provenance for Data Munging Environments

System Architecture

Page 55: Provenance for Data Munging Environments

Experiments

How expensive is it to trace provenance?

What is the overhead on query execution time?

Page 56: Provenance for Data Munging Environments

Datasets

➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl

➢ Typical collections gathered from multiple sources

➢ Sampled subsets of ~110 million triples each; ~25GB each

Page 57: Provenance for Data Munging Environments

Workloads

➢ 8 queries defined for BTC
  ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.

➢ Two additional queries with UNION and OPTIONAL clauses

➢ 7 new queries for WDC

http://exascale.info/tripleprov

Page 58: Provenance for Data Munging Environments

Results

Overhead of tracking provenance compared to the vanilla version of the system for the BTC dataset

[Chart legend: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]

Page 59: Provenance for Data Munging Environments

TripleProv: Query Execution Pipeline

input: a provenance-enabled query

➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries

output: the workload query results, restricted to those derived from data specified by the provenance query
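As pseudocode, the pipeline described above looks roughly like this (a sketch; the store object and its methods are placeholders, not TripleProv's API).

    def run_provenance_enabled_query(store, provenance_query, workload_queries,
                                     strategy="adaptive"):
        # 1. Execute the provenance query to select the admissible contexts/sources.
        contexts = store.execute(provenance_query)

        # 2. Optionally pre-materialize or co-locate the matching data.
        if strategy in ("full", "partial", "adaptive"):
            store.materialize(contexts)

        # 3. Optionally rewrite the workload queries to target those contexts.
        if strategy == "rewriting":
            workload_queries = [store.rewrite(q, contexts) for q in workload_queries]

        # 4. Execute the workload queries, restricted to the selected provenance.
        return [store.execute(q, scope=contexts) for q in workload_queries]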

Page 60: Provenance for Data Munging Environments

Experiments

What is the most efficient query execution strategy for provenance-enabled queries?

Page 61: Provenance for Data Munging Environments

Datasets

➢ Two collections of RDF data gathered from the Web
  ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
  ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl

➢ Typical collections gathered from multiple sources

➢ Sampled subsets of ~40 million triples each; ~10GB each

➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that the provenance queries do not modify the result sets of the workload queries

Page 62: Provenance for Data Munging Environments

Results for BTC

➢ Full Materialization: 44x faster than the vanilla version of the system

➢ Partial Materialization: 35x faster

➢ Pre-Filtering: 23x faster

➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization

➢ Query Rewriting and Post-Filtering strategies perform significantly slower

Page 63: Provenance for Data Munging Environments

Data Analysis

➢ How many context values refer to how many triples? How selective are they?

➢ 6,819,826 unique context values in the BTC dataset.

➢ The majority of the context values are highly selective.

➢ Average selectivity:
  ○ 5.8 triples per context value
  ○ 2.3 molecules per context value

Page 64: Provenance for Data Munging Environments


DECLARATIVE DATA MUNGING (?)

Page 65: Provenance for Data Munging Environments


References

Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein. foxPSL: A Fast, Optimized and eXtended PSL Implementation. International Journal of Approximate Reasoning (2015).

Page 66: Provenance for Data Munging Environments

Why logic?

- Concise & natural way to represent relations

- Declarative representation:
  - Can reuse, extend, and combine rules
  - Experts can write rules

- First-order logic:
  - Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)

Page 67: Provenance for Data Munging Environments

Let the reasoner munge the data.

See Sebastian Riedel's work (among others) on pushing more NLP problems into the reasoner: http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf

Page 68: Provenance for Data Munging Environments

Statistical Relational Learning

● Several flavors:
  o Markov Logic Networks
  o Bayesian Logic Programs
  o Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010]

● PSL has been successfully applied to:
  o entity resolution, link prediction
  o ontology alignment, knowledge graph identification
  o computer vision, trust propagation, …

Page 69: Provenance for Data Munging Environments

Probabilistic Soft Logic (PSL)

● Probabilistic logic with soft truth values ∈ [0,1]:
  friends(anna, bob) = 0.8
  votes(anna, demo) = 0.99

● Weighted rules:
  [weight = 0.7] friends(A,B) && votes(A,P) => votes(B,P)

● Inference as constrained convex minimization:
  votes(bob, demo) = 0.8
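To make the semantics concrete (an illustrative calculation using the Lukasiewicz relaxation that PSL is built on; this is not foxPSL code): the truth of the rule body is max(0, a + b - 1), and the rule is fully satisfied when the head is at least as true as the body.

    # Lukasiewicz conjunction used by PSL: AND(a, b) = max(0, a + b - 1)
    def t_and(a, b):
        return max(0.0, a + b - 1.0)

    friends_anna_bob = 0.8
    votes_anna_demo = 0.99

    body = t_and(friends_anna_bob, votes_anna_demo)   # 0.79
    votes_bob_demo = 0.8                              # the value inferred above

    # Distance to satisfaction of the rule (0 when head >= body); weighted
    # inference minimizes the sum of these distances over all groundings.
    distance = max(0.0, body - votes_bob_demo)
    print(body, distance)   # 0.79 0.0 -> the rule is satisfied at votes(bob, demo) = 0.8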

Page 70: Provenance for Data Munging Environments

FoxPSL: Fast Optimized eXtended PSL

Features: classes, existential quantification (∃), partially grounded rules, optimizations

DSL: the foxPSL language

Page 71: Provenance for Data Munging Environments

Experiments: comparison with ACO

SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM

ACO = implementation of consensus optimization on GraphLab used for grounded PSL

Page 72: Provenance for Data Munging Environments

Conclusions

• Data munging is a central task
• Provenance is a requirement
• Now:
  • Provenance by stealth (ack. Carole Goble)
  • Separate provenance analysis from instrumentation.
• Future:
  • The computer should do the work

Page 73: Provenance for Data Munging Environments


Future Research

• Explore optimizations of taint tracking for capturing provenance.

• Provenance analysis of real-world traces (e.g. from rrshare.org).

• Tracking provenance across environments

• Traces/logs as central provenance primitive

• Declarative data munging