Capturing Interactive Data Transformation Operations using Provenance Workflows

Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute www.deri.ie Capturing interactive data transformation operations using provenance workflows Tope Omitola, Andre Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins and Nigel Shadbolt SWPM Workshop 28.05.2012, Herakleion, Crete

DESCRIPTION

The ready availability of data is increasing the opportunities to re-use them in new applications and analyses. Most of these data are not in the format users want, are usually heterogeneous and highly dynamic, and so require data transformation efforts to re-purpose them. Interactive data transformation (IDT) tools are becoming readily available and lower the barriers to such efforts. This paper describes a principled way to capture the data lineage of interactive data transformation processes. We provide a formal model of IDT, its mapping to a provenance representation, and its implementation and validation on Google Refine. Provision of the data transformation process sequences allows assessment of data quality and ensures portability between IDT and other data transformation platforms. The proposed model showed a high level of coverage against a set of requirements used for evaluating systems that provide provenance management solutions.

TRANSCRIPT

Page 1: Capturing Interactive Data Transformation Operations using Provenance Workflows

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Capturing interactive data transformation operations using provenance workflows

Tope Omitola, Andre Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins and Nigel Shadbolt

SWPM Workshop 28.05.2012, Herakleion, Crete

Page 2: Capturing Interactive Data Transformation Operations using Provenance Workflows

Outline

Motivation

Interactive data transformations (IDTs)

IDT & Provenance

Modelling IDTs

Provenance Representation

Provenance Capture

Case Study

Conclusion

Page 3: Capturing Interactive Data Transformation Operations using Provenance Workflows

Motivation

Dataspaces:

High number of heterogeneous data sources

Complex data transformation environment

Need for both repeatable data transformations and once-off transformations

Traditional ETL approaches for data transformation/integration:

Based on scripting/programming

Focus on repeatable data transformation processes

Page 4: Capturing Interactive Data Transformation Operations using Provenance Workflows

Interactive Data Transformation (IDTs)

Based on user interaction paradigms for user creation of data transformations

Explores GUI elements mapping to data transformation operations

Instant feedback of each iteration

Complementary to existing ETL tools

Lowers the barriers to data transformation for non-programmers (reduces programming effort)

Example platforms: Google Refine, Potter's Wheel, Wrangler

Page 5: Capturing Interactive Data Transformation Operations using Provenance Workflows

Interactive Data Transformation (IDTs)

Page 6: Capturing Interactive Data Transformation Operations using Provenance Workflows

Challenges

How to model IDTs?

Facilitating the reuse of previous IDTs

Representing IDTs

Making IDT platforms provenance-aware

Enabling transportability across IDT and ETL platforms

Provenance

Page 7: Capturing Interactive Data Transformation Operations using Provenance Workflows

IDT & Provenance

Provenance supports representation of interactive data transformations

Output: a provenance descriptor which shows the relationship between the inputs, the outputs, and the applied transformation operations

Both retrospective and prospective provenance

Page 8: Capturing Interactive Data Transformation Operations using Provenance Workflows

IDT

IDT model

Formal model (Algebra for IDT)

Provenance representation

Provenance capture of IDTs

Page 9: Capturing Interactive Data Transformation Operations using Provenance Workflows

IDT Model: Core Elements

Schema and instance data

Set of predefined operations

GUI elements mapping to predefined operations

User actions

Operation selection

Parameter selection

Operation composition (workflow)
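
The core elements listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Row`/`Table` representation, the two operations, and the GUI element identifiers are hypothetical and are not taken from Google Refine or from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]   # one instance record keyed by column (schema) name
Table = List[Row]      # instance data

# Set of predefined operations (hypothetical examples).
def uppercase_column(table: Table, column: str) -> Table:
    return [{**row, column: row[column].upper()} for row in table]

def rename_column(table: Table, old: str, new: str) -> Table:
    return [{(new if k == old else k): v for k, v in row.items()} for row in table]

OPERATIONS: Dict[str, Callable[..., Table]] = {
    "uppercase_column": uppercase_column,
    "rename_column": rename_column,
}

# GUI elements mapping to predefined operations (element ids are made up).
GUI_TO_OPERATION: Dict[str, str] = {
    "menu.edit_cells.to_uppercase": "uppercase_column",
    "menu.edit_column.rename": "rename_column",
}

@dataclass
class UserAction:
    gui_element: str                                      # operation selection
    params: Dict[str, str] = field(default_factory=dict)  # parameter selection

def run_workflow(table: Table, actions: List[UserAction]) -> Table:
    """Operation composition: apply each selected operation in order."""
    for action in actions:
        operation = OPERATIONS[GUI_TO_OPERATION[action.gui_element]]
        table = operation(table, **action.params)
    return table

data = [{"name": "ada"}, {"name": "grace"}]
result = run_workflow(data, [
    UserAction("menu.edit_cells.to_uppercase", {"column": "name"}),
    UserAction("menu.edit_column.rename", {"old": "name", "new": "winner"}),
])
print(result)
```

Each operation returns a new table rather than editing in place, so the input of every step remains available as a candidate provenance artifact.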

Page 10: Capturing Interactive Data Transformation Operations using Provenance Workflows

IDT Model

Page 11: Capturing Interactive Data Transformation Operations using Provenance Workflows

Formalizing the mapping from IDT to Provenance

Definition 1: A provenance-based interactive data transformation engine consists of a set of transformations (or activities) on a set of datasets, generating outputs in the form of other datasets or events which may trigger further transformations

Definition 2: An interactive data transformation event consists of the input dataset, the output dataset(s), the applied transformation function, and the time the transformation took place
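
As a rough reading of Definitions 1 and 2, the sketch below records one event per applied transformation. The class names, the tuple-based `Dataset` type, and the two sample functions are assumptions made for illustration; they are not the paper's formal notation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple

Dataset = Tuple[str, ...]   # a dataset, simplified to an immutable tuple of cell values

@dataclass(frozen=True)
class TransformationEvent:
    """Definition 2: input dataset, output dataset(s), applied function, and time."""
    input_dataset: Dataset
    output_datasets: Tuple[Dataset, ...]
    function_name: str
    timestamp: datetime

class ProvenanceEngine:
    """Definition 1: applies transformations to datasets and records each one as an event."""

    def __init__(self) -> None:
        self.events: List[TransformationEvent] = []

    def apply(self, name: str, fn: Callable[[Dataset], Dataset], dataset: Dataset) -> Dataset:
        output = fn(dataset)
        self.events.append(
            TransformationEvent(dataset, (output,), name, datetime.now(timezone.utc))
        )
        return output

def strip_cells(ds: Dataset) -> Dataset:
    return tuple(cell.strip() for cell in ds)

def title_case(ds: Dataset) -> Dataset:
    return tuple(cell.title() for cell in ds)

engine = ProvenanceEngine()
raw = (" john wayne ", "meena kumari")
stripped = engine.apply("strip_cells", strip_cells, raw)
cleaned = engine.apply("title_case", title_case, stripped)
print(cleaned, len(engine.events))
```

The output of one event is the input of the next, which is the chaining that lets a sequence of events be read as a workflow.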

Page 12: Capturing Interactive Data Transformation Operations using Provenance Workflows

Formalizing the mapping from IDT to Provenance

Definition 3: A run is a function from time to dataset(s) and the transformation applied to those dataset(s)

Definition 4: A trace is the sequence of pairs of a run and the time the run was made
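
One possible reading of Definitions 3 and 4 is sketched below: a run is modelled as a lookup from a time step to the dataset and the transformation applied to it, and a trace as the time-ordered list of (run value, time) pairs. The integer time steps and the helper names `make_run` and `make_trace` are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Dataset = Tuple[str, ...]
Step = Tuple[Dataset, str]   # dataset(s) and the transformation applied to them

def make_run(steps: Dict[int, Step]) -> Callable[[int], Step]:
    """Definition 3 (sketch): a function from time to (dataset, transformation)."""
    def run(t: int) -> Step:
        return steps[t]
    return run

def make_trace(run: Callable[[int], Step], times: Iterable[int]) -> List[Tuple[Step, int]]:
    """Definition 4 (sketch): the sequence of (run value, time) pairs in time order."""
    return [(run(t), t) for t in sorted(times)]

# Steps are deliberately listed out of order; the trace restores time order.
steps: Dict[int, Step] = {
    2: (("john wayne",), "title_case"),
    1: ((" john wayne ",), "strip_cells"),
}
run = make_run(steps)
trace = make_trace(run, steps.keys())
for (dataset, transformation), t in trace:
    print(t, transformation, dataset)
```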

Page 13: Capturing Interactive Data Transformation Operations using Provenance Workflows

Provenance Representation

Proposed in "Representing Interoperable Provenance Descriptions for ETL Workflows"

Three-layered provenance model:

Open Provenance Model Vocabulary Layer

Cogs ETL Provenance Vocabulary

Domain-Specific Model Layer

Linked Data standards

Page 14: Capturing Interactive Data Transformation Operations using Provenance Workflows

Provenance Capture Layers

Page 15: Capturing Interactive Data Transformation Operations using Provenance Workflows

Provenance Event-Capture Sequence Flow

Page 16: Capturing Interactive Data Transformation Operations using Provenance Workflows

Case study

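@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# The opmv and cogs namespace IRIs are assumed here; the slide does not show them.
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix cogs: <http://vocab.deri.ie/cogs#> .
@prefix grf: <http://127.0.0.1:3333/project/1402144365904/> .

# The transformation process and the program that implements it.
grf:MassCellChange-1092380975 rdf:type opmv:Process, cogs:ColumnOperation, cogs:Transformation ;
    cogs:operationName "MassCellChange"^^xsd:string ;
    cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string ;
    rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string .

# Input artifact ('/' in a prefixed local name is escaped as '\/' in Turtle).
grf:MassCellChange-1092380975\/1_0 rdf:type opmv:Artifact ;
    rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string .

# Output artifact.
grf:MassCellChange-1092380975\/1_1 rdf:type opmv:Artifact ;
    rdfs:label "* '''John Wayne'''"^^xsd:string .

# Workflow structure linking process, input, and output.
grf:MassCellChange-1092380975\/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975\/1_0 .
grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975\/1_0 .
grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975 .
grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedAt "2011-11-16T11:02:14"^^xsd:dateTime .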

Implementation over the Google Refine (GR) platform; the listing above is an example descriptor

Callouts on the descriptor: mapping to the actual program, process, input artifact, output artifact, workflow structure
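
To show how a descriptor of this shape could be produced from a captured operation, here is a minimal sketch that formats the triples as strings. The function `describe_operation` and its parameters are hypothetical and are not part of the Google Refine extension; the property names are the ones used in the slide's descriptor, full IRIs are written in angle brackets, and no escaping of literal values is attempted.

```python
from typing import List

# Project IRI as it appears in the slide's descriptor.
BASE = "http://127.0.0.1:3333/project/1402144365904/"

def describe_operation(op_id: str, op_name: str, program: str) -> List[str]:
    """Return Turtle-style triples for one process, its input, and its output."""
    process = f"<{BASE}{op_id}>"
    source = f"<{BASE}{op_id}/1_0>"   # input artifact
    result = f"<{BASE}{op_id}/1_1>"   # output artifact
    return [
        f"{process} rdf:type opmv:Process, cogs:Transformation .",
        f'{process} cogs:operationName "{op_name}"^^xsd:string .',
        f'{process} cogs:programUsed "{program}"^^xsd:string .',
        f"{source} rdf:type opmv:Artifact .",
        f"{result} rdf:type opmv:Artifact .",
        f"{process} opmv:used {source} .",
        f"{result} opmv:wasGeneratedBy {process} .",
        f"{result} opmv:wasDerivedFrom {source} .",
    ]

triples = describe_operation(
    "MassCellChange-1092380975",
    "MassCellChange",
    "com.google.refine.operations.cell.MassEditOperation",
)
print("\n".join(triples))
```

A real implementation would more likely build the graph with an RDF library and let it handle serialization; plain strings are used here only to keep the sketch dependency-free.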

Page 17: Capturing Interactive Data Transformation Operations using Provenance Workflows

Conclusion

The proposed approach has low impact on the existing IDT process

Provenance representation supports different data models

Preliminary implementation of a Google Refine provenance extension