
CIS 602, Fall 2014

CIS 602: Provenance & Scientific Data Management

Introduction

Dr. David Koop

CIS 602, Fall 2014

About Me
• 1998-2002: B.S. in Mathematics, B.Comp.Sci. [Calvin College]
• 2002-2005: M.S. in Computer Science [University of Wisconsin-Madison]
• 2005-2006: Consulting [nVISIA, Inc.]
• 2006-2011: Ph.D. in Computing [University of Utah]
• 2011-2012: Senior software architect [VisTrails, Inc.]
• 2012-2014: Research assistant professor [New York University]
• Now: Assistant professor at UMass-Dartmouth


About Me
• Research Interests
  - Visualization
  - Computational Provenance
  - Data Science
• Research Projects
  - VisTrails: www.vistrails.org
  - VisComplete, UV-CDAT, SAHM
• Hobbies
  - Ultimate Frisbee
  - Hiking
• See my web page for more information
  - http://www.cis.umassd.edu/~dkoop/


About You
• Graduate students?
• Background:
  - Previous topics course (CIS 602)? (visualization, bioinformatics)
  - Research Papers?
  - Provenance?
  - Scientific Workflows?
  - Database Experience?
    • Relational
    • NoSQL
    • XML
    • Graph Databases


About this course
• Course web page is authoritative:
  - http://www.cis.umassd.edu/~dkoop/cis602/
• Topics course
  - A current research area the professor works in
  - A chance to be on the “cutting edge” of research
• No textbook
  - Use recent research papers
• Requires student participation
  - Reading responses
  - Reading presentation
  - Course project


About this course
• Course Registration:
  - Make sure you have registered in COIN for the course
  - Email me ([email protected]) if you need a permission code
• Review of course policies:
  - If you have any concerns or questions, please email me as soon as possible
• If you are not sure if this course is a good fit, please email me or talk to me


What is Provenance?


Provenance in Art

Rembrandt van Rijn
Dutch, 1606 - 1669
Self-Portrait, 1659
oil on canvas
Andrew W. Mellon Collection
1937.1.72

Provenance
George, 3rd Duke of Montagu and 4th Earl of Cardigan [d. 1790], by 1767;[1] by inheritance to his daughter, Lady Elizabeth, wife of Henry, 3rd Duke of Buccleuch of Montagu House, London; John Charles, 7th Duke of Buccleuch; (P. & D. Colnaghi & Co., New York, 1928); (M. Knoedler & Co., New York); sold January 1929 to Andrew W. Mellon, Pittsburgh and Washington, D.C.; deeded 28 December 1934 to The A.W. Mellon Educational and Charitable Trust, Pittsburgh; gift 1937 to NGA.

[1] This early provenance is established by presence of a mezzotint after the portrait by R. Earlom (1743-1822), dated 1767. See John Charrington, A Catalogue of the Mezzotints After, or Said to Be After, Rembrandt, Cambridge, 1923, no. 49.

Associated Names
• Buccleuch, Henry, 3rd Duke of
• Buccleuch, John Charles, 7th Duke of
• Colnaghi & Co., Ltd., P. & D.
• Knoedler & Company, M.
• Mellon, Andrew W.
• Mellon Educational and Charitable Trust, The A.W.
• Montagu, and 4th Earl of Cardigan, George, 3rd Duke of


[National Gallery of Art]

http://www.nga.gov/cgi-bin/tsearch?ownerid=22007















Provenance in Science
• Provenance: the lineage of data, a computation, or a visualization
• Provenance is as important as (or more important than) the result!
• Old solution:
  - Lab notebooks
• New problems:
  - Large volumes of data
  - Complex analyses
  - Writing notes doesn’t scale
• Let computers handle this automatically!

[Figure: lab notebook page with callouts for Date, Annotations, and Observed Data. DNA Recombination, Lederberg]

Provenance-Rich Computational Science

[Figure: a sample paper page whose images were generated by embedding workflows directly in the text, shown with a diagram labeled DATA, Data Management, Computation, Visualization, Publishing, and Provenance]

Scientific Workflows
• Problem: There are many different tools, and reasonably complex analyses often need more than one
• Solutions:
  - Rely on human memory
  - Scripts
  - Workflows

[Figure: the workflow drawn as connected modules: vtkStructuredPointsReader, vtkContourFilter, vtkDataSetMapper, vtkActor, vtkCamera, vtkRenderer, VTKCell]

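Scientific Workflows

import vtk

# Read the structured-points volume from disk.
data = vtk.vtkStructuredPointsReader()
data.SetFileName("../examples/data/head.120.vtk")

# Extract the isosurface at scalar value 67.
# SetInputConnection/GetOutputPort replace the older SetInput/GetOutput
# pipeline calls, which were removed in VTK 6.
contour = vtk.vtkContourFilter()
contour.SetInputConnection(data.GetOutputPort())
contour.SetValue(0, 67)

# Map the isosurface geometry to graphics primitives, ignoring scalar colors.
mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(contour.GetOutputPort())
mapper.ScalarVisibilityOff()

# The actor places the mapped geometry in the scene.
actor = vtk.vtkActor()
actor.SetMapper(mapper)

# Camera parameters: these are part of the visualization's provenance too.
cam = vtk.vtkCamera()
cam.SetViewUp(0, 0, -1)
cam.SetPosition(745, -453, 369)
cam.SetFocalPoint(135, 135, 150)
cam.ComputeViewPlaneNormal()

# Renderer and window.
ren = vtk.vtkRenderer()
ren.AddActor(actor)
ren.SetActiveCamera(cam)
ren.ResetCamera()
renwin = vtk.vtkRenderWindow()
renwin.AddRenderer(ren)

# Interactive trackball-style camera control.
style = vtk.vtkInteractorStyleTrackballCamera()
iren = vtk.vtkRenderWindowInteractor()
iren.SetRenderWindow(renwin)
iren.SetInteractorStyle(style)
iren.Initialize()
iren.Start()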

Module parameters shown in the workflow figure:
  - FileName: .../head.120.vtk
  - Value: (0,67)
  - ViewUp: (0,0,-1)
  - Position: (745,-453,369)
  - FocalPoint: (-135,135,150)

[Figure: the same workflow: vtkContourFilter, vtkDataSetMapper, vtkActor, vtkCamera, vtkRenderer, VTKCell]

Scientific Workflow Provenance


<module id="12" name="vtkDataSetReader" start_time="2010-02-19 11:01:05" end_time="2010-02-19 11:01:07">
  <annotation key="hash" value="c54bea63cb7d912a43ce"/>
</module>
<module id="13" name="vtkContourFilter" start_time="2010-02-19 11:01:07" end_time="2010-02-19 11:01:08"/>
<module id="15" name="vtkDataSetMapper" start_time="2010-02-19 11:01:09" end_time="2010-02-19 11:01:12"/>
<module id="16" name="vtkActor" start_time="2010-02-19 11:01:12" end_time="2010-02-19 11:01:13"/>
<module id="17" name="vtkCamera" start_time="2010-02-19 11:01:13" end_time="2010-02-19 11:01:14"/>
<module id="18" name="vtkRenderer" start_time="2010-02-19 11:01:14" end_time="2010-02-19 11:01:14"/>
...
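An execution log like the one above can be queried with a few lines of code. The following is a minimal sketch, not the VisTrails log API: it assumes the module records are wrapped in a hypothetical `<log>` root element (the slide shows only a fragment) and uses only the attributes that appear on the slide.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumption: a <log> root wraps the records; only the first three modules are copied.
LOG = """<log>
<module id="12" name="vtkDataSetReader" start_time="2010-02-19 11:01:05" end_time="2010-02-19 11:01:07">
  <annotation key="hash" value="c54bea63cb7d912a43ce"/>
</module>
<module id="13" name="vtkContourFilter" start_time="2010-02-19 11:01:07" end_time="2010-02-19 11:01:08"/>
<module id="15" name="vtkDataSetMapper" start_time="2010-02-19 11:01:09" end_time="2010-02-19 11:01:12"/>
</log>"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def module_durations(xml_text):
    """Return {module name: run time in seconds} for every <module> record."""
    root = ET.fromstring(xml_text)
    durations = {}
    for module in root.iter("module"):
        start = datetime.strptime(module.get("start_time"), TIME_FORMAT)
        end = datetime.strptime(module.get("end_time"), TIME_FORMAT)
        durations[module.get("name")] = (end - start).total_seconds()
    return durations


def module_annotations(xml_text):
    """Return {module name: {annotation key: value}} for annotated modules."""
    root = ET.fromstring(xml_text)
    return {
        module.get("name"): {a.get("key"): a.get("value") for a in module.iter("annotation")}
        for module in root.iter("module")
        if module.find("annotation") is not None
    }


durations = module_durations(LOG)
annotations = module_annotations(LOG)
```

The same traversal answers questions such as "which module took longest?" or "which file hash did the reader see?" without rerunning the workflow.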


Evolution Provenance: Photo Editing
• User Actions
• Undo/Redo History

[Figure: image versions labeled original, darkened, sharpened, and grayscale; a second build of the slide adds a watercolor version]

Workflow Evolution Provenance

[Figure: successive versions of a workflow built from modules including HTTPFile, CSVFile, JSONFile, JoinTables, ProjectTable, GMapCell, TableCell, Map, MplBar, MplAxesProperties, MplFigureProperties, MplFigure, MplFigureCell, GetFareData (Group), DateRange (PythonSource), and BuildLabels (PythonSource); a later version uses GMapCircleCell]

delete module “GMapCell”
delete module “CellLocation”
delete module “ProjectTable”
delete module “SelectFromTable”
...
add module “SelectFromTable”
add parameter “float_expr” to “SelectFromTable” with value “latitude > 40.6”
delete parameter “float_expr” from “SelectFromTable”
add parameter “float_expr” to “SelectFromTable” with value “latitude > 40.7”
delete parameter “float_expr” from “SelectFromTable”
add parameter “float_expr” to “SelectFromTable” with value “latitude > 40.8”
...
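A change log like this is enough to rebuild any version of the workflow by replaying actions from the start. The sketch below illustrates that idea only; the tuple encoding of actions and the `replay` function are assumptions made for this example, not the VisTrails data model.

```python
def replay(actions):
    """Apply change actions in order; return {module name: {parameter: value}}."""
    modules = {}
    for action in actions:
        kind = action[0]
        if kind == "add_module":
            modules[action[1]] = {}
        elif kind == "delete_module":
            modules.pop(action[1], None)
        elif kind == "add_parameter":
            _, module, name, value = action
            modules[module][name] = value
        elif kind == "delete_parameter":
            _, module, name = action
            modules[module].pop(name, None)
        else:
            raise ValueError("unknown action: %r" % (kind,))
    return modules


# The parameter exploration from the log, in a hypothetical tuple encoding.
ACTIONS = [
    ("add_module", "SelectFromTable"),
    ("add_parameter", "SelectFromTable", "float_expr", "latitude > 40.6"),
    ("delete_parameter", "SelectFromTable", "float_expr"),
    ("add_parameter", "SelectFromTable", "float_expr", "latitude > 40.7"),
    ("delete_parameter", "SelectFromTable", "float_expr"),
    ("add_parameter", "SelectFromTable", "float_expr", "latitude > 40.8"),
]

# Any prefix of the log is a valid earlier version of the workflow.
first_version = replay(ACTIONS[:2])
latest_version = replay(ACTIONS)
```

Storing the actions rather than full copies of each workflow keeps every intermediate version recoverable, which is what makes undo/redo histories and version trees cheap to keep.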


System-Level Provenance

[Figure: PASS architecture, spanning user and kernel space. Labeled components: User Processes, VFS Layer, Collector, Pasta, KBDB, ext2. Labeled flows: data, provenance, records, events. Callout: dealing with cycles.]

[PASS Architecture, Seltzer et al.]

Database Provenance
• Determine the raw data that contributed to a computed table
• For aggregations, tracing back to original data can require many more annotations or schemes that track the mappings


query even in the special case where the join is always performed before projection. On the other hand, there is a polynomial-time algorithm for deciding whether there is a side-effect-free annotation for SPJU queries which do not simultaneously contain both project and join operators. In fact, the annotation placement problem was later shown to be DP-hard in [30]. In [17], the authors showed that many of the complexity issues disappear for key-preserving operations, which are operations that retain the keys of the input relations.

3 Data Provenance: Current

In this section, we describe research efforts that mainly occur between 2005 and 2007, as shown in Figure 3(c). Our discussion will center around two research projects, DBNotes [4, 15] and SPIDER [1, 14], recently developed at UC Santa Cruz.

3.1 DBNotes

The work of DBNotes builds upon ideas developed in [10, 30]. DBNotes is an annotation management system for relational database systems. In DBNotes, every attribute value in a relation can be tagged with multiple annotations. When a query is executed against the database, annotations of relevant attribute values in the database are automatically propagated to attribute values in the result of the query execution. The queries supported by DBNotes for automatic annotation propagation belong to a fragment of SQL queries that corresponds roughly to select-project-join-union queries. In its default execution mode, annotations are propagated based on where data is copied from (i.e., where-provenance). As a consequence, if every attribute value in the database is annotated with its address, the provenance of data is propagated along, from input to output, as data is transformed by the query. An example of annotations propagated in the default manner is shown below:

Source database D:

Employee
empid        dept
e977 (a1)    CS (a2)
e132 (a3)    EE (a4)
e657 (a5)    BME (a6)

Department
deptid       budget
BME (b1)     1200K (b2)
CS (b3)      670K (b4)
EE (b5)      890K (b6)
MATH (b7)    230K (b8)

Query Q:
SELECT e.empid, e.dept, d.budget
FROM Employee e, Department d
WHERE e.dept = d.deptid
PROPAGATE default

Output of Q(D):
empid        dept         budget
e657 (a5)    BME (a6)     1200K (b2)
e977 (a1)    CS (a2)      670K (b4)
e132 (a3)    EE (a4)      890K (b6)

In this example, every attribute value in the source relations, Employee and Department, is annotated with a unique identifier. For instance, the attribute value 670K is annotated with the identifier b4. The query Q has an additional “PROPAGATE default” clause, which means that we are using the default execution mode as explained earlier. By analyzing the annotations that are propagated to Q(D), we can conclude that the value “BME” in Q(D) was copied from “BME” in the Employee relation (and not the “BME” in Department relation). If the SELECT clause of Q had been “e.empid, d.deptid, d.budget” instead, then the annotation associated with “BME” in Q(D) would be b1 instead of a6. Hence, equivalent queries may propagate annotations differently. This presents a serious difficulty as it means that the annotations (or provenance answers) that one obtains in the result is dependent on the query plan chosen by the database engine. DBNotes resolves this problem through a novel propagation scheme, called the default-all propagation scheme. In this scheme, all annotations of every equivalent formulation of a query are collected together. Consequently, propagated annotations are invariant across equivalent queries. This scheme can thus be viewed as the most general way of propagating annotations. In our example, the “BME” value in Q(D) will consist of both annotations a6 and b1 under the default-all scheme. At first sight, the default-all scheme seems infeasible because the set of all equivalent queries is infinite in general. In [4], a practical and novel implementation that avoids the enumeration of every equivalent query is described. The key insight is that for monotone relational queries, all relevant annotations can in fact be determined by evaluating every query in a finite set of queries. Such a finite set can always be obtained, and is not too big in general. DBNotes also allows one to define custom propagation schemes. In this scheme, the user can specify where annotations should be retrieved from input relations. The custom scheme is especially useful when the user is, for example, only interested in retrieving annotations from a particular database, perhaps due


[Provenance in Databases, Tan et al.]
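The default (where-provenance) propagation in the DBNotes example can be imitated in a few lines. This is an illustrative sketch only, not DBNotes itself: each cell is modeled as a (value, annotation) pair, and `join_with_annotations` is a hypothetical helper that copies a cell together with its annotation.

```python
# Each cell is a (value, annotation) pair, using the identifiers from the example.
EMPLOYEE = [
    {"empid": ("e977", "a1"), "dept": ("CS", "a2")},
    {"empid": ("e132", "a3"), "dept": ("EE", "a4")},
    {"empid": ("e657", "a5"), "dept": ("BME", "a6")},
]
DEPARTMENT = [
    {"deptid": ("BME", "b1"), "budget": ("1200K", "b2")},
    {"deptid": ("CS", "b3"), "budget": ("670K", "b4")},
    {"deptid": ("EE", "b5"), "budget": ("890K", "b6")},
    {"deptid": ("MATH", "b7"), "budget": ("230K", "b8")},
]


def join_with_annotations(employees, departments, select):
    """Join on e.dept = d.deptid; copy each selected cell with its annotation.

    `select` is a list of (side, column) pairs, where side is "e" or "d".
    """
    rows = []
    for e in employees:
        for d in departments:
            if e["dept"][0] == d["deptid"][0]:  # the join compares values only
                source = {"e": e, "d": d}
                rows.append([source[side][column] for side, column in select])
    return rows


# Query Q: SELECT e.empid, e.dept, d.budget
q = join_with_annotations(
    EMPLOYEE, DEPARTMENT, [("e", "empid"), ("e", "dept"), ("d", "budget")])

# Equivalent query: SELECT e.empid, d.deptid, d.budget
q_alt = join_with_annotations(
    EMPLOYEE, DEPARTMENT, [("e", "empid"), ("d", "deptid"), ("d", "budget")])
```

Both queries return the same values, but the "BME" cell carries a6 in `q` and b1 in `q_alt`, which is the discrepancy the default-all scheme is designed to remove.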

[Figure 2: Example of Basic Factorization. (a) ABC1 and LXR molecule data items, each with its own copy of the provenance record (transform, curate HPRD, PubMed 16524875). (b) The same data items after Basic Factorization, each holding a pointer to a single record in the provenance store.]

Reduction Technique            Estimated Provenance Size
No Reduction                   N ∗ S or n ∗ s
Basic Factorization            N ∗ p + N1 ∗ S
Node Factorization             n ∗ p + n1 ∗ s
Argument Factorization         n ∗ p + A ∗ a + n2 ∗ s′
Structural Inheritance         N2 ∗ S
Predicate Based Inheritance    N ∗ S/T

Variables Used
N     total number of provenance records
N1    number of distinct provenance records; N1 ≤ N
N2    number of data items whose provenance record is different from that of their parent data item; N2 ≤ N
n     number of provenance nodes; n ≥ N
n1    number of distinct provenance nodes; n1 ≤ n
n2    number of distinct provenance nodes, after removing the arguments; n2 ≤ n1 ≤ n
S     average size of a provenance record
s     average size of a provenance node
s′    average size of a provenance node without arguments; s′ ≤ s
p     size of a pointer from the data store to the provenance store
A     average size of an argument
a     number of argument annotations
T     number of data items that satisfy a predicate, and have common provenance records

Table 1: Estimated provenance size for each technique.

the provenance record R is replaced by a pointer to its copy in the provenance store.

Since there is one provenance record per data item, the number of (not necessarily distinct) provenance records is N. Let N1 be the number of distinct provenance records. Let S be the average size of a provenance record. The space used for storing provenance, before and after Basic Factorization, is shown in Table 1.
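Basic Factorization and its size estimate from Table 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: provenance records are modeled as hashable tuples, `basic_factorization` is a hypothetical helper, and the sizes S = 100 and p = 8 are made-up numbers for the arithmetic.

```python
def basic_factorization(item_provenance):
    """Store each distinct provenance record once; give every item a pointer to it.

    item_provenance maps a data item to its provenance record (a hashable tuple).
    Returns (pointers, store), where pointers maps items to indexes into store.
    """
    store = []   # distinct provenance records
    index = {}   # record -> position in store
    pointers = {}
    for item, record in item_provenance.items():
        if record not in index:
            index[record] = len(store)
            store.append(record)
        pointers[item] = index[record]
    return pointers, store


def size_without_reduction(N, S):
    return N * S


def size_basic_factorization(N, N1, S, p):
    return N * p + N1 * S


# The two data items of the Basic Factorization figure share one record.
record = ("transform", "curate HPRD", "PubMed 16524875")
pointers, store = basic_factorization({"ABC1": record, "LXR": record})

N, N1 = len(pointers), len(store)
S, p = 100, 8  # assumed sizes, for illustration only
before = size_without_reduction(N, S)
after = size_basic_factorization(N, N1, S, p)
```

With these assumed sizes the estimate drops from N ∗ S = 200 to N ∗ p + N1 ∗ S = 116; the saving grows with the number of items that share a record.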

3.2 Node Factorization

Often, two data items will have distinct provenance records, but these provenance records will have many nodes in common. Node Factorization removes common provenance nodes. Only one copy of each node is stored in a separate provenance store. Provenance pointers are stored with data items to refer to these nodes.

Consider the workflow in Figure 1(c). Two distinct, but similar processes exist, curateHPRD and curateBIND. Consider two provenance records that contain different curation manipulations, but are otherwise identical. For instance, for provenance records P0 = A → B → C, and P1 = A → X → C, the provenance store after Basic Factorization will have one record for each of P0 and P1. Obviously, we can do better by factoring common nodes. This amounts to combining P0 and P1 as A → (B OR X) → C. The pointer from the data items to the provenance store is used to indicate which of B or X is present, i.e., which of P0 or P1 is the correct provenance record for that data item.

In order to accomplish this reduction, we must be able to determine that a) the A nodes in P0 and P1 are equal, b) node B in P0 is similar to node X in P1, and c) the C nodes in P0 and P1 are equal. Provenance Node Equality and Similarity are defined as follows.

DEFINITION 8. Provenance Node Equality:
Two provenance nodes a and b are equal, denoted $a \overset{P}{=} b$, iff
i. they refer to the same manipulation,
ii. all parameters and input types to the manipulation are identical.

DEFINITION 9. Provenance Node Specific Similarity:
Two provenance nodes a and b are specifically similar, with respect to a similarity function Sx, if Sx(a, b) = TRUE.

Notice that similarity function values are dependent on the provenance nodes. For instance, we can define a similarity function S1(a, b) = {a.name like ‘curate’ and b.name like ‘curate’}. In this case S1(curateHPRD, curateBIND) = TRUE, but S1(curateHPRD, transform) = FALSE. We write S for the set of acceptable similarity functions, as defined by a provenance expert familiar with the provenance store in question.

DEFINITION 10. Provenance Node Similarity:
Two provenance nodes a and b are similar, if they are specifically similar with respect to some similarity function Sx() ∈ S.

Provenance node similarity, as defined above, is a binary relation on the provenance nodes. We assume that the set S of similarity functions is such that this relation has the following properties.

• Reflexive: Each provenance node is similar to itself.
• Symmetric: If node a is similar to node b, then b is similar to a.
• Transitive: If a is similar to b, and b is similar to c, then a is similar to c.

So, provenance node similarity is an equivalence relation. It divides the set of all provenance nodes in D into equivalence classes, such that two nodes are similar iff they are in the same equivalence class. For example, consider the workflow shown in Figure 1c; there are five different kinds of manipulations. If we assume that all provenance nodes that pertain to each kind of manipulation are similar, then the similarity relation has five equivalence classes. If we further assume that all curateHPRD and curateBIND nodes are similar to each other, then the similarity relation has only four equivalence classes.

Using the above definitions, we can combine P0 and P1 as A → (B OR X) → C. But what happens if we change our provenance records slightly to: P3 = J → K → L → M and P4 = J → N → O → M; we would like to combine them as J → (K OR N) → (L OR O) → M. In other words, two provenance records could contain a long chain of similar provenance nodes. We can apply Node Factorization to such records using the following definitions.

DEFINITION 11. Common Ancestor Node:
Two provenance nodes a and b have a common ancestor node if
i. $a.parent \overset{P}{=} b.parent$, or
ii. a.parent and b.parent are similar, and also have a Common Ancestor Node.

DEFINITION 12. Common Descendant Node:
Two provenance nodes a and b have a common descendant node if, for some children c and d of a and b, respectively, we have
i. $c \overset{P}{=} d$, or
ii. c and d are similar, and also have a Common Descendant Node.

Provenance Storage


[Efficient Provenance Storage, Chapman et al.]
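The idea of combining P0 = A → B → C and P1 = A → X → C into A → (B OR X) → C can be sketched in simplified form. The code below is an assumption-laden illustration, not the paper's algorithm: node names stand in for whole provenance nodes, chains are assumed to have equal length and to align position by position, and the common ancestor and descendant checks of Definitions 11 and 12 are replaced by that positional alignment.

```python
def s1(a, b):
    """The example similarity function: both node names contain 'curate'."""
    return "curate" in a and "curate" in b


def factor_chains(p0, p1, similarity_functions):
    """Merge two aligned chains; return None if some position cannot be merged.

    Each merged position is a tuple: one name if the nodes are equal,
    two names (an OR group) if they are only similar.
    """
    if len(p0) != len(p1):
        return None
    merged = []
    for a, b in zip(p0, p1):
        if a == b:
            merged.append((a,))
        elif any(f(a, b) for f in similarity_functions):
            merged.append((a, b))
        else:
            return None
    return merged


P0 = ["A", "curateHPRD", "C"]
P1 = ["A", "curateBIND", "C"]
combined = factor_chains(P0, P1, [s1])

# A chain that differs by a dissimilar node is left unfactored.
not_combined = factor_chains(P0, ["A", "transform", "C"], [s1])
```

In the factored form, a data item's pointer would also need to record which alternative of each OR group applies; that bookkeeping is omitted here.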


Querying Provenance

[Figure: a workflow (vtkContourFilter, vtkDataSetMapper, vtkActor, vtkCamera, vtkRenderer, VTKCell) with DATA as its input and IMAGE as its output]

• What process led to the output image?

• What input datasets contributed to the output image?

• What workflows include resampling and isosurfacing with isovalue 57?

• Graph traversal or graph patterns
  - How do we write such queries?
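One way to write the first two queries is as a traversal over an upstream-dependency graph. The sketch below is a plain-Python illustration; the edges in `UPSTREAM` are an assumed wiring of the modules named on the slide (the figure's actual connections are not recoverable from the text), and `KIND` is a hypothetical labeling of nodes as data or modules.

```python
from collections import deque

# Assumed wiring: each node maps to the nodes it directly depends on.
UPSTREAM = {
    "IMAGE": ["VTKCell"],
    "VTKCell": ["vtkRenderer"],
    "vtkRenderer": ["vtkActor", "vtkCamera"],
    "vtkActor": ["vtkDataSetMapper"],
    "vtkDataSetMapper": ["vtkContourFilter"],
    "vtkContourFilter": ["vtkDataSetReader"],
    "vtkDataSetReader": ["DATA"],
}
KIND = {"DATA": "data", "IMAGE": "data"}  # every other node is a module


def lineage(node, upstream):
    """Return every node reachable by following upstream edges from `node`."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# "What process led to the output image?"
process = {n for n in lineage("IMAGE", UPSTREAM) if KIND.get(n) != "data"}

# "What input datasets contributed to the output image?"
inputs = {n for n in lineage("IMAGE", UPSTREAM) if KIND.get(n) == "data"}
```

The third query on the slide (workflows that include resampling and isosurfacing with a given isovalue) is a pattern match over many graphs rather than a traversal of one, which is what motivates query-by-example interfaces.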


Querying Provenance by Example


• Provenance is represented as graphs: hard to specify queries using text!
• Querying workflows by example [Scheidegger et al., TVCG 2007; Beeri et al., VLDB 2006; Beeri et al., VLDB 2007]
  - WYSIWYQ: What You See Is What You Query
  - The interface for creating a workflow is the same as the one for querying

[Scheidegger et al.]


C. Silva & E. Anderson & E. Santos & J. Freire / Using VisTrails and Provenance for Teaching Scientific Visualization

[Figure: activity histograms for the 2007 and 2008 classes, with due dates for Tasks 1-6 marked. x-axis: Time; y-axis: Number of Actions (ticks at 0, 4000, 9000, 14000).]

Figure 5: Activity histogram of action dates with due dates indicated for both 2007 and 2008 classes.

Figure 6: The correlation between the number of branches and the number of tags per user-task.

version tree is correlated with the number of tagged nodes, as shown in Figure 6. This indicates that, as users have to revisit a previously defined workflow, they would select a tagged node because it is easier to identify.

4.2.2. Analysis of Tasks

Workflow evolution information can also be helpful to characterize tasks. As noted in Table 1, the tasks assigned to the scientific visualization students varied in their goals, difficulty, due date, and how open-ended they were. To illustrate how workflow evolution data can be used to understand the different types of work involved in a task, we classified the actions involved in workflow development into: structural actions (addition and deletion of modules and connections in the workflow); parameter actions (modification of parameter values in the workflow); and layout actions (changes to the locations of modules in visual programming interface).

Figure 7 shows an attempt to characterize tasks by the types of actions involved. For all users, we calculated the overall percentage of actions that were structural, parameter and layout actions across all tasks (Figure 7(a)). In addition, we computed these percentages for each task, as shown in Figure 7(b), (c) and (d). The distributions of these percentages were plotted as boxplots. Note that the percentage of actions spent changing parameters has the greatest variance for most tasks. This should be expected as some users locate correct parameter values faster than others, and some will also expend more effort tweaking parameters than others. Another interesting feature of these plots is that Task 5 shows more structural activity than Tasks 2, 3, and 4. This is explained by the fact that students were given examples for the previous three tasks, and in Task 5, they were left to discover how to create workflows from scratch.

4.2.3. Analysis of Users

A useful application of workflow evolution provenance is to help in understanding how different users approach a problem. Figure 8 shows two trees created by different users for the same task. User 1 and User 2 clearly have different development styles: the tree derived by User 2 is both shorter and narrower than that of User 1. This figure also shows a plot of the branching factor of the version trees across the tasks for User 1 and User 2. A smaller branching factor indicates that a more direct path was used to obtain a solution. In contrast, a larger branching factor indicates that more trial-and-error steps were followed. There are many cases where branching can be useful, including when a user wishes to develop workflows that share a common subworkflow: the user designs the first workflow, goes to the version tree, selects the node corresponding to the common subworkflow and from there branches to the second workflow. We found a range of branching factors that varied across users and tasks.

Branching is just one variable from the workflow evolution provenance data that can be used to identify “user signatures”; other variables, such as the time between actions and the number of sessions, may also lead to insights in this respect.

5. Discussion

We strongly believe that teaching is one of the killer applications of provenance-enabled systems. Provenance information can help instructors to be more effective and improve the students’ learning experience. Due to the provenance information, it is possible for one person to see what another person did, and to easily compare their own work to it. This makes it possible for the instructors to share their own work with the students, who can easily see how the problem was approached by someone with more experience. When making new functionality available (e.g., a new VTK module), the process of using the new module in an example can easily be turned into a tutorial on how to use the new functionality. This also makes it easier to have adoption in other places. One of the really nice features of the unobtrusive way that VisTrails captures provenance is that there is no extra burden on the user; they can do their work without caring about remembering what they did.

The data in the previous section shows that workflow evolution provenance allows one to measure, summarize, and

© The Eurographics Association 2010.

Provenance Analytics


Activity Histograms by Date

[Lins et al.]


Provenance Analytics


Comparing Paths to Solutions for Two Students

[Lins et al.]


Provenance Mining

[Figure: a database of workflows, illustrated by three pipelines:
  - vtkDataSetReader, vtkStreamTracer, vtkTubeFilter, vtkPolyDataMapper, vtkActor, vtkRenderer, VTKCell
  - vtkDataSetReader, vtkContourFilter, vtkDataSetMapper, vtkActor, vtkRenderer, VTKCell
  - vtkDataSetReader, vtkMaskPoints, vtkGlyph3D, vtkPolyDataMapper, vtkActor, vtkRenderer, VTKCell]

[VisComplete, Koop et al.]


Secure Provenance
• Make sure no one can tamper with provenance after it is originally generated
• Potential issues: want to add to provenance chain after original generation
• Solution: Secure hashing
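One common way to apply secure hashing here is a hash chain: each entry stores the hash of the previous entry, so changing an old record breaks every later link, while new records can still be appended. The sketch below shows that pattern under assumptions of its own (the record fields, the `GENESIS` value, and the function names are all hypothetical); it is not a scheme taken from the course readings.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(record, prev_hash):
    """SHA-256 over the record and the previous entry's hash."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain, record):
    """Add a record to the end of the chain, linking it to the last entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": entry_hash(record, prev)})


def verify(chain):
    """Return True if every entry's hash and back-link are consistent."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(entry["record"], prev):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"action": "add module", "module": "vtkContourFilter"})
append(chain, {"action": "add parameter", "module": "vtkContourFilter", "value": 67})
```

A chain like this only detects tampering if the most recent hash is held somewhere the attacker cannot rewrite (for example, signed or published), since anyone who can replace the whole chain can recompute all of its hashes.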


Provenance & Semantics
• Make it possible to reason about provenance using domain-specific language
• Use semantics/linked data approaches


The two main strains of research in this area concentrate on (i) provenance modelling, with the goal of supporting the users’ data validation tasks; and (ii) data architectures for provenance management. The work presented in this paper falls in the former of these two categories. Most of the provenance models proposed so far, including those just cited, have been focusing on describing the causal relationships amongst data products, without specific concern for the semantic characterisation of those products. We refer to these graphs as domain-agnostic, as they do not include any reference to domain-specific terms. In contrast, we propose a new semantic model of provenance, embodied by domain-aware graphs, designed to support data derivation questions that are formulated by user-scientists using domain-specific terminology. Fig. 1 clarifies the distinction between the two types of graphs (see footnote 4). The main differences between Fig. 1(a) and Fig. 1(b) are the additional semantic annotations shown in the latter. In this limited example, these are of the form V instance-of C or V has-source C′, where V is a value, and C, C′ are terms in some domain vocabulary, for biology concepts and biological database resources, respectively. We expect that, regardless of the specific formalism chosen to specify these annotations, domain-aware graphs will be useful to answer a broader class of user questions than their domain-agnostic counterparts (namely those questions that rely upon domain terms).

Fig. 1. Adding simple annotations to provenance graphs: (a) domain-agnostic provenance graph; (b) domain-aware provenance graph.

Taking this idea further, we also note that grounding a provenance model in the Semantic Web framework presents additional opportunities for supporting an even broader class of user questions. In particular, we explore the idea of making semantic provenance graphs a part of the broad Web of Data, an increasingly rich source of interconnected data that is uniformly represented according to the principles and conventions of Linked Open Data (LOD) [4]. In practice, we show how mapping data elements in the graph to equivalent data that is published elsewhere in the Web of Data makes it possible for queries to retrieve properties of data, which are not explicitly represented in the provenance graph or its annotations, but are instead associated with their equivalent external representations.

4 We use an abstract notation that is close to the one adopted in the Open Provenance Model http://www.openprovenance.org, where data values (the circles) are either produced or consumed by processes (the squares).

[Janus, Missier et al.]


Provenance Standards
• PROV W3C Standard:
  - http://www.w3.org/TR/prov-overview/
• Preceded by Open Provenance Model (OPM)
• Allows different modes of provenance capture to be combined
• Very general; there is work to extend it to specific domains
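To see why a shared vocabulary helps combine capture modes, the sketch below merges two sets of statements that use the PROV relation names `used(activity, entity)` and `wasGeneratedBy(entity, activity)`. It is plain Python rather than a PROV library or serialization, and the `ex:` identifiers and the split into "workflow" and "system" capture are assumptions made for illustration.

```python
# Statements are (relation, subject, object) tuples:
#   ("used", activity, entity) and ("wasGeneratedBy", entity, activity).
workflow_capture = {
    ("wasGeneratedBy", "ex:image", "ex:render"),
    ("used", "ex:render", "ex:isosurface"),
}
system_capture = {
    ("wasGeneratedBy", "ex:isosurface", "ex:contour"),
    ("used", "ex:contour", "ex:head.120.vtk"),
}

# Because both sources use the same relations and identifiers, combining is a union.
combined = workflow_capture | system_capture


def upstream_entities(entity, statements):
    """Entities reachable by alternating wasGeneratedBy and used links.

    Assumes each entity has at most one generating activity.
    """
    generated_by = {e: a for rel, e, a in statements if rel == "wasGeneratedBy"}
    used = {}
    for rel, a, e in statements:
        if rel == "used":
            used.setdefault(a, set()).add(e)
    found = set()
    stack = [entity]
    while stack:
        activity = generated_by.get(stack.pop())
        for source in used.get(activity, ()):
            if source not in found:
                found.add(source)
                stack.append(source)
    return found


from_workflow_only = upstream_entities("ex:image", workflow_capture)
from_combined = upstream_entities("ex:image", combined)
```

Only the combined statements trace the image back to the input file; each capture mode alone sees part of the chain.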


Figure 6. The Aruvi prototype. The alphabet labels describe the interaction interfaces of the prototype explained in the Prototype section. The numeric labels are used to describe an analysis process presented in the Use Case section.

of the objects does not change when the data mapping is changed by changing the axes or size. Hence, it is possible to track or emphasize the interesting objects during the entire exploration process. The three levels of DOI facilitate convergent analysis. However, the analyst can revert back to a previous DOI of the objects using the history tracking mechanism and continue the analysis with different DOIs for the objects. Hence, divergent analysis is also supported.

The analyst can choose to show or hide the low DOI objects via the show only filtered data interface (see figure 6(c)). The current selection interface (see figure 6(q)) displays the list of objects with high DOI. When there is no selection, it displays the list of objects with medium DOI. The object list in the current selection interface can be added as a note in the knowledge view using the paste as new note interface (see figure 6(r)). The scatterplot allows zoom-in to a particular region of the scatterplot via the Zoom in interface (see figure 6(e)). The settings interface (see figure 6(b)) toggles the display of the size and show only filtered data interfaces. An information bar interface (see figure 6(f)) is used to display details about the selection and size encoding. Finally, the analyst can save, reopen or recover the last analysis using a file menu (see figure 6(a)).

History Tracking
The granularity of the history tracking can be chosen in various ways. For instance, all changes to the visualization specification can be captured. However, some heuristics on specification change detection can be applied to avoid too much low level detail. For instance, when the user continuously changes the filter in the dynamic query interface, the changes are reflected in the visualization (scatterplot) but are not captured by the history tracking module. We found it to be convenient just to capture the visualization state when the mouse pointer leaves the dynamic query interface and if at least one of the filters has been changed. Other heuristics, like detection of (not necessarily continuous) change patterns could be used and will be studied in the future. The base model itself does allow for a variety of choices here.
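The capture heuristic described above can be sketched as follows. This is an illustrative reconstruction, not Aruvi's actual code: the class, the method names, and the `price` filter are all assumptions.

```python
class HistoryTracker:
    """Records a visualization state only when the pointer leaves the
    dynamic query interface and at least one filter has changed.
    Hypothetical sketch; names are not from the Aruvi prototype."""

    def __init__(self, initial_filters):
        self.history = [dict(initial_filters)]  # captured states
        self.current = dict(initial_filters)    # live, not yet captured

    def on_filter_change(self, name, value):
        # Continuous changes update the live view but are not captured.
        self.current[name] = value

    def on_pointer_leave(self):
        # Capture only if the live state differs from the last captured one.
        if self.current != self.history[-1]:
            self.history.append(dict(self.current))
            return True
        return False


tracker = HistoryTracker({"price": (0, 1000)})
for upper in (900, 800, 700):      # user drags a range slider
    tracker.on_filter_change("price", (0, upper))
tracker.on_pointer_leave()         # one state captured for the whole drag
tracker.on_pointer_leave()         # nothing changed: no new state
print(len(tracker.history))        # 2
```

Three slider movements produce a single history entry, which is the reduction in low-level detail the heuristic is meant to achieve.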

USE CASE
We now present a simple use case where a user explores a digital camera dataset (565 cameras with 15 attributes) using the Aruvi prototype. There are several tasks that the user might perform with the data, such as detecting trends and finding cameras that meet user requirements. To perform trend analysis, the user compares the digital camera attributes for different years. The user uses the scatterplot in the data view for this comparison.

The user records the findings in the knowledge view using a mind map. The mind map is a diagram used to represent ideas linked to and arranged radially around a central idea [9]. The user records the central idea - trend analysis - in note 1 (see figure 6(1)). Firstly, the user plots the number

CHI 2008 Proceedings · Visualization to Support Information Work April 5-10, 2008 · Florence, Italy


Visualization & Provenance


[Aruvi, Shrinivasan & van Wijk]

• Capture how results were achieved
• Includes many different items
• Improve collaboration and sharing


Reproducibility


[Figure: items involved in reproducibility: Text, Data, Workflows, Source Code, Libraries, Results, Visualizations]


Scientific Data Management
• Relational Databases
- Row-column format data, aggregation, grouping
- Very useful
- Time-tested
• Scientific Data
- Different requirements
- New types of users
- Streaming
- Often shared
- Data is not the end of the story


Graph Databases


• Lots of data in the form of graphs: molecules, social networks
• Store structured data in a way that makes connectivity queries more efficient

[neo4j]


Graph Indexing
• Subgraph Isomorphism is NP-Complete
• Use frequent graphs or other techniques to make searching graph databases more efficient


Figure 1: A Graph Database with Three Graphs

Figure 2: A Query Graph

solution: “Can we use tree instead of graph as the basic indexing feature?” Tree, which is also denoted as free tree, is a special connected, acyclic and undirected graph. As a generalization of linear sequential patterns, tree preserves plenty of structural information of graph. Meanwhile, tree is also a specialization of general graph, which avoids undesirable theoretical properties and algorithmic complexity incurred by graph. As the middle ground between these two extremes, tree becomes an ideal candidate of indexing features over the graph database. The main contributions of this paper are summarized below.

• We analyze the effectiveness and efficiency of trees as indexing features by comparing them with paths and graphs from three critical aspects, namely, feature size, feature selection cost, and pruning power. We show that tree-features can be effectively and efficiently used as indexing features for graph databases. Our main results show: (1) in many applications the majority of graph-features (usually more than 95%) are tree-features indeed; (2) frequent tree-features and graph-features share similar distributions and frequent tree-features have similar pruning power like graph-features; and (3) tree mining can be done much more efficiently than graph mining (it is not cost-effective to mine frequent graph features in which more than 95% are trees).

• We propose a new graph indexing mechanism, called (Tree+∆), that first selects frequent tree-features as the basis of a graph index, and then on-demand selects a small number of discriminative graph-features that can prune graphs more effectively than the selected tree-features, without conducting costly graph mining beforehand. A key issue here is how to achieve the similar pruning power of graph-features without graph mining. We propose a new approach by which we can approximate the pruning power of a graph-feature by its subtree-features with upper/lower bounds.

• We conducted extensive experimental studies using a real dataset and a series of synthetic datasets. We compared our (Tree+∆) with two up-to-date graph-based indexing methods: gIndex [23] and C-Tree [9]. Our study confirms that (Tree+∆) outperforms gIndex and C-Tree in terms of index construction cost and query processing cost.

The rest of the paper is organized as follows. In Section 2, we give the problem statement for the graph containment query processing, and discuss an algorithmic framework with a cost model. In Section 3, we analyze the indexability of frequent features (path, tree and graph) from three perspectives: feature size, feature selection cost, and pruning power. Section 4 discusses our new approach to add discriminative graph-features on demand. Section 5 presents the implementation details of our indexing algorithm (Tree+∆) with an emphasis on index construction and query processing. Section 6 shows the related work concerning the graph containment query problem over large graph databases. Our experimental study is reported in Section 7. Section 8 concludes this paper.

2. PRELIMINARIES
In this section, we introduce preliminary concepts and outline an algorithmic framework to address the graph containment query problem. A cost evaluation model is also presented on which our analysis of graph indexing solutions is based.

2.1 Problem Statement
A graph G = (V, E, Σ, λ) is defined as an undirected labeled graph where V is a set of vertices, E is a set of edges (unordered pairs of vertices), Σ is a set of labels, and λ is a labeling function, λ : V ∪ E → Σ, that assigns labels to vertices and edges. Let g and g′ be two graphs. g is a subgraph of g′, or g′ is a supergraph of g, denoted g ⊆ g′, if there exists a subgraph isomorphism from g to g′. We also say that g′ contains g or g is contained by g′. The concept of subgraph isomorphism from g to g′ is defined as an injective function from Vg to Vg′ that preserves vertex labels, edge labels and adjacency. The concept of graph isomorphism can be defined analogously by using a bijective function instead of an injective function. The size of g is denoted size(g) = |Vg|. A tree, also known as free tree, is a special undirected labeled graph that is connected and acyclic. For tree, the concept of subtree, supertree, subtree isomorphism, tree isomorphism can be defined accordingly. A path is the simplest tree whose vertex degrees are no more than 2.

Given a graph database G = {g1, g2, ..., gn} and an arbitrary graph g, let sup(g) = {gi | g ⊆ gi, gi ∈ G}. |sup(g)| is the support, or frequency of g in G. g is frequent if its support is no less than σ · |G|, where σ is a minimum support threshold provided by users.

Graph Containment Query Problem: Given a graph database, G = {g1, g2, ..., gn}, and a query graph q, a graph containment query is to find the set, sup(q), from G.

The graph containment query problem is NP-complete. It is infeasible to find sup(q) by sequentially checking subgraph isomorphism between q and every gi ∈ G, for 1 ≤ i ≤ n. And it is especially challenging when graphs in G are large, and |G| is also large in size and diverse. Graph indexing provides an alternative to tackle the graph containment query problem effectively.
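To make the definitions concrete, the sketch below computes sup(q) by the naive sequential scan that the excerpt calls infeasible at scale. It is a brute-force illustration on toy data, not the Tree+∆ method. The graph encoding, the helper names, and the three small graphs are assumptions and do not reproduce Figures 1 and 2.

```python
from fractions import Fraction
from itertools import permutations


def graph(labels, edges):
    """labels: {vertex: label}; edges: [(u, v, label)] of an undirected graph."""
    return labels, {frozenset((u, v)): lbl for u, v, lbl in edges}


def is_subgraph(g, h):
    """True if some injective vertex map from g to h preserves
    vertex labels, edge labels and adjacency (brute force, exponential)."""
    gv, ge = g
    hv, he = h
    g_nodes = list(gv)
    for image in permutations(hv, len(g_nodes)):
        m = dict(zip(g_nodes, image))
        if any(gv[v] != hv[m[v]] for v in g_nodes):
            continue
        if all(he.get(frozenset(m[v] for v in e)) == lbl for e, lbl in ge.items()):
            return True
    return False


def sup(q, database):
    """Indices of the database graphs that contain q."""
    return [i for i, g in enumerate(database) if is_subgraph(q, g)]


# Toy database (illustrative only): a path, a triangle, a triangle with a tail.
database = [
    graph({0: "c", 1: "c", 2: "c"}, [(0, 1, "-"), (1, 2, "-")]),
    graph({0: "c", 1: "c", 2: "c"}, [(0, 1, "-"), (1, 2, "-"), (0, 2, "-")]),
    graph({0: "c", 1: "c", 2: "c", 3: "n"},
          [(0, 1, "-"), (1, 2, "-"), (0, 2, "-"), (0, 3, "-")]),
]
query = graph({"a": "c", "b": "c", "d": "c"},
              [("a", "b", "-"), ("b", "d", "-"), ("a", "d", "-")])

answer = sup(query, database)
sigma = Fraction(2, 3)                      # minimum support threshold
is_frequent = len(answer) >= sigma * len(database)
print(answer, is_frequent)                  # [1, 2] True
```

Every query tests every graph in the database, which is the cost a graph index is meant to avoid by filtering candidates first.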

Example 2.1: A sample query graph, shown in Figure 2, is posed to a sample graph database with three graphs, shown in Figure 1. The graph in Figure 1 (c) is the answer. ✷

2.2 An Algorithmic Framework
Given a graph database G, and a query, q, the graph containment query can be processed in two steps. First, a preprocessing step called index construction generates indexing features from G. The feature set, denoted F, constructs the index, and for each feature f ∈ F, sup(f) is maintained. Second, a query processing step is performed in a filtering-


Figure 6: Frequent Graphs of G

all the path-features appearing in the query graph q are c, c–c, c–c–c, c–c–c–c and c–c–c–c–c. They cannot be used to prune the two graphs in Figure 1 (a) and (b), even if all these path-features are frequent in the graph database. Reconsider Example 2.1 for the graph-based indexing approach that uses frequent graph-features as index entries. This approach needs to mine frequent graph-features beforehand, which incurs high computation cost. In this example, some frequent graph-features discovered are shown in Figure 6, with σ = 2/3. In order to answer the query graph (Figure 2), only the graph-feature (Figure 6 (a)) can be used, which is a tree-feature in nature, while other frequent graph-features (Figure 6 (b) and Figure 6 (c)) are mined wastefully.

4. GRAPH FEATURE ON DEMAND
Based on the discussions in Section 3, a tree-based indexing mechanism can be efficiently deployed. It is compact and can be easily maintained in main memory, as shown in our performance studies. We have also shown that a tree-based index can have similar pruning power like that provided by the costly graph-based index on average in general. However, based on Theorem 3.1, it is still necessary to use effective graph-features to reduce the candidate answer set size, |Cq|, while tree-features cannot. In this section, we discuss how to select additional non-tree graph-features from q on demand that have greater pruning power than their subtree-features, based on the tree-feature set discovered.

Consider a query graph q, which contains a non-tree subgraph g ∈ FG′. If power(g) ≈ power(T(g)) w.r.t. pruning power, there is no need to index the graph-feature g, because its subtrees jointly have the similar pruning power. However, if power(g) ≫ power(T(g)), it will be necessary to select g as an indexing feature because g is more discriminative than T(g) for pruning purpose. Note the concept we use here in this paper is different from the discriminative graph concept used in gIndex, which is based on two frequent graph-features instead.

In this paper, we select discriminative graph-features from queries on-demand, without mining the whole set of frequent graph-features from G beforehand. These selected discriminative graph-features are therefore used as additional indexing features, denoted ∆, which can also be reused further to answer subsequent queries.

In order to measure the similarity of pruning power between a graph-feature g and its subtrees, T(g), we define a discriminative ratio, denoted ε(g), for a non-tree graph, g ∈ FG′ w.r.t. T(g) as

\[
\varepsilon(g) =
\begin{cases}
\dfrac{power(g) - power(T(g))}{power(g)} & \text{if } power(g) \neq 0 \\[1ex]
0 & \text{if } power(g) = 0
\end{cases}
\tag{8}
\]

Figure 7: Discriminative Graphs

Here, 0 ≤ ε(g) ≤ 1. When ε(g) = 0, g has the same pruning power as the set of all its frequent subtrees, T(g). The larger ε(g) is, the greater pruning power g has than T(g). When ε(g) = 1, the frequent subtree set T(g) has no pruning power, while g is the most discriminative graph-feature and definitely needed to be reclaimed and indexed from the graph database, G. Based on Eq. (8), we define a discriminative graph in Definition 4.1:

Definition 4.1: A non-tree graph g ∈ FG′ is discriminative if ε(g) ≥ ε0, where ε0 is a user-specified minimum discriminative threshold (0 < ε0 < 1). ✷

If a frequent non-tree graph g is not discriminative, we consider that there is no need to select g as an indexing feature, because it can not contribute more for pruning than its frequent subtrees that have already been used as indexing features. Otherwise, there is a good reason to reclaim g from G into the index, because g has greater pruning power than all its frequent subtrees (T(g)).

Suppose we set σ = 2/3 and ε0 = 0.5 for the sample database in Figure 1. Figure 7 illustrates two discriminative frequent graph-features. The pruning power of Figure 7 (a) is (1 − 2/3) = 1/3 and the pruning power of Figure 7 (b) is (1 − 1/3) = 2/3. Note: all frequent subtrees in Figure 7 (a) are subtrees of c–c–c–c–c, whose pruning power is 0. So the discriminative ratio, ε, of Figure 7 (a) is 1. The discriminative ratio ε of Figure 7 (b) can be computed similarly as 1/2.
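The worked example above can be checked with a few lines of exact arithmetic. The function below is a direct transcription of Eq. (8); its name is an assumption. The value power(T(g)) = 1/3 for Figure 7 (b) is not stated in the excerpt and is inferred here from the reported ratio of 1/2.

```python
from fractions import Fraction


def discriminative_ratio(power_g, power_tg):
    """Eq. (8): relative gain in pruning power of g over its frequent subtrees T(g)."""
    if power_g == 0:
        return Fraction(0)
    return (power_g - power_tg) / power_g


eps0 = Fraction(1, 2)  # minimum discriminative threshold from the example

# Figure 7 (a): power(g) = 1/3 and power(T(g)) = 0, both given in the text.
eps_a = discriminative_ratio(Fraction(1, 3), Fraction(0))

# Figure 7 (b): power(g) = 2/3 is given; power(T(g)) = 1/3 is an inferred
# value that reproduces the stated ratio of 1/2.
eps_b = discriminative_ratio(Fraction(2, 3), Fraction(1, 3))

print(eps_a, eps_b)                  # 1 1/2
print(eps_a >= eps0, eps_b >= eps0)  # True True
```

Both graphs meet the threshold ε0 = 0.5, so both count as discriminative under Definition 4.1.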

4.1 Discriminative Graph Selection
Given a query q, let’s denote its discriminative subgraph set as D(q) = {g1, g2, ..., gn}, where every non-tree graph gi ⊆ q (1 ≤ i ≤ n) is frequent and discriminative w.r.t. its subtree set, T(gi). For D(q), it is not necessary to reclaim every gi from G as indexing features, because to reclaim gi from G means to compute sup(gi) from scratch, which incurs costly subgraph isomorphism testings over the whole database. Given two graphs g, g′ ∈ D(q), where g ⊆ g′, intuitively, if the gap between power(g′) and power(g) is large enough, g′ will be reclaimed from G; otherwise, g is discriminative enough for pruning purpose, and there is no need to reclaim g′ in the presence of g. Based on the above analysis, we propose a new strategy to select discriminative graphs from D(q).

Recall in [23], a frequent graph-feature, g′, is discriminative, if its support, |sup(g′)|, is significantly greater than |sup(g)|, where g′ is a supergraph of g. It is worth noting that a costly graph mining process is needed to compute sup(g′) and sup(g). Below, we discuss our approach to select discriminative graph-features without graph mining beforehand. In order to do so, we approximate the discriminative computation between g′ and g, in the presence of our knowledge on frequent tree-features discovered.

\[
\begin{array}{ccc}
sup(g)\,(?) & \xrightarrow{\;?\;} & sup(g')\,(?) \\
\uparrow & & \uparrow \\
sup(T_g) & \longrightarrow & sup(T_{g'})
\end{array}
\]

The diagram above illustrates how to estimate the dis-



[Tree + Delta, Zhao et al.]


Scientific Databases
• Store scientific data like arrays (need locality)


Obviously this degree of extensibility complicates query planning enormously. The SciDB optimizer will have very limited insight into the logical or performance properties of the operators it is working with. In the case of the Smooth() operator introduced above, the SciDB engine may have to materialize the result of the intermediate Subsample(). For other kinds of operators, however, it might be possible to pipe-line the results of an inner sub-expression directly.

4. ARCHITECTURE
SciDB adopts a shared nothing design for its overall system architecture. We envision a SciDB instance being deployed over a network of computers, each with its own locally attached storage. Each compute/storage node runs a semi-autonomous instance of a SciDB engine, providing communications, query processing and a local storage manager. SciDB processes running on each node share access to a (logically) centralized system catalog database that stores information about the nodes, data distribution, user-defined extensions, and so forth. Our design is rather less tightly coupled than most shared nothing systems, and is influenced by the design of modern distributed computing systems that use a Map/Reduce [5] model. We expect this looser architecture will make it easier for us to build more flexible provisioning and reliability. Physical nodes may come and go, but unless a query addresses data stored on a node that is off-line, the SciDB instance is unaffected. Adding nodes amounts to adding new entries in our system catalog: the only central point of failure in the system.

Over the next few sections we review some of the features of our system’s implementation.

4.1 Storage Manager
The design of our storage manager draws on features from a number of commercial DBMSs, but includes a number of novel features reflecting the requirements of array data processing.

SciDB implements a distributed, no-overwrite storage manager. Data in a SciDB array is not update-able. New array data can be appended to a SciDB database, or the results of a SciDB query can be written back to the storage manager. We plan to implement only the ‘A’ and ‘D’ of transactional ACIDity.

In addition to our own purpose built storage manager we anticipate that SciDB will provide access to external data in situ through our extensible operator mechanism. For example, scientific users with large collections of NetCDF [9] or HDF [6] files will be able to address that data without importing it, and export data from SciDB in these de facto standard file formats.

4.1.1 Chunking, and Vertical Partitioning
Figure 5 presents an outline of our approach to storage management; how SciDB maps data in logical arrays into physical storage.

First, in common with the so-called column-store storage managers [1, 7], we vertically partition the data in our arrays. The SciDB storage manager splits attributes in a single logical array and handles values for each attribute separately. In other words, all low level operations in SciDB deal with arrays that contain a single value in each cell. The motivation here is the same as it is for column-store systems. Scientific users often focus their attention, in a particular query, on a sub-set of attributes in the logical array. Vertical partitioning therefore reduces I/O costs.

Figure 5: SciDB Storage Manager
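Vertical partitioning can be illustrated with a small sketch. This is not SciDB code or its API: the function name, the nested-list array encoding, and the `temp` and `pressure` attributes are all hypothetical.

```python
# Logical 2x2 array whose cells each hold two attributes (names are hypothetical).
logical = [
    [{"temp": 20.5, "pressure": 101.2}, {"temp": 21.0, "pressure": 101.0}],
    [{"temp": 19.8, "pressure": 100.9}, {"temp": 20.1, "pressure": 101.1}],
]


def vertical_partition(array):
    """Split a multi-attribute array into one single-valued array per attribute."""
    attrs = array[0][0].keys()
    return {a: [[cell[a] for cell in row] for row in array] for a in attrs}


parts = vertical_partition(logical)

# A query that touches only `temp` never has to read the `pressure` values.
print(parts["temp"])  # [[20.5, 21.0], [19.8, 20.1]]
```

Each per-attribute array keeps the shape of the logical array, so cell coordinates stay valid across partitions.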

Second, our storage manager takes each attribute’s data, and further decomposes the array into a number of equal sized, and potentially overlapping, chunks. In SciDB chunks are our physical unit of I/O, processing, and inter-node communication. Chunks are quite large: on the order of 64 megabytes. Within the SciDB system catalog we store, for each chunk, the chunk’s location (as a range of dimensional indices) within the logical array.

4.1.2 Overlapping Chunks
The motivation for our decision to overlap chunks deserves more detailed discussion.

Recall the Gaussian Smoothing operation introduced in Section 3.3. To compute the new, smoothed value for each cell, the algorithm needs to consult attribute values in the surrounding region. The size of the region that contributes to the smoothing operation varies from application to application but is typically fairly small, relative to the overall data. If the array data was partitioned into non-overlapping chunks, SciDB would be obliged to ‘stitch together’ adjacent chunks to reconstruct ‘boundary’ regions.

By segmenting our arrays into overlapping chunks, and picking the right ‘overlap’ extent, we are able to parallelize operations like Gaussian Smoothing. All of the data needed to compute the filter is available within the same chunk. The downside of this strategy is that it increases our storage management overhead, and presents a configuration challenge to SciDB DBAs.

Consider the illustration in Figure 6. Here, an 11x5 array ‘A’ has been decomposed into three, overlapping 5x5 chunks. With this scheme, an operation that requires the examination of any complete 3x3 subsample of A need only consult the data in a single chunk and the operation can be parallelized three ways. The darker grey areas are regions of the array that are replicated; a particular attribute value in a cell stored in more than one chunk. Lighter gray areas represent regions of the array which are ‘non-core’ in the sense that any algorithm considering a 3x3 subsample of ‘A’ will be unable to compute results for these boundary cells.
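The 11x5 example can be reproduced along the 11-cell dimension with a short sketch. The exact chunk boundaries of the paper's Figure 6 are not visible in this excerpt, so the starts 0, 3 and 6 (an overlap of two cells) are an assumption consistent with "three overlapping 5x5 chunks". The function names are hypothetical.

```python
def overlapping_chunks(n, chunk, overlap):
    """Split indices 0..n-1 into [start, stop) ranges of width `chunk`, where
    consecutive ranges share `overlap` indices.
    Assumes (n - chunk) is a multiple of (chunk - overlap), so the last
    chunk ends exactly at n."""
    step = chunk - overlap
    return [(s, s + chunk) for s in range(0, n - chunk + 1, step)]


def chunk_for_window(chunks, start, width):
    """Index of the first chunk that fully contains [start, start + width)."""
    for i, (lo, hi) in enumerate(chunks):
        if lo <= start and start + width <= hi:
            return i
    return None


cols = overlapping_chunks(11, chunk=5, overlap=2)
print(cols)  # [(0, 5), (3, 8), (6, 11)]

# Every 3-wide window along the 11-cell dimension fits inside a single chunk,
# so a 3x3 operation never needs to stitch neighbouring chunks together.
windows = range(11 - 3 + 1)
print(all(chunk_for_window(cols, s, 3) is not None for s in windows))  # True

# Without overlap, windows that straddle a chunk boundary fit in no chunk.
flat = [(0, 5), (5, 10)]
print(chunk_for_window(flat, 4, 3))  # None
```

For a window of width w, an overlap of at least w - 1 cells guarantees that every window lies entirely inside some chunk, at the price of storing the overlapping cells more than once.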

[SciDB]


Assignments for next Tuesday (9/9/2014)
• Reading:
- Provenance for Computational Tasks: A Survey by Freire et al.
- No reading response, will do sample next class
- Be prepared for a short quiz
• Reading Topics:
- Think about which topics you might like to present on
• Course Projects:
- Think about potential course projects that involve the areas discussed today
• Course Registration:
- Make sure you have registered in COIN for the course
- Email me ([email protected]) if you need a permission code
