from data to knowledge with workflows & provenance

Bertram Ludäscher [email protected] Center for Informatics Research in Science & Scholarship (CIRSS), School of Information Sciences (formerly: GSLIS) & National Center for Supercomputing Applications (NCSA) & Department of Computer Science. From Data to Knowledge with Workflows & Provenance

Upload: bertram-ludaescher

Post on 09-Jan-2017


TRANSCRIPT

Page 1: From Data to Knowledge with Workflows & Provenance

Bertram Ludäscher [email protected]

Center for Informatics Research in Science & Scholarship (CIRSS)

School of Information Sciences (formerly: GSLIS) & National Center for Supercomputing Applications (NCSA)

& Department of Computer Science

From Data to Knowledge with Workflows & Provenance

Page 2: From Data to Knowledge with Workflows & Provenance

Outline

•  Scientific Workflows
–  Examples, Features

•  Data Cleaning and Curation

•  Provenance & Reproducible Science
–  “Prospective Provenance” (a.k.a. workflows)
–  Retrospective Provenance

•  YesWorkflow
–  Yes, Scripts can be Workflows, too!

•  Other stuff
–  Time allowing...

SPIN'16@NCSA

Page 3: From Data to Knowledge with Workflows & Provenance

Introductions should come first!

•  MS Computer Science, U Karlsruhe (K.I.T.)
•  PhD Computer Science, U Freiburg, Germany
•  Research Scientist, UC San Diego, SDSC
•  Dept. of Computer Science, UC Davis
•  School of Information Sciences, U of Illinois
•  Natl. Center for Supercomputing Applications

Page 4: From Data to Knowledge with Workflows & Provenance

Scientific Workflows: ASAP

•  Automation
–  wfs to automate computational aspects of science

•  Scaling (exploit and optimize machine cycles)
–  wfs should make use of parallel compute resources
–  wfs should be able to handle large data

•  Abstraction, Evolution, Reuse (human cycles)
–  wfs should be easy to (re-)use, evolve, share

•  Provenance
–  wfs should capture processing history, data lineage
=> traceable data- and wf-evolution
=> Reproducible Science

Trident Workbench

VisTrails

Once upon a time… (Es war einmal…)

Page 5: From Data to Knowledge with Workflows & Provenance

10 Essential functions of a scientific workflow system

1.   Automate programs and services scientists already use.

2.   Schedule invocations of programs and services correctly and efficiently – in parallel where possible.

3.   Manage dataflow to, from, and between programs and services.

4.   Enable scientists (not just developers) to author or modify workflows easily.

5.   Predict what a workflow will do when executed: prospective provenance.

6.   Record what happened during workflow execution: retrospective provenance.

7.   Reveal retrospective provenance – how workflow products were derived from inputs via programs and services.

8.   Organize intermediate and final data products as desired by users.

9.   Enable scientists to version, share and publish their workflows.

10.   Empower scientists who wish to automate additional programs and services themselves.

These functions (not just dataflow & actors) distinguish scientific workflow automation from general scientific software development.

Src: Timothy McPhillips
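Functions 3, 5 and 6 in the list above can be illustrated with a toy sketch. Nothing here is taken from Kepler or any other workflow system: the `Step` class, the `plan` and `run` helpers, and the two example steps are hypothetical names chosen for illustration. `plan` orders steps from their declared inputs and outputs (a stand-in for prospective provenance), and `run` records what each step actually used and generated (a stand-in for retrospective provenance).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Step:
    """A hypothetical workflow step with declared inputs and outputs."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    func: Callable[..., tuple]


def plan(steps, available):
    """Order steps so that every input exists before it is used."""
    known = set(available)
    pending = list(steps)
    ordered = []
    while pending:
        ready = [s for s in pending if set(s.inputs) <= known]
        if not ready:
            names = ", ".join(s.name for s in pending)
            raise ValueError("unsatisfiable inputs for: " + names)
        for s in ready:
            ordered.append(s)
            known.update(s.outputs)
            pending.remove(s)
    return ordered


def run(steps, data):
    """Execute the plan and record one trace event per executed step."""
    values = dict(data)
    trace = []
    for s in plan(steps, values):
        used = {name: values[name] for name in s.inputs}
        results = s.func(*used.values())
        generated = dict(zip(s.outputs, results))
        values.update(generated)
        trace.append({"step": s.name, "used": used, "generated": generated})
    return values, trace


# Steps are deliberately listed out of order; plan() sorts them by dataflow.
steps = [
    Step("average", ("total", "count"), ("mean",), lambda t, c: (t / c,)),
    Step("summarize", ("readings",), ("total", "count"),
         lambda r: (sum(r), len(r))),
]
values, trace = run(steps, {"readings": [2, 4, 6]})
```

The trace is a plain list of dictionaries, so "how was `mean` derived?" becomes a lookup over recorded events rather than a re-reading of the code.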

Page 6: From Data to Knowledge with Workflows & Provenance

WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)

[Figure: WATERS pipeline. Boxes shown:]

•  Assembled contigs
•  Chimera check (Mallard)
•  Profile alignment (STAP or Infernal) [+/- cluster]
•  Find OTUs (OTUHunter)
•  Assign Taxonomy (STAP) [+/- cluster]
•  Build phylogenetic tree (RaxML or Quicktree) [+/- cipres, +/- cluster]
•  View tree: Dendroscope
•  UniFrac: tree & environment file
•  Diversity statistics: Text: OTU list, Chao1, Shannon; Graphs: rarefaction curves, rank-abundance curves
•  Visualization tools: Cytoscape networks & Heatmap

Page 7: From Data to Knowledge with Workflows & Provenance

Executable WATERS Workflow in Kepler


Page 8: From Data to Knowledge with Workflows & Provenance

Example Bioinformatics Workflow: Motif-Catcher

Marc Facciotti et al., UC Davis Genome Center

Page 9: From Data to Knowledge with Workflows & Provenance

Motif-Catcher workflow, implemented in Kepler

S. Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012

Page 10: From Data to Knowledge with Workflows & Provenance

A Data-Streaming Workflow over Sensor Data


Page 11: From Data to Knowledge with Workflows & Provenance

“Plumbing” workflow

•  Monitor and control supercomputer simulations
–  50+ composite actors (subworkflows)
–  4 levels of hierarchy
–  1000+ atomic (Java) actors

[Figure: nested subworkflows annotated with their sizes, including 43 actors, 3 levels; 196 actors, 4 levels; 206 actors, 4 levels; 243 actors, 4 levels; and smaller subworkflows of 12 to 137 actors]

Norbert Podhorszki, ORNL (then: UC Davis)

Page 12: From Data to Knowledge with Workflows & Provenance

Scientific Workflow Design: Some Challenges

“And the graphical UI makes our scientific workflows so much easier to develop, understand and maintain!”

Page 13: From Data to Knowledge with Workflows & Provenance

More “Plumbing” (beware the Boolean Select)

Cabellos et al. Computer Physics Communications 182, 2011

Page 14: From Data to Knowledge with Workflows & Provenance

Modeling & Design: “The limits of my language mean the limits of my world” (Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt)

•  Vanilla Process Network
•  Functional Programming Dataflow Network
•  XML Transformation Network
•  Collection-oriented Modeling & Design framework (COMAD): “Look Ma: No Shims!”

Page 15: From Data to Knowledge with Workflows & Provenance

Problems with [too many] Shims and Wires

•  Shims need to be placed and connected
–  Tedious, error-prone
•  Distract from scientifically meaningful actors
–  Non-descriptive workflows – worth sharing?
•  Data organization is encoded in workflow structure
–  Not robust to data changes
•  Shims often lead to complex designs
–  Imagine all previous “design patterns” intertwined
–  GOTO-programming

COMAD/VDAL: Raising the level of abstraction

•  Localized control-flow
•  Data management not done via wires
•  Actors are coupled not by wire but by data!

Page 16: From Data to Knowledge with Workflows & Provenance

Pipelined Collection-Oriented Workflows

Collection-Oriented Modeling & Design (COMAD)
–  fully embrace the assembly line metaphor
–  data = tagged nested collections
–  e.g. represented as flattened, pipelined (XML) token streams

Actors (like assembly line workers) pass on what they don’t work on.

T McPhillips, S Bowers, D Zinn, B Ludäscher
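The assembly-line idea can be sketched in a few lines of Python. This is not the COMAD implementation: the token layout `(kind, tag, payload)`, the `actor` helper, and the sample stream are assumptions made for illustration. A nested collection is flattened into open/data/close tokens, and each actor rewrites only the data tokens in its scope while passing every other token on untouched.

```python
def actor(scope_tag, fn):
    """Build a stream stage that rewrites only data tokens tagged scope_tag."""
    def stage(tokens):
        for kind, tag, payload in tokens:
            if kind == "data" and tag == scope_tag:
                yield (kind, tag, fn(payload))   # work on what is in scope
            else:
                yield (kind, tag, payload)       # pass on everything else
    return stage


# A flattened nested collection: project > sequences > seq items, plus a note.
stream = [
    ("open", "project", None),
    ("open", "sequences", None),
    ("data", "seq", "acgt"),
    ("data", "seq", "ttga"),
    ("close", "sequences", None),
    ("data", "note", "keep me as is"),
    ("close", "project", None),
]

to_upper = actor("seq", str.upper)
annotate = actor("note", lambda text: text + " (seen)")

# Stages are chained like workers on a line; no shim is needed between them.
result = list(annotate(to_upper(iter(stream))))
```

Because unrelated tokens flow through unchanged, inserting or removing a stage does not require rewiring the others, which is the change-resilience argued for above.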

Page 17: From Data to Knowledge with Workflows & Provenance

Two different workflow designs

•  Hardwiring vs. configurable data/collection management
•  Brittle vs. change-resilient designs
•  Scientist can recognize napkin drawing / conceptual model
•  Human cycles are expensive

Page 18: From Data to Knowledge with Workflows & Provenance

ADIOS in Kepler


Page 19: From Data to Knowledge with Workflows & Provenance

ADIOS in COMAD


Page 20: From Data to Knowledge with Workflows & Provenance

From Data Life-Cycle to Curation Life-Cycle

Uncanny Resemblance: Eye of Jupiter (If you have “visions”…)

DCC Curation Lifecycle

Page 21: From Data to Knowledge with Workflows & Provenance

Data Cleaning (Scientific & Business Apps)

Page 22: From Data to Knowledge with Workflows & Provenance

How do you clean data? (Syntax)

Page 23: From Data to Knowledge with Workflows & Provenance

How do you clean data? (Syntax)

•  Regular Expressions (regex)

•  Write your own scripts
–  … with regex
–  … in Python!
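A minimal sketch of what "your own script, with regex, in Python" can look like. The field formats are assumptions made for illustration, not data from the talk: dates are taken to arrive either as `YYYY-M-D` or as day-first `D/M/YYYY` or `D.M.YYYY`, and names to carry stray whitespace. The helper names `clean_date` and `clean_name` are hypothetical.

```python
import re

ISO_DATE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")
DAY_FIRST_DATE = re.compile(r"(\d{1,2})[./](\d{1,2})[./](\d{4})")


def clean_date(raw):
    """Normalize a date string to ISO 'YYYY-MM-DD'; return None if unrecognized.

    Assumption: slash- or dot-separated dates are day-first (D/M/YYYY).
    """
    text = raw.strip()
    match = ISO_DATE.fullmatch(text)
    if match:
        year, month, day = match.groups()
    else:
        match = DAY_FIRST_DATE.fullmatch(text)
        if not match:
            return None
        day, month, year = match.groups()
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


def clean_name(raw):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", raw).strip()


records = [" 7/6/2016 ", "2016-6-7", "07.06.2016", "June 7"]
cleaned = [clean_date(r) for r in records]
```

Returning `None` for unrecognized input, rather than guessing, keeps the syntactic cleaning step auditable: the rejects can be logged and reviewed separately.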

Page 24: From Data to Knowledge with Workflows & Provenance

Kurator Project (Data Curation Workflows)


Page 25: From Data to Knowledge with Workflows & Provenance

From “Climate Gate” to Reproducible Science

Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science
=> need a workflow-system-agnostic model

Page 26: From Data to Knowledge with Workflows & Provenance

Provenance: The Fine Arts

•  One of these has been sold for nearly $180m.
•  The other could be worth as much or more.
•  Which is which?
•  What is the difference?

Page 27: From Data to Knowledge with Workflows & Provenance

Provenance in Science

•  What’s so “provenance” about this?
•  Grand Canyon’s rock layers are a record of the early geologic history of North America.

The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)

Page 28: From Data to Knowledge with Workflows & Provenance

Natural History: Understanding what happened…

Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.

Author: Jkwchui (Based on drawing by Truth-seeker2004)

Page 29: From Data to Knowledge with Workflows & Provenance

Computational Provenance

•  Origin and processing history of an artifact
–  usually: data (products), figures, ...
–  sometimes: workflow (and script) evolution …

•  Different sub-communities:
–  Provenance in databases
–  Provenance in (scientific) workflows
–  ... programming languages, systems/security, …

Page 30: From Data to Knowledge with Workflows & Provenance

Different Kinds of Data Provenance in Workflows

•  Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
•  Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)

Page 31: From Data to Knowledge with Workflows & Provenance

SKOPE: Synthesized Knowledge Of Past Environments

•  Bocinsky, Kohler et al. study rain-fed maize of the Anasazi
–  Four Corners; AD 600–1500.
•  Climate change influenced Mesa Verde migrations; late 13th century AD.
•  Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000.
•  The algorithm estimates joint information in tree-rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

Page 32: From Data to Knowledge with Workflows & Provenance

YesWorkflow: Yes, scripts are workflows, too!

•  Script vs Workflows / ASAP:
–  Automation: *****
–  Scaling: **
–  Abstraction: *
–  Provenance: **

[Figure: workflow graph of the paleoclimate reconstruction script.
Steps: GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output.
Inputs: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years.
Intermediate data: PRISM_annual_growing_season_precipitation, dendro_series_for_calibration, dendro_series_for_reconstruction, cellwise_unique_selected_linear_models, cellwise_union_selected_linear_models, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors.
Outputs: ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif.]

Page 33: From Data to Knowledge with Workflows & Provenance

YW annotations: Model your Workflow!

Page 34: From Data to Knowledge with Workflows & Provenance

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

•  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …

[Figure: YW workflow view of a data-collection script.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path.
Data with file templates:]

–  sample_spreadsheet  file:cassette_{cassette_id}_spreadsheet.csv
–  calibration_image  file:calibration.img
–  run_log  file:run/run_log.txt
–  rejection_log  file:/run/rejected_samples.txt
–  raw_image  file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
–  corrected_image  file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
–  collection_log  file:run/collected_images.csv

Page 35: From Data to Knowledge with Workflows & Provenance

Paleoclimate Reconstruction (EnviRecon.org)

•  … explained using YesWorkflow!

[Figure: YesWorkflow graph of the reconstruction script, with steps GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, and CAR_Reconstruction_union_output]

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

Page 36: From Data to Knowledge with Workflows & Provenance

Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

Page 37: From Data to Knowledge with Workflows & Provenance

Using Provenance from Script Runs

Example from the log-file:
2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img

But how was that image derived?? (“Provenance for Self!”)

Page 38: From Data to Knowledge with Workflows & Provenance

noWorkflow: not only Workflow!

•  Scripts have provenance, too!

•  Transparently capture some/all provenance from Python script runs.

•  Use filter queries to “zoom” into relevant parts...

[Figure: full noWorkflow trace graph of one simulate_data_collection run. Nodes are function calls and variable assignments labeled with script line numbers (e.g. 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), repeated once per loop iteration. Input files: cassette_q55_spreadsheet.csv, calibration.img. Output files: run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and raw/corrected image pairs such as run/raw/q55/DRT240/e10000/image_001.raw and run/data/DRT240/DRT240_10000eV_001.img for samples DRT240 and DRT322.]

Page 39: From Data to Knowledge with Workflows & Provenance

noWorkflow lineage of an image file

Provenance information about Python function calls, variable assignments, etc.

$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"

[Figure: filtered noWorkflow trace for simulate_data_collection, one node per line:]

230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img

... auto-“make” this!

$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv

Page 40: From Data to Knowledge with Workflows & Provenance

YesWorkflow: Yes, scripts are Workflows, too!

•  Use YW annotations @begin ... @end, @in, @out to reveal hidden conceptual workflow (prospective provenance)

•  Script isn't changed:
–  annotations via comments (=> language independent)

•  For understanding and sharing the “big picture”

•  Query and visualize!

[Figure: YW workflow view of simulate_data_collection.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data: cassette_id, sample_score_cutoff, data_redundancy, sample_spreadsheet, calibration_image, run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log.]
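To make the annotation idea concrete, here is a minimal sketch that pulls @begin/@end/@in/@out comments out of a script and returns the declared steps with their inputs and outputs. It is not the YesWorkflow tool or its parser: `extract_blocks` and the sample script are hypothetical, only the four keywords named on the slide are handled, and the assignment of ports to steps in the sample is an assumption read off the workflow figure.

```python
import re

# The script body is never executed here; only its comments are read.
ANNOTATED_SCRIPT = '''
# @begin load_screening_results
# @in cassette_id
# @in sample_spreadsheet
# @out sample_name
# @out sample_quality
rows = read_rows(sample_spreadsheet)
# @end load_screening_results

# @begin calculate_strategy
# @in sample_name
# @in sample_quality
# @in sample_score_cutoff
# @out accepted_sample
# @out rejected_sample
decision = decide(rows, sample_score_cutoff)
# @end calculate_strategy
'''

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_blocks(source):
    """Return {step_name: {"in": [...], "out": [...]}} from comment annotations."""
    blocks = {}
    open_blocks = []                       # stack of currently open @begin names
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue                       # ordinary code line: ignored
        keyword, name = match.groups()
        if keyword == "begin":
            blocks[name] = {"in": [], "out": []}
            open_blocks.append(name)
        elif keyword == "end":
            if not open_blocks or open_blocks[-1] != name:
                raise ValueError("unmatched @end " + name)
            open_blocks.pop()
        elif open_blocks:                  # @in / @out belong to innermost block
            blocks[open_blocks[-1]][keyword].append(name)
    if open_blocks:
        raise ValueError("missing @end for " + open_blocks[-1])
    return blocks


blocks = extract_blocks(ANNOTATED_SCRIPT)
```

Since the extractor reads only comment text, the same approach works for any language with line comments after adjusting the comment marker, which is the language-independence point made above.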

Page 41: From Data to Knowledge with Workflows & Provenance

Alternate YW Views

[Figure: three renderings of simulate_data_collection: Process view (steps only: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity), Data view, and Workflow view (steps together with their data)]

Page 42: From Data to Knowledge with Workflows & Provenance

What is the lineage of “corrected_image”?

•  From here on “upwards”: What led (leads) to this?

•  ... and what is irrelevant and should be pruned??

[Figure: full YW workflow view of simulate_data_collection]

Page 43: From Data to Knowledge with Workflows & Provenance

What is the lineage of corrected_image?

Subgraph resulting from lineage query on YW workflow model

[Figure: lineage subgraph of simulate_data_collection.
Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images.
Data: cassette_id, sample_spreadsheet, sample_score_cutoff, data_redundancy, calibration_image, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image.]
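A lineage query of this kind amounts to walking the workflow graph backwards from the target. The sketch below is an illustration in plain Python rather than the query mechanism used in the demo. The `WORKFLOW` table is an assumption: step and data names come from the slides, but which data feeds which step is my reading of the figure.

```python
# step -> (inputs, outputs); the edges are an assumed reading of the slide figure.
WORKFLOW = {
    "initialize_run": ((), ("run_log",)),
    "load_screening_results": (
        ("cassette_id", "sample_spreadsheet"),
        ("sample_name", "sample_quality")),
    "calculate_strategy": (
        ("sample_name", "sample_quality", "sample_score_cutoff", "data_redundancy"),
        ("accepted_sample", "rejected_sample", "num_images", "energies")),
    "log_rejected_sample": (
        ("cassette_id", "rejected_sample"),
        ("rejection_log",)),
    "collect_data_set": (
        ("cassette_id", "accepted_sample", "num_images", "energies"),
        ("sample_id", "energy", "frame_number", "raw_image")),
    "transform_images": (
        ("sample_id", "energy", "frame_number", "raw_image", "calibration_image"),
        ("corrected_image", "total_intensity", "pixel_count")),
    "log_average_image_intensity": (
        ("cassette_id", "sample_id", "frame_number", "total_intensity", "pixel_count"),
        ("collection_log",)),
}


def lineage(workflow, target):
    """Return (steps, data) that the target data item transitively depends on."""
    producer = {out: step
                for step, (_, outputs) in workflow.items()
                for out in outputs}
    steps, data = set(), set()
    frontier = [target]
    while frontier:
        item = frontier.pop()
        if item in data:
            continue
        data.add(item)
        step = producer.get(item)          # None means a workflow input
        if step is not None and step not in steps:
            steps.add(step)
            frontier.extend(workflow[step][0])
    return steps, data


steps, data = lineage(WORKFLOW, "corrected_image")
```

Under these assumed edges the query keeps four steps and prunes `initialize_run`, `log_rejected_sample`, and `log_average_image_intensity`, in line with the subgraph shown on the slide.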

Page 44: From Data to Knowledge with Workflows & Provenance

44

simulate_data_collection

initialize_run

run_log load_screening_results

sample_namesample_quality

calculate_strategy

accepted_samplerejected_sample num_imagesenergies

log_rejected_sample

rejection_log

collect_data_set

sample_id energyframe_number raw_image

transform_images

corrected_imagetotal_intensitypixel_count

log_average_image_intensity

collection_log

sample_spreadsheet

calibration_image

sample_score_cutoffdata_redundancy

cassette_id

simulate_data_collection

collect_data_set

sample_id energy frame_number raw_image

calculate_strategy

accepted_sample num_imagesenergies

load_screening_results

sample_namesample_quality

transform_images

corrected_image

sample_spreadsheet

calibration_image

sample_score_cutoff data_redundancy

cassette_id

module.__build_class__

module.__build_class__

simulate_data_collection

180 return

180 run_logger

201 return

201 new_image_file

230 parser

231 cassette_id

236 add_option

241 add_option

246 add_option

248 set_usage

251 parse_args251 args

251 options254 module.len

24 cassette_id

24 sample_score_cutoff

24 data_redundancy

24 calibration_image_file

30 exists

33 exists

32 filepath

34 module.remove

33 exists

32 filepath

34 module.remove

33 exists

32 filepath

34 module.remove

36 run_log

37 write

38 str(sample_score_cutoff)

38 write

38 str(sample_score_cutoff)

49 str.format

49 sample_spreadsheet_file

50 spreadsheet_rows

cassette_q55_spreadsheet.csv

50 spreadsheet_rows(sample_spreadsheet_file)

51 str.format 51 write

50 sample_name

50 sample_quality

61 calculate_strategy

61 rejected_sample

61 energies

61 accepted_sample

61 num_images

72 str.format

72 write

73 open

73 rejection_log

74 str.format

74 TextIOWrapper.write

50 spreadsheet_rows

50 spreadsheet_rows(sample_spreadsheet_file)

51 str.format

51 write

50 sample_name

50 sample_quality

61 calculate_strategy

61 rejected_sample

61 energies

61 accepted_sample

61 num_images

90 str.format

90 write91 sample_id

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image

calibration.img

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file

120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format 93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file

120 module.writer 120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format 106 transform_image 106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format 93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image 106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image 106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file

120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file

120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

50 spreadsheet_rows

50 spreadsheet_rows(sample_spreadsheet_file)

51 str.format

51 write

50 sample_name

50 sample_quality

61 calculate_strategy

61 rejected_sample

61 energies

61 accepted_sample

61 num_images

90 str.format

90 write

91 sample_id

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open119 collection_log_file 120 module.writer

120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file

120 module.writer 120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format

106 transform_image 106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open 119 collection_log_file 120 module.writer 120 collection_log

121 writer.writerow

92 collect_next_image

92 collect_next_image(casset ... _{frame_number:03d}.raw')

93 str.format

93 write

92 energy

92 frame_number

92 intensity

92 raw_image_file

106 str.format 106 transform_image

106 corrected_image_file

106 total_intensity

106 pixel_count

107 str.format

107 write

118 average_intensity

119 open

119 collection_log_file 120 module.writer120 collection_log

121 writer.writerow

92 collect_next_image

50 spreadsheet_rows

128 return

run/run_log.txt

run/rejected_samples.txt

run/raw/q55/DRT240/e10000/image_001.raw

run/data/DRT240/DRT240_10000eV_001.img

run/collected_images.csv

run/raw/q55/DRT240/e10000/image_002.raw

run/data/DRT240/DRT240_10000eV_002.img

run/raw/q55/DRT240/e11000/image_001.raw

run/data/DRT240/DRT240_11000eV_001.img

run/raw/q55/DRT240/e11000/image_002.raw

run/data/DRT240/DRT240_11000eV_002.img

run/raw/q55/DRT240/e12000/image_001.raw

run/data/DRT240/DRT240_12000eV_001.img

run/raw/q55/DRT240/e12000/image_002.raw

run/data/DRT240/DRT240_12000eV_002.img

run/raw/q55/DRT322/e10000/image_001.raw

run/data/DRT322/DRT322_10000eV_001.img

run/raw/q55/DRT322/e10000/image_002.raw

run/data/DRT322/DRT322_10000eV_002.img

run/raw/q55/DRT322/e11000/image_001.raw

run/data/DRT322/DRT322_11000eV_001.img

run/raw/q55/DRT322/e11000/image_002.raw

run/data/DRT322/DRT322_11000eV_002.img

simulate_data_collection

230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')

calibration.img

run/data/DRT240/DRT240_11000eV_002.img

lineage query

YesWorkflow: Conceptual workflow model

noWorkflow: Python trace model

But how do we bridge this gap???

Would like to use YW model to query NW data!

Page 45: From Data to Knowledge with Workflows & Provenance

Some bridges are precarious…

Page 46: From Data to Knowledge with Workflows & Provenance

…and new bridge-building can be stressful

…even if just painting over.

Page 47: From Data to Knowledge with Workflows & Provenance

Habemus Pons! We’ve got the Bridge! The bridge is the journey.. (The journey is the destination)

Lineage of image file in terms of YW model, with details from NW provenance
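A lineage query of this kind can be read as backward reachability over "was derived from" edges. The following is a minimal Python sketch of that idea only. The edge list is a hand-written, hypothetical example that reuses names from the trace values shown earlier; it is not the actual noWorkflow or YesWorkflow schema, and the demo itself uses other tools for its queries.

```python
from collections import deque

# Hypothetical "was derived from" edges (item -> its direct inputs).
# The names echo the trace shown earlier; the edges are an illustrative assumption.
DERIVED_FROM = {
    "run/data/DRT240/DRT240_11000eV_002.img": [
        "run/raw/q55/DRT240/e11000/image_002.raw",
        "calibration.img",
    ],
    "run/raw/q55/DRT240/e11000/image_002.raw": [
        "cassette_id", "sample_id", "energy", "frame_number",
    ],
    "sample_id": ["accepted_sample"],
    "energy": ["energies"],
}


def lineage(item, derived_from=DERIVED_FROM):
    """Return every item that `item` transitively depends on."""
    seen = set()
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    for ancestor in sorted(lineage("run/data/DRT240/DRT240_11000eV_002.img")):
        print(ancestor)
```

Items without recorded inputs, such as calibration.img in this sketch, simply have an empty lineage.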

Page 48: From Data to Knowledge with Workflows & Provenance

Secret Reproducible Sauce

•  Combining provenance information from noWorkflow and YesWorkflow

•  Using all the good stuff:
   –  make, docker, Prolog, SQL, Graphviz

•  Open source
   –  github.com/yesworkflow-org/yw-noworkflow
   –  github.com/gems-uff/yin-yang-demo

•  Have a closer look at the demo!

Page 49: From Data to Knowledge with Workflows & Provenance

YW-RECON: Prospective & Retrospective Provenance… (almost) for free!

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

[Figure: YesWorkflow graph of the script. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path. Data with URI templates:]

sample_spreadsheet    file:cassette_{cassette_id}_spreadsheet.csv
calibration_image     file:calibration.img
run_log               file:run/run_log.txt
rejection_log         file:/run/rejected_samples.txt
raw_image             file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
corrected_image       file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
collection_log        file:run/collected_images.csv

•  URI templates link conceptual entities to runtime provenance “left behind” by the script author…

•  …facilitating provenance reconstruction
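How a URI template can recover variable bindings from a file left behind by a run can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the YesWorkflow implementation: the helper names `template_to_regex` and `bind` are hypothetical, variables are assumed never to contain "/", and the "file:" scheme prefix is stripped by hand.

```python
import re


def template_to_regex(template):
    """Compile 'a/{x}/b_{y}.raw' into a regex with one named group per variable."""
    # re.split with a capturing group alternates: literal, variable, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    seen = set()
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)            # literal text
        elif part in seen:
            pattern += f"(?P={part})"             # repeated variable: same value again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"       # first use: capture the value
    return re.compile(pattern)


def bind(template, path):
    """Return the variable bindings if `path` fits `template`, else None."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None


# raw_image template from the slide, without the "file:" prefix.
RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

if __name__ == "__main__":
    print(bind(RAW_IMAGE, "run/raw/q55/DRT322/e11000/image_030.raw"))
    print(bind(RAW_IMAGE, "run/run_log.txt"))
```

A path that does not fit the template yields no bindings, which is what lets a reconstruction step tell raw images apart from other files in run/.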

Page 50: From Data to Knowledge with Workflows & Provenance

Q1: What samples did the script run collect images from?

[Figure: same workflow graph and run/ directory listing as on Page 49]

Page 51: From Data to Knowledge with Workflows & Provenance

Q2: What energies were used for image collection from sample DRT322?

[Figure: same workflow graph and run/ directory listing as on Page 49]
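Questions like Q1 and Q2 can be answered from file names alone once the raw_image template is applied to the files in run/. The sketch below assumes a hand-written regex equivalent of that template and a small, hypothetical sample of paths taken from the directory listing (not the full run); the function names are illustrative.

```python
import re

# A few raw-image paths as they appear in the run/ listing (not the full run).
RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

# Hand-written regex form of the raw_image URI template (an assumption of this sketch).
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)


def bindings(paths):
    """Variable bindings for every path that fits the raw_image template."""
    matches = (RAW_IMAGE.fullmatch(path) for path in paths)
    return [match.groupdict() for match in matches if match]


def samples_collected(paths):
    """Q1: samples for which at least one raw image exists."""
    return sorted({b["sample_id"] for b in bindings(paths)})


def energies_for(paths, sample_id):
    """Q2: energies at which images of the given sample were collected."""
    return sorted({int(b["energy"]) for b in bindings(paths)
                   if b["sample_id"] == sample_id})


if __name__ == "__main__":
    print(samples_collected(RAW_PATHS))
    print(energies_for(RAW_PATHS, "DRT322"))
```

The answers only cover the paths given as input, so a real query would first list every file under run/raw.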

Page 52: From Data to Knowledge with Workflows & Provenance

Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?

[Figure: same workflow graph and run/ directory listing as on Page 49]

Page 53: From Data to Knowledge with Workflows & Provenance

Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?

[Figure: same workflow graph and run/ directory listing as on Page 49]
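Q3 and Q5 need a join: the corrected image name carries sample_id, energy and frame_number but no cassette_id, while the raw image path carries all four. A minimal sketch, assuming hand-written regex forms of the two templates and a hypothetical sample of paths from the listing, matches the shared variables to find the raw image and reads the cassette id off its path.

```python
import re

# A few raw-image paths as they appear in the run/ listing (not the full run).
RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

# Hand-written regex forms of the templates (assumptions of this sketch).
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)
# Only the file name of the corrected image is matched, so its directory is irrelevant.
CORRECTED_NAME = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img"
)


def raw_images_for(corrected_name, raw_paths):
    """Return (raw_path, cassette_id) pairs whose shared variables agree."""
    corrected = CORRECTED_NAME.fullmatch(corrected_name)
    if corrected is None:
        return []
    wanted = corrected.groupdict()
    found = []
    for path in raw_paths:
        raw = RAW_IMAGE.fullmatch(path)
        if raw and all(raw.group(name) == value for name, value in wanted.items()):
            found.append((path, raw.group("cassette_id")))
    return found


if __name__ == "__main__":
    print(raw_images_for("DRT322_11000eV_030.img", RAW_PATHS))   # Q3
    print(raw_images_for("DRT240_10000eV_001.img", RAW_PATHS))   # Q5
```

Joining on variable names rather than on positions in the path is the design choice here: it keeps the query valid if the directory layout in the templates changes.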

Page 54: From Data to Knowledge with Workflows & Provenance

New Project! (internships next summer!)

http://wholetale.org/