from data to knowledge with workflows & provenance
TRANSCRIPT
Bertram Ludäscher
Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (formerly: GSLIS) & National Center for Supercomputing Applications (NCSA)
& Department of Computer Science
From Data to Knowledge with Workflows & Provenance
Outline
• Scientific Workflows
– Examples, Features
• Data Cleaning and Curation
• Provenance & Reproducible Science
– “Prospective Provenance” (a.k.a. workflows)
– Retrospective Provenance
• YesWorkflow
– Yes, Scripts can be Workflows, too!
• Other stuff
– Time allowing...
SPIN'16 @ NCSA
Introductions should come first!
• MS Computer Science, U Karlsruhe (K.I.T.)
• PhD Computer Science, U Freiburg, Germany
• Research Scientist, UC San Diego, SDSC
• Dept. of Computer Science, UC Davis
• School of Information Sciences, U of Illinois
• Natl. Center for Supercomputing Applications
Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
→ traceable data- and wf-evolution
→ Reproducible Science
[Screenshots: Trident Workbench, VisTrails]
Once upon a time…
10 Essential functions of a scientific workflow system
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what happened during workflow execution: retrospective provenance.
7. Reveal retrospective provenance – how workflow products were derived from inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services themselves.
These functions (not just dataflow & actors) distinguish scientific workflow automation from general scientific software development.
Src: Timothy McPhillips
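Functions 2, 3, 5 and 6 can be illustrated with a toy runner. This is only a minimal sketch under stated assumptions: steps are plain Python functions with declared input and output names, and `Step`, `plan`, and `run` are hypothetical names that do not come from Kepler or any other system in these slides. The runner executes steps one after another; a real system would also run independent steps in parallel.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable


@dataclass
class Step:
    """One workflow step with declared data dependencies (hypothetical model)."""
    name: str
    inputs: list[str]
    outputs: list[str]
    func: Callable[..., tuple]  # must return one value per declared output


def plan(steps):
    """Prospective provenance: derive an execution order from the declarations alone."""
    producer = {out: s.name for s in steps for out in s.outputs}
    deps = {s.name: {producer[i] for i in s.inputs if i in producer} for s in steps}
    return list(TopologicalSorter(deps).static_order())


def run(steps, initial_data):
    """Execute the plan and record retrospective provenance as a list of events."""
    by_name = {s.name: s for s in steps}
    data = dict(initial_data)
    trace = []
    for name in plan(steps):
        step = by_name[name]
        args = [data[i] for i in step.inputs]
        results = step.func(*args)
        data.update(zip(step.outputs, results))
        trace.append({
            "step": name,
            "used": dict(zip(step.inputs, args)),
            "generated": dict(zip(step.outputs, results)),
        })
    return data, trace


# Steps are listed out of order on purpose; the plan sorts them by dataflow.
steps = [
    Step("normalize", ["clean"], ["scaled"], lambda xs: ([x / max(xs) for x in xs],)),
    Step("clean", ["raw"], ["clean"], lambda xs: ([x for x in xs if x is not None],)),
]
result, trace = run(steps, {"raw": [2, None, 4]})
print(plan(steps))
print(result["scaled"])
```

The plan is available before anything runs, while the trace exists only after a run; that split mirrors the prospective/retrospective distinction in items 5 and 6.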
WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)
[Figure: WATERS workflow diagram with the following boxes]
• Assembled contigs
• Profile alignment (STAP or Infernal)
• Chimera check (Mallard)
• Find OTUs (OTUHunter)
• Assign Taxonomy (STAP)
• Build phylogenetic tree (RAxML or Quicktree)
• View tree: Dendroscope
• UniFrac: tree & environment file
• Diversity statistics
– Text: OTU list, Chao1, Shannon
– Graphs: rarefaction curves, rank-abundance curves
• Visualization tools: Cytoscape networks & Heatmap
• Box annotations: +/- cipres, +/- cluster
Executable WATERS Workflow in Kepler
Example Bioinformatics Workflow: Motif-Catcher
Marc Facciotti et al., UC Davis Genome Center
Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012
A Data-Streaming Workflow over Sensor Data
“Plumbing” workflow
• Monitor and control supercomputer simulations
– 50+ composite actors (subworkflows)
– 4 levels of hierarchy
– 1000+ atomic (Java) actors
[Figure: nested subworkflows labeled with their sizes, e.g. “43 actors, 3 levels”, “196 actors, 4 levels”, “206 actors, 4 levels”, “243 actors, 4 levels”]
Norbert Podhorszki, ORNL (then: UC Davis)
Scientific Workflow Design: Some Challenges
“And the graphical UI makes our scientific workflows so much easier to develop, understand and maintain!”
More “Plumbing” (beware the Boolean Select)
Cabellos et al. Computer Physics Communications 182, 2011
Modeling & Design: “The limits of my language mean the limits of my world”
• Vanilla Process Network
• Functional Programming Dataflow Network
• XML Transformation Network
• Collection-oriented Modeling & Design framework (COMAD): “Look Ma: No Shims!”
Problems with [too many] Shims and Wires
• Shims need to be placed and connected
– Tedious, error-prone
• Distract from scientifically meaningful actors
– Non-descriptive workflows – worth sharing?
• Data organization is encoded in workflow structure
– Not robust to data changes
• Shims often lead to complex designs
– Imagine all previous “design patterns” intertwined
– GOTO-programming
COMAD/VDAL: Raising the level of abstraction
• Localized control-flow
• Data management not done via wires
• Actors are coupled not by wire but by data!
Pipelined Collection-Oriented Workflows
Collection-Oriented Modeling & Design (COMAD)
– fully embrace the assembly line metaphor
– data = tagged nested collections
– e.g. represented as flattened, pipelined (XML) token streams
Actors (like assembly line workers) pass on what they don’t work on
T McPhillips, S Bowers, D Zinn, B Ludäscher
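The assembly-line idea can be sketched with Python generators. This is an illustration only, not the COMAD implementation: the token shapes (`open`, `close`, `data`), the `flatten` and `actor` helpers, and the sample data are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Assumed token shapes: ("open", tag), ("close", tag), ("data", tag, value)
Token = Tuple


def flatten(tag: str, collection: dict) -> Iterator[Token]:
    """Flatten a tagged nested collection into a linear token stream."""
    yield ("open", tag)
    for key, value in collection.items():
        if isinstance(value, dict):
            yield from flatten(key, value)   # nested collection
        else:
            yield ("data", key, value)       # leaf data item
    yield ("close", tag)


def actor(tag: str, func: Callable) -> Callable[[Iterable[Token]], Iterator[Token]]:
    """Build an actor that works on data tokens tagged `tag` and passes on the rest."""
    def process(stream: Iterable[Token]) -> Iterator[Token]:
        for token in stream:
            if token[0] == "data" and token[1] == tag:
                yield ("data", tag, func(token[2]))
            else:
                yield token  # not this actor's job: forward unchanged
    return process


project = {
    "sampleA": {"sequence": "acgt", "site": "pond"},
    "sampleB": {"sequence": "ttga", "site": "creek"},
}

upper = actor("sequence", str.upper)
title = actor("site", str.title)

# Actors are chained like workers on a line; no shims or extra wires are needed.
stream = list(title(upper(flatten("project", project))))
for token in stream:
    print(token)
```

Because each actor selects its input by tag rather than by wire, adding a third sample or another nesting level changes the data but not the pipeline.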
Two different workflow designs
• Hardwiring vs. configurable data/collection management
• brittle vs. change-resilient designs
• scientist can recognize napkin drawing / conceptual model
• Human cycles are expensive
ADIOS in Kepler
ADIOS in COMAD
From Data Life-Cycle to Curation Life-Cycle
Uncanny Resemblance: Eye of Jupiter (If you have “visions”…)
DCC Curation Lifecycle
Data Cleaning (Scientific & Business Apps)
How do you clean data? (Syntax)
• Regular Expressions (regex)
• Write your own scripts
– …with regex
– …in Python!
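A minimal sketch of syntactic cleaning with regex in Python. The field names, the sample values, and the day-first reading of ambiguous dates are assumptions chosen for illustration; they are not taken from the slides.

```python
import re


def clean_name(raw: str) -> str:
    """Trim, collapse runs of whitespace, and normalize capitalization."""
    collapsed = re.sub(r"\s+", " ", raw.strip())
    return collapsed.capitalize()


# Day, month, four-digit year, separated by "/", "." or "-".
# Assumption: dates are day-first; values are not range-checked.
DATE = re.compile(r"^(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})$")


def clean_date(raw: str):
    """Rewrite D/M/YYYY-style dates as ISO YYYY-MM-DD; return None if no match."""
    match = DATE.match(raw.strip())
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


print(clean_name("  QUERCUS   alba "))
print(clean_date("7/6/2016"))
print(clean_date("June 7"))
```

Returning `None` for values that do not match keeps unparseable entries visible for manual curation instead of silently guessing.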
Kurator Project (Data Curation Workflows)
From “Climate Gate” to Reproducible Science
Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science
=> need workflow-system agnostic model
Provenance: The Fine Arts
• One of these has been sold for nearly $180m.
• The other could be worth as much or more.
• Which is which?
• What is the difference?
Provenance in Science
• What’s so “provenance” about this?
• Grand Canyon’s rock layers are a record of the early geologic history of North America.
The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
Natural History: Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)
Computational Provenance
• Origin and processing history of an artifact
– usually: data (products), figures, ...
– sometimes: workflow (and script) evolution…
• Different sub-communities:
– Provenance in databases
– Provenance in (scientific) workflows
– ... programming languages, systems/security, …
Different Kinds of Data Provenance in Workflows
• Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
• Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde migrations; late 13th century AD.
– Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000.
– Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
[Figure: workflow graph of the script, with node labels GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
YesWorkflow: Yes, scripts are workflows, too!
• Script vs Workflows / ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
YW annotations: Model your Workflow!
YesWorkflow: Prospective & Retrospective Provenance… (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script…
[Figure: YW workflow view of the data collection script]
Inputs: cassette_id; sample_score_cutoff; sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv); calibration_image (file:calibration.img)
initialize_run → run_log (file:run/run_log.txt)
load_screening_results → sample_name, sample_quality
calculate_strategy → rejected_sample, accepted_sample, num_images, energies
log_rejected_sample → rejection_log (file:/run/rejected_samples.txt)
collect_data_set → sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw)
transform_images → corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path
log_average_image_intensity → collection_log (file:run/collected_images.csv)
Paleoclimate Reconstruction (EnviRecon.org)
• …explained using YesWorkflow!
Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
[Figure: YesWorkflow graph of the paleoclimate reconstruction script, from GetModernClimate through CAR_Reconstruction_union_output]
Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow
João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher
Using Provenance from Script Runs
Example from the log-file:
2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img
But how was that image derived?? (“Provenance for Self!”)
[Figure: noWorkflow provenance graph of one simulate_data_collection run. Nodes are function calls and variable assignments labeled with script line numbers (e.g. 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), many of them repeated. File nodes: cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and raw/corrected image pairs from run/raw/q55/DRT240/e10000/image_001.raw and run/data/DRT240/DRT240_10000eV_001.img through run/raw/q55/DRT322/e11000/image_002.raw and run/data/DRT322/DRT322_11000eV_002.img.]
noWorkflow: not only Workflow!
• Scripts have provenance, too!
• Transparently capture some/all provenance from Python script runs.
• Use filter queries to “zoom” into relevant parts...
noWorkflow lineage of an image file
Provenance information about Python function calls, variable assignments, etc.
$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
[Figure: filtered noWorkflow lineage graph; node labels in order]
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img
...auto-“make” this!
$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv
YesWorkflow: Yes, scripts are Workflows, too!
• Use YW annotations @begin ... @end, @in, @out to reveal hidden conceptual workflow (prospective provenance)
• Script isn't changed:
– annotations via comments (=> language independent)
• For understanding and sharing the “big picture”
• Query and visualize!
[Figure: YW workflow view of simulate_data_collection, with steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity and the data flowing between them]
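A small illustration of what such annotations can look like in a Python script. Only the keywords @begin, @end, @in and @out come from the slide; the script itself (`clean_and_average` and its variable names) is a made-up example, and the exact layout YesWorkflow expects should be checked against its documentation.

```python
# @begin clean_and_average
# @in raw_readings
# @out mean_reading

def clean_and_average(raw_readings):
    # @begin drop_missing
    # @in raw_readings
    # @out valid_readings
    valid_readings = [r for r in raw_readings if r is not None]
    # @end drop_missing

    # @begin compute_mean
    # @in valid_readings
    # @out mean_reading
    mean_reading = sum(valid_readings) / len(valid_readings)
    # @end compute_mean

    return mean_reading

# @end clean_and_average

print(clean_and_average([1.0, None, 3.0]))
```

The annotations live entirely in comments, so the script runs exactly as before; the nested @begin/@end pairs describe two steps connected by `valid_readings`.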
Alternate YW Views
• Process view
• Data view
• Workflow view
[Figure: the three views of simulate_data_collection. The process view shows only the steps (initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity); the workflow view shows steps together with their data.]
What is the lineage of “corrected_image”?
• From here on “upwards”: What led (leads) to this?
• ...and what is irrelevant and should be pruned??
[Figure: full YW workflow view of simulate_data_collection]
What is the lineage of corrected_image?
Subgraph resulting from lineage query on YW workflow model
[Figure: lineage subgraph of simulate_data_collection]
Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images
Data: cassette_id, sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image
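A lineage query of this kind is an upstream traversal of the workflow graph. The sketch below is not the YesWorkflow query engine (the demo uses Prolog and SQL); it is a plain Python illustration. The step and data names come from the slides, but which inputs feed which step is an assumption, since the transcript preserves only the node labels.

```python
# step -> (inputs, outputs). The wiring is assumed for illustration.
WORKFLOW = {
    "initialize_run": (["cassette_id", "sample_score_cutoff"], ["run_log"]),
    "load_screening_results": (["cassette_id", "sample_spreadsheet"],
                               ["sample_name", "sample_quality"]),
    "calculate_strategy": (["sample_name", "sample_quality",
                            "sample_score_cutoff", "data_redundancy"],
                           ["accepted_sample", "rejected_sample",
                            "num_images", "energies"]),
    "log_rejected_sample": (["cassette_id", "rejected_sample"], ["rejection_log"]),
    "collect_data_set": (["cassette_id", "accepted_sample", "num_images", "energies"],
                         ["sample_id", "energy", "frame_number", "raw_image"]),
    "transform_images": (["sample_id", "energy", "frame_number",
                          "raw_image", "calibration_image"],
                         ["corrected_image", "total_intensity", "pixel_count"]),
    "log_average_image_intensity": (["cassette_id", "sample_id", "frame_number",
                                     "total_intensity", "pixel_count"],
                                    ["collection_log"]),
}


def lineage(target):
    """Return (steps, data) that lie upstream of `target`, including `target`."""
    producer = {out: step for step, (_, outs) in WORKFLOW.items() for out in outs}
    steps, data, todo = set(), set(), [target]
    while todo:
        item = todo.pop()
        if item in data:
            continue                      # already visited
        data.add(item)
        step = producer.get(item)         # None for workflow inputs
        if step is not None and step not in steps:
            steps.add(step)
            todo.extend(WORKFLOW[step][0])  # walk "upwards" through the inputs
    return steps, data


steps, data = lineage("corrected_image")
print(sorted(steps))
print(sorted(data))
```

Everything the traversal never reaches, such as run_log, rejection_log and collection_log, is what gets pruned from the view.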
[Figure: the YesWorkflow model and its lineage subgraph for corrected_image, shown next to the full noWorkflow trace and its filtered lineage for run/data/DRT240/DRT240_11000eV_002.img; each pair is connected by a “lineage query” arrow]
YesWorkflow: Conceptual workflow model
noWorkflow: Python trace model
But how do we bridge this gap???
Would like to use YW model to query NW data!
Some bridges are precarious…
…and new bridge-building can be stressful
…even if just painting over.
Habemus Pons! We’ve got the Bridge!
The bridge is the journey... (The journey is the destination)
Lineage of image file in terms of YW model, with details from NW provenance
Secret Reproducible Sauce
• Combining provenance information from noWorkflow and YesWorkflow
• Using all the good stuff:
– make, docker, Prolog, SQL, Graphviz
• Open source
– github.com/yesworkflow-org/yw-noworkflow
– github.com/gems-uff/yin-yang-demo
• Have a closer look at the demo!
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
│ ├── image_001.raw
... ...
│ └── image_030.raw
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
YW-RECON: Prospective & Retrospective Provenance… (almost) for free!
• URI-templates link conceptual entities to runtime provenance “left behind” by the script author…
• …facilitating provenance reconstruction
[Figure: YW workflow view of simulate_data_collection in which file-backed data carry URI templates, e.g. raw_image as file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw and corrected_image as file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img]
Q1: What samples did the script run collect images from?
[Figure: YW workflow view with URI templates, shown next to the run/ directory listing]
initialize_run
run_logfile:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_logfile:/run/rejected_samples.txt
collect_data_set
sample_idenergyframe_numberraw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_imagefile:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_logfile:run/collected_images.csv
sample_spreadsheetfile:cassette_{cassette_id}_spreadsheet.csv
calibration_imagefile:calibration.img
cassette_id
sample_score_cutoff
Q2: What energies were used for image collection from sample DRT322?
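A possible way to answer Q2 from the raw-image paths is sketched below in Python. The regular expression is a hand-written counterpart of the raw_image URI template. The function name energies_for and the assumption that energies and frame numbers are written as digits are illustrative choices, and only paths shown explicitly in the directory listing are used.

```python
import re

# Hand-written regex for the template
# run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw\Z"
)

raw_paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

def energies_for(sample_id, paths):
    """Return the sorted energies at which raw images exist for sample_id."""
    energies = set()
    for p in paths:
        m = RAW.match(p)
        if m and m.group("sample_id") == sample_id:
            energies.add(int(m.group("energy")))
    return sorted(energies)

print(energies_for("DRT322", raw_paths))
```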
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
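For Q3, one option is to read the variables out of the corrected image's file name and substitute them into the raw_image template. The corrected file name does not carry cassette_id, so the sketch below leaves that variable open as a wildcard. The function name raw_image_candidates is hypothetical, the file name is spelled as in the directory listing, and fnmatch's "*" also matches "/", which is acceptable here only because the remaining path segments are fully fixed.

```python
import fnmatch
import re

# File-name part of the corrected_image template: {sample_id}_{energy}eV_{frame_number}.img
CORRECTED = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img\Z"
)
RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

raw_paths = [
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

def raw_image_candidates(corrected_name, paths):
    """Return raw image paths whose template variables match the corrected image."""
    m = CORRECTED.match(corrected_name)
    if m is None:
        return []
    # Fill in the known variables; cassette_id stays a wildcard.
    glob_pattern = RAW_TEMPLATE.format(cassette_id="*", **m.groupdict())
    return fnmatch.filter(paths, glob_pattern)

print(raw_image_candidates("DRT322_11000eV_030.img", raw_paths))
```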
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
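Q5 needs a join across two URI templates: the corrected image name yields sample_id, energy and frame_number, and the raw image path with the same three values also carries cassette_id. A minimal Python sketch under these assumptions follows. The regexes are hand-written counterparts of the templates, the function name cassette_ids_for is hypothetical, and only paths shown explicitly in the directory listing are used.

```python
import re

RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw\Z"
)
CORRECTED = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img\Z"
)

raw_paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
]

def cassette_ids_for(corrected_name, paths):
    """Return cassette ids of raw images sharing sample, energy and frame."""
    wanted = CORRECTED.match(corrected_name)
    if wanted is None:
        return []
    shared = ("sample_id", "energy", "frame_number")
    found = set()
    for path in paths:
        m = RAW.match(path)
        # Join on the variables that both templates have in common.
        if m and all(m.group(k) == wanted.group(k) for k in shared):
            found.add(m.group("cassette_id"))
    return sorted(found)

print(cassette_ids_for("DRT240_10000eV_001.img", raw_paths))
```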
New Project! (internships next summer!)
http://wholetale.org/