Versioning for Workflow Evolution
DESCRIPTION
My presentation on "Versioning for Workflow Evolution", which I gave at DIDC 2010 in June 2010.
TRANSCRIPT
Versioning for Workflow Evolution
Eran Chinthaka Withana, Beth Plale
School of Informatics and Computing, Indiana University, Bloomington, Indiana
Roger Barga, Nelson Araujo Microsoft Research,
Microsoft Corporation, Redmond, Washington
3rd International Workshop on Data Intensive Distributed Computing, Chicago, IL, US; “Versioning for Workflow Evolution”;
June 22, 2010; Eran C. Withana
Workflow Evolution
• Computational Science Experiments
  – Sequence of activities
  – Set of configurable parameters and input data
  – Produces outputs to be analyzed and evaluated further
• Evolution of Research
  – Changes in research artifacts
Workflow Evolution
• Workflows as a good tool to track evolution of research
  – Automate repeatable tasks in an efficient manner
  – Algorithms & experimental procedures encoded into workflows
  – Tracking workflows tracks research too
• Tracking effects over time
  – Provenance of data products
  – Lineage and roots of errors and affected data products
• Comparing Results
  – More than one research direction in a given experiment
  – Comparing outputs from different paths of the research
• Attribution
  – Attribution of credit based on who performed the work, who owns/created it, and who owns the data products
  – Sharing and attribution of research can and should be an integral part of research
    • E.g., sub-modules from myexperiments.org
• Workflow Evolution Framework and versioning model
  – Enables the management of knowledge encoded in workflow executions
Related Work
• Workflow evolution shares a lot in common with provenance collection frameworks
  – I. T. Foster, J.-S. Vockler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. In Proceedings of the 14th International Conference on Scientific and Statistical Database Management, pages 37-46, Washington, DC, USA, 2002. IEEE Computer Society.
• Existing evolution frameworks
  – J. Freire, C. Silva, S. Callahan, E. Santos, C. Scheidegger, and H. Vo. Managing rapidly-evolving scientific workflows. Lecture Notes in Computer Science, 4145:10, 2006.
• Evolution Data Models
  – L. Bavoil, S. P. Callahan, P. J. Crossno, J. Freire, C. E. Scheidegger, C. T. Silva, H. T. Vo. Vistrails: Enabling interactive multiple-view visualizations. In IEEE Visualization, 2005. VIS 05, pages 135-142.
• Versioning at different levels
  – Application level: D. Santry, M. Feeley, N. Hutchinson, and A. Veitch. Elephant: The file system that never forgets. In Workshop on Hot Topics in Operating Systems, pages 2-7. IEEE Computer Society, 1999.
  – System/database level: R. Chatterjee, G. Arun, S. Agarwal, B. Speckhard, and R. Vasudevan. Using applications of data versioning in database application development. In ICSE '04: Proceedings of the 26th International Conference on Software Engineering, pages 315-325, Washington, DC, USA, 2004. IEEE Computer Society.
  – Disk storage level: M. Flouris and A. Bilas. Clotho: Transparent data versioning at the block I/O level. In Proceedings of the 12th NASA Goddard, 21st IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), pages 315-328, 2004.
Use Cases
1. Research Reproduction
2. Scientific Workflows
  – In LEAD, tracking namelist input files and visualizations
  – Tracking activity binaries
Versioning Model
• Dimensions of workflow evolution
  – Direct evolution occurs when a user of the workflow performs one of the following actions:
    • Changes the flow and arrangement of the components within the system
    • Changes the components within the workflow
    • Changes input and/or output parameters or configuration parameters of different components within the workflow
  – Contributions track components that are reused from a previous system
• Workflow Evolution Capturing Stages
  – User explicitly saves the workflow
  – User closes the workflow editor
  – Execution of a workflow
• Warning: This granularity might not capture all edits
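The capturing stages above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Trident implementation: the names `CaptureEvent` and `WorkflowHistory` are hypothetical, and skipping a capture when the workflow definition is unchanged is an assumption made here for brevity.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum


class CaptureEvent(Enum):
    """The three points at which a version is captured (from the slide)."""
    EXPLICIT_SAVE = "user explicitly saves the workflow"
    EDITOR_CLOSE = "user closes the workflow editor"
    EXECUTION = "execution of a workflow"


@dataclass
class WorkflowHistory:
    """Hypothetical in-memory history of one workflow definition."""
    versions: list = field(default_factory=list)  # (event, digest, definition)

    def capture(self, event: CaptureEvent, definition: str) -> bool:
        """Record a version at a capture event.

        Assumption: a version is recorded only if the definition differs
        from the most recently captured one.
        """
        digest = hashlib.sha256(definition.encode("utf-8")).hexdigest()
        if self.versions and self.versions[-1][1] == digest:
            return False  # nothing changed since the last capture
        self.versions.append((event, digest, definition))
        return True


history = WorkflowHistory()
history.capture(CaptureEvent.EXPLICIT_SAVE, "A -> B")
history.capture(CaptureEvent.EDITOR_CLOSE, "A -> B")    # unchanged, skipped
history.capture(CaptureEvent.EXECUTION, "A -> B -> C")  # new version
```

Edits made and undone between two capture events never reach `capture`, which is the granularity limitation the warning above refers to.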
Architecture within Trident Scientific Workflow Workbench
[Figures: Trident Architecture; Trident Evolution Framework Architecture. Diagram labels: Trident Workbench, Design, Workbench, Browser, Trident Runtime Services, Publish-Subscribe Blackboard, Management, Monitor, Administration, Registry Management, Workflow Packages, Scientific Workflows, Windows Workflow Foundation, Trident Registry, Trident Data Model, Data Model, Data Access Layer, Evolution Framework, Versioning Model, Local Storage, Other Local/Remote Versioning System]
User View (within Trident)
[Screenshots: Versioned Objects in Registry; Workflow Evolution View]
Performance Evaluation
• Evaluation strategies
  • Delta – difference between two consecutive versions
  • Checkpointing – complete version saved after a fixed number of versions
  – No Delta, No Checkpointing
    • Each version saved as it is
  – With Delta, No Checkpointing
    • Delta with previous version
  – With Delta, With Checkpointing
    • Checkpointed after n versions
• Workflows used
  Workflow   Size (Bytes)   Delta (Bytes)
  O          1032           210
  M          4087           2564
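The three evaluation strategies above can be illustrated with a minimal Python sketch. It is not the code used in the evaluation: `VersionStore`, `make_delta` and `apply_delta` are hypothetical names, and the delta format (copy/insert operations derived from the standard library's `difflib.SequenceMatcher`) is an assumption chosen only to keep the example self-contained.

```python
from difflib import SequenceMatcher


def make_delta(old: str, new: str) -> list:
    """Encode `new` relative to `old` as copy/insert operations."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # literal new text
        # "delete": the old range is simply never copied
    return ops


def apply_delta(old: str, ops: list) -> str:
    """Rebuild the newer version from `old` and a delta."""
    parts = []
    for op in ops:
        parts.append(old[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)


class VersionStore:
    """use_delta=False               -> No Delta, No Checkpointing
    use_delta=True                   -> With Delta, No Checkpointing
    use_delta=True, checkpoint_every=n -> With Delta, With Checkpointing
    """

    def __init__(self, use_delta: bool, checkpoint_every: int = 0):
        self.use_delta = use_delta
        self.checkpoint_every = checkpoint_every
        self.entries = []   # ("full", text) or ("delta", ops)
        self._latest = ""

    def save(self, text: str) -> int:
        index = len(self.entries)
        checkpoint_due = (self.checkpoint_every > 0
                          and index % self.checkpoint_every == 0)
        if not self.use_delta or index == 0 or checkpoint_due:
            self.entries.append(("full", text))
        else:
            self.entries.append(("delta", make_delta(self._latest, text)))
        self._latest = text
        return index

    def recover(self, index: int) -> str:
        base = index
        while self.entries[base][0] != "full":  # nearest full copy at or before index
            base -= 1
        text = self.entries[base][1]
        for i in range(base + 1, index + 1):    # replay deltas forward
            text = apply_delta(text, self.entries[i][1])
        return text
```

In this sketch, recovering a version from a delta-only store replays every delta since version 0, while checkpointing bounds the replay to at most n - 1 deltas; storing full copies needs no replay at all but keeps every version at full size.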
Performance Evaluation
• File Write Time
[Figure: two panels, O Workflow and M Workflow]
Performance Evaluation
• Version Recovery Time
[Figure: two panels, O Workflow and M Workflow]
Performance Evaluation
• Space Usage for a Version
[Figure: two panels, O Workflow and M Workflow]
Performance Evaluation
• Data Retrieved per Version
[Figure: two panels, O Workflow and M Workflow]
Discussion
• The "No Delta, No Checkpointing" option performs poorly with respect to storage usage
  – 4-5 times more for the smaller workflow with the small delta, and 2 times more for the larger workflow with the large delta
• However, it outperforms both other options with respect to
  – version save time: 20-30 times for the larger workflow with the large delta, and 5 times for the smaller workflow with the small delta
  – version recovery time: 10 times for the smaller workflow with the small delta, and 5 times for the larger workflow with the large delta
• Criteria for selecting an object maintenance strategy
  – size of data objects
  – average changes for data objects between different versions of the same object
  – response time to the user and the system
• Challenges in working with different types of artifacts
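As a rough illustration of how the selection criteria above might be combined, the following Python sketch picks a strategy from an object's size, its average delta size, and whether response time matters most. The function names, the `max_ratio` threshold of 0.5, and the decision rule itself are all hypothetical; the talk does not specify a selection procedure.

```python
def delta_ratio(size_bytes: int, avg_delta_bytes: int) -> float:
    """Fraction of the object that changes, on average, between versions."""
    return avg_delta_bytes / size_bytes


def choose_strategy(size_bytes: int, avg_delta_bytes: int,
                    favor_speed: bool, max_ratio: float = 0.5) -> str:
    """Hypothetical rule: store full copies when response time matters most
    or when deltas are nearly as large as the object; otherwise use deltas
    with checkpoints. `max_ratio` is an arbitrary illustrative threshold."""
    if favor_speed or delta_ratio(size_bytes, avg_delta_bytes) > max_ratio:
        return "No Delta, No Checkpointing"
    return "With Delta, With Checkpointing"


# Sizes taken from the "Workflows used" table: O (1032, 210), M (4087, 2564)
choice_o = choose_strategy(1032, 210, favor_speed=False)
choice_m = choose_strategy(4087, 2564, favor_speed=False)
```

With these illustrative settings, workflow O (a delta of about 20% of its size) is assigned deltas with checkpoints, while workflow M (about 63%) is stored as full copies.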
Future Work
• Dynamic strategy to adjust versioning technique depending on object properties
• Challenges
  – Unavailability of visualization software
  – Visualizing different types of data products, integrating other viz tools
• LEAD II Vortex2 Use case
  – Tracking different WF Activity library versions
Thank You !!!
Questions …?