

D. Koop, CIS 602-01, Fall 2016

CIS 602-01: Computational Reproducibility

Scientific Workflows

Dr. David Koop


Containers to Bridge Multiplicity Issues

[Figure: "About Docker: A shipping container system for applications", from Docker Fundamentals, © 2015 Docker Inc]

[Docker, Inc., 2016]

Dockerfiles, Images, and Containers
• Dockerfiles are used to build Docker images
• A Docker image can be used to run a Docker container
• Containers can be composed together using Docker Compose
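The Dockerfile, image, container chain above can be sketched as the two command lines that drive it. This is a minimal illustration, assuming the standard `docker build -t` and `docker run --rm` invocations; the image tag and the helper names `build_command` and `run_command` are hypothetical and not from the lecture. The helpers only assemble argument lists, so the sketch works without Docker installed.

```python
from typing import List


def build_command(tag: str, context: str = ".") -> List[str]:
    """Return the argv that builds an image from the Dockerfile in `context`."""
    return ["docker", "build", "-t", tag, context]


def run_command(tag: str, *args: str) -> List[str]:
    """Return the argv that starts a throwaway container from image `tag`."""
    return ["docker", "run", "--rm", tag, *args]


# Hypothetical tag, used only for illustration.
tag = "example/nginx:demo"
for argv in (build_command(tag), run_command(tag)):
    print(" ".join(argv))
```

On a machine with Docker, each list could be passed to `subprocess.run(argv, check=True)`.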


Dockerfile

Dockerfile example:

# base image: last debian release
FROM debian:wheezy

# name of the maintainer of this image
MAINTAINER [email protected]

# install the latest upgrades
RUN apt-get update && apt-get -y dist-upgrade

# install nginx
RUN apt-get -y install nginx

# set the default container command: run nginx in the foreground
CMD ["nginx", "-g", "daemon off;"]

# tell the docker engine that there will be something listening on tcp port 80
EXPOSE 80

[A. Baire, 2016]

Technical Challenges in Reproducibility
• Dependency hell
• Imprecise documentation
• Code rot
• Barriers to adoption and reuse in existing solutions

[C. Boettiger, 2015]

Docker Features for Reproducibility
• Integrating into local development environments
• Modular reuse (Docker Compose)
• Portable environments (snapshots)
• Public repository for sharing (DockerHub)
• Versioning

[C. Boettiger, 2015]

Best Practices
• Use Docker containers during development
• Write Dockerfiles instead of installing software in interactive sessions
• Add tests or checks to the Dockerfile
• Use and provide appropriate base images
• Share Docker images and Dockerfiles
• Archive tarball snapshots
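The last practice, archiving tarball snapshots, can be sketched as follows. The sketch assumes the standard `docker save -o` and `docker load -i` commands; the helper names are hypothetical, and recording a SHA-256 checksum next to the archive is an added suggestion rather than something the slide prescribes.

```python
import hashlib
import io
from typing import BinaryIO, List


def save_command(image: str, tarball: str) -> List[str]:
    """Return the argv that writes image `image` to the tar archive `tarball`."""
    return ["docker", "save", "-o", tarball, image]


def load_command(tarball: str) -> List[str]:
    """Return the argv that restores an image from a saved tar archive."""
    return ["docker", "load", "-i", tarball]


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Checksum an archive in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in bytes; a real archive would be opened with open(path, "rb").
print(sha256_of_stream(io.BytesIO(b"pretend this is an image tarball")))
```

Storing the checksum with the archive lets a later reader confirm that the snapshot they load is the one that was published.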

[C. Boettiger, 2015]

The dependency problem in packages

Rethinking package management - Nix
Allow me to introduce you to Nix - The Purely Functional Package Manager. Nix is a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side

Rok Garbas, https://garbas.si

Why are current tools not good enough?

[Figure: packaging instructions and metadata, together with the filesystem /usr, feed into building the package; the result is the filesystem /usr with the package in it]

[R. Garbas, 2016]

nix(os) solution for the dependency problem

Stateless way of building packages

[Figure: two independent pipelines. Packaging instructions and metadata feed into building package A, which is written to the filesystem at /nix/store/$hash-$name-$version. Separate packaging instructions and metadata feed into building package B, which is written to its own /nix/store/$hash-$name-$version path.]
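The idea behind the `/nix/store/$hash-$name-$version` layout can be illustrated with a small sketch: the path is derived from everything that goes into the build, so two builds with different inputs never collide. This is not Nix's actual hashing scheme; the function `store_path`, its parameters, and the use of a truncated SHA-256 hex digest are assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def store_path(name: str, version: str, build_script: str,
               dep_paths: Iterable[str], root: str = "/nix/store") -> str:
    """Derive an install path from all build inputs (illustrative only)."""
    digest = hashlib.sha256()
    # Sorting the dependencies makes the result independent of listing order.
    for part in (name, version, build_script, *sorted(dep_paths)):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return f"{root}/{digest.hexdigest()[:32]}-{name}-{version}"


# Hypothetical dependency paths; changing any input yields a different path.
old = store_path("hello", "2.10", "make install", ["/nix/store/aaa-glibc-2.24"])
new = store_path("hello", "2.10", "make install", ["/nix/store/bbb-glibc-2.25"])
print(old)
print(new)
```

Because the two results differ, both builds can sit side by side in the store, which is what makes the approach stateless compared with installing into a shared /usr.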

[R. Garbas, 2016]

Using containers as part of a workflow
• rabix: https://github.com/rabix/rabix
• bioboxes: http://bioboxes.org


Workflows
• (Business) Workflow Definition [Software AG]

“An orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information.”


Business Workflow

[SAP]

Business Workflows
• Also known as “Business Process Management” or “Business Process Engineering”
• Roots in office automation (think assembly lines for offices)
  - Each person has certain roles
  - For different processes, there are flows on how decisions are made or transferred
• Newer: “Web Service Choreography”


Scientific Workflows
• Scientific Workflow Definition [Ludäscher et al.]

“[P]rocess networks that are typically used as data analysis pipelines or for comparing observed and predicted data, and that can include a wide range of components, e.g., for querying databases, for data transformation and data mining steps, for execution of simulation codes on high performance computers”


Scientific Workflow

Figure 1: Conceptual (“napkin drawing”) view of the Promoter Identification Workflow (PIW) [ABB+03]

2 Scientific Workflows

There is a growing interest in scientific workflows as can be seen from a number of recent events, e.g., the Scientific Data Management Workshop [SDM03], the e-Science Workflow Services Workshop [eSc03], the e-Science Grid Environments Workshop [eSc04], the Virtual Observatory Service Composition Workshop [GRI04], the e-Science LINK-Up Workshop on Workflow Interoperability and Semantic Extensions [LIN04], and last not least, various activities as part of the Global Grid Forum (e.g., [GGF04]), just to name a few. Scientific workflows also play an important role in a number of ongoing large research projects dealing with scientific data management, including those funded by NSF/ITR (GriPhyN, GEON, LEAD, SCEC, SEEK, ...), NIH (BIRN), DOE (SciDAC/SDM, GTL), and similar efforts funded by the UK e-Science initiative (myGrid, DiscoveryNet, and others). For example, the SEEK project [SEE] is developing an Analysis and Modeling System (AMS) that allows ecologists to design and execute scientific workflows [MBJ+04]. The AMS workflow component employs a Semantic Mediation System (SMS) to facilitate workflow design and data discovery via semantic typing [BL04]. Thus SEEK is a good example of a community-driven project in need of a system that allows users to seamlessly access data sources and services, and put them together into reusable workflows. Indeed SEEK is one of the main projects contributing to the cross-project Kepler initiative and workflow system discussed below.

Aspects and Types of Workflows. Scientific workflows often exhibit particular “traits”, e.g., they can be data-intensive, compute-intensive, analysis-intensive, visualization-intensive, etc. The workflows in Sections 2.1.1, 2.1.2, and 2.1.3, e.g., exhibit different features, i.e., service-orientation and data analysis, re-engineering and user interaction, and high-performance computing, respectively. Depending on the intended user group, one might want to hide or emphasize particular aspects and technical capabilities of scientific workflows. For example, a “Grid engineer” might be interested in low-level workflow aspects such as data movement and remote job control. Having workflow components (or actors) that operate at this level will be beneficial to the Grid engineer. Conversely, a scientific workflow system should hide such aspects from analytical scientists (say an ecologist studying species richness and productivity).

The Kepler system aims at supporting very different kinds of workflows, ranging from low-level “plumbing” workflows of interest to Grid engineers, to analytical knowledge discovery workflows for scientists, and conceptual-level design workflows that might become executable only as a result of subsequent refinement steps [BL05].

In the following we first introduce scientific workflows by means of several examples taken from different projects and implemented using the Ptolemy II-based Kepler system [KEP]. We then discuss typical features of scientific workflows and from this derive general requirements and desiderata for scientific workflow systems. We take a closer look at underlying technical issues and challenges in Section 3.

2.1 Example Workflows

2.1.1 Promoter Identification

Figure 1 shows a high-level, conceptual view of a typical scientific knowledge discovery workflow that

[Ludäscher et al., Kepler]

Scientific Workflows
• Manage data-intensive, complex analyses
• Orchestrate different tools
• Structured computation
• Abstraction for more general understanding
• Enable automation, reproducibility, sharing


Business Workflows vs. Scientific Workflows
• Business Workflows
  - Decision-oriented
  - Emphasis on control-flow
  - Often involve many people
  - Often stateful
• Scientific Workflows
  - Data-oriented
  - Emphasis on data-flow
  - Usually involve a small group
  - Usually stateless


Types of Scientific Workflows
• Knowledge Discovery
  - Often executed once, changed, and new version executed again
  - Emphasis of VisTrails workflow system
• Automation
  - Used when same process is repeated for different data
  - Users often use higher-level interfaces to the workflows
• Coordination
  - Example: High-Performance Computing (HPC)
    • Move files from local machines to HPC resources
    • Coordinate execution on HPC resources


Workflows and e-Science: An overview of workflow system features and capabilities

E. Deelman, D. Gannon, M. Shields, I. Taylor


Definitions
• Workflow orchestration: defining a graph of tasks to manage a task
• Workflow: a template for such an orchestration
• Workflow instance: a specific instantiation of a workflow for a particular problem and tool
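The distinction between a workflow (a template) and a workflow instance (a template bound to a particular problem) can be sketched with two small classes. The class and field names below are hypothetical and are not taken from Deelman et al. or any specific workflow system.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Workflow:
    """A template: tasks, the edges between them, and unbound parameters."""
    tasks: Tuple[str, ...]
    edges: Tuple[Tuple[str, str], ...]
    parameters: Tuple[str, ...]

    def instantiate(self, **bindings: str) -> "WorkflowInstance":
        """Bind every parameter to a concrete value for one problem."""
        missing = set(self.parameters) - set(bindings)
        if missing:
            raise ValueError(f"unbound parameters: {sorted(missing)}")
        return WorkflowInstance(self, dict(bindings))


@dataclass(frozen=True)
class WorkflowInstance:
    """A specific instantiation of a workflow template."""
    workflow: Workflow
    bindings: Dict[str, str]


# Hypothetical two-step template and one instance of it.
template = Workflow(
    tasks=("align", "summarize"),
    edges=(("align", "summarize"),),
    parameters=("input_file", "tool"),
)
instance = template.instantiate(input_file="sample.fasta", tool="aligner-v1")
print(instance.bindings)
```

The same template can be instantiated many times with different inputs, which is the reuse that the definitions above point to.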


Workflow Lifecycle

E. Deelman et al. / Future Generation Computer Systems 25 (2009) 528–540

Fig. 1. The workflow lifecycle: composition, mapping, execution, and provenance capture.

or comparing every workflow solution that exists today, we focus here on the scientist’s or distributed-application developer’s view and attempt to extract a feature set that encapsulates the functionality exposed by various workflow systems in an attempt to create a generalised taxonomy for the field as a whole rather than comparing specific solutions. Therefore, the systems we include in this paper only constitute a subset of the many workflow systems that exist within the research and commercial communities and provide an illustrative means for the justification of our feature set.

It is hoped that such a taxonomy of features will be useful not only to scientists or application-developers when considering which features might be important for their problem domain but also to the workflow community as a whole in order to provide some basis for classification of systems in the future and possible interoperability. It is as inconceivable at this stage to see a one-size-fits-all workflow system that encompasses all of the desirable features in one solution as it is to ask for one programming language to be best for all problems. Therefore any decision about the choice of one workflow system or systems used must be informed in order to reduce the development time of applications as a whole in this area.

2. Workflow capabilities and the workflow lifecycle

At one level a workflow is a high-level specification of a set of tasks and the dependencies between them that must be satisfied in order to accomplish a specific goal. For example, a data analysis protocol consisting of a sequence of pre-processing, analysis, simulation and post-processing steps is a typical workflow scenario in e-Science applications. At the level of representation and execution, a workflow is a computer program and it can be expressed in any modern programming language. However, the task of writing a computer program in Perl or Java or Python to orchestrate a set of tasks on a wide-area distributed system goes well beyond the programming skills or patience of most scientific users. The goal of e-Science workflow systems [3] is to provide a specialized programming environment to simplify the programming effort required by scientists to orchestrate a computational science experiment.

In general we can classify four different phases of the workflow lifecycle (as seen in Fig. 1):

1. Composition, Representation and Data Model: the composition of the workflow (abstract or executable) through a number of different means e.g. text, graphical, semantic.
2. Mapping: Involves the mapping from the (abstract) workflow to the underlying resources.
3. Execution: Enactment of the mapped workflow on the underlying resources.
4. Metadata and Provenance: The recording of metadata and provenance information during the various stages of the workflow lifecycle.

During workflow composition the user creates a workflow either from scratch or by modifying a previously designed workflow. The user can rely on workflow component and data catalogs. The workflow composition process can be iterative, where portions of the workflow need to be executed before subsequent parts of the workflow are designed. Once the workflow is defined, all, or portions of the workflow can be sent for mapping and then execution. During that phase various optimizations and scheduling decisions can be made. Finally, the data and all associated metadata and provenance information are recorded and placed in a variety of registries which can then be accessed to design a new workflow. Even though we delineate data recording as a separate phase of the workflow lifecycle, this activity can and often is part of the workflow execution.

In the following four sections, we examine these four areas in detail by extracting a feature set within each category to define the common aspects of functionality that are inherent across workflow systems in general.

3. Composition, representation and execution models

3.1. Workflow composition

The composition of workflows is an important step in the workflow process. It allows a workflow user or programmer to specify the steps and dependencies in either a concrete or abstract fashion that represent the desired overall analysis. There are a number of mechanisms for specifying such flows or dependencies and the resulting graph can be stored in a

[Deelman et al., 2009]

Phases of Workflow Lifecycle
• Composition: defining the workflow, may be new or based on another workflow
• Mapping: map (abstract) workflow to resources
• Execution: running the workflow
• Metadata and Provenance: recording information about the workflow at the various steps
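The four phases can be walked through in a toy pipeline. This is a minimal sketch under simplifying assumptions: the workflow is a linear chain of numeric steps, mapping is plain round-robin, and provenance is a list of dictionaries. The function names and resource labels are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[float], float]]


def compose() -> List[Step]:
    """Composition: define the (here linear) workflow."""
    return [
        ("preprocess", lambda x: x - 1.0),
        ("analyze", lambda x: x * 2.0),
        ("postprocess", lambda x: round(x, 2)),
    ]


def map_steps(steps: List[Step], resources: List[str]) -> Dict[str, str]:
    """Mapping: assign each abstract step to a resource (round-robin)."""
    return {name: resources[i % len(resources)]
            for i, (name, _) in enumerate(steps)}


def execute(steps: List[Step], mapping: Dict[str, str],
            value: float) -> Tuple[float, List[dict]]:
    """Execution, with provenance recorded as each step runs."""
    provenance = []
    for name, func in steps:
        result = func(value)
        provenance.append({"step": name, "resource": mapping[name],
                           "input": value, "output": result})
        value = result
    return value, provenance


steps = compose()
mapping = map_steps(steps, ["local", "cluster"])
result, provenance = execute(steps, mapping, 3.0)
print(result)
for record in provenance:
    print(record)
```

Recording provenance inside `execute` mirrors the point in the Deelman et al. excerpt that data recording is often part of execution rather than a separate pass.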


Workflow Composition
• Textual Composition
• Graphical Workflow Editing
• Community Services (e.g. Web services)
• Semantic Composition (automated reasoning, etc.)


Workflow Representation
• Directed acyclic graphs: outputs to inputs (BPEL, MoML, etc.)
• Petri nets (e.g. Grid-Flow)
• UML: activity diagrams (e.g. Askalon)
• CommonWL
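A directed acyclic graph representation can be sketched with Python's standard `graphlib` module (Python 3.9 or later), which yields an order in which every task runs after the tasks it depends on. The task names are hypothetical, and this illustrates only the DAG idea, not BPEL, MoML, or CWL themselves.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order that respects every dependency.

    `dependencies` maps each task to the set of tasks whose outputs it needs.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical analysis: "plot" needs both "align" and "stats".
workflow = {
    "align": {"fetch"},
    "stats": {"align"},
    "plot": {"align", "stats"},
}
print(execution_order(workflow))

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("a cycle is not a valid DAG workflow")
```

Rejecting cycles up front is what makes the "acyclic" part of the representation checkable before any task is run.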


Common Workflow Language

M. R. Crusoe


Workflow Execution Models
• Control flow model
• Data-flow model
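The data-flow model can be illustrated with a tiny executor in which a task fires as soon as all of its inputs are available, regardless of the order in which tasks were declared. In a control-flow model, by contrast, the order would be spelled out explicitly, for example as a sequence of calls. The function `run_dataflow` and the task names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Task = Tuple[Callable[..., Any], List[str]]


def run_dataflow(tasks: Dict[str, Task],
                 initial: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Fire each task once all of its named inputs have values."""
    values = dict(initial)
    pending = dict(tasks)
    fired: List[str] = []
    while pending:
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise RuntimeError("no task can fire: missing input or cycle")
        for name in ready:
            func, inputs = pending.pop(name)
            values[name] = func(*(values[i] for i in inputs))
            fired.append(name)
    return values, fired


# "total" is declared first but cannot fire until "double" has produced data.
tasks = {
    "total": (lambda a, b: a + b, ["double", "x"]),
    "double": (lambda x: 2 * x, ["x"]),
}
values, fired = run_dataflow(tasks, {"x": 3})
print(fired, values["total"])
```

The firing order falls out of data availability alone, which is why data-flow descriptions suit the data-oriented scientific workflows discussed earlier.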


Mapping workflows to resources
• Understand resource constraints
• Allow user to define mapping
• Use scheduler (e.g. DAGMan) to map
• Optimizations possible…
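Mapping under resource constraints can be sketched as a small best-fit assignment. This is not how DAGMan or any particular scheduler works; it assumes a single constraint (memory, in arbitrary units), assumes all tasks are resident at the same time, and uses hypothetical task and resource names.

```python
from typing import Dict


def map_tasks(task_memory: Dict[str, int],
              resource_memory: Dict[str, int]) -> Dict[str, str]:
    """Assign each task to the tightest-fitting resource with enough memory."""
    free = dict(resource_memory)
    mapping: Dict[str, str] = {}
    # Place the largest tasks first so they are not squeezed out later.
    for task, need in sorted(task_memory.items(), key=lambda kv: -kv[1]):
        candidates = [r for r, avail in free.items() if avail >= need]
        if not candidates:
            raise ValueError(f"no resource can hold task {task!r}")
        best = min(candidates, key=lambda r: free[r])  # best fit
        mapping[task] = best
        free[best] -= need
    return mapping


tasks = {"simulate": 8, "align": 4, "plot": 1}
resources = {"laptop": 4, "cluster": 16}
print(map_tasks(tasks, resources))
```

A user-defined mapping, the second bullet above, would simply replace the returned dictionary with one written by hand.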


Execution Concerns
• Single machine versus grid, cloud…
• Fault tolerance
• Adaptive workflow
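One simple form of fault tolerance is retrying a failed task a bounded number of times. The sketch below assumes tasks are plain callables that raise an exception on failure; the function name and the retry limit are hypothetical, and real workflow systems may instead checkpoint, reschedule, or fail over.

```python
from typing import Any, Callable, Optional, Tuple


def run_with_retries(task: Callable[[], Any],
                     max_attempts: int = 3) -> Tuple[Any, int]:
    """Run `task`, retrying on any exception; return (result, attempts used)."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


# A stand-in task that fails twice before succeeding.
calls = {"n": 0}


def flaky_transfer() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated network failure")
    return "transferred"


print(run_with_retries(flaky_transfer))
```

Returning the attempt count alongside the result gives the provenance record something to store about how the run actually went.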


Pegasus

R. F. da Silva


Workflows and Reproducibility
• How do workflows aid with reproducibility?
• How do workflows aid with reuse?
• What issues do workflows present with reproducibility?
• What about interoperability?
