CIS 602, Fall 2014

CIS 602: Provenance & Scientific Data Management

Scientific Workflows

Dr. David Koop

Reminders
• Reading Responses: Due at 12pm on the day they are assigned
• Course Project Ideas:
  - Please stop by or email me with any questions
• Reading Presentations:
  - Will post list of papers (subject to change) soon

Last Class: Provenance
• Provenance
  - Causality
  - Prospective and Retrospective
• Provenance Management
  - Provenance Capture
  - Provenance Models
  - Using Provenance
• Provenance-enabled example systems

Today’s Reading
• Scientific Workflow Management and the Kepler System [Ludäscher et al.]
• Why is it easy to tell this paper is ~10 years old?
  - Hint: Think about description of distributed computation...

Workflows
• (Business) Workflow Definition [Software AG]

“An orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information.”

Business Workflow

[SAP]

Business Workflows
• Also known as “Business Process Management” or “Business Process Engineering”
• Roots in office automation (think assembly lines for offices)
  - Each person has certain roles
  - For different processes, there are flows on how decisions are made or transferred
• Newer: “Web Service Choreography”

Scientific Workflows
• Scientific Workflow Definition [Ludäscher et al.]

“[P]rocess networks that are typically used as data analysis pipelines or for comparing observed and predicted data, and that can include a wide range of components, e.g., for querying databases, for data transformation and data mining steps, for execution of simulation codes on high performance computers”

2 SCIENTIFIC WORKFLOWS 3

Figure 1: Conceptual (“napkin drawing”) view of the Promoter Identification Workflow (PIW) [ABB+03]

2 Scientific Workflows

There is a growing interest in scientific workflows as can be seen from a number of recent events, e.g., the Scientific Data Management Workshop [SDM03], the e-Science Workflow Services Workshop [eSc03], the e-Science Grid Environments Workshop [eSc04], the Virtual Observatory Service Composition Workshop [GRI04], the e-Science LINK-Up Workshop on Workflow Interoperability and Semantic Extensions [LIN04], and last not least, various activities as part of the Global Grid Forum (e.g, [GGF04]), just to name a few. Scientific workflows also play an important role in a number of ongoing large research projects dealing with scientific data management, including those funded by NSF/ITR (GriPhyN, GEON, LEAD, SCEC, SEEK, ...), NIH (BIRN), DOE (SciDAC/SDM, GTL), and similar efforts funded by the UK e-Science initiative (myGrid, DiscoveryNet, and others). For example, the SEEK project [SEE] is developing an Analysis and Modeling System (AMS) that allows ecologists to design and execute scientific workflows [MBJ+04]. The AMS workflow component employs a Semantic Mediation System (SMS) to facilitate workflow design and data discovery via semantic typing [BL04]. Thus SEEK is a good example of a community-driven project in need of a system that allows users to seamlessly access data sources and services, and put them together into reusable workflows. Indeed SEEK is one of the main projects contributing to the cross-project Kepler initiative and workflow system discussed below.

Aspects and Types of Workflows. Scientific workflows often exhibit particular “traits”, e.g., they can be data-intensive, compute-intensive, analysis-intensive, visualization-intensive, etc. The workflows in Sections 2.1.1, 2.1.2, and 2.1.3, e.g., exhibit different features, i.e., service-orientation and data analysis, re-engineering and user interaction, and high-performance computing, respectively. Depending on the intended user group, one might want to hide or emphasize particular aspects and technical capabilities of scientific workflows. For example, a “Grid engineer” might be interested in low-level workflow aspects such as data movement and remote job control. Having workflow components (or actors) that operate at this level will be beneficial to the Grid engineer. Conversely, a scientific workflow system should hide such aspects from analytical scientists (say an ecologist studying species richness and productivity).

The Kepler system aims at supporting very different kinds of workflows, ranging from low-level “plumbing” workflows of interest to Grid engineers, to analytical knowledge discovery workflows for scientists, and conceptual-level design workflows that might become executable only as a result of subsequent refinement steps [BL05].

In the following we first introduce scientific workflows by means of several examples taken from different projects and implemented using the Ptolemy II-based Kepler system [KEP]. We then discuss typical features of scientific workflows and from this derive general requirements and desiderata for scientific workflow systems. We take a closer look at underlying technical issues and challenges in Section 3.

2.1 Example Workflows

2.1.1 Promoter Identification

Figure 1 shows a high-level, conceptual view of a typical scientific knowledge discovery workflow that

Scientific Workflow

[Ludäscher et al., Kepler]

Scientific Workflows
• Manage data-intensive, complex analyses
• Orchestrate different tools
• Structured computation
• Abstraction for more general understanding
• Enable automation, reproducibility, sharing

Business Workflows vs. Scientific Workflows

• Business Workflows
  - Decision-oriented
  - Emphasis on control-flow
  - Often involve many people
  - Often stateful
• Scientific Workflows
  - Data-oriented
  - Emphasis on data-flow
  - Usually involve a small group
  - Usually stateless

Types of Scientific Workflows
• Knowledge Discovery
  - Often executed once, changed, and new version executed again
  - Emphasis of VisTrails workflow system
• Automation
  - Used when same process is repeated for different data
  - Users often use higher-level interfaces to the workflows
• Coordination
  - Example: High-Performance Computing (HPC)
    • Move files from local machines to HPC resources
    • Coordinate execution on HPC resources

Scientific Workflow Desiderata
1. Seamless access to resources and services
  - Wrappers for existing libraries or services
2. Service composition and reuse
  - Don’t want an actor to be used only once
  - Balance between specificity and reuse
3. Scalability
  - Support different resources/environments
  - Also, write workflow once, run in different environments?
4. Detached Execution
  - Long-running workflows should not hog system resources
5. Reliability and Fault-Tolerance
  - Similar to requirements for software in general

Scientific Workflow Desiderata
6. Usability
  - Workflows should offer advantages over scripts (specification and output)
7. Caching (Smart Re-run)
  - Data-dependent structure makes it possible to cache results
  - Deterministic, non-stateful computations!
8. Linked Semantics
  - Use domain knowledge to inform workflow composition
  - Typing of input/output ports helps here
9. Provenance
  - We’ve talked about this aspect already!
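
The caching desideratum (7) can be illustrated with a small Python sketch. It is only one possible approach, not how any particular workflow system implements smart re-run: each step gets a signature built from its name, its parameters, and the signatures of its upstream steps, and a step runs only if that signature has not been seen. All names (`signature`, `run_cached`, the `load` and `scale` steps) are hypothetical, and the sketch assumes deterministic, stateless steps whose identity is captured by a name.

```python
import hashlib
import json

_cache = {}
executions = 0  # counts real (non-cached) executions


def signature(module_name, params, upstream_signatures):
    """Hash of the step's name, its parameters, and its upstream signatures."""
    payload = json.dumps([module_name, params, list(upstream_signatures)],
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(module_name, func, params, upstream=()):
    """Run func unless an identical computation is cached.

    upstream is a tuple of (signature, value) pairs returned by earlier
    run_cached calls; their values are passed to func positionally.
    """
    global executions
    sig = signature(module_name, params, [s for s, _ in upstream])
    if sig not in _cache:
        executions += 1
        _cache[sig] = func(*[v for _, v in upstream], **params)
    return sig, _cache[sig]


# First run: both steps execute.
load = run_cached("load", lambda n: list(range(n)), {"n": 5})
scaled = run_cached("scale", lambda xs, k: [k * x for x in xs], {"k": 2}, (load,))

# Second run with a changed downstream parameter: "load" is served from the
# cache, and only "scale" executes again.
load_again = run_cached("load", lambda n: list(range(n)), {"n": 5})
rescaled = run_cached("scale", lambda xs, k: [k * x for x in xs], {"k": 3},
                      (load_again,))
```

Because the signature includes upstream signatures, changing a parameter early in the pipeline invalidates everything downstream of it, while unrelated branches stay cached. This only works if steps are deterministic and stateless, which is the point of the second sub-bullet.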

Workflow Components
• Workflow is a directed graph with nodes and edges
  - Different terms for nodes and edges in different systems
• Nodes (Actors, Modules, Processors, or Activities)
  - Black boxes that perform some computation
• Ports
  - Input ports take in input data
  - Output ports make output data available
• Edges (Channels or Connections)
  - Link one actor’s input ports to another actor’s output ports
• Parameters
  - User-manipulable settings that control actor execution
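
The components listed above can be put together in a minimal Python sketch. It is an illustration under simplifying assumptions, not the data model of Kepler, VisTrails, or any other system: the class names `Actor` and `Workflow`, the port names, and the run-once-in-dependency-order execution policy are all hypothetical choices made for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Actor:
    """A black-box node: keyword inputs in, a dict of named outputs out."""
    name: str
    func: Callable[..., Dict[str, object]]  # returns {output_port: value}
    params: Dict[str, object] = field(default_factory=dict)


@dataclass
class Workflow:
    actors: Dict[str, Actor] = field(default_factory=dict)
    # Each edge is ((source actor, output port), (target actor, input port)).
    edges: List[Tuple[Tuple[str, str], Tuple[str, str]]] = field(default_factory=list)

    def add(self, actor: Actor) -> None:
        self.actors[actor.name] = actor

    def connect(self, src: str, out_port: str, dst: str, in_port: str) -> None:
        self.edges.append(((src, out_port), (dst, in_port)))

    def run(self) -> Dict[str, Dict[str, object]]:
        # Kahn's algorithm: an actor runs once all of its upstream actors have run.
        indegree = {name: 0 for name in self.actors}
        for _, (dst, _) in self.edges:
            indegree[dst] += 1
        ready = deque(n for n, d in indegree.items() if d == 0)
        results: Dict[str, Dict[str, object]] = {}
        while ready:
            name = ready.popleft()
            actor = self.actors[name]
            # Gather the values arriving on this actor's input ports.
            inputs = {in_port: results[src][out_port]
                      for (src, out_port), (dst, in_port) in self.edges
                      if dst == name}
            results[name] = actor.func(**inputs, **actor.params)
            for (src, _), (dst, _) in self.edges:
                if src == name:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(results) != len(self.actors):
            raise ValueError("workflow contains a cycle")
        return results


wf = Workflow()
wf.add(Actor("source", lambda n: {"values": list(range(n))}, params={"n": 4}))
wf.add(Actor("square", lambda values: {"squared": [v * v for v in values]}))
wf.add(Actor("total", lambda squared: {"sum": sum(squared)}))
wf.connect("source", "values", "square", "values")
wf.connect("square", "squared", "total", "squared")
out = wf.run()
```

Here `n` is a parameter of the `source` actor, while `values` and `squared` are ports connected by edges. Changing `n` and re-running is the kind of exploratory change a knowledge-discovery workflow is meant to make cheap.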

2 SCIENTIFIC WORKFLOWS 4

links genomic biology techniques such as microarrays with bioinformatics tools such as BLAST to identify and characterize eukaryotic promoters² – we call this the Promoter Identification Workflow or PIW (see also [Wer01, ABB+03, PYN+03]: Starting from microarray data, cluster analysis algorithms are used to identify genes that share similar patterns of gene expression profiles that are then predicted to be co-regulated as part of an interactive biochemical pathway. Given the gene-ids, gene sequences are retrieved from a remote database (e.g., GenBank) and fed to a tool (e.g., BLAST) that finds similar sequences. In subsequent steps, transcription factor binding sites and promoters are identified to create a promoter model that can be iteratively refined.

While Figure 1 leaves many details open, some features of scientific workflows can already be identified: There are a number of existing databases (such as GenBank) and computational tools (such as Clusfavor and BLAST) that need to be combined in certain ways to create the desired workflow. In the past, accessing remote resources often meant implementing a wrapper that mimics a human entering the input of interest, submitting an HTML form, and “screen-scraping” the result from the returned page [LPH01]. Today, more and more tools and databases become accessible via web services, greatly simplifying this task. Another trend are web portals such as NCBI [NCB04] that integrate many tools and databases and sometimes provide the scientist with a “workbench” environment.

Figure 2 depicts snapshots of an early implementation of PIW in Kepler. Kepler is an extension of the Ptolemy II system [PTO04] for scientific workflows. The topmost window includes a loop whose body is expanded below and which performs several steps on each of the given gene-ids: First, an NCBI web service is used to access GenBank data. Subsequently a BLAST step is performed to identify similar sequences to the one retrieved from GenBank. Then a second inner loop is executed (bottom window) for a transcription factor binding site analysis. Using Ptolemy II terminology, we call the individual steps actors, since they act as independent components which communicate with each other only through the channels indicated in the figure. The overall execution of the workflow is orchestrated by a director (the green box in Figure 2; see Section 3.3 for details).

This early PIW implementation in Kepler [ABB+03] illustrates a number of features: Actual “wiring” of a scientific workflow can be much more complicated than the conceptual view (Figure 1) suggests. A mechanism for collapsing details of a subworkflow into an abstract component (called composite actor in Ptolemy II) is essential to tame complexity: The windows in Figure 2 have well-defined input and output ports and thus correspond to (sub)workflows that can be collapsed into a more abstract, composite actor as indicated. Nevertheless, the resulting workflow is fairly complex and we will need to introduce additional mechanisms to simplify the design in particular of loops (see Section 4.1).

²A promoter is a subsequence of a chromosome that sits close to a gene and regulates its activity.

Figure 2: PIW implemented in Kepler [ABB+03]. Composite actors (subworkflows) expanded below.

2.1.2 Mineral Classification

The second example, from a geoinformatics domain, illustrates the use of a scientific workflow system for automation of an otherwise manual procedure, or alternatively, for reengineering an existing custom tool in a more generic and extensible environment. The upper left window in Figure 3 shows the top-level workflow: Some samples are selected from a database holding experimentally determined mineral compositions of igneous rocks. This data, together with a set of classification diagrams are fed into a Classifier subworkflow (bottom left). The manual process of classifying samples involves determining the position of the sample values in a series of diagrams such as the one shown on the right in Figure 3: if the location of a sample point in a non-terminal diagram of order n has been determined (e.g., diorite gabbro

Workflow Components
• Subworkflows
  - Module that can be expanded as it itself is a workflow
  - Must specify how the inputs and outputs are passed
• (Directors)
  - Define how execution occurs for the entire workflow
  - Kepler-specific: not a standard component of most workflow systems

Kepler Features
• Web Service Harvester

3 HIGHLIGHTS OF KEPLER 9

Figure 5: Kepler web service Harvester in action: repository access (1-2), harvesting (3), and use (4).

but to a web service repository. The repository URL might point to a UDDI repository, or simply to a web page listing multiple WSDL URLs as shown in (2). The Harvester then retrieves and analyzes all WSDL files of the repository, creating instantiations of web service actors in the user’s local actor library; see (3). For example, one of the harvested services, the BLAST web service, comprises five service operations which are imported into a corresponding subdirectory. The user can then drag-and-drop any of these service operations on the workflow canvas for use in a scientific workflow (4). The Harvester feature facilitates rapid prototyping and development of web service-based applications and workflows in a matter of minutes – that is, provided

(i) the web services are alive when needed, and

(ii) they can be wired together more or less directly to perform the desired complex task.

The problem with (i) is that, while harvested web services look like local components, their runtime failure can easily “break” a scientific workflow, reminding the user that the service interface has been harvested, not the actual code.⁷ We are currently extending Kepler to make workflows with web services more reliable. One simple approach is to avoid the association of a service operation with a fixed URL. Instead, a list of alternate services can be provided when the workflow is launched, and service failure can then be compensated by invocation of one of the alternate services. Another option is to insert special control tokens into the data stream, indicating to downstream actors the absence of certain results. Long running workflows may thus more gracefully react to web service failures and produce at least partial results. This idea has been further developed for “collection-oriented” (in the functional programming sense) workflows: via so-called “exception-catching actors”, invalid (due to failures) data collections can be filtered out of the data stream, while valid subcollections pass through unaffected [McP05]. An interesting research question is how to extend Ptolemy II’s pause-resume model to a full-fledge transaction model that can handle service failures.

⁷Which is of course the whole point of web services.

The problem (ii) is even more fundamental and has different aspects: At the design level the challenge is how to devise actors that can be reused easily. In Section 3.3 we give a brief introduction to actor-oriented modeling, the underlying paradigm of Ptolemy II, and discuss how it facilitates component composition and reuse. At the “plumbing” level it is often necessary to apply data transformations between two consecutive web services (called “shims” in Taverna). Such data transformations are supported through various actors in Kepler, e.g., XSLT and XQuery actors to apply transformations to XML data, or Perl and Python actors for text-based transformations.
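
The two reliability ideas the excerpt describes for problem (i), a list of alternate services and control tokens marking missing results, can be sketched in Python as follows. This is not Kepler code: `call_with_alternates`, `drop_missing`, the `MISSING` token, and the two stand-in services are hypothetical names invented for this illustration, and real services would be remote calls rather than local functions.

```python
from typing import Callable, Iterable, List, Sequence

MISSING = object()  # control token meaning "no result for this item"


def call_with_alternates(endpoints: Sequence[Callable[[str], str]],
                         request: str) -> object:
    """Try each alternate service in order; return MISSING if all of them fail."""
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except Exception:  # a real actor would catch narrower error types
            continue
    return MISSING


def drop_missing(stream: Iterable[object]) -> List[object]:
    """Stand-in for an exception-catching actor: filter failed items out."""
    return [item for item in stream if item is not MISSING]


# Hypothetical stand-ins for a primary service and its mirror.
def primary(gene_id: str) -> str:
    raise ConnectionError("service down")


def mirror(gene_id: str) -> str:
    if gene_id == "bad":
        raise ValueError("unknown id")
    return f"sequence-for-{gene_id}"


results = [call_with_alternates([primary, mirror], g) for g in ["g1", "bad", "g2"]]
valid = drop_missing(results)
```

The workflow still produces partial results: the failed item travels downstream as a token instead of aborting the run, and a later step decides whether to drop it or report it.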

3.2 Grid and other Extensions

Figure 6 depicts a number of Kepler actors that facilitate scientific workflows, including workflows that make use of “the Grid”. In the upper left, the previously discussed generic WebService actor and some

Kepler Features
• Grid Support

3 HIGHLIGHTS OF KEPLER 10

instantiations are shown. Note how the latter specialize their actor interface via their input/output ports: e.g., Blast_SearchSimple has three input ports and one output port, for the search arguments and result, respectively. The naming scheme used is WSN_OP, where WSN is the name of the web service and OP is a specific web service operation.

Figure 6: Grid actors and other Kepler extensions.

The upper right shows two Grid actors, called FileFetcher and FileStager, respectively. These actors make use of GridFTP [Pro00] to retrieve files from, or put files to, remote locations on the Grid. The GlobusJob actor below is another Grid actor, in this case for running a Globus job [Glo]. At the bottom of Figure 6 a small workflow is shown that takes a Globus proxy and some input files, staging the files to where the job is run, then fetching the results from the remote location and displaying them on the client side. The green box specifies that this workflow is executed using an SDF (Synchronous Data-Flow) director. This director analyzes the dataflow dependencies and token consumption and production rates of actors (here: token = file), and schedules the execution of actors accordingly.
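
One standard step in synchronous dataflow scheduling is to solve the balance equations: for every channel, the tokens produced per schedule period must equal the tokens consumed. The Python sketch below computes the smallest firing counts that satisfy this. It illustrates the general technique only and is not the Ptolemy II or Kepler SDF director; the function name `repetitions`, the edge tuple layout, and the token rates in the example are assumptions made here. It needs Python 3.9 or later for multi-argument `math.gcd` and `math.lcm`.

```python
from fractions import Fraction
from math import gcd, lcm
from typing import Dict, List, Tuple

# (producer, tokens produced per firing, consumer, tokens consumed per firing)
Edge = Tuple[str, int, str, int]


def repetitions(actors: List[str], edges: List[Edge]) -> Dict[str, int]:
    """Smallest positive firing counts r with r[src] * produced == r[dst] * consumed.

    Assumes a connected graph; raises ValueError if the rates are inconsistent.
    """
    rate: Dict[str, Fraction] = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        current = pending.pop()
        for src, produced, dst, consumed in edges:
            if src == current:
                other, value = dst, rate[src] * produced / consumed
            elif dst == current:
                other, value = src, rate[dst] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                raise ValueError("inconsistent token rates: no periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {name: int(r * scale) for name, r in rate.items()}
    divisor = gcd(*counts.values())
    return {name: c // divisor for name, c in counts.items()}


# Hypothetical rates: staging emits 2 files per firing, the job needs 3 files
# per firing and emits 1 result, and the display consumes 1 result.
plan = repetitions(
    ["stage", "job", "display"],
    [("stage", 2, "job", 3), ("job", 1, "display", 1)],
)
```

With these rates, one period fires `stage` three times and `job` and `display` twice each, so that six files are produced and six consumed on the first channel. A director would then order those firings so that no actor runs before its inputs are available.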

On the right, a number of actors that use the SDSC Storage Resource Broker [SRB] are shown, e.g., to connect and disconnect from SRB and to get and put files from and to SRB space, respectively. We are currently in the process of providing all commonly used SRB commands as actors. This will allow the Kepler user to design and execute Grid workflows involving a number of different tools, e.g., SRB for data handling aspects, and Globus, Nimrod and other tools for computational aspects and job scheduling.

In the center and left of Figure 6, various other Kepler actors are shown: The CommandLine actor can be used to incorporate any application into a workflow, provided it can be accessed from the command line.⁸ The “$” icon is reminiscent of a shell prompt. The actor is parameterized with the arguments of the shell command, making it easy to create generic or specialized command line invocations. A Browser actor is shown directly below (cf. Section 2.1.2). It takes as input an HTML file or URL and displays it in the user’s default browser. This makes the actor an ideal output device for displaying intermediate or final workflow results in ways that are well-known to users. Another extremely useful application of this actor is as an input device for user interactions. The result file of an upstream actor might have been transformed to an HTML file (e.g., using the xslt actor) and augmented with HTML forms, check boxes, or other input forms that are displayable to the user in a standard web browser. Upon executing the desired user interaction, an http-post request is sent to a special Kepler web server, acting as a listener, and from there the workflow is resumed.

The Email actor in the center of the figure provides a simple notification mechanism to inform the user of specific situations in the workflow. Together, the Email and Browser actors address core issues of requirement (R6) in Section 2.2. The Pause actor (red down-triangle) pauses workflow execution at specific points, allowing the user to inspect intermediate results, possibly changing parameter values, and resuming the workflow subsequently (addressing (R7) in Section 2.2).

Finally, actors for accessing real-time data streams from ROADNet sensor networks [ROA] have recently been added. These actors (e.g., OrbWaveformSource) can be integrated easily into Kepler, since many of the underlying Ptolemy II directors support streaming execution.⁹

⁸E.g., Kepler workflows can include data analysis steps via calls to R [R].

⁹This should come as no surprise, since dataflow process networks are defined on token streams in the first place.

3.3 Actor-Oriented Modeling

Arguably the most unique feature of Kepler comes from the underlying Ptolemy II system:

“The focus [of the Ptolemy project] is on assembly of concurrent components. The key underlying principle ... is the use of well-defined mod-

Kepler Features
• Map iteration

4 RESEARCH ISSUES 13

edly by some special directors), and a call to postfire. The main actor operation finally happens in the fire method, e.g., a web service actor will make the actual remote service call here.

Towards Actor-Oriented Scientific Workflows. The idea of actor-oriented scientific workflows is to apply the principles of actor-orientation and hierarchical modeling, underlying the Ptolemy approach [EJL+03, BLL+04b], to the modeling and design of scientific workflows. In particular, web service operations, which provide the building blocks of many loosely coupled workflows, should be structured into different parts, corresponding to the different phases and methods used in actor-oriented modeling. For example to implement a web service wA, the service developer should think of specific web service operations such as wA.initialize and wA.prefire in addition to the main “worker” method wA.fire. As in the case of Ptolemy actors, this will lead to more generic and reusable components and even facilitate more complex extensions such as stateful web services.¹⁵
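
The phases named in the excerpt (initialize, prefire, fire, postfire) can be illustrated with a toy Python sketch. Ptolemy II itself is a Java framework with a richer lifecycle, so everything here is an assumption made for illustration: the `Actor` base class, the boolean return conventions, the `Countdown` example, and the single-actor `run_actor` loop standing in for a director.

```python
from typing import List


class Actor:
    """Lifecycle hooks named after the phases mentioned in the excerpt."""

    def initialize(self) -> None:
        """Called once before the first iteration."""

    def prefire(self) -> bool:
        """Return True if the actor is ready to fire."""
        return True

    def fire(self) -> None:
        """Do the main work, e.g. the remote service call."""

    def postfire(self) -> bool:
        """Commit state changes; return False to decline further iterations."""
        return True


class Countdown(Actor):
    """Emits start, start - 1, ..., 1 and then stops."""

    def __init__(self, start: int) -> None:
        self.start = start
        self.remaining = 0
        self.emitted: List[int] = []

    def initialize(self) -> None:
        self.remaining = self.start
        self.emitted = []

    def prefire(self) -> bool:
        return self.remaining > 0

    def fire(self) -> None:
        self.emitted.append(self.remaining)

    def postfire(self) -> bool:
        # State is updated here rather than in fire.
        self.remaining -= 1
        return self.remaining > 0


def run_actor(actor: Actor, max_iterations: int = 100) -> None:
    """A toy director for a single actor: iterate until the actor declines."""
    actor.initialize()
    for _ in range(max_iterations):
        if not actor.prefire():
            break
        actor.fire()
        if not actor.postfire():
            break


counter = Countdown(3)
run_actor(counter)
```

Splitting the work this way lets the same actor be driven by different execution policies: the loop in `run_actor` could be replaced by a scheduler that interleaves many actors without changing `Countdown` at all.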

4 Research Issues

In this section we briefly discuss some technical issues that we have begun addressing for Kepler, but that are less mature and require some additional research.

4.1 Higher-Order Constructs

The early implementation of the Promoter Identification Workflow (PIW) depicted in Figure 2 demonstrated the feasibility and some advantages of implementing scientific workflows in the Kepler extension of Ptolemy II [ABB+03]. However, it also highlighted some inherent challenges of the dataflow-oriented programming paradigm [LA03]. We have argued in Section 2.3 that many current scientific workflow systems are more dataflow-oriented than business workflow systems and approaches, which tend to emphasize event-based control-flow rather than dataflow. When designing real-world scientific workflows it is necessary, however, to handle complex control-flows within a dataflow-oriented setting as well. It is well-known that control-flow constructs require some thought in order to handle them properly. The fairly intricate network topology in Figure 2 includes backward-directed “dataflow” channels, having the sole purpose of sending control tokens that initiate another iteration of a subworkflow. While such complicated structures achieve the desired effect (here, a special kind of loop), they are hard to understand, design, and maintain. Such ad-hoc constructions also increase the complexity of workflow design while diminishing the overall reusability of workflow components (see (R2) in Section 2.2). Fortunately, there are better ways to incorporate structured control into a dataflow-oriented system, thereby directly supporting workflow design as required by (R2).

¹⁵Statefulness is an established concept in actor-oriented modeling and dataflow networks; e.g., it can be represented explicitly via feedback loops.

Figure 9: PIW variant with map iterator.

In [LA03] we have illustrated how higher-order functional programming constructs can be used to improve the design of PIW. In particular, the higher-order function $\mathit{map} :: (\alpha \to \beta) \to [\alpha] \to [\beta]$ has proven to be very useful to implement a certain type of iteration. It takes a function $f$ (from $\alpha$ to $\beta$) and a list of elements of type $\alpha$, and applies $f$ to each list element, returning the list of result elements (each of type $\beta$). Thus map is defined as

$$\mathit{map}\; f\; [x_1, x_2, \ldots, x_n] = [f(x_1), f(x_2), \ldots, f(x_n)]$$

For example, $\mathit{map}\; f\; [1, 2, 3] = [1, 4, 9]$ for $f(x) = x^2$.

Figure 9 shows an improved version of the PIW workflow from Section 2.1.1 and Figure 2, now using the higher-order map function. Note how backward-directed flows of control-tokens are avoided. Instead, iterations are realized as nested subworkflows inside a higher-order Map actor. For example, to implement a look-up of a list of gene sequences via a GenBank web service that can only accept one gene at a time, we simply create the higher-order construct Map(GenBankWS) as shown in Figure 9 (the “stack” icon indicates that the contained workflow is applied multiple times).

Other higher-order functional programming constructs, e.g., foldr (for “fold right”) can be similarly used to provide more abstract and modular iteration and control constructs in a dataflow setting, and we plan to add those to Kepler in the future.
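
The map and foldr constructs discussed in the excerpt can be sketched in Python. This is only an analogy for the Map actor, not its Kepler implementation: `map_actor`, `foldr`, and `genbank_lookup` are hypothetical names, and the lookup function is a local stand-in for a web service that accepts one gene id per call.

```python
from functools import reduce
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_actor(sub_workflow: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Lift a one-item-at-a-time subworkflow to one that accepts a whole list."""
    def lifted(items: List[A]) -> List[B]:
        return [sub_workflow(item) for item in items]
    return lifted


def foldr(f: Callable[[A, B], B], initial: B, items: List[A]) -> B:
    """Right fold: f(x1, f(x2, ... f(xn, initial)))."""
    return reduce(lambda acc, x: f(x, acc), reversed(items), initial)


# Hypothetical stand-in for a service that accepts one gene id per call.
def genbank_lookup(gene_id: str) -> str:
    return f"sequence-for-{gene_id}"


lookup_all = map_actor(genbank_lookup)
sequences = lookup_all(["g1", "g2"])
squares = map_actor(lambda x: x * x)([1, 2, 3])
total_length = foldr(lambda seq, acc: len(seq) + acc, 0, sequences)
```

The iteration lives entirely inside `map_actor`, so the surrounding workflow needs no backward-directed control channels: it passes a list in and gets a list out.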

Other Workflow Systems
• Yahoo! Pipes

[Most Interesting Flickr Images (without flowers)]

Other Workflow Systems
• Mac OS X Automator


[Neil North, “Automator for Mac OS X: Tutorial and Examples”]


Other Workflow Systems
• Windows Workflow Foundation


Other Scientific Workflow Systems
• From Wikipedia (Accessed 9/11/2014)

• Anduril bioinformatics and image analysis
• ASKALON A workflow system for Cloud and Grid executions of workflows [4]
• Apache Airavata A general purpose workflow management system [5][6]
• BioBIKE
• Bioclipse A graphical workbench, with a scripting environment that lets you perform complex actions as a kind of workflow.
• Discovery Net: one of the earliest examples of a scientific workflow system
• Ergatis: workflow creation and monitoring interface
• Galaxy: initially targeted at genomics
• Kepler scientific workflow system
• Mobyle
• OnlineHPC: Online scientific workflow designer and high performance computing toolkit
• OpenMOLE: [7] A scientific workflow system with transparent scaling from a multi-threaded execution up to grid computing execution
• Orange: Open source data visualization and analysis
• Pegasus Workflow Management System [8][9]
• PipeLine Pilot
• Swift parallel scripting language: A scripting language with many of the capabilities of scientific workflow systems built-in.
• Tavaxy: [10] A cloud-based workflow system that integrates features from both Taverna and Galaxy.
• Taverna workbench: widely used in bioinformatics
• Triana
• KNIME
• VisTrails
• Yabi Python based general workflow system integrating any command line tool


Taverna
• Wraps and orchestrates Web Services
• Many users in bioinformatics domain
• myExperiment is an online repository for workflows


[P. Fisher, KEGG Workflow]


Galaxy
• Web-based system: no installation required
- sharing and cloud support
- bioinformatics focus


Pegasus
• Focus on using workflows with grid/cloud resources
• Earth science applications (e.g. earthquake analysis)


VisTrails
• Focus on knowledge discovery, usability, evolution of workflows
• Visual spreadsheet helps users compare outputs


[Figure: VisTrails workflow with modules Map, MplBar, MplAxesProperties, MplFigureProperties, MplFigure, MplFigureCell, GetFareData (Group), DateRange (PythonSource), BuildLabels (PythonSource)]


Workflow Demo


Next Class
• New paper: A Framework for Collecting Provenance in Data-Centric Scientific Workflows (download from course web page)
- Any volunteers for this presentation?
• Send Reading Response to [email protected] by 12pm on 9/16
- Text or PDF (no Word documents, please)

• Keep thinking about project ideas
