
The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures – Technical Report

Laurens Versluis1, Roland Mathá2, Sacheendra Talluri3, Tim Hegeman1, Radu Prodan4, Ewa Deelman5, and Alexandru Iosup1

1Vrije Universiteit Amsterdam, 2University of Innsbruck, 3Delft University of Technology, 4University of Klagenfurt, 5University of Southern California

Abstract

Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. We focus in this work on traces of workflows, which are common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow traces raises important issues: (1) the use of realistic traces is infrequent, and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures, and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.

1 Introduction

Workflows, that is, applications that organize tasks with data and computational inter-dependencies, are already a significant part of private datacenter and public cloud infrastructures [22, 52]. This trend is likely to intensify [27, 12], as organizations and companies transition from basic to increasingly more sophisticated cloud-based services. For example, 96% of companies responding to RightScale's 2018 survey are using the cloud [50], up from 86% in 2012 [30]; the average organization combines services across five public and private clouds.

Figure 1: A visual map to this work: (left) The problem: infrequent use of Realistic (≈40%) and Open-source (≈15%) workflow traces in representative articles, which can affect the relevance and reproducibility of experiments for the entire community. (right) Toward an answer: the Workflow Trace Archive (WTA) stakeholders, process, and tools provide the community with open-source traces of relevant workflows running in public and private computing infrastructures.

To maintain, tune, and develop the computing infrastructures for running workflows at the massive scale and with the diversity suggested by these trends, the systems community requires adequate capabilities for testing and experimentation. Although the community is aware that workload traces enable a broad class of realistic, relevant, and reproducible experiments, currently such traces are infrequently used, as we summarize in Figure 1 (left) and quantify in Section 2. Toward addressing this problem, we focus on improving trace availability and understanding by proposing a new, free and open-access WTA, as detailed in Figure 1 (right) and in the remainder of this work.

This work aligns with our vision of massivizing computer systems (MCS), of "a world where individuals and human-centered organizations are augmented by an automated, sustainable layer of technology, at the core of which are modern (distributed) computer systems" [25]. MCS aims to "understand and eventually control" these systems, and posits that, to avoid a reproducibility crisis, the community must "understand and create together a science, practice, and culture of computer ecosystems" (principle P8 of MCS) and must become "aware of the evolution and emergent behavior of computer ecosystems" (P9). MCS also proposes a series of methodological challenges related to these principles, of which this work helps alleviate mainly challenge C19, related to "understanding and explaining new modes of use, including new, realistic, accurate, yet tractable models of workloads and environments", but also the other challenges where meaningful workloads currently play a role, such as C15, related to experimentation; C16, related to benchmarking; C17, related to testing, validation, and verification; and C18, related to building a science capable of addressing modern distributed systems.

The need for workflow traces is stringent [25, 12]. MCS raises the challenge of understanding all types of workloads that appear in public and private computing infrastructures. Not only has the sheer volume of workloads increased significantly over time [52], but the users of datacenters and cloud operations also expect increasingly better Quality of Service (QoS) from workflow-management systems, including elasticity, reliability, and low cost, under strong assumptions of validation [25, 12] and reproducibility [27, 19]. Developing workflow management systems to meet these requirements requires considerable scientific and technical advances and, correspondingly, comprehensive trace-based experimentation and testing, both in vivo and in silico [13]. Testing such systems, especially at cluster and datacenter scale, often cannot be done in vivo, due to the downtime or the operational costs required. Instead, workflow traces can be replayed in silico, allowing multiple setups to run in parallel, testing individual components, etc., without the downtime or costs. Although realistic workflow traces are key for testing, tuning, validating, and inspiring system designs, they are currently still scarce [11]. Prior work, such as WorkflowHub [44], has introduced numerous workflow traces, yet only from the science domain. As Figure 1 (left) indicates, and Section 2 quantifies and explains, less than 40% of relevant articles focusing on workflow systems conduct experiments with realistic traces, and less than 15% conduct experiments with realistic and open-source traces.

The current scarcity of traces forces researchers either to use synthetically generated workloads or to use one of the few available traces. Synthetic traces may reduce the representativeness and quality of experiments, if they do not match relevant real-world settings. Using realistic traces that correspond to a narrow application domain may result in overfitting; Amvrosiadis et al. [4] demonstrate this for cluster-based infrastructures. Additionally, a lack of realistic traces may lead to limited or even wrong understanding of workflow characteristics, their performance, and their usage, which hampers the reuse of the systems tested with such (workloads of) workflows [37]. This gives rise to the research question RQ-1: How diverse are the workflow traces currently used by the systems community?

We identify the need to share workflow traces collected from relevant environments running relevant workloads under relevant constraints. Effective sharing requires unified trace formats, and also support for emerging and new features. For example, since the introduction of commercial clouds, clients have increasingly started to ask for better QoS, and in particular have started to increasingly express non-functional requirements (NFRs) such as availability, privacy, and security demands in traces [12, 9]. This leads us to research question RQ-2: How to support sharing workflow traces in a common, unified format? How to support arbitrary NFRs in it?

Persuading both academia and industry to release data is vital to address the problems stated above. We tackle this issue with two main approaches. First, by offering tools to obscure sensitive information, while still retaining significant detail in shared traces. Second, by encouraging the same organization to share data across its possibly multiple workflow management systems (sources), and by explicitly aiming to collect data across diverse application domains and fields. The availability of diverse data and tools increases the benefits of making such traces available, while simultaneously reducing the concerns of competitive disadvantage or of an (accidental) disclosure of sensitive information. The community is already helping with both approaches, by increasingly focusing on the problem of reproducibility. For example, ACM introduced artifact review and badges to stimulate open-access software and data artifacts for reproducibility and verification purposes [40]. We add to this community effort our own, which is scientific in nature, asking RQ-3: What is the impact of the source and domain of a trace on the characteristics of workflows?

Addressing research questions 1–3, our contribution is four-fold:

1. To answer RQ-1, we conduct the first comprehensive survey of how the systems community uses workflow traces (Section 2). We collect, select, and label articles from top conferences and journals covering workflow management. We analyze the types of traces used in the community, and the domains and fields covered in published studies.

2. To answer RQ-2, we design the WTA for open access to traces of workloads of workflows (Section 3). We identify a comprehensive set of requirements for a workflow trace archive. A key conceptual contribution of the WTA is the design of a unified trace format for sharing workflows, the first to generalize NFR support at both workflow and task levels. The WTA currently archives a diverse set of (1) real workflow traces collected from real-world environments, (2) realistic workflow traces used in peer-reviewed publications, and (3) workflow traces collected from simulated and emulated environments commonly used by the systems community. The WTA also introduces tools to detail and compare its traces.

Figure 2: The article selection process. Subsequent stages decrease the number of articles: from a corpus of 18,412 articles, down to 104 relevant references.

3. To address RQ-3, we compare key workload characteristics across traces, domains, and sources (Section 4). Our effort is the first to characterize the new trace from Alibaba, and the first to investigate the critical path task length, the level of parallelism, and burstiness using the Hurst exponent on workloads of workflows. Overall, the archive comprises 95 traces, featuring more than 48 million workflows containing over 2 billion CPU core hours.

4. To validate our answers to RQs 1–3, we analyze various threats (Section 5). We conduct a trace-based simulation study and a qualitative analysis. Our results for the former indicate that systems should be tested with different traces to validate claims about the generality of the proposed approach.

2 A Survey of Workflow Trace Usage

To assess the current usage of workflow traces in the systems community and the need for a workflow archive, we systematically survey a large body of work published in top conferences and journals, and investigate articles that perform experiments using workflow traces, either through simulation or using a real-world setup. The process and outcome of this survey answer RQ-1.

Table 1: Workflow trace usage in venues having at least one paper returned in the initial query. The venues with > 5 hits have their individual column. The column "Other" shows combined results for conferences with ≤ 5 hits: ATC, CLOUD, CLUSTER, e-Science, Euro-Par, GRID, HPDC, JSSPP, IC2E, ICDCS, ICPE, IPDPS, NSDI, OSDI, SC, SIGMETRICS, WORKS. Percentages are computed from the total in the corresponding column, e.g., 13 out of 37 for the cell corresponding to row R and column FGCS.

Acronym | Description | Total | FGCS | CCGrid | TPDS | Other
T | Articles using traces | 104 | 37 | 17 | 17 | 33
R | Articles using realistic traces | 40 (38%) | 13 (35%) | 8 (47%) | 6 (35%) | 13 (39%)
R+O | Articles using traces that are both realistic and open-access | 14 (13%) | 6 (16%) | 2 (12%) | 3 (18%) | 3 (9%)

2.1 Article Selection and Labeling

Selection: Figure 2 displays our systematic approach to select articles relevant to this survey, based on [31]. First, we collect data from DBLP [35] and Semantic Scholar [2]. We filter them by venue, retaining only articles from the 10 key conferences and journals in distributed systems listed in the caption of Table 1, including SC. This yields 18,412 articles. Next, we automatically select all articles from the last decade (2009–2018) containing the word "workflow" in either title or abstract, yielding 397 articles. This step provides articles that focus on all aspects of workflows, e.g., scheduling, analysis, and design. Finally, to obtain insights into workflow trace usage, we manually check the 397 articles. Overall, this systematic process yields 104 articles using workflow traces. To highlight the relevance of papers, we use Google Scholar to obtain citation counts. In total, the 104 papers have been cited 3,965 times.
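To illustrate the automatic part of this selection, the sketch below filters article records by venue, decade, and the keyword "workflow"; the record layout and the venue subset are hypothetical simplifications of the actual pipeline, shown only to make the step concrete.

    # Sketch of the automatic article-selection step (hypothetical record layout).
    SELECTED_VENUES = {"FGCS", "CCGrid", "TPDS", "SC", "HPDC"}  # illustrative subset

    def select_articles(records):
        """Keep 2009-2018 articles from selected venues mentioning 'workflow'."""
        selected = []
        for rec in records:  # rec: dict with 'venue', 'year', 'title', 'abstract'
            text = (rec.get("title", "") + " " + rec.get("abstract", "")).lower()
            if (rec.get("venue") in SELECTED_VENUES
                    and 2009 <= rec.get("year", 0) <= 2018
                    and "workflow" in text):
                selected.append(rec)
        return selected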

Labeling: We label for each of the 104 articles the type of trace usage. For articles explicitly describing their use, we use the label realistic for traces derived from real-world workflow executions. We label the others as synthetic. We further label traces as open-access (or open-source) if they are available online and to a broad audience, and closed-access (or closed-source) otherwise. In our analysis, we include among the open-access traces only those that are also realistic.

We also label articles by domain and field. We identify in articles explicit use of traces from the domains "scientific", "engineering", "multimedia", "governmental", and "industry", and from fields such as "bioinformatics", "astronomy", "physics", etc. We further label a trace as uncategorized when its origin remains unexplained.

2.2 Types of Traces Used in the Community

We analyze here the types of traces used by the community, with the following Observations (Os):

O-1: Less than 40% of articles use realistic traces.

O-2: Only one-seventh of all articles use open-access traces.

Table 1 presents the types of traces used in the community, focusing on realistic (R) and open-access (R+O) traces. The community uses traces for experiments across both conference and journal articles, across various levels of (high) quality. In contrast to this positive finding, the results indicate that, from the total number of articles using traces at all, the fraction of articles using realistic and even open-access traces is relatively small. Across all venues, only 38% of the articles use at least one realistic trace, and only 14% of the articles use at least one open-access trace.

These findings match the perceived difficulty in reproducing studies in the field [19, 13], and may hint at why so few of these seemingly successful designs are adopted for use in practice [42].


Figure 3: (top) Top-5 (out of 6) domains (Scientific, Uncategorized, Industry, Engineering, Multimedia) and (bottom) Top-5 (out of 28) fields (Uncategorized, Bioinformatics, Physics, Astronomy, Geology) from which the community sources workflows. ("Uncategorized" for unclear domain or field.)

2.3 Workflow Domains and Fields

We analyze the domains and fields from which the community sources workflows, with the following main observations:

O-3: The community sources workflows from 5+ domains and 25+ fields.

O-4: Traces containing scientific workflows are used significantly more (20x) than workflows from other domains, e.g., industry and engineering, in the surveyed articles.

O-5: Bioinformatics workflows are the most commonly used, but three other fields exhibit usage within a factor of 3.

O-6: Many traces have an uncategorized domain and/or field.

Overall, we find that the community uses diverse workflows, sourced from 6 domains and 28 fields. We further investigate the distribution of use, per domain and per field. Figure 3 (top) shows that the scientific domain is over-represented in the literature among the top-five trace domains encountered, due to the large number of available open-access traces and their conventional use in prior work. In particular, a large portion of the articles use workflow traces from the Pegasus project, which covers the scientific domain. The number of traces in this domain exceeds 200, which is larger than the number of articles in the study, as each article uses multiple traces. In contrast, the next-largest domains are industry and engineering, each with fewer than 10 traces, representing less than one-twentieth of the scientific domain.

Figure 4: Statistical characterization of workflows in open- and closed-access traces: (a) number of workflows per article, (b) smallest workflow size per article, (c) largest workflow size per article. Horizontal axes are logarithmic. Symbols are differentiation markers, not data-points.

We remark on the positive diversity of workflow domains, although this diversity is tempered by the community's extreme focus on scientific workflows. This confirms the bias demonstrated by Amvrosiadis et al. [3] with the popular Google-cluster traces. A similar situation appears for fields, but more tempered, as Figure 3 (bottom) indicates.

Overall, the results reveal that the community has a strong bias for one domain (scientific) and favors scientific fields (especially bioinformatics). We conjecture that the large amount of open-access data from these fields facilitates this bias. To overcome it, and to further reduce the large fraction of uncategorized traces evident in both plots of Figure 3, we posit the community should require that open-access and diverse traces be used in articles claiming the generality of their techniques.


2.4 Workflow Size in Open- and Closed-access Traces

We focus on open- and closed-access traces: is there structural or size-related inequality? We find that:

O-7: Open-access traces generally have fewer workflows.

O-8: Open- and closed-access traces cover different workflow sizes.

Figure 4 uses split violin plots to characterize the number and size of workflows present in open- vs. closed-access traces (bottom half-violins), and across all traces (top). Each half-violin depicts the probability density function (PDF) of an empirical variable: the number of workflows used per article in Figure 4a, etc.

Figure 4a shows that most articles use up to 10 workflows for experiments, with an average of 11 workflows per article and some extreme outliers. The distributions for open- and closed-access traces are similarly shaped, yet closed-access traces can include over an order of magnitude more workflows.

For the smallest workflow size, Figure 4b indicates that about half of the articles use workflows with 30 tasks or fewer. The range of workflow sizes in open-access traces is much smaller than for closed-access traces; in particular, articles using closed-access traces also report many workflows with 10 or fewer tasks.

For the largest workflow size, half of the experiments do not exceed 1,000 tasks, matching the prior characterization of Juve et al. [29]. Although closer than for the smallest workflow size, the open- and closed-access traces again exhibit different ranges for their largest workflow sizes.

3 The Workflow Trace Archive

In this section we outline the design of the WTA, the unified trace format it uses, and the tools that support consumers in selecting traces according to their use case, and we give a summarized overview of the current contents of the archive. Furthermore, to facilitate the continuous growth of the archive, we provide tools for trace anonymization and a collection of trace parsing scripts for different trace sources.

Similarly to how the overall design of experiments is now commonly described in publications in our field, as the setup leading to experimental results, we include here an overview of the design process that led to the design presented in this section. Outlining the design and the process that led to it is important for understanding how the final design came to be and how it fits the intended goal [26].

The overall design of the archive and the detailed design that results in the unified format are done iteratively, using a process that draws from the AtLarge systematic design process [26]. We started by listing initial requirements (see Section 3.1) that the WTA had to fulfill, and co-evolved the requirements with the development of the solution (the archive). For example, we explicitly added the requirement to provide scripts and datasets that aid users in building their own tools, as we discovered how difficult it was to engineer them from scratch (see Section 3.6). Next, we defined an initial format, centered around a number of unique features, such as NFRs. We improved this format iteratively, to meet the requirements and/or to pass various thought experiments. For the latter, whenever we encountered a new data format that was not fully covered by our format, we discussed which properties and/or objects should be added to the format (see Section 3.4). We assessed the trade-off between format comprehensiveness (what to include?) and brevity (what is too much or too complex?) based on personal experience, on the perceived importance of data fields in the literature, and on their frequency of use in other archives. Finally, we designed the analysis tools iteratively, including in them initially our own ideas and then aspects highlighted by other archives, literature reports, and perceived shortcomings.

3.1 Requirements

We identify five key requirements for the structure, content, and operation of a useful archive for workflow traces.

R-1: Diverse Traces for Academia, Industry, and Education. The main observations in Section 2.3 give strong evidence that the archive must include a diverse set of traces, corresponding to the many existing workflow domains and fields. Moreover, the archive must include traces that cover a broad spectrum of workflow sizes, structures, and other characteristics, including both characteristics general to many domains and fields, and idiosyncratic characteristics corresponding to only one domain or field.

Addressing this requirement is important. For academia, experimenting with representative traces demonstrates the applicability of an approach and increases credibility. For industry, representative traces are essential to test production-ready systems, and can act as a validation for techniques proposed by academia [32]. Furthermore, as systems become more complex, education and training on these topics become ever more important.

R-2: A Unified Format for Traces of Workloads of Workflows. To simplify trace exchange, reduce trace-integration effort among different systems, prevent duplicated work for other users, and provide dataset-independent tools (expressed as R-3), the archive must define a unified trace format. The format must cover a broad set of data about the workloads and about the workflow management systems, including: workload metadata; workflow-level data including NFRs and common metadata; task-level data including per-task NFRs and operational metadata such as task-state changes; inter-dependencies between tasks and other operational elements such as data transfers; system-level information including resource provisioning, allocation, and consumption; etc.

Addressing this requirement is the basis of any data archive. A unified trace format for workflow workloads helps provide long-term provenance information [7] to improve the reproducibility of experimental results, which is an ongoing grand challenge in fields such as bioinformatics.

R-3: Detailed Insights into Trace Properties. Just archiving data is not enough; every archive must provide insight into its traces at the level of detail required by the broad audience implied by R-1, from beginner to expert. Broad insights, easily provided in standardized reports, include extrinsic properties such as the number of workflows and tasks and the number of users in the trace, while intrinsic properties include the workflow arrival patterns and the resource consumption per task.

Detailed insights include expert-level analysis of single traces at workload, workflow, and system level, and collective analysis across all traces and across traces with a shared feature (e.g., all traces of a domain or field). These properties must be accessible through readily available tools (see R-4) and, possibly, through interactive online reports. Additionally, practitioners can correlate information across multiple traces, resulting in better quantitative evidence, intuition about otherwise black-box applications, and understanding that helps avoid common pitfalls [18].

R-4: Tools for Trace Access, Parsing, Analysis, and Validation. The most important tool is the online presence of the archive itself. The archive must further provide tools to parse traces from different sources into the unified format (see also R-2), to provide insight into traces (see also R-3), and to validate common properties (e.g., the presence and correctness of properties). An absence of such tools would leave users unable to select appropriate traces, validate their properties, and compare them.

The archive should further aid users in building more sophisticated tools. Newly built tools can then be added to the selection of tools so more parties can make use of them (contributing to R-5).

R-5: Methods for Contribution. The archive must reflect the continuous evolution of workflow use in practice, by increasing the coverage of different scenarios. We make a distinction between two types of contribution: (1) additional traces, possibly originating from a new domain or application field, and (2) additional traces introducing new properties. To facilitate the former contribution, the archive must provide a method for the upload and (basic) automated verification of traces. To facilitate the latter, the format must integrate specific provisions that enable upgrades and long-term maintainability, such as adding a version to each component of the format.

Having methods to contribute new workloads is key to encouraging new and existing contributors to submit traces. Tools to add new domains are of particular importance, to support emerging paradigms with realistic data.

3.2 Overview of the WTA

We design the WTA as a process and a set of tools helping a diverse set of stakeholders. We consider three roles for WTA community members, outlined in Figure 1. The contributor supplies, as the legal owner or representative, one or more traces to the WTA. A workflow trace contains historical task execution data, resource usage, NFRs, resource descriptions, inputs and outputs, etc. To fulfill R-5, the WTA team assists the contributor in parsing, anonymizing, and converting the traces into the unified format (Section 3.4), minimizing the risk of competitive disadvantage, and verifying their integrity. The WTA fulfills R-1 as it incrementally expands with contributions of traces from different domains with different properties.

The user represents non-expert or expert trace consumers. Non-expert users often need to rely on generic domain or trace properties, whereas expert users have detailed knowledge of their system and require fine-grained details for selecting the correct trace. In addition, expert users may comment on (missing) properties and may develop new tools, models, or other techniques to further compare and rank the traces. Both user types require assistance in selecting the most suitable trace given a set of criteria (Section 3.5), as well as analysis and validation (Section 3.6), from the available set of traces (Section 3.7). To support both user types, the WTA discloses both high-level and low-level details.

3.3 Workflow Model

There are numerous types of workflow models used across different communities. A 2018 study by Versluis et al. finds that directed acyclic graphs (DAGs) are the most commonly used formalism in computer systems conferences [49]. Therefore, for the first design of this archive, we adopt DAGs as the workflow model.

A workflow is constructed as a DAG in which nodes are computational tasks and directed edges depict the computational or data constraints between tasks. Entry tasks are tasks with no incoming dependencies; once submitted to the system, they are immediately eligible for execution. Similarly, end tasks are nodes that have no outgoing edges. A collection of workflows submitted to the same infrastructure over a certain period of time is considered a workload.
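A minimal sketch of this model, assuming a workflow is represented as a mapping from task IDs to the IDs of their parents (an illustrative in-memory representation, not the WTA on-disk format):

    def entry_and_end_tasks(parents):
        """parents: dict mapping task id -> set of parent task ids."""
        all_tasks = set(parents)
        # Entry tasks have no incoming dependencies and are immediately eligible.
        entries = {t for t, ps in parents.items() if not ps}
        # End tasks have no outgoing edges, i.e., no task lists them as a parent.
        has_child = {p for ps in parents.values() for p in ps}
        ends = all_tasks - has_child
        return entries, ends

    # Example: a diamond-shaped workflow a -> {b, c} -> d.
    print(entry_and_end_tasks({"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}))
    # ({'a'}, {'d'})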

Although popular, we specifically do not focus in this work on BPMN and BPEL, Petri nets, hypergraphs, general undirected graphs, or cyclic graphs. These formalisms either include business and human-in-the-loop elements [1] or add additional complexity due to having a large set of control structures such as loops, conditions, etc. [46], which we consider out of scope for this work.

3.4 Unified Trace Format

Creating a unified format (R-2) requires from the designer a careful balance between limiting the number of recorded fields and supporting a diverse set of usage scenarios for all stakeholders in Section 3.2. Modern logging and tracing infrastructure can capture thousands of metrics for each machine and workflow task involved [53], from which the designer must select. We specifically envision support for common system and workflow properties found in the typical scenarios considered in the top venues surveyed in Section 2, such as engineering a workflow engine [54], characterizing the properties of workloads of workflows [29], and designing and tuning new workflow schedulers [47].

Our unified format attempts to cover different trace domains, while preserving valuable information, such as resource consumption and NFRs, contributing to fulfilling R-1 and R-3. By analyzing the raw data formats, we carefully selected useful properties to include in our unified format, omitting low-level details, such as cycles per instruction, page cache sizes, etc.

Answering RQ-2 and fulfilling R-2, our trace format is the first to support arbitrary NFRs at both task and workflow levels. For example, one of the LANL traces (introduced in Table 3) contains deadlines per workflow, and the Google cluster data features task priorities; both are supported by the WTA unified format. Capturing these properties is important to test QoS-aware schedulers.

Figure 5: The WTA trace format. The format comprises seven objects, with the following fields:
Workload: Schema version, Workflows, Number of workflows, Number of tasks, Domain, Start date, End date, Number of sites, Number of resources, Number of users, Number of groups, Total resource seconds, Authors, Description.
Workflow: Schema version, ID, Submit time, Tasks, Number of tasks, Critical path length, Critical path task count, Maximum concurrent tasks, NFRs, Scheduler, Domain, Application name, Application field, Total resources used, Total memory used, Total network IO, Total disk space used, Total energy consumed.
Task: Schema version, ID, Type, Submit time, Site ID, Runtime, Resource type, Resource amount requested, Parents, Children, User ID, Group ID, NFRs, Workflow ID, Wait time, Parameters, Memory requested, Disk IO time, Disk space requested, Energy consumption, Network IO time, Resource IDs, Data transfer IDs.
Resource: Schema version, ID, Type, Number of resources, Proc model, Memory, Disk space, Network (speed), Operating system, Details, Events.
DataTransfer: ID, Type, Submit time, Transfer time, Source task ID, Destination task ID, Size, Size unit, Events, Version.
TaskState: Schema version, Start time, End time, Workflow ID, Task ID, Resource ID, CPU rate, Canonical memory usage, Assigned memory, Minimum memory usage, Maximum memory usage, Disk IO time, Maximum disk bandwidth, Disk space usage, Maximum CPU rate, Maximum disk IO time, Sample rate, Sampled CPU usage, Network IO time, Maximum network bandwidth, Network in, Network out.
ResourceState: Schema version, Resource ID, Timestamp, Event type, Platform ID, Available resources, Available memory, Available disk space, Available disk IO bandwidth, Available network bandwidth, Average utilization 1 minute, Average utilization 5 minutes, Average utilization 15 minutes.

As depicted in Figure 5, the WTA format includes seven objects: Workload, Workflow, Task, TaskState, Resource, ResourceState, and DataTransfer. Each of these objects contains a version field, which is updated whenever the set of properties is altered (R-5).

Each trace is a single workload, consisting of multiple workflows and their arrival process. Workload properties include the number of workflows, tasks, users, and resources used (e.g., CPUs or cores), the domain and field when available, the list of authors, the start and end date, and overall resource-consumption statistics. The workload description describes the origin of the workload and other relevant information, including but not limited to its source, execution environment, and non-functional requirements.

Each workflow in the workload has a unique identifier and an arrival time, and contains a set of tasks and several properties, including the scheduler used, the number of tasks, the critical path length, NFRs, and resource consumption. Each workflow also records the name of its domain of study, when possible.

The workflow structure is a DAG of tasks, with control and data dependencies between tasks. Each Task has a unique identifier and lists its submission and waiting time, runtime, and resource requirements, including required (compute) resources, memory, disk, network, and energy usage. Additionally, each task provides optional dictionaries for task-specific execution parameters and NFRs. To model dependencies between tasks, the WTA format maintains for each task a list of its predecessor (parent) and successor (child) tasks. Similarly, data dependencies are recorded as a list of data transfers.
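For example, the critical path recorded at the workflow level can be derived from these parent lists together with task runtimes. The sketch below is a minimal illustration of that computation (runtime-weighted; the task-count variant uses unit weights), not the WTA tooling itself:

    from functools import lru_cache

    def critical_path(runtimes, parents):
        """Longest cumulative runtime over any chain of dependent tasks.

        runtimes: dict task id -> runtime; parents: dict task id -> set of parent ids.
        """
        @lru_cache(maxsize=None)
        def finish(task):
            # A task can finish only after its longest-finishing parent.
            start = max((finish(p) for p in parents[task]), default=0.0)
            return start + runtimes[task]

        return max(finish(t) for t in runtimes)

    # Diamond workflow: a(2) -> b(3), c(1) -> d(2); critical path a-b-d = 7.
    print(critical_path({"a": 2, "b": 3, "c": 1, "d": 2},
                        {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}))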

Resource objects cover various resource types, such as cloud instances, cluster nodes, and IoT devices. A resource has a unique identifier and contains several properties, such as resource type (e.g., CPU, GPU, threads), number, processor model, memory, disk space, and operating system. An optional dictionary provides further details, such as instance type or cloud provider. The ResourceState periodically snapshots the resource state, including availability and utilization. Analogous to the ResourceState, the TaskState periodically records the resource consumption of the task (the Task object records the resource demand). Different from the ResourceState, the TaskState snapshot contains averages over a period of time. This is because resources are generally allocated to tasks once, whereas the actual consumption of a task may vary over time.

Figure 6: A screenshot of part of the table containing all traces on the WTA website, sorted on the tasks column.

Each DataTransfer describes a file transfer from a source to a destination task, which can be a local copy on the same resource, a network transfer from a remote source, etc. To support bandwidth analysis, a data transfer includes the submission time, transfer time, and data size. Each data transfer also provides an optional dictionary with detailed event timestamps (e.g., pause, retry).

3.5 Mechanisms for Trace Selection

We address R-3 by assisting archive users in retrieving appropriate traces for their scenarios, using filter and selection mechanisms. The most important such mechanism is the website, which contains an overview of all traces in a general table, visible in Figure 6, with the number of workflows, tasks, users, etc. This table is sortable and searchable, allowing website users to interact with the more than 90 traces currently in the WTA (column "#WL", row "Total" in Table 3).

We provide, online and as separate tools, a detailed report for each trace. Each report includes automatically generated statistics, such as the number of workflows and tasks, resource properties such as compute, memory, and IO, and job and task arrival-time and runtime distributions (see Section 4). The metrics featured in the report are reported as important by prior studies [45, 51] and enable developers to select traces appropriate for their intended use case.

Table 2: Trace anonymization methods used in WTA tools.
Obfuscation method | Description
IP | Encodes IPv4 addresses
Mail and host | Obfuscates mail and host names
File paths | Hides file paths in Linux and Windows format
Executable files | Encodes executable file names, e.g., py, sh, exe, jar
All files | Hides all file names ending with 2, 3, or 4 letters
Keywords | Anonymizes a list of custom keywords
All | Applies all obfuscation methods listed above

3.6 Tools for Analysis and Validation

We implement the unified trace format using the Parquet file format and the Snappy compression algorithm. Parquet is a binary file format that is supported by many big data tools, such as Apache Spark, Flink, Druid, and Hadoop [48]. Many programming languages also have libraries to parse this format, such as PyArrow for Python and parquet-avro for Java. Snappy compression (https://github.com/google/snappy) reduces the size of the dataset significantly and has low CPU usage during extraction.
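As an illustration, a Snappy-compressed Parquet file in the unified format can be loaded with PyArrow and summarized with pandas; the file path and column names below are assumptions for this sketch and need not match a specific WTA trace.

    import pyarrow.parquet as pq

    # Load the task table of a WTA trace (hypothetical path).
    tasks = pq.read_table("tasks/part-0.snappy.parquet").to_pandas()

    # Basic per-workflow statistics, assuming 'workflow_id' and 'runtime' columns exist.
    per_wf = tasks.groupby("workflow_id")["runtime"]
    print(per_wf.count().describe())   # distribution of workflow sizes (tasks per workflow)
    print(per_wf.sum().describe())     # distribution of total runtime per workflow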

Besides trace-selection support, and to address R-4, the WTA offers several tools to facilitate and incentivize the continuous growth of the archive. Most of these tools required significant engineering effort to develop, due to the typical challenges of big data processing (high volume, noisy data, diverse input formats, etc.). The WTA simplifies the upload of new traces by providing a set of parsing scripts for different trace sources, such as Google, Pegasus, and Alibaba. Parsing traces can become non-trivial once they grow in both complexity and size. Such traces require big data tools, such as Apache Spark, and sufficient resources, such as a cluster, to process. Noisy data raise another non-trivial issue: both Google's and Alibaba's cluster data contained anomalous fields, undocumented attributes, or non-DAG workflows. Some of these issues were never discovered by their respective communities and were corrected in our parsing tools. Debugging, filtering, and correcting noisy big data requires significant compute power and detailed engineering.

Because traces may contain sensitive information, the WTA offers a trace anonymization tool, which helps users automatically replace privacy- and security-related information, to avoid an accidental reveal of proprietary information. Specifically, to remove sensitive information from trace files, we use two common techniques [39]: culling and transforming. Culling is done during trace conversion, by omitting parts of the raw trace data which do not match our workflow trace format. For the transformation, as presented in Table 2, our anonymization tool automatically scans the workflow trace file for sensitive data, such as IP addresses, file paths, and names, by string pattern matching. Besides these standard sensitive-data checks, the WTA offers the option to search for custom privacy-critical strings.

Finally, all matched strings are replaced by a salted SHA-256 hash key. This approach, using cryptographic hash functions, offers protection of sensitive data while preserving the relationships between the matched values in the same trace file [39]. Additionally, our tool hides potential relations to other trace files by adding a salt of length 16 to the hash-key generation, which is randomly generated on each tool run.
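A minimal sketch of this transformation step, restricted for brevity to IPv4 addresses (the actual WTA tool covers the additional patterns listed in Table 2):

    import hashlib
    import re
    import secrets

    SALT = secrets.token_hex(8)  # 16-character salt, regenerated on each tool run
    IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

    def obfuscate(match):
        # Same input + same salt -> same digest, preserving relationships within a trace.
        return hashlib.sha256((SALT + match.group(0)).encode()).hexdigest()

    def anonymize_line(line):
        return IPV4.sub(obfuscate, line)

    print(anonymize_line("task ran on 10.0.0.42 and wrote to 10.0.0.43"))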

To validate traces, the WTA provides a validation script that checks the integrity and summarizes important characteristics of a trace. During trace conversion, using the validation script, we successfully identified several parse bugs and inconsistencies in the data, which we subsequently corrected.

Specifically, because tasks form the base of each trace, our tool checks whether all contained tasks are well defined. This means, for example, that all parsed control dependencies, such as children and parents, link only to existing tasks with valid properties. A task property is valid if the parsed property type matches the property type definition and the property value is allowed, e.g., task runtime > 0. Based on, and similar to, this fundamental validation, our tool provides options to check the workflow and data-transfer properties to identify inconsistencies as well.
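A simplified sketch of such a check, assuming tasks are given as dictionaries with 'id', 'runtime', and 'parents' fields (illustrative only, not the actual validation script):

    def validate_tasks(tasks):
        """Return a list of human-readable problems found in a list of task dicts."""
        ids = {t["id"] for t in tasks}
        problems = []
        for t in tasks:
            if not (isinstance(t["runtime"], (int, float)) and t["runtime"] > 0):
                problems.append(f"task {t['id']}: invalid runtime {t['runtime']!r}")
            for p in t.get("parents", []):
                if p not in ids:
                    problems.append(f"task {t['id']}: unknown parent {p!r}")
        return problems

    print(validate_tasks([{"id": 1, "runtime": 4.2, "parents": []},
                          {"id": 2, "runtime": -1, "parents": [1, 99]}]))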

These tools help combat perceived barriers to sharing data, as described by Sayogo et al. [41]. Several technological barriers are addressed by using a unified format and validation (data architecture, quality, and standardization). Legal and policy barriers are more difficult to address. Our anonymization tool aids in overcoming the data-protection barrier, yet legal and other enforced policies may require tailored solutions.

Besides offering these tools, the WTA also hosts the trace data, addressing logistic and economical barriers. The increasing focus of the community on sharing data artifacts lowers the barrier regarding competition for merit and reputation for quality, and bolsters the culture of open sharing. Finally, each trace has its own DOI, obtained by also uploading it to Zenodo (https://zenodo.org/), which can be cited and thus provides authors with the appropriate credit (incentive barrier).

3.7 Current Content

Having a diverse set of traces available is necessary for use in experimentation. When using traces in experimentation, different traces should be used to prove the generality of the proposed approach (see Section 5). Gathering and parsing raw logs and other traces requires significant computing effort. Using 16 nodes (32 eight-core Xeon E5-2630 v3 CPUs and 1 TB RAM) from the Dutch DAS5 supercomputer [6], several traces require up to a day to compute using big data tools such as Apache Spark. In total, the WTA team spent more than two person-months on converting traces to the unified trace format. By offering these parse scripts and the data, we contribute to R-4.
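For example, converting a large raw log into the unified format can follow the PySpark pattern sketched below; the input path, column names, and schema mapping are hypothetical, and the actual parse scripts in the WTA are source-specific.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("wta-parse-sketch").getOrCreate()

    # Read a (hypothetical) raw task-event log and map it onto unified-format columns.
    raw = spark.read.csv("raw/task_events/*.csv", header=True, inferSchema=True)
    tasks = (raw
             .withColumnRenamed("job_id", "workflow_id")
             .withColumn("runtime", F.col("end_time") - F.col("start_time"))
             .filter(F.col("runtime") > 0))          # drop anomalous records

    # Write Snappy-compressed Parquet, the on-disk representation used by the WTA.
    tasks.write.mode("overwrite").option("compression", "snappy").parquet("out/tasks")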

The WTA currently features 96 workloads from 10 different sources, with over 48 million workflows and 2 billion CPU core hours. Each workload is uniquely identified by a combination of the following properties, if available: source, runtime environment, application, and application parameters [33]. Tables 3 and 4 summarize these traces. From these tables we observe that the WTA contains a vast number of different traces, from different sources and domains, with various numbers of workflows, properties, numbers of tasks, timespans, and core-hour counts. Although supported by our format, no trace currently has information on energy consumption, highlighting the need for such traces [44]. These traces are collected by combining open-access data (logs, traces, etc.) and closed-access data throughout the years, in collaboration with both industry and academia. This contributes to R-1.

Table 3: Overview of the current WTA content, grouped by source. Legend: D = Domain, DS = Datasets, PA = Parameters, PL = Platform, S = Setup, A = Applications, WL = workload, WF = workflow, T = task, U = user, G = group, * = minimum, Eng = Engineering, Sci = Scientific, Ind = Industry, TCH = Total Core Hours. Items in bold are workloads introduced by this work. Items where workflows are for the first time analyzed in this work are in italics. The symbol ‡ next to S7 indicates data with a promise to release, but for which the legal forms have not been completed yet; WTA can already release all other workloads.
Source ID. Name | #WL | D | DS | #PA | #PL | #S | #A | #WF | #T | #U | #G | Year(s) | Timespan | TCH
S1. Askalon Old | 2 | Eng | - | - | 1 | - | mixed | 4,583 | 167,677 | *7 | *6 | 2007 | 19 months | 4,685,300
S2. Askalon New | 67 | Sci | - | *2 | 2 | 67 | *3 | 1,835 | 91,599 | *67 | *67 | 2016 | 47 days | 193
S3. LANL | 2 | Sci | - | - | 1 | - | mixed | 1,988,397 | 475,555,927 | - | - | 2011-2016 | 63 months | *9,625,431
S4. Pegasus | 8 | Sci | - | - | *6 | - | 8 | 56 | 10,573 | 9 | - | 2011 | 4 days | 1,477
S5. Shell | 1 | Ind | - | - | 1 | - | mixed | 3,403 | 10,208 | - | - | 2016 | 10 minutes | 25
S6. SPEC | 2 | Sci | - | - | 1 | - | mixed | 400 | 28,506 | - | - | 2017 | - | 1,231
S7. Two Sigma‡ | 2 | Ind | - | - | 1 | - | mixed | 41,607,237 | 50,518,481 | 610 | 1 | 2016 | 16 months | 69,992,196
S8. WorkflowHub | 10 | Sci | *5 | *4 | 5 | - | 3 | 10 | 14,275 | 10 | - | 2017 | - | 52
S9. Alibaba | 1 | Ind | - | - | 1 | - | mixed | 4,210,365 | 1,356,691,136 | 1 | 1 | 2018 | 8 days | 1,526,925,484
S10. Google | 1 | Ind | - | 1 | 1 | - | mixed | 494,179 | 17,810,002 | 430 | 1 | 2011 | 29 days | 434,821,345
Total | 96 | - | *5 | *7 | *20 | 67 | - | 48,310,465 | 1,900,898,384 | *1,134 | *76 | - | - | 2,046,052,734

Table 4: Overview of properties available per source. Legend: X = available, ∼ = partially available, blank = not available; Task details = individual task information. Columns: Task details, Task resource req., Structural information, Disk, Memory, Network, Energy, NFRs.
S1: X X X
S2: X X X
S3: ∼ X ∼ X
S4: X X X
S5: X X X
S6: X X X X
S7: X X ∼ X
S8: X X X X X
S9: X X X X X X
S10: X ∼ X X X

This diversity enables new workflow management techniques and systems to be thoroughly tested for their feasibility, strengths, and, equally important, weaknesses.

An example of the usability of different traces is in emerging fields. Serverless computing is a new and emerging field, hence no traces are available yet. One of our collaborators used the Shell (IoT) workload, which features small and short-running workflows, much like serverless functions, to experiment with a serverless workflow engine. Additionally, from these traces further insights can be gathered regarding task and job runtimes, sizes, arrival patterns, resource consumption, etc.

We encourage the community to continuously contribute traces to the archive, to improve the coverage of domains, maintain its representativeness of the jobs being executed in the real world by both academia and industry, offer more artifacts to prove robust performance of, e.g., policies, and offer more diversity in trace statistics. By making available methods for contribution, we fulfill R-5.

Figure 7: Structural workflow patterns, per domain.

4 A Characterization of Workloads of Workflows

To answer RQ-3, we perform in this section a characterization of the workloads in the WTA. We characterize workloads using a variety of metrics and properties, including workflow size, resource usage, and structural patterns. Our characterization reveals significant differences between workloads from different domains and sources. Such differences further support our claim that the community needs to look beyond just scientific workflows, and consider a wider range of domains and sources for experimental studies.

4.1 Structural Patterns

O-9: Scientific, industrial, and engineering workflows exhibit various structural patterns, but at least 60% of tasks in a domain match the dominant pattern of that domain.

O-10: Industry workflows stand out by exhibiting primarily scatter patterns, as opposed to pipeline operations.

This characterization quantifies five structural patterns in workflows often used by researchers [8]: scatter (data distribution), shuffle (data redistribution), gather (data aggregation), pipeline, and standalone (process). Investigating these structural patterns is important to understand the types of applications being executed and to tune a system's performance. We exclude from this analysis the LANL, Two Sigma, and Google traces, which lack structural information, that is, task parent-child relationship information.
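One plausible, degree-based operationalization of these patterns is sketched below; it is a simplification for illustration only, and the rules used for Figure 7 (following Bharathi et al. [8]) may differ.

    def classify_task(n_parents, n_children):
        """Heuristic, degree-based labeling of a task's local structural pattern.

        Simplified illustration; not necessarily the exact rules used in this work.
        """
        if n_parents == 0 and n_children == 0:
            return "standalone"
        if n_parents > 1 and n_children > 1:
            return "shuffle"
        if n_children > 1:
            return "scatter"
        if n_parents > 1:
            return "gather"
        return "pipeline"

    print(classify_task(0, 8))   # scatter: one task fanning out to many children
    print(classify_task(5, 1))   # gather: aggregation of many parents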


Figure 8: Daily task-arrival trend, per source.

Figure 7 depicts the structural patterns found per domain. From this figure, we observe that in each domain a dominant pattern emerges that accounts for 61–85% of tasks. In the scientific and engineering domains, the majority of tasks are simple pipelines. Interestingly, the industrial workflows include primarily scatter operations. This observation matches known properties of the Alibaba trace, which accounts for over 99% of the tasks with structural information we analyzed in this domain. In particular, the Alibaba trace includes MapReduce jobs, each consisting of many "map" tasks (scatter operations) and a smaller number of "reduce" tasks (gather operations).

4.2 Arrival Patterns

O-11: From all domains, industrial traces show on average orders of magnitude higher rates of task arrival.

O-12: Scientific traces can show high variability in task arrival rates, unlike industrial and engineering traces.

O-13: Two Sigma shows a typical workday diurnal pattern.

To investigate the weekly trends that may appear in workload traces, we depict in Figure 8, for several traces, the average number of tasks that arrive per day of the week. We omit the Askalon New source from the hourly task-arrival plot, as its traces contain 4 or 5 data points, which is too few to plot a trend. We observe that traces have significantly different arrival rates and patterns. The Alibaba trace features the highest task arrival rates, peaking at over 10,000,000 tasks per hour. Google and the Two Sigma workloads follow with 10-10,000 tasks per hour. This shows that the industrial workloads included in this work have significantly more tasks per hour than the other compute environments, which agrees with companies such as Alibaba and Google operating at a global scale. Also interesting to notice is that, besides Askalon Old 1, the non-industrial traces show significant fluctuations throughout the week, whereas both Alibaba and Google do not. This might be due to the global, around-the-clock operation of Alibaba's and Google's services, which can lead to a more stable task arrival rate.

Table 5: The design and setup of our characterization.
ID | § | Description | Traces | Metric | Granularity
E1 | 4.1 | Analyze structural patterns in workflows per domain | All but S3, 7, 10 | Structural patterns | Workflow level
E2 | 4.2 | Longitudinal analysis | S1, S3, S7, S9, S10 | Tasks per day | Workload level
E3 | 4.3 | Analysis of burstiness per trace | All but S4-8 | Hurst exponent | Workload level
E4 | 4.4 | Measure the level of parallelism per workflow | All but S3, 7, 10 | Level of parallelism | Workflow level
E5 | 4.5 | Analysis of critical path length | All but S3, 7, 10 | Critical path length | Workflow level
E6 | 4.6 | Analysis of critical path runtime | All but S3, 7, 10 | Critical path runtime | Workflow level
E7 | 4.7 | Task interarrival time analysis | All | Interarrival time | Workload level

Figure 9: Hourly task-arrival trend, per source.
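A minimal sketch of how such arrival trends can be computed from task submit times, assuming Unix-epoch timestamps in seconds and using pandas (the aggregation choices mirror Figures 8 and 9 only approximately):

    import pandas as pd

    def arrival_trends(submit_ts):
        """submit_ts: iterable of task submit times (Unix seconds)."""
        t = pd.to_datetime(pd.Series(list(submit_ts)), unit="s")
        counts = t.dt.floor("h").value_counts().sort_index()        # tasks per wall-clock hour
        by_weekday = counts.groupby(counts.index.dayofweek).mean()  # avg tasks/hour per weekday
        by_hour = counts.groupby(counts.index.hour).mean()          # avg tasks/hour per hour of day
        return by_weekday, by_hour

    # Example usage: by_weekday, by_hour = arrival_trends(tasks["submit_time"])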

To observe differences in daily trends, we depictthe average task rate per hour of day in Figure 9.This figure reaffirms our observation that the twolargest traces–Alibaba and Google–and the AskalonOld 1 trace have a relatively stable arrival patternthroughout the day. In contrast, the Two Sigma 1

Figure 10: Hurst exponent estimations for different time windows, per trace. The horizontal axis does not start at zero.

In contrast, the Two Sigma 1 trace exhibits a typical office-hours pattern; task arrival rates increase around hour 7 and start dropping around hour 17. The same pattern occurs to a lesser extent in the Two Sigma 2 trace. The highly variable arrival rates of tasks in the LANL traces, as observed in Figure 8, are also evident in our analysis of daily trends. We study this in more depth in Section 4.3.

4.3 Burstiness

O-14: Most traces investigated exhibit bursty behavior within small window sizes.

O-15: The LANL trace exhibits maximum burstiness at medium window sizes.

O-16: The largest traces (Alibaba and Google) exhibit uniquely bursty behavior: low burstiness at small window sizes and high burstiness at large window sizes.

To investigate whether workloads exhibit bursty behavior, a special kind of arrival pattern, we use the Hurst exponent H. H quantifies the effect that previous values have on the present value of a time series.

14

Page 15: The Work ow Trace Archive: Open-Access Data from Public ... · The Work ow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures { Technical Report Laurens

Figure 11: Workflow level-of-parallelism, per domain.

A value of H < 0.5 indicates a tendency of the series to move in the opposite direction of its previous values, and thus to exhibit jittery behavior (sporadic bursts). A value of H > 0.5 indicates a tendency to move in the same direction, and thus towards well-defined peaks (sustained bursts). When H = 0.5, the series behaves like random Brownian motion.

In this experiment, we inspect bursty behavior by computing the Hurst exponent for task arrivals. The results are visible in Figure 10. From this figure, we observe that most traces exhibit bursty behavior for at least one of the small, medium, and large window sizes; they are also not bursty for at least one window size. This is expected, as in most systems task arrivals vary at (sub-)second intervals. Interestingly, the LANL traces exhibit the most bursty behavior at medium window sizes. This might be due to national-laboratory workflows being submitted in waves: a wave is submitted all at once, leading to a burst, but the wave itself is processed smoothly. The workload is also stable over longer time periods, as evidenced by H ≈ 0.5 for larger windows. Finally, the two largest traces in this work, Alibaba and Google, exhibit increasingly bursty behavior for larger windows. This indicates that, over larger time windows, these workloads vary more (in absolute numbers) than the other sources. This matches the observations in Section 4.2.
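The appendix lists the hurst Python package among the analysis dependencies; for illustration, the self-contained sketch below estimates H with a basic rescaled-range (R/S) procedure on binned task-arrival counts. It is a rough estimator under simplifying assumptions, not necessarily the estimator used to produce Figure 10.

import numpy as np

def arrival_counts(ts_submit_ms, window_ms):
    # Bin task submit times (in ms) into arrival counts per window of window_ms.
    ts = np.asarray(ts_submit_ms, dtype=float)
    edges = np.arange(ts.min(), ts.max() + window_ms, window_ms)
    counts, _ = np.histogram(ts, bins=edges)
    return counts

def rs_hurst(series):
    # Rough rescaled-range (R/S) estimate of the Hurst exponent; assumes a
    # reasonably long series (a few hundred points or more).
    x = np.asarray(series, dtype=float)
    n = len(x)
    log_s, log_rs = [], []
    for s in (8, 16, 32, 64, 128, 256):
        if s > n // 2:
            break
        rs_vals = []
        for start in range(0, n - s + 1, s):
            chunk = x[start:start + s]
            dev = np.cumsum(chunk - chunk.mean())
            sd = chunk.std()
            if sd > 0:
                rs_vals.append((dev.max() - dev.min()) / sd)
        if rs_vals:
            log_s.append(np.log(s))
            log_rs.append(np.log(np.mean(rs_vals)))
    # The slope of log(R/S) versus log(window size) approximates H.
    return np.polyfit(log_s, log_rs, 1)[0]

np.random.seed(0)
arrivals = np.cumsum(np.random.exponential(scale=50, size=5000))  # synthetic submit times, in ms
print(rs_hurst(arrival_counts(arrivals, window_ms=1000)))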

4.4 Parallelism in Workflows

O-17: Task parallelism per workflow can differ significantly between workload domains and sources.

O-18: Industrial workflows exhibit the highest level of parallelism.

Figure 12: Workflow level-of-parallelism, per source. Curves are shaded by domain, to further reveal patterns.

O-19: Out of all sources, Alibaba workflows have the highest level of parallelism, followed by Pegasus and WorkflowHub.

Having observed the structural patterns, we investigate whether the frequent occurrence of the pass-through patterns translates into a high level of parallelism. The level of parallelism indicates how many tasks can maximally run in parallel for a given workflow, provided sufficient resources. Figure 11 depicts the approximated level of parallelism per domain. The approximation algorithm used produces results very close to the true level of parallelism, as demonstrated by Ilyushkin et al. [21]. From this figure, we observe that the industrial domain exhibits the highest 99th-percentile parallelism, up to hundreds of thousands of tasks. This is likely a consequence of the many MapReduce workflows, which are highly parallel by nature, present in the Alibaba trace. Alibaba also contains bag-of-tasks workflows, which by nature have high parallelism. Scientific workflows exhibit low median parallelism but high 99th-percentile parallelism, featuring levels of parallelism up to thousands of tasks. Engineering traces exhibit a moderate amount of median parallelism, between industry and scientific, with at most 1,000 concurrently running tasks.
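The level of parallelism reported above is obtained with the approximation algorithm of Ilyushkin et al. [21]. As a simpler, illustrative alternative, the sketch below schedules every task at its earliest possible start time assuming unlimited resources and counts the maximum number of concurrently running tasks; it is not the algorithm used in this work, and its input names are hypothetical.

def max_parallelism(runtime, parents):
    # runtime: dict task -> runtime; parents: dict task -> list of parent tasks.
    finish = {}

    def finish_time(t):  # earliest finish time, assuming unlimited resources
        if t not in finish:
            start = max((finish_time(p) for p in parents.get(t, [])), default=0.0)
            finish[t] = start + runtime[t]
        return finish[t]

    events = []
    for t in runtime:
        end = finish_time(t)
        events.append((end - runtime[t], 1))   # task starts running
        events.append((end, -1))               # task stops running
    running = peak = 0
    for _, delta in sorted(events):            # ends sort before starts at equal times
        running += delta
        peak = max(peak, running)
    return peak

# Fork-join workflow with three parallel branches.
rt = {"a": 1, "b": 2, "c": 2, "d": 2, "e": 1}
par = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
print(max_parallelism(rt, par))  # 3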

Figure 12 shows the level of parallelism per source. From this figure, we observe that Alibaba exhibits the highest levels of parallelism, as discussed previously. Second are the Pegasus and WorkflowHub workflows. These sources contain a variety of scientific applications, commonly known for their parallel structures, as observed in Section 4.1.


Figure 13: Workflow critical paths (CPs), per domain.

The other traces demonstrate less parallelism, with up to 100 concurrently running tasks. As the Shell trace consists entirely of sequential pipelines, this source does not exhibit any variation.

4.5 Limits to Parallelism in Workflows

O-20: Workflows from the scientific domain have significantly different critical-path lengths.

O-21: The number of tasks on the critical path is the highest for engineering workflows.

O-22: Although highly parallel, industrial workflows exhibit longer critical paths than scientific workflows.

The critical path (CP) refers to the longest sequence of dependent tasks in a workflow, from any entry task to any exit task. By quantifying the CP length, we investigate whether workflow runtimes are primarily dominated by a few heavy tasks or by many small tasks. Figure 13 presents the results of this characterization per workload domain. From this figure we observe that the CP length is the highest for engineering workflows. This matches the parallelism observations in Sections 4.1 and 4.4. Interestingly, even though industrial workflows are often highly parallel, their critical paths are often longer than those of scientific workflows. This indicates that industrial workflows are bigger than scientific workflows, which our data supports.
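Both CP metrics used in this and the next section can be computed with a dynamic program over a topological order of the workflow DAG. The sketch below computes the maximum number of tasks on any path and the maximum total runtime on any path independently; depending on the exact definition, the CP length may instead be taken as the hop count of the runtime-critical path. Input names are hypothetical.

from graphlib import TopologicalSorter  # Python 3.9+

def critical_path(runtime, parents):
    # runtime: dict task -> runtime; parents: dict task -> list of parent tasks.
    order = TopologicalSorter({t: parents.get(t, []) for t in runtime}).static_order()
    best_len, best_time = {}, {}
    for t in order:
        preds = parents.get(t, [])
        best_len[t] = 1 + max((best_len[p] for p in preds), default=0)
        best_time[t] = runtime[t] + max((best_time[p] for p in preds), default=0.0)
    # Longest path by task count, and longest path by total runtime.
    return max(best_len.values()), max(best_time.values())

# Diamond-shaped workflow: a -> {b, c} -> d.
rt = {"a": 5, "b": 1, "c": 10, "d": 2}
par = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(critical_path(rt, par))  # (3, 17.0)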

Figure 14 presents the results of the CP characterization per workload source. From this figure we observe that the CP length differs significantly per trace. Based on the prior findings, the engineering traces are expected to show longer critical paths.

Figure 14: Workflow critical paths (CPs), per source. Curves are shaded by domain, to further reveal patterns.

Figure 15: Workflow critical path runtime, per domain.

Indeed, the Askalon Old traces contain the workflows with the longest critical paths. Alibaba workflows also exhibit long critical paths, indicating that, besides being highly parallel, these workflows also contain many tasks organized in stages. The other traces are more concentrated and exhibit lower critical-path lengths, yet they remain clearly distinct. As the Shell trace contains solely sequential workflows, its critical path length is one.

4.6 Critical Path Runtime

O-23: Engineering workflows have the highest critical path runtime.

O-24: Although highly parallel, industrial workflows exhibit longer runtimes than scientific workflows.

The critical path runtime is the sum of the runtimes of all tasks on the critical path. It is the minimum amount of time required to run a workflow on a given infrastructure. Figure 15 presents the results of this characterization per workload domain.


Figure 16: Workflow critical path runtime, per source.

We observe that the results are very similar to those obtained by characterizing the critical path length, per domain, in Section 4.5. Engineering workflows have the highest critical path runtime, followed by industrial and scientific workflows.

Figure 16 presents the results of the critical path runtime characterization per workload source. We observe that the results are very similar to those obtained by characterizing the critical path length, per source, in Section 4.5. Askalon Old has the highest runtime, followed by Alibaba. The runtime distributions of individual sources are clearly distinguishable.

4.7 Task Interarrival Times

O-25: Almost all tasks in the industrial domain arrive within milliseconds of one another.

O-26: The industrial domain also exhibits the longest task interarrival times.

O-27: The engineering domain exhibits the highest spread of task interarrival times.

O-28: Per source, the Askalon (old and new) and LANL traces exhibit mid-range task interarrival times.

O-29: Two Sigma features the longest task interarrival times, possibly indicating downtime.

In this experiment, we inspect the task interarrival times. Traces not containing task-level information, such as one of the LANL traces, are omitted.

Figure 17: CDF of task interarrival times, per domain.

By quantifying the task interarrival times, we gain insight into the speed at which schedulers in different environments must make decisions. Low task interarrival times translate to systems with a continuous stream of incoming tasks, possibly at a high rate. In such systems, schedulers must often make sub-second decisions to avoid delaying tasks, which could have important implications for QoS metrics such as job makespan and task turnaround time. High task interarrival times translate to systems with longer periods between the arrival of two consecutive tasks. However, this does not necessarily mean that the number of arriving tasks is low. In situations where bags-of-tasks are submitted simultaneously, the interarrival times between the tasks of a bag will be zero, yet the time between the arrival of two bags-of-tasks may be long.
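Computing interarrival times from a trace is straightforward: sort the task submit times and take consecutive differences. A minimal sketch, assuming submit times in milliseconds:

import numpy as np

def interarrival_times(ts_submit_ms):
    t = np.sort(np.asarray(ts_submit_ms, dtype=np.int64))
    return np.diff(t)  # one value per consecutive pair of arrivals

# Three tasks of a bag submitted together, then a fourth task much later.
iat = interarrival_times([1_000, 1_000, 1_000, 61_000])
print(iat)                           # [    0     0 60000]
print(np.percentile(iat, [50, 95]))  # summary points of the CDF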

Figure 17 depicts the CDF of task interarrival times per domain. From this figure we observe that almost all tasks in the industrial domain have an interarrival time of less than 10 milliseconds. This means that industrial task schedulers must make decisions at the millisecond level. Interestingly, the industrial domain also exhibits the highest task interarrival times.

In the scientific domain, roughly 30% of all tasks arrive within 10 milliseconds of one another, and roughly 93% of all tasks have an interarrival time of less than 100 milliseconds.

For the engineering domain, roughly 21% of all tasks have an interarrival time of less than 10 milliseconds.


Figure 18: CDF of task interarrival times, per source.

Around 80% of all tasks have an interarrival time of less than 100 milliseconds, and 95% of less than one second.

Overall, this demonstrates that all domains require scheduling operations to happen at the sub-second level, with industry having the highest need for well-performing schedulers.

To observe differences per use case, Figure 18 depicts the task interarrival times per source. From this figure we observe that the Askalon (old and new) traces and LANL 1 exhibit mid-range interarrival times, as the majority of their tasks have interarrival times between 10 milliseconds and 1 second. For all other sources, the majority of tasks have interarrival times lower than 10 milliseconds, stressing the need for high-performance schedulers. While most tasks arrive quickly after one another, there are significant outliers in the Two Sigma traces. This could indicate downtime of production systems, as the arrival pattern of Two Sigma is diurnal, yet stable (see Section 4.2).

5 Addressing Challenges of Validity

In this section, we discuss challenges to the validity of this work. We address the challenges through either trace-based simulation (the first) or argumentation (the others).

Challenge C-1. Trace diversity does not impact the performance of workflow schedulers.

Table 6: The performance in simulation of two schedulers for traces from different sources. Lower values are better.

Metric   | Policy | Askalon Old | Askalon New | Pegasus     | Shell  | SPEC
Avg. ReT | FCFS   | 2.02·10^5 s | 167 s       | 2.43·10^4 s | 9.76 s | 491 s
         | SJF    | 1.74·10^5 s | 113 s       | 2.12·10^4 s | 9.52 s | 248 s
Avg. BSD | FCFS   | 1.53·10^4   | 65.1        | 1.31·10^3   | 1.13   | 47.4
         | SJF    | 0.14·10^4   | 11.6        | 0.10·10^3   | 1.06   | 2.2
Avg. NSL | FCFS   | 1.05·10^5   | 2.50        | 2.35·10^3   | 1.12   | 13.9
         | SJF    | 0.01·10^5   | 3.14        | 0.06·10^3   | 1.07   | 1.78

As outlined in Sections 3.7 and 4, the WTA traces are diverse. However, what is the impact of trace diversity?

To demonstrate the impact of trace diversity on scheduler performance, we conduct a trace-based simulation study. We simulate workloads from five sources using two scheduler configurations. We equip the simulated scheduler with either the first-come first-serve (FCFS) or the shortest-job-first (SJF) queue-sorting policy. For both scheduler configurations, we further use a best-fit task-placement policy. We do not use a fixed resource environment, to prevent bias when sampling or scaling traces [18]. Instead, we tailor the amount of available resources for each trace to reach roughly 70% resource utilization on average, based on the amount of CPU (core) seconds in the trace and its length. Although ambitious, 70% resource utilization is achievable in parallel HPC environments [28] and can be seen as a target for cloud environments. To evaluate the performance of each scheduler, we use three metrics commonly used to assess scheduler performance [34, 16]: task response time (ReT), bounded task slowdown (BSD, using a lower bound of 1 second), and normalized workflow schedule length (NSL, the ratio between a workflow's response time and its critical path).
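For clarity, these metrics and the resource-sizing rule can be written out explicitly. The formulas below follow the common definitions from the cited scheduling literature and our reading of the setup above; the exact simulator implementation may differ in detail. For a task submitted at time t_s, started at time t_b, and finished at time t_f, with wait time w = t_b - t_s and runtime r = t_f - t_b:

  ReT = t_f - t_s = w + r
  BSD = max( (w + r) / max(r, 1 s), 1 )
  NSL = ReT(workflow) / (critical-path runtime of the workflow)

and the number of simulated CPU cores per trace is chosen such that

  cores ≈ (total CPU core-seconds in the trace) / (0.7 × trace duration).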

We report the performance of each simulated scheduler in Table 6, per source. From this table we observe significant differences between schedulers and trace sources. In particular, we find that the relative performance of schedulers differs between trace sources. For example, SJF outperforms FCFS on the normalized schedule length metric by up to two orders of magnitude on traces from Askalon Old and Pegasus. In contrast, on traces from Askalon New


and Shell, the scheduling policies perform similarly. For the other metrics, these differences are present, but less pronounced. SJF performs better than FCFS on response time and slowdown for each trace source, but the differences in performance between the schedulers vary greatly across traces.

Overall, we kept the working environment fixed per trace, yet obtained significantly different results depending on the scheduler and input trace. Thus, our trace-based simulations give practical evidence that researchers need to experiment with different traces to claim generality and feasibility of their proposed approaches.

C-2. Limited venue selection in the survey. Besides omitting venues that yielded no results for our initial query, we made sure that journals, workshops, and conferences were covered at various levels in terms of quality. We believe this covers the systems community to a degree from which conclusions can be drawn. We specifically focus on articles published in the systems communities, as specialized communities, e.g., bioinformatics, focus on systems that solve domain-specific problems but rarely conduct in-depth experiments, including trace-based ones, to test system-level capabilities and behavior.

C-3. Level of data anonymization. The Google team published interesting work and data [39], but their anonymization approach, which normalizes the values of both resource consumption and available resources, significantly reduces the usability of the traces and the characterization details they provide. We argue this type of anonymization is not preferred. When the available resources per machine, e.g., available disk space, memory, etc., and the resource consumption numbers are normalized, reusing traces for different environments becomes difficult. Researchers then need to make assumptions about the hardware on which the workflows were executed, as done in the work of Amvrosiadis et al. [5], or need to assume a homogeneous environment.

Instead, obfuscation techniques, such as multiplying both consumption and resources by a certain factor, allow for relative comparisons and the possibility to replay the scheduling of the workload on the resources, while still concealing the original data.
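As a minimal illustration of this kind of multiplicative obfuscation (the factor and field names below are hypothetical), scaling capacities and consumption by the same secret factor leaves relative quantities such as utilization unchanged:

SECRET_FACTOR = 3.7  # kept private by the trace provider

def obfuscate(record):
    return {k: v * SECRET_FACTOR for k, v in record.items()}

machine = {"cpu_cores": 32, "memory_gb": 256}
task = {"cpu_cores": 8, "memory_gb": 64}
m, t = obfuscate(machine), obfuscate(task)
# The utilization ratio is preserved by the obfuscation:
print(task["cpu_cores"] / machine["cpu_cores"],  # 0.25
      t["cpu_cores"] / m["cpu_cores"])           # 0.25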

C-4. The Workflow Trace Format. A fourth challenge is the set of properties included in the workload trace format. For each property encountered in other formats, we carefully decided whether or not to include it. Low-level details such as page caches are omitted so as not to unnecessarily complicate the traces. If future work demands change, the versioning schema per object will allow for these additions.

6 Related Work

We survey in this section the relevant body of work focusing on trace archives and on characterizing workloads. Differently from other archives, the WTA focuses on workloads of workflows, preserving workflow-level arrival patterns and task inter-dependencies not found in other archives. Differently from other characterization work, ours is the first to reveal and compare workflow characteristics across different domains and fields of application.

Open-access trace archives: Closest to this work is WorkflowHub [44], which archives traces of workflows executed with the Pegasus workflow engine and offers them in a unifying format containing structural information. WorkflowHub also provides a tool to convert Pegasus execution logs to traces, similar to our parsing tools. Different from this work, WorkflowHub's traces each include a single workflow, and thus not a workload with a job-arrival pattern. WorkflowHub also does not provide statistical insights per trace; thus, it does not meet requirements R-1 and R-3, and only partially meets R-4.

Also relatively close to this work, the ATLAS repository maintained by Carnegie Mellon University [4] contains two traces (the S3 traces in this work), with two other traces announced but not yet released (as announced, the S7 traces in this work). None of their published traces contains task-interdependency data, so, although overlapping with our S3 and S7, the ATLAS work is different in scope and in particular does not address workflows. Further, they do not consider different domains or fields, and their archive lacks a unified format, statistical insights, selection mechanisms, and tooling; thus, they do not meet our requirements R-1 to R-4.

Other trace archives with similarities to this work include the MyExperiment archive (ME) [20], the

19

Page 20: The Work ow Trace Archive: Open-Access Data from Public ... · The Work ow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures { Technical Report Laurens

Parallel Workloads Archive (PWA) [14], and the Grid Workloads Archive (GWA) [24]. ME stores workflow executables, and semantic and provenance data, but does not provide execution traces as the WTA does, and thus has a different scope. The PWA includes traces collected from parallel production environments, which are largely dominated by tightly-coupled parallel jobs and, more recently, by bag-of-tasks applications. The GWA includes traces collected from grid environments; differently from this work, these traces are dominated by bag-of-tasks applications and by virtual-machine lease-release data.

Workload characterization, definition, and modeling: There is much related and relevant work in this area, of which we compare only with the closely related; other characterization work does not focus on comparing traces by domain and does not cover a set of characteristics as diverse as this work, leading to the many findings reported here. Closest to this work, the Google cluster traces have been analyzed from various points of view, e.g., [38, 10, 36]. Amvrosiadis et al. [3, 4] compare the Google cluster traces with three other cluster traces, of 0.3–3 times the size and 3–60 times the duration, and find key differences; our work adds new views and quantitative data on diversity, through both survey and characterization techniques. Bharathi et al. [8] provide a characterization of workflow structures and of the effect of workflow input sizes on said structures. Five scientific workflows are used to explain in detail the composition of their data and computational dependencies. Using the characterization, a workflow generator for parameterized workflows is developed. Juve et al. [29] provide a characterization of six scientific workflows using workflow profiling tools that investigate the resource consumption and computational characteristics of tasks. The teams of Feitelson and Iosup have provided many characterization and modeling studies for parallel [17], grid [23], and hosted-business [43] workloads; and Feitelson has written a seminal book on workload modeling [15]. In contrast, this work addresses in depth the topic of workloads of workflows.

7 Conclusion and Ongoing Work

Responding to the stringent need for diverse workflow traces, in this work we propose the Workflow Trace Archive (WTA), an open-access archive containing workflow traces.

We conduct a survey of how the systems community uses workflow traces, by systematically inspecting articles accepted in the last decade in peer-reviewed conferences and journals. We find that, of all articles that use traces, less than 40% use realistic traces, and less than 15% use any open-access trace. Additionally, the community focuses primarily on scientific workloads, possibly due to the scarcity of traces from other domains. These findings suggest existing limits to the relevance and reproducibility of workflow-based studies and designs.

We design and implement the WTA around five key requirements. At the core of the WTA is a unified trace format that, uniquely, supports both workflow- and task-level NFRs. The archive contains a large and diverse set of traces, collected from 10 sources and encompassing over 48 million workflows and 2 billion CPU core hours.

Finally, we provide deep insight into the WTA traces, through a statistical characterization revealing that: (1) there are large differences in workflow structures between scientific, industrial, and engineering workflows, (2) our two biggest traces, from Alibaba and Google, have the most stable arrival patterns in terms of tasks per hour, (3) industrial workflows tend to have the highest level of parallelism, (4) the level of parallelism per domain is clearly divided, (5) engineering workloads tend to have the most tasks on the critical path, (6) the three domains inspected in this work show distinct critical-path curves, and (7) in order to claim generality of an approach, one should test a system with a variety of traces with different properties, possibly from different domains.

In ongoing work, we aim to attract more organizations to contribute real-world traces to the WTA, and to encourage the use of the WTA content and tools in educational and production settings. One of our


goals is to develop a library that system administrators can integrate into their systems to generate traces in our format. Our preliminary experience shows that developing such a library, even for a single system, requires significant engineering effort; it is thus left for future work. We aim to support other formalisms in the future, including directed graphs, BPMN workflows, etc., based on the community's needs. Furthermore, we aim to improve the trace format and the statistics we report for each trace, based on community feedback.

Reproducibility Statement

The WTA datasets are available online on the archive's website, https://wta.atlarge-research.com/. The WTA tools, simulator, and parse scripts are available as free open-source software at https://github.com/atlarge-research/wta-tools, https://github.com/atlarge-research/wta-sim, and https://github.com/atlarge-research/wta-analysis, respectively.

References

[1] Aalst et al. Advanced topics in workflow management: Issues, requirements, and solutions. JIDPS, 7(3), 2003.
[2] Ammar et al. Construction of the literature graph in semantic scholar. In NAACL, 2018.
[3] Amvrosiadis et al. Bigger, longer, fewer: what do cluster jobs look like outside google? Technical report, 2017.
[4] Amvrosiadis et al. On the diversity of cluster workloads and its impact on research results. 2018.
[5] Amvrosiadis et al. On the diversity of cluster workloads and its impact on research results. In ATC, 2018.
[6] Bal et al. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5), 2016.
[7] Banati et al. Four level provenance support to achieve portable reproducibility of scientific workflows. In MIPRO, 2015.
[8] Bharathi et al. Characterization of scientific workflows. 2008.
[9] Cardoso et al. Workflow quality of service. In International Conference on Enterprise Integration and Modeling Technology, 2002.
[10] Chen et al. Analysis and lessons from a publicly available google cluster trace. Tech. Rep. UCB/EECS-2010-95, 94, 2010.
[11] Cohen-Boulakia et al. Search, adapt, and reuse: the future of scientific workflows. ACM SIGMOD, 40(2), 2011.
[12] Deelman et al. The future of scientific workflows. IJHPCA, 32(1), 2018.
[13] Dudley et al. In silico research in the era of cloud computing. Nature biotechnology, 28(11), 2010.
[14] Feitelson. Parallel workload archive. http://www.cs.huji.ac.il/labs/parallel/workload, 2007.
[15] Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2015.
[16] Feitelson et al. Theory and practice in parallel job scheduling. In JSSPP, 1997.
[17] Feitelson et al. Experience with using the Parallel Workloads Archive. JPDC, 74(10), 2014.
[18] Frachtenberg and Feitelson. Pitfalls in parallel job scheduling evaluation. In JSSPP, 2005.
[19] Gil et al. Examining the challenges of scientific workflows. Computer, 40(12), 2007.
[20] Goble et al. myExperiment: social networking for workflow-using e-scientists. In WORKS, 2007.
[21] Ilyushkin et al. Scheduling workloads of workflows with unknown task runtimes. In CCGrid, 2015.
[22] Ilyushkin et al. An experimental performance evaluation of autoscaling policies for complex workflows. In ICPE, 2017.
[23] Iosup and Epema. Grid computing workloads. IC, 15(2), 2011.
[24] Iosup et al. The grid workloads archive. FGCS, 24(7), 2008.
[25] Iosup et al. Massivizing computer systems: A vision to understand, design, and engineer computer ecosystems through and beyond modern distributed systems. In ICDCS, 2018.
[26] Iosup et al. The AtLarge Vision on the Design of Distributed Systems and Ecosystems. In ICDCS, 2019. (in print). [Online] Available as arXiv e-print, https://arxiv.org/pdf/1906.07471.pdf.
[27] Isom et al. Is Your Company Ready for Cloud. IBM Press, 2012.
[28] Jones and Nitzberg. Scheduling for parallel supercomputing: A historical perspective of achievable utilization. In JSSPP, 1999.
[29] Juve et al. Characterizing and profiling scientific workflows. FGCS, 29(3), 2013.
[30] Kelly. 86 percent of companies use multiple cloud services, 2012.
[31] Kitchenham et al. Systematic literature reviews in software engineering–a systematic literature review. Infor. and Sw. Tech., 51(1), 2009.
[32] Klusacek et al. On interactions among scheduling policies: Finding efficient queue setup using high-resolution simulations. In Euro-Par, 2014.
[33] Krol et al. Workflow performance profiles: Development and analysis. In Euro-Par, 2016.
[34] Kwok and Ahmad. Benchmarking and comparison of the task graph scheduling algorithms. JPDC, 59(3), 1999.
[35] Ley. The dblp computer science bibliography: Evolution, research issues, perspectives. In SPIRE, 2002.
[36] Mishra et al. Towards characterizing cloud backend workloads: insights from google compute clusters. SIGMETRICS, 2010.
[37] Ramakrishnan et al. A multi-dimensional classification model for scientific workflow characteristics. In Workflow Approaches to New Data-centric Science, 2010.
[38] Reiss et al. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In SoCC, 2012.
[39] Reiss et al. Obfuscatory obscanturism: making workload traces of commercially-sensitive systems safe to release. In CLOUD, 2012.
[40] ACM. Artifact review and badging, 2017.
[41] Sayogo and Pardo. Exploring the determinants of scientific data sharing: Understanding the motivation to publish research data. Government Information Quarterly, 30, 2013.
[42] Schwiegelshohn. How to design a job scheduling algorithm. In JSSPP, 2014.
[43] Shen et al. Statistical characterization of business-critical workloads hosted in cloud datacenters. In CCGrid, 2015.
[44] Silva et al. Community resources for enabling research in distributed scientific workflows. In e-Science, volume 1, 2014.
[45] Simmhan et al. A framework for collecting provenance in data-centric scientific workflows. In ICWS, 2006.
[46] Slominski. Adapting BPEL to scientific workflows. In Workflows for e-Science, Scientific Workflows for Grids. 2007.
[47] Sun et al. Adaptive data placement for staging-based coupled scientific workflows. In SC, 2015.
[48] Talluri et al. Characterization of a big data storage workload in the cloud. In ICPE, 2019.
[49] Versluis et al. An analysis of workflow formalisms for workflows with complex non-functional requirements. In ICPE, 2018.
[50] K. Weins. Cloud computing trends: 2018 state of the cloud survey. RightScale, 2018.
[51] Weiwei et al. Using imbalance metrics to optimize task clustering in scientific workflow executions. FGCS, 46, 2015.
[52] Wu et al. Workflow scheduling in cloud: a survey. TJS, 71(9), 2015.
[53] Xiong et al. vPerfGuard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments. In ICPE, 2013.
[54] Zhou et al. A declarative optimization engine for resource provisioning of scientific workflows in IaaS clouds. In HPDC, 2015.

A Artifact Evaluation and Description

We present in this appendix the artifacts (data and tools) used in this work, where to find them, and how to inspect and rerun the tools on said data to reproduce and verify our findings.

A.1 Traces

All traces are available on the WTA archive website.3 The traces are available without restrictions and can be redistributed.

3 URL removed to comply with SC19 double-blind standards.

A.1.1 Survey Workflow Trace Usage

The data used in the survey and the tools used to visualize the data are available as open-source software at https://github.com/atlarge-research/wta-analysis. The data of the survey can be found in the file Literature_survey-usage_of_WFs-2009-2018_2019-05-23.csv. All visualizations regarding these data were generated using the parse_survey_csv.py Python script, which requires Python 3 and the Python libraries NumPy (1.16.2) and Pandas (v0.24.2) to execute.

A.1.2 Parsing Traces

To parse traces into the parquet file format with the WTA workload format, the archive offers a parse script for each distinctive workload structure at https://github.com/atlarge-research/wta-tools. All parse scripts are available as open-access data and can be inspected for correctness and used as a foundation for other parse scripts. All parse scripts are located in the parse_scripts/parquet_parsers/ folder and the filenames are indicative of which raw data source they parse. For example, the workflowhub_to_parquet.py Python script parses a trace from the WorkflowHub archive. All output can be found in parse_scripts/output_parquet/, with a sub-folder per source. Thus, the workflowhub_to_parquet.py Python script will output to parse_scripts/output_parquet/workflowhub/.

A.1.3 Validating Traces

To validate the traces and their properties, the archive offers a trace validation tool. This tool can be found in the GitHub repository containing the WTA tools. The tool expects the traces in parquet file format, using the unified trace format introduced in this work. To run the tool, run the validate_parquet_files.py Python script with, as argument, the path to the trace directory to be validated. The tool is located in the parse_scripts directory. The tool prints its results to standard


output (stdout) and exits with a non-zero exit code on validation errors.

A.2 Trace Characterization

All scripts used to characterize the contents of the WTA are available online at https://github.com/atlarge-research/wta-analysis. These scripts are offered as IPython notebook files, which include both the code and the visual outputs, i.e., graphs. In order to run all IPython notebook files, Jupyter (4.1) is required, and the Python libraries PySpark (2.4.3), plotnine (0.5.1), NumPy (1.16.2), Pandas (v0.24.2), more-itertools (7.0.0), and hurst (0.0.5) should be installed on the machine running Jupyter.

A.2.1 Simulation Experiment

The simulator used in Section 5 is available online as open-source software at https://github.com/atlarge-research/wta-sim. The datasets used are available in the open-access archive introduced in this work.

The simulator is written in Kotlin 1.3 and is executed on a system running CentOS Linux release 7.4.1708 with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz. The parquet data was read using the Avro parquet reader (1.10.1) from the Hadoop (3.2.0) project.

To rerun the simulation experiment, download the simulator sources from the location provided above, compile the simulator using Maven (mvn package), and run the following command for each trace and simulator policy:

java -cp target/wta-sim-0.1.jar \
  science.atlarge.wta.simulator.WTASim \
  -i "/path/to/traces/${trace_name}_parquet/" \
  -o "simulation_results/${trace_name}/${policy}" \
  --target-utilization 0.7 \
  --task-placement-policy best_fit \
  --task-order-policy ${policy}

Replace the trace_name variable with the name of a trace to analyze (e.g., askalon_ee) and replace the policy variable with fcfs or sjf for the appropriate queue-sorting policy. The data used to create Table 6 can be found in simulation_results/${trace_name}/${policy}/summary.tsv. Additional information about using the simulator is provided in the accompanying README.md file.
