
Designing Twister2: Efficient Programming Environment Toolkit for Big Data

Supun Kamburugamuve
School of Informatics and Computing
Indiana University, Bloomington, IN, USA
[email protected]

Geoffrey Fox
School of Informatics and Computing
Indiana University, Bloomington, IN, USA
[email protected]

ABSTRACT

Data-driven applications are required to adapt to the ever-increasing volume, velocity and veracity of data generated by a variety of sources, including the Web and Internet of Things devices. At the same time, an event-driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics and on-demand applications. MapReduce has been generalized to Map-Collective and shown to be very effective in machine learning. However, one often uses a dataflow computing model, which has been adopted by most major big data processing runtimes. The HPC community has also developed several asynchronous many-task (AMT) systems according to the dataflow model. From a different point of view, the services community is moving to an increasingly event-driven model where (micro)services are composed of small functions driven by events, in the form of Function as a Service (FaaS) and serverless computing. Such designs allow applications to scale quickly as well as be cost effective in cloud environments.

An event-driven runtime designed for data processing consists of well-understood components such as communication, scheduling, and fault tolerance. One can make different design decisions for these components that will determine the type of applications a system can support efficiently. We find that modern systems are designed in a monolithic fashion, with a fixed set of choices that cannot be changed easily afterwards. Because of these design choices, their functionality is limited to specific sets of applications. In this paper we study existing systems (candidate event-driven runtimes), the design choices they have made for each component, and how this affects the types of applications they can support. Further, we propose a loosely coupled component-based approach for designing a big data toolkit where each component can have different implementations to support various applications. We believe such a polymorphic design would allow services and data analytics to be integrated seamlessly and expand from edge to cloud to high performance computing environments.

1. INTRODUCTION

Big data has been characterized by the ever-increasing velocity, volume and veracity of the data generated from various sources, ranging from web users to Internet of Things devices to large scientific equipment. The data have to be processed as individual streams and analyzed collectively, both in streaming and batch settings, for knowledge discovery with both database queries and sophisticated machine learning.

These applications need to run as services in cloud environments as well as on traditional high performance clusters. With the proliferation of cloud-based systems and the Internet of Things, fog computing [12] is adding another dimension to these applications, where part of the processing has to occur near the devices.

Parallel and distributed computing is essential to process big data, owing to the data being naturally distributed and the processing often requiring high performance in the compute, communication and I/O arenas. Over the years, the High Performance Computing (HPC) community has developed frameworks such as the Message Passing Interface (MPI) to execute computationally intensive parallel applications efficiently. HPC applications target high performance hardware, including low latency networks, due to the scale of the applications and the required tight synchronous parallel operations. Big data applications have been developed targeting commodity hardware with the Ethernet connections seen in the cloud. Because of this, they are more suitable for executing asynchronous parallel applications with high computation to communication ratios. Lately we have observed that more capable hardware, comparable to HPC clusters, is being added to modern clouds due to increasing demand for cloud applications in deep learning and machine learning. These trends suggest that HPC and cloud are merging, and we need frameworks that combine the capabilities of both big data and HPC frameworks.

There are many properties of data applications that influence the design of the frameworks developed to process them. There are many application classes, including database queries, data management, and data analytics ranging from complex machine learning to pleasingly parallel event processing. One common issue is that the data can be too big to fit into the memory of even a large cluster. Another important aspect is that it is impractical to always expect a balanced data set, from the processing standpoint, across the nodes. This follows from the fact that initial data in raw form are usually not load balanced and often require too much time and disk space to balance. Also, batch data processing alone is not enough, as much data is streamed and needs to be processed online within reasonable time constraints before being stored to disk. Finally, the data may be varied, with processing time that differs between data points and across iterations of algorithms.

Even though MPI is designed as a generic messaging framework, a developer has to manage file access explicitly (falling back to disk when memory is insufficient) and must build higher-level communication operations mostly from send/receive primitives in order to express communication in a big data application. Adding to this mix is the increasing complexity of hardware, with the explosion of many-core and multi-core processors having different memory hierarchies. It is becoming burdensome to develop efficient applications on these new architectures using the low-level capabilities provided by MPI. Meanwhile, the success of Harp [62] has highlighted the importance of the Map-Collective computing paradigm.

The dataflow [31] computation model has been presented as a way to hide some of the system-level details from the user in developing parallel applications. With dataflow, an application is represented as a graph with nodes doing computations and edges indicating communications between the nodes. A computation at a node is activated when it receives events through its inputs. A well-designed dataflow framework hides low-level details such as communications, concurrency and disk I/O, allowing the developer to focus on the application itself. Every major big data processing system has been developed according to the dataflow model, and the HPC community has also developed asynchronous many-task (AMT) systems according to the same model. AMT systems mostly focus on computationally intensive applications, and there is ongoing research to make them more efficient and productive. We find that big data systems developed according to the dataflow model are inefficient for computationally intensive applications with tightly synchronized parallel operations [38], while AMT systems are not optimized for data processing.

At the core of the dataflow model is an event-driven architecture where tasks act upon incoming events (messages) and produce output events. In general, a task can be viewed as a function activated by an event. The cloud-based services architecture is moving to an increasingly event-driven model for composing services in the form of Function as a Service (FaaS). FaaS is especially appealing to IoT applications where the data is event-based in its natural form. Coupled with microservices and serverless computing, FaaS is driving next-generation services in the cloud and can be extended to the edge.

Because of the underlying event-driven nature of both data analytics and message-driven services architecture, we can find many common aspects among the frameworks designed to process data and services. Such architectures can be decomposed into components such as resource provisioning, communication, task scheduling, task execution, data management, fault tolerance mechanisms and user APIs. High-level design choices are available at each of these layers that will determine the type of applications that the framework composed of these layers can support efficiently. We observe that modern systems are designed with fixed sets of design choices at each layer, rendering them suitable for only a narrow set of applications. Because of the common underlying model, it is possible to build each component separately with clear abstractions supporting different design choices. We propose to design and build a polymorphic system by using these components to produce a system according to the requirements of the applications, which we term the toolkit approach. We believe such an approach will allow the system to be configured to support different types of applications efficiently. The authors are actively pursuing a project called Twister2, encompassing the concept of the toolkit. Serverless FaaS is a good approach to building cloud native applications [3, 27], and in this way Twister2 will be a cloud native framework.

This paper provides the following contributions: 1) a study of different application areas and how a common computation model fits them; 2) the design choices of different systems and how they affect each application area; 3) a vision of a big data toolkit (Twister2) that can execute applications from each area efficiently. Furthermore, the paper compares big data and MPI styles of programs to gain better insight into the system requirements.

2. RELATED WORK

Hadoop [58] was the first major open source platform developed to process large amounts of data in parallel. The map-reduce [19] functional model introduced by Hadoop is well understood and well suited to writing distributed pleasingly parallel and one-pass applications. Coupled with Java, it provides a great tool for average programmers to process data in parallel. Soon enough, though, the shortcomings of Hadoop's simple API and its disk-based communications [21] became apparent, and systems such as Apache Spark [61] and Apache Flink [15] were developed to overcome them. These systems are developed according to the dataflow model, and their execution models and APIs closely follow dataflow semantics. Some other examples of batch processing systems include Microsoft Naiad [48], Apache Apex and Google Dataflow [6]. It is interesting to note that even with all its well-known inefficiencies, Hadoop is still used by many people for data processing. Apart from the batch processing systems mentioned above, there are also streaming systems that can process data in real time, which also adhere to the dataflow model. Some examples of open source streaming systems include Apache Storm [57], Twitter Heron [41], Google MillWheel [5], Apache Samza [53] and Flink [15]. Note that some systems, such as Apache Apex, Google Dataflow, Naiad and Apache Flink, process both streaming and batch data in a unified way. Apache Beam [6] is a project developed to provide a unified API for both batch and streaming pipelines. It acts as a compiler and can translate a program written in its API to a supported batch or streaming runtime. Prior to modern distributed streaming systems, research was done on shared memory streaming systems, including StreamIt [56], Borealis [8], Spade [28] and S4 [50].

There are synergies between HPC and big data systems, and the authors [24, 25], among others [36], have expressed the need to enhance these systems by taking ideas from each other. In previous work [22, 23] we have identified the general implications of threads and processes, cache, and memory management in NUMA [11], as well as multi-core settings, for machine learning algorithms with MPI. DataMPI [42] uses MPI to build a Hadoop-like system, while [7] uses MPI communications in Spark for better performance. A toolkit approach as in Twister2 makes interoperability easier at the usage level, as one can change lower-level components to fit different environments without changing the programmatic or user interface.

There is an ongoing effort in the HPC community to develop AMT systems for realizing the full potential of multi-core and many-core machines, as well as handling irregular parallel applications in a more robust fashion. It is widely accepted that writing efficient programs with the existing capabilities of MPI is difficult due to the bare-minimum capabilities it provides. AMT systems model computations as dataflow graphs and use shared memory and threading to get the best performance out of many-core machines. Some examples of such systems are OCR [47], DADuE [13], Charm++ [37], COMPSs [17] and HPX [54], all of which focus on dynamic scheduling of the computation graph. A portability API for AMT systems is developed in DARMA [35], allowing applications to be written agnostic to the details of specific systems. These systems extract the best available performance from multicore and many-core systems while reducing the burden on the user of writing such programs with MPI. Prior to this, there was much focus in the HPC community on developing systems that could bring automatic parallelism to users, such as Parallel Fortran [14]. Research has been done with MPI to understand the effect of computer noise on collective communication operations [34, 33, 4]. For large computations, computer noise coming from the operating system can play a major role in reducing performance. Asynchronous collective operations can be used to reduce the impact of noise in such situations, but they are not guaranteed to eliminate it completely.

In practice, multiple algorithms and data processing applications are combined in workflows to create complete applications. Systems such as Apache NiFi [1], Kepler [45], and Pegasus [20] were developed for this purpose. The lambda architecture [46] is a dataflow solution for designing such applications in a more tightly coupled way. Amazon Step Functions [2] is bringing the workflow to FaaS and microservices.

3. BIG DATA APPLICATIONS

Here we highlight four types of applications with different processing requirements: 1) Streaming, 2) Data pipelines, 3) Machine learning, and 4) Services. With the explosion of IoT devices and the cloud as a computation platform, fog computing is adding a new dimension to these applications, where part of the processing has to be done near the devices.

Streaming applications work on partial data, while batch applications process data stored on disk as a complete set. By definition, streaming data is unlimited in size and hard (and unnecessary) to process as a complete set due to time requirements. Only temporal data sets observed in data windows can be processed at a given time. In order to handle a continuous stream of data, it is necessary to create summaries of the temporal data windows and use them in subsequent processing of the stream. There can be many ways to define data windows, including time-based windows and data count-based windows. In the most extreme case, a single data tuple can be considered as the processing granularity.
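To make count-based windowing concrete, the following minimal Java sketch (written for this discussion; the CountWindow class and its methods are hypothetical and not drawn from any particular streaming runtime) buffers a fixed number of tuples and emits a summary of each window so that the raw data can be discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Minimal count-based window: buffers 'size' tuples, then emits the window for summarization. */
public class CountWindow<T> {
    private final int size;
    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> onWindow;   // user-defined processing of one temporal window

    public CountWindow(int size, Consumer<List<T>> onWindow) {
        this.size = size;
        this.onWindow = onWindow;
    }

    /** Called for every tuple arriving on the (unbounded) stream. */
    public void accept(T tuple) {
        buffer.add(tuple);
        if (buffer.size() == size) {
            onWindow.accept(new ArrayList<>(buffer)); // process the window (e.g. build a summary)
            buffer.clear();                           // drop the raw tuples, keep only the summary
        }
    }

    public static void main(String[] args) {
        CountWindow<Double> window = new CountWindow<>(4,
                w -> System.out.println("window average = "
                        + w.stream().mapToDouble(Double::doubleValue).average().orElse(0)));
        for (int i = 1; i <= 12; i++) {
            window.accept((double) i);   // simulate a continuous stream of tuples
        }
    }
}
```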

Data pipelines are primarily used for extract, transform and load (ETL) operations, even though they can include steps such as running a complex algorithm. They mostly deal with unstructured data stored in raw form or semi-structured data stored in NoSQL [32] databases. Data pipelines work on arguably the largest data sets of these application types. In most cases it is not possible to load complete data sets into memory at once, and we are required to process the data partition by partition. Because the data is unstructured or semi-structured, the processing has to assume unbalanced data for parallel processing. The processing requirements are coarse-grained and pleasingly parallel. Generally we can consider a data pipeline as an extreme case of a streaming application, where there is no order to the data and the streaming windows contain partitions of data.

Machine learning applications execute complex algebraic operations and can be made to run in parallel using synchronized parallel operations. In most cases the data can be load balanced across the workers because curated data is being used. The algorithms can be regular or irregular and may need dynamic load balancing of the computations and data.

Services are moving towards an event-driven model for scalability, efficiency and cost effectiveness in the cloud. The old monolithic services are being replaced by leaner microservices. These microservices are envisioned to be composed of small functions arranged in a workflow [2] or dataflow to achieve the required functionality.

3.1 Dataflow Applications

Parallel computing and distributed computing are two of the general computing paradigms available for doing computations on large numbers of machines. MPI is the de facto standard in HPC for developing parallel applications. It provides a basic but powerful tool for developing parallel applications. An MPI programmer has to consider low-level details such as I/O, memory hierarchy and efficient execution of threads to write a parallel application that scales to large numbers of nodes. With the increasing availability of multi-core and many-core systems, the burden on the programmer to get the best available performance has increased dramatically [23, 22]. Because of the load imbalance and velocity of big data applications, an MPI program written with tightly synchronized operations across parallel workers may not perform well. An example HPC application is shown in Fig. 1, where a workflow system such as Kepler [45] is used to invoke individual MPI applications. A parallel worker of an MPI program does computations and communications within the same process scope, allowing the program to keep state throughout the execution.

Figure 1: MPI applications arranged in a workflow

Dataflow computing has been around in various forms for a long time. A dataflow program is a computation graph with nodes doing computations and edges passing messages between the nodes. Computation at a node is invoked when its input data dependencies are satisfied. It is widely accepted that dataflow programs are easier to write than MPI-style programs for applications that fit the dataflow model well. In a dataflow program, the user has to program the computations in the nodes and define how the nodes are connected to each other. The dataflow framework handles the details, such as executing the tasks using threads, scheduling, data placement and communications. A carefully designed framework can be tuned to run on different hardware with NUMA boundaries, caches and memory hierarchies.


3.1.1 Dataflow Application APIs

Over the years, numerous languages and different types of APIs have been developed for creating dataflow applications. Task-based programming and data transformation-based programming are two popular approaches for dataflow parallel applications.

Data transformation APIs are used primarily by big data systems. They employ a functional programming approach to create the dataflow graph implicitly. In this approach, distributed data is represented in some abstract form, and functions applied to it return other distributed data. A function takes a user-defined operator as an argument and defines the communication between operators. An example function is a partition function, often called a map in dataflow runtimes. A map function works on partitions of a data set and presents the partitioned data to the user-defined operator. The output of a user-defined operator is connected to another user-defined operator by way of another function.
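As an illustration of the data transformation style, here is a small self-contained Java sketch; it does not use any real big data API, and the DistributedData class and its map function are hypothetical stand-ins for a distributed data set and a partition-wise transformation.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy stand-in for a distributed data set: a list of partitions. */
class DistributedData<T> {
    final List<List<T>> partitions;
    DistributedData(List<List<T>> partitions) { this.partitions = partitions; }

    /** Map: applies a user-defined operator to every partition, producing new distributed data. */
    <R> DistributedData<R> map(Function<List<T>, List<R>> operator) {
        return new DistributedData<>(partitions.stream()
                .map(operator)           // a real runtime would run this in parallel tasks
                .collect(Collectors.toList()));
    }
}

public class TransformationExample {
    public static void main(String[] args) {
        DistributedData<Integer> input =
                new DistributedData<>(List.of(List.of(1, 2, 3), List.of(4, 5, 6)));
        // User-defined operator: square every record in the partition it is given.
        DistributedData<Integer> squared =
                input.map(p -> p.stream().map(x -> x * x).collect(Collectors.toList()));
        System.out.println(squared.partitions);   // [[1, 4, 9], [16, 25, 36]]
    }
}
```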

Figure 2: Left: User graph, Right: Execution graph of a dataflow

Task-based APIs are primarily used by the HPC community with AMT systems. A task-based API usually creates the dataflow tasks at runtime as the program progresses. A normal program is used to create the tasks, and it can use complex control operations such as 'if/else' and 'for' loops to control the dataflow dynamically at runtime.
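The task-based style can be sketched with nothing more than the JDK; the example below is only an analogy under that assumption and does not mirror the API of any specific AMT system. A driver program creates tasks dynamically inside ordinary control flow, and a downstream step consumes their outputs, mimicking a dataflow dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Driver program builds the task graph at runtime using normal control flow. */
public class TaskBasedExample {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Future<Integer>> stage = new ArrayList<>();

        // A 'for' loop creates tasks dynamically; an 'if' decides how many to spawn.
        int parallelism = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        for (int i = 0; i < parallelism; i++) {
            final int id = i;
            stage.add(executor.submit(() -> id * id));   // each task is a small function
        }

        // A downstream step consumes the outputs of the first stage (dataflow dependency).
        int sum = 0;
        for (Future<Integer> f : stage) {
            sum += f.get();
        }
        System.out.println("reduced result = " + sum);
        executor.shutdown();
    }
}
```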

3.1.2 Execution Graph

The graph executed by the dataflow runtime is termed the execution (physical) graph. It is created by the framework when the user graph is deployed on the cluster. For example, some user functions may run on larger numbers of nodes depending on the parallelism specified. Also, when creating the execution graph, the framework can apply optimizations to make some dataflow operations more efficient by reducing data movement and overlapping I/O and computations. Fig. 2 shows a user graph and the corresponding execution graph, where multiple W operations and S operations run in parallel. Each user-defined task runs in its own program scope without access to any state of other tasks. The only way to communicate between tasks is by messaging, as tasks can run on different nodes.

3.1.3 Data Partitioning

A big data application requires the data to be partitioned in a hierarchical manner due to memory limitations. Fig. 4 shows an example of such partitioning of a large file containing records of data points. The data is first partitioned according to the number of parallel tasks, and then each partition is again split into smaller partitions. At every stage of the execution, these smaller partitions are loaded into the memory of each worker. This hierarchical partitioning is implicit in streaming applications, as only a small portion of the data is available at a given time.
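A minimal Java sketch of this two-level partitioning, assuming for simplicity that the "file" is just an in-memory list of records: the data is first split into one partition per parallel task, and each task then loads only one small sub-partition into memory at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HierarchicalPartitioning {
    /** Splits 'records' into 'n' roughly equal partitions. */
    static <T> List<List<T>> split(List<T> records, int n) {
        List<List<T>> parts = new ArrayList<>();
        int chunk = (records.size() + n - 1) / n;
        for (int i = 0; i < records.size(); i += chunk) {
            parts.add(records.subList(i, Math.min(i + chunk, records.size())));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> file = IntStream.rangeClosed(1, 16).boxed().collect(Collectors.toList());
        int parallelTasks = 4, subPartitionsPerTask = 2;

        // Level 1: one partition per parallel task.
        for (List<Integer> taskPartition : split(file, parallelTasks)) {
            // Level 2: each task loads only one small sub-partition into memory at a time.
            for (List<Integer> inMemoryChunk : split(taskPartition, subPartitionsPerTask)) {
                System.out.println("processing chunk " + inMemoryChunk);
            }
        }
    }
}
```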

Figure 3: Load imbalance and velocity of data

Figure 4: Hierarchical data partitioning of a big data application

3.1.4 Hiding Latency

It is widely recognized that computer noise can play a huge role in large-scale parallel jobs that require collective operations. Many researchers have experimented with MPI to reduce the performance degradation caused by noise in HPC environments. Such noise is small compared to what typical cloud environments experience, with multiple VMs sharing the same hardware, I/O subsystem and networks. Added to this is Java JVM noise, which most notably comes from garbage collection. The computations in the dataflow model are somewhat insulated from the effects of such noise due to the asynchronous nature of the parallel execution. In streaming settings, the data arrives at the parallel nodes with different speeds and processing time requirements. Because of these characteristics, dataflow operations are the most suitable for such environments. Load balancing [49] is a much harder problem in streaming settings, where data skew is more common because of the nature of the applications.

3.2 Dataflow for Big Data Applications

3.2.1 Streaming Applications

Streaming applications deal with load-imbalanced data arriving at different rates at the parallel workers at any given moment. Having an MPI application process this data would increase the latency of the individual events. Fig. 3 illustrates this point with an example where three parallel workers process messages arriving at different speeds and sizes (different processing times). If an MPI collective operation is invoked, the collective has to wait until the slowest task finishes, which can vary widely. Also, to handle streams of data with higher frequencies, the tasks of the streaming computation must be executed on different CPUs arranged in pipelines. The dataflow model is a natural fit for such asynchronous processing of chained tasks.

3.2.2 Data Pipelines

Data pipelines can be viewed as a special case of streaming applications.


Figure 5: Dataflow application execution, Left: Streaming execution, Middle: Data pipelines executing in stages, Right: Iterative execution

Figure 6: Microservices using FaaS, Left: Functions using a workflow, Right: Functions in a dataflow

They work on hierarchically partitioned data as shown in Fig. 4. This is similar to streaming, where a stream is partitioned among multiple parallel workers and a parallel worker only processes a small portion of the assigned partition at a given time. Data pipelines deal with the same load imbalance as streaming applications, but the scheduling of tasks is different in streaming and data pipeline applications. Usually every task in a data pipeline is executed on each CPU sequentially, so only a subset of tasks is active at a given time, in contrast to all the tasks being active in streaming applications. Streaming communication operations only need to work on data that can be stored in memory, while data pipelines perform communications that require disk because of the large size of the data. It is necessary to support iterative computations in data pipelines in case they execute complex data analytics applications.

3.2.3 Machine Learning

Complex machine learning applications work mostly with curated data that are load balanced. This means the tight synchronizations required by MPI-style parallel operations are possible, because the data is available around the time the communication is invoked. It is not practical to run complex machine learning algorithms (worse than O(n²)) on very large data sets because of their polynomial time requirements. In those cases it is necessary to find heuristic approaches with lower time complexities. There are also machine learning algorithms which can be run in a pleasingly parallel manner. Because of the expressivity required by machine learning applications, the dataflow APIs should be close enough to MPI-style programming but should hide details such as threads and I/O from users. Task-based APIs as used by AMT systems are suitable for such applications. We note that large numbers of machine learning algorithms fall into the map-collective model of computation, as described in [16, 29].

3.2.4 Services

Services are composed of event-driven functions which can be provisioned and scaled without the user having to know the underlying details of the infrastructure. The functions can be directly exposed to the user for event-driven applications, or fronted by microservices for request/response applications. Fig. 6 shows microservices using functions arranged in a workflow and in a dataflow.

4. RUNTIME ARCHITECTURE

The general architecture of a runtime designed for big data is shown in Fig. 7. An application is created using a graph API, and an optimizer can be used to make the graph execution efficient. A resource scheduler then allocates the required computing resources and runs the needed processes, including a master process to manage the job and a set of executors to execute the tasks. There can be additional processes to manage state and gather statistics, but these are not essential. The executors use threads to invoke the tasks and manage the communications. The task scheduler can be distributed to run in each executor or be central. The task execution model adopted is a hybrid model where both processes and threads are used. A single executor can host multiple tasks of the execution graph and execute them using threads.

Figure 7: Dataflow runtime architecture

4.1 Communication

Communication is a fundamental requirement of distributed computing, and the performance of applications largely revolves around efficient implementations of it. Communication patterns that can involve more than two parallel tasks are termed collective communications. These patterns, as identified by the parallel computing community, are available through frameworks such as MPI [26]. Some of the heavily used communication patterns are Broadcast, Gather, Reduce, AllGather and AllReduce [55].

A naive implementation of these communication patterns using straightforward point-to-point communication produces worst-case performance in practical large-scale parallel applications. Instead, the patterns can be implemented using algorithms that minimize the bandwidth utilization and latency of the operation; these are generally termed collective algorithms.
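As a rough back-of-the-envelope illustration of why the algorithm matters (the numbers are counts of communication rounds, not measurements): a naive gather in which all workers send directly to the root serializes p - 1 receptions at one node, while a binomial-tree gather completes in ceil(log2 p) rounds. The short Java sketch below computes both counts.

```java
public class CollectiveCostSketch {
    public static void main(String[] args) {
        int p = 1024;                                                  // number of parallel tasks
        int naiveRounds = p - 1;                                       // root receives p-1 messages one by one
        int treeRounds = (int) Math.ceil(Math.log(p) / Math.log(2));   // binomial-tree rounds
        System.out.println("naive gather rounds: " + naiveRounds);     // 1023
        System.out.println("tree gather rounds:  " + treeRounds);      // 10
    }
}
```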

4.1.1 MPI Collective Operations

In MPI, collective operations and other point-to-point communication operations are driven by control operations. This means the programmer knows exactly when to execute the send or receive functions. Once the program is ready to receive or send data, it can initiate the appropriate operations, which will invoke the network functions. Asynchronous communications differ slightly from synchronous operations in the sense that after their invocation, the program can continue to compute while the operation is pending. It is important to note that even with asynchronous operations, the user needs to use other operations such as wait/probe to complete the pending operation. MPI has clear standard APIs defined for collective communication patterns, and all MPI implementations follow these specifications. The underlying implementation of such a communication pattern can use different algorithms based on factors including message size. Significant research has been done on MPI collectives [55, 52], and the current implementations are highly optimized. A comprehensive summary of MPI collective operations and possible algorithms is found in [59].

4.1.2 Dataflow Collective Operations

A communication pattern defines how the links are arranged in the dataflow graph. For instance, a single node can broadcast a message to multiple nodes in the graph when they are arranged in a broadcast communication pattern. One of the best examples of a collective operation in dataflow is Reduce, which is the opposite of broadcast: multiple nodes link to a single node. The most common dataflow operations include reduce, gather, join [9] and broadcast.

MPI and big data systems have adopted the same types of collective communications, but they have sometimes diverged in the operations supported. Table 1 shows some of the collective operations and their availability in MPI and dataflow systems. Even though some MPI collective operations are not present in big data systems, they can be effective there. Harp [62] is a machine learning focused collective library and supports the standard MPI collectives as well as some other operations like rotate, push and pull.

It is important to observe that dataflow applications use keyed collective operations. Unlike in MPI, where the operations happen in place, dataflow operations happen between individual tasks. Without keyed operations, it is not possible to direct the outcome of a collective operation to a particular task.
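To make the notion of a keyed collective concrete, here is a minimal Java sketch (the keys, values and hash-based routing are arbitrary illustrative choices, not the API of any dataflow runtime): values are reduced per key, and each key's result is then directed to the downstream task responsible for that key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedReduce {
    public static void main(String[] args) {
        // (key, value) records produced by upstream tasks.
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 1), Map.entry("b", 4), Map.entry("a", 2), Map.entry("b", 3));

        // Keyed reduce: values are combined per key rather than in one global in-place reduce.
        Map<String, Integer> reduced = new HashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            reduced.merge(r.getKey(), r.getValue(), Integer::sum);
        }

        int parallelism = 2;
        // Each key's result is directed to a downstream task chosen by the key.
        reduced.forEach((key, value) ->
                System.out.println("key " + key + " -> task "
                        + (Math.abs(key.hashCode()) % parallelism) + ", value " + value));
    }
}
```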

4.1.3 Optimized Dataflow Collective Operations

Each task in a dataflow graph can only send and receive data via its input and output ports, and parallel tasks cannot communicate with each other while performing computations.

Figure 8: Collective operation with sub-tasks arranged in a tree

The authors of this paper propose collective operations as a dataflow graph enrichment, which introduces sub-tasks into the original dataflow graph. Fig. ?? and Fig. 8 show the naive implementation and our proposed approach for dataflow collective operations, respectively. In this approach, the collective operation's computation is moved to sub-tasks on which the collective operation depends; a simplified sketch is given after the requirements list below. These sub-tasks can be connected to each other according to different data structures, such as trees and pipes, in order to optimize the collective communication. This model preserves the dataflow nature of the application, and the collective does not act as a synchronization barrier. The collective operation can run as data becomes available to each individual task, and the effects of unbalanced load and the timing issues seen in MPI no longer apply. For collective operations such as broadcast and scatter, the original tasks will be arranged according to the data structures required by such operations. We identify several requirements for a dataflow collective algorithm.

1. The communication and the underlying algorithm should be driven by data.

2. The algorithm should be able to use disks when the amount of data is larger than the available memory.

3. The collective communication should work on partitions of data and needs to finish only after all the partitions are processed.
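The following Java sketch is a simplified illustration of the sub-task idea referred to above (it is not the Twister2 implementation; the binary-tree shape and the sum operation are arbitrary choices): each sub-task combines its own partial value with the results of its children before handing the combined value to its parent.

```java
public class TreeReduceSketch {
    /** Reduces the values held by sub-tasks arranged in a binary tree rooted at 'node'. */
    static int reduce(int[] values, int node) {
        if (node >= values.length) {
            return 0;                                  // no sub-task at this position
        }
        int left = reduce(values, 2 * node + 1);       // partial result from the left child
        int right = reduce(values, 2 * node + 2);      // partial result from the right child
        return values[node] + left + right;            // combine as results become available
    }

    public static void main(String[] args) {
        // One value per parallel sub-task, e.g. a partial sum computed from its partition.
        int[] partials = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println("reduced value at root = " + reduce(partials, 0));
    }
}
```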

4.1.4 High Performance Interconnects

RDMA (Remote Direct Memory Access) is one of the key areas where MPI excels. MPI implementations support a variety of high-performance communication fabrics and perform well compared to their Ethernet counterparts. Some RDMA fabrics are developed especially targeting MPI-type applications [10]. Recently there have been many efforts to bring RDMA communications to big data systems, including HDFS [36], Hadoop [43] and Spark [44]. Big data applications are primarily written in Java while RDMA libraries are written in C/C++, requiring the integration to go through JNI. Even when passing through additional layers such as JNI, the applications still perform reasonably well with RDMA. One of the key factors holding back the adoption of RDMA fabrics has been their low-level APIs. Nowadays, with unified API libraries such as Libfabric [30] and Photon [39], this is no longer the case.


Table 1: MPI and big data collective operations

MPI | Big Data | Algorithms available (small messages) | Algorithms available (large messages)
Reduce | Reduce, Keyed Reduce | Flat/Binary/N-ary/Binomial Tree | Pipelined/Double/Split Binary Tree, Chain
AllReduce | N/A | Recursive Doubling, Reduce followed by Broadcast | Ring, Rabenseifner Algorithm, Recursive Doubling, Vector Halving with Distance Doubling
Broadcast | Broadcast | Flat/Binary Tree | Pipelined/Double/Split Binary Tree, Chain
Gather | Aggregate, Keyed Aggregate | Flat/Binary/N-ary/Binomial Tree | Pipelined/Double/Split Binary Tree, Chain
AllGather | N/A | Recursive Doubling, Reduce followed by Broadcast | Ring, Rabenseifner Algorithm, Recursive Doubling, Vector Halving with Distance Doubling
Barrier | N/A | Flat/Binary/Binomial Tree |
Scatter | N/A | Flat/Binary Tree | Pipelined/Double/Split Binary Tree, Chain
N/A | Join | Distributed radix hash, sort merge |

4.2 Task Scheduling, Threads & Processes

Task scheduling is a key area in which MPI, static dataflow and dynamic dataflow systems differ. From an MPI perspective, task scheduling is straightforward, as MPI only spawns processes to run; it is the responsibility of the user to spawn threads and assign the computations appropriately. This becomes harder for the MPI programmer when designing applications to run on many-core and multicore systems. It is worth noting that the vast majority of programmers are not comfortable with threads, let alone possess the skill to get good performance from them through efficient use of locks.

Static and dynamic scheduling are the two main paradigms used in scheduling tasks.

Figure 9: Top: Stream tasks scheduled on 4 CPUs in a chain. Bottom: All streaming tasks scheduled on a single CPU.

Static graph scheduling necessitates that the graph be available from the beginning. Because streaming systems require the entire graph to run continuously, this is the only way for such systems to operate. However, this does make it harder to express complex applications, especially those containing loops. This approach is suitable for data-parallel applications such as streaming and data pipelines. Because the entire graph is available upon submission, graph optimization techniques can be applied to obtain the execution graph.

Dynamic graph scheduling allocates and schedules tasks at runtime as the computation progresses. Because the graph is generated on the fly by a control program, more complex operations can be specified easily. AMT systems are dynamic graph execution systems. Being driven by a normal program, this method is not suitable for streaming data applications.

4.2.1 Streaming & Batch Task Scheduling

Streaming systems need to allocate all the tasks of the graph to run continuously, and to optimize for latency. To illustrate the requirements of stream task scheduling, take a hypothetical example where we have 4 computations to execute on a stream of messages, each computation taking t CPU time. Assume we have 4 CPUs available and the data rate is 1 message per t CPU time. If we run all 4 tasks on a single CPU as shown in Fig. 9 (bottom), it takes 4t time to process one message and the computation cannot keep up with the stream using 1 CPU. So we need to load balance across the 4 CPUs, and the order of processing is lost unless explicitly programmed with locks to keep state across the 4 CPUs; it is worth noting, however, that the data remains in a single thread while the processing happens, thus preserving data locality. If we instead use the chained schedule of Fig. 9 (top), the data locality is lost but the task locality is preserved. [40] describes stream computation scheduling on multicore systems in great detail.
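The chained arrangement of Fig. 9 (top) can be approximated with plain JDK threads and blocking queues; the sketch below is only meant to show how the four computations are pipelined across workers, not to represent a real stream scheduler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Four chained computations, one worker thread each, connected by queues. */
public class PipelineSketch {
    static void stage(BlockingQueue<Integer> in, BlockingQueue<Integer> out) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    out.put(in.take() + 1);   // stand-in for one computation taking time t
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);  // let the JVM exit when main finishes
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            queues.add(new ArrayBlockingQueue<>(16));
        }
        for (int i = 0; i < 4; i++) {
            stage(queues.get(i), queues.get(i + 1));   // chain the four tasks in a pipeline
        }
        for (int msg = 0; msg < 5; msg++) {
            queues.get(0).put(msg);                     // incoming stream of messages
            System.out.println("processed message -> " + queues.get(4).take());
        }
    }
}
```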

For batch dataflow applications, the tasks are executed as the computation progresses. This means sequential tasks can run on a single CPU as time passes, unlike in a streaming system. Usually a single thread is scheduled to run on a single core, and when tasks become ready to run, this thread executes them.

4.2.2 Execution of Tasks

As described earlier, a thread-based shared memory model is used for executing the tasks of the dataflow.

Figure 10: Task scheduling and execution of dataflow frameworks

This allows both pipelined execution of tasks and sharing of data through memory. Such pipelined execution of tasks is critical for streaming and data pipeline applications to achieve efficient computation. Because of the large number of cores available in modern systems, a balance must be struck in the number of executor processes run on a single node. If fewer executors run on a node, a single executor has to cope with more memory, which can lead to TLB misses and long JVM GC pauses. If more executors are used, data has to be moved among the processes.

4.3 Data Management & Fault Tolerance

Big data applications work with input data and intermediate data generated through calculations. Apart from this, some state has to be maintained about the computations. In most applications the input data is not updated, while model data is updated frequently. An MPI application delegates data management completely to the user, allowing her to load, partition and place the data as needed. The same data set can be updated throughout the computation, allowing efficient use of memory. Because partitioning and placement of the data are controlled by the user, this approach allows the best possible optimizations.

Even though the MPI approach is flexible, the user needs to work with low-level details in order to write such applications. Distributed shared memory architectures have been proposed to ease some of the burden of data management on the user. Dataflow runtimes can use distributed shared memory (DSM) [51] for data and state management. In general, a DSM presents distributed memory as a continuous global address space to the programmer. In such systems, the tasks of the graph are only allowed to work with the DSM and no local process-level state is kept. This permits the tasks of a dataflow graph to be migrated freely among the nodes. Migration allows the system to recover from node failures and balance load at runtime. The HPC community has developed systems such as partitioned global address space (PGAS) [18] distributed memory. The big data community developed DSM technologies like RDD [60] for the same purpose.

RDD and other big data DSMs are a relaxed implementation of the general DSM architecture, where only coarse-grained operations are allowed. These DSMs are immutable, meaning that once created they cannot be changed. Because the tasks of a dataflow application work on partitions of such distributed memory, when a task fails it can be migrated and recalculated without any side effects. If there are global data dependencies for a partition of such a distributed data set, complete operations may need to be executed again, so checkpointing mechanisms are employed to save state in order to reduce deep recalculations. Because these data sets are immutable, they need to be created anew every time they are updated. This can lead to unnecessary resource consumption for complex iterative applications where the models change frequently during calculations. These DSMs do not allow random access and fine-grained control of data, which can lead to inefficiencies in complex applications [38]. It is worth noting that they work extremely well for pleasingly parallel data pipeline type applications.

Fault tolerance is tightly coupled with how state is managed. If fine-grained control of state is allowed, the checkpointing has to take this into account. The frequency of such checkpointing is mostly given as an option to the application developer. Such an architecture would require recalculation of large portions of the application in case of a failure, depending on the rate of checkpointing.

5. SYSTEMS DESIGN

We have chosen a representative set of runtimes, each of which can support at least one of the application areas comfortably, and studied their strengths and weaknesses relative to the design choices they have made. Table 2 shows these systems along with their design choices and their applicability to the three application areas. MPI is the default choice for HPC applications and HPX-5 is an AMT system; both can handle applications with tightly synchronized parallel operations. Spark and Flink are mainly data pipeline systems, with Flink able to handle streaming computations natively because of its static scheduling. Spark is a dynamic scheduling system and cannot handle low-latency streaming computations. Storm/Heron are streaming systems that cannot handle batch applications but are well suited for streaming analytics.

Table 2: Design choices of current runtimes

Framework | MPI | HPX-5 | Spark | Flink | Naiad | Storm/Heron
Task scheduling | Static | Dynamic | Dynamic | Static | Static | Static
Execution | User control | Task-based threads | Task-based threads | Task-based threads | Task-based threads | Task-based threads
API | In-place communication | Task-based | Data transformation | Data transformation | Data transformation | Explicit graph creation
Optimized communications | Yes | Yes | No | No | No | No
RDMA | Yes | Yes | No | No | No | No
DSM | No | Fine grained | Coarse grained | Coarse grained | Coarse grained | No
Streaming | No | No | Yes (high latency) | Yes | Yes | Yes
Data pipelines | No | No | Yes | Yes | Yes | N/A
Machine learning | Yes | Yes | Performance can be poor | Performance can be poor | N/A | N/A

5.1 A Toolkit for Big Data

Our study has identified high-level features of a big data runtime that determine the type of application it can execute efficiently. The toolkit aims to compose systems from well-defined components, which has the following implications: 1) It allows developers to choose only the components they need in order to develop their application; for example, a user may want only MPI-style communication with static scheduling and distributed shared memory. 2) Each component will have multiple implementations, allowing the user to support different types of applications; e.g., the toolkit can be used to compose a system that performs streaming computations as well as data pipelines. We observed that communications, task scheduling and distributed shared memory (if required) are the three main factors affecting the applications, and the API needs to adapt to each of these choices as well. Table 3 shows the different capabilities expected for the different types of big data applications described herein.

Figure 11: The toolkit approach

We propose a high performance communication library including both MPI and dataflow style communications, a task scheduler capable of scheduling according to different requirements, a thread-based task executor, and a low-level graph representation API as the basis of the toolkit. This can be further enhanced with a distributed shared memory implementation supporting coarse-grained and fine-grained operations. Different APIs can be built on top of such a system, including low-level and high-level APIs to support different use cases. High-level APIs can take the form of languages, SQL queries or domain-specific APIs. Support for classic HPC communications such as MPI will further enhance the system's adaptability to existing legacy applications.

6. CONCLUSIONS & FUTURE WORK

We foresee that the share of large-scale applications driven by data will increase rapidly in the future. The HPC community has tended to focus mostly on heavily compute-bound applications, and with these new developments there is an opportunity to explore data-driven applications with HPC features such as high-speed interconnects and many-core machines. Data-driven computing frameworks are still in the early stages, and as we discussed there are four driving application areas (streaming, data pipelines, machine learning, services) with different processing requirements. In this paper we discussed the convergence of these application areas under a common event-driven model. We also examined the choices available in the design of frameworks supporting big data with different components. Every choice made for a component has ramifications for the performance of the applications the system can support. We believe the toolkit approach gives the user the required flexibility to strike a balance between performance and usability and allows the inclusion of proven existing technologies in a unified environment. This will enable a programming environment that is interoperable across application types and system infrastructure, including both HPC and clouds, where in the latter case it supports a cloud native framework [3]. The authors are actively working on the implementation of the various components of the toolkit and its APIs in order to deliver on the promised flexibility across various applications and systems.

Table 3: Requirements of applications

Type of application | Scheduling | API | Communications | Data and State
Streaming | Static scheduling | Static graph API | Optimized dataflow collectives | Coarse grain DSM, Local
Data pipelines | Static/Dynamic scheduling | Static or dynamic graph generation | Optimized dataflow collectives | Coarse grain DSM, Local
Machine learning | Dynamic scheduling | Dynamic graph generation | Optimized dataflow/MPI collectives | Fine grain DSM, Local
FaaS | Dynamic scheduling | Dataflow or workflow | P2P communication | Local

Acknowledgments

This work was partially supported by the Indiana University Precision Health Initiative and by NSF CIF21 DIBBS 1443054 and NSF RaPyDLI 1415459. We thank Intel for their support of the Juliet system, and extend our gratitude to the FutureSystems team for their support with the infrastructure.

7. REFERENCES[1] Apache NiFi. https://nifi.apache.org/. Accessed: July

19 2017.

[2] AWS Step Functions.https://aws.amazon.com/step-functions/. Accessed:July 19 2017.

[3] Cloud Native Computing Foundation.https://www.cncf.io/. Accessed: 2017-Aug-06.

[4] S. Agarwal, R. Garg, and N. K. Vishnoi. The Impactof Noise on the Scaling of Collectives: A TheoreticalApproach, pages 280–289. Springer Berlin Heidelberg,Berlin, Heidelberg, 2005.

[5] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak,J. Haberman, R. Lax, S. McVeety, D. Mills,P. Nordstrom, and S. Whittle. Millwheel:Fault-tolerant stream processing at internet scale.Proc. VLDB Endow., 6(11):1033–1044, Aug. 2013.

[6] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak,R. J. Fernandez-Moctezuma, R. Lax, S. McVeety,D. Mills, F. Perry, E. Schmidt, and S. Whittle. TheDataflow Model: A Practical Approach to BalancingCorrectness, Latency, and Cost in Massive-scale,Unbounded, Out-of-order Data Processing. Proc.VLDB Endow., 8(12):1792–1803, Aug. 2015.

[7] M. Anderson, S. Smith, N. Sundaram, M. Capota,Z. Zhao, S. Dulloor, N. Satish, and T. L. Willke.bridging the gap between hpc and big dataframeworks.

[8] M. Balazinska, H. Balakrishnan, S. R. Madden, andM. Stonebraker. Fault-tolerance in the BorealisDistributed Stream Processing System. ACM Trans.Database Syst., 33(1):3:1–3:44, Mar. 2008.

[9] C. Barthels, I. Muller, T. Schneider, G. Alonso, andT. Hoefler. Distributed Join Algorithms on Thousandsof Cores. Proc. VLDB Endow., 10(5):517–528, Jan.2017.

[10] M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz,T. Lovett, T. Rimmer, K. D. Underwood, and R. C.Zak. Intel x00AE; Omni-path Architecture: Enabling

Scalable, High Performance Fabrics. In 2015 IEEE23rd Annual Symposium on High-PerformanceInterconnects, pages 1–9, Aug 2015.

[11] S. Blagodurov, S. Zhuravlev, A. Fedorova, andA. Kamali. A Case for NUMA-aware ContentionManagement on Multicore Systems. In Proceedings ofthe 19th International Conference on ParallelArchitectures and Compilation Techniques, PACT ’10,pages 557–558, New York, NY, USA, 2010. ACM.

[12] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli. fogcomputing and its role in the internet of things.

[13] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra. DAGuE: A Generic Distributed DAG Engine for High Performance Computing. Parallel Computing, 38(1):37–51, 2012. Extensions for Next-Generation Parallel Programming Models.

[14] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, Supercomputing '93, pages 351–360, New York, NY, USA, 1993. ACM.

[15] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.

[16] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for Machine Learning on Multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 281–288, Cambridge, MA, USA, 2006. MIT Press.

[17] J. Conejero, S. Corella, R. M. Badia, and J. Labarta. Task-based Programming in COMPSs to Converge from HPC to Big Data.

[18] M. De Wael, S. Marr, B. De Fraine, T. Van Cutsem, and W. De Meuter. Partitioned Global Address Space Languages.

[19] J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. Commun. ACM, 53(1):72–77, Jan. 2010.

[20] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, et al. Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming, 13(3):219–237, 2005.

[21] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A Runtime for Iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 810–818, New York, NY, USA, June 2010. ACM.

[22] S. Ekanayake, S. Kamburugamuve, and G. C. Fox. SPIDAL Java: High Performance Data Analytics with Java and MPI on Large Multicore HPC Clusters. In Proceedings of the 24th High Performance Computing Symposium, HPC '16, pages 3:1–3:8, San Diego, CA, USA, 2016. Society for Computer Simulation International.

[23] S. Ekanayake, S. Kamburugamuve, P. Wickramasinghe, and G. C. Fox. Java Thread and Process Performance for Parallel Machine Learning on Multicore HPC Clusters. In 2016 IEEE International Conference on Big Data (Big Data), pages 347–354, Washington, DC, USA, Dec 2016. IEEE.

[24] G. Fox, J. Qiu, S. Jha, S. Ekanayake, and S. Kamburugamuve. Big Data, Simulations and HPC Convergence, pages 3–17. Springer International Publishing, Cham, 2016.

[25] G. Fox, J. Qiu, S. Kamburugamuve, S. Jha, and A. Luckow. HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 1057–1066, New York, NY, USA, May 2015. IEEE.

[26] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, pages 97–104. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.

[27] D. Gannon, R. Barga, and N. Sundaresan. Cloud Native Applications. IEEE Cloud Computing Magazine special issue on cloud native computing, to be published.

[28] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: The System S Declarative Stream Processing Engine. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1123–1134, New York, NY, USA, 2008. ACM.

[29] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative Machine Learning on MapReduce. In 2011 IEEE 27th International Conference on Data Engineering, pages 231–242, April 2011.

[30] P. Grun, S. Hefty, S. Sur, D. Goodell, R. D. Russell, H. Pritchard, and J. M. Squyres. A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pages 34–39, Aug 2015.

[31] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The Synchronous Data Flow Programming Language LUSTRE. Proceedings of the IEEE, 79(9):1305–1320, 1991.

[32] J. Han, H. E, G. Le, and J. Du. Survey on NoSQL Database. In 2011 6th International Conference on Pervasive Computing and Applications, pages 363–366, Oct 2011.

[33] T. Hoefler, T. Schneider, and A. Lumsdaine. The Impact of Network Noise at Large-scale Communication Performance. In 2009 IEEE International Symposium on Parallel Distributed Processing, pages 1–8, May 2009.

[34] T. Hoefler, T. Schneider, and A. Lumsdaine. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pages 1–11, Washington, DC, USA, 2010. IEEE Computer Society.

[35] D. Hollman, J. Lifflander, J. Wilke, N. Slattengren, A. Markosyan, H. Kolla, and F. Rizzi. DARMA v. beta 0.5, Mar 2017.

[36] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High Performance RDMA-based Design of HDFS over InfiniBand. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 35:1–35:35, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

[37] L. V. Kale and S. Krishnan. Charm++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '93, pages 91–108, New York, NY, USA, 1993. ACM.

[38] S. Kamburugamuve, P. Wickramasinghe, S. Ekanayake, and G. C. Fox. Anatomy of Machine Learning Algorithm Implementations in MPI, Spark, and Flink.

[39] E. Kissel and M. Swany. Photon: Remote Memory Access Middleware for High-Performance Runtime Systems. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1736–1743, May 2016.

[40] M. Kudlur and S. Mahlke. Orchestrating the Execution of Stream Programs on Multicore Platforms. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 114–124, New York, NY, USA, 2008. ACM.

[41] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 239–250, New York, NY, USA, 2015. ACM.

[42] F. Liang, C. Feng, X. Lu, and Z. Xu. Performance Benefits of DataMPI: A Case Study with BigDataBench. In J. Zhan, R. Han, and C. Weng, editors, Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 4th and 5th Workshops, BPOE 2014, Salt Lake City, USA, March 1, 2014 and Hangzhou, China, September 5, 2014, Revised Selected Papers, pages 111–123, Cham, 2014. Springer International Publishing.

[43] X. Lu, N. S. Islam, M. Wasi-Ur-Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-Performance Design of Hadoop RPC with RDMA over InfiniBand. In 2013 42nd International Conference on Parallel Processing, pages 641–650, New York, NY, USA, Oct 2013. IEEE.

[44] X. Lu, D. Shankar, S. Gugnani, and D. K. Panda. High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads. In 2016 IEEE International Conference on Big Data (Big Data), pages 253–262, Dec 2016.

[45] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006.

[46] N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2015.

[47] T. G. Mattson, R. Cledat, V. Cavé, V. Sarkar, Z. Budimlić, S. Chatterjee, J. Fryman, I. Ganev, R. Knauerhase, M. Lee, B. Meister, B. Nickerson, N. Pepperling, B. Seshasayee, S. Tasirlar, J. Teller, and N. Vrvilo. The Open Community Runtime: A Runtime System for Extreme Scale Computing. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7, Sept 2016.

[48] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System.

[49] M. A. U. Nasir, G. D. F. Morales, D. Garcia-Soriano, N. Kourtellis, and M. Serafini. The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines. In 2015 IEEE 31st International Conference on Data Engineering, pages 137–148, April 2015.

[50] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In 2010 IEEE International Conference on Data Mining Workshops, pages 170–177, Dec 2010.

[51] B. Nitzberg and V. Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. Computer.

[52] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra. Performance Analysis of MPI Collective Operations. Cluster Computing, 10(2):127–143, 2007.

[53] R. Ranjan. Streaming Big Data Processing in Datacenter Clouds. IEEE Cloud Computing.

[54] T. Sterling, M. Anderson, P. K. Bohan, M. Brodowicz, A. Kulkarni, and B. Zhang. Towards Exascale Co-design in a Runtime System, pages 85–99. Springer International Publishing, Cham, 2015.

[55] R. Thakur and W. D. Gropp. Improving the Performance of Collective Operations in MPICH, pages 257–267. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003.

[56] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications, pages 179–196. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

[57] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147–156, New York, NY, USA, 2014. ACM.

[58] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., Sebastopol, CA, USA, 1st edition, 2009.

[59] U. Wickramasinghe and A. Lumsdaine. A Survey of Methods for Collective Communication Optimization and Tuning. arXiv preprint arXiv:1611.06334, 2016.

[60] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2–2, Berkeley, CA, USA, 2012. USENIX Association.

[61] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.

[62] B. Zhang, Y. Ruan, and J. Qiu. Harp: Collective Communication on Hadoop. In 2015 IEEE International Conference on Cloud Engineering, pages 228–233, March 2015.