

RESEARCH Open Access

A view of programming scalable data analysis: from clouds to exascale

Domenico Talia

Correspondence: [email protected]
DIMES, Università della Calabria, Rende, Italy

Journal of Cloud Computing: Advances, Systems and Applications (2019) 8:4
https://doi.org/10.1186/s13677-019-0127-x

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Abstract

Scalability is a key feature for big data analysis and machine learning frameworks and for applications that need to analyze very large and real-time data available from data repositories, social media, sensor networks, smartphones, and the Web. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high performance computing (HPC) systems and clouds, whereas in the near future Exascale systems will be used to implement extreme-scale data analysis. This paper discusses how clouds currently support the development of scalable data mining solutions, and it outlines and examines the main challenges to be addressed and solved for implementing innovative data analysis applications on Exascale systems.

Keywords: Big data analysis, Cloud computing, Exascale computing, Data mining, Parallel programming, Scalability

Introduction
Solving problems in science and engineering was the first motivation for inventing computers. A long time after that, computer science is still the main area in which innovative solutions and technologies are being developed and applied. Due in part to the extraordinary advancement of computer technology, data are nowadays generated as never before. In fact, the amount of structured and unstructured digital datasets is going to increase beyond any estimate. Databases, file systems, data streams, social media and data repositories are increasingly pervasive and decentralized.

As the data scale increases, we must address new challenges and attack ever-larger problems. New discoveries will be achieved and more accurate investigations can be carried out thanks to the increasingly widespread availability of large amounts of data. Scientific sectors that fail to make full use of the huge amounts of digital data available today risk losing out on the significant opportunities that big data can offer.

To benefit from big data availability, specialists and researchers need advanced data analysis tools and applications running on scalable architectures that allow the extraction of useful knowledge from such huge data sources. High performance computing (HPC) systems and cloud computing systems are today capable platforms for addressing both the computational and data storage needs of big data mining and parallel knowledge discovery applications.

These computing architectures are needed to run data analysis because complex data mining tasks involve data- and compute-intensive algorithms that require large, reliable and effective storage facilities together with high performance processors to obtain results in reasonable times.

Now that data sources have become very big and pervasive, reliable and effective programming tools and applications for data analysis are needed to extract value and find useful insights in them. New ways to correctly and proficiently compose different distributed models and paradigms are required, and the interaction between hardware resources and programming levels must be addressed. Users, professionals and scientists working in the area of big data need advanced data analysis programming models and tools coupled with scalable architectures to support the extraction of useful information from such massive repositories. The scalability of a parallel computing system is a measure of its capacity to reduce program execution time in proportion to the number of its processing elements (the Appendix introduces and discusses scalability in parallel systems in detail). According to this definition, scalable data analysis refers to the ability of a hardware/software parallel system to exploit increasing computing resources effectively in the analysis of (very) large datasets.
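In quantitative terms (the standard formulation, discussed in detail in the Appendix), if T(p) is the execution time of a program on p processing elements, scalability is commonly measured by the speedup S(p) = T(1) / T(p) and by the efficiency E(p) = S(p) / p = T(1) / (p * T(p)); a parallel data analysis system scales well when E(p) remains close to 1 as p, and the dataset size, grow.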






Today complex analysis of real-world massive data sources requires using high-performance computing systems such as massively parallel machines or clouds. However, in the next years, as parallel technologies advance, Exascale computing systems will be exploited for implementing scalable big data analysis in all the areas of science and engineering [23]. To reach this goal, new design and programming challenges must be addressed and solved. The focus of this paper is on discussing current cloud-based design and programming solutions for data analysis and on suggesting new programming requirements and approaches to be conceived for meeting big data analysis challenges on future Exascale platforms.

Current cloud computing platforms and parallel computing systems represent two different technological solutions for addressing the computational and data storage needs of big data mining and parallel knowledge discovery applications. Indeed, parallel machines offer high-end processors with the main goal of supporting HPC applications, whereas cloud systems implement a computing model in which dynamically scalable virtualized resources are provided to users and developers as a service over the Internet. In fact, clouds do not mainly target HPC applications; they provide scalable computing and storage delivery platforms that can be adapted to the needs of different classes of people and organizations by exploiting the Service Oriented Architecture (SOA) approach. Clouds offer large facilities to the many users who were unable to own their own parallel/distributed computing systems to run applications and services. In particular, big data analysis applications that require accessing and manipulating very large datasets with complex mining algorithms will significantly benefit from the use of cloud platforms.

Although not many cloud-based data analysis frameworks are available today for end users, within a few years they will become common [29]. Some current solutions are based on open source systems, such as Apache Hadoop and Mahout, Spark and SciDB, while others are proprietary solutions provided by companies such as Google, Microsoft, EMC, Amazon, BigML, Splunk Hunk, and InsightsOne. As more such platforms emerge, researchers and professionals will port increasingly powerful data mining programming tools and frameworks to the cloud to exploit complex and flexible software models such as the distributed workflow paradigm. The growing utilization of the service-oriented computing model could accelerate this trend.

From the definition of the term big data, which refers to datasets so large and complex that traditional hardware and software data processing solutions are inadequate to manage and analyze them, we can infer that conventional computer systems are not powerful enough to process and mine big data [28] and are not able to scale with the size of the problems to be solved.

As mentioned before, to face the limits of sequential machines, advanced systems like HPC, clouds and even more scalable architectures are used today to analyze big data. Starting from this scenario, Exascale computing systems will represent the next computing step [1, 34]. Exascale systems are high performance computing systems capable of at least one exaFLOPS, so their implementation represents a very significant research and technology challenge. Their design and development is currently under investigation, with the goal of building by 2020 high-performance computers composed of a very large number of multi-core processors expected to deliver a performance of 10^18 operations per second. Cloud computing systems used today are able to store very large amounts of data, but they do not provide the high performance expected from massively parallel Exascale systems; this is the main motivation for developing Exascale systems, which will represent the most advanced model of supercomputers. They have been conceived for single-site supercomputing centers, not for distributed infrastructures such as multi-clouds or fog computing systems, which aim at decentralized computing and pervasive data management; such infrastructures, however, could be interconnected with Exascale systems used as a backbone for very large scale data analysis.

The development of Exascale systems demands that issues and challenges be addressed and solved at both the hardware and the software level. Indeed, it requires designing and implementing novel software tools and runtime systems able to manage a very high degree of parallelism, reliability and data locality in extreme scale computers [14]. New programming constructs and runtime mechanisms able to adapt to the most appropriate parallelism degree and communication decomposition for making data analysis tasks scalable and reliable are needed. Their dependence on parallelism grain size and on data analysis task decomposition must be studied in depth. This is needed because parallelism exploitation depends on several features like parallel operations, communication overhead, input data size, I/O speed, problem size, and hardware configuration. Moreover, reliability and reproducibility are two additional key challenges to be addressed. Indeed, at the programming level, constructs for handling and recovering communication, data access, and computing failures must be designed. At the same time, reproducibility in scalable data analysis asks for rich information useful to assure similar results in environments that may change dynamically. All these factors must be taken into account in designing data analysis applications and tools that will be scalable on Exascale systems.




Moreover, reliable and effective methods for storing, accessing and communicating data, intelligent techniques for massive data analysis, and software architectures enabling the scalable extraction of knowledge from data are needed [28]. To reach this goal, the models and technologies enabling cloud computing systems and HPC architectures must be extended/adapted or completely changed to be reliable and scalable on the very large number of processors/cores that compose extreme scale platforms, and to support the implementation of clever data analysis algorithms that ought to be scalable and dynamic in resource usage. Exascale computing infrastructures will play the role of an extraordinary platform for addressing both the computational and data storage needs of big data analysis applications. However, as mentioned before, to complete this scenario, efforts must be devoted to implementing big data analytics algorithms, architectures, programming tools and applications on Exascale systems [24].

If this objective is pursued, within a few years scalable data access and analysis systems will become the most used platforms for big data analytics on large-scale clouds. In a longer perspective, new Exascale computing infrastructures will appear as the platforms for big data analytics in the next decades, and data mining algorithms, tools and applications will be ported to such platforms for implementing extreme data discovery solutions.

In this paper we first discuss cloud-based scalable data mining and machine learning solutions, then we examine the main research issues that must be addressed for implementing massively parallel data mining applications on Exascale computing systems. Data-related issues are discussed together with communication, multi-processing, and programming issues. Section II introduces issues and systems for scalable data analysis on clouds, and Section III discusses design and programming issues for big data analysis on Exascale systems. Section IV completes the paper, also outlining some open design challenges.

Data analysis on clouds
Clouds implement elastic services, scalable performance and scalable data storage used by a large and ever-increasing number of users and applications [2, 12]. In fact, clouds enlarged the arena of distributed computing systems by providing advanced Internet services that complement and complete the functionalities of distributed computing provided by the Web, Grid systems and peer-to-peer networks. In particular, most cloud computing applications use big data repositories stored within the cloud itself, so in those cases large datasets are analyzed with low latency to effectively extract data analysis models.

Big data is a new and over-used term that refers to massive, heterogeneous, and often unstructured digital content that is difficult to process using traditional data management tools and techniques.

The term includes the complexity and variety of data and data types, real-time data collection and processing needs, and the value that can be obtained by smart analytics. However, we should recognize that data are not necessarily important per se; they become very important if we are able to extract value from them, that is, if we can exploit them to make discoveries. The extraction of useful knowledge from big digital datasets requires smart and scalable analytics algorithms, services, programming tools, and applications. All the solutions needed to find insights in big data will contribute to making them really useful for people.

The growing use of service-oriented computing is accelerating the use of cloud-based systems for scalable big data analysis. Developers and researchers are adopting the three main cloud models, software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS), to implement big data analytics solutions in the cloud [27, 31]. According to a specialization of these three models, data analysis tasks and applications can be offered as services at the infrastructure, platform or software level and made available at any time and from anywhere. A methodology for implementing them defines a new model stack for delivering data analysis solutions that is a specialization of the XaaS (Everything as a Service) stack and is called Data Analysis as a Service (DAaaS). It adapts and specializes the three general service models (SaaS, PaaS and IaaS) to support the structured development of big data analysis systems, tools and applications according to a service-oriented approach. The DAaaS methodology is thus based on three basic models for delivering data analysis services at different levels, as described here (see also Fig. 1):

- Data analysis infrastructure as a service (DAIaaS). This model provides a set of hardware/software virtualized resources that developers can assemble and use as an integrated infrastructure for storing large datasets, running data mining applications and/or implementing data analytics systems from scratch;

- Data analysis platform as a service (DAPaaS). This model defines a supporting software platform that developers can use for programming and running their data analytics applications, or for extending existing ones, without worrying about the underlying infrastructure or specific distributed architecture issues; and

- Data analysis software as a service (DASaaS). This is a higher-level model that offers end users data mining algorithms, data analysis suites or ready-to-use knowledge discovery applications as Internet services that can be accessed and used directly through a Web browser.



According to this approach, every data analysis software is provided as a service, relieving end users of implementation and execution details.

Cloud-based data analysis tools
Using the DASaaS methodology we designed a cloud-based system, the Data Mining Cloud Framework (DMCF) [17], which supports three main classes of data analysis and knowledge discovery applications:

- Single-task applications, in which a single data mining task such as classification, clustering, or association rules discovery is performed on a given dataset;

- Parameter-sweeping applications, in which a dataset is analyzed by multiple instances of the same data mining algorithm with different parameters; and

- Workflow-based applications, in which knowledge discovery applications are specified as graphs linking together data sources, data mining tools, and data mining models.

DMCF includes a large variety of processing patterns to express knowledge discovery workflows as graphs whose nodes denote resources (datasets, data analysis tools, mining models) and whose edges denote dependencies among resources.

A Web-based user interface allows users to compose their applications and submit them for execution to the cloud platform, following the data analysis software as a service approach. Visual workflows can be programmed in DMCF through a language called VL4Cloud (Visual Language for Cloud), whereas script-based workflows can be programmed in JS4Cloud (JavaScript for Cloud), a JavaScript-based language for data analysis programming.

Figure 2 shows a sample data mining workflow composed of several sequential and parallel steps. It is just an example for presenting the main features of the VL4Cloud programming interface [17]. The example workflow analyzes a dataset by using n instances of a classification algorithm, which work on n portions of the training set and generate the same number of knowledge models. By using the n generated models and the test set, n classifiers produce in parallel n classified datasets (n classifications). In the final step of the workflow, a voter generates the final classification by assigning a class to each data item, choosing the class predicted by the majority of the models.
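To make the voting step concrete, the following minimal C sketch (not DMCF/VL4Cloud code; the sizes and the predictions array are hypothetical) assigns to each test item the class predicted by the majority of the n models:

#include <stdio.h>

#define N_MODELS  5   /* n: instances of the classification algorithm */
#define N_ITEMS   4   /* items in the test set */
#define N_CLASSES 3   /* number of class labels */

/* Return the class that received the majority of the votes. */
static int majority_vote(const int votes[], int n_votes, int n_classes) {
    int counts[N_CLASSES] = {0};
    int best = 0;
    for (int m = 0; m < n_votes; m++)
        counts[votes[m]]++;
    for (int c = 1; c < n_classes; c++)
        if (counts[c] > counts[best])
            best = c;
    return best;
}

int main(void) {
    /* predictions[m][i]: class assigned to test item i by model m
       (hypothetical values standing for the n classified datasets). */
    int predictions[N_MODELS][N_ITEMS] = {
        {0, 1, 2, 1},
        {0, 1, 1, 1},
        {0, 2, 2, 1},
        {1, 1, 2, 1},
        {0, 1, 2, 2}
    };
    /* The voter assigns to each item the class chosen by most models. */
    for (int i = 0; i < N_ITEMS; i++) {
        int votes[N_MODELS];
        for (int m = 0; m < N_MODELS; m++)
            votes[m] = predictions[m][i];
        printf("item %d -> class %d\n", i,
               majority_vote(votes, N_MODELS, N_CLASSES));
    }
    return 0;
}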

Although DMCF has been mainly designed to coordinate coarse-grain data and task parallelism in big data analysis applications by exploiting the workflow paradigm, the DMCF script-based programming interface (JS4Cloud) also allows parallelizing fine-grain operations in data mining algorithms, as it permits programming any data mining algorithm, such as classification, clustering and others, in a JavaScript style. This can be done because loops and data-parallel methods are run in parallel on the virtual machines of a cloud [16, 26].

Like DMCF, other innovative cloud-based systems designed for programming data analysis applications are Apache Spark, Sphere, Swift, Mahout, and CloudFlows. Most of them are open source. Apache Spark is an open-source framework developed at UC Berkeley for in-memory data analysis and machine learning [34]. Spark has been designed to run both batch processing and dynamic applications like streaming, interactive queries, and graph analysis. Spark provides developers with a programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. Differently from other systems, and from Hadoop in particular, Spark stores data in memory and queries them repeatedly so as to obtain better performance. This feature can be useful for a future implementation of Spark on Exascale systems.

Swift is a workflow-based framework for implementing functional data-driven task parallelism in data-intensive applications.

Fig. 1 The three models of the DAaaS software methodology. The DAaaS software methodology is based on three basic models for delivering data analysis services at different levels (application, platform, and infrastructure). The DAaaS methodology defines a new model stack for delivering data analysis solutions that is a specialization of the XaaS (Everything as a Service) stack and is called Data Analysis as a Service (DAaaS). It adapts and specializes the three general service models (SaaS, PaaS and IaaS) to support the structured development of big data analysis systems, tools and applications according to a service-oriented approach



The Swift language provides a functional programming paradigm where workflows are designed as a set of calls with associated command-line arguments and input and output files. Swift uses an implicit data-driven task parallelism [32]. In fact, it looks like a sequential language, but being a dataflow language, all variables are futures; thus execution is based on data availability. Parallelism can also be exploited through the use of the foreach statement. Swift/T is a new implementation of the Swift language for high-performance computing. In this implementation, a Swift program is translated into an MPI program that uses the Turbine and ADLB runtime libraries for scalable dataflow processing over MPI. Recently, a porting of Swift/T to very large cloud systems for the execution of very many tasks has been investigated.

Differently from the other frameworks discussed here, DMCF is the only system that offers both a visual and a script-based programming interface. Visual programming is a very convenient design approach for high-level users, like domain-expert analysts having a limited understanding of programming. On the other hand, script-based workflows are a useful paradigm for expert programmers, who can code complex applications rapidly, in a more concise way and with greater flexibility. Finally, the workflow-based model exploited in DMCF and Swift makes these frameworks of more general use with respect to Spark, which offers a very restricted set of programming patterns (e.g., map, filter and reduce), thus limiting the variety of data analysis applications that can be implemented with it.

These and other related systems are currently used for the development of big data analysis applications on HPC and cloud platforms.

However, additional research work in this field must be done, and the development of new models, solutions and tools is needed [13, 24]. Just to mention a few, active and promising research topics are listed here, ordered by importance:

- Programming models for big data analytics. New abstract programming models and constructs hiding the system complexity are needed for big data analytics tools. The MapReduce model and workflow models are often used on HPC and clouds, but more research effort is needed to develop other scalable, adaptive, general-purpose higher-level models and tools. Research in this area is even more important for Exascale systems; in the next section we discuss some of these topics in Exascale computing.

- Reliability in scalable data analysis. As the number of processing elements increases, the reliability of systems and applications decreases, therefore mechanisms for detecting and handling hardware and software faults are needed. Although it has been proved in [7] that no reliable communication protocol can tolerate crashes of the processors on which the protocol runs, as stated in the same paper there are ways in which systems can cope with this impossibility result. Among them, at the programming level it is necessary to design constructs for handling communication, data access, and computing failures and for recovering from them. Programming models, languages and APIs must provide general and data-oriented mechanisms for failure detection and isolation, preventing an entire application from failing and assuring its completion. Reliability is an even more important issue in the Exascale domain, where the number of processing elements is massive and fault occurrence increases, making detection and recovery vital.

Fig. 2 A parallel classification workflow designed with the VL4Cloud programming interface. The figure shows a workflow designed with the VL4Cloud programming interface during its execution. The workflow implements a parallel classification application. Tasks/services included in square brackets are executed in parallel. The results produced by the classifiers are selected by a voter task that produces the final classification




- Application reproducibility. Reproducibility is another open research issue for designers of complex applications running on parallel systems. Reproducibility in scalable data analysis must deal, for example, with data communication, data-parallel manipulation and dynamic computing environments. Reproducibility demands that current data analysis frameworks (like those based on MapReduce and on workflows) and future ones, especially those implemented on Exascale systems, provide additional information and knowledge on how data are managed, on algorithm characteristics and on the configuration of software and execution environments.

- Data and tool integration and openness. Code coordination and data integration are main issues in large-scale applications that use data and computing resources. Standard formats, data exchange models and common APIs are needed to support interoperability and ease cooperation among design teams using different data formats and tools.

- Interoperability of big data analytics frameworks. The service-oriented paradigm allows running large-scale distributed applications on heterogeneous cloud platforms along with software components developed using different programming languages or tools. Cloud service paradigms must be designed to allow worldwide integration of multiple data analytics frameworks.

Exascale and big data analysis
As we discussed in the previous sections, data analysis has gained a primary role because of the very large availability of datasets and the continuous advancement of methods and algorithms for finding knowledge in them. Data analysis solutions advance by exploiting the power of data mining and machine learning techniques and are changing several scientific and industrial areas. For example, the amount of data that social media generate daily is impressive and continuous. Some hundreds of terabytes of data, including several hundreds of millions of photos, are uploaded daily to Facebook and Twitter.

Therefore it is central to design scalable solutions for processing and analyzing such massive datasets. As a general forecast, IDC experts estimate that the data generated worldwide will reach about 45 zettabytes by 2020 [6]. This impressive amount of digital data asks for scalable high performance data analysis solutions.

However, today only one-quarter of the available digital data would be a candidate for analysis, and about 5% of that is actually analyzed. By 2020, the useful percentage could grow to about 35%, also thanks to data mining technologies.

Extreme data sources and scientific computing
Scalability and performance requirements are challenging conventional data storages, file systems and database management systems. The architectures of such systems have reached limits in handling very large processing tasks involving petabytes of data because they have not been built to scale beyond a given threshold. This condition calls for new architectures and analytics platform solutions that must process big data for extracting complex predictive and descriptive models [30]. Exascale systems, both on the hardware and the software side, can play a key role in supporting solutions for these problems [23].

An IBM study reports that we are generating around 2.5 exabytes of data per day.¹ Because of that continuous and explosive growth of data, many applications require the use of scalable data analysis platforms. A well-known example is the ATLAS detector from the Large Hadron Collider at CERN in Geneva. The ATLAS infrastructure has a capacity of 200 PB of disk and 300,000 cores, with more than 100 computing centers connected via 10 Gbps links. The data collection rate is very high, and only a portion of the data produced by the collider is stored. Several teams of scientists run complex applications to analyze subsets of those huge volumes of data. This analysis would be impossible without a high-performance infrastructure that supports data storage, communication and processing. Computational astronomers are also collecting and producing larger and larger datasets each year that, without scalable infrastructures, cannot be stored and processed. Another significant case is the Energy Sciences Network (ESnet), the US Department of Energy's high-performance network managed by Berkeley Lab, which in late 2012 rolled out a 100 gigabits-per-second national network to accommodate the growing scale of scientific data.

If we move from science to society, social data and e-health are good examples to discuss. Social networks, such as Facebook and Twitter, have become very popular and are receiving increasing attention from the research community since, through the huge amount of user-generated data, they provide valuable information concerning human behavior, habits, and travels. When the volume of data to be analyzed is of the order of terabytes or petabytes (billions of tweets or posts), scalable storage and computing solutions must be used, but no clear solutions exist today for the analysis of Exascale datasets.



The same occurs in the e-health domain, where huge amounts of patient data are available and can be used for improving therapies, for forecasting and tracking of health data, and for the management of hospitals and health centers. Very complex data analysis in this area will need novel hardware/software solutions; however, Exascale computing is also promising in other scientific fields where scalable storage and databases are not used/required. Examples of scientific disciplines where future Exascale computing will be extensively used are quantum chromodynamics, materials simulation, molecular dynamics, materials design, earthquake simulations, subsurface geophysics, climate forecasting, nuclear energy, and combustion. All those applications require the use of sophisticated models and algorithms to solve complex equation systems that will benefit from the exploitation of Exascale systems.

Programming model features for exascale data analysis
Implementing scalable data analysis applications on Exascale computing systems is a very complex job, and it requires high-level fine-grain parallel models, appropriate programming constructs, and skills in parallel and distributed programming. In particular, mechanisms and expertise are needed for expressing task dependencies and inter-task parallelism, for designing synchronization and load balancing mechanisms, for handling failures, and for properly managing distributed memory and concurrent communication among a very large number of tasks. Moreover, when the target computing infrastructures are heterogeneous and require different libraries and tools to program applications on them, the programming issues are even more complex. To cope with some of these issues in data-intensive applications, different scalable programming models have been proposed [5]. Scalable programming models may be categorized by:

i. their level of abstraction: expressing high-level and low-level programming mechanisms, and

ii. how they allow programmers to develop applications: using visual or script-based formalisms.

Using high-level scalable models, a programmer defines only the high-level logic of an application and hides the low-level details that are not essential for application design, including infrastructure-dependent execution details. The programmer is assisted in application definition, and application performance depends on the compiler that analyzes the application code and optimizes its execution on the underlying infrastructure. On the other hand, low-level scalable models allow programmers to interact directly with the computing and storage elements composing the underlying infrastructure and thus to define the application's parallelism directly.

Data analysis applications implemented with some frameworks can be programmed through a visual interface, which is a convenient design approach for high-level users, for instance domain-expert analysts having a limited understanding of programming. In addition, a visual representation of workflows or components intrinsically captures parallelism at the task level, without the need to make parallelism explicit through control structures [14]. Visual-based data analysis is typically implemented by providing workflow-based languages or component-based paradigms (Fig. 3). Dataflow-based approaches, which share with workflows the same application structure, are also used. However, in dataflow models the grain of parallelism and the size of data items are generally smaller with respect to workflows. In general, visual programming tools are not very flexible because they often implement a limited set of visual patterns and provide restricted ways to configure them.

Fig. 3 Main visual and script-based programming models used today for data analysis programming



To address this issue, some visual languages provide users with the possibility to customize the behavior of patterns by adding code that specifies the operations executed by a specific pattern when an event occurs.

On the other hand, code-based (or script-based) formalisms allow users to program complex applications more rapidly, in a more concise way, and with higher flexibility [16]. Script-based applications can be designed in different ways (see Fig. 3):

- With a complete language or a language extension that allows expressing parallelism in applications, according to a general-purpose or a domain-specific approach. This approach requires the design and implementation of a new parallel programming language, or of a complete set of data types and parallel constructs to be fully inserted in an existing language.

- With annotations in the application code that allow the compiler to identify which instructions will be executed in parallel. According to this approach, parallel statements are separated from sequential constructs, and they are clearly identified in the program code because they are denoted by special symbols or keywords (see the sketch after this list).

- Using a library in the application code that adds parallelism to the data analysis application. Currently this is the most used approach, since it is orthogonal to host languages. MPI and MapReduce are two well-known examples of this approach.
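As an illustration of the annotation-based approach, the following minimal C sketch uses a standard OpenMP annotation to mark a data-parallel loop, leaving the compiler and runtime to divide the iterations among threads (the score function and the data sizes are hypothetical):

#include <omp.h>
#include <stdio.h>

#define N 1000000

/* Hypothetical per-record scoring function of a data analysis task. */
static double score(double x) { return x * x + 1.0; }

static double data[N], result[N];

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = (double)i / N;

    /* The annotation below marks the loop as parallel: the sequential
       code is unchanged, and the parallel statement is identified by
       the pragma keyword alone. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        result[i] = score(data[i]);

    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}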

Given the variety of data analysis applications and classes of users (from skilled programmers to end users) that can be envisioned for future Exascale systems, there is a need for scalable programming models with different levels of abstraction (high-level and low-level) and different design formalisms (visual and script-based), according to the classification outlined above.

As we discussed, data-intensive applications are software programs that have a significant need to process large volumes of data [9]. Such applications devote most of their processing time to running I/O operations and to exchanging and moving data among the processing elements of a parallel computing infrastructure. Parallel processing in data analysis applications typically involves accessing, pre-processing, partitioning, distributing, aggregating, querying, mining, and visualizing data that can be processed independently.

The main challenges for programming data analysis applications on Exascale computing systems come from potential scalability, network latency and reliability, reproducibility of data analysis, and resilience of the mechanisms and operations offered to developers for accessing, exchanging and managing data.

Indeed, processing very large data volumes requires operations and new algorithms able to scale in loading, storing, and processing massive amounts of data that generally must be partitioned into very small data grains, on which thousands to millions of simple parallel operations perform the analysis.

Exascale programming systems
Exascale systems force new requirements on programming systems to target platforms with hundreds of homogeneous and heterogeneous cores. Evolutionary models have recently been proposed for Exascale programming that extend or adapt traditional parallel programming models like MPI (e.g., EPiGRAM [15], which uses a library-based approach, and Open MPI for Exascale in the ECP initiative), OpenMP (e.g., OmpSs [8], which exploits an annotation-based approach, and the SOLLVE project), and MapReduce (e.g., Pig Latin [22], which implements a domain-specific complete language). These new frameworks limit the communication overhead in message passing paradigms or limit the synchronization control if a shared-memory model is used [11].

As Exascale systems are likely to be based on large distributed memory hardware, MPI is one of the most natural programming systems. MPI is currently used on over one million cores; it is therefore reasonable to have MPI as one of the programming paradigms used on Exascale systems. The same holds for MapReduce-based libraries, which today run on very large HPC and cloud systems. Both these paradigms are largely used for implementing big data analysis applications. As expected, general MPI all-to-all communication does not scale well in Exascale environments; to solve this issue, new MPI releases introduced neighbor collectives to support sparse "all-to-some" communication patterns that limit the data exchange to limited regions of processors [11].
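The following minimal C sketch (assuming a standard MPI-3 library) illustrates the idea: each process exchanges a partial result only with its two neighbors on a ring, through a distributed graph topology and a neighborhood collective, instead of communicating with all processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process talks only to its left and right ring neighbors:
       a sparse "all-to-some" pattern instead of all-to-all. */
    int left = (rank - 1 + size) % size, right = (rank + 1) % size;
    int sources[2] = {left, right}, dests[2] = {left, right};
    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, sources, MPI_UNWEIGHTED,
                                   2, dests, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &ring);

    double partial = (double)rank;   /* hypothetical partial result */
    double neigh[2];
    /* Neighborhood collective: data moves only along graph edges. */
    MPI_Neighbor_allgather(&partial, 1, MPI_DOUBLE,
                           neigh, 1, MPI_DOUBLE, ring);

    printf("rank %d received %.1f and %.1f\n", rank, neigh[0], neigh[1]);
    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}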

Ensuring the reliability of Exascale systems requires a holistic approach, including several hardware and software technologies for both predicting crashes and keeping systems stable despite failures. In the runtime of parallel APIs like MPI, and of MapReduce-based libraries like Hadoop, if incorrect behavior in case of processor failure is to be avoided, a reliable communication layer must be provided on top of the lower, unreliable layer by implementing a protocol that works safely with every implementation of the unreliable layer, which cannot tolerate crashes of the processors on which it runs. Concerning MapReduce frameworks, reference [18] reports on an adaptive MapReduce framework, called P2P-MapReduce, which has been developed to manage node churn, master node failures, and job recovery in a decentralized way, so as to provide a more reliable MapReduce middleware that can be effectively exploited in dynamic large-scale infrastructures.



On the other hand, new complete languages such as X10 [29], ECL [33], UPC [21], Legion [3], and Chapel [4] have been defined by exploiting a data-centric approach. Furthermore, new APIs based on a revolutionary approach, such as GA [20] and SHMEM [19], have been implemented according to a library-based model. These novel parallel paradigms are devised to address the requirements of data processing using massive parallelism. In particular, languages such as X10, UPC, and Chapel, and the GA library, are based on a partitioned global address space (PGAS) memory model that is suited to implementing data-intensive Exascale applications because it uses private data structures and limits the amount of shared data among parallel threads.

Together with different approaches, such as Pig Latin and ECL, these programming models, languages and APIs must be further investigated, designed and adapted to provide data-centric scalable programming models able to support the reliable and effective implementation of Exascale data analysis applications composed of up to millions of computing units that process small data elements and exchange them with a very limited set of processing elements. PGAS-based models, dataflow and data-driven paradigms, and local-data approaches today represent promising solutions that could be used for Exascale data analysis programming. The APGAS model is, for example, implemented in the X10 language, where it is based on the notions of places and asynchrony. A place is an abstraction of shared, mutable data and worker threads operating on the data. A single APGAS computation can consist of hundreds or potentially tens of thousands of places. Asynchrony is implemented by a single block-structured control construct, async: given a statement ST, the construct async ST executes ST in a separate thread of control. Memory locations in one place can contain references to locations at other places. To compute upon data at another place, the

at (p) ST

statement must be used. It allows the task to change its place of execution to p, executes ST at p and returns, leaving behind tasks that may have been spawned during the execution of ST.

Another interesting language based on the PGAS model is Chapel [4]. Its locality mechanisms can be effectively used for scalable data analysis, where light data mining (sub-)tasks are run on local processing elements and partial results must be exchanged. Chapel data locality provides control over where data values are stored and where tasks execute, so that developers can ensure that parallel data analysis computations execute near the variables they access, or vice versa, to minimize the communication and synchronization costs.

For example, Chapel programmers can specify how domains and arrays are distributed among the system nodes. Another appealing feature of Chapel is the expression of synchronization in a data-centric style: by associating synchronization constructs with data (variables), locality is enforced and data-driven parallelism can be easily expressed also at very large scale. In Chapel, locales and domains are abstractions for referring to machine resources and for mapping tasks and data to them. Locales are language abstractions for naming a portion of a target architecture (e.g., a GPU, a single core or a multicore node) that has processing and storage capabilities. A locale specifies where (on which processing node) to execute tasks/statements/operations. For example, in a system composed of 4 locales we can declare

const Locs: [4] locale;

For executing the method Filter(D) on the first locale, we can use

on Locs[0] do Filter(D);

and to execute the K-means algorithm on the 4 locales we can use

forall lc in Locs do on lc do Kmeans();

Whereas locales are used to map tasks to machine nodes, domain maps are used for mapping data to a target architecture. Here is a simple example of the declaration of a rectangular domain:

const D: domain(2) = {1..n, 1..n};

Domains can also be mapped to locales. Similar concepts (logical regions and mapping interfaces) are used in the Legion programming model [3, 5].

Exascale programming is a strongly evolving research field, and it is not possible to discuss in detail all the programming models, languages and libraries that are contributing features and mechanisms useful for Exascale data analysis application programming. However, the next section introduces, discusses and classifies current programming systems for Exascale computing according to the most used programming and data management models.

Exascale programming systems comparison
As mentioned, several parallel programming models, languages and libraries are under development for providing high-level programming interfaces and tools for implementing high-performance applications on future Exascale computers. Here we introduce the most significant proposals and discuss their main features. Table 1 lists and classifies the considered systems and summarizes some pros and fallacies of the different classes.



Since Exascale systems will be composed of millions of processing nodes, distributed memory paradigms, and message passing systems in particular, are candidate tools to be used as programming systems for this class of systems. In this area, MPI is currently the most used and studied system, and different adaptations of this well-known model are under development, such as, for example, Open MPI for Exascale. Other systems based on distributed memory programming are Pig Latin, Charm++, Legion, PaRSEC, Bulk Synchronous Parallel (BSP), the AllScale API, and the Enterprise Control Language (ECL). Just considering Pig Latin, we can notice that some of its parallel operators, such as FILTER, which selects a set of tuples from a relation based on a condition, and SPLIT, which partitions a relation into two or more relations, can be very useful in many highly parallel big data analysis applications.

On the other side, we have shared-memory models, where the major system is OpenMP, which offers a simple parallel programming model although it does not provide mechanisms to explicitly map and control data distribution, and it includes non-scalable synchronization operations that make its implementation on massively parallel systems very challenging.

Other programming systems in this area are Threading Building Blocks (TBB), OmpSs, and Cilk++. The OpenMP synchronization model, based on locks, atomic and sequential sections, which limits parallelism exploitation in Exascale systems, is being modified and integrated in recent OpenMP implementations with new techniques and routines that increase asynchronous operations and parallelism exploitation. A similar approach is used in Cilk++, which supports parallel loops and hyperobjects, a new construct designed to solve data race problems created by parallel accesses to global variables. In fact, a hyperobject allows multiple tasks to share state without race conditions and without using explicit locks.
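The following minimal C sketch illustrates the same lock-free state sharing idea through the analogous OpenMP reduction mechanism (not Cilk++ hyperobjects themselves): each thread accumulates into a private copy that the runtime combines at the end, so the shared variable is updated with no explicit lock and no data race:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double sum = 0.0;
    /* reduction(+:sum): each thread works on a private copy of sum;
       the runtime merges the copies when the loop completes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += (double)i;
    printf("sum = %.0f\n", sum);
    return 0;
}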

As a tradeoff between distributed and shared memory organizations, the Partitioned Global Address Space (PGAS) model has been designed to implement a global memory address space that is logically partitioned, with portions of it local to single processes. The main goal of the PGAS model is to limit data exchange and isolate failures in very large-scale systems. Languages and libraries based on PGAS are Unified Parallel C (UPC), Chapel, X10, Global Arrays (GA), Co-Array Fortran (CAF), DASH, and SHMEM.

Table 1 Exascale programming systems classification

- Distributed memory.
  Languages: Charm++, Legion, High Performance Fortran (HPF), ECL, PaRSEC.
  Libraries/APIs: MPI, BSP, Pig Latin, AllScale.
  Pros and fallacies: distributed memory languages/APIs are very close to the Exascale hardware model. Systems in this class consider and deal with communication latency; however, data exchange costs are the main source of overhead. Except AllScale and some MPI versions, systems in this class do not manage network and CPU failures.

- Shared memory.
  Languages: TBB, Cilk++.
  Libraries/APIs: OpenMP, OmpSs.
  Pros and fallacies: shared memory models do not map efficiently on Exascale systems; extensions have been proposed to deal better with synchronization and network failures. No single convincing solution exists so far.

- Partitioned memory.
  Languages: UPC, Chapel, X10, CAF.
  Libraries/APIs: GA, SHMEM, DASH, OpenSHMEM, GASPI.
  Pros and fallacies: the local memory model is very useful, but its combination with global/shared memory mechanisms introduces too much overhead. GASPI is the only system in this class that enables applications to recover from failures.

- Hybrid models.
  Languages and libraries: UPC + MPI, C++/MPI, MPI + OpenMP, Spark-MPI, FLUX, EMPI4Re, DPLASMA.
  Pros and fallacies: hybrid models facilitate the mapping to hardware architectures; however, the different programming routines compete for resources, making it hard to control concurrency and contention. Resilience mechanisms are harder to implement because of the mixing of different constructs and data models.



PGAS appears to be well suited to implementing data-intensive Exascale applications because it uses private data structures and limits the amount of shared data among parallel threads. Its memory-partitioning model facilitates failure detection and resilience. Another programming mechanism useful for decentralized data analysis is data synchronization: in the SHMEM library it is implemented through the shmem_barrier operation, which performs a barrier operation on a subset of processing elements and then enables them to go further by sharing synchronized data.

Starting from those three main programming approaches, hybrid systems have been proposed and developed to better map application tasks and data onto the hardware architectures of Exascale systems. In hybrid systems that combine distributed and shared memory, message-passing routines are used for data communication and inter-node processing, whereas shared-memory operations are used for exploiting intra-node parallelism. A major example in this area is given by the different MPI + OpenMP systems recently implemented. Hybrid systems have also been designed by combining message passing models, like MPI, with PGAS models, restricting the data communication overhead and improving MPI efficiency in execution time and memory consumption. The PGAS-based MPI implementation EMPI4Re, developed in the EPiGRAM project, is an example of this class of hybrid systems.
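A minimal C sketch of the hybrid MPI + OpenMP pattern (the per-element work is hypothetical): OpenMP threads exploit intra-node parallelism, while message passing combines the per-process partial results across nodes:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    /* MPI for inter-node communication, OpenMP for intra-node threads. */
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Intra-node: threads compute a partial result in parallel. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0 / N;            /* hypothetical per-element work */

    /* Inter-node: message passing combines the partial results. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global result = %f\n", global);
    MPI_Finalize();
    return 0;
}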

Associated with the programming model issues, a set of challenges concerns the design of runtime systems, which in Exascale computing systems must be tightly integrated with the programming tools level. The main challenges for runtime systems obviously include parallelism exploitation, limited data communication, data dependence management, data-aware task scheduling, processor heterogeneity, and energy efficiency. Together with those main issues, however, other aspects are addressed in runtime systems, like storage/memory hierarchies, storage and processor heterogeneity, performance adaptability, resource allocation, performance analysis, and performance portability. In addressing those issues, the currently used approaches aim at providing simplified abstractions and machine models that allow algorithm developers and application programmers to generate code that can run and scale on a wide range of Exascale computing systems.

This is a complex task that can be achieved by exploiting techniques that allow the runtime system to cooperate with the compiler, the libraries, and the operating system to find integrated solutions and to make smarter use of hardware resources through efficient ways of mapping the application code to the Exascale hardware. Finally, due to the specific features of Exascale hardware, runtime systems need to find methods and techniques that bring the computing system closer to the application requirements. Research work in this area is carried out in projects like XPRESS, StarPU, Corvette, DEGAS, libWater [10], Traleika-Glacier, OmpSs [8], SnuCL, D-TEC, SLEEC, PIPER, and X-TUNE, which are proposing innovative solutions for large-scale parallel computing systems that can be used in Exascale machines. For instance, a system that aims at integrating the runtime with the language level is OmpSs, which provides mechanisms for data dependence management (based on DAG analysis, as in libWater) and for mapping tasks to computing nodes and handling processor heterogeneity (the target construct). Another issue to be taken into account in the interaction between the programming level and the runtime is performance and scalability monitoring. In the StarPU project, for example, performance feedback is provided through task profiling and trace analysis.
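To make the dependence-driven task model concrete, here is a minimal sketch using standard OpenMP depend clauses (OmpSs pioneered this style of in/out task annotations, which later influenced OpenMP tasking); the runtime derives a task DAG from the declared data accesses and schedules tasks as their inputs become ready:

```c
#include <stdio.h>

int main(void) {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        /* The runtime builds a DAG from the in/out sets:
           t2 and t3 both depend on t1 and may run concurrently;
           t4 waits for both branches. */
        #pragma omp task depend(out: a)
        a = 1;                          /* t1: produce a */

        #pragma omp task depend(in: a) depend(out: b)
        b = a + 1;                      /* t2: consume a, produce b */

        #pragma omp task depend(in: a) depend(out: c)
        c = a * 2;                      /* t3: consume a, produce c */

        #pragma omp task depend(in: b, c)
        printf("b + c = %d\n", b + c);  /* t4: joins both branches */
    }
    return 0;
}
```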
In very large-scale high-performance machines and in Exascale systems, the runtime systems are more complex than in traditional parallel computers. In fact, performance and scalability issues must be addressed at the inter-node runtime level and appropriately integrated with intra-node runtime mechanisms [25]. All these issues relate to system and application scalability. In fact, vertical scaling of systems with multicore parallelism within a single node must be addressed. Scalability is still an open issue in Exascale systems also because speed-up requirements for system software and runtimes are much higher than in traditional HPC systems, and different portions of code in applications or runtimes can generate performance bottlenecks.
Concerning application resiliency, the runtime of Exascale systems must include mechanisms for restarting tasks and accessing data in case of software or hardware faults without requiring developer involvement. Traditional approaches for providing reliability in HPC include checkpointing and restart (see, for instance, MPI_Checkpoint), reliable data storage (through file and in-memory replication or double buffering), and message logging for minimizing the checkpointing overhead. Whereas the global checkpointing/restart technique is the most used to limit system/application faults, in the Exascale scenario new mechanisms with low overhead and high scalability must be designed. These mechanisms should limit task and data duplication through smart approaches for selective replication. For example, silent data corruption (SDC) is recognized to be a critical problem in Exascale computing. However, although replication is useful, its inherent inefficiency must be limited. Research work is carried out in this area to define techniques that limit replication costs while offering protection from SDC. For application/task checkpointing, instead of checkpointing the entire address space of the application, as occurs in OpenMP and MPI, the minimal state of the tasks needed for fault recovery must be identified and checkpointed, thus limiting data size and recovery overhead.
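As a minimal, self-contained illustration of this idea (a hypothetical sketch, not an Exascale-grade mechanism), the following C program checkpoints only the few variables needed to resume a task, rather than its entire address space; the file name task42.ckpt and the task_state_t structure are invented for the example:

```c
#include <stdio.h>

/* Minimal task state: only what is needed to resume the task,
   not its entire address space (hypothetical structure). */
typedef struct { long next_item; double partial_result; } task_state_t;

static int save_state(const task_state_t *s, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int load_state(task_state_t *s, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;                 /* no checkpoint: start fresh */
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void) {
    task_state_t st = {0, 0.0};
    load_state(&st, "task42.ckpt");    /* resume if a checkpoint exists */

    for (long i = st.next_item; i < 1000000; i++) {
        st.partial_result += (double)i;     /* the actual task work */
        if (i % 100000 == 0) {              /* periodic, low-overhead save */
            st.next_item = i + 1;
            save_state(&st, "task42.ckpt");
        }
    }
    printf("result = %f\n", st.partial_result);
    return 0;
}
```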

Requirements of exascale runtime for data analysis
One of the most important aspects to ponder in applications that run on Exascale systems and analyze big datasets is the tradeoff between sharing data among processing elements and computing locally, so as to reduce communication and energy costs while keeping performance and fault-tolerance levels. A scalable programming model founded on basic operations for data-intensive/data-driven applications must include mechanisms and operations for:

• Parallel data access, which allows increasing data access bandwidth by partitioning data into multiple chunks, according to different methods, and accessing several data elements in parallel to meet high-throughput requirements.

• Fault resiliency, which is a major issue as machines expand in size and complexity. On Exascale systems with a huge number of processes, non-local communication must be prepared for a potential failure of one of the communication sides; runtimes must feature failure-handling mechanisms for recovering from node and communication faults.

• Data-driven local communication, which is useful to limit the data exchange overhead in massively parallel systems composed of many cores; in this case data availability among neighbor nodes dictates the operations taken by those nodes.

• Data processing on limited groups of cores, which allows concentrating data analysis operations involving limited sets of cores and large amounts of data on localities of Exascale machines, facilitating a form of data affinity that co-locates related data and computation.

• Near-data synchronization, to limit the overhead generated by synchronization mechanisms and protocols that involve several distant cores in keeping data up to date.

• In-memory querying and analytics, needed to reduce query response times and the execution time of analytics operations by caching large volumes of data in the computing nodes' RAM and issuing queries and other operations in parallel on the main memory of computing nodes.

• Group-level data aggregation in parallel systems, which is useful for efficient summarization, graph traversal, and matrix operations, and is therefore of great importance in programming models for data analysis on massively parallel systems (a sketch follows this list).

• Locality-based data selection and classification, for limiting the latency of basic data analysis operations running in parallel on large-scale machines, in such a way that the subsets of data needed together in a given phase are locally available (in a subset of nearby cores).

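As announced in the list above, the following minimal sketch (assuming MPI as the underlying runtime) combines data processing on limited groups of cores with group-level data aggregation: ranks are split into small locality groups and each group reduces its data internally, with no global synchronization; the group size is invented for the example:

```c
#include <mpi.h>
#include <stdio.h>

#define GROUP_SIZE 4   /* hypothetical locality group, e.g., one node */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split the global communicator into small locality groups;
       all aggregation below stays inside one group. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP_SIZE, rank, &group);

    double local_value = (double)rank;      /* toy per-core datum */
    double group_sum = 0.0;
    /* Group-level aggregation: no global communication involved. */
    MPI_Reduce(&local_value, &group_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, group);

    int grank;
    MPI_Comm_rank(group, &grank);
    if (grank == 0)
        printf("group of rank %d: sum = %f\n", rank, group_sum);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```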
A reliable and high-level programming model and its associated runtime must be able to manage and provide implementation solutions for those operations, together with the reliable exploitation of a very large amount of parallelism.
Real-world big data analysis applications cannot be practically solved on sequential machines. If we refer to real-world applications, every large-scale data mining and machine learning software system that today is under development in the areas of social data analysis and bioinformatics will certainly benefit from the availability of Exascale computing systems and from the use of Exascale programming environments that will offer massive and adaptive-grain parallelism, data locality, and local communication and synchronization mechanisms, together with the other features discussed in the previous sections, which are needed for reducing execution time and making feasible the solution of new problems and challenges. For example, in bioinformatics applications parallel data partitioning is a key feature for running statistical analysis or machine learning algorithms on high performance computing systems. After that, clever and complex data mining algorithms must be run on each single core/node of an Exascale machine on subsets of data to produce data models in parallel. When partial models are produced, they can be checked locally and must be merged among nearby processors to obtain, for example, a general model of gene expression correlations or of drug-gene interactions. Therefore, for those applications, data locality, highly parallel correlation algorithms, and limited communication structures are very important to reduce execution time from several days to a few minutes. Moreover, fault-tolerance software mechanisms are also useful in long-running bioinformatics applications to avoid restarting them from the beginning when a software/hardware failure occurs.
Moving to social media applications, nowadays the huge volume of user-generated data in social media platforms, such as Facebook, Twitter and Instagram, is a very precious source of data from which to extract insights concerning human dynamics and behaviors. In fact, social media analysis is a fast-growing research area that will benefit from the use of Exascale computing systems. For example, social media users moving through a sequence of places in a city or a region may create a huge amount of geo-referenced data that includes extensive knowledge about human dynamics and mobility behaviors. A methodology for discovering the behavior and mobility patterns of users from social media posts and tweets includes a set of steps such as collection and pre-processing of geotagged items, organization of the input dataset, data analysis and trajectory mining algorithm execution, and results visualization. In all those data analysis steps, the utilization of scalable programming techniques and tools is vital to obtain practical results in feasible time when massive datasets are analyzed. The Exascale programming features and requirements discussed here and in the previous sections will be very useful in social data analysis for executing parallel tasks like concurrent data acquisition (where data items are collected by exploiting parallel queries on different data sources), parallel data filtering and data partitioning through local and in-memory algorithms, and classification, clustering, and association mining algorithms, which are very compute-intensive and need a large number of processing elements working asynchronously to produce learning models from billions of posts containing text, photos, and videos. The management and processing of Terabytes of data involved in those applications cannot be done efficiently without solving issues like data locality, near-data processing, and large-scale asynchronous execution, together with the other ones addressed in Exascale computing systems.

Together with an accurate modeling of basic operations and of the programming languages/APIs that include them, supporting correct and effective data-intensive applications on Exascale systems will also require a significant programming effort from developers when they need to implement complex algorithms and data-driven applications such as those used, for example, in big data analysis and distributed data mining. Parallel and distributed data mining strategies, like
• collective learning,
• meta-learning, and
• ensemble learning,

must be devised using fine-grain parallel approaches to be adapted to Exascale computers. Programmers must be able to design and implement scalable algorithms by using the operations sketched above, specifically adapted to those new systems. To reach this goal, a coordinated effort between the operation/language designers and the application developers would be very fruitful.

In Exascale systems, the cost of accessing, moving, and processing data across a parallel system is enormous [24, 30]. This requires mechanisms, techniques, and operations for capable data access, placement, and querying. In addition, scalable operations must be designed in such a way as to avoid global synchronizations, centralized control, and global communications. Many data scientists want to be abstracted away from these tricky, lower-level aspects of HPC, at least until they have their code working, and then potentially to tweak communication and distribution choices in a high-level manner in order to further tune their code. Interoperability and integration with the MapReduce model and MPI must be investigated, with the main goal of achieving scalability on large-scale data processing.
Different data-driven abstractions can be combined to provide a programming model and an API that allow the reliable and productive programming of very large-scale heterogeneous and distributed-memory systems. In order to simplify the development of applications in heterogeneous distributed-memory environments, large-scale data parallelism can be exploited on top of the abstraction of n-dimensional arrays subdivided into partitions, so that different array partitions are placed on different cores/nodes that process them in parallel. This approach allows the computing nodes to process data partitions in parallel at each core/node using a set of statements/library calls that hide the complexity of the underlying process. Data dependency in this scenario limits scalability, so it should be avoided or limited to a local scale.
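A minimal sketch of this partitioned-array abstraction follows, assuming MPI and a one-dimensional array whose size divides evenly among ranks; each core/node receives one partition, processes it locally, and the partitions are gathered back:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* global array size, assumed divisible by #ranks */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *global = NULL;
    double *part = malloc(chunk * sizeof(double));
    if (rank == 0) {
        global = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) global[i] = (double)i;
    }

    /* Place a different array partition on each core/node. */
    MPI_Scatter(global, chunk, MPI_DOUBLE,
                part,   chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each node processes only its own partition, in parallel. */
    for (int i = 0; i < chunk; i++)
        part[i] = part[i] * part[i];

    MPI_Gather(part, chunk, MPI_DOUBLE,
               global, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global[N-1] = %f\n", global[N - 1]);

    free(part);
    free(global);
    MPI_Finalize();
    return 0;
}
```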
Abstract data types, provided by libraries so that they can be easily integrated into existing applications, should support this abstraction. As we mentioned above, another issue is the gap between users with HPC needs and experts with the skills to make the most of these technologies. An appropriate directive-based approach is to design, implement, and evaluate a compiler framework that allows generic translations from high-level languages to Exascale heterogeneous platforms. A programming model should be designed at a level higher than that of standards such as OpenCL, including also checkpointing and fault resiliency. Efforts must be made to show the feasibility of transparent checkpointing of Exascale programs and to quantitatively evaluate the runtime overhead. Approaches like CheCL show that it is possible to enable transparent checkpoint and restart in high-performance and dependable GPU computing, including support for process migration among different processors, such as between a CPU and a GPU.
The model should enable rapid development with reduced effort for different heterogeneous platforms. These heterogeneous platforms need to include low-energy architectures and mobile devices. The new model should allow a preliminary evaluation of results on the target architectures.


Concluding remarks and future work
Cloud-based solutions for big data analysis tools and systems are in an advanced phase both on the research and on the commercial side. On the other hand, new Exascale hardware/software solutions must be studied and designed to allow the mining of very large-scale datasets on those new platforms.
Exascale systems raise new requirements for application developers and programming systems that target architectures composed of a very large number of homogeneous and heterogeneous cores. General issues like energy consumption, multitasking, scheduling, reproducibility, and resiliency must be addressed together with other data-oriented issues like data distribution and mapping, data access, and data communication and synchronization. Programming constructs and runtime systems will play a crucial role in enabling future data analysis programming models, runtime models, and hardware platforms to address these challenges, and in supporting the scalable implementation of real big data analysis applications.
In particular, here we summarize a set of open design challenges that are critical for designing Exascale programming systems and for their scalable implementation. The following design choices, among others, must be taken into account:

• Application reliability: Data analysis programming models must include constructs and/or mechanisms for handling task and data access failures and for recovering from them. As data analysis platforms grow ever larger, fully reliable operation cannot be taken for granted, and the implicit assumption of reliability becomes less credible; therefore, explicit solutions must be proposed.

• Reproducibility requirements: Big data analysis running on massively parallel systems demands reproducibility. New data analysis programming frameworks must collect and generate metadata and provenance information about algorithm characteristics, software configuration, and execution environment, to support application reproducibility on large-scale computing platforms.

• Communication mechanisms: Novel approaches must be devised for facing network unreliability [7] and network latency, for example by expressing asynchronous data communications and locality-based data exchange/sharing.

• Communication patterns: A correct paradigm design should include communication patterns that allow application-dependent features and data access models, limiting data movement and simplifying the burden on Exascale runtimes and interconnections.

• Data handling and sharing patterns: Data locality mechanisms/constructs, like near-data computing, must be designed and evaluated on big data applications where subsets of data are stored in nearby processors, while avoiding imposing locality when data must be moved. Other challenges concern data affinity control, data querying (NoSQL approaches), and global data distribution and sharing patterns.

• Data-parallel constructs: Useful models like data-driven/data-centric constructs, dataflow parallel operations, independent data parallelism, and SPMD patterns must be deeply considered and studied.

• Grain of parallelism: Everything from very fine-grain to process-grain parallelism must be analyzed, also in combination with the different degrees of parallelism that Exascale hardware supports. Perhaps different grain sizes should be considered within a single model to address hardware needs and heterogeneity.

Finally, since big data mining algorithms often require the exchange of raw data or, better, of mining parameters and partial models, to achieve scalability and reliability on thousands of processing elements, metadata-based information, limited-communication programming mechanisms, and partition-based data structures with associated parallel operations must be proposed and implemented.

Endnotes
1. https://www.ibm.com/annualreport/2013/bin/assets/2013_ibm_annual.pdf

Appendix
Scalability in parallel systems
Parallel computing systems aim at usefully employing all their processing elements during application execution. Indeed, only an ideal parallel system can do that fully, because of sequential times that cannot be parallelized (as Amdahl's law suggests [35]) and because of several sources of overhead, such as sequential operations, communication, synchronization, I/O and memory access, network speed, I/O system speed, hardware and software failures, problem size, and program input. All the issues related to the ability of parallel systems to fully exploit their resources are referred to as system or program scalability [36].
The scalability of a parallel computing system is a measure of its capacity to reduce program execution time in proportion to the number of its processing elements. According to this definition, scalable computing refers to the ability of a hardware/software parallel system to exploit increasing computing resources effectively in the execution of a software application [37].
Despite the difficulties that can be faced in the parallel implementation of an application, a framework, or a programming system, a scalable parallel computation can always be made cost-optimal if the number of processing elements, the size of memory, the network bandwidth, and the size of the problem are chosen appropriately.
For evaluating and measuring the scalability of a parallel program, some metrics have been defined and are largely used: parallel runtime T(p), speedup S(p), and efficiency E(p). Parallel runtime is the total processing time of the program using p processors (with p > 1). Speedup is the ratio between the total processing time of the program on one processor and the total processing time on p processors: S(p) = T(1)/T(p). Efficiency is the ratio between the speedup and the total number of used processors: E(p) = S(p)/p.
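These definitions are straightforward to apply; the following small C program, with hypothetical timing values, illustrates them:

```c
#include <stdio.h>

/* Speedup and efficiency as defined above:
   S(p) = T(1)/T(p),  E(p) = S(p)/p. */
int main(void) {
    double t1   = 120.0;                /* hypothetical T(1), seconds */
    double tp[] = {64.0, 34.0, 20.0};   /* hypothetical T(p) readings */
    int    p[]  = {2, 4, 8};

    for (int i = 0; i < 3; i++) {
        double s = t1 / tp[i];          /* speedup    S(p) */
        double e = s / p[i];            /* efficiency E(p) */
        printf("p=%d  S=%.2f  E=%.2f\n", p[i], s, e);
    }
    return 0;
}
```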
Application scalability is influenced by the available hardware and software resources, their performance and reliability, and by the sources of overhead discussed before. In particular, the scalability of data analysis applications is tightly related to the exploitation of parallelism in data-driven operations and to the overhead generated by data management mechanisms and techniques. Moreover, application scalability also depends on the programmer's ability to design algorithms that reduce sequential time and exploit parallel operations. Finally, the instruction designers and the runtime implementers contribute to the exploitation of scalability [38]. All these arguments mean that, for realizing Exascale computing in practice, many issues and aspects must be taken into account by considering all the layers of the hardware/software stack involved in the execution of Exascale programs.
In addressing parallel system scalability, system dependability must also be tackled. As the number of processors and network interconnections increases, and as tasks, threads, and message exchanges increase, the rate of failures and faults increases too [39]. As discussed in reference [40], the design of scalable parallel systems requires assuring system dependability. Therefore, understanding failure characteristics is a key issue for coupling high performance and reliability in massively parallel systems at Exascale size.

Abbreviations
APGAS: Asynchronous partitioned global address space; BSP: Bulk Synchronous Parallel; CAF: Co-Array Fortran; DAaaS: Data analysis as a service; DAIaaS: Data analysis infrastructure as a service; DAPaaS: Data analysis platform as a service; DASaaS: Data analysis software as a service; DMCF: Data Mining Cloud Framework; ECL: Enterprise Control Language; ESnet: Energy Sciences Network; GA: Global Array; HPC: High performance computing; IaaS: Infrastructure as a service; JS4Cloud: JavaScript for Cloud; PaaS: Platform as a service; PGAS: Partitioned global address space; RDD: Resilient distributed dataset; SaaS: Software as a service; SOA: Service-oriented architecture; TBB: Threading Building Blocks; VL4Cloud: Visual Language for Cloud; XaaS: Everything as a service

Acknowledgments
Not applicable.

Funding
This work has been partially funded by the ASPIDE project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 801091.

Availability of data and materials
Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Authors' contributions
DT carried out all the work presented in the paper. The author read and approved the final manuscript.

Authors' information
Not applicable.

Competing interests
The author declares that he has no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 23 October 2018 Accepted: 29 January 2019

References
1. Amarasinghe S et al (2009) Exascale software study: software challenges in extreme-scale systems. Defense Advanced Research Projects Agency, Arlington, VA, USA
2. Armbrust M et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
3. Bauer M et al (2012) Legion: expressing locality and independence with logical regions. In: Proc. International Conference on Supercomputing. IEEE CS Press
4. Bradford L, Chamberlain CD, Zima HP (2007) Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications 21(3):291–312
5. Diaz J, Munoz-Caro C, Nino A (2012) A survey of parallel programming models and tools in the multi and many-core era. IEEE Trans Parallel Distributed Systems 23(8):1369–1386
6. http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
7. Fekete A, Lynch N, Mansour Y, Spinelli J (1993) The impossibility of implementing reliable communication in the face of crashes. Journal of ACM 40(5):1087–1107
8. Fernandez A et al (2014) Task-based programming with OmpSs and its application. In: Proc. Euro-Par 2014: Parallel Processing Workshops, pp 602–613
9. Gorton I, Greenfield P, Szalay AS, Williams R (2008) Data-intensive computing in the 21st century. Computer 41(4):30–32
10. Grasso I, Pellegrini S, Cosenza B, Fahringer T (2013) LibWater: heterogeneous distributed computing made easy. In: Proc. of the 27th International ACM Conference on Supercomputing (ICS '13), New York, USA, pp 161–172
11. Gropp W, Snir M (2013) Programming for exascale computers. Computing in Science & Eng 15(6):27–35
12. Gu Y, Grossman RL (2009) Sector and Sphere: the design and implementation of a high-performance data cloud. Philosophical Transactions Series A: Mathematical, Physical, and Engineering Sciences 367(1897):2429–2445
13. Lucas R et al (2014) Top ten exascale research challenges. Office of Science, U.S. Department of Energy, Washington, DC
14. Maheshwari K, Montagnat J (2010) Scientific workflow development using both visual and script-based representation. In: Proc. of the Sixth World Congress on Services (SERVICES '10), Washington, DC, USA, pp 328–335
15. Markidis S et al (2016) The EPiGRAM project: preparing parallel programming models for exascale. In: Proc. ISC High Performance 2016 International Workshops, pp 56–58
16. Marozzo F, Talia D, Trunfio P (2015) JS4Cloud: script-based workflow programming for scalable data analysis on cloud platforms. Concurrency and Computation: Practice and Experience 27(17):5214–5237
17. Marozzo F, Talia D, Trunfio P (2013) A cloud framework for big data analytics workflows on Azure. In: Catlett C, Gentzsch W, Grandinetti L, Joubert G, Vazquez-Poletti J (eds) Cloud Computing and Big Data. IOS Press, Advances in Parallel Computing, pp 182–191
18. Marozzo F, Talia D, Trunfio P (2012) P2P-MapReduce: parallel data processing in dynamic cloud environments. J Comput Syst Sci 78(5):1382–1402
19. Meswani MR et al (2012) Tools for benchmarking, tracing, and simulating SHMEM applications. In: Proc. Cray User Group Conference (CUG)
20. Nieplocha J (2006) Advances, applications and performance of the Global Arrays shared memory programming toolkit. International Journal of High Performance Computing Applications 20(2):203–231
21. Nishtala R et al (2011) Tuning collective communication for partitioned global address space programming models. Parallel Comput 37(9):576–591
22. Olston C et al (2008) Pig Latin: a not-so-foreign language for data processing. In: Proc. SIGMOD '08, Vancouver, Canada, pp 1099–1110
23. Petcu D et al (2015) On processing extreme data. Scalable Computing: Practice and Experience 16(4):467–489
24. Reed DA, Dongarra J (2015) Exascale computing and big data. Commun ACM 58(7):56–68
25. Sarkar V, Budimlic Z, Kulkarni M (eds) (2016) Runtime systems report - 2014 Runtime Systems Summit. U.S. Dept of Energy, USA
26. Talia D (2013) Clouds for scalable big data analytics. Computer 46(5):98–101
27. Talia D, Trunfio P, Marozzo F (2015) Data Analysis in the Cloud - Models, Techniques and Applications. Elsevier, Amsterdam, Netherlands
28. Talia D (2015) Making knowledge discovery services scalable on clouds for big data mining. In: Proc. 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM), IEEE Computer Society Press, pp 1–4
29. Tardieu O et al (2014) X10 and APGAS at petascale. In: Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14)
30. U.S. Department of Energy (2013) Synergistic challenges in data-intensive science and exascale computing. Report of the Advanced Scientific Computing Advisory Committee Subcommittee
31. Hwang K (2017) Cloud Computing for Machine Learning and Cognitive Applications. MIT Press
32. Wozniak JM, Wilde M, Foster IT (2014) Language features for scalable distributed-memory dataflow computing. In: Proc. of the Workshop on Data-Flow Execution Models for Extreme-Scale Computing at PACT, Edmonton, Canada
33. Yoo A, Kaplan Y (2009) Evaluating use of data flow systems for large graph analysis. In: Proc. of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS)
34. Zaharia M et al (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65
35. Amdahl GM (1967) Validity of the single-processor approach to achieving large-scale computing capabilities. In: Proc. of AFIPS Conference, Reston, VA, pp 483–485
36. Bailey D (1991) Twelve ways to fool the masses when giving performance results on parallel computers. RNR Technical Report RNR-90-020, NASA Ames Research Center
37. Grama A et al (2003) Introduction to Parallel Computing. Addison Wesley
38. Gustafson JL (1988) Reevaluating Amdahl's law. Commun ACM 31(5):532–533
39. Shi JY et al (2012) Program scalability analysis for HPC cloud: applying Amdahl's law to NAS benchmarks. In: Proc. High Performance Computing, Networking, Storage and Analysis (SCC), IEEE CS, pp 1215–1225
40. Schroeder B, Gibson G (2006) A large-scale study of failures in high-performance computing systems. In: Proc. of the International Conference on Dependable Systems and Networks (DSN 2006), IEEE CS, Philadelphia, PA, pp 25–28