
Data Movement on Emerging Large-Scale Parallel Systems

IVY BO PENG

Doctoral Thesis
School of Computer Science and Communication

Kungliga Tekniska Högskolan
December 2017

TRITA-CSC-A-2017:25
ISSN 1653-5723
ISRN-KTH/CSC/A16/25SE
ISBN 978-91-7729-592-1

Akademisk avhandling som med tillstånd av Kungl Tekniska Högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i datalogi 18 December 2017 i School of Computer Science and Communication, Kungl Tekniska Högskolan, Valhallavägen 79, Stockholm.

© Ivy Bo Peng, December 2017

Tryck: Universitetsservice US-AB

Accepted by School of Computer Science and Communication, Royal Institute of Technology, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Doctoral Committee

Bronis R. de Supinski, Ph.D. (Lawrence Livermore National Laboratory)

Siegfried Benkner, Ph.D. (University of Vienna)

Anne C. Elster, Ph.D. (Norwegian University of Science and Technology)

Christoph Kessler, Ph.D. (Linköping University)

November 11th, 2017

Abstract

Large-scale HPC systems are an important driver for solving computational problems in scientific communities. Next-generation HPC systems will not only grow in scale but also in heterogeneity. This increased system complexity entails more challenges to data movement in HPC applications. Data movement on emerging HPC systems requires asynchronous fine-grained communication and efficient data placement in the main memory.

This thesis proposes an innovative programming model and algorithm to prepare HPC applications for the next computing era: (1) a data streaming model that supports emerging data-intensive applications on supercomputers, (2) a decoupling model that improves parallelism and mitigates the impact of imbalance in applications, (3) a new framework and methodology for predicting the impact of large-scale heterogeneous memory systems on HPC applications, and (4) a data placement algorithm that uses a set of rules and a decision tree to determine the data-to-memory mapping in heterogeneous main memory.

The proposed approaches in this thesis are evaluated on multiple supercomputers with different processors and interconnect networks. The evaluation uses a diverse set of applications that represent conventional scientific applications and emerging data-analytic workloads on HPC systems. The experimental results on the petascale testbed show that the approaches obtain increasing performance improvements as system scale increases, and this trend supports the approaches as a valuable contribution towards future HPC systems.

Sammanfattning

Storskaliga HPC-system är en viktig drivkraft för att lösa datorproblem i vetenskapliga samhällen. Nästa generations HPC-system kommer inte bara att växa i skala utan också i heterogenitet. Denna ökade systemkomplexitet medför flera utmaningar för dataförflyttning i HPC-applikationer. Dataförflyttning på nya HPC-system kräver asynkron, finkornig kommunikation och en effektiv dataplacering i huvudminnet.

Denna avhandling föreslår en innovativ programmeringsmodell och algoritm för att förbereda HPC-applikationer för nästa generation: (1) en dataströmningsmodell som stöder nya dataintensiva applikationer på superdatorer, (2) en kopplingsmodell som förbättrar parallelliteten och minskar obalans i applikationer, (3) en ny metodologi och struktur för att förutse effekten av storskaliga, heterogena minnessystem på HPC-applikationer, och (4) en datalägesalgoritm som använder en uppsättning av regler och ett beslutsträd för att bestämma kartläggningen av data-till-minnet i det heterogena huvudminnet.

Den föreslagna programmeringsmodellen i denna avhandling är utvärderad på flera superdatorer med olika processorer och sammankopplingsnät. Utvärderingen använder en mängd olika applikationer som representerar konventionella vetenskapliga applikationer och nya dataanalyser på HPC-system. Experimentella resultat på testbädden i petaskala visar att programmeringsmodellen förbättrar prestandan när systemskalan ökar. Denna trend indikerar att modellen är ett värdefullt bidrag till framtida HPC-system.

Acknowledgements

First and foremost, I want to express my sincere gratitude towards my supervisors, Prof. Stefano Markidis and Prof. Erwin Laure at KTH, for their continuous support during my Ph.D. study and research. My Ph.D. study would have been impossible without their encouragement, guidance, support, and immense knowledge. I cannot imagine better advisors: they care deeply about their students while also providing enough freedom to explore the research field. Their influence extends beyond my research career and will continue beyond my graduation.

I want to thank Dr. Roberto Gioiosa and Dr. Gokcen Kestor at PNNL for their tremendous mentorship during my extended stay at PNNL. They guided me through the shift to a completely new topic and helped me build my understanding and knowledge base step by step. They encouraged me to pursue my enthusiasm and to hone my skills as a researcher. Always patient and cheerful, they are great advisors, colleagues, and friends. My colleagues and friends at PNNL, John Feo, Ryan Friese, Burcu Mutlu, Marco Minutoli, Vito Giovanni Castellana, and Antonino Tumeo, made my life in Richland most enjoyable!

My mentor Dr. Jeffrey Vetter and my colleagues Dr. Shirley Moore and Dr. Seyong Lee have given me great help and support since my relocation to ORNL. Their expertise in different areas, experience, and vision have broadened my horizon in research. I also want to thank Dr. Pietro Cicotti at the San Diego Supercomputer Center for his great help.

My thanks also go to Dr. Andris Vaivads, Dr. Yuri Khotyaintsev, Elin Eriksson, and Andreas Johlander at the Swedish Institute of Space Physics for sharing their extensive knowledge in plasma physics. I also thank Dr. Gian Luca Delzanno for his mentoring and guidance during my short stay at LANL. I feel fortunate to have had the chance to work with Prof. Gábor Tóth at the University of Michigan in the past years.

Besides my advisors and colleagues, I would like to thank Dr. Bronis R. de Supinski for being the opponent, and Prof. Dr. Anne C. Elster, Prof. Dr. Christoph Kessler, and Prof. Dr. Siegfried Benkner for serving on the grading committee and for their encouragement, insightful comments, and hard questions. I also want to thank the Advance Reviewer Johan Hoffman for his constructive comments that helped me reshape my thesis.

Contents

1 Introduction
   1.1 Thesis Outline
   1.2 List of Publications

2 Preliminaries
   2.1 Exascale Computing
      2.1.1 Exascale Challenges
      2.1.2 The Trend of Future Architectures
      2.1.3 Applications on Supercomputers
   2.2 Data Movement on HPC Systems
      2.2.1 Impact of Memory Subsystem
      2.2.2 Impact of Communication
   2.3 Programming Models for Exascale Computing
      2.3.1 Disruptive and Incremental Directions
      2.3.2 Desirable Features of Programming Models
   2.4 Our Approaches

I Data Streams for Fine-grained Asynchronous Communication

3 A Data Streaming Model
   3.1 Contributions
   3.2 Motivation
   3.3 A Data Streaming Model For HPC Systems
   3.4 Design
   3.5 Implementation
   3.6 Evaluation
   3.7 Use Case in HPC Applications

4 Characterizing Streaming Computing
   4.1 Contributions
   4.2 Motivation
   4.3 Streaming Graph Topologies
   4.4 Streaming Metrics
   4.5 Case Study on a Cray XC40 Supercomputer

5 A Decoupling Model
   5.1 Contributions
   5.2 Motivation
   5.3 A Stream-based Decoupling Model
   5.4 The Performance Model
   5.5 Implementation in HPC Applications
   5.6 Evaluation

6 Data Streams in HPC Applications
   6.1 Contributions
   6.2 Coupling and Decoupling with Streams
   6.3 Emerging Workloads
   6.4 A MapReduce Application
   6.5 Coupling LHC with an Online Classifier

II Data Placement on Heterogeneous Memory Systems

7 Emerging Heterogeneous-Memory Supercomputers
   7.1 Non-volatile Memories
   7.2 3D-Stacked Memories
   7.3 Heterogeneous Memory Systems

8 The Impacts of HMS on HPC Applications
   8.1 Contributions
   8.2 Related Works
   8.3 A NUMA-Based Emulation Approach
   8.4 Performance Impact of HMS
   8.5 Key Performance Metrics

9 Data Placement on an HBM-DRAM System
   9.1 Contributions
   9.2 Data Placement on the KNL Processor
   9.3 Characteristics of the HBM-DRAM Memory
   9.4 Key Factors for Placing Data on KNL

10 A Data Placement Algorithm for HMS
   10.1 Contributions
   10.2 Related Work
   10.3 A Data Placement Algorithm
      10.3.1 Preliminaries
      10.3.2 Single Data Object Placement
      10.3.3 Global Data Object Placement
   10.4 The Design and Implementation of RTHMS
   10.5 Evaluation on HPC Applications

11 Future Works

12 Conclusions

My Contributions to the Papers

Bibliography

Glossary

Chapter 1

Introduction

Current high-performance computing (HPC) systems can deliver a performance of petaflops, i.e., 10^15 floating-point operations per second. This tremendous computing power has become the cornerstone for many scientific communities to solve large, complex problems that are infeasible on any other computing systems. Long-term advances in science, engineering, and society demand ever-increasing computing capability. This motivates exascale computing, capable of 10^18 floating-point operations per second, which is expected to arrive in the next five to ten years. Today, efficient execution of large-scale applications on petascale systems is already a complex task. On future exascale systems, this task will be further complicated by increasing system complexity and evolving application characteristics. In this thesis, we focus on optimisations of data movement on future large-scale HPC systems. Our work assumes that exascale systems will be distributed-memory machines that consist of a large number of compute nodes, each equipped with multiple memory technologies side by side.

Large-scale HPC systems today are often implemented with tens of thousands of compute nodes interconnected by high-speed networks. The large aggregated memory capacity and the high aggregated throughput together make solutions of large-scale problems feasible. Data movement on such systems is deeply hierarchical: it involves a multi-level memory hierarchy inside each compute node as well as communication across compute nodes. As a result, the data movement in an application needs to consider both parts. In fact, for a given problem on a given system, the intra-node data movement and the inter-node communication are closely related. On the one hand, the number of compute nodes cannot be smaller than the number of nodes whose aggregated memory can hold the given problem. On the other hand, using more compute nodes can either improve or degrade application performance, depending on the scalability of the application. Today, MPI is the most widely adopted programming model on HPC systems. Other programming models, such as the PGAS languages and the HPCS languages, can also use MPI operations as building blocks. Thus, this thesis mostly focuses on MPI for the communication in applications.

We propose using data streams for inter-node data movement to take advantage of the streaming paradigm for improved parallelism of an application. Extensive research has identified the limitations of current coarse-grained synchronous programming models for scaling to large-scale systems. For instance, a small load imbalance or synchronization overhead can degrade the overall performance of an application. Others have observed that exascale computing needs programming models with asynchronous parallelism [18] and fine-grained communication for irregular applications [170]. Following this path, we use fine-grained data streams to enable an asynchronous decoupling model. This model separates suitable operations onto groups of processes and supports parallel progression of general operations. It also reduces the cost of operations whose complexity depends on the number of processes when they are decoupled to a small group. Furthermore, the data flow between groups reduces the impact of process imbalance, which results from waiting for a delayed communication peer. We show that data streams are not only useful for decoupling operations within one application but can also be used for coupling multiple workloads on supercomputers. We demonstrate through concrete use cases that emerging workloads can be supported on supercomputers with data streams. For inter-node data movement, we made the following contributions to address specific scientific questions:

What functionalities are required for supporting a data streaming model on supercomputers? We evaluate the feasibility of implementing a data streaming model atop MPI in Chapter 3 and Paper 1 [125]. We analyze the desired features of a data streaming model on HPC systems and map these features to the de-facto programming system on supercomputers. We present several implementation strategies to bridge the gap between the streaming model and the functionalities in MPI. We provide a proof-of-concept library implementation atop MPI and a parallel version of the STREAM benchmark for performance evaluation. To identify the performance factors of streaming computing, we perform parametric experiments on supercomputers with different configurations. Our results show that the performance of stream processing is impacted by the granularity of stream elements, the ratio between groups, and the underlying interconnect hardware.
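
For illustration only, the following minimal C program shows the kind of fine-grained, asynchronous message exchange that underlies such a streaming model; it is not the interface of the library developed in Paper 1, and the element granularity, message tag, and rank roles are assumptions chosen for the example.

    #include <mpi.h>
    #include <stdlib.h>

    #define ELEM_DOUBLES 1024   /* hypothetical stream-element granularity */
    #define STREAM_TAG   42     /* hypothetical tag reserved for stream traffic */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *elem = malloc(ELEM_DOUBLES * sizeof(double));
        int nelems = 100;                    /* number of stream elements to move */

        if (rank == 0) {                     /* data (producer) process */
            for (int i = 0; i < nelems; i++) {
                MPI_Request req;
                /* asynchronous push of one fine-grained stream element */
                MPI_Isend(elem, ELEM_DOUBLES, MPI_DOUBLE, 1, STREAM_TAG,
                          MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer reusable afterwards */
            }
        } else if (rank == 1) {              /* consumer process */
            for (int i = 0; i < nelems; i++) {
                MPI_Recv(elem, ELEM_DOUBLES, MPI_DOUBLE, 0, STREAM_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                /* ... operate on the received stream element here ... */
            }
        }

        free(elem);
        MPI_Finalize();
        return 0;
    }

The sketch needs at least two MPI processes, e.g. mpirun -np 2 ./stream_sketch.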

What performance metrics are critical for characterizing the performance of streaming computing on supercomputers? We define a set of performance metrics that capture the peak and sustainable performance of general configurations in Chapter 4 and Paper 2 [103]. We show that a general streaming topology can be configured with multi-level tree or linear chain structures. On a petascale supercomputer, we evaluate the peak and sustained injection rates and processing rates of representative configurations. In addition, we also quantify the loss of performance as the number of levels increases. Our results indicate that streaming computing on supercomputers consists of a transient phase and a sustainable phase. The peak injection rate and processing rate can differ during the transient phase. Eventually, the sustainable injection rate and processing rate reach equilibrium during the sustainable phase. This equilibrium also holds between the head and tail stages of multi-level configurations.
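
These quantities can be written compactly. As a sketch consistent with the description above (the symbols are introduced here for illustration and are not necessarily the notation used in Paper 2), let N_inj(t) denote the number of stream elements injected by the source stage and N_proc(t) the number of elements processed by a downstream stage after time t. Then

    R_{\mathrm{inj}}(t) = \frac{N_{\mathrm{inj}}(t)}{t}, \qquad
    R_{\mathrm{proc}}(t) = \frac{N_{\mathrm{proc}}(t)}{t},

and the sustainable phase is reached when R_inj and R_proc settle to a common equilibrium value, while the peak rates observed during the transient phase may differ.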

How do we design an intermediate model between the disruptive and incremental approaches to prepare HPC applications for the exascale era? We formalize a decoupling model in Chapter 5 and Paper 3 [123] that supports easy adaptation of existing MPI applications to enable a streaming processing paradigm among groups of processes. We argue that deeper parallelism can be exposed if multiple operations within an application are decoupled onto multiple groups of processes. In this work, we provide a performance model to guide programmers in identifying suitable operations to be decoupled. We provide a stream-based implementation to enable asynchronous fine-grained data flows between groups, which can effectively mitigate the impact of process imbalance. We performed four case studies on both scientific and data-analytics applications to demonstrate the performance advantage of the decoupling model in large-scale applications. Our results show that the operation decoupling and data mapping approach can effectively optimize application performance on large-scale systems, achieving up to 3x improvement on 8,192 processes of a petascale supercomputer when following the guideline to select use cases.
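
A minimal sketch of the group formation behind such a decoupling, using only standard MPI communicator splitting; the 1:8 compute-to-helper ratio and the role assignment are assumptions made for this example, not a prescription of the model.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int world_rank, world_size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Decouple roughly one eighth of the ranks to progress an auxiliary
           operation (e.g., I/O or a reduction); the remaining ranks compute. */
        int helper = (world_rank >= world_size - world_size / 8);
        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, helper, world_rank, &group_comm);

        if (helper) {
            /* ... receive data streamed from the compute group and progress
                   the decoupled operation asynchronously ... */
        } else {
            /* ... main computation; stream intermediate data to the helpers ... */
        }

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }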

What applications and workloads can be enabled with data streams on supercomputers? We showcase different applications of data streams in Chapter 6 and Paper 4 [122]. We first implement a data-intensive application using data streams to pipeline Map and Reduce operations on groups of processes. We then use data streams to couple a large-scale scientific application with a separate I/O application for high-frequency visualization. Finally, we couple a detector with a machine-learning classifier via data streams to mimic online processing of Large Hadron Collider events. We ran scaling tests of these use cases on a petascale supercomputer. Our results show that data streams can flexibly support different emerging workloads on large-scale HPC systems with bounded memory consumption, meeting the performance requirements of practical use cases.

Data movement within one compute node is significantly impacted by the memory subsystem. Recent hardware trends indicate that it is unrealistic to rely on DRAM as the sole main memory technology for exascale systems. Future HPC systems are likely to be equipped with multiple memory technologies, such as high-bandwidth memory (HBM) that is embedded on-package and non-volatile memory (NVRAM) that does not require refresh power. This shift in the memory subsystem will further deepen the memory hierarchy within a compute node. As a result, intra-node data movement needs to consider the different characteristics of memory technologies. We made three main contributions to optimize intra-node data movement on heterogeneous-memory supercomputers. First, we introduce an emulation approach for evaluating the impact of different memory systems on application scalability. Second, we design sets of experiments on real hardware to identify key factors for application performance on HBM-DRAM main memory. Finally, we propose an algorithm for data placement on heterogeneous memory systems. For intra-node data movement, we addressed the following scientific questions:

At large scale, how will application performance be impacted when moving from uniform main memory to heterogeneous main memory? We emulate a thin-node system with uniform memory and a fat-node system with heterogeneous memories to evaluate the impact of future memory systems on representative applications in Chapter 8 and Paper 5 [127]. We propose a systematic methodology that uses the fast and slow access latency and bandwidth of different NUMA domains to emulate large-scale thin-node and fat-node architectures. We identify three key application metrics from performance counters to estimate the performance improvement or degradation on emerging memory systems. Our results show that irregular applications benefit the most from the large-capacity memory systems, reaching up to 2.74x performance improvement, even though a portion of the memory is slower. On the other hand, regular CPU-intensive applications with good cache locality will sustain their current performance when ported to heterogeneous memory systems.
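
As a rough approximation of this emulation methodology (not the exact experimental setup of Paper 5), the "slow" part of a heterogeneous main memory can be mimicked on a dual-socket node by binding selected allocations to the remote NUMA domain with libnuma; the domain IDs and sizes below are assumptions for a typical two-domain system.

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma is not available on this system\n");
            return 1;
        }

        size_t fast_bytes = 1UL << 30;   /* 1 GiB of "fast" memory */
        size_t slow_bytes = 4UL << 30;   /* 4 GiB of "slow" memory */

        /* Local domain 0 plays the fast memory, remote domain 1 the slow one. */
        double *fast = numa_alloc_onnode(fast_bytes, 0);
        double *slow = numa_alloc_onnode(slow_bytes, 1);

        /* ... place latency/bandwidth-critical data in 'fast' and
               capacity-driven data in 'slow', then run the workload ... */

        numa_free(fast, fast_bytes);
        numa_free(slow, slow_bytes);
        return 0;
    }

Compile with -lnuma; the process should also be pinned to the cores of domain 0 so that accesses to domain 1 actually pay the remote latency and bandwidth penalty.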

What factors are critical for applications to exploit a memory system comprised of on-package HBM and off-package DRAM? We perform a case study on real hardware that features a high-bandwidth memory with small capacity and a normal-bandwidth memory with larger capacity in Chapter 9 and Paper 6 [120]. We first characterize the access latency and memory bandwidth of the three memory configurations, varying the data size and the number of hardware threads. We select a diverse set of scientific and data-analytic applications that represent the spectrum of applications on exascale systems. We quantify the impact of access pattern, data locality, and hardware threads on characteristically different applications. Our results show that high-bandwidth memory can deliver up to 5x memory bandwidth but also incurs 15% higher latency. Applications with regular memory accesses and problem sizes small enough to fit into HBM benefit the most from direct data placement on HBM. Applications with problem sizes larger than HBM need to consider whether data reuse in HBM-backed cache can compensate for the increased caching overhead. Applications that are bound by memory latency can only benefit from HBM when multiple hardware threads per core can hide the higher latency of HBM compared to DRAM.
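
For concreteness, the following is a minimal sketch of explicit HBM placement on a KNL node in flat mode using the memkind library's hbwmalloc interface; the array size is a placeholder and the snippet is illustrative rather than taken from the benchmarks in Paper 6.

    #include <hbwmalloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1UL << 26;                /* hypothetical working-set size */
        int hbm = (hbw_check_available() == 0);

        /* Allocate in MCDRAM (HBM) when present, otherwise fall back to DRAM. */
        double *a = hbm ? hbw_malloc(n * sizeof(double))
                        : malloc(n * sizeof(double));
        if (a == NULL)
            return 1;

        for (size_t i = 0; i < n; i++)       /* bandwidth-bound streaming access */
            a[i] = 2.0 * (double)i;

        printf("allocated in %s\n", hbm ? "HBM" : "DRAM");

        if (hbm) hbw_free(a);
        else     free(a);
        return 0;
    }

Link with -lmemkind.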

How do we develop an algorithm and tools for optimizing data placement on a heterogeneous memory system? We provide an algorithm that maps each data structure to a suitable memory in a memory system based on a set of rules in Chapter 10 and Paper 7 [121]. We define a memory object as an allocation on a contiguous memory space. We first present a set of rules to determine the preference of a memory object for each of the memories available in a heterogeneous-memory compute node. Then, we determine the global allocation of all memory objects, considering their size, their performance impact, and their interactions, as well as the characteristics of the target system. We also provide a tool that implements the algorithm and gives recommendations on source code changes to the programmer. We evaluate our algorithm by applying the tool to representative benchmarks and mini-applications and comparing the tool's recommendations with alternative memory configurations. Our results show that the recommendations from the algorithm either outperform or match the best manual allocations in a variety of workloads.
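
To make the idea concrete, the following deliberately simplified C sketch scores a few hypothetical memory objects and greedily fills a capacity-limited fast memory; it is a stand-in for illustration and not the actual rule set or decision tree implemented in RTHMS.

    #include <stdio.h>

    typedef struct {
        const char *name;
        double size_gb;          /* allocation size in GiB                          */
        double bw_sensitivity;   /* 0 = latency-bound, 1 = strongly bandwidth-bound */
        int    in_hbm;           /* placement decision                              */
    } mem_object;

    int main(void)
    {
        mem_object objs[] = {
            { "particles", 10.0, 0.9, 0 },
            { "grid",       2.0, 0.8, 0 },
            { "lookup",     0.5, 0.2, 0 },
        };
        int n = sizeof(objs) / sizeof(objs[0]);
        double hbm_capacity_gb = 16.0;       /* e.g., KNL MCDRAM capacity */

        /* Rank objects by a simple benefit-per-GiB score (selection sort). */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double si = objs[i].bw_sensitivity / objs[i].size_gb;
                double sj = objs[j].bw_sensitivity / objs[j].size_gb;
                if (sj > si) { mem_object t = objs[i]; objs[i] = objs[j]; objs[j] = t; }
            }

        /* Greedily place the best-ranked objects in HBM until it is full. */
        double used = 0.0;
        for (int i = 0; i < n; i++)
            if (used + objs[i].size_gb <= hbm_capacity_gb) {
                objs[i].in_hbm = 1;
                used += objs[i].size_gb;
            }

        for (int i = 0; i < n; i++)
            printf("%-10s -> %s\n", objs[i].name, objs[i].in_hbm ? "HBM" : "DRAM");
        return 0;
    }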

1.1 Thesis Outline

The structure of the thesis is described in the remainder of this section. The papers referred to in this thesis and my specific contributions to each paper are appended to the thesis.

Chapter 2: Preliminaries explains the basic concepts for understanding the work on exascale computing in this thesis. We start with the definition of and motivation for exascale computing and the major areas of challenges. We emphasize the trend of future architectures and application characteristics, which provide the context of data movement on exascale systems. We also introduce the main programming models on HPC systems and highlight the necessity of more asynchronous fine-grained communication for exascale computing.

Part I: Data Streams for Fine-grained Asynchronous Communication presents our works that optimize data movement in communication.

Chapter 3 proposes a data streaming model on supercomputers. We start with the motivation for supporting a streaming paradigm on supercomputers to address emerging workloads and also discuss the difficulties with current approaches. We identify the gap between the desired features of a streaming model and the available functionalities in MPI. Then, we introduce several implementation strategies that bridge this gap. Finally, we provide a parallel STREAM benchmark for evaluating the performance of streaming processing on supercomputers.

Chapter 4 defines a new set of performance metrics for characterizing streaming processing on supercomputers. We use the peak and sustained injection rates and processing rates at the first stage and the last stage of general topologies. We perform a case study on a Cray XC40 machine. We identify the performance metrics at the transient and equilibrium phases of different streaming systems.


Chapter 5 proposes a decoupling model that separates characteristically different operations onto different groups of processes and enables a streaming processing paradigm among these groups. We formalize the proposed model and provide a performance model that guides the selection of proper operations for decoupling. We demonstrate the easy adaptation of existing applications using the interface of an MPI data stream library. We evaluate the proposed approach in four applications on petascale supercomputers.

Chapter 6 extends the use cases of data streams to enable emerging workloads on supercomputers. We use data streams to enable a data-intensive application, to couple two separate applications, and to enable online processing of high-frequency events from a detector. We describe the changes required for implementing the use cases and evaluate the performance on a petaflops supercomputer.

Part II: Data Placement on Heterogeneous Memory Systems presents our works that optimize data placement in main memory.

Chapter 7 introduces recent developments in memory technologies. It introduces non-volatile memory and 3D-stacked memory technologies as well as the programming challenges in heterogeneous memory systems.

Chapter 8 proposes an emulation methodology for large-scale uniform and heterogeneous memory systems. We evaluate the impact of emerging memory systems on scientific and data-intensive applications. We derive three key metrics from performance counters to estimate application performance when porting to heterogeneous memory systems.

Chapter 9 presents a case study on real hardware that features HBM and DRAM. We quantify the impact of data size, access pattern, and the number of hardware threads. We abstract the key factors that are important for applications to exploit such new architectures.

Chapter 10 proposes an algorithm for data placement on heterogeneous memory systems. We use a set of rules and a decision tree to determine the preference of memories for each memory object and their priority in the final decision. We provide an implementation of this algorithm in the Intel Pin framework. We evaluate the algorithm on a set of applications on real hardware.

Chapter 11: Future Works describes the future work that will continue the research presented in this thesis.

Chapter 12: Conclusions concludes the work in this thesis and highlights the key findings.

1.2 List of Publications

This thesis contains an introduction section that summarizes the specific scientific questions addressed in this work and a preliminaries section that explains the fundamental concepts, including the motivation for exascale computing, the technical challenges, and the trends in future hardware, applications, and programming models. The rest of the thesis reviews and extends the approaches and contributions from each paper in the following list:

Paper 1 Ivy Bo Peng, Stefano Markidis, Erwin Laure, Daniel Holmes, and Mark Bull. A Data Streaming Model in MPI. In: Proceedings of the 3rd Workshop on Exascale MPI. ACM. 2015, p. 2

Paper 2 Stefano Markidis, Ivy Bo Peng, Roman Iakymchuk, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. A Performance Characterization of Streaming Computing on Supercomputers. In: Computational Science (ICCS), 2016 International Conference on. IEEE. 2016

Paper 3 Ivy Bo Peng, Stefano Markidis, Roberto Gioiosa, Gokcen Kestor, and Erwin Laure. Preparing HPC Applications for the Exascale Era: A Decoupling Strategy. In: Parallel Processing (ICPP), 2017 46th International Conference on. IEEE. 2017

Paper 4 Ivy Bo Peng, Stefano Markidis, Roberto Gioiosa, Gokcen Kestor, and Erwin Laure. MPI Streams for HPC Applications. In: New Frontiers in High Performance Computing and Big Data. IEEE. 2017


Paper 5 Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. Exploring Application Performance on Emerging Hybrid-Memory Supercomputers. In: High Performance Computing and Communications (HPCC), 2016 IEEE 18th International Conference on. IEEE. 2016, pp. 473-480

Paper 6 Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. Exploring the Performance Benefit of Hybrid Memory System on HPC Environments. In: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE. 2017, pp. 683-692

Paper 7 Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. RTHMS: A Tool for Data Placement on Hybrid Memory System. In: Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management. ACM. 2017, pp. 82-91

The following peer-reviewed papers were also published during my Ph.D. study. The first four publications evaluate the impact of process imbalance on large-scale HPC systems and motivate our proposal of a decoupling model in this thesis. The remaining three publications focus on large-scale computational simulations on supercomputers.

Paper 8 Ivy Bo Peng, Stefano Markidis, and Erwin Laure. The Cost of Synchronizing Imbalanced Processes in Message Passing Systems. In: 2015 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. 2015, pp. 408-417

Paper 9 Stefano Markidis, Juris Vencels, Ivy Bo Peng, Dana Akhmetova, Erwin Laure, and Pierre Henri. Idle Waves in High-Performance Computing. In: Physical Review E 91.1 (2015), p. 013306

Paper 10 Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. Idle Period Propagation in Message-Passing Applications. In: High Performance Computing and Communications (HPCC), 2016 IEEE 18th International Conference on. IEEE. 2016, pp. 937-944

Paper 11 Stefano Markidis, Ivy Bo Peng, Jesper Larsson Träff, Antoine Rougier, Valeria Bartsch, Rui Machado, Mirko Rahn, Alistair Hart, Daniel Holmes, and Mark Bull. The EPiGRAM Project: Preparing Parallel Programming Models for Exascale. In: International Conference on High Performance Computing. Springer International Publishing. 2016, pp. 56-68


Paper 12 Ivy Bo Peng, Juris Vencels, Giovanni Lapenta, Andrey Divin, Andris Vaivads, Erwin Laure, and Stefano Markidis. Energetic Particles in Magnetotail Reconnection. In: Journal of Plasma Physics 81.02 (2015), p. 325810202

Paper 13 Ivy Bo Peng, Stefano Markidis, Andris Vaivads, Juris Vencels, Jorge Amaya, Andrey Divin, Erwin Laure, and Giovanni Lapenta. The Formation of a Magnetosphere with Implicit Particle-in-Cell Simulations. In: Computational Science (ICCS), 2015 International Conference on. IEEE. 2015

Paper 14 Ivy Bo Peng, Stefano Markidis, Erwin Laure, Andreas Johlander, Andris Vaivads, Yuri Khotyaintsev, Pierre Henri, and Giovanni Lapenta. Kinetic Structures of Quasi-Perpendicular Shocks in Global Particle-in-Cell Simulations. In: Physics of Plasmas (1994-present) 22.9 (2015), p. 092109


Chapter 2

Preliminaries

In this chapter, we introduce the basic concepts and the context of this thesis. Our work focuses on data movement in HPC applications on next-generation supercomputers. In the first section, we present exascale computing, including the motivation, the main challenges, the trend of future architectures, and the evolving characteristics of applications. In the second section, we discuss data movement issues on distributed-memory supercomputers, including the impact of communication and main memory. The last section summarizes the main programming models for HPC systems and the desirable features for exascale computing.

2.1 Exascale Computing

High-performance computing (HPC) is an important driver for scientific discoveries and breakthroughs. Open questions in many scientific domains rely on ever-increasing computing capabilities to find solutions to large computational problems. The current leading HPC systems deliver computing power on the order of petaflops, i.e., 10^15 floating-point operations per second (FLOPS). These systems have enabled unprecedented simulations that extend and deepen our understanding in life-impacting domains, from the formation of the universe to genome sequencing, from weather forecasting to drug discovery. Still, some scientific problems remain unsolvable, even on the largest HPC platforms today. For instance, current computing power cannot enable realistic simulation of the Earth's magnetosphere in space weather [85]. To address these computational challenges, larger systems with higher capability are expected to arrive in the future. The next milestone in HPC systems is exascale computing, which will deliver 1000 times the computing power of petaflops machines. Exascale computing is also expected to accelerate the time to solution for large problems that take a long time to solve on today's systems. Exascale computing is anticipated to advance research in science, engineering, finance, and social science.

Exascale HPC systems refer to those systems that can deliver 10^18 FLOPS, i.e., exaflops. Currently, the HPC community uses the High-Performance Linpack (HPL) benchmark to measure the FLOPS performance of a system. Besides the primary requirement of exaflops computing capability, an exascale system is also required to have proportionally scaled memory bandwidth, persistent storage, and communication bandwidth to support computing performance in applications. HPC systems, or supercomputers, in this work refer to capability computing systems that use a large fraction of their computing power for solving a single large problem [18]. In the rest of this thesis, we use exascale computing and exascale supercomputer interchangeably.

Today's top supercomputers in the world are mostly implemented as distributed-memory machines. Clusters of conventional compute nodes are tightly connected with high-end interconnects. Each compute node has its separate memory space and can be considered as an independent computing system. Inside the compute node, the common architecture is a symmetric multiprocessor (SMP) chip with multiple cores that share a coherent memory space. Each core may also support several hardware threads. Large HPC systems typically have tens of thousands of these tightly connected compute nodes. Data movement within one compute node goes through the memory hierarchy of the memory subsystem, while data movement across compute nodes goes through the interconnects. From the current studies in exascale computing and the three DOE pre-exascale systems [148, 155, 156], we assume that the next-generation supercomputers will still be implemented as distributed-memory machines.

Enabling efficient execution of applications on today's petaflops supercomputers is already challenging. On exascale supercomputers, this effort will further increase as the complexities of future architectures increase and the requirements of applications keep evolving. In the next three subsections, we briefly introduce the main areas of technical challenges in exascale computing and the trend of future systems and applications.

2.1.1 Exascale Challenges

There are four main technical challenges in exascale computing: power and energy efficiency, the memory and storage challenge, extreme parallelism and concurrency, and resilience [18, 40]. Power efficiency has become the top priority in enabling exascale computing. Exascale systems will have a power budget of 20 to 30 megawatts. Transistors have hit the power wall: on a small area, the power density cannot remain constant when scaling the frequency. The power wall and the power budget make it infeasible to achieve exascale computing by simply integrating more units together. Instead, power-efficient computation units, memory, storage, algorithms, and runtimes are required [70, 77, 101, 165]. Among these, the cost of data movement within a computing system has been identified as a more challenging issue than supporting computation at low power. For this reason, power-efficient memory and storage technologies are likely to emerge in the architecture of exascale machines.

Memory and storage on exascale supercomputers need to scale proportionally with the computing power to make the system usable. This requires memory bandwidth to reach as high as one exabyte per second on exaflops machines. In the past decade, the scaling of computing capability has mostly come from the increased number of integrated transistors instead of increased frequency. As the number of cores and threads on a processor keeps increasing, they demand higher memory bandwidth and capacity. The memory wall challenge emerges because the performance gap between the microprocessor and the memory is growing exponentially [22, 175]. This challenge has pressured the dominant memory technology on supercomputers, i.e., DRAM, towards the end of its dominance, given the multiple difficulties of scaling to higher density. Future memory systems will likely feature multiple technologies. This directly impacts the data movement inside a compute node. Similarly, the storage system also needs to cope with the requirements for increasing capacity and bandwidth. New storage technologies like Solid State Drives (SSD), especially Storage Class Memory (SCM) and 3D-stacked Non-Volatile Memory (NVM), bring more tiers bridging the main memory and the Hard Disk Drives (HDD).
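
The one-exabyte-per-second figure follows from a simple balance argument, under the illustrative assumption that the system sustains one byte of memory traffic per floating-point operation:

    10^{18}\ \mathrm{flop/s} \times 1\ \mathrm{byte/flop} = 10^{18}\ \mathrm{bytes/s} = 1\ \mathrm{EB/s}.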

Extreme concurrency is another major challenge at exascale. Exascale computing is expected to support billion-way thread parallelism. Researchers have conducted extensive studies on new programming models and programming systems that can manage such massive parallelism. On this path, whether disruptive or incremental approaches should be taken still remains debatable. One common agreement is that MPI, the de-facto programming standard for today's HPC systems, will be available on the next-generation systems, likely interoperating with another programming model to form an MPI + X model for exascale computing.

Resilience is the last area of challenges identified in the Exascale Computing Study [18]. There are two main factors for the expected growth of the fault rate on exascale systems [27]. First, an exascale supercomputer will be comprised of a massive number of components, including computing, memory, storage, and network units. A large number of units increases the probability that a fault will occur in at least one of them. The mean time to failure (MTTF) on exascale machines is expected to be too short for the current checkpointing approach to be effective. The second factor is the increase of soft errors in these components, i.e., not all failures result in detection. Soft errors occur when data is corrupted but the execution completes without interruption [82]. In some sense, this type of failure is more dangerous for scientific simulations as it could lead to wrong results.
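
The first factor can be quantified with a common simplifying assumption: for N independent components with identical, exponentially distributed times to failure, the system-level MTTF shrinks linearly with the component count,

    \mathrm{MTTF}_{\mathrm{system}} \approx \frac{\mathrm{MTTF}_{\mathrm{component}}}{N},

so a component MTTF of ten years spread over 10^6 components gives a system MTTF of roughly five minutes.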

2.1.2 The Trend of Future Architectures

Next-generation supercomputers will have increased heterogeneity [83]. Today, state-of-the-art systems are already equipped with heterogeneous and hierarchical computing components, interconnect networks, memories, and storage systems. These emerging architectures are shaped by the main challenges in exascale computing. Since Dennard scaling broke down around 2006 [48], transistors can no longer benefit from "free" scaling in terms of power because the leakage current becomes an issue on a small area, i.e., the power wall challenge. Nowadays, performance scaling of a computing system is mainly attributed to increased parallelism instead of increased transistor frequency. More transistors are being integrated together to achieve high aggregated throughput. This increase in density forces vendors to replace complex but power-hungry cores with simpler but power-efficient cores to keep the total power consumption within a restricted budget. Following this trend, many-core processors and hardware accelerators are emerging. They are considered promising approaches to enable exascale computing with 20x power efficiency [164, 165].

The most common accelerator on top supercomputers today is the general-purpose graphics processing unit (GPGPU). Since the first deployment of a GPU-enabled supercomputer, the Titan supercomputer at Oak Ridge National Laboratory [157], GPUs have gained increasing popularity on top supercomputers. At the time of writing, the Piz Daint supercomputer at the Swiss National Supercomputing Centre is the fastest GPU-based supercomputer, delivering about 20 petaflops [154]. The latest GPU today is the NVIDIA Volta GPU [56], which will enable two pre-exascale supercomputers, Sierra and Summit [155, 156]. Besides GPUs, other forms of accelerators, such as FPGAs and DSPs, and energy-efficient architectures, such as ARM big.LITTLE [58] and the Intel Many Integrated Core (MIC) architecture [44], are also attracting increasing attention. One big challenge on accelerator-enabled supercomputers is portability. Accelerators often rely on low-level programming frameworks and programming models, such as CUDA and OpenCL, for performance. Higher-level support in the language, such as OpenACC [61] and OpenMP 4.0 [115], which provide directive-based programming models, is becoming more mature and gaining popularity. Such programming models rely on the compiler for automatic translation of code for accelerators [89, 90, 94].
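
As a small illustration of the directive-based approach, the self-contained C example below offloads a vector update with an OpenMP 4.x target construct; the compiler generates the device code, and the region falls back to host execution when no accelerator is present.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double x[N], y[N];
        double a = 2.0;

        for (int i = 0; i < N; i++) { x[i] = (double)i; y[i] = 1.0; }

        /* Offload the loop to the default accelerator device, copying x in
           and y in and out of device memory. */
        #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[42] = %f\n", y[42]);
        return 0;
    }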

The architecture of supercomputers is moving from the multi-core, small-memory thin-node architecture towards the many-core, large-memory fat-node architecture. The compute node on future systems could support 100-1000 cores and terabytes of memory consisting of different technologies. One main reason for this trend is the memory wall challenge, i.e., the performance of memory has been lagging behind the performance of processors and this gap is increasing. Previously, memory performance was already improving more slowly than CPU performance, with an exponentially growing gap [175]. Now, as more cores are integrated into a single chip, the large number of threads demands even higher memory bandwidth and capacity to sustain performance. The memory wall becomes one limiting factor for scaling supercomputers. All three pre-exascale supercomputers follow the fat-node architecture, with much larger memory and many more cores/threads compared to today's systems [148, 155, 156].

Exascale memory subsystems are likely to be heterogeneous. Two main reasons lead to this shift in memory subsystems. First, the dominant memory technology, dynamic random access memory (DRAM), is facing challenges in coping with the growth in system scale. Memory density has nearly reached a stopping point due to power leakage and the manufacturing process. The refresh power required by DRAM grows with memory capacity and density [88]. Memory systems on exascale systems require higher density and lower power consumption than the current technology [143]. Second, alternative memory technologies are under active research, but no single technology can completely replace DRAM in terms of performance and cost. High-bandwidth memory, such as 3D-stacked DRAM, can achieve memory bandwidth orders of magnitude higher than DRAM [87, 117]. However, their high cost limits their deployment to small capacities. Emerging non-volatile memory (NVM) technologies can significantly reduce power consumption as they do not require refresh power to retain data.


[Figure 2.1: A conceptual illustration of the deepened memory hierarchy, from today's systems (left panel: CPU registers, SRAM cache, DRAM main memory, SSD/HDD storage devices) to emerging memory systems (right panel: CPU registers, SRAM and HBM caches, HBM and DRAM main memory, large-capacity NVM, SSD/HDD storage devices), ordered from faster to larger.]

Byte-addressable non-volatile RAM, such as Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM), and Resistive Random-Access Memory (ReRAM), is promising for replacing DRAM in implementing main memory [75, 84, 179]. The 3D XPoint memory technology stacks non-volatile memory to bridge the gap between storage and main memory [23]. Nevertheless, the current NVM technologies still have higher access latency, asymmetric read/write bandwidth, and low write endurance compared to DRAM. A practical solution is to integrate these new technologies with DRAM to form a heterogeneous memory subsystem. State-of-the-art processors and accelerators have already adopted this approach. In 2015, AMD released its Fiji GPU, which features first-generation HBM. The Intel KNL processor is equipped with an HBM-DRAM memory system, where the HBM can achieve 4x the bandwidth of DRAM [146]. The Volta GPU also features hybrid memories, where the high-bandwidth memory can deliver nearly 900 GB/s. The Xeon Phi and Volta are among the building blocks of the pre-exascale supercomputers.

Heterogeneity in the memory system imposes new programming challenges. The memory organization of modern systems is hierarchical: faster and smaller memory is placed close to the cores while slower and larger memory is placed far from the cores. The inclusion of multiple memory technologies in main memory and storage further deepens this hierarchy. Figure 2.1 presents one possible memory hierarchy on future systems. Explicit data placement on a heterogeneous memory system requires significant effort from the programmer. In some large-scale applications, these manual efforts can become prohibitively expensive due to the large number of data structures and the scale of the application. Programmers are likely to resort to the default setting of the memory system, which requires minimal modification of the code. Intelligent and automatic data placement on such systems is highly desirable and should be supported by the runtime system.

2.1.3 Applications on Supercomputers

At exascale, regular and simple applications are rare while applications with more complex algorithms are becoming common [18]. The primary objective of a supercomputer is to solve problems that are unsolvable on any other computing system. Thus, the characteristics of applications directly impact the design and implementation of exascale computing. Real-world applications running on today's petascale supercomputers may use a diverse set of algorithms and have mixed characteristics. Scientific applications will continue to be first-class citizens on exascale systems. Emerging applications such as data analytics and machine learning are likely to be coupled with scientific applications for in-situ processing, avoiding the expensive data movement of traditional post-processing. The next-generation supercomputers need to support both existing applications and emerging workloads.

Scientific Applications. Scientific applications often use one algorithm to solve a very large problem that requires high computing capability and a large memory size. Representative algorithms include solvers for the Maxwell equations, the Navier-Stokes equations, and the Poisson equation. Such problems are large enough that multiple compute nodes are required to hold them in memory. They are often computation intensive, requiring a large number of processors for a reasonable processing time. The computational problem is typically decomposed onto a group of compute nodes, and each processor only solves a smaller subproblem.

In the early days, scientific applications were usually regular. For instance, they represented the problem with three-dimensional mesh grids and partitioned the problem regularly over processes. Today, applications are becoming more and more irregular, either to implement new algorithms or to represent realistic problems. For regular applications, locality-aware implementations and the caching and prefetching of modern computers can effectively improve performance. If the intrinsic algorithm of an application is irregular in its data structure, e.g. adaptive meshes, or in its data access, e.g. graph problems, application performance is bound by memory latency. Such applications are sensitive to the latency of the memory and the network rather than the bandwidth.

Communication can become a limiting factor when a large number of processes are in use at exascale. To solve a large problem, the problem is often distributed onto processes. These processes are tightly coupled and need to communicate data. A common example is an iterative solver, e.g. a Poisson solver, that requires the sum of the residuals on all processes to proceed to the next iteration (see the collective-reduction sketch after this list). Coarse-grained synchronous communication models cause the application to be more sensitive to random delays on processes, resulting in significant performance degradation at massive concurrency. Thus, exascale applications need to explore a finer-grained communication paradigm that can mitigate such impact.

Data-Analytics Applications. Scientific applications are converging with data-analytics applications [141]. Today, large-scale scientific applications generate data that grow exponentially in volume. Traditional post-processing analysis has reached its limits under this trend. The enormous amount of data stresses the memory system and persistent storage, often becoming a performance bottleneck. One solution is to enable in-situ data-analytics routines that filter out irrelevant information and only save useful information. This approach can effectively reduce the amount of data to be moved within a system. In fact, the energy cost of data movement accounts for a substantial part of the power consumption of a system and is estimated to be more challenging than the computation power [18, 79, 143].

HPC hardware, parallel programming models, and systems are evolving to tackle the challenges from data-intensive applications. One example is the Blue Waters supercomputer, which enables the VPIC code to generate 290 TB of data on 300,000 MPI processes [26]. The large datasets generated from scientific simulations have proven valuable for new scientific discoveries [10, 130]. Future systems may expect more data-analytics applications for processing real-time instrumentation data. For example, the Square Kilometre Array will provide approximately 200 GB/s of raw data [171].

Data-analytics applications have different characteristics than scientific applications. Memory size is often their primary requirement on the memory system. For many problems, feasibility depends on whether the memory capacity of a system can hold the data. Out-of-core algorithms or batched executions are used when the memory of a system cannot hold the problem. The memory access pattern in data-analytics applications is data-driven. Little temporal or spatial locality can be exploited to reduce memory access time. Though considerable efforts have been invested in implementations with better locality, the intrinsic algorithmic properties limit such optimisation. Data-driven access makes the application irregular and thus sensitive to the latency of the memory and the network. Asynchronous fine-grained programming paradigms are desirable for optimizing such applications.
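
The collective reduction mentioned above for iterative solvers takes only a few lines of MPI; the sketch below uses a dummy local residual and is generic rather than taken from any particular solver in this thesis.

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* local_res2 would normally be the squared residual norm computed on
           this process's subdomain; a dummy value is used here. */
        double local_res2 = 1.0e-8 * (rank + 1);
        double global_res2;

        /* Every process blocks here until all contributions have arrived,
           which is why imbalance directly delays the whole iteration. */
        MPI_Allreduce(&local_res2, &global_res2, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global residual = %e\n", sqrt(global_res2));

        MPI_Finalize();
        return 0;
    }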

Moving towards exascale computing, the different categories of applications are also converging. For instance, to enable high-spatial-resolution simulations, applications in weather, climatology and solid earth science (WCES) need to manage massive amounts of scientific data and also support realistic modelling. Applications in fundamental science, such as nuclear fusion, plasma physics, and materials science, will require the capacity to handle exabytes of data in order to compute at exaflops. Moreover, exascale computing projects have identified that breakthroughs in complex systems will require multi-scale techniques that couple multiple numerics, models, big data, machine learning, etc. [152].

2.2 Data Movement on HPC Systems

Data movement on distributed-memory supercomputers mainly consists of two parts: access to main memory and communication.

First, within a single compute node, the data moves through the memory hierarchy to become available for the processing units. This data movement is mainly impacted by the memory subsystem and the placement of data structures. When the main memory is implemented with technologies of higher bandwidth and lower access latency, the time for moving data is automatically reduced. For a memory system comprised of non-uniform memory access (NUMA) domains or multiple memory technologies, placing data structures on a proper memory can reduce the cost of data movement in an application.

Second, when data moves across compute nodes, the data movement is mostly impacted by the interconnect network and the programming model.


[Figure 2.2: Data movement on emerging heterogeneous-memory supercomputers. Data (green) moves within the intra-node memory system, which consists of different memory technologies (grey, in different shades), and across multiple compute nodes (Node 0 to Node 3) through communication.]

A high-end interconnect network with high bandwidth and low latency can directly reduce the time to transfer data. Asynchronous programming models that can overlap the communication with other work could also reduce the cost of communication in applications. In the scope of this thesis, we propose approaches for these two parts of data movement on supercomputers. Figure 2.2 illustrates that data on emerging heterogeneous-memory supercomputers moves through multiple memory technologies inside a node and over the network.

2.2.1 Impact of Memory Subsystem

On distributed-memory supercomputers, each compute node can be considered as an independent computing system where multiple cores and threads share a coherent memory space. Thus, data movement within a compute node is significantly impacted by the memory subsystem. The memory organization of modern systems is hierarchical: faster and smaller memory is placed close to the processing units while slower and larger memory is placed far from the processing units. A typical memory subsystem consists of main memory and likely multiple levels of caches, e.g., L1 and L2 caches.


Today, dynamic random access memory (DRAM) is the dominant technology for implementing the main memory of supercomputers. Supercomputers have employed DRAM as main memory for decades for its low cost and high performance. Other memory technologies, such as SRAM, are used to implement caches. As supercomputers continue to grow in scale, DRAM is facing challenges in coping with the increasing demands from the system. The balance between DRAM and other memory technologies is about to shift due to three main factors. First, scaling towards higher density in the memory cells of the DRAM technology is progressively slowing to the point that a hard stop is expected in the near future. Among various issues, the increase of leakage power and the reliability of the manufacturing process have become major problems for scaling DRAM technology. Second, emerging irregular HPC applications and data-analytics workloads require an ever-increasing capacity to store data. The density and static power consumption of DRAM make it an unviable solution for these new workloads. Memory technologies with higher density and lower power consumption are required. Third, the massive number of cores and hardware threads in modern processor chips needs to be sustained by a higher bandwidth than the DRAM technology can provide.

Active research is engaged in finding new memory technologies that can scale better than DRAM, such as NVM and HBM. Due to cost or performance, no single new technology can directly replace DRAM. Heterogeneous memory systems with multiple memory technologies working side-by-side are emerging. Data movement within a heterogeneous-memory compute node needs to consider the different characteristics of all memories available in one memory subsystem. Efficient data placement within a compute node will directly impact the performance of applications. For this reason, our optimization of intra-node data movement tackles heterogeneity in the main memory. We address the challenges of moving towards heterogeneous memory subsystems in three steps. We first evaluate the impact of the memory subsystem on application performance on emulated large-scale systems. After that, we conduct a case study on a real heterogeneous-memory node to extract the key factors for applications to exploit the underlying memory subsystem. Based on the findings from the previous steps, we devise an algorithm that considers the characteristics of data structures and memories to reduce the cost of data movement on a heterogeneous memory system.



2.2.2 Impact of Communication

Data movement across compute nodes happens through communication over the interconnect network. Different programming models have been developed for HPC systems, including Cilk, Charm++, MPI, UPC, Co-Array Fortran, and Chapel [21, 30, 34, 60, 76, 114]. Nevertheless, many programming models still use MPI to implement the communication layer. For this reason, we describe the terminology of MPI for communication on supercomputers. In MPI, the execution unit is called a process and each process has its private memory space. Data movement between processes is done through communication. The next sections introduce the two main types of communication in MPI and the concept of load imbalance, a main factor that can degrade the communication performance of parallel applications.

MPI Communication

Point-to-point communication transfers data from a source process, called the sender, to a destination process, called the receiver. This is the basic communication that is used to build other, more complex types of communication. Depending on whether the receiver is actively or passively involved in messaging, point-to-point communication can be further divided into two-sided and one-sided communication. Two-sided communication is common in production-level applications because it has been available since the start of the MPI standard. Two-sided communication has blocking and non-blocking implementations. Blocking communication cannot return before the message buffer is free for reuse, while a non-blocking implementation can return immediately without waiting for the buffer to be free. One-sided communication was later introduced in the MPI-2 standard in 1997 to exploit the remote direct memory access feature of modern networks. With effective support from the hardware, this type of communication is designed to reduce the overhead of message passing by offloading data movement to the network hardware. One-sided communication only has a non-blocking implementation.
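To make the blocking/non-blocking distinction concrete, the following minimal sketch contrasts a blocking send with a non-blocking receive between two MPI processes (run with at least two ranks); it is an illustrative example, not code from the thesis.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0};
    if (rank == 0) {
        /* Blocking send: returns only when buf may be safely reused. */
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Non-blocking receive: returns immediately; completion is
           checked later, allowing other work to overlap communication. */
        MPI_Request req;
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other computation could be performed here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}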

Collective communication transfers data among a group of processes, called a communicator. This type of communication is useful in applications where global information is needed. For instance, in iterative solvers, the sum of residuals on all processes is required in order to proceed to the next iteration. As collective communication involves all processes in a communicator, the complexity of its implementation is often a logarithmic function of the number of processes. On supercomputers, the cost of collective communication can grow rapidly when tens of thousands of processes are used. There are two main directions of optimization. The first direction is to devise efficient algorithms with lower complexity. The optimal algorithm changes under different scenarios, so optimization also uses heuristics to select the best algorithm. The second direction is to exploit special support from the underlying hardware. For example, some machines, like BlueGene/L, have special-purpose network hardware to support certain collective communication [4]. A good implementation of the MPI standard may employ multiple optimization strategies and hide the implementation complexity from the programmer. Recently, sparse collective communication for nearest neighbours further extends the collective communication operations [68]. Collective communication also has blocking and non-blocking variants.
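The iterative-solver example above can be expressed with a single collective call; below is a minimal sketch where local_res and the tolerance tol are placeholders for the solver's own quantities.

#include <mpi.h>
#include <stdbool.h>

/* Returns true when the global residual, summed over all processes in
   comm with one collective call, falls below the given tolerance. */
bool converged(double local_res, double tol, MPI_Comm comm) {
    double global_res = 0.0;
    MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_res < tol;
}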

Load Imbalance

The cost of communication depends on the interconnect network and the programming model. On one hand, fast networks with lower latency and higher bandwidth can speed up data movement from one compute node to another in all applications. On the other hand, on a given supercomputer, applications that are constructed in suitable programming models can expose higher parallelism and effectively hide the cost of communication. In this thesis, for the inter-node data movement, we explore novel programming models that can optimize the communication in applications. Previous works have pointed out that at extreme concurrency, even a small delay in one process can be prohibitively expensive [18, 143]. When processes are imbalanced, i.e., some are slower than others, the cost of data movement includes both the data transfer time and the idle time that a process waits for its delayed communication peer. Process imbalance can significantly increase the cost of point-to-point communication and collective communication [124, 128].

Process imbalance is practically unavoidable at billion-way parallelism. There are two main sources: system interference and imbalanced workloads. On large-scale machines, interference from system noise is unavoidable and can cause equal workloads to finish in variable time [133]. The interference could come from dynamic factors, e.g., the temperature of a processor depends on the environment, or from the OS, e.g., the kernel can interrupt a process of a parallel application to execute system-level activities. These delays range from 100 ns (the cost of a cache miss) to 20 ms (the cost of process pre-emption or of a swap-in) [14]. Recent research on the impact of system noise in parallel applications includes theoretical models, benchmarking and simulation [2, 17, 50, 67, 133]. Parallel applications often distribute a large problem onto processes and each process only works on a subset of the workload. Assigning equal workload to each process, however, is only possible within a small set of static and regular algorithms, such as 3D meshes. Other scientific applications, such as unstructured meshes and adaptive grids, could allocate a variable amount of work to processes. Data-analytics applications often have their workloads driven by data, e.g., the distribution of vertices in graph problems, resulting in an imbalanced workload on processes. Experimental evidence has shown that random delays occurring over nanosecond time scales, originating on a single process, propagate among processes and are responsible for overall performance degradation that can reach millisecond delays. Random small delays can propagate through the system and degrade the overall performance of applications [124, 128].

Extensive research has addressed system noise and workload imbalance. Several micro-kernels and lightweight kernels (LWKs) have been developed to minimize system noise. Examples of micro-kernels include L4 [95], Exokernel [45, 46, 47], and K42 [7]. The IBM Compute Node Kernel (CNK) for Blue Gene supercomputers [54, 112] is an example of an LWK targeted at HPC systems. CNK is a standard open-source, vendor-supported OS that provides maximum performance and scales to hundreds of thousands of nodes. There are also several full-weight HPC kernels, including ZeptoOS [13, 15, 16], IBM AIX [74], and various other Linux variants [55]. The enclave-based application composition [12] can isolate operating systems/runtimes in enclaves to address the different requirements from workloads within one compute node.

Workload balancing has attracted research in algorithms, languages, and runtimes. Parallel applications are exploring new algorithms or improving current algorithms for a more even distribution of workloads. However, some algorithms are fundamentally imbalanced, such as particle-based algorithms, adaptive meshes and the coupling of multiple physical models. New languages and runtimes that can automatically handle workload balancing are becoming popular. For instance, task-based models create a large number of fine-grained tasks and rely on the runtime to schedule the tasks evenly onto the underlying hardware threads. Work-stealing runtimes, such as Cilk's runtime system [21], are well-known examples in this category. Charm++ is another example that provides measurement-based load balancing [76, 180]. Recent research also explores decoupling frameworks to offload these load-balancing routines [118].
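As an illustration of the fine-grained tasking style described above, the sketch below uses OpenMP tasks as a generic stand-in for runtimes such as Cilk or Charm++: the programmer only expresses tasks, and the runtime schedules them onto the available hardware threads.

#include <omp.h>
#include <stdio.h>

/* Each call spawns two child tasks; the runtime schedules the resulting
   irregular task tree onto the available hardware threads. */
static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single
    result = fib(30);
    printf("fib(30) = %ld\n", result);
    return 0;
}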



2.3 Programming Models for Exascale Computing

A programming model refers to the way of expressing the programmer's view of the machine with a set of languages and libraries. Proper programming models can enable efficient implementation of an algorithm on target machines. As introduced in Section 2.1.3, applications on supercomputers are evolving. Regular applications are becoming rare. Applications that exhibit irregularity in workload, data structures, and access pattern are becoming common. They require programming models with asynchronous and fine-grained communication. Another motivation for such programming models comes from the impact of process imbalance on communication at massive concurrency in exascale computing, as introduced in Section 2.2.2. At billion-way parallelism, perfect balance is nearly impossible and process imbalance can be a limiting factor of performance on large systems. For these reasons, one can assume that process imbalance in exascale computing is a norm that should be reflected in programming models.

2.3.1 Disruptive and Incremental Directions

New programming models have been proposed for exascale computing. These new models include new languages or extensions to existing languages. We categorize these models into three classes: the shared-memory model, the message-passing model, and the PGAS model. Shared-memory models are often used on smaller-scale systems, such as a single compute node with SMP chips, where threads share a memory address space. In this model, the OpenMP API and the POSIX thread API are popular choices [25, 35]. New languages, such as Cilk and Cilk++, rely on compiler support to translate or map their logical threads onto low-level machine threads [21, 91]. The message-passing model is the most widely adopted model on supercomputers. High-performance implementations of the MPI standard [51], e.g., MPICH and OpenMPI, are widely available on all supercomputers and engage vendor support on future machines. In the PGAS model, all processors in the system share a Partitioned Global Address Space, which is a unified memory space formed by the private memory on all compute nodes. Popular PGAS programming models include extensions to existing languages, such as UPC [34] and Co-Array Fortran [114], the GPI-2 implementation of the GASPI standard [62], as well as new languages like Chapel [30] and X10 [32].

Today, the dominant programming model on supercomputers is comprised of a sequential language, such as C or Fortran, with a message-passing layer, e.g., the libraries that implement the MPI interfaces. Most production-quality applications on petaflops supercomputers are implemented in this model. Optimization of such applications relies on the new features in the MPI standard. For instance, MPI-2 introduced one-sided communication (the RMA interface) with the objective of exploiting the remote direct memory access (RDMA) feature of modern networks [59]. With hardware support, RMA should reduce the overhead of communication by bypassing OS kernels and offloading data movement to the network [52, 53]. MPI-3 further extends the RMA interface and adds other features, such as support for shared memory and dynamic process management. In this incremental direction, the MPI+X model, where X is usually a shared-memory programming model like OpenMP, is gaining popularity. We evaluate the interoperability of the message-passing model and the PGAS model in [71, 104].
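As a brief illustration of the RMA interface discussed above, the sketch below uses MPI-3 one-sided communication to write a value directly into a window exposed by another process; it is a generic example under standard MPI semantics (run with at least two ranks), not code from the cited papers.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one double per process as an RMA window. */
    double *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0.0;

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Write directly into the window of process 1 without its
           explicit participation in the transfer. */
        double value = 42.0;
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* completes the RMA epoch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}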

A disruptive direction is to rewrite applications in task-based programming languages that provide load balancing and dynamic execution, e.g., Cilk++, Charm++ or Chapel [29, 76, 91]. A runtime system executes fine-grained tasks on different computing units and orchestrates the workload and resource usage [40]. While in principle this approach can elegantly solve the main exascale challenges for applications, implementing production-quality HPC applications using tasks often requires a major application development effort. In addition, task-based programming systems for distributed memory are not as mature as MPI. With exascale computing expected in 2020-2024, this approach is challenged by the timeline.

2.3.2 Desirable Features of Programming Models

Which programming models will be efficient for implementing applications for exascale computing is still debatable. However, the desirable features for the wide adoption of a programming model in applications are becoming clearer as the challenges, the trend of future architectures and application characteristics are becoming better understood. In general, both the disruptive and incremental models are trying to enable more asynchronous fine-grained communication in applications, as this can mitigate the impact of process imbalance at extreme concurrency. Programming models for next-generation supercomputers should reflect the heterogeneity in different computing systems, e.g., computing units and memory systems. They should allow an HPC application to exploit the heterogeneity in the target system flexibly. Finally, as exascale computing is approaching, adaptability to existing applications can also directly impact the choice of programmers. A programming model that can be easily adopted in existing applications, or that can reuse parts of them, can be implemented and ported quickly onto a new system.

2.4 Our Approaches

The following two parts of this thesis revolve around data movement on emerging large-scale HPC systems.

In Part I, we propose a data streaming model that uses asynchronous fine-grained streams for moving data among processes of HPC applications. In Chapter 3, we show that the advantages of data streaming models can address the challenges of the increasing amount of data on HPC systems. We then present our design of a data streaming model for HPC systems. We also provide an implementation atop MPI and benchmark its performance on different supercomputers. In Chapter 4, we define a set of new performance metrics to characterize streaming computing on supercomputers. We also present our case study on a Cray XC40 supercomputer. In Chapter 5, we propose the decoupling model for preparing applications for the exascale era, which separates operations onto groups of processes and enables a streaming processing paradigm among these groups. We show that data streams can enable this decoupling model in applications. In Chapter 6, we further extend the streaming model for coupling different applications on supercomputers. This part refers to four peer-reviewed papers: "A Data Streaming Model in MPI" [125], "A Performance Characterization of Streaming Computing on Supercomputers" [103], "Preparing HPC Applications for the Exascale Era: A Decoupling Strategy" [123], and "MPI Streams for HPC Applications" [122].

In Part II, we propose approaches and algorithms for moving data on heterogeneous main memories. In Chapter 7, we summarize the advances in memory technologies and the programming challenges faced by HPC applications. In Chapter 8, we introduce a methodology for emulating large-scale heterogeneous-memory supercomputers. In Chapter 9, we evaluate a real heterogeneous-memory system and propose the main considerations for an application to exploit such systems. In Chapter 10, we propose a data-placement algorithm for heterogeneous memory systems as well as its implementation in a tool called RTHMS. This part refers to three peer-reviewed papers: "Exploring Application Performance on Emerging Hybrid-Memory Supercomputers" [127], "Exploring the Performance Benefit of Hybrid Memory System on HPC Environments" [120], and "RTHMS: A Tool for Data Placement on Hybrid Memory Systems" [121].


Part I

Data Streams for Fine-grained Asynchronous Communication

Chapter 3

A Data Streaming Model

In this chapter and in Paper 1 [125], we introduce a data streaming model for supercomputers to tackle the challenges of emerging data-intensive applications. Section 3.4 describes the design of an MPI stream library for HPC systems. In Section 3.5, we introduce the implementation strategies to bridge the gap between MPI functionalities and a data streaming model. In Section 3.6, we introduce a parallel STREAM [106] benchmark to evaluate the performance of data streaming on different supercomputers. Finally, we show new operations that can be supported in a data streaming model.

3.1 Contributions

Our contributions are threefold. First, we present a data streaming model that processes data sets on-the-fly to tackle the challenges of large data sets on supercomputers. Second, we identify the desirable features of such models and provide a proof-of-concept implementation atop MPI. Finally, we evaluate the performance of the streaming model on supercomputers and quantify the impact of three factors: the granularity of the stream element, the computation intensity of the operation and the ratio between data producers and consumers. Our experiments show that the data streaming model can achieve acceptable performance (52%-65% of the maximum available bandwidth) on modern HPC systems. Furthermore, the model demonstrates promising scalability by achieving processing rates as high as 200 GB/s and 80 GB/s using 2,048 data producers over 2,048 data consumers on a Blue Gene/Q and a Cray XC40 supercomputer respectively.



3.2 Motivation

HPC applications and data-intensive applications are converging [141]. Large-scale HPC applications continue to generate an ever-increasing amount of data [65]. The conventional post-processing approach, which saves the data to the file system and only analyzes it after the simulation, has become a performance bottleneck [99]. In addition, new HPC applications that demand both streaming and computing capabilities on supercomputers have emerged in the last decade. For instance, the Large Hadron Collider (LHC) [131], the Square Kilometer Array (SKA) [24], the Large Synoptic Survey Telescope (LSST) [161], and the Laser Interferometer Gravitational-Wave Observatory (LIGO) [5] will further increase their data rates. SKA [24] is estimated to reach 10 PB/s, which can drain the entire memory of supercomputers if data is not processed in time. Popular data processing frameworks are designed for the cloud environment [37, 116, 178], which presents a different set of design considerations compared to HPC systems.

Streaming computing is a programming paradigm that can address the challenges of data-intensive applications. It processes data on-the-fly and only requires limited memory/storage to handle large data sets. It also supports reactive real-time computation on irregular, potentially infinite, data flows [147]. StreaMIT [158] is a modern language that supports this paradigm. It constructs programs using the Filter as the basic computation unit and connects these filters using data streams as the basic communication unit. Based on the assumption of a static data flow rate, a timing mechanism that is relative to the data flow is provided in this language to facilitate irregular control messages. Despite the success of StreaMIT on other platforms, several limitations prohibit it from being adopted on supercomputers. First, reformatting existing HPC applications to comply completely with the stream abstraction is very difficult. Instead, for easier adoption in existing HPC applications, providing a natural interface to these applications is necessary. Second, the de-facto programming system on HPC is MPI and its most active implementations are in C and Fortran, while StreaMIT relies on a Java-based compiler for high performance. Third, the data movement among processes on supercomputers requires an efficient communication layer. To adopt streaming computing on supercomputers, it is necessary to support it in popular programming systems for HPC systems.



3.3 A Data Streaming Model For HPC Systems

Our data streaming model for HPC systems inherits the basic concept of streaming computing but also addresses the specific challenges of the HPC environment. In streaming computing, streams are a continuous sequence of fine-grained data structures that move from a set of processes, called data producers, to another set of processes, called data consumers. These fine-grained data structures are often small in size and in a uniform format, called a stream element. A set of computations can be attached to a data stream. Stream elements in a stream are processed online such that they are discarded as soon as they are consumed by the attached computation.

On large-scale HPC systems, the data producers and consumers likely reside on different compute nodes and have separate memory spaces. Thus, our data streaming model implicitly includes a communication layer for moving data streams. In particular, our work focuses on parallel streams, where data producers and consumers are distributed among processes that require communication to move data. Stream elements, which are the basic unit of data movement, now become the basic unit of communication. Stream elements move through a persistent communication channel, called a stream channel. The communication of stream elements is asynchronous. The data consumer does not wait for a specific communication peer but computes on the first available stream elements. In this way, parallel streams enable asynchronous and fine-grained data movement on supercomputers.

3.4 Design

In this section, we design a library that enables our data streaming model on parallel systems. Data consumers process the incoming stream elements on a first-come-first-served basis. The operation that processes stream elements is attached to a parallel stream. The operations have three modes: initial, intermediate and terminal. The initial and terminal operations are only processed once, at the beginning and the end of the program, and the intermediate operation is repeated on each stream element.

The library should support both stateless and stateful operations on stream elements. Stateless operations do not rely on any information from previous elements, e.g., filtering based on a threshold. Stateful operations rely on some information from previous stream elements and might require more memory than stateless operations, e.g., sorting stream elements. The ordering of the stream elements is relaxed and not preserved by the order in which they are injected by producers. Stream elements will not be processed in the order they are streamed out but in the order in which they become available to a data consumer. This concept is similar to modern out-of-order CPUs that hide latency from data dependences.

The length of a data stream, e.g., the number of elements, is not required a priori. This feature is important because, in common use cases, the actual size is often not known until runtime. Thus, the library should not require prior information about the stream length. Instead, it should support run-time termination of parallel streams. Each data producer continues streaming out data until it signals the termination of its contribution to a parallel stream. When a data consumer receives termination signals from all data producers, it stops receiving incoming stream elements and calls the terminal function.

3.5 Implementation

We provide an implementation of the streaming model in a library atop MPI, called MPIStream. Different approaches can implement the streaming model, e.g., Active Access supported by RDMA hardware [19], the Active Message (AM) mechanism [166] and the Pebble Programming Model [170]. We choose MPI because it is the most widely used programming system on parallel systems and is also the assembly language for other programming models on HPC systems. Currently, MPI does not include functions that can directly enable data streaming in HPC applications.

The basic functions supported in the MPIStream library are listed below. Each function supports blocking and non-blocking variants. We refer the reader to Paper 4 for the full description.

1. MPIStream_CreateChannel(int is_producer, int is_consumer, MPI_Comm old_comm, MPIStream_Channel* channel)

2. MPIStream_FreeChannel(MPIStream_Channel* channel)

3. MPIStream_Attach(MPI_Datatype stream_dt, MPIStream_Operation *op, MPI_Stream *stream, MPIStream_Channel *channel)

4. MPIStream_Send(void* sendbuf, MPI_Stream *stream)

5. MPIStream_Terminate(MPI_Stream *stream)

6. MPIStream_Operate(MPI_Stream* stream)
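To indicate how these functions fit together, the fragment below sketches a producer/consumer program built only from the calls listed above; the construction of the MPIStream_Operation descriptor is omitted (see Paper 4), and the variables rank, nproducers, nelements and data are hypothetical placeholders whose semantics are assumed from the surrounding text.

/* Hedged sketch: operation setup and error handling are elided;
   argument semantics follow the description in this chapter. */
MPIStream_Channel channel;
MPI_Stream stream;
MPIStream_Operation *op = NULL;   /* defined by the application, see Paper 4 */

int is_producer = (rank < nproducers);          /* first ranks produce */
MPIStream_CreateChannel(is_producer, !is_producer, MPI_COMM_WORLD, &channel);
MPIStream_Attach(MPI_DOUBLE, op, &stream, &channel);

if (is_producer) {
    for (int i = 0; i < nelements; i++)
        MPIStream_Send(&data[i], &stream);      /* stream out one element */
    MPIStream_Terminate(&stream);               /* signal end of contribution */
} else {
    MPIStream_Operate(&stream);                 /* process elements until all
                                                   producers have terminated */
}
MPIStream_FreeChannel(&channel);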

Our library uses MPI (derived) data types to specify the data structure of stream elements. This allows the library to support zero-copy streaming even when the memory layout of the stream element on the data consumer is non-contiguous. The communication channel is implemented using the persistent communication primitives in MPI. This reduces the overhead of constructing many messages with the same signature. The asynchronous data movement between the data producers and consumers is implemented using the non-blocking point-to-point communication operations in MPI. This allows processes to perform other operations when stream elements are not yet available.
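The fragment below illustrates the MPI mechanisms mentioned above, namely a derived datatype describing a non-contiguous element layout and a persistent receive request that can be started repeatedly; it is a generic sketch under standard MPI semantics, not the actual MPIStream code, and the layout parameters are arbitrary.

#include <mpi.h>

/* Describe a stream element as 8 doubles strided across a larger array,
   so the incoming data lands directly in its non-contiguous destination.
   dst must span at least 29 doubles for this layout. */
void post_persistent_recv(double *dst, MPI_Datatype *element_t,
                          MPI_Request *req, MPI_Comm comm) {
    MPI_Type_vector(8, 1, 4, MPI_DOUBLE, element_t);   /* 8 blocks, stride 4 */
    MPI_Type_commit(element_t);

    /* The request is created once and can be (re)started many times,
       avoiding the cost of rebuilding the same message envelope. */
    MPI_Recv_init(dst, 1, *element_t, MPI_ANY_SOURCE, 0, comm, req);
    MPI_Start(req);
    /* The caller waits on *req, processes dst, then calls MPI_Start again;
       the datatype and request are freed when the channel is released. */
}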

The main gap between the MPI functionalities and the streaming model comes from message ordering and matching. MPI imposes stricter requirements when matching messages in order to preserve the order of messages. These restrictions are not required in the data streaming model and can even limit the advantages of the model. When MPI matches an incoming message on the receiver side, it checks both the source and the tag of a receive request. As MPI allows wildcards for these two fields, it will always match all incoming messages to the first receive request with wildcard source and tag. When all incoming messages match a single receive request, this causes high contention and serialization in the communication. To overcome this clash in message matching, we use randomized message tags and a wildcard message source in the implementation. In this way, multiple incoming messages can match multiple receive requests without knowing their source or tag a priori.
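A minimal sketch of the matching strategy just described: the consumer pre-posts several receives that differ only in their tag and accept any source, while producers draw a random tag so that concurrent messages spread over the posted requests; buffer sizes and ranks are illustrative, not taken from the library.

#include <mpi.h>
#include <stdlib.h>

#define NSLOTS 4
#define ELEM_COUNT 8

/* Consumer side: one pre-posted receive per buffer slot, each with a
   unique tag and a wildcard source, so concurrent incoming elements can
   match different requests instead of serializing on a single one. */
void consume(double buf[NSLOTS][ELEM_COUNT], MPI_Comm comm) {
    MPI_Request reqs[NSLOTS];
    for (int tag = 0; tag < NSLOTS; tag++)
        MPI_Irecv(buf[tag], ELEM_COUNT, MPI_DOUBLE,
                  MPI_ANY_SOURCE, tag, comm, &reqs[tag]);

    int idx;
    MPI_Waitany(NSLOTS, reqs, &idx, MPI_STATUS_IGNORE);
    /* ... process buf[idx], then re-post reqs[idx] ... */
}

/* Producer side: pick a random tag from the same range. */
void produce(const double *element, MPI_Comm comm) {
    int tag = rand() % NSLOTS;
    MPI_Send(element, ELEM_COUNT, MPI_DOUBLE, /* consumer rank */ 1, tag, comm);
}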

We demonstrate this relaxed ordering with an example in Figure 3.1. Stream elements 1, 2 and 3 are streamed out at approximately the same time. However, they will match different receive requests that point to different locations in the memory buffer. Rq0 has two matching stream elements, stream elements 0 and 1. As stream element 0 arrives earlier, it would likely have been processed before the arrival of stream element 1.

3.6 Evaluation

In this section, we evaluate the performance of the streaming library on two supercomputers. We first introduce a parallel STREAM benchmark, which is based on the original STREAM benchmark [107] for measuring sustainable memory bandwidth. Then, we benchmark the streaming performance on a Cray XC40 and a Blue Gene/Q supercomputer. Our results show that the granularity of stream elements, the computation intensity of stream operations and the ratio between data producers and consumers can significantly impact streaming performance.

We use a parallel STREAM benchmark to measure the processing rate of data consumers in an application using the MPIStream library. We implement four kernels in the benchmark so that the data is injected by data producers into the stream channel and then used in the copy, scale, add and triad kernels by the data consumers. The results presented in this section are obtained by injecting 10,000 stream elements of MPI_DOUBLE data type and applying the scale kernel. For each test, the data producer and consumer processes reside on different compute nodes so that the data movement goes through the interconnect network.

Figure 3.1: Diagram of the management of memory and communication in stream operations. The data producers generate a random tag when they stream out an element. Each data consumer maintains a list of persistent communication requests and a memory buffer. Each request has a unique message tag. The memory buffer can hold multiple stream elements.

The first experiment evaluates the impact of the granularity of the stream element. We measure the peak performance of the underlying network with a ping-pong benchmark. Then, we measure the performance of the library with the parallel STREAM benchmark. Comparing the performance with the peak bandwidth, we notice that the library achieves a reasonable fraction, at maximum 65% and 52%, of the peak bandwidth despite the overhead from the implementation strategy introduced in Section 3.5. We present the results in Figure 3.2. Initially, as the size of stream elements increases, performance increases. Performance reaches its peak at 9 GB/s and 3.5 GB/s on the Cray XC40 and Blue Gene/Q supercomputers respectively. After reaching peak performance, streaming performance saturates on both supercomputers.

Figure 3.2: The maximum and minimum processing rate in GB/s varying the size of the stream element using the scale benchmark with 32 data producers over 32 data consumers on a Cray XC40 and a Blue Gene/Q supercomputer.

In the second experiment, we study the impact of the computation intensity of the stream operation. We vary the number of floating-point operations per stream element (FLOPS) in the operations that are attached to the same data stream. We find that the processing rate remains almost constant till the compu