
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 27, NO. 5, MAY 2016

MemAxes: Visualization and Analytics for Characterizing Complex Memory Performance Behaviors

Alfredo Giménez, Todd Gamblin, Ilir Jusufi, Abhinav Bhatele, Martin Schulz, Peer-Timo Bremer, and Bernd Hamann

Abstract—Memory performance is often a major bottleneck for high-performance computing (HPC) applications. Deepening memory hierarchies, complex memory management, and non-uniform access times have made memory performance behavior difficult to characterize, and users require novel, sophisticated tools to analyze and optimize this aspect of their codes. Existing tools target only specific factors of memory performance, such as hardware layout, allocations, or access instructions. However, today's tools do not suffice to characterize the complex relationships between these factors. Further, they require advanced expertise to be used effectively. We present MemAxes, a tool based on a novel approach for analytic-driven visualization of memory performance data. MemAxes uniquely allows users to analyze the different aspects related to memory performance by providing multiple visual contexts for a centralized dataset. We define mappings of sampled memory access data to new and existing visual metaphors, each of which enables a user to perform different analysis tasks. We present methods to guide user interaction by scoring subsets of the data based on known performance problems. This scoring is used to provide visual cues and automatically extract clusters of interest. We designed MemAxes in collaboration with experts in HPC and demonstrate its effectiveness in case studies.

Index Terms—Performance Visualization, High-Performance Computing, Memory Visualization


1 INTRODUCTION

High-performance computing (HPC) systems consist of large-scale parallel computers that solve computationally intensive problems and often feature complex and long-running applications. Such systems enable many forms of scientific inquiry, but their computational capabilities vastly outpace the abilities of applications to take advantage of them, leaving much of their potential computational power largely unused. It is becoming increasingly important to understand and analyze the behavior and performance of HPC applications in an effort to optimize them. However, this is an extremely difficult task: depending on the hardware and software properties of a particular system, the behavior of an application can be impacted by several different factors. In many cases, memory access performance plays a dominating role, due to the fact that raw computational performance has advanced (and continues to advance) at a much higher rate than that of data access performance. This trend is also referred to as the "Memory Wall" [25].

To counteract this trend, hardware designers have introduced deeper, more complex memory hierarchies. While software researchers have developed a wide variety of strategies to take advantage of them [19], [15], no general solution is currently capable of achieving performance anywhere near peak memory performance. Different application and hardware combinations have vastly different memory usage requirements and capabilities, explaining the need for analysis and optimization of individual application executions.

• Alfredo Giménez ([email protected]) and Bernd Hamann ([email protected]) are with the University of California, Davis.
• Ilir Jusufi (ilir.jusufi@lnu.se) is with Linnaeus University in Växjö, Sweden.
• Todd Gamblin, Abhinav Bhatele, Martin Schulz, and Peer-Timo Bremer are with Lawrence Livermore National Laboratory. Emails: {tgamblin,bhatele,schulzm,ptbremer}@llnl.gov

Manuscript received February 29, 2016

Hardware simulators present one option, as they can simulate the execution of each instruction on a particular piece of hardware, with the goal of exposing detailed memory behaviors at a low level. However, these simulators incur large overheads and often fail to correctly predict total system behavior by neglecting outside variables, such as operating system interference, frequency scaling due to heat, and non-deterministic effects resulting from concurrent execution. For these reasons, optimizing memory performance typically requires one to collect memory-related performance data from actual, non-simulated runs and perform analyses on the collected data.

Accurate collection of memory-related performance data is a field of research on its own. Collecting the data itself consumes hardware resources, including memory, so high-overhead solutions corrupt the observed information. Conversely, not collecting sufficient information makes it impossible to perform a useful analysis. To understand the execution of a single memory instruction, we need to know which line of source code generated the instruction, which data structure was accessed, which CPU was used for execution, and which memory resource was accessed. Moreover, with modern multi-level caching, the performance of one memory instruction can depend on those that happened before. In prior work, we developed a method for collecting the needed fine-grained memory access information with low overhead [9]. In this paper, we introduce visualization and analysis methods that use this information to characterize memory performance with respect to the many different factors that contribute to it.


Contributions

We present the memory performance visualization and analysis tool MemAxes. MemAxes enables novel analysis of the complex relationships between source code, data structures, and hardware by mapping the input data to multiple different visual representations, each exposing different characteristics of memory behavior. This unique combination of characteristics more completely describes complex memory behavior than is possible with existing methods, and as a result, our method makes it possible to more accurately diagnose memory performance problems. In addition, MemAxes introduces a new way to visually characterize utilization of the hardware involved in memory accesses. Finally, MemAxes is based on a new method that guides the process of data exploration by "scoring" various subsets of the performance data and presenting the results as possible interactive data exploration choices to the user.

2 RELATED WORK

Performance data visualization can be done in multiple ways, subject to the particular data used and the application considered. We briefly review a subset of the related research that involves memory and memory-related data and refer the reader to [13] for an in-depth survey of the field at large.

Visualization of Memory Allocation. One of the motivations driving memory performance visualization techniques is the desire to understand how data is laid out in address space. Considering the design of modern memory hierarchies, the location of a particular data item in address space has a significant effect on how quickly it can be accessed. It is critically important to understand how memory allocation and utilization affects the performance of a complex computer code and how to optimize it. In the context of visualization, the problem is to depict a set of ranges (allocated memory buffers) along a single dimension (the total available address space) as they change over time (allocation, de-allocation, re-allocation). Further, one must visualize when accesses are made and by what parts of the code.

Griswold et al. devised a method that plots allocated memory buffers as blocks along an axis representing address space, encoding the buffer type via color and wrapping the axis across rows to more efficiently utilize screen space [10]. While providing a detailed visualization of memory fragmentation for small memory spaces, this method was designed to show the entirety of the memory address space and allocations of a complete program, which is simply too much information to show in its entirety, even for moderately complex applications. Furthermore, this visualization method represents the state of memory utilization for only a single point in time. Moreta et al. presented an improved approach based on plotting time as an axis orthogonal to address space [18]. However, this approach further limits the available real estate for visualizing address space, as all allocations for a single time point are plotted along a single row. Cheadle et al. introduced a similar method for showing the address space of the heap by plotting allocations via a heat map. Their method also shows histograms of different allocations separated by size and type [6]. The resulting visualizations effectively show the distribution of different allocations for a program, but the screen space required scales linearly with respect to the number of allocations.

The methods described above are more or less effective regarding the goal of understanding memory allocation characteristics. All are severely limited concerning scalability. A modern application typically involves several thousand memory allocation operations per millisecond, and modern platforms can use several million memory addresses for a single process. Furthermore, address space is a virtual construct; the data and addresses also have physical counterparts, i.e., cache hardware, memory banks, etc., and their properties greatly affect memory access efficiency. The location of memory and cache resources, their efficiency with respect to the different processors that access them, and their eviction policies play crucial roles in achieved performance. Therefore, virtual address space alone is insufficient for characterizing memory performance behavior.

Visualization of Hardware Topologies. The physical analog to the virtual address space is the set of physical addresses covering all memory hardware resources in a particular system. However, the available hardware performance data often lacks the information to connect it to the virtual address space used by the application. Hardware-centric visualizations should expose performance characteristics that effectively depict utilization across hardware resources, disjoint from the address space visible to the application developer.

Memory hardware resources include main memory, caches, and their connectivity to each other and to processors. The layout and connectivity of memory hardware resources define the hardware topology of the system. This topology has a hierarchical structure, with processors defining the leaves and parent nodes representing the next-smallest available memory resource (L1 cache), until the largest memory resource (main memory) is reached at the root.

Early research by Alpern et al. focused on connections between the processor and lower levels in the cache hierarchy [3] to illustrate the intended behavior of tiling algorithms. Their research was one of the first efforts directed at understanding the relationship between algorithms, data, and the hardware used. Unfortunately, this research effort was purely conceptual and did not address the visualization of real data. Choudhury and Rosen developed a technique for depicting data movement within a simulated cache hierarchy using an abstract visualization showing individual cache lines and their transactions [7]. Their approach effectively extended Alpern et al.'s effort by showing data elements in their respective hardware containers, using visual abstraction as a tool to improve comprehensibility and utilizing visualization space more effectively. Further, their method links transactions to the instructions in the code that caused them. However, the approach is only effective for hardware with caches small enough to keep individual cache line accesses visually discernible. Rosen et al. introduced an approach for visualizing memory accesses for representative subsets of CUDA hardware performance data [20]. The idea used to detect representative subsets is highly effective for reducing the visual exploration space into semantic regions, which is necessary to support the viewing of several hundred GPU cores and several thousand timesteps per execution. Unfortunately, the approach is highly hardware-specific.

The software tools hwloc [5] and likwid [24] provide capabilities to detect, output, and visualize hardware topologies, but they do not allow one to directly visualize performance data mapped to the topologies. Recently, Denoyelle et al. [8] built upon hwloc by color-coding performance counters mapped onto the visual components representing different hardware resources. The result is a highly advanced hardware resource monitor. However, the tool does not facilitate analysis of specified time intervals; only an aggregated summary of performance for a single time point is supported.

Visualizations that incorporate hardware topology expose a myriad of hardware-dependent performance problems that would otherwise remain unseen. However, current approaches either rely on simulated data to represent attributions between the hardware and software or do not represent this attribution information at all. As a result, current tools can shed light on hardware behavior, but not on its root causes within a simulation code.

Visualization of Memory-related Counters. Memory performance counters are available on most modern processors. They record the number of occurrences of specific events, e.g., cache misses or accesses to main memory, over time. Each counter may be associated with a particular resource, e.g., a processor core or memory resource. MView [4] displays memory access counters on a Gantt chart, with different processors on the horizontal axis and different counters encoded via color and height. This approach effectively shows the distribution of memory-related events on a per-processor basis and scales relatively well using common interaction methods for Gantt charts. Schulz et al. [21] described a technique for aggregating memory counters by their associated locations in a physical mesh used for simulation. The result is highly scalable, but also highly application-specific.

The major drawback of visualizing counter data is attribution. Counters provide the number of events over time periods, but do not provide details about those particular events. Thus, events cannot be directly attributed to the instructions that caused them or the data addresses involved.

Visualization of Sampled Memory Accesses. Our research presented here concerns sampled memory instruction data, which has only become available with recent microprocessor generations. Compared to the summary information provided by counters, sampling has the advantage of providing highly detailed information about individual events that occur during program execution. This enhanced information comes at a significant overhead cost, and sampling must be carefully configured to mitigate this trade-off. Liu et al. extended HPCToolkit to include memory access sampling data in order to attribute memory accesses to recorded call paths and associated memory allocations [17], [2]. This attribution provides valuable information. However, this toolkit does not provide more advanced visualizations beyond scatterplots and is limited in interaction techniques. It relies heavily on a user's ability to explore the data in meaningful ways.

3 REQUIREMENTS

Based on the benefits and drawbacks of the current state-of-the-art methods in memory performance visualization, we identified a set of requirements for our tool:

Scalability. The growing requirements of high-performance computing applications make scalability critical. This includes both visual scalability (how the visualization quality scales with respect to the input data size) and computational scalability (how quickly the tool runs with respect to the input data size). In an optimal scenario, visual quality is unaffected by the input dataset and the tool performs at interactive rates regardless of data size.

Attribution. In order to perform meaningful analysis, the tool should be able to show not only aspects of memory performance behavior but also their causes. Repeatedly, we have seen that a single visualization may be effective in showing one or two different characteristics of memory behavior, but unless we are able to correlate those characteristics with others, we cannot hypothesize any causal relationships between them. Attributing data across visualizations allows us to determine which characteristics may be related and thus provides us with a set of potential correlations from which to draw conclusions.

Usability. Many of the aforementioned methods provide a large visual exploration space and/or advanced interaction techniques. However, many also require the analyst to have a substantial amount of prior knowledge of their application and computing environment in order to perform useful interactions for analysis. We require that our tool be not only powerful but usable, in the sense that the process of finding interesting performance behaviors is simple and intuitive.

4 MEMORY PERFORMANCE

Modern memory architectures typically contain three levels of cache and either a single main random-access memory (RAM) resource or multiple non-uniform memory access (NUMA) resources. The time and energy required to access a memory resource increase exponentially with the size of the resource [11], so it is most often better to utilize small cache resources. Caches may be either shared or unshared among different processors. Typically, architectures include one or two unshared caches per processor and possibly a shared cache for a set of processors. Main memory resources are shared among all processors but are the largest and slowest to access, and NUMA resources are local to a subset of processors on a shared socket and incur a high penalty for accesses from processors on a remote socket.

Problems arise when application data cannot fit into small caches and when multiple processors contend for shared resources, both of which are almost always the case, especially in high-performance computing (HPC) applications. The former issue requires processors to evict data from smaller caches to replace it with incoming data, forcing subsequent accesses to acquire data from larger, slower resources. The latter requires processors to communicate across caches in order to ensure that they do not operate on stale copies of cached data.

A wide variety of factors affect the outcome of different memory access patterns, as discussed in Section 1. By analyzing the memory accesses in the contexts of these different factors, we are able to gain an understanding of the interplay between them and in turn formulate a high-level explanation for certain memory behaviors. Finally, high-performance computing experts can use this information to form hypotheses for potential causes of problems and determine possible optimization methods.

4.1 Memory Access Samples

As described in Section 1, there exist many different types of memory performance data, each of which may provide different insights. In this work, we focus on the data acquired from memory access sampling. We configure an available performance monitoring unit in hardware to record a memory access after every N accesses have completed, effectively giving us a subset of all accesses made during a program's execution [12].

Sampling can be configured for different values of N, and in addition, a latency threshold T may be specified. In the latter case, memory accesses are only recorded if their duration exceeds T.


Fig. 1: Exploration and analysis workflow for characterizing memory performance using MemAxes. Initially, MemAxes shows the overview of all memory access samples; then, through guided interaction and cluster extraction, the user may select regions of interest. On any selection of data, the analyst may draw conclusions by identifying patterns in different views and seeing how patterns between multiple attributed contexts correlate with one another.

Different N configurations allow the user to trade off sampling granularity and overhead. Because a low granularity may miss important memory behaviors, it is helpful to also specify a T to bias the sampling towards slower accesses, which may represent bottlenecks and are potentially of more interest.
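To make the configuration concrete, the following is a minimal sketch of this sampling policy in Python. It models only the policy described above (keep every N-th retired access, and only if its latency exceeds T), not the hardware performance monitoring unit, and all names are illustrative.

```python
# Illustrative model of the sampling policy: keep every N-th retired
# access, and only if its latency exceeds the threshold T.
def sample_accesses(accesses, period_n, latency_t):
    """Yield every period_n-th access whose latency exceeds latency_t (cycles)."""
    for i, access in enumerate(accesses, start=1):
        if i % period_n == 0 and access["latency"] > latency_t:
            yield access

# Example with a synthetic access stream.
stream = [{"latency": lat} for lat in (3, 250, 14, 90, 7, 410, 12, 35)]
print(len(list(sample_accesses(stream, period_n=2, latency_t=30))))  # -> 4
```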

Every sample is essentially a tuple of attributes associated with a single memory access event. Some attributes are quantitative, others qualitative. Specifically, the hardware-provided attributes contained in an individual sample are:

Timestamp (quantitative). A timestamp taken at the time the access retires, a numerical value.

Latency (quantitative). The number of elapsed processor cycles from access request to access retirement.

Processor (qualitative). The ID of the processor that issued the access.

Instruction Pointer (qualitative). The address of the instruction that issued the access request.

Data Source (qualitative). An encoding specifying the type of resource where the access was resolved, e.g., L1 cache or remote RAM.

Data Address (qualitative). The pointer address of the data being accessed.

We extend this list of attributes using methods developed in the high-performance computing community [17], [9] to additionally obtain:

Data Symbol (qualitative). The name (as a string) of the symbol from which data was accessed, e.g., "myArray".

Access Index (quantitative). The index used to access the data element from its container, e.g., the value i in "myArray[i]".

State Variables (varies). Any additionally recorded pieces of information, as specified by the programmer.

As a result, each memory access sample can be interpreted as a multidimensional point with a minimum of 8 dimensions and a virtually limitless number of additional program state variables.

In addition to the memory access samples, we store the hardware topology of the executing system using the well-established hwloc [5] tool. The topology has a hierarchical format, with larger memory resources parenting smaller ones and processors at the leaves below the smallest (L1 cache) memory resource.
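As a rough illustration, the sketch below models these two inputs in Python: a memory access sample as a multidimensional record and the hwloc-reported topology as a tree. The class and field names are assumptions made for the example, not MemAxes' actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class AccessSample:
    """One sampled memory access (hardware-provided plus extended attributes)."""
    timestamp: int           # cycle count at retirement
    latency: int             # cycles from request to retirement
    cpu: int                 # issuing processor ID
    instruction_ptr: int     # address of the issuing instruction
    data_source: str         # resolving resource, e.g. "L1" or "remote RAM"
    data_address: int        # accessed virtual address
    data_symbol: str = ""    # e.g. "myArray"
    access_index: int = -1   # e.g. the i in myArray[i]
    state: Dict[str, Any] = field(default_factory=dict)  # extra program state

@dataclass
class TopologyNode:
    """A node of the hardware topology tree (memory resource or CPU)."""
    name: str
    children: List["TopologyNode"] = field(default_factory=list)
    parent: Optional["TopologyNode"] = None
    cycles: int = 0          # aggregated access cycles (filled in later)
    samples: int = 0         # aggregated sample count (filled in later)

    def add_child(self, child: "TopologyNode") -> "TopologyNode":
        child.parent = self
        self.children.append(child)
        return child
```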

5 DESIGN OVERVIEW

Due to the multidimensional nature of the input data, it is tempting to apply existing multidimensional visualization techniques and create a single, all-encompassing view. However, such techniques often lose important contextual information by projecting a wide array of attributes, both qualitative and quantitative, into a lower-dimensional space. The result is a visualization that attempts to do everything but accomplishes very little in terms of specific analysis tasks within the domain.

The most effective performance visualizations have been those that tie a particular data exploration or analysis task to a representative visual metaphor. For example, when analyzing memory fragmentation, it is helpful to use a visual metaphor for a program's memory address space. However, as previously noted, such designs also limit the types of exploration or analysis that may be done. For this reason, we decided to design MemAxes with multiple context-aware views, each of which is suited to a particular set of characteristics relevant for memory performance analysis. In addition, by linking the context-aware views, MemAxes makes it possible to identify correlative relationships between these different characteristics. Thus, this design satisfies our original requirement to make it possible to attribute performance characteristics to one another.

This design also requires effective navigation by the user, however, and the interactive search space of this data is vast. In order to make this space navigable, we developed methods for guided interaction and cluster extraction, which communicate to the user which selections are of potential interest. The result is a tool from which users with limited knowledge can still glean useful insights, while the available interactions for more knowledgeable users are not limited, thus satisfying the usability requirement.


Fig. 2: Visualization of memory access data in the context of the top offending lines of source code (top left) and data objects (top right). The topmost offending line of source code is shown highlighted within its containing file in the source pane (bottom).

In order to fulfill the requirement that MemAxes be visually scalable, all visualization methods used rely on some form of aggregation. The aggregations used produce visual elements that are invariant to the data size, so visual clutter remains constant for all inputs. MemAxes is computationally scalable up to a limit: nearly all interactions incur basic set operations that make use of well-known optimizations.

The exploration and analysis workflow is illustrated in Figure 1.

6 VIEWS

MemAxes comprises five linked interactive views: two context-aware views, one multidimensional view, and two single-dimensional views. The context-aware views aggregate and present a subset of the dimensions relevant for performance analysis and will be discussed in detail. To avoid limiting the user to exploring subsets of the dimensions, we included a multidimensional view showing all dimensions, as well as two single-dimensional views in which the user may focus on certain dimensions in detail. The first shows a standard histogram over a single numerical dimension (such as time), and the second shows a histogram of categorical values (such as instruction) sorted in order of appearance.

We identified the two contexts most relevant to the problems in memory performance based on related work and our experience in HPC and developed context-aware views for each: (1) the program instructions and data, and (2) the hardware topology. For the multidimensional view, we developed (3) a modified version of parallel coordinates we call parallel histograms. We omit a discussion of the single-dimensional views, as they are straightforward.

6.1 Instructions and Data

Many previous performance visualization methods have overlaid performance data onto the source code, and for good reason: optimization efforts most often require modification of the code, or at least an understanding of how it affects performance. Many tools show performance data embedded in callpaths, which are the recorded sequences of function calls in a program execution. These tools often provide the capability for the user to select a function in the callpath and see it highlighted in the source code file. Typically, the goal in this context is to identify which parts of the code contribute most to execution time for the purpose of determining which parts of the code require optimization.

With this in mind, we designed a simple visualization to enable the user to quickly identify the top offenders in terms of the instructions and data symbols in the source code. Top offenders are the components of the code that have the greatest contribution to the number of memory access cycles in an execution. The relevant attributes of a memory access sample in this context are the lines of source code for particular instructions and the data symbols accessed. We calculate the total access cycles associated with these attributes by aggregating latency along them, and we sort the resulting aggregations in decreasing order. These results are the top offending lines of code and data symbols for memory performance, in ranked order.
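The ranking itself reduces to a group-and-sum over the samples. Below is a minimal Python sketch under the assumption that each sample is a plain record whose instruction pointer has already been resolved to a source line; all field names and values are synthetic.

```python
from collections import defaultdict

def top_offenders(samples, key):
    """Return [(value, total_latency_cycles)] sorted in decreasing order."""
    totals = defaultdict(int)
    for s in samples:
        totals[s[key]] += s["latency"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic samples; source lines and data symbols are placeholders.
samples = [
    {"source_line": "foo.c:42", "data_symbol": "myArray", "latency": 320},
    {"source_line": "foo.c:42", "data_symbol": "mass",    "latency": 45},
    {"source_line": "foo.c:87", "data_symbol": "myArray", "latency": 110},
]
print(top_offenders(samples, "source_line"))  # top offending lines of code
print(top_offenders(samples, "data_symbol"))  # top offending data objects
```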

We visualize the top offenders using a horizontal bar chart, which effectively shows the ranking of the top offenders as well as the differences in contribution between them. The user interface of MemAxes includes a pane showing the top offending lines of source code and data objects side-by-side, above a text editor showing the topmost offending line of source code highlighted within its source code file, as shown in Figure 2.

The user is able to select top offenders by clicking on the appropriate bar in the chart. In doing so, all samples associated with that offender, either a line of code or a data object, are selected. If a selection has been made, the top offenders and the highlighted line in the text editor show only the selected sample information; otherwise, they show all samples. Thus, by selecting a line of code, the user may immediately see which data objects were accessed by that line, and conversely, by selecting a data object, the user may see which lines of code accessed it.

6.2 Hardware Topology

Existing tools for visualizing hardware topology are aimed at showing resource locality and basic hardware layout. For example, hwloc (Figure 3) and likwid (not shown) both depict the hierarchy of the hardware topology using a form of icicle plot. We identified two main drawbacks in the current approaches: they use visual real estate poorly and thus do not scale well to complex topologies, and no existing methods are able to visualize acquired memory access samples in the context of the hardware topology. We designed a new visualization for the hardware topology with improved scalability for complex architectures and the capability to visually embed memory access samples via a function that maps them to the visual components of the hardware.

Visual Layout. The hardware topology may be interpreted as a hierarchy where a single computation node represents the root, memory resources (RAM, NUMA, caches) represent internal nodes, and processors (CPUs) represent the leaves. As such, hierarchical visualizations readily apply to visualizing hardware topology.


Fig. 3: The hardware topology visualization produced by hwloc, which uses a horizontal layout. The pink rectangle denotes a single RAM resource [5]. The caches are drawn underneath in white, processor cores in dark gray, and processing units (CPUs) embedded inside the cores.

However, the majority of visualizations show the topology using horizontal layouts, commonly called icicle plots [16], such as the one in Figure 3. This layout allocates the same amount of visual space to each level in the hierarchy; however, by design, hardware topologies have few resources in the top levels of the hierarchy (larger, slower memory resources) and many in the lower levels (smaller, faster caches and many processors). As a result, an enormous amount of space is wasted showing the few large memory resources, and the number of processors that may be viewed at once is limited by the horizontal viewing space.

As a solution, we use a radial, space-filling visual layout for the hardware topology, essentially a sunburst chart [22], as shown in Figure 4 (a). This layout allocates less space to higher levels in the hierarchy and more space to lower levels, thus providing a much improved solution for visualizing hardware topology compared to horizontal layouts.
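One way to realize such a layout is to divide each node's angular span evenly among its children, so the many small resources near the leaves receive most of the arc length at larger radii. The following sketch illustrates that idea; it reuses the illustrative TopologyNode from the Section 4.1 sketch, assumes unique resource names, and is not the tool's layout code.

```python
def assign_arcs(node, start=0.0, end=360.0, depth=0, arcs=None):
    """Recursively assign a (start_angle, end_angle, depth) arc to each node.

    Works with any node exposing `.name` and `.children`, such as the
    illustrative TopologyNode defined earlier.
    """
    if arcs is None:
        arcs = {}
    arcs[node.name] = (start, end, depth)
    if node.children:
        span = (end - start) / len(node.children)
        for i, child in enumerate(node.children):
            assign_arcs(child, start + i * span, start + (i + 1) * span,
                        depth + 1, arcs)
    return arcs
```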

We colormap the nodes to show resource utilization, either by total number of cycles or by number of samples. Additionally, we add lines between nodes to depict resource connectivity and scale the thickness of these lines to show data traffic between resources. Although using line thickness to visually quantify data has a more limited range than a colormap, from 1 to 50 pixels at most, it is enough to show relative differences between links and preferable to showing multiple colormaps in the same view.

As a result, we are able to embed the utilization of processors and memory resources, and the data traffic between resources, onto our visual representation of the hardware topology, as shown in Figure 4 (b).

Mapping Samples to Hardware Topology. In order to visualize accesses in this layout, we defined a function mapping each sampled memory access to a set of associated resources and links. We then aggregate all accesses over the resources and links and embed the resulting aggregations into the visualization.

Each access has an associated depth in the memory hierarchy and an associated processing unit (CPU). These represent the level in cache or memory from which a piece of memory was accessed and the CPU that requested the access, respectively.


Fig. 4: (a) The radial layout of the hardware topology in MemAxes for a simple architecture. Main memory resources are dark purple, L3 caches are light purple, L1 and L2 caches are light orange, and processing units are dark orange. (b) A more complex architecture with performance data annotated onto the hardware resources, via a color map, and onto the transactions between them, via line width.

For each access, we perform a tree traversal starting from the CPU that issued the access and ending at the depth where the access was resolved. The traversed path represents the actual motion of the data that was accessed for a particular memory access sample; the CPU searches for the data first in the nearest cache, and if it fails, it searches the next nearest cache, and so on until the request is satisfied. The data is then copied to all smaller caches between the level in which it was found and the CPU, thus incurring traffic on the links between the resources on this path.

If we can define an aggregation function for an attribute, such as sum/average/deviation (if quantitative) or minimum/maximum/count (if qualitative), we can then use this function to aggregate the attribute along the traversed path. For most types of performance analysis, it is useful to aggregate the number of accesses that traversed each link and the sum of latencies for all accesses to each resource, so this is presented by default. The resulting visualization depicts the distribution of data transactions across links as well as the distribution of access cycles across resources, as seen in Figure 4 (b). Thus, if a resource has high outgoing traffic but low cycles, we can conclude that it is being used often and efficiently, and if the opposite is true, there may be a performance problem.
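A rough sketch of this per-sample aggregation is given below, again using the illustrative TopologyNode from the Section 4.1 sketch. The depth convention (root = 0, CPUs deepest) and the link-traffic bookkeeping are assumptions made for the example.

```python
def aggregate_sample(cpu_leaf, resolved_depth, latency, link_traffic):
    """Accumulate one sampled access onto the resources it traversed.

    cpu_leaf       -- TopologyNode of the issuing CPU (a leaf)
    resolved_depth -- tree depth of the resource that satisfied the access
                      (root = 0, so smaller means farther from the CPU)
    latency        -- cycles added to every resource on the traversed path
    link_traffic   -- dict mapping (child_name, parent_name) -> #transactions
    """
    # Depth of the leaf: number of ancestors up to the root.
    depth, probe = 0, cpu_leaf
    while probe.parent is not None:
        probe, depth = probe.parent, depth + 1

    # Walk upward from the CPU to the resolving resource.
    node = cpu_leaf
    while node is not None and depth >= resolved_depth:
        node.cycles += latency
        node.samples += 1
        if depth > resolved_depth and node.parent is not None:
            key = (node.name, node.parent.name)
            link_traffic[key] = link_traffic.get(key, 0) + 1
        node, depth = node.parent, depth - 1
```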

Because aggregation schemes provide a summary of the input data, it is critical for the user to define or discover interesting subsets for analysis, rather than attempt to analyze an aggregation of the entire dataset. A particular phenomenon of performance behavior may only be visible under a highly specific selection of the dataset. We address this problem in Section 7.

6.3 Parallel Histograms

Though context is highly valuable from an analysis standpoint, it also limits the number of visualizable attributes. In order to include the missing details, we included a context-unaware visualization we call parallel histograms. Parallel histograms are histograms of each attribute plotted in parallel (similar to parallel coordinates), as shown in Figure 5. Because MemAxes shows parallel histograms instead of parallel coordinates, the data is again aggregated, allowing this view to scale to large numbers of samples.

Parallel histograms allow the user to easily view the distributions of samples within every dimension of the access data.


Fig. 5: Parallel histograms of memory access samples. Each attribute is plotted along a vertical histogram, and all attribute histograms are placed side-by-side in parallel.

As selections are made in the context-specific views, the user can identify whether any of those selections are correlated with an individual dimension not shown in the other views. Most importantly, through parallel histograms the user may make ranged selections along each axis by clicking and dragging along it, and may combine ranged selections as well. Thus, parallel histograms complement the context-specific views by providing comprehensive selection and visualization capabilities.

This combination of linked visualizations over a central dataset allows users to explore the relationships between the various actors in memory performance. For example, by selecting a top offending line of source code, we are able to determine not only which data objects contributed most to access time, but also which individual elements inside the data object did. Individual elements may be slow to access if the data structures are not optimized for locality, in which case the hardware is unable to effectively use caches. It may also be the case that the decomposition of data to hardware threads prevents the hardware from utilizing shared caches. This decomposition depends on the topology as well as the data layout and access patterns, and as such it is imperative that the user analyze how all three contexts relate to one another to understand the memory performance behavior.
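Conceptually, combining ranged selections amounts to intersecting per-axis filters over the shared set of samples. A minimal sketch, with dict-based samples and hypothetical field names:

```python
def select(samples, brushes):
    """Return the samples satisfying every ranged selection.

    `brushes` maps attribute name -> (low, high), inclusive on both ends.
    """
    return [s for s in samples
            if all(lo <= s[attr] <= hi for attr, (lo, hi) in brushes.items())]

samples = [{"latency": 40, "cpu": 3}, {"latency": 900, "cpu": 3},
           {"latency": 120, "cpu": 11}]
print(select(samples, {"latency": (100, 1000), "cpu": (0, 7)}))
# -> [{'latency': 900, 'cpu': 3}]
```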

7 GUIDED INTERACTION

MemAxes provides a comprehensive set of interaction and exploration tools through linking and selection; however, unless the user knows what specific characteristics to look for, the exploration parameter space is often too extensive to search through effectively. Questions in performance analysis often require the user to identify anomalous subsets of data within the entire dataset, such as periods of poor load balance. Such subsets may be as small as 100 samples within sets of several thousand. It is clearly infeasible for the user to select every window of 100 samples within the dataset.

As a solution, we propose automatically scoring subsets of samples based on how well they relate to the types of behaviors that performance analysts typically look for, and presenting the resulting scores to the user. The user can make selections based on these scores and analyze the resulting visualizations.

We first describe the metrics as functions of input subsets, and then we describe how we choose those inputs.

7.1 Proposed Scoring Metrics

From collaboration with domain experts and from previous surveys in the field of performance visualization [13], we determined a set of behaviors that performance analysts typically search for. We defined expressions to calculate the relationship between a subset of samples and each behavior. We target two performance behaviors: average access latency and load imbalance.

Our proposed metrics take as input the accumulated memory access cycles on the hardware topology for a subset of samples, as described in Section 6, and produce a single value as output. In order for differently sized subsets to be compared to one another, a metric must be invariant to the number of samples in the subset.

In the following equations (1) and (2), we denote depth in the hardware topology by $d$, the vector of hardware resources at a specified depth by $\mathbf{r}_d$, and a specific resource by $r_{d,i}$. The value $c(r_{d,i})$ represents the number of memory access cycles associated with a particular resource, and $s(r_{d,i})$ represents the number of memory accesses.

Average Access Latency. In practice, the number of cycles taken to access a cache or main memory resource is highly variable due to hardware effects. For example, to maintain cache coherence, the hardware has to keep track of which copies of cached data are stale and update them when a core attempts to read them. This is done opaquely in hardware but contributes significantly to the wait time for accessing data. To provide an indicator of the average access latency achieved in a particular execution, we calculate the average number of cycles per access at depth $d$, as shown in equation (1):

$$ L_d = \frac{1}{|\mathbf{r}_d|} \sum_{i=1}^{|\mathbf{r}_d|} \frac{c(r_{d,i})}{s(r_{d,i})} \qquad (1) $$

Load Imbalance Across Resources. In general, performance degrades when memory access is less uniform across the available resources. Often, a few resources are highly over- or under-utilized relative to the average utilization. This behavior is indicative of bottlenecks that may prevent an application from achieving peak performance. In terms of the data, we can identify subsets that exhibit this behavior by searching for regions with significant outliers in hardware utilization.

We calculate the imbalance metric $I_d$ for a specified depth in the hardware topology as the maximal difference, in terms of memory accesses, between each resource and the average number of samples over all resources at that depth, $\mu_d(s)$. In order for this metric to be size-invariant, we normalize the difference by the standard deviation of the number of samples across the resources at that depth, $\sigma_d(s)$, as shown in equation (2):

$$ I_d = \frac{\max_{i=1}^{|\mathbf{r}_d|} \left| s(r_{d,i}) - \mu_d(s) \right|}{\sigma_d(s)} \qquad (2) $$
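Both metrics are straightforward to compute from the per-resource aggregates. The sketch below is an illustrative Python rendering of equations (1) and (2); the guard that skips resources without samples is a practical addition for the example, not part of the formulas.

```python
from statistics import mean, pstdev

def avg_access_latency(cycles, counts):
    """Equation (1): mean cycles-per-access over the resources at one depth.

    cycles[i] and counts[i] stand for c(r_{d,i}) and s(r_{d,i});
    resources with no recorded accesses are skipped.
    """
    return mean(c / s for c, s in zip(cycles, counts) if s > 0)

def load_imbalance(counts):
    """Equation (2): max |s_i - mean| normalized by the standard deviation."""
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return 0.0
    return max(abs(s - mu) for s in counts) / sigma

# Example: four resources at the NUMA level (synthetic numbers).
print(avg_access_latency([4000, 5200, 300, 350], [40, 48, 30, 33]))
print(load_imbalance([40, 48, 30, 33]))
```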


7.2 Metric Visualization

Given these two scoring metrics, we need to provide input subsets of potential interest. Because we aim to determine how scores vary with respect to each attribute, we provide as input a subset of samples with a bounded range of values along a single attribute. In other words, we select samples within a window of values along the attribute and use that as input. We divide the attribute into N windows, calculate the metric for each, and visualize the resulting scores via color-mapped blocks along the attribute's axis, as shown in Figure 6. The user may specify the metric to use as well as the number N of windows to adjust the scope and granularity of the visualization. The resulting visual cue effectively highlights the variation of the metric along an attribute, from which the user may decide to select outliers for further analysis.
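The windowed scoring can be sketched as follows: split one attribute's value range into N equal bins, score each bin's samples with a chosen metric, and return one score per bin for colormapping along the axis. The function and field names below are illustrative.

```python
def score_windows(samples, attr, n_windows, metric):
    """Return one metric score per equal-width window along `attr` (None if empty)."""
    lo = min(s[attr] for s in samples)
    hi = max(s[attr] for s in samples)
    width = (hi - lo) / n_windows or 1  # avoid zero width for constant attributes
    bins = [[] for _ in range(n_windows)]
    for s in samples:
        i = min(int((s[attr] - lo) / width), n_windows - 1)
        bins[i].append(s)
    return [metric(b) if b else None for b in bins]

# Example: mean sampled latency per window along the timestamp axis.
samples = [{"timestamp": t, "latency": l}
           for t, l in [(0, 40), (1, 55), (5, 300), (6, 320), (9, 40)]]
print(score_windows(samples, "timestamp", 3,
                    lambda b: sum(s["latency"] for s in b) / len(b)))
```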

Fig. 6: Histogram and guided interaction along a single attribute. The colored rectangles at the bottom encode derived metrics for bins of samples along the attribute. Darker colors indicate areas of higher imbalance across NUMA memory resources.

7.3 Metric-based Clustering

It is often the case that a range of neighboring bins have similar metric scores. As a result, viewing each of these bins may show redundant results; conversely, an individual bin may not contain enough samples to show meaningful correlations. A more effective binning scheme would group together all samples that have similar properties and only present subsets with unique behaviors.

We have already defined a robust method for scoring data subsets that is invariant to subset size. For our purposes, we additionally require a clustering approach that is scalable, both computationally and visually, and that makes it possible to analyze how metrics vary with respect to each attribute. We therefore use the constraint that a cluster of samples must be a range of consecutive samples along an attribute. With this approach, each subset only needs to be compared to its neighbors along a single attribute, and the resulting clusters represent different positions and ranges along the attribute. Furthermore, we want a clustering solution that adapts to different datasets and is adjustable in granularity. For these reasons, we use agglomerative clustering, taking as input a set of sliding windows along the attribute.

We generate the initial subsets by moving a sliding window along a specified attribute. We create a window for the smallest w samples along the attribute, store this subset of w samples as a leaf, and slide the window forward by ∆ < w samples until we reach the end of the attribute. For N samples, this method creates approximately N/∆ leaves, each containing w samples. The process is illustrated in Figure 7.

This approach has a number of advantages. Individual samples are insufficient to characterize performance behavior, so clustering based on individual samples is unpredictable and leads to undesirable results. This method ensures that all leaves have w samples, where w is a controllable parameter. By creating windows along an attribute, we create subsets that define how the performance behavior changes with respect to that attribute, which was a primary goal for our automated analysis method.

Fig. 7: The process for clustering memory access samples. We create a window of w samples starting at the minimum value along the attribute and move it by an amount ∆ < w samples until we reach the end, creating a subset for the w samples at each window position. We agglomeratively cluster the subsets using our proposed metrics from Section 7 as distance metrics.

Further, because the windows overlap, we capture subsets that start and end at any point along the attribute, within the resolution of ∆, and we can capture variable-length behavior patterns.

We control the accuracy, shape, and computational complexity of the generated tree by modifying w and ∆. Higher values of w capture longer, more stable behaviors at the cost of missing outliers, and higher values of ∆ reduce the number of leaves and the time to create the tree at the cost of resolution in the possible begin and end points of performance behaviors.
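The window generation itself is simple; below is a minimal sketch (samples as dicts, field name illustrative) that produces the overlapping leaves described above.

```python
def window_leaves(samples, attr, w, delta):
    """Return overlapping leaves of w samples each, sliding by delta along `attr`."""
    ordered = sorted(samples, key=lambda s: s[attr])
    leaves, start = [], 0
    while start + w <= len(ordered):
        leaves.append(ordered[start:start + w])
        start += delta
    return leaves

# Example: 10 samples, windows of 4 sliding by 2 -> 4 overlapping leaves.
samples = [{"zidx": z} for z in range(10)]
print(len(window_leaves(samples, "zidx", w=4, delta=2)))
```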

Fig. 8: Cluster viewing and selection. Clusters of a specified depth and metric are shown along the attribute's axis as markers (circles) colored by their derived metrics. When the user hovers the mouse pointer over a marker, a glyph expands showing the hardware topology visualization for the subset of samples contained in the cluster. Upon clicking the marker, the subset is selected for further analysis.

After creating a set of initial subsets along an attribute, we agglomeratively cluster these subsets. We define the similarity (i.e., distance) between subsets using the derived performance metrics described in Section 7.1; specifically, we use the absolute difference between their metric values. With M metrics and D attributes, we have M × D possible cluster trees. We note that the windows often contain different numbers of samples. Because the derived performance metrics are invariant to the number of samples contained in a subset, this is not a problem.

Our agglomerative clustering algorithm iteratively groups together the closest pairs of subsets, with the constraint that subsets may only be grouped together if they are neighbors along the attribute.


Fig. 9: MemAxes upon opening the memory performance data collected from LULESH, before performing any interaction. In the topology view (marked 1), we can see that one processor was extremely under-utilized relative to the rest. We also see that several accesses were made to the top NUMA node (the dark orange arc directly above the center) but few, if any, were made to the bottom one. In the code/data view (marked 2), we see a parallel loop in the code accessing elements fy and nodalMass from the domain. In the top offenders view directly above, we see that more time was spent waiting for fy than for nodalMass.

All resulting clusters represent a range along the attribute, and separation of clusters along an attribute indicates a shift in performance behavior. This clustering method accurately extracts ranges of samples that exhibit a particular performance behavior along an attribute, and by exploring these behaviors, the user can effectively understand how a type of performance behavior varies with respect to a particular variable.
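A compact sketch of the neighbor-constrained agglomeration is shown below. It uses any size-invariant scoring function (such as the metrics of Section 7.1) as the distance, and stopping at a requested cluster count stands in for cutting the cluster tree at a chosen depth; this is an illustration of the constraint, not the tool's implementation.

```python
def agglomerate(leaves, metric, n_clusters):
    """Merge adjacent windowed leaves until n_clusters remain.

    Each cluster is a contiguous range along the attribute; at every step
    the pair of neighboring clusters with the smallest absolute difference
    in metric score is merged and rescored.
    """
    clusters = [{"samples": list(leaf), "score": metric(leaf)} for leaf in leaves]
    while len(clusters) > n_clusters:
        i = min(range(len(clusters) - 1),
                key=lambda j: abs(clusters[j]["score"] - clusters[j + 1]["score"]))
        merged = clusters[i]["samples"] + clusters[i + 1]["samples"]
        clusters[i:i + 2] = [{"samples": merged, "score": metric(merged)}]
    return clusters
```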

7.4 Viewing and Selecting Clusters

Once the cluster trees are generated, the user may view individual clusters at a glance and select their associated samples for further analysis if desired. The user navigates clusters along an attribute by selecting different metrics and tree depths. After making a selection, the user is presented with markers indicating the locations of clusters along the attribute's axis line. The markers are small circles along the axis, colored by their metric values. When the user hovers the mouse pointer over a marker, it expands to show a glyph of the hardware topology visualization for its respective sample subset. Thus, the user can quickly gain an overview of the resulting selection and decide whether to make it. By clicking on the marker, the user selects the subset contained in the cluster, and all views are updated to show the current selection. The cluster viewing and selection interface is shown in Figure 8.

8 CASE STUDIES

In the following, we present case studies that demonstrate the workflow when using MemAxes, including initial exploration, data-driven hypothesis forming, and experimental validation. All experiments were performed on the Cab cluster [1] at Lawrence Livermore National Laboratory, which features two eight-core Intel Xeon E5-2670 sockets on each node, for a total of 16 cores with L1 and L2 caches, two shared L3 caches, and two NUMA memory banks.

We collected memory performance data from two commonly used HPC proxy applications, each of which isolates computationally complex code sections from a larger application for the purpose of analyzing performance in detail. The proxy applications we analyzed are LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) [14] and XSBench [23].

8.1 LULESH

LULESH emulates the behavior of an iterative shock-hydrodynamics simulation involving a mesh that deforms over time. The mesh deformation requires the simulation to retrieve the changing locations of cells and corners in the mesh in order to access their values. Each value retrieval requires multiple memory accesses, and it has been hypothesized that this step is a significant contributor to total computation time.

We configured the memory access sampler to record a set of additional attributes from the application: the x-, y-, and z-coordinates in the mesh for each access. These additional attributes make it possible to further discern the relationships between the layout of the data and other characteristics, such as locality in different memory resources.


Fig. 10a: Cluster glyphs generated by clustering along the zidx axis using the processor load imbalance metric. Consecutive z-indices were accessed by distant processors on either side of the hardware topology, causing contention across both processor sockets.

Fig. 10b: Cluster glyphs generated by clustering along the zidx axis using the processor load imbalance metric, after mapping consecutive z-indices to neighboring processor cores. We can see that consecutive z-indices were accessed by neighboring processors, meaning cross-socket data contention was effectively reduced. We observe a 60% decrease in memory access cycles and a 10% decrease in total execution time.

A screenshot of the initial state of MemAxes after loading the collected data, and before any user interaction, is shown in Fig. 9. The views depict all collected sample points, and we can gain some initial insights into overall execution. In the topology view, marked with a red number 1, we see that one of the processors appears to be extremely under-utilized with respect to the other processors. In addition, the first NUMA node (the top innermost arc) is associated with many accesses and transactions, based on its color and link thickness, while the other appears to have been used sparingly, if at all. We presented these initial observations to the developers of LULESH and they confirmed that (1) inter-node communication was handled by the message-passing library, with this particular library implementation reserving a single core for communication operations, and (2) intra-node parallelism in this version of LULESH was achieved by distributing work to all processors on a node, with allocation done only on the first NUMA node, leaving few accesses to be made to the second. This valuable developer feedback validated our data collection and aggregation methods.

The code and data views (marked with a red number 2) revealed the most memory-intensive section of code in LULESH. Concerning the source code, we saw a loop preceded by a parallel directive, indicating a parallel region. Within this region, one access had been made to each of the variables fx, fy, fz, and three accesses had been made to nodalMass. However, in the sorted histogram of variable contributions (emphasizing top-offending variables) for this particular region of code, we saw that the accesses to each of the other variables had actually been more expensive overall, although more accesses had been made to nodalMass. In order to further study this anomaly, we selected each of these data objects and examined the selections in the different views. This analysis revealed that nodalMass was rarely accessed by the L3 cache (only 16 times), and those accesses cost an average of 405 cycles. The other data objects, fx, fy, and fz, were only slightly more frequently accessed by L3, with average costs of between 1100 and 2700 cycles. However, these same data objects incurred few accesses by the L2 cache. These observations led us to hypothesize that the data objects were small enough to fit into the L1 cache, and that accesses to the shared L3 cache occurred due to contention between multiple processors.

In order to further investigate the hypothesis that the fx, fy, and fz data objects were involved in shared L3 cache contention, we considered the load imbalance metric across processor resources. By clustering based on processor imbalance across different dimensions of the data and examining the resulting cluster glyphs, we were able to identify an undesirable data decomposition pattern, shown in Fig. 10a. Along the zidx dimension, consecutive elements had apparently been accessed by processor cores residing at opposite ends of the hardware topology, i.e., by distant cores across both available sockets. This behavior is highly inefficient: neighboring data elements in the z-direction were consequently contended across the entire topology and required constant data migration to maintain cache coherence.
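Scoring a sample subset by processor load imbalance can be sketched as follows. This is a hedged illustration using one common formulation (max over mean, minus one); the exact metric implemented in MemAxes may differ:

```python
from collections import Counter
from typing import Iterable, Tuple

def load_imbalance(samples: Iterable[Tuple[int, int]], n_cores: int) -> float:
    """Score a sample subset by how unevenly its access cycles are spread
    over cores. Each sample is a (core_id, cycles) pair; 0.0 means
    perfectly balanced, larger values mean a few cores carry most of the
    cost."""
    per_core: Counter = Counter()
    for core, cycles in samples:
        per_core[core] += cycles
    totals = [per_core.get(c, 0) for c in range(n_cores)]
    mean = sum(totals) / n_cores
    return 0.0 if mean == 0 else max(totals) / mean - 1.0
```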

We designed an experiment to resolve this observed inefficiency and collected memory access samples again. By changing the mapping of indices such that consecutive indices would be processed by neighboring processor cores, we hypothesized that remote accesses would be reduced and performance would improve. We performed the experiment and, again, performed clustering based on processor imbalance across z-indices to verify that we had successfully reduced cross-socket data contention, see Fig. 10b. Most importantly, we observed a 60% decrease in total latency and a 10% decrease in total execution time. This result represents a significant performance improvement in terms of both time and energy consumption. (HPC applications typically consume on the order of several hundred joules per second per compute node, and applications can run for several hours using several thousand compute nodes.)

Fig. 11: Hardware topology view of sampled memory accesses of the XSBench application execution. Resource utilization was mostly uniformly distributed, with the exception of NUMA resources, where all accesses occurred in the first (top) NUMA resource.
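The index remapping described in this experiment was made in LULESH's loop scheduling; purely as a schematic of the idea, the sketch below uses hypothetical names and assumes a block (static) partition of z-indices over cores listed in topology order:

```python
from typing import List

def core_for_zindex(z: int, n_z: int, cores_in_topology_order: List[int]) -> int:
    """Map consecutive z-indices to neighboring cores.

    cores_in_topology_order lists core IDs so that adjacent entries share
    caches and sockets (e.g., cores 0-7 on socket 0 followed by cores 8-15
    on socket 1 for a Cab node). A block partition then keeps neighboring
    z-indices on neighboring cores, so data shared across a z-boundary
    tends to stay within one socket."""
    n_cores = len(cores_in_topology_order)
    block = (n_z + n_cores - 1) // n_cores      # ceil(n_z / n_cores)
    return cores_in_topology_order[min(z // block, n_cores - 1)]
```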

8.2 XSBench

XSBench is a proxy application for OpenMC, a Monte Carlo simulation for calculating particle interactions. Due to its underlying non-deterministic algorithm, its memory accesses follow a less predictable pattern and thus pose a difficult performance analysis problem.

The initial topology view for the XSBench data shows fairly uniform utilization across all resources, except across NUMA domains (see Fig. 11). Since the first NUMA domain incurred 282 sampled accesses, while the second incurred none, we hypothesized that this application had allocated data in only one of the two available NUMA domains. This hypothesis was confirmed by the developers.

We investigated these NUMA accesses further by selecting them and observing the other dimensions in the parallel histograms view. A clear correlation emerged with the time dimension, with three sections of time showing different distributions of sampled accesses and nearly all NUMA accesses occurring during the third one. We also examined different metrics of imbalance across this axis and observed that the second section of time incurred high amounts of L2 imbalance. These observations led us to hypothesize that the XSBench application underwent three separate phases over time: the first had relatively few memory accesses, the second accessed data objects too large to fit into the L1 cache, and the third caused several NUMA accesses (see Fig. 12).

Fig. 12: Top: histogram of the time dimension, in which three time sections can be seen, indicating different execution phases (marked). Middle: the same dimension after selecting accesses to NUMA resources; nearly all NUMA accesses occurred during the third phase. Bottom: the same dimension with the L2 imbalance metric shown along the axis; the second phase caused the most L2 cache imbalance.
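The selection used here, restricting samples to NUMA-resolved accesses and counting them per time phase, can be sketched as follows. The field layout and the "NUMA" resource label are illustrative assumptions, not the tool's data model:

```python
from bisect import bisect_right
from typing import List, Tuple

def numa_accesses_per_phase(samples: List[Tuple[float, str]],
                            phase_boundaries: List[float]) -> List[int]:
    """Count samples that resolved in a NUMA domain, per execution phase.

    Each sample is a (timestamp, resource_level) pair; phase_boundaries
    holds the timestamps separating the phases visible in the time
    histogram (two boundaries yield three phases)."""
    counts = [0] * (len(phase_boundaries) + 1)
    for t, level in samples:
        if level.upper().startswith("NUMA"):
            counts[bisect_right(phase_boundaries, t)] += 1
    return counts
```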

We verified these hypotheses once again with the developers of XSBench. The developers confirmed that the application involved three main phases: the first being relatively inexpensive in terms of memory accesses, the second involving a sorting step over a large data set, and the third performing non-deterministic memory accesses to the same data object from multiple processor cores. The last phase caused slow accesses serviced by large memory resources, as the hardware failed to predict and prefetch data elements for computation.

9 CONCLUSIONS AND FUTURE WORK

MemAxes is a novel tool for the analysis of memory performance behaviors in high-performance computing environments. By creating multiple aggregation-based visualizations of a central dataset, MemAxes provides the capability to understand the complex relationships between the hardware, the data, and the code. It introduces a new visualization for data transactions in a multi-level cache hierarchy that is capable of exposing aspects of load balance, data decomposition, and parallel access patterns that were previously difficult to discover. MemAxes also explores the concept of guided interaction via visual cues and automatic cluster extraction based on scoring subsets of the data. We were able to successfully utilize MemAxes to analyze the complex memory behaviors of two memory-intensive proxy applications, and we were able to significantly improve the performance of one of them using insights from this analysis, saving time and energy.

While we proposed two metrics for scoring sample sets, we believe that much can be done to create more advanced metrics. Furthermore, although the method of scoring is highly application-dependent, the idea of using scores to create visual cues and describe similarity between subsets could be extended to other applications. In particular, scoring would enable many visualization and data processing techniques for high-dimensional or non-quantitative data that otherwise has ill-defined distance metrics or similarities.

We found that the clustering algorithm we used was both justified for this application and practically effective. However, for different data or scoring functions, many generalizations and specializations are possible. The space of possible initial clusterings for agglomerative clustering is infeasibly large, and effectively exploring this space remains a challenge.

We tested MemAxes with up to 150,000 samples of 13 dimensions each. Though the aggregation-based visualizations scale well to large numbers of samples, selection and filtering operations do not. This becomes especially difficult for multi-node executions, where each node can create several hundred thousand samples and there can be several thousand nodes. This problem is ubiquitous in the field of large-scale data processing and analysis, and MemAxes would benefit greatly from state-of-the-art methods in distributed database processing.

Analysis of memory performance behavior continues to be a challenging topic, as memory hierarchies continue to increase in complexity. New processors have complex on-node interconnection networks, and future hardware poses an interesting challenge in terms of scale and extensibility. Further, processors continue to evolve, and our techniques will need to be extended to provide insight into richer data from future performance monitoring units.

ACKNOWLEDGMENTS

We thank John Tramm and Tanzima Islam for their expert feedback and help in the analysis of XSBench. We also thank Barry Rountree for his help in diagnosing memory behaviors.

This research was supported by Lawrence Livermore National Laboratory (LLNL) and the University of California (UC) at Davis through its UC Laboratory Fees Research Grant Program.

REFERENCES

[1] Cab: Intel Xeon system in Livermore Computing. http://computation.llnl.gov/computers/cab.

[2] L. Adhianto, S. Banerjee, M. W. Fagan, M. Krentel, G. Marin, J. M. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6), 2010.

[3] B. Alpern, L. Carter, and T. Selker. Visualizing computer memory architectures. In Proceedings of the First IEEE Conference on Visualization (Visualization '90), pages 107–113. IEEE Computer Society Press, 1990.

[4] R. Bosch. Using visualization to understand the behavior of computer systems. PhD thesis, Stanford University, 2001.

[5] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst. hwloc: A generic framework for managing hardware affinities in HPC applications. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '10), pages 180–186, Washington, DC, USA, 2010. IEEE Computer Society.

[6] A. M. Cheadle, A. J. Field, J. W. Ayres, N. Dunn, R. A. Hayden, and J. Nystrom-Persson. Visualising dynamic memory allocators. In Proceedings of the 5th International Symposium on Memory Management (ISMM '06), pages 115–125, New York, NY, USA, 2006. ACM.

[7] A. N. M. I. Choudhury and P. Rosen. Abstract visualization of runtime memory behavior. In Proceedings of the 6th IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2011), pages 1–8, 2011.

[8] N. Denoyelle, B. Goglin, and E. Jeannot. A topology-aware performance monitoring tool for shared resource management in multicore systems. In Proceedings of Euro-Par 2015: Parallel Processing Workshops, Lecture Notes in Computer Science, Vienna, Austria, Aug. 2015. Springer.

[9] A. Gimenez, T. Gamblin, B. Rountree, A. Bhatele, I. Jusufi, P.-T. Bremer, and B. Hamann. Dissecting on-node memory access performance: A semantic approach. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC14), pages 166–176. IEEE, 2014.

[10] R. Griswold and G. Townsend. The visualization of dynamic memory management in the Icon programming language. Technical report, Department of Computer Science, University of Arizona, 1989.

[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2012.

[12] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3B, August 2007.

[13] K. E. Isaacs, A. Gimenez, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, B. Hamann, and P.-T. Bremer. State of the art of performance visualization. In Proceedings of the Joint Eurographics-IEEE VGTC Symposium on Visualization (EuroVis 2014), State-of-the-Art Reports, 2014.

[14] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, and Z. DeVito. LULESH programming model and performance ports overview. 2012.

[15] M. Kowarschik and C. Weiß. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies: Advanced Lectures, volume 2625 of Lecture Notes in Computer Science, pages 213–232. Springer, 2003.

[16] J. B. Kruskal and J. M. Landwehr. Icicle plots: Better displays for hierarchical clustering. The American Statistician, 37(2):162–168, 1983.

[17] X. Liu and J. Mellor-Crummey. A data-centric profiler for parallel programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pages 1–12, New York, NY, USA, Nov. 2013. ACM.

[18] S. Moreta and A. Telea. Visualizing dynamic memory allocations. In Proceedings of the 4th IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2007), pages 31–38, 2007.

[19] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, 1999.

[20] P. Rosen. A visual approach to investigating shared and global memory behavior of CUDA kernels. pages 1–10, Mar. 2013.

[21] M. Schulz, J. A. Levine, P.-T. Bremer, T. Gamblin, and V. Pascucci. Interpreting performance data across intuitive domains. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP), pages 206–215. IEEE, 2011.

[22] J. Stasko and E. Zhang. Focus+context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations. In Proceedings of the IEEE Symposium on Information Visualization (InfoVis 2000), pages 57–65, 2000.

[23] J. R. Tramm, A. R. Siegel, T. Islam, and M. Schulz. XSBench - the development and verification of a performance abstraction for Monte Carlo reactor analysis. In Proceedings of PHYSOR 2014, 2014.

[24] J. Treibig, G. Hager, and G. Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 39th International Conference on Parallel Processing Workshops (ICPPW), pages 207–216, Sept. 2010.

[25] W. A. Wulf and S. A. McKee. Hitting the memory wall: Implications of the obvious. SIGARCH Computer Architecture News, 23(1):20–24, Mar. 1995.


Alfredo Gimenez is a Ph.D. candidate at the University of California, Davis, focusing on scalable visualization and analysis techniques for performance data in HPC environments. He received a B.S. in computer science from UC Davis in 2010, worked at Intel Corporation from 2010 to 2012 developing tools for debugging and analyzing general-purpose GPU code, and has been an active collaborator at Lawrence Livermore National Laboratory since 2012.



Todd Gamblin is a Computer Scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research focuses on scalable tools and algorithms for measuring, analyzing, and visualizing the performance of massively parallel applications. He leads several projects in these areas and has been at LLNL since 2008. Todd received the Ph.D. and M.S. degrees in Computer Science from the University of North Carolina at Chapel Hill in 2009 and 2005. He received his B.A. in Computer Science and Japanese from Williams College in 2002. He has also worked as a software developer in Tokyo and held graduate research internships at the University of Tokyo and IBM Research. Todd recently received an Early Career Research Award from the U.S. Department of Energy.


Ilir Jusufi is a Senior Lecturer in the Department of Media Technology at Linnaeus University, Växjö, Sweden. He was a Postdoctoral Scholar at the University of California, Davis. He received a B.S. in Computer Science from the South East European University in Macedonia and an M.S. in Computer Science from Växjö University in Sweden. He earned his Ph.D. degree at Linnaeus University in Sweden, focusing on visualization and interaction techniques for multivariate networks.


Abhinav Bhatele is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His interests lie in performance optimization through analysis, visualization, and tuning, and in developing algorithms for high-end parallel systems. His thesis was on topology-aware task mapping and distributed load balancing for parallel applications.

Abhinav received a B.Tech. degree in Computer Science and Engineering from I.I.T. Kanpur, India in May 2005 and M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 2007 and 2010, respectively. Abhinav was an ACM/IEEE-CS George Michael Memorial HPC Fellow in 2009. He has received several awards for his dissertation work, including the David J. Kuck Outstanding MS Thesis Award in 2009, a Distinguished Paper Award at Euro-Par 2009, and the David J. Kuck Outstanding Ph.D. Thesis Award in 2011. Recently, a paper he co-authored with LLNL and external collaborators was selected for a best paper award at IPDPS in 2013.


Martin Schulz is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). He earned his Doctorate in Computer Science in 2001 from the Technische Universität München (Munich, Germany) and also holds a Master of Science in Computer Science from the University of Illinois at Urbana-Champaign. He has published over 175 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. He is the PI for the Office of Science X-Stack project "Performance Insights for Programmers and Exascale Runtimes" (PIPER) as well as for the ASC/CCE project on Open|SpeedShop, and is involved in the DOE Office of Science exascale projects CESAR, ExMatEx, and ARGO. Martin's research interests include parallel and distributed architectures and applications; performance monitoring, modeling, and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.


Peer-Timo Bremer is a member of technical staff and project leader at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Associate Director for Research at the Center for Extreme Data Management, Analysis, and Visualization at the University of Utah. His research interests include large-scale data analysis, performance analysis, and visualization, and he recently co-organized a Dagstuhl Perspectives Workshop on integrating performance analysis and visualization. Prior to his tenure at CASC, he was a postdoctoral research associate at the University of Illinois, Urbana-Champaign. Peer-Timo earned a Ph.D. in Computer Science at the University of California, Davis in 2004 and a Diploma in Mathematics and Computer Science from the Leibniz University in Hannover, Germany in 2000. He is a member of the IEEE Computer Society and ACM.


Bernd Hamann is a professor of computer science at the University of California, Davis. He studied mathematics and computer science at the Technical University of Braunschweig, Germany, and received a Ph.D. in computer science from Arizona State University in 1991. His main teaching and research interests are data analysis and visualization, image processing, and geometric design and modeling.