
Runtime Visualization of Application Progress and Monitoring of a GPU-enabled Parallel Environment

SZYMON BULTROWICZ, PAWEŁ CZARNUL, PAWEŁ ROŚCISZEWSKI
Department of Computer Architecture

Faculty of Electronics, Telecommunications and Informatics
Gdańsk University of Technology

11/12 Gabriela Narutowicza Street, 80-233 Gdańsk, POLAND
[email protected], {pczarnul, pawel.rosciszewski}@eti.pg.gda.pl

Abstract: The paper presents design, implementation and real life uses of a visualization subsystem for a distributed framework for parallelization of workflow-based computations among clusters with nodes that feature both CPUs and GPUs. Firstly, the proposed system presents a graphical view of the infrastructure with clusters, nodes and compute devices along with parameters and runtime graphs of load, memory available, fan speeds etc. Secondly, it depicts execution progress of particular workflow nodes that correspond to parallel computations realized by OpenCL kernels. States of workflow nodes such as ready, running, executed are shown with colors. Thirdly, the system allows the programmer to provide code that visualizes the physical representation of the application executed at runtime. The visualization code is intentionally separated from the computational kernels in OpenCL. The benefits of the proposed system are demonstrated for a parallel numerical integration code executed on both CPUs and GPUs.

Key–Words: Visualization, Parallel computing, Heterogeneous environments, OpenCL

1 Introduction

Recent advances in hardware solutions allowing High Performance Computing (HPC) resulted in several distinct approaches to parallelization and adoption of corresponding APIs. These include:

1. accelerator level such as:

• GPUs – NVIDIA CUDA API, OpenCL [1], OpenACC [2],

• Intel Xeon Phi – OpenMP [3], MPI [4] and OpenCL,

2. multicore CPUs and SMP machines – OpenMP [3], OpenCL [1], Pthreads etc.,

3. cluster level – mainly MPI [4].

The most powerful clusters from the TOP500 list such as Tianhe-2 and Titan make use of all of these parallelization levels, e.g. Tianhe-2 uses Intel Xeon E5 CPUs along with Intel Xeon Phis [5] while Titan uses Opteron CPUs and NVIDIA K20X accelerators [6]. Efficient use of all these resources requires both knowledge of the APIs and tools allowing runtime visualization of the following:

1. state of the infrastructure including the cluster, node and compute device level,

2. progress of the application including its physical representation.

The following are selected examples that require visualization: snow cover computing [7], large-scale terrains [8], parallel particle swarm optimization [9], parallel biomedical image processing [10].

This work presents a visualization subsystem for a multi-level parallelization framework using CPUs and GPUs that was presented by the authors in [11]. The framework allows modeling an application as a workflow graph with definition of computational kernels in OpenCL which are attached to workflow nodes. The runtime layer of the system, depending on the optimizer chosen, selects compute devices to use, partitions input data and distributes it among the devices working in a dynamic master-slave fashion among data servers and compute devices.

Section 2 presents existing systems for monitoring infrastructure and application progress, Section 3 presents the proposed solution with architecture, storage solutions and API. A real life example of parallel integration and its visualization is shown in Section 4, with conclusions and future work in Section 5.

Applications of Information Systems in Engineering and Bioscience

ISBN: 978-960-474-381-0


2 Related Work

2.1 Visualization of HPC Infrastructure

Nagios [12] is an open source system that supports monitoring of computers including CPU, disk and logs across systems like Unix and Windows, as well as networks and software. Various service checks are available for network services, with the possibility to write custom ones in many languages. Notifications and event handlers can be defined, and graphing plugins are available.

Ganglia [13] allows monitoring of one or more clusters by running local area monitoring Gmon subsystems. Gmons use UDP multicast to gather parameters and data within a cluster. Gmon agents can arrange a network and listen to neighbors rather than poll for messages. The Gmon layer communicates with a wide area Gmeta by sending XML data over TCP. The monitoring structure can be expressed as a hierarchical tree. Ganglia uses [14] a Round Robin Database for storage of information related to nodes, clusters and grids. RRDTool generates graphs which can then be presented to the user via an interface written in PHP.

Paper [15] presents dproc, a kernel-level resource monitoring framework that can be used for gathering data such as CPU, disk, memory and network usage on nodes and sending it to other nodes using KECho publish/subscribe channels. Applications can subscribe using criteria in the form of filters that can be deployed on-the-fly.

Work [16] presents an HPC cluster monitoring architecture meant for grid and utility computing. It allows finding available nodes. Furthermore, it features system and job monitoring including support for both Linux and Windows environments. It is composed of four tiers: a Grid Information Detecting Layer with nodes reporting loads, memory usage etc. to a multithreaded Grid Information Management Layer that exports XML data in response to requests from a Web Portal layer, which forwards it to the client.

2.2 Visualization of Parallel Application Progress

Visualization of a parallel application, in particular one run on GPUs, can be viewed with two goals in mind. The first is identification of bottlenecks as well as data and processing flow in the program. The second is visualization of the simulation itself.

2.2.1 Visualization of Application to Infrastructure Mapping

Since the proposed solution includes a workflow-based processing concept, visualization tools exist in almost every Workflow Management System (WfMS). For instance, in BeesyCluster workflows can be defined visually [17, 18] and the states of particular workflow nodes are reflected dynamically at runtime when using the agent-based BeesyBees execution engine [19]. This also applies to other workflow-based systems such as Triana [20], Kepler [21] etc.

Secondly, there exist technology-related tools for visualization of application progress. As an example, vendors provide low-level tools for visualization of application execution on GPUs. Specifically, NVIDIA provides the NVIDIA Visual Profiler [22] for CUDA applications. The tool displays timelines with e.g. memory transfers and method calls and is able to automate the search for bottlenecks. AMD's CodeXL [23] is a similar tool for profiling applications that also supports OpenCL codes. It displays load statistics and assembler code and allows previewing card buffers. However, the two applications do not support distributed applications using GPUs.

The TAU (Tuning and Analysis Utilities) parallel performance system [24] offers visualization of application execution with support for C++, Java, Python as well as CUDA, OpenCL and others. Special code is injected and executed, and data is collected by the TAU system. Compared to TAU, Vampir [25] offers both automatic and manual instrumentation of the code and provides VampirTrace for data gathering and Vampir for data analysis and identification of bottlenecks. Paper [26] proposes AerialVision, a visualization tool for GPUs, that can assist in bottleneck identification and associate the part of the code responsible.

2.2.2 Visualization of Application Results

Visualization of the actual parallel execution at runtime is often necessary in order to gain knowledge on how the simulation progresses, verify its correctness and identify data subdomains on which to focus more. Many parallel applications develop their custom visualization tools, often working off-line from logs written during runs. Runtime visualization is especially desired on accelerators, which primarily use their own memory. The host-accelerator bandwidth may be a limitation for the application itself, not to mention runtime visualization.

There exist several solutions for visualization of the application results. A Parallel Graphics Library1

1 http://www.hq.nasa.gov/hpcc/reports/annrpt97/accomps/cas_larc/WW229.html


allows visualization of parallel SPMD applications using polygon rendering [27]. However, output graphics consume bandwidth and solutions of that type do not seem adequate for processing codes on accelerators. PARADE [28], on the other hand, allows visualization using layers such as: instrumentation, which gathers information about events, choreography, which handles events, and presentation, which visualizes information prepared by the choreography layer. The system allows manual instrumentation as well as wrapping communication and system libraries. Visualization of OpenCL computations within OpenGL is proposed in [29].

Visualization itself can also be performed in parallel. VisIt2 [30] is a free tool for interactive parallel visualization and analysis of scientific data. Data can be visualized and animated in 2D and 3D, and very large data sets are supported. The Engine and Viewer components are separated. The former can be run either serially or in parallel, while the latter can make use of GPU processing power. Work [31] extends VisIt by provisioning support for multiple clients to connect to existing VisIt sessions. It allows clients running on mobile devices to connect to VisIt running on servers. ParaView3 [32] is an open source application for analysis and large scale data visualization that can be run on single nodes or supercomputers.

Several tools for data analysis and visualization in high performance computing are listed in [33]. These include in particular Community Multiscale Air Quality (CMAQ)4, Astrophysical Adaptive Mesh Refinement (Enzo)5, Grid Analysis and Display System (GrADS)6, MATLAB7, High-Performance Computational Chemistry software (NWChem)8, ParaView9, Parallel Ocean Program Simulation (POP)10 and Astrophysical Simulation Analysis and Visualization11.

Several examples for visualization of computations in particular domains can be given. As an example, paper [34] presents a GPU-based framework for interactive visualization of time-dependent climate research applications. Paper [35] discusses GPGPU processing and visualization of three-dimensional cellular automata.

2 https://wci.llnl.gov/codes/visit/
3 http://www.paraview.org/
4 www.cmaq-model.org
5 http://code.google.com/p/enzo
6 www.iges.org/grads/grads.html
7 www.mathworks.com/products/matlab/index.html
8 www.nwchem-sw.org/index.php/Main_Page
9 www.paraview.org
10 www.vets.ucar.edu/vg/POP/index.shtml
11 http://yt-project.org

3 Proposed Solution

The proposed visualization system is in the first place motivated by the need for multi-level parallel processing in modern HPC systems. The processing framework was proposed by the authors before in [11]. It should be noted that technically we build on the experiences of the other approaches outlined in Section 2 and integrate the system state, application state and real life simulation representation into one solution. This makes it a highly desirable tool for HPC processing.

3.1 Architecture and Design

The visualization framework architecture depends strictly on the general architecture of the KernelHive system, which will be outlined in this section. Additionally, communication channels added for visualization purposes will be discussed, reflected by the dashed arrows in Figure 1.

3.1.1 GUI

There are various types of KernelHive system users. For each of them, the main point of interaction with the system is the GUI module, shown in the upper left corner of Figure 1. It is a Java Swing application that allows:

• application architects – to construct application workflows using the Workflow Editor module,

• low-level coders – to edit the source of computational codes using the Kernel Editor module,

• execution administrators – to run applications and manage workflow executions using the Execution Wizard and Workflow Executions modules,

• infrastructure administrators – to inspect the computational infrastructure available to the system using the Infrastructure Browser module.

The two latter types of users are particularly interested in visualization. For their purposes, apart from the aforementioned Infrastructure Browser, we added the Resource Monitor and Preview Provider modules for inspecting the current state of the hardware and of working applications, respectively.

3.1.2 Execution Engine

The central point of the KernelHive system is the Execution Engine, situated at the top of Figure 1. For the GUI module it is a backend, accessed through SOAP Web Services. We define a communication endpoint


[Figure 1: Architecture of the Visualization Framework. The diagram shows the GUI (Swing) and the Execution Engine (Java EE) with Storage (filesystem), connected over a WAN to Cluster Manager (jsvc daemon) entry points, which communicate over a LAN with Units (C++), Workers (C++/OpenCL) and Monitoring Agents (C++); arrows denote communication and data transfer.]

for sending the visualization, separate from the one for the core functionality management.

As a Java EE application, the Execution Engine operates on high-level abstraction notions describing the whole system, gathering information about the current state of KernelHive. This information is constantly reported by the lower level modules described in the following subsections.

Gathering the object-oriented representation of the system internals in the Execution Engine allows implementing various optimization strategies focusing on criteria like performance, energy efficiency [11] or system reliability.

3.1.3 Cluster Managers

KernelHive provides parallelization of computations on many levels, including the level of multiple clusters. The definition of a cluster in KernelHive is versatile enough to allow various infrastructure configurations – it is a set of nodes connected by a network.

Additionally, there should be one entry point machine that is connected by a network with both the nodes in the cluster and the Execution Engine.

Such entry point machines in KernelHive are equipped with instances of the Cluster Manager module. It is a Java daemon (jsvc) whose task is to forward all communication between the nodes and the Execution Engine.

Communication between the Cluster Manager and the Execution Engine is based on high-level Web Services, as this method is easy to extend, maintain and develop. However, because the Cluster Manager serves in this case only as a proxy, messages are not deserialized and are passed on in the same binary-serialized form in which they come from the Workers.

3.1.4 Units and Workers

Information about the resources as well as application state in KernelHive originates from the modules that are deployed closest to the hardware, namely Units


and Workers.

Each computer connected to KernelHive is

equipped with an instance of the Unit subsystem. It is a system daemon (implemented in C++) responsible for TCP communication with its Cluster Manager. The basic functions of a Unit are to report the available devices and afterwards to execute tasks scheduled by the higher-level subsystems.

Given a job to execute, the Unit runs in the operating system a predefined binary corresponding to the job type set in the application definition. For each computation template available in KernelHive (and ready to use in the GUI module), there are one or more corresponding binaries to run on a node. The collection of all those binaries is called the Worker subsystem.

The typical task of a Worker binary is to obtain the required input data and computational code (OpenCL) from a data server indicated by the Execution Engine, run the program on a given device and finally gather and send the result data back to a given data server. Hence, the Worker binaries differ only in the way of obtaining the needed inputs, which depends on the computation template (e.g. master-slave, SPMD).

Both Units and Workers use UDP for communication with their corresponding Cluster Managers. Units send information about resource consumption on a host, and Workers send data with progress information.

3.2 Data Acquisition and Storage Concerns

KernelHive uses two different channels of communication, each dedicated to its purpose. The first one, for the monitoring module, focuses on transferring and storing data about the usage of resources on particular nodes. The progress visualization channel transfers reports from Workers to the Execution Engine.

3.2.1 Monitoring Module

Data gathering is performed by the monitoring agent installed on the cluster nodes. In contrast to the actual Workers, which are started on demand, the agent runs all the time and is able to monitor the system constantly. It uses the standard Linux API, based on file-based statistics, to acquire data such as total memory, used memory, CPU usage etc. Besides the general node statistics, GPU data is also required, which is not available through the standard API. In this case, KernelHive uses proprietary NVIDIA tools, i.e. nvidia-smi, to retrieve information about GPU load. Nonetheless, a full status report is not available for all graphics cards; in such cases a best-effort approach is taken and as much information as possible is presented.
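The agent's reading of Linux statistics files can be sketched briefly. The snippet below is a simplified illustration in Java rather than the agent's actual C++ code; it parses /proc/meminfo-style lines (the real agent also reads CPU and other statistics) into numeric values:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal parser for /proc/meminfo-style lines such as
// "MemTotal:       16384256 kB" -> ("MemTotal", 16384256).
class MemInfoParser {
    static Map<String, Long> parse(String text) {
        Map<String, Long> values = new HashMap<>();
        for (String line : text.split("\n")) {
            String[] parts = line.split(":");
            if (parts.length != 2) continue;
            // Strip the optional "kB" suffix and surrounding whitespace
            String v = parts[1].trim().replace(" kB", "").replace("kB", "").trim();
            try {
                values.put(parts[0].trim(), Long.parseLong(v));
            } catch (NumberFormatException ignored) { }
        }
        return values;
    }

    public static void main(String[] args) {
        String sample = "MemTotal:       16384256 kB\nMemFree:         1234567 kB";
        Map<String, Long> m = parse(sample);
        System.out.println("used = " + (m.get("MemTotal") - m.get("MemFree")) + " kB");
    }
}
```

On a real node the input would come from reading /proc/meminfo itself; periodic samples of such values are what the agent forwards towards the engine.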

Apart from some invariable node properties, e.g. host name or total memory size, the rest of the data has to be sent sequentially, which can impact data traffic significantly. Hence the sequential messages are serialized in binary, which makes each message notably smaller and helps save network bandwidth.

The acquired data is sent through the cluster to the engine, where it is stored. General purpose relational databases are not suitable in this instance due to the characteristics of the data. There are practically no relations; however, there is a huge influx of new records as a result of new information coming every second. For this reason, the data is stored in an RRD, which stands for Round Robin Database. An RRD is a file-based database meant for handling cyclic data; for instance, it automatically deletes data older than a specified time, e.g. 1 hour. Additionally, it allows easily retrieving different kinds of metrics such as average, count etc. Moreover, one of its useful features is a built-in generator of charts, which is used for data presentation in the GUI.
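The round-robin semantics that make an RRD attractive here can be shown with a minimal sketch (our own illustration of the storage idea, not the actual RRDTool file format or API): a fixed-capacity ring overwrites the oldest sample, so storage never grows no matter how long monitoring runs.

```java
// Minimal illustration of round-robin storage semantics: the newest sample
// overwrites the oldest slot once capacity is reached, and aggregate
// metrics (here, the average) are computed over the retained window.
class RoundRobinStore {
    private final double[] samples;
    private int next = 0;   // position of the next write
    private int count = 0;  // number of valid samples (<= capacity)

    RoundRobinStore(int capacity) { samples = new double[capacity]; }

    void add(double value) {
        samples[next] = value;               // overwrite the oldest slot
        next = (next + 1) % samples.length;
        if (count < samples.length) count++;
    }

    double average() {
        double sum = 0;
        for (int i = 0; i < count; i++) sum += samples[i];
        return count == 0 ? 0 : sum / count;
    }

    int size() { return count; }
}
```

RRDTool additionally consolidates samples into multiple resolutions and renders charts; the ring above only conveys why the database stays bounded.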

3.2.2 Visualization of Progress

Information about the progress of an application has to be set by OpenCL code run on a graphics card. Due to the fact that there is no possibility to create a communication channel between the kernel and the host during kernel execution, the application status is reported after every kernel run. Each report contains preview data that reflects the part of the input given to the application. Thereafter, all reported pieces are gathered on the server and published to the GUI. Eventually, the user application downloads this package of data and passes it to the user-defined visualization algorithm described in more detail in Section 3.3.

Progress data is short term information because its scope is a single application run. Therefore there is no need for it to be saved, and it is not expected to grow large. For this reason progress data is kept in memory.
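A minimal sketch of such in-memory, per-run storage follows (class and method names are hypothetical, not KernelHive's actual API): chunks reported after each kernel run are appended under a run identifier and served to the GUI on demand, then simply discarded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of keeping short-lived progress data in memory. Preview chunks
// are accumulated per application run and never persisted to disk.
class ProgressStore {
    private final Map<String, List<byte[]>> chunksByRun = new ConcurrentHashMap<>();

    // Called when a Worker reports a piece of preview data.
    void report(String runId, byte[] previewChunk) {
        chunksByRun
            .computeIfAbsent(runId, k -> Collections.synchronizedList(new ArrayList<>()))
            .add(previewChunk);
    }

    // Called by the GUI: a copy of everything reported so far for one run.
    List<byte[]> snapshot(String runId) {
        List<byte[]> chunks = chunksByRun.get(runId);
        return chunks == null ? new ArrayList<>() : new ArrayList<>(chunks);
    }

    // Progress data is scoped to a single run, so it is simply dropped.
    void discard(String runId) {
        chunksByRun.remove(runId);
    }
}
```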

3.3 API

In terms of programming language, modules in KernelHive can be divided into those implemented in Java (high level modules and the GUI) and those implemented in C++ with OpenCL (hardware and computation related). The visualization framework has to inject its parts into modules implemented in both of those languages.

In this section, methods used to share data between the subsystems (Sections 3.3.1 and 3.3.4) as well as programming interfaces used in OpenCL kernels (Section 3.3.2) and Java visualization code (Section 3.3.3) are described.


struct PreviewObject {
    // start of data range, delta, integral value
    float f1; float f2; float f3;
};

Figure 2: The PreviewObject Structure in OpenCL

package pl.gda.pg.eti.kernelhive.common.monitoring.service;

public class PreviewObject {
    protected float f1; protected float f2; protected float f3;
    /* Getters and setters for the properties... */
}

Figure 3: The PreviewObject Structure in Java

3.3.1 PreviewObject Structure

We assume that throughout the application execution period, a sequence of data objects will be transferred from computational nodes to the GUI. The structure of these objects should be defined by the application provider in Java and OpenCL using only simple, compatible types. Beyond that, the application provider is free to decide about the contents of the preview object.

Figures 2 and 3 show an exemplary PreviewObject structure in OpenCL and Java for an integral calculating application.
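The requirement that both definitions use only simple, compatible types exists because the two sides must agree on one raw byte layout. The sketch below illustrates this for the three-float PreviewObject, assuming 32-bit floats in little-endian order (typical for x86 hosts, but an assumption to be checked per device; this codec is our own illustration, not KernelHive code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Assumed wire format for one PreviewObject: three consecutive 32-bit
// floats (f1, f2, f3), i.e. 12 bytes, in little-endian byte order.
class PreviewObjectCodec {
    static final int BYTES = 3 * Float.BYTES; // 12 bytes per object

    static byte[] encode(float f1, float f2, float f3) {
        ByteBuffer buf = ByteBuffer.allocate(BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.putFloat(f1).putFloat(f2).putFloat(f3);
        return buf.array();
    }

    static float[] decode(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        return new float[] { buf.getFloat(), buf.getFloat(), buf.getFloat() };
    }
}
```

Because both structures are plain sequences of floats, no per-field metadata needs to travel with the preview data.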

3.3.2 Kernel Interface and Example

In the process of creating an application workflow, the developer builds a directed acyclic graph from blocks that correspond to computation templates available in KernelHive (e.g. Master-slave, SPMD). Each block comes with one or more templates of codes in OpenCL that include the signature of the kernel.

Figure 4 shows the computational code for an exemplary integral calculating application. It fills a DataProcessor template which, in the processData kernel signature, includes a PreviewObject pointer. The KernelHive Worker subsystem takes care of the memory allocation issues, runs the kernel and sends the generated intermediate visualization data to the Execution Engine.

3.3.3 PreviewProvider Interface and Example

To complete the outline of the implementation internals of the application state visualization, we show the graphics related code for the aforementioned integral calculating application in Figure 5.

Once again, KernelHive allows high flexibility for the application developer. The application should come with an implementation of the IPreviewProvider Java interface, which defines a paintData method. The method input includes a list of PreviewObjects (described in Section 3.3.1) and objects specific to the Java Swing technology utilized in the GUI module.

3.3.4 Resource State Serialization

The resource state can be split into two different kinds. The first one is invariable host data such as host name, total memory size, CPU core count etc. As these messages are sent only once when the agent starts, message size is not a concern here, and for readability string serialization was used.

The second kind of messages is sequential data consisting of information about used CPU, memory and so on. In this instance, message size is very important because there are supposed to be a lot of such messages. Therefore, this type of data is serialized in binary.
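The size difference can be seen in a short sketch (the field set below is hypothetical, not KernelHive's actual message format): three fixed-width fields occupy 16 bytes in binary, while their textual rendering is roughly twice as long.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration of why sequential reports are serialized in binary:
// fixed-width fields are much smaller than their textual form.
class MetricsMessage {
    static byte[] toBinary(float cpuLoad, long usedMemBytes, float gpuLoad) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeFloat(cpuLoad);      // 4 bytes
            out.writeLong(usedMemBytes);  // 8 bytes
            out.writeFloat(gpuLoad);      // 4 bytes
            return bos.toByteArray();     // 16 bytes in total
        } catch (IOException e) {
            throw new AssertionError(e);  // cannot happen for in-memory streams
        }
    }

    static String toText(float cpuLoad, long usedMemBytes, float gpuLoad) {
        return "cpu=" + cpuLoad + ";mem=" + usedMemBytes + ";gpu=" + gpuLoad;
    }
}
```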

Since the monitoring reports are not crucial for proper operation, all messages here are sent via UDP because of its smaller network consumption overhead.

4 Adaptive Quadrature Example and Its Visualization

The exemplary application calculates the integral of a given function using the method of adaptive quadrature. The input for the application is a set of points which represent a set of rectangles to calculate. The job of KernelHive is to split the data, distribute the data chunks among the Workers and then get the results back.
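As a point of reference for the parallel version, the adaptive quadrature method itself can be sketched serially. The paper does not list the kernel's adaptiveIntegral helper, so the variant below (interval bisection with a trapezoid rule) is an illustration of the general technique rather than KernelHive's code:

```java
import java.util.function.DoubleUnaryOperator;

// Serial reference for adaptive quadrature: an interval is bisected until
// the two-halves estimate agrees with the whole-interval estimate to
// within the tolerance, concentrating work where the function varies most.
class AdaptiveQuadrature {
    static double integrate(DoubleUnaryOperator f, double a, double b, double eps) {
        return refine(f, a, b, trapezoid(f, a, b), eps);
    }

    private static double refine(DoubleUnaryOperator f, double a, double b,
                                 double whole, double eps) {
        double m = 0.5 * (a + b);
        double left = trapezoid(f, a, m);
        double right = trapezoid(f, m, b);
        if (Math.abs(left + right - whole) < eps) {
            return left + right;
        }
        // Halve the tolerance for each subinterval and recurse
        return refine(f, a, m, left, eps / 2) + refine(f, m, b, right, eps / 2);
    }

    private static double trapezoid(DoubleUnaryOperator f, double a, double b) {
        return 0.5 * (b - a) * (f.applyAsDouble(a) + f.applyAsDouble(b));
    }
}
```

In the KernelHive version, each Worker effectively performs such refinement on its own subset of rectangles, which is why the input can be split freely among devices.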

Therefore, each kernel receives a set of rectangles to process. Besides the data, the kernel receives a list of empty preview objects which are filled during the


__kernel void processData(__global float* input, unsigned int dataSize,
                          __global float* output, unsigned int outputSize,
                          __global struct PreviewObject *previewBuffer)
{
    // Get the index of the current element to be processed
    int id = get_global_id(0);
    // Get the number of data items per processing thread
    int actualItemsCount = outputSize / sizeof(float);
    int itemsPerThread = actualItemsCount / get_global_size(0);
    // Calculate the fields
    for (int i = 0; i < itemsPerThread; i++) {
        int idx = (id * itemsPerThread) + i;
        // Get the delta between values:
        float delta = input[idx + 1] - input[idx];
        float y = adaptiveIntegral(input[idx], input[idx + 1]);
        output[idx] = y;
        // Fill the preview object
        previewBuffer[idx].f1 = input[idx];
        previewBuffer[idx].f2 = delta;
        previewBuffer[idx].f3 = y;
    }
}

Figure 4: Code of the Data Processor Computational Kernel

public class PreviewProvider implements IPreviewProvider {
    public void paintData(Graphics g, List<PreviewObject> data,
                          int areaWidth, int areaHeight) {
        g.setColor(Color.YELLOW);
        float minX = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float minY = 0;
        float maxY = Float.NEGATIVE_INFINITY;
        for (PreviewObject po : data) {
            if (validatePreviewObject(po)) {
                minX = Math.min(po.getF1(), minX);
                maxX = Math.max(po.getF1() + po.getF2(), maxX);
                maxY = Math.max(po.getF3(), maxY);
            }
        }
        float ratioX = areaWidth / (maxX - minX);
        float ratioY = areaHeight / (maxY - minY);
        for (PreviewObject po : data) {
            int width = Math.round(ratioX * po.getF2());
            int height = Math.round(ratioY * po.getF3());
            int x = Math.round(ratioX * po.getF1());
            int y = areaHeight - height;
            g.fillRect(x, y, width, height);
        }
    }
}

Figure 5: Code of the PreviewProvider Component Responsible for Rendering the Visualization

execution and eventually returned to the Worker. Thereafter, as soon as the kernel finishes its job, the list is sent back to the server. Each preview object contains the parameters of one calculated rectangle, i.e. its


position, width and height. A code listing of the exemplary application kernel is shown in Figure 4.

Figure 6: States of Integration Application Progress Preview

As a result, the KernelHive GUI is provided by the server with a complete list of rectangle descriptions to render, which can easily be done by the rendering mechanism developed by the user, shown in Figure 5.

The visualization of progress depicted in Figure 6 allows seeing how the packages of preview data arrive in time. After each kernel run a new set of preview objects is available and is instantly rendered as a series of yellow rectangles.

Besides the live progress preview, the Workflow Progress view shows a grid with colored tiles representing the particular jobs into which the Execution Engine has split the workflow. Grey tiles mean that a job is scheduled but not yet processed, green ones that the job is currently being processed, and blue tiles mean completed jobs. An example of such a grid is visible in Figure 7.

Apart from observing the application itself, it is also possible to monitor the hardware on which the application is running. Using the KernelHive GUI, the user can watch live data coming from cluster nodes containing information about the usage of particular resources, e.g. CPU, memory, GPU memory etc. Sample charts presenting the load of one of the host CPU cores and GPU global memory usage are shown in Figure 8. All charts are rendered using standard RRD mechanisms. It is also easy to extend the current API with custom resources, which would allow monitoring of additional points of interest, e.g. power consumption.
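A custom monitored resource such as power consumption could be plugged in along these lines (a hypothetical interface for illustration only; the actual KernelHive monitoring API is not shown in the paper):

```java
// Hypothetical sketch of how a custom monitored resource (e.g. power
// consumption) could be exposed to the monitoring subsystem; interface
// and method names are illustrative, not the actual KernelHive API.
interface MonitoredResource {
    String name();          // e.g. "power_consumption"
    String unit();          // e.g. "W"
    double currentValue();  // sampled periodically and fed to the RRD backend
}

class ConstantPowerSensor implements MonitoredResource {
    public String name() { return "power_consumption"; }
    public String unit() { return "W"; }
    public double currentValue() { return 250.0; } // stub reading for the sketch
}
```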

Figure 7: Workflow Progress Preview

5 Conclusions and Future Work

In the paper we presented a new visualization system for the KernelHive environment that allows multi-level parallelization of computations in a collection of clusters with CPUs and GPUs [11]. The proposed solution allows visualization of the infrastructure with runtime usage of resources, visualization of workflow application progress and runtime presentation of results as they are generated. Computational code in OpenCL is separated from the visualization code in Java that is assigned to it. An example for parallel adaptive integration is shown.

In the future, we aim at optimization of the data transfer protocol used for visualization by using compression. Additionally, it would be possible and useful to extract monitoring into a separate module that could be reused in other projects.
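The compression idea mentioned above could be realized, for instance, by GZIP-compressing the serialized preview data before transmission, as in this sketch (illustrative only; not part of the current KernelHive protocol):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of compressing serialized preview data before sending it to the
// GUI; illustrative only, not part of the current KernelHive protocol.
public class PreviewCompression {
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data); // closing the stream flushes the GZIP trailer
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }
}
```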

Acknowledgments: The work was partially performed within grant “Modeling efficiency, reliability and power consumption of multilevel parallel HPC


Figure 8: Sample Resource Usage Charts

systems using CPUs and GPUs” from the National Science Center in Poland based on decision no. DEC-2012/07/B/ST6/01516.

References:

[1] Kirk, D.B., Hwu, W.-m. W.: Programming Massively Parallel Processors, Second Edition: A Hands-on Approach. Morgan Kaufmann (2012) ISBN-13: 978-0124159921

[2] Wienke, S., Springer, P., Terboven, C., an Mey, D.: OpenACC: First experiences with real-world applications. In: Proceedings of the 18th International Conference on Parallel Processing. Euro-Par'12, Berlin, Heidelberg, Springer-Verlag (2012) 859–870

[3] Chapman, B., Jost, G., Pas, R.v.d.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press (2007)

[4] Gropp, W., Lusk, E., Thakur, R.: Using MPI-2: Advanced Features of the Message-Passing Interface. MIT Press, Cambridge, MA, USA (1999)

[5] Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. 1st edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2013)

[6] Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming. 1st edn. Addison-Wesley Professional (2010)

[7] Huraj, L., Siladi, V., Silaci, J.: Comparison of design and performance of snow cover computing on GPUs and multi-core processors. WSEAS Trans. Info. Sci. and App. 7 (2010) 1284–1294

[8] Qiu, H., Chen, L.T., Qiu, G.P., Yang, H.: An effective visualization method for large-scale terrain dataset. In: WSEAS Transactions on Information Science and Applications. Volume 5. (2013)

[9] Roberge, V., Tarbouchi, M.: Parallel particle swarm optimization on graphical processing unit for pose estimation. In: WSEAS Transactions on Computers. Volume 11. (2012)

[10] Remenyi, A., Szenasi, S., Bandi, I., Vamossy, Z.I., Valcz, G., Bogdanov, P., Sergyan, S., Kozlovszky, M.: Parallel biomedical image processing with GPGPU-s in cancer research. In Szakal, A., ed.: 3rd IEEE International Symposium on Logistics and Industrial Informatics (LINDI 2011), Budapest, IEEE Hungary Section (2011) 245–248

[11] Czarnul, P., Rosciszewski, P.: Optimization of execution time under power consumption constraints in a heterogeneous parallel system with GPUs and CPUs. In Chatterjee, M., Cao, J.N., Kothapalli, K., Rajsbaum, S., eds.: ICDCN. Volume 8314 of Lecture Notes in Computer Science., Springer (2014) 66–80

[12] Josephsen, D.: Building a Monitoring Infrastructure with Nagios. Prentice Hall PTR, Upper Saddle River, NJ, USA (2007)

[13] Sacerdoti, F.D., Katz, M.J., Massie, M.L., Culler, D.E.: Wide area cluster monitoring with Ganglia. In: CLUSTER, IEEE Computer Society (2003)

[14] Massie, M.L., Chun, B.N., Culler, D.E.: The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing 30 (2004)


[15] Agarwala, S., Poellabauer, C., Kong, J., Schwan, K., Wolf, M.: Resource-aware stream management with the customizable dproc distributed monitoring mechanisms. In: HPDC, IEEE Computer Society (2003) 250–259

[16] Li, M., Zhang, Y.: HPC cluster monitoring system architecture design and implement. Intelligent Computation Technology and Automation, International Conference on 2 (2009) 325–327

[17] Czarnul, P.: Modeling, run-time optimization and execution of distributed workflow applications in the JEE-based BeesyCluster environment. The Journal of Supercomputing 63 (2013) 46–71

[18] Czarnul, P.: A model, design, and implementation of an efficient multithreaded workflow execution engine with data streaming, caching, and storage constraints. The Journal of Supercomputing 63 (2013) 919–945

[19] Czarnul, P., Matuszek, M.R., Wojcik, M., Zalewski, K.: BeesyBees: A mobile agent-based middleware for a reliable and secure execution of service-based workflow applications in BeesyCluster. Multiagent and Grid Systems 7 (2011) 219–241

[20] Taylor, I., Shields, M., Wang, I., Harrison, A.: Visual grid workflow in Triana. Journal of Grid Computing 3 (2005) 153–169

[21] Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., Zhao, Y.: Scientific workflow management and the Kepler system: Research articles. Concurr. Comput.: Pract. Exper. 18 (2006) 1039–1065

[22] NVIDIA: Profiler User's Guide (2013) http://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf

[23] Advanced Micro Devices: Getting started with CodeXL (2012) http://developer.amd.com/wordpress/media/2012/10/CodeXL_Quick_Start_Guide.pdf

[24] Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20 (2006) 287–311

[25] Muller, M.S., Knupfer, A., Jurenz, M., Lieber, M., Brunst, H., Mix, H., Nagel, W.E.: Developing scalable applications with Vampir, VampirServer and VampirTrace. In Bischof, C.H., Bucker, H.M., Gibbon, P., Joubert, G.R., Lippert, T., Mohr, B., Peters, F.J., eds.: PARCO. Volume 15 of Advances in Parallel Computing., IOS Press (2007) 637–644

[26] Ariel, A., Fung, W.W.L., Turner, A.E., Aamodt, T.M.: Visualizing complex dynamics in many-core accelerator architectures. In: ISPASS, IEEE Computer Society (2010) 164–174

[27] Crockett, T.W.: Design considerations for parallel graphics libraries. Technical report (1994)

[28] Stasko, J.T.: The PARADE environment for visualizing parallel program executions: A progress report. Technical report, Georgia Institute of Technology (1995)

[29] Abo-Namous, O.: Visualizing OpenCL computations within OpenGL (2011) http://omar.toomuchcookies.net/node/2011/08/visualizing-opencl-computations-within-opengl/

[30] Ahern, S., Brugger, E., Whitlock, B., Meredith, J.S., Biagas, K., Miller, M.C., Childs, H.: VisIt: Experiences with sustainable software. CoRR abs/1309.1796 (2013)

[31] Krishnan, H., Harrison, C., Whitlock, B., Pugmire, D., Childs, H.: Exploring collaborative HPC visualization workflows using VisIt and Python. In van der Walt, S., Millman, J., Huff, K., eds.: Proceedings of the 12th Python in Science Conference. (2013) 69–73

[32] Cedilnik, A., Geveci, B., Moreland, K., Ahrens, J.P., Favre, J.M.: Remote large data visualization in the ParaView framework. In Heirich, A., Raffin, B., dos Santos, L.P.P., eds.: EGPGV, Eurographics Association 163–170

[33] Szczepanski, A., Huang, J., Baer, T., Mack, Y., Ahern, S.: Data analysis and visualization in high-performance computing. Computer 46 (2013) 84–92

[34] Cuntz, N., Kolb, A., Leidl, M., Rezk-Salama, C., Bottinger, M.: GPU-based dynamic flow visualization for climate research applications. In Schulze, T., Preim, B., Schumann, H., eds.: SimVis, SCS Publishing House e.V. (2007) 371–384

[35] Gobron, S., Coltekin, A., Bonafos, H., Thalmann, D.: GPGPU computation and visualization of three-dimensional cellular automata. Vis. Comput. 27 (2011) 67–81
