

Climate Science Performance, Data and Productivity on Titan

Benjamin Mayer∗, Patrick H. Worley∗, Rafael Ferreira da Silva§, Abigail L. Gaddis∗

∗Oak Ridge National Laboratory, Oak Ridge, TN, USA
§Information Sciences Institute, University of Southern California, Marina Del Rey, CA, USA

{mayerbw,worleyph,gaddisal}@ornl.gov, [email protected]

Abstract—Climate Science models are flagship codes for the largest of high performance computing (HPC) resources, both in visibility, with the newly launched Department of Energy (DOE) Accelerated Climate Model for Energy (ACME) effort, and in terms of significant fractions of system usage. The performance of the DOE ACME model is captured with application level timers and examined through a sizeable run archive. Performance and variability of compute, queue time and ancillary services are examined. As Climate Science advances in the use of HPC resources there has been an increase in the human and data systems required to achieve program goals. A description of current workflow processes (hardware, software, human) and planned automation of the workflow, along with historical and projected data-in-motion and data-at-rest usage, are detailed. The combination of these two topics motivates a description of future systems requirements for DOE Climate Modeling efforts, focusing on the growth of data storage and of the network and disk bandwidth required to handle data at an acceptable rate.

Index Terms—Climate Science, High Performance Computing, Performance Analysis, ACME.

I. INTRODUCTION

Scientific workflows that run on high performance computing (HPC) systems can be very expensive in terms of both time and data [1]–[4]. In the Climate community, to complete a single instance or ensemble of high-resolution simulations can take on the order of six to nine months and generate hundreds of Terabytes (TB) of data. This time span is due to many different types of tasks and factors. Obviously compute time is a factor, but so is the time spent in the queue waiting to run, due to both competing jobs and machine down time. Having produced model output is only a partial goal; there are many other data management issues that take significant amounts of time, including generating analysis and archiving model output to long-term storage. These time spans will increase over the coming years due to the increase in spatial and temporal model resolution leading to better quality simulations. In this work, we examine this workflow process with the help of instrumentation, and look for areas where improvements can be made.

There are several scales at which development happens in climate models, from an individual equation or feature, to how that feature interacts with its component model, and finally to how all of the components interact together. The amount of computation and the timeline for each of these scales range from the development end, where modifications happen on the minute to hour scale and require seconds to minutes of computation, up to full scale simulations that take months and tens of millions of hours of computing. At every scale of development, issues are identified. The faster one can do experiments at all scales, and especially at the large production scale, the faster the quality of the model can be improved.

A. Related Work

Performance evaluation and optimization of HPC applications is common practice, enabled by a variety of available tools and libraries (e.g., TAU [5], CrayPat [6], HPCToolkit [7], gptl [8], Vampir [9], etc.). However, science productivity rarely comes from running a single application, but rather requires executing a series of tasks. These other tasks also consume resources and take time, which can be a significant factor [10] in the total time to solution. There are a number of science endeavors that exhibit these characteristics. Examples include those with a large data fusion component, e.g. astronomical data, those with a large data analysis component, e.g. data from large instruments such as accelerators, or those conducting large parameter sweeps, where monitoring and controlling the subtasks is itself a critical task.

One non-climate example that has taken an approach similar to what is described, and proposed, here is Bellerophon [11], an n-tier software system built to support a production-level HPC application called CHIMERA, which simulates the temporal evolution of core-collapse supernovae. The CHIMERA team’s workflow management needs are met with software tools that enable real-time data analysis and visualization, automate regression testing, dynamically generate code repository statistics, monitor the current status of several supercomputing resources, and integrate with other workflow tools utilized by the group. In addition, monitoring tools built into Bellerophon allow users to keep tabs on important quantities such as current physical time, wall time, and simulation progress.

The large computing centers also provide tools and methodologies for supporting workflows, and for center-wide management. For instance, the DOE National Energy Research Scientific Computing Center (NERSC) actively shows queue wait time [12] and provides history functionality. They are doing queue time data collection on a large scale. Workload archives [13]–[17] are widely used for research in distributed systems to validate assumptions, to model computational activity, and to evaluate methods in simulation or in experimental conditions.


These workloads mainly capture information about job executions, but lack detailed information for the applications, such as the separation between the initialization and run phases. Analyses of HPC workload characteristics including system usage, user population, and application characteristics have been conducted in [2], [18]. However, these analyses are focused on the overall workload, and not on the performance of a specific application.

B. Problem Statement

The climate modeling process involves a sequence of steps: 1) model configuration, 2) compilation, 3) model execution, 4) diagnostic analysis, 5) archive of model output and analysis to tape (e.g., HPSS), and 6) publication of model output to a content management system (e.g., the Earth System Grid Federation (ESGF)). As part of this process there are some operations that use a shared file system and others that require data transfers between file systems (via data transfer nodes, DTNs).

In projects like the DOE Ultra High Resolution Global Climate Simulation project [?] and the Accelerated Climate Modeling for Energy (ACME) project [19], where the goal is to exercise and analyze high resolution climate models, large amounts of simulation output are generated. The process of generating these data can take a significant amount of time, and not just the expected large amount of compute time. The workflow for producing a finished climate product involves many steps across several machines, and involves many people, with an associated cost in both people and pre- and post-processing time.

Given the nearly year long time frame for performing a 20 year high resolution simulation or running several hundred simulation years of moderate resolution simulation [20], there are likely places where optimization to reduce time to solution can happen. One of the goals of this work is to highlight where time is being spent in the workflow so that the process can be optimized.

Data volume in climate science is on the order of hundreds of TB produced yearly. Given this amount of data, the analysis that needs to be performed, the transfers for distribution, and the archiving operations all take significant amounts of time. Automation of some tasks might help with this workload.

II. EXAMINING PERFORMANCE

A. Data Captured

In this work, we examined performance data for runs of the Community Earth System Model (CESM) and of the DOE ACME branch of the CESM over the period 2012–2015 on the Jaguar (Cray XT5) and Titan (Cray XK7) supercomputers hosted at the Oak Ridge Leadership Computing Facility (OLCF). The CESM is equipped with workflow and model computer performance instrumentation, and this was further augmented for these studies. This augmentation is native to ACME, and we will refer to both ACME and this modified version of the CESM as ACME in the rest of this paper. We recorded various performance-related attributes (timings and configuration information), as well as timespans for some pre- and post-processing events. Captured configuration information includes model specific parameters, simulation period, and the number and location of processing elements, MPI tasks, and OpenMP threads. Examples of recorded values are the rates at which the components of the coupled model integrate and the total throughput rate. Examples of recorded timespans are when the model entered the queuing system, when a run started, etc. The information that was recorded changed over the 4 year period being examined, leading to a varying population size in the data presented here, dependent on the particular attribute being examined.
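As a concrete illustration, a single archived record might resemble the sketch below; the field names and values are hypothetical, chosen only to show the kinds of attributes listed above rather than the actual schema used by the ACME/CESM scripts.

# Hypothetical example of one captured performance record (illustrative field names only).
run_record = {
    "case": "t341_FAMIP",                  # model configuration (case) name
    "machine": "titan",                    # system the run executed on
    "mpi_tasks": 8192,                     # number of MPI tasks
    "omp_threads": 2,                      # OpenMP threads per task
    "simulated_period": "1 year",          # simulation period for this job
    "queue_enter": "2014-03-12T03:15:42",  # when the job entered the queuing system
    "run_start": "2014-03-12T19:02:10",    # when the run started
    "init_seconds": 547.8,                 # initialization phase wall time
    "run_seconds": 17899.2,                # run phase wall time
    "throughput_sypd": 1.5,                # total throughput rate (simulated years per day)
}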

ACME is a coupled model made up of several components. The components correspond to the physical domains to be modeled, i.e. atmosphere, ocean, land, sea ice, land ice, and rivers (river transport model). A case is a specific model configuration with input files, resolution, choice of actively simulating a physical domain or just reading in data, selection of physical processes and numerical methods to enable in each component, etc. The data called out for single cases here are for two resolutions, t341 and t85, with ne120 and ne30 cases also being run. t341 is 0.35◦ resolution (512x1024 latitude/longitude points) and t85 is 1◦ (128x256 latitude/longitude). Both of these use a spectral Eulerian numerical method for evolving the dynamical equations in the atmosphere, and both use 26 vertical levels. The ne120 resolution is 0.25◦ and ne30 is 1◦. The ne numbers correspond to using a spectral element based numerical method for the atmosphere dynamics, and both use 30 vertical levels. When a case is referred to as high resolution, this currently refers to the t341 and ne120 resolutions, while low resolution refers to the t85 or ne30 resolutions. The other part of the case name describes the model configuration. For example, a leading F indicates that data read from a file are used for the ocean forcing communicated to the atmosphere, as opposed to B configurations where the ocean forcing is produced by an active ocean simulation. The AMIP vs 1850 runs are specific model configurations that differ in their initial input files and internal parameters. The desired simulation duration for high resolution runs is on the order of a few decades, while 100+ years is the target for the low resolution simulations.

Some of the data presented in this paper are from model runs done on the Jaguar system, which does not use Graphical Processing Units (GPU). The Titan system includes both a multi-core processor and a GPU accelerator in each computational node, but the GPUs were not used in the ACME runs reported on here.1

B. Data Capture Infrastructure

The ACME model has an extensive system of control scripts that manage correctness checks and workflow tasks related to running the model. These scripts have been modified to include code that records the timestamp of particular events, and archives this and other performance data in a project-specific central repository.

1The ACME science cases require a monotone limiter for tracer transport, which had not been ported to the GPU accelerators at the time of these experiments.


Some of these data are captured only when the model run completes successfully, resulting in a selection bias in the presented performance data.
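A minimal sketch of the kind of instrumentation hook described above is given below; the function name, event names, and archive path are hypothetical, and the actual ACME scripts record events in their own format and repository layout.

import json, os, time

# Hypothetical helper: append one timestamped workflow event to a shared,
# project-wide archive (the real control scripts use their own repository).
def record_event(case_name, event, archive="/proj/acme/perf_archive/events.jsonl"):
    entry = {
        "case": case_name,
        "event": event,            # e.g. "queue_enter", "run_start", "run_end"
        "timestamp": time.time(),  # seconds since the epoch
        "host": os.uname().nodename,
    }
    with open(archive, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example use at points the control scripts already pass through:
# record_event("t341_FAMIP", "run_start")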

C. Analysis Techniques

Computer performance is typically measured at the climate simulation level using the inverse time metric Simulated Years Per Day (SYPD), that is, how many simulation years can be computed for the particular configuration being measured if it ran for 24 hours straight. This metric is convenient for the climate scientists when specifying performance requirements as it relates directly to required productivity. This is also a sufficient metric if only application performance is of concern, as when tuning, or if the simulation is running on a dedicated system. In this paper, we also use a time to solution metric that takes into account all steps to a solution when run on shared resource systems.
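For reference, the SYPD metric follows directly from the simulated span and the elapsed wall time; the helper below is a minimal sketch of the definition, not code from the model scripts.

# Simulated Years Per Day: simulation years completed per 24 hours of wall-clock time.
def sypd(simulated_years: float, wallclock_seconds: float) -> float:
    return simulated_years * 86400.0 / wallclock_seconds

# Example: 2 simulated years in one day (86,400 s) of wall time gives sypd(2, 86400) == 2.0.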

Total time to solution is made up of several components (Fig. 1): Configure, Build, Queue (Fig. 1.a), Run (Fig. 1.b), Diagnostic Analysis, Transfer, and Archive. There are several other tasks that are required for provenance and output data archiving in a complete workflow, but the results of these tasks are typically not consumed directly and thus are not time critical for the initial results; they will be ignored in our workflow optimization analysis. Another issue that can take significant time occurs when the model terminates abnormally. Currently, failed jobs are not automatically resubmitted to the queue, and progress is halted until a user resubmits the model (Fig. 1.c). Time can also be spent waiting for a human to intervene when a model finishes running and analysis and data transfers are required (Fig. 1.d).
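Stated compactly, the time-to-solution metric considered here is approximately

    T_total ≈ T_configure + T_build + T_queue + T_run + T_diagnostics + T_transfer + T_archive + T_human,

where T_human stands for any time spent waiting for a person to resubmit a failed job or to launch the post-run analysis and transfers (Fig. 1.c and 1.d). This decomposition is our summary of the figure, not notation used elsewhere in the paper.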

The Configuration step is primarily limited by the principal investigators, who can take months to years to decide how to configure the model so as to best test a hypothesis. In contrast, the computer time required for configuration is on the order of several minutes. The build step takes on the order of tens of minutes, with 20–30 minutes being typical. This step needs to occur when there are changes to core system libraries, changes to the number of processors that are used, and for every new science configuration (case). Currently, builds happen infrequently except during debugging or performance optimization. Even though the configure and build steps are on the critical path and block time to a solution, given the infrequent nature of each step (typically once per case) and their relatively small cost, they will be ignored for this analysis.

D. Results

In this section, we examine model run time (Fig. 1.b), various aspects of queue time (Fig. 1.a), and the impact of human factors (Fig. 1.c). Data transfer time and archive time will be dealt with in the data section.

Performance vs time (single cases): Here we examine the computer performance of the model over time to see if there are changes or if it is fairly consistent. Seemingly small configuration changes can make large differences. When a case is being run to answer a science question, the model is typically run in a fixed configuration, including a fixed number of MPI tasks and OpenMP threads. Comparing performance between cases only makes sense for identical configurations, either on the same system or when comparing between systems. What we will be looking at in this sub-section is several individual cases that have been run many times over a period of time, comparing performance internally over time.

The initialization phase of ACME does a parallel I/O read of restart files and a further distribution of those data, using a subset of the MPI tasks for I/O and then reordering the data from the output format to that used during computation before distribution to the rest of the tasks. The run phase of the program advances the simulation in time, serially in the time direction but exploiting parallelism spatially during the computation for each time step. Model history is also output periodically during the run phase, with data being gathered and reordered by the I/O tasks before being written out. Table I shows the statistical summary for both phases on three cases for a total of 247 runs.

                          Initialization    Run
Mean (s)                  86.31             5,276.98
Standard deviation (s)    87.63             628.34
Coefficient of variation  1.02              0.12
Count                     118               118
Max (s)                   962.22            6,817.39
Min (s)                   55.10             76.17

(a) t85 FAMIP

                          Initialization    Run
Mean (s)                  547.82            17,899.15
Standard deviation (s)    206.55            869.54
Coefficient of variation  0.38              0.05
Count                     42                42
Max (s)                   1,735.69          19,990.46
Min (s)                   407.01            16,822.68

(b) t341 FAMIP

                          Initialization    Run
Mean (s)                  585.14            16,264.20
Standard deviation (s)    178.44            1,006.05
Coefficient of variation  0.30              0.06
Count                     87                87
Max (s)                   1,450.79          19,686.51
Min (s)                   455.04            15,054.04

(c) t341 F1850

TABLE I: Statistical summary of the ACME initialization and run phases.

The initialization times for the t341 cases are similar, with t85 being smaller. This makes sense as there is less data to read in for the lower resolution t85 case. By visual inspection (Fig. 2), there appears to be a good deal of variation in the t341 cases, but the coefficients of variation for the t341 cases are only 0.38 and 0.30 respectively, while that of t85 is 1.02. This indicates that there is less relative variability for the t341 cases compared to t85.
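The coefficient of variation used here is just the standard deviation normalized by the mean; a quick check against the Table I initialization statistics (a small sketch, not code from the paper):

# Coefficient of variation = standard deviation / mean (dimensionless).
def coeff_of_variation(std_dev: float, mean: float) -> float:
    return std_dev / mean

print(coeff_of_variation(87.63, 86.31))    # t85 FAMIP   -> ~1.02
print(coeff_of_variation(206.55, 547.82))  # t341 FAMIP  -> ~0.38
print(coeff_of_variation(178.44, 585.14))  # t341 F1850  -> ~0.30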

Fig. 3 shows the distribution of run time for the three cases. There are few outliers in the run time dataset.


Fig. 1: Manual experiment and diagnostic workflow.

Fig. 2: Initialization time for the t341 cases and the t85 case (247 runs).

The low run time outlier in the t85 case is near the beginning of the run sequence and terminates almost immediately. Likely, this is early termination from a case misconfiguration. In comparing the variability in the performance between the run time and the initialization time, we note that the coefficients of variation are much lower (0.05, 0.11, 0.06 respectively) compared to the initialization times. The ratio between run time and initialization time is about 30 for the t341 cases, and about 60 for the t85 case, showing that initialization time is only 1.5% to 3% of total run time. Even though the initialization time has a large amount of variability, its contribution to total time is small, thus its contribution to variability is also small.

Queue time vs job size (bulk statistics): Fig. 4 shows 485 runs from 2014 (1/11 to 11/20). From this diagram we can see that almost all requests are under 20k core hours and they are run within a day of being submitted. For the large requests there appears to be a uniform distribution with a large range for how long the jobs take to be serviced, anywhere from seconds to a few weeks.

There is a question about what caused the larger runs to have such a range of queue wait times. Looking at the plot of queue wait time compared to the date of submission (Fig. 5), we can see several interesting things. There is a spike in wait times in March/April.

Fig. 3: Run time for the t341 cases and the t85 case (247 runs).

Fig. 4: Job queue time per job size (485 runs from 2014).

The DOE INCITE (Innovative and Novel Computational Impact on Theory and Experiment) program allocation period runs from January 1st to December 31st of every year, with the DOE ALCC (Advanced Scientific Computing Research Leadership Computing Challenge) program running July 1st to June 30th. Conventional wisdom suggests that projects try to use large portions of their allocation near the end of those periods. However, the spike we are seeing is not associated with either of those timeframes. What this queue wait time spike appears to be is projects doing large runs in preparation for major conferences such as Supercomputing (with full paper submissions due April 11th, 2014), or Gordon Bell Prize attempts that were due May 1st.


Fig. 5: Job queue time per submission date (485 runs from 2014).

Queue request time vs actual run time: Queue time is on the order of a day while requested run time is on the order of 6 hours. The model run time is the component that is making progress, so we want to make sure that we are using as much of the requested time as possible once it has been given to the process. Looking at the same t341 and t85 cases discussed above, the mean fraction of requested time used was 74% for t85, 88% for t341 FAMIP, and 81% for t341 F1850.

For the t85 case, with a 2 hour request window, using 74% of the window means that there are about 31 minutes left at the end of the run. Two standard deviations of the combined initialization and run time amount to about 24 minutes, within which 95% of the runs should finish. There is thus an additional buffer of 7 minutes beyond the two standard deviations of model run time. Given the variability and the short request window, a mean of 74% utilization is fairly good. A similar analysis with the t341 based models shows a similar result for t341 FAMIP, but an extra 34 minutes for F1850.
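A rough reconstruction of that arithmetic, using the t85 FAMIP statistics from Table I (a sketch of the reasoning only):

request_window_min = 120.0                     # 2 hour request window for the t85 case
used_min = 0.74 * request_window_min           # ~89 minutes used on average
slack_min = request_window_min - used_min      # ~31 minutes left at the end of the run

# Two standard deviations of (initialization + run) time, from Table I(a):
two_sigma_min = 2.0 * (87.63 + 628.34) / 60.0  # ~24 minutes
extra_buffer_min = slack_min - two_sigma_min   # ~7 minutes beyond the 2-sigma spread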

Time spent waiting for people, i.e. not running and not in queue: Another possible cause of additional time in completing a case is when a job fails and needs to be restarted manually, or when a job is stopped on purpose. We define the metric for the time spent waiting for people to respond as the timespan between when one job is expected to end and when the next job in a series is created. A positive value indicates that the next job is submitted after the expected end of the preceding job, which characterizes a delay in the execution. A negative value indicates that the next job in the sequence is submitted before the expected job deadline. The latter is the expected outcome, since ACME resubmits itself to the queue once a run has succeeded, and on production infrastructure the walltime is often overestimated. Fig. 6 illustrates a situation where the metric would yield a negative value.
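Expressed as a small sketch (the function and parameter names here are hypothetical; the actual analysis uses the archived event timestamps described in Section II):

from datetime import datetime, timedelta

# Delay metric: time between when a job is expected to end and when the next job
# in the series is created. Here the expected end is taken as the job start time
# plus its requested walltime (one plausible reading of the definition above).
# Positive result -> a delay waiting for a person; negative result -> the next job
# was created before the previous one had to finish, the normal case since ACME
# resubmits itself and walltimes are overestimated.
def resubmission_delay(job_start: datetime, requested_walltime: timedelta,
                       next_job_created: datetime) -> timedelta:
    expected_end = job_start + requested_walltime
    return next_job_created - expected_end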

Using the same 485 instances of production runs as previously, we identified 7 occurrences of positive values (delays) from the above analysis, which represent about 1.4% of the total case runs. The delays range from hours to a few days in most cases, with two instances of multi-week delays.

Fig. 6: Typical sequence of events in queue.

These delays have been explained as occasions when the model run did not complete successfully, but, when resubmitted, ran to completion and continued running. Possibly this was caused by a hardware fault, but no information was archived that allows us to confirm this at this point in time. (At the time of a failure, the job id can be used to query system logs and identify possible causes for failures.) Even if these were hardware failures, they occurred only intermittently and had a low impact on throughput. The small hours-to-days delays likely arise when the person minding the run is not alerted that the run has failed, and it is restarted only when they notice that it is no longer running. The longer delays are from a case where model output analysis showed that the simulation was starting to drift in an undesirable direction. The simulations were then halted while the source of the issue could be identified and a solution formulated.

III. DIAGNOSTIC ANALYSIS

While the model is running, analyses are computed to track the progress of the simulation. These can be, for example, single values such as RESTOM, which measures the rate at which energy is entering or leaving the model, summary statistics and plots of cloud layers, and/or the extent of ice sheets, all to ensure that the simulation is not “nonphysical”. It is important that these analyses are carried out in a timely manner so corrections can be made, or the simulation can be discarded.

There are standard analysis packages used for each of the components. The available data are exclusively for the Atmosphere diagnostics, for which there are two software packages. One is the Atmosphere Model Working Group (AMWG) diagnostics package, which is based on NCAR Command Language (NCL) analysis tools [21]. The other package replicates the behavior of AMWG with additional calculations being performed. This second package is based on the Python CDAT library, which is part of the UV-CDAT package [22].

                        Time (hr)
NCL AMWG                6.45
UV-CDAT Total           58.74
UV-CDAT climatologies   55.79
UV-CDAT diagnostics     2.95

TABLE II: Analysis operations timespan.

These operations can run in parallel, but are critical for feedback about how the model is progressing.

IV. TRANSFER TIME

As part of the end-to-end workflow, model output data needs to be archived as well as published to the ESGF system.


The allocation of long term storage and infrastructure for publication is often located at a different site from the compute allocation. Hence, data transfer across different networks is required. We have two sets of transfers: 1) wide area, between two national laboratories; and 2) local area, internal to a single laboratory but between security zones and different disk systems. All transfers under one gigabyte have been removed from the data set as they have little impact on total time and are typically log files of negligible scientific use other than for reporting such as this paper. Table III shows the statistical summary of the two data transfer sets.

                          Rate (Mbit/s)    Size of Transfer (GB)
Mean                      591.0            430.7
Standard deviation        621.8            872.2
Coefficient of variation  1.05             2.02
Count                     34               34
Max                       2,210.1          5,110
Min                       1.58             1.0

(a) Transfer from one national laboratory to another (wide area)

                          Rate (Mbit/s)    Size of Transfer (GB)
Mean                      642.3            2,166.2
Standard deviation        405.7            2,259.7
Coefficient of variation  0.63             1.04
Count                     38               38
Max                       1,330.8          5,930
Min                       19.3             7.2

(b) Transfer internally across security zones (local area)

TABLE III: Statistical summary of data transfers.

The typical setup of these transfers is a model case needing to be copied. A transfer session is started per model component, or a transfer per file type (there are multiple file types per component). These transfers run in parallel.

At this point we don’t have enough data to provide statistical information, but enough to make an estimate. A high resolution simulation with a coupled ocean model can produce 130 TB of data. If all of this information is transferred at one time, instead of as it is generated, and if there is only one transfer process working, this example would take 19.6 days. If there are four parallel transfers the transfer time drops to 5 days. Transferring the data as it is generated does not add to the total amount of time to run an experiment, as the transfers can happen in parallel with the case running.
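As a back-of-the-envelope check of that estimate, assuming a sustained rate of roughly 600 Mbit/s (about the mean rates in Table III):

data_bytes = 130e12                  # ~130 TB of simulation output
rate_bits_per_s = 600e6              # ~600 Mbit/s sustained, roughly the Table III means
single_stream_days = data_bytes * 8 / rate_bits_per_s / 86400  # ~20 days, consistent with 19.6
four_stream_days = single_stream_days / 4                      # ~5 days with four parallel transfers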

A. Current Dataset Size

The amount of data generated by climate related projects over the course of their year long duration can be quite sizeable. For three climate projects that one of the authors participated in, the final data volumes saved to tape archive were 161 TB, 53 TB, and 252 TB.

V. PERFORMANCE SUMMARY

To this point we have detailed several contributions to total experiment run time. A summary of that information is presented here:

• Queue time: progress through the queue happens in under a day in 94.6% of the cases in our record. There is some variability around busy times of computer usage;

• Model performance is fairly consistent. Approximately 20% of the variability comes from the model initialization phase, with the remainder in the rest of the model, which includes numerical algorithms, communication, and writing of restart and history files;

• Diagnostic Analysis: the process consumes from a quarter day to two days;

• Transfers in parallel after the simulation is complete take ∼5 days. Transfers as the model is running happen in parallel with the simulation.

VI. PROJECTIONS

In this section, we examine parameters and planned changes that impact overall workflow performance. In particular, we observe how changes to the model, along with the platforms that run the experiments, can influence experiment performance.

A. Relationship between data size and compute time

The amount of computing time awarded to a project has an impact on the ability of the project to generate data. In particular, the above mentioned projects received: 33.5 million hours (Mh) for t341 FAMIP; 116 Mh for t85 FAMIP; and 50 Mh for t341 F1850. These were allocation hours on Jaguar. The data intensity in TB per million allocation hours for each project was 4.8, 0.5, and 5.0, respectively. The two projects with high data intensity for the amount of allocation were focused on high resolution production runs and looked at data output at a high frequency with respect to simulation time.
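A quick check of those intensities, assuming the archived volumes listed in Section IV-A (161, 53, and 252 TB) correspond to the three projects in the same order:

# Data intensity in TB archived per million allocation hours (Mh); the mapping of
# archive volumes to projects is an assumption based on the order they are listed.
print(161 / 33.5)   # t341 FAMIP -> ~4.8 TB/Mh
print(53 / 116)     # t85 FAMIP  -> ~0.5 TB/Mh
print(252 / 50)     # t341 F1850 -> ~5.0 TB/Mh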

B. Changes in simulation impacting compute intensity

The ACME climate model is changing atmospheric models from CAM4 to CAM5. CAM5 is more computationally expensive than CAM4 due to enhanced physics, increased aerosols, and improved cloud properties, which significantly increases the amount of computation required per simulated year (i.e., it is more compute intensive). In practice, CAM5 is 2–4x more computationally expensive for the same amount of data, with the range of performance depending on the exact configuration [23]. Also impacting the computation-to-data balance, the DOE community standard configuration is moving to higher spatial resolution (1◦ to 0.25◦). In the atmosphere, the theoretical computational impact of the resolution change represents an increase of 16x (4 times the resolution in each of the two horizontal dimensions). The number of vertical levels is independent of the horizontal resolution, and is only modified for experimental concerns. In practice, when comparing the performance of CAM5 ne30 to ne120 configurations, there is a 27x difference in computational cost. Combining this with the 16x increase in the size of the output data fields, the amount of required computation is once again scaling faster than the data output, in this case by nearly 1.7x, given the single sample available for this estimate.
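The scaling argument in this paragraph can be made explicit with a little arithmetic (a sketch of the reasoning only):

horizontal_factor = (1.0 / 0.25) ** 2     # 1 deg -> 0.25 deg in both horizontal dims: 16x grid points
observed_cost_factor = 27.0               # measured CAM5 ne30 -> ne120 cost increase
data_factor = 16.0                        # output volume grows with the grid; vertical levels unchanged
compute_vs_data = observed_cost_factor / data_factor  # ~1.7x: compute grows faster than data output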


Additional physical details can be revealed by increasing the temporal resolution of the output. Increasing the output frequency allows more physics to be captured from the governing equations, as opposed to being represented by physical parameterizations, and reveals more realistic interactions between the different simulated scales, for example in precipitation, wind speed, tropospheric moisture convection, Atlantic storm tracks, and eddy fluxes. For instance, output more frequent than monthly (e.g., daily or hourly) may be required in simulations where the number of hurricanes is examined. More frequent output thus enables better simulations and more detailed analysis, which leads to better founded papers and to investigating a greater number of science questions from the same experiment. How far the balance is pushed towards gathering more data is largely controlled by available storage resources, with the latter being an ongoing discussion.

C. Machine Changes

Future generations of supercomputers at the Leadership Computing Facilities are aiming to be 5-10x more capable than current generation machines, with the follow-on generation another 5-10x again [24]. The percentage allocation that Climate Science has on these large systems has been fairly consistent historically.

From the Summit announcement we have seen machine computational performance scale up while I/O rates stay constant, albeit with a change in technology. If the ratio of delivered to rated performance stays constant with these changes in technology, we will continue to see a widening gap between computation and I/O. However, looking back at the model changes, we see that the model is changing in a similar way, with a larger increase in computation than in I/O.

D. Projections of compute and data given projected machine speeds

In the previous section, the transition from CAM4 to CAM5 gave a 2-4x increase in computational demands, while the resolution change needed 27x the amount of computation with a 16x increase in data volume. To analyze higher resolution model behavior, the I/O output volume needs to increase, but it is limited by available storage space.

VII. AUTOMATION

Much of the reporting (simulation status) and ancillary tasks (publication of data, archival storage) for these experiments can happen in parallel with the simulation, but currently consume large amounts of personal time while being time sensitive. To alleviate the time demands on staff, improve consistency, and reduce the need for high speed sub-systems, automation has been introduced into the experimental process.

A. First Attempt

OLCF operates as a moderate security data processing facility. This imposes restrictions such as requiring two factor authentication to log in, submit jobs, or start data transfers from external locations. Also, there are no reliable facilities for running daemon processes internal to OLCF, motivating early attempts to automate the ACME workflow within OLCF to use a novel one-sided workflow (see Fig. 7). This method consisted of commands being issued from outside of OLCF, from the Compute and Data Environment for Science (CADES) infrastructure, using one time passwords, with the processes running in OLCF doing automated reporting of status. This arrangement has been abandoned due to the inability to directly gather information about process status, which is critical when exceptions occur. Other considerations include the inability to start transfers that move simulation output from OLCF to CADES Open Research in an automated fashion, and the dependence on fragile scripting to carry out complex tasks across several machines.

B. Current Path

Pegasus is a workflow management system that is targeted at large scale workflows on large resources. It uses a file based task and flow configuration system with a web based UI for status and reporting. The Pegasus system is highly portable, performance oriented, and scalable to large numbers of tasks and resources; it automatically captures provenance, provides reliability in workflow execution, and has error recovery.

We have encoded the core ACME workflow (Configuration, Build, Run and periodic Analysis) into a Pegasus workflow as a demonstration of capability. This simulation has run successfully at NERSC. At OLCF and CADES Open Research, workflow nodes have been installed into the infrastructure on an experimental basis with the intent of transitioning to production ready systems. The ACME workflow is currently being ported to these new Pegasus workflow instances.
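The paper does not reproduce the encoding itself; the sketch below only illustrates what a linear Configure, Build, Run, and Analysis chain looks like in the Pegasus Python API (Pegasus.api, which postdates the work described here), and the transformation names are purely hypothetical.

from Pegasus.api import Workflow, Job

# Illustrative only: a minimal Configure -> Build -> Run -> Analysis chain.
wf = Workflow("acme-core")
configure = Job("case_setup").add_args("--case", "t341_FAMIP")  # hypothetical executables
build = Job("case_build")
run = Job("case_run")
analysis = Job("amwg_diagnostics")

wf.add_jobs(configure, build, run, analysis)
wf.add_dependency(configure, children=[build])
wf.add_dependency(build, children=[run])
wf.add_dependency(run, children=[analysis])
wf.write("acme-core-workflow.yml")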

As demonstrations of capability and portability for the ACME project are made, additional capabilities will be added to the workflow, such as the data transfers, ESGF publication, and archiving to HPSS.

C. Automated Workflow Impacts

Up to now we have been doing mass processing of data that has required days to weeks to transfer data between systems and store it to HPSS. With automation, treating this as a single event, where a large volume of data is moved through the system and injected at the end points, can end. Instead, as progress is being made, each small piece of data can be handled as it is generated.

Taken very approximately, we see that the model runs once per day, with a requested run time of 6 hours, actually running for 4.5 to 5 hours of the 6 hour allocation.


Fig. 7: One-sided workflow.

Half an hour of that time is spent doing I/O at a rate of approximately 50 MB/s. The network transfer rates are about 80 MB/s, thus needing about 20 minutes per day to stream the data to the destination machine. The OLCF HPSS system ingests data at around 200 MB/s, again needing only minutes once the proper tapes are loaded. Moving to a processing flow where each step is incrementally updated results in lower demand on hardware and human resources.
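Those rates combine into a rough per-day budget (a back-of-the-envelope sketch using the figures quoted above):

io_seconds_per_day = 0.5 * 3600                # half an hour of I/O per daily run
daily_output_mb = io_seconds_per_day * 50      # at ~50 MB/s -> ~90,000 MB (~90 GB) per day
transfer_minutes = daily_output_mb / 80 / 60   # at ~80 MB/s -> ~19 minutes to stream out
hpss_minutes = daily_output_mb / 200 / 60      # at ~200 MB/s -> ~7.5 minutes to ingest into HPSS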

VIII. CONCLUSION AND FUTURE WORK

The ACME model is changing to become more compute and data intense, with the ratio of the two leaning towards becoming more compute intense. Machine performance is moving in a similar way. As there is little information at this point on future machine performance, it is impossible to quantify how the changes in machine balance compare to model balance, but they are moving in similar directions.

Future work includes the following:

• Increase the number of places where performance information is captured, such as the transfer and publication times;

• Increase the number of fields captured;

• Gather data and build an analytic model for increased queue request time vs. longer queue wait time;

• Reduce the run time of applications: optimize run time where appropriate, and automate where appropriate.

ACKNOWLEDGEMENTS

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

This work was supported by DOE under contract #DE-SC0012636, “Panorama—Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows,” and by the Office of Biological and Environmental Research as part of the project “Accelerated Climate Modeling for Energy.”

We thank Ashley Barker, Christopher Fuson, Judy Hill, Gideon Juve, Eric Lingerfelt, Vickie Lynch, Ross Miller, Matt Norman, Suzanne Parete-Koon, Duane Rosenburg and Brian Smith for their valuable contributions.


REFERENCES

[1] G. Mehta, E. Deelman, J. A. Knowles, T. Chen, Y. Wang, J. Vockler, S. Buyske, and T. Matise, “Enabling data and compute intensive workflows in bioinformatics,” in Euro-Par 2011: Parallel Processing Workshops. Springer, 2012, pp. 23–32.

[2] W. Tang, N. Desai, D. Buettner, and Z. Lan, “Job scheduling with adjusted runtime estimates on production supercomputers,” Journal of Parallel and Distributed Computing, vol. 73, no. 7, pp. 926–938, 2013.

[3] R. Ferreira da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, and M. Livny, “Toward fine-grained online task characteristics estimation in scientific workflows,” in 8th Workshop on Workflows in Support of Large-Scale Science, 2013, pp. 58–67.

[4] R. Ferreira da Silva, G. Juve, M. Rynge, E. Deelman, and M. Livny, “Online task resource consumption prediction for scientific workflows,” Parallel Processing Letters, accepted, 2015.

[5] S. S. Shende and A. D. Malony, “The TAU parallel performance system,” International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.

[6] S. Kaufmann and B. Homer, “CrayPat - Cray X1 performance analysis tool,” Cray User Group (May 2003), 2003.

[7] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “HPCToolkit: Tools for performance analysis of optimized parallel programs,” Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.

[8] J. Rosinski, “General purpose timing library (GPTL): A tool for characterizing performance of parallel and serial applications,” Cray User Group (May 2009).

[9] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, “VAMPIR: Visualization and analysis of MPI resources,” 1996.

[10] W. Chen and E. Deelman, “Workflow overhead analysis and optimizations,” in Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science. ACM, 2011, pp. 11–20.

[11] E. J. Lingerfelt, O. Messer, S. S. Desai, C. A. Holt, and E. J. Lentz, “Near real-time data analysis of core-collapse supernova simulations with Bellerophon,” Procedia Computer Science, vol. 29, pp. 1504–1514, 2014.

[12] “NERSC,” https://www.nersc.gov/users/queues/queue-wait-times.

[13] “Parallel workloads archive,” http://www.cs.huji.ac.il/labs/parallel/workload.

[14] A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, and D. H. J. Epema, “The grid workloads archive,” Future Gener. Comput. Syst., vol. 24, no. 7, pp. 672–686, 2008.

[15] C. Germain-Renaud, A. Cady, P. Gauron, M. Jouvin, C. Loomis, J. Martyniak, J. Nauroy, G. Philippon, and M. Sebag, “The grid observatory,” IEEE International Symposium on Cluster Computing and the Grid, pp. 114–123, 2011.

[16] R. Ferreira da Silva and T. Glatard, “A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions,” in Euro-Par 2012: Parallel Processing Workshops, ser. Lecture Notes in Computer Science, I. Caragiannis, M. Alexander, R. Badia, M. Cannataro, A. Costan, M. Danelutto, F. Desprez et al., Eds., 2013, vol. 7640, pp. 79–88.

[17] R. Ferreira da Silva, W. Chen, G. Juve, K. Vahi, and E. Deelman, “Community resources for enabling and evaluating research on scientific workflows,” in 10th IEEE International Conference on e-Science, ser. eScience’14, 2014, pp. 177–184.

[18] D. L. Hart, “Measuring TeraGrid: workload characterization for a high-performance computing federation,” International Journal of High Performance Computing Applications, vol. 25, no. 4, pp. 451–465, 2011.

[19] “ACME: Accelerated climate modeling for energy,” http://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy.

[20] K. Evans, D. Bader, G. Bisht, J. Caron, J. Hack, S. Mahajan, M. Maltrud, J. McClean, R. Neale, M. Taylor, J. Truesdale, M. Vertenstein, and P. Worley, “Mean state and global characteristics of the t341 spectral community atmosphere model,” CESM Atmosphere Working Group, 2012.

[21] NCAR, “Atmosphere working group diagnostics package,” https://www2.cesm.ucar.edu/working-groups/amwg/amwg-diagnostics-package.

[22] L. Cinquini, D. Crichton, C. Mattmann, J. Harney, G. Shipman, F. Wang, R. Ananthakrishnan, N. Miller, S. Denvil, M. Morgan et al., “The earth system grid federation: An open infrastructure for access to distributed geospatial data,” Future Generation Computer Systems, vol. 36, pp. 400–417, 2014.

[23] S. Tilmes, “CAM4/CAM5 comparison,” in Chemistry-Climate Working Group Meeting, CESM Annual Workshop, 2012.

[24] B. Bland, “Present and future leadership computers at OLCF,” OLCF User Group Conference Call, December 3, 2014.