WarpIV Technologies, Inc. Simulation Interoperability Workshop, Spring 2013 Paper 13S-SIW-047

The Roadmap

Jeffrey S. Steinman, Ph.D.

WarpIV Technologies, Inc. 5230 Carroll Canyon Road, Suite 306

San Diego, CA 92121

(858) 605-1646

[email protected]

Key Words: Open Unified Technical Framework, OpenUTF, OpenMSA, OSAMS, OpenCAF, High Performance Computing, Multicore, Manycore, WarpIV Kernel, SPEEDES

ABSTRACT: The multicore computing revolution has begun and demands a new software development paradigm to take advantage of modern computing resources. The Simulation Interoperability Standards Organization (SISO) launched the Open Source Initiative for Parallel and Distributed Modeling & Simulation (OSI-PDMS) Study Group in 2007 to prepare for this paradigm shift in modern computing. This study group evolved one year later to become the Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG).

The initial focus of the PDMS-SSG was to (a) provide a forum for discussing various parallel and distributed M&S technologies and solutions, and (b) educate M&S practitioners in the subject of massively parallel computing on high-performance computers. Over time, three architectures emerged to address the high-level performance and interoperability requirements of parallel and distributed M&S. These three architectures are: (1) Open Modeling & Simulation Architecture (OpenMSA), which focuses on the technology of parallel M&S, (2) Open System Architecture for Modeling & Simulation (OSAMS), which focuses on the necessary modeling constructs and methodologies to support cost-effective, scalable, and interoperable M&S, and (3) Open Cognitive Architecture Framework (OpenCAF), which focuses on modeling constructs necessary to represent the thought processes of complex and intelligent behaviors.

These three synergistic architectures are currently being matured and refined by participants of the PDMS-SSG. The goal is for these architectures to become international M&S standards. The Open Unified Technical Framework (OpenUTF) comprises these three architectures. The WarpIV Kernel is the freely available (within the United States) open source reference implementation of the OpenUTF. The WarpIV Kernel can be made available to the international community through an approved export license.

This paper provides a roadmap for establishing the OpenUTF as a common technical framework. It tackles many of the challenging issues leading to the formation of the three architectures. The roadmap discusses how to migrate legacy models and services to operate within the OpenUTF using composability techniques. Topics discussed in the roadmap include: (1) the multicore computing revolution, (2) impact of parallel computing on life cycle costs, (3) fundamentals of parallel and distributed computing, (4) abstract time synchronization mechanisms, (5) composable modeling techniques, (6) publish and subscribe services, (7) advanced performance techniques, (8) integration with external systems, (9) visualization and analysis, (10) testing and troubleshooting, (11) technology transition, (12) standardization, and (13) an incremental strategy for integrating legacy systems into the OpenUTF. The primary goal of the roadmap is to help M&S practitioners (a) understand the OpenUTF vision and (b) make its transition into mainstream programs a reality.

It is anticipated that this roadmap will evolve into a technical report that will be provided by the PDMS-SSG. This paper should be regarded as an early draft of this report.

1.0 Introduction

The goal of this paper is to provide a roadmap for unifying Modeling & Simulation (M&S) Science & Technology (S&T) under a common Open Unified Technical Framework (OpenUTF). The benefits of such a framework, if widely adopted by the community, would be: (1) support for scalable parallel and distributed computing, (2) significantly improved interoperability and reuse between models and/or services, and (3) considerably lower life cycle costs for software programs.

Figure 1: Roadmap for achieving goals.

A successful roadmap must address and solve all of the technical issues related to making the development and widespread use of a common framework a reality. As shown in Figure 1, the roadmap must identify (a) the current “as is” state of software technology, (b) the “to be” end goal that is desired, and (c) an incremental strategy for reaching the end goal in a cost-effective and rapid manner. In other words, the roadmap must take the community from where it is today on a safe and affordable journey to the right destination. Note that the poorly designed highway in Figure 1 does not provide a safe journey. Such high-risk and dangerous roads must be avoided as the community lays out a feasible roadmap.

By necessity, this roadmap covers multiple related topics that must work together in establishing the vision for a common software framework. The roadmap does not delve deeply into any particular topic, but rather attempts to provide adequate discussion of the technical issues, challenges, risks, and solutions associated with each topic. The roadmap guides the discussion of each topic to a technically feasible, implementable, and affordable solution based on considering all options and making smart choices along the way. The roadmap requires community participation.

2.0 The Need for Change

The end of dramatic exponential growth in single-processor performance marks the end of the dominance of the single microprocessor in computing. The era of sequential computing must give way to a new era in which parallelism is at the forefront… The next generation of discoveries is likely to require advances at both the hardware and software levels of computing systems.

The Future of Computing Performance: Game Over or Next Level?

National Research Council, 2011

Figure 2: Article from the National Research Council emphasizes the importance of next-generation software systems embracing the multicore computing revolution.

2.1 Multicore Computing Revolution

The multicore computing revolution (see Figure 2) has begun and will forever impact the way software is designed, developed, managed, tested, validated, fielded, and maintained. As shown in Figure 3, modern computing architectures will likely be composed of wide-area and local-area networks connecting handhelds, laptops, desktops, and cloud clusters of multi-rack machines, each containing one or more motherboards with local RAM, each board potentially configured with multiple CPU chips, each chip providing a substantial number of processors that are grouped into nodes sharing local memory caches.

Figure 3: Modern computing architectures.

Hierarchical software systems of systems operating as an enterprise will be distributed across such computing architectures with requirements to support both distributed network-centric warfare and scalable high-performance computing. In this emerging distributed communicating and computing world, which also includes cyber-physical real-time systems that are embedded in smart devices, it will become increasingly important to efficiently map software applications and actual code implementations to computing resources in a manner that (a) distributes computational workloads evenly, (b) minimizes network bandwidth consumption, (c) reduces interaction latencies between tightly coupled software components, (d) increases performance by making use of scalable parallel processing techniques, and (e) promotes interoperability and reuse of software models and services.

Software technologies, tools, and frameworks are needed to reduce the complexity of developing software in the emerging parallel and distributed computing world. M&S, in its most generic form, very likely faces the toughest challenges of all software domains. The most demanding simulations are composed of large numbers of interacting objects representing hierarchical systems or entities that comprise an overarching system. Such objects, distributed across processors, freely interact with other objects at arbitrary time scales as they operate independently and concurrently in abstract time. These objects often require cognitive capabilities to portray humanistic thought processes and decisions that combine both logical and emotional personality traits. Cognitive models representing humans are highly dependent on societal worldviews, ideologies, emotion, perception, and imperfect data filtering. Human decisions are rarely optimal. The mapping of these complex and highly interactive objects onto networks, machines, boards, chips, nodes, and processors impacts the performance of the fully integrated executing system.

2.2 Composability and Time Scales

Composability in the roadmap is defined as the ability to map encapsulated models and/or services to (1) each other in a hierarchical systems-of-systems computational graph structure and (2) computing resources within a parallel and distributed environment. The term model refers to a simulated representation of a system or subsystem. Services, on the other hand, are actual algorithms that can be deployed in real systems. One benefit of having a common technical framework is that models and services make use of the same programming constructs and can therefore more freely interoperate. This means that simulations can use real services for high fidelity studies or algorithm testing within a scalable testbed. Similarly, actual services can utilize predictive models, which can be useful for tactical systems providing decision support. Scalable services can be hosted within a common high-performance technical framework to automatically integrate with web-based or other standards-based external interfaces. This unifies the worlds of service-oriented operational systems, modeling & simulation, and test & evaluation testbeds.
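
To make this concrete, the following minimal C++ sketch shows how a composite might host a simulated model and a deployable service behind a single abstract interface, so that a study can swap in the real algorithm and a tactical system can swap in the predictive model. All class names and the toy arithmetic are illustrative assumptions, not part of any OpenUTF specification.

    // Minimal sketch (hypothetical names): a model and a service sharing one
    // abstract interface so a composite can compose either interchangeably.
    #include <iostream>
    #include <memory>
    #include <vector>

    class ITrackEstimator {  // abstraction shared by models and services
     public:
      virtual ~ITrackEstimator() = default;
      virtual double EstimatePosition(double time) const = 0;
    };

    class SimTrackModel : public ITrackEstimator {  // simulated representation
     public:
      double EstimatePosition(double time) const override {
        return 10.0 * time;  // toy constant-velocity model
      }
    };

    class RealTrackerService : public ITrackEstimator {  // deployable algorithm
     public:
      double EstimatePosition(double time) const override {
        return 10.0 * time + 0.5;  // stand-in for the fielded tracker
      }
    };

    class Composite {  // hierarchically composes models and/or services
     public:
      void Add(std::unique_ptr<ITrackEstimator> part) {
        parts_.push_back(std::move(part));
      }
      void Report(double time) const {
        for (const auto& p : parts_)
          std::cout << "estimate at t=" << time << ": "
                    << p->EstimatePosition(time) << "\n";
      }
     private:
      std::vector<std::unique_ptr<ITrackEstimator>> parts_;
    };

    int main() {
      Composite aircraft;
      aircraft.Add(std::make_unique<SimTrackModel>());       // study configuration
      aircraft.Add(std::make_unique<RealTrackerService>());  // testbed configuration
      aircraft.Report(2.0);
    }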

As shown in Figure 4, common technical framework support for multiple processing configurations within an enterprise can be challenging, especially when considering the fact that long-distance interactions between software systems operating at the enterprise level can be six orders of magnitude more expensive than interactions between tightly integrated software components linked together within the same process performing function or method calls within a single core. Software composition strategies must account for these interaction time scales to provide efficient execution.

Figure 4: Interaction time scales for execution across the full spectrum of network and computing resources.

Enterprises integrate multiple disjoint applications within a loosely coupled distributed system. Communication between such systems is often sporadic and irregular. High Level Architecture (HLA) federations can be thought of as enterprises because they are composed of loosely coupled heterogeneous simulations (called federates), each providing a different functional capability that operates at its own time scale.

In an ideal world, the combined collection of federates within an HLA federation spans the required functionality of the simulated system for its intended use. Loosely coupled federations can encounter conceptual modeling problems, making it very difficult and costly to validate results. Federates within a federation often have overlapping models, which further exacerbates the problem of validation. Experience has shown it to be very costly to integrate federates into multiple federations, especially when object models are different, startup procedures are specific to each federation, tools are federation-specific, scenario descriptions have different formats, etc. Run-time performance of HLA federations is usually far from ideal.

Enterprises can be locally or geographically distributed across arbitrary networks. At the enterprise (or federation) level, distributed software systems interact with overheads that are typically greater than 10⁻³ seconds. These overheads can become excessive for highly interacting applications, especially when ASCII-based data formats such as XML are required or when time management services, often required for supporting analysis-based M&S, are enabled.

Enterprises (or federations) are composed of systems (or federates) that themselves may internally execute on one or more local machines. Such locally distributed systems almost always utilize more optimized message-passing services and are typically able to interact via TCP/IP at the 10⁻⁴ second time scale.

Within a machine, specific systems (or federates) can be further decomposed into composites (sometimes referred to as simulation objects) that represent a particular simulated system, perhaps organized internally as a hierarchical collection of models and services. Depending on whether composites are local or remote to a board, they mutually communicate either through high-speed communications provided by specialized backplanes within a rack of boards (approximately 10⁻⁵ seconds overhead), or through shared memory (approximately 10⁻⁶ seconds overhead) when they reside on the same board.

Threads, using kernel-level locking mechanisms to support concurrent memory access, can handle interactions between composites residing on the same chip but on different processors at the 10⁻⁸ second time scale. However, having too many threads can result in bottlenecks during access to critical memory.

Finally, a process executing on a core interacts through function or method calls with overheads that are on the order of 10⁻⁹ seconds. For full encapsulation between models and services, data sharing and function/method calls must be performed through abstractions that hide actual implementations. This adds a level of indirection that is not required for data sharing and method/function calls occurring between objects provided within an encapsulated model or service. As far as the technical framework is concerned, encapsulation is only enforced at the model or service layer. What goes on inside a model or service is up to the developer to decide and optimize.

An important distinction must be made between (a) models and services that are composed within a composite and (b) instances of composites that are distributed across processors. Once the simulation is initialized and ready to begin its actual execution, all processing occurs through time-tagged events that occur in abstract time (logical, time-stepped, discrete-event, scaled real-time, hard real-time, etc.). By working in abstract time, models and services can freely operate in multiple time domains. The actual time domain choice is made at run time when the application is launched. The framework internally manages events for composites in abstract time. By definition, this means that all of a composite’s internal models and services progress in time together. This is not always true for different composites, even when they are located on the same processor. For optimal performance and flexibility, the framework must allow composites to evolve independently in abstract time. So, while it is always valid for models and services within a composite to directly share data and invoke each other’s methods, this is not generally valid for different composites, even when co-located on the same processor. Instead, different composite instances must interact and share data through event mechanisms that are scheduled and processed in abstract time. How composites are defined and mapped to processors significantly impacts performance. For example, the highly interacting systems of an aircraft carrier might be represented as multiple composites that are decomposed onto several processors within a multicore chip, while other composites representing individual entities are simply spread across different chips, boards, or machines.
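
The sketch below illustrates this rule with a hypothetical event queue: interactions between composites are expressed as time-tagged events that the framework processes in abstract-time order, rather than as direct method calls across composite boundaries. The EventQueue and Radar names are invented for illustration.

    // Sketch (hypothetical API): interactions between composites are scheduled
    // as time-tagged events and processed in abstract-time order, never as
    // direct method calls across composite boundaries.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Event {
      double time;                   // abstract simulation time
      std::function<void()> action;  // work to perform at that time
      bool operator>(const Event& other) const { return time > other.time; }
    };

    class EventQueue {  // stand-in for the framework's per-node event manager
     public:
      void Schedule(double t, std::function<void()> a) {
        queue_.push({t, std::move(a)});
      }
      void Run() {  // process pending events in increasing abstract time
        while (!queue_.empty()) {
          Event e = queue_.top();
          queue_.pop();
          e.action();
        }
      }
     private:
      std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
    };

    class Radar {  // a composite instance
     public:
      void Detect(double t) { std::cout << "t=" << t << ": radar detection\n"; }
    };

    int main() {
      EventQueue framework;
      Radar radar;
      // Another composite never calls radar.Detect() directly; it schedules
      // events that the framework delivers at the proper abstract times.
      framework.Schedule(5.0, [&radar] { radar.Detect(5.0); });
      framework.Schedule(2.0, [&radar] { radar.Detect(2.0); });
      framework.Run();  // prints t=2 before t=5, regardless of scheduling order
    }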

Collectively, the spectrum of interaction overheads on modern network and computing architectures demands composition techniques that map interacting software appropriately. It is noteworthy that most of the prior community focus has been to standardize interoperability and reuse at the enterprise (or federation) level. Yet, actual software models and services, which currently have no universally agreed-upon interoperability standards, run six orders of magnitude faster than enterprise-level systems. There is an urgent need to standardize how models and services interoperate at the actual code level. This is a necessary step to support true composability in a meaningful way. Because of the technical challenges in providing such capabilities within a common framework, the only feasible approach for implementing the roadmap is through free and unconstrained open source participation and cooperation amongst technologists and scientists.

The common technical framework must (a) provide programming constructs that simplify model and service code development, (b) enforce complete encapsulation of models and services so that they can share data and invoke each other’s methods without introducing code dependencies, and (c) define metadata standards for describing models and services so that they can be correctly composed using automation tools that also check for validity. Enterprises composed of applications, composites, models, and services must operate in a scalable manner within the technical framework over the full spectrum of possible network and computing configurations.

2.3 Interoperability and Reuse

Because the multicore computing revolution demands a different software methodology, it makes sense to also consider other beneficial software practices that promote interoperability and reuse. Software composability of multiuse components is enhanced by the notion of composites being hierarchically constructed from models and/or services that reside within the same node or even the same thread. Models and/or services within a composite are free to mutually interact using abstracted function or method calls and data sharing techniques, while composites within a multicore system (or federate) mutually interact with other composites through events scheduled using high-speed communications or shared memory. Event scheduling overheads increase when extending these concepts to local and wide-area networks. So, the mapping of services and models into composites, and the mapping of composites to cores, CPU chips, boards, machines, and networks significantly impacts overall system performance.

2.4 Impact on Life Cycle Costs

Figure 5: Software life cycle.

As shown in Figure 5, traditional software systems go through an iterative life cycle. Typical software programs begin by identifying requirements, leading to design and prototyping of the system before beginning actual code development. Testing, debugging, validation and accreditation are performed as the development of the system matures. Experience has shown that verification and validation should be performed as early as possible in the development process. The system is documented in preparation for operational deployment, where it is fielded and maintained. Operator training can be extensive, especially when the upgraded system replaces an older interface. User feedback relating problems, missing functionality, recommendations for better usage, etc., results in further software life cycle rounds that again start with new or additional requirements. In some cases, requirements evolve so rapidly over time that the final software implementation bears little resemblance to its original design. If the system design is brittle, or when modifications are made (on tight schedules or budgets) without respecting the original system’s design, such applications eventually collapse due to their unmaintainable complexity. It simply becomes too expensive to support the program any longer.

Large programs often ramp up their technical staffs with expert engineers during the first round of the software life cycle process. Once the system is fielded, engineering resources are often reduced to a skeleton maintenance team. The most experienced engineers leave the project to work on new and more exciting startup efforts. Meanwhile, maintenance of the fielded system becomes challenging and costly, especially when a maintenance team with limited expertise patches the software as needed to fix bugs or add new features. Without the original core expertise, such software changes often introduce new bugs, which can be insidious to discover, let alone fix. The design of the system becomes rigid over time, resulting in prohibitive maintenance costs.

Another problem with the current approach towards software life cycle management occurs when the capabilities of a particular tool are not presently needed. This can result in the program being defunded and the software shelved. Eventually, all of the original subject matter experts move on to other projects. Then, at a later date, the application’s capability is once again needed, yet there are no experienced engineers available to support the program. So, either an inexperienced engineering team is assembled to resurrect the program, or a new development effort must begin, often resulting in significant waste due to duplication. There is no guarantee that prior lessons learned from the previous system are applied to such do-over efforts.

The solution to these problems is to have a common framework that is well understood by the engineering community, which encapsulates models and services into small and maintainable modules that are managed in a common repository. An automated test framework verifies and validates models as part of the configuration management process, which ensures that the models are ready to be used when needed. Engineers can move between programs and/or organizations knowing that models and services can be maintained or extended by a competent work force because code is encapsulated (which minimizes dependencies) and small enough (perhaps less than a few thousand lines of code at most) to quickly understand how it works. Design documentation of complex mathematics, algorithms, data structures, or numerical methods must be provided for models and services. A software engineer who is unfamiliar with a software model or service must understand its conceptual design before effectively making modifications.

The most challenging issue is how to make the transition to a common framework…

2.5 Transitioning to Multicore

With the multicore computing revolution introducing new parallel processing requirements and software methodologies, computationally intensive legacy software systems are faced with several difficult choices. Four common decisions are described below:

Cautious: Because of technical risks and high-cost factors, many existing programs will likely choose to ignore the multicore revolution. This approach cannot be universally taken because it will eventually result in critical organizations falling behind in information processing capabilities. The cautious approach ultimately leads to obsolescence, which is dangerous in competitive environments where paradigm shifts in technology cause new industries to rise while others fall. Maintaining superiority in information technology can also determine winners and losers in modern warfare.

Refactor: In order to take advantage of multicore computing, this approach attempts to refactor all of the computationally intensive algorithms that make up a legacy software system. However, what usually happens in real systems is that the software, which was never designed for parallel computation in the first place, does not achieve its performance goals. In addition, the resulting software becomes significantly more difficult to maintain and far less robust. The refactoring approach ultimately leads to wasted efforts that are costly and fail to accomplish their performance goals.

Rewrite: To fully achieve both performance and interoperability goals, the obvious approach is to rewrite software from scratch using advanced parallel and distributed computing technology that organizes simulated systems containing models and services into composites that are distributed across processors in a smart way. However, software program managers rarely have the funding resources to rewrite their applications from scratch. Nevertheless, new startup programs must embrace multicore computing requirements from inception. A common standard multicore framework is essential for new startup programs to be cost effective.

Incremental: The best strategy for legacy systems to embrace multicore computing is to make software changes through a series of incremental steps. Several technically feasible approaches will be described later in the roadmap. As previously discussed (see Figure 1), the roadmap will only be successful if an incremental approach can be used to safely transition legacy software to the common framework without throwing everything away and starting over. Existing capabilities must be maintained while making the transition to the new way of producing scalable software.

Figure 6: Multicore computing without a common framework significantly increases life cycle costs, while supporting multicore computing within a common framework can actually lower life cycle costs if done correctly.

As illustrated in Figure 6, current life cycle costs of software programs are already too high. Adding multicore computing requirements without proper tools and frameworks causes life cycle costs to skyrocket. However, having a common framework that promotes parallel and distributed composability of composites, models, and services not only provides scalable performance, but also lowers life cycle costs for several important reasons. Figure 7 lists some of the life cycle cost benefits of using a common framework.

Figure 7: Cost benefits (in addition to scalable performance) of using a common framework.

First, the framework provides advanced modeling constructs that are elegant and significantly simplify complex event interactions. Some of these modeling constructs will be discussed later in the roadmap. The framework therefore requires fewer lines of code to be written by modelers, which increases productivity and lowers development costs. Second, the framework promotes interoperability and reuse of models and services through type-checked and consistency-checked composability mechanisms. Third, common Verification, Validation, and Accreditation (VV&A) techniques (and actual V&V tests) can be used to perform both standalone-unit and integrated-system tests. This alone will have a significant effect on unifying VV&A processes and methodologies, resulting in significant cost savings and more trustworthy systems. Fourth, a common framework amortizes training costs for model developers and users because all programs use the same core technology. Fifth, a common framework allows the workforce to be more flexible and adaptive because models and services tend to be small, understandable, and fully encapsulated from other models and services. Life cycle maintenance costs will therefore be significantly reduced as a smaller and more versatile skilled workforce supports the common collection of models and services.

3.0 Parallel and Distributed Computing

Figure 8: Global Information Grid (GIG) illustrating complex interacting systems operating as an interconnected parallel and distributed-computing enterprise.

3.1 Parallel and Distributed Computing Concepts

The previous section described modern enterprise computing systems (see Figure 8) composed of systems distributed across geographically dispersed computing resources down to actual software applications executing within a thread on a single core within a multicore CPU chip. It is important to clarify the meaning of distributed computing in contrast to parallel computing.

Normally parallel computing is performed on supercomputers designed with shared memory and/or high-speed communications, while distributed computing is performed on geographically dispersed machines that utilize standard network protocols to exchange messages. However, the distributed computing methodology can be executed on a single machine (perhaps with multiple cores) using high-speed communications, while the parallel computing methodology can be performed by multiple machines that are distributed across a network and communicate using standard slow-speed protocols. Conceptual understanding of parallel and distributed computing transcends the actual hardware and networks used to support processing.

A software application operating in parallel can be thought of as a single program that happens to use parallel processing techniques combined with communication services to execute more rapidly. This can be achieved by executing on a single machine with multiple processors or on several machines connected through a network. In an ideal world, a parallel application executing on N compute-nodes would run N times faster than if it ran on a single node. To achieve maximum performance, nodes must have minimal interprocess communication overheads, evenly distribute their data and computations, minimize data storage and processing redundancy, and remove all bottlenecks that affect the application’s computational critical path. If one node prematurely exits or crashes for any reason, then the whole program must gracefully shut down processing, no matter how many nodes and/or machines are used. Parallel processing is used to speed up an application by having it run concurrently on multiple processors. A correctly parallelized application runs in a repeatable manner that is independent of the number of processors used to speed it up.

High Performance Computing (HPC) communication services must be highly optimized and offer a variety of both synchronous and asynchronous interprocess communication services that extend well beyond the basic message passing capabilities that are typically found in network-based protocols. It would therefore be inappropriate to define HPC communication services solely through standard message-passing constructs because shared memory and/or hardware-based solutions can offer orders of magnitude improved performance when used appropriately. HPC communication services must be properly abstracted to allow for these kinds of optimizations on any kind of computing and communications hardware architecture. Thus, it is important to define the communication interfaces in such a way that they can be implemented and optimized on all HPC and/or network-based architectures. The Simulation Interoperability Standards Organization (SISO) is in the initial stages of standardizing the OpenUTF High Speed Communications (OpenUTF-HSC) services.

In contrast, distributed software systems are very different from parallel systems in that they are usually composed of multiple loosely coupled applications. The number of applications at any point in time can continually change as systems dynamically join and resign from the enterprise. Client/server, Object Request Broker (ORB) with distributed objects, sockets, and/or web-based architectures that work with XML data formats are often used to support communications between loosely coupled distributed systems. Network-oriented enterprise systems are designed to be robust and not crash just because one application unexpectedly shuts down. Of course, this goal can be difficult to achieve, especially when a critical system, such as a server, goes down. Nevertheless, distributed enterprise systems seek flexibility, robustness, and fault tolerance, while parallel systems seek scalability and computational performance.

In the modeling & simulation world, distributed systems typically use Live, Virtual, and Constructive (LVC) communication standards such as the High Level Architecture (HLA), Distributed Interactive Simulation (DIS) protocols, and/or the Test and Training Enabling Architecture (TENA) for message passing services. These systems, in turn, rely on standard network protocols such as TCP, UDP, and IP Multicast. In some cases, especially when heartbeat entity-state update message-passing techniques are used, best-effort communication services are utilized. Due to the synchronization needs of parallel discrete-event applications, a common M&S framework must provide robust/reliable services for supporting scalable parallel execution. Communication on HPC platforms, and especially through shared memory, is always reliable. Knowing that message passing is both scalable and reliable eliminates heartbeat update strategies that can waste precious bandwidth and computational resources. An optimized common framework utilizes very different publish and subscribe data distribution techniques that efficiently coordinate their operations in logical time, which can be synchronized to the wall clock.

Of course, LVC services are actually built upon lower-level and more fundamental communication services that ultimately rely on standard network protocols. But, due to the dynamic join/resign nature of enterprises or federations, distributed systems never use scalable synchronous communication calls. This is a critical distinction between parallel and distributed systems. Because of these important factors, parallel systems require substantially different and much more optimized communication services to achieve scalability goals.

3.2 Parallel Computing Basics

As previously discussed, the goal of parallel computing is to use multiple processors in an efficient way to speed up an application as it executes. Parallel speedup is defined as the execution time of the application optimized for operation on a single node divided by the execution time on N processors. Normally speedup ranges between 1 (meaning no speedup is achieved) and N (meaning that perfect speedup is achieved). However, applications can actually execute slower in parallel than sequentially when communication overheads dominate computations, redundant processing is required, synchronizations between processors are overly excessive, critical path dependencies between objects are serial in nature, etc.

On the other extreme, applications can actually achieve speedups greater than N. While the phenomenon, known as superlinear speedup, is rare, it has been observed in large-scale M&S applications that use less than optimal data structures for managing pending events. For example, event management based on linked list data structures scales as n, where n is the number of pending events in the system. When parallelizing the system, each node now has roughly n/N events to manage. So, event management overheads decrease on each node. In addition, each node manages its events concurrently with the other nodes, which reduces the overall event management overhead by another factor of 1/N. Event management overheads therefore scale as n/N² on N nodes. Speedup, as previously defined, is computed by dividing the time on one node (n) by the time on N nodes (n/N²), resulting in a speedup factor of N², which is certainly larger than N (indicating superlinear speedup) when running in parallel.

Superlinear speedup can also be attributed to memory cache effects because parallel applications distribute their memory consumption across nodes. When running in parallel, each node consumes approximately 1/N the amount of memory used when running sequentially on a single processor. This memory reduction per node gives the caching system a better chance of maintaining frequently accessed memory in local cache, resulting in faster execution. Other interesting superlinear speedup issues can arise involving information theory concepts related to time ordering/merging of outputs generated by concurrently processed events, but these topics are beyond the scope of this roadmap.

Computational efficiency is defined as speedup divided by N. An efficiency value of 1 indicates perfect speedup, while 1/N indicates no speedup when running in parallel. Values less than 1/N actually indicate slow down (i.e., the application runs slower in parallel than on one node).

Perhaps a more interesting concept is relative efficiency, which measures speedup going from m nodes to n nodes. Normally, relative speedup measurements focus on parallel performance when doubling the number of nodes.

For example, an application may achieve a relative efficiency of 0.9 every time it doubles the number of nodes. So, the actual efficiency on 32 nodes would be (0.9)⁵ = 0.59049. Because the relative efficiency does not decrease as the number of nodes increases from 1 to 32, this system linearly scales even though it does not achieve perfect processing efficiency. The concept of scalability is not the same as speedup, even though they are related.
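
The short sketch below works through these definitions, including the compounded relative efficiency (0.9)⁵ from the example above; the run times used are hypothetical.

    // Worked sketch of the definitions above: speedup = T1/TN, efficiency =
    // speedup/N, and the actual efficiency implied by a constant relative
    // efficiency of 0.9 per doubling of nodes. Run times are hypothetical.
    #include <cmath>
    #include <cstdio>

    int main() {
      double t1 = 100.0;  // run time on one node (seconds, assumed)
      double tn = 4.0;    // run time on n nodes (assumed)
      int n = 32;

      double speedup = t1 / tn;         // 25.00
      double efficiency = speedup / n;  // ~0.78

      // Five doublings take 1 node to 32; compounding 0.9 per doubling:
      double compounded = std::pow(0.9, 5.0);  // 0.59049, as in the text

      std::printf("speedup = %.2f, efficiency = %.2f, compounded = %.5f\n",
                  speedup, efficiency, compounded);
    }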

In addition, scalability extends beyond just run time execution. Scalability considers how dependent variables such as run time, memory consumption, message throughput, and message bandwidth consumed behave as a function of independent variables such as the number of nodes, number of entities in the simulation, time resolution, and/or model fidelity. A truly scalable system must consider all of these combined factors together.

3.3 Processing Loop

Parallel software systems usually perform the following steps: Startup, Initialize, Compute/Communicate loops, Store Results, and Terminate. Figure 9 illustrates a generic application executing concurrently in parallel.

Figure 9: Typical processing loop in a parallel application.

During the Startup phase, the program is replicated into separate processes or multiple threads for concurrent processing. The Startup function also creates any shared memory segments that might be required by the application. Node (or thread) Identifiers are numbered from zero to the total number of nodes (or threads) minus one. Each node is able to obtain its NodeId (or ThreadId) to assist in the decomposition of data during Initialize and then later while processing the data. After Startup, all nodes or threads independently execute the same program concurrently. Speedup is obtained as each node or thread simultaneously performs the Compute operation on its local data. Computed results that must be shared are then synchronously delivered to other nodes or threads during the Communicate function before starting the next cycle. The new cycle does not begin until all nodes have completed their communication. After all cycles have been computed, which for most simulation applications would indicate the simulation has reached its end time (the simulation might also break out of its Process Cycle when a termination condition is determined), the StoreResults function is called to output results to the screen or to a file. Note that each node could also produce outputs during each of its Process Cycle iterations. Finally, when the application is done, each node or thread calls the Terminate function to tear down the communications infrastructure that was set up during Initialize. Each node or thread then exits the program indicating successful completion of the application.
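
A skeletal version of this loop is sketched below. The Hsc* functions are illustrative stand-ins for a communications framework, stubbed here for a single node so the sketch compiles and runs; they are not the actual framework interface.

    // Skeletal version of the Figure 9 loop (Startup, Initialize,
    // Compute/Communicate cycles, StoreResults, Terminate).
    #include <cstdio>
    #include <vector>

    static void HscStartup(int* nodeId, int* numNodes) {  // Startup (stub)
      *nodeId = 0;    // a real framework forks processes/threads and creates
      *numNodes = 1;  // shared memory segments here
    }
    static void HscBarrier() {}    // synchronize all nodes (stub)
    static void HscTerminate() {}  // tear down communications (stub)

    int main() {
      int nodeId = 0, numNodes = 1;
      HscStartup(&nodeId, &numNodes);  // replicate the program across nodes

      std::vector<double> localData;   // Initialize: decompose data by NodeId
      for (int i = nodeId; i < 1000; i += numNodes) localData.push_back(i);

      double localResult = 0.0;
      for (int cycle = 0; cycle < 10; ++cycle) {
        for (double x : localData) localResult += x;  // Compute on local data
        HscBarrier();  // Communicate shared results before the next cycle
      }

      std::printf("node %d result %.0f\n", nodeId, localResult);  // StoreResults
      HscTerminate();  // Terminate, then exit
    }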

One file per node or thread can be created while storing results, or the results can be combined into a single stream before writing a consolidated/merged output file. Merging time-tagged data created from events takes additional work, which offsets superlinear speedup effects that are present, even when highly optimized event management data structures are used. Oftentimes, inputting and outputting results can be so data intensive that it becomes the overwhelming bottleneck of a parallel application. Special Parallel I/O capabilities are provided on some supercomputers to reduce I/O bottlenecks. Parallel I/O techniques include providing multiple paths between processors and disks and/or special striped file formats that allow multiple readers and writers to simultaneously perform file operations in parallel.

It is essential for parallel processing algorithms to be robust on each processor because the data is distributed in a scalable manner across nodes. Data is generally not duplicated across processors because that would introduce memory scalability problems. Therefore, if a node goes down for any reason, whether it is due to a hardware failure or a software crash, there is no practical way for the application to recover because the data assigned to that node is forever lost. In addition, most synchronous communication services either hang or crash while waiting for messages that will never be received from nodes that have unexpectedly stopped functioning due to software or hardware failures.

If a single node goes down unexpectedly, the operating system or supporting framework must cleanly terminate the entire parallel application. Otherwise, users would have to manually bring down each node and then clean up all of the temporary shared memory segments that were created during execution. This can become tedious, especially if the number of nodes is large. To the user, a parallel application should be thought of as a single indivisible program. If any part of it fails, then the entire parallel application itself fails as a unit. Again, this is in contrast to the distributed computing methodology where loosely interacting systems dynamically join and resign from the enterprise.

The most practical way to offer fault tolerance for parallel applications with long execution times is to implement periodic checkpointing that can be used to restart the application from a checkpoint file if a node ever crashes. Of course, if a software bug caused the crash, then there is a good chance that after a restart, the program will again crash at the same point in its execution. Checkpoints can take substantial time to complete and they often produce very large files. Therefore, checkpoints should be done fairly infrequently. Normally, each node performs its own checkpoint, which means that the application must restart using the exact same number of processors. This can also be a problem when a CPU is no longer functional. Big and Little Endian data representations must also be considered when creating checkpoints in heterogeneous computing environments. An application running on a Little Endian machine will probably not restart correctly on a Big Endian machine unless special care is taken to handle both formats in a universal manner.
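
One conventional remedy for the Endian issue is to serialize checkpoint values in an explicit byte order, as in the following minimal sketch; this is an illustration of the general technique, not the framework’s actual persistence mechanism.

    // Serialize checkpoint values most-significant byte first so the same
    // file restores correctly on Big and Little Endian hosts alike.
    #include <cstdint>
    #include <cstdio>

    void WriteU64BigEndian(std::FILE* f, std::uint64_t v) {
      unsigned char buf[8];
      for (int i = 0; i < 8; ++i)  // emit bytes independent of host order
        buf[i] = static_cast<unsigned char>(v >> (56 - 8 * i));
      std::fwrite(buf, 1, 8, f);
    }

    std::uint64_t ReadU64BigEndian(std::FILE* f) {
      unsigned char buf[8] = {0};
      if (std::fread(buf, 1, 8, f) != 8) return 0;  // minimal error handling
      std::uint64_t v = 0;
      for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];
      return v;
    }

    int main() {
      std::FILE* f = std::fopen("checkpoint.bin", "wb+");
      if (!f) return 1;
      WriteU64BigEndian(f, 123456789ULL);  // e.g., a saved event counter
      std::rewind(f);
      std::printf("restored: %llu\n",
                  static_cast<unsigned long long>(ReadU64BigEndian(f)));
      std::fclose(f);
    }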

Most parallel applications are not fault tolerant because of the complexity and excessive overheads in supporting checkpoints and restarts. Yet, fault tolerance can be an important consideration for applications that operate for long periods of time on very large supercomputers, where hardware failures due to scale occur more frequently. Desktops with multicore chips are much more reliable because all of the processors reside within a single chip or motherboard and are not as prone to such frequent hardware failures.

Algorithms are often difficult to completely parallelize and almost always contain at least some sequential portions of code. This introduces Amdahl’s law, which predicts the performance limits for mixed sequential and parallel algorithms. In Amdahl’s law, P is the normalized portion of the computation that can be efficiently parallelized, (1 − P) represents the normalized serial portion of the computation, and N is the number of processing nodes. The maximum speedup is 1/((1 − P) + P/N). In the limit as N gets very large, Amdahl’s law indicates the limit of parallelism for an application containing some commonly executed serial code: the maximum speedup approaches 1/(1 − P). Despite being obvious, Amdahl’s law is important. It highlights the need for eliminating redundant or serial computations within a parallel application.
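
The brief sketch below evaluates Amdahl’s law for an assumed parallelizable fraction of P = 0.95, showing how quickly the speedup saturates toward its 1/(1 − P) = 20 limit.

    // Evaluating Amdahl's law, speedup(N) = 1/((1-P) + P/N), for an assumed
    // parallelizable fraction P = 0.95.
    #include <cstdio>

    int main() {
      double p = 0.95;  // assumed parallelizable fraction
      int nodes[] = {8, 64, 1024};
      for (int n : nodes)
        std::printf("N = %4d: speedup = %.2f\n", n, 1.0 / ((1.0 - p) + p / n));
      std::printf("limit as N -> infinity: %.1f\n", 1.0 / (1.0 - p));  // 20.0
    }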

3.4 Decomposition Strategies

The most common way to parallelize an application is to distribute its unique data across nodes and then have each node independently execute the same set of algorithms on its local data. This data-parallelism approach can also be used to distribute objects representing simulated systems or entities across nodes.
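
A minimal sketch of this data-parallel mapping follows; the modulo ownership rule is illustrative, as real frameworks typically use more sophisticated load-balancing strategies.

    // Data-parallel decomposition: every node runs the same program but
    // instantiates only the entities it owns.
    #include <cstdio>
    #include <vector>

    struct Entity { int id; /* models, state, ... */ };

    std::vector<Entity> BuildLocalEntities(int nodeId, int numNodes, int total) {
      std::vector<Entity> local;
      for (int id = 0; id < total; ++id)
        if (id % numNodes == nodeId)  // simple ownership rule
          local.push_back({id});
      return local;
    }

    int main() {
      for (int node = 0; node < 4; ++node)  // emulate four nodes in one process
        std::printf("node %d owns %zu of 1000 entities\n", node,
                    BuildLocalEntities(node, 4, 1000).size());
    }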

Another popular approach to parallelism is to distribute functions to nodes. Parallelism is achieved through data-flow and pipelining techniques. Functional decomposition can sometimes be easier to implement when refactoring existing legacy applications. In some ways, the concept of functional decomposition is similar to the HLA approach of federating an arbitrary collection of interacting simulations. In such federated systems, each simulation provides a unique function within the overall federation.

Applications can sometimes be thought of as an interconnected collection of black box functions or services that receive input data and produce output results that are consumed by other functions. A topological computational graph of these functions is constructed to define the data flow. Each function can then be assigned to a processor to achieve parallelism. A scheduler is required to activate functions when more than one function is assigned to a processor. The scheduler can be the host operating system that wakes up individual processes when their function is requested, or the scheduler can be implemented within a more elaborate execution framework such as the OpenUTF.
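
The sketch below captures the essence of functional decomposition as a data-flow pipeline executed stage by stage; in a real deployment each stage would be assigned to its own processor. Stage names and the toy arithmetic are invented for illustration.

    // Functional decomposition as a data-flow pipeline of black-box functions.
    #include <functional>
    #include <iostream>
    #include <vector>

    struct Stage {
      const char* name;
      std::function<double(double)> fn;  // consumes the upstream output
    };

    int main() {
      // A linear pipeline: sensor -> tracker -> display. A general system
      // would form an arbitrary directed graph of such functions.
      std::vector<Stage> pipeline = {
          {"sensor", [](double x) { return x + 1.0; }},
          {"tracker", [](double x) { return x * 2.0; }},
          {"display", [](double x) { return x; }},
      };
      double data = 0.0;
      for (const auto& s : pipeline) {  // data flows from stage to stage
        data = s.fn(data);
        std::cout << s.name << " produced " << data << "\n";
      }
      // Throughput is limited by the slowest stage -- the bottleneck effect
      // discussed below for time-managed HLA federations.
    }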

Several problems arise with functional decomposition that can severely limit scalable performance. First, applications typically have a limited number of functions. This restricts the amount of parallelism within the application. This may not be a problem for small-scale parallel systems. However, this limitation can become a serious factor as the number of processors in a multicore computing environment becomes large compared to the number of functions. Second, it is difficult to achieve significant parallelism because computationally intensive functions almost always emerge within the application as crippling performance bottlenecks. Parallel applications can only execute as fast as the slowest function, which usually dominates performance. This is regularly experienced within time-managed HLA federations. The slowest simulation within the federation paces the advancement of time.

A common second step in systems using functional decomposition is to use data-parallel techniques within each function to extract further parallelism. Even with this approach, it is often hard to determine how many processors each function requires to keep workloads balanced across the entire application.

When running in a network environment, it is not unusual for the underlying communications infrastructure to introduce such large overheads to the distributed system that the application runs slower than it would if everything ran on a single processor. This is frequently observed in time-managed HLA federations, especially when small values of lookahead are required to maintain necessary fidelity.

In most cases, data parallelism is the preferred approach for supporting scalable multicore processing. In terms of applications, this means that evenly decomposing composites across processors will likely perform and scale better than functionally distributing simulations within federations, and then trying to parallelize each simulation individually. Distributing composites across processors within a common high performance computing parallel M&S framework is a paradigm shift away from the HLA functional decomposition approach. Yet, this will be required to maximize performance, while promoting composable interoperability and reuse between models and services. Models and services can be further parallelized using additional techniques that will be described later in the roadmap.

Data parallelism distributes entities across processors and does not easily support fault tolerance or dynamic processor reconfiguration during execution. Checkpoints, while challenging to fully implement, are generally required to provide some fault tolerance. Memory management technologies such as object persistence can be used to automate checkpoint and restart. The challenge is to support this transparently with minimal overheads.

3.5 High Speed Communications

Figure 10: Common parallel-computing inter-node communication topologies. Hypercubes provide an extremely rich and scalable interconnection topology by adding a communications link to each node whenever doubling the number of nodes. 2D meshes provide four neighbor links per node, while 3D meshes provide six links per node. Meshes are often connected into toroids that connect edges. Toroids reduce the worst-case number of hops to send a message from one node to another node by a factor of two.

The most common way to support parallel processing on high-performance computers is to use a combination of specially designed asynchronous and synchronous message passing services to coordinate processing between multicore nodes that are interconnected in various ways (see Figure 10).

One benefit of this approach is that both shared memory and network message passing services can be unified into a common integrated communications framework that allows one or more multicore computers to operate as a single parallel computing platform. Achieving parallelism through message passing services alone can support mixed multicore and network computing configurations, but at the expense of slower communications between nodes on different machines. Sometimes, these communication overheads are negligible, especially when message passing between machines is infrequent. Careful decomposition of data or objects to nodes becomes important when operating in mixed shared-memory and network configurations. Communication-intensive applications achieve their highest performance when operating within a single multicore computer.

Several high-performance communications libraries for parallel computing have been developed over the past twenty years. Historically, the two most commonly used libraries have been (a) the Parallel Virtual Machine (PVM) and (b) the more recent Message Passing Interface (MPI). These communications libraries were primarily designed for synchronous high-performance physics-based applications such as computational fluid dynamics. They were not optimized for applications requiring asynchronous discrete-event scheduling capabilities.

The OpenUTF-HSC specifies a more optimized High Speed Communications library than PVM or MPI. It was specifically designed to support parallel discrete-event simulation communication requirements, which are significantly more challenging than the requirements of typical physics-based parallel applications. A Native and an MPI-based reference implementation of this communications library are freely available with source code for non-commercial use. Both implementations operate on all versions of Windows, Mac OS X, Linux, and virtually all other versions of UNIX. Mixed operating systems, data representation types, and architecture configurations are also supported.

Eight categories of services are necessary to support general-purpose parallel computing applications, and specifically parallel discrete-event simulation.

1. Startup and shutdown services
2. Miscellaneous services
3. Global synchronizations
4. Global reductions
5. Synchronized data distribution
6. Asynchronous message passing
7. Coordinated message passing
8. Object Request Broker (ORB) services

Software developers using the OpenUTF to build models and services do not directly invoke these high-speed communication services. However, they are critical in supporting the underlying event scheduling and time synchronization services provided by the framework to models and services during execution. Each category of service is briefly described below.

3.5.1 Startup and Shutdown

Applications automatically begin parallel execution when they invoke the startup interface. The startup service automatically launches all of the necessary processes to support parallel processing. Shared memory segments are transparently created within the communications infrastructure to coordinate inter-process communications services. Network connections, when executing in a network or cluster environment, are also established during the startup process. When applications have completed their work, the shared memory segments are deleted by the framework through its terminate interface.
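
As a rough illustration, the pattern described above might look like the following sketch. The function names (HscStartup, HscTerminate, RunApplication) are hypothetical placeholders, not the actual OpenUTF-HSC interface.

    void HscStartup(int* argc, char*** argv);  // hypothetical startup interface
    void RunApplication();                     // hypothetical application work
    void HscTerminate();                       // hypothetical terminate interface

    int main(int argc, char** argv) {
        HscStartup(&argc, &argv);  // launch processes, create shared memory
                                   // segments, establish network connections
        RunApplication();          // application work proceeds in parallel
        HscTerminate();            // delete shared memory segments and shut down
        return 0;
    }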

3.5.2 Miscellaneous Services

The framework provides a number of miscellaneous services that allow applications to determine their node identifier and the total number of nodes executing in the application. Node identifiers range from zero to the number of nodes minus one. Several other services are provided that allow applications to tune the size of shared memory segments and how they are partitioned into mailboxes for supporting asynchronous and coordinated message delivery. Default values that do not exceed standard system settings are provided by the framework and offer excellent performance for most applications.

3.5.3 Global Synchronizations

The simplest communication operations are the basic synchronization services. Two flavors are supported: (1) barrier synchronization provides blocking synchronization between all processes, and (2) fuzzy barrier synchronization allows applications to enter a fuzzy barrier without blocking. When all processes have entered the fuzzy barrier, the exit criterion is satisfied. Fuzzy barriers are extremely useful for applications that wish to indicate synchronization readiness but would like to continue processing while waiting for the rest of the nodes to synchronize. This service is not directly provided by PVM or by classic MPI (MPI-3.0 introduced a comparable nonblocking barrier), yet it is essential for optimistic event processing algorithms such as Time Warp and others.
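
For readers more familiar with MPI, the following sketch approximates the fuzzy-barrier pattern using MPI-3's nonblocking barrier; it is only an analogy, not the OpenUTF-HSC interface (DoUsefulWork is a hypothetical placeholder).

    #include <mpi.h>

    void DoUsefulWork();  // hypothetical: e.g., continue processing pending events

    void FuzzyBarrier() {
        MPI_Request request;
        int done = 0;
        MPI_Ibarrier(MPI_COMM_WORLD, &request);  // signal readiness without blocking
        while (!done) {
            DoUsefulWork();                      // keep working while others catch up
            MPI_Test(&request, &done, MPI_STATUS_IGNORE);  // exit criterion met?
        }
    }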

3.5.4 Global Reductions

Global reductions provide a second category of synchronized blocking operations. For example, these operations allow applications to synchronously obtain the sum, maximum, and minimum of integer and double precision values that are passed as arguments by each node. A set of commonly used operations for integers and doubles are directly provided. A more generic global reduction capability is also provided that allows users to define their own reduction operations on larger classes or data structures. This is useful when collecting multiple run-time performance values from each node during global time updates.
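
The sketch below shows the flavor of such a reduction, using MPI as a stand-in for the framework's own service; GetNextEventTime is borrowed from the pseudocode later in this roadmap.

    #include <mpi.h>

    double GetNextEventTime();  // earliest unprocessed event time on this node

    double GlobalMinNextEventTime() {
        double localTime = GetNextEventTime();
        double globalTime;
        MPI_Allreduce(&localTime, &globalTime, 1,
                      MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);  // blocking reduction
        return globalTime;  // every node receives the identical global minimum
    }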

3.5.5 Synchronized Data Distribution

The framework supports five synchronized blocking services that are commonly used to distribute data between nodes: (1) the scatter operation allows a node to partition and distribute a block of data to receiving nodes, (2) the gather operation is the inverse of scatter and allows a node to collect partitioned data from the other nodes, (3) the form vector operation concatenates buffers from each node to construct a replicated vector on all of the nodes, (4) the form matrix operation allows each node to input a vector of data that is used to construct a replicated matrix on all of the nodes, and (5) the broadcast service broadcasts a buffer from one node to all of the other nodes. These services require the synchronous participation of all the nodes in the application. The synchronized data distribution services are commonly used to parallelize computationally intensive matrix operations across processors.
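
As one concrete analogy, the form matrix operation corresponds closely to MPI's all-gather collective, sketched below; the HSC provides its own interface for this service.

    #include <mpi.h>
    #include <vector>

    // Each node contributes one row; every node receives the replicated matrix.
    std::vector<double> FormMatrix(const std::vector<double>& myRow, int numNodes) {
        std::vector<double> matrix(myRow.size() * numNodes);
        MPI_Allgather(myRow.data(), (int)myRow.size(), MPI_DOUBLE,
                      matrix.data(), (int)myRow.size(), MPI_DOUBLE,
                      MPI_COMM_WORLD);
        return matrix;  // row i holds the vector contributed by node i
    }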

3.5.6 Asynchronous Message Passing

Asynchronous message passing is supported with up to 256 user-defined queued message types. Three types of services are implemented: (1) unicast that sends a message from one node to another node, (2) destination-based multicast that sends a message from one node to multiple nodes identified in a bitfield, where each bit represents a destination node, and (3) broadcast that sends a message from one node to all other nodes. Asynchronous messages do not require pre-allocating buffers. The framework creates the memory for messages received. As an optional optimization, users can provide a function that performs memory allocation for incoming messages. This can significantly improve performance and is automated by the OpenUTF using optimized object factories for implementing user-defined event messages. The object factory manages the reuse of fixed-length messages to bypass dynamic memory overheads.
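
The sketch below conveys the free-list idea behind such an object factory. It is a simplified reading of the description above, not the OpenUTF implementation.

    #include <cstddef>
    #include <vector>

    class MessageFactory {  // recycles fixed-length message buffers
    public:
        explicit MessageFactory(std::size_t messageSize) : size(messageSize) {}
        void* Allocate() {
            if (!freeList.empty()) {          // reuse a recycled buffer,
                void* msg = freeList.back();  // bypassing dynamic memory overheads
                freeList.pop_back();
                return msg;
            }
            return ::operator new(size);      // grow only when the free list is empty
        }
        void Free(void* msg) { freeList.push_back(msg); }  // recycle, never delete
        ~MessageFactory() {
            for (void* msg : freeList) ::operator delete(msg);
        }
    private:
        std::size_t size;
        std::vector<void*> freeList;
    };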


3.5.7 Coordinated Message Passing

Coordinated message passing allows applications to exchange multiple messages between nodes in a synchronized manner. Each node may have an arbitrary set of messages that are sent to other nodes. After each node finishes sending all of its coordinated messages, each node then begins receiving incoming messages that were sent by other nodes to itself. Eventually, a NULL message pointer is returned to each node when it can be guaranteed that there are no more pending coordinated messages to be received. All nodes must participate in this synchronized operation. Coordinated message passing is used in parallel simulation to ensure that each node has received all event-scheduling messages before starting its next event-processing cycle. The act of flushing all messages before starting the next event-processing cycle provides needed flow control and execution stability.
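
The shape of one coordinated exchange might look like the following sketch; all function and type names are hypothetical stand-ins for the services described above.

    #include <vector>

    struct Message { int destNode; /* flat header and payload */ };

    // Hypothetical HSC-style primitives (illustrative declarations only):
    void SendCoordinatedMessage(int node, Message* msg);
    void DoneSending();                    // every node must reach this point
    Message* ReceiveCoordinatedMessage();  // returns NULL when fully drained
    void HandleEventMessage(Message* msg);

    void CoordinatedExchange(const std::vector<Message*>& outgoing) {
        for (Message* msg : outgoing)      // arbitrary set of messages per node
            SendCoordinatedMessage(msg->destNode, msg);
        DoneSending();
        while (Message* incoming = ReceiveCoordinatedMessage())
            HandleEventMessage(incoming);  // flush before the next event cycle
    }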

3.5.8 Object Request Broker Services

An Object Request Broker (ORB) capability allows applications to asynchronously invoke methods on designated objects residing on other nodes. The ORB supports user-defined methods with up to twenty strongly type-checked arguments, along with variable-length data that may be appended to the message. The ORB services are layered on top of the asynchronous message passing services. By abstracting away message passing interfaces, the ORB provides a very user-friendly high-performance communication mechanism. The ORB automatically uses an object factory that manages fixed-length messages in free lists. This significantly reduces dynamic memory allocation and deallocation overheads that otherwise could cripple performance. The ORB also provides a high-level communication framework for supporting embedded grid computing techniques that can be used to parallelize computationally intensive events.

3.6 Common Performance Issues

As previously discussed, parallel applications can suffer from several issues that hinder performance. These issues along with several solutions are discussed below.

3.6.1 Workload Imbalance

The most important consideration in achieving scalable performance is to distribute the workload evenly across processors. As shown in Figure 9, parallel applications normally execute in computational cycles that compute, communicate, and synchronize at cycle boundaries. Ignoring communication and synchronization overheads, each processing cycle takes as long as its slowest node. Workload imbalances cause nodes with less work to sit idle while they wait for the slowest node to finish before communicating and synchronizing. This produces a built-in inefficiency that is sometimes hard to address.

Entity-based simulations suffer from scenario-dependent unpredictable workloads. For example, if a parallel simulation distributes its sensor models across processors, it is very likely that some sensors produce large numbers of detections, while other sensors produce few or none. Their workloads depend on whether detectable entities enter their field of view, which is highly scenario dependent. Because intelligent entities can move in an unpredictable manner, it is extremely hard to perfectly distribute workloads across processors.

Figure 11 illustrates the effect of distributing objects with stochastic workloads on up to 65,536 processors. Notice how the speedup efficiency decreases steadily as the number of nodes increases. This is because the computational workloads exhibit larger relative fluctuations as fewer objects are placed on each node.

Figure 11: Load-balancing effect on speedup when evenly distributing objects with stochastic workloads to processing nodes. Each object is assumed to have a mean workload of 10.0 seconds with a sigma of 3.0 seconds. The mean speedup is calculated by dividing the total processing time of the full set of objects by the processing time of the node with the largest workload. Hundreds of Monte Carlo replications were analyzed and averaged to determine the mean speedup results.

Figure 11 explains the load-balancing behavior of systems with stochastic workloads using the central limit theorem, which states that the standard deviation of a statistical mean decreases as the square root of the sample size n, as n gets large. Once the application has just one object per node, stochastic load imbalances become even more pronounced. Some nodes have light computations (perhaps two or three sigma less than the mean) and others have heavy computations (perhaps two or three sigma more than the mean). The slowest node in the application dictates the run time.
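
A small Monte Carlo sketch makes this effect concrete. It assumes a fixed pool of 65,536 objects with normally distributed workloads (mean 10.0 seconds, sigma 3.0 seconds, matching Figure 11) dealt round-robin to the nodes; the exact experimental setup behind Figure 11 may differ.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    double MeanSpeedup(int numObjects, int numNodes, int reps) {
        std::mt19937 rng(12345);
        std::normal_distribution<double> workload(10.0, 3.0);
        double sum = 0.0;
        for (int r = 0; r < reps; ++r) {
            std::vector<double> nodeWork(numNodes, 0.0);
            double total = 0.0;
            for (int i = 0; i < numObjects; ++i) {
                double w = std::max(0.0, workload(rng));  // clamp negative draws
                nodeWork[i % numNodes] += w;              // round-robin distribution
                total += w;
            }
            // The slowest node dictates the run time of each cycle.
            double slowest = *std::max_element(nodeWork.begin(), nodeWork.end());
            sum += total / slowest;
        }
        return sum / reps;
    }

    int main() {
        for (int nodes : {16, 256, 4096, 65536})
            std::printf("%6d nodes: mean speedup %.0f\n",
                        nodes, MeanSpeedup(65536, nodes, 100));
        return 0;
    }

As the number of nodes grows toward one object per node, the computed speedup falls well below the node count, illustrating the efficiency loss described above.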

The lesson learned here is that for a fixed problem size, speedup efficiency almost always begins to dip as the number of nodes increases, due to the stochastic nature of computational workloads. Possible solutions are (a) to statically predict workloads and then assign objects to nodes appropriately during initialization, or (b) to allow the framework to migrate objects across nodes during execution with the goal of evening out the overall computational workload of each node.

One approach that should normally be avoided is to first run the simulation, analyze workloads, and then run it again with a better object decomposition strategy that produces a more even workload. This cheat can support impressive demos, but it is not really effective in supporting real work. One exception is when running large numbers of Monte Carlo replications, where each replication essentially runs the same scenario and produces similar results, but with some statistical or parameterized variance. In such cases, it may make sense to analyze workloads of prior runs to optimize performance in subsequent runs.

Another solution to address workload imbalances is to dynamically migrate work from heavily loaded nodes to those with lighter workloads. To support this, a common framework must provide two capabilities:

1. The ability to pack complex composites that contain hierarchically configured models and services into buffers that can be sent as messages to other nodes, where the full composite is reassembled. To support interactions between composites, the other nodes in the simulation must also be informed that the composite has moved.

2. Efficient algorithms that determine which composites should migrate to which other nodes. A good dynamic load balancing algorithm (a) minimizes how often objects are migrated to avoid thrashing, (b) keeps workloads even, and (c) minimizes message overheads between interacting composites.

The first capability can be addressed through the use of embedded persistence, which can also be used to support automatic checkpointing for fault tolerance. Persistence essentially tracks all memory allocations and pointers. With this information, persistence is able to pack complex objects that are composed of other objects, etc., into a buffer that can be sent to another processor, where the memory segments and pointers are reassembled.

The second capability requires the framework to continually maintain statistics of event processing. In conservative simulations, such statistics might include computational time for each node to complete its processing cycle. Optimistic simulations might use the number of events processed beyond global virtual time to determine which nodes are ahead of which other nodes.

3.6.2 Communication Overheads

Performance breaks down in the processing cycle shown in Figure 9 whenever communication overheads, relative to the computations performed in each cycle, become excessive. It is therefore important to decompose computations across processors in such a manner that minimizes communications between nodes. To achieve this, some parallel applications distribute their data and computations to nodes according to specific hardware topologies, such as those shown in Figure 10. This way, the number of communication hops is minimized.

2D and 3D physics-based problems, such as those addressed by computational fluid dynamics algorithms, often represent physical space as grid cells that are distributed across processors. Hardware mesh topologies provide a good fit for these types of problems. A good strategy to minimize communication between nodes is to group cells into larger connected regions on each node that minimize perimeters or surface areas (see Figure 12). This strategy is intuitive because the amount of information exchanged between cells is usually proportional to their boundaries with neighboring cells. Distributing cells in a manner that maximally localizes neighbors to the same node minimizes inter-node communications.

Figure 12: Decomposition of 256 2D grid cells to 16 processors, indicated by color. Assuming that the grids are connected in a toroid (i.e., each edge connects to its opposite edge), the left decomposition strategy has a perimeter of 16 grid-cell edges per node, while the right strategy has a perimeter of 34 grid-cell edges per node. The left strategy is preferred because the data exchanged across grid-cell boundaries results in less communication between nodes.

3.6.3 Redundant Sequential Code

As previously discussed, Amdahl's law explains how redundant code limits parallelism. This is important because developers are sometimes tempted to introduce redundant computations that on the surface appear to achieve massive parallelism, yet actual performance is worse. This effect can be subtle. For example, consider the computational patterns shown in Figure 13.


Figure 13: Transaction between A, B, C, and D showing different decomposition strategies (grouped by boxes) and critical paths.

Assume that computation A takes 1 second, B takes 100 seconds, each of the eight C computations takes 1 second, and D takes 1 second. Further, assume that all communication overheads (i.e., the data flow represented by arrows) are negligible. The computations are grouped into individual units within rectangles as shown in Case I and Case II. Computations within each rectangle are potentially performed on separate processors. Case I takes 1 + 100 + 8 + 1 = 110 seconds to complete a transaction on one node. Assuming that computations are pipelined so that A, B, C, and D operate concurrently in repeated cycles, a transaction on 10 nodes completes every 101 seconds because the AB box is the bottleneck. Speedup on 10 nodes is computed as 110/101 = 1.089, which is very poor.

Case II naively parallelizes B redundantly by replicating it within each C computation. The bottleneck now shifts to the BC boxes, which operate in parallel but take 101 seconds to complete. Because of the redundancy, the total sequential computation skyrockets to 810 seconds. The speedup of this configuration is 810/101 = 8.020. Yet the run time on ten processors is still 101 seconds. Case II falsely appears to have better speedup, but it actually runs no faster than Case I. In fact, Case II is much worse because it requires ten processors just to match the run time of Case I.

3.6.4 Critical Path Dependencies

The analysis performed in the previous section (see Figure 13) assumed that Case I and Case II were pipelined to achieve concurrency. Suppose such transactions are not pipelined in a manner where the pipeline is continually being fed. As an example, suppose A represents a decision made by a simulated entity that causes computation B to change its motion. Suppose C represents computations by sensors that are affected by the changed motion. Sensors send their detections to a tracker represented by D.

Consider first Case I with the assumption that all critical path computations start at time 0. The critical path analysis shows that A begins its computation at time 0 and finishes at time 1. B then begins its computation at time 1 and finishes its computation at time 101. The results from B are fed to each of the eight C computations, where they complete in parallel at time 102. The results are fed to D at time 102, which performs its computation once all of the results are received. D finishes its computation at time 103. The maximum speedup of this transaction is therefore 110/103 = 1.068, which is slightly worse than when only load balancing was considered in the previous section.

Now consider Case II. A again begins its computation at time 0 and produces results to each of the BC boxes at time 1. BC results are computed in parallel and produce results to D at time 102. D then performs its computations once it has received all of its results and completes again at time 103. Again, notice that the run time (assuming ten processors are used) is the same as Case I, even though it falsely boasts better speedup.

Critical path dependencies can be insidious. The worst interaction scenario is when composites frequently schedule events for large numbers of other composites with zero lookahead. A good example of this is when a composite updates a published attribute that is distributed to many subscribing composites. This fanning out of events puts the source composite on the critical path of the other composites. If all composites frequently perform such zero-lookahead fan-out event transactions, the result is that composites are all on each other’s critical path, resulting in extremely poor parallel performance. A much more optimized publish and subscribe data distribution service that minimizes fan-out event patterns and maximizes lookahead is required to remove these critical path dependencies.

3.6.5 Processing Efficiency

How processing is performed in abstract time reveals another subtle performance issue. Many applications distribute large data sets evenly across nodes and then update their complete system states in time-stepped loops. Near-perfect speedups are often reported. The problem is that not everything may need to be updated at every time step. A discrete-event approach would let time vary and only update the system’s state as needed.

As an example, imagine an n-body physics-based simulation, with 10,000 particles interacting via their mutual gravity, running on 100 nodes. Each node would have 100 particles. Assuming that each particle takes the same amount of processing time to update its state and that communications overhead between nodes is negligible, a time-stepped simulation could in theory achieve perfect speedup. Yet, there are a number of very serious flaws to the time-stepped approach.

For illustration purposes, imagine that during updates every particle computed its mutual force with every other particle. If the magnitude of the change in force defined an integration-resolution threshold, then particles that are far apart would need to interact less often than particles that are close together. The mass of the particles would also affect interaction times because gravitational forces are proportional to mass.

A discrete-event approach would allow the time scales between particle pairs to fluctuate, while maintaining the specified resolution. Nearby particles that need to interact very often could do so at tighter time scales, while particles farther apart would update at larger time scales. This discrete-event computational strategy would result in (a) faster processing because fewer computations are required, (b) more accurate results because of support for very tight time scales for nearby particles, and (c) better characterization of the overall accuracy of the produced results.

Figure 14: Time scales for updating evenly distributed interacting particles in a box. Update times are determined by holding the change in position due to the force constant. The black arrow shows what a typical time increment might be when using a global time step, which tries to balance resolution with execution speed. Everything to the right of the global time step represents updates that are performed more often than necessary. Everything to the left represents a loss in resolution. The better approach is to use discrete-event scheduling to only update state variables when necessary.

However, something else occurs when moving from time-stepped approaches that update everything each time step to the discrete-event approach that only updates state when needed. If conservative time windows are used (this will be discussed in more detail in the next section), then each processing cycle computes events for its set of particles. For example, one node may update 28 particles, another 73, and another only 12. The run time during that cycle is dictated by the slowest node, which means that most of the nodes are idle waiting to synchronize before proceeding to the next cycle. In the next cycle, the node that previously updated 73 particles now updates only 4, the node that previously updated 28 now updates 42, and the node that previously updated 12 now updates 91. The workloads on each node change, but the result is the same: the slowest node dictates the processing time per cycle. Optimistic event processing techniques solve this problem by having each node continually update its particles without stopping and waiting to synchronize.

Instead of holding the time step constant and letting resolutions fluctuate, optimistic event stepping holds the resolution constant and lets time fluctuate. Problems with dynamically changing workloads across time windows are solved by optimistically processing events. Global time stepping over-computes, loses resolution, and produces results that are not easily proven correct, while optimistic event-stepping runs faster, is more accurate, and produces results that are easily characterized by specified resolutions. Obviously, optimistic event stepping is the preferred way to perform computations of such systems.

3.6.6 Collective Performance Effects

Several factors such as computational load balancing, communication and synchronization overheads, computational redundancy, critical path dependencies, and event-scheduling strategies affect parallel performance. Effective composition of models and services to composites that are distributed across processors must consider all of these factors in its overall decomposition strategy.

4.0 Processing in Abstract Time

To provide a universal execution framework, the common framework must support multiple execution modes that operate in abstract time, which could be a purely logical concept, the representation of simulated time, or tied in some manner to the wall clock. Models and services must never directly access the actual system clock, but rather, they must get their time services from the framework.

The framework supports the ability for computations (events) to be scheduled and processed in abstract time, which could have no bearing whatsoever on meaningful notions of time. Abstract time simply provides a necessary mechanism for sequencing the event computations of interacting systems, composites, models, and services across processors in parallel and distributed computing environments.

4.1 Logical Time and Scaled Real Time

The first critical distinction is whether abstract time should be tied in any manner to the wall clock. Simulations that run as fast as possible operate in logical time with no ties whatsoever to the wall clock. Logical time simulations are used to support a wide variety of analysis applications where results are needed as quickly as possible. Simulations that interact with humans or hardware systems must rely on the wall clock to pace internal and external event processing in abstract time. The wall clock can be scaled to support faster or slower execution in a smooth manner. For example, a scale factor of two would mean that the simulation paces itself to advance twice as fast as the wall clock. Real time simulations imply that the scale factor is one.
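
A minimal sketch of scaled pacing follows, assuming simulated time is measured in seconds from the start of the run.

    #include <chrono>
    #include <thread>

    // Block until the wall clock catches up with simulated time simTime,
    // advancing scale times faster than real time (scale = 1.0 is real time).
    void PaceToWallClock(double simTime,
                         std::chrono::steady_clock::time_point runStart,
                         double scale) {
        std::chrono::duration<double> offset(simTime / scale);
        std::this_thread::sleep_until(
            runStart + std::chrono::duration_cast<
                std::chrono::steady_clock::duration>(offset));
    }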

Further distinctions between hard and soft real time can be made as tolerances are established for how far events are allowed to deviate from the wall clock. Hard real-time implies that events begin their computations at scheduled times and complete their computation before a specified deadline. Soft real-time systems tend to combine multiple events in time windows that are synchronized to the wall clock at the beginning of the time window. To maintain a real-time rate, all event computations must complete before the end of the time window. Soft real-time systems can either lag when their time windows cannot keep pace with the wall clock, or they adjust their processing to maintain pace with the wall clock. Hard real-time systems, especially involving embedded software residing in cyber-physical hardware systems, are less forgiving and can produce unstable performance when hard deadlines are not met.

Hard real-time systems, such as embedded cyber-physical systems, must guarantee predictable execution timelines. This usually requires a dedicated real-time operating system to eliminate external interruptions from uncontrolled processes. To support hard real-time capabilities, the common framework could itself operate as a real-time operating system, or it could simply execute as a layer on top of a real-time operating system.

Soft real-time systems, such as interactive simulations, typically deal with the wall clock in a more forgiving manner. They sometimes perform computations in an as-best-as-possible mode where updates are prioritized and performed as time permits. A natural requirement of real-time interactive simulation is to minimize the delay from when humans receive output from the simulation to when their interactions can be injected back into the simulation. Reducing network lag is critical for expert video gamers, where the timing of every frame counts.

4.2 Time Stepped and Discrete Event

The two most commonly used mechanisms for processing in abstract time are time stepped (also known as time-driven or lock step) and discrete event (also known as event-stepped). Discrete-event frameworks provide the most general mechanism for controlling time and can support both discrete-event and time-stepped models. Time-stepped frameworks are more limited in capability and only support time-stepped models. Both approaches can support as-fast-as-possible and scaled-time execution. Real-time mechanisms that rely solely on the wall clock for synchronization cannot support logical time modes effectively. So, the most general solution that supports all types of processing in abstract time is discrete event.

Parallel and distributed time-stepped simulations update the state of their simulated system over abstract time using the fixed time-increment loop shown below.

Startup();
Initialize();
GlobalTime = 0.0;
while (GlobalTime < EndTime) {
    Update();
    Communicate();
    GlobalTime += Step;
}
Terminate();

The Update function updates the entire system on each node. Notice that the models themselves do not control time. They are passive in that sense, which affords independence from the framework. The nodes then communicate whatever information is mutually needed and then they globally synchronize before universally stepping in time and repeating the cycle as time advances. The simulation terminates once the end time is reached.

Discrete event simulations operate in a different manner than time-stepped simulations. Time is not updated in fixed steps, but instead can take on any continuous value. An event can be thought of as a method invoked on an object that is processed at its scheduled time. Events can represent a physical phenomenon occurring in the system. Events (or pseudo events) can also be used to provide bookkeeping or data distribution services that have nothing to do with physical activities of the system.

Events can be scheduled in dynamically computed time steps that continually update state in a manner that keeps errors within tolerance. By allowing time scales to fluctuate, errors are better understood and managed, while providing better accuracy with fewer computations. Fixed time steps require smaller time scales than necessary to maintain reasonable errors, resulting in excessive computations. Errors due to using a fixed time resolution are also harder to assess.

New events can be scheduled to occur at the current event time or at any time in the future. Events are generally not permitted to schedule events in their past because that would violate fundamental principles of causality; violating causality can create time vortexes that result in unstable behavior. When executing a simulation in parallel, the scope of an event is limited to a single composite. Data exchanges and interactions between composites, which are potentially located on different nodes, must occur solely through events.

4.3 Sequential Simulation

Sequential simulation is straightforward to implement. As shown in Figure 15, an event queue manages pending events in time-sorted order, where the event at the top of the queue has the earliest time. The event is popped off the event queue, processed, and then newly scheduled events are inserted back into the event queue. The processing cycle continues until either the end time of the simulation is reached, or a termination condition is detected.

Figure 15: Event processing in sequential discrete-event simulation.

An event queue data structure is required to perform three operations: (1) insertion of an event, (2) removal of an already inserted event, and (3) removal of the next event. Binary trees with appropriate balancing mechanisms provide a very good general-purpose data structure for implementing event queues.
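
A minimal sketch of such an event queue follows, using a self-balancing tree (std::multiset, typically a red-black tree) keyed on event time; the Event type is illustrative.

    #include <set>

    struct Event { double time; /* recipient composite, parameters, etc. */ };

    struct EarlierTime {
        bool operator()(const Event* a, const Event* b) const {
            return a->time < b->time;
        }
    };

    class EventQueue {
    public:
        void Insert(Event* e) { events.insert(e); }  // (1) insert an event
        void Remove(Event* e) {                      // (2) remove a given event
            auto range = events.equal_range(e);      // events with equivalent times
            for (auto it = range.first; it != range.second; ++it)
                if (*it == e) { events.erase(it); return; }
        }
        Event* PopNext() {                           // (3) remove the earliest event
            if (events.empty()) return nullptr;
            Event* e = *events.begin();
            events.erase(events.begin());
            return e;
        }
    private:
        std::multiset<Event*, EarlierTime> events;
    };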

Pseudo code for processing in a sequential discrete event simulation is shown below. The virtual Process method is assumed to perform some computation on a model or service residing within a composite. How this turns into invoking a method on an object is considerably more complex. The OpenUTF does this through the use of easy-to-use macros that essentially provide the capability through code generation.

Initialize();
while (GetNextEventTime() < EndTime) {
    Event = RemoveNextEvent();
    Event->Process();
    Event->ScheduleGeneratedEvents();
}

4.4 Lookahead-based Conservative Simulation

The primary challenge of parallel discrete-event simulation is for each node to continually determine whether it is safe to process its next event, or whether it is possible for an event with an earlier time tag to arrive as a message from another node that should be processed ahead of it. In the most general case, every composite (composed of models and services) can potentially interact with every other composite without timing constraints. Conservative simulations serialize in this general case because the current event might schedule another event for another composite that has a smaller time tag than the time tag of what the composite thinks is its next event. The only event in the simulation that is actually safe to process without risk of receiving a straggler event is the one with the smallest time tag.

Figure 16: Challenge of parallel discrete-event simulation.

The serialization problem is addressed by introducing global lookahead, L, which essentially places a minimum time-increment requirement on events scheduled for other composites that might reside on other nodes. Formally speaking, an event processed at time T must never schedule a new event with a time tag less than T + L for composites residing on other nodes. Because the framework arbitrarily distributes composites to nodes, this dictates that inter-composite events never interact more tightly in time than L. No constraints are placed on events scheduled by a composite for itself. This becomes critical for models and services that represent complex systems within a composite. Conservative simulations do not constrain the internal behaviors of models and services.

Events are safely processed in cycles of time duration L with full assurance that they will be processed in their correct ascending time order without straggler messages arriving with earlier time tags scheduled by composites residing on other nodes.

One important optimization is to determine the minimum time of the next unprocessed event on each node after a cycle completes. By doing this, the simulation can jump to this time at the start of the next cycle to avoid creeping, which is often seen in time-stepped simulations that have small time steps.

Figure 17: Event processing in conservative lookahead-based parallel simulations. Note that technically speaking, an event could be associated with multiple composites on the node. However, because the distribution of composites to nodes can vary, events are normally restricted to operate on a single composite.

Figure 17 illustrates event processing in conservative simulations. The event with the earliest time is removed from the event queue, processed, and then newly generated events are either sent as messages to remote nodes or inserted into the local event queue if the event is for a composite residing on the same node. All messages sent during each cycle must be received before the next cycle begins. Pseudo code for processing events in a lookahead-based conservative simulation is shown below.

Startup();
Initialize();
GlobalTime = 0.0;
while (GetNextEventTime() < EndTime) {
    while (GetNextEventTime() < GlobalTime + Lookahead) {
        if (EndTime < GetNextEventTime()) {
            break;
        }
        Event = RemoveNextEvent();
        Event->Process();
        Event->ScheduleGeneratedEvents();
    }
    Communicate();
    GlobalTime = GlobalMin(GetNextEventTime());
}
Terminate();

4.5 Topology-based Parallel Conservative

Topology-based synchronization can be useful for simulations involving systems with rigid interaction topologies having sparse interconnections. Examples might include (a) emulating the logical operation of complex circuit boards, (b) simulating communications of local and/or wide area networks with fixed interconnections, (c) modeling production-lines in manufacturing factories, (d) representing arbitrary queuing systems, and (e) analyzing the processing and communication performance of parallel computers.

Conservative topology-based event processing is not very useful for military applications because physical battlefield entities are generally able to interact with all other battlefield entities. In other words, the topology for military simulation applications is not a sparse graph; it is potentially a fully interconnected graph. Awkward attempts have been made to apply topology-based synchronization to military simulation problems using geographical grids to manage battlefield entities that migrate from grid to grid as they move about the battlefield. However, such attempts significantly complicate and limit the ability to represent truly complex composites that are hierarchically composed of multiple models and services. Previous work has produced only toy demonstrations that cannot easily be extended to efficiently and intuitively model the actual behaviors of complex battlefield entities. In addition, parallel computational performance using this approach has been only marginal.

Topology-based synchronization uses the basic notion that each composite has a set of input queues with incoming events that were scheduled by other composites. In addition, models and services within composites are always allowed to schedule events for themselves.

An example of a topology-based simulation involving seven composites is shown in Figure 18. In this example, A is a pure event-scheduling source, while G is a pure event-receiving sink. The other composites schedule and receive events from other composites and themselves as defined by the graph topology. Feedback is allowed and can complicate things considerably, especially if lookahead is allowed to be zero.

Figure 18: An example of a topology-based parallel and distributed simulation.

A First-In-First-Out (FIFO) scheduling constraint is normally required to preserve monotonically increasing logical times for events scheduled by one composite for another. In other words, the events scheduled by A for B must always increase in time. It would therefore be incorrect for A at time 10 to schedule an event for B at time 20, and then for A at time 12 to schedule a subsequent event for B at time 18. Such event crossing between composites is not permitted. The FIFO constraint also assumes that the communication framework preserves message ordering for delivery between nodes, which imposes an additional requirement on the message-passing framework.

The input and output queues for D in Figure 18 are shown in Figure 19. Again, note that all composites are permitted to schedule events for themselves. In addition to the topology and FIFO constraints, different lookahead values can be applied on each link to help composites advance time for other composites. This can provide an important improvement over global-lookahead conservative approaches: rather than imposing a global lookahead value defined as the minimum lookahead across all links, each link employs its own lookahead value. This can significantly improve performance, but potentially at the expense of larger synchronization overheads. Feedback loops in the topology can further introduce significant synchronization overheads.

Figure 19: FIFO event management in topology-based parallel and distributed simulation.

In the simplest case, a composite knows it is safe to process its next event if all of its input queues have at least one pending event. The composite simply processes the event in its set of input queues with the earliest time tag. As events are processed and new events are scheduled, other composite input queues may become populated, enabling further event processing. However, it is possible for situations to arise where no composites are able to process their next event because one or more of their input queues are empty. Deadlocks can occur if this situation is not handled correctly.

Various solutions have been proposed using special NULL time-synchronization messages to update time between composites. A simple NULL message approach transmits the new time of a composite plus corresponding lookahead values to its links whenever an event is processed, or whenever a NULL message is received. The NULL message indicates the earliest time a receiving composite’s input queue might receive an event from the first composite. If the earliest actual event for a composite receiving the NULL message is earlier than the minimum time of all empty input queues, then it is safe for the composite to process its next event. While this approach provides a more aggressive strategy for updating time, it introduces more communication overhead and still does not completely solve the deadlock problem when there are feedback loops.

NULL messages can generate other NULL messages when a composite is not able to process an event but is able to update the earliest time it might schedule an event for other composites. In this manner, NULL messages propagate between composites to break deadlocks and eventually allow composites to process their next event. Further deadlock or time-creeping complexities arise when lookahead values are zero or very small. Feedback loops can also pose interesting problems. NULL synchronization message explosions have been observed, making the topology-based approach very inefficient and potentially computationally unstable. Often, these instabilities are more significant than those found in the rollback-based optimistic approaches described in the next section. Conservative topology-based synchronization techniques can actually carry more risk and uncertainty than optimistic techniques.

A further problem with topology-based synchronization is that composites often process events at widely varying times. Event processing is not required to be performed in strict time order within a node; instead, it is based on whichever composites are able to conservatively process their next events. Composites can therefore drift widely out of sync with each other in simulated time. This situation makes it difficult to exchange information with external systems such as graphical displays, hardware in the loop, and humans undergoing training.

One of the side benefits of topology-based synchronization is that FIFO event queues can be maintained as simple linked lists, which eliminates the need to sort events within each input queue. For composite-based simulations, this performance benefit, unfortunately, does not outweigh all of the other problems. These savings can also be insignificant when event computations are large compared to event management overheads, which is often the case for realistic military simulation applications. Because of its many problems, topology-based synchronization does not generally provide a viable solution for military simulation applications.

4.6 Parallel Optimistic

The most general approach for supporting unconstrained parallel execution of interacting composites that are composed of models and services is to use optimistic event-processing techniques. Events are processed in a speculative manner, where they can be rolled back if necessary. Optimistic simulations do not require the use of global lookahead, fixed interaction topologies, FIFO event-scheduling constraints, or rigorous message delivery ordering. Optimistic simulations are also ideal for supporting both real-time and logical time execution modes, which operate transparently in abstract time.

Optimistic simulations use rollback-based event-processing techniques to process events in an aggressive manner with the hope that no straggler events will be received. When stragglers are received, rollback mechanisms kick in to undo all of the events that were processed out of order. As previously discussed, each event is associated with a single composite. This means that when a straggler event is received, only those events associated with the composite are rolled back, not every event on the node. Rollbacks can produce secondary rollbacks, but with proper flow control, this too can be limited to provide a stable execution.

From a pure framework perspective, events do three things: (1) modify state variables associated with models and services within a composite, (2) schedule new events, and (3) produce output such as files or messages to the external world. So, undoing an event simply means (a) restoring modified state variables to their previous values before the event was processed, (b) retracting or discarding scheduled events, and (c) never releasing output to the external world until the event is committed.

4.6.1 Rollbackable State Management

Figure 20: Event processing in optimistic parallel and distributed simulations.

As shown in Figure 20, each event manages rollback items in a rollback queue that is associated with the event. Each rollbackable operation performed during event processing generates a rollback item that is inserted into the rollback queue. Through virtual functions, rollback items know how to undo their operation. Rolling back the state changes made by an event is accomplished by having each rollback item undo its operation. Rollback items are processed in reverse order. Simple state changes to primitive data types, such as integers and doubles, are easily automated using rollbackable data types with operator overloading. Other rollbackable data types and operations can be more complex. Examples include container classes, allocation and deallocation of memory, error handling, external output, etc.
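
The sketch below conveys the operator-overloading idea for a simple rollbackable integer; the actual framework is considerably more general (per-event rollback queues, commit and uncommit handling, and many more data types).

    #include <memory>
    #include <vector>

    struct RollbackItem {             // knows how to undo itself via virtual calls
        virtual ~RollbackItem() = default;
        virtual void Rollback() = 0;      // undo the recorded operation
        virtual void Rollforward() = 0;   // redo it if the event is rolled forward
    };

    // Stand-in for the rollback queue of the event currently being processed.
    std::vector<std::unique_ptr<RollbackItem>> currentEventRollbackQueue;

    class RollbackableInt {
    public:
        RollbackableInt& operator=(int newValue) {  // assignment records an item
            auto item = std::make_unique<Assignment>();
            item->target = this;
            item->oldValue = value;
            item->newValue = newValue;
            currentEventRollbackQueue.push_back(std::move(item));
            value = newValue;
            return *this;
        }
        operator int() const { return value; }      // reads need no bookkeeping
    private:
        struct Assignment : RollbackItem {
            RollbackableInt* target;
            int oldValue, newValue;
            void Rollback() override    { target->value = oldValue; }
            void Rollforward() override { target->value = newValue; }
        };
        int value = 0;
    };

Rolling back an event then amounts to walking its rollback queue in reverse order and invoking Rollback on each item.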

Rollback items can also be rolled forward. Rollback and rollforward are much like the undo and redo functions found in commercial applications such as Microsoft Office. A good rollback framework minimizes the overheads of (a) allocating and deallocating rollback items and (b) supporting the following four generic operations:

1. Commit
2. Rollback
3. Rollforward
4. Uncommit

The Commit operation is called when it is determined that the event will never be rolled back and is therefore committed. External messages are released to the outside world when events are committed. The Rollback operation is called when the event is rolled back. This operation returns state to its original condition before the event was processed. The Rollforward operation is called when the event was rolled back, but now rolled forward. A common use of rolling forward is when performing visualization or analysis based on External Modeling Framework (EMF) log files. This will be discussed in more detail later in the roadmap. Finally, the Uncommit operation is called when an event is rolled back and will not be rolled forward.

4.6.2 Events and Rollbacks

Events can generate new events for composites that are located on remote nodes or on the same local node. The most common case is for a composite to schedule an event for itself, which means that the new event is local.

As shown in Figure 21, an event is scheduled by first creating a message that contains all of the information associated with the event. Among other things, this includes the event time, the object handle of the recipient composite, event type, and all parameters/data associated with the event. The message is not to be confused with the event object itself, but rather defines everything about the event when it is scheduled. Event messages are flat regions of memory that can be sent as messages across nodes.

Figure 21: Events and messages.

Event parameters, stored as part of the event message, are passed to the recipient object method(s) when the event is eventually processed. Variable-length data can be appended to messages to support arbitrary data that might be passed with the event. Messages for composites located on different nodes must be flat. They are not packed or serialized by the framework, but rather are sent as is through the communications framework.

User-defined (or macro-generated) event objects are created from messages and inserted into event queues that are managed by each composite. A scheduler residing on each node manages its set of composites. The virtual Process event method calls the appropriate method(s) on objects within models and services that are associated with the composite. Methods can be fixed or dynamically registered to handle user-defined events.

Often, the event itself does not know what method gets called on what object when it is processed. Methods can be dynamically registered and/or unregistered to handle events in a very generic way. This means that the scheduler of the event does not necessarily know what specific method is actually invoked by the recipient when the event is processed. This promotes the decoupling of composites, models, and services.

If an event is rolled back, it must retract any events it generated. One technique used to provide flow control is to only release messages of scheduled events when the event is committed. This risk-free way to manage event messages simplifies rollbacks. A rollback simply discards all unsent messages that were scheduled by the event.

A more aggressive strategy that does not hold back event messages retracts those messages with antimessages. It is possible for an antimessage to reach its destination after its associated event has already been processed. This situation induces further rollbacks and antimessages. Cascading rollbacks can explode into an avalanche if flow control is not provided. Another flow control technique uses the time tag of the event's scheduler to determine when to release its messages. If the scheduler's time tag is less than the Global Virtual Time (GVT), then the event knows that it will not be retracted by an antimessage. This technique limits cascading effects to one level.

GVT is defined as the minimum time tag of all unprocessed events, event messages (in transit or unsent), and antimessages in transit. By causality, GVT represents the global time across all nodes at which events can be committed. In theory, GVT continually advances while events are processed as the simulation executes. However, this information is not easily provided to the nodes. In practice, GVT is periodically updated in time cycles (see Figure 9), with all nodes determining GVT in a synchronized manner. Events are then committed after GVT is updated. This (a) causes some messages that were held back to be released, (b) commits rollback items associated with committed events, (c) deletes committed events, their associated messages, and antimessage stubs, and (d) releases external output, which can be print statements, data written to files, or messages sent to external systems.

4.6.3 External Output and Rollbacks

External output generated by time-tagged events is only released when events are committed. Each node could simply release its output independently when its events are committed, but committed output from multiple nodes to a common terminal window, for example, would appear jumbled. An important issue when running in parallel is how to merge output from multiple nodes into a common stream that is meaningful. When running sequentially, this happens automatically because events are processed in time order. A more challenging problem is having an external system receive committed messages from arbitrary nodes while knowing when it has received all of its messages and how to advance time. Advanced techniques for handling these kinds of problems will be discussed later in the roadmap.

4.7 Representation of Time for Repeatability

Logical-time simulations must have a unique way to represent event times to ensure repeatability in a manner that is independent of the number of nodes or wall clock. This is accomplished by representing time as a class with a physical time value and a series of tiebreakers. Through operator overloading, the class behaves as a double, which means that it can be used natively in numerical computations. The (a) equal-to and (b) less-than-or-equal-to operators are also overloaded to support time comparisons required by event queue data structures. The fields of this class are shown in Table 1.

Table 1: Abstract time representation.

Name           Type     Description
Time           double   Physical time of the event
Priority 1     int      First user-settable priority
Priority 2     int      Second user-settable priority
Counter        int      Counter from composite
Unique Id      int      Unique Id of composite
Super Counter  double   Unique non-rollbackable counter
Aux Counter    int      Auxiliary counter
Aux Unique Id  int      Auxiliary Unique Id
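
A condensed sketch of such a class follows, with fields named after Table 1. Comparing the fields lexicographically guarantees a strict global ordering; the actual class includes additional operator-overloading machinery to behave as a double in numerical code.

    #include <tuple>

    struct AbstractTime {
        double time = 0.0;          // physical time of the event
        int priority1 = 0;          // first user-settable priority
        int priority2 = 0;          // second user-settable priority
        int counter = 0;            // counter from the scheduling composite
        int uniqueId = 0;           // unique id of the scheduling composite
        double superCounter = 0.0;  // unique non-rollbackable counter
        int auxCounter = 0;         // auxiliary counter (transactions)
        int auxUniqueId = 0;        // auxiliary unique id (transactions)

        operator double() const { return time; }  // usable in numerical expressions

        bool operator==(const AbstractTime& rhs) const { return Key() == rhs.Key(); }
        bool operator<=(const AbstractTime& rhs) const { return Key() <= rhs.Key(); }

    private:
        std::tuple<double, int, int, int, int, double, int, int> Key() const {
            return std::make_tuple(time, priority1, priority2, counter, uniqueId,
                                   superCounter, auxCounter, auxUniqueId);
        }
    };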

4.7.1 User Settable Time Fields

The Time field is normally represented as seconds past midnight relative to some start date. A double provides approximately 15 decimal digits of precision, which means that even for scenarios lasting a year (31,536,000 seconds), physical time resolutions are still in the 0.1 microsecond range. If this is still not high enough resolution, the type can be changed to long double, which (depending on the compiler) adds 32 to 64 more bits of resolution, but at the expense of larger overheads.

Two user-settable integer priority fields follow the physical time. These priority fields can take on negative values, which can be useful in some circumstances. Note that the priority fields could also be used to represent fractional seconds if more elaborate ways to represent time are needed. For example, physical Time could represent integer seconds, while the first priority field could represent nanoseconds. Realistically though, representing physical time in just the Time field almost always provides more than adequate resolution, especially when considering the optional use of long doubles.

4.7.2 Framework-settable Time Fields

The framework automatically sets the next two time fields when events are scheduled to ensure that no event times are identical. Each composite is assigned a unique identifier by the framework when it is constructed. Composites also manage a rollbackable event counter that is initialized to zero. The counter is incremented each time an event is scheduled by a composite. This guarantees that every event scheduled by the same composite is unique. Because each composite also has a unique identifier, the combination of counter and unique identifier globally guarantees that each event has a unique time tag across the simulation.

The framework sets the counter and unique identifier time fields of each newly scheduled event after the source event is processed.

One final step is required for this to work properly. Consider what happens if an event, with a counter value of 100 in its time tag, schedules a zero-lookahead event without changing the priority fields. Suppose the counter of the composite processing the event is only 50. If this counter value were to be used, the new event would actually go backwards in time. This violates causality. The simple solution is to always advance the counter in the composite to the current event counter plus one when this situation occurs. So, the new counter in the composite jumps to 101, the counter in the new event is set to 101, and then the counter in the composite is again incremented to 102 in preparation for the next event it might schedule.


4.7.3 Super Counter

The super counter provides a way to uniquely tag each event instance, which is critical for antimessages to locate their associated events. Because messages can arrive late and/or out of order, uniquely identifying event instances when duplicates are possible is an essential bookkeeping requirement for the framework. Each node initializes its super counter to zero during startup and increments it whenever an event is scheduled.

4.7.4 Transactions and Auxiliary Time Fields

Auxiliary time provides a final capability that is essential for supporting zero-lookahead event transactions that initiate and complete before the time tag of any other event can intervene. A transaction is a sequence of events scheduled to accomplish a single goal. For example, updating a published attribute generates a transaction: the new attribute value is first scheduled for each node having at least one subscriber, and then an attribute reflection event is locally scheduled for each subscriber on the node. Auxiliary time allows all events within the transaction to occur as a unit; no other event has a time tag overlapping those of the events in the transaction.

This is accomplished by having all events in the transaction use the same counter and unique identifier while updating the auxiliary counter and auxiliary unique identifier. This essentially pushes the tie-breaking mechanism further down the chain while events in the transaction are processed with the same first five time values. When auxiliary time is used, the super counter is more than just an instance identifier; it jumps as necessary to maintain causality.

4.8 Five Dimensional Simulation

Modeling and simulation applications normally simulate four-dimensional worlds comprised of three spatial dimensions and a causally constrained time dimension. Five-dimensional simulation technology, known as HyperWarpSpeed, can be thought of as the simulation of parallel four-dimensional worlds that overlap, yet deviate from one another through branches that propagate within their own timelines (see Figure 22).

Branches can further branch and/or merge as the course of events unfolds over time. Enormous computational savings occur because branch-specific effects are localized, meaning that computations common to branches, no matter how complex, are shared. Applications such as missile defense have successfully analyzed the effects of redundant target-weapon assignments with stochastic outcomes using this technology. A virtually unlimited number of outcome permutations has been explored in less than 60 seconds running on a four-processor desktop.

Figure 22: Five dimensional simulation explores multiple decision points within a single execution by branching across independent timelines that intersect, interact, and potentially merge.

4.8.1 Five Dimensional Simulation Background

Analytic simulations are often independently executed many times with different parameter settings to determine the optimal configuration of complex systems. Monte Carlo simulations that model critical outcomes using probabilistic distributions and random number generation often require large numbers of end-to-end simulation replications to determine the statistical measures of effectiveness and/or performance of complex systems. Real-time estimation and prediction decision aid tools project the future through simulation using real-time data to calibrate the current state and require multiple-hypothesis what-if branch excursions to simultaneously explore the effectiveness of various battle-plan outcomes.

These different kinds of simulation applications are commonly implemented by running multiple executions of the same simulation using (1) different parameter settings, (2) different starting random number seeds, or (3) different critical decision options and/or battle plans. Simplistic parallel processing is achieved by farming the replications to multiple processors. Experiments must be carefully designed to explore the most important cases.

The simple brute-force Monte Carlo approach ignores the fact that each of the replicated simulation executions may redundantly repeat many of the same time-consuming calculations. Furthermore, outcomes are generally provided only at the end of the simulation runs. It is extremely challenging to consider, across all replications, the common threads of execution and the relative frequencies of certain decision points and stochastic branches.

Some improvements over the brute-force multiple-replication Monte Carlo approach have been explored over the past 15 years. One example replicates (i.e., forks) the simulation at branch points. This has the appeal of sharing computations up to the branch point, but it does not take full advantage of computation sharing afterwards, it is only supported on shared-memory computing architectures, and it can quickly degenerate into an unmanageable number of parallel executions.

Another approach clones objects at critical branch points. While this offers a higher degree of computation sharing, it still does not take full advantage of the potential computation sharing for complex objects with multiple concurrent event patterns. Furthermore, none of these approaches merge common branches to eliminate redundant event computations.

Five-dimensional simulation employs a more aggressive approach that automatically detects and reuses redundant event calculations that are shared between the multiple branches at the state attribute level. Unique calculations for a particular hypothesis branch are only performed when required. As a further optimization, this more efficient computational approach also takes advantage of parallel processing when multiple processors are available through the use of optimistic event-processing algorithms. Large scenarios that require considerable amounts of memory are supported through scalable parallel and distributed algorithms that distribute simulated objects across computing resources. Stochastic simulation applications that explore multiple branch outcomes are supported with minimal computational overhead on single-processor machines, multi-processor machines, and/or networks of such machines.

Five-dimensional simulations enable the exploration of multiple decision branches at key decision points within a single simulation execution while transparently leveraging parallel and distributed computing resources. Through aggressive computation sharing, five-dimensional simulation offers a significant improvement over Monte Carlo techniques. Preliminary performance results on Air Force, Army, and Missile Defense battle scenarios have demonstrated orders-of-magnitude speedups over traditional brute-force Monte Carlo approaches executed serially. This technology has significant ramifications for national defense.

4.8.2 HyperWarpSpeed Algorithm

As opposed to cloning entire simulations or objects at key decision points, HyperWarpSpeed offers a more flexible, robust, scalable, and efficient solution for exploring multiple decision paths. Coordination of how the alternative timelines interact is completely automated and transparent through unique multivariable state management structures, event splitting, and event merging techniques. These techniques are briefly discussed below.

4.8.2.1 Replication Sets

Fundamental to HyperWarpSpeed is the notion of replication numbers and replication sets. Monte Carlo simulations uniquely identify individual runs with a replication number. A replication number can be used to (1) set different starting seeds for pseudo random number generation, (2) specify a set of model performance parameters, (3) identify a particular set of decisions in a battle plan, and/or (4) identify enemy responses.

A replication set is simply a collection of replication numbers. HyperWarpSpeed represents replication sets as bit fields, where each bit set to one identifies a replication in the set. For example, a replication set with 32 potential replications containing replications {0, 1, 4, 9, 15, 29} would be stored as binary 0010 0000 0000 0000 1000 0010 0001 0011, where the rightmost bit represents replication 0, the next bit to the left represents replication 1, and so on. A 32-bit integer can store up to 32 replications; arrays of integers are used to manage larger replication sets.

To achieve scalable performance, high-speed bit-masking instructions automate most of the critical bookkeeping operations for replication sets. Both events and state variables are associated with replication sets. State variables manage an array of values where the array index is associated with the replication number. Events keep track of which replication numbers are represented by the event. At the start of the simulation before any branching occurs, (1) state variables store the same value for all possible replications and (2) all initially scheduled events represent the full set of replications. Branching, event splitting, event merging, and multireplication state variable assignments occur as events are processed.
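
The sketch below illustrates the bit-field representation for a single 32-replication word (hypothetical names; as noted above, arrays of integers extend this to larger sets).

    #include <cstdint>

    struct ReplicationSet {
        static const int kMaxReps = 32;  // one 32-bit word for brevity
        uint32_t bits;                   // bit r set => replication r in set

        ReplicationSet() : bits(0) {}

        void Add(int rep)            { bits |= 1u << rep; }
        bool Contains(int rep) const { return (bits >> rep) & 1u; }

        // Critical bookkeeping reduces to high-speed mask instructions:
        ReplicationSet And(const ReplicationSet &o) const {
            ReplicationSet r; r.bits = bits & o.bits; return r;
        }
        ReplicationSet AndNot(const ReplicationSet &o) const {
            ReplicationSet r; r.bits = bits & ~o.bits; return r;
        }
    };

    // The set {0, 1, 4, 9, 15, 29} from the text:
    //   0010 0000 0000 0000 1000 0010 0001 0011  ==  0x20008213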

4.8.2.2 Branching

Branching enables a HyperWarpSpeed model to simultaneously explore multiple decision points within a special event type known as a process. An event is a method on a C++ object; a process is an event that is able to pass time without exiting the method. Special modeling constructs are provided by the framework to allow processes to (1) wait specified periods of time before waking up, (2) wait for interrupts to occur, and/or (3) wait for necessary resources to become available before continuing processing. Processes can have multiple wait statements within a single method.


HyperWarpSpeed allows processes to branch using syntax similar to the standard switch-case statement. Each decision point is currently limited to twenty branches. At the decision point, model developers specify the probability of taking each branch. HyperWarpSpeed schedules new processes that continue execution with subdivided replication sets based on the specified probabilities for each branch label. Actual replication numbers are randomized to avoid correlation effects. The maximum number of branch points within an individual timeline depends on the replication set size. For example, a replication set of 128 could be evenly halved seven times (2^7 = 128), resulting in 128 unique permutations. Any additional branch statements encountered would select a single branch at random, as in a standard Monte Carlo run.
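
The following hypothetical helper suggests how a branch point might subdivide a replication set according to branch probabilities while randomizing replication assignment; it is a sketch of the idea, not the OpenUTF branch construct itself.

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    std::vector<ReplicationSet> SubdivideByProbability(
            const ReplicationSet &in, const std::vector<double> &prob) {
        std::vector<int> members;
        for (int r = 0; r < ReplicationSet::kMaxReps; ++r)
            if (in.Contains(r)) members.push_back(r);

        // Randomize assignment so replication numbers are not correlated
        // with branch labels across decision points.
        std::mt19937 rng(12345);  // illustrative fixed seed
        std::shuffle(members.begin(), members.end(), rng);

        std::vector<ReplicationSet> out(prob.size());
        std::size_t next = 0;
        for (std::size_t b = 0; b < prob.size(); ++b) {
            std::size_t take =
                static_cast<std::size_t>(prob[b] * members.size() + 0.5);
            for (std::size_t k = 0; k < take && next < members.size(); ++k)
                out[b].Add(members[next++]);
        }
        while (next < members.size())       // rounding remainder
            out.back().Add(members[next++]);
        return out;
    }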

4.8.2.3 Multireplication State Variables

Multireplication state variables manage an array of values, where each element in the array corresponds to a particular replication number. State variables are potentially accessed and/or modified by events with arbitrary replication sets. Multireplication data types are implemented as C++ objects representing integers, doubles, and Boolean values. With operator overloading, these data types behave as normal variables. Under the hood, they manage multiple values for different replication subsets.
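
A minimal sketch of a multireplication double appears below (hypothetical). Operator overloading and value sharing between replications holding equal values are elided; the Assign method stands in for the overloaded assignment operator, which in the real framework would obtain the current event's replication set implicitly.

    class MultiDouble {
        double value[ReplicationSet::kMaxReps];  // one value per replication
    public:
        explicit MultiDouble(double v) {
            for (int r = 0; r < ReplicationSet::kMaxReps; ++r) value[r] = v;
        }
        // Assignment applies only to the replications of the current event.
        void Assign(double v, const ReplicationSet &reps) {
            for (int r = 0; r < ReplicationSet::kMaxReps; ++r)
                if (reps.Contains(r)) value[r] = v;
        }
        double Get(int rep) const { return value[rep]; }
    };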

4.8.2.4 Event Splitting

Event splitting occurs when an event or process associated with a particular replication set accesses multireplication variables with different values. For example, a Boolean multireplication state variable data type might store the value true for replications {0, 1, 4, 9, 15, 29}, and false for all other replications. An event or process with replications {5, 9, 12, 15, 20} will likely produce different results depending on the actual Boolean value. So, HyperWarpSpeed automatically processes the event twice, once for replications {5, 12, 20} and then again for replications {9, 15}. During event splitting operations, HyperWarpSpeed automatically tracks replication set dependencies. Any event or process may access several multireplication state variables, causing arbitrarily complex event splitting.
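
Using the ReplicationSet sketch from Section 4.8.2.1, the split described above reduces to two bit-mask intersections:

    // Reproducing the event split from the text.
    void IllustrateSplit() {
        ReplicationSet evt, isTrue;
        int evtReps[]  = {5, 9, 12, 15, 20};     // event's replications
        int trueReps[] = {0, 1, 4, 9, 15, 29};   // variable is true here
        for (int i = 0; i < 5; ++i) evt.Add(evtReps[i]);
        for (int i = 0; i < 6; ++i) isTrue.Add(trueReps[i]);

        ReplicationSet runTrue  = evt.And(isTrue);     // {9, 15}
        ReplicationSet runFalse = evt.AndNot(isTrue);  // {5, 12, 20}
        // The framework would now process the event once per non-empty
        // subset, tracking the dependency on the Boolean variable.
        (void)runTrue; (void)runFalse;
    }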

4.8.2.5 Event Merging

Event merging enables identical timelines to recombine whenever possible. It is entirely possible for (1) identical events to be scheduled during event splitting, and (2) pending unprocessed events to be identical. HyperWarpSpeed merges identical events before they are processed to consolidate event scheduling and processing overheads. Without an automated merging capability, HyperWarpSpeed could branch and split events in an exponential, out-of-control manner. With merging, when decision points are independent of one another, the simulation continually diverges and re-merges between decision points.

5.0 Composable Modeling Techniques

The roadmap previously introduced the concept of composites that are hierarchically composed of models and services. Composite instances are assigned to processors where their internal models and services operate concurrently with those in other composites. Applications within the OpenUTF are composed of composites. Enterprises are composed of applications. The discussion now turns from the technology of parallel computing to the software programming methodology that facilitates interoperability and reuse between models and services.

5.1 Art of Modeling & Simulation

Figure 23 highlights the commonly used definitions for modeling and simulation. As will be shown, these definitions do not adequately describe the software development process that real engineers apply when creating models and simulations.

Figure 23: Modeling & Simulation.

According to Figure 23, a model is basically a static representation of a system, entity, phenomenon, or process, while simulation animates those static models with behaviors over time. In reality, model developers must address the time behavior of their system representations. What is most troubling in these definitions is that the terms modeling and simulation are often used interchangeably. This section makes a clear distinction between models (or services) and simulation.


With regard to the common framework, simulations are simply software programs that are composed of composites, which in turn are hierarchically composed of models and services. Simulations require a simulation engine (framework) to provide core event-scheduling and event-processing services, which allow models to advance in time. Models are simply software representations of systems and/or subsystems.

Without a common framework, models can be reusable only if (a) they are implemented as black boxes independent of the framework, (b) they are independent of other models, and (c) they have no reliance on global variables specific to other models or services. Most models in legacy simulations are tightly coupled to their framework, other models, and specialized services, and they commonly share global variables and data structures. In other words, legacy models are almost never reusable.

A better software approach is needed to simultaneously address (1) parallel and distributed scalable performance on multicore machines and (2) interoperability and reuse of models and services. The art of M&S can be defined as follows:

The art of modeling is to accurately portray the behavior of a represented system as it evolves over time and interacts with other models, services, and/or systems, while minimizing both software and computational complexity.

The art of simulation is to provide an interoperable high-performance parallel and distributed execution framework that enables a collection of models and services to be easily developed, tested, configured, and operated in a variety of run-time modes.

5.2 Science & Technology

At this point, it is helpful to make another distinction. Like modeling and simulation, the terms science and technology are often used interchangeably. Better definitions are needed.

Scientists perform research and development to construct, test, and/or validate models based on conceptual and/or theoretical understanding of hypothetical or real systems. Scientists can be subject matter experts, operations researchers, model developers, analysts, testers, and system validators spanning theoretical, experimental, and/or phenomenological areas of expertise.

Technologists also perform research and development to support architectures, frameworks, and tools that are necessary for scientists to carry out their work in a practical manner. If astronomers are scientists, telescope builders are technologists (see Figure 24). Modern advances in technology enable better science.

Figure 24: Far Side comic strip by Gary Larson showing Carl Sagan as a kid looking up towards the sky, amazed that there must be hundreds of stars in the galaxy. Carl Sagan was a well-known astronomer who is best remembered for his phrase about the galaxy, “…billions upon billions of stars.” What Carl Sagan actually needed to help him be a better scientist as a kid was a powerful telescope.

5.3 Multi-domain Modeling Spectrum

Figure 25: Multi-domain modeling spectrum.

As shown in Figure 25, military simulation covers a wide spectrum of use cases that require different levels of modeling resolution. The left side of the figure indicates the highest-resolution modeling, which is based on physics. Examples include computational fluid dynamics to model wind turbulence on aircraft, electromagnetic radio-frequency propagation of signals, and chemical reactions and dispersion of chemical, biological, and radiological weapons. Engineering models offer a slightly lower resolution (higher level of abstraction); they can represent circuit boards, antennas, embedded systems, and enterprise systems. Engagement models focus on interactions between simulated entities, such as dogfights between aircraft or combat between troops on the ground. Mission-level models often aggregate entities into larger statistical models; a mission model could represent an air mission containing fighters and bombers following an Air Tasking Order (ATO), a battle group of destroyers protecting an aircraft carrier, or a platoon of soldiers. Finally, campaign models are highly aggregated and focus on the statistical outcomes of warfare campaigns.


Physics and Engineering models tend to be computationally intensive, while usually involving a small number of battlefield entities. Physics-based models naturally decompose into large numbers of spatial grid cells, electromagnetic wave propagation paths, etc., which can be represented by large numbers of composites that operate in parallel. Engineering-based models of entities composed of complex systems can distribute their internal systems between many composites to achieve parallelism. For example, an aircraft carrier with many complex systems could be represented by several composites, each modeling their portion of the full system. The challenge is to balance computational workloads, minimize critical path dependencies, and make the internal interactions between models and services transparent to whether they are located within the same or on different composites.

Engagement models normally represent each battlefield entity as a composite system that is hierarchically composed of models and services. Scalability is achieved by creating large numbers of entities (composites) that are distributed across processors where they operate concurrently. The number of entities created is scenario dependent. Performance issues arise when scenarios are not large enough to distribute workloads or when certain event computations associated with a small number of entities, such as command centers, become bottlenecks. One approach to alleviate this problem is to use various embedded grid-computing techniques to parallelize the processing of a single event. This approach will be discussed later in the roadmap.

Mission and Campaign models often (but not always) aggregate entities into larger composites that act as a unit. Instead of weapon exchanges occurring between actual battlefield entities, statistical models such as Lanchester Attrition are used to determine outcomes of battles. Mission and Campaign models do not provide the same level of detail as Physics, Engineering, or Engagement models. Yet to be valid, they ought to provide the same statistical ensemble results. Sometimes, measures of performance derived from the execution of physics and engineering models are used to parameterize higher-level mission and campaign models. Mission and Campaign models often require large numbers of Monte Carlo runs, which can simply be farmed to available processors using standard grid computing techniques.

Five-dimensional simulation techniques, previously discussed, can offer tremendous performance boosts for Engagement, Mission, and Campaign simulations.

5.4 Composites and Systems of Systems

A composite (see Figure 26) is hierarchically composed of models and services that collectively represent a complex system of systems. The framework constructs each composite, with its models and services, as a self-contained event-processing unit.

Figure 26: A composite is hierarchically composed of systems containing models and services. Models and services within a composite invoke methods and/or schedule events using the Abstract Interface Manager (AIM). Methods on objects within models and services register their methods to handle abstract interfaces. The Data Registry (DR) allows models and services to register data for other models and services to use.

Models and services within a composite advance in time together as a unit even though their internal events are independent and asynchronous from one another. They are technically able to invoke methods on each other, and can directly share data without having to schedule events. However, to promote true interoperability and reuse, models and services within a composite must be completely encapsulated and therefore independent from each other. Another way of saying this is that models and services must never include header files provided by other models and services. A model or service knows nothing about the implementation of another model or service. Instead, models and services use abstracted services to mutually invoke each other’s methods and/or share data.

The framework facilitates this capability within each composite by managing abstract interfaces and shared data that are dynamically registered by models and services within the composite. Methods on objects within models and services are dynamically registered to inform the Abstract Interface Manager (AIM) that they are able to process a specified (strongly type-checked) interface. So, when a model or service calls or schedules an abstract interface, it does not know which other models or services are actually invoked. As an important optimization, an interface scheduled for the current time is optimized as a direct method call if the model is within the same composite. If the recipient model or service resides within a different composite, it is scheduled as an event. A publish and subscribe delivery service, provided by the framework, ensures that the appropriate composite(s) receive the event. The composite processes the event by invoking all registered methods for that interface.
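
The following sketch conveys the flavor of AIM-style registration and invocation (hypothetical names; the real AIM supports strongly type-checked interfaces of arbitrary signature, whereas this sketch fixes one signature for brevity).

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    class AbstractInterfaceManager {
        std::map<std::string,
                 std::vector<std::function<void(double)> > > handlers;
    public:
        // A model registers a method to handle a named interface.
        void RegisterHandler(const std::string &name,
                             std::function<void(double)> method) {
            handlers[name].push_back(method);
        }
        // The caller never knows which models, if any, are invoked.
        void Invoke(const std::string &name, double arg) {
            std::vector<std::function<void(double)> > &v = handlers[name];
            for (std::size_t i = 0; i < v.size(); ++i) v[i](arg);
        }
    };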

Each composite also maintains its own Data Registry (DR), which manages pointers to registered data in a binary tree data structure. A string key, and potentially a series of string tags, is used to register and look up data. Models and services therefore register data without knowing which other models and services access it. A publish and subscribe data distribution service is provided by the framework to efficiently distribute published data to subscribing models and services residing within other composites, or even within other applications spanning the enterprise. Models and services within a composite can also share data through federation objects, which will be discussed later in this roadmap.
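
A minimal sketch of such a registry follows (hypothetical API; the actual Data Registry performs strong type checking and supports string tags, both omitted here).

    #include <map>
    #include <string>

    class DataRegistry {
        // The paper describes a binary tree; std::map provides one.
        std::map<std::string, void *> entries;
    public:
        void Register(const std::string &key, void *data) {
            entries[key] = data;
        }
        template <typename T>
        T *Lookup(const std::string &key) const {
            std::map<std::string, void *>::const_iterator it =
                entries.find(key);
            return it == entries.end() ? 0 : static_cast<T *>(it->second);
        }
    };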

5.5 Composable Models and Services

A metadata file defines the capabilities of each model and service. At minimum, this file defines (1) the name of the model or service, (2) the package in which it resides for dynamic library linking at run time, (3) all data registered, (4) all data accessed from the data registry, (5) all abstract interfaces provided, (6) all abstract interfaces invoked, (7) all data published for other subscribing composites, (8) all subscribed data published by other composites, and (9) model initialization parameters. The metadata file is essential for validating the hierarchical structure of a composite: in a correctly structured composite, every model or service requiring registered data and interfaces must be matched with other models and services that provide them.

To further promote interoperability and reuse between models and services, abstract interfaces and data that are registered must be well understood with common agreement on semantics. Abstract interfaces and shared data are defined separately from models and services. They therefore have their own metadata files.

As shown in Figure 27, one extremely hard challenge is to transparently automate how a complex system distributed across multiple composites is composed of hierarchical models and services. Events between models and services residing in different composites are not difficult to support. Much harder is sharing data, especially when that data, for example a container of other objects, has a rich and complex structure. The framework provides two mechanisms: (1) a data registry that can support any data structure without requiring serialization and (2) a publish/subscribe framework that automatically delivers specific types of published data to subscribers. The framework can support the notion of updates and reflections of simple data types that are shared between models and services residing within multiple composites. More complex data types are much harder to support.

Figure 27: Models and Services are specified by metadata describing initialization parameters, data provided, data needed, services provided, and services needed.

System files (sometimes called computational graphs) define the hierarchical model/service structure of composites (see Figure 28). Transparent name mangling uniquely defines interfaces between models and services. The framework automatically creates, initializes, and “wires together” the graphically defined structure of models and services for each composite.

Figure 28: Composites are defined as computational graphs that are composed of models and services that are wired together.

Two types of connections are provided between models and services. First, outputs of a model and/or service can be data that feed as inputs to other models and services. A notification method is invoked when input data is modified, which can produce additional outputs that trigger other model and service notification methods. Notifications are processed according to priorities associated with model and service inputs to help optimize and stabilize computations. Prioritizing the order of computations can be critical when systems have feedback. An important application of systems composed of models having data inputs and outputs is cognitive modeling via rules or emotions that trigger whenever their inputs change. Complex cognitive models can be constructed from large rule sets representing thoughts, which trigger other thoughts, and so on. Second, outputs of a model and/or service can be registered interfaces, which can be function calls that are invoked right away or scheduled for a future time.

Composite instances are defined in scenario files. This can be accomplished by enumerating each composite with its hierarchical system specification, parameters and thresholds for each model and service, mission description, etc. Scenario files can also specify multiple instances of replicated objects that take their geospatial positions within a geometrical constellation or grid structure, or that are given random locations.

Figure 29: Data files defining three levels of composition.

Figure 29 shows the relationship between the three levels of data files: first, the metadata file for each model and service; second, the system specification of an individual composite; and third, a scenario file that describes multiple instances of composites.

5.6 Visual Programming with Models and Services

Figure 30: Process of developing code for models and services, providing metadata about models and services, specifying systems that are composed of models and services, executing a scenario composed of composite systems within a parallel and distributed computing environment, and then visualizing results.

As shown in Figure 30, visual programming tools can be used to develop models and services, generate code to simplify programming efforts, capture metadata about models and services, compose systems, construct scenarios, and visualize results.

Figure 31: Packages.

Figure 31 shows the connection between models and services that are provided in dynamically linked libraries, which execute inside the OpenUTF parallel and distributed execution environment. A metadata package describes each model and service that is used by the Vision graphical user interface to compose system packages. Vision also creates the system definition file that combines the scenario description of composites with their internal systems composed of models and services. A generic main program reads the system definition file, dynamically loads the necessary libraries, and then executes the scenario on parallel and distributed computing resources. The OpenUTF Kernel provides core framework services necessary to execute the scenario.

5.7 Integrated Development Environment

A graphical Integrated Development Environment (IDE) would greatly facilitate model and service code development. The IDE essentially provides a graphical interface for specifying header file class definitions of models and services. Internal data members would be identified as either (1) rollbackable state variables, (2) parameters that are set during initialization, or (3) exportable attributes that are published. In addition, data members that need to be registered in the data registry are specially identified. Class methods are similarly identified as regular methods, notification methods, or events.

A graphical programming interface can generate code to perform mechanical steps that are tedious and error prone for programmers. Such operations include the use of macros to specify methods as events, registering methods to handle specified interfaces, publish and subscribe operations, data registration, parameter initialization, etc.


The IDE can also ensure that all dynamically changeable state variable data types are defined as rollbackable. The benefit of having an IDE is faster and less error-prone development of models and services with automatic capturing of their metadata for visual programming.

5.8 Executable Architectures

The roadmap described the process of developing models, services, data, and interfaces with metadata descriptions that can be used to graphically create composite systems of systems, which execute within the OpenUTF. This is exactly what is needed to support executable architectures of real systems operating in parallel and distributed computing environments. The artifacts of this process fully capture the complete architecture of a distributed system and can be used to support common standards such as the DoD Architecture Framework.

5.9 Basic Modeling Constructs

The framework provides various types of programming constructs for building composites, models, services, interfaces, and defining common data. Like the tools of an expert carpenter, each construct has its proper purpose in implementing the behaviors of complex models and services. This section highlights the most important basic modeling constructs provided by the OpenUTF that are commonly used within models and services to coordinate their activities in abstract time.

Because composites are automatically created and distributed across processors by the framework from system definition files, this section skips describing how this actually occurs and primarily focuses on the various event types that are provided by the OpenUTF. Keep in mind that models and services internally coordinate their activities in abstract time using these constructs. Abstract data registration, abstract interfaces, and publish and subscribe capabilities are used to define how models and services interact and share data within a composite, between composites, or with other applications operating in a distributed enterprise or federation.

5.9.1 Local Event

A local event is defined as a method on an object that is scheduled by the model or service for itself. The object can be a model or service, or it can be an object embedded within the model or service. There are no inheritance requirements for local events. Local events can have up to 20 strongly type-checked arguments that are passed to the object’s method when the event is eventually processed.

A macro provided by the framework is used by the model or service to define the local event as a method on some object. That macro code-generates all of the machinery for scheduling and processing the event. A local event is scheduled by calling a macro-generated function. The arguments to the SCHEDULE_<EventName> function are (1) the abstract time of the event, (2) a pointer or reference to the object implementing the local event, and (3) the arguments associated with the event that are passed to the method when it is called. Variable-length data can be appended to the local event if needed.
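
As a simplified stand-in for the macro-generated machinery (not the actual OpenUTF macros, which also handle abstract time, rollback support, and argument marshaling), scheduling a method on an object as a local event boils down to something like the following.

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <utility>

    static std::multimap<double, std::function<void()> > eventQueue;

    struct Radar {
        void Scan(int beamId) { std::printf("scanning beam %d\n", beamId); }
    };

    // Roughly what a generated SCHEDULE_Scan(time, object, args) expands to.
    void ScheduleScan(double time, Radar *object, int beamId) {
        eventQueue.insert(std::make_pair(
            time, std::bind(&Radar::Scan, object, beamId)));
    }

    int main() {
        Radar radar;
        ScheduleScan(10.0, &radar, 1);  // (1) time, (2) object, (3) args
        ScheduleScan(5.0, &radar, 2);
        while (!eventQueue.empty()) {   // process events in time order
            eventQueue.begin()->second();
            eventQueue.erase(eventQueue.begin());
        }
        return 0;
    }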

5.9.2 Process Model

Local events can be turned into thread-like reentrant events, sometimes called processes (not to be confused with UNIX processes). Normally, an event is a method on an object that is invoked at its scheduled time. The event is completed when the method returns. A process can wait for a specified period of time or for some condition to become valid and then wake up. Much like context switching between threads, all of the local variables associated with the process are restored to their prior values upon waking up. So, a process essentially goes to sleep, passes time, and then wakes up without leaving the method. In reality, this is all accomplished under the hood with events, but to the model developer, it appears as if the method simply wakes up. A process can have multiple wakeup points within its method. The basic constructs of processes are described below.

The most basic construct is the wait statement, which specifies an amount of time to wait before waking up. A variant of this is wait until, which uses the actual wakeup time as an argument instead of the duration.

Another very useful construct is the wait-for-interrupt statement. A process can specify up to 32 different types of interrupts that it responds to using a mask. The process then waits either for an interrupt to be set or for a specified timeout to occur. The timeout can be infinite, meaning that the process wakes up only when one of its specified interrupts is set.

Integer and double semaphores provide another way for processes to wake up. A process can use the wait-for construct to wake up when a semaphore is greater than, greater than or equal to, less than, or less than or equal to a specified value. Semaphores can be combined with interrupts and timeouts as well. A common application is a sensor that sleeps until at least one detectable entity is within range, with an integer semaphore providing the count of detectable entities. A timeout can also be specified.

Integer and double resources provide a way for a process to request a resource. Using the wait-for-resource construct, the process sleeps until the resource becomes available. At that point, the process wakes up, consumes the resource, and, once done using it, returns it for other processes to consume. Resources can be used with interrupts and timeouts. An example of a resource might be fuel: an entity waits for its fuel before starting its mission. Resources can also represent servers in queuing systems; if a resource is not available, processes wait in queues and wake up when the requested resource later becomes available. Various queuing disciplines can be used (FIFO, LIFO, prioritized by amount requested, inversely prioritized by amount requested, or generic priority). To promote parallelism of heavyweight composites, resources can be shared between composites.

A more advanced process model capability, known as ask, is less frequently used but extremely powerful. It allows a process to invoke a two-way method on an object residing within another composite and have the results returned. This is still accomplished through events under the hood, but to the modeler, it appears as if a method on a different composite is invoked with results returned. As an optimization, multiple asks can be invoked before waking up to receive the results of each ask. While these transactions normally occur with zero lookahead, the results can actually be returned at different times. To reduce rollbacks when using optimistic time management, other events for composites with time tags greater than the return time of a pending ask are not processed until the results are returned. This provides stability through additional flow control, which can be critical.

Finally, the process model supports stochastic branching. The branch statement uses probabilities to determine which branch to take. The branch construct looks very much like a switch/case statement. When running in five dimensions, the branch construct actually explores all branches. It maintains the probabilities specified in each branch to provide correct statistics at the end of the simulation execution.

5.9.3 Simple Event

Another type of event, known as a simple event, is used to schedule events between different models and services that reside within the same composite or other composites. Unlike local events, simple events use a string name to define their event type. Simple events come in five flavors that are distinguished by the type of arguments passed: (1) no arguments, (2) single class, (3) generic buffer, (4) up to 20 user-defined type-checked arguments, and (5) run time class, which provides an easy-to-use container for passing generic arguments by string name. Simple events using run time classes can also apply logical string expressions that are dynamically interpreted to help filter out unwanted events when using publish and subscribe techniques.

A framework-provided macro is used to create the function for scheduling or publishing a simple event with type-checked arguments. To eliminate model and service interdependencies, this interface should be defined in its own package. A separate framework-provided macro is used by the model or service to identify a method on a class for handling the simple event. That macro generates functions for the method to dynamically register, unregister, subscribe, and unsubscribe to the simple event.

The object handle of the recipient composite is provided when scheduling a directed simple event for a specific composite. Object handles can be dynamically computed, looked up by name, or discovered through publish and subscribe services. No object handle is necessary when publishing an undirected simple event. The framework routes the published simple event first to all nodes containing one or more subscribers, and then to each individual subscriber residing on the node.

Event names are often coded or mangled to filter recipients so that the event is processed by a limited set of recipients. It is possible to publish a simple event with no subscribers. When that occurs, the framework simply discards the simple event.

Subscription filters can be applied when simple event parameters are passed using run time classes. A built-in interpreter utility provided by the framework evaluates the logical filter using data stored in the run time class to determine if the subscription condition is met.

A simple event scheduled for objects residing within the same composite at the current time is automatically optimized as a normal function call by the framework. This can significantly reduce computational overheads.

5.9.4 Data Registration

As previously mentioned, data is registered and accessed by models and services using the data registry that is provided by each composite. The framework provides registration interfaces for many built-in data types and performs strong type checking when accessing such data. For example, an interrupt can be registered in the data registry and then used by several processes to set and wake up from interrupts. This allows models to mutually coordinate their activities in complex ways without requiring dependencies on each other’s implementation.

As with abstract interfaces, user-defined data types that are registered and accessed from the data registry by models and services must be defined within their own package. This is necessary to eliminate all interdependencies between truly composable models and services.


5.10 Cognitive Models

Modeling human decision-making involves the triggering of complex logic that is best suited for rule-based systems within a cognitive framework (see Figure 32). The heart of any rule-based cognitive framework is a Reasoning Engine that allows a method (with no arguments) on an object to be registered as being sensitive to any number of state variables that act as stimulus. No class inheritance is required and methods can be named anything. Sensitive methods are automatically invoked whenever a registered stimulus changes by more than a specified tolerance.

Figure 32: Conceptual operation of the Open Cognitive Architecture Framework (OpenCAF).

Stimulus comes from multiple sources including perception of the environment, processed or fused data, and outputs of thoughts that stimulate the triggering of other thoughts.

During the course of processing a triggered sensitive method, further state variables that are registered as stimulus for other sensitive methods may be modified, thereby triggering additional sensitive methods in a cascading manner. This provides the foundation for triggering abstract cognitive thought processes. One thought generates stimulus that triggers additional thoughts. Priorities can be assigned to each stimulus registered with a sensitive method. Ordering the triggering of sensitive methods based on stimulus priority can help to achieve faster and more meaningful cognitions that converge more quickly to stable values.
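
A minimal sketch of such a reasoning engine appears below (hypothetical API). Stimulus priorities are omitted, and the cascade terminates either on convergence or after a fixed number of passes, mirroring the termination conditions described later in this section.

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    class ReasoningEngine {
        struct Sensitivity {
            double *stimulus;               // registered state variable
            double last;                    // value at last triggering
            double tolerance;               // minimum change to retrigger
            std::function<void()> thought;  // the sensitive method
        };
        std::vector<Sensitivity> table;
    public:
        void Register(double *stimulus, double tolerance,
                      std::function<void()> thought) {
            Sensitivity s = { stimulus, *stimulus, tolerance, thought };
            table.push_back(s);
        }
        // Cascade until no thought fires (convergence) or a pass limit.
        void Think(int maxPasses = 100) {
            for (int pass = 0; pass < maxPasses; ++pass) {
                bool fired = false;
                for (std::size_t i = 0; i < table.size(); ++i) {
                    Sensitivity &s = table[i];
                    if (std::fabs(*s.stimulus - s.last) > s.tolerance) {
                        s.last = *s.stimulus;
                        s.thought();   // may modify other stimulus values
                        fired = true;
                    }
                }
                if (!fired) return;    // no more new thoughts
            }
        }
    };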

Stimulus can be represented as encapsulated state variables residing within models, services, or embedded objects. Stimulus is managed within the composite's data registry, which gives sensitive methods access to stimulus information while preserving object-oriented encapsulation principles. Other models and services can also access stimulus data through the data registry.

Stimulus can be thought of as volatile short-term memory. Stimulus is usually based on perception, which is not necessarily the truth. Long-term memory, on the other hand, might consist of model parameters, thresholds, known doctrine guidelines, rules of engagement, geographical data, and operational concepts that are established prior to execution. Long-term memory rarely changes during model execution and would not require use of the stimulus infrastructure, which is associated with additional overheads.

Thoughts are represented as sensitive methods within classes. The Reasoning Engine automatically coordinates input/output stimulus and other state variables between registered class methods. All state variables operate transparently in sequential, parallel (optimistic or conservative), and/or real (scaled, hard, or soft) time.

Collectively, thoughts form a computational graph that is executed by the Reasoning Engine. Graphs are not rigid and can be dynamically reorganized as new data is received and/or decisions are required. This introduces the possibility of having adaptable cognitive models that are able to morph in response to dynamically changing circumstances.

User-defined classes can implement multiple thoughts that share data encapsulated within object instances. Processing a thought may stimulate new thoughts as stimulus is modified. Thinking stops when the cognitive model runs out of ideas (i.e., there are no more new thoughts), when thoughts converge to a stable state, or when the thinking process is intentionally terminated due to time constraints or lack of convergence. The same thought (i.e., method on a software object) can be triggered multiple times as stimulus becomes modified by other thoughts. In this manner, cognitive models are able to change their minds or converge to a solution during the thinking process as refinement occurs.

Thoughts are conceptual rules or algorithms that are automatically triggered by stimulus. In this manner, thoughts are much more general and individually encapsulated than hardcoded monolithic if-then-else logical expressions commonly coded in other systems. Models can also register or unregister their thoughts dynamically over abstract time.

Cognitive modeling capabilities provide a rich foundation for supporting agent-based simulation, where intelligent agents make cognitive decisions based on their perception of the environment, adapt as dynamics change, and behave autonomously to provide complex decision-making. Cognitive models are mobile because their specifications are described via system files, which can be communicated in a distributed computing environment and subsequently used to dynamically construct instances.

5.10.1 World View Learning Machine

The World View Learning Machine (WVLM) shown in Figure 33 provides an extension to the basic cognitive modeling framework previously discussed. The WVLM combines several important components that work together to form a worldview, which may not be optimal or even true, but represents the thought processes of people, organizations, and societies with strongly adhered-to ideologies. Worldview representations are necessary for correctly modeling the cognitive decisions of adversaries with vastly different ideologies. The WVLM can represent adversaries making emotional, irrational, and suboptimal decisions.

Figure 33: World View Learning Machine.

Ideologies represent a collection of interconnected rules that are believed to be true within a big-picture worldview. Models provide more complicated predictive properties that help make sense of the rules; they fill in the details that the interconnected set of rules alone leaves missing. The WVLM transforms its ideology into a collection of truth statements that are implemented as rules within the reasoning engine. Combined with more elaborate models, the reasoning engine triggers cognitions as stimulus is received through perception.

Desires are represented as emotion-based reasoning constructs that make decisions using weighted inputs that are coupled with multiplicative thresholds. This seat-of-the-pants way of making decisions adds personality to the thought processes occurring within the reasoning engine and explains why people sometimes behave contrary to their strongly held worldview ideologies. All inputs and outputs in the WVLM are filtered to account for imperfect data fusion and interpretation of data. Biases often play a strong role in the filtering process. Personal or societal ideologies are almost always hard to change, often resulting in bizarre interpretations of reality by those with strongly held rigid worldviews.

A validation stage continually assesses the validity of the worldview cognitive model. Results that do not pass validation cause adjustments to be made throughout the WVLM. The validation criteria can be strict, which means that the WVLM has a high threshold for acceptance. On the other extreme, the validation criteria could be loose, meaning that the model exhibits limited critical thinking capability. In such cases, emotional decision-making based on selfish desires, distorted perception, improperly connected rules, and inaccurate models combine with validation tests that are also irrational.

One of the important characteristics of the WVLM is that it can combine incorrectly filtered data, irrational ideologies, and incorrect rules and models, to produce outputs that are valid according to the validation step, which itself can be incorrect. A WVLM can be trained to describe a wide variety of human behaviors based on ideologies, personalities, desires, and biases.

5.11 Detecting System Violations

One very simple application of a rule-based cognitive model is to catch and flag system violations as they occur in a simulation. Correct system performance is expressed as a series of simple logic-based rules that must not be violated. An example might be a series of commands given to an autonomous vehicle to perform a specific operation, where rule checking ensures that the performance margins of its internal systems are never exceeded.

The rule-based violation checking approach is easily understandable and extensible by non-technical users. It not only catches critical system performance violations, but also lends an extra level of trustworthiness to computed results for runs in which no violations are detected. Users of the model or service know that their results are automatically checked for correct performance by an easily understood and traceable set of simple rules.
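
A tiny sketch of this idea follows (hypothetical names); each rule is a named predicate over simulation state that is evaluated after events are processed.

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct Rule {
        std::string name;             // human-readable rule description
        std::function<bool()> holds;  // predicate over simulation state
    };

    int main() {
        double speed = 42.0, maxSpeed = 40.0;  // illustrative state
        std::vector<Rule> rules;
        Rule r = { "speed must not exceed maxSpeed",
                   [&]() { return speed <= maxSpeed; } };
        rules.push_back(r);

        for (std::size_t i = 0; i < rules.size(); ++i)  // run after events
            if (!rules[i].holds())
                std::printf("VIOLATION: %s\n", rules[i].name.c_str());
        return 0;
    }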

5.12 Embedded Simulation

Events within simulations often need to predict future outcomes to make good decisions. For example, a pilot in an aircraft might need to explore all possible outcomes of multiple maneuvers being considered to determine which one provides the best result. Such predictions often involve complex interactions between systems that can only be determined by simulating the future. So an event at a particular time in the primary simulation needs to simulate the future before making its decision. This can be supported by embedded simulation.

Figure 34 illustrates an event in the primary simulation that needs to make a decision based on a prediction of the future. It launches one or more embedded simulations to predict the future, and the results of those runs are used to make the decision within the primary simulation. Embedded simulations can themselves embed further simulations to explore various decisions. Alternatively, embedded simulations could use five-dimensional simulation techniques to simultaneously explore the effects of multiple decisions and/or stochastic outcomes.

Figure 34: Embedded simulation. An event within the primary simulation runs an embedded simulation to predict the future. The start time of the embedded simulation is the time of the event in the primary simulation. The results from the embedded simulation are used by the event to make a decision.

There are several ways to launch an embedded simulation. The most straightforward is to launch it as a separate external process, perhaps running on a different machine from the primary simulation. The results could be returned at a later time, allowing the primary simulation to execute concurrently with the embedded simulation during that time window. A grid-computing scheme could be used to launch multiple embedded simulations on available computing resources when multiple decisions need to be considered. If the event is a process, the wakeup could occur automatically when all of the results are returned and stored in the data registry. The process simply retrieves its data for each of the embedded simulation excursions, chooses the best decision, and then continues its normal processing.

A second way to launch an embedded simulation is to actually execute it within the event. The problem with this approach, however, is that the event computation running the embedded simulation can become a bottleneck. Solutions to this problem likely involve additional multithreading techniques to speed up the computations of the embedded simulation.

6.0 Publish and Subscribe Services

Simulations are often composed of large numbers of composite entities that are distributed across processors and mutually interact in abstract time during execution. Entities are typically able to interact with other entities when they are within sensor range, communication range, and/or weapons range.

Publish and subscribe services allow encapsulated composites to both mutually interact and share data in parallel and distributed computing environments. Standardization of these services can also be used to automate integration with enterprise systems. Publishers and subscribers provide their publication and subscription interests to the framework where interaction events and attribute data mapped into federation objects are distributed in an efficient manner. The framework always first distributes interactions and object attributes to the nodes containing at least one subscriber. Then the framework relays the interaction or attribute update to all local subscribers. The entire two-step transaction is performed using events that operate in auxiliary time so that there are never any other events outside the transaction with time tags overlapping those events generated by the transaction.

6.1 Publish and Subscribe Data Distribution

Composites must have access to basic information about other composites as they potentially interact. This requires publish and subscribe data distribution services to deliver entity information in parallel and distributed computing environments. At minimum, this information generally includes each composite entity’s spatial position and unique identifier. In general, composite entities freely move about the battlefield and can continually change their motion and/or other attributes without notice.

As shown in Figure 35, publish and subscribe data distribution services allow models and services within composites to map exportable attributes into Federation Objects (FOs). Publishers publish their FOs to enable the distribution of their attributes to subscribers. A subscriber specifies what types of FOs it is interested in, along with specially created filters that operate on the attributes of published FOs. Subscribers can specify parent class types during subscription, in which case all derived class instances will be discovered.

A filter returns either true or false when it is applied to a FO, indicating whether the filter is passed. The most common type of filter is based on the range between subscribers and published FOs. However, filters can operate on a wide variety of other attributes as well. Both filters and attributes can dynamically change, which adds complexity to publish and subscribe data distribution services. Some attributes are implemented as time-ordered sequences of segments that compute their values as a function of time. These kinds of computed attributes are most often used to represent motion. Computed attributes further complicate publish and subscribe filtering mechanisms because they have to be regularly computed and handled with tolerances.

Figure 35: Publishers, subscribers, exportable attributes, federation objects (FOs), and filters.

A subscriber discovers FOs when any of its registered filters meets the criteria of a published FO. Each filter combines a set of criteria on one or more published attributes that must all be true for the filter to pass. The set of filters registered by a subscriber forms a logical OR to regulate discovery.

Publication and subscription regions are said to overlap when the filter criteria for a FO evaluate to true. FOs are removed from a subscriber when (a) none of the filters registered by the subscriber meet the criteria of the published FO, (b) the FO is unpublished by the creator of the FO, or (c) the FO is actually deleted by the publisher. In typical scenarios involving moving entities, published FOs are continually discovered and removed by subscribers during the execution of the simulation. Subscribers can register notification methods on their objects that are automatically invoked whenever FOs are discovered or removed. This allows the subscriber to take specific action when these special events occur.
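
The discovery rule can be summarized as an OR of ANDs: a filter passes only when all of its criteria hold, and a subscriber discovers a FO when any of its registered filters passes. The sketch below illustrates this logic with hypothetical types (FO, Filter); it is not the actual OpenUTF API.

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Minimal stand-in for a published federation object.
    using FO = std::map<std::string, double>;

    // One filter: a set of criteria that must ALL hold (logical AND).
    struct Filter {
        std::vector<std::function<bool(const FO&)>> criteria;
        bool passes(const FO& fo) const {
            return std::all_of(criteria.begin(), criteria.end(),
                               [&](const auto& c) { return c(fo); });
        }
    };

    // Discovery: ANY registered filter passing discovers the FO (logical OR).
    bool discovers(const std::vector<Filter>& filters, const FO& fo) {
        return std::any_of(filters.begin(), filters.end(),
                           [&](const Filter& f) { return f.passes(fo); });
    }

    int main() {
        Filter nearMe;  // range filter centered on a subscriber at (0, 0)
        nearMe.criteria.push_back([](const FO& fo) {
            return std::hypot(fo.at("x"), fo.at("y")) < 50.0;
        });
        FO aircraft{{"x", 10.0}, {"y", 20.0}};
        std::cout << std::boolalpha << discovers({nearMe}, aircraft) << "\n";  // true
    }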

Publishers can update the attributes of their FOs anytime during the course of the simulation. Updated attributes are reflected to those subscribers that currently have received the FO. A filterable attribute that is modified can potentially cause the FO to be newly discovered or removed from subscribers as filters are automatically reevaluated. Again, subscribers can register notification methods that are invoked whenever FO attributes are reflected. This allows the subscriber to take specific action when attributes of remote FOs are modified.

Depending on how the parallel and distributed application is configured, reflections can be provided instantaneously in simulated time, delayed by a prescribed lookahead value, delayed by specified tolerances on attribute values, or arbitrarily delayed by network latencies when operating in real time. Rollback-based optimistic run-time configurations introduce additional challenges because entities on the same processor may be represented at different logical times due to time warping. Space-time data management techniques that record state history over time are typically used to handle this case. Other touch-depend techniques can also be used to coordinate multiple subscribers locally sharing a discovered FO on their node.

Subscribers can modify the attributes of remote FOs if permission is granted. Modified attributes are automatically provided back to the publisher of the FO, where they are internally updated and then reflected to other subscribing entities. This all occurs optimistically with zero lookahead. The use of auxiliary time guarantees that all events in the multi-step transaction occur before any other event outside the transaction is processed. It also eliminates ambiguities about who owns an attribute when two entities modify the same attributes of a remote FO at the same time. Publishing entities can register methods on internal objects for notification when local or remote updates occur. This allows the publisher to take specific action when attributes of local FOs are modified.

While many applications do not explicitly provide publish and subscribe data distribution services in the same formalism described above, military simulation applications almost always address these generic capabilities in some manner, especially when operating in parallel and distributed computing environments. Because of the zero-lookahead fan-out and fan-in types of event transactions, publish and subscribe data distribution services often govern the overall performance of simulation applications. Efficient filtering that minimizes critical path dependencies between publishers and subscribers is essential for scalable performance.

6.2 Publish and Subscribe Event Scheduling

Simple events also provide publish and subscribe event-scheduling services. Basic filtering is performed using the name of the event to determine recipients. Logical string-based filter expressions can optionally be applied when using run time classes for simple events. Run time classes store name-value pairs in a packed container. Filters use data stored in the run time class to provide values for variables used in the expression. A logical expression evaluated as true means that the subscriber should receive the simple event.

An example of a simple event with filters might be a weapon detonation event. Subscribing entities would create a filter expressing their approximate location, which can change over time. Subscribers must periodically update their filters with their new position as they move about the battlefield. The weapon detonation event stores its position and the radius of its blast in the run time class. The filter then determines if an entity is within the effective range of the weapon detonation.
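
A minimal sketch of this example follows, assuming the run time class behaves like a container of name-value pairs; the withinBlast function and all values are invented for illustration.

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <string>

    // Stand-in for a run time class: packed name-value pairs.
    using RunTimeClass = std::map<std::string, double>;

    // Subscriber-side filter: is the entity at (ex, ey) inside the blast radius?
    bool withinBlast(const RunTimeClass& rtc, double ex, double ey) {
        double dx = rtc.at("x") - ex;
        double dy = rtc.at("y") - ey;
        return std::hypot(dx, dy) <= rtc.at("radius");
    }

    int main() {
        RunTimeClass detonation{{"x", 100.0}, {"y", 50.0}, {"radius", 30.0}};
        std::cout << std::boolalpha
                  << withinBlast(detonation, 110.0, 60.0) << "\n";  // true
    }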

7.0 Advanced Performance Techniques

As previously discussed, scalable performance of multicore applications depends on several factors: (1) workloads must be evenly balanced across processors, (2) communication services must be highly optimized, (3) redundant computations must be avoided, (4) data (i.e., objects, grid cells, etc.) must be distributed evenly across processors while minimizing communication between data elements, (5) critical path dependencies must be mitigated, (6) intrinsic lookahead must be maximized, and (7) bottlenecks must be eliminated. Pure parallel discrete-event simulation techniques by themselves are not enough to ensure scalable execution. This section describes additional techniques that are necessary for achieving fully scalable performance.

Figure 36: Typical representation of a composite as an entity that is hierarchically composed of models and services.

Decomposition strategies involving models, services, and composites and their relationship to entities play a significant role in achieving parallel performance (see Figure 25). Event patterns between entities as they publish, subscribe, update, reflect data, and mutually interact affect the critical path. Zero-lookahead fan-in and fan-out patterns, often associated with publish and subscribe data distribution services, can serialize a simulation if not properly optimized.

The most common representation of a battlefield entity is as a single composite, shown in Figure 36. The composite is composed of models and services representing the entity’s internal systems and behaviors. Parallel performance is achieved by creating many entity composites that are distributed across processors and operate concurrently. The models and services within an entity composite are free to (a) share data using the data registry provided by the composite and (b) mutually invoke each other’s methods using abstract interfaces.

Mission and campaign simulations (see Figure 37) often combine the representation of multiple battlefield entities within a composite. Abstracted models represent the aggregate or statistical behavior of the entities as a collective. Parallelism is normally achieved by running large numbers of independent Monte Carlo replications. Each replication is initialized with a different random number seed to produce different outcomes. Results from each run are gathered and analyzed to produce statistical measures of specific system performance and overall effectiveness. However, other techniques such as grid computing based on data decomposition can also be used to parallelize computations within an aggregate composite. For some problems, five-dimensional simulation also offers an alternative way to achieve statistical results without resorting to large numbers of Monte Carlo runs.

Figure 37: Mission and campaign simulations aggregate multiple entities within a single composite.
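
The replication pattern can be sketched as follows; runReplication is a stand-in model, and the seeding and aggregation shown are illustrative rather than the OpenUTF's actual Monte Carlo services.

    #include <future>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    // Stand-in model: each replication runs independently with its own seed.
    double runReplication(unsigned seed) {
        std::mt19937 rng(seed);  // per-replication random number stream
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double outcome = 0.0;
        for (int i = 0; i < 1000; ++i) outcome += u(rng);
        return outcome / 1000.0;  // invented effectiveness measure
    }

    int main() {
        const int replications = 8;
        std::vector<std::future<double>> runs;
        for (int r = 0; r < replications; ++r)  // different seed per run
            runs.push_back(std::async(std::launch::async, runReplication, 12345u + r));
        std::vector<double> results;
        for (auto& f : runs) results.push_back(f.get());
        double mean = std::accumulate(results.begin(), results.end(), 0.0) / replications;
        std::cout << "mean effectiveness = " << mean << "\n";
    }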

The most challenging case is physics and engineering simulations because they usually simulate just a few entities, but at extremely high levels of fidelity. Due to the need for high-precision results, physics and engineering simulations are often computationally intensive. Additional parallelization techniques, beyond simply representing entities as composites, are needed for these types of simulations to scale on large supercomputers.

As previously discussed, multiple composites can represent a single entity. An example might be an aircraft carrier with many systems and subsystems, or a sensor with computationally intensive signal processing algorithms that are easily parallelized.

Figure 38: Functional decomposition of a single entity using multiple composites. In this example, the entity is represented as three composites, each modeling a subset of the full system. A significant challenge is for models and services to be transparent to their configuration in terms of data sharing and abstract interfaces when spread across multiple composites.

Two ways to extend parallelism for such entities are: (1) distribute the models and services associated with the entity across multiple composites (i.e., functional decomposition, see Figure 38), and (2) let each composite representing the entity be composed of its full collection of models and services, but have each composite subscribe to a subset of its full data, which automatically distributes the computations of the data to each composite (i.e., data decomposition, see Figure 39).

Figure 39: Data decomposition of an entity using multiple composites. In this example, the entity is represented as three composites, each modeling the full system, but subscribing to different data. Each composite processes its data concurrently. Additional work may be required to merge results.

To obtain scalable performance, workloads must be continually balanced. Workloads can change during the course of an application's execution, which means that some form of dynamic load balancing is needed. One strategy is to migrate composites from heavily loaded processors to ones with lighter workloads. This requires three things: (1) determining which processors have heavy workloads and which ones do not, (2) determining which composites should move to provide a more even workload while minimizing communications, and (3) actually moving composites from one processor to another. Implementing dynamic migration of composites across processors is extremely difficult to achieve.
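
A toy sketch of steps (1) and (2) appears below: find the most and least loaded nodes and nominate a composite to migrate between them. A real balancer must also weigh communication locality, which this deliberately ignores.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> nodeLoad = {0.9, 0.4, 0.7, 0.2};  // fraction of time busy
        auto hi = std::max_element(nodeLoad.begin(), nodeLoad.end());
        auto lo = std::min_element(nodeLoad.begin(), nodeLoad.end());
        std::cout << "migrate a composite from node " << (hi - nodeLoad.begin())
                  << " to node " << (lo - nodeLoad.begin()) << "\n";  // node 0 to node 3
    }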

7.1 Embedded Grid Computing

A different, and perhaps better, approach to dynamic load balancing uses concepts from grid computing to parallelize computations that otherwise would be bottlenecks. Grid computing essentially breaks an event computation into separate tasks that can be farmed to other processors, where they are performed concurrently. The results are gathered back to the originator of the work. Without grid computing, a computationally intensive event could become a bottleneck that serializes the execution. Grid computing provides an important technique to remove this bottleneck.

Grid computing, as shown in Figure 40, can be performed in several ways. First, computations from an event can be farmed out to other computers that are not being used by the application. The results are returned at a specified abstract time, which can be the current event time or any time in the future. Time barriers are established to ensure that global time does not advance beyond the expected result time until all computed results are received.

Figure 40: Grid computing farms computations of intensive events to other processors. The results are returned (potentially at a later abstract time) where the event combines the results and continues processing.

Second, computations from an event can be farmed out to processors currently being used by the application. The computations are processed in parallel in-between event processing. This can be accomplished either through message-passing techniques (see discussion on the Object Request Broker services for invoking methods on remote objects) or through the use of threads. In either case, processors or threads must be prioritized in a manner that farms work to lightly loaded processors. For optimistic simulations, this usually means that work is farmed to those nodes that are further ahead of the others in terms of global virtual time.

Grid computing using threads requires categorizing different types of threads and establishing their priorities. The first category is communication threads. Each node has at least one communication thread running at the highest priority. It sleeps until a message is received. The second category is the node thread. Each node has one of these running at a moderately low priority. The actual priority is related to how far the node has advanced ahead of global virtual time. Slower nodes receive higher priority than faster nodes. The third category is the worker thread. These threads are organized in a pool and are nominally set to the lowest priority, where they sleep until needed. When an event is parallelized using threads, an appropriate number of worker threads are activated with a priority that is higher than all of the node threads. This effectively swaps out node threads while worker threads perform their computations. One important consideration when assigning worker thread priorities is that events close to global virtual time should receive a higher priority than events further out from global virtual time.
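
One priority scheme consistent with this description is sketched below; the numeric priority bands are assumptions made for illustration, not values taken from the OpenUTF.

    #include <algorithm>
    #include <iostream>

    enum class ThreadCategory { Communication, Node, Worker };

    // Higher return value means higher scheduling priority. timeAheadOfGvt is
    // how far the node (or event) has advanced beyond global virtual time.
    int priority(ThreadCategory cat, double timeAheadOfGvt) {
        switch (cat) {
        case ThreadCategory::Communication:
            return 100;  // always highest: sleeps until a message arrives
        case ThreadCategory::Node:
            // Slower nodes (small lead over GVT) outrank faster nodes.
            return std::max(10, 50 - static_cast<int>(timeAheadOfGvt));
        case ThreadCategory::Worker:
            // Activated workers outrank every node thread, and work for events
            // close to GVT outranks work for events farther out.
            return std::max(51, 90 - static_cast<int>(timeAheadOfGvt));
        }
        return 0;
    }

    int main() {
        std::cout << priority(ThreadCategory::Node, 5.0) << " "      // 45
                  << priority(ThreadCategory::Worker, 5.0) << "\n";  // 85
    }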

Grid computing breaks down when too many workers are used. This concept is shown in Figure 41.

Figure 41: Embedded grid computing critical path analysis. The assumption is that each communication hop takes one unit of processing time. Workload, W, is held constant and assumed to be perfectly parallelizable as it is farmed to N workers. The critical path time for all of the results to be received and processed is 2 + W/N + N.

The optimal number of workers depends heavily on the amount of work (relative to framework communication and message-handling overheads) to be performed. Minimizing the critical path time 2 + W/N + N with respect to N shows that the optimal number of workers is the square root of W. This result is important for framework services to properly assist in decomposing workloads. Performance results for different workloads are shown in Figure 42.

Figure 42: Embedded grid computing maximum speedup for different workloads as a function of the number of workers.
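
A quick numeric check of this model, using the critical path time 2 + W/N + N from Figure 41:

    #include <cmath>
    #include <iostream>

    int main() {
        double W = 100.0;  // perfectly parallelizable workload
        auto tcrit = [&](double N) { return 2.0 + W / N + N; };  // critical path time
        double Nopt = std::sqrt(W);  // optimal worker count = 10
        std::cout << "t(5)="   << tcrit(5.0)        // 2 + 20 + 5  = 27
                  << " t(10)=" << tcrit(Nopt)       // 2 + 10 + 10 = 22 (minimum)
                  << " t(20)=" << tcrit(20.0)       // 2 + 5 + 20  = 27
                  << " speedup=" << W / tcrit(Nopt) // roughly 4.5x
                  << "\n";
    }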

7.2 Advanced Decision Support

Simulations are often used in live tactical environments to rapidly predict the future based on the current Common Operational Picture (COP) and one or more plans or decisions that are under consideration. This is where optimistic event processing, and especially five-dimensional simulation technologies, really shine.

Tactical systems normally provide a COP that receives external data from a multitude of sources (sensors, intelligence, reconnaissance, etc.). It continually fuses received data into a best-estimate integrated picture of the current world. The COP changes dynamically over time as new information is received. Monte Carlo simulations explore the effectiveness of one or more plans, based on the current COP. As the COP evolves, the plans need to be continually reassessed.

Optimistic five-dimensional simulation offers a different approach to providing tactical decision support. The optimistic five-dimensional simulation ties its global virtual time to the wall clock while continually predicting the future as it runs in parallel. The simulation receives data in real time, where it is fused to form the COP. Only the affected parts of the simulation are rolled back as live data is received. Five-dimensional predictions are continually refined as the current estimate of the operational picture evolves over time.

Five-dimensional simulation continually refines its estimate of the COP with live data as it predicts virtually all of the possible future outcomes without having to periodically launch large numbers of Monte Carlo replications. Five-dimensional simulation automatically executes in parallel to achieve additional performance gains. This provides extremely rapid decision support for next-generation command and control systems. This is game-changing technology.

8.0 Integration with External Systems

Models running within the common framework must be able to integrate with external systems such as real-world services, legacy simulations or federations, hardware in the loop, and visualization or analysis tools. This section discusses some of the advanced communication and synchronization techniques for integrating OpenUTF applications with external systems.

8.1 Live, Virtual, and Constructive Simulation

The publish and subscribe data distribution and event services were carefully specified in the OpenUTF to be compatible with Live, Virtual, and Constructive (LVC) interoperability standards such as the High Level Architecture (HLA), Distributed Interactive Simulation (DIS), and the Test/Training Enabling Architecture (TENA). This means that high-speed gateways can be provided by the OpenUTF to automate LVC integration and interoperability. It also means that the OpenUTF itself could become a High Performance Computing Run Time Infrastructure (HPC-RTI) for HLA. This provides an incremental path for legacy federates to migrate to the OpenUTF.

8.2 Direct Integration using the EMF

A direct way to integrate an external system to a native OpenUTF application is through its External Modeling Framework (EMF), shown in Figure 43.

Figure 43: The EMF provides a remote interface between external applications and the OpenUTF. It coordinates two-way event and data exchanges in logical or real time. Received messages can be logged for later playback testing and/or analysis.

The EMF synchronizes time with the simulation in a robust manner using a specified lookahead value that allows both systems to process concurrently. It supports two-way publish and subscribe data distribution and event-scheduling capabilities. Its internal engine also supports all of the modeling constructs provided by the OpenUTF, so external applications, even those with their own simulation engines, can take advantage of existing OpenUTF models and services. The EMF further provides a bridge for migrating legacy models to the OpenUTF, and it can operate live or in playback mode from automatically generated log files.

The EMF is essentially a thin client that is connected to the central simulation. It is constructed as an object with interfaces to advance time, support two-way publish and subscribe services, and easily retrieve received data from its embedded federation object manager. All messages received by the EMF are time tagged and processed optimistically as events. Using optimized rollback and rollforward techniques, the EMF allows time to actually go forward and backward (see Figure 44). This is essential for supporting live or offline visualization and analysis capabilities.

Figure 44: EMF optimistic processing engine. Messages received from the OpenUTF simulation are processed optimistically as events. Rollback items are saved as each event is processed for rollback and rollforward capabilities.
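
The shape of such a thin client might resemble the following sketch; the Emf class and its methods are placeholders chosen for illustration, not the actual EMF interface.

    #include <iostream>

    // Placeholder for the EMF thin-client object described above.
    class Emf {
    public:
        void AdvanceTo(double t) { time_ = t; }  // process received events up to t
        void Rollback(double t)  { time_ = t; }  // roll state back for playback
        double Time() const      { return time_; }
    private:
        double time_ = 0.0;
    };

    int main() {
        Emf emf;
        emf.AdvanceTo(10.0);  // play forward
        std::cout << "t=" << emf.Time() << "\n";
        emf.Rollback(5.0);    // VCR-style reverse
        std::cout << "t=" << emf.Time() << "\n";
    }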

8.3 Web-based Services

Web-based services using XML standards such as the Simple Object Access Protocol (SOAP) and Web Service Definition Language (WSDL) are straightforward to support in the OpenUTF (see Figure 45). A service request made by an external system is received by the SOAP server and turned into a simple event that is injected into the OpenUTF. It is routed to the appropriate composite through publish and subscribe services and invokes the registered method of the object providing the service. The service produces an output (if the service is two-way) that goes back as a simple event to the SOAP server, where it is translated back into an XML message and sent to the external system that originally launched the request.

Figure 45: Integration of OpenUTF applications to external systems using web services or LVC standards.
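
The round trip can be sketched as follows, with every name a placeholder: the server looks up the registered service method (standing in for the simple event routed by publish and subscribe) and returns its output for translation back to XML.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    // Placeholder registry of service methods keyed by service name.
    std::map<std::string, std::function<std::string(const std::string&)>> registry;

    // In the real flow the request becomes a simple event routed by publish
    // and subscribe; here the registered method is invoked directly.
    std::string handleSoapRequest(const std::string& service, const std::string& body) {
        return registry.at(service)(body);
    }

    int main() {
        registry["Echo"] = [](const std::string& in) { return "<reply>" + in + "</reply>"; };
        std::cout << handleSoapRequest("Echo", "hello") << "\n";
    }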

8.4 HPC-RTI

Because the OpenUTF provides the same types of publish and subscribe data distribution and event scheduling services as HLA, it is straightforward to build an HLA High Performance Computing Run Time Infrastructure (HPC-RTI) using the OpenUTF, which then operates as both a simulation framework and an HLA RTI.

Figure 46: High Performance Computing Run Time Infrastructure (HPC-RTI).

What makes the HPC-RTI unique is that all services are built on top of a rigorous and repeatable optimistic simulation engine. This means that all HLA services (including Declaration Management, Data Distribution Management, and Ownership Management) can be time managed. No other RTI offers this capability. The HPC-RTI coordinates its internal optimistic processing with federates that process their events conservatively; nothing that could later be rolled back is ever transferred from the HPC-RTI to a federate. What is also unique about this approach is that models and services within composites can actually execute inside the HPC-RTI. This provides an incremental transition strategy for legacy applications to migrate their models and services into the OpenUTF.

There are many benefits to using the OpenUTF for supporting HLA federations. Benefits include faster and more scalable reliable communications, higher-performance time management, repeatable execution, true support for zero-lookahead event scheduling, time-managed operation for all services, portability on all mainstream operating systems and computing platforms, and a straightforward way to incrementally migrate models to the OpenUTF.

8.5 Future OpenUTF-based Federations

The OpenUTF does not require all models and services to be dynamically linked into a single executable. Collections of different OpenUTF executables could operate together as a federation using cluster mechanisms that group processors with their executable. This allows proprietary models and services that must reside on certain machines to work in a larger OpenUTF context. An OpenUTF federation can also integrate with HPC-RTI federates executing on individual nodes, external systems using the EMF, and more broadly with other LVC systems. All combinations are possible. Together, this provides a very powerful and flexible basis for software interoperability on all types of computing systems.

9.0 Visualization and Analysis

This section discusses various graphical visualization and analysis tools that are required for the OpenUTF.

9.1 Monitor and Control

The Monitor and Control (MC) visual tool allows operators to launch one or more simulation executions and then monitor and control their progress as they run. The MC tool can pause, resume, or kill the simulation. It can also pace the simulation. More than one MC tool can connect to the simulation while it is executing, which enables remote monitoring. Pauses are named and can be set for arbitrary future or current times, which allows multiple pauses to be simultaneously specified.

The tool receives and displays run time statistics as time evolves. Run time values include global virtual time, wall time, number of events processed, CPU time spent processing events, number of events committed, CPU time spent processing committed events, number of rollbacks, number of messages sent, number of antimessages sent, and critical path maximum speedup. Values can be displayed in a table and/or as plotted strip charts that evolve over time. Performance values are normally displayed as global summaries taken across all nodes. Node-specific performance values can also be displayed.

The MC tool can gather additional statistics at the end of the simulation run. Such statistics include (a) event-processing statistics by event type or by object type, (b) memory consumption by type, (c) message statistics by type, and (d) load balancing by node. These statistics can be extremely helpful in finding bottlenecks, memory leaks, message overheads, excessive rollbacks by event or object type, etc.

9.2 Trace Files

Trace files can optionally be turned on during run time to capture essential information for each event that is processed. This includes (1) the time of the event, (2) the type of event, (3) which composite the event is associated with, (4) rollback items generated by the event (optional), (5) time spent processing the event (optional), (6) all new events scheduled while processing the event, and (7) additional printed output generated while processing the event. A trace file analysis tool, shown in Figure 47, can be used to visualize event transactions occurring between composites, models, and services. Trace file analysis helps validate transactions, find problems with repeatability, isolate bugs, etc.

Figure 47: Trace file analysis of event processing between multiple objects over time.
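
One possible shape for such a per-event trace record, with field names invented to mirror the seven items listed above:

    #include <iostream>
    #include <string>
    #include <vector>

    struct TraceRecord {
        double time;                               // (1) time of the event
        std::string eventType;                     // (2) type of event
        std::string compositeId;                   // (3) associated composite
        int rollbackItems;                         // (4) rollback items (optional)
        double cpuSeconds;                         // (5) processing time (optional)
        std::vector<std::string> scheduledEvents;  // (6) events it scheduled
        std::string printedOutput;                 // (7) captured printed output
    };

    int main() {
        TraceRecord rec{10.5, "SensorScan", "Radar-3", 4, 0.002, {"TrackUpdate"}, ""};
        std::cout << rec.time << " " << rec.eventType
                  << " -> " << rec.scheduledEvents[0] << "\n";
    }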

9.3 Scenario Generation

Scenario generation involves several aspects. The first part is to define the types of composites, how many composites of each type to create, their hierarchical structure of models and services, and the individual model parameters that need to be set during initialization. The Vision graphical composability tool previously discussed provides this capability. More complicated is specifying the mission of each composite, its cognitive behaviors, and how it performs a series of individual tasks to reach goals.

A mission can be defined as a series of goals that need to be accomplished. Like routing a message in a network through multiple hops, missions are accomplished by one or more tasks executed in sequence. State tables can be constructed to indicate what the current task should be given the current state and desired goal. In the same way that message routing can be adaptive to congestion or faults in a network, tasks can also adapt to environmental stimuli such as surprise engagements with hostile systems. Rather than simply executing a scripted set of tasks for each goal, an entity can intelligently respond to unexpected circumstances using state machine and cognitive modeling techniques that reprioritize goals and/or choose alternative tasks that might still accomplish the goal.
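
A minimal sketch of such a state table follows; the states, goals, and tasks are invented for illustration.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    using Key = std::pair<std::string, std::string>;  // (current state, desired goal)

    int main() {
        std::map<Key, std::string> stateTable = {
            {{"enroute",   "strike-target"}, "fly-waypoint"},
            {{"engaged",   "strike-target"}, "evade-threat"},
            {{"at-target", "strike-target"}, "release-weapon"},
            {{"released",  "return-home"},   "fly-to-base"},
        };
        // Look up the next task given the current state and goal.
        std::cout << stateTable[{"engaged", "strike-target"}] << "\n";  // evade-threat
    }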

An example of this is an aircraft with a time-critical mission to drop a bomb on a high-value target and then fly back to its base. Suppose that while flying to the target, the aircraft encounters an unexpected enemy surface-to-air missile site. The aircraft has to decide what to do. Possibilities are (a) continue with the mission and hope that defenses are adequate, (b) route around the threat, which might impact getting to the target on time, (c) abort the mission and fly back to the base, or (d) destroy the threat and then continue to the target. This decision must take into account a number of factors that are complex and coupled. The cognitive model must mimic the decision making of the pilot, which is not the same as choosing the optimal solution. Perhaps a WVLM cognitive model of the pilot ought to be used to represent individual pilot personalities and decision-making.

Mission specification and cognitive decision-making are almost certainly the hardest part of scenario generation. This topic is an open field of research and development.

9.4 3D and 2D Visualization using the EMF

A generic 3D (see Figure 48) and 2D (see Figure 49) visualization tool provides interactive visualization of live scenarios. Offline playback is also supported using automatically generated log files. Because the EMF is able to roll state forward and backward, the visualization tool is able to display the exact state of the simulation at any point in time. A VCR-like controller is provided to play, fast forward, reverse, pause, or stop the playback. The visualization tool can maintain a 60 frame-per-second display rate, even for fairly large scenarios with many entities, because the EMF rollback framework is highly optimized.

Figure 48: 3D visualization of a missile defense scenario involving multistage missile launches, individually released multiple warheads per missile, decoys, ground-based interceptor sites with fire control, interceptor flyouts, space based sensors, ground-based radars, track fusion, and battle management.

Figure 49: 2D visualization showing a map of the USA.

9.5 Data Logging and the EMF

The EMF provides two mechanisms for capturing data generated by an OpenUTF simulation. First, the EMF supports publish and subscribe data distribution mechanisms to capture dynamically changing attributes in federation objects. Second, the EMF supports a data logging capability that allows a named collection of points to be logged together as a unit. During execution, the EMF can automatically log all of its received messages to a file for later playback or analysis.

9.6 Mathematical Analysis

As previously discussed, the EMF can log published data and named collections, which can be analyzed afterwards using an analysis tool, such as the one shown in Figure 50.

Figure 50: EMF Analysis Tool.

The EMF Analysis Tool (EAT) provides an interface that allows an analyst to plot values from published attributes in federation objects (which include time-based parameterized algorithms that can be computed at any time value) and from named collections, which are simply sets of name-value pairs logged at a point in time during event processing. Plots can be time-based, scatter plots, or histograms. Logical filters that act on logged data can be specified to remove unwanted data points. This allows analysts to perform guided data mining excursions to find needles in haystacks. Chi-squared curve fitting capabilities support a wide variety of statistical distributions and polynomials. Finally, analysts can plot new values defined by mathematical expressions using existing data.

Users can specify error bars on histograms, the time step size for time plots or for plots of federation objects, the number of histogram bins, plot symbol types, colors, etc. Collections of plot specifications can be saved and later loaded for making custom plots from multiple log files. This can save significant time when analyzing the results of multiple runs.

The OpenUTF also provides a C++ programming interface for directly creating plots from a running simulation or from EMF log files. This allows an SME programmer performing analysis to quickly create plots from code without using a graphical interface. Plots can also be constructed from a command-line utility, with arguments specifying the log file name, plot type, variables to plot, etc. In addition, a run time class input file can be used to automatically create plots from log files.

10.0 Technology Transition

Programs typically begin by using the best technology that is currently available and hope to transition newer and better technologies along the way. However, integration of new technology, especially when independently developed by different organizations, is often difficult because of conflicts with the program's fundamental design. Despite the best of intentions, technology transition efforts often fail.

The problem is that there is usually a technology gap (see Figure 51) that can be viewed in two ways: (1) as a capability gap (i.e., technology lags behind needed capabilities) or (2) as a time gap (i.e., technology is provided at a later time than when it was actually needed). To address the problem of technology transition, programs and broader user communities must continually invest in emerging technology and plan appropriately. Schedules must reflect feasible technology transition, especially in the startup period of new programs. Continual investments in technologies must be made, even before new programs are conceived. Advances in technology can also provide enabling capabilities that launch new programs.

Figure 51: Technology gap.

10.1 Why Programs Fail

Programs fail for a number of reasons. In retrospect, high-level management will often claim that the failure was not due to a lack of technology, but to other programmatic reasons. It is important to emphasize here that inadequate technology or improper use of technology can very well be the culprit in failed programs. This is why it is so important to establish the right technical framework and overall vision for M&S. Future systems must be scalable on modern multicore computing hardware, while promoting interoperability and reuse. Legacy systems must not be thrown away, but rather incrementally brought into the technical framework. The full life-cycle cost of future systems must be minimized.

Before a major program begins, calls for research and development proposals are normally announced and awarded. This step does not guarantee that the best ideas are awarded and/or matured in preparation for use in the program. Large contract awards tend to go to large organizations, which tend to use their own in-house technologies, processes, and tools. The not-invented-here syndrome begins to set in as the program is developed from the ground up. Technologists seem to always be on the outside trying to get their technologies into programs that are run by big organizations with their own agendas.

Organizations typically ramp up their engineering staff at the start of a new program. When new technologies are actually used, team members must be trained to learn how to work with the new technology. This can take a significant amount of time. In addition, the technology is often tailored for the program, which can also take a significant amount of time. So, what often happens during the first year of a program is that the technical staff is trained while the technology is being extended for use on the program. This can be quite wasteful on programs with large teams. Testers waiting for software to test have little to do during the program’s startup. Having a mature common technical framework with an already trained engineering team would save a significant amount of startup time. Training costs and new framework capabilities developed to support the program would apply to other efforts as well.

Actual software development begins after spending a significant amount of time training the engineering team and preparing the new technology for use. After enough software is developed, testers are finally able to contribute to the program. At some point in time, it is decided that the system is ready to be formally tested, verified, validated, possibly accredited, and ultimately fielded. At this point, programs often go into maintenance mode. Engineering teams are reduced to a skeleton staff. The most experienced engineers move on to other programs where the process repeats.

Performance issues, especially when considerations for multicore computing have been ignored, are usually an afterthought, making it almost impossible to re-architect the system for high-performance computing. As new requirements are added to the program, the software slowly becomes unwieldy to work with. Productivity goes down and eventually the program fails because it becomes too expensive to maintain, or it does not accomplish its function/performance goals. Again, this is why it is so important to establish the right vision for a common technical framework embraced and widely used across the software community.

All of this can be attributed to poor software technology planning, which perhaps can be attributed to programmatics. Nevertheless, the importance of providing a clear technical vision that accomplishes all of the desired goals cannot be overstated. The best-managed programs still need the right technology to succeed.

11.0 Standardization

Standards provide an important part of the roadmap. The common technical framework must work with existing Live, Virtual, and Constructive (LVC) simulation standards and web services that are commonly used by the community, while also creating new standards and preparing a roadmap for next-generation capabilities. This section discusses the high-level roadmap for standardizing the common technical framework.

11.1 Working with SISO

To prepare for the paradigm shift in modern multicore computing, the Simulation Interoperability Standards Organization (SISO) launched the Open Source Initiative for Parallel and Distributed Modeling & Simulation (OSI-PDMS) Study Group in 2007. The Parallel and Distributed Modeling and Simulation Standing Study Group (PDMS-SSG) began November 10, 2008 as a continuation of the OSI-PDMS. The PDMS-SSG goal is to study the subject of parallel and distributed M&S and to prepare the standards community for embracing scalable computing technology.

The PDMS-SSG initially provided an open forum for discussing parallel and distributed M&S across multiple programs. Over time, the PDMS-SSG shifted its focus to the refinement of three layered architectures that comprise what is now known as the Open Unified Technical Framework. Figure 52 shows the architecture layers of the OpenUTF. Each layer potentially makes use of layers below, but not above.

Figure 52: OpenUTF Architecture layers. Green layers indicate functionality provided by OpenMSA. Yellow layers indicate functionality provided by OSAMS. The orange layer represents functionality provided by OpenCAF. Finally, the blue layers represent models and services, including legacy systems and visualization/analysis tools.

The Open Modeling and Simulation Architecture (OpenMSA) is a layered architecture that focuses on the technology of scalable parallel and distributed computing for real-world, M&S, and T&E applications operating in abstract time. The OpenMSA also addresses interoperability with existing Live Virtual and Constructive (LVC) standards such as HLA, DIS, TENA, and XML-based web services such as the Simple Object Access Protocol (SOAP) and Web Service Definition Language (WSDL).

The Open System Architecture for Modeling and Simulation (OSAMS) focuses on the programming constructs and interoperability methodologies for developing composable plug-and-play models and services. OSAMS provides a Service Oriented Architecture (SOA) philosophy within an application executing in parallel on multicore computers. Composite objects are hierarchically composed of models and services that mutually (a) interact through abstract interfaces and (b) share data using a general-purpose data registry that is managed by the composite. Instances of composites are distributed across processors and mutually interoperate in parallel.

The Open Cognitive Architecture Framework (OpenCAF) provides an infrastructure for representing intelligent cognitive behavior. This includes (a) stimulus and thought triggering, (b) management of goal-oriented behavior, and (c) branch excursions to rapidly explore optimal decision-making. OpenCAF is essential for representing intelligent behaviors of semi-automated forces.

11.2 Open Source Reference Implementation

An open source reference implementation, known as the WarpIV Kernel, is provided to assist in the evaluation of the three architectures that comprise the OpenUTF. The reference implementation contains nearly 500,000 lines of code, representing nearly 25 years of research and development. It is anticipated that the PDMS-SSG will recommend standardizing these three architectures, layer by layer, in an incremental manner, while embracing the open source code base as the de facto community-supported implementation. Without an open source reference implementation, the cost and technology barriers would be prohibitive. In addition, it would be difficult for different organizations with competing professional and financial interests to adopt anything considered proprietary. All of the technologies provided by the open source reference implementation have been openly published in conferences and forums. No patents have been pursued.

The only current restriction on distribution comes from the United States Department of Commerce. Without going through the export license process, distribution of the WarpIV Kernel is confined to organizations within the United States. This restriction will probably be lifted in the future. Note that the export restriction does not prohibit international organizations from obtaining the software; they just have to go through the export license process. An alternative is for foreign organizations working jointly with the United States to obtain the software directly from the United States government.

Citizens of the United States can freely obtain a non-commercial license to the software by filling out a one-page information sheet and signing the license agreement, which essentially limits redistribution to government agencies only. The United States government has an unrestricted use license for the software.

11.3 Requirements Traceability

The roadmap outlines several key functional areas that must be provided by the OpenUTF for widespread use by the community. They are summarized in Figure 53. A much more involved requirements process is actually needed to specify detailed requirements.

Figure 53: High-level outline of required functionality for the OpenUTF.

Normally, the requirements process starts with defining capabilities of an actual system to be built. However, the OpenUTF must support a wide variety of applications. This means that system requirements must represent the needs of all possible systems using the OpenUTF. System requirements then must be mapped (traced) to technical framework requirements, which are mapped to architecture layers, modules, and standards. The architecture layers map to its implementation, which maps to verification and validation tests.

As shown in Figure 54, traceability works in both directions. Tests map back to the implementation. The implementation maps back to architecture layers, which map back to technical framework requirements, which map back to system requirements. In other words, middle layers in the requirements traceability matrix are tracked both from a higher level and a lower level.

Figure 54: Requirements traceability.

Managing a crosschecked requirements traceability matrix is important because it helps track tests, capability gaps, etc. For example, the OpenUTF currently provides approximately 200,000 lines of test code. The question is: what parts of the implementation are actually tested, and where are the gaps? By mapping test code back to the implementation, one can determine how well tested the implementation is and what still needs to be tested. It would also reveal resources wasted on redundantly testing the same code.

Similarly, the implementation layer can be mapped back to architecture layers to determine where capability gaps exist. One might question implemented capabilities that do not map to any architecture layer.

11.4 Common & Cross Cutting Working Group

Some oversight should be provided if the OpenUTF becomes a common framework for hosting interoperable models and services. Similar to FOM and SOM construction, this broad oversight should standardize abstract interfaces and shared data within a common data model. Without such oversight, interfaces and data definitions will proliferate, defeating the purpose of encapsulating models and services.

A Common & Cross Cutting Working Group (C&CC-WG) should be established to represent stakeholders in the development and usage of models and services across a wide spectrum of domains. The primary focus of this working group is to standardize data and interfaces. However, this body could also address conceptual modeling issues, semantics, capability gaps, performance issues, coordination of synergistic efforts, etc.

The C&CC-WG should also participate in the OpenUTF standards and technology development oversight to ensure that the modeling needs of their communities are satisfied and to provide helpful feedback on issues, capability gaps, etc.

11.5 Validation of Models and Services

The OpenUTF provides an approach towards building fully encapsulated models and services. This means that the same verification and validation tests could be used to test different implementations of the same type of model. This opens the possibility of new standards for conducting verification and validation tests of encapsulated black box models and services. A direct comparison of test results becomes possible.

For example, there could be a dozen different radar model implementations, each using the same data and interfaces, but with different internal algorithms. A common set of tests could measure the computed output results and run-time execution performance of each implementation to help determine proper usage of each implementation.

It is likely that a physics-based radar model will have very different inputs and outputs from a campaign-level radar model. The OpenUTF does not require all interfaces of similar models to be the same. The C&CC-WG must recognize when interfaces can be unified and when it does not make sense to do so.

Verification and validation regression tests for models and services must be organized in a manner that facilitates automatic testing of multiple implementations. Results should be analyzed and tabulated for potential users to review before choosing an implementation. Tests should assess and track functional correctness, robustness, computational speed, and validity. A model that suddenly runs significantly slower, or produces different results, should be flagged for review to make sure that an error has not crept into the code base.
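
A sketch of this flagging rule, with illustrative tolerances:

    #include <cmath>
    #include <iostream>

    struct RunResult { double output; double seconds; };

    // Flag a run whose timing or output drifts beyond tolerance from baseline.
    bool needsReview(const RunResult& baseline, const RunResult& current) {
        bool slower  = current.seconds > 1.5 * baseline.seconds;
        bool differs = std::fabs(current.output - baseline.output) >
                       1e-6 * std::fabs(baseline.output);
        return slower || differs;
    }

    int main() {
        RunResult base{42.0, 1.0}, now{42.0, 2.5};  // same output, much slower
        std::cout << std::boolalpha << needsReview(base, now) << "\n";  // true
    }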

12.0 Incremental Strategy

The roadmap provides an incremental strategy for migrating legacy systems to the common technical framework. The goal is for the widespread M&S community to transition from the current 'as is' state to the desired 'to be' state. This means that existing systems operating sequentially, perhaps supporting LVC interoperability with other systems, transition over time to the OpenUTF.

Characteristics of the 'to be' state are: (a) support for all abstract time execution modes, (b) scalable execution, (c) model and service interoperability, reuse, and composability, (d) operation on all mainstream computing platforms and operating systems, (e) significantly lower life cycle maintenance and operation costs, (f) more robust execution, and (g) verified and validated operation.

Three primary techniques are recommended for successfully accomplishing this transition. The first approach uses LVC standards to connect legacy systems to the OpenUTF. This strategy provides a stepping-stone that preserves existing functionality while developing native models and services for the OpenUTF. Once enough native OpenUTF models and services are developed, the legacy systems are no longer needed.

The second approach uses the HPC-RTI as a bridge to migrate models into the OpenUTF. Any legacy federate that is HLA-capable could easily integrate with the HPC-RTI and then run models and services residing in their legacy engine and the OpenUTF engine. The HPC-RTI bridge approach is viable, even when the legacy simulation operates in a standalone manner. It also provides a quick way for federates to self-federate to achieve parallel performance.

The third approach uses a four-stage strategy that is shown in Figure 55.

Figure 55: Four-stage incremental transition approach.

The first step develops middleware translation services that map legacy framework services to the OpenUTF. This is normally a straightforward task because the OpenUTF provides a very rich modeling framework that is almost always more capable than the legacy framework; this step would be considerably harder if the OpenUTF modeling framework were not as capable. The OpenUTF and middleware replace the legacy system's core infrastructure. At this point, the services of both frameworks are available to be used.

The second step refactors key models and services to begin utilizing the OpenUTF modeling constructs. The goal is to refactor software into native OpenUTF models and services that still operate in the legacy simulation, but could eventually be put into a repository for use in other OpenUTF simulations.

The third step begins to parallelize the legacy simulation using lookahead-based conservative (non-rollbackable) time management techniques. Legacy models and services are first grouped into composites that are distributed across processors. The composites may still use legacy modeling constructs. All global variables, tables, and direct method calls between composites must be eliminated, which for some legacy simulations would require significant effort.

The fourth step transforms all state variables within composites to use rollbackable data types for full support of optimistic event processing.
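
Conceptually, a rollbackable state variable records each prior value so that optimistic events can be undone. The sketch below illustrates the idea only; the WarpIV Kernel's actual rollbackable types are far more sophisticated.

    #include <iostream>
    #include <stack>

    template <typename T>
    class Rollbackable {
    public:
        explicit Rollbackable(T v) : value_(v) {}
        Rollbackable& operator=(T v) {
            history_.push(value_);  // save prior value in case of rollback
            value_ = v;
            return *this;
        }
        void Rollback() {           // undo the most recent assignment
            value_ = history_.top();
            history_.pop();
        }
        void Commit() {             // GVT passed: discard saved history
            while (!history_.empty()) history_.pop();
        }
        T Value() const { return value_; }
    private:
        T value_;
        std::stack<T> history_;
    };

    int main() {
        Rollbackable<int> fuel(100);
        fuel = 80;        // optimistic event burns fuel
        fuel.Rollback();  // event rolled back
        std::cout << fuel.Value() << "\n";  // 100
    }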

In most cases, only the first two stages are technically feasible and/or cost effective. The first two stages do not directly focus on scalable performance of the legacy simulation, but rather the transition to a common framework. If key models and services from mainstream legacy simulations followed the first two steps, the overarching repository would become populated with useful capabilities very quickly.

12.1 Filling Technology Gaps

The current open source reference implementation does not fully implement all of the capabilities of the OpenUTF. To get ahead of the technology curve, the most critical capability gaps need to be filled as quickly as possible before large teams begin developing native models and services.

The highest priority gaps include (a) completing and optimizing the publish and subscribe filtering services, (b) developing visual programming tools to simplify model and service coding efforts, (c) refining the visual model and service composability tools, (d) completing the HPC-RTI and gateways to LVC federations or web services, and (e) simplifying, extending, and optimizing the current embedded grid computing techniques.

12.2 Documentation and Training

Documentation and training materials for the OpenUTF need to be continually updated and made available to the user community. User documentation includes programming guides, software reference guides, design documentation, technical papers, and quick reference guides. Training materials include conceptual presentations and tutorials, training videos that guide new users through various features, hands-on training materials with lecture notes and workbooks, and a variety of code examples. Hands-on training events have already taken place at various locations within the United States.

12.3 Coordinated Tiger Team Approach

Perhaps the best way to proceed forward is to form a small tiger team comprised of subject matter experts in (a) the OpenUTF technology and (b) model and service developers across a wide spectrum of domains. The tiger team would optimally consist of approximately 20 people, all experts in their respective areas, and willing to flush out OpenUTF issues as gaps are being filled and capabilities are refined. This effort would probably take several years to complete. The result of this would be to provide a fully usable and vetted OpenUTF with a set of joint core models and services that provide the foundation of future systems.

The tiger team would also initially function as the first C&CC-WG. This will help lay out the charter, processes, and artifacts of the C&CC-WG.

While all of this is underway, legacy systems could also begin following the first two transition steps described in Figure 55, which will produce many more native models and services for future use. A full transition to the OpenUTF is probably possible within a five-to-ten-year period. At that point, it is expected that commodity CPU chips will come with hundreds, if not thousands, of cores.

12.4 Formation of a Model & Service Repository

As native models and services are developed for the OpenUTF, the first issue that comes up is where to put them and how they will be managed. Some models and services will be proprietary, while others will be open source or government owned. Some models and services might be classified. A configuration management framework for freely available models and services with source code must be established that accommodates all of these considerations. Proprietary model and service packages should be maintained by their own organizations. Standards must be established for documenting and installing packages on user systems.

Releases of the core OpenUTF, its models, services, and tools must support rapid update cycles. Power users must be able to obtain the latest capabilities, before formal releases are made available. Bug tracking, new feature requests, etc., must all be provided in the configuration management framework. The verification and validation pedigree of each model and service also needs to be maintained so that users will know their intended usability limitations.

13.0 Summary and Conclusions

The roadmap addressed the important issues for providing a common framework for modeling and simulation. This framework, known as the Open Unified Technical Framework (OpenUTF), addresses both (1) scalable performance on multicore computers and (2) interoperability and reuse of models and services.

The roadmap first highlighted how the multicore revolution demands a new paradigm for software development. Computationally intensive software must begin taking advantage of parallel computing techniques to harness the processing power of modern CPUs. However, developing parallel computing software can become complex and difficult to maintain. The solution is to provide all of the low-level details of parallel computing within a common framework. If done right, a common framework can provide support for scalable multicore computing, while also lowering software development costs.

The roadmap then provided a brief overview of parallel and distributed computing, including a discussion of the most common performance issues. It then extended the discussion to the types of communication services and time management schemes that are necessary to support scalable computing in abstract time. In particular, the discussion showed that only optimistic time management techniques support all execution modes without constraint and should therefore be provided at the core of the OpenUTF. Other techniques (e.g., conservative, time-stepped, sequential, five-dimensional, scaled real-time, hard real-time) are easily derived from this core capability.
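To make the optimistic approach concrete, the minimal C++ sketch below shows speculative event execution with state checkpointing, so that a straggler (an event arriving with an earlier time stamp) can trigger a rollback. All class and method names here are illustrative and are not the actual WarpIV Kernel API.

    // Minimal sketch of optimistic synchronization: events execute
    // speculatively and state is checkpointed so a straggler event can
    // roll the entity back before reprocessing in correct time order.
    #include <iostream>
    #include <map>

    class OptimisticEntity {
      int state_ = 0;
      std::map<double, int> checkpoints_;  // state snapshot before each event
     public:
      void Process(double t, int delta) {
        checkpoints_[t] = state_;  // save state so the event can be undone
        state_ += delta;           // speculative execution
      }
      void Rollback(double t) {
        auto it = checkpoints_.lower_bound(t);  // first event at or after t
        if (it != checkpoints_.end()) {
          state_ = it->second;                  // restore earliest saved state
          checkpoints_.erase(it, checkpoints_.end());
        }
      }
      int State() const { return state_; }
    };

    int main() {
      OptimisticEntity e;
      e.Process(10.0, +5);
      e.Process(20.0, +7);  // speculative; may later be undone
      e.Rollback(15.0);     // straggler at t=15 arrives: undo events >= 15
      e.Process(15.0, +1);  // reprocess in correct time order
      e.Process(20.0, +7);
      std::cout << "final state = " << e.State() << "\n";  // prints 13
      return 0;
    }

A real optimistic kernel must also cancel any messages the rolled-back events sent, but the checkpoint-and-restore pattern above is the essential mechanism.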

The roadmap then shifted to model and service composability issues, with the goal of maximizing interoperability and reuse while promoting scalable operation. The roadmap introduced the notion of composites that are hierarchically composed of encapsulated models and services, which are distributed across processors. Models and services are described by metadata that specifies initialization parameters, data provided, data consumed, data published, data subscribed to, interfaces provided, and interfaces invoked. Using these mechanisms, drag-and-drop composability of complex systems of systems is possible, leading to the ability to execute fully documented architectures.
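The sketch below illustrates how such metadata might be represented and checked before composition: a tool can verify that every subscription is satisfied by some publisher before wiring components together. The ModelMetadata type and the Compatible check are hypothetical, shown only to make the idea concrete.

    // Illustrative sketch of model metadata driving composability checks.
    // The field names mirror the categories described above but are not
    // part of any OpenUTF standard.
    #include <iostream>
    #include <string>
    #include <vector>

    struct ModelMetadata {
      std::string name;
      std::vector<std::string> initParameters;    // initialization parameters
      std::vector<std::string> dataPublished;     // topics this model publishes
      std::vector<std::string> dataSubscribed;    // topics this model consumes
      std::vector<std::string> interfacesProvided;
      std::vector<std::string> interfacesInvoked;
    };

    // A drag-and-drop tool could verify that every subscription is
    // satisfied by some publisher before composing the system.
    bool Compatible(const ModelMetadata& consumer,
                    const std::vector<ModelMetadata>& system) {
      for (const auto& need : consumer.dataSubscribed) {
        bool found = false;
        for (const auto& m : system)
          for (const auto& pub : m.dataPublished)
            if (pub == need) found = true;
        if (!found) return false;
      }
      return true;
    }

    int main() {
      ModelMetadata sensor{"Sensor", {"ScanRate"}, {"Track"}, {}, {"GetTracks"}, {}};
      ModelMetadata shooter{"Shooter", {}, {}, {"Track"}, {}, {"GetTracks"}};
      std::cout << (Compatible(shooter, {sensor}) ? "composable" : "missing inputs")
                << "\n";
      return 0;
    }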

The roadmap also described some of the important modeling concepts involving the data registry and abstract interface management provided by composites. Then, the roadmap described how local events, processes, and simple events could be used to schedule computations in abstract time. Cognitive modeling concepts were highlighted. Publish and subscribe services with scalable filtering were also reviewed.
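As a simple illustration of scheduling computations in abstract time, the sketch below queues events by time stamp and executes them in order. The names and structure are illustrative only; the actual OpenUTF event-scheduling interfaces differ.

    // Bare-bones sketch of abstract-time event scheduling: events are
    // queued by time stamp and executed in time order, independent of
    // wall-clock time.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Event {
      double time;
      std::function<void()> action;
      bool operator>(const Event& other) const { return time > other.time; }
    };

    int main() {
      std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue;
      queue.push({5.0, [] { std::cout << "t=5: update position\n"; }});
      queue.push({2.0, [] { std::cout << "t=2: perform sensor scan\n"; }});
      while (!queue.empty()) {  // execute events in abstract-time order
        queue.top().action();
        queue.pop();
      }
      return 0;
    }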

The roadmap then shifted back to advanced performance techniques involving different ways to represent entities using composites. In particular, the embedded grid computing technique was shown to extend parallelism of computationally intensive events, while simultaneously addressing stochastic load balancing issues that commonly plague large systems.
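The following sketch conveys the embedded-grid idea using plain C++ threads as a stand-in: a computationally intensive event spreads its work across several workers and gathers the partial results, so one heavy event does not serialize the node that scheduled it. This is a conceptual illustration under that assumption, not the OpenUTF's actual mechanism.

    // Conceptual sketch: a heavy event farms its work out to several
    // workers and combines the results. std::async is used here only as
    // a stand-in for the embedded grid computing technique.
    #include <future>
    #include <iostream>
    #include <vector>

    double HeavyChunk(int begin, int end) {
      double sum = 0.0;  // stand-in for an expensive computation
      for (int i = begin; i < end; ++i) sum += 1.0 / (1.0 + i);
      return sum;
    }

    int main() {
      const int n = 1000000, workers = 4, chunk = n / workers;
      std::vector<std::future<double>> parts;
      for (int w = 0; w < workers; ++w)  // spread the event's work
        parts.push_back(std::async(std::launch::async, HeavyChunk,
                                   w * chunk, (w + 1) * chunk));
      double total = 0.0;
      for (auto& p : parts) total += p.get();  // gather partial results
      std::cout << "result = " << total << "\n";
      return 0;
    }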

The roadmap next discussed several techniques for integrating OpenUTF applications with external systems, which could be legacy systems, hardware-in-the-loop, or visualization/analysis tools. Three techniques were highlighted: (1) use gateways to work with existing Live, Virtual, and Constructive (LVC) or web standards, (2) directly integrate external systems using the External Modeling Framework (EMF), and (3) integrate legacy federates using the High Performance Computing Run Time Infrastructure (HPC-RTI).

The roadmap then described several tools for monitoring and controlling the execution of the simulation, analyzing trace files, developing scenarios, visualizing the simulated system in two and three dimensions, logging data for offline analysis, and performing mathematical analysis.

A brief discussion on technology transition showed how most programs face a technology gap: either the capability is not there when needed, or it is provided too late. This motivates preparing the OpenUTF now for anticipated widespread use. An important part of the roadmap is to work with the Simulation Interoperability Standards Organization (SISO) to develop requirements and new standards based on the three synergistic architectures (OpenMSA, OSAMS, and OpenCAF) of the OpenUTF. An open source reference implementation, known as the WarpIV Kernel, is freely provided to the community; however, an export license is required for use outside the United States. The formation of a Common & Cross Cutting Working Group will be needed to provide general oversight of common interfaces and shared data between models and services. Also, new standards, processes, and techniques for verifying and validating models and services will emerge as similar encapsulated models become able to share common black-box tests.

Finally, the roadmap laid out several incremental transition strategies for migrating legacy systems to the OpenUTF. Perhaps the two most interesting techniques are to (1) use the HPC-RTI as a bridge to migrate legacy HLA federates to the OpenUTF, and (2) apply the first two steps of the four-step migration plan, which first replaces the legacy engine with the OpenUTF engine using middleware interfaces and then incrementally refactors key models to utilize the recommended modeling constructs of the OpenUTF. The formation of a coordinated tiger team was suggested to allow a small team of subject matter experts to work out the technical issues of the OpenUTF as its current capability gaps are filled and/or enhancements are made. The small team also forms the initial C&CC-WG, which will help establish its charter, policies, and procedures.

A technically feasible roadmap has been provided for establishing the OpenUTF as a common framework. This roadmap is intended to help decision-makers begin the process of making this happen.

Additional Resources on the OpenUTF

1. Parallel and Distributed Modeling & Simulation Terms of Reference (TOR), November 10, 2008.

2. Steinman, Jeff, Craig Lammers, Maria Valinski, and Wendy Steinman. 2009. “Open Unified Technical Framework Volume 1: OSAMS, User’s Guide for Model Developers.” WarpIV Technologies, Inc.

3. Steinman, Jeff, Craig Lammers, Maria Valinski, and Wendy Steinman. 2010. “Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) DRAFT REPORT Volume 1: PDMS Technology.” Submitted to the SISO Standards Activities Committee (SAC), 14 April 2010.

4. For additional literature on the OpenUTF and the WarpIV Kernel, see http://www.warpiv.com/Literature.html.

Author Biography

JEFFREY S. STEINMAN, President and CEO of WarpIV Technologies, Inc., received his Ph.D. in High Energy Physics from UCLA in 1988, where he studied the quark structure function of high-energy virtual photons at the Stanford Linear Accelerator Center. Dr. Steinman was the creator of the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES), has published more than 75 papers and articles in the field of high performance computing, holds five U.S. patents in high performance M&S technology, is a frequently invited speaker at conferences, seminars, and forums, and is the principal designer/developer of the WarpIV Kernel Reference Implementation of the OpenMSA, OSAMS, and OpenCAF architectures that make up the Open Unified Technical Framework (OpenUTF). Dr. Steinman is currently the Chair of the Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG), which is governed by the Simulation Interoperability Standards Organization (SISO).