PDMS – 2 Hour Tutorial

  • PDMS 2 Hour Tutorial 1

  • Multicore computing revolution: the need for change

    Proposed Open Unified Technical Framework (OpenUTF) architecture standards: OpenMSA, OSAMS, and OpenCAF as future standards

    Introduction to parallel computing: programming models; High Speed Communications (HSC) through shared memory

    Synchronization and Parallel Discrete Event Simulation (PDES): event management; time management

    Open discussion

    PDMS 2 Hour Tutorial 2

  • The future of computing is MULTICORE

    PDMS 2 Hour Tutorial 3

  • PDMS 2 Hour Tutorial 4

    "I skate to where the puck is going to be, not where it has been!" (Wayne Gretzky)

  • Performance wall: clock speed and power consumption, memory access bottlenecks, limits of single-instruction-level parallelism

    Multiple processors (cores) on a single chip are the future. There is no foreseeable limit to the number of cores per chip, but exploiting them requires software to be written differently.

    Supercomputing community consensus: low-level parallel programming is too hard. Threads, shared memory, locks/semaphores, race conditions, repeatability, etc., are too hard and expensive to develop and debug (fine-grained HPC is not for the average programmer).

    Message passing is much easier but can be less efficient. High-level approaches, tools, and frameworks are needed (OpenUTF, new compilers, languages, math libraries, memory management, etc.).

    PDMS 2 Hour Tutorial 5

  • [Figure: hardware hierarchy, from nodes (cores) on a chip, to chips on a board, to boards in a computer/blade/cluster, up to cloud computing, net-centric/GIG systems of systems]

    PDMS 2 Hour Tutorial 6

    The world of computing is rapidly changing and will soon demand new parallel and distributed service-oriented programming methodologies and technical frameworks.

    Experts say that parallel and distributed programming is too hard for normal development teams; the Open Unified Technical Framework abstracts away the low-level programming details.

  • Microsoft: sponsor of the by-invitation-only 2007 Manycore Computing Workshop, which brought together the who's who of supercomputing. There was unanimous consensus on the need for multicore computing software tools and frameworks for developers (e.g., OpenUTF).

    Apple: Snow Leopard will have no new features (the focus is on multicore computing). The next version of Apple's OS X operating system will include breakthroughs in programming parallel processors, Apple CEO Steve Jobs told The New York Times in an interview after this week's Worldwide Developers Conference. "The way the processor industry is going is to add more and more cores, but nobody knows how to program those things," Jobs said. "I mean, two, yeah; four, not really; eight, forget it."

    http://bits.blogs.nytimes.com/2008/06/10/apple-in-parallel-turning-the-pc-world-upside-down/

    PDMS 2 Hour Tutorial 7

  • Next generation chips: Intel has disclosed details on a chip that will compete directly with Nvidia and ATI and may take it into uncharted technological and market-segment waters. Larrabee will be a stand-alone chip, meaning it will be very different from the low-end (but widely used) integrated graphics that Intel now offers as part of the silicon that accompanies its processors. And Larrabee will be based on the universal Intel x86 architecture.

    The number of cores in each Larrabee chip may vary according to market segment. Intel showed a slide with core counts ranging from 8 to 48, claiming performance scales almost linearly as more cores are added: that is, 16 cores will offer twice the performance of eight cores.

    http://i4you.wordpress.com/2008/08/05/intel-details-future-larrabee-graphics-chip

    PDMS 2 Hour Tutorial 8

  • Next generation chips

    Intel touts 8-core Xeon monster (Nehalem-EX)

    Intel gave a demo yesterday of its eight-core, 2.3 billion-transistor Nehalem-EX, which is set to launch later this year. Nehalem-EX has up to 8 cores, giving a total of 16 threads per socket.

    By Jon Stokes | Last updated May 28, 2009 8:25 AM CT

    http://arstechnica.com/hardware/news/2009/05/intel-touts-8-core-xeon-monster.ars

    PDMS 2 Hour Tutorial 9

  • COMPOSABLE SYSTEMS: the Open Unified Technical Framework (OpenUTF)

    PDMS 2 Hour Tutorial 10

    [Figure: OpenUTF Kernel hosting Model Components and Service Components]

  • Simulation is not as cost effective as it should be; we need to do things differently. Revolutionary, not evolutionary, change!

    The multicore computing revolution demands a change in software development methodology and a standardized framework.

    New architecture standards: we should be building models, not simulations.

    Model and Service components developed in a common framework automate integration for Test and Evaluation.

    Verification and Validation need a common test framework with standard processes.

    Open source overcomes the technology/cost barrier and supports widespread community involvement.

    PDMS 2 Hour Tutorial 11

  • PDMS 2 Hour Tutorial 12

    [Figure: timescale axis spanning 1 ns, 10 ns, 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms]

  • Requires assessment of the current state: existing tools, technologies, methodologies, data models, existing interfaces, policies, requirements, business models, contract language, lessons learned, impediments to progress, etc.

    Requires the right vision for the future: lowered costs, better quality, faster end-to-end execution, easier to use and maintain, feasible technology, optimal use of workforce skill sets, multiuse concepts, composability, modern computational architectures, multiplatform, net-centric, etc.

    Requires an executable transition strategy: incremental evolution, risk reduction, phased capability, accurately assessed transition costs, available funding, prioritization, community buy-in and participation, formation of new standards.

    PDMS 2 Hour Tutorial 13

  • 1. Engine and Model Separation
    2. Optimized Communications
    3. Abstract Time
    4. Scheduling Constructs
    5. Time Management
    6. Encapsulated Components
    7. Hierarchical Composition
    8. Distributable Composites
    9. Abstract Interfaces
    10. Interaction Constructs
    11. Publish/Subscribe
    12. Data Translation Services
    13. Multiple Applications
    14. Platform Independence
    15. Scalability
    16. LVC Interoperability Standards
    17. Web Services
    18. Cognitive Behavior
    19. Stochastic Modeling
    20. Geospatial Representations
    21. Software Utilities
    22. External Modeling Framework
    23. Output Data Formats
    24. Test Framework
    25. Community-wide Participation

    PDMS 2 Hour Tutorial 14

  • OpenMSA (Layered Technology): focuses on parallel and distributed computing technologies. Modularizes technologies through a layered architecture; contains OSAMS and OpenCAF. Proven technologies based on experience with large programs. A cost-effective strategy for developing scalable computing technology. Provides interoperability without sacrificing performance. Facilitates sequential, parallel, and distributed computing paradigms.

    OSAMS (Model/Service Composability): focuses on interfaces and a software development methodology to support highly interoperable plug-and-play model/service components. Provided by OpenMSA but could be supported by other architectures.

    OpenCAF (Cognitive Intelligent Behavior): thoughts and stimulus, goal-oriented behaviors, decision branch exploration, five-dimensional excursions. Provided as an extension to OSAMS.

    PDMS 2 Hour Tutorial 15

  • [Figure: OpenUTF layer diagram]
    OpenCAF: behaviors, cognitive thought processes, 5D simulation, goal-oriented optimization
    OSAMS: modularity, composability, interoperability, flexibility, programming constructs, VV&A
    OpenMSA: open source technology, HPC/multicore performance, synchronization
    OpenUTF: architecture standards, net-centricity, data models
    Related elements: HPC, network, scheduling, modeling framework, services, models, behavior representation, cognitive rule triggering, Bayesian branching, goals and state machines, decision support, composites, pub/sub services, LVC interoperability, web-based SOA

    PDMS 2 Hour Tutorial 16

  • [Figure: OpenMSA layered architecture, bottom to top]
    Operating System Services, Threads
    General Software Utilities (OSAMS), ORB Network Services
    Internal High Speed Communications, External Distributed Communications, Rollback Framework
    Rollback Utilities (OSAMS), Persistence (OSAMS)
    Standard Template Library (OSAMS), Event Management Services
    Time Management, Standard Modeling Framework (OSAMS, OpenCAF)
    Distributed Simulation Management Services (OSAMS Pub/Sub Data Distribution), SOM/FOM Data Translation Services
    External Modeling Framework (EMF) & Distributed Blackboard
    Gateway Interfaces (HLA, DIS, TENA, Web-based SOA), HPC-RTI Bridge
    Model & Service Component Repository, Entity Composite Repository, CASE Tools
    Direct Federate, Abstract Federate, HLA Federate
    LVC Federation & Enterprises, External System Visualization/Analysis

    PDMS 2 Hour Tutorial 17

  • [Figure: OpenCAF reasoning engine]
    Data received: federation objects and/or interactions
    Stimulus - perception (short-term memory)
    Thoughts 1 through N triggered by stimulus
    Data processing: behaviors, tasks, notifications, abstract methods, uncertainty
    Prioritized goals; state, action & task management; tasks (5D branching)

    PDMS 2 Hour Tutorial 18

  • Left-brain reasoning (rule-based)
    Inputs are ints, doubles, or Booleans
    Inputs are prioritized when they are associated with RBRs
    Inputs can be fed into multiple reasoning nodes
    Outputs can be inputs to other reasoning nodes
    Feedback loops are permitted
    [Figure: inputs W, X, Y, Z feeding reasoning nodes A, B, C]

    Based on the OpenUTF Kernel Sensitivity List: sensitive variables (stimulus) are registered with sensitive methods (thoughts). Thoughts are automatically triggered whenever a registered stimulus is modified. Thoughts can modify other stimulus to trigger additional thoughts. Reasoning terminates when the solution converges or when the maximum number of thoughts is reached, as in the sketch below.

    PDMS 2 Hour Tutorial 19
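    A minimal C++ sketch of how such a sensitivity list could be wired up. The class, method, and variable names (SensitivityList, Register, Modify, maxThoughts) are hypothetical; the slides do not show the actual OpenUTF kernel API.

        #include <functional>
        #include <map>
        #include <string>
        #include <vector>

        class SensitivityList {
        public:
            using Thought = std::function<void(SensitivityList&)>;

            // Register a thought (sensitive method) against a stimulus variable.
            void Register(const std::string& stimulus, Thought thought) {
                thoughts[stimulus].push_back(std::move(thought));
            }

            // Modify a stimulus; every registered thought fires, and a thought may
            // modify other stimulus, triggering additional thoughts in turn.
            void Modify(const std::string& stimulus, double value) {
                values[stimulus] = value;
                if (++thoughtCount > maxThoughts) return;   // terminate at max thoughts
                for (auto& t : thoughts[stimulus]) t(*this);
            }

            double Get(const std::string& stimulus) const {
                auto it = values.find(stimulus);
                return it == values.end() ? 0.0 : it->second;
            }

        private:
            std::map<std::string, std::vector<Thought>> thoughts;
            std::map<std::string, double> values;
            int thoughtCount = 0;
            int maxThoughts  = 1000;   // convergence guard
        };

    For example, registering a thought against a "temperature" stimulus and then calling Modify("temperature", 451.0) would fire that thought, which may in turn modify other stimulus and trigger further thoughts until the chain dies out or the thought limit is reached.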

  • Learned reasoning (training-based)
    Inputs are ints, doubles, or Booleans
    The TBR is trained and then used to produce outputs (it can be continually trained during execution)
    Inputs can be fed into multiple reasoning nodes
    Outputs can be inputs to other reasoning nodes
    Feedback loops are permitted
    [Figure: inputs W, X, Y, Z feeding reasoning nodes A, B, C]

    PDMS 2 Hour Tutorial 20

  • Right-brain reasoning
    Inputs are normalized, weighted, and summed, with the weights summing to one:
        wW + wX + wY + wZ = 1
    The sum is multiplied by the product of the selected thresholds to produce the output:
        A = (wW*W + wX*X + wY*Y + wZ*Z) * TW2 * TX1 * TY1 * TZ3
    The output is normalized
    Inputs can be fed into multiple reasoning nodes
    Outputs can be inputs to other reasoning nodes
    Feedback loops are permitted
    [Figure: inputs W, X, Y, Z with threshold curves (TW1, TW2, TW3; TX1, TX2; TY1; TZ1, TZ2, TZ3) feeding output A]

    PDMS 2 Hour Tutorial 21
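    A small C++ sketch of a single right-brain node under these definitions; the function and parameter names are illustrative, not part of the OpenCAF API.

        #include <cstddef>
        #include <numeric>
        #include <vector>

        // One right-brain reasoning node: normalized weights, weighted sum of the
        // inputs, scaled by a product of threshold factors assumed to lie in [0,1].
        double RightBrainNode(const std::vector<double>& inputs,      // W, X, Y, Z in [0,1]
                              std::vector<double> weights,            // relative importance
                              const std::vector<double>& thresholds)  // e.g. TW2, TX1, TY1, TZ3
        {
            // Normalize the weights so they sum to 1.
            double wsum = std::accumulate(weights.begin(), weights.end(), 0.0);
            for (double& w : weights) w /= wsum;

            // Weighted sum of the normalized inputs.
            double a = 0.0;
            for (std::size_t i = 0; i < inputs.size(); ++i) a += weights[i] * inputs[i];

            // Multiply by the product of the selected thresholds.
            for (double t : thresholds) a *= t;
            return a;   // remains in [0,1] when inputs and thresholds are in [0,1]
        }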

  • Arbitrary graphs can be constructed from Rules, Neural Nets, and Emotions

    Outputs of graphs can trigger changes to behaviors by reprioritizing goals

    Behaviors are only triggered once reasoning is completed

    [Figure: graph combining Emotion Based Reasoning, Training Based Reasoning, and Rule Based Reasoning nodes]

    PDMS 2 Hour Tutorial 22

  • PDMS 2 Hour Tutorial 23

    [Figure: transition from monolithic applications (collections of hardwired services) and simulations (collections of hardwired models) to composable plug-and-play systems built on the OpenUTF Kernel: service components, model components, abstract interfaces, and a V&V test framework, within a net-centric enterprise framework (composable systems, LVC, Web, GCCS, data visualization)]

  • PDMS 2 Hour Tutorial 24

  • PDMS 2 Hour Tutorial 25

    Reusable software components; plug-and-play composability; conceptual model interoperability; pub/sub data distribution and abstract interfaces; V&V test framework; performance benchmarks

    Parallel and distributed operation; scalable run-time performance; platform/OS independence; OpenMSA: technology; OSAMS: modeling constructs; OpenCAF: behavior representation

    Composable systems; LVC (HLA, DIS, TENA); Web Services (SOA); data model; C4I/GCCS; visualization and analysis

  • OpenUTF Kernel

    PDMS 2 Hour Tutorial 26

    [Figure: a composable system built from plug-and-play model/service components on the OpenUTF Kernel]

    Net-centric operation: enterprise frameworks, command and control, standard data models

    Legacy interoperability: distributed federation; training, analysis, test; FOM/SOM

    Standalone operation: laptops, desktops, clusters, HPC; pub/sub data distribution

  • Transparently hosts hierarchical services using the same interfaces as model components

    SOAP interface connects services to external applications

    Collections of related services are dynamically configured and distributed across processors on multicore systems

    Services internally communicate through pub/sub services and decoupled abstract interfaces

    Seamlessly supports LVC integration

    PDMS 2 Hour Tutorial 27

    [Figure: composite net-centric system on a multicore computer. Each composite shows subscribed data received, published data provided, abstract services provided, and abstract services invoked. Services communicate through pub/sub data exchanges and abstract interfaces; composites are distributed across processors to achieve parallel performance. Web services and an LVC interface connect the dynamically configured structure to net-centric SOA/LVC on networks of single-processor and multicore computers.]

  • PDMS 2 Hour Tutorial 28

    [Figure: OpenUTF component repository and global installation & make system. The repository holds Models (DAS, ETS, T&D, Weather, CCSI, ATP-45), Services, Interfaces (polymorphic methods, interactions, federation objects, XML interfaces, web services), Source/Include/Library trees, and Tests (verification, validation, and benchmarks for each component), all built on the OpenUTF Kernel (roughly 320,000 lines of code).]

    General concept: government-maintained software configuration management; automatic platform-independent installation & make system; test framework (verification, validation, and benchmarks); will seamlessly support mainstream interoperability standards; designed for secure community-wide software distribution.

  • PDMS 2 Hour Tutorial 29

    [Figure: OpenUTF ecosystem around the OpenUTF Kernel: LVC interoperability standards, web standards, models, services, V&V test framework, data & interfaces, development tools, composability tools, visualization tools, analysis tools]

  • INTRODUCTION TO PARALLEL COMPUTING

    PDMS 2 Hour Tutorial 30

    [Figure: interconnect topologies]
    16-node hypercube topology: log2(N) worst-case hops
    2D mesh topology: (m+n) worst-case hops
    3D mesh topology: (l+m+n) worst-case hops

  • PDMS 2 Hour Tutorial 31

    [Figure: parallel process cycle. After startup, each node (Node 0 through Node N-1) repeats Initialize, Compute, Communicate, and Store Results (to a file) in its process cycle.]

  • Parallel computing vs. distributed computing

    Parallel computing maps computations, data, and/or object instances within an application onto multiple processors to obtain scalable speedup. It normally occurs on a single multicore computer, but it can operate across multiple machines. The entire application crashes if one node or thread crashes.

    Distributed computing interconnects loosely coupled applications within a network environment to support interoperable execution. It normally occurs on multiple networked machines, but it can operate on a single multicore computer. Dynamic connectivity supports fault tolerance but loses scalability.

    Speedup(N) = T1 / TN
    Efficiency(N) = Speedup(N) / N
    RelativeEfficiency(M, N) = (M / N) * [Speedup(N) / Speedup(M)]

    PDMS 2 Hour Tutorial 32
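    These definitions translate directly into code; the timings in the comment below are invented purely for illustration.

        // Scaling metrics from the definitions above.
        double Speedup(double t1, double tN)           { return t1 / tN; }
        double Efficiency(double t1, double tN, int n) { return Speedup(t1, tN) / n; }

        // Example (made-up numbers): T1 = 100 s and T16 = 8 s give
        // Speedup(16) = 12.5 and Efficiency(16) = 0.78.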

  • Time driven (or time stepping) is the simplest approach:

    for (double time = 0.0; time < END_TIME; time += STEP) {
        UpdateSystem(time);
        Communicate();
    }

    The discrete event approach (or event stepping) manages activities within the system more efficiently. Events occur at a point in time and have no duration. Events do not have to correspond to physical activities (pseudo-events). Events occur for individual object instances, not for the entire system. Events, when processed, can modify state variables and/or schedule new events.

    Parallel discrete event simulation offers unique synchronization challenges.

    PDMS 2 Hour Tutorial 33
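    By contrast, a minimal event-stepping loop might look like the following C++ sketch (illustrative only, not the OpenUTF event API): events are pulled from a priority queue in time order, and processing an event may schedule new ones.

        #include <functional>
        #include <queue>
        #include <vector>

        struct Event {
            double time;                   // events occur at a point in time, no duration
            std::function<void()> action;  // may modify state and/or schedule new events
            bool operator>(const Event& rhs) const { return time > rhs.time; }
        };

        // Earliest event first.
        std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pending;

        void Run(double endTime) {
            while (!pending.empty() && pending.top().time <= endTime) {
                Event e = pending.top();
                pending.pop();
                e.action();                // processing may push new events onto 'pending'
            }
        }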

  • Distributed net-centric computing: programs communicate through a network interface (TCP/IP, HTTPS, SOA and Web Services, Client/Server, CORBA, Federations, Enterprises, Grid Computing, NCES, etc.)

    Parallel multicore computing: processors communicate directly through high-speed mechanisms (threads, shared memory, message passing)

    PDMS 2 Hour Tutorial 34

    [Figure: progression from a sequential program to multithreaded, shared memory, and message passing variants]

  • PDMS 2 Hour Tutorial 35

    [Figure: three shared-memory servers connected through a cluster server, each running part of a parallel application]

  • Startup and terminate: forks processes; cleans up shared memory

    Miscellaneous services: node info, shared memory tuning parameters, etc.

    Synchronization: hard and fuzzy barriers

    Global reductions: Min, Max, Sum, Product, etc.; performance statistics; can support user-defined operations

    Synchronized data distribution: broadcast, scatter, gather, form matrix

    Asynchronous message passing: unicast, destination-based multicast, broadcast; automatic or user-defined memory allocation; up to 256 message types

    Coordinated message passing: patterned after the Crystal Router; synchronized operation guarantees all messages are received by all nodes; unicast, destination-based multicast, broadcast

    ORB services: remote asynchronous method invocation with user-specified interfaces

    PDMS 2 Hour Tutorial 36

  • PDMS 2 Hour Tutorial 37

    [Figure: example of a global synchronization (reduction) across five processing nodes, proceeding in stages 0 through 3; each node waits until the operation completes and then receives the final result]
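    One common way to implement such a staged reduction is recursive doubling, sketched below in C++ for a power-of-two node count (a five-node case like the figure needs an extra fix-up step). Send and Recv are hypothetical stand-ins for the HSC message-passing calls, not actual library functions.

        void   Send(int node, double value);   // hypothetical HSC unicast send
        double Recv(int node);                 // hypothetical blocking receive

        // Every node contributes localValue and ends up with the global sum
        // after log2(numNodes) stages of pairwise exchanges.
        double GlobalSum(double localValue, int myNode, int numNodes) {
            double sum = localValue;
            for (int stage = 1; stage < numNodes; stage <<= 1) {
                int partner = myNode ^ stage;   // pair up with a different node each stage
                Send(partner, sum);             // exchange partial sums with the partner
                sum += Recv(partner);
            }
            return sum;
        }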

  • PDMS 2 Hour Tutorial 38

  • PDMS 2 Hour Tutorial 39

    [Figure: one shared-memory block per node, with incoming-message slots (circular buffer, one per sending node) and an outgoing-message circular buffer with head and tail pointers]

    One shared memory block per node. Slots manage incoming messages for each node; a circular buffer manages outgoing messages.

    Steps in sending a message:
    1. Write the header and message at the head of the sender's output message buffer.
    2. Write the index of the message header into the receiving node's shared-memory slot for the sender's node.

    Steps in receiving a message:
    1. Iterate over the slot managers to find messages.
    2. Read each message using the index stored in the slot.
    3. Mark the header as having been read.

    Potential technical issues: cache coherency, instruction synchronization. A sketch of these steps follows.
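    One possible reading of those steps in C++; the layout and sizes are invented for illustration, and a real implementation also needs memory fences, wrap-around handling, and the cache-coherency care noted above.

        #include <cstring>

        struct Header { int NumBytes; int Index; char ReadFlag; };   // abbreviated header

        struct NodeBlock {                  // one shared-memory block per node
            Header outHeaders[256] = {};    // headers for outgoing messages
            char   outPayload[65536] = {};  // circular buffer of outgoing payloads
            int    head = 0;                // next free header (sender side)
            int    payloadHead = 0;         // next free byte in the payload buffer
            int    slots[64] = {};          // incoming: header index written by each sender
        };

        // Sender: (1) write header and message at the head of its output buffer,
        // (2) publish the header index in the receiver's slot for this sender.
        void SendMessage(NodeBlock& me, NodeBlock& receiver, int myNodeId,
                         const char* data, int numBytes) {
            Header& h  = me.outHeaders[me.head];
            h.NumBytes = numBytes;
            h.Index    = me.payloadHead;
            h.ReadFlag = 0;
            std::memcpy(&me.outPayload[me.payloadHead], data, numBytes);
            receiver.slots[myNodeId] = me.head;       // step 2: publish the header index
            me.head = (me.head + 1) % 256;
            me.payloadHead += numBytes;               // wrap-around omitted in this sketch
        }

        // Receiver: iterate over its slots, read any unread message, mark its header read.
        void PollMessages(NodeBlock& me, NodeBlock* blocks, int numNodes) {
            for (int sender = 0; sender < numNodes; ++sender) {
                Header& h = blocks[sender].outHeaders[me.slots[sender]];
                if (h.NumBytes > 0 && !h.ReadFlag) {
                    const char* msg = &blocks[sender].outPayload[h.Index];
                    (void)msg;                        // hand the message to the application
                    h.ReadFlag = 1;                   // step 3: mark the header as read
                }
            }
        }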

  • PDMS 2 Hour Tutorial 40

    [Figure: circular buffer head/tail states: tail chasing head vs. head chasing tail]

  • PDMS 2 Hour Tutorial 41

    [Figure: message format and header format]
    Each message is preceded by a header (Header 1, Header 2, ..., Header n) with the layout:
        int NumBytes;
        int Index;
        unsigned short Packet;
        unsigned short NumPackets;
        char DummyChar0;
        char DummyChar1;
        char DummyChar2;
        char ReadFlag;

  • PDMS 2 Hour Tutorial 42

  • PDMS 2 Hour Tutorial 43

  • SYNCHRONIZATION: Parallel Discrete Event Simulation (PDES)

    PDMS 2 Hour Tutorial 44

  • Standardized processing cycle interfaces to support any time management algorithm: uses virtual functions on the scheduler to specialize processing steps; supports reentrant applications (e.g., HPC-RTI, graphical interfaces, etc.)

    Highly optimized internal algorithms for managing events: optimized and flexible event queue infrastructure; native support for sequential, conservative, and optimistic processing; internal use of free lists to reduce memory allocation overheads; optimized memory management with high speed communications

    Statistics gathering and debug support: rollback and rollforward application testing; automatic statistics gathering (live critical path analysis, message statistics, event processing and rollbacks, memory usage, etc.); merged trace file generation for debugging parallel simulations, which can be tailored to include rollback information, performance data, and user output

    PDMS 2 Hour Tutorial 45

  • Time management modes are generically implemented through class inheritance from WpScheduler

    OpenMSA provides a generic framework to support basic parallel and distributed event processing operations, which makes it easy to implement new time management algorithms

    OpenMSA creates the object implementing the requested time management algorithm at run time

    The base class WpScheduler provides generic event management services for sequential, conservative, and optimistic processing

    The WpWarpSpeed, WpSonicSpeed, WpLightSpeed, and WpHyperWarpSpeed time management objects inherit from WpScheduler to implement their specific event processing and synchronization algorithms, as sketched below

    PDMS 2 Hour Tutorial 46
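    A minimal C++ sketch of that inheritance structure. The class names come from the slides; the single virtual method and the run-time factory are assumptions made for illustration.

        #include <memory>
        #include <string>

        class WpScheduler {                       // generic event management services
        public:
            virtual ~WpScheduler() = default;
            virtual void ProcessGvtCycle() = 0;   // specialized per time-management mode
        };

        struct WpLightSpeed     : WpScheduler { void ProcessGvtCycle() override { /* fast sequential   */ } };
        struct WpSonicSpeed     : WpScheduler { void ProcessGvtCycle() override { /* conservative      */ } };
        struct WpWarpSpeed      : WpScheduler { void ProcessGvtCycle() override { /* optimistic        */ } };
        struct WpHyperWarpSpeed : WpScheduler { void ProcessGvtCycle() override { /* 5D branching      */ } };

        // The requested time-management object is constructed at run time.
        std::unique_ptr<WpScheduler> MakeScheduler(const std::string& mode) {
            if (mode == "LightSpeed") return std::make_unique<WpLightSpeed>();
            if (mode == "SonicSpeed") return std::make_unique<WpSonicSpeed>();
            if (mode == "WarpSpeed")  return std::make_unique<WpWarpSpeed>();
            return std::make_unique<WpHyperWarpSpeed>();
        }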

  • PDMS 2 Hour Tutorial 47

    main {
        Plug in User SimObjs
        Plug in User Components
        Plug in User Events
        Execute
    }

    Execute {
        Initialize
        Process Up To (End Time)
        Terminate
    }

    Initialize {
        Launch processes
        Establish Communications
        Construct/Initialize SimObjs
        Schedule Initial Events
    }

    Process Up To (Time) {
        while (GVT < Time) { Process GVT Cycle }
    }

    Process GVT Cycle {
        Process Events & User Functions
        Update GVT
        Commit Events
        Print GVT Statistics
    }

    Terminate {
        Terminate All SimObjs
        Print Final Statistics
        Shut Down Communications
    }

  • PDMS 2 Hour Tutorial 48


  • PDMS 2 Hour Tutorial 49


  • PDMS 2 Hour Tutorial 50

    [Figure: event queues of a logical process]
    Processed events: doubly linked list
    Future pending events: priority queue
    Rollback queue: ordered by simulation time
    Event messages

    Scheduler: a priority queue of logical processes (i.e., simulation objects) ordered by next event time, as sketched below
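    An illustrative C++ rendering of those structures; the type and field names are assumptions, and a real scheduler must also re-order objects as their next event times change.

        #include <list>
        #include <queue>
        #include <vector>

        struct Event { double time; /* payload, rollback items, ... */ };
        struct EventLater {   // earliest event first
            bool operator()(const Event* a, const Event* b) const { return a->time > b->time; }
        };

        struct SimObj {                                      // logical process
            std::list<Event*> processedEvents;               // doubly linked list
            std::priority_queue<Event*, std::vector<Event*>, EventLater> pendingEvents;
            std::list<Event*> rollbackQueue;                 // ordered by simulation time
            double NextEventTime() const {
                return pendingEvents.empty() ? 1e300 : pendingEvents.top()->time;
            }
        };

        // Scheduler: priority queue of SimObjs ordered by each object's next event time.
        struct ObjLater {
            bool operator()(const SimObj* a, const SimObj* b) const {
                return a->NextEventTime() > b->NextEventTime();
            }
        };
        using Scheduler = std::priority_queue<SimObj*, std::vector<SimObj*>, ObjLater>;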

  • The priority queue uses a new self-correcting tree data structure that employs a heuristic to keep the tree roughly balanced

    The tree data structure efficiently supports three critical operations: element insertion in O(log2(n)) time, element retraction in O(log2(n)) time, and element removal in O(1) time

    Does not require storage of additional information in tree nodes to keep the tree balanced

    Tracks depth on insert and find operations to adjust the tree organization through specially combined multi-rotation operations

    The goal is to minimize long left/left and/or right/right chains of elements in the tree

    Competes with the STL Red-Black Tree: beats STL when compiled unoptimized; slightly worse than STL when compiled optimized

    PDMS 2 Hour Tutorial 51

  • PDMS 2 Hour Tutorial 52

    OptimalDepth = log2(NumElements)
    NumRotations = ActualDepth - OptimalDepth

    The rotation heuristic decreases depth to keep the tree roughly balanced (see the small helper below).
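    A tiny C++ helper capturing that heuristic; whether the optimal depth is rounded up or down is an assumption here.

        #include <cmath>

        // How many rotations to apply after an insert or find that reached
        // actualDepth in a tree currently holding numElements items.
        int RotationsNeeded(int numElements, int actualDepth) {
            int optimalDepth = (int)std::ceil(std::log2(numElements > 0 ? numElements : 1));
            int numRotations = actualDepth - optimalDepth;
            return numRotations > 0 ? numRotations : 0;   // only rotate when too deep
        }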

  • Rollback Manager: manages the list of rollbackable items created as rollbackable operations are performed. Each event provides a rollback manager; a global pointer is set before the event is processed. Rollbacks are performed in reverse order to undo operations.

    Rollback Items: each rollbackable operation generates a Rollback Item that is managed by the Rollback Manager. Rollback utilities include (1) native data types, (2) memory operations, (3) container classes, (4) strings, and (5) various miscellaneous operations.

    Rollback Items inherit from the base class to provide four virtual functions: Rollback, Rollforward, Commit, Uncommit (see the sketch below).

    PDMS 2 Hour Tutorial 53
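    A minimal C++ sketch of that structure. The four virtual function names come from the slide; the example item and manager are illustrative, not the actual OpenUTF classes.

        #include <memory>
        #include <utility>
        #include <vector>

        class RollbackItem {
        public:
            virtual ~RollbackItem() = default;
            virtual void Rollback()    = 0;   // undo the operation
            virtual void Rollforward() = 0;   // redo it after a rollforward
            virtual void Commit()      = 0;   // operation is now permanent (GVT has passed)
            virtual void Uncommit()    = 0;
        };

        // Example rollbackable operation: assignment to a native double.
        class RbAssignDouble : public RollbackItem {
            double& var;
            double  oldValue;
        public:
            RbAssignDouble(double& v, double newValue) : var(v), oldValue(v) { var = newValue; }
            void Rollback()    override { std::swap(var, oldValue); }   // restore prior value
            void Rollforward() override { std::swap(var, oldValue); }   // re-apply new value
            void Commit()      override {}
            void Uncommit()    override {}
        };

        class RollbackManager {                   // one per event
            std::vector<std::unique_ptr<RollbackItem>> items;
        public:
            void Record(std::unique_ptr<RollbackItem> item) { items.push_back(std::move(item)); }
            void Rollback() {                     // undo in reverse order
                for (auto it = items.rbegin(); it != items.rend(); ++it) (*it)->Rollback();
            }
        };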

  • Distributed Synchronization

    Conservative vs. Optimistic Algorithms

    Rollbacks in the Time Warp Algorithm

    The Event Horizon

    Breathing Time Buckets

    Breathing Time Warp

    WarpSpeed

    Four Flow Control Techniques

    PDMS 2 Hour Tutorial 54

  • PDMS 2 Hour Tutorial 55

  • Conservative algorithms impose one or more constraints: object interactions are limited to just neighbors (e.g., Chandy-Misra); object interactions have non-zero time scales (e.g., lookahead); object interactions follow a FIFO constraint

    Optimistic algorithms impose no constraints but require a more sophisticated engine: support for rollbacks (and advanced features for rollforward); flow control is required to provide stability; optimistic approaches can sometimes support real-time applications better

    The most important thing is for applications to develop their models to maximize parallelism: simulations will generally not execute in parallel faster than their critical path

    PDMS 2 Hour Tutorial 56

  • [Figure: directed graph of logical processes A through G exchanging events]

    57 PDMS 2 Hour Tutorial

  • PDMS 2 Hour Tutorial 58

    [Figure: conservative (Chandy-Misra style) processing at logical process D. D maintains FIFO input queues for scheduled input events and times from C and E, and for self-scheduled events from D itself, and sends scheduled output events and times to F and B.]

  • GVT is defined as the minimum time-tag of any unprocessed event, unsent message, or message/antimessage in transit

    Theoretically, GVT changes as events are processed; in practice, GVT is updated periodically by a GVT update algorithm

    To correctly provide time management services to the outside world, GVT must be updated synchronously between internal nodes

    PDMS 2 Hour Tutorial 59
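    A minimal C++ sketch of that definition; illustrative only, since a real GVT algorithm must gather these minima consistently while messages are still moving through the network.

        #include <algorithm>
        #include <vector>

        struct NodeState {
            double minUnprocessedEvent;   // earliest event not yet processed on this node
            double minUnsentMessage;      // earliest message generated but not yet sent
            double minMessageInTransit;   // earliest message/antimessage still in the network
        };

        double ComputeGvt(const std::vector<NodeState>& nodes) {
            double gvt = 1e300;           // effectively +infinity
            for (const NodeState& n : nodes) {
                gvt = std::min({gvt, n.minUnprocessedEvent,
                                     n.minUnsentMessage,
                                     n.minMessageInTransit});
            }
            return gvt;                   // events earlier than GVT can safely be committed
        }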

  • PDMS 2 Hour Tutorial 60

  • PDMS 2 Hour Tutorial 61

  • PDMS 2 Hour Tutorial 62

  • PDMS 2 Hour Tutorial 63

    [Figure: CPU time vs. simulation time for Time Warp and Breathing Time Buckets. Proximity detection benchmark on 32 nodes with 259 ground sensors and 1,099 aircraft.]

  • PDMS 2 Hour Tutorial 64

    [Figure: processed events and rollbacks vs. simulation time for Time Warp and Breathing Time Buckets. Proximity detection benchmark on 32 nodes with 259 ground sensors and 1,099 aircraft.]

  • PDMS 2 Hour Tutorial 65

  • PDMS 2 Hour Tutorial 66

    [Figure: generated messages]

  • Opposite problems arise when comparing Breathing Time Buckets and Time Warp

    Imagine mapping events into a global event queue

    Events processed by runaway nodes have a good chance of being rolled back

    Messages from runaway nodes should be held back

    PDMS 2 Hour Tutorial 67

  • Example with four nodes (plotted against wall time). Time Warp phase: messages are released as events are processed. Breathing Time Buckets phase: messages are held back. GVT: flushes messages out of the network while processing events. Commit: releases event horizon messages and commits events.

    PDMS 2 Hour Tutorial 68

  • The abstract representation of logical time uses five tie-breaking fields to guarantee unique time tags:
    double Time       Simulated physical time of the event
    int Priority1     First user-settable priority field
    int Priority2     Second user-settable priority field
    int Counter       Event counter of the scheduling SimObj
    int UniqueId      Globally unique Id of the scheduling SimObj

    Guaranteed logical times: the OpenUTF automatically increments the SimObj event Counter to guarantee that each SimObj schedules its events with unique time tags. Note that the Counter may jump to ensure that events have increasing time tags: SimObj Counter = max(SimObj Counter, Event Counter) + 1.

    The OpenUTF automatically stores the UniqueId of the SimObj in event time tags to guarantee that events scheduled by different SimObjs are unique (see the sketch below).

    PDMS 2 Hour Tutorial 69
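    The five fields map naturally onto a small C++ struct with lexicographic comparison; the field names come from the slide, while the comparison operator and counter helper are illustrative.

        #include <tuple>

        struct LogicalTime {
            double Time;       // simulated physical time of the event
            int    Priority1;  // first user-settable priority
            int    Priority2;  // second user-settable priority
            int    Counter;    // event counter of the scheduling SimObj
            int    UniqueId;   // globally unique id of the scheduling SimObj

            bool operator<(const LogicalTime& rhs) const {   // lexicographic tie-breaking
                return std::tie(Time, Priority1, Priority2, Counter, UniqueId)
                     < std::tie(rhs.Time, rhs.Priority1, rhs.Priority2, rhs.Counter, rhs.UniqueId);
            }
        };

        // Counter rule from the slide: when a SimObj schedules a new event, advance its
        // counter past the counter of the event currently being processed.
        int NextCounter(int simObjCounter, int eventCounter) {
            return (simObjCounter > eventCounter ? simObjCounter : eventCounter) + 1;
        }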

  • Four algorithms, selectable at run time, are currently supported in the OpenUTF reference implementation:

    LightSpeed for fast sequential processing: optimistic processing overheads are removed; parallel processing overheads are removed

    SonicSpeed for ultra-fast sequential and conservative parallel event processing: highly optimized event management (no bells and whistles)

    WarpSpeed for optimistic parallel event processing, with four new flow control techniques to ensure stability: cascading antimessages can be eliminated; individual event lookahead evaluation for message-sending risk; message-sending risk based on uncommitted event CPU time; run-time adaptable flow control for risk and optimistic processing

    HyperWarpSpeed for supporting five-dimensional simulation: branch excursions, event splitting/merging, parallel universes

    PDMS 2 Hour Tutorial 70

  • PDMS 2 Hour Tutorial 71

    [Figure: two cases on a timeline starting at GVT: in one case messages are held back, in the other it is OK to send messages]

  • PDMS 2 Hour Tutorial 72

    [Figure: lookahead-based risk along the time axis: events within the risk lookahead window send their messages, events beyond it hold them back]

  • PDMS 2 Hour Tutorial 73

    [Figure: message-sending risk based on uncommitted event CPU time (Tcpu0 through Tcpu6): once the accumulated processing threshold is exceeded, messages are held back]

  • PDMS 2 Hour Tutorial 74

    [Figure: run-time adaptable flow control. The number of rollbacks over time is monitored to tune Nopt (unstable: decrease Nopt; stable: slightly increase Nopt), and the number of antimessages over time is monitored to tune Nrisk (unstable: decrease Nrisk; stable: slightly increase Nrisk). A rough sketch follows.]
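    A rough C++ sketch of that adaptive rule; the thresholds and step sizes are invented for illustration, and the actual WarpSpeed controller is not described in detail in these slides.

        // Shrink the optimistic window (Nopt) when rollbacks spike, shrink the risk
        // window (Nrisk) when antimessages spike, and grow both slowly while stable.
        void AdaptFlowControl(int rollbacksThisCycle, int antimessagesThisCycle,
                              int& Nopt, int& Nrisk) {
            if (rollbacksThisCycle > 100) Nopt /= 2;        // unstable: back off quickly
            else                          Nopt += 1;        // stable: creep up slowly
            if (antimessagesThisCycle > 100) Nrisk /= 2;
            else                             Nrisk += 1;
            if (Nopt  < 1) Nopt  = 1;
            if (Nrisk < 0) Nrisk = 0;
        }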

  • PDMS 2 Hour Tutorial 75

  • PDMS 2 Hour Tutorial 76

  • OPEN DISCUSSION: final thoughts

    PDMS 2 Hour Tutorial 77

  • Participate in the PDMS Standing Study Group (PDMS-SSG): simulation users, model developers, technologists, sponsors, program managers, policy makers

    Receive OpenUTF hands-on training for the open source reference implementation: one-week hands-on training events can be arranged for groups if there is enough participation

    Begin considering the OpenUTF architecture standards: OpenMSA layered technology; OSAMS plug-and-play components; OpenCAF representation of intelligent behavior

    PDMS 2 Hour Tutorial 78