High Performance Multi-agent System based Simulations
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (by Research)
in
Computer Science and Engineering
by
Prashant Sethia
200602017
prashant [email protected]
Center for Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
June 2011
Copyright © Prashant Sethia, 2011
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “High Performance Multi-agent
System based Simulations” by Prashant Sethia, has been carried out under my supervision
and is not submitted elsewhere for a degree.
Date Advisor: Dr. Kamalakar Karlapalem
To my father, Mr. Moti Lal Sethia, and my mother, Mrs. Saroj Sethia
Sincere gratitude to
My grandfather, Shri Santok Chand Sethia. His blessings are always with me, taking
me through all ups and downs.
My father and mother, for their endless love, care and the sacrifices made for me.
My guide, Dr. Kamalakar Karlapalem, for inspiring and guiding me throughout my work.
He will always be the first one I turn to whenever I need any advice.
And my sisters and friends, for making my life beautiful and full of joy.
Abstract
Real-life city-traffic simulation presents a good example of multi-agent simulations in-
volving a large number of agents (each human modelled as an individual agent). Analysis of
emergent behaviors in social simulations largely depends on the number of agents involved
(at least 100,000 agents). Due to the large number of agents involved, it can take several
seconds or even minutes to simulate a single second of real life. Hence, we resort to
distributed computing to speed up the simulations. We used Hadoop to manage and execute
multi-agent applications on a small cloud of computers, and used the CUDA framework to develop
an agent-simulation model on GPUs.
Hadoop is known for efficiently supporting massive workloads on a large number of sys-
tems with the help of its novel task distribution and the MapReduce model. It provides a
fault-tolerant and failure-resilient underlying framework, and these properties propagate
to the multi-agent simulation solution developed on top of it. Further, when some of the sys-
tems fail, Hadoop dynamically re-balances the workload on the remaining systems, and it
provides the capability to add new systems on the fly while the simulation is running. In our
solution, agents are executed as MapReduce tasks, and one execution of all the agents in the
simulation becomes a MapReduce job. Agents need to communicate with each other to make
autonomous decisions. Agent communication may involve inter-node exchange of data,
causing network I/O to increase and become the bottleneck in multi-agent simulations. Fur-
ther, Hadoop does not process small files efficiently. We present an algorithm for grouping
agents on the cluster of machines such that frequently communicating agents are brought to-
gether in a single file (as much as possible), avoiding inter-node communication. The speed-up
achieved using Hadoop was limited by the overheads of a large number of MapReduce tasks
running to execute agents, and by the steps Hadoop takes to provide failure re-
silience. Moreover, HDFS-based files are slow to access and hard to append to. Hence, we use
a Lucene-index-based mechanism to store agent data and implement agent messaging. With
our implementation, we achieve better performance and scalability in comparison
to currently available simulation frameworks.
We experimented with GPUs for speeding up massive complex simulations, the GPUs be-
ing managed with the help of the CUDA framework. CUDA provides the capability to create and
manage a very large number (on the order of 10^12) of light-weight GPU threads. It follows the Single Instruction Mul-
tiple Threads (SIMT) architecture, the single instruction being modelled as a kernel function. Agent
strategies are hence developed as kernel functions, with each agent running as a separate GPU
thread. Consistency of agent states is maintained by using a four-state agent execution
model. The model decentralizes the problem of consistency and helps avoid bottle-
necks that could occur with a single centralized system.
We tested the performance of the two developed frameworks and compared them with
existing solutions like NetLogo, MASON and DMASF. The experiments involved both sim-
ulations with heavy computation and simulations with a large number of messages amongst
agents. The framework developed on Hadoop provided a linear scale-up in the number of
running agents with the increase in the number of machines in the cluster. The time taken to exe-
cute a simulation for a fixed number of agents decreased inversely as the number of machines
increased. Hadoop rebalanced the simulation, with very little overhead in time, when some
machines failed and new ones were added dynamically. However, the execution time was
up to 6-8 times more than that of DMASF under the same experimental settings. With
the help of Lucene indexing along with Hadoop, the execution time improved, and for larger
numbers of agents our system outperformed DMASF. The GPU solution, on the other hand, outper-
formed all the existing CPU solutions in terms of execution time. The speed-up achieved was
as much as 10,000 times for some simulations when compared with DMASF running on 2
CPUs for the same experiments.
Contents
1 Introduction
  1.1 Motivation
  1.2 Contribution and Organization

2 Related Work
  2.1 AB-Example

3 A Multi-agent Simulation Framework on Small Hadoop Clusters
  3.1 Hadoop Architecture and Map-Reduce Model
    3.1.1 Hadoop Distributed File System (HDFS)
    3.1.2 Map-Reduce Paradigm
    3.1.3 MapReduce Job Execution by Hadoop
  3.2 Multi-agent Simulation Framework on Hadoop
    3.2.1 Handling Failures
      3.2.1.1 Namenode Failure
      3.2.1.2 Secondary Namenode Failure
      3.2.1.3 Datanode Failure
    3.2.2 Dynamic Addition of New Nodes
  3.3 Implementation Issues for MAS framework on Hadoop
    3.3.1 Small-Files-Problem
      3.3.1.1 Solution to the Small-Files-Problem
    3.3.2 Agent Communication
      3.3.2.1 Agent Clustering algorithm based on agent-communication
      3.3.2.2 Implementing Greedy Agent-Redistribution Algorithm
        3.3.2.2.1 In MAP-1 phase
        3.3.2.2.2 In REDUCE-1 phase
        3.3.2.2.3 In MAP-2 phase
        3.3.2.2.4 In REDUCE-2 phase
      3.3.2.3 Placing the agent-groups into Files
    3.3.3 Queries in Agent-State Updates
  3.4 Agent Execution using Lucene/Solr Indexing
    3.4.1 Introduction to Lucene and Solr
    3.4.2 Implementation Strategy
  3.5 Experimental Results
    3.5.1 Circle Simulation
    3.5.2 Standing Ovation Problem (SOP)
    3.5.3 Sand-pile Simulation
    3.5.4 KP Simulation
    3.5.5 Dynamic Nodes Addition
    3.5.6 Scalability Tests
    3.5.7 Experiments and comparison with Lucene

4 Efficient Multi-Agent Simulation using Four State Agent Execution Model on GPUs
  4.1 Outline of nVidia Compute Unified Device Architecture
  4.2 Agent Execution Model
    4.2.1 Evaluating ‘decision-lag’
    4.2.2 Utility of ‘perceive’ state
  4.3 FSAM-framework Architecture
    4.3.1 Distribution of agents
    4.3.2 Event-driven approach of agent-state updates
    4.3.3 Messaging
    4.3.4 Warp management
  4.4 Experimental Results
    4.4.1 Experiments on performance
      4.4.1.1 Computationally intense with no messaging (Circle simulation)
      4.4.1.2 Communication intensive (Hand-shake simulation)
      4.4.1.3 Messaging and computationally balanced (Sand-pile simulation)
    4.4.2 Evaluating FSAM-framework Architecture
      4.4.2.1 Agent distribution algorithm
      4.4.2.2 Warp management
      4.4.2.3 Stress Testing

5 Conclusions
  5.1 Future Work

Bibliography

Appendix A: Example codes for agent simulation on Hadoop simulation framework
  A.1 Circle Simulation
  A.2 Standing Ovation Simulation
  A.3 Sand-pile Simulation

Appendix B: Example codes for agent simulation on FSAM
  B.1 Circle Simulation
  B.2 Hand-shake Simulation
  B.3 Sand-pile Simulation
List of Figures

2.1 AB-Example
3.1 Agent distribution and communication
3.2 Comparing iteration-time for caching On and Off for Circle Simulation on Hadoop
3.3 Comparing number of inter-node messages for clustering On and Off for Standing Ovation Simulation
3.4 Comparing iteration-time for clustering On and Off for Standing Ovation Simulation
3.5 Comparing iteration-time for clustering On and Off for Sand-pile simulation
3.6 Iteration times obtained for KP-Simulation
3.7 Iteration times obtained when removing and adding machines dynamically
3.8 Scalability test with 200,000 agents
3.9 Scalability test with varying number of datanodes
3.10 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Circle Simulation
3.11 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Standing Ovation Simulation
3.12 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Sand-pile Simulation
4.1 Agent state change diagram
4.2 FSAM-framework Architecture
4.3 Decision-lag obtained for different scenarios on GPU-based FSAM
4.4 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Circle simulation
4.5 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Circle simulation
4.6 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Hand-shake simulation
4.7 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Hand-shake simulation
4.8 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Sand-pile simulation
4.9 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Sand-pile simulation
List of Tables

3.1 Map-reduce solution for word-count as it occurs in [2]
3.2 Framework Supplied Code Classes
3.3 Classes for which code is supplied by user
3.4 ALGORITHM: Greedy Agent-Redistribution
3.5 ALGORITHM: Agent-Allocation
3.6 API provided to handle cached results
4.1 ALGORITHM: Agent-distribution on GPUs
Chapter 1
Introduction
In recent decades, multi-agent systems have emerged as an important research field. They
have found several important applications in areas such as distributed problem solving, robotics,
and the simulation and construction of synthetic worlds. In the area of distributed problem
solving, multi-agent technology has helped in developing robust systems and novel soft-
ware architectures by bringing in modularity and flexible decoupling of components. In
robotics, a complex task can be carried out by dividing it into simpler tasks and allocating
each robot (which represents an agent in the system) specific goals.
Another important contribution of multi-agent systems comes in simulations. Simulations
are widely used to enhance knowledge in biology, in the social sciences and in several other fields
through the testing of developed theories. Simulations help in creating and modeling virtual
cities and humans. One can then simulate disasters and test several rescue strategies;
an example is the RoboCup Rescue Competition. Another classic example is
simulating the traffic of a city with millions of humans and thousands of vehicles such as trains
and buses. Such experiments then enable us to construct better road layouts and plan an
effective traffic system. Analyzing emergent behaviors in social simulations prompts us to
run the theories for a large number of agents and see how the results vary. Similarly, to get a
more realistic model of cities, one would want to simulate a huge population with millions
of humans, thus increasing the number of human agents. Due to the large number of agents, the time
taken for such simulations becomes large.
A simulation cycle can be seen as one execution of all agents (a step or decision taken
by each agent). In simulations involving millions of agents, the running time for a simulation
cycle can be several seconds or even minutes, and when run for a large number of cycles,
the total simulation time can be several hours or days. Similarly, if we want to run multiple
experiments, we would want the simulation runs to take as little time as possible.
Hence, we resort to distributed computing to achieve speed-up.
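The cycle structure described above can be sketched as follows (a minimal illustration; the Agent class and its step logic are hypothetical and not part of any framework discussed in this thesis):

```python
# One simulation cycle = one execution of every agent in the simulation.
class Agent:
    def __init__(self, aid):
        self.aid = aid
        self.steps_taken = 0

    def step(self):
        # A 'step' stands for one decision or action taken by the agent.
        self.steps_taken += 1

def run_simulation(agents, cycles):
    for _ in range(cycles):      # total time grows with the number of cycles ...
        for agent in agents:     # ... and with the number of agents per cycle
            agent.step()

agents = [Agent(i) for i in range(1000)]
run_simulation(agents, 10)
```

With millions of agents, the inner loop over agents is exactly what distributed execution parallelizes.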
To fully exploit the capabilities of any distributed system, we must be able to dynamically
balance the work load on the running systems. Further, hardware failures are not rare. If
some of the machines fail at run time, then in the common case the entire simulation
needs to be restarted. If we can somehow dynamically re-balance the workload on the
remaining machines and maintain logs of the simulation progress, we can continue
the simulation from the point of failure. So, we require a fault-tolerant and failure-resilient
system that runs on a large number of processors and agents. Moreover, the system should be
easily extensible so that new machines can be added dynamically to scale up the simulations.
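The idea of resuming from the point of failure can be illustrated with a toy checkpointing loop (a hypothetical sketch; in practice Hadoop performs such logging and recovery internally):

```python
import json
import os
import tempfile

def run_with_checkpoints(state, total_cycles, ckpt_path):
    # Resume from the last completed cycle if a checkpoint log exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["cycle"] < total_cycles:
        state["value"] += 1          # stand-in for one simulation cycle of work
        state["cycle"] += 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)      # log progress after every completed cycle
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
run_with_checkpoints({"cycle": 0, "value": 0}, 5, path)
# Simulate a restart: the run continues from cycle 5, not from scratch.
s = run_with_checkpoints({"cycle": 0, "value": 0}, 8, path)
```

Only the three cycles after the "failure" are re-executed; without the checkpoint log, all eight would be.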
Furthermore, in multi-agent based simulations, agents perceive the environment and other
agents’ states. Based on their perceptions, they frame decisions. Hence, it is important for
an agent to get the most current information from other agents and the environment. We term
the issue of getting the latest perceptions of other agents and the environment the data currency
problem for an agent.
Development of multi-agent based simulation systems has seen substantial research in
recent years. Numerous studies have addressed designing new programming paradigms
(NetLogo [12]) and architectures (MASON [16]), scale-up of simulations (ZASE [13],
DMASF [14]), and developing mechanisms for recovery from agent failures ([24]).
Hadoop [5] and CUDA [9] present emerging trends for distributed and parallel computing.
Hadoop provides a scalable, fault-tolerant, failure-resilient distributed computing platform
and is capable of dynamically balancing the workload. In case of system failures, or when
new machines are added to the cluster, Hadoop dynamically re-balances the work
load without stopping an ongoing job. The multi-agent simulation framework developer only
needs to develop a layer for agent-based simulation on top of it to inherit the afore-mentioned
advantages.
GPUs, with hundreds of ALUs and several thousand registers, provide a faster alternative to
CPUs. With their light-weight threads, GPUs provide immense parallelism, thus speeding up
the execution of instructions. Further, the GPUs available today are general-purpose parallel pro-
cessors with support for accessible programming interfaces and industry-standard languages
such as ‘C’. The nVidia Compute Unified Device Architecture (CUDA) [9] enables programmers
and developers to write software that solves complex computational problems by utilizing the
many-core parallel processing power of GPUs. Hence, an agent simulation framework de-
veloped on top of CUDA would naturally be faster than its equivalent implementation on
CPUs.
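The kernel-function idea can be mimicked on a CPU as a sketch: one function applied uniformly to every agent's state, with each invocation independent of the others so that, on a GPU, each could run in its own light-weight thread (an illustration only; real CUDA kernels are written in C, and the update rule here is invented):

```python
def move_kernel(state):
    # The 'single instruction' applied to each agent's state; on a GPU,
    # each invocation would execute in a separate light-weight thread.
    x, y = state
    return (x + 1, y - 1)

agent_states = [(i, i) for i in range(10000)]
# A data-parallel map: no invocation depends on any other invocation.
agent_states = [move_kernel(s) for s in agent_states]
```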
1.1 Motivation
We studied the architectures of Hadoop and CUDA in depth and identified the features
which can help in advancing the state of the art for multi-agent simulation frameworks. In this
work, our main motivation is to devise agent frameworks which efficiently (i) provide data
currency with minimum possible overhead to the execution time of the simulation (work done
on GPUs using CUDA); and (ii) scale with a large number of agents, allow dynamic addition
of new machines, and handle system failures without affecting the ongoing simulation (work
done on Hadoop).
One of the major issues with simulations involving a huge number of agents is scalability.
One such agent application would be creating a virtual city with ten million citizens and
simulating the traffic and transport system for that city. The data for the city is detailed enough to
take into account the railway network and road routes, and has multiple modes of transport, such
as buses, trains and cars, running on it. Each human has some information about the roads and routes,
and a set of movement activities which involves going from one place to another. Executing
such a simulation requires a lot of space to store the city map data, each human’s personal
knowledge about the routes, information about the vehicles running in the city,
and other details; along with space, a lot of computation is also required, as the number of
agents is huge. Moreover, we may want to store the simulation data being generated, which
can be huge. Similarly, if we want to run multiple experiments, we would want the simulation
runs to take as little time as possible. Thus, scalability becomes a major issue in
such simulations. The main motivation behind developing a simulation platform over Hadoop
is scalability. Hadoop (along with Lucene [31], see Section 3.4) provides scalability and speed
to the simulations run on top of it.
The framework developed on GPUs using CUDA has a very low idle time for agents for
a wide class of applications, as demonstrated by the results obtained for several experiments
(see Section 4.4). As such, the framework gives agents the ability to perceive the envi-
ronment and receive information at a high rate. Now consider a class of applications in which
the environment keeps getting updated at a high rate (on the order of a few seconds) and the
agents need to keep track of the changes in the environment and take quick actions
accordingly. If the number of agents is large, providing quick updates to all is a challenge.
A stock-market simulation is one such example. There are millions of stake-holders in a stock
market, who are modeled as agents. The environment contains the list of quotes for the stocks of vari-
ous firms. The list gets updated as the prices change rapidly. Real-life transactions
suggest that stock traders usually try to profit from short-term price volatility, with trades
lasting as little as several seconds. Stake-holders would be able to make more profitable
trades if they remain constantly aware of the stock prices and trade at appropriate times.
When simulating ten million agents with GPUs, the average idle time obtained for an agent
is less than a millisecond, while that obtained on CPUs is of the order of tens of thousands
of seconds. This implies that while the GPU agents are aware of the mar-
ket dynamics almost all the time, the time between consecutive updates for CPU agents is of the
order of hours. Clearly, if we had stock quote prices streamed into the system from a
real-life source, GPU-based agents would have a better idea of the market than
CPU-based agents.
1.2 Contribution and Organization
The main contributions of our work include the development of: (i) a scalable, dynami-
cally extensible, fast, fault-tolerant and failure-resilient agent-based simulation framework
on Hadoop; (ii) optimization techniques for the implementation of a simulation framework on
a Hadoop cluster, namely caching of intermediate results, using Lucene indexing for fast
agent-data retrieval and messaging, and an algorithm for clustering frequently communicat-
ing agents (to reduce inter-processor communication); (iii) a four-state agent execution
model which ensures that every agent gets the latest perceptions from the agent environment;
and (iv) a fast multi-agent simulation framework which utilizes the parallel architecture of GPUs
to efficiently implement the four-state agent execution model.

Chapter 2 presents the related research on agent-simulation frameworks. Chapter 3 presents
the architecture of the simulation framework developed on Hadoop, along with the optimization
techniques that were developed; some experimental results are included. Chapter 4 presents
the four-state agent execution model; the architecture of the agent-simulation framework developed
on GPUs, along with experimental results, is included in this chapter. Finally, in Chapter 5,
we present some conclusions and the future work that can be done to extend the functionality
and efficiency of the agent execution models and frameworks presented in this work.
Chapter 2
Related Work
Developing tools for multi-agent simulations has always been an active area of research,
with emphasis laid on different aspects: architecture, scalability, efficiency, fault-
tolerance and effectiveness of the system. A number of frameworks have been developed.
Swarm [17], NetLogo [12] and RePast [18] are frameworks managing simulations on
a single CPU core; ZASE [13], DMASF [14] and MASON [16] are the ones which scale
simulations to multiple cores.
SWARM [17] and RePast [18] are widely used frameworks for studying emergent agent
behaviors through agent-based social simulations. SWARM was developed to help users
implement agent models and conduct experiments on those models. It has two versions
available: one developed in Objective-C and the other in Java. It conceptualized agent-based
simulation models as a hierarchy of “swarms”, a swarm being a group of objects and a sched-
ule of actions that the objects execute. A primary purpose of agent-based modelling platforms
is to control which specific actions are executed when, in simulated time. Models often use
discrete, fixed time steps but sometimes use “dynamic scheduling”, in which new actions can
be generated as the model executes and scheduled for execution at a specific future time.
Swarm provides explicit methods for scheduling actions, both in fixed time steps and dynam-
ically.
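Dynamic scheduling of the kind Swarm supports can be sketched with a priority queue keyed on simulated time, where an executing action may schedule further actions at future times (an illustration only, not Swarm's actual API; the action functions are invented):

```python
import heapq

def run(initial_schedule):
    # Each entry is a (time, sequence-number, action) triple; the sequence
    # number breaks ties so actions at equal times execute in insertion order.
    queue = list(initial_schedule)
    heapq.heapify(queue)
    executed, seq = [], len(queue)
    while queue:
        t, _, action = heapq.heappop(queue)
        executed.append(t)
        for dt, act in action(t):          # an action may schedule new actions
            heapq.heappush(queue, (t + dt, seq, act))
            seq += 1
    return executed

def noop(t):
    return []

def spawn_once(t):
    # Dynamically schedule a follow-up action 5 time units in the future.
    return [(5, noop)] if t == 0 else []

times = run([(0, 0, spawn_once), (3, 1, noop)])
```

Fixed time steps are the special case where every action reschedules itself with the same constant dt.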
RePast [18] was developed with the objective of providing functionality equivalent to SWARM
in Java, and has been designed to make it easier for inexperienced users to build and test agent
models. Repast Simphony (Repast S) is the latest version of Repast. The Repast S agent
model designer is being developed to allow users to visually specify the logical structure of
their models, the spatial (e.g., geographic maps and networks) structure of their models, the
kinds of agents in their models, and the behaviors of the agents themselves. Users can then
execute model runs as well as visualize and store results. In addition, the Repast S runtime
environment includes automated results analysis and connections to a variety of spreadsheet,
visualization, data mining, and statistical analysis tools.
Though SWARM and RePast provide a lot of functionality, they both lack the capability
to manage more than one system. As such, the number of agents that can be executed using
them is limited by the configuration of the machine. Hence, the two frameworks are not
scalable.
NetLogo [12] introduced a novel programming model for implementing agent-based sim-
ulations, which eased the development of complex agent models and scenarios. It manages
all the agents in a single thread of execution, switching execution between different agents
deterministically, not randomly, after each agent has done some minimal amount of work
(simulated parallelism). The simulated parallelism provides deterministic reproducibility of
the agent-based simulation every time it is run with the same seed for the random number generator,
which was one of the implementation goals of NetLogo. Further, a visualization module pro-
vides 2D/3D visuals of the ongoing simulation. However, NetLogo is not able to distribute the
computation over a cluster of computers and hence is not scalable.
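Simulated parallelism of this kind can be sketched as deterministic round-robin switching driven by a single seeded random number generator, which is what makes every run with the same seed reproducible (a sketch of the idea, not NetLogo's implementation; the per-agent work is invented):

```python
import random

def simulate(num_agents, cycles, seed):
    rng = random.Random(seed)             # one seeded generator for the whole run
    positions = [0.0] * num_agents
    for _ in range(cycles):
        for i in range(num_agents):       # deterministic, fixed agent order
            positions[i] += rng.random()  # minimal unit of work per agent
    return positions

run1 = simulate(100, 10, seed=42)
run2 = simulate(100, 10, seed=42)         # identical to run1, by construction
```

Because both the switching order and the random draws are fixed by the seed, the two runs produce bit-identical results.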
MASON [16], developed in Java, provides a platform for running massive simulations
over a cluster of computers. It exploits several features of Java: its platform independence to
make the framework portable; strict math and type definitions to obtain duplicatable results;
and its efficient object serialization for check-pointing simulations. It has a layered architecture
with separate layers for agent modelling and visualization, which makes decoupling the visu-
alization part easier. It has the capability to support millions of agents (without visualization).
Checkpoints of agent data are created on disk for offline visualization. Platform independence
and check-pointing together allow simulations to be migrated from one platform to another in the
middle of a run.
ZASE [13] (developed in Java) is another scalable platform for running billions of agents.
It divides the simulation into several smaller agent runtimes, each runtime controlling hun-
dreds of thousands of agents and running on a separate machine. A thread-pool architecture
is followed, with several agents sharing a single thread of execution. It keeps all agents in
main memory without the need to access the disk. As such, scaling simulations requires either
increasing the main memory or adding new processors altogether, both of which are more
expensive than adding secondary storage.
DMASF [14] (developed in Python) has an architecture similar to MASON and ZASE.
Like ZASE, it divides the simulation into several smaller runtimes executing on different com-
puters; furthermore, several agents are executed in a shared single thread of execution. How-
ever, it uses a MySQL database to provide scalability with the help of secondary storage
rather than being bounded by the limited main memory. Similar to MASON, it has a modu-
lar architecture. It separates agent modelling from visualization and makes the decoupling of
the two easier. Further, it dynamically balances the agent execution load among the machines
running the simulation.
However, the ability to handle hardware failures is lacking in all three (MASON, ZASE,
DMASF). If some of the systems using these frameworks fail during the simulation run, then
the simulation needs to be restarted from the beginning.
Further, there has been some work on developing agent simulation frameworks on GPUs.
In [22], the authors present an architecture for a 3D simulation framework, ABGPU, which simu-
lated up to 65,536 agents at 60 frames/sec (with visualization) and 1 million agents at 5 fps
(without visualization). The parallel nature of processing offers significant and scalable per-
formance increases, with the added benefit of avoiding data transfer between the simulation
and rendering stages.
In [23], the authors use the GPU’s texture memory to store information about the agents and
environment, providing mechanisms to simulate births and deaths of agents. State-update
functions that operate on the agent-state data are programmed as kernels [11]. Therefore,
different update functions have different kernels. Kernels operate one at a time on the
entire data set. Thousands of individual threads are automatically launched, with each thread
executing the same kernel on a portion of the agent-state array. The threads are then barrier-
synchronized at the end of the computation. Depending on the agent model, several kernels
may be invoked for the computation of one time step. Kernels are implemented as shaders,
and data is stored as textures.
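The per-kernel, barrier-synchronized update scheme of [23] can be sketched with a double-buffered state array: each "kernel" reads only the old buffer and writes a fresh one, and swapping buffers plays the role of the barrier (an illustration under that assumption; the original work implements kernels as shaders operating on textures, and the diffusion rule here is invented):

```python
def diffuse_kernel(states, i):
    # Each thread would run this on one slot of the agent-state array.
    # Reading only the old buffer makes thread ordering irrelevant.
    left = states[i - 1] if i > 0 else 0.0
    right = states[i + 1] if i < len(states) - 1 else 0.0
    return (left + states[i] + right) / 3.0

def run_step(states, kernel):
    # Launch one 'thread' per array slot; returning a new buffer and
    # replacing the old one corresponds to the barrier synchronization.
    return [kernel(states, i) for i in range(len(states))]

states = [0.0, 3.0, 0.0]
states = run_step(states, diffuse_kernel)
```

A second update function would simply be a second kernel, invoked on the whole array after the first completes.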
In [25], the authors scale the number of agents to one billion. They provide a latency-hiding
scheme to avoid communication delays which may occur due to messaging between agents
residing on physically different systems. Multiple instances of the same agent are run on the
systems which have agents communicating with it. In their work, they assume agents to be
organized in a grid, and agents can interact only with agents within some specific
distance in their neighborhood region. For the distributed environment, the global
grid gets partitioned into multiple grids. Hence, agents lying near a border may still have
to communicate with neighbouring agents on different systems. Such agents have multiple
instances running on the candidate systems. This results in lower communication delays.
However, since their scheme works well only for agents having static communication links, it would
be difficult to utilize it for a broader class of simulations which have fairly dynamic
communication links between agents (for an example, see [26]).
[30] is a recent work on developing an environment that enables agent-based models to be
run on high-performance computers (HPCs) and GPUs. The simulation code is generated by
processing a model definition of the agents using a template engine, given a set of predefined
code templates and flags indicating the target architecture. The generated code (including
common routines for input/output) can then be compiled and executed on the appropriate
architecture. When simulating on HPCs, the framework exploits task parallelism; when
simulating on GPUs, data parallelism is utilized. However, the syntax for model and behaviour
scripting remains the same, allowing models to be compiled for either architecture. The
experiments performed involved millions of agents simulating the European economy.
[Figure: (a) the AB-dependency graph over properties A.x, A.y, B.x, B.y; (b) the time delays:
at t = t1, A.x is updated in main memory; at t = t2, the updated A.x is written to the disk;
at t = t3, the updated A.x is visible to B; λ1 = t2 − t1 and λ2 = t3 − t2.]
Figure 2.1 AB-Example
2.1 AB-Example
Let us consider a scenario with two agents, A and B, each having two properties, 'x' and
'y'. Figure 2.1(a) shows the dependency graph for the two agents. Agent B considers the value
of A.x before updating B.x, and agent A considers the value of B.y before it updates A.y. Our
motive is to provide agent B with the latest value of A.x before it updates B.x, and similarly
for agent A.
Assume A and B have a shared disk-resident database (or one across the network) to store
their properties. There is some delay, λ1, involved in writing the updated A.x value to the
database. Further, there is some delay, λ2, in reading A.x from the database by agent B.
Therefore, either B has to delay the update of B.x by (λ1 + λ2) in order to get the latest value
of A.x, or else it can proceed with a relatively older value of A.x, thus avoiding the delay. In
our agent-execution model, B always delays its decision-making to get the most recent A.x,
and hence our aim is to reduce this delay as much as possible.
Multi-agent simulation frameworks like NetLogo [12] and RePast [18] execute agents serially
on one machine following the cycle-based simulation model [21]: one execution of all the
agents corresponds to one cycle of the simulation; in this way the simulation proceeds in
cycles of agent execution. Following the AB-dependency graph to ensure data currency, it
would take two simulation cycles to complete the execution: (i) in the first cycle, A updates
A.x and B updates B.y; (ii) in the second cycle, agent A reads B.y to update A.y and B reads
A.x to update B.x. Since all changes occur in main memory, (λ1 + λ2) is very low. However,
when the number of agents is large (tens of millions), serial execution can get relatively slow
compared to an equivalent parallel execution, making the total time taken to run the
simulation large.
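The two-cycle execution just described can be sketched in a few lines of plain Java. This is an illustrative toy of our own, not code from any of the frameworks; the property values and update rules are made up.

```java
// Toy model of the cycle-based execution of the AB-example. All state lives
// in main memory, so the read/write delay (lambda1 + lambda2) is negligible.
public class ABCycles {
    static double ax, ay, bx, by;

    /** Runs the two cycles described in the text and returns B.x. */
    static double run() {
        ax = 0; ay = 0; bx = 0; by = 0;
        // Cycle 1: A updates A.x and B updates B.y (no cross-reads needed).
        ax = 1.0;
        by = 2.0;
        // Cycle 2: A reads B.y to update A.y; B reads A.x to update B.x.
        // Both reads see the values written in cycle 1, ensuring data currency.
        ay = by + 1.0;
        bx = ax + 1.0;
        return bx;
    }
}
```

Serializing the cycles in this way guarantees that every cross-agent read sees the value written in the previous cycle, which is exactly the data-currency property the AB-example demands.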
Frameworks like DMASF [14], MASON [16], or ZASE [13] distribute computation over
a set of available processors. Agent execution is done by switching agents among a limited
number of parallel processes, which may even be running on different machines. In such
scenarios, agents either need to flush data to the disk or communicate through messages over
the network. In either case, (λ1 + λ2) is still high, even though the simulation time may go
down due to parallel agent execution.
DMASF [14] uses the cycle-based simulation model [21], and agents are executed in a limited
number of parallel threads. The simulation enters the next cycle after all agents have updated
their states for the current cycle. DMASF maintains agent states in a database. In a particular
cycle, agents make their perceptions using this database, update their states, and the updated
agent state gets flushed into the database. Considering the AB-example, agent B needs to make
sure that agent A has updated the value of A.x in the database before B can read it. Therefore,
in order to solve the data currency problem, the second cycle of the simulation should begin
only after all the database writes corresponding to the first cycle have completed. Hence the
agents need to wait till all the writes have been performed. Further, accessing the database to
read and write values incurs a lot of overhead in the execution time of each cycle. MASON [16]
suffers from a similar overhead: agents residing on different systems communicate through
messages and exchange data, so network I/O adversely impacts the execution time.
On the other hand, GPUs have hundreds of cores to do computations at a fast pace. CUDA
[9] enables creation and management of a large number (up to several millions) of GPU
threads. CUDA APIs allow access to a shared global memory [11] and fast access to the
mapped main memory [11] for sharing agent data. Thus, the delay (λ1 + λ2) becomes
negligible, along with the simulation time. So in our work on GPUs, we provide an agent
execution model to achieve fast data currency.
Chapter 3
A Multi-agent Simulation Framework on Small Hadoop
Clusters
In this chapter, we present the design of an agent-based simulation framework implemented
on a Hadoop cloud. Being developed on top of Hadoop, it inherits Hadoop's aforementioned
advantages.
Our proposed framework developed on Hadoop provides three major advances to the current
state of the art: (i) dynamic addition of new computing nodes while the simulation is running;
(ii) handling node failures without affecting the ongoing simulation, by redistributing the
failed tasks on working systems; (iii) allowing simulations to run on machines running
different operating systems. Further, the framework incorporates several optimization
techniques: (i) clustering of frequently communicating agents (for reducing inter-processor
communication); (ii) caching of query results run on the Hadoop cloud (for improving
performance); (iii) indexing agent data and agent messages using Lucene [31] for faster
retrieval during the simulation. With the help of Lucene indexing, the execution time of the
simulation is comparatively less than that of current-day simulation frameworks.
3.1 Hadoop Architecture and Map-Reduce Model
Hadoop is an Apache project that develops open-source software for reliable and scalable
distributed computing. It maintains a distributed file system, the Hadoop Distributed File
System (HDFS), for data storage and processing. Hadoop uses the classic Map-Reduce
programming paradigm to process data. This paradigm easily fits a large number of problems
[8]. Hadoop consists of a single master system (known as the namenode) along with several
slave systems (known as datanodes). For failure resilience purposes, it has a secondary
namenode which replicates the data of the namenode at regular intervals.
3.1.1 Hadoop Distributed File System (HDFS)
HDFS is a block-structured file system: individual files are broken into blocks of a fixed
size (default 64 MB), which are distributed across a cluster of one or more machines
(datanodes); thus, all the blocks of a single file may not be stored on the same machine.
Access to a file may therefore require access to multiple machines, in which case the file could
be rendered unavailable by the loss of any one of those machines. HDFS solves this problem
by replicating each block across a number of machines (three, by default). The metadata,
consisting of the division of the files into blocks and the distribution of these blocks on
different datanodes, is stored on the namenode.
3.1.2 Map-Reduce Paradigm
The MapReduce paradigm transforms a list of (key, value) pairs into a list of values. The
transformation is done using two functions: Map and Reduce. The Map function takes an input
(key1, value1) pair and produces a set of intermediate (key2, value2) pairs. The Map output
can have multiple entries with the same key2. The MapReduce framework sorts the Map
output according to the intermediate key2 and groups together all intermediate value2s
associated with the same intermediate key2. The Reduce function accepts an intermediate key2
and the set of corresponding value2s for that key2, and produces one or more output value3s.
(i) map(key1,value1) -> list<(key2,value2)>
(ii) reduce(key2, list<value2>) -> list<value3>
The intermediate values are supplied to the Reduce function via an iterator. This allows
handling lists of values that are too large to fit in memory. The MapReduce framework calls
the Reduce function once for each unique key, in sorted order. Due to this, the final output list
generated by the framework is sorted according to the key of the Reduce function.
For example, consider the standard problem of counting the number of occurrences of each
word in a large collection of documents [2]. This problem can be solved using the Map
and Reduce functions. The Map function emits each word along with an associated count of
occurrences (just '1' in this simple example). The Reduce function sums together all counts
emitted for a particular word.
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

Table 3.1 Map-reduce solution for word-count as it occurs in [2]
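As a concrete illustration, the word-count job can be mimicked in plain Java, with an in-memory shuffle standing in for Hadoop's sort-and-group step. The class and method names here are ours, not Hadoop's API.

```java
import java.util.*;

public class WordCount {
    /** Map phase: emit a (word, "1") pair for every word in the document. */
    static List<Map.Entry<String, String>> map(String doc) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String w : doc.split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, "1"));
        return out;
    }

    /** Reduce phase: sum the emitted counts for one word. */
    static int reduce(String word, Iterator<String> counts) {
        int result = 0;
        while (counts.hasNext()) result += Integer.parseInt(counts.next());
        return result;
    }

    /** The framework's job: shuffle map output by key, then reduce each group.
        A TreeMap stands in for the framework's sort-by-key behaviour. */
    static SortedMap<String, Integer> run(List<String> docs) {
        SortedMap<String, List<String>> groups = new TreeMap<>();
        for (String d : docs)
            for (Map.Entry<String, String> kv : map(d))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        SortedMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet())
            out.put(g.getKey(), reduce(g.getKey(), g.getValue().iterator()));
        return out;
    }
}
```

Note how the grouping structure mirrors the text: all value2s for the same key2 are collected before the single reduce call per key, and the final output comes out sorted by key.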
3.1.3 MapReduce Job Execution by Hadoop
In Hadoop terminology, a 'job' means an execution of a Mapper and Reducer across a data
set, and a 'task' means an execution of a Mapper or a Reducer on a slice of data. Hadoop
distributes the Map invocations across multiple machines by automatically partitioning the
input data into a set of M independent splits. These input splits are processed in parallel
by M Mapper tasks on different machines. Hadoop invokes a RecordReader method on the
input split to read one record per line until the entire input split has been consumed. Each
invocation of the RecordReader leads to another call to the Map function of the Mapper.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces
using a partitioning function. Note that some problems require a series of MapReduce steps
to accomplish their goals:
Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3... and so on
Hadoop supports this by allowing us to chain MapReduce jobs by writing multiple driver
methods, one for each job, using ChainMapper classes (see Section 3.3.2.2).
The namenode is responsible for scheduling the various MapReduce tasks on different
datanodes. First the Map tasks are scheduled and then the Reduce tasks. The namenode gets
periodic updates from the datanodes about the work-load on each. It computes the average
time taken by MapReduce tasks on each datanode and then distributes tasks such that the
faster datanodes get more tasks to execute and the slower ones get fewer.
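The scheduling rule above (faster datanodes receive proportionally more tasks) can be sketched as follows. The method name and the inverse-average-time weighting are illustrative assumptions of ours, not Hadoop's actual scheduler code.

```java
public class TaskShares {
    /**
     * Splits totalTasks among datanodes so that a node's share is inversely
     * proportional to its average task time (faster node => more tasks).
     * avgTaskTimes[i] is the average MapReduce task time on datanode i.
     */
    static int[] distribute(int totalTasks, double[] avgTaskTimes) {
        int n = avgTaskTimes.length;
        double[] speed = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            speed[i] = 1.0 / avgTaskTimes[i];  // speed = inverse of avg time
            total += speed[i];
        }
        int[] share = new int[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            share[i] = (int) Math.floor(totalTasks * speed[i] / total);
            assigned += share[i];
        }
        // Hand leftover tasks (lost to rounding down) to the least-loaded
        // fast nodes first.
        for (int left = totalTasks - assigned; left > 0; left--) {
            int best = 0;
            for (int i = 1; i < n; i++)
                if (speed[i] / (share[i] + 1) > speed[best] / (share[best] + 1))
                    best = i;
            share[best]++;
        }
        return share;
    }
}
```

For instance, with two datanodes averaging 1 s and 2 s per task, 30 tasks split 20/10 under this weighting.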
3.2 Multi-agent Simulation Framework on Hadoop
Each agent has three essential properties: a unique identifier, a time-stamp associated with
the state of that agent, and the type of the agent. The identifiers and time-stamps (current
time) are generated by the framework itself, whereas the agent type is provided by the user.
The user can also specify additional properties. The state of an agent at a particular timestamp
refers to the set of its property values at that timestamp. Likewise, an update in the state of an
agent refers to changes in these property values. The user needs to provide an
update-agent-state function for each type of agent, which is a programmatic model for
incorporating agent strategy.
Hadoop requires problems to be expressed in the MapReduce model. This constraint is
required in order to make the scheduling of MapReduce tasks (done by Hadoop) independent
of the problem being solved. Further, by default Hadoop invokes one Map task for each input
file. Therefore, we model a multi-agent simulation as a series of MapReduce jobs, with each
job representing one iteration of the simulation, and we model each agent as a separate input
file to these jobs. This leads a particular job to invoke multiple Map tasks, one for each agent,
executing in parallel. Therefore the function update-agent-state, which is responsible
for updating the state of an agent in each iteration, is written as a Map task. Reduce tasks are
responsible for writing the data back into the file associated with each agent.
Each agent state is modelled as a separate flat file on HDFS, with the file name being the
agent identifier. This agent file contains the t most recent states of the agent, where t is a
user-specified parameter (specified through getIterationParameters() in the CreateEnviron
class described later); each state is distinguished by a timestamp. This agent file is the input
for the map task initiated to update the state of the agent (one map task for each input
file/agent). The current state of an agent is the one having the most recent timestamp. A
separate message file is associated with each agent, storing recent messages received from
other agents. One MapReduce task is invoked for each agent.
Each iteration of the simulation corresponds to one MapReduce job, invoked with one
MapReduce task corresponding to every single agent. The framework implements two classes,
Agent and AgentAPI. Class Agent contains two classes, Map and Reduce, corresponding to
the MapReduce task. In the map method of class Map, data is read from the associated input
file of the agent into a Java map<object, object>, say agent_data, mapping agent properties
to their values. The agent state is updated by executing the user-supplied Update method (in
the AgentUserCode class described later). The timestamp is also updated to the current time.
All the properties are concatenated as a string, which is passed as the value in the (key, value)
tuple to the Reducer, the key being the agent identifier. In the reduce method of the Reduce
class, the input properties-concatenated string is written to the corresponding agent flat file.
The main() function creates the simulation world with the help of the class CreateEnviron,
for which the code is
    public class Agent {
        public static class Map extends MapReduceBase implements Mapper {
            public void map(key, value, OutputCollector<key, value>);
        }
        public static class Reduce extends MapReduceBase implements Reducer {
            public void reduce(key, Iterator<value>, OutputCollector<key, value>);
        }
        public static void main(String[] args) {
            // Create the simulation world.
            CreateEnviron ce = new CreateEnviron();
            iterparams = ce.getIterationParameters();
            ce.createWorld();
            // Configure map-reduce jobs.
            // Invoke map-reduce task.
        }
    }

    public class AgentAPI {
        public void createAgent(map<object, object> agent_data);
        map<String, String> getAgents();
        void sendMessage(String from_agent_identifier,
                         String to_agent_identifier, String message);
        map<String, String> readMessages(String agent_identifier);
    }

Table 3.2 Framework Supplied Code Classes
supplied by the user. The method getIterationParameters() gets user-specified inputs such as
the number of iterations the simulation is intended to run and the number of most recent
timestamps for which the agent data is retained in the agent-specific flat file. The method
createWorld() is also supplied by the user; it creates agents and initializes the world. Further,
the framework supplies code for the AgentAPI class. The method
createAgent(map<object, object>) creates a flat file in HDFS named with the agent identifier
and writes the initial state of the agent. The method getAgents() returns the data of all the
agents present in the simulation. The sendMessage() function sends a message to another
agent. It writes the current timestamp and the identifiers of the two
    public class CreateEnviron {
        public map<object, object> getIterationParameters();
        void createWorld() {
            // map<object, object> agent_data initialized.
            AgentAPI crobj = new AgentAPI();
            crobj.createAgent(agent_data);
            // And so on for any number of agents.
        }
    }

    public class AgentUserCode {
        map<object, object> Update(map<object, object> agent_data) {
            switch (agent_data["type"]):
                case "type1": // User code for update.
                case "type2": ...
        }
        void Shape(String agent_identifier) {
            switch (agent_identifier):
                case "type1": // User code for rendering shape.
                case "type2": ...
        }
    }

Table 3.3 Classes for which code is supplied by user
agents involved in the communication in the message files associated with them. The
readMessages() function reads all the messages sent to the agent in the last iteration.
Finally, the user needs to supply the code for the strategy of each type of agent.
3.2.1 Handling Failures
System failures are common when the number of systems involved is large. Information
about the namenode and secondary namenode is present on all the datanodes. The
namenode sends a heart-beat message to the datanodes at regular intervals (by default 600
seconds; configurable by the user). Each datanode sends an acknowledgement message along
with information regarding the status of the various tasks running on it. This information
includes the number of completed MapReduce tasks (for the current MapReduce job) since the
last heart-beat message and the total time taken to complete them. It also includes the
number of active MapReduce tasks and the number of MapReduce tasks in the queue, waiting
to be scheduled.
When a particular Map (or Reduce) task fails (and in cases when a slower Map task is
becoming a bottleneck for the rest of the processes), Hadoop spawns a new process to carry
out its job, and may also use idle processes (ones which have completed their Map/Reduce
task) to do its work. When one of the several processes spawned to complete the failed task
finishes, the rest of them are aborted (speculative execution). Thus, the simulation enters the
next iteration only when all Map tasks in the current iteration are completed. Hence Map
tasks are used for updating an agent's state in a particular iteration.
3.2.1.1 Namenode Failure
If the datanodes do not receive a heartbeat message from the namenode for more than two
time intervals (1200 seconds), the namenode is considered to have failed. The namenode
data has already been replicated on the secondary namenode at regular intervals; therefore,
failure of the namenode does not cause any loss of data. The datanodes (on detecting
namenode failure) at once declare the secondary namenode as the new namenode. All the
responsibilities of the namenode, like job scheduling, are now taken up by this new namenode.
Further, the datanode which is physically nearest to the new namenode (for faster namenode
data replication) is selected as the new secondary namenode. A special case occurs when the
secondary namenode has also failed at the instant when the namenode is detected as failed. In
this case, the simulation is aborted.
3.2.1.2 Secondary Namenode Failure
The namenode detects a secondary namenode failure if it does not receive an
acknowledgement for the heart-beat message. Since the secondary namenode only held a
replica of the namenode data, its failure is handled simply by electing a new secondary
namenode from the current datanodes (the datanode physically nearest to the namenode is
selected).
3.2.1.3 Datanode Failure
The namenode detects a datanode failure if it does not receive an acknowledgement of the
heart-beat message from the datanode. The data of each node is replicated on three other
nodes in the distributed system. As such, when a datanode fails, its data can be recovered
easily using its replicas. However, it might have been running several MapReduce tasks when
it failed. These MapReduce tasks need to be rescheduled on some other datanode. Two cases
arise for a failed MapReduce task: (i) the failure occurred while running the Map task; (ii) the
failure occurred while running the Reduce task. When the datanode fails while running its
Map task, the entire MapReduce task needs to be rescheduled on some other datanode and
the complete task needs to be redone. If a datanode failure occurs while running a Reduce
task, then optimally the Map task should not be redone. To achieve this, the output of a Map
task is replicated (as soon as it finishes) on those datanodes which contain the data replicas
for the failed datanode. Thus, if a datanode fails when it is running a Reduce task, only the
Reduce task is rescheduled and redone on some other node.
Thus, even if some of the machines in the Hadoop cluster fail, the simulation does not
stop.
3.2.2 Dynamic Addition of New Nodes
The namenode maintains a file containing the IP addresses of the different machines
(datanodes). It sends heart-beat messages to the systems mentioned in this file. If a new
system needs to be added to the Hadoop cluster, information about it simply needs to be
added to this file. When the namenode finds a new entry in this file, it immediately grants the
new datanode access to HDFS and invokes MapReduce tasks on it, rebalancing the total
work-load. Since heart-beat messages are sent every 600 seconds, the newly added datanode
can be idle for at most this period. Further, the simulation need not be stopped to achieve
this addition.
3.3 Implementation Issues for MAS framework on Hadoop
Our model faces run-time challenges which need to be addressed. As the number of
agents becomes large (of the order of 10^7 agents on 100 machines), overhead increases
significantly due to the generation of a large number of MapReduce tasks. Further, the way in
which the agents are distributed on different datanodes may increase the execution time. We
present below the challenges faced with the above model and solutions for the same.
3.3.1 Small-Files-Problem
Every file, directory, and block in HDFS is represented as an object in the namenode's
memory, each of which occupies about 150 bytes. So if we have 10 million agents running,
we need about 3 GB of memory for the namenode (assuming each file has one block).
Furthermore, HDFS is not geared up to efficiently access small files: it is primarily designed
for streaming access of large files. Reading through small files normally causes lots of seeks
and lots of hopping from datanode to datanode to retrieve each small file, all of which is
inefficient. Moreover, Map tasks usually process a block of input at a time. If the files are very
small and there are lots of them, then each map task processes very little input, and there are
many more Map tasks running, each of which imposes extra book-keeping overhead.
3.3.1.1 Solution to the Small-Files-Problem
The overheads posed by the small-files problem can be reduced if we group several agents
together into a single file. The problem now is to decide the number of agents to be put in
a single file.
Since our major concern is to reduce the overhead due to the generation of a large number
of Map tasks, we first need to find the number of Map tasks that should be run on the
Hadoop cluster to give the best expected performance. Too few Map tasks would not fully
exploit the available resources, while too many Map tasks would require extra book-keeping
and cause swapping of processes (JVMs) between main memory and disk. Hence, the number
of Map tasks that gives good average performance depends on the amount of main memory
available and the computational complexity of the tasks. The memory available to each task
(JVM) can be controlled by setting the mapred.child.java.opts property appropriately; the
default is 200 MB. As an example, let us assume the amount of main memory available on
each node (datanodes and namenode) to be 4 GB. The maximum number of Map tasks that
can be run on a datanode without any swapping would then be 4 GB / 200 MB = 20. With 10
datanodes in the cluster, the number of Map tasks that can be run would be 20 × 10 = 200.
Hence, if we need to simulate 10 million agents, we divide them into 200 groups with 50,000
agents in each group, and a single group is written to one file. Further, the metadata required
would now be around 200 × 150 bytes × 2 = 60 kB, as opposed to the 3 GB computed earlier.
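The sizing arithmetic above can be collected into a small helper. The class and method names are ours; the 150-bytes-per-object constant and the one-block-per-file assumption are the figures quoted earlier in this section.

```java
public class FileSizing {
    /** Max Map tasks that fit in cluster memory without swapping. */
    static long maxMapTasks(long memPerNodeBytes, long memPerTaskBytes, int numNodes) {
        return (memPerNodeBytes / memPerTaskBytes) * numNodes;
    }

    /** Agents per file when the population is split into one file per task. */
    static long agentsPerFile(long numAgents, long numFiles) {
        return (numAgents + numFiles - 1) / numFiles;  // ceiling division
    }

    /** Namenode metadata: ~150 bytes per object, one file object plus one
        block object per file (each file assumed to fit in a single block). */
    static long metadataBytes(long numFiles) {
        return numFiles * 150L * 2L;
    }
}
```

With 4 GB nodes, 200 MB per JVM, and 10 datanodes this reproduces the 200-task, 50,000-agents-per-file, 60 kB-of-metadata figures from the worked example.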
3.3.2 Agent Communication
Agent simulation requires communication between different agents. The agent
communication occurs by fetching the states of other agents from their corresponding agent
files, which may reside on different datanodes. This can be a time-consuming factor in such
simulations. Hence, the number of accesses to files residing on remote datanodes needs to be
reduced.
Continuing the solution presented in Section 3.3.1.1, the reduction of inter-node accesses
is achieved by placing the agents which communicate with each other frequently in the same
group, and hence in the same file. Furthermore, the block-size, which is 64 MB by default, is
set to numAgents × szAgentData, where numAgents is the number of agents to be put in a
single file and szAgentData is the maximum size of an agent's data. The block-size is fixed in
this way to avoid placing blocks belonging to the same file on different nodes.
We use clustering to group the agents. Further, we need to cluster while the simulation is
running; hence, we need an algorithm that does not impose too much overhead on the
execution time of the simulation. The following greedy algorithm achieves these requirements,
and it can itself be distributed via Map/Reduce.
3.3.2.1 Agent Clustering algorithm based on agent-communication
Given K sites, agents a1, a2, ..., aN, and communication statistics between these agents,
the problem is to form groups of agents in such a manner that communication between agents
in the same group is maximized and that between agents in different groups is minimized. By
achieving low inter-group messaging, we reduce the inter-node communication. The
complexity of the solution needs to be of the order of the number of communication links
among the agents.
Rm(ak) denotes the map value for agent ak in file m. The algorithm begins by distributing
the agents evenly into K files on the given sites. Their communication links with each other
during the first iteration are noted, and based on these links they are grouped together.
Step 2 brings together the agents who communicated with each other into the same group.
To avoid redundancy, if two agents communicated with each other, it is considered that the
agent having the lower agent identifier communicated with the other agent, and not the other
way round (Step 2(ii)); we refer to this agent having the lower identifier as the representative
agent. Step 3 combines the groups formed in different files in Step 2 and merges them using
the criteria given in the algorithm. This step is required because the same agent ai may
communicate with agents in several different files and hence have different values of Rj(ai).
Next, consider a case where agent ai occurs in two files m and n. In file m, ai is a
representative agent, with Rm(ai) = ai. In file n, it is not a representative agent and has
Rn(ai) lower than ai. In such a case, R(ai) = Rn(ai). All the elements which initially mapped
to ai in file m have to be re-mapped to Rn(ai). This justifies Step 4 of the algorithm.
    1. Distribute the N agents randomly into K files,
       and carry out one iteration of the simulation.
    2. For every agent ai, in each file j:
       (i)  Compute the list of agents with which ai communicated,
            using the message file associated with ai.
            Call ai the representative agent.
       (ii) Map each agent ak in ai's list to ai only if
            id-value(ai) <= id-value(ak); and map agent ai to itself.
            Denote this map table as Rj.
    3. Combine the map-tables Rj from different files into a single
       map table R using the following update rule:
            if (Rm(ak) < Rn(ak)) then R(ak) = Rm(ak)
            else R(ak) = Rn(ak)
    4. Let R(ak) = al. Then do R(ak) = R(al). Do this update
       for all the entries in map-table R.
    5. All agents ai having the same value for R(ai) form the
       same group.

Table 3.4 ALGORITHM: Greedy Agent-Redistribution
As an example, let the distribution of agents and communication links between them be as
shown in Figure 3.1.
[Figure: seven agents a1–a7 distributed across Sites 1, 2, and 3, with communication links
among them.]
Figure 3.1 Agent distribution and communication
Step 2 (corresponding to the algorithm): R1(a1) = a1; R1(a3) = a3; R1(a7) = a1; R2(a2)
= a1; R2(a6) = a3; R3(a6) = a5; R3(a4) = a3; R3(a5) = a4
Step 3: R(a1) = a1; R(a3) = a3; R(a7) = a1; R(a2) = a1; R(a6) = a3; R(a4) = a3; R(a5)
= a4
Step 4: R(a1) = a1; R(a3) = a3; R(a7) = a1; R(a2) = a1; R(a6) = a3; R(a4) = a3; R(a5)
= a3
Step 5:
Group 1: a1, a2, a7
Group 2: a3, a4, a5, a6
So finally the agents communicating with each other are grouped together. In the presented
example this is the best solution. The algorithm may not always give the best solution, but for
the example scenarios tested it gave reasonable results with O(M/K) complexity, where M is
the number of unique communication links between different agents and K is the number of
sites.
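The worked example above can be reproduced with a compact sequential sketch of the algorithm in plain Java (our own method names; this is a single-machine stand-in, not the thesis's distributed MapReduce implementation). Like the original, a single Step-4 pass may leave long representative chains only partially collapsed.

```java
import java.util.*;

public class GreedyRedistribution {
    /**
     * links.get(a) holds the identifiers of the agents that agent a
     * communicated with during the last iteration. Returns R, mapping
     * every agent to its representative; agents with equal R-values
     * form one group.
     */
    static Map<Integer, Integer> cluster(Map<Integer, Set<Integer>> links) {
        Map<Integer, Integer> r = new HashMap<>();
        // Steps 2-3: map each agent to the lowest identifier among the
        // agents it shares a communication link with (and itself).
        for (Map.Entry<Integer, Set<Integer>> e : links.entrySet()) {
            int ai = e.getKey();
            r.merge(ai, ai, Math::min);
            for (int ak : e.getValue()) {
                int lo = Math.min(ai, ak), hi = Math.max(ai, ak);
                r.merge(lo, lo, Math::min);  // representative maps to itself
                r.merge(hi, lo, Math::min);  // higher id maps to lower id
            }
        }
        // Step 4: one re-mapping pass, R(ak) = R(R(ak)), over every entry.
        for (Integer ak : new ArrayList<>(r.keySet()))
            r.put(ak, r.get(r.get(ak)));
        return r;
    }
}
```

Feeding in the links of Figure 3.1 (a1–a7, a1–a2, a3–a6, a5–a6, a3–a4, a4–a5) yields the two groups {a1, a2, a7} and {a3, a4, a5, a6} computed by hand above.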
3.3.2.2 Implementing Greedy Agent-Redistribution Algorithm
Since we already have a Hadoop cloud setup, we use this cloud to run the above algorithm
and compute the clusters. So, we reframed the algorithm into a chained MapReduce model
and executed it on Hadoop. Execution is carried out in two chained MapReduce jobs; the
output of the first MapReduce job becomes the input to the Map phase of the second
MapReduce job. Finally, the clusters of agents are obtained as the output of the second
MapReduce job. Step 1 of the above algorithm corresponds to MAP-1 and Step 2 to
REDUCE-1. Further, Step 3 is executed as MAP-2 and Step 4 as REDUCE-2.
3.3.2.2.1 In MAP-1 phase:
Input - An agent and the list of agents it communicated with. This is represented as a single
line of numbers x1, x2, ..., xk, where x1 is the identifier of the agent under consideration,
and the following numbers are the identifiers of the agents who communicated with it. The
input consists of several such lines, one for each agent in the simulation. The input split and
load balancing are done by Hadoop itself.
Output - (Key, Value) pairs (xi, x1) if x1 <= xi.
3.3.2.2.2 In REDUCE-1 phase:
Input - The output from MAP-1.
Output - The different values corresponding to the same key are brought together in a list
by Hadoop. Let Valmin denote the minimum of these values. The list of values is reduced to
the single value Valmin. Therefore, the output of this phase is the reduced set of (Key, Value)
pairs. Further, if a pair with the same value for Key and Value occurs (e.g. (2, 2)), then
(2, Valmin) is written to a separate file (referred to as Representative Maps). This is used in
phase MAP-2.
3.3.2.2.3 In MAP-2 phase:
Input - The (Key, Value) pairs from the REDUCE-1 output and the file Representative Maps.
Output - For each Key, the corresponding Value is mapped to R(Value) using the
Representative Maps file, which contains (Value, R(Value)) pairs. Finally, the pair is reversed.
Therefore, the output of the phase is (R(Value), Key).
3.3.2.2.4 In REDUCE-2 phase:
Input - The (Key, Value) pairs from MAP-2.
Output - The values corresponding to the same key are grouped together in a list by Hadoop
itself. Therefore the final output of the phase is (Key, List of Values). For example, if (k1, v1),
(k1, v2), and (k1, v3) were present in the output of MAP-2, then the corresponding output
pair will be (k1, [v1, v2, v3]).
These clusters/groups contain agents who communicate with each other.
3.3.2.3 Placing the agent-groups into Files
The total number of files to be formed is determined using the concepts mentioned in
Section 3.3.1.1. In our experiments, we fixed it at 10 × K, where K is the number of computing
sites. The clusters obtained are re-arranged into 10 × K files such that, as far as possible, all
agents belonging to the same cluster occur together in the same file. We follow the procedure
stated in Table 3.5.
    file = 1
    for each cluster c:
        if (size(c) + allocation(file) < capacity(file)) then
            Place all agents in cluster c into this file.
        else
            Place into this file the maximum possible number
            of agents in cluster c.
            file = file + 1
            Reduce cluster c to its unallocated agents.
            Repeat the loop for this c.
        Update allocation(file) appropriately.

Table 3.5 ALGORITHM: Agent-Allocation
where,size(c)denote the size of clusterc, allocation(file) gives the current number of
agents placed into thisfile, andcapacity(file)gives the maximum number of agents which
this file can hold.
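The procedure of Table 3.5 can be sketched as below; this is a simplified version in which a cluster that does not fit is split greedily across consecutive files (agent identifiers and the fixed per-file capacity are illustrative, and the table's strict inequality is relaxed to "fits"):

```java
import java.util.*;

public class AgentAllocation {
    // Greedy allocation of clusters into files of fixed capacity, following
    // Table 3.5. Returns, per file, the list of agent identifiers placed in it.
    static List<List<Integer>> allocate(List<List<Integer>> clusters, int capacity) {
        List<List<Integer>> files = new ArrayList<>();
        files.add(new ArrayList<>());
        for (List<Integer> cluster : clusters) {
            List<Integer> remaining = new ArrayList<>(cluster);
            while (!remaining.isEmpty()) {
                List<Integer> file = files.get(files.size() - 1);
                int room = capacity - file.size();
                if (remaining.size() <= room) {
                    file.addAll(remaining);                  // whole cluster fits
                    remaining.clear();
                } else {
                    file.addAll(remaining.subList(0, room)); // fill file, open next
                    remaining = new ArrayList<>(remaining.subList(room, remaining.size()));
                    files.add(new ArrayList<>());
                }
            }
        }
        return files;
    }
}
```

With capacity 4 and clusters {1,2,3}, {4,5}, {6}, the first file receives agents 1-4 and the second receives 5 and 6; agents 1-3 of the first cluster stay together, as intended.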
3.3.3 Queries in Agent-State Updates
public class DistributedCacheUtils {
    public void createCache(String cacheFileName,
                            List<String> records, int iterationNumber);
    List<String> readFromCache(String cacheFileName, int iterationNumber);
    List<String> deleteFromCache(String cacheFileName, int iterationNumber);
}

Table 3.6 API provided to handle cached results
An agent, to decide its next state, needs to know the state of other agents. For
example, in the simulation of a soccer game, an agent may want to know the locations
of all the players at a given instant; this information for that particular instant would be the same for
all agents. Hence, the percept once computed can be broadcast to all; in other words, once
the query for the locations of all agents is computed, its result can be used by other agents as well.
Therefore, caching results reduces execution time by avoiding redundant execution of the same
queries.
Queries need to access a large number of agent files, many of which may reside on dif-
ferent datanodes. To reduce access time, we cache the results of recent queries in HDFS.
The cached result files are physically replicated on each datanode with the help of the Distribut-
edCache class of Hadoop. This provides rapid access to the cached results for the datanodes.
The framework provides an implementation (refer to Table 3.6) which allows the simulation de-
veloper to perform various operations on cache files. Method createCache() creates a cache
file with the specified file name, writes the records provided and copies the file into the Dis-
tributedCache of the HDFS. Further, the iteration number of the simulation in which the file
was created is stored along with it. Method readFromCache() reads the entire cache file asso-
ciated with the iteration number passed as a parameter. Lastly, deleteFromCache() allows the
developer to delete unused cache files and clear HDFS space from time to time.
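As an illustration of how the Table 3.6 API shape might behave, here is a minimal in-memory stand-in keyed by (cacheFileName, iterationNumber); the real implementation writes to HDFS and replicates files via Hadoop's DistributedCache, which this sketch does not attempt, and the class name is invented:

```java
import java.util.*;

// In-memory stand-in for the DistributedCacheUtils API of Table 3.6.
public class LocalCacheUtils {
    private final Map<String, List<String>> store = new HashMap<>();

    private static String key(String name, int iteration) {
        return name + "#" + iteration;   // one entry per (file, iteration) pair
    }

    public void createCache(String cacheFileName, List<String> records, int iterationNumber) {
        store.put(key(cacheFileName, iterationNumber), new ArrayList<>(records));
    }

    public List<String> readFromCache(String cacheFileName, int iterationNumber) {
        return store.getOrDefault(key(cacheFileName, iterationNumber), Collections.emptyList());
    }

    public List<String> deleteFromCache(String cacheFileName, int iterationNumber) {
        List<String> removed = store.remove(key(cacheFileName, iterationNumber));
        return removed == null ? Collections.emptyList() : removed;
    }
}
```

A simulation would, for example, create a "locations" cache in iteration 5, let every agent read it during that iteration, and delete it once the iteration's updates are complete.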
3.4 Agent Execution using Lucene/Solr Indexing
A major improvement in run-time (as much as 100 times; see the results section) is
achieved by replacing HDFS-based agent data storage with Lucene-based indexed storage.
In this section, we briefly introduce the technologies Lucene [31] and Solr [32] and the way
agent execution is implemented using them.
3.4.1 Introduction to Lucene and Solr
Lucene [31] is an open source, highly scalable text search-engine library developed by
the Apache Software Foundation. Lucene was developed in Java and has since been ported
to many other programming languages, including Perl, Python, C++, and .NET. Lucene's
APIs focus mainly on text indexing and searching.
Supporting full-text search using Lucene requires two steps: (1) creating a Lucene index
on the documents and/or database objects and (2) parsing the user query and looking up
the prebuilt index to answer the query. Lucene uses powerful, accurate, and efficient search
algorithms to look into the index created by it and retrieve the results.
The major concepts in Lucene include documents, fields and queries. The unit of search
and indexing is a document. A document is basically an object that needs to be stored in the
index. A document consists of one or more fields. A field is simply a name-value pair. An
index, therefore, consists of one or more documents, each document containing some pre-
defined fields. Indexing involves adding documents to an IndexWriter, and searching involves
retrieving documents from an index via an IndexSearcher. Lucene provides its own query
language for searching, allowing users to specify which fields to search on, which fields to
give more weight to (boosting), the ability to perform boolean queries (AND, OR, NOT), and
other functionality.
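The document/field/index relationship can be illustrated with a toy inverted index in plain Java; this sketch only conveys the concept and is not Lucene's actual API or data structures (the class and method names are invented):

```java
import java.util.*;

// Toy inverted index: each document is a set of (field, value) pairs, and the
// index maps a field:value term to the set of document identifiers containing it.
public class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Indexing involves adding documents": record each field:value term.
    public void addDocument(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> f : fields.entrySet())
            postings.computeIfAbsent(f.getKey() + ":" + f.getValue(),
                                     k -> new TreeSet<>()).add(docId);
    }

    // "Searching involves retrieving documents": look up a single field term.
    public Set<Integer> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptySet());
    }
}
```

Lucene's real index adds tokenization, scoring, boosting and on-disk segment files on top of this basic postings-list idea.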
Lucene uses a file-based locking mechanism to prevent concurrent index modifications.
Moreover, Lucene allows simultaneous searching and indexing which further enhances its
performance.
Lucene provides libraries for building and searching an index on a local machine. If instead
we want to query a Lucene index residing on some remote machine, we need a search server
which can handle all querying requests and answer them efficiently. Solr [32] does this job.
Solr runs as a standalone full-text search server within a servlet container such as Tomcat.
Solr uses the Lucene Java search library at its core for full-text indexing and search. Like
Lucene, it allows the users to specify the schema of the documents that need to be stored in the
underlying Lucene index. Further, it provides highly scalable distributed search with a shared
index across multiple hosts, which increases the search speed and allows for a larger number
of simultaneous queries.
3.4.2 Implementation Strategy
We use Lucene to store the agent data. Each agent is modelled as a document in the Lucene
index, with the different agent attributes stored as fields in its document. Agents are indexed
based on their agent identifiers. The agent identifier is configured in the index as a unique and
required field of the agent documents. A Solr-based search server is set up which listens for
incoming requests for adding agent documents and querying agent data, and executes them on
the underlying Lucene index.
The implementation of the classes CreateEnviron and AgentUserCode remains the same
as earlier. Even the APIs provided by the classes AgentAPI and Agent have the same decla-
rations for their methods; only the way they are implemented changes.
Each iteration in the simulation corresponds to one MapReduce job, invoked with each
MapReduce task executing a set of agents; the number of agents executed in a sin-
gle task is decided using the concepts described in Section 3.3.1.1. Class Agent contains two
classes, Map and Reduce, corresponding to the MapReduce task. In the map method of class
Map, agent data is fetched from the Lucene index through the APIs provided by Solr. A single
query is formed by concatenating the identifiers with an OR clause for the set of agents being
executed by that particular MapReduce task. The execution time for a single query fetching
multiple agent documents is much less than that of multiple queries fetching one agent document
each. Agents are then updated using the Update method provided by the user in the Agen-
tUserCode class. The updated values for the agent documents are rewritten to the Lucene index
once the Update method has been called for all agents. Again, a bulk-update API is provided
by Solr which updates several entries in the Lucene index in one request, rather than updat-
ing single entries in multiple requests. The agent identifier and the corresponding serialized
agent properties are emitted as the (key, value) pairs on completion of the Map task. In the reduce
method of class Reduce, the received (key, value) pairs are simply passed to the OutputCollec-
tor. In this manner, we have a log of the states of all agents after completion of every iteration.
This log can be used later for offline visualization and analysis.
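The single OR-query built per task can be sketched as follows; the field name agent_id and the exact query syntax shown are assumptions for illustration, not the framework's actual schema:

```java
import java.util.*;

// Sketch of the bulk-fetch idea: one query ORing together the identifiers of
// all agents handled by a MapReduce task, instead of one query per agent.
public class BulkQuery {
    static String orQuery(String idField, List<Integer> agentIds) {
        StringJoiner sj = new StringJoiner(" OR ");
        for (int id : agentIds)
            sj.add(idField + ":" + id);
        return sj.toString();
    }
}
```

A task responsible for agents 1, 2 and 3 would thus issue the single query `agent_id:1 OR agent_id:2 OR agent_id:3` rather than three round-trips to the search server.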
Method AgentAPI.createAgent creates agents by adding agent documents to the Lucene
index. Method AgentAPI.getAgents issues a select-all-like query on the Lucene index. Agen-
tAPI.sendMessage creates an entry in the Lucene index with the message identifier being a
function of the iteration number and the agent identifiers of the two agents involved. Messages
indexed by message identifiers built in this manner provide fast retrieval of the messages
by the AgentAPI.readMessage method.
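One possible message-identifier scheme matching this description is sketched below; the exact encoding used by the framework is not specified in the text, so the format here is purely illustrative:

```java
// Deterministic message identifier built from the iteration number and the two
// agent identifiers, so an entry written by sendMessage can be looked up
// directly by readMessage without scanning the index.
public class MessageId {
    static String of(int iteration, int fromAgent, int toAgent) {
        return iteration + "-" + fromAgent + "-" + toAgent;
    }
}
```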
Replacing HDFS-based agent data storage and messaging with fast index-based oper-
ations for updates and messaging provides a major reduction in the execution time of the
agent-based simulation. The speed-up achieved is up to 100 times for some simulations.
3.5 Experimental Results
In our experiments, we took multi-agent simulation problems of a diverse nature, so as to
test the overhead of executing the two optimization algorithms - clustering of agents and
caching of intermediate results - running on top of the Hadoop framework. Further, we stud-
ied the speed-up provided by using a Lucene index along with Hadoop. We also compared
the execution time for several simulations with one of the existing simulation frameworks,
DMASF [14], for varying numbers of agents. We set up a Hadoop cloud with 10 Unix ma-
chines, each having a 1.67 GHz processor with 2 GB RAM. Our experiments involved 200,000
agents distributed on these 10 machines and interacting with each other for 100 iterations.
Some of the important results obtained are presented here.
3.5.1 Circle Simulation
In this problem, agents are scattered randomly on a 2-D plane. Their goal is to position
themselves in a circle, and then move around it. The strategy involved computation of the arith-
metic mean of the locations of all the agents. A frequent query is executed to compute this mean
from the locations of all agents. Accessing the agent files on different systems every time
an agent requested it (once in each agent-update function) was avoided by caching the locations of
all the agents once and then using this cached value for future reference. This cache was up-
dated in every iteration. The average execution time for one iteration reduced from 124 seconds
to 58 seconds (see Figure 3.2).
Figure 3.2 Comparing iteration-time for caching On and Off for Circle Simulation on Hadoop
Figure 3.3 Comparing number of inter-node messages for clustering On and Off for Standing Ovation Simulation
Figure 3.4 Comparing iteration-time for clustering On and Off for Standing Ovation Simulation
Figure 3.5 Comparing iteration-time for clustering On and Off for Sand-pile simulation
Figure 3.6 Iteration times obtained for KP-Simulation
Figure 3.7 Iteration times obtained when removing and adding machines dynamically
Figure 3.8 Scalability test with 200,000 agents
Figure 3.9 Scalability Test with varying number of datanodes
3.5.2 Standing Ovation Problem (SOP)
The basic SOP can be stated as: a brilliant economics lecture ends and the audience begins
to applaud. The applause builds and, tentatively, a few audience members may or may not
decide to stand. Does a standing ovation ensue, or does the enthusiasm fizzle? The authors
present a simulation model for the above problem in [28]. We modelled the auditorium as
a 500×400 grid, and agents were randomly assigned a seat. Agents in this simulation com-
municated with an almost constant set of neighbouring agents. Hence the clustering algorithm
proposed earlier showed a marked reduction in the number of inter-site messages and ma-
jor improvements in execution time. The time for each iteration was almost halved (see Figures
3.3 and 3.4). The slight increase in iteration-time during the third iteration is due to the overhead
of the clustering algorithm. It took approximately 35 seconds for the clustering algorithm to run
for 200,000 agents.
3.5.3 Sand-pile Simulation
In this problem, grain particles are dropped from a height into a beaker through a funnel,
and they finally settle down in the beaker after colliding with each other and with the walls
of the beaker and funnel. A detailed description and solution of the problem is given in [26];
for a visualization of the problem refer to the link [27].
The queries generated in this problem were generic enough to give good results on caching
intermediate results; these results were similar to the Circle Simulation and are hence omitted. However, the
set of agents with which a particular agent interacted changed too frequently. Hence, the
number of inter-node messages varied from iteration to iteration. The results show a 100-
iteration run with regular agent-clustering done after every 30th iteration (see Figure 3.5). It
is observed that agent clustering reduces the execution time for the following iterations, but due to
frequent changes in the communication links of agents, the effect of clustering reduces rapidly.
The sharp rises at the 33rd, 63rd and 93rd iterations are due to the overhead of the clustering
algorithm.
3.5.4 KP Simulation
This simulation is done to test the messaging efficiency of the framework. K denotes the
number of agents in a particular run of the simulation. P is the number of messages sent
by each agent. Results obtained for different values of K and P are shown in Figure 3.6.
Results indicate that inter-node communication incurs a major cost in execution time. As the
number of agents increases, involving a larger number of data-nodes, the cost
of messaging increases. This is indicated by the two pairs of values (K=1000, P=5000),
taking 11 seconds, and (K=5000, P=1000), taking 36 seconds, for the same total number
of messages flowing in the system (5000×1000).
3.5.5 Dynamic Nodes Addition
We tested the ability of Hadoop to redistribute agents when new datanodes are dynami-
cally added to the Hadoop cluster and when some hardware failures occur. We ran the sand-pile
simulation with a Hadoop cloud consisting of ten datanodes. At the 30th iteration, we failed two
datanodes. Finally, we added one datanode at the 60th iteration. Results show an inverse relation
between execution time and the number of active data-nodes, shown in Figure 3.7. The aver-
age execution time for one iteration increased from 99 seconds to 121 seconds on reduction
of nodes from 10 to 8, and then decreased again to 106 seconds when 9 datanodes became
active.
3.5.6 Scalability Tests
For testing the scalability of the framework, we conducted two experiments. In the first ex-
periment, we ran the circle simulation with 200,000 agents, varied the number of active
datanodes from 1 to 10, and noted the average time taken for one iteration in each case. Re-
sults show the run-time being inversely proportional to the number of machines (Figure 3.8).
The time taken for one iteration reduced from 489 seconds (on one machine) to 58 seconds when
executed on 10 systems. In the second experiment, we ran the circle simulation for 60 sec-
onds and noted the number of agents updated with one machine in the Hadoop cloud. Then,
we varied the number of machines from 1 to 10 and in each case noted the number of agents
updated in 60 seconds. An almost linear increase in the number of updated agents was
observed (Figure 3.9). For one system, 26,081 agents were updated in 60 seconds,
which increased to 200,012 agents when 10 systems were put to use in the same amount of time.

Figure 3.10 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Circle Simulation
3.5.7 Experiments and comparison with Lucene
We compared the execution time of Hadoop using a Lucene index as the backend for agent data
and messaging with Hadoop using HDFS as the backend. The execution time obtained using
Lucene indexing is comparable to and even faster than DMASF [14]. We used the three simu-
lations - circle, standing ovation and sand-pile - mentioned earlier for the experiments. Results
obtained are presented in Figures 3.10, 3.11, 3.12. The results show that as we increase the
number of agents, Hadoop with a Lucene index outperforms the current state-of-the-art simulation
framework, DMASF.
Figure 3.11 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Standing Ovation Simulation
Figure 3.12 Comparing iteration-times between Hadoop with HDFS, Hadoop with Lucene, and DMASF for Sand-pile Simulation
Hadoop provides dynamic load balancing, failure recovery and dynamic addition of new
nodes, which definitely incur processing overheads in terms of heart-beat messages and data
replication. It is a generic framework for solving diverse problems and is not specifically in-
tended for multi-agent systems. Hence, the run-times for different simulations obtained with
Hadoop are definitely not the best when compared with other simulation frameworks.
Chapter 4
Efficient Multi-Agent Simulation using Four State Agent
Execution Model on GPUs
In this chapter, we present an agent execution framework on GPUs to achieve fast data
currency. The main contributions of this chapter are: (i) developing an agent execution model
with four agent states - update, perceive, decide, rest - to ensure that every agent gets the
latest perceptions of the agent environment (Section 4.2); and (ii) presenting a multi-agent
simulation framework using FSAM (Four-State Agent-execution Model), utilizing GPUs as a
platform to efficiently support multi-agent applications with millions of agents (Section 4.3).
We also present some optimizations which the framework utilizes to speed up the simula-
tion on GPUs: (i) optimal distribution of agents on a cluster of GPUs, (ii) a fast messaging
model, and (iii) improvements in managing warps ([11]). These optimizations are specific
to the CUDA architecture for GPUs. On running 10 million agents, the speedup achieved on
GPUs for some simulations was as high as 10,000 times when compared against 2 CPUs with a
similar configuration (see Section 4.4.1).
4.1 Outline of nVidia Compute Unified Device Architecture
The CUDA architecture consists of a scalable array of multi-threaded Streaming Multiproces-
sors (SMs). Each multiprocessor in turn consists of multiple Scalar Processor (SP) cores. A
multiprocessor creates, manages, and executes concurrent and light-weight GPU threads in
hardware with extremely low scheduling overhead. It implements a fast barrier synchroniza-
tion with a single instruction. These features enable fine-grained parallelism by assigning
one thread to each data element (in the present case, assigning a thread to each agent) of the
problem under consideration.
The CUDA architecture supports creation of a large number of GPU threads, theoretically
of the order of 2^40. These threads are organized into two- or three-dimensional thread blocks.
Threads within a block synchronize and share data through a shared memory. The threads of
a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new
blocks are launched on the vacated multiprocessors. Thread blocks are further organized into
one-dimensional or two-dimensional grids of thread blocks.
CUDA uses a Single-Instruction Multiple-Thread (SIMT) architecture, i.e., the same set of in-
structions is carried out concurrently in different threads, with each thread processing dif-
ferent data elements. Each thread has its own instruction address and register state. The set
of instructions which is executed N times in parallel by N different CUDA threads is referred
to as the kernel function. Each of the threads that execute a kernel is given a thread identifier.
Similarly, each block in the grid is given a unique identifier. CUDA executes thread blocks
independently, in any order, across any number of cores, enabling programmers to write code
that scales with the number of cores.
CUDA threads may access data from multiple memory spaces during their execution. Each
thread has a private local memory. Each thread block has a shared memory (16 kB in size)
visible to all threads of the block and with the same lifetime as the block. Finally, all threads
have access to the same global memory. There are also two additional read-only memory
spaces accessible by all threads: the constant and texture memory spaces. The global, con-
stant, and texture memory spaces are optimized for different memory usages. The above
memory spaces altogether are referred to as the device memory spaces (as they reside on the
GPU device). Kernels operate out of device memory. They cannot directly
Figure 4.1 Agent state change diagram
operate on the memory space of the host CPU on which the GPU devices are mounted. However,
CUDA also enables accelerated access to page-locked host memory by the GPU device.
The number of blocks a multiprocessor can process at a time depends on the number of regis-
ters required per thread and the shared memory required per block for a given kernel. If either of
them is not enough to process at least one block, the kernel will fail to launch. The number
of registers required per thread (for a given kernel code), the shared memory required per block
and their effect on performance can be determined using the CUDA Occupancy Calculator [10]
provided by nVidia.
4.2 Agent Execution Model
Each agent is modelled as a separate GPU thread. We define four agent states - update,
perceive, decide and rest (refer to the state transition diagram in Figure 4.1). The application de-
veloper needs to define what each agent does in each of these states as separate functions;
we refer to these sub-routines as agent-state codes. As soon as the agent threads are created,
agents enter the update state and execute their update code, on completion of
which they enter the perceive state. In the perceive state, every agent monitors the events
occurring in the agent environment and obtains information about the states of other agents.
During the perceive state, an agent executes its perceive code multiple times in order to get the
latest perception of the agent environment and determine the best possible decisions. We ensure
that an agent is scheduled at least twice for perceive code execution (see Sections 4.2.1 and
4.2.2 for details).
The decisions framed during the perceive state get executed once the agent enters the decide
state. After executing the decide code, an agent immediately enters the rest state. An
agent remains in the rest state till every agent in the simulation has executed its decide code.
As soon as every agent reaches the rest state (on completion of the decide state), agents enter the
update state. This marks the beginning of the next cycle of the simulation. Thus, the simulation
is said to complete one cycle when every agent completes one set of the state transitions,
starting from the update state to the rest state.
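The cycle described above can be sketched as a simple state machine; in the framework each agent is a GPU thread and the transitions are synchronized across all agents, which this single-agent sketch omits:

```java
import java.util.*;

// Minimal sketch of the four-state cycle: update -> perceive -> decide -> rest.
public class FourStateAgent {
    enum State { UPDATE, PERCEIVE, DECIDE, REST }

    State state = State.UPDATE;
    final List<State> trace = new ArrayList<>();

    // Advance to the next state; REST wraps back to UPDATE, which in the
    // framework marks the beginning of the next cycle.
    void step() {
        trace.add(state);
        switch (state) {
            case UPDATE:   state = State.PERCEIVE; break;
            case PERCEIVE: state = State.DECIDE;   break; // after decision-lag elapses
            case DECIDE:   state = State.REST;     break;
            case REST:     state = State.UPDATE;   break; // all agents rested: next cycle
        }
    }
}
```

Four steps thus visit each state exactly once, completing one cycle and returning the agent to the update state.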
We have not yet defined the transition from the perceive state to the decide state. For this we
introduce two parameters - the decision-lag ǫ, and the idle-time. An agent in the simulation stays in
the perceive state till every agent has spent at least ǫ time (the decision-lag) in its perceive state. In
our experiments, because of the parallel threading provided by GPUs, the value of ǫ obtained is
very small. Idle-time is the duration for which an agent thread waits to get scheduled for
execution on the processor.
4.2.1 Evaluating ‘decision-lag’
The value of the decision-lag, ǫ, needs to be chosen carefully. Consider two groups of agents,
G1 and G2, with agents in the same group being scheduled concurrently. Let the agent threads
corresponding to group G1 be scheduled earlier than G2. G2 agent-threads will then complete their
update code execution later than G1. Now, we need to ensure that G1 agents execute their
perceive code at least once after the completion of the G2 agents' update code execution, in order
to get the latest states of the G2 agents. Further, G2 agents in their perceive state may send a
message to G1 agents, so G1 agents should execute their perceive code at least once after G2
agents have executed their perceive code. If we ensure that G2 agent-threads get scheduled
at least twice for perceive code execution, then we can be sure that G1 agent-threads get
scheduled at least once for perceive code execution after completion of the G2 agents' perceive.
Thus, we need to make sure that G2 agent-threads stay in the perceive state for a time ǫ greater
than the sum of the idle-time and the time taken for perceive code execution.
Further, ǫ will vary for different simulations depending on the time taken by the agent-
state codes. The idle-time increases with the number of agents, hence affecting the value of ǫ.
A good estimate of ǫ is obtained by executing one cycle of the simulation keeping ǫ equal to 0;
i.e., as soon as every agent finishes the update state, agents execute their perceive code once
and enter the decide state. The total time taken for one cycle is noted. In this time, every
agent has been scheduled at least thrice for execution, once corresponding to each of the update,
perceive, and decide states. Hence, the total time is a good choice for the decision-lag ǫ. So, we
assign this value to ǫ and restart the simulation from the beginning. During the course of the
simulation it may happen that agents change their execution path; i.e., an agent may face different
scenarios and hence execute a different strategy which may take longer to execute.
Accordingly, the value of ǫ needs
to be adjusted. Therefore, the decision-lag for the (i+1)th cycle, ǫ_{i+1}, is calculated as

ǫ_{i+1} = maximum(ǫ', ǫ_avg), where
ǫ_avg = (Σ_{k=1}^{i} ǫ_k) / i,
ǫ' = T_i(u+p+d),
T_i(u+p+d) = time taken for update, perceive and decide code execution in the ith cycle.
We found experimentally that the value of ǫ converges after a few iterations. In our ex-
periments with 10^6 agents, ǫ converged to 31.439 ms for the Circle simulation and to 14.125 ms
for the Hand-shake simulation after the second cycle itself (refer to Section 4.4.1). For the Sand-pile sim-
ulation, ǫ started off with a high value of 105.253 ms due to numerous calculations in the first
cycle, but gradually converged to 80.380 ms after the tenth cycle.
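The update rule for ǫ can be sketched as follows; times are in milliseconds, and the history array stands for the ǫ values used in cycles 1..i:

```java
// Sketch of the decision-lag update: epsilon for the next cycle is the maximum
// of this cycle's measured state-code time T_i(u+p+d) and the running average
// of the epsilons used so far.
public class DecisionLag {
    static double next(double[] epsilonHistory, double measuredTime) {
        double sum = 0;
        for (double e : epsilonHistory) sum += e;
        double avg = epsilonHistory.length == 0 ? 0 : sum / epsilonHistory.length;
        return Math.max(measuredTime, avg);
    }
}
```

Taking the maximum makes ǫ react quickly to cycles that suddenly take longer, while the running average keeps it from collapsing after one unusually fast cycle.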
4.2.2 Utility of ‘perceive’ state
With our agent-execution model, the simulation developer needs to know which agent-activities
are dependent on other agents and which are independent. Independent activities
form the code corresponding to the update state. Dependent activities get broken into two
parts: first, analyzing the agents (or the environment) on which the activity is dependent,
the perceive code; second, taking decisions on the basis of the perceptions and messages re-
ceived, the decide code. In the AB-example, agent A will update the value of A.x, and B will
update the value of B.y in their update code. In the perceive state, A will read B.y, and B will
read A.x. Finally, in the decide state, A will update the value of A.y and B will update B.x.
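The AB-example can be written out as plain sequential code to show the split; the attribute names follow the text, while the update rules themselves are invented for illustration:

```java
// The AB-example: the four-state split guarantees each agent reads the other's
// freshly updated attribute before deciding.
public class ABExample {
    static class Agent { int x, y; }

    static void runCycle(Agent a, Agent b) {
        // update state: independent activities
        a.x = a.x + 1;
        b.y = b.y + 1;
        // perceive state: each agent reads the other's latest value
        int aSeesBy = b.y;
        int bSeesAx = a.x;
        // decide state: decisions based on the perceptions
        a.y = aSeesBy;
        b.x = bSeesAx;
    }
}
```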
The presence of the perceive state is important. Looking at the G1G2-example, let agent A
be a part of G1 and B a part of G2. B is scheduled after A for execution of the update
code. Now, A may execute its perceive state in parallel with B's update state execution. In order
to ensure that A perceives the latest value of B.y, A should execute its perceive code at least
once after B's update code (as enabled by the decision-lag) for its decision making, whereas the
absence of the perceive state might have led A to read an older value of B.y. Therefore, in our
model A is always made to execute the perceive state.
4.3 FSAM-framework Architecture
We implemented the FSAM-framework for testing our execution model. The framework con-
sists of GPUs mounted on the same host CPU. There are three core components: a master-
controller, agent-controllers and user-interaction (Figure 4.2). Each agent is modelled as a
separate CUDA thread on the GPU.
Master-controller: This component is responsible for managing the overall execution
of the simulation. It initiates the simulation and distributes the agents on the available agent-
controllers. It keeps track of whether the agent-controllers are working or have failed, and redis-
tributes the agents in case of system failures.
Figure 4.2 FSAM-framework Architecture
Agent-controller: This component is responsible for running the agents allotted to it.
Every agent allotted to an agent-controller runs in a separate CUDA thread.
User-interaction: This component gives a visual representation of the running simula-
tion. It runs on the same physical system as the master-controller. The visualizer accesses
the agent-environment data on the master-controller and renders the visualization using it.
The simulation begins by invoking a query to know the number of GPU devices (agent-
controllers) that are hosted on the master-controller (CPU). Then, the master-controller does the
initial distribution of the agents on the agent-controllers, and the simulation starts with all
agents entering their update state.
4.3.1 Distribution of agents
Given N agents and G GPUs, the agent distribution algorithm is stated in Table 4.1. The
compute_concurrency() function determines the number of concurrent threads that can run
on each agent-controller using the CUDA Occupancy Calculator [10]. The agent-controllers are
sorted in descending order of their number of concurrent threads. Next, the agent-controller
that can run the maximum number of concurrent threads is allotted 'n' agents, where
1. for i = 1 to G:
       C[i].concurrency = compute_concurrency();
       C[i].id = i;
2. Sort C in descending order according to C[i].concurrency value.
3. s = 0; i = 0; Num[1...G] = 0;
4. while (s <= N && i < G):
       Num[C[i].id] = C[i].concurrency;
       s += C[i].concurrency; i += 1;
5. if (s < N):
       for i = 1 to G:
           Num[C[i].id] += (N-s) * C[i].concurrency / s;
6. Num[i] is the number of agents to be allotted to the ith GPU.

Table 4.1 ALGORITHM: Agent-distribution on GPUs
'n' is the number of concurrent threads that can run on this system. Then, the agent-controller
having the second maximum number gets a similar allotment. This process continues till
all the agents are allotted. The total number of agents can be more than the total number of
concurrent threads. In such a case, we first allot agents to each of the agent-controllers as
mentioned above, and then distribute the remaining unallotted agents in proportion to the
number of concurrent threads running on those systems.
In our experiments, we found that with this distribution no participating GPU is over-
loaded and none of them is idle for a large number of agents.
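The distribution of Table 4.1 can be sketched as below; the concurrency values are illustrative, and the proportional remainder is truncated by integer division, as in the table:

```java
import java.util.*;

public class AgentDistribution {
    // Sketch of Table 4.1: GPUs sorted by concurrency each receive one full wave
    // of concurrent threads, best GPU first; remaining agents are then spread in
    // proportion to concurrency. num[i] = agents allotted to the i-th GPU.
    static int[] distribute(int n, int[] concurrency) {
        int g = concurrency.length;
        Integer[] order = new Integer[g];
        for (int i = 0; i < g; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> concurrency[b] - concurrency[a]);

        int[] num = new int[g];
        int allotted = 0, s = 0;
        for (int idx : order) {                    // one wave per GPU
            if (allotted >= n) break;
            int take = Math.min(concurrency[idx], n - allotted);
            num[idx] = take;
            allotted += take;
            s += concurrency[idx];
        }
        if (allotted < n && s > 0) {               // leftovers: proportional share
            int leftover = n - allotted;
            for (int i = 0; i < g; i++)
                num[i] += leftover * concurrency[i] / s;
        }
        return num;
    }
}
```

For example, with two GPUs of concurrency 20 and 10, distributing 45 agents first fills one wave on each (20 + 10), then splits the remaining 15 agents 10/5 by the same 2:1 ratio.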
4.3.2 Event-driven approach of agent-state updates
The application developer needs to provide the implementation code for the agent-environment
initialization, and the update, perceive and decide codes. The agent state codes are executed
as kernel ([11]) functions on the agent-controllers. CUDA provides event generation
APIs which allow us to raise events when a particular kernel finishes its execution. Using
these APIs, events are generated on complete execution of kernel functions, which denote the
change in state of the agents, and the master-controller gets notified about these events. The
perceive state for an agent gets initiated as soon as its update state is complete. Switching from
the perceive state to the decide state for an agent requires all agents to complete their perceive state.
Therefore, the master-controller waits for notifications from all the agent-controllers about the
completion of the perceive state, after which it initiates the execution of the decide state. An agent
enters the rest state as soon as its decide state is completed. The transition from the rest state to the
update state occurs in a way similar to the perceive-decide state transition. State changes occur
according to the protocols described in Section 4.2.
An agent on an agent-controller is considered to have failed if the agent-controller does
not receive a notification for its state transition within 10*εi seconds, where εi is the decision-lag
computed for the ith cycle. In such a case, the kernel executing that particular agent is re-launched
for execution.
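A minimal sketch of this failure rule (the function and parameter names are ours; the notification plumbing itself is framework-internal):

```c
#include <stdbool.h>

/* Decide whether an agent's kernel must be re-launched: true when no
   state-transition notification has arrived within 10 * eps_i seconds,
   where eps_i is the decision-lag computed for the current cycle i. */
bool needs_relaunch(double now, double last_notify, double eps_i) {
    return (now - last_notify) > 10.0 * eps_i;
}
```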
4.3.3 Messaging
In order to ensure the security of agent-data, no agent can handle other agents' data directly. If
an agent wants to send a message to another agent, or wants to see or change the data of other
agents, then it needs to send a request to the master-controller, which decides whether to accept
or deny the request. In order to speed up the data transfers, agents running on GPU-based
agent-controllers share memory with the master-controller using the mapped memory APIs
provided by CUDA. The application developer needs to implement a function which
takes two agent identifiers as parameters, agent_from_id and agent_to_id, and returns true if
agent_from_id is allowed to access agent_to_id's data; otherwise it returns false. With this
setting, whenever a request is initiated by an agent, the master-controller simply invokes this
function and accordingly returns the pointer to the data if the return value was true, else returns
null.
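A sketch of this request path on the master-controller side; the predicate type, the sample policy and the flat data array are our assumptions (the Allow functions in Appendix B play the predicate's role):

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_AGENTS 4
#define AGENT_DATA_BYTES 10

/* Flat per-agent data store (10 bytes per agent, as in Section 4.4.2.3). */
static char agent_data[NUM_AGENTS][AGENT_DATA_BYTES];

/* Developer-supplied access predicate. */
typedef bool (*allow_fn)(int agent_from_id, int agent_to_id);

/* Master-controller request handler: consult the predicate and hand
   back a pointer to the target agent's data, or NULL on denial. */
void *request_data(int from, int to, allow_fn allow) {
    return allow(from, to) ? (void *)agent_data[to] : NULL;
}

/* Example policy (ours): lower ids may inspect higher ids. */
bool sample_allow(int from, int to) { return from < to; }
```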
The master-controller becomes a bottleneck for such requests when the number of agents
becomes large. To lessen the burden, we have a local-message controller on each GPU. Each
GPU maintains a local copy of the data of the agents running on it, in addition to the CPU-shared
memory. If the two agents involved in messaging are on the same agent-controller, then the
local message controller handles the request using the local agent-data; otherwise the request
goes to the master-controller, which then handles the request using mapped memory.
With this messaging model, the framework is able to manage up to 10^14 messages in approximately
100 milliseconds.
4.3.4 Warp management
Warps are groups of 32 threads which are initiated, scheduled and terminated together.
Performance is best when all threads in a warp follow the same execution path. Threads may
diverge due to conditional statements in the code. For example, let there be two threads, ta
and tb; ta satisfies the condition cond whereas tb satisfies the else part of cond. In CUDA,
thread tb gets blocked while ta is executing the if part; likewise, thread ta stays idle when tb
is executing the else part. Hence, there should be as little divergence as possible in the execution
paths of threads in the same warp. In order to reduce the divergence in the execution paths, we
allot threads in the same block to the same type of agents, as their behavior would be similar to
each other and they would be less likely to diverge. Experiments showed as much as an 8-fold
improvement in execution time (Section 4.4.2.2).
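The allotment step reduces to sorting agents by type before mapping them, in order, onto blocks of threads; a host-side sketch (record and function names are ours):

```c
#include <stdlib.h>

/* Agent record with a type tag (e.g. sand-particle vs. beaker). */
typedef struct { int id; int type; } AgentRec;

static int by_type(const void *a, const void *b) {
    return ((const AgentRec *)a)->type - ((const AgentRec *)b)->type;
}

/* Sort agents by type; assigning the sorted array, in order, to
   512-thread blocks then gives each block (apart from at most one
   boundary block per type) agents of a single type, so warps are far
   less likely to diverge on type-dependent conditionals. */
void group_agents_by_type(AgentRec *agents, int n) {
    qsort(agents, n, sizeof(AgentRec), by_type);
}
```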
4.4 Experimental Results
We conducted several experiments for testing the utility of our agent-execution model and
the performance of the FSAM-framework. For the experiments we used an nVidia Tesla T10 GPU with
933 GFLOPS of processing performance, a 1.30 GHz clock-rate and 4 GB of GDDR3 memory
at 102 GB/s bandwidth. It has 30 multi-processors with 240 cores, a constant memory of 64
KB, a shared memory of 16 KB per block and 16K registers per block. For all the experiments,
we kept the block size at 512 threads. The grid size was computed as ⌈N/512⌉, where N is the
number of agents in the simulation. We could do only limited comparisons against other GPU-based
solutions because the code for the systems presented in [22], [25], [23] was not available
publicly. Code for DMASF is available [15].
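The grid sizing above is the usual ceiling division; as a minimal C sketch:

```c
/* Grid size for a fixed block size of 512 threads: ceil(N / 512). */
int grid_size(int n_agents) {
    return (n_agents + 511) / 512;
}
```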
Figure 4.3 Decision-lag obtained for different scenarios on GPU-based FSAM (decision-lag in seconds, log scale, vs. K, where the number of agents = 10^K; curves: Circle, Hand-shake and Sand-pile simulations)
4.4.1 Experiments on performance
We analyzed the performance and expressiveness of the proposed agent execution framework
following the 4-agent-state model. The time taken by each cycle (we call it cycle-time),
the idle-time and the decision-lag have been measured for different scenarios. These times can be
estimated as a function of the number of messages being sent to and received from different
agents and the number of computations in the agent code. We took three sample scenarios for our
experiments and compared the time taken by the FSAM-framework with CPU-based DMASF. An
Intel i3 processor with 2.93 GHz and 4 GB RAM was used for the CPU. Two CPUs were used for
DMASF. The cycle-time, idle-time and decision-lag are measured in seconds and are averaged
over 100 cycles for both the frameworks. The low cycle-times and idle-times obtained
for different scenarios showed the effectiveness of using the FSAM-framework for a wide range
of simulations.
Figure 4.4 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Circle simulation (cycle-time in seconds vs. K, number of agents = 10^K)
Figure 4.5 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Circle simulation (idle-time in seconds vs. K, number of agents = 10^K)
Figure 4.6 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Hand-shake simulation (cycle-time in seconds vs. K, number of agents = 10^K)
Figure 4.7 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Hand-shake simulation (idle-time per iteration in seconds vs. K, number of agents = 10^K)
Figure 4.8 Cycle-time comparison between GPU-based FSAM and CPU-based DMASF: Sand-pile simulation (cycle-time in seconds vs. K, number of agents = 10^K)
Figure 4.9 Idle-time comparison between GPU-based FSAM and CPU-based DMASF: Sand-pile simulation (idle-time per iteration in seconds vs. K, number of agents = 10^K)
4.4.1.1 Computationally intense with no messaging (Circle simulation).
We randomly distributed agents on a 2D plane. The goal of the agents is to position themselves
on a circle and then move around it. The strategy involved computation of the arithmetic mean
(which requires N additions corresponding to both the dimensions, where N is the number
of agents) for each agent in every cycle. Figures 4.4 and 4.5 show the comparison of execution
times between GPU-based FSAM and CPU-based DMASF. The speed-up achieved using FSAM
is nearly 200 times for cycle-time. The idle-time for CPUs increases linearly with the number
of agents. For GPUs the idle-time depends on the number of cores available for
concurrent computation. It remains the same while the number of agents is less than 30,720 (as all
the agents are scheduled concurrently) and gradually increases after that.
4.4.1.2 Communication intensive (Hand-shake simulation).
In this simulation, all the agents were packed in a single room. Their aim was to shake
hands with every other agent 100 times. A hand-shake between two agents X and Y is modelled
as a message from X to Y followed by an acknowledgement message sent by Y to X. Every
agent ended up sending N messages (one corresponding to each agent) in every cycle, where
N is the number of agents. Figures 4.6 and 4.7 show the comparison results. The GPU-based
FSAM has a memory-based messaging model and hence has a remarkably low cycle-time as
compared to CPU-based DMASF, which has a disk-based messaging model. The speed-up achieved
in cycle-time for 10^7 agents was more than 10,000 times.
4.4.1.3 Messaging and computationally balanced (Sand-pile simulation).
In this problem, grain particles (each particle modelled as an agent) are dropped from a
height into a beaker through a funnel, and they finally settle down in the beaker after colliding
with each other and with the walls of the beaker and funnel. A detailed description and solution
of the problem is given in [26]; for a visualization of the problem refer to [27]. For each
agent, a single cycle involved exchange of positions and velocities with a subset of agents,
then computing the physical forces acting on it and sending its impact to the appropriate agents.
Comparison results are shown in Figures 4.8 and 4.9. A lot of mathematical calculations are required in
this simulation, and GPUs being faster than CPUs leads to a speed-up of as much as 2316
times for 10^7 agents.
4.4.2 Evaluating FSAM-framework Architecture
We tested our design decisions for the FSAM-framework architecture presented in Section 4.3.
4.4.2.1 Agent distribution algorithm
Using the CUDA Occupancy Calculator ([10]), we computed that 30,720 concurrent threads
can run on a single GPU for the Sand-pile simulation ([26]) code. We ran the simulation with
30,000 agents. We first divided them equally on four (identical) GPUs and found the cycle-time
to be 10.837 ms. Next, we ran all the agents on a single GPU and found the cycle-time
equal to 10.841 ms, the same as before. Thus, we can use spare GPUs to run another instance of
FSAM and perform different simulations in parallel.
Then, we increased the number of agents to 60,000. We ran them on a single GPU and
found the cycle-time to be 13.643 ms. Using our distribution algorithm, we get the distribution
on two GPUs as 30,720 and 29,280. With this distribution the cycle-time achieved was 10.838 ms.
4.4.2.2 Warp management
We divided 10,000,000 agents into two agent-types, sand-particle agent and beaker agent,
corresponding to the sand-pile simulation, in the ratio 100:1. We compared a random distribution
of agents on CUDA blocks against the optimization stated in Section 4.3.4 (the same
block having the same type of agents). While for the random distribution the average cycle-time
was 680.413 ms, the corresponding time for the optimization in Section 4.3.4 was 80.491 ms,
showing more than an 8-fold improvement. This is because agents of type beaker interact with
significantly more particles in their update than the sand-particle agents. Random distribution
of agents on the blocks hence caused the particle agents to remain idle for
a longer time.
4.4.2.3 Stress Testing
By architecture, in CUDA a block can have a maximum of 1024 threads and a grid can
have a maximum of 65,535 × 65,535 (approximately 2^32) blocks, giving a total of 2^42 possible
threads. We carried out experiments to find out the capacity of the FSAM-framework with 4
GPU-based agent-controllers. Each agent was allotted 10 bytes of data. Mapped memory on
the CPU had 4 GB of space. Hence, the maximum number of possible threads (or agents) was
limited to 3.676×10^8. The idle-time obtained was 0.00153 milliseconds. Next, we used all the
space in the GPU global memory (refer [11]) and the CPU host memory for the agents' data. The
number of agents was increased to 7.250×10^8. The idle-time obtained in this case was almost the
same, 0.00159 milliseconds. In either case, on further increasing the number of agents, the CUDA
kernel (refer [11]) failed to launch. Further, we ran the above number of agents for 1,000,000
cycles and neither agents nor any agent-controller failed, testifying to the remarkable capability
of CUDA to support a large number of threads. Thus, the FSAM-framework built on top of
CUDA is quite reliable.
Chapter 5
Conclusions
Cloud computing and multi-core based concurrent programming are recent advancements
in the field of solving larger problems. Multi-agent simulation, when scaled up to several millions
of agents, is one such problem posing several challenges: (i) scalability of the framework
running such massive simulations; a lot of computation power is needed to execute agents,
along with primary and secondary storage to store simulation data; (ii) fast agent execution
and message delivery, along with latest agent-environment state perception. In our work, we
addressed these challenges and provided separate solutions for them. The challenge with scalability
is handled with the Hadoop-cum-Lucene solution, and the agent-state perception challenge is
tackled by utilizing CUDA over GPUs.
Hadoop provides a novel framework for running applications involving thousands of nodes
and petabytes of data. It allows a developer to focus on the agent model and its optimization
without getting involved in fault-tolerance issues. Extensibility of the hardware on which the
framework is running is made easy by Hadoop, by allowing dynamic addition of new nodes and
by allowing heterogeneity between the operating systems which the different nodes are running.
Therefore, it provides a strong backbone for implementing a large-scale agent-based simulation
framework. Using cached results is a major optimization in the framework. A faster lookup
for agents is achieved by indexing agent-data using Lucene. Agent-messages in an iteration
are also indexed using Lucene to achieve fast agent-messaging. Further, the simulation data
for each iteration is stored in HDFS, which can be used for off-line/on-line visualization of
the simulation.
GPUs provide a massively concurrent architecture for program execution. CUDA allows
us to utilize the multi-core GPUs through creation and execution of several millions
of threads. Agent-based simulation frameworks developed on CUDA can have each agent
owning its own GPU-thread of execution, thus enabling agents to be active almost all the
time during the simulation and to get the latest perceptions from the environment and other agents. We
developed an agent-based simulation framework, the FSAM-framework, which followed a four-state
agent execution model and sped up the agent-execution and agent-messaging using the
computational power of GPUs. The presented four-state agent execution model, along with
GPUs, tackled the problem of obtaining the latest environment perceptions, with the decision-lag
of the perceive state making the difference from currently followed agent execution models.
Selecting an appropriate alternative out of the presented frameworks depends on the application
at hand. If the application has a large memory requirement, on the order of several
gigabytes, then the framework developed on Hadoop is the better alternative to be used. GPUs
have a limited memory and as such would fail to launch the agent-execution if the data does
not fit into the memory. On the other hand, if the rate of perceptions being received is more
than the rate at which they are processed by an agent, then the four-state agent execution
model should be used. In general, the four-state agent execution model fits best on the parallel
GPUs, giving very low values for decision-lag, provided the simulated agents do not
require a lot of memory during execution. However, the four-state model can be executed on
other hardware architectures as well; only the delay values for decision-lag get higher.
5.1 Future Work
The implemented systems can be further optimized and more functionalities can be added
to them. Developing better heuristics for caching results and for determining appropriate cache
sites for faster access of the results are some of the challenging tasks in the Hadoop framework.
For online analysis, a scene visualization module can be provided which can be triggered at the
end of a single map-reduce job (which is equivalent to a single iteration in the multi-agent
simulation), and the updated scene can then be rendered using Java APIs.
The 4-state agent execution model in its presented form is fit for only 1-level dependencies
among agent variables (a dependency is said to be of level n if there exists a path of
length n in the directed dependency graph). The presented execution model can be extended
to solve n-level dependencies by allowing agents to make updates of dependent variables
in their perceive state and multiplying the decision-lag by a factor of n.
The frameworks developed are available for download at [34], [33]. Sample agent codes
can be found in the Appendix.
Related Publications
1. A Multi-agent Simulation Framework on Small Hadoop Cluster - Prashant Sethia and
Kamalakar Karlapalem - Engineering Applications of Artificial Intelligence Journal.
DOI: 10.1016/j.engappai.2011.06.009
2. A Multi-agent Simulation Framework on Small Hadoop Clouds - Kamalakar Karlapalem
and Prashant Sethia - ITMAS workshop at Ninth International Conference On
Autonomous Agents And Multi-agent Systems, Toronto, Canada - 2010.
3. Efficient Multi-Agent Simulation using Four State Agent Execution Model on GPUs
- Prashant Sethia and Kamalakar Karlapalem. Under review in Engineering Applications
of Artificial Intelligence Journal.
Bibliography
[1] Jacques Ferber - Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence.
Addison Wesley Longman, Harlow, UK, 1999.
[2] Jeffrey Dean and Sanjay Ghemawat - MapReduce: Simplified Data Processing on Large
Clusters. Proceedings of the 6th conference on Symposium on OSDI - Volume 6, 2004.
[3] Steven F. Railsback, Steven L. Lytinen, Stephen K. Jackson - Agent-based Simulation
Platforms: Review and Development Recommendations. Society for Computer Simulation
International, 2006.
[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber - Bigtable: A
Distributed Storage System for Structured Data. Seventh Symposium on OSDI, Seattle,
WA, 2006.
[5] Hadoop wiki page : [http://wiki.apache.org/hadoop]
[6] Nuannuan Zong, Feng Gui, Malek Adjouadi - A New Clustering Algorithm of Large
Datasets with O(N) Computational Complexity. Proceedings of the 5th International
Conference on ISDA, 2005.
[7] Cloud-computing Wikipedia page: [http://en.wikipedia.org/wiki/Cloud_computing]
[8] http://code.google.com/edu/parallel/mapreduce-tutorial.html
[9] nVidia CUDA home-page :
http://www.nvidia.com/object/cudahome.html
[10] CUDA Occupancy Calculator :
http://developer.download.nvidia.com/compute/cuda/CUDA Occupancycalculator.xls
[11] nVidia CUDA Programming Guide, Version 3.0
[12] S Tisue, U Wilensky. NetLogo: Design and Implementation of a Multi-Agent Modeling
Environment. Proceedings of the Agent Conference, 2004. pp161-184.
[13] G Yamamoto, H Tai, H Mizuta. A Platform for Massive Agent-Based Simulation and
Its Evaluation. MMAS, 2008. pp1-12.
[14] IVA Rao, M Jain, K Karlapalem. Towards Simulating Billions of Agents in Thousands
of Seconds. AAMAS, 2007. Article No. 143.
[15] DMASF code: http://sourceforge.net/projects/dmasf/files/
[16] S Luke, C Cioffi-Revilla, L Panait, K Sullivan, G Balan. MASON: a multiagent simula-
tion environment. Simulation, 2005. pp517-527.
[17] N Minar, R Burkhart, C Langton, M Askenazi. The Swarm Simulation System: A
Toolkit for Building Multi-agent Simulations. Santa Fe Institute Working Paper, 1996.
[18] N Collier. Repast: An extensible framework for agent simulation. Natural Resources
and Environmental Issues. Vol. 8, Article 4.
[19] M Sonnessa. JAS: Java Agent-based Simulation Library.An Open Framework for
Algorithm-Intensive Simulation. Industry And Labor Dynamics: The Agent-Based
Computational Economics Approach, 2003. pp43-56.
[20] A Fedoruk, R Deters. Improving fault-tolerance by replicating agents. AAMAS, 2002.
pp737-744.
[21] R Sarika, S Harith, K Karlapalem. Database Driven RoboCup Rescue Server. RoboCup
2008. pp602-613.
[22] P Richmond, D Romano. Agent Based GPU, a Real-time 3D Simulation and Interactive
Visualisation Framework for Massive Agent Based Modelling on the GPU. IWSV, 2008.
[23] M Lysenko, RM DSouza. A Framework for Megascale Agent Based Model Simulations
on Graphics Processing Units. Journal of Artificial Societies and Social Simulation,
2008. Vol. 11, pp10.
[24] P Varakantham, S Gangwani, K Karlapalem. On Handling Component and Transaction
Failures in Multi Agent Systems. SIGecom Exchanges, 2001. Vol. 3, pp32-43.
[25] BG Aaby, KS Perumalla, SK Seal. Efficient Simulation of Agent-Based Models on
Multi-GPU and Multi-Core Clusters. International Conference on Simulation Tools and
Techniques for Communications, Networks and Systems and Workshops, 2010. Article
29.
[26] L Breton, JD Zucker, E Clement. A multi-agent based simulation of sand piles in a static
equilibrium. MABS, 2000. pp108-118.
[27] Visualization of sand piles problem: [http://grmat.imi.pcz.pl]
[28] John H. Miller, Scott E. Page - The Standing Ovation Problem : Computational model-
ing in the social sciences, 2004.
[29] L Welch, S Ekwaro-Osire. Fairness in Agent Based Simulation Frameworks. Journal of
Computing and Information Science in Engineering, March 2010. Volume 10, Issue 1,
011002.
[30] M Kiran, P Richmond, M Holcombe, L S Chin, D Worth, C Greenough. FLAME: sim-
ulating large populations of agents on parallel hardware architectures. AAMAS, 2010.
pp1633-1636.
[31] Lucene: http://lucene.apache.org/
[32] Solr: http://lucene.apache.org/solr/
[33] Multi-agent simulation framework using CUDA: http://sourceforge.net/projects/fsam/
[34] Multi-agent simulation framework on Hadoop: http://sourceforge.net/projects/hmasf/
Appendix A
Example codes for agent simulation on Hadoop simulation
framework
A.1 Circle Simulation
public class CreateEnviron {
public Map<Object, Object> getIterationParameters() {
Map <Object, Object> iter_data =
new HashMap <Object, Object>();
//Name of the directory containing
//intermediate simulation data.
String keytemp="AGENT_DIRECTORY";
String valtemp="CIRCLE_AGENTS";
iter_data.put((Object)keytemp, (Object)valtemp);
//Number of iterations for the simulation to run.
keytemp="NOI";
valtemp="10000";
iter_data.put((Object)keytemp, (Object)valtemp);
return iter_data;
}
void createWorld() {
AgentAPI crobj=new AgentAPI();
Random randomGenerator = new Random();
for(int i=1;i<=NUM_AGENTS;i++)
{
//Agent data initialized.
Map <Object, Object> agent_data =
new HashMap <Object, Object>();
agent_data.put((Object)"ID",(Object)i);
agent_data.put((Object)"TYPE",(Object)"circle");
//Circles are distributed randomly between
//co-ordinates (0, 0) and (500, 500).
float x=randomGenerator.nextFloat()*500;
float y=randomGenerator.nextFloat()*500;
agent_data.put((Object)"X",(Object)x);
agent_data.put((Object)"Y",(Object)y);
crobj.createAgent(agent_data);
}
}
}
public class AgentUserCode {
Map<Object, Object> Update(Map<Object, Object> agent_data) {
switch ((String) agent_data.get("TYPE")) {
case "circle":
float cntx=0, cnty=0;
AgentAPI aapi=new AgentAPI();
for(int i=1;i<=NUM_AGENTS;i++)
{
Map<Object, Object> agent= aapi.getAgent(i);
float tx=(float)agent.get("X");
float ty=(float)agent.get("Y");
cntx+=tx;
cnty+=ty;
}
cntx=cntx/NUM_AGENTS;
cnty=cnty/NUM_AGENTS;
float rad=1.5f;
float dy=cnty-(float)agent_data.get("Y");
float dx=cntx-(float)agent_data.get("X");
float theta = (float) Math.atan2(dy, dx);
float targetx = rad * (float) Math.cos(theta) + cntx;
float targety = rad * (float) Math.sin(theta) + cnty;
float x = (targetx - (float)agent_data.get("X"));
float y = (targety - (float)agent_data.get("Y"));
agent_data.put((Object)"X",(Object)x);
agent_data.put((Object)"Y",(Object)y);
default:
break;
}
return agent_data;
}
void Shape(String agent_type) {
switch (agent_type) {
case "circle":
//User code for rendering shape.
default:
break;
}
}
}
A.2 Standing Ovation Simulation
public class CreateEnviron {
public Map<Object, Object> getIterationParameters() {
Map <Object, Object> iter_data =
new HashMap <Object, Object>();
//Name of the directory containing
//intermediate simulation data.
String keytemp="AGENT_DIRECTORY";
String valtemp="AGENTS_STANDING_OVATION";
iter_data.put((Object)keytemp, (Object)valtemp);
//Number of iterations for the simulation to run.
keytemp="NOI";
valtemp=1000;
iter_data.put((Object)keytemp, (Object)valtemp);
return iter_data;
}
void createWorld() {
AgentAPI crobj=new AgentAPI();
Random randomGenerator = new Random();
for(int i=1;i<=NUM_AGENTS_X;i++)
{
for(int j=1;j<=NUM_AGENTS_Y;j++)
{
//Agent data initialized.
Map <Object, Object> agent_data =
new HashMap <Object, Object>();
String id=Integer.toString(i)+":"+Integer.toString(j);
agent_data.put((Object)"ID",(Object)(id));
agent_data.put((Object)"TYPE",(Object)"audience");
//Agents are seated in a rectangular grid.
agent_data.put((Object)"X",(Object)i);
agent_data.put((Object)"Y",(Object)j);
//Flag indicating whether the agent is standing.
int stand=randomGenerator.nextInt(2);
agent_data.put((Object)"STAND",(Object)stand);
crobj.createAgent(agent_data);
}
}
}
}
public class AgentUserCode {
Map<Object, Object> Update(Map<Object, Object> agent_data) {
switch ((String) agent_data.get("TYPE")) {
case "audience":
AgentAPI aapi=new AgentAPI();
int total=0, standing=0;
int x=(int)agent_data.get("X");
int y=(int)agent_data.get("Y");
List<String> Ids = new ArrayList<String>();
Ids.add(Integer.toString(x-1)+":"+Integer.toString(y-1));
Ids.add(Integer.toString(x-1)+":"+Integer.toString(y));
Ids.add(Integer.toString(x-1)+":"+Integer.toString(y+1));
Ids.add(Integer.toString(x)+":"+Integer.toString(y-1));
Ids.add(Integer.toString(x)+":"+Integer.toString(y+1));
Ids.add(Integer.toString(x+1)+":"+Integer.toString(y-1));
Ids.add(Integer.toString(x+1)+":"+Integer.toString(y));
Ids.add(Integer.toString(x+1)+":"+Integer.toString(y+1));
for(int i=0;i<Ids.size();i++)
{
Map<Object, Object> agent= aapi.getAgent(Ids.get(i));
if(agent!=null)
{
int st=(int)agent.get("STAND");
standing+=st;
total+=1;
}
}
if(standing > total/2)
agent_data.put((Object)"STAND",1);
else if(standing < total/4)
agent_data.put((Object)"STAND",0);
default:
break;
}
return agent_data;
}
void Shape(String agent_type) {
switch (agent_type) {
case "audience":
//User code for rendering shape.
default:
break;
}
}
}
A.3 Sand-pile Simulation
public class CreateEnviron {
public Map<Object, Object> getIterationParameters() {
Map <Object, Object> iter_data =
new HashMap <Object, Object>();
//Name of the directory containing
//intermediate simulation data.
String keytemp="AGENT_DIRECTORY";
String valtemp="SANDPILE_AGENTS";
iter_data.put((Object)keytemp, (Object)valtemp);
//Number of iterations for the simulation to run.
keytemp="NOI";
valtemp=10000;
iter_data.put((Object)keytemp, (Object)valtemp);
return iter_data;
}
void createWorld() {
AgentAPI crobj=new AgentAPI();
Random randomGenerator = new Random();
for(int i=1;i<=NUM_AGENTS;i++)
{
//AgentStruct data initialized.
Map <Object, Object> agent_data =
new HashMap <Object, Object>();
agent_data.put((Object)"ID",(Object)i);
agent_data.put((Object)"TYPE",(Object)"sand");
//Sand-particles are distributed randomly
//between co-ordinates (0, 0) and (1000, 1000).
float x=randomGenerator.nextFloat()*1000;
float y=randomGenerator.nextFloat()*1000;
agent_data.put((Object)"X",(Object)x);
agent_data.put((Object)"Y",(Object)y);
//Initial velocities are zero.
float vx=0, vy=0;
agent_data.put((Object)"VX",(Object)vx);
agent_data.put((Object)"VY",(Object)vy);
crobj.createAgent(agent_data);
}
}
}
public class AgentStruct{
int id;
float x,y,vx,vy;
}
public class AgentUserCode {
Map<Object, Object> Update(Map<Object, Object> agent_data) {
switch ((String) agent_data.get("TYPE")) {
case "sand":
float cntx=0, cnty=0;
AgentAPI aapi=new AgentAPI();
float e=0.6f;
int i;
float x,y,vx,vy,t,s,tn, circ, th, an ,cs, sn;
x=(float)agent_data.get("X");
y=(float)agent_data.get("Y");
vx=(float)agent_data.get("VX");
vy=(float)agent_data.get("VY");
for (i=1;i<=NUM_AGENTS;i++)
{
Map<Object, Object> agi=aapi.getAgent(i);
AgentStruct ai= new AgentStruct();
ai.id=(int)agi.get("ID");
ai.x=(float)agi.get("X");
ai.y=(float)agi.get("Y");
ai.vx=(float)agi.get("VX");
ai.vy=(float)agi.get("VY");
if(ai.id!=(int)agent_data.get("ID"))
{
if(Math.pow(ai.x-x,2)+
Math.pow(ai.y-y,2)< 6400)
{
AgentStruct k= new AgentStruct();
k.x=(ai.x-x); k.y=(ai.y-y);
AgentStruct v = new AgentStruct();
v.vx=(ai.vx); v.vy=(ai.vy);
float radm=(1.0f*((k.x*k.x)+(k.y*k.y)));
if( radm == 0)
radm=1;
float kx=k.x/radm, ky=k.y/radm;
ai.x=x+kx*80; ai.y=y+ky*80;
AgentStruct n= new AgentStruct();
float velm=(1.0f*((v.vx*v.vx)+(v.vy*v.vy)));
if (velm == 0)
velm=1;
n.vx=v.vx/velm; n.vy=v.vy/velm;
float tx,k1x,k1y;
tx=(kx*n.vx+ky*n.vy)*velm;
kx=tx*kx; ky=tx*ky;
AgentStruct k1=new AgentStruct();
AgentStruct v1=new AgentStruct();
k1.x=-k.x; k1.y=-k.y;
v1.vx=vx; v1.vy=vy;
if (radm == 0)
radm=1;
k1x=k1.x/radm; k1y=k1.y/radm;
velm=(1.0f*((v1.vx*v1.vx)+(v1.vy*v1.vy)));
if (velm == 0)
velm=1;
n.vx=v1.vx/velm; n.vy=v1.vy/velm;
if( velm == 0)
velm=1;
tx=(k1x*n.vx+k1y*n.vy)*velm;
k1x=tx*k1x; k1y=tx*k1y;
AgentStruct nochng=new AgentStruct();
AgentStruct nochng1=new AgentStruct();
nochng.x=v.vx-kx; nochng.y=v.vy-ky;
nochng1.x=v1.vx-k1x; nochng1.y=v1.vy-k1y;
vx=(1-e)*k1x/2+(1+e)*kx/2+nochng1.x;
ai.vx=(1-e)*kx/2+(1+e)*k1x/2+nochng.x;
vy=(1-e)*k1y/2+(1+e)*ky/2+nochng1.y;
ai.vy=(1-e)*ky/2+(1+e)*k1y/2+nochng.y;
}
}
}
float g=9.8;
t=440*(y)/300.0f;
s=100+t;
th=(440.0f/300.0f);
circ=x*x+y*y-(s-80)*(s-80);
if(circ>=0 && x!=0)
{
tn=y/x;
if(tn<0)
tn=tn*(-1);
an=(float)Math.atan(tn); sn=(float)Math.sin(an); cs=(float)Math.cos(an);
if(x>=0 && y<0)
cs=cs*(-1);
if(y>=0 && x>=0)
{ sn=sn*(-1);
cs=cs*(-1);
}
if(x<0 && y>=0)
sn=sn*(-1);
vx+=(g*(th)*cs*(th));
vy-=(g*(th)*(th));
x+=vx*0.1+0.005*(g*(th)*cs*(th));
y+=vy*0.1-0.005*g*(th)*(th);
}
else
{
vx=0; vy-=(g/10);
x+=vx*0.1; y+=vy*0.1-0.005*g;
}
agent_data.put((Object)"X", (Object)x);
agent_data.put((Object)"Y", (Object)y);
agent_data.put((Object)"VX", (Object)vx);
agent_data.put((Object)"VY", (Object)vy);
default:
break;
}
return agent_data;
}
void Shape(String agent_type) {
switch (agent_type) {
case "sand":
//User code for rendering shape.
default:
break;
}
}
}
Appendix B
Example codes for agent simulation on FSAM
B.1 Circle Simulation
int NUM_ITERATIONS=10000;
struct Agent{
int Id;
float x,y,cntx,cnty;
};
//The pointer is passed by reference so that the
//allocated array is visible to the caller.
void InitializeWorld(Agent ** gpuAgents)
{
//Initialize agents.
*gpuAgents=(Agent *)malloc(sizeof(Agent)*NUM_GPU_AGENTS);
Agent *agents=*gpuAgents;
for(int i=0;i<NUM_GPU_AGENTS;i++)
{
agents[i].Id=i+1;
//Distribute circles randomly between
//co-ordinates (0,0) and (500, 500).
agents[i].x=rand()%500;
agents[i].y=rand()%500;
}
}
//To decide whether one agent can send a message to another
__device__ bool Allow(Agent *agnt1, Agent *agnt2)
{
//In this example we make it true always.
return true;
}
__device__ void Update(Agent *agnt, Agent *agents)
{ //Updation in this case is based on perception
//hence nothing is done.
;
}
__device__ void Perceive(Agent *agnt, Agent *agents)
{
//Perceive locations of other agents
//and compute centroid.
float cntx=0, cnty=0;
for(int i=0;i<NUM_GPU_AGENTS;i++)
{
cntx+=agents[i].x;
cnty+=agents[i].y;
}
agnt->cntx=cntx/NUM_GPU_AGENTS;
agnt->cnty=cnty/NUM_GPU_AGENTS;
}
__device__ void Decide(Agent *agnt, Agent *agents)
{
//Update accordingly based on the centroid perceived.
float rad=1.5;
float cntx=agnt->cntx, cnty=agnt->cnty;
float theta =atan2(cnty-agnt->y,cntx-agnt->x);
float targetx = rad * cos(theta) + cntx;
float targety = rad * sin(theta) + cnty;
agnt->x = (targetx - agnt->x);
agnt->y = (targety - agnt->y);
}
B.2 Hand-shake Simulation
int NUM_ITERATIONS=10;
struct Agent{
int Id;
int countHandShakes, tempCount;
};
//The pointer is passed by reference so that the
//allocated array is visible to the caller.
void InitializeWorld(Agent ** gpuAgents)
{
//Initialize agents.
*gpuAgents=(Agent *)malloc(sizeof(Agent)*NUM_GPU_AGENTS);
Agent *agents=*gpuAgents;
for(int i=0;i<NUM_GPU_AGENTS;i++)
{
agents[i].Id=i+1;
agents[i].countHandShakes=0;
}
}
//To decide whether one agent can send a message to another
__device__ bool Allow(Agent *agnt1, Agent *agnt2)
{
//This is just a sample strategy.
if(agnt1->Id > agnt2->Id)
return true;
else
return false;
}
__device__ void Update(Agent *agnt, Agent *agents)
{
//Send dummy messages to other agents.
for(int i=0;i<NUM_GPU_AGENTS;i++)
{
SendMessage(agnt->Id, agents[i].Id, "HAND-SHAKE!");
}
}
__device__ void Perceive(Agent *agnt, Agent *agents)
{
char **msgs;
int count = GetMessages(msgs);
agnt->tempCount=0;
for(int i=0;i<count;i++)
{
//Need to implement your own string comparison.
if(strcmpCUDA(msgs[i],"HAND-SHAKE!")==0)
agnt->tempCount++;
}
}
__device__ void Decide(Agent *agnt, Agent *agents)
{
//Update based on number of hand-shakes received.
agnt->countHandShakes+=agnt->tempCount;
}
B.3 Sand-pile Simulation
int NUM_ITERATIONS=10000;
struct Agent{
int Id;
float x, y, vx, vy;
};
void InitializeWorld(Agent * gpuAgents)
{ //Initialize agents.
gpuAgents=(Agent *)malloc(sizeof(Agent)*NUM_GPU_AGENTS);
for(int i=0;i<NUM_GPU_AGENTS;i++)
{ //Sand-particles are distributed randomly
//between co-ordinates (0, 0) and (1000, 1000).
gpuAgents[i].Id=i+1;
gpuAgents[i].x=rand()%1000;
gpuAgents[i].y=rand()%1000;
gpuAgents[i].vx=0;
gpuAgents[i].vy=0;
}
}
//To decide whether one agent can send a message to another
__device__ bool Allow(Agent *agnt1, Agent *agnt2)
{
return true;
}
__device__ void Update(Agent *agnt, Agent *agents)
{
//Update of the position depends on
//relative positions of sand-particles.
//Hence, computed after perceptions.
;
}
__device__ void Perceive(Agent *agent, Agent *agents)
{
//Compute collisions.
float e=0.6;
int i;
for (i=0;i<NUM_GPU_AGENTS;i++)
{
Agent &ai = agents[i]; //reference, so collision updates persist
if(ai.Id!=agent->Id)
{
if(pow((ai.x-agent->x),2)+
pow((ai.y-agent->y),2)< 6400)
{
Agent k;
k.x=(ai.x-agent->x); k.y=(ai.y-agent->y);
Agent v;
v.vx=(ai.vx); v.vy=(ai.vy);
float radm=(1.0*((k.x*k.x)+(k.y*k.y)));
if( radm == 0)
radm=1;
float kx=k.x/radm, ky=k.y/radm;
ai.x=agent->x+kx*80; ai.y=agent->y+ky*80;
Agent n;
float velm=(1.0*((v.vx*v.vx)+(v.vy*v.vy)));
if (velm == 0)
velm=1;
n.vx=v.vx/velm; n.vy=v.vy/velm;
float tx,k1x,k1y;
tx=(kx*n.vx+ky*n.vy)*velm;
kx=tx*kx; ky=tx*ky;
Agent k1, v1;
k1.x=-k.x; k1.y=-k.y;
v1.vx=agent->vx; v1.vy=agent->vy;
if (radm == 0)
radm=1;
k1x=k1.x/radm; k1y=k1.y/radm;
velm=(1.0*((v1.vx*v1.vx)+(v1.vy*v1.vy)));
if (velm == 0)
velm=1;
n.vx=v1.vx/velm; n.vy=v1.vy/velm;
if( velm == 0)
velm=1;
tx=(k1x*n.vx+k1y*n.vy)*velm;
k1x=tx*k1x; k1y=tx*k1y;
Agent nochng,nochng1;
nochng.x=v.vx-kx; nochng.y=v.vy-ky;
nochng1.x=v1.vx-k1x; nochng1.y=v1.vy-k1y;
agent->vx=(1-e)*k1x/2+(1+e)*kx/2+nochng1.x;
ai.vx=(1-e)*kx/2+(1+e)*k1x/2+nochng.x;
agent->vy=(1-e)*k1y/2+(1+e)*ky/2+nochng1.y;
ai.vy=(1-e)*ky/2+(1+e)*k1y/2+nochng.y;
}
}
}
float g=9.8;
float x,y,t,s,tn, circ, th, an ,cs, sn;
x=agent->x-640; y=agent->y-650;
t=440*(y)/300.0; s=100+t; th=(440/300.0);
circ=x*x+y*y-(s-80)*(s-80);
if(circ>=0 && x!=0)
{
tn=y/x;
if(tn<0)
tn=tn*(-1);
an=atan(tn); sn=sin(an); cs=cos(an);
if(x>=0 && y<0)
cs=cs*(-1);
if(y>=0 && x>=0)
{ sn=sn*(-1); cs=cs*(-1);
}
if(x<0 && y>=0)
sn=sn*(-1);
agent->vx+=(g*(th)*cs*(th));
agent->vy-=(g*(th)*(th));
}
else
{
agent->vx=0; agent->vy-=(g/10);
}
}
__device__ void Decide(Agent *agent, Agent *agents)
{
//Position updated after perceptions for all
//agents are completed.
float g=9.8;
agent->x=agent->vx*0.1+agent->x;
agent->y=agent->vy*0.1-0.049*g+agent->y;
}