Scythe Proceedings and Bulletin of the International Data Farming Community Issue 20 - Workshop 32


Table of Contents

IWW 32: Blossoming Data
2018: DataFarming.org
2010: Data Farming and Defense Applications
2004: Data Farming: Discovering Surprise


It is appropriate that the publication supporting the International Data Farming Workshop is named after a farming implement. In farming, a scythe is used to clear and harvest. We hope that the “Scythe” will perform a similar role for our data farming community by being a tool to help prepare for our data farming efforts and harvest the results. The Scythe is provided to all attendees of the Workshops. Electronic copies may be obtained by request to the editors. Please contact the editors for additional paper copies.

Articles, ideas for articles and material, and any commentary are always appreciated.

Bulletin Editors
Ted Meyer: [email protected]
Gary Horne: [email protected]

International What-if Workshop 32

“Blossoming Data”

International Data Farming Community

Overview

The International Data Farming Community is a consortium of researchers interested in the study of Data Farming, its methodologies, applications, tools, and evolution.

The primary venue for the Community is the biannual International Data Farming Workshops, where researchers participate in team-oriented model development, experimental design, and analysis using high performance computing resources... that is, Data Farming.

Scythe, Proceedings and Bulletin of the International Data Farming Community, Issue 20, Workshop 32, Publication date: April 2018

IWW 32 Overview: "Blossoming Data"

This International Workshop was the 32nd in a series where we have focused on gaining insights into our what-if? questions through data farming. This event also contained the 12th meeting of the NATO Modeling and Simulation Task Group "Developing Actionable Data Farming Decision Support for NATO." This Task Group has been designated MSG-124 and is applying the work from the MSG-088 Task Group that documented the data farming process. International What-if? Workshop (IWW) 32 was held from 26 through 30 March 2017 at Blue Canopy in Reston, Virginia, USA, where we had the chance to visit nearby Washington DC during the blossoming of the famous cherry trees! And I also want to note that we had many off-site participants located in Finland.

The teams and team leaders at IWW 32 were:

Team 1: Cyber Defence, Gary Horne (USA) and Santiago Balestrini (USA)
Team 2: Operation Planning, Stephan Seichter (DEU) and Johan Schubert (SWE)
Team 3: Using Data Farming in Humanitarian Assistance Modeling, Merikki Lappi (FIN)
Team 4: Simulation of Traffic Accidents due to Animals, Esa Lappi (FIN)
Team 5: Data Farming Simulations of Autonomous Systems, Wayne Stilwell (USA)

The first two teams were from the MSG-124 Task Group and worked during the week toward finishing documentation of their work from the previous three-plus years. Their final report is available in draft form, and the final version is currently in progress through the NATO MSG channels. The next two teams continued work from IWW 31, and the final team was designed simply to gather ideas for the topic.

Thus this twentieth issue of the Scythe does not contain a summary of team efforts, other than to say the work was valuable in pushing their own data farming agendas. Instead, in the pages that follow we have summarized data farming through two articles that I believe capture the essence of the topic. I like to say "I live for surprise!" and the first article is titled Data Farming: Discovering Surprise. The second is titled Data Farming and Defense Applications, and I had the privilege of co-authoring both articles with Ted Meyer!

And speaking of Ted Meyer, I would like to offer my special thanks to him for his help in putting together this publication, and in fact ALL 20 issues of The Scythe! I would also like to express my thanks to the management of Blue Canopy for hosting IWW 32. The first photo below shows the top management of Blue Canopy and the Reston participants. Also, Esa Lappi and the student participants were located at the Päivolä School of Mathematics in Tarttilla, Finland and the second group picture shows them.

Gary Horne

- IWW 32 - Overview1


International What-If? Workshop 32: Data Farming - Reston, Virginia, USA

International What-If? Workshop 32: Data Farming - Student Enclave, Finland

P.S. Two additional photos taken during this 32nd Workshop provide an amazing bookend to the series of workshops from number 1 to number 32 as the movement grew from a simple idea with the name Data Farming that I coined in 1997. This idea was able to grow into action because of funding proposed by a Senator from Hawaii and voted into law in 1998 in the Capitol depicted in the next picture! And this effort was named Project Albert after Albert Einstein who is depicted in the statue in the final picture.

P.P.S. The first data farming workshop took place in 1999 in Hawaii, and a list of all 32 workshop dates and locations is found on the inside back cover of this publication.


DataFarming.org


Data Farming Workshops and pertinent research are gathered, exhibited, and linked online at datafarming.org. DataFarming.org is intended to provide a straightforward representation of the concepts of data farming and a basic starting point for anyone interested in learning more about the subject. We want it to include documents, papers, and journal articles when possible, and links to any applicable research.

If you would like to provide input, links, comments, or corrections to DataFarming.org please contact Ted Meyer at [email protected] or Gary Horne at [email protected].

Data Farming and Defense Applications

Team
Gary Horne, DataFarming.org
Ted Meyer, MeyerCraft, Inc.

INTRODUCTION

Data farming uses simulation modeling, high performance computing, experimental design, and analysis to examine questions of interest with large possibility spaces. This methodology allows for the examination of whole landscapes of potential outcomes and provides the capability of executing enough experiments so that outliers might be captured and examined for insights. In this paper we will provide an overview of data farming and describe the six domains of data farming. We will also illustrate data farming in the context of application to questions inherent in military decision-making, in particular social network analysis related to countering improvised explosive devices.

Overview of Data Farming

Data Farming is a collaborative and iterative team process (Horne 1997; Horne and Meyer 2004). This process normally requires input and participation by subject matter experts, modelers, analysts, and decision-makers.

Data Farming focuses on a more complete landscape of possible system responses and progressions, rather than attempting to pinpoint an answer. This "big picture" solution landscape is an invaluable aid to the decision maker in light of the complex nature of the modern battle space. And while there is no such thing as an optimal decision in a system where the enemy has a role, data farming allows the decision maker to more fully understand the landscape of possibilities and thereby make more informed decisions. Data farming also allows for the discovery of outliers that may lead to findings that allow decision makers to no longer be surprised by surprise.

Data Farming continues to evolve from initial work in a USMC effort called Project Albert (Hoffman and Horne 1998) to the work documented in the latest edition of the Scythe (Horne and Meyer 2010) documenting International Data Farming Workshop (IDFW) 20 held in March 2010 in Monterey, California. The Scythe is the publication of the International Data Farming Community that contains the proceedings of the IDFWs. IDFW 21 is scheduled to take place in Lisbon, Portugal in September 2010.

The Six Domains of Data Farming

The discovery of surprises and potential options is made possible by data farming. But many disciplines lie behind these discoveries, and their use in the overall data farming process evolved over a period of time. In this section we give a brief account of this development.

Six realms, or domains, were incorporated into the data farming methodology from 1997 to 2002. Initial data farming efforts in the 1997-98 time frame relied upon two basic ideas:

1. Developing models, called distillations, which may not have a great deal of verisimilitude but could be focused to specifically address the questions at hand (Horne 1999).

2. Using high performance computing to execute models many times over varied initial conditions to gain understanding of the possible outliers, trends, and distribution of results.

The models need not be agent-based models, but because of the ease with which they can be prototyped, agent-based models were used during this beginning period. This rapid prototyping facilitated the iterative nature of the approach. Also, the huge volume of output from the simulations made possible by high performance computing resulted in a need to develop visualization tools and methods commensurate with this tremendous amount of data. Thus, visualization of simulation data and rapid prototyping of scenarios became important to data farming efforts in the 1999-2000 time frame.

The simulations that defense analysts use are often large and complex. An evaluation of complete landscapes is extremely time consuming, sometimes not even possible. Also, even the smaller, more abstract agent-based distillations referred to above can have many parameters that are potentially significant and that could take on many values. Thus, even with high performance computing and the small models used in data farming, gridded designs, where every combination of values is simulated, are unwieldy. Using efficient experimental designs is therefore essential, and the Naval Postgraduate School (NPS) in Monterey, California joined Project Albert researchers in the early 2000s with their expertise in this area. NPS researchers have collaborated with others worldwide as well (see Kleijnen, Sanchez, Lucas, and Cioppa 2005).


Finally, collaboration must take place at many levels if the full power of data farming is to be brought to bear on any question. Collaborative processes help to integrate the other five domains of data farming through interdisciplinary work in creating models and data farming infrastructure and during the iterative process of prototyping scenarios and examining output from model runs. Collaboration also takes place between people from different organizations and nations sharing information and perspectives at various points in approaching common questions. With the addition of design of experiments and collaborative processes to data farming efforts in 2001-2002, much attention then focused on the defense applications discussed in the next section. The six realms, or domains, discussed above that contribute to the data farming process are depicted in Figure 1.

Figure 1. The Six Domains of Data Farming

Defense Applications

Since the incorporation of the above six domains into the process we call data farming, several articles have captured the fundamentals of data farming (e.g., Horne and Meyer 2005). But the key tenet in the data farming process has been the focus on the questions, and since 2002 many application efforts have been documented. For example, at the Naval Postgraduate School many theses have been completed that used data farming. And over the past decade, over 150 international work teams have formed around questions at International Data Farming Workshops. These work teams fall into areas, or themes, which include: Joint and Combined Operations (e.g., C4ISR Operations, Network Centric Warfare, Networked Fires, and Future Combat Missions), Urban Operations, Combat Support (e.g., UAV Operations, Robotics, Logistics, and Combat ID), Peace Support Operations, the Global War on Terrorism, Homeland Defense, and Disaster Relief. The types of questions in these areas typically do not have precisely defined initial conditions and a complete set of algorithms that describe the system being considered. These questions address open systems that defy prediction. Data farming is used to provide insight that can be used by decision-makers. As an illustrative example, we now describe how data farming is being integrated with other techniques in the context of countering improvised explosive devices.

Illustrative Application: Social Network Modeling to Support the Counter-IED Fight

This work represents results from an ongoing study to examine the utility of distillation modeling in the Counter-IED (Improvised Explosive Device) fight. Understanding social networks, their nature in insurgencies and IED networks, and how to impact them is important to the Counter-IED (C-IED) fight. This study, conducted as a team effort with international and inter-agency participation, is exploring methods of extracting, analyzing, and visualizing the dynamic social networks that are inherent in models with agent interaction. The effort is being conducted in order to build tools that may be useful in examining and potentially manipulating insurgencies.

The team started with a simple scenario that evolves cliques via interactions based on shared attributes. This simple model is the initial basis for the team's investigations and is being used to examine the types of network statistics that can serve as MOEs and as pointers to unique and emergent behaviors of interest. The team's initial goals were to extend this very basic scenario with simple variations and to test candidate tools and prototype methods for data farming the scenario, extracting network data, analyzing end-of-run network statistics, and visualizing network behaviors. Social Network Analysis (SNA) techniques were explored in detail to determine which network metrics would be most beneficial for analyzing the types of networks produced by the agent-based scenario. Developing these tools and methods, and delineating applicable metrics, will allow the exploration of questions regarding C-IED issues, including insurgent network evolution and adaptation.

Insurgent networks can be categorized into two groups of interest to C-IED efforts: IED Emplacement Networks (consisting of personnel that are directly involved with IED usage) and IED Enabling Networks (consisting of communities that indirectly support and enable the IED Emplacement Networks). This study is identifying tools that can be used to explore patterns that might provide valuable insights into emergent behaviors of interest for both of these classes of networks.

Background

In previous work related to the use of agent-based modeling in the C-IED domain, task plans aimed at addressing specific C-IED questions were developed. The current work is aimed at producing capabilities that can address these tasks. Task topics included: methods of indirect network attack; identifying important link layers for impacting insurgent networks in specific environments; identifying important individuals; the emergence of insurgent cells; and eroding popular support for insurgent networks.

From this set of tasks the study team selected a set of candidate tasks for follow-up study and analysis. It was concluded that both data farming and SNA concepts and techniques needed to be applied to address the candidate tasks, and that the current set of tools and methods available in these domains was not up to the task. The study team is working on developing the necessary tools and methods. In this effort we have:

• Demonstrated the ability to extract social network data from an existing scenario that included agent interaction but did not explicitly define a network. In this scenario the network "emerged," or evolved, from the basic agent interactions.

• Data farmed this initial scenario and established the need to simplify the target scenario in order to more closely examine cause-and-effect relationships to SNA statistics.

• Developed a new base scenario, delineated a simple illustrative DOE, and data farmed the model to provide a sample data set for further exploration.

• Examined the utility of and approach to applying specific SNA statistics, methods, and concepts using the data farming output provided from previous work.

• Delineated the data requirements for the various types of networks that might be extracted from various models.

• Established and documented software and processes for applying these capabilities to detecting and analyzing emergent networks.

This work has led to the study team's conviction that additional work needs to be accomplished in order to address C-IED-oriented problems. Generalized SNA/data farming tools that can be applied to output from various model types should have the capability to:

• Detect the presence of a network or networks.
• Distinguish different networks and different classes of networks.
• Determine if and when networks achieve equilibrium.
• Determine which model inputs have significant impact on the state and behaviors of the network.

Specifically, the intent is to use these capabilities to address a variety of social network questions such as:

• What do insurgent networks look like? Who is in the network? Who is not?
• How do we distinguish networks that should be attacked from networks that should be attritted or co-opted?
• Who are the High Value Individuals (HVIs) and what are their identifiable characteristics?
• Will removing specific nodes destabilize a network?
• What are the 2nd and 3rd order effects of network manipulation?
• What are the potential unintended consequences?

Abstracted Illustrative Scenario and DOE

The Pythagoras agent-based model development environment was used for the initial scenarios. The first phase of activity was based on the Pythagoras distribution "Peace" scenario, with some minor modifications of the source code to support the extraction of network interaction data. Data farming of this scenario demonstrated the ability to extract inherent emergent network data. Initial analysis of the results led to the development of a more basic scenario in order to test basic network concepts.

The illustrative "Clique Creator" (CC) scenario was developed using Pythagoras's "relative" color change capability as a tool for experimenting with SNA extraction and analysis. CC has a single agent class with 100 instantiated agents that are uniformly distributed across Pythagoras's red and blue color spaces. The agents' only "weapon" is "Chat," which induces a relative color change on other agents with which the agent interacts. As the scenario is executed, entities move through various color states, becoming "more" red or "more" blue depending on their interactions with other red-ish or blue-ish entities. States change depending on whether two entities engage in "chatting" and form a connection. The more any two agents interact, the more alike they become.

The focus of the scenario selection was to represent dynamic homophily and use the results to explore the various analysis tools under study. Multiple excursions and replications of the Pythagoras-developed Clique Creator scenario were used to produce the data for analysis with the candidate tools. This baseline provided a means for the team to experiment with various SNA measures and analysis techniques. Pythagoras can provide multiple views of agent state data. A spatial view shows the physical relationship between entities and where connections or bonds are formed. The inclination space view sorts the entities by color; this color space view is used to illustrate the homophilic state of the participating entities in the simulation. A very basic full-factorial design space was used to data farm the scenario.
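The chat-driven color dynamic described above can be sketched outside Pythagoras in a few lines. This is a simplified stand-in, not Pythagoras's actual implementation: agents here do not move, the parameter values are illustrative, and only the homophilic color pull is modeled.

```python
import random

random.seed(1)

# Illustrative values; parameter names follow the design-matrix factors
# described below, but these numbers are not Pythagoras's.
RELATIVE_CHANGE = 0.10   # fraction of the color gap closed per chat
INFLUENCE_RNG = 20.0     # maximum physical distance at which agents chat
FRIEND_THRESH = 60.0     # color distance under which agents are "linked"

# 100 agents spread over a 100x100 area and a (red, blue) color space.
agents = [{
    "pos": (random.uniform(0, 100), random.uniform(0, 100)),
    "color": [random.uniform(0, 255), random.uniform(0, 255)],
} for _ in range(100)]

def distance(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def chat(a, b):
    # Each agent's color shifts a fraction of the way toward the other's,
    # so repeated chats make the pair progressively more alike.
    for i in range(2):
        gap = b["color"][i] - a["color"][i]
        a["color"][i] += RELATIVE_CHANGE * gap
        b["color"][i] -= RELATIVE_CHANGE * gap

for step in range(50):
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            if distance(agents[i]["pos"], agents[j]["pos"]) <= INFLUENCE_RNG:
                chat(agents[i], agents[j])

# Emergent homophilic links: pairs whose colors have converged.
links = [(i, j) for i in range(100) for j in range(i + 1, 100)
         if distance(agents[i]["color"], agents[j]["color"]) < FRIEND_THRESH]
print(len(links), "homophilic links")
```

Even this toy version exhibits the key property the team exploits: the network is never declared explicitly but emerges from repeated local interactions.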

Table 1: Experimental Design Matrix

The design matrix (Table 1) reflects four input parameters that influence the composition of the resulting networks:

• RelativeChange - Percentage relative change of color when "chatted."

• InfluenceRng - Maximum distance of a chat.

• FriendThresh - Agents within this color range are considered "linked."

• EnemyThresh - A dependent variable, calculated as FriendThresh plus 55 in order to preserve the same friend-to-enemy distance (equivalent to the "neutral" range) as was present in the base scenario.

The CC scenario can be considered a metaphor for a group of people establishing relationships based on shared interests or desires (color space proximity) and physical proximity (relative agent location). Agents are drawn toward agents with similar color and move away from agents of dissimilar color. The closer agents are in location, the more frequently they "chat" with each other, and thus the closer they grow in color space. Eventually, cliques of like-interest agents form and are impacted by other agents and cliques. The input parameters varied in the design matrix affect these behavioral processes in straightforward ways.
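A design matrix of this shape is easy to generate programmatically. Since 27 excursions are reported later in the paper, the matrix is consistent with three levels for each of the three independent factors; the level values below are placeholders, as Table 1's actual entries are not reproduced here, but the derivation of EnemyThresh follows the rule just described:

```python
import itertools

# Placeholder levels: three per independent factor (actual Table 1 values differ).
relative_change = [5, 10, 20]    # percent color change per chat
influence_rng = [10, 30, 50]     # maximum chat distance
friend_thresh = [40, 60, 80]     # color distance defining a "friend" link

design = []
for rc, ir, ft in itertools.product(relative_change, influence_rng, friend_thresh):
    design.append({
        "RelativeChange": rc,
        "InfluenceRng": ir,
        "FriendThresh": ft,
        # EnemyThresh is dependent: FriendThresh + 55 keeps the
        # friend-to-enemy ("neutral") gap identical across excursions.
        "EnemyThresh": ft + 55,
    })

print(len(design))  # 3^3 = 27 excursions
```

Treating EnemyThresh as derived rather than farmed keeps the design full-factorial in only the three truly independent factors, which is what holds the run count to 27.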

Visualizing the Dynamic Network State

Part of a toolset to examine social network dynamics is the ability to analyze the ongoing agent interactions, behaviors, and network responses. Co-visualizing the various aspects (layers) of network dynamics can potentially provide powerful insight into the network.

Figure 1. CC Scenario - Spatial View

Team 6 has done an initial examination of the CC scenario using several visualization capabilities. Figure 1 is the spatial view provided by Pythagoras; it shows the agents at a time-step midway through the scenario. "Chats" are shown as lines between agents. This view, though, focuses on the spatial location of the agents. Figure 2 shows four time-steps of an "inclination"-space view. In this image the position of each agent is based on its location in color space: the "redness" (0-255) of the agent is represented on the x axis, and the "blueness" (0-255) on the y axis. As the scenario proceeds left to right, top to bottom, note the congregation of agents into color groups. These groups do not represent the cliques formed, though, because the spatial aspect is not represented.

Figure 2. CC Scenario - Inclination Space View

Figures 3 and 4 represent the same agent network, derived from the CC scenario, using the social network analysis layouts generated by the R SNA package and the SoNIA software package.

Figure 3. CC Scenario - Static Graph View

Figure 3 shows a static network layout representation of one of the CC time-steps using the default SNA layout algorithm. The SNA R package plots each time-step independently, not accounting for the layout defined in the previous time-step; because the layout of each time-step is independent, the dynamic evolution is difficult to examine.

Figure 4. CC Scenario - Dynamic Graph View

Figure 4 shows a single time-step using the SoNIA application. SoNIA is designed to support dynamic time-series network data; as a result, the layout of any time-step can use the previous time-step as a starting point. The result is a layout that displays the evolution of the network, but that can produce layouts that are not easily viewed statically. It should be noted that Figures 2, 3, and 4 do not represent the spatial data shown in Figure 1 in any way: the "physical" location is ignored in these representations. In Figure 2, location represents color, and in Figures 3 and 4 the location is purely a function of the layout algorithm, which is designed to display the network in an uncluttered and easily viewed manner, not the spatial location of the agents.

Social Network Analysis (SNA)

One of Team 6's goals is to begin to understand the utility of various SNA statistics in understanding the scenario dynamics and the results of data farming. Step one in this process was to delineate which outputs and analysis methods provide insight into network evolution and impact on agent behaviors. SNA statistics fall into two classes: node statistics and network statistics. Node statistics include betweenness, closeness, eigenvector centrality, and degree. Network statistics include the number of components, the number of cliques, and the average path length.

The study team decided to focus on node statistics initially and produced time-series output for every node of betweenness, eigenvector centrality, and degree. Although data for 27 excursions of data farming was collected, it was decided to do an initial comparison of three excursions, where the primary variation was the color distance that defined what is considered a friend (a homophilic link). Excursions 0, 1, and 2 were examined. Figure 5 represents one replication each from excursions 0, 1, and 2 as delineated in Table 1. The three plots represent the degree of each agent over time. The vertical axis is degree (the number of links associated with a node), the horizontal axis is time, and the axis going into the page is agent number. Figure 5 was generated using the PlotGL plugin to R.
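The degree-over-time surfaces plotted in Figure 5 can be reconstructed from interaction logs. The sketch below assumes, hypothetically, that each farmed run emits (time-step, agent, agent) chat events; the event list shown is made up for illustration and stands in for real model output:

```python
from collections import defaultdict

# Tiny made-up chat log: (time_step, agent_a, agent_b) tuples.
chat_log = [
    (0, 1, 2), (0, 2, 3), (1, 1, 2), (1, 3, 4),
    (2, 1, 4), (2, 2, 3), (3, 1, 2), (3, 1, 3),
]

def degree_series(events, num_agents, num_steps):
    """Degree (number of distinct link partners) per agent per time-step."""
    partners = defaultdict(set)  # (step, agent) -> set of partners that step
    for t, a, b in events:
        partners[(t, a)].add(b)
        partners[(t, b)].add(a)
    series = [[0] * num_agents for _ in range(num_steps)]
    for (t, agent), others in partners.items():
        series[t][agent] = len(others)
    return series

series = degree_series(chat_log, num_agents=5, num_steps=4)
for t, row in enumerate(series):
    print(t, row)
```

Repeating this extraction per replicate and per excursion yields exactly the degree-by-agent-by-time data cube that the three panels of Figure 5 visualize.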

Figure 5. Centrality (degree) for Excursions 0, 1, and 2

In Figure 5, various pattern differences related to the evolution and devolution of cliques and components can be discerned. There are obvious differences between the excursions, with 0 and 1 appearing to reach convergence but 2 never converging. Some agents reach a steady state and maintain it for some time, while other groups of agents participate in behaviors that lead to the growth and reduction of degree for groups of agents.

Results

Two counter-intuitive results presented themselves. Excursion 2, in Figure 5c, shows that an increase in FriendThresh (that is, expanding the range, and thus the number, of agents an agent has homophilic links with in color space) leads to increased instability in terms of clique formation. The initial assumption was that this would affect the size of the cliques and the number of components. The unexpected result is that the increase prevents the stabilization of cliques and network components. Rather, it appears that the increase allows groups to "steal" members from other groups more easily. Another interesting behavior is the Excursion 0 (Figure 5a) degree variation that occurs before equilibrium. In this case it appears that larger components form initially but devolve into smaller groups over time. The team intends to investigate the set of replicates associated with this excursion to determine whether this behavior is consistent for this level of FriendThresh.


Summary and Way Ahead

Significant insight was gained by team members in delineating the capabilities needed in a toolkit for the extraction and analysis of dynamic social data from models. The following capabilities will be needed for ongoing data farming research of basic social networks:

• Synching of Visualization: Various representations of the dynamic network are useful, but examining multiple views of the network synced by time-step would provide powerful relational insights.

• Equilibrium Time: Determining whether equilibrium occurs, and how long it takes, is often the first step in analysis.

• Data Farm Time Window Reduction Size: Dynamic network analysis requires defining what constitutes a link; for example, a single interaction or multiple interactions over some time window. Being able to data farm this time window would provide analysts insight into network basics.

• Node Statistic Capability: Degree, betweenness, eigenvector centrality, and closeness need to be extractable for each node, time-step, replicate, and excursion, and then represented effectively.

• Network/Component Statistic Capability: The number of cliques and components, density, and other statistics need to be acquired for each time-step, replicate, and excursion.

• Newcomer/Leaving Effects: Measure the effects of the dynamic birth and death of agents.

• Network Boundary Effects: Data farm the impact of varying the size and extent of the network.

• MOEs (end-of-run vs. time-series): Both end-of-run and ongoing behaviors may be important.
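The time-window idea in the list above can be sketched as follows. The events and thresholds are illustrative stand-ins, but the two knobs (window size and minimum interaction count) are exactly the quantities proposed as farmable:

```python
from collections import Counter

# Made-up interaction events: (time_step, agent, agent).
events = [(0, "a", "b"), (1, "a", "b"), (2, "b", "c"),
          (3, "a", "b"), (4, "b", "c"), (5, "b", "c")]

def links_in_window(events, start, window, min_interactions):
    """Pairs with at least min_interactions events in [start, start + window)."""
    counts = Counter()
    for t, x, y in events:
        if start <= t < start + window:
            counts[tuple(sorted((x, y)))] += 1
    return {pair for pair, n in counts.items() if n >= min_interactions}

# A short window with the same threshold yields a sparser network
# than a long window: the link set itself depends on the window choice.
print(links_in_window(events, start=0, window=3, min_interactions=2))
print(links_in_window(events, start=0, window=6, min_interactions=2))
```

Farming over window size and threshold, rather than fixing them by fiat, would show the analyst how sensitive the extracted network structure is to the link definition itself.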

The study team intends to continue to delineate tool capabilities for data farming social network models. We intend to accomplish the following tasks in the upcoming months:

• Document tools and methods identified in previous work.

• Define model output requirements for SNA analysis.

• Expand the toolkit to include additional network, node, and link statistics.

• Expand data farming methods to other network layers, including weapon and resource interaction, spatial, communication, and multiple "inclination" parameters.

• Continue detailed analysis of Clique Creator data farming results.

• Test the use of tools and methods on other models (MANA, NetLogo scenarios).

• Begin delineating an insurgent IED network scenario.

ACKNOWLEDGMENT

The authors would like to thank the members of Team 6 at International Data Farming Workshops 19 and 20 in New Zealand and Monterey, California for their contributions, insights, and support.

References

1. Carrington, P. J., Scott, J., and Wasserman, S. 2005. Models and Methods in Social Network Analysis (Structural Analysis in the Social Sciences). Cambridge University Press.

2. Henscheid, Z., Middleton, D., and Bitinas, E. 2007. Pythagoras: An Agent-Based Simulation Environment. Scythe Issue 1: 40-44. Monterey, CA.

3. Hoffman, F. and Horne, G. 1998. Maneuver Warfare Science 1998. United States Marine Corps Project Albert. Quantico, VA.

4. Horne, G. 1997. Data Farming: A Meta-Technique for Research in the 21st Century. Briefing presented at the Naval War College, Newport, RI.

5. Horne, G. 1999. Maneuver Warfare Distillations: Essence Not Verisimilitude. Proceedings of the 1999 Winter Simulation Conference, eds. A. Farrington, H. B. Nembhard, D. T. Sturrock, and G. W. Evans, 1147-1151. Phoenix, AZ.

6. Horne, G. and Meyer, T. 2004. Data Farming: Discovering Surprise. Proceedings of the 2004 Winter Simulation Conference, eds. R. Ingalls, M. D. Rossetti, J. S. Smith, and B. A. Peters, 171-180. Washington, DC.

7. Horne, G. and Meyer, T. 2005. Data Farming Architecture. Proceedings of the 2005 Winter Simulation Conference, eds. M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, 1082-1087. Orlando, FL.

8. Horne, G. and Meyer, T. January 2010. Scythe, Proceedings and Bulletin of the International Data Farming Community, Issue 7, Workshop 19. SEED Center for Data Farming, Monterey, CA.

9. Horne, G. and Meyer, T. August 2010. Scythe, Proceedings and Bulletin of the International Data Farming Community, Issue 8, Workshop 20. SEED Center for Data Farming, Monterey, CA.

10. Kleijnen, J., Sanchez, S., Lucas, T., and Cioppa, T. 2005. A User's Guide to the Brave New World of Designing Simulation Experiments. INFORMS Journal on Computing 17(3): 263-289.

11. PlotGL R Package (http://cran.r-project.org/web/packages/plotgl/index.html)

12. SNA R Package (http://cran.r-project.org/web/packages/sna/index.html)

13. SoNIA Social Network Image Animator (http://www.stanford.edu/group/sonia/index.html)

14. Wasserman, S. and Faust, K. 1994. Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences). Cambridge University Press.

- IWW 32 - Defense Applications Team10

Data Farming: Discovering Surprise

Team
Gary Horne, DataFarming.org
Ted Meyer, MeyerCraft, Inc.

ABSTRACT
The development of models and the analysis of modeling results usually require that models be run many times. Very few modelers are satisfied with the computing resources available to do sensitivity studies, validation and verification, measures of effectiveness analysis, and related necessary activities. Fortunately, high performance computing, in the form of distributed computing capabilities and commodity node systems, is becoming more pervasive and cost effective. In this paper the authors describe the concept and methods of Data Farming: the study and development of methods, interfaces, and tools that make high performance computing readily available to modelers and allow analysts to explore the vast amounts of data that result from exercising models.

BACKGROUND
In 1998, General Charles Krulak, then Commandant of the Marine Corps, recognized the inherent nonlinearity of war and that many of the existing combat models and simulations were archaic or inappropriate given the asymmetric nature of modern combat. For a few years leading up to his comments, the USMC had led the development of ideas to capture answers not provided by traditional models. A fast-running model called ISAAC was developed and the concept of Data Farming (Brandstein and Horne 1998) was invented. Congress expressed interest in combining these ideas with high performance computing and other high-tech capabilities. Project Albert (named after Einstein in the same manner that the first model was named after Newton) has continued since then, stressing the development of technology to capture and explore the huge possible outcome spaces generated by Data Farming. The capability development has emphasized, and benefited from, a wide range of interdisciplinary, joint, and coalition collaboration, and the still-developing experimental technologies have recently begun to be tested in various application areas in collaborative efforts.

Project Albert is not about running specific models to predict a final "answer." Project Albert is about Data Farming any model to gain insight into potential outcomes and to experiment with emerging methods. Data Farming, by providing the ability to process large parameter spaces, makes possible the discovery of surprises (both positive and negative) and potential options. Project Albert is addressing real questions. It has used Data Farming to seek insight into questions such as:

• When is decentralized (vs. centralized) command and control desired or preferred?

• What is the role of trust, or other so-called ‘intangibles’, on the battlefield?

• How can we best protect our homeland from a martyr-based offense?

• How can a bio-terrorist attack be mitigated in a free society?

• What system characteristics are important in military convoy protection systems?

• How can groups co-exist peacefully?

Of course, there are many other questions of interest, but these are a few that teams have attempted to address using Data Farming. The Data Farming approach explores these types of questions from the perspective of the 'whole', vice from the perspective of the component parts. And finally, the desire in all of our efforts is to go well beyond point estimates, because the understanding we seek requires much more (Horne 2001).

Data Farming relies on a set of enabling technologies and processes that have been the focus of ongoing research and development efforts: distributed and high-performance computing; agent-based simulations and rapid model development; knowledge discovery methods; high-dimensional data visualization techniques; design-of-experiments methods; human-computer interfaces; teamwork and collaborative environments; and heuristic search techniques. Project Albert has pursued, and continues to pursue, a program of developing, integrating, and applying this methodology and these technologies to problems in the military domain. Project Albert's mission is to: "Create the best Data Farming environment possible to collaboratively explore the vast space of possibilities inherent in the questions that our decision makers face in today's uncertain world."

- IWW 32 - Discovering Surprise Team

WHAT IS DATA FARMING?
Data Farming can be thought of as nothing more than putting the advances mentioned earlier to work to engage the scientific method. Testing of models and a complete exploration of model output require that a model of even a modest level of complexity be run a statistically significant number of times over a potentially large parameter space. Few models are examined as extensively as their designers and the decision-makers who rely on them would wish. Even though high performance computing (HPC) is becoming cheaper and more available, its use currently requires specialized development and expertise, and many developmental resources are expended building ad hoc model execution capabilities.

One objective of this research is to answer the question: "What interfaces, human and software, will allow modelers to easily submit and execute models, design experiments, collect results, explore the results, and support decision making?" Although much research has been done in HPC, most has been in processing methodologies and improvements, not in human access and interfaces. As any computing capability becomes more of a commodity, however, ease of use and access take on a premium and become the conduit for new opportunity.

Typically, Data Farming is an iterative team process. Figure 1 presents the data farming process as a set of embedded loops. The following steps are inherent in this figure and may be repeated until insight is gained.

• Question/topic research and definition
• Model development and gaming
• Parameter space exploration
• Data exploration and analysis

These steps normally require input and participation by subject matter experts, modelers, analysts, and decision-makers. The "Scenario Creation" loop shown on the left side of the figure involves building a model that adequately represents the system that pertains to the question being asked by the decision-maker. It involves not only the creation of a model but also the honing and refining of the question, so that all participants understand the scope and intent and confirm that the designed model sufficiently addresses the question. The scenario or model is crafted so that the decision-maker knows that it is addressing the issue at hand, the subject matter expert believes that the model adequately represents the real-world processes at work, the analyst can acquire the data required to examine the outcome space, and the modeler has the resources to implement the model in software. The modeler iterates the implementation, with all participants providing feedback into the development. In this phase the team must also define what measurements should be collected from the model in order to address the question being asked. The loops shown in Figure 1 all require the same participation and concurrence.

Figure 1: Data Farming Iterative Process

During the scenario creation loop the team may want to execute the model any number of times, examining the results to ensure the model is meeting their specific requirements. At this point, high performance computing may be used to "game" parameters, that is, to adjust them so that the model appropriately emulates the system being examined.

Once the team agrees that the scenario is represented by a basecase, the "Scenario Run Space Execution" loop shown in Figure 1 is entered. In this loop the team determines which scenario parameters should be examined and what processes should be used to vary them. Here the team is exploring the possible variations (or excursions) in the initial conditions of the scenario; specifically, those parameters that address the question being posed are considered. The basecase provides a starting set of parameter values, but the team must decide what range and limits of variation are appropriate for the scenario. Additionally, the method for varying the parameters should be considered. "Gridded" Data Farming results in a simple but voluminous full factorial variation of all parameters by defining ranges and step sizes. Alternatively, evolutionary (or generative analysis) algorithms may be used to explore the parameter space using optimization processes (Lucas et al. 2002). Any number of other methods may be developed to optimally cover the potential parameter space to be examined. If the model uses random seeds (e.g., to "randomize" initial conditions or in ongoing order or conflict resolution), then the model may be executed some number of times, each with a variation in the initial random seeds used. The number of replicates of each excursion must be defined. A study should be done over the parameter space to determine the sensitivity of the results to the random seeding; "outliers" in the results may depend on this randomization and may provide insight into the potential volatility of real-world systems.
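The gridded design and per-replicate seeding described above can be sketched in a few lines. The parameter names, ranges, and replicate count below are invented for illustration, not taken from any actual study:

```python
import itertools
import random

# Hypothetical parameter ranges for an excursion study; the names,
# values, and step choices are illustrative only.
parameter_ranges = {
    "sensor_range": [10, 20, 30],        # gridded: range covered in fixed steps
    "aggressiveness": [0.25, 0.50, 0.75],
    "squad_size": [4, 8],
}
REPLICATES = 5  # runs per excursion, each with its own random seed

def gridded_design(ranges, replicates):
    """Full factorial ("gridded") design: every combination of every
    parameter value, repeated once per replicate with a fresh seed."""
    names = list(ranges)
    for values in itertools.product(*(ranges[n] for n in names)):
        excursion = dict(zip(names, values))
        for rep in range(replicates):
            yield {**excursion, "replicate": rep,
                   "seed": random.randrange(2**31)}

runs = list(gridded_design(parameter_ranges, REPLICATES))
print(len(runs))  # 3 * 3 * 2 excursions x 5 replicates = 90 runs
```

Even this tiny grid grows multiplicatively with each added parameter, which is one reason the text mentions evolutionary and other space-covering designs as alternatives to pure gridding.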


Upon defining the parameter space to be explored, the model is ported to a high performance computing environment and executed. At this point any number of analysis, data mining, and visualization methods may be applied to the output measurements. In Data Farming, there is not necessarily a predefined hypothesis being confirmed; the data is being "explored." Trends and relationships may become exposed, but outliers, cusps, and saddle points may also be found in the n-dimensional output space. The completion of this exploration step can result in various outcomes:

• the team may determine that additional areas or resolutions of parameter space should be explored (iteration of the “Scenario Run Space Execution” loop);

• the team may decide that the model needs to be adjusted (back to the “Scenario Creation” loop);

• the team may decide that sufficient insight has been gained or that circumstances require a completion of the effort.

The results of this process may be incorporated into other modeling and operations analysis activities. Insight may be used to adjust wargames, provide input to deterministic models and equations, or build higher-verisimilitude simulations and models.

Data Farming provides a never-ending opportunity to explore our questions. The idea is to grow more data in the areas of interest. This growth, within a particular definition of a particular distillation, might be in the form of more runs or a different preparation of the sample space to include different parameters, finer gradations of parameter values, or greater ranges. After the execution of samples and analysis using data visualization and search methods, the data farmers are free to grow more data in interesting areas, integrate with information from other tools, prepare a different scenario using the same distillation, select another distillation, or any combination of these possibilities that might lead to progress on the question at hand or on new questions that arise during this exploration process.

DISTILLING QUESTIONS

"Everything should be made as simple as possible, but not simpler."

"Any intelligent fool can make things bigger, more complex… It takes a touch of genius and a lot of courage to move in the opposite direction."

~ Albert Einstein

Models used within the paradigm of data farming are referred to as “distillations.” It is recognized that all models are “distillations” or abstractions of the real world. It is only by judicious selection of specific aspects of a system that we can produce models that are helpful. The Einstein quotes above capture the intent of modeling within the realm of Data Farming. Distillations should be complex enough to address the question… and no more complex.

Ideally, distillations have the following characteristics:

• Intuitive–the team must be able to understand the parameters and rules that define the model and how they relate to the system being modeled;

• Transparent–the team must be able to understand how the behaviors that emerge in the model emerged from a set of parameters and rules; and

• Transportable–the model must be portable to a Data Farming environment.

Although any model could be data farmed, distillations are intended to be a bottom-up reduction to the essence of a question. Typically, distillations are expected to be developed quickly–potentially in a matter of a few days, hours, or even minutes. Realistically, though, model development environments have not reached the ease of use required to produce models in minutes. As a result of these requirements, the current focus has been on the implementation of agent-based modeling environments such as Pythagoras, represented in Figure 2.

Figure 2: Pythagoras Distillation Environment

Distillations use abstraction judiciously. "Weapons" can represent interchanges of various types, such as food, resources, messages, and positive or negative messages or propaganda. Location or proximity in the model can be abstracted to represent relative aspects of other relational parameters. Modeled obstacles can represent walls, floors, borders, or sociological or psychological obstructions in non-geoterrain or combat interchanges. Another Einstein quote, "Imagination is more important than knowledge," is an important guideline in distillation development. Distillation modelers must often innovate and use imagination to define abstractions. For example, communication level has been used as a "proxy" for trust: do you use or ignore the information provided?


WHY DATA FARMING?
The availability of Data Farming resources can be advantageous to any decision support process that is aided by modeling. Data Farming was developed to provide methods to address several phenomena that are not easily addressed using traditional methods of modeling:

• Non-linearity–including sensitive dependence on initial conditions and bifurcation events;

• Intangibles–“Fuzzy” parameters such as leadership, morale, and trust; and

• Adaptation–including opponent reaction and co-evolution.

Data Farming is a process that can address questions quickly: distillation models are developed quickly, and HPC allows results to be produced and collected quickly. Data Farming allows the examination of whole landscapes of potential outcomes, not just a few cases. It provides the capability of executing enough experiments that outliers might be captured and examined for insights. Data Farming is not intended to predict an outcome; it is used to aid intuition and to gain insight.

Recapping, Data Farming is a methodology and set of tools that provides modelers the ability to execute models or simulations hundreds of thousands or even millions of times. This capability can be used to support modeling and decision support in a number of ways:

• Sensitivity Studies – Models of any complexity are subject to chaotic or non-linear behaviors that may vary over the space of possible inputs to the model. Data Farming provides the ability to examine much larger and higher-resolution areas of the parameter space and thereby study the statistical variability of the model.

• Validation and Verification – Data Farming allows modelers to fully test a model’s reaction to various inputs over a broad space of possible and potentially unforeseen combinations of input parameters. Results may be examined to ascertain the correct execution of the model algorithms and to compare the results to the real world.

• Model Development and Gaming – “All models are simplifications of the real world.” In order to hone the model and its parameters to better represent the real world, models are often repeatedly executed to “steer” parameters. Furthermore, the process of developing models requires innumerable executions of the model to aid in debugging and algorithm development. The ability to run models over a larger parameter space speeds the model development process. This process is greatly enhanced by Data Farming.

• Scenario Analysis – Once models are developed, they are executed. The results of the execution are studied to provide insight or to address real world questions. Data Farming allows the model to be executed over a much larger number of input parameters and a larger number of random variations, which can give decision-makers a more complete view of the possible outcomes and system dynamics.

• Trends and Outliers – Traditionally, models are run a few times to do scenario analysis of a small window of the possible outcomes, and a few summary statistics are generated to represent the results. If one examines a wider parameter space, however, trends and relationships between inputs and measures of effectiveness can be studied. Of equal importance is the ability to identify which parameter combinations or random variations result in "outliers": special cases that may indicate model problems, or high-risk or high-opportunity domains of the parameter space.

• Heuristic Search and Discovery – Data Farming encompasses the ability to apply iterative methodologies for model analysis such as genetic algorithms and other sophisticated optimization and search methodologies.

• Generation of Massive Test Data Sets – Data Farming can be used in conjunction with models to generate massive data sets to test learning algorithms and other data mining tools. This is particularly valuable where actual data may not be available for security or privacy concerns.
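One simple way to hunt for the outliers mentioned above is to flag excursions whose summary measure sits far from the rest. A minimal sketch, assuming made-up excursion names, made-up measure of effectiveness (MOE) values, and an arbitrary z-score threshold:

```python
import statistics

# Illustrative per-excursion mean MOE values; the data is invented
# so that one excursion (e06) is an obvious outlier.
moe_by_excursion = {
    "e01": 0.52, "e02": 0.55, "e03": 0.51, "e04": 0.54,
    "e05": 0.53, "e06": 0.95, "e07": 0.50, "e08": 0.56,
}

def flag_outliers(results, z_threshold=2.0):
    """Flag excursions whose MOE lies more than z_threshold sample
    standard deviations from the mean -- candidates for closer
    inspection, not automatic conclusions."""
    mean = statistics.mean(results.values())
    sd = statistics.stdev(results.values())
    return [name for name, moe in results.items()
            if abs(moe - mean) / sd > z_threshold]

print(flag_outliers(moe_by_excursion))  # ['e06']
```

In practice a flagged excursion is a pointer back into the run space: replay it, vary its seeds, and decide whether it reflects a model problem or a genuinely high-risk or high-opportunity region.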

A TOOLKIT FOR DATA FARMING
Since 1998, a suite of tools has been developed to implement Data Farming environments and distillation models that can be executed in these environments. In general these tools are expected to be openly available to the collaborative community that is involved in Data Farming development. These tools fall into three categories: 1) implementations of Data Farming environments, 2) distillation modeling environments, and 3) data exploration tools.

Two Data Farming environments have been developed: the Maui High Performance Computing Center (MHPCC) Parallel Execution System (PES) and OldMcData. The PES is accessed through a web-based interface that allows the uploading of basecases to several supercomputing multi-node systems. The web interface also allows users to define the excursion space to be examined, the number of replicates, and the types of output to be produced and collected. The system includes software that distributes distillations to nodes, executes them, and collects output in a central repository. This system is currently being maintained and is undergoing a major developmental upgrade.

OldMcData (Upton 2004) is a smaller-scale system that can be used to execute model excursions on a standalone computer or on a distributed set of nodes on a network, in combination with an application called Xstudy. It allows users to set up a Data Farming environment on any networked set of nodes.

Six agent-based distillation modeling environments have been integrated into OldMcData and the PES: ISAAC, Socrates, Pythagoras, Mana, PAX, and NetLogo. Of these, three are currently under development: Pythagoras by Northrop Grumman, Mana (displayed in Figure 3) by the New Zealand Defence Technology Agency, and the PAX Peace Support Operation Model by EADS, Germany.
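The distribute-execute-collect pattern that the PES and OldMcData implement can be mimicked in miniature with a worker pool. `model_run`, the excursion tuples, and the fake measurement below are all invented stand-ins, not the interface of either system:

```python
from concurrent.futures import ThreadPoolExecutor

def model_run(excursion):
    """Placeholder for one distillation run on one node; returns a
    deterministic fake "measurement" so the sketch is checkable."""
    sensor_range, squad_size = excursion
    return {"excursion": excursion, "moe": sensor_range * squad_size}

# The excursion list plays the role of the run space to be farmed.
excursions = [(r, s) for r in (10, 20, 30) for s in (4, 8)]

# Pool workers stand in for compute nodes; pool.map distributes the
# excursions and collects the results, in order, into one central
# list -- the role the PES's central output repository plays.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(model_run, excursions))

print(len(results))  # 6 excursions executed and collected
```

The real systems add what the toy omits: staging the model binary onto each node, per-run seeds, retries, and a durable repository rather than an in-memory list.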

Figure 3: A Sample Scenario in the Mana Model

Of the six integrated models, three have source code available: ISAAC, Pythagoras, and Socrates. Each of these models has a rich feature set. Of the integrated models, NetLogo is the most open-ended, being a complete programming environment. NetLogo is an open source modeling system available online.

It should be noted, though, that any model that adheres to a fairly simple set of specifications can be integrated into the Data Farming environments. Models of various types could potentially be integrated: logistics, deterministic, network, game theory, predictive agent, and others. To be integrated into a Data Farming environment, a model must minimally be able to define a basecase using an XML text file and adhere to a simple text-based, delimited record/field output format.

Three visualization tools have been developed for use in the Data Farming process: the Playback Tool, the VizTool Landscape Plotter, and Avatar, all developed at MHPCC. The current visualization development is aimed at data exploration, not data presentation; the tools are aimed at fairly sophisticated analysts (Meyer and Johnson 2001). Development is intended to support the special needs of the Data Farming community and the high-dimensional data that is produced. The long-term goal is to provide interfaces and tools that directly support decision makers.

The first visualization tool, the Playback Tool, allows users to take time series output from a model and watch the model with VCR-style stepping, rewind, and fast-forward controls.
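As a rough sketch of the minimal integration contract described above (an XML basecase in, delimited records out): the element names, the output fields, and the stand-in model below are invented for illustration; the actual OldMcData and PES file formats differ.

```python
import csv
import io
import random
import xml.etree.ElementTree as ET

# Hypothetical basecase file contents; element and attribute names
# are invented, not the real schema.
BASECASE_XML = """<basecase>
  <parameter name="sensor_range" value="20"/>
  <parameter name="squad_size" value="8"/>
</basecase>"""

def read_basecase(xml_text):
    """Parse the parameters of an XML basecase into a dict."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): float(p.get("value"))
            for p in root.findall("parameter")}

def run_and_report(params, steps=3, seed=1):
    """Stand-in model run emitting one delimited record per time
    step -- the simple record/field output format the text asks for."""
    rng = random.Random(seed)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["step", "casualties"])  # header record of field names
    for step in range(steps):
        writer.writerow([step, rng.randint(0, int(params["squad_size"]))])
    return out.getvalue()

params = read_basecase(BASECASE_XML)
report = run_and_report(params)
print(report)
```

Because the contract is just "XML in, delimited records out," a harness can vary the parameter values in the XML, launch the model per excursion, and concatenate the delimited output into one farmable data set.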

The utility of the software is currently limited to the ISAAC model, but it allows the playback of multiple excursions and replicates simultaneously, giving users a powerful comparative view.

The second visualization tool, the VizTool Landscape Plotter, is displayed in Figure 4. This is a powerful tool that allows users to extract 2D slices out of high-dimension, full factorial, gridded Data Farming output to easily display the relationship of multiple parameters to the output measurements. Figure 4 depicts the maximum, mean, and minimum of replicates for a slice of data extracted from a five-dimensional data set of thousands of records.

Figure 4: The MHPCC VizTool Landscape Plotter

Two visualization concepts in particular support the exploration of high-dimension data: focus and linking (Buja et al. 1991). Focus refers to the ability to manipulate the view or perspective of a visualization interactively; zooming, rotating, and subsetting/sampling to examine relevant data at varying resolution are examples of focus. Linking refers to being able to examine multiple perspectives/visualizations at the same time to discover relationships among parameters: selecting or coloring data in one view results in linked selection or coloring in other views.

Avatar, the latest visualization tool, is designed to begin implementing focusing and linking for the Data Farming environment. Figure 5 shows two views of the same data using the Avatar visualization modules. Avatar also provides for subsetting of data sets, can support non-gridded data, and is under ongoing development. It is intended to be integrated with model playback so that users, once they have discovered interesting results, can examine the behaviors that caused them.

Figure 5 shows two linked views of the data: a 3D scatter plot and a parallel coordinate plot. Avatar allows the selection of any three parameters or outputs to be displayed in the 3D scatter. It provides for rotation, zooming, selection, and subsetting as well.


Figure 5: Avatar 3D Scatter and Parallel Coordinate Plot

The parallel coordinate plot is a powerful visualization that allows the examination of a large number of dimensions at one time. This plotter also provides for data subsetting, selection, and manipulation of scale and focus. Avatar has a plug-in architecture for adding visualization modules and can be easily extended. Modules for 2D scatter, 2D scatter with jitter, bubble plots, box plots, table views, and statistical summaries have also been developed.
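The focus-and-linking idea can be reduced to a toy sketch in which a selection made in one view is shared, by record index, with every other view of the same data. The records and field names below are invented:

```python
# Invented excursion results; each record would be one point in a
# scatter plot and one polyline in a parallel coordinate plot.
records = [
    {"sensor_range": 10, "aggressiveness": 0.25, "moe": 0.41},
    {"sensor_range": 20, "aggressiveness": 0.50, "moe": 0.63},
    {"sensor_range": 30, "aggressiveness": 0.75, "moe": 0.97},
    {"sensor_range": 30, "aggressiveness": 0.25, "moe": 0.58},
]

def select(records, predicate):
    """Focus step: pick record indices in one view, e.g. a brushed
    region of a scatter plot."""
    return {i for i, r in enumerate(records) if predicate(r)}

def linked_view(records, selected, field):
    """Linked view: render another field with the same indices
    highlighted (here '*' marks a selected record)."""
    return [(r[field], "*" if i in selected else " ")
            for i, r in enumerate(records)]

# Brush the high-MOE outcomes in one view...
picked = select(records, lambda r: r["moe"] > 0.9)
# ...and see which sensor ranges they correspond to in another.
print(linked_view(records, picked, "sensor_range"))
```

The shared index set is the whole trick: each view renders its own fields but consults the same selection, so brushing in one plot highlights the same excursions everywhere.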

OPEN SOURCE, OPEN INVITATION
This paper has two purposes. The first is to introduce the reader to the concept of Data Farming, its utility and value, and the toolkit that has been developed to support it. The second is to provide an open invitation to collaborators. By November 2004 we expect that much of the software developed to support the Data Farming process will be hosted on SourceForge.net. Those familiar with Open Source projects will recognize that this means we hope other modelers can benefit from this software and that we can benefit from the expansion of our user and developer community. Please contact the authors with any questions regarding collaboration or use of these tools.

REFERENCES
1. Brandstein, A. and Horne, G. 1998. Data Farming: A Meta-Technique for Research in the 21st Century. Maneuver Warfare Science 1998, Marine Corps Combat Development Command Publication, Quantico, Virginia.

2. Buja, A., McDonald, J. A., Michalak, J., and Stuetzle, W. 1991. Interactive Data Visualization Using Focusing and Linking. Proceedings of the 2nd Conference on Visualization '91, 156-163. IEEE Computer Society Press, Los Alamitos, California.

3. Horne, G. 2001. Beyond Point Estimates: Operational Synthesis and Data Farming. Maneuver Warfare Science 2001. United States Marine Corps Project Albert. Quantico, Virginia.

4. Lucas, T., Sanchez, S., Brown, L., and Vinyard, W. 2002. Better Designs for High-Dimensional Explorations of Distillations. Maneuver Warfare Science 2002. United States Marine Corps Project Albert. Quantico, Virginia.

5. Meyer, T. and Johnson, S. 2001. Visualization for Data Farming: A Survey of Methods. Maneuver Warfare Science 2001. United States Marine Corps Project Albert. Quantico, Virginia.

6. Upton, Stephen. 2004. User's Guide: OldMcData, the Data Farmer, Version 1.0. United States Marine Corps Project Albert. Quantico, Virginia.


Data Farming Workshops

Project Albert International Workshops
Project Albert International Workshop, August 1999, Maui
Project Albert International Workshop 2, January 2000, Maui
Project Albert International Workshop 3, February 2001, Auckland
Project Albert International Workshop 4, August 2001, Australia
Project Albert International Workshop 5, July 2002, Germany
Project Albert International Workshop 6, March 2003, Monterey
Project Albert International Workshop 7, September 2003, Quantico
Project Albert International Workshop 8, April 2004, Singapore
Project Albert International Workshop 9, November 2004, Wellington
Project Albert International Workshop 10, May 2005, Stockholm
Project Albert International Workshop 11, February 2006, Honolulu
Project Albert International Workshop 12, June 2006, Germany

International Data Farming Workshops
International Data Farming Workshop 13, November 2006, Netherlands
International Data Farming Workshop 14, March 2007, Monterey
International Data Farming Workshop 15, November 2007, Singapore
International Data Farming Workshop 16, March 2008, Monterey
International Data Farming Workshop 17, September 2008, Germany
International Data Farming Workshop 18, March 2009, Monterey
International Data Farming Workshop 19, November 2009, Auckland
International Data Farming Workshop 20, March 2010, Monterey
International Data Farming Workshop 21, September 2010, Portugal
International Data Farming Workshop 22, March 2011, Monterey
International Data Farming Workshop 23, September 2011, Finland
International Data Farming Workshop 24, March 2012, Monterey
International Data Farming Workshop 25, September 2012, Istanbul

International What-If? Workshops
International What-If? Workshop 26, June 2013, Washington
International What-If? Workshop 27, January 2014, Finland
International What-If? Workshop 28, October 2014, Washington
International What-If? Workshop 29, March 2015, Finland
International What-If? Workshop 30, February 2016, Italy
International What-If? Workshop 31, October 2016, Finland
International What-If? Workshop 32, March 2017, Washington

Additional Information Contact: Dr. Gary Horne at [email protected].

Scythe - Proceedings and Bulletin of the International Data Farming Community

Issue 20 - Workshop 32