
A Model-Driven Instructional Strategy:
The Benchmarked Experiential System for Training (BEST)

Georgiy Levchuk, Ph.D. (1)
Wayne Shebilske, Ph.D. (2)
Jared Freeman, Ph.D. (1)

(1) Aptima, Inc.
(2) Wright State University

Introduction

The military, the aviation industry, and other commercial enterprises increasingly use simulations to train and maintain skills. Such "experiential training" (cf. Ness, Tepe, and Ritzer, 2004; Silberman, 2007) deeply engages learners with realistic environments, technologies and, often, with the teammates with whom they must execute real-world missions.

However, these simulations are not training systems in a formal sense. The environments generally are not imbued with the capability to measure human performance, nor with the instructional intelligence to improve it (Freeman, MacMillan, Haimson, Weil, Stacy, and Diedrich, 2006) through the selection of feedback (cf. Shebilske, Gildea, Freeman, and Levchuk, 2009) or training scenarios (the topic addressed in this paper). It is the trainees and trainers who exercise instructional strategy in these simulations. Trainees are notoriously poor instructional strategists, however. They often invest less time on task than is optimal, exercise poor learning strategies, and engage in other undesirable learning practices (Williams, 1993; Steinberg, 1989).

Significant advances have been made in developing Intelligent Tutoring Systems (ITS), which automate the application of instructional strategy. ITS designed around the systematic application of instructional strategies (e.g., Chi and VanLehn's (2007) focus on principles and dynamic scaffolding) or around computational models of human cognition and learning have improved learning by individuals in well-defined domains. For example, the LISP programming tutor (Anderson, Conrad, & Corbett, 1989) reduced training time by 30% and increased scores by 43% relative to a control group, with a particularly strong effect for the poorest students. There is little literature, however, that tests methods of scaling ITS student models up from individual instruction to team training, and little concerning their application in ill-defined domains – those domains in which there is disagreement over solution methods and solutions for a given problem.

Scaling up an ITS to teams is a problem because the literature is scant and not altogether helpful with respect to modeling instructional strategy for team training. Most formal models used for team training are, ironically, models of individual performers that serve as simulated teammates and/or instructors to a single trainee (cf. Eliot & Woolf, 1995; Miller, Yin, Volz, Ioerger, & Yen, 2000; Rickel & Johnson, 1999; Freeman, Haimson, Diedrich, & Paley, 2005; Freeman, 2002). Because these models generally represent domain experts, not instructional experts, their ability to enforce sound instructional strategy is limited.

Ill-defined domains, likewise, are a little-explored frontier for ITS. The literature includes reports of marked success in automated tutoring of algebra (Anderson, Douglass, & Qin, 2004), LISP (Anderson et al., 1989), and probability and physics (Chi & VanLehn, 2008). Some systems are designed to train people to diagnose unique failures in seemingly well-defined domains such as electronics (Lesgold, Lajoie, Bunzo, & Eggan, 1992). Recent work explores performance and learning in complex domains, such as those that can be characterized as dynamical systems (Dutt and Gonzalez, 2008). However, there is scant literature concerning the application of ITS with strong didactic models to ill-defined domains, including military mission planning and execution.

We are attempting to automate the management of instructional strategy in simulation-based team training within ill-defined domains such as military command and control. To accomplish this, we are developing and empirically testing the effects of (1) models that assess team performance in these domains, and (2) models that adapt instructional strategy to the team on the basis of these assessments. We call this the Benchmarked Experiential System for Training (BEST).

To assess performance, we have applied optimization techniques that generate benchmarks against which to evaluate team performance, as well as animated feedback based on these benchmarks. This model-based approach to benchmarking and feedback reliably increases learning, holding learning trials constant (Shebilske, Gildea, Freeman, and Levchuk, 2009), and it is arguably more efficient than engaging domain experts in developing solutions to large numbers of scenarios. To adapt instructional strategy, we have extended Atkinson's (1972) application of optimal control theory to sequence instructional experiences. This paper focuses on the sequencing model and its effects in two experiments.

A POMDP Model of Instructional Strategy

BEST selects, from a large and structured library of scenarios, the scenario that is most likely to increase team expertise given the team's performance in the previous scenario. In practice, this optimal scenario sequencing policy is expressed in a simple lookup table that specifies the next scenario to administer given any plausible combination of a prior scenario and a performance score. We generate this table by applying the modeling technique described below to historical data or expert estimates of selected parameters. Generating the table is computationally intensive, and so it is created before training is delivered. During training, instructors can navigate the table rapidly to select a scenario (Figure 1a); software can do so instantaneously. The intended effect of this customized, optimized scenario sequencing is to increase the level of expertise students achieve in a fixed period of training, or to accelerate progress to a given level of expertise. We describe the optimization strategy here, then turn to an empirical test of its effects.
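To make the lookup-table representation concrete, the following minimal Python sketch shows how such a precomputed sequencing policy could be stored and consulted during training. The scenario names and score bins are hypothetical placeholders, not entries from the table actually generated by the POMDP solver described below.

```python
# Minimal sketch of a precomputed scenario-sequencing table (hypothetical entries).
# Keys pair the scenario just completed with a discretized performance score;
# values name the scenario that the policy recommends administering next.
sequencing_table = {
    ("scenario_03", "low"):    "scenario_03",  # hold at the same difficulty
    ("scenario_03", "medium"): "scenario_04",  # advance one step
    ("scenario_03", "high"):   "scenario_05",  # advance two steps
}

def next_scenario(prev_scenario: str, score_bin: str) -> str:
    """Return the next scenario; repeat the previous one if the pair is unlisted."""
    return sequencing_table.get((prev_scenario, score_bin), prev_scenario)

print(next_scenario("scenario_03", "high"))  # -> scenario_05
```

An instructor (or software) needs only this table at training time; all of the computational expense lies in the off-line model solution that produces it.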

[Insert Figure 1 about here]

Figure 1: The Problem of Training and Conceptual POMDP Solution

The problem of sequencing instructional experiences has all the elements of planning under uncertainty: the system has observables, hidden dynamics, and control measures that must be planned. The true state of trainee knowledge and skill cannot be directly observed, but only estimated from measures of process and outcome. The effect of a training experience on expertise cannot be reliably predicted; it, too, is probabilistic. Finally, the challenge at hand is to select training events that produce the fastest or largest change in expertise. To represent and solve this complex problem systematically and reliably, we invoke a computational modeling strategy.

To model the evolution of team expertise, we assume that the space of team expertise can be represented using a finite-state machine; that is, we discretize expertise into a finite set of team expertise states. While this is a simplifying assumption, it enables us to study complex temporal expertise dynamics and develop efficient system control strategies.

To develop an optimal instructional strategy for navigating this space, we utilize the Partially Observable Markov Decision Process (POMDP; Kaelbling, Littman, and Cassandra, 1998; see a tutorial at http://www.cs.brown.edu/research/ai/pomdp/tutorial/index.html; see Figure 1). The POMDP is well suited to decision-theoretic planning under uncertainty. It has several advantages.


First, the POMDP formulation captures the dynamic nature of team and individual skills via a Markov decision process graph. Within the graph, a single finite discrete variable indexes the current team expertise state, and external actions control expertise changes. For example, the state of a team's expertise may represent the proficiency of each team member at their tasks (such as accurately prioritizing enemy targets) and their ability to collaborate (e.g., to communicate efficiently with other team members). The state changes approximate the dynamics of team expertise when the model applies a specific control action to a team. In our context, a control action corresponds to selecting a training scenario. Expertise changes are described by a table of transition probabilities that statistically represent the uncertain effect on expertise of selecting a specific training scenario for a team.

Second, the POMDP formulation allows us to make training decisions under partial observability of the true state of team expertise. While observations about team and individual performance influence our belief about achieved team skills, the actual ("true") state of skills is not observable. Thus, we can only estimate the expertise state, interpreting it as "partially observable".

Third, the POMDP formulation allows our model to treat scenario selection both as the control mechanism that changes skills and as the testing mechanism that yields more knowledge of the true state of team expertise.

More formally, the POMDP model is described using the following variables:

a finite set of states, S;
a finite set of control actions, A;
a finite set of observations, Z;
a state transition function, τ: S × A → Π(S), where Π(·) is a probability distribution over some finite set;
an observation function, o: S × A → Π(Z); and
an immediate reward function, r: S × A → R.

In the present application, the set S represents all possible states of team expertise. The team can be in only one state at a given time. The set A represents all of the available training/testing scenarios. The set Z consists of all possible observations about trainees, that is, all possible values of normalized performance and process measures. The state transition function τ models the uncertainty in the evolution of expertise states (learning), while the observation function o relates the observed measures to the true underlying expertise state and scenario selection actions. The immediate utility of performing a control action in each of the true states of the environment is given by the immediate reward function r, which can incorporate a cost of training and a benefit of attaining expertise.
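For illustration, the six components above can be held in a few simple data structures. The following Python sketch uses hypothetical state, scenario, and observation labels and uniform placeholder probabilities; the actual parameter values in BEST were estimated from historical data and expert judgment, as described later.

```python
import numpy as np

# Hypothetical labels for the POMDP components defined above.
states = ["novice", "partial_skill", "adequate_skill", "high_skill"]              # S
scenarios = ["low_tst_low_threat", "med_tst_med_threat", "high_tst_high_threat"]  # A
observations = ["accuracy_low", "accuracy_medium", "accuracy_high"]               # Z

n_s, n_a, n_z = len(states), len(scenarios), len(observations)

# Transition function tau: transition[a, s, s2] = Pr{s2 | s, a}; each row sums to 1.
transition = np.full((n_a, n_s, n_s), 1.0 / n_s)   # uniform placeholder values

# Observation function o: observation[a, s2, z] = Pr{z | s2, a}; each row sums to 1.
observation = np.full((n_a, n_s, n_z), 1.0 / n_z)  # uniform placeholder values

# Immediate reward r(s, a): benefit of attained expertise minus cost of training.
reward = np.zeros((n_s, n_a))
```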

Thus, the dynamics of team expertise are represented in a state-action model (see Figure 2a) equivalent to a Markov Decision Process (MDP) graph, in which the instructional actions change the team expertise with some uncertainty. The state-action model is uniquely described by a set of team expertise states S, a set of selectable training scenarios A, and the state transition function τ. That is, if S = {s1, s2, ..., sN} and A = {a1, a2, ..., aM}, then the transition function τ: S × A → Π(S) defines the probability τ(si, ak, sj) = Pr{sj | si, ak} that team expertise will change to state sj if scenario ak is applied when team expertise is in state si.

[Insert Figure 2 about here]

Figure 2: An example of POMDP structure

(a) The state-action model illustrates how the controlled instructions of the trainer can affect the dynamics of team expertise. For example, if the team does not have any skill in pairing assets (such as weapons) to tasks (such as enemy targets), then playing a mission scenario containing air and ground task classes with high appearance frequency would have a 30% probability of achieving no effect, a 10% probability of achieving a high level of skill, a 40% probability of the team acquiring some skills for which further training is required, and a 20% probability that adequate skills are achieved.

(b) The observation model illustrates an example of how observations from the average task accuracy measure are related to the selection of scenarios (represented as task classes and task frequencies) and the true state of expertise resulting from executing a new scenario. For example, there is a 40% probability that average task accuracy will range from 60% to 70%, given that the new scenario contained air and ground task classes with high appearance frequency and that the team achieved some asset-task pairing skills that require further training.
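The example probabilities in this caption can be read as single rows of the transition and observation tables. The sketch below restates them with hypothetical shorthand labels; entries not stated in the caption are illustrative remainders only.

```python
# Transition row (Figure 2a): from the "no asset-task pairing skill" state, under the
# "air + ground task classes, high appearance frequency" scenario.
transition_row = {
    "no_skill":       0.30,  # no effect achieved
    "some_skill":     0.40,  # some skills acquired; further training required
    "adequate_skill": 0.20,
    "high_skill":     0.10,
}

# Observation row (Figure 2b): given the same scenario and a resulting "some_skill"
# state, the probability of each average-task-accuracy bin.
observation_row = {
    "accuracy_below_60": 0.35,  # illustrative remainder, not stated in the caption
    "accuracy_60_to_70": 0.40,  # stated in the caption
    "accuracy_above_70": 0.25,  # illustrative remainder, not stated in the caption
}

assert abs(sum(transition_row.values()) - 1.0) < 1e-9
assert abs(sum(observation_row.values()) - 1.0) < 1e-9
```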

The true states that team expertise takes over time (that is, the states of the MDP) are not known to the trainer or the instructional model. They obtain only partial observations about the current state of expertise in the form of performance and/or process measures. The observation-state relationships are captured using an observation model (see Figure 2b) described by the state set, the action set, and the observation function o. That is, if the set of measure outcomes is Z = {z1, z2, ..., zL}, then the observation function defines the probability Pr{zj | si, ak} that a normalized performance/process measure outcome zj is obtained when instruction action ak (a scenario) is applied and team expertise transitions to state si. (In another formulation, this probability reflects the dependence of measures on only the true expertise state, that is, the probability Pr{zj | si}.)

The training policy should produce substantial learning while minimizing the cost of training sessions. The cost and reward can be quantified using the scoring of the team's skills, as we have done in the present research, and other factors such as the duration of training or the complexity of scenario development and training session preparation. However, since the true states of the team's expertise are not known, we cannot compute a precise cost-reward measure of the training policy. Instead, we use as the objective an expected cost-reward measure, called expected utility, which incorporates the beliefs about the state of expertise.

The POMDP solution objective is to derive a control policy, an instructional strategy that achieves the greatest expected utility (expected reward of training) over some number of decision steps (training events). This control policy can be expressed as a policy graph, where nodes are beliefs about the true state of the team's expertise, each such node contains a training scenario to be executed, and links correspond to probabilities of moving to a new belief state.

If the states of expertise were observable, this policy could be specified as a training action to be performed at the currently attained state of expertise s. The problem of finding the team training policy then simplifies to a Markov Decision Process (MDP). The MDP solution is significantly easier to obtain than a POMDP solution, with algorithms running in time quadratic in the size of the state space per iteration, or cubic in the size of the state space for a closed-form solution (Bellman, 1957).
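As a point of reference for the fully observable case, the following sketch shows standard value iteration over the transition and reward arrays defined earlier. It is a generic illustration of the MDP simplification, not the solver used in BEST.

```python
import numpy as np

def mdp_value_iteration(transition, reward, gamma=0.95, tol=1e-6):
    """Value iteration for the fully observable (MDP) case -- a generic sketch.

    transition: array (A, S, S), transition[a, s, s2] = Pr{s2 | s, a}
    reward:     array (S, A), immediate reward r(s, a)
    Returns optimal state values and a greedy policy mapping each state to an action.
    """
    n_a, n_s, _ = transition.shape
    values = np.zeros(n_s)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum over s2 of Pr{s2 | s, a} * V(s2)
        q = reward + gamma * np.einsum("ast,t->sa", transition, values)
        new_values = q.max(axis=1)
        if np.max(np.abs(new_values - values)) < tol:
            return new_values, q.argmax(axis=1)
        values = new_values
```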

In the case of partial observability, the intelligent tutoring system (ITS) at time t+1 does not know the current state s[t+1] of the team's expertise/knowledge. Instead, the ITS has an initial belief about the expertise state (potentially from prior assessments), the history of observations z_{t+1} = {z[1], z[2], ..., z[t+1]}, and the ITS's own actions a_t = {a[1], a[2], ..., a[t]}. The ITS can act optimally on this information by conditioning the training policy on its current belief about the state of the team's expertise/knowledge at every time step. The belief state at time t is represented as a vector of probabilities b[t] = (b1[t], b2[t], ..., bN[t]), where bi[t] is the probability that the state of the team's knowledge is si at time t (with Σ_{i=1..N} bi[t] = 1).


These beliefs are probability distributions over the states si ∈ S and are sufficient statistics for acting optimally. The POMDP-based scenario training policy is then defined on the belief state, so that we specify the training scenario π(b) ∈ A to be performed at belief state b.
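The belief maintained between scenarios is updated with the standard Bayesian filter for POMDPs; a minimal sketch follows, using the transition and observation arrays defined earlier. The policy lookup shown is a generic alpha-vector evaluation, not the specific representation produced by the solver used in this work.

```python
import numpy as np

def belief_update(belief, action, obs, transition, observation):
    """Bayesian belief update after applying scenario `action` and observing `obs`.

    belief:      array (S,), current probability over expertise states
    transition:  array (A, S, S), Pr{s2 | s, a}
    observation: array (A, S, Z), Pr{z | s2, a}
    """
    predicted = belief @ transition[action]            # sum_s Pr{s2 | s, a} * b(s)
    unnormalized = observation[action][:, obs] * predicted
    return unnormalized / unnormalized.sum()

def select_scenario(belief, alpha_vectors):
    """Pick the scenario whose alpha-vector gives the highest value at this belief.

    alpha_vectors: list of (value vector over states, scenario index) pairs,
    as produced by a value-function POMDP solver (a generic sketch).
    """
    values = [alpha @ belief for alpha, _ in alpha_vectors]
    return alpha_vectors[int(np.argmax(values))][1]
```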

Optimal algorithms for POMDPs are significantly more computationally complex than solutions for the MDP formulation, and the problem of finding optimal policies is PSPACE-complete (Papadimitriou and Tsitsiklis, 1987). There exist two main classes of approximate algorithms for POMDPs: value-function methods that seek to approximate the value of belief states, that is, probability distributions over the environment states (Hauskrecht, 2000), and policy-based methods that search for a good policy within some restricted class of policies.

Due to the large size of the belief state space, the optimal policy that maximizes the objective function cannot be derived using conventional means. Currently, problems of a few hundred states are at the limits of tractability for optimal POMDP algorithms (Smith and Simmons, 2004). This is because most exact algorithms for general POMDPs use a form of dynamic programming, which suffers a computational explosion in the belief state space (Cassandra, Littman, and Zhang, 1997). Still, these algorithms provide the useful finding that a value function can be given a piece-wise linear and convex representation and transformed into a new such function iteratively over time. And while the set of belief states is infinite, the structure of the POMDP problem allows efficient clustering of all beliefs into a limited set of states.

Several algorithms that use dynamic-programming (DP) updates of the value function have been developed, such as one pass (Sondik, 1971), exhaustive enumeration (Monahan, 1982), linear support (Cheng, 1988), and witness (Littman, Cassandra, and Kaelbling, 1996). Of these algorithms, the witness algorithm has been shown to have superior performance (Littman, Cassandra, and Kaelbling, 1996). Combining the benefits of Monahan's enumeration and the witness algorithm, a still better optimal algorithm called incremental pruning was developed by Zhang and Liu (1996) and enhanced by Cassandra, Littman, and Zhang (1997).

To overcome the solution complexity of optimal algorithms, efficient approximate solutions to the POMDP have been proposed (Littman, Cassandra, and Kaelbling, 1995). These algorithms use replicated Q-learning or linear Q-learning to update the belief state-action vectors during the search in belief space.

Another approximate technique is the Heuristic Search Value Iteration (HSVI) algorithm proposed by Smith and Simmons (2004). This is an anytime algorithm that returns an approximate policy and a provable bound on its error with respect to the optimal policy. HSVI combines two well-known techniques: attention-focusing search heuristics and piece-wise linear convex representations of the value function. On some benchmarking problems, HSVI displayed more than a 100-fold improvement in solution time compared to state-of-the-art POMDP value iteration algorithms (Smith and Simmons, 2004). In addition, HSVI was able to solve problems 10 times larger than those reported previously.

Recently, researchers have examined new methods that restrict the set of states that the POMDP policy solution could have (Sutton et al., 2000; Poupart, Ortiz, and Boutilier, 2001; Aberdeen and Baxter, 2002). One such solution, using an internal-state policy-gradient algorithm (Aberdeen, 2003), was shown to solve problems with tens of thousands of possible environment states in reasonable time (e.g., 30 minutes). (Recall that the BEST POMDP solution is computed off-line once for all students in advance of training; thus the time to compute a solution does not influence training.) This algorithm uses the concept of a finite state stochastic controller (FSSC), which provides efficient near-optimal policy identification and model training by trading off optimality against complexity and adapting the solution to new evidence over time. The FSSC solution is attractive because the number of its internal states can be significantly smaller than the number of possible states of team expertise or state beliefs, and new states can be added and old ones deleted as new evidence is received. As the internal states come to describe the beliefs about the team expertise and the transition probabilities become deterministic, the FSSC approaches the optimal solution to the POMDP problem.

The algorithms above describe the main features of POMDP solution approaches. Optimal algorithms can only be applied to solve small-scale POMDP problems, and approximate iterative solutions must be used for larger-scale situations. In our research, we experimented with a small space of 11 expertise states, and therefore used the optimal value iteration solutions. However, we envision a need to move to approximate solutions for real-world applications. The recent advances in heuristic value iteration, stochastic controller models, and error-bounding state space reduction methods promise to achieve the needed trade-offs between optimality and complexity of POMDP modeling.

Experimental Validation

Testing the Application of a POMDP to Team Training

We conducted two similar experiments to compare a control condition to the POMDP adaptive training system. In the control condition, the experimenters applied a single, predetermined sequence of scenarios. In the POMDP condition, the experimenters assessed team performance and then used that assessment and the scenario identity to select the next scenario from a simple table generated by the POMDP solver. In each experiment, a single team used the POMDP instructional strategy to learn one new experience and the control instructional strategy to learn another new experience. The two experiments counterbalanced the order of the two instructional strategies.

The experiments trained teams to execute the tasks of an Air Force Dynamic Targeting Cell (DTC). The role of the DTC is to rapidly adjust air operations to strike Time Sensitive Targets (TSTs), which are unexpected hazards (threats) and opportunities (targets) that may demand immediate action. These actions should minimally disrupt the current mission plan for prosecuting other targets and defending against other threats (cf. Shebilske et al., 2009).

Pilot studies suggested to us that, in this domain, whole-task training would be less effective than hierarchical part-task training, which is one of a few part-task training sequences that have afforded an advantage over whole-task training (Fredericksen & White, 1989). Accordingly, the training scenarios systematically and independently vary the number of threats and targets to develop student skill at each part of their task: defense and offense. The control condition and the adaptive POMDP condition drew on the same library of scenarios.

To increase the relevance of this work to field training and operations, the experiment simulated (1) experienced DTC teams, (2) training for new experiences in which (a) the current enemy escalated or changed tactics, (b) new enemies arose, and (c) enemies were confronted in different contexts (Shebilske et al., 2009). To replicate experienced DTC teams, we used a paradigm in which two teams participated for many hours to learn many more complex scenarios. These teams attained the proficiency of a moderately or highly skilled operational DTC, in the judgment of scientists who had performed the task analysis of DTC teams in operational settings and of our simulated team. We focused on training for new experiences because this is the emphasis of most operational training. Such training is analogous to professional football players training each week for their next opponent (Corrington & Shebilske, 1995).

We started developing our POMDP model after we had about 50 hours of data on participants in Experiment 1, who were trained to perform like DTC teams, and after developing a system to enable experts to rate trainee performance, as described by Shebilske et al. (2009). This empirical foundation was not enough to estimate parameters for the POMDP model from historical data alone, but it was enough to support expert judgments about parameters, as follows. When reviewing the historical assessments from each training scenario (a1), the experts thought of their ratings as observations (o1) at the end of the scenario, which reflected the state of expertise (s2) the team obtained while performing the current scenario. The experts then estimated the Observation Probability (Pr{o1 | s2, a1}) and Transition Probability (Pr{s2 | s1, a1}), which reflects how the team expertise changes from one state to another when the team performs the current scenario.

The experts did not have data for all possible combinations of transitions from one difficulty level to another, but they noticed a tendency that enabled them to estimate by extrapolation. The tendency concerned the probability of transitioning from a lower state to a higher state as a function of the combination of time sensitive targets (TSTs) with enemy defenses (such as Surface-to-Air Missiles), which constitute threats to our strike force. Having a low number of TSTs or Threats raised the probability of better performance on a component task in the short run, but in the long run, having a medium to high number increased overall performance as reflected by ratings on the most comprehensive of our training objectives: to coordinate attack assets.

The experts also noticed another tendency that guided their estimations of the transition probability (P) for a team that was given a scenario (a1) with a difficulty of medium to high TST/Threat combinations. They found high variability in the probability that a team given such a scenario would advance from one state of expertise (s1) to a more advanced state of expertise (s2) in a single performance of the scenario. The estimated advancement included assessment of the team's state of expertise with respect to (a) recognizing TSTs, (b) taking into account other threats, and (c) submitting a plan to prosecute the TST while disrupting other mission priorities minimally. The experts' final estimates for this transition probability (Pr{s2 | s1, a1}) included estimates of states of expertise before and after the scenario. The POMDP model was thus able to hold a team at the same scenario difficulty level when their state had not changed as a result of performance on the current scenario, or to advance the team to a higher scenario difficulty level when their measured expertise state had improved.

The POMDP model assigned scenarios that were challenging but achievable. That is, the POMDP model kept training in the team's zone of proximal development (Vygotsky, 1978). The hierarchical part-task training strategy used in the control condition also was designed to keep the trainees in their proximal zone. However, it used a more conservative, predetermined sequence, as shown in Figure 3. The triangles in the figure show that, for both experiments, the hierarchical part-task control strategy gradually increased the number of TSTs and then increased the other Threats that were present in the mission. The logic was that a DTC team must learn to detect TSTs before it can design strike packages that take into account the TST in the context of the other Threats. Accordingly, detecting TSTs is a less advanced skill in the hierarchy. This is consistent with the logic of the hierarchical part-task strategy, which conservatively keeps challenges low for more advanced skills in the hierarchy until a skill less advanced in the hierarchy is learned well. Then it holds the challenges for the less advanced skill at a consistently high level and gradually increases the challenge for the more advanced skill. In contrast, the circles in the figure show that the POMDP training strategy increased TSTs and Threats less conservatively and more efficiently by changing both before the less advanced TST skill had reached its highest level. Specifically, the control strategy and the POMDP strategy started with the same TST/Threat combination and ended with the same TST/Threat combination, but the POMDP strategy took the more direct diagonal path. Because the POMDP was not predetermined, it was able to use the inferred state of expertise before and after each scenario to advance the challenge of both TSTs and Threats or to hold the participants at the same challenge level of TSTs and Threats. The circles that overlap represent the same combination of TSTs and Threats. Specifically, in Experiment 1, the team was held for two scenarios of 10 TSTs and 35 Threats, for two scenarios of 11 TSTs and 40 Threats, and for two scenarios of 12 TSTs and 45 Threats. In Experiment 2, the POMDP diagonal path looks similar. A critical difference, however, is where the POMDP model told the experimenters to hold the team or to advance them. Specifically, the team was held for two scenarios at 10 TSTs and 35 Threats and for three scenarios at 12 TSTs and 45 Threats. We hypothesized that the POMDP training path would be more effective than the control hierarchical part-task training path.

[Insert Figure 3 about here]

Figure 3: Training sequences for control and POMDP conditions

We tested this hypothesis in two single-team experiments. The team in Experiment 1 consisted of 7 undergraduate college students (2 women and 5 men, mean age = 22.6 years), who were paid $7.25 per hour for 117 hours each. The team in Experiment 2 consisted of 7 undergraduate college students (2 women and 5 men, mean age = 20 years), who were paid $7.25 per hour for 45 hours each. Participation in both experiments was part of the students' responsibilities as research assistants, but the teams did not know the purpose of the experiment until they had completed the research.

Shebilske et al. (2009) describe the workstations and Aptima's Dynamic Distributed Decision-making (DDD, see www.aptima.com, searched on 8 March 2010) synthetic task environment used to simulate Air Force Intelligence, Surveillance, and Reconnaissance (ISR) and the DTC task. The dependent variable was the quality of the proposed strike package for each TST, which was determined by expert ratings of the strike package. The ratings evaluated the plan of the whole team, as opposed to evaluating each individual, making each experiment a single-team design (cf. Shebilske et al., 2009).

Although the main purpose of running two experiments was to counterbalance the order of presenting the two instructional strategies, the second experiment also decreased the amount of training before the teams learned new experiences. Each experiment had three phases. Phase I was background training; Phase II was performing missions with the same enemy using the same attack strategy; Phase III was performing missions that exposed the teams to new, and putatively more difficult, experiences defending against new, and qualitatively different, attack strategies. Sessions in all three Phases of both experiments consisted of planning (10 min), mission execution (40 min), and debriefing (10 min). Phase I was 50 hrs in Experiment 1 and 16 hrs in Experiment 2. Phase II was 49 hrs in Experiment 1 and 11 hrs in Experiment 2. Difficulty increased gradually in Phase II as described by Shebilske et al. (2009). Phase III was 18 hrs in Experiments 1 and 2.


Extended training produced experienced teams, but at time scales that are unusually long for typical laboratory training studies. For example, the scale of training interventions in the present experiment was the scenario. In Phase III, for instance, 18 scenario interventions occurred in 18 hrs of training, or 1 intervention per hour for 18 hrs. The ratio, 1/18, is not as unusual as the time scale. If we apply the same ratio to 18 interventions in a typical laboratory task in which all training occurs in 1 hr, we get a more familiar timescale: 1 training intervention per 3.3 minutes in 60 min. Research is needed to determine the best ratio of interventions per time unit for longer training intervals.

Results during initial training in Phases I and II

Time was reduced in Phases I and II of Experiment 2 because Experiment 1 had shown that the team was so close to asymptotic performance on the pretest in Phase II that the pretest-posttest difference was not statistically significant. Figure 4 shows this result and shows that the reduction of Phase I and II times in Experiment 2 was effective. It resulted in much lower pretest performance in Phase II, the same asymptotic performance on the posttest, and a significant difference between the pretest and posttest (t(35) = 3.13, p < .01). Note that the pretests and posttests were similar to one another and were counterbalanced across experiments. The scores on all pretests and posttests were ratings of the most comprehensive training objective, which was to coordinate attack assets.

[Insert Figure 4 about here]

Figure 4: Pretest and posttest scores in two experiments

Phase III: Testing the POMDP Approach for an Experienced Team to Learn a New Experience

Phase III was critical because it compared the Control training protocol and the POMDP training protocol for the same team learning two new experiences that varied with respect to enemy strategy. For example, during New Experience 1 at the beginning of Phase III, the enemy destroyed the refueling tankers, and did so consistently. This forced the DTC to choose different attack package patterns, and gave them repeated practice doing so. During New Experience 2 in Phase III, a different strategic variation was presented and practiced. The order of training the two new experiences was the same in both experiments. The only difference in Phase III was in the order of testing the Control training protocol and the POMDP training protocol. In Experiment 1, the team had the Control training protocol for New Experience 1 and the POMDP training protocol for New Experience 2. In Experiment 2, the team had the POMDP training protocol for New Experience 1 and the Control protocol for New Experience 2. As a result, over the two experiments, the POMDP and Control conditions had both orders of new experiences and pretest-posttest combinations.

Results During Phase III

The results from both experiments supported the hypothesis. Figure 5 shows that the posttest minus pretest difference for the POMDP training strategy was greater than that difference for the Control training strategy in both experiments. That is, the POMDP training strategy produced more learning whether it was used for learning the first or second new experience, or whether it was used with the first or second pretest-posttest combination. The standard errors for the posttests were consistently smaller than those for the pretests. We used SPSS to conduct conservative t-tests, which are robust to unequal variance. In Experiment 1, on the POMDP posttest, accuracy of the plan to coordinate attack assets rose significantly from 1.6 on the pretest to 3.0 on the posttest (t(31) = 3.11, p < .01). Between the POMDP posttest and the New Experience 2 Control pretest, performance fell from 3.0 to 1.6 (t(27) = 2.83, p < .01). On the Control protocol posttest, the slight rise from 1.6 to 1.9 was not significant (t(34) = 0.48, p > .05). In Experiment 2, on the Control posttest, attack plan accuracy rose insignificantly from 1.7 on the pretest to 2.2 on the posttest (t(41) = .86, p > .05). Between the Control posttest and the New Experience 2 POMDP pretest, performance fell insignificantly from 2.2 to 1.7 (t(40) = .94, p > .05). On the POMDP protocol posttest, the increase in attack plan accuracy from 1.7 to 3.2 was significant (t(40) = 2.70, p < .01).
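For readers who wish to reproduce this style of comparison outside SPSS, the sketch below shows an unequal-variance (Welch) t-test in Python. The score arrays are hypothetical placeholders, not the ratings analyzed above.

```python
import numpy as np
from scipy import stats

# Hypothetical plan-accuracy ratings standing in for a pretest and a posttest;
# the analyses reported in the text were conducted in SPSS.
pretest_scores = np.array([1.2, 1.8, 1.5, 2.0, 1.4, 1.7])
posttest_scores = np.array([2.8, 3.1, 2.9, 3.3, 3.0, 2.9])

# Welch's t-test does not assume equal variances, matching the conservative,
# unequal-variance tests described in the text.
t_stat, p_value = stats.ttest_ind(posttest_scores, pretest_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```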

[Insert Figure 5 about here]

Figure 5: Difference scores for new experiences between treatments and experiments

Discussion

One potential disadvantage of single-team designs is the possibility of serial correlation, which would violate the independence assumption that is required for many standard statistics, including those used in the present experiment. Anderson (2001) reviews misunderstandings about this potential risk, which result in failures to realize the potential advantages of single-person (or, in this case, single-team) designs. He also discusses evidence and conditions in which the independence assumption seems reasonable. For example, we isolated the observations in our rating procedure by rating each plan relative to the circumstances that existed when the plan was made, regardless of earlier or later circumstances. Accordingly, we cautiously made the independence assumption until future experiments yield enough data to test it with time series analyses (see West, 2006).

Another disadvantage of the single-team A-B design is that neither the order nor the materials can be counterbalanced. We addressed this problem by counterbalancing order and materials between Experiments 1 and 2 and by minimizing the similarity of strategic countermeasures in Phases II and III. Learning with the Control protocol second in Experiment 1 was potentially facilitated by general knowledge of the tendency for the enemy to change strategies. However, this was a potential bias in favor of the Control protocol, and it made the observed POMDP advantage more defensible. We cautiously conclude that the POMDP protocol facilitated experienced teams learning new experiences in both experiments and that the Control protocol did not.

Accordingly, we argue that the advantage for the POMDP in Experiments 1 and 2 was due to the instructional strategy (scenario selection) driven by a POMDP model, and not to differences in the order of new experiences or in the pretests and posttests.

The advantage of the POMDP-driven strategy is that it exposes a team to as much complexity as the team can learn from given its state of expertise [1]. This is not true of static instructional strategies such as the control condition tested here. Further, the use of POMDPs is arguably more effective than instruction guided by the trainee or by a trainer who (typically) has domain knowledge but not instructional expertise. It also may be more cost-effective than an ITS, which can require development of an instrumented instructional environment, a rule-based (and, thus, domain-specific) model of expert performance, a rule-based model that diagnoses student performance, and a model of instructional strategy (Polson and Richardson, 1988). By contrast, the present POMDP training protocol requires only an instructional environment with a library of systematically defined training scenarios; instrumentation, automated or observer-based, to assess trainee performance; a generalized POMDP solver; and a model of the rate of human learning given the scenario library, based on historical performance scores and/or expert assessments of trainee state before and after a training scenario. In sum, the present experiments demonstrate both the potential instructional effectiveness of a POMDP adaptive training system and its technical feasibility.

[1] Although the state of expertise is evaluated before and after each scenario, no assumption is made that the change in expertise state during a scenario depended only on events in that scenario. The change in expertise may have been influenced by all prior experience, but it occurred in the context of the evaluation period, which was a scenario for this experiment.

The research reported here is a foundation for future experiments. These could test information processing accounts of the observed advantage of POMDP-driven instructional sequences. The findings would also increase our understanding of team performance in ill-defined domains. One hypothesis for such research is this: the responsiveness of the POMDP sequences to team state enables students to learn from scenarios with higher counts of TSTs and threats, and these combinations are advantageous over training with a lower number of either TSTs or threats. The literature on part-task versus whole-task training provides a useful perspective on these combinations. Having a lower number of either TSTs or threats is similar to part-task training, while having higher TST/Threat combinations is similar to whole-task training. Whole-task training is generally better than part-task training, especially when training with the whole task facilitates the learning of critical input/output ensembles (e.g., Fredericksen & White, 1989; Gopher, Weil, and Siegel, 1989).

Ieoger, Shebilske, Yen, and Volz (2005) set the stage for applying this principle to teamwork on complex tasks such as the present task, which is a dynamical system in the sense that the components are interdependent and interactive. They argue that teams understand system dynamics by processing collective variables, which are all uncontrolled variables that depend on the reciprocal interactions among an organism (in this case the team), its components (in this case team members), and its environment (in this case the DDD/DTC simulation environment). For example, an input/output ensemble in the present task might include as input understanding a communication such as: "Would it be better to prosecute the present TST with the jet that we had planned to use to attack the nearby enemy tank, or with the jet that we had planned to use to attack the nearby enemy aircraft?" A response (output) might be, "Use the jet for the tank because we can replace it more easily." The collective variables for this case include the two team members, the TST, and the other threats, the enemy tank and enemy jet. They are uncontrolled in the sense that they are free to vary over a wide range of possibilities. These collective variables become more complex when the numbers of TSTs and threats increase together. That is, when the DTC environment includes high TST / high threat combinations, team members must recognize more complex patterns in enemy and friendly weapons. They must also engage in more complex patterns of communication to form a unified team plan. The plan might include, for instance, combinations of two relevant TSTs, four relevant threats, and many more objects that must be seen as irrelevant to a specific input/output ensemble. These more complex input/response patterns of collective variables must be integrated into functional input/response ensembles. Medium to high TST/threat combinations enable the formation of these ensembles and facilitate learning.

The present research did not measure these collective variables. However, future research will investigate whether these collective variables are best learned with the POMDP adaptive training system using medium to high TST/threat combinations. Future experiments will also compare the collective variables for high TST/threat combinations with and without the POMDP adaptive training system. We hypothesize that the adaptive feature will be necessary to enable effective use of the high TST/threat combinations. Finally, we believe that information processing models of collective variables will explain this advantage of the POMDP adaptive training system.

The present experiment also provides a foundation for future development of a POMDP adaptive training system. The technologies employed here to leverage the POMDP for training were a mix of manual and automated methods. Defining the POMDP for the DTC was a laborious process. Technologies are needed that simplify and accelerate subject-matter-expert work to parameterize a POMDP for a new domain. Applying the POMDP during training was slowed by manual assessment of performance. Automated measurement and assessment of team performance could remove the expert from the loop where that produces reliable and valid measures, and in so doing it would accelerate training. Further, automated measurements (observations), related scenarios (actions), and other data could be stored and used to manually or automatically refine the POMDP model and improve its training recommendations.

Finally, the speed and effectiveness of the training administered here may have been limited by the large size of the training event: a scenario. This limitation was imposed by the present experimental procedures and not by the POMDP approach itself. The POMDP approach could be revised to select smaller training events, such as vignettes, that would be automatically composed into scenarios. Alternatively, the POMDP could specify training objectives that a scenario generation technology could translate into vignettes and dynamically construct into scenarios (cf. MacMillan, Stacy, and Freeman, in press). Such a wedding of the POMDP with dynamic scenario generation technology would ensure that training consists of only those events that are relevant to a team's training needs. This would necessarily be more efficient than selection from a library of pre-defined scenarios, none of which is guaranteed to address all and only a team's current training needs.

Conclusion

The research reported here employed a POMDP model to inject rigorous, real-time instructional strategy into simulation-based team training in an ill-defined domain. POMDP-driven adaptive selection of training experiences enhanced learning, relative to ordering scenarios in a predetermined sequence using a hierarchical part-task training strategy. This work demonstrated the feasibility of a POMDP-driven adaptive training system. Finally, it set the stage for experiments that test information processing accounts of the POMDP's effectiveness, and for development of technology that marries POMDP scenario specification techniques with automated scenario generation technology that adapts instruction within as well as between scenarios.

This research program is also distinguished by its logical progression in integrating field research, laboratory research, and field applications. One starting point for the present research was a field study of DTC teams in operational settings. Based on these task analyses, we simulated DTC operations using the DDD synthetic task environment, and applied a POMDP model to adaptively select training experiences in the laboratory. The products of this research include a large number of challenging training scenarios and a model-driven instructional strategy that, we believe, should be of direct benefit not only to future research, but also to Air Force DTC staff training in operational settings.

References

Aberdeen, D. (2003). Policy-Gradient Algorithms for Partially Observable Markov Decision 542

Processes. PhD Thesis, The Australian National University, April 2003. 543

Aberdeen, D., and J. Baxter. (2002). Scalable Internal-State Policy-Gradient Methods for 544

POMDPs. Proceedings of the International Conference on Machine Learning. 2002, 545

pp.3-10. 546

Anderson, J. R., Conrad, F. G., & Corbett, A. T. (1989). Skill acquisition and the LISP Tutor. 547

Cognitive Science, 13, 467-506. 548

Anderson, J. R., Douglass, S. & Qin, Y. (2004). How should a theory of learning and cognition 549

inform instruction?. In A. Healy (Ed.) Experimental cognitive psychology and it’s 550

applications. American Psychological Association; Washington, D. C. 551

Anderson, N.H. (2001). Empirical Direction in Design and Analysis. Mahwah, NJ.: Lawrence 552

Erlbaum Associates. 553

Atkinson, Richard C. (1972). Ingredients for a theory of instruction. American Psychologist. Vol 554

27(10), Oct 1972, 921-931. 555

Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press. 556

Cassandra, A.R.; Littman, M.L.; and Zhang, N.L. (1997). Incremental pruning: A simple, fast, 557

exact method for partially observable Markov decision processes. Uncertainty in 558

Artificial Intelligence (UAI). 559

Page 27: Model-Driven Instructional Strategy 1jaredfreeman.com/jf_pubs/Freeman_ARI_Adaptive... · Model-Driven Instructional Strategy 1 1 2 A Model-Driven Instructional Strategy: 3 The Benchmarked

Model-Driven Instructional Strategy

27

Cheng, H. T. (1988). Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia.

Chi, M., & VanLehn, K. (2007). Accelerated future learning via explicit instruction of a problem solving strategy. In K. R. Koedinger, R. Luckin, & J. Greer (Eds.), Artificial intelligence in education (pp. 409-416). Amsterdam, Netherlands: IOS Press.

Chi, M., & VanLehn, K. (2008). Eliminating the gap between the high and low students through meta-cognitive strategy instruction. Proceedings of ITS 2008, Montreal, Canada.

Corrington, K., & Shebilske, W. L. (1995). Complex skill acquisition: Generalizing laboratory-based principles to football. Applied Research in Coaching and Athletics Annual, 54-69.

Dutt, V., & Gonzalez, C. (2008). Human perceptions of climate change. Proceedings of the 2008 System Dynamics Conference, Athens, Greece.

Eliot, C., & Woolf, B. P. (1995). An adaptive student centered curriculum for an intelligent training system. User Modeling and User-Adapted Interaction, 5, 67-86.

Frederiksen, J., & White, B. (1989). An approach to training based upon principled task decomposition. Acta Psychologica, 71, 89-146.

Freeman, J. (2002). I've got synthers. Who could ask for anything more? Proceedings of the 46th Annual Meeting of the Human Factors and Ergonomics Society, Baltimore, MD.

Freeman, J., Haimson, C., Diedrich, F., & Paley, M. (2005). Training teamwork with synthetic teams. In C. Bowers, E. Salas, & F. Jentsch (Eds.), Creating high-tech teams: Practical guidance on work performance and technology. Washington, DC: APA Press.

Freeman, J., MacMillan, J., Haimson, C., Weil, S., Stacy, W., & Diedrich, F. (2006). From gaming to training. Society for Advanced Learning Technology, Orlando, FL, 8-10 February 2006.

Gopher, D., Weil, M., & Siegel, D. (1989). Practice under changing priorities: An approach to the training of complex skills. Acta Psychologica, 71, 147-177.

Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33-94.

Ioerger, T., Shebilske, W., Yen, J., & Volz, R. (2005). Agent-based training of distributed command and control teams. Proceedings of the 49th Annual Meeting of the Human Factors and Ergonomics Society, Orlando, FL.


Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.

Lesgold, A. M., Lajoie, S. P., Bunzo, M., & Eggan, G. (1992). SHERLOCK: A coached practice environment for an electronics troubleshooting job. In J. Larkin & R. Chabay (Eds.), Computer assisted instruction and intelligent tutoring systems: Shared issues and complementary approaches (pp. 201-238). Hillsdale, NJ: Erlbaum.

MacMillan, J., Stacy, W., & Freeman, J. (in press). The design of synthetic experiences for effective training: Challenges for DMO. Proceedings of Distributed Mission Operations Training, Mesa, AZ.

Miller, M., Yin, J., Volz, R. A., Ioerger, T. R., & Yen, J. (2000). Training teams with collaborative agents. Proceedings of the Fifth International Conference on Intelligent Tutoring Systems (ITS-2000), 63-72.

Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models and algorithms. Management Science, 28(1).

Ness, J. W., Tepe, V., & Ritzer, D. R. (2004). The science and simulation of human performance. Amsterdam: Elsevier.


Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441-450.

Polson, M., & Richardson, J. J. (1988). Foundations of intelligent tutoring systems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Poupart, P., Ortiz, L. E., & Boutilier, C. (2001). Value-directed sampling methods for monitoring POMDPs. Uncertainty in Artificial Intelligence. Retrieved June 10, 2008, from www.citeseer.nj.nec.com/445996.html

Rickel, J., & Johnson, W. L. (1999). Virtual humans for team training in virtual reality. Proceedings of the Ninth World Conference on AI in Education. IOS Press.

Shebilske, W., Gildea, K., Freeman, J., & Levchuk, G. (2009). Optimizing instructional strategies: A Benchmarked Experiential System for Training (BEST). Theoretical Issues in Ergonomics Science (Special Issue on Optimizing Virtual Training Systems), 10(3), 267-278.

Silberman, M. (Ed.). (2007). The handbook of experiential learning. New York: Wiley & Sons.

Smith, T., & Simmons, R. G. (2004). Heuristic search value iteration for POMDPs. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI), ACM International Conference Proceeding Series, Vol. 70.

Sondik, E. J. (1971). The optimal control of partially observable Markov processes. PhD thesis, Stanford University.

Steinberg, E. R. (1989). Cognition and learner control: A literature review, 1977-1988. Journal of Computer-Based Instruction, 16, 117-121.


Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057-1063.

Vygotsky, L. (1978). Mind in society. Cambridge, MA: Harvard University Press.

West, B. J. (2006). Where medicine went wrong: Rediscovering the path to complexity. NJ: World Scientific.

Williams, M. D. (1993). A comprehensive review of learner control: The role of learner characteristics. Proceedings of Selected Research and Development Presentations at the Convention of the Association for Educational Communications and Technology, Sponsored by the Research and Theory Division (15th, New Orleans, LA, January 13-17, 1993).

Zhang, N. L., & Liu, W. (1996). Planning in stochastic domains: Problem characteristics and approximations (Technical Report HKUST-CS96-31). Department of Computer Science, The Hong Kong University of Science and Technology.



Author Note

Georgiy M. Levchuk is a Simulation and Optimization Engineer at Aptima, Inc. His research interests include global, multi-objective optimization and its applications in the areas of organizational design and adaptation, and network optimization. He received a Ph.D. in Electrical Engineering from the University of Connecticut, Storrs.

Wayne Shebilske is Professor of Psychology in the Department of Psychology at Wright State University. His research interests include instruction for complex skills, team training, and spatial cognition. Dr. Shebilske received a Ph.D. in psychology from the University of Wisconsin, Madison.

Jared Freeman is Chief Research Officer at Aptima. He is a cognitive psychologist whose work addresses the design and assessment of instructional systems, operational systems, and human organizations. Dr. Freeman received a Ph.D. in human learning and cognition from Columbia University.

This research was supported in part by an STTR award from the Air Force Office of Scientific Research.

Correspondence concerning this article should be addressed to Jared Freeman, Aptima, 1726 M St. NW, Washington, DC 20036. E-mail: [email protected]


Footnotes

(none)


Tables and Figures

Note to the editor: The figures and captions are included here solely for reference. Greyscale .tif files have been delivered to you for each figure. Captions are embedded in the text, above.

[Figure 1 image omitted. Panel (a), Team Training Problem: a trainer and trainees working in the Training Environment (DDD), which yields scenario measures of performance. Panel (b), Conceptual POMDP Model: the Training Scenario (controlled), the True Team Expertise State (hidden), and the Measures (observed), connected by links labeled "effect on expertise," "effect on measures," and "feedback."]

Figure 1: The Problem of Training and Conceptual POMDP Solution
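Read formally, panel (b) corresponds to maintaining a belief about the hidden team-expertise state with the standard POMDP belief update (see, e.g., Kaelbling, Littman, & Cassandra, 1998). The textbook form is reproduced below for the reader's convenience; it is not drawn from the BEST implementation itself:

b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\sum_{s''} O(o \mid s'', a) \sum_{s} T(s'' \mid s, a)\, b(s)}

where b is the current belief over expertise states s, a is the selected training scenario (the controlled element), T is the state-transition (learning) model, O is the observation (measurement) model, and o is the observed performance measure.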



[Figure 2 image omitted. Panel (a), state-action model: from the state "asset-task pairing skills: none," the action "air and ground task; high frequency" leads to "skills: none" (p=0.3), "skills: high" (p=0.1), "skills: need training" (p=0.4), and "skills: adequate" (p=0.2). Panel (b), observation model: given that action and the resulting state "asset-task pairing skills: need training," average task accuracy falls in [30%, 50%] with p=0.2, in [50%, 60%] with p=0.4, and in [60%, 70%] with p=0.4. A legend distinguishes state, action, and observation nodes.]

Figure 2: Example of POMDP structure

(a) The state-action model illustrates how the trainer's controlled instruction can affect the dynamics of team expertise. For example, if the team has no skill in pairing assets (such as weapons) to tasks (such as enemy targets), then playing a mission scenario containing air and ground task classes at high appearance frequency has a 30% probability of producing no change, a 10% probability of producing a high level of skill, a 40% probability of producing some skill that requires further training, and a 20% probability of producing adequate skill.

(b) The observation model illustrates how observations from the average task accuracy measure relate to the selected scenario (represented as task classes and task frequencies) and to the true state of expertise that results from executing that scenario. For example, there is a 40% probability that average task accuracy will fall between 60% and 70%, given that the new scenario contained air and ground task classes at high appearance frequency and that the team reached the state in which its asset-task pairing skills still require training.
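To show how these example values would be used, the Python sketch below encodes the Figure 2 entries and performs one belief update: prediction through the state-action model, then Bayesian correction through the observation model, as in the equation accompanying Figure 1. It is an illustrative sketch only, not code from the BEST system; the state and observation labels are shorthand for the figure's entries, and the initial belief and the transition/observation rows that Figure 2 does not specify are assumptions made here so the example runs end to end.

# Illustrative only: encodes the example probabilities from Figure 2 and applies
# the standard POMDP belief update (predict with the transition model, then
# correct with the observation model). Labels, the initial belief, and all rows
# not specified in Figure 2 are assumptions made for this sketch.

STATES = ["none", "high", "need_training", "adequate"]   # asset-task pairing skill levels

# Action: play a scenario containing air and ground task classes at high frequency.
# Figure 2(a): P(next state | current state = "none", this action).
T_FROM_NONE = {"none": 0.3, "high": 0.1, "need_training": 0.4, "adequate": 0.2}

# Figure 2(b): P(average-accuracy bucket | next state = "need_training", this action).
O_NEED_TRAINING = {"acc_30_50": 0.2, "acc_50_60": 0.4, "acc_60_70": 0.4}

def predict(belief, transition):
    """Prediction step: push the current belief through the transition model."""
    predicted = {s: 0.0 for s in STATES}
    for s, p_s in belief.items():
        for s_next, p_t in transition[s].items():
            predicted[s_next] += p_s * p_t
    return predicted

def correct(predicted, observation, obs_model):
    """Correction step: weight by P(observation | next state) and renormalize."""
    unnorm = {s: predicted[s] * obs_model[s].get(observation, 0.0) for s in STATES}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()} if z > 0 else predicted

if __name__ == "__main__":
    # Assumed starting belief: the team is known to lack asset-task pairing skills.
    belief = {"none": 1.0, "high": 0.0, "need_training": 0.0, "adequate": 0.0}

    # Only the row from "none" is given in Figure 2(a); other states are left unchanged here.
    transition = {s: {s: 1.0} for s in STATES}
    transition["none"] = T_FROM_NONE

    # Only the "need_training" row is given in Figure 2(b); the rest are uniform placeholders.
    obs_model = {s: {"acc_30_50": 1 / 3, "acc_50_60": 1 / 3, "acc_60_70": 1 / 3} for s in STATES}
    obs_model["need_training"] = O_NEED_TRAINING

    predicted = predict(belief, transition)                  # P(need_training) = 0.4 after the scenario
    posterior = correct(predicted, "acc_60_70", obs_model)   # observed accuracy in [60%, 70%]
    print(posterior)  # belief mass shifts further toward "need_training"

In a full model, every combination of expertise state and scenario would carry its own transition and observation rows; the uniform placeholders above merely stand in for the entries that Figure 2 does not show.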


[Figure 3 image omitted: two line plots, "Experiment 1 Path" (Expt 1 Control vs. Expt 1 POMDP) and "Experiment 2 Path" (Expt 2 Control vs. Expt 2 POMDP), plotted against axes labeled TSTs (8-13) and Threats (30-50).]

Figure 3



Figure 4

[Figure 4 image omitted: bar chart of Pretest and Posttest scores for Exp. 1 and Exp. 2; vertical axis from 0 to 4.]


Figure 5

[Figure 5 image omitted: bar charts of Pretest and Posttest scores for Experiment 1 (Phase III) and Experiment 2 (Phase III); each panel shows New Experience 1 and New Experience 2, with bars labeled POMDP 1, Control 1, Control 2, and POMDP 2; vertical axis from 0 to 3.5.]