The DARPA High-Performance Knowledge Bases Project: Year 1 Results

Paul Cohen, Robert Schrag, Eric Jones, Adam Pease, Albert Lin, Barbara Starr, David Gunning, and Murray Burke

AI Magazine Volume 19, Number 4 (1998). Copyright © 1998, American Association for Artificial Intelligence. All rights reserved.



Now completing its first year, the High-Performance Knowledge Bases Project promotes technology for developing very large, flexible, and reusable knowledge bases. The project is supported by the Defense Advanced Research Projects Agency and includes more than 15 contractors in universities, research laboratories, and companies. The evaluation of the constituent technologies centers on two challenge problems, in crisis management and battlespace reasoning, each demanding powerful problem solving with very large knowledge bases. This article discusses the challenge problems, the constituent technologies, and their integration and evaluation.

Although a computer has beaten the world chess champion, no computer has the commonsense of a six-year-old child. Programs lack knowledge about the world sufficient to understand and adjust to new situations as people do. Consequently, programs have been poor at interpreting and reasoning about novel and changing events, such as international crises and battlefield situations. These problems are more open ended than chess. Their solution requires shallow knowledge about motives, goals, people, countries, adversarial situations, and so on, as well as deeper knowledge about specific political regimes, economies, geographies, and armies.

The High-Performance Knowledge Base (HPKB) Project is sponsored by the Defense Advanced Research Projects Agency (DARPA) to develop new technology for knowledge-based systems.¹ It is a three-year program, ending in fiscal year 1999, with funding totaling $34 million. HPKB technology will enable developers to rapidly build very large knowledge bases—on the order of 10^6 rules, axioms, or frames—enabling a new level of intelligence for military systems. These knowledge bases should be comprehensive and reusable across many applications, and they should be maintained and modified easily. Clearly, these goals require innovation in many areas, from knowledge representation to formal reasoning and special-purpose problem solving, from knowledge acquisition to information gathering on the web to machine learning, from natural language understanding to semantic integration of disparate knowledge bases.

For roughly one year, HPKB researchers have been developing knowledge bases containing tens of thousands of axioms concerning crises and battlefield situations. Recently, the technology was tested in a month-long evaluation involving sets of open-ended test items, most of which were similar to sample (training) items but otherwise novel. Changes to the crisis and battlefield scenarios were introduced during the evaluation to test the comprehensiveness and flexibility of knowledge in the HPKB systems. The requirement for comprehensive, flexible knowledge about general scenarios forces knowledge bases to be large. Challenge problems, which define the scenarios and thus drive knowledge base development, are a central innovation of HPKB. This article discusses HPKB challenge problems, technologies and integrated systems, and the evaluation of these systems.

The challenge problems require significant developments in three broad areas of knowledge-based technology. First, the overriding goal of HPKB—to be able to select, compose, extend, specialize, and modify components from a library of reusable ontologies, common domain theories, and generic problem-solving strategies—is not immediately achievable and requires some research into the foundations of very large knowledge bases, particularly research in knowledge representation and ontological engineering.


Second, there is the problem of building on these foundations to populate very large knowledge bases. The goal is for collaborating teams of domain experts (who might lack training in computer science) to easily extend the foundation theories, define additional domain theories and problem-solving strategies, and acquire domain facts. Third, because knowledge is not enough, one also requires efficient problem-solving methods. HPKB supports research on efficient, general inference methods and optimized task-specific methods.

HPKB is a timely impetus for knowledge-based technology, although some might think it overdue. Some of the tenets of HPKB were voiced in 1987 by Doug Lenat and Ed Feigenbaum (Lenat and Feigenbaum 1987), and some have been around for longer. Lenat's CYC Project has also contributed much to our understanding of large knowledge bases and ontologies. Now, 13 years into the CYC Project and more than a decade after Lenat and Feigenbaum's paper, there seems to be consensus on the following points:

The first and most intellectually taxing task when building a large knowledge base is to design an ontology. If you get it wrong, you can expect ongoing trouble organizing the knowledge you acquire in a natural way. Whenever two or more systems are built for related tasks (for example, medical expert systems, planning, modeling of physical processes, scheduling and logistics, natural language understanding), the architects of the systems realize, often too late, that someone else has already done, or is in the process of doing, the hard ontological work. HPKB challenges the research community to share, merge, and collectively develop large ontologies for significant military problems. However, an ontology alone is not sufficient. Axioms are required to give meaning to the terms in an ontology. Without them, users of the ontology can interpret the terms differently.

Most knowledge-based systems have no common sense, so they cannot be trusted. Suppose you have a knowledge-based system for scheduling resources such as heavy-lift helicopters, and none of its knowledge concerns noncombatant evacuation operations. Now, suppose you have to evacuate a lot of people. Lacking common sense, your system is literally useless. With a little common sense, it could not only support human planning but might be superior to it because it could think outside the box and consider using the helicopters in an unconventional way. Common sense is needed to recognize and exploit opportunities as well as to avoid foolish mistakes.

Often, one will accept an answer that is roughly correct, especially when the alternatives are no answer at all or a very specific but wrong answer. This is Lenat and Feigenbaum's breadth hypothesis: "Intelligent performance often requires the problem solver to fall back on increasingly general knowledge, and/or to analogize to specific knowledge from far-flung domains" (Lenat and Feigenbaum 1987, p. 1173). We must, therefore, augment high-power knowledge-based systems, which give specific and precise answers, with weaker but adequate knowledge and inference. The inference methods might not all be sound and complete. Indeed, one might need a multitude of methods to implement what Polya called plausible inference. HPKB encompasses work on a variety of logical, probabilistic, and other inference methods.

It is one thing to recognize the need for commonsense knowledge, another to integrate it seamlessly into knowledge-based systems. Lenat observes that ontologies often are missing a middle level, the purpose of which is to connect very general ontological concepts such as human and activity with domain-specific concepts such as the person who is responsible for navigating a B-52 bomber. Because HPKB is grounded in domain-specific tasks, the focus of much ontological engineering is this middle layer.

The Participants

The HPKB participants are organized into three groups: (1) technology developers, (2) integration teams, and (3) challenge problem developers. Roughly speaking, the integration teams build systems with the new technologies to solve challenge problems. The integration teams are led by SAIC and Teknowledge. Each integration team fields systems to solve challenge problems in an annual evaluation. University participants include Stanford University, Massachusetts Institute of Technology (MIT), Carnegie Mellon University (CMU), Northwestern University, University of Massachusetts (UMass), George Mason University (GMU), and the University of Edinburgh (AIAI). In addition, SRI International, the University of Southern California Information Sciences Institute (USC-ISI), the Kestrel Institute, and TextWise, Inc., have developed important components. Information Extraction and Transport (IET), Inc., with Pacific Sierra Research (PSR), Inc., developed and evaluated the crisis-management challenge problem, and Alphatech, Inc., is responsible for the battlespace challenge problem.


Challenge Problems

A programmatic innovation of HPKB is challenge problems. The crisis-management challenge problem, developed by IET and PSR, is designed to exercise broad, relatively shallow knowledge about international tensions. The battlespace challenge problem, developed by Alphatech, Inc., has two parts, each designed to exercise relatively specific knowledge about activities in armed conflicts. Movement analysis involves interpreting vehicle movements detected and tracked by idealized sensors. The workaround problem is concerned with finding military engineering solutions to traffic-obstruction problems, such as destroyed bridges and blocked tunnels.

Good challenge problems must satisfy several, often conflicting, criteria. A challenge problem must be challenging: It must raise the bar for both technology and science. A problem that requires only technical ingenuity will not hold the attention of the technology developers, nor will it help the United States maintain its preeminence in science. Equally important, a challenge problem for a DARPA program must have clear significance to the United States Department of Defense (DoD). Challenge problems should serve for the duration of the program, becoming more challenging each year. This continuity is preferable to designing new problems every year because the infrastructure to support challenge problems is expensive.

A challenge problem should require little or no access to military subject-matter experts. It should not introduce a knowledge-acquisition bottleneck that results in delays and low productivity from the technology developers. As much as possible, the problem should be solvable with accessible, open-source material. A challenge problem should exercise all (or most) of the contributions of the technology developers, and it should exercise an integration of these technologies. A challenge problem should have unambiguous criteria for evaluating its solutions. These criteria need not be so objective that one can write algorithms to score performance (for example, human judgment might be needed to assess scores), but they must be clear, and they must be published early in the program. In addition, although performance is important, challenge problems that value performance above all else encourage "one-off" solutions (a solution developed for a specific problem, once only) and discourage researchers from trying to understand why their technologies work well and poorly. A challenge problem should provide a steady stream of results, so progress can be assessed not only by technology developers but also by DARPA management and involved members of the DoD community.

The HPKB challenge problems are designed to support new and ongoing DARPA initiatives in intelligence analysis and battlespace information systems. Crisis-management systems will assist strategic analysts by evaluating the political, economic, and military courses of action available to nations engaged at various levels of conflict. Battlespace systems will support operations officers and intelligence analysts by inferring militarily significant targets and sites, reasoning about road-network trafficability, and anticipating responses to military strikes.

Crisis-Management Challenge Problem

The crisis-management challenge problem is intended to drive the development of broad, relatively shallow commonsense knowledge bases to facilitate intelligence analysis. The client program at DARPA for this problem is Project GENOA—Collaborative Crisis Understanding and Management. GENOA is intended to help analysts more rapidly understand emerging international crises to preserve U.S. policy options. Proactive crisis management—before a situation has evolved into a crisis that might engage the U.S. military—enables more effective responses than reactive management.

The challenge-problem development team worked with GENOA representatives to identify areas for the application of HPKB technology. This work took three or four months, but the crisis-management challenge-problem specification has remained fairly stable since its initial release in draft form in July 1997.

The first step in creating the challenge problem was to develop a scenario to provide context for intelligence analysis in time of crisis. To ensure that the problem would require development of real knowledge about the world, the scenario includes real national actors with a fictional yet plausible story line. The scenario, which takes place in the Persian Gulf, involves hostilities between Saudi Arabia and Iran that culminate in closing the Strait of Hormuz to international shipping.

Next, IET worked with experts at PSR to develop a description of the intelligence-analysis process, which involves the following tasks: information gathering (what happened), situation assessment (what does it mean), and scenario development (what might happen next).


Situation assessment (or interpretation) includes factors that pertain to the specific situation at hand, such as motives, intents, risks, rewards, and ramifications, and factors that make up a general context, or "strategic culture," for a state actor's behavior in international relations, such as capabilities, interests, policies, ideologies, alliances, and enmities. Scenario development, or speculative prediction, starts with the generation of plausible actions for each actor. Then, options are evaluated with respect to the same factors as in situation assessment, and a likelihood rating is produced. The most plausible actions are reported back to policy makers.

These analytic tasks afford many opportunities for knowledge-based systems. One is to use knowledge bases to retain or multiply corporate expertise; another is to use knowledge and reasoning to "think outside the box," to generate analytic possibilities that a human analyst might overlook. The latter task requires extensive commonsense knowledge, or "analyst's sense," about the domain to rule out implausible options.

The crisis-management challenge problem includes an informal specification for a prototype crisis-management assistant to support analysts. The assistant is tested by asking questions. Some are simple requests for factual information; others require the assistant to interpret the actions of nations in the context of strategic culture. Actions are motivated by interests, balancing risks and rewards. They have impacts and require capabilities. Interests drive the formation of alliances, the exercise of influence, and the generation of tensions among actors. These factors play out in a current and a historical context. Crises can be represented as events or as larger episodes tracking the evolution of a conflict over time, from inception or trigger, through any escalation, to eventual resolution or stasis. The representations being developed in HPKB are intended to serve as a crisis corporate memory to help analysts discover historical precedents and analogies for actions. Much of the challenge-problem specification is devoted to sample questions that are intended to drive the development of general models for reasoning about crisis events.

Sample questions are embedded in an analytic context. The question "What might happen next?" is instantiated as "What might happen following the Saudi air strikes?" as shown in figure 1. Q51 is refined to Q83 in a way that is characteristic of the analytic process; that is, higher-level questions are refined into sets of lower-level questions that provide detail.

Figure 1. Sample Questions Pertaining to the Responses to an Event.

    III. What of significance might happen following the Saudi air strikes?
      B. Options evaluation: Evaluate the options available to Iran.
         Close the Strait of Hormuz to shipping.
         Evaluation: Probable
         Motivation: Respond to Saudi air strikes and deter future strikes.
         Capability: (Q51) Can Iran close the Strait of Hormuz to international
           shipping? (Q83) Is Iran capable of firing upon tankers in the Strait
           of Hormuz? With what weapons?
         Negative outcomes: (Q53) What risks would Iran face in closing the strait?

The challenge-problem developers (IET with PSR) developed an answer key for the sample questions, a fragment of which is shown in figure 2. Although simple factual questions (for example, "What is the gross national product of the United States?") have just one answer, questions such as Q53 usually have several. The answer key actually lists five answers, two of which are shown in figure 2. Each is accompanied by suitable explanations, including source material. The first source (the Convention on the Law of the Sea) is electronic. IET maintains a web site with links to pages that are expected to be useful in answering the questions. The second source is a fragment of a model developed by IET and published in the challenge-problem specification. IET developed these fragments to indicate the kinds of reasoning they would be testing in the challenge problem.

For the challenge-problem evaluation, held in June 1998, IET developed a way to generate test questions through parameterization. Test questions deviate from sample questions in specified, controlled ways, so the teams participating in the challenge problem know the space of questions from which test items will be selected. This space includes billions of questions, so the challenge problem cannot be solved by relying on question-specific knowledge. The teams must rely on general knowledge to perform well in the evaluation. (Semantics provide practical constraints on the number of reasonable instantiations of parameterized questions, as do online sources provided by IET.) To illustrate, Q53 is parameterized in figure 3. Parameterized question 53, PQ53, actually covers 8 of the roughly 100 sample questions in the specification.

Parameterized questions and associated class definitions are based on natural language, giving the integration teams responsibility for developing (potentially different) formal representations of the questions. This decision was made at the request of the teams. An instance of a parameterized question, say, PQ53, is mechanically generated; then the teams must create a formal representation and reason with it—without human intervention.
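As a rough illustration of the mechanics, here is a minimal Python sketch of how a parameterized question such as PQ53 (figure 3) can be instantiated by enumerating parameter values. The template string and parameter values are illustrative stand-ins, not IET's actual generator or class definitions.

import itertools

# Simplified rendering of a parameterized question; the real PQ53 grammar
# (figure 3) allows nested parameter types.
PQ53 = "What {outcome} would {agent} face in {action}?"

PARAMETERS = {
    "outcome": ["risks", "rewards"],
    "agent": ["Iran"],
    "action": ["closing the Strait of Hormuz to international shipping",
               "taking hostage citizens of Saudi Arabia"],
}

def instantiate(template, parameters):
    """Enumerate every combination of parameter values for a template."""
    names = list(parameters)
    questions = []
    for values in itertools.product(*(parameters[n] for n in names)):
        questions.append(template.format(**dict(zip(names, values))))
    return questions

for question in instantiate(PQ53, PARAMETERS):
    print(question)

Each generated instance would then be handed to an integration team's system, which must formalize and answer it without human intervention.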

Figure 2. Part of the Answer Key for Question 53.

    Answer(s):
    1. Economic sanctions from {Saudi Arabia, GCC, U.S., U.N.}
       • The closure of the Strait of Hormuz would violate an international
         norm promoting freedom of the seas and would jeopardize the
         interests of many states.
       • In response, states might act unilaterally or jointly to impose
         economic sanctions on Iran to compel it to reopen the strait.
       • The United Nations Security Council might authorize economic
         sanctions against Iran.
    2. Limited military response from {Saudi Arabia, GCC, U.S., others}
       …
    Source(s):
    • The Convention on the Law of the Sea.
    • (B5) States may act unilaterally or collectively to isolate and/or
      punish a group or state that violates international norms. Unilateral
      and collective action can involve a wide range of mechanisms, such as
      intelligence collection, military retaliation, economic sanction, and
      diplomatic censure/isolation.

Figure 3. A Parameterized Question Suitable for Generating Sample Questions and Test Questions.

    PQ53 [During/After <TimeInterval>,] what {risks, rewards} would
    <InternationalAgent> face in <InternationalActionType>?

    <InternationalActionType> =
      {[exposure of its] {supporting, sponsoring} <InternationalAgentType>
         in <InternationalAgent2>,
       successful terrorist attacks against <InternationalAgent2>'s
         <EconomicSector>,
       <InternationalActionType>(PQ51),
       taking hostage citizens of <InternationalAgent2>,
       attacking targets <SpatialRelationship> <InternationalAgent2>
         with <Force>}

    <InternationalAgentType> =
      {terrorist group, dissident group, political party,
       humanitarian organization}

Battlespace Challenge Problems

The second challenge-problem domain for HPKB is battlespace reasoning. Battlespace is an abstract notion that includes not only the physical geography of a conflict but also the plans, goals, and activities of all combatants prior to, and during, a battle and during the activities leading to the battle. Three battlespace programs within DARPA were identified as potential users of HPKB technologies: (1) the dynamic multiinformation fusion program, (2) the dynamic database program, and (3) the joint forces air-component commander (JFACC) program. Two battlespace challenge problems have been developed.

The Movement-Analysis Challenge Problem

The movement-analysis challenge problem concerns high-level analysis of idealized sensor data, particularly from the airborne JSTARS moving target indicator radar. This Doppler radar can generate vast quantities of information—one reading every minute for each vehicle in motion within a 10,000-square-mile area.² The movement-analysis scenario involves an enemy mobilizing a full division of ground forces—roughly 200 military units and 2,000 vehicles—to defend against a possible attack. A simulation of the vehicle movements of this division was developed, the output of which includes reports of the positions of all the vehicles in the division at 1-minute intervals over a 4-day period, for 18 hours each day. These military vehicle movements were then interspersed with plausible civilian traffic to add the problem of distinguishing military from nonmilitary traffic. The movement-analysis task is to monitor the movements of the enemy to detect and identify types of military site and convoy.

Because HPKB is not concerned with signal processing, the inputs are not real JSTARS data but are instead generated by a simulator and preprocessed into vehicle tracks. There is no uncertainty in vehicle location and no radar shadowing, and each vehicle is always accurately identified by a unique bumper number. However, vehicle tracks do not precisely identify vehicle type but instead define each vehicle as either light wheeled, heavy wheeled, or tracked. Low-speed and stationary vehicles are not reported.

Vehicle-track data are supplemented by small quantities of high-value intelligence data, including accurate identification of a few key enemy sites, electronic intelligence reports of locations and times at which an enemy radar is turned on, communications intelligence reports that summarize information obtained by monitoring enemy communications, and human intelligence reports that provide detailed information about the numbers and types of vehicle passing a given location. Other inputs include a detailed road network in electronic form and an order of battle that describes the structure and composition of the enemy forces in the scenario region.

Given these inputs, movement analysis comprises the following tasks. First is to distinguish military from nonmilitary traffic. Almost all military traffic travels in convoys, which makes this task fairly straightforward except for very small convoys of two or three vehicles. Second is to identify the sites between which military convoys travel, determine which of these sites are militarily significant, and determine the type of each militarily significant site. Site types include battle positions, command posts, support areas, air-defense sites, artillery sites, and assembly-staging areas. Third is to identify which units (or parts of units) in the enemy order of battle are participating in each military convoy. Fourth is to determine the purpose of each convoy movement. Purposes include reconnaissance, movement of an entire unit toward a battle position, activities by command elements, and support activities. Fifth is to infer the exact types of the vehicles that make up each convoy. About 20 types of military vehicle are distinguished in the enemy order of battle, all of which show up in the scenario data.

To help the technology base and the integration teams develop their systems, a portion of the simulation data was released in advance of the evaluation phase, accompanied by an answer key that supplied model answers for each of the inference tasks listed previously.

Movement analysis is currently carried out manually by human intelligence analysts, who appear to rely on models of enemy behavior at several levels of abstraction. These include models of how different sites or convoys are structured for different purposes and models of military systems such as logistics (supply and resupply). For example, in a logistics model, one might find the following fragment: "Each echelon in a military organization is responsible for resupplying its subordinate echelons. Each echelon, from battalion on up, has a designated area for storing supplies. Supplies are provided by higher echelons and transshipped to lower echelons at these areas." Model fragments such as these are thought to constitute the knowledge of intelligence analysts and, thus, should be the content of HPKB movement-analysis systems. Some such knowledge was elicited from military intelligence analysts during programwide meetings. These same analysts also scripted the simulation scenario.
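For concreteness, the idealized track reports described above might be modeled as follows. This is a sketch with assumed field names, not the challenge problem's actual data format.

from dataclasses import dataclass
from enum import Enum

class VehicleClass(Enum):
    # Tracks do not give the exact vehicle type, only one of three classes.
    LIGHT_WHEELED = "light wheeled"
    HEAVY_WHEELED = "heavy wheeled"
    TRACKED = "tracked"

@dataclass
class TrackReport:
    bumper_number: str           # unique and always accurate in the simulation
    vehicle_class: VehicleClass
    x_km: float                  # position; no location uncertainty, no shadowing
    y_km: float
    minute: int                  # one report per moving vehicle per minute

# Note: low-speed and stationary vehicles generate no reports at all.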


The Workaround Challenge Problem

The workaround challenge problem supports air-campaign planning by the JFACC and his or her staff. One task for the JFACC is to determine suitable targets for air strikes. Good targets allow one to achieve maximum military effect with minimum risk to friendly forces and minimum loss of life on all sides. Infrastructure often provides such targets: It can be sufficient to destroy supplies at a few key sites or critical nodes in a transportation network, such as bridges along supply routes. However, bridges and other targets can be repaired, and there is little point in destroying a bridge if an available fording site is nearby. If a plan requires an interruption in traffic of several days, and the bridge can be repaired in a few hours, then another target might be more suitable. Target selection, then, requires some reasoning about how an enemy might "work around" the damage to the target.

The task of the workaround challenge problem is to automatically assess how rapidly and by what method an enemy can reconstitute or bypass damage to a target and, thereby, help air-campaign planners rapidly choose effective targets. The focus of the workaround problem in the first year of HPKB is automatic workaround generation.

The workaround task involves detailed representation of targets and the local terrain around the target and detailed reasoning about actions the enemy can take to reconstitute or bypass this damage. Thus, the inputs to workaround systems include the following elements:

First is a description of a target (for example, a bridge or a tunnel), the damage to it (for example, one span of a bridge is dropped; the bridge and vicinity are mined), and key features of the local terrain (for example, the slope and soil types of a terrain cross section coincident with the road near the bridge, together with the maximum depth and the speed of any river or stream the bridge crosses).

Second is a specific enemy unit or capability to be interdicted, such as a particular armored battalion or supply trucks carrying ammunition.

Third is a time period over which this unit or capability is to be denied access to the targeted route. The presumption is that the enemy will try to repair the damage within this time period; a target is considered to be effective if there appears to be no way for the enemy to make this repair.

Fourth is a detailed description of the enemy resources in the area that could be used to repair the damage. For the most part, repairs to battle damage are carried out by Army engineers, so this description takes the form of a detailed engineering order of battle.

All inputs are provided in a formal representation language.

The workaround generator is expected to provide three outputs. First is a reconstitution schedule, giving the capacity of the damaged link as a function of time since the damage was inflicted. For example, the workaround generator might conclude that the capacity of the link is zero for the first 48 hours, but thereafter, a temporary bridge will be in place that can sustain a capacity of 170 vehicles an hour. Second is a time line of engineering actions that the enemy might carry out to implement the repair, the time these actions require, and the temporal constraints among them. If there appears to be more than one viable repair strategy, a time line should be provided for each. Third is a set of required assets for each time line of actions: a description of the engineering resources that are used to repair the damage and pointers to the actions in the time line that utilize these assets. The reconstitution schedule provides the minimal information required to evaluate the suitability of a given target. The time line of actions provides an explanation to justify the reconstitution schedule. The set of required assets is easily derived from the time line of actions and can be used to suggest further targets for preemptive air strikes against the enemy to frustrate its repair efforts.
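A sketch of how these three outputs might be represented as data structures follows; the field names are invented for illustration (the real systems exchanged a formal representation language, not Python objects).

from dataclasses import dataclass, field

@dataclass
class CapacityStep:
    # One step of the reconstitution schedule, e.g., 0 vehicles/hour for the
    # first 48 hours, then 170 vehicles/hour once a temporary bridge is up.
    hours_after_damage: float
    vehicles_per_hour: float

@dataclass
class Action:
    name: str                    # e.g., "install temporary bridge"
    duration_hours: float
    predecessors: list = field(default_factory=list)  # indices of actions that must finish first

@dataclass
class WorkaroundResult:
    reconstitution_schedule: list   # list of CapacityStep
    time_lines: list                # one list of Actions per viable repair strategy
    required_assets: dict           # asset name -> indices of actions using it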

A training data set was provided to help developers build their systems. It supplied inputs and outputs for several sample problems, together with detailed descriptions of the calculations carried out to compute action durations; lists of simplifying assumptions made to facilitate these calculations; and pointers to text sources for information on engineering resources and their use (mainly Army field manuals available on the World Wide Web).

Workaround generation requires detailed knowledge about the capabilities of the enemy's engineering equipment and how it is typically used by enemy forces. For example, repairing damage to a bridge typically involves mobile bridging equipment, such as armored vehicle-launched bridges (AVLBs), medium girder bridges, Bailey bridges, or float bridges such as ribbon bridges or M4T6 bridges, together with a range of earthmoving equipment such as bulldozers. Each kind of mobile bridge takes a characteristic amount of time to deploy, requires different kinds of bank preparation, and is "owned" by different echelons in the military hierarchy, all of which affect the time it takes to bring the bridge to a damage site and effect a repair. Because HPKB operates in an entirely unclassified environment, U.S. engineering resources and doctrine were used throughout. Information from Army field manuals was supplemented by a series of programwide meetings with an Army combat engineer, who also helped construct sample problems and solutions.

Integrated Systems

The challenge problems are solved by integrated systems fielded by integration teams led by Teknowledge and SAIC. Teknowledge favors a centralized architecture that contains a large commonsense ontology (CYC); SAIC has a distributed architecture that relies on sharing specialized domain ontologies and knowledge bases, including a large upper-level ontology based on the merging of CYC, SENSUS, and other knowledge bases.

Teknowledge Integration

The Teknowledge integration team comprises Teknowledge, Cycorp, and Kestrel Institute. Its focus is on semantic integration and the creation of massive amounts of knowledge.

Semantic Integration: Three issues make software integration difficult. Transport issues concern mechanisms to get data from one process or machine to another; solutions include sockets, remote-method invocation (RMI), and CORBA. Syntactic issues concern how to convert number formats, "syntactic sugar," and the labels of data. The more challenging issues concern semantic integration: To integrate elements properly, one must understand the meaning of each. The database community has addressed this issue (Wiederhold 1996); it is even more pressing in knowledge-based systems.

The current state of practice in software integration consists largely of interfacing pairs of systems as needed. Pairwise integration of this kind does not scale up, unanticipated uses are hard to cover later, and chains of integrated systems at best evolve into stovepipe systems. Each integration is only as general as it needs to be to solve the problem at hand.

Some success has been achieved in low-level integration and reuse; for example, systems that share scientific subroutine libraries or graphics packages are often forced into similar representational choices for low-level data. DARPA has invested in early efforts to create reuse libraries for integrating large systems at higher levels. Some effort has gone into expressing a generic semantics of plans in an object-oriented format (Pease and Carrico 1997a, 1997b), and applications of this generic semantics to domain-specific tasks are promising (Pease and Albericci 1998). The development of ontologies for integrating manufacturing planning applications (Tate 1998) and work flow (Lee et al. 1996) is ongoing.

Another option for semantic integration is software mediation (Park, Gennari, and Musen 1997). Software mediation can be seen as a variant on pairwise integration, but because the integration is done by knowledge-based means, one has an explicit expression of the semantics of the conversion. Researchers at Kestrel Institute have successfully defined formal specifications for data and used these theories to integrate formally specified software. In addition, researchers at Cycorp have successfully applied CYC to the integration of multiple databases.

The Teknowledge approach to integration is to share knowledge among applications and create new knowledge to support the challenge problems. Teknowledge is defining formal semantics for the input and output of each application and the information in the challenge problems.

Many concepts defy simple definitions. Although there has been much success in defining the semantics of mathematical concepts, it is harder to be precise about the semantics of the concepts people use every day. These concepts seem to acquire meaning through their associations with other concepts, their use in situations and communication, and their relations to instances. To give the concepts in our integrated system real meaning, we must provide a rich set of associations, which requires an extremely large knowledge base. CYC offers just such a knowledge base.

CYC (Lenat 1995; Lenat and Guha 1990) consists of an immense, multicontextual knowledge base; an efficient inference engine; and associated tools and interfaces for acquiring, browsing, editing, and combining knowledge. Its premise is that knowledge-based software will be less brittle if and only if it has access to a foundation of basic commonsense knowledge. This semantic substratum of terms, rules, and relations enables application programs to cope with unforeseen circumstances and situations.

The CYC knowledge base represents millions of hand-crafted axioms entered during the 13 years since CYC's inception. Through careful policing and generalizing, there are now slightly fewer than 1 million axioms in the knowledge base, interrelating roughly 50,000 atomic terms. Fewer than two percent of these axioms represent simple facts about proper nouns of the sort one might find in an almanac. Most embody general consensus information about the concepts. For example, one axiom says one cannot perform volitional actions while one sleeps, another says one cannot be in two places at once, and another says you must be at the same place as a tool to use it. The knowledge base spans human capabilities and limitations, including information on emotions, beliefs, expectations, dreads, and goals; common everyday objects, processes, and situations; and the physical universe, including such phenomena as time, space, causality, and motion.

The CYC inference engine comprises an epistemological and a heuristic level. The epistemological level is an expressive nth-order logical language with clean formal semantics. The heuristic level is a set of some three dozen special-purpose modules, each of which contains its own algorithms and data structures and can recognize and handle some commonly occurring sorts of inference. For example, one heuristic-level module handles temporal reasoning efficiently by converting temporal relations into a before-and-after graph and then doing graph searching rather than theorem proving to derive an answer. A truth maintenance system and an argumentation-based explanation and justification system are tightly integrated into the system and are efficient enough to be in operation at all times. In addition to these inference engines, CYC includes numerous browsers, editors, and consistency checkers. A rich interface has been defined.
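The flavor of such a special-purpose temporal module can be conveyed with a small sketch: assertions of the form (before a b) become edges in a graph, and queries reduce to reachability search. This is an illustration of the general idea, not Cycorp's implementation.

from collections import defaultdict, deque

class BeforeAfterGraph:
    def __init__(self):
        self.after = defaultdict(set)   # event -> events asserted to follow it

    def assert_before(self, a, b):
        """Record the assertion that event a precedes event b."""
        self.after[a].add(b)

    def is_before(self, a, b):
        """Does a precede b under the transitive closure of the assertions?"""
        seen, queue = set(), deque(self.after[a])
        while queue:
            event = queue.popleft()
            if event == b:
                return True
            if event not in seen:
                seen.add(event)
                queue.extend(self.after[event])
        return False

g = BeforeAfterGraph()
g.assert_before("Saudi air strikes", "strait closure")
g.assert_before("strait closure", "economic sanctions")
assert g.is_before("Saudi air strikes", "economic sanctions")

Graph reachability is linear in the number of assertions, whereas deriving the same ordering by general theorem proving could be far more expensive.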

Crisis-Management Integration: The crisis-management challenge problem involves answering test questions presented in a structured grammar. The first step in answering a test question is to convert it to a form that CYC can reason with, a declarative decision tree. When the tree is applied to the test-question input, a CYC query is generated and sent to CYC.

Answering the challenge-problem questions takes a great deal of knowledge. For the first year's challenge problem alone, the Cycorp and Teknowledge team added some 8,000 concepts and 80,000 assertions to CYC. To meet the needs of this challenge problem, the team created significant amounts of new knowledge, some developed by collaborators and merged into CYC, some added by automated processes.

The Teknowledge integrated system includes two natural language components: START and TextWise. The START system was created by Boris Katz (1997) and his group at MIT. For each of the crisis-management questions, Teknowledge has developed a template into which user-specified parameters can be inserted. START parses English queries for a few of the crisis-management questions to fill in these templates. Each filled template is a legal CYC query. TextWise Corporation has been developing natural language information-retrieval software primarily for news articles (Liddy, Paik, and McKenna 1995). Teknowledge intends to use the TextWise knowledge base information tools (KNOW-IT) to supply to CYC many instances of facts discovered from news stories. The system can parse English text and return a series of binary relations that express the content of the sentences. There are several dozen relation types, and the constants that instantiate each relation are WORDNET synset mappings (Miller et al. 1993). Each of the concepts has been mapped to a CYC expression, and a portion of WORDNET has been mapped to CYC. For those synsets not in CYC, the WORDNET hyponym links are traversed until a mapped CYC term is found.
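A minimal sketch of this fallback mapping follows, with invented synset names, links, and mappings (KNOW-IT's actual data structures are not given in the article).

from typing import Optional

SYNSET_TO_CYC = {"vehicle.n.01": "#$TransportationDevice"}   # illustrative
TAXONOMIC_LINK = {"tanker.n.02": "ship.n.01",                # illustrative
                  "ship.n.01": "vessel.n.02",
                  "vessel.n.02": "vehicle.n.01"}

def cyc_term_for(synset: str) -> Optional[str]:
    """Return a CYC term for a synset, walking taxonomic links until a
    mapped synset is found, as described above."""
    current: Optional[str] = synset
    while current is not None:
        if current in SYNSET_TO_CYC:
            return SYNSET_TO_CYC[current]
        current = TAXONOMIC_LINK.get(current)
    return None

assert cyc_term_for("tanker.n.02") == "#$TransportationDevice"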

Battlespace Integration: Teknowledge supported the movement-analysis and workaround problems.

Movement analysis: Several movement-analysis systems were to be integrated, and much preliminary integration work was done. Ultimately, the time pressure of the challenge-problem evaluation precluded a full integration. The MIT and UMass movement-analysis systems are described briefly here; the SMI and SRI systems are described in the section entitled SAIC Integrated System.

The MIT MAITA system provides tools for constructing and controlling networks of distributed-monitoring processes. These tools provide access to large knowledge bases of monitoring methods, organized around the hierarchies of tasks performed, knowledge used, contexts of application, the alerting of utility models, and other dimensions. Individual monitoring processes can also make use of knowledge bases representing commonsense or expert knowledge in conducting their reasoning or reporting their findings. MIT built monitoring processes for sites and convoys with these tools.

The UMass group tried to identify convoys and sites with very simple rules. Rules were developed for three site types: (1) battle positions, (2) command posts, and (3) assembly-staging areas. The convoy detector simply looked for vehicles traveling at fixed distances from one another. Initially, UMass was going to recognize convoys from their dynamics, in which the distances between vehicles fluctuate in a characteristic way, but in the simulated data, the distances between vehicles remained fixed. UMass also intended to detect sites by the dynamics of vehicle movements between them, but no significant dynamic patterns could be found in the movement data.
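A toy version of that fixed-spacing convoy test might look like the following; the tolerance and data layout are assumptions, because the article does not publish the UMass rules themselves.

def is_convoy(positions_over_time, spacing_tolerance_m=5.0):
    """positions_over_time[t][i] is vehicle i's distance along the road at
    time step t.  A group is convoy-like if inter-vehicle gaps stay
    essentially constant over time (as they did in the simulated data)."""
    gaps_at = [
        [snapshot[i + 1] - snapshot[i] for i in range(len(snapshot) - 1)]
        for snapshot in positions_over_time
    ]
    reference = gaps_at[0]
    return all(
        abs(gap - ref) <= spacing_tolerance_m
        for gaps in gaps_at[1:]
        for gap, ref in zip(gaps, reference)
    )

# Three vehicles 50 m apart, moving together for three time steps:
track = [[0.0, 50.0, 100.0], [400.0, 450.0, 500.0], [800.0, 850.0, 900.0]]
assert is_convoy(track)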

Workarounds: Teknowledge developed two workaround integrations, one an internal Teknowledge system, the other from AIAI at the University of Edinburgh.

Teknowledge developed a planning tool based on CYC, essentially a wrapper around CYC's existing knowledge base and inference engine. A plan is a proof that there is a path from the final goal to the initial situation through a partially ordered set of actions. The rules in the knowledge base driving the planner are rules about action preconditions and about which actions can bring about a certain state of affairs. There is no explicit temporal reasoning; the partial order of temporal precedence between actions is established on the basis of the rules about preconditions and effects.

The planner is a new kind of inference engine, performing its own search but in a much smaller search space. However, each step in the search involves interaction with the existing inference engine by hypothesizing actions and microtheories and doing asks and asserts in these microtheories. This hypothesizing and asserting on the fly in effect amounts to dynamically updating the knowledge base in the course of inference; this capability is new for the CYC inference engine.

Consistent with the goals of HPKB, the Teknowledge workaround planner reused CYC's knowledge, although it was not knowledge specific to workarounds. In fact, CYC had never been the basis of a planner before, so even stating things in terms of an action's preconditions was new. What CYC provided, however, was a rich basis on which to build workaround knowledge. For example, the Teknowledge team needed to write only one rule to state "to use something as a device, you must have control over that device," and this rule covered the cases of using an M88 to clear rubble, a mine plow to breach a minefield, a bulldozer to cut into a bank or narrow the gap, and so on. One rule can cover so many cases because clearing rubble, demining an area, narrowing a gap, and cutting into a bank are all specializations of IntrinsicStateChangeEvent, an extant part of the CYC ontology.
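The plan-as-proof idea can be illustrated with a bare-bones backward chainer that reduces a goal through operator preconditions to facts in the initial situation. The operators and facts here are invented for illustration; the real planner searched within CYC's knowledge base and microtheories, not a Python dictionary.

OPERATORS = {
    "install-temporary-bridge": {
        "pre": ["banks-prepared", "bridge-on-site"],
        "adds": ["gap-crossable"],
    },
    "prepare-banks": {"pre": ["bulldozer-on-site"], "adds": ["banks-prepared"]},
}

def plan(goal, initial, depth=10):
    """Return a sequence of actions achieving goal, or None.  The returned
    order is one linearization of the partial order implied by the
    precondition/effect rules."""
    if goal in initial:
        return []
    if depth == 0:
        return None
    for name, op in OPERATORS.items():
        if goal in op["adds"]:
            steps = []
            for precondition in op["pre"]:
                sub = plan(precondition, initial, depth - 1)
                if sub is None:
                    break
                steps += sub
            else:
                return steps + [name]
    return None

print(plan("gap-crossable", {"bulldozer-on-site", "bridge-on-site"}))
# ['prepare-banks', 'install-temporary-bridge']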

The AIAI workaround planner was also implemented in CYC and took data from Teknowledge's FIRE&ISE-TO-MELD translator as its input. The central idea was to use the scriptlike structure of workaround plans to guide the reasoning process. For this reason, a hierarchical task network (HTN) approach was taken. A planning-specific ontology was defined within the larger CYC ontology, and planning rules only referenced concepts within this more constrained context. The planning application was essentially embedded in CYC.

CYC had to be extended to represent composite actions that have several alternative decompositions and complex preconditions and effects. Although it is not a commonsense approach, AIAI decided to explore HTN planning because it appeared suitable for the workaround domain. It was possible to represent actions, their conditions and effects, the plan-node network, and plan resources in a relational style. The structure of a plan was implicitly represented in the proof that the corresponding composite action was a relation between particular sets of conditions and effects. Once proved, action relations are retained by CYC and are potentially reusable. An advantage of implementing the AIAI planner in CYC was the ability to remove brittleness from the planner-input knowledge format; for example, it was not necessary to account for all the possible permutations of argument order in predicates such as bordersOn and between.

SAIC Integrated System

SAIC built an HPKB integrated knowledge environment (HIKE) to support both the crisis-management and battlespace challenge problems. The architecture of HIKE for crisis management is shown in figure 4. For battlespace, the architecture is similar in that it is distributed and relies on the open knowledge base connectivity (OKBC) protocol, but of course, the components integrated by the battlespace architecture are different. HIKE's goals are to address the distributed communications and interoperability requirements among the HPKB technology components—knowledge servers, knowledge-acquisition tools, question-answering systems, problem solvers, process monitors, and so on—and to provide a graphic user interface (GUI) tailored to the end users of the HPKB environment.

HIKE provides a distributed computing infrastructure that addresses two types of communications needs. First are input and output data-transportation and software connectivities. These include connections between the HIKE server and technology components, connections between components, and connections between servers. HIKE encapsulates information content and data transportation through JAVA objects, hypertext transfer protocol (HTTP), remote-method invocation (JAVA RMI), and database access (JDBC). Second, HIKE provides for knowledge-content assertion and distribution and query requests to knowledge services.

The OKBC protocol proved essential. SRI used it to interface the theorem prover SNARK to an OKBC server storing the Central Intelligence Agency (CIA) World Fact Book knowledge base. Because this knowledge base is large, SRI did not want to incorporate it into SNARK but instead used the procedural-attachment feature of SNARK to look up facts that were available only in the World Fact Book. MIT's START system used OKBC to connect to SRI's OCELOT-SNARK OKBC server. This connection will eventually give users the ability to pose questions in English, which are then transformed to a formal representation by START and shipped to SNARK using OKBC; the result is returned using OKBC. ISI built an OKBC server for their LOOM system for GMU. SAIC built a front end to the OKBC server for LOOM that was extensively used by the members of the battlespace challenge-problem team.
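The general shape of procedural attachment can be reduced to a sketch: when the reasoner needs a ground fact it does not hold locally, a designated predicate is answered by calling out to an external knowledge server instead of by inference. This is not SNARK's actual interface; the server call and fact names below are hypothetical stand-ins.

LOCAL_FACTS = {("borders", "Iran", "Iraq"): True}   # facts held by the reasoner

def fetch_from_okbc_server(predicate, *args):
    """Hypothetical stand-in for a remote lookup, e.g., against an OKBC
    server holding the World Fact Book knowledge base."""
    world_fact_book = {("gdp-rank-above", "United States", "Iran"): True}
    return world_fact_book.get((predicate, *args), False)

ATTACHED = {"gdp-rank-above"}   # predicates answered procedurally

def holds(predicate, *args):
    if predicate in ATTACHED:
        return fetch_from_okbc_server(predicate, *args)
    return LOCAL_FACTS.get((predicate, *args), False)

assert holds("borders", "Iran", "Iraq")
assert holds("gdp-rank-above", "United States", "Iran")

The design payoff is that a large external knowledge base never has to be loaded into the theorem prover; it is consulted only when a relevant subgoal arises.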

With OKBC and other methods, the HIKE infrastructure permits the integration of new technology components (either clients or servers) into the integrated end-to-end HPKB system without introducing major changes, provided that the new components adhere to the specified protocols.

Crisis-Management Integration: The SAIC crisis-management architecture is focused around a central OKBC bus, as shown in figure 4. The technology components provide user interfaces, question answering, and knowledge services. Some components have overlapping roles. For example, MIT's START system serves both as a user interface and a question-answering component. Similarly, CMU's WEBKB supports both question answering and knowledge services.

[Figure 4. SAIC Crisis-Management Challenge Problem Architecture. The diagram shows HIKE clients and servers (Analyst, Knowledge Engineer, GKB-Editor), question-answering and knowledge-service components (START, SNARK, ATP, SPOOK, KNOW-IT, webKB, Ontolingua, SKC), and input sources (training data, the web, electronic archival sources, and real-time news feeds), all connected through the OKBC (Open Knowledge Base Connectivity) bus.]

Page 12: Hpkb   year 1 results

to reuse knowledge whenever it made sense.The SAIC team reused three knowledge bases:(1) the HPKB upper-level ontology developedby Cycorp, (2) the World Fact Book knowledgebase from the CIA, and the Units and MeasuresOntology from Stanford. Reusing the upper-level ontology required translation, compre-hension, and reformulation. The ontology wasreleased in MELD (a language used by Cycorp)and was not directly readable by the SAIC sys-tem. In conjunction with Stanford, SRI devel-oped a translator to load the upper-level ontol-ogy into any OKBC-compliant server. Onceloaded into the OCELOT server, the GKB editorwas used to comprehend the upper ontology.The graphic nature of the GKB editor illuminat-ed the interrelationships between classes andpredicates of the upper-level ontology. Becausethe upper-level ontology represents functionalrelationships as predicates but SNARK reasonsefficiently with functions, it was necessary toreformulate the ontology to use functionswhenever a functional relationship existed.Battlespace Integration The distributedHIKE infrastructure is well suited to support anintegrated battlespace challenge problem as itwas originally designed: a single informationsystem for movement analysis, trafficability,and workaround reasoning. However, the traffi-cability problem (establishing routes for variouskinds of vehicle given the characteristics) wasdropped, and the integration of the other prob-lems was delayed. The components that solvedthese problems are described briefly later.

Movement analysis: The movement-analy-sis problem is solved by MIT’s monitoring,analysis, and interpretation tools arsenal (MAI-TA); Stanford University’s Section on MedicalInformatics’ (SMI) problem-solving methods;and SRI International’s Bayesian networks. TheMIT effort was described briefly in the sectionentitled Teknowledge Integration. Here, wefocus on the SMI and SRI movement-analysissystems.

For scalability, SMI adopted a three-layeredapproach to the challenge problem: The firstlayer consisted primarily of simple, context-free data processing that attempted to findimportant preliminary abstractions in the dataset. The most important of these were trafficcenters (locations that were either the startingor stopping points for a significant number ofvehicles) and convoy segments (a number ofvehicles that depart from the same traffic cen-ter at roughly the same time, going in roughlythe same direction). Spotting these abstractionsrequired setting a number of parameters (forexample, how big a traffic center is). Oncetrained, these first-layer algorithms are linear in

WEBKB supports both question answering andknowledge services.

HIKE provides a form-based GUI with whichusers can construct queries with pull-downmenus. Query-construction templates corre-spond to the templates defined in the crisis-management challenge problem specification.Questions also can be entered in natural lan-guage. START and the TextWise componentaccept natural language queries and thenattempt to answer the questions. To answerquestions that involve more complex types ofreasoning, START generates a formal representa-tion of the query and passes it to one of thetheorem provers.

The Stanford University Knowledge SystemsLaboratory ONTOLINGUA, SRI International’sgraphic knowledge base (GKB) editor, WEBKB,and TextWise provide the knowledge servicecomponents. The GKB editor is a graphic toolfor browsing and editing large knowledge bases,used primarily for manual knowledge acquisi-tion. WEBKB supports semiautomatic knowledgeacquisition. Given some training data and anontology as input, a web spider searches in adirected manner and populates instances ofclasses and relations defined in the ontology.Probabilistic rules are also extracted. TextWiseextracts information from text and newswirefeeds, converting them into knowledge inter-change format (KIF) triples, which are thenloaded into ONTOLINGUA. ONTOLINGUA is SAIC’scentral knowledge server and informationrepository for the crisis-management challengeproblem. ONTOLINGUA supports KIF as well ascompositional modeling language (CML). Flowmodels developed by Northwestern University(NWU) answer challenge problem questionsrelated to world oil-transportation networksand reside within ONTOLINGUA. Stanford’s systemfor probabilistic object-oriented knowledge(SPOOK) provides a language for class frames tobe annotated with probabilistic information,representing uncertainty about the propertiesof instances in this class. SPOOK is capable of rea-soning with the probabilistic information basedon Bayesian networks.

Question answering is implemented in several ways. SRI International's SNARK and Stanford's abstract theorem prover (ATP) are first-order theorem provers. WEBKB answers questions based on the information it gathers. Question answering is also accomplished by START and TextWise, which take a query in English as input and use information retrieval to extract answers from text-based sources (such as the web and newswire feeds).

The guiding philosophy during knowledge base development for crisis management was to reuse knowledge whenever it made sense. The SAIC team reused three knowledge bases: (1) the HPKB upper-level ontology developed by Cycorp, (2) the World Fact Book knowledge base from the CIA, and (3) the Units and Measures Ontology from Stanford. Reusing the upper-level ontology required translation, comprehension, and reformulation. The ontology was released in MELD (a language used by Cycorp) and was not directly readable by the SAIC system. In conjunction with Stanford, SRI developed a translator to load the upper-level ontology into any OKBC-compliant server. Once it was loaded into the OCELOT server, the GKB editor was used to comprehend the upper ontology. The graphic nature of the GKB editor illuminated the interrelationships between classes and predicates of the upper-level ontology. Because the upper-level ontology represents functional relationships as predicates but SNARK reasons efficiently with functions, it was necessary to reformulate the ontology to use functions wherever a functional relationship existed.

Battlespace Integration
The distributed HIKE infrastructure is well suited to support an integrated battlespace challenge problem as it was originally designed: a single information system for movement analysis, trafficability, and workaround reasoning. However, the trafficability problem (establishing routes for various kinds of vehicle given their characteristics) was dropped, and the integration of the other problems was delayed. The components that solved these problems are described briefly later.

Movement analysis: The movement-analysis problem is solved by MIT's monitoring, analysis, and interpretation tools arsenal (MAITA); Stanford University's Section on Medical Informatics' (SMI) problem-solving methods; and SRI International's Bayesian networks. The MIT effort was described briefly in the section entitled Teknowledge Integration. Here, we focus on the SMI and SRI movement-analysis systems.

For scalability, SMI adopted a three-layered approach to the challenge problem. The first layer consisted primarily of simple, context-free data processing that attempted to find important preliminary abstractions in the data set. The most important of these were traffic centers (locations that were either the starting or stopping points for a significant number of vehicles) and convoy segments (a number of vehicles that depart from the same traffic center at roughly the same time, going in roughly the same direction). Spotting these abstractions required setting a number of parameters (for example, how big a traffic center is). Once trained, these first-layer algorithms are linear in the size of the data set and enabled SMI to use knowledge-intensive techniques on the resulting (much smaller) set of data abstractions.
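The article does not spell out SMI's first-layer algorithms or parameter values, so the following Python sketch is only illustrative: the record fields and all thresholds are invented, but it shows the kind of linear-time, context-free pass described above.

```python
from collections import defaultdict

# Hypothetical record format: (vehicle_id, x_km, y_km, t_hours, heading_deg)
# for each observed stop or departure event. All thresholds are invented.
CELL_KM = 2.0        # spatial bin size ("how big a traffic center is")
MIN_VEHICLES = 10    # distinct vehicles needed to call a cell a traffic center
TIME_BIN_H = 0.5     # departures in the same half hour count as simultaneous
HEADING_BIN = 45     # headings in the same 45-degree sector count as one direction

def traffic_centers(stop_events):
    """Cells where a significant number of vehicles start or stop."""
    cells = defaultdict(set)
    for vid, x, y, t, hdg in stop_events:
        cells[(round(x / CELL_KM), round(y / CELL_KM))].add(vid)
    return {cell for cell, vids in cells.items() if len(vids) >= MIN_VEHICLES}

def convoy_segments(departure_events, centers):
    """Vehicles leaving the same center at roughly the same time and heading."""
    groups = defaultdict(list)
    for vid, x, y, t, hdg in departure_events:
        cell = (round(x / CELL_KM), round(y / CELL_KM))
        if cell in centers:
            groups[(cell, int(t // TIME_BIN_H), int(hdg // HEADING_BIN))].append(vid)
    return [g for g in groups.values() if len(g) >= 4]  # ignore tiny convoys
```

Each pass touches every report once, consistent with the first layer being linear in the size of the data set; the minimum convoy size of four echoes SMI's choice, mentioned later, not to detect convoys of fewer than four vehicles.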

The second layer was a repair layer, which used knowledge of typical convoy behaviors and locations on the battlespace to construct a "map" of militarily significant traffic and traffic centers. The end result was a network of traffic centers connected by traffic. Three main tasks remained: (1) classify the traffic centers, (2) figure out what the convoys are doing, and (3) identify which units are involved. SMI iteratively answered these questions by using repeated layers of heuristic classification and constraint satisfaction. The heuristic-classification components operated independently of the network, using known (and deduced) facts about single convoys or traffic centers. Consider the following rule for trying to identify a main supply brigade (MSB) site (paraphrased into English, with abstractions in boldface):

If we have a current site which is unclassified,
and it's in the Division support area,
and the traffic is high enough,
and the traffic is balanced,
and the site is persistent with no major deployments emanating from it,
then it's probably an MSB.

SMI used similar rules for the constraint-satisfaction component of its system, allowing information to propagate through the network in a manner similar to Waltz's (1975) well-known constraint-satisfaction algorithm for edge labeling.
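Rendered as code, such a rule is a plain conjunction over the abstractions computed by the earlier layers. The sketch below is not SMI's implementation; the Site fields and the traffic threshold are hypothetical stand-ins for the boldface abstractions.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_TRAFFIC = 50  # vehicles per day; a hypothetical threshold

@dataclass
class Site:
    # Abstractions computed by the first two layers (fields are invented).
    classification: Optional[str]
    in_division_support_area: bool
    traffic_volume: int
    traffic_is_balanced: bool   # inbound traffic roughly equals outbound
    is_persistent: bool
    major_deployments: bool     # major deployments emanating from the site

def classify_site(site: Site) -> Optional[str]:
    """One heuristic-classification rule, mirroring the paraphrased MSB rule."""
    if (site.classification is None
            and site.in_division_support_area
            and site.traffic_volume >= HIGH_TRAFFIC
            and site.traffic_is_balanced
            and site.is_persistent
            and not site.major_deployments):
        return "MSB"            # "then it's probably an MSB"
    return None
```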

The goal of the SRI group was to induce a knowledge base to characterize and identify types of site, such as command posts and battle positions. Its approach was to induce a Bayesian classifier using a generative-model approach, producing a Bayesian network that could serve as a knowledge base. This modeling required transforming raw vehicle tracks into features (for example, the frequency of certain vehicles at sites or the number of stops) that could be used to predict sites. Thus, it was also necessary to have hypothetical sites to test. SRI relied on SMI to provide hypothetical sites, and it also used some of the features that SMI computed. As a classifier, SRI used tree-augmented naive (TAN) Bayes (Friedman, Geiger, and Goldszmidt 1997).
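A full TAN learner, which augments naive Bayes with a learned tree of dependencies among the features, is too long to sketch here. The outline below shows only the pipeline the text describes (raw tracks to features to classifier), with a plain naive Bayes model standing in for TAN and invented feature names.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # simplified stand-in for TAN Bayes

def site_features(tracks):
    """Turn the raw vehicle tracks near one hypothesized site into features of
    the kind mentioned in the text (vehicle frequencies, number of stops).
    The track fields used here are invented for illustration."""
    n_stops = sum(t["stops"] for t in tracks)
    n_heavy = sum(1 for t in tracks if t["vehicle_class"] == "heavy")
    return [len(tracks), n_stops, n_heavy / max(len(tracks), 1)]

def train_site_classifier(training_sites, labels):
    """training_sites: lists of tracks per hypothesized site (provided by SMI);
    labels: known site types, e.g., 'command post' or 'battle position'."""
    X = np.array([site_features(s) for s in training_sites])
    return GaussianNB().fit(X, np.array(labels))
```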

Workarounds: The SAIC team integrated two approaches to workaround generation, one developed by USC-ISI, the other by GMU.

ISI developed course-of-action–generation problem solvers to create alternative solutions to workaround problems. In fact, two alternative course-of-action–generation problem solvers were developed. One is a novel planner whose knowledge base is represented in the ontologies, including its operators, state descriptions, and problem-specific information. It uses a novel partial-match capability developed in LOOM (MacGregor 1991). The other is based on a state-of-the-art planner (Veloso et al. 1995). Each solution lists several engineering actions for the workaround (for example, deslope the banks of the river, install a temporary bridge), includes information about the resources used (for example, what kind of earthmoving equipment or bridge is used), and asserts temporal constraints among the individual actions to indicate which can be executed in parallel. A temporal estimation-assessment problem solver evaluates each of the alternatives and selects one as the most likely choice for an enemy workaround. This problem solver was developed in EXPECT (Swartout and Gil 1995; Gil 1994).
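The article does not give ISI's plan representation, but a minimal data structure capturing what it describes (engineering actions, the resources they use, and temporal constraints that expose parallelism) might look like the following sketch; all names and numbers are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str          # e.g., "deslope the river banks"
    resource: str      # e.g., "bulldozer", "temporary bridge"
    duration_h: float  # estimated execution time in hours

@dataclass
class WorkaroundPlan:
    actions: list
    before: list = field(default_factory=list)  # pairs (i, j): action i precedes j

    def makespan(self):
        """Overall time estimate: the longest path through the precedence
        graph. Actions not ordered by `before` are taken to run in parallel,
        which is what the temporal constraints are meant to expose."""
        memo = {}
        def finish(i):
            if i not in memo:
                preds = [a for (a, b) in self.before if b == i]
                memo[i] = self.actions[i].duration_h + max(
                    (finish(p) for p in preds), default=0.0)
            return memo[i]
        return max(finish(i) for i in range(len(self.actions)))

# Invented example: deslope the banks, then install the bridge.
plan = WorkaroundPlan(
    actions=[Action("deslope the river banks", "bulldozer", 4.0),
             Action("install a temporary bridge", "AVLB", 2.0)],
    before=[(0, 1)])
assert plan.makespan() == 6.0
```

A temporal estimation-assessment solver like the one described above would compare such makespans (among other factors) across the alternative plans.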

Several general battlespace ontologies (for example, military units, vehicles), anchored on the HPKB upper ontology, were used and augmented with ontologies needed to reason about workarounds (for example, engineering equipment). Besides these ontologies, the knowledge bases used included a number of problem-solving methods to represent knowledge about how to solve the task. Both ontologies and problem-solving knowledge were used by two main problem solvers.

EXPECT's knowledge-acquisition tools were used throughout the evaluation to detect missing knowledge. EXPECT uses problem-solving knowledge and ontologies to analyze which information is needed to solve the task. This capability allows EXPECT to alert a user when there is missing knowledge about a problem (for example, unspecified bridge lengths) or a situation. It also helps debug and refine ontologies by detecting missing axioms and overgeneral definitions.

GMU developed the DISCIPLE98 system. DISCIPLE is an apprenticeship multistrategy learning system that learns from examples, from explanations, and by analogy and can be taught by an expert how to perform domain-specific tasks through examples and explanations in a way that resembles how experts teach apprentices (Tecuci 1998). For the workaround domain, DISCIPLE was extended into a baseline integrated system that creates an ontology by acquiring concepts from a domain expert or importing them (through OKBC) from shared ontologies. It learns task-decomposition rules from a domain expert and uses this knowledge to solve workaround problems through hierarchical nonlinear planning.



First, with DISCIPLE's ontology-building tools, a domain expert assisted by a knowledge engineer built the object ontology from several sources, including expert manuals, Alphatech's FIRE&ISE document, and ISI's LOOM ontology. Second, a task taxonomy was defined by refining the task taxonomy provided by Alphatech. This taxonomy indicates principled decompositions of generic workaround tasks into subtasks but does not indicate the conditions under which such decompositions should be performed. Third, the examples of hierarchical workaround plans provided by Alphatech were used to teach DISCIPLE. Each such plan provided DISCIPLE with specific examples of decompositions of tasks into subtasks, and the expert guided DISCIPLE to "understand" why each task decomposition was appropriate in a particular situation. From these examples and the explanations of why they are appropriate in the given situations, DISCIPLE learned general task-decomposition rules. After a knowledge base consisting of an object ontology and task-decomposition rules was built, the hierarchical nonlinear planner of DISCIPLE was used to automatically generate workaround plans for new workaround problems.
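DISCIPLE's learned rules are far richer than this, but a toy sketch can show the shape of a task-decomposition rule (a task, applicability conditions, subtasks) and how a hierarchical planner applies such rules top down. The tasks, conditions, and state fields below are all invented.

```python
# A task-decomposition rule pairs a generic task with the conditions under
# which a decomposition applies (the part DISCIPLE learns from explained
# examples) and the subtasks it yields.
RULES = [
    {"task": "restore-crossing",
     "if": lambda s: s["bridge_destroyed"] and s["river_width_m"] <= 20,
     "subtasks": ["clear-approaches", "install-temporary-bridge"]},
    {"task": "restore-crossing",
     "if": lambda s: s["bridge_destroyed"] and s["river_width_m"] > 20,
     "subtasks": ["deslope-banks", "ford-river"]},
]

def plan(task, state):
    """Expand tasks top down until no decomposition rule applies."""
    for rule in RULES:
        if rule["task"] == task and rule["if"](state):
            return [step for sub in rule["subtasks"] for step in plan(sub, state)]
    return [task]  # primitive action: no applicable decomposition

# plan("restore-crossing", {"bridge_destroyed": True, "river_width_m": 15})
# -> ['clear-approaches', 'install-temporary-bridge']
```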

Evaluation
The SAIC and Teknowledge integrated systems for crisis management, movement analysis, and workarounds were tested in an extensive study in June 1998. The study followed a two-phase, test-retest schedule. In the first phase, the systems were tested on problems similar to those used for system development, but in the second, the problems required significant modifications to the systems. Within each phase, the systems were tested and retested on the same problems. The test at the beginning of each phase established a baseline level of performance, and the test at the end measured improvement during the phase.

We claim that HPKB technology facilitates rapid modification of knowledge-based systems. This claim was tested in both phases of the experiment because each phase allows time to improve performance on test problems. Phase 2 provides a more stringent test: Only some of the phase 2 problems can be solved by the phase 1 systems, so the systems were expected to perform poorly in the test at the beginning of phase 2. The improvement in performance on these problems during phase 2 is a direct measure of how well HPKB technology facilitates knowledge capture, representation, merging, and modification.

Each challenge problem is evaluated by different metrics. The test items for crisis management were questions, and the test was similar to an exam. Overall competence is a function of the number of questions answered correctly, but the crisis-management systems are also expected to "show their work" and provide justifications (including sources) for their answers. Examples of questions, answers, and justifications for crisis management are shown in the section entitled Crisis-Management Challenge Problem.

Performance metrics for the movement-analysis problem are related to recall and precision. The basic problem is to identify sites, vehicles, and purposes given vehicle-track data; so, performance is a function of how many of these entities are correctly identified and how many incorrect identifications are made. In general, identifications can be marked down on three dimensions: First, the identified entity can be more or less like the actual entity; second, the location of the identified entity can be displaced from the actual entity's true location; and third, the identification can be more or less timely.

The workaround problem involves generating workarounds to military actions such as bombing a bridge. Here, the criteria for successful performance include coverage (the generation of all viable workarounds), appropriateness (the generation of workarounds appropriate to the action), specificity (the exact implementation of the workaround), and accuracy of timing inferences (the length of time each step in the workaround takes to implement).

Performance evaluation, although essential, tells us relatively little about the HPKB integrated systems, still less about the component technologies. We also want to know why the systems perform well or poorly. Answering this question requires credit assignment because the systems comprise many technologies. We also want to gather evidence pertinent to several important, general claims. One claim is that HPKB facilitates rapid construction of knowledge-based systems because ontologies and knowledge bases can be reused. The challenge problems by design involve broad, relatively shallow knowledge in the case of crisis management and deep, fairly specific knowledge in the battlespace problems. It is unclear which kind of problem most favors the reuse claim and why. We are developing analytic models of reuse. Although the predictions of these models will not be directly tested in the first year's evaluation, we will gather data to calibrate these models for a later evaluation.

Results of the Challenge Problem Evaluation

We present the results of the crisis-management evaluation first, followed by the results of the battlespace evaluation.

Crisis Management
The evaluation of the SAIC and Teknowledge crisis-management systems involved 7 trials, or batches, of roughly 110 questions. Thus, more than 1500 answers were manually graded by the challenge problem developer, IET, and subject matter experts at PSR on criteria ranging from correctness to completeness of source material to the quality of the representation of the question. Each question in a batch was posed in English accompanied by the syntax of the corresponding parameterized question (figure 3). The crisis-management systems were supposed to translate these questions into an internal representation, MELD for the Teknowledge system and KIF for the SAIC system. The MELD translator was operational for all the trials; the KIF translator was used to a limited extent on later trials.

The first trial involved testing the systems on the sample questions that had been available for several months for training. The remaining trials implemented the "test and retest with scenario modification" strategy discussed earlier. The first batch of test questions, TQA, was repeated four days later as a retest; it was designated TQA' for scoring purposes. The difference in scores between TQA and TQA' represents improvements in the systems. After solving the questions in TQA', the systems tackled a new set, TQB, designed to be "close to" TQA. The purpose of TQB was to check whether the improvements to the systems generalized to new questions. After a short break, a modification was introduced into the crisis-management scenario, and new fragments of knowledge about the scenario were released. Then, the cycle repeated: A new batch of questions, TQC, tested how well the systems coped with the scenario modification; then after four days, the systems were retested on the same questions, TQC', and on the same day, a final batch, TQD, was released and answered.

Each question in a trial was scored according to several criteria, some official and others optional. The four official criteria were (1) the correctness of the answer, (2) the quality of the explanation of the answer, (3) the completeness and quality of cited sources, and (4) the quality of the representation of the question. The optional criteria included lay intelligibility of explanations, novelty of assumptions, quality of the presentation of the explanation, the automatic production by the system of a representation of the question, source novelty, and reconciliation of multiple sources. Each question could garner a score between 0 and 3 on each criterion, and the criteria were themselves weighted. Some questions had multiple parts, and the number of parts was a further weighting criterion. In retrospect, it might have been clearer to assign each question a percentage of the points available, thus standardizing all scores, but in the data that follow, scores are on an open-ended scale. Subject-matter experts were assisted with scoring the quality of knowledge representations when necessary.
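The article reports the 0 to 3 scale and the fact of weighting but not the weights themselves; the sketch below shows the arithmetic of one question's score under invented weights.

```python
# Official criteria with hypothetical weights (the article does not publish
# the actual weights). Each criterion is scored 0-3 by a subject-matter expert.
WEIGHTS = {"correctness": 2.0, "explanation": 1.0,
           "sources": 1.0, "representation": 1.0}

def question_score(criterion_scores, n_parts=1):
    """criterion_scores, e.g. {"correctness": 3, "explanation": 2, ...};
    multipart questions are further weighted by their number of parts."""
    raw = sum(WEIGHTS[c] * s for c, s in criterion_scores.items())
    return raw * n_parts  # open-ended scale, as the article notes

# question_score({"correctness": 3, "explanation": 2,
#                 "sources": 3, "representation": 2}, n_parts=2)  # -> 26.0
```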

A web-based form was developed for scoring, with clear instructions on how to assign scores. For example, on the correct-answer criterion, the subject-matter expert was instructed to award "zero points if no top-level answer is provided and you cannot infer an intended answer; one point for a wrong answer without any convincing arguments, or most required answer elements; two points for a partially correct answer; three points for a correct answer addressing most required elements."

When you consider the difficulty of the task, both systems performed remarkably well. Scores on the sample questions were relatively high, which is not surprising because these questions had been available for training for several months (figure 5). It is also not surprising that scores on the first batch of test questions (TQA) were not high. It is gratifying, however, to see how scores improve steadily between test and retest (TQA and TQA', TQC and TQC') and that these gains are general: The scores on TQA' and TQB and on TQC' and TQD were similar.

The scores designated auto in figure 5 refer to questions that were translated automatically from English into a formal representation. The Teknowledge system translated all questions automatically, the SAIC system very few. Initially, the Teknowledge team did not manipulate the resulting representations, but in later batches, they permitted themselves minor modifications. The effects of these can be seen in the differences between TQB and TQB-Auto, TQC and TQC-Auto, and TQD and TQD-Auto.

Although the scores of the Teknowledge and SAIC systems appear close in figure 5, differences between the systems appear in other views of the data. Figure 6 shows the performance of the systems on all official questions plus a few optional questions. Although these extra questions widen the gap between the systems, the real effect comes from adding optional components to the scores. Here, Teknowledge got credit for its automatic translation of questions and also for the lay intelligibility of its explanations, the novelty of its assumptions, and other criteria.

Recall that each question is given a score between zero and three on each of four official criteria. Figure 7 represents the distribution of the scores 0, 1, 2, and 3 for the correctness criterion over all questions for each batch of questions and each system. A score of 0 generally means the question was not answered. The Teknowledge system answered more questions than the SAIC system, and it had a larger proportion of perfect scores.

Interestingly, if one looks at the number of perfect scores as a fraction of the number of questions answered, the systems are not so different. One might argue that this comparison is invalid, that one's grade on a test depends on all the test items, not just those one answers. However, suppose one is comparing students who have very different backgrounds and preparation; then, it can be informative to compare them on both the number and quality of their answers. The Teknowledge-SAIC comparison is analogous to a comparison of students with very different backgrounds. Teknowledge used CYC, a huge, mature knowledge base, whereas SAIC used CYC's upper-level ontology and a variety of disparate knowledge bases. The systems were not at the same level of preparation at the time of the experiment, so it is informative to ask how each system performed on the questions it answered. On the sample questions, Teknowledge got perfect scores on nearly 80 percent of the questions it answered, and SAIC got perfect scores on nearly 70 percent of the questions it answered. However, on TQA, TQA', and TQB, the systems are very similar, getting perfect scores on roughly 60 percent, 70 percent, and 60 percent of the questions they answered. The gap widens to a roughly 10-percent spread in Teknowledge's favor on TQC, TQC', and TQD. Similarly, when one averages the scores of answered questions in each trial, the averages for the Teknowledge and SAIC systems are similar on all but the sample-question trial. Even so, the Teknowledge system had both higher coverage (questions answered) and higher correctness on all trials.

If one looks at the minimum of the official component scores for each question—answer correctness, explanation adequacy, source adequacy, and representation adequacy—the difference between the SAIC and Teknowledge systems is quite pronounced. Calculating the minimum component score is like finding something to complain about in each answer—the answer or the explanation, the sources or the question representation. It is the statistic of most interest to a potential user who would dismiss a system for failure on any of these criteria. Figure 8 shows that neither system did very well on this stringent criterion, but the Teknowledge system got a perfect or good minimum score (3 or 2) roughly twice as often as the SAIC system in each trial.

Figure 5. Scores on the Seven Batches of Questions That Made Up the Crisis-Management Challenge Problem Evaluation. Some questions were automatically translated into a formal representation; the scores for these are denoted with the suffix Auto.

Movement Analysis
The task for movement analysis was to identify sites—command posts, artillery sites, battle positions, and so on—given data about vehicle movements and some intelligence sources. Another component of the problem was to identify convoys. The data were generated by simulation for an area of approximately 10,000 square kilometers over a period of approximately 4 simulated days. The scenario involved 194 military units arranged into 5 brigades, 1848 military vehicles, and 7666 civilian vehicles. Some units were collocated at sites; others stood alone. There were 726 convoys and nearly 1.8 million distinct reports of vehicle movement.

Four groups developed systems for all or part of the movement-analysis problem, and SAIC and Teknowledge provided integration support. The groups were Stanford's SMI, SRI, UMass, and MIT. Each site identification was scored for its accuracy, and recall and precision scores were maintained for each site. Suppose an identification asserts at time t that a battalion command post exists at a location (x, y). To score this identification, we find all sites within a fixed radius of (x, y). Some site types are very similar to others. For example, all command posts have similar characteristics; so, if one mistakes a battalion command post for, say, a division command post, then one should get partial credit for the identification. The identification is incorrect but not as incorrect as, say, mistaking a command post for an artillery site. Entity error ranges from 0 for completely correct identifications to 1 for hopelessly wrong identifications. Fractional entity errors provide partial credit for near-miss identifications. If one of the sites within the radius of an identification matches the identification (for example, a battalion command post), then the identification score is 0, but if none matches, then the score is the average entity error for the identification matched against all the sites in the radius. If no site exists within the radius of an identification, then the identification is a false positive.

Recall and precision rates for sites are defined in terms of entity error. Let H be the number of sites identified with zero entity error, M be the number of sites identified with entity error less than one (near misses), and R be the number of sites identified with maximum entity error. Let T be the total number of sites, N be the total number of identifications, and F be the number of false positives. The following statistics describe the performance of the movement-analysis systems:

zero-entity-error recall = H / T
non-one-entity-error recall = (H + M) / T
maximum-entity-error recall = (H + M + R) / T
zero-entity-error precision = H / N
non-one-entity-error precision = (H + M) / N
maximum-entity-error precision = (H + M + R) / N
false-positive rate = F / N
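Because these definitions recur in the discussion of figure 9, a direct transcription into Python may be useful; it adds nothing beyond the formulas above.

```python
def movement_analysis_stats(H, M, R, T, N, F):
    """The recall, precision, and false-positive statistics defined above.

    H: identifications with zero entity error (exactly right)
    M: near misses (entity error strictly between zero and one)
    R: identifications with maximum entity error
    T: total sites; N: total identifications; F: false positives
    """
    return {
        "zero_ee_recall":       H / T,
        "non_one_ee_recall":    (H + M) / T,
        "max_ee_recall":        (H + M + R) / T,
        "zero_ee_precision":    H / N,
        "non_one_ee_precision": (H + M) / N,
        "max_ee_precision":     (H + M + R) / N,
        "false_positive_rate":  F / N,
    }
```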

Figure 6. Scores on All Questions (Including Optional Ones) and Both Official and Optional Criteria.

Figure 7. The Distribution of Correctness Scores for the Crisis-Management Challenge Problem. Correctness is one of the official scoring criteria and takes values in {0, 1, 2, 3}, with 3 being the highest score. This graph shows the proportion of the correctness scores that are 0, 1, 2, and 3. The lower bar, for example, shows that roughly 70 percent of the correctness scores were 3, and roughly 10 percent were 1 or 0.

Figure 8. The Distribution of Minimum Scores for the Crisis-Management Challenge Problem. Each question is awarded four component scores corresponding to the four official criteria. This graph shows the distribution of the minimum of these four components over questions. With the lowest bar, for example, on 20 percent of the questions, the minimum of the four component scores was 3.

Recall and precision scores for the groups are shown in figure 9. The experiment design involved two phases with a test and retest within each phase. Additionally, the data sets for each test included either military traffic only or military plus civilian traffic. Had each group run its systems in each experimental condition, there would have been 4 tests, 2 data sets with 2 versions of each (military only and military plus civilian), and 4 groups, or 64 conditions to score. All these conditions were not run. Delays were introduced by the process of reformatting data to make it compatible with a scoring program, so scores were not released in time for groups to modify their systems; one group devoted much time to scoring and did not participate in all trials, and another group participated in no trials for reasons discussed at the end of this section.

The trials that were run are reported in figure 9. Recall rates range from 40 percent to just over 60 percent, but these are for maximum-entity-error recall—the frequency of detecting a site when one exists but not identifying it correctly. Rates for correct and near-miss identification are lower, ranging from 10 percent to 15 percent, and are not shown in figure 9. Of the participating research groups, MIT did not attempt to identify the types of site, only whether sites are present; so, their entity error is always maximum. The other groups did not attempt to identify all kinds of site; for example, UMass tried to identify only battle positions, command posts, and assembly-staging areas. Even so, each group was scored against all site types. Obviously, the scores would be somewhat higher had the groups been scored against only the kinds of site they intended to identify (for example, recall rates for UMass ranged from 20 percent to 39 percent for the sites they tried to identify).

Precision rates are reported only for the SMI and UMass teams because MIT did not try to identify the types of site. UMass's precision was highest on its first trial; a small modification to its system boosted recall but at a significant cost to precision. SMI's precision hovered around 20 percent on all trials.

Figure 9. Scores for Detecting and Identifying Sites in the Movement-Analysis Problem. Trials labeled A are from the first phase of the experiment, before the scenario modification; trials labeled B are from after the scenario modification. Mil and Mil + Civ refer to data sets with military traffic only and with both military and civilian traffic, respectively.

Scores were much higher for convoy detection. Although scoring convoy identifications posed some interesting technical challenges, the results were plain enough: SMI, UMass, and MIT detected 507 (70 percent), 616 (85 percent), and 465 (64 percent) of the 726 convoys, respectively. The differences between these figures do not indicate that one technology was superior to another because each group emphasized a slightly different aspect of the problem. For example, SMI didn't try to detect small convoys (fewer than four vehicles). In any case, the scores for convoy identifications are quite similar.

In sum, none of the systems identified site types accurately, although they detected sites and convoys pretty well. The reason for these results is that sites can be detected by observing vehicle halts, just as convoys can be detected by observing clusters of vehicles. However, it is difficult to identify site types without more information. It would be worth studying the types of information that humans require for the task.

The experience of SRI is instructive: The SRI system used features of movement data to infer site types; unfortunately, the data did not seem to support the task. There were insufficient quantities of training data to train classifiers (fewer than 40 instances of sites), but more importantly, the data did not have sufficient underlying structure—it was too random. The SRI group tried a variety of classifier technologies, including identification trees, nearest neighbor, and C4.5, but classification accuracy remained in the low 30-percent range. This figure is comparable to those released by UMass on the performance of its system on three kinds of site. Even when the UMass recall figures are not diluted by sites they weren't looking for, they did not exceed 40 percent. SRI and UMass both performed extensive exploratory data analysis, looking for statistically significant relationships between features of movement data and site types, with little success; the regularities they did find were not strongly predictive. The reason these teams found little statistical regularity is probably that the data were artificial. It is extraordinarily difficult to build simulators that generate data with the rich dependency structure of natural data.

Some simulated intelligence and radar information were released with the movement data. Although none of the teams reported finding it useful, we believe these kinds of data probably help human analysts. One assumes humans perform the task better than the systems reported here, but because no human ever solved these movement-analysis problems, one cannot tell whether the systems performed comparatively well or poorly. In any case, the systems took a valuable first step toward movement analysis, detecting most convoys and sites. Identifying the sites accurately will be a challenge for the future.

Workarounds
The workaround problem was evaluated in much the same way as the other problems: The evaluation period was divided into two phases, and within each phase, the systems were given a test and a retest on the same problems. In the first phase, the systems were tested on 20 problems and retested after a week on the same problems; in the second phase, a modification was introduced into the scenario, the systems were tested on five problems, and after a week they were retested on the same five problems and five new ones.

Solutions were scored along five equally weighted dimensions: (1) viability of enumerated workaround options, (2) correctness of the overall time estimate for a workaround, (3) correctness of solution steps provided for each viable option, (4) correctness of temporal constraints among these steps, and (5) appropriateness of engineering resources used. Scores were assigned by comparing the systems' answers with those of human experts. Bonus points were awarded when, occasionally, systems gave better answers than the experts. These answers became the gold standard for scoring answers when the systems were retested.

Four systems were fielded by ISI, GMU, Teknowledge, and AIAI. The results are shown in figures 10 and 11. In each figure, the upper extreme of the vertical axis represents the maximum score a system could get by answering all the questions correctly (that is, 200 points for the initial phase, 50 points for the first test in the modification phase, 100 points for the retest in the modification phase). The bars represent the number of points actually scored by each system, and the circles represent the number of points that could have been awarded given the number of questions that were actually answered. For example, in the initial phase, ISI answered all questions, so it could have been awarded 200 points on the test and 200 on the retest; GMU covered only a portion of the domain, and it could have been awarded a maximum of 90 and 100 points, respectively.

How one views the performance of these systems depends on how one values correctness; coverage (the number of questions answered); and, more subtly, the prospects for scaling the systems to larger problem sets. An assistant should answer any question posed to it, but if the system is less than ideal, should it answer more questions with some errors or fewer questions with fewer errors? Obviously, the answer depends on the severity of errors, the application, and the prospect for improving system coverage and accuracy. What we might call errors of specificity, in which an answer is less specific or complete than it should be, are not inconsistent with the philosophy of HPKB, which expects systems to give partial—even commonsense—answers when they lack specific knowledge.

Figures 10 and 11 show that USC-ISI was the only group to attempt to solve all the workaround problems, although its answers were not all correct, whereas GMU solved fewer problems with higher overall correctness. AIAI solved fewer problems still, but quite correctly, and Teknowledge solved more problems than AIAI with more variable correctness. One can compute coverage and correctness scores as follows: Coverage is the number of questions attempted divided by the total number of questions in the experiment (55 in this case). Correctness is the total number of points awarded divided by the number of points that might have been awarded given the number of answers attempted. Figure 12 shows a plot of coverage against correctness for all the workaround systems. Points above and to the right of other points are superior; thus, the ISI and GMU systems are preferred to the other systems, but the ranking of these systems depends on how one values coverage and correctness.
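In code, these two statistics are one-liners; the example numbers below are invented.

```python
TOTAL_QUESTIONS = 55  # workaround problems in the experiment

def coverage(n_attempted):
    """Fraction of the 55 problems a system attempted."""
    return n_attempted / TOTAL_QUESTIONS

def correctness(points_awarded, points_available_for_attempted):
    """Points earned relative to the points available on attempted problems."""
    return points_awarded / points_available_for_attempted

# Invented example: a system attempting 20 problems and earning 160 of the
# 200 points available for them has coverage 20/55 = 0.36 and correctness 0.80.
```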

Figure 10. Results of the Initial Phase of the Workaround Evaluation. Lighter bars represent test scores; darker bars represent retest scores. Circles represent the scope of the task attempted by each system, that is, the best score that the system could have received given the number of questions it answered.

Two factors complicate comparisons between the systems. First, if one is allowed to select a subset of problems for one's system to solve, one might be expected to select problems on which one hopes the system will do well. Thus, ISI's correctness scores are not strictly comparable to those of the other systems because ISI did not select problems but solved them all. Said differently, what we call correctness is really "correctness in one's declared scope." It is not difficult to get a perfect correctness score if one selects just one problem; conversely, it is not easy to get a high correctness score if one solves all problems.

Second, figure 12 shows that coverage can be increased by knowledge engineering, but at what cost to correctness? GMU believes correctness need not suffer with increased coverage and cites the rapid growth of its knowledge base. At the beginning of the evaluation period, the coverage of the knowledge base was about 40 percent (11,841 binary predicates), and two weeks later, the coverage had grown significantly to roughly 80 percent of the domain (20,324 binary predicates). In terms of coverage and correctness (figure 12), GMU's coverage increased from 47.5 percent in the test phase to 80 percent in the modification phase, with almost no decrement in correctness. However, ISI suggests that the complexity of the workaround task increases significantly as coverage is extended, to the point where a perfect score might be out of reach given the state of the art in AI technology. For example, in some cases, the selection of resources requires combining plan generation with reasoning about temporal aspects of the (partially generated) plan. Moreover, interdependencies in the knowledge bases grow nonlinearly with coverage because different aspects of the domain interact and must be coordinated. For example, including mine-clearing operations changes the way one looks at how to ford a river. Such interactions in large knowledge bases might degrade the performance of a system when changes are made during a retest and modification phase.

Figure 12. A Plot of Overall Coverage against Overall Correctness for Each System on the Workaround Challenge Problem.

The coverage-versus-correctness debate should not cloud the accomplishments of the workaround systems. Both ISI and GMU were judged to perform the workaround task at an expert level. All the systems were developed quite rapidly—indeed, GMU's knowledge base doubled during the experimental period itself—and knowledge reuse was prevalent. These results take us some way toward the HPKB goal of very rapid development of powerful and flexible knowledge base systems.

Evaluation Lessons Learned
The experiments reported here constitute one of the larger studies of AI technology. In addition to the results of these experiments, much was learned about how to conduct such studies. The primary lesson is that one should release the challenge problem specifications early. Many difficulties can be traced to delays in the release of the battlespace specifications. Experiment procedures that involve multiple sites and technologies must be debugged in dry-run experiments before the evaluation period begins. There must be agreement on the formats of input data, answers, and answer keys—a mundane requirement but one that caused delays in the movement-analysis evaluation.

A major contributor to the success of the crisis-management problem was the parameterized question syntax, which gave all participants a concise representation of the kinds of problem they would solve during the evaluation. Similarly, evaluation metrics for crisis management were published relatively early. Initially, there was a strong desire for evaluation criteria to be objective enough for a machine to score performance. Relaxing this requirement had a positive effect in crisis management. The criteria were objective in the sense that experts followed scoring guidelines, but we could ask the experts to assess the quality of a system's explanation—something no program can do. However, scoring roughly 1500 answers in crisis management took its toll on the experts and IET. It is worth considering whether equally informative results could be had at a lower cost.

The HPKB evaluations were set up as friendly competitions between the integration teams, SAIC and Teknowledge, each incorporating a subset of the technology in a unique architecture. However, some technology was used by both teams, and some was not tested in integrated systems but rather as stand-alone components. The purpose of the friendly competition—to evaluate the merits of different technologies and architectures—was not fully realized. Given the cost of developing HPKB systems, we might reevaluate the purported and actual benefits of competitions.

Figure 11. Results of the Modification Phase of the Workaround Evaluation. Lighter bars represent test scores; darker bars represent retest scores. Circles represent the scope of the task attempted by each system, that is, the best score that the system could have received given the number of questions it answered. Note that only 5 problems were released for the test, 10 for the retest; so, the maximum available points for each are 50 and 100, respectively.

Conclusion
HPKB is an ambitious program, a high-risk, high-payoff venture beyond the edge of knowledge-based technology. In its first year, HPKB suffered some setbacks, but these were informative and will help the program in the future. More importantly, HPKB celebrated some extraordinary successes. It has long been a dream in AI to ask a program a question in natural language about almost anything and receive a comprehensive, intelligible, well-informed answer. Although we have not achieved this aim, we are getting close. In HPKB, questions were posed in natural language, questions on a hitherto impossible range of topics were answered, and the answers were supported by page after page of sources and justifications. Knowledge bases were constructed rapidly from many sources, reuse of knowledge was a significant factor, and semantic integration across knowledge bases started to become feasible. HPKB has integrated many technologies, cutting through traditional groupings in AI, to develop new ways of acquiring, organizing, sharing, and reasoning with knowledge, and it has nourished a remarkable assault on perhaps the grandest of the AI grand challenges—to build intelligent programs that answer questions about everyday things, important things like international relations but also ordinary things like river banks.

Acknowledgments
The authors wish to thank Stuart Aitken, Vinay Chaudhri, Cleo Condoravdi, Jon Doyle, John Gennari, Yolanda Gil, Moises Goldszmidt, William Grosso, Mark Musen, Bill Swartout, and Gheorghe Tecuci for their help preparing this article.

Notes
1. Many HPKB resources and documents are found at www.teknowledge.com/HPKB. A web site devoted to the Year 1 HPKB Challenge Problem Conference can be found at www.teknowledge.com/HPKB/meetings/Year1meeting/. Definitive information on the year 1 crisis-management challenge problem, including updated specification, evaluation procedures, and evaluation results, is available at www.iet.com/Projects/HPKB/Y1Eval/. Earlier, draft information about both challenge problems is available at www.iet.com/Projects/HPKB/Combined-CPs.doc. Scoring procedures and data for the movement-analysis problem are available at eksl-www.cs.umass.edu:80/research/hpkb/scores/. For additional sources, contact the authors.
2. These numbers and all information regarding MTI radar are approximate; actual figures are classified.

References
Friedman, N.; Geiger, D.; and Goldszmidt, M. 1997. Bayesian Network Classifiers. Machine Learning 29:131–163.
Gil, Y. 1994. Knowledge Refinement in a Reflective Architecture. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 520–526. Menlo Park, Calif.: American Association for Artificial Intelligence.
Jones, E. 1998. Formalized Intelligence Report Ensemble (FIRE) & Intelligence Stream Encoding (ISE). Memo, Alphatech.
Katz, B. 1997. From Sentence Processing to Information Access on the World Wide Web. Paper presented at the AAAI Spring Symposium on Natural Language Processing for the World Wide Web, 24–26 March, Stanford, California.
Lee, J.; Grunninger, M.; Jin, Y.; Malone, T.; Tate, A.; Yost, G.; and the PIF Working Group, eds. 1996. The PIF Process Interchange and Framework Version 1.1. Working Paper 194, Center for Coordination Science, Massachusetts Institute of Technology.
Lenat, D. 1995. Artificial Intelligence. Scientific American 273(3): 80–82.
Lenat, D. B., and Feigenbaum, E. A. 1987. On the Thresholds of Knowledge. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, 1173–1182. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.
Lenat, D., and Guha, R. 1990. Building Large Knowledge-Based Systems. Reading, Mass.: Addison Wesley.
Liddy, E. D.; Paik, W.; and McKenna, M. 1995. Development and Implementation of a Discourse Model for Newspaper Texts. Paper presented at the AAAI Symposium on Empirical Methods in Discourse Interpretation and Generation, 27–29 March, Stanford, California.
MacGregor, R. 1991. The Evolving Technology of Classification-Based Knowledge Representation Systems. In Principles of Semantic Networks: Explorations in the Representation of Knowledge, ed. J. Sowa, 385–400. San Francisco, Calif.: Morgan Kaufmann.
Miller, G.; Beckwith, R.; Fellbaum, C.; Gross, D.; and Miller, K. 1993. Introduction to WORDNET: An Online Lexical Database. Available at www.cogsci.princeton.edu/~wn/papers/.
Park, J.; Gennari, J.; and Musen, M. 1997. Mappings for Reuse in Knowledge-Based Systems. Technical Report SMI-97-0697, Section on Medical Informatics, Stanford University.
Pease, A., and Albericci, D. 1998. The Warplan. CJL256, Teknowledge, Palo Alto, California.
Pease, A., and Carrico, T. 1997a. The JTF ATD Core Plan Representation: A Progress Report. Paper presented at the AAAI 1997 Spring Symposium on Ontological Engineering, 24–26 March, Stanford, California.
Pease, A., and Carrico, T. 1997b. The JTF ATD Core Plan Representation: Request for Comment. AL/HR-TP-96-9631, Armstrong Lab, Wright-Patterson Air Force Base, Ohio.
Swartout, B., and Gil, Y. 1995. EXPECT: Explicit Representations for Flexible Acquisition. Paper presented at the Ninth Knowledge Acquisition for Knowledge-Based Systems Workshop, 26 February–3 March, Banff, Canada.

Tate, A. 1998. Roots of SPAR—Shared Planning and Activity Representation. Knowledge Engineering Review (Special Issue on Ontologies) 13(1): 121–128.
Tecuci, G. 1998. Building Intelligent Agents: An Apprenticeship Multistrategy Learning Theory, Methodology, Tool, and Case Studies. San Diego, Calif.: Academic.
Veloso, M.; Carbonell, J.; Perez, A.; Borrajo, D.; Fink, E.; and Blythe, J. 1995. Integrating Planning and Learning: The PRODIGY Architecture. Journal of Experimental and Theoretical Artificial Intelligence 7:81–120.
Waltz, D. 1975. Understanding Line Drawings of Scenes with Shadows. In The Psychology of Computer Vision, ed. P. H. Winston, 19–91. New York: McGraw-Hill.
Wiederhold, G., ed. 1996. Intelligent Integration of Information. Boston, Mass.: Kluwer Academic.

Paul Cohen is professor of computer science at the University of Massachusetts. He received his Ph.D. from Stanford University in 1983. His publications include Empirical Methods for Artificial Intelligence (MIT Press, 1995) and The Handbook of Artificial Intelligence, Volumes 3 and 4 (Addison-Wesley), which he edited with Edward A. Feigenbaum and Avron B. Barr. Cohen is developing theories of how sensorimotor agents such as robots and infants learn ontologies.

Robert Schrag is a principal scientist at Information Extraction & Transport, Inc., in Rosslyn, Virginia. Schrag has served as principal investigator to develop the high-performance knowledge base crisis-management challenge problem. Besides evaluating emerging AI technologies, his research interests include constraint satisfaction, propositional reasoning, and temporal reasoning.

Eric Jones is a senior member of the technical staff at Alphatech Inc. From 1992 to 1996, he was a member of the academic staff in the Department of Computer Science at Victoria University, Wellington, New Zealand. He received his Ph.D. in AI from Yale University in 1992. He also has degrees in mathematics and geography from the University of Otago, New Zealand. His research interests include case-based reasoning, natural language processing, and statistical approaches to machine learning. He serves on the American Meteorological Society Committee on AI Applications to Environmental Science.

Adam Pease received his B.S. and M.S. in computer science from Worcester Polytechnic Institute. He has worked in the area of knowledge-based systems for the past 10 years. His current interests are in planning ontologies and large knowledge-based systems. He leads an integration team for Teknowledge on the High-Performance Knowledge Base Project.

Albert D. Lin received a B.S.E.E. from Feng Chia University in Taiwan in 1984 and an M.S. in electrical and computer engineering from San Diego State University in 1990. He is a technical lead at Science Applications International Corporation, currently leading the High-Performance Knowledge Base Project.

Barbara H. Starr received a B.S. from the University of Witwatersrand in South Africa in 1980 and an Honors degree in computer science from the University of Witwatersrand in 1981. She has been a consultant in the areas of AI and object-oriented software development in the United States for the past 13 years and in South Africa for several years before that. She is currently leading the crisis-management integration effort for the High-Performance Knowledge Base Project at Science Applications International Corporation.

David Gunning has been the program manager for the High-Performance Knowledge Base Project since its inception in 1995 and recently became the assistant director for Command and Control at the Defense Advanced Research Projects Agency, overseeing research projects in Command and Control that include work in intelligent reasoning and decision aids for crisis management, air campaign planning, course-of-action analysis, and logistics planning. Gunning has an M.S. in computer science from Stanford University and an M.S. in cognitive psychology from the University of Dayton.

Murray Burke has served as the High-Performance Knowledge Base Project deputy program manager for the past year, and the Defense Advanced Research Projects Agency has selected him as its new program manager. He has over 30 years in command and control information systems research and development. He was cofounder and executive vice president of Knowledge Systems Concepts, Inc., a software company specializing in defense applications of advanced information technologies. Burke is a member of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Institute of Electrical and Electronics Engineers.

