

Using Natural Language Interfaces

Handbook of Human-Computer Interaction
M. Helander (ed.)
© Elsevier Science Publishers B.V. (North-Holland), 1996

Chapter 33

William C. Ogden
Philip Bernick
Computing Research Laboratory
New Mexico State University
Las Cruces, New Mexico 88003
ogden | [email protected]

1.0 Introduction
    Habitability
2.0 Evaluation Issues
3.0 Evaluations of Prototype and Commercial Systems
    Laboratory Evaluations
    Field Studies
    Natural Language Versus Other Interface Designs
4.0 Design Issues
    What is Natural?
    Restrictions on Vocabulary
    Restrictions on Syntax
    Functional Restrictions
    Effects of Feedback
    Empirically Derived Grammars
5.0 Design Recommendations
6.0 Conclusion
7.0 Acknowledgments
8.0 References

1.0 Introduction

A goal of human factors research with computer systems is to develop human-computer communication modes that are both error tolerant and easily learned. Since people already have extensive communication skills through their own native or natural language (e.g. English, French, Japanese, etc.), many believe that natural language interfaces (NLIs) can provide the most useful and efficient way for people to interact with computers.

Although some authors express a belief that computers will never be able to understand natural language (e.g. Winograd & Flores, 1986), others feel that natural language processing technology needs only to advance sufficiently to make general purpose NLIs possible. Indeed, there have been several attempts to produce commercial systems.

The goal for most natural language systems is to provide an interface that minimizes the training required for users. To most, this means a system that uses the words and syntax of a natural language such as English. There is, however, some disagreement as to the amount of "understanding" or flexibility required in the system.

Systems have been proposed that permit users to construct English sentences by selecting words from menus (Tennant et al., 1983). However, Woods (1977) rejects the idea that a system using English words in an artificial format should be considered a natural language system, and assumes that the system should have an awareness of discourse rules that make it possible to omit easily inferred details. In further contrast, Perlman (1984) suggests that "naturalness" be determined by the context of the current application and urges the design of restricted "natural artificial" languages.

Philosophical issues about the plausibility of computers understanding and generating natural language aside, it was widely believed that the number of NLI applications would continue to grow (Waltz, 1983). Since then, work in graphical user interfaces (GUIs) has solved many of the problems that NLIs were expected to solve. As a result, NLIs have not grown at the rate first anticipated, and those that have been produced are designed to use constrained language in limited domains.

Research into NLIs continues, and a frequently asked question is how effective these interfaces are for human-computer communication. The focus of this chapter is not to address the philosophical issues of natural language processing. Rather, it is to review empirical methods that have been applied to the evaluation of these limited NLIs and to review the results of user studies. We have not included the evaluation of speech systems, primarily due to space limitations. Readers interested in speech system evaluations should consult Hirschman et al. (1992) and Goodine et al. (1992).

This discussion of empirical results is also not limited to a single definition of natural language. Instead, it uses the most liberal definition of natural language and looks at systems that seek to provide flexible input languages that minimize training requirements. The discussion is limited to two categories of studies that report empirical results obtained from observing users interacting with these systems. The first category consists of studies of prototype systems developed in research environments; these studies have been conducted both in laboratory settings and in field settings. The second category consists of simulated systems studied in the laboratory that are designed to help identify desirable attributes of natural language systems. Before reviewing these studies, some criteria for evaluating NLIs are presented.

Habitability

Habitability is a term coined by Watt (1968) to indicate how easily, naturally, and effectively users can use language to express themselves within the constraints of a system language. A language is considered habitable if users can express everything that is needed for a task using language they would expect the system to understand. For example, if there are 26 ways in which a user population would be likely to describe an operation, a habitable system will process all 26.

In this review of studies, at least four domains in which a language can be habitable should be considered: conceptual, functional, syntactic, and lexical. Users of an NLI must learn to stay within the limits of all four domains.

Conceptual: The conceptual domain of a language describes the language's total area of coverage, and defines the complete set of objects and actions covered by the interface. Users may only reference those objects and actions processable by the system. For example, a user should not ask about staff members who are managers if the computer system has no information about managers. Such a system would not understand the sentence:

1. What is the salary of John Smith's manager?

Users are limited to only those concepts the system has information about. There is a difference between the conceptual domain of a language and the conceptual domain of the underlying system. The conceptual domain of a language can be expanded by recognizing concepts (e.g. manager) that exceed the system's coverage and responding appropriately (Codd, 1974) (e.g. "There is no information on managers."). The query could then be said to be part of the language's conceptual domain, but not supported by the system's. However, there will always be a limit to the number of concepts expressible within the language, and users must learn to refer to only these concepts.

Functional: The functional domain is defined by constraints on what can be expressed within the language without elaboration, and determines the processing details that users may leave out of their expressions. While conceptual coverage determines what can be expressed, functional coverage determines how it can be expressed. Natural language allows speakers to reference concepts in many ways depending on listener knowledge and context. The functional domain is determined by the number of built-in functions or the knowledge the system has available. For example, although a database may have salary information about managers and staff, a natural language interface still may not understand the question expressed in Sentence 1 if the procedure for getting the answer is too complicated to be expressed in one question. For example, the answer to Sentence 1 may require two steps: one to retrieve the name of the manager and another to retrieve the salary associated with the name. Thus, the system may allow the user to get the answer with two questions:

2a. Who is the manager of John Smith?

System: MARY JONES

2b. What is the salary of Mary Jones?

With these questions, the user essentially specifies procedures that the system is capable of. The question in Sentence 1 does not exceed the conceptual domain of the language because salaries of managers are available. Instead, it expresses a function that does not exist (i.e. a function that combines two retrievals from one question). Nor is it a syntactic limitation, since the system might understand a question with the same syntactic structure as Sentence 1, but one that can be answered with a single database retrieval:

3. What is the name of John Smith's manager?
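To make the functional distinction concrete, the sketch below runs Sentences 1, 2a, and 2b against a small staff table (the schema and data are invented for illustration). Answering Sentence 1 in one question requires the system to compose two retrievals itself, which is exactly the kind of built-in function a limited NLI may lack; Sentences 2a and 2b instead let the user supply that procedure as two single-retrieval questions.

    import sqlite3

    # A toy staff table; the schema and data are assumptions for illustration.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE staff (name TEXT, salary INTEGER, manager TEXT);
        INSERT INTO staff VALUES ('John Smith', 40000, 'Mary Jones'),
                                 ('Mary Jones', 60000, NULL);
    """)

    # Sentence 1 asks the system to combine two retrievals in one question
    # (here, a subquery): "What is the salary of John Smith's manager?"
    one_question = con.execute(
        "SELECT salary FROM staff WHERE name = "
        "(SELECT manager FROM staff WHERE name = 'John Smith')").fetchone()

    # Sentences 2a and 2b express the same procedure as two questions,
    # each answerable with a single retrieval.
    manager = con.execute(
        "SELECT manager FROM staff WHERE name = 'John Smith'").fetchone()[0]
    two_questions = con.execute(
        "SELECT salary FROM staff WHERE name = ?", (manager,)).fetchone()

    print(one_question, two_questions)  # (60000,) (60000,)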

Other, more formal, languages vary in functional coverage as well. Concepts like square root can be expressed directly in some languages but must be computed in others. A habitable system provides the functions that users expect in the interface language. Since there will be a limit on the functional domain of the language, users must learn to refer to only those functions contained in the language.

Syntactic: The syntactic domain of a language refers to the number of paraphrases of a single command that the system understands. A system that did not allow possessives might not understand Sentence 1, but would understand Sentence 4:

4. What is the salary of the manager of John Smith?

A habitable system must provide the syntactic coverage users expect.

Lexical: The lexical domain of a language refers to the words contained in the system's lexicon. Sentence 1 might not be understood if the language does not accept the word "salary" but accepts "earnings." Thus, Sentence 5 would be understood:

5. What are the earnings of the manager of John Smith?

A natural language system must be made habitable in all four domains because it will be difficult for users to learn which domain is violated when the system rejects an expression. A user entering a command like Sentence 1 that is rejected by the system, but does not violate the conceptual domain, might be successful with any of the paraphrases in Sentences 2, 4, or 5. Determining which one will depend upon the functional, syntactic, or lexical coverage of the language. When evaluating the results of user studies, it is important to keep the distinctions between habitable domains in mind.
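One implication is that a rejected expression is most useful when the system can report which domain was violated. A minimal sketch of such a diagnostic, in which the lexicon, the toy possessive rule, and the coverage sets are all invented for illustration:

    # All coverage sets below are assumptions invented for illustration.
    LEXICON = {"what", "is", "are", "the", "earnings", "of", "manager",
               "name", "john", "smith"}
    CONCEPTS = {"earnings", "manager", "name"}   # conceptual coverage
    FUNCTIONS = {"single_retrieval"}             # functional coverage

    def diagnose(tokens, concepts_used, function_needed):
        """Report which habitability domain a query violates, if any."""
        if any("'" in t for t in tokens):        # toy rule: no possessives
            return "syntactic violation: possessive form not parsed"
        unknown = [t for t in tokens if t not in LEXICON]
        if unknown:
            return f"lexical violation: unknown word(s) {unknown}"
        if not set(concepts_used) <= CONCEPTS:
            return f"conceptual violation: {set(concepts_used) - CONCEPTS}"
        if function_needed not in FUNCTIONS:
            return f"functional violation: {function_needed} not available"
        return "accepted"

    # Sentence 1, lower-cased and tokenized: it fails on the possessive
    # first, and would fail lexically on 'salary' even if rephrased.
    print(diagnose("what is the salary of john smith's manager".split(),
                   {"manager"}, "single_retrieval"))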

NLIs attempt to cover each domain by meeting the expectations of the user, and interface habitability is determined by how well these expectations are met. Of course, the most habitable NLI would be one capable of passing a Turing Test or winning the Loebner Prize¹.

¹ In the 1950s Turing proposed a test designed to challenge our beliefs about what it means to think (Turing, 1950). The Loebner variant involves computer programs with NLIs that have been designed to converse with users (Epstein, 1993). Users do not know whether they are communicating with a computer or another user. Winning the prize involves fooling the users into thinking they are conversing with another human when, in fact, they are communicating with a computer.

It can be difficult to measure the habitability of a language. To determine how much coverage of each domain is adequate for a given task requires good evaluation methods. The next section reviews some of the methodological issues to consider when evaluating natural language interfaces.

2.0 Evaluation Issues

When evaluating the studies presented in this chapter there are several methodological issues that need to be considered: user selection and training, task generation and presentation, so-called Wizard-of-Oz simulations (Kelley, 1984), parsing success rates, and interface customization. Here we present a brief discussion of these issues.

User selection and training: Evaluating any user-system interface requires test participants who represent the intended user population. For natural language evaluations, this is a particularly important factor. A language's habitability depends on how well it matches user knowledge about the domain of discourse. Therefore, test participants should have knowledge similar to that held by the actual users in the target domain. Some studies select carefully, but most have tried to train participants in the domain to be tested. It is likely, however, that participants employed for an experiment will be less motivated to use the NLI productively than existing users of the database. On the other hand, a measure of control is provided when participants are trained to have a common understanding of the domain.

Training ranges from a minimal introduction to the domain to extensive training on the interface language that can include instructions and practice on how to avoid common traps. Obviously, the quality of training given to users significantly affects user performance. Therefore, the kind and amount of training users receive should represent the training that users are expected to have with the actual product.

Task generation and presentation: The goal of studies designed to evaluate the use of natural language is to collect unbiased user expressions as users are engaged in computer-related tasks. These tasks should be representative of the type of work the user would be expected to accomplish with the interface and be presented in a way that does not influence the form of expression. In most studies, tasks are generated by the experimenter, who attempts to cover the range of function available in the interface. Letting users generate their own tasks is an alternative method, but results in a reduction of experimenter control. This method also requires that users be motivated to ask appropriate questions.

Experimenter-generated tasks are necessary if actual users are not available or need prior training. These tasks can simulate a hypothesized level of user knowledge and experience by presenting tasks assumed to be representative of questions that would be asked of the real system. The disadvantage of experimenter-generated tasks is that they do not allow for assessment of the language's conceptual habitability because experimenters usually generate only solvable tasks.

User-generated tasks have the advantage of being able to assess the language's conceptual and functional coverage because users are free to express problems that may be beyond the capabilities of the system. The disadvantage of user-generated tasks is that a study's results may not generalize beyond the set of questions asked by the selected set of users; no attempt at covering all of the capabilities of the interface will have been made. Another disadvantage is that actual users of the proposed system must be available.

How experimenter-generated or user-generated tasks are presented to test participants has a strong influence on the expressions test participants generate. In an extreme case, participants would be able to solve the task merely by entering the task instructions as they were presented. At the other extreme, task instructions would encourage participants to generate invalid expressions.

Researchers usually choose one of two methods for overcoming these extremes. One method presents the task as a large, generally-stated problem that requires several steps to solve. This method not only tests the habitability of the language, but also the problem-solving ability of the participants. Participants are free to use whatever strategy seems natural and to use whatever functions they expect the interface to have. Like user-generated tasks, this method does not allow researchers to test all of the anticipated uses of the system because participants may not ask sufficiently complicated questions.

An alternative method has been used to test more of a system's functions. In this method, participants are given items like tables or graphs with some information missing, and are then asked to complete these items by asking the system for the missing information. This method gives an experimenter the most control over the complexity of expressions that participants would be expected to enter. However, some expressions may be difficult to represent non-linguistically, so coverage may not be as complete as desired (cf. Zoeppritz, 1986). An independent measure of how participants interpreted the questions should also be used to determine whether participants understood the task. For example, participants may be asked to do the requested task manually before asking the computer to do it.

Wizard-of-Oz (WOz) simulations: WOz studies simulate a natural language system by using a human to interpret participants' commands. In a typical experiment a participant will type a natural language command on one terminal that will appear on a terminal monitored by an operator (the Wizard) hidden in another location. The Wizard interprets the command and takes appropriate actions that result in messages that appear on the participant's terminal. Usually the Wizard makes decisions about what a real system would or would not understand. It is likely that the Wizard will not be as consistent in responding as the computer would, and this problem should be taken into account when reviewing these studies. However, WOz simulations are useful for quickly evaluating potential designs.

Evaluation of parsing success rates: An often reported measure of habitability is the proportion of expressions that can be successfully parsed by the language processor returning a result. But studies that report parsing success rates as a global indicator of how well the system is doing assume that all commands are equally complex when they are not (Tennant, 1980).

A high success rate may be due to participants in the study repeating a simple request many times. For example, Tennant observed a participant asking "How many NOR hours did plane 4 have in Jan of 1973," and then repeating this request again for each of the 12 months. This yields 12 correctly parsed questions. On the other hand, another participant trying to get the information for all 12 months in one request had two incorrectly parsed requests before correctly requesting "List the NOR hours in each month of 1973 for plane 4." The second participant had a lower parse rate percentage, but obtained the desired information with less work than the first participant. In fact, Tennant found that successful task solution scores did not correlate with parsing success rates. Therefore, a user's ability to enter allowable commands does not guarantee that they can accomplish their tasks.
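Tennant's contrast is easy to restate as metrics. A minimal sketch, with the two session logs invented to mirror the example above (each entry records whether a request parsed and whether it completed the task):

    # Invented session logs: (request_parsed, task_completed_by_request).
    repeater = [(True, False)] * 11 + [(True, True)]   # 12 requests, 1 task
    planner = [(False, False), (False, False), (True, True)]  # 3 requests

    def parse_rate(log):
        return sum(parsed for parsed, _ in log) / len(log)

    def requests_per_task(log):
        return len(log) / sum(done for _, done in log)

    print(parse_rate(repeater), requests_per_task(repeater))  # 1.0  12.0
    print(parse_rate(planner), requests_per_task(planner))    # ~0.33  3.0

By the parse-rate measure the first participant looks perfect; by effort per completed task the second is far ahead.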

Parsing success rates need to be interpreted in light of other measures such as number of requests per task, task solution success, and solution time. Since these measures depend on the tasks users are given, which vary from study to study, it would be inappropriate to compare systems across studies.

Interface customization: Finally, how a system is customized for the application being evaluated is an important methodological issue. Each NLI requires that semantic and pragmatic information about the task domain be encoded and entered. Evaluation results are significantly affected if researchers capture and enter this information poorly. Since most evaluations have been conducted on systems that were customized by system developers, these systems represent ideally adapted interfaces. However, most operational systems will not have the advantage of being customized by an expert, so performance with the operational system may be worse.

3.0 Evaluations of Prototype and Commercial Systems

The feasibility of the natural language approach is usually shown by building and demonstrating a prototype system prior to delivering it to the marketplace. Very few of these prototype systems have been evaluated by actually measuring user performance. The few that have been evaluated are reviewed here, beginning with a review of evaluations done under controlled laboratory conditions, and followed by a review of field studies.

Laboratory Evaluations

LADDER: Hershman et al. (1979) studied ten Navy officers using LADDER, a natural language query system designed to provide easy access to a naval database. The goal of this study was to simulate as closely as possible the actual operational environment in which LADDER would be implemented. To accomplish this, Navy officers were trained to be intermediaries between a hypothetical decision maker and the computer's database in a simulated search and rescue operation. Officers were given global requests for information and were asked to use LADDER to obtain the necessary information. Training consisted of a 30-minute tutorial session that included a lengthy discussion of LADDER's syntax and vocabulary, followed by one hour of practice that involved typing in canned queries and solving some simple problems. Compared to participants in other studies, these participants were moderately well trained.

Participants were largely successful at obtaining necessary information from the database, and were able to avoid requests for information not relevant to their task. Thus, it seems that participants were easily able to stay within the conceptual domain of the language. Hershman et al., however, report that LADDER parsed only 70.5 percent of the 366 queries submitted. Participants also used twice the number of queries that would have been required by an expert LADDER user. Almost 80 percent of the rejected queries were due to syntax errors. Apparently, LADDER's syntactic coverage was too limited for these moderately trained users. However, a contributing factor may have been the wording of the information requests given to the participants. These requests were designed not to be understood by LADDER, and this could have influenced the questions typed by the participants.

Hershman et al. concluded that the system could benefit from expanded syntactic and lexical coverage, but that users would still require training. Apparently, the training that was given to participants in this study was adequate for teaching the system's functional and conceptual coverage, but not for teaching its syntactic and lexical coverage.

PLANES: Tennant (1979) conducted several studies of prototype systems and came to similar conclusions. However, Tennant also provides evidence that users have trouble staying within the conceptual limits of a natural language query interface. Tennant studied users of two natural language question answering systems: PLANES and the Automatic Advisor. PLANES was used with a relational database of flight and maintenance records for naval aircraft, and the Automatic Advisor was used to provide information about engineering courses offered at a university. Participants were university students who were unfamiliar with the database domains of the two systems. Participants using PLANES were given a 600-word script that described the information contained in the database. Participants using the Automatic Advisor were only given a few sentences describing the domain. Participants received no other training. Problems were presented either in the form of partially completed tables and charts or in the form of long descriptions of high-level problems that users had to decompose.

Problems were generated either by people familiar with the databases or by people who had received only the brief introduction given to the participants. The purpose of having problems generated by people who had no experience with the system was to test the conceptual completeness of the natural language systems. If all of the problems generated by inexperienced users could be solved, then the system could be considered conceptually complete. However, this was not the case. Tennant does not report statistics, but claims that some of the problems generated by the inexperienced users could not have been solved using the natural language system, and consequently participants were less able to solve these problems than the problems generated by people familiar with the database.

Tennant concluded that the systems were not conceptually or functionally complete and that, without extending the conceptual coverage beyond the limits of the database contents, natural language systems would be as difficult to use as formal language systems.

NLC: Other laboratory evaluations of prototype systems have been conducted using a natural language programming system called NLC (Biermann et al., 1983). The NLC system allows users to display and manipulate numerical tables or matrices. Users are limited to commands that begin with an imperative verb and can only refer to items shown on their display terminals. Thus, the user is directly aware of some of the language's syntactic and conceptual limitations.

A study conducted by Biermann et al. (1983) compared NLC to a formal programming language, PL/C. Participants were asked to solve a linear algebra problem and a "grade-book" problem using either NLC or PL/C. Participants using NLC were given a written tutorial, a practice session, the problems, and some brief instructions on using an interactive terminal. Participants using PL/C were given the problems and used a batch card reading system. The 23 participants were just completing a course in PL/C and were considered to be in the top one-third of the class. Each participant solved one of the problems using NLC and the other using PL/C. Problems were equally divided among languages.

Results show that 10 of 12 (83 percent) participants using NLC correctly completed the linear algebra problem in an average of 34 minutes. This performance compared favorably to that of the PL/C group, in which 5 of 11 (45 percent) participants correctly completed this problem in an average of 165 minutes. For the "grade-book" problem, 8 of 11 (73 percent) NLC participants completed the problem correctly in an average of 68 minutes, while 9 of 12 (75 percent) PL/C participants correctly completed the problem in an average of 125 minutes. The reliability of these differences was not tested statistically, but it was clear that participants with 50 minutes of self-paced training could use a natural language programming tool on problems generated by the system designers. These participants also did at least as well as similar participants who used a formal language which they had just learned.

The system was able to process 81 percent of the natural language commands correctly. Most of the incorrect commands were judged to be the result of "user sloppiness" and non-implemented functions. Users stayed within the conceptual domain when they were given an explicit model of the domain (as items on the display terminal) and were given problems generated by the system designers. However, users seemed to have difficulty staying within the functional limitations of the system and were not always perfect in their syntactic and lexical performance.

The idea of referring to an explicit conceptual model that can be displayed on the screen is a good one. Biermann et al. also pointed out the necessity of providing immediate feedback via an on-line display to show users how their commands were interpreted. If there was a misinterpretation, it would be very obvious, and the command could be corrected with an UNDO instruction.

In another study of NLC, Fink et al. (1985) examined the training issue. Eighteen participants with little or no computer experience were given problems to solve with NLC.

To solve these problems, participants had to formulate conditional statements that the system was capable of understanding. Participants received no training, nor were they given examples of how to express these conditions. In other respects the experimental methodology was the same as in Biermann et al. (1983). The following is an example of an allowable conditional statement in NLC:

For i = 1 to 4, double row i if it contains a positive entry.
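As a sketch of the computation such a command describes, rendered here in NumPy (the matrix values are invented for illustration):

    import numpy as np

    m = np.array([[ 1, -2],
                  [-3, -4],
                  [ 5,  6],
                  [-7,  8]])

    # "For i = 1 to 4, double row i if it contains a positive entry."
    for i in range(4):            # rows 1 to 4 (0-indexed here)
        if (m[i] > 0).any():      # "if it contains a positive entry"
            m[i] *= 2             # "double row i"

    print(m)  # every row except [-3, -4] is doubled

The difficulty Fink et al. observed was not with this computation itself but with discovering phrasings of the condition that the system would accept.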

Fink et al. reported large individual differences in the participants' abilities to discover rules for generating conditional statements. One participant made only one error in solving 13 problems, whereas another participant could not solve any problems. In general, participants made large numbers of errors solving the first few problems and made few errors after discovering a method that worked. These findings support the conclusion that training is required for these kinds of natural language systems.

A commercial system: Customization can make it difficult to study prototype natural language interfaces. The flexibility of commercial systems demands customization, but a system's usability depends on its natural language processor and on how well the interface has been customized for the application. When usability problems occur, evaluators may have trouble determining whether the natural language processor or the customization is responsible.

Ogden & Sorknes (1987) evaluated a PC-based natural language query product that allowed users to do their own customizing. The evaluation goal was to assess how well a commercially available NLI would meet the needs of a database user who was responsible for customization, but who had no formal query training. The interface was evaluated by observing seven participants as they learned and used the product. They were given the product's documentation and asked to solve a set of 47 query-writing problems. Problems were presented as data tables with some missing items, and participants were asked to enter a question that would retrieve only the missing data.

The interface had difficulty interpreting participants' queries, with first attempts producing a correct result only 28 percent of the time. After an average of 3.6 attempts, 72 percent of the problems were answered correctly. For another 16 percent of the problems, users believed they had received a correct answer when they had not. This undetected error frequency would be unacceptable to most database users.

The results also showed that participants frequently used the system to view the database structure; an average of 17 times for each participant (36 percent of the tasks). This indicates that the system's natural language user needed to have specific knowledge of the database to use the interface. Clarification dialog occurred when the parser prompted users for more information. This occurred an average of 20 times for each participant (42 percent of the tasks). The high frequency of undetected errors, coupled with the frequent need for clarification dialogs, suggests that users were struggling to be understood.

The following is an example of how undetected errors can occur:

User: How many credits does "David Lee" have?
System: count: 2

User: What are the total credits for "David Lee"?
System: total credits: 7

Here the system gives a different answer to paraphrases of the same question. In the first case, the phrase "How many" invoked a count function, and "2" is the number of courses the student took. In the second case, the word "total" invoked a sum function. The system did not contain the needed semantic information about credits to determine that they should be summed in both cases. Thus, to use this system, users would need to know the functional differences between saying "How many..." and "What are the total..." The language was not habitable for the participants in the study, who did not know these differences, due to limitations in the system's functional coverage. The conclusion is that users need specific training on the functional characteristics of the language and database in much the same way as users of formal languages do. Without this training, users cannot be expected to customize their own language interface.
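The divergence above is easy to reproduce. A minimal sketch, assuming a hypothetical course table and a trigger-word mapping of the kind the product appears to have used (both the mapping and the data are assumptions, not the product's actual rules):

    # Hypothetical records: David Lee took two courses worth 3 + 4 credits.
    courses = [{"student": "David Lee", "credits": 3},
               {"student": "David Lee", "credits": 4}]

    def answer(question):
        rows = [r["credits"] for r in courses if r["student"] in question]
        if "how many" in question.lower():
            return len(rows)  # "How many" triggers COUNT of matching records
        if "total" in question.lower():
            return sum(rows)  # "total" triggers SUM of the credit values

    print(answer('How many credits does "David Lee" have?'))      # 2
    print(answer('What are the total credits for "David Lee"?'))  # 7

Without semantic knowledge that credits are a summable quantity, the trigger words alone decide which function runs, and only the user's wording separates the two answers.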

Harman and Candela (1990) report an evaluation of an information retrieval system's NLI. The prototype system, which used a very simple natural language processing (NLP) model, allowed users to enter unrestricted natural language questions, phrases, or a set of terms. The system would respond with a set of text document titles ranked according to relevance to the question. This system's statistical ranking mechanism considered each word in a query independently. Together these words were statistically compared to the records in an information file. This comparison was used to estimate the likelihood that a record was relevant to the question.
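A minimal sketch of this kind of word-at-a-time statistical ranking (the weighting formula and the documents are assumptions for illustration; Harman and Candela's actual scheme may have differed):

    import math
    from collections import Counter

    # A toy information file; the records are invented for illustration.
    docs = {"doc1": "ranking text documents by relevance to a question",
            "doc2": "boolean operators in retrieval systems"}

    def weight(word):
        # Rarer words count more (an IDF-style weight, assumed here).
        n = sum(word in text.split() for text in docs.values())
        return math.log((1 + len(docs)) / (1 + n)) + 1

    def score(query, text):
        counts = Counter(text.split())
        # Each query word contributes independently; no Boolean structure.
        return sum(counts[w] * weight(w) for w in query.split())

    query = "which documents discuss relevance ranking"
    print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
    # ['doc1', 'doc2']: doc1 shares 'documents', 'relevance', and 'ranking'

Note how a query word the file never contains ('discuss') simply contributes nothing, rather than causing the query to fail as it might in a Boolean system.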

The system contained test data already used by this study's more than 40 test participants. The questions submitted to the system were user-generated and relevant to each participant's current or recent research interests. Nine participants were proficient Boolean retrieval system users, and five others had limited experience. The rest of the participants were not familiar with any retrieval systems. All participants were very familiar with the data contained in the test sets.

Evaluating results from this and other text retrieval systems is problematic in that the measure of success is often a subjective evaluation of how relevant the retrieved documents are, and this makes it impossible to determine how successful the interaction was. However, Harman and Candela report that for a select set of text queries, 53 out of 68 queries (77 percent) retrieved at least one relevant record. The other reported results were qualitative judgements made by Harman and Candela based on the comments of the test participants. Generally, this study's participants found useful information very quickly, and first-time users seemed to be just as successful as the experienced Boolean system users. While the study did not test a Boolean system, Harman and Candela point out that, in contrast, first-time users of Boolean systems have little success.

Harman and Candela conclude that this type of NLI for information retrieval is a good solution to a difficult Boolean interface. While some queries, such as those requiring the NOT operator, are not handled correctly by statistically based approaches, these problems could be overcome. A study by Turtle (1994) that directly compares results of a Boolean and an NLI information retrieval system is reviewed later in this chapter.

Summary: Laboratory studies of natural language prototypes make possible the observation that users do relatively well if they 1) are knowledgeable about the domain or are given good feedback about the domain, 2) are given language-specific training, and 3) are given tasks that have been generated by the experimenters. Users perform poorly when 1) training is absent, 2) domain knowledge is limited, or 3) the system is functionally impoverished.

Tennant's studies suggest that user-generated tasks will be much more difficult to perform than experimenter-generated tasks. Since actually using an interface will involve user-generated tasks, it is important to evaluate NLIs under field conditions.

Field Studies

Field studies of prototype natural language systems have focused on learning how the language was used, what language facilities were used, and identifying system requirements in an operational environment with real users working with genuine data. Two of these evaluations (Krause, 1980; Damerau, 1981) are discussed here. Harris (1977) has reported a field test that resulted in a 90 percent parsing success rate, but since no other details were reported the study is hard to evaluate. Three other studies will also be examined: one is a field study of a commercial system (Capindale and Crawford, 1990), a second compares a prototype natural language system with a formal language system (Jarke et al., 1985), and a third describes a field test of a conversational hypertext natural language information system (Patrick and Whalen, 1992).

USL: Krause (1980) studied the use of the User Specialty Language (USL) system for database query answering in the context of an actual application. The USL system was installed as a German-language interface to a computer database. The database contained grade and other information about 430 students attending a German Gymnasium. The users were teachers in the school who wanted to analyze data on student development. For example, they wanted to know if early grades predicted later success. Users were highly motivated and understood the application domain well. Although the amount of training users received is not reported, the system was installed under optimal conditions: by its developers, after they interviewed users to understand the kinds of questions that would be asked.

The system was used over a one-year period. During this time, about 7300 questions were asked in 46 different sessions. For each session, a user would come to a laboratory with a set of questions and would use the system in the presence of an observer. Study data consisted of the session logs, observer's notes, and user questionnaires. The observer did not provide any user assistance.

The results reported by Krause come from an analysis of 2121 questions asked by one of the users. Generally, this user successfully entered questions into USL. Overall, only 6.9 percent of the user's questions could be classified as errors, and most of these (4.4 percent) were correctable typing errors. Krause attributes part of this low error rate to the observation that the user was so involved in the task and the data being analyzed that little effort was spent to learn more about the USL system. This user may have found some simple question structures that worked well and used them over and over again. This may be very indicative of how natural language systems will be used. It is unclear whether the user actually got the desired answers, since Krause provides no data on this. However, Krause reports two observations that suggest that the user did get satisfactory answers: 1) the user remained very motivated, and 2) a research report based on the data obtained with USL was published.

A major finding by Krause was that syntactic errors gave the user more difficulty than semantic errors. It was easier to recover from errors resulting from "which students go to class Y?" if it required a synonym change resulting in "which students attend class Y?" than if it required a syntactic change resulting in "List the class Y students." From these observations, Krause concludes that broad syntactic coverage is needed even when the semantics of the database are well understood by the users.

TQA: Damerau (1981) presents a statistical summary of the use of another natural language query interface called the Transformational Question Answering (TQA) system.

Over a one-year period, a city planning department used TQA to access a database consisting of records of each parcel of land in the city. Users, who were very familiar with the database, received training on the TQA language. Access was available to users whenever needed via a computer terminal connected to the TQA system. Session logs were collected automatically and consisted of a trace of all the output received at the terminal as well as a trace of the system's performance.

The results come primarily from one user, although other users entered some requests. A total of 788 queries were entered during the study year, and 65 percent of these resulted in an answer from the database. There is no way to know what proportion of these answers were actually useful to the users. Clarification was required when TQA did not recognize a word or when it recognized an ambiguous question, and thirty percent of the questions entered required clarification. In these cases, the user could re-key the word or select among alternate interpretations of the ambiguous question.

Damerau also reports instances of users echoing back the system's responses, which was not allowable input in the version of TQA being tested. The TQA system would repeat a question after transforming phrases found in the lexicon. Thus, the phrase "gas station" would be echoed back to the user as "GAS_STATION." Users would create errors by entering the echoed version. TQA was subsequently changed to echo some variant of what the user entered which would be allowable if entered.
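A minimal sketch of that repair, assuming a toy phrase lexicon (TQA's internals were certainly more involved): echo the question using surface phrases the parser also accepts, rather than internal tokens.

    # Toy phrase lexicon; the entries are assumptions for illustration.
    LEXICON = {"gas station": "GAS_STATION", "parcel of land": "PARCEL"}

    def internal_form(question):
        for phrase, token in LEXICON.items():
            question = question.replace(phrase, token)
        return question

    def echo(question):
        # Map internal tokens back to surface phrases before echoing,
        # so that re-entering the echoed text is itself valid input.
        shown = internal_form(question)
        for phrase, token in LEXICON.items():
            shown = shown.replace(token, phrase)
        return shown

    q = "list each gas station on this parcel of land"
    print(internal_form(q))  # ...GAS_STATION...PARCEL (the old, unsafe echo)
    print(echo(q))           # surface phrases restored; safe to re-enter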

The results reported by Damerau are mainly descriptive, but the researchers were encouraged by these results and reported that users had positive attitudes toward the system. Evaluating this study is difficult because no measures of user success are available.

INTELLECT: Capindale and Crawford (1990) report a field evaluation of INTELLECT, the first commercial NLI to appear on the market. INTELLECT is an NLI to existing relational database systems. Nineteen users of the data, who had previously accessed it through a menu system, were given a one-hour introduction to INTELLECT. They were then free to use INTELLECT to access the data for a period of ten weeks. Users ranged in their familiarity with the database but were all true end-users of the data. They generated their own questions, presumably to solve particular job-related problems, although Capindale and Crawford did not analyze the nature of these questions. The capabilities and limitations of INTELLECT are summarized well by Capindale and Crawford, who make special note of Martin's (1985) observation that the success of an INTELLECT installation depends on building a custom lexicon and that "To build a good lexicon requires considerable work" (p. 219). Surprisingly, no mention is made of the effort or level of customization that went into installing INTELLECT for their study. This makes it difficult to evaluate Capindale and Crawford's results.

Most of the reported results are questionnaire data obtained from users after the ten-week period and are of little consequence to the present discussion, except to say that the users were mostly pleased with the idea of using an NLI and rated many of INTELLECT's features highly. Objective data was also recorded in transaction logs, which Capindale and Crawford analyzed by defining success as the parse success rate. The parse success rate was 88.5 percent. There was no attempt to determine task success rate, as is often the case in field studies.

Comparison to Formal Language: Since no system can cover all possible utterances of a natural language, NLIs are in some sense formal computer languages. Therefore, these systems must be compared against other formal language systems in regard to function, ease of learning and recall, etc. (Zoeppritz, 1986). In a comprehensive field study that compared a natural language system (USL) with a formal language system (SQL), Jarke et al. (1985) used paid participants to serve as "advisors" or surrogates to the principal users of a database. The database contained university alumni records and the principal users were university alumni officers. This could be considered a field study because USL and SQL were used on a relational database containing data from an actual application, and the participants' tasks were generated by the principal users of these data. However, it could also be considered a laboratory evaluation because the participants were paid to learn and use both languages, and they were recruited solely to participate in the study. Unfortunately, it lacked the control of a laboratory study since 1) participants were given different tasks (although some tasks were given to both language groups), and 2) the USL system was modified during the study. Also, the database management system was running on a large time-shared system; response times and system availability were poor and varied between language conditions.

Eight participants were selected non-randomly from a pool of 20 applicants who, in the experimenters' judgment, represented a homogeneous group of young business professionals. Applicants were familiar with computers, but had only limited experience with them. Classroom instruction was given for both SQL and USL, and each participant learned and used both. Instruction for USL, the natural language system, was extensive and specific, and identified the language's restrictions and strategies to overcome them.

Analysis of the tasks generated by the principal users for use by the participants indicated that 15.6 percent of the SQL tasks and 26.2 percent of the USL tasks were unanswerable. The proportion of these tasks that exceeded the database's conceptual coverage versus the proportion that exceeded the query language's functional coverage is not reported. Nevertheless, Jarke et al. conclude that SQL is functionally more powerful than USL. The important point is, however, that principal users of the database (who knew the conceptual domain very well) generated many tasks that could not be solved by either query language. Thus, this study supports the findings suggested by Tennant (1979): users who know the conceptual domain but who have had no experience with computers may still ask questions that cannot be answered.

SQL users solved more than twice as many tasks as USL users. Of the fully solvable tasks, 52.4 percent were solved using SQL versus 23.6 percent using USL. A fairer comparison is to look at the paired tasks (tasks that were given to both language groups), but Jarke et al. do not present these data clearly. They report that SQL was "better" on 60.7 percent, that USL was "better" on 17.9 percent, and that SQL and USL were "equal" on 21.4 percent of the paired tasks. They do not indicate how "better" or "equal" performance was determined. Nevertheless, the results indicate that it was difficult for participants to obtain responses to the requests of actual users regardless of which language was used. Furthermore, the natural language system tested in this study was not used more effectively than the formal language system.

In trying to explain the difficulty users had with natural language, Jarke et al. cited lack of functionality as one main reason for task failure. Of the solvable tasks, 24 percent were not solved because participants tried to invoke unavailable functions. Apparently, the USL system tested in this study did not provide the conceptual and/or functional coverage that was needed for the tasks generated by the actual users of the database.

Many task failures had to do with the system's hardware and operating environment. System unavailability and interface problems contributed to 29 percent of the failures. In contrast, system and interface problems contributed to only 7 percent of the task failures when SQL was used. This represents a source of confounding between the two language conditions and weakens the comparison that can be made. Therefore, little can be said about the advantage of natural language over formal languages based on this study.

It is clear, however, that the prototype USL system studied in this evaluation could not be used effectively to answer the actual questions raised by the principal users of the database. It should be noted that the system was installed and customized by the experimenters and not by the system developers. Thus, a sub-optimal customization procedure might have been a major contributor to the system's performance.

COMODA: Patrick and Whalen (1992) conducted a large field test of COMODA, their conversational hypertext natural language information system for distributing information about the disease AIDS to the public. In this test, users with computers and modems could call a dial-up AIDS information system and use natural language to ask questions or just browse.

Whalen and Patrick report that during a two-month period the COMODA system received nearly 500 calls. The average call lasted approximately 10 minutes and involved an average of 27 exchanges (query-response, request) between the user and the system. Of these, approximately 45 percent were direct natural language queries, and though they provide no specific numbers, Whalen and Patrick report that the system successfully answered many of them.

Users were obtained by advertising in local newspapers, radio, and television in Alberta, Canada. A close analysis of the calls from the final three weeks of data collection evaluated not only those inputs from users that the system could parse, but also the success rates of the system's responses. Correct answers were those that provided the information requested by the user, or that gave the response "I don't know about that." when no information responding to the request was available. Incorrect responses were those that provided the wrong information when correct information was available, provided wrong information when correct information was not available, gave the response "I don't know about that." when correct information was available, or responded to a query so ambiguous that a correct answer could not be identified. Whalen and Patrick report a 70 percent correct response rate.

Patrick and Whalen were surprised by the number of requests to browse, since the system was designed to enable users to easily obtain information about a particular topic area. However, since topic focus requires knowledge of the domain by a user, and since there is no way for the experimenter to know how knowledgeable users were, the requests for browsing might be explained by the novelty of the system and users' interest in exploring it while learning about the information it contained.

It also isn't clear from the study whether users thought they were interacting with a human or a computer. Previously, Whalen and Patrick reported that their system does not lead users to believe that they are interacting with a human (Whalen and Patrick, 1989). However, Whalen went on to enter a variant of this system in the 1994 Loebner competition, where he took first prize. Though his system fooled none of the judges into thinking it was a human (which is the goal of the competition), he did receive the highest median score of all the computer entrants. An important difference between Whalen's entry and other systems is that it contains no natural language understanding component. Like COMODA, it is limited to recognizing the actual words and phrases people use to discuss a topic.

The result of this work contributes significantly to the notion that NLP is not an essential component of successful NLIs, and that NLIs are useful for databases other than relational databases.

Summary: Field studies are not usually intended to be generalized beyond their limited application. Only the relative success of the implementations can be assessed. The studies that have been presented offer mixed results. In general, the results of the field studies tend to agree with the laboratory study results. If users are very familiar with the database, their major difficulties are caused by syntactic limitations of the language. However, if the system does not provide the conceptual or functional coverage the user expects, performance will suffer dramatically. If this is the case, it appears that training will be required to instruct users about the functional capabilities of the system, and the language must provide broad syntactic coverage. The type of training that is required has not been established.

Natural Language Versus OtherInterface Designs

A first issue concerns the question of whether anatural language would really be any better foran interface than a formal artificial languagedesigned to do the same task. The previouslydiscussed studies that compared prototype lan-guages with artificial languages reportedmixed results. In the case of a database appli-cation, Jarke et al. (1985) showed an advantagefor the artificial language, whereas in a pro-gramming application Biermann et al. (1983)showed an advantage for the natural language.


This section reviews other studies that compare natural and artificial languages.

Small and Weldon (1983) simulated two database query languages using WOz. One language was based on a formal language (SQL) and had a fixed syntax and vocabulary. The other allowed unrestricted syntax and free use of synonyms. However, users of both languages had to follow the same conceptual and functional restrictions. Thus, users of the natural language had to specify the database tables, columns, and search criteria to be used to answer the query. For example, the request "Find the doctors whose age is over 35." would not be allowed because the database table that contains doctors is not mentioned. Thus, a valid request would have been "Find the doctors on the staff whose age is over 35." Because Small and Weldon were attempting to control for the information content necessary in each language, this study compared unrestricted syntax and vocabulary to restricted syntax and vocabulary while trying to control for functional capabilities.
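To make the shared restriction concrete, the following minimal sketch (ours, not Small and Weldon's; the table and column names are hypothetical, and Python is used only for brevity) shows the kind of check their simulated system effectively enforced on both languages:

    # A request was acceptable only if it named the relevant database
    # table and column along with the search criterion. The names here
    # are hypothetical stand-ins for Small and Weldon's actual database.
    REQUIRED_REFERENCES = {"staff", "age"}  # table and column for this query

    def mentions_required_references(request: str) -> bool:
        words = set(request.lower().rstrip(".?").split())
        return REQUIRED_REFERENCES.issubset(words)

    print(mentions_required_references("Find the doctors whose age is over 35"))
    # False: no table named
    print(mentions_required_references("Find the doctors on the staff whose age is over 35"))
    # True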

The participant's task was to view a subset of the data and write a query that could retrieve that subset. Although it is unclear how much of the required functional information (e.g. table and column names) was contained in these answer sets, this method of problem presentation may have helped the natural language participants include this information in their requests. Participants used both languages in a counterbalanced order. Ten participants used natural language first and ten participants used the artificial language first. The natural language users received no training (although they presumably were given the names of the database tables and columns), and the artificial language users were given a self-paced study guide of the language. The participants were then given four practice problems followed by 16 experimental problems with each language.

Results showed that there was no difference in the number of language errors between the two languages. It appears that the difficulty the natural language users had in remembering to mention table and column names was roughly equivalent to the difficulty the artificial language participants had in following the syntactic and lexical restrictions while mentioning the same names. The structured order of the formal language must have helped those participants remember to include the column and table names. Thus, it is likely that it was more difficult for the natural language participants to remember to include table and column information than it was for the formal language participants. This analysis is based on the assumption that the participants in the formal language condition made more syntactic and lexical errors than the participants using natural language; however, Small and Weldon present only overall error rates, so this assumption may be incorrect.

The results also show that participants using the structured language could enter their queries faster than those using the natural language, especially for simple problems. Small and Weldon use this result to conclude that formal languages are superior to natural languages. However, the tested set of SQL was limited in function compared to what is available in most implementations of database query languages, and the speed advantage reported for SQL was not as pronounced for more complicated problems. Thus, a better conclusion is that NLIs should provide more function than their formal language counterparts if they are going to be easier to use than formal languages. To provide a flexible syntax and vocabulary may not be enough.

Shneiderman (1978) also compared a natural language to a formal relational query language. However, unlike Small and Weldon (1983), Shneiderman chose not to impose any limits on the participants' use of natural language. Participants were told about a department store employee database and were instructed to ask questions that would lead to information about which department they would want to work in. One group of participants first asked questions in natural language and then used the formal language. Another group used the formal language first and then natural language. In the formal query language condition, participants had to know the structure and content of the database, but in the natural language condition they were not given this information. The results showed that the number of requests that could not be answered with data in the database was higher when using natural language than when using the formal language. This was especially true for participants in the natural language first condition. This should not be surprising given that the participants did not know what was in the database. However, the result highlights the fact that users' expectations about the functional capabilities of a database will probably exceed what is available in current systems.

In another laboratory experiment, Borenstein (1986) compared several methods for obtaining on-line help, including a human tutor and a simulated natural language help system. Both allowed for unrestricted natural language, but the simulated natural language help system required users to type queries on a keyboard. He compared these to two other traditional methods: the standard UNIX "man" and "key" help system, and a prototype window and menu help system. The UNIX help system was also modified to provide the same help texts that were provided by the menu and natural language systems. The participants were asked to accomplish a set of tasks using a UNIX-based system but had prior experience only with other computer systems. As a measure of the effectiveness of the help systems, Borenstein measured the time these participants needed to complete the tasks.

The results showed that participants completed tasks fastest when they had a human tutor to help them, and slowest when they used the standard UNIX help system. But Borenstein found little difference between the modified UNIX command interface, the window/menu system, and the natural language system. Because all of the methods provided the same help texts, Borenstein concluded that the quality of the information provided by the help system is more important than the interface.

Hauptmann and Green (1983) compared an NLI with a command language and a menu-based interface for a program that generated simple graphs. Participants were given hand-sketched graphs to reproduce using the program. The NLI was embedded in a mixed-initiative dialog in which either the computer or the users could initiate the exchange. Hauptmann and Green report no differences between the three interface styles in the time to complete the task or in the number of errors. However, they do report many usability problems with all three interfaces. One such problem, which may have been more critical for the NLI, was the restrictive order in which operations could be performed with the system. Also, the NLI was a simple keyword system that was customized based on what may have been too small a sample. The authors concluded that NLIs may give no advantage over command and menu systems unless they can also overcome rigid system constraints by adding flexibility not contained in the underlying program.

Turtle (1994) also compared an NLI to an artificial language interface. He compared the performance of several information retrieval systems that accept natural language queries to the performance of expert users of a Boolean retrieval system when searching full-text legal materials. In contrast to all other studies described in this section, Turtle found a clear advantage for the NLI.

Page 17: Using Natural Language Interfaces - Web.nmsu.eduogden/pdf/UsingNaturalLanguageInterfaces.p… · User selection and training: Evaluating any user-system interface requires test participants

17

In Turtle's study, experienced attorneys developed a set of natural language issue statements to represent the type of problems lawyers would research. These natural language statements were then used as input to several commercial and prototype search systems. The top 20 documents retrieved by each system were independently rated for relevance. The issue statements were also given to experienced users of a Boolean query system (WESTLAW). These users wrote Boolean queries and were allowed to iterate each against a test database until they were satisfied with the results. The sets of documents obtained using these queries contained fewer relevant ones than the sets obtained by the NLI systems.

This is impressive support for using an NLI for searching full-text information, and it amplifies the earlier results of Harman and Candela (1990). The weakness of this study, however, is that it did not actually consider or measure user interactions with the systems. There is little detail on how the issue statements were generated, and one can only guess how different these might have been had they been generated by users interacting with a system.

Another study presents evidence that an NLI may be superior to a formal language equivalent. Napier et al. (1989) compared the performance of novices using Lotus HAL, a restricted NLI, with Lotus 1-2-3, a menu/command interface. Different groups of participants were each given a day and a half of training on the respective spreadsheet interfaces and then solved sets of spreadsheet problems. The Lotus HAL users consistently solved more problems than did the Lotus 1-2-3 users. Napier et al. suggest that the HAL users were more successful because the language allowed reference to spreadsheet cells by column names. It should be pointed out that Lotus HAL has a very restricted syntax with English-like commands. HAL does, however, provide some flexibility, and it clearly provides more functionality than the menu/command interface of Lotus 1-2-3.

Walker and Whittaker (1989) conducted a field study that compared a menu-based and a natural language database query language. The results they report regarding the usefulness of and problems with a restricted NLI for database access are similar to those of the studies previously summarized (e.g. high task failure rates due to lexical and syntactic errors). An interesting aspect of their study was the finding that a set of users persisted in using the NLI despite a high frequency of errors. Although this set of users was small (9 of 50), this finding suggests the NLI provided a necessary functionality that was not available in the menu system. However, the primary function used by these users was a sort function (e.g. "... by department"). This function is typically found in most formal database languages and may reflect a limitation of the menu system rather than an inherent NLI capability. Walker and Whittaker also found a persistent use of coordination, which suggests another menu system limitation. Coordination allows the same operation on more than one entity at a time (e.g. "List sales to Apple AND Microsoft") and is also typically found in formal database languages. Thus, it seems that this study primarily shows that the menu system better met the needs of most of the users.

Summary: With the exception of Turtle (1994) and Napier et al. (1989), there is no convincing evidence that interfaces that allow natural language have any advantage over those restricted to artificial languages. It could be effectively argued that some of these laboratory investigations put natural language at a disadvantage. In the Small and Weldon (1983) study, natural language was functionally restricted, and in the Shneiderman (1978) study users were uninformed about the application domain. These are unrealistic constraints for an actual natural language system.

When the NLI provides more functionality than the traditional interface, then clearly, as in the case of Lotus HAL, an advantage can be demonstrated. What needs to be clarified is whether the added functionality is inherently due to properties of natural language, or whether it can be engineered as part of a more traditional GUI. Walker (1989) discusses a taxonomy of communicative features that contribute to the efficiency of natural language and suggests, along with others (e.g. Cohen et al., 1989), that an NLI could be combined with a direct manipulation GUI for a more effective interface. The evidence suggests that a well-designed restricted language may be just as effective as a flexible natural language. But what is a well-designed language? The next section will look at what users expect a natural language to be.

4.0 Design Issues

Several laboratory experiments have been conducted to answer particular design questions concerning NLIs. The remainder of the chapter will address these design issues.

What is Natural?

Early research into NLIs was based on the premise that unconstrained natural language would prove to be the most habitable and easy-to-use method for people to interact with computers. Later, Chin (1984), Krause (1990), and others observed that people converse differently with computers (or when they believe that their counterpart is a computer) than they do when their counterpart is (or they believe their counterpart to be) another person. For example, Chin (1984) discovered that users communicating with each other about a topic will rely heavily on context. In contrast, users who believed they were talking to a computer relied less on context. This result, as well as those of Guindon (1987), suggests that context is shared poorly between users and computers. Users may frame communication based upon notions about what computers can and cannot understand. Thus, 'register' might play an important role in human-computer interaction via NLIs.

Register: Fraser (1993) reminds us that register can be minimally described as a "variety according to use." In human-human conversation, participants discover linguistic and cognitive features (context, levels of interest, attention, formality, vocabulary, and syntax) that affect communication. Combined, these features can be referred to as register. Fraser points out that people may begin communicating using one register, but, as the conversation continues, the register may change. Convergence in this context refers to the phenomenon of humans adapting to, or adopting the characteristics of, each other's speech in ways that facilitate communication. Fraser suggests that for successful NLIs it is the task that should constrain the user and the language used, not the sublanguage or set of available commands. In other words, it would be inappropriate for an NLI to constrain a user to a limited vocabulary and syntax. Rather, users should be constrained in their use of language by the task and domain.

Register refers to the ways language use changes in differing communication situations, and is determined by speaker beliefs about the listener and the context of the communication. Several laboratory studies have been conducted to investigate how users would naturally communicate with a computer in an unrestricted language. For example, Malhotra (1975) and Malhotra and Sheridan (1976) reported a set of WOz studies that analyzed users' inputs to a simulated natural language system. They found that a large portion of the input could be classified into a fairly small number of simple syntactic types. In the case of a simulated database retrieval application, 78 percent of the utterances were parsed into ten sentence types, with three types accounting for 81 percent of the parsed sentences (Malhotra, 1975). It seems that users are reluctant to put demands on the system that they feel might be too taxing.

Malhotra's tasks were global, open-ended problems that would encourage a variety of expressions. In contrast, Ogden and Brooks (1983) conducted a WOz simulation study in which the tasks were more focused and controlled. They presented tables of information with missing data, and participants were to type one question to retrieve the missing data. In an unrestricted condition, Ogden and Brooks found that 89 percent of the questions could be classified into one global syntactic category (see the section 'Restrictions on Syntax' for a description of this type). Thus, participants seem to naturally use a somewhat limited subset of natural language. These results have been replicated by Capindale and Crawford's (1990) study of INTELLECT. Using the same syntactic analysis, they found 94 percent of the questions fell into the same category identified by Ogden and Brooks. Burton and Steward (1993), also using the same method, report 95 percent of the questions to be of the same type.

Ringle and Halstead-Nussloch (1989) conducted a series of experiments to explore the possibilities for reducing the complexity of natural language processing. They were interested in determining whether user input could be channeled toward a form of English that was easier to process but that retained the qualities of natural language. The study was designed to test whether feedback could be used to shape user input by introducing an alternative to an ordinary human-human conversation model that would maintain users' perception of a natural and effective question-and-answer dialogue. Nine college undergraduates who were classified as casual computer users of email and word processors were the participants in this study. The task was to use an unfamiliar electronic text processor to edit and format an electronic text file to produce a printed document that was identical to a hardcopy version they had been given. The computer terminal was split into two areas: one for editing an electronic document, and a second dialog window for communicating with a human tutor. Human tutors used two modes of interaction: a natural mode, where tutors could answer questions in any appropriate manner, and a formal mode, in which tutors were instructed to simulate the logical formalism of an augmented transition network (ATN). For example, in simulating the ATN, response times should be longer when input was ill formed or contained multiple questions. Each participant had two sessions, one in formal mode and one in natural mode. Four participants began with formal mode sessions, and five with natural. The study's two measurement factors were tractability (how easily the simulated ATN could correctly parse, extract relevant semantic information, identify the correct query category, and provide a useful reply) and perceived naturalness. Tractability was measured by analyzing and comparing the transcripts of natural versus formal user-tutor dialogs for fragmentation, parsing complexity, and query category. Perceived naturalness was determined by looking at users' subjective assessments of the usability and flexibility of the natural versus formal help modes.

Of the 480 user-tutor exchanges in the experiment, 49 percent were in the natural mode and 51 percent in the formal mode. Dialogue analysis for fragmented sentences found a rate of 21 percent in natural mode versus 8 percent in formal mode. This suggests that feedback in the formal mode may have motivated queries that were syntactically well formed. A five-point scale was used for evaluating utterance parsing complexity: 1) could be easily handled by the hypothetical ATN parser; 2) contained some feature (misspelling, sentence fragment, unusual word) that might make parsing difficult; 3) had two or more difficult parsing features; 4) had features that would probably be handled incorrectly; 5) could not be handled at all. Ringle and Halstead-Nussloch report that complexity ratings were significantly lower for formal dialogues than for natural ones. More interesting are their results indicating that both groups' complexity levels fell after the first session. This result was used to suggest that the shaping of user queries accomplished in the first formal session persisted. Another explanation might be that the difficulties users were having with the system became less complex over time. Numerical values of the users' subjective evaluations of system usability and usefulness are not given. It is reported that users rated both natural and formal modes very highly, and much higher than users would rate 'conventional' on-line help.

Ringle and Halstead-Nussloch's results seem to support their conclusion that user input can be shaped in ways that make it more tractable for natural language interfaces, though the result would be stronger had their study used an actual ATN system. Further, it appears that differences in register between formal and informal language can be used to effect this shaping. Because their study used so few participants, it is difficult to evaluate the claim that perceived naturalness does not decrease as a result of switching from natural to formal dialog modes. However, the perceived usability of the system did not appear to change for users in this study.

These results are encouraging for the prospect of defining usable subsets for natural language processing. Less encouraging results are provided by Miller (1981). He asked computer-naive participants to write procedural directions intended to be followed by others. They were given six file-manipulation problems and were required to carefully enter into the computer a detailed procedure for solving the problems. Their only restriction was that they were limited to 80 characters of input for each step in the procedure.

The participants in Miller's study did not provide the type of specific information a computer would require to be able to solve the problems. Participants left out many important steps and were ambiguous about others. They were obviously relying on an intelligent listener to interpret what was entered. Different results may have been obtained had the participants been interacting with a simulated computer, and thus been writing instructions intended for the machine instead of for other people. However, the study shows that a true natural language programming environment would need to provide very broad conceptual and functional domains.

Another study that investigated how people naturally express computer functions was conducted by Ogden and Kaplan (1986). The goal of the Ogden and Kaplan study was to observe participants' natural use of AND and OR in the context of a natural language query interface.

Thirty-six participants with word-processing experience, but with little or no database experience, were given an explanation of the general concept of a database and were told that they would be testing a database program that could answer English questions. For each problem, participants were given a table shown on a computer screen. This table consisted of hypothetical student names followed by one or more columns of data pertaining to the student (such as HOMESTATE and MAJOR). Each table contained missing names. The participant was to enter a question that would retrieve the missing names by identifying them using the information in the other columns of the table. Thus, by controlling the information in the additional columns, Ogden and Kaplan controlled the type of set relationships that were to be expressed.

Results showed that participants always used OR correctly to indicate union, but AND was used to indicate both union and intersection. For problems that required union, participants used OR on 60 percent of the problems, used AND on 30 percent, and used neither AND nor OR on the remaining 10 percent. On the other hand, participants almost always used AND when they wanted to specify intersection (97 percent). The use of OR to specify intersection was very rare (1 percent). Thus, programs written to accept natural language can safely interpret OR as a logical OR but will need to process additional information to interpret AND correctly.

The data showed that participants tended to use 'and' to conjoin clause groups in the context of union problems, but not in the context of intersection problems. For example, for a union problem, participants would type, "Which students live in Idaho and which major in psychology?" For an intersection problem they would type, "Which students live in Idaho and major in psychology?" In the first case two clauses were conjoined, and in the second case two adverbial phrases were conjoined. Thus, Ogden and Kaplan were able to identify a consistent pattern of using 'and' that could be used to clarify its meaning. Their conclusion was that, without training, users would not be able to specify the meaning of 'and' clearly enough for unambiguous natural language processing. Processors will need to be able to recognize an ambiguous 'and' and prompt users to clarify the meaning.
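The pattern Ogden and Kaplan identified lends itself to a simple heuristic. The sketch below is our illustration, not their processor; the clause-marker list is a crude, hypothetical stand-in for real parsing. It reads 'and' as union when the second conjunct opens a new clause, and as possibly ambiguous intersection otherwise:

    # Heuristic suggested by Ogden and Kaplan's data: 'and' joining two
    # full clauses tends to mean union; 'and' joining two predicate
    # phrases tends to mean intersection and may need user confirmation.
    CLAUSE_MARKERS = {"which", "who", "what", "list", "find"}

    def interpret_and(question: str) -> str:
        left, _, right = question.lower().partition(" and ")
        first_word = right.split()[0] if right.split() else ""
        if first_word in CLAUSE_MARKERS:
            return "union"
        return "intersection (prompt the user to confirm)"

    print(interpret_and("Which students live in Idaho and which major in psychology?"))
    # union
    print(interpret_and("Which students live in Idaho and major in psychology?"))
    # intersection (prompt the user to confirm)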

The studies by Malhotra and by Ogden and Brooks suggest that users impose natural restrictions on themselves when interacting with a computer, while the studies by Miller and by Ogden and Kaplan suggest that natural language is too informal to clearly communicate with a computer. A natural language subset that people would find easy to use seems possible, but the formal constraints required for clarity may cause problems. The next sections review work that was designed to understand the restrictions users can be expected to learn.

Restrictions on Vocabulary

Several studies have shown that participants can communicate effectively with a restricted subset of English. For example, Kelly and Chapanis (1977) identified a 300-word vocabulary that allowed participants to communicate as effectively as participants who had an unrestricted vocabulary. Kelly and Chapanis first determined what words were used by unrestricted two-person teams when solving particular problems while communicating via teletype. They then restricted a subsequent group to the 300 words most commonly used by the unrestricted group and found that the restricted group solved the same set of problems as quickly and as accurately as the unrestricted group. In another set of studies, Ford, Weeks, and Chapanis (1980) and Michaelis (1980) encouraged groups of participants to restrict their dialog to as few words as possible and compared their performance to groups who were given no incentive to be brief. They found that the restricted-dialog groups performed a problem-solving task in less time than the unrestricted groups.

While Kelly and Chapanis (1977) used a vocabulary restricted to an empirically determined subset, Ogden and Brooks (1983) tested a group of participants who were restricted to a vocabulary defined by an existing database. In this condition, participants had to enter questions using only those nouns contained in the database as table names, column names, or data. The participants were given (and always had available) a list of these nouns that showed the structure of the database. Verbs, prepositions, articles, and other word classes were allowed, and no syntactic restrictions were imposed. On the first attempt at solving a new problem, 23 percent of the questions that participants entered contained non-allowed words. It can be concluded that this type of lexical restriction was not very natural. However, by the third attempt at a question, these errors were reduced to 7 percent of the questions entered.

Summary: The findings from the restricted-lexicon studies support an observation by Krause (1980) that users of USL could recover easily from errors caused by omissions in USL's lexicon. Thus, it can be concluded that users can adapt to a limited vocabulary, especially if it has been derived on the basis of empirical observations. However, Krause and others (e.g. Hershman et al., 1979) provide evidence that users have a difficult time recovering from syntactic errors. The next section reviews work related to this issue.
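The lexical restriction itself is easy to state programmatically. The following is a minimal sketch of the kind of check the Ogden and Brooks condition implies (ours; the word lists are hypothetical stand-ins for a real schema and lexicon):

    # Content nouns must come from the database (table names, column
    # names, or data); verbs, prepositions, articles, etc. are free.
    ALLOWED_NOUNS = {"employee", "salary", "department"}
    FUNCTION_WORDS = {"list", "find", "the", "of", "each",
                      "whose", "is", "in", "what", "are"}

    def disallowed_words(question: str) -> set:
        words = set(question.lower().rstrip("?.").split())
        return words - ALLOWED_NOUNS - FUNCTION_WORDS

    print(disallowed_words("List the salary of each employee"))  # set(): allowed
    print(disallowed_words("List the pay of each worker"))       # {'pay', 'worker'}: flagged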

Restrictions on Syntax

While the studies outlined above show that humans can communicate with each other within a restricted vocabulary, these results may not generalize to restricted human-computer communication. There are, of course, many syntactic constructions in which a restricted vocabulary could be used. However, the evidence cited previously suggests that users could be comfortable using a restricted-English syntax (Malhotra, 1975; Malhotra and Sheridan, 1976; Ogden and Brooks, 1983).

Hendler and Michaelis (1983) followed a methodology similar to that of Kelly and Chapanis (1977) to study the effects of a restricted syntax on natural language dialog. Participant pairs were to solve a problem by communicating with each other. Communication was effected by entering messages into a computer terminal. One participant-pair group was allowed unrestricted communication, while another group had to follow a limited context-free grammar. This grammar was selected to be easily processed by a computer. The restricted group was told to use a limited grammar but was not told what the grammar rules were. Each group received three problems in a counterbalanced order. The results showed that the restricted participant pairs took longer to solve the problems than the unrestricted participants in Session 1, but this difference went away in Sessions 2 and 3. It appears that these participants adapted to the restricted syntax after solving one problem.

Ogden and Brooks (1983) also imposed a syntactic restriction on a group of 12 participants entering questions into a simulated query language processor. They imposed a context-free pragmatic grammar in which the terminal symbols of the grammar referred to attributes of the database. The grammar restricted users to questions that began with an optional action phrase (e.g. "What are..." or "List..."), followed by a required phrase naming a database retrieval object (e.g. "...the earnings..."), which could optionally be followed by any number of phrases describing qualifications (e.g. "...of the last two years"). Unlike the participants in the Hendler and Michaelis study, these participants were informed of the constraints of the grammar and were given several examples and counter-examples of allowable sentences.
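Rendered as a pattern, the grammar's top level looks roughly like the sketch below (our reconstruction from the description above; the phrase inventories are hypothetical and far smaller than the real grammar's):

    import re

    ACTION    = r"(?:what are|what is|list|find|show)\s+"   # optional action phrase
    OBJECT    = r"the\s+\w+"                                # required retrieval object
    QUALIFIER = r"(?:\s+(?:of|for|in|with)\s+[\w\s]+)*"     # optional qualifications

    QUESTION = re.compile(rf"^(?:{ACTION})?{OBJECT}{QUALIFIER}[?.]?$", re.IGNORECASE)

    print(bool(QUESTION.match("List the earnings of the last two years")))  # True
    print(bool(QUESTION.match("What are the earnings?")))                   # True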

Results showed that 91 percent of the first attempts at a question were syntactically correct, and this improved to 95 percent by the third attempt. Participants had the most trouble avoiding expressions with noun adjectives (e.g. "blue parts" instead of "parts that are blue"). These results are limited in that the syntactic constraints were not very severe, but they do suggest that users can control their input based on these types of instructions.

Jackson (1983) imposed a set of syntactic restrictions on participants who were using a command language to perform a set of computer tasks. Each task involved examining and manipulating newspaper classified ads stored and displayed by the computer. Participants were given a description of a function the computer was to perform. They were then told to enter commands having a single action and a short description of the objects to be acted upon. For half of the participants, commands were to be constructed in an English-like order, with an action followed by an object description (e.g. "Find the VW ads"). The other participants were asked to reverse the order and specify the object first, followed by an action (e.g. "VW ads find"). It was hypothesized that the more natural English-like action-object construction would be easier to enter than the unnatural object-action construction.

Commands were collected from 30 experienced and 30 inexperienced computer users who were fully informed about the syntactic restriction. Comparisons between the two conditions were made on the basis of command entry time. Experienced users were faster than inexperienced users, but both groups could enter object-action commands as quickly as action-object commands. This suggests that the experience of using natural English does not transfer to learning and using a computer command language. It also suggests that the experience of using an action-object command language (the experienced group) does not negatively transfer to learning and using an object-action command language. Syntax does not seem to matter for constrained languages like this.

Jackson also reports that users had little difficulty generating constrained commands that had only one action and noun phrase while leaving out common syntactic markers like pronouns. The participants had more trouble including all the necessary information in their requests than they did in getting the words in a particular order. Participants tended to leave information out of commands that could be inferred from context. Over half (53 percent) of the initial commands omitted the type of object from the object's description. For example, participants would enter "Find VW" instead of "Find VW ads." The language was not habitable because of functional limitations; the language processor could not infer object types.

Burton and Steward (1993) also analyzed the use of ellipsis (e.g., asking "Pens?" after asking "How many pencils?") and compared it to an NLI with a feature that allowed the previous question to be edited and re-submitted. When participants had both features available, re-editing a previous query was chosen twice as often as using ellipsis. However, an interface with both features seemed to be harder to use than an interface that had only one. They reported few differences in success rates when using either the ellipsis or edit interfaces.

Summary: The results of these laboratory studies of syntactic restrictions suggest that people adapt rapidly to these types of constraints when interacting with a computer. None directly investigated the observation that syntactic errors are hard to correct, although the Ogden and Brooks study indicates an ability to recover. However, in the language they tested, specific syntactic errors could be detected by the computer, and error messages could indicate specific solutions to the participants, who were fully informed about the syntactic restrictions of the language. The issue of feedback is discussed in a later section. However, Jackson's findings suggest that functional restrictions will be harder to adapt to than syntactic restrictions. The next section looks at this issue.

Functional Restrictions

Omitting information that can be inferred from context has been referred to as a functional capability of natural language. A habitable language will be able to infer information that users leave out of their natural language input. At issue is the match between the user's expectations and the language's capabilities. Either users must be taught the language's capabilities, or the language must be customized to provide the capabilities users expect. The customization issue is reviewed in a later section. This section will look at the user's ability to learn and adapt to functional limitations.

In a follow-up study based on the Ogden and Brooks methodology, Ogden designed a functionally limited natural language query system and attempted to teach computer-inexperienced participants the limitations of the language. Twelve participants were told to follow the restricted grammar described by Ogden and Brooks (see above), but participants also had to include all needed references to existing database attributes. For example, the request "Find the salary of David Lee" would need to be rephrased as "Find the salary of the employee named David Lee," and the request "List the courses which are full" would need to be rephrased as "List the courses whose size equals its limit." This is similar to the restriction imposed by Small and Weldon (1983).

Participants were fully informed of the restrictions and were given several examples and counter-examples of allowable questions. They were also given the necessary database information and always had it available while entering questions.

The results showed that it was difficult for participants to learn to express the needed database functions using natural language. Particularly difficult were expressions that required calculations, especially when they had common synonyms that were not allowed (e.g. "...a 10 percent raise..." vs. "...salary times 1.1..."). Further, expressions that required identifying necessary data relationships were usually left out (e.g. "...staff ID is equal to salesman ID..."). The total error rate for this condition was 29 percent on participants' first attempts. This was much worse than for the syntactic restriction reported by Ogden and Brooks (9 percent). The results indicated that participants had a difficult time recovering from these errors as well (19 percent errors on the third attempt).

Summary: These results are consistent with previously reported findings: it may be difficult for people using natural language to be specific enough for the system to clearly understand them (e.g. Jackson, 1983; Miller, 1981; Ogden and Kaplan, 1986). Next, studies that use feedback to help the user understand system limitations are examined.

Effects of Feedback

Zolton-Ford (1984) conducted a WOz simulation study that examined the effects of system feedback on participants' own input. Zolton-Ford systematically varied two system feedback characteristics: the vocabulary used (high frequency or low frequency) and the system feedback phrase length (complete sentences or only verbs and nouns). Two input characteristics were also varied: the communication mode (keyboard or voice) and the amount of restriction placed on participants' language (restricted to phrases matching the system's feedback language, or unrestricted). Zolton-Ford's question was to what extent participants would imitate the system's feedback language. The participants in the restricted condition were not informed of the restrictions they were to follow. Instead, to introduce participants to the system's communication style and to the functions available in the system, all participant groups were guided by the program during the initial portion of their interactions. Participants were asked to solve 30 problems concerning the input, update, and retrieval functions of an imaginary inventory database.


Results showed that the restricted participants generated significantly more output-conforming entries than did the unrestricted participants. This was especially true when the feedback language was simple (verbs and nouns) rather than complex (complete sentences). System message word frequency had no effect. Zolton-Ford suggested the following design criteria:

1. Provide consistently worded program feedback, because users will model it.

2. Design the program to communicate with tersely phrased outputs, because users will be able to model them more easily than more verbose outputs.

3. Include error messages that reiterate the vocabulary and syntax understood by the program, because users will alter their vocabulary and syntax to match those provided in the error messages.

Support for Zolton-Ford's conclusions comes from a study reported by Slator et al. (1986). Slator et al. used a WOz-simulated NLI to an existing software package that produced computer-generated graphs to measure the effects of feedback. Participants were asked to construct a graph by entering English commands. The Wizard interpreted the commands and entered translated commands into the graphics package. Participants were divided into two feedback groups. The control group received no feedback other than the feedback that resulted from changes in the graph. The feedback group saw the translated commands entered by the experimenter. Results show that the feedback group made significantly fewer semantically ambiguous utterances than did the control group (7.9 percent and 22.4 percent, respectively). According to Slator et al., the reason the feedback group did so much better was that they began to imitate the feedback. They provide this example:

Participant: For X axis lable only every tenth year.

This is translated to a computer command (ignoring the misspelled word "label"):

Feedback: XAXIS MAJOR STEP 10

The participant immediately recognizes a shorter way to say what was wanted, which is reflected in the next statement:

Participant: Yaxis major step 2000.

Just as in the Zolton-Ford study, the users modeled the system's output. Slator et al. also suggest that by showing the formal-language translations of the natural language input, users will more quickly learn an abstract system model that will help them understand how to use the computer. Of course, this depends on how well the formal language reflects the underlying system model. The formal language design must be coherent or users may have trouble imitating it.

Summary: It is clear from these two studies that feedback in an NLI will strongly influence user performance with the system. Feedback should be formed to reflect the simplest input language acceptable to the system. Certainly another role for feedback is to let users know how their input was interpreted by the interface. This has been the recognized role of feedback in many systems. These two studies suggest that training may be a more important role for feedback in natural language systems.
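The interaction pattern itself is simple to reproduce. The following is a minimal sketch of translation-echo feedback in the spirit of Slator et al. (ours, not their system; the translation table is a tiny hypothetical fragment standing in for the Wizard):

    # Each natural language input is translated and the formal command is
    # echoed back, giving users a terser form to imitate.
    TRANSLATIONS = {
        "for x axis label only every tenth year": "XAXIS MAJOR STEP 10",
        "yaxis major step 2000": "YAXIS MAJOR STEP 2000",
    }

    def respond(user_input: str) -> str:
        command = TRANSLATIONS.get(user_input.lower().rstrip("."))
        return command if command else "COMMAND NOT UNDERSTOOD"

    print(respond("For X axis label only every tenth year"))  # XAXIS MAJOR STEP 10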

Empirically Derived Grammars

An important issue for NLIs is the effort required to install a new application. Each application of a natural language interface will require encoding semantic and pragmatic information about the task domain and the users' linguistic requirements in that domain. For most prototype systems, a great deal of effort is required to capture, define, and enter this information into the system. This customization process requires a person who knows how to acquire the information and how to translate it into a form usable by the natural language program. Very little research has been conducted to investigate methods for collecting the necessary application data so that it can be used to customize the interface. Customization is important because the functional capabilities of the end-user interface will be determined by how well the domain-specific linguistic information is gathered and encoded. The following two studies describe an approach to building domain-specific information into the system. These studies started with systems that had no linguistic capability and used WOz simulations to demonstrate how domain-specific information could be empirically derived and incorporated into a natural language program.

Kelley (1984) developed an NLI to an electronic calendar program using an iterative design methodology. In the first phase, participants used a simulated system whose linguistic capabilities were totally determined by the Wizard simulating the system. The inputs collected during the first phase were then used to develop language processing functions, and 15 additional participants used the program. During this phase, the Wizard would intervene only when it was necessary to keep the dialog going. After each participant, new functions were added to the language so that it could process the inputs that had required operator intervention. Care was taken to prevent new changes from degrading system performance. Finally, as a validation step, six more participants used the program without experimenter intervention.
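The loop can be summarized in a few lines. The sketch below is our toy rendering of Kelley's procedure, not his system; the word-level "grammar" stands in for real lexicon and rule acquisition:

    # Each participant's unparsable inputs extend the lexicon before the
    # next participant runs; coverage grows until interventions stop.
    def parses(lexicon: set, utterance: str) -> bool:
        return all(word in lexicon for word in utterance.split())

    def iterate(sessions: list, lexicon: set) -> set:
        for session in sessions:                    # one session per participant
            for utterance in session:
                if not parses(lexicon, utterance):  # the Wizard would intervene here
                    lexicon = lexicon | set(utterance.split())
        return lexicon

    lexicon = iterate([["remind me monday"], ["remind me 8/11 send card"]],
                      {"remind", "me"})
    print(parses(lexicon, "remind me monday"))  # True after iteration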

Kelley used a widely diverse participant population, and most participants had very limited computer experience. They were given a brief introduction to using the display terminal and were then asked to enter whatever appointments they had over the next two weeks. They were specifically asked to enter at least ten appointments and were prodded by the experimenter during the course of their work only if they failed to exercise the storage, retrieval, and manipulation functions of the database on their own.

System performance improved rapidly during the intervention phase. The growth of the lexicon and new functions reached an asymptote after only ten participants (iterations). During the validation phase, participants correctly stored 97 percent of the appointments entered. Other interpretation errors occurred, but either the user or the system was able to recognize and correct them. Even if these corrected errors are included, participants were understood 84 percent of the time.

The data suggest that the system provided an interface that was easy to use for novice computer users. The input language covered by this "empirically derived" grammar is very distant from grammatical English. A typical input to this system was:

remind me 8/11/82 send birthday card to mama

This finding provides further support for Slator et al.'s (1986) conclusion that users care more about brevity than grammatical correctness.

The quickness with which most of the lexicon and phrase structures could be captured indicates that the users in the study had a clear and homogeneous knowledge of the application domain. An important issue is whether this technique would work as well in wider domains, for example, those covered by most databases. The following study can be used to address this issue. It used the same technique as Kelley in an electronic mail application domain.

Good et al. (1984) conducted a study similar to Kelley (1984), but with five important differences. First, the Wizard intervened only when the input was judged to be simple enough to be parsed, and error messages were sent to the participants when it was not. Second, the initial prototype system was based on existing mail systems, not on a user-defined system as it was in the Kelley study. Third, the system was not modified after each iteration, but only at periodic intervals. Fourth, Good et al. presented a fixed set of tasks, whereas Kelley's tasks were user-generated. Fifth, Good et al. used 67 participants during the iterative development phase, compared to the 15 participants used in the Kelley study. Despite the differences, the results were similar. Good et al. focus the discussion of their results on the improvement in parse rate obtained from changes made to the system based on user input. The initial system could parse only 7 percent of the commands issued by users, but after 67 participants and 30 changes to the software, 76 percent of the commands could be parsed. The study does not report all the data, but participants were not able to complete all of the tasks with the initial system, whereas they were able to complete all of them within one hour with the final system.

Summary: It is difficult to compare the results obtained by Kelley to those obtained by Good et al. Kelley reports a better parse rate with fewer iterations, but the task sets given and the amount of modification done to the systems may have been quite different. What can be generally concluded is that user-defined systems can be created that will allow users to get work done in a natural way without any training or help.

5.0 Design Recommendations

Two things people must learn in order to use any system are 1) the system's capabilities, and 2) a language to invoke those capabilities. With formal computer languages, people learn the capabilities of the system as they learn the language. With natural language, they are spared the burden of learning a formal language and consequently lose an opportunity to learn the capabilities of the system. The evidence presented here suggests that users have difficulties with natural language systems when they either have not had enough experience to know what the capabilities of the system are, or the system has not been built to anticipate all of the capabilities users will assume. People tend to enter incomplete requests (e.g. Miller, 1981; Jackson, 1983) or attempt to invoke functions the system does not have (e.g. Jarke et al., 1985; Ogden and Sorknes, 1986; Tennant, 1979).

There are two approaches to correcting this mismatch between users' expectations and the system's capabilities. The first approach can be considered an artificial intelligence approach. Its goal is to provide a habitable system by anticipating all of the capabilities users will expect. It requires extensive interface program customization for each application domain. Unfortunately, the methods for collecting and integrating domain-specific information have not been well developed or tested. For large domains, like database applications where the pragmatic and semantic information is often complex, customization methods have been proposed (e.g. Grosz et al., 1987; Haas and Hendrix, 1980; Hendrix and Lewis, 1981; Manaris and Dominick, 1993), but these methods have not been empirically evaluated. It is clear, however, that as the complexity of the NL processing system increases, users' expectations for the system increase, as do the difficulties involved in customizing the system for other domains. For systems based on very simple NLP techniques, like statistically-based full-text retrieval systems, the customization problem becomes much simpler, or at least better defined.

For smaller domains, the procedures used by Kelley (1984) and Good et al. (1984) are promising. The disadvantage of their approach is that the solution is not guaranteed. The performance of the system will depend on the availability of representative users prior to actual use, and it will depend on the installer's abilities to collect and integrate the relevant information.

The second approach might be considered a human engineering approach. It depends on more training for the users, coupled with an interface language whose capabilities are more apparent. The primary goal of this approach is to develop a system that allows users to develop a consistent conceptual model of the system's domain so that they will understand the system's capabilities. One aspect of a more apparent language is the use of feedback. Zolton-Ford (1984) and Slator et al. (1986) have shown the effectiveness of feedback in shaping a user's language. Another example of the use of feedback and an apparent language model is provided by the NLC system described by Biermann et al. (1983). Extreme examples of this interface type are systems that provide natural language in menus (Tennant et al., 1983; Mueckstein, 1985). The disadvantage of this second approach is that users will need to be trained. To keep this training to a minimum (i.e., less than what would be required for an artificial language), the language should provide more function than would be provided by an artificial language. Thus, domain-specific information would still need to be collected and integrated into the interface.

The solution lies in a combination of the artificial intelligence and human engineering approaches. The following set of recommendations represents a synthesis of these two approaches based on the results of the reviewed user studies. The recommendations concern the design and development of NLIs, but could easily apply to the design of any user-system interface.

Clearly define an application domain and user population: It would be desirable to offer a list of characteristics that could describe the users and the domains best suited for natural language technology. However, there has been far too little work done in this area to provide this guidance. What is certain, however, is that NLIs can be built only when the functional and linguistic demands of the user can be determined. Therefore the domain and user population must be well defined.

Plan for an incremental adaptation period: It is recommended that application-specific linguistic information be gathered empirically and that the system be designed to easily accept this type of data. Representative users should be asked to use the interface language to generate data for system customization. The system must be built to incorporate incremental changes reflecting use over a period of time. There is no evidence to suggest how long this period may be. Methods for collecting and integrating domain-specific information need to be established. The verdict is still out on whether these systems can be installed and customized for a new application by people other than the system developers.

Provide access to meta-knowledge: The data that have been reviewed here suggest that the most difficult problem people have when using natural language interfaces is staying within the language's conceptual and functional limitations. Because users are not required to learn a formal language, there is less opportunity to learn the capabilities of the system. Systems must therefore provide mechanisms that allow users to interrogate the system for its knowledge. Few of the reviewed prototype systems provide this kind of mechanism.

Provide broad (or well-defined) syntactic coverage: Evidence obtained from user evaluations of prototype systems suggests that a primary problem for users is limited syntax, whereas laboratory experiments suggest that people can adapt fairly rapidly to some syntactic restrictions, especially when the restrictions can be defined. More studies should be conducted to investigate the possibility that broader syntactic coverage will encourage users to expect even more coverage. Current findings suggest that the broader the syntactic coverage, the better.

Provide feedback: Feedback in an NLI can have two functions. One is to play back the user's expression in a paraphrase or visible system response to demonstrate how it was interpreted. The other is to show the user the underlying system commands, informing the user of a more concise expression. Evidence strongly suggests that users will echo the feedback's form, so paraphrases should be constructed to be allowable as input expressions. Concise feedback should be provided for training purposes if the syntactic coverage is very limited. Visible system responses as well as concise paraphrases should be provided where possible. In database queries, concise paraphrases may be ambiguous, so in this case long, unambiguous expressions should be used. Nevertheless, they should be constructed as allowable system input.

6.0 Conclusion

No attempt has been made to identify characteristics of the specific technologies used in the various natural language systems that contribute to usability. Thus, no conclusions can be made for or against any particular natural language technology. This is not to imply that differences in technology are not important. They certainly are. But the focus here has been to review the existing empirical evidence of how NLIs are used.

The evidence gathered with existing natural language prototypes and products does not provide an encouraging picture. There has been just one well-documented case of a single user having success working with an NLI on self-generated questions in an actual application (Krause, 1980). Damerau's (1981) results may or may not be taken as an example of successful use. The other successes (Biermann et al., 1983; Hershman et al., 1979) can be attributed to well-designed experimental tasks and trained users. More studies show users having problems when tasks are user-generated, or when users have not had extensive experience with the task domain or with the system's capabilities (Jarke et al., 1985; Ogden & Sorknes, 1986; Tennant, 1979). Experience with prototype systems suggests that much more research needs to be done.

From the experimental studies, it is clear that users do not need and do not necessarily benefit from grammars based on natural language. When communicating with a computer, users want to be as brief as possible. These studies suggest that users will benefit from a natural language system's ability to provide broader conceptual and functional coverage than that provided with an artificial language. The use of synonyms and the ability to leave out contextual information (ellipsis) would seem to provide the most benefit. However, to provide this coverage requires the collection of application domain-specific information. How this information is gathered and represented will determine the usability of the NLI. The methods for accomplishing this are not well established, and, in the eight years since this review appeared in the first edition of the Handbook of Human-Computer Interaction, these methods are still not forthcoming. There is no doubt that with enough hard work and iteration, good NLIs can be developed that provide the usability enhancements we expect. Given that the most impressive recent NLI successes have been in text-based information retrieval applications, the key may lie in using simple models of NLP resulting in simpler domain customization methods. For example, statistically-based text retrieval systems have well-specified indexing methods for acquiring the linguistic domain knowledge required by the system. This may represent the most important area for further research into NLIs.
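As an illustration of how a statistically-based retrieval system acquires its linguistic domain knowledge directly from the text it indexes, the sketch below builds a small tf-idf index, a standard term-weighting scheme; the three-document collection and all names are invented for the example. Note that no hand-built grammar or domain lexicon is required: the "customization" is simply indexing the collection.

import math
from collections import Counter

docs = {
    "d1": "natural language interfaces for database query",
    "d2": "boolean query evaluation in text retrieval",
    "d3": "menu selection and command language interfaces",
}

def tf_idf_index(docs):
    """Weight each term in each document by tf * log(N / df)."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter(w for words in tokenized.values() for w in set(words))
    return {
        d: {w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
        for d, words in tokenized.items()
    }

def search(index, query):
    """Rank documents by the summed weights of the query's terms."""
    terms = query.lower().split()
    scores = {d: sum(weights.get(t, 0.0) for t in terms)
              for d, weights in index.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    index = tf_idf_index(docs)
    print(search(index, "natural language query"))  # d1 ranks first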


7.0 Acknowledgments

The authors are indebted to an anonymous reviewer and to Rhonda Steele for their comments on earlier drafts of this chapter.

8.0 References

Biermann, A.W., Ballard, B.W., and Sigmon, A.H. (1983). An experimental study of natural language programming. International Journal of Man-Machine Studies, 18, 71-87.

Borenstein, N.S. (1986). Is English a natural language? In K. Hopper and I.A. Newman (Eds.), Foundation for Human-Computer Communication (pp. 60-72). North-Holland: Elsevier Science Publishers B.V.

Burton, A. and Steward, A.P. (1993). Effects of linguistic sophistication on the usability of a natural language interface. Interacting with Computers, 5(1), 31-59.

Capindale, R.A., and Crawford, R.O. (1990). Using a natural language interface with casual users. International Journal of Man-Machine Studies, 20, 341-361.

Chin, D. (1984). An analysis of scripts generated in writing between users and computer consultants. In National Computer Conference (pp. 637-642).

Codd, E.F. (1974). Seven steps to RENDEZVOUS with the casual user (IBM Research Report J1333). San Jose, CA: San Jose Research Laboratory, International Business Machines Corporation.

Cohen, P.R., Sullivan, J.W., Dalrymple, M., Gargan, R.A., Moran, D.B., Schlossberg, J.L., Pereira, F.C.N., and Tyler, S.W. (1989). Synergistic use of direct manipulation and natural language. In Proceedings of CHI '89 (pp. 227-232). New York: Association for Computing Machinery.

Damerau, F.J. (1981). Operating statistics for the transformational question answering system. American Journal of Computational Linguistics, 7, 30-42.

Epstein, R. (1993). 1993 Loebner Prize competition in artificial intelligence: Official transcripts and results. Technical Report, Cambridge Center for Behavioral Studies.

Fink, P.K., Sigmon, A.H., and Biermann, A.W. (1985). Computer control via limited natural language. IEEE Transactions on Systems, Man, and Cybernetics, 15, 54-68.

Ford, W.R., Weeks, G.D., and Chapanis, A. (1980). The effect of self-imposed brevity on the structure of didactic communication. The Journal of Psychology, 104, 87-103.

Fraser, N.M. (1993). Sublanguage, register and natural language interfaces. Interacting with Computers, 5(4), 441-444.

Good, M.D., Whiteside, J.A., Wixon, D.R., and Jones, S.J. (1984). Building a user-derived interface. Communications of the ACM, 27, 1032-1043.

Goodine, D., Hirschman, L., Polifroni, J., Seneff, S., and Zue, V. (1992). Evaluating interactive spoken language systems. In Proceedings of ICSLP-92 (Vol. 1, pp. 201-204). Banff, Canada.

Grosz, B.J., Appelt, D.E., Martin, P.A., and Pereira, F.C.N. (1987). TEAM: An experiment in the design of transportable natural-language interfaces. Artificial Intelligence, 32, 173-243.

Guindon, R. (1987). Grammatical and ungrammatical structures in user-adviser dialogues: Evidence for sufficiency of restricted languages in natural language interfaces to advisory systems. In Proceedings of the 25th ACL (pp. 41-44). Stanford University.

Harman, D. and Candela, G. (1990). Bringing natural language information retrieval out of the closet. ACM SIGCHI Bulletin, 22(1), 42-48.

Harris, L.R. (1977). User-oriented database query with the ROBOT natural language query system. International Journal of Man-Machine Studies, 9, 697-713.

Hass, N. and Hendrix, G. (1980). An approach to acquiring and applying knowledge. In First National Conference on Artificial Intelligence (pp. 235-239). American Association for Artificial Intelligence.

Hauptman, A.G. and Green, B.F. (1983). A comparison of command, menu-selection and natural-language computer programs. Behaviour and Information Technology, 2(2), 163-178.

Hendler, J.A. and Michaelis, P.R. (1983). The effects of limited grammar on interactive natural language. In Proceedings of CHI '83: Human Factors in Computing Systems (pp. 190-192). New York: Association for Computing Machinery.

Hendrix, G. and Lewis, W. (1981). Transportable natural language interfaces to databases. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 159-165). Menlo Park: Association for Computational Linguistics.

Hershman, R.L., Kelly, R.T., and Miller, H.G. (1979). User performance with a natural language query system for command control (Technical Report NPRDC-TR-797). San Diego, CA: Navy Personnel Research and Development Center.

Hirschman, L., et al. (1992). Multi-site data collection for a spoken language corpus. In Proceedings of ICSLP-92 (Vol. 2, pp. 903-906). Banff, Canada.

Jackson, M.D. (1983). Constrained languages need not constrain person/computer interaction. SIGCHI Bulletin, 15(2-3), 18-22.

Jarke, M., Turner, J.A., Stohr, E.A., Vassiliou, Y., White, N.H., and Michielsen, K. (1985). A field evaluation of natural language for data retrieval. IEEE Transactions on Software Engineering, 11, 97-114.

Kelley, J.F. (1984). An iterative design methodology for user-friendly natural-language office information applications. ACM Transactions on Office Information Systems, 2, 26-41.

Kelly, M.J., and Chapanis, A. (1977). Limited vocabulary natural language dialog. International Journal of Man-Machine Studies, 9, 479-501.

Krause, J. (1980). Natural language access to information systems: An evaluation study of its acceptance by end users. Information Systems, 5, 297-319.

Malhotra, A. (1975). Design criteria for a knowledge-based English language system for management: An experimental analysis (Project MAC Report TR146). Cambridge, MA: Massachusetts Institute of Technology.

Malhotra, A. and Sheridan, P.B. (1976). Experimental determination of design requirements for a program explanation system (IBM Research Report RC 5831). Yorktown Heights, NY: International Business Machines Corporation.

Martin, J. (1985). Fourth Generation Languages, Vol. 1: Principles. Englewood Cliffs, NJ: Prentice-Hall.

Michaelis, P.R. (1980). Cooperative problem solving by like- and mixed-sex teams in a teletypewriter mode with unlimited, self-limited, introduced and anonymous conditions. JSAS Catalog of Selected Documents in Psychology, 10, 35-36 (Ms. No. 2066).

Miller, L.A. (1981). Natural language programming: Styles, strategies, and contrasts. IBM Systems Journal, 20, 184-215.

Mueckstein, E.M. (1985). Controlled natural language interfaces: The best of three worlds. In Proceedings of CSC '85: ACM Computer Science Conference (pp. 176-178). New York: Association for Computing Machinery.

Napier, H.A., Lane, D., Batsell, R.R., and Guadango, N.S. (1989). Impact of a natural language interface on ease of learning and productivity. Communications of the ACM, 32(10), 1190-1198.

Ogden, W.C. and Brooks, S.R. (1983). Query languages for the casual user: Exploring the middle ground between formal and natural languages. In Proceedings of CHI '83: Human Factors in Computing Systems (pp. 161-165). New York: Association for Computing Machinery.

Ogden, W.C. and Kaplan, C. (1986). The use of AND and OR in a natural language computer interface. In Proceedings of the Human Factors Society 30th Annual Meeting (pp. 829-833). Santa Monica, CA: The Human Factors Society.

Ogden, W.C. and Sorknes, A. (1987). What do users say to their natural language interface? In Proceedings of INTERACT '87: 2nd IFIP Conference on Human-Computer Interaction. Amsterdam: Elsevier Science.

Patrick, A.S. and Whalen, T.E. (1992). Field testing a natural language information system: Usage characteristics and users' comments. Interacting with Computers, 4(2), 218-230.

Perlman, G. (1984). Natural artificial languages: Low level processes. International Journal of Man-Machine Studies, 20, 373-419.

Ringle, M.D., and Halstead-Nussloch, R. (1989). Shaping user input: A strategy for natural language dialog design. Interacting with Computers, 1(3), 227-244.

Shneiderman, B. (1978). Improving the human factors aspect of database interactions. ACM Transactions on Database Systems, 3, 417-439.

Slator, B.M., Anderson, M.P., and Conley, W. (1986). Pygmalion at the interface. Communications of the ACM, 29, 599-604.

Small, D.W. and Weldon, L.J. (1983). An experimental comparison of natural and structured query languages. Human Factors, 25, 253-263.

Tennant, H.R. (1979). Experience with the evaluation of natural language question answerers (Working Paper 18). Urbana, IL: University of Illinois, Coordinated Science Laboratory.

Tennant, H.R. (1980). Evaluation of natural language processors (Report T-103). Urbana, IL: University of Illinois, Coordinated Science Laboratory.

Tennant, H.R., Ross, K.M., and Thompson, C.W. (1983). Usable natural language interfaces through menu-based natural language understanding. In Proceedings of CHI '83: Human Factors in Computing Systems (pp. 154-160). New York: Association for Computing Machinery.

Turing, A.M. (1950). Computing machinery and intelligence. Mind, 59, 433-460.

Turtle, H. (1994). Natural language vs. Boolean query evaluation: A comparison of retrieval performance. In Proceedings of the Seventeenth Annual International Conference on Research and Development in Information Retrieval (pp. 212-221). London: Springer-Verlag.

Walker, M. and Whittaker, S. (1989). When natural language is better than menus: A field study. Technical Report, Hewlett Packard Laboratories, Bristol, England.

Waltz, D.L. (1983). Helping computers understand natural language. IEEE Spectrum, November, 81-84.

Watt, W.C. (1968). Habitability. American Documentation, July, 338-351.

Whalen, T. and Patrick, A. (1989). Conversational hypertext: Information access through natural language dialogues with computers. In Proceedings of CHI '89: ACM Human Computer Interaction Conference. New York: Association for Computing Machinery.

Winograd, T. and Flores, F. (1986). Understanding Computers and Cognition. Norwood, NJ: Ablex.

Woods, W.A. (1977). A personal view of natural language understanding. SIGART Newsletter, 61, 17-20.

Zoeppritz, M. (1986). Investigating human factors in natural language data base query. In J.L. Mey (Ed.), Language and Discourse: Test and Protest: A Festschrift for Petr Sgall (Linguistic and Literary Studies in Eastern Europe 19) (pp. 585-605). Amsterdam/Philadelphia: John Benjamins.

Zolton-Ford, E. (1984). Reducing variability in natural-language interactions with computers. In Proceedings of the Human Factors Society 28th Annual Meeting (pp. 768-772). Santa Monica, CA: The Human Factors Society.