A small natural language interpreter in Prolog

Knut Tveitane, knut at itu.dk

Christian Theil Have, cth at itu.dk

Supervisor: Henning Christiansen, henning at ruc.dk

May 29, 2006

4 Week Project, IT University of Copenhagen

Contents

1 Introduction
  1.1 About the Project
  1.2 The Authors
  1.3 Purpose and Scope
2 Background
  2.1 Introduction
  2.2 Use Cases
  2.3 Similar work
    2.3.1 Attempto Controlled English
    2.3.2 Common Logic Controlled English
    2.3.3 Natural Language Case Tool
    2.3.4 Metafor
  2.4 Lexicons
  2.5 Discussion
3 Our approach
  3.1 Use Cases and Natural English
  3.2 Supported Natural Language Constructs
    3.2.1 Basic Sentences
    3.2.2 Property Sentences
    3.2.3 Entity-relational Sentences
    3.2.4 Phrase Lists
    3.2.5 Compound Sentences and Pronouns
    3.2.6 Syntactic Stringency
  3.3 Delimiting the Project
  3.4 Tools and methods
  3.5 Input Grammar Details
  3.6 Code Generation
    3.6.1 Extraction of facts from the program
    3.6.2 Generation of Graphviz code
    3.6.3 From parse tree to code
4 Running the Project Software
  4.1 Instructions for Use
  4.2 Example Sessions
    4.2.1 Simple example using Test domain
    4.2.2 More complex example with the Company domain
5 Future Work
6 Conclusion
7 References
8 Appendix A - Code
  8.1 Input Grammar.pl
  8.2 CodeGen.pl
  8.3 Lexicon Test.pl
  8.4 Lexicon Company.pl

List of Figures

1 Object notation for UML diagrams
2 UML diagram for sample session
3 UML diagram for company use case

List of Tables

1 Examples of use cases and the entities associated with


1 Introduction

1.1 About the Project

This project is a 4 week project at the IT University of Copenhagen. We, the authors, are MSc students at ITU, but met at RUC in spring 2006, where we both attended the course “Paradigms in Programming” (PiP). This course was also where we were introduced to Prolog.

The PiP course gave us both a wish to achieve a better working knowledge of Prolog than the course alone could give us. That is why we decided to do a 4 week project in Prolog. A common interest in language and language technology led us to a grammar based project.

We thank professor Henning Christiansen at RUC for supervising the project, and for supplying valuable input and feedback during these four weeks.

1.2 The Authors

Knut Tveitane is a second semester MSc student at the Software Development study line at ITU. He has worked for a number of years with IT applications in the language supplier and translation industry. He has an interest in language, but he is not a linguist, and he is new to logic programming and Prolog.

Christian Theil Have is also an MSc student at the Software Development line at ITU. He has solid programming experience and a computer science background (B.S.). During his studies he has developed a strong interest in AI related technologies. He has no linguistic background, but finds it a very interesting field. Apart from the PiP course, this is also his first adventure in Prolog.

1.3 Purpose and Scope

The process of getting from a use case written in natural language to code in a programming language is time consuming work that traditionally demands human skills, and is usually performed by educated professionals. In object oriented approaches, part of this process is to identify classes and methods and to partition the system into them.

The purpose of the project is to investigate the possibilities for automated transition from “Use Cases” in a natural language syntax into a computer readable representation, by trying to capture the semantics of the natural language and map it into building blocks of the object oriented programming paradigm (classes, objects, methods, properties etc.).

The shift in programming languages from strictly procedural to more object oriented has made this mapping process easier. Programming language constructs more and more resemble the linguistic constructs in natural language. Hugo Liu concludes that “..fairly direct mappings are possible from parsed English to the control and data structures of traditional high-level programming languages..” [LL05]


We propose automating this process at least partly with a tool that performs the mapping from a use case written in natural language to a computer understandable language.

It’s our hope that such a tool might be useful to the professional normally performing this process manually. Possible uses for such a tool include:

• Brainstorming.

• Prototyping.

• A learning tool that could help students gain insight into the process of developing software.

• Certain types of domain specific applications.

2 Background

2.1 Introduction

There have been several attempts to map natural language specifications to code in a programming language. We have not found anyone specifically targeting use cases, though some of the examined approaches do something very similar to what we hope to achieve. The examined approaches have all used English as their input language.

Philosophies The approaches can be divided according to two different philosophies:

• Formal languages: Approaches that use a controlled subset of English that maps into first order logic.

• Opportunistic recognition: Approaches that leave room for ambiguity in the input language, and opportunistically recognize a subset of it.

Attempto’s ACE (section 2.3.1) and CLCE (section 2.3.2) are examples of the first philosophy, while Metafor (section 2.3.4) and to a degree the “Natural Language CASE Tool” (section 2.3.3) are examples of the second.

2.2 Use Cases

Use Cases are normally described in natural language. “A use case describes what a system does but does not specify how it does it” [BJR99]. Use cases model the flow of events between a system and its actors.

Even though use cases are written in natural language, only a subset of English is normally used.


The Unified Modeling Language User Guide [BJR99] does not provide any in-depth guidance on how to write use cases; it only gives a few examples written as paragraphs of text in present tense and third person.

There has been some relevant research concerned with the structure and style of use cases. Alistair Cockburn [Coc97] has suggested a semi-formal approach where each action description has a certain structured format. Both Martin Fowler [Fow03] and Cockburn suggest consistent line numbering. Line numbers make it less ambiguous to determine the flow of events.

A European research team called CREWS has elaborated on Cockburn’s research and defined a rigorous set of guidelines for use case writing in [Ach98a]. The guidelines address both the content and style of use cases. The guidelines are expressed almost as a formal grammar.

The CREWS team also provides some research in the area of linguistic structures of use cases written using their guidelines in [Ach98b]. The guidelines are divided into content guidelines and style guidelines.

The guidelines include a number of things that simplify the language in which the use cases are written. Avoidance of such things as anaphoric references, explicit references and synonyms clearly makes parsing simpler. So does consistency in terminology and abstraction level.

The style guidelines provide some insight into which linguistic structures one can expect to find in use cases. Some of those are:

• Atomic action structures.

• Flow conditions.

• Loops.

The guidelines provide some templates for these linguistic structures, which could relatively easily be formulated as grammars.

Research by Karl Cox and others [CP00] has questioned the usefulness of the CREWS guidelines, judging them to be too complex. Subsequently they have written their own guidelines [KCS01], called CP. The CP guidelines indeed seem clearer and simpler. CP is also divided into style rules and content rules. They are summarized below.

CP Style rules

• Style 1: Each sentence in the description should be on a new, numbered line. Alternatives and exceptions should be described in a section below the main description and the sentence numbers should agree.

• Style 2: Avoid pronouns if there is more than one actor.

• Style 3: No adverbs or adjectives.

• Style 4: Avoid negatives.

• Style 5: Give explanations if necessary.


• Style 6: All verbs are in present tense format.

• Style 7: There should be logical coherence throughout the description.

• Style 8: When an action occurs there should be a meaningful response to that action.

CP Content rules

• Structure 1: Subject verb object.

• Structure 2: Subject verb object prepositional phrase.

• Structure 3: Subject passive.

• Structure 4: Underline other use case names.

2.3 Similar work

2.3.1 Attempto Controlled English

Attempto Controlled English (ACE) is a subset of English designed for writing software specifications. This subset is enough to express first order logic. “ACE can be accurately and efficiently processed by a computer, but is expressive enough to allow natural usage.” [Fuc00] Specifications in ACE appear to be informal, but are in fact quite formal.

“The Attempto system translates specification texts in ACE into discourse representation structures and optionally into Prolog.” [Fuc00]

The input text is parsed using something similar to definite clause grammars: “The specification text is parsed by a top-down parser using a unification-based phrase structure grammar.” [Fuc00] It operates with a small vocabulary that “contains entries of the function word class, e.g. determiners, prepositions, pronouns” [Fuc00], but nouns, verbs and adjectives have to be added for the specific input text. Their grammar is built to recognize declarative sentences and composite sentences built from declarative sentences by using coordination constructors (and, or, either-or). “A declarative sentence tells us how the world looks like if the sentence is true (proposition) and claims that the world looks like that (illocution).” [Fuc00] A declarative sentence can be illustrated by a simple grammar: subject + finite verb (+ complement or object). Only sentences in present tense and third person (either singular or plural) are allowed. Sentences containing modal adjectives or verbs are not allowed.

They employ some techniques that allow for more natural sentences. For instance they allow anaphoric references and provide a means to resolve them. The technique used is a combination of deixis look-back based on syntactic features such as gender and number, and a simple rule of right associativity.

A feature that is not present in natural language, but practical for software specifications, is a sort of variables (in ACE terminology, dynamic names). “Dynamic names in ACE distinguish single instance of the set of objects denoted by the preceding noun.” [FST99]


They also allow synonyms and abbreviations, but do not provide details on how they manage synonym resolution. Possibly, it’s done using a lexicon such as WordNet, which is described in section 2.4.

The input text is translated to a discourse representation structure (DRS). “A DRS is a structured form of first-order predicate logic which contains discourse referents representing the objects of the discourse, and conditions for the discourse referents” [Fuc00]. The DRS is similar to rules in the Prolog programming language and can be translated to Prolog.

2.3.2 Common Logic Controlled English

Common Logic Controlled English (CLCE) is a specification draft written by John F. Sowa for a “formal language with an English-like syntax” [Sow04]. The grammar definition is still incomplete. It is a subset of English that can be translated into first-order logic (FOL). As of yet no tool exists that transforms sentences to FOL, but Sowa’s claim is that “Under the assumption that all words, names, and variables are declared explicitly or implicitly before their first use, the translation of any CLCE text to FOL can be performed in a single pass by a context-free parser augmented with two symbol tables.” [Sow04]

ACE and CLCE are very similar languages. Attempto has advanced schemes for resolution of anaphora and structural ambiguity. This makes ACE a more natural language than CLCE but prevents translation in both directions and makes parsing more complex.

2.3.3 Natural Language Case Tool

[NT96] introduces an approach to mapping process descriptions written in natural text to information system design. The paper describes “a CASE tool that utilizes natural language processing for interpreting and mapping business rules to information systems design” [NT96].

In brief, it identifies condition-action structures in the sentences and represents them as branches in a flowchart. The boxes in the flowchart are labeled with relevant parts of the input text. The text within the boxes is processed in a systematic way, so that subjects are left out and negations in conditions are eliminated. The tool does not seem to be capable of much contextualizing, in the sense that the text in the nodes is not processed further. Liu describes the nodes as “large unparsed natural language utterances.” [LL05]

The paper gives the impression that the tool can identify a lot of different condition and action structures within the input text, but unfortunately the detail in which these constructions are described is very limited.

The technical details are also sparse. They employ multi-pass parsing and disambiguation using a dictionary with syntactic and semantic information. The dictionary distinguishes different conceptual categories (tangible, object, person, event, location etc.).


2.3.4 Metafor

Metafor is a tool that maps natural language (English) into code in the Python programming language. Its authors conclude that “fairly direct mappings are possible from parsed English to the control and data structures of LISP and Python.” [LL04]

Metafor accepts a wide range of language constructions and is definitely not a formal language. It is a prime example of a tool in the opportunistic category. It understands different narrative stances, past and present tense, metonyms, dominion, anaphoric and even dynamic references (set selections).

They compare programming to storytelling. This is interesting, since storytelling is similar to use case writing, where events are also described chronologically.

The capabilities of the tool are very impressive, since it allows much ambiguity in the input language. One of the reasons why Metafor so successfully recognizes such a rich input language is the use of the ConceptNet lexicon. We briefly describe this lexicon in section 2.4.

They do not mention how parsing and internal representation of the input text are handled. On a side-note, a deictic stack is mentioned, which might well be the way they handle anaphoric resolution.

Programmatic Semantics The authors [LL05] have coined the term “programmatic semantics” to describe the transliteration process: “Programmatic semantics is a mapping between natural linguistic structures and basic programming language structures.” [LL05]

They have further divided programmatic semantics into four categories:

• Syntactic features

• Procedural features

• Relational and set-theoretic features

• Representational equivalence

Syntactic features imply semantics. Different word categories naturally map to certain programming language constructs: nouns as objects, verbs as functions and adjectives as properties.

Procedural features include expression of conditional rules and iteration. The paper [LL05] points out linguistic constructions for handling conditionals: subjunctives, possibles and when. Subjunctives are the well known two-clause constructions also seen in most programming languages (if ... then ...). When is similar to if, and can be handled the same way. Possibles are constructions such as may and might. Note that possibles are illegal in the formal languages examined (ACE and CLCE). Loops are similar to conditionals; they are also two-clause constructions with a conditional and a body.


Relational and set-theoretic features can denote implicit loops. For instance a selectional constraint can be used to select a subset from a set. Implicitly this means looping through the set, applying the selectional constraint.
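To make the idea of an implicit loop concrete, the following is a minimal Prolog sketch of a selectional constraint such as “the alcoholic drinks”. It is not how Metafor works (Metafor generates Python); the facts drink/2 and the predicate select_by_kind/2 are hypothetical names chosen for illustration only.

```prolog
% Illustrative data: drink(Name, Kind). Not taken from Metafor or [LL05].
drink(beer, alcoholic).
drink(wine, alcoholic).
drink(juice, soft).

% select_by_kind(+Kind, -Names): Names is the subset of all drinks whose
% kind is Kind. findall/3 performs the implicit loop over the whole set
% and keeps the elements that satisfy the constraint.
select_by_kind(Kind, Names) :-
    findall(Name, drink(Name, Kind), Names).
```

With these facts, the query select_by_kind(alcoholic, L) binds L to [beer, wine]. No explicit loop is written; the iteration is implied by the set selection, which is the point made in the text above.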

Representational equivalence is a kind of type system used for inferring a code representation for the objects in the input text. “In Metafor, we always begin by assuming the simplest code representation which can accommodate the facts in the story, dynamically refactoring to more complex representations as necessary.” [LL05] They claim that “the sort of representational equivalence found in natural language is quite unparalleled in any formal programming language”. This might be true, but the way they infer representation seems similar to Standard ML’s type inference system.

2.4 Lexicons

Each of the examined approaches uses some sort of lexicon. Lexicons are widely used in natural language processing, so they will not go unmentioned here either. In the following we describe some important lexicons and briefly discuss their properties. We use the term “lexicon” to describe a collection or database of words where each word is linked with lexical and possibly semantic knowledge, i.e. an advanced dictionary.

A very popular publicly available lexicon is WordNet [Fel98]. WordNet is a huge lexicon that contains information about different words, categorized by word class (e.g. noun, verb, adverb etc.). Most notably, WordNet also contains information about relations between words and different word senses (e.g. rock can refer to both a stone and a kind of music). Some of the supported relations include synonyms, antonyms, part-of, kind-of and several others.

Attempto and CLCE use custom lexicons. Each of them contains a limited number of words and must be extended by the user to process a specific input document. Those dictionaries contain only closed word classes such as determiners, prepositions and conjunctions. The user extends the dictionary with domain specific nouns and verbs.

These limited lexicons make good sense in a formal language approach. However, in the opportunistic approaches, large lexicons packed with lexical and semantic information are required, since input cannot easily be anticipated.

“Natural language CASE Tool” uses a large lexicon with 75,000 words, which can also be extended by the user. This lexicon contains information about different concept categories. Details on the lexicon are sparse, but it probably has a high degree of similarity to WordNet.

Metafor uses a large lexicon called ConceptNet [LS04]. ConceptNet is a very advanced lexicon developed at MIT that includes more than 250,000 elements of commonsense concepts. It is similar to WordNet but contains a much richer set of semantic relations. While WordNet has been hand-crafted, ConceptNet was developed as a web-collaboration project. WordNet only contains entries for individual words, but entries in ConceptNet are linked with much more contextual information (so word sense, for instance, may be determined using sentence analysis).


2.5 Discussion

The limited language of use cases makes it feasible to employ natural language processing. The scope of this could be limited to a certain subset of language constructions.

Different approaches for mapping natural language to code exist and can be divided into formal languages and opportunistic language recognition tools. The approaches also differ in what types of language constructs they support and how they map them to a computer understandable representation. ACE and CLCE map to formal logic while Metafor and “Natural language CASE Tool” have a heavy emphasis on procedural features. “Natural language CASE Tool” totally disregards structural features whereas Metafor works with a combination of structural and procedural features. “Natural language CASE Tool” creates diagrams instead of actual code in a programming language, but its authors argue that their tool could easily be modified to output code in a programming language. Metafor skips directly to code generation in the Python programming language. Because of Python’s high-level features (dynamic typing, lambda functions etc.), they are able to do this quite elegantly, but are spared many considerations they would have had to go through if they had used a language like Java.

3 Our approach

3.1 Use Cases and Natural English

Use cases are descriptions of system functionality and interaction between parts of a system. The word “system” is used here in its widest meaning: in theory it can be any compound structure where two or more components interact. In reality the systems described usually consist of human beings (“users”), some computer software and some kind of hardware objects.

The use cases describe the capabilities of the system, and interactions between the parts of the system, in particular between the users on one side and the hardware and software parts on the other.

Use cases are a system design tool. This means that the software (and sometimes also some of the hardware) described in the use cases does not exist at the time the use cases are written. On the contrary, it is the purpose of the use cases to aid the system designer in designing a well functioning system. One of the tasks for the system designer is, based on the use cases and other documentation, to define the class structure - that is, the main entities of the system, their content and capabilities, and the relations between them.

The purpose of this project is, as mentioned before, to do some investigation into the possibilities of automating this process.

The use cases are written in a natural language. However, they are not free form. As discussed in section 2.2, several sets of guidelines have been elaborated for how natural language should be applied to use cases.

We have chosen to base the syntax on Attempto Controlled English, as presented in section 2.3.1. ACE comprises a subset of the English language, such that any statement in ACE is valid English, but not every valid English statement is valid Controlled English. Three notable limitations of ACE are that it supports only:

• Present tense

• 3rd person singular or plural

• Active, declarative sentences

The grammar we have implemented in this version does not cover the entire definition of ACE. The precise subset of ACE that we support is presented in the next section.

3.2 Supported Natural Language Constructs

When defining the constraints for the subset of natural English that we would implement support for, the goal was that it had to be expressive enough to identify the important entities and relations in an object oriented description of a system. Further, it had not only to comply with a basic formal syntax, but also to be sufficiently flexible to maintain a certain degree of the flow and style of natural language. This raises the controversy of expressional flexibility vs syntactical stringency and error control. We have tried to find a balance between these, implementing a “proof of concept” showing that both can be accommodated within this kind of solution.

Below we present the different sentence types and constructions the grammar parser is designed to recognize from a grammatical or syntactical point of view. Where applicable, the programmatic semantics of individual constructs are included. In section 3.5, we present details of implementations for the most important of these constructs.

3.2.1 Basic Sentences

The starting point is the simplest of sentences, with a noun phrase followed by a simple verb phrase. The verb phrase can contain an intransitive or transitive verb. We will first consider verbs that imply an action to be performed by or on the subject of the sentence. Examples are “A man walks” or “The woman drives a car”.

Programmatic Semantics The subject of the sentence is a noun. This noun maps to a class definition in the object oriented programming paradigm.

The verb maps to a method of the class represented by the subject. If the verb is transitive, the object is another noun (that defines another class) and this class serves as argument to the method.
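The mapping described above can be sketched as a small definite clause grammar. This is only an illustration under stated assumptions: the lexicon entries and the terms class/1 and method/3 are hypothetical names, not the representation used in the project's own grammar file.

```prolog
% Hypothetical mini-lexicon, for illustration only.
det(a).
det(the).
noun(man).
noun(woman).
noun(car).
iv(walks).      % intransitive verb
tv(drives).     % transitive verb

% sentence(-Facts): a noun phrase followed by a simple verb phrase.
% Facts lists the classes and the method implied by the sentence.
sentence(Facts) --> np(Subject), vp(Subject, Facts).

% A noun phrase is a determiner followed by a known noun.
np(Noun) --> [Det, Noun], { det(Det), noun(Noun) }.

% Intransitive verb: a method without arguments on the subject's class.
vp(Subject, [class(Subject), method(Subject, Verb, [])]) -->
    [Verb], { iv(Verb) }.
% Transitive verb: the object is a second class, used as method argument.
vp(Subject, [class(Subject), class(Object), method(Subject, Verb, [Object])]) -->
    [Verb], { tv(Verb) }, np(Object).
```

For example, phrase(sentence(F), [the, woman, drives, a, car]) binds F to [class(woman), class(car), method(woman, drives, [car])].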


3.2.2 Property Sentences

Sentences that imply an ownership or containment relation are syntactically identical to the transitive sentences described in the previous section. However, verbs like to have or to contain cannot be considered to imply an action performed by or on the subject. Instead, a sentence like “A car has an engine” signals a static super-subordinate relation between subject and object.

The object of a property sentence may be plural. In these cases, the object may contain a quantifier. This may be an integer, a number (between two and twelve) spelled out in letters, or an unspecified quantifier like “some”. An example is the sentence “A car has four wheels”.

Programmatic Semantics The main semantic difference between this sentence construction and the basic transitive sentences described above lies in that the verb in itself is not considered a method. Instead, the class represented by the object is considered a property of the class in the subject.

When it comes to properties in plural form, they map to multi-valued properties. There are different approaches to representing these in object oriented programming languages. We have tried not to limit the flexibility, by maintaining as precise information as possible about the cardinality of the property. This means the exact number of instances is conserved if present; if it is not, the information that it is a multi-valued property is still saved.
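The treatment of quantifiers and cardinality can be sketched as follows. The sketch assumes a hypothetical lexicon format noun(Singular, Plural) and a hypothetical result term property(Owner, Property, Cardinality), where the atom many stands for an unspecified multi-valued property; none of these names are taken from the project code.

```prolog
% Hypothetical lexicon: noun(Singular, Plural).
det(a).
det(an).
det(the).
noun(car, cars).
noun(engine, engines).
noun(wheel, wheels).

% Numbers two to twelve spelled out in letters.
number_word(two, 2).    number_word(three, 3).   number_word(four, 4).
number_word(five, 5).   number_word(six, 6).     number_word(seven, 7).
number_word(eight, 8).  number_word(nine, 9).    number_word(ten, 10).
number_word(eleven, 11). number_word(twelve, 12).

% property_sentence(-P): "<det> <noun> has <object>".
property_sentence(property(Owner, Prop, Card)) -->
    [Det, Owner], { det(Det), noun(Owner, _) },
    [has],
    object(Prop, Card).

% Singular object: exactly one instance.
object(Prop, 1) --> [Det, Prop], { det(Det), noun(Prop, _) }.
% Plural object: a quantifier followed by the plural form of the noun.
object(Prop, Card) --> quantifier(Card), [Plural], { noun(Prop, Plural) }.

quantifier(N)    --> [N], { integer(N), N >= 2 }.   % e.g. 4
quantifier(N)    --> [W], { number_word(W, N) }.    % e.g. four
quantifier(many) --> [some].                        % unspecified quantity
```

With this sketch, [a, car, has, four, wheels] yields property(car, wheel, 4), while [a, car, has, some, wheels] yields property(car, wheel, many), so the information that the property is multi-valued is kept even when the exact number is unknown.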

3.2.3 Entity-relational Sentences

Some sentences define entities in terms of others. These sentences are syntactically similar to the above examples. The verb to be (is/are) serves the purpose of defining such relationships.

Programmatic Semantics One type of such sentences is similar to “A car is a kind of vehicle”. Here, too, there is a super-subordinate relationship, but of a different type. In object oriented terminology, vehicle is a superclass of car. The key to this kind of sentence is what we can call a subclassing phrase, kind of, following the verb to be.

Another construction is a sentence like “John is a man”. The noun phrase here is a proper noun, and the significance of this sentence is that there is a concrete, named entity (John) of the class man. In object oriented terminology, John is an object of class man.

After the person John has been introduced, we must be prepared for sentences of the form “John talks”. The sentence looks straightforward at first glance, but the programmatic semantics of it is a bit more complex. “John” maps to an object, “talks” to a method - but objects don’t define methods (or properties). The sentence must be understood to mean that the method belongs to the class the object is an instance of.
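The rule that a method stated for an object belongs to the object's class can be sketched in a few clauses. The facts subclass/2 and instance/2 and the predicates owner_class/2 and method_fact/3 are hypothetical names used only for this illustration.

```prolog
% Facts an interpreter might have extracted so far (illustrative data).
subclass(car, vehicle).     % "A car is a kind of vehicle"
instance(john, man).        % "John is a man"

% owner_class(+Subject, -Class): the class a method or property stated
% for Subject should be attached to.
owner_class(Subject, Class) :-
    instance(Subject, Class), !.    % Subject is an object: use its class
owner_class(Subject, Subject).      % otherwise Subject is itself a class

% method_fact(+Subject, +Verb, -Fact): "John talks" gives a method of
% class man, since objects do not define methods themselves.
method_fact(Subject, Verb, method(Class, Verb)) :-
    owner_class(Subject, Class).
```

Here method_fact(john, talks, M) gives M = method(man, talks), whereas method_fact(car, drives, M) gives M = method(car, drives).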


3.2.4 Phrase Lists

In natural language several phrases of the same type are often packed into a list instead of having a sentence each. Such lists can be lists of verb phrases or lists of objects to the verb to have. Lists are comma separated, except for the last two entries which are separated by a conjunction. We have implemented support for such lists, using the conjunctions and or or.
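A list of this shape can be recognized with a short recursive grammar. The sketch below assumes that the comma arrives as the atom ',' in the token list and uses a hypothetical item//1 non-terminal that accepts bare nouns; the project's own list handling may differ.

```prolog
% Hypothetical nouns, for illustration only.
noun(engine).
noun(wheel).
noun(door).
noun(roof).

item(X) --> [X], { noun(X) }.

conj --> [and].
conj --> [or].

% items(-List): "x", "x and y", or "x, y and z" style lists.
items([X])      --> item(X).
items([X, Y])   --> item(X), conj, item(Y).
% A comma may only be followed by a list that still ends in a
% conjunction, i.e. one with at least two entries.
items([X | Xs]) --> item(X), [','], items(Xs), { Xs = [_, _ | _] }.
```

For example, phrase(items(L), [engine, ',', wheel, and, door]) gives L = [engine, wheel, door], while a list without the final conjunction, such as [engine, ',', wheel], is rejected.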

3.2.5 Compound Sentences and Pronouns

One or more sentences of the types mentioned above, combined by conjunctions and followed by a sentence-ending punctuation mark (full stop is the only one implemented so far), comprise a compound sentence. A use case is made up of one or more compound sentences.

Within a use case (possibly spanning compound sentences), the subject of a sentence can be substituted by a pronoun, e.g. “A man is a type of person. He has a car”. The pronoun will always refer to the subject of the previous sentence. The pronouns available are he, she, it and they.
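The rule that a pronoun always refers to the subject of the previous sentence can be sketched by threading the previous subject through a list of sentences. The sentence representation s(Subject, Rest), with Subject being either noun(N) or pronoun(P), is an assumption made for this example and not the parse tree format of the project.

```prolog
% resolve_use_case(+Sentences, -Resolved): replace every pronoun subject
% by the subject of the previous sentence.
resolve_use_case(Sentences, Resolved) :-
    resolve(Sentences, none, Resolved).

% resolve(+Sentences, +PreviousSubject, -Resolved)
resolve([], _, []).
resolve([s(noun(N), Rest) | Ss], _, [s(N, Rest) | Out]) :-
    resolve(Ss, N, Out).                 % a noun becomes the new referent
resolve([s(pronoun(_), Rest) | Ss], Prev, [s(Prev, Rest) | Out]) :-
    Prev \== none,                       % a pronoun needs a previous subject
    resolve(Ss, Prev, Out).
```

For the use case “A man is a type of person. He has a car”, represented as [s(noun(man), [is, a, type, of, person]), s(pronoun(he), [has, a, car])], the second sentence is resolved to s(man, [has, a, car]). A use case that starts with a pronoun fails, since there is nothing to refer to.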

3.2.6 Syntactic Stringency

Several syntax check constraints regulate which words and word forms can be used together. We have implemented singular/plural agreement between subject and verb phrases, such that “A woman walks” is accepted, but “She drive a car” is not. Gender agreement between pronouns and the nouns they refer to has also been implemented. Therefore, the use case “A man goes. She has a bag” is not accepted, because “she” is not allowed to refer to “a man”.
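Agreement checks of this kind are commonly expressed in DCG by sharing a feature variable between non-terminals. The following sketch assumes a hypothetical lexicon with gender-marked nouns and pronouns and with verb_form(Plural, Singular) pairs; it illustrates the technique and is not a copy of the project's grammar.

```prolog
% Hypothetical lexicon, for illustration only.
det(a).
det(the).
noun(man, masc).
noun(woman, fem).
noun(car, neut).
noun(bag, neut).
pronoun(he, sg, masc).
pronoun(she, sg, fem).
pronoun(it, sg, neut).
pronoun(they, pl, _).          % "they" carries no gender
verb_form(walk, walks).        % verb_form(PluralForm, SingularForm)
verb_form(drive, drives).
verb_form(go, goes).
verb_form(have, has).

% Number agreement: subject and verb share the feature Num.
sentence --> subject(Num), verb(Num), object.

subject(sg)  --> [Det, Noun], { det(Det), noun(Noun, _) }.
subject(Num) --> [P], { pronoun(P, Num, _) }.

verb(sg) --> [V], { verb_form(_, V) }.
verb(pl) --> [V], { verb_form(V, _) }.

object --> [].
object --> [Det, Noun], { det(Det), noun(Noun, _) }.

% Gender agreement: a pronoun may only refer to a noun of its gender.
may_refer(Pronoun, Noun) :-
    pronoun(Pronoun, _, Gender),
    noun(Noun, Gender).
```

With this sketch, phrase(sentence, [a, woman, walks]) succeeds and phrase(sentence, [she, drive, a, car]) fails. Likewise may_refer(he, man) succeeds while may_refer(she, man) fails, which corresponds to rejecting “A man goes. She has a bag”.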

3.3 Delimiting the Project

We have made simplifications regarding the input and output. Instead of full natural language input, we use Prolog list format, with each word and syntax element represented as atoms. The transition from natural syntax to this list syntax is considered trivial.

The project depends on a lexicon that, in our test project, is defined as part of the program. The lexicon consists of two parts: one part defines words that are part of the basic language definition, the other part defines the domain specific terminology for the use cases we want to analyze. The latter consists of nouns, proper nouns, transitive and intransitive verbs. The domain specific lexicon is implemented as a separate module (file), making it easily interchangeable.
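The two-part lexicon could look roughly like the sketch below. All predicate names and words are hypothetical; in particular, they are not the contents of the project's lexicon files. In practice the domain-specific facts would sit in their own file and be loaded separately, so that another domain can be swapped in without touching the grammar.

```prolog
% --- Part 1: basic language definition (would live with the grammar) ---
determiner(a).
determiner(an).
determiner(the).
conjunction(and).
conjunction(or).
pronoun(he).
pronoun(she).
pronoun(it).
pronoun(they).

% --- Part 2: domain-specific terminology (would live in its own file) ---
noun(customer).
noun(order).
proper_noun(john).
transitive_verb(places).
intransitive_verb(waits).

% known_word(+Word): Word is defined in either part of the lexicon.
known_word(W) :- determiner(W).
known_word(W) :- conjunction(W).
known_word(W) :- pronoun(W).
known_word(W) :- noun(W).
known_word(W) :- proper_noun(W).
known_word(W) :- transitive_verb(W).
known_word(W) :- intransitive_verb(W).

% unknown_words(+Tokens, -Unknown): the tokens missing from the lexicon,
% i.e. the words a new domain lexicon would have to supply.
unknown_words([], []).
unknown_words([W | Ws], Us) :-
    known_word(W), !,
    unknown_words(Ws, Us).
unknown_words([W | Ws], [W | Us]) :-
    unknown_words(Ws, Us).
```

For the input [john, places, an, invoice], unknown_words/2 returns [invoice], which shows how an input in the list format described above relates to the vocabulary of the currently loaded domain.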

We have built a couple of relatively small domain-specific lexicons for testing the approach. The lexicons are sufficiently large to prove that a reasonably wide selection of use cases can be transformed, and can easily be extended. Other dictionaries are available which contain a much larger selection of words and word categorizations, see section 2.4. It would be possible to substitute our simple dictionary with one of those to make the system recognize a much larger vocabulary.


We have chosen to focus on the structural features found in natural language. Behavioral and procedural features are not represented in our system. Metafor (section 2.3.4) and the “Natural Language CASE Tool” (section 2.3.3) have shown that it is possible to handle these, but doing so would be time consuming and out of reach in the time available for this project.

Instead of generating code in an actual programming language, we generate UML diagrams. There are several reasons why we chose to do this:

• There usually is a design phase between use case writing and coding. The UML class diagram is a central tool and model in this phase, and is later used when writing the actual code.

• We focus on the structural aspects, which are exactly what is modeled in the UML class diagram.

• The class diagram is more illustrative than code.

3.4 Tools and methods

The Definite Clause Grammar (DCG) syntactic extension to Prolog is a very powerful tool for building parsers and otherwise analyzing language. DCG itself is based on a simple syntax, where production rules define one grammatical symbol (or segment) in terms of a sequence of (one or more) others. The format is s → n, v, meaning that the (non-terminal) symbol s consists of the symbol sequence n followed by v. The symbols are either non-terminal (i.e. segments that have their own production rules and will be further segmented) or terminal (which, in a natural language grammar, correspond to words). Expressive power is added by the use of parameters (so-called “features”) and by embedding ordinary Prolog code in the production rules. We have used DCG in combination with standard Prolog code for the input grammar and the lexicon.

DCG is, as mentioned, just a syntactic extension to Prolog. The DCG production rules are translated to ordinary Prolog clauses - the production rule s → n, v is transformed to the Prolog clause:

s(List1, Rest) :-
    n(List1, List2),
    v(List2, Rest).

The arguments to all of the predicates in this clause form a construction called difference lists - a construction Definite Clause Grammars rely heavily on. A difference list is a pair of lists, where the second list is a tail of the first. The value of a difference list is the front part of the first list, up to the point where the second list starts. The difference list [1,2,3,4,5],[4,5] evaluates to [1,2,3]. The second list may be (and often is) the empty list, in which case the difference list equals the first list.
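To make the notion of a difference list and the translation of the rule s → n, v concrete, the following minimal sketch may help. diff_value/3 is a hypothetical helper introduced only for this illustration, and the two terminal rules for n and v are assumed example words.

```prolog
:- use_module(library(lists)).   % append/3

% Hypothetical helper, not part of the project code.
% diff_value(+List, +Rest, -Value): Value is the part of List before Rest.
diff_value(List, Rest, Value) :-
    append(Value, Rest, List).

% The production rule from the text, plus two assumed terminal rules.
s --> n, v.
n --> [john].
v --> [walks].
```

The query diff_value([1,2,3,4,5], [4,5], V) binds V to [1,2,3]. Because s is translated to a predicate with two extra arguments, it can be called either through phrase(s, [john,walks]) or directly as s([john,walks,home], Rest), which leaves Rest = [home].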

Different solutions for the grammatical parsing of the input were considered. One option was to build a full syntax tree for the use cases, but this resulted in rather complex code that was hard to get an overview of, and the added complexity did not seem to ease the syntactical transition. A better solution proved to be an approach where the input is segmented on multiple levels, aiming to isolate the programmatic semantics of each segment on the highest possible level, and embedding functionality to handle these semantics directly in the grammar.

In other words, as high up in the segmenting hierarchy as possible, the entities (from an object-oriented point of view) that can be captured from the programmatic semantics of the text segments are asserted to the program’s database as facts.

The following types of facts (shown with their arguments) are asserted:

• class(Classname)

• extends(Classname, Superclass)

• property(PropertyName, Classname, Cardinality)

• method(Methodname, Classname) or, for a method with an argument, method(Methodname, Classname, Argument)

• object(Objectname, Classname)
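As a minimal sketch of how such facts end up in the database, the example below stores the facts for the use case “A car is a type of vehicle.” The definition of addfact/1 follows the one listed in Appendix A; demo_assert/0 is a hypothetical driver added only for this illustration.

```prolog
:- dynamic class/1, extends/2, property/3, method/2, method/3, object/2.

% As in Appendix A: assert a fact only if it is not already known.
addfact(F) :- ( \+ F -> assert(F) ; true ).

% Hypothetical driver: the facts for "A car is a type of vehicle."
demo_assert :-
    addfact(class(car)),
    addfact(class(vehicle)),
    addfact(extends(car, vehicle)),
    addfact(class(car)).          % already known, so nothing is added
```

After calling demo_assert, the query class(C) enumerates car and vehicle exactly once each, since the duplicate class(car) is ignored.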

Prolog, being primarily a declarative language, has a “symmetrical approach” to input and output: which variables are “output” from a rule or procedure depends on whether they are unbound when the rule is evaluated, not on their position in the rule as in imperative languages. In other words, variables that hold output can be found in the body (right-hand side) as well as the head (left-hand side) of a rule. This symmetrical approach also holds for DCG, meaning that DCG production rules can be used not only for parsing a sequence of terminals into a complex, non-terminal symbol, but also for generating a sequence of terminals from a non-terminal. Thus, we could also use DCG in combination with standard Prolog to produce the output, based on the set of facts that exist after the input is processed.
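This symmetry can be shown with a single rule. The non-terminal fact_tokens//1 below is a hypothetical example, not a rule from the project grammar; it relates an extends fact to a word sequence and can be run in either direction.

```prolog
% Hypothetical rule relating a fact to a sequence of words.
fact_tokens(extends(Sub, Super)) -->
    [a, Sub, is, a, kind, of, Super].
```

Used for parsing, phrase(fact_tokens(F), [a,car,is,a,kind,of,vehicle]) binds F to extends(car,vehicle). Used for generation, phrase(fact_tokens(extends(man,person)), Words) binds Words to [a,man,is,a,kind,of,person].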

3.5 Input Grammar Details

As mentioned earlier, the grammar in this version is a subset of ACE. Even though support for the most complex features is not implemented, the grammar is sufficiently advanced to allow for a variety of sentence constructions.

The text is limited to present tense, 3rd person, declarative, active sentence form.

Viewed from the top down, the input consists of one or more use cases. The input grammar identifies each piece of text with semantic significance, and for each such piece, one or more facts are asserted to the program. The list of facts grows with each use case that is analyzed. See table 1 for an example of the mapping between use cases and asserted facts.

A use case consists of one or more compound sentences (each terminated with an end punctuation symbol, “.”). A compound sentence, in turn, consists of one or more sentences, joined together by conjunctions.


A sentence is divided into a noun phrase and a verb phrase, with the noun phrase - either a proper noun or a determiner followed by a noun - as the subject of the sentence.

When a noun is identified in the text, it is generally asserted as a class fact. The verb phrase can have different functional significance. It can be either a subclassing verb phrase, an instantiation verb phrase, a method verb phrase or a property verb phrase, depending partly on the verb it contains, partly on other factors.

Verbs are grouped according to their functionality, related to the types of verb phrases stated above. The verb to be (in its 3rd person present singular and plural forms, is and are) is classified in a functional class of its own, as a subordinating verb. As mentioned in 3.2.3 above, it is found in entity-relational sentences, where it is used to define entities (objects and classes) that are based on other classes. It occurs in two verb phrase types: subclassing verb phrases, which signify the definition of a class as a subclass of some other class (and trigger the assertion of extends facts), and instantiation verb phrases, which signify the instantiation of an object of a class (and trigger the assertion of object facts).

Another functional verb class is possessive verbs. They are found in property verb phrases, which describe property relations for classes, as discussed in section 3.2.2, and assert property facts. The group consists of verbs such as have and contain.

Quantifiers are used in property verb phrases to set the value of the cardinality argument of the property fact. Cardinality is set to 1 if the noun object of the verb phrase is singular. If the noun is plural, it is set to a number > 1 if a numeric quantifier is specified (either as an integer or spelled out in letters for the numbers two to twelve), or to the atom n if no quantifier or an indefinite quantifier (like “some”) is specified.
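The cardinality rule can be sketched as a small predicate. The names cardinality/3 and quantifier_value/2 are assumptions made for this illustration; the project code may organise this differently.

```prolog
% Hypothetical predicate names; the logic mirrors the rule described above.
% cardinality(+Number, +Quantifier, -Cardinality)
cardinality(singular, _, 1).
cardinality(plural, Quantifier, Cardinality) :-
    (   quantifier_value(Quantifier, N)
    ->  Cardinality = N          % numeric quantifier given
    ;   Cardinality = n          % none or indefinite, e.g. "some"
    ).

% quantifier_value(+Quantifier, -Value): integers > 1 or the words two..twelve
quantifier_value(N, N) :- integer(N), N > 1.
quantifier_value(two, 2).
quantifier_value(three, 3).
quantifier_value(four, 4).
quantifier_value(five, 5).
quantifier_value(six, 6).
quantifier_value(seven, 7).
quantifier_value(eight, 8).
quantifier_value(nine, 9).
quantifier_value(ten, 10).
quantifier_value(eleven, 11).
quantifier_value(twelve, 12).
```

For the sentence “A car has some seats, an engine and four wheels”, this gives n for the seats, 1 for the engine and 4 for the wheels, in line with the property facts listed in table 1.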

Other verbs are divided along traditional lines into transitive and intransitive verbs. In our lexicon, these will typically be “action verbs”, mapping into method facts. Intransitive verbs are asserted as parameter-less methods, while transitive verbs are asserted using the grammatical object as parameter (only one parameter is possible using this approach).

As with sentences, method- and property-specifying verb phrases as well as property noun phrases can be compound, following the normal syntax where the first entries in the list are separated by commas and only the last is preceded by a conjunction - and or or.

The grammar is - though fairly basic - sufficiently advanced to show that a DCG that handles different elements of the programmatic semantics on different grammatical levels is able to extract and express differences in meaning from syntactically quite similar sentences.

3.6 Code Generation

The system generates UML class diagrams. As argued in section 3.3, we decided that this more illustrative approach is better suited to the purpose of this project than generating actual code for a programming language.


Table 1: Examples of use cases and the entities associated with them

Use case: A car is a type of vehicle.
Asserted facts: class(car), class(vehicle), extends(car,vehicle)

Use case: A man is a kind of person. John is a man.
Asserted facts: class(man), class(person), extends(man,person), object(john, man)

Use case: A car has some seats, an engine and four wheels.
Asserted facts: class(car), class(seat), property(seat,car,n), class(engine), property(engine,car,1), class(wheel), property(wheel,car,4)

Use case: A library contains books and magazines and has borrowers. A borrower borrows a book. He takes the book and goes.
Asserted facts: class(library), class(book), property(book,library,1), class(magazine), property(magazine,library,1), class(borrower), property(borrower,library,1), method(borrow,borrower,book), method(borrow,borrower,book), method(take,borrower,book), method(go,borrower)

Diagrams are generated in the “dot” language, which can be visualized using Graphviz. “Graphviz is open source graph visualization software.” [gra].

UML class diagrams only represent structural relations between classes, such as inheritance, association and aggregation. Objects are not included. However, in our diagrams we decided not to conform strictly to the UML specification [uml], so that we could include objects anyway. Objects can be recognized by our input grammar, so they should also be displayed in the output. We have introduced our own notation for including objects in the diagram. Objects are represented as boxes that contain the class of the object, then a colon, followed by the name of the object. There is an arrow from the class to the object. The head of this arrow is a circle. This is also our own notation. An example of the notation is shown in figure 1.

We display properties both on the class they belong to and as aggregation arrows. Normally only simple types are displayed as properties on the class; however, we do not operate with simple types. From a visual perspective, it seems clearer to display properties in both ways.


[Figure: class box “myclass” with the method “+some_method() : void”, connected to an object box “myclass:myobject”]

Figure 1: Object notation for UML diagrams

3.6.1 Extraction of facts from the program

As described in section 3.4, programmatic semantics are asserted as facts in the program. To facilitate code generation, we must later extract these facts again. This is done using rules that utilize the built-in bagof predicate. All the facts of each category are appended to lists as tokens. These lists are combined into a token program, where the facts from each list are qualified with additional information tokens (such as class, method, property and object). The final token program is a flat list of tokens. An example of such a token program is shown below (indented for readability):

[ program,
    class, vehicle,
    class, car,
        method, drive,
        property, wheel, 4,
        property, engine, 1,
        property, seat, n,
    class, wheel,
    class, engine,
    class, seat,
    extends, vehicle, car
]
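A minimal sketch of such an extraction step is given below. It is not the project's implementation: the predicate names token_program/1, class_tokens/2 and concat_all/2 are hypothetical, the sample facts are assumed, methods with an argument and objects are left out, and findall/3 is used in place of bagof/3 because findall/3 returns the empty list when a category has no facts.

```prolog
:- use_module(library(lists)).   % append/3
:- dynamic class/1, method/2, property/3, extends/2.

% Assumed sample facts, as they might look after parsing a use case.
class(car).
class(wheel).
method(drive, car).
property(wheel, car, 4).
extends(car, vehicle).

% token_program(-Program): flat token list for all stored facts.
token_program([program|Tokens]) :-
    findall(Ts, (class(C), class_tokens(C, Ts)), ClassLists),
    findall([extends, Super, Sub], extends(Sub, Super), ExtendsLists),
    concat_all(ClassLists, ClassTokens),
    concat_all(ExtendsLists, ExtendsTokens),
    append(ClassTokens, ExtendsTokens, Tokens).

% class_tokens(+Class, -Tokens): class token, then its methods and properties.
class_tokens(C, [class, C|Rest]) :-
    findall([method, M], method(M, C), MethodLists),
    findall([property, P, N], property(P, C, N), PropertyLists),
    concat_all(MethodLists, MethodTokens),
    concat_all(PropertyLists, PropertyTokens),
    append(MethodTokens, PropertyTokens, Rest).

% concat_all(+ListOfLists, -List): concatenate all sublists in order.
concat_all([], []).
concat_all([L|Ls], All) :-
    concat_all(Ls, Rest),
    append(L, Rest, All).
```

With the sample facts, token_program(P) produces a flat list that starts with program, lists each class followed by its methods and properties, and ends with the extends tokens in the order superclass, subclass.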

3.6.2 Generation of Graphviz code

The token program is used as input to a DCG grammar. This also serves as an illustration of how definite clause grammars can be used to produce a language as well as parse one.

The DCG grammar builds the Graphviz code as a parse tree while recognizing the token language. Each rule in the DCG grammar builds a list of “dot” code and control tokens, corresponding to the element in the token grammar. The result is a nested list (indeed, a tree) containing all the output for Graphviz.


Rules of the grammar The grammar contains rules for each semantic category:

• classes

• properties

• methods

• extends (inheritance)

• objects

Since each of the elements may occur any number of times (including zero), they are recognized by recursive DCG rules. Those rules are similar in structure, and the structure of such a recursive rule is:

rec_elements([]) --> []. % match zero or no more elements
rec_elements([X,Y]) --> single_element(X), rec_elements(Y).
single_element(X) --> ...

Each “single element” rule contributes some code to the feature X. The recursive rules collect these contributions in nested lists. The nesting level of the produced list is proportional to the number of elements of this type.
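To show the nesting concretely, the schema can be completed with an assumed single_element rule. The rule below, which recognises the two tokens method and a name and contributes a three-element code list, is a hypothetical instance chosen only for this illustration.

```prolog
% The recursive schema from the text.
rec_elements([]) --> [].                  % match zero or no more elements
rec_elements([X,Y]) --> single_element(X), rec_elements(Y).

% Hypothetical instance: one "method Name" pair contributes a code fragment.
single_element([Name, '()', newline]) --> [method, Name].
```

For the input [method,walk,method,talk], phrase(rec_elements(T), ...) builds T = [[walk,'()',newline],[[talk,'()',newline],[]]]: each further element adds one more level of nesting, which is why the result has to be flattened before output.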

For most of the elements, context is not really needed: code can be generated independently of its context. Whether or not this is possible depends on the similarity of the input and output languages. In our case, we have designed the input language such that the order of elements is almost identical to that of the corresponding elements of the Graphviz code. The places where we need to handle context are described below.

Classes The syntax of classes in the token language, expressed in Backus-Naur Form (BNF), is:

<class> ::= "class" class-name methods properties

For each class, its properties and methods are generated recursively. The class rule produces a feature with all the Graphviz code for the class, which includes the code passed up from the methods and properties.

Properties Properties have the following token syntax, expressed in BNF:

<property> ::= "property" name cardinality

Properties are translated into aggregation arrows that point from one class to another. The name of the class pointed from is given as the next token in the input stream, whereas the class pointed to occurs somewhere before the property in the input stream. We handle this by passing down the name of the current class as a feature to the property rules. A second feature is used to pass up the generated tree for the aggregation arrows, and a third feature is used to pass up the generated tree for the class property list. Cardinality is resolved using the typeinf relation. typeinf generates either an atomic type or an array type (with the relevant number of elements) for the property list.

Methods We handle two kinds of methods: methods that do not take any arguments and methods that take exactly one argument. The token syntax for methods in BNF is:

<method> ::= "method" name | "method" name argument

Methods without an argument are simple and are reflected only in the method list in the class description. A feature is used to generate the tree for the method list.

Methods with an argument also trigger the construction of an association arrow. The first end of the association arrow, the name of the class the method belongs to, is passed down using a feature; the second end is given as the next token in the input stream. A second feature is used to pass up the method list and a third is used to pass up the code for the association arrow.

extends Extends has the following syntax in BNF:

<extends> ::= "extends" superclass-name subclass-name

In inheritance (extends), the names of the related classes are given directly as the tokens following the extends token. This saves the use of a feature to pass down the name of the current class. In the dot language it does not matter where we put the inheritance declarations, so we just put them after all the class declarations.
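A generating rule for extends could look roughly like the sketch below. The rule name gen_extends//1 is hypothetical and the project's actual rule may differ; the emitted fragments are modelled on the inheritance edges in the sample output of section 4.2.1, with newline as a control token.

```prolog
% Hypothetical rule: recognise "extends Super Sub" and build dot code for it.
gen_extends(['edge [ arrowhead = "empty" ]', newline,
             Sub, ' -> ', Super, newline]) -->
    [extends, Super, Sub].
```

Since both class names are read directly from the token stream, the rule needs no feature for passing down context: phrase(gen_extends(Code), [extends, person, man]) yields the fragments for an edge from man to person.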

Objects Objects have the following syntax in BNF:

<object> ::= "object" object-name class-name

Objects are similar in construction to extends. No contextual information is needed by the rules producing code for objects. They are placed at the end of the dot program.

3.6.3 From parse tree to code

A depth-first traversal of the parse tree visits the nodes in the correct order for output in the “dot” language. We “flatten” the list before output, and this flattening process is really a depth-first traversal that puts the elements in a flat list in the order visited.

Each element in this list is written to a file in order of appearance. Certain control tokens (tab and newline) are used to control the formatting of the output, and thus have special interpretations. All other elements are written directly.
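The two steps can be sketched as follows. The predicate names flatten_tree/2 and write_tokens/1 are assumptions made for this illustration; the sketch assumes a ground tree and writes to the current output stream rather than to a file.

```prolog
% flatten_tree(+Tree, -Flat): depth-first, left-to-right flattening of a
% nested list. Hypothetical name; assumes Tree contains no unbound variables.
flatten_tree(Tree, Flat) :-
    flatten_tree(Tree, Flat, []).

flatten_tree([], Flat, Flat) :- !.
flatten_tree([H|T], Flat, Rest) :- !,
    flatten_tree(H, Flat, Mid),
    flatten_tree(T, Mid, Rest).
flatten_tree(Leaf, [Leaf|Rest], Rest).

% write_tokens(+Flat): write each element, interpreting the control tokens.
write_tokens([]).
write_tokens([T|Ts]) :-
    write_token(T),
    write_tokens(Ts).

write_token(newline) :- !, nl.
write_token(tab)     :- !, put_char('\t').
write_token(X)       :- write(X).
```

For example, flatten_tree([[a,[b]],[[c],[]]], F) gives F = [a,b,c], and write_tokens([man,' -> ',person,newline]) prints one line of dot code.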


4 Running the Project Software

4.1 Instructions for Use

The project software consists of several Prolog program files.

• InputGrammar.pl (Grammatical parser)

• CodeGen.pl (Output generator)

• Lexicon_XXX.pl (Domain-specific lexicon for domain XXX)

After consulting the relevant files, one or more use cases are entered using the clause:

use_case(U).

The argument U is a list of atoms representing words and punctuation. Punctuation marks must be enclosed in apostrophes.

To generate output, enter the clause:

generate_dotty_file(F).

F contains the name of a file (enclosed in apostrophes) to contain the output. The file contents serve as input to Graphviz.

4.2 Example Sessions

4.2.1 Simple example using Test domain

This first example session uses the lexicon contained in the file Lexicon_Test.pl. The file contains terms related to persons and cars - a domain that is not closely related to computer systems, but that serves to demonstrate entity relationships.

SICStus 3.11.0 (x86-win32-nt-4): Mon Oct 20 00:38:10 WEDT 2003
Licensed to ruc.dk
| ?- :- consult('C:/NLPproject/InputGrammar.pl').
% consulting c:/nlpproject/inputgrammar.pl...
% loading c:/program files/sicstus prolog 3.11.0/library/lists.po...
% module lists imported into user
% loaded c:/program files/sicstus prolog 3.11.0/library/lists.po in module lists, 0 msec 13600 bytes
% consulted c:/nlpproject/inputgrammar.pl in module user, 10 msec 30688 bytes
| ?- :- consult('C:/NLPproject/codegen.pl').
% consulting c:/nlpproject/codegen.pl...
% consulted c:/nlpproject/codegen.pl in module user, 10 msec 12320 bytes
| ?- :- consult('C:/NLPproject/Lexicon_Test.pl').
% consulting c:/nlpproject/lexicon_test.pl...
% consulted c:/nlpproject/lexicon_test.pl in module user, 0 msec 3976 bytes
| ?- use_case([a,man,is,a,kind,of,person,'.',john,is,a,man,'.']).
yes
| ?- use_case([a,woman,is,a,type,of,person,'.',
              women,walk,',',talk,and,drive,cars,'.']).
yes
| ?- use_case([a,car,has,an,engine,and,four,wheels,'.']).
yes
| ?- generate_dotty_file('c:\\dotty_test.txt').
yes
| ?-

This generates the following output:

digraph G {
    fontsize = 8
    node [
        fontsize = 8
        shape = "record"
    ]
    edge [
        fontsize = 8
    ]
    man[
        label = "{man||}"
    ]
    person[
        label = "{person||}"
    ]
    woman[
        label = "{woman||+: walk(param:) : void\l+: talk(param:) : void\l+: drive(param:car) : void\l}"
    ]
    edge [ arrowhead = "none" ]
    woman -> car [ label="drive" ]
    car[
        label = "{car|- property: engine\l- property: wheel[4]\l|}"
    ]
    edge [ arrowhead = "odiamond" ]
    engine -> car [ label="1" ]
    edge [ arrowhead = "odiamond" ]
    wheel -> car [ label="4" ]
    engine[
        label = "{engine||}"
    ]
    wheel[
        label = "{wheel||}"
    ]
    edge [ arrowhead = "empty" ]
    man -> person
    edge [ arrowhead = "empty" ]
    woman -> person
    obj_john [
        label = "{man:\ljohn}"
    ]
    edge [ arrowhead = "odot" ]
    man -> obj_john
}

When this file is processed by Graphviz, it produces the graphic representation shown in figure 2.

[Figure: class boxes man, person, woman (methods walk, talk and drive), car (properties engine and wheel[4]), engine and wheel; object box man:john; an association labelled “drive” between woman and car; aggregations labelled 1 (engine) and 4 (wheel) towards car]

Figure 2: UML diagram for sample session

The diagram uses a slightly modified UML diagram syntax. The nouns from the use cases can be recognized as class rectangles, with methods and properties specified, and the standard UML arrowed connectors showing generalization (inheritance). Classes that are properties of other classes are shown with connectors ending in an open diamond, the UML syntax for aggregations. The connectors are also labelled with cardinality indicators. If a method in one class has another class type as parameter, the classes are connected with an arrowless connector - the UML notation for associations. These connectors are marked with the method name.

There is one additional syntax element, as presented in 3.6, which is not part of the UML standard: instantiated objects are shown in the diagram. They are connected to the class they belong to by a connector with a circle at the object end.

4.2.2 More complex example with the Company domain

Not only does this example represent a more complex set of entities, the domain is also more relevant to the design of computer systems. The lexicon file Lexicon_Company.pl contains terms related to companies and employees, and can be viewed as a starting point for a Human Resources information system.

The Prolog session is similar to the one above, but instead of Lexicon_Test.pl, the file Lexicon_Company.pl should be consulted. And of course the use cases are different:

use_case([a,company,has,a,number,of,departments,'.',
          it,produces,goods,or,delivers,services,'.']),
use_case([a,department,consists,of,employees,'.',
          all,employees,work,and,they,have,a,position,'.',
          they,are,persons,'.']),
use_case([the,employees,have,salaries,'.',
          the,company,pays,the,salary,'.']),
use_case([a,sales,representative,is,a,kind,of,position,'.',
          an,office,clerk,is,a,type,of,position,'.']),
use_case([office,clerks,use,a,computer,'.']),
use_case([sales,representatives,sell,and,they,have,a,budget,'.']),
use_case([a,boss,is,a,kind,of,position,'.']),
use_case([a,boss,manages,employees,'.']),
use_case([mary, is, the, boss, '.', john, is, an, office, clerk, '.']).

We will not show the contents of the intermediate output file here. The graph produced when it is processed by Graphviz is shown in figure 3.

5 Future Work

There are several options for extending the project and its software. We give a short presentation of some of them below. An obvious one is to extend the grammar. The other options deal with integration and communication with “the outer world”, i.e. other existing systems and services.

Grammar Extension. The grammatical system implemented in the project software is quite limited. Due to the time frame of the project, we had to concentrate on a limited number of syntactical constructs. The software must be considered a “proof of concept” more than a finished system.


Figure 3: UML diagram for company use case

It would be interesting to continue extending the grammar. The list of interesting new features to implement includes more advanced use of pronouns (also enabling pronominal references to the object of a sentence), and handling of indirect objects and prepositional expressions (enabling sentences indicating methods with more than one argument, such as “The man gives the book to the woman”). These extensions do not have large semantic implications.

Features that would also require extensions to the programmatic semantics of the software include the introduction of primitive types (int, real, string). Presently, all properties of a class are of another class type. Also, the introduction of adjectives would imply changes to the property typing system. “The car is red” would for instance have to map “red” to an enumeration type “colour”, for which a finite set of values is defined.

Finally, one could try to implement support for expressing program logic, to put some “flesh” on the skeleton code. One would need ways to specify sequences, selections and iterations of operations, and a strategy for coupling the code description to the correct method. This needs very careful consideration, and it is hard to imagine a solution without a quite stringent and restricted natural language syntax.

Input Parsing via ACE Web Service. The Attempto Controlled English project offers a Web Service that - among more advanced functions - can be used to transform a sentence from ordinary free form (with blanks separating words) to Prolog list format. The service also checks that the input is legal Controlled English, so it can serve as a first-level syntax check for the program.

We have not been able to find any specific information on, or explicit support for, using SICStus Prolog as a Web Service client. Though it is probably not very complicated to implement (using Prolog’s HTTP libraries), we did not want to spend time investigating this during this project.

Extending the Program with Semantically Rich Lexicons. It would be interesting to extend the program using one of the various lexicons described in section 2.4. One obvious consequence of doing this is that we could assume things not explicitly described in the use case by the user. For instance, if the user describes an entity called “john”, we could infer that john is the name of a person, even if the user has never mentioned anything about “persons”. Inheritance and composition could be inferred with a very limited amount of input from the user using WordNet’s kindOf and PartOf relations. ConceptNet’s semantic relations go further and define things such as subeventOf, which could be used to infer procedural features. It would also allow for a much wider input language where synonyms and word senses could be inferred from context. There is a catch, though. Automatically inferring things from a semantic net such as the mentioned lexicons could lead the system to infer something erroneously. The system could also infer too much or too little. Maybe the user does not want the system to automatically infer a “person” class, etc. Using these rich semantic lexicons would bring us closer to an opportunistic approach. Such an approach has some very desirable features, but also introduces unwanted complexity.

Output Formats and .Net CodeDOM. We have concentrated on generating output in a format that can be used to produce UML diagrams. UML diagrams are a common representation for all object-oriented languages and formats, and this makes them a natural first step in a system like this.

However, once we have the information necessary to construct the UML class diagram, going one step further and actually generating a code skeleton in some object-oriented language is close to trivial, even if it would involve quite a bit of labor. An output grammar (code generator) would have to be constructed for each language to be supported. However, general purpose programming languages pose some challenges that were not considered in the generation of UML diagrams:

• Types: Simple types vary between programming languages, and an implementation would require a connection from the input grammar to the desired programming language. The current input grammar can be used to describe that a person has a name and that a name is a string. That would trigger the creation of a string class (which would not know how to actually represent a string). A more sensible solution would instead use the programming language’s built-in string class.

• Order of construction of classes: In Graphviz the order in which the classes are described does not matter. In a programming language, classes usually must be described in an order such that they are declared before the classes referencing them.

• Generation of variable and parameter names: Uniqueness could easily be ensured by using a global uniqueness scheme. Giving the variables sensible names would be more difficult.

One particularly interesting output alternative to investigate would be to interface the program to Microsoft .NET and deliver the data in CodeDOM (Code Document Object Model) format. CodeDOM is a .NET API defining a language-independent program description (or meta language) model. A program specification in CodeDOM can be rendered to any .NET-based programming language. However, the CodeDOM model itself has no persistent format - the model is constructed at runtime and cannot be saved other than by rendering it in a programming language. Therefore, direct interfacing to .NET is necessary to use this approach. Prolog has a .NET interface module that we presume could facilitate this.

6 Conclusion

The purpose of this project was formulated as “...to investigate the possibilities for automated transition from “Use Cases” in a natural language syntax into a computer readable representation...”.

Natural language processing is an immensely complex field. It was important for us to constrain the scope of the project, since there would have been more than enough interesting problems to spend our time on. Committing to these constraints, we have shown that our goal was realistic, and that it was possible to construct a functional - though somewhat limited - NLP system within the timeframe of a 4 week student project.

We also gained valuable experience in the use of Prolog and Definite Clause Grammars, which proved to be as powerful tools for this kind of task as we anticipated. For relatively inexperienced Prolog users, however, it is sometimes demanding to switch one’s mindset from imperative to logic programming - things take time, and often the result of a day’s work measured in lines of code is not too impressive. It is all the more satisfying to see how advanced functionality one can achieve with very few program lines!

Summing up, it has been a most interesting project to work on, we achieved our goal, and it had a good learning effect.


7 References

[Ach98a] Camille Ben Achour. Guiding scenario authoring. In EJC, pages 152–171, 1998.

[Ach98b] Camille Ben Achour. Writing and correcting textual scenarios for system design. In DEXA Workshop, pages 166–170, 1998.

[BJR99] Grady Booch, Ivar Jacobson, and James Rumbaugh. The Unified Modeling Language User Guide. Addison-Wesley, 1999.

[Coc97] Alistair Cockburn. Structuring use cases with goals. Journal of Object-Oriented Programming, September/October 1997.

[CP00] Karl Cox and Keith Phalp. Replicating the CREWS use case authoring guidelines experiment. Empirical Software Engineering, 5(3):245–267, 2000.

[Fel98] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[Fow03] Martin Fowler. UML Distilled: A Brief Guide to the Standard Object Modeling Language. Object Technology Series. Addison-Wesley, third edition, September 2003.

[FST99] Norbert E. Fuchs, Uta Schwertel, and Sunna Torge. Controlled natural language can replace first-order logic, October 09 1999.

[Fuc00] Norbert E. Fuchs. Attempto controlled english. In WLP, pages 211–218, 2000.

[gra] Graphviz - graph visualization software.

[KCS01] Karl Cox, Keith Phalp, and Martin Shepperd. Comparing use case writing guidelines. In Seventh International Workshop on Requirements Engineering (RE’01), June 2001.

[LL04] Hugo Liu and Henry Lieberman. Toward a programmatic semantics of natural language. In VL/HCC, pages 281–282. IEEE Computer Society, 2004.

[LL05] Hugo Liu and Henry Lieberman. Programmatic semantics for natural language interfaces. In Proceedings of ACM CHI 2005 Conference on Human Factors in Computing Systems, volume 2 of Late breaking results: short papers, pages 1597–1600, 2005.

[LS04] Hugo Liu and Push Singh. Conceptnet: A practical commonsense reasoning toolkit, May 02 2004.

[NT96] Introducing A Natural and Language Case Tool. Eliciting and mapping business rules to IS design: Introducing a natural language CASE tool, July 31 1996.

[Sow04] John F. Sowa. Common logic controlled english, 2004.

[uml] Unified Modeling Language (UML), version 2.0.


8 Appendix A - Code

This appendix contains the Prolog code produced during the project. The code is also available via the webpage:

http://www.itu.dk/~cth/nlp/

8.1 Input Grammar.pl

:- dynamic(class/1).      % class(Name)
:- dynamic(method/2).     % method(Name, Class)
:- dynamic(method/3).     % method(Name, Class, Argument)
:- dynamic(property/3).   % property(Name, Class, Cardinality)
:- dynamic(extends/2).    % extends(Name, Super)
:- dynamic(object/2).     % object(Name, Class)

:- use_module(library(lists)).

%%%% Global predicates

addfact(F) :- \+ F -> assert(F) ; true.

getclass(A, C) :- object(A, C) ; class(A), A = C.

%%%% Translation from input format to DCG format

use_case(S) :- use_case(_, _, S, []).

%%%% Grammar rules and assertions

%%%% Ai = Actor (input), Ao = Actor (output), Gi = Gender (input), Go = Gender (output)

use_case(Ai, Gi) --> compound_sentence(Ai, Gi, _, _).
use_case(Ai, Gi) --> compound_sentence(Ai, Gi, Ao, Go), use_case(Ao, Go).

compound_sentence(Ai, Gi, Ao, Go) -->
    sentence(Ai, Gi, Ao, Go), moresentences(Ao, Go), end_punctuation.

sentence(_, _, Actor, Gnd) -->
    noun_phrase(Cnt, Gnd, Actor),
    verb_phrase(Cnt, Actor).

sentence(Actor, Gnd, Actor, Gnd) -->
    pronoun(Cnt, Gnd, Actor),
    verb_phrase(Cnt, Actor).

moresentences(_, _) --> [].
moresentences(Ai, Gi) -->
    conjunction, sentence(Ai, Gi, Ao, Go),
    moresentences(Ao, Go).

verb_phrase(Cnt, Actor) --> subclassing_verb_phrase(Cnt, Actor).
verb_phrase(Cnt, Actor) --> instantiation_verb_phrase(Cnt, Actor).
verb_phrase(Cnt, Actor) --> method_verb_phrases(Cnt, Actor).
verb_phrase(Cnt, Actor) --> property_verb_phrases(Cnt, Actor).

subclassing_verb_phrase(Cnt, Actor) -->
    subord_verb(Cnt, _),
    subclassing_noun_phrase(Cnt, Object), !,
    { addfact(extends(Actor, Object)) }.

instantiation_verb_phrase(Cnt, Actor) -->
    subord_verb(Cnt, _),
    noun_phrase(_, _, Object), !,
    { addfact(object(Actor, Object)) }.

method_verb_phrases(Cnt, Actor) --> method_verb_phrase(Cnt, Actor).
method_verb_phrases(Cnt, Actor) -->
    method_verb_phrase_list(Cnt, Actor), method_verb_phrase(Cnt, Actor),
    conjunction, method_verb_phrase(Cnt, Actor).

method_verb_phrase_list(_, _) --> [].
method_verb_phrase_list(Cnt, Actor) -->
    method_verb_phrase(Cnt, Actor),
    list_separator,
    method_verb_phrase_list(Cnt, Actor).

method_verb_phrase(Cnt, Actor) -->
    intrans_verb(Cnt, Verb), !,
    { getclass(Actor, Actor_Class), addfact(method(Verb, Actor_Class)) }.

method_verb_phrase(Cnt, Actor) -->
    trans_verb(Cnt, Verb),
    noun_phrase(_, _, Object), !,
    { getclass(Actor, Actor_Class), addfact(method(Verb, Actor_Class, Object)) }.

property_verb_phrases(Cnt, Actor) --> property_verb_phrase(Cnt, Actor).
property_verb_phrases(Cnt, Actor) -->
    property_verb_phrase_list(Cnt, Actor), property_verb_phrase(Cnt, Actor),
    conjunction, property_verb_phrase(Cnt, Actor).

property_verb_phrase_list(_, _) --> [].
property_verb_phrase_list(Cnt, Actor) -->
    property_verb_phrase(Cnt, Actor),
    list_separator,
    property_verb_phrase_list(Cnt, Actor).

property_verb_phrase(Cnt, Actor) -->
    possess_verb(Cnt, _),
    property_noun_phrases(Actor).

property_noun_phrases(Actor) --> property_noun_phrase(Actor).
property_noun_phrases(Actor) -->
    property_noun_phrase_list(Actor), property_noun_phrase(Actor),
    conjunction, property_noun_phrase(Actor).

property_noun_phrase_list(_) --> [].
property_noun_phrase_list(Actor) -->
    property_noun_phrase(Actor),
    list_separator, property_noun_phrase_list(Actor).

property_noun_phrase(Actor) -->
    noun_phrase(sing, _, Object), !,
    { getclass(Actor, Actor_Class), addfact(property(Object, Actor_Class, 1)) }.

property_noun_phrase(Actor) -->
    quantifier(X),
    noun_phrase(plur, _, Object), !,
    { getclass(Actor, Actor_Class), addfact(property(Object, Actor_Class, X)) }.

noun_phrase(Cnt, Gnd, Actor) -->
    determiner(Cnt),
    noun(Cnt, Gnd, Actor), !,
    { addfact(class(Actor)) }.

noun_phrase(sing, Gnd, Actor) --> proper_noun(Gnd, Actor).

subclassing_noun_phrase(Cnt, Actor) -->
    subclasser,
    noun(Cnt, _, Actor), !,
    { addfact(class(Actor)) }.

%%%% Lexicon − general

conjunction --> [and].
conjunction --> [or].

list_separator --> [','].
end_punctuation --> ['.'].

determiner(sing) --> [a].
determiner(sing) --> [an].
determiner(_) --> [the].
determiner(plur) --> [].
determiner(sing) --> [any].
determiner(sing) --> [every].
determiner(plur) --> [some].
determiner(plur) --> [most].
determiner(plur) --> [all].

pronoun(sing, n, _) --> [it].
pronoun(sing, m, _) --> [he].
pronoun(sing, f, _) --> [she].
pronoun(plur, _, _) --> [they].

quantifier(n) --> [].
quantifier(n) --> [several].
quantifier(n) --> [some].
quantifier(n) --> [a, number, of].
quantifier(X) --> [X], { integer(X) }.
quantifier(2) --> [two].
quantifier(3) --> [three].
quantifier(4) --> [four].
quantifier(5) --> [five].
quantifier(6) --> [six].
quantifier(7) --> [seven].
quantifier(8) --> [eight].
quantifier(9) --> [nine].
quantifier(10) --> [ten].
quantifier(11) --> [eleven].
quantifier(12) --> [twelve].

subclasser --> [a, kind, of].
subclasser --> [a, sort, of].
subclasser --> [a, type, of].

possess_verb(sing, have) --> [has].
possess_verb(plur, have) --> [have].
possess_verb(sing, contain) --> [contains].
possess_verb(plur, contain) --> [contain].
possess_verb(sing, consistof) --> [consists, of].
possess_verb(plur, consistof) --> [consist, of].

subord_verb(sing, be) --> [is].
subord_verb(plur, be) --> [are].
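
The way the grammar combines number agreement (the Cnt argument) with the assertion of facts can be illustrated on a much smaller scale. The following is a minimal sketch and not part of the project code: all mini_ predicates and the fact names seen_class/1 and seen_property/3 are hypothetical, chosen so that the sketch does not clash with the real grammar. It assumes a Prolog system with DCG support and phrase/2, such as SWI-Prolog.

```prolog
:- dynamic seen_class/1, seen_property/3.

% Assert a fact only if it is not already known (same idea as addfact/1).
add_fact(F) :- ( \+ F -> assert(F) ; true ).

% A sentence is a noun phrase followed by a possessive verb phrase.
% Both must agree in number (Cnt).
mini_sentence --> mini_np(Cnt, Owner), mini_vp(Cnt, Owner), ['.'].

mini_np(Cnt, Noun) -->
    mini_det(Cnt), mini_noun(Cnt, Noun),
    { add_fact(seen_class(Noun)) }.

mini_vp(Cnt, Owner) -->
    mini_has(Cnt), mini_quant(N), mini_noun(plur, Part),
    { add_fact(seen_class(Part)),
      add_fact(seen_property(Part, Owner, N)) }.

mini_det(sing) --> [a].
mini_det(_)    --> [the].

mini_has(sing) --> [has].
mini_has(plur) --> [have].

mini_quant(4) --> [four].
mini_quant(N) --> [N], { integer(N) }.

mini_noun(sing, car)   --> [car].
mini_noun(plur, car)   --> [cars].
mini_noun(sing, wheel) --> [wheel].
mini_noun(plur, wheel) --> [wheels].

% Parse one sentence and collect the property facts it produced.
mini_demo(Props) :-
    retractall(seen_class(_)),
    retractall(seen_property(_, _, _)),
    phrase(mini_sentence, [a, car, has, four, wheels, '.']),
    findall(p(Name, Class, Card), seen_property(Name, Class, Card), Props).
```

Calling mini_demo(Props) binds Props to [p(wheel, car, 4)]. A token list such as [a, cars, has, four, wheels, '.'] is rejected, because the determiner and the noun disagree in number.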

8.2 CodeGen.pl

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Code generation

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:- use_module(library(lists)).

%%% utility rules

flatten([Head|Tail], FlatList) :-
    flatten(Head, FlatHead),
    flatten(Tail, FlatTail),
    append(FlatHead, FlatTail, FlatList), !.
flatten([], []).
flatten(X, [X]) :- atomic(X).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% Determine array type using cardinality..

typeinf(n, '[n]') :- !.
typeinf(1, '') :- !.
typeinf(X, ['[', X, ']']) :- !.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% extraction

% Extracts facts from the program as lists

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

classes(Classes) :-
    bagof(Class, class(Class), Classes), !.
classes([]).

class_methods_noarg(Class, Methods) :-
    bagof(X, method(X, Class), Methods).
class_methods_noarg(_, []).

class_methods_one_arg(Class, [Methods, Arg]) :-
    bagof(X, method(X, Class, Arg), Methods).
class_methods_one_arg(_, []).

class_properties(Class, [X, Cnt]) :-
    bagof(X, property(X, Class, Cnt), X).
class_properties(_, []).

class_properties_list(Class, L) :-
    bagof(X, class_properties(Class, X), L).

extends_list([SuperClass, SubClass]) :-
    bagof(X, extends(SubClass, SuperClass), X).
extends_list([]).

objects([O, C]) :-
    bagof(X, object(O, C), X).
objects([]).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Generate program list

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

generate_program(Program) :-
    classes(C), generate_classes(C, ClassProg),
    bagof(EL, extends_list(EL), EL), qualify(extends, EL, ExtendsProg),
    bagof(O, objects(O), O), qualify(object, O, ObjectsProg),
    flatten([program, ClassProg, ExtendsProg, ObjectsProg], Program), !.

generate_classes([], []).
generate_classes([C|Rest], [class, C, M0L, M1L, PL|GRest]) :-
    class_methods_noarg(C, M0), qualify(method, M0, M0L),
    bagof(X, class_methods_one_arg(C, X), M1), qualify(method, M1, M1L),
    class_properties_list(C, P), qualify(property, P, PL),
    generate_classes(Rest, GRest).

qualify(_, [], []).
qualify(_, [[]], []).
qualify(Q, [[]|Rest], R) :- qualify(Q, Rest, R).
qualify(Q, [C|Rest], [Q, C|R]) :-
    qualify(Q, Rest, R).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Generate output for Graphviz %

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

gv_uml([H, C, E, O, F]) -->
    gv_header(H),
    gv_classes(C),
    gv_extends(E),
    gv_objects(O),
    gv_footer(F).

gv_header(['digraph G {', newline,
           tab, 'fontsize = 8', newline, newline,
           tab, 'node [', newline,
           tab, tab, 'fontsize = 8', newline,
           tab, tab, 'shape = "record"', newline,
           tab, ']', newline, newline,
           tab, 'edge [', newline,
           tab, tab, 'fontsize = 8', newline,
           tab, ']', newline]) -->
    [program].

gv_footer([newline, '}', newline]) --> [].

%%% Classes:

gv_classes([]) --> [].
gv_classes([C, CS]) -->
    gv_class(C),
    gv_classes(CS).

gv_class([tab, Name, '[ ', newline, tab, tab,
          'label = "{', Name, '|', P, '|', M, '}"',
          newline, tab, ']',
          newline, Aggregations, newline,
          newline, A, newline]) -->
    [class],
    gv_name(Name),
    gv_methods(Name, A, M),
    gv_properties(Name, P, Aggregations).

%% Methods:

gv_methods(_, [], []) --> [].
gv_methods(Class, [A1, A2], [N, M]) -->
    gv_method(Class, A1, N), gv_methods(Class, A2, M).

gv_method(Class, Assoc,
          ['+: ', Name, '(param:', Arg, ')',
           ' : ', 'void', '\\', 'l']) -->
    [method],
    gv_name(Name),
    gv_method_arg(Class, Arg, Name, Assoc).

gv_method_arg(_, [], _, []) --> [].
gv_method_arg(Class, Arg, MethodName,
              [newline, tab, 'edge [ arrowhead = "none" ]',
               newline, tab,
               Class, ' -> ', Arg, ' [ label="', MethodName, '" ]', newline]) -->
    gv_name(Arg).

%% Properties:

gv_properties(_, [], []) --> [].
gv_properties(Class, [Props1|PropsRest], [Agg1, AggRest]) -->
    gv_property(Class, Props1, Agg1),
    gv_properties(Class, PropsRest, AggRest).

gv_property(Class,
            ['- property: ', Name, ArrayType, '\\', 'l'],
            [tab, 'edge [ arrowhead = "odiamond" ]', newline,
             tab, Name, ' -> ', Class, ' [ label="', C, '" ]', newline]) -->
    [property],
    gv_name(Name),
    gv_cardinality(C),
    { typeinf(C, ArrayType) }.

%% inheritance:

gv_extends([]) --> [].
gv_extends([E, F]) -->
    gv_extend(E),
    gv_extends(F).

gv_extend([tab, 'edge [ arrowhead = "empty" ]', newline,
           tab, Sub, ' -> ', Super, newline]) -->
    [extends],
    gv_name(Super),
    gv_name(Sub).

% objects

gv_objects([]) --> [].
gv_objects([O, P]) --> gv_object(O), gv_objects(P).

gv_object([newline, tab, 'obj_', Object, ' [ ', newline,
           tab, tab, 'label = "{', Class, ':', '\\', 'l',
           Object, '}"', newline,
           tab, ']', newline,
           tab, 'edge [ arrowhead = "odot" ]', newline,
           tab, Class, ' -> ', 'obj_', Object, newline]) -->
    [object],
    gv_name(Object),
    gv_name(Class).

gv_name(X) --> [X].
gv_cardinality(X) --> [X].

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Write output to file

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

output_code([]).
output_code([newline|Rest]) :-
    nl, output_code(Rest).
output_code([tab|Rest]) :-
    write('    '), output_code(Rest).
output_code([X|Rest]) :-
    write(X), output_code(Rest).

generate_dotty :-
    generate_program(P), !,
    gv_uml(Code, P, []),
    flatten(Code, FlatCode),
    output_code(FlatCode).

generate_dotty_file(File) :-
    tell(File), generate_dotty, told.
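
To make the shape of the generated Graphviz text easier to picture, the following minimal sketch emits a comparable DOT fragment directly from facts. It deliberately swaps the token-list technique of gv_uml for plain format/2 calls, so it is not the project's generator. The predicates demo_class/1, demo_property/3, emit_classes/0, emit_properties/0, emit_dot/0 and dot_atom/1 are hypothetical, and with_output_to/2 assumes SWI-Prolog.

```prolog
% Hypothetical sample facts, shaped like the facts the grammar asserts.
demo_class(car).
demo_class(wheel).
demo_property(wheel, car, 4).      % demo_property(Name, Class, Cardinality)

% One record-shaped node per class.
emit_classes :-
    forall(demo_class(C),
           format('    ~w [ label = "{~w}" ]~n', [C, C])).

% One aggregation edge per property, labelled with its cardinality.
emit_properties :-
    forall(demo_property(Name, Class, Card),
           format('    ~w -> ~w [ arrowhead = "odiamond", label = "~w" ]~n',
                  [Name, Class, Card])).

% Print a complete digraph: header, nodes, edges, footer.
emit_dot :-
    format('digraph G {~n', []),
    format('    node [ shape = "record" ]~n', []),
    emit_classes,
    emit_properties,
    format('}~n', []).

% Capture the text as an atom instead of printing it (SWI-Prolog).
dot_atom(Text) :-
    with_output_to(atom(Text), emit_dot).
```

Running emit_dot prints a small digraph that the Graphviz tools can render. As in gv_property above, the aggregation edge points from the property to the owning class.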

8.3 Lexicon Test.pl

%%%% Dictionary − Domain specific − Persons and cars

noun(sing, n, person) --> [person].
noun(plur, n, person) --> [persons].
noun(sing, m, man) --> [man].
noun(plur, m, man) --> [men].
noun(sing, f, woman) --> [woman].
noun(plur, f, woman) --> [women].
noun(sing, n, car) --> [car].
noun(plur, n, car) --> [cars].
noun(sing, n, engine) --> [engine].
noun(plur, n, engine) --> [engines].
noun(sing, n, wheel) --> [wheel].
noun(plur, n, wheel) --> [wheels].
noun(sing, n, seat) --> [seat].
noun(plur, n, seat) --> [seats].
noun(sing, n, bag) --> [bag].
noun(plur, n, bag) --> [bags].

proper_noun(m, john) --> [john].
proper_noun(f, mary) --> [mary].

intrans_verb(sing, go) --> [goes].
intrans_verb(plur, go) --> [go].
intrans_verb(sing, walk) --> [walks].
intrans_verb(plur, walk) --> [walk].
intrans_verb(sing, talk) --> [talks].
intrans_verb(plur, talk) --> [talk].
intrans_verb(sing, look) --> [looks].
intrans_verb(plur, look) --> [look].
intrans_verb(sing, run) --> [runs].
intrans_verb(plur, run) --> [run].

trans_verb(sing, drive) --> [drives].
trans_verb(plur, drive) --> [drive].
trans_verb(sing, like) --> [likes].
trans_verb(plur, like) --> [like].
trans_verb(sing, love) --> [loves].
trans_verb(plur, love) --> [love].

8.4 Lexicon Company.pl

%%%% Dictionary − Domain specific − Companies and employees

noun(sing, n, company) --> [company].
noun(plur, n, company) --> [companies].
noun(sing, n, department) --> [department].
noun(plur, n, department) --> [departments].
noun(sing, n, person) --> [person].
noun(plur, n, person) --> [persons].
noun(sing, m, man) --> [man].
noun(plur, m, man) --> [men].
noun(sing, f, woman) --> [woman].
noun(plur, f, woman) --> [women].
noun(sing, n, employee) --> [employee].
noun(plur, n, employee) --> [employees].
noun(sing, n, salary) --> [salary].
noun(plur, n, salary) --> [salaries].
noun(sing, n, position) --> [position].
noun(plur, n, position) --> [positions].
noun(sing, n, office_clerk) --> [office, clerk].
noun(plur, n, office_clerk) --> [office, clerks].
noun(sing, n, sales_rep) --> [sales, representative].
noun(plur, n, sales_rep) --> [sales, representatives].
noun(sing, n, budget) --> [budget].
noun(plur, n, budget) --> [budgets].
noun(sing, n, computer) --> [computer].
noun(plur, n, computer) --> [computers].
noun(plur, n, goods) --> [goods].
noun(plur, n, service) --> [services].
noun(sing, n, boss) --> [boss].
noun(plur, n, boss) --> [bosses].

proper_noun(m, john) --> [john].
proper_noun(f, mary) --> [mary].

intrans_verb(sing, work) --> [works].
intrans_verb(plur, work) --> [work].
intrans_verb(sing, sell) --> [sells].
intrans_verb(plur, sell) --> [sell].

trans_verb(sing, pay) --> [pays].
trans_verb(plur, pay) --> [pay].
trans_verb(sing, produce) --> [produces].
trans_verb(plur, produce) --> [produce].
trans_verb(sing, deliver) --> [delivers].
trans_verb(plur, deliver) --> [deliver].
trans_verb(sing, manage) --> [manages].
trans_verb(plur, manage) --> [manage].
trans_verb(sing, use) --> [uses].
trans_verb(plur, use) --> [use].

%%%% Test case:

company_test_case :-
    use_case([a, company, has, a, number, of, departments, '.',
              it, produces, goods, or, delivers, services, '.']),
    use_case([a, department, consists, of, employees, '.',
              all, employees, work, and, they, have, a, position, '.',
              they, are, persons, '.']),
    use_case([the, employees, have, salaries, '.',
              the, company, pays, the, salary, '.']),
    use_case([a, sales, representative, is, a, kind, of, position, '.',
              an, office, clerk, is, a, type, of, position, '.']),
    use_case([office, clerks, use, a, computer, '.']),
    use_case([sales, representatives, sell, and, they, have, a, budget, '.']),
    use_case([a, boss, is, a, kind, of, position, '.']),
    use_case([a, boss, manages, employees, '.']),
    use_case([mary, is, the, boss, '.', john, is, an, office, clerk, '.']).
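
The test case passes every use case as a hand-written list of lower-case atoms, with '.' and ',' as separate tokens. As an illustration of how plain text might be brought into that format, here is a minimal sketch of a tokenizer. It is not part of the project code: tokenize/2, pad_punctuation/2, non_empty/1 and to_token/2 are hypothetical names, and the string built-ins used assume SWI-Prolog 7 or later with its default double-quoted strings.

```prolog
% tokenize(+Text, -Tokens)
% Lower-case the text, separate '.' and ',' from the words,
% split on spaces and convert each part to an atom or an integer.
tokenize(Text, Tokens) :-
    string_lower(Text, Lower),
    string_chars(Lower, Chars),
    pad_punctuation(Chars, Padded),
    string_chars(PaddedString, Padded),
    split_string(PaddedString, " ", " ", Parts0),
    include(non_empty, Parts0, Parts),
    maplist(to_token, Parts, Tokens).

% Surround full stops and commas with spaces so they become tokens.
pad_punctuation([], []).
pad_punctuation([C|Cs], [' ', C, ' '|Rest]) :-
    memberchk(C, ['.', ',']), !,
    pad_punctuation(Cs, Rest).
pad_punctuation([C|Cs], [C|Rest]) :-
    pad_punctuation(Cs, Rest).

% Drop the empty strings left behind by repeated spaces.
non_empty(S) :- S \== "".

% Integers stay numbers (the grammar's quantifier rule accepts them);
% everything else becomes an atom.
to_token(Part, Token) :-
    (   number_string(N, Part), integer(N)
    ->  Token = N
    ;   atom_string(Token, Part)
    ).
```

The resulting list has the form expected by use_case/1. Multi-word nouns need no special handling, since the lexicon already matches them as token sequences such as [office, clerk].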
