The candidate confirms that the work submitted is their own and that appropriate credit has been given where reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.
(Signature of student)
Recognising Textual Entailment
Dragos Ionut Necula
Cognitive Science (Industry)
2004/2005
Summary
This project sets out to investigate the task of recognising textual
entailment in automated systems. The first chapter presents a project
summary. In the second chapter the task of textual entailment, and
issues surrounding it, are analysed, both from a theoretical standpoint
and regarding existing attempts at implementation in automated
systems. The final section of this chapter outlines the implementation
approach proposed, which relies on the division of the task into a
series of subtasks representing various potential sources of
entailment. The third chapter describes in detail the software
implemented, providing at the same time an in-depth analysis of the
task of entailment. Features described are entailment from synonymy,
from hypernymy and from passive sentences. A confidence level estimate
is also proposed and explained. The final chapter gives an account of
the methods used to evaluate the software, covering corpus design as
well as theoretical considerations. Finally, potential future
extensions to the system are reviewed.
Acknowledgements
I would like to thank Eric Atwell, my supervisor, for his patience and
useful feedback, Dr. Katja Markert, my assessor, for her comments and
suggestions, as well as the developers behind the tools used in
developing the software and everyone behind the Pascal RTE Challenge.
Contents
Chapter 1. Introduction 1
1.1 Aim 1
1.2 Objectives 1
1.3 Minimum Requirements 2
1.4 Possible Extensions 3
1.5 Project Schedule 3
Chapter 2. Background 5
2.1 The Notion of Entailment 6
2.1.1 Definition 6
2.1.2 Ambiguity of Entailment 7
2.2 Automating Entailment 9
2.2.1 The Pascal RTE Challenge 9
2.2.2 The Probabilistic Approach 10
2.2.3 The Symbolic Approach 12
2.2.4 Proposed Approach 14
Chapter 3. The Software 16
3.1 Overview of Design 16
3.2 Methodology 17
3.3 Software Tools 19
3.3.1 The RASP Toolkit 20
3.3.2 The Wordnet Database 21
3.3.3 The NLTK and Python 22
3.4 The Minimum Requirements of the System 24
3.5 Additional Implemented Features 27
3.5.1 Entailment from Hypernymy 27
3.5.2 Entailment from Passive Sentences 29
3.5.3 Confidence Estimate 30
3.6 Further Refinements 32
Chapter 4. Evaluation 36
4.1 Corpus Development 36
4.2 Testing and Evaluation 38
4.2.1 Analysis of Testing Results 38
4.2.2 Evaluation 41
4.3 Possible Enhancements 43
4.4 Conclusion 45
Bibliography 46
Appendices 49
Chapter 1. Introduction
1.1 Aim
The overall aim of the project is to provide a study of and propose
approaches to the task of automatic textual inferencing in the
context of natural language processing, including a basic Python
program to illustrate and implement some of the methods proposed.
1.2 Objectives
The objectives of the project are to:
● review and critically assess existing literature relevant to
the task of textual entailment
● outline a possible symbolic approach to the task of
implementing the capability of textual inferencing in
computer systems, starting from the idea of dividing the
task of entailment into subtasks reflecting various
entailment sources.
● develop software that would implement methods aimed at
achieving automated textual inferencing along the lines of
the proposed approach. The software is to be developed in
Python and will make use of existing NLP tools (e.g. parsers
or stemmers) included in the Natural Language Toolkit (NLTK
- nltk.sourceforge.net). Given a knowledge base and a novel
sentence the system would return a confidence level of the
truth of the sentence by attempting to infer it from the
knowledge base.
● evaluate the software produced against a specialised corpus,
developed around the features of entailment implemented in
the system.
1.3 Minimum Requirements
The minimum requirements are:
● a report on existing literature relevant to the task of
textual entailment
● the successful incorporation of a set of relevant external
tools (e.g. parsers, lexical databases) in the software
produced, with the aim of concentrating the development
process on the task of entailment, as opposed to other
fields of Natural Language Processing (e.g. tokenization,
parsing)
● a working software system that would attempt textual
inferencing with novel simple sentences of the form Subject
– Verb – Direct Object, probably through the use of synonymy
applied to a knowledge base consisting of a corpus formed of
the same type of sentences. For instance, given the sentence
'Bob rushed home' and assuming the sentence 'Bob hurried
home' exists in the knowledge base, the system should
be able to correctly identify that the former is entailed by
its knowledge base.
● the construction of a suitable, relevant corpus of sentences
of the type Subject - Verb – Direct Object that would be
used for the purposes of testing the written software.
● appropriate and correct evaluation of the software, both
through testing with the created corpus and on theoretical
considerations.
Note: The minimum requirements have been modified as a result of the
feedback received after submitting the mid-project report.
1.4 Possible Extensions
● the implementation of an entailment confidence estimate that
would approximate, in percentages, the possibility of
entailment of a test sentence from the knowledge base
● the implementation of a feature responsible for detection of
entailment from hypernyms
● the implementation of a feature responsible for detection of
entailment as a result of active – passive sentence
conversions
● the addition of capability for the system to handle sentence
structures more complex than Subject - Verb - Direct Object.
1.5 Project Schedule
Week 2a – Week 10a - Background reading.
Week 2a – Week 4a - Completing the aims and minimum requirements
form
Week 3a – Week 5a - Familiarization with Python and the NLTK.
Week 3a – Week 7a - Basic outline of the proposed approach to
entailment.
Week 9a – Week 11a - Writing up the mid-project report.
Week 8a – Week 4b - Development of appropriate corpus
Week 8a – Week 11a - Implementation of a prototype consistent with
the basic principles of the proposed approach,
that would satisfy the minimum requirements of
the project.
1 The letters 'a' and 'b' used in the enumeration of the term weeks denote the first and second semesters respectively (e.g. 'Week 2a' – second week of the first semester, 'Week 5b' – fifth week of the second semester).
Week 11a – Week 8b - Addition of features to the prototype (as many
as time allows)
Week 5b - Week 8b - Evaluation and testing of the final system.
Week 8b - Week 10b - Completing the final report
Note: The project schedule has been modified as a result of the
feedback received after submitting the mid-project report.
Chapter 2. Background
The process of programming linguistic competence into artificial
systems has proved, and continues to prove, a highly complex task.
There are many reasons why this is so, and most of them have to do
with flexibility. The manner in which humans use language is one
that has so far resisted formalization, and this obviously poses a
serious problem as far as implementing the ability to use language
in artificial systems is concerned, since artificial systems tend to
work with formal rules and are as such generally poor at dealing
with ambiguity and flexibility.
In the light of the above, it appears that one of the main problems
with natural language processing is language variability. While it
is the case that humans use language to describe or refer to a huge
range of objects (be it environmental factors, internal beliefs or
abstract concepts), and that in itself poses a problem, even in
cases when the problem space is very restricted the possible choices
of linguistic terms necessary to describe it are wide-ranging and
indeed quite varied. A typical example is supplied by the problem
known as many-to-many mapping between sense and text. Upon the sight
of a table with a cup on it one would most likely say 'There is a
cup on the table'. However, this is merely the most likely
description of the problem space, but by no means is it the only
one. The sentences 'There exists a cup on the table', 'There is a
porcelain liquid container on the table', 'The table has a cup on
it' and many others also accurately describe the problem space.
2.1 The Notion of Entailment
2.1.1 Definition
The concept of entailment accounts for the linguistic situation
described above as well as for other similar ones. Saeed (2003)
gives the following definition for entailment:
“A sentence p entails a sentence q when the truth of the first (p)
guarantees the truth of the second (q), and the falsity of the
second (q) guarantees the falsity of the first (p).”
To illustrate this definition consider the sentences:
(1)Bob reads a book.
(2)Bob reads a semantics book.
In this instance, according to the definition above, (2) entails (1).
Indeed, if it is the case that Bob reads a semantics book, it can be
said with certainty that Bob reads a book. Similarly, if it is not
the case that Bob reads a book, it cannot be true that Bob reads a
semantics book.
Given this definition of entailment, Saeed (2003) proceeds to
define paraphrases as sets of sentences which mutually entail each
other. The sentences produced earlier to describe the problem space
composed of a table with a cup on it would be examples of
paraphrasing.
At first sight, it might be tempting to equate the notion of
entailment as described above to the logical concept of entailment
as it is defined by classical mathematical logic (CML). Cheng (1996)
warns against this by pointing out that entailment in CML is
reducible to material implication (A -> B) and as such does not
account for any relation between A and B. In CML the statement
'1+1=2' would legitimately entail from the sentence 'it is raining',
whereas in natural language this is not the case. Starting from this
observation he goes as far as proposing a few alternative relevant
logic systems that solve the paradoxes of CML and other relevant
logics. (Cheng, 1996)
2.1.2 Ambiguity of Entailment
Consider the following sentences:
1) Mary likes animals.
2) Mary likes vicious, savage dogs.
3) Mary took a phone call in her office at noon today.
4) Mary skipped lunch today.
Assuming that sentences (1) – (4) refer to the same person named
Mary and that Mary is a typical individual, most people would quite
likely label (1) True and (2) False. Yet from a logical standpoint
(1) entails (2), since vicious, savage dogs are a type of animal.
According to the definition of entailment given in section 2.1.1,
this makes it quite impossible for (2) to be False while (1) is
True.
This apparent paradox could be resolved by the presence of an
appropriate quantifier (such as 'some') in sentence (1) that would
restrict the domain of the word 'animals' in such a manner as to
exclude the vicious dogs. Indeed, although the quantifier is not
explicitly present, most humans would assume its implied presence
and would understand sentence (1) as meaning 'Mary likes most
animals'. This assumption, however, is exclusively the result of the
knowledge that it is highly unlikely that anyone (Mary included)
would like all
animals. This knowledge is acquired from extensive practical
experience of the kind a computer does not have access to. For this
reason, the use of implicit knowledge of this sort in language poses
a major problem for any system tackling the task of entailment.
A similar principle is at work in the case of sentences (3) and (4).
Though the two utterances appear to be totally unrelated, an
entailment of (4) from (3) is perfectly plausible. Once again, in
order for such a relation to be detected a considerable amount of
experience and background knowledge is required (e.g., lunch breaks
happen at noon, Mary does not have lunch in her office, etc.).
If the current limitations of computer systems are taken into
account, and consequently the problem space of sentence entailment is
restricted by excluding cases (such as the ones above) involving
experience or implicit knowledge, the task of automating the
detection of entailment between sentences would be similar in nature
to the manual (i.e. performed by a human) task of attempting to
detect entailment between sentences in an unfamiliar language, with
the aid of tools such as various books written in the language,
collections of synonyms of the language or monolingual dictionaries.
Even thus restricted, the problem remains far from straightforward,
mainly because sentence entailment can be generated in a wide range
of ways, all conceptually different from each other. Saeed (2003)
broadly separates the different sources of entailment into two
categories: lexical and syntactic. Examples of the former would be
synonymy and hyponymy, while notions such as active and passive
sentences belong to the latter (e.g. 'I read a book' and 'A book is
read by me'). Depending on the approach to the task, in its initial
stages a program automating the task of entailment could well
account for only a few of the many possible sources of entailment.
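As an illustration of a lexical check along these lines, consider the following sketch. The mini-lexicon it uses is entirely hypothetical and merely stands in for a real resource such as Wordnet; it is not the system described later in this report.

```python
# Illustrative sketch only: a hand-coded mini-lexicon standing in for a
# lexical resource such as Wordnet. The word lists are invented examples.
SYNONYMS = {"rush": {"hurry"}, "buy": {"purchase"}}
HYPERNYMS = {"dog": {"animal"}, "cat": {"animal"}}

def lexically_entails(word_t, word_h):
    """True if word_t entails word_h by identity, synonymy or hypernymy."""
    if word_t == word_h:
        return True
    if word_h in SYNONYMS.get(word_t, set()) or word_t in SYNONYMS.get(word_h, set()):
        return True
    return word_h in HYPERNYMS.get(word_t, set())

def sentence_entails(text, hypothesis):
    """Word-by-word check for equal-length Subject - Verb - Object sentences."""
    t, h = text.split(), hypothesis.split()
    return len(t) == len(h) and all(lexically_entails(a, b) for a, b in zip(t, h))

print(sentence_entails("dog chase cat", "animal chase cat"))  # True
```

Under this toy lexicon, 'dog chase cat' entails 'animal chase cat' (hypernymy) and 'bob rush home' entails 'bob hurry home' (synonymy), while reversing subject and object correctly fails the check.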
2.2 Automating Entailment
2.2.1 The Pascal RTE Challenge
Unfortunately, few attempts have so far been made to resolve
entailment in natural language systems. This is perhaps surprising
considering the central role entailment plays in the way humans use
language, but it is most likely due to the difficult problems the
task poses, some of which have been described above. Acknowledging
the importance of the
issue of textual entailment for the field of Natural Language
Processing and many of its applications, as well as being aware of
the little amount of research existent on the topic, Dagan et al
launched in June 2004 the PASCAL Recognizing Textual Entailment
Challenge, with the aims of providing “a first opportunity for
presenting and comparing possible approaches for modeling textual
entailment” and potentially promoting “the development of generic
semantic “engines”, which will play an analogous role to that of
generic syntactic analyzers across multiple applications” (Dagan et
al, 2004).
This initiative is important for several reasons. First of all to
this date it provides the only available corpus developed
specifically for the task of entailment. This corpus consists of a
number of sentence pairs, each classified as either True or False
(True in case the second sentence in the pair entails from the
first, False otherwise) by human annotators, with a roughly equal
balance between True and False pairs, and is divided in seven
subsets, reflecting possible cases of entailment proposed by the
authors (Information Retrieval, Comparative Documents, Reading
Comprehension, Question Answering, Information Extraction, Machine
Translation and Paraphrase Acquisition)2 (for a more detailed
2 Unfortunately, due to the fact that the system described in this project was restricted to the handling of sentences of the type Subject – Verb – Direct Object (see section 3.1), the RTE corpus could not be used for testing purposes.
description, see Dagan et al, 2004).
The existence of the challenge is also important because it
represents a recognition of the importance of the issue of sentence
entailment, and as such it has attracted a considerable amount of
interest from research groups. Indeed, Dagan et al report 13 submissions of
systems in response to the call for participation set out by the
challenge, with overall accuracies ranging between 50 and 60 percent
(Dagan et al, 2005). The relatively low accuracies obtained bear
witness to the difficulty of the task, but provide nonetheless a
useful benchmark against which future systems could be evaluated.
Unfortunately, the publication date (10th of April 2005) of a
selection of the submitted reports fell too close to the deadline
for this report for any of them to be reviewed here3.
2.2.2 The Probabilistic Approach
The interest shown in the Pascal RTE Challenge is particularly
welcome considering that before its launch only a few papers had
been written specifically on the topic of textual entailment as a
Natural Language Processing task.
One of these, written by two of the initiators of the RTE
challenge, Dagan and Glickman (2003), adopts a probabilistic
approach to entailment. They propose an inference model which does
not represent meanings explicitly, focusing instead directly on the
lexical-syntactic representations of sentences which are usually the
result of automatic parsing of language constructs. The model has at
its core a set of entailment patterns consisting of relations
3 It must be mentioned that the development work on this project and the software it includes proceeded in parallel to the RTE challenge and therefore any similarity between the ideas proposed here and the systems submitted for the challenge is purely coincidental.
between language expression templates, and associated probabilities.
A template is defined as “a language expression along with its
syntactic analysis, optionally with variables replacing sub-parts of
the structure” (e.g. X subj buy obj Y). Thus an example of an
entailment pattern could be represented as: (X subj buy obj Y
⇒ X subj own obj Y), P, where P is the probability that a
text that entails the left hand side of the expression also entails
its right hand side. (Dagan and Glickman, 2003)
Given a pair of sentences and a knowledge base of such entailment
patterns, the inference model identifies templates present in the
sentences and then applies a set of probabilistic inference rules to
combine the relevant pattern probabilities, producing an
approximation of entailment between the given sentences.
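The core of this pattern-matching idea can be sketched as follows. The patterns, the probabilities attached to them and the rule used to combine them (multiplication, under a naive independence assumption) are all illustrative inventions, not the actual Dagan and Glickman model.

```python
# Illustrative sketch of probabilistic entailment patterns; the patterns,
# probabilities and combination rule are assumptions made for the example.
PATTERNS = {
    ("buy", "own"): 0.8,   # (X subj buy obj Y => X subj own obj Y), P = 0.8
    ("own", "have"): 0.9,
}

def entailment_estimate(text_verb, hyp_verb):
    """Estimate P(hyp_verb pattern | text_verb pattern), allowing one
    intermediate step and multiplying probabilities (naive independence)."""
    if text_verb == hyp_verb:
        return 1.0
    direct = PATTERNS.get((text_verb, hyp_verb))
    if direct is not None:
        return direct
    best = 0.0
    for (lhs, mid), p1 in PATTERNS.items():
        if lhs == text_verb:
            p2 = PATTERNS.get((mid, hyp_verb), 0.0)
            best = max(best, p1 * p2)
    return best

print(entailment_estimate("buy", "have"))  # roughly 0.72 (0.8 * 0.9)
```

The chained case ('buy' entails 'own' entails 'have') shows why a large pattern database matters: coverage grows with the number of stored patterns, which is precisely the acquisition problem discussed above.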
This approach has some obvious advantages in that it avoids a number
of problems that would have to be resolved by a logical automated
system dealing with entailment, most notably the existence of an
indefinite number of potential sources of entailment. Its main
disadvantage, however, is its heavy dependence on a large database
of entailment patterns. Dagan and Glickman list a number of possible
methods of pattern acquisition (e.g. extracting paraphrases from a
single corpus, learning patterns from the web) without explicitly
favouring any, but it is difficult to assess the efficiency of any
of them (especially since some of them are not fully developed) with
any degree of certainty (Dagan and Glickman, 2003).
As for the model they propose, without the existence of a database
of entailment patterns, its potential success rate in recognizing
entailment is impossible to predict, which makes the model in itself
difficult to assess.
2.2.3 The Symbolic Approach
Iwanska (1993) favours an approach to the task of entailment that is
radically different from the model proposed by Dagan and Glickman.
Like Cheng (1996) she maintains that first order logic is unsuitable
for representation of natural language constructs. As an alternative
she develops what she refers to as 'a formal, computational,
semantically clean representation of natural language' (Iwanska,
1993). This representation, which she calls UNO, reflects the
semantics of eleven syntactic categories (e.g. nouns, verbs,
adverbs) and accounts for the logical relations of conjunction,
disjunction and negation. These features combined make UNO a powerful
knowledge representation model for natural language. For full
details see Iwanska (1993), but as an example, given the sentences:
(1)John doesn't walk very fast.
(2)Every student works hard.
UNO would produce the following representations:
(R1) john -->{ not walk (speed ⇒ fast (adv ⇒ very)) }
(R2) np (det ⇒ every,
n ⇒ student) -->{[ work (adv ⇒ hard) ]} (Iwanska, 1993)
Starting from the premise that entailment of sentences heavily
depends on the correspondent entailment of their components and
using UNO as a knowledge representation model, Iwanska implements
(in LISP) a system that takes a natural language sentence as input,
converts it into an UNO representation and compares it against a
knowledge base consisting of a set of UNO representations by
comparing its components to components of the UNO representations in
the knowledge base (which she calls rules). According to the results
of this process the sentence is classified as True if it entails
from the knowledge base, False if it contradicts it, or Unknown if
neither of the previous can be established with certainty (Iwanska,
1993).
An important aspect of the system is its capability of automatically
expanding its database from the input sentences it receives. If in
the course of the comparison process described above a rule in the
input sentence is identified as having no correspondent in the
knowledge base or as being stronger than or equal to its
correspondent, then this rule is added to the knowledge base. This
implies that the knowledge base only contains “lower boundaries” of
each of the rules it stores, thus avoiding duplicates and allowing
for an infinite number of possible entailing sentences. “For
example, the UNO representation of a single property walk extremely
fast accounts for the properties walk very fast, walk fast, walk,
walk or dance, etc; at the same time, this single property accounts
for the lack of the properties such as not walk, walk fast but not
very fast, etc.” (Iwanska, 1993)
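The update rule described in the quoted passage might be sketched as follows. The representation of a rule as a (verb, modifier set) pair and the definition of 'stronger' as modifier superset are simplifying assumptions made purely for illustration; Iwanska's UNO rules are considerably richer.

```python
# Toy sketch of Iwanska-style knowledge-base expansion: keep only the
# strongest version of each rule. Rule = (verb, frozenset of modifiers);
# both the representation and the 'stronger' test are assumptions.
def stronger_or_equal(rule_a, rule_b):
    """rule_a is at least as strong as rule_b if it has the same verb and
    at least the same modifiers (e.g. 'walk extremely fast' vs 'walk fast')."""
    return rule_a[0] == rule_b[0] and rule_a[1] >= rule_b[1]

def update_kb(kb, new_rule):
    """Add new_rule unless an existing rule already covers it; drop any
    weaker rules it subsumes, so the knowledge base holds no duplicates."""
    for rule in kb:
        if stronger_or_equal(rule, new_rule):
            return kb  # already accounted for by a stronger rule
    kb = [r for r in kb if not stronger_or_equal(new_rule, r)]
    kb.append(new_rule)
    return kb

kb = []
kb = update_kb(kb, ("walk", frozenset({"fast"})))
kb = update_kb(kb, ("walk", frozenset({"fast", "extremely"})))
print(kb)  # only the stronger rule, 'walk extremely fast', remains
```

After both updates the base contains a single rule for 'walk', which accounts for 'walk fast' and 'walk' without storing them explicitly, mirroring the "lower boundaries" behaviour described above.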
While perhaps lacking some of the flexibility of the probabilistic
model described above, this symbolic approach to entailment is
designed to cope with relatively sophisticated cases of entailment,
the treatment of which is unclear in the case of the system proposed
by Dagan and Glickman (2003). Another advantage is its ability to
extend its knowledge base from the test sentences, which is an
elegant solution to the problem of an extensive knowledge base.
Unfortunately Iwanska does not provide any accuracy results for her
system, nor does she give any account of testing or evaluation
methods, which makes it impossible, once again, to assess its value.
2.2.4 Proposed Approach
The approach to the task of recognizing entailment proposed here,
while essentially symbolic in nature, attempts to combine the
advantages of a systematic representation of natural language, such
as the one described in section 2.2.3, the flexibility given by the
existence of an entailment probability estimate as in section 2.2.2,
and the author's own ideas regarding the issue of entailment and
linguistic knowledge representation generally.
In a similar fashion to both systems already described, this
approach relies on an entailment inference engine and a knowledge
base against which test sentences are to be compared and entailment
judgements made according to the results. However this knowledge
base consists of neither entailment patterns nor symbolic
representations of sentences. Instead the information obtained from
the natural language sentences from which the knowledge base is
generated is stored in the form of lists of the nouns contained in
these sentences, each noun having an associated set of attributes
reflecting additional information provided in the sentences (e.g.
syntactic functions, other parts of speech attached, syntactic
referents, etc.).
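Concretely, a knowledge base of this kind might be sketched as below. The attribute names ('function', 'verb', 'referent') are invented for the example and are not the system's actual field names.

```python
# Illustrative sketch of the noun-centred knowledge representation;
# attribute names are hypothetical stand-ins for the real ones.
def add_sentence(kb, subject, verb, direct_object):
    """Store a Subject - Verb - Direct Object sentence as two noun entries,
    each noun carrying attributes recording its role in the sentence."""
    kb.setdefault(subject, []).append(
        {"function": "subject", "verb": verb, "referent": direct_object})
    kb.setdefault(direct_object, []).append(
        {"function": "direct_object", "verb": verb, "referent": subject})

knowledge_base = {}
add_sentence(knowledge_base, "student", "read", "book")  # 'Students read books'
add_sentence(knowledge_base, "cat", "drink", "milk")     # 'Cats drink milk'
print(sorted(knowledge_base))  # ['book', 'cat', 'milk', 'student']
```

Indexing the base by noun makes the comparison step natural: the nouns of a test sentence can be looked up directly and their attributes compared against the stored ones.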
This model of knowledge representation was adopted with the view
that nouns are the central building blocks of language constructs. A
detailed defence of this hypothesis is beyond the scope of this
report. It is enough to point out that this seems to be a matter of
general agreement in the linguistic community. Wordnet (a lexical
database, see section 3.3.2 for further information), for instance,
is focused on representations of lexical relations and categories
between nouns.
The inference engine is similar to the symbolic system proposed by
Iwanska, in that it relies on the assumption that entailment between
sentences depends to a large extent on the entailment between their
component parts4. It therefore attempts to establish entailment of
the input sentence from the knowledge base by comparing the
component nouns (and their attributes) of the test sentence to those
present in the knowledge base. Instead of making a definite
judgement, however, it calculates an entailment confidence estimate
(not unlike the probabilities computed in the model proposed by
Dagan and Glickman) based on the proportion of nouns that match the
database and the degree to which they match it.
A system meant to exhibit and develop upon the key aspects of the
approach to the task of automatic recognition of sentence entailment
briefly outlined here has been implemented in Python. The detailed
description of this system and the concepts behind it is the
subject of the next chapter.
4 Though difficult to prove, this claim appeals to human intuitions about language, especially if some of the considerations discussed in section 2.1.2 are taken into account.
Chapter 3. The Software
3.1 Overview of Design
The planning stage of the development process of the software
intended to test the theoretical principles described above started
from the idea of separating the task of automated sentence
entailment detection into subtasks reflecting possible different
sources of entailment within natural language sentences. Given the
theoretical assumptions, these subtasks would mostly mirror either
issues of sentence structure (e.g. passive sentences entailed by
active sentences, negative sentences entailed by positive sentences)
or semantic relations between the component words of the sentences
considered (e.g. synonymy, antonymy, hyponymy, meronymy). The next
step was to decide on one such subtask as the main feature of the
program and the first one to be implemented. Several other possible
sources of entailment would then be incorporated into the software
as additional features, taking into account development time
constraints. Finally an entailment confidence estimate feature would
be added, based on which the software would make judgements about
entailment between two sentences.
Thus the end product would behave in the following manner: given a
pair of sentences (a and b) through the means of a standard input
method (e.g. reading from a file), the system would compare them
based on the possible sources of entailment implemented as its
features. It would, in the process, compute an entailment confidence
estimate according to the degree of correspondence encountered
between the two sentences. The outcome would be that sentence b
would be judged as entailing from sentence a if the confidence
estimate was higher than 0.5, and as not entailing otherwise. The
system would output the result and exit.
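Sketched as a program skeleton, this behaviour might look as follows. The actual feature-based comparison is replaced here by a trivial word-overlap placeholder, so the function names and scoring are illustrative only.

```python
# Skeleton of the intended end-to-end behaviour; compare_sentences is a
# placeholder for the real feature-based comparison described above.
def compare_sentences(a, b):
    """Placeholder confidence estimate: fraction of b's words found in a."""
    words_a, words_b = set(a.split()), b.split()
    return sum(w in words_a for w in words_b) / len(words_b)

def judge(a, b, threshold=0.5):
    """Return the confidence estimate and the resulting verdict."""
    estimate = compare_sentences(a, b)
    verdict = "entailed" if estimate > threshold else "not entailed"
    return estimate, verdict

print(judge("cats drink fresh milk", "cats drink milk"))
```

With the placeholder comparison, 'cats drink milk' is judged as entailed by 'cats drink fresh milk' (estimate 1.0), while an unrelated pair falls below the 0.5 threshold.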
The sentence structures that the system was designed around and
tested on throughout are of the type Subject – Verb – Direct Object
(e.g. 'Cats drink milk' or 'Books bring wisdom'). This was done with
the intention of significantly reducing the problem space (necessary
given the time constraints of the project and the complexity of the
task at hand), while keeping true to the conceptual implications of
the notion of sentence entailment.
With the same considerations in mind, the system only worked with
lemmatised versions of the words composing the sentences, thus
ignoring issues such as the number for the nouns (e.g. 'students'
was treated as 'student') or the tense and person for the verbs
(e.g. 'liked' and 'likes' were both treated as 'like').
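For illustration, the effect of working with lemmatised forms can be mimicked with a toy lemmatiser. The exception list and suffix rule below are crude assumptions made for the example; in the actual system the lemmas are obtained from the RASP morphological analysis (see section 3.3.1).

```python
# Toy lemmatiser for illustration only: a tiny (hypothetical) exception
# list followed by a crude plural-stripping rule. The real system relies
# on RASP for lemmatisation rather than rules such as these.
def naive_lemma(word):
    exceptions = {"liked": "like", "likes": "like"}
    if word in exceptions:
        return exceptions[word]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]  # 'students' -> 'student'
    return word

print(naive_lemma("students"), naive_lemma("liked"), naive_lemma("likes"))
# student like like
```

Collapsing inflected forms in this way is what allows 'Students read books' and 'A student reads a book' to be compared on equal terms.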
3.2 Methodology
Marks (2002) distinguishes and compares two main currents in the
field of software design methodology – the waterfall approach and
the iterative approach.
He describes waterfall methodologies as “structuring a project into
distinct phases with defined deliverables from each phase”, where
“the first phase tries to capture What the system will do (its
requirements), the second determines How it will be designed, in the
middle is the actual programming, the fourth phase is the full
system Testing, and the final phase is focused on Implementation
tasks such as go-live, training, and documentation.” (Marks, 2002)
It is easy to see why a waterfall methodology would be inappropriate
for the software system proposed here. While it is necessary to set
certain system requirements at an early stage, it would be
impossible to specify exactly “what the system will do” without a
knowledge of which of the possible sources of entailment will be
implemented. And this knowledge would only become available during
the course of the implementation itself, since it would depend on
the time taken to incorporate each feature.
Another problem is that in order to guarantee a successful final
product, testing would have to be part of each of the subtasks set
to implement various features, rather than being performed at the
end of the development process. Once a feature is implemented,
comprehensive testing is required before the implementation of the
next feature is begun.
Because of these considerations, an alternative iterative
methodology was preferred, since it better suited the modular
character of the system specifications. In a software development
process based on an iterative approach, “the initial planning will
consist of only a broad outline of the business objectives and
creating a framework for the overall project components. In some
cases, a set of core features for the entire system will be built
initially and subsequent features added as the project progresses.
In other cases, just certain modules will be built entirely during
the early part of the project and other components added over time.”
(Marks, 2002)
The process described in this definition addresses quite effectively
the design specifications of the proposed entailment detection
system, as outlined in section 3.1. The detection of entailment has
been established as the objective of the project, with different
potential sources of entailment (alongside a confidence estimate) as
project components. One of these sources will be decided upon as the
initial feature with others to be added as the project progresses.
The main advantage of this approach is flexibility. With an
iterative framework, it is not necessary to decide on all the
features that would be implemented before the code writing begins.
Also, testing can be performed whenever necessary and this is
crucial for the safe delivery of a successful end product.
While on the topic of methodology, it must be mentioned that
throughout the development process, whenever possible, code has been
written in such a manner as to allow room for possible extensions.
3.3 Software Tools
Part of the reason why entailment is a novel topic in the area of
natural language processing is the lack, until recently, of the
appropriate tools necessary for any attempt to tackle the task of
entailment to be possible at all.
It would be very difficult to make any reasonable judgement
regarding any structural relation between two sentences and between
their components without the existence of competent, appropriate
sentence processing tools such as parsers, taggers or stemmers.
Similarly, one cannot establish semantic relations between different
words without a database documenting such relations on a large
scale.
Thankfully, in recent years, as the interest in the field of natural
language processing rose such tools have been developed and have
become available. Of these, this particular project makes use of the
RASP toolkit, the Wordnet database and some features of the Natural
Language Toolkit. The programming language of choice is Python.
3.3.1 The RASP Toolkit
The Robust Accurate Statistical Parsing (in short, RASP) toolkit was
originally developed by Ted Briscoe and John Carroll (for a detailed
description of the system see Briscoe and Carroll, 2002). Briefly,
the toolkit comprises a set of modules, each performing, in this
order, tokenization, part-of-speech tagging, lemmatisation and
finally parsing. The modules are independent and thus can be run
separately, but for the purposes of this project they were all run
together by executing one of the scripts provided. The script in
question has various display options, with different amounts of
information being displayed.
The initial preferred display, given the sentence 'Students read
books', for example, would produce the following parsing:
[number]
(ncsubj read student+s _)
(dobj read book+s _)
where [number] is the order of the sentence from the beginning of the
file ('1' for the first sentence, '2' for the second, etc.). The
contents of the brackets depend on the syntactic function of the
word5, but in general the first term is the syntactic function
of the current word, followed by the referent (i.e. the word on
which the current word depends), the lemmatised version of the
current word and finally the initial syntactic function6.
At a later stage in development, a more complete RASP display was
preferred, which given the same input sentence, would output:
5 For a detailed description of the grammatical notation scheme used see Briscoe and Carroll, 2000.
6 In the example provided this is represented by '_', in other words none. Given the sentence 'Books are read
by students', however, the line for the word 'books', for instance, would be (ncsubj read book+s dobj).
20
("students" "read" "books") 1 ; (-5.099)
(|ncsubj| |read_VV0| |student+s_NN2| _)
(|dobj| |read_VV0| |book+s_NN2| _)
The additional information provided by this form of output, and of
relevance to the system, is the inclusion of the entire sentence
rather than just its order number, together with part-of-speech
tags following every word (VV0, NN2, etc.). The tagset used by the
tagging component of RASP is the CLAWS2 tagset.
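To make the discussion concrete, the grammatical-relation lines shown above can be read into a simple data structure. This is an illustrative sketch only (the helper name and tuple layout are my own), assuming the pipe-delimited output format reproduced above:

```python
def parse_rasp_relations(lines):
    """Read RASP grammatical-relation lines such as
    '(|ncsubj| |read_VV0| |student+s_NN2| _)' into plain tuples."""
    relations = []
    for line in lines:
        line = line.strip()
        if not (line.startswith("(|") and line.endswith(")")):
            continue  # skip the sentence/score line
        # drop the parentheses, split on whitespace, strip the |...| marks
        fields = [field.strip("|") for field in line[1:-1].split()]
        relations.append(tuple(fields))
    return relations

output = ['("students" "read" "books") 1 ; (-5.099)',
          "(|ncsubj| |read_VV0| |student+s_NN2| _)",
          "(|dobj| |read_VV0| |book+s_NN2| _)"]
print(parse_rasp_relations(output))
```

Each tuple then holds, in order, the syntactic function, the referent, the lemmatised word and the initial syntactic function, as described above.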
Several features of the RASP toolkit eventually made it the software
of choice for the initial sentence processing step7:
– an all-in-one package including tagging, lemmatising and parsing
– no need for training
– grammar already included with the parser
– syntactic information included (e.g. subject, direct object)
– similar in approach to the theory behind the system, in the
central role given to nouns (while verbs are mainly regarded as
'referents')
3.3.2 The Wordnet Database
“WordNet, an electronic lexical database, is considered to be the
most important resource available to researchers in computational
linguistics, text analysis, and many related areas. Its design is
inspired by current psycholinguistic and computational theories of
human lexical memory. English nouns, verbs, adjectives, and adverbs
are organized into synonym sets, each representing one underlying
lexicalized concept. Different relations link the synonym sets.”
7 Initially a combination of NLTK tools were used for tagging, parsing, lemmatising etc. They were dropped early in the development process as it became increasingly evident that they were incompatible with the requirements of the system (see section 3.3.3).
21
(Fellbaum, 1998)
The choice of Wordnet as an aid in the task of lexically comparing
components of sentences is probably obvious to anyone with an
interest in computational linguistics. A great number of research
projects involving work with lexical relations of various kinds
employ Wordnet as a research tool8.
Comparing components of the sentences considered for entailment is a
key element of the approach to the task of entailment proposed here.
A significant number of the features either implemented or
considered as possible extensions consist of potential sources of
entailment based on lexical relations (such as synonymy or hyponymy)
between components of different sentences. For this reason, Wordnet
was not only a useful aid, but a vital tool in the process of
developing the system detailed here. As an additional advantage, the
organisation of Wordnet resembles in certain aspects some of the
theoretical observations at the base of this project.
3.3.3 The NLTK and Python
Mainly intended as an educational tool, the Natural Language Toolkit
(NLTK), written in Python, is an open source collection of
independent modules, each implementing a data structure or task
relevant to the field of natural language processing (Loper and
Bird, 2002). The fact that it is written in Python and that it
includes a comprehensive set of tutorials and a generally detailed
and clear documentation base makes it relatively easy to become
familiar with, and for the purposes of this project that was
considered an advantage.
8 For an up-to-date list of projects using Wordnet, see the Wordnet bibliography, currently maintained by Rada Mihalcea, at http://engr.smu.edu/~rada/wnb/
22
Initially the NLTK was intended to play a central role in the
process of development of the entailment software, as the main
source of off-the-shelf tools the software would use, including
taggers and parsers. As the development progressed, however, it
became apparent that, because they were designed primarily for
educational rather than research purposes, some of the NLTK tools,
notably the parser, were insufficient for the requirements of this
project. As a consequence eventually the alternative RASP toolkit
was judged as more suitable and thus was preferred for the tasks of
tagging, lemmatising and parsing. Other NLTK modules however, namely
the tokenizer and the Wordnet interface functions, were kept and
successfully used throughout the implementation process.
After some consideration, Python was designated as the programming
language of choice. This was partly because the NLTK is written in
Python, but also because of other important considerations intrinsic
to the language, such as its relatively loose structural
constraints, its ease of use and, perhaps most importantly, the
superior manner in which Python handles data types, compared to
alternatives such as Java or C9. This was a major factor, given the
amount of string and list processing that was necessarily a part of
the entailment software.
The main disadvantage of Python, as well as of the NLTK, is the lack
of resource efficiency. Russell (2003) reports that, when used with
large sets of data, the NLTK tools prove to be significantly slower
than alternative tools available for the same tasks. This is hardly
surprising given that its developers clearly state that speed was
not one of the design aims of the toolkit (Loper and Bird, 2002) in
addition to the fact that Python is traditionally recognised as
being a slow language compared, for instance, to C.
9 For more information on Python see the Python Tutorials (van Rossum, 2005)
23
Nonetheless, the fact that the entailment system was both a
small-scale undergraduate project and tested with relatively small
amounts of data (see the section on testing) made the lack of
efficiency an acceptable impediment, considering the advantages
brought by using Python and the NLTK.
3.4 The Minimum Requirements of the System
(Entailment from Synonymy)
As mentioned previously the software development strategy was
centred around the idea of dividing the task of entailment into
subtasks representing various potential sources of entailment. One
of these subtasks was given priority as the first feature to be
implemented and its successful completion was meant to fulfil the
minimum requirements of a system intended to approach the task of
automated sentence entailment.
After some thought, this initial subtask was set as representing the
case of entailment through synonymy. Take the sentences:
1) Students read books.
2) Pupils read books.
There is little doubt as to the fact that (1) entails (2). Equally
obvious, however, is that, just as well, (2) entails (1). Apart from
being perhaps one of the most common sources of entailment, synonymy
typically produces cases of mutual entailment, a concept known as
paraphrasing (Saeed, 2003). This was one of the reasons why
entailment from synonymy was chosen as the first feature to be
implemented.
Having established that the minimum requirement of the system was
24
detection of entailment from synonymy (i.e. paraphrasing) I shall
now turn to a discussion of the way in which this was implemented.
In virtue of its structure, the corpus provided by the Pascal RTE
Challenge for testing purposes suggests attempting automated textual
entailment by analysing sentence pairs. The system described here
takes a slightly different path in that it treats the data (i.e. the
collection of sentences) by considering two distinct categories: a
set of sentences that are considered as known by the system and
about which no entailment judgements are made (from here on referred
to as 'the knowledge base'); and another set ('the test base') that
the system analyses sentence by sentence, attempting to determine
the potential of entailment of each from the sentences in the
knowledge base.
At the moment this works by the system trying to identify sentences
in the knowledge base that potentially entail the test base sentence
under analysis. As a possible extension, however, the system could
perform fuzzier searches and attempt entailment from the
knowledge base as a whole. This is one of the reasons why this
approach was preferred over the more popular sentence-pairs
alternative. It is an important choice since the knowledge base
approach much more closely resembles the manner in which entailment
occurs in human language interaction. Rarely are we presented with
pairs of sentences and asked to make entailment judgements. We
usually make such judgements by comparing utterances we encounter
against our whole spectrum of knowledge, linguistic or otherwise.
Given the requirement that the test data would be split into a
knowledge base and a test base, the system then takes the file
containing the former, parses it10 and stores the contents in a
'knowledge list'. Rather than lists of sentences, or anything to
10 For a description of the parser output, see section 3.3.1
25
that effect, this list contains 'noun tokens'11 (produced with the
aid of the NLTK tokenizer), in conformance with the theoretical
considerations mentioned earlier. At this stage, each of the tokens
stores the following attributes12:
– TEXT: containing the string representation of the token (e.g.
student['TEXT'] = 'student')
– REFERENT: containing the string representation of the word on
which the token depends syntactically (e.g. student['REFERENT'] =
'read', given the sentence 'Students read books'13).
– SYNTAG: containing the syntactic function of the token as given by
the parser (e.g. student['SYNTAG'] = ncsubj, given the same
sentence as above).
– SENTENCE: containing the number of the sentence from which the
token is extracted (e.g. '3' should the relevant sentence be the
third in the processed file).
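A minimal sketch of such a noun token, represented here as a plain Python dictionary (the helper function is illustrative, not the project's actual code):

```python
def make_noun_token(text, referent, syntag, sentence):
    """Build a noun token carrying the four attributes described above."""
    return {"TEXT": text, "REFERENT": referent,
            "SYNTAG": syntag, "SENTENCE": sentence}

# the two noun tokens of 'Students read books' (sentence 1 of its file)
student = make_noun_token("student", "read", "ncsubj", 1)
book = make_noun_token("book", "read", "dobj", 1)
print(student["TEXT"], student["SYNTAG"], student["REFERENT"])
```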
The next step is the processing of the test base file in a similar
manner to the knowledge base. The subject of the first test sentence
is then identified, and a list of synonyms of the most common sense
is retrieved from the Wordnet database. Each of the synonyms
(including the original word) is then checked against the knowledge
list. If a match is found, a comparison between the syntactic
functions (i.e. the subjects) and then the referents (i.e. the
predicates) of the test word and the synonymous known word is
performed. If the comparison is still positive, in other words if
the two words still match, the knowledge base sentence containing
the synonym is extracted and the same procedure repeats, in that the
next noun token belonging to the test sentence is matched against
11 Tokens are, in simple terms, concrete instantiations of conceptual entities, which can hold a number of properties. An extended system built on the one presented here could use types, which are actual conceptual entities as opposed to instantiations of them, instead of tokens. For a more detailed explanation of the notion of 'token', as well as the difference between tokens and types, see the NLTK tutorial on tokenization (Klein et al, 2004).
12 Later the token attributes were refined to contain more information (see section 3.6).
13 Remember the software works with lemmatised versions of the words in sentences.
26
the noun tokens of the corresponding knowledge base sentence.
In the case of a sentence of the type 'subject-verb-direct object'
(S-V-O) (which was the structure of both the test and the known
sentences), this is only a two step process, as S-V-O sentences
necessarily only contain two noun tokens, the subject and the direct
object. There is no reason, however, why the principle would not
work in a similar manner for longer sentences, which could be the
case in an extended system.
If the matching process completes successfully (i.e. all noun tokens
of the test sentence have matches - in terms of TEXT, SYNTAG and
REFERENT – in the selected knowledge base sentence) the test
sentence is established as entailing from the knowledge base, an
output to this effect is produced and the system proceeds to the
next test sentence.
Thus the set minimum requirements are fulfilled, producing a basic
system performing automated recognition of entailment of an S-V-O
test sentence from a known set of S-V-O sentences on the grounds of
synonymy.
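The matching step just described can be sketched as follows, with a toy synonym table standing in for the Wordnet lookup (all names and data here are illustrative, not the project's actual code):

```python
# Toy synonym table standing in for the Wordnet lookup
SYNONYMS = {"pupil": ["pupil", "student"], "book": ["book"]}

def entails_from(test_tokens, known_tokens):
    """True if every noun token of the test sentence has a knowledge-base
    token matching on TEXT (up to synonymy), SYNTAG and REFERENT."""
    for tok in test_tokens:
        candidates = SYNONYMS.get(tok["TEXT"], [tok["TEXT"]])
        if not any(known["TEXT"] in candidates
                   and known["SYNTAG"] == tok["SYNTAG"]
                   and known["REFERENT"] == tok["REFERENT"]
                   for known in known_tokens):
            return False  # one unmatched token blocks the entailment
    return True

# 'Students read books' in the knowledge base; 'Pupils read books' under test
known_sent = [{"TEXT": "student", "SYNTAG": "ncsubj", "REFERENT": "read"},
              {"TEXT": "book", "SYNTAG": "dobj", "REFERENT": "read"}]
test_sent = [{"TEXT": "pupil", "SYNTAG": "ncsubj", "REFERENT": "read"},
             {"TEXT": "book", "SYNTAG": "dobj", "REFERENT": "read"}]
print(entails_from(test_sent, known_sent))
```

With two noun tokens per S-V-O sentence, this is the two-step process described above; the loop would extend unchanged to longer sentences.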
3.5 Additional implemented features
3.5.1 Entailment from Hypernymy
Saeed (2003) describes hypernymy as a lexical relation of inclusion.
Given two words x and y, if the meaning of x encompasses and includes
the meaning of y, then a relation of hypernymy exists between the
two words, with the more general term (x) being termed the hypernym
and the more specific term (y) named the hyponym (e.g. 'animal' is a
hypernym of 'cat').
Hypernymy was chosen as one of the features to be implemented mainly
27
in virtue of it being one of the main sources of entailment, but
also because it exhibits nicely the potential complexity of the task
at hand. Given the sentences:
Animals have rights.
Cats have rights.
the hypernym lies in the sentence that 'does the entailment', the
sentence from which the arrow of implication starts. In a loose
sense, it can be said of this case that 'the hypernym entails the
hyponym'. Whereas with the sentences:
Some animals have rights.
Cats have rights.
the opposite is true. The addition of the quantifier 'some' to one
of the sentences completely reverses the direction of entailment,
with the sentence containing the hyponym now entailing the one
containing the hypernym.
Because the system described here is designed to handle simple S-V-O
sentences, containing lemmatised nouns with no articles attached, in
all tested cases the sentence containing the hypernym entailed the
sentence containing the hyponym. It is easy to see, however, why an
enhanced system handling more complex sentence structures would
have to address this issue.
In order to add the entailment from hypernymy feature to the system
the matching process described in section 3.4 had to be modified to
include the recognition of hypernyms, as well as synonyms, in the
knowledge base. Thus, after retrieving the list of synonyms of each
noun token in the test sentence from Wordnet, the system now also
retrieves a list of hypernyms for each of these synonyms (also from
28
Wordnet) and then attempts to find matches in the knowledge base for
each of the synonyms of the test noun tokens as well as for their
corresponding lists of hypernyms.
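The expansion step can be sketched as follows, with toy stand-ins for the Wordnet synonym and hypernym lookups (the data and helper name are illustrative):

```python
# Toy stand-ins for the Wordnet lookups used by the real system
SYNONYMS = {"cat": ["cat", "true cat"]}
HYPERNYMS = {"cat": ["feline", "animal"], "true cat": ["feline"]}

def candidate_matches(test_noun):
    """Knowledge-base words that could license entailment of `test_noun`:
    its synonyms, plus the hypernyms of each of those synonyms."""
    candidates = set(SYNONYMS.get(test_noun, [test_noun]))
    for synonym in list(candidates):
        candidates.update(HYPERNYMS.get(synonym, []))
    return candidates

# 'Animals have rights' can now match the test subject 'cats'
print(sorted(candidate_matches("cat")))
```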
3.5.2 Entailment from Passive Sentences
Like the previous features implemented, passive sentences are a very
common source of entailment. And, as in the case of synonymy, most
often the active and passive form of a sentence are paraphrases of
each other. Not only does the sentence 'Students read books' entail
the sentence 'Books are read by students', but the reverse (i.e.
'Books are read by students' entails 'Students read books') also
holds true.
The difference between passive sentence derivation and synonymy,
however, is that the former is a syntactic source of entailment
(i.e. entailment is derived through syntactic changes) whereas the
latter is lexical in nature (as is entailment from hypernymy).
The implementation of the entailment derived from passive sentences
is carried out in a different manner from previous features,
reflecting the difference in nature between this and the previous
subtasks. The idea is that the two example sentences used above,
while different in syntactic structure, contain the same linguistic
information. Because of this, there is no reason to assume that the
content of the two sentences ought to be stored in the knowledge
base in different manners14.
Thus instead of altering the matching procedure, as with
14 Indeed, keeping the principle of similarity between the knowledge base idea and human storage of linguistic information in mind, this seems justified. Intuitively it seems unlikely that our brains would store basically the same information twice because of structural differences in presentation. Rather some sort of structural decoding takes place as part of the information processing mechanisms. The proposed implementation strategy attempts to mimic this behaviour.
29
hypernyms, changes were made to the function responsible for
generating the noun tokens from the parsed output of the sentences.
Passive sentences were identified by the presence of auxiliaries15
and, should a passive sentence be found, the tokenization process
would change to reflect the properties of the component words as
they would be found in the equivalent active sentence. In effect a
passive-to-active sentence conversion would be performed, but it
would be done by applying changes to the (parsed) nouns in the
sentence rather than to the sentence as a whole.
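A minimal sketch of this token-level normalisation, under the simplifying assumption that the passive subject and object simply swap the roles they would carry in the active sentence (the token layout and function name are hypothetical, not the project's actual code):

```python
def activise_tokens(tokens):
    """Rewrite the noun tokens of a passive sentence to carry the roles
    they would have in the equivalent active sentence."""
    rewritten = []
    for token in tokens:
        token = dict(token)  # leave the input untouched
        if token["SYNTAG"] == "ncsubj":    # passive subject -> active object
            token["SYNTAG"] = "dobj"
        elif token["SYNTAG"] == "dobj":    # and vice versa
            token["SYNTAG"] = "ncsubj"
        rewritten.append(token)
    return rewritten

# 'Books are read by students', stored as if it were 'Students read books'
passive = [{"TEXT": "book", "SYNTAG": "ncsubj", "REFERENT": "read"},
           {"TEXT": "student", "SYNTAG": "dobj", "REFERENT": "read"}]
print([(t["TEXT"], t["SYNTAG"]) for t in activise_tokens(passive)])
```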
3.5.3 Confidence estimate
While perhaps not immediately obvious, the need for a confidence
estimate becomes clear if the matching procedure on which the system
relies is analysed in some detail. Briefly, matching works by
identifying the subject of a given test sentence and looking for a
match in the knowledge base. Should a match be found, the next
noun token in the test sentence is identified and the
procedure repeats if a match is found again. Logically this is
correct. Given the hypothesis that entailment between sentences is
linked to the entailment between their components, a test sentence
would entail from a knowledge base only if all its noun tokens
entail from noun tokens in the knowledge base. In practice, however,
while having the advantage of not producing any false positives, this
condition is likely to generate some false negatives. The reason for
this is the unavoidable incompleteness of the database used for
retrieval of lexical relations16 (such as synonymy and hypernymy) –
in this case Wordnet. It is perfectly possible that certain words
would have synonyms or hypernyms that are not present in Wordnet.
15 This was a valid criterion since the system was designed to deal with S-V-O sentences. No auxiliaries would be found in this type of sentence, whereas in the passive conversion of an S-V-O sentence the presence of an auxiliary (i.e. 'by') is necessary.
16 The parser could potentially be a source of error as well. But given the degree of sophistication of current parsers, this is less likely, especially when working with simple sentence structures such as S-V-O.
30
If at any stage in the process of iterating through the noun tokens
of the test sentence such a word were to occur, the loop would break
and the sentence would be categorised as not entailing from the
knowledge base. The use of confidence estimates generally
circumvents this problem. In order to understand why this is the
case, an explanation of the confidence estimate calculation is
necessary.
The overall confidence estimate of an entailment of a test sentence
is expressed as a percentage. It is calculated by assigning
a percentage weight, x, to each of the noun tokens of the test sentence.
During the matching procedure each token is compared against the
knowledge base on three attributes, in this order – its 'TEXT'
(including its synonyms or hypernyms), its 'SYNTAG' and its
'REFERENT' (again, including synonyms and hypernyms). Each attribute
has equal weight, x/3. Therefore each time an attribute has a match
in the corresponding noun token in the knowledge base, x/3 is added
to the overall confidence estimate of the sentence. The order of the
attributes is important because the attempt to match the last two
was made dependent on the successful matching of the first.
Therefore, although the percentage weight is equal between the
three, in fact the first carried more significance. This is a
reasonable constraint since without a lexical relation present,
there is no sense in which a correspondent known token exists in the
first place17.
The process repeats for every token in the test sentence and a final
confidence estimate is computed. If this is greater than 50 percent,
the test sentence is judged as entailing from the knowledge base
and the confidence estimate is displayed. Since the system is
17 Should this condition not exist, tokens could add to the confidence estimate if they depend syntactically on the same word, for instance, or if they have the same syntactic function, which would be illegitimate, since, without a lexical relation present, these attributes are too loose to constitute feasible grounds for an entailment relation.
31
designed to handle S-V-O sentences (which only have two noun
tokens), at this stage each noun token was manually assigned a
weight of 50 percent18.
As a final point, it should be mentioned that after some
consideration, the same type of constraint applied to the token
attributes was kept to an extent for the whole sentence, in that the
successful matching of the test sentence subject to a known subject
was a necessary condition for the matching of the other test
sentence tokens. This was done with the purpose of reflecting the
importance of the subject in a sentence (see section 4.2.2) and thus
simulating, in some respects, the manner in which humans use language.
Thus given the sentences:
(1) Animals eat lunch.
(2) Cats eat fish.
(3) Students eat lunch.
the confidence estimate of (2) entailing from (1) would be 50
percent, while the confidence estimate of (3) entailing from (2)
would be 0.
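The calculation just described can be sketched as follows; the helper names are illustrative and the toy `related` relation stands in for the Wordnet synonym/hypernym lookup. The example reproduces the 50 percent and 0 estimates for sentences (2) and (3) above:

```python
def token_confidence(test_tok, known_tok, related, weight=50.0):
    """Score one test token against one known token: weight/3 for each of
    TEXT (up to synonymy/hypernymy, via `related`), SYNTAG and REFERENT.
    SYNTAG and REFERENT only count once the TEXT attributes match."""
    if not related(test_tok["TEXT"], known_tok["TEXT"]):
        return 0.0
    score = weight / 3
    if test_tok["SYNTAG"] == known_tok["SYNTAG"]:
        score += weight / 3
    if test_tok["REFERENT"] == known_tok["REFERENT"]:
        score += weight / 3
    return score

def sentence_confidence(test_toks, known_toks, related):
    """Sum the best score of each test token; matching the subject is a
    necessary condition for scoring anything at all."""
    subj_t = next(t for t in test_toks if t["SYNTAG"] == "ncsubj")
    subj_k = next(t for t in known_toks if t["SYNTAG"] == "ncsubj")
    if not related(subj_t["TEXT"], subj_k["TEXT"]):
        return 0.0
    return sum(max(token_confidence(t, k, related) for k in known_toks)
               for t in test_toks)

# toy lexical relation: 'animal' is a hypernym of 'cat', plus identities
PAIRS = {("cat", "animal"), ("cat", "cat"), ("fish", "fish"),
         ("student", "student"), ("lunch", "lunch")}
related = lambda a, b: (a, b) in PAIRS

s1 = [{"TEXT": "animal", "SYNTAG": "ncsubj", "REFERENT": "eat"},
      {"TEXT": "lunch", "SYNTAG": "dobj", "REFERENT": "eat"}]
s2 = [{"TEXT": "cat", "SYNTAG": "ncsubj", "REFERENT": "eat"},
      {"TEXT": "fish", "SYNTAG": "dobj", "REFERENT": "eat"}]
s3 = [{"TEXT": "student", "SYNTAG": "ncsubj", "REFERENT": "eat"},
      {"TEXT": "lunch", "SYNTAG": "dobj", "REFERENT": "eat"}]

print(round(sentence_confidence(s2, s1, related)))  # (2) from (1): 50
print(round(sentence_confidence(s3, s2, related)))  # (3) from (2): 0
```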
3.6 Further refinements
The software development process concluded with the implementation
of a few final adjustments of the system, aimed primarily at
improving the quality and accuracy of the methods implemented and in
some cases with the purpose of facilitating potential extensions.
18 Later in the development process an automatic method of establishing token weights was introduced (see section 3.6).
32
The first of these adjustments was the modification of some of the
properties of the noun tokens19. More precisely, the 'TEXT' and the
'REFERENT' attributes were modified to store a tuple containing both
the string representation of the token and its part of speech tag
(e.g. student['TEXT'] = ('student', 'NN2'), as opposed to student
['TEXT'] = 'student' as it was initially). The reason for this has
to do with the fact that when looking up a word in the Wordnet
database, its part of speech is required in addition to its string
representation. Initially this was provided manually, according to
the 'SYNTAG' of a token (e.g. if the token is a subject, the part
of speech ought to be a noun). Retrieving this information from the
parser and storing it in the token properties is obviously a better,
more elegant solution. The difference is especially obvious in the
eventuality of an extension designed to cope with pronouns, for
instance. In such a case, it would not be possible to determine the
part of speech of a word from its syntactic function (both nouns and
pronouns can be subjects).
Another token property that was modified was the 'SENTENCE'
attribute. When the switch in parser output was decided, the
sentence attribute was changed to store the whole string
representation of the sentence rather than a number denoting its
position in its respective file. Apart from being more accurate,
this simplified the process of identifying passive sentences by the
presence of auxiliaries and improved the quality of the output the
system produced.
The next modification was done to provide further depth and accuracy
to the system. Initially when looking up the synonyms of a noun
token in the Wordnet database only the most common sense was
retrieved. While this is enough to demonstrate the principle of
entailment from synonymy, it is likely to be an important source of
19 A description of the noun tokens and their properties can be found in section 3.4
33
error, since often enough words are used with senses other than the
most common one. With this in mind, the system was modified so that the
five most frequent senses were retrieved. The synonyms for different
senses were stored in different lists, making it possible to later
access the sense number for a test noun token synonym should a match
be found in the knowledge base.
The final modification brought to the system was an improvement in
confidence estimate calculation. The obvious disadvantage of the
initial implementation was that the weight assigned to each noun
token (0.5) relied on the presence of exactly two such tokens in
each given sentence. Naturally, in the eventuality of an extension
this might not hold true, and indeed, should the system be able to
cope with sentences containing various numbers of noun tokens, it
becomes problematic. A more general method of calculation was
therefore introduced as an improvement. This method would obtain the
weight of each noun token in a given sentence according to the
number of total noun tokens present in this sentence, while taking
into consideration the sense frequency of the synonym for which a
match is found, as introduced in the improvement described above.
The method can be described by the formula:
x = 100/n − s
where x is the weight assigned to each token (all sentence tokens
have the same weight20), n is the number of tokens in the sentence
and s is the sense number of the match in order of frequency. The
distribution of a token's weight between its attributes remained
unchanged. Thus for a test sentence containing two noun tokens (as
in the case of S-V-O sentences), assuming a match for the third most
common sense of the nouns has been found in the knowledge base, the
weight assigned to each token would be 100/2 − 3 = 47 percent.
20 An alternative system was considered by which the subject token would have a higher weight than the others. This was dropped eventually, as it was decided that the dependence of the matching of other tokens in two sentences on the successful matching of the subjects accurately and fully reflected the greater importance of subjects compared to other syntactic functions.
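The weight formula above can be stated directly in code; a worked example for the S-V-O case with a match on the third most common sense (the function name is illustrative):

```python
def token_weight(n_tokens, sense_number):
    """x = 100/n - s: the weight of each noun token in a sentence with
    n_tokens noun tokens, matched on its sense_number-th most common
    sense."""
    return 100.0 / n_tokens - sense_number

# two noun tokens (S-V-O), match on the third most common sense:
# 100/2 - 3 = 47
print(token_weight(2, 3))
```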
34
Chapter 4. Evaluation
4.1 Corpus Development
As mentioned in section 2.2.1, the only publicly available corpus
designed specifically for the task of entailment is provided by the
organisers of the Pascal RTE Challenge. Unfortunately this corpus
was incompatible with the requirements of the system described here,
firstly because the corpus consists of sentence pairs while the
system was designed to work with a knowledge base and a set of test
sentences, and secondly, and perhaps more importantly, because the
system is restricted to S-V-O sentences (whereas the RTE corpus
contains much more complex sentence structures). This restriction
also made it impossible for a corpus to be compiled from other
sources, since the vast majority of collections of natural language
sentences available, like the RTE corpus, contain more complex
sentence structures.
The only feasible alternative was the manual construction of a
corpus suitable to the purposes and requirements of the system
presented here. Inevitably this implied that the system was tested
on a relatively small data set.
Since corpus size constituted a major disadvantage, particular care
has been given in the process of constructing the corpus to issues
such as representativeness and corpus bias21. As a result, a mixed
strategy of random selection and selection according to criteria
established based on the features implemented in the system was
adopted to construct a specialised corpus suitable for testing of
21 For a detailed discussion of corpus representativeness as well as general issues surrounding corpus construction and corpus use see McEnery and Wilson (1999).
36
the system while relatively free from bias22.
The first stage in the development of the corpus consisted of
generating the sentences that formed the knowledge base. For this
purposes a small Python script was written which made use of NLTK
tools to retrieve the two hundred most frequent common nouns that
occur in the Brown corpus23, sort them alphabetically and output them
to a file (see Appendix 2 for a code listing of the script). A
slightly modified version of the same script was used to produce a
file containing the two hundred most frequent verbs in the Brown
corpus. From these two lists words were selected randomly and
combined in a sensible fashion (i.e. avoiding nonsensical
combinations such as “Experience eats fish”) to twenty sentences of
the type Subject – Verb – Direct Object. This core set of sentences
formed the knowledge base (see Appendix 3 for a list of sentences).
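The extraction procedure can be illustrated with a stand-in for the Brown corpus: counting the most frequent words carrying the Brown tag 'NN' (singular common noun) in a small tagged sample. The sample data is invented for illustration; the actual script, listed in Appendix 2, ran over the full corpus via NLTK:

```python
from collections import Counter

# Invented tagged sample standing in for the Brown corpus; in the
# Brown tagset 'NN' marks a singular common noun.
tagged = [("the", "AT"), ("student", "NN"), ("reads", "VBZ"),
          ("the", "AT"), ("book", "NN"), ("a", "AT"),
          ("student", "NN"), ("writes", "VBZ"), ("a", "AT"),
          ("letter", "NN")]

nouns = Counter(word for word, tag in tagged if tag == "NN")
print([word for word, count in nouns.most_common(2)])
```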
In the next stage a further one hundred sentences were derived from
the initial set. These could be loosely grouped in a number of
categories, each illustrating a method used for derivation (see
Table 1)24
The categories, as well as the number of sentences allocated to each
category, were decided with the purpose of providing a set of
sentences that would thoroughly test the three features implemented
in the system (i.e. entailment from synonymy, hypernymy and passive
sentences) as well as the confidence estimate function. The design
of the categories also allowed a relatively balanced distribution
between the number of test sentences that entailed from the
knowledge base (55 – from categories 1-3, 7, 9) and the number of
those that did not (45 – from categories 4-6, 8, 10).
22 Complete elimination of bias is very difficult, if not impossible, to achieve, especially in the case of specialised corpora (McEnery and Wilson, 1999).
23 For further details on the Brown corpus see the Brown Corpus Manual (Francis and Kucera, 1979).
24 For a full listing of the sentences forming the test corpus see Appendix 4.
37
Category No. | Description | No. of Sentences
1 | sentences derived from the knowledge base set by replacing the subject with a synonym or a hyponym | 15
2 | sentences derived from the knowledge base set by replacing the direct object with a synonym or a hyponym | 10
3 | sentences derived from the knowledge base set by replacing the predicate with a synonym | 15
4 | sentences derived from the knowledge base set by replacing the subject with an unrelated word | 10
5 | sentences derived from the knowledge base set by replacing the direct object with an unrelated word | 10
6 | sentences derived from the knowledge base set by replacing the predicate with an unrelated word | 10
7 | sentences derived from the knowledge base set by applying any combination of the methods described in categories 1-3 | 10
8 | sentences derived from the knowledge base set by applying any combination of the methods described in categories 4-6 | 10
9 | sentences derived from any of the sentences in categories 1-3 and 7 by conversion to passive voice | 5
10 | sentences derived from any of the sentences in categories 4-6 and 8 by conversion to passive voice | 5

Table 1: Categories of sentences included in the test corpus.
4.2 Testing and Evaluation
4.2.1 Analysis of Testing Results
Running the software with the 20-sentence knowledge base and the 100-
sentence test base, generated by the methods described in section
4.1, produced the results summarised in Table 225.
25 See Appendix 5 for a listing of the output of the system run on the test set described.
Category No. | Accuracy (%) | Average confidence estimate value (%)
1  | 93.33 | 92.67
2  | 80    | 89.73
3  | 60    | 68.4
4  | 100   | 2.5
5  | 100   | 50
6  | 100   | 25
7  | 100   | 97.7
8  | 100   | 14.7
9  | 100   | 99.4
10 | 100   | 35
Table 2: Test results.
With 55 sentences in the test corpus entailed from the knowledge
base and 45 sentences not entailing, the natural baseline was
established at 55 percent: the accuracy of a system that uniformly
predicts entailment for every test sentence. With an overall
accuracy of 93.33 percent, the system performs significantly above
this baseline. The overall accuracy for detection of entailment
(categories 1-3, 7, 9) is 86.67 percent, while the accuracy for
detection of non-entailment (categories 4-6, 8, 10) is 100 percent.
Therefore, on the given test data, the system did not produce any
false positives.
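The baseline figure follows directly from the corpus design and is worth making explicit. A quick sketch of the arithmetic, in plain Python, using the category counts given above:

```python
# A system that uniformly predicts "entails" is correct exactly as
# often as the positive class occurs, which gives the natural
# majority-class baseline for this test corpus.
entailing = 55       # categories 1-3, 7 and 9
non_entailing = 45   # categories 4-6, 8 and 10
total = entailing + non_entailing

baseline = 100.0 * entailing / total
print(baseline)  # 55.0
```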
The most likely source of error is the retrieval of lists of
synonyms and hyponyms from WordNet. WordNet is not a complete
database, so it is possible that some synonyms or hyponyms used in
the test corpus were not found. This assumption is supported by the
fact that the lowest accuracy, 60 percent, occurs for sentences in
category 3 (in which verbs are replaced by synonyms), given that the
development of WordNet is centred around nouns.
The average overall confidence levels are 89.58 percent for entailed
sentences and 23.3 percent for non-entailed sentences [26]. Factors
that affect the confidence estimate for entailed sentences are
primarily errors of judgement (i.e. sentences being classified as
non-entailing), but also the matching of words in the test sentences
to less common synonyms in the knowledge base. The latter is
intentional and is a consequence of the method used to calculate
confidence estimates (see section 3.6 for further details).
For non-entailed sentences, the large difference in average
confidence estimates between categories 4 and 5 (2.5 percent versus
50 percent) is caused by the design of the matching process at the
core of the system rather than by implementation flaws. The
sentences in category 4 are generated by replacing the subject with
unrelated words, whereas in category 5 it is the direct object that
is replaced. Because the system relies heavily on the correct
matching of subjects, for sentences in category 4 the entire
matching process is interrupted at an early stage, producing very
low confidence estimates. In contrast, the system matches subjects
correctly for sentences in category 5, which eventually produces a
higher confidence level.
Category 6 produces intermediate values (25 percent) because verbs
are matched as referents of the subjects, after matching the
subjects themselves and before attempting to match the remainder of
the sentences.
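The staged behaviour described in the last two paragraphs can be illustrated with a simplified sketch. The weights below are assumptions chosen only to reproduce the reported per-category averages; they are not the formula the system actually uses (see section 3.6 for that):

```python
# Assumed component weights (illustrative, not the report's formula):
# matching proceeds subject -> verb -> direct object and aborts at
# the first failure, so an early (subject) failure leaves almost no
# confidence while a late (object) failure leaves a substantial amount.
WEIGHTS = (('subject', 25.0), ('verb', 25.0), ('object', 50.0))

def staged_confidence(subject_ok, verb_ok, object_ok):
    results = (subject_ok, verb_ok, object_ok)
    confidence = 0.0
    for (stage, weight), ok in zip(WEIGHTS, results):
        if not ok:
            break
        confidence += weight
    return confidence

print(staged_confidence(False, True, True))  # category 4: subject fails -> 0.0
print(staged_confidence(True, True, False))  # category 5: object fails -> 50.0
print(staged_confidence(True, False, True))  # category 6: verb fails -> 25.0
```

The sketch reduces category 4 to zero; the reported average of 2.5 percent suggests the real estimate retains a small residual value, which this simplification omits.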
26 The confidence level measures the probability of entailment of test sentences from the knowledge base, rather than the confidence in the judgement made. Therefore, for a good system, one would expect high confidence levels for sentences that entail and low confidence levels otherwise, rather than uniformly high values.
4.2.2 Evaluation
Naturally, the first step in evaluating a piece of software is a
discussion of the test results. An overall accuracy of 93.33 percent
against a baseline of 55 percent, together with the fact that no
false positives were produced, demonstrates that the system
successfully implemented the features it was tested for.
Nonetheless, errors did occur, which implies that improvement is
possible.
Judging from the test results, the main weakness of the system is
its reliance on WordNet for evidence of lexical relations such as
synonymy and hypernymy. Unfortunately, a solution to this problem is
not obvious, given that WordNet is currently the largest lexical
database available. It is worth pointing out, however, that part of
the reason why the use of WordNet generated false negatives is the
restriction to the Subject-Verb-Direct Object sentence structure.
When working with S-V-O sentences (which necessarily contain only
two noun tokens), individual words carry heavy confidence level
weights. As a result, if just one word in the test sentence has no
match in the knowledge base, the confidence level drops
significantly and ultimately leads to the test sentence being judged
as non-entailing. Should the system be capable of handling more
complex sentence structures (the primary extension suggested in
section 4.3), the confidence level weight of each token would
usually be lower (as there would be more tokens in a test sentence),
and therefore potential errors generated by WordNet's incompleteness
would be less significant in the process of judging entailment for a
test sentence.
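The dilution effect can be sketched under the simplifying assumption that every token carries an equal share of the confidence weight (the actual weighting scheme is described in section 3.6):

```python
# With n equally weighted tokens, a single WordNet miss costs 100/n
# confidence points, so richer sentence structures dilute the damage
# of one missing synonym or hyponym.
def confidence_after_one_miss(num_tokens):
    return 100.0 * (num_tokens - 1) / num_tokens

print(confidence_after_one_miss(3))   # S-V-O sentence: roughly 66.7
print(confidence_after_one_miss(10))  # richer structure: 90.0
```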
Apart from the production of false negatives, the other potential
problem identified from the test results is best illustrated by the
average confidence level for category 5 sentences (50 percent). When
constructing the test corpus, these were generated by replacing the
direct objects of some of the knowledge base sentences with
unrelated words. Thus the test sentences clearly do not entail, yet,
at 50 percent, the confidence estimate is on the verge of
entailment. There are two reasons for this: the high confidence
weight assigned to each noun token (already discussed above) and the
reliance of the system on the successful matching of test sentence
subjects to knowledge base subjects (as discussed in section 4.2.1).
This reliance could have been eliminated when the confidence
estimate was introduced (see section 3.5.3); after some thought it
has been kept, for theoretical considerations.
In developing an approach to the task of textual entailment
recognition, the assumption was made that nouns are the building
blocks of utterances (section 2.2.3). In the same manner, and
perhaps more clearly, subjects are considered to play a central role
amongst the nouns a sentence contains. The manner in which humans
use language supports this. If, presented with an utterance, an
individual understands its subject, it can be said that they have
acquired information, as they are aware that something is being said
about the subject. If instead they only recognise the direct object,
for instance, the information acquired is less important [27]. Thus
the system places emphasis on subjects in an attempt to maintain
resemblance to human use of language.
Overall the system successfully fulfils its purpose of highlighting
important aspects of the problem of textual entailment and of
27 It is important to distinguish between grammatical subject and logical subject. For active sentences the two are identical. However, in passive sentences the logical subject is in fact the direct object. Competent parsers which include syntactic information recognise this difference. This system also accounts for it by performing passive-active conversion prior to processing the information contained in a sentence (see section 2.2.2).
verifying some of the assumptions behind the proposed approach to
this problem. As the results show, it performs in a satisfactory
manner given the restrictions imposed on it, which enforce the
principles at the core of its design. It is intended as a model,
rather than a complete system for automated recognition of textual
entailment. Further development is possible, however; the next
section discusses ways in which it may be conducted.
4.3 Possible Enhancements
An important aspect of the development of the system described in
Chapter 3 was the consideration given to possible extensions; for
this reason the implementation was done in such a manner as to allow
extensions to be added on top of the features already implemented,
without the need for major changes. This is shown by the fact that a
number of possible enhancements have already been mentioned in the
process of describing how various features were implemented.
Of these, perhaps the most important would be allowing the system to
deal with more complex sentence structures than S-V-O. Apart from
providing a more complete system this would allow for extensive
testing with existing large corpora and potentially comparison with
other systems dealing with recognition of entailment. For this
purpose, initially the method responsible for the creation of noun
tokens could easily be extended to support the entire set of
grammatical relations the RASP parser can represent (see Annexe 2),
rather than only subject and direct object, as is the case at the
moment. The next step would be the introduction of additional
attributes for the noun tokens that would store other information
potentially contained in the knowledge base sentences (such as
adjectives or determiners).
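A possible shape for such an extended noun token is sketched below; the field names and the relation labels ('ncsubj', 'dobj') are illustrative assumptions, not the representation the current system uses:

```python
# Hypothetical extended noun token: 'relation' generalises the
# current subject/direct-object distinction to an arbitrary
# grammatical relation label, and modifier information (adjectives,
# determiner) is stored alongside the noun itself.
class NounToken:
    def __init__(self, text, relation, adjectives=None, determiner=None):
        self.text = text                    # surface form of the noun
        self.relation = relation            # grammatical relation label
        self.adjectives = adjectives or []  # e.g. ['black']
        self.determiner = determiner        # e.g. 'the', 'all'

token = NounToken('cats', 'ncsubj', adjectives=['black'], determiner='the')
print(token.text, token.relation, token.adjectives)
```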
An obvious and important set of extensions would be provided by the
addition of other possible sources of entailment as features. It is
difficult to approximate how many of these would be feasible without
an in-depth study on the topic of sources of entailment. However,
some suggestions would be entailment from determiners (e.g. 'All
animals drink water' entails 'Some animals drink water'), from
holonyms28 (e.g. 'My car was stolen' entails 'My wheel was stolen',
given that the wheel referred to is part of the stolen car), or from
negative sentences (for instance 'I am going [not home]' – e.g. 'to
school' – entails 'I am not going home').
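As an indication of how the determiner case might be implemented, a minimal rule is sketched below; the lookup table and function name are hypothetical, and real input would of course come from a parser rather than whitespace splitting:

```python
# Replacing a universal determiner with an existential one preserves
# truth, so 'All X ...' entails 'Some X ...' when the rest of the
# sentence is unchanged.
DETERMINER_ENTAILS = {'all': {'some'}, 'every': {'some'}}

def determiner_entails(premise, hypothesis):
    p, h = premise.lower().split(), hypothesis.lower().split()
    if p[1:] != h[1:]:   # everything after the determiner must match
        return False
    return p[0] == h[0] or h[0] in DETERMINER_ENTAILS.get(p[0], set())

print(determiner_entails('All animals drink water', 'Some animals drink water'))  # True
print(determiner_entails('Some animals drink water', 'All animals drink water'))  # False
```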
Finally, a number of enhancements could be made regarding the
management of the knowledge database of noun tokens. Firstly, the
database could be optimised so that redundant entries are removed.
This applies to the same nouns appearing in different sentences, but
also to synonyms, for instance. This is quite a complex enhancement,
since the duplicate entries would most likely store different
information in some of their attributes, and methods would have to
be developed to merge each of these attributes before removal can
take place. However, if completed, given a large knowledge base, an
improvement in execution speed would be an immediate gain. In
addition, it would open the way for two other knowledge base
enhancements.
One of these is the ability to extend the knowledge base with
unknown information identified in the test sentences, both by the
addition of new noun tokens and by the modification of attributes of
noun tokens already known. This would confer robustness on the
knowledge base, enabling, simply put, the system to 'learn'
language.
28 Y is a holonym of X if X is a part of Y (e.g. 'car' is a holonym of 'wheel').
The other is the addition of the capability to perform 'fuzzy'
searches during the matching process, allowing entailment judgements
to be made from the knowledge base as a whole rather than by
identifying individual sentences that are potential candidates for
entailment of the test sentence.
4.4 Conclusion
The virtues of having an artificial system that could recognise
entailment, and implicitly paraphrase, should be obvious.
Information extraction, information retrieval and question answering
systems would all benefit greatly from this sort of ability. More
generally, a system able to deal with language in any fluent manner
is impossible to conceive of without a comprehensive ability to deal
with entailment. This provides strong motivation for research
directed at automating the recognition of sentence entailment, and
was a main factor in choosing entailment as the topic for this
project.
While by no means providing a complete solution to the problem of
recognising sentence entailment, the approach described in this
project, and the software implemented to demonstrate and test the
ideas behind it, succeed in their purpose of identifying some of the
key aspects of this task and providing a starting point for further
development.
Bibliography
Briscoe, E. and Carroll, J. Robust Accurate Statistical Annotation of General Text. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, May 2002, pp. 1499-1504.

Briscoe, E. and Carroll, J. Grammatical Relation Annotation. http://www.informatics.susx.ac.uk/research/nlp/carroll/grdescription/index.html, 2000 (last visited 27 April 2005).

Cheng, J. The Fundamental Role of Entailment in Knowledge Representation and Reasoning. Journal of Computing and Information, Vol. 2, No. 1: 853-873, 1996.

Dagan, I. and Glickman, O. Probabilistic Textual Entailment: Generic Applied Modeling of Language Variability. Learning Methods for Text Understanding and Mining, 26-29, 2004.

Dagan et al. Recognising Textual Entailment Challenge, 15 June 2004 – 10 April 2005. http://www.pascal-network.org/Challenges/RTE/ (last visited 27 April 2005).

Dagan et al. The PASCAL Recognising Textual Entailment Challenge. http://www.cs.biu.ac.il/~glikmao/rte05/dagan_et_al.pdf (last visited 27 April 2005).

Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, 1998.

Francis, W. N. and Kucera, H. Brown Corpus Manual. http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM, 1979 (last visited 27 April 2005).

Iwanska, L. Logical Reasoning in Natural Language: Is it all about Language? International Journal of Minds and Machines, Special Issue on Knowledge Representation for Natural Language, 3: 475-510, 1993.

Klein et al. Elementary Language Processing: Tokenizing Text and Classifying Words. http://nltk.sourceforge.net/tutorial/tokenization/index.html, 2004 (last visited 27 April 2005).

Loper, E. and Bird, S. NLTK: The Natural Language Toolkit. Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia, 2002.

Marks, D. Development Methodologies Compared. http://www.ncycles.com/e_whi_Methodologies.htm, 2002 (last visited 27 April 2005).

McEnery, T. and Wilson, A. Corpus Linguistics. Edinburgh: Edinburgh University Press, 1999.

Russel, K. Natural Language Toolkit (NLTK) 1.2. http://nltk.sourceforge.net/docs.html, 2003 (last visited 27 April 2005).

Saeed, J. I. Semantics, Second Edition. Blackwell Publishing Ltd, 2003.

UCREL. CLAWS2 Tagset. http://www.comp.lancs.ac.uk/ucrel/claws2tags.html (last visited 27 April 2005).

van Rossum, G. Python Tutorial, Release 2.4. http://docs.python.org/tut/tut.html, 2005 (last visited 27 April 2005).
Appendix 1
Given the interdisciplinary character of my degree and a personal interest
in philosophy as well as a taste for innovation, the issue of textual
entailment presented itself as a perfect topic for the final year project.
From the outset I suspected it was going to be difficult but the existence
of the PASCAL RTE challenge (http://www.pascal-
network.org/Challenges/RTE/), of the NLTK (http://nltk.sourceforge.net) and
the fact that I was enrolled for the Natural Language Processing module in
the second semester encouraged me to go ahead with my choice. Only later
did I realise that there was very little relevant research material
available and that few of the topics treated in the NLP module had
connections with entailment. My enthusiasm for the topic remained unchanged
however and drove me to do massive amounts of reading around the topic,
most of which I did not use in the final report. More time still went on
learning Python, NLTK methods, and acquiring general NLP concepts in
advance of their introduction in the relevant module. On the programming
side of things, I spent a few weeks trying to adapt NLTK tools for the
project only to discover eventually that the RASP parser was better suited
to my needs. Because of all these factors I found myself at the end of
Semester 1 only having started the actual work. Obviously this put a lot of
pressure on me and at times I thought I would not finish. The reading did
help, however, indirectly, by giving me an in-depth understanding of the
topic and thus enabling me to develop and implement my ideas in an
efficient manner. The result is a project in which the theoretical
treatment of the topic is the focus and the software is meant as an aid to
understanding and investigating the issues at hand.
The main advice I would have for others interested in the topic is to pay
careful attention to planning. This is crucial, given the complexity of the
topic. Draft a detailed plan in the initial stages, allowing for possible
delays and adhere to it strictly. Entailment is an interesting but
difficult topic, requiring a strong theoretical foundation, even for
someone oriented towards the programming aspect of it. But it is vital to
limit the time spent reading such that you leave yourself enough time to do
the work. You will not manage to solve the problem completely, there simply
isn't enough time. Set yourself a number of specific goals early and
advance through them methodically.
Overall I have enjoyed working on this project and managed to keep my
interest for the topic intact throughout.
Appendix 2
Script used to retrieve the 200 most frequent nouns
in the Brown corpus.
from nltk.corpus import brown
from nltk.probability import FreqDist

# Load every section of the Brown corpus.
corpus = []
for item in brown.items():
    corpus.append(brown.read(item))

# Count occurrences of singular (nn) and plural (nns) common nouns.
fd = FreqDist()
for text in corpus:
    for token in text['WORDS']:
        if (token['TAG'] == 'nn') or (token['TAG'] == 'nns'):
            fd.inc(token['TEXT'])

# Write the 200 most frequent nouns, one per line, to the file 'samples'.
f = file('samples', 'a')
counter = 0
for item in fd.sorted_samples():
    counter = counter + 1
    if counter <= 200:
        f.write(item)
        f.write('\n')
    else:
        break
f.close()

print 'most frequent word is %s' % fd.max()
print 'Done.'
Appendix 3
Knowledge Base Corpus.
felines eat fish.
students make errors.
parents love children.
scientists perform experiments.
countries have governments.
governments pass laws.
education prevents poverty.
boys like girls.
business involves money.
questions require answers.
detectives analyse evidence.
animals want space.
history teaches lessons.
people damage nature.
success builds confidence.
teachers develop minds.
experience brings wisdom.
books provide information.
society needs examples.
companies create jobs.
Appendix 4
Test Corpus
cats eat fish.
beasts want space.
tutors develop minds.
textbooks provide information.
cultures need examples.
pupils make errors.
nations have governments.
governance passes laws.
instruction prevents poverty.
doubts require answers.
investigators analyse evidence.
creatures want space.
stories teach lessons.
winners build confidence.
records provide information.
education prevents destitution.
detectives analyse proofs.
questions require feedback.
students make slips.
business involves funds.
students make mistakes.
parents love babies.
society needs lessons.
teachers develop judgements.
success builds trust.
boys like girlfriends.
education prevents impoverishment.
questions require solutions.
companies create business.
history teaches morals.
students create errors.
scientists execute experiments.
countries need governments.
governments give laws.
business demands money.
detectives examine evidence.
animals need space.
experience conveys wisdom.
felines consume fish.
success establishes confidence.
sharks eat fish.
justice prevents poverty.
chess develops minds.
explanations need examples.
cities destroy nature.
computers provide information.
investment creates jobs.
age brings wisdom.
banks make errors.
gambling involves money.
countries have citizens.
books provide pleasure.
animals want food.
companies create competition.
questions require interpretation.
people destroy property.
boys like football.
society needs peace.
teachers develop exams.
education prevents ignorance.
parents protect children.
scientists design experiments.
countries elect governments.
business attracts money.
detectives find evidence.
animals enjoy space.
people change nature.
books hold information.
society lacks examples.
teachers shape minds.
cats consume fish.
pupils create mistakes.
nations need governments.
instruction prevents destitution.
business demands funds.
doubts require solutions.
detectives examine proofs.
beasts need space.
stories teach morals.
cultures need lessons.
cigarettes destroy health.
parents have duties.
elections involve campaigns.
birds eat seeds.
trains improve transport.
companies sell products.
universities offer diplomas.
people enjoy music.
scientists propose theories.
pictures decorate walls.
information is provided by books.
girls are liked by boys.
evidence is analysed by investigators.
errors are made by pupils.
lessons are needed by society.
jobs are created by investment.
pleasure is provided by books.
exams are developed by teachers.
evidence is found by detectives.
nature is changed by people.
Appendix 5
System output after running on the test set in
Appendix 4
Sentence "cats eat fish" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "beasts want space" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "tutors develop minds" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "textbooks provide information" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "cultures need examples" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "pupils make errors" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "nations have governments" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "governance passes laws" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "instruction prevents poverty" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "doubts require answers" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "investigators analyse evidence" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "creatures want space" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "stories teach lessons" entails from the knowledge base.
Confidence level =97.0
-----
Sentence "winners build confidence" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "records provide information" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "education prevents destitution" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "detectives analyse proofs" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "questions require feedback" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "students make slips" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "business involves funds" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "students make mistakes" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "parents love babies" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "society needs lessons" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "teachers develop judgements" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "success builds trust" does not entail from any known sentence.
Confidence level =50
-----
Sentence "boys like girlfriends" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "education prevents impoverishment" entails from the knowledge
base.
Confidence level =100.0
-----
Sentence "questions require solutions" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "companies create business" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "history teaches morals" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "students create errors" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "scientists execute experiments" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "countries need governments" entails from the knowledge base.
Confidence level =96.0
-----
Sentence "governments give laws" does not entail from any known sentence.
Confidence level =25
-----
Sentence "business demands money" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "detectives examine evidence" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "animals need space" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "experience conveys wisdom" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "felines consume fish" entails from the knowledge base.
Confidence level =92.0
-----
Sentence "success establishes confidence" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "sharks eat fish" does not entail from any known sentence.
Confidence level =0
-----
Sentence "justice prevents poverty" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "chess develops minds" does not entail from any known sentence.
Confidence level =0
-----
Sentence "explanations need examples" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "cities destroy nature" does not entail from any known sentence.
Confidence level =0
-----
Sentence "computers provide information" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "investment creates jobs" does not entail from any known sentence.
Confidence level =25
-----
Sentence "age brings wisdom" does not entail from any known sentence.
Confidence level =0
-----
Sentence "banks make errors" does not entail from any known sentence.
Confidence level =0
-----
Sentence "gambling involves money" does not entail from any known sentence.
Confidence level =0
-----
Sentence "countries have citizens" does not entail from any known sentence.
Confidence level =50
-----
Sentence "books provide pleasure" does not entail from any known sentence.
Confidence level =50
-----
Sentence "animals want food" does not entail from any known sentence.
Confidence level =50
-----
Sentence "companies create competition" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "questions require interpretation" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "people destroy property" does not entail from any known sentence.
Confidence level =50
-----
Sentence "boys like football" does not entail from any known sentence.
Confidence level =50
-----
Sentence "society needs peace" does not entail from any known sentence.
Confidence level =50
-----
Sentence "teachers develop exams" does not entail from any known sentence.
Confidence level =50
-----
Sentence "education prevents ignorance" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "parents protect children" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "scientists design experiments" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "countries elect governments" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "business attracts money" does not entail from any known sentence.
Confidence level =25
-----
Sentence "detectives find evidence" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "animals enjoy space" does not entail from any known sentence.
Confidence level =25
-----
Sentence "people change nature" does not entail from any known sentence.
Confidence level =25
-----
Sentence "books hold information" does not entail from any known sentence.
Confidence level =25
-----
Sentence "society lacks examples" does not entail from any known sentence.
Confidence level =25
-----
Sentence "teachers shape minds" does not entail from any known sentence.
Confidence level =25
-----
Sentence "cats consume fish" entails from the knowledge base.
Confidence level =92.0
-----
Sentence "pupils create mistakes" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "nations need governments" entails from the knowledge base.
Confidence level =96.0
-----
Sentence "instruction prevents destitution" entails from the knowledge
base.
Confidence level =99.0
-----
Sentence "business demands funds" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "doubts require solutions" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "detectives examine proofs" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "beasts need space" entails from the knowledge base.
Confidence level =98.0
-----
Sentence "stories teach morals" entails from the knowledge base.
Confidence level =97.0
-----
Sentence "cultures need lessons" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "cigarettes destroy health" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "parents have duties" does not entail from any known sentence.
Confidence level =25
-----
Sentence "elections involve campaigns" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "birds eat seeds" does not entail from any known sentence.
Confidence level =25
-----
Sentence "trains improve transport" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "companies sell products" does not entail from any known sentence.
Confidence level =25
-----
Sentence "universities offer diplomas" does not entail from any known
sentence.
Confidence level =0
-----
Sentence "people enjoy music" does not entail from any known sentence.
Confidence level =25
-----
Sentence "scientists propose theories" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "pictures decorate walls" does not entail from any known sentence.
Confidence level =22
-----
Sentence "information is provided by books" entails from the knowledge
base.
Confidence level =100.0
-----
Sentence "girls are liked by boys" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "evidence is analysed by investigators" entails from the knowledge
base.
Confidence level =98.0
-----
Sentence "errors are made by pupils" entails from the knowledge base.
Confidence level =100.0
-----
Sentence "lessons are needed by society" entails from the knowledge base.
Confidence level =99.0
-----
Sentence "jobs are created by investment" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "pleasure is provided by books" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "exams are developed by teachers" does not entail from any known
sentence.
Confidence level =50
-----
Sentence "evidence is found by detectives" does not entail from any known
sentence.
Confidence level =25
-----
Sentence "nature is changed by people" does not entail from any known
sentence.
Confidence level =25
-----
Appendix 6
README instructions for running the software and the
script in Appendix 2.
Both the software and the script make use of the NLTK toolkit, which
is not included with the standard Python distribution. Therefore, if
running on the University of Leeds, School of Computing Linux
system, type 'extras nltk' at the command prompt in a console before
running either program. This sets the NLTK environment variables.

To run the software, type 'python rte.py' at the command prompt in
the console and follow the instructions on screen.