

Research and Development in Intelligent Systems XXI


Max Bramer, Frans Coenen and Tony Allen (Eds)

Research and Development in Intelligent Systems XXI

Proceedings of AI-2004, the Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence

Springer


Professor Max Bramer, BSc, PhD, CEng, FBCS, FIEE, FRSA Faculty of Technology, University of Portsmouth, Portsmouth, UK

Dr Frans Coenen Department of Computer Science, University of Liverpool, Liverpool, UK

Dr Tony Allen Nottingham Trent University

British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

ISBN 1-85233-907-1 Springer is part of Springer Science+Business Media springeronline.com

© Springer-Verlag London Limited 2005 Printed in Great Britain

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Typesetting: Camera-ready by editors Printed and bound at the Athenaeum Press Ltd, Gateshead, Tyne & Wear 34/3830-543210 Printed on acid-free paper SPIN 11006770


TECHNICAL PROGRAMME CHAIR'S INTRODUCTION

M.A.BRAMER University of Portsmouth, UK

This volume comprises the refereed technical papers presented at AI-2004, the Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, held in Cambridge in December 2004. The conference was organised by SGAI, the British Computer Society Specialist Group on Artificial Intelligence.

The papers in this volume present new and innovative developments in the field, divided into sections on AI Techniques I and II, CBR and Recommender Systems, Ontologies, Intelligent Agents and Scheduling Systems, Knowledge Discovery in Data and Spatial Reasoning and Image Recognition.

This year's prize for the best refereed technical paper was won by a paper entitled Extracting Finite Structure from Infinite Language by T. McQueen, A. A. Hopgood, T. J. Allen and J. A. Tepper (School of Computing & Informatics, Nottingham Trent University, UK). SGAI gratefully acknowledges the long-term sponsorship of Hewlett-Packard Laboratories (Bristol) for this prize, which goes back to the 1980s.

This is the twenty-first volume in the Research and Development series. The Application Stream papers are published as a companion volume under the title Applications and Innovations in Intelligent Systems XII.

On behalf of the conference organising committee I should like to thank all those who contributed to the organisation of this year's technical programme, in particular the programme committee members, the executive programme committee and our administrators Linsay Turbert and Collette Jackson.

Max Bramer Technical Programme Chair, AI-2004


ACKNOWLEDGEMENTS

AI-2004 CONFERENCE COMMITTEE

Dr. Tony Allen, Nottingham Trent University (Conference Chair)

Dr Robert Milne, Sermatech Intelligent Applications Ltd (Deputy Conference Chair, Finance and Publicity)

Dr. Alun Preece, University of Aberdeen (Deputy Conference Chair, Electronic Services)

Dr Nirmalie Wiratunga, Robert Gordon University, Aberdeen (Deputy Conference Chair, Poster Session)

Prof. Adrian Hopgood, Nottingham Trent University (Tutorial Organiser)

Prof. Ann Macintosh, Napier University (Application Programme Chair)

Richard Ellis, Stratum Management Ltd (Deputy Application Programme Chair)

Professor Max Bramer, University of Portsmouth (Technical Programme Chair)

Dr Frans Coenen, University of Liverpool (Deputy Technical Programme Chair)

Dr. Bob Howlett, University of Brighton (Exhibition Organiser)

Rosemary Gilligan (Research Student Liaison)

TECHNICAL EXECUTIVE PROGRAMME COMMITTEE

Prof. Max Bramer, University of Portsmouth (Chair)
Dr. Frans Coenen, University of Liverpool (Vice-Chair)
Dr. Tony Allen, Nottingham Trent University
Prof. Adrian Hopgood, Nottingham Trent University
Mr. John Kingston, University of Edinburgh
Dr. Peter Lucas, University of Nijmegen, The Netherlands
Dr. Alun Preece, University of Aberdeen


TECHNICAL PROGRAMME COMMITTEE

Alia Abdelmoty (Cardiff University)

Andreas A Albrecht (University of Hertfordshire)

Tony Allen (Nottingham Trent University)

Somaya A. S. Almaadeed (Qatar University)

Yaxin Bi (Queen's University Belfast)

Arkady Borisov (Riga Technical University)

Max Bramer (University of Portsmouth)

Ken Brown (University College Cork)

Frans Coenen (University of Liverpool)

Bruno Cremilleux (University of Caen)

Juan A. Fdez. del Pozo (Technical University of Madrid)

Marina De Vos (University of Bath)

John Debenham (University of Technology, Sydney)

Stefan Diaconescu (Softwin)

Nicolas Durand (University of Caen)

Anneli Edman (University of Upsala)

Mark Elshaw (University of Sunderland)

Max Garagnani (The Open University)

Adriana Giret (Universidad Politecnica de Valencia)

Mercedes Gomez Albarran (Univ. Complutense de Madrid)

Martin Grabmüller (Technische Universität Berlin)

Anne Hakansson (Uppsala University, Sweden)

Mark Hall (University of Waikato, New Zealand)

Eveline M. Helsper (Utrecht University)

Ray Hickey (University of Ulster)

Adrian Hopgood (The Nottingham Trent University)

Chihli Hung (De Lin Institute of Technology, Taiwan)

Piotr Jedrzejowicz (Gdynia Maritime University, Poland)

John Kingston (University of Edinburgh)

T. K. Satish Kumar (Stanford University)

Alvin C. M. Kwan (University of Hong Kong)

Brian Lees (University of Paisley)

Peter Lucas (University of Nijmegen)

Angeles Manjarres (Universidad Nacional de Educacion a Distancia, Spain)

Daniel Manrique Gamo

Raphael Maree (University of Liege, Belgium)


David McSherry (University of Ulster)

Alfonsas Misevicius (Kaunas University of Technology)

Ernest Muthomi Mugambi (Sunderland University, UK)

Lars Nolle (Nottingham Trent University)

Tomas Eric Nordlander (University of Aberdeen)

Tim Norman (University of Aberdeen)

Dan O'Leary (University of Southern California)

Barry O'Sullivan (University College Cork)

Alun Preece (University of Aberdeen)

Gerrit Renker (Robert Gordon University)

Maria Dolores Rodriguez-Moreno (Universidad de Alcala)

Fernando Saenz Perez (Universidad Complutense de Madrid)

Miguel A. Salido (Universidad de Alicante)

Barry Smyth (University College Dublin)

Jon Timmis (University of Kent)

Kai Ming Ting (Monash University)

Andrew Tuson (City University)

M.R.C. van Dongen (University College Cork)

Ian Watson (University of Auckland)

Graham Winstanley (University of Brighton)

Nirmalie Wiratunga (Robert Gordon University)

Shengxiang Yang (University of Leicester)


CONTENTS

BEST TECHNICAL PAPER

Extracting Finite Structure from Infinite Language (x) T. McQueen, A. A. Hopgood, T. J. Allen and J. A. Tepper, School of Computing & Informatics, Nottingham Trent University, UK 3

SESSION 1a: AI TECHNIQUES I

Modelling Shared Extended Mind and Collective Representational Content Tibor Bosse, Catholijn M. Jonker and Martijn C. Schut, Department of Artificial Intelligence, Vrije Universiteit Amsterdam; Jan Treur, Department of Artificial Intelligence, Vrije Universiteit Amsterdam and Department of Philosophy, Universiteit Utrecht 19

Overfitting in Wrapper-Based Feature Subset Selection: The Harder You Try the Worse it Gets John Loughrey and Pádraig Cunningham, Trinity College Dublin, Ireland 33

Managing Ontology Versions with a Distributed Blackboard Architecture Ernesto Compatangelo, Wamberto Vasconcelos and Bruce Scharlau, Department of Computing Science, University of Aberdeen 44

OntoSearch: An Ontology Search Engine Yi Zhang, Wamberto Vasconcelos and Derek Sleeman, Department of Computing Science, University of Aberdeen, Aberdeen, UK 58

SESSION 1b: CBR AND RECOMMENDER SYSTEMS

Case Based Adaptation Using Interpolation over Nominal Values Brian Knight, University of Greenwich, UK and Fei Ling Woon, Tunku Abdul Rahman College, Kuala Lumpur, Malaysia 73

Automating the Discovery of Recommendation Rules David McSherry, School of Computing and Information Engineering, University of Ulster, Northern Ireland 87

Incremental Critiquing (x) James Reilly, Kevin McCarthy, Lorraine McGinty and Barry Smyth, Department of Computer Science, University College Dublin, Ireland 101

Note: X indicates SGAI recognition award


SESSION 2: AI TECHNIQUES II

A Treebank-Based Case Role Annotation Using An Attributed String Matching Samuel W.K.Chan, Department of Decision Sciences, The Chinese University of Hong Kong, Hong Kong, China 117

A Combinatorial Approach to Conceptual Graph Projection Checking Madalina Croitoru and Ernesto Compatangelo, Department of Computing Science, University of Aberdeen 130

Implementing Policy Management Through BDI Simon Miles, Juri Papay, Michael Luck and Luc Moreau, University of Southampton, UK 144

Exploiting Causal Independence in Large Bayesian Networks (x) Rasa Jurgelenaite and Peter Lucas, Radboud University Nijmegen, The Netherlands 157

SESSION 3: INTELLIGENT AGENTS AND SCHEDULING SYSTEMS

A Bargaining Agent Aims to 'Play Fair' John Debenham, Faculty of Information Technology, University of Technology, Sydney, NSW, Australia 173

Resource Allocation in Communication Networks Using Market-Based Agents (x) Nadim Haque, Nicholas R. Jennings and Luc Moreau, School of Electronics and Computer Science, University of Southampton, Southampton, UK 187

Are Ordinal Representations Effective? Andrew Tuson, Department of Computing, City University, UK 201

A Framework for Planning with Hybrid Models

Max Garagnani, Department of Computing, The Open University, UK 214

SESSION 4: KNOWLEDGE DISCOVERY IN DATA

Towards Symbolic Data Mining in Numerical Time Series

Agustin Santamaria, Technical University of Madrid, Spain; Africa Lopez-Illescas, High Council for Sports, Madrid, Spain; Aurora Perez-Perez and Juan P. Caraça-Valente, Technical University of Madrid, Spain 231

Support Vector Machines of Interval-based Features for Time Series Classification (x) Juan Jose Rodriguez, Universidad de Burgos, Spain and Carlos J. Alonso, Departamento de Informatica, Universidad de Valladolid, Spain 244


Neighbourhood Exploitation in Hypertext Categorization Houda Benbrahim and Max Bramer, Department of Computer Science and Software Engineering, University of Portsmouth, UK 258

Using Background Knowledge to Construct Bayesian Classifiers for Data-Poor Domains Marcel van Gerven and Peter Lucas, Institute for Computing and Information Sciences, University of Nijmegen, The Netherlands 269

SESSION 5: SPATIAL REASONING, IMAGE RECOGNITION AND HYPERCUBES

Interactive Selection of Visual Features through Reinforcement Learning Sebastien Jodogne and Justus H. Piater, Montefiore Institute, University of Liege, Belgium 285

Imprecise Qualitative Spatial Reasoning Baher El-Geresy, Department of Computer Studies, University of Glamorgan, UK and Alia Abdelmoty, Department of Computer Science, Cardiff University, UK 299

Reasoning with Geometric Information in Digital Space (x) Passent El-Kafrawy and Robert McCartney, Department of Computer Science and Engineering, University of Connecticut, USA 313

On Disjunctive Representations of Distributions and Randomization T. K. Satish Kumar, Knowledge Systems Laboratory, Stanford University 327

AUTHOR INDEX 341


BEST TECHNICAL PAPER


Extracting Finite Structure from Infinite Language

T. McQueen, A. A. Hopgood, T. J. Allen, and J. A. Tepper School of Computing & Informatics, Nottingham Trent University,

Burton Street, Nottingham, NG1 4BU, UK {thomas.mcqueen, adrian.hopgood, tony.allen, jonathan.tepper}@ntu.ac.uk

www.ntu.ac.uk

Abstract

This paper presents a novel connectionist memory-rule based model capable of learning the finite-state properties of an input language from a set of positive examples. The model is based upon an unsupervised recurrent self-organizing map [1] with laterally interconnected neurons. A derivation of functional-equivalence theory [2] is used that allows the model to exploit similarities between the future context of previously memorized sequences and the future context of the current input sequence. This bottom-up learning algorithm binds functionally-related neurons together to form states. Results show that the model is able to learn the Reber grammar [3] perfectly from a randomly generated training set and to generalize to sequences beyond the length of those found in the training set.

1. Introduction

Since its inception, language acquisition has been one of the core problems in artificial intelligence. The ability to communicate through spoken or written language is considered by many philosophers to be the hallmark of human intelligence. Researchers have endeavoured to explain this human propensity for language in order both to develop a deeper understanding of cognition and also to produce a model of language itself. The quest for an automated language acquisition model is thus the ultimate aim for many researchers [4]. Currently, the abilities of many natural language processing systems, such as parsers and information extraction systems, are limited by a prerequisite need for an incalculable amount of manually derived language and domain-specific knowledge. The development of a model that could automatically acquire and represent language would revolutionize the field of artificial intelligence, impacting on almost every area of computing from Internet search engines to speech-recognition systems.

Language acquisition is considered by many to be a paradox. Researchers such as Chomsky argue that the input to which children are exposed is insufficient for them to determine the grammatical rules of the language. This argument for the poverty of stimulus [5] is based on Gold's theorem [6], which proves that most classes of languages cannot be learnt using only positive evidence, because of the effect of overgeneralization. Gold's analysis and proof regarding the unfeasibility of language acquisition thus forms a central conceptual pillar of modern linguistics. However, less formal approaches have questioned the treatment of language identification as a deterministic problem in which any solution must involve a guarantee of no future errors. Such approaches to the problem of language acquisition [7] show that certain classes of language can be learnt using only positive examples if language identification involves a stochastic probability of success.

Language acquisition, as with all aspects of natural language processing, traditionally involves hard-coded symbolic approaches. Such top-down approaches to cognition attempt to work backwards from formal linguistic structure towards human processing mechanisms. However, recent advances in cognitive modelling have led to the birth of connectionism, a discipline that uses biologically inspired models that are capable of learning by example. In contrast to traditional symbolic approaches, connectionism uses a bottom-up approach to cognition that attempts to solve human-like problems using biologically inspired networks of interconnected neurons. Connectionist models learn by exploiting statistical relationships in their input data, potentially allowing them to discover the underlying rules for a problem. This ability to learn the rules, as opposed to learning via rote memorization, allows connectionist models to generalize their learnt behaviour to unseen exemplars. Connectionist models of language acquisition pose a direct challenge to traditional nativist perspectives based on Gold's theorem [6] because they attempt to learn language using only positive examples.

2. Connectionism and Determinacy

Since the early nineties, connectionist models such as the simple recurrent network (SRN) [8] have been applied to the language acquisition problem in the form of grammar induction. This involves learning simple approximations of natural language, such as regular and context-free grammars. These experiments have met with some success [6, 7], suggesting that dynamic recurrent networks (DRNs) can learn to emulate finite-state automata. However, detailed analysis of models trained on these tasks shows that a number of fundamental problems exist that may derive from using a model with a continuous state-space to approximate a discrete problem.

While DRNs are capable of learning simple formal languages, they are renowned for their instability when processing long sequences that were not part of their training set [8, 9]. As detailed by Kolen [10], a DRN is capable of partitioning its state space into regions approximating the states in a grammar. However, sensitivity to initial conditions means that each transition between regions of state space will result in a slightly different trajectory. This causes instability when traversing state trajectories that were not seen during training. This is because slight discrepancies in the trajectories will be compounded with each transition until they exceed the locus of the original attractor, resulting in a transition to an erroneous region of state space. Such behavior is characteristic of continuous state-space DRNs and can be seen as both a power and a weakness of this class of model. While this representational power enables the model to surpass deterministic finite automata and emulate non-deterministic systems, it proves to be a significant disadvantage when attempting to emulate the deterministic behavior fundamental to deterministic finite state automata (DFA).

Attempts have been made to produce discrete state-space DRNs by using a step-function for the hidden layer neurons [9]. However, while this technique eliminates the instability problem, the use of a non-differentiable function means that the weight-update algorithm's sigmoid function can only approximate the error signal. This weakens the power of the learning algorithm, which increases training times and may cause the model to learn an incorrect representation of the DFA.

The instability of DRNs when generalizing to long sequences that are beyond their training sets is a limitation that is probably endemic to most continuous state-space connectionist models. However, when finite-state extraction techniques [9] are applied to the weight space of a trained DRN, it has been shown that once extracted into symbolic form, the representations learnt by the DRN can perfectly emulate the original DFA, even beyond the training set. Thus, while discrete symbolic models may be unable to adequately model the learning process itself, they are better suited to representing the learnt DFA than the original continuous state-space connectionist model.

While supervised DRNs such as the SRN dominate the literature on connectionist temporal sequence processing, they are not the only class of recurrent network. Unsupervised models, typically based on the self-organizing map (SOM) [11], have also been used in certain areas of temporal sequence processing [12]. Due to their localist nature, many unsupervised models operate using a discrete state-space and are therefore not subject to the same kind of instabilities characteristic of supervised continuous state-space DRNs. The aim of this research is therefore to develop an unsupervised discrete state-space recurrent connectionist model that can induce the finite-state properties of language from a set of positive examples.

3. A Memory-Rule Based Theory of Linguistics

Many leading linguists, such as Pinker [13] and Marcus [14], have theorized that language acquisition, as well as other aspects of cognition, can be explained using a memory-rule based model. This theory proposes that cognition uses two separate mechanisms that work together to form memory. Such a dual-mechanism approach is supported by neuro-biological research, which suggests that human memory operates using a declarative fact-based system and a procedural skill-based system [15]. In this theory, rote memorization is used to learn individual exemplars, while a rule-based mechanism operates to override the original memorizations in order to produce behaviour specific to a category. This memory-rule theory of cognition is commonly explained in the context of the acquisition of the English past tense [13]. Accounting for children's over-regularizations during the process of learning regular and irregular verbs constitutes a well-known battlefield for competing linguistic theories. Both Pinker [13] and Marcus [14] propose that irregular verbs are learnt via rote-memorization, while regular verbs are produced by a rule. The evidence for this rule-based behaviour is cited as the over-regularization errors produced when children incorrectly apply the past tense rule to irregular verbs (e.g. runned instead of ran).

The model presented in this paper is a connectionist implementation of a memory-rule based system that extracts the finite-state properties of an input language from a set of positive example sequences. The model's bottom-up learning algorithm uses functional-equivalence theory [2] to construct discrete-symbolic representations of grammatical states (Figure 1).

4. STORM (Spatio Temporal Self-Organizing Recurrent Map)

STORM is a recurrent SOM [1] that acts as a temporal associative memory, initially producing a localist-based memorization of input sequences. The model's rule-based mechanism then exploits similarities between the future context of memorized sequences and the future context of input sequences. These similarities are used to construct functional relationships, which are equivalent to states in the grammar. The next two sections will detail the model's memorization and rule-based mechanisms separately.

4.1 STORM's Memorization Mechanism

STORM maintains much of the functionality of the original SOM [11], including the winning-neuron selection algorithm (Equation 1), weight-update algorithm (Equation 2) and neighbourhood function (Equation 3). The model's localist architecture is used to represent each element of the input sequence using a separate neuron. In this respect, STORM exploits the SOM's abilities as a vector quantization system rather than as a topological map. Equation 1 shows that for every input to the model (X), the neuron whose weight vector has the lowest distance measure from the input vector is selected as the winning neuron (Y). The symbol d denotes the distance between the winning neuron and the neuron in question. As shown in Figure 1, each input vector consists of the current input symbol and a context vector, representing the location of the previous winning neuron.

$y_j = \arg\min_j \left( d(\mathbf{x}, \mathbf{w}_j) \right) \qquad (1)$

The weight update algorithm (Equation 2) is then applied to bring the winning neuron's weight vector (W), along with the weight vectors of neighbouring neurons, closer to the input vector (X). The rate of weight change is controlled by the learning rate α, which is linearly decreased through training.

$W_y(t+1) = W_y(t) + \alpha \, h_{ij} \left( \mathbf{x}(t) - W_y(t) \right) \qquad (2)$

The symbol h in Equation 2 denotes the neighbourhood function (Equation 3). This standard Gaussian function is used to update the weights of neighbouring neurons in proportion to their distance from the winning neuron. This weight update function, in conjunction with the neighbourhood function, has the effect of mapping similar inputs to similar locations on the map and also minimizing weight sharing between similar inputs. The width of the kernel σ is linearly decreased through training.

$h_{ij} = \exp\!\left( -\frac{d^2}{2\sigma^2} \right) \qquad (3)$
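For illustration, Equations 1-3 can be combined into a single SOM update step roughly as follows (a minimal Python sketch; the map layout, array shapes and parameter names are assumptions of mine, not details taken from the paper):

```python
import numpy as np

def winning_neuron(x, W):
    """Equation 1: index of the neuron whose weight vector is closest to input x.
    W has shape (n_neurons, input_dim); x has shape (input_dim,)."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

def neighbourhood(grid, winner, sigma):
    """Equation 3: Gaussian kernel over map distance d to the winning neuron."""
    d = np.linalg.norm(grid - grid[winner], axis=1)
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def update_weights(x, W, winner, grid, alpha, sigma):
    """Equation 2: move the winner and its neighbours towards the input."""
    h = neighbourhood(grid, winner, sigma)        # one factor per neuron
    return W + alpha * h[:, None] * (x - W)

# Example: a 10 x 10 map with 17-dimensional inputs (7-bit symbol + 10-bit context).
rows, cols, dim = 10, 10, 17
rng = np.random.default_rng(0)
W = rng.random((rows * cols, dim))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
x = rng.random(dim)
winner = winning_neuron(x, W)
W = update_weights(x, W, winner, grid, alpha=0.1, sigma=5.0)
```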

The model uses an orthogonal input vector to represent the grammar's terminal symbols. Each of the seven terminal symbols is represented by setting the respective binary value to 1 and setting all the other values to 0 (Table 1).

Grammatical symbol    Orthogonal vector
B                     1 0 0 0 0 0 0
T                     0 1 0 0 0 0 0
P                     0 0 1 0 0 0 0
S                     0 0 0 1 0 0 0
X                     0 0 0 0 1 0 0
V                     0 0 0 0 0 1 0
E                     0 0 0 0 0 0 1

Table 1 - Orthogonal vector representations for input symbols
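In code, Table 1 amounts to a one-hot encoding of the seven terminal symbols (a brief sketch; the helper name is illustrative):

```python
import numpy as np

SYMBOLS = ['B', 'T', 'P', 'S', 'X', 'V', 'E']

def encode_symbol(symbol):
    """Return the 7-bit orthogonal (one-hot) vector for a terminal symbol."""
    vec = np.zeros(len(SYMBOLS))
    vec[SYMBOLS.index(symbol)] = 1.0
    return vec

print(encode_symbol('T'))   # [0. 1. 0. 0. 0. 0. 0.]
```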

Page 17: Research and Development in Intelligent Systems XXI: Proceedings of AI-2004, the Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial


Fig. 1 - Diagram showing conceptual overview of model. The left side shows STORM's representation of a FSM, while the right side of the diagram shows the FSM for the Reber grammar.

As shown in Figures 1 and 2, STORM extends Kohonen's SOM [11] into the temporal domain by using recurrent connections. The recurrency mechanism feeds back a representation of the previous winning neuron's location on the map using a 10-bit Gray-code vector. By separately representing the column and row of the previous winning neuron in the context vector, the recurrency mechanism creates a 2D representation of the neuron's location. Further details of the recurrency mechanism, along with its advantages, are provided in [1]. This method of explicitly representing the previous winner's location as part of the input vector has the effect of selecting the winning neuron based not just on the current input, but also indirectly on all previous inputs in the sequence. The advantage of this method of recurrency is that it is more efficient than alternative methods (e.g. [16]), because only information pertaining to the previous winning neuron's location is fed back. Secondly, the amount of information fed back isn't directly related to the size of the map (i.e. the recursive SOM [16] feeds back a representation of each neuron's activation). This allows the model to scale up to larger problems without exponentially increasing computational complexity.

Page 18: Research and Development in Intelligent Systems XXI: Proceedings of AI-2004, the Twenty-fourth SGAI International Conference on Innovative Techniques and Applications of Artificial


Fig. 2 - Diagram showing STORM's input representation. The model's weight vector consists of a 7-bit orthogonal symbol vector representing the terminal symbol in the grammar, along with a 10-bit Gray-code context vector, representing the column and row of the previous winning neuron.
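A sketch of how such a 17-bit input vector could be assembled (the 5-bits-per-coordinate split of the 10-bit Gray-code context and the helper names are assumptions on my part):

```python
SYMBOLS = ['B', 'T', 'P', 'S', 'X', 'V', 'E']

def one_hot(symbol):
    """7-bit orthogonal encoding of a terminal symbol (Table 1)."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]

def gray_code(n, bits=5):
    """Binary-reflected Gray code of integer n, most significant bit first."""
    g = n ^ (n >> 1)
    return [float((g >> i) & 1) for i in reversed(range(bits))]

def build_input(symbol, prev_row, prev_col):
    """17-element STORM input: one-hot symbol plus a 10-bit Gray-coded context
    giving the row and column of the previous winning neuron."""
    return one_hot(symbol) + gray_code(prev_row) + gray_code(prev_col)

print(build_input('T', 3, 7))
```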

4.2 STORM's Rule-Based Construction Mechanism

The model's location-based recurrency representation and localist architecture provide it with a very important ability. Unlike conventional artificial neural networks, the sequences learnt by STORM can be extracted in reverse order. This makes it possible to start with the last element in an input sequence and work backwards to find the winning neurons corresponding to the previous inputs in the sequence. STORM uses this ability, while processing input sequences, to find any existing pre-learnt sequences that end with the same elements as the current input sequence. For example, Figure 3 shows that the winning neuron for the symbol 'T' in sequence 1 has the same future context ('XSE') as the winning neuron for the first symbol 'S' in sequence 2.

Functional-equivalence theory [2] asserts that two states are equivalent if, for all future inputs, their outputs are identical. STORM uses the inverse of this theory to construct states in a bottom-up approach to grammar acquisition. By identifying neurons with consistently identical future inputs, the model's temporal Hebbian learning (THL) mechanism binds together potential states via lateral connections. By strengthening the lateral connections between neurons that have the same future context, this THL mechanism constructs functional relationships between the winning neuron for the current input and the winning neuron for a memorized input (referred to as the alternative winner) whose future context matches that of the current input sequence (Figure 4). In order to prevent lateral weight values from becoming too high, a negative THL value is applied every time a winning neuron is selected. This has the effect of controlling lateral weight growth and also breaking down old functional relationships that are no longer used.

1. B T X S E
2. B T S X S E

Fig. 3 - Diagram showing the memorized winning neurons for two sequences that end with the same sub-sequence 'XSE'

Once states have formed, they override the recurrency mechanism, forcing the model to use a single representation for the future inputs in the sequence rather than the original two representations (Figure 4). The advantage of forming states in this manner is that it provides the model with a powerful ability to generalize beyond its original memorizations. The model's THL mechanism conforms to the SOM's winner-take-all philosophy by selecting the alternative winner as the neuron whose future context is the best match to that of the current input sequence. Given that tracing back through the future context may identify multiple alternative winners, the criterion of best matching winner classifies the strongest sequence stored in the model as the winner. Furthermore, THL is only used to enhance the functional relationship between the winner and the alternative winner if the future context for the alternative winner is stronger than that of the winner itself. Thus, the model has a preference for always using the dominant sequence and it will use the THL mechanism to re-wire its internal pathways in order to use any dominant sequence.


Constructing the lateral connections between functionally-related neurons is equivalent to identifying states in a grammar. Once the strength of these lateral connections exceeds a certain threshold they override the standard recurrency mechanism, affecting the representation of the previous winning neuron that is fed back (Figure 4). Instead of feeding back a representation of the previous winning neuron, the lateral connections may force the model to feed back a representation of the functionally-related neuron. The consequence of this is that the rest of the sequence is processed as if the functionally-related neuron had been selected rather than the actual winner. For example, Figure 4 shows that when the first 'S' symbol in sequence 2 is presented to STORM, its winning neuron is functionally linked to the winner for the 'T' symbol from sequence 1. As the latter winning neuron is the dominant winner for this state, its location is fed back as context for the next symbol in sequence 2.

1. B T X S E
2. B T S X S E

Fig. 4 - Functional override in the winning-neuron selection algorithm. The functional relationship (shown in grey) between the third symbol 'S' in the second sequence and the second symbol 'T' in the first sequence forces the model to process the remaining elements in the second sequence (namely 'XSE') using the same winning neurons as for the first sequence.
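The temporal Hebbian learning step and the resulting functional override could be sketched as follows (my own reconstruction: the paper gives the positive and negative THL rates in Table 3, but the exact update rule and the override threshold here are assumptions):

```python
import numpy as np

N_NEURONS = 100                    # 10 x 10 map
POS_THL, NEG_THL = 0.5, 0.005      # learning rates from Table 3
STATE_THRESHOLD = 1.0              # assumed threshold for the override

lateral = np.zeros((N_NEURONS, N_NEURONS))   # lateral (functional) connections

def thl_update(lateral, winner, alt_winner=None):
    """Decay the winner's lateral weights (negative THL), and strengthen the
    link to the alternative winner whose future context matches (positive THL)."""
    lateral[winner, :] = np.maximum(lateral[winner, :] - NEG_THL, 0.0)
    if alt_winner is not None:
        lateral[winner, alt_winner] += POS_THL
    return lateral

def context_neuron(lateral, winner):
    """Neuron whose location is fed back as context: the functionally-related
    neuron if its lateral link is strong enough, otherwise the winner itself."""
    linked = int(np.argmax(lateral[winner]))
    return linked if lateral[winner, linked] > STATE_THRESHOLD else winner
```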

While a state is formed based on similarities in future context, there may be cases where the future context, for the respective input symbols that make up the state, is dissimilar (Table 2). However, once a state has been constructed, the future context in subsequent sequences containing that state will be processed in an identical manner, regardless of the future context itself. For example, when trained on the sequences in Table 2, the 'T' symbol from sequence 1 will form a state with the first 'S' symbol from sequence 2. This will result in both sequences 1 and 2 sharing the same winning neurons for their final three inputs (X S E). STORM will then be able to generalize this learnt state to its memorization of sequence 3, resulting in the same winning neurons being activated for the 'X X V V E' in test sequence 4 as in training sequence 3.


#    Training sequence
1    B T X S E
2    B T S X S E
3    B T X X V V E

Test sequence (4): B T S X X V V E

Table 2 - Generalization example. When trained on the first three sequences, STORM is able to construct a state between the 'T' in sequence 1 and the first 'S' in sequence 2. By generalizing this learnt state to its memorization of sequence 3, STORM is able to correctly process sequence 4 by activating the same winning neurons for the subsequence 'X X V V E' as would be activated in sequence 3.

5. Experiments

In order to quantify STORM's grammar induction abilities, the model was applied to the task of predicting the next symbols in a sequence from the Reber grammar (Figure 1). Similar prediction tasks have been used in [8] and [3] to test the SRN's grammar-induction abilities. The task involved presenting the model with symbols from a randomly generated sequence that was not encountered during training. The model then had to predict the next possible symbols in the sequence that could follow each symbol according to the rules of the grammar. STORM's predictions are made by utilizing the locational representational values used in its context vector. As further explained in [1], the winning neuron for an input is the neuron whose weight vector best matches both the input symbol and the context representation of the last winning neuron's location. STORM predicts the next symbol by finding the neuron whose context representation best matches that of the current winning neuron (i.e. the symbol part of the weight vector is ignored in the Euclidean distance calculation). This forces the model to find the neuron that is most likely to be the next winner. The symbol part of this neuron's weight vector provides the next predicted symbol itself. This process is then repeated to find the second-best matching winner and the corresponding second predicted next symbol.
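A sketch of that prediction step, assuming as above that each weight vector holds a 7-element symbol part followed by a 10-element context part (the function is illustrative, not the authors' code):

```python
import numpy as np

SYMBOLS = ['B', 'T', 'P', 'S', 'X', 'V', 'E']

def predict_next_symbols(W, winner_location_code, k=2):
    """Predict the k most likely next symbols.  W: (n_neurons, 17) weight matrix;
    winner_location_code: the 10-bit Gray-coded location of the current winner.
    Only the context part (columns 7-16) enters the distance calculation."""
    distances = np.linalg.norm(W[:, 7:] - winner_location_code, axis=1)
    best = np.argsort(distances)[:k]
    # The symbol part of each matched neuron gives the predicted symbol.
    return [SYMBOLS[int(np.argmax(W[j, :7]))] for j in best]
```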

In accordance with established training criteria for artificial neural network models [17], the experiments were conducted on randomly generated separate training and test sets (i.e. sequences were unique with respect to all other sequences in both sets). Such an approach ensures that the model's performance, assessed from the test set, is a true measure of its generalization abilities, because the test sequences were not encountered during training. The experiment was run ten times using models with randomly generated initial weights, in order to ensure that the starting state did not adversely influence the results.

The recursive depth parameter, as listed in Table 3, denotes the maximum number of sequential recursive traversals a sentence may contain (i.e. how many times it can go around the same loop). In order to ensure that the training and test sequences are representative of the specified recursive depth, the sets are divided equally between sequences of each recursive depth (i.e. a set of six sequences with a recursive depth (RD) of 2 will contain two sequences with an RD of 0, two sequences with an RD of 1 and two sequences with an RD of 2).

Parameter                                               Value
Number of epochs                                        1000
Learning rate α (linearly decreasing)                   0.1
Initial neighbourhood σ (linearly decreasing)           5
Positive / negative temporal Hebbian learning rate      0.5 / 0.005
Number of training sequences                            21
Number of test sequences                                7
Maximum recursive depth (RD) of sequences               6
Model size                                              10 x 10

Table 3 - Experimental parameters for the first experiment
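For concreteness, Reber-grammar sequences of bounded recursive depth can be generated from the finite-state machine of Figure 1 along these lines (a sketch; counting loop transitions as "recursive traversals" is my reading of the text, not a detail given in the paper):

```python
import random

# Transition table for the Reber grammar of Figure 1: state -> [(symbol, next_state)]
REBER = {
    0: [('B', 1)],
    1: [('T', 2), ('P', 3)],
    2: [('S', 2), ('X', 4)],
    3: [('T', 3), ('V', 5)],
    4: [('X', 3), ('S', 6)],
    5: [('P', 4), ('V', 6)],
    6: [('E', None)],
}
LOOPS = {(2, 2), (3, 3), (4, 3), (5, 4)}   # transitions that re-enter a loop

def generate_sequence(max_depth):
    """Generate one Reber string whose number of loop traversals is at most
    max_depth, by simple rejection sampling."""
    while True:
        state, seq, depth = 0, [], 0
        while state is not None:
            symbol, nxt = random.choice(REBER[state])
            seq.append(symbol)
            if (state, nxt) in LOOPS:
                depth += 1
            state = nxt
        if depth <= max_depth:
            return ''.join(seq)

print(generate_sequence(2))   # e.g. 'BTXSE' or 'BTSXSE'
```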

As shown in Figure 5, six models learnt the grammar with over 89% accuracy during training and three of them became perfect grammar recognizers. However, this number fell by the end of training, with only two perfect models and an additional two models with over 90% performance accuracy. This equates to an average post-training performance of 71%. While less than half the models successfully learnt the grammar, it is worth noting that this is significantly better than for SRNs, where Sharkey [18] showed that only two out of 90 SRNs became finite-state grammar recognisers in a similar experiment using the Reber grammar.

One of the proposed advantages of a discrete state-space model (discussed in Section 2) is its ability to generalize to sequences longer than those encountered during training without the instabilities characteristic of standard DRN models. In order to test this proposition, a perfect finite-state recognizer (i.e. a model that scored 100% prediction accuracy) from the first experiment (Figure 5) was tested on a further three test sets. These sets contained sequences with recursive depths of 8, 10 and 12 and should constitute a much harder problem for any model trained only on sequences with a recursive depth of 6. The models that achieved 100% performance accuracy in the original experiments also achieved 100% accuracy on the test sets with higher recursive depths. This shows that these models act as perfect grammar recognizers that are capable of generalizing to sequences of potentially any length.

[Figure 5: bar chart of prediction accuracy (%) against test number (1-10), showing for each model the highest prediction accuracy during training and the prediction accuracy after training.]

Fig. 5 - Results from ten models trained on randomly generated separate training and test sets.

6. Conclusions and Future Work

We have presented a novel connectionist memory-rule based model capable of inducing the finite-state properties of an input language from a set of positive example sequences. In contrast with the majority of supervised connectionist models in the literature, STORM is based on an unsupervised recurrent SOM [1] and operates using a discrete state-space.

The model has been successfully applied to the task of learning the Reber grammar by predicting the next symbols in a set of randomly generated sequences. The experiments have shown that over half the models trained are capable of learning a good approximation of the grammar (over 89%) during the training process. However, by the end of training, only a fifth of the models were capable of operating as perfect grammar recognizers. This suggests that the model is unstable and that partial or optimal solutions reached during training may be lost by the end of the training process. Despite this instability, a comparison between STORM and the SRN, when applied to a similar problem [3], shows that STORM is capable of learning the grammar perfectly much more often than its counterpart. Furthermore, experiments show that STORM's discrete state-space allows it to generalize its grammar recognition abilities to sequences far beyond the length of those encountered in the training set, without the instabilities experienced in continuous state-space DRNs.

Future work will initially involve analyzing the model to find where it fails. Once the model's abilities have been fully explored, its stability will be improved to increase the number of models that successfully become perfect grammar recognizers. STORM will then be enhanced to allow it to process more advanced grammars. Given that regular grammars are insufficient for representing natural language [19], the model must be extended to learn at least context-free languages if it is to be applied to real-world problems. However, despite such future requirements, STORM's current ability to explicitly learn the rules of a regular grammar distinguishes its potential as a language acquisition model.

References

1. McQueen, T. & Hopgood, A. & Tepper, J. & Allen, T. A Recurrent Self-Organizing Map for Temporal Sequence Processing. In: Proceedings of the 4th International Conference on Recent Advances in Soft Computing (RASC2002), Nottingham, 2002

2. Hopcroft J. & Ullman J. Introduction to Automata Theory, Languages and Computation, vol 1, Addison-Wesley, 1979

3. Cleeremans A, Servan-Schreiber D, McClelland J. Finite State Automata and Simple Recurrent Networks. In: Neural Computation 1989; Vol 1, pp 372-381

4. Collier R. An historical overview of natural language processing systems that learn. Artificial Intelligence Review 1994; 8(1)

5. Chomsky, N. Aspects of the Theory of Syntax. MIT Press, 1965

6. Gold, EM. Language Identification in the Limit. Information and Control 1967; 10:447-474

7. Horning, J.J. A study of grammatical inference. PhD thesis, Stanford University, California, 1969

8. Elman, J.L. Finding Structure in Time. Cognitive Science 1990; 14:179-211

9. Omlin, C. Understanding and Explaining DRN Behaviour. In: Kolen, J. and Kremer, S. (eds) A Field Guide to Dynamical Recurrent Networks. IEEE Press, New York, 2001, pp 207-227


10. Kolen, J. Fool's Gold: Extracting Finite State Machines From Recurrent Network Dynamics. In: Cowan J, Tesauro G and Alspector J (eds) Advances in Neural Information Processing Systems 6. Morgan Kaufmann, San Francisco CA, 1994, pp 501-508

11. Kohonen T. Self-Organizing Maps, vol 1. Springer-Verlag, Germany, 1995

12. Barreto, G. and Araújo, A. Time in Self-Organizing Maps: An Overview of Models. International Journal of Computer Research: Special Edition on Neural Networks: Past, Present and Future 2001; 10(2):139-179

13. Pinker, S. Words and Rules. Phoenix, London, 2000

14. Marcus, G. F. Children's Overregularization and Its Implications for Cognition. In: P. Broeder and J. Murre (eds) Models of Language Acquisition: Inductive and Deductive approaches. Oxford University Press, Oxford, 2000, pp 154-176

15. Cohen, N.J. and Squire, L.R. Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that. Science 1980; 210:207-210

16. Voegtlin, T. Recursive Self-Organizing Maps. Neural Networks 2002; 15(8-9):979-991

17. Hopgood, A. A. Intelligent Systems for Engineers and Scientists, 2nd edition, CRC Press LLC, Florida, 2001, pp 195-222

18. Sharkey N, Sharkey A, Jackson S. Are SRNs sufficient for modelling language acquisition?. In: Broeder P, Murre J. (eds) Models of Language Acquisition: Inductive and Deductive Approaches. Oxford University Press, Oxford, 2000, pp 33-54

19. Lawrence S, Giles C, Fong S. Natural Language Grammatical Inference with Recurrent Neural Networks. IEEE Transactions on Knowledge and Data Engineering 2000; 12(1): 126-140


SESSION 1a: AI TECHNIQUES I


Modelling Shared Extended Mind and Collective Representational Content

Tibor Bosse¹, Catholijn M. Jonker¹, Martijn C. Schut¹ and Jan Treur¹,²

¹Vrije Universiteit Amsterdam, Department of Artificial Intelligence

{tbosse, jonker, schut, treur}@cs.vu.nl http://www.cs.vu.nl/~{tbosse, jonker, schut, treur}

²Universiteit Utrecht, Department of Philosophy

Abstract

Some types of animals exploit the external environment to support their cognitive processes, in the sense of patterns created in the environment that function as external mental states and serve as an extension to their mind. In the case of social animals the creation and exploitation of such patterns can be shared, thus obtaining a form of shared mind or collective intelligence. This paper explores this shared extended mind principle for social animals in more detail. The focus is on the notion of representational content in such cases. Proposals are put forward and formalised to define collective representational content for such shared external mental states. A case study in social ant behaviour in which shared extended mind plays an important role is used as illustration. For this case simulations are described, representation relations are specified and are verified against the simulated traces.

1. Introduction

Behaviour is often not only supported by internal mental structures and cognitive processes, but also by processes based on patterns created in the external environment that serve as external mental structures; cf. [5, 6, 7 & 8]. Examples of this pattern of behaviour are the use of 'to do lists' and 'lists of desiderata'. Having written these down externally (e.g., on paper, in your diary, in your organizer or computer) makes it unnecessary to have an internal memory about all the items. Thus internal mental processing can be kept less complex. Other examples of the use of extended mind are doing mathematics or arithmetic, where external (symbolic, graphical, material) representations are used; e.g., [4 & 12]. In [16] a collection of papers can be found based on presentations at the conference 'The Extended Mind: The Very Idea' that took place in 2001. Clark [6] points at the roles played by both internal and external representations in describing cognitive processes: 'Internal representations will, almost certainly, feature in this story. But so will external representations, ...' [6, p. 134]. From another, developmental angle, Griffiths and Stotz [9] also endorse the importance of using both internal and external representations; they speak of 'a larger representational environment which extends beyond the skin', and claim that 'culture makes humans as much as the reverse' [9, p. 45].

Allowing mental states, which are in the external world and thus accessible for any agent around, opens the possibility that other agents also start to use them. Indeed, not only in the individual, single-agent case, but also in the social, multi-agent case the extended mind principle can be observed, e.g., one individual creating a pattern in the environment, and one or more other individuals taking this pattern into account in their behaviour. For the human case, examples can be found everywhere, varying from roads and traffic signs to books or other media, and to many other kinds of cultural achievements. Also in [17] it is claimed that part of the total team knowledge in distributed tasks (such as air traffic control) comprises external memory in the form of artefacts. In this multi-agent case the extended mind principle serves as a way to build a form of social or collective intelligence that goes beyond (and may even not require) social intelligence based on direct one-to-one communication.

Especially in the case of social animals, external mental states created by one individual can be exploited by another individual, or, more generally, the creation and maintenance, as well as the exploitation, of external mental states can be activities in which a number of individuals participate. For example, presenting slides on a paper with multiple authors to an audience. In such cases the external mental states cross, and in a sense break up, the borders between the individuals and become shared extended mental states. An interesting and currently often studied example of collective intelligence is the intelligence shown by an ant colony [2]. Indeed, in this case the external world is exploited as an extended mind by using pheromones. While they walk, ants drop pheromones on the ground. The same or other ants sense these pheromones and follow the route in the direction of the strongest sensing. Pheromones are not persistent for long times; therefore such routes can vary over time.
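As a toy illustration of this pheromone-based shared extended mind (this is not the simulation model of [3]; the evaporation and deposit constants are invented):

```python
import random

EVAPORATION = 0.1    # fraction of pheromone lost per step (assumed)
DEPOSIT = 1.0        # amount dropped by an ant on the route it takes (assumed)

def step(pheromone, n_ants):
    """One step: each ant picks a route in proportion to pheromone strength
    and reinforces it; afterwards all pheromone partially evaporates."""
    routes = list(pheromone)
    for _ in range(n_ants):
        weights = [pheromone[r] + 0.01 for r in routes]   # small floor keeps unused routes reachable
        chosen = random.choices(routes, weights=weights)[0]
        pheromone[chosen] += DEPOSIT          # external state created by one ant ...
    for r in routes:                          # ... and decaying over time
        pheromone[r] *= (1.0 - EVAPORATION)
    return pheromone

pheromone = {'route_A': 0.0, 'route_B': 0.0}
for _ in range(50):
    pheromone = step(pheromone, n_ants=10)
print(pheromone)   # one route typically dominates: a collective choice via shared external state
```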

In [3] the shared extended mind principle is worked out in more detail. The paper focusses on formal analysis and formalisation of the dynamic properties of the processes involved, both at the local level (the basic mechanisms) and the global level (the emerging properties of the whole), and their relationships. A case study in social ant behaviour in which shared extended mind plays an important role is used as illustration.

In the current paper, as an extension to [3], the notion of representational content is analysed for mental processes based on the shared extended mind principle. The analysis of notions of representational content of internal mental state properties is well-known in the literature on Cognitive Science and Philosophy of Mind. In this literature a relevant internal mental state property m is taken and a representation relation is identified that indicates in which way m relates to properties in the external world or the agent's interaction with the external world; cf. [1, 10 & 15, pp. 184-210]. For the case of extended mind an extension of the analysis of notions of representational content to external state properties is needed. Moreover, for the case of external mental state properties that are shared, a notion of collective representational content is needed (in contrast to a notion of representational content for a single agent).


Thus, by addressing the ants example and its modelling from an extended mind perspective, a number of challenging new issues on cognitive modelling and representational content are encountered:

• How to define representational content for an external mental state property

• How to handle decay of a mental state property

• How can joint creation of a shared mental state property be modelled

• What is an appropriate notion of collective representational content of a shared external mental state property

• How can representational content be defined in a case where a behavioural choice depends on a number of mental state properties

In this paper these questions are addressed. To this end the shared extended mind principle is analysed in more detail, and a formalisation is provided of its dynamics. It is discussed in particular how a notion of collective representational content for a shared external mental state property can be formulated. In the literature notions of representational content are usually restricted to internal mental states of one individual. The notion of collective representational content developed here extends this in two manners: (1) for external instead of internal mental states, and (2) for groups of individuals instead of single individuals. It is reported how in a case study of social behaviour based on shared extended mind (a simple ant colony) the proposals put forward have been evaluated. The analysis of this case study comprises multi-agent simulation based on identified local dynamic properties, identification of dynamic properties that describe collective representational content of shared extended mind states, and verification of these dynamic properties.

2. State Properties and Dynamic Properties

Dynamics will be described in the next section as evolution of states over time. The notion of state as used here is characterised on the basis of an ontology defining a set of physical and/or mental (state) properties that do or do not hold at a certain point in time. For example, the internal state property 'the agent A has pain', or the external world state property 'the environmental temperature is 7°C', may be expressed in terms of different ontologies. To formalise state property descriptions, an ontology is specified as a finite set of sorts, constants within these sorts, and relations and functions over these sorts. The example properties mentioned above can then be defined by nullary predicates (or proposition symbols) such as pain, or by using n-ary predicates (with n ≥ 1) like has_temperature(environment, 7). For a given ontology Ont, the propositional language signature consisting of all state ground atoms (or atomic state properties) based on Ont is denoted by APROP(Ont). The state properties based on a certain ontology Ont are formalised by the propositions that can be made (using conjunction, negation, disjunction, implication) from the ground atoms. A state s is an indication of which atomic state properties are true and which are false, i.e., a mapping s: APROP(Ont) → {true, false}.


To describe the internal and external dynamics of the agent, explicit reference is made to time. Dynamic properties can be formulated that relate a state at one point in time to a state at another point in time. A simple example is the following dynamic property specification for belief creation based on observation:

'at any point in time tl if the agent observes at tl that it is raining, then there exists a point in time t2 after tl such that at t2 the agent believes that it is raining'.

To express such dynamic properties, and other, more sophisticated ones, the temporal trace language TTL is used; cf. [11]. To express dynamic properties in a precise manner a language is used in which explicit references can be made to time points and traces. Here a trace or trajectory over an ontology Ont is a time-indexed sequence of states over Ont. The sorted predicate logic temporal trace language TTL is built on atoms referring to, e.g., traces, time and state properties. For example, 'in the output state of A in trace γ at time t property p holds' is formalised by state(γ, t, output(A)) |= p. Here |= is a predicate symbol in the language, usually used in infix notation, which is comparable to the Holds-predicate in situation calculus. Dynamic properties are expressed by temporal statements built using the usual logical connectives and quantification (for example, over traces, time and state properties). For example the following dynamic property is expressed:

'in any trace y, if at any point in time tl the agent A observes that it is raining, then there exists a point in time t2 after tl such that at t2 in the trace the agent A believes that it is raining'.

In formalised form:

∀t1 [ state(γ, t1, input(A)) |= agent_observes_itsraining ⇒ ∃t2 > t1 state(γ, t2, internal(A)) |= belief_itsraining ]

Language abstractions by introducing new (definable) predicates for complex expressions are possible and supported.

A simpler temporal language has been used to specify simulation models. This language (the leads to language) offers the possibility to model direct temporal dependencies between two state properties in successive states. This executable format is defined as follows. Let α and β be state properties of the form 'conjunction of atoms or negations of atoms', and e, f, g, h non-negative real numbers. In the leads to language α →→e, f, g, h β means:

If state property α holds for a certain time interval with duration g, then after some delay (between e and f) state property β will hold for a certain time interval of length h.

For a precise definition of the leads to format in terms of the language TTL, see [14]. A specification of dynamic properties in leads to format has as advantages that it is executable and that it can often easily be depicted graphically.
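To illustrate the executable reading of the leads to format, the following simplified sketch assumes discrete time steps, a single fixed delay d with e ≤ d ≤ f, and states represented as sets of true atoms; it is an illustration only, not the simulation environment of [14].

# Simplified sketch of executing a "leads to" rule alpha ->>_{e,f,g,h} beta
# over discrete time, assuming one fixed delay d between the end of the
# alpha-interval and the start of the beta-interval.
from typing import Callable, List, Set

State = Set[str]          # a state = the set of atoms that are true
Trace = List[State]       # trace[t] is the state at time t

def apply_leads_to(trace: Trace, alpha: Callable[[State], bool],
                   beta_atom: str, g: int, d: int, h: int) -> Trace:
    """If alpha held throughout an interval of length g ending at t,
    make beta_atom true for h steps starting at t + d."""
    new_trace = [set(s) for s in trace]
    for t in range(g, len(trace)):
        if all(alpha(trace[k]) for k in range(t - g, t)):
            for k in range(t + d, min(t + d + h, len(trace))):
                new_trace[k].add(beta_atom)
    return new_trace

# Example: an observation of rain leads to a belief that it is raining.
trace: Trace = [{"agent_observes_itsraining"} if t < 3 else set() for t in range(10)]
trace = apply_leads_to(trace, lambda s: "agent_observes_itsraining" in s,
                       "belief_itsraining", g=2, d=1, h=3)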

3. Representation for Shared Extended Mind

Originally, the different types of approaches to representational content that have been put forward in the literature on Cognitive Science and Philosophy of Mind, [1, 13 & 15, pp. 191-193, 200-202] are all applicable to internal (mental) states. They have in common that the occurrence of the internal (mental) state property m at a specific point in time is related (by a representation relation) to the occurrence of other state properties, at the same or at different time points. For the temporal-interactivist approach [1 & 13] a representation relation relates the occurrence of an internal state property to sets of past and future interaction traces. The relational specification approach to representational content is based on a specification of how a representation relation relates the occurrence of an internal state property to properties of states distant in space and time; cf. [15, pp. 200-202]. As mentioned in the Introduction, one of the goals of this paper is to apply these approaches to shared extended mental states instead of internal mental states.

Suppose p is an external state property used by a collection of agents in their shared extended mind, for example, as an external belief. At a certain point in time this mental state property is created by performing an action a (or maybe a collection of actions) by one or more agents to bring about p in the external world. Given the thus created occurrence of p, at a later point in time any agent can observe p and take this mental state property into account in determining its behaviour. For a representation relation, which indicates representational content for such a mental state property p, two possibilities are considered: (1) a representation relation relating the occurrence of p to one or more events in the past (backward), or (2) a representation relation relating the occurrence of p to behaviour in the future (forward). Moreover, for each category, the representation relation can be described by referring to external world state properties, independent of the agent (using the relational specification approach), or referring to interaction state properties (e.g., observing, initiating actions) for the agent (using the temporal-interactivist approach). In this paper only the relational specification approach is addressed. This approach is applied both backward and forward. For reasons of presentation, first in the upcoming section the (qualitative) case is considered that p is the result of the action of one agent, e.g., the presence of pheromone. Next, the (quantitative) case that p is the result of actions of multiple agents is considered. Here p has a certain degree or level, e.g., a certain accumulated level of pheromone; in decisions levels for a number of such state properties p are taken into account. For the ants case study, the world in which the ants live is described by a labeled graph as depicted in Figure 1. Locations are indicated by A, B, ..., and edges by e1, e2, ... To represent such a graph the predicate connected_to_via(l0, l1, e) is used. The ants move from location to location via edges; while passing an edge, pheromones are dropped.
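As a small illustration of this representation (the concrete locations and edges below are assumptions for the example, not a transcription of Figure 1), the connected_to_via facts can be held as a simple relation:

# Illustrative encoding of a labeled graph via connected_to_via(l0, l1, e) facts.
# The concrete locations and edges here are assumptions, not the paper's Figure 1.
connected_to_via = [
    ("A", "B", "e1"),
    ("B", "C", "e2"),
    ("A", "G", "e6"),
    ("G", "H", "e7"),
]

def edges_from(location: str):
    """All (edge, neighbour) pairs reachable from a location (undirected reading)."""
    for l0, l1, e in connected_to_via:
        if l0 == location:
            yield e, l1
        elif l1 == location:
            yield e, l0

print(list(edges_from("A")))   # [('e1', 'B'), ('e6', 'G')]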

Figure 1 An ants world


3.1 The Qualitative Case

In this section representational content is addressed for the qualitative case. This means that an external state property p is the result of the action of one agent, e.g., the presence of pheromone.

Looking Backward

Looking backward, for the qualitative case the preceding state is the action a by an arbitrary agent, to bring about p. This action a is not an external state property but an interaction state property of this agent. However, this action was performed due to certain circumstances in the world that made the agent do the action. So, the chain of processes can be followed further back to the agent's internal state properties. Still further back it can be followed to the agent's observations that in the past formed the basis of these internal state properties. As these observations concern observations of certain state properties of the external world, we finally arrive at other external world state properties. These external world state properties will be used for the representation relation (conforming to the relational specification approach). It may be clear that if complex internal processes come in between, such a representation relation can become complicated. However, if the complexity of the agent's internal processes is kept relatively simple (as is one of the claims accompanying the extended mind principle), this results in a feasible approach.

For the relational specification approach a representation relation can be specified by temporal relationships between the presence of the pheromone (at a certain edge), and other state properties in the past or future. Although the relational specification approach as such does not explicitly exclude the use of state properties related to input and output of the agent, in our approach below the state properties will be limited to external world state properties. As the mental state property itself also is an external world state property, this implies that temporal relationships are provided only between external world state properties.

The pheromone being present at edge e is temporally related to the existence of a state at some time point in the past, namely an agent's presence at e:

If at some time point in the past an agent was present at e, then after that time point the pheromone was present at edge e.

If the pheromone is present at edge e, then at some time point in the past an agent was present at e.

Note here that the sharing of the external mental state property is expressed by using explicit agent names in the language and quantification over (multiple) agents. In the usual single agent case of a representation relation, no explicit reference to the agent itself is made. A formalisation is as follows:

∀t1 ∀l ∀e ∀a [ state(γ, t1) |= is_at_edge_from(a, e, l) ⇒ ∃t2>t1 state(γ, t2) |= pheromone_at(e) ]

∀t2 ∀l ∀e [ state(γ, t2) |= pheromone_at(e) ⇒ ∃a, t1<t2 state(γ, t1) |= is_at_edge_from(a, e, l) ]


Looking Forward

Looking forward, in general the first step is to relate the extended mind state property p to the observation of it by an agent (under certain circumstances c). However, to reach external state properties again, the chain of processes can be followed further through this agent's internal processes to the agent's actions and their effects on the external world.

For the example, the effect of an agent's action based on its observation of the pheromone is that it heads for the direction of the pheromone. So, the representation relation relates the occurrence of the pheromone (at edge e) to the conditional (with the condition that the agent is at the location) fact that the agent heads for the direction of e. The pheromone being present at edge e is temporally related to a conditional statement about the future, namely if an agent later arrives at the location, coming from any direction e', then he will head for direction e:

If the pheromone is present at edge e1, then if at some time point in the future an agent arrives at a location involving e1, coming from any direction e2 ≠ e1, then the next direction he will choose is e1.

If a time point t1 exists such that at t1 an agent arrives at a location involving e1, coming from any direction e2 ≠ e1, and if at any time point t2 > t1 an agent arrives at a location involving e1 coming from any direction e2 ≠ e1, then the next direction he will choose is e1, then at t1 the pheromone is present at edge e1.

A formalisation is as follows:

∀t1 ∀l ∀e1 [ state(γ, t1) |= pheromone_at(e1) ⇒
  ∀t2>t1 ∀e2, a [ e2 ≠ e1 & state(γ, t2) |= is_at_location_from(a, l, e2) ⇒
    ∃t3>t2 state(γ, t3) |= is_at_edge_from(a, e1, l) & [∀t4 t2<t4<t3 ⇒ is_at_location_from(a, l, e2)] ] ]

∀t1 ∀l ∀e1 [ ∃a, e2 [ e2 ≠ e1 & state(γ, t1) |= is_at_location_from(a, l, e2) ] &
  [∀t2>t1 ∀a, e2 [ e2 ≠ e1 & state(γ, t2) |= is_at_location_from(a, l, e2) ⇒
    ∃t3>t2 state(γ, t3) |= is_at_edge_from(a, e1, l) & [∀t4 t2<t4<t3 ⇒ is_at_location_from(a, l, e2)] ] ]
  ⇒ state(γ, t1) |= pheromone_at(e1) ]

3.2 The Quantitative Case

The quantitative, accumulating case allows us to consider certain levels of a mental state property p; in this case a mental state property is involved that is parameterised by a number: it has the form p(r), where r is a number, denoting that p has level r. This differs from the above in that it is now possible to model: (1) joint creation of p: multiple agents together bring about a certain level of p, each contributing a part of the level, (2) by decay, levels may decrease over time, (3) behaviour may be based on a number of state properties with different levels, taking into account their relative values, e.g., by determining the highest level of them. For the ants example, for each choice point multiple directions are possible, each with a different pheromone level; the choice is made for the direction with the highest pheromone level (ignoring the direction the ant just came from).


Looking Backward

To address the backward quantitative case (i.e., the case of joint creation of a mental state property), the representation relation is analogous to the one described above, but now involves not the presence of one agent at one past time point, but a summation over multiple agents at different time points. Moreover a decay rate r with 0 < r < 1 is used to indicate that after each time unit only a fraction r is left. Thus for the ants example in mathematical terms the following property is expressed:

There is an amount v of pheromone at edge e if and only if there is a history such that at time point 0 there was ph(0, e) pheromone at e, and for each time point k from 0 to t a number dr(k, e) of ants dropped pheromone, and v = ph(0, e) · r^t + Σ_{k=0..t} dr(t-k, e) · r^k

A formalisation of this property in the logical language TTL is as follows:

∀t ∀e ∀v [ state(γ, t) |= pheromones_at(e, v) ⇔ Σ_{k=0..t} Σ_a case(state(γ, k) |= is_at_edge(a, e), 1, 0) · r^(t-k) = v ]

Here for any formula f, the expression case(f, v1, v2) indicates the value v1 if f is true, and v2 otherwise.
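For concreteness, the accumulated level defined by this summation can be computed directly from a history of ant positions; the sketch below is an illustration mirroring the formula (the per-time-point set representation and the drop amount are assumptions), not the authors' TTL checker:

# Sketch: accumulated pheromone level at edge e at time t with decay rate r,
# mirroring v = ph(0, e) * r**t + sum_{k=0..t} dr(t-k, e) * r**k.
from typing import List, Set

def pheromone_level(ants_at_e: List[Set[str]], t: int, r: float,
                    initial: float = 0.0, drop: float = 1.0) -> float:
    """ants_at_e[k] = set of ants on edge e at time k; each drops `drop` pheromone."""
    level = initial * r ** t
    for k in range(t + 1):
        dr_k = drop * len(ants_at_e[k])        # amount dropped at time k
        level += dr_k * r ** (t - k)           # decayed for the remaining t-k steps
    return level

history = [{"ant1"}, {"ant1", "ant2"}, set(), {"ant3"}]
print(pheromone_level(history, t=3, r=0.9))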

Looking Forward

The forward quantitative case involves a behavioural choice that depends on the relative levels of multiple mental state properties. This means that at each choice point the representational content of the level of one mental state property is not independent of the level of the other mental state properties involved at the same choice point. Therefore it is only possible to provide representational content for the combined mental state property involving all mental state properties involved in the behavioural choice. Thus for the ants example the following property is specified:

If at time t1 the amount of pheromone at edge e1 is maximal with respect to the amount of pheromone at all other edges connected to that location, except the edge that brought the ant to the location, then, if an ant is at that location l at time t1, the next direction the ant will choose at some time t2 > t1 is e1.

If at time t1 an ant is at location l and for every ant arriving at that location l at time t1 the next direction it will choose at some time t2 > t1 is e1, then the amount of pheromone at edge e1 is maximal with respect to the amount of pheromone at all other edges connected to that location l, except the edge that brought the ant to the location.

A formalisation of this property in TTL is as follows:

∀t1, l, l1, e1, e2, i1 [ e1 ≠ e2 & state(γ, t1) |= connected_to_via(l, l1, e1) & state(γ, t1) |= pheromones_at(e1, i1) &
  [∀l2 ≠ l1, e3 ≠ e2 [ state(γ, t1) |= connected_to_via(l, l2, e3) ⇒ ∃i2 [ 0<i2<i1 & state(γ, t1) |= pheromones_at(e3, i2) ] ] ]
  ⇒ ∀a [ state(γ, t1) |= is_at_location_from(a, l, e2) ⇒
    ∃t2>t1 state(γ, t2) |= is_at_edge_from(a, e1, l) & [∀t3 t1<t3<t2 ⇒ is_at_location_from(a, l, e2) ] ] ] ]


∀t1, l, l1, e1, e2 [ e1 ≠ e2 & state(γ, t1) |= connected_to_via(l, l1, e1) &
  ∃a state(γ, t1) |= is_at_location_from(a, l, e2) &
  ∀a [ state(γ, t1) |= is_at_location_from(a, l, e2) ⇒
    ∃t2>t1 state(γ, t2) |= is_at_edge_from(a, e1, l) & [∀t3 t1<t3<t2 ⇒ is_at_location_from(a, l, e2) ] ]
  ⇒ ∃i1 [ state(γ, t1) |= pheromones_at(e1, i1) &
    [∀l2 ≠ l1, e3 ≠ e2 [ state(γ, t1) |= connected_to_via(l, l2, e3) ⇒ ∃i2 [ 0<i2<i1 & state(γ, t1) |= pheromones_at(e3, i2) ] ] ] ] ]

4. A Simulation Model of Shared Extended Mind

In [3] a simulation model of an ant society is specified in which shared extended mind plays an important role. This model is based on local dynamic properties, expressing the basic mechanisms of the process. In this section, a selection of these local properties is presented, and a resulting simulation trace is shown. In the next section it will be explained how the representation relations specified earlier can be verified against such simulation traces. Here a is a variable that stands for an ant, l for a location, e for an edge, and i for a pheromone level.

LP5 (Selection of Edge)

This property models (part of) the edge selection mechanism of the ants. It expresses that, when an ant observes that it is at location l, and there are two edges connected to that location, then the ant goes to the edge with the highest amount of pheromones. Formalisation:

observes(a, is_at_location_from(l, e0)) and neighbours(l, 3) and connected_to_via(l, l1, e1) and observes(a, pheromones_at(e1, i1)) and connected_to_via(l, l2, e2) and observes(a, pheromones_at(e2, i2)) and e0 ≠ e1 and e0 ≠ e2 and e1 ≠ e2 and i1 > i2 →→ to_be_performed(a, go_to_edge_from_to(e1, l1))

LP9 (Dropping of Pheromones)

This property expresses that, if an ant observes that it is at an edge e from a location l to a location l1, then it will drop pheromones at this edge e. Formalisation:

observes(a, is_at_edge_from_to(e, l, l1)) →→ to_be_performed(a, drop_pheromones_at_edge_from(e, l))

LP13 (Increment of Pheromones)

This property models (part of) the increment of the number of pheromones at an edge as a result of ants dropping pheromones. It expresses that, if an ant drops pheromones at edge e, and no other ants drop pheromones at this edge, then the new number of pheromones at e becomes i*decay+incr. Here, i is the old number of pheromones, decay is the decay factor, and incr is the amount of pheromones dropped. Formalisation:

to_be_performed(a1, drop_pheromones_at_edge_from(e, l1)) and ∀l2 not to_be_performed(a2, drop_pheromones_at_edge_from(e, l2)) and ∀l3 not to_be_performed(a3, drop_pheromones_at_edge_from(e, l3)) and a1 ≠ a2 and a1 ≠ a3 and a2 ≠ a3 and pheromones_at(e, i) →→ pheromones_at(e, i*decay+incr)


LP14 (Collecting of Food)

This property expresses that, if an ant observes that it is at location F (the food source), then it will pick up some food. Formalisation:

observes(a, is_at_location_from(l, e)) and food_location(l) →→ to_be_performed(a, pick_up_food)

LP18 (Decay of Pheromones)

This property expresses that, if the old amount of pheromones at an edge is i, and there is no ant dropping any pheromones at this edge, then the new amount of pheromones at e will be i*decay. Formalisation:

pheromones_at(e, i) and ∀a,l not to_be_performed(a, drop_pheromones_at_edge_from(e, l)) →→ pheromones_at(e, i*decay)
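As an illustrative sketch of how LP13 and LP18 together update pheromone levels in a single simulation step (a simplified re-implementation that generalises LP13 to any number of dropping ants; it is not the leads to execution environment itself):

# Simplified one-step update of pheromone levels: an edge with n dropping ants
# gets i*decay + n*incr (cf. LP13), an edge with no dropping ants gets i*decay (cf. LP18).
from typing import Dict

def update_pheromones(levels: Dict[str, float], drops: Dict[str, int],
                      decay: float, incr: float) -> Dict[str, float]:
    """levels[e] = current pheromone at edge e; drops[e] = number of ants dropping at e."""
    return {e: i * decay + drops.get(e, 0) * incr for e, i in levels.items()}

levels = {"e1": 10.0, "e2": 0.0}
levels = update_pheromones(levels, {"e1": 2}, decay=0.9, incr=5.0)  # e1 -> 19.0, e2 -> 0.0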

A special software environment has been created to enable the simulation of executable models. Based on an input consisting of dynamic properties in leads to format, the software environment generates simulation traces. An example of such a trace can be seen in Figure 2. Time is on the horizontal axis, the state properties are on the vertical axis. A dark box on top of the line indicates that the property is true during that time period, and a lighter box below the line indicates that the property is false. This trace is based on all local properties identified.

Because of space limitations, in the example situation depicted in Figure 2 only three ants are involved. However, similar experiments have been performed with a population of 50 ants. Since the abstract way of modelling used for the simulation is not computationally expensive, these simulations also took no more than 30 seconds.

As can be seen in Figure 2 there are two ants (ant1 and ant2) that start their search for food immediately, whereas ant3 comes into play a bit later, at time point 3. When ant1 and ant2 start their search, none of the locations contain any pheromones yet, so basically they have a free choice where to go. In the current example, ant1 selects a rather long route to the food source (via locations A-B-C-D-E-F), whilst ant2 chooses a shorter route (A-G-H-F). Note that, in the current model, a fixed route preference (via the attractiveness predicate) has been assigned to each ant for the case there are no pheromones yet. After that, at time point 3, ant3 starts its search for food. At that moment, there are trails of pheromones leading to both locations B and G, but these trails contain exactly the same number of pheromones. Thus, ant3 also has a free choice between locations B and G, and chooses in this case to go to B. Meanwhile, at time point 18, ant2 has arrived at the food source (location F). Since it is the first to discover this location, the only trail leading back to the nest is its own trail. Thus ant2 will return home via its own trail. Next, when ant1 discovers the food source (at time point 31), it will notice that there is a trail leading back that is stronger than its own trail (since ant2 has already walked there twice: back and forth, not too long ago). As a result, it will follow this trail and will keep following ant2 forever. Something similar holds for ant3. The first time that it reaches the food source, ant3 will still follow its own trail, but some time later (from time point 63) it will also follow the other two ants. To conclude, eventually the shortest of both routes is shown to remain, whilst the other route evaporates. Other simulations, in particular for small ant populations, show that it is important that the decay parameter of the pheromones is not too high. Otherwise, the trail leading to the nest has evaporated before the first ant has returned, and all ants get lost!

[Figure 2 plots, against time, which state properties hold for ant1, ant2 and ant3: observed pheromone levels at the edges, pick_up_food and drop_food actions, and is_at_location_from positions for the locations A-H.]

Figure 2 Simulation trace of the dynamics of the ants' behaviour

5. Verification

In addition to the simulation software, a software environment has been developed that enables dynamic properties specified in TTL to be checked against simulation traces. This software environment takes a dynamic property and one or more (empirical or simulated) traces as input, and checks whether the dynamic property holds for the traces. Traces are represented by sets of Prolog facts of the form

holds(state(m1, t(2)), a, true).


where m1 is the trace name, t(2) is time point 2, and a is a state formula in the ontology of the component's input. This fact indicates that state formula a is true in the component's input state at time point 2. The programme for temporal formula checking basically uses Prolog rules for the predicate sat that reduce the satisfaction of the temporal formula finally to the satisfaction of atomic state formulae at certain time points, which can be read from the trace representation. Examples of such reduction rules are:

sat(and(F,G)):- sat(F), sat(G).

sat(not(and(F,G))):- sat(or(not(F), not(G))).

sat(or(F,G)):- sat(F).

sat(or(F,G)):- sat(G).

sat(not(or(F,G))):- sat(and(not(F), not(G))).
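To illustrate how such a reduction bottoms out in checks of atomic state formulae at time points, the following sketch checks a simple 'observation leads to belief' property against a trace held as time-indexed sets of true atoms (a Python illustration, not the authors' Prolog environment):

# Sketch: checking "whenever the observation holds at t1, the belief holds at some t2 > t1"
# against a trace given as time-indexed sets of true atoms (illustrative, not the TTL checker).
from typing import List, Set

Trace = List[Set[str]]

def holds(trace: Trace, t: int, atom: str) -> bool:
    return atom in trace[t]

def check_observation_leads_to_belief(trace: Trace, obs: str, belief: str) -> bool:
    return all(
        any(holds(trace, t2, belief) for t2 in range(t1 + 1, len(trace)))
        for t1 in range(len(trace))
        if holds(trace, t1, obs)
    )

trace: Trace = [{"agent_observes_itsraining"}, set(), {"belief_itsraining"}]
print(check_observation_leads_to_belief(trace, "agent_observes_itsraining", "belief_itsraining"))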

Using this environment, the formal representation relations presented in Section 3.2 have been automatically checked against traces like the one depicted in Section 4. The duration of these checks varied from 1 to 10 seconds, depending on the complexity of the formula (in particular, the backward representation relation has a quite complex structure, since it involves reference to a large number of events in the history). All these checks turned out to be successful, which validates (for the given traces at least) our choice for the representational content of the shared extended mental state property pheromones_at(e, v). However, note that these checks are only an empirical validation; they are not an exhaustive proof as, e.g., model checking is. Currently, the possibilities of combining TTL with existing model checking techniques are being explored.

In addition to simulated traces, the checking software allows dynamic properties to be checked against other types of traces as well. In the future, the representation relations specified in this paper will be checked against traces resulting from other types of ant simulations, and possibly against empirical traces.

6. Discussion

The extended mind perspective introduces a high-level conceptualisation of agent-environment interaction processes. By modelling the ants example from an extended mind perspective, the following challenging issues on cognitive modelling and representational content were encountered:

1. How to define representational content for an external mental state property

2. How to handle decay of a mental state property

3. How can joint creation of a shared mental state property be modelled

4. What is an appropriate notion of collective representational content of a shared external mental state property

5. How can representational content be defined in a case where a behavioural choice depends on a number of mental state properties


These questions were addressed in this paper. For example, modelling joint creation of mental state properties (3.) was made possible by using relative or leveled mental state properties, parameterised by numbers. Each contribution to such a mental state property was modelled by addition to the level indicated by the number. Collective representational content (4.) from a looking backward perspective was defined by taking into account histories of such contributions. Collective representational content from a forward perspective was defined taking into account multiple parameterised mental state properties, corresponding to the alternatives for behavioural choices, with their relative weights. In this case it is not possible to define representational content for just one of these mental state properties, but it is possible to define it for their combination or conjunction (5.).

The high-level conceptualisation has successfully been formalised and analysed in a logical manner. The formalisation enables simulation and automated checking of dynamic properties of traces or sets of traces, in particular of the representation relations.

For future research, it is planned to define the general concept of extended mind in a more precise way. This will make the distinction between extended mind states and other external world states, which is currently not always clear, more concrete. In addition, the approach will be applied to several other cases of extended mind. For example, can the work be related to AI planning representations, traffic control, knowledge representation of negotiation, and to the concept of "shared knowledge" in knowledge management?

Acknowledgements The authors are grateful to Lourens van der Meij for his contribution to the development of the software environment, and to an anonymous referee for some valuable comments on an earlier version of this paper.

References

1. Bickhard, M.H. Representational Content in Humans and Machines. Journal of Experimental and Theoretical Artificial Intelligence, vol. 5, 1993, pp. 285-333.

2. Bonabeau, E., Dorigo, M. and Theraulaz, G. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York, 1999.

3. Bosse, T., Jonker, C.M., Schut, M.C., and Treur, J. Simulation and Analysis of Shared Extended Mind. Proceedings of the First Joint Workshop on Multi-Agent and Multi-Agent-Based Simulation, MAMABS'04. To appear, 2004.

4. Bosse, T., Jonker, C.M., and Treur, J. Simulation and analysis of controlled multi-representational reasoning processes. Proc. of the Fifth International Conference on Cognitive Modelling, ICCM'03. Universitäts-Verlag Bamberg, 2003, pp. 27-32.

5. Clark, A. Being There: Putting Brain, Body and World Together Again, MIT Press, 1997.

6. Clark, A. Reasons, Robots and the Extended Mind. In: Mind & Language, vol. 16, 2001, pp. 121-145.


7. Clark, A., and Chalmers, D. The Extended Mind. In: Analysis, vol. 58, 1998, pp. 7-19.

8. Dennett, D.C. Kinds of Mind: Towards an Understanding of Consciousness, New York: Basic Books, 1996.

9. Griffiths, P. and Stotz, K. How the mind grows: a developmental perspective on the biology of cognition. Synthese, vol.122, 2000, pp. 29-51.

10. Jacob, P. What Minds Can Do: Intentionality in a Non-Intentional World. Cambridge University Press, Cambridge, 1997.

11. Jonker, C.M., and Treur, J. Compositional Verification of Multi-Agent Systems: a Formal Analysis of Pro-activeness and Reactiveness. International Journal of Cooperative Information Systems, vol. 11, 2002, pp. 51-92.

12. Jonker, C.M., and Treur, J. Analysis of the Dynamics of Reasoning Using Multiple Representations. In: W.D. Gray and C.D. Schunn (eds.), Proceedings of the 24th Annual Conference of the Cognitive Science Society, CogSci 2002. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 2002, pp. 512-517.

13. Jonker, C.M., and Treur, J. A Temporal-Interactivist Perspective on the Dynamics of Mental States. Cognitive Systems Research Journal, vol. 4, 2003, pp. 137-155.

14. Jonker, C.M., Treur, J., and Wijngaards, W.C.A. A Temporal Modelling Environment for Internally Grounded Beliefs, Desires and Intentions. Cognitive Systems Research Journal, vol. 4, 2003, pp. 191-210.

15. Kim, J. Philosophy of Mind. Westview Press, 1996.

16. Menary, R. (ed.) The Extended Mind, Papers presented at the Conference The Extended Mind - The Very Idea: Philosophical Perspectives on Situated and Embodied Cognition, University of Hertfordshire, 2001. John Benjamins, 2004, to appear.

17. Scheele, M. Team Action: A Matter of Distribution. Distributed Cognitive systems and the Collective Intentionality they Exhibit. Third International Conference on Collective Intentionality, 2002.


Overfitting in Wrapper-Based Feature Subset Selection: The Harder You Try the Worse it Gets*

John Loughrey, Padraig Cunningham

Trinity College Dublin, College Green, Dublin 2, Ireland. {John.Loughrey, Padraig.Cunningham}@cs.tcd.ie

Abstract. In Wrapper based feature selection, the more states that are visited during the search phase of the algorithm the greater the likelihood of finding a feature subset that has a high internal accuracy while generalizing poorly. When this occurs, we say that the algorithm has overfitted to the training data. We outline a set of experiments to show this and we introduce a modified genetic algorithm to address this overfitting problem by stopping the search before overfitting occurs. This new algorithm called GAWES (Genetic Algorithm With Early Stopping) reduces the level of overfitting and yields feature subsets that have a better generalization accuracy.

1 Introduction

The benefits of wrapper-based techniques for feature selection are well established [1, 15]. However, it has recently been recognized that wrapper-based techniques have the potential to overfit the training data [2]. That is, feature subsets that perform well on the training data may not perform as well on data not used in the training process. Furthermore, the extent of the overfitting is related to the depth of the search. Reunanen [2] shows that, whereas Sequential Forward Floating Selection (SFFS) beats Sequential Forward Selection on the data used in the training process, the reverse is true on hold-out data. He argues that this is because SFFS is a more intensive search process i.e. it explores more states.

In this paper we present further evidence of this and explore the use of the number of states explored in the search as an indicator of the depth of the search and thus as a predictor of overfitting. Clearly this metric does not tell the whole story since for example a lengthy random search will not overfit at all.

We also explore a solution to this overfitting problem. Techniques from Machine Learning research for tackling overfitting include:

- Post-Pruning: Overfitting can be eliminated by pruning, as is done in the construction of Decision Trees [6].

* This research was funded by Science Foundation Ireland Grant No. SFI-02 IN. Ill 11


- Jitter: Adding noise to the training data can make it more difficult for the learning algorithm to fit the training data and thus overfitting is avoided [12].

- Early Stopping: Overfitting is avoided in the training of supervised Neural Networks by stopping the training when performance on a validation set starts to deteriorate [7, 14].

Of these three options, the one that we explore here is Early Stopping. We present a stochastic search process that has a cross-validation stage to determine when overfitting occurs. Then the final search uses all the data to guide the search and stops at the point determined by the cross-validation. We show in Section 4 that this method works well in reducing the overfitting associated with feature selection. In Section 2 of the paper we briefly discuss different approaches to Feature Selection, focusing on various wrapper based search strategies. Section 3 provides more detail on the GAWES algorithm and early stopping in stochastic search. Section 4 outlines the results of the experimental study. Future avenues for research are discussed in Section 5 and the paper concludes in Section 6.

2 Wrapper-Based Feature Subset Selection

Feature selection is defined as the selection of a subset of features to describe a phenomenon from a larger set that may contain irrelevant or redundant features. Improving classifier performance and accuracy are usually the motivating factors behind this, as the accuracy is degraded by the presence of these irrelevant features. The curse of dimensionality is the term given to the phenomenon when there are too many features in the model and not enough instances to completely describe the target concept. Feature selection attempts to identify and eliminate unnecessary features, thereby reducing the dimensionality of the data, and hopefully resulting in an increase in accuracy.

The two common approaches to feature selection are the use of filters and the wrapper method. Filtering techniques attempt to identify features that are related to or predictive of the outcome of interest: they operate independently of the learning algorithm. An example is Information Gain, which was originally introduced to Machine Learning research by Quinlan as a criterion for building concise decision trees [6] but is now widely used for feature selection in general. The wrapper approach differs in that it evaluates subsets based upon the accuracy estimates provided by a classifier built with that feature subset. Thus wrappers are much more computationally expensive than filters, but can produce better results because they take the bias of the classifier into account and evaluate features in context. A detailed presentation of the wrapper approach can be found in [1].
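For concreteness, the core wrapper step, scoring a single candidate feature subset by the cross-validated accuracy of a classifier restricted to those features, can be sketched as follows (an illustration using scikit-learn and a 3-nearest-neighbour classifier; the function name and the random data are assumptions, and this is not the implementation used in the paper):

# Sketch of a wrapper evaluation: the "fitness" of a feature mask is the
# cross-validated accuracy of a classifier trained on only those features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(X: np.ndarray, y: np.ndarray, mask: np.ndarray, folds: int = 10) -> float:
    """mask is a boolean vector selecting a feature subset."""
    if not mask.any():                      # an empty subset cannot be evaluated
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask], y, cv=folds).mean()

# Illustrative random data only; not one of the paper's datasets.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 8)), rng.integers(0, 2, size=60)
print(wrapper_fitness(X, y, np.array([True, False, True, True, False, False, True, False])))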

2.1 Search Algorithms

The wrapper method can be viewed as a search optimization process and therefore can incur a high computational cost. From n features, the number of possible feature subsets is 2^n, so it is impractical to search the whole state space except in situations with a small number of features. The search strategies available can be classed into three categories: randomized, sequential and exhaustive, depending on the order in which they evaluate the subsets. In this research we only experiment with randomized and sequential techniques as an exhaustive search is infeasible in most domains. The algorithms we use are forward selection, backward elimination, hill climbing and a genetic algorithm as these tend to be quite popular and are easily implemented strategies.

2.2 The Problem of Overfitting

A classifier is said to overfit to a dataset if it models the training data too closely and gives poor predictions on new data. This occurs when there is insufficient data to train the classifier and the data does not fully cover the concept being learned. Such models are said to have a high variance, meaning that small changes in this data will have a significant influence on the resulting model [8]. This is a problem for many real world situations where the data available may be quite noisy. Overfitting in feature selection appears to be exacerbated by the intensity of the search, since the more feature subsets that are visited the more likely the search is to find a subset that overfits [2-4]. In [1, 4] this problem is described, although little is said on how it can be addressed. However, we believe that limiting the extent of the search will help combat overfitting. Kohavi et al. [10] describe the feature weighting algorithm DIET, in which the set of possible feature weights can be restricted. Their experiments show that when DIET is restricted to two non-zero weights the resultant models perform better than when the algorithm allows for a larger set of feature weights, in situations when the training data is limited. This restriction on the possible set of values in turn restricts the extent to which the algorithm can search. However, in feature selection we only have two possible weights: a feature can only have a value of '1' or '0', i.e. be turned 'on' or 'off', so we cannot restrict this aspect any further. Perhaps counter-intuitively, restricting the number of nodes visited by the feature selection algorithm should help further.

Figure 1 shows accuracies obtained during a feature selection search using a genetic algorithm for the hand dataset (see Table 1). We could expect the search to suffer from overfitting at any point after generation 17 in the search. In this example, we see a typical demonstration of overfitting where we see a peak in the generalization performance early on with a gradual deterioration in performance after that.



Fig. 1 A comparison of the Internal and Test Set accuracy on the 'hand' dataset. A trend line is shown for the Test Set accuracy (dashed line).

Our experiments begin with an initial investigation into the correlation between the depth of search and the associated level of overfitting. We compare the algorithms mentioned in Section 2.1 using a 10-fold Cross Validation Accuracy on a 3-Nearest Neighbor classifier.

The graphs in Figure 2 support the hypothesis that the more nodes that are evaluated in the subspace search, the more likely it is to find a subset that overfits and performs poorly on the test set. Hill Climbing is the least intensive search in each example and as a result has the poorest internal and test set accuracy in most cases. This shows this algorithm's tendency to under-fit the training data, probably getting stuck in a local maximum. The FS and BE searches perform quite similarly over all datasets, and it is interesting to note that they examine a similar number of nodes in most cases. The research into the comparative performance of these strategies has been inconclusive [15, 16]. Our results are not much different. BE tends to be a little more intensive, but on the datasets we show here this does not result in it overfitting to a greater extent. One could expect that any difference in these strategies is dependent on the dataset used. In five of the seven datasets the GA explores the most states and is outperformed by both FS and BE in all of these cases. While one may have expected this more intensive strategy to yield higher generalization accuracies, the graphs show that this is clearly not the case. Moreover, on the two datasets where the GA evaluates fewer nodes, its performance is more competitive with the FS and BE algorithms.


Table 1. Datasets used:

Name         Instances   Features        Source
Hand         63          13 (+1 Class)   http://www.cs.tcd.ie/research groups/mlg
Breast       273         9 (+1 Class)    UCI Repository
Sonar        208         60 (+1 Class)   UCI Repository
Ionosphere   351         35 (+1 Class)   UCI Repository
Diabetes     768         8 (+1 Class)    UCI Repository
Zoo          101         16 (+1 Class)   UCI Repository
Glass        214         9 (+1 Class)    UCI Repository

[Figure 2 shows, for each of the datasets Zoo, Sonar, Glass, Hand, Ionosphere, Diabetes and Breast, the internal accuracy, test set accuracy and number of nodes evaluated for the FS, BE, HC and GA searches.]

Fig. 2 The graphs above show the results for the preliminary experiments. FS - Forward Selection; BE - Backward Elimination; HC - Hill Climbing; GA - Genetic Algorithm. The left-hand side 'y' axis represents the classification accuracy. The right-hand side 'y' axis represents the number of states visited in the subspace search.

3 Early Stopping in Stochastic Search

The idea of implementing early stopping in our search is an appealing one. The method is widely understood, and easy to implement. In neural networks the training process is stopped once the generalization accuracy starts to drop. This generalization performance is obtained by withholding a sample of the data (the validation set). A major drawback of withholding data from the training process for use in early stopping is that overfitting arises in situations where the data available provides inadequate coverage of the phenomenon. In such situations, we can ill afford to withhold data from the training process.

The strategy we adopt here (see Figure 4) is to start with a cross validation process to determine when overfitting occurs [9]. Then all the training data is used to guide the search, with the search stopping at the point determined in the cross-validation. In order to determine if this actually does address overfitting, our evaluation involves wrapping this process in an outer cross validation that gives a good assessment of the overall generalization accuracy. The overall evaluation process is shown in Figure 3.

Step 0. Divide complete data set F into 10 folds, F1 ... F10.
        Define FTi ← F \ Fi   {training set corresponding to holdout set Fi}
Step 1. For each fold i:
   Step 1.1. Using GAWES on FTi find feature mask Mi   {see Fig 4}
   Step 1.2. Calculate accuracy ATi of mask Mi on training data FTi using cross-validation.
   Step 1.3. Calculate accuracy AGi of mask Mi on holdout data Fi using FTi as training set.
Step 2. AT ← Average(ATi)   {Accuracy on training data}
        AG ← Average(AGi)   {Accuracy on unseen data}

Fig. 3 Outer cross-validation, determining the accuracy on training data AT and the generalization accuracy AG.

Step 0. Divide the data set FTi into 10 folds, E1 ... E10.
        Define ETj ← E \ Ej   {training set corresponding to validation set Ej}
Step 1. For each fold j:
   Step 1.1. Using GA and ETj find best feature mask Mj[g] for each generation g.
   Step 1.2. AETj[g] ← accuracy of mask Mj[g] on training data ETj   {i.e. fitness}
   Step 1.3. AEGj[g] ← accuracy of mask Mj[g] on validation data Ej using ETj as training set.
Step 2. AET[g] ← Average(AETj[g])   {Accuracy on training data at each gen.}
        AEG[g] ← Average(AEGj[g])   {Accuracy on validation data at each gen.}
Step 3. sg ← generation with highest AEG[g]   {the stopping point}
Step 4. Using GA and FTi, find best feature mask M[sg] for generation sg.
Step 5. Return M[sg].

Fig. 4 Inner cross validation, determining the generation for early stopping.

From this evaluation we can estimate when overfitting will occur, i.e. the point at which the generalization performance starts to fall off. Once we have this estimate we can then rerun the algorithm with new parameters that will stop the search before overfitting starts.

Deciding when to stop is not such a straightforward task. In [14] a number of different criteria for early stopping are discussed and it is suggested that allowing the condition to be biased towards the latter stages of the search will yield small improvements in generalization accuracy. This said however, if we delay too much we run the risk of overfitting once again.


3.1 The GAWES Approach Using a Genetic Algorithm

The GAWES algorithm was developed using the FIONN workbench [13]. The algorithm is based upon the standard GA and the fitness of each individual is calculated from a 10-fold Cross Validation measure. Once the fitness has been calculated, the evolutionary strategy is based upon the Roulette Wheel technique, where the probability of an individual being selected for the new generation is related to its fitness. We use a two point crossover operation and the probability of a mutation occurring is 0.05 [5, 11].

After a series of preliminary experiments we decided to fix the population size for the GA to 20, with the number of generations set to 100. We arrived at this after taking into consideration the length of time it took to execute the algorithm, the performance of the end mask, and the rate at which the population converged. The purpose of the experiments was to determine the gen_limit for each dataset - the generation after which the genetic algorithm should be stopped.
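The operators mentioned above can be sketched as follows (an illustrative re-implementation of roulette-wheel selection, two-point crossover and per-bit mutation on feature masks; the names and details are assumptions, not the FIONN implementation):

# Sketch (illustrative): roulette-wheel selection, two-point crossover and
# per-bit mutation with probability 0.05 on bit-string feature masks.
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    return random.choices(population, weights=fitnesses, k=1)[0]

def two_point_crossover(a, b):
    """Swap the segment between two random cut points."""
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(mask, p=0.05):
    """Flip each bit independently with probability p."""
    return [bit ^ 1 if random.random() < p else bit for bit in mask]

# Population of 20 random 13-bit masks (13 features, as in the 'hand' dataset).
population = [[random.randint(0, 1) for _ in range(13)] for _ in range(20)]
child_a, child_b = two_point_crossover(population[0], population[1])
child_a = mutate(child_a)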

[Figure 5 shows, per dataset, the internal accuracy, the test set accuracy and the test set trend line plotted against the generation count.]

Fig. 5 The graphs above show the results of running GAWES on the datasets. The x axis represents the generation count, while the y axis is the accuracy.

The results obtained are shown in Figure 5. The graphs represent 90% of the total data available, where 81% was used in the internal accuracy measure and 9% was used for the test set accuracy. The remaining 10% of data was withheld for the evaluation of the GAWES algorithm. All graphs are averaged over 100 runs of the genetic algorithm, where each run is performed on a different sample of the data. From these runs we were able to generate a trend line of the test set accuracy, based upon a nine-point moving average. Using a smoothed average we have a more reliable indication of the best point for early stopping. The stopping point was chosen as the point at which this test set trend line is at its maximum value.
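The smoothing and stopping-point choice described here can be expressed compactly (an illustrative sketch; the function and array names are ours, and the accuracy values are made up):

# Sketch: nine-point moving average of per-generation test set accuracy,
# with the stopping generation taken at the maximum of the trend line.
import numpy as np

def stopping_generation(test_acc_per_gen: np.ndarray, window: int = 9) -> int:
    kernel = np.ones(window) / window
    trend = np.convolve(test_acc_per_gen, kernel, mode="valid")
    return int(np.argmax(trend)) + window // 2   # centre of the best window

acc = np.array([0.80, 0.82, 0.85, 0.86, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81, 0.80])
print(stopping_generation(acc))   # generation index of the smoothed maximum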

The graphs in Fig 5 show classical overfitting to different extents in five of the seven datasets; that is, there is an increase in test set accuracy followed by a gradual deterioration. In the breast dataset we see that the hold-out test set accuracy starts to deteriorate from the first generation and never seems to recover. From this behavior we assume that feature selection in these datasets will not lead to an increase in accuracy. The test set accuracy of ionosphere degrades in a similar manner but starts to improve in fitness after the 8th generation and peaks at around the 11th before overfitting.

4 Evaluation

Having obtained an estimated gen_limit from Figure 5 we can re-run our GA to evaluate the GAWES performance. As with other early stopping techniques, GAWES is only successful if the generalization of the end point is higher than the result that would otherwise be obtained. Another characteristic of these techniques is that the internal accuracy will be lower because the potential to overfit has been constrained. Figure 6 and Table 2 show the results.

The results are much as we expected. The number of states evaluated by the classifier is greatly reduced, as indicated by the line in the graphs. It is also shown that our algorithm does not suffer from overfitting as much as the standard GA, and in six of the seven datasets our GAWES algorithm beats the longer, more intensive search. We believe that in the case where GAWES failed (zoo), this failure was due to a small number of cases per class in the dataset. Dividing smaller datasets further, as is required in our algorithm, leads to a high variance between successive training and test sets, which makes it more difficult to get an accurate estimate of when overfitting occurs.

These results are consistent with our suggestion that the harder you try in wrapper-based feature subset selection, the worse it gets when the number of training cases is limited. By reducing the length of time that the GA is allowed to run, we limit the number of subsets it can evaluate, thus reducing the depth of the search. Our results provide clear evidence that early stopping can help to reduce the amount of overfitting. The improvements in some of the results could probably be increased further if work were done on other aspects of the GA.


[Figure 6 shows, per dataset, the internal accuracy, the test set accuracy and the number of nodes evaluated for the standard GA and for GAWES.]

Fig. 6 The graphs above show the results for the GAWES algorithm. The line and the right-hand side 'y' axis represent the number of states visited during each search.

Table 2. Summary of results for the GAWES algorithm.

             GA                     GAWES
             Internal   Test Set    Internal   Test Set
Hand         98.04      84.03       96.47      88.57
Breast       78.42      70.39       78.26      73.64
Sonar        90.70      83.7        90.12      84.12
Ionosphere   93.634     89.74       92.81      90.32
Diabetes     75.14      70.84       74.87      73.69
Zoo          96.37      93          94.38      91.18
Glass        77.51      71.42       77.36      72.87


5 Future Work

At this stage we feel we have established the principle that early stopping can be effective in addressing overfitting in feature subset selection. The next stage of this research is to perform experiments on many more datasets to get a clear picture of the performance of the early stopping algorithm. Our experiments so far have been done with a one-size-fits-all GA and it seems clear that the parameters of the GA need to be tuned to the characteristics of the data. Some of the results shown here might have been improved if we had chosen other parameters for the GA, as our better results were on datasets that had fewer features. This was probably due to our choice of population size. The population size remained constant across the experiments so that the effect of early stopping could be examined under equal conditions. It would be interesting to look into this further, whether it means working with the GA more or indeed moving to another stochastic technique such as Simulated Annealing (SA). Simulated annealing is inspired by statistical mechanics and is similar to the standard Hill Climbing search, but differs in that it is able to accept decreases in the fitness. The search is modeled on the cooling of metals, and so the probability of accepting a decrease in fitness is based upon the current temperature of the system (an artificial variable). The temperature of the system is high at the beginning but slowly cools as the search progresses; therefore significant decreases in fitness are more likely to be accepted early in the search process when the temperature of the system is high, but are less likely as the search progresses and the temperature gradually cools [17]. This gives the search the ability to escape from local maxima that it would otherwise get trapped in early in the search. We feel implementing Early Stopping in SA has promise, as there are many ways in which one can restrict the length of the search, e.g. by increasing the cooling rate.
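The acceptance rule described here can be sketched as follows (an illustration under assumptions: a geometric cooling schedule is used, and the feature-mask proposal step is left as a placeholder; this is not an implementation from the paper):

# Sketch (illustrative): simulated-annealing acceptance for a feature-subset
# search; worse subsets are accepted with probability exp(delta / T).
import math
import random

def accept(delta_fitness: float, temperature: float) -> bool:
    """Always accept improvements; accept decreases with probability exp(delta/T)."""
    if delta_fitness >= 0:
        return True
    return random.random() < math.exp(delta_fitness / temperature)

T, alpha = 1.0, 0.95          # geometric cooling: T <- alpha * T at each step;
for step in range(200):       # a smaller alpha cools faster and so shortens the search
    # ... propose a neighbouring feature mask and compute delta_fitness here ...
    T *= alpha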

6 Conclusions

Reunanen [2] shows that overfitting is a problem in wrapper-based feature selection. Our preliminary experiments support this finding. We have proposed a mechanism for early stopping in stochastic search as a solution. Early stopping is a widely known and well understood method of avoiding overfitting in neural network training, but we are unaware of any other research that applies it to feature selection.

Genetic algorithms are often used in feature selection, although one major difficulty associated with them is parameter selection. The population size, generation limit, evolutionary technique, and crossover and mutation values all have to be set, and these values all depend on the dataset being explored. It has been shown that the more the feature subspace is searched, the greater the chance there is of overfitting. By reducing the length of time that the GA is allowed to run, we limit the number of subsets it can evaluate, thus reducing the depth of the search. However, more work is needed to make the algorithm more competitive with existing feature-selection techniques. It is important to mention that overfitting does not always occur, and finding datasets that demonstrated the effects of early stopping was difficult. We have an issue with the datasets available to us in that sometimes feature selection is not


always necessary, and as a result determining when to stop based upon a marginal increase in test set accuracy is not always reliable. Increasing the number of datasets is a major issue for future research. Moreover, the computational requirements of GAWES meant that many searches took days to execute, which somewhat limited the number of results we could show.

References

1. Kohavi, R., John, G. Wrappers for feature subset selection. Artificial Intelligence, Vol. 97, No. 1-2, pp. 273-324, 1997

2. Reunanen, J. Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, Vol. 3, pp. 1371-1382, 2003

3. Jain, A., Zongker, D. Feature Selection: Evaluation, Application and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, 1997

4. Kohavi, R., Sommerfield, D. Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology. First International Conference on Knowledge Discovery and Data Mining (KDD-95)

5. Yang, J., Honavar, V. Feature Subset Selection Using a Genetic Algorithm. In H. Liu and H. Motoda (Eds), Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 117-136. Massachusetts: Kluwer Academic Publishers

6. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993

7. Fausett, L. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice-Hall, 1994

8. Cunningham, P. Overfitting and Diversity in Classification Ensembles based on Feature Selection. Department of Computer Science, Trinity College Dublin, Technical Report No. TCD-CS-2000-07

9. Caruana, R., Lawrence, S., Giles, L. Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping. Neural Information Processing Systems, Denver, Colorado, 2000

10. Kohavi, R., Langley, P., Yun, Y. The utility of feature weighting in nearest-neighbor algorithms. In Proceedings of the European Conference on Machine Learning (ECML-97), 1997

11. Mitchell, M. An Introduction to Genetic Algorithms. MIT Press, 1998

12. Koistinen, P., Holmstrom, L. Kernel regression and backpropagation training with noise. In J. E. Moody, S. J. Hanson, and R. P. Lippman (Eds), Advances in Neural Information Processing Systems 4, pp. 1033-1039. Morgan Kaufmann Publishers, San Mateo, CA, 1992

13. Doyle, D., Loughrey, J., Nugent, C., Coyle, L., Cunningham, P. FIONN: A Framework for Developing CBR Systems. To appear in Expert Update

14. Prechelt, L. Automatic Early Stopping Using Cross Validation: Quantifying the Criteria. Neural Networks, 1998

15. Aha, D., Bankert, R. A Comparative Evaluation of Sequential Feature Selection Algorithms. In D. Fisher and J. H. Lenz (Eds), Artificial Intelligence and Statistics, New York, 1996

16. Doak, J. An Evaluation of Feature Selection Methods and Their Application to Computer Security. Technical Report CSE-92-18, University of California, Davis, Department of Computer Science

17. Kirkpatrick, S., Gelatt, C. D. Jr., Vecchi, M. P. Optimization by Simulated Annealing. Science, 220(4598):671-680, 1983


Managing ontology versions with a distributed blackboard architecture

Ernesto Compatangelo Wamberto Vasconcelos Bruce Scharlau

Department of Computing Science, University of Aberdeen

Abstract

Ontology versioning deals with the management of ontology changes, including the evaluation of the consequences arising from these changes. We describe a distributed, "pluggable" blackboard architecture for managing different ontology versions. This allows existing environments for ontology design to be naturally extended with versioning capabilities with little or no overhead. We also outline how some core components of our architecture can be linked together, showing how this can be used to manage ontology versions, reasoning with and about them.

1 Introduction

Ontological engineering was born with the aim of reusing knowledge-based systems by building and sharing domain ontologies, which are semantically sound specifications of domain conceptualisations [10]. Focus in this research area rapidly shifted from the editing and the application of single ontologies to the different aspects of managing multiple ontologies in a distributed environment [5]. Most of these aspects involve the querying, the maintenance, and the reuse of multiple ontologies in a semantic web context. Such knowledge management tasks are accomplished by sharing, mapping, combining, and versioning a number of distributed ontologies [11].

Versioning, i.e. the ability to manage ontology changes and their consequences, is a critical functionality in those frameworks that deal with the management of multiple ontologies. In fact, versioning entails a thorough analysis of the available ontologies, which is based on the changes introduced to transform one ontology version into a different one. Such analysis leads to the specification of the evolution history of an ontology as an oriented graph of versions. This history can be used to assess whether any two ontology versions are based on different conceptualisations or whether they represent the same conceptualisation, rendered using a different descriptive structure and/or lexicon. In turn, such an assessment allows the evaluation of the effects of each ontology change on the results provided by those systems that use the ontology.


To date, ontology versioning research has given rise to different, somewhat complementary frameworks for ontology versioning and evolution within the context of multi-ontology management. The ontology versioning approach for the semantic web [5] is part of the WonderWeb ontology infrastructure framework. It mainly focuses on ontology versions for which a history of changes is available. The approach also includes the OntoView versioning system [6], which can highlight changes in concepts belonging to differentiated ontologies. The ontology evolution management approach [9] is part of the Karlsruhe Ontology and Semantic Web (KAON) framework. It mainly focuses on ontology versions in which articulations between concepts can be introduced such that the retrieval of ontology instances is not affected once a new ontology version replaces an old one. The approach also includes the Ontologging system [8], which establishes semantic bridges between concepts in different ontology versions. The component-based approach to ontology evolution [7] is part of the Protege ontology development framework. It mainly focuses on ontology versions for which a history of versions is not available. The approach also includes the PROMPT system [11], which establishes alignment mappings between concepts in different ontology versions based on their degree of similarity. All these frameworks at least partially focus on one or more of the following aspects:

1. Representing details of changes between ontology versions [7, 4].

2. Specifying ontology change operations and analysing the different implications of these operations in diverse contexts [12, 7, 5, 14].

Although each of the different ontology versioning approaches currently under development separately addresses one or more of the above aspects (1) and (2), to the best of our knowledge none of them jointly addresses all of them. Therefore, we propose an architecture to jointly address the above issues. Our proposal is built around a blackboard system where information on the design activity is recorded. The blackboard system allows for simultaneous access to the versioning space by different users and tools. An important feature in our proposal concerns the ability to reason with and about the versioning space, for instance showing when in the design of an ontology a concept appears and what implications it has in the current version. Our declarative approach allows for alternative reasoning capabilities to be deployed. We envisage our architecture being added with little or no overhead to existing ontology management environments. As an immediate advantage, this allows single-user environments to be naturally extended to distributed, multi-user scenarios.

This paper is structured as follows. Section 2 introduces our architecture. Section 3 outlines our framework for ontology versioning. Section 4 describes our representation of changes in actual or virtual ontology versions. Section 5 provides some relevant versioning scenarios. Finally, Section 6 draws some preliminary conclusions and describes future research in this area.

WonderWeb: http://wonderweb.semanticweb.org/    KAON: http://kaon.semanticweb.org/    Protege: http://protege.stanford.edu/


2 A blackboard architecture for versioning

An ontology generally changes with time as part of its evolutionary lifecycle, thus giving rise to a whole range of (sometimes incompatible) actual variants. Handling these variants is the remit of that part of ontology management often denoted as ontology evolution management [9]. However, independently developed ontologies can also be considered as virtual variants of a hypothetical common ancestor. In the broader context of multiple ontology management, these virtual variants should also fall within the scope and the remit of ontology versioning. In fact, algorithms for establishing mappings (i) between similar elements in two independently developed ontologies and (ii) between differentiated elements in two versions of the same ontology are themselves variants of a common approach to ontology comparison [11].

In theory, the creation and the modification of (each knowledge element in) an ontology should always be formally recorded and documented as a log of changes. This would allow the reconstruction of the actual versioning history. In practice, the documentation of ontology changes is often unavailable. However, it should still be possible to infer and thus to reconstruct a virtual versioning history, i.e. a series of possible sequences of changes. Unfortunately, this requires the combination of a wide range of logical deductions and heuristic inferences that detect similarities and differences in ontology variants [5, 11]. A similar combination should also be used to assess the effects of changes when replacing an ontology variant with a different one.

A way to enable the interaction between the various representation and reasoning components needed to manage different ontology versions is that of devising a versioning architecture that accommodates the different components around a shared ontology versioning space. The architecture proposed in this paper is thus built around a blackboard system, which consists of a memory space shared by different processes or threads. We present a diagram with our proposed architecture in Figure 1. The shared blackboard, represented as a larger central box, stores the versioning spaces of different ontologies. Distinct users, represented by smaller boxes around the blackboard, access and update the shared memory space. Our versioning architecture implements the infrastructure required for the blackboard system functionalities (e.g., concurrent access, queues, priority of access, and so on) as well as for knowledge representation and reasoning.

[Figure 1: Blackboard architecture. The shared blackboard sits at the centre, storing the ontologies, and is surrounded by its users: CONCEPTOOL, reasoning tools, visualisation tools and a set of agents.]

^The term "actual variant", ("variant" for short), is used to denote any ontology which is either proven or reliably acknowledged to be a modification of another accessible ontology.

^The term "virtual variant", (or "version" for short), is used to denote an ontology that represents any of the (independently developed) conceptualisations of the same domain.


2.1 Architectural components

The bottom left box shows the CONCEPTOOL intelligent management environment [2] interacting with the blackboard, performing changes directly onto the versioning space and annotating the space with extra knowledge about the operations carried out. Alternatively, a Web browser (second box, bottom row) interacts via forms with the shared space, sending requests in a particular format; this allows users to remotely access the shared space and confers openness on our proposal. Reasoning tools (third box, bottom row) make use of the knowledge stored in the blackboard space to perform inferences. Another useful class of users concerns the visualisation services, which allow the inspection of the shared space along different dimensions and in alternative formats.

The "pluggability" of our architecture is realised in the following way. Existing environments for ontology engineering can be augmented with a functionality where, on saving the latest changes to a local non-shared file, a separate event takes place. As a result of such an event, a new ontology version is added to the space, with the appropriate connections forged between the new version and an existing version. Ultimately, the files exclusively owned by users would become obsolete, as the blackboard will store all of them, possibly including additions and changes that will benefit people other than their originators. We aim at an implementation of our architecture in which we can offer this feature.

We also show on the right-hand side of the diagram a vertical row of boxes with agents, i.e. autonomous and proactive pieces of software able to communicate with other (software or human) agents [16]. Our architecture will be populated with a team of agents that continuously roam the versioning space stored in the blackboard, carrying out checks of desirable properties (or the lack of undesirable ones) on the latest versions and preparing reports on the activities of teams and team members. Agents will also be used to search for particular components in the versioning space, monitoring it until they appear.

We want to investigate adequate data structures for storing the ontology constructs, as well as means for representing knowledge on finer- and coarser-grained operations and annotations on the changes (authorship, date/time, justification, and so on). The data structures and knowledge representation formalisms should be accompanied by, respectively, algorithms for the management of the data and reasoning mechanisms. The shared memory requires mechanisms and policies for disciplined access and update by the users.

The policies regulating the access and update of the shared memory should incorporate some of the practices adopted in real-life scenarios. For instance, changes introduced by junior engineers could be vetoed by a senior engineer. This would lead to provisional versions with changes subject to approval.

A Linda tuple space [1] is a realisation of a blackboard system that has been incorporated into various programming languages, including Prolog and Java [3]. Our versioning system will deploy existing tuple space management facilities based on blackboard systems for multi-agent systems [15].

It is worth noting that our architecture is an open one: other functionalities can be added without affecting existing components. The blackboard can be accessed simultaneously by different threads that implement arbitrary computational behaviours. We envisage auxiliary management services being gradually added to our architecture, for instance, preparing summaries of team members' performance (using their output stored in the blackboard), monitoring particular versions (due to their confidentiality or complexity), creating visual representations of (portions of) the blackboard, and so on.

The different functionalities can be realised in many ways, using different technologies and programming languages; this also confers heterogeneity on our architecture. The architecture is also scalable, as the blackboard can be divided up and kept in different machines (as many as necessary) and be managed as one single knowledge repository. We also note that our proposal is lightweight in that there is only a thin interface layer between those accessing the blackboard and the blackboard itself. The bulk of the management will be carried out behind the scenes, and those developing new tools and functionalities will be provided with high-level means to access and update the blackboard.

2.2 Case Study: ConcepTool and JavaSpace

We want our architecture to enable "plug-and-play" capabilities with existing editors and tools. We demonstrate the ease of this by allowing the CONCEPTOOL system [2] to use a JavaSpace [3] to share ontology versions. We found that we only needed a trivial amount of code to write a project (the larger unit of concepts organised by CONCEPTOOL) onto the space. While this could be written into the core code of CONCEPTOOL (or any other tool, for that matter), we find that the extra functionality of a shared JavaSpace should be added as a plug-in for other ontology editors and tools.

Saving a project onto a JavaSpace can be transparent to the user, who works on the project as normal. When the work is saved, a copy of it can be placed in the JavaSpace to be shared and monitored by the services running there. Saving onto the JavaSpace is done in addition to the standard functionality of saving the work locally. We illustrate our approach with the diagram in Figure 2. In the diagram we show CONCEPTOOL and its local file as well as its connection to the (remote or local) JavaSpace. For the purposes of testing our implementation, we added a monitor to the JavaSpace, shown in the bottom right of the diagram. The monitor shows the interaction with the JavaSpace as CONCEPTOOL connects to it and sends out a project for saving.

[Figure 2: CONCEPTOOL and a JavaSpace blackboard. CONCEPTOOL saves to its local file and, in addition, writes the project to a remote or local JavaSpace, where a monitor observes the interaction.]

We added an item to the menu of CONCEPTOOL labelled "Save to JavaSpace". When selected, the menu item makes a simple method call, shown in Figure 3.

    void menuProjectSaveToJavaSpace_actionPerformed(ActionEvent e) {
        pf.saveToJavaSpace();
    }

    public boolean saveToJavaSpace() {
        // first make sure that a 'normal' save has been performed
        // to set the file name for the project
        save();
        try {
            JavaSpace space = SpaceAccessor.getSpace("JavaSpace");
            if (space == null) throw new Exception("*** no JavaSpace! ***");
            String projectString = null;
            this.saveXML(JAVA_SPACE_PATH);
            projectString = new String();
            BufferedReader in = new BufferedReader(new FileReader(JAVA_SPACE_PATH));
            while (in.ready()) {
                projectString = projectString + in.readLine() + "\n";
            }
            // create a new ProjectInfo entry
            ProjectInfo pi = new ProjectInfo(projectString, this.project.getName(), "version1");
            // write it to the JavaSpace
            System.out.println("Write to Space: " + pi.projectIdentifier);
            space.write(pi, null, Lease.FOREVER);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return saveAs();
    }

Figure 3: Fragment of code to save CONCEPTOOL projects onto a JavaSpace

The method first writes the file to the local file system and then acquires the JavaSpace and writes out a ProjectInfo object onto the space, which can then be read and used by anyone with access to the JavaSpace.

Used this way, JavaSpaces enable any standalone editor to be networked to other colleagues by adding our plug-in. There is no latency as with HTML-based tools, nor the Java plug-in problems that occur with applets. Developers can still use their preferred environment and only need to add our plug-in to take advantage of JavaSpace persistence and collaboration. The only requirement beyond our plug-in is access to a JavaSpace, which can be anywhere on the Internet. Another part of our envisaged plug-in is the JavaSpace browser, which will enable developers to select an item from the JavaSpace, download it to their local filesystem where they can work on it, and then save it back to the JavaSpace as a new version.
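As an indication of how the envisaged JavaSpace browser might work, the sketch below checks a project out of the space by name and checks it back in under a new version label. It reuses the entry written in Figure 3, but the field names and both methods are our own assumptions rather than part of the current plug-in.

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // Illustrative sketch of the envisaged JavaSpace browser: fetch a stored
    // project by name and write it back under a new version label.
    public class JavaSpaceBrowserSketch {

        // Stand-in for the entry class written in Figure 3; the field names
        // (contents, projectIdentifier, version) are assumptions on our part.
        public static class ProjectInfo implements Entry {
            public String contents;            // serialised (XML) project
            public String projectIdentifier;   // project name
            public String version;             // version label, e.g. "version1"
            public ProjectInfo() {}            // public no-arg constructor required for entries
        }

        // read() returns a copy of a matching entry without removing it from the space;
        // null fields in the template act as wildcards.
        public static ProjectInfo checkOut(JavaSpace space, String projectName) throws Exception {
            ProjectInfo template = new ProjectInfo();
            template.projectIdentifier = projectName;
            return (ProjectInfo) space.read(template, null, 10000);
        }

        // Write the edited project back as a new version, leaving earlier versions in place.
        public static void checkIn(JavaSpace space, ProjectInfo edited, String newVersion) throws Exception {
            edited.version = newVersion;
            space.write(edited, null, Lease.FOREVER);
        }
    }

Because read() leaves the matched entry in place and each check-in writes a fresh entry, earlier versions remain available to other users and to the monitoring services on the space.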


3 Versioning for ontology management

Our framework for ontology versioning encompasses a formal knowledge model, a software engineering method, and tools to support both real and virtual versioning. The adopted knowledge model should provide means to explore and organise the space of possible ontologies, allowing engineers to annotate their design activities and to reason about these annotations. The software engineering methodology and the accompanying tool(s) should support the construction, the evolution and the maintenance of ontologies, creating virtual versioning histories whenever necessary. We thus envisage an Inferential Ontology Management System (IOMS) equipped with an Ontology Versioning Environment (OVEN), such that:

• The OVEN should explicitly document the evolutionary history of (each concept in) an ontology across different ontology versions. Versioning information about (each concept in) an ontology (including the rationale for its current structure and content) could be encoded using a number of specialised "own slots" in the representation of ontological knowledge.

• The OVEN should be able to reconstruct and show the evolutionary history of (each concept in) an ontology across different ontology versions using the available (versioning) information.

• The OVEN should be able to create a virtual versioning hierarchy out of a set of partially overlapping ontologies. In the most general case, these ontologies, for which no versioning information is available, are independently developed from a single original ontology, their engineers having had no contact to share or reuse design principles.

Therefore, we envisage a set of specific ontology versioning services to be provided by the OVEN environment whereby any changes are formally recorded, justified, and explained by their authors. In our vision, these changes should be recorded as ontology annotations; they would thus represent a layer of knowledge about the ontology design, development and maintenance history.

For practical reasons, we will focus on domain ontologies represented using frames (possibly in conjunction with first order logic axioms), as most application ontologies are currently expressed in this way. However, we will explore different logic-based annotations to represent both concept creation or modification operations and their justification or rationale. We will also investigate how formal reasoning can be exploited to manage different ontology versions, e.g. comparing selected properties and computing the implications of the differences between these versions.

We envisage a formal framework supported by software tools where engineers and domain experts share their knowledge, either independently or jointly building and maintaining ontologies. For instance, starting from an existing ontology Ω, users can perform operations φ1, φ2, ..., φn, independently developing ontologies Ω1, Ω2, ..., Ωn, which are thus partially overlapping. Alternatively, starting from an existing ontology Ωj, users can perform operations f1, f2, ..., fn, jointly developing distinct ontology versions Ω[j,1], ..., Ω[j,n]. Each operation, its associated parameters and result are recorded and made available for inspection by other users and/or by other reasoning services. These parameters provide a knowledge layer that can be used to explore the design space of an ontology. This is depicted in Figure 4, where each labelled arrow represents one of the above kinds of transformation operations. In our formal framework, specialised operations are defined to insert, delete, and modify ontology components, to merge ontology clusters, and so on.

[Figure 4: Ontology design space. Starting from a root ontology, each labelled arrow is a transformation operation leading to a new (partially overlapping) ontology version, forming a tree of versions.]

The usage of a frame-based model to represent ontological knowledge provides a natural and cost-effective way of enabling annotation-based ontology versioning services in addition to structural or lexical reasoning services. More precisely, while ontological knowledge as such can be recorded in the template slots of frames, versioning information can be recorded in the own slots of frames. As ontology reasoning only takes template slots into account, own slots can thus be used to record all sorts of annotations for versioning and other purposes. In particular, each versioning-relevant operation (e.g. creation, modification) involving any concept in the ontology should be recorded in terms of the following (a minimal sketch of such a record is given after the list):

• The structure and the content of the new concept in the ontology version generated at the end of the editing session as a result of the modification operation.

• A pointer to the ontology version opened during the editing session in which that concept has been created or modified.

• A pointer to the starting concept on which the modified concept is based. This starting concept is contained in the ontology version opened during the editing session in which that concept has been created or modified.
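The small Java class below is a minimal, hypothetical sketch of the own-slot information listed above; the field names are our own illustration and are not prescribed by the OVEN design.

    // Hypothetical sketch of the versioning information recorded for one
    // concept creation or modification; field names are illustrative only.
    public class ConceptVersionRecord {
        public String conceptName;            // the (possibly new) concept
        public String conceptDefinition;      // structure and content after the edit
        public String sourceOntologyVersion;  // version opened during the editing session
        public String baseConceptName;        // starting concept the change was applied to
        public String author;                 // annotations: who, when and why
        public String timestamp;
        public String justification;
    }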

In our vision, an OVEN working session is an editing session during which either a new ontology is created or an existing ontology is opened, altered and saved. A new ontology version is generated as a result of an editing session if there is any difference between the ontology immediately after opening and immediately before closing.

In our approach, any operation f(C, C*): Ωi → Ω[i,n], where a concept C in the ontology Ωi is transformed into a concept C* in the ontology Ω[i,n], can be entirely reconstructed and described by comparing the versioning information in the specifications of C and C*. If such versioning information is not available, then a possible reconstruction must be inferred by the OVEN on the basis of the available non-annotated ontologies.

Our goal is akin to that of software versioning systems, which are now commercially or freely available for a number of languages like Java and C++ (e.g.,


the open source project [13]). However, we want to exploit the declarative nature and semantic richness of formally represented ontological knowledge, which can be manipulated in a number of different ways. For instance, we would like to extract either those portions of the ontology that were affected by a particular change or those portions that remained completely unaffected.

4 Representing changes in ontologies

Given an ontology Ωi expressed using a suitable formalism and notation, we want to represent the changes at the end of an editing session that gives rise to a variant ontology version Ω[i,1]. The description of such changes should include the element of Ω[i,1] resulting from each operation, the nature and the author of the operation, the time stamp, and the rationale for such a change. These operations (generically denoted as f1, f2, ..., fn) are functions whose features (i.e. types and number of parameters) may vary depending on their nature. Parameters, however, should include a reference to Ωi; moreover, the result of each fk must be a variant ontology Ω[i,1] of Ωi. For instance, an operation to create a new subclass of a class in an ontology Ωi needs as parameters the superclass to which the new subclass will be added, plus any specific values for attributes which give rise to the subclass. It is worth noting that some of the parameters of an operation, such as the originating superclass of a new subclass, are already specified by default in the concept created during the editing session.

The different annotated versions fully characterise the development history of an ontology, as the sequence of modification operations can be reconstructed by comparing the available ontology variants. The same ontology can be edited differently by different users; as a result, the design space is explored via a tree rooted on the starting ontology Ωi, whose nodes are new variant versions of Ωi. The edges connecting a node Ωi to its offspring nodes Ω[i,1], Ω[i,2], ..., Ω[i,n] are the editing operations fk applied to Ωi and resulting in Ω[i,1], Ω[i,2], ..., Ω[i,n]. The topology of the design space is shown in Figure 4.

4.1 Granularity of changes

The granularity of the f operations is an important issue to consider. At one end of the spectrum, there are coarse-grained operations F that represent an editing session in its entirety, where any number of simpler editing operations fk ∈ F have been performed. In this case, no annotations about the single operations fk on the individual concepts in Ωi are recorded; instead, annotations on the whole ontology are recorded. These could simply be the author of the editing operations and the date/time the new ontology version was generated. Alternatively, operations can be finer-grained, in this case describing individual editing operations performed on the ontology.

(The notion of "being affected" by a change can be given different alternative definitions. A variant ontology version is one for which all the required versioning annotations exist.)


Given an original ontology and a new version, we envisage a framework where it is possible to automatically compare them, figuring out their mutual differences and thus reconstructing the operations that have caused these differences. Hence, in principle we can infer the operations carried out on an ontology. However, if an ontology is sufficiently complex and large, this process can be costly. Therefore, we could save the effort of inferring the changes if we require that new ontology versions are explicitly annotated with the operations that generated them.

4.2 A declarative representation of the version space

A declarative representation of the version space should allow us to explicitly manipulate nodes, edges and branches. Ideally, one should provide the same information in different ways: for instance, we might need to represent the fragments of the explored version space in which a particular operation fk has not been performed. We show in Figure 5 an initial declarative presentation for the version space depicted in Figure 4. In this case, we use Prolog constructs to define relationships between the components. We have used the fact root/1 to define the root of the version space: in our case, the Ω0 ontology. Facts edge/3 state that the application of the second parameter (i.e., a function) to the first parameter (i.e., an ontology) generates the third parameter (i.e., the new ontology).

    root(Ω0).
    edge(Ω0, f1, Ω1).    edge(Ω1, f3, Ω[1,1]).    edge(Ω2, f1, Ω[2,1]).
    edge(Ω0, f2, Ω2).    edge(Ω1, f4, Ω[1,2]).    ...

Figure 5: Representation of the versioning space

Using the above representation, we can develop tools and functionalities that support users in navigating and in further exploring the design space. For instance, we can devise a way of reconstructing the design history of any ontology from the design space. Figure 6 shows a fragment of Prolog code that implements such functionality. In this fragment, predicate history/2 reconstructs, given an ontology Onto (first argument), the design history H of this ontology (second argument), i.e., the sequence of intermediate versions and operations leading from the root ontology to Onto. Predicate history/2 builds on the usual definition of a predicate path/4, which finds a path between its first argument and its second argument, using an intermediary path (the path built so far) in its third argument and returning the final path in its fourth argument. For instance, with the facts of Figure 5, the query history(Ω[1,2], H) would bind H to [Ω[1,2]/f4, Ω1/f1], recording that Ω[1,2] was obtained from the root by applying f1 and then f4.

    history(Onto, H) :- root(Root), path(Root, Onto, [], H).

    path(A, A, Path, Path).
    path(A, B, PathSoFar, Path) :-
        edge(A, F, C),
        path(C, B, [C/F|PathSoFar], Path).

Figure 6: Ontology design history

We want to investigate alternative formalisms suitable to represent the operations fi. Although these can be safely regarded as ordinary functions, the previously discussed granularity issues affect how they should be represented. The range of allowed operations to be performed on the ontologies also influences the formalism adopted and its representation. Time issues (when the version becomes available), authorship (who performs the operations, their reputation,


authority within the team, and so on), and justification (why the operation was performed; this issue is further explored below) may all influence our decision.

5 Versioning scenarios

Scenarios characterised either by surreptitious ontology changes or by untraceable changes can lead to ambiguous or incomplete interpretations and unacceptable or inconsistent implications. Hence, they are not considered in the versioning approach proposed in this paper.

In the following, we will introduce and discuss a specific example. We will use simple mathematical constructs to present our example in order not to overload the discussion with technical details, avoiding early commitments to particular formalisms (Description logics, RDF, UML, etc.) and notations (XML, UML diagrams).

We define an ontology as the pair Ω = (C, R), where C = {C1, ..., Cn} is a set of concepts Ci and R = {ρ1, ..., ρm} is a set of relationships. Each concept definition is of the form (c, {(a0, v0), ..., (an, vn)}), where c is the concept name and (ai, vi) are pairs of attribute names ai and values vi. Each ρj ⊆ C × C, 1 ≤ j ≤ m, is a relationship among the concepts Ci. The elements of each ρj are of the form (C, C′), where C, C′ ∈ C are concepts in C, and the pair represents that C is related to C′ via ρj.

Some typical operations performed on ontologies can be formally represented as rewriting rules as follows:

• Creation of a concept - a concept definition is introduced in the set C, with an added entry in one of the relationships ρj:

    (C, R) → (C ∪ {C}, (R − {ρj}) ∪ {ρj ∪ {(C, C′)}})

Additional constraints may be required to precisely represent the conditions under which the operation can be performed. For instance, C′ ∈ C, that is, the concept with which C is newly related must already be a concept in Ω (possibly subsuming all concepts).

• Renaming of a concept - the name of a concept may be changed and this change must be propagated throughout the ontology:

    (C ∪ {(c, A)}, R) → (C ∪ {(c′, A)}, R · {c/c′})

Concept C = (c, A) is renamed to C′ = (c′, A), A being the set of attribute/value pairs. The operation must replace every occurrence of c with c′ in the sets of R, denoted by R · {c/c′}.

• Addition of an attribute/value pair to a concept - an existing concept may be altered to accommodate an extra attribute/value pair (a, v):

    (C ∪ {(c, A)}, R) → (C ∪ {(c, A ∪ {(a, v)})}, R)


< ^[1,1,1] -

-^ ^[1,1]

^ I v ^ ^ [ 1 , 1 , 2 ]

Qi ^ f

' ^ c

X ^ '\ ^[1,2]

/ a

^[1,2,1]

The above list is not exhaustive: it is meant to illustrate the kind of operations we aim to provide to annotate the versioning space. The operations require user intervention: to provide the details of a concept definition, to choose the concept to be renamed (and its new name), to choose the concept to be changed, and so on. The rewriting rules above should accommodate means for user interaction, allowing engineers to experiment with distinct combinations.
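To illustrate how one of these rewriting rules might be applied programmatically, the sketch below implements the concept-renaming rule over a simple in-memory version of the (C, R) pair defined earlier; the classes and the method are our own simplification and not part of the formal framework.

    import java.util.*;

    // Illustrative sketch: the ontology pair (C, R) as plain Java collections,
    // and the renaming rule (C ∪ {(c,A)}, R) -> (C ∪ {(c',A)}, R·{c/c'}).
    public class RenameRewrite {

        // A concept is a name plus a map of attribute/value pairs.
        record Concept(String name, Map<String, String> attributes) {}

        // An ontology is a set of concepts and named relationships over concept names.
        record Ontology(Set<Concept> concepts,
                        Map<String, Set<Map.Entry<String, String>>> relationships) {}

        static Ontology rename(Ontology o, String c, String cPrime) {
            Set<Concept> newConcepts = new HashSet<>();
            for (Concept k : o.concepts())
                newConcepts.add(k.name().equals(c)
                        ? new Concept(cPrime, k.attributes())   // (c,A) becomes (c',A)
                        : k);

            // Propagate the renaming through every relationship: R · {c/c'}
            Map<String, Set<Map.Entry<String, String>>> newRels = new HashMap<>();
            for (var rel : o.relationships().entrySet()) {
                Set<Map.Entry<String, String>> pairs = new HashSet<>();
                for (var pair : rel.getValue()) {
                    String from = pair.getKey().equals(c) ? cPrime : pair.getKey();
                    String to = pair.getValue().equals(c) ? cPrime : pair.getValue();
                    pairs.add(Map.entry(from, to));
                }
                newRels.put(rel.getKey(), pairs);
            }
            return new Ontology(newConcepts, newRels);
        }
    }

Applying rename yields a new Ontology value rather than mutating the old one, which matches the view of each operation as an edge from one version to the next in the versioning space.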

A more appropriate data structure to represent the exploration of the version space is a graph, as different sequences of operations may result in the same ontology. We illustrate this scenario in Figure 7, where different ontologies may converge depending on the operations performed on them. Graphs can, however, be broken into trees by replicating the nodes that cause two (or more) branches to converge. In the figure we show the operations defined above being used to obtain new ontologies. The operations are recorded and the specific parameters provided by the users are also incorporated into the versioning space. Convergence in a version space may suggest that although users went along different design paths, they eventually reached an agreed ontology. In our example, version [1,1,2] has been reached via three alternative design paths. Although we have shown individual rewriting rules on our diagram, these can be gathered as a sequence of operations performed during an editing session, and only the final resulting ontology is stored in the versioning space.

Our formalisation of alterations in terms of rewriting rules allows us to study important versioning issues from solid theoretical underpinnings. For instance, during an editing session many operations can be later undone. In such cases, we should be able to analyse the sequence of rewrite rules applied and select the shortest subsequence that maps the original ontology onto the resulting one.

Furthermore, given two subsequences of rewrite rules applied to the same original ontology, we should be able to infer properties among different resulting ontologies based on the operations carried out. For example, given two versions Ω′ and Ω″ stemming from the same original version and obtained via different sequences of operations, we may want to check if Ω′ ∩ Ω″ ≠ ∅, if Ω′ ⊢ Ω″, or, conversely, Ω″ ⊢ Ω′. We want to answer these questions by analysing the sequences of operations, rather than the ontologies themselves.

6 Conclusions

Ontology building is a team effort: domain experts and engineers jointly devise partial and/or overlapping constructs that together give rise to ontologies and taxonomies. Tools to support the engineering of ontologies must take into account the inherently collaborative and distributed nature of the task.


Software engineers have long recognised the evolutionary nature of software development, whereby bugs or limitations of initial versions, once detected, give rise to improved, more robust versions. The evolutionary nature of this process calls for means of recording and managing different versions of software components. Indeed, more recent proposals of environments for ontology engineering have included versioning in their functionalities.

We propose an architecture for ontology versioning. Our approach, however, differs from those adopted in conventional versioning of software and ontologies in that it is knowledge-rich. This allows reasoning with and about the changes performed to components of ontologies. We also want to formally address the issue of justification of the choices and decisions of early designs and their changes, proposing mechanisms whereby engineers can interactively document their activities. The justification of designs and changes allows for formal reasoning about the design activity, taking into account authorship and reputation, refutation of justification, lifespan of a version, and so on.

Our distributed architecture is built around a shared blackboard system allowing different access policies to reflect real-life practices. Independent ontologies can be simultaneously developed, and design histories can be transferred across from one ontology to another. A set of support services such as visualisation tools and summaries will help in coping with the potential growth of version spaces. Our architecture is pluggable, open, lightweight and scalable.

Our approach to ontology versioning should naturally lead to an Ontology Configuration Management (OCM) system, the equivalent of Software Configuration Management (SCM) systems. We are investigating the adaptability of the functionalities of SCMs to OCMs, such as deltas (whereby only the changes are recorded and the unchanged part is reused, thus saving space), specifications of ontological components and their individual versioning, and so on.

References

[1] N. Carriero and D. Gelernter. Linda in Context. Comm. of the ACM, 32(4):444-458, April 1989.

[2] E. Compatangelo and H. Meisel. CONCEPTOOL: intelligent support to the management of expressive knowledge. In Proc. of the 7th Int'l Conf. on Knowledge-Based Intell. Information & Engineering Systems (KES'2003), volume 2773 of Lect. Notes in Artif. Intell., pages 81-88. Springer-Verlag, 2003.

[3] E. Freeman, S. Hupfer, and K. Arnold. JavaSpaces: Principles, Patterns and Practice. Addison-Wesley, 1999.

[4] J. Heflin and J. Hendler. Dynamic Ontologies on the Web. In Proc. of the American Association for Art'l Intell. Conf. (AAAI'2000). AAAI Press, 2000.

[5] M. Klein et al. Ontology versioning and change detection on the web. In Proc. of the 13th Int'l Conf. on Knowledge Engineering and Knowledge Management (EKAW'2002), pages 197-212. Springer-Verlag, 2002.

[6] M. Klein et al. OntoView: Comparing and versioning Ontologies. Collected Posters of the 1st Int'l Semantic Web Conf. (ISWC'2002), 2002.

[7] M. Klein and N. F. Noy. A component-based framework for ontology evolution. In Proc. of the IJCAI'03 Workshop on Ontologies and Distributed Systems, 2003.

[8] A. Maedche et al. Managing Multiple Ontologies and Ontology Evolution in Ontologging. In Proc. of the Conf. on Intelligent Information Processing, World Computer Congress 2002, pages 51-63. Kluwer Academic Publishers, 2002.

[9] A. Maedche et al. Ontologies for enterprise knowledge management. Intelligent Systems, 18(2):26-33, 2003.

[10] R. Neches et al. Enabling technology for knowledge sharing. AI Magazine, 12(3):36-56, 1991.

[11] N. F. Noy and M. A. Musen. Ontology Versioning as an Element of an Ontology-Management Framework. Technical Report SMI-2003-0961, School of Medical Informatics, Stanford University, USA, 2003. To appear in IEEE Intelligent Systems.

[12] N. Noy and M. Klein. Ontology Evolution: Not the Same as Schema Evolution. Knowledge and Information Systems, 6(4):428-440, 2004.

[13] D. Price. Concurrent Versions System, last accessed April 2004. URL http://ccvs.cvshome.org/.

[14] L. Stojanovic et al. User-driven ontology evolution management. In Proc. of the 13th Int'l Conf. on Knowledge Engineering and Knowledge Management (EKAW'2002), volume 2473 of Lect. Notes in Comp. Sci., pages 285-300. Springer-Verlag, 2002.

[15] W. W. Vasconcelos et al. Rapid Prototyping of Large Multi-Agent Systems through Logic Programming. Annals of Mathematics and Artificial Intelligence, 41(2-4), 2004. Special Issue on Logic-Based Agent Implementation.

[16] M. Wooldridge. An Introduction to Multiagent Systems. John Wiley & Sons, 2002.


OntoSearch: An Ontology Search Engine

Yi Zhang, Wamberto Vasconcelos, Derek Sleeman
Department of Computing Science,
University of Aberdeen, Aberdeen, AB24 3UE, Scotland, UK

Email: {yzhang, wvasconc, dsleeman}@csd.abdn.ac.uk

Abstract

Reuse of knowledge bases and the semantic web are two promising areas in knowledge technologies. Given some user requirements, finding suitable ontologies is an important task in both these areas. This paper discusses our work on OntoSearch, a kind of "ontology Google", which can help users find ontologies on the Internet. OntoSearch combines Google Web APIs with a hierarchy visualization technique. It allows the user to perform keyword searches on certain types of "ontology" files, and to visually inspect the files to check their relevance. The OntoSearch system is based on Java, JSP, Jena and JBoss technologies.

1. Introduction

Reuse of knowledge bases is an important area in knowledge technologies. Determining the principal topic of an existing knowledge base (KB) is very important for the reuse of knowledge bases. Identify-Knowledge-Base (IKB) [2, 3] is a tool to identify the principal topic(s) of some particular knowledge base by matching concepts (extracted from the KB) against a reference taxonomy (extracted from a reference ontology). Finding (normally from the Internet) a relevant reference ontology for a particular KB is the key point in the use of the IKB system.

This work is part of the Advanced Knowledge Technology (AKT) project, which is funded by EPSRC [1]. The IKB system [2, 3] (Aberdeen University) and the ExtrAKT system [4, 5, 6] (Edinburgh University), which work together with the OntoSearch system, were also built for the AKT consortium.
Knowledge Reuse: http://www.aktors.org/technologies/reuse/


The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It envisions a globally interconnected network of machine-processable information, made possible by means of the sharing of semantic data models or ontologies. Locating suitable existing ontologies to capture the user-required information from the Internet is a big challenge in current Semantic Web research.

Finding a suitable ontology on the Internet is a hard task, and there is still no good tool to handle this problem. Google offers a powerful web search engine; however, with regard to ontology searching, it has its own problems, such as a lack of visualization facilities. The Google Web APIs give us a chance to develop our own tool, OntoSearch, to search for relevant ontology files that meet the user's requirements.

In this article, we discuss the issue of searching for relevant ontologies on the Internet and introduce our tool, OntoSearch. In section 2, we give some background to our research and list some current problems. In section 3, OntoSearch is introduced in detail. In section 4, some discussion and future work are given, followed by a brief summary.

2. Background

2.1 IKB: Identify Knowledge Base

Reuse of knowledge bases is a promising area in knowledge technologies and many researchers are focusing on how to reuse existing knowledge bases for different applications [1, 2]. Such requests for reuse are often specified as a knowledge base (KB) characterisation problem:

Require knowledge base on topic T, conforming to the set of constraints C [2].

There are two key points here:

• Decide what the principal topic (T) of a given knowledge base is.
• Decide whether a KB conforms to certain constraints C.

As we noted, determining the principal topic of an existing knowledge base (KB) is an important step in the reuse of knowledge bases. Identify-Knowledge-Base (IKB) [2, 3] is a tool to suggest the principal topic(s) addressed by a knowledge base. It matches concepts extracted from a particular knowledge base against some reference taxonomy, where the taxonomy can be pre-stored or extracted from ontologies which are either stored on the local machine or are accessible through the WWW. The 'most specific' super-concept subsuming these extracted concepts is said to be the principal topic of the knowledge base.

W3C Semantic Web: http://www.w3.org/2001/sw/    Google Web APIs: http://www.google.com/apis/


Here we give a simple example about a taxonomy of Food. Suppose we already have the taxonomy depicted in Figure 1:

[Figure 1: Taxonomy showing different kinds of food. Food divides into Fruit-Vegetables and a meats branch; Fruit-Vegetables into Fruit (Apples, Pears) and Vegetables (Potatoes, Carrots); the meats branch into White Meat (Chicken, Rabbit) and Red Meat (Game, Beef).]

If the concepts {Apples, Pears} are extracted and passed to the IKB system, the system would suggest that {Fruit} might be the focus of the knowledge base. Similarly, if the concepts {Apples, Potatoes, and Carrots} are extracted, {Fruit-vegetables} would be the output. If the set of concepts {Potatoes, Chicken, and Game} is provided, topic {Food} would be returned as the result.
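A minimal sketch of this topic-identification step is shown below: it walks up a child-to-parent map from each extracted concept and returns the most specific ancestor common to all of them. The data structure and method are hypothetical simplifications of IKB, intended only to make the example above concrete.

    import java.util.*;

    // Illustrative sketch of IKB-style topic identification: find the most
    // specific taxonomy node that subsumes all extracted concepts.
    // The taxonomy is given as a child -> parent map (a tree).
    public class TopicFinder {

        public static String principalTopic(Map<String, String> parentOf,
                                            Collection<String> concepts) {
            List<String> path = null;   // ancestors of the first concept, most specific first
            for (String c : concepts) {
                List<String> ancestors = pathToRoot(parentOf, c);
                if (path == null) {
                    path = ancestors;
                } else {
                    path.retainAll(ancestors);   // keep only common ancestors, order preserved
                }
            }
            // the first remaining ancestor is the most specific common super-concept
            return (path == null || path.isEmpty()) ? null : path.get(0);
        }

        private static List<String> pathToRoot(Map<String, String> parentOf, String concept) {
            List<String> path = new ArrayList<>();
            for (String node = parentOf.get(concept); node != null; node = parentOf.get(node))
                path.add(node);
            return path;
        }
    }

With the taxonomy of Figure 1 encoded as a child-to-parent map, the method would return Fruit for {Apples, Pears}, matching the behaviour described above.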

The IKB system is implemented in Java. Jena, a Java API, is used to manipulate RDF models. The ExtrAKT system [4, 5, 6], developed at Edinburgh University, is used to extract concepts from a Prolog knowledge base and then pass them to the IKB system.

There are two main inputs to the IKB system: concepts extracted from a KB and a reference taxonomy. The concepts can be extracted by the ExtrAKT system. However, choosing a suitable reference ontology is very hard. In using the IKB system, we found that there are a huge number of ontologies available online, but finding a relevant reference ontology for some particular KB is not an easy job at all (more discussion is given in the next section). Nonetheless, finding a relevant reference ontology taxonomy is essential for using the IKB system.

2.2 Semantic Web

"The Semantic Web is an extension of the current web in which information is given a well-defined meaning, better enabling computers and people to work in cooperation."

— Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001 [1]

The Semantic Web [8] provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It envisions a globally interconnected network of machine-processable information, made possible by the sharing of semantic data models, which are also known as ontologies.

Java: http://java.sun.com/    Jena: http://www.hpl.hp.com/personal/bwm/rdf/jena/    Resource Description Framework (RDF): http://www.w3.org/RDF/

The Semantic Web is a collaborative effort led by the World Wide Web Consortium, with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

There are many people working in this area to improve, extend and standardize the Semantic Web, and many documents and tools have already been developed. However, Semantic Web technologies are still in their infancy and there are many challenges in this area. One of the most important issues is to locate suitable existing ontologies to capture the user-required information from the Semantic Web. For example, if you want to publish your top ten favourite music tracks on the Semantic Web, you would like to find some ontologies that represent real-world things like "artist", "track title", and "album"; otherwise, you will have to build these ontologies yourself. However, locating suitable ontologies on the Semantic Web is currently far from easy and, as far as we know, there is still no handy tool to help users. So, we need to build a kind of "ontology Google" tool to kick-start this process.

2.3 Google Application for ontologies

Nowadays, Google is widely used to search for information on the Internet. With the powerful facilities offered by Google, we can rapidly search many resources on the web. The next question is: can one use Google to locate an existing ontology which conforms to the user's requirements? The answer is "yes". As pointed out in [9], we can simply use the Google facility "filetype:" to limit the type of file searched. For example, if we search in Google for "filetype:RDFs Food", then Google will return all the RDFs files containing the keyword "Food". So the user can use Google to search for existing ontologies in different formalisms, such as DAML(+OIL), RDFs, OWL, etc., and use (or reuse) them for their own needs.

It seems Google is a good way to help the user find suitable online ontology resources. However, after some experiments (basically focused on finding RDFs files), we found it does not perform as expected; it is very hard to use Google to search for suitable ontology files. There are several problems:

Firstly, ontologies are not always available for a particular topic/domain. Some domains have many resources while others have very few.

W3C: http://www.W3C.org    Extensible Markup Language: http://www.w3.org/XML/    Google: http://www.Google.com    DAML(+OIL): http://www.daml.org/    Resource Description Framework (RDF) Schema: http://www.w3.org/TR/2000/CR-rdf-schema-20000327/    OWL: http://www.w3.org/TR/2004/REC-owl-features-20040210/


Secondly, Google returns links of'•elevant files, and the user will have to check if they are really relevant. This can be very time consuming because Google does not offer a quick way to browse ontology files.

Last but not least, Google searches files based on keywords supplied by the users. It does not check the real content and structure of the files. Some (usually many) irrelevant files will be returned to the user, just because they have the keywords somewhere in their files. We quite often find many RDFs files, which contain the required keywords, but on fiirther examination of the ontology, we realised that the files do not match our needs at all; that is, they do have the required keywords, but they are not situated as required. For example, when we searched for a food ontology using the keyword concept "Food", ontologies about the Animal domain are also returned, because the file contains a statement, such as "animal food vegetarian". Obviously, it is not really what we want. This kind of "mistakes" can cost the user more time to find acceptable ontologies. Thus, Google's keyword searching is not good enough as an ontology search tool.

Google Web APIs are a free beta service to help programmers develop their own Google-based applications. With the Google Web APIs service, software developers can query more than 4 billion web pages directly from their own computer programs. Google uses the SOAP and WSDL standards, so a developer can program in his or her favourite environment, such as Java, Perl or Visual Studio .NET. So, with the support of the Google Web APIs, we can develop a more specific tool to search for user-required ontologies from the Semantic Web.
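To make the idea concrete, here is a minimal sketch of how the keyword-plus-filetype query described above might be assembled programmatically. It is an illustration only, not the OntoSearch code: the helper names and the post-filter on file extensions are our own assumptions, and the call to the Google Web APIs client itself is omitted.

```python
# A minimal sketch (not the authors' implementation): it only builds the
# Google query string with the "filetype:" restriction described above.
# The resulting string would then be submitted through a Google Web APIs
# client; that client and any licence key are omitted here.

def build_ontology_query(keywords, file_type="rdfs"):
    """Combine user keywords with a filetype restriction, e.g. 'Food filetype:rdfs'."""
    return " ".join(keywords) + " filetype:" + file_type

def looks_like_ontology_url(url, extensions=(".rdfs", ".rdf", ".owl", ".daml")):
    """A cheap post-filter on returned URLs before showing them to the user."""
    return url.lower().endswith(extensions)

if __name__ == "__main__":
    print(build_ontology_query(["Food"]))                           # Food filetype:rdfs
    print(looks_like_ontology_url("http://example.org/food.rdfs"))  # True
```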

3. Empirical Studies

3.1 Design of OntoSearch

As mentioned in the last section, finding ontologies to satisfy user requirements is a very important issue, in both the KB reuse and Semantic Web areas. There is no existing tool to solve this problem. Google does have the power, but does not seem to be specific enough to give good results.

After some experiments, we noticed that the problem arises because Google does not offer a good visualization function for ontology files (in different formalisms, such as RDFs, etc.). As the user cannot view the ontology in an intuitive graphical format, they have to look through the ontologies as structured text files. This process takes a lot of time and cannot guarantee a good result, as the plain text of the ontology does not show its internal structure clearly.

After reviewing some ontology tools, we found that showing the hierarchy (structure) of an ontology is very important in helping the user to understand the nature of the ontology. Most of the tools, such as ReTAX [10], Protege [11], OntoEdit [12, 13],

SOAP: http://ws.apache.org/soap/
WSDL: http://www.w3.org/TR/wsdl
Perl: http://www.perl.org/
.NET: http://www.microsoft.com/net/


OilEd [14], WebODE [15] and OntoRAMA [16, 17], offer a hierarchy-viewing facility to support the user in building and editing ontologies. A hierarchical view of an ontology seems to be a good way to give the user a quick overview of the selected ontology. In this piece of work, we investigate the applicability of such visualisation techniques to ontology searching on the Internet.

To this end, we developed a visualization tool, OntoSearch, which combines the Google search engine with RDFs ontology (hierarchy) visualization technology. It helps the user search for relevant (keyword-matching) ontology files on the Internet and displays the files in a visually appealing way: a hierarchy tree. The hierarchical view allows users to quickly review the structures of different ontology files and select the relevant ones.

We show a diagrammatic overview of OntoSearch in Figure 2:

Figure 2: Overview of OntoSearch

The user inputs to OntoSearch the keywords that describe the nature of the required ontology. OntoSearch then applies the Google engine to search for RDFs files related to the keywords and returns a list of relevant links (URLs) to the user. The user then chooses some of the returned RDFs files, displays their structure, and

The rectangles in the figures represent processes while the ovals represent data or information.


decides which of the files are relevant. Finally, the user selects the relevant RDFs files and saves them in a taxonomy library for future use.

As we now have the ontology-searching tool OntoSearch, we can link it to our other tool, IKB. Figure 3 shows the links between them and how they interoperate.

Figure 3: The relation between OntoSearch, IKB and ExtrAKT

3.2 Development of OntoSearch

The OntoSearch system is implemented in Java and JSP. It is a web-based system which offers an online service based on JBoss. Jena, a Java API for manipulating RDF models, is used to read the ontology (RDFs file) into Java. The Google Web APIs provide the Internet search engine. A JSP tag (tree tag) is applied to visualize the hierarchical structure of the ontology.
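As a rough sketch of the hierarchy-building step, the fragment below parses an RDFS file and collects its rdfs:subClassOf triples into a tree. Note the assumptions: the paper's system uses Jena and a JSP tree tag in Java, whereas this sketch uses the Python rdflib library instead, and the example URL is hypothetical.

```python
# A rough analogue of OntoSearch's hierarchy view, using rdflib rather than
# the Jena/JSP stack described above (an assumption for illustration only).
from collections import defaultdict
from rdflib import Graph, RDFS

def subclass_tree(source):
    """Parse an RDFS file and return a parent -> children mapping
    built from its rdfs:subClassOf triples."""
    g = Graph()
    g.parse(source, format="xml")            # RDFS files are RDF/XML
    children = defaultdict(list)
    for child, parent in g.subject_objects(RDFS.subClassOf):
        children[parent].append(child)
    return children

def print_tree(children, node, depth=0):
    """Print the class hierarchy as an indented tree."""
    label = str(node).split("#")[-1].split("/")[-1]
    print("  " * depth + label)
    for child in sorted(children.get(node, []), key=str):
        print_tree(children, child, depth + 1)

# Hypothetical usage:
#   tree = subclass_tree("http://www.example.org/food.rdfs")
#   roots = [p for p in tree if not any(p in kids for kids in tree.values())]
#   for root in roots:
#       print_tree(tree, root)
```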

The user can browse and use the OntoSearch interface using any web browser. The user inputs keywords to describe the nature of the required ontology on the

JSP: http://java.sun.com/products/jsp/
JBoss: http://www.jboss.org/index.html
JSP Tree Tag (Version 1.5): http://www.guydavis.ca/projects/oss/tags/


keyboard. OntoSearch then applies the Google Web APIs to search the Internet for relevant files (the file type is currently restricted to RDFs, but this can be changed) and returns all the URLs on the screen. The user can select files to inspect their structure in a hierarchical tree view. Thus, the user can get a general idea of the content and structure of the returned ontologies. Finally, the user can save the relevant ontology to local disk.

3.3 Demonstration of OntoSearch

Next, an example of using OntoSearch is given. Suppose the user is looking for an ontology in the Food domain. The required ontology should contain some real-world knowledge about food and related issues. The user inputs the keyword "Food" into OntoSearch. After searching, some RDFs files are returned as results, as shown in Figure 4.

Figure 4: Search ontologies by keywords


As many RDFs files are often returned, the user then has to inspect them to check whether these files are really about the Food domain. As there is one file named "Food.RDFs", the user selects that one first. The content of that RDFs file is shown as triples in Figure 5.

Figure 5: The triples of the example ontology

As shown in Figure 5, there is only one kind of triple in this ontology: all the triples are of the "subClassOf" type. All the concepts in this ontology are subclasses (within several levels) of the Food concept. Thus, we can regard this ontology as a hierarchy of different kinds of foods. In fact, this ontology file does match the user's needs.

Figure 6 gives the hierarchy of that ontology. Obviously, this format is much easier for the user to understand than the triple format which is shown in Figure 5.


Figure 6: Hierarchy visualization of the selected ontology

After viewing the hierarchy of the selected ontology, the user decides whether the ontology is relevant to the requirement, and then proceeds to check further returned ontologies.

4. Summary, Discussion and Future Work

As mentioned earlier, the OntoSearch system is a useful tool which can search for ontology files on the Internet and visualize them as hierarchies. The next stage of our work will be developing an advanced mode for the OntoSearch system:

The current OntoSearch system is quite simple. It can only search for one type of ontology file (RDFs), and it only compares the user keywords with the contents of the ontology files wherever they occur; it therefore matches the keywords indiscriminately, whether they come from concepts or from comment fields. A future version of OntoSearch


will allow the user to choose different representational formalisms used to express ontologies, and it will allow the user to specify the type of entity (concepts, attributes or comments, etc.) to be matched.

Other future work includes:

• Creating a "library" of taxonomies

More experiments will be carried out, especially on particular domains, to test our OntoSearch system. The user-accepted ontologies will be stored in a repository for future use (e.g. for use with IKB).

• WordNet (http://www.cogsci.princeton.edu/~wn/) application

The synonym problem is not well addressed in the current version of OntoSearch. We are planning to incorporate WordNet in future versions so that our tool will be more effective, i.e. it will retrieve a larger number of relevant ontologies.

Acknowledgements

This work is supported under the EPSRC's grant number GR/N15764 to the Advanced Knowledge Technologies Interdisciplinary Research Collaboration, http://www.aktors.org/akt/, which comprises the Universities of Aberdeen, Edinburgh, Sheffield, Southampton and the Open University.

References

1. Advanced Knowledge Technology (AKT project), http://www.aktors.org/akt/
2. Sleeman D, Potter S, Robertson D, and Schorlemmer W.M. Enabling Services for Distributed Environments: Ontology Extraction and Knowledge Base Characterisation, ECAI-2002 workshop, 2002
3. Sleeman D, Zhang Yi, Vasconcelos W. Characterisation of Knowledge Bases, Proceedings of AI-2003 (the twenty-third Annual International Conference of the British Computer Society's Specialist Group on Artificial Intelligence (SGAI)), 2003
4. Schorlemmer M, Potter S, and Robertson D. Automated Support for Composition of Transformational Components in Knowledge Engineering. Informatics Research Report EDI-INF-RR-0137, June 2002
5. Sleeman D, Potter S, Robertson D, and Schorlemmer W.M. Ontology Extraction for Distributed Environments, In: B. Omelayenko & M. Klein (Eds), Knowledge Transformation for the Semantic Web. Amsterdam: IOS Press, p80-91, 2003
6. ExtrAKT system: a tool for extracting ontologies from Prolog knowledge bases, http://www.aktors.org/technologies/extrakt/
7. Berners-Lee T, Hendler J, Lassila O, The Semantic Web, Scientific American, 2001, http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
8. Sean B. Palmer, The Semantic Web: An Introduction, 2001-09, http://infomesh.net/2001/swintro/#ontInference
9. DuCharme B, Googling for XML, February 11, 2004, http://www.xml.com/pub/a/2004/02/11/googlexml.html
10. Alberdi E & Sleeman D, ReTAX: a step in the Automation of Taxonomic Revision, Artificial Intelligence, 91, p257-279, 1997
11. Musen M. A., Fergerson R. W., Grosso W. E., Crubezy M., Eriksson H., Noy N. F. and Tu S. W., The Evolution of Protege: An Environment for Knowledge-Based Systems Development, International Journal of Human-Computer Interaction (in press)
12. Sure Y., Staab S., Angele J. OntoEdit: Guiding Ontology Development by Methodology and Inferencing. In: R. Meersman, Z. Tari et al. (eds.), Proceedings of the Confederated International Conferences CoopIS, DOA and ODBASE (2002), Springer, LNCS 2519, 1205-1222
13. Sure Y., Staab S., Erdmann M., Angele J., Studer R. and Wenke D., OntoEdit: Collaborative ontology development for the semantic web, Proc. of ISWC2002, (2002) 221-235
14. Bechhofer S., Horrocks I., Goble C., Stevens R. OilEd: a Reason-able Ontology Editor for the Semantic Web. Proceedings of KI2001, Joint German/Austrian Conference on Artificial Intelligence, September 19-21, Vienna. Springer-Verlag LNAI Vol. 2174, pp 396-408, 2001
15. Corcho O., Fernandez-Lopez M., Gomez-Perez A. and Vicente O., WebODE: An Integrated Workbench for Ontology Representation, Reasoning and Exchange, Proc of EKAW2002, Springer LNAI 2473 (2002) 138-153
16. Eklund P, Roberts N and Green S. OntoRama: Browsing RDF Ontologies using a Hyperbolic-style Browser, http://www.kvocentral.com/kvopapers/ontorama.pdf
17. Eklund P, Cole R, and Roberts N. Retrieving and Exploring Ontology-based Information. In: Handbook on Ontologies, International Handbooks on Information Systems, Springer 2004, ISBN 3-540-40834-7 (2003) 405-414


SESSION 1b:

CBR AND RECOMMENDER SYSTEMS


Case Based Adaptation Using Interpolation over Nominal Values

Brian Knight¹, Fei Ling Woon²
¹University of Greenwich, CMS, London, UK
²Tunku Abdul Rahman College, SAS, Kuala Lumpur, Malaysia
[email protected], [email protected]

Abstract

In this paper we propose a method for interpolation over a set of retrieved cases in the adaptation phase of the case-based reasoning cycle. The method has two advantages over traditional systems: the first is that it can predict "new" instances, not yet present in the case base; the second is that it can predict solutions not present in the retrieval set. The method is a generalisation of Shepard's Interpolation method, formulated as the minimisation of an error function defined in terms of distance metrics in the solution and problem spaces. We term the retrieval algorithm the Generalised Shepard Nearest Neighbour (GSNN) method. A novel aspect of GSNN is that it provides a general method for interpolation over nominal solution domains. The method is illustrated in the paper with reference to the Irises classification problem. It is evaluated with reference to a simulated nominal value test problem, and to a benchmark case base from the travel domain. The algorithm is shown to out-perform conventional nearest neighbour methods on these problems. Finally, GSNN is shown to improve in efficiency when used in conjunction with a diverse retrieval algorithm.

1. Introduction

In this paper we present a Case-Based Reasoning (CBR) adaptation method that utilises a generalised interpolation method applying equally to nominal and continuous values. The paper is an extended version of the method previously proposed by Knight & Woon [1]. We show that this approach has advantages over existing interpolative methods such as k-Nearest Neighbours (k-NN) and Distance-Weighted Nearest Neighbours (DWNN).

The k-Nearest Neighbours (k-NN) algorithm [2, 3] has been a primary method employed in the task of classification. It predicts the solution for a given target query based on the output classification of its k nearest neighbours, assuming that the output class of the target query is most similar to the output class of its nearby instances in (possibly weighted) Euclidean distance. This algorithm can be used for approximating both discrete-valued and continuous-valued target functions. In a discrete-valued domain, the output class that has the most common


vote among its k nearest neighbours determines the output classification of the target query. For continuous-valued target functions, the mean value of its k nearest neighbours is taken as the solution rather than their most common class.

An alternative to k-NN is the Distance-Weighted Nearest Neighbour method (DWNN). In this algorithm, the contribution of each of the k nearest neighbours is weighted according to its distance to the target query x_q. When this algorithm is used for real-valued target functions the weighted mean value of the k nearest neighbours is taken, and it is thus a localised version of Shepard's method [4]. However, for discrete solution spaces a voting mechanism replaces the weighted mean. The Distance-Weighted Nearest Neighbour algorithm for discrete solution spaces takes the solution value f(x_q) which has the highest vote, where the vote for a particular solution is weighted by the inverse square of its distance from the target, d(x_q, x_i)^{-2}. The algorithm is given as follows:

$$ \hat{f}(x_q) \leftarrow \arg\max_{y \in Y} \sum_{i=1}^{k} w_i\, \delta(y, f(x_i)) \qquad (1.1) $$

where $w_i = \frac{1}{d(x_q, x_i)^2}$; $\delta(y, y') = 1$ if $y = y'$ and $\delta(y, y') = 0$ otherwise; and the set Y is the set of y values in the retrieved set of nearest neighbours.
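For illustration (this sketch is ours, not taken from the paper), the discrete distance-weighted vote of (1.1) can be written as:

```python
# Illustrative sketch of DWNN voting for discrete solutions, following (1.1).
# "neighbours" holds (distance_to_query, solution) pairs for the k retrieved
# cases; an exact match (distance 0) is returned directly to avoid division by zero.
from collections import defaultdict

def dwnn_predict(neighbours):
    votes = defaultdict(float)
    for d, y in neighbours:
        if d == 0:
            return y
        votes[y] += 1.0 / (d * d)     # weight = inverse square of the distance
    return max(votes, key=votes.get)

# e.g. dwnn_predict([(0.4, "red"), (0.5, "blue"), (0.9, "red")]) returns "red"
```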

It should be noted that under both k-NN and DWNN, the solution must exist in one of the nearest neighbours. It is not possible for these methods to return intermediate values not present in the retrieved set Y. This property is not shared by most interpolation formulae over continuous real domains, which typically return real values not in the retrieved set; for example, interpolating as the average of the retrieved set Y = {7, 8} produces the solution 7.5, which is not in the set Y. However, the interpolation method GSNN, proposed in this paper, is a generalisation of Shepard's continuous formulae which applies in exactly the same way to nominal and real value domains. As such, it does not require the solution value to be contained in the retrieved set, or indeed within the case base at all.

This characteristic of k-NN and DWNN over nominal solution domains has disadvantages for CBR. The first of these relates to the performance of CBR systems in predictive classification: in many retrieval situations the retrieved set may not contain the target nominal value at all, particularly if the target is in a region of the case base where cases are sparse. This can in fact degrade the predictive performance of the retrieved set, as we see in the examples given in Sections 6 and 7 of the paper.

There is also a problem involved when new solutions are added to a case base. When a new solution is first added, there may be very few cases, or even no cases at all, present in the case base. In this situation the new solution will rarely, if ever, be retrieved into the set for interpolation. In the example of the travel case base discussed in Section 7, a new hotel may be added for which there are no extant


holidays. This hotel will never be retrieved by k-NN or DWNN as a solution for a target holiday.

The importance of interpolation over nominal domains has been discussed by Wilson and Martinez in [5]. They provide an account of approaches to defining metrics over nominal domains, and propose the use of an extension to Stanfill and Waltz's Value Difference Metric [6] as the basis for interpolation. However, as the authors point out, the method will not cope with solutions not in the case base.

In this paper, we propose a generalisation of Shepard's method which is somewhat similar to DWNN, but which depends upon the minimisation of a distance weighted error function, rather than the usual maximisation of a voting function. This method applies equally to both discrete and continuous solution domains. We refer to the algorithm as the Generalised Shepard Nearest Neighbour algorithm (GSNN).

In Section 2 we describe Shepard's interpolation method, and in Section 3 explain the generalisation to nominal values. In Section 4 we discuss the general properties of the interpolation, and in Section 5 illustrate the procedure with reference to the well-known Irises classification problem [7]. The Irises problem is a good one for illustration purposes, since it demonstrates how 28/50 instances of iris-versicolour can be correctly predicted from a case base of just 2 cases: 1 of iris-setosa and 1 of iris-virginica. This illustrates clearly how solutions not in the case base can be correctly interpolated. In Section 6 we give a simulated example, taken from a benchmark used by Ramos & Enright [8]. This is used to measure the performance improvement on random and regular case bases. In Section 7, we use the travel benchmark case base [9] to test both the improvement in performance effected by GSNN, and also to test its ability to deal with new solutions. In Section 8 we show that GSNN can improve in efficiency when used in conjunction with a diverse retrieval algorithm. We conclude in Section 9, with a summary and indications of future work.

2. Shepard's Method

Shepard's interpolation method [4] is one of a variety of well-known algorithms available for multivariate scattered data interpolation, where the independent variable x ∈ R^d and the dependent variable y ∈ R (see e.g. [10]). We choose Shepard's method since, as noted by Mitchell [3], in the case of continuous domains it is the global form of DWNN. Shepard's method is global, requiring all points in a dataset to estimate a function f(x) at the point x. The interpolation function is given by:

$$ f(x) = \sum_{i=1}^{n} \|x - x_i\|^{-p} f(x_i) \Big/ \sum_{i=1}^{n} \|x - x_i\|^{-p} \qquad (2.1) $$

where the interpolation is over the set of n points {x_1, x_2, ..., x_n}, p > 0, and ‖x − x_i‖ denotes the Euclidean distance in R^d.
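As a small illustration (ours, not part of the original paper), (2.1) can be computed directly for real-valued y:

```python
# Sketch of Shepard's interpolation (2.1) for real-valued y.
# "points" is a list of (x_i, y_i) pairs, where each x_i is a tuple of coordinates.
import math

def shepard(points, x, p=2):
    num = den = 0.0
    for xi, yi in points:
        d = math.dist(x, xi)          # Euclidean distance ||x - x_i||
        if d == 0:
            return yi                 # x coincides with a data point
        w = d ** (-p)
        num += w * yi
        den += w
    return num / den

# e.g. shepard([((0.0,), 1.0), ((1.0,), 3.0)], (0.5,)) returns 2.0
```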

Lazzaro and Montefusco [11] point out that this scheme has the advantage of full independence from the space dimension but the disadvantages of low reproduction quality and high computational cost. By independence from the space dimension


we mean that the d-dimensional values of the point x ∈ R^d do not enter into (2.1) themselves, but only the distances between the points x and x_i. This independence from the space dimension makes the scheme of interest in CBR applications, since we show in Section 4 that we can generalise the method to operate only on distances. Often in CBR, we do not necessarily have embedding dimensions, as we shall show in Section 4. However, the global nature of the function and its concomitant high computational cost make a local form of the method more suitable. Franke and Nielson [10] have proposed a modification of Shepard's method which offers improved reproduction quality and reduced complexity. In this method, known as the modified quadratic Shepard's method, the influence of each data point in the data set is confined to interpolation points within a radius of the data point.

For case-based reasoning the global nature of Shepard's method is not a problem, since we are only looking for interpolation over a small set of retrieved data points. The retrieval phase of the cycle has effected the localisation of the method. In effect the retrieval phase ensures that only data points within a radius of the interpolation point will influence the interpolation value.

3. An Error Function

The objective of this paper is to present a retrieval method which will apply to a general class of problem and solution domains. The assumption here is that a distance metric d_x(x, x_i) is defined on the problem domain X and d_y(y, y_i) on the solution domain Y.

Shepard's method applies to domains where X = R^d and Y = R, and we would like to apply it to more general domains, wherever a distance metric may be defined. Fortunately, the method is already independent of the space R^d, inheriting only the Euclidean distance ‖x − x_i‖ from R^d. We propose to generalise the Euclidean distances of Shepard's method to use general distance functions d_x(x, x_i) and d_y(y, y_i).

For the solution space Y, we first need to express Shepard's interpolant in terms of d_y(y, y_i). To do this we may note that y = f(x) in (2.1) is the value of y which minimizes the error function:

$$ I(y) = \sum_{i=1}^{n} \|y - y_i\|^2 \, \|x - x_i\|^{-p} \Big/ \sum_{i=1}^{n} \|x - x_i\|^{-p} $$

To see this, we find the extrema of I(y) as:

$$ \partial I/\partial y = 2\sum_{i=1}^{n} (y - y_i)\, \|x - x_i\|^{-p} \Big/ \sum_{i=1}^{n} \|x - x_i\|^{-p} = 0 \iff y \sum_{i=1}^{n} \|x - x_i\|^{-p} = \sum_{i=1}^{n} \|x - x_i\|^{-p}\, y_i $$


That this is a minimum follows from the positive definite form:

$$ \partial^2 I/\partial y^2 = 2\sum_{i=1}^{n} \|x - x_i\|^{-p} \Big/ \sum_{i=1}^{n} \|x - x_i\|^{-p} $$

The function I(y) depends only upon the Euclidean distances ‖y − y_i‖ and ‖x − x_i‖. In order to generalise the method completely, we propose to replace the Euclidean distances by the general distances d_y(y, y_i) and d_x(x, x_i) in the error function:

$$ I(y) = \sum_{i=1}^{k} d_y(y, y_i)^2 \, d_x(x, x_i)^{-p} \Big/ \sum_{i=1}^{k} d_x(x, x_i)^{-p} \qquad (3.1) $$

Here, the set {x_1, x_2, ..., x_k} are the k nearest neighbours in the problem space to the point x, and d_x(x, x_i) and d_y(y, y_i) are distances (or dissimilarity coefficients) on the domains x ∈ X, y ∈ Y, each satisfying:

d(a, b) ≥ 0, ∀a, b
d(a, a) = 0, ∀a
d(a, b) = d(b, a), ∀a, b

The interpolant value is the value y ∈ Y which minimizes the error function I.

This method is different from the Distance-Weighted Nearest Neighbour method. Although both of them are local forms of Shepard's method in continuous solution domains, DWNN is not such a local form in discrete domains. In these domains DWNN relies on a voting function (i.e., δ(y, y') = 1 if y = y' and δ(y, y') = 0 otherwise). However, GSNN relies on a distance metric, d_y(y, y_i), defined on the solution domain Y. In GSNN, we are now interested in the interpolation value y ∈ Y that minimizes I.

The Generalised Shepard Nearest Neighbour algorithm (GSNN) is given as follows:

Generalised Shepard Nearest Neighbour Algorithm

$$ \hat{f}(x_q) \leftarrow \arg\min_{y \in Y} \sum_{i=1}^{k} w_i \, d_y^{\,2}(y, f(x_i)) \qquad (3.2) $$

where $w_i = \frac{1}{d_x(x_q, x_i)^{p}}$.

Although (3.2) is somewhat similar in appearance to the DWNN formula (1.1), we should notice that the set Y in this algorithm is the set of all possible y-values, whereas in the formula (1.1) for DWNN, Y is the set of y-values in the retrieved set.
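The following sketch (ours, for illustration) implements this minimisation directly; the candidate solution set, the distance metric on solutions and the retrieved neighbours are all supplied by the caller.

```python
# Sketch of GSNN: choose the y, from the full set of candidate solution
# values, that minimises the distance-weighted error of (3.1)/(3.2).
def gsnn_predict(neighbours, solution_values, d_y, p=1):
    """neighbours: (d_x(x_q, x_i), y_i) pairs for the k retrieved cases.
    solution_values: every candidate y, not just those in the retrieved set.
    d_y: distance metric on the solution domain."""
    for dx, yi in neighbours:
        if dx == 0:                    # the query coincides with a stored case
            return yi
    def error(y):                      # numerator of (3.1); the denominator is constant in y
        return sum(d_y(y, yi) ** 2 * dx ** (-p) for dx, yi in neighbours)
    return min(solution_values, key=error)
```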


4. General Properties

We can get an idea of how the method works by first examining a simple case y = f(x), as illustrated in Fig. 1. Here X = R, Y = R and d_x(x, x_i), d_y(y, y_i) are absolute distances in R. We take p = 1, k = 2 and consider two retrieved cases x_1 and x_2 in the neighbourhood of the point x.

Figure 1. Interpolation using the function I(y) for y = f(x)

In fact, for this simple case the interpolation curve is a straight line between the points (x_1, y_1), (x_2, y_2). For x_1 < x < x_2, we have:

$$ I(y) = \big[(x_2 - x)(y - y_1)^2 + (x - x_1)(y - y_2)^2\big] / (x_2 - x_1) $$

The minimum value of I occurs when (x, y) lies on the straight line:

$$ (x_2 - x_1)\,y = (y_2 - y_1)\,x + (x_2 y_1 - x_1 y_2) $$

For smooth curves such as the one illustrated in Fig. 1, we expect good estimation by interpolation between the two retrieved cases. However, extrapolation is not likely to be as accurate as we move away from x_1 and x_2. Here the interpolation method gives an estimate asymptoting to (y_1 + y_2)/2 (the average of the y values), with no apparent tendency towards the true value of y = f(x).

The computational complexity of the method is equivalent to that of the nearest neighbour algorithm. Retrieval of k cases, once they are ordered by distance, involves little extra computation. The only computational overhead is the calculation of the minimal value of I(y), which requires of the order of kN evaluations if there are N nominal values for y.

5. Illustrative Example: Interpolation over Unordered Nominal Values

In this example we illustrate how the method works in detail. We choose the well known Iris dataset [7]. Although the Iris data set contains only continuous variables


in the problem domain, the solution space is a set of 3 nominal values, which is sufficient to illustrate how GSNN works.

The data set contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. We take the problem space X to be R^4, so that x = (x_1, x_2, x_3, x_4) is a point in problem space, where x_1 = sepal width, x_2 = sepal length, x_3 = petal width, x_4 = petal length. The solution space is Y = {setosa, versicolour, virginica}. For this problem, we need to define a distance metric in both the problem space X and the solution space Y. For the problem space we define distance according to a weighted sum of attributes. For convenience, we assign an equal weight of ¼ to each attribute, so that:

d_x(x, x') = ¼(|x_1 − x_1'| + ...)

For the Y space, we need to construct d_y(y, y'). In this test, we have used the distances between cluster centres (see Fig. 2) to represent the distances between the classes. These distances are shown in the following matrix:

              setosa    versicolour    virginica
setosa          0          0.35          0.49
versicolour    0.35         0            0.18
virginica      0.49        0.18           0


Figure 2. Principal component plot of the Iris dataset

To demonstrate how the method works, we take two cases, one from setosa and one from virginica:

x_1 = (4.4, 2.9, 1.4, 0.2), y_1 = setosa


x_2 = (7.2, 3.2, 6, 1.8), y_2 = virginica

We take as target the versicolour iris:

x = (5.5, 2.3, 4, 1.3), y = ?

Taking p = 1 and k = 2, the function I(y) is:

$$ I(y) = \sum_{i=1}^{2} d_y(y, y_i)^2 \, d_x(x, x_i)^{-1} \Big/ \sum_{i=1}^{2} d_x(x, x_i)^{-1} = \big((0.36)^{-1} d_y(y, \mathrm{setosa})^2 + (0.35)^{-1} d_y(y, \mathrm{virginica})^2\big) \big/ \big((0.36)^{-1} + (0.35)^{-1}\big) $$

Using the values for d_y in the matrix above, we have the following values for I(y):

I(setosa) = 0.1217, I(versicolour) = 0.0768, I(virginica) = 0.1184

Since I(versicolour) is minimum, we take y = versicolour as the estimated value.

This example shows an advantage of the interpolation method, in that it can correctly predict nominal values not represented in the case base itself. We show in Section 7 that this can be an advantage when new solutions need to be added to a case base. In fact, from the extreme case base with just two cases used in this example, the method correctly predicts 128 of the 150 irises in the dataset. The DWNN method can only predict the 100 setosa and virginica targets correctly. The full details of this example are given in [12].
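The worked example above can be reproduced in a few lines. The problem-space distances (0.36 and 0.35) and the cluster-centre distances are taken from the text; the script below (ours, for illustration) recomputes I(y) and recovers the same values.

```python
# Reproducing the worked Iris example: two retrieved cases (setosa at problem-space
# distance 0.36, virginica at 0.35) and the cluster-centre distances as d_y.
D_Y = {("setosa", "versicolour"): 0.35, ("setosa", "virginica"): 0.49,
       ("versicolour", "virginica"): 0.18}

def d_y(a, b):
    if a == b:
        return 0.0
    return D_Y.get((a, b), D_Y.get((b, a)))

neighbours = [(0.36, "setosa"), (0.35, "virginica")]   # (d_x, class) pairs
classes = ["setosa", "versicolour", "virginica"]

def I(y):                                              # error function (3.1) with p = 1
    num = sum(d_y(y, yi) ** 2 / dx for dx, yi in neighbours)
    den = sum(1.0 / dx for dx, _ in neighbours)
    return num / den

for y in classes:
    print(y, round(I(y), 4))   # setosa 0.1217, versicolour 0.0768, virginica 0.1184
print(min(classes, key=I))     # versicolour
```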

6. Test of the Method on a Simulated Case Base

We have conjectured that the capability of GSNN for predicting nominal values not in the retrieval set can be advantageous in improving performance. To examine how this applies in practice, we simulated case bases of varying density and structure, and used the method to estimate simulated target sets. As a basis for the simulation, we adapted the smoothly varying function:

$$ F(x, y) = \sin 2\pi y \cdot \sin \pi x \qquad (6.1) $$

used by Ramos and Enright [8] to test Shepard's method for interpolation over scattered data. We adapted (6.1) by discretising the function to give 21 nominal values, y_1, ..., y_21. These are the 21 integral values of the function:

y = Int(10 sin 2πx_1 · sin 2πx_2),

where Int() is the integer function.

Although these values y_1, ..., y_21 are in fact numeric, we have treated them as nominal throughout this experiment, and inherited a distance metric from the numeric values:

d_y(y_i, y_j) = |y_i − y_j|


In this way, we have treated the 21 values y_1, ..., y_21 as symbols, with no intrinsic order but with an externally imposed metric d_y(y, y'). Ramos and Enright tested Shepard's method on (6.1) with both regularly spaced and randomly spaced node sets, and we have followed this example in two tests. Test 6.1 uses regularly spaced cases at various case densities. This might represent a well-organised case base, where the cases had been selected from a large available pool. Test 6.2 uses randomly selected cases, and is intended to represent a disorganised case base. Cases (x_1, x_2, y) are constructed as:

(x_1, x_2, y = Int(10 sin 2πx_1 · sin 2πx_2)),   0 ≤ x_1, x_2 ≤ 1
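A sketch of how such simulated case bases might be generated is given below. It is an illustration only: the exact lattice layout within the unit square and the truncating behaviour assumed for Int() are our own choices, not details taken from the paper.

```python
# Sketch of the simulated case bases: each case maps a point (x1, x2) in the
# unit square to the nominal value Int(10 sin 2*pi*x1 * sin 2*pi*x2).
import math, random

def nominal_value(x1, x2):
    # int() truncates towards zero, which is assumed here for Int().
    return int(10 * math.sin(2 * math.pi * x1) * math.sin(2 * math.pi * x2))

def regular_case_base(n):
    """Test 6.1 style: an n-by-n lattice over the unit square (layout assumed)."""
    step = 1.0 / (n - 1)
    return [((i * step, j * step), nominal_value(i * step, j * step))
            for i in range(n) for j in range(n)]

def random_case_base(size, seed=0):
    """Test 6.2 style: the same number of uniformly random cases."""
    rng = random.Random(seed)
    return [((x1, x2), nominal_value(x1, x2))
            for x1, x2 in ((rng.random(), rng.random()) for _ in range(size))]
```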

Altogether there are 400 separate regions in the (x_1, x_2) plane, each mapping to a nominal value. In Test 6.1, cases were constructed over a regular square lattice, with 7², 10², 20², 25² and 30² points. These lattice points are not associated with the 400 regions. Fig. 3 shows the result of Test 6.1.

Figure 3. Correct predictions in estimating a test set of 1000 targets, for regular case bases

These results confirm that GSNN can out-perform both k-NN and DWNN for case bases with regular structure. In fact, the optimum value of k is 3 in this test for all three methods. We see that GSNN reached a performance of 80% correct predictions for a case base of only 400 cases, as against 50% and 30% for DWNN and k-NN respectively for the same case base.

In Test 6.2, we took case bases of the same size as those in Test 6.1, but this time generated x_1, x_2 randomly within the unit square. Fig. 4 shows the results of this test.


Figure 4. Correct predictions in estimating a test set of 1000 targets, for random case bases

The results show that more errors are recorded for random case bases than for regular case bases of equivalent size, whatever the value of k, although once again, k=3 is optimal. However, the improvement is not so marked as in the regular case base test of 6.1. In Section 4 above it was noted that the interpolation method does not work well for extrapolation. In Test 6.1, because of the regularity of the case base, all estimates are in fact interpolations, whereas in Test 6.2, we expect some extrapolations as well.

7. Test of the Method on the Travel Case Base

In this section, the interpolative method is tested on a benchmark case base from the travel domain [9]. The problem investigated here is that of predicting a hotel for a given package holiday. We divide the domain attributes into the problem domain X = {holiday type, destination region, duration, accommodation type, ...} and the solution domain Y = {Hotel}. The case base consists of 1024 package holidays. For the problem space we define distance according to a weighted sum of attributes with equal weight. For the Y space, we derive a metric on Y defined by a hotel's region and class of accommodation.

From the original case base we have chosen 1000 cases for the experiments. 300 cases are chosen randomly as target problems; these are unseen target problems, not in the case base. The remaining 700 cases are used to form experimental case bases. We divide the 700 cases into 7 case bases ranging in size from 100, 200, 300 up to 700. This enables us to examine the predictive power of each retrieval method, using various case base sizes, on the 300 unseen target problems. Following Smyth & McKenna [13], we use a similarity threshold as the criterion for correct prediction: if the predicted value is within the similarity threshold, that counts as a correct prediction. In the experiments below, we take the threshold as 100%.


Figure 5. Comparing the correct prediction accuracy (%) of retrieval methods on 300 unseen target problems

Fig. 5 shows the comparison of the correct prediction accuracy (%) of GSNN, DWNN and k-NN on the 300 unseen target problems. Details of these test results are available in [14]. The experiment shows clearly that GSNN out-performs both DWNN and k-NN, due chiefly to its ability to correctly predict nominal values not in the retrieval set, or even in the case base. We shall defer a detailed analysis of the results until the next section, in which we see that the efficiency can be further improved by using a different retrieval algorithm.

8. Interpolation Using GSNN with a Diverse Retrieval Set

In Section 6, we have shown that GSNN works better in a regular case base than in a random case base, and that this is connected with the general property that the algorithm performs better at interpolation than extrapolation. Analysis of false predictions from GSNN shows in fact that it does not predict well when the members of the retrieved set are close together. In this case we can fall into the extrapolation trap, the target being outside the interpolation points. This points to the importance of the retrieval set for interpolation. The experiment described in Section 7 simply used the k nearest neighbours for interpolation. However, Smyth & McGinty [15] and Smyth & McClave [16] have shown that selecting a more diverse set can improve the efficiency of recommendation systems. In this section, we show that use of a diversity algorithm can also improve the efficiency of interpolation systems.


In this experiment, we use the bounded-greedy diversity technique proposed by Smyth and McGinty [15] to generate a diverse set of candidate cases. We then used the diverse set for interpolation, using DWNN and GSNN. The results for various case base sizes are given in Fig. 6, which contrasts the two methods using both diverse retrieval sets and nearest neighbour retrieval sets.


Figure 6. Comparing the correct prediction accuracy (%) of retrieval methods using both diverse retrieval sets and nearest neighbour retrieval sets, on 300 unseen target problems

Here we see that GSNN benefits considerably from the use of a diverse retrieval algorithm. DWNN however improves only marginally. This is because DWNN operates using distance weighted voting that is usually dominated by the nearer neighbours. Diverse sets will naturally include more distant neighbours, which will play little part in the voting. GSNN on the other hand does not suffer from this disadvantage.
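The bounded-greedy selection used here can be sketched as follows. This is our paraphrase of the published technique [15, 16], not code from the paper: sim is assumed to be a problem-space similarity function returning values in [0, 1], and each selection step weighs similarity to the query against diversity from the cases already chosen.

```python
# Sketch of bounded-greedy diverse retrieval (after [15, 16]): take the
# bound*k most similar cases, then greedily pick k of them, each time
# maximising similarity-to-query times relative diversity from those already chosen.
def bounded_greedy(query, cases, k, sim, bound=2):
    candidates = sorted(cases, key=lambda c: sim(query, c), reverse=True)[:bound * k]
    selected = []
    while candidates and len(selected) < k:
        def quality(c):
            if not selected:
                return sim(query, c)
            rel_div = sum(1 - sim(c, s) for s in selected) / len(selected)
            return sim(query, c) * rel_div
        best = max(candidates, key=quality)
        selected.append(best)
        candidates.remove(best)
    return selected
```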

Analysis of the results shows that GSNN has two advantages. The first is that it can predict solutions that are not in the case base. In this example, GSNN managed to predict correctly hotels for which there were no holiday packages in the case base. In column 3 (i.e., New Hotels) of Table 1, we see the statistics on new hotel predictions; DWNN cannot predict any of these correctly. Notice that these decrease as the case base size increases. This is because the number of new hotels in the test set decreases as the case base size increases.

The second advantage is that GSNN can correctly predict "old" hotels which are not in the retrieval set. Once again, DWNN cannot predict these. Column 2 of Table 1 (i.e., Old Hotels) shows the number of "old" hotels correctly predicted by GSNN which were incorrectly predicted by DWNN. Contrast this with column 4, which shows the number of "old" hotels that are correctly predicted by DWNN but incorrectly


by GSNN. Column 5 shows the number of "old" hotels that are correctly predicted by both GSNN and DWNN.

Table 1. Comparing the number of correct predictions of old hotels and new hotels for GSNN and DWNN using diverse retrieval sets, on 300 unseen target problems

No of Cases   Correct by GSNN, not DWNN   Correct by GSNN, not DWNN   Correct by DWNN, not GSNN   Correct by both GSNN and DWNN
              (Old Hotels)                (New Hotels)                (Old Hotels)                (Old Hotels)
100                 5                          35                          16                          38
200                21                          28                          15                          50
300                28                          22                           9                          54
400                44                          24                          12                          59
500                60                          27                          12                          60
600                61                          22                          12                          71
700                63                          16                          14                          76

9. Conclusion

In this paper, we have proposed a method for interpolation over nominal values. The method assumes only that a distance or dissimilarity metric is defined over the problem domain (the independent variable) and over the solution domain (the dependent variable). The method has an advantage for CBR in that it is applicable to case bases with nominal values in the problem and solution domains where no natural ordering exists. Previous attempts at interpolation have assumed such an ordering. The method generalises Shepard's interpolation method by expressing it in terms of the minimization of a function I(y). This function relies only on distance metrics defined over the problem and solution spaces.

We have tested the method on three test problems: the well studied Irises problem; a benchmark simulated nominal case base; and a benchmark CBR case base from the travel domain. The examples studied indicate that GSNN can be more efficient as an adaptation engine than other nearest neighbour methods. We have also shown that the method can work well in collaboration with a diverse retrieval algorithm. Once again experiments show that it can out-perform DWNN interpolation on a benchmark CBR case base.

References

1. Knight B. and Woon F. L. Case Base Adaptation Using Solution-Space Metrics. Proceedings of the 18th International Joint Conference on Artificial Intelligence IJCAI-03, Acapulco, Mexico, 9-15 August. Morgan Kaufmann, San Francisco, CA, 2003, pp 1347-1348
2. Cover T. and Hart P. Nearest Neighbour Pattern Classification. IEEE Transactions on Information Theory 1967; 13:21-27
3. Mitchell T. Machine Learning. McGraw-Hill Series in Computer Science, WCB/McGraw-Hill, USA, 1997
4. Shepard D. A Two-dimensional Interpolation Function for Irregularly Spaced Data. Proceedings of the 23rd National Conference, ACM, 1968, pp 517-524
5. Wilson D. R. and Martinez T. R. Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research 1997; 6:1-34
6. Stanfill C. and Waltz D. Toward Memory-Based Reasoning. Communications of the ACM 1986; 29:1213-1228
7. Fisher R. A. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 1936; 7, Part II, 179-188
8. Ramos G. A. and Enright W. Interpolation of Surfaces over Scattered Data. Visualization, Imaging and Image Processing VIIP2001, Proceedings of IASTED, Marbella, Spain, 3-5 September, ACTA Press, 2001, pp 219-224
9. Lenz M., Burkhard H-D. and Bruckner S. Applying Case Retrieval Nets to Diagnostic Tasks in Technical Domains. Proceedings of the 3rd European Workshop on Case-Based Reasoning EWCBR-96, Lausanne, Switzerland, November. Springer-Verlag, Berlin, 1996, pp 219-233 (Lecture Notes in Artificial Intelligence no. 1168)
10. Franke R. and Nielson G. Smooth Interpolation of Large Sets of Scattered Data. International Journal for Numerical Methods in Engineering 1980; 15:1691-1704
11. Lazzaro D. and Montefusco L. B. Radial Basis Functions for the Multivariate Interpolation of Large Scattered Data Sets. Journal of Computational and Applied Mathematics 2002; 140:521-536
12. www.gre.ac.uk/~wf05/iris/iristestresults/
13. Smyth B. and McKenna E. Modelling the Competence of Case-Bases. Proc of the 4th European Workshop on Case-Based Reasoning, EWCBR-98, Dublin, Ireland, September. Springer-Verlag, Berlin, 1998, pp 208-220 (Lecture Notes in Artificial Intelligence no. 1488)
14. www.gre.ac.uk/~wf05/travel/traveltestresults/
15. Smyth B. and McGinty L. The Power of Suggestion. Proceedings of the 18th International Joint Conference on Artificial Intelligence IJCAI-03, Acapulco, Mexico, 9-15 August. Morgan Kaufmann, San Francisco, CA, 2003, pp 127-132
16. Smyth B. and McClave P. Similarity vs. Diversity. Proc of the 4th International Conference on Case-Based Reasoning, ICCBR-01, Vancouver, BC, Canada, July/August. Springer-Verlag, Berlin, 2001, pp 347-361 (Lecture Notes in Artificial Intelligence no. 2080)


Automating the Discovery of Recommendation Rules

David McSherry

School of Computing and Information Engineering, University of Ulster, Coleraine BT52 1SA, Northern Ireland

Abstract

We present techniques for the discovery of recommendation rules that describe the behaviour of a recommender system in localised areas of the product space. Potential uses of the discovered rules include assessing the performance of the system in terms of recommendation efficiency and solution quality. For example, the discovered rules may reveal potential efficiency gains that might be achieved with an alternative recommendation strategy. We also present an efficient algorithm for automating the discovery of recommendation rules in nearest-neighbour (NN) retrieval, the standard approach to product recommendation in case-based reasoning (CBR).

1 Introduction

Balancing the trade-off between recommendation efficiency and solution quality is an important challenge in CBR approaches to product recommendation [1-3]. Recently there has been considerable research interest in increasing the efficiency of recommendation dialogues relative to the standard NN approach in which the recommended case is the one that is most similar to a query representing the preferences of the user [4-8]. However, the potential benefits must be balanced against the possible effects on solution quality [5, 7-8].

In this paper, we present techniques for the discovery of recommendation rules that describe the behaviour of a recommender system in localised areas of the product space. Potential uses of the discovered rules include assessing the system's performance in terms of efficiency and solution quality. For example, the discovered rules may reveal potential efficiency gains that might be achieved with an alternative recommendation strategy. They may also be useful for identifying regions of the product space in which the case library is lacking in coverage. We also show how recommendation rules can be used to assess the performance of a system that bases its recommendations on a subset of the user's preferences relative to one that takes account of all the user's preferences.


The following is a valid recommendation rule for a restaurant recommender system that always recommends "Pizza Italiano" when the user is looking for an inexpensive Italian restaurant:

if cuisine = Italian and price range = economy then Pizza Italiano

However, it is not a valid recommendation rule if it is possible that Pizza Italiano will not be recommended when the system is informed of additional preferences of the user such as the preferred location of the restaurant.

We will refer to the number of conditions in a recommendation rule as the length of the rule. The discovery of a recommendation rule with a small number of conditions may indicate that the recommended case is located in a region of the product space that is sparsely represented by the available cases. It also means that some of the system's recommendations are being made on the basis of a subset of the user's preferences. But it does not necessarily follow that the potential efficiency gains suggested by the rule are actually being achieved by the system. In a form-based interface, for example, the user's preferences with respect to all the case attributes may be elicited before the retrieval of recommended cases.

On the other hand, the discovery of recommendation rules showing that important attributes are often ignored by the system may give rise to concerns about the quality of its recommendations. In the following section, we present an objective approach to assessing the quality of the recommendations "covered" by a recommendation rule.

An important point to note is that a recommendation rule is a description, rather than a prediction, of a system's behaviour given the cases available in the case library. In CBR approaches to product recommendation, the system's recommendations may depend on importance weights assigned to the case attributes, which can often be adjusted to reflect the personal priorities of the user. However, we assume in this paper that the recommendation rules used to describe a system's behaviour are based on importance weights, if any, that are the same for all users, or chosen by the investigator to represent the priorities of a "typical" user.

In Sections 2 and 3, we examine recommendation rules in more detail and consider possible techniques for their discovery in different recommendation strategies. In Section 4, we present an algorithm for automating the discovery of recommendation rules in NN retrieval, the standard approach to product recommendation in CBR. In Section 5, we present the results of applying our discovery algorithm to a variety of case libraries, including a standard benchmark in the travel domain containing more than 1,000 cases. Our conclusions are presented in Section 6.

2 Recommendation Rules

Table 1 shows an example case library in the property domain that we use to illustrate our discussion of recommendation rules. If we assume that all possible values of the case attributes are represented in the case library, then there are 239 possible queries in the example product space apart from the empty query. We will refer to a query in which preferred values are specified for all the case attributes as


a full-length query. In the example product space, the number of possible full-length queries is 3 x 3 x 4 x 2 = 72.

In general, the cases recommended in response to a given query will depend on the available cases and the system's recommendation strategy. We will denote by rCases(Q, S) the cases recommended for a given query Q when a recommendation strategy S is applied to a given case library. In NN retrieval, it is not unusual for more than one case to be maximally similar to the user's query, in which case we assume that all such cases are equally recommended. That is,

rCases(Q, NN) = {C : Sim(C, Q) ≥ Sim(C°, Q) for all cases C°}

We will refer to a rule R of the form:

if a_1 = v_1 and a_2 = v_2 ... and a_k = v_k then C

as a recommendation rule for a given recommender system if C is one of the cases recommended by the system for every query that includes the preferences on the left-hand side (LHS) of the rule. More formally, we say that R is a recommendation rule for the system if:

C ∈ rCases(Q, S) for every query Q such that conditions(R) ⊆ Q

where S is the system's recommendation strategy. We say that a given query Q is covered by R if conditions(R) ⊆ Q.

A possible recommendation rule for a recommender system based on the example case library in Table 1 might be:

Rule 1. if beds = 2 and style = detached then Case 7

It can be seen to cover 12 of the 239 possible queries in the product space, including 6 full-length queries. It is interesting to note that preferred values are specified for only two of the four case attributes in Rule 1, which means that Case 7 is always recommended whenever the user expresses a preference for a 2-bedroom detached property, regardless of her preferences with respect to location and reception rooms. In view of the importance often associated with location in the property domain, the discovery of such a recommendation rule may give rise to concerns about the quality of the system's recommendations in the area of the product space covered by the rule.

As the standard approach to product recommendation in CBR, NN retrieval based on a full-length query representing the user's preferences with respect to all the case attributes provides an appropriate baseline for assessing the quality of a recommendation based on a subset of the user's preferences. The same principle can be used to assess the quality of a given recommendation rule. More specifically, we measure the accuracy of the rule relative to the results produced by NN retrieval when applied to all the full-length queries that are covered by the rule.


Table 1. An example case library in the property domain

Case No.   Location   Style      Bedrooms   Reception Rooms
1          A          detached   5          3
2          A          terraced   4          2
3          B          detached   4          2
4          C          semi       4          3
5          B          terraced   2          2
6          C          semi       3          3
7          A          detached   2          2
8          A          detached   3          3
9          B          semi       3          2

Definition 1. The accuracy of a recommendation rule R is the percentage of full-length queries Q that it covers for which conclusion(R) ∈ rCases(Q, NN).

By definition, the accuracy of a recommendation rule for a system based on NN retrieval, whether or not the system's recommendations are based on full-length queries, is 100%. For a system that uses a recommendation strategy other than NN retrieval, the accuracy of a given recommendation rule R can be determined by submitting all full-length queries covered by R to a recommender system based on NN retrieval and counting the number of occasions on which conclusion(R) is one of the recommended cases. Of course this exhaustive approach is possible only in a finite product space.
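The exhaustive check just described can be sketched directly (this is an illustration, not the discovery algorithm of Section 4). The nn_retrieve function, which returns the set of maximally similar cases for a query, is assumed to be supplied by the system under test.

```python
# Sketch of the exhaustive accuracy check for a recommendation rule:
# enumerate every full-length query covered by the rule and count those for
# which the rule's conclusion is among the NN-recommended cases.
from itertools import product

def rule_accuracy(rule_conditions, rule_conclusion, attribute_values, nn_retrieve):
    """rule_conditions: attribute -> required value (the LHS of the rule).
    attribute_values: attribute -> list of all possible values.
    nn_retrieve: query -> set of maximally similar cases (assumed given)."""
    free = [a for a in attribute_values if a not in rule_conditions]
    covered = correct = 0
    for combo in product(*(attribute_values[a] for a in free)):
        query = {**rule_conditions, **dict(zip(free, combo))}   # a covered full-length query
        covered += 1
        if rule_conclusion in nn_retrieve(query):
            correct += 1
    return 100.0 * correct / covered

# For Rule 1 (beds = 2, style = detached) the free attributes are location and
# reception rooms, giving 3 x 2 = 6 full-length queries to check.
```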

As we shall see in Section 3, Case 7 is one of the cases recommended in NN retrieval for only one of the six full-length queries covered by Rule 1. The accuracy of Rule 1 is therefore only 17%, which is perhaps unsurprising given that it ignores the user's preference with respect to location.

3 Recommendation Strategies

We now examine some of the recommendation strategies used in CBR recommender systems, and discuss possible techniques for the discovery of recommendation rules in the two most common strategies.

3.1 Nearest Neighbour Retrieval

We assume that the similarity of any case C to a given query Q over a subset A_Q of the case attributes A is defined to be:

$$ Sim(C, Q) = \sum_{a \in A_Q} w_a \, sim_a(C, Q) $$


where for each a ∈ A, w_a is the importance weight associated with a and sim_a(C, Q) is a local measure of the similarity of π_a(C), the value of a in C, to π_a(Q), the preferred value of a. We also assume that for each a ∈ A, 0 ≤ sim_a(C, Q) ≤ 1 and sim_a(C, Q) = 1 if and only if π_a(C) = π_a(Q). Often in practice, the overall similarity score is divided by the sum of the importance weights to give a normalised similarity score NSim(C, Q) in the range from 0 to 1. When discussing actual similarity scores, we will show only the normalised scores.

To apply NN retrieval to our example case library in Table 1, we must first assign importance weights to the attributes and define measures for assessing the similarity of a given case to a target query. We will assume that the importance weights assigned to location, style, bedrooms and reception rooms are 4, 3, 2, and 1 respectively.

As is often the case in practice, we define the similarity of two values x and y of a numeric attribute such as bedrooms or reception rooms to be:

sim_a(x, y) = 1 - \frac{|x - y|}{\max(a) - \min(a)}

where min(a) and max(a) are the minimum and maximum values of the attribute in the case library. Since the number of bedrooms ranges from 2 to 5 in the example case library, the similarity of two values that differ by one is 0.67.

Our similarity measure for style (det, sem, ter) is equivalent to applying our similarity measure for numeric attributes to the number of adjoining buildings (0, 1, 2). For example:

sim_style(det, det) = 1        sim_style(det, sem) = 0.5        sim_style(det, ter) = 0

Finally, our similarity measure for location assigns a similarity score of 1 if the two locations are the same and 0 if they are not the same.
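As a concrete illustration of these measures, the following minimal sketch computes the normalised similarity NSim for the example case library, assuming the attribute names loc, style, beds and recs, the importance weights 4, 3, 2 and 1, and the value ranges from Table 1.

# Importance weights and value ranges assumed from the example in the text.
WEIGHTS = {"loc": 4, "style": 3, "beds": 2, "recs": 1}
RANGES = {"beds": (2, 5), "recs": (2, 3)}
ADJOINING = {"detached": 0, "semi": 1, "terraced": 2}  # number of adjoining buildings

def sim_numeric(x, y, lo, hi):
    """Local similarity for a numeric attribute, scaled by its range in the case library."""
    return 1.0 - abs(x - y) / (hi - lo)

def local_sim(attribute, case_value, preferred_value):
    """Local similarity measures for the property domain described above."""
    if attribute == "loc":
        return 1.0 if case_value == preferred_value else 0.0
    if attribute == "style":
        return sim_numeric(ADJOINING[case_value], ADJOINING[preferred_value], 0, 2)
    return sim_numeric(case_value, preferred_value, *RANGES[attribute])

def nsim(case, query):
    """Weighted similarity over the query attributes, normalised by the sum of all weights."""
    score = sum(WEIGHTS[a] * local_sim(a, case[a], v) for a, v in query.items())
    return score / sum(WEIGHTS.values())

# Example: Case 2 from Table 1 against the query loc = A, style = terraced, beds = 5
# reproduces the 0.83 score derived later in this section.
case2 = {"loc": "A", "style": "terraced", "beds": 4, "recs": 2}
print(round(nsim(case2, {"loc": "A", "style": "terraced", "beds": 5}), 2))  # 0.83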

One recommendation rule for NN retrieval in the example case library is:

Rule 1. if loc = A and style = terraced and beds = 4 and recs = 2 then Case 2 (1)

As the conditions of Rule 1 exactly match the description of Case 2, it is clear that no other case can be more similar to the only possible query that includes those conditions. The figure in brackets after the rule is the number of full-length queries that it covers. Like any recommendation rule in which preferred values are specified for all the case attributes, Rule 1 covers only one full-length query.

Less obviously, the following is also a valid recommendation rule for NN retrieval in the example case library:

Rule 2. if loc = A and style = terraced then Case 2 (8)

This is a more interesting rule as it means that if the user is looking for a terraced property in location A, then Case 2 will always be recommended whatever the user's preferences with respect to bedrooms and reception rooms.

To confirm that Rule 2 is a NN recommendation rule, we need to verify that Case 2 will be one of the cases recommended for any query that includes the preferences on the LHS of the rule. For example, one of the queries covered by Rule 2 is:

Q: loc = A, style = terraced, beds = 5


As it differs from Q only in a single bedroom, Case 2 has a high similarity score:

NSim(Case 2, Q) = \frac{4 \times 1 + 3 \times 1 + 2 \times 0.67}{10} = 0.83

The next most similar case is Case 1. Though differing from Q only in style, it is unable to compete with Case 2:

NSim(Case 1, Q) = \frac{4 \times 1 + 3 \times 0 + 2 \times 1}{10} = 0.60

So Case 2 is in fact the only case recommended for the example query in NN retrieval. That Case 2 is one of the cases recommended in NN retrieval for any of the other 14 queries covered by Rule 2 can be verified in a similar way. Of course, this is a computationally intensive process and possible only in a finite product space. In Section 4, we present an approach to automating the discovery of NN recommendation rules that does not rely on exhaustive testing of covered queries and in which there is no assumption of a finite product space.

3.2 Conversational CBR

In conversational CBR (CCBR), a query is incrementally elicited in an interactive dialogue with the user, often with the aim of minimising the number of questions the user is asked before a conclusion is reached [9]. In product recommendation, the elicited query represents the preferences of the user with respect to one or more of the case attributes [7,10-11]. On each cycle of the recommendation process, the user is asked to specify a preferred value for the attribute considered most useful by the system. The query elicitation process continues until a predefined termination condition is satisfied.

In some CCBR approaches, a decision tree induced from descriptions of the available products is used to guide the retrieval of products that meet the requirements of the user. The product represented by each case is treated as a unique outcome class in the induction process [6], often with attribute selection based on information gain [12] as in the standard CBR approach to inductive retrieval [13].
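As a rough illustration of this induction step (a sketch only, not the implementation behind the figures that follow), treating each case as a unique outcome class means that the entropy of a set of k distinct cases is log2 k, so information gain over a candidate split reduces to a simple expression over the partition sizes:

import math

def information_gain(cases, attribute):
    """Information gain for splitting `cases` on `attribute`, treating each case as a
    unique outcome class (so the entropy of a subset of k cases is log2(k))."""
    n = len(cases)
    partitions = {}
    for case in cases:
        partitions.setdefault(case[attribute], []).append(case)
    remainder = sum((len(p) / n) * math.log2(len(p)) for p in partitions.values())
    return math.log2(n) - remainder

def best_attribute(cases, attributes):
    """Attribute with the highest information gain, as used to grow the decision tree."""
    return max(attributes, key=lambda a: information_gain(cases, a))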

Figure 1 shows a decision tree induced from the example case library in Table 1 with attribute selection based on information gain. Compared to approaches that require the user to specify preferred values for all the case attributes, the decision tree offers a clear advantage in terms of recommendation efficiency. In a recommender system based on the decision tree, users will never be asked more than two questions. As we shall see, however, the trade-offs in terms of solution quality and coverage are unlikely to be acceptable.

The decision tree can be regarded as a collection of recommendation rules:

Rule 1. if beds = 2 and style = terraced then Case 5 (6, 33)
Rule 2. if beds = 2 and style = detached then Case 7 (6, 17)
Rule 3. if beds = 3 and loc = A then Case 8 (6, 33)
Rule 4. if beds = 3 and loc = B then Case 9 (6, 33)
Rule 5. if beds = 3 and loc = C then Case 6 (6, 100)


Figure 1. Decision tree induced from the example case library with information gain as the attribute-selection criterion

Rule 6. if beds = 4 and style = terraced then Case 2 (6, 33)
Rule 7. if beds = 4 and style = semi then Case 4 (6, 33)
Rule 8. if beds = 4 and style = detached then Case 3 (6, 33)
Rule 9. if beds = 5 then Case 1 (18, 17)

The first figure in brackets after each recommendation rule is the number of full-length queries that it covers and the second is its accuracy relative to NN retrieval. For example, Rule 9 covers 18 full-length queries and its accuracy is 17%. This means that Case 1 is one of the cases recommended in NN retrieval for only three of the 18 full-length queries covered by Rule 9. The other eight rules also have low accuracy relative to NN, with the exception of Rule 5, which agrees with NN on all six of the full-length queries that it covers. The overall accuracy of the decision tree relative to NN is only 33%, which does not seem an acceptable trade-off for the benefits it offers in terms of recommendation efficiency.

Another important limitation of the example decision tree is related to the fact that no two rules generated from a decision tree can cover the same full-length query. The total number of full-length queries covered by the recommendation rules, and hence by the decision tree, is 66. As there are 72 full-length queries in the product space, this means that there are six full-length queries that the decision tree fails to cover. Its failure to provide full coverage of the product space can also be seen from Figure 1. If the user is looking for a 2-bedroom semi-detached property, then the decision tree is unable to offer a recommendation. The problem, of course, is that decision trees insist on exact matching and there is no such case in the case library.

The length of the recommendation rules may also be of interest from a maintenance perspective. A very short rule like Rule 9 indicates a region of the product space that is sparsely represented by the available cases. In fact, it can be seen from Table 1 that only one 5-bedroom property is available in the case library. For a larger decision tree, summary statistics such as maximum and average rule length may also be useful as a means of assessing recommendation efficiency.

3.3 Incremental Nearest Neighbour

Several authors have questioned the use of information gain as an attribute-selection criterion in product recommendation [5,7-8]. No account is taken of the relative importance of the case attributes, and the user's preferences with respect to attributes not mentioned in the recommendation dialogue are ignored. The result is that a product may be recommended simply because it is the only one that matches a requirement that the user considers to be of little importance.

Alternatives to information gain include the simVar measure proposed by Kohlmaier et al. [5]. However, to effectively address the trade-off between recommendation efficiency and solution quality, a recommender system must also be capable of recognising when the dialogue can safely be terminated without compromising solution quality. Naive approaches such as terminating the dialogue when a similarity threshold is reached cannot guarantee that a better solution will not be found if the dialogue is allowed to continue. In recent work [7,14] we presented a CCBR approach to product recommendation that uniquely combines an effective approach to reducing the length of recommendation dialogues with a mechanism for ensuring that the dialogue is terminated only when it is certain that the recommendation will be the same no matter how the user chooses to extend her query.

A key role in the approach, which we will refer to here as incremental nearest neighbour (iNN), is played by the concept of case dominance that we now define.

Definition 2. A given case C2 is dominated by another case C1 with respect to a query Q if Sim(C2, Q) < Sim(C1, Q) and Sim(C2, Q*) < Sim(C1, Q*) for all possible extensions Q* of Q.

One reason for the importance of case dominance in product recommendation is that if a given case C2 is dominated by another case C1 then the product represented by C2 can be eliminated. It can also be seen that if the case that is most similar to the user's current query dominates all other cases, then there is no need for the query to be further extended, as the user's preferences with respect to any remaining attributes cannot affect the recommendation. The criterion used to identify dominated cases in iNN is stated in the following theorem, which assumes that for each a ∈ A, the distance measure d_a = 1 − sim_a satisfies the triangle inequality [7].


Theorem 1. A given case C2 is dominated by another case C1 with respect to a query Q if and only if:

Sim(C_2, Q) + \sum_{a \in A - A_Q} w_a (1 - sim_a(C_1, C_2)) < Sim(C_1, Q)
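Read as code, the dominance test of Theorem 1 might look like the minimal sketch below, assuming the attribute weights and local similarity measures are supplied by the caller and that Sim is the unnormalised weighted sum used in the theorem.

def weighted_sim(case, query, weights, local_sims):
    """Unnormalised weighted similarity of a case to a (possibly partial) query."""
    return sum(weights[a] * local_sims[a](case[a], v) for a, v in query.items())

def dominated(c2, c1, query, weights, local_sims):
    """Theorem 1: c2 is dominated by c1 with respect to `query` if c2's current score,
    plus the most it could still gain on the attributes not in the query, falls short
    of c1's current score. Assumes each distance 1 - sim_a obeys the triangle inequality."""
    slack = sum(weights[a] * (1.0 - local_sims[a](c1[a], c2[a]))
                for a in weights if a not in query)
    return weighted_sim(c2, query, weights, local_sims) + slack < \
           weighted_sim(c1, query, weights, local_sims)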

An initial query entered by the user is incrementally extended in iNN by asking the user for the preferred values of attributes not mentioned in her initial query. Attribute selection is goal driven in that the attribute selected at any stage is the one with the potential to maximise the number of cases dominated by a case selected by the system as the target case.

In the following section we describe how the concept of case dominance on which iNN is based can also be used to guide the discovery of recommendation rules in NN retrieval.

4 Recommendation Rule Discovery

We now present an efficient algorithm for the discovery of recommendation rules in NN retrieval. Given a target case, our algorithm aims to construct a NN recommendation rule of the shortest possible length for the target case. As we shall see, an important role in the discovery process is played by the concept of case dominance [7]. First we note that the conditions in a recommendation rule can be regarded as a query. It can also be seen that for any case C and query Q, if Q then C is a NN recommendation rule for C if and only if C is one of the cases recommended in NN retrieval for all possible extensions Q* of Q. Note that here we consider the possible extensions of a given query to include the query itself.

The importance of case dominance in the discovery of NN recommendation rules can be seen from the following theorem.

Theorem 2. For any case C and query Q, if Q then C is a NN recommendation rule for C if the following conditions hold:

(a) C is an exact match for Q
(b) For any case C° that is not dominated by C with respect to Q, C° and C have the same values for all a ∈ A − A_Q

Proof. If conditions (a) and (b) are true, then as C is an exact match for Q, it is clear that C ∈ rCases(Q, NN). For any other extension Q* of Q, we can write Q* = Q ∪ Q', where Q' is the query consisting of those preferences in Q* that are not included in Q. For any case C' that is dominated by C with respect to Q, we know that Sim(C, Q*) > Sim(C', Q*). For any case C° that is not dominated by C with respect to Q, C° and C have the same values for all a ∈ A − A_Q and so Sim(C, Q*) = Sim(C, Q) + Sim(C, Q') ≥ Sim(C°, Q) + Sim(C°, Q') = Sim(C°, Q*). It follows that C ∈ rCases(Q*, NN) for any extension Q* of Q, and so if Q then C is a NN recommendation rule for C as required.


It follows from Theorem 2 that a NN recommendation rule for a target case C can be constructed simply by adding attribute-value pairs from the description of C to an initially empty query until any case that is not dominated by C has the same value for all the remaining attributes. However, the role of case dominance in our approach is not limited to providing a stopping criterion for the discovery process. It also provides the basis of an attribute selection criterion that aims to minimise the length of the discovered rule. At each stage of the discovery process, the attribute selected for addition to the query on the LHS of the rule (with its value in the target case) is the one that maximises the number of cases dominated by the target case with respect to the extended query.

algorithm RuleSeeker(C, Cases, Atts)
begin
    Q ← {}
    while not all_same(Cases, Atts) do
    begin
        a ← most_useful(Q, C, Cases, Atts)
        Q ← Q ∪ {a = π_a(C)}
        Atts ← Atts − {a}
        Cases ← Cases − {C′ : C′ is dominated by C with respect to Q}
    end
    return Q
end

Figure 2. Algorithm for the discovery of recommendation rules in NN retrieval

Our algorithm for the discovery of recommendation rules in NN retrieval, called RuleSeeker, is outlined in Figure 2. C is the target case for which a recommendation rule if Q then C is to be discovered. Initially containing all cases in the case library, Cases is the set of cases that are not currently dominated by the target case. The predicate all_same(Cases, Atts) succeeds if Atts (the set of remaining attributes) is empty, or Cases contains only the target case, or all cases in Cases have the same values for all attributes in Atts. The attribute a returned by most_useful is the one that maximises the number of cases in the current subset dominated by C with respect to Q ∪ {a = π_a(C)}. If two or more attributes are equally promising according to this criterion, RuleSeeker uses the importance weights associated with the case attributes as a secondary selection criterion. That is, it selects the most important of the equally promising attributes.
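A minimal Python rendering of this loop, reusing the weighted_sim and dominated sketches given after Theorem 1, is shown below. It is an illustrative sketch rather than the published implementation; the identity test `c is not target` and the dictionary representation of cases are assumptions made for brevity.

def rule_seeker(target, cases, attributes, weights, local_sims):
    """Sketch of RuleSeeker: grow the query Q on the LHS of the rule 'if Q then target',
    at each step choosing the attribute that dominates the most remaining cases
    (ties broken by importance weight)."""
    query = {}
    remaining = [c for c in cases if c is not target]
    atts = list(attributes)

    def undominated(att):
        # Cases that would still not be dominated if `att` were added to the query.
        q = dict(query, **{att: target[att]})
        return [c for c in remaining if not dominated(c, target, q, weights, local_sims)]

    def all_same():
        # Stop when no attributes remain, only the target remains, or all remaining
        # cases agree with each other on every remaining attribute.
        return not atts or not remaining or all(
            all(c[a] == remaining[0][a] for c in remaining) for a in atts)

    while not all_same():
        # Fewest undominated cases left = most cases dominated; -weight breaks ties.
        att = min(atts, key=lambda a: (len(undominated(a)), -weights[a]))
        remaining = undominated(att)
        query[att] = target[att]
        atts.remove(att)
    return query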

It can be seen from Theorem 2 that the rule discovered by RuleSeeker is in fact a NN recommendation rule for the target case. At each stage of the discovery process, RuleSeeker uses the dominance criterion from Theorem 1 to identify the cases that will be dominated by the target case if a given attribute (and its value in the target case) is added to the query on the LHS of the discovered rule. The complexity of the discovery process for a single rule is linear in the size of the case library.

When applied to the example case library in Table 1 with each case in turn as the target case, RuleSeeker discovers the following recommendation rules:

Rule 1. if loc = A and beds = 5 and style = detached then Case 1 (2)
Rule 2. if loc = A and style = terraced then Case 2 (8)
Rule 3. if loc = B and style = detached then Case 3 (8)
Rule 4. if loc = C and beds = 4 then Case 4 (6)
Rule 5. if loc = B and style = terraced then Case 5 (8)
Rule 6. if loc = C and beds = 3 then Case 6 (6)
Rule 7. if loc = A and style = detached and beds = 2 and recs = 2 then Case 7 (1)
Rule 8. if loc = A and style = detached and beds = 3 and recs = 3 then Case 8 (1)
Rule 9. if loc = B and style = semi then Case 9 (8)

For example, Rule 3 says that if the user is looking for a detached property in location B, then she can do no better than Case 3 regardless of her preferences with respect to bedrooms and reception rooms.

5 Experimental Results

We now present the results of applying our algorithm for the discovery of NN recommendation rules to three selected case libraries. Our investigation focuses on the length of the rules discovered by RuleSeeker in case libraries of realistic size. Below we describe the characteristics of the selected case libraries and reasons for including them in our experiments. Ranging in size from 120 to over 1,000 cases, they all have eight attributes and include both nominal and continuous attributes.

PC. The well-known PC case library [2] contains the descriptions of 120 personal computers. The attributes in this relatively small case library and weights assigned to them in our experiments are type (8), price (7), manufacturer (6), processor (5), speed (4), monitor size (3), memory (2), and hard disk capacity (1).

AutoMPG. Our second case library is based on the AutoMPG dataset from the UCI repository of machine learning databases [15]. We include this dataset because some of the attributes (e.g. year, origin, mpg, acceleration) are among those one might expect to see in a recommender system for previously-owned cars. Our case library of 392 cases includes all examples in the dataset apart from six examples that have missing values for one of the attributes. Attributes in the case library and importance weights assigned to them in our experiments are: year (8), origin (7), mpg (6), displacement (5), acceleration (4), horsepower (3), cylinders (2) and weight (1). The following is one of the NN recommendation rules discovered when RuleSeeker was applied to the AutoMPG case library:

if origin = Europe and year = 80 and acceleration = 21.8 and mpg = 30 then Mercedes-Benz 240d


Travel. This case library (www.ai-cbr.org) is a standard benchmark containing the descriptions of 1,024 holidays. Attributes in the case library and importance weights assigned to them in our experiments are price (8), month (7), region (6), persons (5), duration (4), type (3), accommodation (2), and transport (1).

Our first experiment compares the characteristics of the discovered recommendation rules in the selected case libraries. In each case library, we present each case to RuleSeeker as a target case and record the length of the discovered recommendation rule. Figure 3 shows the maximum, average and minimum length of the discovered rules for PC, AutoMPG and Travel.

Figure 3. Lengths of recommendation rules discovered in three case libraries (maximum, average and minimum rule lengths for PC, AutoMPG (N = 392) and Travel (N = 1024))

Figure 4. Lengths of recommendation rules discovered in the Travel case library (frequency of each rule length)

Average rule length ranges from 4.7 to 5.1 and is smallest for Travel, showing that average rule length does not necessarily increase as the size of the case library increases. On the other hand, Travel can be seen to have yielded recommendation rules ranging in length from 3 to 7, whereas the maximum length of the rules discovered in PC and AutoMPG is 6.

Our second experiment focuses on the distribution of rule length among the discovered recommendation rules in the Travel case library. Figure 4 shows the relative frequency with which each rule length occurs over all 1,024 discovered rules. By far the most common lengths are 4 and 5, together accounting for 90% of the discovered rules. Though rule length reaches a maximum length of 7 in this case library, this was the case for only two of the 1,024 discovered rules.

6 Conclusions

Recommendation rule discovery is a useful and novel approach to describing the behaviour of a recommender system in localised areas of the product space. For example, recommendation rules generated from a decision tree provide a complete description of the system's behaviour that can be used to assess its performance in terms of efficiency, coverage and solution quality. The inferior accuracy and coverage characteristics of the recommendation rules generated from an example decision tree highlight the potential limitations of decision tree approaches to product recommendation in which no account is taken of the relative importance of the case attributes.

We have also presented an efficient algorithm called RuleSeeker for automating the discovery of NN recommendation rules. Our algorithm's goal-driven approach to rule discovery enables the investigator to focus on a particular case (or product) and quickly obtain an answer to the important question:

When will this case be recommended?

The response provided by RuleSeeker is a set of preferences such that the selected case will always be recommended for any query that includes those preferences. Alternatively, RuleSeeker can be used to discover a NN recommendation rule for every case in the case library. In a system that makes no attempt to minimise the length of recommendation dialogues, the discovered rules may reveal potential gains in recommendation efficiency that could be achieved with a CCBR approach such as iNN [7].

A current limitation of RuleSeeker that we plan to address in future research is that only a single rule is discovered for a given target case. As there may be many recommendation rules for a target case, an important issue is how to identify those that are likely to be of most interest. A related issue concerns the generality of the discovered rules. The value of each attribute in the conditions of a discovered rule is its value in the target case. While this makes good sense for nominal and discrete attributes, conditions expressed in terms of exact values of continuous attributes, such as acceleration = 21.8, make the discovered rules seem unnatural and unnecessarily specific. In future research we plan to investigate techniques for the discovery of more general recommendation rules, for example with conditions expressed in terms of a preferred maximum or minimum value of a continuous attribute, or a range of preferred values.


References

1. Branting, L.K.: Acquiring Customer Preferences from Return-Set Selections. In: Aha, D.W., Watson, I. (eds.) Case-Based Reasoning Research and Development. LNAI, Vol. 2080. Springer-Verlag, Berlin Heidelberg (2001) 59-73
2. McGinty, L., Smyth, B.: Comparison-Based Recommendation. In: Craw, S., Preece, A. (eds.) Advances in Case-Based Reasoning. LNAI, Vol. 2416. Springer-Verlag, Berlin Heidelberg New York (2002) 575-58
3. McSherry, D.: Balancing User Satisfaction and Cognitive Load in Coverage-Optimised Retrieval. In: Coenen, F., Preece, A., Macintosh, A. (eds.) Research and Development in Intelligent Systems XX. Springer-Verlag, London (2003) 381-394
4. Doyle, M., Cunningham, P.: A Dynamic Approach to Reducing Dialog in On-Line Decision Guides. In: Blanzieri, E., Portinale, L. (eds.) Advances in Case-Based Reasoning. LNAI, Vol. 1898. Springer-Verlag, Berlin Heidelberg (2000) 49-60
5. Kohlmaier, A., Schmitt, S., Bergmann, R.: A Similarity-Based Approach to Attribute Selection in User-Adaptive Sales Dialogues. In: Aha, D.W., Watson, I. (eds.) Case-Based Reasoning Research and Development. LNAI, Vol. 2080. Springer-Verlag, Berlin Heidelberg (2001) 306-320
6. McSherry, D.: Minimizing Dialog Length in Interactive Case-Based Reasoning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (2001) 993-998
7. McSherry, D.: Increasing Dialogue Efficiency in Case-Based Reasoning Without Loss of Solution Quality. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (2003) 121-126
8. Schmitt, S.: simVar: a Similarity-Influenced Question Selection Criterion for e-Sales Dialogs. Artificial Intelligence Review 18 (2002) 195-221
9. Aha, D.W., Breslow, L.A., Muñoz-Avila, H.: Conversational Case-Based Reasoning. Applied Intelligence 14 (2001) 9-32
10. Göker, M.H., Thompson, C.A.: Personalized Conversational Case-Based Recommendation. In: Blanzieri, E., Portinale, L. (eds.) Advances in Case-Based Reasoning. LNAI, Vol. 1898. Springer-Verlag, Berlin Heidelberg (2000) 99-111
11. Shimazu, H., Shibata, A., Nihei, K.: ExpertGuide: A Conversational Case-Based Reasoning Tool for Developing Mentors in Knowledge Spaces. Applied Intelligence 14 (2001) 33-48
12. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1 (1986) 81-106
13. Watson, I.: Applying Case-Based Reasoning: Techniques for Enterprise Systems. Morgan Kaufmann, San Francisco (1997)
14. McSherry, D.: Explanation in Recommender Systems. In: Cunningham, P., McSherry, D. (eds.) Proceedings of the ECCBR-04 Workshop on Explanation in CBR (2004)
15. Blake, C., Merz, C.: UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, California (1998)


Incremental Critiquing

James Reilly, Kevin McCarthy, Lorraine McGinty, Barry Smyth

Adaptive Information Cluster*, Department of Computer Science, University College Dublin (UCD), Ireland.

*This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361

Abstract

Conversational recommender systems guide users through a product space, alternately making concrete product suggestions and eliciting the user's feedback. Critiquing is a common form of user feedback, where users provide limited feedback at the feature level by constraining a feature's value-space. For example, a user may request a cheaper product, thus critiquing the price feature. Usually, when critiquing is used in conversational recommender systems, there is little or no attempt to monitor successive critiques within a given recommendation session. In our experience this can lead to inefficiencies, on the part of the recommender system, and confusion on the part of the user. In this paper we describe an approach to critiquing that attempts to consider a user's critiquing history, as well as their current critique, when making new recommendations. We provide experimental evidence to show that this has the potential to significantly improve recommendation efficiency.

1 Introduction

Recommender systems are designed to alleviate the information overload problem [13] by assisting users to make choices that guide and inform the decision making process. They combine ideas from information retrieval, machine learning and user profiling, among others, to offer the user a more efficient search environment, one that is better suited to their needs and preferences, and one that helps them to locate what they are looking for more quickly and easily. Conversational case-based recommender systems guide users through a recommendation session by adopting a recommend-review-revise strategy [1, 6]. In short, during each cycle of a recommendation session a user is presented with a new recommendation and is offered an opportunity to provide feedback on this suggestion. On the basis of this feedback, the recommender system will revise its evolving model of the user's current needs in order to make further recommendations.

Considerable research effort has been invested in developing and evaluating different forms of feedback for conversational recommender systems and a variety of feedback alternatives have become commonplace. For example, value elicitation approaches ask the user specific questions about specific features (e.g. "what is your target price?") while preference-based feedback and ratings-based methods simply ask the user to indicate which product they prefer when presented with a small set of alternatives [6, 7, 15], or to provide ratings for these alternatives [13]. It is well known that different forms of feedback introduce different types of trade-offs when it comes to recommendation efficiency and user effort. For instance, value elicitation is a very informative form of feedback, but it requires the user to provide detailed feature-level information. In contrast, preference-based feedback is a far more ambiguous form of feedback but it requires only minimal user effort. One form of feedback that strikes a useful balance, in this regard, is critiquing [2, 3]. The user expresses a directional preference over the value-space for a particular product feature. For example in a PC recommender, a user might indicate that they are looking for a PC that has a "faster processor" than the current recommendation, "faster processor" being a critique over the processor speed feature of the PC case.

In our recent work we have explored a variety of ways of improving the efficiency of conversational recommenders that employ different forms of feedback, with our main focus on preference-based feedback and critiquing [6, 8, 14, 15]. In this paper we continue in this vein by proposing a new approach to critiquing called incremental critiquing. The motivation for this new approach stems from our observation that, when critiquing is used in conversational recommender systems, there is usually little or no attempt to monitor successive critiques within a given recommendation session. In our experience this can lead to a number of problems. For instance, users may, during the course of a recommendation session, change their mind about certain feature preferences (maybe temporarily), and so contradict earlier critiques that they may have applied. Our incremental critiquing strategy attempts to remedy this by maintaining a history of successive critiques, during a recommendation session, and by allowing these past critiques to influence future recommendation cycles. In this way new recommendations are chosen based not only on their ability to satisfy the current critique, but also on their compatibility with previous critiques. We describe a flexible approach for incorporating past critiques in a way that facilitates the type of feedback inconsistencies that tend to occur during real recommendation sessions. We further show that this incremental form of critiquing has the potential to deliver significant improvements in recommendation efficiency as well as providing users with more intuitive recommendations.

2 A Review of Critiquing

Critiquing, as a form of feedback, is perhaps best known by association with the FindMe recommender systems [2], and specifically the Entree restaurant recommender. The original motivation for critiquing included the need for a type of feedback that was simple for users to understand and apply, and yet informative enough to focus the recommender system. For instance, the Entree system presents users with a fixed set of directional critiques in each recommendation cycle. In this way users can easily request to see further suggestions that are

different in terms of some specific feature. For example, the user may request another restaurant that is cheaper or more formal, for instance, by critiquing its price and style features.

Figure 1: Unit critiques and compound critiques in a sample recommender.

Recently researchers have begun to look more closely at critiquing and have suggested a number of ways that this form of feedback can be improved. For example, [7, 8, 10] show how the efficiency of critiquing can be improved by incorporating diversity into the recommendation process and in Section 2.1 we will review further recent work. However, a number of challenges remain when it comes to the practicalities of critiquing. Certainly, the lack of any continuity-analysis of the critiques provided in successive cycles means that recommender systems may be misled by the inconsistent feedback provided by an uncertain user, especially during early recommendation cycles.

2.1 Unit vs Compound Critiques

The critiques described so far are all examples of what we term unit critiques. That is, they express preferences over a single feature; cheaper critiques a price feature, more formal critiques a style feature, for example. This ultimately limits the ability of the recommender to narrow its focus, because it is guided by only single-feature preferences from cycle to cycle.

An alternative strategy is to consider the use of what we call compound critiques [12]. These are critiques that operate over multiple features. This idea of compound critiques is not novel. In fact the seminal work of Burke et al. [2] refers to critiques for manipulating multiple features. They give the example of the sportier critique, in a car recommender, which operates over a number of different car features; engine size, acceleration and price are all increased. Similarly we might use a high performance compound critique in a PC recommender to simultaneously increase processor speed, RAM, hard-disk capacity and price features. Obviously compound critiques have the potential to improve recommendation efficiency because they allow the recommender system to focus on multiple feature constraints within a single cycle. In addition, it has also been argued that they carry considerable explanatory power because they help the user to understand common feature interactions [5, 12]; in the PC example above the user can easily understand that improved CPU and memory comes at a price.

In the past when compound critiques have been used they have been hard-coded by the system designer so that the user is presented with a fixed set of compound critiques in each recommendation cycle. These compound critiques may, or may not, be relevant depending on the cases that remain at a given point in time. For instance, in the example above the sportier critique would continue to be presented as an option to the user despite the fact that the user may have already seen and declined all the relevant car options. Recently, we have argued the need for a more dynamic approach to critiquing in which compound critiques are generated on-the-fly, during each recommendation cycle, by mining commonly occurring patterns of feature differences that exist in the remaining cases [12]. Figure 1 shows a Digital Camera recommender system that we have developed to explore usability and efficiency issues in a real-world setting. The screenshot shows two types of critiques. Standard unit critiques are presented alongside each individual feature, while k (where k=3) dynamically generated compound critiques appear below the current case description.
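The dynamic critiquing approach of [12] mines compound critiques with Apriori-style frequent-pattern mining; the following simplified sketch, with illustrative names and a plain frequency count in place of full pattern mining, conveys the underlying idea of counting how often combinations of unit critiques co-occur among the remaining cases.

from collections import Counter
from itertools import combinations

def critique_pattern(candidate, current, numeric_features):
    """Unit-critique pattern of a candidate case relative to the current recommendation,
    e.g. ('price', '<') meaning the candidate is cheaper than the current case."""
    pattern = []
    for f in numeric_features:
        if candidate[f] < current[f]:
            pattern.append((f, '<'))
        elif candidate[f] > current[f]:
            pattern.append((f, '>'))
    return pattern

def mine_compound_critiques(remaining_cases, current, numeric_features, size=2, k=3):
    """Count how often combinations of unit critiques co-occur among the remaining cases
    and return the k most frequent combinations as candidate compound critiques."""
    counts = Counter()
    for case in remaining_cases:
        for combo in combinations(critique_pattern(case, current, numeric_features), size):
            counts[combo] += 1
    return [combo for combo, _ in counts.most_common(k)]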

2.2 Consistency & Continuity

Regardless of the type of critiquing used (unit versus compound, or a mixture of both), or the manner in which the critiques have been generated (fixed versus dynamic), there are a number of important issues that need to be kept in mind from an application deployment perspective. This is especially important when it comes to anticipating how users are likely to interact with the recommender. In particular, users cannot be relied upon to provide consistent feedback over the course of a recommendation session. For example, many users are unlikely to have a clear understanding of their requirements at the beginning of a recommendation session. In our experience many users rely on the recommender as a means to educate themselves about the features of a product-space. As a result users may select apparently conflicting critiques during a session as they explore different areas of the product space in order to build up a clearer picture of what is available. For example, in a given cycle we may find a prospective digital camera owner looking for a camera that is cheaper than the current 500 euro recommendation, but later on may ask for a camera that is more expensive than another 500 euro recommendation. There are a number of reasons for this: perhaps the user has made a mistake; perhaps they are just interested in seeing what is available at the higher price; or perhaps their preferences have changed from the start of the session as they recognise the serious compromises that are associated with lower priced cameras [11].

Current recommender systems that employ critiquing tend to focus on the current critique and the current case, without considering the critiques that have been applied in the past. This, we argue, can lead to serious problems. For instance, if the recommender uses each critique to permanently filter-out incompatible product-cases, then a user may find that there are no remaining cases when they come to change their mind; having indicated a preference for sub-500 euro cameras early-on, the user will find the recommender unable to make recommendations for more expensive cameras in future recommendations.

As a result of problems like this, such a strict filtering policy is usually not employed by conversational recommender systems in practice. Instead of permanently filtering-out incompatible cases, irrelevant cases for a particular cycle tend to be temporarily removed from consideration, but may come to be reconsidered during future cycles as appropriate. Of course this strategy introduces the related problem of how past critiques should influence future recommendations, especially if they conflict with or strengthen the current critique.

Current implementations of critiquing tend to ignore these issues in the blind hope that users will either behave themselves (that is, responsibly select a sequence of compatible critiques in pursuit of their target product) or have the patience to backtrack over their past critiques in order to try alternatives. In our experience this approach is unlikely to prove successful. In real user trials common complaints have included the lack of consistency between successive recommendation cycles that arises because of these issues.

3 An Incremental Critiquing Strategy

Our response to the issues outlined above is to give due consideration to past critiques during future recommendation cycles using a variation of comparison-based recommendation [6] that we call incremental critiquing (see Figure 2). We do this by maintaining a critique-based user model which is made up of those critiques that have been chosen by the user so far. This model is used during recommendation to influence the choice of a new product case, along with the current critique. Our basic intuition is that the critiques that a user has applied so far provide a representation of their evolving requirements. Thus the set of critiques that the user has applied constitutes a type of user model (U = {U1, ..., Un}, where Ui is a single unit critique) that reflects their current preferences. At the end of each cycle, after the user has selected a new critique, we add this critique to the user model.


CB: case base, U: user model, q: query, t: chosen critique, r: recommended item

1.   define Incremental-Critiquing(q, CB)
2.       U ← {}
3.       t ← {}
4.       repeat
5.           r ← ItemRecommend(q, CB, t, U)
6.           t ← UserReview(r, CB)
7.           q ← QueryRevise(q, r)
8.           U ← UpdateModel(U, t, r)
9.       until UserAccepts(r)
10.
11.  define UserReview(r, CB)
12.      t ← user critique for some feature ∈ r
13.      CB ← CB − {r}
14.      return t
15.
16.  define QueryRevise(q, r)
17.      q ← r
18.      return q
19.
20.  define UpdateModel(U, t, r)
21.      if IsCompound(t) then t-set ← UnitCritiques(t)
22.      else t-set ← {t} endif
23.      for each t′ ∈ t-set do
24.          U ← U − contradict(U, t′, r)
25.          U ← U − refine(U, t′, r)
26.          U ← U + {⟨t′, r⟩}
27.      endfor
28.      return U
29.
30.  define ItemRecommend(q, CB, t, U)
31.      CB′ ← {c ∈ CB | Satisfies(c, t)}
32.      CB′′ ← sort CB′ by decreasing Quality
33.      r ← top item in CB′′
34.      return r

Figure 2: The Incremental Critiquing algorithm

3.1 Maintaining the User Model

Maintaining an accurate user model, however, is not quite as simple as storing a list of previously selected critiques. As we have mentioned above, some critiques may be inconsistent with earlier critiques. For example, in the case of a PC recommender, a user selecting a critique for more memory, beyond the 512MB of the recommended case, during one cycle, may later contradict themselves by indicating a preference for less memory than the 256 MB offered during a subsequent cycle. In addition, a user may refine their requirements over time. They might start, for example, by indicating a preference for more than 128 MB of RAM (with a more memory critique on a current case that offers 128 MB). Later they might indicate a preference for more than 256 MB of RAM with a more memory critique on a case that offers 256 MB. In consideration of the above our incremental critiquing strategy updates the user model by adding the latest critique only after pruning previous critiques so as to eliminate these sorts of inconsistencies (see lines 23-26 in Figure 2). Specifically, prior to adding a new critique all existing critiques that are inconsistent with it are removed from the user model. Also, all existing critiques, for which the new critique is a refinement, are removed from the model. Finally, it is worth mentioning how the user model is updated with compound critiques; so far our examples have assumed unit critiques. To keep things simple, we deal with compound critiques by splitting them up into their constituent unit critiques so that the update procedure then involves making a set of unit updates.
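A minimal sketch of this pruning step is given below, under the assumption that a critique is represented as a (feature, direction, value) triple over a numeric feature, e.g. ('memory', '>', 256); nominal and equality critiques would need extra handling.

def inconsistent(old, new):
    """Opposite directions on the same feature that no value can satisfy together,
    e.g. ('memory', '>', 512) followed by ('memory', '<', 256)."""
    (fo, do, vo), (fn, dn, vn) = old, new
    if fo != fn or do == dn:
        return False
    lower, upper = (vo, vn) if do == '>' else (vn, vo)
    return upper <= lower

def refines(new, old):
    """Same feature and direction, with the new critique at least as tight as the old."""
    (fn, dn, vn), (fo, do, vo) = new, old
    if fn != fo or dn != do:
        return False
    return vn >= vo if dn == '>' else vn <= vo

def update_model(model, critique, case):
    """Prune critiques that the new critique contradicts or refines, then record the
    new (critique, case) pair; this mirrors lines 23-26 of Figure 2 in spirit."""
    kept = [(old, c) for old, c in model
            if not inconsistent(old, critique) and not refines(critique, old)]
    kept.append((critique, case))
    return kept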

3.2 Influencing Recommendation

The basic idea behind the user model is that it should be used to influence the recommendation process, prioritising those product cases that are compatible with its critiques. The standard approach to recommendation, when using critiquing, is a two-step one. First, the remaining cases are filtered by eliminating all of those that fail to satisfy the current critique. Next, these filtered cases are rank-ordered according to their similarity to the current recommendation.

We make one important modification to this procedure. Instead of ordering the filtered cases solely on the basis of their similarity to the recommended case, we also compute a compatibility score for each candidate case, which is essentially the percentage of critiques in the user model that this case satisfies (see Equation 1, and note that satisfies(Ui, c') returns a score of 1 when the critique Ui is satisfied by the filtered case c', and 0 otherwise). Thus a case that satisfies 3 out of the 5 critiques in a user model obtains a compatibility score of 0.6.

Compatibility(c', U) = \frac{\sum_{i=1}^{n} satisfies(U_i, c')}{n}, where n = |U|   (1)

Quality(c', c, U) = Compatibility(c', U) \times Similarity(c', c)   (2)

This compatibility score is then combined with the candidate's (c') similarity to the recommended case, c, in order to obtain an overall quality score; see Equation 2. This quality score is used to rank-order the filtered cases prior to the next recommendation cycle and the case with the highest quality is then chosen as the new recommendation (see lines 30-34 in Figure 2).

Of course there are many ways that we could have combined compatibility and similarity. In this work we give equal weight to both compatibility and similarity. Other weight settings may be considered; for example we might increase the weight given to past critiques as the user model grows. The essential point is, however, that the above formulation allows us to prioritise those candidate cases that: (1) satisfy the current critique; (2) are similar to the previous recommended case; and (3) satisfy many previous critiques. In so doing we are implicitly treating the past critiques in the user model as soft constraints for future recommendation cycles; it is not essential for future recommendations to satisfy all of the previous critiques, but the more they satisfy, the better they are regarded as recommendation candidates. Moreover, given two candidates that are equally similar to the previously recommended case, our algorithm will prefer the one that satisfies the greater number of recently applied critiques.
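Putting Equations 1 and 2 together, the modified recommendation step can be sketched as follows; the satisfies and similarity functions are assumed to be supplied by the surrounding system, and the treatment of an empty user model is an illustrative choice.

def compatibility(candidate, user_model, satisfies):
    """Equation 1: fraction of critiques in the user model that the candidate satisfies.
    `user_model` is a list of (critique, case) pairs; satisfies(critique, case) -> 0 or 1."""
    if not user_model:
        return 1.0  # assumed default before any critique has been applied
    return sum(satisfies(u, candidate) for u, _ in user_model) / len(user_model)

def quality(candidate, current, user_model, satisfies, similarity):
    """Equation 2: equal-weight product of compatibility and similarity to the current case."""
    return compatibility(candidate, user_model, satisfies) * similarity(candidate, current)

def item_recommend(cases, current, critique, user_model, satisfies, similarity):
    """Filter by the current critique, then return the highest-quality remaining case."""
    remaining = [c for c in cases if satisfies(critique, c)]
    return max(remaining, key=lambda c: quality(c, current, user_model, satisfies, similarity))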

4 Evaluation

One of the key measures of success of any conversational recommender is efficiency in terms of session length. Recommendation sessions that present users with increasingly good recommendations are likely to have fewer cycles and are therefore likely to lead to greater success than long ones [4, 9]. Because incremental critiquing has a much more informed user model than standard critiquing we expect that it should lead to shorter sessions. In this section we describe the results from an efficiency evaluation of incremental critiquing by comparing its performance to standard forms of critiquing (both unit and compound) on two different datasets. The results are very positive, showing significant reductions in session length.

4.1 Setup

Our main objective is to compare our incremental critiquing (IC) approach to the standard version of critiquing (STD). In addition we also take the opportunity to explore a minor variation to incremental critiquing, which we refer to as IC-ENABLE. This approach follows the incremental critiquing method but after each cycle it makes a second attempt to prune the user model by eliminating those critiques that are not compatible with the current recommended case; this involves minor revisions to the UpdateModel method of the algorithm presented in Figure 2. The hope is that this second pruning phase will help to maintain an improved user profile that remains better focused on the region of the product-space in the vicinity of the current recommendation.

We compare our three different recommendation algorithms (STD, IC and IC-ENABLE) by examining recommendation performance on both unit and compound critiques. To be clear, the standard approach to unit critiquing just involves the user being presented with, and selecting, unit critiques. The standard approach to compound critiquing involves the user being presented with both unit and compound critiques. The compound critiques are generated according to the dynamic critiquing strategy described in [12] and discussed briefly in Section 2.1; 5 compound critiques are presented during each cycle along with the standard unit critiques.

The evaluation was performed using two standard datasets. The PC dataset [8] consists of 120 PC cases, each described in terms of 8 features including manufacturer, processor, memory etc. The Travel dataset (available from http://www.ai-cbr.org) consists of 1024 vacation cases. Each case is described in terms of 9 features including price, duration, region etc.

4.2 Methodology

Ideally we would like to have carried out an online evaluation with live users, but unfortunately this was not possible. As an alternative we opted for an offline evaluation of the kind described in [8, 12, 15]. Accordingly, each case (base) in the case-base is temporarily removed and used in two ways. First, it serves as a basis for a set of queries by taking random subsets of its features. We focus on subsets of 1, 3 and 5 features to allow us to distinguish between hard, moderate and easy queries respectively. Second, we select the case that is most similar to the original base. These cases are the recommendation targets for the experiments. Thus, the base represents the ideal query for a user, the generated query is the initial query provided by the 'user', and the target is the best available case for the 'user'. Each generated query is a test problem for the recommender, and in each recommendation cycle the 'user' picks a critique that is compatible with the known target case; that is, a critique that, when applied to the remaining cases, results in the target case being left in the filtered set of cases. Each leave-one-out pass through the case-base is repeated 10 times and the recommendation sessions terminate when the target case is returned.

Figure 3: Efficiency results: Incremental critiquing vs standard critiquing.
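The trial-generation step of this methodology can be sketched as follows; the names are illustrative, the similarity function is assumed to be supplied, and the published runs repeat the whole leave-one-out pass ten times, which is folded into a repeats parameter here.

import random

def generate_trials(case_base, similarity, query_sizes=(1, 3, 5), repeats=10):
    """Each case ('base') is removed and used to seed partial queries of 1, 3 and 5
    features; the most similar remaining case becomes the recommendation target."""
    trials = []
    for base in case_base:
        rest = [c for c in case_base if c is not base]
        target = max(rest, key=lambda c: similarity(c, base))
        for size in query_sizes:
            for _ in range(repeats):
                features = random.sample(sorted(base.keys()), size)
                query = {f: base[f] for f in features}
                trials.append((query, target))
    return trials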

4.3 Recommendation Efficiency

Ultimately, we are interested in how incremental critiquing affects recommendation efficiency; that is, the number of cycles before the target case is returned to the user. In the following sets of experiments we are looking for efficiency improvements over standard approaches to unit and compound critiquing. To evaluate this, we run the leave-one-out test for each dataset (PC and Travel), on both unit and compound critiquing systems, and compare the performance of the two forms of incremental critiquing against the standard approach to critiquing. We measure the average length of the recommendation sessions for each of the different recommendation strategies, with special attention paid to recommendation sessions for different levels of query difficulty.

Figure 3 presents summary results for the two data sets on unit and compound critiquing. Each graph represents the average session length (in terms of the number of cycles) for each of the three recommendation strategies. The results show a clear and compelling benefit for the incremental critiquing variations (IC and IC-ENABLE). For example, in the PC domain, the average length of an IC session using unit critiques is 5.4 cycles, compared to 10 cycles for STD (see Figure 3(a); a 45% reduction in cycle length for IC). The results are even more striking for the larger Travel dataset. STD delivers session lengths of 65.7 cycles, on average for unit critiquing, compared to only 18.8 cycles for IC (see Figure 3(c); a 71% session length reduction). Similar benefits are observed when we look at session lengths in compound critiquing scenarios. For instance, in the PC dataset IC delivers a 52% reduction in average session length compared to STD (see Figure 3(b)) and a 48% improvement is noted in the Travel domain; see Figure 3(d). It is interesting to note that there is little difference between the performance of IC and IC-ENABLE. If anything, IC-ENABLE appears to be performing marginally worse than IC across the board, suggesting that its second pass at pruning is not contributing in a positive way to recommendation quality.

Figure 4: Efficiency results: Incremental critiquing for different query lengths.

It is also worth investigating how the recommendation efficiency is influenced by the initial query length, as this is a reasonable measure of query difficulty; short queries are more ambiguous as recommendation starting points and there­fore represent more of a challenge than longer, more complete queries. To do this we recomputed the session length averages above by separating out the queries of various sizes. As expected, session length decreases as the query size increases; see Figure 4. Once again, the results point to a significant advan­tage due to incremental critiquing but it is interesting to note that the benefit decreases as query size increases; that is, incremental critiquing has a greater advantage when the initial query is more difficult. For example in the PC do­main when using unit critiques (Figure 4(a)), IC reduces session length by 46% for the most difficult queries (where only 1 feature is specified initially) but this then falls to 38% for the easiest queries (where 5 features are specified initially). For compound critiques (Figure 4(b)) this benefit drops from 37% to 15%.

This reduction in benefit is likely due to the fact that shorter sessions to begin with mean that there is less opportunity for incremental critiquing to make a difference. Thus, as query difficulty is reduced so is the IC benefit. As we move from unit critiques to compound critiques we find a reduction in average session length, and a similar reduction in IC benefit. The IC benefit found in the Travel domain (Figures 4(c) and (d)) is less sensitive to query difficulty, or the change from unit to compound critiques, mainly because the Travel session lengths remain fairly long so there is plenty of opportunity for IC to exert its influence.

Figure 5: Critique selection noise results.

4.4 Critique Selection Noise

In the experiments so far we have assumed that our artificial user will always pick a critique that is compatible with the target case. Of course this assumption is unlikely to hold up in reality. Often users will make mistakes or choose sub-optimal critique selections. The question is: how do our incremental critiquing approaches perform under such conditions? To test this, we introduce different levels of noise into the critique-selection process. For example, a critique selection noise level of 10% means that at every cycle there is a 10% chance that the 'user' will choose an incorrect critique, one that is incompatible with their target. We examine noise levels of 10%-40%, following the standard methodology above to produce an average session length per noise level.
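A minimal sketch of how such noise could be injected into a simulated user is given below; the class and method names are illustrative only and are not those of our actual test harness.

import java.util.List;
import java.util.Random;

// Illustrative only: with probability noiseLevel the simulated user picks a critique
// that is incompatible with its target case; otherwise it picks a compatible one.
final class NoisyCritiqueSelector {
    private final Random rng = new Random();

    Critique select(List<Critique> compatible, List<Critique> incompatible, double noiseLevel) {
        boolean mistake = rng.nextDouble() < noiseLevel;   // e.g. 0.1 for the 10% condition
        List<Critique> pool = (mistake && !incompatible.isEmpty()) ? incompatible : compatible;
        return pool.get(rng.nextInt(pool.size()));         // uniform choice within the chosen pool
    }
}

interface Critique { }   // stands in for whatever critique representation the recommender uses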

The results are presented in Figure 5(a-d) as graphs of average session length (numbers of cycles) versus different noise levels. Each graph charts the performance of the three algorithmic variations—STD, IC and IC-ENABLE—and, as before, we look at both datasets in terms of unit and compound critiques. As expected, the results show general increases in session length as the noise level increases; unreliable critique selections inevitably lead to less efficient recommendation sessions as it is harder for the recommender system to focus in on the right area of the product-space. However, we do see that IC maintains a significant advantage over the standard approach in both domains for unit and compound critiques.

In the PC dataset using unit critiques only (Figure 5(a)), IC produces session lengths that vary from 23 (at the 40% noise-level) to about 6 cycles (at the 0% noise-level). The STD algorithm produces session lengths that vary from 26 cycles to just over 10 cycles across the same noise-levels. In other words, incremental critiquing reduces session lengths by between 45% (at the 0% noise-level) and 10% (at the 40% noise-level) when compared to STD. Once again the IC-ENABLE variant is seen to perform slightly worse (by about 7%) than IC in the PC dataset. On the much larger Travel dataset we notice some differences in results when compared to PC. For example, at the 40% noise-level IC actually performs marginally worse than STD for unit critiques; see Figure 5(c). However, as the noise-level drops to more reasonable levels we find that IC rapidly and significantly outperforms STD. It appears that at the highest noise level the IC user model has become swamped with many inconsistent critiques and so fails to perform as well as STD, which is not influenced by any user model. The fact that IC-ENABLE actually performs better than IC at this 40% noise-level supports this hypothesis; IC-ENABLE benefits from a user model whose inconsistent critiques have been eliminated by a second pruning stage.

This study of noise-sensitivity shows that the incremental critiquing approach still offers significant reductions in session length even if users make many unreliable critique selections. However, it is also clear that the benefits of incremental critiquing decrease as the noise level increases. In reality, it seems unlikely that users would behave so erratically as to select spurious critiques 40% of the time, and so we are confident that the more significant benefits available to IC at more reasonable (10%-30%) noise-levels are likely to be realised in practice. There are of course other ways that we may be able to cope with noise too. For example, weighting functions could be used to limit the influence of less reliable critiques in the user model or to limit the influence of older critiques on the grounds that they are more likely to be unreliable.

5 Conclusions

Critiquing is an important form of user feedback that is ideally suited to many recommendation scenarios. It is straightforward to implement, easy for users to understand and use, and it has been shown to be effective at guiding conversational recommender systems. Recently we have been interested in ways of improving the efficiency of critiquing in conversational recommender systems. We have pioneered the use of compound critiques that are dynamically generated during each recommendation cycle, and we were able to show that these compound critiques have the ability to reduce session length significantly.

The motivation for the present work came from the observation that critiquing-based recommender systems, whether based on unit or compound critiques, never appeared to explicitly consider the sequence of critiques chosen by a user during recommendation. Our studies suggest that by maintaining a record of a user's critiques we can recognise 'changes of mind', inconsistencies and refinements as the user's requirements evolve during a session, and that by doing so we can better influence future recommendations within a session. The incremental critiquing strategy described in this paper attempts to capture this idea. It maintains a model of the user during a recommendation session by storing a history of their selected critiques. We have described how this type of user model can be used to provide further guidance for a conversational recommender system by considering the degree of compatibility between a recommendation candidate and the user's past critiques as a factor during recommendation.
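The sketch below illustrates one way such a compatibility factor could enter the recommendation step, assuming compatibility is simply the fraction of stored critiques that a candidate satisfies and that it is blended linearly with the similarity to the current query; the weighting scheme and helper names are illustrative rather than a definitive account of the implementation.

import java.util.List;

// Illustrative ranking step: candidates are ordered by a blend of their similarity
// to the current query and their compatibility with the user's critique history.
final class IncrementalCritiquingRanker {
    Candidate recommend(List<Candidate> candidates, Query query, List<Critique> history, double alpha) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double compatibility = satisfiedFraction(c, history);               // in [0, 1]
            double score = alpha * compatibility + (1 - alpha) * c.similarityTo(query);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    private double satisfiedFraction(Candidate c, List<Critique> history) {
        if (history.isEmpty()) return 0.0;
        long satisfied = history.stream().filter(c::satisfies).count();
        return (double) satisfied / history.size();
    }
}

interface Critique { }
interface Query { }
interface Candidate {
    double similarityTo(Query q);        // similarity of the candidate case to the current query
    boolean satisfies(Critique c);       // does the candidate comply with a past critique?
}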

Our evaluation studies have shown that this new incremental critiquing approach can deliver significant performance benefits, by reducing session lengths by up to 70%, under a variety of experimental conditions. These benefits have been observed in multiple domains and regardless of whether unit or compound critiquing is used. By combining our new incremental critiquing approach with our dynamic critiquing strategy [12], we have achieved overall session-length reductions (over standard unit critiquing) of about 60% in the PC domain (with session lengths falling from over 10 to 4 cycles on average) and over 80% in the Travel domain (with session lengths falling from about 65 to 12 cycles).

We believe that these session-length reductions will dramatically improve the appeal of critiquing as a standard form of feedback in a wide variety of recommendation scenarios. The improvements mean that the efficiency of critiquing is brought more into line with the efficiency of value elicitation forms of feedback, but with none of the disadvantages of value elicitation when it comes to increased user effort and additional domain expertise requirements.

References

[1] D.W. Aha, L.A. Breslow, and H. Muñoz-Avila. Conversational case-based reasoning. Applied Intelligence, 14:9-32, 2000.

[2] R. Burke, K. Hammond, and B. Young. Knowledge-based navigation of complex information spaces. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 462-468. AAAI Press/MIT Press, 1996. Portland, OR.

[3] R. Burke, K. Hammond, and B.C. Young. The FindMe Approach to Assisted Browsing. Journal of IEEE Expert, 12(4):32-40, 1997.

[4] M. Doyle and P. Cunningham. A Dynamic Approach to Reducing Dialog in On-Line Decision Guides. In E. Blanzieri and L. Portinale, editors, Proceedings of the 5th European Workshop on Case-Based Reasoning (EWCBR-00), pages 49-60. Springer, 2000. Trento, Italy.

[5] K. McCarthy, J. Reilly, L. McGinty, and B. Smyth. Thinking Positively - Explanatory Feedback for Conversational Recommender Systems. In Proceedings of the European Conference on Case-Based Reasoning (ECCBR-04) Explanation Workshop, 2004. Madrid, Spain.

[6] L. McGinty and B. Smyth. Comparison-Based Recommendation. In Susan Craw, editor, Proceedings of the 6th European Conference on Case-Based Reasoning (ECCBR-02), pages 575-589. Springer, 2002. Aberdeen, Scotland.

[7] L. McGinty and B. Smyth. The Role of Diversity in Conversational Recommender Systems. In D. Bridge and K. Ashley, editors, Proceedings of the 5th International Conference on Case-Based Reasoning (ICCBR-03), pages 276-290. Springer, 2003. Trondheim, Norway.

[8] L. McGinty and B. Smyth. Tweaking Critiquing. In Proceedings of the Workshop on Personalization and Web Techniques at the International Joint Conference on Artificial Intelligence (IJCAI-03), pages 20-27. Morgan-Kaufmann, 2003. Acapulco, Mexico.

[9] D. McSherry. Minimizing dialog length in interactive case-based reasoning. In Bernhard Nebel, editor, Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), pages 993-998. Morgan Kaufmann, 2001. Seattle, Washington.

[10] D. McSherry. Diversity-Conscious Retrieval. In Susan Craw, editor, Proceedings of the 6th European Conference on Case-Based Reasoning (ECCBR-02), pages 219-233. Springer, 2002. Aberdeen, Scotland.

[11] D. McSherry. Similarity and Compromise. In D. Bridge and K. Ashley, editors, Proceedings of the 5th International Conference on Case-Based Reasoning (ICCBR-03), pages 291-305. Springer-Verlag, 2003. Trondheim, Norway.

[12] J. Reilly, K. McCarthy, L. McGinty, and B. Smyth. Dynamic Critiquing. In P.A. Gonzalez Calero and P. Funk, editors, Proceedings of the European Conference on Case-Based Reasoning (ECCBR-04). Springer, 2004. Madrid, Spain.

[13] B. Smyth and P. Cotter. A Personalized TV Listings Service for the Digital TV Age. Journal of Knowledge-Based Systems, 13(2-3):53-59, 2000.

[14] B. Smyth and L. McGinty. An Analysis of Feedback Strategies in Conversational Recommender Systems. In P. Cunningham, editor, Proceedings of the 14th National Conference on Artificial Intelligence and Cognitive Science (AICS-2003), pages 211-216, 2003. Dublin, Ireland.

[15] B. Smyth and L. McGinty. The Power of Suggestion. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), pages 127-138. Morgan-Kaufmann, 2003. Acapulco, Mexico.


SESSION 2:

AI TECHNIQUES II


A Treebank-Based Case Role Annotation Using an Attributed String Matching

Samuel W. K. Chan
Dept. of Decision Sciences
The Chinese University of Hong Kong
Hong Kong SAR

[email protected]

Abstract

A novel approach to identifying case roles is proposed. The approach makes use of an attributed string matching technique which takes full advantage of a huge number of sentence patterns in a Treebank. Based on the syntactic and semantic tags encoded in the Treebank, the approach goes beyond shallow parsing to a deeper level of language understanding, while preserving robustness, without being bogged down in a complete linguistic analysis. An evaluation over 5,000 Chinese sentences is reported in order to justify its statistical significance.

1. Introduction

Automatic information extraction is an area that has received a great deal of attention in recent developments of computational linguistics. While a profusion of issues relating to questions of efficiency, flexibility, and portability, amongst others, have been thoroughly discussed, the problem of extracting meaning from natural texts has scarcely been addressed. When the size and quantity of documents available on the Internet are considered, the demand for a highly efficient system that identifies semantic meaning is clear.

The case frame¹, as proposed by most linguists, is one of the most important structures that can be used to represent the meaning of sentences [1]. One could consider a case frame to be a special, or distinguishing, form of knowledge structure about sentences. Although several criteria for recognizing case frames in sentences have been considered in the past, none of the criteria serves as a completely adequate decision procedure. Most of the studies in computational linguistics do not provide any hints on how to map input sentences into case frames automatically, particularly in Chinese [2]. As a result, both the efficiency and robustness of the techniques used in information extraction are highly in doubt when they are applied to real-world applications. In this research, first, a shallow but effective sentence chunking process is developed. This sentence chunking process extracts all the phrases from the input sentences, without being bogged down in deep semantic parsing and understanding. Second, based on the syntactic and semantic tags of the latest Chinese Sinica Treebank, the extracted phrases are then annotated with the possible case roles [3]. Our approach integrates both the chunking and the case role annotation into a single step. One of our primary goals in this research is to design a shallow but robust mechanism which can analyze sentences in Chinese [4, 5, 6, 7]. Even though the classical syntactic and semantic analysis of Chinese is extremely difficult, if not impossible, to systematize in current computational linguistics research, this Treebank-based annotator does not require any deep linguistic analysis to be formalized. Consequently, the annotated sentences will give, piecemeal, the underlying semantic representation, without being mired in the formalism.

¹ Due to the lack of conciseness or conformity that authors have shown in using this and other terms, in this paper a case frame is to be understood as an array of slots, each of which is labelled with a case name, and eventually possibly filled with a case filler, the whole system representing the underlying structure of an input sentence.

The organization of the paper is as follows. The related work in case role analysis is first described in Section 2. The characteristics of the Treebank which supports our approach will also be explained. In this research, each Chinese token has two attributes, i.e., Part-of-Speech (POS) and Semantic Class (SC). Any input sentence can thus be viewed as an attributed string. The detailed discussion of how an attributed string matching algorithm can be used in case role annotation is given in Section 3. The system has already been implemented in the Java language. In order to demonstrate the capability of our system, an experiment with 5,000 sentences is conducted. It is explained in Section 4, followed by a conclusion.

2. Related Work

Following the framework of case grammar which was originally proposed by Fillmore in 1968, many researchers in linguistics and philosophy have accepted that every nominal constituent in every language bears a single syntactic-semantic case relation [8, 9]. Computational techniques can be found in many earlier systems [10, 11]. Nagao et al. [12] developed a powerful parser for Japanese sentences based on the case frames encoded in a verb dictionary. Somers [13] described a prototype computer program which attempts to map surface strings of English onto a formalism representing one level of a deep structure. It was suggested that semantic features inherent in the main verb of a sentence can be used to infer a potential case frame for that sentence. Weischedel et al. [14] predicted the intended interpretation of an utterance when more than one interpretation satisfies all known syntactic and semantic constraints, and ascertained its case frames. Utsuro, Matsumoto and Nagao [15] described a method for acquiring surface case frames of Japanese verbs from bilingual corpora. They made use of translation examples in two distinct languages that have quite different syntactic structures and word meanings. Kurohashi and Nagao [16] used a case frame dictionary, which has some typical example sentences for each case frame, to select a proper case frame for an input sentence. They devised a matching score from which the case frame with the best score is considered the best case structure of the sentence. Their approach relies heavily on the accuracy of a thesaurus which may be continuously subject to modification, however. More recently, Cook [17] developed a matrix model and applied it to an in-depth analysis of 5,000 English clauses. Chan & Franklin [18] also suggested a connectionist model for resolving case role ambiguity.

Figure 1 Example of a tree of Information-based Case Grammar in the Sinica Treebank (example sentence: ba4 ma1 tao3 lun4 siao3 ming2 da3 ren2 yi1 shi4 — 'the parents discuss the issue of Siu-Ming hitting a person')

The state of the art in computational linguistics is to make use of the knowledge encoded in Treebanks to analyze sentence structures. In contrast to the English and Chinese Penn Treebanks, which took a straightforward syntactic approach [19], the Information-based Case Grammar (ICG) in the Sinica Chinese Treebank stipulates that each lexical entry contains both semantic and syntactic features [20]. As shown in Figure 1, partial semantic information is annotated in the Chinese structural trees. That is, grammatical constraints are expressed in terms of the linear order of thematic roles and their syntactic and semantic restrictions. This tree structure has the advantage of maintaining phrase structure rules as well as the syntactic and semantic dependency relations. The latest version of the Sinica Treebank (v.2.1), released in early 2004, contains about 55,000 trees with 300,000 words. The Treebank contains a compact bundle of syntactic and semantic information, with more than 150 different types of POS and 50 semantic roles. Table 1 shows the frequency of some of the semantic roles in the Treebank.

Table 1 Frequency of some case roles in the Sinica Treebank

Case Role     Frequency        Case Role       Frequency
Property      41572            Agent           10707
Goal          22942            Experiencer      1801
Time          13570            Manner           5469
Range         11308            Location         4286

The ICG shown in the Treebank indicates the way that lexical tokens in the sentences are related to each other. Unfortunately, a full syntactic and semantic analysis of every sentence in every text is too computationally demanding. In this research, a shallow case role tagger is designed by matching any input Chinese sentence with the trees in the Treebank using an approximate pattern matching technique. The algorithm, characterized by an optimization technique, looks for a transformation with minimum cost, called the edit distance. While the concept of edit distance is commonly found in conventional pattern matching techniques [21, 22], we take a step further in applying the technique in the area of natural language processing. The technique is essentially accomplished by applying a series of edit operations to an input sentence to change it to every tree in the Treebank. Every edit operation has an associated cost, and the total cost of the transformation can be calculated by summing up the costs of all the operations. This edit distance reflects the dissimilarity between the input sentence and the trees. Instead of analyzing the exact Chinese tokens appearing in the sentence, extended attributes of each token in both the input sentence and the trees, namely their POS and semantic classes, are used. The closely matched tree, i.e., the one with minimum cost or edit distance, is selected and the corresponding phrase structures and semantic role tags delineated in the tree are unified with the input sentence. The detailed discussion of the algorithm is given in the following section.

3. Shallow Case Role Annotation Using Attributed String Matching Algorithm

Let two given attributed strings A and B be denoted as A = a1 a2 a3 ... am and B = b1 b2 b3 ... bn, where ai and bj are the i-th and j-th attributed symbols of A and B respectively. Each attributed symbol represents a primitive of A or B. Generally speaking, to match an attributed string A with another B means to transform or edit the symbols in A into those in B with a minimum-cost sequence of allowable edit operations. In general, the following three types of edit operations are available for attributed symbol transformation.

(a) Change: to replace an attributed symbol ai with another bj, denoted as ai → bj.
(b) Insert: to insert an attributed symbol bj into an attributed string, denoted as λ → bj, where λ denotes the null string.
(c) Delete: to delete an attributed symbol ai from an attributed string, denoted as ai → λ.

[Definition 1] An edit sequence is a sequence of ordered edit operations s1, s2, ..., sp, where si is any of the following three types of edit operations: Change, Insert, Delete.

[Definition 2] Let R be an arbitrary nonnegative real cost function which defines a cost R(ai → bj) for each edit operation ai → bj. The cost of an edit sequence S = s1, s2, ..., sp is defined to be

R(S) = Σi=1..p R(si)    (1)


[Definition 3] For two strings A and B with lengths m and n respectively, D(i, j) denotes the edit distance, which is the minimum cost of the edit operations needed to transform the first i characters of A into the first j characters of B, where 1 ≤ i ≤ m and 1 ≤ j ≤ n.    (2)

In other words, if A has m letters and B has n letters, then the edit distance of A and B is precisely the value D(m, n). The following algorithm has been proposed for computing all the edit distances D(i, j) [23].

[Algorithm A]
D(0, 0) := 0;
for i := 1 to len(A) do D(i, 0) := D(i-1, 0) + R(ai → λ);
for j := 1 to len(B) do D(0, j) := D(0, j-1) + R(λ → bj);
for i := 1 to len(A) do
  for j := 1 to len(B) do
  begin
    m1 := D(i, j-1) + R(λ → bj);
    m2 := D(i-1, j) + R(ai → λ);
    m3 := D(i-1, j-1) + R(ai → bj);
    D(i, j) := min(m1, m2, m3);
  end
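For concreteness, Algorithm A translates directly into the following Java sketch, with the three cost functions R(·) left as caller-supplied callbacks; the class and interface names are ours and are not part of the original system.

import java.util.List;

// Straightforward implementation of Algorithm A (Wagner-Fischer dynamic programming)
// over attributed tokens; the three cost functions are supplied by the caller.
final class AttributedEditDistance<T> {
    interface Cost<T> {
        double change(T a, T b);   // R(a -> b)
        double insert(T b);        // R(lambda -> b)
        double delete(T a);        // R(a -> lambda)
    }

    double distance(List<T> a, List<T> b, Cost<T> cost) {
        int m = a.size(), n = b.size();
        double[][] d = new double[m + 1][n + 1];
        d[0][0] = 0.0;
        for (int i = 1; i <= m; i++) d[i][0] = d[i - 1][0] + cost.delete(a.get(i - 1));
        for (int j = 1; j <= n; j++) d[0][j] = d[0][j - 1] + cost.insert(b.get(j - 1));
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                double m1 = d[i][j - 1] + cost.insert(b.get(j - 1));
                double m2 = d[i - 1][j] + cost.delete(a.get(i - 1));
                double m3 = d[i - 1][j - 1] + cost.change(a.get(i - 1), b.get(j - 1));
                d[i][j] = Math.min(m1, Math.min(m2, m3));
            }
        }
        return d[m][n];   // D(m, n) is the edit distance between the two attributed strings
    }
}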

Our attributed string matching for case role annotation makes use of the algorithm above and modifies the cost function R(·) for the various edit operations. In our approach, each Chinese token has two attributes, i.e., Part-Of-Speech (POS) and Semantic Class (SC). Let S be an input sentence and T be a tree in the Sinica Treebank, and let si and tj be two tokens in S and T with attributes (POSi, SCi) and (POSj, SCj) respectively. We define the cost function for a change operation si → tj to be

R(si → tj) = u(POSi, POSj) + v(SCi, SCj)    (3)

where u(POSi, POSj) defines the partial cost due to the difference between the POS of the tokens. The POS tags from the Chinese Knowledge Information Processing Group (CKIP) of Academia Sinica are employed [24]. The tags involve 46 different types of POS which can be further refined into more than 150 subtypes. In order to work out the cost function u(·,·), all the POS tags in our system are organized into a tree structure using XML with an associated hard-coded cost function. Table 2 shows an XML fragment for the nouns (Na), which are divided into in-collective (Nae) and collective (Nal) nouns; these are then ultimately divided into in-collective concrete uncountable nouns (Naa), in-collective concrete countable nouns (Nab), in-collective abstract countable nouns (Nac), and in-collective abstract uncountable nouns (Nad).


Table 2 Cost function and the tree structure of Nouns (Na) based on the CKIP Academia Sinica.

<Na Cost="4" Level="2">
  <Nal Cost="2" Level="3">
    <Nall Cost="1" Level="4">
      <Naa Cost="1" Level="5" />
      <Nab Cost="1" Level="5" />
    </Nall>
    <Nal2 Cost="1" Level="4">
      <Nac Cost="1" Level="5" />
      <Nad Cost="1" Level="5" />
    </Nal2>
  </Nal>
  <Nae Cost="2" Level="3">
    <Naea Cost="1" Level="4" />
    <Naeb Cost="1" Level="4" />
  </Nae>
</Na>

The cost function u(·,·) reflects the difference based on the costs encoded in the XML as shown in Table 2. For example, the cost for changing the word meaning 'responsibility' to the word meaning 'meeting' is relatively low since they are both classified under Nal2, even though their meanings are completely different. The function u(·,·) partially indicates the alignment of the syntactic structure of the input sentence and the sentence appearing in the Treebank. The second term in equation (3) defines another partial cost due to the semantic differences. In our approach, the lexical tokens in both sentences are identified using a lexical source similar to Roget's Thesaurus. The lexical source is a bilingual thesaurus with an is-a hierarchy. An is-a hierarchy can be viewed as a directed acyclic graph with a single root. Based on the is-a hierarchy in the thesaurus, we define the conceptual distance d between two notional words by their shortest path lengths [25]. Figure 2 shows one of the is-a hierarchies in our bilingual thesaurus, displayed using our Tree Editor. While the upward links correspond to generalization, specialization is represented by the downward links. For example, the upward link from 'preschool' to 'school' indicates that 'school' is more general than 'preschool', and the lexical tokens in the same terminal subclass, such as 'preschool', 'nursery school', or 'day-care centre', are of the same meaning. The hierarchies in the thesaurus are based on the idea that linguists classify lexical items in terms of similarities and differences. They are used to structure or rank lexical items from the more general to the more specific.
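The exact formula behind u(·,·) is not spelled out above; one plausible reading, sketched here purely for illustration, is to charge the Cost attribute of every node climbed on the way from the two POS tags to their lowest common ancestor in the XML tree, so that tags sharing a deep ancestor (e.g. Nac and Nad under Nal2) incur a small cost.

import java.util.HashSet;
import java.util.Set;

// Illustrative POS-difference cost: sum the per-node costs on the paths from the two
// tags up to their lowest common ancestor in the POS hierarchy (an assumption, not
// necessarily the hard-coded function actually used by the system).
final class PosCost {
    static final class Node {
        final String tag; final double cost; final Node parent;
        Node(String tag, double cost, Node parent) { this.tag = tag; this.cost = cost; this.parent = parent; }
    }

    double u(Node a, Node b) {
        Set<String> ancestorsOfA = new HashSet<>();
        for (Node n = a; n != null; n = n.parent) ancestorsOfA.add(n.tag);
        Node lca = b;
        while (lca != null && !ancestorsOfA.contains(lca.tag)) lca = lca.parent;
        if (lca == null) return Double.MAX_VALUE;                       // disjoint hierarchies
        double total = 0.0;
        for (Node n = a; n != null && !n.tag.equals(lca.tag); n = n.parent) total += n.cost;
        for (Node n = b; n != null && !n.tag.equals(lca.tag); n = n.parent) total += n.cost;
        return total;                                                   // identical tags cost 0
    }
}

Under this reading, changing an Nac token into an Nad token would cost 1 + 1 = 2 (the tags meet at Nal2), whereas changing Nac into Naeb climbs through Nal2 and Nal on one side and Nae on the other, giving a noticeably larger cost.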


[Fragment of the bilingual thesaurus hierarchy, descending from HUMAN SOCIETY AND INSTITUTIONS [RO09] through School [RO0902fr] to terminal entries such as nursery school [RO0902fr02010401].]

Figure 2 An is-a hierarchy in the bilingual thesaurus

Given two tokens t1 and t2 in an is-a hierarchy of the thesaurus, the distance d between the items is defined as follows:

d(t1, t2) = minimal number of is-a relationships in the shortest path between t1 and t2    (4)

The shortest path lengths in is-a hierarchies are calculated as follows. Initially, a search fans out through the is-a relationships from the original two nodes to all nodes pointed to by the originals, until a point of intersection is found. The paths from the original two nodes are concatenated to form a continuous path, which must be a shortest path between the originals. The number of links in the shortest path is counted. Since d(t1, t2) is positive and symmetric, d(t1, t2) is a metric, which means (i) d(t1, t1) = 0; (ii) d(t1, t2) = d(t2, t1); (iii) d(t1, t2) + d(t2, t3) ≥ d(t1, t3). At the same time, the semantic similarity measure between the items is defined by:

v(ti, tj) = d(ti, tj)   if d(ti, tj) ≤ dmax
          = MaxInt      otherwise                    (5)

where dmax is proportional to the number of lexical items in the system and MaxInt is the maximum integer of the system. This semantic similarity measure defines the degree of relatedness between tokens. Obviously, a strong degree of relatedness exists between lexical tokens under the same nodes. For the cost of the insert and delete operations, we make use of the concept of collocation, which measures how likely two tokens are to co-occur in a window of text. To better distinguish statistics-based ratios, work in this area is often presented in terms of the mutual information, which is defined as

MI(tj-1, tj) = log2 [ P(tj-1, tj) / (P(tj-1) × P(tj)) ]    (6)

where tj-1 and tj are two adjacent tokens. While P(x, y) is the probability of observing x and y together, P(x) and P(y) are the probabilities of observing x and y anywhere in the text, whether individually or in conjunction. Note that tokens that have no association with each other and co-occur together according to chance will have a mutual information value close to zero. This leads to the cost functions for insertion and deletion shown in equations (7) and (8) respectively.

R(λ → tj) = K × |z|    if 0 ≥ z ≥ ε, where z = min{MI(tj-1, tj), MI(tj, tj+1)}
          = MaxInt     otherwise                                                (7)

R(ti → λ) = L × |z'|   if 0 ≥ z' ≥ ε, where z' = MI(ti-1, ti+1)
          = MaxInt     otherwise                                                (8)

where K, L and ε are three constants that depend on the size of the corpus.

Obviously, the insertion operation will be penalized if the co-occurrence between the newly inserted token and its neighbours is low. Similarly, the deletion operation is most likely to happen if there is a high co-occurrence between the adjacent pair after the deletion. Using the above cost functions for the three types of edit operations, the tree in the Treebank with minimum cost is identified as the best approximation of the input sentence S, and its relevant case role tags will be adopted.
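The shortest-path computation behind equation (4) can be sketched as a breadth-first search over the is-a links, which yields the same path length as the two-sided fan-out described above; the class name and graph representation are illustrative only.

import java.util.*;

// Conceptual distance d(t1, t2): length of the shortest path between two thesaurus
// entries following is-a links in either direction.
final class ConceptualDistance {
    private final Map<String, Set<String>> links = new HashMap<>();   // node -> is-a neighbours (both directions)

    void addIsA(String child, String parent) {
        links.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
        links.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
    }

    int d(String t1, String t2) {
        if (t1.equals(t2)) return 0;
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(t1, 0);
        queue.add(t1);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : links.getOrDefault(node, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(node) + 1);
                    if (next.equals(t2)) return dist.get(next);        // frontier has reached t2
                    queue.add(next);
                }
            }
        }
        return Integer.MAX_VALUE;   // plays the role of MaxInt in equation (5)
    }
}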

4. Experimental Results

We have implemented the system using Sun Microsystems' Java JDK 1.4.2. The whole system is designed in the Unified Modeling Language (UML) using Rational Rose. To show the efficiency of the proposed algorithm, a series of experiments is performed. In our system, for every input sentence, the best matching tree with minimum edit distance in the Treebank is calculated as shown in Algorithm A. The Information-based Case Grammar (ICG) of the best matching tree in the Treebank is then adopted. Figure 3 shows the graphical user interface, which includes the cost matrix generated and the corresponding ICG structure of the input sentence.


Figure 3 Graphical User Interface (GUI) in the shallow case role annotation


We have tested our shallow case role annotation with 5,000 input sentences. The detailed results are shown in Table 3. The average sentence length is around 10.5 char/sentence.

Table 3 Analysis of 5,000 sentences in the experiment. Edit distance is defined as the minimum cost of transforming the input sentence into the closest sentence pattern in the Treebank. The smaller the distance, the higher the similarity.

Edit distance   # of sentences   Average # of tokens   Average edit distance   % of sentences having incomplete semantic classes
0-25            336              5.24                  21.06                    2.32
0-50            1556             6J5                   34.42                    9.01
0-75            2841             6.67                  46.94                   11.08
0-100           5000             6.62                  65.94                   11.93

In order to let readers visualize the relevance of the edit distance to the underlying tree structure, Figures 4 and 5 show two sentences with edit distances equal to 12 and 64 respectively. In each figure, the upper sentence comes from the Treebank T while the lower one represents the input sentence S. Due to the similarity, in terms of the edit distance, of the matched pair, the syntactic structure of the sentence from the Treebank is transplanted to the input sentence.

Figure 4 Sentence with edit distance equal to 12


Figure 5 Sentence with edit distance equal to 64

As a result, each token in the input sentence will inherit the associated roles from the target sentence. Consider, for example, the Chinese sentence shown in Figure 5:

(S1) (in English: Unfortunately, the national budget is so tight)

The sentence is chunked into the phrases meaning (unfortunately) and (the national budget is so tight); the latter is further chunked into (national budget), (so) and (tight). At the same time, the phrases (the national budget is so tight), (national budget) and (so) are annotated as the goal, main theme and degree of Sentence S1 respectively. This shallow case role annotation technique does not require a fully and exactly tagged input sentence.

Similarly, the sentence

(S2) (in English: The senators discuss the issue of sending troops initiated by the president)

has a small edit distance, equal to 15, to the tree shown in Figure 1. The sentence is then chunked into the phrases (the senators), (discuss) and (the issue of sending troops initiated by the president), which are tagged with agent, act and goal respectively by taking advantage of the annotation in the Treebank. Certainly, the phrase (the issue of sending troops initiated by the president) can be further chunked into (sending troops initiated by the president) and (the issue), as shown in Figure 1. This chunking not only provides the basic semantic tag for each constituent, it also reflects the deep structure of the input sentence. For example, the deep structure shown in Figure 1 is projected exactly onto Sentence S2, i.e., discuss[senators, sending troops initiated by the president].

As with other text analysis tasks, the effectiveness of the system is dictated by recall and precision parameters, where recall (R) is the percentage of correct case roles that are identified, while precision (P) is the percentage of case roles tackled by the system which are actually correct. In addition, a common parameter F is used as a single-figure measure of performance, as follows:

F = ((β² + 1) × P × R) / (β² × P + R)    (9)

We set β = 1 to give no special preference to either recall or precision. The recall, precision and F value are 0.84, 0.92 and 0.878 respectively. It is worthwhile to mention that, as shown in Table 3, more than 500 sentences have incomplete semantic classes, which mainly come from proper nouns, unknown words, proverbs or even short phrases. While the boundaries between words and phrases in Chinese are not easy to differentiate, the performance, thanks to the coverage of semantic classes in our thesaurus, does not deteriorate much in our system. This tolerance ability provides graceful degradation in our case role annotation. While other systems are brittle and work only on an all-or-none basis, the robustness of our system is guaranteed even though more than 10% of tokens have their SC tags missing.

5. Conclusion

A recent trend in information retrieval and question answering systems has tried to circumvent many syntactic and semantic complexities by extracting only certain predetermined patterns of information from narrowly focused classes of text. Although such an approach can achieve high quantitative returns, it necessarily compromises the quality of understanding. However, from a very superficial observation of the human language understanding process, it is clear that texts which are poorly formed grammatically can be understood, apparently by much the same processes as are used for well-formed texts. This kind of fault-tolerant behavior in language understanding without relying on any deep syntax analysis is imperative in real world applications.

In this paper, we have demonstrated a shallow parsing approach, based on the annotation of a Treebank, which does not require a full syntactic parse to pursue semantic analysis. Sentence interpretation is represented by chunks of phrases or words which are tagged with case roles. Our system does not claim to deal with all aspects of language, but its limitations are not relevant to our main focus: our system can understand grammatical and even ungrammatical sentences, provided that they have been analysed in the Treebank. Our approach uses much simpler and more local types of linguistic primitives without demanding a complete syntactic analysis. Our linguistic sequence analysis is inspired by research into bio-molecular sequences, such as DNA, RNA and proteins. Bio-molecular scientists believe that high sequence similarity usually implies significant functional or structural similarity. It is characteristic of biological systems that objects have a certain form that has arisen by evolution from related objects of similar but not identical form. This sequence-to-structure mapping is a tractable, though partly heuristic, way to search for functional or structural universality in biological systems. With the support of the results shown in this paper, we conjecture that this sequence-to-structure phenomenon appears in our sentences. The sentence sequence encodes and reflects the more complex linguistic structures and mechanisms described by linguists. With the development of rapid methods for sequence comparison, both with heuristic algorithms and powerful parallel computers, our sentence chunking and case role annotation based on sequence homology suggest a more plausible way to handle real corpora.

Acknowledgement

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4438/04H) and the APIB research grant of the Chinese University of Hong Kong.

References

[1] Fillmore, C.J. (1968). The case for case. In E. Bach & R.T. Harms (Eds.), Universals in Linguistic Theory, 1-90. Holt, Rinehart & Winston.

[2] Chao, Y.-R. (1968). A Grammar of Spoken Chinese. University of California Press.

[3] CKIP (2004). Sinica Chinese Treebank: An Introduction of Design Methodology. Academia Sinica.

[4] Chan, S.W.K. (2001). Integrating linguistic primitives in learning context-dependent representation. IEEE Transactions on Knowledge and Data Engineering: Special Issue on Connectionist Models for Learning in Structured Domains, 13, 2, 157-175.

[5] Chan, S.W.K. (2004). Extraction of textual salient patterns: Synergy between lexical cohesion and contextual coherence. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 34, 2, 205-218.

[6] Her, O.S. (1990). Grammatical Functions and Verb Subcategorization in Mandarin Chinese. The Crane Publishing Co.

[7] Li, Y.C. (1971). An Investigation of Case in Chinese Grammar. Seton Hall University Press.

[8] Jackendoff, R. (1983). Semantics and Cognition. MIT Press.

[9] Dowty, D. (1991). Thematic proto-roles and argument selection. Language, 67, 547-619.

[10] Simmons, R.F. (1970). Natural language question answering systems. Communications of the ACM, 13, 15-30.

[11] Wilks, Y.A. (1972). Grammar, Meaning and the Machine Analysis of Language. Routledge.

[12] Nagao, M., Tsujii, J., and Tanaka, K. (1976). Analysis of Japanese sentences by using semantic and contextual information—semantic analysis. Information Processing Society of Japan, 17, 1, 10-18.

[13] Somers, H.L. (1982). The use of verb features in arriving at a 'meaning representation'. Linguistics, 20, 237-265.

[14] Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., Palmucci, J. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19, 2, 359-382.

[15] Utsuro, T., Matsumoto, Y., and Nagao, M. (1993). Verbal case frame acquisition from bilingual corpora. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, vol. 2, 1150-1156.

[16] Kurohashi, S., and Nagao, M. (1994). A method of case structure analysis for Japanese sentences based on examples in case frame dictionary. IEICE Transactions on Information and Systems, vol. E77-D, no. 2, 227-239.

[17] Cook, W.A. (1998). Case Grammar Applied. Summer Institute of Linguistics.

[18] Chan, S.W.K., & Franklin, J. (2003). Dynamic context generation for natural language understanding: A multifaceted knowledge approach. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 33, 1, 23-41.

[19] Xia, F., Palmer, M., Xue, N., Okurowski, M.E., Kovarik, J., Chiou, F.-D., Huang, S., Kroch, T., & Marcus, M. (2000). Developing guidelines and ensuring consistency for Chinese text annotation. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

[20] Chen, F.-Y., Tsai, P.-F., Chen, K.-J., & Huang, C.-R. (2000). Sinica Treebank. [in Chinese] Computational Linguistics and Chinese Language Processing, 4, 2, 87-103.

[21] Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

[22] Tsay, Y.T., & Tsai, W.H. (1989). Model-guided attributed string matching by split-and-merge for shape recognition. International Journal of Pattern Recognition and Artificial Intelligence, 3, 2, 159-179.

[23] Wagner, R.A., & Fischer, M.J. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21, 1, 168-173.

[24] Chen, K.-J., Huang, C.-R., Chang, L.-P., & Hsu, H.-L. (1996). Sinica Corpus: Design methodology for balanced corpora. Proceedings of the 11th Pacific Asia Conference on Language, Information, and Computation (PACLIC 11), Seoul, Korea, 167-176.

[25] Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19, 1, 17-30.


A combinatorial approach to conceptual graph projection checking

Madalina Croitoru Ernesto Compatangelo

Department of Computing Science, University of Aberdeen

Abstract

We exploit the combinatorial structure of conceptual graphs in order to obtain better execution times when computing projection, which is a core generalisation-specialisation relation over conceptual graphs. We show how the problem of finding this relation can be translated into the Maximum Clique problem. Consequently, approximation techniques developed for the Maximum Clique problem can be used to compute projection in conceptual graphs. We show that there are "simple queries" which can be answered quickly, thus providing efficient reasoning support in a knowledge management environment based on conceptual graphs.

1 Introduction

Conceptual graphs (CGs for short) are a widely-used logical approach to knowledge representation based on semantic networks. They introduce a clear distinction between ontological and asserted knowledge, specifying it in a way that is both readable by humans and computationally tractable by machines [16].

Graph-based formalisms (e.g. CGs) can be successfully applied to solve relevant knowledge management problems such as modelling the articulation of ontologies and representing links between different ontology versions [11]. As deriving relationships (e.g. subsumption) between linked structures in articulated or versioned ontologies greatly improves multi-ontology management [14], devising a set of inferences that operate on CGs (thereby reasoning with these structures) is an important area of research. Reasoning with and about conceptual graphs, which is logically sound and complete [12], is based on a conceptual graph operation called projection. This is a labelled graph homomorphism which defines a generalisation-specialisation relation over conceptual graphs. A structure G is more general than a structure H (denoted as G > H) if there is a projection from G to H. The ">" symbol is interpreted as "greater than" in the sense that a "human" is more generic (i.e. broader) than a "student".

Algorithms devised for computing projection in CGs are either based on logic or on graph theory. The former combine First-Order Logic and Prolog mechanisms (e.g. resolution) in a reasoning tool [7]. Conversely, the latter translate the reasoning problem into a graph-homomorphism problem [5, 10].


Unfortunately, deciding whether G > H given two conceptual graphs G and H is an NP-complete problem [5, 2]. However, it has been shown that this is polynomially equivalent to other problems, such as (i) conjunctive query containment [4, 15] and query output [8] in databases, (ii) constraint satisfaction [15] in combinatorial optimisation, and (iii) clause subsumption [8] in knowledge representation and reasoning. Consequently, algorithms of exponential complexity with fast execution time have been used in practical applications when the size of the graphs involved is not too large [4].

Extensive research has been done to improve the computational behaviour of algorithms for testing projection in CGs [12, 9, 1]. However, most research results have been influenced by the algorithms used to solve the above mentioned equivalent problems. In this paper, we show how the rich combinatorial structure of CGs can be used to obtain better execution times for the resulting projection algorithm. Section 2 reviews basic notions and results about (simple) conceptual graphs. Section 3 presents a new projection algorithm based on the idea of firstly projecting relation nodes, which has the advantage of implicitly forcing the projection of concept node neighbours. Section 4 introduces the matching graph M(G→H) of a pair (G, H) of conceptual graphs, which is the main contribution of this paper. Such a graph is based on the efficient translation of the problem of deciding whether G > H into the well-known Maximum Clique [3] problem with the matching graph M(G→H) as input. This translation allows the approximation techniques developed for the Maximum Clique problem to be used in the projection of conceptual graphs. Moreover, the clique number of the matching graph can be used as a non-trivial measure in order to perform comparisons between conceptual graphs.

2 Simple Conceptual Graphs

Simple conceptual graphs are bipartite node-edge diagrams in which square nodes, representing term occurrences, alternate with rounded nodes, representing predicate occurrences. Labelled edges linking round nodes (relation nodes) to a set of square nodes (concept nodes) symbolise the ordered relationship between a predicate occurrence and its arguments. Concept nodes are labelled with a concept type and either a constant or a star (unnamed existentially quantified variables). Examples of CGs are shown in Figures 3 and 4.

2.1 Bipartite Graphs

A bipartite graph is a pair G = (VG, EG) with the node set VG = VC ∪ VR, where VC and VR are finite disjoint nonempty sets, and each edge e ∈ EG is a two-element set e = {vC, vR}, where vC ∈ VC and vR ∈ VR.

A bipartite graph G is denoted as G = (VC, VR; EG). The number of edges incident with a node v ∈ VG is the degree, dG(v), of the node v. If, for each vR ∈ VR, there is a linear order e1 = {vR, v1}, ..., ek = {vR, vk} on the set of edges incident to vR (where k = dG(vR)), then G is an ordered bipartite graph.


A simple way to express that G is an ordered graph is to provide a labelling

l : EG → {1, ..., |VC|}, where l({vR, w}) is the index of the edge {vR, w} in the above ordering of the edges incident in G to vR. We denote an ordered bipartite graph by G = (VC, VR; EG, l), where l is an order labelling of the edges of G.

For each node v ∈ VC ∪ VR, the set of neighbouring nodes of v is denoted as NG(v) = {w ∈ VC ∪ VR | {v, w} ∈ EG}. Similarly, if A ⊆ VR ∪ VC, the set of neighbouring nodes of A is NG(A) = ∪v∈A NG(v) − A.

If G is an ordered bipartite graph, then for each r ∈ VR, NG^i(r) denotes the i-th neighbour of r, i.e. v = NG^i(r) iff {r, v} ∈ EG and l({r, v}) = i.
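These definitions amount to little more than an adjacency list in which each relation node keeps its concept neighbours in the order fixed by l; a minimal illustrative sketch (the naming is incidental) is the following, in which NG^i(r) becomes a simple index lookup.

import java.util.*;

// Minimal ordered bipartite graph: concept nodes, relation nodes, and for every
// relation node the ordered list of its concept neighbours (the labelling l).
final class OrderedBipartiteGraph {
    final Set<String> conceptNodes = new LinkedHashSet<>();
    final Set<String> relationNodes = new LinkedHashSet<>();
    private final Map<String, List<String>> orderedNeighbours = new HashMap<>();

    void addEdge(String relation, String concept) {          // edges are added in l-order
        relationNodes.add(relation);
        conceptNodes.add(concept);
        orderedNeighbours.computeIfAbsent(relation, k -> new ArrayList<>()).add(concept);
    }

    String neighbour(String relation, int i) {                // N_G^i(r), 1-based as in the text
        return orderedNeighbours.get(relation).get(i - 1);
    }

    int degree(String relation) {                             // d_G(r)
        return orderedNeighbours.getOrDefault(relation, List.of()).size();
    }
}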

2.2 Support

Background knowledge, i.e. basic ontological knowledge, is encoded in a structure called support, which is implicitly used in the representation of factual knowledge as labelled graphs. A support is a 4-tuple S = (TC, TR, I, *) where:

• TC is a finite, partially ordered set (poset) of concept types (TC, ≤) that defines a type hierarchy where, ∀x, y ∈ TC, x ≤ y means that x is a subtype of y. The top element of this hierarchy is the universal type ⊤C.

• TR is a finite set of relation types partitioned into k posets (TR^i, ≤), i = 1, ..., k, of relation types of arity i (1 ≤ i ≤ k), where k is the maximum arity of a relation type in TR. Each relation type of arity i, namely r ∈ TR^i, has an associated signature σ(r) ∈ TC × ... × TC (i times), which specifies the maximum concept type of each of its arguments. This means that if we use r(x1, ..., xi), then xj is a concept with type(xj) ≤ σ(r)j (1 ≤ j ≤ i). The partial orders on relation types of the same arity must be signature-compatible, i.e. ∀r1, r2 ∈ TR^i, r1 ≤ r2 ⇒ σ(r1) ≤ σ(r2).

• I is a set of countable individual markers that refer to specific concepts.

• * is the generic marker that refers to an unspecified concept. However, this concept has a specified type.

• The sets TC, TR, I and {*} are mutually disjoint, and I ∪ {*} is partially ordered by x ≤ y if and only if x = y or y = *.

Figures 1 and 2 introduce a small-scale example of a support providing information about a holiday booking ontology "BukARest". Notably, while Figure 1 shows the concept hierarchy, Figure 2 shows the relationship taxonomy.

2.3 Simple conceptual graphs

A simple conceptual graph (SCG) is a 3-tuple SG = [S, G, λ], where:

• S = (TC, TR, I, *) is a support;

• G = (VC, VR; EG, l) is an ordered bipartite graph;

• λ is a labelling of the nodes of G with elements from the support S: ∀r ∈ VR, λ(r) ∈ TR^dG(r); ∀c ∈ VC, λ(c) ∈ TC × (I ∪ {*}), such that if c = NG^i(r'), λ(r') = r and λ(c) = (tc, refc), then tc ≤ σi(r).

[Tree of concept types rooted at Entity, including Person (Student, Retired), Transport (LandTransport, AirTransport, Plane), City, Balnear Resort, Camping, Sea-side, Mountains and Attribute (PersonAttribute, TransportAttribute, LocationAttribute), with attribute values such as Married, Single, Rich, Fast, Slow, Cheap, Expensive and En-vogue.]

Figure 1: Concept hierarchy for the holiday booking ontology

[Taxonomy of relation types with signatures: spendHoliday(Person, Location); bookAccomodation(Person, Location) with subtypes bookFullBoard(Person, Balnear) and bookSelfCatering(Person, Camping); travel(Person, Transport) with subtypes fly(Person, Plane) and drive(Person, Car); bookHolidayPack(Person, Location, Plane); characteristic(T, Attribute) with subtypes personCharacteristic(Person, PersonAttribute), locationCharacteristic(Location, LocationAttribute) and transportCharacteristic(Transport, TransportAttribute).]

Figure 2: Relationship taxonomy for the holiday booking ontology


Each relation node r is labelled by a relation type denoted as type(r), with an associated signature and an arity given by the degree of the node r in G. Each concept node c is labelled by a pair (type(c), ref(c)), where type(c) is the type of the node c and ref(c) the referent of c. The referent of c either belongs to I (when c is said to be an individual concept node) or is the generic marker * (when c is said to be a generic concept node). When the support is implicit, we shall denote a simple conceptual graph by the pair SG = (G, λ).

An example of a simple conceptual graph based on the support described in the previous section is shown in Figure 3. This graph states that John, a retired rich person, bought a holiday pack to fly to the Black Sea Coast, an expensive balneary resort, while Kate, a single student, also booked full board at the Black Sea Coast, driving there in her slow VWGolf4 car.


Figure 3: A first example of simple conceptual graph

2.4 Projection

Projection [16] is the fundamental operation on simple conceptual graphs; it can be used to define a pre-order on the set of SCGs based on the same support.

If SG = (G, λG) and SH = (H, λH) are two simple conceptual graphs defined on the same support S, then a projection from SG to SH is a mapping Π : VC(G) ∪ VR(G) → VC(H) ∪ VR(H) where:

• Π(VC(G)) ⊆ VC(H) and Π(VR(G)) ⊆ VR(H);

• ∀c ∈ VC(G), ∀r ∈ VR(G), if c = NG^i(r) then Π(c) = NH^i(Π(r));

• ∀v ∈ VC(G) ∪ VR(G), λG(v) ≥ λH(Π(v)).
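Verifying that a given mapping Π satisfies these conditions is straightforward; the sketch below assumes a small hypothetical ConceptualGraph interface exposing the node sets, the ordered neighbour function and the label order test, and is meant only as an illustration of the definition, not as the algorithm of this paper.

import java.util.Map;

// Checks the three projection conditions for a candidate mapping pi from the nodes
// of G to the nodes of H; pi is assumed to be defined on every node of G.
final class ProjectionChecker {
    boolean isProjection(ConceptualGraph g, ConceptualGraph h, Map<String, String> pi) {
        for (String c : g.conceptNodes())                    // concept nodes map to concept nodes
            if (!h.isConceptNode(pi.get(c))) return false;
        for (String r : g.relationNodes()) {
            String s = pi.get(r);
            if (!h.isRelationNode(s)) return false;
            for (int i = 1; i <= g.degree(r); i++)           // i-th neighbour is preserved
                if (!pi.get(g.neighbour(r, i)).equals(h.neighbour(s, i))) return false;
        }
        for (String v : pi.keySet())                         // labels may only specialise
            if (!g.label(v).geq(h.label(pi.get(v)))) return false;
        return true;
    }
}

interface Label { boolean geq(Label other); }                // order test on the support
interface ConceptualGraph {
    Iterable<String> conceptNodes();
    Iterable<String> relationNodes();
    boolean isConceptNode(String v);
    boolean isRelationNode(String v);
    int degree(String relation);
    String neighbour(String relation, int i);
    Label label(String v);
}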


If there is a projection from SG to SH, then SG subsumes SH (denoted as SG > SH). Such a subsumption relation is a pre-order on the set of all simple conceptual graphs defined on the same support. For example, the conceptual graph shown in Figure 4 represents the fact that John, who is a rich retired person, spends the holiday in the same expensive balneary resort (the Black Sea Coast) as Kate, who is a single student. Testing whether there is a projection relationship between this graph and the one shown in Figure 3 means "querying" whether John and Kate spend their holiday in the same place.


Figure 4: A second example of simple conceptual graph

3 A new projection checking algorithm

Subsumption checking is an NP-complete problem that has been investigated using graph homomorphism techniques [5, 2]. Unfortunately, the algorithms that have been devised so far do not exploit the underlying conceptual graph properties [6]. In fact, results achieved so far do not intuitively provide the user with an understanding of how some of the derived polynomial reductions are connected with particular graph structures. Due to the size of the graphs used in practice, a backtracking algorithm can be derived with good computational results. We are currently implementing this algorithm; however, a shortage of available test data on CGs makes it harder for us to perform comparative evaluations. In the next section, we will show how our algorithm can be used to describe new tractable instances of the general problem.

Let us suppose that we want to test if SG > SH for two SCGs SG = (G, λG) and SH = (H, λH) defined on the same support S. We can assume that G is connected, since it is easy to see that SG > SH if and only if there is a projection from (the SG associated to) each connected component of G to SH. In particular, we have no isolated concept vertices in G.

The algorithm constructs, in an incremental way, an induced subgraph of G which has a projection to H. The algorithm stops either when this subgraph is G itself or when no projection from G to H can be found.


The novel approach proposed in this paper is that of projecting relation vertices of G to "compatible" relation vertices of H, which implicitly forces projections of their (concept) neighbours. More specifically, if Π(r) = s for some vertex r ∈ V_R(G), then (by the definition of a projection) Π(N_G^i(r)) = N_H^i(s). This will reduce the search space in a number of applications where the support is not "theoretically" constructed (as it is in the proofs of NP-completeness of the corresponding decision problem [5, 2]). Comparisons are discussed in [6].

A projection can be viewed as a mapping Π : V_R(G) → V_R(H) where: (i) for all r ∈ V_R(G), λ_G(r) ≥ λ_H(Π(r)) and, for all i ∈ {1, ..., d_G(r)}, λ_G(N_G^i(r)) ≥ λ_H(N_H^i(Π(r))); and (ii) for all r, r' ∈ V_R(G), if N_G^i(r) = N_G^j(r'), then N_H^i(Π(r)) = N_H^j(Π(r')). Since concept nodes in a conceptual graph are not isolated, we obtain a projection from SG to SH by extending Π to Π : V_C(G) ∪ V_R(G) → V_C(H) ∪ V_R(H), putting, for each v ∈ V_C(G), Π(v) = N_H^i(Π(r)), where r ∈ V_R(G) and v = N_G^i(r).

During the execution of the algorithm a set P ⊆ V_R(G) is maintained, such that a projection Π has already been constructed for the subgraph of G with vertex set P ∪ N_G(P). Moreover, for each relation vertex r ∈ V_R(G) \ P, a set Candidates(r) of possible candidates from V_R(H) is available. A relation vertex s belongs to the set Candidates(r) if and only if λ_H(s) ≤ λ_G(r) and, for each i ∈ {1, ..., d_G(r)}, if v = N_G^i(r) ∈ N_G(P), then Π(v) = N_H^i(s). This means that a relation vertex from Candidates(r) can be used to extend the current projection of the subgraph of G with vertex set P ∪ N_G(P) to a "compatible" projection of the subgraph of G with vertex set P ∪ {r} ∪ N_G(P ∪ {r}). If, for some r ∈ V_R(G) \ P, Candidates(r) = ∅, then there is no extension of the current projection Π to a projection of the entire graph G to H.

Let us consider Figure 5, where the relation r_G and its neighbour concepts are shown in graph G, while the relation r_H and its neighbours are shown in graph H. If we imagine the "claw" formed by r_G and its neighbours as a rigid structure, then this structure has to match the corresponding structure in H.

Figure 5: An example of projection constraints

A problem arises if we consider another relation node s_G with N_G(r_G) ∩ N_G(s_G) ≠ ∅. In this case, the common neighbours must satisfy the same adjacency properties in graph H.

Initially, the set P is empty and the sets Candidates(r) (where r ∈ V_R(G)) provide the possible "free" projections of the relation vertices belonging to graph G.


Figure 6: Jigsaw example to show projection algorithm behaviour: (a) initial situation; (b) checking whether the space fits; (c) checking whether the contour fits; (d) good matching

Such projections can be determined by scanning the relation vertices s of H and testing whether λ_H(s) ≤ λ_G(r). This can be regarded as a "pre-processing" step in our algorithm, since it includes all the tests involving the partial orders represented by the support. The time complexity of an efficient implementation of this step can be very important in certain applications and depends on the adopted data structures [7, 13].

Our algorithm is similar to assembling a jigsaw by putting the pieces together. The algorithm takes a piece of the jigsaw and checks whether it fits in the place we want to put it (the third condition in the projection definition). If it does, we then check whether the contour also fits (the matching-neighbours condition), as shown in Figure 6.

The current set P is represented as a sequence P[1], P[2], ..., P[k], where k = |P|. For each i ∈ {1, ..., k}, a stack Π_i is associated to P. The top of the stack Π_i is the current projection of the relation vertex P[i], i.e. P[i] is projected to Π(P[i]). The above sets Candidates(r) are computed for this current projection and are therefore indexed by k: Candidates_k(r) (r ∈ V_R(G) \ P). In our jigsaw example, the dark area depicted in Figure 6(a) represents our current set P.

If k < |V_R(G)|, the vertex s ∈ V_R(G) \ P used to extend the current set P is selected by a greedy heuristic in order to reduce the search space by identifying dead ends in advance. More precisely, the vertex s is selected from V_R(G) \ P by the rule |N_G(s) \ N_G(P)| = max_{r ∈ V_R(G) \ P} |N_G(r) \ N_G(P)|. In the corresponding forward step, P[k+1] ← s (i.e. P ← P ∪ {s}) and the stack Π_{k+1} is created using the set Candidates_k(s).

In the proposed jigsaw example, we use the greedy heuristic by finding a piece that will fit in next to the most existing pieces. We also make sure that the selected piece properly fits in the free jigsaw space (see Figure 6(b)).


Once the piece properly fits in the free space, we need to make sure that it matches with its neighbours (see Figure 6(c)). In Figure 6(d) there is a potential match for the current jigsaw position. We continue to fit jigsaw pieces until we either complete the jigsaw (success) or reach a dead end. If the backtracking algorithm allows a different choice of jigsaw pieces, we continue; otherwise the algorithm reports that no projection exists between the two graphs.

The new candidate sets Candidates_{k+1}(r) (r ∈ V_R(G) \ P) are also computed, by checking whether the relation vertices from Candidates_k(r) remain compatible with the projection of s fixed by the top of the stack Π_{k+1}. This means that if v ∈ N_G(s) \ N_G(P \ {s}) and v = N_G^i(r), then Π(v) (which is settled by the projection of s) must be such that Π(v) = N_H^i(r'), for each r' ∈ Candidates_k(r) that is transferred into Candidates_{k+1}(r). Each element of N_G(s) \ N_G(P \ {s}) thus generates a new constraint for the relation vertices in the sets Candidates_k(r), which explains our rule for selecting the vertex s. The resulting backtracking algorithm is as follows:

1. Preprocessing step:
   for each r ∈ V_R(G) do
      Candidates_0(r) ← { r' ∈ V_R(H) | λ_H(r') ≤ λ_G(r) };
   found ← false;
2. Find_Projection(0);
   if not found then return "G ≱ H" else return "G ≥ H" and Π.

Find_Projection(k):
if not found then
   if k = |V_R(G)| then found ← true
   else {
      V ← {P[1], ..., P[k]};
      t ← min_{r ∈ V_R(G) \ V} |Candidates_k(r)|;
      S ← { r ∈ V_R(G) \ V : |Candidates_k(r)| = t };
      if t > 0 then {
         find s ∈ S such that |N_G(s) \ N_G(V)| = max_{r ∈ S} |N_G(r) \ N_G(V)|;
         P[k+1] ← s;
         construct the stack Π_{k+1} from Candidates_k(s);
         while Π_{k+1} ≠ ∅ do {
            r' ← pop(Π_{k+1});
            Π(P[k+1]) ← r';
            for each v ∈ N_G(P[k+1]) \ N_G(V) do
               if v = N_G^i(P[k+1]) then Π(v) ← N_H^i(r');
            for each r ∈ V_R(G) \ (V ∪ {P[k+1]}) do
               construct Candidates_{k+1}(r) from Candidates_k(r) by keeping only those
               elements r'' ∈ Candidates_k(r) such that, for each i with
               N_G^i(r) ∈ N_G(P[k+1]) \ N_G(V), we have Π(N_G^i(r)) = N_H^i(r'');
            Find_Projection(k+1)
         } } }
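A direct Python rendering of this backtracking scheme is sketched below. It is our own reading of the algorithm, under the assumed data structures used earlier (dicts of labels and ordered neighbour tuples, and a support order leq); it is not the authors' implementation.

def project(G, H, leq):
    # Preprocessing: "free" candidates, using only the support's order and the arities.
    cand = {r: [s for s in H['neighbours']
                if leq(H['relations'][s], G['relations'][r])
                and len(H['neighbours'][s]) == len(G['neighbours'][r])
                and all(leq(H['concepts'][H['neighbours'][s][i]], G['concepts'][c])
                        for i, c in enumerate(G['neighbours'][r]))]
            for r in G['neighbours']}
    return _extend(G, H, {}, {}, cand)

def _extend(G, H, pi_rel, pi_con, cand):
    if len(pi_rel) == len(G['neighbours']):
        return {**pi_rel, **pi_con}                   # a full projection has been found
    rest = [r for r in G['neighbours'] if r not in pi_rel]
    if any(not cand[r] for r in rest):
        return None                                   # dead end: some candidate set is empty
    covered = {c for r in pi_rel for c in G['neighbours'][r]}
    # Greedy choice: among the most constrained relations, pick the one adding most new concepts.
    t = min(len(cand[r]) for r in rest)
    s = max((r for r in rest if len(cand[r]) == t),
            key=lambda r: len(set(G['neighbours'][r]) - covered))
    for s_img in cand[s]:
        # The image of s forces the images of its concept neighbours.
        forced, ok = {}, True
        for i, c in enumerate(G['neighbours'][s]):
            img = H['neighbours'][s_img][i]
            if pi_con.get(c, img) != img or forced.get(c, img) != img:
                ok = False
                break
            forced[c] = img
        if not ok:
            continue
        # Forward checking: keep only candidates that agree with the newly forced concepts.
        new_cand, ok = dict(cand), True
        for r in rest:
            if r == s:
                continue
            kept = [r_img for r_img in cand[r]
                    if all(forced[c] == H['neighbours'][r_img][i]
                           for i, c in enumerate(G['neighbours'][r]) if c in forced)]
            if not kept:
                ok = False
                break
            new_cand[r] = kept
        if not ok:
            continue
        found = _extend(G, H, {**pi_rel, s: s_img}, {**pi_con, **forced}, new_cand)
        if found is not None:
            return found
    return None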


4 The matching graph

The correctness of the algorithm presented in the previous section can be derived by examining its behaviour on a virtual graph which can be obtained from the two simple conceptual graphs G and H. Furthermore, this graph can be used to improve the time complexity of the projection test.

Let SG = (G, λ_G) and SH = (H, λ_H) be two SCGs with no isolated concept vertices defined on the same support S. The matching graph of SG and SH is defined as the graph M_{G→H} = (V, E) where:

• V" C ^^(G) X VR{H) is the set of all pairs (r, s) such that r eVR{G),se VR{H), Ac(r) > XH{S), and Vz G { 1 , . . . , dcW}, XciNi^ir)) > XH{Nh{s)).

• E is the set of all 2-sets {(r, s), (r', s')}, where r ≠ r', (r, s), (r', s') ∈ V, and N_H^i(s) = N_H^j(s') for all i ∈ {1, ..., d_G(r)} and j ∈ {1, ..., d_G(r')} such that N_G^i(r) = N_G^j(r').

The first condition in the above definition states that the nodes of the matching graph are the pairs of nodes that match (puzzle pieces that match their potential space). The second condition makes sure that if the contour of the pieces matches, then there is an edge between the two pairs. To obtain the matching graph, we horizontally enumerate the relation nodes r to be projected and vertically depict their corresponding sets Candidates_0(r). The edges are drawn if and only if the potential jigsaw arrangement of the two pieces does not clash. There is no edge between nodes on the same vertical line.
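For illustration, the matching graph can be materialised directly from this definition. The sketch below is ours, reusing the assumed graph representation from the earlier sketches (dicts of labels and ordered neighbour tuples, and a support order leq).

from itertools import combinations

def matching_graph(G, H, leq):
    # Vertices: pairs (r, s) whose relation labels and ordered concept labels are compatible.
    V = [(r, s) for r in G['neighbours'] for s in H['neighbours']
         if leq(H['relations'][s], G['relations'][r])
         and len(H['neighbours'][s]) == len(G['neighbours'][r])
         and all(leq(H['concepts'][H['neighbours'][s][i]], G['concepts'][c])
                 for i, c in enumerate(G['neighbours'][r]))]
    # Edges: two pairs are joined if and only if every concept shared by r and r' in G
    # receives the same image in H through s and s'.
    def compatible(p, q):
        (r, s), (rp, sp) = p, q
        return all(H['neighbours'][s][i] == H['neighbours'][sp][j]
                   for i, c in enumerate(G['neighbours'][r])
                   for j, cp in enumerate(G['neighbours'][rp]) if c == cp)
    E = [frozenset((p, q)) for p, q in combinations(V, 2)
         if p[0] != q[0] and compatible(p, q)]
    return V, E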

Figure 7: An example of a matching graph

The vertices of the graph M_{G→H} shown in Figure 7 are represented by black circles, each labelled by the pair formed from the corresponding column and row relation vertices. For example, the vertices (x_1, y_1) and (x_1, y_4) mean that the relation vertex x_1 of G can be projected to the relation vertex y_1 or to y_4 of H. The only edges drawn in M_{G→H} are those of the complete subgraph induced by {(x_1, y_1), (x_2, y_2), (x_3, y_2), (x_4, y_1), (x_5, y_2)}. The edge {(x_1, y_1), (x_4, y_1)} means that projecting x_1 and x_4 to y_1 preserves the (ordered) adjacency in G, since the concept vertices shared by x_1 and x_4 in G are mapped to their common neighbour images of y_1 in H. It follows that by projecting x_1 to y_1, x_2 to y_2, x_3 to y_2, x_4 to y_1 and x_5 to y_2, together with the projections of the concept vertices v_1, v_2, v_3 and v_4 of G onto the concept vertices u_1 and u_2 of H that these choices force, we obtain a projection Π from SG to SH.

We now introduce the notion of a clique in a graph F as a set of mutually adjacent vertices, the maximum cardinality of which is denoted ω(F).

Theorem 1 Let SG = (G, λ_G) and SH = (H, λ_H) be two simple conceptual graphs without isolated concept vertices defined on the same support S, and let M_{G→H} = (V, E) be their matching graph. There is a projection from SG to SH if and only if ω(M_{G→H}) = |V_R(G)|.

Proof. For each r ∈ V_R(G), let V_r = {(r, s) ∈ V(M_{G→H})}. The sets V_r are disjoint, their union is V(M_{G→H}) and no two vertices of the same V_r are adjacent in M_{G→H}. It follows that any clique Q of M_{G→H} satisfies |Q ∩ V_r| ≤ 1. Therefore |Q| = |Q ∩ V(M_{G→H})| = Σ_{r ∈ V_R(G)} |Q ∩ V_r| ≤ Σ_{r ∈ V_R(G)} 1 = |V_R(G)|.

If ω(M_{G→H}) = |V_R(G)|, then there is a clique Q in M_{G→H} with |Q| = |V_R(G)|. Hence, by the above remark, |Q ∩ V_r| = 1 for every r ∈ V_R(G). We can therefore define the map Π : V_R(G) → V_R(H) by taking Π(r) = s, where Q ∩ V_r = {(r, s)}. By the definition of the matching graph M_{G→H}, we can extend Π to V_C(G) ∪ V_R(G) by taking, for each c ∈ V_C(G), Π(c) = N_H^i(Π(r)), where r ∈ V_R(G) and N_G^i(r) = c (such a relation vertex exists since G has no isolated concept vertices). This extension is well defined since, by the definition of the matching graph, if c = N_G^j(r') with r' ≠ r, then N_H^i(Π(r)) = N_H^j(Π(r')). Therefore Π is a projection from SG to SH.

Conversely, if a projection Π : SG → SH exists, then the set {(r, Π(r)) | r ∈ V_R(G)} is a clique of cardinality |V_R(G)| in M_{G→H} and, using the remark at the beginning of the proof, ω(M_{G→H}) = |V_R(G)| holds.

Remarks

1. Using a backtracking scheme, the algorithm described in Section 3 tries to construct a maximum clique in the (implicit) graph M_{G→H}, whose sets V_r are the sets Candidates_0(r). The explicit algorithm based on the above theorem can benefit from the necessary condition that the clique number equals |V_R(G)|. For example, if (r, s) ∈ V(M_{G→H}) is a vertex such that there is an r' ∈ V_R(G), r' ≠ r, with N_{M_{G→H}}((r, s)) ∩ V_{r'} = ∅, then (r, s) belongs to no |V_R(G)|-clique and can thus be deleted from M_{G→H}. If we call this new graph the reduced matching graph of the pair (G, H), then the algorithm in Section 3 can be improved in a way that corresponds to forward-checking rules in constraint satisfaction. The new projection checking algorithm can now be described as follows:


1. Construct the reduced matching graph M_{G→H};
2. Find the clique number ω(M_{G→H});
3. If ω(M_{G→H}) < |V_R(G)| then return "G ≱ H",
   else return "G ≥ H" and Π (obtained as in the proof of the theorem).

An illustrative sketch of this procedure is given after these remarks.

2. The above theorem reduces the checking of a projection from SG to SH to the NP-hard Max Clique problem on the matching graph. However, the combinatorial structure of the matching graph (where the sets (V_r)_{r ∈ V_R(G)} are the colour classes of a |V_R(G)|-colouring of M_{G→H}) shows that the abstract decision problem to be solved is in fact the following:

Maximum Clique in a Coloured Graph
INSTANCE: A graph G and a k-colouring of G (k ∈ Z+).
QUESTION: Does G have a k-clique?

Corollary 4.1 Maximum Clique in a Coloured Graph is NP-complete.

3. In practical approaches to projection checking of conceptual graphs, the reduction proved in the theorem can be exploited in two ways.

Firstly, if the graph M_{G→H} (in its reduced form) belongs to a class of graphs on which finding the maximum clique can be solved in polynomial time, then our projection testing can be done in polynomial time.

Secondly, any approximate method (e.g. genetic algorithms or semi-definite approximations) for determining the clique number of the matching graph [3] can be used to obtain practical approaches in the case of large conceptual graphs. Moreover, the clique number of the matching graph can be considered as a non-trivial measure for comparing conceptual graphs. More precisely, the difference between |V_R(G)| and ω(M_{G→H}) can be considered as the "distance" between SG and SH.

While preliminary results for the first approach are available, the second approach still needs further experimental investigation.
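As announced in Remark 1, the forward-checking reduction and the clique test can be sketched as follows (again our own illustration, continuing the matching-graph representation assumed above). Vertices that cannot belong to a |V_R(G)|-clique are pruned, and a clique picking one vertex per colour class V_r is then searched for by backtracking.

def clique_projection(G, V, E):
    # V and E as produced by matching_graph(); returns a relation-vertex mapping or None.
    edges = set(E)
    rels = list(G['neighbours'])
    classes = {r: [p for p in V if p[0] == r] for r in rels}
    # Reduction: repeatedly drop vertices with no neighbour in some other colour class.
    changed = True
    while changed:
        changed = False
        for r in rels:
            kept = [p for p in classes[r]
                    if all(any(frozenset((p, q)) in edges for q in classes[rp])
                           for rp in rels if rp != r)]
            if len(kept) != len(classes[r]):
                classes[r], changed = kept, True
    if any(not classes[r] for r in rels):
        return None                      # some V_r is empty: no |V_R(G)|-clique exists
    # Backtracking: choose one pair per relation vertex, keeping mutual adjacency.
    def search(i, chosen):
        if i == len(rels):
            return {p[0]: p[1] for p in chosen}
        for p in classes[rels[i]]:
            if all(frozenset((p, q)) in edges for q in chosen):
                found = search(i + 1, chosen + [p])
                if found is not None:
                    return found
        return None
    return search(0, [])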

Theorem 2 If the query conceptual graph G in the projection checking problem G ≥ H? is such that its set of relation vertices V_R(G) can be ordered as r_1 < r_2 < ... < r_m so that, for each i < j < k, either N_G(r_i) ∩ N_G(r_j) ⊆ N_G(r_j) ∩ N_G(r_k) or N_G(r_i) ∩ N_G(r_k) ⊆ N_G(r_j) ∩ N_G(r_k), then the answer is yes if and only if M_{G→H}, namely the reduced matching graph of G and H, has m non-empty sets V_r.

This theorem shows that there are "simple queries" which can be answered quickly. This is of practical importance from a computational viewpoint in a knowledge environment where reasoning is based on conceptual graphs. A proof of the theorem is detailed in [6].


5 Discussion

We showed how the rich combinatorial structure of conceptual graphs can be used to obtain faster execution times for the resulting projection algorithms. Our main contribution is the efficient translation of the problem of deciding whether G ≥ H into the well-known Maximum Clique problem, where the input is a structure called the matching graph M_{G→H}. This result paves the way for using the approximation techniques already developed for the Maximum Clique problem [3] in computing the projection of conceptual graphs.

The matching graph M_{G→H} of a pair (G, H) of conceptual graphs is a coloured graph introduced as a way of checking the correctness of our novel projection algorithm. The novel idea of the proposed algorithm is to project relation vertices of G to "compatible" relation vertices of H, which implicitly forces projections of their (concept) neighbours. It is expected that this will reduce the search time for conceptual graphs in those applications where the support is not as "artificially" constructed as in the theoretical proofs. In fact, we are currently investigating the performance of our algorithm vis-à-vis the matching algorithms currently implemented in CG editing systems.

The search time can be further reduced by exploiting the structure of the matching graph. Note that, due to the structure of the matching graph, we also highlighted a new NP-complete problem, i.e. Max Clique in coloured graphs. If the graph M_{G→H} (i.e. its reduced form) belongs to a class of graphs for which finding the maximum clique can be solved in polynomial time, then our projection testing can be done in polynomial time. Since the recognition of such classes can be done in polynomial time, we can improve the practical time complexity in the case of a positive answer. In this case it would be worth finding which combinatorial properties of the involved conceptual graphs give rise to such matching graphs. Moreover, the clique number of the matching graph can be considered a non-trivial measure to compare conceptual graphs. More precisely, the difference between |V_R(G)| and ω(M_{G→H}) can be considered as the "distance" between SG and SH.

Finally, we enunciated a condition on the structure of the conceptual graph G under which the structure of the matching graph M_{G→H} guarantees a successful projection. This means that there are "simple queries" which can be answered quickly. This is of practical importance when reasoning is based on conceptual graphs. Such is the case where the articulation of diverse ontologies or the versioning of related ontologies is represented using CGs.

References

[1] F. Baader, R. Molitor, and S. Tobies. Tractable and Decidable Fragments of Conceptual Graphs. In Proc. of the 7th Int'l Conf. on Conceptual Structures, pages 480-493. Springer-Verlag, 1999.

[2] J.-F. Baget and M.-L. Mugnier. Extensions of Simple Conceptual Graphs: the Complexity of Rules and Constraints. Jour. of Artificial Intelligence Research, 16:425-465, 2002.


[3] M. Bomze et al. The maximum clique problem. In Handbook of Combinatorial Optimization, volume Suppl. A, pages 1-74. Kluwer Academic Publishers, 1999.

[4] M. Chein, M.-L. Mugnier, and G. Simonet. Nested graphs: A graph-based knowledge representation model with FOL semantics. In Proc. of the 6th Int'l Conf. on the Principles of Knowledge Representation and Reasoning (KR'98), pages 524-535. Morgan Kaufmann, 1998.

[5] P. Creasy and G. Ellis. A conceptual graph approach to conceptual schema integration. In Proc. of the 1st Int'l Conf. on Conceptual Structures (ICCS'93), 1993.

[6] M. Croitoru and E. Compatangelo. On Conceptual Graph Projection. Technical Report AUCS/TR0403, Dept. of Computing Science, University of Aberdeen, UK, 2004. URL http://www.csd.abdn.ac.uk/research/recentpublications.php.

[7] G. Ellis. Efficient retrieval from hierarchies of objects using lattice operations. In Proc. of the 1st Int'l Conf. on Conceptual Structures, Lect. Notes in Artif. Intell., pages 274-293. Springer-Verlag, 1993.

[8] G. Gottlob, N. Leone, and F. Scarcello. Hypertree Decompositions: A Survey. In Proc. of the 26th Int'l Symp. on Mathematical Foundations of Computer Science, pages 37-57. Springer-Verlag, 2001.

[9] G. Kerdiles and E. Salvat. A Sound and Complete CG Proof Procedure Combining Projections with Analytic Tableaux. In Proc. of the 5th Int'l Conf. on Conceptual Structures, pages 371-385. Springer-Verlag, 1997.

[10] R. Levinson and G. Ellis. Multi-Level Hierarchical Retrieval. In Proc. of the 6th Annual W'shop on Conceptual Graphs, pages 67-81, 1991.

[11] P. Mitra, G. Wiederhold, and M. L. Kersten. A Graph-Oriented Model for Articulation of Ontology Interdependencies. In Proc. of the VII Conf. on Extending Database Technology (EDBT'2000), Lect. Notes in Comp. Sci., pages 86-100. Springer-Verlag, 2000.

[12] M.-L. Mugnier. On generalization/specialization for conceptual graphs. Jour. of Experimental and Theoretical Computer Science, 7:325-344, 1995.

[13] S. H. Myaeng and A. Lopez-Lopez. Conceptual graph matching: A flexible algorithm and experiments. Jour. of Experimental and Theoretical Computer Science, 4:107-126, 1992.

[14] N. F. Noy and M. A. Musen. Ontology Versioning as an Element of an Ontology-Management Framework. Technical Report SMI-2003-0961, School of Medical Informatics, Stanford University, USA, 2003. To appear in IEEE Intelligent Systems.

[15] G. Simonet, M. Chein, and M.-L. Mugnier. Projection in conceptual graphs and query containment in nr-Datalog. Technical Report 98025, Laboratoire d'Informatique, Robotique et Microélectronique, Montpellier, France, 1998.

[16] J. Sowa. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks Cole Publishing Co., 2000.


Implementing Policy Management through BDI

Simon Miles and Juri Papay and Michael Luck and Luc Moreau University of Southampton, Southampton, UK

email: [email protected]

Abstract The requirement for Grid middleware to be largely transparent to individual users and at the same time act in accordance with their personal needs is a difficult challenge. In e-science scenarios, users cannot be repeatedly interrogated for each operational decision made when enacting experiments on the Grid. It is thus important to specify and enforce policies that enable the environment to be configured to take user preferences into account automatically. In particular, we need to consider the context in which these policies are applied, because decisions are based not only on the rules of the policy but also on the current state of the system. Consideration of context is explicitly addressed, in the agent perspective, when deciding how to balance the achievement of goals and reaction to the environment. One commonly-applied abstraction that balances reaction to multiple events with context-based reasoning in the way suggested by our requirements is the belief-desire-intention (BDI) architecture, which has proven successful in many applications. In this paper, we argue that BDI is an appropriate model for policy enforcement, and describe the application of BDI to policy enforcement in personalising Grid service discovery. We show how this has been implemented in the myGrid registry to provide bioinformaticians with control over the services returned to them by the service discovery process.

1 Introduction

The Grid is a truly heterogeneous, large-scale computing environment in which resources are geographically distributed and managed by a multitude of institutions and organisations [16]. As part of the endeavour to define the Grid, a service-oriented approach has been adopted by which computational resources, storage resources, networks, programs and databases are all represented by services [10]. Discovering services, workflows and data in this fluid and ever-changing environment is a real challenge that highlights the need for registries with reliable information content.

myGrid (www.mygrid.org.uk), a pilot project funded by the UK e-Science programme, aims to develop a middleware infrastructure that provides support for bioinformaticians in the design and execution of workflow-based in silico experiments [12] utilising the resources of the Grid [15]. For less skilled users, myGrid should help in finding appropriate resources, offering alternatives to busy resources and guiding them through the composition of resources into complex workflows. In this context, the Grid becomes egocentrically based around the Scientist: myGrid.

Middleware is generally regarded as successful if its behaviour and management remain largely invisible to the user, while still remaining efficient. However, such a requirement is in conflict with myGrid's philosophy, according to which middleware should act in accordance with personal needs. Crucially, myGrid user requirements [14] have identified that final service selection ultimately rests with the scientist, who will select those to be included according to the goal of the experiment they are designing. Therefore, a service registry, the specific Grid service we consider in this paper, should be designed to provide a list of services adapted to the user's needs. Yet, realistically, users cannot be interrogated repeatedly for each operational decision pertaining to service discovery made when enacting experiments on the Grid. It is important that the environment in which the scientist works is configured prior to use so that preferences can be taken into account automatically.

The term 'user' should be understood here in the broadest sense. Indeed, the configuration may not necessarily be customised from the end-user's personal preferences only, but also from the preferences or recommendations of their collaborators or institutions. For instance, guidelines may be issued by a lab director about which services should or should not be used. System managers may also dictate service constraints, e.g. so that the machines hosting them remain at an acceptable load level. All these generally complex preference requirements specify not only the functional behaviour of services, but also non-functional aspects such as security or quality of service. They cannot be programmed generically into the system, since they can vary dramatically from one deployment to another and can potentially change over time. Our belief is that the definition of preferences with regard to a particular aspect of behaviour, in our case service discovery, should be achieved through a policy. Such a policy defines what is allowed, what is not allowed, and what to do (or not) in reaction to events, for all the stakeholders of the service.

While various policy definition technologies have been developed, largely focusing on the declaration of either what is or is not permitted, or what rule to follow in reaction to a given event, little consideration has been given to the context in which rules are applied. Conversely, work in artificial intelligence, planning and agent-based systems has developed approaches in which context is explicitly considered when deciding how to achieve a goal or react to the environment. One commonly-applied abstraction that balances reaction to multiple events with context-based reasoning in the way suggested by our requirements is the belief-desire-intention (BDI) architecture.

In this paper, therefore, we propose the application of agent-oriented engineering and the BDI architecture to policy enforcement in personalising Grid service discovery. Specifically, having adopted a message-passing architecture to implement the registry, we see such messages as events that can trigger plans, whose execution is decided based on the current state of the system and which enact further goals by generating more messages. Primitive goals are executed by passing messages to the appropriate manually programmed message handlers. This provides us with a flexible architecture that is capable of routing and handling messages according to all requirements specified in the form of policies at deployment time. We show how this has been implemented in the myGrid registry to provide bioinformaticians with control over the services returned to them by the service discovery process.

In Section 2, we present the motivation for policy-controlled behaviour in Grid service discovery, give an example scenario in which policy enforcement is required and show how an intentional, agent-based perspective provides a useful way of modelling the problem. We develop the idea in Section 3 by showing how the Procedural Reasoning System (PRS) [17] is applied to the problem of policy enforcement in our design, and show the way in which this is implemented as part of the myGrid architecture in Section 4. We discuss the evaluation of our work in Section 5, discuss related work on policies and BDI agents in Section 6, and draw conclusions in Section 7.

2 Service Discovery Policy in myGrid

For a bioinformatician to discover tools and data repositories available on the Grid, there must be a known registry in which they are advertised. However, as part of the user requirements, individual scientists have preferences about which services to use: they prefer some services over others, and sometimes want to take into account the opinions of trusted experts as to which services have performed well for them previously. In many labs, all members will use the same services and the same parameters to those services. This subjective information is not available in public registries and cannot easily be added (it cannot be added at all by third parties).

In myGrid, therefore, we have developed the idea of a personalised registry. In brief, an instance of a registry copies a filtered selection of entries from other registries and allows metadata to be attached to the service descriptions. Metadata is encoded information regarding, and explicitly associated with, existing data. This metadata can include personal opinions and can subsequently be used to ensure that the service discovery returns only the desired services.

To enable this personalised service discovery, it must be possible to specify a wide range of behaviours, including the following:

1. Keep registry contents up-to-date with regard to another registry.

2. Add metadata to entries in the registry if allowed.

3. Perform discovery and return results to a client if allowed.

4. Keep subscribers notified of new services registered in the registry.

5. Favour some sources of data and metadata over others when multiple sources are used, retaining only the most favoured.

6. Resolve conflicts between information from different sources.

All of the above configuration options are personal to the owner of the personalised registry, and can include significant amounts of detail. In fact, all these requirements lead to a curated registry, which identifies multiple roles: owner, curator and user, with policies specifying the behaviour of the system, but also the actions permitted to each role (in role-based access control style). Other management issues to be enforced include the prioritisation of actions, which affects the responsiveness and latency of the system, and the avoidance of cyclic subscriptions. To deploy a personalised registry, therefore, we need a policy to define its behaviour, and a suitable mechanism to enforce that policy during the operation of the personalised registry. We will argue that this is best done using BDI.

Example Scenario To better explain our approach to solving the problem of policy enforcement in this domain, in this section we present an example scenario and associated policy. Figure 1 illustrates the scenario in which the expert scientist in an organisation has a personalised registry (Registry 1) that copies the service adverts published in one or more public registries. The expert then adds a trust value as metadata to each service advert, indicating how reliable they have found that service. A novice in the same organisation also has a personalised registry (Registry 2) that copies the content of the expert's registry, but only where the trust value of a service is higher than a particular defined constant. The novice is the only user allowed to edit the metadata in Registry 2. This means that when the novice discovers services, they are only provided with services that the expert has judged to be trustworthy.

Figure 1: A deployment scenario. The information in Personalised Registry 1 is curated: a curator adds trust values to the advertised services. Personalised Registry 2 pulls information from Registry 1 via notification messages and queries for service details, selecting services with a trust value > X.

We now consider in detail the policy specified in order to define the desired behaviour of these personalised registries. First, on metadata being added or updated in the expert's registry, the novice's registry should query the expert's registry for the details of the service advert and its metadata, and then save them if the metadata includes a trust value that exceeds the defined constant. Second, a service advert and its associated metadata should only be kept in the novice's registry if they have been copied from the expert's registry or if the novice has personally published them, so that no other party is authorised to change the contents of the novice's registry.

In our system, the policy is specified in one or more policy documents following a structured XML format. Thus, the policy document for encoding the behaviour of the novice's registry is given in Figure 2, which is separated into three parts.

In the first part of Figure 2, we show the part of the policy document relating to authorisation. We group all users into a single "User" role and all trusted registries into a "TrustedRegistry" role. There is only one client in each specified role in the policy as it stands, but this could easily be extended later if the novice allows colleagues to use her personalised registry. The operations related to changing metadata, such as trust values, are grouped into the "Edit Metadata" operation type. We then specify that members of the "User" role are permitted to perform operations of this operation type (members of the "TrustedRegistry" role are also permitted to edit metadata).


<Role value = "User"> <Client value = "Novice"/> </Role> <Role value = "TrustedRegistry"> <Client value = "Expert"/> </Role> <OperationType value = "Edit Metadata"> <Operation value = "AddMetadataToBusinessService"/> <Operation value = "DeleteMetadata"/> </OperationType> <Pennission> <OperationType reference = "Edit Metadata"/> <Role reference = "User"/> </Pennission>

<Subscription> <Registry> <RegistryLocation value =

"http://www.ecs.ac.uk/expert:8080/view"/> <RegistryType> <RegistryCallHandler value = "uk.ac.soton.ecs .views.server, iiiipl.notifications

.handlers.RegistryEventlncomingHcuidler"/> </RegistryType> <Client reference = "Expert" /> </Registry> <Argument parameterName = " topic "> <String value = "MetadataChanged"/> </Arguinent>

</Subscription>

<EntityFilter> <CoinparisonHandler value = "GreaterOrEqualThanCoitparison" /> <MetadataValueExtractor> <Operation value = "GetBusinessServiceMetadata" /> <Arguinent parameterName = "type"> <URI value = "http://www.ecs.ac.uk/expert/types/TrustRating"/> </Argument> </MetadataValueExtractor> <MetadataValueExtractor> <Operation value = "ReturnConstant" /> <Argument> <String value = "0.7"/> </Argument> </MetadataValueExtractor> </EntityFilter>

Figure 2: Policy document fragment with three parts


The second part of Figure 2 specifies that the expert's registry should notify the novice's registry whenever a piece of metadata is added or changed. This is because the metadata may include a trust value provided by the expert, and we may wish to copy the service advertisement to which it is attached into the novice's registry. Such a request for notification is called a subscription. Finally, the third part of Figure 2 specifies the novice's registry filter, which ensures that only services with a trust value exceeding 0.7 are included.
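As an illustration of how such documents might be turned into in-memory policy objects, the sketch below parses the authorisation fragment with Python's standard XML library. The wrapping <Policy> element and all function names are our own assumptions; the actual myGrid parsing code is not shown in the paper.

import xml.etree.ElementTree as ET

POLICY = """<Policy>
  <Role value="User"><Client value="Novice"/></Role>
  <OperationType value="Edit Metadata">
    <Operation value="AddMetadataToBusinessService"/>
    <Operation value="DeleteMetadata"/>
  </OperationType>
  <Permission>
    <OperationType reference="Edit Metadata"/>
    <Role reference="User"/>
  </Permission>
</Policy>"""

def load_policy(xml_text):
    root = ET.fromstring(xml_text)
    roles = {r.get('value'): {c.get('value') for c in r.findall('Client')}
             for r in root.findall('Role')}
    op_types = {t.get('value'): {o.get('value') for o in t.findall('Operation')}
                for t in root.findall('OperationType')}
    permissions = [(p.find('Role').get('reference'), p.find('OperationType').get('reference'))
                   for p in root.findall('Permission')]
    return roles, op_types, permissions

def is_permitted(client, operation, roles, op_types, permissions):
    # A client may perform an operation if some role it plays is granted an
    # operation type that contains the operation.
    return any(client in roles.get(role, set()) and operation in op_types.get(op_type, set())
               for role, op_type in permissions)

roles, op_types, permissions = load_policy(POLICY)
print(is_permitted('Novice', 'DeleteMetadata', roles, op_types, permissions))   # True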

Interpretation in Intentional Terms Registries satisfying the requirements above are crucial for several reasons. First, they are federated, with annotations made in one registry being communicated to another without a guarantee of their inclusion in the latter: this is flexible, social behaviour. Second, the registries are autonomous in that they poll other registries according to some query, and reactive in that they may incorporate the results into their own repository, both within the current environmental context provided by the policy and through communication from other registries. Such flexible, autonomous and reactive behaviour, which shares many characteristics with the notions underlying agent systems, suggests that an agent approach may offer a useful framework.

In this way, the services other than the registry can also be viewed as agents, which may be represented in the system as automatic publishers (or re-publishers) of services into multiple registries, as automated discoverers of services to be included in workflows, as personal agents adjusting service discovery to a user's preferences and as automated executors that handle the invocation, composition and failure of services. This usefully identifies the application as a whole as a multi-agent system.

To allow this to be applied in the design of an individual agent, we can view the registry as an entity with multiple intentions that it fulfils in line with environmental conditions (the context as defined earlier). Since registries themselves are traditionally viewed as passive repositories of information, it is perhaps clearer to view the agent in this system as an entity managing the content of the registry rather than the registry itself, so that the agent enforces the management policy of the registry. (Note, however, that there is also much work on agent brokers and mediators addressed in a different context [19, 22]. Although these address a different problem of matching, they might use policy-based registries to achieve such increased functionality.)

This policy enforcement agent can be seen as instantiating as intentions the goals expressed above. More specifically, the agent must take context into account in fulfilling its intentions. For example, in keeping the most recent version of a service advertisement in the registry, the currently stored version must be taken into account. Similarly, in copying service advertisements from the most trusted other registries, the sources of an advertisement and the trust in those sources must be compared.

3 Using BDI to Enforce Policy

In order to enforce management policy in a registry, we create an agent that processes the goals of the policy and the operations performed on the registry using the belief-desire-intention model and its specific instantiations as the procedural reasoning system (PRS) and the distributed multi-agent reasoning system (dMARS) [6]. The choice of this policy enforcement mechanism was due to the particular requirements of registry management. In particular, decisions on policy are triggered by external events, and different decisions are made depending on both the policy and the contents of the registry. Other policy enforcement technologies are discussed in Section 6; in this section, we show how the BDI model fits well with our requirements.

The benefits of modelling registry policy enforcement in terms of agents are twofold. First, at the level of the individual agent, we can map our requirements to proven agent technologies and re-use them for fulfilling intentions in a flexible, autonomous, context-sensitive way. In this way, we adopt the BDI model, which can manage multiple intentions achieved in different ways depending on context. We discuss how the BDI model has been applied to registry policy enforcement below. The second benefit is in easing the complexity of understanding the behaviour of multiple, federated registries. As the behaviour of a registry is controlled by a flexible policy enforcement agent, we can assume that each will robustly interpret any data sent to it by other registries, without consideration of how it is processed or how it may conflict with other operations the agent is performing.

The Procedural Reasoning System PRS is an agent architecture that balances reaction to the environment (i.e. doing something appropriate) with structured activity based on knowledge (i.e. planning in advance for complex actions). An agent's beliefs correspond to information the agent has about the world, while desires (or goals, in the system) intuitively correspond to the tasks allocated to it. The intuition is that an agent is unable to achieve all its desires, even if these desires are consistent. Agents must therefore fix upon some subset of available desires (or intentions) and commit resources to achieving them until either they are believed satisfied or no longer achievable [5].

The model is operationalised by pre-defined plans that specify how to achieve goals or perform activities. Each agent has a plan library representing its procedural knowledge of how to bring about states of affairs. Plans include several components: a trigger, which indicates the circumstances in which the plan should be considered for use; a context, which indicates when the plan is valid for use (and is a formal representation of the context of the registry as discussed in the introduction); and a body comprising a set of actions and subgoals that specify how to achieve the plan's goal or activity.

In the PRS view, triggering events, e.g. user commands, appear at unpredictable times in the external environment and within the agent. On perceiving an event, an agent compares it against the plans in its plan library and, for each plan with a matching trigger, the context is compared to the agent's current beliefs. When the trigger and context of a plan match the current event and beliefs, the plan is committed to being achieved and becomes an intention, an active thread of control that enacts the plan. As an agent runs, it progressively enacts its intentions by performing the actions specified in the intended plan one at a time. When all actions have completed, the intention is achieved. Subgoals in a plan allow parts of the plan to be tailored to the current circumstances, which may have changed since the plan started (because the newly triggered subplans are chosen based on their contexts). The activity of triggering new plans and transforming them into intentions combines with the enactment of intentions to allow such agents to balance reactivity with advance planning.
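A minimal, heavily simplified sketch of this cycle may help to fix ideas; the class and attribute names below are ours, and the real PRS/dMARS machinery is considerably richer.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Plan:
    trigger: str                                     # event type that makes the plan relevant
    context: Callable[[dict, dict], bool]            # valid given current beliefs and event data?
    body: Callable[[dict, dict, 'Agent'], None]      # actions and subgoals to enact

@dataclass
class Agent:
    beliefs: dict
    plans: List[Plan]
    events: List[Tuple[str, dict]] = field(default_factory=list)

    def post(self, event: str, data: dict) -> None:
        self.events.append((event, data))            # external call or internal subgoal

    def step(self) -> None:
        # Take one pending event; every plan whose trigger and context match becomes an
        # intention, which is enacted here by simply running its body to completion.
        if not self.events:
            return
        event, data = self.events.pop(0)
        for plan in self.plans:
            if plan.trigger == event and plan.context(self.beliefs, data):
                plan.body(self.beliefs, data, self)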

Broadly, we can see how each feature is useful in meeting our requirements. An operation call by a user to the registry (e.g. publishing or discovering a service, or information being received from another registry) is represented by a message, the receiving of which acts as a triggering event for PRS plans. The context of a plan can represent decisions on whether to allow an action, such as publishing a service, based on the authorisation of the caller, the content of the advertisement or the current content of the registry. The actions of a plan are performed by sending messages (either the ones received or new ones created according to the plan specification) to the appropriate handler, which implements the required business logic. Finally, subgoals can be used both to separate concerns and to ensure the most recent context is taken into account on each decision. For example, copying information from another registry may involve checking both the information content and the authorisation applied to the registry, and these steps can be separated by the use of subgoals.

Example PRS Plan In our system, a set of policy documents, written in XML, is parsed when a registry is created and used to construct a running PRS agent with a set of PRS plans. The policy can then be altered at run-time to, for example, permit new users to perform service discovery.

In the example scenario given in Section 2, services are copied from the expert's registry into the novice's registry only if the trust value assigned by the expert is greater than 0.7. The policy requires several plans to be defined to ensure appropriate behaviour over time, including those encoding the following behaviours. On startup, the novice's policy agent should ask the expert's policy agent to inform it when trust values are added to or changed for the services in the expert's registry. Then, every time a new service advert is published, either by a user or copied from another registry, the agent should ensure that the user or registry is authorised to provide such information before it can be saved. When the agent is informed that a trust value has been added to a service in another registry, it should decide whether to include a copy of the service.

The TrustValueAdded plan for this final behaviour is shown in Figure 3, which expresses the plan structure clearly, based loosely on the AgentSpeak(L) language [20]. At the top of the plan is the trigger for considering when to form an intention to enact the plan. This trigger is a piece of metadata representing a trust value being added to a service in a remote registry, with the service being identified by a unique service key, as is the convention in UDDI. In the second section, the context determines whether the plan is applicable whenever the trigger occurs. The context compares the trust value to the constant given in the policy (0.7), and if the trust value is high enough, the actions are performed. These are the last four lines of the plan, performing the following: retrieve the details of the service advert from the remote registry, save those details, retrieve the metadata attached to the service advert and save that metadata. The retrieval operations are primitive actions performed on the remote registry, while the save operations are subgoals that trigger the processing of new intentions. The value of using subgoals in this example is that the authorisation to copy the contents of a remote registry can be checked at the time of the save operation.

For each operation, the authorisation is checked before it can be performed. In the SaveService plan of Figure 3, we show the SaveService message being triggered, either as a subgoal of the TrustValueAdded plan or because a user has called the saveService operation to publish a service.


TrustValueAdded(remoteRegistry, serviceKey, trust)
  ? trust >= 0.7
  details = getServiceDetails(remoteRegistry, serviceKey)
  !SaveService(details)
  metadata = getServiceMetadata(remoteRegistry, serviceKey)
  !SaveMetadata(serviceKey, metadata)

SaveService(serviceDetails)
  ? inRole(role, caller)
  ? isPermitted(SaveService, role)
  UDDIPublishHandler.SaveService(serviceDetails)

Figure 3: Plans for copying services and saving service adverts

In the context, the plan first finds in the agent's belief store the roles in which the client sending the message is active, and then checks that at least one role is permitted to perform SaveService. If so, a message is sent to the UDDI publish handler to actually save the service.
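The decision logic of these two plans can be pictured with the following stand-alone sketch; the registry operations (get_service_details, is_permitted and so on) are hypothetical stand-ins for whatever calls the registries actually expose, and the 0.7 threshold is the constant from the policy in Figure 2.

TRUST_THRESHOLD = 0.7                          # the constant declared in the policy document

def on_trust_value_added(remote, service_key, trust, local, caller_role):
    # TrustValueAdded plan: the context tests the threshold; the body retrieves the advert
    # and its metadata (primitive actions) and posts two save subgoals.
    if trust < TRUST_THRESHOLD:
        return
    details = remote.get_service_details(service_key)
    save_service(details, local, caller_role)
    metadata = remote.get_service_metadata(service_key)
    save_metadata(service_key, metadata, local, caller_role)

def save_service(details, local, caller_role):
    # SaveService plan: the context checks authorisation before the business logic runs.
    if local.is_permitted(caller_role, 'SaveService'):
        local.uddi_publish_handler.save_service(details)

def save_metadata(service_key, metadata, local, caller_role):
    if local.is_permitted(caller_role, 'AddMetadataToBusinessService'):
        local.metadata_handler.save_metadata(service_key, metadata)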

4 Implementing Policy Enforcement

The above sections justify the use of PRS and describe how policy documents can be mapped to PRS plans, but we still need to address how the policy is practically enforced within the registry. In order to do this, we need to introduce the architecture of the myGrid registry.

The myGrid registry is modular and extensible, allowing deployers to choose which protocols will be supported. At the moment the registry can support several protocols including UDDI version 2, a protocol for parsing and annotating WSDL documents and a generic protocol for attaching metadata to service descriptions. Each operation call is processed by a chain of handlers performing the business logic of the operation. The PRS agents are inserted into the handler chains to ensure that the policy is enforced at all times.

As shown in Figure 4, a registry is created by providing a set of policy documents to a registry factory. The policy describes a set of policy objects controlling how the registry behaves. These policy objects are explicitly instantiated when the documents are parsed. They are then translated into Procedural Reasoning System plans to be interpreted by a set of PRS Enforcer agents, one for each protocol that the registry is configured to process. Whenever a user of the registry makes an operation call, the call is transformed into a message that triggers a PRS Enforcer before the message (or a newly generated one) can be processed by the registry's handlers. The policy can be modified during the registry's lifetime by the registry administrator via calls to a generic policy object manipulation API.
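The place of the enforcers in the handler chains can be pictured with a small interceptor sketch; the class and method names are illustrative and not those of the myGrid code base.

class PolicyEnforcer:
    def __init__(self, plans):
        self.plans = plans            # plans derived from the parsed policy documents

    def handle(self, message, next_handler):
        # A message only reaches the protocol handler through an applicable plan,
        # which may forward it unchanged, rewrite it, or generate further messages.
        for plan in self.plans:
            if plan['trigger'] == message['operation'] and plan['context'](message):
                for outgoing in plan['body'](message):
                    next_handler(outgoing)
                return
        raise PermissionError('operation %r denied by policy' % message['operation'])

def uddi_publish_handler(message):
    print('saving service advert:', message['payload'])

# A permissive example plan that lets authorised SaveService calls through unchanged.
plans = [{'trigger': 'SaveService',
          'context': lambda m: m.get('role') == 'User',
          'body': lambda m: [m]}]
PolicyEnforcer(plans).handle(
    {'operation': 'SaveService', 'role': 'User', 'payload': {'name': 'ExampleService'}},
    uddi_publish_handler)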

The transformation of policy documents into PRS agents can best be illustrated by referring back to the policy fragments in Figure 2. The first fragment translates directly into terms in the context of plans to save metadata, denying the action if the permission does not exist; the second maps to the triggering event of the TrustValueAdded plan; and the third fragment is used as the context of that plan.

Figure 4: The architecture of a myGrid registry

5 Evaluation

A small but increasing number of bioinformatics services are being made available as Web Services by public and commercial organisations. These include XML Central of DDBJ [4], who have exposed many popular bioinformatics tools, such as Blast, the gene sequence similarity search tool, and Protein Data Bank Japan [3]. In addition, the European Bioinformatics Institute [1], as part of the myGrid project, has exposed many of its tools as Web Services to support the day-to-day research by biologists at the University of Manchester and the University of Newcastle, and these are gradually being moved into the public domain.

We have successfully tested our system using the scenario described earlier, as well as simpler cases, using the interfaces and descriptions of these existing bioinformatics services (these are available from the websites of the organisations mentioned above). While this demonstrates that our approach does what it states it should, the evaluation of personalisation is made difficult by the subjective and qualitative nature of judging it to be a success. The requirements for personalisation come directly from biologists [14] and, at the moment, we have anecdotal evidence from bioinformaticians that it would be a useful tool when transferring to a new organisation, for example. However, the true value cannot be interpreted until myGrid becomes more widely adopted within organisations. In addition, the personalised registry will continue to be developed and tested as part of the Open Middleware Infrastructure Institute programme [2], which has a wide user base including engineers, physicists, chemists and biologists.


6 Related Work

Policy languages have been the focus of much attention lately in the Grid and Web Services community. Bradshaw et al. [11] discuss the notion of policy in the context of autonomous entities that cannot always be trusted to regulate their own behaviour appropriately, because they may be poorly designed, buggy or malicious. In this context, they introduce KAoS, a policy language that allows them to externally adjust the bounds of autonomous behaviour in order to ensure the safety and effectiveness of the system. Specifically, their policy language and associated enforcers allow them to dynamically regulate the behaviour of system components without changing their code and without requiring the cooperation of the components being governed. The KAoS policy language is based on the OWL ontology language, and distinguishes permitted actions from obligated actions [21]. KAoS can therefore be seen as a specification language, identifying some of the properties that are expected from a system. Interestingly, obligation and permission policies may not necessarily be enforceable, and this remains an open problem at the moment.

The recent proliferation of policy-based specifications in the Web Services community is also relevant to this line of work. WS-Policy is a framework for indicating a service's requirements and policies [7]. WS-SecurityPolicy defines extensions for security properties for Web services [8]. Other proposals for security policy exist in the context of Web Services: XACML (eXtensible Access Control Markup Language) is an OASIS specification that defines an XML schema for an extensible access-control policy language [23]; Denker et al. [9] define an OWL ontology of security mechanisms, which they use to specify which security mechanisms are supported by Semantic Web Services. Such security requirements can be regarded as an obligation policy that clients must satisfy in order to interact with a service. Similarly, Kagal et al. [18] also define a policy language with constructs for obligations and authorisations. Their policy statements are also used during matchmaking. Ponder [13] is a declarative, object-oriented language for specifying security policies with role-based access control, and general-purpose management policies for specifying what actions are carried out when specific events occur within a system or what resources to allocate under specific conditions. Enforcement is typically delegated to enforcement agents that intercept actions and check whether the access is permitted.

7 Conclusions and Further Work

The demand for personalisation of the Grid in a wide variety of applications has driven the need for policies which configure the resources available to the needs of individ­ual users. However, the dynamic distributed environment requires policy enforcement mechanisms that act in a timely and context-dependent way. These requirements are exactly those addressed by agent-based systems and solved by existing agent tech­nologies such as BDI. In this paper, we have presented our approach to policy en­forcement in service discovery, as applied to the bioinformatics application domain. In our approach, policy documents are written by a user to define the behaviour of a personalised service registry. These are then translated into a set of PRS plans, enacted


by a policy enforcement agent whenever triggered. Acting as part of the handling architecture of a registry, the agent can ensure that authorisation rights are observed and that content is shared between registries by continued communication. We will be testing our policy approach further within the context of the myGrid architecture in supporting bioinformaticians. We also intend to look at the verification of policies, by reasoning over their data representation. Additionally, we intend to look at applying our policy enforcement to configuring the agent matchmaking process, which can be seen as an extension of service discovery.

8 Acknowledgements

This research is funded in part by the myGrid e-Science pilot project grant (EPSRC GR/R677643), the Overseas Travel Scheme of the Royal Academy of Engineering, EPSRC project (reference GR/S75758/01) and the PASOA project (EPSRC GR/S67623/01).

References

[1] European Bioinformatics Institute. http://www.ebi.ac.uk, 2004.

[2] Open Middleware Infrastructure Institute. http://www.omii.ac.uk/, 2004.

[3] Protein Data Bank Japan. http://pdbj.protein.osaka-u.ac.jp/, 2004.

[4] XML Central of DDBJ. http://xml.ddbj.nig.ac.jp, 2004.

[5] P. R. Cohen and H. J. Levesque. Intention is choice with commitment. Artificial Intelligence, 42:213-261, 1990.

[6] M. d'Inverno et al. The dMARS architecture: A specification of the distributed multi-agent reasoning system. Autonomous Agents and Multi-Agent Systems, to appear, 2004.

[7] D. Box et al. Web Services Policy Framework (WS-Policy). http://msdn.microsoft.com/webservices/default.aspx?pull=/library/en-us/dnglobspec/html/ws-policy.asp, 2003.

[8] G. Della-Libera et al. Web Services Security Policy (WS-SecurityPolicy). www.ibm.com/developerworks/library/ws-secpol/index.html, 2002.

[9] G. Denker et al. Security for DAML web services: Annotation and matchmaking. In Second International Semantic Web Conference, Sanibel Island, FL, 2003.

[10] I. Foster et al. The Physiology of the Grid — An Open Grid Services Architecture for Distributed Systems Integration. Technical report, Argonne National Laboratory, 2002.

[11] J. M. Bradshaw et al. Computational Autonomy, chapter Dimensions of Adjustable Autonomy and Mixed-Initiative Interaction. Springer, 2004.


[12] L. Moreau et al. On the Use of Agents in a BioInformatics Grid. In S. Lee, S. Sekiguchi, S. Matsuoka, and M. Sato, editors, Proceedings of the Third IEEE/ACM CCGRID'2003 Workshop on Agent Based Cluster and Grid Computing, pages 653-661, Tokyo, Japan, 2003.

[13] N. Damianou et al. Ponder: a language for specifying security and management policies for distributed systems. Imperial College Research Report DoC 2000/1, 2000.

[14] R. Stevens et al. Performing in silico Experiments on the Grid: A Users Perspective. In S. Cox, editor, Proceedings of the UK OST e-Science Second All Hands Meeting 2003, pages 43-50, Nottingham, UK, 2003.

[15] I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 2001.

[16] I. Foster. What is the Grid? A three point checklist. http://www-fp.mcs.anl.gov/~foster/, 2002.

[17] M. Georgeff and A. Lansky. Procedural Knowledge. Proceedings of the IEEE, 74(10):1383-1398, 1986.

[18] L. Kagal, T. Finin, and A. Joshi. A policy based approach to security for the semantic web. In Second International Semantic Web Conference (ISWC2003), Sanibel Island, FL, 2003.

[19] M. Klusch and K. Sycara. Brokering and Matchmaking for Coordination of Agent Societies: A Survey. In A. Omicini et al., editors, Proceedings of Coordination of Internet Agents 2001. Springer, 2001.

[20] A. Rao. AgentSpeak(L): BDI agents speak out in a logical computable language. In Seventh European Workshop on Modelling Autonomous Agents in a Multi-Agent World, 1996.

[21] A. Uszok, J. Bradshaw, and R. Jeffers. KAoS: A policy and domain services framework for grid computing and semantic web services. In Proceedings of the iTrust'04 Conference, 2004.

[22] C. Wong and K. Sycara. A Taxonomy of Middle Agents for the Internet. In A. Omicini et al., editors, Proceedings of the Fourth International Conference on Multiagent Systems, pages 465-466, 2000.

[23] OASIS eXtensible Access Control Markup Language (XACML). http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml.


Exploiting Causal Independence in Large Bayesian Networks

Rasa Jurgelenaite and Peter Lucas

Radboud University Nijmegen, Nijmegen, The Netherlands

Email: {rasa, peterl}@cs.ru.nl

Abstract

The assessment of a probability distribution associated with a Bayesian network is a challenging task, even if its topology is sparse. Special probability distributions based on the notion of causal independence have therefore been proposed, as these allow defining a probability distribution in terms of Boolean combinations of local distributions. However, for very large networks even this approach becomes infeasible: in Bayesian networks which need to model a large number of interactions among causal mechanisms, such as in fields like genetics or immunology, it is necessary to further reduce the number of parameters that need to be assessed. In this paper, we propose using equivalence classes of binomial distributions as a means to define very large Bayesian networks. We analyse the behaviours obtained by using different symmetric Boolean functions with these probability distributions as a means to model joint interactions. Some surprisingly complicated behaviours are obtained in this fashion, and their intuitive basis is examined.

1 Introduction

Bayesian networks offer an appealing language with an associated set of tools for building models of domains with inherent uncertainty. However, a significant bottleneck in constructing Bayesian networks, whether done manually or by learning from data, is the size of their underlying probability tables. Even though adopting a sound design methodology may render the resulting graph representation of the Bayesian network relatively sparse, typically, real-world Bayesian networks include some probability tables which are really large. There are several proposals in the literature which may help reduce the size of the tables. One of the more systematic ways to cope with large probability tables is offered by the theory of causal independence; it allows decomposing a probability distribution in terms of Boolean interactions among local parameters.

As a consequence of the success of Bayesian networks in solving realistic problems, increasingly complicated situations are being tackled. We are in particular interested in the modelling of biomedical knowledge, for example in fields such as genetics and immunology; in these fields hundreds to thousands of interactions between variables may need to be captured in a probabilistic model. Clearly, such models cannot be handled without exploiting (potentially


hypothetical) knowledge about underlying causal mechanisms and associated simplifying assumptions.

The aim of the present work was to develop a theory on top of the theory of causal independence which allows defining interactions between a huge number of causal factors. This is done by assuming that the parameters in terms of which the probability distribution is defined are members of an equivalence class. We apply symmetric Boolean functions to combine first the causal factors inside an equivalence class and subsequently the effects of the equivalence classes. The probabilistic behaviour obtained in this fashion is analysed in detail.

The practical significance of such an analysis becomes clear if one realises that many practical Bayesian network models use causal independence assumptions. A well-known example is the probabilistic reformulation of the Quick Medical Reference (QMR), called QMR-DT, a very large diagnostic knowledge-based system in the area of internal medicine, in which causal independence was used to manage the complexity of the underlying Bayesian network model [8].

The remainder of this paper is organised as follows. In the following section, the basic properties of Bayesian networks, Boolean functions, and the notion of causal independence are introduced. A mathematical analysis of the behaviour of various models is given in Sections 3 and 4. The paper is rounded off by a summary of what has been achieved and by plans for future research.

2 Preliminaries

2.1 Bayesian Networks and Causal Modelling

A Bayesian network B = (G, Pr) represents a factorised joint probability distribution on a set of variables V. It consists of two parts: (1) a qualitative part, represented as an acyclic directed graph (ADG) G = (V(G), A(G)), whose vertices V(G) correspond to the random variables in V, and arcs A(G) represent the conditional (in)dependencies between the variables; (2) a quantitative part Pr consisting of local probability distributions Pr(V | π(V)), for each vertex V ∈ V(G) given its parents π(V). The joint probability distribution Pr is factorised according to the structure of the graph, as follows:

$$\Pr(V(G)) = \prod_{V \in V(G)} \Pr(V \mid \pi(V)).$$
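As a minimal illustration of this factorisation, the Python sketch below computes the probability of one complete truth assignment by multiplying the local distributions; the toy network, its variable names and its numbers are invented for illustration and are not taken from the paper.

```python
# A minimal sketch of Pr(V(G)) = prod_{V in V(G)} Pr(V | pi(V)) for binary
# variables.  The toy network and its parameters are illustrative only.

parents = {                       # pi(V): the parent set of each vertex
    "Infection": [],
    "Penicillin": [],
    "Decrease": ["Infection", "Penicillin"],
}

cpt = {                           # Pr(V = True | assignment to pi(V))
    "Infection": {(): 0.3},
    "Penicillin": {(): 0.5},
    "Decrease": {
        (True, True): 0.9, (True, False): 0.1,
        (False, True): 0.0, (False, False): 0.0,
    },
}

def joint_probability(assignment):
    """Probability of a full truth assignment, factorised over the graph."""
    prob = 1.0
    for v, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        p_true = cpt[v][key]
        prob *= p_true if assignment[v] else 1.0 - p_true
    return prob

print(joint_probability({"Infection": True, "Penicillin": True, "Decrease": True}))
```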

Each variable V ∈ V has a finite set of mutually exclusive states. In this paper, we assume all variables to be binary; as an abbreviation, we will often use v to denote V = ⊤ (true) and v̄ to denote V = ⊥ (false). Variables V can either act as free variables, in which case their binding is arbitrary, or they can act as bound variables, where bindings are established by associated operators. Furthermore, an expression such as

$$\sum_{\psi(I_1,\ldots,I_n) = \epsilon} p(I_1,\ldots,I_n)$$

stands for summing over all possible values of p(I_1, …, I_n) for all possible values of the variables I_k for which the constraint ψ(I_1, …, I_n) = ε holds.

Figure 1: Example Bayesian network, modelling the interaction between the antimicrobial agents penicillin and chlortetracyclin on infection.

Even though it is acknowledged by researchers that Bayesian networks are excellent tools for the modelling of uncertain causal mechanisms, the question remains in what way different causal mechanisms can best be modelled. Let us look at two real-world examples, which provide motivation for the approach developed in this paper.

Consider the interaction between bactericidal antimicrobial agents, i.e. drugs that kill bacteria by interference with their metabolism, resulting, for example, in fragile cell walls, and bacteriostatic antimicrobial agents, i.e. drugs that inhibit the multiplication of bacteria, for example by suppressing the production of necessary proteins. Penicillin is an example of a bactericidal drug, whereas chlortetracyclin is an example of a bacteriostatic drug. It is well known among medical doctors that the interaction between bactericidal and bacteriostatic drugs can have antagonistic effects; e.g. the drug combination penicillin and chlortetracyclin may have as little effect against an infection as prescribing no antimicrobial agent at all, even if the bacteria are susceptible to each of these drugs. The depiction of the causal interaction of the relevant variables is shown in Figure 1.

As a second example, consider the administration of chemotherapy to patients. If a patient has cancer, chemotherapy increases the chances of survival; however, if the patient does not have cancer, chemotherapy reduces the chances of survival. Clearly, the causal interaction between chemotherapy, cancer and survival has some underlying logic. This is shown schematically in Figure 2.

Although the Bayesian networks shown in figures 1 and 2 have a very similar structure, their underlying interaction semantics is very different as we will see.


Figure 2: Example Bayesian network, modelling the interaction among cancer, chemotherapy and survival.

2.2 Probabilistic Representation

Causal independence [10], also called noisy functional dependence [5], is a popular way to specify interactions among cause variables. The global structure of a causal independence model is shown in Figure 3; it expresses the idea that causes C_1, …, C_n influence a given common effect E through intermediate variables I_1, …, I_n and a deterministic function f, called the interaction function. The impact of each cause C_k on the common effect E is independent of each other cause C_j, j ≠ k. The function f represents in which way the intermediate effects I_k, and indirectly also the causes C_k, interact to yield the final effect E. Hence, the function f is defined in such a way that when a relationship, as modelled by the function f, between I_k, k = 1, …, n, and E = ⊤ is satisfied, then it holds that e = f(I_1, …, I_n). It is assumed that Pr(e | I_1, …, I_n) = 1 if f(I_1, …, I_n) = e, and Pr(e | I_1, …, I_n) = 0 if f(I_1, …, I_n) = ē.

The conditional probability of the occurrence of the effect E given the causes C_1, …, C_n, i.e. Pr(e | C_1, …, C_n), can be obtained from the conditional probabilities Pr(I_k | C_k) as follows [7, 10]:

$$\Pr(e \mid C_1, \ldots, C_n) = \sum_{f(I_1,\ldots,I_n) = e} \; \prod_{k=1}^{n} \Pr(I_k \mid C_k). \quad (1)$$

It is assumed that absent causes do not contribute to the effect, i.e. Pr(i_k | c̄_k) = 0.
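As an illustration of equation (1), the following Python sketch enumerates the configurations of the intermediate variables and sums the products of the local probabilities; the interaction functions, parameter values and cause assignments below are hypothetical, and this is not code from the paper.

```python
from itertools import product

def causal_independence_prob(f, pr_i_given_c, causes):
    """Pr(e | C_1,...,C_n): sum over intermediate configurations with
    f(I_1,...,I_n) true of prod_k Pr(I_k | C_k), where an absent cause
    contributes Pr(i_k) = 0.  pr_i_given_c[k] is Pr(i_k | c_k present)."""
    n = len(causes)
    total = 0.0
    for inter in product([False, True], repeat=n):
        if not f(inter):
            continue
        p = 1.0
        for k in range(n):
            p_ik = pr_i_given_c[k] if causes[k] else 0.0   # absent causes do not contribute
            p *= p_ik if inter[k] else 1.0 - p_ik
        total += p
    return total

# Noisy-OR and noisy-AND arise from f = any and f = all respectively:
print(causal_independence_prob(any, [0.8, 0.6, 0.7], [True, True, False]))
print(causal_independence_prob(all, [0.8, 0.6, 0.7], [True, True, True]))
```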

An important subclass of causal independence models is formed by models in which the deterministic function f can be defined in terms of separate binary functions g_k, also denoted by g_k(I_k, I_{k+1}). Such causal independence models have been called decomposable causal independence models [4]; these models are of significant practical importance. Usually, all functions g_k(I_k, I_{k+1}) are identical for each k; a function g_k(I_k, I_{k+1}) may therefore be simply denoted by g(I, I′).


Figure 3: Causal independence model.

Well-known examples of causal independence models are the noisy-OR and noisy-AND models, where the function / represents a logical OR and a logical AND function, respectively.

2.3 Symmetric Boolean Functions

The function f in equation (1) is actually a Boolean function. However, there are 2^{2^n} different n-ary Boolean functions [2, 9]. Consequently, the potential number of causal interaction models is huge. However, in the case of causal independence it is usually assumed that the function f is decomposable into identical, binary functions. In addition, it is attractive to assume that the order of the cause variables does not matter; thus, it makes sense to restrict causal independence models to symmetric Boolean functions, where the order of arguments is irrelevant [9].

There are 8 symmetric binary Boolean functions, of which 6 are commutative and associative; we will take these as a basis for defining Boolean functions of n arguments [7]. The advantage of this choice is that the order of arguments is irrelevant, as for any symmetric Boolean function, and that the resulting functions are also decomposable. Logical truth and falsity are constants, and act as the global extremes in a partial order among Boolean functions. As such they give rise to trivial causal independence models. The remaining four causal independence models are defined in terms of the logical OR, AND, XOR and bi-implication.

We use * to denote a commutative, associative binary operator. A Boolean function can then also be expressed as an expression: f_*(I_1, …, I_n) = I_1 * ⋯ * I_n. Table 1 gives the truth tables for the n-ary Boolean functions of interest. From now on, the following notation is adopted: ∨ (OR), ∧ (AND), ⊗ (exclusive OR), ↔ (bi-implication).

Table 1: The truth tables for the n-ary symmetric Boolean functions; here we have that k = Σ_{j=1}^{n} ν(I_j), with ν(I_j) equal to 1 if I_j is equal to true and 0 otherwise.

I_1 ∨ ⋯ ∨ I_n is true iff k ≥ 1
I_1 ∧ ⋯ ∧ I_n is true iff k = n
I_1 ⊗ ⋯ ⊗ I_n is true iff odd(k)
I_1 ↔ ⋯ ↔ I_n is true iff even(n − k)


We return to our example Bayesian-network models shown in figures 1 and 2. The interaction between penicillin and chlortetracyclin as depicted in Figure 1 can be described by means of an exclusive OR, ⊗, as presence of either of these in the patient's body tissues leads to a decrease in bacterial growth, whereas if both are present or absent, there will be little or no effect on bacterial growth. The interaction between cancer and chemotherapy as shown in Figure 2 can be described by means of a bi-implication, ↔, as chances of survival are large in the case of cancer if it is being treated by chemotherapeutics, and also in the absence of cancer without treatment.

2.4 Symmetric Causal Independence Models

Recall that the function f_∨(I_1, …, I_n) yields the value true if there is at least one variable I_j with the value true. Therefore, the probability distribution for the OR causal independence model is defined as follows:

$$\Pr_\vee(e \mid C_1,\ldots,C_n) = 1 - \Bigl(1 - \sum_{I_1\vee\cdots\vee I_n}\; \prod_{k=1}^{n}\Pr(I_k \mid C_k)\Bigr) = 1 - \prod_{k=1}^{n}\Pr(\bar{i}_k \mid C_k). \quad (2)$$

The probability distribution for the AND causal independence model is defined similarly:

$$\Pr_\wedge(e \mid C_1,\ldots,C_n) = \prod_{k=1}^{n}\Pr(i_k \mid C_k). \quad (3)$$

The function f_⊗(I_1, …, I_n) yields the value true if there is an odd number of variables I_j with the value true. Therefore, in order to determine the probability of the effect variable E, Pr(e | C_1, …, C_n), the probabilities for all cause variable combinations with an odd number of present causes have to be added. We have:

$$\begin{aligned}
\Pr_\otimes(e \mid C_1,\ldots,C_n) &= \sum_{I_1\otimes\cdots\otimes I_n}\; \prod_{k=1}^{n}\Pr(I_k \mid C_k)\\
&= \Pr(\bar i_1 \mid C_1)\cdots\Pr(\bar i_n \mid C_n)\sum_{\substack{1\le k\le n\\ \mathrm{odd}(k)}}\;\sum_{j_1=1}^{n-k+1}\sum_{j_2=j_1+1}^{n-k+2}\cdots\sum_{j_k=j_{k-1}+1}^{n}\frac{\Pr(i_{j_1}\mid C_{j_1})\cdots\Pr(i_{j_k}\mid C_{j_k})}{\Pr(\bar i_{j_1}\mid C_{j_1})\cdots\Pr(\bar i_{j_k}\mid C_{j_k})}
\end{aligned} \quad (4)$$

The function f_↔(I_1, …, I_n) yields the value true if there is an even number of variables I_j with the value false. Thus, to determine Pr(e | C_1, …, C_n) the


probabilities for all cause variable combinations with an even number of absent causes have to be added:

$$\begin{aligned}
\Pr_\leftrightarrow(e \mid C_1,\ldots,C_n) &= \sum_{I_1\leftrightarrow\cdots\leftrightarrow I_n}\; \prod_{k=1}^{n}\Pr(I_k \mid C_k)\\
&= \Pr(i_1 \mid C_1)\cdots\Pr(i_n \mid C_n)\left(1 + \sum_{\substack{1\le k\le n\\ \mathrm{even}(k)}}\;\sum_{j_1=1}^{n-k+1}\sum_{j_2=j_1+1}^{n-k+2}\cdots\sum_{j_k=j_{k-1}+1}^{n}\frac{\Pr(\bar i_{j_1}\mid C_{j_1})\cdots\Pr(\bar i_{j_k}\mid C_{j_k})}{\Pr(i_{j_1}\mid C_{j_1})\cdots\Pr(i_{j_k}\mid C_{j_k})}\right)
\end{aligned} \quad (5)$$

The following proposition establishes the relationship between the probability distributions obtained when taking the XOR and the bi-implication, respectively, as a basis for a causal interaction model:

Proposition 1  Pr_⊗(e | C_1, …, C_n) = Pr_↔(e | C_1, …, C_n), for odd(n), and Pr_⊗(e | C_1, …, C_n) = Pr_↔(ē | C_1, …, C_n), for even(n).

The XOR and bi-implication causal interaction models are very sensitive to changes in the probabilities of the cause variables. If at least one cause variable is equally likely to be absent or present, the probability of the effect variable E is also equally likely to be absent or present, as is shown in the following proposition:

Proposition 2  Let XOR and bi-implication be the Boolean functions of two causal independence models. If at least one cause variable is equally likely to be present or absent, i.e. Pr(i_k | C_k) = Pr(ī_k | C_k), the probabilities of the effect E to be present or absent are also equal:

$$\Pr_*(e \mid C_1,\ldots,C_n) = \Pr_*(\bar e \mid C_1,\ldots,C_n) = \tfrac{1}{2},$$

where * ∈ {⊗, ↔}.

The proposition indicates that the probability for one cause variable can completely dominate the probability of the effect variable E. However, the situation changes if this particular cause variable is instantiated. This property is invalid for OR and AND causal interaction models: in these models one cause variable cannot completely dominate the probability distribution for the effect variable E.
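Proposition 2 can be checked numerically with the enumeration approach of equation (1); in the sketch below, which uses made-up parameter values, setting a single Pr(i_k | C_k) to 0.5 forces both the XOR and the bi-implication model to return 0.5.

```python
from itertools import product
from math import prod

def pr_effect(f, probs):
    """Pr(e) when all causes are present, summing over the intermediate
    configurations for which the interaction function f is true."""
    return sum(
        prod(p if i else 1 - p for i, p in zip(inter, probs))
        for inter in product([False, True], repeat=len(probs))
        if f(inter)
    )

xor   = lambda inter: sum(inter) % 2 == 1
biimp = lambda inter: (len(inter) - sum(inter)) % 2 == 0

# One cause with Pr(i_k | C_k) = 0.5 dominates the effect distribution,
# whatever the other (arbitrary) parameters are:
probs = [0.5, 0.17, 0.93, 0.64]
print(pr_effect(xor, probs), pr_effect(biimp, probs))   # both 0.5 (up to rounding)
```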


3 Grouping Probabilistic Information

Even if we use the theory of causal independence as a tool to simplify estimating a conditional probability distribution Pr(E | C_1, …, C_n), if n is very large the entire process rapidly becomes infeasible. However, the larger n becomes, the more likely it becomes that parameters Pr(I_k | C_k) of a causal independence model become arbitrarily close to each other. Hence, one way to simplify the estimation of the probability distribution is to group parameters in particular equivalence classes, and to assume that the class representative Pr(I_k | C_k) follows a particular statistical law. In the remainder of the paper, we study the various probability distributions that are obtained in this fashion. In the case of a Bayesian network with discrete variables, taking the binomial distribution as a basis for estimation purposes seems to offer a good starting point.

3.1 The Binomial Distribution

The binomial distribution is one of the most commonly used discrete probability distributions. In an experiment which follows a binomial distribution, trials are independent and identical, with possible outcomes 'success' and 'failure', and with a probability of success that is constant.

The probability distributions of a causal independence model can be interpreted as representing a sequence of results of an experiment of n identical trials, where n is equal to the number of cause variables. From the definition above we can see that cause variables can be treated as trials of an experiment satisfying the requirements of a binomial distribution, as the number of cause variables n is known in advance, all cause variables have two states, are independent, and the probability of occurrence of each cause is the same.

3.2 Equivalence Classes of Binomial Distributions

We organise the intermediate variables I_1, …, I_n and their associated variables C_1, …, C_n by their influence on the common effect E, in accordance with the increasing order of the associated probabilistic parameters Pr(I_k | C_k). Next, we choose a small positive number ε ∈ ℝ⁺, which determines how much the probabilities may vary inside an equivalence class. An intermediate variable I_k belongs to the t-th equivalence class if its probability of success Pr(i_k | C_k) falls into the interval [2(t − 1)ε, 2tε). The number of equivalence classes is equal to r = 1/(2ε). Further, we assume that all intermediate variables from the same equivalence class have the same probability of success Pr(i_t | C_t) and apply the concepts of the binomial distribution to estimate the probability distribution of the t-th equivalence class, $\sum_{I_{t_1} * \cdots * I_{t_{n_t}}} \prod_{k=t_1}^{t_{n_t}} \Pr(I_k \mid C_k)$, where C_{t_1}, …, C_{t_{n_t}} are the cause variables that belong to the t-th equivalence class, n_t is the number of variables in the equivalence class, and Σ_{t=1}^{r} n_t = n. In this paper we assume the class representative to be Pr(i_t | C_t) = (2t − 1)ε; however, there are other possible ways to define the probability of success inside an equivalence class, e.g. Pr(i_t | C_t) = (1/n_t) Σ_{k=t_1}^{t_{n_t}} Pr(i_k | C_k).
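The grouping step can be sketched as follows; the parameter list, the value of ε and the helper name equivalence_classes are illustrative assumptions rather than the authors' implementation.

```python
# A sketch of the grouping step: bucket the parameters Pr(i_k | C_k)
# into r = 1/(2*eps) classes of width 2*eps, with class representative
# (2t - 1)*eps.  The numbers below are illustrative.

def equivalence_classes(params, eps):
    r = round(1.0 / (2.0 * eps))                     # number of classes
    counts = [0] * r                                 # n_t: class sizes
    for p in params:
        t = min(int(p // (2.0 * eps)), r - 1)        # 0-based class index
        counts[t] += 1
    reps = [(2 * t + 1) * eps for t in range(r)]     # (2t - 1)*eps, 1-based t
    return [(rep, n) for rep, n in zip(reps, counts) if n > 0]

params = [0.04, 0.07, 0.11, 0.12, 0.13, 0.48, 0.52, 0.53]
print(equivalence_classes(params, eps=0.05))
# e.g. classes with representatives around 0.05, 0.15, 0.45, 0.55 and sizes 2, 3, 1, 2
```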


To determine the probability distribution of the effect variable E based on the probability distributions of the contributing equivalence classes, exactly the same combining functions are employed as when combining the single probability distributions Pr(I_k | C_k) associated with cause variables C_k.

Depending on the Boolean function employed, the probability distribution inside an equivalence class is then determined by one of the following equations:

$$\Pr_\vee(e \mid C_1,\ldots,C_n) = 1 - \Pr(\bar i_t \mid C_t)^n \quad (6)$$

$$\Pr_\wedge(e \mid C_1,\ldots,C_n) = \Pr(i_t \mid C_t)^n \quad (7)$$

$$\Pr_\otimes(e \mid C_1,\ldots,C_n) = \sum_{\substack{0\le k\le n\\ \mathrm{odd}(k)}} \binom{n}{k}\Pr(i_t \mid C_t)^k \Pr(\bar i_t \mid C_t)^{n-k} \quad (8)$$

$$\Pr_\leftrightarrow(e \mid C_1,\ldots,C_n) = \sum_{\substack{0\le k\le n\\ \mathrm{even}(k)}} \binom{n}{k}\Pr(\bar i_t \mid C_t)^k \Pr(i_t \mid C_t)^{n-k} \quad (9)$$
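A small Python sketch of equations (6)–(9), using binomial coefficients directly; the function names and the parameter values in the example are assumptions for illustration only.

```python
from math import comb

# Equations (6)-(9) for one equivalence class with representative
# probability p = Pr(i_t | C_t) and n member variables (a sketch).

def pr_or(p, n):                  # (6)
    return 1.0 - (1.0 - p) ** n

def pr_and(p, n):                 # (7)
    return p ** n

def pr_xor(p, n):                 # (8): odd number of present intermediates
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(1, n + 1, 2))

def pr_biimp(p, n):               # (9): even number of absent intermediates
    return sum(comb(n, k) * (1 - p)**k * p**(n - k) for k in range(0, n + 1, 2))

for n in range(1, 6):
    print(n, pr_or(0.3, n), pr_and(0.3, n), pr_xor(0.3, n), pr_biimp(0.3, n))
```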

4 Analysis of Probabilistic Behaviour

In this section, we study the properties of the causal independence models introduced above, and in particular we examine patterns in the resulting probability distribution as a function of the number of contributing causes. This will give us insight into the global probabilistic characteristics of large causal independence models.

Section 3 mentioned a scheme to combine the effects of the individual equivalence classes. Here it is therefore permitted to restrict the mathematical analysis to one equivalence class of binomial distributions only, as the analysis for the other equivalence classes is identical. The basis of the analysis is provided by the mathematical theory of sequences and series.

Let S_1^*, S_2^*, … be a sequence, abbreviated to (S_n^*); throughout this section, a member S_n^* of this sequence represents a sum of products of probability distributions in an equivalence class of binomial distributions, i.e. S_n^* = Pr_*(e | C_1, …, C_n) as given by one of the equations (6)–(9).

We assume the probability Pr(i_t | C_t) to be constant, i.e. p = Pr(i_t | C_t). In our treatment we combine various causal independence models based on similarity in behaviour. For example, the OR and AND causal independence models possess similar behaviours, which in most cases appear to be each other's opposites. Analogous remarks can be made for the two other types of causal independence models. The following propositions show that OR and AND causal independence models yield first-order behaviour, which is monotonic for any probability p with the exception of the bounds p ∈ {0, 1}. The proofs are omitted because of lack of space (cf. [6]).

Figure 4: The patterns of the OR causal independence model (probability of the effect plotted against the number of causes, for p ∈ {0, 1/10, 1/2, 9/10, 1}).

Proposition 3  Let (S_n^*) be a sequence as defined above. For each member S_n^* of the sequence it holds that:

• if p ∈ (0, 1) then
  – S_n^* ∈ [p, 1) for * = ∨, and
  – S_n^* ∈ (0, p] for * = ∧;
• otherwise, if p ∈ {0, 1} then S_n^* = p for both * = ∨ and * = ∧.

Proposition 4  If p ∈ (0, 1) then a sequence (S_n^*) is

• strictly monotonically increasing for * = ∨, and
• strictly monotonically decreasing for * = ∧.

It appears that the sequences converge to one of their bounds. As we try to understand the behaviour of large causal independence models, the rate of convergence is clearly also relevant. The first derivative of the function F, used to generate the sequence S_{n+1}^* = F(S_n^*), can serve as a basis for this. If * = ∨ then F(S_n^∨) = 1 − (1 − p)(1 − S_n^∨); thus the larger the value of p, the faster the sequence converges to 1. If * = ∧ then F(S_n^∧) = p S_n^∧; thus the smaller the value of p, the faster the sequence converges to 0. Figures 4 and 5 illustrate the results above by means of plots.
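The monotone behaviour described in Propositions 3 and 4 can be reproduced with the first-order recurrences implied by equations (6) and (7); the sketch below, with an arbitrarily chosen p, is illustrative and not part of the paper.

```python
# A sketch of the monotone sequences behind Propositions 3 and 4:
# S_n^or = 1 - (1-p)^n and S_n^and = p^n, generated by the recurrences
# S_{n+1} = 1 - (1-p)(1 - S_n) and S_{n+1} = p * S_n respectively.

def sequence(first, step, n_max):
    s, out = first, []
    for _ in range(n_max):
        out.append(round(s, 4))
        s = step(s)
    return out

p = 0.3
print(sequence(p, lambda s: 1 - (1 - p) * (1 - s), 10))   # increases towards 1
print(sequence(p, lambda s: p * s, 10))                    # decreases towards 0
```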

So much for the study of the OR and AND causal models; the nature of the monotonic behaviour revealed by the propositions above and the associated plots is presumably consistent with the expectations of the reader. However, the study of the properties of the causal independence models with XOR and bi-implication interactions revealed surprisingly complicated behaviours. In addition to the expected bounds of 0 and 1, the sequences have an additional bound at 1/2.


Figure 5: The patterns of the AND causal independence model (probability of the effect plotted against the number of causes, for p ∈ {0, 1/10, 1/2, 9/10, 1}).

Proposition 5  Let (S_n^*) be a sequence as defined above. For each member S_n^* of the sequence it holds that:

• if p ∈ [0, 1/2) then
  – S_n^* ∈ [p, 1/2) for * = ⊗, and
  – S_n^* ∈ [p, 1/2) ∪ (1/2, 1 − 2p(1 − p)] for * = ↔;
• otherwise, if p ∈ (1/2, 1] then
  – S_n^* ∈ [2p(1 − p), 1/2) ∪ (1/2, p] for * = ⊗, and
  – S_n^* ∈ (1/2, p] for * = ↔.

Proof: (Sketch) As the proof is by induction, we express S_{n+1}^⊗ in terms of S_n^⊗. Using the theory of binomial coefficients it follows that

$$S_{n+1}^{\otimes} = \sum_{\substack{1\le k\le n+1\\ \mathrm{odd}(k)}} \binom{n+1}{k} p^k (1-p)^{n+1-k} = (1-p)S_n^{\otimes} + p(1 - S_n^{\otimes}) = S_n^{\otimes}(1-2p) + p \quad (10)$$

We have used the fact that

$$1 - S_n^{\otimes} = \sum_{\substack{0\le k\le n\\ \mathrm{even}(k)}} \binom{n}{k} p^k (1-p)^{n-k}.$$

In a similar way, we obtain the results for the bi-implication:

$$S_{n+1}^{\leftrightarrow} = S_n^{\leftrightarrow}(2p-1) + 1 - p \quad (11)$$

□
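The recurrences (10) and (11) can be iterated directly to see the oscillating convergence to 1/2; the following sketch, with an arbitrarily chosen p, is illustrative only.

```python
# A sketch of the recurrences (10) and (11):
#   S_{n+1}^xor   = S_n^xor * (1 - 2p) + p
#   S_{n+1}^biimp = S_n^biimp * (2p - 1) + 1 - p
# For p > 1/2 the XOR sequence oscillates around 1/2 and converges to it
# at rate |1 - 2p| (and analogously for the bi-implication).

def iterate(first, step, n_max):
    s, out = first, []
    for _ in range(n_max):
        out.append(round(s, 4))
        s = step(s)
    return out

p = 0.9
print(iterate(p, lambda s: s * (1 - 2 * p) + p, 10))        # XOR, with S_1 = p
print(iterate(p, lambda s: s * (2 * p - 1) + 1 - p, 10))    # bi-implication, S_1 = p
```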

Proposition 6  A sequence (S_n^*) is

• strictly monotonically increasing if p ∈ (0, 1/2) and * = ⊗,


• strictly monotonically decreasing if p ∈ (1/2, 1) and * = ↔,
• constant, S_n^* = p, if
  – p ∈ {0, 1/2} and * = ⊗,
  – p ∈ {1/2, 1} and * = ↔,
• non-monotonic if
  – p ∈ (1/2, 1] and * = ⊗,
  – p ∈ [0, 1/2) and * = ↔.

The propositions above yield insight into the behaviour of the sequences but raise questions about the behaviour non-monotonic sequences will show, i.e. when p ∈ (1/2, 1], * = ⊗, and p ∈ [0, 1/2), * = ↔. Let the sequence (S_n^*) be divided into two sequences: S_1^*, S_3^*, …, denoted by (S_odd(n)^*), and S_2^*, S_4^*, …, denoted by (S_even(n)^*). We have the following proposition:

Proposition 7  Let (S_odd(n)^*) and (S_even(n)^*) be sequences as defined above. For each member of the sequences it holds that:

• if * = ⊗ and p ∈ (1/2, 1] then
  – S_odd(n)^⊗ ∈ (1/2, p],
  – S_even(n)^⊗ ∈ [2p(1 − p), 1/2);
• if * = ↔ and p ∈ [0, 1/2) then
  – S_odd(n)^↔ ∈ [p, 1/2),
  – S_even(n)^↔ ∈ (1/2, 1 − 2p(1 − p)].

Proposition 8  Let (S_odd(n)^*) and (S_even(n)^*) be sequences as defined above. Then it holds that:

• if p ∈ (1/2, 1] and * = ⊗,
  – (S_odd(n)^⊗) is strictly monotonically decreasing,
  – (S_even(n)^⊗) is strictly monotonically increasing;
• if p ∈ [0, 1/2) and * = ↔,
  – (S_odd(n)^↔) is strictly monotonically increasing,
  – (S_even(n)^↔) is strictly monotonically decreasing.

From the propositions above we conclude that despite their complicated behaviours, the sequences converge to 1/2. Once again we will employ the first derivative of the function F, with S_{n+1}^* = F(S_n^*), to determine the convergence rate. From the previous results (10) and (11) we know that F(S_n^⊗) = (1 − 2p)S_n^⊗ + p and F(S_n^↔) = S_n^↔(2p − 1) + 1 − p. As |F′(S_n^*)| = |1 − 2p| for * ∈ {⊗, ↔}, the rate of convergence depends on the value of p; the closer the value of p is to 1/2, the faster the sequence converges to 1/2. Figures 6 and 7 illustrate this behaviour.


Figure 6: The patterns of the XOR causal independence model (probability of the effect plotted against the number of causes).

Figure 7: The patterns of the bi-implication causal independence model (probability of the effect plotted against the number of causes).

5 Discussion

In this paper, we addressed the problem of probability distribution estimation in very large Bayesian networks. Quite naturally, the theory of causal independence served as a starting point for such networks. As was argued, even if resorting to this theory, it quickly becomes infeasible to assess probability distributions for such networks. Our solution was to group local probability distributions into equivalence classes using probability intervals, and to use a suitably defined probability distribution as a basis for assessment.

The basic tools used for probability estimation were symmetric Boolean functions, which appeared to offer a natural choice as they provide a logical description of interactions between cause variables where the order between variables does not matter, and the binomial distribution, which is a standard choice in discrete probability distributions. As far as we know, this is the first paper offering a systematic analysis of the global probabilistic patterns that occur in large Bayesian networks based on the theory of causal independence.


As was shown, these types of Bayesian networks reveal surprisingly rich probabilistic patterns.

Even though the results achieved in this paper are theoretical, it should be stressed that the theory of causal independence is being used in practice in building Bayesian networks. The theory developed in this paper can be used as a basis for the construction of very large Bayesian networks, for example, in fields such as medicine, in particular internal medicine, and genetics. Although Bayesian networks have been explored in the early 1990s in such fields as part of research projects, it is only now that Bayesian networks are being adopted as tools for solving biomedical problems (cf. [3]). The theory developed in this paper could enhance the practical usefulness of the formalism.

References

[1] F.J. Diez, Parameter adjustment in Bayes networks. The generalized noisy or-gate, Proc. UAI-93, pp. 99-105, 1993.

[2] H.B. Enderton, A Mathematical Introduction to Logic, Academic Press, San Diego, 1972.

[3] N. Friedman, Inferring cellular networks using probabilistic graphical models, Science, 303, pp. 799-805, 2004.

[4] D. Heckerman and J.S. Breese, A new look at causal independence, Proc. UAI'94, pp. 286-292, 1994.

[5] F.V. Jensen, Bayesian Networks and Decision Graphs, Springer-Verlag, Berlin, 2001.

[6] R. Jurgelenaite and P.J.F. Lucas, Parameter Estimation in Large Causal Independence Models, Technical Report NIII-R0414, NIII, Radboud University Nijmegen, 2004.

[7] P.J.F. Lucas, Bayesian network modelling by qualitative patterns, Proc. ECAI-2002, pp. 690-694, 2002.

[8] M.A. Shwe, B. Middleton, D.E. Heckerman, M. Henrion, E.J. Horvitz, H.P. Lehmann and G.F. Cooper, Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base, I - The probabilistic model and inference algorithms, Methods Inf Med, 30, pp. 241-255, 1991.

[9] I. Wegener, The Complexity of Boolean Functions, John Wiley, New York, 1987.

[10] N.L. Zhang and D. Poole, Exploiting causal independence in Bayesian network inference, JAIR, 5, pp. 301-328, 1996.


SESSION 3:

INTELLIGENT AGENTS AND SCHEDULING SYSTEMS


A Bargaining Agent aims to 'Play Fair'

John Debenham

Faculty of Information Technology, University of Technology, Sydney, NSW, Australia, debenham@it.uts.edu.au,

WWW home page: http://www-staff.it.uts.edu.au/~debenham/

Abstract. A bargaining agent exchanges proposals, supported by claims, with an opponent. Each proposal and claim exchanged reveals valuable information about the sender's position. A negotiation may break down if an agent believes that its opponent is not playing fairly. The agent aims to give the impression of fair play by responding with comparable information revelation whilst playing strategically to influence its opponent's preferences with claims. The agent makes no assumptions about the internals of its opponent, including her motivations, logic, and whether she is conscious of a utility function. It focusses only on the information in the signals that it receives. It uses maximum entropy probabilistic reasoning to estimate unknown values in probability distributions including the probability that its opponent will accept any deal.

1 Introduction

A bargaining agent extends the simple, offer-exchange, bargaining agent described in [1] that evaluates and generates proposals based on information extracted from the marketplace, the World Wide Web and by observing the behavior of its opponent. The agent described here supports its position with claims. Each proposal and claim exchanged reveals valuable information about the sender's position. The agent responds with proposals and claims that have comparable information revelation. In this way it aims to gain the trust of its opponent. The agent does not necessarily strive to optimize its utility and aims to make informed decisions in an information-rich but uncertain environment.

In addition to exchanging proposals, an argumentation agent exchanges arguments, and so it requires mechanisms for evaluating arguments, and for generating arguments [2]. An argument is information that either justifies the agent's negotiation stance or attempts to influence its opponent's stance [2]. Argumentation here is approached in the rhetorical sense in which arguments are intended to beneficially influence the opponent's evaluation of the issues [3]. This is in contrast to defeasible argumentation in which an agent attempts to find the truth, possibly by finding fault with its opponent's logic.

The negotiation agent, Π, attempts to fuse the negotiation with the information that is generated both by and because of it. To achieve this, it draws on ideas from information theory rather than game theory. Π decides what to do — such as what argument to generate — on the basis of its information that may be qualified by expressions of degrees of belief. Π uses this information to calculate, and continually re-calculate, probability distributions for that which it does not know. One such distribution, over the


set of all possible deals, expresses Π's belief in the acceptability of a deal to itself. Other distributions attempt to predict the behavior of its opponent, Ω — such as what proposals she might accept. Π makes no assumptions about the internals of its opponent, including whether she has, or is even aware of the concept of, utility functions. Π is purely concerned with its opponent's behavior — what she does — and not with assumptions about her motivations.

As with the agent described in [1], the negotiation agent described here does not assume that it has a von Neumann-Morgenstern utility function. The agent makes assumptions about: the way in which the integrity of information will decay, and some of the preferences that its opponent may have for some deals over others. It also assumes that unknown probabilities can be inferred using maximum entropy inference [4], ME, which is based on random worlds [5]. The maximum entropy probability distribution is "the least biased estimate possible on the given information; i.e. it is maximally non-committal with regard to missing information" [6]. In the absence of knowledge about the opponent's decision-making apparatus the negotiating agent assumes that the "maximally noncommittal" model is the correct model on which to base its reasoning.

2 The Negotiation Agent: Π

Π operates in an information-rich environment that includes the Internet. The integrity of Π's information, including information extracted from the Internet, will decay in time. The way in which this decay occurs will depend on the type of information, and on the source from which it is drawn. Little appears to be known about how the integrity of real information, such as news-feeds, decays, although the effect of declining integrity has been analyzed. For example, [7] considers how delays in the acquisition of trading data affect trading outcomes.

One source of Π's information is the signals received from Ω. These include offers from Ω to Π, the acceptance or rejection by Ω of Π's offers, and claims that Ω sends to Π. This information is augmented with sentence probabilities that represent the strength of Π's belief in its truth. If Ω rejected Π's offer of $8 two days ago then what is Π's belief now in the proposition that Ω will accept another offer of $8 now? Perhaps it is around 0.1. A linear model is used to model the integrity decay of these beliefs, and when the probability of a decaying belief approaches 0.5¹ the belief is discarded. The model of decay could be exponential, quadratic or whatever.

2.1 Interaction Protocol

A deal is a pair of commitments δ_{Π:Ω}(π, ω) between an agent Π and an opponent agent Ω, where π is Π's commitment and ω is Ω's commitment. D = {δ_i} is the deal set — ie: the set of all possible deals. If the discussion is from Π's point of view then the

¹ A sentence probability of 0.5 represents null information, ie: "maybe, maybe not", for the truth of a single statement. In general, prior to Π discovering new information that Ω rejected $8, Π would have had a prior probability, p, for Ω accepting $8 now. The integrity of this new information should decay so that the probability that Ω will accept $8 tends to p, at which point the information may be discarded.


subscript "Π : Ω" may be omitted. These commitments may involve multiple issues and not simply a single issue such as trading price. The set of terms, T, is the set of all possible commitments that could occur in deals in the deal set. An agent may have a real-valued utility function, U : T → ℝ, that induces an ordering on T. For such an agent, for any deal δ = (π, ω) the expression U(ω) − U(π) is called the surplus of δ, and is denoted by L(δ) where L : T × T → ℝ. For example, the values of the function U may be expressed in units of money. It may not be possible to specify the utility function either precisely or with certainty.² This is addressed in Sec. 3 where a predicate ΠAcc(·) represents the acceptability of a deal to Π.

The agents communicate using sentences in a first-order language C. This includes the exchange, acceptance and rejection of offers. C contains the following predicates: Offer(δ), Accept(δ), Reject(δ) and Quit(·), where Offer(δ) means "the sender is offering you a deal δ", Accept(δ) means "the sender accepts your deal δ", Reject(δ) means "the sender rejects your deal δ" and Quit(·) means "the sender quits — the negotiation ends". C also contains predicates to support argumentation — these are described in Sec. 5.

2.2 Agent Architecture

Π uses the language C for external communication, and the language L for internal representation. Two predicates in L are: ΠAcc(·) and ΩAcc(·). The proposition (ΠAcc(δ) | I_t) means: "Π will be comfortable accepting the deal δ given that Π knows information I_t at time t". The idea is that Π will accept deal δ if P(ΠAcc(δ) | I_t) > α for some threshold constant α. The precise meaning that Π gives to ΠAcc(·) is described in Sec. 3. The proposition ΩAcc(δ) means "Ω is prepared to accept deal δ". The probability distribution P(ΩAcc(·)) is estimated in Sec. 4.1.

Each incoming message M from source S received at time t is time-stamped and source-stamped, M_{[S,t]}, and placed in an in-box, X, as it arrives. Π has an information repository I, a knowledge base K and a belief set B. Each of these three sets contains statements in a first-order language L. I contains statements in L together with sentence probability functions of time. I_t is the state of I at time t and may be inconsistent. At some particular time t, K_t contains statements that Π believes are true at time t, such as ∀x(Accept(x) → ¬Reject(x)). The belief set B_t = {β_i} contains statements that are each qualified with a given sentence probability, B(β_i), that represents Π's belief in the truth of the statement at time t. K and B play different roles in the method described in Sec. 2.3; K_t ∪ B_t is required by that method to be consistent.

Π's actions are determined by its "strategy". A strategy is a function S : K × B → A where A is the set of actions. At certain distinct times the function S is applied to K and B and the agent does something. The set of actions, A, includes sending Offer(·), Accept(·), Reject(·), Quit(·) messages and claims to Ω. The way in which S works is described in Secs. 4.2 and 5.1. Two "instants of time" before the S function is activated, an "import function" and a "revision function" are activated. The import function I : (X × I_{t-}) → I_t clears the in-box, using its "import rules". An import

² The often-quoted oxymoron "I paid too much for it, but it's worth it.", attributed to Samuel Goldwyn, movie producer, illustrates that intelligent agents may negotiate with uncertain utility.


rule takes a message M, written in language C, and from it derives sentences written in language L to which it attaches decay functions, and adds these sentences together with their decay functions to I_{t-} to form I_t. These decay functions are functions of the message type, the time the message arrived and the source from which it came — an illustration is given below. An import rule has the form: P(S | M_{[Ω,t]}) = f(M, Ω, t) ∈ [0, 1], where S is a statement, M is a message and f is the decay function. Then the belief revision function R : I_{t-} → (I_t × K_t × B_t) deletes any statements in I_{t-} whose sentence probability functions have a value that is ≈ the prior value at time t. From the remaining statements R selects a consistent set of statements and instantiates their sentence probability functions to time t, and places the unqualified statements from that set in K_t and the qualified statements, together with their sentence probabilities, in B_t.

An example now illustrates the ideas in the previous paragraph. Suppose that the predicate ΩAcc(δ) means that "deal δ is acceptable to Ω". Suppose that Π is attempting to trade a good "g" for cash. Then a deal δ(π, ω) will be δ(g, x) where x is an amount of money. If Π assumes that Ω would prefer to pay less than more then I_t will contain: o : (∀x, y)((x > y) → (ΩAcc(g, x) → ΩAcc(g, y))). Suppose Π uses a simple linear

decay for its import rules: f(M, Ω, t) = trust(Ω) + (0.5 − trust(Ω)) × (t − t_i)/decay(Ω), where trust(Ω) is a value in [0.5, 1] and decay(Ω) > 0.³ trust(Ω) is the probability attached to S at time t = t_i, and decay(Ω) is the time period taken for P(S) to reach 0.5, when S is discarded. Suppose at time t = 7, Π receives the message: Offer(g, $20)_{[Ω,7]}, and has the import rule: P(ΩAcc(g, x) | Offer(g, x)_{[Ω,t_i]}) = 0.8 − 0.025 × (t − t_i), ie: trust is 0.8 and decay is 12. Then, in the absence of any other information, at time t = 11, K_{t=11} contains o and B_{t=11} contains ΩAcc(g, $20) with a sentence probability of 0.7.
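The linear decay of this worked example can be sketched as follows; the function names sentence_probability and still_believed are hypothetical, and only the numbers of the example above (trust 0.8, decay 12, arrival at t = 7) are reused.

```python
# A sketch of the linear integrity decay used by the import rules: the
# sentence probability starts at trust(Omega) when the message arrives and
# reaches the default prior 0.5 after decay(Omega) time units, at which
# point the belief is discarded.

def sentence_probability(trust, decay, t_arrival, t_now):
    return trust + (0.5 - trust) * (t_now - t_arrival) / decay

def still_believed(trust, decay, t_arrival, t_now, tolerance=1e-9):
    return sentence_probability(trust, decay, t_arrival, t_now) > 0.5 + tolerance

# The offer Offer(g, $20) received at t = 7 with trust 0.8 and decay 12:
for t in (7, 9, 11, 15, 19):
    p = sentence_probability(0.8, 12, 7, t)
    print(t, round(p, 3), still_believed(0.8, 12, 7, t))   # 0.7 at t = 11, dropped at t = 19
```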

Π uses three things to make offers: an estimate of the likelihood that Ω will accept any offer [Sec. 4.1], an estimate of the likelihood that Π will, in hindsight, feel comfortable accepting any particular offer [Sec. 3], and an estimate of when Ω may quit and leave the negotiation — see [1]. Π supports its negotiation with claims with the aim of either improving the outcome — reaching a more beneficial deal — or improving the process — reaching a deal in a more satisfactory way.

2.3 Random worlds

Let G be the set of all positive ground literals that can be constructed using the predicate, function and constant symbols in L. A possible world is a valuation function v : G → {⊤, ⊥}. V denotes the set of all possible worlds, and V_K denotes the set of possible worlds that are consistent with a knowledge base K [5].

A random world for K is a probability distribution W_K = {p_i} over V_K = {v_i}, where W_K expresses an agent's degree of belief that each of the possible worlds is the actual world. The derived sentence probability of any σ ∈ L, with respect to a random world W_K, is:

$$(\forall \sigma \in L)\quad P_{W_K}(\sigma) = \sum_{n} \{\, p_n : \sigma \text{ is } \top \text{ in } v_n \,\} \quad (1)$$

³ In this example, the value for the probability is given by a linear decay function that is independent of the message type, and trust and decay are functions of Ω only. There is scope for using learning techniques to refine the trust and decay functions in the light of experience. As discussed in footnote 1, the value "0.5" is used here as a default prior sentence probability.


A random world W_K is consistent with the agent's beliefs B if: (∀β ∈ B)(B(β) = P_{W_K}(β)). That is, for each belief its derived sentence probability as calculated using Eqn. 1 is equal to its given sentence probability.

The entropy of a discrete random variable X with probability mass function {p_i} is [4]: H(X) = −Σ_n p_n log p_n, where p_n > 0 and Σ_n p_n = 1. Let W_{K,B} be the "maximum entropy probability distribution over V_K that is consistent with B". Given an agent with K and B, its derived sentence probability for any sentence, σ ∈ L, is:

$$(\forall \sigma \in L)\quad P(\sigma) \triangleq P_{W_{\{K,B\}}}(\sigma) \quad (2)$$

Using Eqn. 2, the derived sentence probability for any belief, β_i, is equal to its given sentence probability. So the term sentence probability is used without ambiguity.

If X is a discrete random variable taking a finite number of possible values {x_i} with probabilities {p_i} then the entropy is the average uncertainty removed by discovering the true value of X, and is given by H(X) = −Σ_n p_n log p_n. The direct optimization of H(X) subject to a number, θ, of linear constraints of the form Σ_n p_n g_k(x_n) = ḡ_k for given constants ḡ_k, where k = 1, …, θ, is a difficult problem. Fortunately this problem has the same unique solution as the maximum likelihood problem for the Gibbs distribution [8]. The solution to both problems is given by:

$$p_n = \frac{\exp\bigl(-\sum_{k=1}^{\theta} \lambda_k g_k(x_n)\bigr)}{\sum_{m} \exp\bigl(-\sum_{k=1}^{\theta} \lambda_k g_k(x_m)\bigr)}, \quad n = 1, 2, \ldots \quad (3)$$

where the constants {λ_k} may be calculated using Eqn. 3 together with the three sets of constraints: p_n > 0, Σ_n p_n = 1 and Σ_n p_n g_k(x_n) = ḡ_k. The distribution in Eqn. 3 is known as the Gibbs distribution.
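As an illustration of how the Gibbs form of Eqn. 3 can be fitted to a set of beliefs, the following Python sketch maximises entropy over a small, hand-made set of possible worlds by gradient steps on the Lagrange multipliers; the worlds, the single belief and the helper max_entropy are assumptions for illustration and are not the authors' implementation.

```python
import math

def max_entropy(worlds, beliefs, iters=20000, lr=0.5):
    """Maximum entropy distribution over `worlds` subject to constraints
    P(sentence_k) = target_k.  Uses the Gibbs form of Eqn. 3: p_n is
    proportional to exp(-sum_k lambda_k * g_k(world_n)); the multipliers
    are adjusted until the expectation of each indicator matches its target."""
    lam = [0.0] * len(beliefs)
    probs = []
    for _ in range(iters):
        weights = [math.exp(-sum(l * ind(w) for l, (ind, _) in zip(lam, beliefs)))
                   for w in worlds]
        z = sum(weights)
        probs = [wt / z for wt in weights]
        expectations = [sum(p * ind(w) for p, w in zip(probs, worlds))
                        for ind, _ in beliefs]
        lam = [l - lr * (target - e)
               for l, (_, target), e in zip(lam, beliefs, expectations)]
    return probs

# Three toy possible worlds over ground literals a and b, one belief P(a) = 0.8:
worlds = [{"a": True, "b": True}, {"a": True, "b": False}, {"a": False, "b": False}]
beliefs = [(lambda w: 1 if w["a"] else 0, 0.8)]
print([round(p, 3) for p in max_entropy(worlds, beliefs)])   # ~[0.4, 0.4, 0.2]
```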

2.4 An Exemplar Application

An exemplar application is used in what follows. Π is attempting to purchase a particular second-hand motor vehicle, with some period of warranty, for cash. So the two issues in this negotiation are: the period of the warranty, and the cash consideration. A deal δ consists of this pair of issues, and the deal set has no natural ordering. Suppose that Π wishes to apply ME to estimate values for P(ΩAcc(δ)) for various δ. Suppose that the warranty period is simply 0, …, 4 years, and that the cash amount for this car will certainly be at least $5,000 with no warranty, and is unlikely to be more than $7,000 with four years' warranty. In what follows all price units are in thousands of dollars. Suppose then that the deal set in this application consists of 55 individual deals in the form of pairs of warranty periods and price intervals: { (w, [5.0, 5.2)), (w, [5.2, 5.4)), (w, [5.4, 5.6)), (w, [5.6, 5.8)), (w, [5.8, 6.0)), (w, [6.0, 6.2)), (w, [6.2, 6.4)), (w, [6.4, 6.6)), (w, [6.6, 6.8)), (w, [6.8, 7.0)), (w, [7.0, ∞)) }, where w = 0, …, 4. Suppose that Π has previously received two offers from Ω. The first is to offer 6.0 with no warranty, and the second to offer 6.9 with one year's warranty. Suppose Π believes that Ω still stands by these two offers with probability 0.8. Then this leads to two beliefs: β_1 : ΩAcc(0, [6.0, 6.2)); B(β_1) = 0.8, β_2 : ΩAcc(1, [6.8, 7.0)); B(β_2) = 0.8. Following


the discussion above, before "switching on" ME, Π should consider whether it believes that P(ΩAcc(δ)) is uniform over δ. If it does then it includes both β_1 and β_2 in B, and calculates W_{K,B}, which yields estimates for P(ΩAcc(δ)) for all δ. If it does not then it should include further knowledge in K and B. For example, Π may believe that Ω is more likely to bid for a greater warranty period the higher her bid price. If so, then this is a multi-issue constraint, that is represented in B, and is qualified with a sentence probability.

3 Accepting a Proposed Deal

The proposition (ΠAcc(δ) | I_t) was introduced in Sec. 2.2. Here, agent Π is attempting to buy a second-hand motor vehicle with a specific period of warranty as described in Sec. 2.4. This section describes how Π estimates P(ΠAcc(δ) | I_t). This involves the introduction of four predicates into the language L: Me(·), Suited(·), Good(·) and Fair(·).

General information is extracted from the World Wide Web using special purpose bots that import and continually confirm information. These bots communicate with Π by delivering messages to Π's in-box X using predicates in the communication language C in addition to those described in Sec. 2.1. These predicates include IsGood(T, Ω, r) and IsFair(T, δ, s), meaning respectively that "according to agent T, agent Ω is a good person to deal with certainty r", and "according to agent T, δ is a fair market deal with certainty s". The continual in-flow of information is managed as described in [9]. As described in Sec. 2.2, import functions are applied to convert these messages into beliefs. For example: P(Good(Ω) | IsGood(T, Ω, r)_{[T,t_i]}) = f(IsGood, T, t), where Good(Ω) is a predicate in the agent's internal language L meaning "Ω will be a good agent to do business with". Likewise, IsFair(·) messages in C are imported to I as Fair(·) statements in L, where Fair(δ) means "δ is generally considered to be a fair deal at least".

With the motor vehicle application in mind, P(ΠAcc(δ) | I_t) is derived from conditional probabilities attached to four other propositions: Suited(ω), Good(Ω), Fair(δ), and Me(δ), where Suited(ω) means "terms ω are perfectly suited to Π's needs", and Me(δ) means "on strictly subjective grounds, the deal δ is acceptable to Π". These four probabilities are: P(Suited(ω) | I_t), P(Good(Ω) | I_t), P(Fair(δ) | I_t ∪ {Suited(ω), Good(Ω)}) and P(Me(δ) | I_t ∪ {Suited(ω), Good(Ω)}). The last two of these four probabilities factor out both the suitability of ω and the appropriateness of the opponent Ω. The third captures the concept of "a fair market deal" and the fourth a strictly subjective "what ω is worth to Π". The "Me(·)" proposition is closely related to the concept of a private valuation in game theory. This derivation of P(ΠAcc(δ) | I_t) from the four other probabilities may not be suitable for assessing other types of deal. For example, in eProcurement some assessment of the value of an on-going relationship with an opponent may be a significant issue. Also, for some low-value trades, the inclusion of Good(·) may not be required.

To determine T?{Suited{uji) \ It), if there are sufficiently strong preference relations to establish extrema for this distribution then they may be assigned extreme values « 0.0 or 1.0. 77 is Chen repeatedly asked to provide probability estimates for the offer


Fig. 1. Acceptability of a deal. [The figure shows information flowing from the Internet, market data and agent Ω into the distributions P(Me(δ)), P(Suited(ω)), P(Good(Ω)) and P(Fair(δ)), from which agent Π derives the single value P(ΠAcc(δ) | I_t).]

ω that yields the greatest reduction in entropy for the resulting distribution [4]. This continues until Π considers the distribution to be "satisfactory". This is tedious, but the "preference acquisition bottleneck" appears to be an inherently costly business [10].

Determining P(Good(Ω) | I_t) involves an assessment of the reliability of the opponent Ω. For some retailers (sellers), information, of varying reliability, may be extracted from sites that rate them. For individuals, this may be done either by assessing their reputation established during prior trades [11], or through the use of some intermediate escrow service that is itself rated for "reliability".

P(Fair(δ) | I_t ∪ {Suited(ω), Good(Ω)}) is determined by reference to market data. Suppose that recently a similar vehicle sold with three years' warranty for $6,500, and another, less similar, sold with one year's warranty for $5,500. These are fed into I_t and are represented as two beliefs in B: β3 : Fair(3, [6.4, 6.6)) with B(β3) = 0.9, and β4 : Fair(1, [5.4, 5.6)) with B(β4) = 0.8. The sentence probabilities attached to this data may be derived from knowing the identity, and so too the reputation, of the bidding agent. In this way the acceptability value is continually adjusted as information becomes available. In addition to β3 and β4, there are three chunks of knowledge in K. First, κ2 : Fair(4, 4999), which determines a base value for which P(Fair) = 1, and two other chunks that represent Π's preferences concerning price and warranty: κ3 : ∀x, y, z ((x > y) → (Fair(z, x) → Fair(z, y))) and κ4 : ∀x, y, z ((x > y) → (Fair(y, z) → Fair(x, z))). The deal set is a 5 × 11 matrix with highest price interval [7.0, ∞). The three statements in K mean that there are 56 possible worlds. The two beliefs are consistent with each other and with K. A complete matrix for P(Fair(δ) | I_t) is derived by solving two simultaneous equations of degree two using Eqn. 3. As new evidence becomes available


it is represented in B, and the inference process is re-activated. If new evidence renders B inconsistent then this inconsistency will be detected by the failure of the process to yield values for the probabilities in [0, 1]. If B becomes inconsistent then the revision function R identifies and removes inconsistencies from B prior to re-calculating the probability distribution. The values below were calculated using a program written by Paul Bogg, a PhD student in the Faculty of IT at UTS [12]:

p \ w         w = 0    w = 1    w = 2    w = 3    w = 4
[7.0, ∞)      0.0924   0.1849   0.2049   0.2250   0.2263
[6.8, 7.0)    0.1849   0.3697   0.4099   0.4500   0.4526
[6.6, 6.8)    0.2773   0.5546   0.6148   0.6750   0.6789
[6.4, 6.6)    0.3697   0.7394   0.8197   0.9000   0.9053
[6.2, 6.4)    0.3758   0.7516   0.8331   0.9147   0.9213
[6.0, 6.2)    0.3818   0.7637   0.8466   0.9295   0.9374
[5.8, 6.0)    0.3879   0.7758   0.8600   0.9442   0.9534
[5.6, 5.8)    0.3939   0.7879   0.8734   0.9590   0.9695
[5.4, 5.6)    0.4000   0.8000   0.8869   0.9737   0.9855
[5.2, 5.4)    0.4013   0.8026   0.8908   0.9790   0.9921
[5.0, 5.2)    0.4026   0.8053   0.8947   0.9842   0.9987

The two evidence values, 0.9000 at (w = 3, p = [6.4, 6.6)) and 0.8000 at (w = 1, p = [5.4, 5.6)), appear in the table at exactly their given values. Determining P(Me(δ) | I_t ∪ {Suited(ω), Good(Ω)}) is a subjective matter. It is specified using the same device as used for Fair, except that the data is fed in by hand "until the distribution appears satisfactory". To start this process, first identify those δ that "Π would never accept"; these are given a probability ≈ 0.0. Second, identify those δ that "Π would be delighted to accept"; these are given a probability ≈ 1.0. The Me proposition links the ME approach with "private valuations" in game theory.

The whole "accept an offer" apparatus is illustrated in Fig. 1. The in-flow of infor­mation from the Internet, the market and from the opponent agents is represented as It and is stored in the knowledge base /C and belief set B. In that Figure the D sym­bols denote probability distributions as described above, and the o symbol denotes a single value. The probability distributions for Me{5), Suited{u)) and Fair{S) are derived as described above. ME inference is then used to derive the sentence probability of the P{nAcc{S) I It) predicate from the sentence probabilities attached to the Me, Suited, Good and Fair predicates. This derivation is achieved by two chunks of knowledge and two beliefs. Suppose that iT's "principles of acceptability" require that: / 5 : {Me A Suited A Good A Fair) -^ UAcc ^6 * {-^Me V -^Suited) -> -^11 Ace these two statements are represented in /C, and there are 19 possible worlds. Suppose that n believes that: /?5 : {HAcc I Me A Suited A -^Good A Fair); 6(^5) = 0.1 f3e : {UAcc \ Me A Suited A Good A -.Fa/r); B(/36) = 0.4 these two beliefs are represented in B. The ME inference process is rather opaque — it is difficult to look at Eqn. 3 and guess the answer. In an attempt to render the infer­ence digestible [1] uses a Bayesian net to derive P(i7Acc). In contrast, the derivation is achieved here using ME.


The ΠAcc predicate generalizes the notion of utility. If I_t contains (Me → ΠAcc) then P(ΠAcc) = P(Me). Then define P(Me(π, ω)) to be 0.5 × ((U(ω) − U(π)) / (U(ω̂) − U(π)) + 1) for U(ω) ≥ U(π), and zero otherwise, where ω̂ = argmax_ω {U(ω) | (π, ω) ∈ D}.^ An acceptability threshold α of 0.5 will then accept deals for which the surplus is non-negative. In this way ΠAcc represents utility-based negotiation with a private valuation.
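
As a minimal worked sketch of the two acceptance curves (the linear form as reconstructed above, and the sigmoid variant from the footnote below); the function names, the guard for the degenerate case and the example utilities are our own:

import math

def p_me_linear(u_omega, u_pi, u_best):
    """0 for negative surplus, 0.5 at zero surplus, 1.0 at the best available terms."""
    if u_omega < u_pi:
        return 0.0
    if u_best <= u_pi:                     # degenerate case: no strictly better terms exist
        return 1.0
    return 0.5 * ((u_omega - u_pi) / (u_best - u_pi) + 1.0)

def p_me_sigmoid(u_omega, u_pi, beta=1.0):
    """Footnote variant: sigmoid of the surplus, avoiding the reference to u_best."""
    return 1.0 / (1.0 + math.exp(-beta * (u_omega - u_pi)))

# With acceptability threshold alpha = 0.5, both forms accept exactly the deals
# whose surplus U(omega) - U(pi) is non-negative.
print(p_me_linear(6.0, 5.0, 7.0), p_me_sigmoid(6.0, 5.0, beta=2.0))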

4 Negotiation

Π engages in bilateral bargaining with its opponent Ω. Π and Ω each exchange offers alternately at successive discrete times [13]. They enter into a commitment if one of them accepts a standing offer. The protocol has three stages:

1. Simultaneous, initial, binding offers from both agents;
2. A sequence of alternating offers; and
3. An agent quits and walks away from the negotiation.

In the first stage, the agents simultaneously send Offer(.) messages to each other that stand for the entire negotiation. These initial offers are taken as limits on the range of values that are considered possible. This is crucial to the method described in Sec. 2.3, where there are domains that would otherwise be unbounded. The exchange of initial offers "stakes out the turf" on which the subsequent negotiation will take place. In the second stage, an Offer(.) message is interpreted as an implicit rejection, Reject(.), of the opponent's offer on the table. Second-stage offers stand only if accepted by return; Π interprets these offers as indications of Ω's willingness to accept, and they are represented as beliefs with sentence probabilities that decay in time. The negotiation ceases either in the second stage, if one of the agents accepts a standing offer, or in the final stage, if one agent quits and the negotiation breaks down. To prevent an "information flood", the agents are only permitted to exchange claims either with proposals or in response to requests as described in Sec. 5.

4.1 Estimating the Opponent's Response to a Proposal

To support the offer-exchange process, Π has to do two different things. First, it must respond to offers received from Ω; that is described in Sec. 3. Second, it must send offers, and possibly information, to Ω. This section describes machinery for estimating the probabilities P(ΩAcc(δ)), where the predicate ΩAcc(δ) means "Ω will accept Π's offer δ". In the following, Π is attempting to purchase a particular second-hand motor vehicle, with some period of warranty, for cash from Ω as described in Sec. 2.4. So a deal δ will be represented by the pair (w, p), where w is the period of warranty in years and $p is the price.

Π assumes the following two preference relations for Ω, and K contains: κ11 : ∀x, y, z ((x < y) → (ΩAcc(y, z) → ΩAcc(x, z)))

" The annoying introduction of u may be avoided completely by defining P(Me(7r, u)) = i- exp(-/9x(U(a;)-u(7r)) ^ ^ ^^^ constant /3. This is the well-known sigmoid transfer function used in many neural networks. This function is near-linear for U(u;) w U(7r), and is concave, or "risk averse", outside that region. The transition between these two behaviors is determined by the choice of p.


κ12 : ∀x, y, z ((x < y) → (ΩAcc(z, x) → ΩAcc(z, y))). As in Sec. 3, these sentences conveniently reduce the number of possible worlds. The two preference relations κ11 and κ12 induce a partial ordering on the sentence probabilities in the P(ΩAcc(w, p)) array, from the top-left, where the probabilities are ≈ 1, to the bottom-right, where the probabilities are ≈ 0. There are fifty-one possible worlds that are consistent with K.

Suppose that the offer exchange has proceeded as follows: Ω asked for $6,900 with one year's warranty and Π refused; then Π offered $5,000 with two years' warranty and Ω refused; and then Ω asked for $6,500 with three years' warranty and Π refused. Then at the next time step B contains: β11 : ΩAcc(3, [6.8, 7.0)), β12 : ΩAcc(2, [5.0, 5.2)) and β13 : ΩAcc(1, [6.4, 6.6)), and with a 10% decay in integrity for each time step: P(β11) = 0.7, P(β12) = 0.2 and P(β13) = 0.9.
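
The precise decay law is not given in this excerpt. One simple scheme that reproduces the quoted values is a linear decay of the sentence probability towards the ignorance value 0.5, at 10% of the [0, 1] range per time step, as in this sketch (the function name and parameters are our own):

def decayed_integrity(initial, steps, rate=0.1, ignorance=0.5):
    """Decay a sentence probability linearly towards 0.5 by `rate` per time step."""
    if initial >= ignorance:
        return max(ignorance, initial - rate * steps)
    return min(ignorance, initial + rate * steps)

# An acceptance indication starts near 1.0, a refusal near 0.0:
print(round(decayed_integrity(1.0, 1), 2),    # most recent ask        -> 0.9
      round(decayed_integrity(1.0, 3), 2),    # oldest ask             -> 0.7
      round(decayed_integrity(0.0, 2), 2))    # refused counter-offer  -> 0.2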

Eqn. 3 is used to calculate the distribution W{K, B}, which shows that there are just five different probabilities in it. The probability matrix for the proposition ΩAcc(w, p) is:

p \ w         w = 0    w = 1    w = 2    w = 3    w = 4
[7.0, ∞)      0.9967   0.9607   0.8428   0.7066   0.3533
[6.8, 7.0)    0.9803   0.9476   0.8330   0.7000   0.3500
[6.6, 6.8)    0.9533   0.9238   0.8125   0.6828   0.3414
[6.4, 6.6)    0.9262   0.9000   0.7920   0.6655   0.3328
[6.2, 6.4)    0.8249   0.8019   0.7074   0.5945   0.2972
[6.0, 6.2)    0.7235   0.7039   0.6228   0.5234   0.2617
[5.8, 6.0)    0.6222   0.6058   0.5383   0.4523   0.2262
[5.6, 5.8)    0.5208   0.5077   0.4537   0.3813   0.1906
[5.4, 5.6)    0.4195   0.4096   0.3691   0.3102   0.1551
[5.2, 5.4)    0.3181   0.3116   0.2846   0.2391   0.1196
[5.0, 5.2)    0.2168   0.2135   0.2000   0.1681   0.0840

In this array, the derived sentence probabilities for the three sentences in B appear at exactly their given values: 0.7000 at (3, [6.8, 7.0)), 0.2000 at (2, [5.0, 5.2)) and 0.9000 at (1, [6.4, 6.6)).

4.2 Negotiating with Equitable Information Revelation

Π's negotiation strategy is a function S : K × B → A, where A is the set of actions that send Offer(.), Accept(.), Reject(.) and Quit(.) messages to Ω. Π's argumentation strategy includes sending other messages described in Sec. 5. If Π sends Offer(.), Accept(.) or Reject(.) messages to Ω then she is giving Ω information about herself. In an infinite-horizon bargaining game where there is no incentive to trade now rather than later, a self-interested agent will "sit and wait", and do nothing except, perhaps, ask for information. The well-known bargaining response to an approach by an interested party, "Well, make me an offer", illustrates how a shrewd bargainer may behave in this situation.

An agent may be motivated to act for various reasons; three are mentioned here. First, if there are costs involved in the bargaining process, due either to changes in the value of the negotiation object with time or to the intrinsic cost of conducting the negotiation itself. Second, if there is a risk of breakdown caused by the opponent walking away from


the bargaining table. Third, if the agent is concerned with establishing a sense of trust [11] with the opponent; this could be the case in the establishment of a business relationship. Of these three reasons, the last two are addressed here. The risk of breakdown may be reduced, and a sense of trust may be established, if the agent appears to its opponent to be "approaching the negotiation in an even-handed manner". One dimension of "appearing to be even-handed" is to be equitable with the value of information given to the opponent. Various bargaining strategies, both with and without breakdown, are described in [1], but they do not address this issue. A bargaining strategy is described here that is founded on a principle of "equitable information gain". That is, Π attempts to respond to Ω's messages so that Ω's expected information gain is similar to that which Π has received.

Π models Ω by observing her actions, and by representing beliefs about her future actions in the probability distribution P(ΩAcc). Π measures the value of information that it receives from Ω by the change in the entropy of this distribution as a result of representing that information in P(ΩAcc). More generally, Π measures the value of information received in a message, μ, by the change in the entropy of its entire representation, J_t = K_t ∪ B_t, as a result of the receipt of that message; this is denoted by Δ_μ|J_t^Π|, where |J_t^Π| denotes the value (as negative entropy) of Π's information in J at time t. Although both Π and Ω will build their models of each other using the same data, namely the messages exchanged, the observed information gain will depend on the way in which each agent has represented this information, as discussed in [14]. It is "not unreasonable to suggest" that these two representations should be similar. To support its attempts to achieve "equitable information gain", Π assumes that Ω's reasoning apparatus mirrors its own, and so it is able to estimate the change in Ω's entropy as a result of sending a message μ to Ω: Δ_μ|J_t^Ω|. Suppose that Π receives a message μ = Offer(.) from Ω and observes an information gain of Δ_μ|J_t^Π|. Suppose that Π wishes to reject this offer by sending a counter-offer, Offer(δ), that will give Ω expected "equitable information gain": δ = {arg max_δ P(ΠAcc(δ) | I_t) ≥ α | (Δ_Offer(δ)|J_t^Ω| ≈ Δ_μ|J_t^Π|)}. That is, Π chooses the most acceptable deal to herself that gives her opponent expected "equitable information gain", provided that there is such a deal. If there is not, then Π chooses the best available compromise δ = {arg max_δ (Δ_Offer(δ)|J_t^Ω|) | P(ΠAcc(δ) | I_t) ≥ α}, provided there is such a deal; this strategy is rather generous, as it rates information gain ahead of personal acceptability. If there is not, then Π does nothing.

The "equitable information gain" strategy generalizes the simple-minded alternat­ing offers strategy. Suppose that 77 is trying to buy something from i? with bilateral bar­gaining in which all offers and responses stand — ie: there is no decay of offer integrity. Suppose that 77 has offered $1 and i? has refused, and i? has asked $10 and 77 has re­fused. If amounts are limited to whole dollars only then the deal set P = {1, • • , 10}. 77 models i? with the distribution P(i7Acc(.)), and knows that P(^A<:c(l)) = 0 and P (i?Acc( 10)) = 1. The remaining eight values in this distribution are provided by Eqn. 3, and the entropy of the resulting distribution is 2.2020. To apply the "equitable information gain" strategy 77 assumes that i?'s decision-making machinery mirrors its own. In which case i? is assumed to have constructed a mirror-image distribution to model 77 that will have the same entropy. At this stage, time ^ = 0, calibrate the amount of information held by each agent at zero — ie: \Jo^\ = \Jo^\ = 0. Now


if, at time t = 1, Ω asks Π for $9 then Ω gives information to Π and |J_1^Π| = 0.2548. If Π rejects this offer then she gives information to Ω and |J_1^Ω| = 0.2548. Suppose that Π wishes to counter with an "equitable information gain" offer. If, at time t = 2, Π offers Ω $2 then |J_2^Ω| = 0.2548 + 0.2559. Alternatively, if Π offers Ω $3 then |J_2^Ω| = 0.2548 + 0.5136. And so $2 is a near "equitable information gain" response by Π at time t = 2.
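
The bookkeeping behind this example can be sketched as follows. The entropy and gain functions are generic; the candidate offers, their acceptability values and the estimated opponent gains are invented for illustration (they are not the 0.2548/0.2559/0.5136 values above, which depend on Eqn. 3), and the tolerance on "approximately equal" gains is an assumption:

import numpy as np

def entropy(p):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_gain(p_before, p_after):
    """Value of a message: reduction in entropy once it has been represented."""
    return entropy(p_before) - entropy(p_after)

def equitable_counter_offer(candidates, alpha, gain_received, tolerance=0.05):
    """candidates: {offer: (P(PiAcc(offer) | I_t), estimated gain to Omega)}."""
    equitable = [d for d, (p, g) in candidates.items()
                 if p >= alpha and abs(g - gain_received) <= tolerance]
    if equitable:                    # most acceptable among the equitable offers
        return max(equitable, key=lambda d: candidates[d][0])
    acceptable = [d for d, (p, g) in candidates.items() if p >= alpha]
    if acceptable:                   # fall back: maximise the opponent's gain
        return max(acceptable, key=lambda d: candidates[d][1])
    return None                      # otherwise do nothing

candidates = {2: (0.70, 0.26), 3: (0.75, 0.51), 4: (0.80, 0.90)}   # dollar offers
print(equitable_counter_offer(candidates, alpha=0.6, gain_received=0.2548))  # -> 2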

5 Claims

Π is a bargaining agent that argues by exchanging proposals justified with claims. Sec. 4 describes Π's offer-exchange machinery; Π's claim-exchange machinery is described here. Claims are expressed in terms of the communication predicates IsGood(Γ, Ω, r) and IsFair(Γ, δ, s)^, as described in Sec. 3, and a third predicate, ILike(.). An agent may have a preference for some deals over others. If Ω believes that it is in its interests to communicate some of its preferences to Π then it does so using the ILike(.) predicate in the communication language C. Messages expressed in terms of these three predicates are imported using import rules. For example, the preference relation κ11 in Sec. 4.1 could have been imported from a message ∀x, y, z ((x < y) → (ILike(y, z) → ILike(x, z)))[Ω, t_i] in X using a suitable import rule.

The negotiation protocol is described in Sec. 4; in it, agents are permitted to request information from one another. The communication language C contains three such predicates. ?IsFair(δ) and ?IsGood(Ω) are invitations for the opponent to submit claims in terms of IsGood(.) and IsFair(.). Those two predicates are also used by Π to request information from its data and text mining bots as described in Sec. 3. The third predicate, ?ILike(.), is an invitation for Ω to submit claims expressed in terms of ILike(.).

Π must be able to evaluate claims. Π's offer-exchange machinery is based on probability distributions derived by applying maximum entropy inference to J_t, which contains knowledge and beliefs. This architecture is designed to facilitate the import of information, and so it extends easily to evaluating claims received from Ω. All that is required are suitable import rules to perform this task.

5.1 Generating Claims

Here Π's objective in using argumentation is to increase the expected value of the negotiation in some sense whilst exhibiting fair play. Argumentation may lead to a more valuable outcome either by discovering information about Ω's preferences, particularly Ω's willingness to trade, or by convincing Ω to modify her preferences so as to increase her willingness to accept deals from a section of the deal set D that is more profitable to Π than is currently "available". The communication language C contains the four bargaining predicates described in Sec. 2.1, and the six argumentation predicates described above. The negotiation protocol described in Sec. 4 restricts agents to exchanging claims either in response to invitations or in support of a proposal.

^ This follows a simple version of Toulmin's model: claim (δ is a fair market deal), data (agent Γ states that a prior deal δ was struck), warrant (the deal δ reported by Γ could have been availed of by Ω and therefore is fair), and backing (Γ reports honestly).


Π uses the predicates IsGood(.) and IsFair(.) to support its proposals. It uses its own acceptance machinery (see Sec. 3) "in reverse" to identify suitable evidence to transmit in these predicates; this is not described further here. Π uses the predicate ILike(.) to convey its preferences. The use of this predicate to attempt equitable information revelation is described below. To convey preference information, the ILike(.) predicate will only be partially instantiated; otherwise it reduces to an Offer(.).

Conveying preference information is a "two-edged sword". If an agent states "I want a blue one" then on the one hand this is an invitation to her opponent to accelerate the search by only considering such deals, but on the other hand it is an invitation to her opponent to exploit the possibility that she may be prepared to pay more for a "blue one". Transmitting a preference statement such as this should cause a substantial change in entropy, Δ_ILike(blue)|J_t^Ω|. So an agent needs to be motivated to do so; equitable information revelation provides such a motivation. The general principle of equitable information gain is to respond to an incoming message μ_in with a response μ_out such that Δ_μout|J_t^Ω| ≈ Δ_μin|J_t^Π|. The question now is how to select such a μ_out.

An approach to issue-tradeoffs is described in [15]. The bargaining strategy described there attempts to make an acceptable offer by "walking round" the iso-curve of Π's previous offer (that has, say, an acceptability of α_Π ≥ α) towards Ω's subsequent counter-offer. In terms of the machinery described here, an analogue is to use the strategy S′ : arg max_δ { P(ΩAcc(δ)) | P(ΠAcc(δ) | I_t) ≥ α_Π } for α = α_Π. This is reasonable for an agent that is attempting to be accommodating without compromising its own interests. Presumably such an agent will have a policy for reducing the value α_Π if her deals fail to be accepted. The complexity of the strategy in [15] is linear in the number of issues. The strategy described here does not have that property, but it benefits from using P(ΩAcc(.)), which contains footprints of the prior offer sequence (see Sec. 4.1); in that distribution, more recent offers have stronger weights.

In a sense, the preference exchange strategy mimics S′ by "walking round" the iso-curve of preferences defined, after receipt of a message μ_in, by Δ_ILike(λ)|J_t^Ω| ≈ Δ_μin|J_t^Π|. Note that |J_t^Ω| represents the expected effect at time t of all previous messages that Π has transmitted to Ω. An ILike(λ) message will be associated with a set of deals, as λ is only partially instantiated; for example, "I like 4-year warranties". Given a message μ_in, there will be many such λ with the property Δ_ILike(λ)|J_t^Ω| ≈ Δ_μin|J_t^Π|. Of all of these λ, choose the one for which the mean of P(ΠAcc(δ)) over all δ associated with λ is greatest. This may not be easy to calculate, but it is a "fair play" response.

6 Conclusions

The establishment of a sense of trust [11] contributes to the establishment of business relationships and to preventing breakdown in one-off negotiation. One way to foster trust is to give the impression of fair play. Exchanging and reacting to proposals in a negotiation gives the opponent valuable information. Exchanging preference information in argumentation has the potential to accelerate the negotiation, but it does so at the cost of also giving away information. One sense of "fair play" is to be equitable in this information exchange. The agent described here exchanges offers and claims whilst attempting to achieve equitable information revelation.


The agent architecture is based on a first-order logic representation, and so it is independent of the number of negotiation issues, although only two-issue bargaining is illustrated here. The implementation incorporates a modified version of tuProlog that handles the Horn clause logic, including the belief revision and the identification of those random worlds that are consistent with K. Existing text and data mining bots have been used to feed information into Π.

Π has five ways of leading a negotiation towards a positive outcome. First, by making more attractive offers to Ω. Second, by reducing its threshold α. Third, by acquiring information to hopefully increase the acceptability of offers received. Fourth, by encouraging Ω to submit offers that are more attractive to Π. Fifth, by encouraging Ω to accept Π's offers. The first two are described in Sec. 4.2. The third is not described here; it is not difficult to see how the acceptance machinery can be "driven backwards" to achieve this. The last two have been described in Sec. 5.1.

Much has not been described here, including the data and text mining software, the proactive acquisition of information, and the way in which the incoming information is structured to enable its orderly acquisition [9].

References

1. Debenham, J.: Bargaining with information. In: Proceedings Third International Conference on Autonomous Agents and Multi Agent Systems, AAMAS-2004. (2004)
2. Rahwan, I., Ramchurn, S., Jennings, N., McBurney, P., Parsons, S., Sonenberg, E.: Argumentation-based negotiation. Knowledge Engineering Review (2004)
3. Ramchurn, S., Jennings, N., Sierra, C.: Persuasive negotiation for autonomous agents: A rhetorical approach. In: Proc. IJCAI Workshop on Computational Models of Natural Argument. (2003) 9-17
4. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press (2003)
5. Halpern, J.: Reasoning about Uncertainty. MIT Press (2003)
6. Jaynes, E.: Information theory and statistical mechanics: Part I. Physical Review 106 (1957) 620-630
7. Bernhardt, D., Miao, J.: Informed trading when information becomes stale. The Journal of Finance LIX (2004)
8. Pietra, S.D., Pietra, V.D., Lafferty, J.: Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 380-393
9. Debenham, J.: An eNegotiation framework. In: Applications and Innovations in Intelligent Systems XI, Springer Verlag (2003) 79-92
10. Castro-Schez, J., Jennings, N., Luo, X., Shadbolt, N.: Acquiring domain knowledge for negotiating agents: a case study. Int. J. of Human-Computer Studies (to appear) (2004)
11. Ramchurn, S., Jennings, N., Sierra, C., Godo, L.: A computational trust model for multi-agent interactions based on confidence and reputation. In: Proceedings 5th Int. Workshop on Deception, Fraud and Trust in Agent Societies. (2003)
12. Bogg, P.: (http://www-staff.it.uts.edu.au/plbogg/negotiation/demos/maxent/)
13. Kraus, S.: Strategic Negotiation in Multiagent Environments. MIT Press (2001)
14. Debenham, J.: Auctions and bidding with information. In Faratin, P., Rodriguez-Aguilar, J., eds.: Proceedings Agent-Mediated Electronic Commerce VI: AMEC. (2004)
15. Faratin, P., Sierra, C., Jennings, N.: Using similarity criteria to make issue trade-offs in automated negotiation. Artificial Intelligence 142 (2003) 205-237


Resource Allocation in Communication Networks Using Market-Based Agents

Nadim Haque, Nicholas R. Jennings, Luc Moreau
School of Electronics and Computer Science, University of Southampton

Southampton, UK.

{N.A.HAQUE,N.R.JENNINGS,L.MOREAU}@ecs.soton.ac.uk

Abstract

This work describes a system that allocates end-to-end bandwidth in a switched meshed communications network. The solution makes use of market-based software agents that compete in a number of decentralised marketplaces to buy and sell bandwidth resources. Agents perform a distributed depth-first search with decentralised markets in order to allocate routes for calls. The approach relies on a resource reservation and commit mechanism in the network. Initial results show that under a light network load, the system sets up a high percentage of calls, which is comparable to the optimum value, and that, under all network loads, it performs significantly better than a random strategy.

1 Introduction

The work presented in this paper describes the methodology, implementation and evaluation of a multi-agent system that allocates end-to-end (source-to-destination) bandwidth in a communications network to set up calls. In particular, we consider meshed networks where nodes communicate with their immediate neighbours using radio [1]. In such networks, nodes operate on batteries and solar power and are therefore designed to consume as little power as possible; they are connected to fixed handsets via base stations. These networks are used mainly in developing countries where equipment is scarce and cost must be kept to a minimum. They are equally applicable in areas where the network infrastructure is not fixed, for example, soldiers in a desert who need to communicate their geographical positions to one another. Such low-power-consumption and low-cost solutions imply that such a network has limited bandwidth. This has two implications: (i) the number of messages sent between nodes must be restricted and (ii) the size of each message sent should be kept to a minimum.

Therefore, resource allocation is a central problem in effectively managing such networks. Specifically, this covers the process by which network elements try to meet the competing demands that applications have for network resources, primarily link bandwidth and buffer space in routers or switches [2]. This is a challenging problem since resources become scarce when there is a high demand for them. Thus, practical methods must be found for allocating the scarce resources in a way that satisfies users adequately.


Against this background, the solution we have developed can be viewed as a computational economy where software agents compete in a marketplace to buy and sell bandwidth on a switched network. Here, buyer agents represent callers that aim to make calls in the network, and seller agents represent the owners of the resources who wish to profit from leasing their bandwidth. However, a key requirement is that the bandwidth resources should not all be sold from the same central location in the network, because a centralised market server would give a central point of failure. Therefore, we use decentralised market servers from where resources are bought and sold. This means that if a failure were to occur on a server node, resources would still be available from other market servers. Also, since bandwidth from neighbouring nodes is required to form a continuous end-to-end path in the network, there is a requirement for a protocol that can allocate interrelated resources simultaneously. This ensures that either no resources or a complete set of resources are bought.

We decided to base our solution on agents for a number of reasons. First, their autonomous behaviour allows them to carry out their tasks in the decentralised control regime of distributed marketplaces. Second, the reactive nature of agents is needed to respond to requests quickly so that calls within the network can be made with minimum delay. Third, agents have the ability to interact flexibly, which is important in our system because the agents need to bid against a variety of different opponents in an environment where the available resources vary dynamically. A market-based approach was chosen for the following reasons. First, markets are effective mechanisms for allocating scarce resources in a decentralised fashion [3]. Second, they achieve this based on the exchange of small amounts of information such as prices. Finally, they provide a natural way of viewing the resource allocation problem because they ensure that the individual who values the resources the most will obtain them.

To meet our requirements, the system we have developed extends the state of the art in the following ways. It develops a novel distributed market mechanism in which the allocations made consist of sets of interrelated resources, bundles, which are sold in multiple markets. The marketplace protocol incorporates a reservation and commitment mechanism that provides a guarantee that resources will not be bought unnecessarily.

The remainder of this paper is structured as follows: section 2 describes the design of the system and the components that it comprises. A methodology outlining the evaluation of the system and experimental results are presented in section 3. Section 4 describes the related work. Finally, the conclusions of the work are discussed in section 5 along with the envisaged future work.

2 System Design

This section describes the design of the system. Specifically, section 2.1 outlines the basic components, section 2.2 describes the network used and how it is modelled, section 2.3 details the constituent agents and section 2.4 then outlines the process of how resources are acquired.


Figure 1: An overview of the system architecture. Black nodes in regions represent market servers and grey nodes represent allocated resources for a particular call from the caller to the callee. [The figure depicts a buyer agent submitting bids to, and seller agents submitting asks to, an auctioneer agent on a market server, with the allocated path running from the caller to the callee.]

2.1 System Architecture

The system consists of three types of agents: seller, buyer and auctioneer agents (see figure 1). Seller agents are responsible for selling the bandwidth capacity resources and buyer agents are responsible for buying these resources. The auctioneer agents accept asks from seller agents and bids from buyer agents and conduct auctions so that resources can be allocated using a market-based protocol (a description of which is given in subsection 2.3.1). As can be seen, the overall network is divided into a number of regions (section 2.2 describes what regions are and explains why each one has its own market server). Callers are not regarded as agents within the system but are used to initiate calls via the use of handsets. When a call request takes place, the destination location to which the caller wishes to make the call is passed to the buyer agent on the local node. This agent then starts the process of setting up the call. For each call attempt, a buyer agent in each required region tries to reserve a resource bundle (i.e. a set of interrelated resources in a single region) from its local market server. Buyer agents work together to collectively make a complete source-to-destination path across the regions using the bundles, i.e. the path is put together in a distributed way. If some resource bundles cannot be obtained for a call, then a backtracking mechanism is used which allows alternative allocations to be made if currently reserved resource bundles cannot lead to the final destination. An example of backtracking is outlined in section 2.4.


2.2 Network Structure and Modelling

As outlined in section 1, it is desirable for resources to be bought and sold at various points in the network and not from a central location. With this in mind, the structure of the network requires consideration. In particular, there are a number of ways in which a market could have been distributed. The two approaches that were considered were: (i) to have resource information replicated across several market servers, where each can sell all of the resources in the entire network, or (ii) to partition the complete resource information such that the market servers sell resources that are not for sale on any other market (i.e. to introduce local network regions that are distinct and where only resources within those regions are sold). We regard a network region as a group of nodes that are situated geographically close together, where each region is created in advance of any resources being bought or sold. Nodes on the edge of regions can communicate with other edge nodes in neighbouring regions.

We chose the partitioned approach for a number of reasons. Firstly, if resource information were replicated then, for each bid submitted, the recipient market server would need to contact all other markets to make sure that the same resources are not being sold elsewhere. This could soon flood the network with messages. This situation is avoided with regions, since each market only sells the resources within that region and only the required markets are contacted. Also, the partitioned approach allows the expansion of the network, where extra regions and markets can be added without significantly affecting the existing network.

To model the network, each node has a fixed total bandwidth capacity that is split logically into several equal parts, where these parts are the resources. This means that these parts of bandwidth can be used in relaying several calls at the same time through the nodes. Each node has a fixed number of handsets attached, from where calls originate. A handset that is currently in use is assumed to be engaged and, thus, cannot be used for any other calls at the same time. Also, control messages are currently assumed to be routed by a separate communication layer of infinite capacity, for which we do not set an upper bound on the bandwidth or number of messages. We aim to relax this assumption as part of our future work (as described in section 5).
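
A minimal sketch of this node model in Python, using the parameter values quoted later in Sec. 3.1 (10 bandwidth units, 2 handsets); the class and field names are our own:

from dataclasses import dataclass, field

@dataclass
class Node:
    """A network node: fixed capacity split into equal resource units."""
    node_id: int
    region: int
    bandwidth_units: int = 10        # up to 10 simultaneous relayed calls
    handsets: int = 2                # each handset serves one call at a time
    neighbours: list = field(default_factory=list)

    def can_relay(self) -> bool:
        return self.bandwidth_units > 0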

2.3 The Agents

2.3.1 The Auctioneer Agent

Auctioneer agents conduct auctions using a combinatorial reverse auction protocol [4] to allocate goods (units of node bandwidth) to buyers. With this particular protocol, the auctioneer agents try to allocate a combination of goods (i.e. a source-to-destination path) that consists of the cheapest possible bundles. There is one auctioneer agent present in each region in the network, each on its respective market server node. Market servers are placed manually at a central location in their regions where there is a high connectivity of neighbouring nodes; this is so that they can receive more messages per unit


time than if the connectivity were less. Over a period of time, auctioneer agents execute a winner determination protocol that determines which resources are allocated to which parties, every time they have a bid to process.

In more detail, for each bid submitted by a buyer, the set of winning sellers must be found. For each buyer, the auctioneer has a set of resources that it tries to acquire, M = {1, 2, ..., m}, as specified by the buyer in its bid. Buyers only ever bid for single units of goods for their bundles, since one unit of node bandwidth is assumed to be sufficient capacity for handling a call. They specify for which nodes these single resource units are required: U = {u1, u2, ..., um} where, in this case, ui = 1. Sellers only ever sell one type of resource each, i.e. the bandwidth of a single node k (where k is a different and unique single node for each seller). They each submit an ask individually, and the market eventually receives the set of asks from all sellers: A = {A1, A2, ..., An}. Each ask is a tuple Aj = (λj, pj), where λj > 0 is the number of resource units of node k offered by the ask from the jth seller and pj is the ask price per unit. The winner determination algorithm then attempts to allocate resources by minimising the amount spent [4]:

min Σ_{j=1..n} pj xj   s.t.   Σ_{j=1..n} λj xj ≥ ui,  i = 1, 2, ..., m;   xj ∈ {0, 1}

A bid from a buyer agent contains several bundles, of which only one is required (see subsection 2.3.3). The winner determination protocol operates by exhaustively iterating through these, finding the bundles which are available as a complete set. From these bundles, the cheapest one is allocated to the buyer agent (i.e. this algorithm is executed for each bundle in a buyer agent's bid until the one with the minimum cost is found). Assuming that a buyer agent's bid is successful, resources are sold at the seller agents' asking prices.
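
Because buyers bid for single units only, winner determination for one bid reduces to finding the cheapest bundle whose nodes all have at least one unit on offer. A sketch under that reading (the data layout, function name and example asks are our own):

def winner_determination(bundles, asks):
    """Return the cheapest fully-available bundle and its cost, or None.

    bundles: list of bundles, each a list of node ids (one unit per node).
    asks:    {node_id: (units_available, ask_price_per_unit)}, one seller per node.
    """
    best = None
    for bundle in bundles:                                 # exhaustively try each bundle
        if all(asks.get(node, (0, 0))[0] >= 1 for node in bundle):
            cost = sum(asks[node][1] for node in bundle)   # pay each seller's ask price
            if best is None or cost < best[1]:
                best = (bundle, cost)
    return best

asks = {0: (10, 1), 1: (9, 2), 4: (10, 1), 7: (8, 3), 2: (0, 1)}   # node 2 is sold out
bid = [[0, 1, 4, 7], [0, 2, 4, 7]]
print(winner_determination(bid, asks))                     # -> ([0, 1, 4, 7], 7)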

2.3.2 The Seller Agents

There are several seller agents per region, one owning each node. The implication of each seller agent owning a node is that they can compete against each other by pricing their respective resources competitively. All seller agents are physically deployed on their local market server nodes, and we assume, for the moment, that they all use the same simple linear pricing strategy. A seller agent begins with y resource units, initially priced at one price unit each. For each unit sold, the price increases by one price unit (i.e. when there is only one resource unit left, it should cost y price units). Conversely, for each unit reclaimed by a seller agent, the price reduces by one price unit.

The initial low price of one price unit is chosen so that sellers can sell resources more easily to begin with. As demand for resources increases, the price per unit increases so that buyer agents have to bid more for resources. Given this, seller agents can maximise their utilities by making as much profit as possible. They also reduce the price of resources by one price unit when they have reclaimed a resource, so that they can lure more buyers to purchase resources from them (i.e. seller agents remain competitive against each other).
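
A sketch of this pricing rule (the class and method names are illustrative):

class SellerAgent:
    """Linear pricing: start at 1; +1 per unit sold, -1 per unit reclaimed."""

    def __init__(self, node_id, units=10):
        self.node_id = node_id
        self.units_left = units
        self.price = 1                    # initial ask price per unit

    def sell_unit(self):
        assert self.units_left > 0, "no capacity left on this node"
        sold_at = self.price
        self.units_left -= 1
        self.price += 1                   # demand has risen, so raise the ask
        return sold_at

    def reclaim_unit(self):
        self.units_left += 1
        self.price = max(1, self.price - 1)   # stay competitive after release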


2.3.3 The Buyer Agents

Buyer agents purchase node capacity resources from seller agents within the system and are funded by callers so that resource bundles can be bought. The bundles establish a complete path from the caller's source location to the destination location that the caller wishes to contact. These resources allow calls to be made across the network. There is one buyer agent per node. They are put on individual nodes so that they can await call requests, from callers, at any point in the network. The number of buyer agents required in setting up a call is the same as the number of regions in which resources are required for that call (i.e. different buyer agents purchase resources in their own respective regions in order to make a complete path across several regions, for multi-region calls). For a single-region call, only a single buyer within that region is required to set up the call. If the call request involves several regions, then other buyer agents are contacted to purchase resources in their regions. The process of reserving resources across several regions is described in detail in section 2.4.

The current buyer agent bidding strategy is simple and assumes that buyers have knowledge of the price of all resources within their own regions.^ However, it must be noted that buyer agents do not know the current availability of resources, as this would be unrealistic. We assume that all buyer agents use the same purchasing strategy. Thus, when a buyer agent receives a request for purchasing node bandwidth, it formulates its bid. In doing so, we assume that buyers have knowledge of how all of the regions are connected together in the network, as well as of the regions in which all nodes are situated.^ Therefore, once the buyer knows the final destination for which it purchases the resources, it finds the cheapest set of routes that lead from its current node to a destination node within its own region. These are then sent as a bid to the buyer's local market. If the final destination node is within the same region, then that node is the destination node. If, however, the final destination is in another region, then the buyer finds a set of routes that lead to a node within its current region that is connected to a node in a neighbouring region that leads to the region where the final destination node is. Since the buyer agents have knowledge of resource prices, they select a set of bundles that minimise the cost of their desired routes.

A buyer agent would like to obtain only one bundle from the set that it submits to its local market. Therefore, we make the assumption that buyer agents are only allowed to submit up to a certain number of bundles for each bid. The value chosen here is five, because we wanted to allow some choice and flexibility in the bundle that a buyer could be allocated, and yet not choose a number so high that the market algorithm has to do significant amounts of unnecessary processing.^ Finally, if the buyer agent is successful in reserving resources, it is informed by the local market.

^ More advanced buyer strategies will be investigated as part of the future work.
^ Buyer agents do not know the entire topology of how all nodes in all regions are connected together.
^ A future investigation will be to look into exactly how much processing is done when the number of bundles submitted is altered.
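
The buyer's bundle-selection step of Sec. 2.3.3 can be sketched as follows, under the stated assumptions that a buyer knows intra-region prices (but not availability) and submits at most five bundles; the graph encoding and function name are our own:

def cheapest_bundles(region_graph, prices, src, targets, max_bundles=5):
    """Enumerate loop-free routes from src to any target node in the buyer's
    own region and return up to max_bundles of them, cheapest first.

    region_graph: {node: [neighbour, ...]} restricted to the buyer's region.
    prices:       {node: known unit price}.
    targets:      the destination node, or edge nodes leading towards the
                  region that contains the destination.
    """
    routes = []

    def dfs(node, path, cost):
        if node in targets:
            routes.append((cost, path))
            return
        for nxt in region_graph.get(node, []):
            if nxt not in path:                       # avoid cycles
                dfs(nxt, path + [nxt], cost + prices.get(nxt, 1))

    dfs(src, [src], prices.get(src, 1))
    return [path for _, path in sorted(routes)[:max_bundles]]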


2.4 Acquiring Resources Across Regions

In a multi-region call, when a buyer agent has successfully reserved a bundle of resources, the market server in that region is responsible for contacting a buyer agent that is on the edge of the next region. The node on which this second buyer agent resides must be in reach of the last node in the bundle of resources that has been reserved in the previous region, so that when the call eventually takes place, there is a continuous path from the source node to the destination node. To this end, the reservation procedure is described next, followed by the backtracking mechanism that releases resources that are no longer required and attempts to reserve alternative bundles for a given call when a complete path cannot be made.

2.4.1 Resource Reservation and Commitment

Figure 2 shows the actual network topology used in our experiments (see section 3). We now use it to demonstrate how buyer agents attempt to reserve resources. The market servers in regions 0, 1, 2, 3 and 4 are assumed to be resident on nodes 3, 16, 26, 38 and 46, respectively. For this example, we assume that the source of the call is node 0 and the destination is node 49. When a call request arrives in region 0 on node 0, the buyer agent on that node, say b1, sends a bid to its local market. Here, we assume that b1 has successfully reserved the path containing resources 0-1-4-7. The market server on node 3 then contacts a buyer agent in region 1 so that it can purchase the next set of resources. It makes a random decision, selecting a buyer agent on either node 8 or node 11, since these are directly in reach of node 7 in region 0, where node 7 is the last resource in the reserved bundle. In this example we assume that node 8 is chosen, on which the buyer agent b2 resides. Therefore, b2 is given the responsibility of bidding for a set of resources in region 1. This process continues until the final destination is reached. Hence, there is an element of cooperation between buyers in different regions when paths are being reserved. Buyer agents reserve resources only from local markets because the complete network is split up into regions; local markets only sell resources in the local region in which they are operating.

Once the final destination has been reached, the market server in the last region (region 4) sends a commit message to the buyer agent within its own region. This buyer agent then contacts the market server in the previous region (region 1) which, in turn, informs its buyer agent, b2, about the complete path being reserved. Payment for resources takes place during this commit phase. Eventually, the originating buyer agent, b1, receives the commit message and the call can be placed. Once the call has completed, a message is sent from b1 in region 0 to its local market stating that resources need to be released. After this has been done, the message is propagated across all used markets in the direction of the final region so that resources can be released. The markets can then resell the resources to the buyers that place bids for them in the future.


Figure 2: A 50 node network topology that has been partitioned into 5 distinct regions. The grey nodes show where the hand-picked market servers reside.

2.4.2 The Backtracking Mechanism

As part of our solution, the system uses a backtracking mechanism that allows alternative allocations to be made if currently reserved resource bundles cannot lead to the final destination. Thus, if a buyer agent in an intermediate region fails to reserve a bundle of resources, it resubmits another bid to its local market containing bundles that lead to another destination node within its own region. This process continues until either a bundle has been reserved or there are none left. In the latter case, the market in the previous region is informed and the previous buyer agent releases its currently reserved resource bundle and bids for another set of resources that lead to a different region.

Using figure 2 as an example, if b2 on node 8 fails to be allocated a resource bundle from node 8 to node 21, then it can submit a second bid for a route that leads from node 8 to node 22. If this also fails, then b2 knows that all routes that lead directly to region 4 have been exhausted. Therefore, it could try for a bundle that leads to region 3 (i.e. node 8 to node 19). If b2 is successful in receiving such a bundle (e.g. 8-11-15-19), then the buyer agent on node 33 in region 3 can continue setting up this call by bidding for a bundle of resources that lead from its region to region 4. In short, the agents in the system perform a distributed depth-first search of the resource bundles when bids are made (a complete description of the system algorithm is given in [5]).
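
The region-level search can be summarised as a depth-first search with reservation on the way forward and release on backtracking. The sketch below runs it sequentially in a single process; in the actual system these steps are distributed across buyer agents and market servers, and the two callables stand in for bids to, and messages from, the local markets:

def allocate_call(region_graph, src_region, dst_region, reserve_bundle, release_bundle):
    """region_graph:   {region_id: [neighbouring region_id, ...]}.
    reserve_bundle: callable(region, next_region_or_None) -> bundle or None,
                    i.e. bid at that region's market for a route towards
                    next_region (or to the destination node itself).
    release_bundle: callable(region, bundle) -> None.
    Returns a list of (region, bundle) pairs, or None if the call fails."""

    def dfs(region, visited):
        if region == dst_region:
            bundle = reserve_bundle(region, None)        # route to the destination node
            return [(region, bundle)] if bundle else None
        for nxt in region_graph.get(region, []):
            if nxt in visited:
                continue
            bundle = reserve_bundle(region, nxt)         # route to an edge node toward nxt
            if bundle is None:
                continue                                 # try another neighbouring region
            rest = dfs(nxt, visited | {nxt})
            if rest is not None:
                return [(region, bundle)] + rest
            release_bundle(region, bundle)               # backtrack: free the reservation
        return None

    return dfs(src_region, {src_region})

# Illustrative use with stub market calls that always succeed:
regions = {0: [1], 1: [0, 3, 4], 3: [1, 4], 4: [1, 3]}
print(allocate_call(regions, 0, 4,
                    reserve_bundle=lambda r, nxt: "bundle-in-region-%d" % r,
                    release_bundle=lambda r, b: None))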

3 Experimental Evaluation

This section describes the experimental work that was carried out in evaluating the system. Section 3.1 describes the methodology and experimental parameters,


and the results are outlined in sections 3.2 and 3.3.

3.1 Experimental Methodology and Settings

In order to evaluate our system, it was benchmarked against two other controls: the global optimum values, and a random strategy for allocating resources. For both controls, as well as our algorithm, we assume that one hop in the network takes one simulation time step. When a source-destination pair has been selected for a call attempt in our simulation, the same pair is used for the optimum and random strategies.

The global optimum strategy works in an entirely impractical way that gives it a number of significant advantages over our system. The optimum strategy assumes that it has global knowledge of all of the resources available at any moment in time. In more detail, at the time a call originates, a complete global search is done to see if a path exists that leads from the source node to the destination node. If one is found, then this is deemed to be a successful allocation attempt. This test is performed on each time step during the set-up period when a call attempt is made, until a solution is found. Whilst one hop in the network is assumed to take one time step, we assume that the global optimum strategy provides an instantaneous allocation when measuring the call success rate (see section 3.2 for details concerning this experiment). If no source-to-destination path is found before a call has been set up in our system, then it is considered to have failed in the optimum strategy. With the random strategy, a randomly chosen neighbouring node is selected and a check is done to see if there is sufficient call capacity for it to accept a call. If so, it is made the current node. If not, then the previous node must select another neighbouring node. The search process continues until either the final destination node has been reached or there are no more neighbouring nodes to contact. If the final destination is found, then the random strategy is considered to have succeeded in its allocation attempt. To avoid cyclic routes and reserving multiple units of bandwidth on the same node, nodes are not allowed to contact neighbours where resources have already been reserved.
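
A sketch of the random benchmark as described above (simplified in that neighbours without spare capacity are filtered out rather than tried and rejected; names are our own):

import random

def random_allocation(neighbours, has_capacity, src, dst):
    """Random walk: pick an uncontacted neighbour with spare capacity at each
    step; the call is dropped when none is left (no backtracking)."""
    path, current, contacted = [src], src, {src}
    while current != dst:
        options = [n for n in neighbours[current]
                   if n not in contacted and has_capacity(n)]
        if not options:
            return None                   # call is dropped
        current = random.choice(options)
        contacted.add(current)
        path.append(current)
    return path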

The experimental settings we used in this evaluation were obtained from a domain expert. Specifically, each experiment was run for a total of 100,000 time steps. The simulation was probed after every 1,000 time steps. The duration of a call was set to 500 time steps. We assume that each node has 2 handsets attached to it. Also, each node has a total of 10 units of node bandwidth capacity available. This means that a node can handle up to 10 simultaneous calls at any one time. Calls were made to originate after every 25 time steps. The cost of calls was set at 35 price units per region. For each experiment, the call origination probability (traffic load in the network) was increased. Also, the number of simulation runs for each experiment was sufficient for the results to be statistically significant at the 95% confidence level. The network topology on which our system operates was shown in figure 2. This was chosen because it demonstrates a topology which has a central region (region 1) through which many calls require resources in multiple regions.


To evaluate our system, we wish to measure the average call success rate (section 3.2). This provides insight into a fundamental measure: the percentage of successful calls that can be placed given different traffic loads in the network. We also look at the average time required for a call to be set up (section 3.3). For all experiments, graphs are plotted, each of which shows the standard deviation using error bars.

3.2 Average Call Success Rate

The purpose of this experiment is to investigate the number of calls that could successfully be set up, on average, when varying the call origination probability. The hypothesis for this experiment was that if the call origination probability is increased, the call success rate will decrease, assuming that all other variables remain constant. As can be seen from figure 3(a), the call success rate does indeed decrease, but it does so at a steady rate. The reason for this is that as the call origination probability is increased, the bandwidth capacity in the nodes is used more (or occupied for longer periods of time) and therefore bandwidth is more scarce. This is confirmed by figure 3(b), which shows that as the load in the network is increased, the usage of nodes is greater. Figure 3(a) also shows that our algorithm performs considerably better than the random strategy. In particular, the average call success rate does not increase with the random strategy when the load is increased, because nodes are not allowed to communicate with neighbouring nodes that have already been contacted for a given call. When there are no more neighbouring nodes left, the calls are dropped. This dictates the overall poor performance of the random strategy, regardless of the load in the network.

In more detail, the results in figure 3(a) show that when the call origination probability was set at only 0.01 (1% load), our system successfully allocated 84% of the calls, where the global optimum was only marginally higher at 92%. This shows that the system performs comparatively well at a light load. We would expect the global optimum strategy to perform better than our algorithm because of the many advantages it is given in terms of information and processing capability (as was detailed in section 3.1). As traffic load increases, the difference in average call success rate between the optimum strategy and the system algorithm becomes larger. The reason for this is that increasing the traffic load induces more contention for resources, which has a larger effect on the algorithm than on the optimum strategy. This can be explained by the fact that our system attempts an exhaustive search across the network for resource bundles. Doing so means that a certain percentage of resource bundles are reserved and unused for periods of time, and this prevents some other calls from being set up. In the case of the optimum strategy, allowing allocations instantaneously means that resources are never occupied unnecessarily for any amount of time, even when load is increased in the network. In order to get our system to perform as close as possible to the optimum solution, we aim to limit the amount of backtracking in the system and to make the buyer agents bid more intelligently. Consequently,


future experiments will be conducted in order to see how well the allocations are being utilised.

Figure 3: Graph plots for the experiment described in section 3.2 (plots not reproduced): (a) average call success rate against call origination probability, for the Algorithm, Optimum and Random strategies; (b) average node occupancy against call origination probability (for Algorithm).

Figure 4: Graph plots for the experiment described in section 3.3 (plots not reproduced): (a) average call set up time against call origination probability, for the Algorithm, Optimum and Random strategies; (b) percentage of successful calls made across region(s) against call origination probability (for Algorithm), broken down into single, double, triple and quadruple region calls.

3.3 Average Call Set Up Time

The purpose of this experiment is to investigate how long it takes, on average, for a call to be set up when varying the call origination probability. The hypothesis was that if the call origination probability is increased, then the average time taken for call set up will be longer, assuming that all other variables remain constant. Figure 4(a) shows that as the call origination probability was increased, the average call set up time actually decreased. Figure 4(a) also shows that, using the system algorithm, calls took a longer time to be set up than with the optimum strategy. The reason for this is that, with the algorithm, a few messages are


required between market servers and buyer agents, within and across regions. The optimum strategy does not require such messages. Using the random strategy, the average call set up time is marginally above 0 time steps. This result gives a false impression of this strategy performing well. The result can be explained by the fact that very few calls are successfully set up with the random strategy (as indicated by figure 3(a)) and that these are all short distance calls of only a few hops in length.

For our system algorithm, our intuition for calls taking a shorter time when increasing load was that more shorter distance calls were being set up than longer distance calls. Figure 4(b) shows how the percentage of successful calls that were made across one or more regions was changing as load was increased. This showed that the percentage of single region calls increases when call origination probability is increased, double region calls stay approximately the same and triple and quadruple region calls decrease. We also intuitively know that single region calls, on average, would take a shorter time to set up than double region calls, which in turn take less time than triple region calls, and so on. This indicates that increasing load means that the average number of regions used for a successful call decreases, which explains why the average call set up time also decreases with load.

4 Related Work

There are several market-based architectures that have been proposed for allocating resources in a distributed environment. Gibney and Jennings [6] describe a system in which agents compete for network resources in distributed markets so that calls can be routed in a telecommunications network. The system used a double auction protocol [7] with sealed bids. Results showed that as more resources were being used, the price of resources marginally increased such that eventually the buyers bought alternative paths. This provided good utilisation of the network and also balanced the load in the network. However, a drawback was that if some resources on a path were already bought and the next desired resource could not be obtained, then the resources already bought could become redundant and a certain amount of money would be spent unnecessarily. In contrast, our reserve/commit mechanism ensures that this situation is avoided by releasing unused resources immediately and allowing payment to occur only after all necessary resources have been successfully reserved.

The Global Electronic Market System (GEM) [8] is a framework for decentralised markets across the Internet. GEM has a single distributed market on which goods are sold. The general idea in GEM is that agents initially trade in local markets and, when required, inter-market communication takes place between other markets. The GEM system is different from traditional independent local markets because the markets are replicated and the order for goods is distributed across these markets. Multiple markets are used in GEM to increase the probability of finding a match for a resource.

[Footnote: The resources that are allocated in GEM are not necessarily network resources.]


If a market is heavily loaded, then it is possible that another market can be used for obtaining resources. Looking at GEM provided an insight into one method of how servers in a market-based resource allocation system could be distributed. However, the approach taken by GEM of replicating the resource information is not suitable for our system because it induces more messages in the network than our partitioned approach (as was detailed in section 2.2).

MIDAS [9] is an auction-based mechanism that allocates link bandwidth in a network for making paths. Simultaneous multi-unit Dutch auctions were used as the protocol for allocating the resources. However, this auction protocol would be inadequate with respect to our requirements since it is not capable of allocating several interrelated goods at the same time. Finally, Ezhilchelvan and Morgan [10] have looked at how an auction system can be distributed across several servers in a network of servers. However, this approach assumes that communication takes place using a high-bandwidth network, which is an assumption that cannot be made within our work.

5 Conclusions and Future Work

In this paper, a system was described that allocates end-to-end bandwidth to set up calls in a network using market-based agents. The system used a combinatorial reverse auction where bundles of interrelated resources were allocated, and novel reserve and commit mechanisms were developed to cope with the partitioned nature of the distributed marketplace. Empirical evaluation showed that our system successfully set up considerably more calls than the random strategy at all traffic loads. It also set up a comparable number of calls when put side by side with the optimum strategy given a light network load. Results also showed that the average time taken for a call to be set up is longer when the load in the network is at its lightest. This was explained by the fact that the percentage of longer distance calls decreased as load was increased and vice versa and that, intuitively, we know that shorter distance calls take less time to set up.

Whilst the optimum used to benchmark against our system was unrealistic, there are a number of ways in which our system can be improved. Firstly, we aim to develop agent strategies that are more realistic with respect to the current assumptions made. Specifically, buyer agents will need to make realistic estimates on the price of resources without knowing the actual prices a priori. In order to achieve this, various techniques such as learning and heuristic methods will be investigated for allowing buyer agents to calculate resource prices. Secondly, we aim to account for a finite number of control messages within our simulation. Currently, our system assumes that there is an infinite amount of bandwidth available for control messages. Finally, in order for our system to perform as close as possible to the optimum, we plan on limiting the amount of backtracking that will take place. It is envisaged that doing so, as well as allowing buyers to bid more intelligently in the first instance, would mean that fewer resources are reserved unnecessarily. Therefore, more resources will


be available for other calls. This should also reduce the average set up time for calls, which is desirable.

6 Acknowledgements

The research in this paper is part of the EPSRC funded Mohican Project (Reference no: GR/R32697/01). We would also like to acknowledge the contribution of Steve Braithwaite who provided us with domain expertise.

References

[1] P. Nicopolitidis, M. S. Obaidat, G. I. Papadimitriou and A. S. Pomportsis, Wireless Networks, John Wiley & Sons Ltd, Chichester, England, 2003.

[2] L. L. Peterson and B. S. Davie, Computer Networks: A Systems Approach, Morgan Kaufmann Publishers Inc, San Francisco, California, 2000.

[3] S. H. Clearwater, Market-Based Control: A Paradigm For Distributed Resource Allocation, World Scientific Publishing Co. Pte. Ltd, Covent Garden, London, 1996.

[4] T. Sandholm, S. Suri, A. Gilpin and D. Levine, Winner determination in combinatorial auction generalizations. In AGENTS-2001 Workshop on Agent-Based Approaches to B2B, Montreal, Canada, 2001.

[5] N. Haque, Resource Allocation in Communication Networks Using Market-Based Agents, Technical Report, School of Electronics and Computer Science, University of Southampton, Southampton, UK, May 2004.

[6] M. A. Gibney and N. R. Jennings, Dynamic Resource Allocation by Market-Based Routing in Telecommunications Networks, Springer-Verlag: Heidelberg, Germany, 1998, volume 1437, pages 102-117.

[7] P. Wurman, W. Walsh and M. Wellman, Flexible double auctions for electronic commerce: Theory and implementation, Decision Support Systems, 1998, volume 24, pages 17-27.

[8] B. Rachlevsky-Reich, I. Ben-Shaul, N. Tung Chan, A. W. Lo and Tomaso Poggio, GEM: A Global Electronic Market System, Information Systems, 1999, volume 24, number 6, pages 495-518.

[9] C. Courcoubetis, M. Dramitinos and G. D. Stamoulis, An Auction Mechanism for Bandwidth Allocation Over Paths: New Results, M3I Modelling Workshop, London, UK, June 2001.

[10] P. D. Ezhilchelvan and G. Morgan, A Dependable Distributed Auction System: Architecture and an Implementation Framework, International Symposium on Autonomous Decentralized Systems, 2001, pages 3-10.


Are Ordinal Representations Effective?

Andrew Tuson
Department of Computing, City University, London,
Northampton Square, London EC1V 0HB, UK
email: [email protected]

Abstract

Permutation optimisation problems are of interest to the local search community, who have long been interested in effective representations of such problems. This paper examines the effectiveness of one such 'general-purpose' approach, the ordinal encoding. Using forma analysis to structure the discussion, it shall be argued that the ordinal approach, by abstracting away problem structure, can perform poorly even in cases where the structures it manipulates map relatively well onto the problem domain. The discussion will be evaluated by an empirical study of the flowshop sequencing problem.

1 Introduction

This paper explicitly examines representational issues pertaining to optimisation over a space of permutations [6]; these include assignment, scheduling and routing problems. The Ordinal [2] encoding is one such approach in the evolutionary algorithms literature, devised to overcome a perceived shortcoming of encoding solutions directly as permutations. Since standard crossover operators in evolutionary algorithms do not produce legal permutations and the ordinal encoding does not suffer from this problem, it can be considered in a sense general purpose. However, work has not been performed on how this encoding works in terms of the problem structure it may exploit.

Representation is an important aspect of local search and evolutionary algorithm design; e.g. [14] argues that the design of such optimisers can be thought of as analogous in many respects to that of a knowledge based system. In this context, forma analysis [9] and its extensions due to [12, 14] can also be used to design operators for evolutionary algorithms in a rigorous and principled manner.

Forma analysis is a generalisation of the notion of schema [3] that provides a framework for analysing evolutionary algorithm representations and operators based on the use of equivalence relations that represent features/building blocks thought to correlate well with solution quality.

This paper will provide a brief overview of forma analysis and show how it can structure the analysis of the ordinal encoding and the derivation of alternative problem specific operators. This analysis will highlight whether and how the ordinal encoding exploits problem structure and highlights a number of potential shortcomings due to this encoding abstracting away problem structure.


An empirical case study of the flowshop sequencing problem, selected as an example of a problem which has structure closest to what the ordinal encoding may exploit, will be used to support the analysis in this paper.

2 Forma Analysis

Forma analysis [8, 9] describes each solution in the search space, S, by a set of relevant features (eg. edges if we are considering the travelling salesman problem) that are thought to relate strongly to solution quality. Forma analysis then formalises each feature as an equivalence relation ψ which has a set of equivalence classes (formae) denoted by ξ ∈ Ξ_ψ (e.g. the presence/absence of a given edge). Formae can also be used to denote the subset of the solutions in the search space that match a given equivalence relation/class combination, ie.:

Ξ_ψ(j) = {x ∈ S | ψ(x) = j}

For each equivalence relation ψ we can now define a function/predicate ψ(x, y) that indicates whether two solutions x, y ∈ S lie in the same equivalence class of ψ:

ψ(x, y) ⇔ ψ(x) = ψ(y)

Given the above we can fully specify any solution in the search space in terms of the vector of basis equivalence relations which are the members of the minimum set, Ψ, of equivalence relations required to uniquely describe any solution in the search space S. A basis set must satisfy the criterion of coverage, where ψ refers to a basis equivalence relation:

∀x ∈ S, ∀y ∈ S \ {x}, ∃ψ ∈ Ψ : ¬ψ(x, y)

Finally, it may be the case that there are certain constraints on what equivalence classes can be used for a given equivalence relation with respect to the equivalence classes adopted for other equivalence relations — permutations are an example as will be shown later — in which case the search space S will use a subset of the above set of combinations.

2.1 Solution Encoding

From the above, any particular solution, x ∈ S, can be described in terms of a representation function, ρ. This is defined in terms of a set of partial functions for each of the equivalence relations in Ψ:

ρ_ψ : S → Ξ_ψ    where ρ_ψ(x) = [x]_ψ

where [x]_ψ is the equivalence class of x under ψ. The representation function for the string as a whole, ρ_Ψ : S → Ξ_Ψ, is thus given by the combination of the partial representation functions, ρ_ψi, for all ψi ∈ Ψ:


ρ_Ψ(x) = (ρ_ψ1(x), ρ_ψ2(x), ..., ρ_ψn(x)) = ([x]_ψ1, [x]_ψ2, ..., [x]_ψn)

The work in [9] then notes that the above formalism suggests an encoding for the encoding space E as an image (direct encoding) of the induced equivalence classes in ρ_Ψ(x). Though forma analysis suggests an encoding, other encodings can be used so long as the operators for that encoding are functionally equivalent, in the sense that the same formae (features) are manipulated in the same way as forma analysis prescribes. Therefore, from now on, instantiations of operators will be described in terms of the usual encoding used for that approach.

2.2 Derivation of Operators

Forma analysis can, from the above, produce general specifications for the distance metric and mutation operator. The distance metric, d(x, y, Ψ), can now be defined as the number of equivalence relations in Ψ that are not equivalent for two solutions x, y ∈ S that are of interest:

d(x, y, Ψ) = Σ_{ψ∈Ψ} (1 − ψ(x, y))

The neighbourhood of a solution (and therefore the neighbourhood operator) is specified in terms of a generalised k-change operator [14], which defines the set of solutions that differ by up to k features with respect to the basis set of equivalence relations. Formally, for a unary (mutation) operation N : S × K_N → S, the set of solutions in the neighbourhood, N_k-change(x, Ψ, k), of x ∈ S is given by:

N_k-change(x, Ψ, k) = {y ∈ S | d(x, y, Ψ) ≤ k}
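
To make these specifications concrete, the following is a minimal Python sketch of the distance metric and the generalised k-change neighbourhood, assuming each basis equivalence relation is supplied as a callable returning a solution's equivalence class (the function names and the explicit candidate set are illustrative assumptions, not part of the formalism):

def forma_distance(x, y, basis):
    # d(x, y, Psi): number of basis equivalence relations whose classes differ.
    return sum(1 for psi in basis if psi(x) != psi(y))

def k_change_neighbourhood(x, candidates, basis, k):
    # Solutions (other than x) that differ from x in at most k basis features.
    return [y for y in candidates if 0 < forma_distance(x, y, basis) <= k]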

Forma analysis also defines representation independent recombination (ie. crossover) operators. This arises from a consideration of the desirable properties of recombination operators, and their formalisation in [9]. An example of such an operator is Random Transmitting Recombination (RTR): for all basis equivalence relations, the equivalence class for the child solution, ψ(z), must be present in either or both of the parent solutions. A formal specification of RTR, in terms of equivalence relations, is given by (note that ⊕ denotes 'exclusive-or'):

RTR(x, y, Ψ) = {z ∈ S | ∀ψ ∈ Ψ : ψ(x) = ψ(z) ⊕ ψ(y) = ψ(z)}

or, more concisely, in terms of the basis formae:

RTR(x, y, Ψ) = {z ∈ S | ∀ξ ∈ Ξ_Ψ : x, z ∈ ξ ⊕ y, z ∈ ξ}

where the actual child solution, z, is again chosen out of the set above uniformly at random.


RTR is selected as the recombination operator in this study as it is the least disruptive of the operators defined in [9]; also, for the purposes of this study, consistency is all that is needed.
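
As an illustration only (not the implementation used in the paper), a transmitting recombination can be sketched by filtering a candidate set against the parents' basis features; the brute-force enumeration over candidates and the use of an inclusive "or" (a feature shared by both parents is trivially transmitted) are simplifying assumptions:

import random

def rtr(x, y, basis, candidates):
    # Keep only candidates whose every basis feature matches at least one parent,
    # then return one of them uniformly at random.
    feasible = [z for z in candidates
                if all(psi(z) in (psi(x), psi(y)) for psi in basis)]
    return random.choice(feasible)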

2.3 A New Concept: Linkage Specialisation

Many problems exhibit some form of positional linkage, that is, features that are 'close by' in the solution interact more strongly with regard to their contribution to solution quality. As a result of this interaction it is reasonable to suppose that if multiple strongly interacting features are to be changed then they should be changed at the same time, as a unit.

An extension of forma analysis in [14] formalises the above notion as a linkage specialisation of the neighbourhood operator; this defines constraints upon the operator specification so as to reduce the neighbourhood to include only those moves that are considered to preserve the linkage between features.

This restriction can be achieved for mutation by creating a variant of the generalised k-change operator that only allows changes to features that are positionally adjacent. This can also be transferred to recombination operators. For example, the generalised N-point crossover (GNX) operator template due to [10] can be used as a variant of RTR, in conjunction with a basis set, to derive a standard N-point crossover operator.

3 Permutation-Based Operators

Mattfeld et al [6] have proposed a taxonomy of sequencing recombination operators based upon the building blocks (formae) that they manipulate and the types of sequencing problem each is thought suited to.

• Position is the absolute position of an item (eg. item 5 is at position 4); this was proposed to be suitable for assignment-type problems.

• Precedence is whether one task is performed before another (eg. item 5 appears before item 4) and was thought useful for 'scheduling' problems.

• Edge is whether two items are next to each other (eg. item 5 is next to item 4); which is thought to be suitable for routing problems.

The central question is: to what extent is problem specific structure exploited by the ordinal approach? It will be clear later on that precedences are the structures that this approach processes most of all, albeit imperfectly. Thus a forma analysis for precedence features will act as a reference point against which to contrast and compare the ordinal approach, assuming throughout permutations of n elements in the set N = {1, ..., n}.


3.1 Formalising Precedence

Let Ψ_prec be the set of basis precedence equivalence relations for a permutation of N elements, Ψ_prec = {ψ_prec(i,j) | i, j ∈ N ∧ i ≠ j} — in all cases i and j refer to the elements in the permutation. Therefore the equivalence classes are simply a true/false answer to whether the element i precedes the element j in the permutation (ie. ∀i, j (i ≠ j) : Ξ_ψ_prec(i,j) = {ξ_true, ξ_false}).

Constraints to ensure a valid permutation need to be provided. The first of these ensures that if item i is before j then the reverse relationship cannot possibly be true (in fact we can discard around half the basis functions due to symmetry):

∀i, j ∈ N (i ≠ j) : ψ_prec(i,j) ⟹ ¬ψ_prec(j,i)

In addition, a constraint needs to be added that a valid permutation exists if and only if the relationship between the precedences is consistent (in that the transitivity condition is preserved):

∀i, j, k ∈ N (i ≠ j ≠ k) : (ψ_prec(i,j) ∧ ψ_prec(j,k) ⟹ ψ_prec(i,k))

A distance metric can be specified as the number of differing (non-redundant) precedence relations between two solutions.
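
For instance, a direct Python sketch of this metric (an illustration, not taken from the paper) counts the element pairs whose relative order differs between the two permutations:

from itertools import combinations

def precedence_distance(x, y):
    # Number of element pairs {a, b} whose precedence differs between x and y.
    pos_x = {e: i for i, e in enumerate(x)}
    pos_y = {e: i for i, e in enumerate(y)}
    return sum(1 for a, b in combinations(x, 2)
               if (pos_x[a] < pos_x[b]) != (pos_y[a] < pos_y[b]))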

3.2 Precedence Mutation

The 2-change operator is minimal for precedences; it corresponds to the swap-adjacent operator, which exchanges two elements of the permutation that occupy adjacent positions in the solution. The k-change operator can be viewed as a sequence of swap-adjacent moves.

A more commonly used operator, however, is the permutation-shift operator (Figure 1), which selects and removes an element from the permutation and re-inserts it elsewhere in the sequence.

Shift: 1 3 [6] 5 4 [2] 7 8  →  1 3 5 4 2 6 7 8

Figure 1: The Permutation-Shift Operator

This operator can in fact be viewed as a linkage specialisation of the k-change operator, in a similar fashion to before, where the k precedences modified are those between the element removed and the elements between it and the second selected element it is inserted before/after (both elements are boxed in Figure 1). This assumes that precedences between positionally nearby elements in the sequence interact more strongly with respect to fitness.
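
A minimal Python sketch of these two mutation moves follows (the function names and 0-indexed position arguments are illustrative assumptions):

def swap_adjacent(perm, i):
    # Exchange the elements at positions i and i+1 (the minimal 2-change move).
    p = list(perm)
    p[i], p[i + 1] = p[i + 1], p[i]
    return p

def shift(perm, i, j):
    # Remove the element at position i and re-insert it at position j.
    p = list(perm)
    p.insert(j, p.pop(i))
    return p

# e.g. shift([1, 3, 6, 5, 4, 2, 7, 8], 2, 5) -> [1, 3, 5, 4, 2, 6, 7, 8], as in Figure 1.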


Permutation shift is an operator that appears to be 'natural' (in that it would be an operation that a human would be likely to use) for a number of scheduling problems, in that it visually captures the concept of a 'block of jobs' in the schedule.

3.3 Precedence Crossover

An operator has been devised that strictly transmits precedences: Precedence Preservative Crossover (PPX) [6]; or equivalently Precedence RTR (and G2X for the two-point variant which implements the linkage specialisation used in the shift operator).

[Figure not reproduced: two parent permutations with two crossover points; the child constructed in this example is 3 6 2 5 4 7 8 1.]

Figure 2: The 2-point PPX/Precedence G2X Crossover Operator

Figure 2 above illustrates the working of the two-point variant of this operator. Two positions are selected for crossover that are used for both parents. It is also assumed in this example that the process starts on the uppermost, 'current' solution in Figure 2. Now, working from the left hand side of the current parent solution, elements are placed from the current parent solution into the child (building up the solution from left to right) and simultaneously removed from each of the two parent solutions. When a crossover position is reached, the current parent solution is changed. This process continues until the child solution is constructed. For ease of interpretation, the elements in the parent solutions are numbered in the order that they are taken to construct the child solution.
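
Based on this description, the following Python sketch gives one possible reading of the two-point variant (the function name, 0-indexed crossover positions and the child-length test for reaching a crossover point are assumptions made for illustration):

def ppx_two_point(parent_a, parent_b, cut1, cut2):
    # Build the child left to right, switching the 'current' parent at each cut
    # and always taking the leftmost element not yet used.
    remaining = [list(parent_a), list(parent_b)]
    cuts = {cut1, cut2}
    child, current = [], 0                 # start from the first (uppermost) parent
    while len(child) < len(parent_a):
        if len(child) in cuts:
            current = 1 - current          # crossover position reached: switch parent
        gene = remaining[current][0]       # leftmost remaining element of current parent
        child.append(gene)
        remaining[0].remove(gene)          # remove from both parents
        remaining[1].remove(gene)
    return child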

4 The Ordinal Encoding

The ordinal encoding was originally due to [2]. A string of N variables, numbered i from left to right, with values in the range 1 to N − i + 1, is used to encode the permutations in the form of a 'pick-list' (see Figure 3). The string is then decoded by proceeding from the start of the string and taking (and removing) the j-th element from the (ordered) permutation {1, 2, ..., N}, where the value of j is given by the value of the string at that point — the process is then repeated until a permutation is produced. This is illustrated by Algorithm 1 and Figure 3 below.

To formalise this encoding, let Ψ_ord be the set of basis ordinal equivalence relations for a permutation of N elements, where i refers to one of the n positions in the encoding, ie. Ψ_ord = {ψ_ord(i) | i = 1, ..., n}.


Algorithm 1 TRANSFORMING AN ORDINAL ENCODING TO A PERMUTATION

1: Let O be a list containing the ordinal representation of the solution;
2: Let P = ∅; {where P is the permutation to be constructed}
3: Let E = {1, ..., N}; {where E is the numerically ordered set of elements in the permutation}
4: repeat
5:   j = ITEM(E, FIRST(O)); {take the n-th item from E, where n is determined by O}
6:   P = APPEND(P, j);
7:   E = REMOVE(E, j);
8:   O = REMOVE(O, FIRST(O));
9: until E = ∅;
10: Return the permutation P thus produced;

The equivalence classes are the set Ξ_ψ_ord(i) = {ξ_1, ..., ξ_{n−i+1}}, which correspond to each of the j-th remaining items in the permutation.

The distance metric in this case corresponds simply to the Hamming distance between the solutions. Also, the minimal (smallest change) mutation in this encoding involves taking one of the equivalence relations and changing its equivalence class, as shown by Figure 3 below.

1 2 4 3 2 1 1 1   →(mutation)→   1 2 4 1 2 1 1 1
1 3 6 5 4 2 7 8                   1 3 6 2 5 4 7 8

Figure 3: The Ordinal Neighbourhood Operator
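
A minimal Python sketch of the decoding procedure of Algorithm 1, reproducing the two mappings shown in Figure 3 (the function name is illustrative):

def decode_ordinal(ordinal):
    # Decode an ordinal 'pick-list' of 1-based indices into a permutation of 1..N.
    remaining = list(range(1, len(ordinal) + 1))    # numerically ordered elements
    return [remaining.pop(j - 1) for j in ordinal]  # take the j-th remaining item each time

# decode_ordinal([1, 2, 4, 3, 2, 1, 1, 1]) -> [1, 3, 6, 5, 4, 2, 7, 8]
# decode_ordinal([1, 2, 4, 1, 2, 1, 1, 1]) -> [1, 3, 6, 2, 5, 4, 7, 8]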

For this study a standard n-ary uniform crossover operator was adopted (i.e. Ordinal RTR).

4.1 Relating the Ordinal and Shift Neighbourhoods

The observant reader will note that the above example corresponds to a shift move where element 2 is removed and inserted before element 5. In fact, all of the minimal mutations have this effect. The two neighbourhoods are not, however, of the same size. The number of solutions in this neighbourhood is N(N − 1)/2, which is half the size of the shift neighbourhood — so where does the other half of the shift neighbourhood lie? Consider Figure 4, which shows that the other half of the shift neighbourhood is in fact quite distant in the ordinal space — the


distance increasing as the number of precedences changed by the shift operator increases.

1 2 3 4 5 6 7 8  [1 1 1 1 1 1 1 1]
   → 1 3 4 5 2 6 7 8  [1 2 2 2 1 1 1 1]
   → 1 3 4 5 6 2 7 8  [1 2 2 2 2 1 1 1]
   → 1 3 4 5 6 7 2 8  [1 2 2 2 2 2 1 1]

Figure 4: The Missing Half of the Shift Neighbourhood

Therefore, not only does the ordinal representation not correspond to any direct feature of the permutation given in [6]; it is poorly correlated to even the most similar of those permutation features, precedence, described earlier.

4.2 How Does Forma Analysis Help?

Forma analysis makes two key contributions to this study. First of all, it ensures that operators are used that actually do process the features that they are claimed to process.

Second, the above discussion relating the two neighbourhoods can be placed in forma processing terms, which is more meaningful when discussing recombination operators (as it is effectively a generalised version of traditional schema processing arguments). In the case above, a highly transmitting ordinal operator also implicitly processes (ie. transmits) precedence formae to some extent — this idea of implicit forma processing was first introduced by [4].

In fact a more extensive comparison of sequencing operators conducted in [14] shows that relative operator performance can often be accounted for in terms of the degree of implicit forma processing that they perform.

5 Empirical Study

The analysis above indicates that the ordinal approach most directly (though imperfectly) manipulates precedence relationships in permutations, of those


given in [6]. Therefore a scheduling-type sequencing problem will be used to evaluate the predictions made above, based on the arguments given by [6], to give the best chance for the ordinal approach to shine. Furthermore, to see whether EA dynamics play a role, results for a basic stochastic hillclimber will be presented and contrasted with those for an EA.

However, from the arguments above it would be expected that precedence based operators would outperform the ordinal approach as the ordinal approach only partially manipulates these features.

5.1 Experimental Approach

The flowshop sequencing, or n/m/P/Cmax, problem [5] involves finding a sequence of jobs for the flowshop (a straight line of machines to process), so as to minimise the makespan — the time taken for the last of the jobs to be completed.

This task is known to be NP-hard [1] (the number of possible sequences is n!) and can be formalised as follows: n jobs have to be processed (in the same order) on m machines; the aim is to find a job permutation {J1, J2, ..., Jn} so as to minimise Cmax. This is defined as follows: given processing times p(i, j) for job i on machine j and the job permutation above, we can find the completion times by the following equations:

C(J1, 1) = p(J1, 1)

C(Ji, 1) = C(Ji−1, 1) + p(Ji, 1)   for i = 2, ..., n

C(J1, j) = C(J1, j − 1) + p(J1, j)   for j = 2, ..., m

C(Ji, j) = max{C(Ji−1, j), C(Ji, j − 1)} + p(Ji, j)   for i = 2, ..., n; j = 2, ..., m

Cmax = C(Jn, m)
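
These recurrences translate directly into a short Python function (an illustrative sketch; the data layout p[job][machine] and the function name are assumptions):

def makespan(sequence, p):
    # Cmax for the given job sequence; p[job][machine] are the processing times.
    m = len(p[sequence[0]])
    completion = [0.0] * m   # completion times of the previous job on each machine
    for job in sequence:
        for j in range(m):
            earliest = completion[j] if j == 0 else max(completion[j], completion[j - 1])
            completion[j] = earliest + p[job][j]
    return completion[-1]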

Standard benchmarks due to Taillard [13] exist for this problem, and were used in this study. Experiments were performed for the first of the Taillard test set's instances of the following problems: 20x5, 20x10, 20x20, 50x5, 50x10, 50x20, 100x5, 100x10, 100x20, 200x10, and 200x20; where the notation used is of the form 'number of jobs' x 'number of machines'.

The measure of performance used was the makespan obtained after a set number, N, of evaluations. The value of N used in these experiments was solely dependent upon n (the number of jobs) and was set to 5000, 6750, 8000, and 9250 evaluations respectively. The value of N at n = 20 was set on the basis of formative experiments and then scaled up by a roughly ln(n) relationship for larger instances, justified from empirical results from [7].

A sample of fifty runs was taken in each case, and where performance differences are reported as being significant, this refers to the results of a number of statistical hypothesis tests (two-tailed Student t-tests) which can be found in [14].


5.2 Algorithm Configurations

A Davis-style, GENITOR [15] steady-state EA (with kill-worst replacement) with an unstructured population model was implemented (for full details see [14]). This was chosen as being generally applicable and robust, based on results for sequencing problems in the EA literature such as [11].

Experiments were also performed to examine the performance of stochastic hillclimbing; the implementation used is described fully in [14].

5.3 Experimental Results

The results obtained are summarised in Tables 1 and 2. For all of the (mean) results presented here, the standard deviation is given in parentheses and P-values are quoted in braces for comparisons with the precedence-based approach, low values supporting the alternative hypothesis (difference in performance).

Problem   Shift              Ordinal
20x5      1292.44 (7.77)     1307.16 (17.55) {> 0.00}
20x10     1605.88 (11.00)    1696.16 (43.21) {≫ 0.00}
20x20     2337.42 (17.18)    2399.50 (31.67) {≫ 0.00}
50x5      2734.46 (6.76)     2748.00 (14.90) {≫ 0.00}
50x10     3121.10 (24.73)    3281.32 (39.09) {≫ 0.00}
50x20     4021.62 (22.52)    4253.40 (58.17) {≫ 0.00}
100x5     5506.56 (14.14)    5523.80 (23.14) {≫ 0.00}
100x10    5884.80 (30.06)    6140.66 (53.84) {≫ 0.00}
100x20    6581.78 (38.13)    6981.76 (75.53) {≫ 0.00}
200x10    11020.20 (31.61)   11257.36 (79.35) {≫ 0.00}
200x20    11797.80 (50.06)   12468.80 (84.39) {> 0.00}

Table 1: Summary of Experimental Results (Hillclimbing)

Comparing representations for stochastic hillclimbing indicates that the shift neighbourhood gave the highest quality solutions in the time available, with the ordinal neighbourhood being significantly outperformed over all instances at an extremely high level of confidence.

The variability of results within each set of EA runs is much higher than for the hillclimber, making it harder to show statistical significance. However, examination of the results obtained for the ordinal representation showed that this representation was still a poor choice.

The use of crossover operators that are 'precedence aware' gave the best performance; it would appear that the performance of the ordinal approach may be due to its ability to process precedences in an implicit manner. The performance of crossover operators based on the precedence and ordinal neighbourhoods (formae) was in line with the relative performance indicated by the hillclimbing experiments, which begs the question of whether hillclimbing experiments could help design EA crossover operators.


Problem   PPX and Shift       Ordinal
20x5      1261.36 (91.18)     1280.32 (85.55) {0.31}
20x10     1577.88 (77.45)     1623.80 (85.56) {0.01}
20x20     2267.94 (70.30)     2333.44 (76.90) {0.00}
50x5      2804.44 (129.12)    2834.34 (121.50) {0.25}
50x10     3155.68 (100.90)    3244.28 (100.13) {≫ 0.00}
50x20     4020.76 (92.10)     4136.80 (88.88) {≫ 0.00}
100x5     5390.86 (190.86)    5431.52 (156.64) {0.22}
100x10    5843.34 (159.87)    5937.54 (136.00) {≫ 0.00}
100x20    6824.06 (97.10)     7043.92 (97.06) {≫ 0.00}
200x10    11067.86 (207.81)   11264.38 (179.96) {≫ 0.00}
200x20    12263.32 (111.37)   12607.42 (133.75) {≫ 0.00}

Table 2: Summary of Experimental Results (Evolutionary Algorithm)

No consistent trends were found with regard to the relative performance of the two approaches and the number of jobs and machines. However, a more extensive comparison can be found in [14] if a closer examination of such relationships is desired.

6 Conclusions

This paper has examined the Ordinal approach to representing sequencing problems. In this context, forma analysis assisted by placing the ordinal approach in the context of other work relating permutation features to sequencing problem domains, and in ensuring that the operators used actually manipulated the features they were claimed to manipulate.

From this, it was argued that the ordinal approach, in abstracting away relevant problem structure, could have a detrimental effect on optimiser performance.

The results from a study of the flowshop sequencing problem supported this view; appropriate operators, selected using forma analysis to directly exploit the problem features thought relevant to flowshop sequencing, clearly outperformed the ordinal approach in the way predicted by the work of Mattfeld et al [6].

So, in summary, though the intent of [2] to produce a general approach to sequencing problems is laudable, it is clear that it does so by unnecessarily reducing optimiser performance; especially when work such as Mattfeld et al's provides straightforward guidance in relating sequencing problem type to the features operators should explicitly manipulate, and forma analysis guides the construction of such operators.

7 Acknowledgments

I would like to express my gratitude to the Engineering and Physical Sciences Research Council (EPSRC) for their support via a research studentship


(95306458) of the work that provided a basis for this paper.

References

[1] Michael R. Garey and David S. Johnson. Computers and Intractability: a Guide to the Theory of NP-Completeness. Freeman, 1979.

[2] J. J. Grefenstette, R. Gopal, B. Rosmaita, and D. Van Gucht. Genetic Algorithm for the TSP. In J. J. Grefenstette, editor, Proceedings of the International Conference on Genetic Algorithms and their Applications, pages 160-168. San Mateo: Morgan Kaufmann, 1985.

[3] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: The University of Michigan Press, 1975.

[4] M. Jelasity and J. Dombi. Implicit Formae in Genetic Algorithms. In H.-M. Voigt et al, editors, Parallel Problem-Solving from Nature - PPSN IV, LNCS, pages 154-163. Springer-Verlag, 1996.

[5] A. H. G. Rinnooy Kan. Machine Sequencing Problems: Classification, Complexity and Computations. Martinus Nijhoff, The Hague, 1976.

[6] D. Mattfeld, C. Bierwirth, and H. Kopfer. On Permutation Representations for Scheduling Problems. In H.-M. Voigt et al, editors, Parallel Problem-Solving from Nature - PPSN IV, LNCS, pages 310-318. Springer-Verlag, 1996.

[7] I. H. Osman and C. N. Potts. Simulated annealing for permutation flow-shop scheduling. OMEGA, 17:551-557, 1989.

[8] N. J. Radcliffe. Equivalence Class Analysis of Genetic Algorithms. Complex Systems, 5(2):183-205, 1991.

[9] N. J. Radcliffe. The Algebra of Genetic Algorithms. Annals of Maths and Artificial Intelligence, 10:339-384, 1994.

[10] N. J. Radcliffe and P. D. Surry. Formal memetic algorithms. In T. C. Fogarty, editor, Proceedings of the AISB Workshop on Evolutionary Computation. Springer-Verlag, 1994.

[11] C. R. Reeves. A genetic algorithm for flowshop sequencing. Computers & Ops. Res., 22:5-13, 1995.

[12] P. D. Surry. A Prescriptive Formalism for Constructing Domain-Specific Evolutionary Algorithms. PhD thesis, University of Edinburgh, UK, 1998.

[13] E. Taillard. Benchmarks for basic scheduling problems. European Journal of Operations Research, 64:278-285, 1993.


[14] A. L. Tuson. No Optimisation Without Representation: A Knowledge-Based Systems View of Evolutionary/Neighbourhood Search Optimisation. PhD thesis, University of Edinburgh, 2000.

[15] D. Whitley. The GENITOR Algorithm and Selective Pressure. In Stephanie Forrest, editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 116-121. San Mateo: Morgan Kaufmann, 1989.


A Framework for Hybrid Planning

Max Garagnani Department of Computing, The Open University

Milton Keynes - UK

Abstract

Sentential and analogical representations constitute two complementary formalisms for describing problems and domains. Experimental evidence indicates that different domain types can have their most efficient encoding in different representations. While real-world problems typically involve a combination of different types of domains, all modern planning domain description languages are purely sentential. This paper proposes a framework for planning with hybrid models, in which sentential and analogical descriptions can be integrated and used interchangeably, thereby allowing a more efficient description of realistically complex planning problems.

1 Introduction

Research in knowledge representation and reasoning [1, 2] indicates that many problems are easier to solve if described using analogical [3] (a.k.a. diagrammatic, homomorphic [4]) representations. Recent experimental evidence in planning [5] demonstrates that Move problems, involving the movement and manipulation of entities subject to a set of constraints, are solved significantly faster (up to two orders of magnitude) if recast in purely analogical terms. Real domains, however, are typically the result of a "mixture" of diverse types of components: representing all of these aspects in an efficient manner using a purely analogical (or a purely sentential) language is often impossible.

This work concerns the development of hybrid (or heterogeneous [4]) planning representations, able to merge sentential and analogical models into a single formalism that combines the strengths and overcomes the weaknesses of both paradigms. In this paper, we (i) review the setGraph model described in [6] (to the best of our knowledge, the only existing proposal of analogical planning representation) and extend it into a more expressive representation; (ii) briefly describe the sentential model chosen, based on the planning domain description language PDDL2.1 [7] and expressively equivalent to the analogical model; (iii) describe a simple model of hybrid planning which allows the two above representations to be integrated; and (iv) present a general theory that guarantees the soundness of the approach. In particular, the Soundness Theorem presented extends to analogical and hybrid representations the theory of sound action description given in [8], originally limited to sentential models. The final section discusses related work, limitations and future directions.


2 The Analogical Model: setGraphs

This section reviews the setGraph model proposed in [6] and extends it so as to allow (1) numeric values (hence, attributes with infinite domains), and (2) actions involving non-conservative changes (addition and removal of elements to and from a state) and numeric updates.

A setGraph is essentially a directed graph in which the vertices are sets of symbols. For example, Figure 1.(a) shows a setGraph description of a Blocks World (BW) state with three blocks and a table, represented by symbols A, B, C, Table. The vertices of the graph are depicted as ovals, labelled V1, ..., V10. In this example, the edges of the graph (arcs) represent 'on' relations between spatial locations: if a vertex containing symbol x is linked to a vertex containing symbol y, then On(x, y) holds in the current state.

Figure 1: A setGraph encoding of the Blocks World domain: (a) state representation; (b) Move(x, y, z) operator (x ∈ {A, B, C}; y, z ∈ {A, B, C, Table}).

The symbols of a setGraph can be moved from one vertex (set) to any other through the application of diagrammatic operators, which specify the set of legal transformations of a state (setGraph). Figure 1.(b) depicts the Move operator for the BW domain. The operator preconditions P describe a specific arrangement of symbols in a part (sub-graph) of the current state; the effects E describe the arrangement of these symbols in the same sub-graph after the application of the operator. Intuitively, an operator P ⇒ E is applicable in a state s iff each of the graphs contained in P can "overlap" with (be mapped to) a sub-graph of s having the same "structure", so that each variable corresponds to a distinct symbol, each vertex to a vertex, each edge to an edge, and: (1) if a variable is contained in a set (vertex), the corresponding symbol is contained in the corresponding set; (2) if an edge links two vertices, the corresponding edge links the corresponding sets; and (3) if a set is empty, the corresponding image is empty. When all of these conditions hold, we will say that the precondition setGraphs are satisfied in s. Notice that the variables can be of specific types, subsets of the universe of symbols. For example, variable x of the Move(x, y, z)


operator has type Block = {A, B, C}, while y, z ∈ Object = {A, B, C, Table}. Thus, this operator encodes the movement of a block x from its current location to a new one, originally empty, situated "on top" of a set containing another block (or the table) y. Notice that the operator is applicable only if block x has an empty set on top of it (i.e., if x is clear).

The application of an operator to a state s causes the corresponding symbols in s to be re-arranged according to the situation described in the effects E. For example, the Move(x, y, z) operator can be applied to the state of Figure 1.(a) in several ways. One possible binding is x/C, y/Table, z/A; the application of Move(C, Table, A) would unstack block C from A and put it on the table (i.e., in set V9 of Figure 1).

This simple representation is made more expressive by allowing edges to connect any two elements of the graph (e.g., a symbol and a vertex, or two symbols), and symbols to consist of strings of digits. Such an extended representation is presented more formally in the sequel.

2.1 Formalising Set Graphs

The formal definition of setGraph is based on the concept of multiset [9]. A multiset is a set-like object in which the order of the elements is unimportant, but their multiplicity is significant. For example, C = {1,1,0,0,0} denotes a multiset of integers containing two occurrences of 1 and three occurrences of 0. Since the order is unimportant, C is equivalent to {1,0,1,0,0}, but not to {1,1,0,1,0}. The empty multiset is denoted as { }. We will say that x is contained in C and write "x ∈ C" to indicate that element x appears (occurs) at least once in multiset C.

A setGraph is a multiset of nodeSets, which are defined as follows:

Definition 1 (nodeSet) Let W be a set of symbols (language). A nodeSet is either:

• a symbol w ∈ W, or

• a finite multiset of nodeSets.

In short, nodeSets are data structures consisting of multi-nested sets of symbols (strings) with multiply occurring elements and no limit on the level of nesting. NodeSets that are symbols of W are called nodes. All other nodeSets are called places. Thus, a place can contain both nodes and places. Places can be labelled (with possibly identical labels). Given a nodeSet N, ρ(N) is defined as the multiset of all the nodeSets occurring in N (including N itself). For example, consider a language W = {A}. Let N1 be the nodeSet {A, {A}, {{A}}}. Then, ρ(N1) = {N1} ∪ {A, A, A, {A}, {{A}}, {A}}.

In order to represent numbers, the language W is allowed to also contain numeric symbols. A numeric node is a string of form n.m, or n (possibly preceded by + or −), where n, m are sequences of digits and the first (last) digit of n (m) is not 0. The set of strings of all real numbers (of form "n.m") together with "⊥" (representing undefined values) will be called ℜ⊥.


Definition 2 (setGraph) A setGraph is a pair (N, E), where N is a nodeSet and E = {E1, ..., Ek} is a finite set of binary relations on ρ(N).

If E contains only one relation E′, we shall simply write (N, E′) instead of (N, {E′}). For example, let N1 be the nodeSet N1 = {A, {B}, {{C}}}. The pair (N1, E1), with E1 = {(C, B), ({B}, {{C}}), ({A, {B}, {{C}}}, A)}, is a setGraph. The instances of the binary relation E1, pairs of elements of ρ(N1), are the edges of the setGraph. Notice that if all places of N1 were assigned distinct labels, for example, N1 = P0{A, P1{B}, P2{P3{C}}} (where the syntax name{x, y, ..., z} denotes a place with label name containing nodeSets x, y, ..., z), then E1 could be specified more simply as E1 = {(C, B), (P1, P2), (P0, A)}.

Example 1 - Consider the Briefcase domain, consisting of two connected locations (home and office), one briefcase, and three objects (called A, C and D). The sentential state {(at Brief Home) (at A Home) (at C Home) (in D Brief)} can be encoded as a setGraph σ = (P1, R1), where:

P1 = { Home{A, C, Brief{D}}, Off{ }, 1 }
R1 = { (Home, Off), (Off, Home), (Brief, 1) }

Two (unlabelled) edges in R1 are used to represent the bidirectional connection between the two locations. The setGraph also contains an edge associating place Brief to a numeric node. This node is used to keep track of the total number of objects currently inside the briefcase.
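
Purely as an illustration of the data structure (not a representation used by the authors), the setGraph σ of Example 1 could be held in memory roughly as follows, with places as (label, contents) pairs, nodes as strings, and edges as pairs of identifiers:

# NodeSets: places are (label, contents) pairs, nodes are plain strings,
# numeric nodes are digit strings. This layout is a hypothetical sketch.
P1 = [
    ("Home", ["A", "C", ("Brief", ["D"])]),   # Home contains A, C and the place Brief{D}
    ("Off", []),                              # empty place Off
    "1",                                      # numeric node counting the briefcase's contents
]
R1 = [("Home", "Off"), ("Off", "Home"), ("Brief", "1")]   # edges between identifiers
sigma = (P1, R1)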

[Diagram not reproduced. Panel (b) shows the type hierarchy, including PLACE (with subtypes Location and Mobile{Portable}) and NODE.]

Figure 2: Analogical encoding of the Briefcase domain: (a) graphical depiction of setGraph σ = (P1, R1) (see text); (b) nodeSets type hierarchy.

Figure 2.(a) shows a graphical representation of setGraph σ. All and only the nodeSets that are contained in a place are depicted within the perimeter of the corresponding oval. NodeSets (and edges) of a setGraph are grouped in different types (or sorts), specified using a hierarchy (see Figure 2.(b)). By default, NODE and PLACE are the only two subtypes of the root-type NODESET

(not shown in the figure). Every type t represents the finite set of instances (leaves) of the subtree that has t as root (hence, NODESET = PLACE ∪ NODE).

Different types may have different properties. The properties of a type are inherited by all of its subtypes and instances. In this example, the PLACE


hierarchy restricts any instance of a Mobile place to contain only Portable nodes (by default, a place may contain any instance of NODESET).

If a setGraph G has an associated type hierarchy, G is said to be typed. In a typed setGraph, the language W is assumed to contain all the instances of NODE ∪ ℜ⊥. In what follows, all setGraphs will be assumed to be typed, unless otherwise specified. A parameterised setGraph is a setGraph in which at least one of the place labels or nodes has been replaced with one of its super-types (or, equivalently, with a variable ranging on one of its super-types). We refer to the label, symbol or variable name associated to an element x of a setGraph (nodeSet or edge) as that element's identifier, returned by the function id(x). A typed setGraph containing only instances of the NODESET hierarchy (i.e., no types or variables) is said to be ground (e.g., see Figure 2.(a)).

2.2 Representing Action

In addition to the description of the initial world state (encoded as a ground setGraph), a planner must also be provided with a specification of how states are changed by actions. As usual, the domain-specific legal transformations of a state are defined through a set of parameterised action schemata (operators). An operator P ⇒ E consists of preconditions P, specifying the situation required to hold in the state before the action is executed, and effects E, describing the situation of the state after. For example, Figure 3.(a) depicts a graphical representation of the Move operator for the Briefcase domain, which transfers a mobile object x (and all of its contents) from one location to another. The elements to the left of the arrow represent the preconditions; those to the right, the effects.

[Diagrams not reproduced. In (a), x ∈ Mobile and the location variables range over Location; in (b), y ∈ Portable and z ∈ Mobile, with numeric precondition x ∈ ℜ⊥, x < 3 and numeric effect (increase x 1).]

Figure 3: Briefcase operators: (a) Move and (b) Put-in.

While the model described in [6] was limited to actions consisting only of nodeSet movement, the addition and removal of an element and the update of the value of a numeric node will also be allowed here. The movement (or removal) of elements in a setGraph is based on the following general rules: (1) if a node is (re)moved, all edges linked to it move (are removed) with it; (2) if a place is (re)moved, all the elements contained in it and all edges linked to it move (are removed) with it. Given the current value x of a numeric node and


a numeric value v ∈ ℜ⊥, the possible update operations are: (a) assignment (x′ := v); (b) increase (x′ := x + v); (c) decrease (x′ := x − v); (d) scale-up (x′ := x · v); and (e) scale-down (x′ := x / v). Finally, any element not moved, removed or updated is left unaltered (i.e., we assume default persistence).

In a setGraph operator P ⇒ E, preconditions P and effects E are composed of two separate parts, analogical and numerical. The analogical components consist of ordered lists of typed setGraphs. The numerical part of the preconditions consists of a set of comparisons (<, >, ≤, ≥, =, ≠) between pairs of numerical expressions, while the numerical effects consist of a set of update operations of the kind (a)-(e) listed above. Numeric expressions are built from the values of numeric nodes using arithmetic operators. For example, Figure 3.(b) represents graphically the Put-in operator, which moves an object y inside a mobile z, subject to them being at the same location and to the mobile containing at most two objects. The analogical precondition and effect lists of this operator consist of only one typed setGraph. The numerical parts constrain and update, respectively, the value x of the node associated to the mobile z.

The next definition specifies the conditions for a parameterised setGraph T to "match" (or be satisfied in) a ground setGraph G. Intuitively, T is satisfied in G iff there exists a substitution of all the parameters of T with appropriate instances such that T can be made to "coincide" with G (or with part of it).

Definition 3 (Satisfaction) Given a parameterised setGraph T = (N, E) and a ground setGraph G, T is satisfied in G iff there exists a substitution θ of all variables and types of T with corresponding instances, and a 1-1 function σ : T → G mapping elements of T to elements of G, such that, if Tθ is the ground setGraph T after substitution θ, the following conditions hold true:

• for each nodeSet (or edge) x ∈ Tθ, id(x) = id(σ(x))

• for all pairs (x, y) ∈ ρ(N) × ρ(N), if x ∈ y then σ(x) ∈ σ(y)

• for all edges e = (x, y) ∈ E, σ(e) = (σ(x), σ(y))

The first condition requires that each element of T is mapped to an element having an identical identifier. The second condition requires that any relation of containment between nodeSets of T is reflected by containment between the corresponding images in G. The last condition requires that if two nodeSets are linked by an edge, the corresponding images are linked by the image of the edge in G. For example, given the type hierarchy of Figure 2.(b), it is easy to see that the preconditions of the two operators of Figure 3 are both satisfied in the ground setGraph of Figure 2.(a).

The semantics of action is specified by providing an algorithmic definition of the following: (a) a method to check whether an operator O is applicable in a given state s; (b) a method for calculating the new state resulting from the application of an operator O in a state s. These definitions are given below; a code sketch of their numeric part follows them:

(a) an operator P ⇒ E is applicable in a state (ground setGraph) s iff (1) all the parameterised setGraphs of P are satisfied in s (using a binding σ and


a common substitution θ), and (2) if every occurrence of each numeric variable x in the numeric part of P is replaced with the value of σ(x), all the numeric comparisons in P are true;

(b) if operator O is applicable in state s, the result of applying O is the new setGraph obtained from s by (1) carrying out - on the corresponding elements of s identified through binding σ - the changes required to transform each of the setGraphs in the preconditions P into the (respective) setGraph in the effects E, and (2) for each update operation of E, updating the corresponding numeric node with the result of the operation.
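As an illustration of the numeric half of steps (a) and (b), the sketch below checks the comparisons of a precondition and applies the update operations; the tuple encoding of comparisons and updates is an assumption made only for this example, and the analogical (setGraph) part of the check is omitted.

# Hedged sketch of the numeric part of operator applicability and application.
import operator

CMP = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
UPDATE = {"assign": lambda x, v: v, "increase": lambda x, v: x + v,
          "decrease": lambda x, v: x - v, "scale-up": lambda x, v: x * v,
          "scale-down": lambda x, v: x / v}

def numeric_part_applicable(comparisons, values):
    """(a.2): all comparisons hold once variables are replaced by their values."""
    return all(CMP[op](values[var], const) for var, op, const in comparisons)

def apply_numeric_effects(updates, values):
    """(b.2): update each numeric node with the result of its operation."""
    new_values = dict(values)
    for kind, var, v in updates:
        new_values[var] = UPDATE[kind](new_values[var], v)
    return new_values

# Put-in (Figure 3(b)): applicable while the mobile carries fewer than 3 objects.
state = {"x": 2}
if numeric_part_applicable([("x", "<", 3)], state):
    state = apply_numeric_effects([("increase", "x", 1)], state)
print(state)   # -> {'x': 3}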

3 The Sentential Model

The sentential planning representation adopted is a simplified version of PDDL-2.1 [7] equivalent to extending STRIPS to numbers and functor symbols.

As in PDDL2.1 [7], the sentential world state is composed here of two separate parts, a logical (STRIPS-like) state and a numeric state. While the logical state L consists of a set (conjunction) of ground atomic formulae (the truth of an atom p depending on whether p ∈ L), the numeric state consists of a finite vector R of values in ℝ⊥ = ℝ ∪ {⊥} (where ℝ is the set of real numbers). Each element of R contains the current value of one of the primitive numeric expressions (PNEs) of the problem (values associated with tuples of objects by functor symbols - see [7] for more details).
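A minimal sketch of such a state-pair, assuming atoms are encoded as tuples and ⊥ is represented by None, could look as follows (the class and field names are illustrative only).

# Hedged sketch of a sentential state-pair (L, R): L is a set of ground atoms,
# R maps each primitive numeric expression (PNE) to a value (None stands for ⊥).
from dataclasses import dataclass, field

@dataclass
class SententialState:
    L: set = field(default_factory=set)     # logical (STRIPS-like) state
    R: dict = field(default_factory=dict)   # PNE -> value (None = undefined)

    def holds(self, atom) -> bool:
        # an atom p is true iff p is an element of L
        return atom in self.L

s = SententialState(L={("at", "Brief", "Home"), ("in", "A", "Brief")},
                    R={("Total_obj", "Brief"): 1.0})
print(s.holds(("at", "Brief", "Home")))   # -> True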

A sentential operator specifies a transformation of a state-pair s = (L, R) into a new state-pair s' = (L', R'). We consider operators P ⇒ E in which the preconditions P contain just a set (conjunction) of literals (possibly negative atoms) and a set of comparisons between pairs of numeric expressions (containing PNEs and numbers), while the effects E consist of a list of literals and a set of numeric update operations (analogous to those allowed in setGraph operators). This does not cause any loss of generality, as any PDDL2.1 "level 2" (i.e., non-durative actions) operator can be compiled into an equivalent set of ground operators of the above form [7]. In view of this, we refer to the sentential formalism described above as PDDL2.1-lev2. The complete semantics for these operators is described in [7]. An example of a sentential operator is given in the next section.

4 The Hybrid Representation

The hybrid model puts together the analogical and sentential models described above. In the hybrid representation, the world state is composed of two distinct parts: an analogical state and a sentential state. The two components are treated as two independent sub-states; each hybrid operator will consist of two distinct parts, each describing the transformation of the respective sub-state.


For example, consider a modified Briefcase domain, in which a bucket B containing green paint is used to carry around the objects A, C, D. Any object dropped in the bucket becomes green. The analogical part of the Drop-in operator would be identical to the Put-in action depicted in Figure 3(b). The sentential part could consist of the following preconditions P and effects E:

P = { (colour y w) }    E = { (colour y Green), ¬(colour y w) }

where y ∈ Portable and w ∈ Colours. As described in the previous section, the sentential part of the operator may also contain numerical elements. For example, given the 1-placed functor symbol "Total_obj" returning the number of items currently contained by a mobile object, precondition P could require (< (Total_obj B) 3), and effect E would contain (increase (Total_obj B) 1). Notice that the function (or PNE) (Total_obj Brief) is realised in the analogical representation of Figure 2(a) by linking Brief to a numeric node.
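Purely as an illustration, the sentential half of this hybrid Drop-in operator could be written down as plain data; the dictionary layout below is an assumption, and the analogical half (identical to Put-in in Figure 3(b)) would accompany it in the full hybrid operator.

# Hedged sketch of the sentential part of the hybrid Drop-in operator.
# The encoding of literals as tuples and of updates as triples is an
# illustrative assumption, not a syntax proposed by the paper.
drop_in_sentential = {
    "preconditions": {
        "literals":    [("colour", "y", "w")],
        "comparisons": [(("Total_obj", "B"), "<", 3)],
    },
    "effects": {
        "add":     [("colour", "y", "Green")],
        "delete":  [("colour", "y", "w")],
        "updates": [("increase", ("Total_obj", "B"), 1)],
    },
}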

5 Soundness of Hybrid Planning Models

The simple juxtaposition of sentential and analogical representations does not guarantee that the resulting model is sound with respect to the real domain represented. In this section, we describe a unifying framework that leads to the specification of the conditions for sound hybrid representations. These conditions extend those identified by Lifschitz in [8], still at the basis of current sentential planning languages [7]. It should be pointed out that the contents of this section are purely theoretical constructs which, pragmatically speaking, are not necessary for the actual realization of a hybrid planning system.

Following [8], the world is taken to be, at any instant of time, in a certain state s, one of a set S of possible ones. A domain consists of a finite set I of entities and finite sets of relations among (and properties of) entities. In order to describe a domain, we adopt a formal language ℒ = (P, F, C), where P, F, C are finite sets of relation, function and constant symbols, respectively. Each relation and function symbol of P and F can be either numeric or logical. The wffs of ℒ (logical and numeric atoms) are built as follows:

• for any p ∈ P, p(c1, ..., cn) is a logical atom iff ci ∈ C (with i ∈ {1, ..., n})

• for any q ∈ P, q(t1, ..., tn) is a numeric atom iff ti ∈ PNE ∪ NE

• f(c1, ..., cm) is a primitive numeric expression (PNE) iff f ∈ F and ci ∈ C

• a real number is a numeric expression (NE); for any h ∈ F, h(t1, ..., tm) is a NE iff ti ∈ PNE ∪ NE

where NE and PNE indicate, respectively, the sets of all numeric and primitive numeric expressions, obtained as specified above.

An interpretation function g maps each constant symbol c ∈ C to a distinct entity i = g(c) ∈ I, each m-placed logical function symbol f ∈ F to a function g(f) : Iᵐ → ℝ, and each n-placed logical relation symbol p ∈ P to a relation g(p) ⊆ Iⁿ.


Each m-placed numeric function symbol h ∈ F is mapped to a (fixed) function g(h) : ℝᵐ → ℝ, and each n-placed numeric relation symbol q ∈ P to a (fixed) relation on real numbers g(q) ⊆ ℝⁿ.

We define g(t) = t for all t ∈ ℝ. Let f ∈ F and ti ∈ C ∪ PNE ∪ NE; if t = f(t1, ..., tm), then g(t) is the value (in the current state s) of g(f) calculated in g(t1), ..., g(tm) (written f(t1, ..., tm)|s). In what follows, we assume that, for a given language ℒ, a fixed interpretation g is adopted.

Definition 4 (Atom-satisfaction) Given a language ℒ = (P, F, C) and a state s ∈ S, an atom p(t1, ..., tn) ∈ ℒ is satisfied in s iff g(p)(g(t1), ..., g(tn)) is true in s.

Consider an abstract data structure V (such as a tree, a list, an array, etc.) and a universe U of elements (e.g., integers, characters, booleans, ...). Let V_U be a select set of instances of V built using elements in U (e.g., trees of booleans, lists of integers, etc.).

Definition 5 (Model) Given a language ℒ = (P, F, C) and a set V_U of data structure instances with elements ∈ U, a model is a pair M = (d, ε), where d ∈ V_U and ε is a 1-1 total function ε : C → U.

A model is essentially a data structure containing elements taken from a set U. The function ε maps the relevant objects (symbols) of the domain to the corresponding elements of the universe that represent them (which may or may not appear in the model). The use of an unspecified data structure V allows this definition to be used for both sentential and analogical (setGraph) models.

Definition 6 (Domain representation structure) A domain representation structure (DRS) for a language ℒ (with interpretation g) is a triple (V_U, Ψ, Φ), where V_U is a set of instances of a data structure V with elements in U, and each ψ_i ∈ Ψ (φ_j ∈ Φ) is an algorithm associated to the n(m)-placed relation (function) symbol i ∈ P (j ∈ F), such that ψ_i, φ_j always terminate, and:

• for each logical symbol p ∈ P, ψ_p : V_U × Uⁿ → {0, 1}

• for each logical symbol q ∈ F, φ_q : V_U × Uᵐ → ℝ⊥

• for each numeric relation (function) symbol r ∈ P (h ∈ F), ψ_r calculates g(r) ⊆ ℝⁿ and φ_h calculates g(h) : ℝᵐ → ℝ

Basically, a DRS consists of a data structure and a set of procedures for checking it. Each procedure takes as input a model (a data structure instance) and a set of object symbols, and (always) returns a value. For example, given n objects c1, ..., cn, in order to establish whether p(c1, ..., cn) holds in the current model M, it is sufficient to run the procedure ψ_p on M, using symbols ε(c1), ..., ε(cn) ∈ U (representing objects c1, ..., cn in M) as input.

Example 2 - Consider the Briefcase domain of Example 1.


The briefcase, the two locations (home and office) and the three portable objects are the entities of interest. The property "to be inside" is the relation of interest, and the number of objects inside the briefcase is the only relevant numeric property. Let the language ℒ1 contain the following symbols, having their standard interpretation: a 2-placed logical relation In, a 1-placed logical function Total_obj, and a 2-placed numeric relation '<'. The constant symbols are C1 = {A, C, D, Brief, Home, Off}. Let us define, for this domain and language, an analogical domain representation structure DRS_a.

The data structure V_a adopted is the setGraph (an example of state was given in Figure 2(a)). The universe U consists of set C1. Procedure ψ_In(d, x, y) takes as input a ground setGraph d and two labels x, y ∈ U, and returns 1 if the setGraph of Figure 4(a) (having parameters x, y substituted with the corresponding input) is satisfied in d, 0 otherwise. (Notice that ψ_In(d, x, y) will need to check whether input label x is an instance of NODE or PLACE, and represent x as a place only in the latter case - this is indicated in the Figure using a dashed oval instead of a normal one.) Procedure φ_Total_obj(d, x) takes as input a ground setGraph d and a string x ∈ Mobile, and, if there exists σ such that the setGraph of Figure 4(b) is satisfied in d with mapping σ, it returns the value of σ(w), ⊥ otherwise. Procedure ψ_<(x, y) works as expected.

Figure 4: Briefcase domain: parameterised setGraphs encoding procedures ψ_In(x, y) and φ_Total_obj(x) ((a) and (b), respectively). (Figure labels: (a) x ∈ PLACE ∪ NODE, y ∈ PLACE; (b) x ∈ Mobile, w ∈ ℝ⊥.)

Definition 7 (Model-representation) Given a language ℒ, a DRS ℛ = (V_U, Ψ, Φ) for ℒ and a model M = (d, ε) with d ∈ V_U, M represents a state s ∈ S (written M ≡_ℛ s) iff, for every logical atom p(t1, ..., tn) and PNE f(t1, ..., tm) of ℒ, both of the following hold:

• ψ_p(d, ε(t1), ..., ε(tn)) = 1 iff p(t1, ..., tn) is satisfied in s

• φ_f(d, ε(t1), ..., ε(tm)) = f(t1, ..., tm)|s if f(t1, ..., tm) is defined in s, ⊥ otherwise

Definition 8 (Planning Domain) A planning domain is a pair (S, A), where S is the set of possible world states, and A, the set of actions, is a finite set of total functions a : S → S.

Given a domain (S, A) (with language ℒ and DRS ℛ) and a set Σ of models in ℛ, Σ represents S iff each model M ∈ Σ represents exactly one state s ∈ S and, ∀s ∈ S, there exists one and only one model M ∈ Σ such that M ≡_ℛ s.


Definition 9 (Sound action representation) Given a domain (S, A) (with language ℒ and DRS ℛ) and a set Σ of models in ℛ representing S, a function λ : Σ → Σ is a sound representation of a ∈ A iff, for each model M ∈ Σ and state s ∈ S such that M ≡_ℛ s, λ(M) ≡_ℛ a(s).

Given a domain D = (S, A), a pair R = (Σ, Λ) is a sound representation of D iff Σ is a set of models representing S, and Λ = {λ1, ..., λk} is a set of sound representations of the actions {a1, ..., ak} = A.

Theorem 1 (Soundness) Let R = (Σ, Λ) be a sound representation of a domain D = (S, A). Let λ̄ = (λ1, ..., λn) be a sequence (plan) of sound action representations, and ā = (a1, ..., an) be the corresponding sequence of actions. If M0 ∈ Σ represents s0 ∈ S, and the application of λ̄ to M0 produces Mn = λ̄(M0), then Mn represents ā(s0).

Theorem 1 and the definition of sound action representation extend to hybrid representations the Soundness Theorem and Definition A given in [8]. (The proof, by induction, follows directly from Definition 9, and is analogous to the proof given in [8], except that the concept of satisfaction is replaced here with that of representation, applicable to both the sentential and analogical case).

Finally, the following theorem demonstrates that setGraphs and PDDL2.1-lev2 have equivalent expressive power:

Theorem 2 (Equivalence) Any setGraph encoding of a planning domain can be transformed into an equivalent sentential (PDDL2.1-lev2) description, and vice versa.

Proof - The proof of the first part is straightforward and not reported here for reasons of space. Consider the second part of the theorem. We first show how to transform every sentential state-pair (logical, numeric) into a corresponding setGraph, and then show how any sentential operator can be encoded as an equivalent setGraph operator in this representation.

Every state s = (L, R) consists of a finite set L of ground atoms p(x1, ..., xn) and a finite vector R of numeric values yj, each one representing the value in s of the j-th primitive numeric expression f(x1, ..., xm) (where xi ∈ C, and C is the set of constant symbols representing the entities of the domain). Let G be a setGraph containing the following: (1) three places, labelled Pred, Obj and Funct; (2) a node "c" in Obj for each symbol c ∈ C; (3) a node "p" in Pred and a set of labelled edges {e1(p, x1), ..., en(p, xn)} for each atom p(x1, ..., xn) of L; and (4) a node "f" for each functor symbol f and a set of nodes {x1, ..., xm, yj} in Funct linked by a set of edges {(f, x1), (x1, x2), ..., (xm, yj)} for each value yj of R. Then, the truth of p(x1, ..., xn) can be determined by checking if the setGraph ({Pred{p, x1, ..., xn}}, {e1(p, x1), ..., en(p, xn)}) is satisfied in G. Moreover, the value of the j-th PNE is identified by the value to which the numeric variable w ∈ ℝ⊥ has to be bound for the parameterised setGraph ({f, x1, ..., xm, w}, {(f, x1), (x1, x2), ..., (xm, w)}) to be satisfied in G.
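The state translation used in this part of the proof can be phrased as a small procedure. The sketch below follows steps (1)-(4) on a toy state; the nested-dictionary encoding of a setGraph is an assumption made only for illustration.

# Hedged sketch of the proof's state translation: build a setGraph G
# (as nested dicts) from a sentential state (L, R).
def sentential_to_setgraph(L, R):
    G = {"places": {"Pred": set(), "Obj": set(), "Funct": set()}, "edges": set()}
    for atom in L:
        p, args = atom[0], atom[1:]
        G["places"]["Pred"].add(p)                 # (3) node "p" in Pred
        for i, x in enumerate(args, start=1):
            G["places"]["Obj"].add(x)              # (2) constants appearing in L as nodes in Obj
            G["edges"].add((f"e{i}", p, x))        # (3) labelled edges e_i(p, x_i)
    for (f, *args), y in R.items():
        # (4) for each PNE value y_j, a chain of edges (f, x1), (x1, x2), ..., (xm, y_j)
        chain = [f] + list(args) + [y]
        G["places"]["Funct"].update(chain)
        G["edges"].update(zip(chain, chain[1:]))
    return G

G = sentential_to_setgraph({("at", "Brief", "Home")}, {("Total_obj", "Brief"): 1.0})
print(G["edges"])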


Given the above encoding, every sentential operator can be transformed into an equivalent setGraph operator as follows: each addition (removal) of an atom p(x1, ..., xn) to (from) the logical state L corresponds to the addition (removal) of the corresponding node "p" and associated edges to (from) place Pred. Similarly, each update of a PNE f(x1, ..., xm) in R is encoded through the update of the numeric node w at the end of the chain (f, x1), (x1, x2), ..., (xm, w). Q.E.D.

Notice that the expressive equivalence of two formalisms does not imply equivalent efficiency: two formalisms that are equally powerful still produce different descriptions of the same problem, possibly requiring a different number of solution steps, even when solved using the same algorithm. Indeed, this was the setup for the experiments reported in [5]: in each of five different domains, and for each test problem, the same problem was solved much faster (between two and one hundred and sixty times) if recast in purely analogical terms.

6 Related Work and Discussion

Perhaps the main contribution of this paper is the theoretical framework (Definitions 5-10 and Theorem 1) for planning with analogical, sentential and hybrid representations. Notice that the framework described is not specific to the analogical or sentential models that have been considered here: in addition to being able to support and integrate setGraphs and PDDL2.1-lev2 representations, the theory provides a basis for the integration of any sentential and analogical representations that fit its premises.

A further contribution consists of an extended, powerful analogical planning representation (Definitions 1-3) based on setGraphs [6], expressively equivalent to the sentential model adopted (Theorem 2). It should be noticed that although the setGraph formalism can ultimately encode any PDDL2.1 "level-2" planning domain description, it can do so only if conditional effects, quantification and disjunctive preconditions are previously compiled away [7]. In this sense, the expressiveness of the setGraph formalism is still limited, although adding this kind of "syntactic sugar" to the model should not be too difficult.

The possibility of separating the analogical part of a domain description from the sentential one suggests that hybrid representations may be effective in the automatic extraction of heuristics [10]. In particular, a useful heuristic for a planning problem could be obtained by simply ignoring the sentential (or analogical) parts of the hybrid description, and solving the "relaxed" problem thus obtained. The learning of domain-specific control knowledge also appears to be facilitated by the adoption of analogical and hybrid descriptions. To see this, observe that the ability to learn from the solution of different problems in the same domain depends heavily on the capacity to recognise common "patterns" in different solutions. Identifying such patterns in sequences of propositional states is generally much harder than in analogical representations.

SetGraphs borrow ideas from various related works in the area of


analogical representations [2, 1], set and graph theory [9] and semantic networks (in particular, conceptual graphs [11]). However, the use of diagrammatic representations in planning appears to have no precedents. Myers and Konolige [3] describe a hybrid framework for problem solving that allows a sentential system to carry out deductive reasoning with and about diagrams. Their system allows the addition (and extraction) of information to and from diagram models, but does not permit existing analogical information to be "retracted" from the models. This possibility is crucial for enabling non-monotonic changes of a diagram, typically associated with the execution of an action. Similar considerations apply to other works on heterogeneous representations (e.g., [4]). The work of Glasgow and Malton [12] on purely analogical spatial reasoning is closely related to many of the ideas developed here; the present work generalises their approach to hybrid models, and extends it with a representation for describing and reasoning about the effects of an action. The work of Long & Fox [13] on generic types and automatic problem decomposition nicely dovetails with the present approach. In particular, hybrid descriptions can be used to encode different generic types using different representations. For example, while the movement aspects of a domain can be conveniently encoded using setGraphs, stationary changes (e.g., a change in colour or size) may be better described using a sentential formalism. Once decomposed, these sub-problems can be solved independently, using more efficient, special-purpose algorithms.

A specific syntax for setGraph-based planning languages has not been discussed here. The BNF specification of a language for purely analogical planning was proposed in [5], though it was restricted to the use of two-dimensional arrays of characters. While the extended setGraph formalism requires a more complex definition, the simplicity of its primitive components - sets and graphs - should make a syntax specification relatively straightforward.

An important aspect of action modelling that has not been dealt with here concerns the specification of concurrent (analogical) actions. A possible approach to identifying non-interference conditions for the concurrent execution of setGraph operators may be to require that the elements of the setGraph acted upon by the operators be disjoint. However, the introduction of time and durative actions, and of other features such as conditional and continuous effects, makes this a rather complex issue, requiring further investigation.

Acknowledgements

This work was supported by the UK EPSRC (ref. GR/R53432/01). Special thanks go to Michael Jackson, Neil Smith, David Smith, Jorg Hoffmann, Bernhard Nebel, Maria Fox and Derek Long for their useful comments and feedback.

References

[1] Glasgow, J., Narayanan, N., Chandrasekaran, B., eds.: Diagrammatic Reasoning. AAAI Press/The MIT Press, Cambridge, MA (1995)


[2] Kulpa, Z.: Diagrammatic representation and reasoning. Machine GRAPHICS & VISION 3 (1994) 77-103

[3] Myers, K., Konolige, K.: Reasoning with analogical representations. In Nebel, B., Rich, C., Swartout, W., eds.: Principles of Knowledge Representation and Reasoning: Proceedings of the Third International Conference (KR92), Morgan Kaufmann Publishers Inc., San Mateo, CA (1992) 189-200

[4] Barwise, J., Etchemendy, J.: Heterogeneous logic. In: [1]. (1995) 211-234

[5] Garagnani, M., Ding, Y.: Model-based planning for object-rearrangement problems. In: Proceedings of 13th International Conference on Automated Planning and Scheduling (ICAPS'03) - Workshop on PDDL, Trento, Italy (2003) 49-58

[6] Garagnani, M.: Model-Based Planning in Physical domains using SetGraphs. In Bramer, M., Preece, A., Coenen, F., eds.: Research and Development in Intelligent Systems XX (Proc. of AI-2003), Springer-Verlag (2003) 295-308

[7] Fox, M., Long, D.: PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20 (2003) 61-124

[8] Lifschitz, V.: On the semantics of STRIPS. In Georgeff, M., Lansky, A., eds.: Proceedings of the 1986 Workshop: Reasoning about Actions and Plans (1986)

[9] Skiena, S.: Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Addison-Wesley, Reading, MA (1990)

[10] Hoffmann, J., Nebel, B.: The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research 14 (2001) 253-302

[11] Sowa, J.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA (1984)

[12] Glasgow, J., Malton, A.: A semantics for model-based spatial reasoning. Technical Report 94-360, Department of Computing and Information Science, Queen's University, Kingston, Ontario (1994)

[13] Long, D., Fox, M.: Automatic synthesis and use of generic types in planning. In Chien, S., Kambhampati, S., Knoblock, C., eds.: Proceedings of the 5th International Conference on AI Planning and Scheduling (AIPS'00), Breckenridge, CO, AAAI Press (2000) 196-205


SESSION 4:

KNOWLEDGE DISCOVERY IN DATA


Towards Symbolic Data Mining in Numerical Time Series

Agustin Santamaria¹, Africa Lopez-Illescas², Aurora Perez-Perez¹, Juan P. Caraça-Valente¹

¹ Technical University of Madrid, Campus de Montegancedo, 28660 Madrid, Spain

² High Council for Sports, C/ Obispo Trejo s/n, Madrid, Spain

Abstract

The analysis of time series databases is very important in areas like medicine, engineering or finance. Most of the approaches that address this problem are based on numerical algorithms calculating distances, clusters, index trees, etc. We have developed a numerical pattern discovery algorithm to find similar patterns to characterize time series, with good results in the isokinetics domain.

However, it is sometimes necessary to conduct a domain-dependent analysis, searching for symbolic rather than numerical characteristics of the time series. For this purpose, we have designed a symbol extraction method that translates a numerical sequence into a symbolic one with a semantic value in a particular domain. This method provides a semi-analyzed symbolic series, in which the main characteristics of the numerical series have been discovered. This symbolic series helps users to efficiently analyze time series much as an expert in the domain would.

1. Introduction

There are many databases that store temporal information as sequences of data in time, also called temporal sequences. They can be found in different domains like the stock market, business, medicine, etc. An important domain for the application of data mining (DM) in the medical field is physiotherapy and, more specifically, muscle function assessment based on isokinetics data. This data is retrieved by an isokinetics machine (Fig. 1), on which patients perform strength exercises at constant speed. We decided to focus on knee exercises (extensions and flexions) since most of the data and knowledge gathered by the sport physicians is related to this joint. The machine records the strength exerted. The data takes the form of a strength curve with additional information on the angle of the knee (Fig. 1b).


The top half of the curve represents extensions (knee angle from 90° to 0°) and the bottom half flexions (knee angle from 0° to 90°) (Fig. 1b).

This work is part of the I4 Project (Intelligent Interpretation of Isokinetics Information) [1], which provides sport physicians with a set of tools to analyze patient strength data output by an isokinetics machine. The isokinetics data is processed by the I4 system, which cleans and pre-processes the data and provides a set of DM tools to analyze isokinetics exercises in order to discover new and useful information for monitoring injuries, detecting potential injuries early, discovering fraudulent sickness leaves, etc.

The I4 system does not incorporate any expert knowledge (apart from thresholds). However, after observing the expert at work, we found that he applies his knowledge and expertise to focus on certain sections of the series and ignore others. The process followed by the expert suggested that the analysis tools needed to include domain-dependent information to assure that the system would follow a similar process. In this paper we report work based on a technique that transforms the numerical time series into a symbolic one.

This paper focuses on the discovery of similar patterns in time series. We briefly describe the numerical algorithm used to discover similar - rather than identical - patterns and the difficulties in its application in the real world. Then, we describe another approach to the problem, based on extracting semantic information from the numerical series. The paper is arranged as follows: section 2 describes the process of data preparation; section 3 describes the method for searching for similar patterns in numerical time series; section 4 introduces the importance of domain-dependent analysis and symbolic time series; section 5 introduces the Symbols Extraction Method (SEM); and section 6 shows the results of SEM. The paper ends with some conclusions and future lines of research.

Figure 1 Isokinetics machine (a) and collected data (b).

2. Data Preparation

A good preparation of the initial data is crucial for achieving useful results in any DM or discovery task.


But no standard, universally valid procedure can be designed for this stage, so solutions vary substantially from one problem to another.

Data preparation in I4 is as follows. The available isokinetics test data sets have been used to assess the physical capacity and injuries of top athletes since the early 90s. An extensive collection of tests has been gathered since then, albeit immethodically. Hence, we had a set of heterodox, unclassified data files in different formats, which were, in part, incomplete. On the positive side, the quality of the data was unquestionable, as the protocols had been respected in the huge majority of cases, the isokinetics system used was of proven quality and the operating personnel had been properly trained.

Figure 2 Data pre-processing tasks (decoding, cleaning and smoothing of the raw test records).

A series of tasks, summarized in Fig. 2, had to be carried out before the available data set could be used. The first one involved decoding, as the input came from a commercial application (the isokinetics system) that has its own internal data format. Then, the curves had to be evaluated to identify any that were invalid and to remove any irregularities introduced by mechanical factors of the isokinetics system. Two data cleaning tasks were performed using expert knowledge:

• Removal of incorrect tests. The goal of this task is to determine that the isokinetic test protocol has been correctly applied. All the exercises defined in the protocol must have been completed successfully in the correct order.


Additionally, the strength values must demonstrate that patients exerted themselves during the exercises and, therefore, tired to some degree.

• Elimination of incorrect extensions and flexions. Even if the isokinetic protocol has been correctly implemented, some of the extensions and/or flexions within an exercise may be of no use, owing mainly to lack of concentration by the patient during the exercise. I4 detects extensions and flexions that are invalid because the patient employed much less effort in them than in others.

Having validated all the exercises as a whole and each exercise individually, they have to be filtered to remove noise introduced by the machine itself. Again expert knowledge had to be used to automatically identify and eliminate flexion peaks, that is, maximum peaks produced by machine inertia.

This process results in a database in which tests are homogeneous, consistent and noise free.

3. Numerical Algorithm for Discovering Similar Patterns

As far as isokinetics exercises are concerned, the presence of patterns in the time series could be representative of some kind of injury, and the correct identification of the deviation could be an aid for detecting the injury in time. The process of developing a DM method for identifying patterns that potentially characterize some sort of injury was divided into two phases:

• Develop an algorithm that detects similar patterns in exercises.

• Use this algorithm to detect patterns that appear in exercises done by patients with injuries and do not appear in exercises completed by healthy individuals.

The algorithm developed is capable of detecting similar sequential patterns in a set of time series. It reuses some state-of-the-art ideas [2, 3, 4], like the R-tree for indexing patterns, the Apriori property to prune the search tree, etc. Owing to the special features of our problem, however, major changes had to be made to state-of-the-art algorithms in order to consider pattern similarity using the Euclidean distance, as the above-mentioned papers either search for identical patterns in the series or consider only patterns of a given length. In identical pattern-searching algorithms, each pattern matches a branch of the tree. In the similar pattern-searching algorithm in question, however, a pattern can match several branches. For example, if the patterns (12, 14, 16, 18) and (12, 14, 16, 19) are considered similar (depending on the algorithm parameters), this must be taken into account when calculating the frequency of the two patterns. A pattern search tree was built to speed up the pattern-searching algorithm. In algorithms that search for identical patterns [4], it suffices to store the pattern and a counter of appearances in each node. As there are similar patterns in our case, the list of series in which the pattern appears (SA) and the list of series in which a similar pattern appears (SSA) have to be stored (Fig. 3). This is because pattern similarity is not transitive (i.e., we can have p1 similar to p2, p2 similar to p3 and p1 not similar to p3).
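A minimal sketch of such a tree node, with the SA and SSA lists and a confidence computed from both, might look as follows; the Euclidean distance over equal-length patterns and all names are illustrative assumptions, not the authors' implementation.

# Hedged sketch of a node of the similar-pattern search tree: besides the
# pattern itself it keeps SA (series where the pattern appears) and SSA
# (series where a similar pattern appears).
import math
from dataclasses import dataclass, field

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

@dataclass
class PatternNode:
    pattern: tuple
    SA: set = field(default_factory=set)    # series containing the pattern
    SSA: set = field(default_factory=set)   # series containing a similar pattern
    children: list = field(default_factory=list)

    def confidence(self, total_series: int) -> float:
        # a series supports the pattern if it contains the pattern itself
        # or a similar one
        return len(self.SA | self.SSA) / total_series

n = PatternNode((12, 14, 16, 18), SA={"s1"}, SSA={"s2"})
print(euclidean((12, 14, 16, 18), (12, 14, 16, 19)))   # -> 1.0
print(n.confidence(total_series=4))                    # -> 0.5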


Figure 3 Format of the similar pattern search tree (each node stores a pattern, e.g. 12, 14, 16, 18, together with SA - the series in which the pattern appears - and SSA - the series in which a similar pattern appears).

The problem defined in phase 1) of the method of injury identification is set out as follows:

• Given:
- A collection S of time series composed of sequences of values (usually real numbers or integers), of variable length, where the length of the longest is max-length.

- The value (supplied by the user) of minimum confidence min-conf (number of series in which pattern appears divided by the total number of series).

- The maximum distance between patterns to be considered similar (max-dist).

• Find:
- All patterns of length 0 < i < max-length, that is, identical or similar sequences that appear in S with a confidence greater than or equal to min-conf.

First, patterns that appear in the time series are built in the same manner as an identical pattern-searching algorithm would build them. However, it is not enough just to store the number of times the pattern appears in the series to calculate its confidence. It is important to find out in which series the pattern appears in order to be able to analyze its similarity to other patterns. Then the algorithm has to run through the patterns to take into account the appearances attributed to similar patterns. For each pattern p, all the patterns of the same length in any series that are at a lesser distance than the threshold max-dist from pattern p are considered similar patterns.

Special care has to be taken in the pruning phase not to prune patterns which, although not frequent, play a role in making another pattern frequent. If this sort of pattern were pruned, the algorithm would not be complete, that is, it would not find all the possible patterns. Only patterns that are infrequent and whose minimum distance from the other patterns is greater than the required distance (max-dist) will be pruned. Having completed the tree-pruning phase, the next level of the tree is generated using the longest patterns. A full description of this algorithm is given in [5].
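The pruning criterion can be stated compactly: a pattern is pruned only if it is infrequent and no other pattern lies within max-dist of it. A hedged sketch (all names are assumptions for illustration):

# Illustrative sketch of the pruning rule described above.
import math

def should_prune(pattern, conf, other_patterns, min_conf, max_dist, distance):
    """Prune only patterns that are infrequent AND far from every other pattern."""
    if conf >= min_conf:
        return False                         # frequent: never pruned
    if any(distance(pattern, q) <= max_dist for q in other_patterns):
        return False                         # may make another pattern frequent
    return True                              # infrequent and isolated: prune

# e.g. with the Euclidean distance used for pattern similarity:
dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
print(should_prune((12, 14, 16, 18), 0.1, [(12, 14, 16, 19)],
                   min_conf=0.3, max_dist=2.0, distance=dist))   # -> False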

Although this method demonstrated its usefulness for discovering representative patterns for different groups of individuals and pathologies, some difficulties have been found in its application. On the one hand, the very nature of the data produces a certain number of false positives and false negatives that only the expert could detect. On the other hand, the pattern representation is not very illustrative for the expert, who demands a more intuitive representation related to his own way of thinking and operating.


Hence, symbolic series were researched as an alternative way of getting closer to the expert's conceptual mechanisms. The next sections describe the method for extracting symbolic knowledge from numerical series.

4. Semantic Extraction

In this paper we report a method that transforms the numerical time series into a symbolic time series, allowing physicians to use a sequence whose main features have been identified, making their routine work much easier and more efficient. The purpose is to help users to efficiently analyze the time series much as an expert in a particular domain would.

So far, there has been a lot of research into temporal sequences introducing concepts like distance, needed to establish whether two series are similar, transformations, designed to convert one time series into another to ease analysis, and patterns, that is, independently meaningful sections of a time series that explain its behaviour or characterise a time series. Many papers have been published analysing which are the best techniques for defining distance and what transformations have to be used to find patterns and, therefore, to forecast future events. Some of these papers are related to pattern discovery in areas like signal processing [6], genetic algorithms [7], and speech recognition [8].

The main exception to this numerical analysis is the work by Agrawal et al. [9], which presents a shape definition language, called SDL, for retrieving objects based on shapes contained in the histories associated with these objects. This language is domain independent, which is one of the main differences from the work that we present in this paper.

An important point is that time series should, in most cases, be analyzed by an expert in the series domain. The expert will have the expertise to give an explanation, the result of which will be an interpretation of the different features of the time series. When analysing a sequence, most experts instinctively split the temporal sequence into parts that are clearly significant to their analysis and, maybe, ignore parts of the sequence that provide no information. So, the expert identifies a set of concepts based on the features present in each part of the time series that are relevant for explaining its behaviour. This process can later be used to assemble these concepts into a set, called a pattern, which characterises a given situation.

After observing isokinetics domain experts at work, we found that they focus on sections like "ascent, curvature, peaks..." These are the sections that contain the concepts that must be extracted from the data. Therefore, we developed SEM (Symbol Extraction Method), whose goal is to translate the numerical time series into a symbolic series incorporating expert knowledge. From the sequence analysis point of view, the translation of the time series, composed of numerical values, to a sequence made up of symbols, brings the sequence one step closer to the expert, making it more akin to his expertise in the field. Our aim is to look for a method that transforms the time series into a symbolic sequence, more familiar to the expert.


5. Symbols Extraction Method

We will first describe the Isokinetics Symbols Alphabet (ISA), which consists of the symbols used to build the symbolic sequences. Then, we will describe the method used for symbol extraction.

5.1 Isokinetics Symbols Alphabet (ISA)

It does not make much sense to start studying data depicted over time if there is no knowledge of the domain that is to be analyzed. Interviews with the isokinetics expert, who is specialised in the analysis of isokinetics temporal sequences of joints like the knee or the shoulder, were planned to elicit expert knowledge as the research advanced.

Figure 4 Typical morphology of an extension or flexion region.

After the first few interviews, the expert stated that there were two visually distinguishable regions in every exercise: knee extension and flexion. Both had a similar morphology (the shape shown in Fig. 4), which allows us to identify the following symbols:

- Ascent. The part of the sequence where the patient gradually increases the strength applied.

- Descent. The part of the sequence where the patient gradually decreases the strength applied.

- Peak. A prominent part in any part of the sequence.
- Trough. A depression in any part of the sequence.
- Curvature. The upper section of a region.
- Transition. The changeover from extension to flexion (or vice versa).

After identifying the symbols used by the expert, it was necessary to associate types with these symbols, which have to be taken into account when translating a temporal numerical sequence to a symbolic one. The types were elicited directly from the expert as he analyzed a set of supplied sequences that constituted a significant sample of the whole database. For peaks and troughs the expert distinguished between big and small. He considered ascents and descents severe or gentle. Also, he said that the curvature is usually irregular (an exercise done improperly), flat or sharp. As the expert separated an extension from a flexion, we have to store this information together with the symbols and their types.


"Flex". Thus, we know the source of each symbol at any time. The set of symbols, types and zones form an alphabet called Isokinetics Symbols Alphabet (ISA) (see Table 1).

Table 1 Isokinetics Symbols Alphabet

    Zone         Symbol       Types
    EXT / FLEX   Ascent       Severe, Gentle
                 Descent      Severe, Gentle
                 Trough       Big, Small
                 Peak         Big, Small
                 Curvature    Sharp, Flat, Irregular
                 Transition   -

5.2 Symbols Extraction Method (SEM)

The previous section defined the ISA, which will be used to get symbolic sequences from numerical temporal sequences. The Symbols Extraction Method (SEM), whose architecture is shown in Fig. 5, was designed to make this transformation. SEM is divided into two parts. The first one is domain independent (DIM) and, therefore, can be applied and reused for any domain. The second part is domain dependent (DDM) and is the part that really contains the expert knowledge about the symbols needed to analyze a particular sequence.

Figure 5 SEM architecture (the exercise and a reference model feed the domain-independent module, whose output, together with domain-dependent knowledge, feeds the domain-dependent module - symbol extraction, filtering and symbol typing - producing the symbol sequence).

The I4 application contains a database of isokinetics exercises done by all sorts of patients. A particular exercise, done at a speed of 60 radians per second, is used as input for the SEM (Fig. 5). The DIM is made up of a submodule that outputs a set of domain-independent features. Basically, these features are peaks and troughs, which, after some domain-dependent filtering, will be matched to symbols. Both the features output by the DIM (or simple features) and the domain-dependent data will be used as input for the DDM, which is divided into two submodules. The first one extracts the ISA symbols and the second one characterises each symbol with type and zone. The DDM output will be the symbolic sequence.


5.2.1 Domain-Independent Module (DIM)

The goal of this module is to scan the time series and extract simple features from it. Particularly, these features will be peaks and troughs, which can be found in any sequence irrespective of the domain. Thus, a trough is a point where a sequence recording downward values starts to record upward values and a peak is the point where the sequence recording upward values starts to record downward values.

This module sequentially scans the whole sequence, retrieving peaks and troughs. Apart from storing the point at which a peak or a trough has been detected, it also stores data related to the detected feature. Irrespective of whether the detected feature is a peak or a trough, the following data are stored (see Fig. 6; a code sketch of this scan follows the list):

- Point. Highest (lowest) strength value of the peak (trough).
- Slope. In the case of a trough, the slope is calculated from the point of the previous peak to the point of that trough. In the case of a peak, the slope is calculated from the point of the previous trough to the point of that peak.

- Initial, Final. Values of the initial and final points of the peak or trough.
- Duration. Value that contains the difference between the initial and final points.
- Amplitude. Value that measures the height of the peak or the depth of the trough.
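The following Python sketch illustrates such a scan; the simplified definitions of slope, duration and amplitude used here are assumptions for illustration, not the exact formulas of the DIM.

# Hedged sketch of the DIM scan: detect peaks and troughs in a numeric
# sequence and record the data listed above.
def extract_simple_features(seq):
    features = []
    prev_turn = 0                      # index of the previous turning point
    for i in range(1, len(seq) - 1):
        is_peak = seq[i - 1] < seq[i] >= seq[i + 1]
        is_trough = seq[i - 1] > seq[i] <= seq[i + 1]
        if not (is_peak or is_trough):
            continue
        features.append({
            "kind": "peak" if is_peak else "trough",
            "point": seq[i],                                   # strength value
            "slope": (seq[i] - seq[prev_turn]) / (i - prev_turn),
            "initial": seq[prev_turn],
            "final": seq[i + 1],
            "duration": (i + 1) - prev_turn,
            "amplitude": abs(seq[i] - seq[prev_turn]),
        })
        prev_turn = i
    return features

print(extract_simple_features([0, 2, 5, 3, 1, 4, 2]))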

Figure 6 Data associated with a simple feature.

This module has also been tested with sequences from other domains, like the stock market and electrocardiograms, providing outcomes that show its applicability.

5.2.2 Domain-Dependent Module (DDM)

The goal of the DDM is to get a set of domain-dependent symbols. Its inputs are the original temporal sequence, made up of real numbers that correspond to the strength exerted by the knee, and the output of the DIM. The DDM consists of three submodules. The first one outputs the set of symbols, the second one filters the set of symbols and the third one characterises each symbol.


a) Outputting domain-dependent symbols

Our aim is to output the symbols shown in section 5.1. At first glance, it would appear that all the peaks/troughs supplied by the DIM would result in a peak/trough output by the DDM. However, this is not possible, because, if we did it like that, all peaks and troughs, no matter how insignificant they were, would be taken as symbols. The expert only analyzes some peaks or troughs, disregarding irrelevant ones. Therefore, the peaks/troughs supplied by the DIM need to be filtered by means of a condition (i.e. amplitude/relation > threshold) that assesses whether a peak, or a trough, can be considered as a symbol. The values of the thresholds were determined through an iterative procedure.

The ascent and descent symbols were determined similarly to the peak and trough extraction. To avoid confusion between ascents/descents and peaks/troughs, ascents or descents must fulfil Expr. 1 (a code transcription of the expression is given after it).

((gradient >= slope_threshold) and (duration >= dur_threshold))
or
((gradient >= slope_threshold) and (amplitude >= ampl_threshold))

Expression 1 Condition for determining an ascent/descent.
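Expression 1 transcribes almost directly into code. In the sketch below the threshold values are illustrative placeholders (the real values were tuned iteratively, as described above), and for descents the absolute value of the gradient would presumably be used.

# Hedged transcription of Expression 1; thresholds are placeholder values.
SLOPE_THRESHOLD, DUR_THRESHOLD, AMPL_THRESHOLD = 1.5, 5, 20.0

def is_ascent_or_descent(gradient, duration, amplitude):
    return ((gradient >= SLOPE_THRESHOLD and duration >= DUR_THRESHOLD)
            or (gradient >= SLOPE_THRESHOLD and amplitude >= AMPL_THRESHOLD))

print(is_ascent_or_descent(gradient=2.0, duration=8, amplitude=10.0))   # -> True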

Regarding curvatures, the objective was to locate the section of the region where a curvature could be found, irrespective of whether the region was an extension or a flexion. It was estimated that the curvature accounted for around 20% of the upper section of each region.

The transition symbol indicates the changeover from extension to flexion and vice versa.

b) Filtering

The set of symbols output by the previous submodule is put through a filtering stage (see Fig. 5), which, among other filtering processes, checks that no repeated symbols appear.

c) Types of symbols

The goal of this submodule is to label each symbol with a type (see Table 1). This will provide more precise information about the original temporal sequence. Remember also that the expert instinctively uses a symbol typology based on his expertise. This classification is done by means of a set of thresholds that define the symbol type for each case.

6. Results

A graphical interface has been designed to make it easy to work with the SEM.


First we will describe this user interface and then we will present a series of examples that illustrate how the SEM translates a sequence into symbols and how the original time sequence can be reconstructed from them.

An exercise is selected as input to the SEM. The original temporal sequence of the exercise is displayed at the top of the interface (see Fig. 7). The central part displays the translation of the temporal sequence into symbols, illustrating all the SEM stages. The first stage outputs the domain-independent features, as shown on the left under the heading "FEATURES". This list contains all the information related to each peak/trough and is formatted as follows:

<feature>. <value_of_the_point> Slope: <slope_value> Ini: <initial_value> Fin: <final_value> Ampl: <amplitude_value> Dur: <duration_value>

The next stage of the method is to output the domain-dependent symbols, which are shown (see Fig. 7) under the heading "DOMAIN-DEPENDENT SYMBOLS". The threshold parameters that are used to output these symbols are listed under "FILTERING PARAMETERS".

Figure 7 Symbolic representation interface.

The result of the last stage of the SEM, the type characterisation of each symbol, is set out on the right side of the interface, under the heading "DOMAIN-DEPENDENT TYPED SYMBOLS". The threshold parameters used are shown as "TYPOLOGY PARAMETERS".

The curve reconstructed from the symbols is shown at the bottom (Fig. 7).

As a preliminary validation step, we presented a set of 30 cases to the expert. Each case was composed of an exercise and its symbolic transformation.


The expert analysed each transformation and gave his opinion regarding both the transformation as a whole and each particular semantic section. As a consequence of this process, some of the symbolic transformations were tuned and many of the parameter and threshold values were adjusted. Currently, a more exhaustive validation process is being carried out, based on the daily work of the expert.

As stated by the expert, SEM is an important aid for physicians in writing reports, examining the evolution of an athlete's joint, diagnosing injuries or monitoring treatment after a medical diagnosis.

7. Conclusions and future lines

In this paper we have introduced a method (SEM) that transforms a numerical sequence into a symbolic sequence with semantic content in a specific domain. This work is part of the I4 project, which provides a set of tools to analyze the isokinetics strength curves of athletes and other patients.

This automated procedure extracts from a time series the same set of symbols that the expert would have inferred naturally. The process followed by the expert suggested that the analysis tools needed to include domain-dependent information to ensure that the system would follow a similar process. This information, given the nature of this domain, must be expert knowledge.

It should be noted that the extraction of such symbols for subsequent temporal sequence analysis is an important part of the expert's job of making reports on patient strength based on such concepts/symbols. The transformation process is of great utility for isokinetics domain experts, since it saves them a task that requires a lot of calculations, but it is even more useful for the non-specialist medical user because it provides knowledge that they could hardly extract from the numerical sequence.

Although a deeper evaluation is in progress, we have carried out a field evaluation, presenting a set of cases to the isokinetics expert, where each case is composed of the original data sequence, the symbolic sequence and the rebuilt sequence (to provide the expert with a graphical view of the transformation). The comments of the expert suggested that this process would be of great utility for his work, making it easier to elaborate diagnosis reports. In addition, the detailed examination of each specific case showed that the rebuilt curves include, in most cases, the essential characteristics that allow a precise diagnosis of the patient.

This research has opened up a new path for automating other processes like pattern discovery or sequence comparison symbolically (rather than numerically). The next step in our work is to define a method for discovering similar patterns and for comparing temporal symbolic series, which will easily allow us to run the rest of the I4 tools (comparison of two patients, comparison of a patient with a model that represents a group, clustering of a set of patients according to their muscular strength, identification of patterns expected to represent some kind of problem, and so on).


References

1. Alonso, F., Lopez-Chavarrias, I., Caraça-Valente, J.P., & Montes, C. Knowledge Discovery in Time Series Using Expert Knowledge. In K. J. Cios, Medical Data Mining and Knowledge Discovery. Heidelberg: Physica-Verlag, 2001.

2. Rakesh Agrawal, Christos Faloutsos, and Arun N. Swami. Efficient Similarity Search In Sequence Databases. In D. Lomet, editor, Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (FODO), pages 69-84, Chicago, Illinois, 1993. Springer Verlag.

3. Faloutsos C, Ranganathan M, Manolopoulos Y (1994b) Fast subsequence matching in time series databases. In Proceedings of SIGMOD'94, Minneapolis, MN, pp 419-429.

4. Han J, Dong G, Yin Y (1998) Efficient mining of partial periodic patterns in time series database. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, pp 214-218.

5. F. Alonso, J.P. Caraça-Valente, L. Martinez, C. Montes. Discovering Similar Patterns for Characterising Time Series in a Medical Domain. Knowledge and Information Systems: An International Journal, 5(2): 183-200, 2003.

6. H. V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, 1988.

7. Goldberg, D. E., 1989, Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley.

8. W. A. Ainsworth. Speech Recognition by Machine. London: Peter Peregrinus Ltd, 1988.

9. R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zaït. Querying Shapes of Histories. Proceedings of the 21st VLDB Conference, Zurich, Switzerland, pp. 502-514, 1995.


Support Vector Machines of Interval-based Features for Time Series Classification*

Juan Jose Rodriguez
Lenguajes y Sistemas Informaticos

Universidad de Burgos, Spain

Carlos J. Alonso
Grupo de Sistemas Inteligentes, Departamento de Informatica


Abstract

In previous works, a time series classification system has been presented. It is based on boosting very simple classifiers, formed by only one literal. The literals used are based on temporal intervals.

The obtained classifiers were simply a linear combination of literals, so it is natural to expect some improvement in the results if those literals were combined in more complex ways. In this work we explore the possibility of using the literals selected by the boosting algorithm as new features, and then using an SVM with these metafeatures. The experimental results show the validity of the proposed method.

1 Introduction

Normally, boosting [26] is used with well-known base classifiers, such as decision trees or neural networks. Hence, its main contribution is the capacity to improve the accuracy of those methods. Nevertheless, boosting also makes it possible to develop domain-specific learning methods in a rather simple way. It is only necessary to develop a modest (weak) base classifier, adequate for that domain. Then, using boosting, it is possible to obtain a strong learner for that domain. For time series classification, adequate base classifiers for boosting are interval-based literals [23].

The domain we consider is time series. Our weak classifiers, interval-based literals, consider what happens in a given interval, e.g. what is the average value. These classifiers are very simple, but are designed for this domain in particular. Using boosting with these classifiers it is possible to obtain good results [23].

The classifiers obtained with boosting are linear combinations of the base classifiers. Then, it is natural to ask whether it could be possible to obtain better results using more complex combinations of the base classifiers.

*This work has been supported by the Spanish MCyT project DPI2001-01809


Although boosting is not as resistant to overfitting as was once thought [20], its capacity for obtaining a combination of weak base classifiers that are able to classify reasonably well is still very useful. Hence, a natural idea is to obtain the base classifiers using boosting, but to combine them using a more robust method. In this paper we explore the use of Support Vector Machines, SVM [6], for combining the obtained base classifiers. In related work [22], we considered the use of decision trees with features selected by boosting, with the objective of obtaining more comprehensible classifiers.

The rest of the paper is organized as follows. The classification method is described in Sect. 2. Section 3 presents experimental results. Some related works are mentioned in Sect. 4. Finally, we give some concluding remarks in Sect. 5.

2 The Classification Method

2.1 Interval-based Predicates

For comparison purposes we first introduce point-based predicates. They use only one point of the series:

• point<( Example, Variable, Point, Threshold ). It is true if, for the Example, the value of the Variable at the Point is less than or equal to Threshold.

The Variable is included in the predicate because multivariate time series are also considered. This notion of variable is the one used in time series, not the one used in machine learning. The combination of a Variable (e.g., x) and a Point (e.g., 7) would be considered in machine learning as a feature (e.g., x7).

Interval-based predicates consider what happens in a given interval. The ones used in the present paper are:

• average<( Example, Variable, Begin, End, Threshold ). It is true if, for the Example, the average value of the Variable in the interval given by Begin and End is less than or equal to Threshold.

• deviation<( Example, Variable, Begin, End, Threshold ). It is true if, for the Example, the deviation of the values of the Variable in the interval given by Begin and End is less than or equal to Threshold.

In [23], more interval-based predicates are described (e.g., always, sometime), and it is explained how to select them in an efficient way.
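As an illustration only (not the authors' implementation), the following Python sketch shows how such point- and interval-based predicates could be evaluated; the data layout (a mapping from variable names to lists of values) and the function names are assumptions.

    import statistics

    def point_less(example, variable, point, threshold):
        # point<: true if the value of the variable at the given point is <= threshold
        return example[variable][point] <= threshold

    def average_less(example, variable, begin, end, threshold):
        # average<: true if the average value in the interval [begin, end] is <= threshold
        values = example[variable][begin:end + 1]
        return statistics.mean(values) <= threshold

    def deviation_less(example, variable, begin, end, threshold):
        # deviation<: true if the deviation of the values in [begin, end] is <= threshold
        values = example[variable][begin:end + 1]
        return statistics.pstdev(values) <= threshold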

Variable length series. There are several methods, more or less complicated, that can be used to normalize the lengths of a set of series. These methods, which preprocess the data set, can be adequate for some domains. Nevertheless, the use of these methods is not a general solution. For instance, the series of two classes could have the same shape, but different lengths. Then, it is


Literal                                     Class 1      Class 2      Class 3
deviation<( E, x, 63, 126, 1.813266 )      -0.399084    -0.421169     1.037954
deviation<( E, x, 48, 111, 1.889214 )      -0.232020    -0.153754     0.606268
average<( E, x, 38, 101, 0.770881 )         0.096480     0.072766     0.715047
not average<( E, x, 30, 33, 3.349725 )      0.746371    -1.641331     0.664797
average<( E, x, 55, 58, 4.280890 )         -0.906306     0.352215     0.471577
deviation<( E, x, 3, 34, 1.653625 )        -0.334412     2.162192    -0.114016
not average<( E, x, 49, 52, 5.508630 )      0.743638     0.517645    -0.195655
deviation<( E, x, 25, 56, 1.107998 )        0.634717     0.540074     0.000214
not average<( E, x, 44, 47, 3.720549 )      0.484320    -0.932335    -0.155742
not deviation<( E, x, 26, 33, 3.355703 )   -0.149486     0.090624     0.657721

Table 1: Example of a classifier obtained with boosting, for a 3-class problem. For each literal, a weight is associated with every class.

important that the learning method can deal with variable length series. Of course, it can still be used with preprocessed data sets of uniform length.

If the series are of different lengths, there will be literals with intervals that lie, partially or totally, outside some of the series. In these cases, the result of the evaluation of the literal will be neither true nor false, but an abstention.

2.2 Boosting Interval-based Literals

Clearly, in order to obtain accurate classifiers it is necessary to combine several literals. The approach used in this work is to combine them by boosting.

At present, an active research topic is the use of ensembles of classifiers. They are obtained by generating and combining base classifiers, constructed using other machine learning methods. The target of these ensembles is to increase the accuracy with respect to the base classifiers.

One of the most popular methods for creating ensembles is boosting [26], a family of methods, of which ADABOOST is the most prominent member. They work by assigning a weight to each example. Initially, all the examples have the same weight. In each iteration a base (also named weak) classifier is constructed, according to the distribution of weights. Afterward, the weight of each example is readjusted, based on the correctness of the class assigned to the example by the base classifier. The final result is obtained by weighted votes of the base classifiers.
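For illustration, the generic binary boosting scheme just described can be sketched in Python as follows; this is a plain ADABOOST skeleton, not the multiclass ADABOOST.MH variant actually used, and build_weak_learner stands for any procedure (here assumed) that fits a base classifier to the current weight distribution.

    import math

    def boost(examples, labels, build_weak_learner, rounds=100):
        n = len(examples)
        weights = [1.0 / n] * n                      # initially all examples have the same weight
        ensemble = []
        for _ in range(rounds):
            weak = build_weak_learner(examples, labels, weights)   # fit to the current distribution
            predictions = [weak.predict(x) for x in examples]
            error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
            error = min(max(error, 1e-10), 1 - 1e-10)
            alpha = 0.5 * math.log((1 - error) / error)            # vote of this base classifier
            # increase the weight of misclassified examples, decrease the rest
            weights = [w * math.exp(alpha if p != y else -alpha)
                       for w, p, y in zip(weights, predictions, labels)]
            total = sum(weights)
            weights = [w / total for w in weights]
            ensemble.append((alpha, weak))
        return ensemble                              # the final classifier is a weighted vote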

ADABOOST is only for binary problems, but there are several methods for extending ADABOOST to the multiclass case. The one used in this work is ADABOOST.MH [27].

Table 1 shows an example classifier. It was obtained from a data set with three classes. This classifier is composed of 10 base classifiers. The first column shows the literal. For each class in the data set there is another column, with the weight associated with the literal for that class.

In order to classify a new example, a weight is calculated for each class, and then the example is assigned to the class with the greatest weight. Initially,


the weight of each class is 0. For each base classifier, the literal is evaluated. If it is true, then for each class the weight of the literal for that class is added to the class total. If the literal is false, then the weight of the literal for that class is subtracted from the class total.

For the basic boosting method the base classifiers return +1 or -1. Nevertheless, there are variants that use confidence-rated predictions [27]. In this case, the base learner returns a real value: the sign indicates the classification and the absolute value the confidence in this prediction. There are three possible results for the evaluation of an interval-based literal: false, true, or abstention. They are assigned, respectively, the numeric values -1, +1 and 0.
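As an illustrative Python sketch (the representation of the ensemble as a list of literal/weight pairs is an assumption), the classification procedure just described could be written as:

    def classify(example, ensemble, classes):
        # ensemble: list of (literal, weights) pairs, where literal(example) returns
        # True, False or None (abstention), and weights maps each class to a real value
        totals = {c: 0.0 for c in classes}            # initially the weight of every class is 0
        for literal, weights in ensemble:
            outcome = literal(example)
            value = 0 if outcome is None else (1 if outcome else -1)   # abstention counts as 0
            for c in classes:
                totals[c] += value * weights[c]       # add if true, subtract if false
        return max(totals, key=totals.get)            # the class with the greatest weight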

2.3 SVM of Interval-based Features

Once an ensemble of literals has been obtained, it is possible to use it as a classifier. Nevertheless, we consider another alternative. Each literal can be considered as a new (meta)feature. Then, any other classification method can be used with these new features.

Although it would be possible to use the literals directly as boolean features, the ones considered here compare the value of a function over an interval (e.g., the average) with a threshold. If the classification method that is going to be used is able to deal with numeric features, it is sensible to use the values of the functions instead of the values of the literals.

Boosting combines the base classifiers using weighted voting. Our hypothesis was that perhaps the base classifiers could be combined in a better way. In this paper we test this hypothesis using Support Vector Machines.
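A minimal sketch of this idea is shown below, using scikit-learn's SVC as a stand-in (the paper's own experiments use other SVM implementations); the representation of the selected literals as (function, begin, end) triples is an assumption.

    import numpy as np
    from sklearn.svm import SVC

    def interval_features(series, selected_literals):
        # For each literal selected by boosting, use the value of its interval function
        # (e.g. the average or the deviation) rather than the boolean outcome of the comparison.
        return [func(series[begin:end + 1]) for func, begin, end in selected_literals]

    def train_svm_on_metafeatures(train_series, labels, selected_literals):
        X = np.array([interval_features(s, selected_literals) for s in train_series])
        clf = SVC(kernel="linear")      # a gaussian kernel would use kernel="rbf"
        clf.fit(X, labels)
        return clf

    # Hypothetical usage, with literals such as (np.mean, 63, 126) or (np.std, 3, 34):
    # model = train_svm_on_metafeatures(training_series, training_labels, selected_literals)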

3 Experimental Validation

3.1 Data Sets

Table 2 summarizes the characteristics of the data sets. Five are synthetic, six are real.

The CBF data set is an artificial problem, introduced in [25]. The learning task is to distinguish between three classes: cylinder, bell or funnel. Figure 1 shows some examples of this data set. The CBF translated data set is a modification of the previous one, introduced in [9]. It emphasizes the shifting of the patterns.

In the Control Charts data set there are six different classes of control charts, synthetically generated by the process in [1]. Figure 2 shows two examples of each class. The data used were obtained from the UCI KDD Archive [10].

The Two Patterns data set was introduced in [9]. Each class is characterized by the presence of two patterns in a definite order. Figure 3 shows examples of this data set.

The data set Trace is introduced in [24]. It is proposed as a benchmark for classification systems of temporal patterns in the process industry. This data set was generated artificially. There are four variables, and each variable has


Data set          Classes   Length         Variables   Examples   Evaluation
CBF               3         128            1           798        10-CV
CBF-tr            3         128            1           5000       1000 / 4000
Control Charts    6         60             1           600        10-CV
Two Patterns      4         128            1           5000       1000 / 4000
Trace             16        [268...394]    4           1600       800 / 800
Gun               2         150            2           200        10-CV
Pendigits         10        8              2           10992      7494 / 3498
Vowels            11        10             1           990        528 / 462
Japanese Vowels   9         [7...29]       12          640        270 / 370
Auslan/N          95        [17...149]     8           1900       5-CV
Auslan/F          95        [45...136]     22          2565       5-CV

Table 2: Characteristics of the data sets.

Figure 1: Examples of the CBF data set (panels: Cylinder, Bell, Funnel). Two examples of the same class are shown in each graph.

Figure 2: Examples of the Control data set (panels: Normal, Decreasing, Increasing, Cyclic, Downward, Upward). Two examples of the same class are shown in each graph.


Figure 3: Examples of the Two Patterns data set (panels: down-down, up-down, down-up, up-up).

Figure 4: Trace data set (panels: Variable 1 to Variable 4). Each example is composed of 4 variables, and each variable has two possible behaviors. In the graphs, two examples of each behavior are shown.

two behaviors, as shown in figure 4. The combination of the behaviors of the variables produces 16 different classes. 1600 examples were generated, 100 of each class. Half of the examples are for training and the other half for testing.

The data set Gun was introduced in [19]. It comes from the video surveillance domain. There are two classes, "gun-draw" and "point". In the first case the actor takes a replicate gun and in the second one he points with his index finger. The data are obtained by tracking the centroid of the right hand in the X-axis. Figure 5 shows two examples of each class.

The data set Pendigits is introduced in [2]. It was obtained by collecting

Figure 5: Examples of the Gun data set (panels: Gun-Draw, Gun-Draw, Point, Point).


250 samples from 44 writers, using a tablet with a cordless stylus. The training data are the digits from 30 writers and the digits from the other 14 writers are the testing data.

The data set Vowels is introduced in [7]. It is a problem of speaker-independent recognition of the eleven steady-state vowels of British English.

The data set Japanese Vowels is introduced in [15]. It is a speaker recognition problem. Nine male speakers uttered two Japanese vowels /ae/ successively. A 12-degree linear prediction analysis was applied to each utterance to obtain a discrete-time series with 12 LPC cepstrum coefficients. This means that one utterance by a speaker forms a time series whose length is in the range 7-29, and each point of the series has 12 features (the 12 coefficients).

Auslan is the Australian sign language, the language of the Australian deaf community. Instances of the signs were collected using an instrumented glove [12]. There are two versions of this data set, obtained with different equipment. In the first one the manufacturer is Nintendo, and in the second one it is Flock. According to [12], in terms of the quality of the data, the Flock system was far superior to the Nintendo system.

3.2 Results

For those data sets with a specified partition, the experiments were repeated five times, because the learning method is not deterministic (some decisions are made randomly). For the rest of the data sets, 10-fold cross validation was used. The exceptions are the Auslan data sets, because for them it is the norm to use 5-fold cross validation [12]. Boosting was used to select 100 features.

The series are of variable length, and if we want to use SVM (with conventional kernels) with these data sets it is necessary to express them using a fixed number of attributes. The solution used is to have as many attributes as the longest series (multiplied by the number of variables). The attributes corresponding to points after the end of a series are given the value "missing". This is done for implementation convenience, because the SVM tool used allows an attribute to be given the value missing; it substitutes missing attributes with the average value of the attributes. The same approach is used for the interval predicates: if the interval is after the end of a series, the value of the feature will be "missing".
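As a sketch of this padding-plus-imputation step (assuming univariate series and per-attribute mean imputation; the exact behaviour of the SVM tool's missing-value handling is not spelled out here):

    import numpy as np

    def pad_and_impute(series_list):
        # Pad each series with NaN ("missing") up to the length of the longest one.
        max_len = max(len(s) for s in series_list)
        data = np.full((len(series_list), max_len), np.nan)
        for i, s in enumerate(series_list):
            data[i, :len(s)] = s
        # Replace each missing value with the mean of the observed values of that attribute.
        col_means = np.nanmean(data, axis=0)
        rows, cols = np.where(np.isnan(data))
        data[rows, cols] = col_means[cols]
        return data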

For linear kernels, the implementation of SVM used was the one available in WEKA [29]. It is based on the sequential minimal optimization (SMO) algorithm [18, 13]. The parameters were not adjusted in any way; the default values were used. The features used were normalized. For multiclass problems, this tool builds a classifier for each pair of classes.

For gaussian kernels, LIBSVM was used [11, 5], because it includes a tool for selecting the parameters of this kind of kernel. It considers different values for two parameters (γ and C) and uses 5-fold cross validation to evaluate each pair of values.
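A rough equivalent of this parameter search, written with scikit-learn rather than the LIBSVM tool, is sketched below; the grid values are illustrative, not those of the tool.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Try several (gamma, C) pairs and keep the one with the best 5-fold
    # cross-validation accuracy on the training data.
    param_grid = {"C": [2.0 ** k for k in range(-5, 16, 2)],
                  "gamma": [2.0 ** k for k in range(-15, 4, 2)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    # search.fit(X_train, y_train)        # X_train, y_train: metafeature matrix and labels
    # best_model = search.best_estimator_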

The results are shown in Table 3. It shows the results obtained with boosting and with SVM, using point- and interval-based features and linear and gaussian


Data set""] lOBF CBF-tr Control Two Pat. Trace Gun Pendigits Vowels J. Vowels Auslan/N Auslan/F \

1 Boosting

1 points 3.51

22.54 4.00

46.09 73.92 2.00

10.30 56.10 6.32

40.05 9.49

intervals]

TlS^ 6.64 0.83

20.59 10.90 0.50 6.39

47.10 4.86

34.11 10.17

SVM (linear) 1 original

488 31.31

1.00 17.90 64.88

6.50 5.06

53.03 3.24

42.34 7.27

points 5.37

30.39 1.17

21.76 66.95 11.00 5.06

53.38 2.81

34.11 3.35

intervals

r25l 6.39

0.17 9.34 0.15 5.50 2.40

40.17 1.51

22.84 1.42

1 SVM (gaussian) [original

LOO 4.25 0.50 8.42

63.37 2.50 1.86

38.10 2.86

43.53 7.46

points

"Too 4.72 1.67

13.07 32.40

2.50 1.77

37.45 6.43

39.53 3.28

intervals 0.75 0.86 0.17 2.56 0.12 3.00

1.59 39.18 1.41

21.58 1.281

Table 3: Experimental results. Error rates (as percentages). The best result for each data set is marked in boldface.

kernels. It also shows the results for SVM with the original data.

For nine of the eleven data sets the best result is obtained when using interval features and gaussian kernels. The exceptions are Gun and Vowels. For the first one the best result is obtained with boosting and interval features. For this data set SVM gains no advantage from using "one vs one", because there are only two classes. For Vowels, the best result is obtained with point features and gaussian kernels. In this data set the series are very short, only 10 points.

The use of boosting with point-based features is in fact a kind of feature selection method: the features selected are attributes that were already present in the data set. The reduction of the number of features is rather important; e.g., for the Auslan-F data set the number of attributes of the original data is 2992, while only 100 features are selected.

An important question when using gaussian kernels is the number of support vectors, because it is necessary to store them with the classifier. For the linear kernel they are not necessary, because the obtained classifiers are linear combinations of the inputs. Table 4 shows the average number for the different data sets. The size of the support vectors is not the same in all settings: when using the original data set it is its dimensionality, and when using selected features it is the number of selected features (100 or less). For some of the data sets (Two Patterns, Trace, Auslan) nearly all the training examples are considered as support vectors. If the size of the classifier is important, it could be more appropriate to use linear kernels.

SVM with linear kernels and interval features has better results than boosting with interval features for nine of the eleven data sets. In some cases the differences are important (e.g., the last 5 data sets in the table). On the other hand, SVM with linear kernels and interval features has better results than SVM with linear kernels on the original data sets for all eleven data sets.


                 original   points    intervals
CBF                326.0      287.1      191.7
CBF-tr             732.0      724.6      543.0
Control            351.4      336.6      222.7
Two Patterns       924.0      906.0      799.6
Trace              800.0      800.0      776.2
Gun                116.6      109.1      105.7
Pendigits         1135.0     1656.0      932.6
Vowels             522.0      522.0      516.0
Jap. Vowels        234.0      226.2      235.6
Auslan/F          2043.2     2008.8     2019.6
Auslan/N          1520.0     1519.8     1519.4

Table 4: Average number of Support Vectors.

3.2.1 Previous results.

The results that we know for these data sets from other authors are:

• CBF: 0.0% [12] using nearest neighbour and using Hidden Markov Models. Their data set is not the same as ours, although both were generated using the same method.

• CBF-tr: 3.22% [9] using decision trees of extracted patterns. Our error with gaussian SVM and interval features is 0.86%.

• Control: 1.50% [9] with boosting of decision trees. Our error with SVM of interval features is 0.17%.

• Two Patterns: 4.95% [9] using decision trees of extracted patterns. Our error with gaussian SVM and interval features is 2.56%.

• Trace: 1.4% [24], using recurrent neural networks and wavelets, but 4.5% of the examples are not assigned to any class. We obtain 0.15% with linear SVM and interval features.

• Gun: 0.5% [19] using a variant of dynamic time warping. We obtain the same result using boosting with interval features.

• Pendigits: 2.2% [2] using 3-nearest neighbours. 2.97% [3] using multilayer perceptrons with the series and an image of 8 x 8 pixels as input. Using SVM with gaussian kernels our results are smaller than 2%.

• Vowels: 44% [21] using the nearest neighbour classifier; the results using different neural networks are not better. Some authors use this data set with cross validation, but in that case the problem is a lot easier. In the specified partition the speakers for the training and test data sets are different.


Data set    Boosting            SVM (linear)                  SVM (gaussian)
            points   intervals  original  points   intervals  original  points   intervals
Trace        69.50     13.72      44.00    59.93      0.10      44.75    56.63      0.13
J. Vowels     4.92      6.00       2.97     2.65      2.00       2.16     2.49      2.32
Auslan/N     33.89     31.74      32.79    30.53     21.47      31.37    27.42     20.05
Auslan/F     10.24     10.81       7.67     3.14      1.40       7.56     3.05      1.35

Table 5: Experimental results for the extended data sets. Error rates (as percentages).

• Japanese Vowels: 5.9% [15] using their method and 3.8% using a 5-state continuous Hidden Markov Model. Our results using SVM with the original data set or with interval features are better.

• Auslan/N: 28.80% [12], for Hidden Markov Models. Our result with linear SVM and interval features is 22.84%.

• Auslan/F: slightly smaller than 2% [12]. This result was obtained by voting several classifiers (9 for the best result). Each one of these classifiers was obtained using feature extraction, and then boosting decision trees of those features. Our results for linear SVM and interval features are close to 1.5%.

3.2.2 Normalized data sets.

It could be argued that the method used for dealing with variable length series, namely considering that the values after the end of the series have the value "missing", is not adequate when using SVM with the original data. Hence, we consider another alternative. For each data set, the series are extended to the length of the longest series in the data set. To extend a series, some of the values are duplicated. For instance, if we want to extend a series of length 100 to a length of 150, for each pair of values one would be duplicated. In this way all the values present in the original series are in the extended series, and the extended series has no values that are not present in the original one.
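As an illustration, the stretching procedure can be written as follows (a sketch; only the example above is given, so the exact duplication rule is an assumption):

    def extend_series(series, target_length):
        # Stretch a series to target_length by duplicating evenly spaced values,
        # so every original value is kept and no new values are introduced.
        n = len(series)
        if target_length <= n:
            return list(series)
        # position i of the extended series is mapped back to an original index
        return [series[i * n // target_length] for i in range(target_length)]

    # Example: extend_series(list(range(100)), 150) keeps all 100 values and
    # duplicates roughly one value out of every two.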

The results for the extended data sets are shown in table 5. The best results for each data set are again obtained using interval features with SVM.

4 Related Work

There are some works that also propose selecting features with boosting and then using these features with other classifiers. For instance, [16] considers a problem of facial expression classification and [14] a speech recognition problem.

There are several proposals for selecting features from time series. For instance, [12] proposes to extract features from each series. Then these features are grouped, and each group is considered as an attribute. [17] approximates each series with several functions, each function for an interval of the series. The


parameters of the functions and the positions of the intervals are considered as attributes. [8] proposes to use genetic programming for feature selection. The selected features are trees. The nodes in the trees are signal processing and statistical methods. The selected features are used with SVM.

With respect to SVM, one of the most closely related works is [4]. They use dynamic time warping (DTW) as a kernel. They state that although DTW is not a metric, because the triangle inequality does not hold, and the resulting kernel is not positive definite, they obtain good recognition rates. On the other hand, the "Dynamic Time-Alignment Kernel" [28] is a kernel based on DTW.

5 Conclusions and Further Work

A novel approach has been presented for multivariate time series classification. It is based on the application of SVM to the obtained features. Obtaining these features could be considered as a preprocessing step. Nevertheless, these features are obtained using another classification method, boosting, with base classifiers that are algorithms which select the best feature. Although it is possible to use the boosted classifiers directly, the experimental results demonstrate that the accuracy can be improved substantially if SVM is applied to the obtained features. The experimental validation shows that this method is a strong alternative to state-of-the-art time series classification methods.

One sensible question is why boosting is used: it would be easier to use SVM directly with the defined metafeatures. This approach is not considered because there are a lot of possible metafeatures. If the length of the series is n, the number of possible intervals is O(n^2). With the exception of small data sets, it is not an option to multiply the size of the data set in this way. Hence, some process that selects interesting metafeatures is necessary. The process considered here is boosting.

It is not among the objectives and conclusions of this work to make any statement about any kind of superiority of SVM over boosting. First, the complexities of the classifiers obtained with boosting and SVM are not comparable. Boosting and linear SVM both produce linear combinations, but in the case of boosting there is a linear combination of features per class, whereas in SVM there is a linear combination for each pair of classes. This is because the method used for dealing with multiclass problems in the SVM case was "one vs one".

Probably, the results obtained with boosting could be improved. Using only 100 rather simple base classifiers for a 95-class problem is not comparable with the settings commonly reported for boosting, where it is not unusual to use ensembles of 100 decision trees. The fact is that obtaining and using the base classifiers has a cost, and improving the accuracy by adding more classifiers to the ensemble is not always an option. The "one vs one" method could also be used for boosting, but in this case an ensemble of literals would be obtained for each pair of classes.

On the other hand, it would be possible to apply the same scheme (that is, obtaining metafeatures with boosting, then using another classification method for


combining them) with any other classification method instead of SVM. In particular, it would be possible to use boosting again with, for instance, decision trees of interval literals. In this way we believe it could be possible to obtain results similar to the ones obtained with SVM.

Acknowledgements. To the maintainers of the UCI KDD Repository [10]. To the donors of the different data sets [1, 2, 7, 9, 12, 15, 19, 24]. To the developers of the WEKA library [29] and of LIBSVM [11, 5].

References

[1] Robert J. Alcock and Yannis Manolopoulos. Time-series similarity queries employing a feature-based approach. In Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, 1999.

[2] Fevzi Alimoglu. Combining multiple classifiers for pen-based handwritten digit recognition. Master's thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University, 1996. http://www.cmpe.boun.edu.tr/~alimoglu/alimoglu.ps.gz.

[3] Fevzi Alimoglu and Ethem Alpaydin. Combining multiple representations for pen-based handwritten digit recognition. ELEKTRIK: Turkish Journal of Electrical Engineering and Computer Sciences, 9(1):1-12, 2001.

[4] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. On-line handwriting recognition with support vector machines: A kernel approach. In 8th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 49-54, 2002.

[5] Chih-Chung Chang and Chih-Jen Lin. Training nu-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119-2147, 2001. http://www.csie.ntu.edu.tw/~cjlin/papers/newsvm.ps.gz.

[6] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.

[7] David H. Deterding. Speaker Normalisation for Automatic Speech Recognition. PhD thesis, Department of Engineering, University of Cambridge, 1989.

[8] Damian Eads, Daniel Hill, Sean Davis, Simon Perkins, Junshui Ma, Reid Porter, and James Theiler. Genetic algorithms and support vector machines for time series classification. In 5th Conference on the Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation. Symposium on Optical Science and Technology of the 2002 SPIE Annual Meeting, 2002. http://www.cs.rit.edu/~dre9227/papers/eadsSPIE4787.pdf.

[9] Pierre Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium, 2002.

[10] S. Hettich and S. D. Bay. The UCI KDD archive [http://kdd.ics.uci.edu], 1999. Irvine, CA: University of California, Department of Information and Computer Science.


[11] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

[12] Mohammed Waleed Kadous. Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. PhD thesis, The University of New South Wales, School of Computer Science and Engineering, 2002.

[13] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore, 1999.

[14] Aldebaro Klautau. Mining speech: automatic selection of heterogeneous features using boosting. In ICASSP 2003, 2003. http://www.laps.ufpa.br/aldebaro/papers/klautau-icassp03.zip.

[15] Mineichi Kudo, Jun Toyama, and Masaru Shimbo. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters, 20(11-13):1103-1111, 1999.

[16] Gwen Littlewort, Marian S. Bartlett, Ian Fasel, Joel Chenu, Takayuki Kanda, Hiroshi Ishiguro, and Javier R. Movellan. Towards social robots: Automatic evaluation of human-robot interaction by facial expression classification. In NIPS 2003 Conference Proceedings, Advances in Neural Information Processing Systems 16, 2003.

[17] Robert T. Olszewski. Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data. PhD thesis, Computer Science Department, Carnegie Mellon University, 2001. http://reports-archive.adm.cs.cmu.edu/anon/2001/abstracts/01-108.html.

[18] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[19] Chotirat Ann Ratanamahatana and Eamonn Keogh. Making time-series classification more accurate using learned constraints. In SIAM International Conference on Data Mining (SDM'04), 2004.

[20] Gunnar Rätsch, Takeshi Onoda, and Klaus R. Müller. Regularizing AdaBoost, 1999. http://ida.first.gmd.de/~raetsch/ps/RaeOnoMueSSd.pdf.

[21] Anthony J. Robinson. Dynamic Error Propagation Networks. PhD thesis, Cambridge University Engineering Department, 1989.

[22] Juan J. Rodriguez and Carlos J. Alonso. Interval and dynamic time warping-based decision trees. In 19th Annual ACM Symposium on Applied Computing, Special Track on Data Mining, 2004.

[23] Juan J. Rodriguez, Carlos J. Alonso, and Henrik Bostrom. Boosting interval based literals. Intelligent Data Analysis, 5(3):245-262, 2001.

[24] Davide Roverso. Multivariate temporal classification by windowed wavelet decomposition and recurrent neural networks. In 3rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface, 2000.

[25] Naoki Saito. Local Feature Extraction and Its Applications Using a Library of Bases. PhD thesis, Department of Mathematics, Yale University, 1994.


[26] Robert E. Schapire. A brief introduction to boosting. In 16th International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1401-1406. Morgan Kaufmann, 1999.

[27] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In 11th Annual Conference on Computational Learning Theory (COLT 1998), pages 80-91. ACM, 1998.

[28] Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic time-alignment kernel in support vector machine. In NIPS 2001, 2001.

[29] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.


Neighbourhood Exploitation in Hypertext Categorization

Houda Benbrahim and Max Bramer
Department of Computer Science and Software Engineering, Portsmouth University
{houda.benbrahim, max.bramer}@port.ac.uk

Abstract

The exponential growth of the web has led to the necessity to put some order to its content. The automatic classification of web documents into predefined classes, that is hypertext categorization, came to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and the linked neighbourhood all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) which extra information hidden in HTML tags and linked neighbourhood pages to take into consideration to improve the classification task, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naive Bayes, K-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that the clever use of the information in the linked neighbourhood and HTML tags improved the accuracy of the classification algorithms.

1. Introduction

It has been estimated that the World Wide Web comprises more than 3 billion pages and is growing at a rate of 1.5 million pages a day [1]. Faced with such a huge volume of documents, search engines become limited: too much information to look at and too much information retrieved. The organization of web documents into categories will reduce the search space of search engines, and improve their retrieval performance. Moreover, a recent study [2] showed that users prefer to navigate through directories of pre-classified content, and that providing a categorised view of retrieved documents enables them to find more relevant information in a shorter time. The common use of the manually constructed category hierarchies for navigation support in Yahoo [3] and other major web portals has also demonstrated the potential value of automating the process of hypertext categorization.

Text categorization is a relatively mature area where many algorithms have been developed and experiments conducted. Classification accuracy reached 87% [4]


for some algorithms applied to well-known text categorization corpora (Reuters, 20_newsgroups, ...) where the vocabulary is coherent and the authorship quality is high. Those same classical classifiers perform badly [4] on samples from Yahoo! pages. This is due to the extreme diversity of web pages (such as homepages, articles, ...) and authorship, and to the little consistency in vocabulary.

Automated hypertext categorization poses new research challenges because of the extra information in a hypertext document. Hyperlinks, HTML tags, metadata and linked neighbourhood all provide rich information for classifying hypertext that is not available in traditional text categorization. Researchers have only recently begun to explore the issues of exploiting rich hypertext information for automated categorization.

There is a growing volume of research in the area of learning over web text documents. Since most of the documents considered are in HTML format, researchers have taken advantage of the structure of those pages in the learning process. The systems generated differ in performance because of the quantity and nature of the additional information considered.

Benbrahim and Bramer [5] used the BankSearch dataset to study the impact of the use of metadata (page keywords and description), page title and link anchors in a web page on classification. They concluded that the use of basic text content enhanced with weighted extra information (metadata + title + link anchors) improves the performance of three different classifiers.

Oh et al. [6] reported some observations on a collection of online Korean encyclopaedia articles. They used system-predicted categories of the linked neighbours of a test document to reinforce the classification decision on that document and they obtained a 13% improvement over the baseline performance when using local text alone.

Joachims et al. [7] also reported a study using the WebKB university corpus, focusing on Support Vector Machines with different kernel functions. Using one kernel to represent one document based on its local words, and another kernel to represent hyperlinks, they give evidence that combining the two kernels leads to better performance in two out of three classification problems.

Yang, Slattery and Ghani [8] have defined five hypertext regularities which may hold in a particular application domain, and whose presence may significantly influence the optimal design of a classifier. The experiments were carried out on 3 datasets and 3 learning algorithms. The results showed that the naive use of the linked pages can be more harmful than helpful when the neighbourhood is noisy, and that the use of metadata, when available, improves the classification accuracy.

This paper deals with web document categorization. Two issues will be considered in depth: (i) the choice of representation for documents and the extra information hidden in HTML pages and their neighbourhood that should be taken into consideration to improve the classification task, and (ii) how to filter out the noisy neighbourhood. Finally, data collected from the web will be used to evaluate the performance of different classification methods with different choices of text representation.


Document representation is described in Section 2. Some classification algorithms used for hypertext are reviewed in Section 3. Section 4 presents experiments and results, comparing different classification algorithms with different webpage representation techniques.

2. Text Representation

In order to apply machine-learning methods to document categorization, consideration first needs to be given to a representation for HTML pages. An indexing procedure that maps a text into a compact representation is applied to the dataset. The most frequently used method is a bag-of-words representation where all words from the set of documents under consideration are taken and no ordering of words or any structure of text is used. The words are selected to support classification under each category in turn, i.e. only those words that appear in documents in the specified category are used (the local dictionary approach). This means that the set of documents has a different feature representation (set of features) for each category. This approach for building the dictionary has been reported to lead to better performance [9] and [10]. This leads to an attribute-value representation. Each distinct word corresponds to a feature, with the number of times the word occurs in the document as its value.

Based on our preliminary work [5], metadata (page keywords and description), page title and link anchors in a web page, along with the basic page content, improved the accuracy of the classification task. This extra information is included in the data dictionary. Then, words in a page's neighbours (documents pointing to, and pointed to by, the target page) are included in the text representation. The blind use of the content of links may harm the classification task [11]; this is due to the fact that many pages point to pages on different subjects, e.g., web pages of extremely diverse topics link to Yahoo! or BBC news web pages. To filter out the noisy link information, just the adjacent pages most similar to the target page are kept. The similarity of pages is determined by the cosine measure.
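This filtering step can be sketched as follows (illustrative Python only; term-frequency dictionaries are assumed as the page representation, and 0.8 is the threshold reported in Section 4.4):

    import math

    def cosine(u, v):
        # Cosine similarity between two term-frequency dictionaries.
        common = set(u) & set(v)
        dot = sum(u[t] * v[t] for t in common)
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def filter_neighbours(target_vector, neighbour_vectors, threshold=0.8):
        # Keep only the in-coming/out-going pages whose cosine similarity
        # with the target page reaches the threshold.
        return [v for v in neighbour_vectors if cosine(target_vector, v) >= threshold]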

With the bag-of-words approach to text representation, it is possible to have tens of thousands of different words occurring in a fairly small set of documents. Using all these words is time consuming and represents a serious obstacle for a learning algorithm. Moreover, many of them are not really important for the learning task and their usage can degrade the system's performance. Many approaches exist to reduce the dimension of the feature space; the most common ones are (i) the use of a stop list containing common English words, or (ii) the use of stemming, that is, keeping the morphological root of words.

3. Classification Algorithms

3.1 Naive Bayes (NB)

Naive Bayes (NB) is a widely used model in machine learning and text classification. The basic idea is to use the joint probabilities of words and categories in the training set of documents to estimate the probabilities of categories for an unseen document. The term 'naive' refers to the assumption that


the conditional probability of a word is independent of the conditional probabilities of other words in the same category.

A document is modelled as a set of words from the same vocabulary, V. For each class, Cj, and word, w ∈ V, the probabilities P(Cj) and P(wk|Cj) are estimated from the training data. Then the posterior probability of each class given a document, D, is computed using Bayes' rule:

P(Cj | D) = P(Cj) ∏(i=1..|D|) P(ai | Cj) / P(D)

where ai is the i-th word in the document, and |D| is the length of the document in words. Since for any given document the prior probability P(D) is a constant, this factor can be ignored if all that is desired is a ranking rather than a probability estimate. A ranking is produced by sorting documents by their odds ratios, P(C1|D) / P(C0|D), where C1 represents the positive class and C0 represents the negative class. An example is classified as positive if the odds are greater than 1, and negative otherwise.
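For illustration, the ranking by odds ratio can be sketched as follows (Python; the smoothing of unseen words with a small constant is an assumption, since the estimation details are not given here):

    import math

    def log_posterior(doc_words, prior, word_prob):
        # log P(C) + sum over the words of the document of log P(ai | C);
        # prior and word_prob are assumed to be estimated from the training data.
        return math.log(prior) + sum(math.log(word_prob.get(w, 1e-9)) for w in doc_words)

    def classify_binary(doc_words, prior_pos, probs_pos, prior_neg, probs_neg):
        # Rank by the odds ratio P(C1|D) / P(C0|D); P(D) cancels out.
        log_odds = (log_posterior(doc_words, prior_pos, probs_pos)
                    - log_posterior(doc_words, prior_neg, probs_neg))
        return log_odds > 0            # positive if the odds are greater than 1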

3.2 K-Nearest Neighbour (KNN)

K-Nearest Neighbour (KNN) is a well-known statistical approach in pattern recognition. KNN assumes that similar documents are likely to have the same class label. Given a test document, the method finds the K nearest neighbours among the training documents, and uses the categories of the K neighbours to weight the category candidates. The similarity score of each neighbour document to the test document is used as the weight of the categories of the neighbour document. If several of the K nearest neighbours share a category, then the per-neighbour weights of that category are added together, and the resulting weighted sum is used as the likelihood score of that category with respect to the test document. By sorting the scores of candidate categories, a ranked list is obtained for the test document. By thresholding on these scores, binary category assignments are obtained. The decision rule in KNN can be written as:

y(x, cj) = Σ(di ∈ KNN) sim(x, di) · y(di, cj) - bj

where y(di, cj) ∈ {0, 1} is the classification of document di with respect to category cj (y = 1 for Yes, and y = 0 for No); sim(x, di) is the similarity between the test document x and the training document di; and bj is the category-specific threshold for the binary decisions.
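A sketch of this decision rule in Python (sim, label and the threshold b_j are assumed to be supplied from elsewhere):

    def knn_decision(test_doc, training_docs, k, category, b_j, sim, label):
        # Implements y(x, cj) = sum over the K nearest neighbours of sim(x, di) * y(di, cj) - bj.
        neighbours = sorted(training_docs, key=lambda d: sim(test_doc, d), reverse=True)[:k]
        score = sum(sim(test_doc, d) * label(d, category) for d in neighbours)
        return score - b_j > 0         # binary assignment by thresholding the score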


3.3 C4.5 (Decision Tree Classification)

C4.5 is a decision tree classifier developed by Quinlan. The training algorithm builds a decision tree by recursively splitting the data set using a test of maximum gain ratio. The tree is then pruned based on an estimate of the error on unseen cases. During classification, a test vector starts at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node until a leaf is encountered, at which time the pattern is asserted to belong to the class named by that leaf.

3.4 Support Vector Machine (SVM)

Support vector machines are based on the Structural Risk Minimization principle [12] from computational learning theory. The idea is to find a hypothesis h for which we can guarantee the lowest true error. The true error of h is the probability that h will make an error on an unseen and randomly selected test example. An upper bound can be used to connect the true error of a hypothesis h with the error of h on the training set and the complexity of H (measured by the VC-Dimension), the hypothesis space containing h [12]. Support vector machines find the hypothesis h which minimizes this bound on the true error by effectively and efficiently controlling the VC-Dimension of H. A remarkable property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space. This property makes them suitable for text categorization, where the dimension of the input space is very high.

4. Experiments

4.1 Dataset

To test the proposed algorithms for hypertext classification, datasets were needed that reflected the properties of real world hypertext classification tasks.

The major practical problem in using web document datasets is that most of the URLs become unavailable. The well-known dataset WebKB project at CMU [13] is outdated since most of its web pages are no longer available.

The BankSearch [14] dataset used for the experiments comprises a set of HTML web documents. The Open Directory Project and Yahoo! categories were used to provide web pages that have already been categorized by people. The considered dataset consists of 11,000 pages. The web pages were distributed over 11 different categories under 4 distinct themes. The dataset consists of some sets of categories that are quite distinct from each other, as well as other categories that are quite similar to each other. Table I gives a summary of the dataset.


Dataset ID   Dataset Category     Associated Theme
A            Commercial Banks     Banking and Finance
B            Building Societies   Banking and Finance
C            Insurance Agencies   Banking and Finance
D            Java                 Programming Languages
E            C/C++                Programming Languages
F            Visual Basic         Programming Languages
G            Astronomy            Science
H            Biology              Science
I            Soccer               Sport
J            Motor Sport          Sport
K            Sport                Sport

Table I. Dataset Summary

4.2 Performance measures

The evaluation of the different classifiers is measured using four different measures: recall (R), precision (P), accuracy (Acc), and the F1 measure [15]. These can all be defined using the 'confusion matrix' shown as Table II.

                             Correct class is Ci   Correct class is not Ci
Assigned class is Ci                  a                       b
Assigned class is not Ci              c                       d

Table II. Confusion Matrix

R = a / (a + c)   if (a + c) > 0, otherwise R = 1

P = a / (a + b)   if (a + b) > 0, otherwise P = 1

Acc = (a + d) / n   where n = a + b + c + d > 0


F1 = 2PR / (P + R) = 2a / (2a + b + c)   if (a + c) > 0, otherwise undefined

Recall (R) is the percentage of the documents for a given category that are classified correctly. Precision (P) is the percentage of the predicted documents for a given category that are classified correctly. Accuracy (Acc) is defined as the ratio of correct classifications into a category Ci.

Neither recall nor precision makes sense in isolation from the other. In fact, a trivial algorithm that assigns class Ci to all documents will have a perfect recall (100%), but an unacceptably low precision. Conversely, if a system decides not to assign any document to Ci it will have a perfect precision but a low recall. The F1 measure has been introduced to balance recall and precision by giving them equal weights.

Classifying a document involves determining whether or not it should be classified in any or potentially all of the available categories. Since the four measures are defined with respect to a given category only, the results of all the binary classification tasks (one per category) need to be averaged to give a single performance figure for a multiple class problem.

In this paper, the 'micro-averaging' method will be used to estimate the four measures for the whole category set. Micro-averaging reflects the per-document performance of a system. It is obtained by globally summing over all individual decisions and uses the global contingency table.
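As an illustration, the micro-averaged measures can be computed from the per-category confusion counts as follows (a Python sketch):

    def micro_averaged(confusions):
        # confusions: one (a, b, c, d) tuple per binary (per-category) task,
        # following the confusion matrix above; the counts are summed globally
        # before the measures are computed.
        A = sum(a for a, b, c, d in confusions)
        B = sum(b for a, b, c, d in confusions)
        C = sum(c for a, b, c, d in confusions)
        D = sum(d for a, b, c, d in confusions)
        recall = A / (A + C) if (A + C) > 0 else 1.0
        precision = A / (A + B) if (A + B) > 0 else 1.0
        accuracy = (A + D) / (A + B + C + D)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else None
        return recall, precision, accuracy, f1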

4.3 Design of experiments

The classification algorithms NB, KNN, SVM and C4.5 were applied to the BankSearch dataset to address the different binary classification problems. The dataset was randomly split into 70% training and 30% testing.

Two local dictionaries were then built for each category and for each text representation after stop word removal (using a stop list of 512 words provided by David Lewis [16]), with the option of stemming turned either on or off. Documents were represented by a vector space model (VSM) where the weights were the term frequencies in the documents.

Two series of experiments were conducted. The documents are represented by (i) the basic content of the HTML documents or (ii) a combination of basic HTML content, metadata, title and link anchors of the target page, along with the weighted content of the similar in-coming and out-going linked pages.

The local dictionaries and document VSMs for the second option of text representation were constructed as follows. For each target page in the dataset, the set of neighbour pages was determined. The content of all the in-coming and out-going links, along with the target page, was used to build the dictionaries. Then, the similarity of each target page with its neighbours was calculated to filter out the noisy links. The term weights of the target pages were adjusted so that the target page is influenced by its similar neighbours.
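The exact weighting formula is not given here; purely as an illustration, one simple way to let the target page be influenced by its similar neighbours is to add an extra, scaled contribution for terms shared with each similar neighbour (alpha is a hypothetical parameter):

    def reinforce_with_neighbours(target_tf, similar_neighbour_tfs, alpha=0.5):
        # Hypothetical adjustment: terms of the target page that also occur in a
        # similar neighbour receive an extra, alpha-scaled contribution.
        adjusted = dict(target_tf)
        for neighbour in similar_neighbour_tfs:
            for term, weight in neighbour.items():
                if term in adjusted:
                    adjusted[term] += alpha * weight
        return adjusted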


4.4 Results and interpretation

As a first note, within the preprocessing step it was noticed that the pages considered in this specific dataset have on average 16.4 out-going links, with this number varying between a maximum of 189 and a minimum of 1. This number also varies depending on the category considered. Concerning the in-coming links, the average number of pages was 7, with a maximum of 456 and a minimum of 0. Many target pages in the dataset were not pointed to by any document on the web. An interesting observation was made while determining the pages similar to a given target page: the average number of similar pages (including both in-coming and out-going pages) was 5, with a minimum of 0 and a maximum of 36 (these numbers vary depending on the considered category). As a result, a large number of linked pages was thrown away. This clearly illustrates that the linked neighbourhood is noisy, and the filtering step was helpful in this regard.

The different algorithms result in different performance depending on the features used to represent the documents.

The set of experiments evaluates SVM, C4.5, NB and KNN for texts represented using either (i) the basic content enhanced by the metadata, title and link anchors, with the stemming option turned on or off, or (ii) a combination of the basic content, metadata, title and link anchors of the target page with those of its similar neighbours, where extra weight was assigned to words common to the target page and its neighbours; this is also done with the stemming option turned on or off.

Figure 1: NB, KNN, C4.5 and SVM accuracy for different choices of text representation (Base, BaseStm, Neigh, NeighStm).

Figure 1 (Figure 2) reports the accuracy (F1 measure) on the test set of SVM, C4.5, NB and KNN for the different text representation options.


Figures 1 and 2 show that the use of stemming improves the performance of all the classifiers for all the options of text representation. Stemming is helpful since it decreases the size of the feature space; moreover, it combines the weights of the different words that share the same stem.

These figures also show that SVM outperforms all the other classifiers. This is not surprising since it has been reported [12] that it works well in high-dimensional feature spaces; this also explains its slight increase in performance when stemming was used. C4.5 also outperforms NB and KNN for the different text representations. The features selected by C4.5 to build the tree were meaningful in terms of class description.

Figure 2: NB, KNN, C4.5 and SVM F1 measure for different choices of text representation (Base, BaseStm, Neigh, NeighStm).

Including the extra information from the filtered neighbourhood with the basic pages improved the performance of the different classifiers. Note that if the threshold used to decide about the similarity of two pages is set too high, no similar documents would be found, and the performance of the classifiers with the linked neighbourhood information taken into consideration for text representation would be the same as that of the classifiers with the basic text content as text representation. The slight increase in each classifier's performance when the linked information is used means that all the noisy links that might harm the classification were filtered out. The arbitrary threshold used in those experiments to decide about the similarity of two pages was set to 0.8. This threshold may seem too high, since it may reject even useful links, but at the same time it is somewhat safe, since there is little chance of including noisy links.


5. Conclusions and future work

In summary, a number of experiments were conducted to evaluate the performance of some well-known learning algorithms on hypertext data. Different text representations have been used and evaluated. It can be concluded that the careful use of the extra information available in the linked neighbourhood increases the performance of the classifiers. The improvement was smaller than expected, since the filtering threshold was too high and useful links might have been filtered out.

The careful use of the extra information in the linked neighbourhood of HTML pages improved the performance of the different classifiers. In future work, this extra information will be extended by selecting useful neighbour links less severely. The class of a linked neighbour, instead of its similarity to the target page, may be used to filter out noisy links. Experiments with different datasets should also be conducted before final conclusions are drawn.

References

1. K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proc. of the 7th World Wide Web Conference (WWW7), 1998.

2. H. Chen and S. Dumais. Bringing order to the web: automatically categorizing search results. In Proceedings of CHI-00, ACM International Conference on Human Factors in Computing Systems, pages 145-152, Den Haag, NL, 2000. ACM Press, New York, US.

3. http://www.yahoo.com

4. S. Chakrabarti, B. Dom, R. Agrawal and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In VLDB, Athens, Greece, Aug. 1997.

5. H. Benbrahim and M. Bramer. Impact on performance of hypertext classification by selective rich html capture. IFIP World Computer Congress, Toulouse, France, Aug 2004 (to appear).

6. H. Oh, S. Myaeng, and M. Lee. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the Twenty Third ACM SIGIR Conference, Athens, Greece, July 2000.

7. T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorization. In International Conference on Machine Learning (ICML'01), San Francisco, CA, 2001. Morgan Kaufmann.

8. Y. Yang, S. Slattery and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems (Special Issue on Automatic Text Categorization) 18 (2-3) 2002, pp. 219-241.

9. C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 233-251.

10. A. Bensaid and N. Tazi. Text categorization with semi-supervised agglomerative hierarchical clustering. International Journal of Intelligent Systems, 1999.

11. S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307-318, Seattle, Washington, June 1998. ACM Press.

12. T. Joachims. Text categorization with Support Vector Machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142.

13. http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

14. http://www.pedal.rdg.ac.uk/banksearchdataset/

15. Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999.

16. D. Lewis. Feature selection and feature extraction for text categorization. Proceedings of Speech and Natural Language Workshop, 1992, pp. 212-217.

Using Background Knowledge to Construct Bayesian Classifiers for Data-Poor Domains

Marcel van Gerven & Peter Lucas
Institute for Computing and Information Sciences
Radboud University, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands

E-mail: {marcelge,peterl}@cs.ru.nl

Abstract

The development of Bayesian classifiers is frequently accomplished by means of algorithms which are highly data-driven. Often, however, sufficient data are not available, which may be compensated for by eliciting background knowledge from experts. This paper explores the trade-offs between modelling using background knowledge from domain experts and machine learning using a small clinical dataset in the context of Bayesian classifiers. We utilized background knowledge to improve Bayesian classifier performance, both in terms of classification accuracy and in terms of modelling the structure of the underlying joint probability distribution. Relative differences between models of differing structural complexity, which were learnt using varying amounts of background knowledge, are explored. It is shown that the use of partial background knowledge may significantly improve the quality of the resulting classifiers.

1 Introduction

Again and again, Bayesian classifiers have proved to be a robust machine learning technique in the presence of sufficient amounts of data [3, 9, 7]. The heavy reliance of their construction algorithms on available data is, however, not always justified, as there are many domains in which this availability is limited. For instance, in the medical domain, more than 90% of medical disorders have a sporadic occurrence and, therefore, even clinical research datasets may only include data of a hundred to a few hundred patients. Clearly, in such cases there is a role for human domain knowledge to compensate for the limited availability of data, which then may act as background knowledge to a learning algorithm.

Even if the exploitation of background knowledge seems difficult to avoid in such data-poor domains, there is a question as to the form of this background knowledge. In the context of Bayesian classifiers, where the aim is to learn a probability distribution that is then used for classification purposes, representing background knowledge as a Bayesian network seems to have at least the appeal that it can easily be transferred to a Bayesian classifier. We call Bayesian networks that offer a task-neutral representation of statistical relations in a domain declarative Bayesian networks. Often, declarative Bayesian networks can be given a causal interpretation.

The construction of declarative Bayesian networks is a difficult undertaking; experts have to state perfectly all the dependencies, independencies and conditional probability distributions associated with a given domain. Since this is a very time-consuming task and an instantiation of the infamous knowledge acquisition bottleneck, we will investigate how background knowledge of different degrees of completeness influences the quality of the resulting classifiers built from this knowledge. We will refer to this form of incomplete and fragmentary knowledge as partial background knowledge.

We will use so-called forest-augmented naive classifiers in order to assess the performance of Bayesian classifiers of different degrees of structural complexity. Both the naive and the tree-augmented naive classifier are limiting cases of this type of Bayesian network [10, 7]. Since Bayesian classifiers ultimately represent a joint probability distribution, we are not only interested in classifier performance, but also in the quality of the learnt probability distributions.

The aim of this article is to gain insight into the quality of Bayesian classifiers when learnt from either (partial) background knowledge or data using a clinically realistic model and accompanying patient database. Note that this is fairly uncommon, since most machine learning research is either based on the availability of large amounts of data or on a declarative model from which the data is generated. These models and data are often explicitly designated for benchmarking purposes, but it is not known and even doubted whether they properly represent the real-world situation [7]. Therefore, we have chosen to use both a model and a dataset taken directly from clinical practice. The declarative model serves as the background knowledge we have at our disposal and we will show how its exploitation may assist in the construction of Bayesian classifiers. We investigate whether the use of partial background knowledge is a feasible strategy in case of limited availability of data.

2 Forest-augmented naive classifiers

2.1 Definition and construction

A Bayesian network B (also called belief network) is defined as a pair B = (G, P), where G is a directed, acyclic graph G = (V(G), A(G)), with a set of vertices V(G) = {X_1, ..., X_n}, representing a set of stochastic variables, and a set of arcs A(G) ⊆ V(G) × V(G), representing conditional and unconditional stochastic independences among the variables, modelled by the absence of arcs among vertices. Let π_G(X_i) denote the conjunction of variables corresponding to the parents of X_i in G. On the variables in V(G) is defined a joint probability distribution P(X_1, ..., X_n), for which, as a consequence of the local Markov property, the following decomposition holds: P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | π_G(X_i)).
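
To make the factorisation concrete, the following minimal sketch computes a joint probability as the product of each variable's conditional probability given its parents. The tiny network, variable names and numbers are illustrative assumptions, not taken from this paper.

```python
# Minimal sketch of the local Markov factorisation
# P(X_1, ..., X_n) = prod_i P(X_i | pi_G(X_i)).
# Illustrative network: C -> E1, C -> E2.

parents = {"C": (), "E1": ("C",), "E2": ("C",)}

# Conditional probability tables, indexed by (value, tuple of parent values).
cpts = {
    "C":  {("yes", ()): 0.3, ("no", ()): 0.7},
    "E1": {("t", ("yes",)): 0.8, ("f", ("yes",)): 0.2,
           ("t", ("no",)):  0.1, ("f", ("no",)):  0.9},
    "E2": {("t", ("yes",)): 0.6, ("f", ("yes",)): 0.4,
           ("t", ("no",)):  0.5, ("f", ("no",)):  0.5},
}

def joint_probability(assignment):
    """Probability of a full assignment, as the product of the local CPTs."""
    p = 1.0
    for var, table in cpts.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= table[(assignment[var], parent_values)]
    return p

print(joint_probability({"C": "yes", "E1": "t", "E2": "f"}))  # 0.3 * 0.8 * 0.4 = 0.096
```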

Figure 1: Forest-augmented naive (FAN) classifier. Note that both the naive classifier and the tree-augmented naive classifier are limiting cases of the forest-augmented naive classifier.

In order to systematically assess the performance of Bayesian classifiers with structures of varying complexity we utilize the forest-augmented naive classifier, or FAN classifier for short (Fig. 1). A FAN classifier is an extension of the naive classifier, where the topology of the resulting graph over the evidence variables ℰ = {E_1, ..., E_n} is restricted to a forest of trees [7]. For each evidence variable E_i there is at most one incoming arc allowed from ℰ \ {E_i} and exactly one incoming arc from the class variable C.

The algorithm to construct FAN classifiers used in this paper is based on a modification of the algorithm to construct tree-augmented naive (TAN) classifiers by Friedman et al. [3] as described in Ref. [7], where the class-conditional mutual information (CMI)

I(E_i; E_j | C) = Σ_{E_i, E_j, C} P(E_i, E_j, C) log [ P(E_i, E_j | C) / (P(E_i | C) P(E_j | C)) ]

is used to select succeeding arcs between evidence variables.

In our research, the joint probability distributions of the classifiers were learnt either from data, using Bayesian updating with uniform Dirichlet priors, or estimated from a declarative Bayesian network. We refer to classifiers of the first kind as data-driven classifiers (denoted by F_d) and to classifiers of the second kind as model-driven classifiers (denoted by F_m). We use F_k^n to refer to a type-k FAN classifier containing n arcs of the sort (E_i, E_j) with i ≠ j. Note that F_k^n is equivalent to a naive classifier when n = 0 and equivalent to a TAN classifier when n is equal to |ℰ| - 1, forming a spanning tree over the evidence variables.
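
As an illustration of this construction step, the sketch below estimates the class-conditional mutual information from a three-way contingency table of counts and then greedily keeps the highest-scoring arcs that leave the evidence variables a forest (a Kruskal-style selection with union-find). This is only a reading of the FAN construction described above, not the authors' implementation; the function names and input formats are assumptions.

```python
import numpy as np

def class_conditional_mi(counts):
    """I(E_i; E_j | C) from a contingency table counts[e_i, e_j, c]."""
    p = counts / counts.sum()                 # P(e_i, e_j, c)
    p_c = p.sum(axis=(0, 1))                  # P(c)
    p_ic = p.sum(axis=1)                      # P(e_i, c)
    p_jc = p.sum(axis=0)                      # P(e_j, c)
    mi = 0.0
    for i, j, c in np.ndindex(p.shape):
        if p[i, j, c] > 0:
            mi += p[i, j, c] * np.log(p[i, j, c] * p_c[c] / (p_ic[i, c] * p_jc[j, c]))
    return mi

def fan_forest(n_evidence, cmi, num_arcs):
    """Pick up to num_arcs undirected arcs (E_i, E_j) with the highest CMI
    such that the evidence variables still form a forest (no cycles)."""
    parent = list(range(n_evidence))          # union-find structure

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for (i, j), _ in sorted(cmi.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                          # adding the arc keeps a forest
            parent[ri] = rj
            chosen.append((i, j))
            if len(chosen) == num_arcs:
                break
    return chosen

# Illustrative use: CMI weights for three evidence variables, keep at most 2 arcs.
weights = {(0, 1): 0.21, (0, 2): 0.05, (1, 2): 0.11}
print(fan_forest(3, weights, num_arcs=2))     # [(0, 1), (1, 2)]
```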

2.2 Estimating classifiers from background knowledge

The new approach studied in this article is to learn a Bayesian classifier's joint probability distribution not only from data, but alternatively to estimate it from a declarative Bayesian network. Declarative Bayesian networks may be viewed as the best approximation to the underlying probability distribution of the domain given the knowledge we have at our disposal. Learning FAN classifiers directly from a declarative model is accomplished as follows.

If we have a joint probability distribution P(X, ℰ, C) with X = {X_1, ..., X_n}, evidence variables ℰ = {E_1, ..., E_m} and class variable C, underlying the

Figure 2: Declarative Bayesian network used in computing the joint probability distributions for a three-vertex network, where P(E_i, E_j, C) = P(E_i | E_j, C) P(E_j | C) P(C) and P(E_i, E_j | C) = P(E_i | E_j, C) P(E_j | C).

declarative Bayesian network B = (G, P), then the following decomposition is associated with the Bayesian network:

P(X, ℰ, C) = P(C | π_G(C)) ∏_{i=1}^{m} P(E_i | π_G(E_i)) ∏_{j=1}^{n} P(X_j | π_G(X_j)).

The joint probability distribution underlying the FAN classifier B' = (G', P') with V(G') = V(G) is defined as P'(ℰ, C). The probability distribution P is used as a basis for the estimation of P', as follows:

P'(E_i | ρ(E_i), C) = Σ_{y ∈ σ(X ∪ ℰ \ ({E_i} ∪ ρ(E_i)))} P(E_i, y | ρ(E_i), C),   (1)

where σ(V) denotes the set of configurations of the variables in V and

ρ(E_i) = {E_j} if π_{G'}(E_i) = {E_j, C}, and ρ(E_i) = ∅ otherwise.

The construction of FAN classifiers from the declarative model and the FAN construction algorithm amounts to estimating three-vertex networks of the form depicted in Fig. 2 using equation (1).

Since FAN classifiers may incorporate just a proper subset of the vertices in the declarative model, we are allowed to remove vertices which do not take part in the computation of the (conditional) probabilities P(C), P(E_j | C) and P(E_i | E_j, C). Equation (1) does not take these irrelevant vertices explicitly into account, but standard techniques from the context of Bayesian inference exist to prune a declarative model prior to computing relevant probabilities [6].
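
Enumerating a full joint table is only practical for very small networks; in practice the three-vertex marginals would be computed with standard Bayesian-network inference on the pruned model, as noted above. Purely as an illustration of Equation (1), the following sketch (the array layout and function name are assumptions) sums a full joint table over all other variables and then conditions on (E_j, C).

```python
import numpy as np

def estimate_fan_cpt(joint, ei_axis, ej_axis, c_axis):
    """P'(E_i | E_j, C) from a full joint table, following Equation (1):
    sum out every variable other than E_i, E_j and C, then normalise over E_i."""
    # Bring the retained axes to the front in the order (E_i, E_j, C).
    p = np.moveaxis(joint, (ei_axis, ej_axis, c_axis), (0, 1, 2))
    # Sum out the remaining variables -> P(E_i, E_j, C).
    p = p.reshape(p.shape[0], p.shape[1], p.shape[2], -1).sum(axis=3)
    # Condition on (E_j, C): divide by P(E_j, C).
    return p / p.sum(axis=0, keepdims=True)

# Illustrative use on a random five-variable joint distribution.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 3, 2))
joint /= joint.sum()
cpt = estimate_fan_cpt(joint, ei_axis=0, ej_axis=1, c_axis=4)
print(cpt.sum(axis=0))        # every (E_j, C) column sums to 1
```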

2.3 Classifier evaluation

The performance of FAN classifiers may be determined by computing zero-one loss, where the value c* of the class variable C with the largest probability is taken:

c* = argmax_c P(C = c | ℰ).

Figure 3: A declarative model is reduced to a partial model. Subsequently, FAN models are constructed from the partial model.

A disadvantage of this straightforward method of comparing the quality of the classifiers is that the actual posterior probabilities are ignored. A more precise indication of the behaviour of Bayesian classifiers is obtained with the logarithmic scoring rule [2]. Let D be a dataset, |D| = p, p > 0. With each prediction generated by a Bayesian model for case r_k ∈ D, with actual class value c_k, we associate a score

S_k = -log P(c_k | ℰ),

which can be interpreted formally as the entropy and has the informal meaning of a penalty. When the probability P(c_k | ℰ) = 1, then S_k = 0 (actually observing c_k generates no information); otherwise, S_k > 0. The total score for dataset D is now defined as the average of the individual scores: S = (1/p) Σ_{k=1}^{p} S_k.

The logarithmic scoring rule measures differences in probabilities for a class c_k given evidence ℰ. A global measure of the difference between two probability distributions P and Q is the relative entropy (or Kullback-Leibler divergence [5]):

D(P, Q) = Σ_X P(X) log [ P(X) / Q(X) ].

We have used the percentage of correctly classified cases computed using zero-one loss as our measure of classification accuracy, the logarithmic score to gain insight into the quality of the assigned probabilities for unseen cases, and relative entropy as a means to gain insight into the quality of the joint probability distribution when comparing the declarative model with the other models.
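
A minimal sketch of the three evaluation measures described above (array layouts and function names are assumptions; pred_probs[k, c] is taken to hold the predicted probability of class c for case k):

```python
import numpy as np

def accuracy(pred_probs, true_classes):
    """Zero-one-loss based accuracy: fraction of cases whose most probable
    class equals the actual class value."""
    return float(np.mean(pred_probs.argmax(axis=1) == true_classes))

def logarithmic_score(pred_probs, true_classes):
    """Average penalty S = (1/p) * sum_k -log P(c_k | evidence)."""
    p_true = pred_probs[np.arange(len(true_classes)), true_classes]
    return float(np.mean(-np.log(p_true)))

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P, Q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Illustrative use with three test cases and a binary class variable.
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
labels = np.array([0, 1, 0])
print(accuracy(probs, labels), logarithmic_score(probs, labels))
print(relative_entropy([0.7, 0.3], [0.6, 0.4]))
```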

2.4 Partial background knowledge

Declarative Bayesian networks are particularly useful to represent the background knowledge we have about a domain, but often this knowledge is incomplete. We define partial background knowledge as any form of knowledge which is incomplete relative to the total amount of background knowledge available.

More formally, let B = (G, P) be a declarative model with joint probability distribution P(X_1, ..., X_n), representing full knowledge of a domain. Let B' = (G', P') with V(G') = V(G) be a Bayesian network with P'(X_1, ..., X_n). B' is said to represent partial background knowledge if 0 < D(P, P') < ε for small ε > 0, where ε is the least upper bound of D(P, P') for an uninformed prior P' (note that D(P, P') ≥ 0 in general).

In this article we have focused on the incomplete specification of dependencies as our operationalisation of partial background knowledge, such that for a partial model B', A(G') ⊆ A(G). The probability distribution P is used as a basis for the estimation of P', as follows:

P'(X_i | π_{G'}(X_i)) = Σ_{y ∈ σ(π_G(X_i) \ π_{G'}(X_i))} P(X_i | π_{G'}(X_i), y) P(y | π_{G'}(X_i)).   (2)

Figure 3 shows how a partial model is estimated from a declarative model using equation (2) and employed to estimate the probabilities for a FAN classifier. Varying the amount of background knowledge we have at our disposal enables us to investigate the relative merits of knowledge of different degrees of completeness. The upper bound of completeness is formed by the knowledge represented in the declarative Bayesian network.
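
For instance, when a single arc Y → X_i is removed, Equation (2) reduces to averaging the original CPT over Y, weighted by P(Y | remaining parents). A minimal numpy sketch under assumed array layouts (not the authors' code):

```python
import numpy as np

def drop_parent(cpt, p_removed_given_kept):
    """Equation (2) for one removed parent Y of X_i.

    cpt                  : P(X_i | kept parents, Y), shape (x, k, y)
    p_removed_given_kept : P(Y | kept parents),      shape (k, y)
    returns                P'(X_i | kept parents),   shape (x, k)
    """
    # Average over the configurations y of the removed parent.
    return np.einsum('xky,ky->xk', cpt, p_removed_given_kept)

# Illustrative use: binary X_i, binary kept parent, ternary removed parent.
rng = np.random.default_rng(1)
cpt = rng.random((2, 2, 3)); cpt /= cpt.sum(axis=0, keepdims=True)
p_y = rng.random((2, 3));    p_y /= p_y.sum(axis=1, keepdims=True)
print(drop_parent(cpt, p_y).sum(axis=0))   # each column sums to 1
```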

3 Non-Hodgkin lymphoma model and data

Figure 4: Declarative Bayesian network as designed with the help of expert clinical oncologists.

Figure 5: Differing resulting structures for data-driven FAN classifiers (left) and model-driven FAN classifiers (right) for the class-variable 5-YEAR-RESULT.

In this research, we used a Bayesian network incorporating most factors relevant for the management of the uncommon disease gastric non-Hodgkin lymphoma (NHL for short), referred to as the declarative model, which is shown in Fig. 4. It is fully based on expert knowledge and has been developed in collaboration with clinical experts from the Netherlands Cancer Institute (NKI) [8]. The model has been shown to contain a significant amount of high quality knowledge [1]. Furthermore, we are in possession of a database containing 137 patients who have been diagnosed with gastric NHL.

We excluded post-treatment variables and have built FAN classifiers as depicted in Fig. 5, where the structure and underlying probability distributions are either learnt from the available patient data or estimated directly from the (partial) declarative model using equation (1).

Classifiers were evaluated by computing classification accuracy and logarithmic score for 137 patient cases for the class variable 5-YEAR-RESULT. This variable represents whether a patient has died from NHL (DEATH) or lives (ALIVE) five years after therapy. For the classifiers learnt from patient data, leave-one-out cross-validation was carried out such that test cases were excluded during estimation of the joint probability distribution of the resulting classifiers. Probability distributions of the classifiers were compared with that of the declarative model by means of relative entropy.

Both the declarative model and the patient database are used as a gold standard, even though no such standard exists in practice. The declarative model is regarded as the gold standard when used as the reference model in computing relative entropies, and the patient database is regarded as the gold standard when used as a test set during leave-one-out cross-validation. As such, both the declarative model and the patient database reflect our best guess with respect to the underlying joint probability distribution of the domain.

Figure 6: Classification accuracy (left) and logarithmic score (right) for Bayesian classifiers F^0 to F^7 with a varying number of arcs, learnt from either patient data (dotted line) or the declarative model (solid line). Classification accuracy and logarithmic score for the declarative model are shown for reference (straight line).

4 Results

4.1 Data-driven versus model-driven classification

The results for both classification accuracy and logarithmic score (Fig. 6) show that performance was consistently better for the model-driven classifiers than for the data-driven classifiers. Construction of a classifier from a database with a limited number of cases obviously leads to a performance degradation and the use of background knowledge considerably enhances classifier quality. Fig. 6 also shows that model-driven FAN classifiers attained better performance than the declarative model, which is task-neutral and not optimised for classification.

Performance differences between model-driven and data-driven classifiers can only arise from qualitative differences in terms of network structure or quantitative differences in terms of estimated conditional probabilities. We proceed by showing how such differences may arise.

4.1.1 Qualitative Differences

When structures are compared, it is found that entirely different dependencies were added due to large differences in CMI when computed either from patient data or background knowledge. The strongest dependency computed from patient data is the dependency between CT&RT-SCHEDULE (chemotherapy and radiotherapy schedule) and CLINICAL-STAGE, having a CMI of 0.212. An indirect dependency with a CMI of 0.0112 indeed exists between these variables, since the two post-treatment variables EARLY-RESULT and 5-YEAR-RESULT are mutual descendants (Fig. 4). Because post-treatment information is unknown

Table 1: Relative entropies for model-driven and data-driven FAN classifiers.

              F^0   F^1   F^2   F^3   F^4    F^5    F^6    F^7
Model-driven  0.52  0.27  0.22  0.18  0.15   0.14   0.13   0.13
Data-driven   6.56  6.58  8.40  9.24  11.55  11.56  12.36  13.77

at the time of therapy administration, clinicians tend to base therapy selection directly on the clinical stage of the tumour. This is an example of a discrepancy between expert opinion and clinical practice, which must be taken into account when validating a model based on patient data. In Ref. [8] more such discrepancies are identified, which are due to evolution in treatment policy or the use of indirect and inaccurate measures of a variable which is identified to be clinically relevant.

Besides the occurrence of such discrepancies, which can only be identified by having sufficient knowledge about the domain, the construction of an accurate classifier based on a small database is impaired in principle. The conjecture that suboptimal dependencies were added is supported by the increasing relative entropy between the declarative model and data-driven classifiers with increasing structural complexity (Table 1). It is unlikely that the naive classifier is simply the best representation of the dependencies within the model, since relative entropy was shown to decrease for model-driven classifiers of increasing structural complexity. Data-driven models add a different set of dependencies, which may be due to incorrect estimation of conditional probabilities during the computation of conditional mutual information. In Ref. [11] we refer to added dependencies which are based on insufficient information as spurious dependencies and present a solution based on non-uniform Dirichlet priors to prevent their occurrence.

4.1.2 Quantitative Differences

With regard to the naive data-driven classifier, we observed a higher logarithmic score than that of the naive model-driven classifier. Since the structures are equivalent, this must be caused by an incorrect estimation of the conditional probabilities. This is also evident from the discrepancies between the prior probabilities for classifiers built either from data or from background knowledge, as depicted in Fig. 5.

As more arcs are added, the incorrect estimation of conditional probabilities is amplified. The addition of a parent with n states multiplies the number of possible parent configurations of a vertex by n. For instance, a large increase in logarithmic score was observed between two successive data-driven models when a dependency between GHS (general health status) and AGE was added. There is, however, no patient data available on the age distribution when GHS takes on the value POOR, such that a uniform Dirichlet prior will be assumed, which is inconsistent with the knowledge contained in the declarative model (Fig. 7).

Figure 7: The probability distribution P(AGE | GHS=POOR, 5-YEAR-RESULT=DEATH) is estimated as a uniform distribution, since there is no data present for this configuration and the Dirichlet prior is uniform. Note that an estimate chosen as the marginal distribution P(AGE | 5-YEAR-RESULT=DEATH) computed from patient data (dotted line) comes closer to the distribution computed from the declarative model (solid line).

Note that a decrease in classification performance was also observed for model-driven classifiers, in which case amplification of incorrect estimation cannot be caused by a finite sample size, because conditional probabilities can be reliably estimated from the declarative model. It can however be caused by an incorrect estimation of conditional probabilities by the expert physician; it is to be expected that the accurate estimation of conditional probabilities tends to become more difficult when the size of the conditioning set grows. However, an estimate can in principle be made for any conditional probability, where the estimate might be the marginal distribution as shown in Fig. 7. Any use of such marginals when probabilities are computed from data must be implemented explicitly.

4.1.3 Performance Testing with Probabilistic Logic Sampling

In order to test whether a naive classifier always performs best for this domain, we have generated a random sample of 10 × 137 cases from the declarative model by means of probabilistic logic sampling [4]. When validating the model based on this sample we found that the logarithmic score decreased from 0.545 for the naive model to 0.523 for the TAN model. Thus, TAN models are in principle able to perform better than a naive model, but for this domain the improvement is only marginal. The reason for this marginal improvement is explained as follows.

When comparing the CMI between variables computed from either background knowledge or patient data, we have found that there is only one dependency, between GHS (general health status) and AGE, showing a high CMI of 0.173 when computed from background knowledge, whereas there are many such combinations when computed from patient data. Let P_0 and P_1 denote the probability distributions for model F_m^0 and model F_m^1 encoding this dependency. The differences in logarithmic score for these models are then specified by P_0(5-YEAR-RESULT | AGE, GHS) and P_1(5-YEAR-RESULT | AGE, GHS), which can be computed from

[ P(AGE | GHS, 5-YEAR-RESULT) P(GHS | 5-YEAR-RESULT) P(5-YEAR-RESULT) ] / [ P(AGE | GHS) P(GHS) ],

where the last component is constant for both F_m^0 and F_m^1 and the first component reduces to P_0(AGE | 5-YEAR-RESULT)/P_0(AGE) for model F_m^0. When we compare D(P_1(AGE | GHS, 5-YEAR-RESULT), P_0(AGE | GHS, 5-YEAR-RESULT)) and D(P_1(5-YEAR-RESULT | AGE, GHS), P_0(5-YEAR-RESULT | AGE, GHS)) we find relative entropies of respectively 2.00 and 0.135.

Let c_{AGE,GHS} denote the value of the class variable 5-YEAR-RESULT for evidence {AGE, GHS}, classified using P_1(5-YEAR-RESULT | AGE, GHS). The difference between the logarithmic scores S_{AGE,GHS} of F_m^1 and F_m^0 for evidence {AGE, GHS} can be written as

log [ P_1(c_{AGE,GHS} | AGE, GHS) / P_0(c_{AGE,GHS} | AGE, GHS) ],   (3)

and the relative entropy between models F_m^1 and F_m^0 can be written as

D(P_1, P_0) = Σ_{AGE, GHS, 5-YEAR-RESULT} P_1(AGE, GHS, 5-YEAR-RESULT) log [ P_1(AGE | GHS, 5-YEAR-RESULT) / P_0(AGE | 5-YEAR-RESULT) ].   (4)

There is little impact on the logarithmic score (equation 3), since it depends on the factors P(5-YEAR-RESULT | AGE, GHS), which show only little relative entropy between models F_m^1 and F_m^0. The impact on the relative entropy between models F_m^1 and F_m^0 is high (equation 4), since it depends on the factors P(AGE | GHS, 5-YEAR-RESULT).

4.2 Classification using partial models

Although the benefit of using background knowledge has been demonstrated in previous sections, it will not usually be the case that full knowledge of the domain is available. Instead, one expects the expert to deliver partial knowledge about the structure and underlying probabilities of the domain. In this section we investigate how partial specifications influence the quality of Bayesian classifiers. To this end, we created partial models retaining 0, 5, 10, 15, 20, 25 and all 32 arcs of the original declarative model.

Figure 8: Regression results on classification accuracy (left) and logarithmic score (right) for the naive classifier (o, thin line) and TAN classifier (+, thick line) for partial models containing varying amounts of partial background knowledge, as measured by the relative entropy D(P, P') between the declarative model B = (G, P) and partial models B' = (G', P').

In total 77 different partial models were generated and the relative entropies between the declarative and partial models were computed. From these models we have generated the model-driven naive and TAN classifiers. Linear regressions on classification accuracy and logarithmic score are shown in Fig. 8.

The outliers at the bottom right and top right of the figure were identified to be partial models where the class variable 5-YEAR-RESULT is a disconnected vertex and were not included in the regression. Such a model encodes just the class variable's prior probabilities and can be regarded as the model with baseline performance. Superimposed + and o symbols represent models whose relevant dependencies can be fully represented within the conditional probability tables of the naive classifier.

It is hard to discern a pattern in the left part of Fig. 8 and little value can be assigned to the regression results. On average, the naive classifier does show better classification accuracy than the TAN model, with a best performance of 73.72% for a model containing ten arcs with a relative entropy of 1.75. The large variance in classification accuracy for partial models with equal relative entropies confirms previous results reported in Ref. [7], where it was indicated that the relationship between the quality of a probability distribution, as measured here more precisely by means of relative entropy, and classification performance is not straightforward.

In the right part of Fig. 8 one can observe, on average, an increase in logarithmic score with increasing relative entropy, which is more pronounced for the naive classifier. This corroborates the thesis that more complete background knowledge has in general a positive effect on classification performance.

On average, partial models containing 10 arcs attain performances similar to that of the model which was learnt from data, which demonstrates that the use of partial background knowledge is indeed a feasible alternative to the use of data for the construction of Bayesian classifiers.

Note that the set of partial models we have used may not be a representative sample, as there are more ways to define partial knowledge. For instance, in our definition, irrelevant vertices are not taken into account when constructing partial models. Hence, arcs may be removed which do not influence the quality of the background knowledge represented in the model with respect to the classification task. On the other hand, naively removing arcs from the declarative model may disconnect the class variable from the rest of the model, reducing model quality severely. In practice, one expects a domain expert to provide a partial model which expresses knowledge relevant to the classification task.

5 Conclusion

Many real-world problems are characterised by the absence of sufficient statistical data about the domain. Most algorithms for constructing Bayesian classifiers are highly data-driven and therefore incapable of producing acceptable results in such data-poor domains. In this article we have formalised the notion of partial background knowledge and introduced the concept of a partial model. We presented a method for constructing model-driven classifiers from partial background knowledge and showed that they outperform data-driven classifiers for data-poor domains.

The main goal of this article was to gain insight into the quality of Bayesian classification when building a real-world classifier for a data-poor domain. Our use of both a model and a dataset taken directly from clinical practice enabled us to show that:

1. Performance differences between model-driven and data-driven classifiers may arise from discrepancies between expert opinion and clinical practice.

2. The performance of both data- and model-driven classifiers decreases when the structural complexity of the classifiers increases.

3. Even though the introduction of dependencies may have a significant impact on relative entropy, the effect on logarithmic score can be negligible.

For model-driven classifiers, the performance decrease is thought to arise mainly from judgment error in estimating conditional probabilities. For data-driven classifiers, the performance decrease is thought to be due to the small size of the database, leading to the introduction of spurious dependencies and the amplification of incorrect estimation of conditional probabilities. Further research has shown that the use of non-uniform Dirichlet priors is capable of preventing the introduction of spurious dependencies in a principled manner [11].

We have demonstrated that for a real-world problem, background knowledge offers a significant contribution to improving the quality of learnt classifiers and even becomes invaluable, since data is often noisy, incomplete and hard to obtain. Note that our operationalisation of partial background knowledge is only one of the many forms of background knowledge one may wish to include.

In a real-world setting, a proper mix should be determined in terms of the use of various kinds of background knowledge on one hand and learning based on data on the other hand. The development of techniques for using background knowledge in order to improve the quality of Bayesian networks is the focus of our future research.

References

[1] C. Bielza, J. A. Fernandez del Pozo, and P. J. F. Lucas. Finding and explaining optimal treatments. In AIME 2003, pages 299-303, 2003.

[2] R. G. Cowell, A. P. Dawid, and D. Spiegelhalter. Sequential model criticism in probabilistic expert systems. PAMI, 15(3):209-219, 1993.

[3] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.

[4] M. Henrion. Propagation of uncertainty by probabilistic logic sampling in Bayes' networks. In Proceedings of Uncertainty in Artificial Intelligence, volume 2, pages 149-163, 1988.

[5] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.

[6] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H. G. Leimer. Independence properties of directed Markov fields. Networks, 20:491-506, 1990.

[7] P. J. F. Lucas. Restricted Bayesian network structure learning. In J. A. Gamez, S. Moral, and A. Salmeron, editors, Advances in Bayesian Networks, Studies in Fuzziness and Soft Computing, volume 146, pages 217-232. Springer-Verlag, Berlin, 2004.

[8] P. J. F. Lucas, H. Boot, and B. G. Taal. Computer-based decision support in the management of primary gastric non-Hodgkin lymphoma. Methods of Information in Medicine, 37:206-219, 1998.

[9] M. Pazzani. Searching for dependencies in Bayesian classifiers. In Learning from data: Artificial intelligence and statistics V, pages 239-248. New York, NY: Springer-Verlag, 1996.

[10] J. P. Sacha, L. Goodenday, and K. J. Cios. Bayesian learning for cardiac SPECT image interpretation. Artificial Intelligence in Medicine, 26:109-143, 2002.

[11] M. A. J. van Gerven and P. J. F. Lucas. Employing maximum mutual information for Bayesian classification. Technical report NIII-R0433, Radboud University Nijmegen, 2004.

SESSION 5:

SPATIAL REASONING, IMAGE RECOGNITION AND HYPERCUBES

Interactive Selection of Visual Features through Reinforcement Learning

Sebastien Jodogne*
Montefiore Institute (B28), University of Liege
B-4000 Liege, Belgium
S.Jodogne@ULg.ac.be

Justus H. Piater
Montefiore Institute (B28), University of Liege
B-4000 Liege, Belgium
Justus.Piater@ULg.ac.be

Abstract

We introduce a new class of Reinforcement Learning algorithms designed to operate in perceptual spaces containing images. They work by classifying the percepts using a computer vision algorithm specialized in image recognition, hence reducing the visual percepts to a symbolic class. This approach has the advantage of overcoming to some extent the curse of dimensionality by focusing the attention of the agent on distinctive and robust visual features.

The visual classes are learned automatically in a process that only relies on the reinforcement earned by the agent during its interaction with the environment. In this sense, the visual classes are learned interactively in a task-driven fashion, without an external supervisor. We also show how our algorithms can be extended to perceptual spaces, large or even continuous, upon which it is possible to define features.

1 Introduction

Reinforcement Learning (RL) is a general framework for modeling the behavior of an agent that learns how to perform its task through its interactions with the environment [2, 7, 22]. The agent is never told what action it should take; rather, when it does a good or a bad action, it only receives a reward or a punishment, the reinforcement. Schematically, RL lies between supervised learning (where an external teacher gives the correct action to the agent) and unsupervised learning (in which no clue about the goodness of the action is given). RL has had spectacular applications, e.g. turning a computer into an excellent backgammon player [23], or making a quadruped robot learn walking progressively without any human intervention [6].

In RL, the agent operates by repeating the following sequence of operations: at time t, (i) it senses its inputs in order to determine the current state s_t of the

* Research Fellow of the Belgian National Fund for Scientific Research (FNRS).

environment, (ii) it selects an action a_t, (iii) it applies this action, which results in sensing a new state s_{t+1} while perceiving a numerical reinforcement r_{t+1} ∈ ℝ, and (iv) it possibly updates its control law using this new experiment. Initially, since the agent knows nothing about what it should do, it acts randomly. After some trial-and-error interactions, the agent begins to learn its task and performs better and better. Two major challenges in RL are the exploration-versus-exploitation dilemma (should the agent exploit its history or try new actions?) and the delayed-reward problem (the pertinence of an action can appear a long time after the interaction, for example in the game of chess).

In this article, we consider the applicability of RL when the agent is faced with visual inputs. As an example, consider the task of grasping objects. It has been shown that infants learn to pre-shape their hands using their vision before they reach the object to grasp [10]. Once the contact is made, haptic feedback is used to locally optimize the grasp. For this grasping procedure to succeed, infants have to learn to distinguish between objects that require different hand shapes. Thus, infants learn to recognize objects following the needs of the grasping task. More generally, evidence shows that visual learning is task-driven [20]. Our long-term goal is to create an artificial system that would acquire object recognition skills using only its interactions with the environment [15]. RL is one plausible framework to model such a system.

Unfortunately, RL algorithms are subject to the curse of dimensionality, i.e., they are very sensitive to the number of states and actions. Now, the size of perceptual domains containing visual inputs is exponential in the size of the images. On the other hand, since the number of interactions an agent has at its disposal to learn its task is necessarily finite, generalization abilities are necessary to face continuous input and/or output spaces: similar perceptions are indeed expected to require similar actions. For instance, a robotic hand learning to grasp objects has a continuous action space.

In order to deal with these two issues, some authors have recently tried to take advantage of supervised learning techniques in the context of RL [4, 14]. Their main argument is that supervised learning comprises a large number of powerful techniques tackling high-dimensional problems with excellent generalization performance. Sketchily, these approaches reduce the RL problem to a sequence of supervised regression problems, each approximating the value of taking, in each state, any possible sequence of actions of a fixed length (the further along in the regression sequence, the greater the considered length). Using the terminology of Ernst et al. [4], we will refer to such techniques as Fitted Q Iteration. It seems thus promising to use Fitted Q Iteration in RL problems involving camera sensors by applying the regression algorithms directly to the values of the raw pixels. To the best of our knowledge, no material has been published on this topic yet.
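
For orientation, here is a bare-bones sketch of Fitted Q Iteration in the spirit of Ernst et al. [4], using scikit-learn's ExtraTreesRegressor as the embedded supervised learner; the data layout, hyperparameters and function name are assumptions, not the setup evaluated in this paper. A greedy policy is then obtained by evaluating the learnt regressor for every action in a given state and picking the maximiser.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, n_actions, n_iterations=50, gamma=0.95):
    """Sketch of Fitted Q Iteration: each pass fits a regressor approximating
    Q_N(s, a) from one-step targets built with the previous approximation.

    transitions: list of (s, a, r, s_next) with s, s_next as 1-D feature arrays
                 and a an integer action index.
    Returns a regressor approximating Q(s, a) on inputs [s, a].
    """
    X = np.array([np.append(s, a) for (s, a, _, _) in transitions])
    rewards = np.array([r for (_, _, r, _) in transitions])
    next_states = np.array([s_next for (_, _, _, s_next) in transitions])

    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            targets = rewards                  # 1-step horizon: Q_1 = r
        else:
            # Q_N(s, a) = r + gamma * max_a' Q_{N-1}(s', a')
            q_next = np.column_stack([
                q_model.predict(np.column_stack([next_states,
                                                 np.full(len(next_states), a)]))
                for a in range(n_actions)])
            targets = rewards + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q_model
```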

Nevertheless, if Fitted Q Iteration is used on visual perceptual spaces, the embedded supervised learning algorithm will necessarily have to distinguish between visual inputs. In this sense, the learning algorithm will have to solve simultaneously a computer vision problem (image classification) and an RL problem (construction of an optimal control law). Now, it is widely admitted that vision problems are difficult to solve, and a large number of non-trivial, powerful techniques devoted to the visual recognition of objects have been developed during the last decades.

Our basic idea is therefore to facilitate the RL process by taking advantage of specialized image classification algorithms, while letting the supervised learning algorithm focus on the control law computation. Obviously, we expect that replacing the visual input by a symbolic input (i.e., the class number corresponding to the image) will drastically reduce the size of the perceptual space, and will break to some extent the curse of dimensionality.

It is clear that the idea of using vision algorithms to make RL easier is not limited to Fitted Q Iteration: actually, any RL algorithm could benefit from visual recognition. Therefore, our technique should remain general enough not to rely on a particular RL algorithm.

2 Reinforcement Learning

2.1 Markov Decision Processes

Reinforcement Learning problems are most often defined in the Markov Decision Processes (MDP) framework. This basically amounts to saying that, after doing some action a_t in some state s_t of the environment, the next state s_{t+1} does not depend on the entire history of the system, but only on s_t and a_t. This also implies that the environment obeys a discrete-time dynamics.

According to the conventions of Kaelbling et al. [7], an MDP is a tuple (S, A, r, T), where S is the finite set of possible states in the environment; A is the finite set of possible actions; r : S × A → ℝ is the reinforcement function, giving for each state-action pair the immediate reinforcement for doing this action in this state; and T : S × A × S → [0, 1] is the transition function, giving the probability of reaching one state after doing some action in some state. Formally:

T(s, a, s') = P{ s_{t+1} = s' | s_t = s, a_t = a }.

2.2 Optimal Policies and Q-functions

A stationary Markovian control policy (for shortness, a policy) is a probabilistic mapping from the states to the actions. A policy governs the behavior of the agent by specifying what action it should take in each state. RL is concerned with the construction of an optimal policy, in a sense that remains to be defined.

The goal of the agent is not to maximize its immediate reinforcements (the sequence of r_t), but its rewards over time. This leads to the definition of the discounted return. Given an infinite sequence of interactions, the discounted return at time t is defined by:

R_t = Σ_{i=0}^{∞} γ^i r_{t+i+1},   (1)

where γ ∈ [0, 1] is the discount factor that gives the current value of future reinforcements.¹ This means that a reward perceived k units of time later is only worth γ^k of its current value.

Let us call the Q function of a policy π the function giving, for each state s ∈ S and each action a ∈ A, the expected discounted return obtained by starting from the state s, taking the action a, and thereafter following the policy π: Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a }, where E_π denotes the expected value given that the agent follows the policy π. Dynamic Programming theory [1] shows that all the optimal policies for a given MDP share the same Q function, denoted Q*, which always exists and satisfies the so-called Bellman optimality equation:

Q*(s, a) = r(s, a) + γ Σ_{s' ∈ S} T(s, a, s') max_{a' ∈ A} Q*(s', a'),   (2)

for all s ∈ S and a ∈ A. When the Q* function is known, for example by solving the non-linear system of Equations (2), an optimal deterministic policy π* is easily derived by letting π*(s) = argmax_{a ∈ A} Q*(s, a) for each s ∈ S.
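
As a small illustration of how Equation (2) can be solved numerically for a finite MDP, the following sketch performs Q-value iteration (successive approximation); the tabular representation, stopping tolerance and function name are assumptions, not part of the paper.

```python
import numpy as np

def q_value_iteration(T, r, gamma=0.95, tol=1e-8):
    """Solve the Bellman optimality equation (2) by successive approximation.

    T : shape (S, A, S), T[s, a, s2] = transition probability to s2
    r : shape (S, A),    r[s, a]     = immediate reinforcement
    Returns the optimal Q function (S, A) and a greedy optimal policy (S,).
    """
    q = np.zeros_like(r, dtype=float)
    while True:
        # Q(s, a) <- r(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q(s', a')
        q_new = r + gamma * (T @ q.max(axis=1))
        if np.max(np.abs(q_new - q)) < tol:
            return q_new, q_new.argmax(axis=1)
        q = q_new

# Illustrative two-state, two-action MDP.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(q_value_iteration(T, r))
```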

2.3 Overview of RL Algorithms

RL algorithms can be roughly divided into two categories: incremental and batch. In incremental RL, the agent starts with an initial policy, which is continuously updated after each interaction with the environment until convergence to an optimal policy. The popular Q-learning algorithm [24] belongs to this category, as does Sarsa [22].
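
The incremental update at the heart of Q-learning [24] can be sketched as follows; the learning-rate and exploration settings are illustrative assumptions rather than choices made in the paper.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q[(s, a)] towards r + gamma * max_a' Q[(s', a')]."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Exploration policy commonly paired with Q-learning."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Illustrative use: tabular Q-values default to 0.
Q = defaultdict(float)
q_learning_update(Q, s="s0", a="left", r=1.0, s_next="s1", actions=["left", "right"])
print(Q[("s0", "left")])   # 0.1 after one update from a reward of 1
```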

On the contrary, in batch RL, the learning process is split into two parts: (i) collection of a database of interactions, and (ii) computation of an optimal policy. The database simply contains the tuples (s_t, a_t, r_{t+1}, s_{t+1}) encountered during the interactions, which summarize the entire history of the system (indeed, the time information t does not matter because of the Markovian nature of the environment). Value Iteration and Policy Iteration [1] are batch RL algorithms, as is Fitted Q Iteration (cf. Introduction). Batch RL is an interesting method when the cost of the experiments is expensive, which is the case in many robotic applications, for example grasping. It is indeed sufficient to collect once and for all a representative set of interactions.

2.4 Perceptual Aliasing

So far, we have implicitly supposed that the agent is able to distinguish between the states of the environment using only its sensors. If this is the case, the perceptual space is said to be fully observable, and the right decisions can always be made on the basis of the percepts. If it is not the case (i.e., if the perceptual space is only partially observable), the agent cannot distinguish between every pair of states and thus will possibly not be able to systematically take the right decision.

¹ In practice, γ is often supposed to be less than 1 to ensure the convergence of the sum.

This phenomenon is known as the perceptual aliasing (or hidden state) problem, and is closely related to ours, as will soon become clear.

Two solutions to this general problem have been proposed in the literature: either the agent identifies and then avoids states where perceptual aliasing occurs [25], or it tries to build a short-term memory that will allow it to remove the ambiguities on its percepts [3, 9].

In this paper, we will only consider fully observable perceptual spaces. However, until the agent has learned the visual classes required to complete its task, visual percepts needing different reactions may be mapped to the same class, thus introducing perceptual aliasing. Nevertheless, the previous approaches are irrelevant in our context, since these ambiguities can be removed by further refining the image classifier. Actually, previous techniques tackle a lack of information inherent to the used sensors, whereas our goal is to handle a surplus of information related to the high redundancy of visual representations.

3 Image Classification using Visual Features

Besides RL, image classification is the other tool required by our algorithms. The goal of image classification is to map an image to a class of objects.

Recent successes in visual object recognition are due to the use of local-appearance approaches [8, 15, 18]. Such approaches first locate highly informative patterns in the image and in a picture of the object to be recognized, using interest point detectors [19], then match these interest points using a local description of their neighborhood, called a visual feature² [11]. If there are enough matches, the image is taken as belonging to the object class. As visual features are vectors of real numbers, there exists an unbounded number of features.

Local-appearance methods can deal with partial occlusions and are very flexible, since they do not need a 3D model of the objects, which is frequently hard to obtain, especially for non-rigid objects. Furthermore, they take advantage of more and more powerful interest point detectors and local descriptors.

4 Reinforcement Learning of Visual Classes

4.1 Description of our Learning System

As discussed in the Introduction, we propose to introduce an image classifier before the RL algorithm itself. The resulting architecture will be referred to as Reinforcement Learning of Visual Classes, and is schematically depicted at the right of Figure 1. This two-level hierarchy can be thought of as a way to raise the abstraction level on which the RL algorithm is applied: the classifier translates low-level information (the raw values of the pixels) into high-level information (an image class) that will itself feed the RL algorithm.

The key idea in RL of Visual Classes is to focus the attention of the agent on a small number of very distinctive visual features that allow the agent to reason

² The terminology "visual feature" is used here as a synonym for "local descriptor".

[Figure: two block diagrams. Left: percepts and reinforcements feed the Reinforcement Learning module directly. Right: percepts pass through an Image Classifier before Reinforcement Learning, with a "classes to refine" feedback arrow from the RL module back to the classifier.]

Figure 1: Comparison of information flows between "classical" Reinforcement Learning (left) and Reinforcement Learning of Visual Classes (right).

upon visual classes rather than raw pixels, and that enhance its generalization capabilities [15]. Initially, the system knows only about one class, so that all the percepts are mapped to this class. Of course, this introduces a kind of perceptual aliasing, though the perceptual space is fully observable. The challenging problem is therefore to refine a visual class dynamically when the agent identifies inconsistencies in the earned discounted returns when faced with that class. For instance, if the same action leads sometimes to a reward, and other times to a punishment, there is strong evidence that the agent is "missing something" in the percepts corresponding to the class. This explains the presence of a right-to-left arrow in Figure 1: the RL algorithm has to inform the image classifier when the learning of a new visual class is required.

Since there is no external supervisor telling the agent when a refinement is needed, our algorithm can only rely on statistical analysis involving the reinforcements earned by the agent. The agent will consequently learn visual classes only through interactions, which is the central property of this system. Intuitively speaking, the role of the agent is to identify functionally-distinguishable percepts: it should distinguish between percepts that involve different discounted returns when it chooses the same reactions.

In the sequel, we will discuss the two major elements that are required in order to turn this learning structure into a working algorithm, namely: (i) a robust criterion able to decide when the classification is not fine enough, that will be called the aliasing criterion, and (ii) an image classifier able to refine a class on request by learning a new distinctive visual feature. To conclude this general description, note that the visual features should be powerful enough to distinguish any functionally-distinguishable percept. We will suppose that this weak requirement is met in the rest of the paper.

4.2 Detailed Description

4.2.1 Core of the Algorithm

The previous section has introduced the paradigm of Reinforcement Learning of Visual Classes. We are now ready to give an in-depth view of our algorithm, which operates in batch mode, since it relies on a statistical analysis of the discounted return observed during the interactions. Here is its core:

1. Begin with step count k := 0 and a percept classifier C_k that maps all the percepts to a single class, i.e., such that C_k(s) = 1 for all percepts s;

2. Collect a database of interactions (s_t, a_t, r_{t+1}, s_{t+1}, e_t), where s_t are the raw percepts³ furnished by the sensors, and e_t is a Boolean tag indicating whether the action a_t has been chosen by randomization or by deterministic exploitation of the knowledge of the agent from the previous steps;

3. After N interactions have been collected:

(a) Use the aliasing criterion to decide if a class needs to be refined,

(b) While there exist aliased classes, refine the classifier C_k by learning new distinctive visual features, which leads to a new classifier C_{k+1},

(c) Let k := k + 1. If k is below some threshold value, go to Step 2.

4. Use an RL algorithm to control the system through the last classifier C_k.

The way the interactions are acquired at the second step is unimportant. For example, a simple ε-greedy policy can be used in order to collect the database. Nevertheless, the database has to satisfy the following requirement, the reason for which will be explained in the next section: at a given step k, whenever the agent chooses to deterministically exploit its knowledge at some time t, it should do so for at least the next k interactions, i.e., up to time t + k.
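A minimal sketch of this batch-mode loop is given below. It is an illustration only, not the authors' implementation: the percept classifier, aliasing test, refinement step and embedded RL algorithm are passed in as callables (they are specified in the following subsections), and the Gym-style env.reset()/env.step() interface and the n_actions attribute are assumptions made for the example.

```python
import random

def collect_interactions(env, classify, greedy_action, n_steps, epsilon=0.1):
    """Collect (s_t, a_t, r_{t+1}, s_{t+1}, e_t) tuples; e_t marks exploitation."""
    database, s = [], env.reset()
    for _ in range(n_steps):
        exploit = random.random() > epsilon          # simple epsilon-greedy collection
        a = greedy_action(classify(s)) if exploit else random.randrange(env.n_actions)
        s_next, r = env.step(a)
        database.append((s, a, r, s_next, exploit))
        s = s_next
    return database

def rlvc(env, classifier, find_aliased, refine, run_rl, n_interactions, max_k):
    """Batch-mode Reinforcement Learning of Visual Classes (steps 1-4 above)."""
    policy = lambda visual_class: 0                  # arbitrary initial reaction
    for k in range(max_k):                           # k plays the role of the step count
        db = collect_interactions(env, classifier.classify, policy, n_interactions)
        aliased = find_aliased(db, classifier, k)    # aliasing criterion (Section 4.2.2)
        while aliased:                               # step 3(b): refine while classes are aliased
            classifier = refine(classifier, aliased, db, k)
            aliased = find_aliased(db, classifier, k)
        policy = run_rl(db, classifier)              # step 4: any RL algorithm over classes
    return classifier, policy
```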

4.2.2 Aliasing Criterion

The only information available to the aliasing criterion is given by the reinforcement values present in the database of interactions. Since the database is necessarily finite, our criterion can only rely on an approximation of the discounted returns (see Equation (1)) observed in the database over some finite time horizon. This leads to the following definition:

Definition 1 The truncated discounted return at some time t for a time horizon H is defined as R_t^H = Σ_{i=0}^{H−1} γ^i r_{t+i+1}, where the r_{t+i+1} are the reinforcements present in the database. It is left undefined if t is greater than N − H.

Let us now suppose that the environment is deterministic, i.e., the transition function T of the underlying MDP is deterministic. In this context, it is clear that executing the same sequence of actions starting from a given state will always lead to the same truncated discounted return. Therefore, two states can be distinguished using only the reinforcements if there exists some sequence of actions such that executing this sequence in those two states leads to different truncated discounted returns. Of course, we cannot try every possible sequence of actions, so we restrict ourselves to the sequences present in the database.

¹Here, s_t denotes at the same time a percept and a state. This syntax is justified since the perceptual space is fully observable: there is a mapping from the percepts to the states.


Now, there could be random variations in the truncated discounted returns just because of the non-deterministic nature of the exploration policy. Such variations should obviously not be taken into account in the aliasing criterion. This explains the requirement on the database of interactions introduced at the end of Section 4.2.1: by considering only the sequences of actions starting in states marked as obtained from deterministic exploitation of the system history, which can be determined by testing the flag e_t, we ensure the uniqueness of the considered sequences of actions.

At some step k of our algorithm, this uniqueness is only ensured for sequences of actions of length less than k. The aliasing criterion is thus based upon an incremental construction: at step k of our algorithm, we only try to distinguish states that are aliased by considering sequences of actions of length k present in the database, i.e., that are k-aliased. More formally:

Definition 2 Two states s_t and s_t', belonging to the same visual class (i.e., such that C_k(s_t) = C_k(s_t')) and encountered respectively at times t and t', are k-aliased if they have both been tagged as obtained from deterministic exploitation, and if R_t^k ≠ R_{t'}^k.

Of course, the more interactions are collected, the more fine-grained distinctions between states can be discovered. Note that the number of iterations of our algorithm corresponds to the maximum time horizon to consider.
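The two definitions translate directly into code. The sketch below assumes the database is the list of (s, a, r, s_next, e) tuples from the collection step, that classify maps a percept to its class, and that rewards[t] holds r_{t+1}, the reinforcement of the t-th interaction; gamma is the discount factor. It is an illustration of the definitions rather than the paper's own code.

```python
def truncated_return(rewards, t, horizon, gamma):
    """R_t^H = sum of gamma^i * r_{t+i+1} for i = 0..H-1; rewards[t] holds r_{t+1}."""
    if horizon <= 0 or t > len(rewards) - horizon:
        return None                                   # undefined past the database
    return sum(gamma ** i * rewards[t + i] for i in range(horizon))

def k_aliased_pairs(database, classify, k, gamma, eps=1e-6):
    """Pairs of exploited time steps whose states share a class but differ in R^k."""
    rewards = [r for (_s, _a, r, _s2, _e) in database]
    exploited = [t for t, (_s, _a, _r, _s2, e) in enumerate(database) if e]
    pairs = []
    for i, t1 in enumerate(exploited):
        for t2 in exploited[i + 1:]:
            if classify(database[t1][0]) != classify(database[t2][0]):
                continue
            r1 = truncated_return(rewards, t1, k, gamma)
            r2 = truncated_return(rewards, t2, k, gamma)
            if r1 is not None and r2 is not None and abs(r1 - r2) > eps:
                pairs.append((t1, t2))                # the two states are k-aliased
    return pairs
```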

4.2.3 Class Refinement

The class refining operation has to discover a new visual feature that best explains the variation in the truncated discounted returns for some visual class at some time horizon k. This is a classification problem, for which we propose a variation of the standard splitting rule used when building decision trees [12].

Firstly, we sort the observed truncated discounted returns obtained starting from the considered class. Each cutpoint in the obtained sequence induces a binary partition of the visual percepts mapped to this class: the percepts whose corresponding truncated returns are above the cutpoint, and the others. Then, for each possible cutpoint, we extract the visual feature that maximizes some information-theoretic score for this partition into two buckets of visual percepts. This is done by iterating over all the visual features present around the interest points in the considered percepts, which are finite in number, and evaluating the split induced by each of those features. We finally keep only the visual feature that has the maximal score among all the extracted visual features.
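The following sketch illustrates this cutpoint search. Information gain is used as one reasonable information-theoretic score (the paper does not fix a particular one), and detect(feature, percept) together with the extraction of candidate features around interest points are assumed to be provided elsewhere; the function names are illustrative.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a bucket containing pos/neg examples."""
    total = pos + neg
    if pos == 0 or neg == 0 or total == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(above, present):
    """Gain of splitting the above/below-cutpoint labels by feature presence."""
    n = len(above)
    gain = entropy(sum(above), n - sum(above))
    for side in (True, False):
        side_labels = [a for a, p in zip(above, present) if p == side]
        gain -= (len(side_labels) / n) * entropy(sum(side_labels),
                                                 len(side_labels) - sum(side_labels))
    return gain

def best_refinement(percepts, returns, candidate_features, detect):
    """Return the (feature, cutpoint, score) with the best split over this class."""
    order = sorted(range(len(returns)), key=lambda i: returns[i])
    best = (None, None, -1.0)
    for cut in range(1, len(order)):                 # every cutpoint of the sorted returns
        above = [False] * len(returns)
        for i in order[cut:]:
            above[i] = True
        for feature in candidate_features:           # features around the interest points
            present = [detect(feature, p) for p in percepts]
            score = information_gain(above, present)
            if score > best[2]:
                best = (feature, returns[order[cut]], score)
    return best
```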

4.2.4 Non-deterministic Environments

We have supposed since Section 4.2.2 that the environment behaves deterministically. Of course, this might not be the case. So, a hypothesis test using the χ²-statistic is applied after each class refining attempt in order to decide if the selected visual feature induces a genuine split that is significantly different from a random split. This approach is inspired by decision tree pruning [17].
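A hedged sketch of such a validation step is shown below: it tests whether the feature's presence is independent of the high/low-return bucket with a χ² test on the 2×2 contingency table and rejects the refinement otherwise. The contingency-table construction, the use of scipy and the threshold alpha are assumptions for illustration, not details taken from the paper.

```python
from scipy.stats import chi2_contingency

def split_is_significant(above, present, alpha=0.05):
    """above/present: parallel Boolean lists over the percepts of the refined class."""
    table = [[sum(a and p for a, p in zip(above, present)),
              sum(a and not p for a, p in zip(above, present))],
             [sum((not a) and p for a, p in zip(above, present)),
              sum((not a) and not p for a, p in zip(above, present))]]
    _, p_value, _, _ = chi2_contingency(table)       # chi-squared independence test
    return p_value < alpha                           # keep the refinement only if significant
```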


4.3 Using Decision Trees as Classifiers

The concrete classifier used in our implementation has not been discussed yet. In this work, we have been working with binary decision trees: the visual classes correspond to the leaves of the tree, and the internal nodes are labeled by the visual feature whose presence is to be tested in that node. The classification of a percept consists in starting from the root node, then progressing in the tree structure according to the presence or the absence of each visual feature found during the descent, until reaching a leaf.

To refine a visual class using a visual feature, it is sufficient to replace the leaf corresponding to this class by an internal node testing the presence or the absence of this feature, and leading to two new leaves.
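A small sketch of this tree, written under the stated description, is given below; detect is assumed to answer whether a percept exhibits a feature, and the class names are illustrative.

```python
class Leaf:
    """A leaf of the tree, i.e. one visual class."""
    def __init__(self, class_id):
        self.class_id = class_id

class Node:
    """An internal node testing the presence of one visual feature."""
    def __init__(self, feature, present, absent):
        self.feature, self.present, self.absent = feature, present, absent

def classify(tree, percept, detect):
    """Descend from the root following feature presence until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.present if detect(tree.feature, percept) else tree.absent
    return tree.class_id

def refine(tree, class_id, feature, new_class_id):
    """Replace the leaf of class_id by a node testing feature, creating one new leaf."""
    if isinstance(tree, Leaf):
        if tree.class_id == class_id:
            return Node(feature, Leaf(new_class_id), tree)
        return tree
    return Node(tree.feature,
                refine(tree.present, class_id, feature, new_class_id),
                refine(tree.absent, class_id, feature, new_class_id))
```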

5 Reinforcement Learning of Classes

The approach we have just presented is actually not limited to visual inputs. It could indeed be useful in any perceptual domain (possibly continuous) that supports classification as a way to reduce its size. In this context, the "visual features" would become "features", i.e., properties that can be displayed or not by the raw percepts. For example, a feature could be the value of a bit in the case of percepts containing binary numbers. Our technique could also be applied for agents having noisy sensors: the use of distinctive features would allow the agents to get rid of noise by examining only pertinent and robust parts of their percepts.

All the previous algorithms can readily be adapted to perceptual spaces upon which the following three elements can be defined:

Features: A feature is any property a raw percept can exhibit or not. There can possibly be an infinite number of features.

Feature Detector: It is a function that tells whether or not a given raw percept exhibits a given feature.

Refining Oracle: It is an oracle that, given two sets of raw percepts, returns the most informative feature explaining this partition into two subsets. It is introduced as an oracle since it is allowed to use some context-dependent information to direct the search for the best feature: the oracle is not obliged to exhaustively consider every feature, which makes particular sense when there is an infinite number of features.

We will call such a generalization Reinforcement Learning of Classes; a minimal interface sketch for these three elements is given below.
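The sketch below phrases the three elements as an abstract interface; the class and method names are illustrative, not taken from the paper.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class PerceptualDomain(ABC):
    """Features, a feature detector and a refining oracle for one perceptual space."""

    @abstractmethod
    def detects(self, feature: Any, percept: Any) -> bool:
        """Feature detector: does the given raw percept exhibit the given feature?"""

    @abstractmethod
    def refining_oracle(self, bucket_a: Iterable[Any], bucket_b: Iterable[Any]) -> Any:
        """Return the most informative feature separating the two sets of raw percepts."""
```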

6 Experiments

We have investigated the behavior of Reinforcement Learning of (Visual) Classes in the context of a simple navigation problem, namely escaping from a discrete 2D maze consisting of empty cells and walls. The goal of the agent


[Percept layout: bits 0-3 encode the x position (0-10), bits 4-14 carry irrelevant (random) information, and bits 15-17 encode the y position (0-7).]

Figure 2: On the left, Sutton's Gridworld [21]. Filled squares are walls, and the exit is indicated by an asterisk. On the right, a diagram describing the percepts of the agent, that are binary numbers of 18 bits.

is to reach the exit of the maze as fast as possible. In each cell, the agent has four possible actions: go up, right, down, or left. When a move would take the agent into a wall, the location is not changed. When a move takes it into the exit, the agent is randomly teleported elsewhere in the maze. The agent earns a reward of 100 whenever the exit is reached, and a penalty of −1 for any other move. Note that the agent is faced with the delayed-reward problem.

This task is directly inspired by Sutton's so-called "Gridworld" [21], with the major exception that our agent does not have direct access to its (x, y) position in the maze. Rather, the position is implicitly encoded in the percepts: in a first experiment, the percepts will be binary numbers that contain the binary values of x and y; in a second experiment, a different object will be buried in each cell under transparent glass, and the sensors of the agent will return a picture of the object underneath.

6.1 The "Binary" Gridworld

In this first experiment, we have used the original Gridworld topology, which is depicted at the left of Figure 2. The sensors of the agent return a binary number, the structure of which is shown on the right of the same figure. In this experiment, features are defined as the bits of the binary numbers, so RL of Classes has been applied. Here, the feature detector tests if a given bit is set or not, and the refining oracle seeks the most informative bit explaining the partition into two subsets of binary numbers.
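For this binary setting, the feature detector and refining oracle are particularly simple, as sketched below. The score used by the oracle here (the difference of set-bit fractions between the two buckets) is a simplified surrogate for the information-theoretic score described earlier, and the function names are illustrative.

```python
def bit_detector(bit, percept):
    """Feature detector: is the given bit set in the 18-bit binary percept?"""
    return (percept >> bit) & 1 == 1

def bit_oracle(bucket_a, bucket_b, n_bits=18):
    """Refining oracle: the bit whose value differs most between the two buckets."""
    def set_fraction(bucket, bit):
        return sum(bit_detector(bit, p) for p in bucket) / max(len(bucket), 1)
    return max(range(n_bits),
               key=lambda b: abs(set_fraction(bucket_a, b) - set_fraction(bucket_b, b)))
```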

To achieve its task, the agent has to focus its attention on the bits encoding x and y, since the other bits are random, and thus irrelevant to its task. We have noticed that this is indeed the case: the built classifier only uses the bits 0, 1, 2, 3, 15, 16 and 17. The obtained classification is shown in Figure 3, as well as the optimal policy it involves. It can easily be seen that the built policy is optimal. After k has reached the value 15, which roughly corresponds to the diameter of the maze, no further split was produced. Note however that this value can vary depending on the database of interactions collected.

It is important to notice that the classification rule is obtained without pre-processing or human intervention. The agent is initially not aware of


Figure 3: On the left, the classification using bits obtained at the end of our algorithm. On the right, the policy built using the last classifier.

which bits are important. Moreover, the interest of using features is clear in this application: a direct tabular representation of the Q function would have 2^18 × 4 cells (one for each possible pair of a binary number and an action).

6.2 The "Tiled'' Gridworld The goal of this second experiment is to illustrate RL of Visual Classes on the toy example depicted in Figure 4. The navigation rules are identical to the Bi­nary Gridworld, but there are fewer cells in order to better interpret the results. The percepts are color images of objects taken from the COIL-100 database [13]. Each cell is identified by a different object. The used visual features are color differential invariants detected at Harris color points of interest [5].

Figures 4 and 5 depict the obtained results. The algorithm succeeds at distinguishing between visual inputs requiring different reactions. Once k has reached the value 3, no further refinement has taken place. It is interesting to notice that the hamburger and the wooden toy (class 4), as well as the duck and the boat (class 5), have not been distinguished. This is a desirable property since these states require the same action, i.e. to go right.

7 Conclusions

We have introduced algorithms that succeed in learning distinctive features in an interactive context, using only reinforcement feedback. Our approach is quite general, and is applicable to many perceptual domains (large, continuous and/or noisy). In particular, these algorithms can be applied in RL problems with image inputs. The only restrictions on the perceptual space are that it must be fully observable, and that it must be possible to define features in it. To achieve this goal, techniques similar to those used in supervised decision tree construction are exploited (evaluation of splits using information-theoretic measures, and reduction of overfitting through hypothesis testing). The architecture of our system is independent of the underlying RL algorithm.


Figure 4: On the left, the Tiled Gridworld. The objects under each cell are marked with their interest points circled. The exit is labeled by an asterisk. On the right, the learned classification.

Figure 5: The resulting decision tree. The visual features tested in each internal node are circled.


Its two-level hierarchy with top-down and bottom-up information flows makes it possible to raise the abstraction level upon which the embedded RL algorithm is applied.

Our work can be seen as a generalization of the visual feature learning system that has been applied by Piater to grasp objects [15]. Indeed, this system can only be applied in interactive tasks with no delayed reward (i.e., where γ = 0) and with binary reinforcements (i.e., only two reinforcements are possible: either a good or a bad action). Moreover, our work is to be distinguished from the tree-based discretization technique of Pyeatt and Howe [16], since the latter is specific to Q-learning, and since its discretization of the perceptual space relies on the perceptual values rather than on higher-level features.

Future research should try to adapt RL of (Visual) Classes to problems with continuous perceptual and/or action spaces, for example grasping. On the other hand, techniques to remove learned features that are subsequently proved to be useless could be developed. To evaluate the performance of different classifiers (e.g., naive Bayes), and to use more powerful visual features (e.g., affine-invariant features, or features taking semi-local constraints into account), are other interesting research topics.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, 1957.

[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[3] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In National Conference on Artificial Intelligence, pages 183-188, 1992.

[4] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning, 2004. Submitted for publication.

[5] V. Gouet and N. Boujemaa. Object-based queries using color points of interest. In IEEE Workshop on Content-Based Access of Image and Video Libraries, pages 30-36, Kauai, Hawaii, USA, 2001.

[6] M. Huber and R. Grupen. A control structure for learning locomotion gaits. In 7th Int. Symposium on Robotics and Applications, Anchorage, AK, May 1998. TSI Press.

[7] L.P. Kaelbling, M.L. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[8] T.K. Leung, M.C. Burl, and P. Perona. Finding faces in cluttered scenes using random labeled graph matching. In Proc. of the Fifth International Conference on Computer Vision, page 637. IEEE Computer Society, 1995.

[9] R.A. McCallum. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, Rochester, New York, 1996.


[10] M. McCarty, R. Clifton, D. Ashmead, P. Lee, and N. Goubet. How infants use vision for grasping objects. Child Development, 72:973-987, 2001.

[11] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 257-263, Madison, Wisconsin, June 2003.

[12] T.M. Mitchell. Machine Learning. McGraw Hill, 1997.

[13] S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, Columbia University, New York, NY, February 1996.

[14] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine learning, 49(2-3):161-178, 2002.

[15] J.H. Piater. Visual Feature Learning. PhD thesis, Computer Science Department, University of Massachusetts, Amherst, MA, February 2001.

[16] L.D. Pyeatt and A.E. Howe. Decision tree function approximation in reinforcement learning. In Proc. of the Third International Symposium on Adaptive Systems, pages 70-77, Havana, Cuba, March 2001.

[17] J.R. Quinlan. The effect of noise on concept learning. In Machine Learning: An Artificial Intelligence Approach: Volume II, pages 149-166. Kaufmann, Los Altos, CA, 1986.

[18] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530-535, 1997.

[19] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision, 37(2):151-172, 2000.

[20] P.G. Schyns and L. Rodet. Categorization creates functional features. Journal of Experimental Psychology: Learning, Memory and Cognition, 23(3):681-696, 1997.

[21] R.S. Sutton. Integrated architectures for learning, planning and reacting based on approximating dynamic programming. In Proc. of 7th Int. Conference on Machine Learning, pages 216-224, San Mateo, CA, 1990.

[22] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[23] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, March 1995.

[24] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.

[25] S.D. Whitehead and D.H. Ballard. Learning to perceive and act by trial and error. Machine Learning, 7:45-83, 1991.


Imprecise Qualitative Spatial Reasoning

Baher A. El-Geresy School of Computing, University of Glamorgan

Treforest, Wales, UK

Alia I. Abdelmoty School of Computer Science

Cardiff University,

Cardiff, Wales, UK

Abstract

This paper addresses the issue of qualitative reasoning in imprecise spatial domains. In particular, the uncertainty in the nature of the spatial relationships between objects in space is represented by a set of possibilities. The approach to spatial reasoning proposed here is carried out in three steps. First, a transformation is carried out on the disjunctive set of possible relationships to derive their corresponding set of spatial constraints. Reasoning formulae are developed to propagate the set of identified constraints and finally a transformation is carried out on the resulting constraints to map them back to the domain of spatial relations to identify the result of the spatial composition. Two general equations form the basis for the propagation of the spatial constraints. A major advantage of this method is that reasoning with incomplete knowledge can be done by direct application of the reasoning formulae on the spatial objects considered, and thus eliminates the need for utilising the inordinate number of composition tables which must be built for specific object types and topology. The method is applied on spatial objects of arbitrary complexity and in a finite definite number of steps controlled by the complexity needed in the representation of objects and the granularity of the spatial relations required.

1 Introduction

Large spatial databases such as geographic databases are characterised by the need to store and manipulate substantial numbers of spatial objects and to provide effective and efficient means of retrieval and analysis. For example, a typical geographic database may contain hundreds of thousands of objects represented by polygons which are themselves represented by hundreds of points. Expensive computational geometry techniques as well as spatial data structures and indexing algorithms are normally employed in such databases. Spatial data in many applications of this sort are often imprecise or incomplete [9, 12]. This may be due in part to the inaccuracy of the measuring devices, or simply to the non-availability of the information. For example, a topographic data set may contain detailed representation of boundaries of large cities, but only representative centre points of smaller towns and villages. Hence, the exact spatial relationships which these smaller objects may be involved in cannot be precisely defined. It would therefore be useful, in these circumstances, for such systems to be able to encode this uncertainty in the representation of spatial objects and spatial relationships. More importantly, it would also be useful to reflect this imprecision in the manipulation and analysis of the data sets.


Qualitative Spatial Representation and Reasoning (QSRR) is an active field of AI research where formalisms for encoding and manipulating qualitative spatial knowledge are studied [3, 2]. The goal is for such techniques to complement and enhance the traditional methods, especially when precise information is neither available nor needed. A typical problem for qualitative reasoning techniques is the automatic composition of spatial relationships. For example, the derivation of the fact that a region x is not connected to another region z, using the knowledge that region x is inside a region y and y is either not connected to z or is touching z externally. Most approaches to QSRR are concerned with finding means of automating the composition of spatial relationships and ultimately the derivation of composition tables for different types of objects and relationships [6, 5, 4]. The problem is considered to be a major challenge to automatic theorem provers [1, 11].

This problem is complicated further when imprecise or incomplete knowledge is used as input to the composition process. In this case, knowledge is usually represented by a disjunctive set of relations, e.g. region x overlaps with or is inside region y. Every relation in the disjunctive set of relations needs to be processed separately and the resulting sets of composed relations are intersected, or summed, to derive the final spatial composition result.

For example, if the relation between objects x and y is disjoint ∨ touch ∨ overlap and y is inside z, then the following four steps are needed to derive the composition between x and z.

1. disjoint(x, y) ∘ inside(y, z) → {disjoint ∨ touch ∨ overlap ∨ inside}

2. touch(x, y) ∘ inside(y, z) → {touch ∨ overlap ∨ inside}

3. overlap(x, y) ∘ inside(y, z) → {overlap ∨ inside}

4. The result is the sum of the results of the above three steps, i.e. {disjoint ∨ touch ∨ overlap ∨ inside}

In general, if the numbers of relations in the disjunctive sets for x, y and y, z are n and m respectively, then the number of spatial compositions is n × m and the total number of steps required to derive the result is (n × m) + 1. The above method is dependent on the availability of composition tables between the objects involved and for the types of relations considered. Hence, composition tables must be either pre-computed or generated on the fly. What was also propagated in the above example is knowledge of possibilities about the spatial scene and not knowledge of facts. Few works have approached the problem of reasoning with imprecise spatial knowledge. In [8], Freksa used semi-interval based temporal reasoning to deal with incomplete or imprecise knowledge. His method was based on capturing the relations between the starts and ends of intervals to represent a set of disjunctive relations. Eleven coarse relations were introduced and a diagrammatic representation of the disjunction was used. Hernandez [10] used a similar approach to define coarse relations between convex regions in the spatial domain. Both works are limited by the diagrammatic representations used and by the fact that their methods are applicable only if the relations involved are conceptual neighbours [7]. Also, these methods rely on looking up the relations in pre-computed composition tables, as no general reasoning mechanism was proposed.
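The table-based procedure criticised here can be sketched as follows: every pair of relations from the two disjunctive sets is composed through a pre-computed composition table and the n × m partial results are unioned. The composition table itself, and its availability for the object types at hand, are assumed.

```python
def compose_disjunctive(relations_xy, relations_yz, composition_table):
    """relations_*: iterables of relation names; composition_table: {(r1, r2): set}."""
    result = set()
    for r1 in relations_xy:
        for r2 in relations_yz:
            result |= composition_table[(r1, r2)]     # one table lookup per pair
    return result
```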


[Figure 1 shows two example relationships, touch(x, y) (T(x, y)) and overlap(x, y) (O(x, y)), together with their 3 × 3 intersection matrices over the components x0, x1, x2 and y0, y1, y2; in the touch case x2 ∩ y1 is empty, while in the overlap case all component intersections are non-empty.]

Figure 1: Different qualitative spatial relationships distinguished by identifying the appropriate intersection of components of the objects and the space, and their corresponding intersection matrices respectively.

In this paper, a new approach is proposed for reasoning with imprecise spatial knowledge. The approach is significant as it is carried out in only two steps, irrespective of the number of relations in the disjunctive sets noted above. Also, the method does not rely on the pre-computation of composition tables and hence may be applied to objects of arbitrary complexity.

The approach utilises the representation method proposed in [6] and inherits its generality in dealing with topological relations between objects of arbitrary complexity. The paper is organised as follows. In section 2, the approach used for the representation of topological relations is presented. The reasoning approach is then described in section 3. Several examples are given in section 4 to demonstrate the generality of the approach. In section 5, a discussion is given on how the method may be applied to representation and reasoning in the temporal domain. Conclusions and an outlook on future work are given in section 6.

2 The Underlying Representation of Spatial Relations

Objects of interest and their embedding space are divided into components according to a required resolution. The method of representation of spatial relations has been proposed in earlier works [6] and is briefly reviewed here. Topological relations are represented through the intersection of object components. The distinction of topological relations is dependent on the strategy used in the decomposition of the objects and their related spaces.

The complete set of spatial relationships is represented by the combinatorial intersection of the components of one space with those of the other space.

If R(x, y) is a relation of interest between object x and object y, and X and Y are the spaces associated with the objects respectively, such that m is the number of components in X and l is the number of components in Y, then a spatial relation R(x, y) can be represented by one state of the following


equation:

R(x, y) = X ∩ Y = ⋃_{i=1..m} ⋃_{j=1..l} (x_i ∩ y_j) = (x_1 ∩ y_1, ..., x_1 ∩ y_l, x_2 ∩ y_1, ..., x_m ∩ y_l)

The intersection x_i ∩ y_j can be an empty or a non-empty intersection. The above set of intersections shall be represented by an intersection matrix R(x, y), whose rows correspond to the components of one space and whose columns correspond to the components of the other.

For example, spatial relationships and their corresponding intersection matrices are shown in figure 1. The component x2 has an empty intersection with y1 in 1(a) and a non-empty intersection in 1(b).
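As a small sketch of this representation, the matrix can be built from any component decomposition once a component-level intersection predicate is available; the geometry itself is abstracted away behind that predicate, which is an assumption of the example.

```python
def intersection_matrix(x_components, y_components, intersects):
    """R(x, y) as a dict keyed by component indices (i, j), with values 1 or 0."""
    return {(i, j): 1 if intersects(xc, yc) else 0
            for i, xc in enumerate(x_components)
            for j, yc in enumerate(y_components)}
```

With this encoding, the touch relation of figure 1(a) would give a matrix whose only 0 entries include (2, 1), while the overlap relation of figure 1(b) would give all 1s.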

2.1 Mapping Component Intersections into Relations

The intersection matrix is in fact a set of intersection constraints whose values identify specific spatial relationships. Figure 2 represents the mapping between intersection constraints and sets of spatial relations in the case of two convex areal objects. Table entries represent the resulting set of possible relations if the result of the intersection of the corresponding components is non-empty (or (1)). For example, if x0 ∩ y2 = 1 is the only intersection known, then the relationship between objects x and y is D ∨ T ∨ O ∨ I ∨ IB, and so on. If the result of the intersection is the empty set (or (0)), then the possible set of relations will be the complement of the set shown. If x0 ∩ y2 = 0, then the corresponding table entry will be E ∨ C ∨ CB. The possible relations between objects x and y can therefore be derived from the combination (set intersection) of all the table entries. An example is given in figure 3, where in 3(a) the intersection matrix for objects x and y is shown (with unknown value for x2 ∩ y2). In 3(b) the mapping of the intersections into possible relations is given using the table (and its complement) in figure 2(b). The spatial relation between objects x and y is then derived from the set intersection of all table entries to be I(x, y) ∨ IB(x, y), as shown in 3(c).
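This mapping step can be sketched directly from the description above: each known entry contributes the table's set (if 1) or its complement (if 0), unknown entries contribute no constraint, and the possible relations are the intersection of all contributions. The mapping_table argument plays the role of the table in figure 2(b), and the relation names follow figure 2(a).

```python
ALL_RELATIONS = {"D", "T", "O", "E", "C", "CB", "IB", "I"}

def possible_relations(matrix, mapping_table):
    """matrix: {(i, j): 1, 0 or '?'}; mapping_table: {(i, j): set of relation names}."""
    result = set(ALL_RELATIONS)
    for key, value in matrix.items():
        if value == 1:
            result &= mapping_table[key]                   # table entry itself
        elif value == 0:
            result &= ALL_RELATIONS - mapping_table[key]   # its complement
        # a '?' entry contributes no constraint
    return result
```

Applied to the matrix of figure 3(a), for instance, this intersection reduces to {I, IB}, matching figure 3(c).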

3 The Approach

The imprecise knowledge in the form of a disjunctive set of spatial relations is to be represented as a set of intersection constraints. For example, the set of relations (T, O, E, CB, IB) shown in figure 4(a) can be expressed precisely by one constraint only, namely x2 ∩ y2 = 1, as shown in 4(b).

The process of spatial reasoning can be defined as the process of propagating the intersection constraints of two spatial relations (for example, R_1(A, B) and


(a) [Figure 2(a) depicts the eight possible relations between two convex areal objects: D(x, y), T(x, y), O(x, y), E(x, y), C(x, y), CB(x, y), IB(x, y), I(x, y).]

(b) Mapping of non-empty component intersections into disjunctive sets of possible relations:

        x0                   x1                        x2
y0      ALL                  D ∨ T ∨ O ∨ C ∨ CB        D ∨ T ∨ O ∨ C ∨ CB
y1      D ∨ T ∨ O ∨ I ∨ IB   O ∨ I ∨ IB ∨ C ∨ CB ∨ E   I ∨ IB ∨ O
y2      D ∨ T ∨ O ∨ I ∨ IB   O ∨ C ∨ CB                T ∨ O ∨ IB ∨ CB ∨ E

Figure 2: (a) Set of possible relations between two convex areal objects, (b) Mapping the non-empty intersection of object components into a disjunctive set of possible relations.


(a) Intersection matrix for objects x and y (the entry x2 ∩ y2 is unknown, 0 ∨ 1):

        x0   x1   x2
y0      1    0    0
y1      1    1    1
y2      1    0    0 ∨ 1

(b) Mapping of the individual component intersections into possible relations:

        x0                   x1                        x2
y0      ALL                  I ∨ IB ∨ E                I ∨ IB ∨ E
y1      D ∨ T ∨ O ∨ I ∨ IB   O ∨ I ∨ IB ∨ C ∨ CB ∨ E   I ∨ IB ∨ O
y2      D ∨ T ∨ O ∨ I ∨ IB   D ∨ T ∨ E ∨ I ∨ IB        (T ∨ O ∨ IB ∨ CB ∨ E) OR (D ∨ I ∨ C)

(c) I(x, y) ∨ IB(x, y)

Figure 3: (a) An example intersection matrix with unknown value for x2 ∩ y2. (b) Mapping intersections of individual components to possible relations. (c) Possible relations between the objects as the common subset of all the table entries.

(a) [Figure 4(a) depicts the disjunctive set of possible relations T, O, E, CB, IB between the two objects.]

(b) The corresponding intersection matrix, in which every entry is unknown (?) except x2 ∩ y2 = 1:

        x0   x1   x2
y0      ?    ?    ?
y1      ?    ?    ?
y2      ?    ?    1

Figure 4: (a) Set of possible spatial relations between two spatial objects. (b) Their representation by the single constraint x2 ∩ y2 = 1.


R_2(B, C)), to derive a new set of intersections between the objects. The derived constraints can then be mapped to a specific spatial relation (i.e. the relation R_3(A, C)).

Let X = ⋃_{i=1..m} x_i and Z = ⋃_{k=1..n} z_k represent the spaces X and Z associated with objects x and z respectively; m and n are the total numbers of components in those spaces. If y_j ⊆ Y, where Y is the embedding space of the common object in the composition relation, then since X = Y = Z it follows that (y_j ⊆ X) ∧ (y_j ⊆ Z).

In general the intersection of two components can take one of three values: 0, 1 or ?, where 0 indicates an empty intersection, 1 a non-empty intersection and ? either a 0 or a 1 (an unknown value). Hence, if P = {0, 1, ?} then y_j ⊆ X → ∀x_i ∈ X (y_j ∩ x_i = p_q) where p_q ∈ P. Similarly, y_j ⊆ Z → ∀z_k ∈ Z (y_j ∩ z_k = p_q).

The reasoning process can be carried in two steps, namely:

1. Multiplication Operation: on the intersection relations between every component from the intermediate space and every component of the other two spaces.

2. Addition Operation: on the results of multiplication for all the components of the intermediate space.

The multiplication operation can be expressed as follows:

(X ∩ Z)_{y_j} = [ ⋃_{i=1..m} (x_i ∩ y_j) ] * [ ⋃_{k=1..n} (z_k ∩ y_j) ]     (1)

where (X ∩ Z)_{y_j} are the propagated intersection relations between the components of the spaces X and Z based on their intersection with the component y_j of space Y. The addition operation can be expressed as follows:

(X ∩ Z) = Σ_{j=1..l} (X ∩ Z)_{y_j}     (2)

where l is the total number of components of space Y.

Substituting (1) in (2) we get the general reasoning formula below.

General Reasoning Formula

(X ∩ Z) = Σ_{j=1..l} [ ( ⋃_{i=1..m} (x_i ∩ y_j) ) * ( ⋃_{k=1..n} (z_k ∩ y_j) ) ]     (3)

Note that there is no restriction on the application of formula (3) to a single component or to a set of components of the intermediate space, hence a general form of formula (1) can be stated as follows:

(X ∩ Z)_{y'} = [ ⋃_{i=1..m} (x_i ∩ y') ] * [ ⋃_{k=1..n} (z_k ∩ y') ]     (4)

where y' ⊆ Y, for example y' = y_1 ∪ y_2. The two formulae (3) and (4) constitute the general spatial reasoning method for incomplete or uncertain knowledge. Formula (3) needs to be applied in all



Table 1: Multiplication Table for incomplete knowledge.


Table 2: Addition Table for incomplete knowledge.

cases, while (4) only needs to be applied whenever a constraint exists of the following form:

x_i ∩ y_j = ? ∧ x_i ∩ y_{j+1} = ? ∧ x_i ∩ (y_j ∪ y_{j+1}) = 1     (5)

i.e. the intersections of x_i with y_j and y_{j+1} cannot both be empty. Then using y' = (y_j ∪ y_{j+1}) will give x_i ∩ y' = 1.

To distinguish the constraint in (5) from a non-related constraint of the form x_i ∩ y_j = ? ∧ x_i ∩ y_{j+1} = ? ∧ x_i ∩ (y_j ∪ y_{j+1}) = ?, a label will be used and equation (5) can be rewritten as x_i ∩ y_j = ?a ∧ x_i ∩ y_{j+1} = ?a. The added subscript a indicates that the two constraints are related.

A constraint of the type (5) can be used either as an input to the reasoning task or as an output of it. If the number of components of space X that have non-empty intersections with y_j is greater than 1, then this case is distinguished with the label 1⁺ instead of 1. Similarly, if the number of components of space X that have undefined intersections (?) with y_j is greater than 1, then this case is distinguished with ?⁺ instead of ?.

Accordingly, the multiplication and addition tables of our method are as shown in Tables 1 and 2.

Note that in the addition table we use ?α and ?β since we add results for different components of space Y, i.e. ?α + ?β = ?α ∧ ?β.

Equations 3 and 4 and the multiplication and addition tables represent the general space algebra for reasoning with incomplete or uncertain knowledge. The algebra makes no restriction on the complexity of the objects used or the completeness or uncertainty of the knowledge of the topological relations involved.
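Structurally, the propagation of formula (3) amounts to iterating over the components of the intermediate space for every pair of components of the outer spaces. The sketch below shows only that structure: the actual multiplication and addition of constraint values are defined by Tables 1 and 2, so they are passed in here as abstract multiply and add callables; the matrix representation as dicts keyed by component indices is an assumption of the example.

```python
def compose(xy_matrix, zy_matrix, x_components, z_components, y_components,
            multiply, add):
    """Propagate the (X n Z) constraints from (X n Y) and (Z n Y), component by component."""
    result = {}
    for i in x_components:
        for k in z_components:
            acc = None
            for j in y_components:                    # the sum over y_j of formula (3)
                contribution = multiply(xy_matrix[(i, j)], zy_matrix[(k, j)])
                acc = contribution if acc is None else add(acc, contribution)
            result[(i, k)] = acc
    return result
```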


(a) The intersection matrices for R(x, y) and R(y, z) (columns y0, y1, y2):

        y0   y1   y2              y0   y1   y2
x2      1    0    ?          z0   1    1    1
x1      1    1    1          z1   1    0    0
x0      1    0    0          z2   1    0    ?

(b) [resulting possible set of relations between x and z]

Figure 5: (a) Example reasoning problem with incomplete knowledge, (b) Resulting possible set of relations between x and z.

4 Examples of Spatial Reasoning with Incomplete Knowledge

4.1 Example 1

Consider the reasoning problem where the relations between the simple convex areal objects x, y and z are C(x, y) or CB(x, y) between x and y, and D(y, z) or T(y, z) between y and z, as shown in figure 5(a). This indefiniteness is reflected in the intersection matrices in the figure, where x2 ∩ y2 = ? and y2 ∩ z2 = ?.

It is required to derive the possible relationships between objects x and z. Applying the general topological reasoning equation (3) on y0, y1 and y2, using the multiplication table, we get the following.

• y0 intersections (n'' = n and m'' = m, the second general constraint):

(X ∩ Z)_{y0}: {x0, x1, x2} ∩ {z0, z1, z2} = ?

• y1 intersections: x1 ∩ z0 = 1 and (X − x2) ∩ (Z − z0) = φ.

• y2 intersections (n'' > 0 ∧ n' > 1 ∧ m'' > 0 ∧ m' > 1, i.e. a related constraint of the type described above):

x1 ∩ z0 = ?a, x1 ∩ z2 = ?a, z0 ∩ x1 = ?b, z0 ∩ x2 = ?b, x0 ∩ (z0 ∪ z1 ∪ z2) = φ,


z1 ∩ (x0 ∪ x1 ∪ x2) = φ.

Using the addition table we get the following:

             y0      y1      y2          Σ
x0 ∩ z0      ?       0       0           ?
x0 ∩ z1      ?       0       0           ?
x0 ∩ z2      ?       0       0           ?
x1 ∩ z0      ?       1       ?a ∧ ?b     1
x1 ∩ z1      ?       0       0           ?
x1 ∩ z2      ?       0       ?a          ?a
x2 ∩ z0      ?       0       ?b          ?b
x2 ∩ z1      ?       0       0           ?
x2 ∩ z2      ?       0       ?           ?

Also, from the first general constraint, x0 ∩ z0 = 1. Compiling the above intersections we get the following resulting intersection matrix:

        z0   z1   z2
x0      1    ?    ?
x1      1    ?    ?
x2      ?    ?    ?

Mapping the above matrix into spatial relations, using the table in figure 2(b), gives the disjunctive set R(X, Z) = D ∨ T ∨ O ∨ C ∨ CB, as shown in figure 5(b).

4.2 Example 2

The resulting relationships in the above example gave the intersection of x2 as x2 ∩ {z0, z1, z2} = ?, i.e. no constraints are propagated for the component x2. Suppose a new fact is added such that x2 ∩ {z0, z2} = 1, i.e. x2 ∩ z0 = ?a and x2 ∩ z2 = ?a, and that the relationships between objects x and z are to be further composed with a relationship C(z, q) between objects z and q, as shown in figure 6.

Applying the general topological reasoning equation (3) on z0, z1 and z2 and using the multiplication table, we get the following.

• z0 intersections:

(X ∩ Q)_{z0}: (x0 ∩ q0 = 1) ∧ (x1 ∩ q0 = 1) ∧ (x2 ∩ q0 = ?) ∧ ({q1, q2} ∩ {x0, x1, x2} = φ)

• z1 intersections:

{q0, q1, q2} ∩ {x0, x1, x2} = ?


[Figure 6(a) shows the relations D(x, z), T(x, z), O(x, z), C(x, z) and CB(x, z) from the previous example composed with C(z, q); figure 6(b) shows the resulting possible relations between x and q.]

Figure 6: (a) Further composition of the spatial relations between x and z from the previous example with the relation C(z, q) to get the possible relations between x and q in (b).

• z2 intersections:

(q0 ∩ {x0, x1, x2} = ?) ∧ ({q1, q2} ∩ {x0, x1, x2} = φ)

From the previous example we have that x2 ∩ z0 = ?a and x2 ∩ z2 = ?a. Applying equation (4) for x2 only, we get the following.

• x2 intersections:

x2 ∩ (z0 ∪ z2) = 1 ∧ (z0 ∪ z2) ∩ q0 = 1 → x2 ∩ q0 = 1

Compiling the above intersections we get the following resulting intersection matrix.

        q0   q1   q2
x0      1    ?    ?
x1      1    ?    ?
x2      1    ?    ?

Mapping the above matrix into spatial relations, using the table in figure 2(b), gives the disjunctive set R(X, Q) = D ∨ T ∨ O ∨ C ∨ CB, as shown in figure 6(b).

4.3 Example 3: Composition with indefinite and related constraints

Consider the relations between objects x, y and z as shown in figure 7 (R(x, y) = T(x, y) ∨ C(x, y) ∨ CB(x, y) and R(y, z) = IB(y, z)). Their representative intersection matrices are as follows:



Figure 7: (a) Composition with indefinite and related constraints, (b) The composition result.

        y0   y1   y2              y0   y1   y2
x2      1    0    ?a         z0   1    0    0
x1      1    ?    ?a         z1   1    1    1
x0      1    ?    ?          z2   1    0    1

Applying the general formula (3) on y0, y1 and y2 and using the multiplication table, we get the following.

• y0 intersections:

{x0, x1, x2} ∩ {z0, z1, z2} = 1

• y1 intersections:

x1 ∩ z1 = ? ∧ x0 ∩ z1 = ? ∧ x2 ∩ z1 = ?

• y2 intersections:

x1 ∩ z1 = ?a ∧ x1 ∩ z2 = ?a, x2 ∩ z1 = ?a ∧ x2 ∩ z2 = ?a, x0 ∩ z1 = ? ∧ x0 ∩ z2 = ?

Also, from the first general constraint, x0 ∩ z0 = 1. Compiling the above intersections we get the following resulting intersection matrix.

        z0   z1   z2
x0      1    ?    ?
x1      ?    ?a   ?a
x2      ?    ?a   ?a


i.e. {x1, x2} ∩ {z1, z2} = 1. Mapping the constraints back will result in the exclusion of the disjoint relation only, and hence R(X, Z) = T ∨ O ∨ E ∨ I ∨ IB ∨ C ∨ CB.

5 Conclusions

A general approach to spatial reasoning over imprecise topological relations is proposed. The approach is applicable to objects of arbitrary complexity. The method builds on and generalises previous work in [6] where spatial relations are represented by the intersection of object and space components. Spatial reasoning is carried out in three steps. First a transformation is used to map the imprecise input relations into a specific set of known constraints. Spatial reasoning is carried out on the constraints to derive a resulting set of constraints and finally the resulting constraints are mapped back into a set of possible relations between the objects considered. The method eliminates the need for the development and utilisation of composition tables in the spatial reasoning process. It has also been briefly shown how to adapt the method for representation and reasoning in the temporal domain. The homogeneous treatment of space and time is a subject of much research and shall be investigated further in future work.

References

[1] B. Bennett, A. Isli, and A. Cohn. When does a composition table provide a complete and tractable proof procedure for a relational constraint language, 1997.

[2] S. Chen. Advances in Spatial Reasoning. Ablex, 1990.

[3] A.G. Cohn and S.M. Hazarika. Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae, 46(1-2):1-29, 2001.

[4] A.G. Cohn and A.C. Varzi. Modes of Connection: A Taxonomy of Qualitative Topological Relations. In Proceedings of the International Conference on Spatial Information Theory COSIT99, volume LNCS 1329, pages 299-314. Springer Verlag, 1999.

[5] M.J. Egenhofer and J.R. Herring. A Mathematical Framework for the Definition of Topological Relationships. In Proceedings of the 4th International Symposium on Spatial Data Handling, volume 2, pages 803-13, 1990.

[6] B.A. El-Geresy and A.I. Abdelmoty. SPARQS: Automatic Reasoning in Qualitative Space. In Proc. of AI'2003, the Twenty-third SGAI Int. Conf. on Innovative Techniques and Applications of Artificial Intelligence, pages 243-254. Springer, 2003.

[7] C. Freksa. Conceptual Neighborhood and its Role in Temporal and Spatial Reasoning. In Decision Support Systems and Qualitative Reasoning, pages 181-187, 1991.


[8] C. Freksa. Temporal Reasoning based on Semi-Intervals. Artificial Intelligence, 54:199-227, 1992.

[9] M.F. Goodchild and S. Gopal, editors. Accuracy of Spatial Databases. Taylor & Francis, London, 1989.

[10] D. Hernandez. Qualitative Representation of Spatial Knowledge, volume 804. Springer Verlag, 1994.

[11] D. Randell and M. Wikowski. Building Large Composition Tables via Axiomatic Theories. In Principles of Knowledge Representation and Reasoning: Proceedings of the Eighth International Conference (KR-2002), pages 26-35. AAAI Press, 2002.

[12] M.F. Worboys and E. Clementini. Integration of imperfect spatial information. Journal of Visual Languages and Computing, 12:61-80, 2001.


Reasoning with Geometric Information in Digital Space

Passent El-Kafrawy Robert McCartney

Department of Computer Science and Engineering

Storrs, CT

Abstract

Concurrency requires consistency and correctness. Isothetic rectangles can be used as a geometrical technique to verify a safe and deadlock-free schedule for concurrent nodes. However, the known algorithms for concurrency using isothetic rectangles require prior knowledge of the system behavior. We provide a new mechanism to use isothetic rectangles without this limitation. The discrete nature of isothetic rectangles provides an opportunity for inter-diagrammatic reasoning. Inter-Diagrammatic Reasoning (IDR) can be easily computed on a parallel machine, and has a complexity of O(n) for most of the iso-rectangle problems where the best known algorithm was O(n log n) in Euclidean geometry. This new framework will also allow a dynamic mode of operation in calculating the closure of a set of iso-rectangles, rather than restricting the solution to static systems where all required resources must be reserved in advance.

1 Introduction

Deadlock is a major problem in computer science. In this paper, we discuss how to address this geometrically, by representing the resource elements of processes as isothetic rectangles in a diagram. We consider the special case of discrete time steps, which allows us to map these diagrams into digital space. We then apply the techniques of inter-diagrammatic reasoning to solve the deadlock problem, resulting in a dynamic deadlock avoidance technique.

In the 1980s, researchers worked hard on defining concurrency control mechanisms, one of which is using geometrical representation techniques. Dijkstra was the first researcher to explain semaphores using progress graphs [3]. Although he did not use them as a technique to find a safe solution, other researchers used progress graphs as a geometrical representation technique to prove that a schedule was safe and deadlock free. The properties of transaction systems with lock operations were related to the geometry of isothetic rectangles in [12].

The efficient graphical solutions for deadlock require that the set of rectangles be known in advance. The prior knowledge of all requirements is needed to construct the graphical image of the system from which a solution can be calculated. This static mode of operation is a limitation, especially in most of


the real concurrent applications. In dynamic operation, as the set of rectangles changes, a new solution has to be calculated automatically. This requires a computational framework that works at the rectangle level, in contrast to the Euclidean solution where a processing step works parallel to the axes. There is no known graphical solution that operates dynamically until now; there are non-graphical approaches for deadlock prevention [11, 5] but not avoidance.

In dealing with deadlock, one has to sacrifice convenience or correctness. Some systems ignore the problem, assuming that it happens rarely. But when it happens, restoring the system is a big burden with great loss. Another solution is detection and resolution: the system tries to detect when it happens and then recover from it. Even if deadlock can be detected, recovery is difficult and sometimes impossible. The other technique is to avoid deadlock, but is there an algorithm that allocates resources safely? The algorithms known are restricted by the fact that all resources must be requested in advance; no dynamic algorithm exists. The last technique known is to prevent deadlock from occurring, but finding appropriate conditions for prevention is usually impractical in most systems [10].

The collection of rectangles in the plane is an abstract representation of several real problems. The research in that area works on orthogonal directions, parallel to the axes, called isothetic rectangles according to [8] (other researchers call them iso-oriented, orthogonal, linear and aligned), iso-rectangles for short. The collection of isothetic rectangles is characterized by the unique property that the plane can be subdivided into subintervals, in each direction of the coordinates, where each axis here represents a concurrent transaction. Each rectangle represents two intervals, one on each axis; the two intervals are the mutual exclusion periods for requesting a single resource by the two transactions.

The goal of this research is to avoid deadlock dynamically, in situations where the resources are not all known in advance. We restrict time to discrete integer values, which means that the diagrams are in discrete space, where we can use inter-diagrammatic reasoning (IDR) as a reasoning tool. IDR requires an underlying discrete tessellation, and isothetic rectangles characterize such a discrete grid. IDR operates over the rectilinear grid, rectangle by rectangle. As a rectangle can represent a requested resource in a concurrent system, requests are handled dynamically to prevent deadlock. At the same time this geometrical reasoning technique allows visual monitoring of the system. The state of the concurrent system can be verified at any given time.

Inter-diagrammatic reasoning (IDR) is a computational framework for digital geometry [1]. IDR can be used to represent digital geometry in two dimensions, as well as providing a concise language for specifying algorithms. In previous work [7], we showed how IDR provides an algorithmic and computational framework for computing in planar digital space. We examined the characteristics of digital geometry and digital pictures, and applied inter-diagrammatic reasoning to represent data and computations relating to planar digital pictures.

We will define some basic notation about the geometry of isothetic rectangles and IDR as a general computational framework. A detailed description of how concurrency can be geometrically represented is given. The reasoning mechanism for a concurrent system of two nodes is first defined; we then


generalize the technique to a d-transaction system and, finally, conclude with a summary of the proposed technique.

2 Isothetic Rectangles

An application involving intervals or subdivision of the domain space is a candidate to be represented computationally using isothetic rectangles. Two well-studied problems represented and solved using isothetic rectangles are concurrency control [2, 6] and VLSI design [4]. Isothetic rectangles can also represent other problems like scheduling, performance and QoS, which are problems in database management, operating systems and networking.

By definition an isothetic rectangle is a rectangle with sides parallel to the axes. A set of these can be represented graphically on a plane. The mapping of the rectangles on the plane imposes a subdivision on the axes. In other words, the rectangle sides map onto the coordinates of the Cartesian graph, forming a grid that has discrete intervals. In our application, we restrict this grid to fall on particular intervals, so the underlying space can be described by predefined rectangular pixels.


Figure 1: Some isothetic rectangles in a 2D space

Different questions arise with the applicability of isothetic rectangles to different problems (e.g. VLSI design requires the area and perimeter of intersection and/or union, concurrency control requires the closure of the union, etc.). As this paper is more directed to reasoning about concurrent nodes, we will not explain all of these problems. We will concentrate on the closure of isothetic rectangles as it is the key to finding an appropriate solution to concurrency.

Consider two points p1 = (x1, y1) and p2 = (x2, y2) in the plane. These are called incomparable if x1 < x2 and y1 > y2; this can be defined mathematically as (x1 − x2)(y1 − y2) < 0. The SW-closure of the incomparable points p1 and p2 is then all points enclosed by the rectangle with sides x1, x2, y1 and y2 and SW-corner point (x1, y2). Similarly, the NE-closure is all points enclosed by the rectangle with sides x1, x2, y1 and y2 and NE-corner point (x2, y1); these are the shaded regions in figure 1.


Definition Let U_R be the union of a set of iso-rectangles. Then U_R is SW-closed if for every two incomparable points p1 and p2 in U_R, the SW-closure of p1 and p2 is in U_R, and U_R is NE-closed if for every two incomparable points p1 and p2 in U_R, the NE-closure of p1 and p2 is in U_R. If U_R is NE-closed and SW-closed then U_R is NESW-closed [9].

A region S is the X-closure of a region R, denoted S = X(R), if S is the smallest X-closed region containing R, where X is in {NE, SW, NESW}. NESW-closure is abbreviated as the closure. The closure of a region R is the (well-defined) smallest closed region containing R. This definition holds if R is connected. If not, the closure consists of the union of the closures of each connected subset of regions.
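As a small sketch of the point-level definitions, under a literal reading and with integer (digital) coordinates, two incomparable points span an axis-aligned rectangle whose SW and NE corners are the corner points named above; the tuple representation of rectangles is an assumption of the example.

```python
def incomparable(p1, p2):
    """(x1 - x2)(y1 - y2) < 0, i.e. neither point dominates the other."""
    (x1, y1), (x2, y2) = p1, p2
    return (x1 - x2) * (y1 - y2) < 0

def closure_rectangle(p1, p2):
    """The axis-aligned rectangle spanned by two incomparable points, whose SW and
    NE corners are the corner points named in the definitions above."""
    (x1, y1), (x2, y2) = p1, p2
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```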

3 Inter-diagrammatic Reasoning

The fundamental concept of inter-diagrammatic reasoning is that reasoning is done about information in diagrams. The solution is drawn from inferences over a set of diagrams and is represented in another diagram (or a set of diagrams). The fundamental computational step is the combination of two diagrams.

Any computational step in IDR depends on the underlying uniform tessellation of the 2-dimensional space. The computation step is done on two diagrams with the same tessellation by evaluating an operator over each pair of corresponding tessera. Each pair of tessera can be computed independently of the others, constructing the resultant diagram. These computations are a good candidate for parallel computing.

A diagram in IDR is a bounded planar region with a discrete set of tessera; for discrete problems with isothetic rectangles the obvious tessellation is the rectangular grid. Each tessera has a corresponding color; if all tessera are colored WHITE then the diagram is called null. In the context of isothetic rectangles, a binary diagram could be sufficient, where each rectangle is filled with BLACK and the background is WHITE. But more information can be inferred from a diagram if it allows different gray levels (or colors).

A rectangle's color here will not be black; rather, it will have a certain gray level. When rectangles are added, the more rectangles that overlap, the darker the color will be, so the intersecting areas will be noticed easily. The colored diagrams (gray levels or colors) ease the evaluation of the level of intersection between the rectangles. The implementation of colored diagrams is the same as binary ones in IDR; only the color variable will be set differently.

As the concurrent system changes over time, iso-rectangles are added or deleted accordingly. Adding a rectangle means ORing its constructed diagram with the solution diagram. Deleting one means peeling off a rectangle without re-evaluating the whole solution. Removing a rectangle that intersects another one could cut pieces from the one underneath, but peeling off levels of gray gives us the tool to remove a rectangle without losing any information. Statistics about the concurrent system can also be read from the solution diagram, as the


degree of intersection represents the number of transactions requiring a single resource (entity). This is directly calculated from the gray level (color value).

To apply any IDR operator/function to these diagrams, the resultant diagram is computed by combining the colors of corresponding tesserae in the input diagrams. Let D1 and D2 be two input diagrams; the operators are:

or (union), denoted D1 ∨ D2, is the maximum of each pair of corresponding tessera values.

and (intersection), denoted D1 ∧ D2, is the minimum of each pair of corresponding tessera values.

overlay (add), denoted D1 + D2, is the sum of each pair of corresponding tessera values.

peel (difference), denoted D1 - D2, is the difference of each pair of corresponding tessera values.

not (complement), denoted ¬D1, is the difference between BLACK and each tessera value in the diagram.

In addition there are other mapping functions that work on a set of diagrams:

accumulate, denoted α(D, {D}, o), applies the binary operation o to the initial diagram D and the sequence of diagrams {D}.

map, μ(g, {D}), applies the function g to each diagram in the sequence {D}.

filter, φ(g, {D}), filters a sequence of diagrams by applying g to the sequence {D} and removing each diagram for which g returns false.

null is a boolean operator η(D) that returns true if the diagram D is all WHITE.

max returns the maximum color in D as an integer value.

min returns the minimum color in D as an integer value.

lambda, λv.b, is used for functional abstraction.

These operators are used in [7] to deal with geometrical properties of planar diagrams, where the functions are explained in detail.
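As an illustration only, the following minimal Python sketch realises these tessera-wise operators on numpy arrays; the 0..255 color range, the clamping of peel at WHITE and all function names are our assumptions, not part of the IDR definition.

    import numpy as np

    BLACK = 255   # assumed maximum color value; 0 is WHITE
    # A diagram is a 2-D integer array; every operator combines corresponding
    # tesserae independently, so each works elementwise over the whole grid.

    def idr_or(d1, d2):      return np.maximum(d1, d2)            # union
    def idr_and(d1, d2):     return np.minimum(d1, d2)            # intersection
    def idr_overlay(d1, d2): return d1 + d2                       # add
    def idr_peel(d1, d2):    return np.clip(d1 - d2, 0, None)     # difference
    def idr_not(d):          return BLACK - d                     # complement

    def accumulate(d, diagrams, op):          # alpha(D, {D}, o)
        for x in diagrams:
            d = op(d, x)
        return d

    def idr_map(g, diagrams):                 # mu(g, {D})
        return [g(d) for d in diagrams]

    def idr_filter(g, diagrams):              # phi(g, {D})
        return [d for d in diagrams if g(d)]

    def null(d):  return bool(np.all(d == 0)) # eta(D): all WHITE?
    def d_max(d): return int(d.max())
    def d_min(d): return int(d.min())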


4 Isothetic Rectangles and Concurrency

A concurrent system is composed of a set of concurrent nodes or transactions. A node can be a transaction in a database system, a process or task in an operating system, or any entity that requires some resources in a multiprocessing system. Each node requests a set of resources in a certain order as it runs; in other words, it performs a sequence of actions. The node can be subdivided into time intervals of equal length, where each interval represents an action: a unit time of accessing a certain resource.

Each node is represented as an axis on a cartesian graph and its actions are the coordinate values of that axis. Inconsistency happens when more than one node requests the same resource at the same time. Mutual exclusion, typically implemented with semaphores, is the main mechanism used to guarantee consistency: the resource is locked by the node that accesses it and is released (unlocked) when the resource is no longer required. If another node requests the same resource, it waits until the resource is released. The period during which two nodes request and release a resource can be visually represented by an isothetic rectangle, see figure 2.

Figure 2: Geometrical representation of a 2-transaction system (the axes carry the lock/unlock points Lx, Ux, Ly, Uy of the two transactions).

In figure 2, two nodes T1 and T2 request and release two resources x and y (denoted Lx for lock x and Ux for unlock x). The time elapsed in accessing resource x by T1 and T2 composes an isothetic rectangle in the 2D plane (T1, T2); similarly for any other resource used by both nodes. Inconsistency happens when an action is performed within the time covered by this rectangle, so such actions are scheduled outside that period of time. These rectangles are called forbidden regions.

The curve S in figure 2 represents the sequence of actions scheduled between T1 and T2. A horizontal line in the curve starting from coordinate i and ending at j means that the actions i to j of the node represented by the x-axis are to be


executed. Similarly, a vertical line in the curve starting from coordinate i′ and ending at j′ means that the actions i′ to j′ of the node on the y-axis are to be executed. A safe schedule is one that does not intersect any forbidden region.

Figure 3: The set of vertical and horizontal half planes.

Figure 4: The creation of a rectangle and its shadow (the south shadow is shown) from the half plane diagrams.

To map iso-rectangles into digital space, the basic tessellation of the images is a rectangular grid. In order to place iso-rectangles over the grid when required, a set of IDR operators is defined to create an image for a rectangle with given coordinate values. The construction step uses a set of pre-stored diagrams, HP_m, where m is the maximum coordinate value: for each increment on the axes a half plane diagram is defined in the horizontal and in the vertical direction. The set of half planes is an ordered pair HP_m = (H, V). The half plane diagrams are stored in order according to their axis positions, as in figure 3.

To construct a rectangle's diagram, the half plane diagrams of the corresponding four coordinates are used and the overlapped area defines the rectangle, see figure 4. Let H(x) denote a horizontal half plane diagram and V(y) a vertical half plane diagram. A rectangle is then defined as

R_i = ¬H(R_i,bottom) ∧ ¬V(R_i,left) ∧ V(R_i,right) ∧ H(R_i,top)
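A small sketch of this construction on numpy grids is given below; the orientation convention (a half plane covers all indices below its coordinate), the resulting inclusive/exclusive bounds, and the helper names are our assumptions. The last function anticipates the next paragraph by building a shadow from only three half planes.

    import numpy as np

    BLACK = 255   # assumed color for a filled tessera; 0 is WHITE

    def h_plane(m, y):
        # H(y): horizontal half-plane diagram; rows with index < y are BLACK.
        d = np.zeros((m, m), dtype=int); d[:y, :] = BLACK; return d

    def v_plane(m, x):
        # V(x): vertical half-plane diagram; columns with index < x are BLACK.
        d = np.zeros((m, m), dtype=int); d[:, :x] = BLACK; return d

    def rectangle(m, left, right, bottom, top):
        # R = not H(bottom) AND not V(left) AND V(right) AND H(top); under this
        # convention it covers rows bottom..top-1 and columns left..right-1.
        return np.minimum(
            np.minimum(BLACK - h_plane(m, bottom), BLACK - v_plane(m, left)),
            np.minimum(v_plane(m, right), h_plane(m, top)))

    def south_shadow(m, left, right, top):
        # A shadow overlaps only three half planes: dropping the rectangle's
        # lower bound lets the region extend all the way down to the axis.
        return np.minimum(np.minimum(BLACK - v_plane(m, left), v_plane(m, right)),
                          h_plane(m, top))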

For each rectangle four shadow diagrams are also constructed. These shadow diagrams are required in calculating the closure of the union of all rectangles. To construct a shadow, only three half plane diagrams are overlapped, as illustrated for the south shadow in figure 4.


5 Safe and Deadlock Free Reasoning

Safety is achieved by scheduling actions outside of the forbidden areas. However, the use of semaphores (lock/unlock operations in mutual exclusion) can deadlock some nodes; a deadlock is a cyclic wait relationship between transactions. Thus, besides safety, freedom from deadlock is another important requirement.

Graphically, deadlock means that a curve on the (T1, T2) relationship graph has reached a point from which it cannot proceed any further. The area marked D in figure 2 is a deadlock area: if a curve enters that region then it cannot move forward without violating the mutual exclusion condition. From the geometry of isothetic rectangles, region D is the SW-closure of the union of the two connected rectangles in the diagram. The SW-closure is therefore computed and added to the forbidden region, and any schedule will then proceed avoiding deadlock. [12] calculated the NE-closure as well as the SW-closure for a safe solution; similarly, we add the NE-closure to the forbidden region (the area denoted U in figure 2), although we may relax that later.

The set of transaction pair relationship diagrams, T, represents the current state of the system. Dynamically, a new request arrives from one of the nodes, say Ti requests resource f. The request is added to the axis of Ti. If Tj has f on its axis, then the points of the two matching Lf and Uf operations construct a new forbidden region that is added to the (Ti, Tj) diagram. Any action for (Ti, Tj) can then be scheduled outside the new forbidden region.

Let ℱ denote a forbidden region, let τ be a concurrent system, and let Ti ≠ Tj be two transactions of τ. Let ℱij be the forbidden region of (Ti, Tj) and let S be a schedule not intersecting ℱij. The goal is to find the connected closure of all rectangles in τ. The SW-closure and NE-closure are computed and added to ℱij, and any schedule outside that area is safe. A schedule S corresponds to an increasing curve from O to F that avoids all such iso-rectangles. The two serial histories are the curves OTiF and OTjF. The schedule S provides maximum concurrency if it increases diagonally towards F.

5.1 Reasoning in 2-transaction System

A system with only two concurrent transactions is represented in a single isothetic diagram where the actions of T1 are placed on the x-axis and those of T2 on the y-axis. Any request for the same resource x by T1 and T2 produces an iso-rectangle with coordinates T1Lx, T1Ux, T2Lx and T2Ux; let us call it R1. If R1 overlaps another iso-rectangle (region), say R2, then the closure needs to be added if R1 and R2 have incomparable points. If the rectangles are connected and have no incomparable points then they are simply merged into one diagram. The SW-closure is calculated from the union of: 1) the intersection of the south shadow diagram of R1 and the west shadow diagram of R2, 2) the intersection of the west shadow of R1 and the south shadow of R2, 3) R1, and 4) R2, see figure 5. The NE-closure is calculated similarly from the north and east shadow diagrams of R1 and R2.


R_NE = (R1_north ∧ R2_east) ∨ (R1_east ∧ R2_north)

R_SW = (R1_south ∧ R2_west) ∨ (R1_west ∧ R2_south)

R_SW ∨ R1 ∨ R2 ∨ R_NE = closure(R1, R2)

Figure 5: Calculating the SW-closure of R1 and R2, SW(R1, R2).

SW(R1, R2) = (R1_south ∧ R2_west) ∨ (R1_west ∧ R2_south) ∨ R1 ∨ R2

After calculating the closure, the union of each pair of shadow diagrams in each direction is taken. The newly calculated region is then subtracted from each shadow, to ensure that the shadow does not overlap the region. Having explained how the union and closure are calculated from two rectangles and their shadows, we now give the algorithm for calculating the closure of n rectangles dynamically over the (Ti, Tj) diagram.

Figure 6: T1 requests resource f.

For each two-transaction diagram, (Ti, Tj), whenever both transactions require the same entity an iso-rectangle is created in a diagram, together with its four shadow diagrams as stated in section 4. The rectangle is stored in the set R. The union of the rectangles with their closure is accumulated in a diagram called (Ti, Tj). As the closure is calculated for a connected set of iso-rectangles, each connected set is stored in a separate diagram in the set C. Each diagram in C has a different color, which is used as an index into C; this allows for direct


lookups given the color of the region. The same colors are given to the corresponding regions in (Tj, Ti). The algorithm loops over R until all rectangles are added. Usually, R contains one diagram when a request is issued and required by another node, but if more than one required resource is requested then each one is represented in a separate diagram (if there is a mutual exclusion period between the two transactions) and added to R. The algorithm works the same for one request or more. Each time a new request is received a single iteration of this algorithm is executed. The algorithm works as follows:

1. Take a rectangle from R, say Ri.

2. Evaluate I = Ri ∧ (Ti, Tj).

3. If η(I) then Ri does not overlap (Ti, Tj), which means that Ri is a new disconnected region. Ri is colored with a new color and added to the set C, and Ri is overlaid on (Ti, Tj): (Ti, Tj) ∨ Ri.

4. If Ri intersects (Ti, Tj) then the closure needs to be calculated:

a. Get the region that intersects Ri by color = max(I).

b. Get from C the diagram indexed by color, C_color.

c. Subtract (Ti, Tj) - C_color.

d. Calculate closure(Ri, C_color) as in figure 5 and color it with color.

e. Replace Ri with closure(Ri, C_color).

5. Repeat from step 2 until the intersection of Ri with (Ti, Tj) is empty. The algorithm continues as before until R is empty.

An example is given in figure 7; a compact sketch of this loop is given below.
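The following simplified Python sketch uses boolean numpy masks instead of colored IDR diagrams, and computes the shadow diagrams by cumulative ORs towards each side of the grid; the mask-based bookkeeping and all names are our assumptions rather than the paper's colored-diagram implementation.

    import numpy as np

    def _acc(mask, axis, reverse):
        m = np.flip(mask, axis) if reverse else mask
        m = np.logical_or.accumulate(m, axis=axis)
        return np.flip(m, axis) if reverse else m

    def shadows(mask):
        # Shadow diagrams of one region: the region extended all the way
        # towards each side of the grid (rows = y, columns = x).
        return {"south": _acc(mask, 0, True),  "north": _acc(mask, 0, False),
                "west":  _acc(mask, 1, True),  "east":  _acc(mask, 1, False)}

    def closure(r1, r2):
        # NESW-closure of two connected regions, following figure 5.
        s1, s2 = shadows(r1), shadows(r2)
        sw = (s1["south"] & s2["west"]) | (s1["west"] & s2["south"])
        ne = (s1["north"] & s2["east"]) | (s1["east"] & s2["north"])
        return sw | ne | r1 | r2

    def add_rectangle(regions, rect):
        # Dynamic step of section 5.1: repeatedly merge the new rectangle with
        # any connected region it intersects, recomputing the closure after
        # each merge, until no remaining region intersects it.
        merged, rest = rect, list(regions)
        changed = True
        while changed:
            changed = False
            for i, region in enumerate(rest):
                if np.any(merged & region):
                    merged = closure(merged, region)
                    del rest[i]
                    changed = True
                    break
        rest.append(merged)
        return rest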

5.2 Reasoning for d-transaction System

Geometrically, each node is an axis of a d-dimensional coordinate system, with the actions being the coordinate values on the axes and d the number of transactions. In the proposed technique each pair of transactions corresponds to a plane with a grid imposed by their actions. The time interval elapsed between the Lx/Ux operations on a certain entity x by these two transactions produces an iso-rectangle with coordinates imposed by these actions. The relation between each pair of transactions is represented in a single diagram. A point (a, b) on the grid represents the state in which the first a actions of Ti and the first b actions of Tj have been executed. The whole system is defined in d(d - 1)/2 diagrams.

In real-time concurrency, when a transaction Ti requires an action f, this action needs to be scheduled safely. That may also require rescheduling some or all existing requests that have not been executed yet. This can be done as follows: first of all, the action is added to the axes of the (d - 1) diagrams of Ti


Figure 7: Example of how the closure is calculated by introducing a new rectangle.

(in relation to all other transactions) with its expected finish time if possible. If another transaction Tj requests f then an iso-rectangle is created in the (Ti, Tj) diagram. As the rectangle is added over the diagram, the closure is calculated using the algorithm in section 5.1. The set of these diagrams defines the schedules for Ti.

To define the safe periods of time in which that action can be scheduled (or a total schedule for Ti), all diagrams of Ti are accumulated using the "+" operator. Finally, a schedule that does not intersect any forbidden area in the accumulated diagram is safe and deadlock free. In figure 8 the solid curve represents the executed actions and the dotted one is a safe schedule for Ti (where the horizontal parts are actions of Ti and the vertical segments are actions from T2, T3, or T4). The concurrent system can thus be controlled from this diagram, which represents the system as it is running while planning.

Besides inferring a schedule for a transaction Ti, other information can be deduced easily when accumulating with the "+" operator. Before calculating the accumulated diagram Ai, all the (Ti, Tj) diagrams are colored with a single minimum gradient, x, by an IDR mapping function applied to the diagrams (Ti, Tj), j = 1..N. A rectangle in (Ti, Tj) means that two transactions require a single resource r, and the same rectangle in the accumulated diagram Ai represents the number of transactions requiring r at that time. This is represented diagrammatically by the color value of that rectangle: dividing the accumulated color value by x gives the number of transactions that requested r.
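A minimal numpy sketch of this accumulation step is shown below; the gradient normalisation stands in for the IDR recoloring function mentioned above, and the names are ours.

    import numpy as np

    def accumulate_schedules(pair_diagrams, x=1):
        # Sum ("+") the (Ti, Tj) diagrams of one transaction Ti. Every diagram
        # is first normalised to a single gradient x, so that in the
        # accumulated diagram the color divided by x is the number of
        # transactions requesting a resource in that cell.
        normalised = [np.where(d > 0, x, 0) for d in pair_diagrams]
        acc = sum(normalised)
        return acc, acc // x   # accumulated diagram, per-cell transaction counts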


Figure 8: A 4-transaction system; the first three diagrams represent the relationship between T1 and all the others when T1 requests resource f, and the bottom diagram is the proposed schedule for T1.

5.3 Algorithm Complexity

Let us first review the complexity of each IDR operator and function as defined in [7]. The complexity of obtaining the union, intersection, overlay and peel is constant on a parallel machine, assuming that a processor is assigned to each pixel. The IDR functions (accumulate, map and filter) depend on the function to be applied; the null, max and min functions take O(log p) time, where p is the number of pixels in the diagram. Therefore, obtaining the union of N iso-rectangles, with each rectangle in a separate diagram, is O(N). The space complexity is constant, as these N diagrams are accumulated incrementally in a single diagram. Similarly, most of the isothetic rectangle problems are solved in linear time on the given parallel machine.

The closure of two isothetic regions is computed from the union of the overlaps of the four shadow diagrams, taken two at a time. For each pair of iso-rectangles five constant-time operations are performed to calculate the closure. This is the time taken for a single closure calculation; the total time of the concurrency algorithm, however, depends on the number of disconnected regions in the pair relationship diagram. The closure is computed only for the connected sets of rectangles on (Ti, Tj) that intersect Ri, so if C has k connected components then the algorithm will perform the closure at most k times for any added rectangle; moreover, as each merge reduces the number of regions by 1, the total number of pairwise closure computations cannot be greater than N - 1. In each step it takes constant time to calculate the closure and O(log p) to get the maximum color or check for null intersection, so the total complexity is O(N log p) time


and O(k) space, where k is the maximum number of disconnected regions in the whole system, k ≤ N.

In concurrency control in a multi-transaction system, the number of diagrams defined in the system is quadratic in the number of transactions: a diagram is needed to represent the relationship between each pair of transactions. When a request is scheduled for Ti, at most (d - 1) diagrams are considered, namely the diagrams that represent the relationship of Ti with all other transactions. Thus the complexity of scheduling an action requested by a transaction Ti at any given time depends on the complexity of calculating the closure in each pair-relation diagram and on the accumulation of Ti's diagrams. Since the total cost of calculating the closure in each diagram is O(k log p), the complexity of defining the safe and deadlock free periods in a d-transaction system is O(k log p + d). In practice k is much smaller than N: even if, in the worst case, k = N at a given time and the new request intersects these N components, then k = 1 before the next closure calculation.

In summary, most IDR operations for isothetic rectangles take O(n) time and O(p) space. However, the size of the diagram, p, is fixed and can be treated as constant space. The IDR solution to isothetic rectangles is more efficient than previously given solutions for concurrency control (or VLSI design) and is not difficult to implement.

6 Conclusion

We proposed a solution for representing and solving the concurrency control problem diagrammatically from a set of isothetic rectangles. Isothetic rectangles are handled in a digital framework. Using IDR, the solution is represented and calculated on a pixel by pixel basis. In comparison to the continuous-space solutions, where line-sweep techniques were used, this algorithm takes O(N log p) time.

The proposed solution also allows the system to work in a dynamic mode of operation. We do not calculate the solution once after having all requirements (rectangles); instead, the solution is calculated on a one by one basis. This allows for a better fit with most current database management systems. The representation also provides a graphical view of the concurrent system over a period of time, depending on the size of the axes. From this graphical representation different information can be extracted for free: the resources that are accessed by a certain transaction, the peak and off-peak periods for each transaction, how many transactions request a single resource at a given time, and other information that is implicit in the accumulated diagrams.

Although isothetic rectangles have had well known applications since the 1960s, more problems can be developed within the new computational model. The applicability of isothetic rectangles in dynamic and distributed systems should be investigated in more detail, including the relationship between different transactions in a multi-transaction system to handle concurrency; especially the distinction between read/write pairs (RR, RW, WW) and how the transactions


should be checked for concurrency under these distinctions, and what other information can be inferred from the diagrams. Reliability and scalability also need to be investigated. In terms of IDR, more efficient data structures can be developed that allow for faster handling of the underlying operators.

References

[1] M. Anderson and R. McCartney. Diagram processing: Computing with diagrams. Artificial Intelligence, 145(l-2):181-226, 2003.

[2] S. Carson and J. P. Reynolds. The geometry of semaphore programs. ACM transactions on Programming Languages and Systems, 9(l):25-53, January 1987.

[3] E. W. Dijkstra. Co-operating sequential processes. In F. Genuys, editor, Programming Languages, pages 43-110. Academic Press, 1968.

[4] M. Kankanhalli and W. R. Franklin. Area and perimeter computation of the union of a set of iso-rectangles in parallel. Journal of Parallel and Distributed Computing, 27:107-117, 1995.

[5] K. Lam, C. Pang, and S. Son. Resolving executing-committing conflicts in distributed real-time database systems. In COMPUTER JOURNAL, volume 42, pages 674-692, 1999.

[6] W. Lipski and C. H. Papadimitriou. A fast algorithm for testing for safety and detecting deadlocks in locked transaction systems. Journal of Algorithms, 2(3):211-226, September 1981.

[7] R. McCartney and P. El-Kafrawy. Inter-diagrammatic reasoning and digital geometry. In Proceedings of the Third International Conference, Diagrams 2004, pages 199-215, Cambridge, UK, 2004. LNAI vol. 2980.

[8] F. Preparata and M. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, 1985.

[9] E. Soisalon-Soininen and D. Wood. An optimal algorithm for testing for safety and detecting deadlocks in locked transaction systems. ACM, 2(3):108-116, 1982.

[10] A. Tanenbaum. Modern Operating Systems. Prentice Hall, 2nd edition, 2001.

[11] O. Ulusoy. Performance issues in processing active real-time transactions. In LECTURE NOTES IN COMPUTER SCIENCE, volume 1553, pages 98-118, 1998.

[12] M. Yannakakis, C. H. Papadimitriou, and H. T. Kung. Locking policies: Safety and freedom from deadlock. IEEE Symposium on Foundations of Computer Science, pages 286-297, October 1979.


On Disjunctive Representations of Distributions and Randomization

T. K. Satish Kumar
Gates 250, Knowledge Systems Laboratory

Stanford University, U.S.A.

[email protected]

Abstract

We study the usefulness of representing a given joint distribution as a positive linear combination of disjunctions of hypercubes, and generalize the associated results and techniques to Bayesian networks (BNs). The fundamental idea is to pre-compile a given distribution into this form, and employ a host of randomization techniques at runtime to answer various kinds of queries efficiently. Generalizing to BNs, we show that these techniques can be effectively combined with the dynamic programming-based ideas of message-passing and clique-trees to exploit both the topology (conditional independence relationships between the variables) and the numerical structure (structure of the conditional probability tables) of a given BN in efficiently answering queries at runtime.

1 Introduction

We present a novel method for representing and reasoning with joint distributions, and generalize this method to probabilistic models like BNs. The fundamental idea is to represent a joint distribution as a positive linear combination of disjunctions of hypercubes, and employ a host of randomization techniques at runtime to answer various kinds of queries efficiently (in time that is only polynomial in the size of this representation). We argue that because such a representation is much more compact (often exponentially so) than various other schemes, the computational complexity of a multitude of fairly important AI problems like Bayesian inference and MAP (maximum a posteriori) hypothesis selection can be made much less than the traditional complexities attached with them. In particular, we will show how we can pre-compile a given BN into a series of disjunctions of hypercubes, and exploit both its topology (conditional independence structure between the variables) and its numerical structure (the structure of its conditional probability tables (CPTs)) for answering queries efficiently at runtime. Two surprising results that follow from our approach are: (1) Bayesian inference, which has traditionally been characterized as being exponential in the tree-width of the variable-interaction graph (moralized graph), can be made exponential only in a small factor r that is much less than the tree-width, and (2) the problem of MAP hypothesis selection, which has traditionally been characterized as being exponential in the constrained tree-width of the same graph (the constrained tree-width is much greater than


0.04 ([0 <= Xi <= 3][0 <= Xj <= 2] ∨ [2 <= Xi <= 4][1 <= Xj <= 4])
+ 0.03 ([0 <= Xi <= 3][1 <= Xj <= 4] ∨ [2 <= Xi <= 4][0 <= Xj <= 2])
+ 0.02 ([0 <= Xi <= 1][0 <= Xj <= 4] ∨ [0 <= Xi <= 4][2 <= Xj <= 3] ∨ [3 <= Xi <= 4][0 <= Xj <= 4])

Figure 1: Shows the hypercube-based representation of a joint distribution over 2 random variables Xi and Xj (both having domains of size 4). The size of this represen­tation is only 7 (as opposed to the 16 entries required in the tabular representation). The regions are indicated by shaded areas.


Figure 2: The left side of the figure shows a simple example to illustrate why insisting on disjointness of the hypercubes can lead to larger representations. The right side of the figure illustrates the idea of importance sampling in the FPRAS for estimating the volume of a disjunction of hypercubes.

the tree-width, and even worse, depends on the query itself), can also be made exponential only in the same small factor r (which is independent of the query).

2 Hypercube-Based Representations of Joints

Consider a single joint distribution (say, on some discrete random variables X = {X1, X2 ... XN}) given explicitly in tabular form, and consider answering two kinds of queries (for some subsets of variables Y and Z): (1) computing P(Y = y / Z = z) (inference queries), and (2) computing argmax_y P(Y = y / Z = z) (MAP queries). In the worst case, both kinds of queries require us to consider all the entries in the table, and are therefore exponential in the number of variables (because the size of the table is exponential in the number of variables). The fundamental idea in this paper is to do much better by doing a fair amount of work before any query is presented to us (hence significantly reducing the amount of work we need to do at runtime).

In particular, we first pre-compile a given joint distribution into a positive linear combination of regions as shown in Figure 1. Each region is a disjunction of hypercubes, and a hypercube, in turn, is a conjunction of upper and


lower bounding planes along each dimension. Further, each entry in the joint is equal to the sum of the weights attached to all those regions that subsume the space corresponding to that entry. Such a disjunctive representation of the joint distribution is much more compact (see definition below) than the tabular representation. Finally, at runtime (when a query is presented to us), the idea is to answer the presented queries in time that is only polynomial in the size of this hypercube-based representation.

Definition 1: The size of a hypercube-based representation of a joint distribution is equal to the number of hypercubes in it.

Although the size of the hypercube-based representation could in the worst case still be exponential in the number of variables, it is always guaranteed to be better than the tabular representation.¹ More importantly, the hypercube-based representation is a framework for exploiting the numerical structure of a joint distribution (which is different from just exploiting the independence relationships between the random variables). Note that we do not insist on the hypercubes being disjoint. Figure 2 (left side) shows an example where the disjunctive representation of a region defined by two intersecting hypercubes is (by definition) only of size 2, but insisting on disjointness requires at least 5 hypercubes to describe it. In general, insisting on disjointness can lead to a hypercube-based representation that is exponential in the number of dimensions (variables), while the disjunctive representation can still be only polynomial.

While hypercube-based decompositions of joints can be made irrespective of whether we are dealing with continuous or discrete distributions, we will work with regions as if they were continuous, and assume that in the discrete case the probability mass at a point is distributed uniformly in the "cell" corresponding to that entry (all "cells" are conceptually of the same dimensions). When required, we will recover and revert this transformation. The issue of how we can represent a joint optimally (most compactly) using a positive linear combination of disjunctions of hypercubes, and the combinatorial problem of finding the optimal domain orderings, are alluded to later in the paper; it is important to note that these problems arise only in the offline phase (before any query is presented to us). Our real concern now is how we can answer queries efficiently (at runtime) using the hypercube-based representations.

We will use the following notation. We will assume that the domains of all the variables are ordered in some way. A hypercube Hi can then be represented as a conjunction of upper and lower bounding hyperplanes along each dimension. That is, Hi = [L1^{Hi} <= r(X1) <= U1^{Hi}] ∧ [L2^{Hi} <= r(X2) <= U2^{Hi}] ... [LN^{Hi} <= r(XN) <= UN^{Hi}]. Here, [Lj^{Hi} <= r(Xj) <= Uj^{Hi}] indicates that the rank of the domain values allowed for variable Xj should be within Lj^{Hi} and Uj^{Hi}.² A region is then a disjunction of hypercubes, represented as Ri = Hi^1 ∨ Hi^2 ... Hi^{Mi}, and a distribution P is a positive linear combination of regions, represented as Σ_{i=1..T} wi·Ri (wi > 0). A query Q specifies a range of values for some of the variables (im-

¹ We can always decompose a joint distribution into a positive linear combination of hypercubes, each corresponding to a single entry in the distribution.

² For continuous distributions, this can indicate the actual range of values taken by Xj.


plicitly marginalizing over the other variables) and can therefore be represented as Q = [L1^Q <= r(X1) <= U1^Q] ∧ [L2^Q <= r(X2) <= U2^Q] ... [LN^Q <= r(XN) <= UN^Q]. Further, we will represent the part of a hypercube Hi within the bounding planes specified by a query Q by Hi ∧ Q, and the part of a region Ri within these planes by Ri ∧ Q. We will use #Ri to denote the volume of the region Ri (area or length when there are only 2 or 1 variables, respectively).

Lemma 1: For any region R = H1 ∨ H2 ... HM, R ∧ Q = (H1 ∧ Q) ∨ (H2 ∧ Q) ... (HM ∧ Q), and the size of the hypercube-based representation of R ∧ Q can only be less than that of R.

Proof: By the simple distributive law, R ∧ Q = (H1 ∧ Q) ∨ (H2 ∧ Q) ... (HM ∧ Q). Now consider any Hj ∧ Q. Since Hj is of the form [L1^{Hj} <= r(X1) <= U1^{Hj}] ∧ [L2^{Hj} <= r(X2) <= U2^{Hj}] ... [LN^{Hj} <= r(XN) <= UN^{Hj}], we have that Hj ∧ Q = [max(L1^{Hj}, L1^Q) <= r(X1) <= min(U1^{Hj}, U1^Q)] ∧ [max(L2^{Hj}, L2^Q) <= r(X2) <= min(U2^{Hj}, U2^Q)] ... [max(LN^{Hj}, LN^Q) <= r(XN) <= min(UN^{Hj}, UN^Q)], which is also a hypercube. Therefore, every hypercube in the disjunction yields a hypercube when ∧-ed with Q (except when the result is the null space, in which case we remove it from the disjunction). This proves that the size of the hypercube-based representation of R ∧ Q can only be less than that of R.
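A minimal Python sketch of such a representation and of the ∧-operation used in Lemma 1 is shown below; the class and function names and the half-open treatment of the rank intervals are our choices for illustration only.

    from dataclasses import dataclass
    from typing import List, Tuple, Optional

    # A hypercube is a list of (lower, upper) rank bounds, one pair per
    # variable; a region is a disjunction (list) of hypercubes; the joint is a
    # weighted sum of regions.
    Hypercube = List[Tuple[float, float]]

    @dataclass
    class Region:
        hypercubes: List[Hypercube]

    @dataclass
    class HypercubeJoint:
        weights: List[float]
        regions: List[Region]

    def intersect(h: Hypercube, q: Hypercube) -> Optional[Hypercube]:
        # Lemma 1: H ^ Q is again a hypercube (max of lower bounds, min of
        # upper bounds per dimension), or empty if some interval collapses.
        out = [(max(l1, l2), min(u1, u2))
               for (l1, u1), (l2, u2) in zip(h, q)]
        return None if any(lo >= hi for lo, hi in out) else out

    def restrict(region: Region, q: Hypercube) -> Region:
        # R ^ Q: intersect each hypercube with Q and drop the empty ones, so
        # the representation can only shrink.
        kept = [hq for h in region.hypercubes
                if (hq := intersect(h, q)) is not None]
        return Region(kept)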

2.1 Answering Inference Queries

We claim that inference queries can be reformulated as volume estimation problems when a joint distribution P is represented implicitly as a positive linear combination of disjunctions of hypercubes.

Lemma 2: Let P be a probability distribution over the variables X1, X2 ... XN represented implicitly as Σ_{i=1..T} wi·Ri (with Ri = Hi^1 ∨ Hi^2 ... Hi^{Mi}). Then, for any query Q, P(Q) = Σ_{i=1..T} wi·#(Ri ∧ Q).

Proof: We first consider the case when Q is a specific assignment for all the variables. Geometrically, this represents a point, and its probability (as specified by the joint distribution P) has been written out as a positive linear combination of weights attached to regions that subsume it. By definition therefore, P(Q) = Σ_{i=1..T} wi·h(Ri, Q). Here, h(Ri, Q) indicates whether the point Q lies within the region Ri. Now, let Q be a general query. We have that P(Q) = Σ_{Q'} P(Q'), where Q' is a complete assignment to all the variables (i.e. a single point) and is a consistent extension of Q (i.e. Q' is contained in Q). This means that P(Q) = Σ_{Q'} Σ_{i=1..T} wi·h(Ri, Q'). Interchanging the summations, we have that P(Q) = Σ_{i=1..T} Σ_{Q'} wi·h(Ri, Q'). Now, Σ_{Q'} h(Ri, Q') = #(Ri ∧ Q), hence giving us that P(Q) = Σ_{i=1..T} wi·#(Ri ∧ Q).

The above Lemma establishes our ability to efficiently answer inference queries if we can efficiently estimate the volume of a disjunction of hypercubes.

2.1.1 Estimating the Volume of a Disjunction of Hypercubes:

One way to estimate the volume of the disjunction of a set of potentially intersecting hypercubes (i.e. of the region R = H1 ∨ H2 ... HM) is to use the principle


ALGORITHM: SAMPLE-COUNTER
INPUT: Hypercubes H1, H2 ... HM in N-dimensional space.
OUTPUT: A counter (H, p) sampled uniformly at random (H is a hypercube and p is a point in the N-dimensional space).
(1) For each hypercube Hi:
    (a) Let #Hi = Π_{j=1..N} (Uj^{Hi} - Lj^{Hi}).
(2) Choose H = Hi with probability proportional to #Hi.
(3) Choose a point p uniformly at random from H as follows:
    (a) For j = 1 to N:
        (A) Choose s in [0, 1] uniformly at random.
        (B) Set the Xj coordinate of p to s·Lj^H + (1 - s)·Uj^H.
(4) RETURN: (H, p).
END ALGORITHM

Figure 3: Shows the algorithm for uniform sampling from the space of counters associated with a disjunction of hypercubes.

ALGORITHM: ESTIMATE-VOLUME
INPUT: Hypercubes H1, H2 ... HM in N-dimensional space.
OUTPUT: The volume of their disjunction H1 ∨ H2 ... HM.
(1) countBottomMost = 0.
(2) For i = 1 to N' (= 4M log_e(2/δ) ε^{-2}):
    (a) (Hk, p) = SAMPLE-COUNTER(H1, H2 ... HM).
    (b) If p does not lie in any of H1, H2 ... H_{k-1}:
        (A) Set countBottomMost = countBottomMost + 1.
(3) Let f = countBottomMost / N'.
(4) RETURN: f · (#H1 + ... + #HM).
END ALGORITHM

Figure 4: Shows the algorithm for estimating the volume of a disjunction of intersecting hypercubes. ε and 1 - δ are respectively the relative approximation and confidence factors in the FPRAS.

of inclusion and exclusion. Here, we add up the volumes of all the hypercubes independently, and subtract from them all the volumes of the pairwise intersections, and so forth. The obvious problem with this approach, however, is that we have to consider an exponential number of terms, and the computation is therefore infeasible. A second approach is to perform sampling. The naive way of doing this is to choose points uniformly at random (from the entire space) one at a time (treating them as samples), count the fraction of the N' samples that lie in any of the hypercubes, and scale appropriately. However, this method is not very useful because the Estimator Theorem for uniform sampling relates the number of samples N' with the relative approximation factor ε and the

confidence factor 1 - δ through the equation N' >= 4 log_e(2/δ) / (ρ ε²). Here ρ is the actual fraction of the volume occupied by the hypercubes, and if it happens to be exponentially low, we need an exponentially large number of samples.

To get around this problem, we leverage the extra structure present in the hypercube-based decomposition of a given region (see Figures 2 (right side), 3 and 4). A point lying within any Hi also lies within the region R. Moreover, a point within Hi = [L1^{Hi} <= r(X1) <= U1^{Hi}] ∧ [L2^{Hi} <= r(X2) <= U2^{Hi}] ... [LN^{Hi} <= r(XN) <= UN^{Hi}] can be sampled uniformly at random by independently choosing a value for each Xj between Lj^{Hi} and Uj^{Hi} uniformly at random. Also, the


volume of Hi is given by (U1^{Hi} - L1^{Hi}) × (U2^{Hi} - L2^{Hi}) ... (UN^{Hi} - LN^{Hi}). Imagine a series of columns that can hold objects called counters (see right side of Figure 2). Suppose that there is a column associated with each possible point in the N-dimensional space, and suppose that for every Hi we put a counter in the column of every point that lies within it. Suppose we order the hypercubes in some way and throw the counters corresponding to each Hi into the columns in that order. It is easy to see that #R is proportional to the number of points whose columns have at least one counter in them. This in turn is equal to the number of "bottom-most" counters. Since we know that the total number of counters is proportional to #H1 + #H2 ... #HM, estimating the fraction of bottom-most counters leads us to estimating #(H1 ∨ H2 ... HM) as required. We have to ensure two things: (1) we can sample uniformly at random among all the counters, and (2) estimating the fraction of bottom-most counters does not require an exponential number of samples. We can take care of (1) by picking a hypercube Hi with probability proportional to #Hi, and then sampling a point in it uniformly at random (see Figure 3). We can take care of (2) by noticing that, since there are no more than M counters in any column, the actual fraction of bottom-most counters is bounded below by 1/M. Note that a sampled counter can be checked to see if it is a bottom-most counter in polynomial time by verifying that none of the hypercubes occurring before the chosen hypercube contain the point corresponding to this counter. By the Estimator Theorem, therefore, the number of samples required to get an (ε, δ) approximation suffices to be 4M log_e(2/δ) ε^{-2}. This yields an FPRAS (fully polynomial-time randomized approximation scheme) for estimating the volume of a disjunction of hypercubes, the running time of which is polynomial in M, N, log_e(1/δ) and 1/ε.
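The two procedures of Figures 3 and 4 can be prototyped directly; the Python sketch below follows them with the sample count 4M log_e(2/δ)/ε² from the text, representing each hypercube as a list of (lower, upper) bounds per dimension (parameter and function names are ours).

    import math, random

    def volume(h):
        # #H = product of the side lengths.
        v = 1.0
        for lo, hi in h:
            v *= (hi - lo)
        return v

    def sample_counter(hypercubes, vols):
        # SAMPLE-COUNTER: pick a hypercube with probability proportional to
        # its volume, then a point uniformly inside it.
        k = random.choices(range(len(hypercubes)), weights=vols)[0]
        p = [lo + random.random() * (hi - lo) for lo, hi in hypercubes[k]]
        return k, p

    def contains(h, p):
        return all(lo <= x <= hi for (lo, hi), x in zip(h, p))

    def estimate_volume(hypercubes, eps=0.1, delta=0.05):
        # ESTIMATE-VOLUME: the fraction of sampled counters that are
        # "bottom-most" (their point is not covered by any earlier hypercube)
        # scales the total counter mass sum(#Hi) down to #(H1 v H2 ... HM).
        if not hypercubes:
            return 0.0
        vols = [volume(h) for h in hypercubes]
        n = int(4 * len(hypercubes) * math.log(2 / delta) / eps ** 2)
        bottom = 0
        for _ in range(n):
            k, p = sample_counter(hypercubes, vols)
            if not any(contains(hypercubes[i], p) for i in range(k)):
                bottom += 1
        return (bottom / n) * sum(vols)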

2.2 Answering MAP Queries

MAP queries are typically much more complex than inference queries; in one sense, answering them requires us to evaluate the probability of every possible combination of values of the MAP variables and report the one with the maximum probability.³ One simple way to answer MAP queries is to compute the probability of every possible combination of values of the MAP variables using the volume estimation scheme, and choose the best such combination. This procedure is no longer exponential in the number of variables, but is still exponential in the number of MAP variables. Our first attempt to circumvent this is to choose only a few (a constant or polynomial number of) randomly selected combinations of domain values for the MAP variables, evaluate their probabilities, and return the best among them. The obvious problem with this approach is that there are an exponential number of possible combinations, and the probability that we hit upon H* (the true MAP hypothesis) is exponentially low. We will now show how randomization can again help us get around this problem.

³ Note that for continuous distributions, MAP queries involving domain intervals are more natural than specific values to variables.


ALGORITHM: SAMPLE-POINT
INPUT: Hypercubes H1, H2 ... HM in N-dimensional space.
OUTPUT: A point in H1 ∨ H2 ... HM sampled uniformly at random.
(1) Set Pass = False.
(2) While (Pass == False):
    (a) (H, p) = SAMPLE-COUNTER(H1, H2 ... HM).
    (b) Let k be the number of Hi (1 <= i <= M) containing p.
    (c) Set Pass = True with probability 1/k.
(3) RETURN: p.
END ALGORITHM

Figure 5: Shows the algorithm for sampling a point uniformly at random from the space of a disjunction of hypercubes.

2.2.1 Uniform Sampling in a Disjunction of Hypercubes:

Given a region R = H1 ∨ H2 ... HM, Figure 3 shows the procedure for uniform sampling from the space of counters (as defined in the previous subsection), and Figure 5 shows the procedure for using this towards sampling uniformly at random from the set of all points in R.

Lemma 3: Figure 3 samples a counter uniformly at random.

Proof: Continuing the discussion in the previous subsection, the probability that Hi is chosen in step 2 is (#Hi / Σ #Hi), and the probability that a particular counter associated with Hi is chosen in step 3 is (1/#Hi). The probability of a particular counter being chosen is therefore (#Hi / Σ #Hi)(1/#Hi) = (1 / Σ #Hi), which is the same for all counters.

Lemma 4: Upon termination, Figure 5 samples a point in R uniformly at random.

Proof: We prove this Lemma by induction on the number of iterations. Consider any point p in R, and let kp be the number of hypercubes that it is in. In step 2(a), the probability that we choose a counter in its column is (kp / Σ #Hi). In step 2(c), the probability that p is passed as the chosen sample is (1/kp), hence making the probability of choosing any point p in R (as the required sample) equal to (1 / Σ #Hi), which is the same for all points in R. A point that is not in R will never be chosen because it does not induce any counters.

Lemma 5: After L iterations, Figure 5 passes a sample with probability >= 1 - e^{-L/M}.

Proof: The probability that no sample is passed in the first iteration is 1 - 1/k1 (for some 1 <= k1 <= M), which is <= 1 - 1/M. The probability that no sample is passed after L iterations is therefore <= (1 - 1/M)^L, which is <= e^{-L/M}, hence establishing the truth of the Lemma.

Lemma 6: The running time complexity of Figure 5 is O(MN + L(M + N)).

Proof: The complexity of the steps in Figure 3 that are independent of the iterations in Figure 5, and can therefore be done just once, is O(MN). The complexity of the remaining steps is O(M + N), and since the number of iterations of Figure 5 is L, the total complexity is O(MN + L(M + N)).

It is worth noting that Figure 5 is a Las Vegas algorithm, and if we assume that e^{-100} (the probability of the algorithm not terminating) is small enough, we can set L = 100M so that the running time of Figure 5 is O(M(M + N)).
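A self-contained Python sketch of this Las Vegas sampler, with the default cap L = 100M suggested above, is shown below (hypercubes are again lists of (lower, upper) bounds; all names are ours).

    import math, random

    def sample_point(hypercubes, max_iters=None):
        # SAMPLE-POINT: draw counters (a hypercube chosen with probability
        # proportional to its volume plus a uniform point inside it) and
        # accept the point with probability 1/k, where k is the number of
        # hypercubes containing it; accepted points are uniform over the union.
        vols = [math.prod(hi - lo for lo, hi in h) for h in hypercubes]
        inside = lambda h, p: all(lo <= x <= hi for (lo, hi), x in zip(h, p))
        iters = max_iters if max_iters is not None else 100 * len(hypercubes)
        for _ in range(iters):
            h = random.choices(hypercubes, weights=vols)[0]
            p = [lo + random.random() * (hi - lo) for lo, hi in h]
            k = sum(inside(hc, p) for hc in hypercubes)
            if random.random() < 1.0 / k:
                return p
        return None   # non-termination has probability at most e^{-L/M}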


ALGORITHM: SAMPLE-DISTR
INPUT: A positive linear combination of regions Σ_{i=1..T} wi·Ri that implicitly represents a joint distribution P over the variables X1, X2 ... XN.
OUTPUT: A sample (complete assignment to all the variables), a, drawn according to the distribution P.
(1) For each Ri:
    (a) Compute #Ri = ESTIMATE-VOLUME(Ri).
(2) Choose Ri with probability wi·#Ri.
(3) RETURN: SAMPLE-POINT(Ri).
END ALGORITHM

Figure 6: Shows the algorithm for sampling a complete assignment to all the variables according to a distribution P represented using a positive linear combination of disjunctions of hypercubes.

2.2.2 Sampling from a Joint Distribution:

Figure 6 illustrates an algorithm for sampling from a joint distribution P (over the variables X1, X2 ... XN) when it is represented implicitly using a positive linear combination of disjunctions of hypercubes.

Lemma 7: Figure 6 is polynomial in Σ_{i=1..T} Mi, 1/ε, log_e(1/δ) and N.

Proof: This follows directly from the complexity of the volume estimation procedure in step 1(a) and the complexity of the sampling procedure in step 3.

Lemma 8: Figure 6 chooses a complete assignment a with probability P(a).

Proof: Let φi(a) indicate whether the complete assignment a lies in the region Ri. The probability that a is chosen is equal to Σ_{i=1..T} (#Ri·wi)(φi(a)/#Ri) = Σ_{i=1..T} wi·φi(a). From Lemma 2, the term Σ_{i=1..T} wi·φi(a) is equal to P(a).

Lemma 9: To sample a partial assignment b for a subset of the variables V = {Xi1, Xi2 ... Xik} ⊆ {X1, X2 ... XN} according to the marginal distribution P_V, it suffices to sample a complete assignment a according to P and take its projection on the set V.

Proof: From the description of the sampling procedure, the probability of choosing b is Σ_{complete assignment a} P(a)·(a is consistent with b). This summation is equivalent to marginalization, and is equal to P(b) as required.

Lemma 10: To sample a complete assignment a to all of the variables X = {X1, X2 ... XN} according to the distribution P, it suffices to sample a partial assignment b to the variables V = {Xi1, Xi2 ... Xik} according to the marginal distribution P_V, and subsequently sample an assignment to the variables X\V according to the conditional distribution P_{(X\V)/(V=b)}.

Proof: The probability of choosing a complete assignment a (according to the above procedure) is Σ_b P(b)(Σ_c P(c/b)·(c and b are consistent with a)). The inner summation is equivalent to marginalization, and is equal to P(a/b) when b is consistent with a. The whole term then becomes equal to Σ_b P(b)P(a/b) = P(a), as required.

2.2.3 An Atlantic-City Algorithm for MAP:

From the foregoing Lemmas, we can design a randomized algorithm for sampling from the marginal distribution over the MAP variables by first sampling a complete assignment as shown in Figure 6, and then taking the projection of this


ALGORITHM: MAP-HYPOTHESIS
INPUT: A distribution P over the N variables X1, X2 ... XN represented implicitly as a positive linear combination of disjunctions of hypercubes; a subset of variables Y ⊆ X.
OUTPUT: A MAP hypothesis H* over the variables Y.
(1) currentBestValue = 0.
(2) For i = 1 to L:
    (a) a = SAMPLE-DISTR(P).
    (b) Take the projection of a onto the variables Y (denoted aY).
    (c) For each variable in Y:
        (A) Modify aY to be the appropriate interval (domain value) for continuous (discrete) distributions.
    (d) If P(aY) > currentBestValue:
        (A) currentBestValue = P(aY).
        (B) currentBestAssgn = aY.
(3) RETURN: currentBestAssgn.
END ALGORITHM

Figure 7: Shows the algorithm for MAP hypothesis selection. The key idea is to exploit the fact that we can efficiently sample a hypothesis according to its true distribution. Note that step 2(d) is carried out using the volume estimation procedure.

sample over the MAP variables. The complexity of this procedure is dominated by the former step, and is only polynomial in the size of the hypercube-based representation of the joint. In every sample drawn, therefore, the probability that we hit upon H* (the true MAP hypothesis) is P(H*), its actual probability. This addresses the problem with our first attempt, where H* could only be sampled with an exponentially low probability. More formally, we can design a randomized algorithm for MAP hypothesis selection as shown in Figure 7.

Lemma 11: Figure 7 returns the MAP hypothesis over the variables Y with probability >= 1 - e^{-L·P(H*)}.

Proof: From Lemma 9, we know that step 2(c) samples an assignment H for the variables Y with probability P(H). The probability that we do not hit H* in any single iteration is <= 1 - P(H*). After L iterations, the probability that we do not hit H* is <= (1 - P(H*))^L, which is <= e^{-L·P(H*)}, hence proving the Lemma.

In the above algorithm, if we assume that e^{-100} is small enough, we can set L to 100/P(H*), making the running time only polynomial in the size of the hypercube-based representation of the joint and in 1/P(H*). A trivial upper bound on this is polynomial in N and 1/ε, for an absolute approximation factor of ε, hence leading us to a randomized approximation scheme for MAP.⁴
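The selection loop itself is small; the Python sketch below assumes callables sample_distr, project and prob standing in for SAMPLE-DISTR, the projection step and the inference query P(aY), none of which are implemented here.

    def map_hypothesis(sample_distr, project, prob, num_iters):
        # MAP-HYPOTHESIS loop: draw complete assignments from the joint,
        # project them onto the MAP variables, score each projection with the
        # inference machinery (volume estimation), and keep the best seen.
        best_value, best_assignment = 0.0, None
        for _ in range(num_iters):
            ay = project(sample_distr())
            p = prob(ay)
            if p > best_value:
                best_value, best_assignment = p, ay
        return best_assignment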

2.3 Conditioning

Often, queries are presented that require us to reason about conditional distributions (based on some observations). We will now show that the foregoing results carry over directly to conditional distributions.

Lemma 12: Given a joint P = Σ_{i=1..T} wi·Ri over variables X = {X1, X2 ... XN}, the size of the hypercube-based representation of P conditioned on the observations Z = z (Z ⊆ X) remains unchanged (or can only decrease).

⁴ In most practical cases, P(H*) is fairly high for H* to be even considered as a hypothesis (or a diagnosis). If we make the additional assumption that P(H*) >= a (for some application-specific known constant a), the convergence is much faster, and L (the number of iterations) needs to be only a constant, viz. 100/a.


Proof: By Bayes rule, P(X / Z = z) = P(X, Z = z)/P(Z = z). By Lemma 2, P(X, Z = z) is given by Σ_{i=1..T} wi·(Ri ∧ Z = z). Note that ∧-ing Z = z with Ri does not increase the number of hypercubes, because Z = z (like Q in Lemma 1) constitutes a single hypercube. Further, the factor P(Z = z) is a single number dividing wi (for all 1 <= i <= T), and can in turn be computed efficiently using Lemma 2 and the volume estimation procedure.

3 Hypercube-Based Representations of BNs

A cluster-tree over a BN G (over the variables X = {X1, X2 ... XN}) is a tree each of whose nodes is associated with a cluster (a subset of X). Each edge is annotated with a subset of the nodes called the separator. We say that a cluster-tree T over G satisfies the family values property if, for every family (a node and its parents), there exists some cluster C in T that subsumes the variables in it. We say that T satisfies the running intersection property if, whenever there is a variable Xi such that Xi is in C and Xi is in C', then Xi is also in every cluster on the path in T between C and C'. A cluster-tree that satisfies both the family values and the running intersection properties is called a clique-tree, and its nodes are referred to as cliques (over the subset of variables they contain).

Clique-trees constitute a dynamic programming perspective on exploiting the independence relationships in a BN for answering queries (see [3] for details). Specifically, the complexities of answering inference and MAP queries are related to the size of the largest clique constructed while running the variable-elimination algorithm using a chosen ordering. The best ordering yields the tree-width and is used to answer inference queries, while the best ordering with the additional constraint that all the non-MAP variables have to be eliminated before any MAP variable yields the constrained tree-width and is used to answer MAP queries. The constrained tree-width can be much larger than the tree-width and, even worse, depends on the MAP variables in the query itself. In this section, we will show how we can improve upon these complexities using the ideas presented in the previous sections together with the standard ideas of message-passing and clique-trees.

3.1 Hypercube-Based Message Passing and Sampling

The fundamental idea in hypercube-based message-passing (see Figure 9) for clique-tree calibration is to still make use of clique-trees and message-passing, but the data structures maintained inside each clique are made compact and efficient. Figure 8 compares the data structures maintained in each clique by the traditional and the hypercube-based methods. Traditional approaches represent the potential of a clique by maintaining a table that is exponential in the size of the clique (and perhaps analytically representing the dependency on any continuous variables). In hypercube-based approaches, we maintain a table that is exponential only in the communication size of that clique. The


Domain(A) = {1, 2}    Domain(B) = [0, 10]    Domain(C) = {1, 2, 3}    Domain(D) = {1, 2}

(a) Explicit storage of the potential (each entry holds an analytic function f(B) of the continuous variable B).

(b) Hypercube-based storage of the potential: a table indexed by the communication variables A and D, each entry holding a weight and a bounded hypercube representation, where

HCR = hypercube-based representation of the potential over A, B, C and D,
E1 = [0 <= A <= 1] ∧ [0 <= B <= 10] ∧ [0 <= C <= 3] ∧ [1 <= D <= 2],
E2 = [1 <= A <= 2] ∧ [0 <= B <= 10] ∧ [0 <= C <= 3] ∧ [1 <= D <= 2].

Figure 8: Compares the data structures maintained in each clique by the traditional versus the hypercube-based approaches. In the former case, we need a table of size 8 (assuming that each entry specifies the distribution over the continuous variable B analytically). In the latter case, we need only 4 entries when we have a bounded hypercube-based decomposition of the clique's potential.

communication of a clique C is the set of variables that C shares with any of its neighbors (denoted C⊙), and the communication size is the cardinality of this set. Each entry in the table consists of two parts: (1) a weight field that incorporates any information that the clique may receive from its neighbors, and (2) an HCR field that essentially ∧-s the hypercube-based representation of the initial potential (Π0(C⊙)_HCR in Figure 9) of that clique (say bounded by size P) with the particular assignment of the communicating variables corresponding to that entry (the result also has a hypercube-based representation of size bounded by P).

Suppose that all the initial potentials have hypercube-based decompositions bounded by size P. Then we immediately have an algorithm that extends the message-passing algorithm, and whose message-passing phase has a running time that is polynomial in P and exponential not in the size of the largest clique but in the size of the largest communication. The truth of this claim follows from the following observation: summing out variables U ⊆ V in the joint over V in order to obtain a marginal joint over V\U is equivalent to obtaining a probability for every possible assignment to the variables in V\U, and this (by Lemma 2) is equivalent to estimating the volume of the hypercube-based representation of the joint for the different values of V\U one at a time.

Figure 9 shows the working of message-passing algorithms that employ the hypercube-based techniques for sending messages. An incoming message is incorporated by multiplying it into the currently maintained potential over the weights. An outgoing message is obtained by marginalization over the appropriate missing variables, where marginalization takes into account both the weights and the volumes of the disjunctions of hypercubes. Note that an incoming message can always be appropriately incorporated, and an outgoing message can always be correctly computed, because the variables over which they are defined form a subset of the variables over which the potential (table) is maintained.
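
As an illustration of the outgoing-message computation just described, the following sketch is ours and not the paper's code; the encoding of the entries and the disjointness assumption used for the volume computation are assumptions. The message towards a neighbor over separator Y is obtained by summing, over all entries that agree with a given assignment to Y, the entry's weight times the volume of its hypercube-based representation.

from collections import defaultdict
from math import prod

def box_volume(box):
    """Volume of one box {var: (lo, hi)}; point intervals count as a factor of 1."""
    return prod((hi - lo) if hi > lo else 1.0 for lo, hi in box.values())

def hcr_volume(hcr):
    """Volume of a disjunction of boxes, assuming the boxes are pairwise disjoint."""
    return sum(box_volume(b) for b in hcr)

def outgoing_message(entries, comm_vars, separator):
    """entries: {assignment over comm_vars: (weight, hcr)}.
    Returns a message table over assignments to the separator variables."""
    positions = [comm_vars.index(v) for v in separator]
    message = defaultdict(float)
    for assignment, (weight, hcr) in entries.items():
        y = tuple(assignment[p] for p in positions)
        message[y] += weight * hcr_volume(hcr)   # sums out the non-separator variables
    return dict(message)

Under these assumptions this corresponds to the quantity Σ_{C⊙\Y} Π(C⊙)_W × #Π(C⊙)_HCR of steps 7(f) and 8(B) in Figure 9, with an exact disjoint-box volume standing in for the randomized estimate sketched earlier.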



ALGORITHM: HYPERCUBE-CALIBR
INPUT: A clique-tree T for G.
RESULT: Calibration of T.

(1) For each clique C: (a) Set Π_INIT(C) = 1.
(2) For each family F: (a) Choose a clique C s.t. F ⊆ C. (b) Π_INIT(C) = Π_INIT(C) × CPT(F).
(3) For each clique C: (a) Π_0(C⊙)_HCR = DECOMP[Π_INIT(C)]. (b) Set Π_0(C⊙)_W = 1.
(4) Pick any clique C_r as the root.
(5) polist = list of cliques in post order.
(6) Let rpolist be the reverse of polist.
(7) While polist is not empty:
    (a) Let C = next item on polist.
    (b) Let C_1, C_2, ..., C_k be C's children.
    (c) Let C_+ be C's parent.
    (d) Set Π(C⊙)_W = Π_0(C⊙)_W × μ(C_1→C) × ... × μ(C_k→C).
    (e) Let Y = C ∩ C_+.
    (f) Set μ(C→C_+) = Σ_{C⊙\Y} Π(C⊙)_W × #Π(C⊙)_HCR.
(8) While rpolist is not empty:
    (a) Let C = next item on rpolist.
    (b) Let C_1, C_2, ..., C_k be C's children.
    (c) For i = 1 to k:
        (A) Let Y = C ∩ C_i.
        (B) Set μ(C→C_i) = Σ_{C⊙\Y} Π(C⊙)_W × #Π(C⊙)_HCR.
        (C) Set Π(C_i) = Π(C_i) × μ(C→C_i).
END ALGORITHM

Figure 9: Shows the working of the hypercube-based message-passing algorithm for clique-tree calibration.

ALGORITHM: CLQTR-SAMPLING
INPUT: A calibrated clique-tree T with root C_r.
OUTPUT: A sample a drawn according to the distribution represented by T.

(1) Let prlist be the cliques arranged in prefix order with C_r as the root.
(2) While prlist is not empty:
    (a) Let C be the next item on prlist.
    (b) Let C_1, C_2, ..., C_k be C's children.
    (c) Choose an assignment a(C) to all the variables in C according to Π(C).
    (d) For i = 1 to k:
        (i) Π(C_i) = Π(C_i) / a(C ∩ C_i).
(3) RETURN: a.
END ALGORITHM

Figure 10: Shows the algorithm for sampling from a joint represented by a clique-tree. Step 2 implements the iterative conditioning and dynamic programming paradigm, exploiting the properties of the clique-tree and the truth of Lemmas 9 and 10.

Figure 10 shows the procedure for sampling from the joint distribution represented by the clique-tree in time exponential only in the size of the largest communication. Here, step 2(c) is implemented using the procedure SAMPLE1-DISTR (Figure 6), and step 2(d) is implemented using simple ∧-ing. The correctness of the procedure follows from the truth of Lemmas 9 and 10. Note that while the original message-passing algorithm incorporated evidence Z = z by marginalizing only over the entries that were consistent with Z = z (see [3]), hypercube-based algorithms incorporate evidence by ∧-ing each term in the hypercube-based representation of the joint distribution with Z = z (such an ∧-ing results in a representation that is still bounded by size P). Also note that since the algorithm in Figure 7 involves drawing only a polynomial number of samples (see Lemma 11), the complexity of answering MAP queries is also exponential only in the size of the largest communication.
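
To make the evidence step concrete, the following sketch is our own and not the paper's code; boxes are encoded as mappings from variables to closed intervals, as in the earlier sketches. ∧-ing every term with the evidence Z = z intersects it with the corresponding point constraints, so terms can only shrink or disappear and the result still has at most P terms, as noted above.

def incorporate_evidence(hcr, evidence):
    """AND every box of an HCR (list of {var: (lo, hi)} dicts) with evidence {var: value}."""
    result = []
    for box in hcr:
        new_box = dict(box)
        consistent = True
        for var, value in evidence.items():
            lo, hi = new_box.get(var, (float('-inf'), float('inf')))
            if lo <= value <= hi:
                new_box[var] = (value, value)   # shrink the interval to the point z
            else:
                consistent = False              # this term is inconsistent with Z = z
                break
        if consistent:
            result.append(new_box)
    return result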



ALGORITHM: HYPCUBE-CLQTR
INPUT: A BN G.
OUTPUT: A clique-tree J s.t. (A) Π_INIT(C) (for all cliques C) has a hypercube-based representation with size bounded by P, and (B) the size of the largest communication is minimized.
(1) J = M-Ordering-Clique-Tree(G).
(2) While (True):
    (a) If there exists (C_i, C_j) s.t.
        (A) Π_INIT(C_i) × Π_INIT(C_j) has a hypercube-based representation with size bounded by P, and
        (B) |(C_i ∪ C_j)⊙| < |(C_i)⊙ ∪ (C_j)⊙|,
        then (i) Merge (C_i, C_j) in J.
    (b) Else Break.
END ALGORITHM

Figure 11: An offline algorithm that tries to minimize the exponential factor (the size of the maximum communication) for hypercube-based message-passing and sampling algorithms. In step 1, although the tree-width itself is hard to find, m-ordering (see [6]) is used to obtain good approximations.

Further, this factor is independent of the MAP variables, which enables us to perform pre-computation on a given BN even in the context of answering MAP queries.

3.2 Offline Computation

Although the message-passing phase of the hypercube-based algorithm is exponential only in the size of the largest communication, the initial phase is still exponential in the size of the largest clique (routine 'DECOMP' in Figure 9). The point to note, however, is that this is a one-time process that is independent of any evidence or query. Once we compile a BN into a series of disjunctions of hypercubes for the initial potentials of all the cliques, we are in a position to perform all of the previous tasks efficiently, and it is this message-passing and sampling phase that matters at runtime and that turns out to be exponential only in the size of the largest communication.

Because of the above arguments, our goal is now to produce clique-trees that minimize the size of the largest communication while keeping the hypercube-based decompositions of the initial potentials of all the cliques bounded by size P. Figure 11 shows an algorithm that runs offline and tries to minimize the size of the largest communication while retaining the bounded hypercube-based representation property. The idea is to use the clique-tree generated by m-ordering (see [6]) over the moralized graph of the BN as a starting point, and then to merge cliques until merging them any further yields no benefit.^ It is easy to see that such a merging process maintains the family-preservation and running-intersection properties required for clique-trees. In most practical domains, the size of the largest communication comes out significantly smaller than the tree-width of the moralized graph (as approximated by m-ordering); indeed, we are always assured of doing at least as well as this factor, because we use it as the starting point in Figure 11 and only improve upon it.

^Even the offline decomposition of a large potential can be done simply by multiplying the hypercube-based decompositions of the individual potentials it incorporates, and using subsequent algebraic simplification.
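
A minimal sketch of the operation described in this footnote (ours; the encoding of terms as interval boxes with weights is an assumption carried over from the earlier sketches): the product of two hypercube-based decompositions is obtained by intersecting their boxes pairwise and multiplying the corresponding weights, after which identical boxes can be merged as a simple algebraic simplification.

def multiply_hcr(hcr1, hcr2):
    """Product of two decompositions given as lists of ({var: (lo, hi)}, weight) terms."""
    product_terms = []
    for bounds1, w1 in hcr1:
        for bounds2, w2 in hcr2:
            bounds, empty = {}, False
            for var in set(bounds1) | set(bounds2):
                lo1, hi1 = bounds1.get(var, (float('-inf'), float('inf')))
                lo2, hi2 = bounds2.get(var, (float('-inf'), float('inf')))
                lo, hi = max(lo1, lo2), min(hi1, hi2)
                if lo > hi:
                    empty = True            # the two boxes do not intersect
                    break
                bounds[var] = (lo, hi)
            if not empty:
                product_terms.append((bounds, w1 * w2))
    return product_terms

def simplify(hcr):
    """Merge terms whose boxes coincide by summing their weights."""
    combined = {}
    for bounds, weight in hcr:
        key = tuple(sorted(bounds.items()))
        combined[key] = combined.get(key, 0.0) + weight
    return [(dict(key), w) for key, w in combined.items()]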



4 Related Work and Conclusions

We presented a novel method for representing joint distributions as positive linear combinations of disjunctions of hypercubes, and provided randomized algorithms for reasoning with them efficiently. We generalized this to BNs (exploiting both the independence structure between the variables and the structure of the CPTs) to show important (and surprising) implications for the complexity of answering inference and MAP queries. The work presented in this paper generalizes our earlier work (presented in [5]) to cases where the random variables are not necessarily Boolean. [5] presents a method for pre-compiling a given BN (over Boolean random variables) into a series of SAT instances in DNF (disjunctive normal form), and subsequently employing randomization to answer queries that can be reformulated as DNF counting and/or DNF sampling tasks. [2] shows a slightly different method for representing BNs, and [1] employs counting schemes for SAT instances in DNNF (decomposable negation normal form). The DNNF representation, however, is significantly weaker than the DNF representation in the sense that the latter can be compact where the former is exponentially larger. Our future work proceeds along two lines. First, we are looking into various combinatorial arguments for finding (near-)optimal hypercube-based decompositions of a given joint, including whether we can perturb it for better results, what tradeoffs need to be made, and so on. Second, we are extending the theory developed in this paper to deal with influence diagrams, POMDPs, and various other tasks in probabilistic reasoning. We are also applying these techniques to various real-life scenarios that require fast probabilistic inference, diagnosis, and/or decision making.

References

[1] Darwiche A. (2001). On the Tractable Counting of Theory Models and its Applications to Belief Revision and Truth Maintenance. Journal of Applied Non-Classical Logics. 2001.

[2] Darwiche A. (2002). A Logical Approach to Factoring Belief Networks. Proceedings of KR '2002.

[3] Jensen F. V. and Jensen F. (1994). Optimal Junction Trees. Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI '94).

[4] Karp R., Luby M. and Madras N. (1989). Monte-Carlo Approximation Algorithms for Enumeration Problems. Journal of Algorithms. 1989.

[5] Kumar T. K. S. (2002). SAT-Based Algorithms for Bayesian Network Inference. Proceedings of the 22nd SGAI International Conference on Knowledge-Based Systems and Applied Artificial Intelligence (ES'2002).

[6] Tarjan R. E. and Yannakakis M. (1984). Simple Linear-Time Algorithms to Test Chordality of Graphs, Test Acyclicity of Hypergraphs, and Selectively Reduce Acyclic Hypergraphs. SIAM Journal on Computing. 1984.


AUTHOR INDEX

Abdelmoty, A 299
Allen, TJ 3
Alonso, CJ 244
Benbrahim, H 258
Bosse, T 19
Bramer, M 258
Caraça-Valente, J 231
Chan, SWK 117
Compatangelo, E 44, 130
Croitoru, M 130
Cunningham, P 33
Debenham, J 173
El-Geresy, B 299
El-Kafrawy, P 313
Garagnani, M 214
Haque, N 187
Hopgood, AA 3
Jennings, NR 187
Jodogne, S 285
Jonker, CM 19
Jurgelenaite, R 157
Knight, B 73
Kumar, TKS 327
López-Illescas, A 231
Loughrey, J 33
Lucas, P 157, 269
Luck, M 144
McCarthy, K 101
McCartney, R 313
McGinty, L 101
McQueen, T 3
McSherry, D 87
Miles, S 144
Moreau, L 144, 187
Papay, J 144
Perez-Perez, A 231
Piater, JH 285
Rahman, TA 73
Reilly, J 101
Rodríguez, JJ 244
Santamaria, A 231
Scharlau, B 44
Schut, MC 19
Sleeman, D 58
Smyth, J 101
Tepper, JA 3
Treur, J 19
Tuson, A 201
Van Gerven, M 285
Vasconcelos, W 44, 58
Woon, FL 73
Zhang, Y 58
