
Data Mining, Rough Sets and Granular Computing


Studies in Fuzziness and Soft Computing

Editor-in-chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland. E-mail: [email protected] http://www.springer.de/cgi-bin/search_book.pl?series=2941

Further volumes of this series can be found at our homepage.

Vol. 74. H.-N. Teodorescu, L. C. Jain and A. Kandel (Eds.) Hardware Implementation of Intelligent Systems, 2001 ISBN 3-7908-1399-0

Vol. 75. V. Loia and S. Sessa (Eds.) Soft Computing Agents, 2001 ISBN 3-7908-1404-0

Vol. 76. D. Ruan, J. Kacprzyk and M. Fedrizzi (Eds.) Soft Computing for Risk Evaluation and Management, 2001 ISBN 3-7908-1406-7

Vol. 77. W. Liu Propositional, Probabilistic and Evidential Reasoning, 2001 ISBN 3-7908-1414-8

Vol. 78. U. Seiffert and L. C. Jain (Eds.) Self-Organizing Neural Networks, 2002 ISBN 3-7908-1417-2

Vol. 79. A. Osyczka Evolutionary Algorithms for Single and Multicriteria Design Optimization, 2002 ISBN 3-7908-1418-0

Vol. 80. P. Wong, F. Aminzadeh and M. Nikravesh (Eds.) Soft Computing for Reservoir Characterization and Modeling, 2002 ISBN 3-7908-1421-0

Vol. 81. V. Dimitrov and V. Korotkich (Eds.) Fuzzy Logic, 2002 ISBN 3-7908-1425-3

Vol. 82. Ch. Carlsson and R. Fuller Fuzzy Reasoning in Decision Making and Optimization, 2002 ISBN 3-7908-1428-8

Vol. 83. S. Barro and R. Marin (Eds.) Fuzzy Logic in Medicine, 2002 ISBN 3-7908-1429-6

Vol. 84. L. C. Jain and J. Kacprzyk (Eds.) New Learning Paradigms in Soft Computing, 2002 ISBN 3-7908-1436-9

Vol. 85. D. Rutkowska Neuro-Fuzzy Architectures and Hybrid Learning, 2002 ISBN 3-7908-1438-5

Vol. 86. M. B. Gorzalczany Computational Intelligence Systems and Applications, 2002 ISBN 3-7908-1439-3

Vol. 87. C. Bertoluzza, M.A. Gil and D.A. Ralescu (Eds.) Statistical Modeling, Analysis and Management of Fuzzy Data, 2002 ISBN 3-7908-1440-7

Vol. 88. R. P. Srivastava and T.J. Mock (Eds.) Belief Functions in Business Decisions, 2002 ISBN 3-7908-1451-2

Vol. 89. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena and R. R. Yager (Eds.) Technologies for Constructing Intelligent Systems 1, 2002 ISBN 3-7908-1454-7

Vol. 90. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena and R.R. Yager (Eds.) Technologies for Constructing Intelligent Systems 2, 2002 ISBN 3-7908-1455-5

Vol. 91. J.J. Buckley, E. Eslami and T. Feuring Fuzzy Mathematics in Economics and Engineering, 2002 ISBN 3-7908-1456-3

Vol. 92. P. P. Angelov Evolving Rule-Based Models, 2002 ISBN 3-7908-1457-1

Vol. 93. V.V. Cross and T.A. Sudkamp Similarity and Compatibility in Fuzzy Set Theory, 2002 ISBN 3-7908-1458-X

Vol. 94. M. MacCrimmon and P. Tillers (Eds.) The Dynamics of Judicial Proof, 2002 ISBN 3-7908-1459-8


Tsau Young Lin · Yiyu Y. Yao · Lotfi A. Zadeh (Editors)

Data Mining, Rough Sets and Granular Computing

With 104 Figures and 56 Tables

Springer-Verlag Berlin Heidelberg GmbH


Professor Tsau Young Lin San Jose State University The Metropolitan University of Silicon Valley Department of Mathematics and Computer Science One Washington Square San Jose, CA 95192-0103 USA [email protected]

Professor Yiyu Y. Yao University of Regina Department of Computer Science Regina, Saskatchewan, S4S 0A2 Canada [email protected]

Professor Lotfi A. Zadeh University of California Berkeley Initiative in Soft Computing (BISC) Computer Science Division and Electronics Research Laboratory Department of Electrical and Electronics Engineering and Computer Science Berkeley, CA 94720-1776 USA [email protected]

ISSN 1434-9922 ISBN 978-3-7908-2508-4 ISBN 978-3-7908-1791-1 (eBook) DOI 10.1007/978-3-7908-1791-1

Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Data mining, rough sets, and granular computing: with 56 tables / Tsau Young Lin .. ed. - Heidelberg; New York: Physica-Verl., 2002

(Studies in Fuzziness and Soft Computing; Vol. 95)

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Physica-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 2002. Originally published by Physica-Verlag Heidelberg in 2002. Softcover reprint of the hardcover 1st edition 2002.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Hardcover Design: Erich Kirchner, Heidelberg


Preface

During the past few years, data mining has grown rapidly in visibility and importance within information processing and decision analysis. This is particularly true in the realm of e-commerce, where data mining is moving from a "nice-to-have" to a "must-have" status.

In a different though related context, a new computing methodology called granular computing is emerging as a powerful tool for the conception, analysis and design of information/intelligent systems. In essence, data mining deals with summarization of information which is resident in large data sets, while granular computing plays a key role in the summarization process by drawing together points (objects) which are related through similarity, proximity or functionality. In this perspective, granular computing has a position of centrality in data mining.

Another methodology which has high relevance to data mining and plays a central role in this volume is that of rough set theory. Basically, rough set theory may be viewed as a branch of granular computing. However, its applications to data mining have predated those of granular computing.

This volume is the result of a two-year project aimed at coalescing the concepts and techniques of granular computing on one side, and rough set theory on another. It consists of a collection of up-to-date and authoritative expositions of the basic theories underlying data mining, granular computing and rough set theory, and stresses their wide-ranging applications. A principal aim of our work is to stimulate an exploration of ways in which progress in data mining can be enhanced through integration with granular computing and rough set theory.

T.Y. Lin, Y.Y. Yao, L.A. Zadeh


Contents

Preface v T.Y. Lin, Y.Y. Yao and L.A. Zadeh

PART 1: GRANULAR COMPUTING - A NEW PARADIGM

Some Reflections on Information Granulation and its Centrality in Granular Computing, Computing with Words, the Computational Theory of Perceptions and Precisiated Natural Language 3

L.A. Zadeh

PART 2: GRANULAR COMPUTING IN DATA MINING

Data Mining Using Granular Computing: Fast Algorithms for Finding Association Rules 23

T.Y. Lin and E. Louie

Knowledge Discovery with Words Using Cartesian Granule Features: An Analysis for Classification Problems 46

J.G. Shanahan

Validation of Concept Representation with Rule Induction and Linguistic Variables 91

S. Tsumoto

Granular Computing Using Information Tables 102 Y.Y. Yao and N. Zhong

A Query-Driven Interesting Rule Discovery Using Association and Spanning Operations 125

J.P. Yoon and L. Kerschberg


PART 3: DATA MINING

An Interactive Visualization System for Mining Association Rules 145
J. Han, N. Cercone and X. Hu

Algorithms for Mining System Audit Data 166
W. Lee, S.J. Stolfo and K.W. Mok

Scoring and Ranking the Data Using Association Rules 190
B. Liu, Y. Ma and C.K. Wong

Finding Unexpected Patterns in Data 216
B. Padmanabhan and A. Tuzhilin

Discovery of Approximate Knowledge in Medical Databases Based on Rough Set Model 232
S. Tsumoto

PART 4: GRANULAR COMPUTING

Observability and the Case of Probability 249 C. Alsina, J. Jacas and E. Trillas

Granulation and Granularity via Conceptual Structures: A Perspective From the Point of View of Fuzzy Concept Lattices 265

R. Belohlavek

Granular Computing with Closeness and Negligibility Relations 290 D. Dubois, A. Hadj-Ali and H. Prade

Application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty 308

A. Fattah, V. Pouchkarev, A. Belenki, A. Ryjov and L.A. Zadeh


Basic Issues of Computing with Granular Probabilities 339 G.J. Klir

Multi-dimensional Aggregation of Fuzzy Numbers Through the Extension Principle 350

G. Mayor, A.R. de Soto, J. Suñer and E. Trillas

On Optimal Fuzzy Information Granulation 364 A. Ryjov

Ordinal Decision Making with a Notion of Acceptable: Denoted Ordinal Scales 398

R.R. Yager

A Framework for Building Intelligent Information-Processing Systems Based on Granular Factor Space 414

F. Yu and C. Huang

PART 5: ROUGH SETS AND GRANULAR COMPUTING

GRS: A Generalized Rough Sets Model 447 X. Hu, N. Cercone, J. Han and W. Ziarko

Structure of Upper and Lower Approximation Spaces of Infinite Sets 461 D.S. Malik and J.N. Mordeson

Indexed Rough Approximations, A Polymodal System, and Generalized Possibility Measures 474

S. Miyamoto

Granularity, Multi-valued Logic, Bayes' Theorem and Rough Sets 487 Z. Pawlak

The Generic Rough Set Inductive Logic Programming (gRS-ILP) Model 499

A. Siromoney and K. Inoue

Possibilistic Data Analysis and Its Similarity to Rough Sets 518
H. Tanaka and P. Guo


Part 1

Granular Computing - A New Paradigm


Some Reflections on Information Granulation and its Centrality in Granular Computing, Computing with Words, the Computational Theory of Perceptions and Precisiated Natural Language

Lotfi A. Zadeh

Berkeley Initiative in Soft Computing (BISC), Computer Science Division and the Electronics Research Laboratory, Department of EECS, University of California, Berkeley, CA 94720-1776; [email protected]

The past few years have witnessed what in retrospect may be seen as a turning point in the evolution of fuzzy logic. What I have in mind is the debut of four linked methodologies: granular computing, computing with words, the computational theory of perceptions and precisiated natural language. What follows is a view of the links between the underlying structures of these methodologies - a view which is presented from a personal perspective.

In our quest for machines which are capable of performing complex tasks, we are developing a better understanding of the centrality of information granulation in human cognition, reasoning and decision-making [2]. In many contexts, information granulation is a reflection of the bounded ability of sensory organs and, ultimately, the brain, to resolve detail and store information. In other contexts, granulation is employed to solve a complex problem by partitioning it into simpler subproblems. This is the essence of the strategy of divide and conquer.

In a general setting, a granule is a clump of real or mental objects (points) drawn together by indistinguishability, similarity, proximity or functionality. Modes of information granulation (IG) in which granules are crisp play important roles in a wide variety of methods, approaches and techniques. Among them are: interval analysis, quantization, chunking, rough set theory, diakoptics, divide and conquer, Dempster-Shafer theory, machine learning from examples, qualitative process theory, decision trees, semantic networks, analog-to-digital conversion, constraint programming, cluster analysis and many others. In this context, particularly worthy of note is Professor Pawlak's theory of rough sets [1].

To Professors Z. Pawlak and J. Kacprzyk.

Research supported in part by ONR Contract N00014-99-C-0298, NASA Contract NCC2-1006, NASA Grant NAC2-1177, ONR Grant N00014-96-1-0556, ONR Grant FDN0014991035, ARO Grant DAAH 04-961-0341 and the BISC Program of UC Berkeley.


Important though it is, crisp IG has a major blind spot. More specifically, it fails to reflect the fact that in much - perhaps most - of human reasoning and concept formation the granules are fuzzy rather than crisp. For example, the fuzzy granules of a human head are the nose, ears, forehead, hair, cheeks, etc. Each of the fuzzy granules is associated with a set of fuzzy attributes, e.g., in the case of hair, the fuzzy attributes are color, length, texture, etc. In turn, each of the fuzzy attributes is associated with a set of fuzzy values. Specifically, in the case of the fuzzy attribute length (hair), the fuzzy values are long, short, not very long, etc. The fuzziness of granules, their attributes and their values is characteristic of the ways in which human concepts are formed, organized and manipulated.

Fuzzy information granulation (fuzzy IG) underlies the remarkable human capability to perform a wide variety of physical and mental tasks without any measurements and any computations. In performing such tasks, e.g., driving in city traffic, we employ our perceptions of distance, speed, direction, shape, intent, likelihood, truth and other attributes of physical and mental objects. A basic characteristic of perceptions is their intrinsic imprecision. More specifically, perceptions are, for the most part, f-granular in the sense that: (a) the boundaries of perceived classes are fuzzy; and (b) the values of perceived attributes have a granular structure (Fig. 1). In large measure, natural languages may be viewed as systems for describing - and reasoning with - perceptions.

Figure 1. f-granularity. f-granularity is a reflection of the bounded ability of sensory organs and, ultimately, the brain, to resolve detail and store information


It is of historical interest to note that in my 1973 paper, "Outline of a New Approach to the Analysis of Complex Systems and Decision Processes" [5], it was, in an implicit way, the concept of f-granularity that provided a springboard for the basic concepts of (a) linguistic variable; and (b) fuzzy rule set (Fig. 2). In retrospect, the introduction of these concepts may be viewed as a pivotal event in the evolution of fuzzy logic. Today, almost all applications of fuzzy logic use these concepts in one form or another.

Rule: if X is A then Y is B

Rule set: if X is Ai then Y is Bi, i = 1, 2, ..., n

Generalized rule: if X isr A then Y iss B

Figure 2. Rules and rule sets

Existing scientific theories - based as they are on crisp logic and crisp set theory - provide no means for dealing with f-granularity of perceptions. As we move farther into the age of machine intelligence and automation of reasoning, the need for such means will become more apparent. This is one of the main reasons why fuzzy logic will eventually gain acceptance as the logic on which scientific theories should be based.

The concept of f-granularity plays a central role in four fuzzy-logic-based theories: granular computing (GrC) [7],[12]; computing with words (CW) [11], [13]; the computational theory of perceptions (CTP) [13],[14]; and precisiated natural language (PNL). These theories are closely linked but not identical. (The label "granular computing" was suggested by Professor T.Y. Lin.)

To see the theories in a clearer perspective, it is helpful to note that a basic concomitant of scientific progress is generalization - generalization of concepts, methods and theories. Among the principal modes of generalization there are three that are based on fuzzy logic. They are: (a) f-generalization, which involves a progression from crisp sets to fuzzy sets; (b) f.g-generalization, which involves a


progression from fuzzy sets to granulated fuzzy sets (Fig. 3); and (c) nl-generalization, which involves a progression from fuzzy granules to propositions drawn from a natural language, with the understanding that such propositions play the role of descriptors of perceptions.


Figure 3. Generalization: f-generalization; g-generalization (crisp-granulation); f.g-generalization (fuzzy granulation)

Symbolically, if T is a theory, then its f-generalization is denoted as T+; its f.g-generalization is denoted as T++; and its nl-generalization is denoted as Tnl.

In this perspective, granular computing may be viewed as f.g-generalization of numerical computing, and computing with words as nl-generalization of granular computing. As for the computational theory of perceptions, it may be regarded as a branch of computing with words in which the objects of computation are propositions - propositions which play the role of descriptors of perceptions (Fig. 4).

A very basic concept which is shared by granular computing, computing with words and the computational theory of perceptions is that of a generalized constraint [9], [10]. It is this concept that is the principal source of the computational power of these theories.

Conventionally, a constraint on a variable, X, is assumed to mean that the values which X is allowed to take are restricted to a specified subset, C, of the range of X.



Figure 4. Evolution of computing. From numerical computing to interval analysis; to granular computing; to precisiated natural language; to computing with words; to computational theory of perceptions

A generalized constraint is expressed as X isr R, where X is the constrained variable, R is the constraining relation and r in isr is an indexing variable whose values define the ways in which R constrains X. The principal constraints are: possibilistic (r = blank); veristic (r = v); probabilistic (r = p); usuality (r = u); random set (r = rs); fuzzy graph (r = fg) and Pawlak set (r = ps). Simple examples are:

possibilistic (crisp): a ≤ X ≤ b

possibilistic (fuzzy): X is small

veristic: Ethnicity (Robert) isv 0.5|German + 0.5|French, meaning that Robert is half German and half French.

probabilistic: X isp N(m, σ²), meaning that X is a normally distributed random variable with mean m and variance σ².

usuality: X isu small, meaning that usually (X is small).

random set: X isrs (0.1\small + 0.6\medium + 0.3\large), meaning that X is a fuzzy-set-valued random variable which takes the values small, medium and large with respective probabilities 0.1, 0.6 and 0.3.


fuzzy graph: f isfg (small x small + medium x large + large x small), meaning that f is a function, Y = f(X), which is defined by the rule set:

if X is small then Y is small
if X is medium then Y is large
if X is large then Y is small.
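To make the isr notation concrete, here is a minimal Python sketch (not part of the original text) that represents two of the modalities above - a possibilistic constraint and a fuzzy-graph constraint - as degree-returning functions. The membership functions small, medium and large are invented solely for the illustration.

```python
# Minimal sketch of generalized constraints as degree-returning functions.
# The membership functions below are invented for illustration only.

def small(x):                      # fuzzy set "small" on [0, 10]
    return max(0.0, 1.0 - x / 5.0)

def medium(x):                     # fuzzy set "medium" on [0, 10]
    return max(0.0, 1.0 - abs(x - 5.0) / 2.5)

def large(x):                      # fuzzy set "large" on [0, 10]
    return max(0.0, min(1.0, (x - 5.0) / 5.0))

def possibilistic(mu):
    """Possibilistic constraint 'X is R': the degree of a value x is mu_R(x)."""
    return lambda x: mu(x)

def fuzzy_graph(rules):
    """Fuzzy-graph constraint 'f isfg sum_i A_i x B_i': degree of a pair (x, y)."""
    return lambda x, y: max(min(mu_a(x), mu_b(y)) for mu_a, mu_b in rules)

x_is_small = possibilistic(small)
f_is_fg = fuzzy_graph([(small, small), (medium, large), (large, small)])

print(x_is_small(1.0))     # 0.8 -> 1.0 is fairly "small"
print(f_is_fg(5.0, 9.0))   # 0.8 -> a medium x with a large y fits the fuzzy graph
```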

From these constraints, other types of constraints may be constructed by combination, transformation, modification and qualification. Examples:

Conjunction of possibilistic constraints: (X is not very small) and (X is not very large).

Conjunction of probabilistic constraints and possibilistic constraints: (X isp N(m, σ²)) and ((X, Y) is much larger).

Transformation of possibilistic constraints: f(X) is R

Modification of possibilistic constraints: X is very R

Truth-qualification: (X isr R) is not very true

Probability-qualification: (X isr R) is very unlikely

Possibility-qualification: (X isr R) is quite possible.

These and other operations on constraints are instances of generalized constraint propagation. Thus, if C1, ..., Cn-1, Cn are generalized constraints, then generalized constraint propagation from C1, ..., Cn-1 to Cn is a derivation of Cn from C1, ..., Cn-1 through the use of rules of constraint propagation, which coincide with the rules of inference in fuzzy logic. Schematically, derivation of Cn from C1, ..., Cn-1 is represented as

C1, ..., Cn-1
-------------
Cn

One of the principal rules of inference in fuzzy logic is the generalized extension principle. The basic idea of this principle was described in my 1965 paper [3] and more explicitly in my 1975 paper "The Concept of a Linguistic Variable and its Application to Approximate Reasoning" [6]. For possibilistic constraints, the principle may be expressed as

f(X) is A

g(X) is B

where B = g(f⁻¹(A)) and f⁻¹ is the inverse of f (Fig. 5).


f(X) is A;  g(X) is g(f⁻¹(A));  $\mu_{g(f^{-1}(A))}(v) = \sup_u \mu_A(f(u))$, subject to $v = g(u)$

Figure 5. Generalized extension principle

In effect, the extension principle asserts that a possibilistic constraint on a given function of X, f(X), induces a possibilistic constraint on a specified function, g(X), with the constraining relation given by B = g(f⁻¹(A)). In more concrete terms, if the membership function of A is μ_A, then the membership function of g(f⁻¹(A)) is the solution of the variational problem

$$\mu_{g(f^{-1}(A))}(v) = \sup_u \mu_A(f(u))$$

subject to

$$v = g(u).$$
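As a purely illustrative numerical sketch (not part of the text), the sup over u can be approximated on a grid; the particular choices of f, g and μ_A below are assumptions made only for the example.

```python
import numpy as np

# Grid approximation of the generalized extension principle:
#   mu_B(v) = sup_u mu_A(f(u))  subject to  v = g(u).
# f, g and mu_A are arbitrary choices made only for this illustration.

f = lambda u: u ** 2
g = lambda u: 2.0 * u + 1.0
mu_A = lambda y: np.clip(1.0 - np.abs(y - 4.0) / 3.0, 0.0, 1.0)   # "approximately 4"

u_grid = np.linspace(-5.0, 5.0, 2001)

def mu_B(v, tol=0.01):
    mask = np.abs(g(u_grid) - v) <= tol        # enforce v = g(u) up to grid tolerance
    if not mask.any():
        return 0.0
    return float(np.max(mu_A(f(u_grid[mask]))))

for v in (5.0, 1.0, -3.0):
    print(v, round(mu_B(v), 3))                # e.g. mu_B(5.0) = 1.0, since u = 2 gives f(u) = 4
```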

The concept of a generalized constraint serves to provide a way of precise characterization of a granule. What this means is that a granule may be viewed as the set of points in its universe of discourse which satisfy a given generalized constraint, X isr R. In particular, in the case of possibilistic constraints, the set in question is the fuzzy set R.

In this perspective, granular computing (GrC) may be viewed as a system of computing in which the objects of computation have a granular structure, with a granule defined by a generalized constraint. By construction, granular computing subsumes interval analysis, quantization and rough set theory. Within granular computing, the principal computational tool is the generalized extension principle.

What is the difference between granular computing and computing with words? As was alluded to already, computing with words may be viewed as nl-generalization of granular computing (Fig. 4). In essence, this mode of


generalization adds to granular computing the capability to compute with words and propositions drawn from a natural language. A simple illustration of this capability is the so-called Robert example. More specifically, assume that my perception is that Robert returns from work at about 6 pm. Based on this item of knowledge, I wish to compute: (a) the probability that Robert is home at 6:30 pm; and (b) the earliest time at which the probability that Robert is home is high. Standard probability theory does not have a capability to provide answers to these questions because it does not have the ability to operate on perception-based information, e.g., "usually Robert returns from work at about 6 pm."

A key assumption which is made in computing with words is that the meaning of a proposition, p, may be expressed as a generalized constraint. Thus,

p → X isr R,

where the arrow should be read as "translates into." In general, X, r and R are implicit in p. What this points to is that, in general, translation of a proposition drawn from a natural language may be viewed as explicitation of X, r and R. In this sense, translation - in the context of computing with words - is equivalent to explicitation (Fig. 6).

• information is conveyed by constraining - in one way or another- the values which a variable can take

• when information is conveyed by propositions in a natural language, a proposition represents an implicit constraint on a variable.

translation (explicitation): p → X isr R (the canonical form), in which X is the constrained variable, R is the constraining relation, and r specifies the role of R in relation to X.

Figure 6. A proposition may be viewed as an implicit constraint on a variable

In computing with words, translation is governed by what is called constraint-centered semantics of natural language (CSNL). Basically, CSNL is an extension of test-score semantics [9], [10] from possibilistic constraints to generalized constraints.

As an illustration, let us consider the Robert example. In this case,


p = usually Robert returns from work at about 6 pm.

The first step involves a calibration of "usually" and "about 6 pm" by specifying their membership functions, μ_usually and μ_6*, where 6* = about 6 pm (Fig. 7).

• precisiation = calibration + composition

Figure 7. Calibration of "usually" and "about 6 pm" in the Robert example

The constrained variable, X, is the probability that Robert returns from work at about 6 pm. The constraint on this variable may be expressed as

X is usually,

where "usually" plays the role of a fuzzy probability. Now let T be the time at which Robert returns from work, and let g be the

probability density of T. Then, the probability that Robert returns from work at about 6 pm may be expressed as [4]

$$\left(\int_0^{12} g(u)\,\mu_{6^{*}}(u)\,du\right),$$

where 0 = noon and 12 = midnight. It follows that the initial datum, p, may be interpreted as a possibilistic constraint on g which may be expressed as

$$\left(\int_0^{12} g(u)\,\mu_{6^{*}}(u)\,du\right) \text{ is usually}.$$

In more explicit form, the degree to which g satisfies this constraint is given by


$$d(g) = \mu_{usually}\left(\int_0^{12} g(u)\,\mu_{6^{*}}(u)\,du\right).$$

Now the probability that Robert is home at 6:30 is a functional of g given by

$$P = \int_0^{6{:}30} g(u)\,du.$$

At this point, we can invoke the generalized extension principle to compute P. Thus, computation of P reduces to the solution of the variational problem

$$\mu_P(v) = \sup_g\, \mu_{usually}\left(\int_0^{12} g(u)\,\mu_{6^{*}}(u)\,du\right)$$

subject to

$$v = \int_0^{6{:}30} g(u)\,du,$$

where μ_P is the membership function of P. This variational problem can be reduced to a nonlinear program through discretization of u.
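Purely as an illustration of the discretization step, the sketch below approximates μ_P(v) by a crude random search over discretized densities g; the calibrations of "usually" and "about 6 pm" and the search scheme are assumptions of the sketch, not part of the chapter.

```python
import numpy as np

# Crude illustration of reducing the variational problem to a finite search:
# discretize u on [0, 12] (0 = noon), sample candidate densities g, and keep
# the best mu_usually value among those satisfying v = P(home by 6:30).
# The calibrations of "usually" and "about 6 pm" are assumed for the sketch.

rng = np.random.default_rng(0)
u = np.linspace(0.0, 12.0, 25)

mu_6star = np.clip(1.0 - np.abs(u - 6.0) / 0.5, 0.0, 1.0)       # "about 6 pm"
def mu_usually(p):                                              # "usually"
    return float(np.clip((p - 0.5) / 0.4, 0.0, 1.0))

def mu_P(v, samples=5000, band=0.02):
    best, upto_630 = 0.0, (u <= 6.5)
    for _ in range(samples):
        w = rng.dirichlet(np.full(u.size, 0.3))                 # discretized g(u) du, sums to 1
        if abs(w[upto_630].sum() - v) <= band:                  # constraint: v = int_0^6:30 g du
            best = max(best, mu_usually(float((w * mu_6star).sum())))
    return best

for v in (0.2, 0.5, 0.8):
    print(v, round(mu_P(v), 3))
```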

What is the relation between computing with words and the computational theory of perceptions?

The point of departure in the computational theory of perceptions is the assumption that perceptions are described by propositions drawn from a natural language. In this sense, the following propositions are examples of perceptions.

Marie is young.

Robert is very honest.

Most Swedes are tall.

Overeating causes obesity.

It is very unlikely that there will be a significant increase in the price of oil in the near future.

With this assumption, the computational theory of perceptions may be viewed as computing with words applied to propositions which are descriptions of perceptions (Fig. 4).

An important characteristic of perceptions is their intrinsic imprecision - imprecision which, as we alluded to earlier, is a reflection of the bounded ability of the human mind to resolve detail and store information. Imprecision of perceptions puts them well beyond the reach of Aristotelian logic and crisp probability theory. In the main, this is the reason why existing scientific theories do not have the ability to operate on perception-based information. The importance of the computational theory of perceptions derives from the fact that it adds this capability to existing theories. An important example is the perception-based theory of probabilistic reasoning [14]. The Robert example falls within this theory. Another example is the following.


A box contains balls of various sizes. My perceptions are: (a) there are about 20 balls; (b) most are large; and (c) a few are small. What is the probability that a ball drawn at random is neither large nor small? Neither this problem nor the Robert example can be dealt with through the use of standard probability theory.

Computing with words and the computational theory of perceptions suggest an important idea - the idea of what may be called Precisiated Natural Language (PNL). In time, PNL may have a significant impact on the structure and content of scientific theories.

The basic idea which underlies PNL is simple. Natural languages are rich but imprecise. What is possible, however, and what is frequently done, involves a construction of what may be called a precisiation language, L; endowing L with precise syntax and semantics; and using L for expressing the meaning of propositions in a natural language, NL. A familiar example is the language of first-order predicate logic, PL, and its variants such as Prolog and other types of meaning representation languages.

The problem with PL as a precisiation language is that it has a very limited expressive power. In particular, it has no capability to deal with f-granularity of perceptions. Thus, the meaning of a proposition as simple as "most Swedes are tall" cannot be represented in PL.

In the case of PNL, the precisiation language, GCL, is based on the concept of a generalized constraint. More specifically, the Generalized Constraint Language, GCL, consists of all generalized constraints which can be formed by combination, transformation, modification and qualification of possibilistic, veristic and

NL: natural language

GCL: generalized constraint language

CSNL: constraint-centered semantics of natural language

CSNL serves as a bridge between NL and GCL

Figure 8. Constraint-centered semantics of natural language


probabilistic constraints. By construction, GCL is maximally expressive. A simple example of a constraint in GCL is

(f(X) is R) and (g(X,Y) is S) is unlikely.

A proposition, p, in NL is precisiable if it is translatable into GCL through the use of constraint-centered semantics of natural languages (CSNL) (Fig. 8). Basically, CSNL is an extension of test-score semantics [8], [9], [10] to generalized constraints.

Informally, PNL may be defined as the set of propositions in NL which are precisiable through translation into GCL (Fig. 9). Since GCL is maximally

PL: first-order predicate logic; GCL: generalized constraint language; generalized constraint: X isr R; p: proposition in PNL

Figure 9. Relationship between PNL, NL, PL and GCL

expressive, PNL is the largest subset of NL which admits precisiation. Unlike PL, PNL may be viewed as the language which underlies computing with words and the computational theory of perceptions. An important sublanguage of PNL is the language of fuzzy if-then rules (Fig. 10).

As a language which provides a means of assigning a precise meaning to a proposition and, more particularly, to a perception, PNL can be employed in a wide variety of ways. Among these, there are two roles that stand out in importance: (a) PNL as a knowledge-description language; and (b) PNL as a concept definition language.

A simple example of the use of PNL as a knowledge-description language is the following.


rule: if X isr R then Y iss S

rule set:
if X isr R1 then Y iss S1
if X isr R2 then Y iss S2
...
if X isr Rn then Y iss Sn

{if X isr Ri then Y iss Si}, i = 1, 2, ..., n

example: if X isp A then (X, Y) is R usually

Figure 10. The language of generalized rules and rule sets is a sublanguage of PNL

Assume that an item of knowledge in my knowledge-base is the perception:

p = overeating causes obesity.

Translation of p into GCL may be expressed as

X is most,

where X is the proportion of those who are obese among those who overeat. More specifically, consider a population, A, of n individuals. Assume that the

grades of membership of a_i, i = 1, ..., n, in the fuzzy set of obese individuals and in the fuzzy set of those who overeat are μ_obese(a_i) and μ_overeat(a_i), respectively. Then the constraint on X may be expressed as

ΣCount(obese/overeat) is most,

where

$$\Sigma Count(obese/overeat) = \frac{\sum_i \mu_{obese}(a_i) \wedge \mu_{overeat}(a_i)}{\sum_i \mu_{overeat}(a_i)},$$

where ∧ is min or, more generally, a t-norm. Consequently, the degree, d, to which the proportion in question is constrained by p may be expressed as:

$$d = \mu_{most}\left(\frac{\sum_i \mu_{obese}(a_i) \wedge \mu_{overeat}(a_i)}{\sum_i \mu_{overeat}(a_i)}\right).$$
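For illustration, the sketch below computes the relative ΣCount and the degree d for a small made-up population; the membership grades and the calibration of "most" are assumptions of the sketch, not data from the text.

```python
# Illustrative computation of the relative sigma-count and of the degree d
# to which "overeating causes obesity" constrains it. The membership grades
# and the calibration of "most" below are assumptions for the sketch only.

mu_overeat = [0.9, 0.7, 0.1, 0.8, 0.0, 0.6]   # mu_overeat(a_i) for a small population
mu_obese   = [0.8, 0.4, 0.2, 0.9, 0.1, 0.5]   # mu_obese(a_i)

def mu_most(x):                                # calibration of "most" (assumed)
    return min(1.0, max(0.0, (x - 0.5) / 0.4))

# SigmaCount(obese/overeat) = sum_i min(mu_obese, mu_overeat) / sum_i mu_overeat
num = sum(min(o, v) for o, v in zip(mu_obese, mu_overeat))
den = sum(mu_overeat)
sigma_count = num / den

d = mu_most(sigma_count)                       # degree to which p constrains the proportion
print(round(sigma_count, 3), round(d, 3))
```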


In the above analysis, ΣCount(obese/overeat) is assumed to be measurement-based. Alternatively, ΣCount(obese/overeat) could be perception-based, in which case it would be treated as a linguistic variable.

The principal function of PNL is precisiation. As an illustration, consider the proposition (perception)

q = obesity is caused by overeating,

What is the difference, if any, in the meanings of p and q?

Let Y be the proportion of those who overeat among those who are obese. Then, using PNL, the meaning of q would be represented as

Y is most

or, more explicitly, as

ΣCount(overeat/obese) is most,

where, by analogy with the case of p, ΣCount(overeat/obese) = (Σ_i μ_obese(a_i) ∧ μ_overeat(a_i)) / (Σ_i μ_obese(a_i)). In this instance, PNL serves to precisiate the difference between the meanings of p and q.

The role of PNL as a concept definition language is likely to grow in visibility and importance as we move farther into the age of machine intelligence and automation of reasoning. The basic reason why this is likely to happen is that humans can understand imprecisely defined concepts, but machines require that meaning be defined precisely.

To be more specific, let X be the concept of, say, a summary, and assume that I am instructing a machine to generate a summary of a given article or a book. To execute my instruction, the machine must be provided with a definition of what is meant by a summary. It is somewhat paradoxical that we have summarization programs which can summarize, albeit in a narrowly prescribed sense, without being able to formulate a general definition of summarization. The same applies to the concepts of causality, randomness and probability. Indeed, it may be argued that these and many other basic concepts cannot be defined within the conceptual framework of classical logic and set theory. It is in this regard that PNL can play an essential role as a definition language.

The role of PNL as a definition language is closely related to the concept of what may be called generalized definability. A basic assumption which underlies this concept is that definability has a hierarchical structure (Fig. 11). Furthermore, it is understood that a definition must be unambiguous, precise, operational, general, and co-extensive with the concept which it defines.

The hierarchy involves five different types of definability. The lowest level is that of c-definability, with c standing for crisp. Thus, informally, a concept, X, is c-definable if it is a crisp concept, e.g., a prime number, a linear system or a Gaussian distribution. The domain of X is the space of instances to which X applies.


Statistical independence: f.g-definable; partially c-definable.

Figure 11. Definability hierarchy. An element has a dual interpretation: (a) an object (instance); and (b) a concept

The next level is that of f-definability, with f standing for fuzzy. Thus, X is a fuzzy concept if its denotation, F, is a fuzzy set in its universe of discourse. A fuzzy concept is associated with a membership function which assigns to each point, u, in the universe of discourse of X, the degree to which u is a member of F. Alternatively, it may be defined algorithmically in terms of other fuzzy concepts. Examples of fuzzy concepts are small number, strong evidence and similarity. It should be noted that many concepts associated with fuzzy sets are crisp concepts. An example is the concept of a convex fuzzy set. Most fuzzy concepts are context-dependent.

The next level is that of f.g-definability, with g standing for granular, and f.g denoting the conjunction of fuzzy and granular. Informally, in the case of a concept which is f.g-granular, the values of attributes are granulated, with a granule being a clump of values which are drawn together by indistinguishability, similarity, proximity or functionality. As was alluded to already, f.g-granularity reflects the bounded ability of the human mind to resolve detail and store information. An example of an f.g-granular concept which is traditionally defined as a crisp concept is that of statistical independence. This is a case of misdefinition - a definition which is applied to instances for which the concept is not defined, e.g., fuzzy events. In particular, a common misdefinition is to treat a concept as if it were c-definable whereas in fact it is not.

The next level is that of PNL-definability. On this level, PNL can be employed to define concepts for which lower-level definitions do not exist, e.g., the concept of a usual -- rather than expected -- value of a random variable. Another important application of PNL relates to re-definition of concepts which


are normally defined as crisp concepts, e.g., optimality, Pareto-optimality, Lyapounov stability and statistical independence. What is not widely recognized is that, in reality, these and many other basic concepts are fuzzy rather than crisp. PNL-based re-definition of such concepts is needed because crisp definitions may lead to counterintuitive results.

As an illustration, consider Lyapounov's definition of stability. Assume that a ball of diameter D is placed on a bottle whose throat has diameter d (Fig. 12). When d/D is about a half or larger, the system is clearly stable, both intuitively and in Lyapounov's sense. However, as d/D becomes smaller, the system remains stable in Lyapounov's sense so long as it is not zero. For small values of d/D, say about 0.05, the system is clearly unstable. What this contradiction implies is that stability is a matter of degree - and not a crisp concept as it is usually assumed to be.


Figure 12. Lyapounov definition of stability may lead to counterintuitive conclusions

Another example is the concept of statistical independence. The standard definition of independence does not apply to fuzzy events [4]. As an illustration, consider "smoking" and "heart disease." Are they dependent or independent? The conventional definition of independence does not apply because both "smoking" and "heart disease" are not crisply defined events.

In employing PNL, the first step is granulation of "smoking" and "heart disease." For simplicity, assume that three levels are employed: mild, moderate and severe for "heart disease"; and light, moderate and heavy for "smoking."

The next step is that of constructing a contingency table in which an entry such as ΣCount(mild/heavy) represents the relative ΣCount of individuals with


mild heart disease among heavy smokers. The granular values of ΣCount are assumed to be: low, medium and high.

smoking \ heart disease   mild     moderate   serious
light                     medium   low        low
moderate                  high     medium     low
heavy                     high     high       medium

It is understood that the entries in this table can be either measurement- or perception-based.

The table serves to provide a basis for assessing the degree to which "heart disease" depends on "smoking." Thus, if the rows are identical, then "heart disease" is independent of "smoking." If this is not the case, then a PNL-based similarity function may be employed to serve as a linguistic measure of the degree of dependence. Evidently, the degree of dependence would depend on the coarseness of granulation and the calibration of linguistic values of granulated variables. This is an intrinsic characteristic of PNL-based definitions.
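One possible numerical reading of this procedure is sketched below; the mapping of the linguistic ΣCount values to numbers and the distance-based comparison of rows stand in for the PNL-based similarity function mentioned above and are assumptions of the sketch.

```python
# Illustrative sketch of a degree-of-dependence measure based on the table
# above: linguistic SigmaCount values are mapped to numbers and rows are
# compared. The numeric encoding and the distance measure are assumptions
# made for the sketch, not the chapter's PNL-based similarity function.

scale = {"low": 0.2, "medium": 0.5, "high": 0.8}

table = {                       # smoking level -> SigmaCount profile over heart-disease levels
    "light":    ["medium", "low", "low"],
    "moderate": ["high", "medium", "low"],
    "heavy":    ["high", "high", "medium"],
}

rows = [[scale[v] for v in profile] for profile in table.values()]

def row_distance(r1, r2):
    return sum(abs(a - b) for a, b in zip(r1, r2)) / len(r1)

# If all rows were identical the maximum pairwise distance would be 0
# (independence); larger values indicate stronger dependence.
dependence = max(row_distance(r1, r2) for i, r1 in enumerate(rows)
                                      for r2 in rows[i + 1:])
print(round(dependence, 2))
```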

The use of PNL as a definition language is in its early stages of exploration. The use of PNL for this and other purposes is a new direction. My conviction is that eventually PNL will become a widely used tool in the realms of description of knowledge, definition of concepts and, more generally, automation of reasoning.

In conclusion, the concept which has a position of centrality in granular computing, computing with words, the computational theory of perceptions and precisiated natural language, is that of fuzzy information granulation. It is this concept that plays a pivotal role in fuzzy logic, but is not a part of any other logical system. Basically, the concept of fuzzy information granulation is simple and natural. Why it had not been introduced much earlier in the evolution of science is a question that historians will have to resolve.

References

[1] Z. Pawlak, Rough sets, International Journal of Computer and Information Science, 11, 341-356, 1982.
[2] J.G. Shanahan, Soft Computing for Knowledge Discovery: Introducing Cartesian Granule Features, Kluwer Academic Publishers, Boston, 2000.
[3] L.A. Zadeh, Fuzzy sets, Information and Control, 8, 338-353, 1965.
[4] L.A. Zadeh, Probability measures of fuzzy events, Journal of Mathematical Analysis and Applications, 23, 421-427, 1968.
[5] L.A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Transactions on Systems, Man and Cybernetics, SMC-3, 28-44, 1973.
[6] L.A. Zadeh, The concept of a linguistic variable and its application to approximate reasoning, Part I, Information Sciences, 8, 199-249, 1975; Part II, Information Sciences, 8, 301-357, 1975; Part III, Information Sciences, 9, 43-80, 1975.
[7] L.A. Zadeh, Fuzzy sets and information granularity, in: M. Gupta, R. Ragade and R. Yager (Eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, Amsterdam, 3-18, 1979.
[8] L.A. Zadeh, Test-score semantics for natural languages and meaning representation via PRUF, in: B. Rieger (Ed.), Empirical Semantics, Brockmeyer, Bochum, 1982.
[9] L.A. Zadeh, Outline of a computational approach to machine and knowledge representation based on the concept of a generalized assignment statement, Proceedings of the International Seminar in Artificial Intelligence and Man-Machine Systems, M. Thoma and A. Wyner (Eds.), 198-211.
[10] L.A. Zadeh, Test-score semantics as a basis for a computational approach to the representation of meaning, Literary and Linguistic Computing, 1, 24-35, 1986.
[11] L.A. Zadeh, Fuzzy logic = computing with words, IEEE Transactions on Fuzzy Systems, 2, 103-111, 1996.
[12] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 111-127, 1997.
[13] L.A. Zadeh, From computing with numbers to computing with words - from manipulation of measurements to manipulation of perceptions, IEEE Transactions on Circuits and Systems, 45, 105-119, 1999.
[14] L.A. Zadeh, Outline of a computational theory of perceptions based on computing with words, in: N.K. Sinha and M.M. Gupta (Eds.), Soft Computing & Intelligent Systems: Theory and Applications, Academic Press, London, 2-33, 2001.

Lotfi A. Zadeh is a Professor in the Graduate School and Director of the Berkeley Initiative in Soft Computing (BISC), Computer Science Division and the Electronics Research Laboratory, Department of EECS, University of California, Berkeley, CA 94720-1776.


Part 2

Granular Computing in Data Mining


Data Mining Using Granular Computing: Fast Algorithms for Finding Association Rules

T. Y. Lin (1, 2) and Eric Louie (3)

1 Department of Mathematics and Computer Science San Jose State University, San Jose, California 95192-0103

2 Berkeley Initiative in Soft Computing, University of California, Berkeley, California 94720

3 IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 [email protected], [email protected]

Abstract. An attribute value of a relation is a meaningful name (common property) of a group of entities (an elementary granule). A relational model using such elementary granules as its attribute values is called a machine-oriented relational model. In such a model, data processing, in particular finding association rules, is transformed into granular computing. In this paper, algorithms for finding association rules by granular computing are presented. Analysis and experiments show that the computation is fast and that this is a promising approach. Experiments show it is about 15-20 times faster; theoretical analysis indicates that on the support-counting step, which is the major step, it is at least 32 (wordsize) times faster.

Keywords: association rule, granule, machine-oriented relational model.

1 Introduction

In database theory, a relation, often loosely used, could mean relation variables or relation values ([4], pp. 123). In this paper, relation will always mean relation values (or relation instances).

A relation is a knowledge representation of a universe of entities, which is assumed to be a Cantor set. Each entity is represented by a tuple of attribute values. Each attribute value is a name of an elementary concept representing a certain property of entities. In other words, for a fixed knowledge representation, one can regard each attribute value as a meaningful name (a property) of a subset of entities; such a subset will be referred to as an elementary granule (of the universe). Note that in a fixed column each distinct attribute value represents a distinct and mutually disjoint elementary granule; the collection of such elementary granules forms a partition (equivalence relation) of the universe. In this paper, we will use various variants of granules; however, elementary granules will always be reserved for equivalence classes of the partition of some column. The relational model (for this fixed universe) using elementary granules as attribute values has been referred to as the machine


oriented relational model [7]. Using such models, data processing is reduced to granular computing: finding association rules becomes computing of intersections of elementary granules; decision rules, inclusions; soft decision rules, partial inclusions; and extensional function dependencies (universal inference rules), refinements of two partitions. In this paper, we will focus on association rules. Some experimental results are presented. The paper is organized as follows: in the first few sections, we set up the model; then we explain the implementations of the algorithms. Various test runs are presented. We believe we have demonstrated that granular computing is a viable approach to relational data mining.

2 Relations and Information Tables

A relation is a representation of the real world (a classical set of entities) by elementary concepts (attribute values). In other words, it represents an entity by a list (tuple) of attribute values. In such a model, there is no formal component for the entities, the independent variable of the representation. This is often inconvenient. So instead, we will use information tables, in which the entities play a formal role. For a formal presentation, see Appendix 1.

SNUM   SNAME   Status   City
S1     Smith   TWENTY   LONDON
S2     Jones   TEN      PARIS
S3     Blake   THIRTY   PARIS
S4     Clark   TWENTY   LONDON
S5     Adams   THIRTY   ATHENS

Table 1. The Supplier Table

V     →   (SNUM   SNAME   Status   City)
ID1   →   (S1     Smith   TWENTY   LONDON)
ID2   →   (S2     Jones   TEN      PARIS)
ID3   →   (S3     Blake   TEN      PARIS)
ID4   →   (S4     Clark   TWENTY   LONDON)
ID5   →   (S5     Adams   THIRTY   ATHENS)

Table 2. Information Table of Suppliers; arrows and parentheses will be suppressed


Here, we will illustrate the notion by examples. Let us consider Table 1 from a popular text [4]. Let the set of entities be denoted by V, and its members by ID_i, i = 1, 2, .... Let the knowledge representation be

T : entities → tuples.

Then the relation, Table 1, is the image T(u), and the information table, Table 2, is its graph (u, T(u)). In the subsequent tables, the parentheses and arrows of Table 2 will be suppressed.

3 A Granular Relation Theory - Machine Oriented Modeling

From a processing point of view, data mining is a process of machine derivation of interesting properties from the mathematical structure of stored data. What would be the proper and efficient primitives for such machine processing? In traditional database theory (the human-oriented model), attribute values are the primitives. Attribute values are meaningful elementary concepts (properties) to humans. However, to the machine, they are merely bits and bytes; human intuition provides no special aid to machine processing. In fact, attribute values are often cumbersome for the machine to process, since they are often semantically interrelated and these interrelationships are mathematically reflected in the global structure of the data. For example, the dependency between two attribute values, say "London → TWENTY" (London has the potential to have twenty billion in business), can be deduced by a human based on the context (background knowledge); the machine, however, has to scan through the two columns.

Ideally, all primitives should be independent of each other. Relational theory has tacitly assumed that the entities form a Cantor set; in other words, there are no interactions among them. So entities are ideal primitives for machine processing. Note that each granule is a subset of entities, so it is an expression of entities. Hence the machine-oriented data model, which is a relational model that uses granules as attribute values, uses entities as independent variables. This theory is developed in this section.

3.1 Granules and Attribute Values

Let Dom(A) be the domain of a given attribute A. Let c_0 be a fixed attribute value, which represents a property (an elementary concept). We can regard the attribute A (column) as a projection that maps entities to elementary concepts. Formally, it is the map (see Appendix 1)

p(-, A) : V → Dom(A); v → c = p(v, A)

In Table 2, the attribute STATUS defines the map St: V → Dom(STATUS)


St: ID1 → TWENTY
St: ID2 → TEN
St: ID3 → TEN
St: ID4 → TWENTY
St: ID5 → THIRTY

The inverse images of the values are:

1. {ID1, ID4} = St⁻¹(TWENTY) = {v | v ∈ V, St(v) = TWENTY};
2. {ID2, ID3} = St⁻¹(TEN) = {v | v ∈ V, St(v) = TEN};
3. {ID5} = St⁻¹(THIRTY) = {v | v ∈ V, St(v) = THIRTY}.

In general, the inverse image under the attribute A of a fixed value, c_0, is

A⁻¹(c_0) = {v | v ∈ V, A(v) = c_0}
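For illustration only, a minimal Python sketch of this inverse-image construction, applied to the STATUS column of Table 2, reproduces the three elementary granules listed above; the dictionary encoding of the column is an assumption of the sketch, not part of the chapter.

```python
# A minimal sketch (not from the chapter) of the inverse-image construction
# A^{-1}(c) = {v | v in V, A(v) = c}: grouping entities into elementary granules.

from collections import defaultdict

# The map St : V -> Dom(STATUS) read off Table 2.
status = {"ID1": "TWENTY", "ID2": "TEN", "ID3": "TEN",
          "ID4": "TWENTY", "ID5": "THIRTY"}

def elementary_granules(column):
    """Return the partition E_A induced by an attribute: value -> inverse image."""
    granules = defaultdict(set)
    for entity, value in column.items():
        granules[value].add(entity)
    return dict(granules)

print(elementary_granules(status))
# {'TWENTY': {'ID1', 'ID4'}, 'TEN': {'ID2', 'ID3'}, 'THIRTY': {'ID5'}}
```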

The collection of all distinct inverse images A⁻¹(c), for all c in Dom(A), defines a partition or, equivalently, an equivalence relation. By abuse of notation, we will use E_A to denote both. Each A⁻¹(c) is an equivalence class. The notion of such equivalence relations is important, so we collect some observations:

Definitions and Propositions

1. An attribute A defines an equivalence relation E_A (partition). An equivalence class is called an elementary granule. It is a subset of entities, so it can be regarded as an expression of entities.

2. Assume E_A and E_B are two partitions (equivalence relations). E_A is said to be a refinement of E_B iff every elementary granule in E_A is contained in an elementary granule in E_B. In this case, we may also say that E_B depends on E_A; E_B is coarser than E_A; or E_A is finer than E_B.

3. Assume E_A and E_B are two equivalence relations. The intersection E_A ∩ E_B is another equivalence relation whose granules consist of all possible intersections of E_A- and E_B-granules (equivalence classes). They will be denoted by E_A ∩ E_B-granules or (E_A, E_B)-granules. This result is, in fact, valid for any number of equivalence relations.

4. The collection of all elementary granules constitutes a set, called the quotient set and denoted by V/E_A or simply V/A.

5. An elementary granule plays two roles: one as an element of the quotient set V/A, another as a subset of the universe V. We can regard the element as the canonical name or label of the subset. In other words, the quotient set is a set that consists of all canonical names.

6. The attribute, as a projection, can be factored through the quotient set

A: V → V/A → Dom(A); v → [v]_A → c,

where the first map is known as the quotient map; [v]_A, as usual, is the E_A-equivalence class containing v, regarded as its canonical name; and the second map is the naming map, NAME(·).


7. The naming map gives each (machine-oriented) canonical name a (human-oriented) meaningful name, i.e., an attribute value c; in notation, c = NAME([v]_A) = A(v).

These notions are illustrated in Tables 3, 4, 5 and 6.

Supplier list   Canonical name of a granule (encoded label)   Meaningful name (attribute value)
ID1             SNUM(10000)                                    S1
ID2             SNUM(01000)                                    S2
ID3             SNUM(00100)                                    S3
ID4             SNUM(00010)                                    S4
ID5             SNUM(00001)                                    S5

Table 3. Canonical and Meaningful Granular Representations of SNUM

Supplier list   Canonical name of a granule (encoded label)   Meaningful name (attribute value)
ID1             SNAME(10000)                                   Smith
ID2             SNAME(01000)                                   Jones
ID3             SNAME(00100)                                   Blake
ID4             SNAME(00010)                                   Clark
ID5             SNAME(00001)                                   Adams

Table 4. Canonical and Meaningful Granular Representations of SNAME

Supplier list   Canonical name of a granule (encoded label)   Meaningful name (attribute value)
ID1, ID4        CITY(10010)                                    LONDON
ID2, ID3        CITY(01100)                                    PARIS
ID5             CITY(00001)                                    ATHENS

Table 5. Canonical and Meaningful Granular Representations of CITY

3.2 Granular Representations of Relations - Machine Oriented Models

The observation, in the last section, that each attribute induces an equivalence relation leads us to the consideration of using elementary granules


Supplier list   Canonical name of a granule (encoded label)   Meaningful name (attribute value)
ID1, ID4        STATUS(10010)                                  TWENTY
ID2, ID3        STATUS(01100)                                  TEN
ID5             STATUS(00001)                                  THIRTY

Table 6. Canonical and Meaningful Granular Representations of STATUS

as attribute values, or more precisely, their bit patterns or lists of members; they are expressions of entities. Using Table 3, Table 4, Table 5, and Table 6, we can transform Table 1 into Table 7 and Table 8; they are two granular forms of Table 1. Ordinary subset notation is used in the former table and bit representation in the latter table. In a bit pattern, a bit is on if and only if the corresponding object belongs to the subset.

Note that the SNUM and SNAME columns have the same structure as the universe. Their elementary granules are all singletons. It is clear that Table 8 can be compacted into Table 9 without losing any information; we will call it the machine-oriented model. The list representation is similar; we skip the illustration.

v     CNAME(SNUM)   CNAME(SNAME)   CNAME(STATUS)   CNAME(CITY)
ID1   ID1           ID1            {ID1, ID4}      {ID1, ID4}
ID2   ID2           ID2            {ID2}           {ID2, ID3}
ID3   ID3           ID3            {ID3, ID5}      {ID2, ID3}
ID4   ID4           ID4            {ID1, ID4}      {ID1, ID4}
ID5   ID5           ID5            {ID3, ID5}      ID5

Table 7. The List Representation of Supplier Table

v     CNAME(SNUM)   CNAME(SNAME)   CNAME(STATUS)   CNAME(CITY)
ID1   SNUM(10000)   SNAME(10000)   STATUS(10010)   CITY(10010)
ID2   SNUM(01000)   SNAME(01000)   STATUS(01000)   CITY(01100)
ID3   SNUM(00100)   SNAME(00100)   STATUS(00101)   CITY(01100)
ID4   SNUM(00010)   SNAME(00010)   STATUS(10010)   CITY(10010)
ID5   SNUM(00001)   SNAME(00001)   STATUS(00101)   CITY(00001)

Table 8. The Bit Representation of Supplier Table


Canonical names (encoded labels)   Meaningful names (attribute values)
STATUS(10010)                      TWENTY
STATUS(01000)                      TEN
STATUS(00101)                      THIRTY
CITY(10010)                        LONDON
CITY(01100)                        PARIS
CITY(00001)                        ATHENS

Table 9. Granular Relational Model-Machine Oriented Modeling
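The granular tables above can be produced mechanically from the base relation. The following is a minimal sketch (not the chapter's implementation), written in Python for illustration, assuming the supplier CITY column shown in Tables 5 and 7-9; the leftmost bit of each pattern stands for ID1.

    universe = ["ID1", "ID2", "ID3", "ID4", "ID5"]

    # CITY column of the supplier table (assumed here, matching Tables 5 and 7-9).
    city = {"ID1": "LONDON", "ID2": "PARIS", "ID3": "PARIS",
            "ID4": "LONDON", "ID5": "ATHENS"}

    def granules(column, universe):
        """Map each distinct attribute value to the bit pattern of its elementary granule."""
        out = {}
        for value in set(column.values()):
            out[value] = "".join("1" if column[v] == value else "0" for v in universe)
        return out

    print(granules(city, universe))
    # e.g. {'LONDON': '10010', 'PARIS': '01100', 'ATHENS': '00001'}  (key order may vary)

Each bit pattern is the canonical name (encoded label) of the corresponding granule, as in Table 5.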

4 Data Mining by Granular Computing

Semantically, data mining is one form of induction. Induction is generally regarded as the process of inferring general laws or principles from the observations of particular instances. In databases, common "frames" of general laws are association rules, decision rules (inference rules) and extensional function dependencies. Association rules can be regarded as soft decision rules. Extensional function dependencies are universal decision rules. We will give a brief formulation.

Let us recall some notions from [7]. Let A = {A1, A2, ..., An} and B = {B1, B2, ..., Bm} be two sets of attributes of a relational database, and let c = (a1, a2, ..., an) and d = (b1, b2, ..., bm) be two tuples of attribute values of A and B respectively. Let Gai and Gbj be the elementary granules corresponding to ai and bj, i = 1, 2, ..., n, j = 1, 2, ..., m, respectively. Let Pc = ∩i Gai and Qd = ∩j Gbj be the respective intersections. Each attribute value is the name of an elementary granule (an equivalence class). So c is a tuple of names, corresponding to a collection of elementary granules. We will regard c as the name of the intersection of these elementary granules, so we write c = NAME(Pc) and d = NAME(Qd). Let Card(X) denote the cardinality of a set X. Now, we first define the common frames of general laws using granules and defer their details to subsequent sections:

1. Decision rule [6]: A formula c → d is a decision rule if Pc is included in Qd, i.e., Pc ⊆ Qd.

2. Universal decision rule: A formula A ⟹ B is a universal decision rule (extensional function dependency) if ∀ c ∈ A ∃ d ∈ B such that c → d is a decision rule.

3. Robust decision rule [13]: A formula c → d is a robust decision rule if c → d is a decision rule and Card(Pc) ≥ threshold.

4. Soft decision rule (strong rule) [19],[15]:


A formula c → d is a soft decision rule (strong rule) if Pc is softly or approximately included in Qd.

5. Association rule [1]: A pair (c, d) is an association rule if Card(Pc ∩ Qd) ≥ threshold.

4.1 Decision Rules

In a traditional model, we say c → d is a decision rule if, whenever the attribute value c appears in a tuple, d also appears in the same tuple. So, to check whether the following rule is valid,

"If STATUS = TEN, then CITY = Paris."

one needs to scan through the two columns STATUS and CITY in Table 1 and check whether the attribute value "TEN" is consistently associated with "Paris." In the machine-oriented model, the same fact can be checked by the inclusion of two elementary granules, namely,

"TEN" n "Paris" =ST ATUS(010000) nCITY(01100) =(010000) n(01100) =(01100) =ST ATU S(010000) = "TEN"

Note that the attribute names in the bit patterns are the names of partitions (columns), so they do not participate in the computation.
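As a small illustration (a sketch, not part of the chapter), the check above reduces to a single bitwise AND on integer-encoded granules, with the leftmost bit standing for ID1:

    TEN   = int("01000", 2)   # STATUS granule {ID2}
    PARIS = int("01100", 2)   # CITY granule {ID2, ID3}

    def is_decision_rule(p, q):
        """c -> d holds iff the elementary granule of c is included in that of d."""
        return p & q == p

    print(is_decision_rule(TEN, PARIS))   # True:  TEN is included in PARIS
    print(is_decision_rule(PARIS, TEN))   # False: PARIS is not included in TEN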

4.2 Universal Decision Rules

Decision rules are relationships among constants (attribute values). Now, we will examine rules that are valid for the whole universe; they are relationships among columns (attributes).

Extensional and Intensional Function Dependencies

In this section we will recall a few notions on extensional relational databases [12]. A finite collection of attribute names {A1, A2, ..., An} is called a relation scheme [5]. A relation value (instance), or simply relation, R on a relation scheme is a finite set of tuples of attribute values (see Appendix 1 for the formal theory). A functional dependency occurs when the values of a tuple on one set of attributes uniquely determine the values on another set of attributes. Formally, let X and Y be two subsets of A.

Definition. A relation R satisfies

1. EFD: X ⟹ Y is an extensional functional dependency if for every X-value there is a uniquely determined Y-value in the relation instance R.


2. IFD: X ⟹ Y is an intensional functional dependency if the dependency is satisfied by all relation instances R under the scheme.

One should note that at any given moment a relation instance may satisfy some family of extensional functional dependencies (EFDs); however, the same family may not be satisfied by other relation instances. The family that is satisfied by all relation instances is the intensional functional dependency (IFD). In this paper, we will be interested only in extensional functional dependencies, since we have a fixed universe. So the notation "X ⟹ Y" denotes an EFD.

Universal Decision Rules - Extensional Function Dependencies

As pointed out earlier, an attribute A induces an equivalence relation EA;

see Section 3.1. So a family of attributes corresponds to a set of equivalence relations. We collect some obvious facts:

Propositions

1. A ⟹ B is a universal decision rule iff it is an EFD.
2. A ⟹ B is a universal decision rule iff EA is a refinement of EB.

In the traditional model, we scan through the A and B columns to check whether each distinct attribute value of A is uniquely associated with an attribute value of B. In the machine-oriented model, we check the refinement relation between the two partitions.
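A minimal sketch of this refinement check (an assumption-laden illustration, not the chapter's code) follows, with partitions given as lists of integer bit masks over the five-supplier universe, leftmost bit = ID1, and granule values taken from Tables 3-6:

    def refines(ea, eb):
        """A ==> B is a universal decision rule (EFD) iff every EA-granule
        is contained in some EB-granule."""
        return all(any(g & h == g for h in eb) for g in ea)

    E_SNUM   = [0b10000, 0b01000, 0b00100, 0b00010, 0b00001]
    E_STATUS = [0b10010, 0b01000, 0b00101]
    E_CITY   = [0b10010, 0b01100, 0b00001]

    print(refines(E_SNUM, E_CITY))     # True:  SNUM ==> CITY
    print(refines(E_STATUS, E_CITY))   # False: {ID3, ID5} lies in no single CITY granule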

4.3 Soft Rules

Very often patterns cannot be as clean as decision rules or extensional functional dependencies, yet we would still like to express them as some kind of rules, namely soft rules. In other words, we say c → d is a soft decision rule if the inclusion of Pc in Qd is a soft or approximate inclusion. Soft inclusion can be interpreted in many ways:

1. The number of "illegal or neglected elements" is small [19]. Formally, c → d is a soft decision rule if Card(Pc \ (Pc ∩ Qd))/Card(Pc) ≤ threshold. In the language of association rules, a soft rule is a (directed) association rule with a high confidence level (a small sketch follows this list).

2. The inclusion is ε-fuzzy [15]; see Appendix 3.

3. In [14], a new relaxation method based on the semantics of the data is proposed; it neglects only semantically negligible elements. Assume each point in Dom(Qd) has a unique (nearest) neighborhood (globally, this defines a binary relation on Dom(Qd)); this neighborhood is called the elementary (basic) neighborhood [7]. Then c → d is a soft decision rule if Pc is included in a neighborhood of Qd, Pc ⊆ N(Qd), where N(Qd) is the union of the unique (nearest) neighborhoods of the points in Qd. In some literature, these are called cross-level decision rules.
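For interpretation 1 above, the check is again a matter of bit operations. A small sketch (the threshold value is chosen arbitrarily for illustration):

    def is_soft_rule(p, q, threshold=0.2):
        """c -> d is a soft rule if Card(Pc − (Pc ∩ Qd)) / Card(Pc) <= threshold."""
        neglected = p & ~q                      # elements of Pc outside Qd
        return bin(neglected).count("1") / bin(p).count("1") <= threshold

    TWENTY = 0b10010    # STATUS granule {ID1, ID4}
    LONDON = 0b10010    # CITY granule {ID1, ID4}
    PARIS  = 0b01100    # CITY granule {ID2, ID3}

    print(is_soft_rule(TWENTY, LONDON))   # True  (exact inclusion, ratio 0)
    print(is_soft_rule(TWENTY, PARIS))    # False (no overlap, ratio 1)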


5 Association Rules via Granular Computing

Let Ai, i = 1, 2, ... be the attributes of a given relation R. Each column Ai gives rise to a quotient set Qi = V/Ai; see Section 3.1. Each tuple in R induces a tuple in the Cartesian product Q1 × Q2 × .... For simplicity and clarity, the term tuple will be reserved for the relation R, and the tuples and their sub-tuples in the Cartesian product of quotient sets will be referred to as combinations of elementary granules from Qi, i = 1, 2, .... A combination of length q is called a q-combination. A granule is said to be large if it meets the s% percentage requirement, that is, its cardinality is at least s% of Card(V); otherwise it is lean. A q-combination (of elementary granules) is said to be large if the intersection of the elementary granules in the combination is large; we will refer to such combinations as q-large-combinations. A q-large-combination is a q-association rule, an association rule of length q [1], [2].

The Apriori algorithms, generally, proceed as follows:

1. Scan through the database to create a list of 1-large-combinations (1-large-itemsets).

2. Generate all possible q-combinations of attribute values from the (q − 1)-large-combinations ((q − 1)-large-itemsets).

3. For this fixed q, count the occurrences of each q-combination in the relation. In the original algorithm, this loops through each tuple (transaction) in the relation and uses a routine, called the subset function (hashing), to count each occurrence.

4. Declare those q-combinations to be association rules if their counts meet or exceed the s% threshold.

Roughly, the algorithms settle two issues. The first is the generation of q-combinations, which we treat in Section 5.1. The second is to evaluate whether the q-combinations are large; this is treated in Section 5.2. We consider both together in Section 5.3.

5.1 Generating Combinations

First, we observe some simple properties of q-combinations.

Proposition 5.1.1

• A q-combination that includes a lean granule is lean.
• Apriori condition: all (q − 1)-sub-combinations of a large q-combination are large.

To start the induction, we need the 1-large-combinations (a 1-combination is an elementary granule). This is done by scanning through the whole database; at the same time, the scan also creates a list of distinct attribute values and counts their respective occurrences. The output of this step consists of the 1-large-combinations (1-association rules). To generate q-combinations, the algorithm joins


the (q − 1)-large-combinations with distinct 1-large-combinations. One should note that q denotes the number of distinct attributes (columns), not the number of granules; each attribute may consist of several elementary granules.

To accomplish this, the traditional approach has two steps:

1. The generation step: it creates mathematical q-combinations by joining each (q − 1)-large-combination with distinct 1-large-combinations. (We could use the same method as item 2 of the Apriori algorithm in Section 5.)

2. The pruning step: it uses the Apriori condition of Proposition 5.1.1 to eliminate every q-combination whose (q − 1)-sub-combinations are not all large. This step reduces the number of potential candidates to be verified.

The actual verification of largeness is the task of the next section. The granular computing approach can take the following optimization: it estimates the pruning cost and the actual cost of performing the AND operations (of (q − 1)-large-combinations with distinct 1-large-combinations), and takes the cheaper route.
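The generation and pruning steps can be sketched as follows (an illustrative sketch, not the chapter's routine), with a q-combination represented as a frozenset of (attribute, value) pairs, one value per attribute:

    from itertools import combinations

    def generate_candidates(large_q_minus_1, large_1):
        """Join (q-1)-large-combinations with distinct 1-large-combinations."""
        candidates = set()
        for prev in large_q_minus_1:
            used = {attr for attr, _ in prev}
            for item in large_1:                 # item: frozenset holding one (attr, value) pair
                (attr, _), = item
                if attr not in used:
                    candidates.add(prev | item)
        return candidates

    def prune(candidates, large_q_minus_1):
        """Apriori condition: keep a candidate only if all its (q-1)-sub-combinations are large."""
        large = set(large_q_minus_1)
        return {c for c in candidates
                if all(frozenset(sub) in large for sub in combinations(c, len(c) - 1))}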

5.2 Counting Granules - Association Rules

In this section, we will explain how to identify whether a q-combination is large. For granular computing, this is simple bit counting. We will use logarithmic counting; see Appendix 2.

In this step, the traditional approaches go through each transaction and use the subset function to verify whether the given q-combination belongs to that transaction; see Section 2.1 of [2]. It takes more than one instruction to execute the subset function. For granular computing, by contrast, it involves only the AND operation (and a very small fair share of overhead in global I/O). It is clear that if the ANDed bit that represents a particular tuple of the generated q-combination is 1, then that q-combination does belong to that tuple. It takes at most q/32 (32 = word size) instruction(s) to verify this fact; note that each AND operation verifies 32 tuples. So the granular approach is much faster.
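As a sketch of this step (granules stored as Python integers, so each AND effectively covers a machine word of tuples at a time):

    def is_large(granules, universe_size, s_percent):
        """A q-combination is large iff the AND of its elementary granules
        has at least s% of Card(V) bits set."""
        intersection = granules[0]
        for g in granules[1:]:
            intersection &= g
        return bin(intersection).count("1") >= universe_size * s_percent / 100.0

    # Is (STATUS = TWENTY, CITY = LONDON) large at 40% support over the 5 suppliers?
    print(is_large([0b10010, 0b10010], universe_size=5, s_percent=40))   # True (support 2)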

5.3 The Granular Algorithm - the Combined Step

There are two strategies for carrying out the computations: one is the full computation of the intersection; the other stops at the s% condition. In these experiments we use the full computation, since the support s% we use is large; see below.

1. For q = 1, the task is to identify which elementary granules are large. We accomplish this while we are converting the database to bit streams; see Section 5.1

2. For q = 2, the algorithm generates


(a) the candidate 2-combination (L1, L2), by joining mathematically two 1-large-combinations, L1 and L2.

(b) It ANDs the candidate and counts the bits of the result, L1 ∩ L2, where L1, L2 are from the previous step. We can take two approaches:
   i. Full computation: we compute the intersection and save the result. For this approach, we may need memory and storage management, since the intersection may not fit in main memory.
   ii. Partial computation: the computation of the intersection stops at the point where s% is reached. No intersections of bit patterns are saved. The disadvantage is that we may redo the same computations repeatedly. But in the case that s is much smaller than the length of the bit strings, this is the right approach.

(c) For q ≥ 3, the algorithm generates
   i. the candidate q-combination, by joining a (q − 1)-large-combination and distinct 1-large-combinations.
   ii. It verifies whether the generated q-combination meets the requirement, in other words, whether all possible (q − 1)-sub-combinations are large.
   iii. It ANDs the candidate and counts the bits of the result, L(q−1) ∩ L1, where L1 is a 1-large-combination. Again, we can take the two approaches, full and partial computation.

5.4 Memory Management

In the last three subsections, we have explained the algorithm as if there were infinite main memory. Let us explain "visually" how the data is arranged. Each 1-large granule stands up vertically. These vertical granules of one attribute are lined up next to each other: the first column consists of the bit string of one 1-large granule, the second column of another granule, and so on. Since all the granules are the same size, they form a large matrix of bits.

Next, we will cut these granules horizontally. The first horizontal block consists of the first 4K bits of every granule, the second block consists of the second 4K bits, and so on. The algorithms explained in the last three subsections apply to each horizontal block, one block at a time: do the first block, then the second, and so on.

We do assume that the main memory is large enough to hold one horizontal block (and perform some computation).
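A sketch of this block scheme follows (granules held as byte strings of equal length; the 4K block size follows the text, everything else is an assumption for illustration):

    BLOCK_BYTES = 4096   # one horizontal 4K block

    def blocks(granule_bytes):
        for start in range(0, len(granule_bytes), BLOCK_BYTES):
            yield granule_bytes[start:start + BLOCK_BYTES]

    def combination_count(granule_list):
        """Bit count of the intersection, accumulated one horizontal block at a time."""
        total = 0
        for parts in zip(*(blocks(g) for g in granule_list)):
            word = int.from_bytes(parts[0], "big")
            for p in parts[1:]:
                word &= int.from_bytes(p, "big")
            total += bin(word).count("1")
        return total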

6 Experiment Results

The experimental results are taken from [8]. Four different data sets were generated to compare the run times of these algorithms for finding association rules; this is an improvement over [9], [10]. The data sets vary the number of tuples, the number of attributes per relation, and the minimal support. Each attribute takes from 2 to 100 distinct values.


To reflect the I/O costs, we restrict ourselves to 10 megabytes of main memory, so all algorithms need to do substantial I/O. Each read/write unit is restricted to 4K bytes. The software for Apriori, AprioriSubset and AprioriHybrid are our implementations of the algorithms in [1], [2]; we believe we have done our best to maximize the power of these algorithms. All relations are artificially generated; see Table 10.

The first data set has a smaller data size than the memory; it characterizes a memory-based operation. The other three data sets have a larger data size than the memory. The algorithms to compare are the following: granule, Apriori, AprioriSubset, and AprioriHybrid. The algorithm granule generates q-combinations by combining (q − 1)-large-combinations ((q − 1)-association rules) with 1-large-combinations (1-association rules). All versions of Apriori generate q-candidates by combining two (q − 1)-association rules.

The meaning of each column in the following tables is self-evident, except for a few:

1. " Granule size" is calculated by the following: granule size = total number of granules per table * number of pages per granule * page size. The number of granules per table is the number of I-candidates. The number of pages per granule is approximately the number of tuples in a relation divided by number of tuples per page. Finally, the page size is 4096. The granule size is not directly related to performance; the I-large granule size (Association rules of length 1) is more relevant to the performance.

2. " Logical read" means the number of reading the granules from the buffer. 3. "Real read" means the number of reading the granules from secondary

storage. 4. "gen time" means the time for generating a granule. 5. "count time" means the time for counting the bit in a granule. 6. "Apriori" means Apriori algorithm without using" Subset Routine" call,

in other words, during the prune time, the apriori condition is checked by searching through the large sub-granule lists.

7. "AprioriSubset" means Apriori algorithm using "Subset Routine" call (see Section 2.1 of [2]).

8. " Candidate bar" means the Ci,i = 1,2, ... in AprioriTid

Data Set   Rows     Columns   Table Size   Granule Size   Memory   Minimal Support
1          100000   16        6.4M         7.7M           10M      2000
2          400000   16        25.6M        36.7M          10M      8000
3          100000   48        19.5M        27.8M          10M      2000
4          400000   48        78.0M        88.0M          10M      8000

Table 10. Data Sets


( C1 7) (C2 53) (C3 2) (C4 15) (C5 28) (C6 10) ( C7 31) ( C8 24) ( C9 51) ( C10 17) (C11 62) ( C12 8) ( C13 23) ( C14 25) ( C15 53) ( C16 62)

Table 11. 16 Pairs of (Column Names, # of Distinct Attribute Values) on Data Set 1

Length   # of Candidates   # of Rules                 Granule   Apriori    AprioriSubset   AprioriHybrid
1        471               291          gen time      0.00      0.00       0.00            0.00
                                        count time     1.84      0.94       0.94            0.94
2        39168             214          gen time      0.09      0.34       0.33            0.33
                                        count time     14.36     770.16     104.91          104.97
3        48                14           gen time      0.17      0.14       0.13            0.14
                                        count time     0.02      0.33       6.49            8.22
4        0                 0            gen time      0.02      0.00       0.00            0.00
                                        count time     0.00      0.00       0.00            0.00
Total time                                             16.50     771.91     112.78          114.59

Table 12. Computing Times for Different Methods on Data Set 1

Reads           Granule   Apriori   AprioriSubset   AprioriHybrid
Real Reads      3048      1563      1563            1563
Logical Reads   312756    3126      3126            3126

Table 13. Number of Reads for Different Methods on Data Set 1

( C1 54) ( C2 37) ( C3 53) ( C4 57) ( C5 34) ( C6 54) ( C7 51) ( C8 56) ( C9 52) ( C10 18) (C11 53) ( C12 16)

( C13 37) ( C14 41) ( C15 41) ( C16 35)

Table 14. 16 Pairs of (Column Names, # of Distinct Attribute Values) on Data Set 2


Length   # of Candidates   # of Rules                 Granule   Apriori    AprioriSubset   AprioriHybrid
1        689               307          gen time      0.00      0.00       0.00            0.00
                                        count time     8.75      4.06       3.84            3.84
2        43556             4            gen time      0.13      0.39       0.39            0.41
                                        count time     51.95     3400.28    369.19          369.20
3        1                 1            gen time      0.00      0.00       0.00            0.00
                                        count time     0.00      0.44       24.86           26.83
4        0                 0            gen time      0.00      0.00       0.02            0.00
                                        count time     0.00      0.00       0.00            0.00
Total time                                             60.83     3405.17    398.30          400.28

Table 15. Computing Times for Different Methods on Data Set 2

Reads           Granule   Apriori   AprioriSubset   AprioriHybrid
Real Reads      12963     18750     18750           18750
Logical Reads   1128489   0         0               0

Table 16. Number of Reads for Different Methods on Data Set 2

( Cl 18) ( C2 24) ( C3 18) ( C4 39) ( C5 65) ( C6 47) ( C7 61) ( C8 5) ( C9 59) ( CI0 46) (C11 47) ( C12 27)

( C13 56) ( C14 11) ( C15 50) ( C16 45) ( C17 54) ( C18 59) ( C19 37) ( C20 34) ( C21 11) ( C22 62) ( C23 41) ( C24 11) ( C25 21) ( C26 23) ( C27 9) ( C28 30) ( C29 66) ( C30 52) ( C31 8) ( C32 58) ( C33 35) ( C34 43) ( C35 6) ( C36 42) ( C37 42) ( C38 15) ( C39 59) ( C40 10) ( C41 13) ( C42 24) ( C43 58) ( C44 35) ( C45 51) ( C46 26) ( C47 2) ( C48 40)

Table 17. 48 Pairs of (Column Names, # of Distinct Attribute Values) on Data Set 3

Here are some observations and explanations on the results (mainly on data sets 1 and 2).

1. All versions of Apriori use the same routine to find 1-large-combinations; therefore, all execution times are the same for all versions of Apriori on finding 1-large-combinations.

2. The routine granule creates a temporary table before determining 1-large-combinations, so its run time to find 1-large-combinations is longer than Apriori's.


Length   # of Candidates   # of Rules                 Granule   Apriori    AprioriSubset   AprioriHybrid
1        1695              820          gen time      0.00      0.00       0.00            0.00
                                        count time     9.02      4.67       3.16            3.16
2        27899             785          gen time      4.67      17.69      17.63           17.75
                                        count time     130.91    6452.06    1776.98         1776.00
3        1341              51           gen time      2.11      0.42       0.42            0.41
                                        count time     0.55      13.75      396.06          408.61
4        0                 0            gen time      0.01      0.00       0.00            0.01
                                        count time     0.00      0.00       0.00            0.00
Total time                                             147.27    6488.59    2194.25         2205.94

Table 18. Computing Times for Different Methods on Data Set 3

Reads           Granule   Apriori   AprioriSubset   AprioriHybrid
Real Reads      10153     13582     13582           13582
Logical Reads   2635911   704       704             704

Table 19. Number of Reads for Different Methods on Data Set 3

( C1 11) ( C2 48) ( C3 36) ( C4 36) ( C5 53) ( C6 17) ( C7 2) ( C8 57) ( C9 65) ( C10 46) (C11 13) ( C12 39)

( C13 9) ( C14 58) ( C15 3) ( C16 21) ( C17 32) ( C18 47) ( C19 28) ( C20 59) ( C21 43) ( C22 58) ( C23 60) ( C24 61) ( C25 18) ( C26 54) ( C27 6) ( C28 31) ( C29 62) ( C30 43) ( C31 11) ( C32 11) ( C33 61) ( C34 15) ( C35 11) ( C36 31) ( C37 50) ( C38 37) ( C39 13) ( C40 28) ( C41 41) ( C42 54) ( C43 20) ( C44 56) ( C45 47) ( C46 19) ( C47 15) ( C48 18)

Table 20. 48 Pairs of (Column Names, # of Distinct Attribute Values) on Data Set 4

3. AprioriHybrid uses the routine AprioriSubset to determine 2-large-combinations and 3-large-combinations. The run times to find 2- and 3-large-combinations are therefore similar.

4. On data 1 and 2, AprioriSubset wins over Apriori when finding 2-large-combinations because the subset function in AprioriSubset reduces the number of 2-candidates to compare against each tuple in the relation.

5. When finding 3-large-combinations with AprioriSubset, the number of subsets per tuple is many times the number of 3-candidates. Each tuple in data sets


Length   # of Candidates   # of Rules                 Granule   Apriori     AprioriSubset   AprioriHybrid
1        1654              957          gen time      0.00      0.00        0.00            0.00
                                        count time     60.34     28.44       28.69           36.56
2        46298             1200         gen time      9.33      30.03       30.03           29.95
                                        count time     568.84    35280.64    9699.50         9652.92
3        1807              176          gen time      26.22     14.97       14.77           14.75
                                        count time     10.80     80.63       1587.61         1708.94
4        10                1            gen time      0.16      0.05        0.08            0.06
                                        count time     0.01      29.34       7375.17         5.64
5        0                 0            gen time      0.00      0.00        0.09            0.00
                                        count time     0.00      0.00        0.00            0.00
Total time                                             675.70    35464.09    18735.94        11448.83

Table 21. Computing Times for Different Methods on Data Set 4

Reads           Granule    Apriori   AprioriSubset   AprioriHybrid
Real Reads      35662      76192     76192           57144
Logical Reads   11660581   0         0               0

Table 22. Number of Reads for Different Methods on Data Set 4

1 and 2 has C(16,3) and C(48,3) subsets respectively. Finding 3-large-combinations with Apriori is therefore better than with AprioriSubset.

6. AprioriHybrid needs the 3-candidate bar (C̄i in AprioriTid) tables when finding 4-large-combinations. So the 3-candidate bar table is created while finding 3-large-combinations. As a result, the execution time for finding 3-large-combinations in AprioriHybrid is longer than in AprioriSubset.

7. On data 1, the table and granule sizes are smaller than the memory size. Finding large-combinations for data 1 is memory-based.

8. On data 1, all versions of Apriori require 1563 real reads to do one relation scan. All subsequent relation scans are logical reads.

9. On data 2, all versions of Apriori require 18750 real reads for the three cycles. Each table scan replaces the memory twice, since the stored data is 2.5 times larger than the memory size.

10. On data 1, the granule algorithm performs 6.8 times faster than AprioriSubset.

11. On data 2, the granule algorithm performs 6.5 times faster than AprioriSubset.

Next are observations and explanations on data sets 3 and 4.


12. The run times to generate new q-candidates or q-combinations are typically smaller than the run times to process the q-candidates against the relation. For all versions of Apriori, the logical reads in data 3 appear because the stored data is less than 2 times the size of memory. The table size is not large enough to flush the memory on one relation scan.

13. AprioriHybrid on data 4 took 5.64 seconds to find 4-large-combinations. This shows the benefit of the candidate bar table compared to Apriori and AprioriSubset.

14. The granule algorithm has a high number of logical reads because of the overlap of attribute values among the q-combinations.

15. In data 4, AprioriHybrid has 57144 real reads, which is 19048 fewer than the real reads for Apriori and AprioriSubset. The difference is the number of reads for one relation scan. AprioriHybrid did not perform a relation scan when finding 4-large-combinations; instead, it used the candidate bar table. The reads on the candidate bar table are not shown here.

16. On data 3, the granule algorithm performs 14.9 times faster than AprioriSubset.

17. On data 4, the granule algorithm performs 16.4 times faster than AprioriHybrid.

18. The memory size for granule is greater than the page size times the number of 1-candidates. When this is true, the number of real reads is low and the number of logical reads is high. When this is false, some data pages in memory are more likely to be replaced with other data pages. As a result, logical reads are replaced with real reads.

19. The granule's number of reads can be verified with the following:

1. Formula: real reads + logical reads = Σ_{i=1}^{r} (Ci × i × p).
2. The term r is the maximum length of candidates.
3. The term Ci is the number of i-candidates.
4. The term p is the number of pages per granule.

For example, data 4 has 11696243 total reads: 35662 real reads plus 11660581 logical reads. Each granule requires 13 pages, 400000/(4096 × 8) rounded up. So the total number of page reads is equal to 13 × (1654×1 + 446298×2 + 1807×3 + 10×4) = 11696243.

The use of granules offers a promising method for finding association rules. The AND, SHIFT, and ADD operations among granules are natural for general machine architectures, which makes them fast. The disadvantage with granules is the cost of creating the temporary table to hold the granules. However, the gain is that the data is converted to a compact form, which benefits the whole process.

7 Conclusions

The granular computing approach is faster than Apriori mainly because the "database scans" are replaced by bit operations. Moreover, the cost of the bit operations is known well ahead of time, so optimizations are possible.


In this paper, we illustrate the application of granular computing to relational databases. If the databases have additional semantics [18], [17], [3], [20], [21], this approach is even better. The theoretical model has been explained in [11], [7]; further experimental results will be reported soon.

References

1. Agrawal, R., T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 207-216, Washington, DC, June 1993.
2. Agrawal, R., R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
3. W. Chu and Q. Chen, "Neighborhood and associative query answering," Journal of Intelligent Information Systems, Vol. 1, 355-382, 1992.
4. Date, C., An Introduction to Database Systems, Vol. I, 7th ed., Addison-Wesley, 2000.
5. Maier, D.: The Theory of Relational Databases. Computer Science Press, 1983 (6th printing 1988).
6. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic, Dordrecht (1991).
7. T. Y. Lin, "Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, September/October 2000, pp. 113-124.
8. Eric Louie, "Using Granules to Find Association Rules," Thesis, San Jose State University, Fall 2000.
9. Eric Louie and T.Y. Lin, "A Data Mining Approach using Machine Oriented Modeling: Finding Association Rules using Canonical Names." In: Proceedings of the 14th Annual International Symposium on Aerospace/Defense Sensing, Simulation, and Controls, SPIE Vol. 4057, Orlando, April 24-28, 2000.
10. Eric Louie and T.Y. Lin, "Finding Association Rules using Fast Bit Computation: Machine-Oriented Modeling." In: Proceedings of the 14th Annual International Symposium on Aerospace/Defense Sensing, Simulation, and Controls, SPIE Vol. 4057, Orlando, April 24-28, 2000.
11. T.Y. Lin, "Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems." In: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Physica-Verlag, 1998, 107-121.
12. T.Y. Lin, "An Overview of Rough Set Theory from the Point of View of Relational Databases," Bulletin of the International Rough Set Society, Vol. 1, No. 1, March 1997, 30-34.
13. T. Y. Lin, "Rough Set Theory in Very Large Databases," in Proceedings of the Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2, 936-941.
14. T. Y. Lin and Y.Y. Yao, "Mining Soft Rules Using Rough Sets and Neighborhoods." In: Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2, 1095-1100.
15. T. Y. Lin, "Coping with Imprecision Information - "Fuzzy" Logic," Downsizing Expo, Santa Clara Convention Center, Aug. 3-5, 1993.
16. T. Y. Lin, "Topological and Fuzzy Rough Sets." In: Decision Support by Experience - Application of the Rough Sets Theory, R. Slowinski (ed.), Kluwer Academic Publishers, 1992, 287-304.
17. T. Y. Lin, "Neighborhood Systems and Approximation in Database and Knowledge Base Systems," in Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, October 12-15, 1989, 75-86.
18. T. Y. Lin, "Neighborhood Systems and Relational Database," in: Proceedings of the 1988 ACM Sixteenth Annual Computer Science Conference, February 23-25, Atlanta, Georgia, 1988, 725.
19. W. Ziarko, "Variable Precision Rough Set Model." Journal of Computer and System Sciences, Vol. 46, No. 1, February 1993, Academic Press, pp. 38-59.
20. Y.Y. Yao, "Information tables with neighborhood semantics." In: Proceedings of the 14th Annual International Symposium on Aerospace/Defense Sensing, Simulation, and Controls, SPIE Vol. 4057, pp. 108-116, Orlando, April 24-28, 2000.
21. Y.Y. Yao and N. Zhong, "Potential applications of granular computing in knowledge discovery and data mining," in Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, Computer Science and Engineering, Vol. 5, pp. 573-580, Orlando, July 31 - August 4, 1999.


Appendix 1 - Definitions of Information Tables

An information table is a 4-tuple

(V, A, Dom, ρ),

where

1. V = {u, v, ...} is a set of entities.
2. A is a set of attributes {A1, A2, ..., An}.
3. dom(Ai) is the set of values of attribute Ai;

Dom = dom(A1) × dom(A2) × ... × dom(An).

4. ρ: V × A → Dom, called the description function, is a map such that

ρ(u, Ai) is in dom(Ai) for all u in V and Ai in A.

The description function ρ allows us to interpret an attribute as a projection.

The description function ρ induces a set of maps

t = ρ(u, −): A → Dom.

Each image forms a tuple:

t = (ρ(u, A1), ρ(u, A2), ..., ρ(u, An)).

Note that the tuple t is associated with the object u, but not necessarily uniquely. In an information table, two distinct objects could have the same tuple representation, which is not permissible in relational databases.

The notion of a relation in relational theory consists of

1. V = {x, y, ...} is an implicit set of entities, which does not appear in the formal model.

2. A is a set of attributes {A1, A2, ..., An}.
3. dom(Ai) is the set of values of attribute Ai;

Dom = dom(A1) ∪ dom(A2) ∪ ... ∪ dom(An).

4. Implicitly, to each entity u we associate a mapping

tu: A → Dom,

where tu(Ai) ∈ dom(Ai) for each Ai ∈ A.

A relation consists of the mappings tu: A → Dom. Informally, one can view a relation as a table consisting of rows of elements. Each row represents an entity uniquely.


Appendix 2 - Counting Bits

We will explain the logarithmic approach, which counts the bits by dividing the word into blocks. For each word, it does the count first by every 2 bits, next by every 4 bits, then by every 8 bits, and finally by every 16 bits. Here is a sample (we use pseudo-C syntax).

1. 2-bit count: odd bits are added to the preceding even bits. Let B be the bit pattern and Count32 a variable holding the "new" bit pattern.

Count32 = (B & 0x55555555) + ((B >> 1) & 0x55555555);

2. 4-bit count: odd 2-bit fields are added to even 2-bit fields (each 2-bit field holds the sum of an even and an odd bit).

Count32 = (Count32 & 0x33333333) + ((Count32 >> 2) & 0x33333333);

3. 8-bit count: odd 4-bit fields are added to even 4-bit fields (each 4-bit field holds the sum of even and odd 2-bit fields).

Count32 = (Count32 & 0x0F0F0F0F) + ((Count32 >> 4) & 0x0F0F0F0F);

4. 16-bit count: odd 8-bit fields are added to even 8-bit fields.

Count32 = (Count32 & 0x00FF00FF) + ((Count32 >> 8) & 0x00FF00FF);

5. 32-bit count: the odd 16-bit field is added to the even 16-bit field.

TotalCount = (Count32 & 0x0000FFFF) + (Count32 >> 16);

The first four statements take 28 instructions, 4 × (3 MOVE + 2 AND + 1 ADD + 1 SHR), and the last statement takes 7 instructions (3 MOVE + 1 AND + 2 ADD + 1 SHR). The total for this approach is 35 instructions to count one 32-bit word. This can be faster than the usual shift-and-test method by an order of magnitude.
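For reference, the five steps can be consolidated into a single routine. Below is a sketch in Python rather than the chapter's pseudo-C, with explicit 32-bit masks; it assumes a non-negative input that fits in 32 bits.

    def popcount32(b):
        # logarithmic bit count of one 32-bit word, mirroring steps 1-5 above
        c = (b & 0x55555555) + ((b >> 1) & 0x55555555)      # 2-bit sums
        c = (c & 0x33333333) + ((c >> 2) & 0x33333333)      # 4-bit sums
        c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F)      # 8-bit sums
        c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF)      # 16-bit sums
        return (c & 0x0000FFFF) + (c >> 16)                 # final 32-bit sum

    print(popcount32(0b10010))        # 2
    print(popcount32(0xFFFFFFFF))     # 32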

The instruction count for determining whether a q-combination is large is the total count of AND instructions between granules plus the instructions for counting the bits in the intersected result. Since computers process one word at a time, the number of instructions needed is approximately the following:

1. Set length = The length of the bit string representing a granule. This is the cardinality of the universe, Card(V).

2. Set w = number of bits in a word; in this paper, w is 32.
3. Set count = number of instructions =

(length/w) × ((q − 1) AND operations + cost of counting bits).


4. The total instruction count to determine, in the worst case, whether a q-combination is large is: count = (length/w) × ((q − 1) + 35) instructions.

For example, for a combination of length 10 with 2^20 tuples (a "million" tuples) in the relation, 1,441,792 instructions will be executed:

(2^20/32) × ((10 − 1) + 35) = 2^15 × 44 = 1,441,792 instructions.

Note: We may stop once the count meets s% of the data.

Appendix 3 - Fuzzy Set Theory

In this section, we will recall some results of [15]. Let U be the universe of discourse, a classical set. Let K = (U, FK) be a fuzzy subset, where FK: U → [0, 1] is a membership function. Given a real-world fuzzy set, there can be more than one membership function; these are called admissible functions of the real-world fuzzy set.

Let U = {u1, u2, ..., un}. For a given small number ε (called the radius of tolerance/error), FX and FY are admissible to the same real-world fuzzy set,

1. (from topology/geometry) if FX(u) can "continuously" move to FY(u) within given "ε-neighborhoods" [16].

2. (from measure theory) if, for the given ε,

‖FX − FY‖ / Card(FW) ≤ ε,

where ‖FX − FY‖ = Σ_{u ∈ U} |FX(u) − FY(u)| (the L1-norm), FW = FX ∪ FY, and Card(FW) is the cardinality of the fuzzy set FW. However, this admissibility is not an equivalence relation.

3. FY is "f-fuzzy" included, denoted by FY ~f F X, if F X n FY is another admissible function of the fuzzy set FY; note f-fuzzy inclusion is different from the inclusion of fuzzy sets.

4. If FX and FY are crisp sets, then FY ⊆ε FX is equivalent to 1 − Card(FX \ (FX ∩ FY))/Card(FX) ≤ ε.


Knowledge Discovery with Words Using Cartesian Granule Features:

An Analysis for Classification Problems

James G. Shanahan1

Xerox Research Centre Europe (XRCE) Grenoble Laboratory

6 chemin de Maupertuis, 38240 Meylan, France; e-mail: [email protected]

1. Introduction

Cartesian granule features were originally introduced to address some of the shortcomings of existing forms of knowledge representation such as decomposition error and transparency, and also to enable the paradigm of modelling with words through related learning algorithms. This chapter presents a detailed analysis of the impact of granularity on Cartesian granule feature models that are learned from example data in the context of classification problems. This analysis provides insights on how to effectively model problems using Cartesian granule features with various levels of granulation, granule characterizations, granule dimensionalities and granule generation techniques. Other modelling with words approaches such as the data browser [1, 2] and fuzzy probabilistic decision trees [3] are also examined and compared. In addition, this chapter provides a useful platform for understanding many other learning algorithms that may or may not explicitly manipulate fuzzy events. For example, it is shown how a naive Bayes classifier is equivalent to crisp Cartesian granule feature classifiers under certain conditions.

This chapter is organised as follows: Section 2 describes knowledge representation from a Cartesian granule feature perspective, introducing Cartesian granule features, the rule structures into which they can be incorporated and the approximate reasoning strategies that can be used to perform inference and decision making within this framework. A constructive induction algorithm, G_DACG, is described in Section 3 that facilitates the learning of Cartesian granule feature models from example data. Section 4 provides a detailed analysis of the effect of granularity on Cartesian granule feature modelling in the context of the ellipse classification problem. Models generated using Cartesian granule features of differing levels of granularity, granule characterisations, granule dimensionalities and granule generation techniques are examined in this context. A comparison of other modelling with words approaches is presented in Section 5. Finally, Section 6 provides an overall discussion on the approaches discussed and compared in previous sections. Section 7 finishes with some overall conclusions. For completeness, Appendix A provides a brief overview of the bi-directional transformation that exists between fuzzy sets and probability distributions, which plays a key role in learning and reasoning within the Cartesian granule feature framework.

1 Part of the work reported here was carried out while at the Advance Computing Research Centre, University of Bristol, Bristol, UK. This part of the work was supported by the European Community Marie Curie Fellowship Program and by DERA (UK) under grant 92W69.

2. Knowledge representation using Cartesian granule feature models

This section introduces Cartesian granule features and subsequently shows how they can be incorporated in rule-based models. Different approximate reasoning processes can be used to perform inference and decision-making within this framework. These are presented in the final part of this section.

2.1 Cartesian granule features

Basic definitions and examples of Cartesian granule features and related concepts are presented here.

Definition: Let X = {x1, ..., xn} be a set of given data. A partition P of X is a family of subsets of X, denoted by P = {A1, ..., Ac}, that satisfies the following properties:

(i) Ai ∩ Aj = ∅   ∀ i, j ∈ {1, ..., c}, and i ≠ j

(ii) ⋃_{i=1}^{c} Ai = X

In other words, P provides a minimal, or the most efficient, covering of X. A trivial example of a partition of X is provided by {A, ¬A}, where ¬A denotes the complement of A, for any subset A of X.

Definition: When each Ai is a fuzzy set, a fuzzy partition [6] for X is defined, and the following properties, corresponding to (i) and (ii) above, must hold:

(i) ∀ i, j ∈ {1, ..., c}, and ∀ k ∈ {1, ..., n}

(ii) Σ_{i=1}^{c} μ_{Ai}(x_k) = 1   ∀ k ∈ {1, ..., n}

This type of partition is sometimes known as a fuzzy mutually exclusive partition [1]. Relaxing property (ii) for a fuzzy partition as follows:

0 < Σ_{i=1}^{c} μ_{Ai}(x_k) ≤ c   ∀ k ∈ {1, ..., n}


leads to a fuzzy non-mutually exclusive partition.

Definition: A linguistic partition can be defined as a fuzzy partition P = {w1, ..., wc} where the labels of the constituent fuzzy sets are words wi.

Definition: A feature forms part of a description of an object or entity. A feature can assume values from a universe of discourse Ω. This universe Ω can be continuous, discrete, numeric or symbolic in nature. In the literature, features are often referred to as attributes or variables. A domain feature f is a feature whose value is directly provided by the problem domain (either via sensors or experts). Derived features are artificial features, formed usually as a function of a subset of the available domain features.

Definition: A granule is a collection of points which is labelled by a word. This collection of points is drawn together as a result of indistinguishability, similarity, proximity or functionality [7, 8]. A granule can be characterised by a number of means, such as a fuzzy set or a probability distribution (point or set based). In the case of this work, a granule describes a subset of a domain feature's universe of discourse.

Definition: Given a set of domain features {f1, ..., fm} defined over the universes {Ω1, Ω2, ..., Ωm} and corresponding linguistic partitions {Pf1, ..., Pfm}, where each Pfi consists of labelled fuzzy sets {wi1, wi2, ..., wic}, a Cartesian granule universe Ω_{Pf1×...×Pfm} is a discrete universe defined over ∏_{i=1}^{m} Pfi, where "×" denotes the Cartesian product. More concretely, a Cartesian granule universe Ω_{Pf1×Pf2×...×Pfm} can be formed by taking the cross product of the words making up each linguistic partition Pfi.

Each string concatenation of the individual fuzzy set labels wij (where each index j ranges over the granularity of the partition Pfi) denotes a Cartesian granule. A Cartesian granule can be intuitively visualised as a clump of elements in this m-dimensional universe. Consider the following example, where a two-dimensional Cartesian granule universe is formed using example domain features of Position and Size. To construct a Cartesian granule universe, the universe of each feature is linguistically partitioned arbitrarily as follows:

P_Position = {Left, Middle, Right} and

P_Size = {Small, Medium, Large}.


The Cartesian granule universe Ω_{PPosition×PSize} will then consist of the following discrete elements (Cartesian granules):

Ω_{PPosition×PSize} = {Left.Small, Left.Medium, Left.Large, Middle.Small, Middle.Medium, Middle.Large, Right.Small, Right.Medium, Right.Large}.

This is graphically depicted in Figure 1.

Figure 1: The Cartesian granule universe Ω_{PPosition×PSize} defined in terms of the linguistic partitions of the universes Ω_Size and Ω_Position.

Definition: A Cartesian granule feature CG_{f1×f2×...×fm} is a feature defined over a Cartesian granule universe Ω_{P1×P2×...×Pm}, where each fi is a domain feature and each Pi is a linguistic partition of the respective universe Ωi for all i ∈ {1, ..., m}. Throughout the rest of this chapter, Cartesian granule features will be denoted interchangeably by CG_{f1×f2×...×fm} and F. A Cartesian granule feature can intuitively be viewed as a multidimensional linguistic variable. For example, considering the problem features of Position and Size presented above, the Cartesian granule feature CG_{Position×Size} could denote a feature defined over the Cartesian granule universe Ω_{PPosition×PSize} as defined in Figure 1. A Cartesian granule feature is an example of a derived feature.

Definition: A Cartesian granule fuzzy set CGFS_{f1×f2×...×fm} is a discrete fuzzy set defined over a Cartesian granule universe Ω_{P1×P2×...×Pm}, where each fi is a domain feature and each Pi is a linguistic partition of the respective universe Ωi for all i ∈ {1, ..., m}. Each Cartesian granule is associated with a membership value, which is calculated by combining the individual granule membership values that the domain feature values have in the fuzzy sets that characterise the granules. For example, consider the Cartesian granule w11×...×wm1, where each wi1 is the word associated with the first fuzzy subset in each linguistic partition Pi. The membership value associated with this Cartesian granule w11×...×wm1 for a data tuple <x1, ..., xm> is calculated as follows:

μ_{w11×...×wm1}(x1, ..., xm) = μ_{w11}(x1) ∧ μ_{w21}(x2) ∧ ... ∧ μ_{wm1}(xm),

where xi is the feature value associated with the i-th feature within the data vector. Here the aggregation operator ∧ can be interpreted as any t-norm such as product or min [9].

Extending the example presented above, if the universes Ω_Position and Ω_Size are defined as [0, 100] and [0, 100] respectively, then possible definitions of the fuzzy sets in the partitions P_Position and P_Size (in Fril notation [1])2 could be:

Left: [0:1, 50:0]   Middle: [0:0, 50:1, 100:0]   Right: [50:0, 100:1]

Small: [0:1, 50:0]   Medium: [0:0, 50:1, 100:0]   Large: [50:0, 100:1].

Linguistic partitions provide a means of giving the data a more anthropomorphic feel, thereby enhancing understandability. In essence, when generating a Cartesian granule fuzzy set corresponding to a data tuple, it is first necessary to fuzzify (or reinterpret) the domain feature values. Returning to the example, the attribute values for Position and Size are reinterpreted in terms of the words that partition the respective universes, that is, a linguistic description of the data is generated. Taking a sample data tuple (of the form <Position, Size>) <60, 80> (denoted as <x, y> in Figure 1), each data value is individually linguistically summarised in terms of two fuzzy sets, {Middle/0.8 + Right/0.2} and {Medium/0.4 + Large/0.6}. Subsequently, taking the Cartesian product of these fuzzy data yields the following fuzzy set in the Cartesian granule universe:

CGFS_{Position×Size}(60, 80) = {Middle.Medium/0.32 + Middle.Large/0.48 + Right.Medium/0.08 + Right.Large/0.12}.

2 A fuzzy set definition in Fril such as Middle: [0:0, 50:1, 100:0] can be rewritten mathematically as follows (denoting the membership value of x in the fuzzy set Middle):

μ_Middle(x) =
    0                if x ≤ 0
    x / 50           if 0 < x ≤ 50
    (100 − x) / 50   if 50 < x < 100
    0                if x ≥ 100


Here the combination operator ∧ is interpreted as product.
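A small sketch of this construction (plain Python rather than Fril, using the fuzzy sets of the example above and product as the combination operator; the printed result is approximate):

    def piecewise(points):
        """Piecewise-linear membership built from Fril-style [x:mu, ...] pairs."""
        def mu(x):
            for (x0, m0), (x1, m1) in zip(points, points[1:]):
                if x0 <= x <= x1:
                    return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
            return 0.0
        return mu

    position = {"Left": piecewise([(0, 1), (50, 0)]),
                "Middle": piecewise([(0, 0), (50, 1), (100, 0)]),
                "Right": piecewise([(50, 0), (100, 1)])}
    size = {"Small": piecewise([(0, 1), (50, 0)]),
            "Medium": piecewise([(0, 0), (50, 1), (100, 0)]),
            "Large": piecewise([(50, 0), (100, 1)])}

    def cg_fuzzy_set(x, y):
        fs = {}
        for wp, mup in position.items():
            for ws, mus in size.items():
                m = mup(x) * mus(y)          # product t-norm
                if m > 0:
                    fs[wp + "." + ws] = m
        return fs

    print(cg_fuzzy_set(60, 80))
    # approximately {'Middle.Medium': 0.32, 'Middle.Large': 0.48,
    #                'Right.Medium': 0.08, 'Right.Large': 0.12}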

2.2 Cartesian granule feature rules

Even though it is possible to combine all the domain features of a problem into one Cartesian granule feature, previous work [10, 11] has highlighted the need for discovering structural decompositions of input feature spaces into lower-order feature spaces, in order to generate Cartesian granule feature models that provide good generalisation and knowledge transparency. Each such input feature subspace is described using a single Cartesian granule feature. As a result of this decomposition, a means of aggregating the individual Cartesian granule features is needed. Both evidential logic rule structures and conjunctive rule structures [1, 12] provide a natural mechanism for representing this type of decomposed approach to systems modelling [11]. These types of models are referred to as additive models and product models respectively. The presentation in this chapter is limited to classification models. For a complete discussion on how Cartesian granule features can be used for prediction, the reader is referred to [9].

More formally stated, an additive model consists of a collection of evidential logic rules, one for each class (also referred to as a concept) in the problem domain. A simplified evidential logic rule structure is depicted in Figure 2. Each rule consists of a head proposition and of a body - a collection of fuzzy propositions. CLASS can be viewed as a fuzzy set consisting of a single crisp value, in the case of classification-type problems. The body (conditional part) of each rule consists of a collection of fuzzy propositions, where each proposition is characterised by a single Cartesian granule feature Fi, whose value FiCLASS corresponds to a Cartesian granule fuzzy set for the output variable value of CLASS. Each feature Fi is associated with a weight term WiCLASS that reflects the importance of this feature to the class CLASS.

Figure 2: Evidential logic rule structure.

On the other hand, a product model will consist of a collection of conjunctive rules, one for each class in the problem domain. A conjunctive rule structure is depicted in Figure 3. As is the case in evidential logic rules, CLASS can be viewed as a fuzzy set consisting of a single crisp value, and the body of each rule consists of a collection of Cartesian granule features Fi, whose values FiCLASS correspond to fuzzy sets defined over the Cartesian granule universes Ωi, for the output or



dependent variable value CLASS. Unlike evidential logic rules, no weights are associated with the features Fi.

Figure 3: Conjunctive rule structure.

2.3 Approximate reasoning using Cartesian granule feature models

This section describes the approximate reasoning process that is used in order to perform inference and decision-making within the framework of Cartesian granule feature models. In short, for classification problems, the system accepts as input an object or event description in terms of a data vector Data (raw data) and performs approximate reasoning given this data, the result of which is a discrete value CLASS, the classification of the input object. The presentation here is, for the most part, limited to probabilistic approximate reasoning strategies, though fuzzy logic reasoning strategies could also be used [9].

2.3.1 Inference

Inference in Cartesian granule feature models occurs at two different levels: at the body proposition level and at the body level. At both levels, inference is based upon probabilistic conditionalisation (except in the case of the body level of the evidential logic rule).

2.3.1.1 Inference at the body proposition level

As seen previously, rules in Cartesian granule feature models can be decomposed into head and body propositions. When new evidence Data, in terms of an object description, is presented to the system, a "match" between the body propositions and the evidence needs to be performed to enable higher-level inference (rule-body level inference); that is, the level of support for a body-level proposition given the new evidence needs to be calculated. To achieve this, the domain data values Data must be reinterpreted using each Cartesian granule feature Fi. This results in a data fuzzy set FSiData being generated using the procedure described in Section 2.1. Subsequently, each fuzzy proposition in the body is conditioned on the reinterpreted evidence, resulting in the calculation of the posterior probability Pr(Fi = FSiCLASS | FSiData), that is, the probability of feature Fi having a fuzzy set value FSiCLASS given that the current value is FSiData. This is achieved using the mass assignment theory conditioning operation of semantic unification [1], which allows the conditioning of one fuzzy set given another, through their respective mass assignment representations (see Appendix A for details of this transformation). Unlike classical reasoning, where unification is performed at a


Page 59: Data Mining, Rough Sets and Granular Computing

53

syntactic level (pattern matching), here, where vague statements are represented by fuzzy sets, unification is performed at a semantic level, by the numerical manipulation of the corresponding membership functions via the semantic unification operation. This operation provides a very natural and formal means of measuring the degree of "match" between concepts expressed in terms of fuzzy sets.

Though semantic unification comes in two flavours - interval and point-valued - the work presented in this chapter has been limited to point-valued semantic unification. Point semantic unification can be thought of as corresponding to the expected value of the membership of a fuzzy set f given the least prejudiced distribution (LPD - a probability distribution, see Appendix A for more details) of fuzzy set g [13]. This is expressed more succinctly as follows for the discrete case:

Pr(f | g) = \sum_{i=1}^{n} \mu_f(x_i) \cdot LPD_g(x_i)

where both the fuzzy sets f and g are defined over the discrete universe Ω_X = {x_1, x_2, ..., x_n} and LPD_g denotes the probability distribution corresponding to the fuzzy set g, i.e. the updated probability distribution Pr(X | g) resulting from knowing that the value of the variable X has a fuzzy set value g.
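To make the point semantic unification operation concrete, the following is a minimal illustrative sketch (in Python, and not the authors' implementation) for discrete fuzzy sets represented as dictionaries mapping elements to memberships. The LPD computation assumes the standard mass assignment recipe, in which the mass associated with each alpha-cut is shared equally among the cut's elements, as described in Appendix A.

```python
def least_prejudiced_distribution(fuzzy_set):
    """Approximate LPD of a discrete fuzzy set {element: membership}.

    Each drop between successive membership levels defines the mass of an
    alpha-cut; that mass is shared equally among the cut's elements.
    """
    items = sorted(fuzzy_set.items(), key=lambda kv: kv[1], reverse=True)
    lpd = {x: 0.0 for x in fuzzy_set}
    for i, (_, mu) in enumerate(items):
        next_mu = items[i + 1][1] if i + 1 < len(items) else 0.0
        mass = mu - next_mu                     # mass of the cut {x_1, ..., x_i}
        if mass > 0.0:
            cut = [x for x, _ in items[: i + 1]]
            for x in cut:
                lpd[x] += mass / len(cut)
    return lpd


def point_semantic_unification(f, g):
    """Pr(f | g): expected membership of f under the LPD of g."""
    lpd_g = least_prejudiced_distribution(g)
    return sum(f.get(x, 0.0) * p for x, p in lpd_g.items())
```

For example, with f = {a: 1.0, b: 0.6} and g = {a: 0.4, b: 1.0}, the LPD of g is {a: 0.2, b: 0.8} and point_semantic_unification(f, g) returns 0.68.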

2.3.1.2 Inference at the rule body level

The previous section described how to calculate the degree of support for a body-level proposition given new evidence using point-valued semantic unification. This results in a point probability for each body-level proposition, Pr(FS_iCLASS | FS_iData). Inference at the rule body level is concerned with calculating a support for the overall collection of body propositions. This calculation varies depending on the rule structure being utilised. For conjunctive rule structures, the body support Body_CLASS is calculated by taking the product of the individual point semantic unifications between the fuzzy sets FS_iCLASS and the data values FS_iData as follows:

Body_{CLASS} = \prod_{i=1}^{m} Pr(FS_{iCLASS} | FS_{iData})

On the other hand, in the case of evidential logic rules, the body support Body_CLASS is calculated as follows:

Body_{CLASS} = \sum_{i=1}^{m} Pr(FS_{iCLASS} | FS_{iData}) \cdot w_i

where w_i is the weight of importance associated with feature i.
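Building on the point_semantic_unification sketch above, the two body-level aggregations can be written down directly. This is an illustrative sketch rather than the authors' code, and it assumes that the class and data fuzzy sets for the m features are supplied as parallel lists.

```python
import math

def conjunctive_body_support(class_fuzzy_sets, data_fuzzy_sets):
    """Product of the point semantic unifications Pr(FS_iCLASS | FS_iData)."""
    return math.prod(point_semantic_unification(fs_class, fs_data)
                     for fs_class, fs_data in zip(class_fuzzy_sets, data_fuzzy_sets))


def evidential_body_support(class_fuzzy_sets, data_fuzzy_sets, weights):
    """Weighted sum of the point semantic unifications (evidential logic rule)."""
    return sum(w * point_semantic_unification(fs_class, fs_data)
               for w, fs_class, fs_data in zip(weights, class_fuzzy_sets, data_fuzzy_sets))
```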

As a final step in the inference process, the support for the rule (or head proposition), S_CLASS, is simply equated to the value of the body support Body_CLASS.

2.3.2 Decision making

In general, when dealing with systems where the individual universes are granulated by fuzzy sets, multiple fuzzy sets and hence multiple fuzzy rules are called upon to deduce an answer given new evidence, i.e. inference is performed in a data-driven manner (forward chaining, as in fuzzy logic). For any particular test case, inference is performed on each rule separately and then the results of the individual rule inferences are combined to give a final overall outcome. Basically, a level of support S_CLASS is calculated for the head of each class rule (Classification of Object is CLASS) using the inference strategies presented above. In the case of classification problems, the classification of the input data vector (decision making) is determined as the class CLASS_max associated with the hypothesis (head proposition) with the highest support. A modified decision-making procedure based upon utility theory could alternatively be used, where the posterior probability S_CLASS (hypothesis support) is multiplied by the utility value of the respective hypothesis, and the classification of the input data vector is then determined as the class CLASS_max associated with the hypothesis that maximises the resulting expected utility [14].
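The decision-making step then reduces to an argmax over the rule supports, optionally scaled by class utilities. The sketch below reuses the body-support helpers introduced in Section 2.3.1.2 and adopts a hypothetical dictionary-based rule representation purely for illustration.

```python
def classify(class_rules, data_fuzzy_sets, utilities=None):
    """Return the winning class and the support S_CLASS of every class rule.

    class_rules: {class_label: (class_fuzzy_sets, weights or None)}; a rule
                 with weights=None is treated as a conjunctive rule, otherwise
                 as an evidential logic rule.
    utilities:   optional {class_label: utility}; when supplied, the decision
                 maximises expected utility instead of raw support.
    """
    supports = {}
    for label, (fuzzy_sets, weights) in class_rules.items():
        if weights is None:
            s = conjunctive_body_support(fuzzy_sets, data_fuzzy_sets)
        else:
            s = evidential_body_support(fuzzy_sets, data_fuzzy_sets, weights)
        supports[label] = s * utilities[label] if utilities else s
    return max(supports, key=supports.get), supports
```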

3. Learning Cartesian granule feature models

Knowledge representation in terms of additive Cartesian granule feature models was presented in the previous section. Here, an induction algorithm, G_DACG, which facilitates the learning of such models from example data, is presented. The induction of additive Cartesian granule feature models falls into the category of supervised learning algorithms. Within this framework, the task of the learner is to model the dependence of an output variable or feature Y on one or more input domain features (f_1, ..., f_n), i.e. to approximate a function y = g(f). For classification problems, the target variable Y is discrete, taking values from the finite set {Class_1, ..., Class_C}. Generally, when learning such systems, problem domains can be represented as databases of examples organised in a spreadsheet format, as presented in Table 1. This table consists of N examples, where each example corresponds to a row t that is made up of both input feature values (V_t1, ..., V_tn) for the corresponding domain features (f_1, ..., f_n), and an output feature value C_t that corresponds to a concept label in the problem domain. In the case of Cartesian granule feature modelling, each feature value V_tf can correspond to a numeric value, a symbolic value, or to uncertain or vague information that can be specified in terms of fuzzy subsets or interval values. Background knowledge about the domain, other than examples, can also be accommodated within the Cartesian granule feature framework but is not considered here.

Table 1: Example database in spreadsheet format.

Example | f1  | ... | ff  | ... | fn  | Class
1       | V11 | ... | V1f | ... | V1n | C1
...     | ... | ... | ... | ... | ... | ...
t       | Vt1 | ... | Vtf | ... | Vtn | Ct
...     | ... | ... | ... | ... | ... | ...
N       | VN1 | ... | VNf | ... | VNn | CN

The goal of supervised learning is to generate a model from the training examples, in this case a Cartesian granule feature model, that covers (classifies correctly) not only training examples, but also examples that have not been seen during training i.e. that generalises well. Subsequent paragraphs describe the main steps in learning an additive Cartesian granule feature model from example data using the G_DACG constructive induction algorithm (Genetic Discovery of Additive Cartesian Granule feature models).

G_DACG, which is presented in the next section in detail, can be viewed abstractly in terms of the following two steps:

• Language identification (step 2 in G_DACG): This step is concerned with identifying the language that can be used to describe models in an effective, tractable and transparent manner, that is, the identification of a network of low-order semantically related features - Cartesian granule features. The step can also be viewed as feature selection and discovery, i.e., identifies "useful" Cartesian granule features, the language of the model. The parameter identification phase of the induction algorithm (outlined next) is used as an evaluation function for identifying the language of the model. As language identification is done outside the main phase of the induction method but uses the induction method as the evaluation function, the feature selection and discovery component of G_DACG is classified as a wrapper approach [15] .

• Parameter identification (steps 3 to 5 in G_DACG): Having identified the language of the model, parameter identification then estimates the class fuzzy sets and class aggregation rules. Setting up the class aggregation rules is further divided into the tasks of estimating the weights associated with the individual Cartesian granule features.

3.1 G_DACG Algorithm

The G_DACG algorithm consists of the following five steps.

Step 1: Setup datasets. Split the database of examples into a training database D_train, a control database D_control and a testing database D_test.

Step 2: Language identification. Select which Cartesian granule features F_i, characterised by constituent domain features f_i and their corresponding abstractions (i.e. the linguistic partition of each problem feature universe
P_fi) should be used in order to model a problem effectively. For the purposes of the experiments presented in this chapter, this step is performed manually; however, previous work has proposed and demonstrated a feature discovery algorithm based on genetic programming [9]. This step outputs a set of Cartesian granule features {F_1, ..., F_i, ..., F_m}. These features are subsequently incorporated into evidential logic or conjunctive rules.

Step 3: Learn the class Cartesian granule fuzzy sets. This step extracts the fuzzy set values CGFS_iClass of each class-rule feature. For each class Class in the problem domain {Class_1, ..., Class_C}, extract a fuzzy set CGFS_iClass defined over each Cartesian granule feature universe Ω_Fi using the procedure outlined subsequently in Section 3.1.1.

Step 4: Identify rule weights. This step is required if the Cartesian granule features {F_1, ..., F_i, ..., F_m} and corresponding fuzzy set values are incorporated into evidential logic rules. It estimates the weight w_i associated with each Cartesian granule feature F_i using semantic discrimination analysis as follows:

w_i = 1 - \max_{j = 1, ..., Class_C;\; j \neq k} Pr(CGFS_{ik} | CGFS_{ij})

where Pr(·|·) denotes the semantic unification operation and k indexes the class of the rule under construction. The feature weights are subsequently normalised for each rule (see the sketch after Step 5).

Step 5: Generate model. Generate the corresponding Cartesian granule feature model, incorporating the learnt fuzzy set values and weights into class rules.
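As a sketch of the weight-estimation part of the algorithm (illustrative only, reusing the point_semantic_unification helper from Section 2.3.1.1), the semantic discrimination weights of an evidential logic rule can be estimated and normalised as follows; the dictionary-based representation of the class fuzzy sets is an assumption made here for brevity.

```python
def semantic_discrimination_weights(class_fuzzy_sets_by_feature, target_class):
    """Estimate w_i = 1 - max_{j != target} Pr(CGFS_i,target | CGFS_i,j).

    class_fuzzy_sets_by_feature: {feature_name: {class_label: CG fuzzy set}}.
    The weights are normalised so that they sum to one within the rule.
    """
    raw = {}
    for feature, by_class in class_fuzzy_sets_by_feature.items():
        target_fs = by_class[target_class]
        overlaps = [point_semantic_unification(target_fs, other_fs)
                    for label, other_fs in by_class.items() if label != target_class]
        raw[feature] = 1.0 - max(overlaps)
    total = sum(raw.values()) or 1.0        # guard against an all-zero corner case
    return {feature: w / total for feature, w in raw.items()}
```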

3.1.1 Learning Cartesian granule feature fuzzy sets from data

The following steps outline how to extract a Cartesian granule fuzzy set from example data for a class or concept defined over a feature universe Ω_Fi. This procedure corresponds to step 3 in the G_DACG algorithm. In the next section, an illustrative example of this fuzzy set learning algorithm is presented.

A Cartesian granule fuzzy set CGFS_iClass corresponding to the concept Class defined over feature universe Ω_Fi is learned from example data tuples as follows (see also the sketch after step F3):

Step F1: Initialise a frequency distribution DIST_iClass defined over all the Cartesian granules in the Cartesian granule feature universe Ω_Fi, that is, set the count of each Cartesian granule to zero.

Step F2: For each training tuple t of class Class, perform the following (Appendix A presents the membership-to-probability bi-directional transformation that exists between fuzzy set theory and probability theory, which is used extensively below):

• Construct the corresponding Cartesian granule fuzzy set (i.e. the linguistic description of the data vector) CGFS_tClass using the procedure outlined in Section 2.1.

• Transform the fuzzy set CGFS_tClass into its corresponding least prejudiced distribution LPD_tClass.

• Update the overall frequency distribution DIST_iClass with the least prejudiced distribution LPD_tClass.

Step F3: This frequency distribution DIST_iClass, once normalised, corresponds to the least prejudiced distribution LPD_iClass, which can then be transformed into the Cartesian granule fuzzy set CGFS_iClass (using the bi-directional transformation). In the absence of any other information, a uniform prior distribution over the Cartesian granules is assumed for this transformation.
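The following sketch mirrors steps F1-F3 for a single class; it reuses the least_prejudiced_distribution helper from Section 2.3.1.1 and assumes that the Section 2.1 procedure is available as a function mapping a raw data vector to its Cartesian granule fuzzy set. The inversion in step F3 implements the uniform-prior case of the bi-directional transformation.

```python
from collections import defaultdict

def lpd_to_fuzzy_set(lpd):
    """Invert the LPD transformation under a uniform prior (Appendix A).

    With granules sorted by decreasing probability p_1 >= ... >= p_n, the
    memberships are mu_i = sum_{j >= i} j * (p_j - p_{j+1}); the largest
    membership always comes out as 1 for a normalised LPD.
    """
    items = sorted(lpd.items(), key=lambda kv: kv[1], reverse=True)
    memberships, running = {}, 0.0
    for i in range(len(items) - 1, -1, -1):
        p_next = items[i + 1][1] if i + 1 < len(items) else 0.0
        running += (i + 1) * (items[i][1] - p_next)
        memberships[items[i][0]] = running
    return memberships


def learn_class_cg_fuzzy_set(class_tuples, linguistic_description):
    dist = defaultdict(float)                          # Step F1: zeroed counts
    for tup in class_tuples:                           # Step F2
        lpd_t = least_prejudiced_distribution(linguistic_description(tup))
        for granule, p in lpd_t.items():
            dist[granule] += p
    total = sum(dist.values())                         # Step F3: normalise ...
    lpd_class = {g: c / total for g, c in dist.items()}
    return lpd_to_fuzzy_set(lpd_class)                 # ... and invert to memberships
```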

3.1.2 Cartesian granule fuzzy set induction example

The following example illustrates how to form a one-dimensional Cartesian granule fuzzy set, in this case the fuzzy set corresponding to the concept of car positions in images. Firstly, the universe of the Position feature is linguistically partitioned. One possible linguistic partition could be:

P_Position = {Left, Middle, Right}.

This linguistic partition is depicted in Figure 4. The main steps in extracting a Cartesian granule fuzzy set for this example are graphically presented in Figure 5.

Figure 4: Fuzzy partition of universe Ω_Position.

The process begins by taking examples of car positions in images and generating corresponding Cartesian granule fuzzy sets and least prejudiced distributions. The top left table contains examples of car positions, corresponding linguistic descriptions (in this case, the Cartesian granule fuzzy sets are equivalent to the linguistic descriptions due to the one-dimensional nature of the CG feature) and least prejudiced distributions. The top middle graph shows the initial Cartesian granule frequency distribution. The top right graph depicts the Cartesian granule frequency distribution after updating with the LPD corresponding to the value of 40. The right middle graph shows the Cartesian granule frequency distribution after updating with the LPD corresponding to the value of 60. The bottom right

graph displays the Cartesian granule frequency distribution after counting all the LPDs corresponding to the example car positions. Finally, the bottom left graph depicts the corresponding Cartesian granule fuzzy set for car positions in images, i.e. a linguistic summary of car positions in images in terms of the words Left, Middle and Right. Here, for presentation purposes, the Cartesian granule feature is one-dimensional in nature; however, multidimensional features can be accommodated in a similar fashion, as will be presented later in this chapter.

Figure 5: Induction of the Cartesian granule fuzzy set, {Left/0.3 + Middle/1 + Right/0.25}, corresponding to car positions in images (as depicted in the lower left graph) from example car positions (top left table).
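For illustration only, the learning sketch above can be exercised on hypothetical car positions. The triangular partition and the example positions below are invented for the purpose of the demonstration and are not the data behind Figure 5, so the resulting fuzzy set will differ from the one quoted in the caption.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with core b and support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical partition of a Position universe [0, 100] into three words.
PARTITION = {"Left": (-50.0, 0.0, 50.0),
             "Middle": (0.0, 50.0, 100.0),
             "Right": (50.0, 100.0, 150.0)}

def describe_position(x):
    """Approximate the Section 2.1 linguistic description by normalising memberships."""
    fs = {word: triangular(x, *abc) for word, abc in PARTITION.items()}
    peak = max(fs.values())
    return {w: m / peak for w, m in fs.items() if m > 0.0}

positions = [40, 60, 55, 48, 70, 35]          # invented example car positions
print(learn_class_cg_fuzzy_set(positions, describe_position))
```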

4. Analysis of Cartesian granule feature models

This section details a set of experiments that were performed in order to empirically evaluate the impact of granularity on Cartesian granule feature models in the context of classification. The ellipse problem, described subsequently, was chosen as a representative problem to conduct this experimental analysis.

The section begins by introducing the ellipse problem. It then describes the format

used to present the experiments and analyses. Subsequent sections provide a detailed analysis of the following for Cartesian granule feature modelling in the context of the ellipse classification problem: level of granulation, granule characterisation, granule dimensionality and granule generation. This section finishes with a summary and discussion of these results.

4.1 Ellipse classification problem

The ellipse problem is a binary classification problem based upon artificially generated data from the real universe R × R. Points satisfying an ellipse inequality of the form x²/a² + y²/b² ≤ 1 are classified as legal, while all other points are classified as illegal. This is graphically depicted in Figure 6. The two domain input features, X and Y, are defined over the universes Ω_X = [-1.5, 1.5] and Ω_Y = [-1.5, 1.5] respectively. Different training, control (validation) and test datasets, consisting of 1000, 300 and 1000 data vectors respectively, were generated using a pseudo-random number stream. An equal number of data samples for each class were generated. Each data sample consists of a triple <X, Y, Class>, where Class adopts the value illegal, indicating that the point <X, Y> does not satisfy the ellipse inequality, and the value legal otherwise.

Figure 6: An ellipse in Cartesian space. Points in the lightly shaded region satisfy the ellipse inequality and thus are classified as legal; points in the darker region are classified as illegal.
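A dataset of the kind described above can be generated along the following lines. Since the exact ellipse parameters are not reproduced here, the semi-axes a and b in this sketch are placeholders, and the equal class balance is obtained by rejection sampling.

```python
import random

def make_ellipse_dataset(n, a=1.0, b=0.75, seed=0):
    """Generate n samples <X, Y, Class> over [-1.5, 1.5] x [-1.5, 1.5].

    A point is 'legal' when x^2/a^2 + y^2/b^2 <= 1 and 'illegal' otherwise;
    sampling continues until both classes hold n // 2 points each.
    """
    rng = random.Random(seed)
    legal, illegal = [], []
    while len(legal) < n // 2 or len(illegal) < n // 2:
        x, y = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
        inside = (x / a) ** 2 + (y / b) ** 2 <= 1.0
        bucket, label = (legal, "legal") if inside else (illegal, "illegal")
        if len(bucket) < n // 2:
            bucket.append((x, y, label))
    return legal + illegal

train, control, test = (make_ellipse_dataset(1000, seed=0),
                        make_ellipse_dataset(300, seed=1),
                        make_ellipse_dataset(1000, seed=2))
```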

4.2 Experimental variables and analysis

The ellipse problem has two domain input features, namely X and Y, and one output (predicted or dependent) variable. As this problem is sufficiently small, it permits the examination of a significant portion of the possible Cartesian granule feature models. The purpose of these experiments is to investigate the impact of different decision variables on the induced Cartesian granule feature model, most of which lie within the feature discovery process of the G_DACG algorithm. Models consisting of Cartesian granule features with various levels of granulation,
granule characterisation and granule dimensionality are manually and systematically sampled. Due to resource constraints (time and computing power), the analysis is limited to Cartesian granule features where the underlying granulation of all domain feature universes (in the case of multi-dimensional Cartesian granule features) is equivalent; though for the investigation into data-driven approaches to partitioning, this assumption is dropped. The examined model sample space represents only a very small proportion of the infinite abyss of possible models.

In this investigation, the use of both one- and two-dimensional Cartesian granule features formed over the problem input features X and Y is examined. The granularity of the partitions is varied from coarse (few fuzzy sets) to very fine (many fuzzy sets). The finer the granularity, the better the powers of prediction, although empirical evidence tends to suggest that there is a threshold on the number of fuzzy sets, above which no significant gains are made in terms of model accuracy. This threshold will vary from problem to problem. For the results presented here, granularities in the interval [2, 20] were considered, bearing in mind that if the partitioning is too fine, model generalisation will suffer. This is more succinctly stated in the principle of generalisation [16]: "The more closely we observe and take into account the detail, the less we are able to generalise to similar but different situations ...". The effect of the following granule characterisations is observed: triangular fuzzy sets; crisp sets; and trapezoidal fuzzy sets with differing degrees of overlap. As presented previously, different rule structures lead to different Cartesian granule feature models. Evidential logic rules lead to additive models and conjunctive rules lead to product models. Both rule structures are examined here. Table 2 summarises the decision variables and their respective values that are investigated.

Table 2: Decision variables (and possible choices) analysed in the context of Cartesian granule feature model construction for three artificial problems.

4.2.1 An example of ACGF modelling for the ellipse problem

A detailed example of one experiment is presented here outlining how a particular type of Cartesian granule feature model can be used to linguistically represent an ellipse. This example serves as a template for problems tackled by Cartesian granule feature models and their results, as presented in subsequent sections. Each experiment consists of five steps, which are outlined below. As alluded to
previously, the language identification phase of modelling (Cartesian granule feature selection), corresponding to steps (i) and (ii) below, is performed manually. Parameter identification, corresponding to steps (iii) to (v) below, is performed automatically using steps 3 and 4 of the G_DACG algorithm (Section 3.1). For the experiment described subsequently, a Cartesian granule feature model is constructed in terms of one two-dimensional Cartesian granule feature. The following steps overview this experimental process from the perspective of this example:

(i) Select Cartesian granule features: The use of one two-dimensional Cartesian granule feature consisting of the domain input features X and Y is examined.

(ii) Determine the granulation of the domain features in each Cartesian granule feature: In this case, the linguistic partitions of the domain features are characterised by six uniformly placed trapezoidal fuzzy sets that overlap to a degree of 0.5 (50% overlap, i.e., the left or right tails of the trapezoid overlap with 50% of the core of the neighbouring fuzzy sets). The linguistic partitions of the universes of the input variables X and Y are defined in Figure 8 (a corresponding graphic depiction is presented in Figure 7).

(iii) Learn Cartesian granule fuzzy sets: Subsequently, a Cartesian granule fuzzy set is learned for each of the legal and illegal classes. The Cartesian granule fuzzy sets corresponding to the legal and illegal classes, when the domain feature universes were partitioned using the above linguistic partitions, are depicted graphically in Figure 10. In both figures each grid point corresponds to a Cartesian granule and its associated membership value. This isomorphic relationship that exists between the class structures, as represented in Cartesian format (raw attribute values), and the graphic representation of the respective Cartesian granule fuzzy sets adds a somewhat intuitive meaning and interpretation to Cartesian granule fuzzy sets.

(iv) Generate rule set: These Cartesian granule features and learnt class fuzzy sets are then incorporated directly into the body of the respective classification rules. In this case since the model only consists of one feature, the conjunctive rule and the evidential logic rule will have equivalent behaviour. The generated rule set for this problem is presented in Figure 9.

(v) Estimate the accuracy of the generated model: The effectiveness of the learnt model is measured based on the accuracy of that model on the test dataset. In addition to model accuracy, a decision boundary is also graphically presented. In this experiment, the classification accuracy of the induced model is 96.5% and the corresponding decision boundary is depicted in Figure 11, where the shaded region corresponds to the predicted legal class, while the unshaded region corresponds to the predicted illegal class. The true ellipse is superimposed on the predicted results to illustrate the accuracy of the model. The fuzzy sets used to partition the domain variable universes

Ω_X and Ω_Y are also shown beneath and to the left of the classification area respectively. In Figure 11 an example of a linguistic term, yAround_Neg1.25, on Ω_Y is denoted by a fuzzy set on the vertical axis. In subsequent results, linguistic terms are omitted from graphs to avoid clutter. In this experiment, the extracted model forms a good approximation of the ellipse, as shown in Figure 11, though there are regions where false negatives and false positives occur. In terms of area, measured in Cartesian space, the extracted model yields an error rate of around 3.5%. In subsequent sections, the accuracy of the extracted models is presented only in terms of test datasets as opposed to area percentages, as this is more representative (and controllable), especially where classes do not occupy equal-size areas (as is the case in the ellipse problem).

Figure 7: A linguistic partition of the variable universe Ω_X, where the granules are characterised by trapezoidal fuzzy sets with 50% overlap.

Figure 8: Linguistic partitions P_X and P_Y of the domain feature universes Ω_X and Ω_Y respectively, where each granule is characterised by trapezoidal fuzzy sets with 50% overlap.
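A uniform trapezoidal linguistic partition with a tunable overlap factor, of the kind shown in Figures 7 and 8, can be generated along the following lines. The parameterisation of "overlap" used here (each trapezoid's tails extend a fraction of a cell into its neighbours) is one plausible reading of the chapter's overlap degree, not necessarily the exact definition used by the authors.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def uniform_trapezoidal_partition(lo, hi, k, overlap):
    """Place k trapezoidal fuzzy sets uniformly over [lo, hi].

    overlap = 0 reproduces crisp cells; larger values (up to 1) push each
    trapezoid's tails further into the neighbouring cells.
    """
    width = (hi - lo) / k
    spread = overlap * width / 2.0
    partition = {}
    for i in range(k):
        left, right = lo + i * width, lo + (i + 1) * width
        label = f"xAround_{(left + right) / 2:.2f}"
        partition[label] = (left - spread, left + spread, right - spread, right + spread)
    return partition

# Six words over [-1.5, 1.5] with a 50% overlap degree, as in Figure 8.
P_X = uniform_trapezoidal_partition(-1.5, 1.5, 6, overlap=0.5)
```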

4.3 Uniform Cartesian granule features

This section examines the use of Cartesian granule features where the universes of the constituent domain features are partitioned by uniformly distributed fuzzy sets. This is in contrast to the next section, where the universes of the constituent domain features are partitioned using data-centric techniques. This section progresses from presenting the results of models consisting of two-dimensional Cartesian granule features to models consisting of one-dimensional features.

Figure 9: A possible rule set for the ellipse problem in terms of two-dimensional Cartesian granule features. See Figure 10 for a close-up version of the fuzzy sets in this model.

4.3.1 Ellipse classification using 2D Cartesian granule features

In the previous section, a prototypical experiment of Cartesian granule feature modelling in the ellipse domain was presented in terms of a two-dimensional feature model. Here, the use of other types of two-dimensional Cartesian granule features is investigated. This investigation begins by exploring the use of uniformly placed, mutually exclusive, triangular fuzzy sets as a means of partitioning the domain feature universes. The granularity of the domain feature universes was varied uniformly across each feature (i.e. the same number of fuzzy sets in each partition). Levels of granularity ranging from 2 to 20 were investigated and the results achieved using the learnt models on unseen test data are plotted in Figure 12. For convenience, the top right hand corner of Figure 12 (and of subsequent result graphs) is used to denote the type of Cartesian granule feature
model being examined. In the case of Figure 12, the graph presents results for the ellipse problem where the underlying models consist of one two-dimensional Cartesian granule feature. The horizontal axis represents the granularity of the domain input feature universes and is expressed in terms of the number of fuzzy sets used. The vertical axis represents the level of accuracy obtained by the corresponding model. To avoid repetition, it is assumed for the remainder of this chapter, unless otherwise stated, that result graphs of this type follow this presentation format. Figure 13 shows the ellipse decision boundaries that were achieved using models where the granularities of the underlying domain features were varied from two to ten. At a granularity level of seven (see Figure 13(f)), the extracted model starts to fit the ellipse, but it is not until a granularity level of about nine that a good fit is achieved, with an error rate of about 4.8%. As depicted in Figure 12, the model accuracies oscillate (especially at the lower levels) as the granularity of the domain features increases. This oscillation is primarily due to the "lucky fit" of the triangular sets, which have broader support for lower levels of granularity. This "lucky fit" is more apparent in the case of crisp granules that are presented subsequently.

Figure 10: Graphic representation of (a) Legal and (b) Illegal Cartesian granule fuzzy sets, where each grid point corresponds to a Cartesian granule and its associated membership.

Next, the use of words that are characterised by trapezoidal fuzzy sets is examined as a means of partitioning the domain feature universes. This type of linguistic partition is not mutually exclusive. Again, the use of one two-dimensional Cartesian granule feature formed over the domain input features X and Y is explored. The trapezoidal fuzzy sets were positioned uniformly over the domain universes, varying the trapezoidal overlap factor from 100% overlap to 0% (0% overlap corresponds to a crisp partition). Figure 14 depicts the results obtained using linguistic partitions generated by trapezoidal fuzzy sets with the following degrees of overlap: 100% overlap (curve labelled T=1.0), 50% overlap (curve labelled T=0.5) and no overlap (curve labelled crisp, i.e. T=0.0). Again the granularity of the domain input feature universes was varied from 2 to 20 fuzzy sets.

Figure 11: Decision boundary using a two-dimensional Cartesian granule feature model, where the domain feature universes were partitioned using six uniformly placed trapezoidal fuzzy sets with 50% overlap.

Figure 12: Classification results for the ellipse problem using one 2D Cartesian granule feature, where triangular fuzzy sets were used to partition the domain features.

Figure 13: A montage of ellipse decision boundaries generated by models consisting of one 2D Cartesian granule feature, where various numbers of triangular fuzzy sets were used to partition the domain features: (a) 2 fuzzy sets; (b) 3 fuzzy sets (everything is classified as illegal); (c) 4 fuzzy sets; (d) 5 fuzzy sets; (e) 6 fuzzy sets; (f) 7 fuzzy sets; (g) 8 fuzzy sets; (h) 9 fuzzy sets; (i) 10 fuzzy sets.

In general, the use of fuzzy sets as a means of linguistically quantising the domain feature universes gives better results than obtained using crisp sets. The results shown in Figure 14 empirically support this claim. The decision boundaries of models using crisp Cartesian granule features lie along the boundaries of the linear crisp granules and thus it becomes more difficult to model problems other than those with a stepwise linear decision boundary. Decision tree approaches (such as ID3/C4.5 [4]) yield similar piecewise linear boundaries. This is clearly depicted in Figure 15, where the decision boundaries of various learnt models that use crisp granules are presented. Nevertheless, as the granularity increases, the Cartesian granules will better fit the surface boundary for the problem, thereby reducing the
model error. But with this increased model accuracy comes a high complexity cost, which may prove intractable in more complex systems, and may lead to overfitting.

Figure 14: Classification results for the ellipse problem using one 2D Cartesian granule feature, where the domain feature universes have been partitioned using trapezoidal fuzzy sets with various degrees of overlap.

Figure 15: (a) Decision boundary for the ellipse problem using a two-dimensional Cartesian granule feature model, where the domain feature universes were partitioned using 3 crisp sets; and (b) with 10 crisp sets.

For completeness, Figure 17 and Figure 18 present the model classification accuracies for other two-dimensional Cartesian granule features with varying domain feature granularities where the granules are characterised by trapezoidal fuzzy sets with different degrees of overlap, ranging from 0% overlap (curve labelled crisp i.e. T = 0.0) to 100% overlap (curve labelled T = 1.0).

Figure 16: Decision boundary for the ellipse problem using a two-dimensional Cartesian granule feature model, where the domain feature universes were partitioned using 7 trapezoidal fuzzy sets with an overlap rate of 60%.

Figure 17: Classification results for the ellipse problem using two-dimensional Cartesian granule features where the domain feature universes are partitioned with trapezoidal fuzzy sets with various degrees of overlap, ranging from 0% (curve labelled crisp) to 50% (curve labelled T=0.5).

In Figure 19, graphs (a)-(k) illustrate the effect of the trapezoidal overlap rate on the decision boundary. These are contrasted with the decision boundary generated by a model where the granule characterisation is a triangular fuzzy set, as depicted in Figure 19(l). In general, for the ellipse problem, granules characterised by trapezoids with overlapping degrees of between 50% and 70% yield models that fit the ellipse adequately (i.e. error rates in terms of misclassified area of around 3%) with very few words (five words) used in the linguistic partition of the domain
feature universes. Figure 16 depicts a model with an accuracy of 98% using seven words that are characterised by trapezoidal fuzzy sets with an overlap degree of 60%. As Figure 16 depicts, the misclassified areas correspond to false positive areas for the ellipse class. This is one of the best results obtained using relatively parsimonious/succinct linguistic partitions (well inside Miller's magic number of 7 ± 2 concepts [17]). Furthermore, when compared with triangular-based partitions, the use of trapezoidal-based partitions tends to yield models which are more parsimonious and which better fit the problem. Figure 20 contrasts the results obtained using models that use trapezoidal-based partitions with overlap rates of 0% (crisp case) and 50%, with models that use triangular-based partitions. This graph clearly shows that Cartesian granule features where the underlying granules are characterised by trapezoidal fuzzy sets outperform their triangular counterparts.

The two-dimensional features presented here represent only a very small proportion of the abyss of possible two-dimensional features. For example, it is possible to use features in which the domain attribute universes could have been partitioned with different types of fuzzy set, different numbers of fuzzy sets, and by using non-uniform partitioning (examined below in Section 4.4).

Figure 18: Classification results using two-dimensional Cartesian granule features where the domain feature universes were partitioned with trapezoidal fuzzy sets of various overlapping degrees (from 50% to 100%).

Figure 19: A montage of decision boundaries for the ellipse problem using an assortment of two-dimensional Cartesian granule feature models, where the domain feature universes were partitioned with a granularity of five as follows: (a)-(k) trapezoidal fuzzy sets where the degree of overlap varies from 0 to 100% in steps of 10%; (l) triangular fuzzy sets.

4.3.2 Ellipse classification using 1D Cartesian granule features

The use of various types of one-dimensional Cartesian granule feature is examined subsequently. In this case, each class rule consists of two one-dimensional features that are based upon the X and Y features respectively. Once again, the use of mutually exclusive triangular fuzzy sets that were placed uniformly across the domain feature universes is explored initially. Levels of granularity ranging from 2 to 20 were investigated.

The results obtained using these learnt models to classify unseen test data are graphed in Figure 21 (curve labelled Triang). In this case the Cartesian granule features were incorporated into evidential logic rule structures. The weights of
importance associated with each feature were estimated using semantic discrimination analysis. These one-dimensional models yielded on average accuracies of 92%. Figure 22(a) and Figure 23(a) illustrate some of the typical decision boundaries achieved using these types of models. In general, the extracted models find it difficult to capture the curvilinear nature of the ellipse's boundary. In fact, it takes a granularity level of around 11 to achieve a respectable boundary (which is still somewhat jagged). On the other hand, when the one-dimensional Cartesian granule features (triangular-based) were combined using conjunctive (product) rule structures, the model accuracies decreased by a couple of percentage points (see Figure 24 - curve labelled ConTriang). Figure 22(b) presents a typical decision boundary for conjunctive models with triangular granule characterisations. In general, using the conjunctive rule structure as a means of combining the features does not produce any false positives for the ellipse class (see Figure 22(b) for example), unlike the evidential logic rule (Figure 22(a)).

Figure 20: Comparison of classification results using two-dimensional Cartesian granule features where the domain feature universes are partitioned with triangular and trapezoidal fuzzy sets.

The use of two one-dimensional Cartesian granule features where the underlying granules are characterised by trapezoidal fuzzy sets is now examined. The trapezoidal fuzzy sets were distributed uniformly over the domain universes, varying the trapezoidal overlap factor from 100% overlap to 0% (i.e. a crisp partition). A granularity range of [2, 20] was investigated with uniformly positioned trapezoidal fuzzy sets with varying overlap. A subset of the results, restricted to the following types of granules, is presented: trapezoids with the best overlap rate; crisp granules; and trapezoidal granules with 100% overlap. This

should give some indication of the accuracies attainable with different degrees of overlap.

Figure 21: Comparison of classification results for the ellipse problem using two one-dimensional Cartesian granule features where the domain feature universes are partitioned with triangular and trapezoidal fuzzy sets. The Cartesian granule features are combined using the evidential logic rule.

Figure 21 presents results where the evidential logic rule structure was used as a means of combining the supports of the individual trapezoidal-based Cartesian granule features. The classification results plotted correspond to models where the underlying granules were characterised by trapezoidal fuzzy sets with an overlap degree of 100% (curve labelled T=1.0), with an overlap degree of 30% (curve labelled T=0.3), and no overlap (curve labelled crisp, i.e. T=0.0). The extracted evidential logic rule models once again outperform their conjunctive counterparts, yielding results in the low 90s (see Figure 24 for a comparison and the next paragraph for an explanation). This is due primarily to the fact that the Y-based Cartesian granule feature is more discriminating than the X-based feature, as the ellipse is horizontally oblong. Consequently, this Cartesian granule feature is given a higher weight (via semantic discrimination analysis) within the evidential reasoning process, resulting in better model accuracies than the conjunctive rule structure, which treats the features as equally important. Figure 23 gives an indication of the nature of the decision boundary generated by evidential and conjunctive rule structures in this case.

Figure 22: Decision boundaries when (a) the evidential logic rule and (b) the conjunctive rule are used as a means of combining one-dimensional Cartesian granule features for the ellipse problem. A granularity level of 11 was used on each domain feature universe. The granules were characterised by triangular fuzzy sets.

Figure 23: (a) Decision boundary when an evidential logic rule is used as a means of combining one-dimensional Cartesian granule features for the ellipse problem. A granularity level of 10 was used on each domain feature universe. The granules were characterised by trapezoidal fuzzy sets with an overlap degree of 30%. (b) Decision boundary when a conjunctive rule is used and the granules were characterised by trapezoidal fuzzy sets with an overlap degree of 20%.

Overall, in the case of one-dimensional Cartesian granule features, regardless of the rule structure, granules which are characterised by trapezoidal fuzzy sets outperform their triangular counterparts. This comparison of rule structure, along with different granule characterisations, is presented in Figure 24, where the labelled curves denote the following: ELTriang corresponds to the use of the evidential logic rule with triangular-based granules; ELT=0.3 represents the use of the evidential logic rule with trapezoidal-based granules with an overlap of 30%; ConTriang represents the use of the conjunctive rule with triangular-based granules; ConT=0.2 represents the use of the conjunctive rule with trapezoidal-based
granules overlapping to a degree of 20%. On the whole, the use of the evidential logic rule, where the granules are trapezoidal with 30% overlap, gives the best results.

4.4 Data centred Cartesian granule features

In the previous sections various types of Cartesian granule features were investigated where the underlying feature partitions were uniform in nature. Here, however, the presentation briefly digresses to investigate the use of partitions that are generated by data-centred approaches, along with investigating the use of granule merging (commonly known as pruning in the literature) in order to enhance the generalisation and transparency of learnt Cartesian granule feature models. This is done in the context of modelling the ellipse problem with two-dimensional Cartesian granule features. Two data-driven approaches to generating partitions of the problem feature universes are investigated: a percentile-based approach; and a clustering approach. Subsequently, pruning is examined, that is, where neighbouring granules are merged.

Figure 24: A comparison of using the evidential logic rule vs. the conjunctive rule as a means of combining one-dimensional Cartesian granule features for the ellipse problem, where the domain feature universes are partitioned with triangular and trapezoidal fuzzy sets.

4.4.1 Cartesian granule features using percentile-based partitions

The generation of partitions using data percentiles is investigated. The cluster centres were generated for each domain feature (i.e. all class data was considered together - heterogeneous) using the following steps. The data for each feature was ordered and subsequently distributed such that the dataset was uniformly distributed across the partition sets. The midpoints of each percentile interval are
used to form the midpoints of the fuzzy sets that partition the domain feature universe. Class Cartesian granule fuzzy sets and rules are subsequently learned in terms of these Cartesian granule features. Figure 25 presents some of the interesting results obtained using two-dimensional Cartesian granule features whose domain feature universes were partitioned as described above. Overall, the heterogeneous percentile approach to partitioning does not perform as well as the uniform approach (see the curve labelled Uniform T=0.5 in Figure 25). However, heterogeneous percentile-based partitioning does provide granules characterised by crisp sets with a significant boost in model accuracy. This is mainly due to the focusing of the crisp sets, in the case of the legal class, in areas where data exist.

Figure 25: Ellipse Classification using two-dimensional CG features where the partitions were generated using the one-dimensional percentile approach.
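A minimal sketch of the heterogeneous percentile approach is given below: the feature values (pooled across classes) are split into k equal-frequency chunks and the midpoint of each chunk is used as a fuzzy set centre. This is an interpretation of the description above, not the authors' implementation.

```python
def percentile_partition_centres(values, k):
    """Midpoints of k equal-frequency (percentile) intervals of one feature."""
    ordered = sorted(values)
    n = len(ordered)
    centres = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]   # roughly n/k points per chunk
        centres.append((chunk[0] + chunk[-1]) / 2.0)
    return centres
```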

4.4.2 Cartesian granule features using clustering-based partitions

The use of clustering techniques is examined as a means of generating partitions in the input feature universes. Any of a number of clustering techniques, such as fuzzy C-means (FCM) [18], Kohonen [19] or LVQ [18], could be used to cluster the input feature data. Here, the FCM clustering algorithm is used. Clustering is considered at different levels of dimensionality and homogeneity, where dimensionality refers to the number of variables considered for clustering at one time and where homogeneity refers to whether all classes are clustered together or whether each class is clustered individually. Homogeneous clustering, while facilitating the extraction of knowledge in terms of constructs which best capture the structure of the underlying training data, may lead to over-fitting. In general,
the number of cluster centres is manually input, but could quite easily be determined automatically (see [20] in the case of FCM). The cluster centres, the output of clustering, are then used to form the centre points of fuzzy sets that partition the respective domain feature universes. The fuzzy sets can subsequently be labelled with words automatically from a predefined dictionary, or can be provided by the user or domain expert. Class Cartesian granule fuzzy sets are generated in terms of these words. The ultimate goal is to extract cluster centres, which partition the individual universes, in such a way that good, parsimonious and intuitive linguistic descriptions of concepts can be extracted from the example data that model the system effectively. In other words, the goal is to extract anthropomorphic knowledge descriptions of the system that are effective in modelling the system.
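As an illustration of how cluster centres might be turned into partition centre points, the following is a minimal sketch of the fuzzy C-means update equations applied to a single feature; the function name fcm_centres and all parameter defaults are illustrative assumptions, not the configuration used in the reported experiments.

import numpy as np

def fcm_centres(x, c, m=2.0, iters=100, seed=0):
    """Fuzzy C-means on a 1-D feature; returns c cluster centres
    that can serve as centre points of the partitioning fuzzy sets."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                       # fuzzy memberships sum to 1 per point
    for _ in range(iters):
        um = u ** m
        centres = (um @ x) / um.sum(axis=1)  # membership-weighted means
        d = np.abs(x[None, :] - centres[:, None]) + 1e-12
        u = 1.0 / (d ** (2.0 / (m - 1)))
        u /= u.sum(axis=0)                   # standard FCM membership update
    return np.sort(centres)

# Example: seven centres for the X feature, matching a granularity of seven.
x_data = np.random.uniform(0.0, 10.0, size=500)
print(fcm_centres(x_data, c=7))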

4.4.2.1 Ellipse classification using one-dimensional clustering-based Cartesian granule features

This section briefly illustrates the application of "single feature" clustering approaches in generating partitions of the domain feature universes. For single feature clustering, two cases are considered: (1) where cluster centres were generated for each feature independently using the FCM clustering algorithm (i.e. heterogeneous clustering); and (2) where cluster centres for each class over each feature universe were generated independently (i.e. homogeneous clustering). Subsequently, linguistic partitions were created using the extracted cluster centres. These partitions were then used in conjunction with two-dimensional Cartesian granule features. Figure 26 presents the results obtained using two-dimensional Cartesian granule features where the domain feature universes were partitioned as described above for a fixed granularity of seven and where the granules were characterised by various types of fuzzy sets (see the X axis in Figure 26). By and large, the "single feature clustering" based models do not perform as well as their uniform counterparts (compare the curves labelled 1DHetroClustering and 1DHomogClustering, representing features generated using heterogeneous and homogeneous clustering respectively, with the curve labelled Uniform in Figure 26).

4.4.2.2 Ellipse classification using two-dimensional clustering-based Cartesian granule features

The use of multidimensional clustering in generating triangular-based partitions of the domain feature universes is examined here. The cluster centres for each class were generated independently (homogeneous clustering) using the FCM clustering algorithm. These cluster centres were then used to generate mutually exclusive triangular-based partitions of the domain feature universes. Table 3 presents some of the more interesting results obtained using two-dimensional Cartesian granule features where the domain feature universes were partitioned as described above. The performance of the models using multidimensional clustering compares very favourably with models that use uniformly partitioned features (compare columns 3 and 4 in Table 3). Forming Cartesian granule features using multi-dimensional clustering can lead to close-lying cluster centres when the multi-dimensional cluster centres are projected onto the individual universes. Consequently, the next variation in the approach is to merge close-lying cluster centres, which is examined in the next section.


Figure 26: Ellipse classification using CG features where the underlying feature partitions are generated using uniform and various clustering approaches. The granularity of the feature universe partition was fixed at seven.

4.4.3 Pruning Cartesian granules

A simple heuristic was used to merge neighbouring clusters: cluster centres which were within 2% of the domain range of each other were merged. The merging of neighbouring cluster centres results in the generation of a new cluster centre, which is the midpoint between the merged centres. Table 4 gives the results obtained using the pruned models and also indicates the corresponding reduction in granularity. Each entry in column 4 of this table consists of two numbers representing the granularity of the X universe for the legal and illegal classes. The entries in column 5 represent similar information, but in this case for the Y feature. Here cluster elimination/merging can be viewed as a complexity reduction. While achieving a reduction in the granularity of the Cartesian granule universe, this also results in a reduction of the model accuracy. The non-merged approach outperforms the merged approach, as indicated by the results in Table 3 and Table 4. However, this reduction in accuracy may be tolerable for more complex systems, since it may be a way of producing a model that is comptractable (computationally tractable) and comprehensible, while also performing satisfactorily in terms of accuracy.
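The merging heuristic can be sketched as follows; this is a minimal illustration assuming a list of one-dimensional cluster centres and a known domain range, with merge_close_centres being a hypothetical helper name.

def merge_close_centres(centres, domain_min, domain_max, tol=0.02):
    """Merge neighbouring cluster centres that lie within `tol`
    (2% by default) of the domain range of each other; each merged
    pair is replaced by its midpoint."""
    threshold = tol * (domain_max - domain_min)
    ordered = sorted(centres)
    merged = [ordered[0]]
    for c in ordered[1:]:
        if c - merged[-1] <= threshold:
            merged[-1] = 0.5 * (merged[-1] + c)   # replace the pair by its midpoint
        else:
            merged.append(c)
    return merged

# Example: centres projected from two-dimensional clustering onto the X universe.
print(merge_close_centres([1.0, 1.1, 4.0, 4.05, 7.5], 0.0, 10.0))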


Other forms of pruning, such as the logical merging of granules, are also possible but are not discussed in detail here. For example, neighbouring granules (in the projected one-dimensional sense) that exhibit similar membership levels could be merged. Similarly, modified entropy algorithms, as used in decision tree pruning [5], could be used here to merge neighbouring granules. Pruning in this way is an example of how to exploit the tolerance for imprecision and uncertainty while achieving tractability, robustness and low solution cost, one of the guiding principles of soft computing [21].

Table 3: A comparison of classification results using Cartesian granule features CGF XxY based upon two-dimensional clustering and uniform partitioning.

Table 4: Classification results using Cartesian granule features CGF XxY based upon two-dimensional clustering (after pruning).

4.5 Discussion and summary

The previous sections have presented the results of a study which evaluated models consisting of Cartesian granule features with different levels of granulation, granule characterisation and granule dimensionality. The following are the main findings of these experiments for the ellipse problem:

• Overall, granules characterised by trapezoidal fuzzy sets outperformed other characterisations.

• One-dimensional and two-dimensional Cartesian granule features were investigated, with the two-dimensional feature yielding a higher accuracy on unseen test data. This suggests a necessity to model this higher-dimensional association in order to avoid decomposition error.


• Generating partitions using data-centred approaches, such as clustering and percentile-based techniques, can lead to simpler models but can reduce model generalisation. Overall, the use of uniformly positioned fuzzy sets is computationally more efficient and effective. These uniformly placed fuzzy sets can subsequently be remapped onto a more natural or humanistic vocabulary using dictionaries, disjunctions, conjunctions, linguistic hedges, etc., to give the model a more anthropomorphic flavour.

• The investigated models consist solely of either one-dimensional features or of a two-dimensional feature, but when models consisting of mixed-dimensional features are used they lead to better performance and transparency (see next section).

• Pruning, as a means of reducing model complexity, while also enhancing the extracted model's accuracy and generalisation powers, was briefly presented but needs further work to illustrate its practical usefulness in this context.

5. Ellipse problem comparison

The previous section has presented the results of experiments where Cartesian granule feature models were used to model the ellipse problem. These models were constructed semi-automatically (that is, the language was determined manually and the corresponding model parameters were identified automatically). In this section, the results of experiments are examined where other modelling with words techniques were applied to the ellipse problem: the G_DACG algorithm (fully automated version); the data browser; and the mass assignment tree induction (MATI) algorithm. In addition, neural networks were also compared. These approaches were assessed using the same datasets that were described above.

5.1 A G_DACG run on the ellipse problem

In the previous section, the language of Cartesian granule feature models was determined manually, while the parameters of the model were identified automatically, resulting in the construction and analysis of both one- and two-dimensional models (but not mixtures). Previous work has, however, automated this task by reformulating Step 2 of the G_DACG algorithm as a search through the space of possible models to discover the model language (Cartesian granule features) [9, 22]. In this work, genetic programming has been used to accomplish this search. Running the fully automated G_DACG algorithm on the ellipse problem results in Cartesian granule feature models that yield not only high levels of accuracy (99% or higher) but also high levels of transparency. One such result is presented in row 1 of Table 7, where the legal class is summarised using two one-dimensional Cartesian granule features (with a granularity of ten for the X feature, and three for the Y feature) and the illegal class is described using three Cartesian granule features. The underlying granule characterisation for this model was a trapezoidal fuzzy set with 60% overlap.


5.2 Fuzzy data browser

The data browser is an induction system that automatically extracts rules and fuzzy sets (one-dimensional) from statistical data [2]. The data browser is another example of the modelling with words paradigm. This will become apparent in this section, which begins by briefly describing the learning process in the data browser. This is followed by a summary of the results obtained when the data browser was applied to the ellipse problem.

5.2.1 Learning in the fuzzy data browser

Consider a classification problem where the target function, y = g(f1, ..., fn), models a dependency between a target feature Y and a set of domain input features f1, ..., fn. The target variable Y is discrete, taking values from the finite set {Class1, ..., ClassC}. The data browser can induce both additive and product models. A data browser induced classifier accepts as input a tuple of values <v1, ..., vn> and predicts the target value y by performing approximate reasoning as described in Section 2.3.

Learning in the data browser consists of two steps: univariate class conditional fuzzy set estimation; and rule generation. The data browser estimates univariate class conditional fuzzy sets from a training dataset via the corresponding probabilistic class conditionals:

Pr(fi | Y)   ∀ i ∈ {1, ..., n}

This is enabled by the membership-to-probability bi-directional transformation presented in Appendix A. Conditional probabilities are constructed on discretised universes, where the underlying partitions can be crisp or fuzzy, i.e. modelling with words. The underlying partition of a universe Ωfi induces a fuzzy set description of each training example, which is subsequently converted into its corresponding least prejudiced distribution using the membership-to-probability bi-directional transformation. The resulting probabilistic event descriptions are then counted for each fuzzy bin and a frequency distribution is generated for each feature fi. This process of counting fuzzy events is similar for one-dimensional Cartesian granule features. Subsequently, the data browser converts each distribution to a continuous form by linking up the centre points of granules. The resulting probability density is then transformed to a fuzzy set, the class conditional fuzzy set FSfi-Classj, which approximates or summarises the description of class Classj in terms of this feature fi. Smoothing algorithms can also be applied prior to this transformation. For example, piece-wise regression techniques could be used, where neighbouring points in the probability distribution with similar characteristics, such as similar derivatives, could be summarised using a line. More sophisticated smoothing algorithms can also be used.
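The fuzzy-event counting step can be sketched as follows; this is a minimal illustration assuming triangular granules over a one-dimensional universe, and the names fuzzy_histogram and triangular_memberships are hypothetical, not taken from the data browser itself.

import numpy as np

def triangular_memberships(x, centres):
    """Membership of value x in triangular granules centred at `centres`
    (each granule peaks at its centre and falls to zero at its neighbours)."""
    mu = np.zeros(len(centres))
    for i, c in enumerate(centres):
        left = centres[i - 1] if i > 0 else c
        right = centres[i + 1] if i < len(centres) - 1 else c
        if left < x <= c:
            mu[i] = (x - left) / (c - left)
        elif c <= x < right:
            mu[i] = (right - x) / (right - c)
        elif x == c:
            mu[i] = 1.0
    return mu

def fuzzy_histogram(values, centres):
    """Count fuzzy events: each example adds its (normalised) memberships
    to the bins, yielding a frequency distribution over the granules."""
    counts = np.zeros(len(centres))
    for x in values:
        mu = triangular_memberships(x, centres)
        counts += mu / mu.sum()          # normalise the example's memberships
    return counts / counts.sum()         # normalise to a probability distribution

centres = [0.0, 2.5, 5.0, 7.5, 10.0]
legal_x = np.random.uniform(2.0, 8.0, size=200)   # hypothetical legal-class data
print(fuzzy_histogram(legal_x, centres))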


The second step in the learning phase of the data browser generates, for each target class value Classj, a conjunctive rule that combines the class conditional fuzzy sets of the individual features.

5.2.2 Results for the fuzzy data browser

Table 5 presents the results achieved when the data browser was used to generate models of different rule types for the ellipse problem, i.e. evidential logic and conjunctive rule structures. In this case, the granularity of the domain feature universes is 15, and the granules were characterised by crisp sets. The experiment, while yielding relatively high accuracy rates on the test dataset, produces an inaccurate decision boundary, especially in the boundary areas of high curvature (see Figure 27(a)). This low performance arises from the decomposed nature of the features and also from poor granulation (also known as discretisation) during learning. In the data browser, when generating fuzzy sets over continuous variables from corresponding data distributions, it is necessary to discretise the domain or to assume that the data follows some distribution (such as a Gaussian distribution). Discretisation is a well-known problem in statistics and machine learning, where slightly different partitions of a domain can lead to significantly different models (distributions, decision trees, etc.) [11, 23, 24].

Table 5: Classification results using the data browser on the ellipse problem, using a granularity level of 15 and crisp granules.

To overcome some of the problems resulting from discretisation, such as discontinuities, various smoothing algorithms can be used [25, 26]. For the models presented in Table 5, the data browser extracted unsmoothed fuzzy sets corresponding to the legal and illegal classes. However, when a smoothing algorithm was applied to the extracted data distributions, the decision boundary depicted in Figure 27(b) resulted. Contrary to the thesis that smoothing can both improve generalisation and reduce the model/fuzzy set complexity, in this case it results in a big drop in performance. This drop in performance is mainly attributed to the underlying abstraction of the domain using crisp buckets (i.e. histogram-based), from which the smoothing algorithm cannot recover.

Using fuzzy granules or buckets during histogramming resulted in a significant improvement in performance, but it is not until the granularity level is increased to 30 that performance levels begin to match those of one-dimensional Cartesian granule feature models (96%).


In the unsmoothed case, a data browser induced fuzzy set is similar to a Cartesian granule fuzzy set (i.e. for one-dimensional features). In both cases a probability distribution is generated on crisp/fuzzy granules from the example data. Subsequently, the data browser converts this distribution to a continuous form by linking up the centre points of granules. The resulting probability density is then transformed to a fuzzy set. In contrast, in the case of Cartesian granule features the granularity of the probability distribution is maintained after the transformation to a fuzzy set (that is, it is not converted to a continuous fuzzy set).

5.3 Mass assignment based decision trees

The mass assignment tree induction algorithm (MATI) [3] induces probabilistic decision trees over linguistically partitioned universes. This approach is another example of both modelling with words and computing with words. The extracted decision trees can be directly translated into extended Fril rules [3]. Applying the MATI algorithm to the ellipse problem yields a classification accuracy of 99% on the unseen test data. In this case, the domain features of the induced model were partitioned uniformly using granules that were characterised by trapezoidal fuzzy sets with an overlap degree of 50%. The granularity of each domain feature universe was ten.


Figure 27: (a) Ellipse decision boundary using data browser generated rules and fuzzy sets with no smoothing; (b) ellipse decision boundary using data browser generated rules and fuzzy sets with smoothing.

5.4 Multi-layer perceptron

Two-layered perceptrons of various architectures were applied to the ellipse problem. The neural networks were implemented with the SNNS simulator [27]. A scaled conjugate gradient (SCG) algorithm [28] was used as the learning algorithm for these feed-forward neural networks. SCG learning algorithms, due to their second order nature, tend to find better (local) minima than first order techniques (such as back propagation), but at a higher computational cost. The simulator learning grain was set to "pattern". Table 6 presents the results obtained when perceptrons with hidden layers of different sizes were used to model the ellipse problem. The number of hidden nodes was varied from two to five and this is indicated in the network architecture column in Table 6. For example, the architecture 2-2-2 corresponds to the following feed-forward network: the network has two input nodes corresponding to the input features X and Y; it has two hidden nodes; and two output nodes, each corresponding to the output classification of legal or illegal respectively. The output classification of a data vector is determined by taking the classification corresponding to the maximum of the output values generated by the data vector. The decision boundaries generated by the neural network models presented in Table 6 are depicted in a series of graphs, the details of which are given in the column entitled "Decision boundary figures". The neural network performs very well in modelling this problem but it does require at least three hidden nodes in order to yield good classification accuracy.
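For illustration, a forward pass through a 2-3-2 network with argmax classification might look as follows; the weights shown are arbitrary placeholders and the code does not reproduce the SNNS/SCG training itself, it is only a minimal sketch of how a trained network of this architecture classifies a point.

import numpy as np

def classify(x, W1, b1, W2, b2, labels=("legal", "illegal")):
    """Forward pass of a 2-3-2 feed-forward network with sigmoid units;
    the predicted class is the output node with the largest value."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer (3 units)
    o = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # output layer (2 units)
    return labels[int(np.argmax(o))]

# Arbitrary (untrained) parameters, for shape illustration only.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
print(classify(np.array([0.2, 5.0]), W1, b1, W2, b2))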

Table 6: Classification results using neural networks on the ellipse problem.


Figure 28: Decision boundaries achieved using different multi-layer perceptrons: (a) a perceptron with a 2-2-2 architecture; (b) a perceptron with a 2-3-2 architecture; and (c) a perceptron with a 2-4-2 architecture.

6. Discussion

Despite the uncomplicated nature of the ellipse classification problem, it does serve to illustrate the importance of granularity within the modelling with words paradigm. It demonstrates that fuzzy sets are a more desirable and necessary characterisation of granules than crisp sets for classification problems. Firstly, models which employ a fuzzy set characterisation of granules will, in general, require a lower granularity. This lower granularity will tend to lead to better generalisation. Secondly, fuzzy set based models, due to the interpolative nature of smooth fuzzy sets, give a much more flexible decision boundary/surface (i.e. not piecewise linear), whereas the use of crisp sets or fairly crisp sets (fuzzy sets with low degrees of overlap) yields decision boundaries which are stepwise in nature. Thirdly, models based upon crisp granules tend to be very sensitive to the location of granule boundaries, sometimes yielding a discontinuous behaviour when the boundaries are changed, whereas the use of fuzzy granules tends to be more robust in this respect. Finally, multi-dimensional granules can effectively model composed feature spaces.

All of the approaches examined here do very well in modelling the ellipse problem. Table 7 presents a summary of some of the best results achieved using these approaches. From a generalisation perspective, the composed approaches, such as the multidimensional Cartesian granule feature models, MATI models and neural network models, perform better than the approaches that rely on total decomposition, such as the one-dimensional Cartesian granule feature approaches and the data browser approaches. From a model complexity perspective, the Cartesian granule feature models and the associated reasoning and inference procedure are glass-box/transparent in nature, and relatively easily interpreted. The data browser and MATI algorithms provide similar transparency of representation and inference. The multi-layer perceptron based models, in addition to their high degree of parameterisation, also have the disadvantage that the mapping they approximate is embodied in the weight and bias matrices and thus the approximation may not be amenable to inspection or analysis except in simple cases.

Table 7: Summary of ellipse classification problem using various learning approaches.

The use of crisp one-dimensional Cartesian granule features incorporated into product rules yields a model that is equivalent to a naive Bayes classifier under certain conditions, even though at the surface level the models and inference strategies look very different, with Cartesian granule feature models being represented by fuzzy sets and probabilistic rules, and naive Bayes classifiers being represented by conditional probabilities and class priors. Both models yield the same results when the class priors are uniform and the distribution of data amongst Cartesian granules is uniform. See [29] for further details of this comparison. A possible new approach to learning is to use a naive Bayesian approach where the events are no longer precise but fuzzy granular.
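Such a fuzzy-event naive Bayes classifier might be sketched as follows; this is a speculative illustration of the proposal rather than the authors' algorithm, reusing the hypothetical fuzzy_histogram and triangular_memberships helpers introduced earlier and assuming uniform class priors.

def fuzzy_naive_bayes(train, centres_per_feature):
    """Estimate, for each class and feature, a distribution over fuzzy
    granules (fuzzy events) by fuzzy counting."""
    model = {}
    for cls, rows in train.items():
        model[cls] = [fuzzy_histogram([r[i] for r in rows], centres)
                      for i, centres in enumerate(centres_per_feature)]
    return model

def predict(model, centres_per_feature, x):
    """Product-rule combination of per-feature fuzzy-event probabilities
    (uniform class priors assumed)."""
    scores = {}
    for cls, dists in model.items():
        score = 1.0
        for i, centres in enumerate(centres_per_feature):
            mu = triangular_memberships(x[i], centres)
            score *= float((mu / mu.sum()) @ dists[i])   # probability of the fuzzy event
        scores[cls] = score
    return max(scores, key=scores.get)

centres = [[0.0, 2.5, 5.0, 7.5, 10.0], [0.0, 2.5, 5.0, 7.5, 10.0]]
train = {"legal": [(4.0, 5.0), (5.5, 4.5)], "illegal": [(1.0, 9.0), (9.0, 1.0)]}
model = fuzzy_naive_bayes(train, centres)
print(predict(model, centres, (4.8, 5.2)))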

From a model representation point of view, learnt probability distributions in terms of Cartesian granule features are equivalent to maximum-depth crisp decision trees (as generated by ID3 or C4.5) or probabilistic fuzzy decision trees (as generated by MATI), where each leaf node is equivalent to a Cartesian granule. However, once these distributions are transformed into their equivalent fuzzy sets (in the Cartesian granule feature case), the parallel no longer exists. Nevertheless, both approaches tend to yield similar results (for example, see Table 7). The added attraction of the Cartesian granule feature approach is that it tries to decompose the problem into a network of low-order, semantically related variables, which are represented by Cartesian granule features that are incorporated into rule-based models, whereas decision trees, in general, try to solve the problem with one big decision tree. Moreover, recent work has illustrated that combining multiple decision trees (in effect Cartesian granule features) can lead to useful results [30].

The results presented in this chapter also support the following argument: approaches that rely on total decomposition, that is, that ignore the problem structure (such as one-dimensional Cartesian granule features, the data browser and naive Bayes), will not, in general, perform as well as approaches that focus on modelling the problem structure (multidimensional Cartesian granule feature models, neural networks and Bayesian networks).

The work presented here has focused on granulation for classification problems; however, the use of fuzzy granules also facilitates the application of this approach to prediction problems and unsupervised learning [9].

7. Conclusions

This chapter has concentrated on the analysis of granularity within the modelling with words paradigm. This analysis has demonstrated the important role that granule characterisation, the level of granularity, the dimensionality of the granules and the techniques used to generate these granules play in modelling with words in general, and more specifically, for Cartesian granule features.

In addition, this chapter has provided a useful platform for comparing and understanding other well-known learning algorithms that may or may not explicitly manipulate fuzzy events or probabilities. For example, it was shown how a naive Bayes classifier is equivalent to crisp Cartesian granule feature classifiers under certain conditions. Other parallels were also drawn between learning approaches such as decision trees and the data browser. As a result of this analysis, an extension of the naive Bayesian approach from crisp events to fuzzy events is proposed.

Overall, Cartesian granule features open up a new and exciting avenue in probabilistic fuzzy systems modelling which allows not only the ability to compute with words but also to model with words, thereby leading to anthropomorphic knowledge descriptions that are effective in modelling classification systems.

Appendix A: Mapping membership values to probabilities

A brief overview of the bi-directional transformation between fuzzy sets and probability distributions is presented here. This mapping is exploited during learning and reasoning in the context of Cartesian granule feature models.

Mass assignments (probability distributions on power set elements, in contrast to probability distributions on singleton elements as in point probability theory) play a fundamental role in this bi-directional mapping between probability distributions (in their singleton or point-value sense) and fuzzy sets [1, 31]. A mass assignment over a finite frame of discernment (universe of discourse) Ω is a function:

m : P(Ω) → [0, 1]

that satisfies the following condition:

Σ_{A ∈ P(Ω)} m(A) = 1

where P(Ω) is the power set of Ω.

Every set A ∈ P(Ω) for which m(A) > 0 is called a focal element of m. A mass assignment can be viewed as a form of knowledge that expresses upper and lower probabilities for the individual elements of the frame of discernment. In other words, a mass assignment can be viewed as a family of probability distributions, all of which satisfy the axioms of probability theory and the upper and lower constraints delimited by the mass assignment. Consequently, although mass assignments can represent probabilities, they have the added flexibility of being able to represent uncertain probabilities. For example, consider a class of undergraduate students where students can be classified as first-class honours, second-class honours or pass. Consider the case where there are 100 students, where it is known that 30 are pass students, 40 are second-class honours or pass, and the classification of the remainder is unknown. This can be written more succinctly in mass assignment format as follows:

MAClass = {pass} : 0.3, {pass, second-class honours} : 0.4, {pass, second-class honours, first-class honours} : 0.3

This mass assignment corresponds to the following family of probability distributions:


0.3 ≤ Pr(pass) ≤ 1
0 ≤ Pr(second-class) ≤ 0.7
0 ≤ Pr(first-class) ≤ 0.3

such that

Pr(pass) + Pr(second-class) + Pr(first-class) = 1.0.

A particular type of probability distribution is obtained by distributing the mass associated with the non-singleton focal elements uniformly; this distribution is termed the least prejudiced distribution (LPD) [31]. In the case of MAClass the corresponding LPD, LPDClass, is given as follows:

Pr(pass) = 0.3 + 0.4/2 + 0.3/3 = 0.6
Pr(second-class) = 0.4/2 + 0.3/3 = 0.3
Pr(first-class) = 0.3/3 = 0.1
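A minimal sketch of this mass-to-LPD calculation, assuming focal elements given as Python frozensets (the function name least_prejudiced is illustrative):

def least_prejudiced(mass_assignment):
    """Spread the mass of each focal element uniformly over its members,
    yielding the least prejudiced distribution (LPD)."""
    lpd = {}
    for focal, mass in mass_assignment.items():
        for element in focal:
            lpd[element] = lpd.get(element, 0.0) + mass / len(focal)
    return lpd

ma_class = {
    frozenset({"pass"}): 0.3,
    frozenset({"pass", "second-class"}): 0.4,
    frozenset({"pass", "second-class", "first-class"}): 0.3,
}
print(least_prejudiced(ma_class))   # pass: 0.6, second-class: 0.3, first-class: 0.1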

The transformation of a mass assignment to a least prejudiced distribution is reversible; hence, given a least prejudiced distribution, it is possible to find the corresponding mass assignment.

Mass assignments are related to fuzzy sets via possibility theory [32]. Consider that a variable V has a fuzzy set value f as follows:

V is f

where f is a fuzzy set defined on the discrete universe X = {x1, ..., xn}, written more succinctly as follows:

f = x1/χ1 + x2/χ2 + ... + xn/χn

where χi denotes the membership of xi in f.

This proposition that "V has a fuzzy set value f" induces a possibility distribution over the values of X such that the membership values of the xi are numerically equated with possibilities, i.e. π_V(xi) = χi for i = 1, ..., n.

Suppose f is a normalised fuzzy set whose elements are ordered according to decreasing membership values, i.e. χ1 ≥ χ2 ≥ ... ≥ χn with χ1 = 1; then the induced possibility measure is Π(A) = max_{xi ∈ A} χi for any A ∈ P(X).


With the assumption that Pr(A) ≤ Π(A) for any A ∈ P(X), the mass assignment corresponding to the fuzzy set f can be determined as follows:

MAf = { {x1, ..., xi} : χi − χi+1 | i = 1, ..., n }, with χn+1 = 0.

This can be extended to non-normal fuzzy sets, so that the mass assignment corresponding to the fuzzy set f takes the following form:

MAf = { {x1, ..., xi} : χi − χi+1 | i = 1, ..., n; ∅ : 1 − χ1 }, with χn+1 = 0,

such that a non-zero mass is assigned to the null set; in this case the mass assignment is said to be incomplete. The extension to fuzzy sets over continuous universes is a little more involved and is achieved by taking alpha cuts of the fuzzy set and proceeding in a similar fashion to that described above, with continuous integrals.

Since the focal elements in the mass assignment corresponding to a fuzzy set are nested (consonant), there exists a straightforward transformation from fuzzy sets to mass assignments to point probability distributions. This bi-directional transformation plays a vital role in the learning algorithms presented in this chapter, facilitating learning through a counting approach over granules/words. For further details of the relationship between probabilities and membership values the reader is referred to [9, 33-37].
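Putting the pieces together, the fuzzy-set-to-mass-assignment step can be sketched as below, reusing the hypothetical least_prejudiced helper shown earlier; fuzzy_set_to_mass is an illustrative name and the example fuzzy set is invented for demonstration.

def fuzzy_set_to_mass(fuzzy_set):
    """Convert a discrete fuzzy set {element: membership} into its nested
    (consonant) mass assignment; a non-normal set assigns mass 1 - max
    membership to the empty set."""
    items = sorted(fuzzy_set.items(), key=lambda kv: kv[1], reverse=True)
    ma = {}
    chi_max = items[0][1]
    if chi_max < 1.0:
        ma[frozenset()] = 1.0 - chi_max           # incomplete mass assignment
    for i, (_, chi) in enumerate(items):
        chi_next = items[i + 1][1] if i + 1 < len(items) else 0.0
        if chi - chi_next > 0.0:
            ma[frozenset(e for e, _ in items[:i + 1])] = chi - chi_next
    return ma

f = {"small": 1.0, "medium": 0.6, "large": 0.2}   # invented fuzzy set
ma = fuzzy_set_to_mass(f)
print(ma)                     # nested focal elements with masses 0.4, 0.4, 0.2
print(least_prejudiced(ma))   # point probabilities recovered by uniform spreading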

References

[1] Baldwin, J. F., Martin, T. P., and Pilsworth, B. W. (1995). FRIL - Fuzzy and Evidential Reasoning in A.I. Research Studies Press (Wiley Inc.).
[2] Baldwin, J. F., and Martin, T. P. (1995). "Fuzzy Modelling in an Intelligent Data Browser." In the proceedings of FUZZ-IEEE, Yokohama, Japan, 1171-1176.
[3] Baldwin, J. F., Lawry, J., and Martin, T. P. (1997). "Mass assignment fuzzy ID3 with applications." In the proceedings of Fuzzy Logic: Applications and Future Directions Workshop, London, UK, 278-294.
[4] Quinlan, J. R. (1986). "Induction of Decision Trees", Machine Learning, 1(1):86-106.
[5] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[6] Ruspini, E. H. (1969). "A New Approach to Clustering", Information and Control, 15(1):22-32.
[7] Zadeh, L. A. (1994). "Soft Computing and Fuzzy Logic", IEEE Software, 11(6):48-56.
[8] Zadeh, L. A. (1996). "Fuzzy Logic = Computing with Words", IEEE Transactions on Fuzzy Systems, 4(2):103-111.
[9] Shanahan, J. G. (2000). Soft Computing for Knowledge Discovery: Introducing Cartesian Granule Features. Kluwer Academic Publishers, Boston.
[10] Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1998). "Aggregation in Cartesian granule feature models." In the proceedings of IPMU, Paris, 6.
[11] Shanahan, J. G. (1998). "Cartesian Granule Features: Knowledge Discovery of Additive Models for Classification and Prediction", PhD Thesis, Dept. of Engineering Mathematics, University of Bristol, Bristol, UK.
[12] Baldwin, J. F. (1993). "Evidential Support Logic, FRIL and Case Based Reasoning", Int. J. of Intelligent Systems, 8(9):939-961.
[13] Baldwin, J. F., Lawry, J., and Martin, T. P. (1996). "Efficient Algorithms for Semantic Unification." In the proceedings of IPMU, Granada, Spain, 527-532.
[14] Lindley, D. V. (1985). Making Decisions. John Wiley, Chichester.
[15] Kohavi, R., and John, G. H. (1997). "Wrappers for feature selection", Artificial Intelligence, 97:273-324.
[16] Baldwin, J. F. (1995). "Machine Intelligence using Fuzzy Computing." In the proceedings of ACRC Seminar (November), University of Bristol.
[17] Miller, G. A. (1956). "The magical number seven, plus or minus two: some limits on our capacity to process information", Psychological Review, 63:81-97.
[18] Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
[19] Kohonen, T. (1984). Self-Organisation and Associative Memory. Springer-Verlag, Berlin.
[20] Sugeno, M., and Yasukawa, T. (1993). "A Fuzzy Logic Based Approach to Qualitative Modelling", IEEE Transactions on Fuzzy Systems, 1(1):7-31.
[21] Zadeh, L. A. (1994). "Soft computing", LIFE Seminar, LIFE Laboratory, Yokohama, Japan (February 24), published in SOFT Journal, 6:1-10.
[22] Baldwin, J. F., Martin, T. P., and Shanahan, J. G. (1997). "Structure identification of fuzzy Cartesian granule feature models using genetic programming." In the proceedings of the IJCAI Workshop on Fuzzy Logic in Artificial Intelligence, Nagoya, Japan, 1-11.
[23] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.
[24] Baldwin, J. F., and Pilsworth, B. W. (1997). "Genetic Programming for Knowledge Extraction of Fuzzy Rules." In the proceedings of Fuzzy Logic: Applications and Future Directions Workshop, London, UK, 238-251.
[25] Baldwin, J. F., and Martin, T. P. (1999). "Basic concepts of a fuzzy logic data browser with applications", Report No. ITRC 250, Dept. of Engineering Maths, University of Bristol.
[26] Weiss, S. M., and Indurkhya, N. (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann.
[27] Zell, A., Mamier, G., Vogt, M., and Mache, N. (1995). SNNS (Stuttgart Neural Network Simulator) Version 4.1. Institute for Parallel and Distributed High Performance Systems (IPVR), Applied Computer Science, University of Stuttgart, Stuttgart, Germany.
[28] Moller, M. F. (1993). "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks, 6:525-533.
[29] Shanahan, J. G. (2000). "A comparison between naive Bayes classifiers and product Cartesian granule feature models", Report in preparation, XRCE.
[30] Breiman, L. (1996). "Bagging predictors", Machine Learning, 66:34-53.
[31] Baldwin, J. F. (1992). "Fuzzy and Probabilistic Uncertainties", In Encyclopaedia of AI, 2nd ed., Shapiro, ed., 528-537.
[32] Baldwin, J. F. (1991). "Combining evidences for evidential reasoning", International Journal of Intelligent Systems, 6(6):569-616.
[33] Sudkamp, T. (1992). "On probability-possibility transformation", Fuzzy Sets and Systems, 51:73-81.
[34] Zadeh, L. A. (1968). "Probability Measures of Fuzzy Events", Journal of Mathematical Analysis and Applications, 23:421-427.
[35] Dubois, D., and Prade, H. (1983). "Unfair coins and necessary measures: towards a possibilistic interpretation of histograms", Fuzzy Sets and Systems, 10:15-20.
[36] Baldwin, J. F. (1991). "A Theory of Mass Assignments for Artificial Intelligence", In IJCAI '91 Workshops on Fuzzy Logic and Fuzzy Control, Sydney, Australia, Lecture Notes in Artificial Intelligence, A. L. Ralescu, ed., 22-34.
[37] Klir, G. (1990). "A principle of uncertainty and information invariance", International Journal of General Systems, 17(2-3):249-275.


Validation of Concept Representation with Rule Induction and Linguistic Variables

Shusaku Tsumoto

Department of Medical Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo 693-8501 Japan E-mail: [email protected]

Abstract. This paper shows problems with the combination of rule induction and attribute-oriented generalization: if the given hierarchy includes inconsistencies, then the application of hierarchical knowledge generates inconsistent rules. We then introduce two approaches to solving this problem, one of which suggests that the combination of rule induction and attribute-oriented generalization can be used to validate a concept hierarchy. Interestingly, fuzzy linguistic variables play an important role in solving these problems.

1 Introduction

Conventional studies on machine learning [11], rule discovery [2] and rough set methods [5,13,14] mainly focus on the acquisition of rules whose target concepts have mutually exclusive supporting sets. The supporting sets of the target concepts form a partition of the universe, and each method searches for sets which cover this partition. In particular, Pawlak's rough set theory shows that a family of sets can form an approximation of the partition of the universe. These ideas easily extend to probabilistic contexts, as shown in Ziarko's variable precision rough set model [19]. However, mutual exclusiveness of the target does not always hold in real-world databases, where conventional probabilistic approaches cannot be applied.

In this paper, we first show that these phenomena are easily found in data mining contexts: when we apply attribute-oriented generalization to attributes in databases, the generalized attributes will have fuzziness for classification, which causes rule induction methods to generate inconsistent rules. Then, we introduce two solutions. The first is to introduce aggregation operators to recover mathematical consistency. The other is to introduce Zadeh's linguistic variables, which describe one way to represent the interaction between lower-level components within an upper-level component and which give a simple solution for dealing with the inconsistencies. Finally, we briefly discuss the mathematical generalization of this solution, in which context-free fuzzy sets are a key idea. In this inconsistent setting, we have to take care of the conflicts between attributes, which can be viewed as a problem with multiple membership functions.


2 Attribute-Oriented Generalization and Fuzziness

In this section, first, a probabilistic rule is defined by using two probabilistic measures. Then, attribute-oriented generalization is introduced as a transformation of rules.

2.1 Probabilistic Rules

Accuracy and Coverage. In the subsequent sections, we adopt the following notation, which was introduced in [10].

Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a. Then, a decision table is defined as an information system, A = (U, A ∪ {d}).

The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation.

For each f ∈ F(B, V), fA denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows:

1. If f is of the form [a = v], then fA = {s ∈ U | a(s) = v}.
2. (f ∧ g)A = fA ∩ gA; (f ∨ g)A = fA ∪ gA; (¬f)A = U − fA.

By the use of this framework, classification accuracy and coverage (true positive rate) are defined as follows.

Definition 1. Let R and D denote a formula in F(B, V) and a set of objects which belong to a decision class d, respectively. The classification accuracy and coverage (true positive rate) of R → d are defined as:

αR(D) = |RA ∩ D| / |RA|  (= P(D|R)), and

κR(D) = |RA ∩ D| / |D|  (= P(R|D)),

where |A| denotes the cardinality of a set A, αR(D) denotes the classification accuracy of R with respect to the classification of D, and κR(D) denotes the coverage, or true positive rate, of R to D.
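As a minimal illustration of these two measures (using invented toy sets rather than any dataset from this paper), accuracy and coverage can be computed directly from the supporting sets:

def accuracy(r_support, d_support):
    """alpha_R(D) = |R_A ∩ D| / |R_A|."""
    return len(r_support & d_support) / len(r_support)

def coverage(r_support, d_support):
    """kappa_R(D) = |R_A ∩ D| / |D|."""
    return len(r_support & d_support) / len(d_support)

# Toy example: R_A = objects matching the rule premise, D = objects of the class.
R_A = {1, 2, 5, 6}
D = {1, 2, 5}
print(accuracy(R_A, D), coverage(R_A, D))   # 0.75 and 1.0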

1 Pawlak recently reports a Bayesian relation between accuracy and coverage [8]:

κR(D) P(D) = P(R|D) P(D) = P(R, D) = P(R) P(D|R) = αR(D) P(R).

This relation also suggests that a priori and a posteriori probabilities can easily and automatically be calculated from the database.


Definition of Rules. By the use of accuracy and coverage, a probabilistic rule is defined as:

R → d   s.t.   R = ∧j ∨k [aj = vk],   αR(D) ≥ δα,   κR(D) ≥ δκ.

This rule is a kind of probabilistic proposition with two statistical measures, and is an extension of Ziarko's variable precision rough set model (VPRS) [19].2

It is also notable that both a positive rule and a negative rule are defined as special cases of this rule, as shown in the next subsections.

2.2 Attribute-Oriented Generalization

Rule induction methods regard a database as a decision table [5] and induce rules, which can be viewed as reduced decision tables. However, the rules extracted from tables do not include information about attributes, and they are too simple. In practical situations, domain knowledge about attributes is very important for the comprehensibility of the induced knowledge, which is one of the reasons why databases are implemented as relational databases [1]. Thus, reinterpretation of induced rules by using information about attributes is needed to acquire comprehensive rules. For example, telorism, cornea, antimongoloid slanting of palpebral fissures, iris defects and long eyelashes are symptoms around the eyes. Thus, those symptoms can be gathered into a category "eye symptoms" when the location of symptoms is to be focused on. The relations among those attributes are hierarchical, as shown in Figure 1. This process of grouping attributes is called attribute-oriented generalization [1].

Attribute-oriented generalization can be viewed as a transformation of variables in the context of rule induction. For example, an attribute "iris defects" should be transformed into an attribute "eye symptoms = yes". It is notable that the transformation of attributes in rules corresponds to that of a database, because a set of rules is equivalent to a reduced decision table. In this case, the case when the eyes are normal is defined as "eye symptoms = no". Thus, the transformation rule for iris defects is defined as:

[iris-defects = yes] → [eye-symptoms = yes]   (1)

In general, when [Ak = Vl] is an upper-level concept of [ai = vj], a transforming rule is defined as:

[ai = vj] → [Ak = Vl],

and the supporting set of [Ak = Vl] is:

[Ak = Vl]A = ∪_{i,j} [ai = vj]A,

2 This probabilistic rule is also a kind of Rough Modus Ponens[7].


Location
  Head ...
  Face ...
  Eye:
    telorism {hyper, normal, hypo}
    cornea {megalo, large, normal}
    antimongoloid slanting of palpebral fissures {yes, no}
    iris defects {yes, no}
    eyelashes {long, normal}
  Nose ...

Fig. 1. An Example of Attribute Hierarchy

where A and a are sets of attributes for the upper-level and lower-level concepts, respectively.

2.3 Examples

Let us illustrate how a fuzzy context is observed when attribute-oriented generalization is applied, using the small dataset shown in Table 1.

Table 1. A Small Database on Congenital Disorders

U  round  telorism  cornea  slanting  iris-defects  eyelashes  class
1  no     normal    megalo  yes       yes           long       Aarskog
2  yes    hyper     megalo  yes       yes           long       Aarskog
3  yes    hypo      normal  no        no            normal     Down
4  yes    hyper     normal  no        no            normal     Down
5  yes    hyper     large   yes       yes           long       Aarskog
6  no     hyper     megalo  yes       no            long       Cat-cry

DEFINITIONS: round: round face, slanting: antimongoloid slanting of palpebral fissures, Aarskog: Aarskog Syndrome, Down: Down Syndrome, Cat-cry: Cat Cry Syndrome.

It is then easy to see that the following rule for "Aarskog" is obtained from Table 1:

[iris-defects = yes] → Aarskog,   α = 1.0, κ = 1.0.


When we apply the transforming rules shown in Figure 1 to the dataset of Table 1, the table is transformed into Table 2.

Table 2. A Small Database on Congenital Disorders (Transformed)

U  eye  eye  eye  eye  eye  eye  class
1  no   no   yes  yes  yes  yes  Aarskog
2  yes  yes  yes  yes  yes  yes  Aarskog
3  yes  no   no   no   no   no   Down
4  yes  yes  no   no   no   no   Down
5  yes  yes  yes  yes  yes  yes  Aarskog
6  no   yes  yes  yes  no   yes  Cat-cry

DEFINITIONS: eye: eye-symptoms.

Then, by using transformation rule (1), the above rule is transformed into:

[eye-symptoms = yes] → Aarskog.

It is notable that the mutual exclusiveness of attributes has been lost by this transformation. Since six attributes (round, telorism, cornea, slanting, iris-defects and eyelashes) are generalized into eye-symptoms, the candidates for accuracy and coverage will be (2/4, 2/3), (2/4, 3/3), (3/4, 3/3), (3/4, 3/3), (3/3, 3/3) and (3/4, 3/3), respectively. Then, we have to select which value is suitable for the context of this analysis.

In [12], Tsumoto selected the minimum value in medical context: accuracy is equal to 2/4 and coverage is equal to 2/3.

Thus, the rewritten rule becomes the following probabilistic rule:

[eye-symptoms = yes] → Aarskog,

α = 2/4 = 0.5, κ = 2/3 = 0.67.

This example shows that the loss of mutual exclusiveness is directly connected to the emergence of fuzziness in a dataset. It is notable that the rule used for the transformation is a deterministic one. When this kind of transformation is applied, whether the applied rule is deterministic or not, fuzziness will be observed. However, no researchers have pointed out this problem with the combination of rule induction and transformation.

It is also notable that the conflicts between attributes with respect to accuracy and coverage correspond to the vector representation of membership functions found in Lin's context-free fuzzy sets [4].

2.4 What is a Problem?

The illustrative example in the last subsection shows that a simple combination of rule induction and attribute-oriented generalization easily generates many inconsistent rules. One of the most important features of this problem is that the simple application of transformation violates mathematical conditions.

Attribute-value pairs can be viewed as mappings in a mathematical context, as shown in Section 2. For example, in the case of the attribute "round", the set of values of "round", {yes, no}, is equivalent to the domain of "round". Then, since the value of round for the first example in the dataset, denoted by "1", is no, round(1) is equal to no. Thus, an attribute is a mapping from examples to values. In the reverse direction, a set of examples is related to an attribute-value pair:

round⁻¹(no) = {1, 6}.

In the same way, the following relation is obtained:

eyelashes⁻¹(normal) = {3, 4}.

However, a simple transformation will violate this mapping condition, because transformation rules map different attributes onto the same generalized attribute name. For example, if the following transformation rules are applied:

round → eye-symptoms,
eyelashes → eye-symptoms,
normal → no, long → yes,

then the following relations are obtained:

eye-symptoms⁻¹(no) = {1, 6},  eye-symptoms⁻¹(no) = {3, 4},

which leads to a contradiction. Thus, the transformed attribute-value pairs do not form a mapping, because of the one-to-many correspondence.

In this way, the violation is observed as the generation of logically inconsistent rules, which is equivalent to a mathematical inconsistency.
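The conflict can be reproduced in a few lines; this is a minimal sketch over the Table 1 data, where the names table1, value_map and inverse_image are invented for illustration:

# Table 1, restricted to the two attributes used in the example.
table1 = {
    1: {"round": "no",  "eyelashes": "long"},
    2: {"round": "yes", "eyelashes": "long"},
    3: {"round": "yes", "eyelashes": "normal"},
    4: {"round": "yes", "eyelashes": "normal"},
    5: {"round": "yes", "eyelashes": "long"},
    6: {"round": "no",  "eyelashes": "long"},
}
value_map = {"no": "no", "yes": "yes", "normal": "no", "long": "yes"}

def inverse_image(attribute, value):
    """Examples mapped to `value` after generalizing `attribute` to eye-symptoms."""
    return {u for u, row in table1.items() if value_map[row[attribute]] == value}

# Both original attributes are renamed to eye-symptoms, yet they disagree:
print(inverse_image("round", "no"))      # {1, 6}
print(inverse_image("eyelashes", "no"))  # {3, 4}  -> eye-symptoms is no longer a mapping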

3 Solutions

3.1 Join Operators

In Subsection 2.3, since the six attributes (round, telorism, cornea, slanting, iris-defects and eyelashes) are generalized into eye-symptoms, the candidates for accuracy and coverage are (2/4, 2/3), (2/4, 3/3), (3/4, 3/3), (3/4, 3/3), (3/3, 3/3) and (3/4, 3/3), respectively. We now show one approach, reported in [12]: the minimum value is selected, so accuracy is equal to 2/4 and coverage is equal to 2/3. This selection of the minimum value is a kind of aggregation, or join operator. With join operators, conflicting values are integrated into one value, which means that the one-to-many correspondence is transformed back into a one-to-one correspondence, and this recovers consistency.

Another example of a join operator is the average. In the above example, the average of the accuracies is 0.71, so if the average operator is selected for aggregation, then the accuracy of the rule is equal to 0.71. This solution can be generalized using the context-free fuzzy sets introduced by Lin [4], as shown in Section 4.
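A minimal sketch of these join operators over the candidate accuracy values (the helper name join is invented):

def join(values, operator):
    """Aggregate conflicting candidate values with a chosen join operator."""
    return operator(values)

accuracy_candidates = [2/4, 2/4, 3/4, 3/4, 3/3, 3/4]
print(join(accuracy_candidates, min))                          # 0.5, the minimum join
print(join(accuracy_candidates, lambda v: sum(v) / len(v)))    # ~0.71, the average join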

3.2 Zadeh's Linguistic Variables

Concept Hierarchy and Information. Another solution is to look at this problem from the viewpoint of information. After the application of the transformation, it is clear that some information has been lost. In other words, the transformation rules derived from a concept hierarchy are a kind of projection, and projection usually loses a substantial amount of information. Intuitively, over-projection is observed as fuzziness.

For example, consider the following three transformation rules:

[Round = yes] → [Eye-symptoms = yes],
[Iris-Defects = yes] → [Eye-symptoms = yes],
[Telorism = hyper] → [Eye-symptoms = yes].

One of the most important questions is whether only the eyes contribute to these symptoms.

Thus, one way to solve this problem is to recover information on the hierarchical structure of each symptom. For example, let us summarize the components of each symptom and the corresponding accuracy in Table 3.

Table 3. Components of Symptoms

Symptoms                 Components                Accuracy
[Round = yes]          : [Eye, Nose, Frontal]      α = 1/2
[Iris-Defects = yes]   : [Substructure of Eye]     α = 3/3
[Telorism = hyper]     : [Eye, Nose, Frontal]      α = 2/3

It is notable that even if the components of the symptoms are the same, the values of accuracy are not equal to each other. These phenomena suggest that the degrees of contribution of the components differ between symptoms. In the above examples, the degrees of contribution of Eye in [Round = yes], [Iris-Defects = yes] and [Telorism = hyper] are estimated as 1/2 (0.5), 3/3 (1.0) and 2/3 (0.67), respectively.


Linguistic Variables and Knowledge Representation. Zadeh proposed linguistic variables to approximate human linguistic reasoning [16-18]. One of the main points in his discussion is that when a human being reasons about a hierarchical structure, he or she implicitly estimates the degree of contribution of each component to the subject at an upper level.

In the case of the symptom [Round = yes], this symptom should be described as a combination of the Eye, Nose and Frontal parts of the face. From the value of accuracy for Aarskog syndrome, since the contribution of Eye in [Round = yes] is equal to 0.5, the linguistic variable for [Round = yes] is represented as:

[Round = yes] = 0.5 * [Eye] + θ * [Nose] + (0.5 − θ) * [Frontal],

where 0.5 and θ are the degrees of contribution of the eyes and nose to this symptom, respectively. It is interesting to see that the real hierarchical structure is recovered by Zadeh's linguistic variable structure, which also suggests that linguistic variables capture one aspect of human reasoning about hierarchical structure. Especially, one important issue is that Zadeh's linguistic variables, although partially, represent the degree of interaction between sub-components at the same hierarchical level, which cannot be achieved by a simple application of the object-oriented approach.

Another important issue is that the degree of contribution, which can be viewed as part of a membership function, can be estimated from data. Estimation of membership functions is one of the key issues in the application of fuzzy reasoning, but it is very difficult to extract such membership functions from data, and usually they are given by domain experts [9].

In summary, these two important issues suggest that a dataset can be used to validate a concept hierarchy. If some inconsistencies are observed after transformation by a given hierarchy, then some information is thought to be lost in the process of transformation.3 From the observation of the lost information, we can go further to the next step of constructing a more consistent hierarchy or knowledge representation. The combination of rule induction methods and attribute-oriented generalization may play an important role in validation. Also, it may be possible to measure the quality of a concept representation from data. Although this topic is not discussed in this paper, evaluation of concept representation is very important for constructing complete and sound concept representations. Construction of terminology and concept representation should be adaptive, because our medical knowledge is dynamic and new knowledge arrives every day.

3 In this approach, we assume that the data do not include errors in the experts' decisions.


4 Functional Representation of Context-Free Fuzzy Sets


Lin has pointed out problems with multiple membership functions and introduced relations between context-free fuzzy sets and information tables [4]. The main contribution of context-free fuzzy sets to data mining is that information tables can be used to represent multiple fuzzy membership functions. Usually, when we meet multiple membership functions, we have to resolve the conflicts between the functions. Lin argues that this resolution is bounded by the context: minimum, maximum and other fuzzy operators can be viewed as contexts. The discussion in Section 2 illustrates Lin's assertion. In particular, when we analyze relational databases, transformation is indispensable for data mining over multiple tables. However, transformation may violate the mutual exclusiveness of the target information table. Then, multiple fuzzy membership functions will be observed.

Lin's context-free fuzzy sets express such analyzing procedures as a simple function, as shown in Figure 2. The important parts of this algorithm are the way a list of membership functions is constructed, and the way it is determined whether the algorithm outputs a metalist of lists of membership functions or a list of numerical values obtained by applying fuzzy operators to the lists of membership functions.

5 Conclusions

This paper shows that the combination of attribute-oriented generalization and rule induction methods generates inconsistent rules, and proposes one solution to this problem. It is surprising that transformation of attributes easily generates this situation in data mining from relational databases: when we apply attribute-oriented generalization to attributes in databases, the generalized attributes will have fuzziness for classification. In this case, we have to take care of the conflicts between attributes, which can be viewed as a problem of linguistic uncertainty or multiple membership functions. Finally, the author pointed out that these contexts should be analyzed by using two kinds of fuzzy techniques: one is the introduction of aggregation operators, which can be viewed as operators on multiple membership functions. The other is linguistic variables, which capture the degree of contribution.

This paper is still a preliminary study on the combination of medical terminology and KDD methods. Further work should be done, but this combination may be useful to construct and evaluate medical terminology and concept representation. It will be our future work to introduce a measure for the evaluation of terminologies and to formalize the validation of terminology and concept representation from data.



procedure Resolution of Multiple Memberships;
var
    i : integer;
    La, Li : List;
    A : a list of attribute-value pairs (multiset: bag);
    F : a list of fuzzy operators;
begin
    Li := A;
    while (A ≠ {}) do
        begin
            [ai = vj](k) := first(A);
            Append μ([ai = vj](k)) to L[ai=vj];
            /* L[ai=vj]: a list of membership functions for the attribute-value pair [ai = vj] */
            A := A − [ai = vj](k);
        end;
    if (F = {}) then
        /* Context-Free */
        return all of the lists L[ai=vj];
    else
        /* Resolution with Contexts */
        while (F ≠ {}) do
            begin
                f := first(F);
                Apply f to each L[ai=vj];
                μf([ai = vj]) := f(L[ai=vj]);
                Output all of the membership functions μf([ai = vj]);
                F := F − f;
            end;
end {Resolution of Multiple Memberships};

Fig. 2. Resolution of Multiple Fuzzy Memberships
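The procedure of Fig. 2 can be sketched in a few lines of Python. This is only an illustrative reading of the algorithm, not the author's implementation; the observation list and the choice of min and max as example fuzzy operators (contexts) are assumptions made for the sketch.

    from collections import defaultdict

    def resolve_memberships(pairs, fuzzy_ops=None):
        """Collect multiple membership values per attribute-value pair and,
        if contexts (fuzzy operators) are supplied, resolve them.

        pairs     : iterable of ((attribute, value), membership) observations
        fuzzy_ops : list of aggregation functions, e.g. [min, max]; if empty
                    or None, the context-free lists are returned unchanged.
        """
        # Build a list of membership values for each attribute-value pair.
        lists = defaultdict(list)
        for (attr, val), mu in pairs:
            lists[(attr, val)].append(mu)

        if not fuzzy_ops:                       # context-free case
            return dict(lists)

        # Resolution with contexts: apply each fuzzy operator to every list.
        resolved = {}
        for f in fuzzy_ops:
            for key, mus in lists.items():
                resolved[(f.__name__,) + key] = f(mus)
        return resolved

    # Hypothetical observations of [Round = yes] contributed by sub-components.
    observations = [
        (("Round", "yes"), 0.5),   # contribution via Eye
        (("Round", "yes"), 0.3),   # contribution via Nose
        (("Round", "yes"), 0.2),   # contribution via Frontal part
    ]

    print(resolve_memberships(observations))              # context-free lists
    print(resolve_memberships(observations, [min, max]))  # resolved with contexts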

References

1. Cai, Y.D., Cercone, N. and Han, J. Attribute-oriented induction in relational databases, in: Piatetsky-Shapiro, G. and Frawley, W.J. (eds.), Knowledge Discovery in Databases, AAAI Press, Palo Alto, CA, pp. 213-228, 1991.

2. Fayyad, U.M., et al. (eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.

3. Lin, T.Y. Fuzzy Partitions: Rough Set Theory, in: Proceedings of IPMU'98, Paris, pp. 1167-1174, 1998.

4. Lin, T.Y. Context Free Fuzzy Sets and Information Tables, Proceedings of EUFIT'98, Aachen, pp. 76-80, 1998.

5. Pawlak, Z. Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.

6. Pawlak, Z. Conflict analysis, in: Proceedings of the Fifth European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), pp. 1589-1591, Verlag Mainz, Aachen, 1997.

7. Pawlak, Z. Rough Modus Ponens, Proceedings of IPMU'98, Paris, 1998.



8. Pawlak, Z. Rough Sets and Decision Analysis, Fifth IIASA Workshop on Decision Analysis and Support, Laxenburg, 1998.

9. Pedrycz, W. and Gomide, F. An Introduction to Fuzzy Sets - Analysis and Design, MIT Press, MA, 1996.

10. Skowron, A. and Grzymala-Busse, J. From rough set theory to evidence theory, in: Yager, R., Fedrizzi, M. and Kacprzyk, J. (eds.), Advances in the Dempster-Shafer Theory of Evidence, pp. 193-236, John Wiley & Sons, New York, 1994.

11. Shavlik, J.W. and Dietterich, T.G. (eds.), Readings in Machine Learning, Morgan Kaufmann, Palo Alto, 1990.

12. Tsumoto, S. Knowledge Discovery in Medical Databases based on Rough Sets and Attribute-Oriented Generalization, Proceedings of FUZZ-IEEE'98, IEEE Press, Anchorage, 1998.

13. Tsumoto, S. Automated Induction of Medical Expert System Rules from Clinical Databases based on Rough Set Theory, Information Sciences 112, 67-84, 1998.

14. Tsumoto, S. Automated Discovery of Plausible Rules based on Rough Sets and Rough Inclusion, Proceedings of PAKDD'99 (in press), LNAI, Springer-Verlag.

15. Zadeh, L.A. Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90, 111-127, 1997.

16. Zadeh, L.A. The concept of a linguistic variable and its application to approximate reasoning, Information Sciences 8, 199-249 (part I), 1975.

17. Zadeh, L.A. The concept of a linguistic variable and its application to approximate reasoning, Information Sciences 8, 301-357 (part II), 1975.

18. Zadeh, L.A. The concept of a linguistic variable and its application to approximate reasoning, Information Sciences 9, 43-80 (part III), 1976.

19. Ziarko, W. Variable Precision Rough Set Model, Journal of Computer and System Sciences 46, 39-59, 1993.


Granular Computing Using Information Tables

Y.Y. Yao1 and Ning Zhong2

1 Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2, E-mail: [email protected]

2 Department of Computer Science and Systems Engineering, Faculty of Engineering, Yamaguchi University, Tokiwa-Dai 2557, Ube 755, Japan, E-mail: [email protected]

Abstract. A simple and more concrete granular computing model may be developed using the notion of information tables. In this framework, each object in a finite nonempty universe is described by a finite set of attributes. Based on the attribute values of objects, one may decompose the universe into parts called granules. Objects in each granule share the same or similar description in terms of their attribute values. Studies along this line have been carried out in the theories of rough sets and databases. Within the proposed model, this paper reviews the pertinent existing results and presents their generalizations and applications.

1 Introduction

The concept of information granulation was first introduced by Zadeh [43] in the context of fuzzy sets in 1979. The basic ideas of crisp information granulation have appeared in related fields, such as interval analysis, quantization, rough set theory, the Dempster-Shafer theory of belief functions, divide and conquer, cluster analysis, machine learning, databases, and many others. However, fuzzy information granulation has not received enough attention [47]. In a series of recent papers and invited talks, Zadeh [45-47,49] proposed the development of a theory of fuzzy information granulation. Motivated by the work of Zadeh, there is a fast growing interest in the study of information granulation and computations under the umbrella of Granular Computing (GrC)1. Roughly speaking, "GrC is a superset of the theory of fuzzy information granulation, rough set theory and interval computations, and is a subset of granular mathematics." [48]

There are theoretical and practical reasons for the study of granular computing. Many authors have argued that information granulation is essential

1 The term "Granular Computing" was suggested by T.Y. Lin to label studies on information granulation and computations [45]. A Special Interest Group in Granular Computing in the Berkeley Initiative in Soft Computing (BISC/SIG GrC) was established in 1997 (URL: http://www.mathcs.sjsu.edu/GrC/GrC.html). The coordinators of the group are Tsau Young Lin (leader), Frank Hoffmann, Yiyu Yao, and Ning Zhong.



to human problem solving, and hence has a very significant impact on the design and implementation of intelligent systems. Zadeh [47] identified three basic concepts that underlie human cognition, namely, granulation, organization, and causation. "Granulation involves decomposition of whole into parts, organization involves integration of parts into whole, and causation involves association of causes and effects." Yager and Filev [29] pointed out that "human beings have been developed a granular view of the world", and "... objects with which mankind perceives, measures, conceptualizes and reasons are granular". In many practical situations, when a problem involves incomplete, uncertain, or vague information, it may be difficult to differentiate distinct elements and one is forced to consider granules. A typical example is the theory of rough sets [20]. In some situations, although detailed information may be available, it may be sufficient to use granules in order to have an efficient and practical solution. Very precise solutions may in fact not be required for many practical problems. It may also happen that the acquisition of precise information is too costly, and coarse-grained information reduces cost.

In summary, granular computing is inspired by the ways in which humans granulate information and reason with coarse-grained information. It builds on existing machinery for fuzzy information processing, such as linguistic variables, fuzzy if-then rules and fuzzy graphs, generalized constraints, and computing with words [47]. Granular computing is likely to play an important role in the evolution of fuzzy logic and its applications.

There are at least three fundamental issues in granular computing: granulation of the universe, description of granules, and relationships between granules. Granulation involves the decomposition of a whole into parts. A granule is "a clump of points (objects) drawn together by indistinguishability, similarity, proximity or functionality" [47]. In order to apply this abstract concept, it is necessary to study criteria for deciding if two elements should be put into the same granule. In other words, one must provide the necessary semantic interpretations for notions such as indistinguishability, similarity, and proximity. Two structures can be observed from the granulation of a universe, the structure of each individual granule and the structure induced by a family of granules. In general, a larger granule may be further divided into smaller granules, while smaller granules may be combined into a larger granule. In this way, one may obtain stratified granulation structures of a universe [34]. Once constructed, it is necessary to describe, to name and to label granules using certain languages. Each label represents a concept such that an element in the granule is an instance of the named category, as is done in classification [7]. The granulated view summarizes available information and knowledge about the universe. By considering a class of objects sharing similar properties, instead of individuals, one may be able to establish relationships and connections between granules. In fact, this is one of the main tasks of data mining [42]. It may be argued that the construction,



interpretation, description, and connections of granules are of fundamental importance in the understanding, representation, organization and synthesis of data, information, and knowledge.

A systematic study and a general framework of granular computing were given in a recent paper by Zadeh [47]. Granules are constructed and defined based on the concept of generalized constraints. Examples of constraints are equality, possibilistic, probabilistic, fuzzy, and veristic constraints. Granules are labeled by fuzzy sets or natural language words. Relationships between granules are represented in terms of fuzzy graphs or fuzzy if-then rules. The associated computation method is known as computing with words [44]. On the other hand, many researchers have investigated specific and more concrete models of granular computing. Lin [9] and Yao [33] studied granular computing using neighborhood systems for the interpretation of granules. Pawlak [20], Skowron and Stepaniuk [24], and Polkowski and Skowron [21] examined granular computing in connection with the theory of rough sets. A salient feature of these studies is that a particular semantic interpretation of granules is defined, and an algorithm for constructing granules is given.

The main objective of the present study is to develop a simple and more concrete model for non-fuzzy granular computing using information tables. With respect to the proposed model, we review studies on non-fuzzy granular computing and investigate their possible generalizations and applications. In this framework, each object in a finite nonempty universe is described by a finite set of attributes. That is, each object is only perceived, observed, or measured by using a finite number of properties. The universe is decomposed into granules by grouping objects with the same or similar properties. The representation of objects by their attribute values provides the semantics for interpreting the induced granules. For example, a patient may be represented by a set of symptoms. A set of patients may be divided into subgroups such that each subgroup of patients suffers from the same disease characterized by certain symptoms. Several types of relationships between attribute values will be considered. They induce different granulation structures on the universe.

To illustrate the usefulness of the proposed framework, at the end of this paper we also discuss two specific problems of granular computing, namely, approximations induced by granulations and relationships between granules in data analysis and mining.

2 Information Tables and a Decision Logic Language

The notion of information tables has been studied by many authors as a simple knowledge representation method, in which objects are described by their values on a set of attributes [1,12,15,17,19,27,37]. Formally, an information table is a quadruple:

S = (U, At, {V_a | a ∈ At}, {f_a | a ∈ At}),


where

U is a finite nonempty set of objects,

At is a finite nonempty set of attributes,

V_a is a nonempty set of values for a ∈ At,

f_a : U → V_a is an information function.


Each information function f_a is a total function that maps an object of U to exactly one value in V_a. Similar representation schemes may be found in many fields, such as decision theory, pattern recognition, machine learning, data analysis, data mining, and cluster analysis [19].

An information table can be conveniently presented in a table form. Table 1, taken from an example in [22], is an example of an information table. The columns are labeled by attributes, the rows are labeled by objects, and each row represents the information about that object. Pawlak [19] referred to an information table as an information system, a knowledge representation system, or an attribute-value system. We prefer to use the name information tables, in order to avoid confusion with the commonly associated meaning of information systems [9].

Object   Height   Hair    Eyes    Class
o1       short    blond   blue    +
o2       short    blond   brown   -
o3       tall     red     blue    +
o4       tall     dark    blue    -
o5       tall     dark    blue    -
o6       tall     blond   blue    +
o7       tall     dark    brown   -
o8       short    blond   brown   -

Table 1. An information table
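For concreteness, the quadruple S = (U, At, {V_a}, {f_a}) for Table 1 can be encoded in a few lines of Python. This is a minimal sketch for illustration; the dictionary encoding and variable names are assumptions, not part of the paper.

    # Table 1 encoded as an information table S = (U, At, {V_a}, {f_a}).
    table = {
        "o1": {"Height": "short", "Hair": "blond", "Eyes": "blue",  "Class": "+"},
        "o2": {"Height": "short", "Hair": "blond", "Eyes": "brown", "Class": "-"},
        "o3": {"Height": "tall",  "Hair": "red",   "Eyes": "blue",  "Class": "+"},
        "o4": {"Height": "tall",  "Hair": "dark",  "Eyes": "blue",  "Class": "-"},
        "o5": {"Height": "tall",  "Hair": "dark",  "Eyes": "blue",  "Class": "-"},
        "o6": {"Height": "tall",  "Hair": "blond", "Eyes": "blue",  "Class": "+"},
        "o7": {"Height": "tall",  "Hair": "dark",  "Eyes": "brown", "Class": "-"},
        "o8": {"Height": "short", "Hair": "blond", "Eyes": "brown", "Class": "-"},
    }

    U = set(table)                                   # finite nonempty universe
    At = {"Height", "Hair", "Eyes", "Class"}         # finite set of attributes
    V = {a: {row[a] for row in table.values()} for a in At}   # value sets V_a
    f = lambda a, x: table[x][a]                     # total information functions f_a

    print(V["Hair"])          # {'blond', 'red', 'dark'}
    print(f("Height", "o3"))  # 'tall'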

An information table contains all available information about the objects in the universe. Objects are perceived and observed only through their properties. Objects with the same description cannot be distinguished, and they are considered to be the same [20]. More generally, objects with similar descriptions may also be considered to be approximately the same. This leads to granulations of the universe. Granular computing using information tables deals mainly with the decomposition of the universe based on objects' descriptions. An information table is a more concrete model that provides semantics for the notion of granules. However, it is only one of the possible ways in which granules are formed and interpreted.

Information provided by an information table may also be described in terms of certain logic languages, in order to make inference easier. Pawlak [19]



discussed a decision logic language (DL-language) with respect to information tables. It is a language for describing objects or a group of objects of the universe. For example, an object can be represented as a conjunction of attribute-value pairs. A subset of objects can be similarly described. Formally, an atomic formula in the DL-language is given by (a, v), where a ∈ At and v ∈ V_a. If φ and ψ are formulas in the DL-language, then so are ¬φ, φ ∧ ψ, φ ∨ ψ, φ → ψ, and φ ≡ ψ. The semantics of the DL-language is defined in Tarski's style through the notions of a model and satisfiability. The model is an information table S, which provides the interpretation for symbols and formulas of the DL-language. The satisfiability of a formula φ by an object x, written x ⊨_S φ or in short x ⊨ φ if S is understood, is interpreted as follows:

(1) x ⊨ (a, v) iff I_a(x) = v,
(2) x ⊨ ¬φ iff not x ⊨ φ,
(3) x ⊨ φ ∧ ψ iff x ⊨ φ and x ⊨ ψ,
(4) x ⊨ φ ∨ ψ iff x ⊨ φ or x ⊨ ψ,
(5) x ⊨ φ → ψ iff x ⊨ ¬φ ∨ ψ,
(6) x ⊨ φ ≡ ψ iff x ⊨ φ → ψ and x ⊨ ψ → φ.

The first four formulas are in fact used in the evaluation of the satisfiability of queries by objects in database systems. For a formula φ, the set of objects satisfying φ is given by:

m_S(φ) = {x ∈ U | x ⊨ φ}.   (1)

It is called the meaning of the formula φ in S. If S is understood, we simply write m(φ). The meaning of a formula φ is the set of all objects having the property expressed by the formula φ. Therefore, φ may be viewed as a description of the set of objects m(φ). Two distinct formulas may have the same meaning in an information table. A granule may have different representations. The connections between formulas of the DL-language and subsets of U are expressed as [19]:

(a) m(a, v) = {x ∈ U | I_a(x) = v},
(b) m(¬φ) = −m(φ),
(c) m(φ ∧ ψ) = m(φ) ∩ m(ψ),
(d) m(φ ∨ ψ) = m(φ) ∪ m(ψ),
(e) m(φ → ψ) = −m(φ) ∪ m(ψ),
(f) m(φ ≡ ψ) = (m(φ) ∩ m(ψ)) ∪ (−m(φ) ∩ −m(ψ)),

where −m(φ) = U − m(φ) denotes the set complement of m(φ). They give a set-theoretic interpretation of the logic operations. In particular, logic negation, conjunction, and disjunction are interpreted as set complement, intersection, and union, respectively.



A formula φ is said to be true in an information table S, written ⊨_S φ, if and only if φ is satisfied by all objects in the universe, namely, m(φ) = U. It is false if and only if no object satisfies the formula, namely, m(φ) = ∅. For two formulas φ and ψ, φ → ψ is true if and only if every object satisfying φ also satisfies ψ, namely, m(φ) ⊆ m(ψ). They are equivalent in S if and only if m(φ) = m(ψ). In summary, we have [19]:

(i) ⊨_S φ iff m(φ) = U,
(ii) ⊨_S ¬φ iff m(φ) = ∅,
(iii) ⊨_S φ → ψ iff m(φ) ⊆ m(ψ),
(iv) ⊨_S φ ≡ ψ iff m(φ) = m(ψ).

We can therefore study the relationships between concepts described by formulas of the DL-language based on the relationships between their corresponding sets of objects.

For the information table of Table 1, the following expressions are examples of formulas of the DL-language:

(Height, tall),
(Height, short),
(Hair, dark),
(Height, tall) ∨ (Height, short),
(Height, tall) ∧ (Hair, dark),
(Height, tall) ∨ (Hair, dark),
(Hair, dark) → (Height, tall),
(Hair, dark) ≡ (Height, tall).

The meanings of these formulas, i.e., the subsets of objects satisfying the formulas, are given by:

m(Height, tall) = {o3, o4, o5, o6, o7},
m(Height, short) = {o1, o2, o8},
m((Height, tall) ∨ (Height, short)) = U,
m(Hair, dark) = {o4, o5, o7},
m((Height, tall) ∧ (Hair, dark)) = {o4, o5, o7},
m((Height, tall) ∨ (Hair, dark)) = {o3, o4, o5, o6, o7},
m((Hair, dark) → (Height, tall)) = U,
m((Hair, dark) ≡ (Height, tall)) = {o1, o2, o4, o5, o7, o8}.

Among these formulas, two are true in the information table, namely:

⊨_S (Height, tall) ∨ (Height, short),
⊨_S (Hair, dark) → (Height, tall).



The first formula represents the fact that an object's Height is either tall or short. The second formula represents the fact that if an object's Hair is dark, then its Height is tall. The subset {o4, o5, o7} can be described by both of the formulas (Hair, dark) and (Height, tall) ∧ (Hair, dark). This suggests that ⊨_S (Hair, dark) ≡ ((Height, tall) ∧ (Hair, dark)).
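A minimal Python sketch of the meaning function m and the truth check ⊨_S over Table 1 is given below. The nested-tuple encoding of formulas is an assumption made for the sketch; it is not the notation of the paper.

    # Formulas are nested tuples:
    #   ("atom", a, v), ("not", p), ("and", p, q), ("or", p, q),
    #   ("imp", p, q), ("equ", p, q).
    table = {
        "o1": {"Height": "short", "Hair": "blond"}, "o2": {"Height": "short", "Hair": "blond"},
        "o3": {"Height": "tall",  "Hair": "red"},   "o4": {"Height": "tall",  "Hair": "dark"},
        "o5": {"Height": "tall",  "Hair": "dark"},  "o6": {"Height": "tall",  "Hair": "blond"},
        "o7": {"Height": "tall",  "Hair": "dark"},  "o8": {"Height": "short", "Hair": "blond"},
    }
    U = set(table)

    def m(phi):
        """Meaning m(phi): the set of objects satisfying the formula."""
        op = phi[0]
        if op == "atom":
            _, a, v = phi
            return {x for x in U if table[x][a] == v}
        if op == "not":
            return U - m(phi[1])
        if op == "and":
            return m(phi[1]) & m(phi[2])
        if op == "or":
            return m(phi[1]) | m(phi[2])
        if op == "imp":                       # m(p -> q) = -m(p) ∪ m(q)
            return (U - m(phi[1])) | m(phi[2])
        if op == "equ":                       # two-way implication
            return m(("and", ("imp", phi[1], phi[2]), ("imp", phi[2], phi[1])))
        raise ValueError(op)

    def true_in_S(phi):
        return m(phi) == U

    dark_implies_tall = ("imp", ("atom", "Hair", "dark"), ("atom", "Height", "tall"))
    print(sorted(m(("atom", "Hair", "dark"))))   # ['o4', 'o5', 'o7']
    print(true_in_S(dark_implies_tall))          # True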

The decision logic language, the DL-language, has been studied by many authors. Orlowska [16] used a similar logic for studying reasoning with vague concepts. Polkowski and Skowron [21] adopted decision logic for the formulation of an adaptive calculus of granules in the context of information tables.

3 Construction and Interpretation of Granules in Information Tables

In the order of generality, this section summarizes the constructions of granules using the equality relation, equivalence relations, and reflexive binary relations on attribute values. The common practice of using the equality relation, as is done in the development of rough set theory [18], is based on exact value matching. It does not take semantic relationships between attribute values into much consideration. By using other types of relations, semantic relationships between attribute values can be easily integrated into information tables.

3.1 Granules Induced by Equality of Attribute Values

With respect to an attribute a ∈ At, two objects o and o' may have the same value, namely, I_a(o) = I_a(o'). In this case, one cannot differentiate o from o' based solely on their values on attribute a. They may be put into the same granule. For v ∈ V_a, one obtains the granule corresponding to the atomic formula (a, v):

G_e(a, v) = {x ∈ U | I_a(x) = v} = m(a, v).   (2)

This granule consists of all objects whose value on attribute a is equal to v. Such granules are defined by equality constraints in the sense discussed by Zadeh [47].

The family of granules,

π_{a} = {G_e(a, v) ≠ ∅ | v ∈ V_a},   (3)

forms a partition of the universe [19]. The corresponding equivalence relation EQ_{a} on U is given by:

o EQ_{a} o'  ⟺  I_a(o) = I_a(o').   (4)



Each equivalence class of the relation EQ_{a} is a granule. The equivalence class containing o ∈ U, written [o]_{EQ_{a}}, is defined by collecting all objects whose value on attribute a is the same as o's value:

[o]_{EQ_{a}} = {x ∈ U | I_a(x) = I_a(o)} = G_e(a, I_a(o)).   (5)

The partition π_{a} of the universe is referred to as a quotient set of U and is denoted by U/EQ_{a}. It offers a granulated view of the universe. The sets in π_{a} are called elementary granules, as they are the smallest granules derivable based on the values of attribute a. From the elementary granules, larger granules may be built by taking the union of a family of elementary granules. One can build a hierarchy of granules. If the empty set ∅ is added, one obtains a sub-Boolean algebra of the Boolean algebra 2^U formed by the power set of U.

The argument for constructing granules can be easily extended to cases of more than one attribute. For a pair of attributes a, b ∈ At and two values v_a ∈ V_a, v_b ∈ V_b, one can obtain the following granule corresponding to the formula (a, v_a) ∧ (b, v_b):

G_e((a, v_a) ∧ (b, v_b)) = {x ∈ U | I_a(x) = v_a ∧ I_b(x) = v_b}
                         = m((a, v_a) ∧ (b, v_b))
                         = m(a, v_a) ∩ m(b, v_b)
                         = G_e(a, v_a) ∩ G_e(b, v_b).   (6)

The granule is defined by two equality constraints. The family of granules:

π_{a,b} = {G_e((a, v_a) ∧ (b, v_b)) ≠ ∅ | v_a ∈ V_a, v_b ∈ V_b}   (7)

is a partition of the universe. The corresponding equivalence relation is given by EQ_{a,b} = EQ_{a} ∩ EQ_{b}, namely,

o EQ_{a,b} o'  ⟺  I_a(o) = I_a(o') ∧ I_b(o) = I_b(o').   (8)

Granules in the partition π_{a,b} are smaller than granules in the partitions π_{a} and π_{b}.

For the information table of Table 1, with respect to the attribute set A = {Hair}, we can partition the universe into the equivalence classes:

{o1, o2, o6, o8}, {o3}, {o4, o5, o7}.

They correspond to the formulas (Hair, blond), (Hair, red), and (Hair, dark). Similarly, the use of attribute Height produces the partition:

{o1, o2, o8}, {o3, o4, o5, o6, o7}.

When the pair of attributes A = {Height, Hair} is used, we consider all possible combinations of the values of Height and Hair, such as (Height, short) ∧ (Hair, blond), (Height, tall) ∧ (Hair, blond), and so on. They produce the partition of the universe:

{o1, o2, o8}, {o3}, {o4, o5, o7}, {o6}.

They are granules finer than the ones produced by using either Height or Hair.

For a subset of attributes A ⊆ At, the equivalence relation is given by EQ_A = ∩_{a∈A} EQ_{a}; each equivalence class (granule) is defined by the equality constraints ∧_{a∈A} I_a(x) = v_a, where v_a ∈ V_a. The algebra ({EQ_A}_{A⊆At}, ∩) is a lower semi-lattice with the zero element EQ_At [15]. For two subsets of attributes A, B ⊆ At, if EQ_A ⊂ EQ_B, we say that the partition π_A is finer than π_B, or π_B is coarser than π_A. We will also say that π_A is a specialization, or refinement, of π_B, or π_B is a generalization, or coarsening, of π_A [19]. The order relation of the semi-lattice represents the generalization-specialization relationships between partitions, i.e., families of elementary granules. The empty set ∅ produces the coarsest equivalence relation, i.e., EQ_∅ = U × U, where × denotes the Cartesian product of sets. The entire set of attributes produces the finest equivalence relation EQ_At. In the construction of granules, the addition of an attribute leads to a specialization, and hence smaller elementary granules. Conversely, the deletion of an attribute leads to a generalization, and hence larger elementary granules.
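The partitions discussed above can be computed mechanically. The following is a minimal sketch, under the assumption that Table 1 is encoded as Python tuples; it reproduces the Hair, Height and {Height, Hair} partitions, and illustrates how adding an attribute refines the partition.

    from collections import defaultdict

    # Table 1 restricted to the attributes used in the example.
    table = {
        "o1": ("short", "blond"), "o2": ("short", "blond"), "o3": ("tall", "red"),
        "o4": ("tall", "dark"),   "o5": ("tall", "dark"),   "o6": ("tall", "blond"),
        "o7": ("tall", "dark"),   "o8": ("short", "blond"),
    }
    ATTRS = ("Height", "Hair")

    def partition(A):
        """Equivalence classes of EQ_A: objects grouped by their values on A."""
        idx = [ATTRS.index(a) for a in A]
        classes = defaultdict(set)
        for obj, values in table.items():
            classes[tuple(values[i] for i in idx)].add(obj)
        return list(classes.values())

    print(partition(["Hair"]))            # three granules
    print(partition(["Height"]))          # two granules
    print(partition(["Height", "Hair"]))  # four, finer granules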

One may study relationships between attributes using partitions induced by individual attributes or subsets of attributes [39]. This can be done in an information-theoretic setting, as discussed by Lee [8] and Malvestuto [14] on the issues of correlation and interdependency among attributes. Notions such as functional, multi-valued, hierarchical and join dependencies are stated in terms of various entropy functions. Additional entropy-related measures and their uses in machine learning and data mining can be found in a paper by Yao et al. [39].

3.2 Granules Induced by Equivalence of Attribute Values

Granules constructed using a single attribute may be either too large or too small. The addition of more attributes may resolve the former problem. A solution to the latter problem will be discussed in this section by grouping values in V_a. In particular, values in V_a are divided into disjoint classes, i.e., a partition of V_a, and the corresponding equivalence classes are used as new attribute values. Two examples of such approaches are the discretization of real-valued attributes and the use of concept hierarchies [5]. The idea can be formalized by introducing equivalence relations on the set of attribute values [4,28].

Suppose E_a is an equivalence relation on the set of values V_a of an attribute a ∈ At. It partitions the set V_a into a disjoint family of subsets V_a/E_a, called the quotient set of V_a. Let [v]_{E_a} denote the equivalence class containing v.



For v ∈ V_a, we obtain a granule by replacing = with E_a in Equation (2):

G_E(a, v) = {x ∈ U | I_a(x) E_a v} = {x ∈ U | I_a(x) ∈ [v]_{E_a}}
          = ∪{m(a, v') | v' ∈ V_a, v' ∈ [v]_{E_a}}
          = ∪{m(a, v') | v' ∈ V_a, v' E_a v}.   (9)

It consists of all objects whose value on attribute a is equivalent to v. The equivalence relation is a generalization of the trivial equality relation =. The granules given by Equation (9) may be interpreted as granules defined by a generalized equality constraint. Many authors [5,9-11] used the equivalence classes [v]_{E_a} as higher level concepts in a concept hierarchy. Each value v ∈ V_a is replaced by its equivalence class [v]_{E_a} in the original information table to produce a quotient information table. For the quotient information table, the following equality constraint can in fact be used: for [v]_{E_a} ∈ V_a/E_a,

G_e(a', [v]_{E_a}) = {x ∈ U | [I_a(x)]_{E_a} = [v]_{E_a}} = m(a', [v]_{E_a}),   (10)

where a' is used to explicitly express the fact that in the quotient information table, an attribute takes equivalence classes of V_a as its values. For two values v E_a v', m(a', [v]_{E_a}) = m(a', [v']_{E_a}).

The family of granules,

π_{a} = {G_E(a, v) ≠ ∅ | v ∈ V_a},   (11)

forms a partition of the universe [19]. The corresponding equivalence relation E_{a} on U is given by:

o E_{a} o'  ⟺  I_a(o) E_a I_a(o').   (12)

The equivalence class containing o ∈ U, written [o]_{E_{a}}, is:

[o]_{E_{a}} = {x ∈ U | I_a(x) E_a I_a(o)} = G_E(a, I_a(o))
            = ∪{m(a, I_a(o')) | o' ∈ U, I_a(o') E_a I_a(o)}
            = ∪{[o']_{EQ_{a}} | o' ∈ U, I_a(o') E_a I_a(o)}.   (13)

It consists of all objects whose value on attribute a is equivalent to that of the object o. Each such equivalence granule is a union of some smaller granules of the equivalence relation defined by the equality relation. The argument can be easily extended to any subset of attributes.

Equivalence classes in V_a/E_a can be combined again to form even larger granules. The process can be continued until granules of the right size are obtained. Alternatively, one may use a sequence of nested equivalence relations on attribute values. This leads to the formation of a concept hierarchy. Each equivalence class of attribute values is viewed as a concept. A finer equivalence relation produces more specific concepts, while a coarser relation produces more general concepts [34].
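The coarsening of granules by a value-level equivalence can be illustrated with a short sketch. The grouping of Hair colours into "light" and "dark" is a hypothetical one-level concept hierarchy, introduced only for illustration.

    from collections import defaultdict

    # Hypothetical equivalence E_Hair on V_Hair, encoded as a value -> class map.
    hair_class = {"blond": "light", "red": "light", "dark": "dark"}

    table = {"o1": "blond", "o2": "blond", "o3": "red", "o4": "dark",
             "o5": "dark", "o6": "blond", "o7": "dark", "o8": "blond"}

    # Granules G_E(Hair, v): objects whose Hair value is equivalent to v.
    granules = defaultdict(set)
    for obj, v in table.items():
        granules[hair_class[v]].add(obj)

    print(dict(granules))
    # {'light': {'o1','o2','o3','o6','o8'}, 'dark': {'o4','o5','o7'}}
    # Each coarse granule is a union of the equality-based granules for the
    # original values, as in Equation (13).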

3.3 Granules Induced by Similarity of Attribute Values

The use of the trivial equality relation = and equivalence relations on attribute values provides a straightforward way of defining granulation structures on the universe. This type of granulation structure is characterized by partitions of the universe. With a fixed information table, from a subset of attributes one can obtain a partition. The converse is not necessarily true. For an arbitrary partition, one may not be able to find a subset of the attributes producing the same partition. Furthermore, equality and equivalence represent special cases of similarity. In order to obtain additional granulation structures, one may use other types of similarity relations on the attribute values [38,40].

Suppose R_a is a binary relation on V_a. For v, v' ∈ V_a, if v R_a v' we say that v' is R_a-related to v, v is a predecessor of v', and v' is a successor of v. The binary relation R_a is interpreted as defining the similarity of attribute values. A value v is similar to v' if v R_a v'. It seems reasonable to assume that R_a is reflexive, i.e., a value is similar to itself. The property of symmetry may not necessarily be required, namely, the similarity may not be symmetric [9,25]. By collecting values similar to v, we can form a granule of V_a as follows:

R_p(v) = {v' | v' ∈ V_a, v' R_a v}.   (14)

The set R_p(v) is called the predecessor neighborhood of v induced by the binary relation [31]. A binary relation and the predecessor neighborhoods uniquely determine each other. By the reflexivity of R_a, the family of granules {R_p(v) ≠ ∅ | v ∈ V_a} forms a covering of V_a, which is not necessarily a partition.

For v ∈ V_a, by extending Equation (9), we obtain a granule of U:

G_s(a, v) = {x ∈ U | I_a(x) R_a v}
          = {x ∈ U | I_a(x) ∈ R_p(v)}
          = ∪{m(a, v') | v' ∈ V_a, v' ∈ R_p(v)}.   (15)

It consists of all objects whose value on attribute a is similar to v. The family of granules,

{G_s(a, v) ≠ ∅ | v ∈ V_a},   (16)

forms a covering of the universe. Each granule is in fact the predecessor neighborhood of a certain element of the universe induced by the binary relation R_{a} on U:

o R_{a} o'  ⟺  I_a(o) R_a I_a(o').   (17)



Unlike the cases of equality and equivalence, where an element of the universe belongs to exactly one equivalence class, an element o may belong to more than one granule. In fact, o is a member of each granule in the family

{G_s(a, v) | I_a(o) ∈ R_p(v), v ∈ V_a}.   (18)

On the other hand, the granule

R_{a}(o) = {x ∈ U | I_a(x) R_a I_a(o)}
         = G_s(a, I_a(o))   (19)

consists of those elements whose value on a is similar to o's value. The relation R_{a} preserves the properties of R_a. For example, if R_a is a reflexive, a symmetric, and a transitive relation, R_{a} is a reflexive, a symmetric, and a transitive relation, respectively [40].

For a subset of attributes A ⊆ At, the similarity constraint is given by ∧_{a∈A} I_a(x) R_a v_a, where v_a ∈ V_a. The similarity relation on the universe defined by A is given by:

o R_A o'  ⟺  ∧_{a∈A} I_a(o) R_a I_a(o')  ⟺  ∧_{a∈A} o R_{a} o'.   (20)

That is, R_A = ∩_{a∈A} R_{a}. The relation R_A only preserves the common properties of the relations R_{a}, a ∈ A. Similarly, an element may belong to more than one granule.

The same process may also be used to construct granules using other neighborhoods of a similarity relation. A detailed discussion can be found in a paper by Yao [31]. Furthermore, all concepts and observations discussed in the last section, such as generalization, specialization, and the semi-lattice structure of granulations, may be examined for the case of arbitrary similarity relations.
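A short sketch of similarity-based granulation follows. The similarity relation on ordered Height values (each value similar to itself and to the next larger value) and the small table are hypothetical examples introduced only to illustrate how a reflexive, non-symmetric relation yields a covering rather than a partition.

    # Hypothetical reflexive, non-symmetric similarity R_a on Height values:
    # pairs (v', v) record that v' R_a v.
    V_height = ["short", "medium", "tall"]
    R_a = {("short", "short"), ("medium", "medium"), ("tall", "tall"),
           ("short", "medium"), ("medium", "tall")}

    table = {"o1": "short", "o2": "medium", "o3": "tall", "o4": "medium"}

    def pred_neighborhood(v):
        """R_p(v) = { v' : v' R_a v }, the predecessor neighborhood of v."""
        return {vp for (vp, w) in R_a if w == v}

    def granule(v):
        """G_s(a, v): objects whose value on the attribute is similar to v."""
        return {x for x, val in table.items() if val in pred_neighborhood(v)}

    covering = {v: granule(v) for v in V_height}
    print(pred_neighborhood("tall"))   # {'medium', 'tall'}
    print(covering)
    # o2 and o4 belong to both the 'medium' and the 'tall' granule, so the
    # granules form a covering of the universe rather than a partition.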

4 Rough Set Approximations

With the granulation of a universe, an arbitrary subset of the universe cannot be represented precisely using granules. One needs to deal with its approximations. This is in fact one of the main issues of the theory of rough sets [19]. Our discussion in this section follows, to a large extent, a paper by Yao [35].

Consider an equivalence relation E ⊆ U × U on a universe U. The pair apr = (U, E) is called an approximation space. With respect to the partition U/E, an arbitrary set X ⊆ U may not necessarily be a union of some equivalence classes. One may characterize X by a pair of lower and upper approximations:

\underline{apr}(X) = ∪{G | G ∈ U/E, G ⊆ X},
\overline{apr}(X) = ∪{G | G ∈ U/E, G ∩ X ≠ ∅}.   (21)



The lower approximation \underline{apr}(X) is the union of all the equivalence granules which are subsets of X. The upper approximation \overline{apr}(X) is the union of all the equivalence granules which have a non-empty intersection with X.

Lower and upper approximations are dual to each other in the sense:

(Ia) \underline{apr}(X) = −\overline{apr}(−X),
(Ib) \overline{apr}(X) = −\underline{apr}(−X).

The set X lies within its lower and upper approximations:

(II) \underline{apr}(X) ⊆ X ⊆ \overline{apr}(X).

Intuitively, the lower approximation may be understood as the pessimistic view and the upper approximation as the optimistic view in approximating a set by using equivalence granules. One can also verify the following properties:

(IIIa) \underline{apr}(X ∩ Y) = \underline{apr}(X) ∩ \underline{apr}(Y),
(IIIb) \overline{apr}(X ∪ Y) = \overline{apr}(X) ∪ \overline{apr}(Y).

The lower (upper) approximation of the intersection (union) of a finite number of sets can be obtained from their lower (upper) approximations. However, we only have:

(IVa) \underline{apr}(X ∪ Y) ⊇ \underline{apr}(X) ∪ \underline{apr}(Y),
(IVb) \overline{apr}(X ∩ Y) ⊆ \overline{apr}(X) ∩ \overline{apr}(Y).

It is impossible to obtain the lower (upper) approximation of the union (intersection) of some sets from their lower (upper) approximations. Additional properties of rough set approximations can be found in papers by Pawlak [18] and Yao [32,36].

Equivalence classes of the partition U/E are called the elementary granules. They represent the available information. All the knowledge we have about the universe is about these elementary granules, instead of about individual elements. It follows that we also have knowledge about the union of some elementary granules. As a matter of fact, if X is the empty set ∅ or the union of one or more elementary granules, then \underline{apr}(X) = X = \overline{apr}(X). These sets are called definable, observable, measurable, or composed granules. The set of all definable granules is denoted GK(U), which is a subset of the power set 2^U. The set GK(U) is closed under both set intersection and union. It is a σ-algebra of subsets of U generated by the family of equivalence classes U/E.

For an element G ∈ GK(U), we have:

\underline{apr}(G) = G = \overline{apr}(G).   (22)



For any subset X ⊆ U, we have the equivalent definition of rough set approximations:

\underline{apr}(X) = ∪{G | G ∈ GK(U), G ⊆ X},
\overline{apr}(X) = ∩{G | G ∈ GK(U), X ⊆ G}.   (23)

This definition offers another interesting interpretation. The lower approximation is the largest definable granule contained in X, while the upper approximation is the smallest definable granule containing X. They therefore represent the best approximations of X from below and above using definable granules.
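The partition-based approximations of Equation (21) can be computed directly. The following minimal sketch uses an abstract universe and partition chosen for illustration; it is not tied to any particular information table.

    U = {1, 2, 3, 4, 5, 6}
    partition = [{1, 2}, {3, 4}, {5, 6}]      # U/E: the elementary granules

    def lower(X):
        """Lower approximation: union of granules contained in X (Eq. 21)."""
        return set().union(*(G for G in partition if G <= X))

    def upper(X):
        """Upper approximation: union of granules meeting X (Eq. 21)."""
        return set().union(*(G for G in partition if G & X))

    X = {1, 2, 3}
    print(lower(X))                      # {1, 2}
    print(upper(X))                      # {1, 2, 3, 4}
    print(lower(X) <= X <= upper(X))     # property (II) holds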

From similarity relations on attribute values, one can derive a similarity relation R on the universe U. A covering of the universe can be constructed by using a particular type of neighborhood of all elements of U. Let U/R denote such a covering. Rough set approximations can be defined by generalizing Equation (21). In particular, an equivalence class is replaced by a granule in U/R. One such generalization is [33]:

\underline{apr}(X) = ∪{G | G ∈ U/R, G ⊆ X},
\overline{apr}(X) = −\underline{apr}(−X).   (24)

In this definition, we generalize the lower approximation and define the upper approximation through duality. In general, \overline{apr}(X) is different from the straightforward generalization ∪{G | G ∈ U/R, G ∩ X ≠ ∅}. While the lower approximation is the union of some granules, the upper approximation cannot be expressed in this way [33].

Subsets in the covering U/R are called elementary granules. By definition, if X is a union of some elementary granules in U/R, then we have \underline{apr}(X) = X. That is, X can be defined by granules in U/R exactly from below. For this reason, the empty set ∅ and the unions of elementary granules are referred to as lower definable granules. The set of all lower definable granules \underline{GK}(U) is the minimum subset of 2^U that contains ∅ and U/R, and is closed under set union. The complemented system:

\overline{GK}(U) = {−G | G ∈ \underline{GK}(U)},   (25)

contains U and is closed under set intersection. In other words, \overline{GK}(U) is a closure system [31]. For an element G ∈ \overline{GK}(U), i.e., −G ∈ \underline{GK}(U), we have:

\overline{apr}(G) = G,
\underline{apr}(−G) = −G.   (26)

That is, the system \overline{GK}(U) consists of the upper definable subsets of U. In general, G = \underline{apr}(G) ≠ \overline{apr}(G) and \underline{apr}(G') ≠ \overline{apr}(G') = G' for elements G ∈ \underline{GK}(U) and G' ∈ \overline{GK}(U). In terms of lower and upper definable granules, we have another equivalent definition:

\underline{apr}(X) = ∪{G | G ∈ \underline{GK}(U), G ⊆ X},
\overline{apr}(X) = ∩{G' | G' ∈ \overline{GK}(U), X ⊆ G'}.   (27)

The lower approximation is the largest lower definable granule contained in X, and the upper approximation is the smallest upper definable granule containing X. They are related to the definition for the case of partitions, in which \underline{GK}(U) and \overline{GK}(U) become the same. For a covering, the set \underline{GK}(U) ∩ \overline{GK}(U) consists of both lower and upper definable granules. Obviously, ∅, U ∈ \underline{GK}(U) ∩ \overline{GK}(U).

The new approximations satisfy properties (I), (II), and (IV). They do not satisfy property (III). Nevertheless, they satisfy a weaker version:

(Va) \underline{apr}(X ∩ Y) ⊆ \underline{apr}(X) ∩ \underline{apr}(Y),
(Vb) \overline{apr}(X ∪ Y) ⊇ \overline{apr}(X) ∪ \overline{apr}(Y).

By definition, \underline{apr}(X ∩ Y) can be written as a union of some elementary granules. Although both \underline{apr}(X) and \underline{apr}(Y) can be expressed as unions of elementary granules, \underline{apr}(X) ∩ \underline{apr}(Y) cannot, in general, be so expressed.
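The covering-based approximations of Equation (24) can also be sketched in a few lines. The covering below is a hypothetical one chosen so that the duality-based upper approximation visibly differs from the straightforward union of granules meeting X.

    U = {1, 2, 3, 4, 5}
    covering = [{1, 2}, {2, 3}, {3, 4, 5}]    # U/R: granules may overlap

    def lower(X):
        """Union of covering granules contained in X (Equation 24)."""
        return set().union(*(G for G in covering if G <= X))

    def upper(X):
        """Upper approximation defined through duality (Equation 24)."""
        return U - lower(U - X)

    X = {1, 2}
    print(lower(X), upper(X))   # {1, 2} {1, 2}
    # The straightforward generalization (union of granules meeting X) is larger:
    print(set().union(*(G for G in covering if G & X)))   # {1, 2, 3}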

5 Data Analysis and Data Mining

A granule represents a concept such that each element in the granule is an instance of the concept. Under this interpretation, one of the tasks of data analysis, knowledge discovery and data mining may be regarded as finding connections between concepts represented by their corresponding granules. In the framework of granular computing, the main results from Yao and Zhong [41,42] are reviewed.

Let X ⊆ U be a subset of the universe representing a certain concept φ_X, and FG a family of granules whose descriptions are known. We consider the task of finding a description of X in terms of granules in FG. For a granule G ∈ FG with description φ_G, i.e., m(φ_G) = G, we have either G ∩ X = ∅ or G ∩ X ≠ ∅. For the case G ∩ X = ∅, we say that G and X are not positively related. However, we have:

G ⊆ −X.   (28)

By property (iii), we have:

⊨_S φ_G → ¬φ_X.   (29)

Hence, we can establish an if-then type rule:

IF φ_G THEN not φ_X.   (30)



This rule enables us to decide that an instance of G is not an instance of X. It gives the properties that make an element of U not an instance of X. For the case G ∩ X ≠ ∅, we consider three special sub-cases:

(a) G ⊆ X,
(b) G ⊇ X,
(c) G = X.

In decision logic languages, we have:

⊨_S φ_G → φ_X,  ⊨_S φ_X → φ_G,  ⊨_S φ_G ≡ φ_X.   (31)

By properties (iii) and (iv), we can form the following set of rules:

IF φ_G THEN φ_X,  OIF φ_G THEN φ_X,  IIF φ_G THEN φ_X,   (32)

where OIF stands for "only if" and IIF stands for "if and only if". We express these rules slightly differently from the conventional way, in order to see the difference between them. The first rule enables us to decide if an element of the universe is an instance of X. It shows the properties that make an element of U an instance of X. The second rule, which is normally expressed as:

IF φ_X THEN φ_G,   (33)

tells us the properties that an instance of X must have. The third rule is the combination of the first two rules. It summarizes the properties that instances of X, and only instances of X, must have. The first two rules may be interpreted as one-way implications, and the third rule as a two-way implication. In knowledge discovery and data mining, one may be interested in different rules depending on the situation. Typically, the first rule is referred to as a decision rule, while the second rule is referred to as a characteristic rule.
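The case analysis above amounts to comparing the sets G and X. A minimal sketch of this comparison is shown below; the function name and the textual rule labels are assumptions introduced for illustration.

    def rule_type(G, X):
        """Classify the logical relation between a granule G and a concept X."""
        if not (G & X):
            return "IF phi_G THEN not phi_X"     # G and X are disjoint
        if G == X:
            return "IIF phi_G THEN phi_X"        # if and only if
        if G <= X:
            return "IF phi_G THEN phi_X"         # decision rule
        if G >= X:
            return "OIF phi_G THEN phi_X"        # characteristic rule (only if)
        return "association only"                # overlap without inclusion

    print(rule_type({1, 2}, {1, 2, 3}))   # IF phi_G THEN phi_X
    print(rule_type({1, 2, 3}, {2, 3}))   # OIF phi_G THEN phi_X
    print(rule_type({1, 4}, {2, 3}))      # IF phi_G THEN not phi_X
    print(rule_type({1, 2}, {2, 3}))      # association only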

The rules obtained for the previous cases are certain rules, which reflect the logical relationships between concepts or granules. In some situations, though a strict logical connection does not exist, there may still exist some connection between two granules. This corresponds to the case where G ∩ X ≠ ∅ and neither G ⊆ X nor G ⊇ X is true. In order to characterize such associations between two concepts φ and ψ, one may generalize logical rules to association rules of the following form:

IF φ THEN ψ with (α_1, ..., α_m),   (34)



where α_1, ..., α_m denote the degrees or strengths of relationships [51]. Although keywords such as IF and THEN are used, one should not interpret the rules as expressing logical implications. Instead, these keywords are used simply to link concepts together [47]. For clarity, we also simply write φ ⇒ ψ. The values α_1, ..., α_m quantify different types of uncertainty and properties associated with the rule. Examples of quantitative measures include the confidence, uncertainty, applicability, quality, accuracy, and interestingness of rules. A recent systematic study of uncertain rules was given by Yao and Zhong [41].

Using the cardinalities of sets, we obtain the contingency table, Table 2, representing the quantitative information about the rule φ ⇒ ψ, where |·| denotes the cardinality of a set. The values in the four cells are not independent. They are linked by the constraint a + b + c + d = n. The 2 × 2 contingency table has been used by many authors for representing information about rules [2,6,23,26,50].

         ψ                  ¬ψ                  Totals
φ        |m(φ) ∩ m(ψ)|      |m(φ) ∩ m(¬ψ)|      |m(φ)|
¬φ       |m(¬φ) ∩ m(ψ)|     |m(¬φ) ∩ m(¬ψ)|     |m(¬φ)|
Totals   |m(ψ)|             |m(¬ψ)|             |U|

         ψ        ¬ψ        Totals
φ        a        b         a + b
¬φ       c        d         c + d
Totals   a + c    b + d     a + b + c + d = n

Table 2. Contingency table for the rule φ ⇒ ψ

From the contingency table, we can define some basic quantities. The generality of the concept φ is defined by:

g(φ) = |m(φ)| / |U| = (a + b) / n,   (35)

which indicates the relative size of the concept φ. A concept is more general if it covers more instances of the universe. If g(φ) = α, then (100α)% of the objects in U satisfy φ. The quantity may be viewed as the probability of a randomly selected element satisfying φ. Obviously, we have 0 ≤ g(φ) ≤ 1.


The absolute support of ψ provided by φ is the quantity:

as(ψ|φ) = |m(ψ) ∩ m(φ)| / |m(φ)| = a / (a + b).   (36)

It may be interpreted as the degree to which φ implies ψ. If as(ψ|φ) = α, then (100α)% of the objects satisfying φ also satisfy ψ. It is in fact the conditional probability of a randomly selected element satisfying ψ given that the element satisfies φ. In set-theoretic terms, it is the degree to which m(φ) is included in m(ψ). Clearly, as(ψ|φ) = 1 if and only if m(φ) ⊆ m(ψ). The change of support of ψ provided by φ is defined by:

cs(ψ|φ) = as(ψ|φ) − g(ψ) = (an − (a + b)(a + c)) / ((a + b)n).   (37)

Unlike the absolute support, the change of support varies from −1 to 1. One may consider g(ψ) to be the prior probability of ψ and as(ψ|φ) the posterior probability of ψ after knowing φ. The difference between the posterior and prior probabilities represents the change of our confidence regarding whether φ actually confirms ψ. For a positive value, one may say that φ confirms ψ; for a negative value, one may say that φ does not confirm ψ. The mutual support of ψ and φ is defined by:

ms(φ, ψ) = |m(φ) ∩ m(ψ)| / |m(φ) ∪ m(ψ)| = a / (a + b + c).   (38)

One may interpret the mutual support, 0 ≤ ms(φ, ψ) ≤ 1, as a measure of the strength of the two-way association φ ⇔ ψ. It measures the degree to which φ confirms, and only confirms, ψ.

The degree of independence of φ and ψ is measured by:

ind(φ, ψ) = g(φ ∧ ψ) / (g(φ) g(ψ)) = an / ((a + b)(a + c)).   (39)

It is the ratio of the joint probability of φ ∧ ψ to the probability obtained if φ and ψ are assumed to be independent. One may rewrite the measure of independence as [3]:

ind(φ, ψ) = as(ψ|φ) / g(ψ).   (40)



It shows the degree of deviation of the probability of ψ in the subpopulation constrained by φ from the probability of ψ in the entire data set [13,30]. With this expression, the relationship to the change of support becomes clear. Instead of using the ratio, the latter is defined by the difference of as(ψ|φ) and g(ψ). When φ and ψ are probabilistically independent, we have cs(ψ|φ) = 0 and ind(φ, ψ) = 1. Moreover, cs(ψ|φ) ≥ 0 if and only if ind(φ, ψ) ≥ 1, and cs(ψ|φ) ≤ 0 if and only if ind(φ, ψ) ≤ 1. This provides further support for the use of cs as a measure of confidence that φ confirms ψ.

All the measures introduced so far have a probabilistic interpretation. They can be roughly divided into three classes: generality (g), one-way association (as and cs), and two-way association (ms and ind). Each type of association measure can be further divided into absolute support and change of support. The measure of absolute one-way support is as, and the measure of absolute two-way support is ms. The measures of change of support are cs for one-way, and ind for two-way. It is interesting to note that all measures of change of support are related to the deviation of the joint probability of φ ∧ ψ from the probability obtained if φ and ψ are assumed to be independent. In other words, a stronger association is present if the joint probability is further away from the probability under independence. The association can be either positive or negative.
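The measures of Equations (35)-(39) follow directly from the four cell counts of Table 2. The sketch below computes them for hypothetical counts; the function name and the example values are assumptions, not taken from the paper.

    def measures(a, b, c, d):
        """Generality, one-way and two-way association measures from the
        2 x 2 contingency table of a rule phi => psi (Equations 35-39)."""
        n = a + b + c + d
        g_phi = (a + b) / n                    # generality of phi
        g_psi = (a + c) / n                    # generality of psi
        as_ = a / (a + b)                      # absolute (one-way) support
        cs = as_ - g_psi                       # change of support
        ms = a / (a + b + c)                   # mutual (two-way) support
        ind = (a * n) / ((a + b) * (a + c))    # degree of independence
        return {"g(phi)": g_phi, "as": as_, "cs": cs, "ms": ms, "ind": ind}

    # Hypothetical counts: a = |m(phi) ∩ m(psi)|, and b, c, d as in Table 2.
    print(measures(a=40, b=10, c=20, d=30))
    # Here cs > 0 and ind > 1, so phi positively confirms psi.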

6 Conclusion

Granular computing may be regarded as a label for the family of theories, methodologies, and techniques that make use of granules (i.e., groups, classes, or clusters of a universe) in the process of problem solving. The construction, representation, and interpretation of granules, as well as the utilization of granules for problem solving, are some of the fundamental issues. In order to understand and investigate these issues, it is necessary to establish a proper framework. By reviewing some existing studies on non-fuzzy granular computing, we proposed a model of granular computing based on information tables. Within this model, various methods for the construction, interpretation, and representation of granules were examined. Two specific problems of granular computing were also discussed, as an illustration to show the usefulness of our model. One may conclude that although the proposed model is simple, it is powerful for the study of fundamental issues in granular computing.

In this paper, we only considered non-fuzzy granular computing. It is useful to extend the framework so that fuzzy information may be incorporated. This may be done by using fuzzy relations on attribute values. One may obtain another generalization by considering incomplete information tables, where an information function maps each object to a set of attribute values instead of a single value [37]. The major part of our discussion was focused on information granulation, with very little emphasis on the actual



computing. Methods for computation based on granulations of the universe are clearly needed. The approaches for constructing rough set approximations and finding connections between granules represent two such examples.

References

1. Codd, E.F. A relational model of data for large shared data banks, Communications of the ACM, 13, 377-387, 1970.

2. Gaines, B.R. The trade-off between knowledge and data in knowledge acquisition, in: Knowledge Discovery in Databases, Piatetsky-Shapiro, G. and Frawley, W.J. (Eds.), AAAI/MIT Press, 491-505, 1991.

3. Gray, B. and Orlowska, M.E. CCAIIA: clustering categorical attributes into interesting association rules, in: Research and Development in Knowledge Discovery and Data Mining, Wu, X., Kotagiri, R. and Korb, K.B. (Eds.), Springer-Verlag, Heidelberg, Germany, 132-143, 1998.

4. Hadjimichael, M. and Wasilewska, A. Rule reduction for knowledge representation system, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 63-69, 1990.

5. Han, J., Cai, Y., and Cercone, N. Data-driven discovery of quantitative rules in relational databases, IEEE Transactions on Knowledge and Data Engineering, 5, 29-40, 1993.

6. Ho, K.M. and Scott, P.D. Zeta: a global method for discretization of continuous variables, Proceedings of KDD-97, 191-194, 1997.

7. Jardine, N. and Sibson, R. Mathematical Taxonomy, Wiley, New York, 1971.

8. Lee, T.T. An information-theoretic analysis of relational databases - part I: data dependencies and information metric, IEEE Transactions on Software Engineering, SE-13, 1049-1061, 1987.

9. Lin, T.Y. Granular computing on binary relations I: data mining and neighborhood systems, II: rough set representations and belief functions, in: Rough Sets in Knowledge Discovery 1, Polkowski, L. and Skowron, A. (Eds.), Physica-Verlag, Heidelberg, 107-140, 1998.

10. Lin, T.Y. and Hadjimichael, M. Non-classificatory generalization in data mining, Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, 404-411, 1996.

11. Lin, T.Y., Zhong, N., Dong, J.Z., and Ohsuga, S. Frameworks for mining binary relations in data, in: Rough Sets and Current Trends in Computing, Polkowski, L. and Skowron, A. (Eds.), Springer-Verlag, Berlin, 387-393, 1998.

12. Lipski, W. Jr. On databases with incomplete information, Journal of the ACM, 28, 41-70, 1981.

13. Liu, H., Lu, H., and Yao, J. Identifying relevant databases for multidatabase mining, in: Research and Development in Knowledge Discovery and Data Mining, Wu, X., Kotagiri, R. and Korb, K.B. (Eds.), Springer-Verlag, Heidelberg, Germany, 211-221, 1998.

14. Malvestuto, F.M. Statistical treatment of the information content of a database, Information Systems, 11, 211-223, 1986.

15. Orlowska, E. Semantics of knowledge operators, Bulletin of the Polish Academy of Sciences, Mathematics, 35, 255-263, 1987.



16. Orlowska, E. Reasoning about vague concepts, Bulletin of the Polish Academy of Sciences, Mathematics, 35, 643-652, 1987.

17. Pawlak, Z. Information systems - theoretical foundations, Information Systems, 6, 205-218, 1981.

18. Pawlak, Z. Rough sets, International Journal of Computer and Information Sciences, 11, 341-356, 1982.

19. Pawlak, Z. Rough Sets, Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1991.

20. Pawlak, Z. Granularity of knowledge, indiscernibility and rough sets, Proceedings of 1998 IEEE International Conference on Fuzzy Systems, 106-110, 1998.

21. Polkowski, L. and Skowron, A. Towards adaptive calculus of granules, Proceedings of 1998 IEEE International Conference on Fuzzy Systems, 111-116, 1998.

22. Quinlan, J.R. Learning efficient classification procedures and their application to chess endgames, in: Machine Learning: An Artificial Intelligence Approach, Vol. 1, Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Morgan Kaufmann, Palo Alto, CA, 463-482, 1983.

23. Silverstein, C., Brin, S., and Motwani, R. Beyond market baskets: generalizing association rules to dependence rules, Data Mining and Knowledge Discovery, 2, 39-68, 1998.

24. Skowron, A. and Stepaniuk, J. Information granules and approximation spaces, Proceedings of the Seventh International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 354-361, 1998.

25. Slowinski, R. and Vanderpooten, D. Similarity relation as a basis for rough approximations, in: Advances in Machine Intelligence & Soft-Computing IV, Wang, P.P. (Ed.), Department of Electrical Engineering, Duke University, Durham, North Carolina, 17-33, 1997.

26. Tsumoto, S. and Tanaka, H. Automated discovery of functional components of proteins from amino-acid sequences based on rough sets and change of representation, Proceedings of KDD-95, 318-324, 1995.

27. Vakarelov, D. A modal logic for similarity relations in Pawlak knowledge representation systems, Fundamenta Informaticae, XV, 61-79, 1991.

28. Wasilewska, A. Conditional knowledge representation system - model for an implementation, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 63-69, 1990.

29. Yager, R.R. and Filev, D. Operations for granular computing: mixing words with numbers, Proceedings of 1998 IEEE International Conference on Fuzzy Systems, 123-128, 1998.

30. Yao, J. and Liu, H. Searching multiple databases for interesting complexes, in: KDD: Techniques and Applications, Lu, H., Motoda, H., and Liu, H. (Eds.), World Scientific, Singapore, 1997.

31. Yao, Y.Y. Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences, 111, 239-259, 1998.

32. Yao, Y.Y. Generalized rough set models, in: Rough Sets in Knowledge Discovery 1, Polkowski, L. and Skowron, A. (Eds.), Physica-Verlag, Heidelberg, 286-318, 1998.

33. Yao, Y.Y. Granular computing using neighborhood systems, in: Advances in Soft Computing: Engineering Design and Manufacturing, Roy, R., Furuhashi, T., and Chawdhry, P.K. (Eds.), Springer, London, 539-553, 1999.



34. Yao, Y.Y. Stratified rough sets and granular computing, Proceedings of the 18th International Conference of the North American Fuzzy Information Processing Society, 800-804, 1999.

35. Yao, Y.Y. Rough sets, neighborhood systems, and granular computing, Proceedings of the 1999 IEEE Canadian Conference on Electrical and Computer Engineering, 1553-1558, 1999.

36. Yao, Y.Y. and Lin, T.Y. Generalization of rough sets using modal logic, Intelligent Automation and Soft Computing, an International Journal, 2, 103-120, 1996.

37. Yao, Y.Y. and Noroozi, N. A unified framework for set-based computations, Proceedings of the 3rd International Workshop on Rough Sets and Soft Computing, Lin, T.Y. (Ed.), San Jose State University, 236-243, 1994.

38. Yao, Y.Y. and Wong, S.K.M. Generalization of rough sets using relationships between attribute values, Proceedings of the 2nd Annual Joint Conference on Information Sciences, 30-33, 1995.

39. Yao, Y.Y., Wong, S.K.M., and Butz, C.J. On information-theoretic measures of attribute importance, in: Methodologies for Knowledge Discovery and Data Mining, Zhong, N. and Zhou, L. (Eds.), Springer-Verlag, Berlin, 133-137, 1999.

40. Yao, Y.Y., Wong, S.K.M., and Lin, T.Y. A review of rough set models, in: Rough Sets and Data Mining: Analysis for Imprecise Data, Lin, T.Y. and Cercone, N. (Eds.), Kluwer Academic Publishers, Boston, 47-75, 1997.

41. Yao, Y.Y. and Zhong, N. An analysis of quantitative measures associated with rules, in: Methodologies for Knowledge Discovery and Data Mining, Zhong, N. and Zhou, L. (Eds.), Springer-Verlag, Berlin, 479-488, 1999.

42. Yao, Y.Y. and Zhong, N. Potential applications of granular computing in knowledge discovery and data mining, Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, Volume 5, 573-580, 1999.

43. Zadeh, L.A. Fuzzy sets and information granularity, in: Advances in Fuzzy Set Theory and Applications, Gupta, N., Ragade, R. and Yager, R. (Eds.), North­Holland, Amsterdam, 3-18, 1979.

44. Zadeh, L.A. Fuzzy logic = computing with words, IEEE Transactions on Fuzzy Systems, 4, 103-111, 1996.

45. Zadeh, L.A. Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, manuscript, Computer Science Division, University of California, Berkeley, USA, 1997.

46. Zadeh, L.A., Toward a restructure of the foundations of fuzzy logic (FL), Ab­stract of BISC Seminar, University of California, Berkeley, USA, 1997.

47. Zadeh, L.A. Towards a theory of fuzzy information granulation and its central­ity in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 19, 111-127, 1997.

48. Zadeh, L.A. Announcement of GrC, 1997. 49. Zadeh, L.A. Information granulation and its centrality in human and machine

intelligence, in: Rough Sets and Current Trends in Computing, Polkowski, L. and Skowron, A. (Eds.), Springer-Verlag, Berlin, 35-36, 1998.

50. Zembowicz, R. and Zytkow, J.M. From contingency tables to various forms of knowledge in database, in: Advances in Knowledge Discovery and Data Mining, Fayyad, U.M, Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (Eds.), AAAI Press / MIT Press, California, 39-81, 1996.

Page 130: Data Mining, Rough Sets and Granular Computing

124

51. Zhong, N., Dong, J., Fujitsu, S., and Ohsuga, S. Soft techniques for rule dis­covery in data, Transactions of Information Processing Society of Japan, 39, 2581-2592, 1998.

Page 131: Data Mining, Rough Sets and Granular Computing

A Query-Driven Interesting Rule Discovery Using Associations and Spanning Operations

Jong P. Yoon1 and Larry Kerschberg2

1 University of Louisiana at Lafayette, The Center for Advanced Computer Studies, Lafayette, LA 70504-4330, USA

2 George Mason University, Department of Information and Software Engineering, Fairfax, VA 22030-4444, USA

Abstract. In practice, users often want rules that are both interesting and related to their goals. This paper describes a technique for mining useful rules that are both interesting and related to user goals. According to the degree of relevancy to a user goal, a database can be divided into five views, ranging from the view positively related to the user goal to the view unrelated to it. Our mining technique can be applied to each such view. Unlike traditional approaches, which apply association and pruning operations to a single view, the union and join operations of SQL are applied to one or more of these views. While the pattern association operation joins patterns over different attributes, the pattern spanning operation unions patterns over the same attributes. The combination of the two operations maintains both the confidence and supportiveness measures, and the differentiation of query views enables us to produce the desired level of interestingness and relevancy.

1 Introduction

The problem of mining useful association rules from multidatabases has received considerable research attention, and several algorithms for mining interesting rules have been developed [PT98,WTL98]. It is well known that most interesting rules describe the independence between antecedent and consequent patterns [PT98,ST96]. If the patterns in a rule are independent and part of them is given by a user goal (i.e., beliefs or expectations [PT98]), the rule may be more interesting but less related to what the user goal requests. The interestingness of association rules, even together with supportiveness and confidence, may not be sufficient to produce useful rules. Therefore, yet another measure, called relevancy, is considered in order to mine useful data, particularly from multidatabases.

The interestingness, relevancy, confidence, and supportiveness of rules discovered from the databases are discussed in this paper.

• Interestingness. Interestingness as unexpectedness can be defined in general [Fag90,ST96]. Association rules are interesting if the patterns in the rule are independent.

• Relevancy. Many rules are mined, and it turns out that not all of them are related to a user goal. Notice that there is a tradeoff between interestingness and relevancy. Interesting rules at one extreme may not be the ones that are related to a user goal, while rules highly related to a user goal at the other extreme are no surprise to the user. The goal of this paper is to mine those rules that are both interesting and related to a user goal.

• Supportiveness and Confidence. Association rules can be used in applications only when they exceed a given minimum confidence. The higher the supportiveness of an association rule, the more applicable it may be. The higher the confidence of an association rule, the more likely it is to be correct in the database.
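As a rough illustration of the supportiveness and confidence measures just described, the following is a minimal Python sketch; the tuple layout and the two predicates are hypothetical, not part of the paper's framework.

def support(rows, antecedent, consequent):
    """Fraction of all rows satisfying both the antecedent and the consequent."""
    return sum(1 for r in rows if antecedent(r) and consequent(r)) / len(rows)

def confidence(rows, antecedent, consequent):
    """Among the rows satisfying the antecedent, the fraction also satisfying the consequent."""
    matching = [r for r in rows if antecedent(r)]
    return sum(1 for r in matching if consequent(r)) / len(matching)

# Toy soil-like tuples: (Ca, Na, Yield)
rows = [(4.35, 0.10, 4), (4.35, 0.10, 3), (4.02, 0.14, 1), (5.67, 0.17, 3)]
ante = lambda r: 4.3 <= r[0] <= 4.4      # Ca in [4.3, 4.4]
cons = lambda r: r[2] >= 3               # higher yield
print(support(rows, ante, cons), confidence(rows, ante, cons))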

With these four measures of data mining in mind, we describe correlations among them. The following example illustrates why these four measures need to be considered.

EXAMPLE 1.1: Consider a soil database where a farmer expects (or queries for) information about the soil composition ratio among calcium, iron, and sodium as they relate to yield. It is likely that the farmer's goal is to determine what soil compositions result in higher yield. Suppose the following rule is mined: soil with sodium of amount x produces higher yield. Since this rule is exactly what the user expected, it may answer the query well, but it may not be interesting in terms of data mining. On the other hand, a rule specifying that urban customers buy bigger crops than rural customers may be interesting because the pattern of crop consumption occurs independently of the user goal. However, it is less useful with respect to the user goal because it is not relevant. The problem is that increasing the interestingness of a rule may result in decreasing its relevancy, and vice versa. Our goal in this paper is then to formulate rules which are both related to and interesting to the user. The following rule may provide such an example: soil with calcium of amount y "hinders" higher yield.

Traditional data mining approaches focus on increasing both confidence and supportiveness. In Figure 1, the bottom middle area labelled (1) is the major concern of traditional data mining research [AIS93,STA98]. More recently, researchers have focused on characterizing interesting rules with minimum support and confidence, as shown in the upper middle area labelled (2) in the figure [ST96,WTL98]. In this paper, we discuss how to mine association rules that are both interesting and related to the user goal, and that meet the minimum support and confidence thresholds. The area we deal with in this paper is (3) in the figure. We do not address the issue of how to build a "good" set of goals. Instead, we assume that user goals are expressed as SQL queries.

The remainder of this paper is organized as follows: Section 2 describes the problem statement and definitions. In Section 3, we develop an algorithm for pattern association using the join operation. Section 4 describes an algorithm for pattern spanning using the union operation in SQL. Section 5 describes how these two operations are performed interactively. Finally, the contributions of this paper and future work are described in Section 6.

Fig. 1. Four Measures (interestingness and relevance plotted against confidence and supportiveness)

2 Problem Statement

Let D = {R_1, R_2, ..., R_r} denote a set of database tables. Assume that the table instances are ordered in transaction time. Each table R_i (1 ≤ i ≤ r) consists of attributes A_{i1}, A_{i2}, ..., A_{im_i}. An example of such tables is shown in Figure 2.

A user goal g expressed in SQL is posed against the database tables D. The answer A is returned by computing A = g(D), that is, {t[A_{ij}] | g(t[A_{ij}]), 1 ≤ i ≤ r, 1 ≤ j ≤ j_i}. If the answer has the same arity (i = r) and the same size (|A| = |D|) as the given database, our approach considers the entire database, as in traditional approaches.

Pattern Model

A pattern is of the form A_{ij} ∈ [l, u], where l and u are the lower and upper bounds on the attribute, respectively [SA96]. If l = u, the pattern is called a point pattern; otherwise it is called a range pattern.

The conjunction of two patterns, A_{ij} ∈ [l_1, u_1] and A_{kl} ∈ [l_2, u_2], is again a pattern. Such a conjunction can be treated by the following operations:

• If j ≠ l, the two patterns can be joined: (A_{ij} ∈ [l_1, u_1], A_{kl} ∈ [l_2, u_2]).

• If i = k and j = l, the two patterns can be unioned: (A_{ij} ∈ [l_1, u_2]), where l_1 ≤ u_1 ≤ l_2 ≤ u_2.

The above two operations are respectively called "association operation" and "spanning operation."

For example, the two patterns (Ca ∈ [4.5,4.5]) and (Na ∈ [0.1,0.17]) can be joined to form (Ca ∈ [4.5,4.5] ∧ Na ∈ [0.1,0.17]). However, the two patterns (Ca ∈ [4.5,4.5]) and (Ca ∈ [4.55,4.75]) may be unioned to form (Ca ∈ [4.5,4.75]). More details about the pattern association and spanning operations are discussed in later sections.
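As a small illustration of the two operations, here is a minimal Python sketch; the dictionary representation of a pattern is a hypothetical convenience, not the paper's data structure.

def associate(p1, p2):
    """Join two patterns over different attributes into one conjunctive pattern.
    A pattern is represented as a dict mapping an attribute name to an (l, u) interval."""
    assert not set(p1) & set(p2), "association requires different attributes"
    return {**p1, **p2}

def span(p1, p2, attr):
    """Union two range patterns over the same attribute into one wider range pattern."""
    l1, u1 = p1[attr]
    l2, u2 = p2[attr]
    assert l1 <= u1 <= l2 <= u2, "patterns must be ordered as l1 <= u1 <= l2 <= u2"
    return {**p1, attr: (l1, u2)}

print(associate({"Ca": (4.5, 4.5)}, {"Na": (0.1, 0.17)}))     # join over Ca and Na
print(span({"Ca": (4.5, 4.5)}, {"Ca": (4.55, 4.75)}, "Ca"))   # union over Ca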


Relevancy to User Goal

A user goal can be expressed in a SELECT-FROM-WHERE SQL query posed to a database. Each instance in the database either satisfies the WHERE clause or not. The set of the former instances is positively related to the goal, and the set of the latter instances is negatively related to it. The attribute values in each instance are either projected by the SELECT clause or not. The projected attributes are directly related to the goal; the others are indirectly related. The tables not specified in the FROM clause are unrelated to the goal.

Based on this classification, the database tables can be decomposed into five components, each called a view: the positively-related query view, the positively-indirectly-related query view, the negatively-related query view, the negatively-indirectly-related query view, and the unrelated query view. Each such view has a different degree of relevancy to a user goal.

• Positively-Related Query View (PRQV). This view is materialized by executing the user query against the database. PRQV(g) = g(D) = {t[A_{ij}] | g(t[A_{ij}]), 1 ≤ i ≤ n, n ≤ r, 1 ≤ j ≤ j_i}.

• Positively-Indirectly-Related Query View (PIRQV). This view contains the same instances as PRQV, but over the attributes that are not projected by the query. PIRQV(g) = {t[A_{ij}] | g(t[A_{ij}]), 1 ≤ i ≤ n, n ≤ r, j_i + 1 ≤ j ≤ m_i}.

• Negatively-Related Query View (NRQV). This view is the complement of PRQV. The instances in this view are not in PRQV, but are projected onto the same attributes. NRQV(g) = {t[A_{ij}] | ¬g(t[A_{ij}]), 1 ≤ i ≤ n, n ≤ r, 1 ≤ j ≤ j_i}.

• Negatively-Indirectly-Related Query View (NIRQV). This view is the complement of PIRQV. NIRQV(g) = {t[A_{ij}] | ¬g(t[A_{ij}]), 1 ≤ i ≤ n, n ≤ r, j_i + 1 ≤ j ≤ m_i}.

• UnRelated Query View (URQV). This view consists of all tables not involved in the query processing. None of its instances and attributes (except the join attributes) appear in any of the above views. This view is the most independent of the user goal, and the least relevant to it. URQV(g) = {t[A_{ij}] | t[A_{ij}] ∈ D, ¬g(t[A_{ij}]), n + 1 ≤ i ≤ r}.

Most previous data mining approaches consider only the PRQV, while our approach can consider all five views. Figure 3 describes the SQL-like algorithm to generate these views. Among the five views, there are 11 combinations that are meaningful to use. Table 1 lists the types of rules to be extracted from these 11 combinations. PRQV and NRQV can be unioned because they are union-compatible, and from this unioned view goal-directly-related and interesting rules can be extracted. Similarly, PIRQV and NIRQV can be unioned to generate goal-indirectly-related and interesting rules. PRQV and PIRQV are joined to generate goal-positively-related rules, while PRQV and NIRQV are outer-joined to generate another type of rules, as described in the table. Although all 11 combinations can be considered, we take only one combination in this paper: PRQV ∪ NRQV.

Fig. 2. Soil Database Example: (a) the Soil table with attributes Ca, Fe, Na, ..., Yield; (b) the Price table with attributes Yield and Price; (c) the Consumption table with attributes City, Market, Yield, ..., Quantity.

Since NRQV consists of one or more tables, and each such table has a different arity and domain, we cast NRQV into the same schema as PRQV so that PRQV and NRQV are union-compatible.

For example, consider the tables in Figure 2. Suppose that the user goal is expressed in the following query, which lists the Ca, Fe, Na, and Price information for fields whose price is at least 20.

SELECT Ca, Fe, Na, Price
FROM Soil AS s, Price AS p
WHERE s.field_id = p.field_id AND Price >= 20

View                        Rules to be extracted
1  PRQV                     Goal-positively/directly-related rules only
2  NRQV                     Goal-negatively/directly-related and interesting rules only
3  PRQV ∪ NRQV              Goal-directly-related and interesting rules
4  PIRQV                    Goal-positively/indirectly-related rules
5  PRQV ⋈ PIRQV             Goal-positively-related rules only
6  NIRQV                    Goal-negatively/indirectly-related and interesting rules
7  NRQV ⋈ NIRQV             Goal-negatively-related and interesting rules
8  PIRQV ∪ NIRQV            Goal-indirectly-related and interesting rules
9  PRQV ⋈_outer NIRQV       Either goal-positively/directly-related rules or goal-negatively/indirectly-related and interesting rules
10 NRQV ⋈_outer PIRQV       Either goal-negatively/directly-related and interesting rules or goal-positively/indirectly-related rules
11 URQV                     Interesting rules only

Note that ⋈_outer denotes the outer-join operation.

Table 1. Kinds of the Rules to be Extracted from Five Views


Algorithm: View Generation
Input: Query q
Output: Positively-/Negatively-Related Query Views (PRQV/NRQV) and Positively-Indirectly-/Negatively-Indirectly-Related Query Views (PIRQV/NIRQV)
Method:
Assume that table R_i has attributes R_i.A_{i1}, R_i.A_{i2}, ..., R_i.A_{im_i}. Let q be the given query
  SELECT R_1.A_{11}, ..., R_1.A_{1j_1}, R_2.A_{21}, ..., R_2.A_{2j_2}, ..., R_n.A_{nj_n}
  FROM R_1, R_2, ..., R_n
  WHERE c;
where j_i ≤ m_i.
begin
  CREATE VIEW PRQV AS
    SELECT '+' AS viewType, R_1.A_{11}, ..., R_1.A_{1j_1}, R_2.A_{21}, ..., R_2.A_{2j_2}, ..., R_n.A_{nj_n}
    FROM R_1, R_2, ..., R_n
    WHERE c;

  CREATE VIEW PIRQV AS
    SELECT '-' AS viewType, R_1.A_{1,j_1+1}, ..., R_1.A_{1m_1}, R_2.A_{2,j_2+1}, ..., R_2.A_{2m_2}, ..., R_n.A_{n,j_n+1}, ..., R_n.A_{nm_n}
    FROM R_1, R_2, ..., R_n
    WHERE c;

  // Negatively-Related Query View w.r.t. table R_i
  CREATE VIEW NRQV_i AS
    (SELECT '-' AS viewType, R_i.A_{i1}, ..., R_i.A_{ij_i} FROM R_i)
    MINUS
    (SELECT '-' AS viewType, R_i.A_{i1}, ..., R_i.A_{ij_i} FROM R_1, R_2, ..., R_n WHERE c);

  // Negatively-Indirectly-Related Query View w.r.t. table R_i
  CREATE VIEW NIRQV_i AS
    (SELECT '-' AS viewType, R_i.A_{i,j_i+1}, ..., R_i.A_{im_i} FROM R_i)
    MINUS
    (SELECT '-' AS viewType, R_i.A_{i,j_i+1}, ..., R_i.A_{im_i} FROM R_1, R_2, ..., R_n WHERE c);
end

Fig. 3. View Generation

As this user goal is processed, we obtain the unioned view "unionedSoil" for the user goal above.

CREATE VIEW unionedSoil AS
  (SELECT Ca, Fe, Na, Price
   FROM Soil AS s, Price AS p
   WHERE s.field_id = p.field_id AND Price >= 20)
  UNION
  (SELECT Ca, Fe, Na, Price
   FROM Soil AS s, Price AS p
   WHERE s.field_id = p.field_id AND NOT Price >= 20)
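For concreteness, the following is a minimal, self-contained Python/SQLite sketch of materializing such a unioned view; the table layouts, the field_id join column, and the sample rows are hypothetical stand-ins for the data of Figure 2, and a viewType column is added so that the two halves of the union remain distinguishable, as in Figure 4.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Soil (field_id INTEGER, Ca REAL, Fe REAL, Na REAL, Yield INTEGER)")
cur.execute("CREATE TABLE Price (field_id INTEGER, Yield INTEGER, Price INTEGER)")
cur.executemany("INSERT INTO Soil VALUES (?,?,?,?,?)",
                [(1, 4.02, 45.1, 0.10, 1), (2, 4.35, 53.2, 0.10, 4), (3, 5.89, 68.1, 0.20, 5)])
cur.executemany("INSERT INTO Price VALUES (?,?,?)",
                [(1, 1, 11), (2, 4, 23), (3, 5, 30)])

# '+' rows satisfy the user goal (Price >= 20); '-' rows are its complement,
# projected onto the same attributes so the two halves are union-compatible.
cur.execute("""
CREATE VIEW unionedSoil AS
SELECT '+' AS viewType, Ca, Fe, Na, Price
  FROM Soil AS s, Price AS p
 WHERE s.field_id = p.field_id AND Price >= 20
UNION
SELECT '-' AS viewType, Ca, Fe, Na, Price
  FROM Soil AS s, Price AS p
 WHERE s.field_id = p.field_id AND NOT Price >= 20
""")
for row in cur.execute("SELECT * FROM unionedSoil ORDER BY viewType"):
    print(row)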

The resulting unionedSoil view can be obtained as shown in Figure 4.

Fig. 4. Query Views: (a) the unionedSoil view (PRQV ∪ NRQV) with columns viewType, Ca, Fe, Na, Price, and cnt; (b) the indirectly-related view (Yield); (c) the unrelated query view (City, Market, Yield, ..., Quantity).

Algorithm: Extraction of Point Patterns
Input: views (PRQV/PIRQV/NRQV/NIRQV)
Output: point patterns P_{1,0} (k = 1 and i = 0, meaning single patterns that have not yet been spanned)
Method:
begin
  // Point patterns extracted from a view w.r.t. attribute A_j of table R_i
  INSERT INTO P_{1,0}
  SELECT viewType, GetString(R_i.A_j) AS AttributeName_j,
         R_i.A_j AS l, R_i.A_j AS u, COUNT(*) AS cnt
  FROM PRQV
  GROUP BY AttributeName_j
  ORDER BY l
  // The same SQL is applied to the PIRQV, NRQV, and NIRQV views (or to the unioned
  // view, as simplified in this paper) to generate P_{1,0}.
end
Notice that the user-defined function GetString returns the input string without binding values. For example, GetString(Age) returns "Age" instead of, say, 27.

Fig. 5. Algorithm for Extracting Point Patterns

From this unioned view, we can obtain point patterns as shown in Figure 5. The set of these point patterns is F_1 in the Apriori algorithm [AAB+96]. By applying the example above to the algorithm in Figure 5 with the input view unionedSoil, we obtain the point patterns shown in Figure 6.
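A rough in-memory analogue of this extraction step, with hypothetical rows of the unioned view, might look as follows; each resulting entry corresponds to one row of Figure 6.

from collections import Counter

# Hypothetical (viewType, Ca, Fe, Na, Price) tuples in the spirit of Figure 4.
rows = [("-", 4.02, 45.1, 0.10, 11), ("+", 4.35, 53.2, 0.10, 23),
        ("+", 4.35, 68.1, 0.10, 23), ("+", 5.67, 53.2, 0.17, 23),
        ("+", 5.89, 68.1, 0.20, 30)]
attributes = ["Ca", "Fe", "Na", "Price"]

counts = Counter()
for view_type, *values in rows:
    for name, value in zip(attributes, values):
        counts[(view_type, name, value)] += 1

# Each entry is a point pattern: viewType, attribute name, l = u = value, cnt.
for (view_type, name, value), cnt in sorted(counts.items()):
    print(view_type, name, value, value, cnt)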

Rule Generation

For a given user goal g, a database is decomposed into five views. From each view, a pattern k can be extracted (pattern extraction, association, and spanning are discussed in later sections).


Point Patterns
viewType  Attribute_name   l      u      cnt
-         Ca               4.02   4.02   2
+         Ca               4.35   4.35   2
-         Ca               4.35   4.35   1
...       ...              ...    ...    ...
-         Fe               45.1   45.1   2
...       ...              ...    ...    ...
-         Na               0.14   0.14   2
+         Na               0.17   0.17   3
...       ...              ...    ...    ...
+         Price            30     30     1

Fig. 6. Point Patterns Extracted

• If k is extracted from PRQV, then the rule will be k ⇒ g. Its confidence is count((k ∧ g)(D))/count(g(D)), and its support is count((k ∧ g)(D))/count(D). Similarly, if k is extracted from PIRQV, then the rule will again be k ⇒ g. The difference from the above is that the rule is less relevant to the user goal; its confidence and support are defined in the same way.

• If k is extracted from NRQV, then the rule will be k ⇒ ¬g. The difference from the above two rules is the negation expressed in the rule. Its confidence is count((k ∧ ¬g)(D))/count(g(D)), and its support is count((k ∧ ¬g)(D))/count(D). Similarly, if k is extracted from NIRQV, then the rule will be k ⇒ ¬g; the difference from the above three rules is that it is a negative rule that is less relevant to the user goal. On the other hand, if k is extracted from URQV, then the rule will be k ⇒ ¬g; the difference from all the above rules is that it is the least relevant to the user goal.

Interestingness of Rules

The definition of "interestingness" is based on those of Silberschatz [ST96] and Fagin [Fag90]: "rules are interesting if they are not expected." Likewise, the J-measure was developed [SG93]: the higher the degree of dissimilarity a rule has, the more useful it is. Brin et al. [BMUT97] describe the interest of A and B in a supermarket basket example as P(A,B)/P(A)P(B). Using these two measures, we define rule interestingness for rules mined from one or more views.

Definition 1. (Rule Interestingness). Consider a query condition q. The interest of a rule k ⇒ g extracted from PRQV (or PIRQV) is obtained by the two measures described above:

I(k, g) = P(k, g) / (P(k)P(g)) [BMUT97];
J(g, k) = P(k)[P(g|k) log2(P(g|k)/P(g)) + (1 − P(g|k)) log2((1 − P(g|k))/(1 − P(g)))] [SG93].

If a rule k ⇒ ¬g is extracted from NRQV (or NIRQV), the interest is

I(k, ¬g) = P(k, ¬g) / (P(k)P(¬g)) [BMUT97];
J(¬g, k) = P(k)[P(¬g|k) log2(P(¬g|k)/P(¬g)) + (1 − P(¬g|k)) log2((1 − P(¬g|k))/(1 − P(¬g)))] [SG93].
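These two measures can be written directly as functions of the probabilities involved; the sketch below is a plain restatement in Python, with the probability values (which would be estimated from the views as relative frequencies) supplied as hypothetical numbers.

from math import log2

def interest(p_kg, p_k, p_g):
    """I(k, g) = P(k, g) / (P(k) * P(g))  [BMUT97]."""
    return p_kg / (p_k * p_g)

def j_measure(p_g_given_k, p_g, p_k):
    """J(g, k) = P(k)[P(g|k) log2(P(g|k)/P(g)) + (1-P(g|k)) log2((1-P(g|k))/(1-P(g)))]  [SG93]."""
    p = p_g_given_k
    return p_k * (p * log2(p / p_g) + (1 - p) * log2((1 - p) / (1 - p_g)))

# Hypothetical numbers: k and g co-occur more often than independence would predict.
print(interest(p_kg=0.15, p_k=0.3, p_g=0.4))        # > 1 indicates positive association
print(j_measure(p_g_given_k=0.5, p_g=0.4, p_k=0.3))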

Theorem 1. (Interesting Rule Generation). Rules generated with higher support from the NRQV lead to more interesting rules. This is particularly the case when |NRQV(g)| ≪ |PRQV(g)|.

Proof: Suppose that the size of NRQV is smaller than that of PRQV, and that the support of the rule mined from NRQV is higher than that of the rule mined from PRQV. The difference of the rule interest between NRQV and PRQV satisfies I(k, ¬g) − I(k, g) ≥ 0:

P(k, ¬g)/(P(k)P(¬g)) − P(k, g)/(P(k)P(g)) = [P(g)P(k, ¬g) − (1 − P(g))P(k, g)] / (P(k)P(g)(1 − P(g))),

because, as stated earlier, the unioned view is obtained only from PRQV and NRQV, so P(¬g) = 1 − P(g). The result is greater than or equal to 0 because of the given conditions: NRQV is smaller than PRQV, that is, P(¬g) ≪ P(g), and the support is higher, that is, P(¬g, k) ≫ P(g, k). Hence, under the given conditions, the theorem holds. The same argument can be made using the J-measure.

The primary goal of this paper is to generate interesting rules that are related to user goals. We propose two operations, pattern association (Section 3) and pattern spanning (Section 4), which are applied incrementally and interleaved with each other (Section 5).

3 The Join Operation for Pattern Association

The algorithm is expressed as an SQL-like expression over relations using the features supported in SQL-92 (partially with SQL3). The algorithm makes multiple passes over the data. The first pass counts the support of each pattern, that is, the number of sequential data that include the pattern.

As a presetting, the size of each view is computed by an SQL statement and inserted into a temporary two-column table called "view_size_table". Each pass k of the Apriori algorithm [AAB+96] first generates candidate patterns C_{k,i} from the frequent patterns F_{k-1,i} of the previous pass. Notice that in our approach a pattern can result from i point patterns clustering together. (We call this "pattern spanning" and describe it in a later section.)

Given F_{k-1,i}, the set of all frequent (k − 1)-patterns, our association operation returns a superset of the set of all frequent k-patterns. From F_{k-1,i}, the candidate patterns C_{k,i} are obtained by examining whether all subsets of each candidate exist in F_{k-1,i}, and then by counting the candidate patterns to check that they meet the minimal supportiveness requirement.

The pattern association described in this paper differs from the previous approach:


• Patterns are joined together in SQL to make a pattern. Instead of examining the existence of all subsets of C_{k,i} in F_{k-1,i}, we examine the existence of C_{k,i} itself in the view unioned from PRQV and NRQV. While the previous approach [STA98] uses a k-way join which takes O(n^k), our approach takes O(m), where n and m denote the sizes of the views F_{k-1} and unionedV (or PRQV ∪ NRQV), respectively, and k denotes the number of attributes. Notice that n ≫ m in general, so our approach improves the performance. The experimental results are omitted in this paper due to space limitations.

• Each such pattern is not just an itemset as in the supermarket example, but one specified over instances in a database.

Since all subsets of a frequent pattern are also frequent, we can generate C_{k,i} from F_{k-1,i} as shown in Figure 7. In the algorithm, line 5 sorts by the count cnt of instances. The reason for sorting is to interact with the spanning operation, which will be discussed in Section 4 and which benefits from sorted data. Lines 6 through 8 prune the associated patterns F_{k-1,i} to generate the candidates C_{k,i}.

Consider the following example. Let F_{2,0} be {(Ca ∈ [4.35,4.35], Fe ∈ [53.2,53.2]), (Ca ∈ [4.35,4.35], Na ∈ [0.10,0.10]), (Fe ∈ [53.2,53.2], Mn ∈ [10.0,10.0]), (Na ∈ [0.10,0.10], Fe ∈ [68.1,68.1])}. Note that F_{2,0} indicates that two point patterns are associated and not yet spanned. Then F_{3,0} will be generated: {(Ca ∈ [4.35,4.35], Fe ∈ [53.2,53.2], Na ∈ [0.10,0.10]), (Ca ∈ [4.35,4.35], Fe ∈ [68.1,68.1], Na ∈ [0.10,0.10])}. From F_{3,0}, the candidate patterns C_3 are determined by examining whether all subsets of each candidate in C_3 exist in F_{2,0}. In this examination, (Fe ∈ [68.1,68.1], Mn ∈ [10.0,10.0], Na ∈ [0.10,0.10]) cannot be a candidate because the subset (Na ∈ [0.1,0.17], Mn ∈ [10.0,10.0]) is not in F_{2,0}. After this, the counting process takes place. The pruning and counting processes are not trivial, especially when dealing with file server systems. However, our SQL approach improves the performance, as shown in lines 6 through 8 of the algorithm. For each associated pattern (line 4), the pruning process takes the view types (line 8) into account.

For example, consider patterns from more than one view. Let F_{2,1} be {(Ca ∈ [4.35,4.35], Fe ∈ [53.2,53.2])_PRQV, (Ca ∈ [4.35,4.35], Na ∈ [0.10,0.17])_NRQV, (Ca ∈ [4.35,4.35], Mn ∈ [10.0,11.0])_NRQV, (Ca ∈ [4.35,4.35], Na ∈ [0.10,0.10])_PRQV, (Mn ∈ [10.0,11.0], Na ∈ [0.10,0.17])_NRQV}. The notation (...)_PRQV in F_{2,1} indicates that two point patterns extracted from the PRQV view are associated and spanned once. From F_{2,1}, we can obtain F_{3,1}: {(Ca ∈ [4.35,4.35], Fe ∈ [53.2,53.2], Na ∈ [0.10,0.10])_PRQV, (Ca ∈ [4.35,4.35], Na ∈ [0.10,0.17], Mn ∈ [10.0,11.0])_NRQV}. A pattern extracted from a view needs to be associated with patterns extracted from the same view. By pruning, only (Ca ∈ [4.35,4.35], Na ∈ [0.10,0.17], Mn ∈ [10.0,11.0])_NRQV can be the candidate.
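Before turning to the SQL formulation in Figure 7, here is a rough in-memory analogue (not the paper's SQL) of the association step with its two constraints: candidates are formed only from frequent patterns of the same viewType that agree on all but their last attribute, and a candidate is kept only if every (k − 1)-subset is frequent. The pattern encoding below is hypothetical.

from itertools import combinations

def join_step(frequent, k):
    """frequent: set of (viewType, ((attr, (l, u)), ...)) patterns of length k-1."""
    candidates = set()
    for (v1, p1), (v2, p2) in combinations(sorted(frequent), 2):
        if v1 == v2 and p1[:-1] == p2[:-1] and p1[-1][0] != p2[-1][0]:
            candidates.add((v1, tuple(sorted(p1 + (p2[-1],)))))
    return candidates

def prune_step(candidates, frequent):
    def all_subsets_frequent(view, pattern):
        return all((view, sub) in frequent
                   for sub in combinations(pattern, len(pattern) - 1))
    return {(v, p) for v, p in candidates if all_subsets_frequent(v, p)}

F2 = {("+", (("Ca", (4.35, 4.35)), ("Fe", (53.2, 53.2)))),
      ("+", (("Ca", (4.35, 4.35)), ("Na", (0.10, 0.10)))),
      ("+", (("Fe", (53.2, 53.2)), ("Na", (0.10, 0.10))))}
C3 = prune_step(join_step(F2, 3), F2)
print(C3)   # the single candidate (Ca, Fe, Na) pattern from the '+' view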


Algorithm: Pattern Association
Input: patterns F_{k-1,i} (k ≥ 1, i ≥ 0) and the minimal support sup
Output: associated patterns F_{k,i}
Method:
Let the schema of F_{k-1,i} be (viewType, AttributeName_1, l_{1,i}, u_{1,i}, AttributeName_2, l_{2,i}, u_{2,i}, ..., AttributeName_{k-1}, l_{k-1,i}, u_{k-1,i}, cnt_{k-1,i}). Notice that if k = 1 and l_k = u_k, then the pattern is a point pattern. The schema describes k − 1 attributes that are associated and clustered with i point patterns. Let sup be the given minimal support threshold.
begin
  // Only 4 tuples are inserted into view_size_table.
  INSERT INTO view_size_table
    SELECT '+' AS viewType, count(*) AS viewSize FROM PRQV;
  INSERT INTO view_size_table
    SELECT '-' AS viewType, count(*) AS viewSize FROM NRQV;

1 INSERT INTO F_{k,i}
  SELECT viewType, I1.AttributeName_1, I1.l_{1,i}, I1.u_{1,i},
         I1.AttributeName_2, I1.l_{2,i}, I1.u_{2,i}, ...,
         I1.AttributeName_{k-1} AS AttributeName_{k-1}, I1.l_{k-1,i} AS l_{k-1,i}, I1.u_{k-1,i} AS u_{k-1,i},
         I2.AttributeName_{k-1} AS AttributeName_k, I2.l_{k-1,i} AS l_{k,i}, I2.u_{k-1,i} AS u_{k,i},
         I1.cnt_{k-1,i} + I2.cnt_{k-1,i} AS cnt_{k,i}
2 FROM F_{k-1,i} AS I1, F_{k-1,i} AS I2
3 WHERE I1.viewType = I2.viewType AND
        I1.AttributeName_1 = I2.AttributeName_1 AND I1.l_{1,i} = I2.l_{1,i} AND I1.u_{1,i} = I2.u_{1,i} AND
        I1.AttributeName_2 = I2.AttributeName_2 AND I1.l_{2,i} = I2.l_{2,i} AND I1.u_{2,i} = I2.u_{2,i} AND ... AND
        I1.l_{k-2,i} = I2.l_{k-2,i} AND I1.u_{k-2,i} = I2.u_{k-2,i} AND
        I1.AttributeName_{k-1} <> I2.AttributeName_{k-1}
4 GROUP BY AttributeName_1, AttributeName_2, ..., AttributeName_{k-1}
5 ORDER BY cnt_{k,i}
6 HAVING (cnt_{k,i} / (SELECT sum(viewSize)
7                      FROM view_size_table AS I3
8                      WHERE I3.viewType = I1.viewType
                       GROUP BY I3.viewType)) ≥ sup;
end

Fig. 7. Algorithm for Associating Patterns

4 The Union Operation for Pattern Spanning

Before we develop this issue further, let's take an example.


Fig. 8. Pattern Association (the associated patterns F_{2,0}: viewType, Attribute_name1, l_1, u_1, Attribute_name2, l_2, u_2, cnt)

Fig. 9. Pattern Spanning (the associated patterns F_{2,1})

Fig. 10. Pattern Association (the associated patterns F_{3,3})

For example, let F_{2,1} be {(Ca ∈ [4.5,4.5], Na ∈ [0.1,0.17]), (Ca ∈ [4.5,4.5], Na ∈ [0.1,0.25]), (Ca ∈ [4.5,4.7], K ∈ [1.0,1.7]), (Ca ∈ [4.5,4.7], K ∈ [1.7,2.1])}.


Notice that F_{k,i} denotes k associated patterns, each of which describes a value range over i different values of an attribute. If we apply the union operation to the first and second patterns together, and to the third and fourth patterns together in F_{2,1}, then F_{2,2} is obtained:

{(Ca ∈ [4.5,4.5], Na ∈ [0.1,0.25]), (Ca ∈ [4.5,4.7], K ∈ [1.0,2.1])}.

Previously, data mining techniques were applied to numerical data [SA96], and clustering techniques were further developed to deal with this problem [EKSX96,Fuk90]. Lent et al. also developed a clustering technique for similar association rules [LSW97]. Compared with these previous approaches, our approach is to express an SQL statement that handles the same problem.


Algorithm: Spanning Patterns
Input: patterns F_{k,i-1} (k ≥ 1, i ≥ 0) and the minimal confidence conf
Output: range patterns F_{k,i}
Method:
Let the schema of F_{k,i-1} be (viewType, AttributeName_1, l_{1,i-1}, u_{1,i-1}, AttributeName_2, l_{2,i-1}, u_{2,i-1}, ..., AttributeName_k, l_{k,i-1}, u_{k,i-1}, cnt_{k,i-1}). Notice that if k = 1 and l_k = u_k, then it is a point pattern. The schema describes k attributes that are associated and clustered with i − 1 point patterns. Let d be the maximum gap allowed between two numeric patterns; let conf be the given minimal confidence threshold.
begin
1 INSERT INTO F_{k,i}
  SELECT viewType, I1.AttributeName_1, I1.l_{1,i-1} AS l_{1,i}, I1.u_{1,i-1} AS u_{1,i},
         I1.AttributeName_2, I1.l_{2,i-1} AS l_{2,i}, I1.u_{2,i-1} AS u_{2,i}, ...,
         I1.AttributeName_k, I1.l_{k,i-1} AS l_{k,i}, I2.u_{k,i-1} AS u_{k,i},
2        I1.cnt_{k,i-1} + I2.cnt_{k,i-1} AS cnt_{k,i}
3 FROM F_{k,i-1} AS I1, F_{k,i-1} AS I2
4 WHERE I1.viewType = I2.viewType
      AND I1.AttributeName_k = I2.AttributeName_k
      AND (I1.u_{k,i-1} - I2.l_{k,i-1}) ≤ d
5 GROUP BY AttributeName_1, AttributeName_2, ..., AttributeName_k
6 ORDER BY cnt_{k,i}
7 HAVING (cnt_{k,i} / (SELECT count(*)
8                      FROM F_{k,i-1} AS I3
9                      WHERE I3.AttributeName_k = I1.AttributeName_k
                         AND I3.l_{k,i-1} ≥ I1.l_{k,i-1} AND I3.u_{k,i-1} ≤ I2.u_{k,i-1})) ≥ conf;
end

Fig. 11. Algorithm for Spanning Patterns

We use the union operation as shown in Figure 11. Line 2 unions two tuples if the gap between the two patterns is below the given threshold d, as tested in line 4. The algorithm also takes the confidence factor into account, as shown in lines 7 through 9. Notice that the input instances are sorted, as in line 6. Based on these sorted instances, there are three possible spanning methods.

• Spanning with an adjacent neighbor. The algorithm shown in Figure 11 spans the given pattern with one neighboring pattern at a time (a small sketch of this merging step follows this list). Consider the two patterns (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k', u_k']) and (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k'', u_k'']). For a given threshold d on the gap between the two values u_k' and l_k'', if the difference between those two values is less than d, then the two patterns are spanned to (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k', u_k'']). Although d may be given by users, it can be chosen automatically if the following is satisfied:

(MAX(dom(A_k)) − MIN(dom(A_k))) / |dom(A_k)| < d < MAX(dom(A_k)) − MIN(dom(A_k))

This spanning works well when two different views are considered together, as in Table 1. Suppose F_{2,2} is

{(Ca ∈ [4.5,4.7], Na ∈ [0.1,0.17])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.17,0.20])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.19,0.25])_NRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.26,0.27])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.27,0.28])_PRQV}.

Then F_{2,3} is obtained by taking one pattern at a time into the union operation:

{(Ca ∈ [4.5,4.7], Na ∈ [0.1,0.20])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.19,0.25])_NRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.26,0.28])_PRQV}.

• Spanning with n neighbors. If the given patterns are ordered and are not mixed with patterns extracted from different views, a given pattern may be spanned with n neighboring patterns at once. Consider n patterns ordered on attribute A_k: (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k', u_k']), (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k'', u_k'']), ..., (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k(n), u_k(n)]). Those patterns can be spanned to (A_1 ∈ [l_1,u_1], A_2 ∈ [l_2,u_2], ..., A_k ∈ [l_k', u_k(n)]) if the following conditions are satisfied:

1. user-given gap threshold d: |l_k'' − u_k'| < d, |l_k(3) − u_k''| < d, ..., and |l_k(n) − u_k(n−1)| < d;

2. user-given threshold η on the number of patterns: n ≤ η (≤ |dom(A_k)|),

where η, the number of patterns to span, is given by the user within the maximum size of the attribute domain. Suppose F_{2,2} is

{(Ca ∈ [4.5,4.7], Na ∈ [0.1,0.17])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.17,0.20])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.20,0.25])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.26,0.27])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.27,0.28])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.28,0.29])_NRQV}.

Then F_{2,3} is obtained by taking five patterns at a time into the union operation:

{(Ca ∈ [4.5,4.7], Na ∈ [0.1,0.28])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.28,0.29])_NRQV}.

This is possible because the five patterns above are all extracted from the same view.


• Parallel spanning. The two methods above apply the union operation with respect to only one attribute at a time. However, the union operation may be applied to more than one attribute. For example, suppose F_{2,2} is

{(Ca ∈ [4.5,4.7], Na ∈ [0.1,0.17])_PRQV, (Ca ∈ [4.7,4.9], Na ∈ [0.17,0.25])_PRQV, (Ca ∈ [4.5,4.7], Na ∈ [0.17,0.20])_PRQV}.

Then F_{2,3} is obtained by applying the union operation to the two attributes Ca and Na: {(Ca ∈ [4.5,4.9], Na ∈ [0.1,0.25])_PRQV}.
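The following is a minimal Python sketch of the adjacent-neighbor spanning mentioned in the first item above; it assumes range patterns from a single view, already sorted on the spanned attribute, and a hypothetical gap threshold d.

def span_adjacent(patterns, d):
    """patterns: list of (l, u) intervals over one attribute, sorted by l."""
    spanned = [patterns[0]]
    for l, u in patterns[1:]:
        prev_l, prev_u = spanned[-1]
        if l - prev_u <= d:                 # gap small enough: union the two patterns
            spanned[-1] = (prev_l, max(prev_u, u))
        else:                               # gap too large: start a new range pattern
            spanned.append((l, u))
    return spanned

na_patterns = [(0.10, 0.17), (0.17, 0.20), (0.26, 0.27), (0.27, 0.28)]
print(span_adjacent(na_patterns, d=0.02))   # [(0.10, 0.20), (0.26, 0.28)]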

5 Interaction between Association and Spanning Operations

Consider the following question: does F_{i,j} → F_{i+1,j} → F_{i+1,j+1} produce the same patterns as F_{i,j} → F_{i,j+1} → F_{i+1,j+1}? The answer is "no" if parallel spanning (mentioned in the previous section) is not allowed. In other words, the spanning and association operations are not commutative. For example, consider F_{1,0} =

{(Ca ∈ [4.5,4.5]), (Ca ∈ [4.6,4.6]), (Na ∈ [0.1,0.1]), (Na ∈ [0.25,0.25])}.

If we follow the order of an association operation followed by a spanning operation, F_{1,0} → F_{2,0} → F_{2,1}, we end up with F_{2,1} =

{(Ca ∈ [4.5,4.5], Na ∈ [0.1,0.25]), (Ca ∈ [4.6,4.6], Na ∈ [0.1,0.25])}.

On the other hand, if we follow the order of a spanning operation followed by an association operation, that is, F_{1,0} → F_{1,1} → F_{2,1}, we end up with F_{2,1} =

{(Ca ∈ [4.5,4.6], Na ∈ [0.1,0.25])}.

Depending upon the order of the two operations, we may obtain different patterns. To deal with this difference, we propose user-interactive, incremental mining. As shown in Figure 12, the user may interactively choose one operation at a time. In this way, patterns are associated and spanned interactively and incrementally. One advantage is that the user receives feedback on how the association and spanning process is progressing.

Fig. 12. User-Interactive Incremental Data Mining

6 Conclusion

This paper presents a high-level view of a new data mining technique for large-scale multidatabases. The framework uses a user query as the goal that drives a data mining session. Four properties of rules are considered: rules may be interesting because they surprise the user, related to the goal, supportive of the user hypothesis, and of high confidence with respect to the user goal. One of the contributions is to find rules that satisfy all four of these properties.

The data mining framework posits five query views which contribute different types of rules; the views are the positively-related query view (PRQV), the positively-indirectly-related query view (PIRQV), the negatively-related query view (NRQV), the negatively-indirectly-related query view (NIRQV), and the unrelated query view (URQV). Combinations (join, union, or outer-join) of these views can be meaningful, and one major insight is that the union of PRQV and NRQV, when mined for patterns, results in goal-directly-related and interesting rules.

The construction of these views, for the purpose of data mining, can be done using standard SQL. Moreover, the join and union operations on relational databases can be used to associate patterns and to enlarge the ranges of attribute patterns. The techniques presented here use SQL-like commands, have favorable computational complexity, and allow us to discover rules that are interesting, relevant, supportive of a user query, and of a high degree of confidence.

References

[AAB+96] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant. The Quest data mining system. In Proc. of Int'l Conf. on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, 1996.

[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Sushil Jajodia, editor, Proc. of the ACM SIGMOD Conf. on Management of Data, pages 207-216, Washington D.C., 1993.

[BMUT97] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules. In Peckman, editor, Proc. of the ACM SIGMOD Conf. on Management of Data, pages 255-264, Tucson, Arizona, 1997.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of Int'l Conf. on Knowledge Discovery in Databases and Data Mining, 1996.

[Fag90] R. Fagin. Finite model theory - a personal perspective. In S. Abiteboul and P. Kanellakis, editors, Proc. Int'l Conf. on Database Theory, 1990.

[Fuk90] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1990.

[LSW97] B. Lent, A. Swami, and J. Widom. Clustering association rules. In Intl. Conf. on Data Engineering, 1997.

[PT98] B. Padmanabhan and A. Tuzhilin. A belief-driven method for discovering unexpected patterns. In Proc. of Int'l Conf. on Knowledge Discovery in Databases and Data Mining, pages 94-100, 1998.

[SA96] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. of the ACM SIGMOD Conf. on Management of Data, 1996.

[SG93] P. Smyth and R. M. Goodman. An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301-316, 1993.

[ST96] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-974, 1996.

[STA98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In Proc. of the ACM SIGMOD Conf. on Management of Data, 1998.

[WTL98] K. Wang, S.H.W. Tay, and B. Liu. Interestingness-based interval merger for numeric association rules. In Data Mining and Knowledge Discovery, pages 121-127, 1998.


Part 3

Data Mining


An Interactive Visualization System for Mining Association Rules

Jianchao Han1, Nick Cercone1, and Xiaohua Hu2

1 Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

2 Knowledge Stream Partner, 148 State St., Boston MA 02109, USA

Abstract. We introduce an interactive visualization system, AViz, which discovers 3D numerical association rules from large data sets. The process of discovering association rules is visualized and consists of six steps: preparing the raw data set, visualizing the original data set, cleaning the data, discretizing numerical attributes, visualizing the discretization, and mining and visualizing the discovered association rules. The architecture of the AViz system is presented and each step is discussed. To discretize numerical attributes, three approaches, including equi-sized, bin-packing based equi-depth, and interaction-based approaches, are implemented and compared. The algorithm for mining and visualizing numerical association rules is proposed. Our experimental result on a census data set shows that the AViz system is useful and helpful for discovering and visualizing numerical association rules.

1 Introduction

Knowledge discovery is still largely an evolving field. The purpose of knowledge discovery is to search for knowledge represented as relationships and patterns and to employ this knowledge to solve problems or interpret phenomena. Although knowledge discovery comes with a great variety of terminology and methodology, it is commonly understood as the process of non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data [8]. This process consists of data selection, model selection, data cleaning and preprocessing, data transformation, data mining, pattern interpretation and evaluation, etc.

Since the data set on which the knowledge discovery process works is usually huge, it is difficult to grasp the patterns or rules behind the raw data. Recent research on data visualization techniques encompasses the effective portrayal of data with the additional goal of providing insights about the data [11] [15] [16]. Data visualization is a process of transforming the generated abstract data into a meaningful visual form so that users can understand the data more easily. Effective visualization makes a data mining system's nature apparent at a glance.

Discovering association rules from large data sets is actively pursued [1] [2] [3] [4] [5] [6] [9] [10] [12] [13] [17] [18] [22] [24] [25]. An association rule is an implication of the form A ⇒ B, where A and B are subsets of the regarded attributes,


and A ∩ B = ∅. The rule A ⇒ B holds in database D with confidence c if c% of the tuples in D that satisfy A also satisfy B, and with support s if s% of the tuples in D satisfy A ∪ B.

Piatetsky-Shapiro [20] described a fast algorithm, KID3, for parallel discovery of all simple exact rules in data and proposed an approach to discretizing numerical attributes. Agrawal et al. discussed a special case of mining association rules from market sale transactions [1], which is now the well-known itemset association. Discovery of classical itemset associations focuses on Boolean items, which take into account the presence or absence of values in a transaction. Cai, Cercone and Han [7], Han and Fu [12], as well as Srikant and Agrawal [24] considered taxonomy items and presented mining algorithms based on a concept hierarchy. Srikant and Agrawal [23] extended this kind of itemset analysis to mining quantitative association rules, where a quantitative item (attribute) can take numeric or categorical values. To circumvent the problem, many researchers have proposed various algorithms, which can be classified into three categories: Boolean association rules, numerical association rules, and hierarchical association rules. In this paper we restrict ourselves to the discovery of a special form of numerical association rules.

Consider association rules of the form A ⇒ B, where A consists of two different numerical attributes and B is a quantitative attribute (a numerical or nominal attribute). Suppose X and Y are two such different numerical attributes and Z is a quantitative attribute. Our goal is to find an interesting region in the X × Y plane for each discretized interval [z_1, z_2] of Z, as shown below:

X ∈ [x_1, x_2], Y ∈ [y_1, y_2] ⇒ Z ∈ [z_1, z_2]

If Z is a nominal attribute, then each discretized interval corresponds to one of its distinct values.

For example, assume X is age, Y is salary, and Z is position. Further assume that age has domain [0, 100], salary has domain [10k, 1000k], and position takes values of programmer, analyst, project leader, and manager. Then the association rules to be discovered might be

age ∈ [20,25], salary ∈ [42k, 48k] ⇒ position = programmer
age ∈ [26,35], salary ∈ [50k, 65k] ⇒ position = analyst
age ∈ [30,40], salary ∈ [60k, 75k] ⇒ position = project_leader
age ∈ [35,50], salary ∈ [65k, 80k] ⇒ position = manager

In order to discover and visualize association rules like those shown above, the AViz system emphasizes three major problems: data reduction, discretization of continuous attributes, and the mining algorithm and visualization scheme.

Data reduction

Generally, business data sets contain millions or billions of data items. To discover associations within a huge data set, the original data must be read from disk or input from other systems and stored in memory. Even if the data set is scanned only once from disk, it is expensive in time, so we must reduce the number of disk scans as much as possible. There are two main approaches to reducing the size of data sets: vertical and horizontal. Vertical reduction reduces the number of tuples in the data set; one such approach is sampling, which has been discussed by many researchers [20] [22] [25]. Horizontal reduction, also called attribute reduction, selects only the important attributes as key features according to some importance measure, such as an approach based on rough sets [11].

In the AViz system, the full data set is read twice: the first read is to visualize the raw data and clean the data, and the second read is to discover and visualize the association rules. The full-pass read overcomes the drawback of the sampling approach, especially when the data are not evenly distributed. To avoid storing the full data set in memory, the AViz system uses an aggregate array. Each array element represents an interval along an axis or a square on the X × Y plane and characterizes the subset of the full data set that falls in the corresponding interval or square. Thus the storage depends on the number of discretized intervals (buckets) of X, Y, and Z, regardless of the size of the full data set.
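As a rough sketch of this aggregate-array idea (with hypothetical bucket counts, attribute ranges, and an equi-sized bucket mapping):

Nx, Ny, Nz = 20, 20, 4
x_range, y_range, z_range = (0.0, 100.0), (10.0, 1000.0), (0.0, 8.0)

# One counter per (X bucket, Y bucket, Z bucket); memory depends only on Nx * Ny * Nz.
counts = [[[0] * Nz for _ in range(Ny)] for _ in range(Nx)]

def bucket(value, lo, hi, n):
    """Map a value in [lo, hi] to an equi-sized bucket index in 0..n-1."""
    return min(int((value - lo) / (hi - lo) * n), n - 1)

def add_tuple(x, y, z):
    ix = bucket(x, x_range[0], x_range[1], Nx)
    iy = bucket(y, y_range[0], y_range[1], Ny)
    iz = bucket(z, z_range[0], z_range[1], Nz)
    counts[ix][iy][iz] += 1

for x, y, z in [(25.0, 45.0, 2.0), (26.0, 48.0, 2.5), (70.0, 800.0, 7.0)]:
    add_tuple(x, y, z)
print(counts[bucket(25.0, 0.0, 100.0, Nx)][bucket(45.0, 10.0, 1000.0, Ny)])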

Discretization of numerical attributes

Another problem is to discretize each individual numerical attribute into interesting intervals, and the associated numerical attributes into regions or hyper-regions. Each interval is represented as a Boolean attribute. A tuple is said to satisfy this Boolean attribute if the value of the corresponding numerical attribute falls into this interval.

There are many approaches to this problem, such as Piatetsky-Shapiro's KID3 [20], Quinlan's C4.5 [21], and Miller and Yang's approach [18]. The approaches that have been presented could be divided into three categories: equi-sized partitioning, equi-depth partitioning, and distance-based partitioning.

The equi-sized approach is to simply partition the continuous domain into intervals with equal length. The SONAR system employs this approach [10].

The equi-depth approach is used in the algorithm KID3 [20], which basically partitions the data values into intervals of equal size along the ordering of the data. For a given bucket width W, the sorted values are initially divided into approximately equal buckets, each of which has low and high bounds and a count of the values between the bounds. If a bucket's last value is the same as the first value of the next bucket, then that first value is reassigned to the previous bucket, until different buckets have different values. If a bucket's first value is less than its last value and the count of the last value is W or more, then the bucket is split in two, one part containing all occurrences of the last value and the other containing all other values.

Another equi-depth approach proposed by Srikant and Agrawal [23] is based on the measure of the partial completeness over itemsets which handles the amount of information lost by partitioning. The intuition behind this measure is that the information lost due to partitioning should be as small as possible. The information lost is measured by the distance between the set of rules obtained from the raw data and the set of rules obtained from the partitioning.

The distance-based approach proposed by Miller and Yang [18] consists of two phases, identifying clusters and combining clusters to form rules. This approach addresses the following problems: the measure of the quality of an interval, which reflects the distance between the data points within the interval (intra-distance) and the distance between the data points in adjacent intervals (inter-distance); the definition of an association rule, which ensures that items in the antecedent will be close to satisfying the consequent; and the measure of rule interest, which should reflect the distance between the data points.

The AViz system provides three approaches to discretization: equi-sized, bin-packing based equi-depth, and interaction-based. The bin-packing based equi-depth approach is based on bin-packing and also employs equi-depth partitioning on the basis of the packed bins. The equi-sized and bin-packing based equi-depth approaches require the user to specify the number of intervals for both numerical attributes. The interaction-based approach is based on the equi-sized or bin-packing based equi-depth approach, in which the raw data is presented to the user in some visual form. After the numerical attributes are discretized and visualized, the user can intuitively adjust the partition by observing the distribution of the data.
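For illustration, the following Python sketch shows simplified versions of the first two families of approaches (plain equi-sized and equi-depth partitioning); it is not the exact KID3 or bin-packing based procedure used in AViz, and the sample values are hypothetical.

def equi_sized(lo, hi, n):
    """Cut the domain [lo, hi] into n intervals of equal length."""
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

def equi_depth(values, n):
    """Put (approximately) the same number of sorted values into each interval."""
    values = sorted(values)
    size = len(values) // n
    bounds = [values[i * size] for i in range(n)] + [values[-1]]
    return list(zip(bounds[:-1], bounds[1:]))

ages = [21, 22, 23, 25, 30, 31, 35, 44, 52, 60, 61, 63]
print(equi_sized(0, 100, 4))   # four intervals of width 25
print(equi_depth(ages, 4))     # four intervals holding three values each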

Mining and visualizing association rules

The third problem is the visualization of the (raw) data set and the discovered association rules. Fukuda et al. proposed the SONAR system, which discovers association rules from two-dimensional data by visualizing the original data and finding an optimized rectangular or admissible region [9] [10]. Han and Cercone [11] implemented the DVIZ system for visualizing various kinds of knowledge. Keim and Kriegel compare different techniques for visualizing information [15], and Kennedy et al. present a framework for information visualization [16].

The visualization in AViz consists of three aspects: raw data visualization, discretized interval visualization, and association rule visualization. Since we only consider three numerical attributes, two in the antecedent and one in the consequent, each data tuple is mapped into the 3D space and further projected as a point onto the numerical plane formed by the two antecedent attributes. The points in the numerical plane may have different distributions. The user can directly observe the distribution and choose the interesting area to clean the data. The cleaned data set can then be discretized using one of the three discretization approaches provided in AViz. The discretized numerical attributes are visualized in a scheme extended from SONAR [10] and DVIZ [11]. Based on the discretized intervals of the numerical attributes, the algorithm for discovering association rules is executed to find and render the optimized area.

This paper is organized as follows. The architecture of the AViz system is proposed in Section 2. The data preparation and reduction are introduced in Section 3. The three approaches to discretizing numerical attributes provided in AViz are discussed and compared in Section 4. The paradigm and algorithm for discovering and visualizing numerical association rules are described and analyzed in Section 5. The implementation of the AViz system and the experiment on census data are presented in Section 6. Finally, Section 7 contains concluding remarks.

2 The AViz system

AViz is an interactive visualization system for knowledge discovery. It employs visualization techniques to clean and preprocess the data and to interpret the discovered patterns, and it exploits numerical-attribute discretization approaches and a mining algorithm to discover numerical association rules according to the requirements (support threshold and confidence threshold) specified by the user.

The AViz system consists of the following steps to discover numerical association rules, shown in Fig. 1. In this section, we briefly introduce each component. In the following sections, the three main aspects, data reduction, discretization of continuous attributes, and mining and visualizing association rules, will be discussed in detail.

Data preparation

Data preparation is at the heart of the process of knowledge discovery, which consists of choosing the data set, attributes to be visualized and discovered, and the number of discretization intervals for each numerical attribute.

In the AViz system, data preparation is used to specify the original data file and attribute file with a specific format, and the numerical attributes X and Y and the quantitative attribute Z, which constitute the antecedent and the consequent of the association rules to be discovered, respectively. This specification is interactively given by the user and implemented by using file dialog and choice windows. The data set prepared for discovering association rules is a list of tuples consisting of three fields <x, y, z>.

Fig. 1. The AViz System

Raw data visualization

The second step, visualizing the raw data, is to read the data tuples from disk and transfer each tuple into a point in the drawable window according to the specified data file and attributes. Since we only consider two numerical attributes as the antecedent of the association rules and one quantitative attribute as the consequent, we project points in the 3D space X × Y × Z onto the X × Y plane. Thus it is easy to observe how the data are distributed in the space, and the joint distribution of the antecedent is visualized so that the interesting area is readily observed. Generally, on the X × Y plane, the denser the points in a region, the more support the region has. The support of a region is important for discovering association rules because each rule must have support not less than the support threshold.

Data cleaning

Data is not always clean, which means that there may exist uninteresting data items or data deviations. To grasp the nature of the interesting data, the raw data set should be cleaned. In the AViz system, two approaches to cleaning the data are provided. To reduce the data set horizontally, we restrict the system to only three attributes; this is done manually in the data preparation step.

To reduce the data set vertically, the user can, by observing the data distribution, interactively specify the data area of interest and remove the data outside that area. The interesting region on the X × Y plane is picked with a rubber band. This region usually contains dense points and has high support. The points outside the region are removed. Therefore, the size of the data set used to discover association rules is reduced. This task is accomplished in the third step, data cleaning. The result of the reduction is redisplayed in the screen window so that it can be reduced further. This step can be repeated until the user is satisfied with the final result.

Discretizing numerical attributes

After the raw data is reduced, the specified interesting data area is scaled and drawn on the whole window. To discretize the numerical attributes into disjoint intervals (buckets), the AViz system provides three approaches in its Discretization Approaches model library, shown in Fig. 1, including the equi-sized, bin-packing based equi-depth, and interaction-based approaches. The user can choose one of the three approaches to jointly discretize the two numerical attributes of the antecedent. The first two approaches require the user to specify the number of disjoint intervals for each continuous attribute, and the last one is based on the first two approaches and allows the user to interactively adjust the discretization.

The discretization is performed for each continuous attribute. If the consequent attribute is a nominal attribute, its distinct values are treated as disjoint buckets.

Visualizing discretization of numerical attributes

To visualize the discretization of the continuous attributes, three sets of buckets are created, one for each attribute. A collection of squares is thus obtained, each square consisting of two intervals, one from each antecedent numerical attribute. These collections of buckets and squares are stored in the Discretized Data Set in Fig. 1.

After the attributes are discretized, the original data in the interesting area is read from disk a second time, and the mapped points are redrawn on the screen. While reading the interesting data, the support and hit of each square are counted with respect to the buckets of Z; these counts are used to compute the RGB color of the square. The support of a square is the number of points that fall in it, and the hit of a square with respect to a bucket of attribute Z is the number of points that fall in this square and whose value falls in this bucket of Z. For each square, the sum of all its hits is equal to its support. Visualizing the discretized attributes amounts to rendering all squares for all buckets of Z based on the support and hit. In the graphics, each bucket of the attribute Z corresponds to a plane parallel to the X × Y plane.
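To make the counting step concrete, the following is a minimal Java sketch of the bookkeeping just described. It is illustrative only: the class and method names, the array layout, and the bucket-mapping functions are our assumptions, not the actual AViz code.

import java.util.function.DoubleToIntFunction;

// Counters for the discretized X x Y plane: support[i][j] is the number of
// points falling in square (i, j), and hit[k][i][j] is the number of those
// points whose z value falls in the k-th bucket of Z.
class SquareCounts {
    final int[][] support;
    final int[][][] hit;

    SquareCounts(int nx, int ny, int nz) {
        support = new int[nx][ny];
        hit = new int[nz][nx][ny];
    }

    // xBucket, yBucket and zBucket map a raw attribute value to its bucket index.
    void add(double x, double y, double z,
             DoubleToIntFunction xBucket, DoubleToIntFunction yBucket,
             DoubleToIntFunction zBucket) {
        int i = xBucket.applyAsInt(x);
        int j = yBucket.applyAsInt(y);
        int k = zBucket.applyAsInt(z);
        support[i][j]++;   // support of the square
        hit[k][i][j]++;    // hit of the square with respect to bucket k of Z
    }
}

For each square, the sum of hit[k][i][j] over all k equals support[i][j], matching the invariant stated above.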

Assume that attribute X is partitioned into N_x buckets and Y into N_y buckets; then the total number of squares is N_x · N_y. Usually, N_x and N_y are between 20 and 300 in practice. Hence the data set is mapped into N_x · N_y squares, no matter how large the data set is.

Discovering and visualizing the association rules

Finally, the algorithm for discovering the association rules is executed to find the optimal region for each plane in terms of the user-specified support and confidence thresholds, which are set by moving threshold sliders. Each optimized region is an association rule corresponding to one bucket of the quantitative attribute Z. The connected squares that form the rules, on the planes corresponding to the buckets of Z in the 3D space X × Y × Z, are highlighted with a deep color, and the other squares are rendered with a light color.

3 Data preparation and reduction

In the AViz system, the data preparation and reduction consist of the first three steps, data preparation, raw data visualization, and data reduction. In this section, we discuss each of them in more detail.

To prepare the original data, three kinds of files are required. The data file provides the AViz system with the raw data set, the attribute file describes the characteristics of each attribute contained in the data file, and the nominal value description file maps the discrete values of each nominal attribute to integer values. All files are in plain text format.

The attribute file contains the descriptions of all attributes, each attribute being described in one line in the following format:

rank, name, type, start, length, max, min

where rank gives the index of an attribute, name the attribute name, type the attribute type (numeric or nominal), start and length the start position and value length of the attribute in the data file, and max and min the domain of the attribute.

For example, Table 1 shows an attribute file which characterizes the attributes Age, Race, Sex, Class-of-worker, Total-person-income, Total-taxable-income, Weeks-worked-in-year, and Hours-worked-weekly.

The data file contains a list of data tuples, each tuple consisting of a list of fixed-length fields. The format of a tuple is as follows:

<field_1><field_2> ... <field_n>

where <field_i> is the value of the i-th attribute, and n is the number of attributes.

The nominal value description file creates mappings from the discrete values of each nominal attribute to integer values, counted starting from 0. Thus only integer values are used in the data file, which saves storage space. All nominal attributes are included in this file, and each attribute's description is enclosed between the attribute name and the word End.


Table 1. Attributes File Format

1, Age, numeric, 6, 2, 99, 0
2, Class-of-worker, nominal, 8, 1, 8, 0
3, Race, nominal, 9, 1, 5, 1
4, Sex, nominal, 10, 1, 2, 1
5, Total-person-income, numeric, 58, 12, 1000000, 0
6, Total-taxable-income, numeric, 70, 12, 1000000, 0
7, Weeks-worked-in-year, numeric, 82, 4, 52, 0
8, Hours-worked-weekly, numeric, 86, 4, 80, 0


For instance, Table 2 shows a piece of a nominal value description file which maps attribute Class-of-worker to the integers 0 through 8. Thus the integer values corresponding to the nominal values of attribute Class-of-worker occupy only one character; as shown in Table 1, their position in the tuples of the data file is 8, with length 1.

Table 2. Nominal Values Mapping to Integer Values

Class-of-worker
0, Not-in-universe
1, Private
2, Federal-government
3, State-government
4, Local-government
5, Self-employed-incorporated
6, Self-employed-not-incorporated
7, Without-pay
8, Never-worked
End

To start with, the AViz system provides file dialog windows to specify the data file, the attribute file and the nominal value description file. Since we restrict ourselves to discovering 3D association rules, three attributes must be specified; this is accomplished using attribute-setting Choice windows, each window being used to choose one attribute and to set its default number of discretization intervals.

To visualize the original data stored in the data file, each tuple is mapped to a point in the draw window. Our purpose is to find association rules of the form

X ∈ [x1, x2], Y ∈ [y1, y2] ⇒ Z ∈ [z1, z2],

so we must first find the potential rectangle [x1, x2] × [y1, y2], which is the interesting area. Since the association rules must have at least the support threshold, all rectangles whose support is less than the threshold need not be considered, which means that these areas can be removed from consideration. To this end, the raw data tuples are read from disk, the triple of values <x, y, z> of the chosen attributes (the antecedent and the consequent) is extracted according to the attributes' positions and lengths in the data file as described in the attribute file, and then each triple is projected to a point in the plane X × Y with respect to the domains of X and Y, represented by their maximum and minimum in the attribute file.
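As a simple illustration of this projection, the following Java sketch maps an attribute value to a pixel coordinate of the draw window using the minimum and maximum recorded in the attribute file; the names are ours, and the linear mapping is an assumption about one straightforward realization.

// Map a value from its [min, max] domain (taken from the attribute file) to a
// pixel coordinate in [0, windowSize - 1]; values outside the domain are clamped.
// Assumes max > min.
class Projection {
    static int toPixel(double value, double min, double max, int windowSize) {
        double clamped = Math.max(min, Math.min(max, value));
        return (int) Math.round((clamped - min) / (max - min) * (windowSize - 1));
    }
}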

After the data tuples are projected onto the X × Y plane and mapped to the draw window, the data distribution can be observed directly. Since the observed distribution is limited by the draw window size, many distinct points in the X × Y plane may overlap when mapped to the window, so the drawing only roughly reflects the real distribution. To better reflect the real distribution, we attempt to reduce the domains of the attributes and restrict attention to the interesting area. All points outside of the interesting area are dropped. What constitutes the interesting area depends on the user's observation and judgment.

To reduce the data set vertically by dropping uninteresting points, the user can interactively specify the interesting data area using a rubber band. This region usually contains dense points and has high support, and the points outside of the region are cleaned.

Once the interesting area is picked up, the AViz system rereads the data from disk and redraws the data tuples in the same way as above. The user can then further choose the data area in which he/she is interested. These two steps, visualizing the raw data and choosing the interesting area, can be repeated until the final data subset used to find association rules is satisfactory.

4 Discretizing numerical attributes

The AViz system provides three approaches to discretizing numerical attributes: equi-sized, bin-packing based equi-depth, and interaction-based.

The equi-sized approach partitions the continuous domain into intervals of equal length. For example, if the domain of attribute age is [0, 99], it can be divided into small intervals of length 10, giving the intervals <age, 0, 9>, <age, 10, 19>, ..., <age, 90, 99>. This approach is simple and easily implemented. Its main drawback is that it may miss many useful rules, since it does not consider the distribution of the data values.

Suppose the domains of the numerical attributes X and Y are [Min_X, Max_X] and [Min_Y, Max_Y], respectively. X × Y forms a Euclidean plane. Each tuple t in the data set can be mapped to a point (t[X], t[Y]) in X × Y. Assume X and Y are discretized into N_x and N_y buckets, respectively. Then the bucket size is, on average, (Max_X − Min_X)/N_x for X and (Max_Y − Min_Y)/N_y for Y. For a region P in X × Y, we say a tuple t meets the condition (X, Y) ∈ P if t is mapped to a point in region P.

The second discretization approach used in AViz is called the bin-packing based equi-depth approach, which differs from existing approaches. The domain of a numerical attribute may contain an infinite number of points. To deal with this problem, KID3 employs an adjustable-buckets method [20], while the approach proposed in [23] is based on the concept of a partial completeness measure. The drawback of these approaches is their time-consuming computation and/or large storage requirements. AViz exploits a simple and direct method, described as follows.

Assume the window size used to visualize the data set is M pixels (width or height), and each pixel corresponds to a bin. Thus we have M bins, denoted B[i], i = 0, ..., M − 1. The raw data tuples are mapped to the bins by the mapping function. Suppose B[i] contains T[i] tuples, and the attribute is to be discretized into N buckets. According to the equi-depth approach, each bucket will contain d = Σ_{i=0}^{M−1} T[i]/N tuples. We first assign B[0], B[1], ... to the first bucket until it contains d or more tuples, and then assign the following bins to the second bucket. This operation is repeated until all buckets contain a roughly equal number of tuples. The process is depicted in Fig. 2.

j = 0;
for (i = 0; i < N; i++) {
    Bucket[i] = 0;
    for (k = j; k < M; k++) {
        Bucket[i] += T[j++];
        if (Bucket[i] >= d)
            break;
    }
}

Fig. 2. Bin-packing based Equi-depth Discretization
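A runnable counterpart of Fig. 2 is sketched below in Java. It returns the bucket boundaries as bin indices rather than only the per-bucket counts; the class and method names are illustrative, and the boundary representation is our assumption.

// Bin-packing based equi-depth discretization: T[0..M-1] are the per-bin
// (per-pixel) tuple counts and N is the desired number of buckets. Whole bins
// are assigned greedily to the current bucket until it holds at least
// d = (sum of T) / N tuples.
class EquiDepthDiscretizer {
    // boundaries[i] is the first bin of bucket i; boundaries[N] == M.
    static int[] bucketBoundaries(int[] T, int N) {
        int M = T.length;
        long total = 0;
        for (int t : T) total += t;
        double d = (double) total / N;
        int[] boundaries = new int[N + 1];
        boundaries[N] = M;
        int bin = 0;
        for (int i = 0; i < N; i++) {
            boundaries[i] = bin;
            long inBucket = 0;
            while (bin < M && inBucket < d) {
                inBucket += T[bin++];   // a bin is never split between buckets
            }
        }
        return boundaries;
    }
}

If the data concentrates on a few bins, the loop may exhaust the bins before all N buckets have been filled, leaving some buckets empty; this is exactly the limitation discussed below.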

The storage requirement of the bin-packing based equi-depth approach is O(M + N), depending only on the number of buckets and the size of the visualization window, regardless of the domain of the attributes and the size of the data set. It does not need to sort the data, and the execution time is linear in the size of the data set. This method, however, may not produce enough buckets, because each bin must be assigned to exactly one bucket and cannot be broken up. For instance, if the data concentrates on a few bins, then the buckets that contain these bins will contain many more tuples than the others. This can happen especially when the visualization window is small.

The third discretization approach that AViz employs is interaction-based. This method consists of two steps. First, the user specifies one of the two approaches above to obtain an initial discretization, and AViz displays the result. In the second step, the user observes the data distribution and the discretization, and then moves discretization lines to wherever he/she thinks appropriate by clicking and dragging the mouse. In this interaction process, the user actively decides the discretization of the numerical attributes. However, since the visualized data has been preprocessed and mapped onto the screen, the user can only obtain a rough idea of the data distribution from the graphics. For a small visualization window, distortion inevitably occurs, which may cause discretization errors.

5 Discovering and visualizing association rules

In order to visualize association rules using geometric techniques, we must pursue an interesting projection of association rules to display. Our basic idea is to find a small region, such as a square, on the display for each association rule and use the size, color hue and intensity of each region to represent the corresponding association rules.

AViz is based on the two-dimensional model for visualizing numerical association rules proposed by Fukuda et al. [10]. Suppose the domains of the numerical attributes X and Y are discretized into N_x and N_y buckets, respectively. These buckets may or may not be equi-sized, depending on the discretization approach. The screen axes are partitioned correspondingly. Thus the X × Y plane is divided into N_x · N_y unit squares. A tuple t in the data set is projected to the unit square containing the point (t[X], t[Y]).

For each bucket of attribute Z, denoted Z_k, consider the unit square G_ij, which is composed of the i-th bucket of X and the j-th bucket of Y. Let u_ij denote the total number of tuples, and v_kij the number of tuples satisfying Z ∈ Z_k, that are projected to region G_ij; that is,

u_ij = Support((t[X], t[Y]) ∈ G_ij),

v_kij = Support((t[X], t[Y]) ∈ G_ij ∧ Z ∈ Z_k).

The confidence of a square G_ij with respect to the bucket Z_k can then be easily calculated as

Confidence(G_ij)_k = v_kij / u_ij.



Thus, on the plane Z ∈ Z_k, G_ij is rendered with the color RGB = (v_kij, u_ij − v_kij, 0). The red component v_kij represents the support of the rule, while the green component reflects the confidence. So the redder the square, the higher its support, and the brighter the square, the higher its confidence.
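A small Java sketch of this coloring rule is given below. The chapter does not specify how the raw counts are scaled to displayable intensities, so the normalization by the largest square support is our assumption.

import java.awt.Color;

// Render square G_ij on the plane Z in Z_k: red encodes the hit v_kij and green
// the remainder u_ij - v_kij, so total brightness grows with support and the
// red share of the brightness equals the confidence.
class SquareColor {
    static Color colorOf(int uij, int vkij, int maxSquareSupport) {
        if (maxSquareSupport == 0) return new Color(0, 0, 0);
        int red = (int) Math.round(255.0 * vkij / maxSquareSupport);
        int green = (int) Math.round(255.0 * (uij - vkij) / maxSquareSupport);
        return new Color(Math.min(red, 255), Math.min(green, 255), 0);
    }
}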

The concepts of confidence and support for a square can be extended to any form of region on the plane. The support and confidence of a region P are defined as follows:

Support(P) = Σ_{G_ij ∈ P} u_ij,

Hit(P)_k = Σ_{G_ij ∈ P} v_kij,

Confidence(P)_k = Hit(P)_k / Support(P).

A region is said to be ample if its support is greater than or equal to the support threshold. A region is said to be confident if its confidence is greater than or equal to the confidence threshold.

The algorithm for discovering numerical association rules by visualization is discussed in [9,10]. To start with, we define the concept of gain. For a given confidence threshold θ, the gain of a square G_ij on the plane Z ∈ Z_k is defined as

gain(G_ij)_k = v_kij − θ × u_ij.

Obviously, when the confidence of G_ij equals θ, gain(G_ij)_k = 0 with respect to Z ∈ Z_k. The gain reflects the confidence: when the confidence is greater than the threshold, the gain is positive, and when the confidence is less than the threshold, the gain is negative.

The gain of a region P with respect to Z ∈ Z_k is defined as

gain(P)_k = Σ_{G_ij ∈ P} gain(G_ij)_k.

It is easy to show that the following problems are all NP-hard [10]:

• find the optimized gain region, the ample and confident region with the maximum gain

• find the optimized support region, the confident region with the maximum support

• find the optimized confident region, the ample region with the maximum confidence

To circumvent the NP-hard problem, we find the optimized region with a specific form. In particular, we consider rectangular regions.

We design a dynamic programming algorithm, depicted in Fig. 3, to find the optimized gain rectangle in O(N_z · N_x · N_y · min{N_x, N_y}) time, where N_z is the number of buckets of attribute Z. The basic idea behind this algorithm is, for each bucket z of Z, to choose a pair of rows, say the i-th and j-th rows, i ≤ j, and consider the rectangles G([i, j], m)_z on the plane Z ∈ z, each consisting of the squares from the i-th row to the j-th row in the m-th column, for m = 1, 2, ..., N_x. Then compute the gain of each G([i, j], m)_z, namely X([i, j], m)_z = Σ_{k=i}^{j} gain(G_km)_z. Finally, compute the gain of the rectangular region G([i, j], [r, k])_z, r ≤ k, which consists of rows i to j and columns r to k. The optimized gain rectangle is the rectangle with the highest gain. In Fig. 3, we suppose N_y < N_x.

Algorithm: Optimized gain rectangle
Input: u_ij, the number of tuples mapped to unit square G_ij;
       v_zij, the number of tuples mapped to unit square G_ij with Z ∈ z;
       support threshold δ and confidence threshold θ
Output: OptRect, a set of optimized rectangular regions with maximum gain, one for each bucket z of Z
Variables: gain(G_ij)_z, the gain of square G_ij with respect to Z ∈ z;
           X([i, j], m)_z, the gain of the rectangle consisting of the squares from the i-th row to the j-th row in column m, with respect to Z ∈ z

begin
    OptRect = ∅;
    for each bucket z of Z
        for m = 1, ..., N_x
            for i = 1, ..., N_y
                if u_im ≥ δ then gain(G_im)_z = v_zim − θ × u_im;
                else gain(G_im)_z = −∞;
                X([i, i], m)_z = gain(G_im)_z;
                for j = i + 1, ..., N_y
                    if u_jm ≥ δ then gain(G_jm)_z = v_zjm − θ × u_jm;
                    else gain(G_jm)_z = −∞;
                    X([i, j], m)_z = X([i, j − 1], m)_z + gain(G_jm)_z;
        for i = 1, ..., N_y
            for j = i, ..., N_y
                Max(1)_z = X([i, j], 1)_z;
                MAX(≤ 1)_z = Max(1)_z;
                for m = 1, ..., N_x − 1
                    Max(m + 1)_z = max{0, Max(m)_z} + X([i, j], m + 1)_z;
                    MAX(≤ m + 1)_z = max{Max(m + 1)_z, MAX(≤ m)_z};
        OptRect = OptRect ∪ {MAX(≤ N_x)_z};
    return OptRect;
end

Fig. 3. Dynamic Programming Algorithm for Finding Optimized Gain Rectangle
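For concreteness, the following Java sketch implements the search of Fig. 3 for a single bucket z: for every pair of rows it accumulates the column-strip gains X and runs a one-dimensional maximum-sum scan over the columns. The method and variable names are ours, and the counts u and v are assumed to be given as arrays.

// Find the rectangle (top row, bottom row, left column, right column) with the
// highest gain for one bucket z of Z. u[i][m] and v[i][m] are the support and
// hit of unit square G_im; delta and theta are the support and confidence thresholds.
class OptimizedGainRectangle {
    static int[] find(double[][] u, double[][] v, double delta, double theta) {
        int ny = u.length, nx = u[0].length;
        double[][] gain = new double[ny][nx];
        for (int i = 0; i < ny; i++)
            for (int m = 0; m < nx; m++)
                gain[i][m] = (u[i][m] >= delta) ? v[i][m] - theta * u[i][m]
                                                : Double.NEGATIVE_INFINITY;
        double bestGain = Double.NEGATIVE_INFINITY;
        int[] best = null;
        for (int i = 0; i < ny; i++) {
            double[] X = new double[nx];   // X[m] = gain of rows i..j in column m
            for (int j = i; j < ny; j++) {
                for (int m = 0; m < nx; m++) X[m] += gain[j][m];
                // one-dimensional maximum-sum scan over the columns
                double run = X[0]; int runStart = 0;
                double max = X[0]; int left = 0, right = 0;
                for (int m = 1; m < nx; m++) {
                    if (run < 0) { run = 0; runStart = m; }
                    run += X[m];
                    if (run > max) { max = run; left = runStart; right = m; }
                }
                if (max > bestGain) { bestGain = max; best = new int[]{i, j, left, right}; }
            }
        }
        return best;   // null if no square meets the support threshold
    }
}

Running this for every bucket of Z matches the stated time bound when N_y ≤ N_x, which is the case assumed in Fig. 3.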



6 AViz implementation and experiment

The AViz system has been implemented in JDK 1.2 and Java3D. Data preparation is accomplished by choosing a data file and an attributes file and specifying the attributes to be mined. The data file is formatted as tuples consisting of a series of fixed-length fields. The attributes file characterizes each attribute, including the attribute name, type, length, position in the data file, and domain. This is implemented in dialog windows (under the file menu and the setting menu). The steps of discovering knowledge are controlled by the control menu, which covers the subsequent five steps. Two sliders are used to control the support threshold and the confidence threshold. By moving these sliders in the discovering step, the resulting rules (the focus areas in the planes parallel to the X × Y plane) change quickly.

AViz has been applied to the 1996 U.S. census data to find association rules between attributes. The data set contains about 1.4 million tuples, each tuple consisting of 5 numerical attributes (age, total-person-income, taxable-income-amount, tax-amount, hours-usually-worked-per-week) and 3 nominal attributes (sex, race, class-of-work).

In the following we give an example that traces the process of association rule discovery and visualization.

Step 1: Data preparation. Choose the two numerical attributes X = Taxable-income-amount and Y = Total-person-income, and the nominal attribute Z = Race. The domains of X and Y are [0, 1000K] and [0, 500K], respectively. Z takes the following values: White, Black, Amer-Indian-Aleut-or-Eskimo, Asian-or-Pacific-Islander, and Other. We also specify that X and Y are to be discretized into 20 intervals.

Step 2: Raw data visualization. Map the raw data into the visualization window, as shown in Fig. 4.

Step 3: Data cleaning. From Fig. 4, it is clearly seen that most of the data concentrates on a strip, which is the part interesting to us. The other data can be cleaned. We pick this strip using the rubber band. After cleaning, the remaining data set contains about 1.08 million tuples.

Step 4: Discretizing numerical attributes. We choose the second discretization approach, bin-packing based equi-depth, and then use the interaction-based method, moving discretization lines to adjust the discretization. The result is shown in Fig. 5.

Step 5: Visualizing the discretization. Fig. 6 visualizes the discretization of Taxable-income-amount and Total-person-income for each value of Z: White, Black, Amer-Indian-Aleut-or-Eskimo, Asian-or-Pacific-Islander, and Other. Each Z value corresponds to a plane, and the volume consisting of all planes can be rotated around the Y axis so that all planes can be viewed clearly.

Step 6: Discovering and Visualizing association rules



Fig. 4. The raw census data and the interesting area

To find the association rules, we move the threshold sliders and set the support threshold and the confidence threshold to 0.2% and 20%, respectively. We obtain five rules, each corresponding to a value of Z, which are listed below and shown in Fig. 7.

X ∈ [16.74K, 17.86K], Y ∈ [42.78K, 43.73K] ⇒ Z = White

X ∈ [13.37K, 14.49K], Y ∈ [35.24K, 35.53K] ⇒ Z = Black

X ∈ [9.99K, 11.12K], Y ∈ [32.20K, 32.73K] ⇒ Z = Amer-Indian-Aleut-or-Eskimo

X ∈ [9.99K, 11.12K], Y ∈ [32.20K, 32.73K] ⇒ Z = Asian-or-Pacific-Islander

X ∈ [9.99K, 11.12K], Y ∈ [32.20K, 32.73K] ⇒ Z = Other



Fig. 5. Discretizing the interesting range of the numerical attributes

The result shows that Amer-Indian-Aleut-or-Eskimo, Asian-or-Pacific-Islander and Other have the same optimized area, while White and Black have different ones.

7 Concluding remarks

AViz is an interactive system for visualizing and discovering numerical association rules. We have briefly introduced its system structure, components, process, discretization approaches, and algorithms. AViz uses six steps to discover association rules: data preparation, raw data visualization, data cleaning, numerical attribute discretization, discretization visualization, and association rule discovery and visualization. The basic idea is first to use visualization techniques to limit the domain of the data by interacting with the user, then to mine the data to discover rules, and finally to visualize the resulting knowledge.

Fig. 6. Visualizing the discretization by rotating the planes around the Y axis (screenshot with Support Threshold (1/1000) and Confidence Threshold (1/100) sliders)

We developed three approaches to discretizing continuous attributes: equi-sized, bin-packing based equi-depth, and interaction-based. We also discussed the algorithm for discovering association rules and the scheme for visualizing these rules. In our implementation, we emphasize the interaction between the computer and the human, since we believe that interactive visualization plays a most important role in data mining by guiding the process of discovering knowledge interactively. The experiment has also demonstrated that visualizing a large data set helps users understand the relationships among the data and concentrate on the meaningful data to discover knowledge.

The geometries representing different kinds of knowledge will be developed further to ensure that the graphics have a clear meaning and are more easily understood. The correspondence between the knowledge and these geometries should be direct and straightforward. There is a tension between the straightforwardness and the expressiveness of the geometries, however, so how to make the trade-off between them is still being considered.

Fig. 7. Optimal range representing association rules (screenshot with Support Threshold (1/1000) and Confidence Threshold (1/100) sliders)

We realize that what we have presented is an initial step toward the specification of an integrated data mining and visualization system, and we expect significant further developments to build on these humble beginnings.

The capability of AViz will be expanded to visualize not only the process of discovering association rules but also the dynamic processes of discovering other kinds of knowledge, such as classification rules, sequential rules, and decision tree construction. We are also planning to develop more interactive tools for each component, such as data transformation, data reduction, and so on. Combining visualization and data mining algorithms will produce a much more efficient method of knowledge discovery.



Acknowledgments: The authors are members of the Institute for Robotics and Intelligent Systems (IRIS) and wish to acknowledge the support of the Networks of Centers of Excellence Program of the Government of Canada, the Natural Sciences and Engineering Research Council, and the participation of PRECARN Associates Inc.

References

1. Agrawal, R., Imielinski, T., Swami, A. (1993) Mining association rules between sets of items in large databases, Proc. of the ACM SIGMOD International Conference on Management of Data, 207-216.

2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A. I. (1996) Fast Discovery of Association Rules, in Advances in Knowledge Discovery and Data Mining ed. by U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307-328, AAAI Press, Menlo Park, CA.

3. Agrawal, R., Srikant, R. (1994) Fast algorithm for mining association rules in large databases, Proc. of the 20th International Conference on VLDB, 487-499.

4. Agrawal, R., Shafer, J. C. (1996) Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, 8(6),962-969.

5. Brin, S., Motwani, R., Silverstein, C. (1997) Beyond Market Baskets: Generalizing Association Rules to Correlations, Proc. of the ACM SIGMOD International Conference on Management of Data, 265-276.

6. Chan, K. C. C., Au, W. (1997) Mining Fuzzy Association Rules, Proc. of the 6th International Conf. on Information and Knowledge Management, CIKM'97, 209-215.

7. Cai, Y., Cercone, N., Han, J. (1991) Attribute Oriented Induction in Relational Databases, in Knowledge Discovery in Databases ed. by Gregory Piatetsky-Shapiro and William J. Frawley, 213-228.

8. Cercone, N., Hamilton, H. (1998) Database Mining, Encyclopedia of Electrical and Electronics Engineering 4, 576-604.

9. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T. (1996) Mining Optimized Association Rules for Numeric Attributes, Proc. of the 15th ACM Symposium on Principles of Database Systems, 182-191.

10. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T. (1996) Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization, Proc. of the ACM SIGMOD International Conference on Management of Data, 13-24.

11. Han, J., Cercone, N. (1999) DVIZ: A System for Visualizing Data Mining, Lecture Notes in Artificial Intelligence 1574, Proc. of the 3rd Pacific-Asia Conference on Knowledge Discovery in Databases, 390-399.

12. Han, J., Fu, Y. (1995) Discovery of multiple-level association rules, Proc. of the 21st International Conference on VLDB, 420-431.

13. Houtsma, M., Swami, A. (1995) Set-oriented mining of association rules, Proc. of the International Conference on Data Engineering, 25-34.

14. Hu, X., Cercone, N. (1999) Data Mining via Discretization, Generalization and Rough Set Feature Selection, Knowledge and Information Systems: An International Journal, 1(1), 33-60.

15. Keim, D. A., Kriegel, H. P. (1996) Visualization Techniques for Mining Large Databases: A Comparison, IEEE Transactions on Knowledge and Data Engineering, 8(6), 923-938.



16. Kennedy, J. B., Mitchell, K. J., Barclay, P. J. (1996) A framework for information visualization, SIGMOD Record, 25(4), 30-34.

17. Mannila, H., Toivonen, H., Verkamo, A. I. (1994) Efficient algorithms for discovering association rules, Proc. of the AAAI Workshop on Knowledge Discovery in Databases, 144-155.

18. Miller, R. J., Yang, Y. (1997) Association Rules over Interval Data, Proc. of the ACM SIGMOD International Conference on Management of Data, 452-461.

19. Park, J. S., Chen, M. S., Yu, P. S. (1995) An effective hash based algorithm for mining association rules, Proc. of the ACM SIGMOD International Conference on Management of Data, 175-186.

20. Piatetsky-Shapiro, G. (1991) Discovery, Analysis, and Presentation of Strong Rules, in Knowledge Discovery in Databases ed. by Gregory Piatetsky-Shapiro and William J. Frawley, 229-260.

21. Quinlan, J. R. (1993) C4.5: Programs for Machine Learning, CA: Morgan Kaufmann.

22. Savasere, A., Omiecinski, E., Navathe, S. (1995) An efficient algorithm for mining association rules in large databases, Proc. of the 21st International Conference on VLDB, 432-444.

23. Srikant, R., Agrawal, R. (1996) Mining Quantitative Association Rules in Large Relational Tables, Proc. of the ACM SIGMOD International Conference on Management of Data, 1-12.

24. Srikant, R., Agrawal, R. (1995) Mining generalized association rules, Proc. of the 21st International Conference on VLDB, 407-419.

25. Toivonen, H. (1996) Sampling large databases for finding association rules, Proc. of the 22nd International Conference on VLDB, 134-145.


Algorithms for Mining System Audit Data

Wenke Lee 1, Salvatore J. Stolfo 2, and Kui W. Mok 3 *

1 Department of Computer Science, North Carolina State University, Raleigh, NC 27695

2 Department of Computer Science, Columbia University, New York, NY 10027
3 Morgan Stanley Dean Witter & Co., 750 7th Avenue, New York, NY 10019
* This author's contributions were made while at Columbia University.

Abstract. We describe our research in applying data mining techniques to construct intrusion detection models. The key ideas are to mine system audit data for consistent and useful patterns of program and user behavior, and to use the set of relevant system features presented in the patterns to compute classifiers that can recognize anomalies and known intrusions. Our past experiments showed that classification rules can be used to detect intrusions, provided that sufficient audit data is available for training and the right set of system features is selected. We use the association rules and frequent episodes computed from audit data as the basis for guiding the audit data gathering and feature selection processes. In order to compute only the relevant patterns, we consider the "order of importance" and "reference" relations among the attributes of the data, and modify these two basic algorithms accordingly to use axis attribute(s) and reference attribute(s) as forms of item constraints in the data mining process. We also use an iterative level-wise approximate mining procedure for uncovering the low frequency but important patterns. We report our experiments in using these algorithms on real-world audit data.

1 Introduction

As network-based computer systems play increasingly vital roles in modern society, they have become the target of our enemies and criminals. Therefore, we need to find the best ways possible to protect our systems. Intrusion prevention techniques, such as user authentication, e.g. using passwords or biometrics, are not sufficient because as systems become ever more complex, there are always system design flaws and programming errors that can lead to security holes [4,8]. Intrusion detection is therefore needed as another wall to protect computer systems. There are mainly two types of intrusion detection techniques. Misuse detection, for example STAT [10], uses patterns of well-known attacks or weak spots of the system to match and identify intrusions. Anomaly detection, for example IDES [17], tries to determine whether deviation from the established normal usage patterns can be flagged as intrusions.

Currently many intrusion detection systems are constructed by manual and ad hoc means. In misuse detection systems, intrusion patterns, for example more than three consecutive failed logins, need to be hand-coded using specific modeling languages. In anomaly detection systems, the features or measures on audit data, for example the CPU usage of a program, that constitute the profiles are chosen based on the experience of the system builders. As a result, the effectiveness and adaptability of intrusion detection systems in the face of newly invented attack methods may be limited.

Our research aims to develop a systematic framework to semi-automate the process of building intrusion detection systems. A basic premise is that when audit mechanisms are enabled to record system events, distinct evidence of legitimate and intrusive user and program activities will be manifested in the audit data. For example, in network traffic audit data, connection failures are normally infrequent. However, certain types of intrusions will result in a large number of consecutive failures that may be easily detected. We therefore take a data-centric point of view and consider intrusion detection as a data analysis task. Anomaly detection is about establishing the normal usage patterns from the audit data, whereas misuse detection is about encoding and matching intrusion patterns using the audit data. We are developing a framework, MADAM ID, for Mining Audit Data for Automated Models for Intrusion Detection. MADAM ID consists of classification and meta-classification [5] programs, association rules [1] and frequent episodes [19] programs, and a feature construction system. The end products are concise and intuitive rules that can detect intrusions, and can be easily inspected and edited by security experts when needed.

The rest of the chapter is organized as follows. We first examine the lessons we learned from our past experiments on building classification models for detecting intrusions, namely we need tools for feature selection and audit data gathering. We then propose a framework and discuss how to incorporate domain knowledge into the association rules and frequent episodes algorithms to discover "useful" patterns from audit data. We report our experiments in using the patterns both as a guideline for gathering sufficient training data, and as the basis for feature selection. Finally we outline open problems and our future research plans.

2 The Challenges

In [13] we described in detail our experiments in building classification models to detect intrusions to sendmail and TCP/IP networks. The results on the sendmail system call data showed that we needed to use as much as 80% of the exhaustively gathered normal data to learn a classifier (RIPPER [6] rules) that can clearly identify normal sendmail executions and intrusions. The results on the tcpdump [11] network traffic data showed that, because of the temporal nature of network activities, adding temporal statistical features gave the classification model a very significant improvement in identifying intrusions. These experiments revealed that we need to solve some very challenging problems for the classification models to be effective.



[Figure: raw tcpdump packet data (three interleaved packets) is summarized into connection records with the attributes time, dur, src, dst, bytes, srv, and flag, e.g.:]

time        dur   src  dst  bytes  srv   flag
10:35:39.1  5.2   A    B    42     http  SF
10:35:40.4  20.5  C    B    22     user  REJ
10:35:41.2  10.2  E    F    1036   ftp   SF

Fig. 1. Processing packet-level network audit data into connection records

Formulating the classification tasks, i.e., determining the class labels and the set of features, from audit data is a very difficult and time-consuming task. Since security is usually an afterthought of computer system design, there are no standard auditing mechanisms and data formats specifically for intrusion analysis purposes. A considerable amount of data pre-processing, which involves domain knowledge, is required to convert raw "action" level audit data into higher level "session/event" records with a set of intrinsic system features. Figure 1 shows an example of audit data preprocessing. Binary tcpdump data is first converted into ASCII packet level data, where each line contains the information of one network packet. The data is ordered by the timestamps of the packets, so packets belonging to different connections may be interleaved; for example, the 3 packets shown in the figure are from different connections. The packet data is then processed into connection records with a number of features (i.e., attributes), e.g., time (the starting time of the connection, i.e., the timestamp of its first packet), dur (the duration of the connection), src and dst (the source and destination hosts), bytes (the number of data bytes from source to destination), srv (the service, i.e., the destination port), and flag (how the connection conforms to the network protocols, e.g., SF is normal, REJ is "rejected"). These intrinsic features essentially summarize the packet level information within a connection. There are commonly available programs that can process packet level data into such connection records for network traffic analysis tasks.



However, for intrusion detection, the temporal and statistical characteristics of connections also need to be considered because of the temporal nature of event sequences in network-based computer systems. For example, a large number of "rejected" connections, i.e., flag = REJ, within a short time frame can be a strong indication of intrusions, because normal connections are rejected only rarely.
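To illustrate the kind of temporal-statistical feature mentioned here, the following Java sketch counts, for each connection, how many connections in the preceding time window were rejected. The record fields mirror the attributes listed above; the class names, the window parameter, and the overall design are our illustrative assumptions, not part of MADAM ID.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// A connection record with the intrinsic features described in the text.
record ConnectionRecord(double time, double dur, String src, String dst,
                        int bytes, String srv, String flag) {}

class TemporalFeatures {
    // records must be sorted by timestamp; returns, for each record, the number of
    // rejected (flag = "REJ") connections among those seen in the preceding `window` seconds.
    static int[] rejectedInWindow(List<ConnectionRecord> records, double window) {
        int[] counts = new int[records.size()];
        Deque<ConnectionRecord> recent = new ArrayDeque<>();
        int rejInWindow = 0;
        for (int i = 0; i < records.size(); i++) {
            ConnectionRecord cur = records.get(i);
            // slide the window: evict connections that are now too old
            while (!recent.isEmpty() && cur.time() - recent.peekFirst().time() > window) {
                if (recent.pollFirst().flag().equals("REJ")) rejInWindow--;
            }
            counts[i] = rejInWindow;
            recent.addLast(cur);
            if (cur.flag().equals("REJ")) rejInWindow++;
        }
        return counts;
    }
}

A count computed this way could be added as an extra column of each connection record; a consistently high value is the kind of intrusion evidence alluded to above.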

We therefore need to construct temporal and statistical measures as additional features in the connection records. Traditional feature selection techniques, as discussed in the machine learning literature, cannot be directly applied here since they do not consider the sequential correlation of features across record boundaries. In [7] Fawcett and Provost presented some very interesting ideas on the automatic selection of features for a cellular phone fraud detector. An important assumption in that work is that there are some general patterns of fraudulent usage for the entire customer population, and individual customers differ only in the "threshold" of these patterns. Such assumptions do not hold here, since different intrusions have different targets on the computer system and normally produce different evidence (and in different audit data sources).

A critical requirement for using classification rules as an anomaly detector is that we need "sufficient" training data that covers as much variation of the normal behavior as possible, so that the false positive rate is kept low (i.e., we wish to minimize detected "abnormal normal" behavior). It is not always possible to formulate a classification model that learns the anomaly detector from limited ("insufficient") training data and then incrementally updates the classifier using on-line learning algorithms. This is because the limited training data may not cover all the class labels, and on-line algorithms, for example ITI [24], cannot deal with new data with new (unseen) class labels. For example, in modeling daily network traffic we use the services of the connections, e.g., http, telnet, etc., as the class labels in training models. We may not have connection records of the infrequently used services with, say, only one week's traffic data. A formal audit data gathering process therefore needs to take place first. As we collect audit data, we need an indicator that can tell us whether the new audit data exhibits any "new" normal behavior, so that we can stop the process when there is no more variation. This indicator should be simple to compute and must be incrementally updatable.

3 Mining Audit Data

We attempt to develop general rather than intrusion-specific tools in response to the challenges discussed in the previous section. The idea is to first compute the association rules and frequent episodes from audit data, which capture the intra- and inter-audit-record patterns. These frequent patterns can be regarded as statistical summaries of the system activities captured in the audit data, because they measure the correlations among system features and the sequential (i.e., temporal) co-occurrences of events. Therefore, these patterns can be utilized, with user participation, to guide the audit data gathering and feature selection processes.

In this section we first provide an overview of the basic association rules and frequent episodes algorithms, then describe in detail our extensions that consider the characteristics of audit data.

3.1 The Basic Algorithms

Following [1], let A be a set of attributes, and I a set of values on A, called items. Any subset of I is called an itemset. The number of items in an itemset is called its length. Let D be a database with n attributes (columns). Define support(X) as the percentage of transactions (records) in D that contain itemset X. An association rule is the expression

X → Y, c, s

where X and Y are itemsets, and X ∩ Y = ∅. s = support(X ∪ Y) is the support of the rule, and c = support(X ∪ Y)/support(X) is the confidence. For example, an association rule from the shell command history file (which is a stream of commands and their arguments) of a user is

trn → rec.humor, 0.3, 0.1,

which indicates that 30% of the time when the user invokes trn, he or she is reading the news in rec.humor, and that reading this newsgroup accounts for 10% of the activities recorded in his or her command history file. We implemented the association rules algorithm following the main ideas of Apriori [2]. Briefly, we call an itemset X a frequent itemset if support(X) ≥ minimum_support. Observe that any subset of a frequent itemset must also be a frequent itemset. The algorithm starts by finding the frequent itemsets of length 1, then iteratively computes the frequent itemsets of length k + 1 from those of length k. This process terminates when no new frequent itemsets are generated. It then proceeds to compute the rules that satisfy the minimum_confidence requirement. The Apriori algorithm is outlined in Figure 2.

Since we look for correlations among the values of different attributes, and the (pre-processed) audit data usually has multiple attributes, each with a large number of possible values, we do not convert the data into a binary database as suggested in [2]. In our implementation we trade memory for speed. The data structure for a frequent itemset has a row (bit) vector that records the transactions in which the itemset is contained. The database is scanned only once to generate the list of frequent itemsets of length 1. When a length k candidate itemset c_k is generated by joining two length k − 1 frequent itemsets l'_{k-1} and l''_{k-1}, the row vector of c_k is simply the bitwise AND of the row vectors of l'_{k-1} and l''_{k-1}.


scan database D to form L_1 = {frequent 1-itemsets};
k = 2;  /* k is the length of the itemsets */
while L_{k-1} ≠ ∅ do begin  /* association generation */
    for each pair of l'_{k-1}, l''_{k-1} ∈ L_{k-1}, l'_{k-1} ≠ l''_{k-1},
            whose first k-2 items are the same do begin
        construct candidate itemset c_k such that its first k-2 items are the same
            as those of l'_{k-1}, and its last two items are the last item of
            l'_{k-1} and the last item of l''_{k-1};
        if there is a length k-1 subset s_{k-1} ⊂ c_k such that s_{k-1} ∉ L_{k-1} then
            remove c_k;  /* the prune step */
        else add c_k to C_k;
    end
    scan D and count the support of each c_k ∈ C_k;
    L_k = {c_k | support(c_k) ≥ minimum_support};
    k = k + 1;
end
forall l_k, k ≥ 2 do begin  /* rule generation */
    forall subsets a_m ⊂ l_k do begin
        conf = support(l_k) / support(a_m);
        if conf ≥ minimum_confidence then
            output rule a_m → (l_k − a_m),
                with confidence = conf and support = support(l_k);
    end
end

Fig. 2. Apriori association rules algorithm


The support of c_k can then be calculated easily by counting the 1s in its row vector, instead of scanning the database. There is also no need to perform the prune step in the candidate generation function. The row vectors of the length k − 1 itemsets are freed to save memory after they have been used to generate the length k itemsets. Since most (pre-processed) audit data files are small enough, this implementation works well in our application domain.
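The row-vector idea can be made concrete with a short Java sketch using java.util.BitSet; the class below is an illustrative assumption about one way to realize it, not the authors' implementation.

import java.util.BitSet;

// Each frequent itemset carries a bit vector with one bit per transaction.
// A length-k candidate's vector is the bitwise AND of the vectors of the two
// length-(k-1) itemsets that generate it, so support is just a popcount.
class RowVectorItemset {
    final BitSet rows;            // bit i set iff transaction i contains the itemset
    final int numTransactions;

    RowVectorItemset(BitSet rows, int numTransactions) {
        this.rows = rows;
        this.numTransactions = numTransactions;
    }

    RowVectorItemset join(RowVectorItemset other) {
        BitSet and = (BitSet) rows.clone();
        and.and(other.rows);      // transactions containing both generating itemsets
        return new RowVectorItemset(and, numTransactions);
    }

    double support() {
        return (double) rows.cardinality() / numTransactions;
    }
}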

The problem of finding frequent episodes based on minimal occurrences was introduced in [18]. Briefly, given an event database D where each transaction is associated with a timestamp, an interval [t1, t2] is the sequence of transactions that starts at timestamp t1 and ends at t2. The width of the interval is defined as t2 − t1. Given an itemset A in D, an interval is a minimal occurrence of A if it contains A and none of its proper sub-intervals contains A. Define support(X) as the ratio between the number of minimal occurrences that contain itemset X and the number of records in D.



A frequent episode rule is the expression

X, Y → Z, c, s, window

where X, Y and Z are itemsets in D. s = support(X ∪ Y ∪ Z) is the support of the rule, and c = support(X ∪ Y ∪ Z)/support(X ∪ Y) is the confidence. The width of each of the occurrences must be less than window. A serial episode rule has the additional constraint that X, Y and Z must occur in transactions in partial time order, i.e., Z follows Y and Y follows X. The description here differs from [18] in that we do not consider a separate window constraint on the LHS (left hand side) of the rule. The frequent episodes algorithm finds patterns in a single sequence of event stream data. The problem of finding frequent sequential patterns that appear in many different data-sequences was introduced in [3]. That related algorithm is not used in our study, since the frequent network or system activity patterns can only be found in the single audit data stream from the network or the operating system.

Our implementation of the frequent episodes algorithm utilizes the data structures and library functions of the association rules algorithm. Any subset of a frequent itemset must also be a frequent itemset, since each interval that contains the itemset also contains all of its subsets. We can therefore again start by finding the frequent episodes of length 2, then length 3, etc. Instead of finding correlations across attributes, we are now looking for correlations across records. The row vector is now used as an interval vector, where each pair of adjacent 1s marks the boundaries of an interval. A temporal join function that considers minimal and non-overlapping occurrences is used to create the interval vector of a candidate length k itemset from the interval vectors of two length k − 1 frequent itemsets.

3.2 Extensions

These basic algorithms do not consider any domain knowledge and as a result they can generate many "irrelevant" (i.e., uninteresting) rules. In [12], rule templates specifying the allowable attribute values are used to post-process the discovered rules. In [22], boolean expressions over the attribute values are used as item constraints during rule discovery. In [21], a "belief-driven" framework is used to discover the "unexpected" (hence "interesting") patterns. A drawback of all these approaches is that one has to know what rules/patterns are interesting or are already in the "belief system". We cannot assume such strong prior knowledge of all audit data.

Interestingness Measures Based on Attributes. We attempt to utilize schema-level information about audit records to direct the pattern mining process. That is, although we cannot know in advance what patterns, which involve actual attribute values, are interesting, we often know what attributes are more important or useful for a given data analysis task.



By using the minimum support and confidence values to output only the "statistically significant" patterns, the basic algorithms implicitly measure the interestingness (i.e., relevancy) of patterns by their support and confidence values, without regard to any available prior domain knowledge. That is, if I is the interestingness measure of a pattern p, then

I(p) = f(support(p), confidence(p))

where f is some ranking function. We propose to incorporate schema-level information into the interestingness measures. Assume I_A is a measure of whether a pattern p contains the specified important (i.e., "interesting") attributes; our extended interestingness measure is

I_e(p) = f_e(I_A(p), f(support(p), confidence(p))) = f_e(I_A(p), I(p))

where f_e is a ranking function that first considers the attributes in the pattern, then the support and confidence values.

In the following sections, we describe several schema-level characteristics of audit data, in the form of "what attributes must be considered", that can be used to guide the mining of relevant features. We do not use these I_A measures in post-processing to filter out irrelevant rules by rank ordering. Rather, for efficiency, we use them as item constraints, i.e., conditions, during candidate itemset generation.

Using the Axis Attribute(s). There is a partial "order of importance" among the attributes of an audit record. Some attributes are essential in describing the data, while others only provide auxiliary information. Consider the audit data of network connections shown in Table 1. Each record (row) describes a network connection. The continuous attribute values, except the timestamps, are discretized into proper bins. A network connection can be uniquely identified by

< timestamp, src_host, src_port, dst_host, service >

that is, the combination of its start time, source host, source port, destination host, and service (destination port). These are the essential attributes when describing network data. We argue that the "relevant" association rules should describe patterns related to the essential attributes. Patterns that include only the unessential attributes are normally "irrelevant". For example, the basic association rules algorithm may generate rules such as

src_bytes = 200 → flag = SF

These rules are not useful and to some degree are misleading. There is no intuition for an association between the number of bytes from the source, src_bytes, and the normal status (flag = SF) of the connection; it may just be a statistical correlation evident in the dataset.



Table 1. Network Connection Records

timestamp  service  src_host  dst_host  src_bytes  dst_bytes  flag
1.1        telnet   A         B         100        2000       SF
2.0        ftp      C         B         200        300        SF
2.3        smtp     B         D         250        300        SF
3.4        telnet   A         D         200        12100      SF
3.7        smtp     B         C         200        300        SF
5.2        http     D         A         200        0          REJ
...

We call the essential attribute(s) axis attribute(s) when they are used as a form of item constraint in the association rules algorithm. During candidate generation, an itemset must contain value(s) of the axis attribute(s). We consider the correlations among non-axis attributes as not interesting. In other words,

I_A(p) = 1 if p contains axis attribute(s), and 0 otherwise.

In practice, we need not designate all essential attributes as axis attributes. For example, some network analysis tasks require statistics about the various network services, while others may require the patterns related to the hosts. We can use service as the axis attribute to compute the association rules that describe the patterns related to the services of the connections.
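As a small illustration of this item constraint, the following Java sketch (with names of our choosing) tests whether a candidate itemset, represented as attribute-value pairs, mentions at least one designated axis attribute; candidates that fail the test are discarded during generation.

import java.util.Map;
import java.util.Set;

class AxisConstraint {
    // I_A(p) = 1 exactly when this returns true.
    static boolean containsAxisAttribute(Map<String, String> itemset,
                                         Set<String> axisAttributes) {
        for (String attribute : itemset.keySet()) {
            if (axisAttributes.contains(attribute)) return true;
        }
        return false;
    }
}

With service as the only axis attribute, an itemset such as {src_bytes = 200, flag = SF} fails the test, so the uninteresting rule src_bytes = 200 → flag = SF from the earlier example is never generated.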

It is even more important to use the axis attribute(s) to constrain the item generation for frequent episodes. The basic algorithm can generate serial episode rules that contain only "unimportant" attribute values. For example,

src_bytes = 200, src_bytes = 200 → dst_bytes = 300, src_bytes = 200

(We omit the support, confidence, and window from the above rule.) Note that each attribute value, e.g., src_bytes = 200, is from a different connection record. To make matters worse, if the support of an association rule on non-axis attributes, A → B, is high, then there will be a large number of "useless" serial episode rules of the form (A|B)(, A|B)* → (A|B)(, A|B)*, due to the following theorem:

Theorem 1. Let s be the support of the association A → B, and let N be the total number of episode rules on (A|B), i.e., rules of the form

(A|B)(, A|B)* → (A|B)(, A|B)*

then N is at least an exponential factor of s.

Proof:



According to the frequent episodes algorithm, any subset of a frequent itemset is also frequent. For example, if A, A → A, B is a frequent episode, then so are A → A and A → B, etc. At each iteration of pattern generation, an itemset is "grown" (i.e., constructed) from its subsets. The number of iterations for growing the frequent itemsets, i.e., the maximum length of an itemset, L, is bounded here by the number of records (instead of the number of attributes as in association rules), which is usually a large number.

At each iteration, we can count N_k, the number of episode patterns on (A|B), where k is the length of the itemsets generated in the current iteration. Thus, the total number of episodes on (A|B) is

N = Σ_{k=2}^{L} N_k.

Assume that there are a total of m records in the database, and that the time difference between the last and the first record is t seconds. There are s·m records that contain A ∪ B in the same transaction. Then the number of minimal and non-overlapping intervals that contain k records with A ∪ B is sm/k. Notice that each of these intervals contains 2^k length-k episodes on (A|B). That is,

N_k = (sm/k) · 2^k,

and thus

N = Σ_{k=2}^{L} (sm/k) · 2^k.

Therefore N is at least an exponential factor of s.

Next, we show that L monotonically increases with s, and is bounded by m. Assume that the records of the database are evenly distributed with regard to time, and so are the records with A ∪ B. Then at each iteration, the width of an interval can be as large as w = kt/(sm). Given the width requirement W, w ≤ W must hold, so we have

k ≤ Wsm/t.

Therefore L, the maximum value of k, is

L = Wsm/t.

It is easy to see that if the records are not evenly distributed, then w is a factor of kt/(sm), and L still monotonically increases with s. ∎



To avoid generating a huge number of "useless" episode rules, we extended the basic frequent episodes algorithm to compute frequent sequential patterns in two phases: first, it finds the frequent associations using the axis attribute(s); second, it generates the frequent serial patterns from these associations. That is, in the second phase, the items (from which episode itemsets are constructed) are the associations involving the axis attribute(s), together with the axis attribute values themselves (i.e., length 1 associations). An example of a rule is

(service = smtp, src_bytes = 200, dst_bytes = 300, flag = SF), (service = telnet, flag = SF) → (service = http, src_bytes = 200)

Note that each itemset of the episode rule, e.g.,

(service = smtp, src_bytes = 200, dst_bytes = 300, flag = SF)

is an association. We have in effect combined the associations among attributes and the sequential patterns among the records into a single rule. This rule formalism not only eliminates irrelevant patterns, it also provides rich and useful information about the audit data.

Table 2. Web Log Records

timestamp  remote host               action  request
1          his.moc.kw                GET     /images
1.1        his.moc.kw                GET     /images
1.3        his.moc.kw                GET     /shuttle/missions/sts-71 ...
3.1        taka10.taka.is.uec.ac.jp  GET     /images
3.2        taka10.taka.is.uec.ac.jp  GET     /images
3.5        taka10.taka.is.uec.ac.jp  GET     /shuttle/missions/sts-71 ...
8          rjenkin.hip.cam.org       GET     /images
8.2        rjenkin.hip.cam.org       GET     /images
9          rjenkin.hip.cam.org       GET     /shuttle/missions/sts-71 ...

Using the Reference Attribute(s). Another interesting characteristic of system audit data is that some attributes can be references of other attributes. These reference attributes normally carry information about some "subject", while other attributes describe the "actions" that refer to the same "subject". Consider the log of visits to a Web site, as shown in Table 2. Here action and request are the "actions" taken by the "subject", remote host. We see that a number of remote hosts each make the same sequence of requests: "/images", "/images" and "/shuttle/missions/sts-71".



It is important to use the "subject" as a reference when finding such frequent sequential "action" patterns, because the "actions" from different "subjects" are normally irrelevant to one another. This kind of sequential pattern can be represented as

(subject = X, action = a), (subject = X, action = b) → (subject = X, action = c)

Note that within each occurrence of the pattern, the action values refer to the same subject, yet the actual subject value may not be given in the rule, since any particular subject value may not be frequent with respect to the entire dataset. In other words, subject is simply a reference (or a variable).

The basic frequent episodes algorithm can be extended to consider reference attribute(s). Briefly, when forming an episode, an additional condition is that, within its minimal occurrences, the records covered by its constituent itemsets have the same value(s) of the reference attribute(s). In other words,

    I_A(p) = 1  if the itemsets of p refer to the same reference attribute value
             0  otherwise
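As a small illustration of this condition (with an invented record format, not the authors' data structures), the check below accepts a minimal occurrence only when every record matched by the episode's itemsets carries the same reference value:

    def same_reference(matched_records, ref_attrs):
        """matched_records: one matched record per constituent itemset of the episode."""
        ref_values = {tuple(rec[a] for a in ref_attrs) for rec in matched_records}
        return len(ref_values) == 1   # corresponds to I_A(p) = 1

    # Both records of this length-2 occurrence refer to the same remote host,
    # so the occurrence would be counted.
    occurrence = [{"remote_host": "his.moc.kw", "action": "GET", "request": "/images"},
                  {"remote_host": "his.moc.kw", "action": "GET", "request": "/shuttle/missions/sts-71"}]
    print(same_reference(occurrence, ["remote_host"]))   # True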

Level-wise Approximate Mining It is often necessary to include the low frequency patterns. In daily network traffic, some services, for example, gopher, account for very low occurrences. Yet we still need to include their patterns in the network traffic profile (so that we have representative patterns for each supported service). If we use a very low support value for the data mining algorithms, we will unnecessarily get a very large number of patterns related to the high frequency services, for example, smtp.

Input: database D, the terminating minimum support s_t, the initial minimum support s_i, and the axis attribute(s)
Output: frequent episode rules Rules
Begin
(1)  R_restricted = ∅;
(2)  scan database D to form L = {frequent 1-itemsets that meet s_t};
(3)  s = s_i;
(4)  while (s ≥ s_t) do begin
(5)      compute episodes from L: each episode must contain at least one axis
         attribute value that is not in R_restricted;
(6)      append new axis attribute values to R_restricted;
(7)      append episode rules to the output rule set Rules;
(8)      s = s/2;  /* use a smaller support value for the next iteration */
     end
end

Fig. 3. Level-wise approximate mining of frequent episodes
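The loop of Figure 3 can also be phrased compactly as follows. This is a sketch rather than a faithful implementation: mine_episodes stands in for the restricted frequent-episodes computation of lines (2) and (5) and is assumed to return both the new rules and the new axis values they introduce, and halving the support is our reading of the update in line (8).

    def level_wise_mining(db, s_init, s_term, axis_attrs, mine_episodes):
        restricted = set()     # "old" axis attribute values that already have output rules
        rules = []
        s = s_init
        while s >= s_term:
            # every episode must contain at least one axis value not in `restricted`
            new_rules, new_axis_values = mine_episodes(db, s, axis_attrs, restricted)
            restricted |= new_axis_values
            rules.extend(new_rules)
            s = s / 2          # use a smaller support value for the next iteration
        return rules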


We use the level-wise approximate mining procedure described in Figure 3 for finding frequent sequential patterns from audit data. The idea is to first find the episodes related to high frequency axis attribute values, for example

(service = smtp, src_bytes = 200), (service = smtp, src_bytes = 200) → (service = smtp, dst_bytes = 300)

We then iteratively lower the support threshold to find the episodes related to the low frequency axis values by restricting the participation of the "old" axis values that already have output episodes. More specifically, when an episode is generated, it must contain at least one "new" (low frequency) axis value. For example, in the second iteration, where smtp now is an old axis value, we get an episode rule

(service = smtp, src_bytes = 200), (service = http, src_bytes = 200) → (service = smtp, src_bytes = 300)

The procedure terminates when a very low support value is reached. In practice, this can be the lowest frequency of all axis values.

Note that for a high frequency axis value, we in effect omit its very low frequency episodes (generated in the runs with low support values) because they are not as interesting (i.e., representative). In other words, at each iteration we have

    I_A(p) = 1  if p contains at least one "new" axis attribute value
             0  otherwise

Hence our procedure is "approximate" mining. We still include all the old (i.e., high frequency) axis values to form episodes along with the new axis values because it is important to capture the sequential context of the new axis values. For example, although used infrequently, auth normally co-occurs with other services such as smtp and login. It is therefore imperative to include these high frequency services in the episode rules about auth.

Our approach is different from the algorithms in [9] since we do not have (and cannot assume) multiple concept levels; rather, we deal with multiple frequency levels of a single concept, e.g., the network service.

4 Using the Mined Patterns

In this section we report our experience in mining the audit data and using the discovered patterns both as the indicator for gathering data and as the basis for selecting appropriate temporal statistical features.

4.1 Audit Data Gathering

We posit that the patterns discovered from the audit data on a protected target (e.g., a network, system program, or user) correspond to the target's behavior.


When we gather audit data about the target, we compute the patterns from each new audit data set, and merge the new rules into the existing aggregate rule set. The added new rules represent (new) variations of the normal behavior. When the aggregate rule set stabilizes, i.e., no new rules from the new audit data can be added, we can stop the data gathering since the aggregate audit data set has covered sufficient variations of the normal behavior.

Our approach of merging rules is based on the fact that even the same type of behavior will have slight differences across audit data sets. Therefore we should not expect perfect (exact) match of the mined patterns. Instead we need to combine similar patterns into more generalized ones.

We merge two rules, r1 and r2, into one rule r if

• their right and left hand sides are exactly the same, or their RHSs can be combined and LHSs can also be combined;

• the support values and the confidence values are close, i.e., within an ε (a user-defined threshold).

The concept of combining is similar to clustering in [16] in that we also combine rules that are "similar" syntactically with regard to their attributes, and are "adjacent" in terms of their attribute values. That is, two LHSs (or RHSs) can be combined, if

• they have the same number of itemsets; and
• each pair of corresponding itemsets (according to their positions in the patterns) have the same axis attribute value(s), and the same or adjacent non-axis attribute value(s).

As an example, consider combining the LHSs and assume that the LHS of r1 has just one itemset,

(ax1 = vx1, a1 = v1)

ax1 is an axis attribute. The LHS of r2 must also have only one itemset,

(ax2 = vx2, a2 = v2)

Further, ax1 = ax2, vx1 = vx2, and a1 = a2 must hold. For the LHSs to be combined, v1 and v2 must be the same value or adjacent bins of values. The LHS of the merged rule r is

(ax1 = vx1, v1 ≤ a1 ≤ v2)

assuming that v2 is the larger value. For example,

(service = smtp, src_bytes = 200)

and (service = smtp, src_bytes = 300)


can be combined into

(service = smtp, 200 ≤ src_bytes ≤ 300)

To compute the statistically relevant support and confidence values of the merged rule r, we record support_lhs and db_size of r1 and r2 when mining the rules from the audit data. support_lhs is the support of a LHS and db_size is the number of records in the audit data.
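The chapter records support_lhs and db_size but does not spell out the arithmetic of the merged statistics, so the following is only one plausible reading: recover the raw counts from the recorded fractions of each audit data set and pool them.

    def merge_stats(r1, r2):
        """r1, r2: dicts with 'support', 'support_lhs' (fractions) and 'db_size' (records)."""
        rule_cnt = r1["support"] * r1["db_size"] + r2["support"] * r2["db_size"]
        lhs_cnt = r1["support_lhs"] * r1["db_size"] + r2["support_lhs"] * r2["db_size"]
        total = r1["db_size"] + r2["db_size"]
        return {"support": rule_cnt / total, "confidence": rule_cnt / lhs_cnt}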

Experiments We test our hypothesis that the merged rule set can indicate whether the audit data has covered sufficient variations of behavior. We obtained one month of TCP/IP network traffic data¹ (we hereafter refer to it as the LBL dataset; there are in total about 780,000 connection records). We segmented the data by day, and for the data of each day, we again segmented the data into four partitions: morning, afternoon, evening and night. This partitioning scheme allowed us to cross evaluate anomaly detection models of different time segments (that have different traffic patterns). It is often the case that very little (sometimes no) intrusion data is available when building an anomaly detector. A common practice is to use audit data (of legitimate activities) that is known to have different behavior patterns for testing and evaluation.

We describe the experiments and results on building anomaly detection models for the "weekday morning" traffic data on connections originated from LBL to the outside world (there are about 137,000 such connections for the month). We decided to compute the frequent episodes using the network service as the axis attribute. Recall from our earlier discussion that this formalism captures both association and sequential patterns. For the first three weeks, we mined the patterns from the audit data of each weekday morning, and merged them into the aggregate rule set. For each rule we recorded merge_count, the number of merges on this rule. Note that if two rules r1 and r2 are merged into r, its merge_count is the sum from the two rules. merge_count indicates how frequently the behavior represented by the merged rule is encountered across a period of time (days). We call the rules with merge_count ≥ min_frequency the frequent rules.
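A sketch of the aggregation step is given below. The can_merge and merge helpers are the rule-combination operations described above and are assumed to be given; starting merge_count at 1 for a freshly mined rule is our own convention, since the text only states that counts are summed when two rules merge.

    def update_aggregate(aggregate, new_rules, can_merge, merge):
        """Fold one day's rule set into the aggregate set; return the number of new rules."""
        added = 0
        for r_new in new_rules:
            for i, r_old in enumerate(aggregate):
                if can_merge(r_old, r_new):
                    merged = merge(r_old, r_new)
                    merged["merge_count"] = (r_old.get("merge_count", 1)
                                             + r_new.get("merge_count", 1))
                    aggregate[i] = merged
                    break
            else:
                r_new["merge_count"] = r_new.get("merge_count", 1)
                aggregate.append(r_new)
                added += 1
        return added   # audit data gathering can stop once this stays at zero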

Figure 4 plots how the rule set changes as we merge patterns from each new audit data set. We see that the total number of rules keeps increasing. We visually inspected the new rules from each new data set. In the first two weeks, the majority are related to "new" network services (that have no prior patterns in the aggregate rule set), and for the last week, the majority are just new rules of the existing services. Figure 4 shows that the rate of change slows down during the last week. Further, when we examine the frequent rules (we used min_frequency = 2 to filter out the "one-time" patterns), we can see in the figure that the rule sets (of all services as well as the individual services) grow at a much slower rate and tend to stabilize.

1 From http://ita.ee.lbl.gov/html/contrib/LBL-CONN-7.html.


[Figure: number of rules (y-axis) versus the number of audit data sets (x-axis, 1-17), with curves for all services, frequent all services, frequent http, frequent smtp, and frequent telnet.]

Fig. 4. The number of rules vs. the number of audit data sets

[Figure: misclassification rates (0-22%) on weekday morning, weekend morning, and weekday night test data.]

Fig. 5. Misclassification rates of classifier trained on first 8 weekdays


We used the set of frequent rules of all services as the indicator of whether the audit data is sufficient. We tested the quality of this indicator by constructing four classifiers, using audit data from the first 8, 10, 15, and 17 weekday mornings, respectively, for training. We used the services of the connections as the class labels, and included a number of temporal statistical features (the details of feature selection are discussed in the next section). The classifiers were tested using the audit data (not used in training) from the mornings and nights of the last 5 weekdays of the month, as well as the last 5 weekend mornings.


[Figure: misclassification rates (0-22%) on weekday morning, weekend morning, and weekday night test data.]

Fig. 6. Misclassification rates of classifier trained on first 10 weekdays

[Figure: misclassification rates (0-22%) on weekday morning, weekend morning, and weekday night test data.]

Fig. 7. Misclassification rates of classifier trained on first 15 weekdays

[Figure: misclassification rates (0-22%) on weekday morning, weekend morning, and weekday night test data.]

Fig. 8. Misclassification rates of classifier trained on first 17 weekdays


Figures 5, 6, 7 and 8 show the performance of these four classifiers in detecting anomalies (different behavior), respectively. In each figure, we show the misclassification rate (percentage of misclassifications) on the test data. Since the classifiers model the weekday morning traffic, we wish this rate to be low on the weekday morning test data, but high on the weekend morning data as well as the weekday night data. The figures show that the classifiers with more training (audit) data perform better. Further, the last two classifiers are effective in detecting anomalies, and their performances are very close (see Figures 7 and 8). This is not surprising at all because, from the plots in Figure 4, the set of frequent rules (our indicator on audit data) is still growing at weekdays 8 and 10, but stabilizes from day 15 to 17. Thus this indicator on audit data gathering is quite reliable.

4.2 Feature Selection

An important use of the mined patterns is as the basis for feature selection. When the axis attribute is used as the class label attribute, features (the attributes) in the association rules should be included in the classification models. In addition, the time windowing information and the features in the frequent episodes suggest that their statistical measures, e.g., the average, the count, etc., should also be considered as additional features.

Experiments on the LBL Dataset We examined the frequent rules from the audit data to determine what features should be used to generate training data and learn a classifier. When the same value of an attribute is repeated several times in a frequent episode rule, it suggests that we should include a corresponding count feature. For example given

(service = smtp, src_bytes = 200), (service = smtp, src_bytes = 200) → (service = smtp, src_bytes = 200)  [0.81, 0.42, 140s]

we add a feature: the count of connections that have the same service and src_bytes as the current connection record in the past 140 seconds. When an attribute (with different values) is repeated several times in the rule, we add a corresponding average feature. For example, given

(service = smtp, duration = 2), (service = telnet, duration = 10) → (service = http, duration = 1)

we add a feature: the average duration of all connections in the past 140 seconds. The classifiers in the previous section included a number of temporal statistical features of this type: the count of all connections in the past 140 seconds, the count of connections with the same service and the same src_bytes, the average duration, the average dst_bytes, etc.


Our experiments showed that when using none of the temporal statistical features, or when using just the count features or just the average features, the classification performance was much worse.
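The two kinds of features can be computed with a simple pass over the time-ordered connection records. The record layout below (dicts with time, service, src_bytes and duration fields) is an assumption made for illustration only.

    def temporal_features(records, window=140):
        """Count and average features over the past `window` seconds for each record."""
        features = []
        for i, cur in enumerate(records):
            past = [r for r in records[:i] if cur["time"] - r["time"] <= window]
            same = sum(1 for r in past
                       if r["service"] == cur["service"]
                       and r["src_bytes"] == cur["src_bytes"])
            avg_duration = sum(r["duration"] for r in past) / len(past) if past else 0.0
            features.append({"count_same_srv_src_bytes": same,
                             "count_all": len(past),
                             "avg_duration": avg_duration})
        return features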

In [13] we reported that as we mined frequent episodes using different window sizes, the number of serial episodes stabilized after the time window reached 30 seconds. We showed that when using 30 seconds as the time interval to calculate the temporal statistical features, we achieved the best classification performance. We sampled the weekday morning data and discovered that the number of episodes stabilized at 140 seconds. Hence, we used it as the window in mining the audit data and as the time interval to calculate statistical features.

4.3 Off-line Analysis

[Figure: similarity against the merged rule set for weekday morning, weekend morning, and weekday night test data.]

Fig. 9. Similarity measures against the merged rule set of weekday mornings

Since the merged rule set was used to (identify and) collect "new" behavior during the audit data gathering process, one naturally asks "Can the final rule set be directly used to detect anomalies?". We used the set of frequent rules to distinguish the traffic data of the last 5 weekday mornings from the last 5 weekend mornings and the last 5 weekday nights. We use a similarity measure. Assume that the merged rule set has n rules, the size of the new rule set from a new audit data set is m, and the number of matches (i.e., the number of rules that can be merged) between the merged rule set and the new rule set is p; then we have

    similarity = (p/n) * (p/m)

p/n represents the percentage of known behavior (from the merged rule set) exhibited in the new audit data, and p/m represents the proportion of (all) behavior in the new audit data that conforms to the known behavior.


Figure 9 shows that the similarity values of the weekday mornings are much larger than those of the weekend mornings and the weekday nights.
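The similarity score is straightforward to compute once the rule-matching test is available; in the sketch below, can_merge is the rule-combination test described in Section 4.1 and is assumed to be given.

    def similarity(merged_rules, new_rules, can_merge):
        """p matches between the merged rule set (n rules) and a new rule set (m rules)."""
        p = sum(1 for r_new in new_rules
                if any(can_merge(r_old, r_new) for r_old in merged_rules))
        n, m = len(merged_rules), len(new_rules)
        return (p / n) * (p / m)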

In general the mined patterns cannot be used directly to classify the records (i.e., they cannot tell which records are anomalous). They are, however, very valuable in off-line analysis. By studying the differences between frequent pattern sets, we can identify the different behavior across audit data sets. For example, by comparing the patterns from normal and intrusion data, we can gain a better understanding of the nature of the intrusions and identify their "signature" patterns.

Misuse Detection: Results on the InfoWorld IWSS16 Dataset We report the results of our recent experiments on a set of network intrusion data from InfoWorld, which contains the attacks of the "InfoWorld Security Suite 16" [20] that was used to evaluate several leading commercial intrusion detection products. (We hereafter refer to this dataset as the IWSS16 dataset.)

The dataset has two traces of tcpdump packet header only data. One contains normal network traffic, and the other contains network traffic where 16 different types of attacks were simulated. According to their attack methods², several intrusions in IWSS16 would leave distinct evidence in the short sequence of (time ordered) connection records. The others would leave evidence only in the data portion of network packets, which was not included in this dataset.

Below we demonstrate how our data mining algorithms can be used to find (test) the intrusion patterns of these attacks. We used a time window of 2 seconds.

• Port Scan: The attacker systematically makes connections to each port (that is, the service) of a target host (the destination host) in order to find out which ports are accessible. In the connection records, there should be a host (or hosts) that receives many connections to its "different" ports in a short period of time. Further, a lot of these connections have the "REJ" flag since many ports are normally unavailable (hence the connections are rejected).

- Data mining strategy: use dst_host as both the axis attribute and the reference attribute to find the "same destination host" frequent sequential "destination host" patterns;

- Evidence in intrusion data: there are several patterns that suggest the attack, for example,

(dst_host = host_v, flag = REJ), (dst_host = host_v, flag = REJ) → (dst_host = host_v, flag = REJ)

2 Scripts and descriptions of many intrusions can be found using the search engine at http://www.rootshell.com.


but no patterns with flag = REJ are found when using the service as the axis attribute (and dst_host as the reference attribute) since a large number of different services (ports) are attempted in a short period of time. As a result, for each service the "same destination host" sequential patterns are not frequent;

- Contrast with normal data: patterns related to flag = REJ indicate that the "same" service is involved.

• Ping Scan: The attacker systematically sends ping (icmp echo) requests to a large number of different hosts to find out which host is available. In the connection records, there should be a host that makes icmp echo connections to many different hosts in a short period of time.

- Data mining strategy: use service as the axis attribute and src_host as the reference attribute to find the "same source host" frequent sequential "service" patterns;

- Evidence in intrusion data: there are several patterns that suggest the attack, for example,

(service = icmp_echo, src_host = host_h), (service = icmp_echo, src_host = host_h) → (service = icmp_echo, src_host = host_h)

Note that there is no dst_host in this rule, suggesting that icmp echo is sent to "different" hosts;

- Contrast with normal data: no such patterns.

• Syn Flood: The attacker makes a lot of "half-open" connections (by sending only a "syn request" but not establishing the connection) to a port of a target host in order to fill up the victim's connection-request buffer. As a result, the victim will not be able to handle new incoming requests. This is a form of "denial-of-service" attack. In the connection records, there should exist a host where one of its ports receives a lot of connections with flag "S0" (only the "syn request" packet is seen) in a short period of time.

- Data mining strategy: use service as the axis attribute and dst_host as the reference attribute to find the "same destination host" frequent sequential "service" patterns;

- Evidence in intrusion data: there is very strong evidence of the attack, for example,

(service = http, flag = S0), (service = http, flag = S0) → (service = http, flag = S0)

- Contrast with normal data: no patterns with flag = S0.

We have developed an automatic technique for comparing and identifying "intrusion only" patterns from an aggregate set of normal patterns and a set of patterns from intrusion audit data [15].


That is, the pattern analysis tasks described above can be automated. In [14] we described an algorithm for constructing temporal and statistical features from the identified intrusion-only patterns. We reported that using this feature construction process, the resultant RIPPER classifier had an overall accuracy of 99.1% on the IWSS16 dataset.

5 Conclusion

In this chapter we discussed data mining techniques for building intrusion detection models. We demonstrated that association rules and frequent episodes from the audit data can be used to guide audit data gathering and feature selection, the critical steps in building effective classification models. We incorporated domain knowledge into these basic algorithms using the axis attribute(s), reference attribute(s), and a level-wise approximate mining procedure. Our experiments on real-world audit data showed that the algorithms are very effective.

To the best of our knowledge, our research was the first attempt to develop a systematic framework for building intrusion detection models. We plan to refine our approach and further study some fundamental problems. Are classification models best suited for intrusion detection (i.e., what are the better alternatives)?

It is important to include system designers in the knowledge discovery tasks. We are implementing a support environment that graphically presents the mined patterns along with the list of features and the time windowing information to the user, and allows him/her to (iteratively) formulate a classification task, and build and test the model using a classification engine such as JAM [23].

6 Acknowledgments

This research is supported in part by grants from DARPA (F30602-96-1-0311) and NSF (IRI-96-32225 and CDA-96-25374).

Our work has benefited from in-depth discussions with Alexander Tuzhilin of New York University, and suggestions from Charles Elkan of UC San Diego. We would also like to thank Dave Fan and Andreas Prodromidis of Columbia University, and Phil Chan of Florida Institute of Technology for their help and encouragement.

References

1. Agrawal R., Imielinski T., Swami A. (1993) Mining Association Rules Between Sets of Items in Large Databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, 207-216

2. Agrawal R., Srikant R. (1994) Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th VLDB Conference

3. Agrawal R., Srikant R. (1995) Mining Sequential Patterns. In: Proceedings of the 11th International Conference on Data Engineering

4. Bellovin S.M. (1989) Security Problems in the TCP/IP Protocol Suite. Computer Communication Review, 19(2):32-48

5. Chan P.K., Stolfo S.J. (1993) Toward Parallel and Distributed Learning by Meta-learning. In: AAAI Workshop in Knowledge Discovery in Databases, 227-240

6. Cohen W.W. (1995) Fast Effective Rule Induction. In: Machine Learning: the 12th International Conference

7. Fawcett T., Provost F. (1996) Combining Data Mining and Machine Learning for Effective User Profiling. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 8-13

8. Grampp F.T., Morris R.H. (1984) Unix System Security. AT&T Bell Laboratories Technical Journal, 63(8):1649-1672

9. Han J., Fu Y. (1995) Discovery of Multiple-level Association Rules from Large Databases. In: Proceedings of the 21st VLDB Conference

10. Ilgun K., Kemmerer R.A., Porras P.A. (1995) State Transition Analysis: A Rule-based Intrusion Detection Approach. IEEE Transactions on Software Engineering, 21(3):181-199

11. Jacobson V., Leres C., McCanne S. (1989) tcpdump. Available via anonymous ftp to ftp.ee.lbl.gov

12. Klemettinen M., Mannila H., Ronkainen P., Toivonen H., Verkamo A.I. (1994) Finding Interesting Rules from Large Sets of Discovered Association Rules. In: Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), 401-407

13. Lee W., Stolfo S.J. (1998) Data Mining Approaches for Intrusion Detection. In: Proceedings of the 7th USENIX Security Symposium, 79-93

14. Lee W., Stolfo S.J., Mok K.W. (1998) Adaptive Intrusion Detection: A Data Mining Approach. Artificial Intelligence Review (to appear)

15. Lee W., Stolfo S.J., Mok K.W. (1999) Mining in a Data-flow Environment: Experience in Network Intrusion Detection. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99), 114-124

16. Lent B., Swami A., Widom J. (1997) Clustering Association Rules. In: Proceedings of the 13th International Conference on Data Engineering

17. Lunt T., Tamaru A., Gilham F., Jagannathan R., Neumann P., Javitz H., Valdes A., Garvey T. (1992) A Real-time Intrusion Detection Expert System (IDES) - Final Technical Report. Technical Report, Computer Science Laboratory, SRI International, Menlo Park, California

18. Mannila H., Toivonen H. (1996) Discovering Generalized Episodes Using Minimal Occurrences. In: Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, 146-151

19. Mannila H., Toivonen H., Verkamo A.I. (1995) Discovering Frequent Episodes in Sequences. In: Proceedings of the 1st International Conference on Knowledge Discovery in Databases and Data Mining

20. McClure S., Scambray J., Broderick J. (1998) Test Center Comparison: Network Intrusion-detection Solutions. INFOWORLD, May 4, 1998

21. Padmanabhan B., Tuzhilin A. (1998) A Belief-driven Method for Discovering Unexpected Patterns. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 94-100

22. Srikant R., Vu Q., Agrawal R. (1997) Mining Association Rules with Item Constraints. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 67-73

23. Stolfo S.J., Prodromidis A.L., Tselepis S., Lee W., Fan D.W., Chan P.K. (1997) JAM: Java Agents for Meta-learning Over Distributed Databases. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 74-81

24. Utgoff P.E., Berkman N.C., Clouse J.A. (1997) Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning, 29:5-44


Scoring and Ranking the Data Using Association Rules

Bing Liu, Yiming Ma, and Ching Kian Wong

School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. {liub, maym, wongck}@comp.nus.edu.sg

Abstract. In many data mining applications, the objective is to find the likelihood that an object belongs to a particular class. For example, in direct marketing, marketers want to know how likely a potential customer will buy a particular product. In such applications, it is often too difficult to predict who will definitely be buyers and non-buyers because the data used for modeling is often very noisy and has a highly imbalanced class distribution. Traditionally, classification systems are used to solve this problem. Instead of assigning a definite class (e.g., buyer or non-buyer) to a data case representing a potential customer, a classification system is made to produce a class probability estimate (or a score) for the data case. However, existing classification systems only aim to find a small subset of rules that exist in data to form a classifier. This small subset of rules can only give a partial (or biased) picture of the domain. In this paper, we show that association rule mining provides a more powerful solution to the problem because association rule mining aims to generate all rules in data. It is thus able to give a complete picture of the underlying relationships that exist in the domain. This complete set of rules enables us to assign a more accurate class probability estimate (or likelihood) to each (new) data case. An efficient technique that makes use of the discovered association rules to produce class probability estimates is proposed. We call this technique scoring based on associations (or SBA). Experiment results on both public domain data and our real-life application data show that the technique performs significantly better than the state-of-the-art classification system C4.5.

1. Introduction

Classification is an important data mining task. The data used in a classification task consists of the descriptions of N objects (or data cases) in the form of records. Each object is described by n distinct attributes. These N objects have also been pre-classified into q known classes. The objective of the classification task is to find a set of characteristic descriptions (e.g., classification rules) for the q classes from the training data. This set of descriptions is often called a predictive model or classifier. The model is then used to classify future (or test) data cases into the q classes, i.e., given a new data case, the model will tell which class it belongs to. We call this binary classification because each data case is either classified to belong to a class or not to belong to a class, and each data case is only classified to belong to one class. Many classification systems have been built in the past [e.g., 27, 17, 20, 28]. They are widely used in real-life applications.

However, building a classifier (or predictive model) that accurately predicts future cases is not always possible in many applications. The resulting classifier may have very poor predictive accuracy because the (training) data used is noisy (or quite random) and has a highly imbalanced (or skewed) class distribution.


To make the matter worse, the user is often interested only in data cases of a minority class, which is even harder to predict. Many machine learning researchers have studied this problem in recent years [e.g., 6, 9, 14, 26]. Let us have an example.

Example 1: In direct marketing applications, marketers want to identify likely buyers of certain products and to promote the products accordingly. Typically, a past promotion database (training data) is used to build a predictive model, which is then used to select new likely buyers from a large marketing database of potential customers (each record or case in the database represents a potential customer). The training data used to build the model is typically very noisy and has an extremely imbalanced class distribution as the response rate to a product promotion (the percentage of people who respond to a promotion and buy the product) is often very low, e.g., 1-2% [13, 16]. Building a classifier to accurately predict buyers is clearly very difficult, if not impossible. If an inaccurate classifier is indeed used, it may only identify a very small percentage of the actual buyers, which is not acceptable for marketing applications. In such applications, it is common that the model is used to score the potential customers in the marketing database and then rank them according to their scores. Scoring means to assign a probability estimate to indicate the likelihood that a potential customer represented by a data record will buy the product. This gives marketers the flexibility to choose a certain percentage of likely buyers for promotion. Binary classification is not suitable for such applications. Assigning a likelihood to each data record is more appropriate.

The above example shows that scoring and ranking the data is very useful for applications where the classes are very hard to predict.

In many other applications, scoring and ranking the data can still be important even when the class distribution in the data is not extremely imbalanced and the predictive accuracy of the classifier built is acceptable. The reason is that we may need more cases of a particular class than can be predicted by the classifier. In such a situation, we would like to have the extra cases that are most likely to belong to this class.

Example 2: In an education application, we need to admit students to a particular course. We wish to select students who are most likely to do well when they graduate. We can build a classifier (or model) using the past data. Assume the classifier built is quite accurate. We then apply the classifier to the new applicants. However, if we only admit those applicants who are classified as good students, we may not be able to admit enough students. We would then like to take extra applicants who are most likely to do well. In this situation, assigning a probability estimate to each applicant becomes crucial because it allows us to admit as many applicants as we want and to be assured that these applicants are those who are most likely to do well. We are currently involved in two such applications.

Presently, classification systems are commonly used to score the data. Although such systems are not originally designed for the purpose, they can be easily modified to output a confidence factor (or probability estimate) as a score [16].


The score is then used in ranking the data cases. However, existing classification systems only aim to discover a small subset of the rules that exist in data to form a classifier. There are many more rules in data that are left undiscovered. This small subset of rules can only give a partial (or biased) picture of the domain.

In this paper, we show that association rule mining [2] provides a more powerful solution to the scoring problem because association rule mining discovers all the rules that exist in data and is thus able to give a complete picture of the domain. This complete set of rules enables us to assign a more accurate class probability estimate (or likelihood) to each new (or test) data case. An efficient technique that makes use of association rules to produce probability estimates is proposed. We call it scoring based on associations (SBA). Experiment results on both public domain data and our real-life application data show that the proposed method performs significantly better than the state-of-the-art classification system C4.5 [27].

1.1 Using association rules to score the data

Below, we outline the proposed approach of using association rules to score the data. We first review association rule mining and then give the proposed approach. We also identify the problems faced in using association rule mining for our task. Solutions to these problems are briefly discussed.

Association rules are an important class of regularities that exist in databases. Since it was first introduced in [2], the problem of mining association rules has received a great deal of attention. The classic application is market basket analysis [2]. It analyzes how the items purchased by customers are associated. An example association rule is as follows,

cheese → beer  [sup = 10%, conf = 80%]

This rule says that 10% of customers buy cheese and beer together, and those who buy cheese also buy beer 80% of the time.

The association rule mining model can be stated as follows: Let I = {i1, i2, ..., im} be a set of items. Let D be a set of transactions (the database), where each transaction d (a data case) is a set of items such that d ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X → Y holds in the transaction set D with confidence c if c% of transactions in D that support X also support Y. The rule has support s in D if s% of the transactions in D contain X ∪ Y.

Given a set of transactions D (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than or equal to the user-specified minimum support (called minsup) and minimum confidence (called minconf). Minsup controls the number of transactions that a rule must cover. Minconf controls the predictive strength of the rule. The two values are set according to the application.
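The two measures are easy to state in code. The transactions below are toy data invented purely to make the cheese/beer example concrete; the numbers printed are for this toy set, not the 10%/80% of the rule quoted above.

    def support_confidence(transactions, X, Y):
        """Support and confidence of the rule X -> Y over a list of item sets."""
        n = len(transactions)
        n_x = sum(1 for t in transactions if X <= t)
        n_xy = sum(1 for t in transactions if (X | Y) <= t)
        return n_xy / n, (n_xy / n_x if n_x else 0.0)

    D = [{"cheese", "beer"}, {"cheese", "beer", "bread"}, {"cheese"},
         {"beer"}, {"bread"}]
    print(support_confidence(D, {"cheese"}, {"beer"}))   # (0.4, 0.666...)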


Proposed approach: The proposed approach consists of two steps:

1. Generate all association rules from the training data. The consequent (or the right-hand side) of each rule should be a class. We call such association rules the class association rules (CAR) [17]. Normal association rules do not have such a restriction. The generated CARs may also be subjected to a pruning operation to remove those non-predictive or insignificant rules.

2. Make use of all the rules to score the future (or test) data. We have designed and implemented a technique that uses the rules to assign a likelihood score to each future (or test) data case.

Problems faced: Using the association rule model for scoring the data faces some interesting problems. Essentially, the classic association rule model is not suitable for our task because of some inherent restrictions of the model and the specific nature of our task.

• Association rules with a fixed target: In our applications, we only need to mine class association rules or CARs, which use the class attribute as the target. In normal association rule mining, there is no fixed target attribute on the right-hand side of the rules. We have adapted the well-known Apriori algorithm [3] for normal association rule mining to mine only CARs.

• Problem with minimum support: Theoretically, both the computational complexity and the number of rules produced grow exponentially for association rule mining. Minsup holds the key to making association rule mining practical. Minsup is used to reduce the search space and to limit the number of rules generated. However, using only a single minsup implicitly makes the assumption that all items in the data are of the same nature and/or have similar frequencies [18] in the database. This is not the case for our task because the class distribution is often extremely imbalanced. Using the existing association rule model can cause the following problems:

1. If the minsup is set too high, we may not find those rules that involve the minority class, which is often the class that the user is interested in.

2. In order to find rules that involve the minority class, we have to set the minsup very low. This may cause a combinatorial explosion because the majority class may have too many rules, most of them overfitted with many conditions and covering very few data cases. These rules have little predictive value. They also cause increased execution time. Hence, we should allow different classes to have different minsup values.

• Problem with minimum confidence: The classic association rule model allows the user to specify only one minconf. This is also inadequate for our task. For example, in a database, it is known that only 5% of the people are buyers and 95% are non-buyers. If we set the minconf at 96%, we may not be able to find any rule of the buyer's class because it is unlikely that the database contains reliable rules of the buyer's class with such a high confidence. If we set a lower confidence, say 50%, we will generate many rules with confidence between 50% and 95% for the non-buyer's class, and such rules are completely meaningless.



In the proposed approach, these problems are solved. We have extended the classic association rule model to allow the user to assign more than one support and more than one confidence value (see also [18]).

After all the rules are generated, we need to find a way to use the rules to score the data. We have designed such a technique, which will be presented in Section 5.

1.2 Our contributions

This paper makes the following contributions:

1. It proposes a technique that uses association rules to score the data. To the best of our knowledge, association rules have not been used for such a purpose. Due to the completeness property of association rule mining, its resulting set of rules allows us to compute a more accurate probability estimate for each unseen (or test) data case.

2. It proposes a method to interpret the evaluation results of different scoring techniques. In [16], lift index (see Section 3) is proposed as a measure to evaluate and to compare different scoring methods. However, no study is made on how to interpret the improvement in lift index of one technique over another. This paper proposes such a scheme (see Section 6.2). This is significant because otherwise we will not be able to tell what it means by an improvement in lift index of one technique over another.

3. The proposed technique does not require the data to be in memory during rule generation because we use the association rule mining technique. Traditional classification systems require the whole data set to fit in memory (there are works that scale up classification systems [20, 11]).

4. It identifies a few problems with association rule mining when applied to real-life applications. The association rule model is extended to handle these problems, e.g., the problems of a single minsup and a single minconf. A more general solution to these problems can be found in [18].

5. From a broader perspective, we believe that association rules not only provide a description of the data (all the rules), but can also be used for different types of prediction tasks. We aim to design and implement a data mining system with (multiple minimum supports [18]) association rule mining as the core for rule generation. The discovered rules can be used for multiple mining tasks, such as classification [17], text categorization and classification, scoring, etc. (Figure 1). The advantage of such a unified approach is that it reduces the learning effort of the user and/or gives him/her great convenience. Currently, there are too many complicated data mining techniques. For instance, for classification alone there are already many (most of these techniques produce similar results). We believe this hinders the widespread use of data mining systems.


[Figure: association rule mining (with multiple minimum supports) at the centre, feeding tasks such as classification or prediction, text categorisation, and identifying interesting patterns.]

Figure 1. (Multiple minimum supports) association rule mining centered data mining

2. Related work


To the best of our knowledge, there is no existing work that makes use of association rules for scoring the data. Current research and applications typically use classification systems for the purpose. For example, [16] uses classification systems in a number of direct marketing applications. It also shows that classification accuracy is not adequate for this type of task, and proposes a new measure called lift index (see Section 3.2) to evaluate the scoring results and scoring techniques. Lift index is based on the concept of lift analysis commonly used in marketing research. However, the paper does not give an in-depth analysis of the behavior of the lift index. In particular, it does not study how improvement in lift index of one technique over another can be translated into the improvement of scores of data cases. This paper makes such an attempt.

One of the key characteristics of the task that we are interested in is the highly imbalanced class distribution. This problem has been recognized and studied by many machine learning researchers in recent years [e.g., 6, 9, 26, 14]. A commonly used technique is to boost up the number of cases (or records) of the minority class by oversampling with replacement [6, 16]. However, their purpose is to improve the predictive accuracy, which is not directly applicable to scoring. In [16], it is shown that imbalanced data is not a problem if the classification system is made to output a confidence factor (or probability estimate) rather than only a definite class. In the evaluation section (Section 6) of this paper, we will also show that boosting up the minority class data does not help the scoring task.

[17] proposes a technique to use association rules for prediction (or classification). The technique first generates association rules and then selects a special subset of the rules to produce a classifier. It is shown that such a classifier can be very accurate. However, the technique is not designed for scoring. When it is applied to scoring, it does not perform as well as the proposed technique, which does not select any rule but uses the whole rule set. Additionally, association rules


generated in this work are slightly different from those in [17] because we allow different classes to have different minimum supports and minimum confidences. The technique in [17] only uses a single minimum support and a single minimum confidence. [22] and [8] also propose some techniques for classification using association rules.

Regarding association rule mining, since it was first introduced in [2], an extensive study has been made in the past by many researchers [e.g., 2, 3, 1, 4, 5, 7, 12, 17, 18, 21, 25, 29, 30]. Many algorithms have been proposed to improve the mining efficiency [3, 5, 21, 30], to mine generalized association rules [12, 29], to mine association rules on-line [7, 1], etc. The model used in these studies, however, has always been the same, i.e., finding all rules that satisfy the user-specified minimum support and minimum confidence constraints. We have discussed the inadequacy of a single minimum support and a single minimum confidence for our applications. [18] presents a general solution to this problem. Mining of class association rules with more than one minsup and minconf in this paper is a special case of the technique given in [18].

3. Problem statement

3.1. The scoring and ranking problem

Given a data set (training data) in the form of a relational table, which consists of N cases (or records) described by n distinct attributes, Attr1, ..., Attrn. An attribute can be a categorical or a numeric attribute. The N cases have been classified into q known classes, C1, ..., Cq. In an application, we are interested in only the data cases of one class, e.g., buyers in direct marketing. We call this class the positive class, and the data cases in the class the positive cases. The positive class is often a minority class. We call the rest of the classes the negative classes.

In this work, we treat all the attributes uniformly. For a categorical attribute, all the possible values are mapped to a set of consecutive positive integers. For a numeric attribute, its value range is discretized into intervals, and the intervals are also mapped to consecutive positive integers. With these mappings, we can treat a data record as a set of (attribute, integer-value) pairs and a class label. We call each (attribute, integer-value) pair an item. Discretization of numeric attributes can be done using existing algorithms in the machine learning literature [e.g., 10].

Let D be the dataset. Let I be the set of all items in D, and Y be the set of class labels. We say that a data record (or case) d ∈ D contains X ⊆ I, a subset of items, if X ⊆ d. A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y. A rule X → y holds in D with confidence c if c% of the cases in D that contain X are labeled with class y. The rule X → y has support s in D if s% of the cases in D contain X and are labeled with class y.

Our objectives are:


(1) to generate the complete set of CARs for each class y that satisfy the user-specified minimum support (denoted by minsup(y)) and minimum confidence (denoted by minconf(y)) of the class. We will discuss how to assign minimum supports and minimum confidences in Section 4.4.

(2) (optional) to prune those non-predictive rules in rule generation.

(3) to use the discovered rules to score each data case in the test (or future) data set, i.e., to assign a probability estimate to indicate how likely it is that the data case belongs to the positive class. The data cases in the test set are then ranked according to the scores.

The proposed algorithm (called SBA) consists of two parts, a rule generator (called SBA-rg), which generates all CARs and is based on the Apriori algorithm for finding normal association rules given in [3], and a scorer (called SBA-sr) for scoring and ranking the data using the discovered CARs (or the remaining CARs after pruning).

3.2. Evaluation criterion: lift index

The scoring problem is related to classification. In classification, predictive accuracy (or misclassification error rate) is commonly used as the measure to evaluate classifiers. However, we have indicated that predictive accuracy is not adequate for our task. The reasons are (see also [16]):

1. Classification accuracy is often very low (especially in predicting positive cases) for data sets that are very noisy and have highly imbalanced class distributions. As discussed in the introduction section, the classification model is not suitable for our task as it does not give users the flexibility to choose sufficient cases that are most likely to be positive. Clearly, classification accuracy cannot be used to evaluate the scoring results.

2. Classification accuracy treats false positives and false negatives equally. But for our task, this is not suitable. A false negative (e.g., recognizing a buyer as a non-buyer) is highly undesirable. A false positive (e.g., recognizing a non-buyer as a buyer), although still undesirable, is less harmful.

[16] proposes to use lift index to evaluate scoring results. Lift index is based on lift analysis in marketing research. Lift analysis works as follows [13]: A predictive model is first built using the training data, which is then applied to the test data to give each test case a score to express the likelihood that the test case belongs to the positive class. The test cases are then ranked according to the scores in the descending order. After that, the ranked list is divided into 10 equal deciles (could be more partitions), with the cases that are most likely to be positive being in decile 1 (the top decile), and the cases that are least likely to be positive in decile 10 (the bottom decile).

Example 3: In a marketing application, we have a test data set of 10,000 cases or potential customers (thus each decile has 1,000 cases or 10% of the total size).


Out of these 10,000 cases, there are 500 positive cases (or buyers). After scoring and ranking (assume modeling has been done), the distribution of the positive cases is shown in the lift table below (the second row in Table 1).

Table 1: An example lift table

    Decile                 1     2     3     4     5     6     7     8     9     10
    % of total cases      10%   10%   10%   10%   10%   10%   10%   10%   10%   10%
    Number of positives   210   120    60    40    22    18    12     7     6     5

From Table 1 we see that more positive cases are gathered in the earlier deciles than in the later deciles. This represents a lift due to modeling [13]. Without modeling, the positive cases would be randomly distributed in the 10 deciles.

It is, however, not clear how to use lift tables to compare the performances of two models because there are too many numbers to compare and the numbers also reside in different deciles. [16] proposes a single measure called lift index to evaluate and to compare different models or different lift tables. Essentially, the lift index is a weighted sum of the positive cases in the lift table. Assume the numbers of positive cases in the 10 deciles are N1, N2, ..., N10, respectively. The lift index (denoted by Lindex) is computed as follows:

    Lindex = (1 × N1 + 0.9 × N2 + 0.8 × N3 + ... + 0.1 × N10) / Σ_{i=1}^{10} N_i

If the distribution of the positive cases is random, the lift index value is 55% (but converges to 50% with more partitions). If all the positive cases land in the first decile, which is the ideal situation, the lift index is 100% [16]. Note that if the number of positive cases in the test data is greater than 10% of the total number of test cases, then the best lift index will not reach 100% because some positive cases have to be put in the subsequent decile(s). For the above example, the lift index is (1×210 + 0.9×120 + ... + 0.1×5)/500 = 85%.

In this paper, we will use lift index to evaluate our technique and to compare it with alternative approaches. (The lift index computation can be easily extended to handle more or fewer bins than 10.)
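For reference, the lift index can be computed directly from scores and class labels as sketched below; the ranking, the equal deciles and the weights 1.0, 0.9, ..., 0.1 follow the definition above, while the function name and argument layout are ours.

    def lift_index(scores, labels, bins=10):
        """labels are 1 for positive cases and 0 otherwise; higher scores rank first."""
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        size = len(order) / bins
        weighted, total = 0.0, sum(labels)
        for rank, i in enumerate(order):
            decile = min(int(rank // size), bins - 1)
            weighted += (1.0 - decile / bins) * labels[i]   # weights 1.0, 0.9, ..., 0.1
        return weighted / total if total else 0.0

With the cases arranged as in Example 3 (500 positives distributed 210, 120, ..., 5 over the deciles), this returns 0.85, matching the 85% computed above.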

4. Generating the complete set of CARs

In this section, we discuss rule generation, i.e., the SBA-rg algorithm with rule pruning. We also give some guidelines on how to assign minimum support and minimum confidence to each class. In the next section, we describe how to use the generated rules for scoring.


4.1 Basic concepts used in SBA-rg

The key operation of SBA-rg is to find all ruleitems that have support above minsup(y). A ruleitem is of the form:

<condset, y>


where condset is a set of items, y E Y is a class label. The support count of the condset (called condsupCount) is the number of cases in the database D that contain the condset. The support count of the ruleitem (called rulesupCount) is the number of cases in D that contain the condset and are labeled with class y. Each ruleitem basically represents a rule:

condset → y,

whose support is (rulesupCount / |D|) * 100%, where |D| is the size of the database, and whose confidence is (rulesupCount / condsupCount) * 100%.

Ruleitems that satisfy minsup(y) are called frequent ruleitems, while the rest are called infrequent ruleitems. For example, the following is a ruleitem:

<{(A, 3), (B, 2)}, (class, 1)>,

where A and B are attributes. If the support count of the condset {(A, 3), (B, 2)} is 3, the support count of the ruleitem is 2, and the total number of cases in D is 10, then the support of the ruleitem is 20%, and the confidence is 66.7%. If the minimum support of class 1 (or minsup(1)) is 10%, then the ruleitem satisfies minsup(1). We say the ruleitem is frequent.

If the confidence of a ruleitem of class y is greater than minconf(y), we say that the ruleitem is accurate. The set of class association rules (CARs) thus consists of all ruleitems that are both frequent and accurate.

4.2 The SBA-rg algorithm

SBA-rg is based on the Apriori algorithm given in [3]. It generates all frequent ruleitems by making multiple passes over the data. In the first pass, it counts the support of individual ruleitem and determines whether it is frequent. In each subsequent pass, it starts with the seed set of ruleitems found to be frequent in the previous pass. It uses this seed set to generate new possibly frequent ruleitems, called candidate ruleitems. The actual supports for these candidate ruleitems are calculated during the pass over the data. At the end of the pass, it determines which of the candidate ruleitems are actually frequent. From this set of frequent ruleitems, it produces the rules (CARs).

Let k-ruleitem denote a ruleitem whose condset has k items. Let Fk denote the set of frequent k-ruleitems. Each element of this set (or a frequent k-ruleitem) is of the following form:

<(k-condset, condsupCount), (y, rulesupCount)>.


Let Ck be the set of candidate k-ruleitems. The SBA-rg algorithm is given in Figure 2 (next page).

1   F1 = {frequent 1-ruleitems};
2   CAR1 = {r = <X, y> ∈ F1 | r.rulesupCount / X.condsupCount ≥ minconf(y)};
3   prCAR1 = pruneRules(CAR1);
4   for (k = 2; Fk-1 ≠ ∅; k++) do
5       Ck = candidateGen(Fk-1);
6       for each data case d ∈ D do
7           Cd = ruleSubset(Ck, d);
8           for each candidate c ∈ Cd do
9               c.condsupCount++;
10              if d.class = c.class then c.rulesupCount++
11          end
12      end
13      Fk = {c = <X, y> ∈ Ck | c.rulesupCount ≥ minsup(y)};
14      CARk = {r = <X, y> ∈ Fk | r.rulesupCount / X.condsupCount ≥ minconf(y)};
15      prCARk = pruneRules(CARk);
16  end
17  CARs = ∪k CARk;
18  prCARs = ∪k prCARk;

Figure 2: The SBA-rg algorithm

Lines 1-3 represent the first pass of the algorithm. It counts the item and class occurrences to determine the frequent 1-ruleitems (line 1). From this set of 1-ruleitems, a set of CARs (called CAR1), which are both frequent and accurate, is generated in line 2. CAR1 is subjected to a pruning operation (line 3), which can be optional. Pruning is also done in each subsequent pass to CARk (line 15). The pruneRules function uses the pessimistic error rate based pruning in C4.5 [27] (see Section 4.3).

For each subsequent pass, say pass k, the algorithm performs 4 major operations. First, the frequent ruleitems Fk-1 found in the (k-1)th pass are used to generate the candidate ruleitems Ck using the candidateGen function (line 5). It then scans the database and updates the various support counts of the candidates in Ck (lines 6-12). After the new frequent ruleitems have been identified to form Fk (line 13), the algorithm produces the rules CARk (line 14). Finally, rule pruning is performed (line 15).

The candidateGen function is similar to the Apriori-gen function in algorithm Apriori except that only those (k-1)-ruleitems with the same class are used to generate the candidate k-ruleitems of that class. This is to ensure that the different minsups are satisfied. Also, if a (k-1)-ruleitem already has an item from a particular attribute, it will not join with another item from the same attribute to produce a candidate ruleitem.
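The join step of this candidate generation can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation; the subset-pruning step of Apriori-gen is omitted, and the data structures (frozensets of attribute-value pairs paired with a class label) are assumptions.

    # Illustrative sketch of the candidateGen idea: (k-1)-ruleitems are joined only
    # when they share the same class label, and two items on the same attribute are
    # never combined.  (Apriori-gen's pruning of candidates with infrequent subsets
    # is left out for brevity.)

    from itertools import combinations

    def candidate_gen(F_prev):
        """F_prev: a set of (condset, y) pairs, where condset is a frozenset of
        (attribute, value) tuples and y is the class label."""
        candidates = set()
        for (c1, y1), (c2, y2) in combinations(F_prev, 2):
            if y1 != y2:
                continue                          # only join ruleitems of the same class
            joined = c1 | c2
            if len(joined) != len(c1) + 1:
                continue                          # condsets must differ in exactly one item
            attrs = [a for a, _ in joined]
            if len(attrs) != len(set(attrs)):
                continue                          # no two items from the same attribute
            candidates.add((frozenset(joined), y1))
        return candidates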

The ruleSubset function takes a set of candidate ruleitems Ck and a data case d to find all the ruleitems in Ck whose condsets are supported by d. This and the operations in lines 8-10 are also similar to those in algorithm Apriori. The difference is that we need to increment both the support counts of the condset and the ruleitem, whereas in algorithm Apriori only one count is updated. This allows us to compute the confidence of the ruleitem. These counts are also useful in rule pruning.

The final set of class association rules is in CARs (line 17). The rules remaining after pruning are in prCARs (line 18).

4.3 Rule pruning

The pruneRules function uses the pessimistic error rate based pruning method in C4.5 [27]. It prunes a rule as follows: if rule r's estimated error rate is higher than the estimated error rate of rule r⁻ (obtained by deleting one condition from the condset of r), then rule r is pruned. Note that when r is a 1-condition rule of the form x → y, then r⁻ is "→ y", which has no condition.

Let N be the number of training cases covered by r, and E be the number of errors of r (a rule covers a data case if the data case satisfies the conditions of the rule). It is very unlikely that r has an error rate as low as E/N when it is used on unseen or test cases. Instead of using E/N as the estimated error rate of r, we estimate the true error rate as the upper limit (denoted by U_CL(r)) of the confidence interval of this error for some specified confidence level CL for the binomial distribution [24, 27]. The default confidence level used in C4.5 is 0.25, which is also used in our system. This pruning can cut down the number of rules generated substantially (see Appendix).
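The pessimistic estimate can be illustrated with the following sketch, which computes the upper limit of the binomial confidence interval for the error rate given E errors out of N covered cases. It uses SciPy's beta quantile (a Clopper-Pearson style bound); C4.5's own implementation uses its own approximation, so treat this only as an approximate illustration, and the function name is an assumption.

    # Sketch of the pessimistic error estimate U_CL(r): the upper bound U such that
    # P(X <= E | error rate U, N trials) = CL for a binomial X.

    from scipy.stats import beta

    def upper_error_limit(E, N, CL=0.25):
        """Upper limit of the error-rate confidence interval at level CL."""
        if E >= N:
            return 1.0
        return beta.ppf(1.0 - CL, E + 1, N - E)

    # e.g., a rule covering 20 cases with 1 error: the pessimistic estimate is used
    # in place of the optimistic 1/20.
    print(upper_error_limit(1, 20))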

An important point that needs to be noted is that when attempting to prune a rule r, the r⁻ used (for a k-condition rule r, there are k r⁻'s) may have been pruned previously. In that case, the procedure needs to go back to the rule that pruned that r⁻ and use that rule to prune r.

The pruning procedure is given in Figure 3 below.

Procedure pruneRules(CARk)
1   for each r ∈ CARk do
2       Compute the estimated error rate U0.25(r) of r;
        /* We use a confidence level of 0.25 */
3       for each r⁻ of r do                 /* There are k such r⁻'s */
4           if r⁻.prune ≠ null then         /* If r⁻ was pruned previously */
5               r⁻ = r⁻.prune;
6           Compute the estimated error rate U0.25(r⁻) of r⁻;
7           if U0.25(r) ≥ U0.25(r⁻) then    /* r is not significant compared to r⁻ */
8               r.prune = r⁻;               /* r can be pruned */
9               exit-for;
10          endif
11      endfor
12  endfor

Figure 3. The pruneRules function


In line 1, it iterates over the rules in CARk. Line 2 computes the estimated error rate of r. Line 3 tries to use each r⁻ to prune r. However, if r⁻ has been pruned previously, which is indicated by r⁻.prune ≠ null (line 4), then the algorithm finds the rule that prunes r⁻ (line 5). Line 6 computes the estimated error rate of r⁻. If U0.25(r) ≥ U0.25(r⁻), then r is pruned. Line 8 records that r is pruned because of r⁻.

4.4 Assigning minimum supports and minimum confidences

Finally, we present some guidelines on how to assign minimum support and minimum confidence to each class. We discuss minimum support first. In normal situations, the user only needs to set one total or overall minimum support (called cminsup), which is then distributed to each class automatically according to the class distribution as follows:

minsup(y) = cminsup × f(y) / |D|

where f(y) is the number of class y cases in the training data and |D| is the total number of cases in the training data. The reason for using this formula is to give frequent classes higher minsups and infrequent classes lower minsups. This ensures that we will generate enough rules for infrequent classes and will not produce too many meaningless rules for frequent classes. Of course, the user is also free to assign a minsup to each class individually.

For minimum confidence, again, the user can assign any minimum confidence for each class. In SBA, we use the following formula to automatically assign minimum confidence to each class, which performs very well:

minconf(y) = f(y) / |D|

The reason for using this formula is that we should not produce rules of class y whose confidence is less than f(y)/|D|, because such rules make no sense.
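The two assignments above can be stated in a few lines of code. The following sketch distributes the total minimum support cminsup over the classes in proportion to class frequency and sets each class's minimum confidence to its relative frequency; the function name and the example labels are illustrative assumptions.

    # Sketch of the automatic per-class threshold assignment:
    #   minsup(y)  = cminsup * f(y) / |D|
    #   minconf(y) = f(y) / |D|

    from collections import Counter

    def per_class_thresholds(class_labels, cminsup=0.01):
        counts = Counter(class_labels)
        total = len(class_labels)
        minsup = {y: cminsup * counts[y] / total for y in counts}
        minconf = {y: counts[y] / total for y in counts}
        return minsup, minconf

    labels = ["neg"] * 90 + ["pos"] * 10
    print(per_class_thresholds(labels))   # the minority class gets a lower minsup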

5. Scoring and ranking the data

After the rules (CARs or prCARs) are generated (from the training data), they are ready to be used to score and rank the new (or test) data. Since each association rule comes with a support and a confidence, it is easy to design a method to score the data. However, designing the best method is very difficult, if not impossible, because there is an infinite number of possible methods. The challenge is thus to find a heuristic method that both performs well (i.e., produces good lift indices) and runs efficiently.

In this section, we first present a simple method for scoring, and then present the method used in SBA-sr. Although the results produced by the simple technique are not as good as those produced by the method in SBA-sr, they are still better than those produced by C4.5 on average (see Section 6.1).

All the methods presented below consist of the following two steps:

1. Scoring: It computes a value or score for each data case to indicate how likely the case is to belong to the positive class.

2. Ranking: It ranks the whole set of data cases using the scores assigned to them. When some data cases have the same score, conflict resolution is performed to decide the ordering among the conflicting cases.

5.1 A simple scoring and ranking method

The purpose of presenting this simple scoring method is two-fold. We wish to show that (1) it is easy to design a technique for scoring using association rules, and (2) it seems that a simple and reasonable technique can outperform the state-of-the-art classification system C4.5 (we will see the evaluation results in Section 6.1). This demonstrates the power of more rules.

Best rule method: We simply choose the rule with the highest confidence, in fact the highest precedence, in the positive class that covers the data case, and use it to score the case. The score is the confidence value of the rule. For rules that cover the data case but whose class is not the positive class, we can obtain each rule's confidence (and support) with respect to the positive class by saving such information during rule generation (if the data set has only two classes, it can be computed directly).

Given any two rules, the precedence relation of the rules is defined as follows:

Definition: Given two rules ri and rj with the positive class, ri ≻ rj (also called ri precedes rj, or ri has a higher precedence than rj) if

1. the confidence of ri is greater than that of rj, or

2. their confidences are the same, but the support of ri is greater than that of rj, or

3. both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.

This method simply uses the precedence relation to choose the "best" rule (i.e., the one with the highest precedence) that covers the data case and uses it to score the data case. For those data cases that do not satisfy any rule, we simply assign the score 0.
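The best rule method amounts to a few lines of code. The following sketch encodes the precedence relation and the scoring step; the rule objects and their fields (conf, sup, rule_id, covers) are assumptions made for illustration, not names from the chapter.

    # Sketch of the best rule method: order rules by confidence, then support, then
    # generation order, and score a case with the confidence of the highest-precedence
    # positive-class rule that covers it.

    def precedes(r1, r2):
        """True if r1 has higher precedence than r2 (earlier generation wins ties)."""
        return (r1.conf, r1.sup, -r1.rule_id) > (r2.conf, r2.sup, -r2.rule_id)

    def best_rule_score(case, positive_rules):
        covering = [r for r in positive_rules if r.covers(case)]
        if not covering:
            return 0.0                      # cases not covered by any rule get score 0
        best = max(covering, key=lambda r: (r.conf, r.sup, -r.rule_id))
        return best.conf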

5.2 The technique used in SBA-sr

The simple technique above basically tries to make the best use of the rules with the positive class to push likely positive cases to the front deciles. However, in many data sets, we could not find many good rules (high confidence rules) with the positive class. The rules in CARs (or prCARs) tend to focus on the majority (or negative) classes. Using only the information (confidence and support) on the positive class in the above method strongly favours rules with the positive class. This results in many low confidence rules of the positive class being used frequently to score the data. The negative class rules are not given due consideration. This may miss the chance to push highly probable negative cases to the last few deciles. That is, since the negative class rules often have very high confidences, it is better to consider these rules in the process in order to push those data cases that are unlikely to be positive to the bottom deciles.

The technique used in SBA-sr considers the negative class rules as modifiers to the positive class rules. Hence, the negative class rules also actively contribute to the scoring process. Specifically, we want to achieve the following: when the positive class rules that cover the data case are not confident, but the negative class rules are very confident, the data case should be pushed to a lower decile (i.e., given a very low score), and vice versa. Furthermore, we believe that rule support should also play a part in scoring. A rule with a higher support is more trustworthy than a rule with a lower support. The formula for computing the score (S) for each data case d is given below; it implements the above ideas. The value of S is between 0 and 1, inclusive.

S = ( Σ_{i=1..m} W^i_positive × conf_i + Σ_{j=1..n} W^j_negative × conf^positive_j ) / ( Σ_{i=1..m} W^i_positive + Σ_{j=1..n} W^j_negative )

where:

m is the number of positive class rules that can cover the data case.

n is the number of negative class rules that can cover the data case.

W^i_positive is the weight factor for a positive class rule. We define W^i_positive as follows:

W^i_positive = conf_i × sup_i

conf_i is the original confidence of the positive class rule.

sup_i is the original support of the positive class rule.

W^j_negative is the weight factor for a negative class rule. We define W^j_negative as follows:

W^j_negative = (conf_j × sup_j) / k

conf_j is the original confidence of the negative class rule.

sup_j is the original support of the negative class rule.

k is a constant to reduce the impact of negative class rules. We have performed many experiments to determine k; when k = 3, the system performs the best with our 20 test data sets (see the test sets in the Appendix).

conf^positive_j is the confidence after converting a negative class rule to a positive class rule. For instance, if we have only two classes in our database and the confidence of a negative class rule is 60%, then the confidence after converting it to a positive class rule is 40%.

For conflict resolution, we compute a priority value (P) using the following formula:

P = ( Σ_{i=1..m} sup_i - Σ_{j=1..n} sup_j ) / (m + n)

This formula uses the supports of the rules to calculate the priority. Basically, we give data cases with higher positive supports higher priorities. When a data case does not satisfy any rule (i.e., m = n = 0), we assign S = 0 and P = 0.

The combined algorithm for computing both S and P for each test data case d is given below (Figure 4; each variable in the algorithm is initialized to 0). Our experiments show that the S and P values can effectively separate negative cases away from the top deciles, and hence generate good lift index results. This algorithm can also be computed efficiently (see the evaluation section).

1   for each r in CARs (or prCARs) do
2       if r covers the data case d then           /* d satisfies the condition of r */
3           if r is a positive class rule then
4               W^i_positive = r.conf_i * r.sup_i;
5               temp_s = temp_s + W^i_positive * r.conf_i;
6               Wi = Wi + W^i_positive;
7               temp_p = temp_p + sup_i
8           else                                   /* r is a negative class rule */
9               W^j_negative = r.conf_j * r.sup_j;
10              temp_s = temp_s + (W^j_negative * r.conf^positive_j) / 3;
11              Wj = Wj + W^j_negative;
12              temp_p = temp_p - sup_j;
13          endif
14          numRules = numRules + 1
15      endif
16  endfor
17  S = temp_s / (Wi + Wj);     /* Note that this algorithm does not consider numRules = 0, */
18  P = temp_p / numRules;      /* which is easy to add */

Figure 4. The SBA-sr algorithm
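The logic of Figure 4 can be restated compactly in Python. The following is a sketch under simplifying assumptions, not the authors' code: each rule is assumed to carry conf, sup, its class label, the confidence it would have for the positive class (conf_pos), and a covers() predicate; all of these names are illustrative, and k is fixed to 3 as in the chapter.

    def score_and_priority(case, rules, k=3):
        num, den, temp_p, n_rules = 0.0, 0.0, 0.0, 0
        for r in rules:                        # rules = CARs or prCARs
            if not r.covers(case):
                continue
            if r.cls == "positive":
                w = r.conf * r.sup             # weight of a positive class rule
                num += w * r.conf
                temp_p += r.sup
            else:
                w = (r.conf * r.sup) / k       # negative class rules are damped by k
                num += w * r.conf_pos          # confidence w.r.t. the positive class
                temp_p -= r.sup
            den += w
            n_rules += 1
        if n_rules == 0:                       # case not covered by any rule
            return 0.0, 0.0
        return num / den, temp_p / n_rules     # (S, P)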


6. Evaluation

We now compare the proposed technique with the state-of-the-art classification system C4.5 (release 8, the most recent release). We used 20 data sets in our experiments. Five (5) of the 20 are our real-life application data sets. The rest (15) are obtained from the UCI Machine Learning Repository [23]. We could not make use of many data sets in the UCI Machine Learning Repository for our experiments because their class distributions are not imbalanced, and imbalanced class distributions are the main characteristic of the applications that we are interested in. Thus, we only selected those data sets with imbalanced class distributions. For each data set we chose a class that has a very low frequency as the positive class. The description of these data sets is given in the Appendix.

Note that a typical real-life application data set for scoring has only two classes of data [13, 16], a positive class and a negative class. Many data sets from the UCI Machine Learning Repository have more than two classes. In our experiments, for each such data set we chose a minority class as the positive class and grouped the rest of the classes to form a single negative class.

Following the discussion in Section 4.4, we set the minconf for each class according to the data characteristics, i.e., the class distribution. For minsup, we also use the formula presented in Section 4.4. The user only needs to specify the total minsup, i.e., cminsup. We have performed many experiments by varying the cminsup value, which will be presented in Section 6.3. It is shown that when cminsup reaches 1-2%, the results are very good.

For association rule mining, the number of association rules generated can grow exponentially. Unlike a transactional database used in traditional association rule mining [2, 3], which does not have many associations, classification data tends to contain a huge number of associations, which can cause combinatorial explosion. Thus, in our experiments, we set a hard limit on the total number of rules that we handle in memory. We will see in Table 4 (in the Appendix) that for many data sets mining cannot be completed even with a very large rule limit (80,000). Section 6.4 presents the experiment results using different rule limits to see how they affect the lift index results. Finally, we selected 1% for cminsup and 80,000 for the rule limit as the default setting of SBA, as this combination produces good and stable results.

Many data sets that we use contain numeric attributes. We discretize these attributes into intervals using the class attribute in each data set. There are a number of discretization algorithms in the machine learning literature for this purpose. We use the entropy-based method given in [10]. The code is taken from MLC++ [15]. Note that for C4.5 no discretization is needed.
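For readers unfamiliar with entropy-based supervised discretization, the following is a simplified sketch in the spirit of Fayyad and Irani [10]. It only finds the single best binary cut point by information gain; the actual MLC++ code applies the cuts recursively with the MDL stopping criterion, which is omitted here, and the function names are illustrative.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        """Return (cut_point, info_gain) for the best binary split, or (None, 0)."""
        pairs = sorted(zip(values, labels))
        base, n = entropy(labels), len(labels)
        best = (None, 0.0)
        for i in range(1, n):
            if pairs[i][0] == pairs[i - 1][0]:
                continue                                    # only cut between distinct values
            left = [l for _, l in pairs[:i]]
            right = [l for _, l in pairs[i:]]
            gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[1]:
                best = ((pairs[i][0] + pairs[i - 1][0]) / 2, gain)
        return best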

The original C4.5 does not provide any confidence or support value for each prediction. However, such information is available in the system. We have modified C4.5 slightly so that it outputs these values. The confidence is the score used in ranking (the support is used for conflict resolution). In the experiments, all C4.5 parameters had their default values.


6.1. Experiment results

The main experiment results are presented in Table 2. For SBA, we use the default setting (cminsup = 1% and a rule limit of 80,000) in rule generation. Each column of the table is explained below.

Table 2: Experiment results (SBA uses cminsup = 1%, rule_limit = 80,000)

                              With pruning                          Without pruning
       1           2          3       4            5        6       7            8
       Data sets   C4.5tree   SBA     (SBA) Exe.   Best     SBA     (SBA) Exe.   Best
                                      time (sec)   rule             time (sec)   rule

 1     Adult        81.7       82.8    94.52        83.0     84.5   350.81        82.9
 2     allbp        95.2       98.0     7.20        98.8     99.6    19.34        98.8
 3     anneal_5    100.0      100.0     0.11       100.0    100.0     0.77       100.0
 4     anneal_U    100.0       99.3     0.05        96.4    100.0     0.50        96.4
 5     auto         93.0       89.0     0.16        88.0     91.0    42.18        87.0
 6     breast       87.5       89.7     0.22        89.7     89.9     1.54        89.7
 7     german       60.0       75.8     3.51        73.0     76.4    12.25        72.4
 8     hepati       70.0       80.0     0.22        78.6     78.6    29.17        78.6
 9     hypo         98.4       98.7     6.92        98.2     98.6    20.93        98.0
10     labor        85.0       88.3     0.00        86.7     88.3     0.06        88.3
11     led7_0       88.7       95.3     0.54        95.0     95.6     0.94        94.0
12     led7_7       84.3       94.7     0.28        93.9     95.7     1.38        94.4
13     pima         72.4       77.5     0.05        76.3     77.4     0.06        76.4
14     sick         98.0       96.8     6.20        98.2     91.3    20.49        98.2
15     vehicle      67.3       72.8     0.55                 72.7     9.17        64.8
 --------------------------------------------------------------------------------------
16     insur 95     65.4       69.2    14.39        68.7     67.8    60.69        68.8
17     insur 96     64.5       61.5    16.31        59.6     64.1    66.74        59.6
18     insur 97     61.9       62.1    12.20        56.2     60.8    49.44        56.4
19     edupo        59.4       66.5     1.59        55.6     68.2     4.50        55.6
20     edupoa       56.2       70.5     1.76        67.7     70.0     4.67        65.6

       Average      79.4       83.4     8.34        81.4     83.5    34.78        81.3

Column 1: It lists the names of the 20 data sets. The first 15 data sets are the public domain data sets, and the last 5 are our real-life data sets. For the 15 public domain data sets, all the training sets and test sets are obtained from the UCI Machine Learning Repository (they are separated from each other in the Repository). For our 5 real-life application data sets, we use data from some years to generate the rules and then use the data from other years for testing. For rows 3 and 4, we use the same training data and the same testing data, but with different positive classes. In row 3, the positive class used (a minority class) is "5", and in row 4, the positive class used (another minority class) is "U". The same applies to rows 11 and 12. Rows 16, 17, and 18, which are for an insurance application, use the data set (training) collected before 1995 to generate the rules, and test on the data sets from 1995, 1996, and 1997 respectively.


Column 2: It gives C4.5tree's lift index for each data set (unseen test set). In C4.5, there are two programs that can produce classifiers, C4.5tree and C4.5rules. C4.5tree produces a decision tree for classification, while C4.5rules generates rules from the decision tree and then uses the rules for classification (see [27] for details on C4.5's rule generation). We also experimented with C4.5rules. But it does not produce good results because too many test cases fall into the default class, which is used when no rule can classify a test case. The default class always favors the majority class, which is bad for the type of applications that we are interested in (we are interested in the minority class). C4.5rules is also very inefficient. For example, a run of C4.5rules on our insurance data could not be completed after two days. C4.5tree is very efficient. Incidentally, we also used classifiers built by the CBA system [17] for scoring (CBA builds classifiers using association rules [17]). It also does not perform well because of the default class problem, as in the case of C4.5rules.

Since our SBA system requires numeric attributes to be discretized before rule generation, we also ran C4.5 using the discretized data. C4.5tree produces results similar to those without discretization. Using discretized data, the average lift index of C4.5tree is slightly lower, i.e., 78.9%.

As mentioned in the related work section, a commonly used technique in machine learning research for handling imbalanced class distributions is to boost the number of positive cases by oversampling with replacement, while keeping the same number of negative cases. In [6], it is suggested that the desired distribution is 50:50, i.e., 50% positive cases and 50% negative cases. We have experimented with this distribution in rule generation. The result is almost the same. For C4.5tree, the average lift index over the 20 test sets is 79.0%. For SBA, it is 83.2%.

From Column 3 to Column 5 we show the experiment results of the two scoring methods using class association rules with rule pruning, i.e., all the rules in prCARs (see also Section 4.2).

Column 3: It gives the lift index produced by our system SBA for each data set.

Column 4: It shows the execution time of SBA in scoring, ranking and computing the lift index for each test data set (the rule generation time is not included; it is given in the Appendix). It can be seen that SBA is quite efficient. All the experiments with SBA are done on a Pentium II 350 PC with 128MB of memory.

Comparing the results from C4.5tree (Column 2) and SBA (Column 3), we have the following observations:

• In general, SBA performs better than C4.5tree. On average over the 20 data sets, the lift index of SBA is higher than that of C4.5tree by 4%, which is a significant gain. We will explain in Section 6.2 what 4% means in terms of movements of data cases from lower deciles to higher deciles.

• SBA is superior to C4.5tree on 15 (in bold) of the 20 tests. In one test (3), the results are the same.


• In 4 tests, SBA makes dramatic gains over C4.5tree, i.e., increasing the lift index by more than 10%.

In Section 5, we also presented a simple method for scoring, i.e., the best rule method. The lift index results of this method are given in Column 5.

Column 5: It gives the lift index of the best rule method on each test set. We can see that on average its lift index is not as good as that of SBA, but is still better than that of C4.5tree.

From Columns 6-8, we show the results of the scoring methods using class association rules without pruning, i.e., all the rules in CARs (see also Section 4.2).

From Column 6, we see that the lift index of SBA using the complete set of rules (without pruning) is almost the same as that of SBA using the rules after pruning (Column 3). The best rule method also produces very similar results (Column 8) as with pruning (Column 5). The running times, however, are drastically different. The execution time of SBA using the complete set of rules without pruning (Column 7) is much higher than that of SBA using the set of rules after pruning (Column 4). This is because pruning reduces the number of rules drastically (see Appendix). We prefer rules after pruning (i.e., prCARs) because they give the user meaningful and significant regularities that exist in the data. They also do not harm the performance in terms of lift index.

Another important observation is that the average results in Columns 5 and 8 (produced by the simple best rule method) are both better than that of C4.5tree. This demonstrates the power of more rules.

The numbers of rules generated without pruning (CARs) and with pruning (prCARs), and the execution times used in rule generation are given in Appendix.

6.2 Interpreting the improvement in lift index

We have shown that the proposed method SBA can improve the lift index of C4.5 on the 20 data sets by 4% on average. A question that needs to be asked is "What does 4% mean?" This issue is not studied in [16]. Unlike predictive accuracy in classification, which has a clear interpretation, a particular lift index value is more difficult to comprehend. For predictive accuracy, if the accuracy is 90%, it means that 10% of the test cases are not classified correctly. If the accuracy of classifier A on the test data is 80% and the accuracy of classifier B on the same test data is 84%, then the improvement of classifier B over classifier A on the test data can be easily interpreted, i.e., classifier B can classify correctly 4% more test cases than classifier A. However, for lift index, there is no such simple mapping. We propose the following scheme to interpret the improvement in lift index.

Since in the type of applications that we are interested in, the user often wants to choose the data cases in the top few deciles (e.g., for product promotion), we propose to use the average percentage of test cases that would be moved from each decile (except the top decile, or decile 1) to the top decile as the indicator to measure the improvement in lift index.


Let N1 be the number of positive cases in decile 1, N2 be the number of positive cases in decile 2, and so on. Let v be the average number of positive cases that would be moved from each subsequent decile to the first decile in order to achieve an improved lift index from an old lift index. Let T be the total number of positive cases in the test data (i.e., T = N1 + N2 + ... + N10). We use Lindex_imp to denote the improved lift index and Lindex_old the old lift index. We then have the following:

Lindex_imp - Lindex_old
    = [1×(N1 + 9v) + 0.9×(N2 - v) + 0.8×(N3 - v) + ... + 0.1×(N10 - v)] / T
      - [1×N1 + 0.9×N2 + 0.8×N3 + ... + 0.1×N10] / T
    = (9×v - 0.9×v - 0.8×v - ... - 0.1×v) / T
    = 4.5 × v / T

Here v/T (we denote it as α) is the average percentage of positive cases that would be moved to the top decile from each subsequent decile due to a certain amount of improvement (denoted by β) in lift index (i.e., β = Lindex_imp - Lindex_old). Then, the average percentage of positive cases that would be moved to the top decile from the other deciles due to β is 9×α. We then have

α = β / 4.5   and   9×α = 2×β

This says that the average percentage of positive cases in the test data that would be moved to the top decile from the other deciles due to a β improvement in lift index is 2β, regardless of the values of Lindex_imp and Lindex_old. Thus, with the improvement of 4% in lift index in our experiments, we would have 8% more positive cases in the first decile on average, which is very significant. We also have:

Lindex_imp - Lindex_old = 4%
4.5 × α = 4%
α = 0.89%

The 4% improvement in lift index also means that on average 0.89% of positive cases in the test data will shift from each decile (except the top decile or decile 1) to the top decile. To make all these more concrete and also to link them to an application, we use a direct marketing example to illustrate.

Example 4: Assume our test database has 100,000 potential buyers, and out of these there are 1000 actual buyers. If we send promotion packages to the people in the first decile, on average we will catch 8% (= 2×4% or 9×0.89%) more buyers (or 80 extra buyers in absolute terms). 8% represents a significant upward movement of positive cases. If we promote to the people in the first two deciles, we will catch 7.12% (8×0.89%) more buyers (or 71.2 extra buyers), because the movement from decile 2 to decile 1 cannot be counted. If we promote to the people in the first three deciles, we will catch 6.23% (7×0.89%) more buyers (or 62.3 extra buyers), and so on. Table 3 depicts the complete situation. To a marketer, these movements of buyers from lower deciles to upper deciles are quite significant.

Table 3: Positive data movement table with a 4% lift index improvement

From the above discussion and the example, we can see that the improvement in terms of movements of positive cases is more dramatic than what 4% suggests.
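The arithmetic above is easy to verify mechanically. The following few lines (an illustrative check, not part of the original chapter) reproduce the 80, 71.2 and 62.3 extra-buyer figures of Example 4; the small discrepancies in the last digit come from the chapter rounding α to 0.89%.

    beta = 0.04                       # 4% lift index improvement
    alpha = beta / 4.5                # fraction moved from each lower decile
    total_buyers = 1000

    print(round(9 * alpha, 4))        # 0.08, i.e., 8% extra buyers in decile 1
    for d in (1, 2, 3):               # promoting to the first d deciles
        extra = (10 - d) * alpha * total_buyers
        print(d, round(extra, 1))     # 80.0, 71.1, 62.2 (chapter: 80, 71.2, 62.3)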

6.3 Effects of cminsup

To show how the total minimum support (cminsup) affects the lift index results of SBA, we performed experiments by varying the cminsup value using the default rule limit of 80,000. Figure 5 shows the average lift index over the 20 tests at various total minimum support levels for SBA (using rules after pruning). It also includes the result from C4.5tree.

[Figure 5 plots the average lift index (roughly 77-84%) of SBA at cminsup values of 0.5%, 1%, 2%, 5%, 6%, 8% and 10%, with the C4.5tree result shown for comparison.]

Figure 5. Effects of cminsup on lift index

From the figure, we see that the lift indices do not change a great deal as the cminsup value increases. At 1-2%, the results are the best.


6.4 Effects of rule limit

The in-memory rule limit is another parameter that can affect the final lift index results. At cminsup = 1%, we experimented with SBA using various rule limits: 30,000, 50,000, ..., and 150,000. The results are shown in Figure 6 (using rules after pruning). The figure also includes the result from C4.5tree.

We can see that the lift indices do not vary a great deal as the rule limit increases. We finally chose cminsup = 1% and a rule limit of 80,000 as the default setting of our system because this combination produces good and stable results.


Figure 6. Effects of rule limit on lift index

7. Conclusion

This paper proposed a method to use association rules to score the data, which is traditionally done by classification systems. The new technique first generates all class association rules (from the training data) with multiple supports and multiple confidences. It then uses these rules to score the test (or future) data. Experiment results show that on average the proposed technique performs significantly better than the state-of-the-art classification system C4.5. In addition, experiments with a naive scoring method indicate that any reasonable technique using association rules could potentially outperform C4.5. This demonstrates the power of more rules as association rule mining finds all rules in data and thus is able to give a complete picture of the domain. A classification system, on the other hand, only generates a small subset of the rules to form a classifier. This small subset of rules only gives a partial picture of the domain.

By no means do we claim that the proposed method is the best method for the task. There can be many other methods, in fact, an infinite number of them. This work only represents the beginning. We believe that there is a great deal of potential for designing even better techniques for scoring and ranking. In our future work, we will explore this further.


Acknowledgement: We would like to thank Yiyuan Xia for modifying C4.5 for scoring purposes. The project is funded by the National Science and Technology Board and the National University of Singapore under the project RP3981678.

References

1. Aggarwal, C. and Yu, P. "Online generation of association rules." ICDE-98, 1998, pp. 402-411.
2. Agrawal, R., Imielinski, T. and Swami, A. "Mining association rules between sets of items in large databases." SIGMOD-93, 1993, pp. 207-216.
3. Agrawal, R. and Srikant, R. "Fast algorithms for mining association rules." VLDB-94, 1994.
4. Bayardo, R., Agrawal, R. and Gunopulos, D. "Constraint-based rule mining in large, dense databases." ICDE-99, 1999.
5. Brin, S., Motwani, R., Ullman, J. and Tsur, S. "Dynamic itemset counting and implication rules for market basket data." SIGMOD-97, 1997, pp. 255-264.
6. Chan, P. K. and Stolfo, S. J. "Towards scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection." KDD-98, 1998.
7. Cheung, D. W., Han, J., Ng, V. and Wong, C. Y. "Maintenance of discovered association rules in large databases: an incremental updating technique." ICDE-96, 1996, pp. 106-114.
8. Dong, G., Zhang, X., Wong, L. and Li, J. "CAEP: classification by aggregating emerging patterns." DS-99: Second International Conference on Discovery Science, 1999.
9. Fawcett, T. and Provost, F. "Combining data mining and machine learning for effective user profiling." KDD-96, 1996.
10. Fayyad, U. M. and Irani, K. B. "Multi-interval discretization of continuous-valued attributes for classification learning." IJCAI-93, 1993, pp. 1022-1027.
11. Gehrke, J., Ganti, V., Ramakrishnan, R. and Loh, W. "BOAT-optimistic decision tree construction." SIGMOD-99, 1999.
12. Han, J. and Fu, Y. "Discovery of multiple-level association rules from large databases." VLDB-95, 1995.
13. Hughes, A. M. The complete database marketer: second-generation strategies and techniques for tapping the power of your customer database. Chicago, Ill.: Irwin Professional, 1996.
14. Kubat, M. and Matwin, S. "Addressing the curse of imbalanced training sets." ICML-97, 1997.
15. Kohavi, R., John, G., Long, R., Manley, D. and Pfleger, K. "MLC++: a machine learning library in C++." Tools with Artificial Intelligence, 1994, pp. 740-743.
16. Ling, C. and Li, C. "Data mining for direct marketing: problems and solutions." KDD-98, 1998.
17. Liu, B., Hsu, W. and Ma, Y. "Integrating classification and association rule mining." KDD-98, 1998.


18. Liu, B., Hsu, W. and Ma, Y. "Mining association rules with multiple minimum supports." KDD-99, 1999.
19. Liu, B., Hsu, W. and Ma, Y. "Pruning and summarizing the discovered associations." KDD-99, 1999.
20. Mehta, M., Agrawal, R. and Rissanen, J. "SLIQ: A fast scalable classifier for data mining." Proc. of the Fifth Int'l Conference on Extending Database Technology, 1996.
21. Mannila, H., Toivonen, H. and Verkamo, A. I. "Efficient algorithms for discovering association rules." KDD-94: AAAI Workshop on Knowledge Discovery in Databases, 1994.
22. Meretakis, D. and Wuthrich, B. "Extending naive Bayes classifiers using long itemsets." KDD-99, 1999.
23. Merz, C. J. and Murphy, P. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], 1996.
24. Mills, F. Statistical Methods, Pitman, 1955.
25. Ng, R. T., Lakshmanan, L. and Han, J. "Exploratory mining and pruning optimizations of constrained association rules." SIGMOD-98, 1998.
26. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T. and Brunk, C. "Reducing misclassification costs." ICML-97, 1997.
27. Quinlan, R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
28. Rastogi, R. and Shim, K. "PUBLIC: A decision tree classifier that integrates building and pruning." VLDB-98, 1998.
29. Srikant, R. and Agrawal, R. "Mining generalized association rules." VLDB-95, 1995.
30. Toivonen, H. "Sampling large databases for association rules." VLDB-96, 1996.

Appendix

Table 4 summarizes the training and testing data sets used in our experiments. Results on rule generation using SBA-rg are also included.

Column 1: It gives the name of each data set.

Column 2: It gives the number of attributes in each data set (training and testing data sets).

Column 3: It gives the characteristics of the training data. The first value gives the number of records or cases in the training data. The second value gives the percentage of positive cases in the training data.

Column 4: Like Column 3, this column gives the same two numbers for the test data.

Column 5: It gives the number of rules generated and the execution time of SBA-rg for each training data set. Here, the default setting of SBA is used, i.e., cminsup = 1% and a rule limit of 80,000. The first value is the total number of rules (CARs) generated without pruning. The second value is the number of rules left after pruning (or prCARs). We see that the number of rules left after pruning is much smaller. The third value is the execution time (in sec.) for rule generation (including pruning, and with data on disk) for each data set (running on a Pentium II 350 PC with 128MB of memory). We see that the rule generation times are reasonable.

Table 4. Description of the training and testing data sets

       1            2         3 Training data        4 Test data           5 Rule generation (1%, 80k)
       Data set     No. of    No. of    % of         No. of    % of        No. of rules        Exe. time
                    Attrs     cases     +ve cases    cases     +ve cases   w/o prn     prn     (sec.)

 1     Adult        14        32561     24.08%       16281     23.62%      80000       6451    142.98
 2     allbp        29         2800      4.43%         972      2.57%      80000       4341     64.82
 3     anneal_5     37          598      8.03%         300      6.33%       2740        146      0.66
 4     anneal_U     37          598      4.35%         300      4.67%       2294        125      0.49
 5     auto         25          136      8.82%          69     14.49%      80000        939      5.99
 6     breast       10          466     36.91%         233     29.61%       4992        158      0.77
 7     german       20          666     31.38%         334     27.25%      80000       5785      8.68
 8     hepati       19          103     24.27%          52     13.46%      80000        972      6.04
 9     hypo         25         2108      4.55%        1055      5.21%      80000       2698     46.00
10     labor        16           40     35.00%          17     35.29%        904         68      0.11
11     led7_0        7          200      7.00%        3000     10.37%        321        106      0.06
12     led7_7        7          200      6.00%        3000      9.73%        546         72      0.06
13     pima          8          512     36.13%         256     32.42%        114         32      0.05
14     sick         28         2800      6.11%         972      6.17%      80000       3955     63.00
15     vehicle      18          564     23.23%         282     28.72%      80000        384     18.02
 ------------------------------------------------------------------------------------------------------
16     insur 95      9        40245      6.50%       39141      6.08%       3097        775     25.48
17     insur 96      9        40245      6.50%       45036      5.85%       3097        775     25.48
18     insur 97      9        40245      6.50%       33729      3.82%       3097        775     25.48
19     edupo        48          638     18.97%         146     23.29%      80000       8166     10.82
20     edupoa       49          638     24.61%         146     26.71%      80000       8061      9.56

       Average      21         8318     16.17%        7266     15.78%      41060       2239     22.73


Finding Unexpected Patterns in Data

Balaji Padmanabhan¹ and Alexander Tuzhilin²

¹ Operations and Information Management Department, The Wharton School, University of Pennsylvania
² Information Systems Department, Stern School of Business, New York University

Abstract Many pattern discovery methods in the KDD literature have the drawbacks of (1) discovering too many obvious or irrelevant patterns and (2) not using prior knowledge systematically. In this chapter we present an approach that addresses these drawbacks. In particular we present an approach to characterizing the unexpectedness of patterns based on prior background knowledge in the form of beliefs. Based on this characterization of unexpectedness we present an algorithm, ZoomUR, for discovering unexpected patterns in data.

1. Introduction

The field of knowledge discovery in databases has been defined in [FPS96] as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from data. However, most of the work in the KDD field focuses on the validity aspect, and the other two aspects, novelty and usefulness, have been studied to a lesser degree. This is unfortunate because it has been observed both by researchers [FPM91, KMR+94, BMU+97, ST95, ST96a, LH96, PT97, P99] and practitioners [S97, F97] that many existing tools generate a large number of valid but obvious or irrelevant patterns. To address this issue, some researchers have studied the discovery of novel [ST95, ST96a, LH96, LHC97, PT97, P99] and useful [PSM94, ST95, ST96a, AT97] patterns.

In this paper, we continue the former stream of research and focus on the discovery of unexpected patterns by using domain knowledge, in the form of beliefs, to seed the search for patterns in data that contradict the beliefs.

The idea of using domain knowledge for the purpose of discovering new knowledge is old and can be traced back to early expert systems such as AM [Len83] and DENDRAL [BF78] that use heuristic search and generate-and-test paradigms for the discovery process. In particular, the AM system discovers new mathematical concepts from the initial set of 115 core set-theoretic concepts by repeatedly applying one of the 250 heuristic rules to the set of already discovered mathematical concepts. The newly generated concepts are then tested for their "interestingness" using such concepts as intuition, aesthetics, utility, richness, interestingness, and relevance [Len83]. Similarly, DENDRAL (and subsequently Meta-DENDRAL) helps organic chemists determine the molecular structure of unknown compounds. This is achieved by generating successively larger and larger structures of molecules using a set of heuristic rules and constraints. The initial approaches to discovering new knowledge from existing knowledge using heuristic searches were extended later in the EURISKO system [LB84], by several researchers working on scientific discovery problems, e.g., [ZZH90, SL90], and also by Buchanan et al. [LBA98]. All this work is related to our approach because these researchers were interested in discovering a broad range of new knowledge, including unexpected knowledge. However, these approaches deal with unexpectedness only in a very limited way and do not directly and formally explain what "unexpectedness" is and how to discover unexpected patterns in a systematic way.

Another stream of work on incorporating domain knowledge into machine learning methods is described in [M80, MK97, PK92]. This type of work mainly deals with inductive learning biases [M80] that constrain learning methods to choose one set of rules over others. Specifically, [MK97] characterizes them as representation biases and preference criteria. A representation bias constrains the form of a pattern to certain types of expressions. A preference bias is induced by specifying criteria for choosing among possible candidate hypotheses. Using this classification, the concept of unexpectedness that we present can be characterized as a preference criterion that focuses the discovery on a certain type of pattern, defined in Section 2. Also, the representation bias in our work restricts the structure of the discovered rules as explained in Section 3.

In summary, the earlier work on using domain knowledge for discovering new knowledge examined only some aspects of unexpectedness and did not focus on this concept per se.

In the KDD community, unexpectedness of a rule relative to a belief system has been considered before in [ST95, ST96a, LH96, LHC97, PT97, Suz97]. The first systematic attempt to define unexpectedness of patterns in the KDD community was reported in [ST95, ST96a] in which "unexpectedness" of a rule is defined relative to a system of user-defined beliefs. A rule is considered to be "interesting" if it affects the degrees of beliefs. This approach is general because it does not impose any assumptions on the structure on beliefs and patterns. Its limitation, however, lies in that it is computationally hard and that the user has to assign degrees to beliefs, which can be a hard problem in some applications. In contrast to this probabilistic approach to unexpectedness, in this paper we present an approach based on logical contradiction.

Liu and Hsu in [LH96] take a different approach and introduce a measure of distance between a rule and a belief based on a syntactic comparison between the rule and belief. In [LH96], a rule and a belief are "different" if either the consequents of the rule and the belief are "similar" but the antecedents are "far apart" or vice versa, where "similarity" and "difference" are defined syntactically based on the structure of the rules. In addition, [LHC97] proposes a method in which users can specify their beliefs by using "generalized impressions" that are easier for the user to specify than specific beliefs. However, the discovery method is based on syntactic comparisons of rules and beliefs. Further, the approach in [LH96] filters interesting rules from a set of rules that need to be generated using some other approach, while our approach presents a belief-driven method to discover only unexpected patterns.

An alternative approach is presented in [Suz97] that discovers "exception rules" in the form of rule-pairs but does not begin with prior background knowledge. However, it has been argued [ST95, ST96a] that unexpectedness is inherently subjective and that prior beliefs of the user are, therefore, an important component of unexpectedness. Further, unexpectedness as defined in [Suz97] can be restrictive since it does not capture some exceptions that are unexpected in the sense defined below.

The approach presented in this paper differs from that in [Suz97] in the following aspects:

• The approach presented in [Suz97] does not depend on prior beliefs but discovers pairs of rules (that can be considered as beliefs) and their exceptions simultaneously. The approach presented in this paper begins with a system of beliefs.

• The approaches consider different types of unexpectedness. The approach presented in this paper is based on the monotonicity of beliefs, while exceptions in [Suz97] are based on the structure of the rule-pair discovered and additional probabilistic constraints.

• The approach in [Suz97] discovers only certain refinements to rules as exceptions, while the approach presented here discovers all refinements that are unexpected, and also unexpected generalizations.

In this paper we present a new definition of unexpectedness, in Section 2, in terms of a logical contradiction of a rule and a belief. We then present, in Section 3, an algorithm for discovering unexpected patterns. Experimental results are discussed in Section 4 followed by conclusions in Section 5.

In this paper, we focus only on the discovery of unexpected patterns given an initial set of beliefs. We do not address the issue of how to build a "good" set of beliefs. We assume that it can be generated using methods described in [ST96b, P99], such as elicitation of beliefs from the domain expert, learning them from data, and refinement of existing beliefs using newly discovered patterns. A similar issue of how to specify an initial set of beliefs has also been addressed in [LHC97].

2. Unexpectedness of a Rule

In order to define the concept of unexpectedness, we first present some preliminaries. We consider rules and beliefs of the form X → A, where X and A are conjunctions of literals (i.e., either atomic formulas of first-order logic or negations of atomic formulas). We keep this definition general and do not impose restrictions on the structure of the atomic formulas that can appear in the literals of X and A. We also associate with the rule some measure of its statistical "strength", such as "confidence" and "support" [AMS+95]. We say that a rule holds on a dataset if the "strength" of the rule is greater than a user-defined threshold value.

We also make an assumption of monotonicity of beliefs. In particular, if we have a belief Y → B that we expect to hold on a dataset D, then the belief will also be expected to hold on any "statistically large"¹ subset of D. If we have a non-monotonic belief (that we expect not to hold for some subset of the data), we incorporate our knowledge of why we do not expect the belief to hold on the subset into the belief, thereby making the belief more specific (as shown in [P99]). We can do this iteratively until we have a set of monotonic beliefs.² Given these preliminary concepts, we define unexpectedness of a rule.

¹ In this paper, a user-specified support threshold is used to determine if the subset is large.

Definition. The rule A → B is unexpected with respect to the belief X → Y on the dataset D if the following conditions hold:

(a) B AND Y |= FALSE. This condition states that B and Y logically contradict each other.

(b) A AND X holds on a statistically large subset of tuples in D. We use the term "intersection of a rule with respect to a belief" to refer to this subset. This intersection defines the subset of tuples in D in which the belief and the rule are both "applicable" in the sense that the antecedents of the belief and the rule are both true on all the tuples in this subset.

(c) The rule A, X → B holds. Since condition (a) constrains B and Y to logically contradict each other, the rule A, X → Y does not hold.

We believe that this definition captures the spirit of "unexpectedness" for the following reasons: (1) The heads of the rule and the belief are such that they logically contradict each other. Therefore in any tuple where the belief and the rule are both "applicable," if the rule holds on this tuple, the belief cannot hold and vice-versa. (2) Since both a rule and a belief hold statistically, it is inappropriate to label a rule "unexpected" if the intersection of the contradicting rule and the belief is very small. Hence we impose the condition that the intersection of the belief and the rule should be statistically large. Within this statistically large intersection, we would expect our belief to hold because of the monotonicity assumption. However if the rule holds in this intersection, the belief cannot hold because the heads of the rule and belief logically contradict each other. Hence the expectation that the belief should hold on this statistically large subset is contradicted. We next present an algorithm, which is an extension of standard association rule generating algorithms [AMS+95], for finding unexpected rules.
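Conditions (a)-(c) can be checked schematically as follows. This is a sketch under simplifying assumptions, not the authors' algorithm: heads are single attribute = value conditions (so logical contradiction reduces to "same attribute, different value"), data cases are dictionaries, A and X are assumed not to assign conflicting values to the same attribute, and min_sup / min_conf play the roles of the "statistically large" and rule-strength thresholds.

    def holds(body, head, D, min_sup, min_conf):
        covered = [t for t in D if all(t.get(a) == v for a, v in body.items())]
        if not covered or len(covered) / len(D) < min_sup:
            return False
        matching = [t for t in covered if t.get(head[0]) == head[1]]
        return len(matching) / len(covered) >= min_conf

    def unexpected(rule, belief, D, min_sup, min_conf):
        (A, B), (X, Y) = rule, belief                    # bodies are dicts, heads are (attr, value)
        contradict = B[0] == Y[0] and B[1] != Y[1]       # (a) B AND Y |= FALSE
        both = [t for t in D if all(t.get(a) == v for a, v in {**A, **X}.items())]
        large = len(both) / len(D) >= min_sup            # (b) A AND X is statistically large
        refined = holds({**A, **X}, B, D, min_sup, min_conf)   # (c) A, X -> B holds
        return contradict and large and refined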

3. Discovery of Unexpected Rules

3.1 Association Rule Preliminaries

In this section we provide an overview of association rules and sketch the algorithms for discovering association rules proposed in [AMS+95]. Let I = {i1, i2, ..., im} be a set of discrete attributes (also called "items" [AIS93]). Let an atomic condition be defined as a proposition of the form "attribute = value", where the attribute can take on a discrete set of mutually exclusive values. An itemset is a conjunction of atomic conditions. Let D = {T1, T2, ..., TN} be a relation consisting of N transactions [AMS+95] T1, ..., TN over the relation schema {i1, i2, ..., im}. A transaction Ti is said to "contain" an itemset if the itemset holds on Ti.

² Converting non-monotonic beliefs to monotonic beliefs can be automated by letting the user specify non-monotonic beliefs with exceptions. Then the system automatically converts these to a set of monotonic beliefs.

An association rule is an implication of the form body → head, where "body" is an itemset and "head" is an itemset that contains only a single atomic condition. The rule holds in D with confidence c if c% of the transactions that contain body also contain head. The rule has support s in D if s% of the transactions in D contain both body and head. The search for association rules is usually constrained to rules that satisfy minimum specified support and confidence requirements. An itemset is said to be large if the percentage of transactions that contain it exceeds the minimum specified support level.

Various efficient algorithms for finding all association rules in transaction databases have been proposed in [AMS+95]. These algorithms operate in two phases.

In the first phase, all large itemsets are generated in an incremental manner. The k-th iteration of Apriori [AMS+95] performs the following two tasks:

(1) generates a set, Ck, of "candidate itemsets", whose support needs to be determined; (2) then evaluates the support of each candidate itemset from the dataset D and determines the itemsets in Ck that are large. The set of large itemsets in this iteration is Lk.

[AMS+95] observes that all subsets of a large itemset are large, which is why the process of computing Ck from the set Lk-1 can be done efficiently. Candidate itemsets of length k are generated from the set of large itemsets of length (k-1) by imposing the constraint that all subsets of length (k-1) of any candidate itemset must be present in the set of large itemsets of length (k-1).

The second phase of the algorithm generates rules from the set of all large itemsets. For example, let I1 = {age = high, income = high} and I2 = {age = high}. From the supports of these two itemsets the confidence, c, of the rule "if (age = high) then (income = high)" can be calculated as c = support({age = high, income = high}) / support({age = high}). Hence in this phase, given the set of all large itemsets, significant rules involving these itemsets are generated.
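The second phase therefore reduces to a division over stored itemset supports, as in the following sketch; the support values reuse the age/income example above and are illustrative only.

    # Sketch of rule confidence from itemset supports (stored as fractions of D).

    supports = {
        frozenset({("age", "high")}): 0.40,
        frozenset({("age", "high"), ("income", "high")}): 0.30,
    }

    def confidence(body, head, supports):
        return supports[frozenset(body | head)] / supports[frozenset(body)]

    print(confidence({("age", "high")}, {("income", "high")}, supports))   # 0.75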

3.2 Discovery of Unexpected Rules

In this section we present an algorithm for discovering unexpected rules. We consider only discrete attributes in this paper. For discrete attributes we differentiate between unordered and ordered attributes in the following sense. For ordered attributes we allow only comparison operators in the corresponding condition. For an ordered attribute even the case when attribute = value is equivalently represented by value ~ attribute ~value. When an unordered attribute is part of a condition, we restrict the operator in that condition to be "=".


The rules and beliefs that we consider are of the form body → head, where body is a conjunction of atomic conditions of the form attribute = value for unordered attributes or of the form value1 ≤ attribute ≤ value2 for ordered attributes, where value, value1, value2 belong to the set of distinct values taken by attribute in the dataset D, and head is an atomic condition not involving any attribute present in body. This definition extends the structure of association rules [AMS+95] by considering discrete domains and conditions involving comparison operators. We consider these extensions since in many applications rules and beliefs involve these additional operators. We further follow the approach taken in [AMS+95] and discover unexpected rules that satisfy user-specified minimum support and confidence requirements.

3.3 Overview of the Discovery Strategy

Consider a belief X → Y and a rule A → B, where both X and A are conjunctions of atomic conditions and both Y and B are single atomic conditions. It follows from the definition of unexpectedness in Section 2 that if an association rule A → B is "unexpected" with respect to the belief X → Y, then the following must hold:

(1) B AND Y |= FALSE.
(2) The rule X, A → B holds.

Hence, for every unexpected rule of the form A → B, it has to be the case that the rule X, A → B also holds.

We present the discovery algorithm ZoomUR ("Zoom to Unexpected Rules"), which consists of two parts: ZoominUR and ZoomoutUR. Given a belief X → Y, algorithm ZoomUR first discovers (in ZoominUR) all rules (satisfying threshold support and confidence requirements) of the form X, A → B, such that B contradicts the head of the belief. We then consider (in ZoomoutUR) other more general and potentially unexpected rules of the form X', A → B, where X' ⊂ X.

The rules that ZoominUR discovers are "refinements" to the beliefs such that the beliefs are contradicted. The rules that ZoomoutUR discovers are not refinements, but more general rules that satisfy the conditions of unexpectedness. For example, if a belief is that "professional → weekend" (professionals tend to shop more on weekends than on weekdays), ZoominUR may discover a refinement such as "professional, december → weekday" (in December, professionals shop more on weekdays than on weekends). ZoomoutUR may then discover a more general rule "december → weekday", which is totally different from the belief "professional → weekend".

3.4 Algorithm ZoominUR

Algorithm ZoominUR is based on algorithm Apriori's ideas [AMS+95] of generating association rules from itemsets in an incremental manner. We use the term "itemset" to refer to a conjunction of atomic conditions, each of the form attribute = value for unordered attributes or of the form value1 ≤ attribute ≤ value2 for ordered attributes, where value, value1, value2 belong to the set of distinct values taken by attribute in the dataset D.

We would like to note that the "range" representation for ordered attributes (value1 5 attribute ::;value2) subsumes any condition of the form attribute 5value or of the form attribute;? value since:

• The range representation value::; attribute ::;valuemax equivalently represents any condition of the form attribute;? value where valuemax is the maximum value taken by the attribute in the dataset D.

• The range representation valuemin ≤ attribute ≤ value equivalently represents any condition of the form attribute ≤ value, where valuemin is the minimum value taken by the attribute in the dataset D.
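As a concrete illustration, the following minimal Python sketch (with hypothetical helper names of our own) normalizes one-sided conditions over an ordered attribute into the range form used here; `values` is assumed to be the collection of distinct values the attribute takes in D.

def to_range(op, value, values):
    """Normalize 'attribute <= value', 'attribute >= value' or 'attribute == value'
    into the (low, high) form low <= attribute <= high."""
    vmin, vmax = min(values), max(values)   # extreme values taken in the dataset D
    if op == "<=":
        return (vmin, value)
    if op == ">=":
        return (value, vmax)
    if op == "==":                          # a point condition as a degenerate range
        return (value, value)
    raise ValueError("unsupported operator: " + op)

# Example: with ages 1..100 observed in D, 'age >= 30' becomes 30 <= age <= 100.
print(to_range(">=", 30, values=range(1, 101)))   # (30, 100)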

There are two main extensions to Apriori that we make in ZoominUR: (1) ZoominUR starts with a set of initial beliefs to seed the search for unexpected rules. This is similar in spirit to the work of [SVA97], where itemset constraints are used to focus the search. (2) We incorporate comparison operators since in many applications some rules involve these operators. Before presenting ZoominUR, we first explain some preliminaries.

Consider the belief X → Y, where X is a conjunction of atomic conditions of the form described above and Y is a single atomic condition. We use the term "CONTR(Y)" to refer to the set of atomic conditions, of the form attribute = value or of the form value1 ≤ attribute ≤ value2, that contradict Y. Assume that v1, v2, ..., vk are the distinct values (sorted in ascending order if a is ordered) that the attribute a involved in Y takes on in D. CONTR(Y) is generated as follows:

(1) If the head of the belief is of the form "value1 ≤ attribute ≤ value2" (attribute is ordered), any condition of the form "value3 ≤ attribute ≤ value4" ∈ CONTR(Y) if the ranges [value1, value2] and [value3, value4] are not empty and do not overlap.

(2) If the head of the belief is of the form "attribute = val" (attribute is unordered), any condition of the form "attribute = vp" ∈ CONTR(Y) if vp ∈ {v1, v2, ..., vk} and vp ≠ val.
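The following is a minimal Python sketch of CONTR(Y) as just described; the tuple encodings ('=', attr, val) and ('range', attr, lo, hi) and the helper name contr are illustrative assumptions of ours, not the paper's notation (degenerate single-value ranges and width limits are omitted for brevity).

from itertools import combinations

def contr(head, values):
    """Enumerate atomic conditions that contradict the head of a belief.
    head is ('=', attr, val) for an unordered attribute or
    ('range', attr, lo, hi) for an ordered one; values is the sorted list
    of distinct values the attribute takes in the dataset D."""
    out = []
    if head[0] == '=':                      # unordered: any other value contradicts
        _, attr, val = head
        out = [('=', attr, v) for v in values if v != val]
    else:                                   # ordered: any non-empty, non-overlapping range
        _, attr, lo, hi = head
        for v3, v4 in combinations(values, 2):
            if v4 < lo or v3 > hi:          # [v3, v4] does not overlap [lo, hi]
                out.append(('range', attr, v3, v4))
    return out

# Example: for the belief head 'y = 0' over a binary attribute y,
# contr(('=', 'y', 0), [0, 1]) yields [('=', 'y', 1)].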

In the case of ordered attributes the width of any condition of the form value1 ≤ attribute ≤ value2 is defined to be value2 − value1. We take as user inputs the minimum and maximum width for all ordered attributes. This is necessary and useful for the following reason. Assume that age is defined to be an ordered attribute and takes values ranging from 1 to 100 in the dataset. Clearly, at the extreme, a rule involving a condition of the form 1 ≤ age ≤ 100 is not useful since the condition 1 ≤ age ≤ 100 will hold for every record in the dataset. Extending this argument, larger ranges of age may hold for most records in the dataset; hence we allow the user to specify the maximum width for age that the user may be interested in considering. Similarly, the user may not be interested in too small a range for ordered attributes, and we allow the user to specify a minimum width for the attribute.

Since the rules discovered need to have minimum support, we follow the method of [AMS+95] and generate large itemsets in the first phase of the algorithm. From the supports of these large itemsets we generate unexpected refinements in the second phase of the algorithm. Given these preliminaries, we describe the algorithm next.

ZoominUR algorithm is presented in Fig. 3.1. The inputs to ZoominUR are:

1. A set of beliefs, B,
2. The dataset D,
3. Minimum and maximum width for all ordered attributes, and
4. Minimum support and confidence values.

For each belief X → Y, ZoominUR finds all unexpected rules of the form X, A → C, such that C ∈ CONTR(Y) and the rules satisfy minimum support and confidence requirements.

For each belief X → Y, ZoominUR first generates incrementally all large itemsets that may potentially generate unexpected rules. Each iteration of ZoominUR generates itemsets in the following manner. In the k-th iteration we generate itemsets of the form {C, X, P} such that C ∈ CONTR(Y). Observe that to determine the confidence of the rule X, P → C, the supports of both the itemsets {C, X, P} and {X, P} will have to be determined. Hence, in the k-th iteration of generating large itemsets, two sets of candidate itemsets are considered for support determination:

(1) The set Ck of candidate itemsets. Each itemset in Ck (e.g. {C, X, P}) contains (i) a condition that contradicts the head of the belief (i.e. any condition C ∈ CONTR(Y)), (ii) the body {X} of the belief, and (iii) k other atomic conditions (i.e. P is a conjunction of k atomic conditions).

(2) A set Ck' of additional candidates. Each itemset in Ck' (e.g. {X, P}) is generated from an itemset in Ck by dropping a condition, C, that contradicts the head of the belief.

We explain the steps of ZoominUR in Fig. 3.1 now. The following is a list of notations that are used in describing the algorithm:

• DISC is the set of unordered attributes.
• CONT is the set of ordered attributes.
• minwidth(a) and maxwidth(a) are the minimum and maximum widths for any ordered attribute a.


• Attributes(x) is the set of all attributes present in any of the conditions in itemset x.

• Values(a) is the set of distinct values the attribute a takes in the dataset D.

First, given a belief, B, the set of atomic conditions that contradict the head of the belief, CONTR(head(B)), is computed (as described previously). Then, the first candidate itemsets generated in C0 (step 2) will each contain the body of the belief and a condition from CONTR(head(B)). Hence the cardinality of the set C0 is the same as the cardinality of the set CONTR(head(B)).

Inputs: Beliefs Bel_Set, Dataset D, minwidth and maxwidth for all ordered
        attributes, and threshold support and confidence values min_support and min_conf

Outputs: For each belief B, itemsets Items_In_UnexpRuleB

1   forall beliefs B ∈ Bel_Set {
2     C0 = { {x, body(B)} | x ∈ CONTR(head(B)) };
      C0' = { {body(B)} };  k = 0
3     while (Ck ≠ ∅) do {
4       forall candidates c ∈ Ck ∪ Ck', compute support(c)
5       Lk = { x | x ∈ Ck ∪ Ck', support(x) ≥ min_support }
6       k++
7       Ck = generate_new_candidates(Lk−1, B)
8       Ck' = generate_bodies(Ck, B)
9     }
10    Let X = { x | x ∈ ∪ Li, x ⊇ a, a ∈ CONTR(head(B)) }
11    Items_In_UnexpRuleB = ∅
12    forall (x ∈ X) {
13      Let a = x ∩ CONTR(head(B))
14      rule_conf = support(x)/support(x − a)
15      if (rule_conf > min_conf) {
16        Items_In_UnexpRuleB = Items_In_UnexpRuleB ∪ {x}
17        Output Rule x − a → a
18      }
19    }
20  }

Figure 3.1 Algorithm ZoominUR

To illustrate this, consider an example involving only binary attributes. For the belief x=0 → y=0, the set CONTR({y=0}) consists of a single condition {y=1}. The initial candidate sets, therefore, are C0 = {{y=1, x=0}}, C0' = {{x=0}}.


Steps (3) through (9) in Fig. 3.1 are iterative: Steps (4) and (5) determine the supports in dataset D for all the candidate itemsets currently being considered and select the large itemsets in this set.

In step (7), the function generate_new_candidates(Lk−1, B) generates the set Ck of new candidate itemsets to be considered in the next pass from the previously determined set of large itemsets, Lk−1, with respect to the belief B ("x → y") in the following manner:

(1) Initial condition (k=1): In the example (binary attributes) considered above, assume that L0 = {{x=0, y=1}, {x=0}}, i.e. both initial candidates had adequate support. Further assume that "p" is the only other attribute (also binary) in the domain. The next set of candidates to be considered would be C1 = {{x=0, y=1, p=0}, {x=0, y=1, p=1}}, and C1' = {{x=0, p=0}, {x=0, p=1}}.

In general we generate C1 from L0 by adding additional conditions of the form attribute = value for unordered attributes or of the form value1 ≤ attribute ≤ value2 for ordered attributes to each of the itemsets in L0. More specifically, for a belief B, the set C1 is computed using the following rules. If itemset x ∈ L0 and x contains a condition that contradicts the head of the belief:

1. The itemset x ∪ {{a = val}} ∈ C1 if a ∈ DISC (set of unordered attributes), val ∈ Values(a) and a ∉ Attributes(x).

2. The itemset x ∪ {{value1 ≤ a ≤ value2}} ∈ C1 if a ∉ Attributes(head(B)), a ∈ CONT (set of ordered attributes), value1 ∈ Values(a), value2 ∈ Values(a), value1 ≤ value2, and the resulting width for the attribute a satisfies the minimum and maximum width restrictions for that attribute.

This process is efficient and complete because of the following reasons.

1. The attributes are assumed to have a finite number of unique discrete values in the dataset D. Only conditions involving these discrete values are considered.

2. For unordered attributes no condition involving an attribute already present in the itemset is added. This ensures that itemsets that are guaranteed to have zero support are never considered. For example, this condition ensures that for the belief month=9 → sales=low, the itemset {{month = 3}} is not added to the itemset {{sales = high}, {month = 9}}.

3. For ordered attributes, however, it is legal to add the itemset {{3 ≤ a ≤ 6}} to {{b=1}, {5 ≤ a ≤ 8}} to result in {{b=1}, {5 ≤ a ≤ 6}}, where the initial belief may be 5 ≤ a ≤ 8 → b=0 for example. Without loss of generality, in this case we represent the new itemset as {{b=1}, {5 ≤ a ≤ 8}, {3 ≤ a ≤ 6}} rather than as {{b=1}, {5 ≤ a ≤ 6}}. We use this "long form" notation since (1) we assume that all itemsets in a given iteration have the same cardinality and (2) the body of the belief is explicitly present in each itemset.
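A rough Python sketch of this first extension step (generating C1 from L0 according to rules 1 and 2 above) follows; the frozenset-of-tuples itemset representation, the dictionary parameters and all helper names are our own illustrative choices, and ordered attribute values are assumed numeric.

def extend_itemset(itemset, disc, cont, values, minw, maxw, head_attr):
    """Yield the candidate extensions of one large itemset from L0 by one
    extra atomic condition; values, minw, maxw are dicts keyed by attribute."""
    present = {c[1] for c in itemset}            # attributes already in the itemset
    # Rule 1: unordered attributes not yet present, one '=' condition per value
    for a in disc:
        if a not in present:
            for v in values[a]:
                yield itemset | {('=', a, v)}
    # Rule 2: ordered attributes other than the head's, ranges within width limits
    for a in cont:
        if a == head_attr:
            continue
        for lo in values[a]:
            for hi in values[a]:
                if lo <= hi and minw[a] <= hi - lo <= maxw[a]:
                    yield itemset | {('range', a, lo, hi)}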


(2) Incremental generation of Ck from Lk−1 when k > 1: This function is very similar to the apriori-gen function described in [AMS+95]. For example, assume that for a belief B, "x → y", c is a condition that contradicts y and that L1 = {{c, x, p}, {c, x, q}, {x, p}, {x, q}}. Similar to the apriori-gen function, the next set of candidate itemsets that contain x and c is C2 = {{x, c, p, q}} since this is the only itemset such that all its subsets of one less cardinality that contain both x and c are in L1.

In general, an itemset X is in Ck if and only if, for the belief B, X contains body(B) and a condition A such that A ∈ CONTR(head(B)) and all subsets of X with one less cardinality, containing A and body(B), are in Lk−1. More specifically, Ck is generated from Lk−1 using the following rule:

• If a ∈ CONTR(head(B)), a ∈ {x1, x2, ..., xp} and {x1, x2, ..., xp, v}, {x1, x2, ..., xp, w} ∈ Lk−1, then {x1, x2, ..., xp, v, w} ∈ Ck if w ∉ Attributes({x1, x2, ..., xp, v}).

The above rule for generating Ck essentially limits itemsets to a single condition for each "new" attribute not present in the belief B. This, however, does not eliminate any relevant large itemset from being generated, as the following example shows. Consider the belief x=1 → y=0 and assume that L1 contains {y=1, x=1, 3 ≤ a ≤ 6} and {y=1, x=1, 5 ≤ a ≤ 7}. Combining these itemsets yields the equivalent itemset {y=1, x=1, 5 ≤ a ≤ 6}, which has the same cardinality as any itemset in L1; if this itemset is large, it would already be present in L1.
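The apriori-style join described by the rule above can be sketched as follows (same frozenset-of-condition-tuples representation as before); this is our illustrative reading of the rule, not the authors' code.

def join_step(L_prev, body, contr_set):
    """Join pairs of large itemsets from L_{k-1} that share the belief body,
    a contradicting condition and all but one remaining condition."""
    C_next = set()
    for x in L_prev:
        for y in L_prev:
            if x == y:
                continue
            common = x & y
            if len(common) != len(x) - 1:
                continue
            if not (common & contr_set) or not (body <= common):
                continue
            w, = y - common                          # the condition unique to y
            if w[1] not in {c[1] for c in x}:        # w's attribute not already in x
                C_next.add(x | {w})
    return C_next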

In step (8), as described previously, we would also need the support of the additional candidate itemsets in Ck' to determine the confidence of the unexpected rules that will be generated. The function generate_bodies(Ck, B) generates Ck' by considering each itemset in Ck, dropping the condition that contradicts the head of the belief, and adding the resulting itemset to Ck'.

Once all large itemsets have been generated, steps (10) to (20) of ZoominUR generate unexpected rules of the form x, p → a, where a ∈ CONTR(head(B)), from the supports of the large itemsets.
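A minimal sketch of the corresponding body generation (step (8)) and rule extraction (steps (10)–(20)) might look as follows; `support` is assumed to be a dictionary mapping itemsets (frozensets of condition tuples) to their counted supports, an assumption of ours.

def generate_bodies(C_k, contr_set):
    """For each candidate itemset, drop the condition that contradicts the
    belief head; the supports of these bodies give the rule denominators."""
    return {itemset - contr_set for itemset in C_k}

def extract_rules(large_itemsets, support, contr_set, min_conf):
    """Emit unexpected rules (x - a) -> a for large itemsets containing a
    contradicting condition a."""
    for x in large_itemsets:
        a = x & contr_set
        if not a:
            continue
        body = x - a
        conf = support[x] / support[body]
        if conf > min_conf:
            yield body, a, conf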

3.5 Algorithm ZoomoutUR

ZoomoutUR considers each unexpected rule generated by ZoominUR and tries to determine all the other more general rules that are unexpected.

Given a belief X → Y and an unexpected rule X, A → B computed by ZoominUR, ZoomoutUR tries to find more general association rules of the form X', A → B, where X' ⊂ X, and checks if they satisfy minimum confidence requirements. Such rules satisfy the following properties. First, they are unexpected since they satisfy all three conditions of unexpectedness: (1) the head of the rule contradicts the head of the belief, (2) the rule is guaranteed to have adequate support, and (3) the intersection of the rule and belief yields an unexpected rule discovered by ZoominUR. Second, these rules are more general in the sense that they have at least as much support as the rule X, A → B. Third, the itemsets {X', A} and {X', A, B} are guaranteed to satisfy the minimum support requirement (though we still have to determine their exact support) since the itemsets {X, A} and {X, A, B} are already known to satisfy the minimum support requirement.

Inputs: Beliefs Bel_Set, Dataset D, min_support, min_conf, and, for each
        belief B, the itemsets Items_In_UnexpRuleB

Outputs: Unexpected zoomout rules

1   forall beliefs B {
2     new_candidates = ∅
3     forall (x ∈ Items_In_UnexpRuleB) {
4       Let K = { (k, k') | k ⊂ x, k ⊇ x − body(B), k' = k − a, a ∈ CONTR(head(B)) }
5       new_candidates = new_candidates ∪ K
6     }
7     find_support(new_candidates)
8     forall ((k, k') ∈ new_candidates) {
9       consider rule: k' → k − k' with confidence support(k)/support(k')
10      if (confidence > min_conf) Output Rule k' → k − k'
11    }
12  }

Figure 3.2. Algorithm ZoomoutUR

We present the ZoomoutUR algorithm in Fig. 3.2. For each belief B from the algorithm ZoominUR, we have the set of all large itemsets Items_In_UnexpRuleB (step (15) in Fig. 3.1) that contain both body(B) and some condition a, such that a ∈ CONTR(head(B)). The general idea is to take each such large itemset, I, and find the supports for all the subsets of I obtained by dropping from I one or more attributes that belong to body(B). Steps 1 through 5 of ZoomoutUR generate a set of ordered pairs (k, k'), new_candidates, such that the itemsets in each pair are obtained from some itemset in Items_In_UnexpRuleB by dropping one or more conditions from the body of the belief, B. In Step 4, an itemset k contains a condition that contradicts the head of the belief while k' does not contain any such condition (similar to the distinction between itemsets in Ck and Ck' explained in ZoominUR). For all the itemsets in new_candidates, Step 7 computes their support in D. Steps 8 through 11 generate unexpected rules from the itemsets in the ordered pairs.
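As a rough illustration of Step 4, the (k, k') pair generation can be sketched in Python as follows, again using frozensets of condition tuples; the representation and helper name are ours.

from itertools import combinations

def zoomout_pairs(x, body, contr_set):
    """Generate the (k, k') pairs of Step 4: drop one or more body(B) conditions
    from a large itemset x; k keeps the contradicting condition a, k' drops it."""
    a = x & contr_set                         # condition(s) contradicting head(B)
    body_in_x = x & body                      # belief-body conditions present in x
    pairs = set()
    for r in range(1, len(body_in_x) + 1):    # drop at least one body condition
        for dropped in combinations(body_in_x, r):
            k = x - frozenset(dropped)
            pairs.add((k, k - a))
    return pairs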

In this section we presented ZoomUR, an algorithm that discovers unexpected patterns in data. The following theorem states the completeness of ZoomUR. The proof is presented in [P99].


Theorem. For any belief A → B, ZoomUR discovers all unexpected rules of the form X → Y, where X and A are conjunctions of atomic conditions and Y and B are single atomic conditions.

4. Experiments

We tested our method on Web logfile data tracked at a major university site. The data was collected over a period of 8 months from May through December 1997 and consisted of over 280,000 hits. Some of the interesting rules in this application involve comparison operators. For example, temporal patterns holding during certain time intervals need to be expressed with conditions of the form "20 ≤ week ≤ 26" (Sep. 10 through Oct. 29 in our example).

We generated 11 beliefs about the access patterns to pages at the site. An example of a belief is: Belief: For all files, for all weeks, the number of hits to a file each week is approximately equal to the file's average weekly hits. Note that this belief involves aggregation of the Web logfile data. To deal with this, we created a user-defined view on the Web logfile and introduced the following attributes: file, week_number, file_access_cnt, avg_access_cnt_file, stable_week. The file_access_cnt is the number of accesses to file in the week week_number. The avg_access_cnt_file is the average weekly access for file in the dataset. The stable_week attribute is 1 if file_access_cnt lies within two standard deviations around avg_access_cnt_file and is 2 (3) if file_access_cnt is higher (lower); one possible construction of this view is sketched after the examples below. The above belief can then be expressed as True → stable_week=1. Though this belief was true in general (it holds with 94% confidence on the view generated), ZoominUR discovered the following unexpected rules:

• For a certain "Call for Papers" file, in the weeks from September 10 through October 29, the weekly access count is much higher than the average, i.e.

file = cfp_file, week_number ≥ 20, week_number ≤ 26 → stable_week=2.

What was interesting about this rule was that it turned out to be a Call-for-papers for the previous year and the editor of the Journal could not understand this unusually high activity! As a consequence, the file was removed from the server.

• For a certain job opening file, the weeks closest to the deadline had unusually high activity: file = job_file, week_number ≥ 25, week_number ≤ 30 → stable_week=2.

This pattern is not only unexpected (relative to our belief) but is also actionable because the administrators can expect a large number of applications and should prepare for this. Also, this pattern can prompt the administrators to examine IP domains that do not appear in the Web log accesses and target them in some manner.
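To make the user-defined view described above concrete, the following pandas sketch shows one plausible way to derive it from weekly access counts; the column names follow the text, while the sample data and the exact banding code are our own illustration, not the authors' implementation.

import pandas as pd

# hits: one row per (file, week_number) with the weekly access count
hits = pd.DataFrame({
    "file": ["cfp_file", "cfp_file", "job_file"],
    "week_number": [20, 21, 25],
    "file_access_cnt": [340, 410, 95],
})

grp = hits.groupby("file")["file_access_cnt"]
hits["avg_access_cnt_file"] = grp.transform("mean")
std = grp.transform("std").fillna(0)

def band(cnt, avg, s):
    if cnt > avg + 2 * s:
        return 2            # unusually high week
    if cnt < avg - 2 * s:
        return 3            # unusually low week
    return 1                # stable week

hits["stable_week"] = [band(c, a, s) for c, a, s in
                       zip(hits["file_access_cnt"], hits["avg_access_cnt_file"], std)]
print(hits)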

We would like to make the following observations based on our experiments with the Web application. First, as the examples show, we need to incorporate comparison operators since many of the interesting patterns are expressed in these terms. Second, the raw web access log data has very few fields, such as IP_Address, File_Accessed, and Time_of_Access. Without beliefs it would be extremely difficult to discover relevant patterns from this "raw" data. Beliefs provide valuable domain knowledge that results in the creation of several user-defined views and also drives the discovery process.

In [P99] we present results of applying ZoomUR in a comprehensive case study application using consumer purchase data. In the consumer purchase dataset, ZoomUR generated between 50 and 5000 rules, for varying levels of support, from an initial set of 28 beliefs. In comparison, for even conservative support values, Apriori generated more than 100,000 rules. Further in [P99] we also show that many of the rules generated by ZoomUR are truly interesting, while the top few rules from Apriori, though very high in confidence, seem obvious or irrelevant.

5. Conclusion

In this paper, we presented an algorithm for the discovery of unexpected patterns based on our definition of unexpectedness. This algorithm uses a set of user-defined beliefs to seed the search for the patterns that are unexpected relative to these beliefs. We tested our algorithm on web logfile data and discovered many interesting patterns. These experiments demonstrated two things. First, user-defined beliefs can drastically reduce the number of irrelevant and obvious patterns found during the discovery process and help focus on the discovery of unexpected patterns. Second, user-defined beliefs are crucial for the discovery process in some applications, such as Weblog applications. In these applications, important patterns are often expressed in terms of the user-defined vocabulary [DT93], and beliefs provide the means for identifying this vocabulary and driving the discovery processes.

As explained in the introduction, we do not describe how to generate an initial system of beliefs. To generate such beliefs, we use the methods described in [ST96b]. However there is a whole set of issues dealing with the problems of generating, managing and revising beliefs that go beyond the initial approaches described in [ST96b] and we are currently working on these issues. We are also working on incorporating predicates and aggregations into the beliefs and on using them in the discovery processes.

References

[AIS93] Agrawal, R., Imielinski, T. and Swami, A., 1993. Mining Association Rules Between Sets of Items in Large Databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pp. 207-216.

[AMS+95] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I., 1995. Fast Discovery of Association Rules. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds., Advances in Knowledge Discovery and Data Mining. AAAI Press.


[AT97] Adomavicius, G., and Tuzhilin, A., 1997. Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach. In Proc. of the Third Intl. Conference on Knowledge Discovery and Data Mining (KDD 97).

[BF78] Buchanan, B.G. and E.A. Feigenbaum. DENDRAL and META-DENDRAL: Their Applications Dimension. Artificial Intelligence, 11:5-24, 1978.

[BMU+97] Brin, S., Motwani, R, Ullman, J.D., and Tsur, S., 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. Procs. ACM SIGMOD Int. Conf. on Mgmt. of Data, pp.255-264.

[DT93] Dhar, V., and Tuzhilin, A, 1993. Abstract-Driven Pattern Discovery in Databases. IEEE Transactions on Knowledge and Data Engineering, v.5, no.6 December 1993.

[F97] Forbes Magazine, Sep. 8, 1997. Believe in yourself, believe in the merchandise, pp.118-124.

[FPM91] Frawley, W.J., Piatetsky-Shapiro, G. and Matheus, C.J., 1991. Knowledge Discovery in Databases: An Overview. In Piatetsky-Shapiro, G. and Frawley, W.J. eds., Know. Disc. in Databases. AAAI/MIT Press, 1991.

[FPS96] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., 1996. From Data Mining to Knowledge Discovery: An Overview. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds., Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

[KMR+94] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. and Verkamo, A.I., 1994. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proc. of the Third International Conference on Information and Knowledge Management, pp. 401-407.

[LB84] D.B. Lenat and J.S. Brown. Why AM and EURISKO appear to work. Artificial Intelligence, 23(3):269-294. 1984.

[LBA98] Y. Lee, B.G. Buchanan, and J.M. Aronis. Knowledge-based Learning in Exploratory Science: Learning Rules to Predict Rodent Carcinogenicity. Machine Learning, 30:217-240. 1998.

[Len83] D.B. Lenat. AM: Discovery in Mathematics as Heuristic Search. In R Davis and D. Lenat, editors. Knowledge-Based Systems in Artificial Intelligence. McGraw-Hill. 1983.

[LH96] Liu, B. and Hsu, W., 1996. Post-Analysis of Learned Rules. In Proc. of the Thirteenth National Conf. on Artificial Intelligence (AAAI '96), pp. 828-834.

[LHC97] Liu, B., Hsu, W. and Chen, S, 1997. Using General Impressions to Analyze Discovered Classification Rules. In Proc. of the Third IntI. Conf. on Knowledge Discovery and Data Mining (KDD 97), pp. 31-36.

[M80] Mitchell, T. The need for biases in learning generalizations. Technical Report CBM-TR-117, Dept. of Computer Science, Rutgers University, 1980.

[MK97] Michalski, R.S. and Kaufman, K.A. Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach. Technical Report P97-3 MLI 97-2, Machine Learning and Inference Laboratory, George Mason University, 1997.

[P99] Padmanabhan, B, 1999. Discovering Unexpected Patterns in Data Mining Applications. Doctoral dissertation, Department of Information Systems, Stern School of Business, New York University.


[PK92] Pazzani, M. and Kibler, D. ''The Utility of Knowledge in Inductive Learning." Machine Learning, 9(1): 57-94, 1992.

[PSM94] Piatetsky-Shapiro, G. and Matheus, C.J., 1994. The Interestingness of Deviations. In Proc. of AAAI-94 Workshop on Know. Discovery in Databases, pp. 25-36.

[PT97] Padmanabhan, B. and Tuzhilin, A, 1997. On the Discovery of Unexpected Rules in Data Mining Applications. In Procs. of the Workshop on Information Technology and Systems (WITS '97), pp. 81-90.

[S97] Stedman, C., 1997. Data Mining for Fool's Gold. Computerworld, Vol. 31,No. 48, Dec. 1997.

[SL90] Shrager, J. and P. Langley. Computational Models of Scientific Discovery and Theory Formation. San Mateo, CA: Morgan Kaufmann, 1990.

[ST95] Silberschatz, A and Tuzhilin, A, 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proc. of the First International Conference on Knowledge Discovery and Data Mining, pp. 275-281.

[ST96a] Silberschatz, A and Tuzhilin, A, 1996. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Trans. on Know. and Data Engineering. Spec. Issue on Data Mining, v.5, no.6, pp. 970-974.

[ST96b] Silberschatz, A and Tuzhilin, A, 1996. A Belief-Driven Discovery Framework Based on Data Monitoring and Triggering. Working Paper #IS-96-26, Dept. of Information Systems, Stern School of Business, NYU.

[Suz97] Suzuki, E., 1997. Autonomous Discovery of Reliable Exception Rules. In Proc. of the Third International Conference on Knowledge Discovery and Data Mining, pp. 259-262.

[SVA97] Srikant, R., Vu, Q. and Agrawal, R. Mining Association Rules with Item Constraints. In Proc. of the Third International Conference on Knowledge Discovery and Data Mining (KDD 97), pp. 67-73.

[ZZH90] Zytkow, J., J. Zhu, and A Hussam. Automated Discovery in Chemistry Laboratory. Proceedings of the Eighth National Conference on Artificial Intelligence. pp 889-894, 1990.


Discovery of Approximate Knowledge in Medical Databases Based on Rough Set Model

Shusaku Tsumoto

Department of Medical Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo 693-8501 Japan E-mail: [email protected]

Abstract. One of the most important problems with rule induction methods is that extracted rules do not plausibly represent information on experts' decision processes, which makes rule interpretation by domain experts difficult. In order to solve this problem, the characteristics of medical reasoning are discussed, and positive and negative rules are introduced which model medical experts' rules. Then, for induction of positive and negative rules, two search algorithms are provided. The proposed rule induction method was evaluated on medical databases, the experimental results of which show that induced rules correctly represented experts' knowledge and several interesting patterns were discovered.

1 Introduction

Rule induction methods are classified into two categories, induction of deterministic rules and induction of probabilistic ones[5,6,8,12]. On the one hand, deterministic rules are described as if-then rules, which can be viewed as propositions. From the set-theoretical point of view, a set of examples supporting the conditional part of a deterministic rule, denoted by C, is a subset of the set whose examples belong to the consequence part, denoted by D. That is, the relation C ⊆ D holds. Thus, deterministic rules are supported by positive examples in a dataset. On the other hand, probabilistic rules are if-then rules with probabilistic information[12]. From the set-theoretical point of view, C is not a subset of D, but closely overlaps with D. That is, the relations C ∩ D ≠ φ and |C ∩ D|/|C| ≥ δ will hold in this case.¹ Thus, probabilistic rules are supported by a large number of positive examples and a small number of negative ones. The common feature of both deterministic and probabilistic rules is that they will deduce their consequence positively if an example satisfies their conditional parts. We call the reasoning by these rules positive reasoning.

However, medical experts do not use only positive reasoning but also negative reasoning for selection of candidates, which is represented as if-then rules whose consequences include negative terms. For example, when a patient who complains of headache does not have a throbbing pain, migraine should not be suspected with a high probability. Thus, negative reasoning also plays an important role in cutting the search space of a differential diagnosis process[12]. Hence, medical reasoning includes both positive and negative reasoning, though conventional rule induction methods do not reflect this aspect. This is one of the reasons why medical experts have difficulties in interpreting induced rules and why the interpretation of rules for a discovery procedure does not easily proceed. Therefore, negative rules should be induced from databases in order not only to induce rules reflecting experts' decision processes, but also to induce rules which will be easier for domain experts to interpret, both of which are important to enhance the discovery process carried out by the cooperation of medical experts and computers.

¹ The threshold δ is the degree of the closeness of overlapping sets, which will be given by domain experts. For more information, please refer to Section 3.

In this paper, first, the characteristics of medical reasoning are focused on and two kinds of rules, positive rules and negative rules, are introduced as a model of medical reasoning. Interestingly, from the set-theoretical point of view, sets of examples supporting both rules correspond to the lower and upper approximation in rough sets[6]. On the other hand, from the viewpoint of propositional logic, both positive and negative rules are defined as classical propositions, or deterministic rules with two probabilistic measures, classification accuracy and coverage. Second, two algorithms for induction of positive and negative rules are introduced, defined as search procedures using accuracy and coverage as evaluation indices. Finally, the proposed method was evaluated on several medical databases, the experimental results of which show that induced rules correctly represented experts' knowledge and several interesting patterns were discovered.

2 Focusing Mechanism

One of the characteristics in medical reasoning is a focusing mechanism, which is used to select the final diagnosis from many candidates[12]. For example, in differential diagnosis of headache, more than 60 diseases will be checked by present history, physical examinations and laboratory examinations. In diagnostic procedures, a candidate is excluded if a symptom necessary to diagnose is not observed.

This style of reasoning consists of the following two kinds of reasoning processes: exclusive reasoning and inclusive reasoning.² The diagnostic procedure will proceed as follows: first, exclusive reasoning excludes a disease from candidates when a patient does not have a symptom which is necessary to diagnose that disease. Secondly, inclusive reasoning suspects a disease in the output of the exclusive process when a patient has symptoms specific to a disease. These two steps are modeled as the usage of two kinds of rules, negative rules (or exclusive rules) and positive rules, the former of which corresponds to exclusive reasoning and the latter of which corresponds to inclusive reasoning. In the next two subsections, these two rules are represented as special kinds of probabilistic rules.

² Relations of this diagnostic model with other diagnostic models are discussed in [13].

3 Definition of Rules

3.1 Rough Sets

In the following sections, we will use the following notations introduced by Grzymala-Busse and Skowron[10], which are based on rough set theory[6]. These notations are illustrated by a small database shown in Table 1, collecting the patients who complained of headache.

Table 1. An Example of Database

No.  age    location  nature      prodrome  nausea  M1   class
1    50-59  occular   persistent  no        no      yes  m.c.h.
2    40-49  whole     persistent  no        no      yes  m.c.h.
3    40-49  lateral   throbbing   no        yes     no   migra
4    40-49  whole     throbbing   yes       yes     no   migra
5    40-49  whole     radiating   no        no      yes  m.c.h.
6    50-59  whole     persistent  no        yes     yes  psycho

DEFINITIONS. M1: tenderness of M1, m.c.h.: muscle contraction headache, migra: migraine, psycho: psychological pain.

Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a, respectively. Then, a decision table is defined as an information system, A = (U, A ∪ {d}). For example, Table 1 is an information system with U = {1, 2, 3, 4, 5, 6}, A = {age, location, nature, prodrome, nausea, M1} and d = class. For location ∈ A, Vlocation is defined as {occular, lateral, whole}.

The atomic formulae over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For example, [location = occular] is a descriptor of B.

For each f ∈ F(B, V), fA denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows.

1. If f is of the form [a = v], then fA = {s ∈ U | a(s) = v}.
2. (f ∧ g)A = fA ∩ gA; (f ∨ g)A = fA ∪ gA; (¬f)A = U − fA.


For example, if f = [location = whole] then fA = {2, 4, 5, 6}. As an example of a conjunctive formula, g = [location = whole] ∧ [nausea = no] is a formula over B, and gA = {2, 5}.
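The meaning fA of a formula can be computed directly from Table 1; the short Python sketch below only illustrates the definitions (the dictionary encoding of the table and the function name are ours).

# Table 1 encoded as a dictionary: object id -> attribute values
table = {
    1: {"age": "50-59", "location": "occular", "nature": "persistent",
        "prodrome": "no", "nausea": "no", "M1": "yes", "class": "m.c.h."},
    2: {"age": "40-49", "location": "whole", "nature": "persistent",
        "prodrome": "no", "nausea": "no", "M1": "yes", "class": "m.c.h."},
    3: {"age": "40-49", "location": "lateral", "nature": "throbbing",
        "prodrome": "no", "nausea": "yes", "M1": "no", "class": "migra"},
    4: {"age": "40-49", "location": "whole", "nature": "throbbing",
        "prodrome": "yes", "nausea": "yes", "M1": "no", "class": "migra"},
    5: {"age": "40-49", "location": "whole", "nature": "radiating",
        "prodrome": "no", "nausea": "no", "M1": "yes", "class": "m.c.h."},
    6: {"age": "50-59", "location": "whole", "nature": "persistent",
        "prodrome": "no", "nausea": "yes", "M1": "yes", "class": "psycho"},
}

def meaning(*descriptors):
    """Set of objects satisfying the conjunction of [attribute = value] descriptors."""
    return {o for o, row in table.items()
            if all(row[a] == v for a, v in descriptors)}

print(meaning(("location", "whole")))                     # {2, 4, 5, 6}
print(meaning(("location", "whole"), ("nausea", "no")))   # {2, 5}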

3.2 Classification Accuracy and Coverage

Definition of Accuracy and Coverage. By the use of the framework above, classification accuracy and coverage, or true positive rate, are defined as follows.

Definition 1. Let R and D denote a formula in F(B, V) and a set of objects which belong to a decision d, respectively. Classification accuracy and coverage (true positive rate) for R → d are defined as:

αR(D) = |RA ∩ D| / |RA|  (= P(D|R)), and

κR(D) = |RA ∩ D| / |D|  (= P(R|D)),

where |S|, αR(D), κR(D) and P(S) denote the cardinality of a set S, the classification accuracy of R as to classification of D, the coverage (a true positive rate of R to D), and the probability of S, respectively.

In the above example, when R and D are set to [nau = 1] and [class = migraine], αR(D) = 2/3 = 0.67 and κR(D) = 2/2 = 1.0.

It is notable that αR(D) measures the degree of the sufficiency of a proposition R → D, and that κR(D) measures the degree of its necessity. For example, if αR(D) is equal to 1.0, then R → D is true. On the other hand, if κR(D) is equal to 1.0, then D → R is true. Thus, if both measures are 1.0, then R ↔ D.
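These two measures can be checked on the example with a few lines of Python; the compact encoding of Table 1 below is ours and serves only to illustrate Definition 1.

# rows of Table 1: object id -> (nausea, class)
rows = {1: ("no", "m.c.h."), 2: ("no", "m.c.h."), 3: ("yes", "migra"),
        4: ("yes", "migra"), 5: ("no", "m.c.h."), 6: ("yes", "psycho")}

R = {o for o, (nausea, _) in rows.items() if nausea == "yes"}   # [nau = 1]
D = {o for o, (_, cls) in rows.items() if cls == "migra"}       # [class = migraine]

alpha = len(R & D) / len(R)   # classification accuracy, P(D|R)
kappa = len(R & D) / len(D)   # coverage (true positive rate), P(R|D)
print(alpha, kappa)           # 0.666..., 1.0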

MDL principle of Accuracy and Coverage. One of the important characteristics of the relation between classification accuracy and coverage is a trade-off relation on description length, called the MDL principle (Minimum Description Length principle)[9], which is easily proved from the definitions of these measures.

Let us define the description length of a rule as:

l = −log2 αR(D) − log2 κR(D),

which represents the length of a bit string needed to describe all the information about the classification accuracy and coverage of the rule. In this definition, the length of coverage corresponds to the cost of "theory" in the MDL principle because of the following proposition on coverage.

Proposition 1 (Monotonicity of Coverage). Let Ri+1 denote a formula which is the conjunction of Ri and an attribute-value pair [ai+1 = vj]. Then,

κRi+1(D) ≤ κRi(D).  □

Then, from their definitions, the following relation will hold unless αR(D) or κR(D) is equal to 1.0:³

l = −log2 αR(D) − log2 κR(D)
  = −log2 (P(R ∩ D)/P(R)) − log2 (P(R ∩ D)/P(D))
  = −log2 (P(R ∩ D)·P(R ∩ D)/(P(D)·P(R)))
  ≥ −log2 (P(R)/P(D)).

P(R) and P(D) are defined as:

P(R) = |[x]R| / |U|  and  P(D) = |D| / |U|,

where U denotes the total set of samples.

When we add an attribute-value pair to the conditional part of a rule, the cardinality of [x]R will decrease and, equivalently, the value of P(R) will become smaller. Thus, log2 P(R) will approach −∞ as a result.

Thus, if we want to get a rule of high accuracy, the coverage of this rule will be very small, which causes the high cost of the description of rules. On the other hand, if we want to get a rule of high coverage, the accuracy of this rule will be very small, which also causes the high cost of the description of rules.

It also means that a rule of high accuracy should be described with additional information about positive examples which do not support the rule, or that a rule of high coverage should be described with additional information about negative examples which support the rule.

³ Since the MDL principle does not consider the concept of coverage, it is difficult to incorporate the meaning of coverage in an explicit way. However, as discussed in the section on negative rules, the situation when the coverage is equal to 1.0 has a special meaning to express the information about negative reasoning. It will be our future work to study the meaning of the case when the coverage is equal to 1.0 in the context of the description length of "theory".

The main objective of this paper is to point out that we should use negative rules as additional information for positive rules, as shown in the next subsection.⁴
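For the running example [nau = 1] → migraine on Table 1, the description-length inequality derived above can be checked numerically; this tiny Python snippet is only a sanity check, with the probabilities read off the six-case table.

from math import log2

P_R, P_D, P_RD = 3/6, 2/6, 2/6          # P(R), P(D), P(R ∩ D) from Table 1
alpha, kappa = P_RD / P_R, P_RD / P_D   # 2/3 and 1.0

l = -log2(alpha) - log2(kappa)          # description length, about 0.585 bits
bound = -log2(P_R / P_D)                # lower bound, about -0.585
print(l, bound, l >= bound)             # ... True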

3.3 Probabilistic Rules

By the use of accuracy and coverage, a probabilistic rule is defined as:

R →(α,κ) d   s.t.   R = ∧j [aj = vk],  αR(D) ≥ δα  and  κR(D) ≥ δκ.

This rule is a kind of probabilistic proposition with two statistical measures, which is an extension of Ziarko's variable precision rough set model (VPRS) [15].⁵

It is also notable that both a positive rule and a negative rule are defined as special cases of this rule, as shown in the next subsections.

3.4 Positive Rules

A positive rule is defined as a rule supported by only positive examples, the classification accuracy of which is equal to 1.0. It is notable that the set supporting this rule corresponds to a subset of the lower approximation of a target concept, which is introduced in rough sets[6]. Thus, a positive rule is represented as:

R → d   s.t.   αR(D) = 1.0.

In the above example, one positive rule of "m.c.h." (muscle contraction headache) is:

[nausea = no] → m.c.h.   α = 3/3 = 1.0.

This positive rule is often called a deterministic rule. However, in this paper, we use the term positive (deterministic) rules, because a deterministic rule which is supported only by negative examples, called a negative rule, is introduced in the next subsection.

3.5 Negative Rules

Before defining a negative rule, let us first introduce an exclusive rule, the contrapositive of a negative rule[12]. An exclusive rule is defined as a rule supported by all the positive examples, the coverage of which is equal to 1.0.⁶ It is notable that the set supporting an exclusive rule corresponds to the upper approximation of a target concept, which is introduced in rough sets[6]. Thus, an exclusive rule is represented as:

R → d   s.t.   κR(D) = 1.0.

⁴ Negative rules are not equivalent to information about positive examples which do not support the positive rules, but they include it implicitly.
⁵ This probabilistic rule is also a kind of Rough Modus Ponens[7].
⁶ An exclusive rule represents the necessity condition of a decision.

In the above example, the exclusive rule of "m.c.h." is:

[M1 = yes] ∨ [nau = no] → m.c.h.   κ = 1.0.

From the viewpoint of propositional logic, an exclusive rule should be represented as:

d → ∨j [aj = vk],

because the condition of an exclusive rule corresponds to the necessity condition of conclusion d. Thus, it is easy to see that a negative rule is defined as the contrapositive of an exclusive rule:

∧j ¬[aj = vk] → ¬d,

which means that if a case does not satisfy any attribute-value pairs in the condition of a negative rule, then we can exclude the decision d from the candidates. For example, the negative rule of m.c.h. is:

¬[M1 = yes] ∧ ¬[nausea = no] → ¬m.c.h.

In summary, a negative rule is defined as:

∧j ¬[aj = vk] → ¬d   s.t.   ∀[aj = vk]  κ[aj=vk](D) = 1.0,

where D denotes a set of samples which belong to a class d. Negative rules should also be included in the category of deterministic rules, since their coverage, a measure of negative concepts, is equal to 1.0. It is also notable that the set supporting a negative rule corresponds to a subset of the negative region, which is introduced in rough sets[6].

4 Algorithms for Rule Induction

An exclusive rule, the contrapositive of a negative rule, is induced by a modification of the algorithm introduced in PRIMEROSE-REX[12], as shown in Fig. 1. This algorithm works as follows. (1) First, it selects a descriptor [ai = vj] from the list of attribute-value pairs, denoted by L. (2) Then, it checks whether this descriptor overlaps with the set of positive examples, denoted by D. (3) If so, this descriptor is included in a list of candidates for positive rules and the algorithm checks whether its coverage is equal to 1.0 or not. If the coverage is equal to 1.0, then this descriptor is added to Rer, the formula for the conditional part of the exclusive rule of D. (4) Then, [ai = vj] is deleted from the list L. This procedure, from (1) to (4), continues until L is empty. (5) Finally, when L is empty, the algorithm generates negative rules by taking the contrapositive of the induced exclusive rules.

On the other hand, positive rules are induced as inclusive rules by the algorithm introduced in PRIMEROSE-REX[12], as shown in Fig. 2. For induction of positive rules, the thresholds of accuracy and coverage are set to 1.0 and 0.0, respectively.

This algorithm works in the following way. (1) First, it substitutes L1, which denotes a list of formulae composed of only one descriptor, with the list Lir generated by the former algorithm shown in Fig. 1. (2) Then, until L1 becomes empty, the following procedures continue: (a) a formula [ai = vj] is removed from L1; (b) then, the algorithm checks whether αR(D) is larger than the threshold or not (for induction of positive rules, this is equal to checking whether αR(D) is equal to 1.0 or not); if so, this formula is included in a list of the conditional parts of positive rules; otherwise, it is included in M, which is used for making conjunctions. (3) When L1 is empty, the next list L2 is generated from the list M.

procedure Exclusive and Negative Rules;
var
  L : List;    /* A list of elementary attribute-value pairs */
begin
  L := P0;     /* P0: A list of elementary attribute-value pairs given in a database */
  while (L ≠ {}) do
    begin
      Select one pair [ai = vj] from L;
      if ([ai = vj]A ∩ D ≠ φ) then do    /* D: positive examples of a target class d */
        begin
          Lir := Lir + [ai = vj];        /* Candidates for Positive Rules */
          if (κ[ai=vj](D) = 1.0)
            then Rer := Rer ∧ [ai = vj];
              /* Include [ai = vj] into the formula of the Exclusive Rule */
        end
      L := L − [ai = vj];
    end
  Construct Negative Rules:
    Take the contrapositive of Rer.
end {Exclusive and Negative Rules};

Fig. 1. Induction of Exclusive and Negative Rules


procedure Positive Rules;
var
  i : integer;  M, Li : List;
begin
  L1 := Lir;   /* Lir: A list of candidates generated by induction of exclusive rules */
  i := 1;  M := {};
  for i := 1 to n do    /* n: Total number of attributes given in a database */
    begin
      while (Li ≠ {}) do
        begin
          Select one pair R = ∧[ai = vj] from Li;
          Li := Li − {R};
          if (αR(D) > δα)
            then do Bir := Bir + {R};    /* Include R in a list of the Positive Rules */
            else M := M + {R};
        end
      Li+1 := (A list of the whole combination of the conjunction formulae in M);
    end
end {Positive Rules};

Fig. 2. Induction of Positive Rules
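A compact Python sketch of both procedures, restricted for brevity to single-descriptor rules and run on Table 1, is given below; it follows the pseudocode of Figs. 1 and 2 but is our illustrative reading, not the PRIMEROSE-REX2 implementation.

table = {
    1: {"location": "occular", "nausea": "no",  "M1": "yes", "class": "m.c.h."},
    2: {"location": "whole",   "nausea": "no",  "M1": "yes", "class": "m.c.h."},
    3: {"location": "lateral", "nausea": "yes", "M1": "no",  "class": "migra"},
    4: {"location": "whole",   "nausea": "yes", "M1": "no",  "class": "migra"},
    5: {"location": "whole",   "nausea": "no",  "M1": "yes", "class": "m.c.h."},
    6: {"location": "whole",   "nausea": "yes", "M1": "yes", "class": "psycho"},
}

def extension(attr, val):
    return {o for o, row in table.items() if row[attr] == val}

def induce(target, delta_alpha=1.0):
    D = {o for o, row in table.items() if row["class"] == target}
    exclusive, candidates, positive = [], [], []
    # Fig. 1: collect descriptors overlapping D; coverage 1.0 descriptors form the exclusive rule
    for attr in ("location", "nausea", "M1"):
        for val in {row[attr] for row in table.values()}:
            ext = extension(attr, val)
            if ext & D:
                candidates.append((attr, val))
                if D <= ext:                      # coverage kappa = 1.0
                    exclusive.append((attr, val))
    # Fig. 2 (first iteration only): candidates whose accuracy reaches the threshold
    for attr, val in candidates:
        ext = extension(attr, val)
        if len(ext & D) / len(ext) >= delta_alpha:
            positive.append((attr, val))
    negative = [("NOT", attr, val) for attr, val in exclusive]  # contrapositive of the exclusive rule
    return exclusive, negative, positive

print(induce("m.c.h."))
# exclusive: ('nausea', 'no') and ('M1', 'yes'), as in the text;
# positive:  ('nausea', 'no') and ('location', 'occular') (the latter covers only case 1)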

5 Experimental Results

For experimental evaluation, a new system, called PRIMEROSE-REX2 (Probabilistic Rule Induction Method for Rules of Expert System ver 2.0), was developed, in which the algorithms discussed in Section 4 were implemented. PRIMEROSE-REX2 was applied to the following three medical domains: headache (RHINOS domain), whose training samples consist of 52119 samples, 45 classes and 147 attributes; cerebrovascular diseases (CVD), whose training samples consist of 7620 samples, 22 classes and 285 attributes; and meningitis, whose training samples consist of 1211 samples, 4 classes and 41 attributes (Table 2).⁷

For evaluation, we used the following two types of experiments. One experiment was to evaluate the predictive accuracy by using the cross-validation method, which is often used in the machine learning literature[11]. The other experiment was to evaluate induced rules by medical experts and to check whether these rules led to a new discovery.

⁷ The subset of the dataset on meningitis is publicly available from the Web site http://www.shimane-med.ac.jp/med..info/tsumoto. Concerning the other two datasets, they are not yet available because of the contracts with the hospitals where the author worked as a neurologist.


Table 2. Databases

Domain      Samples  Classes  Attributes
Headache      52119       45         147
CVD            7620       22         285
Meningitis     1211        4          41

5.1 Performance of Rules Obtained


For comparison of performance, the experiments were performed by the following three procedures. First, the samples were randomly split into new training samples and new test samples. Second, PRIMEROSE-REX2 and the conventional rule induction methods AQ15[5] and C4.5[8] were applied to the new training samples for rule generation. Third, the induced rules and the rules acquired manually from experts were tested on the new test samples. These procedures were repeated 100 times and the classification accuracy was averaged over the 100 trials. This process is a variant of repeated 2-fold cross-validation, introduced in [12].

Experimental results (performance) are shown in Table 3. The first and second rows show the results obtained by using PRIMEROSE-REX2: the results in the first row were derived by using both positive and negative rules and those in the second row were derived by using only positive rules. The third row shows the results derived from medical experts. For comparison, the fourth and fifth rows show the classification accuracy of C4.5 and AQ15. These results show that the combination of positive and negative rules outperforms positive rules alone, although it is a little worse than the medical experts' rules.

Table 3. Experimental Results (Accuracy: Averaged)

Method                                Headache   CVD     Meningitis
PRIMEROSE-REX2 (Positive+Negative)       91.3%   89.3%        92.5%
PRIMEROSE-REX2 (Positive)                68.3%   71.3%        74.5%
Experts                                  95.0%   92.9%        93.2%
C4.5                                     85.8%   79.7%        81.4%
AQ15                                     86.2%   78.9%        82.5%

5.2 What is Discovered?

Positive Rules in Meningitis In the domain of meningitis, the following positive rules, which medical experts do not expect, are obtained.


[WBC < 12000] ∧ [Sex = Female] ∧ [Age < 40] ∧ [CSF_CELL < 1000] → Virus
[Age ≥ 40] ∧ [WBC ≥ 8000] ∧ [Sex = Male] ∧ [CSF_CELL ≥ 1000] → Bacteria

The former rule means that if WBC (White Blood Cell Count) is less than 12000, the Sex of the patient is Female, the Age is less than 40 and CSF_CELL (cell count of cerebrospinal fluid) is less than 1000, then the type of meningitis is Virus. The latter one means that if the Age of the patient is 40 or over, WBC is 8000 or more, the Sex is Male, and CSF_CELL is 1000 or more, then the type of meningitis is Bacteria.

The most interesting point is that these rules include information about age and sex, which often seem to be unimportant attributes for differential diagnosis of meningitis. The first discovery was that women did not often suffer from bacterial infection, compared with men, since such relationships between sex and meningitis have not been discussed in the medical context[1]. By close examination of the meningitis database, it was found that most of the above patients suffered from chronic diseases, such as DM, LC, and sinusitis, which are risk factors of bacterial meningitis. The second discovery was that [age < 40] was also an important factor not to suspect viral meningitis, which also matches the fact that most old people suffer from chronic diseases.

These results were also re-evaluated in medical practice. Recently, the above two rules were checked on an additional 21 cases who suffered from meningitis (15 cases: viral and 6 cases: bacterial meningitis). Surprisingly, the above rules misclassified only three cases (two viral, and one bacterial); that is, the total accuracy was equal to 18/21 = 85.7% and the accuracies for viral and bacterial meningitis were equal to 13/15 = 86.7% and 5/6 = 83.3%. The reasons for the misclassifications were the following: the case of bacterial infection was a patient who had a severe immunodeficiency, although he was very young; the two cases of viral infection were patients who also suffered from herpes zoster. It is notable that even those misclassified cases could be explained from the viewpoint of immunodeficiency: that is, it was confirmed that immunodeficiency is a key word for meningitis.

The validation of these rules is still ongoing, which will be reported in the near future.

Positive and Negative Rules in CVD. Concerning the database on CVD, several interesting rules were derived. The most interesting results were the following positive and negative rules for thalamus hemorrhage:

[Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC: positive] → Thalamus
¬[Risk: Hypertension] ∧ ¬[Sensory = no] → ¬Thalamus

The former rule means that if the Sex of a patient is female and she suffered from left hemiparesis ([Hemiparesis = Left]) and loss of consciousness ([LOC: positive]), then the focus of CVD is the Thalamus. The latter rule means that if he/she neither suffers from hypertension ([Risk: Hypertension]) nor suffers from sensory disturbance ([Sensory = no]), then the focus of CVD is not the Thalamus.

Interestingly, LOC (loss of consciousness) under the condition of [Sex = Female] ∧ [Hemiparesis = Left] was found to be an important factor to diagnose thalamic damage. In this domain, strong correlations between these attributes and others, like those found in the meningitis database, have not been discovered yet. It will be our future work to find what factor is behind these rules.

5.3 Rule Discovery as Knowledge Acquisition

Expert System: RH. Another point of discovery of rules is automated knowledge acquisition from databases. Knowledge acquisition is referred to as a bottleneck problem in the development of expert systems[2], which has not been fully solved and is expected to be solved by induction of rules from databases. However, there are few papers which discuss the evaluation of discovered rules from the viewpoint of knowledge acquisition[14].

For this purpose, we developed an expert system, called RH (Rule-based system for Headache), by using the acquired knowledge.⁸ RH consists of two parts. Firstly, RH requires inputs and applies exclusive and negative rules to select candidates (focusing mechanism). Then, RH requires additional inputs and applies positive rules for differential diagnosis between the selected candidates. Finally, RH outputs diagnostic conclusions.

Evaluation of RH. RH was evaluated in clinical practice with respect to its classification accuracy by using 930 patients who came to the outpatient clinic after the development of this system. Experimental results on classification accuracy are shown in Table 4. The first and second rows show the performance of rules obtained by using PRIMEROSE-REX2: the results in the first row are derived by using both positive and negative rules and those in the second row are derived by using only positive rules. The third and fourth rows show the results derived by using both positive and negative rules and those by using only positive rules acquired directly from medical experts. These results show that the combination of positive and negative rules outperforms positive rules alone and attains almost the same performance as that of the experts.

6 Discussion

As discussed in Section 4, positive (PR) and negative rules (NR) are:

⁸ The reason why we selected the domain of headache is that we formerly developed an expert system, RHINOS (Rule-based Headache INformation Organizing System), which makes a differential diagnosis in headache[3,4]. In developing this system, it took about six months to acquire knowledge from domain experts.


Table 4. Evaluation of RH (Accuracy: Averaged)

Method                                    Accuracy
PRIMEROSE-REX2 (Positive and Negative)    91.4% (851/930)
PRIMEROSE-REX2 (Positive)                 78.5% (729/930)
RHINOS (Positive and Negative)            93.5% (864/930)
RHINOS (Positive)                         82.8% (765/930)

PR:  ∧j [aj = vk] → d     s.t.  α∧j[aj=vk](D) = 1.0,
NR:  ∧j ¬[aj = vk] → ¬d   s.t.  ∀[aj = vk]  κ[aj=vk](D) = 1.0.

Positive rules are exactly equivalent to deterministic rules, which are defined in [6]. So, the disjunction of positive rules corresponds to the positive region of a target concept (decision attribute). On the other hand, negative rules correspond to the negative region of a target concept. From this viewpoint, probabilistic rules correspond to the combination of the boundary region and the positive region (mainly the boundary region).

Thus our approach, the combination of positive and negative deterministic rules, captures the target concept as the combination of positive and negative information. Interestingly, our experiments show that this combination outperforms the use of positive rules alone, which suggests that we also need negative information to achieve higher accuracy. So, although our method is very simple, it captures an important aspect of experts' reasoning and points out that we should examine the role of negative information in experts' decisions more closely.

Another aspect of experts' reasoning is fuzzy or probabilistic: in the rough set community, the problems of deterministic rules were pointed out by Ziarko[15], who introduced the Variable Precision Rough Set Model (VPRS model). The VPRS model extends the positive concept with the precision of classification accuracy: a relation whose classification accuracy is larger than a given precision (threshold) will be regarded as positive. Thus, in this model, rules of high accuracy are included in an extended positive region. Analogously, we can also extend the negative concept with the precision of coverage, which will give an extended negative region. The combination of such positive and negative rules will extend the approach introduced in this paper, and is expected to improve the performance or to extract knowledge about experts' decisions more correctly. Thus, it will be future work to check whether the combination of extended positive and negative rules outperforms that of positive and negative deterministic rules.

Another interest is a measure of the boundary region: the measure of positive information is accuracy and that of negative information is coverage. Probabilistic rules can be measured by the combination of accuracy and coverage[12,13], but with a combination of two measures it is difficult to compare rules with each other, that is, to measure the quality of the boundary. This will also be one of the important future research directions.


7 Conclusions

In this paper, the characteristics of two measures, classification accuracy and coverage, are discussed, which shows that both measures are dual and that accuracy and coverage are measures of positive and negative rules, respectively. Then, an algorithm for induction of positive and negative rules is introduced. The proposed method was evaluated on medical databases, the experimental results of which show that induced rules correctly represented experts' knowledge and several interesting patterns were discovered.

References

1. Adams RD and Victor M: Principles of Neurology, 5th edition. McGraw-Hill, New York, 1993.

2. Buchanan BG and Shortliffe EH (Eds): Rule-Based Expert Systems. Addison-Wesley, 1984.

3. Matsumura Y, Matsunaga T, Hata Y, Kimura M, Matsumura H: Consultation system for diagnoses of headache and facial pain: RHINOS. Medical Informatics 11: 145-157, 1988.

4. Matsumura Y, Matsunaga T, Maeda Y, Tsumoto S, Matsumura H, Kimura M: Consultation System for Diagnosis of Headache and Facial Pain: "RHINOS". Proceedings of Logic Program Conferences, pp. 287-298, 1985.

5. Michalski RS, Mozetic I, Hong J, and Lavrac N: The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. Proceedings of the fifth National Conference on Artificial Intelligence, AAAI Press, Palo Alto CA, pp 1041-1045, 1986.

6. Pawlak Z: Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.
7. Pawlak Z: Rough Modus Ponens. In: Proceedings of International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems 98, Paris, 1998.

8. Quinlan JR: C4.5 - Programs for Machine Learning. Morgan Kaufmann, Palo Alto CA, 1993.

9. Rissanen J: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.

10. Skowron A and Grzymala-Busse J: From rough set theory to evidence theory. In: Yager R, Fedrizzi M and Kacprzyk J (Eds): Advances in the Dempster-Shafer Theory of Evidence, pp. 193-236, John Wiley & Sons, New York, 1994.

11. Shavlik JW and Dietterich TG(Eds): Readings in Machine Learning. Morgan Kaufmann, Palo Alto CA, 1990.

12. Tsumoto S and Tanaka H: Automated Discovery of Medical Expert System Rules from Clinical Databases based on Rough Sets. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining 96, AAAI Press, Palo Alto CA, pp.63-69, 1996.

13. Tsumoto S: Modelling Medical Diagnostic Rules based on Rough Sets. In: Polkowski L and Skowron A (Eds): Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence 1424, 1998.

14. Tsumoto S: Automated Extraction of Medical Expert System Rules from Clinical Databases based on Rough Set Theory. Information Sciences, 112, 67-84, 1998.

15. Ziarko W: Variable Precision Rough Set Model. Journal of Computer and Sys­tem Sciences 46:39-59, 1993.

Part 4

Granular Computing

Observability and the Case of Probability *

Claudi Alsina1 , Joan Jacas1 , and Enric Trillas2

1 Secció de Matemàtiques i Informàtica, Departament d'Estructures a l'Arquitectura, Universitat Politècnica de Catalunya. Diagonal 649, 08028 Barcelona, Spain.

2 Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid. Campus de Montegancedo. 28660 Boadilla del Monte. Madrid. Spain.

Abstract. The modeling of problems in scientific observation has motivated the development of mathematical tools to deal with the several ways of classifying or granulating the universes of discourse: the world as it is perceived. The aim of this paper is to review and clarify various mathematical aspects related to observability problems within a classical boolean structure or a fuzzy context. In doing so, it is shown how ideas arising in Fuzzy Sets theory can become fine tools to handle most of the granulating problems. In the last section we study the observability of the label "probable" viewed as a fuzzy set.

Key words: T-fuzzy equivalence relation, E-observable fuzzy set, approximation.

1 Observable sets in a boolean context

The observations on a certain given set X are always made through a natural or "artificial" instrument that has a certain resolution. Therefore, the objects observed are not the elements of X but certain subsets of X whose mutual "distances" are greater than or equal to the resolution of the instrument.

In a classical setting, what we have is an equivalence relation E associated to the instrument, defined on X and such that if x, y ∈ X are indistinguishable then (x, y) ∈ E, and (x, y) ∉ E otherwise. In this case, the set X is partitioned into a set X/E of equivalence classes. If x̄ denotes the class associated to x ∈ X, x̄ = {y | (x, y) ∈ E}, we have a canonical mapping π : X → X/E such that π(x) = x̄. Under this formulation, a subset A ⊂ X is observable (neatly) with respect to E, or E-observable for short, if it is the union of classes of X/E. This definition can be reformulated in different ways, as shown in Proposition 1.

Summing up, when observing a set X by means of an instrument, our ground set is granulated into subsets (the equivalence classes) and only the sets that are union of classes are "neatly" observed. These subsets are the only ones compatible with the instrument's resolution given by the equivalence E.
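As a small illustrative sketch (the set X and the relation below are assumed examples, not from the text), the following code partitions a set by an equivalence relation and tests whether a subset is E-observable, i.e. a union of equivalence classes.

```python
def classes(X, related):
    """Partition X into equivalence classes of the relation `related`."""
    parts = []
    for x in X:
        for p in parts:
            if related(x, next(iter(p))):
                p.add(x)
                break
        else:
            parts.append({x})
    return parts

def is_observable(A, parts):
    """A is observable iff every class meeting A is entirely contained in A."""
    return all(p <= A or p.isdisjoint(A) for p in parts)

X = {1, 2, 3, 4, 5, 6}
parts = classes(X, lambda x, y: x % 3 == y % 3)   # classes mod 3
print(is_observable({3, 6}, parts))               # True: a full class
print(is_observable({1, 3}, parts))               # False: cuts the class {1, 4}
```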

* Partially supported by CICYT TIC96-1393-C06-06 and PGC PB98-0924

Proposition 1. The following assertions are equivalent:

(a) A is an E-observable set iff A = ⋃_{x∈A} x̄ or A = ∅.
(b) x ∈ A ⇔ (∃u ∈ X such that u ∈ A ∩ x̄).
(c) x ∈ A ⇔ (∀y ∈ X, (x, y) ∈ E ⇒ y ∈ A).

Proof. a ⇒ b: If A = ⋃_{x∈A} x̄ then x ∈ A ⇒ x ∈ A ∩ x̄. Reciprocally, if for x ∈ X there exists u ∈ X such that u ∈ A ∩ x̄, then u ∈ x̄ ⇒ (x, u) ∈ E ⇒ x ∈ ū ⇒ x ∈ A.
b ⇒ c: For any y ∈ X such that (x, y) ∈ E with x ∈ A, we have x ∈ A ∩ ȳ and therefore y ∈ A. Reciprocally, applying (b), for any z ∈ X, z ∈ A ∩ x̄ ⇒ z ∈ A.
c ⇒ a: If A is not a union of classes, there exists x ∈ A such that for some b ∈ x̄, b ∉ A. This contradicts the fact that, applying (c), (x, b) ∈ E and x ∈ A imply b ∈ A. □

In what follows, any set A will be identified with its characteristic or membership function χ_A.

Under this assumption, an equivalent formulation of (b) for the characterization of an observable set A ∈ P(X), where P(X) is identified with {0,1}^X, with respect to E can be stated in terms of the characteristic functions χ_A and χ_E.

(b') A is an E-observable set of X iff

Min(χ_A(y), χ_E(x, y)) ≤ χ_A(x)    (1)

for all x, y of X. Since χ_E(x, x) = 1 for all x ∈ X, (1) is equivalent to

Sup_{y∈X} {Min(χ_E(x, y), χ_A(y))} = χ_A(x).

Finally, an equivalent formulation to (c) can also be rewritten as follows:

(c') A is an observable set iff

χ_A(x) ≤ (χ_E(x, y) → χ_A(y))    (2)

for all x, y in X, where → is defined by (χ_P → χ_Q) = Max(1 − χ_P, χ_Q). Since χ_E(x, x) = 1, Inf_{y∈X} {χ_E(x, y) → χ_A(y)} ≤ (1 → χ_A(x)) = χ_A(x), and (2) is equivalent to

Inf_{y∈X} {χ_E(x, y) → χ_A(y)} = χ_A(x).

What can we do with the remaining subsets of X that are not observable sets? If Ob_E(X) is the set of observables, where the empty set is included, and B ∈ P(X), we can consider the smallest set B̄ ∈ Ob_E(X) such that B ⊂ B̄, which can be interpreted as the set of elements "possibly observed". Another point of view is to consider the greatest set B̲ ∈ Ob_E(X) such that B̲ ⊂ B, which can be understood as the set of elements "necessarily observed".

Example 1. Let X = {a, b, c, d, e, f}; E = {(a,b), (a,c), (e,f), (a,a), (b,b), (c,c), (d,d), (e,e), (f,f)} (closed under symmetry and transitivity); then X/E = {{a, b, c}, {e, f}, {d}}. Let B = {a, d}; then B̄ = {a, b, c, d} and B̲ = {d}. Observe that if B does not contain any equivalence class then B̲ = ∅ and that, since B̲ ⊂ B ⊂ B̄, the finer the partition of X, the "closer" B̄ and B̲ are, giving an improved approximation of B.
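The upper and lower approximations of Example 1 can be reproduced with the short sketch below (under the stated assumption that X/E is the partition of Example 1): the upper approximation is the union of classes meeting B, the lower one the union of classes contained in B.

```python
X_mod_E = [{"a", "b", "c"}, {"e", "f"}, {"d"}]   # the partition of Example 1
B = {"a", "d"}

upper = set().union(*[p for p in X_mod_E if p & B])    # classes meeting B
lower = set().union(*[p for p in X_mod_E if p <= B])   # classes inside B

print(sorted(upper))   # ['a', 'b', 'c', 'd']
print(sorted(lower))   # ['d']
```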

Let us observe that, from another point of view, we can define a map Φ_E : P(X) → P(X) given by

Φ_E(χ_B)(x) = Sup_{y∈X} {Min(χ_E(x, y), χ_B(y))}.

This map has the following properties

(a) χ_B ≤ Φ_E(χ_B)
(b) Φ_E(⋁_{j∈J} {χ_Bj}) = ⋁_{j∈J} {Φ_E(χ_Bj)}
(c) Φ_E(⋀_{j∈J} {χ_Bj}) = ⋀_{j∈J} {Φ_E(χ_Bj)}
(d) Φ_E² = Φ_E
(e) Φ_E(χ_B) ∈ Ob_E(X) and Φ_E(P(X)) = Ob_E(X)
(f) χ_B ∈ Ob_E(X) ⇔ Φ_E(χ_B) = χ_B.

Further, if we restrict this map to the singletons of X, then for any {x} ∈ P(X)

Φ_E(χ_{x}) = χ_x̄.

Φ_E restricted to the singletons of X can be interpreted as the canonical map

π : X → X/E

that assigns to each element x ∈ X its equivalence class x̄. In a similar way, we can define a map ψ_E : P(X) → P(X) as

ψ_E(χ_A)(x) = Inf_{y∈X} {χ_E(x, y) → χ_A(y)}.

This map assigns to A ∈ P(X) the union of classes contained in A:

ψ_E(χ_A)(x) = 1 iff x̄ ⊂ A, and ψ_E(χ_A)(x) = 0 otherwise.

Therefore, if A does not contain any equivalence class, its image is the empty set. ψ_E(χ_A) is the greatest observable set contained in A.

The mapping ψ_E fulfills the following properties

(a) ψ_E(χ_A) ≤ χ_A
(b) ψ_E(⋀_{j∈J} {χ_Aj}) = ⋀_{j∈J} ψ_E(χ_Aj)
(c) ψ_E(⋁_{j∈J} χ_Aj) = ⋁_{j∈J} {ψ_E(χ_Aj)}
(d) ψ_E² = ψ_E
(e) ψ_E(χ_A) ∈ Ob_E(X) and ψ_E(P(X)) = Ob_E(X)
(f) χ_A ∈ Ob_E(X) ⇔ ψ_E(χ_A) = χ_A.

So, given A ∈ P(X), Φ_E(χ_A) and ψ_E(χ_A) are, respectively, the sets of elements possibly observable and necessarily observable associated to A under the equivalence E.

Example 2. If we formulate Example 1 using the characteristic functions, taking a fixed order on X, then any set can be identified with a vector of 0's and 1's and the equivalence E is represented by a boolean square matrix; that is, B = (1, 0, 0, 1, 0, 0) and

E =
  1 1 1 0 0 0
  1 1 1 0 0 0
  1 1 1 0 0 0
  0 0 0 1 0 0
  0 0 0 0 1 1
  0 0 0 0 1 1

and then B̄ = E ⊕_min B = (1, 1, 1, 1, 0, 0) and B̲ = E →_min B = (0, 0, 0, 1, 0, 0), where ⊕_min stands for the Max-Min product of matrices and →_min for the Min-M̄in product, where M̄in(x, y) = Min(x, y) if x > y, and M̄in(x, y) = 1 otherwise.
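The two matrix operations of Example 2 can be sketched as follows (helper names are my own; the data are those of the example): the Max-Min product gives the upper approximation, the Min-residuum product the lower one.

```python
E = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
B = [1, 0, 0, 1, 0, 0]            # characteristic vector of B = {a, d}

def max_min(E, B):
    """Upper approximation: (E +min B)(x) = max_y min(E(x, y), B(y))."""
    return [max(min(e, b) for e, b in zip(row, B)) for row in E]

def min_res(E, B):
    """Lower approximation: (E ->min B)(x) = min_y (E(x, y) -> B(y)),
    with the residuum x -> y = 1 if x <= y, and y otherwise."""
    res = lambda x, y: 1 if x <= y else y
    return [min(res(e, b) for e, b in zip(row, B)) for row in E]

print(max_min(E, B))   # [1, 1, 1, 1, 0, 0]
print(min_res(E, B))   # [0, 0, 0, 1, 0, 0]
```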

2 Observable sets in a fuzzy setting

In the fuzzy framework, the concept of equivalence relation is captured by the so called T-indistinguishability operator or T-fuzzy equivalence relation where T represents a continuous t-norm. More precisely,

Definition 1. A fuzzy equivalence relation with respect to a continuous t-norm T is a reflexive and symmetric fuzzy relation E : X × X → [0,1] such that

T(E(x, y), E(y, z)) ≤ E(x, z) (T-transitive property)

for any x, y, z of X.

Given a fuzzy relation R : X × X → [0,1], a fuzzy set μ : X → [0,1] is a T-logical state for the relational structure (X, R) whenever T(μ(x), R(x, y)) ≤ μ(y) for any x, y in X [18]. Then, T-logical states are called T-R-observables when R is a T-indistinguishability operator [2].

In order to extend the idea of observable sets, we can use the formulations (b') and (c') of Proposition 1, since, as will be shown later, in a fuzzy context the representation (a) is not valid. Fuzzifying the formulation (b') we have:

Definition 2. Given a set X and a T-fuzzy equivalence relation E, h ∈ [0,1]^X is an E-observable set of type 1 if

Sup_{y∈X} {T(E(x, y), h(y))} = h(x).

And if we fuzzify the formulation (c') we have:

Definition 3. Given a set X and a T-fuzzy equivalence relation E, h ∈ [0,1]^X is an E-observable set of type 2 if

Inf_{y∈X} {E(x, y) →_T h(y)} = h(x)

where →_T is the residuated implication associated to the t-norm T, i.e. u →_T v = Sup {α | T(α, u) ≤ v}.

From the properties of T and →_T, it can be easily shown that both definitions are equivalent.

Let Ob_E^T(X) be the set of the T-E-observables of X, Ob_E(X) for short. If we denote by φ_E the mapping from [0,1]^X to itself defined by

φ_E(h)(x) = Sup_{y∈X} {T(E(x, y), h(y))},

then for any h ∈ [0,1]^X, h ∈ Ob_E(X) iff φ_E(h) = h. The map φ_E is a fuzzy closure operator and has been extensively studied [5,7,8]. This map assigns to every fuzzy subset h the set φ_E(h), which is the least E-observable that contains h.

Dually, by using Definition 3, we can define the mapping ψ_E : [0,1]^X → [0,1]^X in the following way:

ψ_E(h)(x) = Inf_{y∈X} {E(x, y) →_T h(y)}

and h ∈ [0,1]^X is an E-observable iff ψ_E(h) = h. This mapping is a fuzzy interior operator. Further studies on this subject can be found in [2]. It is worth noting that ψ_E(h) is the greatest E-observable contained in h.
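The following is an illustrative numeric sketch of the operators φ_E and ψ_E, assuming the Lukasiewicz t-norm W and a small made-up W-fuzzy equivalence E and fuzzy set h (none of these data are from the text).

```python
T   = lambda x, y: max(x + y - 1.0, 0.0)          # Lukasiewicz t-norm W
res = lambda x, y: min(1.0 - x + y, 1.0)          # its residuum x ->_W y

def phi_E(E, h, X):   # least E-observable containing h
    return {x: max(T(E[x][y], h[y]) for y in X) for x in X}

def psi_E(E, h, X):   # greatest E-observable contained in h
    return {x: min(res(E[x][y], h[y]) for y in X) for x in X}

# A tiny W-fuzzy equivalence on X = {0, 1, 2}: E(x, y) = 1 - |x - y| / 2
X = [0, 1, 2]
E = {x: {y: 1.0 - abs(x - y) / 2.0 for y in X} for x in X}
h = {0: 0.2, 1: 0.9, 2: 0.4}                      # an arbitrary fuzzy set

up, low = phi_E(E, h, X), psi_E(E, h, X)
print(up)    # least E-observable above h
print(low)   # greatest E-observable below h
fixed = all(abs(phi_E(E, up, X)[x] - up[x]) < 1e-9 and
            abs(psi_E(E, low, X)[x] - low[x]) < 1e-9 for x in X)
print(fixed)  # True: both results are fixed points, i.e. E-observables
```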

3 The set of E-observables fuzzy sets

In a classical setting, given an equivalence relation E on a set X, the set of E-observables is not empty, since X is always an E-observable. Let us denote by ObE(X) the set of E-observables. It is clear that the union, intersection and the complement of E-observables are also E-observables. Then, the set ObE(X) is a Boolean Algebra, namely the Boolean Algebra generated by the partition X/E.

That is, ObE(X) contains, if X is finite, the "events" representing all the reasonable questions that can be formulated on a random experiment whose results are exactly the singletons of the power set P(X).

What is the situation in the fuzzy case? In [5] it is proved that, given a T-fuzzy equivalence relation E on a set X, the set of the E-observables (Ob_E(X), ≤) is a complete lattice that is a sub-lattice of ([0,1]^X, ≤), contains all the constant fuzzy sets (μ_k(x) = k for any x ∈ X, k ∈ [0,1]) and has μ_0 and μ_1 as its least and greatest elements. Actually, this is a general property of the set of T-logical states, as was proven in [18]. Furthermore,

Proposition 2 ([2,6]). For any h ∈ Ob_E(X) and α ∈ [0,1] we have

(a) T ∘ (μ_α × h) ∈ Ob_E(X)
(b) (μ_α →_T h) ∈ Ob_E(X)
(c) (h →_T μ_α) ∈ Ob_E(X)

In [9] a reciprocal result can be found, namely

Proposition 3. If a family H ⊂ [0,1]^X satisfies (a), (b), (c) and is a complete sublattice of ([0,1]^X, ≤), then there exists a T-fuzzy equivalence relation E such that H = Ob_E(X).

As was commented in the introduction, in a classical setting an observable set is always a union of equivalence classes. This is not the situation for fuzzy sets. In this case, the "columns" E_x of E (E_x = E(x, ·)) can be naturally interpreted as the fuzzy equivalence classes associated to the fuzzy equivalence E [21,3], and we have the following

Proposition 4. If μ_x is the fuzzy set defined by μ_x(x) = 1 and μ_x(y) = 0 if y ≠ x, then φ_E(μ_x) = E_x.

Proof. φ_E(μ_x)(u) = Sup_{y∈X} {T(E(u, y), μ_x(y))} = E(u, x) = E_x(u) for all x, u ∈ X.

In other words, as happens in the classical case, the columns (the fuzzy classes) are the images of the singletons of X; they are also termed the "singletons" associated to E [4] and they can be interpreted as the fuzzified version of {x} via the fuzzy equivalence E.

Proposition 5. If A is a crisp set, we have

(a) φ_E(χ_A) = ⋁_{x∈A} E_x for all A ⊆ X
(b) If P(X) is the power set of X, φ_E(P(X)) is the semilattice H_X ⊂ Ob_E(X) whose elements are generated by {E_x}_{x∈X} via suprema.

Summing up, considering the columns as the fuzzy classes associated to E,

(a) The columns are observables.
(b) The images of the crisp sets under φ_E are observable, and they are the only ones that are a union of fuzzy classes.

In general, Ob_E(X) = φ_E([0,1]^X). Therefore, if h ∈ Ob_E(X) then there exists a μ′ ∈ [0,1]^X such that φ_E(μ′) = h, and

h(x) = ⋁_{y∈X} T(E_y(x), μ′(y)).

So Ob_E(X) can be interpreted as the set of fuzzy sets of the quotient set X/E (i.e. Ob_E(X) = [0,1]^{X/E}) and φ_E : [0,1]^X → [0,1]^{X/E} is the canonical map. Note that if E is a crisp equivalence relation then φ_E restricted to X is the canonical map π : X → X/E.

Let's go back to the boolean structure of the set of observables in the crisp case. What is the situation in the fuzzy case?

If h ∈ Ob_E(X) and N is a strong negation on [0,1], giving the pseudo-complement N ∘ μ for any μ ∈ [0,1]^X [15], then in general N ∘ h ∉ Ob_E(X). Ob_E(X) is a complete lattice, but it is neither a Boolean Algebra nor a De Morgan triplet. In some cases, by choosing a convenient t-norm and negation, this last structure can be fulfilled. The study of this subject was initiated in [2] and we have the following results.

Given a non-strict archimedean t-norm T, we can associate to T a strong negation defined by N_T(x) = x →_T 0 for all x ∈ [0,1]. If t is an additive generator of T, i.e. T(x, y) = t^(-1)(t(x) + t(y)), then x →_T y = t^(-1)(t(y) − t(x)) and N_T(x) = t^(-1)(t(0) − t(x)).
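A small numeric sketch of the generator-based construction just described (the generator and helper names are illustrative assumptions): with t(x) = 1 − x one recovers the Lukasiewicz t-norm, residuum and negation.

```python
def pseudo_inverse(t_inv, t0, u):
    """Clamp u into [0, t(0)] before inverting the decreasing generator."""
    return t_inv(min(max(u, 0.0), t0))

def make_ops(t, t_inv):
    t0 = t(0.0)
    T   = lambda x, y: pseudo_inverse(t_inv, t0, t(x) + t(y))      # t-norm
    imp = lambda x, y: pseudo_inverse(t_inv, t0, t(y) - t(x))      # residuum
    N   = lambda x: imp(x, 0.0)                                    # negation N_T
    return T, imp, N

# Lukasiewicz: t(x) = 1 - x, t^(-1)(u) = 1 - u  ->  N_T(x) = 1 - x
T, imp, N = make_ops(lambda x: 1.0 - x, lambda u: 1.0 - u)
print(T(0.7, 0.6), imp(0.7, 0.4), N(0.3))   # 0.3  0.7  0.7 (up to rounding)
print(N(N(0.3)))                            # 0.3: N_T is involutive (strong)
```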

Lemma 1. Let T be a continuous t-norm and E a T-fuzzy equivalence relation on a set X. If T is a non-strict archimedean t-norm and N_T its associated negation, or E is a crisp equivalence, then for any h ∈ Ob_E(X), N_T ∘ h ∈ Ob_E(X).

Proof. Let h ∈ Ob_E(X) and t a generator of both T and N_T. Then

ψ_E(h)(x) = h(x) = Inf_{y∈X} (E(x, y) →_T h(y))
= Inf_{y∈X} t^(-1)(t(h(y)) − t(E(x, y)))
= Inf_{y∈X} t^(-1)(t(0) − (t(E(x, y)) + t(0) − t(h(y))))
= Inf_{y∈X} t^(-1)(t(0) − t(t^(-1)[t(E(x, y)) + (t(0) − t(h(y)))]))
= Inf_{y∈X} N_T(T(E(x, y), N_T(h(y))))
= N_T(Sup_{y∈X} T(E(x, y), N_T(h(y))))
= N_T(φ_E(N_T ∘ h))(x)

for any x ∈ X. Therefore

N_T ∘ h = φ_E(N_T ∘ h) ⇒ N_T ∘ h ∈ Ob_E(X). □

So we can conclude that the set Ob_E(X) is closed under the pseudo-complementation N_T for t-norms T of the Lukasiewicz family, which can be represented by

T(x, y) = f^(-1)(W(f(x), f(y)))

where W(x, y) = Max(x + y − 1, 0) and f : [0,1] → [0,1] is a strictly increasing function with f(0) = 0 and f(1) = 1.

Lemma 2. Let T be a continuous t-norm, N a strong negation and E a non-crisp T-fuzzy equivalence relation on a set X. If for any h ∈ Ob_E(X), N ∘ h ∈ Ob_E(X), then T is a non-strict archimedean t-norm.

Proof. If T is a strict archimedean t-norm with generator t, let a, b ∈ X be such that E(a, b) ∉ {0, 1} and consider the fuzzy subset h_a = E(a, ·); then h_a ∈ Ob_E(X) and N ∘ h_a ∈ Ob_E(X). On the other hand, in [5] it is proved that, given h ∈ [0,1]^X,

E_h(x, y) = Min(h(x) →_T h(y), h(y) →_T h(x))    (3)

is a T-fuzzy equivalence such that, for any u, v ∈ X, E_h(u, v) ≥ E(u, v) iff h ∈ Ob_E(X).

Therefore,

E_{N∘h_a}(a, b) = Min(N h_a(a) →_T N h_a(b), N h_a(b) →_T N h_a(a))
= Min(N E(a, a) →_T N E(a, b), N E(a, b) →_T N E(a, a))
= Min(0 →_T N E(a, b), N E(a, b) →_T 0)
= N E(a, b) →_T 0 = 0 < E(a, b)

and N ∘ h_a ∉ Ob_E(X), which contradicts our hypothesis. The same inequality can be proved for the t-norm Min and any ordinal sum, noting that, in this case, x →_T 0 = 0 for all x ∈ (0,1]. □

So we have the following theorem

Theorem 1. Let T be a continuous t-norm and E a T-fuzzy equivalence relation on a set X. There exists a strong negation N such that for any h ∈ Ob_E(X), N ∘ h ∈ Ob_E(X) iff T is a non-strict archimedean t-norm.

Thus, Ob_E(X) is closed under fuzzy pseudo-complements whenever T is a non-strict archimedean t-norm.

4 Observability and approximation

Given a fuzzy or a crisp set A, its "observability" depends, of course, on the chosen equivalence relation E. For example, in X = N, the set A = {3, 6, ..., 3n, ...} ⊂ N is not observable if N/E = {{2, 4, ..., 2n, ...}, {1, 3, ..., 2n + 1, ...}}, but A is observable if N/E = {{1, 2, 4, 5, 7, 10, 11, ...}, {3, 6, 9}, {12, 15, ..., 99}, {102, ..., 999}, {1002, ..., 9999}, ...}. In the fuzzy case, the predicate B = Big on X = [0,1] with the usual membership function B(x) = x is a W-observable with respect to E(x, y) = 1 − |x − y|. But with respect to the Prod-fuzzy equivalence E_1(x, y) = Min{(1 + x)/(1 + y), (1 + y)/(1 + x)} on [0,1], we have E_1(1, 0.5) = Min(1.33, 0.75) = 0.75 > 0.5 = B(0.5), so B is not E_1-observable.

As we have pointed out in the preceding section, if a set A is not neatly observed we can improve its "observation" by refining the equivalence E. From another point of view, perhaps it is possible to find a more adequate E to observe A, related to its definition, that is, to the meaning of the concept conveyed by the membership function of A. Therefore, we have two different paths to follow. For the first one, if a T-fuzzy equivalence relation is already established on our set X, taking into account that, in any context where we have a (fuzzy) equivalence relation, the information levels, or the granularity of knowledge in this context, become more degraded as the level of equivalence increases, a natural problem arises: how to fit, in a coherent way, both the relation and the new pieces of information expressed by means of fuzzy sets.

In the presence of a new available piece of fuzzy information, namely a fuzzy set h ∉ Ob_E(X), this problem may be addressed in two different ways:

a) to modify E in order to accommodate the incoming information, or b) to modify h in order to make it compatible with E.

In the first case, modifying E in a suitable manner can be easily achieved by adding h to the set of all observables of E and then generating a new E′ according to the Representation Theorem [20,5]:

E′(x, y) = Inf_{μ∈H′} {E_T(μ(x), μ(y))}, where H′ = Ob_E(X) ∪ {h}.

That is, E′ = Min{E, E_h}, where E_h is generated according to (3). On the other hand, there are mainly two different ways to distort a fuzzy set h in order to make it compatible with a given fuzzy equivalence relation E. Both consist in approximating h by an observable, but they differ depending on whether the upper (φ_E(h)) or the lower (ψ_E(h)) approximation of h is chosen.

Another path to follow is to investigate if for a given fuzzy set A, there is a "natural" fuzzy equivalence E, making A E-observable. Natural in the sense of being directly linked with the meaning conveyed by the use on X of the name "A" and, at the same time, making A E-observable. That is, to find a fuzzy or crisp partition whose granularity is the most adequate to capture the meaning of "A" .

For example, the meaning of the vague predicate B = Big in X = [0,1] is grasped through its use in [0,1], and this use follows at least from the following rules:

1. If x = 0, then "x is not B".
2. If x = 1, then "x is B".
3. If "x is B" and y ≥ x, then "y is B".
4. If x > 0 and "x is B", then there exists some n ∈ N such that "x − 10^(−n) is B".

Consequently, if we take B(x) = f(x), where f : [0,1] → [0,1] is a continuous and non-decreasing function such that f(0) = 0 and f(1) = 1, f gives the "master" expression for any use of B on [0,1], the pattern for all of them. To have a fuzzy set we need to select a particular f by using more information about the particular use of "B" on [0,1]. Without any additional information it is customary to suppose the new rule:

5. The degree of the statement "x is B" depends linearly on x,

and then from 1 and 2 it follows that f(x) = x. Considering that rules 1 to 4 hold up to a degree given by some E_f built up using (3), it follows that E_f(1, y) = f(y), or B_1(y) = f(y); that is, B = B_1 is the class of the E_f fuzzy equivalence of x = 1, a result that seems in agreement with taking x = 1 as the prototype for the meaning of B in [0,1]. But which T should we take?

Since for any t-conorm T*(x, y) = 1 − T(1 − x, 1 − y) we have T*(x, y) ≥ Max(x, y) ≥ y, it follows from rule 3 that if "x is B" then also "T*(x, y) is B", for any y in [0,1]; this again does not allow us to choose any particular T, and the only rule that remains unused to this end is rule 4.

Under rule 5, E_f^Min(x, x − 10^(−n)) = x − 10^(−n); E_f^Prod(x, x − 10^(−n)) = 1 − 10^(−n)/x, and E_f^W(x, x − 10^(−n)) = 1 − 10^(−n), and this last degree is the only one that does not depend on x. If this property is convenient, then T = W. The biggest W-fuzzy equivalence for which B(x) = x is "observable" is E(x, y) = 1 − |x − y|.

Then, under the hypothesis that Big is used on [0,1] following rules 1 to 5, that the degree up to which rules 1 to 4 are certain is given by some E_f, and that rule 4 holds with degree 1 − 10^(−n), B is E-observable in a way that can be considered naturally linked with the use, or definition, of "B" on [0,1].

It should be pointed out that E is 1 minus the euclidean distance in [0,1] and that, with an accuracy of 10^(−n), it identifies any point x > 0 with those points y that verify x − 10^(−n) ≤ y ≤ x + 10^(−n). With the value of n provided by rule 4, all these points are Big provided that "x is B" (everything up to the corresponding degree).

This situation is general: obviously, any E : X × X → [0,1] is a W-fuzzy equivalence if and only if 1 − E is a distance [19].
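A quick numeric check of this remark on a random sample (illustrative only): E(x, y) = 1 − |x − y| is W-transitive on [0,1] because 1 − E is the euclidean distance and the triangle inequality holds.

```python
import random

W = lambda a, b: max(a + b - 1.0, 0.0)          # Lukasiewicz t-norm
E = lambda x, y: 1.0 - abs(x - y)

random.seed(0)
ok = all(
    W(E(x, y), E(y, z)) <= E(x, z) + 1e-12
    for _ in range(10_000)
    for x, y, z in [(random.random(), random.random(), random.random())]
)
print(ok)   # True: the triangle inequality for |.| gives W-transitivity
```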

Let's end this section by introducing a description of when it can be said that a fuzzy set μ ∈ F(X) is naturally observable. A fuzzy set μ of X labelled P is naturally observable whenever:

1) μ ∈ Ob_E(X)
2) If μ is crisp, E directly follows from the definition of P. If μ is properly fuzzy, E directly follows from the rules translating the use of P on X from which μ is constructed.

5 Observability and probability

Let X = {a, b, c, ...} be a Boolean Algebra with maximum 1, minimum 0, complement ′, union +, intersection ·, and partial order a ≤ b given by a · b = a or, equivalently, a + b = b. Let us also consider the vague predicate on X, p = probable, with generic use on X given by the rules:

1) If a = 1, then "a is p" is true.
2) If a · b = 0, the degree of "a + b is p" is the sum of the degrees of "a is p" and "b is p".

Then, the compatibility function μ_p of p in X can be expressed by means of any function p : X → [0,1] such that:

1′) p(1) = 1
2′) If a · b = 0, then p(a + b) = p(a) + p(b),

by taking μ_p(a) = p(a) as the "master" expression of μ_p. As is well known, the functions p are called probabilities on X, and from properties 1′ and 2′ it follows:

3) p(a + a′) = p(1) = 1 = p(a) + p(a′);
4) p(0) = p(1′) = 1 − p(1) = 0;
5) If a ≤ b, then p(a) ≤ p(a) + p(a′ · b) = p(a + a′ · b) = p(b).

Each particular use of p on X adds more information to the generic rules 1 and 2. New information can eventually lead to stating a concrete probability on X. For example, in modeling the random experiment of throwing a die, the Boolean Algebra X is the finite one with the six atoms a_i = "point i is obtained" (1 ≤ i ≤ 6) and, consequently, the degree of "a is p" is the sum of the degrees of the statements "a_i is p" for those atoms a_i such that a_i ≤ a. In this case, if μ_p(a_i) = p(a_i) ∈ [0,1], the functions p are only those that verify

p(a) = Σ_{a_i ≤ a} p(a_i)    (4)

with p(a_1) + ··· + p(a_6) = 1. Then, the new information given by the rule:

3') All statements "ai is p" hold up to the same degree.

fixes the probability (4) as that given by p(ai) = 1/6, the so called uniform probability.

Probabilistic Logics can be understood as those of the generic use of p, and they usually follow two lines for modeling implications. In the first one (Nilsson's Logic), the degree up to which "if "a is p", then "b is p"" holds is given by S_1(a, b) = p(a → b) = p(a′ + b), a function that is a W-p-fuzzy conditional relation [18] because W(p(a), p(a → b)) = p(a · b) ≤ p(b). In the second one (Pólya's Logic), this degree is given by

S_2(a, b) = 1 if p(a) = 0, and S_2(a, b) = p(b/a) = p(a · b)/p(a) otherwise,

a function that is a Prod-p-fuzzy conditional relation [18], since

p(a) · S_2(a, b) = 0 if p(a) = 0, and = p(a · b) otherwise, which is ≤ p(b).

Then, in both cases the corresponding probability p is a logical state for the respective fuzzy relational structure (X, S_i); in the first case p is a W-logical state and in the second one a Prod-logical state. Nevertheless, neither S_1 is a W-fuzzy equivalence nor S_2 is a Prod-fuzzy equivalence; both are reflexive, S_1 is W-transitive, and S_2 is not T-transitive for any t-norm T. Consequently, they are not adequate to "observe" the fuzzy set labeled p; they allow us to consider it only as a logical state and not as an observable.

Concerning the W-conditional fuzzy relation S_1, obviously E_p(a, b) = W(S_1(a, b), S_1(b, a)) = W(p(a → b), p(b → a)) = 1 − p(aΔb), with aΔb = a′ · b + a · b′ (the symmetric difference), is a W-fuzzy equivalence such that W(p(a), E_p(a, b)) ≤ W(p(a), p(a → b)) ≤ p(b). Then, p is a W-E_p-observable.
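A finite-case sketch of this relation (dice events with the uniform probability; an assumed example, not from the text): E_p(a, b) = 1 − p(aΔb) and p is a W-E_p-observable, i.e. W(p(a), E_p(a, b)) ≤ p(b) for all events a, b.

```python
from itertools import chain, combinations

atoms = range(1, 7)
events = [frozenset(c) for c in chain.from_iterable(
    combinations(atoms, k) for k in range(7))]

p   = lambda a: len(a) / 6.0                       # uniform probability
W   = lambda x, y: max(x + y - 1.0, 0.0)           # Lukasiewicz t-norm
E_p = lambda a, b: 1.0 - p(a ^ b)                  # 1 - p(symmetric difference)

print(all(W(p(a), E_p(a, b)) <= p(b) + 1e-12
          for a in events for b in events))        # True
```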

It should be pointed out that the biggest W-fuzzy equivalence under which p is a W-observable is E_p^W(a, b) = Min(p(a) →_W p(b), p(b) →_W p(a)) = 1 − |p(a) − p(b)|, and that, of course, it is possible to find Min- and Prod-fuzzy equivalences making p an observable set, as, for example, E_p^Min(a, b) = Min(p(a) →_Min p(b), p(b) →_Min p(a)) and E_p^Prod(a, b) = Min(p(a) →_Prod p(b), p(b) →_Prod p(a)), respectively. But these fuzzy equivalences do not depend either on the defining properties of the probability p or on the boolean structure of X; only E_p is linked both with these properties and with the logical character of S_1 as a W-conditional fuzzy relation used in Probabilistic Logic.

Furthermore, the function 1 − E_p(a, b) = p(aΔb) is a distance that plays an important role in the definition of the p-measurability of an "event" b, when p is initially defined only for "elementary events" of the Boolean σ-Algebra X, as follows: b is p-measurable if for any ε > 0 there exists an elementary event a such that p(aΔb) ≤ ε. That is, the only p-measurable events are those that can be "approximated with arbitrary accuracy" by elementary events under the distance given by the probability of the symmetric difference.

For all that, it seems adequate to say that p is naturally observable by means of E_p and with respect to the t-norm of Lukasiewicz.

In the same vein, what can we say in the case of the Prod-conditional fuzzy relation S_2? We can begin with the following

Lemma 3. There is no continuous t-norm T′ for which the reflexive and symmetric fuzzy relation

R(a, b) = T(S_2(a, b), S_2(b, a))

is T′-transitive, if the Boolean Algebra X is sufficiently large.

Proof. If R is T′-transitive then, whenever p(a) · p(b) > 0, p(b) · p(c) > 0 and p(a) · p(c) > 0,

T′( T(p(a·b)/p(a), p(b·a)/p(b)), T(p(b·c)/p(b), p(c·b)/p(c)) ) ≤ T(p(a·c)/p(a), p(c·a)/p(c)).

Take a · c = 0 and b = 1. It follows that T′(p(a), p(c)) = 0 and, consequently, it suffices to take a, c such that p(a) and p(c) approach 1, in addition to a · c = 0 and p(a) · p(c) > 0, to conclude that T′ cannot be a continuous t-norm. □

Hence, there is not a general path (analogous to that followed with 51 and W) for making p observable by means of 52 and Prod. So we may explore other possibilities.

In 1951 Karl Menger introduced the definition of a Probabilistic Relation [11] on a set X as a function E : X × X → [0,1] verifying the three properties of a Prod-fuzzy equivalence, interpreting the numbers E(x, y) as a "degree of probabilistic indistinguishability" between x and y. Later, in 1984, Sergei Ovchinnikov [12] proved that E is a Probabilistic Relation if and only if there is a family F of positive functions f : X → R such that

E(x, y) = Inf_{f∈F} Min(f(x)/f(y), f(y)/f(x)), for all x, y in X,

where the positive values f(x), f ∈ F, can be interpreted as different measurements of the elements of X.
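The following is an illustrative sketch (the measurement values are made up) of this representation: a family F of positive measurements induces a probabilistic relation, which is reflexive, symmetric and Prod-transitive.

```python
X = ["u", "v", "w"]
F = [  # two positive "measurements" of the elements of X (assumed numbers)
    {"u": 2.0, "v": 4.0, "w": 4.0},
    {"u": 3.0, "v": 3.0, "w": 6.0},
]

def E(x, y):
    """E(x, y) = inf_{f in F} min(f(x)/f(y), f(y)/f(x))."""
    return min(min(f[x] / f[y], f[y] / f[x]) for f in F)

Prod = lambda a, b: a * b
print(all(E(x, x) == 1.0 for x in X))                 # reflexivity
print(all(Prod(E(x, y), E(y, z)) <= E(x, z) + 1e-12
          for x in X for y in X for z in X))          # Prod-transitivity
```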

Theorem 2. If X is a Boolean Algebra and p a probability on X, P can not be obtained as a Prod-E-observable if E is built up by means of a finite number of positive measurements of the elements of X.

Proof. As the biggest Prod-fuzzy equivalence making p observable is E_p^Prod(a, b) = Min(p(a) →_Prod p(b), p(b) →_Prod p(a)), if p is a Prod-E-observable for a Prod-fuzzy equivalence E on X, then from E(a, b) ≤ E_p^Prod(a, b) for all a, b in X it follows that E(0, 1) ≤ E_p^Prod(0, 1) = 0, i.e. E(0, 1) = 0. Then, if F = {f_1, ..., f_n} is a finite family of positive measurements giving E, it should be

0 = E(0, 1) = Min(f_1(0)/f_1(1), f_1(1)/f_1(0), ..., f_n(0)/f_n(1), f_n(1)/f_n(0)),

which implies the existence of some f_i(0) = 0 or some f_j(1) = 0. This contradicts the hypothesis that all the functions f ∈ F are positive. □

Although it is not physically realizable, Theorem 2 does not exclude the possibility of obtaining p as a Prod-observable through a denumerable family of positive measurements. The constraint "positive measurements" is rather peculiar in the framework of random experiments: for example, if f(a) is the frequency with which an event a occurs in a finite series of repetitions of a certain experiment, it could happen that f(a) = 0 or f(a′) = 0.

Remark. Taking into account the representation theorem for T-fuzzy equivalence relations in the version presented in [5], we have:

(i) Given μ ∈ [0,1]^X and T a continuous t-norm, the fuzzy relation

E_μ(x, y) = Min(μ(x) →_T μ(y), μ(y) →_T μ(x))

is a T-fuzzy equivalence relation.
(ii) E is a T-fuzzy equivalence relation iff there exists a family of fuzzy sets μ_i ∈ [0,1]^X, i ∈ I, such that

E(x, y) = Inf_{i∈I} E_{μ_i}(x, y).

In our case, we have T(x, y) = x · y and x →_T y = 1 if x ≤ y, and x →_T y = y/x otherwise; in particular, for y = 0, x →_T 0 = 1 if x = 0, and x →_T 0 = 0 otherwise. So for any μ_i such that μ_i(x) = 0 ≠ μ_i(y) (or μ_i(y) = 0 ≠ μ_i(x)), E_{μ_i}(x, y) = 0. Therefore, for any finite set of measurements F = {f_1, ..., f_n} with f_i : X → [0,1] such that there exists f_i ∈ F with 0 = f_i(0) ≠ f_i(1), the probabilistic relation generated by F verifies E(0, 1) = 0.

6 Conclusions

In this final section we would like to sum up, after the technical considerations made above, some of the main ideas arising in our analysis: "observability" in a classical setting is related to measurement with a given collection of "crisp" but imprecise instruments. Thus observed sets of real objects become a "granulated" family of subsets (granules) which can be mathematically defined in terms of an equivalence relation. These equivalent observed sets become classes which may reflect some precise characteristics as well as the limitations of the process performed to classify the objects. We may recall here the classical description by Karl Menger concerning sensorial experiments where the subject's perception can distinguish zones but not points. We have discussed, in this classical approach, how the equivalence relation that yields observable classes can be combined with the technique of approximation of non-observable sets by upper and lower observable subsets by means of a refined partition associated to the equivalence relation. When working with characteristic functions we have shown how the mappings Φ_E and ψ_E

describe completely the upper and lower observable subsets mentioned above. These descriptions of the boolean observable sets have motivated, in Section 2, the study of the fuzzy case, where, after using a fuzzy equivalence relation E, one can consider the appropriate E-observable fuzzy sets by taking full advantage of the fuzzified versions of the operators φ_E and ψ_E. Bearing in mind these definitions, the structure of the set of E-observable sets has been studied and relations with crisp sets and their images have been established in Section 3.

Section 4 deals with the problem of observability and how modifications can be introduced when progressive information changes the previous fuzzy classifications. In it, the idea of how to observe a fuzzy set in the way most naturally linked to the meaning it conveys is introduced.

Finally, Section 5 is devoted to the single case of the linguistic label "probable" in a Boolean Algebra, namely to the "observability" of a given probability, viewed as a fuzzy set. It is shown that with T = W (or internal observability) a probability is always naturally observable, but that with T = Prod (or external observability) the problem actually remains open.

References

1. Alsina, C., Trillas, E. and Valverde, L. (1983) On some logical connectives for Fuzzy Set Theory. J Math Anal Appl 93:15-26.

2. Boixader, D., Jacas, J. and Recasens, J. (2000) Fuzzy Equivalence Relations: Advanced Material. In: D. Dubois, H. Prade (Eds.) Fundamentals of Fuzzy Sets, Kluwer Ac. Pub., Boston, 261-290.

3. Dubois, D. and Prade, H. (1994) Similarity-Based Approximate Reasoning. In: Computational Intelligence Imitating Life. Zurada, J.M., Marks, R.J., Robinson, C.J. (Eds.), IEEE Press, New York, 69-80.

4. Höhle, U. (1990) The Poincaré paradox and the cluster problem. In: A. Dress, A. von Haeseler (Eds.), Trees and Hierarchical Structures, Lecture Notes in Biomathematics, 84, 117-124, Springer-Verlag.

5. Jacas, J. (1988) On the generators of T-indistinguishability operators. Stochastica 12:49-63.

6. Jacas, J. (1993) Fuzzy topologies induced by S-metrics. The Journal of Fuzzy Mathematics 1-1:173-191.

7. Jacas, J. and Recasens, J. (1994) Fixed points and generators of fuzzy rela­tions. J Math Anal Appl 186:21-29.

8. Jacas, J. and Recasens, J. (1995) Fuzzy T-transitive relations: eigenvectors and generators. Fuzzy Sets & Systems 72:147-154.

9. Klawonn, F. and Castro, J.L. (1995) Similarity in Fuzzy Reasoning, Mathware & Soft Computing 2:197-228.

10. Kolmogorov, A.N. and Fomin, S.V. (1961) Measure, Lebesgue Integrals and Hilbert Space. Academic Press.

11. Menger, K. (1951) Probabilistic theory of relations. Proc Nat Acad Sci USA 37:178-180.

12. Ovchinnikov, S.V. (1984) Representation of transitive fuzzy relations. Aspects of Vagueness (Eds. H.J. Skala et al.), 105-118, Reidel.

13. Trillas, E. (1993) On logic and Fuzzy Logic. Int. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 1-2:107-137.

14. Trillas, E. (1994) On Membership and Fuzzy Logic States. Proc. 2nd EUFIT 2:796-801.

15. Trillas, E. (1998) On negation functions in fuzzy set theory, In: Advances in Fuzzy Logic, S. Barro et. a1. Eds. Universidade de Santiago de Compostela, 31-43. English version of the Spanish original of 1979.

16. Trillas, E. and Alsina, C. (1978) Introduccion a los Espacios Metricos Gener­alizados. Fund J March Serie Universitaria 49, Madrid. (In Spanish).

17. Trillas, E. and Alsina, C. (1992) Some Remarks on Approximate Entailment. Int Jour Approximate Reasoning 6:525-533.

18. Trillas, E. and Alsina, C. (1993) Logic: Going further from Tarski? Fuzzy Sets and Systems 53:1-13.

19. Trillas, E. and Valverde, L. (1984) An Inquiry into Indistinguishability Oper­ators. Aspects of Vagueness (Eds. H.J. Skala et altri), 231-256, Reidel Pubs.

20. Valverde, L. (1985) On the structure of F-indistinguishability operators. Fuzzy Sets & Systems 17:313-328.

21. Zadeh, L. A. (1971) Similarity relations and fuzzy orderings. Inform Sci 3:177-200.

Granulation and Granularity via Conceptual Structures: A Perspective From the Point of View of Fuzzy Concept Lattices

Radim Belohlavek

Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Bráfova 7, CZ-701 03 Ostrava, Czech Republic

Abstract. A fundamental way of granulating the outer world performed by humans is that of forming concepts. We present a formal model of concepts and conceptual structures called fuzzy concept lattices which is a natural formalization of the Port-Royal approach to concepts. Fuzzy concepts, i.e. the induced fuzzy granules, obey a complete hierarchy w.r.t. the subconcept-superconcept relation. Attention is paid to similarity relations (similarity of objects, concepts and conceptual structures are distinguished) and to logical precision, both of which represent a systematic way to control the granularity and reduce the complexity of the conceptual structure. Applications in conceptual data analysis and representation of conceptual knowledge are discussed.

1 Granulation as Formation of Concepts

The question of what are concepts is one of the most peculiar questions of philosophy, psychology, and cognitive sciences. The reason is the cenral role of concepts in human reasoning. Namely, human reasoning is often identified with reasoning with concepts. There is no unique view on concepts because of the several weltanschaungs and several levels on which the problem of concepts may be regarded. Besides the several classical philosophical writings where one can find considerations on the nature of concepts, concepts are discussed in [25,31,32,34] where one can find also further references.

We are interested in the cognitive, esp. in the information processing role of concepts. Recently, it has been stressed that one of the fundamental features underlying the human capability to cope effectively with the enormous complexity of the perceived information is the granulation of the outer world [48]. From the point of view of soft computing it is therefore an imperative to develop plausible yet computationally tractable models of information granulation. Clearly, granulation of the outer world may be thought to be tantamount to the formation of concepts. Loosely stated, a concept (e.g. MAMMAL 1) represents an information granule via aggregating all the objects (i.e. all the particular mammals) which are covered by the concept. A characteristic feature of the collection of the (formed) concepts, i.e. the information granules, is the hierarchical structure. The hierarchy reflects the natural subconcept-superconcept relation: there are concepts (e.g. ANIMAL) which are more general as well as concepts (e.g. DOG) which are more specific than a given concept (e.g. MAMMAL). The hierarchy is crucial for the conceptual classification of the elementary objects. The granulation and the hierarchy belong to the most important features of conceptual structures.

From the formal point of view, the study of concepts has been a part of traditional logic of Port-Royal school [36]. Even the books on logic from the beginning of this century follow the conception of Port-Royal school with its theory of concepts (see e.g. [20]). With the commencement of modern mathematical logic the chapters on concepts almost disappeared from the logic books. A logical theory of concepts (based on TichY's transparent intensional logic [35]) has been proposed in [25,27]. The notion of concept often appears also in artificial intelligence and clustering. However, except for some cases such as e.g. frames, schemas, and scripts [34,43], the proposed models of concepts are too simple to fulfil the above functions of concepts in a satisfactory way. Conceptual graphs developed by Sowa (see e.g. [24,33]) represent a formalism for reasoning with concepts. For the limited scope they will, however, not be discussed.

We are concerned with the Port-Royal approach to concepts. An alge­braic theory of concepts and hierarchical conceptual structures in the sense of Port Royal has been developed by Wille and his collaborators under the name theory of concept lattices or concept data analysis [1,13,15,14,39,41]. This theory is, in fact, a theory of sharp concepts. In [2,3,7,6,5] the theory has been generalized from the point of view of fuzzy approach. The result­ing theory enables us to model hierarchical structures of non-sharp (fuzzy) concepts with applications in conceptual analysis of fuzzy data and represen­tation of conceptual knowledge. In the following we deal with the problems of granulation and granularity of information from the point of view of mod­eling of concepts in the sense of Port-Royal school. Most of the results have been obtained recently in [2,3,7,6,5] where also the proofs which are omitted here can be found. New here is Theorem 10.

2 Preliminaries

Recall that a fuzzy set [45] in a universe set X is any function A : X ---+ L, where L is a suitable set (of truth values). The value A(x) is called the membership degree of x in A and it is interpreted as the truth value of "x is element of A". Similarly, a fuzzy relation between X and Y is any function

1 We adopt the useful convention of denoting concepts which correspond to lin-guistic expressions by these expressions written capitalized (e.g. MAMMAL).

I : X × Y → L. By {a/x}, where a ∈ L, x ∈ X, is meant the fuzzy set given by {a/x}(x) = a and {a/x}(x′) = 0 for x′ ∈ X, x′ ≠ x. The crucial step is the choice of an appropriate structure on L. From the point of view of fuzzy logic, a natural one is that of a residuated lattice. A residuated lattice [19,21] is an algebra L = (L, ∧, ∨, ⊗, →, 0, 1) where

(i) (L, ∧, ∨, 0, 1) is a lattice with the least element 0 and the greatest element 1,

(ii) (L, ⊗, 1) is a commutative monoid, i.e. ⊗ is associative, commutative, and the identity x ⊗ 1 = x holds,

(iii) ⊗ and → satisfy the adjointness property, i.e.

x ≤ y → z iff x ⊗ y ≤ z

holds for each x, y, z ∈ L (≤ denotes the lattice ordering).

Residuated lattices have been introduced by Dilworth and Ward [38]. Later on, they have been studied under several other names, e.g. integral commutative residuated l-monoids [9, pp. 324-325]' residuated abelian semi­groups with a unit [10, pp. 211-214], or commutative complete lattice ordered semigroups with infinity [16,17]. Residuated lattices have been intensively studied from the point of view of fuzzy logic [21]. A semantically complete first-order many-valued logic with semantics defined over complete residu­ated lattices is described in [22]. Several special classes of residuated lattices serve as structures of truth values of logical calculi which are semantically complete [19,28,30,37]. An interesting discussion on the soundness of choos­ing residuated lattice as the structure of truth values from the logical point of view may be found in [17] and also in [19, pp. 25-27] which is in the spirit of[17].

In each residuated lattice it holds that x ≤ y implies x ⊗ z ≤ y ⊗ z (isotonicity), and x ≤ y implies z → x ≤ z → y (isotonicity in the second argument) and x → z ≥ y → z (antitonicity in the first argument). The operation ⊗ is thus a t-norm (see e.g. [19,23]); → is called the residuum. In the following we will deal with complete residuated lattices, i.e. (L, ∧, ∨, 0, 1) is assumed to be a complete lattice. It can be shown that in each complete residuated lattice the identity

a ⊗ ⋁_{i∈I} b_i = ⋁_{i∈I} (a ⊗ b_i)    (1)

holds. Conversely, if in a complete lattice with a binary operation ⊗ which makes L a commutative semigroup with the identity element 1 the identity (1) is satisfied, then there is a uniquely determined binary operation → on L which satisfies the adjointness property (it holds that a → b = ⋁_{a⊗c≤b} c).

The t-norm ⊗ and the residuum → are intended for modeling of the conjunction and implication, respectively. The adjointness property establishes an important connection between these two operations. The completeness of the residuated lattice is usually required for modeling of the semantics of quantifiers of first-order fuzzy logic (for a different approach using safe models see [19]). Supremum (⋁) and infimum (⋀) are intended for modeling of the existential and general quantifier, respectively.

The most studied and applied set of truth values is the real interval [0,1]. The most important are the Lukasiewicz, Gödel, and product algebras (see [19] for their role and the definitions) defined by the following t-norms: a ⊗ b = max(a + b − 1, 0) (Lukasiewicz), a ⊗ b = min(a, b) (Gödel), and a ⊗ b = a · b (product), with the corresponding residua given by a → b = min(1 − a + b, 1), a → b = 1 if a ≤ b and = b else, a → b = 1 if a ≤ b and = b/a else, respectively. Another important set of truth values is given by (the ordering determines the complete lattice structure) L = {a_0, a_1, ..., a_n}, 0 = a_0 < ... < a_n = 1. Two important t-norms are often considered: a_k ⊗ a_l = a_{max(k+l−n,0)} with the corresponding residuum a_k → a_l = a_{min(n−k+l,n)} (Lukasiewicz), and a_k ⊗ a_l = a_{min(k,l)} with the corresponding residuum a_k → a_l = 1 if k ≤ l and = a_l else (Gödel, i.e. the restrictions of the Gödel t-norm and residuum on [0,1] to {a_0, ..., a_n}). A special case of the latter algebras is the Boolean algebra 2 of classical logic with the support 2 = {0, 1}. It may be easily verified that the only t-norm on {0, 1} is the classical conjunction operation ∧, i.e. a ∧ b = 1 iff a = 1 and b = 1, which implies that the only residuum operation is the classical implication operation →, i.e. a → b = 0 iff a = 1 and b = 0. Note that each of the preceding residuated lattices is complete.
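The three structures just listed can be spot-checked with the short sketch below (an illustrative check, not from the text), which verifies the adjointness property x ≤ y → z iff x ⊗ y ≤ z on a grid of rational truth values.

```python
from fractions import Fraction as F

ONE = F(1)
lukasiewicz = (lambda a, b: max(a + b - ONE, F(0)),
               lambda a, b: min(ONE - a + b, ONE))
godel       = (lambda a, b: min(a, b),
               lambda a, b: ONE if a <= b else b)
product     = (lambda a, b: a * b,
               lambda a, b: ONE if a <= b else b / a)

grid = [F(i, 10) for i in range(11)]
for name, (tnorm, imp) in [("Lukasiewicz", lukasiewicz),
                           ("Godel", godel), ("product", product)]:
    ok = all((x <= imp(y, z)) == (tnorm(x, y) <= z)
             for x in grid for y in grid for z in grid)
    print(name, ok)   # True for all three: adjointness holds on the grid
```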

Fuzzy sets (fuzzy relations) are also called L-sets (L-relations) if the structure L is to be emphasized [16,17]. In this perspective, classical sets (relations) are identified with 2-sets (2-relations). An L-set A is called crisp iff A(x) ∈ {0, 1} for each possible x (similarly for L-relations). Thus, each 2-set (and 2-relation) is crisp. The set of all L-sets in a given universe X will be denoted by L^X.

3 Fuzzy concepts and fuzzy concept lattices

3.1 The traditional (Port-Royal) approach to concepts

From the historical point of view our theory of concepts is based on the traditional approach to concepts which has been developed within the Port-Royal school of logic [25,26,36,20,44]. By Port-Royal, a concept is determined by its extent and its intent ("étendue" and "compréhension" in French and "Umfang" and "Inhalt" in German are used for extent and intent). The extent-intent approach is an example of the set conception in theories of concepts [26]. By the extent of a concept it is meant the collection of all

objects covered by the concept2 . For example, the extent of the concept BLACK DOG is the collection of all black dogs. The intent is understood as the collection of all attributes (or properties) 3 covered by the concept. In our example, the intent of the concept BLACK DOG could be e.g. given by the attributes (properties) "to be black", "to have four extremities", "to be a mammal", "to bark" etc. Both extent and intent are therefore understood to be collections. Implicitly hidden in the texts on Port-Royal approach to concepts is the following assumption:

extent is the collection of all objects sharing all the attributes from the intent    (2)

and

intent is the collection of all attributes shared by all the objects from the extent.    (3)

Some trivial precepts follow immediately. Consider e.g. two concepts, c_1 (BLACK DOG) and c_2 (DOG), such that each element of the extent of c_1 belongs to the extent of c_2. In that case each property from the intent of c_2 belongs to the intent of c_1. In other words,

the more objects, the less attributes (common to all the objects), (4)

and vice versa,

the more attributes, the less objects (sharing all the attributes). (5)

The above consideration can be taken for the philosophical background of the theory we present in the following. Note that the set conception can hardly be defended seriously [25,26] from the philosophical point of view. The concepts in the Port-Royal conception are to be understood as concepts with conjunctive structure which are "cut" in a given world at a given time.

3.2 Fuzzy concepts

To formalize the Port-Royal approach to concepts we start with the primitive notion of a fuzzy context. It is supposed that we restrict our attention to only a certain collection of objects (which will be then the elements of extents of concepts) and to a certain collection of attributes (the elements of intents) of these objects. Furthermore, it is assumed that there is the relation "to

2 This is called the per exempla conception (it is due to scholasticism and Leibniz). In the second conception (called per notiones), the extent is understood to be the class of its subconcepts (it is due to Aristotle). See [25].

3 Sometimes understood as the superconcepts of the concept.

have a property" between the objects and the attributes. An L-context (or a fuzzy context) is a triple (G, M, I) where 1 is an L-relation between the sets G and M, i.e. 1 : G x M ~ L where L is the support of an complete residuated lattice L4. The set G represents the set of all objects from the universe of discourse, M represents the set of all attributes (properties) from the universe of discourse and 1 represents the relation "to have an attribute (property)". For 9 E G and m E M, the value I(g, m) is therefore the truth value of the fact "the object 9 has the property m". Unless explicitly stated, in what follows we assume that we have fixed some L.

By the Port-Royal definition, a concept consists of its extent and intent. Extents and intents of concepts are in general non-sharp collections (consider empirical concepts like BIG DOG). Therefore, following the fuzzy approach, they will be modeled by fuzzy sets. Taking the conditions (2) and (3) into account, crucial roles are played by the mapping ↑ assigning to each collection A (i.e. a fuzzy set A ∈ L^G) of objects the collection A↑ (a fuzzy set A↑ ∈ L^M) of all attributes common to all the objects of A, and the mapping ↓ assigning to each collection B (i.e. a fuzzy set B ∈ L^M) of attributes the collection B↓ (a fuzzy set B↓ ∈ L^G) of all objects sharing all the attributes of B. Rewriting these conditions in logic formulas and evaluating them in the residuated lattice L we get the following definition of ↑ and ↓:

A↑(m) = ⋀_{g∈G} (A(g) → I(g, m)) for m ∈ M,    (6)

B↓(g) = ⋀_{m∈M} (B(m) → I(g, m)) for g ∈ G.    (7)

We are going to show that ↑ and ↓ satisfy the properties linguistically described by (4) and (5). In fact, they satisfy more. To this end we need the following notion. Note that for L-sets A_1, A_2 ∈ L^X, the subsethood degree [16] of A_1 in A_2 is given by Subs(A_1, A_2) = ⋀_{x∈X} (A_1(x) → A_2(x)).

Definition 1. An L-Galois connection (or fuzzy Galois connection) between the sets G and M is a pair (↑, ↓) of mappings ↑ : L^G → L^M, ↓ : L^M → L^G, satisfying

Subs(A_1, A_2) ≤ Subs(A_2↑, A_1↑)    (8)
Subs(B_1, B_2) ≤ Subs(B_2↓, B_1↓)    (9)
A ≤ (A↑)↓    (10)
B ≤ (B↓)↑.    (11)

4 "G", "M", and "I" stem from the German words "Gegenstange" (objects), "Merk­male" (attributes or properties), and "Incidenz" (incidence) used by Wille.

Remark 1. Note that for L = 2 we get the well-known notion of Galois connection between power sets. Also, the results are generalizations of the classical results [9,29].

A simple characterization of L-Galois connections is provided by the following theorem.

Theorem 1 ([3]). A pair (↑, ↓) forms an L-Galois connection between G and M iff

Subs(A, B↓) = Subs(B, A↑)    (12)

for all A ∈ L^G, B ∈ L^M.

The following theorem shows that there is a one-to-one correspondence between L-Galois connections and binary L-relations. As a consequence, instead of starting the formalization of the Port-Royal approach to concepts with the primitive notion of an L-context, one could alternatively start with the notion of an L-Galois connection. For the limited scope we omit the proof (it can be found in [3]).

Theorem 2 ([3]). (1) For a binary L-relation I ∈ L^{G×M} define the pair (↑_I, ↓_I) of mappings by (6) and (7). For an L-Galois connection (↑, ↓) between G and M denote by I_{(↑,↓)} the binary L-relation defined by I_{(↑,↓)}(g, m) = {1/g}↑(m). Then (↑_I, ↓_I) is an L-Galois connection and it holds that

(↑, ↓) = (↑_{I_{(↑,↓)}}, ↓_{I_{(↑,↓)}}) and I = I_{(↑_I,↓_I)}.

(2) For any L-Galois connection (↑, ↓) between G and M and A ∈ L^G, B ∈ L^M, it holds that A↑↓↑ = A↑ and B↓↑↓ = B↓.

The following definition of formal concept is a direct formalization of the conditions (2) and (3) of the Port-Royal approach.

Definition 2. A (formal) L-concept (or fuzzy concept) in an L-context (G, M, I) is a pair (A, B) ∈ L^G × L^M satisfying A↑ = B and B↓ = A.

A is called the extent, B is the intent of (A, B). We write only context and concept if L is obvious. As will be apparent, the structure of truth values influences the structure of concepts. It should be therefore selected carefully.

Remark 2. Note that for L = 2 we get the notion of context and formal concept studied by Wille, see [39].

We are going to show that L-concepts of (G, M, I) correspond to the maximal rectangles contained in I. For A ∈ L^G, B ∈ L^M, denote by A⊗B the L-set in G × M defined by (A⊗B)(g, m) = A(g) ⊗ B(m). Call a rectangle any pair (A, B) ∈ L^G × L^M. There is a naturally defined ordering ≤ on the set of all rectangles given by (A_1, B_1) ≤ (A_2, B_2) iff for all g ∈ G, m ∈ M it holds that A_1(g) ≤ A_2(g) and B_1(m) ≤ B_2(m). A rectangle (A, B) is said to be contained in I if A⊗B ≤ I. The following theorem generalizes the observation of the classical case stating that concepts are just the maximal rectangles of I which are filled with 1's (if we consider the two-valued relation I as a matrix-table of 0's and 1's).

Theorem 3. Let (G, M, I) be an L-context. For each A ∈ L^G, B ∈ L^M it holds that (A, B) is an L-concept iff it is a maximal rectangle contained in I.

Proof. Let (A, B) be an L-concept. If it were not maximal, there would be (A′, B′) ∈ L^G × L^M such that (A, B) < (A′, B′). Hence, there exists a g ∈ G such that A(g) < A′(g) or an m ∈ M such that B(m) < B′(m). Suppose the former, i.e. A(g) < A′(g) for some g ∈ G (the latter may be handled analogously). By assumption, B(m) ≤ B′(m) holds for all m ∈ M, therefore A′(g) ⊗ B(m) ≤ A′(g) ⊗ B′(m) ≤ I(g, m), and thus A′(g) ≤ B(m) → I(g, m) holds for each m ∈ M. We conclude

A(g) < A′(g) ≤ ⋀_{m∈M} (B(m) → I(g, m)) = B↓(g),

a contradiction to A = B↓ ((A, B) is a concept).
Conversely, let (A, B) be a maximal rectangle contained in I. We have to show A = B↓ and B = A↑. From A(g) ⊗ B(m) ≤ I(g, m) it follows that A(g) ≤ B(m) → I(g, m) for all g ∈ G, m ∈ M, i.e. A(g) ≤ B↓(g) = ⋀_{m∈M} (B(m) → I(g, m)), thus A ≤ B↓. (B↓, B) is contained in I, since B↓(g) ⊗ B(m) ≤ I(g, m) is equivalent to B↓(g) ≤ B(m) → I(g, m), which holds evidently. If A ≠ B↓, i.e. A(g) < B↓(g) for some g ∈ G, then (A, B) < (B↓, B), a contradiction to the maximality of (A, B) among rectangles contained in I. The condition B = A↑ may be shown analogously.

Denote

B(G, M, I) = {(A, B) ∈ L^G × L^M | A↑ = B, B↓ = A}

the set of all L-concepts in a given L-context (G, M, I). The concepts (A, B) of B(G, M, I) represent (structured) granules. In general, they are fuzzy granules. If L = 2, we get crisp granules. Note that granulated here are both the set G of (elementary) objects and the set M of (elementary) attributes. The granulation criterion is given by the Port-Royal definition of concept.
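As a brute-force sketch (toy context, Lukasiewicz structure on L = {0, 0.5, 1}; all data are assumed for illustration), B(G, M, I) can be listed by forming every pair (A↑↓, A↑) with A ranging over L^G and keeping the distinct fixed points, which by the properties of ↑ and ↓ recalled above are exactly the L-concepts.

```python
from itertools import product

L = [0.0, 0.5, 1.0]
G, M = ["g1", "g2"], ["m1", "m2"]
I = {("g1", "m1"): 1.0, ("g1", "m2"): 0.5,
     ("g2", "m1"): 0.0, ("g2", "m2"): 1.0}

res = lambda a, b: min(1.0 - a + b, 1.0)       # Lukasiewicz residuum

up   = lambda A: tuple(min(res(A[i], I[g, m]) for i, g in enumerate(G)) for m in M)
down = lambda B: tuple(min(res(B[j], I[g, m]) for j, m in enumerate(M)) for g in G)

concepts = {(down(up(A)), up(A)) for A in product(L, repeat=len(G))}
for extent, intent in sorted(concepts):
    print("extent", extent, "intent", intent)
```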

In the next section we investigate the hierarchical structure of concepts­granules induced by a context.

3.3 L-concept lattices

A natural relation which makes each set of concepts a hierarchical structure is the subconcept-superconcept relation. Informally, a concept c1 (e.g. DOG) is a subconcept of the concept c2 (e.g. MAMMAL) if each element of the extent of c1 is an element of the extent of c2 (i.e. each dog is a mammal) or, equivalently, each element of the intent of c2 is an element of the intent of c1 (i.e. each attribute of mammals is also an attribute of dogs). In this case we also say that c2 is a superconcept of c1. Formalizing the intuitive notions, we introduce the relation ≤ on the set B(G, M, I) by

(A1, B1) ≤ (A2, B2)  iff  A1 ⊆ A2  (iff  B2 ⊆ B1).

Our primary concern is to study the above introduced hierarchy. To this end, denote by ℬ(G, M, I) the pair (B(G, M, I), ≤) and call it the L-concept lattice (or fuzzy concept lattice) induced by (G, M, I). The fundamental properties and characterization of L-concept lattices are given by the following theorem. Note that for a complete lattice V, a subset K ⊆ V is ⋁-dense (supremally dense) in V (⋀-dense (infimally dense) in V) if for each v ∈ V there is K' ⊆ K such that v = ⋁K' (v = ⋀K').

Theorem 4 (Main theorem of L-concept lattices). Let (G, M, I) be an L-context. (1) ℬ(G, M, I) is a complete lattice in which infima and suprema can be described as follows:

⋀_(j∈J) (A_j, B_j) = (⋂_(j∈J) A_j, (⋂_(j∈J) A_j)↑) = (⋂_(j∈J) A_j, (⋃_(j∈J) B_j)↓↑),   (13)

⋁_(j∈J) (A_j, B_j) = ((⋂_(j∈J) B_j)↓, ⋂_(j∈J) B_j) = ((⋃_(j∈J) A_j)↑↓, ⋂_(j∈J) B_j).   (14)

(2) Moreover, a complete lattice V = (V, ≤) is isomorphic to ℬ(G, M, I) iff there are mappings γ : G × L → V, μ : M × L → V, such that γ(G × L) is ⋁-dense in V, μ(M × L) is ⋀-dense in V, and a ⊗ b ≤ I(g, m) is equivalent to γ(g, a) ≤ μ(m, b) for all g ∈ G, m ∈ M, a, b ∈ L.

Proof. A complete proof can be found in [7]. We give here only a sketch. Part (1) of the assertion follows directly from the results on Galois connections between complete lattices [9] by observing that ↑ and ↓ form a Galois connection between the complete lattices (L^G, ⊆) and (L^M, ⊆).

Part (2): If B(G, M, I) and V are isomorphic, the required mappings γ and μ are given by

γ(g, a) = φ(({a/g}↑↓, {a/g}↑)),

μ(m, b) = φ(({b/m}↓, {b/m}↓↑)),

for every g ∈ G, m ∈ M, a, b ∈ L, where φ : B(G, M, I) → V is the isomorphism.

Conversely, if γ and μ with the above properties exist, it can be shown that the mappings φ : B(G, M, I) → V, ψ : V → B(G, M, I) given by

φ(A, B) = ⋁_(g∈G) γ(g, A(g))   (15)

for each (A, B) ∈ B(G, M, I), and

ψ(x) = (A, B), where A(g) = ⋁_(γ(g,a) ≤ x) a, B(m) = ⋀_(μ(m,b) ≥ x) b   (16)

for each x ∈ V, and every g ∈ G, m ∈ M, are monotone bijections, i.e. ℬ(G, M, I) and V are isomorphic.

Remark 3. The above theorem states that the set B(G, M, I) forms a complete hierarchy: for each set of concepts of B(G, M, I) there are both the direct superconcept (the supremum) and the direct subconcept (the infimum). From the epistemological point of view this property seems to be a very natural one.

Remark 4. Theorem 4 generalizes the Main theorem of concept lattices [39] which is a special case for L = 2.

4 Granularity and granulation in conceptual structures

For a fixed structure L of truth values and a fixed context (G, M, I), the concept lattice ℬ(G, M, I) represents the granules (concepts) resulting from the conceptual granulation. In the following we study two further aspects important from the point of view of granularity and granulation: (1) We are concerned with the granularity of L-concepts of the L-concept lattice. We show that a systematic change of L results in a systematic change of granularity of the induced concepts and in a systematic change of the induced concept lattices. (2) The second aspect studied here is the granulation of the conceptual structure (the concept lattice) itself. It is shown that it is possible to granulate the concept lattice by factorization by naturally defined similarity relations.

4.1 Logical precision

From the point of view of granularity of information, an important role is played by the structure L of truth values. Let a fuzzy set A ∈ L^X represent a granule of information. Informally, the more elements of L, the finer the granule A. In the following we propose a systematic approach to handling the granularity of concepts of the concept lattice.

Let L be a structure of truth values. The set L is the set of all possible truth values which we have at our disposal for the logical modeling of our knowledge.


It could be considered as representing "logical discernibility". Consider e.g. the two-element Boolean algebra. Then the level of discernibility is low: we can discern only fully true statements from fully false statements. An n-element chain of truth values offers more: we can discern n logical "levels". Very loosely, using more truth values means more logical precision (in the above sense). From the point of view of logical modeling it is natural to have the possibility to change the set of truth values (in order to increase or decrease the logical discernibility) so that the structural properties of the model remain preserved. Consider two structures L1 and L2 of truth values such that there is an onto mapping h : L1 → L2, i.e. h(L1) = L2. If h preserves the structure of the sets of truth values then the change from L2 to L1 can be considered as an increase of logical precision and, conversely, the change from L1 to L2 can be considered as a decrease of logical precision. The requirement of preserving the structure of truth values may be, from the algebraic point of view, seen as fulfilled if h is a homomorphism [18], i.e. a mapping satisfying the following conditions (note that they are not independent):

h(a ∨ b) = h(a) ∨ h(b),  h(a ∧ b) = h(a) ∧ h(b),  h(a ⊗ b) = h(a) ⊗ h(b),

h(a → b) = h(a) → h(b),  h(0) = 0,  h(1) = 1.

In the following we will suppose that all the homomorphisms under consideration are ⋀-preserving, i.e. for each K ⊆ L1 it holds h(⋀_(k∈K) k) = ⋀_(k∈K) h(k). Given two structures L1 and L2 of truth values and a homomorphism h : L1 → L2, we define for each L1-fuzzy set A in X (A ∈ L1^X) the corresponding L2-fuzzy set h(A) ∈ L2^X by

(h(A))(x) = h(A(x)) for all x ∈ X.   (17)
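As a small illustration of (17), the sketch below applies an onto, ⋀-preserving homomorphism pointwise to a fuzzy set. The particular chains L1 = {0, 0.5, 1} (taken with the Gödel structure, for which this h does preserve ⊗ and →) and L2 = {0, 1}, and the mapping h collapsing the intermediate degree upward, are assumptions chosen for illustration only.

```python
# Minimal sketch of (17): pointwise application of a homomorphism h : L1 -> L2
# to an L1-fuzzy set.  L1 = {0, 0.5, 1} with the Goedel structure, L2 = {0, 1};
# both are illustrative assumptions.
L1 = [0.0, 0.5, 1.0]

def h(a):
    """Collapse the intermediate degree upward: h(0) = 0, h(0.5) = h(1) = 1."""
    return 1.0 if a >= 0.5 else 0.0

def h_set(A):
    """(h(A))(x) = h(A(x)) for all x, as in (17)."""
    return {x: h(a) for x, a in A.items()}

A = {"x1": 0.0, "x2": 0.5, "x3": 1.0}    # an L1-fuzzy set in X = {x1, x2, x3}
print(h_set(A))                           # {'x1': 0.0, 'x2': 1.0, 'x3': 1.0}
```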

The following two statements show how the systematic change of the set of truth values (i.e. an increase or decrease of logical precision) influences the granularity of the respective concepts and the structure of the respective concept lattices.

Lemma 1 ([5]). Let L1, L2 be two complete residuated lattices and h : L1 → L2 be an onto homomorphism. Let (G, M, I) be an L1-context. Then for C ∈ L2^G, D ∈ L2^M, the following holds: (C, D) ∈ B(G, M, h(I)) iff there are A ∈ L1^G, B ∈ L1^M such that (A, B) ∈ B(G, M, I), h(A) = C, and h(B) = D.

A lattice homomorphism h : V1 → V2 between two complete lattices V1 and V2 is called complete if for each K ⊆ V1 it holds h(⋀_(k∈K) k) = ⋀_(k∈K) h(k) and h(⋁_(k∈K) k) = ⋁_(k∈K) h(k).


Theorem 5 ([5]). Under the conditions of the preceding lemma, there is a complete homomorphism of B(G, M, I) onto B(G, M, h(I)).

Proof. Due to Lemma 1 (and its proof) it suffices to show that the mapping h* : B(G, M, I) → B(G, M, h(I)) defined by

h*((A, B)) = (h(A), h(B))

preserves arbitrary meets and joins. We proceed only for meets; the case of joins may be proved analogously. Let (A_j, B_j) ∈ B(G, M, I), j ∈ J. We have to prove

h*(⋀_(j∈J) (A_j, B_j)) = ⋀_(j∈J) h*((A_j, B_j)).   (18)

By Theorem 4 we have

⋀_(j∈J) (A_j, B_j) = (⋂_(j∈J) A_j, (⋂_(j∈J) A_j)↑).   (19)

We thus conclude

h*(⋀_(j∈J) (A_j, B_j)) = h*((⋂_(j∈J) A_j, (⋂_(j∈J) A_j)↑)) = (h(⋂_(j∈J) A_j), h((⋂_(j∈J) A_j)↑))   (20)

and

⋀_(j∈J) h*((A_j, B_j)) = ⋀_(j∈J) (h(A_j), h(B_j)) = (⋂_(j∈J) h(A_j), (⋂_(j∈J) h(A_j))↑).   (21)

The concepts in (20) and (21) are equal iff their extents are equal. Consider thus any g ∈ G. For the extents of the concepts in (20) and (21) we have by (17)

h(⋂_(j∈J) A_j)(g) = h(⋀_(j∈J) A_j(g)) = ⋀_(j∈J) h(A_j(g)) = (⋂_(j∈J) h(A_j))(g),

using the fact that h is ⋀-preserving, i.e. the concepts in (20) and (21) are the same. We have proved (18). By the above considerations, the proof is finished.


The foregoing theorem is important also from the application point of view. Suppose we have a concept with truth values from L1. A further analysis on the level of L1 may be (for various reasons, e.g. computational ones) "too precise", i.e. one does not need such a fine granularity. We can then switch to a level of L2 = h(L1) which is appropriate. Due to the theorem, the structure of the concepts changes systematically, i.e. the structure of concepts in L1 is in a systematic way more precise than the one in L2.

4.2 Granulation of concept lattice by similarity relations

Similarity of concepts

The similarity phenomenon plays a crucial role in the way humans regard the outer world. Gradual similarity of concepts is one of the fundamental preconditions for powerful human reasoning and communication. The similarity phenomenon is thus one of the most important phenomena accompanying conceptual structures.

In fuzzy set theory, the similarity phenomenon is approached via the so-called similarity relations [46]. By a ⊗-similarity relation (or fuzzy ⊗-equivalence relation, L-valued global equality) [23,22] on a universe U it is meant a binary fuzzy relation E satisfying the following properties for all x, y, z ∈ U:

E(x, x) = 1,   (22)

E(x, y) = E(y, x),   (23)

E(x, y) ⊗ E(y, z) ≤ E(x, z).   (24)

Properties (22), (23), and (24) are called reflexivity, symmetry, and transitivity, respectively. The ⊗-similarity class of x ∈ U is the fuzzy set [x]_E ∈ L^U given by [x]_E(y) = E(x, y) for each y ∈ U, i.e. it is the collection of elements similar to x. A fuzzy set A ∈ L^U is said to be extensional w.r.t. E if for every x, y ∈ U it holds A(x) ⊗ E(x, y) ≤ A(y), i.e. if with each of its elements x, A contains all the elements similar to x. In this case, E is also said to be compatible with A. Non-extensional fuzzy sets are not compatible with the underlying similarity relation. It is easily seen that in the crisp case, i.e. L = {0, 1}, similarity relations are equivalence relations. For the study of the similarity phenomenon, the crisp case is a degenerate and uninteresting one: two elements x and y may be "fully similar" (E(x, y) = 1) or "fully dissimilar" (E(x, y) = 0).

To be able to model the equivalence of truth values we have at our disposal the so-called biresiduum (or biimplication) [28,22] operation ↔ defined by

a ↔ b = (a → b) ∧ (b → a).

The following two lemmas will be useful in the following considerations.


Lemma 2 ([6]). Let E be a ⊗-similarity relation on U and let S = {A_i ∈ L^U | i ∈ I} be a family of fuzzy sets. (1) E is the largest ⊗-similarity relation compatible with all [x]_E. (2) The relation E_S defined by

E_S(x, y) = ⋀_(i∈I) (A_i(x) ↔ A_i(y))   (25)

is the largest ⊗-similarity relation compatible with all A_i ∈ S. Moreover, A_i(x) = 1 implies [x]_(E_S) ⊆ A_i.

Notice that in the crisp case (i.e. L = {0, 1}), E_S is a crisp equivalence relation: two elements of the universe are equivalent iff there is no set of the family which separates them.

Lemma 3. For any universe U, the relation E on L^U given for any A1, A2 ∈ L^U by

E(A1, A2) = ⋀_(x∈U) (A1(x) ↔ A2(x))

is the largest ⊗-similarity relation on L^U such that A1(x) ⊗ E(A1, A2) ≤ A2(x) holds for each x ∈ U, A1, A2 ∈ L^U.

Proof. Putting I = U, X = L^U, A_i(x) = x(i) for x ∈ L^U, i ∈ U, the assertion is a direct consequence of Lemma 2.

In the following, it will always be clear which universe U the relation E concerns. Consider first the relations E_Ext and E_Int on B(G, M, I), called the similarity induced by extents and the similarity induced by intents, respectively:

E_Ext((A1, B1), (A2, B2)) = E(A1, A2) = ⋀_(g∈G) (A1(g) ↔ A2(g)),

E_Int((A1, B1), (A2, B2)) = E(B1, B2) = ⋀_(m∈M) (B1(m) ↔ B2(m)).
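For illustration, the degree E_Ext of two concepts can be evaluated directly from their extents. The sketch below assumes the Łukasiewicz structure, for which the biresiduum reduces to 1 − |a − b|; the concrete extents are purely illustrative.

```python
# Minimal sketch (illustrative): induced similarity by extents,
# E_Ext((A1,B1),(A2,B2)) = /\_g (A1(g) <-> A2(g)), over the Lukasiewicz chain.
def bires(a, b):
    """Biresiduum a <-> b = (a -> b) /\ (b -> a); equals 1 - |a - b| here."""
    return min(min(1.0, 1.0 - a + b), min(1.0, 1.0 - b + a))

def E_ext(A1, A2):
    return min(bires(A1[g], A2[g]) for g in A1)

A1 = {"g1": 1.0, "g2": 0.5}
A2 = {"g1": 0.5, "g2": 0.5}
print(E_ext(A1, A2))                  # 0.5
```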

Lemma 3 gives immediately the following statement.

Theorem 6. E_Ext and E_Int are the largest ⊗-similarity relations on B(G, M, I) such that

A1(g) ⊗ E_Ext((A1, B1), (A2, B2)) ≤ A2(g),

B1(m) ⊗ E_Int((A1, B1), (A2, B2)) ≤ B2(m)

hold for every g ∈ G, m ∈ M, (A1, B1), (A2, B2) ∈ B(G, M, I).

To answer the question of how the relations E_Ext and E_Int are related, we derive some preliminary results. The next lemma states that the operators ↑ and ↓ preserve similarity.


Lemma 4 ([6]). Let (G, M, I) be an L-context, A1, A2 ∈ L^G, B1, B2 ∈ L^M. Then it holds E(A1, A2) ≤ E(A1↑, A2↑) and E(B1, B2) ≤ E(B1↓, B2↓).

The following corollary is immediate.

Corollary 1. Under the conditions of Lemma 4, it holds E(A1, A2) ≤ E(A1↑↓, A2↑↓) and E(B1, B2) ≤ E(B1↓↑, B2↓↑).

The following result shows that the similarities of concepts by extents and by intents are equal.

Theorem 7. For any L-context (G, M, I) it holds E_Ext = E_Int.

Proof. Let (A1, B1), (A2, B2) ∈ B(G, M, I), i.e. A_i↑ = B_i and B_i↓ = A_i for i = 1, 2. By Lemma 4 we get

E_Ext((A1, B1), (A2, B2)) = E(A1, A2) ≤ E(A1↑, A2↑) = E(B1, B2) = E_Int((A1, B1), (A2, B2)),

and analogously, E_Int((A1, B1), (A2, B2)) ≤ E_Ext((A1, B1), (A2, B2)). To sum up, E_Ext((A1, B1), (A2, B2)) = E_Int((A1, B1), (A2, B2)).

We will therefore write E instead of E_Ext and E_Int and call it the induced similarity on concepts.

Compatible similarities and factorization

One of the most important reasons why similarity relations are important in human reasoning is that they enable us to reduce the complexity of the outer world at a reasonable price. One no longer considers the particular elements; rather, collections of similar elements are considered. In general systems theory, this process is known as the abstraction process by factorization. Instead of the original system one therefore considers the "system modulo similarity". The loss of precision is the price one pays.

We are interested in the reduction of the concept lattice by factorization modulo similarity. To get an insight into the possibly intricate conceptual structure one has to look for methods for reduction of the structure. In the two-valued (crisp) case, considerable attention has been paid to this problem [14,39]. In the many-valued (fuzzy) case, one would expect methods for gradual reduction of the complexity. The idea is to factorize the concept lattice by an appropriate a-cut ᵃE of the similarity E (note that ᵃE = {(c1, c2) | a ≤ E(c1, c2)}), controlling thus the complexity by a ∈ L. Clearly, the lower a ∈ L, the coarser the factorization. We have to specify both the elements and the structure of the factor system. Since both of these steps are non-standard in our case we describe them in more detail. In general, algebraic systems can be factorized by congruences, i.e. equivalences compatible with the structure of the system. We deal with conceptual structures which are complete lattices. The a-cut ᵃE is clearly a tolerance relation (i.e. reflexive and symmetric), not transitive in general. Compatible tolerance relations on algebras have been studied by Chajda [11]. Factorization of algebras by compatible tolerances is not possible in general. Czédli [12] showed a way to factorize lattices by compatible tolerance relations. The construction has then been used for the factorization of crisp concept lattices [40]. In the following we describe the construction of the factor lattice of an L-concept lattice by a compatible tolerance relation. Let (G, M, I) be an L-context. A tolerance relation T on B(G, M, I) is said to be compatible if it is preserved under arbitrary suprema and infima, i.e. if (c_j, c'_j) ∈ T, j ∈ J, implies both (⋁_(j∈J) c_j, ⋁_(j∈J) c'_j) ∈ T and (⋀_(j∈J) c_j, ⋀_(j∈J) c'_j) ∈ T for any c_j, c'_j ∈ B(G, M, I), j ∈ J. For a compatible tolerance relation T on B(G, M, I) denote c_T = ⋀_((c,c')∈T) c' and c^T = ⋁_((c,c')∈T) c'. Call [c]_T = [c_T, (c_T)^T] = {c' ∈ B(G, M, I) | c_T ≤ c' ≤ (c_T)^T} a block of T and denote B(G, M, I)/T = {[c]_T | c ∈ B(G, M, I)} the set of all blocks. Introduce a relation ≤_T on B(G, M, I)/T by [c]_T ≤_T [c']_T iff ⋀[c]_T ≤ ⋀[c']_T (iff ⋁[c]_T ≤ ⋁[c']_T). The justification of the construction is given by the following statement.

Proposition 1 ([40]). (1) B(G, M, I)/T is the set of all maximal tolerance blocks, i.e. B(G, M, I)/T = {B ⊆ B(G, M, I) | (B × B ⊆ T) & ((∀B' ⊃ B) B' × B' ⊄ T)}. (2) (B(G, M, I)/T, ≤_T) is a complete lattice (the factor lattice), where suprema and infima are described by (26) for every c_j ∈ B(G, M, I), j ∈ J.

Substituting (13) and (14) into (26) we get a more concrete description of the lattice operations.

In order to be able to use the construction, the question is whether the a-cuts ᵃE of the similarity E on B(G, M, I) are compatible. Call a ⊗-similarity relation F on B(G, M, I) compatible if ᵃF is a compatible tolerance relation on B(G, M, I) for each a ∈ L. For the two-valued (crisp) case the situation is completely uninteresting. The only cases are ⁰E = B(G, M, I) × B(G, M, I) (i.e. the factor lattice is a one-element lattice) and ¹E = id_B(G,M,I) = {(c, c) | c ∈ B(G, M, I)} (i.e. the factor lattice is isomorphic to B(G, M, I)).

We do not, of course, have to confine ourselves to the induced similarity E. On the other hand, if we consider only ⊗-similarity relations F satisfying A(g) ⊗ F((A, B), (A', B')) ≤ A'(g) (which is quite natural: it reads "an object belonging to the extent of some concept belongs also to the extent of any similar concept") for each g ∈ G, then Theorem 6 tells us that E provides the most extensive reduction: for any other such F and each a ∈ L, ᵃE is coarser than ᵃF.


We will make use of the following lemma.

Lemma 5 ([6]). For every (A_j, B_j), (A'_j, B'_j) ∈ B(G, M, I), j ∈ J, it holds

⋀_(j∈J) E((A_j, B_j), (A'_j, B'_j)) ≤ E(⋀_(j∈J) (A_j, B_j), ⋀_(j∈J) (A'_j, B'_j)),   (27)

⋀_(j∈J) E((A_j, B_j), (A'_j, B'_j)) ≤ E(⋁_(j∈J) (A_j, B_j), ⋁_(j∈J) (A'_j, B'_j)).   (28)

An interesting corollary of Lemma 5 states that the similarity of any two concepts is less than or equal to the similarity of either of them to their join or meet.

Corollary 2. Let (A1, B1), (A2, B2) ∈ B(G, M, I). The following inequalities hold for i = 1, 2:

E((A1, B1), (A2, B2)) ≤ E((A_i, B_i), (A1, B1) ∧ (A2, B2)),

E((A1, B1), (A2, B2)) ≤ E((A_i, B_i), (A1, B1) ∨ (A2, B2)).

Proof. Put J = {x, y}, (A_x, B_x) = (A_y, B_y) = (A1, B1), (A'_x, B'_x) = (A1, B1), (A'_y, B'_y) = (A2, B2), and apply Lemma 5; the case i = 2 is analogous.

Theorem 8. The induced similarity E on B(G, M, I) is compatible. If a ∈ L is ⊗-idempotent (i.e. a ⊗ a = a) then ᵃE is, moreover, transitive, i.e. a congruence relation on B(G, M, I).

Proof. The first part follows immediately from (27) and (28) by the fact that if a ≤ E((A_j, B_j), (A'_j, B'_j)) for j ∈ J then also

a ≤ ⋀_(j∈J) E((A_j, B_j), (A'_j, B'_j)).

The second part follows from the evident fact that if a is ⊗-idempotent and a ≤ b, c then a ≤ b ⊗ c.

Remark 5. The foregoing theorem assures that the concept lattice can be factorized by any of the a-cuts of the induced similarity E on B(G, M, I). The elements of the factor lattice B(G, M, I)/ᵃE are the blocks of the tolerance ᵃE, i.e. crisp granules of concepts. Depending on a, the factor lattice ranges from a one-element lattice (for a = 0) to an isomorphic copy of B(G, M, I) (for a = 1).

By Theorem 8, if L is the algebra of intuitionistic logic (a Heyting algebra) or the algebra of Gödel logic [19] then each a-cut of E is indeed a congruence relation.

The next theorem shows that the farther apart concepts are in the hierarchy, the less similar they are.


Theorem 9. Let (A_i, B_i) ∈ B(G, M, I), i = 1, 2, 3, with (A1, B1) ≤ (A2, B2) ≤ (A3, B3). Then

E((A1, B1), (A3, B3)) ≤ E((A1, B1), (A2, B2)),
E((A1, B1), (A3, B3)) ≤ E((A2, B2), (A3, B3)).

Proof. By the assumptions, i.e. A1(g) ≤ A2(g) ≤ A3(g) for all g ∈ G, and by the antitonicity of → in the first argument, we have E((A1, B1), (A3, B3)) = ⋀_(g∈G)(A3(g) → A1(g)) ≤ ⋀_(g∈G)(A2(g) → A1(g)) = E((A1, B1), (A2, B2)). The second part may be obtained symmetrically.

For the complete homomorphism h* : B(G, M, I) → B(G, M, h(I)) described in Section 4.1, the concept lattice B(G, M, h(I)) is isomorphic to B(G, M, I)/θ_h*, where θ_h* is the (complete) congruence relation on B(G, M, I) induced by h* (i.e. ((A1, B1), (A2, B2)) ∈ θ_h* iff h*((A1, B1)) = h*((A2, B2))). We therefore have two ways to factorize concept lattices, the factorization by θ_h* and the factorization by ᵃE described in this section. We are going to present a sufficient condition for these factorizations to be equivalent. We need the following lemma.

Lemma 6. Let L be a residuated lattice and let a ∈ L be a ⊗-idempotent (i.e. a ⊗ a = a) atom such that a → 0 = 0 and L = {0} ∪ {x | a ≤ x}. Then the mapping h_a : L → {0, 1} defined by

h_a(x) = 1 for a ≤ x,  h_a(x) = 0 for x < a

is an onto homomorphism of L onto 2. If L is complete, then h_a is, moreover, ⋀-preserving.

Proof. By definition we have h_a(0) = 0 and h_a(1) = 1. We show h_a(x ∨ y) = h_a(x) ∨ h_a(y). Distinguish two cases: (1) if x, y < a then clearly h_a(x ∨ y) = h_a(0) = 0 = h_a(0) ∨ h_a(0) = h_a(x) ∨ h_a(y); (2) otherwise x ∨ y ≥ a, thus h_a(x ∨ y) = 1 = h_a(x) ∨ h_a(y), since either x ≥ a (and then h_a(x) = 1) or y ≥ a (and then h_a(y) = 1). The equality h_a(x ∧ y) = h_a(x) ∧ h_a(y) can be proved analogously. For h_a(x ⊗ y) = h_a(x) ⊗ h_a(y) distinguish (1) x, y ≥ a, in which case x ⊗ y ≥ a due to the idempotency of a and clearly h_a(x ⊗ y) = 1 = h_a(x) ⊗ h_a(y); (2) otherwise, e.g. x < a, we have h_a(x ⊗ y) = 0 = 0 ⊗ h_a(y) = h_a(x) ⊗ h_a(y). Finally, to show h_a(x → y) = h_a(x) → h_a(y) we have either (1) y ≥ a and thus x → y ≥ y ≥ a, i.e. h_a(x → y) = 1 = h_a(x) → 1 = h_a(x) → h_a(y); or (2) y < a; then (2a) x ≥ a implies x → y ≤ a → 0 = 0, i.e. h_a(x → y) = 0 = 1 → 0 = h_a(x) → h_a(y), and (2b) x < a implies h_a(x → y) = 1 = 0 → 0 = h_a(x) → h_a(y). The second part is easy to see.
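The conditions of Lemma 6 can also be checked mechanically on a small example. The sketch below verifies by brute force that h_a is a homomorphism on a finite chain equipped with the Gödel structure; the particular chain and the choice a = 0.25 (which is ⊗-idempotent, an atom, and satisfies a → 0 = 0 there) are assumptions made for illustration, not part of the proof.

```python
# Minimal sketch: brute-force check that h_a from Lemma 6 is a homomorphism
# on a finite Goedel chain (an illustrative verification, not a proof).
L = [0.0, 0.25, 0.5, 0.75, 1.0]              # finite chain with Goedel structure
a = 0.25                                      # atom; a (x) a = a and a -> 0 = 0

def tnorm(x, y): return min(x, y)             # Goedel (x)
def imp(x, y):   return 1.0 if x <= y else y  # Goedel residuum ->

def h(x): return 1.0 if x >= a else 0.0       # h_a of Lemma 6

ok = all(
    h(tnorm(x, y)) == tnorm(h(x), h(y)) and
    h(imp(x, y))   == imp(h(x), h(y)) and
    h(max(x, y))   == max(h(x), h(y)) and
    h(min(x, y))   == min(h(x), h(y))
    for x in L for y in L
)
print("h_a is a homomorphism:", ok)           # True
```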

Theorem 10. Let L be a complete residuated lattice. If a is a ⊗-idempotent atom of L such that a → 0 = 0 and L = {0} ∪ {x | a ≤ x}, and h_a is the homomorphism described in Lemma 6, then B(G, M, h_a(I)) and B(G, M, I)/ᵃE are isomorphic lattices.


Proof. We have to show ((A1, B1), (A2, B2)) ∈ ᵃE iff h_a((A1, B1)) = h_a((A2, B2)). This follows from the fact that ((A1, B1), (A2, B2)) ∉ ᵃE iff there is g ∈ G such that A1(g) = 0 and A2(g) > 0, or A1(g) > 0 and A2(g) = 0, iff h_a((A1, B1)) ≠ h_a((A2, B2)).

Remark 6. Note that the above conditions on L cover an important class of residuated lattices, namely the class of finite chains equipped with the Gödel structure, cf. Section 2.

5 Illustrative examples

Consider the following example. For L we put L = {0, 1/2, 1} with the structure described in Section 2. The context is given by Tab. 1. The set G contains nine elements (Mercury, ..., Pluto), the set M contains four properties ("size small", ..., "near to sun"). The corresponding fuzzy concept lattice

Table 1. Fuzzy context given by planets and their properties.

              size small (ss)  size large (sl)  far from sun (df)  near to sun (dn)
Mercury (Me)        1                0                 0                  1
Venus (V)           1                0                 0                  1
Earth (E)           1                0                 0                  1
Mars (Ma)           1                0                1/2                 1
Jupiter (J)         0                1                 1                 1/2
Saturn (S)          0                1                 1                 1/2
Uranus (U)         1/2              1/2                1                  0
Neptune (N)        1/2              1/2                1                  0
Pluto (P)           1                0                 1                  0

is depicted in Fig. 1. To get a deeper insight, the elements (i.e. concepts) of the lattice are identified in Tab. 2. Consider the a-cut of the induced similarity on B(G, M, I) for a = 1/2, i.e. ½E. The tolerance blocks (which are, in fact, complete sublattices) are depicted in Fig. 2. Note that each block is a maximal subset of L-concepts which are pairwise similar in degree at least 1/2. The corresponding factor lattice B(G, M, I)/½E is depicted in Fig. 3.

A few remarks on this example. First, note that the concepts which were found depend on the fuzzy context. The fuzzy context is given by subjective judgement (e.g. to what degree we consider "Mars is far from sun" to be true). The subjective judgement therefore influences the resulting structure of concepts.


Table 2. Fuzzy concepts of the context of Tab. 1.

no. | extent (Me V E Ma J S U N P) | intent (ss sl df dn)
[The table lists the 38 fuzzy concepts of the context, giving for each its extent and intent membership degrees with values in {0, 1/2, 1}; the individual rows are not legibly recoverable here.]


Fig. 1. Concept lattice of the context in Tab. 1.

Second, there are apparently natural concepts in the concept lattice (e.g. no. 14 ("small planet near to sun")), as well as concepts which "were found" (e.g. no. 26 ("a planet far from sun which is at least partially large")). Concept no. 1 is an example of an empirically empty concept. Concepts which do not contain any element in degree 1 in their extents (e.g. nos. 1, 2, 3) could be called partially (empirically) empty. There is nothing wrong about these concepts, just as there is nothing wrong about empirically empty concepts.

Acknowledgement

This work has been supported by the grant no. 201/99/P060, and partly by the project VS 96037 of the Ministry of Education of the Czech Republic.


Fig. 2. Blocks of the tolerance relation ½E on the concept lattice of Fig. 1.


Fig. 3. Factor lattice of the lattice in Fig. 1 by ½E.

References

1. Arbeiten der Forschungsgruppe Begriffsanalyse bis 1989. Arbeiten der Forschungsgruppe Begriffsanalyse ab 1990.
2. Belohlávek R.: Lattices generated by binary fuzzy relations. Tatra Mount. Math. Publ. 16(1999), 11-19 (special issue on fuzzy set theory and applications).
3. Belohlávek R.: Fuzzy Galois connections. Math. Logic Quarterly 45,4(1999), 497-504.
4. Belohlávek R.: Fuzzy logical bidirectional associative memory for concepts representation. Joint Conf. Inf. Sciences '98 Proc., Vol. II, pp. 123-126, Durham, North Carolina, USA, 1998.
5. Belohlávek R.: Logical precision in concept lattices (submitted). Extended version of a paper presented at the Soft Computing '99 Conference, Genoa, Italy, 1999.
6. Belohlávek R.: Similarity relations in concept lattices. Journal of Logic and Computation (to appear).
7. Belohlávek R.: Lattices of fixed points of fuzzy Galois connections. Math. Logic Quarterly 47,1(2001) (to appear).
8. Belohlávek R.: Concept equations (submitted).
9. Birkhoff G.: Lattice Theory, 3rd edition. AMS Coll. Publ. 25, Providence, R.I., 1967.
10. Blyth T.S., Janowitz M.F.: Residuation Theory. Pergamon Press, London, 1972.
11. Chajda I.: Algebraic Theory of Tolerance Relations. Palacký University Press, Olomouc, 1991.
12. Czédli G.: Factor lattices by tolerances. Acta Sci. Math. (Szeged) 44(1982), 35-42.
13. Ganter B.: Lattice theory and formal concept analysis: a subjective introduction. Preprint MATH-AL-2-1994, Technical University Dresden, Dresden, 1994.


14. Ganter B., Wille R.: Applied lattice theory: formal concept analysis. In: Grätzer G.: General Lattice Theory. To appear.
15. Ganter B., Wille R., Wolff K. E.: Beiträge zur Begriffsanalyse. B. I. Wissenschaftsverlag, Mannheim, 1987.
16. Goguen J. A.: L-fuzzy sets. Journal of Math. Anal. Appl. 18(1967), 145-174.
17. Goguen J. A.: The logic of inexact concepts. Synthese 19(1968-69), 325-373.
18. Grätzer G.: Universal Algebra. Princeton, Toronto, London, Melbourne, 1968.
19. Hájek P.: Metamathematics of Fuzzy Logic. Kluwer, 1998.
20. Höfler A.: Grundlehren der Logik und Psychologie. G. Freytag, Leipzig, 1906.
21. Höhle U.: Commutative, residuated l-monoids. In: Höhle U., Klement E. P. (Eds.): Non-Classical Logics and Their Applications to Fuzzy Subsets, Kluwer (Theory and Decision Library), 1995, pp. 53-106.
22. Höhle U.: On the fundamentals of fuzzy set theory. Journal of Math. Anal. Appl. 201(1996), 786-826.
23. Klir G. J., Yuan B.: Fuzzy Sets and Fuzzy Logic. Theory and Applications. Prentice Hall, Upper Saddle River, NJ, 1995.
24. Lukose D., Delugach H., Keeler M., Searle L., Sowa J. (Eds.): Conceptual Structures: Fulfilling Peirce's Dream. Proc. of Fifth Int. Conf. on Conceptual Structures, ICCS'97. Springer-Verlag, Berlin/Heidelberg/New York, 1997.
25. Materna P.: Svět pojmů a logika (The world of concepts and logic, in Czech). Filosofia, Prague, 199X.
26. Materna P.: Teorie pojmů: Bolzanovská a množinová tradice (Theory of concepts: Bolzanean and set tradition, in Czech). Filozofický časopis 45(4)(1997), 547-557.
27. Materna P.: Objects and concepts. Acta Philosophica Fennica, to appear.
28. Novák V.: Paradigm, formal properties and limits of fuzzy logic. Int. J. General Systems 24(4)(1996), 377-405.
29. Ore O.: Galois connexions. Trans. AMS 55(1944), 493-513.
30. Pavelka J.: On fuzzy logic I, II, III. Zeitschrift für Math. Logik und Grundl. Math. 25(1979), 45-52, 119-139, 447-464.
31. Peacocke C.: A Study of Concepts. MIT Press (A Bradford Book), Cambridge (Massachusetts), London (England), 1992.
32. Rand A.: Introduction to Objectivist Epistemology. Meridian Book, 1990.
33. Sowa J. F.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA, 1984.
34. Thagard P.: Mind. Introduction to Cognitive Science. MIT Press, 1996.
35. Tichý P.: The Foundations of Frege's Logic. W. de Gruyter, Berlin, New York, 1988.
36. Tugendhat E., Wolf U.: Logisch-semantische Propädeutik. Philipp Reclam jun., Stuttgart, 1986.
37. Turunen E.: Well-defined fuzzy sentential logic. Math. Logic Quarterly 41(1995), 236-248.
38. Ward M., Dilworth R. P.: Residuated lattices. Trans. AMS 45(1939), 335-354.
39. Wille R.: Restructuring lattice theory: an approach based on hierarchies of concepts. In: Rival I. (Ed.): Ordered Sets, 445-470, Reidel, Dordrecht-Boston, 1982.
40. Wille R.: Complete tolerance relations of concept lattices. In: Eigenthaler G. et al.: Contributions to General Algebra, vol. 3. Hölder-Pichler-Tempsky, Wien, 1985, pp. 397-415.


41. Wille R.: Concept lattices and conceptual knowledge systems. Computers & Mathematics with Applications 23(1992), 493-515.
42. Wille R.: Conceptual graphs and formal concept analysis. In [24], pp. 290-303.
43. Winston P. H.: Artificial Intelligence. 3rd Ed. Addison-Wesley, 1992.
44. Zába G.: Logika (Logic, in Czech). Hejda & Tuček, Praha, 1906.
45. Zadeh L. A.: Fuzzy sets. Information and Control 8(3)(1965), 338-353.
46. Zadeh L. A.: Similarity relations and fuzzy orderings. Information Sciences 3(1971), 159-176.
47. Zadeh L. A.: The concept of a linguistic variable and its application to approximate reasoning I, II, III. Information Sciences 8(3)(1975), 199-251, 301-357; 9(1975), 43-80.
48. Zadeh L. A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90(2)(1997), 111-127.


Granular Computing with Closeness and Negligibility Relations

Didier Dubois¹, Allel Hadj-Ali², and Henri Prade¹

¹ I.R.I.T., Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex 4, France
{dubois,prade}@irit.fr
² Institut d'Informatique, Université Mouloud Mammeri, 15000 Tizi-Ouzou, Algeria

Abstract. One of the simplest examples of information granulation is the use by humans of approximate equalities when reasoning with orders of magnitude. The paper proposes a symbolic approach for handling orders of magnitude in terms of a closeness relation and an associated negligibility relation. At the semantic level, these relations are represented by means of fuzzy sets and are parameterized. A reduced set of rules, where the parameters are formally combined, embodies all the knowledge for reasoning on the basis of pieces of information stated in terms of orders of magnitude. These rules describe how closeness and negligibility relations can be composed and how they behave with respect to addition and product. The problem of handling qualitative probabilities in uncertain reasoning is then investigated in that perspective.

1 Introduction

In a wide variety of fields, the information on which human decisions and reasoning are based is both fuzzy and granular, as recently emphasized by Zadeh [20][19]. Fuzziness refers to the representation of classes with ill-located boundaries by means of characteristic functions taking values in the interval [0,1]. In a fuzzy class, some elements that are considered as "marginal" or "less acceptable" are given a degree of membership that is intermediary between 0 and 1. Fuzzy logic can offer a natural and rich framework for taking into account the gradual or flexible nature of specifications, and the representation of incomplete information. It should be noted that the growing success of fuzzy logic is mainly due to the fact that it provides a tool for bridging the gap between the perceived continuity of the world and human discrete cognitive representation. In particular, fuzzy logic helps with interfacing numerical data and symbolic labels. The information may be said to be granular in the sense that the data points within a granule have to be dealt with as a whole rather than individually. In this paper, we investigate a particular type of human reasoning where inferences are performed at the granular level: qualitative reasoning about relative orders of magnitude expressed in terms of closeness and negligibility relations.

The idea of granulation, which plays a key role in human cognition, was first suggested by Zadeh [17] in 1979 and has again been advocated by this author recently in [19][20]. In a broad sense, granulation refers to partitioning an object into a collection of granules, each granule being a clump of objects (points) drawn together by indistinguishability, similarity, proximity or functionality. For example, the granules of a human body are the head, neck, arms, chest, etc. The granulation process, which underlies the methodology of computing with words recently emphasized by Zadeh [18], generally serves as a way of achieving data compression. As pointed out by Zadeh [19], information granulation may be crisp or fuzzy. But in much, perhaps most, of human reasoning and concept formation, the granules are fuzzy rather than crisp. The fuzziness of granules is a direct consequence of the fuzziness of the concepts of indistinguishability, similarity, proximity and functionality. Fuzzy information granulation underlies the remarkable human ability to organize and summarize information in an environment of imprecision, partial knowledge, partial certainty and partial truth. Indeed the machinery of fuzzy information granulation, especially in the form of linguistic variables, fuzzy if-then rules and fuzzy graphs, has long played a major role in the success of fuzzy logic in dealing with real-world problems.

In the theory of fuzzy information granulation, a granule, is viewed as a clump of points characterized by a generalized constraint (possibilistic constraint, probabilistic constraint, fuzzy graph constraint, etc), which serves as a canonical form for representing the meaning of a proposition expressed in a natural language. Central to the idea of granulation is the notion of (dis)similarity which enables us to cluster elements which are sufficiently close into a set, and to distinguish them from other subsets. Rough set theory, originated by Pawlak [11], offers a valuable framework for handling granulation. Indeed this theory relies on the notion of equivalence relation which is used for partitioning universes in subsets of indistinguishable elements, and thus coarsening the universe of discourse. In this paper, although we do not explicitly refer to rough sets, the notion of (fuzzy) similarity, or at least of fuzzy proximity, plays a key role.

Fuzzy constraints expressing closeness and negligibility relations can be viewed as a form of fuzzy granulation, a granule being a fuzzy set of elements drawn together by closeness (or negligibility). Then, qualitative reasoning on relative orders of magnitude, expressed in terms of closeness and negligibility relations, may be viewed as a simple illustration of the idea of granular computing, which is a methodology in which granules are used as whole for computing and reasoning. This is the topic of the paper.

Relative order of magnitude reasoning is a form of qualitative reasoning based on approximate equality and negligibility relations [10][15]. The first attempt to formalize and automate such reasoning appeared with the formal system FOG, proposed by Raiman [13][14]. FOG is based on three basic relations expressing 'Negligible in comparison with' (Ne), 'Close to' (Cl), and 'has the same sign and order of magnitude as', i.e., 'is comparable to' (Co). FOG includes one axiom and 31 inference rules. This set of rules was proved to be consistent by giving an interpretation of the three relations in the framework of Non-Standard Analysis. Note that FOG handles relative orders of magnitude through a purely symbolic computation process. Nevertheless, FOG has several limitations which prevent it from being really used in engineering. The two most obvious ones are the impossibility to explicitly use quantitative information when available, and the difficulty to control the inference process in order to obtain valid results in the real world. It has been pointed out in [3][4][7] that the modeling of the relations Ne, Cl and Co by means of fuzzy relations constitutes an appropriate framework for solving these problems. This new representation allows one to give a standard numerical semantics for the symbolic computation performed by FOG's rules, and to interface symbolic relations with their numerical interpretation in a convenient way. Symbolic inference rules which enable a formal manipulation of the relations Cl and Ne, in agreement with the fuzzy semantics, have been proposed [4][7]. In this paper, we examine the independence of the already proposed rules, and we establish new inference rules. A more reduced set of non-redundant rules is proposed. Then, we discuss the application of this set of rules to qualitative probabilities quantifying rules with exceptions in plausible reasoning.

The paper is organized as follows. Section 2 recalls the representation of the relations Cl and Ne, based on fuzzy relations. In Section 3, we present a minimal base of inference rules describing the behaviour of the relations Cl and Ne. Lastly, an application to plausible reasoning with qualitative probabilities is proposed in Section 4.

2 Fuzzy Set Representation of Relative Orders of Magnitude

Relative orders of magnitude can be expressed in practice by fuzzy relations whose membership functions are defined in terms of differences of values or in terms of ratios of values [4][7]. In this paper we use the ratio x/y rather than the difference x − y for modeling the membership functions expressing approximate equality or negligibility. Indeed the idea of negligibility is more easily captured in terms of ratios. The approximate equality relation, represented by the closeness relation Cl, can then be captured by the following fuzzy relation between two real numbers x and y:

μ_Cl(x, y) = μ_M(x/y).   (1)

where the characteristic function μ_M is that of a fuzzy number close to 1, such that μ_M(1) = 1 (since x is close to x), μ_M(t) = 0 if t ≤ 0 (since, if x and y are close, they have the same sign), and such that

μ_M(t) = μ_M(1/t) for all t > 0.   (2)

So, each level cut of M (i.e., M_α = {r | μ_M(r) ≥ α}) is of the form [1−ε, 1/(1−ε)] with ε ∈ [0, 1[. This latter property ensures the symmetry of Cl, i.e.,

μ_Cl(x, y) = μ_Cl(y, x).   (3)

In the formal system FOG [13], the relation Ne can be captured as a function of Cl through the following rule: (a, b) ∈ Ne ⇔ (a + b, b) ∈ Cl. This leads to the following representation:

μ_Ne(x, y) = μ_Cl(x + y, y) = μ_M((x + y)/y) = μ_M(1 + x/y).   (4)

Thus x is negligible w.r.t. y if and only if the ratio x/y is very small w.r.t. 1; this is why the use of a ratio seems more appropriate than a difference for modeling the relation Ne. In the system FOG, Raiman makes use of a third relation of comparability, Co, which is a less restrictive notion than closeness (i.e., Cl ⊆ Co). Note that it is desirable not only to have the fuzzy relation inclusion Cl ⊆ Co, but also to express that a and b are no longer comparable (in the sense of Co) when a becomes negligible in comparison with b, or b becomes negligible in comparison with a. This leads to the definition (a, b) ∈ Co ⇔ ¬((a, b) ∈ Ne or (b, a) ∈ Ne), which can be represented as follows:

μ_Co(x, y) = 1 − max(μ_Ne(x, y), μ_Ne(y, x)) = 1 − max(μ_M(1 + x/y), μ_M(1 + y/x))   (5)
          = min(1 − μ_M(1 + x/y), 1 − μ_M(1 + y/x)).
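For illustration, the relations (1), (4) and (5) can be evaluated numerically once a concrete shape for M is fixed. The sketch below uses a log-symmetric shape, which is an assumption made here for illustration only (any fuzzy number satisfying μ_M(1) = 1, μ_M(t) = μ_M(1/t) and the support constraints of Section 2.1 would do); the support parameter e0 is kept below (3−√5)/2 so that Cl ⊆ Co holds.

```python
# Minimal sketch (illustrative parameters): the fuzzy relations Cl, Ne and Co,
# with M a log-symmetric fuzzy number around 1 whose support is
# [1 - e0, 1/(1 - e0)] and which satisfies mu_M(t) = mu_M(1/t).
import math

e0 = 0.30                                  # support parameter; 0 < e0 <= (3 - 5**0.5)/2
C = -math.log(1.0 - e0)

def mu_M(t):
    """Fuzzy number 'close to 1': mu_M(1) = 1, mu_M(t) = mu_M(1/t) for t > 0."""
    if t <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(math.log(t)) / C)

def mu_Cl(x, y):                           # (1): closeness
    return mu_M(x / y)

def mu_Ne(x, y):                           # (4): negligibility
    return mu_M(1.0 + x / y)

def mu_Co(x, y):                           # (5): comparability
    return min(1.0 - mu_Ne(x, y), 1.0 - mu_Ne(y, x))

print(mu_Cl(95, 100), mu_Ne(2, 100), mu_Co(40, 100))
```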

2.1 Required Properties of the Fuzzy Parameter

In equations (1), (4) and (5), the parameter M is a fuzzy interval which restricts values around 1, with μ_M(1) = 1. Moreover M should satisfy the following constraints:

• In order to ensure the symmetry property of the relation Cl, M is such that μ_M(t) = μ_M(1/t), ∀t. Then, each α-level cut of M is of the form [1−ε, 1/(1−ε)] with ε ∈ [0, 1[.

• μ_Ne(x, y) = 0 for x = y, since x is not negligible with respect to itself. This is equivalent to μ_M(2) = 0, according to (4). Since M is symmetric, μ_M(1/2) = 0 is also true. Thus, the support of M is included in [1/2, 2]. In other terms, a level cut of M is of the form [1−ε, 1/(1−ε)] with ε ∈ [0, 1/2].


• In order to guarantee the condition Cl ⊆ Co, M is chosen such that its support lies in the interval [1−ε, 1/(1−ε)] with ε ∈ [0, (3−√5)/2].

Proof. The condition Cl ⊆ Co is equivalent to μ_M(t) ≤ 1 − max(μ_M(1 + t), μ_M(1 + 1/t))
⇔ max(μ_M(1 + t), μ_M(1 + 1/t)) ≤ 1 − μ_M(t)
⇔ max(μ_(M⊖1)(t), μ_(1/(M⊖1))(t)) ≤ 1 − μ_M(t)
⇔ M⊖1 ⊆ M̄ and 1/(M⊖1) ⊆ M̄,

where the overbar denotes fuzzy set complementation. Let us use α-cuts, and the fact that F ⊆ G is equivalent to F_α ⊆ G_α for all α > 0. M_α is of the form [1−ε, 1/(1−ε)], hence for α-cuts the two inclusions write:

[−ε, ε/(1−ε)] ⊆ ]−∞, 1−ε[ ∪ ]1/(1−ε), +∞[  and  ]−∞, −1/ε] ∪ [(1−ε)/ε, +∞[ ⊆ ]−∞, 1−ε[ ∪ ]1/(1−ε), +∞[.

Note that −1/ε < 1−ε is always true for ε ∈ ]0, 1[. The two inclusions are true if and only if ε/(1−ε) ≤ 1−ε and (1−ε)/ε ≥ 1/(1−ε), which are both equivalent to ε² − 3ε + 1 ≥ 0, which is true for ε ∈ [0, (3−√5)/2]. ∎

Finally, a level cut of M is of the form [1−ε, 1/(1−ε)] with ε ∈ [0, (3−√5)/2], i.e., the support of M lies in the interval [(√5−1)/2, (√5+1)/2].
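The boundary value (3−√5)/2 used in the proof above can be checked numerically; the following small sketch simply evaluates the condition ε² − 3ε + 1 ≥ 0 around that threshold (purely illustrative).

```python
# Minimal sketch: e <= (3 - sqrt(5))/2 is exactly where e**2 - 3*e + 1 >= 0 on [0, 1).
import math

threshold = (3 - math.sqrt(5)) / 2          # ~ 0.382
for e in [0.0, 0.1, 0.2, 0.3, threshold, 0.4, 0.5]:
    print(round(e, 4), e * e - 3 * e + 1 >= -1e-12)
```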

In the rest of the paper, we only consider inference rules involving the more basic order of magnitude relations CI and Ne.

2.2 Composition of the Basic Relations Ne and Cl

From now on, the notation Cl[M] (respectively Ne[M]) means that Cl (respectively Ne) is defined from the membership function (1) (respectively (4)). We shall continue to use the notations (a, b) ∈ Cl[M] and (a, b) ∈ Ne[M] for expressing that the more or less possible pairs of values for a and b are restricted by the fuzzy sets Cl[M] and Ne[M] respectively.

The representation of the relations in terms of ratios enables us to reduce the composition of fuzzy relations (as done in approximate reasoning [16]) to a simple computation on the fuzzy numbers which parameterize the relations. Indeed

sup_y min(μ_Cl[M](x, y), μ_Cl[N](y, z)) = sup_y min(μ_M(x/y), μ_N(y/z))
                                        = μ_(M⊗N)(x/z) = μ_Cl[M⊗N](x, z),   (6)

where ⊗ denotes the product extended to fuzzy numbers (in the following, we shall omit '⊗' when writing products). Similarly,

sup_y min(μ_Cl[M](x, y), μ_Ne[N](y, z)) = sup_y min(μ_M(x/y), μ_N(1 + y/z))   (7)
   = sup_(u,v s.t. u(v−1)+1 = 1+x/z) min(μ_M(u), μ_N(v))
   = μ_(M(N⊖1)⊕1)(1 + x/z) = μ_Ne[M(N⊖1)⊕1](x, z),

where u = x/y, v = 1 + y/z, and where ⊕ and ⊖ denote the addition and subtraction extended to fuzzy numbers [2]. Similarly,

sup_y min(μ_Ne[M](x, y), μ_Ne[N](y, z)) = sup_y min(μ_M(1 + x/y), μ_N(1 + y/z))   (8)
   = sup_(u,v s.t. (u−1)(v−1)+1 = 1+x/z) min(μ_M(u), μ_N(v))
   = μ_((M⊖1)(N⊖1)⊕1)(1 + x/z) = μ_Ne[(M⊖1)(N⊖1)⊕1](x, z),

where u = 1 + x/y and v = 1 + y/z. Thus, we have established the following semantic entailments:

(a, b) ∈ Cl[M] and (b, c) ∈ Cl[N] ⊨ (a, c) ∈ Cl[MN]   (9)

(a, b) ∈ Cl[M] and (b, c) ∈ Ne[N] ⊨ (a, c) ∈ Ne[M(N⊖1)⊕1]   (10)

(a, b) ∈ Ne[M] and (b, c) ∈ Ne[N] ⊨ (a, c) ∈ Ne[(M⊖1)(N⊖1)⊕1]   (11)

N.B. The product MM in fuzzy arithmetic is equal to M² if and only if M is either positive (i.e., μ_M(t) > 0 ⇒ t ≥ 0) or negative (i.e., μ_M(t) > 0 ⇒ t ≤ 0). Here in practice M is positive, but (M⊖1) is not. This is why in (11) we write (M⊖1)(N⊖1) and, when M = N, (M⊖1)(M⊖1) instead of (M⊖1)².
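Since the parameters in (9)-(11) are combined by fuzzy arithmetic, a fixed α-cut of each resulting parameter can be computed by ordinary interval arithmetic on cuts of the form [1−ε, 1/(1−ε)]. The following sketch propagates such cuts; the helper names and the sample values ε = 0.1 and η = 0.2 are illustrative assumptions.

```python
# Minimal sketch: alpha-cut propagation of the parameters in (9)-(11), where a
# cut of M (resp. N) is the interval [1-e, 1/(1-e)] and fuzzy arithmetic on
# the parameters reduces to interval arithmetic.
def cut(e):                            # alpha-cut of a closeness parameter
    return (1.0 - e, 1.0 / (1.0 - e))

def i_mul(a, b):                       # interval product (handles negative parts)
    c = [x * y for x in a for y in b]
    return (min(c), max(c))

def i_sub1(a): return (a[0] - 1.0, a[1] - 1.0)   # A (-) 1
def i_add1(a): return (a[0] + 1.0, a[1] + 1.0)   # A (+) 1

M, N = cut(0.1), cut(0.2)

print(i_mul(M, N))                              # cut of MN            (rule (9))
print(i_add1(i_mul(M, i_sub1(N))))              # cut of M(N-1)+1      (rule (10))
print(i_add1(i_mul(i_sub1(M), i_sub1(N))))      # cut of (M-1)(N-1)+1  (rule (11))
```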

3 Base of Rules Based on the Fuzzy Relations Ne and Cl

A set of symbolic inference rules, based on the order of magnitude relations Cl and Ne, has been proposed and justified in [4] and [7]. Nevertheless, the rules which have been established in the previous papers are not all independent. For example, it is easy to see that the rules

(a, b) ∈ Ne[M] ⇔ (a + b, b) ∈ Cl[M]  and  (a, b) ∈ Cl[M] and (b, c) ∈ Cl[N] ⇒ (a, c) ∈ Cl[MN]

imply the rule

(a, c) ∈ Cl[M] and (b, a) ∈ Ne[N] ⇒ (a + b, c) ∈ Cl[MN],

or that

(a, b) ∈ Ne[M] ⇔ (−a, b) ∈ Ne[2⊖M]  and  (a, b) ∈ Ne[M] ⇔ (a + b, b) ∈ Cl[M]

imply the rule

(a − b, a) ∈ Ne[M] ⇔ (b, a) ∈ Cl[2⊖M].

Furthermore, some rules can be redundant (a rule is said to be redundant with respect to another if they have the same conclusion and if the conditions of the first are included in (or equal to) the ones of the other). A rule is also redundant with respect to another if, under the same conditions, it leads to a conclusion containing (in the sense of fuzzy set inclusion) the one of the other rule. A minimal set of rules (from the point of view of independence and redundancy) can be obtained by performing the following operations:

i) Deleting all rules which can be obtained by chaining two or several rules;

ii) Eliminating all rules which, by a change of variables, coincide with others (for example, the rule expressed by (a, b) ∈ Ne[M] ⇔ (a + b, b) ∈ Cl[M], putting a = c − b, yields (c − b, b) ∈ Ne[M] ⇔ (c, b) ∈ Cl[M], which coincides with another rule given in [7]).

The application of these two operations leads to the following rule base:

- Remarkable properties of Cl:

a ≤ b and c ≤ d and a/b ≤ c/d and (a, b) ∈ Cl[M] ⇒ (c, d) ∈ Cl[M]   (12)

(a, b) ∈ Cl[M] ⇔ (b, a) ∈ Cl[M]   (13)

(a, b) ∈ Cl[M] and (c, d) ∈ Cl[N] ⇒ (a·c, b·d) ∈ Cl[MN]   (14)

- Definition of Ne from Cl:

(a, b) ∈ Ne[M] ⇔ (a + b, b) ∈ Cl[M]   (15)

(a, b) ∈ Cl[M] ⇔ (max(a, b) − min(a, b), min(a, b)) ∈ Ne[M]   (16)

(a, b) ∈ Ne[M] ⇔ (−a, b) ∈ Ne[2⊖M]   (17)

- Remarkable properties of Ne:

if 0 ≤ a/b ≤ 1 and (b, c) ∈ Ne[M] ⇒ (a, c) ∈ Ne[M]   (18)

if (a, b) ∈ Ne[M] and 0 ≤ b/c ≤ 1 ⇒ (a, c) ∈ Ne[M]   (19)

(a, a + b) ∈ Ne[M] ⇒ (a, b) ∈ Ne[M²]   (20)

(a, b) ∈ Ne[M] and (c, d) ∈ Ne[N] ⇒ (a·c, b·d) ∈ Ne[(M⊖1)(N⊖1)⊕1]   (21)

- Composition of the relations Cl and Ne:

(a, b) ∈ Cl[M] and (b, c) ∈ Ne[N] ⇒ (a, c) ∈ Ne[M(N⊖1)⊕1]   (22)

(a, b) ∈ Cl[M] and (c, a) ∈ Ne[N] ⇒ (c, b) ∈ Ne[M(N⊖1)⊕1]   (23)

- Properties of Cl and Ne with respect to the addition:

(a, c) ∈ Cl[M] and (b, c) ∈ Ne[N] ⇒ (a + b, c) ∈ Cl[M⊕N⊖1]   (24)

(a + b, c) ∈ Cl[M] and (b, c) ∈ Cl[N] ⇒ (a, c) ∈ Ne[M⊖N⊕1]   (25)

(a, c) ∈ Ne[M] and (b, a) ∈ Ne[N] ⇒ (a + b, c) ∈ Ne[(M⊖1)N⊕1]   (26)

- Properties of Cl and Ne with respect to the product:

(a, b) ∈ Cl[M] and (c, d) ∈ Ne[N] ⇒ (a·c, b·d) ∈ Ne[M(N⊖1)⊕1]   (27)

Proofs: The proofs of most of these rules are given in the references [4] and [7], and some of them are given in the Appendix. Let us now consider the proof of rule (20). We have

μ_Ne[M](x, x + y) = μ_M(1 + x/(x + y)) = μ_M(2 − y/(x + y))
                 = μ_(2⊖M)(y/(x + y)),  since μ_(f(M))(t) = μ_M(f⁻¹(t)),
                 ≈ μ_(2⊖M)((x + y)/y) = μ_(2⊖M)(1 + x/y) = μ_Ne[2⊖M](x, y).

The approximate equality (≈) is due to the fact that 2⊖M is only approximately symmetric (in the sense of (2)). Indeed the symmetry of M signifies that each of its α-level cuts is of the form [1−ε, 1/(1−ε)] with ε ∈ [0, (3−√5)/2], which implies that the corresponding α-level cut of 2⊖M is [(1−2ε)/(1−ε), 1+ε]; but we only have (1−2ε)/(1−ε) ≈ 1/(1+ε), neglecting ε². In order to overcome this problem, we can substitute for 2⊖M a less restrictive symmetrical fuzzy parameter, e.g., M². Indeed, the corresponding α-level cut of M² is [(1−ε)², 1/(1−ε)²]; it is easy to see that the inequality (1−ε)² ≤ (1−2ε)/(1−ε) ⇔ (1−ε)(1−ε)² ≤ 1−2ε ⇔ ε²−3ε+1 ≥ 0 is true for ε ∈ [0, (3−√5)/2]; and 1/(1−ε)² ≥ 1+ε is also true since 1/(1−ε)² ≥ 1/(1−ε) ≥ 1+ε. Then, [(1−2ε)/(1−ε), 1+ε] ⊆ [(1−ε)², 1/(1−ε)²] holds, namely 2⊖M ⊆ M². Then

0 < μ_Ne[M](x, x + y) = μ_M(1 + x/(x + y)) = μ_M(2 − y/(x + y)) = μ_(2⊖M)(y/(x + y))
  ≤ μ_(M²)(y/(x + y)) = μ_(M²)((x + y)/y),  since M² is symmetric,
  = μ_(M²)(1 + x/y) = μ_Ne[M²](x, y).

This means that (a, b) ∈ Ne[M²]. ∎

From this set of inference rules, other useful rules can be derived. Thus, we can notice that the following rules are consequences of others:

- If c ≠ 0,

(a·c, b·c) ∈ Ne[M] ⇔ (a, b) ∈ Ne[M]   (28)

is a direct consequence of (14) and (15);

- If c ≠ 0,

(a·c, b·c) ∈ Cl[M] ⇔ (a, b) ∈ Cl[M]   (29)

is a direct consequence of (12) and (14), noticing that c = d can be written (c, d) ∈ Cl[1], where μ_1(x) = 0 if x ≠ 1 and μ_1(1) = 1.

- The iteration property of Cl,

(a, b) ∈ Cl[M] and (b, c) ∈ Cl[N] ⇒ (a, c) ∈ Cl[MN],   (30)

and of Ne,

(a, b) ∈ Ne[M] and (b, c) ∈ Ne[N] ⇒ (a, c) ∈ Ne[(M⊖1)(N⊖1)⊕1],   (31)

are consequences of (14), namely of

(A, B) ∈ Cl[M] and (C, D) ∈ Cl[N] ⇒ (A·C, B·D) ∈ Cl[MN]

for A = a, B = C = b and D = c, using (29), and respectively of (21), namely of

(A, B) ∈ Ne[M] and (C, D) ∈ Ne[N] ⇒ (A·C, B·D) ∈ Ne[(M⊖1)(N⊖1)⊕1]

for A = a, B = C = b and D = c, using (28).

- The simplification rule with respect to the addition,

(a + b, c) ∈ Cl[M] and (b, a) ∈ Ne[N] ⇒ (a, c) ∈ Cl[MN],   (32)

is a consequence of (30), (15) and (13).

- The simplification rule with respect to the product,

(a·b, c·d) ∈ Cl[M] and (a, c) ∈ Cl[N] ⇒ (b, d) ∈ Cl[MN],   (33)

is a consequence of (14), namely

(A, B) ∈ Cl[M] and (C, D) ∈ Cl[N] ⇒ (A·C, B·D) ∈ Cl[MN]

with A = a·b, B = c·d, C = 1/a and D = 1/c.

299

with A = a·b, B = c·d, C = lla and D = lIc.

- in the same way the simplification rule with respect to the product

(a·b, c·d) E Ne[M] and (c, a) E Ne[N] => (b, d) E Ne[(M8l)(N8l)EBl] (34)

is a consequence of (21), namely of

(A, B) E Ne[M] and (C, D) E Ne[N] => (AC, B·D) E Ne[(M8l)(N81)EBl]

with A = a·b, B = c·d, C = l/a and D = l/c.

- The simplification rule with respect to the addition,

(a + b, c) ∈ Cl[M] and (b, c) ∈ Ne[N] ⇒ (a, c) ∈ Cl[M⊖N⊕1],   (35)

is a consequence of (25), namely of

(A + B, C) ∈ Cl[M] and (B, C) ∈ Cl[N] ⇒ (A, C) ∈ Ne[M⊖N⊕1]

with A + B = a + b, C = c, B = b + c, i.e. A = a − c, using (15) twice.

- The simplification rule with respect to the addition,

(a + b, c) ∈ Ne[M] and (b, c) ∈ Ne[N] ⇒ (a, c) ∈ Ne[M⊖N⊕1],   (36)

is a consequence of (25) with A = a, B = b + c and C = c.

- The summation rule for negligible quantities,

(a, c) ∈ Ne[M] and (b, c) ∈ Ne[N] ⇒ (a + b, c) ∈ Ne[M⊕N⊖1],   (37)

is a consequence of (24), namely

(A, C) ∈ Cl[M] and (B, C) ∈ Ne[N] ⇒ (A + B, C) ∈ Cl[M⊕N⊖1]

with A = a + c, B = b, and C = c, using (15).

- The simplification rules

(a − b, c) ∈ Cl[M] and (a, c) ∈ Cl[N] ⇒ (b, c) ∈ Ne[N⊖M⊕1]   (38)

(a − b, c) ∈ Ne[M] and (a, c) ∈ Cl[N] ⇒ (b, c) ∈ Cl[N⊖M⊕1]   (39)

can be deduced from (25), namely from

(A + B, C) ∈ Cl[M] and (B, C) ∈ Cl[N] ⇒ (A, C) ∈ Ne[M⊖N⊕1]

with A = b, B = a − b, C = c, and with A = b − c, B = a − b + c, C = c, respectively.


- The simplification rules

(a·b, c·d) ∈ Cl[M] and (a, c) ∈ Ne[N] ⇒ (d, b) ∈ Ne[M(N⊖1)⊕1]   (40)

(a·b, c·d) ∈ Ne[M] and (c, a) ∈ Cl[N] ⇒ (b, d) ∈ Ne[(M⊖1)N⊕1]   (41)

can be deduced from (27), namely from

(A, B) ∈ Cl[M] and (C, D) ∈ Ne[N] ⇒ (A·C, B·D) ∈ Ne[M(N⊖1)⊕1]

for A = c·d, B = a·b, C = a, D = c, using (13) and (28), and for A = c, B = a, C = a·b, D = c·d, using (13).

Nevertheless, if we want to build an efficient symbolic reasoning system, it may be interesting to explicitly keep some of the above simplification rules even if they are redundant.

Besides, while the precision of rules (21) to (27) is optimal, similar rules can be syntactically derived from other rules, but with less precise results. For example, we can obtain the following weakened variant of rule (22):

(a, b) ∈ Cl[M] and (b, c) ∈ Ne[N] ⇒ (a, c) ∈ Ne[MN].

Indeed from (a, b) ∈ Cl[M] we deduce, using (12) and (13), that (a + c, b + c) ∈ Cl[M] for c > 0; by (15) we have (b + c, c) ∈ Cl[N], and then by (30) (a + c, c) ∈ Cl[MN], which implies that (a, c) ∈ Ne[MN]. Indeed, we can verify that the positive part of MN (the only pertinent part here) contains the positive part of M(N⊖1)⊕1.

In the same way (from (15), (14) and (18)), we can obtain the following weakened variant of rule (21):

(a, b) ∈ Ne[M] and (c, d) ∈ Ne[N] ⇒ (a·c, b·d) ∈ Ne[MN].

4 Application to Reasoning with Qualitative Probabilities

Adams [1] has proposed a probabilistic inference system based on the following three inference rules:

Triangularity:  A ~ B, A ~ C ⊢ (A∩B) ~ C
Bayes rule:     A ~ B, (A∩B) ~ C ⊢ A ~ C
Disjunction:    A ~ C, B ~ C ⊢ (A∪B) ~ C

which are sound and complete when A ~ B is understood as "the conditional probability P(B|A) is infinitely close to 1". These rules have been used by Pearl [12] for building a probabilistic inference-like default logic.

In this section, we study the counterpart of these rules with a fuzzy, non-infinitesimal semantics (in order to cope with the graduality of the idea of closeness). In this respect A ~ B will be interpreted as "almost all As are Bs"; in other words, "P(B|A) is close to 1" in the sense of the closeness relation Cl. Note that an interval-based semantics has already been proposed in [5]. "P(B|A) close to 1" can then be written:

(P(B|A), 1) ∈ Cl[M].   (42)

As P(B|A) = P(A∩B)/P(A), we have, using (29):

(P(B|A), 1) ∈ Cl[M] ⇔ (P(A∩B), P(A)) ∈ Cl[M].

4.1 Analysis of the Bayes Rule

The rule writes A ~ B, (A∩B) ~ C ⊢ A ~ C.

It has been shown in [6] that:

P(C|A) ≥ P(B|A) · P(C|A∩B) = P(C∩A∩B)/P(A).   (43)

By hypothesis, A ~ B writes

(P(A∩B), P(A)) ∈ Cl[M]   (44)

and (A∩B) ~ C writes

(P(C∩A∩B), P(A∩B)) ∈ Cl[N].   (45)

By application of the rule (30) to (44) and (45), we obtain

(P(C∩A∩B), P(A)) ∈ Cl[MN], which can be written (P(C∩A∩B)/P(A), 1) ∈ Cl[MN].

By (43), we can then conclude

(P(C|A), 1) ∈ Cl[MN],   (46)

applying the rule

1 ≥ y ≥ x > 0 and (x, 1) ∈ Cl[M] ⇒ (y, 1) ∈ Cl[M],   (47)

which follows from (12).


In terms of intervals [5], if the part of a level cut of M (respectively N) below 1 is denoted [1 − ε, 1] (respectively [1 − η, 1]), (46) expresses that P(C|A) ∈ [(1 − ε)(1 − η), 1] = [1 − (ε + η − εη), 1]. This latter result coincides with the one recently obtained by Gilio [9].
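This interval can be sanity-checked by simulation: the sketch below samples conditional probabilities satisfying the two premises and verifies the lower bound coming from (43) and (46) numerically (an illustrative check, not a proof; the parameter values and helper names are assumptions).

```python
# Minimal sketch: random check that P(C|A) >= P(B|A)*P(C|A and B) >= (1-e)(1-n)
# whenever P(B|A) >= 1-e and P(C|A and B) >= 1-n.
import random

def check(trials=10000, e=0.1, n=0.2):
    for _ in range(trials):
        pBgA = random.uniform(1 - e, 1.0)          # P(B|A)
        pCgAB = random.uniform(1 - n, 1.0)         # P(C|A and B)
        pCgAnotB = random.uniform(0.0, 1.0)        # P(C|A and not B), unconstrained
        pCgA = pBgA * pCgAB + (1 - pBgA) * pCgAnotB
        if pCgA < (1 - e) * (1 - n) - 1e-12:
            return False
    return True

print(check())                                     # True
```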

4.2 Analysis of the Triangularity Rule

The rule writes

A ~ B , A ~ C =} (AnB) ~ C.

The hest lower bound of P(ClAnB) knowing P(BIA) and P(ClA), is given by [6]:

P(ClAnB) 2 max [0, 1 - ((1 - P(ClA» / P(BIA»]. (48)

Note that:

I - ((1 - P(ClA» / P(BIA» == (P(AnB) + P(CnA) - peA»~ / P(AnB). (49)

By hypothesis, we have:

(P(A∩B), P(A)) ∈ Cl[M].   (50)

(P(C∩A), P(A)) ∈ Cl[N].   (51)

Rule (15) enables us to obtain (P(C∩A) − P(A), P(A)) ∈ Ne[N]. This relation and (50) imply that (according to rules (13) and (22)):

(P(C∩A) − P(A), P(A∩B)) ∈ Ne[M(N⊖1)⊕1]   (52)

⇔ (P(C∩A) − P(A) + P(A∩B), P(A∩B)) ∈ Cl[M(N⊖1)⊕1]

⇔ ((P(C∩A) + P(A∩B) − P(A)) / P(A∩B), 1) ∈ Cl[M(N⊖1)⊕1]

which implies that

(P(C|A∩B), 1) ∈ Cl[M(N⊖1)⊕1],   (53)

applying (47).

In terms of intervals, if a level cut of M (respectively N) is of the form [1 − ε, 1/(1 − ε)] (respectively [1 − η, 1/(1 − η)]), (53) expresses that:

P(C|A∩B) ∈ [1 − ε, 1/(1 − ε)] · [−η, η/(1 − η)] + 1, from which we deduce that

P(C|A∩B) ∈ [1 − η/(1 − ε), 1]


which corresponds to a result given in [5]. Also, this latter result coincides with the one obtained by Gilio [9].
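The worst case of bound (48) over the premises can likewise be checked numerically; in the sketch below (plain Python, illustrative values), the premises P(B|A) ≥ 1 − ε and P(C|A) ≥ 1 − η yield exactly the lower bound 1 − η/(1 − ε) derived above.

```python
# Illustrative worst-case check of the triangularity bound (eps, eta arbitrary).
eps, eta = 0.05, 0.10

def lower_bound(p_b_given_a, p_c_given_a):
    # best lower bound of P(C|A n B) given P(B|A) and P(C|A), cf. (48)
    return max(0.0, 1.0 - (1.0 - p_c_given_a) / p_b_given_a)

# The worst case over P(B|A) in [1-eps, 1] and P(C|A) in [1-eta, 1]
# is attained at the lower ends of both intervals:
worst = lower_bound(1 - eps, 1 - eta)
print(worst, 1 - eta / (1 - eps))   # both approximately 0.8947
```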

4.3 Analysis of the Disjunction Rule

The rule reads A → C, B → C ⇒ (A∪B) → C.

The best lower bound of P(C|A∪B) as a function of P(C|A) and P(C|B) is given by [8]:

P(C|A∪B) ≥ P(C|A) · P(C|B) / (P(C|A) + P(C|B) − P(C|A)·P(C|B))   (54)
         = 1 / (1/P(C|B) + 1/P(C|A) − 1)

This bound improves the one used in [6]:

P(C|A∪B) ≥ max(0, P(C|A) + P(C|B) − 1).

Indeed, on [0, 1] we have ab/(a + b − ab) ≥ max(0, a + b − 1).

The following relations are assumed to hold:

(P(C|A), 1) ∈ Cl[M]   (55)

(P(C|B), 1) ∈ Cl[N].   (56)

So by (29), we have (1/P(C|A), 1) ∈ Cl[M] and (1/P(C|B), 1) ∈ Cl[N].

By (16) we have (1/P(C|B) − 1, 1) ∈ Ne[N];

by (24) we have (1/P(C|A) + 1/P(C|B) − 1, 1) ∈ Cl[M⊕N⊖1];

by (29) we have (1/(1/P(C|A) + 1/P(C|B) − 1), 1) ∈ Cl[M⊕N⊖1];

and finally applying (47) and (54), we obtain

(P(C|A∪B), 1) ∈ Cl[M⊕N⊖1].   (57)


This bound, obtained by the successive application of local rules, is not optimal, since it corresponds to the bound used in [6] (indeed we have P(C|A∪B) ∈ [1 − ε − η, 1] for P(C|A) ∈ [1 − ε, 1] and P(C|B) ∈ [1 − η, 1]). An optimal bound could be obtained semantically by a direct computation, similarly to (6) in Section 2. Indeed, the application of the combination/projection technique to the disjunction rule leads to (where x = P(C|A), y = P(C|B) and z = P(C|A∪B) = 1/(1/x + 1/y − 1)):

sup_{x,y: z=1/(1/x+1/y−1)} min(μ_Cl[M](x, 1), μ_Cl[N](y, 1))

= sup_{x,y: z=1/(1/x+1/y−1)} min(μ_M(x), μ_N(y))

= μ_F(z) = μ_Cl[F](z, 1).

This means that (P(C|A∪B), 1) ∈ Cl[F], with F = 1/(1/M ⊕ 1/N ⊖ 1). Since M and N are symmetrical fuzzy intervals, 1/M = M and 1/N = N. This implies that F = 1/(M⊕N⊖1), and finally we obtain:

(P(C|A∪B), 1) ∈ Cl[1/(M⊕N⊖1)]   (58)

Now, in terms of intervals, if a level cut of M (respectively N) is of the form [1 − ε, 1/(1 − ε)] (respectively [1 − η, 1/(1 − η)]), then (58) reads

P(C|A∪B) ∈ [(1 − ε)(1 − η)/(1 − εη), 1/(1 − ε − η)], from which we deduce that

P(C|A∪B) ∈ [(1 − ε)(1 − η)/(1 − εη), 1].   (59)

Now, since (1 − ε)(1 − η)/(1 − εη) = 1 − (ε + η − 2εη)/(1 − εη), (59) reads P(C|A∪B) ∈ [1 − (ε + η − 2εη)/(1 − εη), 1].

This exactly coincides with the optimal result obtained by Gilio in [9].
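The disjunction bounds discussed above can be compared numerically; the sketch below (with arbitrary ε and η) evaluates the bound used in [6] and the optimal bound (54), and confirms the identity with (1 − ε)(1 − η)/(1 − εη).

```python
# Comparing the disjunction bounds for arbitrary eps, eta (illustrative values).
eps, eta = 0.05, 0.10
a, b = 1 - eps, 1 - eta            # worst-case P(C|A), P(C|B)

local   = max(0.0, a + b - 1)      # bound used in [6]
optimal = a * b / (a + b - a * b)  # bound (54)

print(local, optimal)
assert abs(optimal - (1 - eps) * (1 - eta) / (1 - eps * eta)) < 1e-12
assert optimal >= local            # ab/(a+b-ab) >= max(0, a+b-1) on [0, 1]
```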

Thus we have established the three rules:

A →_M B, (A∩B) →_N C ⇒ A →_{MN} C

A →_M B, A →_N C ⇒ (A∩B) →_{M(N⊖1)⊕1} C

A →_M C, B →_N C ⇒ (A∪B) →_{M⊕N⊖1} C


where A →_M B reads (P(B|A), 1) ∈ Cl[M]. These rules enable us to give a fuzzy semantics to plausible reasoning with qualitative probabilities, and to numerically assess the validity of conclusions. Note that these rules express a degradation of the validity of Adams' axioms, since Cl[M] ⊆ Cl[M²] and Cl[M] ⊆ Cl[M(M⊖1)⊕1], as expected.
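At the crisp interval level (a level cut of M written [1 − ε, 1]), the degradation expressed by these rules can be tracked mechanically. The sketch below is only an interval-level approximation of the fuzzy calculus of this section: each conditional is summarised by an exception rate, and the three combinators mirror the interval bounds derived above (the disjunction combinator uses the non-optimal bound (57)).

```python
# Interval-level sketch of the degradation of validity along chained rules.
# Each conditional A ->_M B is summarised by its exception rate eps, i.e. P(B|A) >= 1 - eps.

def bayes(eps_ab, eta_abc):
    """A ->_M B, (A n B) ->_N C  give  A ->_MN C."""
    return eps_ab + eta_abc - eps_ab * eta_abc

def triangularity(eps_ab, eta_ac):
    """A ->_M B, A ->_N C  give  (A n B) ->_{M(N-1)+1} C."""
    return eta_ac / (1 - eps_ab)

def disjunction(eps_ac, eta_bc):
    """A ->_M C, B ->_N C  give  (A u B) ->_{M+N-1} C (non-optimal bound (57))."""
    return eps_ac + eta_bc

# Example: two chained applications (values are arbitrary).
e1 = bayes(0.02, 0.03)              # about 0.0494
e2 = triangularity(0.01, e1)        # about 0.0499
print(e1, e2, disjunction(e1, e2))  # the exception rates grow along the chain
```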


5 Conclusion

In this paper, we have developed a symbolic computation of relative orders of magnitude, with a fuzzy semantics. The symbolic side of the computation emphasizes the granular nature of the notions of closeness and negligibility: sets of close, or negligible, values (w.r.t. another value) are manipulated as a whole. However, thanks to the fuzzy semantics, the computation can be interfaced with numerical values. The modeling of relative orders of magnitude using fuzzy relations constitutes a promising approach for handling some practical problems in qualitative reasoning. A first application of symbolic reasoning based on the relations Ne and Cl has been discussed in [7]. It concerns the approximate resolution of set equations. The second application, presented in this paper, deals with the handling of qualitative probabilities. It provides a basis for expressing the degradation of the exception rate of rules in plausible reasoning.

References

1. Adams E.W.: The Logic of Conditionals. D. Reidel, Dordrecht (1975)

2. Dubois D., Prade H.: Fuzzy numbers: An overview. In: Analysis of Fuzzy Information, Vol. 1: Mathematics and Logic (J.C. Bezdek, ed.), CRC Press, Boca Raton, FL (1987) 3-39

3. Dubois D., Prade H.: Fuzzy arithmetics in qualitative reasoning. In: Modeling and Control of Systems (A. Blaquiere ed.), Lecture Notes in Control and Information Sciences, 121, Springer-Verlag, (1988) 457-467

4. Dubois D., Prade H.: Order of magnitude reasoning with fuzzy relations. Revue d' Intelligence Artificielle (Hermes, Paris) 3(4), (1989) 69-94

5. Dubois D., Prade H.: Semantic considerations on order of magnitude reasoning. Decision Support Systems and Qualitative Reasoning, M.G. Singh and L. Trave-Massuyes (eds.), Elsevier Science Publishers B.V. (North-Holland), IMACS, (1991) 223-228

6. Dubois D., Godo L., Lopez de Mantaras R., Prade H.: Qualitative reasoning with imprecise probabilities. J. of Intelligent Information Systems 2, (1993) 319-363

7. Dubois D., Hadj-Ali A., Prade H.: Raisonnement sur des ordres de grandeurs relatifs avec des relations floues - Quelques nouveaux résultats. Actes de la 7ème Conférence sur la Logique Floue et ses Applications (LFA), Cépaduès, Toulouse (1997) 253-260

8. Gilio A.: Algorithms for precise and imprecise conditional probability assessments. In: Mathematical Models for Handling Partial Knowledge in Artificial Intelligence (G. Coletti, D. Dubois, R. Scozzafava, eds.), Plenum Press (1995) 231-254

9. Gilio A.: Precise propagation of upper and lower bounds in system P. In: Proc. 8th Inter. Workshop on Non-Monotonic Reasoning (NMR'2000), Breckenridge, Colorado, USA, April 9-11 (2000)

10. MQ&D Project (coordinator: Dague P.): Qualitative reasoning: a survey of techniques and applications. AI Communications, The European Journal on Artificial Intelligence, Vol. 8, Nos. 3/4, Sept./Dec. (1995) 119-192

11. Pawlak Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publ., Dordrecht (1991)


12. Pearl J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA (1988)

13. Raiman O.: Order of magnitude reasoning. Proc. of the National Conference of the American Association for AI (AAAI), Philadelphia, August (1986) 100-104

14. Raiman O.: Le raisonnement sur les ordres de grandeur. Revue d'Intelligence Artificielle (Hermès, Paris) 3(4), (1989) 55-67

15. Travé-Massuyès L., Dague P., Guerrin F.: Le raisonnement qualitatif pour les sciences de l'ingénieur. Hermès, Paris (1997)

16. Zadeh L.A.: A theory of approximate reasoning. In: Machine Intelligence, Vol. 9 (J.E. Hayes, D. Michie, L.I. Mikulich, eds.), Elsevier, New York (1979) 149-194

17. Zadeh L.A.: Fuzzy sets and information granularity. In: Advances in Fuzzy Set Theory and Applications (M. Gupta, R. Ragade, R. Yager, eds.), North-Holland, Amsterdam (1979) 3-18

18. Zadeh L.A.: Fuzzy logic = Computing with words. IEEE Trans. on Fuzzy Systems 4(2), (1996) 103-111

19. Zadeh L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, (1997) 111-127

20. Zadeh L.A.: Toward a restructuring of the foundations of fuzzy logic. Proc. of the 7th IEEE International Conference on Fuzzy Systems, Anchorage, Alaska, May 4-9 (1998) 1678-1679

Appendix

• Rule (12): we have μ_Cl[M](x, y) = μ_M(x/y) > 0 and x/y ≤ z/t ≤ 1. This last relation implies that 0 < μ_M(x/y) ≤ μ_M(z/t) ≤ μ_M(1) = 1, since M has an increasing membership function for t ≤ 1. Thus μ_M(z/t) > 0, which implies that μ_Cl[M](z, t) > 0. This means that (c, d) ∈ Cl[M].

• Rule (18): it is easy to see here that x and y are of the same sign (since 0 ≤ x/y ≤ 1), and μ_Ne[M](y, z) = μ_M(1 + y/z) > 0. Now, let us consider the following two cases:

Case 1: Assume that x, y and z are of the same sign; then 0 ≤ x/z ≤ y/z ⇔ 1 ≤ 1 + x/z ≤ 1 + y/z ⇔ 0 < μ_M(1 + y/z) ≤ μ_M(1 + x/z) ≤ μ_M(1) = 1 (since the membership function of M is decreasing for t ≥ 1). This implies that μ_Ne[M](x, z) = μ_M(1 + x/z) > 0, which means that (a, c) ∈ Ne[M].

Case 2: x and y have one sign and z the other; then y/z ≤ x/z ≤ 0 ⇔ 1 + y/z ≤ 1 + x/z ≤ 1 ⇔ 0 < μ_M(1 + y/z) ≤ μ_M(1 + x/z) ≤ μ_M(1) = 1 (since the membership function of M is increasing for t ≤ 1). This implies that μ_Ne[M](x, z) = μ_M(1 + x/z) > 0, which means that (a, c) ∈ Ne[M].

• Rule (19): it is also easy to see here that y and z are of the same sign (since 0 ≤ y/z ≤ 1), and μ_Ne[M](x, y) = μ_M(1 + x/y) > 0. Let us divide the proof into the following cases:

Case 1: Assume that y, z and x are of the same sign; then 0 ≤ x/z ≤ x/y ⇔ 1 ≤ 1 + x/z ≤ 1 + x/y ⇔ 0 < μ_M(1 + x/y) ≤ μ_M(1 + x/z) ≤ μ_M(1) = 1 (since the membership function of M is decreasing for t ≥ 1). This implies that μ_Ne[M](x, z) = μ_M(1 + x/z) > 0, which means that (a, c) ∈ Ne[M].

Case 2: y and z have one sign and x the other; then x/y ≤ x/z ≤ 0 ⇔ 1 + x/y ≤ 1 + x/z ≤ 1 ⇔ 0 < μ_M(1 + x/y) ≤ μ_M(1 + x/z) ≤ μ_M(1) = 1 (since the membership function of M is increasing for t ≤ 1). This implies that μ_Ne[M](x, z) = μ_M(1 + x/z) > 0, which means that (a, c) ∈ Ne[M].

• Rule (23): sup_x min(μ_Cl[M](x, y), μ_Ne[N](z, x)) = sup_x min(μ_M(x/y), μ_N(1 + z/x)) = sup_{u,v: u(v−1)+1 = 1+z/y} min(μ_M(u), μ_N(v)), letting u = x/y and v = 1 + z/x and observing that 1 + z/y = u(v−1) + 1; this equals μ_{M(N⊖1)⊕1}(1 + z/y) = μ_Ne[M(N⊖1)⊕1](z, y). This means that (c, b) ∈ Ne[M(N⊖1)⊕1].

• Rule (24): sup_{x,y,z: w=x+y} min(μ_Cl[M](x, z), μ_Ne[N](y, z)) = sup_{x,y,z: w=x+y} min(μ_M(x/z), μ_N(1 + y/z)) = sup_{u,v: u+v−1 = w/z} min(μ_M(u), μ_N(v)), letting u = x/z and v = 1 + y/z (so that w/z = u + v − 1); this equals μ_{M⊕N⊖1}(w/z) = μ_Cl[M⊕N⊖1](w, z). This means that (a + b, c) ∈ Cl[M⊕N⊖1].
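The sup–min computation behind rule (24) can be checked numerically on discretised membership functions; the sketch below assumes, purely for illustration, symmetric triangular fuzzy numbers M and N centred at 1, and compares the left-hand side with the extension-principle evaluation of μ_{M⊕N⊖1}(w/z).

```python
import numpy as np

# Numeric check of the sup-min computation of rule (24), assuming (for illustration)
# symmetric triangular fuzzy numbers M and N centred at 1.
def tri(spread):
    return lambda t: np.maximum(0.0, 1.0 - np.abs(t - 1.0) / spread)

mu_M, mu_N = tri(0.2), tri(0.1)

z, w = 4.0, 4.1                  # we evaluate the membership of the pair (a + b, c) = (w, z)
xs = np.linspace(0.0, w, 4001)   # all splits w = x + y

# Left-hand side: sup over x + y = w of min( mu_Cl[M](x, z), mu_Ne[N](y, z) )
lhs = max(min(mu_M(x / z), mu_N(1.0 + (w - x) / z)) for x in xs)

# Right-hand side: mu_{M (+) N (-) 1}(w / z) by the extension principle (v = w/z - u + 1)
us = np.linspace(0.5, 1.5, 4001)
rhs = max(min(mu_M(u), mu_N(w / z - u + 1.0)) for u in us)

print(lhs, rhs)                  # both close to 0.9167, up to discretisation error
```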

• Rule (26): sup_{x,y,z: w=x+y} min(μ_Ne[M](x, z), μ_Ne[N](y, x)) = sup_{x,y,z: w=x+y} min(μ_M(1 + x/z), μ_N(1 + y/x)) = sup_{u,v: (u−1)v+1 = 1+w/z} min(μ_M(u), μ_N(v)), letting u = 1 + x/z and v = 1 + y/x (so that 1 + w/z = (u−1)v + 1); this equals μ_{(M⊖1)N⊕1}(1 + w/z) = μ_Ne[(M⊖1)N⊕1](w, z). This means that (a + b, c) ∈ Ne[(M⊖1)N⊕1].

• Rule (27): sup_{x,y,z,t: x·z=p, y·t=q} min(μ_Cl[M](x, y), μ_Ne[N](z, t)) = sup_{x,y,z,t: x·z=p, y·t=q} min(μ_M(x/y), μ_N(1 + z/t)) = sup_{u,v: u(v−1)+1 = 1+p/q} min(μ_M(u), μ_N(v)), letting u = x/y and v = 1 + z/t (so that 1 + p/q = u(v−1) + 1); this equals μ_{M(N⊖1)⊕1}(1 + p/q) = μ_Ne[M(N⊖1)⊕1](p, q). This means that (a·c, b·d) ∈ Ne[M(N⊖1)⊕1].


Application of Granularity Computing to Confirm Compliance with Non-Proliferation Treaty

A. Fattah¹, V. Pouchkarev¹, A. Belenki², A. Ryjov², and L. A. Zadeh³

¹ International Atomic Energy Agency, P.O. Box 100, A-1400 Vienna, Austria
² Department of Mechanics and Mathematics, Lomonosov Moscow State University, 119899 Moscow, Russia
³ Department of EECS, University of California, Berkeley, USA

Summary. Safeguards are essentially a technical means of verifying the fulfillment of political obligations undertaken by States and given a legal force in international agreements relating to the peaceful uses of nuclear energy. The main political objectives are: to assure the international community that States are complying with their non-proliferation and other peaceful undertakings; and to deter (a) the diversion of safeguarded nuclear materials to the production of nuclear explosives or for military purposes and (b) the misuse of safeguarded facilities with the aim of producing unsafeguarded nuclear material.

This chapter has been prepared based on the results of the project "Development of an Intelligent System for Monitoring and Evaluation of Peaceful Nuclear Activities (DISNA), Stage 1: Conceptual Model" [10]. The International Atomic Energy Agency, Department of Safeguards, Division of Concepts and Planning, Section for System Studies, in co-operation with Moscow State University, Department of Mechanics and Mathematics, has initiated this program. The goal of the system, the structure and logic of the model, the integration of the IAEA safeguards information sources, the technical, technological and other factors used as evaluation criteria, the structure of DISNA, its mathematical foundations, and the technology of information processing and evaluation in DISNA are discussed in this report.

The impact of applying the fuzzy logic concept in a real-world situation, such as the detection of a clandestine nuclear programme to enhance safeguards effectiveness, would be enormous. The combination of
- the systems approach using fuzzy logic;
- the mandated "transparency and openness" environment; and
- the results of actual on-site visits/inspections cued by "fuzzy logic" evaluations
is likely to be quite powerful.

When fully developed, DISNA will provide a mechanism for use in monitoring and evaluation of all persistent information for current as well as future needs of the IAEA such as nuclear free zones, monitoring of treaties, etc.

One of the theoretical bases of DISNA is the theory of fuzzy information granulation [16, 17]. Taking into account the great importance of this problem for the modern international community, and the strong interest of similar subject areas for the application of granularity computing, we decided to publish our results in this book. We hope that the ideas and results described here will stimulate new applications of granularity computing techniques.

This work has been performed under the auspices of the International Atomic Energy Agency (IAEA), Vienna. It is based on all information and knowledge available to the authors but does not necessarily reflect the policy expressed or implied by the IAEA or its Member States.


1. Introduction

The International Atomic Energy Agency's Statute in Article III.A.5 allows it "to establish and administer safeguards designed to ensure that special fissionable and other materials, services, equipment, facilities and information made available by the Agency, or at its request, or under its supervision or control, are not used in such a way as to further any military purpose; and to apply safeguards, at the request of the parties, to any bilateral or multilateral arrangement, or at the request of a State, to any of that State's activities in the field of atomic energy" [3].

Safeguards are essentially a technical means of verifying the fulfillment of political obligations undertaken by States and given a legal force in international agreements relating to the peaceful uses of nuclear energy. The main political objectives are: to assure the international community that States are complying with their non-proliferation and other peaceful undertakings; and to deter (a) the diversion of safeguarded nuclear materials to the production of nuclear explosives or for military purposes and (b) the misuse of safeguarded facilities with the aim of producing un-safeguarded nuclear material.

Today, most of these obligations flow from the Non-proliferation Treaty (NPT) and other safeguards agreements [4,5].

According to the NPT type of agreements, the objective of safeguards is "the timely detection of diversion of significant quantities of nuclear material from peaceful nuclear activities to the manufacture of nuclear weapons or other nuclear explosive devices or for purposes unknown and deterrence of such diversion by the risk of early detection". In Safeguards Agreements concluded under the non-NPT system there is no specific definition of technical objectives, but in practice essentially the same concepts apply.

1.1 Background

For more than two decades the IAEA safeguards verification system has evolved around those nuclear activities and nuclear materials that are declared by the States. The changing perceptions of the international community, based on recent events such as in Iraq and the DPRK, have led to the need for strengthening the current safeguards system to enable detection of undeclared nuclear activities, to meet the global proliferation concerns more effectively. Undeclared activities may involve declared or undeclared facilities processing declared or undeclared material. It is generally recognized that evaluation of all information collected from State declarations, verification activities, open or any other source, would be an essential tool for detection of any undeclared nuclear activity in operation or planned. In this context complementary access foreseen in the strengthened safeguards system plays a major role. The Department of Safeguards is in the process of developing an intelligent decision support system, based on human-computer interaction, for the monitoring and evaluation of States' nuclear activities in an objective and balanced manner. The program for the Development of an Intelligent System for Monitoring and Evaluation of Peaceful Nuclear Activities (DISNA) was launched to achieve these objectives.

The goal of the DISNA program is to develop a methodology and software tool for the analysis of information concerning all nuclear activities of a State, the evaluation of its current status, and the projection of probable future changes or developments. The Division of Concepts and Planning, Section for System Studies, in co-operation with Moscow State University, Department of Mechanics and Mathematics, has initiated this program, which will be pursued in three phases. This report constitutes the first phase on the development of a conceptual model, which will be followed by the next two phases, i.e., a demonstration prototype and finally a prototype. It is expected that the prototype would be available by early 1999 for a campaign of 6-12 months' runs with actual data so that the department could deliver a network version in due course.

When fully developed, DISNA will provide a mechanism for use in monitoring and evaluation of all persistent information for current as well as future needs of the IAEA such as nuclear free zones, monitoring of treaties, etc.

1.2 From Classical to Strengthened Safeguards

Safeguards are essentially a technical means of verifying compliance with political obligations undertaken by States in concluding international agreements relating to the peaceful uses of nuclear energy. The technical objective of safeguards is the timely detection of diversion of significant quantities of nuclear material from peaceful nuclear activities to the manufacture of nuclear weapons or of other nuclear explosive devices or for purposes unknown, and deterrence of such diversion by the risk of early detection. The technical conclusion is a statement in respect of each material balance area of the amount of material unaccounted for over a specific period, giving the limits of accuracy of the amounts stated.

The current safeguards regime is based on a nuclear material accountancy system which requires the co-operation of nuclear facility operators, State authorities and IAEA inspectors. The Agency, through facility design review verification and inspection activities, essentially audits the nuclear material accounts of IAEA Member States. The Agency's findings are based on an evaluation of a State's material accountancy system with the assurance that it conforms to accepted accounting principles and that there has been no falsification by national authorities. The assurances provided by the Agency's safeguards system through the verification of nuclear material flows and inventories relate to the correctness of the information provided by Member States and not its completeness. The issue was debated at length about thirty years ago when states were negotiating the Agency's model comprehensive safeguards agreement (the legal basis for the IAEA's safeguards agreements with Member States).

The delegates eventually agreed that the safeguards system would pertain only to declared nuclear material with the tacit understanding that States' declarations would be full and complete. However, this does not mean the system, as designed, can address all of the scenarios for effective implementation of safeguards. When a State accedes to the NPT, it must provide the IAEA with its initial declaration of all nuclear materials and facilities comprehensively, as required by the safeguards agreement it has entered into. To establish the completeness of the initial declarations is a complex exercise requiring access to historical operational records and related data. Even a detailed investigation may not produce the desired results when past records are incomplete or other information substantiating the initial declaration cannot be provided. This is a particularly difficult problem if a State is suspected of having produced unsafeguarded weapons-usable material at any stage before or after joining the NPT or other multinational arrangement requiring comprehensive safeguards. Moreover, even if the initial report is complete, a State can subsequently build clandestine facilities and covertly produce fissile materials. Under comprehensive safeguards agreements, the Agency's access during routine inspections is limited to specified areas called strategic points in the Agency's material accountancy verification obligations. However, this limited access diminishes the Agency's ability to detect, for example, an undeclared nuclear material production cycle without involving safeguarded material.

Recent events such as the discovery of a clandestine nuclear weapons programme in Iraq, the continuing difficulty in verifying the initial declaration of the DPRK upon entry into force of their safeguards agreement and the decision of South Africa and a number of newly independent States from the former Soviet Union to give up their nuclear weapons program and join the NPT have all played a role in embarking on an ambitious effort to strengthen the current IAEA safeguards system. The strength of the safeguards system relies upon three interrelated elements:

- the extent to which the IAEA is aware of the nature and locations of States' nuclear and nuclear-related activities;

- the extent to which IAEA inspectors have physical access to relevant locations for independent verification of the exclusively peaceful intent of a State's nuclear program;

- the will of the international community to take action against States not complying with their non-proliferation commitments when such instances are brought to the notice of the United Nations Security Council by the IAEA.

IAEA access to the Security Council has been re-affirmed since 1991. The Board of Governors of the IAEA has approved a number of specific measures to increase access to information and locations. Some of the new measures are being implemented under existing safeguards agreements. Other measures require new legal authority provided for in the Additional Protocol approved by the Board of Governors in May 1997 [5].

1.3 Enhanced Role of Information Evaluation in Strengthened Safeguards System

The conceptual framework for the evaluation of information for material accountancy safeguards was well established at the time INFCIRC/153 [4] was negotiated. It evolved from a series of considerations which attempted to find a balance between that which was needed to maintain technical rigor and independence on the one hand and that which was achievable and affordable on the other.

Basic to those considerations was the conclusion that a safeguards system based on any imaginable form of direct verification, whereby the verification authority would maintain parallel and independent records and accounts, was neither achievable nor affordable.

In February 1992, the Board affirmed that the scope of comprehensive safeguards is not limited to nuclear material declared to the Agency by a State, but that it also includes nuclear material which may not have been declared. The requirement that the safeguards system provide assurance that States' nuclear material declarations are correct and complete is at the cornerstone of strengthened safeguards.

There is no visualisable or practicable form of direct verification at affordable cost which can assure that a State's nuclear material declarations are complete. However, a State's nuclear programme involves an interrelated set of activities implied by the existence of equipment, infrastructure, tell-tale traces in the environment and a predictable utilisation of nuclear materials. This provides the basis for the development of a concept, a kind of extended audit function involving expanded declaration, information evaluation, new technical measures and inspectors' access as integral parts. This concept is intended to ensure indirectly that the State's nuclear material declarations are complete, through assurance of the absence of activities that could indicate the presence of such material. And, as an audit function, everything that is done by way of evaluation, verification and the seeking of additional information is in the context of a declaration. In the context of strengthened safeguards requiring extended safeguards conclusions, the following questions provide a more specific framework for the evaluation of information:

1. Is the present and planned declared nuclear programme internally consistent?

2. Are the nuclear activities and types of nuclear material at declared locations consistent with those declared (e.g., through the collection and analysis of environmental samples)?


3. Are overall production, imports and inventories of nuclear materials consistent with the utilisation inferred from the declared nuclear programme?

4. Are imports of specified equipment and non-nuclear materials consistent with the declared programme?

5. Is the status of closed-down or decommissioned facilities and selected LOFs in conformity with the State's declaration?

6. Are nuclear fuel cycle related R&D activities generally consistent with declared plans for future development of the declared programme?

All the above questions and answers can be stated in the form of hypotheses and tested by adopting an iterative process of information evaluation and inspection follow-up. Some hypotheses can be tested within a formal statistical framework and others qualitatively. The basic hypotheses, and thus the safeguards conclusions - implicit in the technical objectives of safeguards - are that there has been no diversion and that the State's nuclear material declarations are complete (or, alternatively, that there are no undeclared nuclear activities). The conceptual problem in communicating the nature of such safeguards conclusions in the Safeguards Implementation Report is that the occurrence of "no diversion" or "no undeclared nuclear activities" cannot be detected.

It can only be inferred from the absence of any evidence to the contrary. The absence does not prove that there has been no diversion or that there are no undeclared nuclear activities. It only says that

(i) from all information available and evaluated, none has been observed, and

(ii) in the absence of such observation, there is no reason to reject the hypotheses.

Within a formal setting, the level of assurance or confidence related to a safeguards conclusion is termed the power of the test - the probability of rejecting the hypotheses if they are false. In general, the power of the tests of the hypotheses that there has been "no diversion" or that there are "no undeclared nuclear activities" cannot, except for a few instances at the level of individual facilities, be calculated. The power of the tests, and thus the level of confidence in safeguards conclusions, can only be qualitatively assessed through a technical judgement regarding the efficiency of the information gathering and evaluation process. That is, given the audit nature of safeguards, if evidence contrary to a State's declaration exists, then what are the chances that the evidence will be part of the information collected and that it will be recognised as such through the information evaluation process?

This is the central problem in the technical implementation of safeguards. For material accountancy safeguards and the associated hypothesis that there has been "no diversion", the whole structure evolves from the identification of indicators of diversion (or indicators of circumstances where diversion cannot be excluded) and an information collection and evaluation process that attempts to assure that if such indications exist they will be recognised. The


implementation of strengthened safeguards and the testing of the hypothesis that there are "no undeclared nuclear activities" present a similar dilemma.

In practice, the testing of the hypothesis that there has been "no diversion" is cumulative, through a day-to-day (or, timeliness period to timeliness period) gathering of information which is tested against a State's declaration. Generally, the information has three dimensions - technical, temporal and spatial. For example, a State's nuclear material declaration specifies the type and quantity of nuclear material (technical) present at a given location (spatial) at a point in time (temporal). The information is organised or conditioned by location (e.g., specific facilities) and then evaluated according to technical characteristics distributed in time. A broadening of the definition of a material balance area to include several facilities may change certain verification requirements but it does not change the information evaluation in any substantive way.

The testing of the hypothesis that "there are no undeclared nuclear activities" is again cumulative, where information gathered by the Agency is tested against declarations made by the State. As before, the information has the same three dimensions, only now the roles of the time and space dimensions are reversed. The information is conditioned in time (or, more correctly, by time intervals) and evaluated according to technical characteristics distributed in space. The principal issue in this case is "what activities exist and where are they located". Explicit reference to time will most often enter the evaluation through a State's declaration. The time associated with an indication of any activity detected by the Agency through the information gathering and evaluation process will generally be the time the indication is detected, but not the time associated with the existence of the activity.

The evaluation of information relative to the hypothesis that "there are no undeclared activities" requires that all the information be placed in a context or structure that makes it possible for the evaluator to associate indicators with activities (declared) and to recognise existing ambiguities or inconsistencies (possible undeclared activities).

The obvious structure is the State's declaration organised both technically (e.g., the fuel cycle) and spatially (the geographical location of each activity). The evaluator must be able to visualise the entire spectrum of a State's nuclear programme both technically and in the way in which the activities are distributed geographically. Once these visualisations or overviews are in place the evaluator can proceed to increasing levels of detail at selected locations; however, the basic paradigm - overview, select, detail - is still to be followed. Thus, if a location is selected, the evaluator should be in a position to retrieve all available information regarding activities at this location, selecting (e.g., environmental sample results) and then proceeding to the details. Prior to further descriptions and clarifications on how the evaluation process could develop, a further delineation of the technical dimension and the association of indicators with activities performed would be required.


2. Prologue

2.1 Major Sources of Information

Currently the IAEA database contains various information provided by the State and through performance of inspections. Additional information is also available from other IAEA sources as well as various open sources.

a) Information provided by the State. Under the Safeguards agreement concluded with the IAEA, a State is obliged to establish a system for accounting and control of nuclear material. One of the important components is to maintain nuclear material accountancy records and provide the IAEA with reports for each facility. The records consist of:

- Design Information: as part of a Safeguards agreement, Member States provide the Agency with lists of existing facilities and other locations with pertinent design and functions. This also includes:
  - material accounting and control procedures, including containment and surveillance;
  - operational information required for the evaluation and review of loss mechanisms, shipper/receiver differences, material unaccounted for (MUF) and measurement uncertainties associated with MUF, as appropriate.

- A ledger which is a record of final entry summarizing all inventory changes to determine the book inventory.

- Inventory change reports containing chronological records of the various types of inventory changes that occur at a facility.

- Supporting documents which contain data for each inventory change recorded at the time of the change and at its source.

- Detailed itemised list of all nuclear material present at the facility, identified by material category for each Material Balance Area and Key Measurement Point, etc.

- The Inventory Change Report sent to the Agency gives details of all receipts and shipments of nuclear material for each nuclear material category. The report is dispatched to the Agency within 30 days after the end of the month in which the change takes place.

- Physical Inventory Taking performed by the facility operator, with a detailed Physical Inventory List for each nuclear material category of the nuclear material present in the facility's inventory. The report is dispatched to the Agency within 30 days after an inventory is taken.

- Voluntary Reporting Scheme as established by the IAEA in 1993 as a means of Strengthening Safeguards. It operates on the basis of regular reports from the participating country on its nuclear-related imports and exports containing three categories of information, namely, source material, specified equipment and non-nuclear material. An important feature is the participation of most major nuclear supplier states. Thus, information


can provide substantial transparency in international nuclear trade and can indicate an interest in particular nuclear activities, especially when considered with other information available to the IAEA.

b) Information collected during inspections. Inspection data: collected during inspections at facilities as part of the

IAEA independent verification of inventory and inventory changes arising from performance of measurements, application of containment and surveillance, examination of records and reports, etc.

Environmental sampling: now being applied as part of Strengthening of Safeguards measures. The capability of this technique lies in its sensitivity to detect and characterize radionuclides released into the environment, though only in very small quantities, which can be unambiguously correlated with specific nuclear processes. This is emerging as an effective tool for the detection of undeclared activities.

c) Illicit trafficking in nuclear material and radioactive sources. The Agency has established a database with a primary function to provide a reliable and accurate source of information in a timely manner on all trafficking incidents.

d) Other IAEA databases. The IAEA maintains non-Safeguards databases covering a wide range

of information types. The examples of the databases and the information provided therein are summarised below. The IAEA maintains the following nuclear activity databases:

- Nuclear fuel cycle and materials, containing
  - generic information on the nuclear fuel cycle material balance, the assessment of the amount of nuclear materials needed and produced during the nuclear fuel cycle;
  - data on uranium geology, resources and production;
  - directory of civilian nuclear fuel cycle facilities;
  - comprehensive review of significant uranium deposits world-wide.

- Planning and economic studies data related to the energy and economic statistics of countries including projections for electricity, nuclear power generation and energy consumption.

- Nuclear power engineering
  - contains general and basic design information on power reactors in operation, under construction, planned or shut down, and operating experience data on nuclear power plants in the world.

- International nuclear information system (INIS)
  - bibliographic database compiled from data submitted by 96 Member States and 17 co-operating international organizations.

- Nuclear data
  - nuclear reaction data and nuclear physics numerical data files in different formats for various special-purpose applications;


  - logs of nuclear data requests received from scientists in Member States and logs of materials sent to them;
  - addresses of nuclear scientists, their interest profiles and membership in nuclear data committees.

- Technical Co-operation Data base
  - description and project status regarding all development and research activities;
  - equipment;
  - training courses, workshops and seminars;
  - staff involved in international and national activities.

e) External open sources. A number of open information sources have been identified and are contributing to the Agency's knowledge base on nuclear activities of States. The following are some examples of the open source databases currently available to the Agency:

- Monterey Institute Data base

  - surveys the nuclear assets in each of the Soviet Successor States;
  - compilation of open source reports documenting illicit trafficking and the exodus of specialists from the former Soviet Union;
  - contains abstracts which address the ecological problems associated with nuclear power and weapons; tracks international trade and developments in nuclear technology;
  - surveys trade and development involving ballistic missiles, cruise missiles, missile defence systems and components.

- Kurchatov Institute data base

  - contains information from magazines, newspapers, political and technical reports, and scientific journals from other institutes in Russia and the Newly Independent States;
  - Minatom Information Division;
  - Moscow Carnegie Centre.

- Foreign Broadcast Information Service
  - contains foreign sources of information and is updated daily; the information is obtained from television reports, radio broadcasts, newspaper articles, science and technology reports and analyses of the political situations of countries in the world.

2.2 Current Safeguards Evaluation Practice

As part of the independent verification, the IAEA has to process and evaluate large amounts of data which are available from the different sources described above. The evaluation is divided into two stages. The first stage covers the evaluation of the results of verification activities (inspection activities) against the Agency's inspection goals (Safeguards Criteria). The second stage is the country evaluation, which reflects the evaluated consistency of all the different types of information (including information from inspection activities).

2.3 Enhanced Evaluation Prospects

a) The Physical Model. Nuclear material suitable for the manufacture of weapons does not exist in nature. It must be manufactured from source material through a series of discrete and definable steps (i.e., mining and milling, conversion, enrichment, fuel fabrication, irradiation, reprocessing). Each step can be accomplished through any one of several processes, where the choice of process for a given step depends, to some extent, upon the processes chosen for both the preceding and succeeding steps.

A development, termed the Physical Model, is an attempt to identify, describe and characterise every known process for carrying out each step necessary for the production of weapons-usable material. Thus, any possible route from source material to special fissionable material is describable as some combination of processes identified and characterised in the Physical Model. The model is the combined work of departmental staff and a small group of experts from Member States. It will always be a work-in-progress subject to periodic review and update, but a form of closure was achieved recently with a Consultants' Meeting where each component was subjected to a detailed review by additional experts from ten Member States.

The structure of the Physical Model is a nested arrangement (i.e., overview, select, detail). Each process for carrying out a given step is described and then characterised in terms of indicators of the existence of that process. The indicators of the existence of a process may be certain equipment, nuclear and non-nuclear materials, environmental signatures, requirements for specific technical skills, etc.

The objectives of the Physical Model development are three-fold:

1. to provide a general and easily accessible reference for descriptions of nuclear processes and associated indicators (a training course based on the Physical Model has been tested and is now part of the Department's regular training programme);

2. to provide a Physical Model for each State that is a subset of processes and indicators from the overall Physical Model based on the State's declaration which provides part of the technical overview;

3. to provide a simple (one-dimensional) mapping function from general compilation of information on State's nuclear activities to the top level of the Physical Model.

It was originally thought that the initial information compiled, apart from that provided by the State (declaration) and that generated through inspection activities, could be automated by the use of a text processing system. However, it is now clear that more experience is needed. The initial information, which could include indicators beyond those identified in the Physical Model (e.g., administrative actions such as licensing), could involve much effort; however, it will be a one-time job. Information obtained in the future will simply be dealt with as it arrives.

The Physical Model is a useful technical tool which can assist the evaluator with the recognition problem. However, it has a fundamental limitation in that it does not provide, in any systematic way, a logical connection among indicators. For example, two dual-use equipment indicators may be, taken individually, only mildly suggestive of the existence of some nuclear activity, while taken together they may provide a very strong indication. These connections are an important part of the recognition problem which must be supplied by the evaluator.

Note 2.1. The indicators associated with each process are placed in a quasi-logical structure along the following lines:

- if process A implies and is implied by indicator X, then X is a strong indicator;

- if process A implies indicator Y (and indicator Y may imply process A), then Y is a medium indicator; and

- if process A may imply indicator Z (and indicator Z may imply process A), then Z is a weak indicator.

It is difficult to be consistent with this and, at the same time, not take the step to logically link two or more indicators with the existence of a particular process. However, this categorisation may be helpful.
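As a rough illustration of how such a quasi-logical structure might be encoded in software (the class names and the example entry below are hypothetical and are not taken from the actual Physical Model), one could attach one of the three strength categories of Note 2.1 to each indicator of a process:

```python
from dataclasses import dataclass, field
from enum import Enum

class Strength(Enum):        # the three categories of Note 2.1
    STRONG = "strong"        # the process implies and is implied by the indicator
    MEDIUM = "medium"        # the process implies the indicator
    WEAK = "weak"            # the process may imply the indicator

@dataclass
class Indicator:
    name: str
    strength: Strength

@dataclass
class Process:               # an element of the Physical Model
    name: str
    technology: str
    indicators: list = field(default_factory=list)

# Hypothetical entry, not an excerpt from the actual Physical Model:
example = Process(
    name="gas centrifuge separation",
    technology="enrichment",
    indicators=[Indicator("UF6 feed handling", Strength.MEDIUM),
                Indicator("maraging-steel rotor imports", Strength.WEAK)],
)
print(example.name, [(i.name, i.strength.value) for i in example.indicators])
```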

b) A Country Evaluation. If a country evaluation is taken to be the testing of the hypothesis that "there are no undeclared nuclear activities", then it is a detailed technical evaluation of, firstly, the internal consistency of the State's declaration and, secondly, a point-by-point comparison between indications of activities from all information available to the Agency and what the State indicates it is doing, or plans to do. The process of information evaluation and the inspection process are inextricably linked, as many of the sub-hypotheses (or questions) regarding the absence of nuclear activities (including facility misuse) are, or only can be, tested through direct observation. Some hypotheses to be tested through direct observation are by design, others arise through the need to resolve inconsistencies between information collected by the Agency and a State's declaration. Information is relevant to this technical evaluation only to the extent that it indicates, directly or indirectly, the existence of a nuclear activity or the presence of nuclear material.

A country evaluation is not some kind of general narrative, drawn from open sources or elsewhere, that attempts to describe a State's nuclear programme. That is the State's obligation and it must have the legal standing of a declaration. The Agency's job is to effectively audit that declaration.

2.4 Problems to be Addressed

No universal or perfect methods and algorithms of information processing and decision-making exist. This is because the information on a State's nuclear activities has a number of characteristics which make it difficult, and sometimes impossible, to use standard mathematical methods and/or technologies. These characteristics include:

- fragmentation: a piece of information usually relates to a fragment of the problem. Different fragments of the problem may be covered by the information completely, partially or not at all;

- multi-levels: the information can either cover the entire problem, parts thereof, or a particular element of the problem;

- degrees of reliability: the information can contain particularly reliable data, indirect data, results of conclusions on the basis of the reliable information, or indirect conclusions;

- discrepancy: the information from various sources can be consistent, be somewhat different or just contradictory;

- variation with time: the problem develops in time; therefore, the information at different times about the same element may differ;

- bias: the information reflects the interests of its originator, therefore it can have a tendentious character. In a specific case it may be deliberate misinformation for political, economic or social reasons;

- diversity of information sources: information from articles, newspapers, electronic media, audio and video material, etc.
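These characteristics suggest that every piece of information must carry its own metadata before any aggregation is attempted. The sketch below is our own illustration of such a record (the field names are not DISNA's actual schema); it simply captures the dimensions listed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class InformationItem:
    """One piece of information about a State's nuclear activities (illustrative schema)."""
    source: str                  # e.g. "open source", "inspection report", "State declaration"
    acquired: date               # when the item was obtained (variation with time)
    refers_to: Optional[date]    # when the reported activity allegedly took place, if known
    fragment: str                # which fragment/level of the problem it covers
    reliability: float           # degree in [0, 1]: direct data, indirect data, inference
    possible_bias: bool          # the originator may have an interest in the content
    content: str

item = InformationItem(source="open source", acquired=date(1998, 3, 2), refers_to=None,
                       fragment="enrichment / feed material", reliability=0.4,
                       possible_bias=True,
                       content="reported import of dual-use vacuum equipment")
print(item.source, item.reliability)
```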

3. The Idea of DISNA

The idea of the information monitoring system, DISNA, is presented in Fig. 3.1. As can be seen, the data corresponding to the areas "State's Declarations" (left-hand portion of the diagram in Fig. 3.1) and "IAEA Sources of Information" (right-hand portion of the diagram) differ significantly. That is why comparison of the data elements from these portions comprises a significant part of the work of the expert-analyst. A direct comparison and conjoint processing of this information in the framework of classical information technologies would either be impossible or would require escalating costs for the development and support of different types of dynamic knowledge bases and databases [9].

The model of the State's nuclear activities would be used for conjoint processing of this information. The model is being developed on the basis


Fig. 3.1. The idea of DISNA. [Diagram: the "State's Declarations" area (design information, accountancy reports, protocol notifications, E/I notifications, voluntary reporting scheme) and the "IAEA Sources of Information" area (inspector observations, open sources, other IAEA databases, third party information, others) feed, together with the Physical Model (structure, indicators) and the State's Nuclear Activities Model (structure, content), the Information Monitoring Technology (granularity computing, hierarchical systems theory, discrete dynamic system theory); comparison of the Examined Status with the Fuzzy Augmented Status of the State's Nuclear Activities Model yields the Assessed Status and an evaluation of inconsistencies, leading to confirmation, a request for clarification, or a request for complementary access.]


of the physical model of the fuel cycle and R&D activities and of information monitoring technology ideas. The state of the model based on the analysis and evaluation of information from the "State's Declarations" area forms the Examined Status of the State's Nuclear Activities Model. The state of the model based on the analysis and evaluation of information from "State's Declarations" complemented by information from the "IAEA Sources of Information" area forms the Augmented Status of the State's Nuclear Activities Model. These two statuses can be compared. Based on this comparison we form the Assessed Status of the State's Nuclear Activities Model. Using the Assessed Status, inconsistencies between information from the State's declarations and information from IAEA sources are evaluated. Based on this evaluation, the State's declaration can be confirmed, a request for clarification of the State's declaration can be formulated, or a request for complementary access can be prepared.

The development of a model of this kind, taking into account the basic points (Sect. 3.1), the purpose and tasks of DISNA (Sect. 3.2), the functional requirements of the system (Sect. 3.3), the properties of IAEA information sources, and issues relating to the possible computer implementation of the model, is dealt with in the report.

3.1 Basis

The content of the States' declaration provided on the basis of existing safeguards agreements forms the basis for analysis and evaluation of information. The legal basis for this declaration is the safeguards agreement and the Additional Protocol thereto.

The State makes the declaration and the IAEA's task is:

- to verify the declaration on the basis of the sources of information available to the Agency;

- to evaluate the information available to the Agency from other sources for consistency with the States' declaration;

- in case of inconsistency between the States' declaration and other information in the nuclear area, to formulate questions for clarification which will remove the uncertainty to the extent possible (confirm or remove the inconsistency).

The basis for addressing the task is the physical model of the nuclear fuel cycle and the system of indicators constructed on its basis.

The system should be user-analyst oriented. That means that an IAEA analyst-expert should be able to work with the system after 2-3 days of training.


3.2 Purpose of DISNA

The goal of the system is to monitor the nuclear activities of a particular country and to evaluate the completeness of its declaration by analyzing all information available to the Agency from other sources.

To attain this purpose, the system should address the following tasks:

• detection and recording of discrepancies between the States' declaration and the information available to the Agency;

• evaluation of the importance of these discrepancies against the possibility of undeclared production by the State of metallic plutonium/highly enriched uranium, taking into account the uncertainty in the quality and completeness of the available information;

• detection of the "critical point" in the possible chain of production of undeclared metallic plutonium/highly enriched uranium (HEU), information about the status of which will clear the uncertainty (confirm or resolve the inconsistencies) concerning the possibility of undeclared production by the country of Pu/HEU, so that the Agency can optimally plan its future actions.

3.3 Functional Requirements for DISNA

Definition of the Basic Concepts

Technology: a stage of processing nuclear material, for example, enrichment, fuel fabrication, reactors, etc.

Process: an element of the physical model; it is a separate part of the mining, production, processing or use of nuclear material. For example, the atomic vapor, laser and plasma separation processes, etc., are different types of enrichment technology. Carrying out a process requires feed material, equipment, technologies, specialists and other resources. As a result of conducting a process, by-products and side effects occur.

Group of similar processes: a set of processes within a technology which have a common raw material and which, in their turn, can be divided into independent sub-processes. For example, in the context of enrichment technology there are three groups of similar processes determined by the nature of the raw material: UF6, UCl4, Umet.

Indicator: a specific resource needed for conducting a process, or a specific by-product/side effect. An indicator enables the analyst, with a given level of confidence, to assume the presence of a particular process (or group of processes, or technology). The complete set of indicators is divided into groups (classes).

Sign of a Process: information about the presence of an indicator in the piece of information currently reviewed by an analyst-expert.

Process chain: a sequence of interlinked processes and technologies which, if established, could result in the production of Pu or HEU.


Model of a State's nuclear activities: reflects the status of the area of a State's nuclear activities and develops in time. The basis for designing the structure of a given model is the physical model of the nuclear fuel cycle. The "content" of a given model is the evaluation of the status of its elements at the present instant.

Assumptions

The model of a State's nuclear activities should reflect the situation in the given area and adequately mirror the State's capabilities for processing nuclear materials. For this purpose it is necessary to be able to receive information about the whole model or its separate elements. The Agency receives such information directly from the country on the basis of the existing Safeguards agreements. After evaluation of this information the Agency's analyst can assess, with some uncertainty, the State's capabilities for processing nuclear materials. It should be mentioned that these capabilities change permanently in time, for example because of the construction of new facilities; therefore the status of the model changes dynamically in time. To evaluate the State's real capabilities for processing nuclear materials, the IAEA should be able to check the correctness and completeness of the declaration provided by the country by comparing it with information from other sources. If the information provided by the country and the information from other sources differ, the model of the State's nuclear activities (and its capabilities for processing nuclear materials) will be different from the initial one.

The model of a State's nuclear activities whose "content" was filled in accordance with the information supplied by the country is called the "declared status". The evaluations of its elements (processes) can take the values "yes" (when all the necessary resources for carrying out the process exist), "none" (when the country's declaration does not contain any relevant information) and "partly" (when the country has some of the resources needed for conducting the process, for example when it is doing preparatory work for conducting the given process in the future). On the basis of the "declared status", the Agency's analyst evaluates the technical possibility of conducting the processes, expressing it in fuzzy terms. The model of the State's nuclear activities reflecting the technical possibility of the country conducting the processes is called the "examined status". The status of its elements reflects the possibility of conducting the processes in the given country; it can take the values "none" (when the State has no resources for conducting a given process), "beginning" (when some resources exist but they are clearly insufficient), "advanced" (when some important resources exist but they are insufficient for conducting the process as a whole), "nearly ready" (when most of the important resources exist but some secondary or comparatively easily obtainable resources, for example specialists, are not available) and "ready" (when all the necessary resources for conducting the process are present).
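A minimal sketch of these two status scales is given below; the default mapping at the end is purely illustrative, since in DISNA the evaluation of the examined status is performed by the analyst in fuzzy terms.

```python
from enum import Enum

class DeclaredStatus(Enum):     # evaluation of a process from the State's declaration
    NONE = 0                    # nothing relevant declared
    PARTLY = 1                  # some of the needed resources declared
    YES = 2                     # all necessary resources declared

class ExaminedStatus(Enum):     # technical possibility of conducting the process
    NONE = 0
    BEGINNING = 1
    ADVANCED = 2
    NEARLY_READY = 3
    READY = 4

# Purely illustrative default mapping, before any fuzzy refinement by the analyst:
DEFAULT_EXAMINED = {
    DeclaredStatus.NONE: ExaminedStatus.NONE,
    DeclaredStatus.PARTLY: ExaminedStatus.ADVANCED,
    DeclaredStatus.YES: ExaminedStatus.READY,
}
print(DEFAULT_EXAMINED[DeclaredStatus.PARTLY].name)
```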


The model of States' nuclear activities constructed on the basis of its declaration and information from sources accessible to the Agency is called the "augmented status". By comparing the "examined status" and the "augmented status", the system establishes consistency or inconsistency of the country's declaration with the available information and evaluates the significance of any such inconsistency for the production of Pu or HEU, taking into account the reliability of the information source. The status of the model of the country's nuclear activities received as a result of the comparison of the "examined status" and the "augmented status" will be called the "assessed status".

3.4 Information Analysis in DISNA

Review of IAEA Information Sources

The safeguards' goal is verification of the correctness and completeness of the States' declaration. The comparison of the States' declaration with the information from sources accessible to the Agency is an important component of such verification. The verification process includes the following steps:

a) gaining access to an information source;
b) retrieval of the information from a source;
c) selection of the items of information which may relate to States' nuclear activities;
d) evaluation of the selected items of information by an analyst;
e) overall analysis of the information as a whole;
f) storage of the evaluated information and results of analysis.

All the Agency's information sources can be divided into three groups:

I. The information provided to the Agency by the country (accounting reports; DIQs; Protocol information (Art. 2); E/I notifications; voluntary reporting scheme).

II. The information collected by the Agency independently (IAEA sources, environmental sampling, inspector observations, and third party information).

III. Information from open sources (databases of Monterey, Kurchatov Nuclear Research Center, Reuters, FBIS, NNN, Internet, hard copy sources (journals, newspapers), etc.).

The first group of sources serves to obtain information on the official activity of a State in the nuclear field and is compared with information from sources of the second and third groups. The Agency is carrying out extensive work on broadening the number of information sources and studying their characteristics: their number is constantly being increased and the methods of delivery and selection of information on topics of interest are being improved, particularly as regards open sources. It should be noted that the Agency is using, or is planning to use in the very near future, the most advanced software for selecting information available on the commercial market, such as TOPIC, PATHFINDER and Search 97. All this will serve to provide Agency experts with a large volume of pre-sorted information for analysis of the status of the subject area. Information from the sources of the first group (see Table 1) is supplied to Agency experts for evaluation and analysis and is put in store. The volume of such information is comparatively small, all pieces of information relate to the subject area and have to be taken into account. The sources of such information are considered reliable by definition.

Table 1. Information provided by a State.

Title                                          Volume        Reliability  Frequency
Accounting reports and records                 fairly large  high         periodic
Facility design information                    small         high         periodic
Inspection report                              fairly large  high         periodic
Export/Import notifications                    small         high         periodic
Voluntary Reporting Scheme                     small         high         periodic
Information related to Art. 2 of INFCIRC 540   small         high         various

Pieces of information from sources of the second group (see Table 2) contain data mainly of average and high reliability. The volume of such information compared with the first group is greater and consequently requires more effort for selection and processing. Not all information messages from this group relate to the subject area of interest to the Agency and need to be taken into account in the analysis of its status.

Table 2. Information collected by the Agency independently.

Title                      Volume  Reliability  Frequency
IAEA databases             large   average      daily
Environmental sampling     small   high         periodic
Third party information    small   low          periodic
Inspector's observations   small   average      periodic

Information from sources of the third group (see Table 3) has its own specific features:

- Huge volume of accessible information;
- Need for both primary and secondary selection of information;
- Low reliability of sources of information;
- Delays in obtaining information in the case of hard copy sources and the need to incorporate them electronically;
- The need to use special means for selecting information of interest.


Table 3. Information from open sources.

Title                Volume  Reliability  Frequency
Internet             huge    low          instant
External databases   large   average      daily
Hard copy sources    small   average      periodic


In the field of nuclear activity of a country, changes of varying importance are constantly occurring which are reflected in a huge volume of messages from different information sources. For prompt handling of such changes, Agency analysts have to work with a large volume of varied information with different degrees of reliability. Moreover, the model of States' nuclear activity in itself is quite complex so that without special software it is impossible to deal with changes in the status of the subject area promptly and completely.

Thus, it may be concluded that at the present time the Agency has access to a large number of information sources (see Table 4), the characteristics of which (reliability, time factors, etc.) are constantly studied and taken into account in the selection of information. Advanced means of obtaining and conducting primary selection of information (e.g. TOPIC) are available and are being used.

Table 4. Information sources available to the Agency.

Information group   Volume        Reliability  Primary selection  Secondary selection
Group 1             small         high         no                 no
Group 2             fairly large  average      no                 sometimes
Group 3             large         low          yes                yes

The general characteristics of information reaching the Agency (groups 1-3) can be formulated as follows:

• Information reaching the Agency has different degrees of reliability;
• The volume of incoming information is very high;
• The information from sources accessible to the Agency may not cover the whole subject area;
• Different frequency of arrival of information from different sources;
• Information may arrive in different forms (hard copy, electronic form).

All the above characteristics of the inflow of information will be taken into account in the development of the structure and logic of the information monitoring system.

Processing of Information in DISNA

The technology of processing information in the system consists of two stages:


(I) information input and its evaluation with corresponding change in the status of the models, and
(II) analysis of changes in the subject area.

I. Information input and evaluation includes selecting documents containing indicators (signs) from the incoming flow of information, linking the signs to the corresponding processes, and evaluating the properties of the information received (reliability of the source, completeness, etc.). Then the status of the process is evaluated, taking into account the strength of the obtained sign and the current evaluation of the process. The analyst determines the significance of the sign in the light of the context of the situation. For example, the sign has a great influence on the status of the process if:

- the indicator is defined as a strong one in the physical model;
- it has been obtained from a reliable source;
- it indicates that the State has an important resource for conducting the process;
- the State already has a certain set of resources;
- the system already contains other information confirming the given indicator, etc.

The same indicator but:

- obtained from an unreliable source;
- not confirmed from other sources;

will have a smaller influence on the evaluation of the possibility of conducting the process and, consequently, a smaller significance.

II. Analysis of changes in the subject area has the goal of evaluating the consistency between the information supplied by the country to the Agency and the information obtained by the Agency from other available sources. The degree of consistency of the States' declaration with the activities detected by the IAEA for a particular element (or for the States' nuclear activities as a whole) can be expressed by the fuzzy variables (fuzzy granules) "total agreement", "minor inconsistency", "medium inconsistency", "major inconsistency" and "total disagreement". The degree of consistency is determined by comparing the "examined status" and the "augmented status" as a whole or at the level of a particular process. The importance of a detected inconsistency is determined from the standpoint of an increase in the possibility of Pu/HEU production by the country. In the event of detection of an inconsistency, the mechanism for locating critical processes, from the standpoint of the possibility of producing Pu and HEU and of possible ways of nuclear weapons production, goes into action. If the detected inconsistencies increase the possibility of Pu/HEU production by the country, their importance receives the appropriate evaluation. If the magnitude of the inconsistency increases, the Agency applies the appropriate measures.
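
As a rough illustration of this comparison step (not part of DISNA itself), the following Python sketch maps the status values and consistency grades listed above onto ordinal scales. The numeric ranks and the rule converting a rank difference into a consistency grade are assumptions made purely for illustration.

```python
# Illustrative sketch only: ordinal comparison of the "examined" and "augmented"
# status of a single process. The numeric ranks and the mapping from a rank
# difference to a consistency grade are assumptions, not DISNA's actual rules.

STATUS_RANK = {"none": 0, "beginning": 1, "advanced": 2, "nearly ready": 3, "ready": 4}

CONSISTENCY = [          # (maximum rank difference, consistency grade)
    (0, "total agreement"),
    (1, "minor inconsistency"),
    (2, "medium inconsistency"),
    (3, "major inconsistency"),
    (4, "total disagreement"),
]

def consistency_grade(examined: str, augmented: str) -> str:
    """Return a fuzzy consistency label for one process of the model."""
    diff = abs(STATUS_RANK[examined] - STATUS_RANK[augmented])
    for max_diff, grade in CONSISTENCY:
        if diff <= max_diff:
            return grade
    return "total disagreement"

# Example: declared information suggests "beginning", other sources suggest "nearly ready".
print(consistency_grade("beginning", "nearly ready"))   # -> "medium inconsistency"
```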


3.5 Model of States' Nuclear Activities

Structure of the Model of States' Nuclear Activities

During the design of the model structure, the following requirements have to be taken into account. The model should:

- Adequately reflect the connections between the elements of the subject area;
- Serve as a basis for performing calculations of the mutual effect of the elements and output of the required evaluations;
- Represent the status of the subject area with different levels of detail;
- Be transparent to the user;
- Be suitable for qualitative evaluation of the status of the subject area.

The physical model of the nuclear fuel cycle was taken as the basis for DISNA. This was done for the following reasons:

- The structure of the physical model of the nuclear fuel cycle is sufficiently well developed, i.e. its elements and the interconnections between them are clearly defined;
- The physical model of the nuclear fuel cycle reflects all presently known processes for handling nuclear material and the links between them, i.e. it can take into account all the possible technological chains of production of Pu and HEU;
- A sophisticated system of process indicators exists, which is one of the most important tools for detecting any undeclared nuclear activity of a country and is fully compatible with the physical model of the nuclear fuel cycle;
- A model of the States' nuclear activities, constructed on the basis of the physical model of the nuclear fuel cycle, can in future be easily supplemented by other components: political, social, economic, military, etc.

For a more complete reflection of the structure of the subject area in subsequent versions of the system, it is desirable to take into account political, economic, social, ecological, military and other aspects of the problem. However, at this stage of work, when it is important to define the possibilities of the system for evaluating the status of the subject area, it is quite admissible to use only the physical model of the nuclear fuel cycle for construction of the model of States' nuclear activities.

The structure of the model has several levels, ranging from the technologies alone to specific facilities.

First level - At this level the stages of processing of nuclear material, i.e. the technologies, are considered. The elements of the structure of the model at this level are linked. At this level they are generalized and little suited for performing a specific analysis. This level is intended to represent the general status of nuclear activity of a country: the level of development of technology, general directions of possible production of Pu and HEU, and an overall evaluation of possible inconsistencies between a country's declaration and the real status of nuclear activity.

Second level - The level of groups of similar processes. At this level it is possible to reflect the potential of a State to perform a number of linked processes in the framework of the same technology, for example, the uranium production in the framework of Conversion 1 technology. The elements of the second level and their links with elements of the level below serve for taking into account the influence of indicators of this level on processes of the third level.

Third level - Separate processes. At this level the links between the different technologies for processing nuclear material are clearly seen. This level enables a detailed analysis to be made of the status of the subject area and is the main level of the model.

Fourth level - This is a detailed version of the third level and reflects the existence of specific capacities for processing nuclear materials: reprocessing plants, reactors, spent fuel stores, etc. The elements of the fourth level are bound to a specific point. It should be noted that elements of this level do not exist if no real facility for reprocessing nuclear fuel appears in the States' declaration or is detected by the Agency.

Thus, the four levels of the model of nuclear activity of a country may be delineated as follows:

- Technology level.
- Groups of similar processes.
- Process level.
- Facility level (spatial level).

For analyzing the status of the subject area, the third level will be used mainly. The first level will generalize the results of analysis, and the fourth will serve for detailing the status of the subject area. The links between the elements of the model are quite complex. Whereas the vertical links of elements from the first to the fourth level may be described by a tree, the links between the elements of the same level (first and third) are described in the form of a graph. Under different operating conditions of the system, both these and other forms will be used.

As described in the previous paragraph, the model of the States' nuclear activity consists of four levels. Each succeeding level, depending on the order taken, is a detailed version or a generalization of the previous level. For example, level four of the model is a detailed version of level three, and level three is a generalization of level four. Any element of any level is described by a fuzzy linguistic variable, which at any particular instant may assume one of the values "ready", "nearly ready", "advanced", "beginning" and "none". It should be noted that an expert-analyst puts his own shade of meaning into such a value during the evaluation of an element of the corresponding level; this is explained below for the elements of each level.


3.6 Structure of DISNA

The structure of DISNA must ensure the attainment of the goal of DISNA and fulfill all the assigned tasks (see Sect. 3.2). The structure of DISNA must be designed to meet the following requirements:

- Provide a tool for continuous monitoring of the status of the subject area.
- Provide the IAEA expert with a tool to input into the system documents concerning the States' nuclear activities in textual format, or references to documents in the form of hard copies, video topics, audio reports, etc.

- Produce an evaluation of the influence of an obtained sign on the status of elements of the model and change (confirm) their status accordingly.

- Provide a tool for examining the status of the subject area with several levels of detail.

- Detect inconsistencies between the declared States' capabilities for processing nuclear material and those capabilities as established by the Agency through analysis of information from other available sources.

- Assess the importance of any detected inconsistencies from the point of view of a change in the States' capabilities to produce HEU and Pu.

- Detect "critical points" important from the point of view of production of HEU and Pu, information about which is crucial for resolving an inconsistency between a country's declaration and its capabilities for processing nuclear material established by the Agency.

- Provide storage in its database of all the documents evaluated by the expert, or references to them, with linkage to specific elements of the model of the nuclear activity of a country.

- Provide the IAEA expert with a tool for retrospective analysis of changes in the evaluations of each element of the model, with the possibility of scanning the corresponding document or obtaining references to it.

- Record changes occurring in the system and provide the user with the tool for analyzing them.

Thus, the system must become an important tool of the expert-analyst whose task is to analyze the nuclear activity of a specific country. The system must consist of the following sub-systems.

1. Sub-system for evaluation and input of information. With the aid of this sub-system, the expert-analyst performs an evaluation of the influence of a given document on the status of a particular element or elements of the problem. After familiarization with the content of the incoming document containing an indicator, the expert-analyst determines to which element(s) of the model of the problem this sign may relate. After this the expert-analyst assesses the weight of the obtained sign, the scale of the event, the reliability of the source of information, etc. Then he/she examines the current values of this element of the model of the problem and, if needed, scans the documents or references to them on the basis of which the current value of the sign was obtained. After analysis, the expert-analyst changes the value of this element of the model of the nuclear activity of the country or confirms it. The changed evaluation is recorded in the database of the system and becomes the current one for the given element of the model of the nuclear activity of the country. The document or the reference to it is stored in the database, linked to a specific evaluation of a specific element of the model of the nuclear activity of the country, and may be called up for scanning at any moment at the request of the expert-analyst. This sequence of actions for inputting and evaluating the sign is used when inputting information from sources accessible to the Agency. In the case of inputting information from documents supplied by a country, this information is considered reliable by definition. Thus, when inputting information supplied by a country, the expert-analyst changes the content of the "examined status" and the "augmented status", and, when inputting information from sources available to the Agency, changes the content of the "augmented status" only (see the Assumptions above). These models may differ only in the evaluations of the values of the elements. The structure of the model of the States' nuclear activities in both cases remains the same. This circumstance will be used henceforth for detecting inconsistencies between the declared States' nuclear activities and the activity established by the Agency.

2. The sub-system for evaluating a States' capability to process nuclear material is closely linked to the structure and logic of the model of a States' nuclear capabilities. The aim of this sub-system is to depict a States' capacity to process nuclear material, using as a basis the evaluations of the elements of the model of the States' nuclear activities inputted by the expert-analyst. As stated above, the system stores the results of two evaluations of each element in the model of a States' nuclear capabilities. The first kind of evaluation is derived from the expert-analyst's assessment of the data submitted to the Agency by the country. The second evaluation is based on an assessment of both data from sources available to the Agency and the data submitted to the IAEA by the country. The first group of evaluations was called the "examined status" and the second group the "augmented status". Analysis of the evaluations in the first group yields an evaluation of the declared capacity of a State to process nuclear material. Analysis of the second group yields the States' capacity based both on an assessment of data from sources available to the Agency and on the evaluation of the declared capacity of the State to process nuclear material. A States' capacity to process nuclear material may be presented in the system with varying levels of detail:

- At the level of technologies;
- At the level of processes;
- At the level of facilities.


This is achieved by utilizing the characteristics of the model of a States' nuclear activities (see 3.2).

3. The sub-system for detecting inconsistencies is designed to detect conflicts between the declared capacity of a State to process nuclear material and the capacity identified by the Agency through an analysis of data from available sources. This sub-system is also closely linked to the model of a States' nuclear capabilities. By comparing the evaluations of "examined status" and "augmented status" elements, the system detects inconsistencies at any level of the model. The importance of any inconsistency detected, as regards a possible increase in the real capacity of a country to produce Pu and HEU, is evaluated as follows:

- Taking into account the evaluations of third-level "examined status" and "augmented status" elements, possible HEU and Pu production process chains are searched for;
- The possibility of establishing each of these chains is evaluated. The possibility of establishing a chain is expressed as a fuzzy variable, similar to the evaluation of the possibility of implementing a process.

If the possibility of establishing any of these chains, as determined using the data received by the Agency from available sources, is greater than the possibility of establishing a similar chain as determined using the data supplied by the country, then the inconsistency is considered to be important.

4. The sub-system to detect "critical points" is used to locate the most important points in a States' nuclear programme as regards any change in its capacity to produce HEU and Pu. The identification of such points is essential for the timely concentration of the Agency's efforts on the most important points of a States' nuclear program. The search for "critical points" is performed in the following manner:

- Process chains are searched for where the possibility of establishing the chain exceeds the threshold possibility (this is set by the expert-analyst). The possibility of establishing a process chain is set in a similar manner to the evaluation of the possibility of implementing a process, and is assumed to be equal to the possibility of implementing that process in the given chain whose possibility is the lowest among all the other processes in the chain (this criterion may be changed; a minimal illustration of this criterion is sketched below);

- In chains whose possibility exceeds the threshold possibility, a search is performed for "critical points". By "critical point" we mean the most important process in the given context from the point of view of establishing the whole process chain up to the point of obtaining Pu or HEU. By most important process we mean that process for which, if the possibility of its implementation were increased, this would increase the possibility of the establishment of the whole chain.

The results of this subsystem may also be presented with various levels of detail: at the level of technologies, at the level of processes.
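
The min-based chain criterion and the notion of a critical point described above can be sketched in a few lines of Python. The ordinal encoding of the fuzzy possibility values, the example chain and the threshold are illustrative assumptions only, not part of DISNA.

```python
# Illustrative sketch of the chain evaluation described above. The ordinal
# encoding of the fuzzy possibility values and the example chain are
# assumptions made purely for illustration.

POSSIBILITY = {"none": 0, "beginning": 1, "advanced": 2, "nearly ready": 3, "ready": 4}

def chain_possibility(chain: dict) -> int:
    """Possibility of establishing a chain = lowest possibility among its processes."""
    return min(POSSIBILITY[v] for v in chain.values())

def critical_points(chain: dict) -> list:
    """Processes whose possibility equals the chain minimum: raising such a process
    (when it is the unique minimum) would raise the whole-chain possibility."""
    low = chain_possibility(chain)
    return [name for name, value in chain.items() if POSSIBILITY[value] == low]

# Hypothetical third-level process chain leading to Pu production.
chain = {"mining/milling": "ready", "conversion": "advanced",
         "reactor operation": "nearly ready", "reprocessing": "beginning"}

threshold = POSSIBILITY["beginning"]          # set by the expert-analyst
if chain_possibility(chain) >= threshold:
    print(critical_points(chain))             # -> ['reprocessing']
```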


5. The system database is used to store both current and former evaluations of elements of the model of the States' nuclear activities, the documents used to make these evaluations (or references to them), and the names of the indicators contained in these documents. The database is part of DISNA and it therefore has direct links with all its other parts. This means that data from the database can be used in all system operating regimes.

6. The change analysis sub-system gives the expert analyst the opportunity to review changes in the model of States' nuclear activities:

• Over a specific period of time.
• For a specific process.
• Implemented by a specific expert-analyst (if several people use the system).
• The history of changes in the evaluation of a specific process, etc.

This sub-system can be used by the expert-analyst to generate various types of reports and overviews, and also for detailed study of any aspect of a States' nuclear program.

3.7 Basic Principles of Information Monitoring Technology (IMT)

Information monitoring systems belong to a class of hierarchical fuzzy discrete dynamic systems. The theoretical basis for this class of systems comprises automata theory, fuzzy set theory, discrete mathematics and methods of the analysis of hierarchies, developed independently in the works of Kudrjavcev [6], Messarovich [7], Saaty [13], Zadeh [14-17], and others.

For the development of computer systems aimed at the monitoring problem and the processing of such information, known technologies of information processing (database technology, knowledge bases and expert systems, hypertext technology of information processing, etc.) were examined. None of these satisfies the functional requirements of the system completely. Therefore, there is a need to develop specially designed techniques in order to fulfil the Agency's needs.

Technological Basis

Information monitoring systems make it possible to:

• process uniformly diverse, multilevel, fragmentary, unreliable information varying in time;

• receive evaluations of status of the whole problem and/or its particular aspects;

• simulate various situations in the subject area;


• reveal "critical ways" of development of the problem, i.e. those elements of the problem a small change in whose status may qualitatively change the status of the problem as a whole.

It is possible to define information monitoring technology as technology for information support to a particular user which is, in our opinion, a natural development of information support technologies.

Taking into account the given features of the information and specific methods of its processing, it is possible to declare the main features of the information monitoring technology as follows:

• The system is capable of taking into account data conveyed by different information vehicles (journals, video clips, newspapers, documents in electronic form, etc.). This is provided by storing in the database of the system references to an evaluated piece of information, if it is not a document in electronic form. If the information is a document in electronic form, then both the evaluated information (or part thereof) and a reference thereto are stored in the system. Thus, the system makes it possible to take into account and use in an analysis all pieces of information which have a relationship to the subject area of concern, irrespective of the information source.

• The system should be capable of processing information with different degrees of reliability, some of which would be possibly tendentious. This would be achieved by reflecting the influence of a particular piece of information on the status of the elements in the model regarding the problem concerned with the aid of fuzzy linguistic variables.

• Information with various degrees of reliability, possibly biased, can be processed in the system. For this purpose, the description of the influence of the information received on the status of the model of a problem is achieved with the use of fuzzy linguistic variables. It is necessary to take into account that the evaluation of an element of the model may both vary under the influence of the information received and remain unchanged (i.e. be confirmed).

• Time dependency is one of the parameters of the system. This makes it possible to have a complete picture of the variation of the status of the model at a particular time.

Thus, the systems constructed on the basis of this technology allow the model of the problem to develop in time. It is supported by references to all information material chosen by the analysts, with general and separate evaluations of the status of the problem (and/or its aspects) described on the basis of fuzzy logic theory. The use of time as one of the parameters of the system makes it possible to conduct retrospective analysis and to build forecasts of the development of the problem. There is also the opportunity to identify "critical points", i.e. those elements of the model a small change in which can cause significant changes in the status of the whole problem. The knowledge of such elements has large practical significance and makes it possible to reveal "critical points" of the problem and to work out measures for blocking undesirable situations or achieving desirable ones, i.e. for the timely guidance of the development of the problem in the desirable direction.

Theoretical Basis

First and foremost, the model must adequately reflect the real situation. The conclusions of the comparison between the Examined Status and the Augmented Status are checked against realities in order to make a decision regarding the status of the State's nuclear activities. Adequacy in this case means that one must exclude the possibility that a similar status of the model might correspond to different situations in the area of concern, or that one situation might correspond to several different statuses.

The mathematical foundation for the solution of this problem is fuzzy set theory [15] and, based on it, the theory of linguistic variables [14] and the theory of information granulation [16,17]. This is because the evaluation of the pieces of information in DISNA is performed by human beings.

It is assumed that the expert describes the degree of inconsistency of the obtained information and the readiness or potential for readiness of certain processes in a country in the form of linguistic values. The subjective degree of convenience of such a description depends on the selection and the composition of such linguistic values.

Let us explain this on a model example.

Example 1. Let it be required to evaluate the quantity of plutonium. Let us consider two extreme situations.

Situation 1. It is permitted to use only two values: "small" and "considerable quantity".

Situation 2. It is permitted to use many values: "very small", "not very considerable quantity", ..., "not small and not considerable quantity", ..., "considerable quantity", etc.

Situation 1 is inconvenient. In this situation we have no possibility of describing a certain quantity of plutonium definitely. That means that for many situations both of the permitted values may be unsuitable, and in describing the quantity by them we select between two "bad" values. In other words, we may only say that we have a small or a large amount of plutonium and cannot say that we have, for instance, a middle amount of plutonium.

Situation 2 is also inconvenient. In fact, in describing a specific quantity of nuclear material, several of the permitted values may be suitable. We again experience a problem, but now due to the fact that we are forced to select between two or more "good" values. Could a set of linguistic values be optimal in this case?

It is assumed that the system tracks the development of the problem, i.e. its variation with time. It is also assumed that it integrates the evaluations of different experts. This means that one object may be described by different experts. Therefore it is desirable to have assurances that the different experts describe one and the same object in the most "uniform" way.

On the basis of the above we may formulate the first problem as follows:

Problem 3.1. Is it possible, taking into account certain features of man's perception of objects of the real world and their description, to formulate a rule for the selection of the optimum set of values of characteristics on the basis of which these objects may be described? Two optimality criteria are possible:

Criterion 1. We regard as optimum those sets of values through whose use man experiences the minimum uncertainty in describing objects.

Criterion 2. If the object is described by a certain number of experts, then we regard as optimum those sets of values which provide the minimum degree of divergence of the descriptions.

This problem may be reformulated as the problem of constructing an information granulation procedure that is optimal from the point of view of criteria 1 and 2.

Information monitoring technology assumes the storage of information material (or references to it) and their linguistic evaluations in the system database. In this connection the following problem arises.

Problem 3.2. Is it possible to define indices of the quality of information retrieval in fuzzy (linguistic) data bases and to formulate a rule for the selection of such a set of linguistic values, the use of which would provide the maximum indices of quality of information retrieval?

This problem may be reformulated as the problem of constructing an information granulation procedure that is optimal from the point of view of information retrieval in fuzzy (linguistic) data bases. A solution to this problem can be found in [11, 12].

It is shown that one can formulate a method for selecting the optimum set of values of qualitative indications (a collection of granules [17]). Moreover, it is shown that such a method is stable, i.e. the natural small errors that may occur in constructing the membership functions do not have a significant influence on the selection of the optimum set of values. The sets which are optimal according to criteria 1 and 2 coincide. It is also shown that it is possible to introduce indices of the quality of information retrieval in fuzzy (linguistic) data bases and to formalize them, and that one can formulate a method for selecting the optimum set of values of qualitative indications (a collection of granules [17]) which provides the maximum quality indices of information retrieval; this method, too, is stable with respect to small errors in the construction of the membership functions.

Following this method, one may describe objects with minimum possible uncertainty, i.e. guarantee optimum operation of the information monitoring system from this point of view.


References

1. Belenki A. and Ryjov A. (1997) Fuzzy logic in monitoring the non-proliferation of nuclear technologies, raw materials and weapons. Journal of Fuzzy Logic and Intelligent Systems, Vol. 7, No. 3: 27-33.

2. Hooper R. (1995) Strengthening IAEA Safeguards in an Era of Nuclear Cooperation. Arms Control Today, November.

3. IAEA. (1968) The Agency's Safeguards System (1965, as provisionally extended in 1966 and 1968), INFCIRC/66/Rev.2, Vienna.

4. IAEA. (1972) The Structure and Content of Agreements Between the Agency and States Required in Connection with the Treaty on the Non Proliferation of Nuclear Weapons, INFCIRC/153 (corr.), Vienna.

5. IAEA. (1997) Model Protocol Additional to the Agreement(s) Between State(s) and the International Atomic Energy Agency for the Application of Safeguards. INFCIRC/540, Vienna.

6. Kudrjavcev V.B., Alojshin S.V., Podkolzin A.S. (1985) Introduction to automata theory. Moscow, Nauka. (Russian)

7. Messarovich M.D., Macko D., Takahara Y. (1979) Theory of hierarchical multilevel systems. Academic Press, N.Y.-London.

8. Pellaud B. (1996) Safeguards and the Nuclear Industry. Core Issues 5, The Uranium Institute, London.

9. Popov E. (1987) Expert Systems: Solving Non-formal Tasks in Dialogue with Computer. Moscow, Nauka. (Russian).

10. A. Ryjov, A. Belenki, R. Hooper, V. Pouchkarev, A. Fattah and L.A. Zadeh. (1998) Development of an Intelligent System for Monitoring and Evaluation of Peaceful Nuclear Activities (DISNA), IAEA, STR-310, Vienna.

11. Ryjov A. (1988) The degree of fuzziness of a linguistic scale and its general properties. In: Averkin A.N. (Ed.) Fuzzy decision-making systems. Kalinin University Publishing House, Kalinin, 82-92 (Russian)

12. Ryjov A. (1987) The degree of uncertainty of fuzzy descriptions. In: Krushinsky L.V., Yablonsky S.V., Lupanov O.B. (Eds.) Mathematical cybernetics and its application to biology. Moscow University Publishing House, Moscow, 50-77 (Russian)

13. Saaty T.L. (1993) The Analytic Hierarchy Process. Moscow, Radio and Swjaz. (Russian).

14. Zadeh L.A. (1975) The concept of a linguistic variable and its application to approximate reasoning. Part 1,2,3. Inform Sci 7:199-249; 8:301-357; 9:43-80

15. Zadeh L.A. (1996) Fuzzy logic = computing with words. IEEE Trans on Fuzzy Systems 4:103-111

16. Zadeh L.A. (1997) The Key Roles of Fuzzy Information Granulation in Human Reasoning, Fuzzy Logic and Computing with Words. The ERL Research Summary.

17. Zadeh L.A. (1997) Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy sets and systems 90:111-127


Basic Issues of Computing with Granular Probabilities

George J. Klir 1

1 Center for Intelligent Systems and Dept. of Systems Science & Industrial Eng. Binghamton University - SUNY, Binghamton, New York 13902-6000, U.S.A.

1. Introduction

It has increasingly been recognized that practical applicability of classical probability theory to some problems of current interest, particularly those dependent on human judgement, is severely restricted. This restriction is a result of the additivity axiom of classical probability theory. Due to this axiom, elementary events are required to be pairwise disjoint and the probability of each is required to be expressed precisely by a real number in the unit interval [0, 1]. The requirement of disjoint events makes mathematically good sense, but it becomes problematic whenever we leave the world of mathematics. As is well known, there is no method of scientific measurement that is free from error. As a consequence, observations in close neighborhoods of the sharp boundaries between events are unreliable and should be properly discounted. This results in a violation of the additivity axiom and, by implication, in a violation of the precision requirement. Hence, we need to deal with imprecise probabilities. This need is even more pronounced when instead of measurements we are dependent on assessments based on subjective human judgement. There are other convincing arguments for imprecise probabilities, including, for example, the following:

• Imprecision of probabilities is needed to reflect the amount of information on which they are based. The precision should increase with the amount of information available.

• Total ignorance can be properly modeled by vacuous probabilities, which are maximally imprecise (i.e., each covers the whole range [0, 1]), but not by any precise probabilities.

• Imprecise probabilities are generally easier to assess and elicit than precise ones.

• We may be unable to assess probabilities precisely in practice, even if that is possible in principle, because we lack the time and computational ability.

• A precise probability model that is defined on some class of events determines only imprecise probabilities for events outside the class.

• When several sources of information (sensors, experts, individuals in a group decision) are combined, the extent to which they are consistent can be reflected in the precision of the combined model.


The imprecision in expressing probabilities introduces a new dimension into statistical modeling. It can be utilized, for example, to reflect the amount of information upon which probabilities are estimated or the degree of conflict in information obtained from several different sources. These issues are crucial in statistical modeling with small observation sets as well as in problems of information fusion.

The first thorough study of imprecise probabilities was undertaken by Walley [15]. His principal result is a demonstration that reasoning and decision making based on imprecise probabilities satisfies the principles of coherence and avoidance of sure loss, which are generally viewed as principles of rationality. Hence, the requirement of precision (or, equivalently, the additivity axiom) cannot be justified as inevitable for rationality, as previously believed. The soundness of using imprecise probabilities is based on this important result.

In general, a probability model avoids sure loss if its use cannot lead to outcomes that are harmful to the user. It is coherent if it does not contain any internal inconsistencies resulting from the lack of transitivity of a relevant preference ordering. These concepts are carefully formalized by Walley [15], and he shows that coherent models can be constructed from any set of probability assessments that avoid sure loss by a mathematical procedure called a natural extension.

It is recognized that imprecise probabilities of different types must be distinguished since they require different methodological treatments. The imprecision of probabilities can be expressed, for example, by intervals of real numbers [13], convex sets of probability distributions [10], or fuzzy numbers or fuzzy intervals [11, 12, 14]. These various types of imprecise probabilities may be treated as special cases within fuzzy measure theory [16], a relatively new theory concerned with nonadditive measures, whose connection with imprecise probabilities is analogous to the connection of classical measure theory with precise probabilities.

The purpose of this chapter is to present an overview of basic issues involved in computing with granular probabilities, i.e., probabilities expressed in terms of fuzzy intervals or fuzzy numbers. These issues are well illustrated by a recently developed Bayesian inference with granular probabilities [12, 14].

It should be mentioned at this point that probabilities of fuzzy events, as defined by Zadeh [18], are not necessarily granular. They are either precise probabilities or imprecise probabilities of any of the mentioned types, depending on the nature of the underlying probability distribution.

2. Granular Probabilities

In general, each granular probability is a fuzzy interval defined on [0, 1]. That is, it is a normal fuzzy set on [0, 1] whose α-cuts for all α ∈ (0, 1] are closed subintervals of [0, 1] [9].

Granular probabilities are denoted in this paper by $P_1, P_2, \ldots, P_n$ and their α-cuts are denoted by ${}^{\alpha}P_1, {}^{\alpha}P_2, \ldots, {}^{\alpha}P_n$. Each granular probability $P_i$ ($i \in N_n = \{1, 2, \ldots, n\}$) may conveniently be expressed for all $x \in [0, 1]$ in the canonical form

\[
P_i(x) =
\begin{cases}
f_i(x) & \text{when } x \in [a, b) \\
1 & \text{when } x \in [b, c] \\
g_i(x) & \text{when } x \in (c, d] \\
0 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $a, b, c, d \in [0, 1]$, $a \le b \le c \le d$, $f_i$ is a real-valued function that is strictly increasing and right-continuous, and $g_i$ is a real-valued function that is strictly decreasing and left-continuous [9].

For any granular probability $P_i$ expressed in the canonical form (1), the α-cuts of $P_i$ are expressed for all $\alpha \in (0, 1]$ by the formula

\[
{}^{\alpha}P_i =
\begin{cases}
[\,f_i^{-1}(\alpha),\; g_i^{-1}(\alpha)\,] & \text{when } \alpha \in (0, 1) \\
[\,b, c\,] & \text{when } \alpha = 1,
\end{cases}
\tag{2}
\]

where $f_i^{-1}$ and $g_i^{-1}$ are the inverse functions of $f_i$ and $g_i$, respectively. When the functions $f_i$ and $g_i$ are linear, we obtain trapezoidal fuzzy intervals, and Eq. (2) assumes the very simple form

\[
{}^{\alpha}P_i = [\,a + (b-a)\alpha,\; d - (d-c)\alpha\,].
\tag{3}
\]

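For trapezoidal granules, Eq. (3) is straightforward to implement. The following Python fragment is a minimal sketch of this formula; the particular granule used in the example is an arbitrary illustration.

```python
# Minimal sketch of Eq. (3): alpha-cut of a trapezoidal granular probability
# given by its canonical parameters a <= b <= c <= d in [0, 1].

def trapezoidal_alpha_cut(a, b, c, d, alpha):
    """Return the closed interval [a + (b - a)*alpha, d - (d - c)*alpha]."""
    if not (0 < alpha <= 1):
        raise ValueError("alpha must lie in (0, 1]")
    return (a + (b - a) * alpha, d - (d - c) * alpha)

# Example: a granule roughly meaning "about 0.3".
print(trapezoidal_alpha_cut(0.2, 0.28, 0.32, 0.4, 1.0))   # -> approximately (0.28, 0.32)
print(trapezoidal_alpha_cut(0.2, 0.28, 0.32, 0.4, 0.5))   # -> approximately (0.24, 0.36)
```
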
It is sometimes desirable to generalize the canonical form (1) by allowing the functions $f_i$ and $g_i$ to be only increasing and decreasing, respectively. For this generalized canonical form, formula (2) is not applicable; it must be replaced with the formula

\[
{}^{\alpha}P_i =
\begin{cases}
[\,\inf\{x \mid f_i(x) = \alpha\},\; \sup\{x \mid g_i(x) = \alpha\}\,] & \text{when } \alpha \in (0, 1) \\
[\,b, c\,] & \text{when } \alpha = 1.
\end{cases}
\tag{4}
\]

Consider now a discrete random variable X with values in the set $\{x_1, x_2, \ldots, x_n\}$, and assume that the probabilities of the values of this random variable, $p(x_i)$, are assessed only approximately, in terms of probability granules. Let $P_i$ denote the assessed probability that $X = x_i$ ($i \in N_n$), and let $T = \langle P_1, P_2, \ldots, P_n \rangle$ be called a tuple of probability granules (TPG). For convenience, let

\[
{}^{\alpha}P_i = [\,\underline{p}_i(\alpha),\; \overline{p}_i(\alpha)\,]
\]

for each $i \in N_n$ and each $\alpha \in (0, 1]$. Then,

\[
P_i = \{\, [\underline{p}_i(\alpha),\, \overline{p}_i(\alpha)] \mid \alpha \in (0, 1] \,\}
\]

for each $i \in N_n$. Moreover, the set of probability distributions

\[
S_T(\alpha) = \Big\{ (p_i(\alpha) \mid i \in N_n) \;\Big|\; p_i(\alpha) \in [\underline{p}_i(\alpha),\, \overline{p}_i(\alpha)] \text{ for all } i \in N_n \text{ and } \sum_{i=1}^{n} p_i(\alpha) = 1 \Big\}
\]

is associated with each TPG T for each $\alpha \in (0, 1]$. For dealing with granular probabilities, two properties of TPGs are essential: a given TPG T is called reasonable if the set $S_T(\alpha)$ is not empty for any $\alpha \in (0, 1]$; and it is called feasible if for each $\alpha \in (0, 1]$, each $i \in N_n$, and every value $v_i \in [\underline{p}_i(\alpha), \overline{p}_i(\alpha)]$ there exists at least one probability distribution in $S_T(\alpha)$ such that $p_i(\alpha) = v_i$. It is easy to check whether any given TPG T is reasonable or feasible [14, 17]:

T is reasonable iff

\[
\sum_{i=1}^{n} \underline{p}_i(\alpha) \le 1 \quad \text{and} \quad \sum_{i=1}^{n} \overline{p}_i(\alpha) \ge 1
\]

for all $\alpha \in (0, 1]$; it is feasible iff for all $\alpha \in (0, 1]$

\[
\sum_{i=1}^{n} \underline{p}_i(\alpha) + \overline{p}_j(\alpha) - \underline{p}_j(\alpha) \le 1 \quad \text{for all } j \in N_n
\]

and

\[
\sum_{i=1}^{n} \overline{p}_i(\alpha) + \underline{p}_j(\alpha) - \overline{p}_j(\alpha) \ge 1 \quad \text{for all } j \in N_n.
\]

Moreover, given a reasonable TPG T, there exists a unique feasible TPG T' such that $S_{T'}(\alpha) = S_T(\alpha)$ for each $\alpha \in (0, 1]$. This feasible TPG is obtained by the following recalculation of the values $\underline{p}_i(\alpha)$ and $\overline{p}_i(\alpha)$ in $S_T(\alpha)$:

\[
\underline{p}_i{}'(\alpha) = \max\Big[\underline{p}_i(\alpha),\; 1 - \sum_{j \ne i} \overline{p}_j(\alpha)\Big],
\qquad
\overline{p}_i{}'(\alpha) = \min\Big[\overline{p}_i(\alpha),\; 1 - \sum_{j \ne i} \underline{p}_j(\alpha)\Big].
\tag{5}
\]

Page 346: Data Mining, Rough Sets and Granular Computing

343

Any TPG that is not reasonable is of no use since it does not represent any probability distribution of the random variable of concern; it is based on inconsistent assessments. Any TPG that is reasonable but not feasible contains some redundancy, which can be eliminated by the recalculation expressed by (5).
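
The reasonableness and feasibility tests, together with the recalculation (5), can be sketched in a few lines of Python applied to the α-cuts of a TPG at one fixed α. The interval endpoints used in the example are arbitrary illustrative values.

```python
# Illustrative sketch of the reasonableness/feasibility tests and of the
# recalculation (5), applied to the alpha-cuts of a TPG at one fixed alpha.
# lo[i], hi[i] are the endpoints of the alpha-cut of P_i.

def is_reasonable(lo, hi):
    return sum(lo) <= 1.0 and sum(hi) >= 1.0

def is_feasible(lo, hi):
    n = len(lo)
    return all(sum(lo) - lo[j] + hi[j] <= 1.0 for j in range(n)) and \
           all(sum(hi) - hi[j] + lo[j] >= 1.0 for j in range(n))

def make_feasible(lo, hi):
    """Recalculation (5): tighten redundant bounds without changing S_T(alpha)."""
    n = len(lo)
    new_lo = [max(lo[i], 1.0 - sum(hi) + hi[i]) for i in range(n)]
    new_hi = [min(hi[i], 1.0 - sum(lo) + lo[i]) for i in range(n)]
    return new_lo, new_hi

# A reasonable but not feasible pair of intervals: hi[1] = 0.75 can never be reached.
lo, hi = [0.3, 0.2], [0.5, 0.75]
print(is_reasonable(lo, hi), is_feasible(lo, hi))   # -> True False
print(make_feasible(lo, hi))                        # -> ([0.3, 0.5], [0.5, 0.7])
```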

It is important that the probability granule for any subset A of X is uniquely determined by a given feasible TPG. Let

\[
{}^{\alpha}P(A) = [\,{}^{\alpha}\underline{p}(A),\; {}^{\alpha}\overline{p}(A)\,]
\]

denote the α-cut of the probability granule for the set A. Then, as shown in [13],

\[
{}^{\alpha}\underline{p}(A) = \max\Big[\sum_{x_i \in A} \underline{p}_i(\alpha),\; 1 - \sum_{x_i \notin A} \overline{p}_i(\alpha)\Big],
\qquad
{}^{\alpha}\overline{p}(A) = \min\Big[\sum_{x_i \in A} \overline{p}_i(\alpha),\; 1 - \sum_{x_i \notin A} \underline{p}_i(\alpha)\Big].
\]
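
A minimal Python sketch of this formula, again for one fixed α and with arbitrary illustrative endpoints, might look as follows.

```python
# Sketch of the alpha-cut of the probability granule of a subset A of X,
# for a feasible TPG given by its alpha-cut endpoints lo[i], hi[i].

def subset_alpha_cut(lo, hi, A):
    """A is a set of indices into X; returns the interval for P(A) at this alpha."""
    in_lo = sum(lo[i] for i in A)
    in_hi = sum(hi[i] for i in A)
    out_lo = sum(lo[i] for i in range(len(lo)) if i not in A)
    out_hi = sum(hi[i] for i in range(len(hi)) if i not in A)
    return (max(in_lo, 1.0 - out_hi), min(in_hi, 1.0 - out_lo))

# Feasible alpha-cuts for three values of X.
lo, hi = [0.1, 0.3, 0.4], [0.2, 0.4, 0.5]
print(subset_alpha_cut(lo, hi, {0, 1}))   # -> (0.5, 0.6)
```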

Granular probabilities can be constructed in each application context by the various knowledge acquisition methods [9]. A revised hierarchical analysis method was designed specifically for assessing granular prior probabilities in Bayesian inference by Pan [11].

3. Manipulation of Granular Probabilities

Since granular probabilities are fuzzy intervals, their manipulation requires that appropriate fuzzy arithmetic be used. Unfortunately, the standard fuzzy arithmetic [4,9] is not applicable. The reason is that selections from the components of any given TPG are not independent. They must satisfy not only the requisite equality constraints, as extensively argued in [5,6,7], but also the axiomatic constraints of probability theory.

Consider, for example, that we want to evaluate an arithmetic expression $EXP(P_1, P_2, \ldots, P_n)$, where the arguments are granular probabilities of a given TPG, each of which may appear in the expression more than once. In this case, we have to take into account two types of constraints: (i) equality constraints pertaining to any multiple use of the same argument in the expression [5,6,7]; and (ii) the axiomatic requirement that the probabilities of any finite probability distribution must add to 1. Denoting the evaluation of EXP under the equality and probability constraints by $EXP_{EP}$, this constrained evaluation is defined for each $\alpha \in (0, 1]$ by the formula

\[
{}^{\alpha}EXP_{EP} = \big\{\, EXP\big(p_1(\alpha), \ldots, p_n(\alpha)\big) \;\big|\; (p_i(\alpha) \mid i \in N_n) \in S_T(\alpha) \,\big\}.
\tag{6}
\]


Given a particular expression EXP, Eq. (6) can be expressed more specifically in terms of two similar optimization problems, searching for the minimum in one of them and for the maximum in the other one. Consider, for example, the calculation

\[
E = \sum_{i=1}^{n} P_i\, x_i,
\]

where each $x_i$ is a value of a random variable, $P_i$ is an assessed granular probability of this value, and E is a fuzzy interval expressing the expected value of the variable. Let

\[
{}^{\alpha}E = [\,\underline{e}(\alpha),\; \overline{e}(\alpha)\,].
\]

Then, using the notation introduced in Sec. 2, the values $\underline{e}(\alpha)$ and $\overline{e}(\alpha)$ are calculated for each $\alpha \in (0, 1]$ by solving the following two optimization problems:

\[
\underline{e}(\alpha) = \min \sum_{i=1}^{n} p_i(\alpha)\, x_i,
\qquad
\overline{e}(\alpha) = \max \sum_{i=1}^{n} p_i(\alpha)\, x_i,
\]

subject to the constraints

\[
\text{(i)}\;\; \underline{p}_i(\alpha) \le p_i(\alpha) \le \overline{p}_i(\alpha) \;\; \text{for all } i \in N_n,
\qquad
\text{(ii)}\;\; \sum_{i=1}^{n} p_i(\alpha) = 1.
\]

Due to the linearity of the objective function in this case, the expected value E can be obtained by solving two linear programming problems. Moreover, simple formulas were derived by Dubois and Prade for computing E under the assumption that the probabilities $P_i$ are represented by symmetric fuzzy numbers [1].
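
Assuming a linear-programming routine such as scipy.optimize.linprog is available, the two optimization problems above can be sketched as follows. The helper shown here is hypothetical, and the values x and the α-cut endpoints are arbitrary illustrative inputs.

```python
# Sketch of the two linear programmes for the alpha-cut of the expected value E.
# lo[i], hi[i] are the alpha-cut endpoints of the granular probabilities P_i
# at one fixed alpha; scipy is assumed to be installed.
from scipy.optimize import linprog

def expected_value_alpha_cut(x, lo, hi):
    n = len(x)
    bounds = list(zip(lo, hi))                  # constraint (i)
    A_eq, b_eq = [[1.0] * n], [1.0]             # constraint (ii): probabilities sum to 1
    low = linprog(c=x, A_eq=A_eq, b_eq=b_eq, bounds=bounds)                  # minimize
    high = linprog(c=[-v for v in x], A_eq=A_eq, b_eq=b_eq, bounds=bounds)   # maximize
    return (low.fun, -high.fun)

x = [0.0, 1.0, 2.0]
lo, hi = [0.1, 0.3, 0.4], [0.2, 0.4, 0.5]
print(expected_value_alpha_cut(x, lo, hi))      # -> approximately (1.2, 1.4)
```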

In other computations with granular probabilities, the objective function may be nonlinear. This is, for example, the case of Bayesian inference in which prior probabilities are given in terms of a feasible TPG and likelihoods are either precise or fuzzy. A full investigation of this case is covered in [12] and its summary is given in [13, 14].

In addition to the constraints imposed by probability theory and the requisite equality constraints, the optimization problems involved in computing with probability granules must include any other relevant constraints. For example, if we know for two random variables characterized by granular probabilities that the value of one of the variables is always smaller than the value of the other one, this knowledge must be included as an additional constraint in any computation involving both these variables.

4. Approximation Issues

To make computation with probability granules as efficient as possible, it is desirable to represent them by trapezoidal or triangular membership functions. However, when fuzzy arithmetic (constrained as well as unconstrained) is applied to trapezoidal or triangular granules, the resulting granules are usually not trapezoidal or triangular. More specifically, they deviate from trapezoidal or triangular shapes whenever the operation of multiplication is involved. This is unfortunate since we often need to use the resulting granules as inputs for further processing. This is typical, for example, in the Bayesian inference with granular probabilities [12, 14]. In this case, posterior probabilities obtained at one stage are employed as prior probabilities in the next stage.

For efficient computation, it is desirable to approximate the membership functions of the resulting probability granules at each stage of computation by appropriate trapezoidal or triangular membership functions. However, the question of how to approximate in this case is not easy to answer. A simple way of approximation, usually referred to as the standard approximation, is obtained by keeping the values a, b, c, d in (1) unchanged and replacing the functions $f_i$ and $g_i$ in (1) with their linear counterparts

\[
f_i'(x) = \frac{x-a}{b-a} \quad \text{and} \quad g_i'(x) = \frac{d-x}{d-c}.
\]
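
The following Python sketch illustrates the standard approximation for the product of two trapezoidal granules, computed here by plain (unconstrained) interval arithmetic on the α-cuts purely for illustration; the granules, function names and numbers are assumptions, and the sketch only shows how the exact α-cuts deviate from those of the approximating trapezoid.

```python
# Illustration of the standard approximation for a product of two trapezoidal
# granules on [0, 1] (unconstrained interval arithmetic, for illustration only).
# The exact alpha-cut endpoints are quadratic in alpha; the standard
# approximation keeps a, b, c, d and linearizes the sides.

def product_alpha_cut(t1, t2, alpha):
    """Exact alpha-cut of the product (nonnegative supports assumed)."""
    (a1, b1, c1, d1), (a2, b2, c2, d2) = t1, t2
    lo = (a1 + (b1 - a1) * alpha) * (a2 + (b2 - a2) * alpha)
    hi = (d1 - (d1 - c1) * alpha) * (d2 - (d2 - c2) * alpha)
    return lo, hi

def standard_approximation(t1, t2):
    """Trapezoid keeping the endpoint products a1*a2, b1*b2, c1*c2, d1*d2."""
    (a1, b1, c1, d1), (a2, b2, c2, d2) = t1, t2
    return a1 * a2, b1 * b2, c1 * c2, d1 * d2

t1, t2 = (0.2, 0.3, 0.3, 0.4), (0.5, 0.6, 0.6, 0.7)
a, b, c, d = standard_approximation(t1, t2)
exact = product_alpha_cut(t1, t2, 0.5)
approx = (a + (b - a) * 0.5, d - (d - c) * 0.5)
print(exact, approx)   # the two intervals differ, showing the approximation error
```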

The accumulated error for repeated use of the multiplication operation is examined in [2] and it is shown that it can be sufficiently large to produce misleading results. An alternative approximation is proposed in [2] and further developed in [3], which leads to acceptable results in most practical cases. Moreover, an error expression is derived for the new approximation method, which can be used for checking whether or not the obtained results are acceptable. However, the proposed approximation is polynomial instead of linear, and this increases computational complexity.

An interesting approximation was proposed by Pan [12] in the context of the Bayesian inference with fuzzy probabilities. It is based on keeping the values b and c in (1) unchanged, and calculating new values a' and d' in terms of the weighted least-square-error method, in which the weights are monotone increasing with the values of α according to some rule. Although this approximation method seems promising, no error analysis has been made for it as yet.

Approximation errors can be considerably reduced by allowing the functions $f_i$ and $g_i$ in (1) to be piecewise linear. The simplest way is to replace (1) with


\[
P_i(x) =
\begin{cases}
{}^{1}f_i(x) & \text{when } x \in [a, a_1] \\
{}^{2}f_i(x) & \text{when } x \in [a_1, b] \\
1 & \text{when } x \in [b, c] \\
{}^{1}g_i(x) & \text{when } x \in [c, c_1] \\
{}^{2}g_i(x) & \text{when } x \in [c_1, d] \\
0 & \text{otherwise,}
\end{cases}
\]

where $a \le a_1 \le b \le c \le c_1 \le d$ and ${}^{1}f_i, {}^{2}f_i, {}^{1}g_i, {}^{2}g_i$ are linear functions such that

\[
{}^{1}f_i(a_1) = {}^{2}f_i(a_1) = {}^{1}g_i(c_1) = {}^{2}g_i(c_1) = \alpha_c
\]

for some value $\alpha_c \in (0, 1)$. Then,

\[
{}^{\alpha}P_i =
\begin{cases}
[\,a + \alpha(a_1 - a)/\alpha_c,\;\; d - \alpha(d - c_1)/\alpha_c\,] & \text{when } \alpha \in (0, \alpha_c] \\
[\,a_1 + (\alpha - \alpha_c)(b - a_1)/(1 - \alpha_c),\;\; c_1 - (\alpha - \alpha_c)(c_1 - c)/(1 - \alpha_c)\,] & \text{when } \alpha \in [\alpha_c, 1].
\end{cases}
\]

The approximation method proposed by Pan can be generalized to this case by varying the values $a, a_1, c_1, d$ and using the weighted least-square-error method to obtain the best approximation.

Another approximation problem emerges in the context of "computing with words" introduced by Zadeh [20]. In this case, we operate with a fixed set of linguistic probabilities, each defined in terms of a particular probability granule. The result of each computation is required to be expressed by one of the recognized linguistic terms. This means that we need to determine which of the linguistic terms best approximates (via the associated probability granule) the actual result of our computation. This can be done, for example, by choosing an appropriate distance between the obtained probability granule and those representing the recognized linguistic probabilities, and choosing the linguistic probability with the smallest distance.
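
A minimal sketch of this re-translation step is given below. The vocabulary of linguistic probabilities and the particular distance (mean absolute difference of α-cut endpoints over a grid of α values) are assumptions chosen only for illustration; the chapter itself leaves the choice of distance open.

```python
# Sketch of the re-translation step in computing with words: pick the linguistic
# probability whose granule is closest to a computed granule. The distance used
# here is one possible choice, assumed for illustration only.

def trapezoid_cut(t, alpha):
    a, b, c, d = t
    return a + (b - a) * alpha, d - (d - c) * alpha

def distance(t1, t2, steps=10):
    alphas = [(k + 1) / steps for k in range(steps)]
    total = 0.0
    for al in alphas:
        l1, u1 = trapezoid_cut(t1, al)
        l2, u2 = trapezoid_cut(t2, al)
        total += abs(l1 - l2) + abs(u1 - u2)
    return total / steps

LINGUISTIC = {                 # hypothetical vocabulary of linguistic probabilities
    "unlikely":   (0.0, 0.1, 0.2, 0.3),
    "about even": (0.4, 0.45, 0.55, 0.6),
    "likely":     (0.7, 0.8, 0.9, 1.0),
}

def closest_term(result):
    return min(LINGUISTIC, key=lambda term: distance(result, LINGUISTIC[term]))

print(closest_term((0.35, 0.45, 0.5, 0.65)))   # -> "about even"
```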

5. Uncertainty of Granular Probabilities

The issue of measuring the uncertainty embedded in feasible TPGs has not been sufficiently investigated as yet. However, it is known that two types of uncertainty coexist in each TPG, which are usually referred to as nonspecificity and conflict. To measure the amount of nonspecificity in a TPG, we need to use an appropriate Hartley-like function and, similarly, to measure the amount of conflict in a TPG, we need to use an appropriate Shannon-like function [8].

Given a TPG T and its associated set of probability distributions $S_T(\alpha)$ for each $\alpha \in (0, 1]$, the nonspecificity of T, N(T), is calculated by the formula

\[
N(T) = \int_{0}^{1} HL\big(S_T(\alpha)\big)\, d\alpha,
\tag{7}
\]

where HL denotes a well-established Hartley-like function defined for convex subsets of the n-dimensional Euclidean space [8]. In our case, $S_T(\alpha)$ is for each $\alpha \in (0, 1]$ a convex subset of the n-dimensional Euclidean unit cube, and $HL(S_T(\alpha))$ is defined by the formula


\[
HL\big(S_T(\alpha)\big) = \min_{t \in \mathcal{T}} \ln\Big[ \prod_{i=1}^{n} \big[1 + \mu\big({}^{i}S_{Tt}(\alpha)\big)\big] + \mu\big(S_T(\alpha)\big) - \prod_{i=1}^{n} \mu\big({}^{i}S_{Tt}(\alpha)\big) \Big],
\tag{8}
\]

where $\mu$ denotes the Lebesgue measure, $\mathcal{T}$ denotes the set of all transformations from one orthogonal coordinate system to another, and ${}^{i}S_{Tt}$ denotes the i-th projection of $S_T$ within coordinate system t. As an example, let $T = \langle P_1, P_2 \rangle$, where

\[
{}^{\alpha}P_1 = [\,0.2 + 0.1\alpha,\; 0.4 - 0.1\alpha\,], \qquad {}^{\alpha}P_2 = [\,0.6 + 0.1\alpha,\; 0.8 - 0.1\alpha\,].
\]

Then, N(T) = 0.240856.

The conflict of T, C(T), should be expressed by the formula

\[
C(T) = \int_{0}^{1} SL\big(T(\alpha)\big)\, d\alpha,
\tag{9}
\]

analogous to (7), where SL stands for a Shannon-like function. Although no Shannon-like function for granular probabilities is currently fully justified, a promising candidate is defined for each $\alpha \in (0, 1]$ by formula (10), in which

\[
S(\alpha) = \sum_{i=1}^{n} \big[\underline{p}_i(\alpha) + \overline{p}_i(\alpha)\big].
\]

Applying Eqs. (9) and (10) to the example illustrating the calculation of nonspecificity, we obtain C(T) = 0.869742.

6. Conclusions

Computing with granular probabilities is a special subject area within the emerging theory of fuzzy information granulation [21], which in turn is based on the concept of a linguistic variable [19]. It is also a special subject area within the general theory of imprecise probabilities [15].

The unique feature of computing with granular probabilities is that the axiomatic properties of probability theory must be taken into account as requisite constraints in the associated fuzzy arithmetic [5,6,7]. For each type of computation, two optimization problems must be solved, by which the left endpoints and right endpoints of the resulting α-cuts are determined. One important problem type in which the relevant optimization problems are now well developed is Bayesian inference with granular probabilities [12, 13, 14].
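The sketch below illustrates the two optimization problems for one α level at a time: the minimum and maximum of a function of the probabilities are computed over the box of α-cuts, subject to the requisite constraint that the probabilities sum to one. The granules, the target function, and the use of SciPy's SLSQP solver are assumptions made for this example.

```python
# Constrained fuzzy arithmetic on granular probabilities (illustrative setup): for each
# alpha, minimize and maximize g(p) over the box of alpha-cuts subject to sum(p) = 1;
# the two optima form the alpha-cut of the result.
import numpy as np
from scipy.optimize import minimize

def alpha_cut_box(granules, alpha):
    """granules: list of trapezoids (a, b, c, d); per-component bounds at level alpha."""
    return [(a + (b - a) * alpha, d - (d - c) * alpha) for a, b, c, d in granules]

def result_alpha_cut(g, granules, alpha):
    bounds = alpha_cut_box(granules, alpha)
    cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    x0 = np.array([(lo + hi) / 2 for lo, hi in bounds])   # midpoint starting guess
    lo = minimize(g, x0, bounds=bounds, constraints=cons, method="SLSQP").fun
    hi = -minimize(lambda p: -g(p), x0, bounds=bounds, constraints=cons, method="SLSQP").fun
    return lo, hi

# Example: expected value of a payoff vector under granular probabilities (numbers assumed).
granules = [(0.1, 0.2, 0.3, 0.4), (0.2, 0.3, 0.4, 0.5), (0.2, 0.4, 0.5, 0.6)]
payoff = np.array([1.0, 2.0, 5.0])
g = lambda p: float(np.dot(payoff, p))
for alpha in (0.0, 0.5, 1.0):
    print(alpha, result_alpha_cut(g, granules, alpha))
```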

In addition to the manipulation of granular probabilities via constrained fuzzy arithmetic [5,6,7], computing with granular probabilities involves also some challenging problems of approximation. These problems, which are briefly


discussed in Sec. 4, have not been sufficiently investigated as yet. Further investigation is also needed to clarify some issues regarding measures of uncertainty and uncertainty-based information for granular probabilities.

The importance of imprecise probabilities, and granular probabilities in particular, has increasingly been recognized by theoreticians and practitioners alike. As a consequence, the number of researchers working in this area has been growing steadily. However, a comprehensive methodology for dealing with the various problems of imprecise probabilities is still in its infancy, even though some important results pertaining to this methodology have already been obtained.

References

[1] Dubois, D. and H. Prade [1981], "Additions of interactive fuzzy numbers." IEEE Trans. on Automatic Control, 26(4), pp. 926-936.

[2] Giachetti, R. E. and R. E. Young [1997], "Analysis of the error in the standard approximation for multiplication of triangular and trapezoidal fuzzy numbers and the development of a new approximation." Fuzzy Sets and Systems, 91(1), pp. 1-13.

[3] Giachetti, R. E. and R. E. Young [1997], "A parametric representation of fuzzy numbers and their arithmetic operators." Fuzzy Sets and Systems, 91(2), pp. 185-202.

[4] Kaufmann, A. and M. M. Gupta [1985], Introduction to Fuzzy Arithmetic. Van Nostrand, New York.

[5] Klir, G. J. [1997], "The role of fuzzy arithmetic in engineering." In: Ayyub, B. M., ed., Uncertainty Analysis in Engineering and the Science. Kluwer, Boston.

[6] Klir, G. J. [1997], "Fuzzy arithmetic with requisite constraints." Fuzzy Sets and Systems, 91(2), pp. 165-175.

[7] Klir, G. J. and Y. Pan [1998], "Constrained fuzzy arithmetic: Basic questions and some answers." Soft Computing, 2(2), pp. 100-108.

[8] Klir, G. J. and M. J. Wierman [1998], Uncertainty-Based Information: Elements of Generalized Information Theory. Physica-Verlag/Springer-Verlag, Heidelberg and New York.

[9] Klir, G. J. and B. Yuan [1995], Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, Upper Saddle River, NJ.

[10] Kyburg, H. E. [1987], "Bayesian and non-Bayesian evidential updating." Artificial Intelligence, 31, pp. 271-293.

[11] Pan, Y. [1997a], "Revised hierarchical analysis method based on crisp and fuzzy entries and its application to assessment of prior probability distributions." Inter. J. of General Systems, 26(1-2), pp. 110-131.

[12] Pan, Y. [1997b], Calculus of Fuzzy Probabilities and Its Applications. PhD Dissertation in Systems Science, Binghamton University-SUNY.

[13] Pan, Y. and G. J. Klir [1997], "Bayesian inference based on interval-valued prior distributions and likelihoods." J. of Intelligent & Fuzzy Systems, 5(3).


[14] Pan, Y. and B. Yuan [1997], "Bayesian inference of fuzzy probabilities." Inter. J. of General Systems, 26(1-2), pp. 73-90.

[15] Walley, P. [1991], Statistical Reasoning With Imprecise Probabilities. Chapman and Hall, London.

[16] Wang, Z. and G. J. Klir [1992], Fuzzy Measure Theory. Plenum Press, New York.

[17] Weichselberger, K. and Pohlmann [1990], A Methodology for Uncertainty in Knowledge-Based Systems. Springer-Verlag, New York.

[18] Zadeh, L. A. [1968], "Probability measures of fuzzy events." J. of Math. Analysis and Applications, 23, pp. 421-427.

[19] Zadeh, L. A. [1975], "The concept of a linguistic variable and its application to approximate reasoning I, II, III." Information Sciences, 8, pp. 199-251,301-357; 9, pp. 43-80.

[20] Zadeh, L. A. [1996], "Fuzzy logic = computing with words." IEEE Trans. on Fuzzy Systems, 4(2), pp. 103-111.

[21] Zadeh, L. A. [1997], "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic." Fuzzy Sets and Systems, 90(2), pp. 111-127.


Multi-dimensional Aggregation of Fuzzy Numbers Through the Extension Principle

Gaspar Mayor¹, Adolfo R. de Soto², Jaume Suner¹, and Enric Trillas³

1 Dept. Matematiques i Informatica, Universitat de les Illes Balears 07071-Palma de Mallorca. e-mail: [email protected]

2 Area de Lenguajes y Sistemas Informaticos, Universidad de Leon 24071 Leon. e-mail: [email protected]

3 Dpto. Inteligencia Artificial, Universidad Politecnica de Madrid Campus de Montegancedo, 28660 Boadilla del Monte. Madrid. e-mail: [email protected]

Abstract. In this paper we propose the problem of obtaining a procedure to aggregate fuzzy numbers in such a way that the output is also a fuzzy number. To do this, we use Zadeh's Extension Principle applied to multi-dimensional numerical functions which satisfy certain conditions, obtaining multi-dimensional aggregation functions on the lattice of fuzzy numbers. Special attention is given to the case of trapezoidal fuzzy numbers.

1 Introduction

The problem of aggregating fuzzy sets has preoccupied many researchers for some time and has produced a substantial body of related literature ([6], [7], [8]). The need to synthesize several fuzzy outputs into one unique output appears in many situations; for example, in a fuzzy rule-based system with a set of n rules Rᵢ: IF x is Aᵢ THEN y is Bᵢ, an input triggers the set of rules and each rule produces an output Bᵢ'. The unique final output B is obtained by applying an aggregation operator to the outputs B₁', ..., Bₙ'. This fuzzy set aggregation operator (or function) can be obtained from a numerical aggregation function (idempotent and increasing) f : [0, 1]ⁿ → [0, 1] through B(x) = f(B₁'(x), ..., Bₙ'(x)). It is usual to take f = max, so that B(x) = max(B₁'(x), ..., Bₙ'(x)), but this procedure has the serious drawback of producing outputs B that are "distorted" with respect to B₁, ..., Bₙ and to B₁', ..., Bₙ', which complicates the reading of the final result of the process. It is important to remark that quasi-linear means are needed to synthesize fuzzy preorders ([1]).
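As a small illustration of this pointwise aggregation of rule outputs, the snippet below (with hypothetical clipped rule outputs) compares f = max with the arithmetic mean:

```python
# Pointwise aggregation of the fuzzy outputs of n rules with a numerical aggregation
# function f, as described above. The membership functions are illustrative assumptions.
import numpy as np

x = np.linspace(0.0, 10.0, 201)

def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

# Clipped outputs B'_i of three hypothetical rules.
outputs = [0.8 * triangular(x, 1, 3, 5),
           0.5 * triangular(x, 2, 4, 6),
           0.9 * triangular(x, 3, 6, 9)]

B_max = np.max(outputs, axis=0)    # pointwise maximum aggregation
B_mean = np.mean(outputs, axis=0)  # pointwise arithmetic mean aggregation
print(B_max.max(), B_mean.max())
```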

To avoid this difficulty, in the example above and in any other similar situation, we deal here with the problem of analyzing the advantages of using, for the case of fuzzy numbers, the aggregation procedure derived from the Extension Principle ([9]). At the same time, we introduce a multi-dimensional character to the aggregation and, in this way, we obtain a method which allows us to aggregate any quantity of fuzzy numbers with a fuzzy number as output, satisfying some properties which seem to make this procedure very interesting.


2 Preliminaries

In this section we present some preliminaries, in which the definitions and basic results used throughout the paper can be found. It is worth noting that the definition of fuzzy number we have taken is quite restrictive if compared, for example, with the ones given in [2], [4] and [5], but we believe that it is sufficient for most applications.

Definition 1. A fuzzy number is a continuous function μ : ℝ → [0, 1] such that there exist a, b, c, d, with −∞ ≤ a < b ≤ c < d ≤ +∞, satisfying the three following conditions:

1) μ(x) = 0 ∀x ∈ (−∞, a] ∪ [d, +∞),
2) μ is strictly increasing in [a, b] and strictly decreasing in [c, d],
3) μ(x) = 1 ∀x ∈ [b, c].

Let us denote by N(ℝ) the set of all the fuzzy numbers on ℝ.

Remark 1. In condition (1) of the definition, in the case of a = −∞ and/or d = +∞, μ(a) = 0 means that lim_{x→−∞} μ(x) = 0 and μ(d) = 0 means that lim_{x→+∞} μ(x) = 0.

Remark 2. In condition (2) of the definition, the case a = −∞ means that μ is strictly increasing in the interval (−∞, b] and, similarly, the case d = +∞ means that μ is strictly decreasing in the interval [c, +∞).

Proposition 1. Every fuzzy number is convex.

Proof. Let μ ∈ N(ℝ) be a fuzzy number with parameters (a, b, c, d). We must prove that

μ(λx₁ + (1 − λ)x₂) ≥ min(μ(x₁), μ(x₂))  for all x₁, x₂ ∈ ℝ and λ ∈ [0, 1].

Let us take x₁, x₂ ∈ ℝ with x₁ ≤ x₂ (without loss of generality) and λ ∈ [0, 1]. We will consider three cases:

a) If x₁ ≤ x₂ ≤ c, μ is increasing (not necessarily strictly) in the interval of extremes x₁ and x₂. Because x₁ ≤ λx₁ + (1 − λ)x₂ ≤ x₂, the monotonicity of μ shows that μ(λx₁ + (1 − λ)x₂) ≥ μ(x₁) = min(μ(x₁), μ(x₂)).

b) If b ≤ x₁ ≤ x₂, μ is decreasing (not necessarily strictly) in the interval of extremes x₁ and x₂. Because x₁ ≤ λx₁ + (1 − λ)x₂ ≤ x₂, the monotonicity of μ shows that μ(λx₁ + (1 − λ)x₂) ≥ μ(x₂) = min(μ(x₁), μ(x₂)).

c) Finally, if x₁ < b ≤ c < x₂, we have three cases:

i) If λx₁ + (1 − λ)x₂ < b, then

μ(λx₁ + (1 − λ)x₂) ≥ μ(x₁) ≥ min(μ(x₁), μ(x₂)).


ii) If λx₁ + (1 − λ)x₂ > c, then

μ(λx₁ + (1 − λ)x₂) ≥ μ(x₂) ≥ min(μ(x₁), μ(x₂)).

iii) If b ≤ λx₁ + (1 − λ)x₂ ≤ c, then

μ(λx₁ + (1 − λ)x₂) = 1 ≥ μ(x₁), μ(x₂).

Example 1. When the membership function μ is linear on the intervals [a, b] and [c, d], the fuzzy number is said to be trapezoidal. In this case, we have

μ(x) = 0                 if x ≤ a,
μ(x) = (x − a)/(b − a)   if a < x < b,
μ(x) = 1                 if b ≤ x ≤ c,
μ(x) = (x − d)/(c − d)   if c < x < d,
μ(x) = 0                 if x ≥ d.

When b = c, we have a triangular fuzzy number. In these cases the fuzzy number can be identified with the quadruple (a, b, c, d), and thus we will write μ = (a, b, c, d). Let us denote by TN(ℝ) the set of all the trapezoidal/triangular fuzzy numbers on ℝ.

Definition 2. Given a fuzzy number μ and α ∈ [0, 1], the α-cut of μ is the set

[μ]_α = {x ∈ ℝ : μ(x) ≥ α}.

Example 2. In the case of a trapezoidal fuzzy number μ = (a, b, c, d), it is easy to find the α-cuts from the expression of μ. We have

[μ]_α = [a + (b − a)·α, d − (d − c)·α]  ∀α ∈ (0, 1],

[μ]₀ = ℝ.
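A minimal sketch of these two notions for trapezoidal fuzzy numbers, with illustrative parameter values, is the following:

```python
# Trapezoidal fuzzy numbers mu = (a, b, c, d): membership evaluation and alpha-cuts as in
# Definition 2 and Example 2 (the concrete number below is an assumption).
import numpy as np

def membership(t, x):
    a, b, c, d = t
    x = np.asarray(x, dtype=float)
    up = (x - a) / (b - a)      # increasing branch on (a, b)
    down = (d - x) / (d - c)    # decreasing branch on (c, d); equals (x - d)/(c - d)
    return np.clip(np.minimum(np.minimum(up, down), 1.0), 0.0, 1.0)

def alpha_cut(t, alpha):
    a, b, c, d = t
    if alpha == 0.0:
        return (-np.inf, np.inf)   # [mu]_0 is the whole real line
    return (a + (b - a) * alpha, d - (d - c) * alpha)

mu = (1.0, 2.0, 3.0, 5.0)          # a trapezoidal fuzzy number
print(membership(mu, [0.5, 1.5, 2.5, 4.0, 6.0]))
print(alpha_cut(mu, 0.5))          # -> (1.5, 4.0)
```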

Definition 3. Consider f : ℝⁿ → ℝ. It is said that f is increasing if it is increasing with respect to the product order:

if xᵢ ≤ xᵢ' ∀i = 1, ..., n, then f(x₁, ..., xₙ) ≤ f(x₁', ..., xₙ').

Definition 4. Consider f : ℝⁿ → ℝ. It is said that f is strictly increasing

if xᵢ < xᵢ' ∀i = 1, ..., n, then f(x₁, ..., xₙ) < f(x₁', ..., xₙ').

Proposition 2. Let f : ℝⁿ → ℝ be continuous and strictly increasing. Then f is increasing.

Proof. Let us only prove the case n = 2. Consider x₁ ≤ x₁' and x₂ ≤ x₂'. Suppose that x₁ < x₁' and x₂ = x₂'. Consider a sequence a₁ > a₂ > ... → x₂. Then f(x₁, aₖ) → f(x₁, x₂) and f(x₁', aₖ) → f(x₁', x₂). On the other hand, f(x₁, aₖ₊₁) < f(x₁', aₖ) ∀k ≥ 1, thus f(x₁, x₂) ≤ f(x₁', x₂).


Example 3. Let us see now an example of a strictly increasing function which is not an increasing function (according to the proposition above, it cannot be continuous). Consider the function f : ℝ² → ℝ given by

f(x, y) = min(x, y)       if (x, y) ∉ (0, +∞)² ∪ {(0, 0)},
f(x, y) = 1               if (x, y) = (0, 0),
f(x, y) = 1 + min(x, y)   if (x, y) ∈ (0, +∞)².

Note that f is not increasing because, for instance, f(0, 0) = 1 > f(0, 1) = 0. To prove that f is strictly increasing, let us define A = (0, +∞)². If (x, y), (x', y') ∈ A, or (x, y), (x', y') ∉ A ∪ {(0, 0)}, or (x, y) ∉ A ∪ {(0, 0)} and (x', y') ∈ A, with x < x' and y < y', then clearly f(x, y) < f(x', y'). Finally, if x < 0 and y < 0, then f(x, y) < 0 < 1 = f(0, 0), and if x > 0 and y > 0, then f(x, y) > 1 = f(0, 0).
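The piecewise definition above is easy to check numerically; the following snippet simply evaluates it at a few points:

```python
# Quick numerical check of Example 3: f is strictly increasing but not increasing.
def f(x, y):
    if (x, y) == (0.0, 0.0):
        return 1.0
    if x > 0.0 and y > 0.0:
        return 1.0 + min(x, y)
    return min(x, y)

print(f(0.0, 0.0), f(0.0, 1.0))     # 1.0 0.0  -> not increasing: (0,0) <= (0,1) but f drops
print(f(-1.0, -2.0), f(0.5, 0.25))  # -2.0 1.25 -> consistent with strict monotonicity
```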

3 n-dimensional aggregation functions

In this section we present, for the n-dimensional case, the fundamental results (prop. 3 and 4 due to V. Novak ([5])) which will allow us to deal with the multi-dimensional case in the next section. Other references to complete the reading of this section can be found in [3] and [7].

Definition 5. Given a function f : ℝⁿ → ℝ, the Extension Principle allows us to generate a new function F : ([0,1]^ℝ)ⁿ → [0,1]^ℝ defined as follows: ∀ μ₁, ..., μₙ ∈ [0,1]^ℝ,

F(μ₁, ..., μₙ)(y) = sup { min(μ₁(x₁), ..., μₙ(xₙ)) : (x₁, ..., xₙ) such that f(x₁, ..., xₙ) = y }  ∀y ∈ ℝ,

and F(μ₁, ..., μₙ)(y) = 0 when y does not have a pre-image.

Example 4. In the case of f(x₁, ..., xₙ) = min(x₁, ..., xₙ), one obtains the function MIN defined on ([0,1]^ℝ)ⁿ as follows: ∀ μ₁, ..., μₙ ∈ [0,1]^ℝ, ∀y ∈ ℝ,

MIN(μ₁, ..., μₙ)(y) = sup { min(μ₁(x₁), ..., μₙ(xₙ)) : min(x₁, ..., xₙ) = y }.

Similarly, if f(x₁, ..., xₙ) = max(x₁, ..., xₙ), one obtains the function MAX defined on ([0,1]^ℝ)ⁿ as follows: ∀ μ₁, ..., μₙ ∈ [0,1]^ℝ, ∀y ∈ ℝ,

MAX(μ₁, ..., μₙ)(y) = sup { min(μ₁(x₁), ..., μₙ(xₙ)) : max(x₁, ..., xₙ) = y }.

Proposition 3. Let μ₁, ..., μₙ ∈ N(ℝ) be fuzzy numbers with parameters (a₁, b₁, c₁, d₁), ..., (aₙ, bₙ, cₙ, dₙ), respectively, and f : ℝⁿ → ℝ a continuous and strictly increasing function. The function F : ([0,1]^ℝ)ⁿ → [0,1]^ℝ induced through the Extension Principle by f satisfies that

F(μ₁, ..., μₙ) = μ

is a fuzzy number and, moreover:


1) μ(y) = 0 ∀y ∈ (−∞, f(a₁, ..., aₙ)] ∪ [f(d₁, ..., dₙ), +∞),
2) μ is strictly increasing on [f(a₁, ..., aₙ), f(b₁, ..., bₙ)] and strictly decreasing on [f(c₁, ..., cₙ), f(d₁, ..., dₙ)],
3) μ(x) = 1 ∀x ∈ [f(b₁, ..., bₙ), f(c₁, ..., cₙ)].

Proposition 4. Let μ₁, ..., μₙ ∈ N(ℝ) be fuzzy numbers with parameters (a₁, b₁, c₁, d₁), ..., (aₙ, bₙ, cₙ, dₙ), respectively, and f : ℝⁿ → ℝ a continuous and strictly increasing function. Then the α-cuts of the fuzzy number μ = F(μ₁, ..., μₙ) are:

[μ]_α = f([μ₁]_α × ... × [μₙ]_α)
      = {y ∈ ℝ : y = f(x₁, ..., xₙ), where xᵢ ∈ [μᵢ]_α, i = 1, ..., n}.
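For trapezoidal inputs and an increasing continuous f, Proposition 4 gives a direct way to compute the α-cuts of the aggregated fuzzy number from interval endpoints. The sketch below does exactly that; the inputs and the choice of f (the arithmetic mean) are illustrative assumptions.

```python
# Since f is continuous and increasing, the image of the box [mu_1]_alpha x ... x [mu_n]_alpha
# is the interval [f(lower endpoints), f(upper endpoints)].
import numpy as np

def alpha_cut(t, alpha):
    a, b, c, d = t
    return a + (b - a) * alpha, d - (d - c) * alpha

def aggregated_alpha_cut(f, trapezoids, alpha):
    cuts = [alpha_cut(t, alpha) for t in trapezoids]
    lows = [lo for lo, _ in cuts]
    highs = [hi for _, hi in cuts]
    return f(*lows), f(*highs)

mus = [(1, 2, 3, 4), (0, 1, 1, 2), (2, 3, 4, 6)]   # trapezoidal/triangular inputs (assumed)
f = lambda x, y, z: (x + y + z) / 3.0              # arithmetic mean: continuous, increasing
for alpha in (0.0, 0.5, 1.0):
    print(alpha, aggregated_alpha_cut(f, mus, alpha))
```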

Definition 6. We define on N(ℝ) the following order: given two fuzzy numbers μ, η ∈ N(ℝ),

μ ≼ η ⟺ MIN(μ, η) = μ ⟺ MAX(μ, η) = η.

Proposition 5. With this order, (N(ℝ), ≼) is a distributive lattice.

Proposition 6. Let μ, η ∈ N(ℝ) be two fuzzy numbers; then μ ≼ η if and only if

inf[μ]_α ≤ inf[η]_α  and  sup[μ]_α ≤ sup[η]_α

for all α ∈ [0, 1].

Example 5. Let μ₁ = (a₁, b₁, c₁, d₁) and μ₂ = (a₂, b₂, c₂, d₂) be two trapezoidal fuzzy numbers. Let us suppose that μ₁ ≼ μ₂. Then

inf[μ₁]_α ≤ inf[μ₂]_α  and  sup[μ₁]_α ≤ sup[μ₂]_α

for all α ∈ [0, 1], that is,

inf[a₁ + (b₁ − a₁)·α, d₁ − (d₁ − c₁)·α] ≤ inf[a₂ + (b₂ − a₂)·α, d₂ − (d₂ − c₂)·α]

and

sup[a₁ + (b₁ − a₁)·α, d₁ − (d₁ − c₁)·α] ≤ sup[a₂ + (b₂ − a₂)·α, d₂ − (d₂ − c₂)·α]

for all α ∈ [0, 1]. Putting α = 0, 1, we obtain the relation

a₁ ≤ a₂,  b₁ ≤ b₂,  c₁ ≤ c₂,  d₁ ≤ d₂.

Definition 7. Given a lattice (L, ≤), a function f : Lⁿ → L is an n-dimensional aggregation function if


1) f is idempotent: f(a, ..., a) = a for all a in L;
2) f is increasing with respect to the product order of Lⁿ:

if xᵢ ≤ yᵢ ∀i = 1, ..., n, then f(x₁, ..., xₙ) ≤ f(y₁, ..., yₙ).

Proposition 7. Let f : ℝⁿ → ℝ be a continuous, idempotent and strictly increasing function. Then the function F : (N(ℝ))ⁿ → N(ℝ) induced through the Extension Principle by f is an aggregation function on the lattice (N(ℝ), ≼).

Remark 3. The function f of the proposition is, according to Proposition 2, an aggregation function on ℝ.

Proof. First of all, let us prove the idempotency of F:

F(μ, ..., μ)(y) = sup_{f(x₁,...,xₙ)=y} min(μ(x₁), ..., μ(xₙ))
               ≥ min(μ(y), ..., μ(y))
               = μ(y),

because of the idempotency of f. On the other hand, if x₁, ..., xₙ ∈ ℝ, consider m = min(x₁, ..., xₙ) and M = max(x₁, ..., xₙ), and let y = f(x₁, ..., xₙ). Since f is an aggregation function, we have m ≤ y ≤ M and, since μ is convex, μ(y) ≥ min(μ(m), μ(M)). Now, the inequalities m ≤ xᵢ ≤ M hold for every i, thus μ(xᵢ) ≥ min(μ(m), μ(M)) and therefore min(μ(x₁), ..., μ(xₙ)) ≥ min(μ(m), μ(M)). Since also min(μ(x₁), ..., μ(xₙ)) ≤ min(μ(m), μ(M)), we will have min(μ(x₁), ..., μ(xₙ)) = min(μ(m), μ(M)). Thus

F(μ, ..., μ)(y) = sup_{f(x₁,...,xₙ)=y} min(μ(x₁), ..., μ(xₙ))
               = sup_{f(x₁,...,xₙ)=y} min(μ(m), μ(M))
               ≤ μ(y).

Let us prove now the monotonicity with respect to the product order. Let μ, η ∈ N(ℝ) be two fuzzy numbers with μ ≼ η. To prove that F is non-decreasing, it is enough to show that the inequality

F(μ, μ₂, ..., μₙ) ≼ F(η, μ₂, ..., μₙ)

is valid for every μ₂, ..., μₙ ∈ N(ℝ), because the proof is analogous for the other components. Due to Proposition 6, it is sufficient to prove

sup[F(μ, μ₂, ..., μₙ)]_α ≤ sup[F(η, μ₂, ..., μₙ)]_α, and
inf[F(μ, μ₂, ..., μₙ)]_α ≤ inf[F(η, μ₂, ..., μₙ)]_α,

for all α ∈ [0, 1]. But due to Proposition 4 we have

[F(μ, μ₂, ..., μₙ)]_α = f([μ]_α × [μ₂]_α × ... × [μₙ]_α)  and  [F(η, μ₂, ..., μₙ)]_α = f([η]_α × [μ₂]_α × ... × [μₙ]_α).


Let y = f(x, x₂, ..., xₙ) be a point with x ∈ [μ]_α and xᵢ ∈ [μᵢ]_α for all i = 2, ..., n. Since sup[μ]_α ≤ sup[η]_α, there will exist x' ∈ [η]_α with x ≤ x' and, then, y ≤ y' = f(x', x₂, ..., xₙ) because f is a non-decreasing function. Thus, for each y ∈ [F(μ, μ₂, ..., μₙ)]_α, there exists y' ∈ [F(η, μ₂, ..., μₙ)]_α such that y ≤ y' and, consequently,

sup[F(μ, μ₂, ..., μₙ)]_α ≤ sup[F(η, μ₂, ..., μₙ)]_α.

The proof with the infimum is analogous.

4 Multi-dimensional aggregation functions

This is the main section of the paper: in it, by means of the Extension Principle, we present the result asserting that an aggregation function for fuzzy numbers (whose result is another fuzzy number) can be obtained from a numerical aggregation function. The extended aggregation function preserves, in particular, the properties of multi-dimensional monotonicity required of the numerical function. For more details on the definitions, results and examples in this section, refer to [6].

Definition 8. Given a lattice (L, ≤), let E = ⋃_{n≥1} Lⁿ be the set of all the ordered lists formed with elements of L. The relations ≤_α and ≤_β given below define orders in E. Given x = (x₁, ..., xₙ), y = (y₁, ..., y_m) elements of E:

x ≤_α y  ⟺  x₁ ≤ y₁, ..., xₙ ≤ yₙ (if n = m), or x₁ ≤ y₁, ..., xₙ ≤ yₙ and sup(x₁, ..., xₙ) ≤ inf(y_{n+1}, ..., y_m) (if n < m).

Proposition 8. (E, ≤_α) and (E, ≤_β) are lattices.

The next result establishes a very useful characterization of monotonicity with respect to the orders α and β.

Proposition 9. Let f : E → L be an increasing mapping with respect to the product order. Then:

a) f is increasing with respect to ≤_α if and only if

f(x₁, ..., xₙ) ≤ f(x₁, ..., xₙ, sup(x₁, ..., xₙ))  ∀(x₁, ..., xₙ) ∈ E;

b) f is increasing with respect to ≤_β if and only if

f(x₁, ..., xₙ, inf(x₁, ..., xₙ)) ≤ f(x₁, ..., xₙ)  ∀(x₁, ..., xₙ) ∈ E.


Definition 9. Let (L, ≤) be a lattice. A function f : ⋃_{n≥1} Lⁿ → L is a multi-dimensional aggregation function if

1) ∀n ≥ 1, f is idempotent and increasing with respect to the product order,
2) f is increasing with respect to the orders α and β.

Proposition 10. Let f : ⋃_{n≥1} ℝⁿ → ℝ be increasing with respect to the orders α and β and such that, ∀n ≥ 1, it is continuous, strictly increasing and idempotent. Then the function F : ⋃_{n≥1} (N(ℝ))ⁿ → N(ℝ) induced through the Extension Principle by f is a multi-dimensional aggregation function.

Proof. We know, by Proposition 7, that for every n ≥ 1, F is idempotent and increasing. To show that it is increasing with respect to the orders α and β, it is sufficient to prove the equivalent condition: for all n ≥ 1, and for all (μ₁, ..., μₙ) ∈ (N(ℝ))ⁿ, it is verified that

F(μ₁, ..., μₙ, MIN(μ₁, ..., μₙ)) ≼ F(μ₁, ..., μₙ) ≼ F(μ₁, ..., μₙ, MAX(μ₁, ..., μₙ)).

According to Proposition 6, it is sufficient to prove that

sup[F(μ₁, ..., μₙ, MIN(μ₁, ..., μₙ))]_α ≤ sup[F(μ₁, ..., μₙ)]_α

and

sup[F(μ₁, ..., μₙ)]_α ≤ sup[F(μ₁, ..., μₙ, MAX(μ₁, ..., μₙ))]_α,

and the same conditions for the infimum.

Let us take y ∈ [F(μ₁, ..., μₙ, MIN(μ₁, ..., μₙ))]_α. Then y = f(x₁, ..., xₙ, m), where xᵢ ∈ [μᵢ]_α and m ∈ [MIN(μ₁, ..., μₙ)]_α, that is, m = min(x₁', ..., xₙ') with xᵢ' ∈ [μᵢ]_α. We must show that there exists y' ∈ [F(μ₁, ..., μₙ)]_α such that y ≤ y'. Let us consider xᵢ'' = max(xᵢ, xᵢ') ∈ [μᵢ]_α. Thus (x₁, ..., xₙ, m) ≤_β (x₁'', ..., xₙ'') implies that y = f(x₁, ..., xₙ, m) ≤ y' = f(x₁'', ..., xₙ'').

Let us consider now y ∈ [F(μ₁, ..., μₙ)]_α; then y = f(x₁, ..., xₙ), where xᵢ ∈ [μᵢ]_α. If M = max(x₁, ..., xₙ), then (x₁, ..., xₙ) ≤_α (x₁, ..., xₙ, M) and thus y ≤ y', where y' = f(x₁, ..., xₙ, M) ∈ [F(μ₁, ..., μₙ, MAX(μ₁, ..., μₙ))]_α.

The proof for the case of the infimum is completely analogous.


Let us now see some important examples of functions f : ⋃_{n≥1} ℝⁿ → ℝ satisfying the hypothesis of this proposition.

Example 6. Let us consider a multi-dimensional OWA operator, that is, a function f : ⋃_{n≥1} ℝⁿ → ℝ such that, for each n ≥ 1, there exists an n-dimensional vector of weights Wₙ = (w₁ⁿ, ..., wₙⁿ) ∈ [0,1]ⁿ, with Σ_{i=1}^{n} wᵢⁿ = 1, such that

f(x₁, ..., xₙ) = Σ_{i=1}^{n} wᵢⁿ x_[i]  ∀n ≥ 1,

where x_[i] = max_{1≤j₁<...<jᵢ≤n} min(x_{j₁}, ..., x_{jᵢ}) is the i-th biggest value of the list x₁, ..., xₙ.

We know ([6]) that any multi-dimensional OWA operator can be represented through a triangle of weights in such a way that every row n of this triangle is composed of the corresponding n-dimensional vector of weights Wₙ, that is, by

1
w₁²  w₂²
w₁³  w₂³  w₃³
w₁⁴  w₂⁴  w₃⁴  w₄⁴
...

For example, the operator Min is given by a triangle of weights Δwᵢⁿ such that, for each n ≥ 1, wₙⁿ = 1 and wᵢⁿ = 0 ∀i = 1, ..., n − 1. Similarly, the Max operator is given by w₁ⁿ = 1, wᵢⁿ = 0 ∀i = 2, ..., n. The arithmetic mean is also an OWA operator, with weights wᵢⁿ = 1/n ∀i = 1, ..., n.

These three operators clearly satisfy the hypothesis of the proposition. Let us see which other OWA operators satisfy it.

Notice that any OWA operator f is an idempotent, continuous and strictly increasing function. Thus it will satisfy the hypothesis of the proposition if and only if it is an increasing function with respect to the orders α and β, that is, if and only if the associated triangle of weights is regular ([6]):

Σ_{i=1}^{p} wᵢⁿ⁺¹ ≤ Σ_{i=1}^{p} wᵢⁿ ≤ Σ_{i=1}^{p+1} wᵢⁿ⁺¹

for all n ≥ 1 and p = 1, ..., n.
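The sketch below implements an OWA operator from a triangle of weights and checks the regularity condition numerically; the arithmetic-mean triangle used as input is only one admissible example.

```python
# An OWA operator given by a triangle of weights, plus a numerical check of regularity.
import numpy as np

def owa(weights, xs):
    """OWA value: weights applied to the values sorted in decreasing order."""
    xs = np.sort(np.asarray(xs, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    assert len(w) == len(xs) and abs(w.sum() - 1.0) < 1e-12
    return float(np.dot(w, xs))

def is_regular(triangle, tol=1e-12):
    """triangle[n-1] is the weight vector W_n; checks the regularity inequalities."""
    for n in range(1, len(triangle)):
        wn, wn1 = np.asarray(triangle[n - 1]), np.asarray(triangle[n])
        for p in range(1, n + 1):
            if not (wn1[:p].sum() <= wn[:p].sum() + tol <= wn1[:p + 1].sum() + 2 * tol):
                return False
    return True

mean_triangle = [[1.0 / n] * n for n in range(1, 6)]  # arithmetic mean on each dimension
print(owa(mean_triangle[2], [0.2, 0.9, 0.4]))         # -> 0.5
print(is_regular(mean_triangle))                      # -> True
```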

Example 7. Let us consider now a multi-dimensional quasi-linear weighted mean, that is, a function f : ⋃_{n≥1} ℝⁿ → ℝ defined by

f(x₁, ..., xₙ) = φ⁻¹( Σ_{i=1}^{n} wᵢⁿ φ(xᵢ) ),


where φ is a continuous and strictly monotonic function φ : ℝ → ℝ and Δwᵢⁿ is a triangle of weights.

Notice that, if φ = Id and wᵢⁿ = 1/n ∀i = 1, ..., n, we obtain again the arithmetic mean. Let us see which other means satisfy the hypothesis of the proposition.

Any mean f is an idempotent, continuous and strictly increasing function. Thus it will satisfy the hypothesis of the proposition if and only if it is an increasing function with respect to the orders α and β, that is, if and only if the associated triangle of weights satisfies ([6]):

wᵢⁿ ≥ wᵢⁿ⁺¹ for all n ≥ 1 and i = 1, ..., n.
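A corresponding sketch of a quasi-linear weighted mean is given below; the generator φ = ln (yielding a weighted geometric mean on positive inputs) and the weights are assumptions chosen for illustration.

```python
# Quasi-linear weighted mean f(x) = phi^{-1}(sum_i w_i phi(x_i)).
import numpy as np

def quasi_linear_mean(weights, xs, phi=np.log, phi_inv=np.exp):
    w = np.asarray(weights, dtype=float)
    xs = np.asarray(xs, dtype=float)
    return float(phi_inv(np.dot(w, phi(xs))))

weights = [0.5, 0.3, 0.2]                                      # weights summing to 1
print(quasi_linear_mean(weights, [2.0, 4.0, 8.0]))             # weighted geometric mean
print(quasi_linear_mean(weights, [2.0, 4.0, 8.0],
                        phi=lambda x: x, phi_inv=lambda x: x)) # ordinary weighted mean
```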

The examples above (with the restrictions imposed on the triangles of weights) show that the multi-dimensional extensions of the OWA operators and the means are useful to generate, via the Extension Principle, multi-dimensional aggregation functions for fuzzy numbers.

5 Trapezoidal fuzzy numbers

For the n-dimensional case, we propose the problem of finding n-dimensional real functions which induce, via the Extension Principle, aggregation functions which preserve the trapezoidal character of the inputs. Specifically: which continuous, idempotent and strictly increasing functions satisfy the inclusion F((TN(ℝ))ⁿ) ⊂ TN(ℝ)? That is, which continuous, idempotent and strictly increasing functions f : ℝⁿ → ℝ are such that the function F : (N(ℝ))ⁿ → N(ℝ) induced through the Extension Principle by f satisfies

F(μ₁, ..., μₙ) = μ ∈ TN(ℝ)

whenever μ₁, ..., μₙ ∈ TN(ℝ)?

Let μᵢ = (aᵢ, bᵢ, cᵢ, dᵢ) ∈ TN(ℝ), i = 1, ..., n. We know that ∀α ∈ (0, 1], [μᵢ]_α = [aᵢ + (bᵢ − aᵢ)·α, dᵢ − (dᵢ − cᵢ)·α]. Let us put

rᵢ = aᵢ + (bᵢ − aᵢ)·α,   sᵢ = dᵢ − (dᵢ − cᵢ)·α.

Observe that a trapezoidal fuzzy number is determined by the fact that the extremes of its α-cuts vary linearly with α. Thus we have to analyze the α-cuts of μ = F(μ₁, ..., μₙ). But, according to Proposition 4,

[μ]_α = f([μ₁]_α × ... × [μₙ]_α) = {f(x₁, ..., xₙ) : xᵢ ∈ [μᵢ]_α, i = 1, ..., n} = f([r₁, s₁] × ... × [rₙ, sₙ]).

Since f is continuous and increasing, we obtain

[μ]_α = [f(r₁, ..., rₙ), f(s₁, ..., sₙ)].


Case 1. Let us suppose now that f is an n-dimensional weighted mean with weights w₁, ..., wₙ, that is,

f(x₁, ..., xₙ) = Σ_{i=1}^{n} wᵢxᵢ.

Then

f(r₁, ..., rₙ) = Σ_{i=1}^{n} wᵢrᵢ = Σ_{i=1}^{n} wᵢ[(1 − α)aᵢ + αbᵢ]
             = (1 − α) Σ_{i=1}^{n} wᵢaᵢ + α Σ_{i=1}^{n} wᵢbᵢ
             = (1 − α)f(a₁, ..., aₙ) + αf(b₁, ..., bₙ),

and, similarly,

f(s₁, ..., sₙ) = (1 − α)f(d₁, ..., dₙ) + αf(c₁, ..., cₙ).

Thus, we get

[μ]_α = [(1 − α)f(a₁, ..., aₙ) + αf(b₁, ..., bₙ), (1 − α)f(d₁, ..., dₙ) + αf(c₁, ..., cₙ)]
      = [f(a₁, ..., aₙ) + α(f(b₁, ..., bₙ) − f(a₁, ..., aₙ)), f(d₁, ..., dₙ) − α(f(d₁, ..., dₙ) − f(c₁, ..., cₙ))].

Then μ is a trapezoidal fuzzy number.

Case 2. Let us consider now an n-dimensional OWA operator f_w with weights w₁, ..., wₙ, that is,

f_w(x₁, ..., xₙ) = Σ_{i=1}^{n} wᵢ x_[i],

where x_[i] = max_{1≤j₁<...<jᵢ≤n} min(x_{j₁}, ..., x_{jᵢ}) is the i-th biggest value of the list. Then

f_w(r₁, ..., rₙ) = Σ_{i=1}^{n} wᵢ r_[i]   and   f_w(s₁, ..., sₙ) = Σ_{i=1}^{n} wᵢ s_[i].

Let us suppose now that bᵢ − aᵢ = p and dᵢ − cᵢ = q ∀i = 1, ..., n. Then

f_w(r₁, ..., rₙ) = f_w(a₁, ..., aₙ) + pα

and

f_w(s₁, ..., sₙ) = f_w(d₁, ..., dₙ) − qα,


resulting in

[μ]_α = [f_w(a₁, ..., aₙ) + pα, f_w(d₁, ..., dₙ) − qα].

The linearity of the extremes of the interval with respect to α proves that μ is a trapezoidal fuzzy number and, putting α = 0, 1, we obtain

μ = (f_w(a₁, ..., aₙ), f_w(a₁, ..., aₙ) + p, f_w(d₁, ..., dₙ) − q, f_w(d₁, ..., dₙ)).

In general, since rᵢ = aᵢ + (bᵢ − aᵢ)·α = (1 − α)aᵢ + αbᵢ,

f_w(r₁, ..., rₙ) = Σ_{i=1}^{n} wᵢ r_[i] = (1 − α) f_w^{π₁}(a₁, ..., aₙ) + α f_w^{π₂}(b₁, ..., bₙ),

where π₁ and π₂ are two suitable permutations of {1, ..., n} (depending on α) and we define the OWA operator f_w^π to be

f_w^π(x₁, ..., xₙ) = Σ_{i=1}^{n} wᵢ x_[π(i)].

Observe that f_w^π(x₁, ..., xₙ) = Σ_{i=1}^{n} w_{π⁻¹(i)} x_[i]. Thus, we get

[μ]_α = [(1 − α) f_w^{π₁}(a₁, ..., aₙ) + α f_w^{π₂}(b₁, ..., bₙ), (1 − α) f_w^{π₁'}(d₁, ..., dₙ) + α f_w^{π₂'}(c₁, ..., cₙ)]

for suitable permutations π₁', π₂'.

Anyway, in the general case, μ is not a trapezoidal fuzzy number. To see this, let us consider the case n = 2. The OWA operator

f(x₁, x₂) = w₁ max(x₁, x₂) + w₂ min(x₁, x₂)

will take the value

f(r₁, r₂) = w₁r₁ + w₂r₂   if r₁ ≥ r₂,
f(r₁, r₂) = w₁r₂ + w₂r₁   if r₁ ≤ r₂.

If r₁ ≥ r₂,

f(r₁, r₂) = w₁r₁ + w₂r₂
          = w₁((1 − α)a₁ + αb₁) + w₂((1 − α)a₂ + αb₂)
          = [w₁b₁ + w₂b₂ − (w₁a₁ + w₂a₂)]α + w₁a₁ + w₂a₂.


If r₁ ≤ r₂,

f(r₁, r₂) = w₁r₂ + w₂r₁
          = w₁((1 − α)a₂ + αb₂) + w₂((1 − α)a₁ + αb₁)
          = [w₁b₂ + w₂b₁ − (w₁a₂ + w₂a₁)]α + w₁a₂ + w₂a₁.

Consider now two trapezoidal fuzzy numbers with the following first two parameters:

a₁ = 0.2, b₁ = 0.8, a₂ = 0.4, b₂ = 0.6.

Let us take the OWA operator with weights w₁ = 1/3 and w₂ = 2/3. Then a straightforward calculation gives:

f(r₁, r₂) = (0.8 + 1.4α)/3   if α ≤ 1/2,
f(r₁, r₂) = (1 + α)/3        if α ≥ 1/2.

It is clearly seen that μ is not a trapezoidal fuzzy number, because it is not a segment in the strictly increasing interval of its membership function.

In general, it would be a trapezoidal fuzzy number if and only if the slopes of both straight lines, as functions of α, coincide, that is, if and only if

w₁b₁ + w₂b₂ − (w₁a₁ + w₂a₂) = w₁b₂ + w₂b₁ − (w₁a₂ + w₂a₁).

For example, it is known that the arithmetic mean (w₁ = w₂ = 1/2) of two trapezoidal fuzzy numbers gives another trapezoidal fuzzy number. To see whether there are more 2-dimensional OWA operators preserving the trapezoidal fuzzy numbers, let us impose that both slopes coincide:

w₁(b₁ − b₂ + a₂ − a₁) − w₂(b₁ − b₂ + a₂ − a₁) = 0  ⟹  (w₁ − w₂)(b₁ − b₂ + a₂ − a₁) = 0.

As this condition must be verified for all a₁, a₂, b₁, b₂, it follows that w₁ = w₂ = 1/2, that is, the OWA operator is the arithmetic mean.
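The counterexample can also be verified numerically: the left endpoint of the α-cut of the aggregated fuzzy number is piecewise linear in α with a slope change at α = 1/2, so it is not a segment. The short check below uses the same numbers as above.

```python
# Numerical check of the counterexample: OWA with weights (1/3, 2/3) on trapezoids whose
# first parameters are a1=0.2, b1=0.8 and a2=0.4, b2=0.6.
import numpy as np

w1, w2 = 1.0 / 3.0, 2.0 / 3.0
a1, b1, a2, b2 = 0.2, 0.8, 0.4, 0.6

def left_endpoint(alpha):
    r1 = (1 - alpha) * a1 + alpha * b1
    r2 = (1 - alpha) * a2 + alpha * b2
    return w1 * max(r1, r2) + w2 * min(r1, r2)

alphas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
vals = np.array([left_endpoint(a) for a in alphas])
slopes = np.diff(vals) / np.diff(alphas)
print(vals)     # left endpoints of [mu]_alpha
print(slopes)   # slope changes from ~0.467 to ~0.333 at alpha = 1/2 -> not a straight segment
```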

References

1. Alsina, C.; Trillas, E. (1992) Synthesizing Implications. Int. Jour. of Intell. Systems, 7, 705-713.

2. Dubois, D.; Prade, H. (1993) Fuzzy Numbers: An Overview. Readings in Fuzzy Sets for Intelligent Systems, Vol. I, pp. 3-39, Morgan Kaufmann, San Mateo, CA.

3. Klir, G. J.; Yuan, B. (1995) Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, Inc., New York.


4. Lowen, R. (1996) Fuzzy Set Theory. Kluwer Academic Publishers.

5. Novak, V. (1989) Fuzzy Sets and Their Applications. Adam Hilger, Bristol.

6. Mayor, G.; Calvo, T. (1997) On Extended Aggregation Functions. Proceedings of the IFSA Congress, 1, 281-285.

7. Mayor, G.; Trillas, E. (1986) On the Representation of Some Aggregation Functions. Proceedings of the XVIth IEEE International Symposium on Multiple-Valued Logic, 110-114.

8. Yager, R.R. (1988) On Ordered Weighted Averaging Operators in Multi-criteria Decision Making. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

9. Zadeh, L.A. (1973) The Concept of a linguistic variable and its Application to Approximate Reasoning I, II, III. Reprinted in Yager, R.R. et al. (1987) Fuzzy Sets and Applications: Selected Papers by L. A. Zadeh. John Wiley & Sons, Canada.


On Optimal Fuzzy Information Granulation

A. Ryjov

Lomonosov' Moscow State University, 119899 Moscow, Russia

1. On Optimal Fuzzy Information Granulation

1.1 Introduction

Information granulation is one of the basic concept of human cognition. L.A. Zadeh defined the subject of the theory of fuzzy information granulation by the following way "The theory of fuzzy information granulation (TFIG) is inspired by the ways in which humans granulate information and reason with it. However, the foundations of TFIG and its methodology are mathematical in nature" [7].

One of the mathematical task concerning to the TFIG - the procedure of an optimal fuzzy information granulation - has been described in this chapter. It is assumed that the person describes the properties of a real objects in the form of linguistic values. The subjective degree of convenience of such a description depends on the selection and the composition of such linguistic values (fuzzy granules). Let us explain this on a model example.

Example 1.1. Let it be required to evaluate the height of a man. Let us consider two extreme situations.

Situation 1. It is permitted to use only two values: "small" and "high". Situation 2. It is permitted to use many values: "very small", "not very

high", ... , "not small and not high", ... , "very high" . Situation 1 is inconvenient. In fact, for many people both the permitted

values may be unsuitable and, in describing them, we select between two "bad" values.

Situation 2 is also inconvenient. In fact, in describing height of a man, several of the permitted values may be suitable. We again experience a prob­lem but now due to the fact that we are forced to select between two or more "good" values. Could a set of linguistic values be optimal in this case?

One object may be described by different experts (persons). Therefore it is desirable to have assurances that the different experts describe one and the same object in the most "uniform" way.

On the basis of the above we may formulate the first problem as follows:

Problem 1.1. Is it possible, taking into account certain features ofthe man's perception of objects of the real world and their description, to formulate a rule for selection of the optimum set of val ues of characteristics on the basis of which these objects may be described? Two optimality criteria are possible:


Criterion 1. We regard as optimum those sets of values through whose use man experiences the minimum uncertainty in describing objects.

Criterion 2. If the object is described by a certain number of experts, then we regard as optimum those sets of values which provide the minimum degree of divergence of the descriptions.

This problem may be reformulated as a problem of construction of an optimal information granulation procedure from point of view of criterion 1 and criterion 2. It is shown that we can formulate a method of selecting the optimum set of values of qualitative indications (collection of fuzzy granules [7]). Moreover, it is shown that such a method is stable, i.e. the natural small errors that may occur in constructing the membership functions do not have a significant influence on the selection of the optimum set of values. The sets which are optimal according to criteria 1 and 2 coincide. The results obtained are described in Sect. 1.3.

What gives us the optimal fuzzy information granulation for a solution of the practical tasks? For the answer to this question let us assume that the human's description of an objects make a data base of some data management system. In this connection the following problem arises.

Problem 1.2. Is it possible to define the indices of quality of information re­trieval in fuzzy (linguistic) databases and to formulate a rule for the selection of such a set of linguistic values, use of which would provide the maximum indices of quality of information retrieval?

This problem may be reformulated as a problem of construction of an optimal information granulation procedure from point of view of information retrieval in fuzzy (linguistic) databases.

It is shown that it is possible to introduce indices of the quality of infor­mation retrieval in fuzzy (linguistic) databases and to formalize them. It is shown that it is possible to formulate a method of selecting the optimum set of values of qualitative indications (collection of fuzzy granules [7]) which provides the maximum quality indices of information retrieval. Moreover, it is shown that such a method is stable, i.e. the natural small errors in the construction of the membership functions do not have a significant effect on the selection of the optimum set of values. The results obtained are shown in Sect. 1.4.

1.2 The Concept of a Complete Orthogonal Collection of Fuzzy Granules

For a strict formulation and proof of the enumerated results it is necessary to retain the basic concepts of the theory of fuzzy sets which we require and to define a new one.

The fuzzy variable [5] is a group of three


(a, U, G),

where a is the name of the fuzzy variable; U is the universal set (field of definition); G are the limitations on the possible values (sense) of the variable A. Thus, the fuzzy variable is the named fuzzy set. The linguistic variable [5] is a group of five.

(A,T(A),U, V,M),

where A is the name of the linguistic variable; T(A) is the term-set of the linguistic variable A, i.e. the set of its linguistic

values; U is the universal set, in which the values of the linguistic variable A are

defined; V is the syntactic rule generating the values of the linguistic variable A

(this often has grammatical form); M is the semantic rule which assigns to each element of T(A) its "sense"

as a fuzzy sub-set U. For specifying the rule V it is usual to give the basic values (small quantity,

medium quantity, significant quantity, etc.), the modifiers (not, very, a bit, etc.) and the rules of formation of the elements of the term-set ( it not small quantity, very small quantity, not very small quantity, etc.).

For specifying the rule M it is assumed that the membership functions of the basic terms are given and also that one knows how the modifiers act on the membership functions of the basic values. For example, in [5] it is proposed that, if the membership function I'A(U) of a certain basic value A is known, the membership functions of the linguistic values" notA", and" veryA" can be calculated as follows: I'notA(U) = 1- I'A(U), I'veryA(U) = I'i(u).

Thus, it was assumed that, having specified the membership functions of several basic values and having selected the formalisms of the logic operations " and" , "or" and" not" , we may" calculate" the membership functions of any elements of the term-set T(A). Moreover, we can "calculate the sense" of any statement by performing the relevant operations under the membership functions of the concepts of different linguistic variables. This "sense" may be represented either in the form of a membership function or in the form of the linguistic value best approximating the resultant membership function.

However, it did not prove possible to carry out this programme in full. Investigations performed in this area provided the basis for a large number of theoretical and practical results. They have received new impetus in the light of the idea of granularity computing [7] and computing with words [6] propounded in recent years.

One of the problems of developing the linguistic variable theory in full lay in its too broad definition. It is well known that the fewer limitations there are in any theory, the more difficult it is to obtain significant results from it.


A particular case of a linguistic variable having a wide spectrum of prac­tical applications can be a set of fuzzy variables describing a certain concept (i.e. a linguistic variable with a fixed term-set). Such structures can be inter­preted, in particular, as a set of linguistic values describing some characteris­tics of an objects. For such structures, a series of natural conditions may need to be fulfilled within the framework of which problems 1 and 2 formulated above (Sect. 1.1) can be solved.

Let us consider t fuzzy variables with the names a₁, a₂, ..., a_t, specified in one universal set. We shall call such a set the collection of fuzzy granules (CFG) s_t.

Let us introduce a system of limitations for the membership functions of the fuzzy variables comprising s_t. For the sake of simplicity, we shall designate the membership function of a_j as μ_j. We shall consider that:

(1) ∀j (1 ≤ j ≤ t) ∃U_j¹ ≠ ∅, where U_j¹ = {u ∈ U : μ_j(u) = 1}, and U_j¹ is an interval or a point;

(2) ∀j (1 ≤ j ≤ t) μ_j(u) does not decrease on the left of U_j¹ and does not increase on the right of U_j¹ (since, according to (1), U_j¹ is an interval or a point, the concepts "on the left" and "on the right" are determined unambiguously).

Requirements (1) and (2) are quite natural for membership functions of concepts forming a CFG. In fact, the first one signifies that, for any concept used in the universal set, there exists at least one object which is standard for the given concept. If there are many such standards, they are positioned in a series and are not "scattered" around the universe. The second requirement signifies that, if the objects are "similar" in the metrics sense in a universal set, they are also "similar" in the sense of membership of a certain concept.

Henceforth, we shall need to use the characteristic functions as well as the membership functions, and so we shall need to fulfil the following technical condition:

(3) ∀j (1 ≤ j ≤ t) μ_j(u) has no more than two points of discontinuity of the first kind.

For simplicity let us designate the requirements (1)-(3) as L. Let us also introduce a system of limitations for the sets of membership functions of fuzzy variables comprising s_t. Thus, we may consider that:

(4) ∀u ∈ U ∃j (1 ≤ j ≤ t) : μ_j(u) ≠ 0;

(5) ∀u ∈ U Σ_{j=1}^{t} μ_j(u) = 1.

Requirements (4) and (5) also have quite a natural interpretation. Re­quirement (4), designated the completeness requirement, signifies that for any object from the universal set there exists at least one concept to which it may belong. This means that in our CFG there are no "holes". Require­ment (5), designated the orthogonality requirement, signifies that we do not permit the use of semantically similar concepts or synonyms, and we require sufficient distinction of the concepts used. Note also that this requirements is


often fulfilled or not fulfilled depending on the method used for constructing the membership functions of the concepts forming the CFG. Thus, for ex­ample, if we have a certain number of experts, present them with an object u E U and permit only the answers "Yes, u E a/, and" No, u ¢ a/' (the answer "] do not know" is not permitted), and as the value of the member­ship function J.lj (u) we take the ratio of the number of experts answering positively to the total number of experts, this requirement is automatically fulfilled. Note also that all the results given below are justified with a certain weakening of the orthogonality requirement [2], but for its description it is necessary to introduce a series of additional concepts. Therefore let us dwell on this requirement.

For simplicity we shall designate requirements (4) and (5) as G. We shall term a CFG consisting of fuzzy variables whose membership functions satisfy requirements (1)-(3), and whose populations satisfy requirements (4) and (5), a complete orthogonal collection of fuzzy granules (COCFG) and denote it G(L).
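The completeness and orthogonality requirements are easy to verify numerically for a candidate collection of granules; the sketch below does so on a sampled universe U = [0, 10] with three assumed trapezoidal terms.

```python
# Numerical check of requirement (4) (completeness) and requirement (5) (orthogonality)
# for a collection of fuzzy granules, on a grid over U (terms are illustrative assumptions).
import numpy as np

def trapezoid(u, a, b, c, d):
    left = np.ones_like(u) if a == b else (u - a) / (b - a)
    right = np.ones_like(u) if c == d else (d - u) / (d - c)
    return np.clip(np.minimum(np.minimum(left, right), 1.0), 0.0, 1.0)

u = np.linspace(0.0, 10.0, 1001)
terms = {
    "low":    trapezoid(u, 0, 0, 2, 5),
    "medium": trapezoid(u, 2, 5, 5, 8),
    "high":   trapezoid(u, 5, 8, 10, 10),
}
M = np.vstack(list(terms.values()))
complete = bool(np.all(M.max(axis=0) > 0))          # requirement (4): no "holes"
orthogonal = bool(np.allclose(M.sum(axis=0), 1.0))  # requirement (5): memberships sum to 1
print(complete, orthogonal)
```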

1.3 Choosing of an Optimal Collection of Fuzzy Granules

The set of linguistic values (fuzzy granules) describing some characteristics of an objects may be represented as G(L).

As can be seen from example 1.1 (Sect. 1.1), the different CFG corre­sponding to the degree of inconsistency of the obtained information and the achievement have a different degree of internal uncertainty. Is it possible to measure this degree of uncertainty? For complete orthogonal CFG the answer to this question is yes.

To prove this fact and derive a corresponding formula, we need to intro­duce a series of additional concepts.

Let there be a certain population of t membership functions St E G(L). Let St = {J.ldu), J.l2(U), ... , J.lt(u)}.

Let us designate the population of t characteristic functions {h₁(u), h₂(u), ..., h_t(u)} as the most similar population of characteristic functions if

h_i(u) = 1 if max_{1≤j≤t} μ_j(u) = μ_i(u), and h_i(u) = 0 otherwise.    (1.1)

It is not difficult to see that, if the COCFG consists not of membership functions but of characteristic functions (i.e. our collection is the complete orthogonal collection of crisp granules - COCCG), then no uncertainty will arise when describing objects in it.

The expert unambiguously chooses the term at, if the object is in the corresponding region of the universal set. Some experts describe one and the same object with one and the same term. This situation may be illustrated


as follows [4]. Let us assume that we have scales of a certain accuracy and we have the opportunity to weigh a certain material. Moreover, we have agreed that, if the weight of the material falls within a certain range, it belongs to one of the categories. Then we shall have the situation accurately described. The problem lies in the fact that for our task there are no such scales nor do we have the opportunity to weigh on them the objects of interest to us.

However we can assume that, ofthe two COCFG, the one having the least uncertainty will be that which is most "similar" to the collection consisting of the populations of characteristic functions (COCCG).

In mathematics distance can be a degree of similarity. Is it possible to introduce distance among CFG? For COCFG it is possible.

Let us recall the concept of distance. Let there be a certain set X. As distance in X we designate the two­

position function d, defined by the elements of X with a set of values in the range of non-negative real numbers (i.e. d : X x X -t D+), satisfying the following conditions (distance axioms): '<Ix, y, z EX: l)d(x, x) = 0 2)d(x,y) = d(y,x) 3)d(x, z) ::; d(x, y) + d(y, z).

Since we are considering COCFG, the membership functions of the con­cepts employed satisfy the conditions L. This is a subset of the well-known space of the functions integrated over a certain interval U. It is well known that it is possible to introduce distance in this space, for example as follows:

dU,g) = J If(u) -g(u)ldu. u

We can introduce distance in G(L) with the aid oflemma 1.

Lemma 1.1. Let St E G(L), s~ E G(L), St = {J-ll (u), J-l2(U), ... , J-lt(u)}, s~ = {J-l~ (u), J-l~(u), ... , J-l~(u)}, uU, g) - be the metrics in L.

Then

t

d(St,SD = 2:U(J-lj,J-lj) j=l

is the metrics in G(L).

Proof. To proof the lemma we need to check up three distance axioms. 1) d(st, s~) = 0 ¢> '<Ij(l::; j::; t) U(J-lj,J-lj) = 0

¢> '<Ij(l::; j ::; t) J-lj = J-lj I

¢> St = St, tit I I

2) d(st,s~) I: U(J-lj,J-lj) = I: U(J-lj,J-lj) = d(st,st) j=l j=l

(1.2)


jt1 e(f.lj,f.l~) ~ jt1 [e(f.lj,f.l;) + e(f.l;'f.l~)] t /I t II I

L e(f.lj, f.lj) + L e(f.lj' f.lj) j=l j=l

d(st, s~') + d(s~', s~). The semantic statements formulated by us in the analysis of the collection

consisting of the populations of characteristic functions s̄_t (1.1) may be formalized as follows.

Let s_t ∈ G(L). For the measure of uncertainty of s_t we shall take the value of a functional ξ(s_t), defined on the elements of G(L) and assuming values in [0, 1] (i.e. ξ : G(L) → [0, 1]), satisfying the following conditions (axioms):

A1. ξ(s_t) = 0 if s_t is a COCCG;
A2. Let s_t, s′_{t′} ∈ G(L), where t and t′ may or may not be equal to each other. Then ξ(s_t) ≤ ξ(s′_{t′}) if d(s_t, s̄_t) ≤ d(s′_{t′}, s̄′_{t′}). (Let us recall that s̄_t is the COCCG determined by (1.1) closest to s_t.)

Do such functionals exist? The answer to this question is given by the following theorem.

Theorem 1.1. (Theorem of existence). Let s_t ∈ G(L). Then the functional

ξ(s_t) = (1/|U|) ∫_U f( μ_{j₁*}(u) − μ_{j₂*}(u) ) du,    (1.3)

where

μ_{j₁*}(u) = max_{1≤j≤t} μ_j(u),   μ_{j₂*}(u) = max_{1≤j≤t, j≠j₁*} μ_j(u),    (1.4)

and f satisfies the following conditions:

F1: f(0) = 1, f(1) = 0;
F2: f does not increase,

is a measure of uncertainty of s_t, i.e. it satisfies the axioms A1 and A2.

Proof. Let 1J( St, u) is the sub-integral function in ( 1.3). In this case: 1. e(st} = 0 ¢:> \:Iu E U1J(St,u) = 0 ¢:> \:Iu E U f.lj.(u) - f.lj.(u) = 1

1 2

¢:> \:Iu E U 3j(1 ~ j ~ t): f.lj(U) = 1,f.lj(u) = OW i- j.


2. e(St) ::; e(s;/) {:} IdII JUI f (p:~ (u1) - ph (u1)) du1

::; rJ.r JU2 f (p;~ (u2) - P;; (u2)) du2

{:} rr:hr JUI (p;~ (u1) - ph (u1)) du1 2:: Id21 JU2 (p;~(u2) - p;;(u2)) du2

{:} IdIIJUI (1- (pI~(Ul)-ph(Ul)))dul ::; rJ.r JU2 (1 - (p;~ (u 2 ) - P;; (u 2 )) ) du 2

{:} rr:hr JUI [( 1- P;~ (u1)) + (ph (u1) - 0)] du1 ::; rthr JU2 [ ( 1 - P;~ (u 2 ))

+ (p;; (u 2 ) - 0) ] du2

{:} rr:hr JUI [( hh(u1) - pI~ (u1))

+ (pl;(u1) - hh(u1))]du1

::; rJ.r JU2 [( h;~ (u 2 ) - p;~ (u 2 ))

+ (p;;(u 2 ) -h;;(u2 ))]du2 .


In these formulas the second equivalence is a consequence of the definition of the function f (F2 requirement), the third equivalence is a consequence of inequalities 0 ::; Pi~ (u) - Pi; (u) ::; 1 Vu E U, the substitution 1 and 0 to h with indexes (the final equivalence) is a consequence of the definitions of h and Pidu), Pi;(U) ( 1.4).

The final inequality we can transcribe as

d(hi., pi.) + d(hi., pi.) < d(h;., p;.) + d(h;., p;.), 11 22- 11 22

where d(h,p) = J Ih(u) - p(u)ldu is a measure in L. U

The final inequality is a distance between St and St (see Lemma 1). So, if the conditions of the theorems are achieved then a measure exists

for which 1 ,-

e(St) ::; e(St /)' if d(st, 8;) ::; d(Stl' S:/).

There are many functionals satisfying the conditions of Theorem 1.1. They are described in sufficient detail in [2]. The simplest of them is the functional in which the function f is linear. It is not difficult to see that conditions F1 and F2 are satisfied by the sole linear function f(x) = 1 − x. Substituting it in (1.3), we obtain the following simplest measure of uncertainty of the COCFG:

ξ(s_t) = (1/|U|) ∫_U ( 1 − (μ_{j₁*}(u) − μ_{j₂*}(u)) ) du,    (1.5)

where μ_{j₁*}(u), μ_{j₂*}(u) are determined by the relations (1.4). Let us denote the sub-integral function in (1.5) by η(s_t, u):


η(s_t, u) = 1 − (μ_{j₁*}(u) − μ_{j₂*}(u)).

Now we may give the following interpretation of the measure of uncertainty (1.5).

Interpretation. Let us consider the process of describing objects in the framework of the CFG s₃ ∈ G(L) (Fig. 1.1).

Fig. 1.1. The interpretation of the functional ξ(s_t).

For the objects u₁ and u₅, a man will without hesitation select one of the terms (a₁ and a₃ respectively). For the object u₂ the user starts selecting between the terms a₁ and a₂. This hesitation increases and attains its peak for the object u₄: at this point the terms a₁ and a₂ are indistinguishable. If we remember the procedure for constructing the membership functions described in the analysis of the orthogonality characteristic (Sect. 1.2), we can also confirm that all the experts will be unanimous in describing the objects u₁ and u₅, while in describing u₂ a certain divergence will arise which attains its peak for the object u₄.

Let us now consider formula (1.5). It is not difficult to see that

0 = η(s_t, u₅) = η(s_t, u₁) < η(s_t, u₂) < η(s_t, u₃) < η(s_t, u₄) = 1

(η(s_t, u_j) is equal to the length of the dotted line at point u_j). Thus, η(s_t, u) actually reflects the degree of uncertainty which a man experiences in describing objects in the framework of the corresponding CFG, or the degree of divergence of opinion of the experts in such a description.

Then the degree of fuzziness ξ(s_t) (1.5) is an average measure of such uncertainty in describing all the objects of the universal set.
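A small numerical sketch of the measure (1.5) is given below; the universe, the grid and the three trapezoidal terms are assumptions used only to illustrate the computation.

```python
# The simplest uncertainty measure (1.5) of a COCFG, evaluated numerically on a grid over U.
import numpy as np

def trapezoid(u, a, b, c, d):
    left = np.ones_like(u) if a == b else (u - a) / (b - a)
    right = np.ones_like(u) if c == d else (d - u) / (d - c)
    return np.clip(np.minimum(np.minimum(left, right), 1.0), 0.0, 1.0)

def uncertainty_degree(memberships, u):
    """xi(s_t) = (1/|U|) * integral of 1 - (largest membership - second largest)."""
    M = np.sort(np.vstack(memberships), axis=0)   # ascending along the term axis
    eta = 1.0 - (M[-1, :] - M[-2, :])             # 1 - (mu_j1*(u) - mu_j2*(u))
    return np.trapz(eta, u) / (u[-1] - u[0])

u = np.linspace(0.0, 10.0, 2001)
terms = [trapezoid(u, 0, 0, 2, 5), trapezoid(u, 2, 5, 5, 8), trapezoid(u, 5, 8, 10, 10)]
print(uncertainty_degree(terms, u))   # roughly 0.3 for this granulation
```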

Section 1.3 provides a series of theorems showing that the degree of uncer­tainty ( 1.5) possesses a number of characteristics natural for the uncertainty of description of objects. Moreover, it is shown that in the particular case of an elementary COCFG we obtain a degree of fuzziness of the set which has been well studied in the context of fuzzy set theory.

An important aspect of the practical use of any model is its stability. It is quite natural that in identifying the parameters of the model (in our


case when constructing the membership functions) small measuring errors can occur. If the model is sensitive to such errors, then its practical use is very problematic. Let us consider that the membership functions in our case are not given accurately but have a certain "accuracy" 8. It can be shown (see Sect. 1.3) that the measure of uncertainty ( 1.5) is stable. This means that the proposed method may be employed in practical exercises.

The Several Properties of our Degree of Uncertainty

Let us define the following subset of function set L:

- L is a set of functions from L, which are part-linear and linear on

- i is a set of functions from L, which are part-linear on U (including U). - d -

Theorem 1.2. Let s_t ∈ G(L̄). Then ξ(s_t) = d/(2|U|), where d = Σ_{j=2}^{t} (u_{j−1,R} − u_{j,L}).

Proof. At first we examine the simplest case t = 2 (Fig. 1.2).

Fig. 1.2. The picture of the simplest case t = 2.

Let fix two points U2L (the left non-zero frontier of Jla,(u)) and U1R (the right non-zero frontier of Jlal (u)). A value of the integral ( 1.5) is not equal to zero only on segment [U2L' U1R]. Thus,

e(S2) I~I L (1- (Jli~(U) - Jli;(U))) du

1~ll:~R (1- (Jli~(U) - Jli;(U))) duo

If we use elementary formulas of the geometry we can write:


Po,(u) = {

Po,(u) = {

where d = UIR - un.

1, ~(UIR - u),

0,

0, ~(u - un),

1,

ifu:S un ifun :S u :S UIR ,

ifu 2: UIR

if u :S un ifun :S u :S UIR ,

if u 2: UIR

(1.6)

(1.7)

If we substitute ( 1.6), ( 1.7) in ( 1.5) and remember ( 1.4) then we can write:

For the proof of the theorem in the general case t > 2 we must to repeat ours reasoning for all segments of uncertainty [Uj,L, Uj-l,R] (2 :S j :S t).

Theorem 1.3. Let s_t ∈ G(L̂). Then ξ(s_t) = c·d/|U|, where d is defined as in Theorem 1.2; c < 1, c = const.

We do not prove this theorem because it is rather evident. Since every s_t ∈ G(L) may be approximated (with arbitrarily high accuracy) using a system of sets from G(L̂), the same relation holds for every s_t ∈ G(L).

Let 9 is some one-to-one function, which is defined on U. This function is induced transformation of some COCFG St E G(L) on universum U to COCFG g(st) on universum U', where U' = g(U) = {u': u' = g(u), u E U}. The above induction is defined by following way:

g( St) is a set of membership functions VA (u'), ... ,Jl~ (u')}, where Jlj (u') = Jlj(g(u)) = Jlj(g-l(u')) = Jlj(u), Jlj(u) ESt, 1 :S j:S t.


The following example illustrates this definition.

Example 1.2. Let s_t ∈ G(L), let U be the universe of s_t, and let g be an expansion (compression) of the universe U. In this case, g(s_t) is the set of functions produced from s_t by the same expansion (compression).

Theorem 1.4. Let s_t ∈ G(L), let U be the universe of s_t, and let g be some linear one-to-one function defined on U, with ξ(s_t) ≠ 0. Then ξ(s_t) = ξ(g(s_t)).

Proof. At first we examine the simplest case t = 2 (see the proof of the theorem 1.2). If S2 E G(L) then (see the orthogonality requirement (5) in Sect. 1.2) J.la1 (u) = 1- J.la2(U) 't/u E U. Thus

_1 {"lR (1- (J.li~(U) _ J.li;(U))) du lUilu2L

1 [r;;llW WI }U2L (1 - (J.l a1 (u) - J.la2 (u))) du

+ j::R 1 (1- (J.la2(U) - J.la1 (u))) dU] l" a1(2)

1 [r;;llW WI }U2L (1- ((1- J.la2(U)) - J.la2(U))) du

+ j::R 1 (1 - ((1- J.la1 (u)) - J.la1 (u))) dU] l" a 1 (2)

I ~I [1:;;ll(t) J.la2(u)du+ j::R 1 J.la1(U)du]. U2L l"a1(2)

If we replicate ours reasonings for s~ = g(st) we can write

2 [lg(l";;ll (t)) 19(U1R) ] e(s~) = -I -(-)1 J.l~2(u')du' + J.l~1 (u')du' .

9 U g(U2L) g(l";;ll (t))

(1.9)

(1.10)

If we do a replacement of the variable u' = 9 ( u) then ( 1.10) we can rewrite

2 [l g (l-';;ll (t))

Ig(U)1 g(U2L) J.l~2 (g(u)) dg(u)

19(U1R) ] + -1 1 J.l~1 (g(u)) dg(u)

g(l-' a1 (,.))

(1.11)

2 [ r;;ll m rUlR ] r-(U) I }" J.l a2(u)g'(u)du + }/J-1 1. J.la1 (u)g'(u)du . 9 U2L l-'a1 (2)


The final equality in ( 1.11) is a consequence of the definition of g(St). The equality {(S2) = {(g(S2)) is equivalent to an equality

(1.12)

The equality ( 1.12) we can rewrite as

r-;;llW (1 g'(u) ) l ulR (1 g'(u) ) JU2L J.l a2(U) WI- Ig(U) I du+ l';llW J.lal (u) WI- Ig(U)1 du = O.

(1.13) If g(u) is a linear function then g(u) = ku + a, where k, a is a constant.

The value of g'(u) is equal to

'( ) - k - g(U2) - g(u!) \.J E U 9 u - - VU2, Ul , U2 - Ul

and, in particular,

g'(u) = Igl~)I. If we use the final equality we can write

1 g'(u) WI- Ig(U)1 = O.

Thus the equality ( 1.13) is true. For the proof of the theorem in the general case t > 2 we must to repeat

ours reasoning for all segments of uncertainty [Uj,L, Uj-l,R] (2:::; j :::; t).

This property means that some human describes different-type objects using some set of term with equal difficulties, if physical parameters of ob­jects of one type may be produced from the parameters of objects of other type by some linear transformation. For example, using the set of terms {high, medium, low}, we describe people, trees, buildings and etc. with equal difficulties; using the set of terms { very near, near, not near, far away} we de­scribe distances between molecules, distances between town street, distances between towns on a map and etc. with equal difficulties.

The fuzziness degree of single fuzzy set induced by {(st} is defined as fuzziness degree of a trivial COCFG, determined with a fuzzy set J.I(u):

(1.14)

It's easy proved, that ( 1.14) satisfies all the axioms for the single set fuzziness degree [1]. It can be shown that the introduced in the report more general notion {(st) had been correctly defined.


The Stability of our Degree of Uncertainty

Here, we present the results of the analysis of our model, when the mem­bership functions which are members of the given CFG, are not given with absolute precision, but with some maximal inaccuracy 8 (Fig. 1.3).

Fig. 1.3. The picture of the δ-model.

Let us call this particular situation the δ-model and denote it by G^δ(L). Let us express the parameters of the δ-model (δ₁ and δ₂) as a function of δ. Using elementary geometrical reasoning and Fig. 1.3, we can write δ₁ = δ·√(1 + d²), δ₂ = (δ/d)·√(1 + d²), and, as a result,

δ₁ = d·δ₂.    (1.15)

In this situat ion we can calculate the top and bottom valuations of the degree of fuzziness. The COCFG with the minimum and maximum degree of fuzziness is given in Fig. 1.4 and 1.5 correspondly.

Let's use the following denomination for formalization of this functions:

Then

where

q = {R,L},

_ {R, q= L, if q = L, if q = R,

fj _ t. [UjL + Uj-1,R _ 151 UjL + Uj-1,R 81 ] - UJ =2 2 2 ' 2 + 2

(1.16)

(1.17)

(1.18)

(1.19)

Page 381: Data Mining, Rough Sets and Granular Computing

378

J..L

1

u

Fig. 1.4. The COCFG with the minimun degree of fuzziness.

J..L

1

u

Fig. 1.5. The COCFG with the maximum degree of fuzziness.

Page 382: Data Mining, Rough Sets and Granular Computing

379

(1.20)

{ R, P~ (u) ? P~ (u)

q = L £() R()' , Pi" U > Pi" u 1 1

{ 1- (p~(U) - p~(u)), u E U\U 'ij(St, u) =

0.5, u E U (1.21)

By analogy ( 1.3) the top and bottom valuations of the degree of uncer­tainty ~(St) can be written as:

- 1 1 ~(St) = WI U 'ij(St, u)du. (1.22)

{(sd = I~I L !l.(St, u)du, (1.23)

Theorem 1.5. Let S2 E GoeL). Then

t( ) = d (1 - 62 )2 C( ) = d (1 + 262 ) ~ S2 21UI ,<" S2 21UI·

Proof We can present ~(S2) by analogy of the proof of the theorem 1.2 by the following way: -

2 r'i2L+~ 4 lU2L+~ -lUI jiT 6 du + dlUI _ 6 (u - U2L) du

U2L+T U2L+T

2 iU2L+~ -- (d-ISt)du

dlUl U2L+~ (1.24)

Page 383: Data Mining, Rough Sets and Granular Computing

380

_2 [(~_(5t)+2 ftZdZ_(d+c5I)(~_c51)] dlUl 2 2 J~ 2 2

2

= dl~1 [(~- c5;) (d-d-c5I)+2~ (~ - c51)]

dl~1 [(~_c5;)(_c5d+(~+c5;) (~_c5;)] _2 [(~_c51)(~_c51)]=_2 (d-c5I)2 dlUl 2 2 2 2 dlUI 4 d (1 - c52)2

21U1 During the evaluation of ( 1.24) the replacement of the variables z =

U - U2L and equation c51 = dc52 ( 1.15) has been used. By analogy of calculations {(S2) we can present e(S2) by the following

way:

2 [J..U2L+t+~ J..U2L+t-~ dl I _ 6 ddu+ _ 6 2(u- U2L)du (1.25)

U U2L-~ U2L-~

lU2L+t-~ ] 2 c51 - (d-c5I)du +--

U2L-~ lUI 2

2 [d It-~ d] c51 - d-+2 zdz-(d-c5I)- +-dlUl 2 _!l.. 2 lUI 2

2 [d d(d)] c51 d c51 dlUI 2 (d - d - c5I) + 2 2 - c51 + Wi = 21U1 + Wi

d dc52 d 21U1 + WI = 21UI (1 + 2c52).

During the evaluation of ( 1.25) the replacement of the variables z = u - U2L and equation c51 = dc52 ( 1.15) has been used.

Page 384: Data Mining, Rough Sets and Granular Computing

381

Theorem 1.6. Let St E GO (I). Then

D (1 - 82)2 {(st) = 21U1 ' (1.26)

C( ) = D (1 + 282 ) .. St 21UI' (1.27)

where D = L:~:i dj ,j +1 , dj ,i+l = UjR - Uj+1,L.

Proof of the theorem 1.6 is analogous by the proof of the theorems 1.2 and 1.5.

By comparing the results of the theorem 1.2 and theorems 1.5 and 1.6, we see that for small significances 8, the main laws of our model are preserved. Therefore, we can use our technique of estimation of the degree of fuzziness in practical tasks, since we have shown it to be stable.

Based on the results of Sect. 1.3 and 1.3 we can formulate the following method of optimal fuzzy information granulation:

1. All the "reasonable" sets of linguistic values are formulated; 2. Each of such sets is represented in the form of a COCFG; 3. For each set the measure of uncertainty is calculated ( 1.5); 4. As the optimum set minimizing both the uncertainty in the description

of objects and the degree of divergence of opinions of experts we select the one, the uncertainty of which is minimal.

Following this method, we may describe objects with minimum possible uncertainty, i.e. guamntee optimum operation of the human-computer infor­mation intelligent system from this point of view.

1.4 Models of Information Retrieval in Fuzzy (Linguistic) Data Bases

Let consider some human-computer information systems. The user's estima­tions of the accessible information objects are store in a database of system and are used for an evaluation of the current status of an information prob­lem and for a forecasting of its development. In this sense the database of system is a basis of information model of a subject area. The quality of this basis (and, accordingly, model of the problem) is expressed, in particular, through parameters of the information retrieval. If the database containing the linguistic descriptions of objects of a subject area allows to carry out qualitative and effective search of the relevant information then the system of information monitoring will work also qualitatively and effectively.

Page 385: Data Mining, Rough Sets and Granular Computing

382

As well as in Sect. 1.2, we shall consider that the set of the linguis­tic meanings describing a characteristics of an objects can be submitted as COCFG G(L).

In our study of the process of information searches in data bases whose objects have a linguistic description, we introduced the concepts of loss of information ~ x (U) and of information noise !Ii x (U). These concepts apply to information searches in these data bases, whose attributes have a set of significances X, which are modelled by the fuzzy sets in St • The meaning of these concepts can informally be described as follows. While interacting with the system, a user formulates his query for objects satisfying certain linguistic characteristics, and gets an answer according to his search request. If he knows the real (not the linguistic) values of the characteristics, he would probably delete some of the objects returned by the system (information noise), and he would probably add some others from the data base (information losses), not returned by the system. Information noise and information losses have their origin in the fuzziness of the linguistic descriptions of the characteristics.

These concepts can be formalized as follows. Let's consider the case t = 2 when the set of significances of the charac­

teristics has the following type (Fig. 1.6). Let's fix the number u* E U and introduce following denotes:

~ ~2~) 1~----~-,-------=~ __ ~

~a 2(U") 1-----------'''''''-.

~(U")

u" u

Fig. 1.6. The set of significances of the characteristics: case t = 2.

- N(u*) is the number of objects, the descriptions of which are stored in the data base, that possess a real (physical, not linguistic) significance equal to u*;

- N E - the number of users of the system.

Then

- Na1(u*) = fLal(U*)N(u*) - the number of data base descriptions, which have real meaning of some characteristic equal u* and is described by source of information as al;

Page 386: Data Mining, Rough Sets and Granular Computing

383

- Na2 (u*) = /-la2(u*)N(u*) - the number of the objects, which are described as a2;

- Nf, (u*) = /-la, (u*)NE - the number of the system's users who believe that u* is al;

- N!,(u*) = /-la2(u*)NE - the number ofthe users who believe that u* is a2.

Note that for St E G(L)) taking into account the requirement of orthog­onality (Sect. 1.2) the following matching are right:

Na, (u*) + Na2 (u*) = /-la, (u*)N(u*) + /-la2(u*)N(u*) = N(u*),

Nf, (u*) + N!,(u*) = /-la, (u*)NE + /-la2(u*)N E = N E .

That's why under the request "To find all objects which have a meaning of an attribute, equal al " (let's designate it as (1(0) = al») the user gets Na, (u*) descriptions of objects with real meaning of search characteristic is equal to u*. Under these circumstances Nf, (u*) users do not get Na2 (u*) object descriptions (they carry loses). It goes about descriptions of objects which have the meaning of characteristic equal u*, but described by sources as a2. By analogy the rest N!, (u*) users get noise (" unnecessary" descriptions in the volume of given Na , (u*) descriptions).

A verage individual loses for users in the point u* under the request are equal

(1.28)

By analogy average individual noises in the point u*

(1.29)

Average individual information loses and noises, given under analysed request (CPa, (U) and Wa, (U) accordingly) are naturally defined as

It's obvious that

cpa,(U) = I~I L tpa,(u)du,

Wa, (U) = I~I L 1/!a, (u)du.

(1.30)

(1.31)

(1.32)

Page 387: Data Mining, Rough Sets and Granular Computing

384

By analogy for the request (/(0) = a2) or from symmetry considerations we can get that in this case average loses and noises are equal (4)a2(U) = tlia2 (U)) too and are equal the right part of ( 1.32). Under information loses and noises appearing during some actions with characteristic which has the set of significance X = {al, a2} (4)x(U) and tlix(U)) we naturally understand

(1.33)

(1.34)

where Pi(i = 1,2) - the probability of some request offering in some i­meaning of the characteristic.

It's obvious that as PI + P2 = 1, then

(1.35)

Let's consider a general case t > 2. In this case area of integration U can be presented as

U = UI U Ul2 U U2 U ... U Ut-l,t U Ut , (1.36)

where

- Uj = [Uj-I,R, Uj+I,L] (j = 2, ... , t; UO,R = UI,L, Ut+1,L = Ut,R) - subset U, on which /-Iaj (u) = 1, and, because of orthogonality of G(L), /-Ia. (u) = OVi=Jj;

- Uj-l,j = [Uj,L, Uj-I,R] (j = 2, ... , t) - subset U, on which

0< /-Iaj_l (u), /-Iaj(u) < 1; /-Ia.(U) = 0 for i =J j - 1, i =J j.

Let's consider a query (/(0) = aj)(l ::; j ::; t). In this case on quality of search of the necessary objects render influence the next meanings of an attribute: left (j - 1) and right (j + 1). For average losses of the information and of information noise, thus, it is fair:

(1.37)

(1.38)

where

I~I L /-Iaj(U)/-Iaj_l (u)N(u)du

I~I Lj_l,j /-Iaj_l (u)/-Iaj (u)N(u)du, (1.39)

Page 388: Data Mining, Rough Sets and Granular Computing

385

I~I i J-laj(U)J-laj+, (u)N(u)du

I~I ij,J+, J-laj (U)J-laJ+' (u)N(u)du, (1.40)

The first and second equality in ( 1.39), ( 1.40) can be received by repeat­ing reasoning of a conclusion of the formula ( 1.35), considering for ( 1.39) j = 2 and for ( 1.40) j = 1. Last equality in ( 1.39), ( 1.40) follow from definitions of sets Uj-I,j and Uj,j+1'

In this case average of individual losses of the information and of infor­mation noise arising by the information retrieval by the given attribute will be equal accordingly

t

<Px(U) = LPj<Paj(U), (1.41) j=1

t

tJix(U) = LPjtJiaj(U), (1.42) j=1

where Pj - probability of the query on j - significances of an attribute X,

I:~=I Pj = 1. Let's notice that for extreme significances j = 1 and j = t

<Pa,(U) = <p~~(U) = tJia,(U) = tJig,'(U)

= I~I i" J-la, (u)J-la,(u)N(u)du,

<P~:-' (U) = tJia• (U) = tJi~.·-, (U)

I~I i._". J-la._, (u)J-la. (u)N(u)du.

Substituting ( 1.37) - ( 1.40) in ( 1.41), ( 1.42) in view of the made remark, we shall receive:

<Px(U) = tJix(U) PI -11 11 J-la, (U)J-la,(u)N(u)du U u"

+ ~Pj I~I (ij_"j J-laj_, (U)J-laj(u)N(u)du

+ 1.. J-laj (U)J-laj+, (U)N(U)dU) U).J+'

(1.43)

Page 389: Data Mining, Rough Sets and Granular Computing

386

+ pt-1

11l /-la'_l (U)/-la. (u)N(u)du

U u'_ l ,'

t-i

I~I f;(pj + Pj+i) iuj,j+, /-laj(U)/-laj+l (u)N(u)du.

In Sect. "The Several Properties of Loss ofInformation and ofInformation Noise" also are given a number of the theorems showing that the volumes of losses of the information and of information noise arising by search of the information in a linguistic (fuzzy) databases are coordinated with a de­gree of uncertainty of the description of objects. It means that describing objects by an optimum way (see Sect. 1.2) we provide also optimum search of the information in databases. Is shown also (Sect. "The Stability of Loss of Information and of Information Noise") , that as volume of losses of the information and of information noise and its connection with a measure of uncertainty of the descriptions of objects are stable. It allows to approve that the offered methods can be used in practical tasks and to guarantee optimum work of human - computer information systems.

The Several Properties of Loss of Information and of Information Noise

Theorem 1.7. Let S2 E GeL), N(U) = N = Const. Then

Nd cjjx(U) = wx(U) = 61UI'

where d = 1U121. Proof. Because of S2 E GeL), then membership functions /-lal (u) and /-la2 (u) are express by ( 1.6) and ( 1.7) correspondly. Then if we substitute ( 1.6) and ( 1.7)in ( 1.35) we can write:

1 r'ilR 1 cjjx(U) = wx(U) = WI fun d2 (UlR - u)(u - U2L)N(u)du. (1.44)

where d = UiR - U2L.

Because of N(U) N = Const by rectriction of the theorem, we can write:

Page 390: Data Mining, Rough Sets and Granular Computing

387

-U1RU2L u I:IR] U2L

6d~UI [-2(UrR - U~L) +3(u2L + U1R)(uiR - U~L) (1.45)

-6U1RU2L(U1R - U2L)]

6d~UI [-2urR + 2~L + 3UiRU2L + 3urR

3.,..-3 3- -2 6-2 - + 6- -2] - U2L - U1RU2L - U1RU2L U1RU2L

N(U1R - U2L)3 Nd3 Nd 6d2 1UI - 6d2 1UI - 61UI'

Corollary 1.1. Let the restrictions of the theorem 1.7 are true. Then

Proof. For proof of the corollary is enough to compare ( 1.8) and ( 1.45).

- 1 . Theorem 1.8. Let St E G(L), N(u) = N = Canst and Pj = .t<J = 1, ... , t). Then

ND cT>x(U) = wx(U) = 3tlUl'

where D = I:~:i dj ,j+1 = I:~:i IUj,j+1l·

Proof. Because of N(u) = N = Canst we can rewrite ( 1.43) by the following way:

(1.46)

By analogy ( 1.45) we can derive from ( 1.46):

2N t - 1 ld"+1 N t-l ND cT>x(U) = wx(U) = WI f; iT = 3tlUI f; dj ,j+1 = 3t1U!" (1.47)

Corollary 1.2. Let the restrictions of the theorem 1.8 are true. Then

2N cT>x(U) = wx(U) = 3t~(sd.

Proof. For proof of the Corollary is enough to compare theorem ( 1.2) and ( 1.47).

Page 391: Data Mining, Rough Sets and Granular Computing

388

We can generalize corollaries 1.1 and 1.2 for St E G(L). The following theorems are hold.

Theorem 1.9. Let S2 E G(L), N(u) = N = Const. Then

.;px(U) = !lix(U) = Ce(S2),

where c is a constanta with depends from N only.

Theorem 1.10. Let St E G(L), N(u) = N = Const and Pi t (j 1, ... , t). Then

c .;Px(U) = !lix(U) = ie(st),

where c is a constanta with depends Iram N only.

The proof of theorems 1.9 and 1.10 are bulky. If it is interesting we may refer to [3].

The Stability of Loss of Information and of Information Noise

At first we examine the simplest case t = 2 (Fig. 1.7).

~/'(u· )

~t(u·)

J.1"2L(u· )

1

J.1 "2 R(U·) t:::::::-u.~.~~.!t-_...J... ____ ~ U2L U u

Fig. 1.7. The analysis of 0- model: case t = 2.

Let's fix the number u! E [U2L + ~,U1R - ~] and a query (1(0) = al). In this case

where

1';1 (ut)N(ut)

1':2 (ut)N(ut)

1';1 (ut)N E

1':2 (ut)NE

~ Na1 (ut) ~ 1':1 (ut)N(ui)

~ Na2 (ut) ~ 1';2 (ut)N(ut)

~ N!. (u!) ~ 1':1 (ut)NE

~ NaE2(U*1) ~ L ( *)NE l'a2 ul ,

(1.48)

Page 392: Data Mining, Rough Sets and Granular Computing

389

- N (ut) is the number of objects, the descriptions of which are stored in the data base, that possess a real (physical, not linguistic) significance equal to ut;

- N E - the number of users of the system; - JJ~i (ut), JJ~i (ut) (i = 1,2) - the left and right border of membership func-

tions JJai (u) in point ut (see Fig. 1.7).

If we repeat our reasoning of calculation ( 1.28), ( 1.29), we can write:

~EJJ~l (ut)NE JJ~2(unN(uD :S <Pal (ut) :S ~EJJ~l (ut)NE JJ~2(ut)N(ut), (1.49)

~EJJ~2(Ut)NE JJ~l (ut)N(ut) :S 1/Jal (ut) :S ~EJJ~2 (ut)NE JJ~l (ut)N(ut)· (1.50)

Let's fix the point u~ E [un - ~,un + ~] . In this segment

JJ~l (u~) = 1, JJ~2(U~) = O.

Thus from ( 1.49), ( 1.50) we derive directly:

(1.51)

(1.52)

By analogy for u~ E [U1R - ~,U1R + ~] (JJ~l (u~) = 0, JJ~2 (u~) = 1) we can write:

(1.53)

(1.54)

If we repeat the reasoning of calculation of ( 1.30), we can construct the top (4) al (U)) and bottom (~al (U)) valuations of the if> al (U):

(1.55 )

Page 393: Data Mining, Rough Sets and Granular Computing

390

- 6,

1 /"U'R-,. L R P.a (U) = -, , Pa (u)Pa (u)N(u)du, , U _ 6, ' , U2L+,. (1.56)

By analogy for tf/a, (U)

W'a,(U) 1 U2L+ , [ - ~

WI fu2L-~ p~,(u)N(u)du (1.57)

1 /"

U'R- 6, T L R

'ka, (U) = -,U, _ 6 Pa, (u)Pa,(u)N(u)du, U2L+~

(1.58)

If we will examine the query (/(0) = a2) then from ( 1.48) we derive:

~EP:,(u1)NEp~,(unN(un ~ 'Pa,(un ~ ~EP~,(UnNEp:,(unN(u1), (1.59)

~EP~, (u1)NE p:,(u1)N(u1) ~ 1f>a,(u1) ~ ~EP:' (u1)N E p~,(unN(un· (1.60)

For u5 E [U2£ - ~, U2£ + ~] we derive the following analogies of ( 1.51), ( 1.52):

(1.61 )

(1.62)

By analogy for

u; E [UIR _ 8; ,UlR + 8; ] the following formulas are true:

(1.63)

o ~ 1f>a,(u;) ~ p:,(u;)N(u;). (1.64)

Because of formulas ( 1.49) - ( 1.54) and ( 1.59) - ( 1.64) are coincide completely then (J)a,(U), P.a,(U), W'a,(U), 'ka,(U) are equal to the right part of the formulas ( 1.56) - ( 1.57) correspondingly.

Page 394: Data Mining, Rough Sets and Granular Computing

391

By analogy ( 1.33) and ( 1.34) the top and bottom valuations of loss of information and of information noise for information retrieval on X {al, a2} are equal to

~(U) = Pl~a, (U) + P2~, (U),

4>(U) = Pl4>a, (U) + P24>a, (u),

'k(U) = Pl'ka, (U) + P2'ka, (U),

îî1(U) = plîîla, (U) + P2îî1a,(U),

where Pl (P2) is probability of the query for the first (second) value of at­tribute X .

Because of Pl + P2 = 1 the bot tom valuations of loss of information and of information noise are equal the right part of ( 1.56) and the top valuations of loss of information and of information noise are equal the right part of ( 1.55):

(1.65)

4>x(U) (1.66)

For generalization ( 1.65) ( 1.66) for case t > 2 we present universum U by the following way (Fig. 1.8)

u

Fig. 1.8. The analysis of 0- model: case t > 2.

Page 395: Data Mining, Rough Sets and Granular Computing

392

U lUlL, Ut,R] = [UlL' un - 0;] U [un - i ,Un + 0; ] U [un + 0; ,U1R _ 0; ] U [U1R _ 0; ,U1R + 0;] U ...

U [U'L - 01 U'L + 01] U [U'L + 01 U· 1 R _ 01] (1.67) J 2' J 2 J 2' J- , 2

U [Uj-1,R - 0; ,Uj-1,R + 0;] U [Uj-1,R + 0; ,Uj+1,L _ 0;]

U [Uj+1,L - 0; ,Uj+1,L + 0;] U [Uj+1,L + 0; ,UjR _ 0;]

U [UjR - 0; ,UjR + 0; ] U ... U [UtL _ 0; ,UtL + 0; ]

U [UtL + 0; ,Ut-1,R _ 0;] U [Ut-1,R _ 0; ,Ut-1,R + 0;]

U [Ut-1,R+ 0; ,UtR]. Let's fix number j(l ~ j ~ t) and calculate ~j(U)' 'kaj(U), By analogy

( 1.37), ( 1.38) we can present fE.aj (U) and 'kaj (U) in the following way:

p . (U) = paj_l (U) + paj+l (U) -aJ -GJ ..:::.-a1 '

tfJ . (U) = tfJaj - 1 (U) + tfJaj+l (U). -aJ -a J -a J

It is easy to show (see ( 1.39), ( 1.40)) that

So, if the probability of a queries are equal then

An analogius equation for the top evaluations of loss of information and of information noise are true.

The following theorems are hold.

Page 396: Data Mining, Rough Sets and Granular Computing

393

Theorem 1.11. Let X = {aI, a2}, S2 E GoeI), N(u) = N = Canst. Then

Nd (1- (h)3 fEx(U) = 'k.x(U) = 61UI

Proof. Let's calculate ( 1.65) if the restrictions of the theorem are true. By analogy of the proof of theorem 1.7 we can write

way:

N (d - <hlf+cl2)3 _ Nd (1- %vT+d2)3 6d2 1U1 61UI

Page 397: Data Mining, Rough Sets and Granular Computing

394

(1.70)

Corollary 1.3. Let the restrictions of the theorem 1.11 are true. Then

Proof. For proof of the Corollary is enough to compare ( 1.24) and ( 1.70).

Theorem 1.12. Let X = {al' a2}, s2 E GO(l), N(u) = N = Canst. Then

~ (U) = Iii (U) = Nd(l- 152)3 NdJ2

X X 61UI + lUI .

Proof. Let's calculate ( 1.66) if the restrictions of the theorem are true. By analogy of the proof of theorem 1.11 we can write:

~al (U)

Page 398: Data Mining, Rough Sets and Granular Computing

395

x ( ( U1R - 8; r -(U2L + 8; r) + ( U2L _ 8;) (U1R + 8; )

x ( ( U1R - 8; ) - (U2L + 8; ) ) 1

= N82 N [( ( 8 r ( 8 r) dldl - 6d21UI 2 U1R -; - U2L + ;

-3 ( ( U1R + i ) + (U2L _ 8; ) ) (1.71)

x ( ( U1R _ 8; ) 2 _ (U2L + 8; ) 2)

+6 ( U2L _ 8; ) (U1R + 8; )

x ( ( U1R - 8; ) - (U2L + 8; ) ) 1 = N 8r N [( 3 3) (2 2) dlUI - 6d21U1 2 s - t - 3 (s + t) s - t

+6(s+8d(t- 8d(s-t)]

= N 8r N [( 3 3) (2 2) d1UI-6d21U12 s -t -3(s+t) s-t

+6st (s - t) + 6 (s - t) (t81 - s81 - 8D]

= N8r N [3 ] dlUl-6d2lUI (t-S) -6(t-s)(t-s-81)

= N 8? N [ ( - 81 - 81 r dlUI - 6d21UI u2L + "2 - u1R + "2

(81 81 ) -681 U2L + "2 - U1R + "2

(81 81 )] X U2L + "2 - U1R + "2 - 81

= N8r N [3 ] dlUl - 6d21UI (81 - d) + 681 (81 - d) d

N8r N(d-81)3 6Nd81 (d- 8d = dlUI + 6d21U1 + 6d21UI

Page 399: Data Mining, Rough Sets and Granular Computing

396

During the calculation of ( 1.71) we use a replacement of variables

(_ 01) (_ 01) Z = U - U2L - 2 ' y = U1R + 2 - U

and denotes

_ 01 _ 01 S = U1R - 2' t = U2L + 2'

If we remember equation 01 = d02 ( 1.15), we can rewrite ( 1.71) as following

4>X(U)

(1.72)

Corollary 1.4. Let the restrictions of the theorem 1.12 are true. Then

Proof. Proof. For proof of the Corollary is enough to compare ( 1.25) and ( 1.72).

Theorem 1.13. Let X = {a1, a2, .. " ad, St E Goel), N(u) = N = Cons and Pj = t (j = 1,2, .. . ,t). Then

cJj (U) = I]J (U) = N D (1 - 02)3 -X -X 3t1UI' (1. 73)

where D = L~:i dj ,j+1 = L~:i IUj,j+1l· Proof is compleat ely analogius by proof of the theorem 1.8 using ( 1.70).

Corollary 1.5. Let the restrictions of the theorem 1.13 are true. Then

Proof For proof of the Corollary is enough to compare ( 1.26) and ( 1.73).

Page 400: Data Mining, Rough Sets and Granular Computing

397

Theorem 1.14. Let X = {al,a2, ... ,at}, St E G'5(I), N(u) = N = Cons and Pj = t (j = 1,2, ... , t). Then

q; (U) = ili (U) = N D (1 - 82)3 2N D82

x X 3tlU! + tlU! ' (1.74)

where D = E~:i dj ,j+1 = E~:i !Uj,j+1!'

Proof is compleately analogius by proof ofthe theorem 1.8 using ( 1.72).

Corollary 1.6. Let the restrictions of the theorem 1.14 are true. Then

- - 2N [(1 -82)3 ]-If>x(U) = Wx(U) = t (1 + 282 ) 3 + 282 e(st}.

Proof. For proof of the Corollary is enough to compare ( 1.27) and ( 1.74).

By comparing the results of the theorem 1.7 and theorems 1.11 and 1.12 or of the theorem 1.8 and theorems 1.13 and 1.14, we see that for small significances 8, the main laws of our model of information retrieval are preserved. By comparing the corollary 1.1 and corollaries 1.3 and 1.4 or the corollary 1.2 and corollaries 1.5 and 1.6, we see that for small significances 0, the main laws of our models are preserved. Therefore, we can use our technique of estimation of the degree of uncertainty and our model of information retrieval in fuzzy (linguistic) data bases in practical tasks, since we have shown it to be stable.

References

1. Pospelov D.A. (Ed.)(1986) Fuzzy sets in models of control and artificial intelli­gence. Nauka, Moscow (Russian)

2. Ryjov A. (1987) The degree of uncertainty of fuzzy descriptions. In: Krushinsky L.V., Yablonsky S.V., Lupanov O.B. (Eds.) Mathematical cybernetics and its application to biology. Moscow University Publishing House, Moscow, 50-77 (Russian)

3. Ryjov A. (1988) The degree of fuzziness of a linguistic scale and its general properties. In: Averkin A.N.(Ed.) Fuzzy decision-making systems. Kalinin Uni­versity Publishing House, Kalinin, 82-92 (Russian)

4. A. Ryjov, A. Belenki, R. Hooper, V. Pouchkarev, A. Fattah and L.A. Zadeh. (1998) Development of an Intelligent System for Monitoring and Evaluation of Peaceful Nuclear Activities (DISNA), IAEA, STR-310, Vienna

5. Zadeh L.A. (1975) The concept of a linguistic variable and its application to approximate reasoning. Part 1,2,3. Inform Sci 7:199-249; 8:301-357; 9:43-80

6. Zadeh L.A. (1996) Fuzzy logic == computing with words. IEEE Trans on Fuzzy Systems 4:103-111

7. Zadeh L.A. (1997) Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy sets and systems 90:111-127

Page 401: Data Mining, Rough Sets and Granular Computing

Ordinal Decision Making with a Notion of Acceptable: Denoted Ordinal Scales

Ronald R. Yager Machine Intelligence Institute

Iona College New Rochelle, NY 10801

[email protected]

Abstract. Our concern is with the problem of constructing decision functions to aid in making decision under uncertainty. We discuss the tradeoff that has to be made, when selecting a scale for representing our possible payoffs, between the power of the scale and the burden of the scale. We consider here the situation in which our basic scale is an ordinal scale, however we augment this scale by allowing an additional notion, a classification of payoffs as to whether they are acceptable or not. This allows us to have information such as A is preferred to B but both are acceptable. We indicate that this formally corresponds to an ordinal scale with a denoted element and call such a scale a Denoted Ordinal Scale (DOS). It is shown that this augmentation of the ordinal scale increases the power of the scale and therefore allows us to built more sophisticated decision models.

1. Introduction

In the solution of decision making problems an important component is the scale used to evaluate the worth of the different alternatives open to the decision maker [1, 2]. Usually the selection of a scale involves a tradeoff between the power of the scale and the burden of the scale. By the power of a scale we mean the operations available for manipulation of measurements made in that scale, this is of course effects the types of models we can build using the scale. The more powerful the scale, the more operations available, and the more sophisticated the decision systems that can be built using this scale. By the burden of the scale we mean to indicate the difficulty in providing measurements on the scale. The burden of the scale can be seen to be related to the granularization used in denoting the values of the scale.

Ideally we of course prefer more powerful and less burdensome scales. Unfortunately, the more powerful the scale the more burden involved in providing the necessary measurements. For example, at the most powerful extreme is the absolute scale, here all mathematical operations are available. This scale, however, brings with it the burden that each measurement must be precisely meaningful. Here all distinctions between measurements are meaningful and the decision maker must be precise and careful in assigning values from this scale. At the other end of the burden is the ordinal scale. Here all that is required is an ordering, saying one is better than the other. However, the price we pay for ease of burden is a loss of operations. With this ordinal scale we have only three available operations: maximum, minimum and

Page 402: Data Mining, Rough Sets and Granular Computing

399

negation (inversion). Furthermore, in the case of ordinal scales a distinguishing feature between scales is the cardinality of the scale, how many different grades it has. A very special ordinal scale is that case of two grades, 1 and 0, this is essentially the binary logic scale yes/no.

Intuitively we can view the "powerlburden ratio" as being a measure of the efficiency of the scale and where loosely this can be seen as being constant for the commonly used scales at the frontier of efficiency.

An even less burdensome scale then the ordinal scale is the one assumed in the famous Arrow impossibility theorem [3]. Here one assumes only a collection of independent, non-commensurate ordinal scales, rather than one unifying ordinal scale. That is, in this theorem each decision maker has their own scale which is not connected with the scales of the other decision makers. This does not impose the burden that each decision maker coordinate their valuations with the other decision makers as is implicitly required in the case of a single ordinal scale. The lack of power of this scaling situation is attested to by the impossibility of finding a suitable means for combining multiple preferences. What the Arrow's theorem is essentially telling us is that the power of this multiple independent ordinal scale is insufficient to provide any operations needed for making decisions.

As recently noted by Zadeh [4] human decision makers often use what he calls "perceptions" in performing many of their everyday decisions. What Zadeh refers to as perceptions can be seen as measurement based upon a scale with a very low burden. In this framework another concept discussed by Zadeh [5], the idea of granularization can be seen as closely related to the burden of the scale. The greater the granularization less the burden.

In this work we focus on the problem of decision making under uncertainty in the case in which our basic scale for measuring the payoffs associated with the different causes of action is an ordinal scale. However, we augment this scale by introducing an additional notion which allows us to classify payoffs into acceptable or unacceptable. In this environment solutions can be compared in an ordinal manner as well as being denoted as being acceptable or unacceptable. Thus we can incorporate information such as solution one is preferred to solution two but both are acceptable. We see this provides us with two levels of granularization with respect to our scale. Formally, as we shall see, this augmentation can be accomplished within the framework of our basic ordinal scale by associating with the scale a special value of the scale called the denoted element, thus all we need is the scale and this special value in the scale.

At one level this augmentation of the ordinal scale provides us with some rudimentary ability for counting which increases the power of the scale by allowing more operations and hence allows the construction of more sophisticated decision models without imposing to great an additional burden on the decision maker

2. Decision Making Under Uncertainty

Consider the problem of decision making under uncertainty. The following matrix provides the basic framework for this problem.

Page 403: Data Mining, Rough Sets and Granular Computing

400

C .. IJ

Ap ~--------------------~

Figure #1. Basic decision making under uncertainty framework

Here the Aj indicate a set of possible alternative actions open to a decision

maker, one of which must be chosen. The set X = {xl, .... , xm} consists of a set of

possible values that can be assumed by a variable U, called the state of nature. Cij

indicates the payoff received by the decision maker when he selects alternative Ai and

the state of nature is such that U = Xj. Ideally the decision maker wants to select an

alternative which maximizes his payoff. Unfortunately this is made difficult by the fact that the value of U is unknown at the time the decision maker must select his course of action. This situation induces an uncertainty in the decision making environment.

Effectively we see that if the decision maker selects alternative Ai he receives an

m vector of possible payoffs, Ti = [Cil, Ci2, .... Cim]. He must now compare these

vectors to select a best alternative. In comparing these vectors one fact should be very clear, if Ti and Tk are two m vectors such that Cij ~ Ckj for j = I to m then Ai is a better choice than Aj. This is called the Pareto condition. Except in this very special

case it is hard to distinguish between these vectors. In order to be able to distinguish between the alternatives some additional imperative must be provided. One approach is to introduce some valuation function V which induces a scalar valuation for each alternative payoff vector,

V(Ai) = V(Cil, Ci2, .... , Cim)· Using this function we can then compare the alternatives with respect to these scalar numbers, the bigger the value the better the alternative.

Let us consider some properties that should be associated with such a function. Since we have made no distinction between each of the values for U, they are all equally possible, one property should be a kind of symmetry. This property implies that each of the individual components in the vector of payoffs should be handled in the same manner. If Per[Ci I, Ci2, .... , Cim] is any permutation of the payoffs

associated with Ai then

V(Cil, Ci2, .... , Cim) = V(Per[Cil, Ci2, .... , Cim])· Another requirement is the satisfaction of the Pareto condition, if Cij ~ Ckj for all j

then

Page 404: Data Mining, Rough Sets and Granular Computing

401

V(Cil, Ci2, .... , Cim) ~ V(Ckl, Ck2, .... , Ckm) The combination of Pareto optimality along with the symmetry condition leads

to an interesting property. Assume Cij are the payoffs associated with Ai' Assume bj is a collection of m values such that Cij ~ bj. Let Ak be another alternative for

which Ckj = bj. Let us denote B = [bl, b2, .... bm]. From the Pareto condition V(Ai) ~ V(Ak) From the symmetry condition V(Ak) = V(Per[AkD where

Per(Ak) is any permutation of Ak' Here we then see V(Ai) ~ V(per[AkD.

The implication of this is the fOllowing. Assume Ai = [Ci I, Ci2, .... Cim] and

Ak = [Ckl' Ck2' .... Ckm] are the payoff vectors associated with two alternatives. Let bl , .... , bm be any indexing of the Ckj' a permutation of Ak, such that we get

Cij ~ bj for all j. Then it follows that it must be the case V(Ai) ~ V(Ak) and hence Ai is preferred to Ak' We shall call this extended monotonicity property. As

an illustration of the application of this extended monotonicity requirement consider the following payoff matrix:

10 5

7 13

With C21 = 7 and C22 = 13 if we index bl = C12 = 5 and b2 = CII = 10 then since

C21 ~ bl and C22 ~ b2 we get V(A2) ~ VeAl), thus A2i.§. preferred to AI' It should be noted that without the symmetry condition, the Pareto requirement

gives us only a simple monotonicity:if Cij ~ Ckj for all j then V(Cil, .... , Cim)

~ V(Ckl, .... Ckm) Another property that can be required of these valuation functions is

idempotency: if Cij = a for all j = I to m then V(Ci I, .... Cim) = a. Thus idempotency says that if all the payoffs of an alternative have the same value then this should be the valuation of the alternative. It is noted that any function that satisfies monotonicity, symmetry and idempotency is called a mean operator [6]. While idempotency appears like a nice property it is not completely clear that this should be a required property. Closely related to idempotency is the bounding property. This says that

Minj[Cij] :s; V(Cil, Ci2, .... Cim):S; Maxj[Cij] It can easily be seen that the bounding property implies idempotency. Again this doesn't seem to be a truly required property. These properties, idempotency and boundedness, can be seen to related to some more abstract requirement of naturalism [7] or linearity. By naturalism we mean to indicate that the valuation of certain payoff

Page 405: Data Mining, Rough Sets and Granular Computing

402

should be that payoff, Yea) = a. The satisfaction by the valuation function of the preceding requirements,

symmetry, extended monotonicity and even boundedness, are not enough to uniquely define a valuation function. The decision maker still has considerable freedom in selecting the decision function. Normally the actual choice of the decision valuation function should be a reflection of the decision maker's bias or opinion. In addition the form of valuation function is constrained by the operations available for combining the payoffs. As we have previously indicated the choice of operations is dependent upon the scale which is used to measure the payoffs.

The decision maker's bias or opinion can be expressed in a number of different ways. Let us describe some of them. A decision maker can associate with each payoff vector a specific value. The decision maker may compare pairs of payoff vectors by indicating which one of the two he feels should have a higher evaluation. When supplying preferences in this manner we face the problem that all possibilities are generally difficult to enumerative. Methods such as neural networks [8] and fuzzy modeling [9] can be used to obtain valuation functions by generalizing from some set of specifications of the above type. Another approach that can be used is for the decision maker to indicate his bias in the form of some linguistic specification of his decision attitude with respect to decision making under uncertainty. For example, he may indicate that his bias is to be very conservative and avoid alternatives that can lead to bad payoffs or he made indicate that he is very optimistic and wants to open himself to best possible payoff available. Often the specification of such an attitude can be used to construct an appropriate decision valuation function. This approach, which we shall fundamentally pursue here, can be seen to be very much in the spirit of granularization.

3. Ordinal Decision Making

Our focus here shall be on environments in which the scale on which the payoffs are measured is an ordinal scale. An ordinal scale can be seen to be of two types, natural and prescribed. A prescribed ordinal scale of dimension n, is a set S = {S 1, .... Sn}

such that Si> Sj if i < j. Such a scale supports the operations of Max(v) and

Min(,,), Max(Si' Sj) = Si if i < j and Min(Si' Sj) = Sj if i < j. In addition, it can support an operation called negation(ll), such that ll(Si) = Sn + 1 _ i. We note that

if Si > Sj then ll(Si) < ll(Sj)' Thus negation is essentially an order reversing operation. When using a prescribed ordinal scale every payoff Cij must be associated

with an element in S. There are two modes where we get valuations in a prescribed ordinal scale. One is when the Cij are directly valued in S, here the only perception

of the worth of a payoff is in terms of a value from S. In the second there exists some function F that associates with our perceived valuation of a payoff, C, a value F(C) E S. Here we note that as far as our ability to distinguish and make decisions if

'" '" F(C) = F(C) then we act as if C = C. With a natural scale no prescribed values are associated with the payoffs. The

Page 406: Data Mining, Rough Sets and Granular Computing

403

payoffs are manipulated unadorned simply and directly as they are perceived. A natural ordinal scale is one in which the payoffs as directly perceived can be ordered so that

"" ......... ......... .........

for any pair of payoffs C and C we can determine whether C > C, C > C or C = C and these comparisons are transitive. The point being that with a natural scale no prescribed values are associated with the payoffs. Essentially with a natural ordinal

'" scale we can provide a list of payoffs so that if C is higher than C on the list we say

C > C; if C is lower on the list than C we say that C > C, and if C and C are on the

same position we say that C = C. Here we note that the addition of new payoff values essentially means it can be inserted appropriately in the list. Formally we can associate with this a relationship R on the space C of all payoffs which is transitive, strongly complete and antisymmetric [2, 10].

One distinction between using a prescribed or an natural ordinal scale is that with a prescribed scale the number of allowable distinctions, the number of different classes or level of granularity, is predetermined it is restricted to be the elements in S, whereas with the natural scale the number of distinctions emanate from the payoffs being considered, that is if we have a 100 different payoffs we get 100 levels of distinction. We can always generate a ordinal scale from a naturally ordered set of

'" payoffs, here the scale S is such that each distinct element in the original set of payoffs induces a value in the scale. Thus formally the prescribed and natural scale are equivalent.

Their exists another distinction between a prescribed ordinal scale and a natural ordinal scale. Often when using a prescribed scale the values S have an associated linguistic alias. For example, in the case of the a five point scale {S I, S2' S3' S4, S 5} we may use the linguistic values {very low, low, medium, high, very high}.

While not formally adding any properties, power, to the scale, the association of linguistic aliases with the values often makes the task of assigning values from the scale much easier by allowing the human to act in a linguistic environment. Essentially we can see that the use of linguistic aliases helps reduce the burden associated with the scale.

Given an ordinal scale there are a number of valuation functions we can define. The first valuation function is one that can be seen as modeling a pessimistic decision attitude. In this approach for any alternative Ai,

Val(Ai) = Minj[Cij] Thus Val(Ai) is the smallest payoff that is possible under the selection of Ai. Using

this approach we then select as our decision alternative Aq such that Val(Aq) =

Maxi[Val(Ai)]· Dubois and Prade [11] have suggested the use of a slight modification of this

called the Lexi-Min. In [12] Yager also looked at this Lexi-Min method. In this approach if there are more than one alternative attaining the minimum of Val(Ai), we

adjudicate between these tied values by looking at the second lowest scores associated with these tied alternatives. In this case we select the alternative with the largest second lowest score. We proceed in this manner until we are left with one alternative.

Basically, the pessimistic approach is one in which each alternative is evaluated

Page 407: Data Mining, Rough Sets and Granular Computing

404

by the worst thing that can happen if we select this alternative. It can be seen as a specification of a decision attitude of "watching your back."

Another valuation function is an optimistic approach. In this approach for any alternative Ai,

Val(Ai) = Maxj[Cij] Thus Val(Ai) is the largest payoff that is possible under the selection of Ai' In this approach we select the alternative the one that has the largest valuation. We can modify this to allow a lexi-max which can be used to adjudicate ties. Here we would use the second highest payoff for those tied with the highest payoff.

A generalization of these two approaches can be considered, one which we call

the qth best method. In this approach we use the qth highest payoff for each alternative. If Ai = [Cil, Ci2, .... Cim] we define Bi = [bil, bi2, .... bim] to be a

reordering of the elements in Ai in which bik is the kth largest of the Cij' In the qth

best method we use Val(Ai) = biq.

Here Val(Ai) is the qth largest of the payoffs. We note that if q =1 then we get the

optimistic and if q = m we get the pessimistic method. Thus we can modify the extremes by selecting q near the middle.

The previous formulations for valuation functions are useful in cases of both natural and prescribed ordinal scales. In the following we shall discuss a formulation of valuation function, called ordinal OW A class, which is much more useful in the case of prescribed ordinal scale. Here we associate with our valuation function V a vector W, called the weighting vector, of dimension m in which its components wk, k = 1 to m, satisfy the following conditions:

1) wk E S

2) wm =SI' 3) If k2 > kl then wk2 ~ wk1 .

We see that the wk are elements from the prescribed scale, the value of the mth

component is the biggest value in S and the weights are monotonic. Using this vector we obtain the following valuation function.

Val(Ai) = Maxk =1 to m[Wk 1\ bik]

where bik is the kth largest payoff associated with alternative Ai and 1\ indicates the

min. In should be clear that this valuation function just uses operations available under an ordinal scale. If we let W be our weighting vector and let B be an m

dimensional vector whose kth component is bik then

Va1(Ai) = WT 1\ B

Here B, which we shall call be the ordered argument vector, is a vector whose elements are the payoffs in decreasing order.

By appropriately selecting the vector W we can induce various different valuation functions. Let us look at some of these. If W is such that wk = SI for all k,then

Page 408: Data Mining, Rough Sets and Granular Computing

Val(Ai) = Maxk[SI /\ bik] = SI /\ bjl = bil = Maxj[Cij]' In this case we get the optimistic approach. IfW is such that wk = Sn for k = 1 to m - 1 and wm = SI,then

Val(Ai) = Maxk[Sk /\ bik] = SI /\ Bim = Bim = Minj[Cij] In this case we get the pessimistic approach. If W is such that wk = Sn for k = 1 to q - 1 and wk = S I for k = q to m, then

Val(Ai) = Maxk[wk /\ bik] = Maxk ~ q[SI /\ bik] = biq.

Thus here we get the qth best.

405

We now consider a form of W which can be seen as implementing a valuation function in the spirit of the Hurwicz [13] criteria. Let W be such that wm = SI and

wk = Sp for k = 1 to m - 1. In this case

Val(Ai) = Maxk[Wk /\ bik] = (Maxk =1 to m-l [Sp /\ bikD v (SI /\ bim)

Val(Ai) = bim v (Sp /\ bil)

Val(Ai) = MiniCij] v (Sp /\ Maxj[Cij]] Thus here we take a weighted average of the Max and Min approaches. We noted

the larger Sp, the smaller p, the more we weight the Max.

4. Denoted Ordinal Scales

In decision making environments where we have an ordinal scale, we may also have additional information regarding the payoffs. This information is in the form of an indication of a collection of payoffs which are "acceptable." For example, consider a situation in which a decision maker must select a person for a job from a set of candidates. The decision maker may not only be able to order the candidates regarding his preference, but he may be able to indicate which candidates are acceptable. This situation is made apparent by linguistic statements such as

while they are all acceptable I prefer candidate x they are all unacceptable but z is the best of the lot.

The presence of such additional information on top of the ordering can allow us to build more sophisticated valuation functions then available.

In order to formally capture this in the following we investigate a scale which we shall call a Denoted Ordinal Scale (DOS). Again we shall make a distinction between a prescribed and natural ordinal scale. Assume S = {S 1, S2, .... Sq, .... Sn} is a prescribed ordinal scale with n elements. Here we again assume an ordering on this scale such that Si > Sj if i < j, thus

SI> S2> S3 .... > Sn' In addition we shall associate with this scale a special element on the scale Sq,

called the denoted element. The association of this special denoted element, Sq,

allows us to introduce a mapping D from S into a subset of S, B = {Sl, SnL D: S

~ B, such that

Page 409: Data Mining, Rough Sets and Granular Computing

406

D(Sj) = Sn if j > q. Thus here all scores less than or equal Sq are mapped into S1 while those greater or equal Sq are mapped into Sn' It is to be strongly emphasized that the mapping D uses the ordinal scale S as well as the designated element Sq.

Formally the situation can be seen as one in which we have two connected relationships on the space S, R1 and Rl. The connection is based upon the use of a designated element z E S. The primary relationship, Rl, a linear order on S, induces the basic ordering. Rl is definable in terms of R1 and the designated element z as follows:

if xRly then xRly ifxRl z, y Rl z and xRl y then y Rlx if zRlx, zRlx and xRly then yRlx.

The introduction of this mapping allows us to build more complex decision functions which can make use the information about the acceptability of solutions. Consider the following decision imperative consisting of two parts: F -1: if all the payoffs for an alternative are acceptable then valuate it by its best payoff

or F -2: if any of the payoffs for an alternative are unacceptable then valuate it by its worst. Using the the approach used in fuzzy systems modeling [9] we can build a valuation function to model this decision imperative. In the following /\ and v are used to indicate Min and Max respectively. Condition one can be expressed as

V 1 = Minj[(D(Cij)] /\ Maxj[Cij] Condition two can be expressed as

V2 = Maxj[Neg(D(Cij) /\ MiniCij] Combining these via the or we get

V(Ci1, Ci2, .... , Cim) = Max[V1, V2]' It is interesting to note that this becomes a kind of weighted average of the

maximum and minimum possible payoffs for an alternative. If we let a = Minj[D(Cij)] and take advantage of the fact that Maxj[Neg(D(Cij» = Neg(Minj(D(Cij))] then

V(Ai) = (a /\ Maxj[CijD v (Neg(a) /\ Minj[CijD (1) While the above is a kind of weighted average, it should be pointed out that it is a nonlinear weighted average since a, the weight, is a function of the payoffs and not a simple constant.

A further simplification of (1) can be made. Since Maxj[Cij] ~ Minj[Cij] and

a E {Si' SnL we can express V(Ai) as

V(Ai) = (a /\ Maxj[CijD v Minj[Cij] (2) This can be seen as follows: if a = Sn then both equations (1) and (2) evaluate to

Minj[Cij] and if a = S1 then equation (1) is equivalent to Maxj[Cij] while equation

(2) becomes Maxj[Cij] v Minj[Cij]' However since Maxj[Cij] ~ Minj[Cij] this

Page 410: Data Mining, Rough Sets and Granular Computing

407

becomes Maxj[Cij]. If we let Max[Ai] = Maxj[Cij], the maximal payoff for alternative Ai and let Min[Ai] = Minj[Cij], the minimal payoff for alternative Ai then we can express

V(Ai) = (ex /\ Max[AiD v Min[Ai] where ex = Minj[D(Cij)] = Min[D(Ai)], the minimal acceptability of any payoff under

Ai· F-l: if all the payoffs for an alternative are acceptable then valuate it by its best payoff

or F-2': Valuate the alternative by its worst payoff. Here we see that F -2' has no antecedent clause, it is a kind of default clause.

With the aid of the mapping resulting from the introduction of a designated element we have expressed the concept all acceptable solutions by the representation Minj[D(Cij)]. We can use this framework to express other terms useful for a partitioning of the payoff space.

Consider the concept at least p acceptable solutions where p an integer such that 1 :5: P :5: m. To model this we shall W p be an m dimensional vector such that

thus

wk = S 1 for k = P to m wk = Sn for all others

Sn Wp = Sn

p~ Sl Sl Sl

Furthermore, let aij = D(Cij) and let bk be the kth largest of the aij. Here then

bl = Maxj[aij] and bm = Minj[aij]. Consider now the structure

Gp(Ai) = Maxk[wk /\ bk]·

We see that ifp = 1 then Gl(Ai) = Max[aij]. In this case Gl(Ai) has the value Sl if

at least one element D(Cij) = S 1, thus Ai has at least one acceptable payoff. On the

other hand if Gl (Ai) = Sn' then Maxj[aij] = Sn and we have no acceptable payoff. Consider now the case in which p = m here

Gm(Ai) = Maxk[Wk /\ bk] = bm = Minj[aij]· This has the value S 1 if Minj [aij] = S 1, which implies" all payoffs are acceptable"

More generally for any p Gp(Ai) = Maxk[Wk /\ bk] = Maxk ~ p[SI/\ bk] = Maxk ~ p[bk] = bp

Thus Gp(Ai) = S 1 if the p largest payoff is acceptable, this of course implies that they

are at least p acceptable payoffs. We can express the concept exactly p good payoffs using the Gp(Ai), at

least p good payoffs. In particular if Ep(Ai) indicates exactly p good solutions in

alternative Ai then it can be expressed as

Page 411: Data Mining, Rough Sets and Granular Computing

408

Ep(Ai) = Gp(Ai) A Neg(Gp+ 1 (Ai)) Here we see we expressing the idea at least p good solutions and not p + 1 good solution. Two special cases are worth noting. For p = m then Gm+1 (Ai) = Sn' by

definition and thus Em (Ai) = Gm(Ai)' In addition for p = 0 then GO(Ai) = Sl and

therefore EO(Ai) = Neg(Gl (Ai))' Here we see that we have, that with the aid of the idea of the designated element

and the correlated concept of acceptable solution we have introduced some facility for counting. Thus even though all our definitions only made use of those operations available on an ordinal scale, Max, Min and Neg, we have been able to introduce some rudimentary idea of counting.

In the following we shall these Gp functions to construct complex valuation

functions from more simple ones. One way of constructing complex decision functions from simpler ones is to partition with respect to the number of acceptable solutions. Figure #2 will help us here:

v V V 2 0 I V m

o 2 m

Figure #2. Partition with respect to acceptable solutions.

In this figure for each k, where k is the number of acceptable solutions, we have a particular simple decision function. Thus our complete decision function V is expressible in the form

if the count of acceptable solutions is zero then V = V 0

if the count of acceptable solutions is one then V = VI

if count of acceptable solutions is m then V = V m

Using this approach and the machinery of fuzzy modeling we get V(Ai) = Maxk[Ek(Ai) A Vk(Ai)]

Since one and only one of Ek will equal Sl and the others equal Sn the value of V

will be the Vk(Ai) for this Ek-

This type of very precise formulation is not in the spirit of the kind of granularization that human beings use in expressing their desires. Another approach is to use ranges to express the structure of V. For example if we let k indicate the number of good solutions then we can express our compound decision structure as

If p < k 1 then V is VI

If kl ~ k < k2 then V is V 2

Page 412: Data Mining, Rough Sets and Granular Computing

If k2 ~ k < k3 then V is V3 If k ~ k3 then V is V 4

We can express this as

409

YeA) = (Neg(Gkl (A)) 1\ VI (A)) v (Gkl (A) 1\ Neg(Gk2(A)) 1\ V2(A)) v

(~2(A) 1\ Neg(Gk3(A)) 1\ V 3(A)) v (~3(A) 1\ V 4(A))

In the preceding we have partitioned the good solution space as seen in figure #3.

~ Figure #3. Partitioning the acceptable solution space by ranges

There exists another way of partitioning the space of the number of acceptable solutions which can be seen very much in the spirit of granularization. Before discussing this method we introduce some definitions.

Assume V 1 and V 2 are two valuation functions, that is they take a collection A

of m payoff values drawn from S, A = (al a2, .... , am)' and provide a scalar value in S. As noted two examples of this are the Max and Min. We note for these two it is the case that for all argument tuples A, Max(A) ~ Min(A). We shall generalize this idea by saying VIis more optimistic than V 2 if VI (A) ~ V 2(A) for all A. We shall

denote this as V 1 ~ V2' For A = (aI' .... am) with D being the mapping into acceptable solutions let

D(A) = (D(al), .... , D(am)), here D(aj) E {Si, Sn}. Let count (D(A)) equal the

number of S 1 in D(A); it is the number of acceptable solutions in A. Let V be a

compound decision function of the type we have been considering: YeA) = VjCA) if Count (D(A)) = j.

We have implicitly assumed each of the Vj are valid valuation functions - they are

monotonic, if A and A are two payoff vectors such that A ~ A, ak ~ ~k for k, then

Vj(A) ~ VjCA) . One issue that must be considered is the monotonicity of the compound decision

function V. As we see in the following example monotonicity is not always assured. "- "-

Example: Let A = (S2, S8) and A = {S5, S8}, hence A ~ A. Assume the

designated element Sq is S3' Let V be such that YeA) = Max(A) if Count (D(A)) = 0 YeA) = Min(A) if Count (D(A)) ~ 1

We see that for our example Count(A) = 1 and hence YeA) = Min(A) = S8, while "-

Count(A) = 0 and hence YeA) = Max(A) = S5' Thus we have YeA) > YeA) even

though A > A.

Page 413: Data Mining, Rough Sets and Granular Computing

410

In order to provide compound decision functions that are monotonic we make the following definition. Definition: Let V be a compound decision function of the type in which

V(A) = ViA) if Count(D(A)) = j where Vj is monotonic. We shall say that V is a progressive decision function if

Vj ~ Vk for j > k. The following theorem shows that progressiveness assures monotonicity.

Theorem: If V is a progressive decision function then it is monotonic. ~ ~ .

Proof: Assume A ~ A. Thenj = Count(D(A)) ~ Count(D(A)) = j. Therefore V(A) = ViA)

;.. ~

V(A)= Vj(A) ;.. ~

The monotonicity of Vj and Vj implies Vj(A) ~Vj<A) and VJ(A) ~ V](A). The ~ ~

progressive of V implies ViA) ~ Vj(A) and ViA) ~ Vj(A). From this we see that ~ ~

Vj(A) ~ ViA) ~ Vj(A).

While lack of progressiveness doesn't always imply nonmonotonicity, we have shown that progressiveness always implies monotonicity. We see then that the decision maker can always assure monotonicity in these complex decision functions by making them progressive. Thus it appears that progressive decision functions are very useful. From the point of view of soft computing, where our goal is to try to do things as simply as possible, this is even more so.

In addition to guaranteeing monotonicity the use of progressive decision functions lead's to simplicity in the construction of compound decision functions . Consider now a compound valuation based upon the partitioning shown in figure #4. We shall, furthermore, assume that it is progressive, Vj ~ Vi if j > i.

V 2

K r

Figure #4. Progressive Decision Function

V r

m

As we shall see the progressiveness leads to a very simple formulation. Let GK' J

be the construct corresponding to at least Kj acceptable solution. We can express the compound decision function shown in figure #4 as

V(A) = V O(A) v Maxj =1 to r[~/A) 1\ ViA)]

Let p equal the number of acceptable solutions in A. Consider first the case where p

Page 414: Data Mining, Rough Sets and Granular Computing

411

< K 1. In this case GK.(A) = Sn for all j and hence we get YeA) = VO(A), as desired. J

Consider now the case where Ki ;::: P < Ki + l' In this case GK/A) = S 1 for all Kj ~

Ki and GK/A) = Sn for all Kj ;::: Ki+ 1. In this case

YeA) = VO(A) v VI (A) v VI (A) v V2(A) ......... v Vi(A). Because of assumed progressiveness, VjCA);::: Vj'(A) if j > j', we get as desired YeA) = Vi(A). We've seen then that it corresponds to the following rule base description:

VisVO or

if the number of acceptable solutions is at least K 1 then V is VI or

if the number of acceptable solutions is at least K2 then V is V 2

or

or if the number of acceptable solutions is at least Kr then V is V r

The situation in the preceding is based upon an interesting and novel type of partitioning (granularization) of the space of the number of acceptable solutions, as seen in figure #5. Here we let P be a variable, the number of acceptable solutions, whose domain is the set M = [0, m]. We let Fj, j = 0 to r, be a collection of subsets of M in which Fj = [Kj, m], where Kt > Ks if t > s and K0 = 0. In this case our rule base is of the form

If P is Fj then V is Vj.

Figure #5. Inclusive Partitioning (the nested subsets F0, F1, F2, F3 of [0, m])

In this partitioning we see that the antecedent components are included within each other, and this forms an inclusive partitioning. This inclusiveness means that multiple rules can fire at the same time; however, a property of the consequent, in this case the progressiveness, determines what happens when more than one rule fires.

In the construction of these compound decision functions, while our building blocks, the primary decision functions, were assumed to be monotonic, we showed that to guarantee monotonicity of the compound function we needed progressiveness. In order to assure progressiveness, we needed to be able to compare these primary decision functions with respect to their optimism. Let us look at this issue for the ordinal OWA class of primary decision functions. We recall that the ordinal OWA class has an associated vector W of dimension m such that:

1. wk ∈ S
2. wm = S1
3. If k2 > k1 then wk2 ≥ wk1

and F(a1, ..., am) = Maxj[wj ∧ bj], where bj is the jth largest of the ai.

Assume W and Ŵ are weight vectors with components wj and ŵj respectively, such that wj ≥ ŵj. It is easily seen that FW(a1, ..., am) ≥ FŴ(a1, ..., am) for all (a1, ..., am). This allows us to easily assure progressiveness when working with this class of primary functions.
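As a small illustrative sketch (ours, not the chapter's): assuming scale elements are encoded as integers where a larger integer denotes a better element, the dominance of ordinal OWA operators with pointwise larger weight vectors can be checked directly.

```python
def ordinal_owa(weights, scores):
    """Ordinal OWA: F(a1,...,am) = Max_j (w_j ^ b_j), b_j the j-th largest a_i.
    Scale elements are integers, larger meaning better, so '^' is min, 'Max' is max."""
    b = sorted(scores, reverse=True)                 # b_1 >= b_2 >= ... >= b_m
    return max(min(w, bj) for w, bj in zip(weights, b))

# Two nondecreasing weight vectors with W >= W_hat componentwise (top weight maximal).
W     = [9, 9, 9, 9]      # a very optimistic operator
W_hat = [0, 0, 5, 9]      # a more pessimistic operator

A = [3, 7, 2, 8]
assert ordinal_owa(W, A) >= ordinal_owa(W_hat, A)    # F_W(A) >= F_What(A)
print(ordinal_owa(W, A), ordinal_owa(W_hat, A))      # 8 3
```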

In the preceding we have focused on the situation where we have a prescribed scale. Many of the ideas discussed here can be extended to the case of a natural scale.

5 Conclusion

We considered the problem of constructing decision functions for use in making decisions under uncertainty. We considered the situation in which our basic scale is an ordinal scale; however, we augment this scale by allowing an additional notion, a classification of payoffs as to whether or not they are acceptable. We showed that this formally corresponds to an ordinal scale with a denoted element and called such a scale a Denoted Ordinal Scale (DOS). It was shown that this augmentation of the ordinal scale increased the power of the scale and therefore allowed us to build more sophisticated decision models.



A Framework for Building Intelligent Information-Processing Systems Based on Granular Factors Space*

Fusheng Yu1 and Chongfu Huang2,3

1 Department of Mathematics, Beijing Normal University Beijing 100875, China

2 Institute of Resources Science, Beijing Normal University Beijing 100875, China

3 Key Laboratory of Environmental Change and Natural Disaster, The Ministry of Education of China. E-mail: hchongfu@bnu.edu.cn

Abstract

Reviewing the current Artificial Intelligence (AI) techniques of knowledge representation, knowledge acquisition and inference, in this chapter we develop the factors space into the granular factors space for building intelligent information-processing systems. In detail, we discuss, using factors space methods and granular factors space methods, how to represent knowledge (concepts, in extension and intention, facts, and rules), how to acquire refined knowledge from source knowledge automatically, and how to reason quickly with an explaining capability. To facilitate our study, a tool for building diagnostic expert systems is provided.

1 Introduction

Artificial Intelligence (AI) has three main branches: Knowledge Engineering (KE), Pattern Recognition (PR) and Robotics. Among these three branches, Knowledge Engineering (KE) is the one at the core. In the research work of Knowledge Engineering, the intelligent knowledge-based system has always occupied an important position and has especially attracted people's attention. The reason is that it can process uncertain and imprecise information, and thus can work in manners similar to those of human beings. In KE, the expert system[7,8,9] is a main kind of intelligent system for processing uncertain and imprecise information.

What we are interested in is how to build an intelligent information-processing system. Generally speaking, building an intelligent information-processing system includes three main parts: knowledge acquisition, knowledge representation and knowledge utilization. Among these, knowledge representation plays a much more important role, because the methods for representing knowledge must be suitable not only for representing the knowledge of the real world reasonably, but also for processing the knowledge on computers conveniently and effectively. Up to now, the available methods of knowledge representation are: Logic, List, Procedure, IF-THEN rules, Semantic Network, Analogy, Frame, Semantic Primitives, Conceptual Dependency Theory, Script, Fuzzy Methods, etc. Each of these representation methods has its advantages and shortcomings; in practice, we can choose one according to the characteristics of the given problem. In this chapter, we will give a new method of knowledge representation, the Granular Factors Space Method; details about this method can be found in Section 4.

* Project Supported by National Natural Science Foundation of China, No. 49971001

Knowledge acquisition is the so-called "bottleneck" problem of Knowledge Engineering; this means that whether the information-processing system to be built is good or not is decided by whether or not the knowledge acquisition can be carried out successfully. In the process of knowledge acquisition, knowledge engineers usually take part and play an important role. By exchanging ideas with domain experts, knowledge engineers get the so-called "source knowledge"; after sorting it out, the "source knowledge" is represented in the form decided by the method of knowledge representation employed. In this chapter, we will give a new method of knowledge acquisition, the Random-Set-Falling-Shadow Method; this method can automatically sort out the so-called "refined knowledge".

Knowledge utilization is the process of using the acquired knowledge to solve the given problem; this process usually takes the form of reasoning. There are two kinds of reasoning methods that can be used in this process: exact reasoning and approximate reasoning. Approximate reasoning is a general method for humans to solve problems; although this method is still being improved, it has already shown good characteristics, so it is and will remain the main method used in the process of knowledge utilization. In this chapter, we will give two reasoning methods: the truth-valued flow method and the weighted synthesis method.

The above topics and steps of building intelligent information-processing systems will be discussed in factors space and granular factors space. As an example, we will study how to build intelligent diagnostic information-processing systems, and give a method for developing tools for building diagnostic expert systems.

Aiming to construct a more powerful framework for building intelligent information-processing systems, in this chapter the factors space is first developed into the granular factors space; then the methods for representing knowledge, acquiring knowledge and reasoning based on the theory of factors space are discussed.

To help readers who have never encountered the theory of factors space, in section 2 we outline the main parts of the theory. In section 3, we propose the concept of granular factors space. In sections 4, 5 and 6, we discuss knowledge representation, knowledge acquisition and fuzzy inference in factors space or granular factors space respectively. In section 7, we outline the methods for building intelligent information-processing systems and give a tool for building diagnostic expert systems, which has been implemented on computer. Lastly, we conclude the chapter with a summary in section 8.

2 Factors Space

In the theory of factors space[1,4,24], factor is a metaword; we can understand it from the following two aspects:

(1) Causation. If we have observed a result or a phenomenon, then we usually try to find its causes. If some causes have been found, then each one of them can be called a factor.

(2) Description. Human beings recognize any thing (object) from different aspects, and any such aspect can be called a factor. Any thing (object) can be described from different aspects, and any concept is formed by contrast at some aspect or aspects.

Definition 2.1 Let U be a set of objects and V be a set of factors. (U, V] is called a left-match pair if

∀u ∈ U, V(u) = {f | f is related to u} ⊆ V

Definition 2.2 Suppose (U, V] is a left-match pair and F ⊆ V. A family of sets {X(f)}(f∈F) is called a factors space if it satisfies the following conditions:

(1) F = F(∨, ∧, c, 0, 1) is a complete Boolean algebra;
(2) X(0) = {0};
(3) ∀T ⊆ F, if (∀s, t ∈ T)(s ≠ t ⇒ s ∧ t = 0), then

∨_{t∈T} f_t = ∏_{t∈T} f_t

where ∏_{t∈T} f_t is the Cartesian product of the maps f_t (here each f is regarded as a map f : U → X(f)). F is called the factor-set, f ∈ F is called a factor, 1 is called the full factor, 0 is called the zero factor and it has only one state 0; X(f) is called the state-space (or states space) of factor f, and X(1) is called the total space.

By the operations ∨ and ∧, we can define subfactor, independent factors, difference factor, complementary factor and atom factor:

• Factor f and factor g are independent iff f ∧ g = 0.
• If factor h satisfies (f ∧ g) ∨ h = f and h ∧ g = 0, then h is called the difference factor of f by g, denoted f - g.
• A factor g is called a proper subfactor of factor f, denoted by f > g, if there exists a set Y with Y ≠ ∅ and Y ≠ {0} satisfying X(f) = X(g) × Y. A factor g is called a subfactor of factor f, denoted by f ≥ g, if f > g or f = g.
• The complementary factor of factor f, denoted by f^c, is defined as f^c ≜ 1 - f.


• A factor f is called an atom factor if f has no proper subfactor except the zero factor. The set of all atoms of the factor-set F is written as π; the factor-set F may be regarded as the power set of π, that is, F = P(π).

Property 2.1 If {f_t}_{t∈T} are factors independent of each other, then

X(∨_{t∈T} f_t) = ∏_{t∈T} X(f_t)

A factors space is like a coordinate system in mathematics; it provides a describing environment for things. In a factors space, every thing can be described; it must have its own manifestation on each related factor. Each concept can be viewed as a fuzzy set in the factors space, and it too must have its own manifestation on each related factor; if we can obtain all these factors' manifestations, then we can get a whole understanding of the concept. Each factor manifestation can be described by a fuzzy subset on the states space of the factor; the fuzzy subset can be obtained by projecting the concept set onto the factor axis. Then, by weighted synthesizing, we can grasp the concept.

Definition 2.3 Suppose Ω is a given set of some concepts, U is the related universe of discourse, and (U, V] is a left-match pair. For any F ⊆ V, if {X(f)}(f∈F) constitutes a factors space, then (U, Ω, F] is called a description frame.

For the convenience of expression, we will use the following notation:

(1) "α ∈ Ω" denotes "for a given description frame (U, Ω, F], α ∈ Ω";
(2) "f ≥ g" denotes "f, g ∈ F, f ≥ g";
(3) f, g, h ∈ F are factors of the factors space {X(f)}(f∈F);
(4) α̃ (∈ F(U)) is the extension of concept α in Ω.

Definition 2.4 Suppose α ∈ Ω; the extension α̃ of α is a fuzzy subset of U, that is, α̃ ∈ F(U). Define α̃_f as:

α̃_f : X(f) → [0,1]
x ↦ α̃_f(x) = ∨_{f(u)=x} α̃(u)

α̃_f is called the manifestation extension of α on factor f; in particular, we denote α̃_1 as ᾱ.

Definition 2.5 The projection ↓^f_g from factor f to factor g is defined as:

↓^f_g : X(f) → X(g), (x, y) ↦ ↓^f_g(x, y) = x

For any B ∈ F(X(f)), by the extension principle we get:

(↓^f_g B)(x) = ∨_{↓^f_g(x,y)=x} B(x, y) = ∨_{y∈X(f-g)} B(x, y)


↓^f_g B is called the projection of the fuzzy set B from f to g.

For any B ∈ F(X(g)), the cylinder extension ↑^f_g B of B from g to f is defined as:

↑^f_g B : X(f) → [0,1]
(x, y) ↦ (↑^f_g B)(x, y) = B(x), where (x, y) ∈ X(g) × X(f - g).
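The following is a small illustrative sketch (our own, not from the chapter) of projection and cylinder extension for a factor f = g ∨ h whose state space is a finite grid X(g) × X(h), with fuzzy sets stored as NumPy membership arrays.

```python
import numpy as np

# A fuzzy set B on X(f) = X(g) x X(h), stored as a |X(g)| x |X(h)| membership matrix.
B = np.array([[0.2, 0.9, 0.4],
              [0.7, 0.1, 0.6]])

# Projection from f to g: (proj B)(x) = sup_y B(x, y)  (sup = max on a finite grid).
proj_g = B.max(axis=1)            # shape (|X(g)|,)

# Cylinder extension of a fuzzy set C on X(g) back up to X(f): (cyl C)(x, y) = C(x).
C = np.array([0.5, 0.8])
cyl_C = np.repeat(C[:, None], B.shape[1], axis=1)   # shape (|X(g)|, |X(h)|)

print(proj_g)      # [0.9 0.7]
print(cyl_C)
```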

Definition 2.6 Suppose α ∈ Ω and f ∈ F. We say that f is sufficient for α if ↑^1_f(↓^1_f ᾱ) = ᾱ. If f is sufficient for α, then any factor g which is independent of f is called a redundant factor for α.

Definition 2.7 Suppose α ∈ Ω; let

γ(α) = ∧{f | f is sufficient for α}

γ(α) is called the rank of concept α.

Definition 2.8 Suppose α ∈ Ω and f ∈ F. The feedback extension α̃^f ∈ F(U) of α about f is defined by α̃^f(u) = α̃_f(f(u)), that is, α̃^f = f^{-1}(α̃_f); in particular, α̃^1 is denoted as α̂.

Definition 2.9 Suppose α ∈ Ω, f ∈ F. If α̃^f = α̃, then f is called a coincident factor for concept α. Let

τ(α) ≜ ∧{f | α̃^f = α̃}

τ(α) is called the feedback rank of concept α.

Property 2.2 Suppose α ∈ Ω, f ∈ F. Then α̃^f = α̃ if and only if

↑^1_f(↓^1_f ᾱ) = ᾱ

For a given problem, there certainly exists a corresponding factors space, but different persons may build different factors spaces. Usually we can build a factors space corresponding to a given problem by listing all the atom factors; let π be the set of all the atom factors, then the factor-set can be regarded as F = P(π), and the corresponding factors space is {X(f)}(f∈F).

Factors can be divided into four kinds according to their states spaces:

• Switch-Kind (KS): the states space of each factor of this kind consists of two opposite states; for instance, the states space of the factor f = "sex" is X(f) = {male, female}.

• Degree-Kind (KD): the states space of each factor of this kind is [0,1], each state meaning a degree given by this factor. For instance, the factor f = "satisfaction degree" and the factor g = "difficulty degree" are factors of this kind; their states spaces are [36,44)(°C) and (0,300)(cm) respectively.

• Enumeration-Kind (KE): the states space of each factor of this kind consists of different kind-values which can be enumerated. For instance, the factor f = "occupation" is a KE-factor whose states space is

X(f) = {teacher, worker, peasant, soldier, ...}

In fact, a KD-factor may be regarded as a KE-factor; KD-factors are particular cases of KE-factors.


3 Granular Factors Space

The theory of fuzzy information granulation (TFIG) was proposed by L.A. Zadeh[13]; it plays an important role in human reasoning and fuzzy logic. In TFIG, the concept of a constraint is the basis: every granule is determined by a kind of constraint, which may be one of the following seven types: equality constraint, possibilistic constraint, veristic constraint, probabilistic constraint, probability value constraint, random set constraint, and fuzzy graph constraint. As pointed out in [13], any fuzzy granule can be considered from its different aspects (attributes). This point of view is consistent with that of the theory of factors space, in which anything can be considered from its related factors, so it is natural for us to give the following definition.

Definition 3.1 {X*(f_i)}(f_i∈F) is called a granular factors space corresponding to the factors space {X(f_i)}(f_i∈F) if

(1) ∀f_i ∈ F, X*(f_i) is a fuzzy perfect field[3];
(2) X*(0) = {0*};
(3) ∀T ⊆ F, if (∀s, t ∈ T)(s ≠ t ⇒ s ∧ t = 0), then

X*(∨_{t∈T} f_t) = ∏_{t∈T} X*(f_t)

where X*(f_i) is the set of fuzzy granules on X(f_i); 0* is the unique empty granule corresponding to the empty state 0, and we adopt the convention X*(0) × X*(f) = X*(f), ∀f ∈ F. X*(f_i) is called the granular states space of factor f_i.

Note 3.1 Definition 3.1 is rational, since the product of two perfect fields is still a perfect field.

Property 3.1 ∀f, g ∈ F, X*(f ∨ g) = X*(f - g) × X*(f ∧ g) × X*(g - f).

Proof Since F is a Boolean algebra,

f ∨ g = (f - g) ∨ (f ∧ g) ∨ (g - f)

and f - g, f ∧ g, g - f are independent; by condition (3) in Definition 3.1 we get X*(f ∨ g) = X*(f - g) × X*(f ∧ g) × X*(g - f). □

Property 3.2 If f, g ∈ F and g ≤ f, then ∀A ∈ X*(g), ↑^f_g(A) ∈ X*(f), and ∀B ∈ X*(f), ↓^f_g(B) ∈ X*(g).

Proof Since g ≤ f, we have f = g ∨ (f - g) and g ∧ (f - g) = 0, so X*(f) = X*(g) × X*(f - g); from this formula we get ∀A ∈ X*(g), ↑^f_g(A) ∈ X*(f) and ∀B ∈ X*(f), ↓^f_g(B) ∈ X*(g). □

Definition 3.2 Suppose {X(f)}(f∈F) is a factors space with F = P(π). {X*(f)}(f∈π) is called the atomic granular factors space.

In reality, for a given problem we need only build the atomic granular factors space, since ∀f ∈ F, f can be represented by some atomic factors; by condition (3) in Definition 3.1, the granular states space of f can be represented by the granular states spaces of these factors.


Definition 3.3 ∀f ∈ F, let X̄*(f) = {A ∈ X*(f) | A ∈ P(X(f))} (the crisp granules in X*(f)), and for x ∈ X(f) let x* = ∩{A | A ∈ X̄*(f), x ∈ A}; x* is called an atomic granule. Let X°(f) denote the set of all atomic granules in X(f).

∀α ∈ Ω, ∀f ∈ F, consider the manifestation extension α̃_f of α in X*(f). If α̃_f ⊆ X(f), then two cases may appear:

(1) α̃_f can be represented by some granules in X*(f);
(2) α̃_f cannot be represented by any group of granules in X*(f).

In the first case, we can describe α well on factor f, but in the second case we can get only an approximate description of α on factor f. In order to get a good approximation of concepts on factors, the granular factors space should be fine enough; in order to raise the efficiency of intelligent information-processing systems, the granular factors space should be as simple as possible. We should weigh these two aspects while solving real-world problems.

4 Knowledge Representation in Factors Space

In an intelligent information-processing system, the knowledge to be represented includes facts, rules and inference methods. In this section, we will give the corresponding representation methods for them based on the theory of factors space. Since the majority of knowledge can be represented by concepts, we will first study the representation method for concepts in factors space.

4.1 Representing Concepts in Factors Space

Different concepts are formed by dividing some given universe of discourse; for instance, if we divide the universe of age, we can have the following concepts: young man, middle-aged man and old man. If the division is exact, then the concepts formed are exact; if the division is inexact, then the concepts formed are fuzzy. What are the bases of the division? They are usually factors, so it is natural to represent concepts in the corresponding factors space.

A concept can be described by any one of the following methods:

• Intention-Method: this method describes a concept by showing the essential attributes of the concept.
• Extension-Method: this method describes a concept by showing the objects which conform to the concept.
• Concept-Frame-Method: this method describes a concept by making use of the relations between it and other concepts.

In a factors space, describing a concept is implemented by describing it on the related factors; this is an analysis process. After analysis, the concept can be described by synthesizing the analytical results together; this is a synthesis process.


If we want to describe a concept in a factors space, we can use not only the Intention-Method but also the Extension-Method. Fuzzy set theory provides some extension methods but fewer intention methods for describing concepts; besides, fuzzy set theory does not solve the problems of universe selection and universe alternation.

Describing the Extension of a Concept in Factors Space

Definition 4.1 Suppose f ∈ F. If ∀x ∈ X(f), f^{-1}(x) is a one-point set, then f is called a characteristic factor; if ∀x ∈ X(f) with x ≠ 0, f^{-1}(x) is a one-point set, then f is called a pre-characteristic factor. Let U(f) = {u | f(u) = 0} ⊆ U; U(f) is called the blind universe of factor f, where U is the given universe.

Property 4.1 If f is a characteristic factor, then ∀α ∈ Ω, f is sufficient and coincident for α.

Proof It is easy to prove that f is a monomorphism if f is a characteristic factor, and thus α̃^f = α̃; this means that f is coincident for α, and from Property 2.2, f is sufficient for α. □

Definition 4.2 Suppose α ∈ Ω and f ∈ F. The restriction of α̃ to U - U(f) is called the vision extension of α for factor f, denoted by α̃|_f.

Property 4.2 Suppose f ∈ F is a pre-characteristic factor. Then ∀α ∈ Ω, f^{-1}(f(α̃|_f)) = α̃|_f.

Proof When restricted to the set U - U(f), f is a characteristic factor; by Property 4.1 there holds f^{-1}(f(α̃|_f)) = α̃|_f. □

Note 4.1 The full factor 1 is a characteristic factor.

The following theorems give the methods for describing concepts.

Theorem 4.1

α̃(u) = ∧_{f∈F} α̃_f(x)

where α ∈ Ω, α̃ ∈ F(U), x = f(u), u ∈ U. The proof can be found in [24], where Note 4.1 was made use of.

Theorem 4.2 Suppose G ⊆ F and ∃g ∈ G such that g is a characteristic factor. Then

α̃(u) = ∧_{f∈G} α̃_f(x)

where α ∈ Ω, α̃ ∈ F(U), x = f(u), u ∈ U.

Proof By Theorem 4.1, we have

α̃(u) = ∧_{f∈F} α̃_f(x) ≤ ∧_{f∈G} α̃_f(x)

On the other hand, since g ∈ G is a characteristic factor, u = g^{-1}(x) for x = g(u) ∈ X(g), so

∧_{f∈G} α̃_f(x) ≤ α̃_g(x) = (α̃ ∘ g^{-1})(x) = α̃(g^{-1}(x)) = α̃(u)

hence α̃(u) = ∧_{f∈G} α̃_f(x). □

In the proof of Theorem 4.2, the following property was used:

Property 4.3 If f is a characteristic factor, then ∀α ∈ Ω, α̃_f = α̃ ∘ f^{-1}. This property can be illustrated by Figure 4.1; its proof is omitted here.

Figure 4.1 Relationship between α̃, α̃_f and f (the maps f : U → X(f), α̃ : U → [0,1] and α̃_f : X(f) → [0,1])

Theorem 4.3 Suppose G ⊆ F. If ∀u ∈ U, ∃f ∈ G such that f^{-1}(f(u))

is a one-point set, then

α̃(u) = ∧_{f∈G} α̃_f(x)

where α ∈ Ω, α̃ ∈ F(U), x = f(u), u ∈ U.

Proof It is obvious that α̃(u) ≤ ∧_{f∈G} α̃_f(x). For any u ∈ U, ∃g ∈ G such that g^{-1}(g(u)) is the one-point set {u}, so

∧_{f∈G} α̃_f(x) ≤ α̃_g(x) = α̃(g^{-1}(x)) = α̃(u)  □

Corollary 4.1 Suppose π is the set of all atom factors. If ∀u ∈ U, ∃f ∈ π such that f^{-1}(f(u)) is a one-point set, then

α̃(u) = ∧_{f∈π} α̃_f(x)

where α ∈ Ω, α̃ ∈ F(U), x = f(u), u ∈ U.

Property 4.4 If f, g are characteristic factors, then there is a one-to-one mapping between X(f) and X(g) such that ∀α ∈ Ω, α̃_f = α̃_g (up to this mapping).

Proof Define the mapping h as

h : X(f) → X(g), x ↦ h(x) = g(f^{-1}(x))

It is easy to prove that h is a one-to-one mapping, and that

α̃_f(x) = α̃(f^{-1}(x)) = α̃(g^{-1}(y)) = α̃_g(y)

where y = g(f^{-1}(x)) = h(x). □

From the above discussion, we have the following conclusions about describing

extensions of concepts in factors space:


• The extension of any concept can be described in a factors space.
• The extension of any concept can be described on a sufficient factor.
• The extension of any concept can be described on a coincident factor.
• The extension of any concept can be described on a characteristic factor.

Finely Representing the Extensions of Concepts in Factors Space

For any factor f in a factors space, there holds α̃^f ⊇ α̃, so ∀G ⊆ F and ∀α ∈ Ω we have

∩_{f∈G} α̃^f ⊇ α̃

especially

∩_{f∈π} α̃^f ⊇ α̃

where π is the set of all atom factors; and obviously, if G ⊆ H ⊆ F, then

∩_{f∈G} α̃^f ⊇ ∩_{f∈H} α̃^f

From this formula, it is easy to see that the more factors we can get, the better the approximation that can be reached. But in reality, the number of available factors is limited; it is usually equal to the number of elements of the set π, so we must find other approximation methods. Since α̃ = 1^{-1}(ᾱ), 1 = ∨_{f∈π} f and X(1) = ∏_{f∈π} X(f), we need only concentrate on the atom factors. In fact, the manifestation extension ᾱ of concept α on the full factor 1 is a fuzzy relation in X(1), so the problem of how to represent the extension of a concept well becomes the problem of how to approximate this fuzzy relation well. Let N be the number of all atom factors; the new problem is how to approximate the fuzzy relation ᾱ in the generalized N-dimensional Cartesian space X(f_1) × X(f_2) × ... × X(f_N). For simplicity, we assume that N = 2; the corresponding space is denoted by X × Y, and R ∈ F(X × Y) is a fuzzy relation corresponding to some concept α.

The existing important methods for approximating a fuzzy relation are:

(1) Projection Method[2,3]. For R ∈ P(X × Y), let

R_X = {x | (x, y) ∈ R},  R_Y = {y | (x, y) ∈ R}

R_X, R_Y are called the X-projection and Y-projection of R respectively. For R ∈ F(X × Y), let

R_X = ∪_{λ∈[0,1]} λ(R_λ)_X,  R_Y = ∪_{λ∈[0,1]} λ(R_λ)_Y

R_X, R_Y are called the fuzzy X-projection and fuzzy Y-projection of R respectively.


Property 4.5 ∀R ∈ F(X × Y), there holds R ⊆ R_X × R_Y.

This property gives a method for approximating R, and this method is easy to use: the total work is to get R_X and R_Y, which in general are easy to obtain. The approximation is not good enough, however, because the approximation can be the same for two extremely different fuzzy relations which are both located in R_X × R_Y.

(2) Cross Section Method[2,3]. For R ∈ P(X × Y), let

R|x = {y | (x, y) ∈ R},  R|y = {x | (x, y) ∈ R}

R|x, R|y are called the X-cross-section and Y-cross-section of R respectively. For R ∈ F(X × Y), let

R|x = ∪_{λ∈[0,1]} λ(R_λ)|x,  R|y = ∪_{λ∈[0,1]} λ(R_λ)|y

R|x, R|y are called the fuzzy X-cross-section and fuzzy Y-cross-section of R respectively, where (x, y) ∈ X × Y.

Property 4.6 ∀R ∈ F(X × Y), there holds

R = ∪_{x∈X} {x} × R|x,  R = ∪_{y∈Y} R|y × {y}

Although this method gives an exact representation of R, it is difficult to use, because the cross sections are difficult to obtain in reality.

(3) Directional Cylinder Extension Method. This method was proposed by L.A. Zadeh[13]; it is a generalization of the Projection Method. In this method, the approximation of R is the intersection of all given directional cylinder extensions. In the words of this method, the Projection Method can be stated as follows: the approximation of R in the Projection Method is G_α ∩ G_β, where G_α, G_β are the X-direction cylindrical extension and Y-direction cylindrical extension respectively.

Here we give a new method called the Set-Cross-Section (SCS) Method:

(4) Set-Cross-Section Method. The departure point of this method is that, in reality, we can usually get not only the whole projections of a given relation on some given axes but also partial projections on an axis or some axes. It is obvious that an approximation of the relation based on the partial projections will be better than one based on whole projections.

Definition 4.3 Suppose R ∈ F(X × Y) and A ∈ P(X). The set cross section of R on A, denoted by R|A, is defined as

R|A = ∪_{x∈A} R|x


Property 4.7 A ⊆ B ⇒ R|A ⊆ R|B; R ⊆ S ⇒ R|A ⊆ S|A; where A, B ∈ P(X), R, S ∈ F(X × Y).

Theorem 4.4 ∀A ∈ P(X), R|A = A ∘ R.

Proof

R|A(y) = (∪_{x∈A} R|x)(y) = ∨_{x∈A} R|x(y) = ∨_{x∈A} R(x, y) = ∨_{x∈A}(A(x) ∧ R(x, y)) = (A ∘ R)(y)

thus R|A = A ∘ R. □

Theorem 4.5 Suppose Ai (i = 1, 2, ..., n) is a partition of X; then

∪_{i=1}^{n} (Ai × R|Ai) ⊇ R

The proof is omitted here.

Theorem 4.6 If Bj (j = 1, 2, ..., m) is a refined partition of the partition Ai (i = 1, 2, ..., n) (m ≥ n), then

∪_{i=1}^{n} (Ai × R|Ai) ⊇ ∪_{j=1}^{m} (Bj × R|Bj)

The proof is omitted here.

Definition 4.4 Suppose R ∈ F(X × Y) and A ⊆ X. The inner set cross section of R on A, denoted by R‖A, is defined as

R‖A = ∩_{x∈A} R|x

Property 4.8 A ⊆ B ⇒ R‖A ⊇ R‖B; R ⊆ S ⇒ R‖A ⊆ S‖A; R‖A ⊆ R|A; where A, B ∈ P(X), R, S ∈ F(X × Y).

Theorem 4.7 Suppose Ai (i = 1, 2, ..., n) is a partition of X; then

∪_{i=1}^{n} (Ai × R‖Ai) ⊆ R

The proof is omitted here.

Theorem 4.8 If Bj (j = 1, 2, ..., m) is a refined partition of the partition Ai (i = 1, 2, ..., n) (m ≥ n), then

∪_{i=1}^{n} (Ai × R‖Ai) ⊆ ∪_{j=1}^{m} (Bj × R‖Bj)

The proof is omitted here.

By making use of the set cross section and the inner set cross section, we can get an upper approximation and a lower approximation of a fuzzy relation; moreover, the upper approximation is better than that of the Projection Method. The application steps of this method are:

• Construct a partition of X: {Ai} (i = 1, 2, ..., n);


• Obtain the (inner) set cross section on each Ai;
• Construct the lower and the upper approximations based on the inner set cross sections and on the set cross sections respectively (usually by fuzzy statistical approaches, for example the falling shadow of random sets[1]); from the lower and the upper approximations we can give a proper approximation.
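As a small illustrative sketch (ours, with made-up membership values), the set cross sections, inner set cross sections, and the resulting upper and lower approximations of a finite fuzzy relation can be computed as follows:

```python
import numpy as np

# A fuzzy relation R on a finite grid X x Y (membership matrix).
R = np.array([[0.9, 0.2, 0.0],
              [0.8, 0.3, 0.1],
              [0.1, 0.7, 0.6]])

# A crisp partition of X = {0, 1, 2} into blocks A1 = {0, 1}, A2 = {2}.
blocks = [np.array([0, 1]), np.array([2])]

upper = np.zeros_like(R)   # union of A_i x (set cross section R|A_i)
lower = np.zeros_like(R)   # union of A_i x (inner set cross section)
for A in blocks:
    scs  = R[A].max(axis=0)   # R|A(y) = max_{x in A} R(x, y)
    iscs = R[A].min(axis=0)   # inner section(y) = min_{x in A} R(x, y)
    upper[A] = np.maximum(upper[A], scs)   # (A x R|A) contributes scs on rows in A
    lower[A] = np.maximum(lower[A], iscs)

assert (lower <= R).all() and (R <= upper).all()   # Theorems 4.7 and 4.5
print(upper)
print(lower)
```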

Representing the Intentions of Concepts in Factors Space

The theory of factors space provides not only a method for representing the extensions of concepts but also a method for representing the intentions of concepts. In a factors space, ∀α ∈ Ω, ∀f ∈ F, we can obtain the approximation of the extension of concept α on factor f. If the approximation is the same as the extension of concept α, then factor f is a most essential attribute of concept α; in contrast, if the extension of concept α has nothing to do with factor f, then the factor is an unrelated attribute of concept α; the other factors are related attributes of concept α. We may use the difference between the approximation and the extension to describe the intentions of concepts.

Definition 4.5 Suppose α ∈ Ω and f ∈ F; f is essential to α if f is sufficient for α.

Property 4.9 If f is a coincident factor of concept α, then f is an essential factor of α.

Property 4.10 If f is a characteristic factor of concept α, then f is an essential factor of α.

Property 4.11 If f is an essential factor of concept α and g ≥ f, then g is an essential factor of α. More specifically, γ(α) is the minimum essential factor of concept α.

Definition 4.6 Factor f is called an inessential factor if α̃_f = X(f).

Property 4.12 If f is an inessential factor of concept α and g ≤ f, then g is also an inessential factor of α.

Example 4.1 The factor f = 'sex' is inessential to the concept α = 'young man'.

Definition 4.7 For a given description frame (U, Ω, F], define a mapping S as

S : F × F(U) → [0,1], (f, A) ↦ S(f, A)

S is called a sufficient measure if it satisfies the following conditions[4]:

(1) f(A) ∈ P(X(f)) ⇒ S(f, A) = 1;
(2) (∀x ∈ X(f))(f(A)(x) = 0.5) ⇒ S(f, A) = 0;
(3) S(f, A) = S(f, A^c);
(4) f, g ∈ F, f ≥ g ⇒ S(f, A) ≥ S(g, A).

Definition 4.8 For a given description frame (U, Ω, F], define a mapping C as

C : F × F(U) → [0,1], (f, A) ↦ C(f, A)

C(f, A) = ∧_{u∈U} {1 - f(A)(f(u)) + A(u)}


Definition 4.9 [4] Suppose α ∈ Ω and e ∈ [0,1]; factor f is called an e-essential factor if it satisfies

S(f, α̃) ≥ 1 - e  or  C(f, α̃) ≥ 1 - e

Note 4.2 For simplicity, we use SC(f, α̃) ≥ 1 - e to denote the condition in Definition 4.9.

Suppose f1, f2, ..., fn are all the atom factors and α ∈ Ω; then

α̃(u) ≈ ∧_{i=1}^{n} α̃_{fi}(fi(u))

From this expression we can give statements describing the concept α factor by factor; in fact, they are descriptions of the intention of concept α, and the precision is 1 - SC(f, α̃).

4.2 Representing Facts in Factors Space

Facts are propositions which are used to describe objects, while propositions are based on concepts. In section 4.1 we studied how to represent concepts in factors space; thus facts can also be described in factors space.

Example 4.2 Suppose U = {all patients}, A = {patients suffering from heart disease}, B = {patients suffering from pneumonia}, and u ∈ U. Then we have the following four facts:

(1) u is a patient suffering from heart disease;
(2) u is not a patient suffering from heart disease;
(3) u is a patient suffering from heart disease and pneumonia;
(4) u is a patient suffering from heart disease or pneumonia.

They can be described as u ∈ A, Not(u ∈ A), u ∈ A ∩ B and u ∈ A ∪ B respectively.

Example 4.3 Suppose U = {all patients}, A = {patients suffering from heart disease}, B = {patients suffering from pneumonia}, and S ⊆ U. Then we have the following four facts:

(1) S are patients suffering from heart disease;
(2) S are not patients suffering from heart disease;
(3) S are patients suffering from heart disease and pneumonia;
(4) S are patients suffering from heart disease or pneumonia.

They can be described as S ⊆ A, Not(S ⊆ A), S ⊆ A ∩ B and S ⊆ A ∪ B respectively.

Example 4.2 and Example 4.3 provide two different methods for describing facts: the former uses "∈", while the latter uses "⊆". If we regard u as {u}, then the two methods can be unified, so we have


[Conclusion] Any fact on universe U can be described by a subset (crisp or fuzzy); the general form is A ⊆ B, whose meaning is "A is B", where A and B are subsets of U.

Definition 4.10 Facts of the form "A ⊆ B" are called simple facts; the others are called complex facts.

By the set operations ∩, ∪, c, we can construct complex facts from simple facts.

The description of a fact on the given universe is a whole description of the given object, but in the corresponding factors space, the description of a fact on one factor's states space is a side description of the given object.

In factors space, facts can be described in the following manners:

• "A is B_f" means f(A) ⊆ B_f;
• "A_f is B" means f^{-1}(A_f) ⊆ B;
• "A_f is B_g" means ↑^{f∨g}_f A_f ⊆ ↑^{f∨g}_g B_g;

where A, B ∈ F(U), and A_f, B_f, B_g are fuzzy subsets of X(f), X(g) respectively.

Example 4.4 Examples corresponding to the above manners are:

(1) The body temperature of a patient suffering from pneumonia is on the high side;
(2) A patient whose body temperature is on the high side is one who has caught a cold;
(3) The number of white blood cells of a patient whose body temperature is on the high side is larger.

4.3 Representing Rules in Factors Space

Rules are conditional propositions composed of propositions; they are used to describe the dependence relations between facts. If there is a fuzzy proposition among the component propositions, then the rule is a fuzzy rule. The general form of a rule is IF - THEN -, where each "-" is a proposition. Since propositions can be described in factors space, so can rules.

In a factors space, rules have the following basic forms:

(4.1) IF u is A THEN v is B
(4.2) IF f(u) is A_f THEN v is B
(4.3) IF u is A THEN g(v) is B_g
(4.4) IF f(u) is A_f THEN g(v) is B_g

where A, B ∈ F(U), A_f ∈ F(X(f)), B_g ∈ F(X(g)), u, v ∈ U, f, g ∈ F. The dependence relations reflected respectively by them are:

• Relation between the whole description of u and the whole description of v
• Relation between the side description of u and the whole description of v
• Relation between the whole description of u and the side description of v
• Relation between the side description of u and the side description of v

In a factors space, rules may have the following forms:

(4.5) IF u* is A THEN v* is B
(4.6) IF f(u*) is A_f THEN v* is B
(4.7) IF u* is A THEN g(v*) is B_g
(4.8) IF f(u*) is A_f THEN g(v*) is B_g

where A, B ∈ F(U), A_f ∈ F(X(f)), B_g ∈ F(X(g)), u*, v* ∈ F(U), f, g ∈ F. The dependence relations reflected respectively by them are:

• Relation between the whole description of u* and the whole description of v*

• Relation between the whole description of u* and the side description of v*

• Relation between the side description of u* and the side description of v*

From these rule forms, we can get many other forms by combination of propo­sitions.

4.4 Representing Inference in Factors Space

The general form of classical inference is:

IF u is A THEN v is B
u is A
----------
v is B

where A ⊆ U, B ⊆ V.

The general form of fuzzy inference is:

IF u is A THEN v is B
u is A*
----------
v is B*

where A, A* ∈ F(U), B, B* ∈ F(V), and

B* = A* ∘ R,  R = A × B ∪ A^c × V    (4.9)


In factors space, rules have their own characteristics; this can be seen in (4.1)-(4.8). Based on (4.1)-(4.8), we can give the corresponding inference forms. Taking (4.3) as an example, we have

IF u is A THEN f(v) is B_f
u is A*
----------                    (4.10)
f(v) is B_f*

where A, A* ∈ F(U), B_f, B_f* ∈ F(X(f)), B_f* = A* ∘ R, and R is determined by A and B_f in accordance with formula (4.9).
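The following is a minimal numerical sketch (our own illustration, with arbitrary membership values) of the compositional rule of inference B* = A* ∘ R, with R built as in formula (4.9) and the composition taken as max-min:

```python
import numpy as np

A      = np.array([0.2, 0.9, 0.6])    # fuzzy set A on U (|U| = 3)
B      = np.array([0.7, 0.3])         # fuzzy set B on X(f) (|X(f)| = 2)
A_star = np.array([0.1, 1.0, 0.4])    # observed fact "u is A*"

# R = (A x B) U (A^c x V): R(u, v) = max( min(A(u), B(v)), 1 - A(u) )
R = np.maximum(np.minimum.outer(A, B), (1.0 - A)[:, None])

# B* = A* o R  (max-min composition): B*(v) = max_u min(A*(u), R(u, v))
B_star = np.max(np.minimum(A_star[:, None], R), axis=0)
print(B_star)
```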

Besides this basic inference form, there are the following inference forms in factors space[11]:

(1) Extension Inference

IF f(u) is A, g ≥ f
----------
g(u) is A × X(g - f) = ↑^g_f A

(2) Projection Inference

IF f(u) is A, g ≤ f
----------
g(u) is ↓^f_g A

(3) Join Inference

IF f(u) is A, g(u) is B
----------
h(u) is ↑^h_f A ∩ ↑^h_g B   (h = f ∨ g)

(4) Composition Inference

IF f(u) is A, (f ∨ g)(u) is S
----------
g(u) is A ∘ S

(5) Representation Inference

IF u is C, f ∈ F
----------
f(u) is f(C)


where A ∈ F(X(f)), B ∈ F(X(g)), C ∈ F(U), S ∈ F(X(f ∨ g)), f, g, h ∈ F, u ∈ U.

If we combine these inference forms and the rules in (4.1)-(4.8), then we can get many more inference forms. Taking the rule in (4.3) as an example, if g ≥ f, then from the inference form in (4.10) we have an inference form such as:

IF u is A THEN f(v) is B_f
u is A*
----------
f(v) is B_f*
g ≥ f
----------
g(v) is ↑^g_f B_f*

5 Knowledge Acquiring in Factors Space

Suppose Ω = {αi | i = 1, 2, ..., m} is a group of concepts on universe U, the corresponding factors space is {X(fi)}(fi∈F), F = P(π), π = {f1, f2, ..., fn}, and (U, Ω, F] is the description frame.

In the factors space {X(fi)}(fi∈F), the knowledge about the concepts Ω = {αi | i = 1, 2, ..., m} can be divided into two main kinds: one is the manifestation knowledge of every concept α ∈ Ω on its related factors, the other is the relationship knowledge of the concepts {αi | i = 1, 2, ..., m}. In this section, we will study how to acquire knowledge of these two kinds.

5.1 Acquiring the Manifestation Knowledge in Factors Space

∀α ∈ Ω, ∀f ∈ F, if they are related then α has a manifestation α̃_f on factor f. Generally, α̃_f is a fuzzy set in X(f); thus, in reality, we should know how to obtain knowledge of this kind. Usually it is acquired from domain expert(s): if only one domain expert can be consulted, then α̃_f should be given by him together with the engineer who participates in the given problem; but if there are many domain experts to consult, how should α̃_f be given?

Here we give a method based on the theory of the falling shadow of random sets[1]. For the sake of completeness, we cite some theorems without proof.

Definition 5.1 [1] Let X be the universe and X* = {{x}* | x ∈ X}; let B be a σ-algebra of subsets of P(X) containing X*, so that (P(X), B) is a measurable space. A random set ξ on X is a measurable mapping from some probability field (Ω, A, P) to (P(X), B):

ξ : Ω → P(X),  ξ^{-1}(C) = {ω ∈ Ω | ξ(ω) ∈ C} ∈ A,  ∀C ∈ B

Let S(A, B) denote the set of all random sets from the probability field (Ω, A, P) to (P(X), B).


Definition 5.2 [1] Suppose ξ ∈ S(A, B). The falling shadow of ξ at x ∈ X is defined as

μ_ξ(x) = P{ω ∈ Ω | x ∈ ξ(ω)}

μ_ξ(x) is called the falling shadow function of the random set ξ.

Theorem 5.1 [1] (law of large numbers of the falling shadow of random sets) If ξ_i ∈ S(A, B) (i = 1, 2, ..., n, ...) are random sets which are independent and identically distributed, with μ_{ξ_i}(x) = μ(x), then

ξ̄_n(x) = (1/n) Σ_{j=1}^{n} χ_{ξ_j}(x) → μ(x)  (a.e.) (n → ∞)

In order to obtain the manifestation knowledge from domain experts, we may ask them to answer the question

What is the manifestation of the concept α on factor f?    (5.1)

After this question has been answered, we get n random sets in X(f): {ξ_i} (i = 1, 2, ..., n). In the light of the theory of the falling shadow of random sets, we can use (1/n) Σ_{i=1}^{n} χ_{ξ_i}(x) to approximate the manifestation extension α̃_f. It is obvious that the more random sets we have, the better the approximation.
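A minimal sketch of this estimator (our own illustration): each expert's answer is taken to be an interval of the factor's states space, and the empirical falling shadow at each point is the fraction of intervals covering it.

```python
import numpy as np

# Hypothetical expert answers for "young man" on the factor Age: one interval each.
intervals = [(16, 28), (18, 30), (15, 26), (18, 32)]

ages = np.arange(0, 60)
# Empirical falling shadow: fraction of expert intervals that cover each age.
falling_shadow = np.mean(
    [(lo <= ages) & (ages <= hi) for lo, hi in intervals], axis=0)

print(falling_shadow[18], falling_shadow[29])   # e.g. 1.0 and 0.5
```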

In reality, the random sets may have special properties. As a special case, they often occur as random intervals; therefore we should discuss this case and give the corresponding results.

Theorem 5.2 Suppose ξ = [ξ1, ξ2] is a random interval, where ξ1, ξ2 are random variables whose distribution functions are F1, F2 respectively; then μ_ξ(x) = F1(x + 0) - F2(x).

Proof As space is limited, the proof is omitted here[19].

Corollary 5.1 If ξ1 ~ N(μ1, σ1²) and ξ2 ~ N(μ2, σ2²), then

μ_ξ(x) = Φ((x - μ1)/σ1) - Φ((x - μ2)/σ2)

Proof

μ_ξ(x) = (1/√(2π)) [ (1/σ1) ∫_{-∞}^{x} e^{-(t-μ1)²/(2σ1²)} dt - (1/σ2) ∫_{-∞}^{x} e^{-(t-μ2)²/(2σ2²)} dt ]
       = ∫_{-∞}^{(x-μ1)/σ1} (1/√(2π)) e^{-t²/2} dt - ∫_{-∞}^{(x-μ2)/σ2} (1/√(2π)) e^{-t²/2} dt
       = Φ((x - μ1)/σ1) - Φ((x - μ2)/σ2)

where Φ(x) is the distribution function of the standard normal distribution. □

Corollary 5.2 Suppose ξ = [ξ1, ξ2] is a random interval, where ξ1, ξ2 are

continuous random variables whose distribution functions are F1(x), F2(x) respectively; then μ_ξ(x) = F1(x) - F2(x).

Corollary 5.3 Suppose ξ = [ξ1, ξ2] is a random interval, where ξ1, ξ2 are discrete random variables whose distribution laws are P1, P2 respectively; then

μ_ξ(x) = Σ_{ξ1k ≤ x} P1(ξ1k) - Σ_{ξ2k ≤ x} P2(ξ2k)

According to the results obtained above, we can get the falling shadow function by the following steps:

• Make a hypothesis about the distribution functions of the random variables;
• Test the hypothesis, or determine the parameters of the hypothetical distribution functions.

If the answers to the question (5.1) are fuzzy sets, we have the following conclusions.

Definition 5.3 [3] Suppose (Ω, A, P) is a probability field. Let X* = {x_λ | x ∈ X, λ ∈ [0,1]}, x_λ = {A ∈ F(X) | x ∈ A_λ}, and B̃ = σ(X*); then (F(X), B̃) is a fuzzy measurable space on X. A random fuzzy set ξ on X is a mapping defined below:

ξ : Ω → F(X), ω ↦ ξ(ω),  ξ^{-1}(C) = {ω ∈ Ω | ξ(ω) ∈ C} ∈ A, ∀C ∈ B̃

where x* ∈ F(F(X)), x*(A) = A(x), A ∈ F(X); x_λ = {A ∈ F(X) | x*(A) ≥ λ} = {A | x ∈ A_λ}.

Let S*(A, B̃) be the set of all random fuzzy sets from the probability field (Ω, A, P) to (F(X), B̃).

Definition 5.4 [3] Suppose ξ ∈ S*(A, B̃). The falling shadow of ξ at x ∈ X is defined as

μ_ξ(x) = P_ξ(x*) = ∫ x*(A) dP_ξ

μ_ξ(x) is called the falling shadow function of the random fuzzy set ξ, where ∀C ∈ B̃, P_ξ(C) ≜ P(ξ^{-1}(C)); ∀C* ∈ B̃* = {C ∈ F(F(X)) | C_λ ∈ B̃, λ ∈ (0,1]}, P_ξ(C*) ≜ ∫ C*(A) dP_ξ.

Theorem 5.3 [3] (law of large numbers of the falling shadow of random fuzzy sets) If ξ_i ∈ S*(A, B̃) (i = 1, 2, ..., n, ...) are random fuzzy sets which are independent and identically distributed, with μ_{ξ_i}(x) = μ(x), then

ξ̄_n(x) = (1/n) Σ_{i=1}^{n} χ_{ξ_i}(x) → μ(x)  (a.e.) (n → ∞)


For random fuzzy sets, we can get results similar to those for random sets; details can be found in [3,17] but are omitted here.

Based on this method, the acquisition of manifestation knowledge can be completed with ease; thus, in the process of building intelligent information-processing systems, we can automate the acquisition of knowledge.

5.2 Acquiring the Relationship Knowledge in Factors Space

In the factors space {X(fi)}(fi∈F), for the concepts {αi | i = 1, 2, ..., m}, besides the manifestation knowledge there is also the relationship knowledge. The acquisition of the former was discussed in section 5.1; the acquisition of the latter is discussed in this section.

Since these concepts {αi | i = 1, 2, ..., m} share the same factors space, they usually have something to do with one another; the relationship can be represented by factors, which serve as the link among the concepts. The relationship can be represented by fuzzy rules: the antecedents of the rules are about factors, and the consequents of the rules are about the concepts.

In order to get the rules, generally we should granulate the states spaces. Suppose {X(fi)}(fi∈F) is a factors space and {X*(fi)}(fi∈F) is the granular factors space corresponding to {X(fi)}(fi∈F); then we can construct rules in {X*(fi)}(fi∈F).

Example 5.1 Suppose {X(fi)}(fi∈F) is a factors space whose atom factor set is

π = {Height, Weight, Age}

and

X(Height) = (0, 300) (cm)
X(Weight) = (0, 500) (kg)
X(Age) = (0, 200) (years)

The corresponding granular factors space is {X*(fi)}(fi∈F), where

X*(Height) = {tall, middle, short, ...}
X*(Weight) = {heavy, middle, light, ...}
X*(Age) = {old, middle, young, ...}

where tall, middle, short are fuzzy sets in X(Height), heavy, middle, light are fuzzy sets in X(Weight), and old, middle, young are fuzzy sets in X(Age); they are all fuzzy granules.

For a factor f, how to construct its granular states space X*(f) depends on the properties of the given real problem and on the engineers and experts who participate in the study of the problem. For example, for the factor Age, its granular states space may be

X*(Age) = {very old, old, middle, young, very young, ...}

Having got the granular factors space, we can construct rules easily. For instance, in the granular factors space built above, we have the following rules:


• IF x is short AND x is heavy THEN x suffers from obesity
• IF y is old AND y is heavy THEN y suffers from high blood pressure
• IF z is middle AND z is short THEN z is a dwarf
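A small sketch (ours; the trapezoidal membership functions and thresholds are invented for illustration) of how such granules and rules can be represented on the states spaces of Example 5.1:

```python
def trapezoid(a, b, c, d):
    """Trapezoidal membership function on a numeric states space."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Fuzzy granules on X(Height) (cm) and X(Weight) (kg) -- illustrative shapes only.
short = trapezoid(0, 1, 150, 165)      # ~1 below 150 cm, fading out by 165 cm
heavy = trapezoid(80, 95, 499, 500)    # ramping up from 80 kg

def rule_obesity(height_cm, weight_kg):
    """IF x is short AND x is heavy THEN x suffers from obesity (AND = min)."""
    return min(short(height_cm), heavy(weight_kg))

print(rule_obesity(155, 100))   # firing degree of the rule for one person
```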

and so forth.

Once the granular states spaces of all factors have been determined, the number of all possible rules is determined for the given concepts (objects, things). So the granular states space X*(f) of each factor f should be given carefully.

6 Fuzzy Reasoning in Factors Space

Fuzzy reasoning methods in factors space include two main kinds: weighted synthesis method and truth-valued flow method[20,22,3,4].

6.1 Weighted Synthesis Reasoning in Factors Space

This method consists of two steps:

• Step 1 By analytical method, get the side descriptions of the given object on all related factors;

• Step 2 By synthetical method, get the whole description of the given object from all the known side descriptions.

Suppose {X(fi)}(fi∈F) is a factors space whose atom factor set is π = {f1, f2, ..., fm} and (U, Ω, F] is a description frame. ∀α ∈ Ω, its extension α̃ is unknown, but all its manifestation extensions α̃_{fi} on the related factors fi ∈ π are known (they may be obtained by the method of the falling shadow of random (fuzzy) sets). If wi(x) (i = 1, 2, ..., m) is the weight function of factor fi, then by weighted synthesis we can get an approximation of α̃, that is,

α̃ ≈ α̃_f,  α̃_f(x) = Σ_{i=1}^{m} wi(x) α̃_{fi}(xi)

where f = ∨_{i=1}^{m} fi, x = (x1, x2, ..., xm) ∈ ∏_{i=1}^{m} X(fi), and wi(x) satisfies the condition Σ_{i=1}^{m} wi(x) = 1.
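A minimal numerical sketch (ours) of this weighted synthesis, with two factors, made-up manifestation membership values and fixed weights:

```python
# Manifestation extensions of a concept on two factors, evaluated at the observed
# states x1, x2 of one object (values are illustrative).
alpha_f1_x1 = 0.8     # e.g. membership of the object's age in "young"
alpha_f2_x2 = 0.4     # e.g. membership of the object's height in "tall"

# Weights of the two factors at this point (they must sum to 1).
w = [0.7, 0.3]

# Weighted synthesis: approximate whole description of the object w.r.t. the concept.
alpha_approx = w[0] * alpha_f1_x1 + w[1] * alpha_f2_x2
print(alpha_approx)   # 0.68
```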

Note 6.1 In reality, we usually assume that f = ∨_{g∈G} g for some G ⊆ π. The reasons are:

(1) For some α ∈ Ω, maybe only some of the atom factors are related to it;
(2) On some occasions, perhaps we cannot get the manifestation extensions on all factors related to α.

In this case, however, the approximation may not be the best.

Note 6.2 The weight functions can be given by many methods. We will give a dynamic weight function based on information amount in section 7.


6.2 Fuzzy Reasoning Based on The Theory of Truth-valued Flow Inference

The theory of truth-valued flow inference was first proposed by P.Z. Wang [20,21]. In the light of this theory, inference is a process in which truth value flows between propositions; the medium linking propositions is called an inference channel. More exactly, inference channels conduct truth values between concepts. If the end of one inference channel and the beginning of another inference channel are linked, then a new inference channel is formed; by this linking of inference channels, we can get an inference chain, which may reflect a real inference process, so we can use this method to reason.

For simplicity, we shall give some important definitions and some conclusions without proof; details may be found in [3,4,21].

Definition 6.1 Suppose X, Y are two universes of discourse. F ⊆ F(X) × F(Y) is called a (family of) fuzzy inference channels (under a certain amount of knowledge) if

(1) (A, B) ∈ F, A' ⊆ A ⇒ (A', B) ∈ F;
(2) (A, B) ∈ F, B' ⊇ B ⇒ (A, B') ∈ F;
(3) (A, Bt) ∈ F (t ∈ T) ⇒ (A, ∩_{t∈T} Bt) ∈ F;
(4) (At, B) ∈ F (t ∈ T) ⇒ (∪_{t∈T} At, B) ∈ F.

(A, B) ∈ F, denoted by A → B, is called a fuzzy inference channel from A to B; A and B are called the channel beginning and channel end of (A, B) respectively.

Definition 6.2 On F, define the relation '>' as

(A, B) > (C, D) ⇔ A ⊇ C, B ⊆ D

If (A, B) > (C, D), then we say that the information value of (A, B) is larger than that of (C, D).

It is easy to verify that (F, >) is a semi-lattice.

Property 6.1 If (At, Bt) ∈ F (t ∈ T), then

(∪_{t∈T} At, ∪_{t∈T} Bt) ∈ F,  (∩_{t∈T} At, ∩_{t∈T} Bt) ∈ F

If the channel beginning of an inference channel receives a truth value λ, then λ can be immediately conducted to the channel end of the inference channel. An inference channel can be assigned a channel strength r ∈ [0,1], which reflects the fidelity of the truth value in the process of flowing. We shall use ArB to denote the inference channel (A, B) whose channel strength is r. When truth value λ reaches the channel end, the truth value received by the channel end is denoted by T(λ, r), where T is a triangular norm.

Definition 6.3 Suppose AsB and CtD are inference channels from A to B and from C to D respectively. If B ⊆ C, we say that AsB and CtD can be completely compounded; if B ∩ C ≠ ∅, we say that AsB and CtD can be partly compounded. The result of composition is denoted by ArD = (AsB) ∘ (CtD), where

r = T(s, t)          if B ⊆ C
r = T(s, T(k, t))    if B ∩ C ≠ ∅

where T is a triangular norm and k = δ(C, B), δ being a similitude measure[22]. If B ∩ C = ∅, we say that AsB and CtD cannot be compounded.

Property 6.2 ((AsB) ∘ (CtD)) ∘ (ErF) = (AsB) ∘ ((CtD) ∘ (ErF))

Proof

left  = (A T(s,t) D) ∘ (ErF) = A T(T(s,t), r) F
right = (AsB) ∘ (C T(t,r) F) = A T(s, T(t,r)) F = A T(T(s,t), r) F

so ((AsB) ∘ (CtD)) ∘ (ErF) = (AsB) ∘ ((CtD) ∘ (ErF)). □

Let A T_n(s_i) B denote the composition result of n inference channels which can be completely compounded, whose channel strengths are s1, s2, ..., sn in proper order.

Property 6.3 Assume that T = '∧'; then A T_n(s_i) B = AsB, where s = ∧_{i=1}^{n} s_i.

Property 6.4 If a truth value of 1 reaches the channel end of the composite inference channel A(T_n(s_i))B, then ∀i ∈ {1, 2, ..., n}, s_i = 1.
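A minimal sketch (ours) of truth-value flow along a chain of completely compoundable channels, using the minimum triangular norm T = '∧' as in the property above (the numbers are illustrative):

```python
def compose_strengths(strengths, t_norm=min):
    """Channel strength of a chain of completely compoundable channels."""
    s = 1.0
    for si in strengths:
        s = t_norm(s, si)
    return s

def flow(truth_value, strengths, t_norm=min):
    """Truth value received at the end of the chain: T(lambda, s1, ..., sn)."""
    return t_norm(truth_value, compose_strengths(strengths, t_norm))

chain = [0.9, 0.8, 1.0]            # strengths of channels A->B, B->C, C->D
print(compose_strengths(chain))    # 0.8  (the minimum of the strengths)
print(flow(0.7, chain))            # 0.7
```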

7 Building Intelligent Information-Processing Systems in Factors Space

7.1 General Method for Building Intelligent Information-Processing Systems in Factors Space

An intelligent information-processing system usually has the following main features:

• Having a rational knowledge representation method. One main task of an intelligent information-processing system is to process the knowledge about the given real-world problem; usually this knowledge is of various kinds, therefore the method for representing the knowledge must be rational. This means that the method of knowledge representation employed by the intelligent information-processing system is not only suited to the knowledge about the given real-world problem, but also suited to being handled by a computer.

• Having a convenient environment for acquiring necessary knowledge and an automatic method for generating knowledge usable by the computer. In reality, the knowledge of an intelligent information-processing system is usually of large amount, so the system should provide the user with a convenient environment for acquiring the necessary knowledge and an automatic method for generating computer-usable knowledge, in order to save time.


• Having an efficient inference method. Intelligent information-processing systems are often used for aided decision-making, aided analysis, etc., which is usually achieved by an inference method. Since the amount of the related knowledge is great, the inference method employed by the intelligent information-processing system must be efficient so as to reduce the time for decision-making; the inference method should also be intelligent enough to mimic human beings.

• Having the ability to learn. Humans have the ability to learn in order to obtain new knowledge, and an intelligent information-processing system should possess this ability to a great extent. But up to now, the learning ability of all existing intelligent information-processing systems is far from that of human beings, because the mathematical methods used for learning cannot match those of human beings. The existing mathematical methods for learning can be divided into two kinds: supervised learning methods and unsupervised learning methods. How an intelligent information-processing system can learn effectively is still a difficult problem.

In section 4, we studied the new knowledge representation method, the factors space method; by this method, the extensions and the intentions of concepts, facts and rules can all be represented well.

In section 5, we discussed knowledge acquisition and gave a new method, the random-sets-falling-shadow method. By this method, we can get the refined knowledge from "source knowledge" automatically, and the refined knowledge can be processed by computers.

In section 6, we discussed inference methods and gave two of them: the weighted synthesis method and the truth-valued flow method. These two methods have their own advantages: the former has high inference speed, while the latter has the ability to explain the inference process. So when building an intelligent information-processing system, we can combine them to overcome their respective shortcomings and form a powerful inference engine.

As for learning, we can use the supervised learning method; for example, we can improve the "knowledge base" by modifying the "source knowledge base", and this can be completed with ease because the "knowledge base" is generated automatically from the "source knowledge base". We can also employ the case study method.

Based on the results obtained above, we can build intelligent information-processing systems in factors space by the following steps:

• Step 1. According to the real problem, construct the corresponding factors space {X(f)}(f∈F) and granular factors space {X*(f)}(f∈F). If Ω is the set of all related concepts, then F should be sufficient for Ω; that is to say, ∀α ∈ Ω, all factors related to α must be included in F.
• Step 2. Build the knowledge base of the intelligent information-processing system. The knowledge base consists of manifestation knowledge and relationship knowledge: the manifestation knowledge can be acquired by the falling shadow method and represented by falling shadow functions; the relationship knowledge is represented by fuzzy rules. Both are obtained in the factors space and the granular factors space.
• Step 3. Reason with the method of weighted synthesis and the method of rule inference by truth-valued flow.
• Step 4. Improve the knowledge base by modifying the "source knowledge".
• Step 5. Repeat steps 3 and 4 until the result is satisfactory.

7.2 An Example Illustrating the General Method: A Tool for Building Diagnostic Expert Systems

The block diagram of a typical expert system is shown in Figure 7.1.

[Figure 7.1: Structure of a typical expert system. The diagram connects a source knowledge base, a knowledge base and a data base with knowledge acquiring, knowledge management, inference machine and explanation modules through a man-machine interface.]

In order to shorten the development time of expert systems, it is a good idea to design tools for building expert systems [24]; EMYCIN, KAS, EXPERT, etc. are famous tools for building expert systems.

The techniques introduced above (the knowledge representation technique, knowledge acquisition technique, inference technique and so on) are of great universality, so we can make use of them to develop tools for building expert systems; here we focus on tools for building diagnostic expert systems.

Diagnostic problems are very common in reality, for example medical diagnosis, fault diagnosis, psychological and behavioral diagnosis, etc.

Suppose D = {d_i | i = 1, 2, ..., m} is the set of all faults; D is called the fault set. F = {f_j | j = 1, 2, ..., N} is the set of all necessary factors for the given diagnostic problem; F is called the factor set. π = {f_j | j = 1, 2, ..., n} is the atom factor set, and F = P(π). Because ∀f ∈ F, ∃G ⊆ π such that f = ∨_{g∈G} g, we will only consider the atom factors. The diagnostic factors space can be regarded as {X(f)}_{f∈π}.

Definition 7.1 [12,15] Suppose U is the universe of discourse of D. A diagnostic problem (DP) on U is a quintuple DP = <D, F, {X(f)}_{f∈π}, R, M>, where R ∈ F(D × F) is called the diagnostic relation and M ∈ ∏_{f∈F} X(f) is called a symptom.

In this definition, {X(f)}_{f∈π} is the basic description environment of a diagnostic problem, and R is the core part; once {X(f)}_{f∈π} has been built up, we can build R. If R is represented by falling shadows of random (fuzzy) sets, then we can design an inference engine based on the weighted synthesis mathematical model; if R is represented by a group of rules, then we can design an inference engine based on the truth-valued flow inference method. So we can build two tools for building diagnostic expert systems based on these two different inference methods. In the preceding sections, we have given the knowledge representation method, knowledge acquisition method and inference method used to build these two tools, but we did not give the method for determining the weights in the weighted synthesis method; here we give a weight distribution method, called dynamic weights based on information amount (DWIA).

The intuitive ideas of DWIA are:

• Weight w_ij(x) reflects the importance of factor f_j to concept α_i;

• Weight w_ij(x) is related not only to factor f_j itself but also to the state of factor f_j;

• Weight w_ij(x) is determined by the amount of information provided by factor f_j,

where i = 1, 2, ..., m; j = 1, 2, ..., n.

Definition 7.2 P = {α_i | α_i ∈ Ω, i = 1, 2, ..., m} is called a pattern set, B ∈ F(P) is called a possibility distribution of P, and the entropy of P under distribution B is defined as:

H_P(B) = − ∑_{i=1}^m b̄_i ln b̄_i

where b̄_i = b_i / ∑_{i=1}^m b_i and B = (b_1, b_2, ..., b_m).

Definition 7.3 The amount of information of factor f_j at point x_j ∈ X(f_j) about P is defined as

I(f_j, x_j) is called the point-information amount, where H_max is the maximum entropy and B(x) is the possibility distribution of P at the point

x = (0, ..., 0, x_j, 0, ..., 0), in which x_j occupies the j-th position and the remaining j − 1 and n − j components are 0; here 0 denotes "no manifestation".

By making use of the point-information amount, we give the DWIA below: the weight w_ij of factor f_j at point x = (x_1, x_2, ..., x_n) about concept α_i is defined as

w_ij(x) = I(f_j, x_j) / ∑_{l∈Δ_i} I(f_l, x_l)

where Δ_i = {j | j ∈ {1, 2, ..., n}, r_ij(x) ≠ 0}, r_ij(x) is the manifestation extension of α_i on factor f_j, i = 1, 2, ..., m; j = 1, 2, ..., n.

When we get a manifestation x = (x_1, x_2, ..., x_n) ∈ ∏_{j=1}^n X(f_j) (x_j may be 0), by the formula

α_i(x) ≜ α_{if}(x) = ∑_{j=1}^n w_ij(x) α_{f_j}(x_j)

we can get the possibility α_i(x) of α_i, where f = ∨_{j=1}^n f_j, i = 1, 2, ..., m; j = 1, 2, ..., n.

According to {α_i(x)} (i = 1, 2, ..., m), we can make the decision on the level λ ∈ (0, 1], and finally get the decision-making set H_λ = {α_i | α_i(x) ≥ λ}.

From the above discussion, we can see that the point-information amount plays an important role in the determination of weights; it also plays a role in the choice of factors in successive decision-making [12,14]. In [12,14] we defined the factor information amount and the clearness degree, with which we can do the decision-making better.
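The weight and synthesis formulas above are mechanical enough that a small sketch may help. The Python fragment below is only an illustration: all function names and data layouts are ours, the table `alpha_fj` stands in for the falling-shadow evaluations α_{f_j}(x_j), and since the display equation for I(f_j, x_j) did not survive in the text, `point_information` uses a normalised entropy drop as a stand-in assumption rather than the authors' formula.

```python
import math

def entropy(b):
    """H_P(B) = -sum b_i' ln b_i', with b_i' = b_i / sum(b)  (Definition 7.2)."""
    s = sum(b)
    probs = [bi / s for bi in b if bi > 0]
    return -sum(p * math.log(p) for p in probs) if s > 0 else 0.0

def point_information(b_at_x, m):
    """Stand-in for the lost I(f_j, x_j) formula: normalised entropy drop
    1 - H_P(B(x)) / H_max with H_max = ln m.  This is an assumption."""
    h_max = math.log(m)
    return 1.0 - entropy(b_at_x) / h_max if h_max > 0 else 0.0

def dwia_weights(I, r):
    """w_ij(x) = I(f_j, x_j) / sum_{l in Delta_i} I(f_l, x_l),
    with Delta_i = { j : r_ij(x) != 0 }."""
    m, n = len(r), len(r[0])
    w = [[0.0] * n for _ in range(m)]
    for i in range(m):
        delta_i = [j for j in range(n) if r[i][j] != 0]
        total = sum(I[j] for j in delta_i)
        for j in delta_i:
            w[i][j] = I[j] / total if total > 0 else 0.0
    return w

def weighted_synthesis(w, alpha_fj):
    """alpha_i(x) = sum_j w_ij(x) * alpha_{f_j}(x_j)."""
    return [sum(wi[j] * alpha_fj[i][j] for j in range(len(wi)))
            for i, wi in enumerate(w)]

def decision_set(alpha, lam):
    """H_lambda = { i : alpha_i(x) >= lambda }."""
    return {i for i, a in enumerate(alpha) if a >= lam}
```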

In a factors space, we can design two tools for building diagnostic expert systems, based on the weighted synthesis inference method (Tool-1) and the truth-valued flow inference method (Tool-2) respectively. In order to make use of the advantages of both, we combine them to design a better tool (Tool-3) whose inference is based on both the weighted synthesis inference method and the truth-valued flow inference method. The inference process of Tool-3 is as follows:

• First start the weighted synthesis inference method and generate the decision-making set H_λ on level λ.

• Then start the truth-valued flow inference method, take H_λ as the hypothesis set, and by backward inference generate the diagnosis set under the given manifestation.

• Obtain new manifestations and repeat the last two steps until the diagnosis set is good enough.
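Read as a control loop, the three steps can be sketched as below; `get_manifestation`, `tool1`, `tool2` and `accept` are hypothetical callables standing in for the manifestation source, the weighted-synthesis engine, the truth-valued-flow engine and the stopping test, so this is a sketch of the loop structure only, not of the actual tools.

```python
def tool3_diagnose(get_manifestation, tool1, tool2, lam, accept, max_rounds=10):
    """Combined inference loop of Tool-3 (illustrative sketch)."""
    manifestation = get_manifestation()              # Manifestation block
    diagnosis = set()
    for _ in range(max_rounds):
        hypotheses = tool1(manifestation, lam)       # Tool-1: decision set H(lambda)
        diagnosis = tool2(hypotheses, manifestation) # Tool-2: backward inference
        if accept(diagnosis):                        # Continue (Y/N)?
            break
        manifestation = get_manifestation()          # Get New Manifestations
    return diagnosis
```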


The flow chart of Tool-3 is shown in Figure 7.2.

[Figure 7.2: Structure of Tool-3. The flow chart passes the manifestation to Tool-1, which produces the hypothesis set H(λ); Tool-2 then produces a diagnosis, and a Continue (Y/N)? decision either stops or feeds new manifestations back into the loop.]

Where

• Manifestation block gives all the available manifestations of the factors; the manifestations may be granules as well as states of the factors.

• Tool-1 block is an inference machine based upon weighted synthesis;
• Hypothesis Set H(λ) is the set of hypotheses reasoned out by Tool-1;
• Tool-2 block is an inference machine based upon truth-valued flow;
• Diagnosis is the result reasoned out by Tool-2;
• Continue block determines whether or not the diagnosing process goes on;
• Get New Manifestations block provides new manifestations which are different from those obtained in the Manifestation block.

All the tools (Tool-1, Tool-2, Tool-3) have been implemented on computers in Microsoft VC++. These tools have the following features:

• The man-machine interfaces are friendly and easy to operate;
• Each of them is an integrated environment of system description, knowledge acquisition, automatic knowledge-base building and reasoning;
• The inference speed and the execution efficiency are high;
• They can reduce the development period of expert systems, saving labour and resources.

We have applied them in some domains and obtained good results [14].

8 Conclusion

Based upon the factors space and granular factors space, we have studied the techniques for knowledge representation. In factors space and granular factors space, concepts, facts and rules can all be represented in different forms which are suitable not only for the description of real world problems but also for processing by computers.

Based upon the theory of the falling shadow of random (fuzzy) sets, we have discussed the techniques for knowledge acquisition in factors space and granular factors space. We have given a method for automatically building the knowledge base. The final forms of knowledge are well suited for computers to process.

We have also discussed the inference methods in factors space. The inferences in factors space take various forms which can describe well the thinking manner of human beings. We have especially discussed the inference method based on weighted synthesis and the inference method based on truth-valued flow. We have also given the dynamic weight-distributing method DWIA, which is based on information amount.

Based on the techniques for knowledge representation, knowledge acquisition and inference, we have discussed how to build an intelligent information-processing system in factors space and in granular factors space. As an example, we have designed a tool (Tool-3) for building diagnostic expert systems. Tool-3 has been implemented in VC++ and applied to some real world problems, and the results of the applications are good.

Through the above discussions, we can see that granular factors space is a good framework for building intelligent information-processing systems; we can also see that fuzzy information granulation can be studied conveniently in factors space. However, due to space limitations, we do not give further discussion here.

References

1. Peizhuang Wang, Fuzzy Sets and the Shadow of Random Sets, Beijing Normal University Publishing House,Beijing,1985.

2. Chengzhong Luo, The Fundamental Theory of Fuzzy Sets(I),Beijing Normal University Publishing House,Beijing,1989.

3. Chengzhong Luo, The Fundamental Theory of Fuzzy Sets(II),Beijing Normal University Publishing House,Beijing,1993.

4. Peizhuang Wang, Hongxing Li, The Theory of Fuzzy Systems and Fuzzy Computer, Academic Publishing House, Beijing, 1996.

5. A. Kaufmann, Introduction to the Theory of Fuzzy Subsets, Academic Press, New York, 1975.

6. D. Dubois, H. Prade, Fuzzy Sets and Systems: Theory and Applications, Academic Press, Inc., 1980.

7. Frederick Hayes-Roth, Building Expert Systems, Addison-Wesley Publishing Company, Inc., 1983.

8. G.A. Ringland, D.A. Duce, Approaches to Knowledge Representation: An Introduction, Research Studies Press LTD.,1988.

9. V.N. Constantin, Expert Systems and Fuzzy Systems, The Benjamin Cummings Publishing Company, Inc., 1984.


10. Richard Forsyth, Expert Systems: Principles and Case Studies, Chapman and Hall, Ltd., 1984.

11. Xiantu Peng, Abraham Kandel, Peizhuang Wang, Concepts, Rules and Fuzzy Reasoning: A Factors Space Approach, IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, No. 1, 1990.

12. Fusheng Yu, Fuzzy Diagnosis Theory and Tools for Building Fuzzy Diagnostic Expert Systems Based on Factor Space Theory, Ph.D. Thesis, Beijing Normal University, Beijing, 1998.

13. L.A. Zadeh, Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic, Fuzzy Sets and Systems, Vol. 90, 1997.

14. Chengzhong Luo, Fusheng Yu, The Mathematical Model of Diagnostic and Recognition Problems and The Tool for Building Expert Systems, Fuzzy Systems and Mathematics, Vol. 6, No. 3, 1992.

15. Fusheng Yu, The General Model for Building Diagnostic Expert Systems Based on Backwards Reasoning, System Engineering-Theory & Practice, Vol. 18, No. 5, 1998.

16. Fusheng Yu, Chengzhong Luo, Building Diagnostic Expert Systems in Factors Space, Advances in Mathematics of Electrical Engineering, 1997.

17. Chengzhong Luo, The Law of Large Numbers of The Falling Shadow of Random Fuzzy Sets, Fuzzy Systems and Mathematics, Vol. 6, No. 3, 1992.

18. Fusheng Yu, Chengzhong Luo, The Difference Operator of Fuzzy Sets, Journal of Beijing Normal University, Vol. 34, No. 1, 1998.

19. Chengzhong Luo, Fusheng Yu, The Falling Shadow Distribution of Random Intervals, The Collection of Theses of The Fifth Annual Meeting of The Committee of Fuzzy Systems and Fuzzy Mathematics of China System and Engineering Society, 1990.

20. P.Z. Wang, Truth-valued Flow Inference and Its Dynamic Analysis, Journal of Beijing Normal University, Vol. 25, No. 1, 1989.

21. Peizhuang Wang, Truth-valued Flow Inference Theory and Its Applications, in Advances in Fuzzy Systems: Applications and Theory (P.Z. Wang, K.F. Loe), World Scientific Publishing Company, 1993.

22. B. Bouchon-Meunier et al., Towards General Measures of Comparison of Objects, Fuzzy Sets and Systems, Vol. 84, 1996.

23. Frederick Hayes-Roth et al., Building Expert Systems, Addison-Wesley Publishing Company, Inc., 1983.

24. P.Z. Wang, A Factor Space Approach to Knowledge Representation, Fuzzy Sets and Systems, Vol. 36, 1990.


Part 5

Rough Sets and Granular Computing


GRS: A Generalized Rough Sets Model

Xiaohua Hu¹, Nick Cercone², Jianchao Han², and Wojciech Ziarko³

¹ Knowledge Stream Partner, 148 State St., Boston, MA 02109
² Dept. of Computer Science, Univ. of Waterloo, Waterloo, Ont., Canada
³ Dept. of Computer Science, Univ. of Regina, Regina, Sask., Canada

Abstract. Rough sets extends classical set theory by incorporating the set model into the notion of classification in the form of an indiscernibility relation. Rough sets serves as a tool for data analysis and knowledge discovery from databases. A generalized rough sets model, based on the concept of the VPRS-model, is proposed in this paper. Our approach modifies the traditional rough sets model and is aimed at handling uncertain objects by considering the importance of each object while reducing the influence of noise in modelling the classification process.

1 Introduction

Rough sets theory [7] can be used to reason from data. Rough sets techniques, which are complementary to statistical methods of inference, provide the necessary framework to conduct data analysis and knowledge discovery from imprecise and ambiguous data. Extracting knowledge from data is not a straightforward task. We need to find ways to analyze information at various levels of knowledge representation, going from refined to coarse levels and vice versa, and we also need to extract the useful information from the disorganized data. A number of algorithms and systems have been developed based on this technique [12,6,2].

Classification analysis is a central problem addressed by the theory of rough sets. The original rough sets approach required the classification, within the available information, to be fully correct or certain. Unfortunately, the available information usually allows only for partial classification. As a result, classification with a controlled degree of uncertainty, or a misclassification error, is outside the realm of this approach. The variable precision rough set model (VP-model) [10] introduced the concept of the majority inclusion relation. Rules which are almost always correct, called strong rules, can be extracted with the VP-model. Such strong rules are useful for decision support in a rule-based expert system. In actual applications, the collected data usually contain noise which will greatly affect the knowledge discovery process.

We propose a new generalized version of the rough set model. The generalized rough sets model is an extension of the concept of the variable precision rough sets model. Our new approach will deal with situations where uncertain objects may exist, different objects may have different degrees of importance attached, and different classes may have different noise ratios. The original rough sets model and the VP-model of rough sets [10] become special cases of the GRS-model. The primary advantage of the GRS-model is that it modifies the traditional rough sets model to work well in noisy environments.

2 Main Concepts of Rough Sets

Pawlak [7] introduced the notion of rough sets, which characterizes an ordinary set by a lower and an upper approximation. From a mathematical point of view, rough sets are very simple to understand; they only require finite sets, equivalence relations and cardinalities. In this section, we will review the basics of rough sets theory.

2.1 Information System

In the rough sets model, knowledge is manifested by the ability to classify. Knowledge can be defined as a family of partitions over a fixed finite universe, and it can also be defined as a family of equivalence relations over the universe.

An information system S [7] is a set of objects. It can be represented by a data table (attribute-value system), the columns of which are labelled by a set of attributes and the rows of which are labelled by objects of the universe U. The knowledge is expressed by values of attributes. We consider a special case of information system called a decision table. A decision table is a finite set of decision rules which specify what decision (action) should be taken when certain conditions are satisfied. The decision rules are represented by statements of the type "IF (set of conditions) THEN (set of decisions)". An information system is defined as follows:

Let S = <U, C, D, {VAL_a}_{a∈A}, f> be an information system, where U = {u_1, u_2, ..., u_n} is a non-empty set of objects, C is a non-empty set of condition attributes, and D is a non-empty set of decision attributes. We have A = C ∪ D, the set of all attributes, and C ∩ D = ∅. VAL_a is the domain of attribute a, with at least two elements; the elements of VAL_a are called values of attribute a (a ∈ A). f : U × A → V is a total function such that f(u_i, q) ∈ VAL_q, ∀q ∈ A, u_i ∈ U.

An information system, which provides information about real-world objects, is a representation of a collection of objects in terms of attributes and their values. Attributes are functions whose common domain is a given collection of objects U. Objects can be characterized by some selected features represented by attributes. Two objects with the same attribute values are indiscernible.

2.2 Set Approximations

Let U = {u_i} (i = 1, 2, ..., n) be a non-empty finite set (the universe of discourse), and let R be an equivalence relation on U. An ordered pair A = (U, R) is called an approximation space. The indiscernibility relation, denoted as IND, is an equivalence relation R on U. It partitions U into equivalence classes; [E_i] is an equivalence class of the relation R labelled by the description E_i. The equivalence classes of the relation R are called elementary sets in A. Any finite union of elementary sets is called a definable set in A.

Let B ⊆ A be a set of attributes; it induces an equivalence relation based on the attribute values of B. We use E_B = {E_1, E_2, ..., E_m} to represent these equivalence classes of B. Let X be a subset of U representing a concept. The lower approximation of X, also called the positive region of the set X and denoted as POS_B(X), is the union of all those elementary sets each of which can be classified as definitely belonging to the set X by using the set of attributes B, and is defined as

POS_B(X) = ∪{E_i ∈ E_B | E_i ⊆ X}.

The upper approximation of X, denoted as UPP_B(X), is the union of all those elementary sets each of which can be classified as possibly belonging to the set X by using the set of attributes B, and is defined as

UPP_B(X) = ∪{E_i ∈ E_B | E_i ∩ X ≠ ∅}.

The ratio of the size of the lower approximation of a set to the size of its upper approximation is called the approximation accuracy of the set X, i.e., card(POS_B(X)) / card(UPP_B(X)).

The boundary area of X, denoted as BND_B(X), is the set of all those elementary sets each of which cannot be determined with certainty as belonging to the set X or to the complement of X, and is defined as:

BND_B(X) = UPP_B(X) − POS_B(X).

In the boundary region, none of the elementary sets can be classified with certainty to the concept X or to the concept ¬X by using the set of attributes B.

The collection of elementary sets which can be classified with certainty, on the basis of the available information, as not belonging to the set X by using the set of attributes B is called the negative region and is defined as:

NEG_B(X) = U − UPP_B(X).

If the information about objects is sufficient to classify all of the elementary sets, that is, POS_B(X) = UPP_B(X), then the boundary region of the set X disappears and the rough set becomes equivalent to a standard set. For any concept in the information system, we can derive two kinds of classification rules from the lower and upper approximations of this concept. The rules obtained from the lower approximation of this concept are called deterministic rules, because whenever the description of an object belongs to the set of deterministic rules, this object is definitely in the target concept. The rules obtained from the upper approximation of this concept are called non-deterministic rules, because whenever the description of an object belongs to the set of non-deterministic rules, this object is possibly in the target concept.
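Since the four regions are defined purely set-theoretically, they are easy to compute from a decision table. The following sketch uses a hypothetical toy table and helper names of our own choosing, so it illustrates the definitions rather than any particular system described here.

```python
from itertools import groupby

# A hypothetical toy decision table: objects -> attribute values,
# with 'd' playing the role of the decision attribute.
table = {
    "u1": {"a": 0, "b": 1, "d": "yes"},
    "u2": {"a": 0, "b": 1, "d": "no"},
    "u3": {"a": 1, "b": 0, "d": "no"},
    "u4": {"a": 1, "b": 1, "d": "yes"},
}
U = list(table)

def partition(universe, table, attrs):
    """Elementary sets of IND(attrs): objects indiscernible on attrs."""
    key = lambda obj: tuple(table[obj][a] for a in attrs)
    return [frozenset(g) for _, g in groupby(sorted(universe, key=key), key=key)]

def approximations(universe, classes, X):
    """POS_B(X), UPP_B(X), BND_B(X), NEG_B(X) and the approximation accuracy."""
    X = set(X)
    pos = set().union(*(E for E in classes if E <= X))
    upp = set().union(*(E for E in classes if E & X))
    bnd = upp - pos                      # boundary region
    neg = set(universe) - upp            # negative region
    accuracy = len(pos) / len(upp) if upp else 1.0
    return pos, upp, bnd, neg, accuracy

X = {u for u in U if table[u]["d"] == "yes"}          # the concept "d = yes"
print(approximations(U, partition(U, table, ["a", "b"]), X))
```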

2.3 Attribute Dependency

The analysis of data dependencies in an information system is one of the primary applications of the rough set theory [12]. Such an analysis is based on the properties of the indiscernibility relation induced by condition attributes of an information system. The analysis of data dependencies is performed by computing lower approximations of subsets of objects corresponding to different combinations of values of the decision attributes belonging to D. The lower approximations are computed with respect to the indiscernibility relation generated by the set of condition attributes C. Since only the objects belonging to the union of computed lower approximations can be assigned a unique combination of values of decision attributes belonging to D, the proportion of these objects in the information system provides a measure of the degree of functional dependency between the set of condition attributes and the set of decision attributes.

Let S = <U, C, D, {VAL_a}_{a∈A}, f> be an information system with A = C ∪ D. The measure of the extent to which the set of decision attributes D depends on the set of condition attributes C is called the degree of dependency of D on C and is denoted as γ(C, D). It is defined as:

γ(C, D) = card(POS(C, D)) / card(U)

where POS(C, D) is the union of the lower approximations of all elementary sets of objects corresponding to all concepts of the decision attributes D, and card denotes set cardinality.

The coefficient γ(C, D) expresses numerically the percentage of objects which can be properly classified. If γ(C, D) = 1 then we say that D totally depends on C; if γ(C, D) = 0 then we say that D does not depend on C at all; if 0 < γ(C, D) < 1 then we say that D partially depends on C.
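Reusing `partition` from the earlier sketch, the dependency degree can be computed directly from its definition; the function name is ours.

```python
def dependency_degree(universe, table, C, D):
    """gamma(C, D) = card(POS(C, D)) / card(U): the share of objects whose
    C-elementary set falls entirely inside a single D-concept."""
    c_classes = partition(universe, table, C)
    d_classes = partition(universe, table, D)
    pos = set().union(*(E for E in c_classes
                        if any(E <= F for F in d_classes)))
    return len(pos) / len(universe)

print(dependency_degree(U, table, ["a", "b"], ["d"]))   # 0.5 for the toy table
```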

2.4 Reduction of Attributes

The existence of dependence among attributes of an information system may be used to reduce the set of attributes. The concept of attribute reduct is one of the most important parts of rough sets theory [2,10,12]. Given a set of condition attributes C ⊂ A and a set of decision attributes D ⊂ A (A = C ∪ D), any subset C' ⊆ C of the condition attributes whose degree of dependency γ(C', D) with the decision attributes D is the same as the degree of dependency γ(C, D) of all condition attributes C, and which is minimal, is called a reduct of the condition attributes C. The minimality requirement of the reduct means that no proper subset of the reduct has the identical level of dependency with the decision attributes D. The advantage of using a reduct rather than the original set of condition attributes C is that we can obtain a more concise classification rule without increasing the classification error of the result.

Let S = <U, C, D, {VAL_a}_{a∈A}, f> be an information system and P ⊆ C:

1. An attribute a ∈ P is redundant in P if γ(P − {a}, D) = γ(P, D); otherwise the attribute a is indispensable.

2. If all attributes a_i ∈ P are indispensable in P, then P will be called orthogonal.

3. A subset P ⊆ C is called a reduct of C in S iff P is orthogonal and γ(P, D) = γ(C, D).
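One simple way to find a single reduct is backward elimination over γ; this is only an illustrative strategy of ours, not the specific procedures studied in [2,10,12], and it reuses `dependency_degree` from the previous sketch.

```python
def reduct(universe, table, C, D):
    """Backward elimination: greedily drop condition attributes whose removal
    leaves gamma(C, D) unchanged.  Returns one reduct, not all of them."""
    target = dependency_degree(universe, table, C, D)
    red = list(C)
    for a in list(C):
        trial = [b for b in red if b != a]
        if trial and dependency_degree(universe, table, trial, D) == target:
            red = trial
    return red

print(reduct(U, table, ["a", "b"], ["d"]))
```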

3 Generalized Rough Sets Model

Rough set theory has been applied widely in many applications, such as [12,6]. From our experience of applying rough sets theory to data mining applications, we found that there are some limitations of the original rough sets model. Some of these limitations include:

(1) All tuples are treated with equal importance. In our applications, we first tend to generalize the original data in the database into some generalized form (this step is called data generalization; please refer to [3,4] for details). After generalization, some tuples which are distinct in the primitive database become identical, and a "vote" count is used to record the tuples. Normally different tuples have different "vote" counts, which means these tuples have varying importance to the decision attributes.

(2) Objects are represented crisply. In the original model, the data are crisp; there is no uncertainty associated with the model, and an object either has some property or does not have it. In actual applications, there are times when it is too expensive or risky to make a straightforward yes-no decision; some uncertainty factor is usually associated with the decision.

(3) The probabilistic domain cannot be modelled. In the original rough sets model, a strict set inclusion is used to define the lower approximation, which has no tolerance for noisy data in the classification. For example, suppose X = {x_1, x_2, ..., x_99, ..., x_500}, E_1 = {x_1, x_2, ..., x_99, x_501}, E_2 = {x_500, x_501, ..., x_599}. All of the objects of E_1 are in X except x_501, and none of the objects in E_2 belong to X except x_500. But in the original rough sets model, both equivalence classes E_1 and E_2 are treated equally and are put in the boundary region. However, in actual applications, x_501 may be noise in E_1 and x_500 may be noise in E_2. It seems reasonable to put E_1 in the positive region and E_2 in the negative region.

Based on these considerations, we propose a generalized rough sets model, GRS. We first modify the definition of an information system to extend its representation power and introduce the classification rationale. Then we give a formal explanation of the generalized rough sets model.

3.1 Uncertain Information Systems (UIS)

To manage objects with uncertainty and varying importance degrees, we introduce an uncertain information system (UIS) based on the information system defined by Pawlak [7]. In the uncertain information system, each object is assigned an uncertainty u and an importance degree d. The uncertainty u is a real number in the range from 0.0 to 1.0. If the uncertainty u is equal to 1.0, it represents a completely positive object; if the uncertainty u is equal to 0.0, it represents a completely negative object. The importance degree d represents how important the object is in the information system. The product d × u induces the positive class and d × (1 − u) induces the negative class in the uncertain information system. In other words, d × u is the inducing positive class degree and d × (1 − u) is the inducing negative class degree. An example collection of objects of an uncertain information system is shown in Table 1. The uncertain information system (UIS) is defined as follows:

Let UIS = <U, C, D, {VAL_a}_{a∈C}, f, u, d> be an uncertain information system, where U is a non-empty set of objects, C is a non-empty set of condition attributes, and D is a decision attribute with uncertainty u. VAL_a is the domain of a condition attribute a with at least two elements. Each condition attribute a ∈ C can be perceived as a function assigning a value a(obj) ∈ VAL_a to each object obj ∈ U. d(obj) is a function assigning an importance degree to each object obj ∈ U. Every object belonging to U is therefore associated with a set of certain values corresponding to the condition attributes C, an uncertain value corresponding to the decision attribute D, and a real number corresponding to the importance degree d of the object.

Example 1: In Table 1, we have a set of objects U = {e_i} (i = 1, 2, ..., 6). The set of condition attributes is C = {c1, c2}, the domains of the condition attributes are V_c1 = {0, 1} and V_c2 = {0, 1, 2}, and the decision attribute is D = {dec} with uncertainty values u_dec_i = {0.95, 0.67, 0.15, 0.85, 0.47, 0.10} (i = 1, 2, ..., 6). For each object, an importance degree d is assigned, and the set of importance degrees is d(obj_i) = {4, 3, 4, 4, 3, 4} (i = 1, 2, ..., 6).
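For later illustration, the uncertain information system of Example 1 (Table 1) can be written down directly as data; the dictionary layout below is our own choice, with u the decision uncertainty and d the importance degree.

```python
# The uncertain information system of Table 1 / Example 1.
UIS = {
    "e1": {"c1": 0, "c2": 0, "u": 0.95, "d": 4},
    "e2": {"c1": 0, "c2": 1, "u": 0.67, "d": 3},
    "e3": {"c1": 0, "c2": 2, "u": 0.15, "d": 4},
    "e4": {"c1": 1, "c2": 0, "u": 0.85, "d": 4},
    "e5": {"c1": 1, "c2": 1, "u": 0.47, "d": 3},
    "e6": {"c1": 1, "c2": 2, "u": 0.10, "d": 4},
}
```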

3.2 Noise Tolerance in Uncertain Information Systems

Table 1. An example of an uncertain information system

U    c1   c2   dec (u)   d
e1   0    0    0.95      4
e2   0    1    0.67      3
e3   0    2    0.15      4
e4   1    0    0.85      4
e5   1    1    0.47      3
e6   1    2    0.10      4

To manage noise in uncertain information systems, we adapt the concept of relative classification error, which was introduced by Ziarko [12]. The main idea is to put some equivalence classes from the boundary region into the positive region or the negative region, according to some classification factors. The goal is to obtain strong rules which are almost always correct. In actual applications, each class (positive class and negative class) in the information system may contain different kinds of noise; for example, the given positive training objects may contain some noise while the given negative training objects may be noise-free. Two classification factors P_β and N_β (0.0 ≤ P_β, N_β ≤ 1.0) are introduced to solve this problem. P_β and N_β may take the same value and exist simultaneously; they can be determined by estimating the noise degree in the positive region and the negative region respectively.

Let E be a non-empty equivalence class in the approximation space A = (U, R). The classification ratios of the set E with respect to the positive class P_class and the negative class N_class are defined as

C_P(E) = ∑_{x_i∈E} (d_i × u_i) / ∑_{x_i∈E} d_i,   E ⊆ U

C_N(E) = ∑_{x_i∈E} (d_i × (1 − u_i)) / ∑_{x_i∈E} d_i,   E ⊆ U

where ∑ d_i is the sum of the importance degrees of the objects belonging to the set E, ∑(d_i × u_i) is the sum of the inducing positive class degrees of the objects belonging to the set E, and ∑(d_i × (1 − u_i)) is the sum of the inducing negative class degrees of the objects belonging to the set E.

C_P(E) is defined as the certainty of classifying E into the positive region, and C_N(E) is defined as the certainty of classifying E into the negative region. If we classify the objects belonging to the set E to the positive class, we probably have a classification error rate 1 − C_P(E); if we classify the objects belonging to the set E to the negative class, we probably have a classification error rate 1 − C_N(E).

Based on the measure of relative classification error, one can classify E to the positive class if and only if the classification certainty C_P(E) is greater than or equal to a given precision level P_β, or to the negative class if and only if the classification certainty C_N(E) is greater than or equal to a given precision level N_β. Thus,

E ⊆ P_class if and only if C_P(E) ≥ P_β

E ⊆ N_class if and only if C_N(E) ≥ N_β

otherwise, the equivalence class E belongs to the boundary region. The usefulness of these concepts is demonstrated in Example 2 in the next subsection.

3.3 Set Approximation in the GRS-Model

In the original model of rough sets, the approximation space is defined as a pair A = (U, R) which consists of a non-empty, finite universe of discourse U and an equivalence relation R on U [7]. The equivalence relation R, referred to as an indiscernibility relation IND, corresponds to a partition of the universe U into a collection of equivalence classes or elementary sets R* = {E_1, E_2, ..., E_n}. The elementary sets are the atomic components of the given information system. They correspond to the smallest groups of objects which are distinguishable in terms of the information used to represent them, e.g., in terms of object features and their values.

By using the two classification factors P_β and N_β, we obtain the following generalization of the concept of rough approximation:

Let the pair A = (U, R_{P,N}) be an approximation space and let R*_{P,N} = {E_1, E_2, ..., E_n} be the collection of equivalence classes of the relation R_{P,N}. Let P_β and N_β be two real-number parameters as defined in the previous subsection, such that 0.0 ≤ P_β, N_β ≤ 1.0. Given any arbitrary subset X ⊆ U, its positive lower approximation POS_P(X) is defined as the union of those elementary sets whose classification ratio C_P(E) is greater than or equal to P_β:

POS_P(X) = ∪{E ∈ R*_{P,N} : C_P(E) ≥ P_β}

Its negative lower approximation NEG_N(X) is defined as the union of those elementary sets whose classification ratio C_N(E) is greater than or equal to N_β:

NEG_N(X) = ∪{E ∈ R*_{P,N} : C_N(E) ≥ N_β}

The boundary region BND_{P,N}(X) of the set X is the union of those elementary sets which belong neither to the positive region nor to the negative region of the set X:

BND_{P,N}(X) = U − POS_P(X) − NEG_N(X)


According to the noise level, we can adjust the values of P_β and N_β. If the data is very noisy, we can set P_β and N_β larger; otherwise P_β and N_β can be set a bit smaller (both should be greater than 0.5). If P_β and N_β increase, the positive and negative regions will shrink and the boundary will expand. On the other hand, if P_β and N_β decrease, then the boundary area shrinks and the positive and negative regions will expand. In Example 2, we set P_β and N_β to two different sets of values and the positive, negative and boundary regions change accordingly.

Example 2: Assume the same set of objects U as described by Table 1, and set P_β = 0.85, N_β = 0.80. The set of equivalence classes of the relation R is R = {X1, X2, ..., X6}, where X1 = {e1}, X2 = {e2}, ..., and X6 = {e6}. Thus

C_P(X1) = (4 × 0.95)/4 = 0.95

Similarly,

C_P(X2) = (3 × 0.67)/3 = 0.67
C_P(X3) = (4 × 0.15)/4 = 0.15
C_P(X4) = (4 × 0.85)/4 = 0.85
C_P(X5) = (3 × 0.47)/3 = 0.47
C_P(X6) = (4 × 0.10)/4 = 0.10

C_N(X1) = (4 × (1 − 0.95))/4 = 0.05
C_N(X2) = (3 × (1 − 0.67))/3 = 0.33
C_N(X3) = (4 × (1 − 0.15))/4 = 0.85
C_N(X4) = (4 × (1 − 0.85))/4 = 0.15
C_N(X5) = (3 × (1 − 0.47))/3 = 0.53
C_N(X6) = (4 × (1 − 0.10))/4 = 0.90

Since C_P(X1) ≥ P_β and C_P(X4) ≥ P_β,

POS_P(D) = {X1, X4}

Since C_N(X3) ≥ N_β and C_N(X6) ≥ N_β,

NEG_N(D) = {X3, X6}

So the boundary region is

BND_{P,N}(D) = {X2, X5}

If we want the positive and negative regions to be more "pure", we can increase the P_β and N_β values. Suppose we set P_β = 0.9 and N_β = 0.9; then only C_P(X1) ≥ P_β and C_N(X6) ≥ N_β, so that

POS_P(D) = {X1}
NEG_N(D) = {X6}
BND_{P,N}(D) = {X2, X3, X4, X5}

The equivalence class X4 is no longer good enough to be in the positive region, so it is put in the boundary and the positive region shrinks.
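Example 2 can be reproduced mechanically from the classification-ratio formulas. The sketch below reuses the `UIS` dictionary defined earlier; the function names are ours.

```python
def classification_ratios(E, uis):
    """C_P(E) = sum(d_i*u_i)/sum(d_i);  C_N(E) = sum(d_i*(1-u_i))/sum(d_i)."""
    total = sum(uis[e]["d"] for e in E)
    cp = sum(uis[e]["d"] * uis[e]["u"] for e in E) / total
    return cp, 1.0 - cp

def grs_regions(classes, uis, p_beta, n_beta):
    """Assign each elementary set to POS_P, NEG_N or BND_{P,N}."""
    pos, neg, bnd = [], [], []
    for E in classes:
        cp, cn = classification_ratios(E, uis)
        if cp >= p_beta:
            pos.append(E)
        elif cn >= n_beta:
            neg.append(E)
        else:
            bnd.append(E)
    return pos, neg, bnd

# Example 2: singleton classes X1..X6, P_beta = 0.85, N_beta = 0.80.
classes = [[e] for e in ["e1", "e2", "e3", "e4", "e5", "e6"]]
print(grs_regions(classes, UIS, 0.85, 0.80))
# POS = [['e1'], ['e4']], NEG = [['e3'], ['e6']], BND = [['e2'], ['e5']]
```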

3.4 The Degree of Attribute Dependencies in the GRS-Model

To formally define the attribute dependency measure between the set of condition attributes C ⊂ A and the set of decision attributes D ⊂ A (A = C ∪ D), let C* denote the collection of equivalence classes of the relation IND_{P,N}(C) and, similarly, let D* be the family of equivalence classes of IND_{P,N}(D) = {P_class, N_class}. Given two classification factors P_β and N_β (0.0 ≤ P_β, N_β ≤ 1.0), we say that the set of decision attributes D imprecisely depends on the set of condition attributes C to the degree γ(C, D, P_β, N_β) if:

γ(C, D, P_β, N_β) = IMP(INT(C, D, P_β, N_β)) / IMP(U)

where INT(C, D, P_β, N_β) is the union of the positive and negative lower approximations of all elementary sets of the partition D* = {P_class, N_class} in the approximation space (U, IND_{P,N}(C)), and IMP(X) is an importance function assigning to the set X the sum of the importance degrees of its objects, such that

IMP(U) = ∑_{i=1}^n d_i,   u_i ∈ U

and

IMP(INT(C, D, P_β, N_β)) = ∑_{pos=1}^a d_pos + ∑_{neg=1}^b d_neg,   u_pos ∈ POS_P(X), u_neg ∈ NEG_N(X)

Thus we can rewrite the above formula as:

γ(C, D, P_β, N_β) = (∑_{pos=1}^a d_pos + ∑_{neg=1}^b d_neg) / ∑_{i=1}^n d_i

Informally speaking, the dependency degree γ(C, D, P_β, N_β) of the attributes D on the attributes C at the precision level (P_β, N_β) is the proportion of those objects u_i ∈ U which can be classified into the respective classes of the partition D* (positive class and negative class) with an error rate less than the desired value (P_β, N_β) on the basis of the information represented by the classification C*.


Example 3: Based on the uncertain information system given in Table 1, we can calculate the degree of dependency between the condition attributes C and the decision attribute D with classification factors P_β = 0.85 and N_β = 0.80. From Example 2, we have:

POS_P(D) = {X1, X4}

NEG_N(D) = {X3, X6}

Thus the degree of dependency between C and D is

γ(C, D, 0.80, 0.85) = (4 + 4 + 4 + 4)/22 = 0.73
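The same number follows from a direct implementation of the dependency degree, reusing `grs_regions`, `classes` and `UIS` from the previous sketches; the function name is ours.

```python
def grs_dependency(classes, uis, p_beta, n_beta):
    """gamma(C, D, P_beta, N_beta) = IMP(POS_P union NEG_N) / IMP(U),
    where IMP sums the importance degrees d."""
    pos, neg, _ = grs_regions(classes, uis, p_beta, n_beta)
    covered = [e for E in pos + neg for e in E]
    return sum(uis[e]["d"] for e in covered) / sum(o["d"] for o in uis.values())

print(round(grs_dependency(classes, UIS, 0.85, 0.80), 2))   # 16/22 = 0.73
```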

3.5 Attribute Reduct in the GRS-Model

In the original model of rough sets, the concept of a reduct is based on the notion of functional, or partial functional, data dependency. By substituting the degree of functional dependency in the reduct definition with the degree of dependency γ(C, D, P_β, N_β) computed with the classification factors P_β, N_β, the idea of attribute reduct can be generalized to allow for a further reduction of attributes. Such a reduction, by definition, does not preserve functional or partial functional dependencies. Instead, the point is in maintaining the degree of overlap of elementary sets of the relation IND_{P,N}(C) with elementary sets of the relation IND_{P,N}(D) = {P_class, N_class}.

Let UIS = <U, C, D, {VAL_a}_{a∈C}, f, u, d> be an uncertain information system, let P ⊆ C, and let the classification factors P_β, N_β be given:

1. An attribute a ∈ P is called redundant in P if γ(P − {a}, D, P_β, N_β) = γ(P, D, P_β, N_β); otherwise the attribute a is indispensable.

2. If all attributes a_i ∈ P are indispensable in P, then P will be called orthogonal.

3. A subset P ⊆ C is called a reduct of C in UIS iff P is orthogonal and γ(P, D, P_β, N_β) = γ(C, D, P_β, N_β).

A relative reduct of the set of condition attributes is defined as a maximal independent subset of the condition attributes.

The GRS-reduct, or approximation reduct, of the set of condition attributes C with respect to a set of decision attributes D is a subset RED(C, D, P_β, N_β) of C which satisfies the following two criteria:

1. γ(C, D, P_β, N_β) = γ(RED(C, D, P_β, N_β), D, P_β, N_β);
2. no attribute can be eliminated from RED(C, D, P_β, N_β) without affecting the first criterion.

Example 4: Consider dropping the condition attribute c1 in Table 1 and set P_β = 0.85 and N_β = 0.80. The set of equivalence classes of the relation R is R = {X1, X2, X3}, where X1 = {e1, e4}, X2 = {e2, e5} and X3 = {e3, e6}. So that,

C_P(X1) = (4 × 0.95 + 4 × 0.85)/8 = 0.90

C_N(X1) = (4 × (1 − 0.95) + 4 × (1 − 0.85))/8 = 0.10

C_P(X2) = (3 × 0.67 + 3 × 0.47)/6 = 0.57

C_N(X2) = (3 × (1 − 0.67) + 3 × (1 − 0.47))/6 = 0.43

C_P(X3) = (4 × 0.15 + 4 × 0.10)/8 = 0.125

C_N(X3) = (4 × (1 − 0.15) + 4 × (1 − 0.10))/8 = 0.875

From the above computation, we obtain

POS_P(C') = {X1}

and NEG_N(C') = {X3} (C' = {c2}).

Thus, we have

γ(C', D, 0.80, 0.85) = (8 + 8)/22 = 0.73.

From Example 3, we know that γ(C', D, 0.80, 0.85) = γ(C, D, 0.80, 0.85), so that C' = {c2} is a reduct of C on D.

The concept of a reduct is most useful in those applications where it is necessary to find the most important collection of condition attributes responsible for a cause-and-effect relationship, and it is also useful for eliminating noise attributes from the information system. Given an arbitrary information system, there may exist more than one reduct. Each reduct in the set RED(C, D, P_β, N_β) can be used as an alternative group of attributes which could represent the original information system with the classification factors P_β, N_β. An important problem to solve is how to select an optimal reduct from the set RED(C, D, P_β, N_β). The selection can depend on an optimality criterion associated with the attributes. The computational procedure for finding a single reduct is very straightforward, but finding all reducts is much more complex. Some significant results obtained for this problem can be found in [2,12].


4 Conclusions

We have proposed a generalized rough sets model for modelling the classification process in noisy environments. The end result of using the GRS-model for data analysis is a set of classification rules for classifying objects into positive and negative concepts. The classification rules form a description of each concept. It is not difficult to extend this description to more concepts. The GRS-model extends the applicability of the rough sets approach to problems which are more probabilistic than deterministic in nature, and it inherits the useful properties of the original model of rough sets. From the results of our research, we demonstrate that there is much room for expansion and application of the rough sets theory.

5 Acknowledgment

The first author is grateful to Gregory Piatetsky-Shapiro for his encouragement and support. The authors are/were members of the Institute for Robotics and Intelligent Systems (IRIS) and wish to acknowledge the support of the Networks of Centres of Excellence of the Government of Canada, the Natural Sciences and Engineering Research Council, and the participation of PRECARN Associates Inc.

References

1. Fayyad U, Piatetsky-Shapiro G, Smyth P and Uthurusamy R. (1996) Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press

2. Hu, X. (1995) Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach, Ph.D thesis, University of Regina, Canada

3. Hu, X. and Cercone, N. (1996) Mining Knowledge Rules from Databases: A Rough Set Approach, in Proc. of the 12th International Conf. on Data Engineering

4. Hu, X. and Cercone, N. (1999) Data Mining via Generalization, Discretization and Rough Set Feature Selection, Knowledge and Information Systems: An International Journal, 1(1), 1999

5. Katzberg, J.D. and Ziarko, W. (1993) Variable Precision Rough Sets with Asymmetric Bounds, Proc. Intl. Workshop on Rough Sets and Knowledge Discovery, 163-190.

6. Lin, T.Y. and Cercone, N. (1997) Applications of Rough Sets Theory and Data Mining, Kluwer Academic Publishers

7. Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers

8. Slowinski, R. (ed.) (1992) Intelligent Decision Support: Handbook of Applications and Advances of Rough Sets Theory

9. Simoudis, E. Han, J. and Fayyad U. (1996) Proc. of the Second International Conf on Knowledge Discovery & Data Mining


10. Ziarko, W. (1993) Variable Precision Rough Set Model. Journal of Computer and System Sciences, Vol. 46, No.1, 39-59.

11. Ziarko, W. (1993) Analysis of Uncertain Information in the Framework of Variable Precision Rough Sets, Foundations of Computing and Decision Sciences, Vol. 18, No. 3-4, 381-396.

12. Ziarko, W. (1994) Rough Sets, Fuzzy Sets and Knowledge Discovery, Springer-Verlag


Structure of Upper and Lower Approximation Spaces of Infinite Sets

D.S. Malik and John N. Mordeson

Department of Mathematics and Computer Science, Creighton University, Omaha, NE 68178, USA

Abstract. We determine structural properties of upper and lower approximation spaces. In particular, we show that an upper approximation space is a union of its primary subspaces if and only if it is benign. We also show that in a principal upper approximation space which is not primary, the prime and maximal subspaces coincide. Key words: Upper and lower approximation spaces, finitely fanned, genetic subsets, core, primary subspaces, benign, rough sets

1 Introduction

In 1982, Pawlak introduced the concept of a rough set, [7]. This concept is fundamental to the examination of granularity in knowledge. It is a concept which has many applications in data analysis. The idea is to approximate a subset of a universal set by a lower approximation and an upper approximation in the following manner. A partition of the universe is given. The lower approximation is the union of those members of the partition contained in the given subset and the upper approximation is the union of those members of the partition which have a nonempty intersection with the given subset. It is well known that a partition induces an equivalence relation on a set and vice versa. The properties of rough sets can thus be examined via either partitions or equivalence relations. The members of the partition (or equivalence classes) can be formally described by unary set-theoretic operators, [14], or by successor functions for upper approximation spaces, [2,3]. This axiomatic approach allows not only for a wide range of areas in mathematics to fall under this approach, but also a wide range of areas to be used to describe rough sets. Hence we use an axiomatic approach. Some examples are topology, (fuzzy) abstract algebra, (fuzzy) directed graphs, (fuzzy) finite state machines, modal logic, interval structures, [2,4,5,6,8,14,15,16]. One may generalize the use of partitions or equivalence relations to that of covers or relations, [6,9,10,11,12,13].

In this paper, we determine structural properties of upper and lower approximation spaces of infinite sets. The case for finite sets has been studied extensively. For example, it is shown in [2] that an upper approximation space is a disjoint union of its primary subspaces. A dual result holds for lower approximation spaces. For infinite sets, some interesting complications occur. We examine these complications in Section 2, where we give our main results. Our approach is similar to the one used in [1]. We show that an upper approximation space is a union of its primary subspaces if and only if it is benign, Theorem 2. In Section 3, we show how our results of Section 2 for upper approximation spaces can be carried over to lower approximation spaces. In particular, we examine structure properties of upper and lower approximation spaces, Theorem 9. In Section 4, we show how ideas from commutative ring theory can be used to give properties of upper and lower approximation spaces. Thus, hopefully, we open the way for a new method to be used to study upper and lower approximation spaces. We show, in particular, that in a principal upper approximation space which is not primary, the prime and maximal subspaces coincide, Theorem 12.

2 Upper Approximation Spaces

Let V be a nonempty set and let P(V) denote the power set of V. Let s be a function of P(V) into itself. We are interested in the following conditions on s since they are the ones that hold for upper approximation operators defined via an equivalence relation:

(u1) ∀X ∈ P(V), X ⊆ s(X).
(u2) ∀X, Y ∈ P(V), X ⊆ Y ⇒ s(X) ⊆ s(Y).
(u3) ∀X, Y ∈ P(V), s(X ∪ Y) = s(X) ∪ s(Y).
(u4) ∀X ∈ P(V), s(X) = s(s(X)).

Let s be a function of P(V) into itself. We are also interested in the following conditions on s since they are the ones that hold for a lower approximation operator defined via an equivalence relation:

(l1) ∀X ∈ P(V), X ⊇ s(X).
(l2) ∀X, Y ∈ P(V), X ⊆ Y ⇒ s(X) ⊆ s(Y).
(l3) ∀X, Y ∈ P(V), s(X ∩ Y) = s(X) ∩ s(Y).
(l4) ∀X ∈ P(V), s(X) = s(s(X)).

We sometimes write s(x) for s({x}), where x ∈ V.

Definition 1. The pair (V, s) is called an upper approximation space (uas) if s satisfies properties (u1), (u3), and (u4). If U ⊆ V, then U is called an s-subspace, or merely a subspace of V, if s(U) = U.

We note that if (V, s) is a uas it is not necessarily the case that ∅ = s(∅). We also note that condition (u2) follows from condition (u3).
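As a quick computational sanity check of the remark above that conditions (u1)-(u4) hold for upper approximation operators defined via an equivalence relation, one can verify them exhaustively on a small toy partition; the example below is ours and purely illustrative.

```python
from itertools import combinations

V = {0, 1, 2, 3}
blocks = [{0, 1}, {2}, {3}]                  # a toy partition of V

def s(X):
    # Upper approximation induced by the partition: union of blocks meeting X.
    return set().union(*(B for B in blocks if B & set(X)))

def subsets(S):
    S = list(S)
    return [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]

# (u1) X subset of s(X); (u3) s(X u Y) = s(X) u s(Y); (u4) s(s(X)) = s(X)
assert all(set(X) <= s(X) for X in subsets(V))
assert all(s(X | Y) == s(X) | s(Y) for X in subsets(V) for Y in subsets(V))
assert all(s(s(X)) == s(X) for X in subsets(V))
```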

Let W denote the set of all nonnegative integers, N the set of all negative integers, ℙ the set of all positive integers, and Z the set of all integers.

Definition 2. Let (V, s) be an uas. Let X ⊆ V. Then X is said to be finitely fanned if for every finite subset X' of X, ∃v ∈ V such that X' ⊆ s(v).


Proposition 1. Let (V, s) be an uas. Then there exists a maximally finitely fanned subset of V. In fact, ∀v ∈ V, {v} is contained in a maximally finitely fanned subset.

Proof: Let F = {X | X ⊆ V and X is finitely fanned}. Let v ∈ V. Then {v} ⊆ s(v) and so {v} ∈ F. Thus F ≠ ∅. Let {X_α}_{α∈A} be any chain in F, where X_α ∈ F, ∀α ∈ A. Let X' be a finite subset of X, where X = ∪_{α∈A} X_α. Then X' ⊆ X_α for some α ∈ A. Thus ∃v ∈ V such that X' ⊆ s(v). Hence X ∈ F. By Zorn's lemma, F has a maximal element. The desired result thus follows easily.

Proposition 2. Let (V, s) be an uas. Let U be a maximally finitely fanned subset of V such that if x E s(U), x E s(u) for some u E U. Then U is a subspace of V.

Proof: Let x ∈ s(U). Then ∃u ∈ U such that x ∈ s(u). Now ∀u_1, ..., u_n ∈ U, ∃v ∈ V such that u, u_1, ..., u_n ∈ s(v) and so x ∈ s(u) ⊆ s(v). That is, ∀u_1, ..., u_n ∈ U, ∃v ∈ V such that x, u_1, ..., u_n ∈ s(v). Hence U ∪ {x} is finitely fanned. By the maximality of U, we have that x ∈ U. Thus s(U) ⊆ U and so s(U) = U. Hence U is a subspace of V.

Proposition 3. Let (V, s) be an uas. Then V is a union of its maximally finitely fanned subsets.

Proof: For all v ∈ V, ∃ a maximally finitely fanned subset U of V such that v ∈ U. Hence the desired result is immediate.

Let (V, s) be an uas and let U be a subspace of V. We say that a subset X of U generates U if s(X) = U. We say that U is singly generated if it is generated by a set with a single element.

Example 1. Define s : P(W) → P(W) by: ∀X ∈ P(W), s(X) = {y ∈ W | y ≤ x for some x ∈ X} if X ≠ ∅, and s(∅) = ∅. Then s satisfies conditions (u1)-(u4). The s-subspaces of W are W and {0, 1, ..., n}, ∀n ∈ W. Hence all s-subspaces of W are singly generated, except W. That is, W is not singly generated even though it is a union of an ascending sequence of singly generated s-subspaces. We also see that W is finitely fanned. There is no maximal singly generated s-subspace of W.

Let (V, s) be an uas and let U be a subspace of V. Define σ_U : P(V) → P(V) by: ∀X ∈ P(V),

σ_U(X) = {y ∈ U | s(y) ∩ X ≠ ∅}.

We sometimes write σ for σ_V. It follows that σ satisfies (u1)-(u4).

Definition 3. Let (V, s) be an uas and let (U, s) be a subspace of (V, s). Let X ⊆ U. Then X is called U-genetic, or X is said to be a U-genetic set, if σ_U(X) ⊆ s(X). If U = V, a U-genetic set is called genetic.


Definition 4. Let (V, s) be an uas and let (U, s) be a subspace of (V, s). Let X, Y ⊆ U. Then X is U-genetic for Y if X is U-genetic and Y = s(X). Y is U-genetically closed if ∃Z ⊆ Y such that Z is U-genetic for Y.

Example 2. Define s : P(Z) → P(Z) by: ∀X ∈ P(Z), s(X) = {y ∈ W | y ≤ |x| for some x ∈ X} ∪ X if X ≠ ∅, and s(∅) = ∅. Then s satisfies conditions (u1)-(u4). Now N is genetic since σ(N) = N ⊆ s(N). In fact, any subset of N is genetic. However, W is not genetic. In fact, if X is a subset of Z such that ∃k ∈ W such that −m ∉ X ∀m ≥ k, then X is not genetic since m ∈ σ(k) and m ∉ s(k). Now W is s(W)-genetic, as is {k ∈ W | k ≤ m} for some m ∈ W. For C = s({−2k | k ∈ ℙ}), {−2k | k > 1} ∪ {3} is C-genetic, but not Z-genetic.

Definition 5. Let (V, s) be an uas and let X ⊆ V. Then X is called a primary of V if X is a minimal nonempty genetically closed subspace of V.

Definition 6. Let (V, s) be an uas. Then V is said to be benign if every nonempty genetically closed subspace of V contains a primary of V.

Lemma 1. Let (V, s) be an uas. Then ∀x ∈ V, s(σ(x)) is genetically closed.

Proof: Let x ∈ V. Then σ(x) ⊆ s(σ(x)) and σ(σ(x)) = σ(x) ⊆ s(σ(x)), i.e., σ(x) is V-genetic. Now σ(x) is V-genetic for s(σ(x)) since σ(x) is genetic and s(σ(x)) = s(σ(x)). Hence s(σ(x)) is V-genetically closed. (Z = σ(x) in Definition 4.)

Definition 7. Let (V, s) be an uas. Let U be a subspace of V. Let X ⊆ U. The U-core of X is given by μ_U : P(U) → P(U), where

μ_U(X) = {x ∈ X | σ_U(x) ⊆ X}.

Proposition 4. Let (V, s) be an uas. Let U be a subspace of V. Then μ_U(U) = U and μ_U(∅) = ∅.

Proof: We have μ_U(U) ⊆ U by definition. Let u ∈ U. Then σ_U(u) ⊆ U by the definition of σ_U.

Lemma 2. Let (V, s) be an uas. Let U be a subspace of V. Let X ⊆ Y ⊆ U. Then μ_V(X) ⊆ μ_U(Y).

Proof: Let x ∈ μ_V(X). Then σ_V(x) ⊆ X. Now σ_U(x) ⊆ σ_V(x) ⊆ X ⊆ Y. Thus x ∈ μ_U(Y) since x ∈ Y. Hence μ_V(X) ⊆ μ_U(Y).

Lemma 3. Let (V, s) be an uas. Let U be a subspace of V. Let X ⊆ U. Then σ_V(μ_V(X)) = σ_U(μ_V(X)) = μ_V(X).


Proof: Clearly, OV(J.tv(X)) ;2 au(J.tv(X)) ;2 J.tv(X). Let y E av(J.tv(X)). Then s(y) n J.tv(X) :f. 0. That is, s(y) n {z E X I av(z) ~ X} :f. 0. Hence 3z E s(y) n X such that {w E V I s(w) n {z} :f. 0} = av(z) ~ X. Thus 3z E s(y) n X such that {w E V I z E s(w)} ~ X. Hence y E X. Also, av(y) = {t E V I s(t) n {y} :f. 0} = {t E V lyE s(t)} ~ X since z E s(y) ~ s(s(t)) = s(t). Thus y E J.tv(X). Hence av(J.tv(X)) ~ J.tv(X).

Lemma 4. Let (V, s) be an uas. Let U be a subspace of V and X ⊆ U. Then

μ_V(U) ∩ μ_U(X) = μ_V(X).

Proof: By Lemma 2, μ_V(X) = μ_V(U) ∩ μ_V(X) ⊆ μ_V(U) ∩ μ_U(X). By Lemma 3, if y ∈ μ_V(U) ∩ μ_U(X) (= {u ∈ U | σ_V(u) ⊆ U} ∩ {x ∈ X | σ_U(x) ⊆ X}), then σ_V(y) ⊆ U and σ_U(y) ⊆ X and so σ_V(y) = σ_U(y) ⊆ X. By Definition 7, y ∈ μ_V(X). Thus μ_V(U) ∩ μ_U(X) ⊆ μ_V(X).

Theorem 1. Let (V, s) be an uas. Let X be a genetically closed subset of V. Then μ_V(X) is genetic for X and every set genetic for X is contained in μ_V(X).

Proof: Let Y ~ X be genetic for X. Then a(Y) ~ s(Y) = X. By the definition of core, Y ~ J.tv(X). Thus any set genetic for X is a subset of J.tv(X). By Lemma 3, a(J.tv(X)) = J.tv(X). Also J.tv(X) ~ X = s(Y) and since Y ~ J.tv(X), we have s(Y) ~ s(J.tv(X)). But J.tv(X) ~ X and since X is genetically closed, s(X) = X. Consequently, s(J.tv(X)) ~ s(X) = X. Thus we get a(J.tv(X)) = J.tv(X) ~ X = s(Y) ~ s(J.tv(X)) ~ X, and so a(J.tv(X)) ~ s(J.tv(X)) = X. Hence J.tv(X) is genetic for X.

Corollary 1. Let (V, s) be an uas. Let x ∈ V. Then x is a member of every nonempty genetically closed subset of s(σ(x)).

Proof: Let X ~ s(a(x)) be a nonempty genetically closed subset. Then J.tv(X) :f. 4>. Thus J.tv(X) n s(a(x)) :f. 4>. Hence by [1, Corollary 3.3.9, p. 207], s(a(J.tv(X))) n {x}:f. 4>. That is, x E s(a(J.tv(X))). Now by Lemma 3, a(J.tv(X)) = J.tv(X). By Theorem 1, s(J.tv(X)) = X since J.tv(X) is genetic for X. Hence x E X.

Lemma 5. Let (V, s) be an uas. Let U be a primary for V. Let x ∈ V. Then s(σ(x)) = U if and only if x ∈ μ_V(U).

Proof: Since U is a subspace of V, s(σ(x)) = U implies σ(x) ⊆ U. But σ(x) ⊆ U implies that x ∈ μ_V(U) by the definition of the core.

Conversely, if x ∈ μ_V(U), then s(σ(x)) ⊆ s(σ(μ_V(U))). But σ(μ_V(U)) = μ_V(U) by Lemma 3 and s(μ_V(U)) = U by Theorem 1, since μ_V(U) is genetic for U. Hence s(σ(x)) ⊆ U. Now by Lemma 1, s(σ(x)) is genetically closed, and since it is contained in the minimal genetically closed set U, the two must be equal. That is, s(σ(x)) = U.


Lemma 6. Let (V, s) be an uas. Let U be a primary for V. Let x ∈ U. Then U ⊆ s(σ(x)).

Proof: There exists y ∈ μ_V(U) such that x ∈ s(y), since by Theorem 1, μ_V(U) is genetic for U and hence s(μ_V(U)) = U. But x ∈ s(y) implies y ∈ σ(x), which implies σ(y) ⊆ σ(x). By Lemma 5, U = s(σ(y)) since y ∈ μ_V(U). But σ(y) ⊆ σ(x) implies s(σ(y)) ⊆ s(σ(x)) and thus U ⊆ s(σ(x)).

Lemma 7. Let (V, s) be an uas. If V = ∪_{i∈I} V_i, where {V_i | i ∈ I} is the set of primaries of V, then V is benign.

Proof: If V = <p, then the result is trivial. Suppose V i <p. Let X ~ V be genetically closed and nonempty. Then J1v(X) i <p as shown in the proof of Corollary 1. Hence 3x E J1v(X) n Vi for some i E I. Since J1v(Vi) is genetic for Vi by Theorem 1, 3y E J1v(Vi) such that x E s(y) and so y E u(x). Now by Lemma 5, y E J1v(Vi) => Vi = s(u(y)). Moreover, y E u(x) => s(u(y)) ~ s(u(x)) and thus Vi ~ s(u(x)). But x E J1v(X) => u(x) ~ u(J1v(X)) = J1v(X) and since X is genetically closed, s(u(x)) ~ s(J1(X)) = X. Hence Vi ~ X and since X is an arbitrary genetically closed subset of V, V is benign.

Lemma 8. Let (V, s) be an uas. Let V_i and V_j be distinct primaries of V. Then μ_V(V_i) ∩ μ_V(V_j) = ∅.

Proof: Let y ∈ μ_V(V_i) ∩ μ_V(V_j). Then by Lemma 5, V_i = s(σ(y)) = V_j. The lemma follows by the contrapositive.

Theorem 2. Let (V, s) be an uas. Let {V_i | i ∈ I} be the set of primaries of V. Then V (≠ ∅) is benign if and only if

(1) V = ∪_{i∈I} V_i, and
(2) V ≠ ∪_{i∈I\{j}} V_i, ∀j ∈ I.

Proof: Suppose V is benign. Let x E V. Then 3i E I such that Vi ~ s(u(x)) since s(u(x)) is genetically closed by Lemma 1. But by Corollary 1, x is a member of every nonempty genetically closed subset of s(u(x)) and since Vi is one such subset of s(u(x)), we have x E Vi. Thus (1) holds. Let j E I and let y E J1v(Vj). If y E Vi for some i i j, then 3x E u(y) n J1v(Vi). However y E J1v(Vj) => u(y) ~ J1v(Vj) and thus J1v(Vi) n J1v(Vj) i <p contrary to Lemma 8. Thus y ~ J1v(Vi) Vi i j. Hence (2) holds.

The converse is immediate from Lemma 8.

Theorem 3. Let (V, s) be an uas. Let x ∈ V. Then the following are equivalent.

(1) s(x) is a primary of V.
(2) s(x) is genetically closed.
(3) μ_V(s(x)) ≠ ∅.
(4) s(x) is a maximally singly generated subspace of V.
(5) σ(x) ⊆ s(x).


Proof: That (1)⇒(2) and (2)⇒(3) follows immediately from the respective definitions.

(3)⇒(4): Suppose s(x) ⊆ s(y) for some y ∈ V. Since μ_V(s(x)) ⊆ s(x), we have μ_V(s(x)) ⊆ s(y). But then μ_V(s(x)) ∩ s(y) ≠ ∅ since μ_V(s(x)) ≠ ∅. Hence σ(μ_V(s(x))) ∩ s(y) ≠ ∅, implying that y ∈ σ(μ_V(s(x))) = μ_V(s(x)), where the latter equality holds by Lemma 3. But y ∈ μ_V(s(x)) ⇒ s(y) ⊆ s(μ_V(s(x))) ⊆ s(x). Hence s(x) ⊆ s(y) ⇒ s(y) ⊆ s(x), i.e., s(x) = s(y). Thus s(x) is maximal.

(4)⇒(5): Suppose that σ(x) ⊄ s(x). Then ∃y ∈ σ(x)\s(x). Hence x ∈ s(y) and so s(x) ⊆ s(y). But y ∉ s(x) and so s(y) ⊈ s(x). However, this contradicts the maximality of s(x). Thus σ(x) ⊆ s(x).

(5)⇒(1): Since σ(x) ⊆ s(x), {x} is genetic for s(x) and hence s(x) is genetically closed. Let U be a subspace of s(x) which is genetically closed and nonempty. By Corollary 1, x ∈ U. Hence s(x) = U. Thus s(x) is minimal. Hence s(x) is a primary of V.

Corollary 2. Let (V, s) be an uas. Let U be a subspace of V. Let σ(x) be finite for some x ∈ U. Then U is a primary of V if and only if U is a maximal singly generated subspace of V.

Proof: Let U be a primary of V. Since σ(x) is finite, ∃y ∈ σ(x) ∩ μ_V(U) such that ∀y′ ∈ σ(y), σ(y′) = σ(y). Now σ(y′) = σ(y) ⇒ y ∈ σ(y′) ⇒ y′ ∈ s(y), and so σ(y) ⊆ s(y). By Theorem 3, s(y) is a primary of V and, since y ∈ U, U = s(y) by minimality. Thus U is a maximal singly generated subspace of V by Theorem 3. The converse is immediate from Theorem 3.

3 Lower Approximation Spaces

Definition 8. The pair (V, s̲) is called a lower approximation space (las) if s̲ satisfies properties (l1), (l3), and (l4). If U ⊆ V, then U is called an s̲-subspace, or merely a subspace of V, if s̲(U) = U.

We note that if (V, s̲) is a las, then it is not necessarily the case that V = s̲(V). We also note that condition (l2) follows from condition (l3).

Proposition 5. Let (V, s̲) be a las. Then the union of any collection of nonempty subspaces of V is a subspace of V.

Proof: Let {U_i | i ∈ I} be a collection of subspaces of V, where I is a nonempty index set, and let U = ⋃_{i∈I} U_i. Now s̲(U_i) ⊆ s̲(⋃_{i∈I} U_i) by (l2) for all i ∈ I. Thus s̲(U) = s̲(⋃_{i∈I} U_i) ⊇ ⋃_{i∈I} U_i = U. Since s̲(U) ⊆ U by (l1), s̲(U) = U and so U is a subspace of V.

Definition 9. Let (V, s̲) be a las and let X be a subset of V. Define (X) to be the union of all subspaces of V which are contained in X.


Since s̲(∅) = ∅, ∅ is a subspace of V which is contained in every subset X of V. Thus (X) in Definition 9 is meaningful. Clearly, (X) is the largest subspace of V which is contained in X.

Proposition 6. Let (V, s̲) be a las and let X be a subset of V. Then (X) = s̲(X).

Proof: Let Y be a subspace of V such that Y ⊆ X. Then Y = s̲(Y) ⊆ s̲(X). Thus (X) ⊆ s̲(X) by Definition 9. By (l1), s̲(X) ⊆ X and, since s̲(X) is a subspace of V by (l4), s̲(X) ⊆ (X) since (X) is the largest subspace of V contained in X. Hence s̲(X) = (X).

Definition 10. Let (V, s̲) be a las. Then a subset X of V is called minimal if X is a smallest subset of V such that s̲(X) ≠ ∅.

Proposition 7. Let (V, s̲) be a las. If X is a subset of V which is minimal, then X is a subspace of V.

Proof: Now X ⊇ s̲(X) and s̲(s̲(X)) = s̲(X) ≠ ∅. Thus X = s̲(X) by the minimality of X.

Theorem 4. Let (V, s) be an uas. Define E ⊆ V × V by: ∀(x, y) ∈ V × V, (x, y) ∈ E if and only if s(x) = s(y). Then E is an equivalence relation on V. Let [x] denote the equivalence class of x for E, where x ∈ V. Furthermore, the following conditions are equivalent ∀x ∈ V:

(1) [x] is a subspace of V; (2) [x] = s(x); (3) ∀y ∈ V, y ∈ s(x) implies s(y) = s(x).

Proof: Clearly E is an equivalence relation on V. Let y ∈ [x]. Then s(y) = s(x) and so y ∈ s(x). Thus [x] ⊆ s(x).

(1)⇒(2): Since x ∈ [x], s(x) ⊆ s([x]) = [x]. Thus [x] = s(x).

(2)⇒(3): Let y ∈ V. Suppose y ∈ s(x). Since s(x) = [x], y ∈ [x] and so s(y) = s(x).

(3)⇒(1): We have [x] ⊆ s(x). Let y ∈ s(x). Then s(x) = s(y) and so xEy. Thus y ∈ [x]. Hence s(x) ⊆ [x].

Corollary 3. Let (V, s) be an uas. Let E be defined as in Theorem 4. Then the following conditions are equivalent:

(1) ∀x ∈ V, [x] = s(x); (2) ∀x, y ∈ V, x ∈ s(y) if and only if y ∈ s(x).

Proof: (1)⇒(2): ∀x, y ∈ V, y ∈ s(x) ⇔ s(y) = s(x) ⇔ x ∈ s(y).

(2)⇒(1): Let x ∈ V and y ∈ s(x). Then x ∈ s(y) and so s(y) = s(x). Hence [x] = s(x) by (3) ⇒ (2) of Theorem 4.


Example 3. Let V = {1, 2} and let s : P(V) → P(V) be defined by s(∅) = ∅, s(1) = {1}, s(2) = V = s(V). Then s satisfies (u1), (u3), and (u4), and E = {(1, 1), (2, 2)}. Now s(1) = {1} = [1] ⊂ V = s(2) ⊃ {2} = [2]. We see that it is not the case that s(x) = [x] ∀x ∈ V.

Theorem 5. Let (V, s̲) be a las. Define F ⊆ V × V by: ∀(x, y) ∈ V × V, (x, y) ∈ F if and only if s̲(V\{x}) = s̲(V\{y}). Then F is an equivalence relation on V. Furthermore, the following conditions are equivalent ∀x ∈ V:

(1) V\[x]_F is a subspace of V; (2) V\[x]_F = s̲(V\{x}); (3) ∀y ∈ V, y ∉ s̲(V\{x}) implies s̲(V\{x}) = s̲(V\{y}),

where [x]_F is the equivalence class of x for F.

Proof: Clearly, F is an equivalence relation on V. Now y ∉ V\[x] ⇒ y ∈ [x] ⇒ s̲(V\{x}) = s̲(V\{y}), and so y ∉ s̲(V\{x}). Thus s̲(V\{x}) ⊆ V\[x].

(1)⇒(2): V\[x] = s̲(V\[x]) ⊆ s̲(V\{x}). Thus V\[x] = s̲(V\{x}).

(2)⇒(3): Now y ∉ s̲(V\{x}) ⇒ y ∉ V\[x] (by (2)) ⇒ y ∈ [x] ⇒ s̲(V\{x}) = s̲(V\{y}).

(3)⇒(1): We have s̲(V\{x}) ⊆ V\[x]. Suppose y ∉ s̲(V\{x}). Then s̲(V\{x}) = s̲(V\{y}) by (3). Hence xFy and so y ∈ [x]. Thus y ∉ V\[x]. Hence V\[x] ⊆ s̲(V\{x}). Therefore V\[x] = s̲(V\{x}).

Corollary 4. Let (V, s̲) be a las. Let F be defined as in Theorem 5. Then the following conditions are equivalent:

(1) ∀x ∈ V, V\[x] = s̲(V\{x}); (2) ∀x, y ∈ V, x ∈ s̲(V\{y}) ⇔ y ∈ s̲(V\{x}).

Proof: (1)⇒(2): ∀x, y ∈ V, y ∈ s̲(V\{x}) ⇔ y ∈ V\[x] (by (1)) ⇔ y ∉ [x] ⇔ x ∉ [y] ⇔ x ∈ V\[y] ⇔ x ∈ s̲(V\{y}).

(2)⇒(1): Let x ∈ V and y ∉ s̲(V\{x}). Then x ∉ s̲(V\{y}) and so s̲(V\{y}) = s̲(V\{x}). Hence V\[x] = s̲(V\{x}) by (3)⇒(2) of Theorem 5.

Example 4. Let V = {1, 2} and let s̲ : P(V) → P(V) be defined by s̲(∅) = ∅, s̲({1}) = {1}, s̲({2}) = ∅, and s̲(V) = V. Then s̲ satisfies (l1), (l3), and (l4), and F = {(1, 1), (2, 2)}. Now s̲({1}) = {1} = [1] ⊃ ∅ = s̲({2}) ⊂ {2} = [2]. We see that it is not the case that s̲({x}) = [x] ∀x ∈ V.

Let (V, s) and (V, s̲) be upper and lower approximation spaces, respectively. For the remainder of the section, we assume that ∀X ∈ P(V), V\s(X) = s̲(V\X).

Let E be the equivalence relation on V defined as in Theorem 4.

Assume (1): ∀x, y ∈ V, y ∈ s(x) if and only if x ∈ s(y). Then s(x) = [x]_E ∀x ∈ V by Corollary 3.

Theorem 6. Let (V, s) be an uas. Suppose that s(X) = ⋃_{x∈X} s(x) ∀X ∈ P(V). Then s(X) = {y ∈ V | [y]_E ∩ X ≠ ∅}.


Proof: s(X) = ⋃_{x∈X} s(x) = ⋃_{x∈X} [x]_E = {y ∈ V | [y]_E ∩ X ≠ ∅}.

Theorem 7. Let (V, s̲) be a las. Suppose that s(X) = ⋃_{x∈X} s(x) ∀X ∈ P(V). Then ∀X ∈ P(V), s̲(X) = {y ∈ V | [y]_E ⊆ X}.

Proof: z ∈ s̲(X) ⇔ z ∈ V\s(V\X) ⇔ z ∈ V\{y ∈ V | [y]_E ∩ (V\X) ≠ ∅} ⇔ z ∉ {y ∈ V | [y]_E ∩ (V\X) ≠ ∅} ⇔ z ∈ {y ∈ V | [y]_E ⊆ X}.
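As a small computational illustration of Theorems 6 and 7 (our addition, not part of the original text; the names `upper`, `lower`, and the toy universe are invented), the sketch below computes s(X) = {y : [y]_E ∩ X ≠ ∅} and s̲(X) = {y : [y]_E ⊆ X} for a finite universe whose operator is generated by an equivalence relation given as a partition.

```python
# Minimal sketch assuming a finite universe V given as a list of E-classes.

def upper(X, classes):
    """s(X) = union of the E-classes that meet X (Theorem 6)."""
    X = set(X)
    return set().union(*[c for c in classes if c & X])

def lower(X, classes):
    """s_(X) = union of the E-classes contained in X (Theorem 7)."""
    X = set(X)
    return set().union(*[c for c in classes if c <= X])

# Example: V = {1,...,5} partitioned into E-classes {1,2}, {3}, {4,5}.
classes = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5})]
X = {1, 3, 4}
print(upper(X, classes))   # {1, 2, 3, 4, 5}
print(lower(X, classes))   # {3}
# One checks V \ upper(V \ X) == lower(X), the duality assumed in the text.
```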

Assume (2): ∀x, y ∈ V, x ∈ s̲(V\{y}) if and only if y ∈ s̲(V\{x}). Then V\[x]_F = s̲(V\{x}) ∀x ∈ V by Corollary 4.

Assumption (1) ⇔ Assumption (2): s(x) = s(y) ⇔ V\s(x) = V\s(y) ⇔ s̲(V\{x}) = s̲(V\{y}).

Corollary 5. E = F.

Proof: [x]_E = [y]_E ⇔ [x]_F = [y]_F, i.e., xEy ⇔ xFy.

Proposition 8. Let (V, s) and (V, s̲) be upper and lower approximation spaces, respectively. Then X is an s-subspace of V if and only if V\X is an s̲-subspace of V.

Proof: s(X) = X ⇔ V\s(X) = V\X ⇔ s̲(V\X) = V\X.

Definition 11. Let (V, s) be an uas. Let X be a subspace of V. Then X is called an s-primary subspace of V if X is maximally singly generated, i.e., ∃x ∈ V such that X = s({x}) and ∀y ∈ V, s({x}) ⊆ s({y}) implies s({x}) = s({y}).

Definition 12. Let (V, s̲) be a las. Let X be a subspace of V. Then X is called an s̲-primary subspace of V if X is comaximally singly generated, i.e., ∃x ∈ V such that V\X = s̲(V\{x}) and ∀y ∈ V, s̲(V\{x}) ⊇ s̲(V\{y}) implies s̲(V\{x}) = s̲(V\{y}).

Theorem 8. Let (V, s) and (V, s̲) be upper and lower approximation spaces, respectively. Then P is an s-primary subspace of (V, s) if and only if V\P is an s̲-primary subspace of (V, s̲).

Proof: We have by Proposition 8 that P is an s-subspace of (V, s) if and only if V\P is an s̲-subspace of (V, s̲). Now [s̲(V\{x}) ⊇ s̲(V\{y}) implies s̲(V\{x}) = s̲(V\{y})] ⇔ [V\s({x}) ⊇ V\s({y}) implies V\s({x}) = V\s({y})] ⇔ [s({x}) ⊆ s({y}) implies s({x}) = s({y})].

Theorem 9. Let (V, s) and (V, s̲) be upper and lower approximation spaces, respectively. Let X be an s-subspace of (V, s), or equivalently let V\X be an s̲-subspace of (V, s̲). Then X is a union of s-primaries of (V, s) if and only if V\X is a union of s̲-primaries of (V, s̲).

Proof: The result is immediate from Theorem 8.


Let (V, s) be an upper approximation space. If ∃u ∈ V such that s(u) = V, then u is called a unit. Let V ≠ ∅. If V is finite, then ∃X ⊆ V such that X is maximal with respect to the property that s(X) ≠ V. If V is any space with the property that ∃X ⊆ V such that X is maximal with respect to the property that s(X) ≠ V, then X is a subspace of V: s(X) ≠ V and so s(s(X)) ≠ V. Hence X = s(X) by the maximality of X.

A subspace X of V is called proper if X ≠ V. If X is a proper subspace of V such that there does not exist a subspace Y of V with X ⊂ Y ⊂ V, then X is called maximal. Clearly, if X is maximal with respect to the property that s(X) ≠ V, then X is a maximal subspace.

Theorem 10. Let (V, s) be an uas such that s(∅) ≠ V. Suppose that V has a unit. Then ∃X ⊆ V such that X is maximal with respect to the property that s(X) ≠ V. In fact, if U is any proper subspace of V, then U is contained in a maximal subspace X of V.

Proof: Suppose that U is a proper subspace of V. Let S = {Y | U ⊆ Y ⊆ V and s(Y) ≠ V}. Let {Y_α | α ∈ Ω} ⊆ S be a chain. Then s(Y_α) ≠ V for all α ∈ Ω. Suppose that ⋃_{α∈Ω} Y_α = V. Then V ⊇ s(⋃_{α∈Ω} Y_α) ⊇ ⋃_{α∈Ω} s(Y_α) = V. Since V has a unit, say u, u ∈ s(Y_α) for some α ∈ Ω. Hence V = s(u) ⊆ s(Y_α) ⊆ V. Thus s(Y_α) = V, a contradiction. Hence ⋃_{α∈Ω} Y_α ≠ V. Thus by Zorn's Lemma, S has a maximal element, say X. By the comments preceding the theorem, X is the desired subspace. The first part of the theorem holds by letting U = ∅.

Corollary 6. Let (V, s) be an uas. Suppose that V has a unit. Then any x ∈ V which is not a unit is contained in a maximal subspace X of V.

Proof: Let U = s(x) in the theorem.

Proposition 9. Let (V, s) be an uas. If V has a maximal subspace X, then V = X ∪ Y, where Y is any subspace of V not contained in X.

Proof: We have s(X ∪ Y) = s(X) ∪ s(Y) = X ∪ Y = V, else we contradict the maximality of X.

Corollary 7. Let (V, s) be an uas. If V has maximal subspaces, then V is the union of any two of them.

Theorem 11. Let (V, s) be an uas. Suppose that V has a unit. Let X be a subspace of V. Then X is the unique maximal subspace if and only if V\X ≠ ∅ and V\X is the set of units of V.

Proof: Suppose that X is the unique maximal subspace of V. Let u ∈ V\X. Suppose that u is not a unit. Then s(u) ≠ V. Then u is contained in a maximal subspace Y by Theorem 10. Since u ∉ X, X ≠ Y. However, this contradicts the uniqueness of X. Hence u is a unit. Conversely, suppose that V\X is the set of units of V. Let Y be any subspace of V. If Y ⊈ X, then Y contains a unit and so Y = V. Hence X is the unique maximal subspace of V.

A proper subspace U of V is called prime if U ⊂ s(x) for some x ∈ V implies s(x) = V.

Definition 13. Let (V, s) be an uas with a unit. Then V is called a principal space if for every subspace X of V there exists x ∈ X such that s(x) = X.

Theorem 12. Let (V, s) be a principal uas. Then the prime subspaces and the maximal subspaces coincide.

Proof: Let P be a prime subspace of V. Then ∃x ∈ P such that P = s(x). Now there exists a maximal subspace M containing P. Since V is principal, there exists y ∈ M such that M = s(y). Hence s(x) = P ⊆ M = s(y). Since P is prime, s(x) = s(y). Thus P = M and so P is maximal. Now let M be a maximal subspace of V. Since V is principal, there exists x ∈ M such that s(x) = M. That M is prime follows from its maximality.

Example 5. Let V = {x, y, z}. Define s : P(V) → P(V) as follows: ∀Z ∈ P(V),

s(Z) = V if z ∈ Z, and s(Z) = Z otherwise.

Then s satisfies (u1)-(u4). We see that z is a unit and that s(x) = {x} and s(y) = {y} are prime subspaces of V, but are not maximal since U = {x, y} is a proper subspace of V. We note that U is the unique maximal subspace of V, U is prime, and V is not principal.

Example 6. Let V = {x, y, z}. Define s : P(V) → P(V) as follows: ∀Z ∈ P(V),

s(Z) = V if z ∈ Z; s(Z) = {x, y} if y ∈ Z, z ∉ Z; s(Z) = {x} if Z = {x}; s(Z) = ∅ if Z = ∅.

Then s satisfies (u1)-(u4). We see that z is a unit, V is principal, and s(y) = {x, y} is the only maximal and prime subspace of V.
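A brute-force check of Example 6 can be carried out as in the sketch below (our addition; since properties (u1)-(u4) are stated earlier in the chapter and not reproduced in this excerpt, the three properties verified here -- extensivity, monotonicity, idempotence -- are only our reading of them, and the code is illustrative rather than the authors' construction).

```python
# Encode the operator s of Example 6 on P(V), V = {x, y, z}, and verify some
# expected properties plus the stated facts about units and maximal subspaces.
from itertools import combinations

V = frozenset({'x', 'y', 'z'})

def s(Z):
    Z = frozenset(Z)
    if 'z' in Z:
        return V
    if 'y' in Z:
        return frozenset({'x', 'y'})
    if Z == frozenset({'x'}):
        return frozenset({'x'})
    return frozenset()                      # Z is the empty set

def powerset(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

P = powerset(V)
assert all(Z <= s(Z) for Z in P)                              # extensivity
assert all(s(A) <= s(B) for A in P for B in P if A <= B)      # monotonicity
assert all(s(s(Z)) == s(Z) for Z in P)                        # idempotence
assert s({'z'}) == V                                          # z is a unit
proper_subspaces = [Z for Z in P if s(Z) == Z and Z != V]
assert max(proper_subspaces, key=len) == frozenset({'x', 'y'})  # maximal subspace
```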


References

1. Z. Bavel, Introduction to the Theory of Automata, Reston Publishing Co., Inc., Reston, Virginia, A Prentice Hall Co., 1983.

2. N. Kuroki and J. N. Mordeson, Successor and source functions, J. Fuzzy Math. 5 (1997) 173 - 182.

3. N. Kuroki and J. N. Mordeson, Structure of rough sets and rough groups, J. Fuzzy Math. 5 (1997) 183 - 191.

4. J. N. Mordeson and P. S. Nair, Retrievability and connectedness in fuzzy finite state machines, Fifth IEEE International Conference on Fuzzy Systems, vol. 3, 1586 - 1590, 1996.

5. J. N. Mordeson and P. S. Nair, Connectedness in systems theory, Fifth IEEE International Conference on Fuzzy Systems, vol. 3, 2045 - 2048, 1996.

6. E. Orlowska, Semantic analysis of inductive reasoning, Theoretical Computer Science 43 (1986) 81 - 89.

7. Z. Pawlak, Rough sets, Int. J. Comp. Inform. Sci. 11 (1982) 341 - 356.

8. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Acad. Pub., 1991.

9. J. A. Pomykala, Approximation operations in approximation space, Bulletin of the Polish Academy of Sciences, Mathematics, 35 (1987) 653 - 662.

10. R. Slowinski and D. Vanderpooten, Similarity relation as a basis for rough approximations, in: Advances in Machine Intelligence & Soft Computing, edited by P. P. Wang, Department of Electrical Engineering, Duke University, Durham, North Carolina, USA, 17 - 33, 1997.

11. A. Wasilewska, Conditional knowledge representation systems - model for an implementation, Bulletin of the Polish Academy of Sciences: Mathematics, 37 (1987) 63 - 69.

12. A. Wasilewska and L. Vigneron, On generalized rough sets, preprint.

13. U. Wybraniec-Skardowska, On a generalization of approximation space, Bulletin of the Polish Academy of Sciences: Mathematics, 37 (1989) 51 - 61.

14. Y. Y. Yao, Relational interpretations of neighborhood operators and rough set approximation operators, preprint.

15. Y. Y. Yao, Two views of the theory of rough sets in finite universes, Int. J. Approximate Reasoning, 15 (1996) 291 - 317.

16. Y. Y. Yao and T. Y. Lin, Generalization of rough sets using modal logics, Intelligent Automation and Soft Computing, 2 (1996) 103 - 120.

17. L. A. Zadeh, Fuzzy sets, Inform. and Control 8 (1965) 338 - 353.

18. W. Zakowski, Approximations in the space (U, Π), Demonstratio Mathematica, XVI (1983) 761 - 769.


Indexed Rough Approximations, A Polymodal System, and Generalized Possibility Measures

Sadaaki Miyamoto

University of Tsukuba, Ibaraki 305-8573, Japan

Abstract. Indexed rough approximations that generalize fuzzy rough sets are proposed. A family of indexed relations between objects, with the set of indices being a lattice, is considered. Relations in the family are ordered by inclusion, and moreover the ordering is assumed to be consistent with the ordering of the lattice. Thus a collection of rough approximations, each of which is induced from a relation in the family, is obtained. A polymodal system is defined in which the modal operators carry these indices, and the completeness of the axiomatic system with respect to the Kripke model given by the above collection of rough approximations is proved. Possibility and necessity measures for sentences that take values in the lattice are derived from the polymodal system. These measures are proved to be equivalent to the ordinary possibility and necessity measures when the lattice is the unit interval.

1 Introduction

Since rough sets [7,8] and fuzzy sets [11] are two major methods of dealing with uncertainties, relationships between rough sets and fuzzy sets should be studied methodologically. Dubois and Prade [3] have proposed rough approximations of fuzzy sets called rough fuzzy sets, and approximations using fuzzy similarity relations called fuzzy rough sets.

From the viewpoint of rough approximations, the latter, fuzzy rough sets, offer more room for theoretical study. The method herein can be considered a generalization of fuzzy rough sets, since the present family of relations for the approximations generalizes the similarity relations.

It should moreover be noted that the generalization is made on the basis of modal logic. Namely, a polymodal system [9] associated with the rough approximations is defined and its completeness is proved. The algebraic structure of the index set naturally leads us to lattice-valued possibility measures, in which the ordinary possibility and necessity measures [12] are included as a special case.

Another major difference between the ordinary possibility and necessity measures and the present measures is that the ordinary measures are defined in terms of a set, whereas the present measures are functions of a sentence; hence a problem arises whether or not a possibility distribution can be defined from a given possibility measure. This problem is solved by a simple trick of augmenting atomic sentences, each of which represents a possible world.


Organization of this chapter is as follows. We first discuss a generalized rough approximation with indices whose set is assumed to be a lattice. Second, we show a polymodal system in which the Kripke semantics represents the rough approximation. Moreover, an axiomatic system is considered and the completeness is proved. Lattice-valued possibility and necessity measures are defined on the basis of the polymodal system. Relations with the ordinary measures are then shown. The possibility distribution of the modal system is then discussed. Finally, a generalization of the measure to fuzzy sentences is considered.

Proofs of all propositions in this chapter are summarized in a section before the conclusion, as some readers may not be interested in that technical part of the paper.

2 Rough approximations with indices

Let W be a set of objects on which approximations are considered. A family of binary relations R(α), α ∈ A, is considered. Each relation is used to define approximations. R(α) is assumed to be reflexive, but may or may not be an equivalence relation. If an equivalence relation is assumed, the standard rough approximations are obtained [8]; if the relation is not necessarily an equivalence, we have generalized rough approximations. The index set A is assumed to be a lattice, unless stated otherwise. The lattice has the ordering ⪯ and the two operations sup(α, β) and inf(α, β). Moreover, we assume that for α, β ∈ A such that α ⪰ β,

R(α) ⊆ R(β).   (1)

Given a subset Y ⊆ W, we consider the upper approximation

R(α)^*Y = {w ∈ W : ∃y ∈ Y, yR(α)w}   (2)

and the lower approximation

R(α)_*Y = {w ∈ W : ∀z ∈ W, wR(α)z ⇒ z ∈ Y}.   (3)

Notice that the approximations are defined for any α ∈ A. It is easy to see that

R(α)_*Y ⊆ Y ⊆ R(α)^*Y

and that when R(α) is an equivalence relation, R(α)^*Y and R(α)_*Y become the standard upper and lower approximations, respectively.
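A minimal sketch of (2) and (3) is given below (our addition; relation names, the toy universe and the two index values are invented). With the monotonicity convention of (1), the larger index 1.0 has the smaller relation, and the example reproduces the inclusions stated in Proposition 1 below.

```python
# Each R(alpha) is represented as a set of ordered pairs over a finite W.

def upper_approx(R, Y, W):
    """R(alpha)^* Y = {w in W : exists y in Y with y R(alpha) w}."""
    return {w for w in W if any((y, w) in R for y in Y)}

def lower_approx(R, Y, W):
    """R(alpha)_* Y = {w in W : every R(alpha)-successor of w lies in Y}."""
    return {w for w in W if all(z in Y for z in W if (w, z) in R)}

W = {1, 2, 3}
R = {0.5: {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1), (2, 3), (3, 2)},
     1.0: {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1)}}      # R(1.0) contained in R(0.5)
Y = {1, 2}
print(upper_approx(R[1.0], Y, W), upper_approx(R[0.5], Y, W))  # {1, 2} vs {1, 2, 3}
print(lower_approx(R[1.0], Y, W), lower_approx(R[0.5], Y, W))  # {1, 2} vs {1}
```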

Moreover we have

Proposition 1. Assume that Y is an arbitrarily given subset of W. For α, β ∈ A such that α ⪰ β,

R(α)^*Y ⊆ R(β)^*Y   (4)

and

R(α)_*Y ⊇ R(β)_*Y.   (5)

The proofs of the propositions herein are given in a later section for ease of reference.

3 A polymodal system

3.1 A family of the Kripke models

Throughout this paper we use the framework of modal logic discussed in Chellas [1], except that we consider an indexed modal logic, which is becoming standard in various applications. Namely, modal operators are denoted by

[α]A: A is necessary with the label α,
⟨α⟩A: A is possible with the label α,

where the label α is in the above set A. We assume hereafter that A has finitely or countably infinitely many elements.

Now, the above set of objects is identified with the set of possible worlds in the Kripke model. Moreover, the relations R(α) used for the approximations are identified with the accessibility relations in the Kripke model.

Thus, the model is given by

M = ⟨W, R(α), P⟩, α ∈ A.

The atomic sentences denoted by ℙ₁, ℙ₂, ... are considered and P is the sequence P = ⟨P₁, P₂, ...⟩. P_i is the subset of W in which the atomic sentence ℙ_i is true.

When we consider truth or falsity of a sentence A in W, the symbol M, k ⊨ A means that A is true at the possible world k. ‖A‖ means the subset of possible worlds at which A is true, i.e.,

‖A‖ = {k ∈ W : M, k ⊨ A}.

For the modal operators, M, k ⊨ ⟨α⟩A, which means that A is possibly true at k, is defined by

M, k′ ⊨ A

for some k′ such that kR(α)k′. Similarly, M, k ⊨ [α]A, which means that A is necessarily true at k, is defined by

M, k′ ⊨ A

for every k′ such that kR(α)k′. The following proposition should be noted; its proof is immediate and omitted.

Proposition 2.

M, k ⊨ [α]A ⟺ k ∈ R(α)_*‖A‖,

M, k ⊨ ⟨α⟩A ⟺ k ∈ R(α)^*‖A‖.
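The correspondence of Proposition 2 can be made concrete with the small evaluator below (our addition; the model, names and data are invented for illustration only): the worlds satisfying [α]A and ⟨α⟩A are computed directly from the Kripke semantics and coincide with the lower and upper approximations of ‖A‖.

```python
# Evaluate the indexed modal operators over a finite Kripke model.

def sat_box(R, worlds, ext_A):
    """{k : M,k |= [alpha]A} -- every R(alpha)-successor of k lies in ||A||."""
    return {k for k in worlds if all(kk in ext_A for kk in worlds if (k, kk) in R)}

def sat_diamond(R, worlds, ext_A):
    """{k : M,k |= <alpha>A} -- some R(alpha)-successor of k lies in ||A||."""
    return {k for k in worlds if any((k, kk) in R for kk in ext_A)}

W = {1, 2, 3}
R_alpha = {(1, 1), (2, 2), (3, 3), (2, 3), (3, 2)}   # reflexive accessibility
ext_A = {2}                                          # ||A||
print(sat_box(R_alpha, W, ext_A))      # set()   = R(alpha)_* ||A||
print(sat_diamond(R_alpha, W, ext_A))  # {2, 3}  = R(alpha)^* ||A||
```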

3.2 An axiomatic system and completeness

Let us introduce an axiomatic system and consider completeness between the above model and the axiomatic system. Here we consider the following axioms:

Df[ ]: [α]A ↔ ¬⟨α⟩¬A
K:     [α](A → B) → ([α]A → [α]B)
T:     [α]A → A
Pos:   α ⪰ α′ ⟹ ⟨α⟩A → ⟨α′⟩A,

in which the assumption Pos corresponds to (1). Remark also that, in addition to the above axioms, the standard axioms of propositional logic [1] and the two inference rules

MP: from A and A → B infer B,
RN: from A infer [α]A,

are assumed. For convenience, the above axiomatic system is called SL and the class of the above models is called SL herein.

Now, the following two propositions hold. Remark that proofs of all propositions below are given in a separate section in the sequel.

Proposition 3. Let us consider another axiomatic system called SL' in which the axiom Pos is replaced by

Nec: α ⪰ α′ ⟹ [α′]A → [α]A,

but all the other axioms and inference rules are the same. Then the two axiomatic systems SL and SL' are equivalent. That is, the two sets of all theorems proved from the above two sets of axioms and the inference rules coincide.

Proposition 4. The system SL is sound and complete with respect to the class SL of the Kripke models.

Remark: Relationships between polymodal logic and rough sets have already been studied (cf., e.g., Vakarelov [10]) from a viewpoint different from ours.


4 Lattice-valued measures

4.1 Two subsets of the index set

Observation of two subsets of A is useful in discussing the possibility and necessity. Let

A_P(A; k) = {α : M, k ⊨ ⟨α⟩A}   (6)

and

A_N(A; k) = {α : M, k ⊨ [α]A}.   (7)

For the most part the index k is fixed, and we use A_P(A) and A_N(A) instead of A_P(A; k) and A_N(A; k), respectively, without confusion.

Notice that

A_P(¬A)^c = {α : M, k ⊨ ⟨α⟩¬A}^c
          = {α : NOT(M, k ⊨ ⟨α⟩¬A)}
          = {α : M, k ⊨ ¬⟨α⟩¬A}
          = {α : M, k ⊨ [α]A}.

We thus have

A_P(¬A)^c = A_N(A),

where A_P(¬A)^c is the complement of A_P(¬A). The following equations should also be noted.

A_P(A ∨ B) = {α : M, k ⊨ ⟨α⟩(A ∨ B)}
           = {α : M, k ⊨ ⟨α⟩A ∨ ⟨α⟩B}
           = {α : M, k ⊨ ⟨α⟩A OR M, k ⊨ ⟨α⟩B}
           = {α : M, k ⊨ ⟨α⟩A} ∪ {α : M, k ⊨ ⟨α⟩B}
           = A_P(A) ∪ A_P(B).

A_N(A ∧ B) = A_P(¬(A ∧ B))^c = A_P(¬A ∨ ¬B)^c = (A_P(¬A) ∪ A_P(¬B))^c = A_P(¬A)^c ∩ A_P(¬B)^c
           = A_N(A) ∩ A_N(B).

Notice that no algebraic structure on the index set A is assumed above.


4.2 Possibility measure of a sentence

The lattice A is in general not complete. When it is not complete, we consider the completion of A; hence we can assume that the lattice is complete without loss of generality. Also notice that sup and inf denote the operations of the lattice. (We do not use the symbols ∨ and ∧ as the operations of the lattice, since they should be used as OR and AND in the propositional logic.)

The possibility measure of a sentence is defined to be the supremum of the indices by which the sentence is accessible, where the accessibility is defined by the relation R(α) in the Kripke model. Thus,

Pos_k(A) = sup{α : M, k ⊨ ⟨α⟩A},   (8)

where k is a fixed parameter indicating a possible world in W. On the other hand, the necessity measure is defined by

Nec_k(A) = inf{α : M, k ⊨ [α]A}.   (9)

As noted before, the index k is not necessary in the present discussion and hence we simply write Pos(A) and Nec(A) instead of Pos_k(A) and Nec_k(A), respectively. Roughly, Pos(A) is the maximum of α at which A is possible; Nec(A) is the minimum of α at which A is necessary. It is immediate to see that

Pos(A) = sup A_P(A),   Nec(A) = inf A_N(A).

The meanings of the two measures are as follows. The index implies an environmental condition in which the model is put. The relation kR(α)k′ shows accessibility from a state represented by a possible world k to another state k′. The change of environment is expressed by the variation of the index α. When α changes to α′ (α ⪰ α′), R(α) ⊆ R(α′) means that the accessibility is more limited in the former case than in the latter. In other words, the environment indexed by α is less uncertain than that indexed by α′.
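For a finite chain of indices the suprema and infima in (8) and (9) reduce to max and min, as in the sketch below (our addition; the index values, relations and world names are invented, and the family is chosen to satisfy (1), i.e. relations shrink as the index grows).

```python
def pos(k, ext_A, R_family, worlds):
    levels = [a for a, R in R_family.items()
              if any((k, kk) in R for kk in ext_A)]                  # M,k |= <a>A
    return max(levels, default=None)                                 # sup A_P(A;k)

def nec(k, ext_A, R_family, worlds):
    levels = [a for a, R in R_family.items()
              if all(kk in ext_A for kk in worlds if (k, kk) in R)]  # M,k |= [a]A
    return min(levels, default=None)                                 # inf A_N(A;k)

worlds = {1, 2, 3}
ident = {(w, w) for w in worlds}
R_family = {0.25: ident | {(1, 2), (1, 3)},
            0.5:  ident | {(1, 2)},
            0.75: ident,
            1.0:  ident}
ext_A = {1, 2}                                # ||A||
print(pos(1, ext_A, R_family, worlds))        # 1.0
print(nec(1, ext_A, R_family, worlds))        # 0.5
```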

Proposition 5. Assume that A is a totally ordered set. Then

Nec(¬A) ⪰ Pos(A).   (10)

Moreover, there is no α ∈ A such that

Nec(¬A) ⪰ α ⪰ Pos(A),  α ≠ Pos(A),  α ≠ Nec(¬A).

4.3 Real-valued measures

When the index is the set of rational numbers in the unit interval, the initial lattice defined by the natural ordering is not complete. However, the completion of the set of rational numbers leads us to the set of real numbers. Thus, we can define real-valued possibility and necessity measures by (8) and (9), respectively, on A = [0, 1], the set of real numbers.

It appears that the above introduced measures are different from the ordinary possibility and necessity measures. On the contrary, they are closely related. To show this, we introduce two other measures by

Π(A) = Pos(A),   (11)

N(A) = 1 − Nec(A).   (12)

Then we have

Proposition 6.

Pos(A) = Nec(¬A),   (13)

Π(A) = 1 − N(¬A).   (14)

The last property holds for the ordinary possibility and necessity measures [2]. (For general lattices, however, such a duality between the necessity measure and the possibility measure is difficult to derive.)

Figure 1 shows the relationship among A_P(A), A_N(¬A), Pos(A), and Nec(¬A) when A is the unit interval. We thus have

Pos(A) = sup A_P(A) = Nec(¬A) = inf A_N(¬A).

Fig. 1. Relationship among A_P(A), A_N(¬A), and Pos(A) (= Nec(¬A)) when A = [0, 1].

4.4 Relation with ordinary possibility measure

We assume that A is the unit interval. Let us write

A = ‖A‖,   (15)

and

P(A) = Π(A),   (16)

N(A) = N(A),   (17)

where the left-hand sides are regarded as functions of the subset A = ‖A‖ of W.

The following proposition implies that the fundamental equalities of the ordinary possibility and necessity measures [2] hold for the present measures.

Proposition 7. Let A and B be sets derived from the sentences A and B, respectively. Then we have

P(A ∪ B) = max[P(A), P(B)],

N(A ∩ B) = min[N(A), N(B)].

Thus P(A) and N(A) are regarded as the ordinary possibility measure and necessity measure, respectively, when A is the unit interval. Notice that A is a subset of W, the set of possible worlds.

4.5 Possibility distribution

Unlike the case of an ordinary possibility measure, the definition of a possibility distribution from Pos(A) is not straightforward.

Namely, an ordinary possibility measure on W is defined for any A (⊆ W), from which we can define a possibility distribution

π(k) = P({k})

by putting A = {k}, for an arbitrary k ∈ W. In contrast, Pos(A) is defined from a sentence A and R(α). If only a single sentence is available, it appears that we cannot obtain a possibility distribution.

We use a simple trick of augmenting atomic sentences for possible worlds: let Q_ℓ (ℓ ∈ W) be an atomic sentence such that

‖Q_ℓ‖ = {ℓ}.

Namely, Q_ℓ serves as nothing but an identification of the possible world ℓ ∈ W. Using Q = ⟨Q_ℓ, Q_ℓ′, ...⟩, ℓ, ℓ′ ∈ W, etc., we consider an augmented model

M̂ = ⟨W, R(α), P ∪ Q⟩ (α ∈ A).

(Notice that W is at most countably infinite.) Assume a set Γ of sentences for which possibility and necessity measures should be considered. We augment Γ by adding Q_ℓ, Q_ℓ′, ....


We now can discuss

Pos(Q_ℓ) = sup{α : M̂, k ⊨ ⟨α⟩Q_ℓ}.

Since ‖Q_ℓ‖ = {ℓ}, we have

P({ℓ}) = Pos(Q_ℓ)

for every ℓ ∈ W, whereby

π(ℓ) = P({ℓ})   (18)

is defined. For a given A, assume A = ‖A‖ = {ℓ, ℓ′, ..., ℓ″}; then it is easy to observe

P(A) = P({ℓ, ℓ′, ..., ℓ″}) = sup_{ℓ∈A} Pos(Q_ℓ) = sup_{ℓ∈A} π(ℓ)

from Proposition 7.
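The augmentation trick is easy to reproduce computationally, as in the short sketch below (our addition; world names, index values and relations are invented): with ‖Q_ℓ‖ = {ℓ}, the distribution π(ℓ) = Pos_k(Q_ℓ) is obtained world by world, and P(A) is recovered as the supremum of π over A.

```python
def pos(k, ext, R_family):
    # sup{alpha : some R(alpha)-successor of k lies in ext}
    return max((a for a, R in R_family.items()
                if any((k, kk) in R for kk in ext)), default=None)

def possibility_distribution(k, worlds, R_family):
    return {l: pos(k, {l}, R_family) for l in worlds}   # pi(l) = Pos_k(Q_l)

worlds = {1, 2, 3}
ident = {(w, w) for w in worlds}
R_family = {0.25: ident | {(1, 2), (1, 3)}, 0.5: ident | {(1, 2)}, 1.0: ident}
pi = possibility_distribution(1, worlds, R_family)
print(pi)                             # {1: 1.0, 2: 0.5, 3: 0.25}
print(max(pi[l] for l in {2, 3}))     # P({2, 3}) = 0.5
```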

4.6 Fuzzy sentences

Let us introduce a fuzzy sentence A which is a parameterized family A = {A_α} (α ∈ A) such that

α ⪰ α′ ⟹ ‖A_α‖ ⊆ ‖A_{α′}‖.   (19)

Then we define

Pos(A) = sup{α : M, k ⊨ ⟨α⟩A_α},   (20)

Nec(A) = inf{α : M, k ⊨ [α]A_α}.   (21)

The fuzzy sentence can be replaced by the corresponding fuzzy set, that is, the parameterized family ‖A_α‖ (α ∈ A). Then we have a proposition that generalizes Proposition 7 when A is the unit interval.

Proposition 8. Let A and B be fuzzy sets derived from fuzzy sentences. Then,

P(A ∪ B) = max[P(A), P(B)],

N(A ∩ B) = min[N(A), N(B)].

Alternatively, we may assume that for each possible world ℓ the degree of relevance

α(ℓ) = μ_A(ℓ) ∈ A,

in other words, the membership, is given. The α-cut is

A_α = {ℓ ∈ W : μ_A(ℓ) ⪰ α},

whereby

α ⪰ α′ ⟹ A_α ⊆ A_{α′}

holds. If we introduce artificial sentences A_α by putting ‖A_α‖ = A_α, we can use the definitions (20) and (21). The latter definition, however, is less straightforward and more restrictive.


5 Proof of propositions

Proof of Proposition 1.

The conclusion immediately follows from (1); the detail is omitted.

Proof of Proposition 3.

Let α ⪰ α′ and assume the axiom Pos. Then,

1. ¬[α]¬A → ¬[α′]¬A   (Pos, Df[ ])
2. [α′]¬A → [α]¬A      (1, propositional logic)
3. [α′]A → [α]A         (replace ¬A by A).

Namely, Nec is a theorem. Assuming Nec as an axiom, we can obtain Pos as a theorem in the same manner. We omit the detail.

Proof of Proposition 4.

SOUNDNESS: It is well known that Df[ ] and K are valid in all Kripke models. Since R(α) is reflexive, T is also valid. Thus, it is sufficient to show that Nec or Pos is valid in the class SL of models M. Let k be an arbitrary possible world and assume α ⪰ α′. From (1), if kR(α)k′ then kR(α′)k′. Namely,

if M, k ⊨ ⟨α⟩A then M, k ⊨ ⟨α′⟩A.

This implies M, k ⊨ ⟨α⟩A → ⟨α′⟩A. Thus, Pos is valid.

COMPLETENESS: It is sufficient to show

SL ⊨ ⟨α⟩A → ⟨α′⟩A  ⟹  ⊢ ⟨α⟩A → ⟨α′⟩A,   (22)

where SL ⊨ A means that for all models M in the class SL and for every possible world k in M, M, k ⊨ A holds.

A proper canonical standard model (abbreviated PCSM) should be constructed, and if the PCSM satisfies (22) then the proposition is proved [1]. The procedure of constructing the PCSM is described in Chellas [1] and is omitted here. It should, however, be noticed that the assumption of a countably infinite set of indices is used in applying Lindenbaum's lemma.

In the PCSM, R(α) is defined by

kR(α)k′ ⟺ {⟨α⟩A : A ∈ k′} ⊆ k.   (23)

The right-hand side of (23) is equivalent to

if A ∈ k′ then ⟨α⟩A ∈ k.

Since Pos is an axiom,

if ⟨α⟩A ∈ k then ⟨α′⟩A ∈ k.

It follows that

if A ∈ k′ then ⟨α′⟩A ∈ k.

From these relations we have

{⟨α′⟩A : A ∈ k′} ⊆ k,

and hence kR(α′)k′. Thus, the PCSM satisfies (1), and therefore belongs to the class SL. Now it is immediate to see that (22) holds, since valid sentences in the PCSM are theorems.

Proof of Proposition 5.

Let us divide A into A_P(A) and A_P(A)^c = A_N(¬A). In the former ⟨α⟩A is true, while in the latter ⟨α⟩A is false. From (1) and the assumption of the total ordering,

β ⪰ α for all α ∈ A_P(A), β ∈ A_P(A)^c (= A_N(¬A)).

Since Pos(A) = sup A_P(A) and Nec(¬A) = inf A_N(¬A), we have (10). The latter part is obvious, since each element of A belongs to one and only one of A_P(A) and A_P(A)^c.

Proof of Proposition 6.

Since the set of rational numbers is totally ordered, Proposition 5 holds. From basic arguments concerning rational and real numbers, we have

Pos(A) = sup A_P(A) = inf A_N(¬A) = Nec(¬A).

The latter equality is obvious.

Proof of Proposition 7.

P(A ∪ B) = Pos(A ∨ B)
         = sup{α : M, k ⊨ ⟨α⟩(A ∨ B)}
         = sup{α : M, k ⊨ ⟨α⟩A ∨ ⟨α⟩B}
         = sup{α : M, k ⊨ ⟨α⟩A OR M, k ⊨ ⟨α⟩B}
         = max[sup{α : M, k ⊨ ⟨α⟩A}, sup{α : M, k ⊨ ⟨α⟩B}]
         = max[P(A), P(B)].

The latter equation can easily be proved by using (12) and the first equation. We omit the detail.


Proof of Proposition 8.

The proof is omitted, since it is a slight modification of the proof of Proposition 7.

6 Conclusion

We have uncovered the relationship between the indexed rough approximations and the generalized possibility measures on the basis of polymodal logic. The formulation herein extends the current possibility theory [12,2] and opens a new way of applying lattice-valued possibility measures. For example, probability theory and possibility theory can be used complementarily in a single application. It should also be noticed that the way of extension herein is fundamentally different from that in L-fuzzy sets [5].

The plausibility measures proposed by Friedman and Halpern [4] are also related to the present considerations; the details will be discussed elsewhere.

A nontrivial lattice that is not real-valued and is useful in applications is a set of linguistic expressions. Consider a finite collection of atomic statements representing a system environment. Connections of these statements by AND and OR form a finite lattice. If we assume that a stronger statement implies that the corresponding possibility of an event is more limited in general, Pos(A) is the strongest statement by which A is possible, and Nec(A) is the weakest statement by which A is necessary.

Future studies include investigation of theoretical properties of ordinary and fuzzy sentences as well as real applications of the lattice-valued measures. In particular, a lattice of linguistic expressions should be considered in real applications. Applications include system safety analysis without the framework of probability, and information retrieval using hierarchical categories.

Acknowledgments

The author would like to express his deepest appreciation to Dr. Tetsuya Murai and Dr. Masahiro Inuiguchi for their helpful suggestions. The present research has partly been supported by TARA (Tsukuba Advanced Research Alliance), University of Tsukuba.

References

1. Chellas, B. F. (1980) Modal Logic. Cambridge University Press.

2. Dubois, D. and Prade, H. (1988) Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum.

3. Dubois, D. and Prade, H. (1990) Rough fuzzy sets and fuzzy rough sets. Int. J. General Systems, 17, 191-209.

4. Friedman, N. and Halpern, J. Y. (1996) Plausibility measures and default reasoning. Proc. of the 13th National Conf. on Artificial Intelligence, 2, 1297-1304.

5. Goguen, J. A. (1967) L-fuzzy sets. J. of Math. Anal. and Appl., 18, 145-174.

6. MacLane, S. and Birkhoff, G. (1979) Algebra, 2nd ed. Macmillan.

7. Pawlak, Z. (1982) Rough sets. International Journal of Computer and Information Sciences, 11, 341-356.

8. Pawlak, Z. (1991) Rough Sets. Kluwer Academic Publishers, Dordrecht.

9. Popkorn, S. (1994) First Steps in Modal Logic. Cambridge University Press.

10. Vakarelov, D. (1991) A modal logic for similarity relations in Pawlak knowledge representation systems. Fundamenta Informaticae, 15, 61-79.

11. Zadeh, L. A. (1965) Fuzzy sets. Information and Control, 8, 338-353.

12. Zadeh, L. A. (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.


Granularity, Multi-valued Logic, Bayes' Theorem and Rough Sets

Zdzislaw Pawlak

Institute for Theoretical and Applied Informatics Polish Academy of Sciences ul. Baltycka 5, 44 000 Gliwice, Poland e-mail:[email protected]

Abstract. Granularity of knowledge has recently attracted the attention of many researchers. This paper concerns this issue from the rough set perspective. Granularity is inherently connected with the foundations of rough set theory. The concept of the rough set hinges on the classification of objects of interest into similarity classes, which form elementary building blocks (atoms, granules) of knowledge. These granules are employed to define the basic concepts of the theory. In the paper the basic concepts of rough set theory will be defined and their granular structure will be pointed out. Next the consequences of granularity of knowledge for reasoning about imprecise concepts will be discussed. In particular, the relationship between some ideas of Lukasiewicz's multi-valued logic, Bayes' theorem and rough sets will be pointed out.

1 Introduction

This paper is an extended version of [15].

Information (knowledge) granulation, discussed recently by Prof. Zadeh [27, 28, 29], seems to be a very important issue for computing science, logic, philosophy and others.

In this note we are going to discuss some problems connected with granularity of knowledge in the context of rough sets. First, a discussion of granulation of knowledge in connection with rough and fuzzy sets was presented by Dubois and Prade in [8]. Recently, interesting studies of information granulation in the framework of rough sets can be found in Polkowski and Skowron [16] and Skowron and Stepaniuk [20].

In rough set theory we assume that with every object some information is associated, and objects can be "seen" through the accessible information only. Hence, objects with the same information cannot be discerned and appear as the same. As a result, indiscernible objects of the universe form clusters of indistinguishable objects (granules, atoms, etc.). Thus, from the rough set view, the granularity of knowledge is due to the indiscernibility of objects caused by the lack of sufficient information about them. Consequently, granularity and indiscernibility are strictly connected, and the concept of indiscernibility seems to be prior to granularity.


The current state of rough set theory and its applications can be found in [19].

Indiscernibility has attracted the attention of philosophers for a long time; its first formulation can be attributed to Leibniz (cf. Forrest [9]) and is known as the principle of "the identity of indiscernibles". The principle says that no two objects have exactly the same properties, or in other words, if all properties of objects x and y are the same, then x and y are identical.

But what are "properties of objects"? And what does "all properties" mean? A lot of philosophical discussion has been devoted to answering these questions (cf. e.g., Black [3], Forrest [9]), but we will refrain here from philosophical debate. Let us observe only that Leibniz's approach to indiscernibility identifies indiscernibility with identity. The latter is obviously an equivalence relation, i.e., it leads to a partition of the universe into equivalence classes (granules) of objects which are indistinguishable in view of the assumed properties. Thus in the rough set approach granulation is a consequence of the Leibniz principle.

It is worthwhile to mention that indiscernibility can also be viewed in a wider context, as pointed out by Williamson [25]: "Intelligent life requires the ability to discriminate, but not with unlimited precision". This is a very interesting issue; however, it lies outside the scope of this paper.

In rough set theory we assume an empiricist approach, i.e., we suppose that properties are simply empirical data which can be obtained as a result of measurements, observations, computations, etc., and are expressed by values of a fixed, finite set of attributes; e.g., properties are attribute-value pairs, like (size, small), (color, red), etc. The idea could also be expressed in more general terms, assuming as a starting point not a set of specific attributes but an abstract equivalence relation; however, the assumed approach seems more intuitive.

The equivalence relation is the simplest formalization of the indiscernibility relation and is sufficient for many applications. However, it seems more interesting to assume that the indiscernibility relation is formalized as a tolerance relation, i.e., transitivity of indiscernibility is denied in this case, for if x is indiscernible from y and y is indiscernible from z, then x is not necessarily indiscernible from z. Many authors have proposed a tolerance relation as a basis for rough set theory (cf. e.g., Skowron and Stepaniuk [19]). This causes, however, some mathematical complications as well as philosophical questions, because it leads to vague granules, i.e., granules without sharp boundaries, closely related to the boundary-line approach to vagueness (cf. e.g., Chattebrjee [7], Sorensen [22]).

Besides, instead of a tolerance relation, more sophisticated mathematical models of indiscernibility have also been proposed as a basis for rough set theory (cf. e.g., Krawiec, Slowinski, and Vanderpooten [11], Yao and Wong [26], Ziarko [30]). Interested readers are advised to consult the references mentioned above, but for the sake of simplicity we will adhere in this paper to the equivalence relation as a mathematical formalization of the indiscernibility relation.

Since granules of knowledge can be considered as the basic building blocks of knowledge about the universe, it seems that a natural mathematical model for granulated knowledge can be based on ideas similar to those used in mereology proposed by Lesniewski [12], in which "part of" is the basic relation of the theory. Axioms of mereology, in particular in the version proposed by Suppes [23], seem to be a natural candidate for this purpose. Moreover, rough mereology, an extension of classical mereology proposed by Polkowski and Skowron in [17, 18], seems to be exceptionally well suited to analyzing granules of knowledge without sharp boundaries (cf. Polkowski and Skowron [16], Skowron and Stepaniuk [20]).

It is also worthwhile to mention in this context that granularity of knowledge has also been pursued in quantum physics. Its relation to fuzzy sets and rough sets was first mentioned by Cattaneo [5, 6].

Recently a very interesting study of rough sets, granularity and foundations of mathematics and physics has been done by Apostoli and Kanda [2].

Besides, it is also interesting to observe that computations and measurements are very good examples of granularity of information, for they are in fact based not on real numbers but on intervals, determined by the accuracy of the computation or measurement.

2 Basic Philosophy of Rough Sets

The rough set philosophy is founded on the assumption that with every object of the universe of discourse we associate some information (data, knowledge). E.g., if objects are patients suffering from a certain disease, symptoms of the disease form information about the patients. Objects characterized by the same information are indiscernible (similar) in view of the available information about them. The indiscernibility relation generated in this way is the mathematical basis of rough set theory.

Any set of all indiscernible (similar) objects is called an elementary concept, and forms a basic granule (atom) of knowledge about the universe. Any union of some elementary concepts is referred to as a crisp (precise) concept; otherwise the set is rough (imprecise, vague).

Consequently, each rough concept has boundary-line cases, i.e., objects which cannot be classified with certainty either as members of the concept or of its complement. Obviously, crisp concepts have no boundary-line elements at all. That means that boundary-line cases cannot be properly classified by employing the available knowledge.

Thus, the assumption that objects can be "seen" only through the information available about them leads to the view that knowledge has a granular structure. As a consequence, vague concepts, in contrast to precise concepts, cannot be characterized in terms of elementary concepts. Therefore in the proposed approach we assume that any vague concept is replaced by a pair of precise concepts, called the lower and the upper approximation of the vague concept. The lower approximation consists of all elementary concepts which surely are included in the concept, and the upper approximation contains all elementary concepts which possibly are included in the concept. Obviously, the difference between the upper and the lower approximation constitutes the boundary region of the vague concept. Approximations are the two basic operations of rough set theory.

3 Indiscernibility and Granularity

As mentioned in the introduction, the starting point of rough set theory is the indiscernibility relation, generated by information about objects of interest. The indiscernibility relation is intended to express the fact that due to the lack of knowledge we are unable to discern some objects employing the available information. That means that, in general, we are unable to deal with single objects but we have to consider clusters of indiscernible objects, as fundamental concepts of knowledge.

Now we present the above considerations more formally.

Suppose we are given two finite, non-empty sets U and A, where U is the universe and A is a set of attributes. With every attribute a ∈ A we associate a set V_a of its values, called the domain of a. The pair S = (U, A) will be called an information system. Any subset B of A determines a binary relation I_B on U, which will be called an indiscernibility relation, and is defined as follows:

x I_B y if and only if a(x) = a(y) for every a ∈ B,

where a(x) denotes the value of attribute a for element x.

Obviously I_B is an equivalence relation. The family of all equivalence classes of I_B, i.e., the partition determined by B, will be denoted by U/I_B, or simply U/B; an equivalence class of I_B, i.e., the block of the partition U/B containing x, will be denoted by B(x).

If (x, y) belongs to I_B we will say that x and y are B-indiscernible. Equivalence classes of the relation I_B (or blocks of the partition U/B) are referred to as B-elementary concepts or B-granules.

In the rough set approach the elementary concepts are the basic building blocks (concepts) of our knowledge about reality.
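The B-granules of a small information system can be computed directly from the definition of I_B, as in the sketch below (our addition; the table, attribute names and values are invented for illustration).

```python
# Group the objects of S = (U, A) into B-elementary granules by their
# value vectors on the attribute subset B.
from collections import defaultdict

def granules(table, B):
    """Return the partition U/B induced by the attribute subset B."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in B)].add(obj)
    return list(blocks.values())

# U = {1,...,4}; attributes: size, color.
table = {1: {'size': 'small', 'color': 'red'},
         2: {'size': 'small', 'color': 'red'},
         3: {'size': 'small', 'color': 'blue'},
         4: {'size': 'big',   'color': 'blue'}}
print(granules(table, ['size']))            # [{1, 2, 3}, {4}]
print(granules(table, ['size', 'color']))   # [{1, 2}, {3}, {4}]
```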

4 Approximations and Granularity

Now the indiscernibility relation will be used to define basic operations in rough set theory, which are defined as follows:

B_*(X) = ⋃_{x∈U} {B(x) : B(x) ⊆ X},

B^*(X) = ⋃_{x∈U} {B(x) : B(x) ∩ X ≠ ∅},

assigning to every X ⊆ U two sets B_*(X) and B^*(X), called the B-lower and the B-upper approximation of X, respectively.

Hence, the B-lower approximation of a concept is the union of all B-granules that are included in the concept, whereas the B-upper approximation of a concept is the union of all B-granules that have a nonempty intersection with the concept. The set

BN_B(X) = B^*(X) − B_*(X)

will be referred to as the B-boundary region of X.

If the boundary region of X is the empty set, i.e., BN_B(X) = ∅, then X is crisp (exact) with respect to B; in the opposite case, i.e., if BN_B(X) ≠ ∅, X is referred to as rough (inexact) with respect to B.

Rough sets can also be defined using a rough membership function, defined as

μ_X^B(x) = card(B(x) ∩ X) / card(B(x)).

Obviously μ_X^B(x) ∈ [0, 1].

The value of the membership function μ_X^B(x) is a kind of conditional probability, and can be interpreted as a degree of certainty to which x belongs to X (or 1 − μ_X^B(x), as a degree of uncertainty).

The rough membership function can be used to define approximations and the boundary region of a set, as shown below:

B_*(X) = {x ∈ U : μ_X^B(x) = 1},

B^*(X) = {x ∈ U : μ_X^B(x) > 0},

BN_B(X) = {x ∈ U : 0 < μ_X^B(x) < 1}.
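The three displayed equalities translate directly into the sketch below (our addition; the block list and the set X are invented): the approximations and the boundary region are read off from the values of the rough membership function.

```python
def mu(blocks, X, x):
    block = next(b for b in blocks if x in b)      # B(x)
    return len(block & set(X)) / len(block)        # mu_X^B(x)

def regions(blocks, X):
    U = set().union(*blocks)
    lower = {x for x in U if mu(blocks, X, x) == 1}           # B_*(X)
    upper = {x for x in U if mu(blocks, X, x) > 0}            # B^*(X)
    boundary = {x for x in U if 0 < mu(blocks, X, x) < 1}     # BN_B(X)
    return lower, upper, boundary

blocks = [{1, 2}, {3}, {4, 5}]
print(regions(blocks, {1, 3, 4}))   # ({3}, {1, 2, 3, 4, 5}, {1, 2, 4, 5})
```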

The rough membership function can be generalized as follows (cf. Polkowski and Skowron [17]):

μ(X, Y) = card(X ∩ Y) / card(X),

where X, Y ⊆ U, X ≠ ∅, and μ(∅, Y) = 1. The function μ(X, Y) is an example of a rough inclusion [14] and expresses the degree to which X is included in Y. Obviously, if μ(X, Y) = 1, then X ⊆ Y. If X is included in Y in a degree k, we will write X ⊆_k Y.

The rough inclusion function can be interpreted as a generalization of the mereological relation "part of", and reads as "part in a degree".

Employing now the rough inclusion function, we can represent the approximations in a uniform way:

B_*(X) = ⋃_{x∈U} {B(x) : μ(B(x), X) = 1},

B^*(X) = ⋃_{x∈U} {B(x) : μ(B(x), X) > 0}.

Hence, the B-lower approximation of X consists of all B-granules included in X, whereas the B-upper approximation of X consists of all B-granules roughly included in X.

In this way approximations reveal the granular structure of complex concepts. Thus granularity of knowledge is inherently incorporated in the foundations of rough set theory.

5 Dependencies and Granularity

Another important issue in data analysis is discovering dependencies between attributes. Intuitively, a set of attributes D depends totally on a set of attributes C, denoted C ⇒ D, if all values of attributes from D are uniquely determined by values of attributes from C. In other words, D depends totally on C if there exists a functional dependency between values of D and C.

We would need also a more general concept of dependency, called a partial dependency of attributes. Intuitively, the partial dependency means that only some values of D are determined by values of C.

Formally dependency can be defined in the following way. Let D and C be subsets of A.

We will say that D depends on C in a degree k (0 ≤ k ≤ 1), denoted C ⇒_k D, if

k = γ(C, D) = card(POS_C(D)) / card(U),

where

POS_C(D) = ⋃_{X∈U/D} C_*(X),

called the positive region of the partition U/D with respect to C, is the set of all elements of U that can be uniquely classified to blocks of the partition U/D by means of C. Obviously,

γ(C, D) = Σ_{X∈U/D} card(C_*(X)) / card(U).

If k = 1 we say that D depends totally on C, and if k < 1, we say that D depends partially (in a degree k) on C.
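The sketch below (our addition; the toy decision table and attribute names are invented) computes γ(C, D) directly from the definition: the positive region collects the C-lower approximations of the blocks of U/D.

```python
# Degree of dependency gamma(C, D) for a small information system.
from collections import defaultdict

def partition(table, attrs):
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())

def gamma(table, C, D):
    C_blocks, D_blocks = partition(table, C), partition(table, D)
    pos = set()
    for X in D_blocks:                                        # POS_C(D)
        pos |= set().union(*[b for b in C_blocks if b <= X])
    return len(pos) / len(table)

table = {1: {'a': 0, 'd': 'yes'}, 2: {'a': 0, 'd': 'no'},
         3: {'a': 1, 'd': 'yes'}, 4: {'a': 2, 'd': 'no'}}
print(gamma(table, ['a'], ['d']))   # 0.5 (only objects 3 and 4 classify uniquely)
```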


The coefficient k expresses the ratio of all elements of the universe which can be properly classified to blocks of the partition U/D employing attributes from C, and will be called the degree of the dependency.

Obviously, if D depends totally on C then I_C ⊆ I_D. That means that the partition generated by C is finer than the partition generated by D.

The degree of dependency expresses to what extent the granularity imposed by the set of attributes D can be expressed in terms of elementary concepts associated with C.

The function γ(C, D) can be regarded as a generalization of the rough inclusion function μ(X, Y), for it expresses to what degree the partition generated by C, i.e., U/C, is included in the partition generated by D, i.e., U/D.

In other words, the degree of dependency between C and D reveals to what degree the granular structure imposed by D can be expressed in terms of the granular structure associated with C.

In fact approximations and dependencies are different sides of the same coin, and exhibit a relationship between two kinds of granular structures.

6 Decision Rules

With every dependency C ⇒_k D we can associate a set of decision rules, specifying decisions that should be taken when certain conditions are satisfied.

To express this idea more precisely we need a formal language associated with any information system S = (U, A). The language is defined in a standard way and we omit the detailed definition here, assuming that the reader is familiar with the construction (cf. Pawlak [15]).

By Φ, Ψ, etc. we will denote logical formulas built from attributes, attribute values and logical connectives (and, or, not) in a standard way. We will denote by ‖Φ‖_S the set of all objects x ∈ U satisfying Φ and refer to it as the meaning of Φ in S.

The expression π_S(Φ) = card(‖Φ‖_S) / card(U) will denote the probability that the formula Φ is true in S.

A decision rule is an expression of the form "if ... then ...", written Φ → Ψ; Φ and Ψ are referred to as the condition and the decision of the rule, respectively.

The number supp_S(Φ, Ψ) = card(‖Φ ∧ Ψ‖_S) will be called the support of the decision rule Φ → Ψ in S, and the number

σ_S(Φ, Ψ) = supp_S(Φ, Ψ) / card(U)

will be referred to as the strength of the decision rule Φ → Ψ in S. If supp_S(Φ, Ψ) ≠ 0 then the decision rule Φ → Ψ will be called admissible in S. In what follows we will consider admissible decision rules only.

A decision rule Φ → Ψ is true in a degree l (0 ≤ l ≤ 1) in S if ‖Φ‖_S ⊆_l ‖Ψ‖_S.


With every decision rule Φ → Ψ we associate a certainty factor

π_S(Ψ|Φ) = card(‖Φ ∧ Ψ‖_S) / card(‖Φ‖_S),

which is the conditional probability that Ψ is true in S given that Φ is true in S with the probability π_S(Φ).

The certainty factor of a decision rule can be understood as the degree of truth of the decision rule, or as the degree of inclusion of the conditions in the decisions of the decision rule. Besides, we will also need a coverage factor [24]

π_S(Φ|Ψ) = card(‖Φ ∧ Ψ‖_S) / card(‖Ψ‖_S),

which is the conditional probability that Φ is true in S given that Ψ is true in S with the probability π_S(Ψ).

The coverage factor of a decision rule can be interpreted as the degree of truth of the inverse decision rule, or as the degree of the corresponding inclusion.
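All four quantities of this section can be computed from the meanings of Φ and Ψ alone, as in the sketch below (our addition; the universe and the two extensions are invented).

```python
# Support, strength, certainty and coverage of a rule Phi -> Psi,
# with Phi and Psi given by their meanings ||Phi||_S and ||Psi||_S.

def rule_factors(U, ext_phi, ext_psi):
    both = ext_phi & ext_psi                       # ||Phi and Psi||_S
    support = len(both)
    strength = support / len(U)                    # sigma_S(Phi, Psi)
    certainty = support / len(ext_phi)             # pi_S(Psi | Phi)
    coverage = support / len(ext_psi)              # pi_S(Phi | Psi)
    return support, strength, certainty, coverage

U = set(range(1, 11))
ext_phi = {1, 2, 3, 4}          # objects satisfying the condition Phi
ext_psi = {3, 4, 5, 6, 7}       # objects satisfying the decision Psi
print(rule_factors(U, ext_phi, ext_psi))   # (2, 0.2, 0.5, 0.4)
```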

7 Properties of Certainty and Coverage Factors

Let Φ → Ψ be a decision rule admissible in S. By C(Ψ) we denote the set of all conditions of Ψ, such that if Φ′ ∈ C(Ψ) then Φ′ → Ψ is admissible in S, and by D(Φ) we mean the set of all decisions of Φ such that if Ψ′ ∈ D(Φ) then Φ → Ψ′ is admissible in S. Moreover, we assume that all conditions in C(Ψ) and all decisions in D(Φ) are pairwise mutually exclusive, i.e., if Φ′, Φ ∈ C(Ψ) then ‖Φ′ ∧ Φ‖_S = ∅, and if Ψ′, Ψ ∈ D(Φ) then ‖Ψ′ ∧ Ψ‖_S = ∅. Then the following properties hold:

Σ_{Φ′∈C(Ψ)} π_S(Φ′|Ψ) = 1,   (1)

Σ_{Ψ′∈D(Φ)} π_S(Ψ′|Φ) = 1,   (2)

π_S(Ψ) = Σ_{Φ′∈C(Ψ)} π_S(Ψ|Φ′) · π_S(Φ′) = Σ_{Φ′∈C(Ψ)} σ_S(Φ′, Ψ),   (3)

π_S(Φ) = Σ_{Ψ′∈D(Φ)} π_S(Φ|Ψ′) · π_S(Ψ′) = Σ_{Ψ′∈D(Φ)} σ_S(Ψ′, Φ),   (4)

π_S(Φ|Ψ) = π_S(Ψ|Φ) · π_S(Φ) / Σ_{Φ′∈C(Ψ)} π_S(Ψ|Φ′) · π_S(Φ′) = σ_S(Φ, Ψ) / π_S(Ψ),   (5)

π_S(Ψ|Φ) = π_S(Φ|Ψ) · π_S(Ψ) / Σ_{Ψ′∈D(Φ)} π_S(Φ|Ψ′) · π_S(Ψ′) = σ_S(Φ, Ψ) / π_S(Φ).   (6)


Formulas 3) and 4) are the total probability theorems, whereas formulas 5) and 6) are the Bayes' theorems. The relationship between truth of implica­tions and the Bayes' theorem first was observed by Lukasiewicz [4, 13] (see also [1]). The meaning of Bayes' theorem in this case differs from that postu­lated in statistical inference, where we assume that prior probability about some parameters without knowledge about the data is given. The posterior probability is computed next, which tells us what can be said about prior probability in view of the data.

In the rough set approach the meaning of Bayes' theorem is unlike. It reveals some relationships between decision rules, without referring to prior and posterior probabilities. Instead, the proposed approach connects the total probability theorem and the Bayes, theorem with the strength of decision rules, giving a very simple way of computing the certainty and the coverage factors.

Thus, the proposed approach can be seen as a new model for Bayes' theorem, which offers a new approach to data analysis; in particular, it allows decision rules to be inverted and their certainty factors to be computed, which can be used to explain decisions in terms of conditions.

8 Rough Modus Ponens and Rough Modus Tollens

The above considerations can be seen as a generalization of modus ponens and modus tollens inference rules.

The modus ponens inference rule says that:

if Φ → Ψ is true and Φ is true then Ψ is true.

This rule can be generalized as rough modus ponens as follows. For any Φ → Ψ we have

if Φ → Ψ is true with the probability π_S(Ψ|Φ)

and Φ is true with the probability π_S(Φ)

then Ψ is true with the probability \pi_S(\Psi) = \sum_{\Phi' \in C(\Psi)} \pi_S(\Psi \mid \Phi') \cdot \pi_S(\Phi') = \sum_{\Phi' \in C(\Psi)} \sigma_S(\Phi', \Psi).

Similarly, the modus tollens inference rule

if Φ → Ψ is true and ¬Ψ is true then ¬Φ is true


can be generalized as rough modus tollens as follows. For any Φ → Ψ we have

if Φ → Ψ is true with the probability π_S(Ψ|Φ)

and Ψ is true with the probability π_S(Ψ)

then Φ is true with the probability \pi_S(\Phi) = \sum_{\Psi' \in D(\Phi)} \pi_S(\Phi \mid \Psi') \cdot \pi_S(\Psi') = \sum_{\Psi' \in D(\Phi)} \sigma_S(\Psi', \Phi).

Due to Bayes' theorem (5) and the symmetry of the strength of decision rules we get

\pi_S(\Phi) = \sum_{\Psi' \in D(\Phi)} \sigma_S(\Phi, \Psi').

The generalizations of both inference rules consist in replacing the logical values of truth and falsehood with their probabilities, in accordance with the total probability theorems (3), (4) and Bayes' theorems (5), (6).
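As a purely hypothetical numerical illustration (the figures are invented, not taken from any data set), suppose C(Ψ) = {Φ₁, Φ₂} with π_S(Ψ|Φ₁) = 0.9, π_S(Φ₁) = 0.4, π_S(Ψ|Φ₂) = 0.25 and π_S(Φ₂) = 0.6. Rough modus ponens, i.e. the total probability theorem (3), and Bayes' theorem (5) then give

\pi_S(\Psi) = 0.9 \cdot 0.4 + 0.25 \cdot 0.6 = 0.51, \qquad \pi_S(\Phi_1 \mid \Psi) = \frac{\sigma_S(\Phi_1, \Psi)}{\pi_S(\Psi)} = \frac{0.36}{0.51} \approx 0.71,

so the inverse rule Ψ → Φ₁ and its certainty can be read off directly from the strengths, without referring to prior and posterior probabilities.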

9 Conclusions

Granularity of knowledge, information, measurements, computations, etc., seems to be an intrinsic feature of our thinking and can be considered a manifestation of an old antinomy associated with the continuous-discrete paradigm.

Rough set philosophy hinges on the granularity of data, which is used to build up all its basic concepts, like approximations, dependencies, reduction etc.

Particularly interesting in this approach seems to be the relationship between partial truth, rough mereology, Lukasiewicz's many-valued logic and Bayes' theorem.

Acknowledgments

Thanks are due to Professor Andrzej Skowron for his critical remarks and helpful comments.

References

1. Adams E. W. (1975) The Logic of Conditionals, an Application of Probability to Deductive Logic. D. Reidel Publishing Company, Dordrecht, Boston

2. P. Apostoli, A. Kanda, Parts of the continuum: towards a modern ontology of science, (to appear), 1999

3. Black M. (1952) The Identity of Indiscernibles. Mind, 61

4. Borkowski L. (Ed.) (1970) Jan Lukasiewicz - Selected Works. North Holland Publishing Company, Amsterdam, London, Polish Scientific Publishers, Warszawa


5. Cattaneo G. (1993) Fuzzy Quantum Logic: The Logic of Unsharp Quantum Mechanics. Int. Journal of Theoretical Physics 32: 1709-1734

6. Cattaneo G. (1996) Mathematical Foundations of Roughness and Fuzziness. In: Tsumoto S. et al. (Eds.) The Fourth International Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, Proceedings, The University of Tokyo, 241-247

7. Chatterjee A. (1994) Understanding Vagueness. Pragati Publications, Delhi

8. Dubois D., Prade H. (1999) Foreword. In: Pawlak Z. Rough Sets - Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, Boston, London

9. Forrest P. (1997) Identity of Indiscernibles. Stanford Encyclopedia of Philosophy

10. French S. (1998) Quantum Physics and the Identity of Indiscernibles. British Journal of the Philosophy of Sciences 39

11. Krawiec K., Slowinski R., Vanderpooten D. (1996) Construction of Rough Clas­sifiers Based on Application of Similarity Relation. In: Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, November 6-8, Tokyo, Japan, 23-30

12. Lesniewski S. (1992) Foundations of the General Theory of Sets. In: Surma, Srzednicki, Barnett, Rickey (Eds.) Stanislaw Lesniewski Collected Works, Kluwer Academic Publishers, Dordrecht, Boston, London, 128-173

13. Lukasiewicz J. (1913) Die logischen Grundlagen der Wahrscheinlichkeitsrechnung, Krakow

14. Parker-Rhodes A. F. (1981) The Theory of Indistinguishables. D. Reidel Pub­lishing Company, Dordrecht, Boston, London

15. Pawlak, Z. (1998) Granurality of Knowledge, Indiscernibility and Rough Sets. In: IEEE International Conference on Granulationary Computing, May 5-9, An­chorage, Alaska, 100-103

16. Polkowski L., Skowron A. (1997) Towards Adaptative Calculus of Granules. Manuscript

17. Polkowski L., Skowron A. (1994) Rough Mereology. In: Proc. of the Sympho­sium on Methodologies for Intelligent Systems, Charlotte, N.C., Lecture Notes in Artificial Intelligence 869, Springer Verlag, Berlin, 85-94

18. Polkowski L., Skowron A. (1996) Rough Mereology: A new Paradigm for Ap­proximate Reasoning. Journ. of Approximate Reasoning 15(4):333-365

19. Polkowski L., Skowron A. (Eds.) (1998) Rough Sets in Knowledge Discovery. Physica-Verlag Vol. 1, 2

20. Skowron A., Stepaniuk J. (1997) Information Granulation - a Rough Set Ap­proach. Manuscrript

21. Skowron A., Stepaniuk J. (1996) Tolerance Approximation Spaces. Fundamenta Informaticae 27:245-253

22. Sorensen R. (1997) Vagueness. Stanford Encyclopedia of Philosophy

23. Suppes P. (1972) Some Open Problems in the Philosophy of Space and Time. Synthese 24:298-316

24. Tsumoto S. (1998) Modelling Medical Diagnostic Rules Based on Rough Sets. In: Polkowski L., Skowron A. (Eds.) Rough Sets and Current Trends in Computing, First International Conference, RSCTC'98, Warsaw, Poland, June, Proceedings, Lecture Notes in Artificial Intelligence 1424, Springer, 475-482

25. Williamson T. (1990) Identity and Discrimination. Blackwell


26. Yao Y.Y., Wong S.K.M. (1995) Generalization of Rough Sets using Relation­ships between Attribute Values. In: Proceedings of the Second Annual Joint Conference on Information Sciences, Wrightsville Beach, N.C. USA, September 28 - October 1, 245-253

27. Zadeh L. (1994) Fuzzy Graphs, Rough Sets and Information Granularity. In: Proc. Third Int. Workshop on Rough Sets and Soft Computing, Nov. 10-12, San Jose

28. Zadeh L. (1996) The Key Rules of Information Granulation and Fuzzy Logic in Human Reasoning, Concept Formulation and Computing with Words. In: Proc. FUZZ-96: Fifth IEEE International Conference on Fuzzy Systems, September 8-11, New Orleans

29. Zadeh L. (1996) Information Granulation, Fuzzy Logic and Rough Sets. In: Proc. of the Fourth Int. Workshop on Rough Sets, and Machine Discovery, November 6-8, Tokyo

30. Ziarko W. (1993) Variable Precison Rough Set Model. Journal of Computer and System Sciences 46/1:39-59


The Generic Rough Set Inductive Logic Programming (gRS-ILP) Model

Arul Siromoney¹ and Katsushi Inoue²

¹ School of Computer Science and Engineering, Anna University, Chennai - 600 025, India. asiro@vsnl.com (Contact address)

² Department of Computer Science and Systems Engineering, Faculty of Engineering, Yamaguchi University, Ube 755-8611, Japan. inoue@csse.yamaguchi-u.ac.jp

Abstract. The example semantics of Inductive Logic Programming (ILP) systems is said to be in a rough setting when the consistency and completeness criteria cannot both be fulfilled together, because the evidence, background knowledge and declarative bias are such that any induced hypothesis cannot distinguish between some of the positive and negative examples. The gRS-ILP model (generic Rough Set Inductive Logic Programming model) provides a theoretical foundation in this rough setting for an ILP system to induce hypotheses that are used to say that an example is definitely positive, or definitely negative. An illustrative example using Progol is presented. Results are presented of GOLEM experiments using the data set for drug design for Alzheimer's disease and other experiments using Progol on mutagenesis data and transmembrane domain data.

1 Introduction

Inductive Logic Programming [1,2] (in the example semantics) uses back­ground knowledge definite clauses, and positive and negative example ground facts to induce a logic program that describes the examples, where the in­duced logic program consists of the original background knowledge along with an induced hypothesis (as definite clauses). Consider a hypothesis space that does not contain the example ground facts. The rough setting describes the situation where the background knowledge, declarative bias and evidence are such that it is not possible to induce any logic program that is both consis­tent and complete. Any induced logic program will not be able to distinguish between certain positive and negative examples. It will either cover both the positive and the negative examples in the group (it is then not consistent), or not cover the group at all, with both the positive and the negative examples in this group being left out (it is then not complete).

Rough set theory [3,4] deals with such a situation. The basic notions of rough set theory are used to establish the theoretical foundation of a generic Rough Set Inductive Logic Programming (gRS-ILP) model [5-7]. Two types of hypotheses are then defined to handle the rough setting. One is a


hypothesis that is consistent (but not complete). The other is a hypothesis that is complete (but not consistent). When the consistent hypothesis is true, the example is definitely positive. When the complete hypothesis is not true, the example is definitely negative.

All the existing ILP systems (in the example semantics) can handle a rough setting by putting more knowledge into the background knowledge and thereby making the setting crisp. Most existing systems handle the rough setting by using various techniques to induce a hypothesis that describes the evidence as well as possible. They aim to maximize the correct cover of the induced hypothesis by maximizing the number of positive examples covered and negative examples not covered. This means that most of the positive evidence would be described, along with some of the negative evidence. The induced hypothesis cannot definitively describe an example.

Consistency and completeness in the gRS-ILP model work in the rough setting by acting as a complement to existing example semantic ILP sys­tems. The ILP system still induces its best hypothesis and uses it. However, additional hypotheses are also induced and used. One such hypothesis is con­sistent, and if true for an example, says that the example is definitely positive. Another hypothesis is complete, and if not true for an example, says that the example is definitely negative.

An illustrative example is discussed by modifying the evidence in Michal­ski's East-West problem to create a rough setting, and using the ILP system Progol to illustrate how the gRS-ILP model applies to an existing ILP system in a rough setting of the example semantics. It is interesting to note that ab­solutely no change is required to the Progol executable system. The gRS-ILP model is applied to Progol by utilizing its user definable settings alone.

The results of experiments using the ILP system GOLEM are presented. The dataset is that used in earlier GOLEM experiments on drug design for Alzheimer's disease. The aim of the earlier experiments using these datasets was to maximize the correct cover (number of positives covered and negatives not covered). These earlier experiments did not yield any hypothesis that is both consistent and complete. Now, consistent hypotheses are induced, and these are useful in indicating definitely whether an example is positive or not. The results of experiments using the ILP system Progol on the mutagenesis dataset and on transmembrane domain data are also reported.

2 Inductive logic programming

Inductive Logic Programming [1] is the research area formed at the inter­section of logic programming and machine learning. The semantics of ILP systems are discussed in [2]. In ILP systems, background (prior) knowledge B and evidence E (consisting of positive evidence E+ and negative evidence E-) are given, and the aim is then to find a hypothesis H such that certain conditions are fulfilled.


In the normal semantics, the background knowledge, evidence and hy­pothesis can be any well-formed logical formula. The conditions that are to be fulfilled by an ILP system in the normal semantics are

Prior Satisfiability: B "E- ~ 0 Posterior Satisfiability: B " H "E- ~ 0 Prior Necessity: B ~ E+ Posterior Sufficiency: B "H F E+

However, the definite semantics, which can be considered as a special case of the normal semantics, restricts the background knowledge and hypothesis to being definite clauses. This is simpler than the general setting of normal semantics, since a definite clause theory T has a unique minimal Herbrand model M+(T), and any logical formula is either true or false in the minimal model. The conditions that are to be fulfilled by an ILP system in the definite semantics are

Prior Satisfiability: all e ∈ E⁻ are false in M⁺(B)
Posterior Satisfiability: all e ∈ E⁻ are false in M⁺(B ∧ H)
Prior Necessity: some e ∈ E⁺ are false in M⁺(B)
Posterior Sufficiency: all e ∈ E⁺ are true in M⁺(B ∧ H)

The Sufficiency criterion is also known as completeness with respect to positive evidence and the Posterior Satisfiability criterion is also known as consistency with the negative evidence.

The special case of definite semantics, where evidence is restricted to true and false ground facts (examples), is called the example setting. The example setting is thus the normal semantics with B and H as definite clauses and E as a set of ground unit clauses. The example setting is the main setting of ILP employed by the large majority of ILP systems.

3 Rough set theory

The basic notions of rough set theory are defined in [3], and in [4], which is an excellent reference for the fundamentals of rough set theory.

Let U be a certain set called the universe, and let R be an equivalence relation on U. The pair A = (U, R) is called an approximation space. R is called an indiscernibility relation. If x, y ∈ U and (x, y) ∈ R, we say that x and y are indistinguishable in A.

Equivalence classes of the relation R are called elementary sets, and every finite union of elementary sets is called a composed set.

Let X be a certain subset of U. The greatest composed set contained in X is the best lower approximation (or lower approximation) of X, denoted R̲(X), i.e., R̲(X) = ∪{[x]_R : [x]_R ⊆ X}, where for each x ∈ U, [x]_R = {y ∈ U | (x, y) ∈ R}. R̲(X) is also known as the R-positive region of X (Pos_R(X)).


The lower approximation is the collection of those elements that can be classified with full certainty as members of the set X using R. In other words, elements of Pos_R(X) surely belong to X.

The least composed set containing X is the best upper approximation (or upper approximation) of X, denoted R̄(X), i.e., R̄(X) = ∪{[x]_R : [x]_R ∩ X ≠ ∅}. The upper approximation of X consists of elements that could possibly belong to X. In other words, R does not allow us to exclude the possibility that they may belong to X.

The R-negative region is the complement of the upper approximation with respect to the universe U (Neg_R(X) = U − R̄(X)). The R-negative region is the collection of elements that can be classified without any ambiguity using R as not belonging to the set X. In other words, elements of Neg_R(X) surely belong to the complement of X; that is, elements of the R-negative region surely do not belong to X.

The R-borderline region of X (or boundary of X), Bnd_R(X) = R̄(X) − R̲(X), is the undecidable area of the universe. None of the elements in the boundary region can be classified with certainty into X or U − X using R.

If the R-borderline region of X is empty, X is crisp in R (or X is precise in R); and otherwise, if the set X has some non-empty R-borderline region, X is rough in R (or X is vague in R).
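As a generic illustration (this Python sketch is mine, not code from the chapter), the four regions can be computed directly once the equivalence classes of R are available:

def regions(classes, X):
    # classes: list of sets forming the partition of U into elementary sets; X: subset of U
    X = set(X)
    lower = set().union(*[c for c in classes if c <= X])   # R-lower approximation of X
    upper = set().union(*[c for c in classes if c & X])    # R-upper approximation of X
    U = set().union(*classes)
    return lower, upper, upper - lower, U - upper          # lower, upper, boundary, negative

# Hypothetical example: U = {1,...,6}, elementary sets {1,2}, {3,4}, {5,6}, X = {1,2,3}.
low, up, bnd, neg = regions([{1, 2}, {3, 4}, {5, 6}], {1, 2, 3})
print(low, up, bnd, neg)    # {1, 2}  {1, 2, 3, 4}  {3, 4}  {5, 6}

Here X is rough in R, since its boundary region {3, 4} is non-empty.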

4 Theoretical foundations of gRS-ILP

The rough setting in Inductive Logic Programming (example semantics) is when the Posterior Sufficiency criterion (completeness) with respect to posi­tive evidence and the Posterior Satisfiability criterion (consistency) with the negative evidence cannot be fulfilled together. Consider a hypothesis space that does not contain the example ground facts. The rough setting describes the situation where the background knowledge, declarative bias and evidence are such that it is not possible to induce any logic program that is both consistent and complete. Any induced logic program will not be able to dis­tinguish between certain positive and negative examples. It will either cover both the positive as well as the negative examples in the group (it is then not consistent), or not cover the group at all, with both the positive and the negative examples in this group being left out (it is then not complete) [5]. The following formal definitions are presented in [6,7].

4.1 The RSILP system

We first formally define the ILP system in the example setting of [2] as follows.

Definition 1. An ILP system in the example setting is a tuple S_es = (E_es, B), where (1) E_es = E⁺_es ∪ E⁻_es is the universe, where E⁺_es is the set of positive examples


(true ground facts) and E⁻_es is the set of negative examples (false ground facts), and (2) B is background knowledge given as definite clauses, such that (i) for all e⁻ ∈ E⁻_es, B ⊬ e⁻, and (ii) for some e⁺ ∈ E⁺_es, B ⊬ e⁺. □

Let S_es = (E_es, B) be an ILP system in the example setting. Then let H(S_es) (also written as H(E_es, B)) denote the set of all possible definite clause hypotheses that can be induced from E_es and B, and be called the hypothesis space induced from S_es (or from E_es and B). Further, let P(S_es) (also written as P(E_es, B)) = {P = B ∧ H | H ∈ H(E_es, B)} denote the set of all the programs induced from E_es and B, and be called the program space induced from S_es (or from E_es and B).

Our aim is to find a program P ∈ P(S_es) such that the next two conditions hold: (iii) for all e⁻ ∈ E⁻_es, P ⊬ e⁻, (iv) for all e⁺ ∈ E⁺_es, P ⊢ e⁺.

The following definitions of Rough Set ILP systems in the gRS-ILP model (abbreviated as RSILP systems) use the terminology of [2].

Definition 2. An RSILP system in the example setting (abbreviated as RSILP-E system) is an ILP system in the example setting, S_es = (E_es, B), such that there does not exist a program P ∈ P(S_es) satisfying both the conditions (iii) and (iv) above. □

Definition 3. An RSILP-E system in the single-predicate learning context (abbreviated as RSILP-ES system) is an RSILP-E system, whose universe E is such that all examples (ground facts) in E use only one predicate, also known as the target predicate. 0

A declarative bias [2] biases or restricts the set of acceptable hypotheses, and is of two kinds: syntactic bias (also called language bias) that imposes restrictions on the form (syntax) of clauses allowed in the hypothesis, and semantic bias that imposes restrictions on the meaning, or the behaviour of hypotheses.

Definition 4. An RSILP-ES system with declarative bias (abbreviated as RSILP-ESD system) is a tuple S = (S',L), where (i) S' = (E, B) is an RSILP-ES system, and (ii) L is a declarative bias, which is any restriction imposed on the hypothesis space H(E, B). We also write S = (E, B, L) instead of S = (S', L). 0


For any RSILP-ESD system S = (E, B, L), let H(S) = {H ∈ H(E, B) | H is allowed by L}, and P(S) = {P = B ∧ H | H ∈ H(S)}. H(S) (also written as H(E, B, L)) is called the hypothesis space induced from S (or from E, B, and L). P(S) (also written as P(E, B, L)) denotes the set of all the programs induced by S, and is called the program space induced from S (or from E, B, and L).

4.2 Equivalence relation, elementary sets and composed sets

We now define an equivalence relation on the universe of an RSILP-ESD system.

Definition 5. Let S = (E, B, L) be an RSILP-ESD system. An indiscernibility relation of S, denoted by R(S), is a relation on E defined as follows: ∀x, y ∈ E, (x, y) ∈ R(S) iff (P ⊢ x ⇔ P ⊢ y) for any P ∈ P(S) (i.e., iff x and y are inherently indistinguishable by any induced logic program P in P(S)). □

The following fact follows directly from the definition of R(S).

Fact 1 For any RSILP-ESD system S, R(S) is an equivalence relation.

Definition 6. Let S = (E, B, L) be an RSILP-ESD system. An elementary set of R(S) is an equivalence class of the relation R(S). For each x ∈ E, let [x]_R(S) denote the elementary set of R(S) containing x. Formally, [x]_R(S) = {y ∈ E | (x, y) ∈ R(S)}. A composed set of R(S) is any finite union of elementary sets of R(S). □

Definition 7. An RSILP-ESD system S = (E, B, L) is said to be in a rough setting iff ∃e⁺ ∈ E⁺ ∃e⁻ ∈ E⁻ [(e⁺, e⁻) ∈ R(S)]. □
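For a finite toy program space, Definitions 5-7 can be checked directly: two examples are indiscernible exactly when every program covers both or neither of them. The following Python sketch is a hypothetical illustration (the program names, covers and examples are invented, not from this chapter); it groups examples by their "cover signature" and tests for a rough setting.

from collections import defaultdict

def elementary_sets(examples, covers):
    # covers: dict mapping each program name to the set of examples it derives
    signature = lambda e: frozenset(p for p, cov in covers.items() if e in cov)
    classes = defaultdict(set)
    for e in examples:
        classes[signature(e)].add(e)
    return list(classes.values())

def in_rough_setting(pos, neg, covers):
    # Definition 7: some elementary set contains a positive and a negative example
    return any(c & set(pos) and c & set(neg)
               for c in elementary_sets(set(pos) | set(neg), covers))

# Hypothetical covers of three candidate programs; e1, e2 are positive, e3, e4 negative.
covers = {"P1": {"e1", "e2", "e3"}, "P2": {"e1"}, "P3": set()}
print(elementary_sets({"e1", "e2", "e3", "e4"}, covers))     # e2 and e3 share a signature
print(in_rough_setting({"e1", "e2"}, {"e3", "e4"}, covers))  # True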

4.3 Rough declarative biases

By (E, B, φ) we denote an RSILP-ESD system whose universe and background knowledge are E and B, respectively, and which does not have any declarative bias. We also write (E, B, φ) as (S, φ) where S = (E, B).

For any RSILP-ES system S = (E, B), let H^se(S) = {{e} | e ∈ E} and P^se(S) = {P = B ∧ H | H ∈ H^se(S)}. Let E_B = {e ∈ E | B ⊢ e}.

Fact 2 Let S' = (S, L) be any RSILP-ESD system such that H^se(S) ⊆ H(S'). Every elementary set of R(S'), other than E_B, is a singleton.


Proof. Let S = (E, B). For each P ∈ P^se(S), P ⊢ e and P ⊬ e' for all e' (≠ e) in E, where P = B ∧ H, H = {e}, e ∈ E − E_B. Hence the fact follows. □

Fact 3 Let S^φ = (S, φ) for any RSILP-ES system S = (E, B). Every elementary set of R(S^φ), other than E_B, is a singleton.

Proof. We note that H(S^φ) = H(S) and therefore H^se(S) ⊆ H(S^φ), since H(S) is the set of all possible hypotheses that can be induced from E and B. Using Fact 2 we get this fact. □

Some declarative bias L_R' is needed to be able to have an RSILP-ESD system in a rough setting. In other words, E and B could be such that a rough setting is possible for some L_R', but without some such L_R', S is not in a rough setting. E and B are what we would normally consider input or data in the system. So the input or data could be 'rough', but the system will still not be in a rough setting without some declarative bias L_R'.

We illustrate this with a simple example. Let S = (E, B, φ), with E = {p(x), p(y)} and B = {data(x, a), data(y, a)}. It is to be noted that S is not in a rough setting, even though the input or data appears to be 'rough'.

Let L_te be a declarative bias such that for any R-RSILP-ESD system S = (E, B, L_te), H ∈ H(S) ⇒ x is not a term in q(...) for any a ∈ H, any q(...) ∈ a, and any x such that p(x) ∈ E, where p is the target predicate of S.

We discuss in [7] several issues regarding such declarative biases, and we see that the declarative bias L_te is such a bias L_R'.

4.4 Consistency and completeness in the gRS-ILP model

Let S = (E,B,L) be an RSILP-ESD system, and P(S) the program space induced by S, as defined earlier.

Definition 8. The upper approximation of S, Upap(S), is defined as the least composed set of R(S) such that E⁺ ⊆ Upap(S). □

Definition 9. The lower approximation of S, Loap(S), is defined as the greatest composed set of R(S) such that Loap(S) ⊆ E⁺. □

The set Bnd(S) = Upap(S) − Loap(S) is known as the boundary region of S (or the borderline region of S). The lower approximation of S, Loap(S), is also known as Pos(S), the positive region of S. The set Neg(S) = E − Upap(S) is known as the negative region of S.

Definition 10. The consistent program space Pcons(S) of S is defined as Pcons(S) = {P ∈ P(S) | P ⊬ e⁻, ∀e⁻ ∈ E⁻}. A program P ∈ P(S) is


consistent with respect to S iff P ∈ Pcons(S). The reverse-consistent program space Prev-cons(S) of S is defined as Prev-cons(S) = {P ∈ P(S) | P ⊬ e⁺, ∀e⁺ ∈ E⁺}. A program P ∈ P(S) is reverse-consistent with respect to S iff P ∈ Prev-cons(S). □

Consistency is useful with respect to a positive region and its dual, reverse­consistency, is useful with respect to a negative region.

Definition 11. The complete program space Pcomp(S) of S is defined as Pcomp(S) = {P ∈ P(S) | P ⊢ e⁺, ∀e⁺ ∈ E⁺}. A program P ∈ P(S) is complete with respect to S iff P ∈ Pcomp(S). □

Definition 12. The cover of a program P ∈ P(S) in S is defined as cover(S, P) = {e ∈ E | P ⊢ e}. □

The following facts follow directly from the definitions.

Fact 4 ∀P ∈ Pcons(S), cover(S, P) ⊆ Loap(S).

Fact 5 ∀P ∈ Pcomp(S), cover(S, P) ⊇ Upap(S).

Fact 6 ∀P ∈ Pcomp(S), (E − cover(S, P)) ⊆ (E − Upap(S)).

Fact 7 ∀P ∈ Prev-cons(S), cover(S, P) ⊆ (E − Upap(S)).

For a program P ∈ Pcons(S), the closer cover(S, P) is to Loap(S), the better P is. P is best when cover(S, P) = Loap(S). Similarly, for a program P ∈ Prev-cons(S) (resp., P ∈ Pcomp(S)), the closer cover(S, P) is to U − Upap(S) (resp., Upap(S)), the better P is, and P is best when cover(S, P) = U − Upap(S) (resp., cover(S, P) = Upap(S)).

Fact 8 ∀P ∈ Pcons(S), P ⊢ e ⇒ e ∈ E⁺.

Fact 9 ∀P ∈ Pcomp(S), P ⊬ e ⇒ e ∈ E⁻.

Fact 10 ∀P ∈ Prev-cons(S), P ⊢ e ⇒ e ∈ E⁻.

These facts are used in the definitive description of data in a rough set­ting. Definitive description involves the description of the data with 100% accuracy. In a rough setting, it is not possible to definitively describe the en­tire data, since some of the positive examples and negative examples (of the concept being described) inherently cannot be distinguished from each other. These facts show that definitive description is possible in a rough setting when an example is covered by a consistent program (the example is then definitely positive), covered by a reverse-consistent program (the example is then definitely negative), or not covered by a complete program (the example


is then definitely negative). In many practical implementations, it is easy to find a consistent program (and therefore, also a reverse-consistent program), whereas it is not so easy to find a complete program. So in practical appli­cations of definitive description, consistent and reverse-consistent programs are easier to use than consistent and complete programs.

5 Illustrative example

Progol is an Inductive Logic Programming system written in C by Dr. Muggleton [8]. The syntax for examples, background knowledge and hypotheses is DEC-10 Prolog. Headless Horn clauses are used to represent negative examples and constraints. Progol source code and example files are freely available (for academic research) from ftp.cs.york.ac.uk under the appropriate directory in pub/ML_GROUP.

Ryszard Michalski's classic presentation of 5 Eastbound and 5 Westbound trains, as found in the input file train.pl in the examples subdirectory of Progol version 4.1 (dated 11.9.95), is used as an illustrative example. The sources for Progol version 4.2 (the C version dated 22.01.97) are used.

The background knowledge using the predicates has_car, open, closed (not open), long, short (not long) is presented in tabular form.

trains     cars: long/open   long/closed   short/open   short/closed
east1      11,13   14   12
east2      21,22   23
east3      33   31   32
east4      41,42,44   43
east5      52   51   53
west6      61   62
west7      71,72
west8      81   82
west9      91,93,94
west10     102   101

Progol induces the following hypothesis:

eastbound(A) :- has_car(A,B), not open(B), not long(B).

This hypothesis means that a train is eastbound if it has a car that is closed and short. It is easy to see from the preceding table that this hypothesis is complete and consistent. Every eastbound train has a car that is closed and short, and every train that is not eastbound does not have a car that is closed and short.

Let S₀ = (E, B₀, L_te) with E = E⁺ ∪ E⁻ consisting of the positive examples E⁺ = {eastbound(east1), ..., eastbound(east5)} and the negative examples E⁻ = {¬eastbound(west6), ..., ¬eastbound(west10)}, B₀ be


the background knowledge described above, and L_te be the declarative bias discussed in Section 4.3.

In order to have a good illustration of gRS-ILP, car 72 is made closed instead of being open, and this results in the following table.

trains     cars: long/open   long/closed   short/open   short/closed
east1      11,13   14   12
east2      21,22   23
east3      33   31   32
east4      41,42,44   43
east5      52   51   53
west6      61   62
west7      71   72
west8      81   82
west9      91,93,94
west10     102   101

It can be immediately seen that the earlier hypothesis is no longer consistent in this situation. Train west7 has a car that is short and closed, even though it is not eastbound. The hypothesis incorrectly covers one of the negative examples.

Let S = (E, B, L_te) with B being the modified background knowledge described above. S and S₀ are RSILP-ESD systems.

It is interesting to look at the elementary sets in these two situations. The elementary sets of S₀ (the original problem) are

eastbound(east1)
eastbound(east2), eastbound(east4)
eastbound(east3), eastbound(east5)
¬eastbound(west6), ¬eastbound(west8)
¬eastbound(west7), ¬eastbound(west9)
¬eastbound(west10)

S₀ is not in a rough setting.

However, when car 72 is made closed instead of open, the elementary sets

of S are

eastbound(east1)
eastbound(east3), eastbound(east5)
eastbound(east2), eastbound(east4), ¬eastbound(west7)
¬eastbound(west6), ¬eastbound(west8)
¬eastbound(west9)
¬eastbound(west10)


It is immediately apparent that S is in a rough setting. With this background knowledge, it is not possible to induce a program that distinguishes between the positive examples eastbound(east2), eastbound(east4) and the negative example ¬eastbound(west7).

It is easily seen that

Loap(S) = {eastbound(east1), eastbound(east3), eastbound(east5)}

and

Upap(S) = {eastbound(east1), eastbound(east2), eastbound(east3), eastbound(east4), eastbound(east5), ¬eastbound(west7)}.

The boundary and negative regions are

Bnd(S) = {eastbound(east2), eastbound(east4), ¬eastbound(west7)}

and

Neg(S) = {¬eastbound(west6), ¬eastbound(west8), ¬eastbound(west9), ¬eastbound(west10)}.
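The regions just listed can be recomputed mechanically from the elementary sets of S given above; the short Python sketch below (train names stand in for the corresponding examples) is only a check of that bookkeeping.

# Elementary sets of S for the modified problem, as listed above.
classes = [{"east1"}, {"east3", "east5"},
           {"east2", "east4", "west7"},
           {"west6", "west8"}, {"west9"}, {"west10"}]
E_pos = {"east1", "east2", "east3", "east4", "east5"}         # positive examples

lower = set().union(*[c for c in classes if c <= E_pos])      # Loap(S)
upper = set().union(*[c for c in classes if c & E_pos])       # Upap(S)
U = set().union(*classes)
print(sorted(lower))          # ['east1', 'east3', 'east5']
print(sorted(upper - lower))  # ['east2', 'east4', 'west7']            = Bnd(S)
print(sorted(U - upper))      # ['west10', 'west6', 'west8', 'west9']  = Neg(S)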

Progol is used, with the default value of 0% for the noise setting, to induce a consistent hypothesis H. (The noise setting is the percentage of negative examples that are allowed to be covered.) Let Pcons = B ∧ H. We see that

cover(S, Pcons) = {eastbound(east1), eastbound(east3), eastbound(east5)}

and |cover(S, Pcons)| = 3.

Progol is now run again with a noise setting of 100%. This setting allows the induced hypothesis to cover any number of negative examples incorrectly. This induces the hypothesis H' that is shown in the second entry of the following table. Let Pcomp = B ∧ H'. Pcomp covers one negative example incorrectly. We see that

cover(S, Pcomp) = {eastbound(east1), eastbound(east2), eastbound(east3), eastbound(east4), eastbound(east5), eastbound(west7)}

and |cover(S, Pcomp)| = 6.

         hypothesis                                                 cover
Pcons    eastbound(A) :- has_car(A,B), has_car(A,C),
                         closed(B), short(B), long(C).                3
Pcomp    eastbound(A) :- has_car(A,B),
                         closed(B), short(B).                         6

We see that Progol performs extremely well in this illustration, and describes the rough concept as well as possible. It induces Pcons and Pcomp so well that

cover(S, Pcons) = Loap(S)


and cover(S, Pcomp) = Upap(S). Progol has done the best possible in this example. There is no imprecise

factor contributed by Progol, and the imprecision in the induced hypothesis is only due to the inherent roughness in the background knowledge for these examples.

6 gRS-ILP in experimental environments

Consistency and completeness in the gRS-ILP model are used in experimen­tal environments in the sequence outlined below.

Step 1. An ILP experiment is attempted with a traditional ILP system in the example setting of the general semantics. It is found that it is not possible to induce a complete and consistent hypothesis because the experimental environment itself is imprecise.

The traditional ILP system usually attempts to induce a hypothesis that is as close as possible to the concept described by the examples. It relaxes the completeness condition or the consistency condition (or possibly even both), and induces a program that has the best cover possible (maximum positive examples and minimum negative examples). Let the name of this program be known as P for the purpose of this outline.

The concepts of consistency and completeness in the gRS-ILP model are now used in this situation, by following the remaining steps in the sequence.

Step 2. The same traditional ILP system is used, possibly in a modified version, with the completeness condition removed, but ensuring that any induced hypothesis strictly fulfills the consistency condition. A program induced using this modified system strictly fulfills the consistency condition, but does not fulfill the completeness condition. Let the name of this program be Pcons.

Step 3. The same traditional ILP system is used again, possibly in a modified version, with the consistency condition removed, but ensuring that any induced hypothesis strictly fulfills the completeness condition. A program induced using this modified system strictly fulfills the completeness condition, but does not fulfill the consistency condition. Let the name of this program be Pcomp.

If it is difficult to induce a hypothesis that strictly fulfills the completeness condition, it may be possible to induce a hypothesis by exchanging the roles of the positive and negative examples, and then repeating Step 2 (that is, the original positive examples are used as the new negative examples, the original negative examples are used as the new positive examples, and a consistent hypothesis is induced). Let the name of this program be Prev-cons.


In addition to P, we may use the programs Pcons and Pcomp (or Prev-cons). Let S = (E, B, L_te) be an RSILP-ESD system where E = E⁺ ∪ E⁻, E⁺ are the positive examples, E⁻ the negative examples, B is the background knowledge and L_te the declarative bias.

Using Facts 8, 9 and 10 we have the following. If Pcons ⊢ e, then e ∈ E⁺. If Pcomp ⊬ e, then e ∈ E⁻. If Prev-cons ⊢ e, then e ∈ E⁻.

Otherwise P is used: if P ⊢ e, then it is very likely that e ∈ E⁺; else if P ⊬ e, then it is very likely that e ∈ E⁻.
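This decision procedure amounts to a three-way test. The sketch below is hypothetical Python glue code (the function and parameter names are mine); it assumes the covers of P, Pcons and Prev-cons have already been materialised as sets of examples, for instance by querying the corresponding induced Prolog programs.

def classify(e, cover_P, cover_cons, cover_revcons):
    # Definitive description where possible, best-cover guess otherwise.
    if e in cover_cons:        # Fact 8: a consistent program derives only positive examples
        return "definitely positive"
    if e in cover_revcons:     # Fact 10: a reverse-consistent program derives only negatives
        return "definitely negative"
    # (With a complete program Pcomp available instead, "e not covered by Pcomp"
    #  would give "definitely negative" by Fact 9.)
    return "likely positive" if e in cover_P else "likely negative"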

It is clear from this sequence that gRS-ILP plays its role when the system S is imprecise. B, L and E are such that a complete and consistent hypothesis cannot be induced. Consistency and completeness in gRS-ILP do not replace traditional ILP systems in this imprecise (or rough) setting. They play a complementary role to the traditional ILP system. The gRS-ILP model uses the same ILP system (possibly slightly modified) to try to induce a consistent hypothesis and a complete (or reverse-consistent) hypothesis. When the consistent hypothesis is true, it means that the example surely belongs to the concept described by the given examples. When the complete hypothesis is not true, it means that the example surely does not belong to the concept described by the given examples. (When the reverse-consistent hypothesis is true, it means that the example surely does not belong to the concept described by the given examples.)

7 Experimental illustrations

7.1 GOLEM and variants of drug Tacrine used for treating Alzheimer's disease

GOLEM is another Inductive Logic Programming system written by Dr. Muggleton [9]. Examples and background knowledge ground facts are given and a hypothesis is induced. GOLEM source code and example files are available (for academic research) from ftp.comlab.ox.ac.uk under the appropriate directory in pub/Packages/ILP. The data sets used are available under Datasets.

The drug Tacrine is reportedly useful in treating Alzheimer's disease. However, since it is highly toxic, there have been investigations into possible variants of the basic Tacrine structure. A good drug should have low toxic­ity, high acetocholinesterase inhibition, good reversal of scopolamine induced deficiency and should inhibit amine re-uptake.


There are four experiments based on each of these four properties. In each experiment, the positive and negative examples are pairwise comparisons of drugs for the corresponding one of these four properties. The experiments first performed on this data set using GOLEM are reported in [10]. The data set is available in the directory alzheimer and is based on the work reported in [11].

The property investigated, the number of positive examples and the name of the predicate used to describe the property, are tabulated below for each of the four experiments. The number of negative examples in each experiment is the same as the number of positive examples, and each negative example is the inverse of a positive example in the pairwise comparison of a drug (if great (x, y) is a positive example, then great (y ,x) is a negative example).

Property                                      +ve examples   Predicate
High acetocholinesterase inhibition           663            great
Inhibition of amine re-uptake                 343            great_ne
Low toxicity                                  443            less_toxic
Reversal of scopolamine-induced deficiency    321            great_rsd

The experimental illustration of consistency and completeness in the gRS­ILP model follows. Each experiment is performed on the entire data set and the induced logic program characterizes the data set.

The first step is the original set of GOLEM experiments. Let the name of the program be known as P for the purpose of this outline.

The second step uses GOLEM with the default noise setting of zero, where any induced hypothesis cannot cover any negative example. A hypothesis induced using GOLEM with this setting strictly fulfills the consistency condition, but does not fulfill the completeness condition. Let the name of the induced program be Pcons.

The details of the induced hypotheses for each of the four properties follow. The number of samples (rlggsample) is set to 10% of the number of positive examples.

The number of clauses induced, and the number of positive examples covered, are tabulated below for each experiment. The column 'All clauses' are the results for all the induced clauses. Some of the induced clauses (such as great(A,nl).) are with respect to a particular example (nl in this case). Note that removing such clauses by hand amounts to imposing the declarative bias L te . The clauses left after imposing the declarative bias L te are shown in the next column 'Lte clauses'.


Predicate     rlggsample   All clauses          L_te clauses
                           Number   Cover       Number   Cover
great         66           8        130         5        52
great_ne      34           6        166         3        100
less_toxic    44           20       416         8        295
great_rsd     32           4        85          2        55

The induced hypotheses did not cover any negative example (in keeping with the zero noise setting). The induced hypotheses are presented below.

great(A,B) :- x_subst(A,6,C), alk_groups(A,0), ring_subst_3(B,D).
great(A,B) :- alk_groups(B,3), x_subst(A,6,C), polarisable(C,polari1).
great(A,B) :- x_subst(A,6,C), alk_groups(A,0), ring_subst_4(B,D), h_acceptor(D,h_acc0).
great(A,B) :- alk_groups(A,0), ring_substitutions(B,1), ring_subst_2(B,C), h_acceptor(C,h_acc1).
great(A,B) :- alk_groups(A,C), ring_substitutions(B,C), x_subst(A,D,E), polar(E,F),
              x_subst(B,6,G), polar(G,F), polarisable(G,polari0).

great_ne(A,B) :- alk_groups(B,0), ring_substitutions(A,0).
great_ne(A,B) :- alk_groups(B,0), ring_substitutions(A,1).
great_ne(A,B) :- ring_substitutions(A,0), ring_subst_4(B,C), h_acceptor(C,h_acc1).

less_toxic(A,B) :- alk_groups(A,2).
less_toxic(A,B) :- alk_groups(A,4).
less_toxic(A,B) :- alk_groups(B,3), ring_substitutions(A,1).
less_toxic(A,B) :- ring_subst_4(A,C), h_acceptor(C,h_acc0).
less_toxic(A,B) :- x_subst(B,6,C), alk_groups(B,0), ring_substitutions(A,1).
less_toxic(A,B) :- alk_groups(A,1), x_subst(A,6,C), polar(C,polar3).
less_toxic(A,B) :- alk_groups(B,0), x_subst(B,6,C), polarisable(C,polari1).
less_toxic(A,B) :- alk_groups(B,1), ring_substitutions(A,1), ring_substitutions(B,1),
                   ring_subst_3(B,C), polarisable(C,polari0).

great_rsd(A,B) :- x_subst(B,6,C), ring_substitutions(B,0), ring_subst_2(A,D).
great_rsd(A,B) :- alk_groups(A,0), r_subst_1(B,single_alk(C)), x_subst(A,6,D),
                  polarisable(D,polari0).

The third step is to determine Pcomp using GOLEM with the consistency condition removed, where the induced hypothesis strictly fulfills the completeness condition.

However, it is easier to determine Prev-cons rather than Pcomp, since the negative examples are each the inverted form of a positive example. The value of cover(S, Prev-cons) is the same as the value of cover(S, Pcons), where S = (E, B, L_te) as usually defined.

Predicate    U      |cover(S, Pcons)|   |cover(S, Prev-cons)|
great        1326   52                  52
great_ne     686    100                 100
less_toxic   886    295                 295
great_rsd    642    55                  55

In addition to P, we may use Pcons and Prev-cons.

Using Facts 8 and 10 we have the following. If Pcons ⊢ e, then e ∈ E⁺.

If Prev-cons ⊢ e, then e ∈ E⁻.

Otherwise P is used: if P ⊢ e, then it is very likely that e ∈ E⁺; else if P ⊬ e, then it is very likely that e ∈ E⁻.

We see that in the case of toxicity, 295 out of 443 positive examples are definitively described by Pcons and 295 out of 443 negative examples are definitively described by Prev-cons.

It is seen that these experiments result in interesting observations based on the induced rules listed above. An interesting observation is that an example drug in the data set is definitely less toxic if it contains two or four alkaline groups.

7.2 Progol and mutagenicity in nitroaromatic compounds

The following is a brief summary of an experimental illustration of definitive description using Progol that is described in detail in [6]. These experiments concern the discovery of rules for mutagenicity in nitroaromatic compounds described in [12]. Progol version 4.2 dated 14.02.98 is used.

The results are tabulated below.


positive examples   negative examples   total examples   consistent examples   reverse-consistent examples
125                 63                  188              77                    22

It is seen that 77 of the 125 positive examples are definitively described, and 22 of the 63 negative examples are definitively described.

7.3 Progol and transmembrane domains in amino acid sequences

A similar experiment is conducted for the learning of transmembrane do­mains from amino-acid sequences [13]. The identification of transmembrane domains from amino acid sequences is described in [14]. Progol version 4.4 dated 25.08.98 is used in this study. The results are given below.

positive examples   negative examples   total examples   consistent examples   reverse-consistent examples
583                 583                 1166             249                   55

It is seen that 249 of the 583 positive examples are definitively described, and 55 of the 583 negative examples are definitively described.

8 Conclusions

The gRS-ILP model plays an important role in the rough setting of example semantics in ILP systems. A rough setting is when the consistency and com­pleteness criteria of the example semantics cannot both be fulfilled together, because the background knowledge, the declarative bias and the evidence are inherently such that any induced hypothesis can never distinguish between some of the positive and negative examples. The gRS-ILP model lays a firm theoretical foundation in this rough setting.

In such a rough setting where it is not possible to induce a hypothesis that is both consistent and complete, many ILP systems induce a hypothesis that fulfills the consistency and completeness criteria as much as possible. The gRS-ILP model complements such systems. The gRS-ILP model helps induce additional hypotheses, by using the same ILP system, possibly with minor modifications. These hypotheses are such that they definitively describe part of the data.


The gRS-ILP model is a theoretical foundation for a rough setting in ILP. The use of consistency and completeness (or reverse-consistency) is only one of several possible applications of this model.

Acknowledgements

The authors thank Professors S. Miyano, K. Morita, V. Ganapathy, K.M. Mehata, and R. Siromoney for their valuable comments and support; Profes­sor S. Muggleton and Drs. A. Srinivasan and D. Page for the warm welcome and the sharing of their research results during the first author's brief visit to the Oxford University Computing Laboratory; Dr. N. Zhong for help in providing rough set material; and the Japan Society for Promotion of Science for the Ronpaku Fellowship for the first author.

References

1. S. Muggleton. (1991) Inductive logic programming. New Generation Comput­ing, 8(4), 295-318

2. S. Muggleton and L. De Raedt. (1994) Inductive logic programming: Theory and Methods. Journal of Logic Programming, 19/20, 629-679

3. z. Pawlak. (1982) Rough sets. International Journal of Computer and Infor­mation Sciences, 11(5), 341-356

4. z. Pawlak. (1991) Rough Sets - Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, The Netherlands

5. A. Siromoney. (1997) A rough set perspective of Inductive Logic Programming. In Luc De Raedt and Stephen Muggleton, editors, Proceedings of the IJCAI-97 Workshop on Frontiers of Inductive Logic Programming, Nagoya, Japan, 111-113

6. A. Siromoney and K Inoue. (1998) A framework for Rough Set Inductive Logic Programming - the gRS-ILP model. In Pacific Rim Knowledge Acquisition Workshop, Singapore, 201-217

7. A. Siromoney and K Inoue. (1999) Elementary sets and declarative biases in a restricted gRS-ILP model. Informatica. To appear.

8. S. Muggleton. (1995) Inverse entailment and Progol. New Generation Com­puting, 13, 245-286

9. S. Muggleton and C. Feng. (1992) Efficient induction in logic programs. In S. Muggleton, editor, Inductive Logic Programming, Academic Press, 281-298

10. R.D. King, A. Srinivasan, and M.J.E. Sternberg. (1995) Relating chemical activity to structure: an examination of ILP successes. New Generation Com­puting, 13, 411-433

11. G.M. Shutske, F.A. Pierrat, KJ. Kapples, M.L. Cornfeldt, M.R. Szewczak, F.P. Huger, G.M. Bores, V. Haroutunian, and KL. Davis. (1989) 9-amino-1,2,3,4-tetrahydroacridin-1-ols: Synthesis and evaluation as potential Alzheimer's dis­ease therapeutics. Journal of Medical Chemistry, 32, 1805-1813.

12. A. Srinivasan, S.H. Muggleton, R.D. King, and M.J.E. Sternberg. (1996) The­ories for mutagenicity: a study of first-order and feature based induction. Ar­tificial Intelligence, 85, 277-299


13. A. Siromoney and K. Inoue. (1999) The gRS-ILP model and motifs in strings. In N. Zhong, A. Skowron, and S. Ohsuga, editors, New Directions in Rough Sets, Data Mining, and Granular-Soft Computing - 7th International Workshop, RSFDGrC'99, Yamaguchi, Japan, Lecture Notes in Artificial Intelligence 1711, Springer, 158-167

14. S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and T. Shino­hara. (1993) A machine discovery from amino acid sequences by decision trees over regular patterns. New Generation Computing, 11, 361-375


Possibilistic Data Analysis and Its Similarity to Rough Sets

Hideo Tanaka

Graduate School of Management and Information,

Toyohashi Sozo College, Toyohashi-shi 440-8511, Japan

Peijun Guo

Faculty of Economics, Kagawa University, Takamatsu, Kagawa 760-8523, Japan

Abstract. This paper is dealing with the upper and lower approximation models for representing the given phenomenon in a fuzzy environment. Based on the given data, the upper and lower approximation models can be derived from upper and lower directions, respectively where the inclusion relationship between these two models holds. Thus, the inherent fuzziness existing in the given phenomenon can be represented by the upper and lower models. The modalities of the upper and lower models have been illustrated in regression analysis and also in the identification methods of possibility distributions. The comparison of the concepts of possibility data analysis and rough sets is shown. A measure similar to the accuracy measure of rough sets is used to clarify the difference between the data structure and the assumed model.

1. Introduction

It is well-known that multivariate data analysis is a main tool for analyzing the uncertainty in the real world based on probability theory. Possibilistic data analysis is an alternative based on possibility distributions [9]. Multivariate data analysis considers the uncertainty as probability phenomena while possibility data analysis considers that as possibility phenomena. Possibility theory based on possibility distributions has been proposed by Zadeh [13] and advanced by Dubois and Prade [1]. There are many applications of possibility theory in various fields. For example, possibility linear regression [6,7] and possibility portfolio selection [2,5,8] have been formulated by using exponential possibility distributions on a multi-dimensional space. The theory of exponential possibility distributions has been proposed in the paper [10].

Based on possibilities which are more flexible than probabilities to represent the knowledge of human being, the upper and the lower approximation models are obtained for representing the given phenomenon in a fuzzy environment. These two models are derived from the given data to characterize the inherent fuzziness in the given phenomenon from upper and lower directions, respectively. The modalities of the upper and lower models have been illustrated in interval regression analysis and also in the identification methods of possibility distributions. The upper and lower models are closely connected to the rough sets concept [3,4] in the sense that the imprecise and insufficient knowledge are


characterized by the dual approximations with the inclusion relation between them. The comparison of concepts of possibilistic data analysis and rough sets is shown in this paper. A measure similar to the accuracy measure of rough sets is used to clarify the difference between the data structure and the assumed model.

2. Upper and lower models for interval regression by linear programming methods

Interval regression [6] is regarded as the simplest version of possibility regression analysis. It can be easily applied to investigate many uncertain real-life phenomena. If the given output values are intervals, we can formulate two estimation models, i.e., the upper and lower approximation models based on the inclusion relations between the given interval outputs and the estimated intervals.

2.1. Upper and lower approximation models

An interval linear system can be written as

Y(x) = A_0 + A_1 x_1 + \cdots + A_n x_n = A^t x,   (1)

where x = [1, x_1, ..., x_n]^t is an input vector, A = [A_0, ..., A_n]^t is an interval coefficient vector, and Y is the corresponding estimated interval. An interval coefficient A_i is denoted as A_i = (a_i, c_i), where a_i is the center and c_i is the radius. Thus, an interval coefficient A_i can also be expressed as

radius. Thus, an interval coefficient Ai can also be expressed as

A_i = [a_i - c_i, a_i + c_i].   (2)

By interval arithmetic, the regression model (1) can be expressed as

Y(x) = (a^t x, c^t |x|),   (3)

where a = [a_0, ..., a_n]^t, c = [c_0, ..., c_n]^t and |x_j| = [1, |x_{j1}|, ..., |x_{jn}|]^t. Here a^t x and c^t |x| represent the center and the radius of the estimated interval Y(x), respectively.

When the given outputs are intervals but the given inputs are crisp, we can consider two regression models, namely, an upper estimation model and a lower estimation model. The given data are denoted as (Y_j, x_{j1}, ..., x_{jn}) = (Y_j, x_j),


where Y_j is an interval output denoted as (y_j, e_j). The upper and lower

estimation models are defined respectively as follows:

Y^*(x_j) = A^{*t} x_j = (a^{*t} x_j, c^{*t} |x_j|),   (4)

Y_*(x_j) = A_*^t x_j = (a_*^t x_j, c_*^t |x_j|).   (5)

Two regression models are described as follows:

Upper regression model: The problem here is to satisfy

Y_j ⊆ Y_j^*,  j = 1, ..., m,   (6)

and to find the interval coefficients A_i^* = (a_i^*, c_i^*) that minimize the sum of the spreads of the estimated intervals, that is,

J^* = \sum_{j=1}^{m} c^{*t} |x_j|,   (7)

where the minimization stems from the inclusion relations (6). Since the constraint

conditions Y_j ⊆ Y_j^* can be written as

y_j - e_j \ge a^{*t} x_j - c^{*t} |x_j|, \quad y_j + e_j \le a^{*t} x_j + c^{*t} |x_j|,   (8)

the problem for obtaining the interval coefficients A_i^* can be described as the

following LP problem:

\min_{a^*, c^*} \; \sum_{j=1,\ldots,m} c^{*t} |x_j|   (9)

s.t.  y_j - e_j \ge a^{*t} x_j - c^{*t} |x_j|,
      y_j + e_j \le a^{*t} x_j + c^{*t} |x_j|,   j = 1, ..., m,
      c^* \ge 0.


The reason for minimizing J* is to find the narrowest estimated intervals Yj*

among those satisfying the constraint conditions of (6).
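One concrete way to solve LP (9) is with scipy.optimize.linprog; the sketch below is illustrative glue code under my own naming (upper_model, y_c, y_r are not the paper's notation), with the decision vector stacked as [a*, c*].

import numpy as np
from scipy.optimize import linprog

def upper_model(X, y_c, y_r):
    # Upper interval regression model, LP (9).
    # X: (m, n) crisp inputs; y_c, y_r: centres y_j and radii e_j of the interval outputs Y_j.
    X = np.asarray(X, dtype=float)
    y_c = np.asarray(y_c, dtype=float)
    y_r = np.asarray(y_r, dtype=float)
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # x_j = [1, x_j1, ..., x_jn]^t
    absXb = np.abs(Xb)
    # decision vector z = [a*, c*]; minimise sum_j c*^t |x_j|
    obj = np.concatenate([np.zeros(n + 1), absXb.sum(axis=0)])
    # a*^t x_j - c*^t |x_j| <= y_j - e_j   and   -(a*^t x_j + c*^t |x_j|) <= -(y_j + e_j)
    A_ub = np.vstack([np.hstack([Xb, -absXb]), np.hstack([-Xb, -absXb])])
    b_ub = np.concatenate([y_c - y_r, -(y_c + y_r)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (n + 1)   # a* free, c* >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n + 1], res.x[n + 1:]          # centres a*, spreads c*

The lower approximation model introduced next is the mirror image: the two inclusion constraints are reversed and the same spread sum is maximised (its negative minimised), so the same scaffolding can be reused, bearing in mind that the lower problem may be infeasible.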

Lower regression model: The problem here is to satisfy

Y_{*j} ⊆ Y_j,  j = 1, ..., m,   (10)

and to find the interval coefficients A_{*i} = (a_{*i}, c_{*i}) that maximize the sum of the spreads of the estimated intervals:

J_* = \sum_{j=1}^{m} c_*^t |x_j|,   (11)

where the maximization stems from the inclusion relations (10). Since the constraint conditions Y_{*j} ⊆ Y_j can be written as

y_j - e_j \le a_*^t x_j - c_*^t |x_j|, \quad y_j + e_j \ge a_*^t x_j + c_*^t |x_j|,   (12)

the problem for obtaining the interval coefficients A_{*i} can be described as the

following LP problem:

\max_{a_*, c_*} \; \sum_{j=1,\ldots,m} c_*^t |x_j|   (13)

s.t.  y_j - e_j \le a_*^t x_j - c_*^t |x_j|,
      y_j + e_j \ge a_*^t x_j + c_*^t |x_j|,   j = 1, ..., m,
      c_* \ge 0.


The reason for maximizing J_* is to find the widest estimated intervals Y_{*j} among those satisfying the constraint conditions of (10). The estimated intervals from the upper and lower estimation models satisfy the inclusion relations Y_{*j} ⊆ Y_j ⊆ Y_j^* (j = 1, ..., m). In order to show the validity of the above formulations, assume that the given data (Y_j^0, x_j^0), j = 1, ..., m, satisfy the linear interval system

Y_j^0 = A^{0t} x_j^0,  j = 1, ..., m.   (14)

Theorem 1. If the given data (Y_j^0, x_j^0) (j = 1, ..., m) satisfy (14), the interval vectors A^* and A_* obtained from (9) and (13), respectively, are the same as A^0. Thus, we have

Y^*(x) = Y_*(x) = A^{0t} x for any x.   (15)

Proof. Let us prove only A^* = A^0 for the upper regression model. Since (Y_j^0, x_j^0) satisfies (14), we have

y_j^0 = a^{0t} x_j^0,   (16)

e_j^0 = c^{0t} |x_j^0|.   (17)

Substituting (16) and (17) into the constraint conditions of (9) yields

y_j^0 \ge a^{*t} x_j^0 - c^{*t} |x_j^0| + c^{0t} |x_j^0|,
y_j^0 \le a^{*t} x_j^0 + c^{*t} |x_j^0| - c^{0t} |x_j^0|.   (18)

Setting a^* = a^0 and c^* = c^0, (a^0, c^0)^t is a feasible solution of the LP problem (9). Suppose there were another solution c^* such that


\sum_{j=1,\ldots,m} c^{*t} |x_j^0| < \sum_{j=1,\ldots,m} c^{0t} |x_j^0|.   (19)

Thus, for some i we have

c^{*t} |x_i^0| < c^{0t} |x_i^0|.   (20)

The ith constraint condition of (9) can be rewritten as

y_i^0 \ge a^{*t} x_i^0 + k_i, \quad y_i^0 \le a^{*t} x_i^0 - k_i,   (21)

where ki is as follows:

k_i = c^{0t} |x_i^0| - c^{*t} |x_i^0| > 0.   (22)

It is obvious from the contradiction in (21) that (19) cannot hold. Thus, the optimal solution c^* must be c^0. Moreover, it follows from (18) with c^* = c^0 that y_j^0 = a^{*t} x_j^0. Thus, a^* is equal to a^0. ∎

Theorem 2. There exists always an optimal solution in (9) while it is not assured that there is always an optimal solution in (13) for interval linear systems.

Proof. In the upper regression model, the constraint conditions of (9) always admit a feasible solution if a sufficiently large positive vector is taken for c^*. On the contrary, there are cases where (13) has no feasible solution even if a zero vector is taken for c_*. ∎

Unfortunately, the lower approximation model is not always guaranteed if we fail to assume a proper regression model. In case of no solution for linear regression with the lower approximation model, we can take a following polynomial:

(23)

Since a polynomial such as (23) can represent any function, the center of the lower approximation model Y* (x j) can meet the center of the observed output Yj •

Thus, we can obtain an optimal solution in the lower approximation model by increasing the number of terms of the polynomial (23) until a solution is found


out. The existence of the lower approximation model can be interpreted as the fact that the assumed model is somewhat reliable. If we find a lower model, since there always exists an optimal solution for the upper model, the measure of fitness φ

can be introduced as

(24)

where 0 ≤ φ ≤ 1. This measure of fitness φ indicates how closely the upper model approximates the lower model and vice versa. It is desirable to assume a regression model which gives a higher value of φ. If the given input-output data satisfy a linear system (14), then we can obtain upper and lower models which are identical. In this case the measure of fitness φ becomes 1.

2.2 The integrated model for obtaining the upper and the lower models

To obtain the upper and lower approximation models simultaneously, the following integrated model can be considered.

\min_{a^*, c^*, a_*, c_*}   \sum_{j=1,\ldots,p} c^{*t} |x_j| - \sum_{j=1,\ldots,p} c_*^t |x_j|
s.t.  Y^*(x_j) \supseteq Y_j ,
      Y_*(x_j) \subseteq Y_j ,   (j = 1, \ldots, p),
      a_{*i} + c_{*i} \le a_i^* + c_i^* ,
      a_{*i} - c_{*i} \ge a_i^* - c_i^* ,
      a_{*i} \ge 0 ,  a_i^* \ge 0 ,  c_{*i} \ge 0 ,  c_i^* \ge 0 ,   (i = 0, \ldots, n).    (25)

This LP problem combines (9) and (13) while taking into account the inclusion relations A_i^* \supseteq A_{*i} (i = 0, \ldots, n) between the upper and lower regression coefficients, which guarantee Y_*(x) \subseteq Y^*(x) for any x.
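The integrated problem (25) is still linear, so it can be handled in the same way as (13). Below is a minimal sketch using scipy.optimize.linprog in which the decision vector stacks (a^*, c^*, a_*, c_*) and the coefficient-inclusion conditions appear as extra rows; the names are illustrative, and the non-negativity bounds simply follow the constraint list of (25) as reconstructed above.

import numpy as np
from scipy.optimize import linprog

def integrated_interval_regression(X, y, e):
    """Sketch of the integrated LP (25): obtain the upper coefficients (a_up, c_up)
    and the lower coefficients (a_lo, c_lo) simultaneously, minimizing the total
    spread difference between the upper and lower models."""
    p, n = X.shape
    Xd = np.hstack([np.ones((p, 1)), X])
    A = np.abs(Xd)
    n1 = n + 1
    Z = np.zeros((p, n1))
    s = A.sum(axis=0)
    # decision vector z = [a_up, c_up, a_lo, c_lo]
    obj = np.concatenate([np.zeros(n1), s, np.zeros(n1), -s])
    rows, rhs = [], []
    # Y^*(x_j) must contain Y_j
    rows.append(np.hstack([-Xd, -A, Z, Z])); rhs.append(-(y + e))
    rows.append(np.hstack([ Xd, -A, Z, Z])); rhs.append(y - e)
    # Y_*(x_j) must be contained in Y_j
    rows.append(np.hstack([Z, Z,  Xd, A])); rhs.append(y + e)
    rows.append(np.hstack([Z, Z, -Xd, A])); rhs.append(-(y - e))
    # coefficient inclusion: each A_{*i} lies inside A_i^*
    I = np.eye(n1)
    rows.append(np.hstack([-I, -I,  I, I])); rhs.append(np.zeros(n1))
    rows.append(np.hstack([ I, -I, -I, I])); rhs.append(np.zeros(n1))
    # non-negativity of all coefficients, following the constraint list of (25);
    # relax the bounds on the centers if negative centers must be allowed
    bounds = [(0, None)] * (4 * n1)
    res = linprog(obj, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=bounds, method="highs")
    if not res.success:
        return None
    a_up, c_up, a_lo, c_lo = np.split(res.x, 4)
    return a_up, c_up, a_lo, c_lo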

Now, assuming that an analyst specifies a tolerance limit \omega such that \varphi \ge \omega, we propose an algorithm which gives the two approximation models for interval regression analysis.

Algorithm for obtaining two approximation models:

Step 1: Take a linear function as the regression model:


Y = A_0 + \sum_{i} A_i x_i .    (26)

Step 2: Solve the lower approximation model (13). If there is an optimal solution of (13), go to Step 4; otherwise, go to Step 3.
Step 3: Increase the number of terms of the polynomial, i.e.,

(27)

Go to Step 2.

Step 4: Solve the unified LP problem (25) and calculate the measure of fitness \varphi of the two models. If \varphi \ge \omega, go to Step 5 (we already have the optimal upper model Y^*(x) and the optimal lower model Y_*(x) satisfying Y^*(x) \supseteq Y_*(x) for any x). Otherwise, go to Step 3.

Step 5: End the procedure.
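The algorithm can be sketched as a small driver loop that enlarges a polynomial feature expansion until the lower model exists (Step 2) and the fitness threshold is met (Step 4). The sketch below reuses the helper functions sketched earlier in this section (lower_interval_regression, integrated_interval_regression, measure_of_fitness) and defines a simple single-input feature expansion; all names and stopping parameters are illustrative.

import numpy as np

def polynomial_features(x, degree):
    # Powers x, x^2, ..., x^degree of a single explanatory variable.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** d for d in range(1, degree + 1)])

def fit_two_models(x, y, e, omega=0.5, max_degree=5):
    """Steps 1-5: raise the polynomial degree until the lower model exists and
    the measure of fitness reaches the tolerance limit omega."""
    for degree in range(1, max_degree + 1):
        X = polynomial_features(x, degree)
        if lower_interval_regression(X, y, e) is None:      # Step 2 -> Step 3
            continue
        result = integrated_interval_regression(X, y, e)     # Step 4
        if result is None:
            continue
        a_up, c_up, a_lo, c_lo = result
        Xd = np.hstack([np.ones((len(X), 1)), X])
        if measure_of_fitness(Xd, c_up, c_lo) >= omega:
            return degree, a_up, c_up, a_lo, c_lo             # Step 5
    return None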

3. Interval regression analysis by quadratic programming methods

3.1 Basic model by quadratic programming problems

Here we introduce a basic QP model for interval regression analysis corresponding to the LP-based interval regression model above. To formulate interval regression by QP, the following assumptions are made:

(1) The input-output data are given as (y_j, x_j) (j = 1, \ldots, p), where x_j = [1, x_{j1}, \ldots, x_{jn}]^t.
(2) The data can be represented by the interval linear model (1).
(3) The given output y_j should be included in the estimated output Y(x_j), that is, y_j \in Y(x_j) (j = 1, \ldots, p).

The objective function is defined by

J = \sum_{j=1}^{p} (c^t |x_j|)^2 = c^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) c ,    (28)


which is the sum of squared spreads of the estimated outputs, where \sum_{j=1}^{p} |x_j| |x_j|^t is an (n+1) \times (n+1) positive definite matrix.

Based on the above assumptions, QP-based interval regression determines the optimal interval coefficients A_i = (a_i, c_i), i = 0, \ldots, n, that minimize the objective function (28) subject to the constraints y_j \in Y(x_j) (j = 1, \ldots, p). Thus, the model can be expressed as the following QP problem [11]:

\min_{a,c}  J = c^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) c + \xi a^t a
s.t.  a^t x_j - c^t |x_j| \le y_j ,   (j = 1, \ldots, p),
      a^t x_j + c^t |x_j| \ge y_j ,   (j = 1, \ldots, p),
      c \ge 0 ,    (29)

where \xi is a small positive number. The term \xi a^t a is added to the objective function (28) so that (29) becomes a strictly convex quadratic programming problem, the quadratic form being positive definite with respect to the decision variables a and c.
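The QP (29) maps directly onto a convex-optimization modelling layer. A minimal sketch using cvxpy (an assumed toolchain; any QP solver would do), with the spread term written as a sum of squares and illustrative names:

import numpy as np
import cvxpy as cp

def qp_interval_regression(X, y, xi=1e-6):
    """Sketch of the basic QP model (29): minimize the sum of squared spreads
    plus the small regularization term xi * a'a, subject to each y_j being
    contained in the estimated interval Y(x_j)."""
    p, n = X.shape
    Xd = np.hstack([np.ones((p, 1)), X])       # rows [1, x_j1, ..., x_jn]
    A = np.abs(Xd)
    a = cp.Variable(n + 1)
    c = cp.Variable(n + 1, nonneg=True)        # c >= 0
    objective = cp.sum_squares(A @ c) + xi * cp.sum_squares(a)
    constraints = [Xd @ a - A @ c <= y,        # lower bound of Y(x_j) below y_j
                   Xd @ a + A @ c >= y]        # upper bound of Y(x_j) above y_j
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return a.value, c.value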

3.2 Model integrating central tendency and possibilistic property

In this section, a new objective function is introduced that simultaneously minimizes the sum of squared spreads of the estimated outputs and the sum of squared distances between the estimated output centers and the observed outputs, as follows:

J = k_1 \sum_{j=1}^{p} (y_j - a^t x_j)^2 + k_2 c^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) c ,    (30)

where \sum_{j=1}^{p} (y_j - a^t x_j)^2 corresponds to the least-squares estimation, and k_1 and k_2 are positive weight coefficients.

Using the new objective function (30), interval regression analysis determines the interval coefficients A_i = (a_i, c_i) (i = 0, \ldots, n) that minimize (30) subject to the constraints y_j \in Y(x_j) (j = 1, \ldots, p), which can be expressed as the following QP problem [11]:


\min_{a,c}  k_1 \sum_{j=1}^{p} (y_j - a^t x_j)^2 + k_2 c^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) c
s.t.  a^t x_j - c^t |x_j| \le y_j ,   (j = 1, \ldots, p),
      a^t x_j + c^t |x_j| \ge y_j ,   (j = 1, \ldots, p),
      c \ge 0 .    (31)
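Relative to (29), only the objective changes. A cvxpy sketch (again with illustrative names) differs from the previous one in a single line:

import numpy as np
import cvxpy as cp

def qp_interval_regression_central(X, y, k1=1.0, k2=1.0):
    """Sketch of model (31): trade off central tendency (least squares on the
    estimated centers) against the possibilistic spread term, weighted by k1, k2."""
    p, n = X.shape
    Xd = np.hstack([np.ones((p, 1)), X])
    A = np.abs(Xd)
    a = cp.Variable(n + 1)
    c = cp.Variable(n + 1, nonneg=True)
    objective = k1 * cp.sum_squares(y - Xd @ a) + k2 * cp.sum_squares(A @ c)
    constraints = [Xd @ a - A @ c <= y,
                   Xd @ a + A @ c >= y]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return a.value, c.value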

Likewise, when a data set with crisp inputs and interval outputs is given, we can consider two interval regression models, i.e., the upper and the lower models, as QP problems. In order to guarantee that Y_*(x_j) \subseteq Y_j \subseteq Y^*(x_j) for any arbitrary x_j, it is convenient to assume that

A_{*i} = (a_i, c_i) ,  \quad A_i^* = (a_i, c_i + d_i) ,  \quad d_i \ge 0 ,   (i = 0, \ldots, n),    (32)

i.e., the upper and lower models share the same center coefficients, and the upper spreads exceed the lower spreads by d_i.

The objective function is introduced as the following quadratic function:

J = d^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) d ,    (33)

which is the sum of squared spread differences between the upper and lower regression models. Therefore, interval regression analysis determines the interval coefficients A_i^* and A_{*i} (i = 0, \ldots, n) that minimize the objective function (33) and satisfy the inclusion relations Y_*(x_j) \subseteq Y_j \subseteq Y^*(x_j) (j = 1, \ldots, p), which can be described as the following QP problem:

\min_{a,c,d}  J = d^t \Bigl( \sum_{j=1}^{p} |x_j| |x_j|^t \Bigr) d + \xi (a^t a + c^t c)
s.t.  a^t x_j + c^t |x_j| + d^t |x_j| \ge y_j + e_j ,
      a^t x_j - c^t |x_j| - d^t |x_j| \le y_j - e_j ,
      a^t x_j + c^t |x_j| \le y_j + e_j ,
      a^t x_j - c^t |x_j| \ge y_j - e_j ,   (j = 1, \ldots, p),
      c \ge 0 ,  d \ge 0 ,    (34)

where \xi is a small positive number. As before, \xi (a^t a + c^t c) is added to the objective function (33) so that (34) becomes a strictly convex quadratic programming problem, the quadratic form being positive definite with respect to the decision variables a, c and d. This approach for obtaining the upper and lower regression models is called a unified approach [6].
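A cvxpy sketch of the unified QP (34), again with illustrative names; the inputs are crisp and the observed outputs are intervals [y_j - e_j, y_j + e_j]:

import numpy as np
import cvxpy as cp

def unified_qp_interval_regression(X, y, e, xi=1e-6):
    """Sketch of the unified QP (34): a shared center vector a, lower spreads c,
    and extra upper spreads d, so that Y_*(x_j) lies inside Y_j and Y_j lies
    inside Y^*(x_j)."""
    p, n = X.shape
    Xd = np.hstack([np.ones((p, 1)), X])
    A = np.abs(Xd)
    a = cp.Variable(n + 1)
    c = cp.Variable(n + 1, nonneg=True)
    d = cp.Variable(n + 1, nonneg=True)
    objective = cp.sum_squares(A @ d) + xi * (cp.sum_squares(a) + cp.sum_squares(c))
    constraints = [Xd @ a + A @ c + A @ d >= y + e,   # upper model covers Y_j
                   Xd @ a - A @ c - A @ d <= y - e,
                   Xd @ a + A @ c <= y + e,           # lower model inside Y_j
                   Xd @ a - A @ c >= y - e]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    # the upper spreads are c + d, the lower spreads are c, with shared center a
    return a.value, c.value, d.value

With the data of Table 1 (and a squared input appended to X for model (38)), this is the kind of computation that produces models such as (36)-(40); the exact numbers depend on the solver and on the choice of \xi.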

3.3 Numerical examples

The data set of crisp inputs and interval outputs is shown in Table 1.

Table 1. Data set with crisp inputs and interval outputs

No. (j)      1         2          3         4         5         6         7         8
Input (x)    1         2          3         4         5         6         7         8
Output (Y)   [15,30]   [20,37.5]  [15,35]   [25,60]   [25,55]   [40,65]   [55,95]   [70,100]

The linear interval model for the given inputs and outputs is assumed to be

Y(x) = A_0 + A_1 x .    (35)

We obtained the upper model Y^*(x) and the lower model Y_*(x) as

Y^*(x) = (7.311, 11.856) + (8.371, 2.462) x ,    (36)
Y_*(x) = (7.311, 0.168) + (8.371, 0.514) x ,    (37)

which are depicted in Fig. 1, where the outer two lines represent the upper model Y^*(x) and the inner two lines represent the lower model Y_*(x). Let us change the linear model (35) into the following non-linear model:

Y(x) = A_0 + A_1 x + A_2 x^2 .    (38)

The upper model Y^*(x) and the lower model Y_*(x) are obtained as

Y^*(x) = (10.463, 13.241) + (5.648, 1.944) x + (0.370, 0) x^2 ,    (39)
Y_*(x) = (10.463, 1.204) + (5.648, 1.019) x + (0.370, 0) x^2 ,    (40)

which are depicted in Fig. 2, where the outer two lines represent the upper model Y^*(x) and the inner two lines represent the lower model Y_*(x). It is evident from Fig. 1 and Fig. 2 that the polynomial (38) brings the upper and lower regression models closer together than the linear function (35).


Fig. 1. The upper and lower models based on (35).


Fig. 2. The upper and lower models based on (38).

4. Identification methods of upper and lower possibility distributions

Let us begin with the given data (x_i, h_i) (i = 1, \ldots, m), where x_i = [x_{i1}, \ldots, x_{in}]^t is a vector of returns of n securities S_j (j = 1, \ldots, n) at the ith period, and h_i is an associated possibility grade given by expert knowledge to reflect the degree of similarity between the future state of the stock market and the state of the ith sample.


Assume that the grades h_i associated with the security data x_i (i = 1, \ldots, m) are expressed by an exponential possibility distribution A defined as

\Pi_A(x) = \exp\{ -(x - a)^t D_A^{-1} (x - a) \} ,    (41)

where a = [a_1, a_2, \ldots, a_n]^t is a center vector and D_A is a symmetric positive definite matrix, that is, D_A > 0; D_A is called a distribution matrix.

Given the data, the problem is to determine an exponential possibility distribution (41), i.e., a center vector a and a symmetric positive definite matrix D_A. The center vector a can be approximately estimated as

a = x_{i^*} ,    (42)

where x_{i^*} has the maximum grade, h_{i^*} = \max_{k=1,\ldots,m} h_k. The possibility grade associated with x_{i^*} is revised to be 1 because x_{i^*} is regarded as the center vector. Taking the transformation y = x - a, the possibility distribution with a zero center vector is obtained as

\Pi_A(y) = \exp\{ -y^t D_A^{-1} y \} .    (43)
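The center estimate (42) and the shift (43) are straightforward to implement; a small numpy sketch with illustrative names:

import numpy as np

def center_and_shift(X, h):
    """Estimate the center vector a as the sample with the maximum possibility
    grade (42), revise that grade to 1, and shift the data, y_i = x_i - a (43)."""
    h = np.asarray(h, dtype=float).copy()
    i_star = int(np.argmax(h))
    a = X[i_star].copy()
    h[i_star] = 1.0                 # the center is given full possibility
    Y = X - a                       # zero-centered data used in (44)
    return a, Y, h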

According to two different viewpoints, two kinds of possibility distributions of A, namely upper and lower possibility distributions, are introduced to reflect the data from the upper and lower directions. The upper and lower possibility distributions, denoted \Pi_u and \Pi_l, with associated distribution matrices D_u and D_l, respectively, should satisfy the inequality \Pi_u(x) \ge \Pi_l(x). The upper possibility distribution is the one that minimizes the objective function \Pi_u(y_1) \times \cdots \times \Pi_u(y_m) subject to the constraint conditions \Pi_u(y_i) \ge h_i, and the lower possibility distribution is the one that maximizes the objective function \Pi_l(y_1) \times \cdots \times \Pi_l(y_m) subject to the constraint conditions \Pi_l(y_i) \le h_i. Thus, the following optimization problem is introduced for seeking the possibility distribution matrices D_u and D_l:

\min  \sum_{i=1}^{m} y_i^t D_l^{-1} y_i - \sum_{i=1}^{m} y_i^t D_u^{-1} y_i
s.t.  y_i^t D_u^{-1} y_i \le -\ln h_i ,
      y_i^t D_l^{-1} y_i \ge -\ln h_i ,   i = 1, \ldots, m,
      D_u - D_l \ge 0 ,
      D_l > 0 .    (44)

Here, by (41), minimizing \Pi_u(y_1) \times \cdots \times \Pi_u(y_m) and maximizing \Pi_l(y_1) \times \cdots \times \Pi_l(y_m) are transformed into maximizing \sum_{i=1}^{m} y_i^t D_u^{-1} y_i and minimizing \sum_{i=1}^{m} y_i^t D_l^{-1} y_i, respectively. Likewise, the constraint conditions \Pi_u(y_i) \ge h_i and \Pi_l(y_i) \le h_i are equivalent to y_i^t D_u^{-1} y_i \le -\ln h_i and y_i^t D_l^{-1} y_i \ge -\ln h_i, respectively. In order to ensure that \Pi_u(y) \ge \Pi_l(y) holds for an arbitrary y, the condition D_u - D_l \ge 0 is introduced into (44). \Pi_u(y) and \Pi_l(y) are similar to the rough set concept shown in Fig. 3: the inconsistent knowledge represented by the irregular relation between h_i and y_i is approximated by two exponential functions from the upper and lower directions, called the upper and lower possibility distributions, which play a role similar to the upper and lower approximations of a set.

It is obvious that (44) is a nonlinear optimization problem which is difficult to solve directly. In order to solve (44) more easily, we use principal component analysis (PCA) to rotate the given data (y_i, h_i) so that a positive definite distribution matrix can be obtained. The columns of the transformation matrix T are the eigenvectors of the matrix \Sigma = [\sigma_{ij}], where \sigma_{ij} is defined as

\sigma_{ij} = \Bigl\{ \sum_{k=1}^{m} (x_{ki} - a_i)(x_{kj} - a_j) h_k \Bigr\} \Big/ \sum_{k=1}^{m} h_k ,    (45)

which is similar to a weighted covariance.
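A numpy sketch of the rotation step: the weighted covariance (45), its eigenvectors as the columns of T, and the rotated data z_i = T^t y_i of (46); names are illustrative.

import numpy as np

def pca_rotation(X, h, a):
    # Weighted covariance (45), eigenvector matrix T, and rotated data Z.
    h = np.asarray(h, dtype=float)
    D = X - a                                    # deviations from the center vector
    sigma = (D.T * h) @ D / h.sum()              # sigma_ij of (45)
    _, T = np.linalg.eigh(sigma)                 # columns of T are eigenvectors
    Z = (X - a) @ T                              # row i is z_i = T^t y_i
    return sigma, T, Z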

Using this linear transformation, the data y can be transformed into z = T^t y. Then we have

\Pi_A(z) = \exp\{ -z^t (T^t D_A^{-1} T) z \} .    (46)

According to the property of PCA, T^t D_A^{-1} T is assumed to be a diagonal matrix,

C_A = T^t D_A^{-1} T = \mathrm{diag}(c_{A1}, \ldots, c_{An}) .    (47)


Fig. 3. Graphical explanation of the upper and lower distributions (the upper and lower curves are the upper and lower possibility distributions, respectively; the given possibility grades lie between these two curves).

Denote C_A by C_u and C_l for the upper and lower possibility distributions, respectively, and denote by c_{uj} and c_{lj} (j = 1, \ldots, n) the diagonal elements of C_u and C_l, respectively. The model (44) can then be rewritten as follows:

\min  \sum_{i=1}^{m} z_i^t C_l z_i - \sum_{i=1}^{m} z_i^t C_u z_i
s.t.  z_i^t C_l z_i \ge -\ln h_i ,
      z_i^t C_u z_i \le -\ln h_i ,   i = 1, \ldots, m,
      c_{uj} \ge \varepsilon ,
      c_{lj} \ge c_{uj} ,   j = 1, \ldots, n,    (48)

where \varepsilon is a very small positive value; the condition c_{lj} \ge c_{uj} \ge \varepsilon > 0 makes the matrix D_u - D_l positive semi-definite and the matrices D_u and D_l positive definite. Thus, we have


D_u = T C_u^{-1} T^t ,   D_l = T C_l^{-1} T^t .    (49)

It can be proved that the matrices C_u and C_l in the linear programming (LP) problem (48) always exist (see [8]).
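Because C_u and C_l are diagonal, (48) is an ordinary LP in their diagonal entries. A sketch with scipy.optimize.linprog, reusing the rotated data Z from the PCA step (names illustrative):

import numpy as np
from scipy.optimize import linprog

def upper_lower_distributions(Z, h, eps=1e-6):
    """Sketch of the LP (48): diagonal matrices C_u and C_l for the upper and
    lower exponential possibility distributions in the rotated coordinates."""
    h = np.asarray(h, dtype=float)
    m, n = Z.shape
    Z2 = Z ** 2
    s = Z2.sum(axis=0)
    neg_log_h = -np.log(h)
    obj = np.concatenate([-s, s])                    # min  s'c_l - s'c_u
    A_ub = np.vstack([
        np.hstack([Z2, np.zeros((m, n))]),           # z_i' C_u z_i <= -ln h_i
        np.hstack([np.zeros((m, n)), -Z2]),          # z_i' C_l z_i >= -ln h_i
        np.hstack([np.eye(n), -np.eye(n)]),          # c_uj <= c_lj
    ])
    b_ub = np.concatenate([neg_log_h, -neg_log_h, np.zeros(n)])
    bounds = [(eps, None)] * (2 * n)                 # c_uj, c_lj >= eps
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    c_u, c_l = res.x[:n], res.x[n:]
    return np.diag(c_u), np.diag(c_l)

The distribution matrices then follow from (49): D_u = T C_u^{-1} T^t and D_l = T C_l^{-1} T^t.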

Similar to regression analysis, we can define the measure of fitness \eta as

\eta = \frac{1}{m} \sum_{i=1}^{m} \frac{\Pi_l(x_i)}{\Pi_u(x_i)} .    (50)


Figure 4. The upper and lower possibility distributions.

Numerical example

The data for the possibility portfolio problem are given in Table 2. Using the proposed identification approach described above, we obtained

a = [0.154, 0.176]^t ,    (51)


D_u = \begin{bmatrix} 0.2665 & 0.0972 \\ 0.0972 & 0.1689 \end{bmatrix} , \qquad D_l = \begin{bmatrix} 0.0313 & 0.0165 \\ 0.0165 & 0.0148 \end{bmatrix} .

Using the formulations (48) and (49), we obtained the two possibility distributions shown in Figure 4, where the outer ellipse is the h = 0.5 contour of the upper possibility distribution and the inner ellipse is that of the lower one. From (50), we obtained \eta = 0.226.

Table 2. Return rates on two securities and possibility degrees.

h_i      year          #1 Am.T    #2 A.T.&T.
0.2      1977 (1)      -0.305     -0.173
0.241    1978 (2)       0.513      0.098
0.282    1979 (3)       0.055      0.2
0.324    1980 (4)      -0.126      0.03
0.365    1981 (5)      -0.28      -0.183
0.406    1982 (6)      -0.003      0.067
0.447    1983 (7)       0.428      0.3
0.488    1984 (8)       0.192      0.103
0.529    1985 (9)       0.446      0.216
0.571    1986 (10)     -0.088     -0.046
0.612    1987 (11)     -0.127     -0.071
0.653    1988 (12)     -0.015      0.056
0.694    1989 (13)      0.305      0.038
0.735    1990 (14)     -0.096      0.089
0.776    1991 (15)      0.016      0.09
0.818    1992 (16)      0.128      0.083
0.859    1993 (17)     -0.01       0.035
0.9      1994 (18)      0.154      0.176

5. Similarities between the proposed models and rough sets

Let a set X \subseteq U be given. The upper approximation of X in A, denoted A^*(X), is the least definable set containing X, and the lower approximation of X in A, denoted A_*(X), is the greatest definable set contained in X. The upper approximation A^*(X) and the lower approximation A_*(X) can be defined as

A^*(X) = \bigcup_{E_i \cap X \ne \emptyset} E_i ,  \qquad  A_*(X) = \bigcup_{E_i \subseteq X} E_i ,    (52)

where E_i is the ith elementary set in A. An accuracy measure of a set X in the approximation space A = (U, R) is defined as

\alpha_A(X) = \frac{\mathrm{Card}(A_*(X))}{\mathrm{Card}(A^*(X))} ,    (53)

where \mathrm{Card}(A_*(X)) is the cardinality of A_*(X). When the classification C(U) = \{X_1, \ldots, X_n\} is given, the accuracy of the classification C(U) is defined as

\beta_A(U) = \mathrm{Card}\Bigl( \bigcup_j A_*(X_j) \Bigr) \Big/ \mathrm{Card}\Bigl( \bigcup_j A^*(X_j) \Bigr) ,    (54)

whose concept is used to define the measure of fitness in interval regression analysis and in the identification methods of exponential possibility distributions.
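For completeness, the rough-set quantities in Table 3 can be computed directly from the elementary sets. A small Python sketch of (52) and (53) with illustrative names:

def rough_approximations(elementary_sets, X):
    """Upper and lower approximations (52) of a set X from the elementary sets
    E_i, together with the accuracy measure (53)."""
    X = set(X)
    upper, lower = set(), set()
    for E in map(set, elementary_sets):
        if E & X:                 # E_i intersects X -> part of the upper approximation
            upper |= E
        if E <= X:                # E_i contained in X -> part of the lower approximation
            lower |= E
    accuracy = len(lower) / len(upper) if upper else 1.0
    return lower, upper, accuracy

# Example: U partitioned into elementary sets, X a target set.
E = [{1, 2}, {3, 4}, {5}, {6, 7, 8}]
lower, upper, alpha = rough_approximations(E, {2, 3, 4, 5})
# lower = {3, 4, 5}, upper = {1, 2, 3, 4, 5}, alpha = 0.6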

Table 3. Similarities between rough sets, possibility distributions and regression models

Possibility distributions                          Rough sets                                      Interval regression model
Upper distribution: \Pi_u(x)                       Upper approximation: A^*(X)                     Upper model: Y^*(x)
Lower distribution: \Pi_l(x)                       Lower approximation: A_*(X)                     Lower model: Y_*(x)
Spread of \Pi_u(x): \sum_i \Pi_u(x_i)              Cardinality of A^*(X): Card(A^*(X))             Spread of Y^*(x): c^{*t}|x|
Spread of \Pi_l(x): \sum_i \Pi_l(x_i)              Cardinality of A_*(X): Card(A_*(X))             Spread of Y_*(x): c_*^t|x|
Inequality relation: \Pi_u(x_i) \ge \Pi_l(x_i)     Inclusion relation: A^*(X) \supseteq A_*(X)     Inclusion relation: Y^*(x) \supseteq Y_*(x)
Measure of fitness:                                Accuracy measure of a set X:                    Measure of fitness:
(1/m) \sum_{i=1}^{m} \Pi_l(x_i)/\Pi_u(x_i)         Card(A_*(X))/Card(A^*(X))                       (1/p) \sum_{j=1}^{p} c_*^t|x_j| / c^{*t}|x_j|

Furthermore, the upper and lower approximations of X, A^*(X) and A_*(X), correspond to the upper and lower approximation models in regression analysis and in the identification methods of possibility distributions. Thus, we can summarize the similarities between our models and rough sets as in Table 3 [12].

References

1. Dubois, D. and Prade, H. (1988) Possibility Theory. Plenum Press, New York
2. Guo, P. and Tanaka, H. (1998) Possibilistic data analysis and its application to portfolio selection problems. Fuzzy Economic Review 3/2, 3-23
3. Pawlak, Z. (1982) Rough sets. Int. J. Information and Computer Sciences 11, 341-356
4. Pawlak, Z. (1984) Rough classification. Int. J. Man-Machine Studies 20, 469-483
5. Tanaka, H., Guo, P. and Turksen, B. (2000) Portfolio selection based on fuzzy probabilities and possibility distributions. Fuzzy Sets and Systems 111, 387-397
6. Tanaka, H., Hayashi, I. and Watada, J. (1989) Possibilistic linear regression analysis for fuzzy data. European J. of Operational Research 40, 389-396
7. Tanaka, H. and Ishibuchi, H. (1991) Identification of possibilistic linear systems by quadratic membership functions of fuzzy parameters. Fuzzy Sets and Systems 41, 145-160
8. Tanaka, H. and Guo, P. (1999) Portfolio selections based on upper and lower exponential possibility distributions. European J. of Operational Research 114, 115-126
9. Tanaka, H. and Guo, P. (1999) Possibilistic Data Analysis for Operations Research. Physica-Verlag, Heidelberg; New York
10. Tanaka, H. and Ishibuchi, H. (1993) Evidence theory of exponential possibility distributions. Int. J. of Approximate Reasoning 8, 123-140
11. Tanaka, H. and Lee, H. (1998) Interval regression analysis by quadratic programming approach. IEEE Transactions on Fuzzy Systems 6, 473-481
12. Tanaka, H., Lee, H. and Guo, P. (1998) Possibility data analysis with rough set concept. Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, 117-122
13. Zadeh, L. A. (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3-28