non-symmetrical factorial discriminant analysis for symbolic objects

*Correspondence to: Francesco Palumbo, Dipartimento di Istituzioni Economiche e Finanziarie, Università di Macerata, Via Crescimbeni 14, I-62100 Macerata, Italy.

Contract/grant sponsor: Esprit; contract/grant number: 20821

CCC 1524-1904/99/040419-09$17.50. Copyright © 1999 John Wiley & Sons, Ltd. Received June 1997; Revised May 1999.

APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY
Appl. Stochastic Models Bus. Ind. 15, 419-427 (1999)

NON-SYMMETRICAL FACTORIAL DISCRIMINANT ANALYSIS FOR SYMBOLIC OBJECTS

FRANCESCO PALUMBO1,* AND ROSANNA VERDE2

1Dipartimento di Istituzioni Economiche e Finanziarie, Università di Macerata, Via Crescimbeni 14, I-62100 Macerata, Italy

2Facoltà di Economia, Seconda Università di Napoli, Piazza Umberto I, I-81043 Capua, Italy

SUMMARY

In this paper we propose a generalization of factorial discriminant analysis (FDA) to complex data structures named Symbolic Objects. We assume that the a priori classes are defined by an equal number of intention symbolic objects. The paper proposes a three-step discrimination procedure: symbolic data are coded in suitable numerical matrices; the coded variables are transformed into canonical variables; and the symbolic objects are visualized by building maximum covering area rectangles with respect to the canonical variables. Referring to the graphical representation, geometrical rules are proposed in order to assign new objects to an a priori class on the basis of proximity measures. Copyright © 1999 John Wiley & Sons, Ltd.

KEY WORDS: symbolic data analysis; discrimination and classification; factorial data analysis

1. MOTIVATION AND BASIC DEFINITIONS

Sometimes the real world is too complex to be described by classical numerical data, usually collected in tabular form. A classical data table defines a relation between a set of statistical units and a set of variables, where each variable assumes a single value.

Symbolic objects (SOs) are complex structured data described on the basis of their own properties [1]. Like classical statistical units, they are defined by a set of single-valued variables (numerical and/or categorical), but also by set-valued variables and by logical relations. An SO can also refer to an a priori concept, in which case it is called an intention SO. For example, the concept of tree can be defined by a biologist on the basis of a set of multi-valued variables. A recognition function, associated with the intention SO description, identifies all observed elements that respect its properties. In other words, this rule should be capable of identifying a generic tree as belonging to the concept of tree. The whole set of these elements represents the SO's extension.

Symbolic data analysis (SDA) exploits suitable methods to study the relationships among intention SOs, taking into account their properties.

Given a set of intention SOs and a set of statistical units, we propose a graphical approach aimed at identifying the extension SOs with respect to the given intention SOs. Our proposal is based on a generalization of the non-symmetrical approach to factorial discriminant analysis (FDA) to symbolic data.

Discriminant analysis refers to the statistical methods aimed at 'distinguishing' K groups belonging to the same population on the basis of p explanatory variables. Many authors distinguish between two aspects of discriminant analysis on the basis of its goal: the first deals with the selection of a subset of predictors that represents the a priori groups as distinctly as possible from each other; the second looks for a decision rule to classify a new statistical unit, belonging to the same population, into one of the K groups [2].

The present paper mainly deals with the first aspect. This leads us to tackle the problem from a geometrical point of view. The geometrical approach looks for a set of new variables (Canonical Variables), obtained by means of linear transformations of the p predictors, such that the K groups are well separated. These new variables are called discriminant factors. As a final result, the FDA furnishes the images of the statistical units on the discriminant factorial axes.

Moreover, Canonical Variables can also be used to define a classification rule. Classification rules are based either on proximity measures, or on parametric and non-parametric probabilistic approaches.

In FDA there are two data sets extracted from the same population: the Training set and the Test set. The first is used to define the discriminant factors and the classification rule. The classification (or separability) power of the new discriminant variables and of the rule is generally measured using the second set of elements.

The purpose of this paper is to generalize the non-symmetrical factorial discriminant analysis (NS-FDA) approach, proposed by Verde and Palumbo [3], to symbolic data. The NS-FDA takes as its criterion the decomposition of the explanatory power of a set of categorical descriptors with respect to the a priori classes. In this context we assume that the K classes are represented by an equal number of intention SOs and constitute the training set. The test set, in contrast, can be formed by SOs or by simple statistical units.

The application of the NS-FDA to SOs requires a suitable numerical transformation of the SO descriptors, such that their relations are preserved. Nevertheless, the results of the analysis will be interpreted in symbolic terms, consistently with this kind of data. In this sense, the proposed approach can be considered as a symbolic-numerical-symbolic procedure. The procedure involves three steps:

(i) numerical binary or fuzzy coding of the SO descriptors, depending on the kind of variables. Each SO is decomposed into a set of simple statistical units;

(ii) NS-FDA based on the maximization of the separability of the coded training-set SOs, according to a predictivity measure strictly related to the Goodman-Kruskal $\tau$ index [4]. The SOs are then visualized on the discriminant factorial subspace by means of maximum covering area rectangles (mcar) [5]. In the same way, the test-set SOs can be visualized on the same factorial plane as supplementary elements;

(iii) in order to validate the discrimination power of the factorial axes, geometrical classification rules, based on proximity measures between SO images, are defined. In particular, a generalized Minkowski metric [6] and a new dissimilarity measure based on the minimum descriptor potential increase are considered.

1.1. Symbolic objects definition

Let $E = \{u_1, \ldots, u_n\}$ be a universe of individuals (or elementary objects) described by a set of set-valued variables $Y_j$ ($j = 1, \ldots, p$).

A basic kind of SO is an event. The $j$th event is denoted by $e_j = [Y_j = D_j]$, where $Y_j : E \to \mathcal{Y}_j$ and $D_j \subseteq \mathcal{Y}_j$.

The domains $\mathcal{Y}_j$ of the variables $Y_j$ differ according to the kind of variable: intervals of real values for continuous variables; a set of admissible categories for multi-nominal variables; and weighted categories for modal variables. The weights in the domain of a modal variable can be represented by frequencies, probabilities or belief values.

An assertion SO is also defined on the basis of a recognition function $q(u)$ comparing the values of the individuals $u \in E$ with the set of descriptions $D_j$, by means of relations $R_j$ (with $j = 1, \ldots, p$). We assume that $q(u)$ is a Boolean function. Therefore, the extension of an SO is the set of elements of $E$ such that $Y_j(u) \in D_j$, $\forall j$. If $u$ satisfies all the conditions defined in $q$, then the logical function $q(u) = \text{TRUE}$.

A conjunction of events $e_j$ defines a Boolean assertion object $q$:

$$
q := \bigwedge_{j=1}^{p} [Y_j = D_j] = \bigwedge_{j=1}^{p} e_j \tag{1}
$$
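As a minimal sketch of how a Boolean assertion and its extension can be evaluated, the snippet below stores each description $D_j$ either as a closed interval or as a set of admissible categories and takes the conjunction of the events. The variable names, the `tree` description and the three individuals are hypothetical illustrations, not data from the paper.

```python
# Sketch of a Boolean assertion q = AND_j [Y_j = D_j] and of its extension.
# A description D_j is a closed interval (lo, hi) for a continuous variable
# or a set of admissible categories for a nominal one.
# All names and values below are illustrative assumptions.

def event_holds(value, description):
    """Return True when Y_j(u) belongs to D_j."""
    if isinstance(description, tuple):       # interval [lo, hi]
        lo, hi = description
        return lo <= value <= hi
    return value in description              # set of admissible categories

def recognise(assertion, individual):
    """Recognition function q(u): conjunction of all events e_j."""
    return all(event_holds(individual[name], d) for name, d in assertion.items())

def extension(assertion, universe):
    """Extension of the SO: labels of the individuals u in E with q(u) = TRUE."""
    return [label for label, u in universe.items() if recognise(assertion, u)]

tree = {"height_m": (2.0, 60.0), "trunk": {"woody"}}
E = {
    "oak":   {"height_m": 25.0, "trunk": "woody"},
    "tulip": {"height_m": 0.4,  "trunk": "herbaceous"},
    "shrub": {"height_m": 1.5,  "trunk": "woody"},
}
print(extension(tree, E))  # ['oak']
```

Only the individual that satisfies every event belongs to the extension; failing a single event is enough for $q(u)$ to be FALSE.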

Classes of SOs, or second-order SOs, can be obtained by generalization of a set of individual SOs; as a consequence, the description of a generalized SO contains the descriptions of the first-order SOs.

In our context, the training set is assumed to be made of $K$ intention SOs $\{C_1, \ldots, C_k, \ldots, C_K\}$; however, without loss of generality, it can also be assumed that the training set is made of classes of lower-order SOs. The test set can consist either of simple observations or of lower-order symbolic assertion objects, e.g. $u \in E$.

2. NUMERICAL CODING OF SOs DESCRIPTORS

Since statistical methods cannot treat set-valued variables, we propose to code the categorical variables as binary ones. Specifically, if probability values are associated with the categories of these descriptors, they are directly taken as coding values. Quantitative descriptors are coded in a fuzzy way by means of semi-linear functions. Using this kind of coding, it is necessary to define the number of coding functions and the bounds of the coding intervals, while their co-domain is fixed at [0, 1]. Moreover, the fuzzy coding of numerical variables transforms quantitative descriptors into categorical-like ones, which homogenizes the nature of the variables while retaining the numerical variability of the data. Adopting semi-linear functions, each quantitative value is transformed into a set of membership degrees, or fuzzy coding values [7, 8].
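The fuzzy coding step can be illustrated with a short sketch. It assumes triangular (piecewise-linear) coding functions centred on a list of knots, which is one common choice of semi-linear function; the exact shape of the functions, the function name `fuzzy_code` and the knot values are assumptions made for illustration only.

```python
# Sketch of fuzzy coding with piecewise-linear (triangular) functions.
# Assumption: one coding function per knot, equal to 1 at its own knot and
# decreasing linearly to 0 at the neighbouring knots. Knots must be strictly
# increasing and there must be at least two of them.

def fuzzy_code(x, knots):
    """Code x into len(knots) membership degrees in [0, 1] that sum to 1."""
    x = min(max(x, knots[0]), knots[-1])          # clamp to the coded range
    degrees = [0.0] * len(knots)
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        if left <= x <= right:
            share = (x - left) / (right - left)   # position between two knots
            degrees[i] = 1.0 - share
            degrees[i + 1] = share
            break
    return degrees

knots = [50.0, 100.0, 150.0]          # hypothetical bounds: low, medium, high
print(fuzzy_code(75.0, knots))        # [0.5, 0.5, 0.0]
print(fuzzy_code(150.0, knots))       # [0.0, 0.0, 1.0]
```

For an interval-valued descriptor, the same function would be applied to each bound of the interval, giving one coded row per bound.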

The coded variables are collected in coding matrices, denoted by $Z_j$ (with $j = 1, \ldots, p$).

In order to take into account the relations among descriptors, all the numerical coding matrices $Z_j$ referring to the same SO are combined by means of the Cartesian product of their rows.



Let $Z$ ($N \times G$) be the coding matrix obtained by juxtaposing the $p$ coding matrices $Z_j$ of the numerically coded descriptors for all $K$ training-set SOs. The number of columns $G$ corresponds to the total number of coding categories of the descriptors, while the number of rows $N$ depends on the number of continuous interval variables and is proportional to the number of categories of the multi-nominal variables [9]. Precisely, it is given by:

$$
N = \sum_{i=1}^{K} \left[ 2^{h_i} \times \prod_{j=1}^{l} t_{ij} \right],
$$

where $h_i$ is the number of interval variables, $t_{ij}$ is the number of categories of the $j$th nominal variable $Y_j$ and $l$ is the number of multi-valued nominal variables.

In order to refer the rows of the matrix $Z$ to the different SOs, we define the indicator matrix $X_{N,K}$.

From a geometrical point of view, the rows of $Z$ represent the vertices of the hypercubes associated with the SOs in the space $\mathbb{R}^G$.
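The row count $N$ and the indicator matrix can be made concrete with a small sketch. It enumerates, for each SO, every combination of one bound per interval variable and one admissible category per nominal variable, before any numerical coding is applied. The two SOs below and all function names are hypothetical.

```python
# Sketch: decomposing SOs into vertices and counting the rows of Z.
# Each SO is given as (intervals, categories): a list of (lo, hi) pairs and a
# list of category lists. The example SOs are illustrative assumptions.
from itertools import product
from math import prod

def vertices(intervals, categories):
    """All combinations of one bound per interval and one category per nominal variable."""
    return list(product(*intervals, *categories))

def row_count(symbolic_objects):
    """N = sum_i 2**h_i * prod_j t_ij."""
    return sum(2 ** len(iv) * prod(len(c) for c in cats)
               for iv, cats in symbolic_objects)

so_a = ([(0.0, 1.0), (10.0, 20.0)], [["low", "medium"]])   # h = 2, t = 2
so_b = ([(2.0, 3.0), (5.0, 6.0)], [["high"]])              # h = 2, t = 1
sos = [so_a, so_b]

# Indicator matrix X (N x K): row r has a 1 in the column of the SO it comes from.
X = [[1 if k == i else 0 for k in range(len(sos))]
     for i, (iv, cats) in enumerate(sos)
     for _ in vertices(iv, cats)]

print(len(vertices(*so_a)), len(vertices(*so_b)), row_count(sos))  # 8 4 12
```

Each tuple returned by `vertices` would then be coded numerically (binary or fuzzy) to give one row of $Z$.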

3. INTENTION SO'S REPRESENTATION IN A FACTORIAL SUBSPACE

The factorial discriminant analysis (FDA) we propose falls within the framework of Categorical FDA, because all descriptors have been coded into fuzzy or binary categorical variables. It decomposes a quantity strictly related to a dependence association index: the Goodman-Kruskal $\tau$ index [4].

This approach was already proposed as a generalisation of principal component analysis onto a reference subspace (PCAR), and was presented as a descriptive factorial analysis by Lauro and D'Ambra [10] and by D'Ambra and Lauro [11, 12]. It was named non-symmetrical correspondence analysis (NSCA). Afterwards, Palumbo proposed the NSCA in the FDA context. We refer to Palumbo [13] and to Verde and Palumbo [3] for a more complete treatment.

Let us consider the special case in which there is the dependent variable $X$ and only one independent variable $Y_j$, with $K$ and $q$ categories respectively, observed on $n$ statistical units. The matrices $X_{n,K}$ and $Z_{n,q}$ represent the variables in binary coding. Notice that $P_{Z_j} = Z_j (Z_j' Z_j)^{-1} Z_j'$ is an orthogonal projector, so that the image of $X$ in the space spanned by the orthogonal column vectors of $Z_j$ is given by $P_{Z_j} X$. The NSCA solves the following eigenanalysis:

$$
\frac{1}{n} \left[ X' P_{Z_j} X - \frac{1}{n} X' \mathbf{1}\mathbf{1}' X \right] u_\alpha = \lambda_\alpha u_\alpha \tag{2}
$$

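Equation (2) admits $K-1$ real and distinct solutions ($1 \le \alpha \le K-1$), where $\lambda_\alpha$ is the generic eigenvalue, $u_\alpha$ is the associated eigenvector and $\mathbf{1}$ is the $n \times 1$ unit vector. Solutions are obtained under the general orthonormality constraints:

$$
u_\alpha' u_{\alpha'} = \begin{cases} 0 & \forall \alpha \neq \alpha' \\ 1 & \forall \alpha = \alpha' \end{cases}
$$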



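The Goodman-Kruskal $\tau$ index can be expressed in matrix notation by the following formula:

$$
\tau_{XZ} = \frac{\mathrm{tr}\left[ X' Z_j (Z_j' Z_j)^{-1} Z_j' X - (1/n) X' \mathbf{1}\mathbf{1}' X \right]}{\mathrm{tr}\left[ X' X - (1/n) X' \mathbf{1}\mathbf{1}' X \right]} \tag{3}
$$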

Note that the numerator in expression (3) corresponds to the matrix on the left-hand side of expression (2). Furthermore, the denominator in expression (3) involves only the dependent variable $X$, which is fixed by the a priori classes, and does not depend on the independent variable $Z_j$, so that $\tau$ can be maximized by acting on the numerator alone.
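Taking expression (3) as a ratio of traces, the index can be checked numerically on small examples. For binary indicator matrices, $Z_j' Z_j$ and $X'X$ are diagonal matrices of category counts and $Z_j' X$ is the contingency table, so both traces reduce to sums over counts. The sketch below uses this reduction; the function name and the toy label vectors are illustrative assumptions, and at least two classes must be observed so that the denominator is non-zero.

```python
# Sketch: Goodman-Kruskal tau for predicting the classes X from one
# categorical predictor Z, computed from counts.
#   tr[X' P_Z X]       = sum_{q,k} n_qk**2 / n_q
#   tr[(1/n) X'11'X]   = sum_k n_k**2 / n
#   tr[X'X]            = n
# The label vectors used below are toy data, not taken from the paper.

def goodman_kruskal_tau(x_labels, z_labels):
    n = len(x_labels)
    classes = sorted(set(x_labels))
    cats = sorted(set(z_labels))
    counts = {(q, k): 0 for q in cats for k in classes}   # contingency table Z'X
    for k, q in zip(x_labels, z_labels):
        counts[(q, k)] += 1
    n_q = {q: sum(counts[(q, k)] for k in classes) for q in cats}   # diag(Z'Z)
    n_k = {k: sum(counts[(q, k)] for q in cats) for k in classes}   # diag(X'X)
    tr_projected = sum(counts[(q, k)] ** 2 / n_q[q] for q in cats for k in classes)
    tr_mean = sum(v ** 2 for v in n_k.values()) / n
    return (tr_projected - tr_mean) / (n - tr_mean)

x = ["a", "a", "b", "b"]
print(goodman_kruskal_tau(x, ["u", "u", "v", "v"]))  # 1.0 (Z predicts X perfectly)
print(goodman_kruskal_tau(x, ["u", "v", "u", "v"]))  # 0.0 (Z carries no information)
```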

Under the hypothesis of independence among the predictors, the analysis can be generalized to the case of several independent variables (coded in the matrix $Z$), so that the characteristic Equation (2) becomes:

$$
\frac{1}{n} \left[ X' Z \Delta_Z^{-1} Z' X - \frac{p}{n} X' \mathbf{1}\mathbf{1}' X \right] v_\alpha = \mu_\alpha v_\alpha \tag{4}
$$

The additive multiple model in Equation (4) assumes that $Y_1, Y_2, \ldots, Y_p$ are independent predictors. This constraint is imposed by the use of the block-diagonal matrix $\Delta_Z$ instead of the matrix $Z'Z$.

The quantity $\sum_{\alpha} \mu_\alpha = \sum_{j=1}^{p} \hat{\tau}_{XZ_j}$, where $\hat{\tau}$ indicates the numerator of the Goodman-Kruskal $\tau$ index. The co-ordinates of the SO vertices in the subspace spanned by $\{v_1, v_2, \ldots, v_{K-1}\}$ are obtained in two steps: the regression of $z_i$ onto the space of the variable $X$, $\tilde{x}_i = z_i (\Delta_Z^{-1} Z' X)$; and the projection of $\tilde{x}_i$ on $v_\alpha$:

$$
\xi_{i\alpha} = \tilde{x}_i v_\alpha \tag{5}
$$

This approach can be geometrically interpreted as a principal component analysis of the image of the a priori defined intention SOs onto a subspace generated by the sum of the $p$ subspaces spanned by the column vectors of the matrices $Z_j$, under the vertices cohesion constraint given by the matrix $X$.

The hypercubes associated with the classes of SOs are represented on the factorial planes by maximum covering area rectangles, defined as the rectangles covering all the $n_k$ points into which each assertion has been decomposed. The lengths of the rectangle sides are related to the variability of the descriptors that contribute most to the orientation of the axes.

The image of the generic assertion $C_k$ on the factorial hyperplane of dimension $m \le K-1$ is denoted as

$$
\hat{c}_k = \bigwedge_{\alpha=1}^{m} \left[ S_{k\alpha} = [\min \hat{c}_{k\alpha}, \max \hat{c}_{k\alpha}] \right] \tag{6}
$$

where $S_{k\alpha}$ are the new descriptors, and $[\min \hat{c}_{k\alpha}, \max \hat{c}_{k\alpha}]$ represents the [minimum, maximum] co-ordinates of the vertices of the hypercube associated with $C_k$ on the $\alpha$th factorial axis.
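A maximum covering area rectangle is simply the axis-wise range of the projected vertices of one SO. The sketch below assumes the factorial co-ordinates of the vertices are already available as tuples; the four points are hypothetical values chosen for illustration.

```python
# Sketch: maximum covering area rectangle (mcar) of one SO, i.e. the
# [min, max] interval of its projected vertices on each factorial axis.
# The co-ordinates below are illustrative assumptions.

def mcar(points):
    """Return one (min, max) interval per factorial axis."""
    axes = list(zip(*points))             # regroup co-ordinates axis by axis
    return [(min(a), max(a)) for a in axes]

projected_vertices = [(0.2, 1.0), (0.8, 0.4), (0.5, 1.3), (0.1, 0.9)]
print(mcar(projected_vertices))  # [(0.1, 0.8), (0.4, 1.3)]
```

The same function applies unchanged to test-set SOs projected as supplementary points.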

4. DECISION AIM OF THE NS-FDA

According to the decision aim of standard FDA, a second set of SOs, described by the same set of $p$ variables as the training-set SOs, has to be assigned to the $K$ classes. If the factorial discriminant axes separate the classes of SOs well, each SO of the test set will be assigned to the nearest class according to a geometrical rule.

In order to represent the test-set SOs on the same subspace as the classes of SOs, a preliminary transformation and coding of their descriptors, as seen for the training-set SOs, have to be performed. Then, the vertices of the hypercubes associated with this second set of SOs are projected onto the discriminant factors as supplementary points. The generic object $u_s \in E$ is visualized on the factorial plane as a rectangle, and its representation is denoted as:

$$
\hat{u}_s = \bigwedge_{\alpha=1}^{m} \left[ S_{u\alpha} = [\min \hat{u}_{s\alpha}, \max \hat{u}_{s\alpha}] \right] \tag{7}
$$

5. GEOMETRICAL CLASSIFICATION RULES

To define a geometrical classification rule, we propose two approaches based on the descriptor potential measure $\pi(\cdot)$, defined by De Carvalho [14] as the volume of the Cartesian product of the descriptors. The latter can be expressed, with respect to the co-ordinates of the SOs on the factorial axes, as:

$$
\pi(\cdot) = \prod_{\alpha=1}^{m} \mu(S_\alpha) \tag{8}
$$

$$
\mu(S_\alpha) = \mathrm{length}(S_\alpha) \tag{9}
$$

5.1. A classification rule based on a dissimilarity measure

The first decision rule proposed here is based on a dissimilarity measure introduced by Ichino and Yaguchi [15] and extended to symbolic data by De Carvalho and Diday [6].

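In order to classify a test-set SO $u_s$, we consider the dissimilarity measures $d(\hat{u}_s, \hat{c}_k)$ between its image and the images of the classes in the subspace of dimension $m \le (K-1)$, computed as:

$$
d(\hat{u}_s, \hat{c}_k) = \sqrt[r]{\sum_{\alpha=1}^{m} \left\{ \beta_\alpha \psi_\alpha(\hat{u}_s, \hat{c}_k) \right\}^r}, \tag{10}
$$

where $\beta_\alpha > 0$, $\forall \alpha \in \{1, \ldots, m\}$, $\sum_{\alpha=1}^{m} \beta_\alpha = 1$, $r \ge 1$, and

$$
\psi_\alpha(\hat{u}_s, \hat{c}_k) = \frac{\mu(S_{u\alpha} \oplus S_{k\alpha}) - \mu(S_{u\alpha} \otimes S_{k\alpha}) + \gamma \left( 2\mu(S_{u\alpha} \otimes S_{k\alpha}) - \mu(S_{u\alpha}) - \mu(S_{k\alpha}) \right)}{\mu(S_{u\alpha} \oplus S_{k\alpha})}. \tag{11}
$$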

The parameter $\gamma \in [0, 1]$, and $S_{u\alpha}$ is the new descriptor of $u_s$ with respect to the $\alpha$th factorial axis. In order to take into account the different inertia explained by the factorial axes, we assume the weights $\beta_\alpha$ equal to the eigenvalues $\lambda_\alpha$ ($\alpha = 1, \ldots, m$) normalized to sum to 1. The generic object $u_s \in E$ is then assigned to $C_k$ if its dissimilarity measure $d(\hat{u}_s, \hat{c}_k)$ is minimum with respect to the image of $C_k$ on the factorial subspace. In particular, we observe that if $d(\hat{u}_s, \hat{c}_k) = 0$, then $u_s$ is a specialization of $C_k$ with respect to the factorial variables $S_\alpha$ ($\alpha = 1, \ldots, m$).
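The rule can be sketched for interval descriptors, where the join of two intervals is the smallest interval containing both, the meet is their intersection and the measure is the interval length. The values of `gamma` and `r`, the eigenvalues and the rectangles below are assumptions chosen for illustration; the intervals are assumed not to be both reduced to the same single point, so that the join has positive length.

```python
# Sketch of the dissimilarity-based rule for rectangles given as lists of
# (lo, hi) intervals, one per factorial axis. gamma, r, the eigenvalues and
# the example rectangles are illustrative assumptions.

def length(interval):
    lo, hi = interval
    return hi - lo

def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def meet(a, b):
    """Intersection of a and b, or None when they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def psi(a, b, gamma):
    """Normalized component dissimilarity on one factorial axis."""
    m = meet(a, b)
    mu_join = length(join(a, b))
    mu_meet = length(m) if m is not None else 0.0
    phi = mu_join - mu_meet + gamma * (2 * mu_meet - length(a) - length(b))
    return phi / mu_join

def dissimilarity(u_rect, c_rect, eigenvalues, gamma=0.5, r=2):
    """Weighted Minkowski-type combination with weights beta = normalized eigenvalues."""
    total = sum(eigenvalues)
    betas = [lam / total for lam in eigenvalues]
    return sum((b * psi(u, c, gamma)) ** r
               for b, u, c in zip(betas, u_rect, c_rect)) ** (1.0 / r)

def classify(u_rect, class_rects, eigenvalues, gamma=0.5, r=2):
    """Assign the test rectangle to the class with minimum dissimilarity."""
    return min(class_rects,
               key=lambda k: dissimilarity(u_rect, class_rects[k], eigenvalues, gamma, r))

classes = {"C1": [(0.0, 1.0), (0.0, 1.0)], "C2": [(2.0, 3.0), (0.0, 1.0)]}
u = [(0.2, 0.8), (0.1, 0.9)]
print(classify(u, classes, eigenvalues=[0.6, 0.4]))  # C1
```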



Note: the Breakfast Cereal data set used in Section 6 was drawn from the WWW site http://lib.stat.cmu.edu and was presented at the 1993 Graphics Exposition.

5.2. A classification rule based on the minimum descriptor potential increase (pdi)

A second classification rule is based on a geometrical generalization of the images of the classes of SOs so as to include the image of each test-set SO. According to this rule, a lower-order SO $u_s$ is assigned to the higher-order SO $C_k$ that includes it with the minimum increase of its descriptor potential:

$$
\mathrm{pdi}(\hat{u}_s, \hat{C}_k) = \frac{\prod_{\alpha=1}^{m} \mu(S_{u\alpha} \oplus S_{k\alpha}) - \prod_{\alpha=1}^{m} \mu(S_{u\alpha})}{\prod_{\alpha=1}^{m} \mu(S_{k\alpha})} = \min \tag{12}
$$

The product of $\mu(S_{u\alpha} \oplus S_{k\alpha})$ with respect to the first two factorial axes ($\alpha = 1, 2$) represents the area of the rectangle of maximum extension of the object class $C_k$ needed to include the image of the object $u_s$ to be classified. The $\mathrm{pdi}(\hat{u}_s, \hat{C}_k) = 1$ when $\hat{u}_s$ coincides with the image $\hat{c}_k$ of the SO class $C_k$. In this case $u_s$ can be considered as an extension of $C_k$, according to the extension SO definition given by Diday [1].

6. APPLICATION: BREAKFAST CEREAL DATASET

An application of the NS-FDA for SOs has been performed on the Breakfast Cereal data set.

The intention SOs have been defined as four different kinds of breakfast products. These specifications were given by an expert, who defined four "classes" of breakfast on the basis of their energetic power. These are Dietetic, Medium Dietetic, Energetic and High Energetic. They are described by four numerical (interval-valued) descriptors: calories (number), sodium (mg), fibre (g) and potassium (mg); and by a multi-category nominal descriptor: vitamins and minerals enrichment (low, medium, high).

The SOs to be assigned to these a priori concepts are defined as the different product ranges of the manufacturers General Mills, Post, Kellogg's and Ralston Purina. Of course, these objects are described by the same variables.

The results of the NS-FDA in Figure 1 show the representation of the intention and test-set SOs on the first factorial plane (explained inertia 73 per cent).

If we consider the correlations between the categories of the coded descriptors and the factorial axes (for the sake of simplicity they are not presented in this text), we can interpret the factorial axes in terms of SO assertions [16].

The right side of the first axis is described by: [vitamins = {medium}] $\wedge$ [potassium = {high}] $\wedge$ [fibre = {high}], and the positive direction of the second axis is characterised by: [sodium = {medium}] $\wedge$ [calories = {medium}]. The first axis therefore grades the products, from right to left, from dietetic to high energetic.

Table I contains the classification results based on the values computed according to the proposed classification rules. According to both rules, the Post, Kellogg's and General Mills products can mainly be considered as part of the Energetic range. Ralston Purina products are more typical of the Dietetic range according to De Carvalho's dissimilarity index ('B'), and of the Medium Dietetic range with respect to the other rule ('A'). However, looking at the numerical values of the rules with respect to Kellogg's and Ralston Purina, we note only a feeble difference.

Figure 1. First factorial plane: intention and test-set SOs representation (explained inertia = 73 per cent).

Table I. Classification results: A = minimum descriptor potential increment; B = De Carvalho dissimilarity.

| Manufacturer | Energetic A | Energetic B | Dietetic A | Dietetic B | High Energetic A | High Energetic B | Medium Dietetic A | Medium Dietetic B |
|---|---|---|---|---|---|---|---|---|
| Ralston Purina | 7.83 | 0.197 | 3.19 | 0.118 | 7.43 | 0.180 | 2.82 | 0.121 |
| Post | 2.02 | 0.099 | 5.44 | 0.235 | 5.68 | 0.223 | 2.73 | 0.190 |
| Kellogg's | 1.01 | 0.000 | 3.19 | 0.120 | 3.55 | 0.125 | 2.01 | 0.068 |
| General Mills | 1.02 | 0.000 | 2.59 | 0.167 | 1.95 | 0.061 | 2.26 | 0.117 |
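The assignments discussed above follow from taking, for each manufacturer, the class with the smallest value under each rule. The short sketch below applies that minimum rule to the Table I values; the dictionary layout and the function name are illustrative choices.

```python
# Sketch: applying the minimum rule to the Table I values.
# Each entry is (A, B): A = minimum descriptor potential increment,
# B = De Carvalho dissimilarity. The layout is an illustrative choice.

TABLE_I = {
    "Ralston Purina": {"Energetic": (7.83, 0.197), "Dietetic": (3.19, 0.118),
                       "High Energetic": (7.43, 0.180), "Medium Dietetic": (2.82, 0.121)},
    "Post":           {"Energetic": (2.02, 0.099), "Dietetic": (5.44, 0.235),
                       "High Energetic": (5.68, 0.223), "Medium Dietetic": (2.73, 0.190)},
    "Kellogg's":      {"Energetic": (1.01, 0.000), "Dietetic": (3.19, 0.120),
                       "High Energetic": (3.55, 0.125), "Medium Dietetic": (2.01, 0.068)},
    "General Mills":  {"Energetic": (1.02, 0.000), "Dietetic": (2.59, 0.167),
                       "High Energetic": (1.95, 0.061), "Medium Dietetic": (2.26, 0.117)},
}

def assign(row, rule):
    """Return the class with the minimum value; rule 0 is A, rule 1 is B."""
    return min(row, key=lambda cls: row[cls][rule])

for maker, row in TABLE_I.items():
    print(maker, "| A:", assign(row, 0), "| B:", assign(row, 1))
# Ralston Purina | A: Medium Dietetic | B: Dietetic
# Post | A: Energetic | B: Energetic
# Kellogg's | A: Energetic | B: Energetic
# General Mills | A: Energetic | B: Energetic
```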

ACKNOWLEDGEMENTS

This paper has been carried out in the framework of the contribution of the 'Dipartimento di Matematica e Statistica', University of Naples, to the SODAS project (supervisor Prof. N.C. Lauro) and was supported by the Esprit grant (no. 20821).

REFERENCES

1. Diday E. An introduction to symbolic data analysis. INRIA, R.R. n. 1936, 1993.

2. Saporta G. Probabilités, Analyse des Données et Statistiques. Technip: Paris, 1990.

3. Verde R, Palumbo F. Non Symmetrical Factorial Discriminant Analysis on qualitative predictors. Acts of the XXXVIII Riunione Scientifica della Società Italiana di Statistica, Rimini, 1996 (in Italian).

4. Goodman LA, Kruskal WH. Measures of association for cross-classifications. Journal of the American Statistical Association 1954; 49:732-764.

5. Chouakria A, Diday E, Cazes P. Extension of Principal Component Analysis to Interval Data. Proceedings of New Techniques and Technologies for Statistics, Bonn, 1995.

6. De Carvalho FAT, Diday E. Un indice de proximité entre objets symboliques qui tient compte des contraintes dans l'espace de description. Unpublished manuscript, INRIA, 1996.

7. van Rijckevorsel J, De Leeuw J. Component and Correspondence Analysis. Wiley: New York, 1988.

8. Verde R. B-Spline Functions and Fuzzy Coding of Numerical Variables in Non Linear Data Analysis. Rocco Curto Editore, 1994 (in Italian).

9. Chouakria A, Verde R, Diday E, Cazes P. Généralisation de l'analyse factorielle des correspondances multiples à des objets symboliques. Proceedings of the Quatrièmes Journées de la Société Francophone de Classification, Vannes, France, 1996.

10. Lauro NC, D'Ambra L. L'analyse non symétrique des correspondances. In Data Analysis and Informatics, III, Diday E. et al. (eds). North-Holland: Amsterdam, 1984.

11. D'Ambra L, Lauro NC. Principal Component Analysis onto a Reference sub-space. Rivista di Statistica Applicata 1982; 15.

12. D'Ambra L, Lauro NC. Non Symmetrical Exploratory Data Analysis. Statistica Applicata 1992; 4:511-529.

13. Palumbo F. Selezione e quantificazione dei predittori qualitativi nell'analisi discriminante. Ph.D. Thesis, Univ. "Federico II", Napoli, 1995.

14. De Carvalho FAT. Méthodes Descriptives en Analyse de Données Symboliques. Thèse de Doctorat, Université Paris IX Dauphine, 1992.

15. Ichino M, Yaguchi H. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Transactions on Systems, Man and Cybernetics 1994; 24:698-708.

16. Gettler-Summa M. Factorial axis interpretation by symbolic objects. In Journées Symbolique-Numérique, Diday, Kodratoff, Pinson (Eds). Paris, 1993.


