fuzzy ontology mining and semantic information granulation

Fuzzy Ontology Mining and Semantic Information Granulation forEffective Information Retrieval Decision Making

Raymond Y.K. Lau 1 , Chapmann C.L. Lai 1 , Yuefeng Li 2

1 Department of Information Systems, City University of Hong Kong,Tat Chee Avenue, Kowloon, Hong Kong

E-mail: {raylau, chunllai}@cityu.edu.hk2 School of Information Technology, Queensland University of Technology,

GPO Box 2434, Brisbane, Qld 4001, AustraliaE-mail: [email protected]

Abstract

The notion of semantic information granulation is explored to estimate the information specificity orgenerality of documents. Basically, a document is considered more specific than another document ifit contains more cohesive domain-specific terminologies than that of the other one. We believe thatthe dimension of semantic granularity is an important supplement to the existing similarity-based andpopularity-based measures for building effective document ranking functions. The main contributions ofthis paper is the illustration of the design and development of a fuzzy ontology based granular informationretrieval (IR) system to improve the effectiveness of IR decision making for various domains. Based onthe notion of semantic information granulation, a novel computational model is developed to estimatethe semantic granularity of documents; these documents can then be ranked according to the informationseekers’ specific semantic granularity requirements. One main component of the proposed computationalmodel is the fuzzy ontology mining mechanism which can automatically build domain-specific ontologyfor the estimation of semantic granularity of documents. Our TREC-based experiment reveals that theproposed fuzzy ontology based granular IR system outperforms a classical vector space based IR systemin domain specific IR. Our research work opens the door to the applications of granular computing andfuzzy ontology mining methods to enhance domain specific IR decision making.

Keywords: Text Mining, Fuzzy Ontology, Fuzzy Subsumption, Information Granulation, Granular Com-puting, Information Retrieval.

1. Introduction

Classical similarity-based document ranking func-tions 22,23, and the recent popularity-based rankingalgorithms 16 have been applied to developed IR sys-tems such as Internet search engines. One implicitassumption behind the document ranking functionsof existing Internet search engines is that popularity

is closely correlated with relevance. Unfortunately,the correlation between popularity and relevancecould be weak for newly created Web documents 14.In this paper, we examine if another dimension canbe used to supplement the existing similarity-basedand popularity-based dimensions so as to develop amore effective document ranking function. In fact,recent advances in Granular Computing 1,31,32 sheds

International Journal of Computational Intelligence Systems, Vol.4, No. 1 (February, 2011).

Published by Atlantis Press Copyright: the authors 54

zegerkarssen

Texte tapé à la machine

Accepted: 29-11-2009 Received: 09-09-2010

Raymond Y.K. Lau, Chapmann C.L. Lai, Yuefeng Li

light on developing more effective IR systems to al-leviate the problem of information overload. Thegranular computing paradigm emphasizes the effec-tive use of levels of “granularity” or abstraction tosystematically analyze, represent, and solve real-world problems 1,13,19,21,32. In granular comput-ing, information granulation refers to the computa-tional processes of generating and presenting levelsof granularity of information to facilitate problemsolving 30,34,17. However, when researchers refer toinformation granularity in the discipline of granularcomputing, they often mean the ”structural granu-larity”, that is, the structural abstractions of infor-mation items such as the coarse level of a docu-ment containing the finer levels of sections, chap-ters, paragraphs, sentences, and so on. In the con-text of IR, we try to apply the concept of infor-mation granulation to design a granular IR systemwhich can estimate the “semantic granularity” ofdocuments (e.g., general vs. specific documents)and rank these documents with respect to informa-tion seekers’ specific granularity requirements.

The semantic granularity of a document refers tothe levels of semantic details (i.e., the specificity)of information contained in the document 9. To ob-jectively estimate the semantic granularity of a doc-ument, it is possible to refer to a domain ontologysuch as the Medical Subject Headings (MeSH)a todetermine if a document contains specific terminolo-gies or general concepts. It is generally acceptedthat ontology refers to a formal specification of con-ceptualization 3. Since specificity is the antonymof generality, we will measure a document’s seman-tic granularity (an attribute) in terms of documentspecificity (attribute value) throughout this paper.As we only focus on the semantic granularity ratherthan the structural granularity of documents, we willloosely use the term granularity to refer to semanticgranularity for the rest of this paper.

Because of the sheer volume of documentsarchived on the Web and digital libraries, it is notpractical to manually label “general” or “specific”documents individually. In fact, it is almost impossi-ble to assign a static granularity label to a documentbecause the granularity of a document is subject to

human interpretations in some cases. For instance,the term “granular computing” may be consideredspecific for an undergraduate student, while it maybe considered too general for researchers in the fieldof granularity computing. One of the main contri-butions of this paper is the development of a com-putational model for granular information retrieval.Such a computational model supports the objectiveestimation of the granularity of documents by con-sulting domain ontology, and it can also deal withthe subjectivity in information granularity by tak-ing into account the individual user’s granularity re-quirement. As comprehensive domain ontology maynot be readily available for arbitrary domains, thesecond main contribution of this paper is the illus-tration of how to apply a fuzzy ontology discoverymethod to enhance the effectiveness of the proposedgranular IR system.

The remainder of the paper is organized as fol-lows. Section 2 highlights previous research in therelated area and compare these research work withours. Section 3 describes the general architecture ofthe proposed granular IR system. Our novel fuzzyontology discovery method is then illustrated in Sec-tion 4. The computational details for the estimationof document or query granularity are then given inSection 5. Section 6 describes TREC-AP collectionbased evaluation of the proposed granular IR sys-tem. Finally, we offer concluding remarks and de-scribe future directions of our research work.

2. Related Research

Yao 31 is probably the first researcher to explore theidea of granular computing in the context of IR. Itwas proposed that an IR support system should ex-ploit document space granulations (e.g., documentclustering), user space granulations (e.g., groupingsimilar queries into a group user profile), term spacegranulations (e.g., grouping terms by specificity orgenerality), and retrieval result granulations (e.g.,clustering the result sets) to develop effective IR sys-tems for an individual or group of information seek-ers 31. However, the idea of applying the granularcomputing methodology to IR remains as a concep-

ahttp://www.nlm.nih.gov/mesh/


Fuzzy Ontology Mining and Semantic Information Granulation for IR Decision Making

tual discussion rather than a concrete system designand development work. Our research extends theidea of granular information retrieval support systemby designing, implementing, and evaluating a proto-type granular IR system. Based on the conceptualideas presented in 31, we explore term space granu-lation and apply this concept to construct a compu-tational model to estimate the granularity of docu-ments and queries.

It was argued that the readability of a docu-ment could be assessed quantitatively and objec-tively 29. Accordingly, an ontology-based compu-tational model was developed to assess the read-ability of documents. The assumption was that ifa document contained some terms which appearedin a concept hierarchy (i.e., ontology) of a specificdomain, the readability score of that document de-creased. Our granular IR system also makes use ofdomain ontology to objectively assess the granular-ity (e.g., the specificity) of documents. However, weavoid modeling the more subjective issue of read-ability which may not be quantified based on domainontology alone.

Zhou et. al. 35 examined the issues of in-formation specificity and information generality inthe context of ontological user profiling and users’search intension modeling. Their computationalmodel for estimating information specificity and in-formation generality was based on Dempster-Shafer(D-S) theory of evidence. In particular, the speci-ficity measure was developed based on the belieffunction of the D-S theory, whereas the generalitymeasure was constructed based on the plausibilityfunction of the D-S theory. Instead of employing theD-S theory of evidence to estimate document granu-larity, we develop an efficient ontological computa-tional model to estimate document granularity.

The FOGA framework for fuzzy ontology gener-ation has been proposed 27. The FOGA frameworkconsists of fuzzy formal concept analysis, fuzzyconceptual clustering, fuzzy ontology generation,and semantic representation conversion. Essentially,the FOGA method extends the formal concept anal-ysis approach, which has also been applied to on-tology discovery, with the notions of fuzzy sets.Our proposed ontology discovery method is based

on previous work in computational linguistic andwith the computational mechanism built based onthe concept of fuzzy relations.

An ontology mining technique was proposedto extract patterns representing users’ informationneeds 12. The ontology mining method consists oftwo parts: the top backbone and the base backbone.The former represents the relations between com-pound classes of the ontology. The latter indicatesthe linkage between primitive classes and compoundclasses. The Dempster-Shafer theory of evidencemodel was applied to extract the relations amongclasses. The strength of the ontology mining methodis that it can effectively synthesize taxonomic rela-tions and non-taxonomic relation in a single ontol-ogy model. In addition, a novel method was pro-posed to capture the evolving patterns in order torefine the initially discovered ontology. Finally, aformal model was developed to assess the relevanceof the discovered ontology with respect to the user’sinformation needs. The research work presented inthis paper focuses on fuzzy domain ontology discov-ery rather than the discovery of crisp ontology rep-resenting users’ information needs.

A fuzzy ontology which is an extension of thedomain ontology with crisp concepts is utilized fornews summarization purpose 11. The main functionof the fuzzy inference mechanism is to generate themembership degrees (classification) for each eventwith respect to some pre-defined concepts. Thestandard triangular membership function is used forthe classification purpose. The method discussed inthis paper is a fully automatic fuzzy domain ontol-ogy discovery approach. There is no pre-definedfuzzy concepts and taxonomy of concepts, insteadour text mining method will automatically discoversuch concepts and generate the taxonomy relations.

Sanderson and Croft 24 proposed a document-based subsumption induction method to automati-cally derive a hierarchy of terms from a corpus. Inparticular, the subsumption relations among termsare developed based on the co-occurrence of termsin the documents of a corpus. For example, term t1is considered more specific than another term t2 ifthe appearance of t1 in a document implies the ap-pearance of t2 in the same document but not vice



versa. They adopted an artificial threshold such asPr(t2|t1) > 0.8 as a fixed cut-off to determine thespecificity relation between t1 and t2. Our conceptmining method differs from their work in that we aredealing with the more challenging task of concepthierarchy mining rather than term relationship ex-traction. In addition, our method extends their com-putational method in that the co-occurrence of termsis derived based on a moving text window ratherthan the whole document to reduce the chance ofgenerating noisy subsumption relations.

3. General System Architecture

Learning & Adaptation

Document Ranking

Indexing

Document Representation

User Profile

Information Needs

Information Seekers

Query & Feedback

Information Sources

Selected documents

Granular Presentation Domain

Ontology

Ontology Mining

Fig. 1. The System Architecture of the Granular IR System

The general system architecture of our granularIR system is depicted in Figure. 1. An informa-tion seeker first translates their implicit informationneeds (including granularity requirement) into ex-plicit queries. Recurring queries are often storedin a user profile within the granular IR system.On the other hand, information objects (e.g., Webpages) from specific information sources such as theInternet are characterized by a particular indexingscheme. These document characterizations are also

stored in the local cache of the granular IR system.The document ranking mechanism of the granularIR system computes an aggregated document scorefor each document by taking into account three as-pects, namely similarity, popularity, and granularitywith respect to the given query. The granular presen-tation layer of the IR system will generate the appro-priate presentation formats (e.g., ranked list of doc-uments or clusters of documents) with respect to thespecific preferences of individual user or group ofusers. As a result, information seekers can retrieveinformation with the levels of details and formatsthey prefer. After reviewing the information objects,information seekers may provide relevance feedbackabout the content and the presentation format of thedelivered documents to the learning and adaptationmechanism. Thereby, both the content and the pre-sentation format of documents can be improved insubsequent round of information search. To alle-viate the problem of lack of good quality domainontology for certain domains, the ontology miningmechanism can automatically discover domain on-tology based on some information sources. Ontol-ogy discovery is invoked as a background task, andso it will not affect the online IR performance. Ourprototype systemb was developed using Java (J2SEv 1.4.2), Java Server Pages (JSP) 2.1, and Servlet2.5.

4. Automatic Fuzzy Domain OntologyDiscovery

4.1. A Formal Model for Fuzzy DomainOntology

Since any IR processes involve uncertainty 8, an un-certainty management mechanism is required in ourontology mining method. The notions of Fuzzy setand Fuzzy Relation are effective to represent knowl-edge with uncertainty 33. Therefore, a fuzzy ontol-ogy rather than a crisp ontology is employed in ourgranular IR system. Our proposed fuzzy domain on-tology is formally defined by:

Definition 1. [Fuzzy Set] A fuzzy set F consistsof a set of objects drawn from a domain X and the

bhttp://quantum.is.cityu.edu.hk/ong/ongui.jsp



membership of each object xi in F is defined bya membership function µF : X 7→ [0,1]. For thespecial case of a crisp set, the crisp membershipfunction has the mapping µF : X 7→ {0,1}.

Definition 2. [Fuzzy Relation] A fuzzy relation RXYis defined as the fuzzy set R on a domain X ×Ywhere X and Y are two crisp sets. The membershipof each object (xi,yi) in R is defined by a member-ship function µR : X×Y 7→ [0,1].

Definition 3. [Fuzzy Ontology] A fuzzy ontology isa 6-tuple Ont = 〈X ,A,C,RXC,RAC,RCC〉, where X isa set of objects, A is the set of attributes describingthe objects, and C is a set of concepts (classes). Thefuzzy relation RXC : X ×C 7→ [0,1] assigns a mem-bership to the pair (xi,ci) for all xi ∈ X ,ci ∈ C, thefuzzy relation RAC : A×C 7→ [0,1] defines the map-ping from the set of attributes A to the set of con-cepts C, and the fuzzy relation RCC : C×C 7→ [0,1]defines the strength of the sub-class/super-class re-lationships among the set of concepts C.

Based on the idea of formal concept analysis 2,X is the extent of the concepts C, and A is the intentwhich defines the properties of C. According to theidea of subsumption, the sub-concept/super-conceptrelation (RCC) can be defined by:

Definition 4. [Fuzzy Subsumption] With respectto an arbitrary α-cut level, a concept cx ∈ C isthe sub-concept of another super-concept cy ∈ Cif and only if ∀ai ∈ {z ∈ A|µRAC(z,cy) > α},µRAC(ai,cx) > α . Alternatively, from an exten-sional perspective, a concept cx ∈ C is the sub-concept of another super-concept cy ∈C if and onlyif ∀xi ∈ {z ∈ X |µRXC(z,cx) > α}, µRXC(xi,cy) > α

with respect to an arbitrary α-cut level.

Definition 4 can be explained as follows: if themembership of every attribute ai ∈ A for the conceptcy ∈C is greater than or equal to a certain thresholdα , the membership of the corresponding attribute aifor the concept cx ∈C is also greater than or equal toα , then the concept cx is the sub-concept of cy. Ascan be seen, the crisp subsumption relation is only

a special case of the generalized fuzzy subsumptionrelation in that the threshold value α = 1 is estab-lished for the crisp case. In other words, if it is truethat every attribute ai ∈ A characterizing the conceptcy implies that it also characterizes the concept cx,the concept cx is the sub-concept of cy.

4.2. Context-Sensitive Text Mining for ConceptDiscovery

In the field of IR, the notion of context vectors 5,25

has been proposed to construct computer-based rep-resentations of concepts (i.e., linguistic class). Inthis approach, a concept is represented by a vec-tor of words (features) and their numerical weights.The weight of a word indicates the extent to whichthe particular word is associated with the underly-ing concept. Figure. 2 shows that the concept “ac-quisition” is represented by the feature terms suchas “merger”, “company”, “arbitrage”, “takeover”,“buyout”, etc. Indeed, this is a real example ex-tracted from the TREC-AP Topic description file byapplying our ontology discovery algorithm. All theterms are stemmed by our program as shown in Fig-ure. 2. The context vector of “acquisition” is repre-sented as follows:

Concept: acquisitionContext Vector:〈(merger,0.675),(compani,0.675),(arbitrag,0.589),(takeover,0.507),(buyout,0.416), . . .〉

Fig. 2. The Representation of the Concept “Acquisition”



A context vector can be seen as a point in a multi-dimensional geometric information space with eachdimension representing a property term. A linguis-tic concept such as “acquisition” can be taken as aclass (set) with respect to the fuzzy sets framework.A feature word such as “merger” can then be treatedas an attribute describing the concept to a certain de-gree (i.e., µRAC(merger,acquisition) = 0.675).

Our context-sensitive text mining method is firstapplied to extract concepts from a textual corpus 6.In particular, a virtual window is moved from left toright among the tokens in each document of a corpusto take into account the proximity among terms. Ac-cording to previous studies, a text window of 5 to 10terms is effective 7,18, and so we adopt this range asthe basis to perform our windowing process. Stan-dard association measure such as Mutual Informa-tion (MI) 26 is then applied to measure the associa-tion between a concept and its underlying terms:

MI(ti, t j) = log2Pr(ti, t j)

Pr(ti)Pr(t j)(1)

where MI(ti, t j) is the mutual information betweenterm ti and term t j. Pr(ti, t j) is the joint probabilitythat both terms appear in a text window, and Pr(ti)is the probability that a term ti appears in a text win-dow. The probability Pr(ti) is estimated based on|wt ||w| where |wt | is the number of windows containingthe term t and |w| is the total number of windowsconstructed from a corpus. Similarly, Pr(ti, t j) is thefraction of the number of windows containing bothterms out of the total number of windows.

We develop Balanced Mutual Information(BMI) 7,10 to compute the degree of associationamong tokens. This method considers both termpresence and term absence as the evidence of the im-plicit term relationships.

µci(t j) ≈ BMI(ti, t j)

= β × [Pr(ti, t j) log2(Pr(ti,t j)+1Pr(ti)Pr(t j)

)+

Pr(¬ti,¬t j) log2(Pr(¬ti,¬t j)+1Pr(¬ti)Pr(¬t j)

)] −(1−β )× [Pr(ti,¬t j) log2(

Pr(ti,¬t j)+1Pr(ti)Pr(¬t j)

)+

Pr(¬ti, t j) log2(Pr(¬ti,t j)+1Pr(¬ti)Pr(t j)

)]

(2)

where µci(t j) is the membership function to estimatethe degree of a term t j ∈ A belonging to a conceptci ∈ C. µci(t j) is the computational mechanism forthe relation RAC defined in the fuzzy domain ontol-ogy Ont = 〈X ,A,C,RXC,RAC,RCC〉. The member-ship function µci(t j) is indeed approximated by theBMI score. The weight factor β > 0.5 is used tocontrol the relative importance of two kinds of evi-dence (positive and negative).

To improve computational efficiency and filternoisy relations, only certain linguistic pattern (e.g.,Noun Noun, and Adjective Noun) will be consid-ered. After the linear normalization process (i.e.,∀ci∈C,t j∈X µci(t j) ∈ [0,1]), concept vectors 5,25 canbe built. Essentially, the MI score between a con-cept and an underlying feature word is treated asthe membership of the word (attribute) character-izing the concept. An α-cut is applied to discardterms from the potential concept if their member-ship values are below the threshold α . For instance,the concept vector for “commercial bank” is repre-sented by cx = 〈(merger,0.675), (compani,0.675),(arbitrag,0.589), (takeover,0.507)〉 after applyinga cut of α = 0.5.

4.3. Concept Pruning

To further filter the noisy concepts, we adopt theTFIDF 23 like heuristic to perform the filtering pro-cess. Similar approach has also been used in ontol-ogy learning 15. For example, if a concept is signifi-cant for a particular domain, it will appear more fre-quently in that domain when compared with its ap-pearance in other domains. The following measureis used to compute the relevance score of a concept:

Rel(ci,D j) =Dom(ci,D j)

∑nk=1 Dom(ci,Dk)

(3)

where Rel(ci,D j) is the relevance score of a con-cept ci in the domain D j. The term Dom(ci,D j) isthe domain frequency of the concept ci (i.e., num-ber of documents containing the concept divided bythe total number of documents in the corpus). Thehigher the value of Rel(ci,D j), the more relevant theconcept is for domain D j. Based on empirical test-ing, we can estimate a threshold ϖ for a particular



domain. Only the concepts with relevance scoresgreater than the threshold will be selected.

4.4. Fuzzy Relation Discovery

The final stage towards our ontology discoverymethod is fuzzy taxonomy generation based onsubsumption relations among extracted concepts.Spec(cx,cy) denotes that concept cx is a specializa-tion (sub-class) of another concept cy. The degree ofsuch a specialization is derived by:

µC×C(cx,cy) ≈ Spec(cx,cy)

=∑tx∈cx ,ty∈cy,tx=ty µcx (tx)⊗µcy (ty)

∑tx∈cx µcx (tx)(4)

where ⊗ is a fuzzy conjunction operator which isequivalent to the min function. The above for-mula states that the degree of subsumption (speci-ficity) of cx to cy is based on the ratio of the sumof the minimal membership values of the com-mon terms belonging to the two concepts to thesum of the membership values of terms in theconcept cx. The range of the Spec(cx,cy) func-tion falls in the unit interval [0,1] and the sub-sumption relation is asymmetric. When the tax-onomy is built, we only select the subsumptionrelations such that Spec(cx,cy) > Spec(cy,cx) andSpec(cx,cy) > λ where λ is a threshold to distin-guish significant subsumption relations. The pa-rameter λ is estimated based on empirical tests.In addition, a pruning step is introduced such thatthe redundant taxonomy relations are removed.If the membership of a relation µC×C(c1,c2) 6min({µC×C(c1,ci), . . . ,µC×C(ci,c2)}), wherec1,ci, . . . ,c2 form a path P from c1 to c2, the re-lation R(c1,c2) is removed because it can be derivedfrom other stronger taxonomy relations in the on-tology. The details of the fuzzy domain ontologymining algorithm can be found at 10.

5. A Computational Model for EstimatingSemantic Granularity

5.1. Semantic Information Granulation

Intuitively, a document is considered specific if itcontains some domain specific terminologies. For

instance, comparing a document about “diseases”and another document about “pneumonia”, the lat-ter is probably considered to be more specific thanthe former because it refers to a specific kind of dis-ease using formal terminology. With reference toFigure. 3 which shows a segment of the MeSH do-main ontology, we know that “conjunctivitis” is onespecific kind of “eye infection”, and so “conjunc-tivitis” is a more specific terminology than “eye in-fection” is. We propose the notion of “terminologi-cal specificity” to measure the proportion of domainspecific terminologies appearing in a document. Onthe other hand, consider that two documents con-taining terminologies of the same level as encodedin a concept hierarchy may still demonstrate differ-ent specificity. With reference to Figure. 3, a docu-ment about “conjunctivitis” and “keratitis” is proba-bly considered more specific than another documentabout “conjunctivitis” and “warts”. The reason isthat both “conjunctivitis” and “keratitis” are specif-ically referring to “eye infection”, whereas “warts”is about skin disease. This gives rise to the notion of“referential specificity” which refers to the cohesionof the terminologies covered by a document. In thispaper, we argue that the granularity of a documentcan be estimated based on two dimensions, that is,terminological specificity and referential specificity.

Virus Diseases [C02]

Eye Infections [C02.325]

Skin Diseases [C02.825]

Conjunctivitis [C02.325.250]

Conjunctivitis, Acute Hemorrhagic [C02.325.250.250]

Keratitis[C02.825.290]

Warts [C02.825.810]

CondylomataAcuminata[C02.825.810.110]

Fig. 3. A Segment of the MeSH Domain Ontology



5.2. Computing Document and QueryGranularity

Terminological specificity of a document can be es-timated according to the coverage and the level ofspecificity of the terms contained in the document.With reference to a domain ontology such as the onedepicted in Figure. 3, the more low-level terminolo-gies appear in a document, the more specific the doc-ument will be. Eq.(5) is proposed to measure the ter-minological specificity of a document by computingthe average depths of all the document terms withreference to a domain ontology. The depth of a ter-minology is measured in terms of the distance be-tween the terminology node and the root node. Thedepth of a term not appearing in the ontology is as-sumed zero.

T S(d) =∑t∈MC depth(t)

max(ran(depth))×|MC|(5)

where MC refers to the set of matching terminolo-gies found in a document d and an ontology, and|MC| is the cardinality of the set MC. The func-tion depth(t) returns the depth of the terminology twith respect to an ontology. The operator ran returnsthe range of a function. The normalization factor

1max(ran(depth)) is applied to the T S(d) function suchthat its range falls in the unit interval.

If the terms of a document are more cohesive(e.g., they refer to the same topic), the document isconsidered more referentially specific. The referen-tial specificity of two terms can be measured basedon the information specificity (information content)of their least common subsumer 20:

RS(d) =2∑ti∈MC,t j∈MC,i6= j sim(ti, t j)

|MC|× (|MC|−1)(6)

sim(ti, t j) = maxcs∈S(ti,t j)

− log2 Pr(cs) (7)

where the function sim(ti, t j) returns the “semanticsimilarity” of two terminologies ti and t j based onthe information content of their least common sub-sumers found in an ontology. S(ti, t j) represents the

set of common subsumers cs of ti and t j. The fac-tor |MC|×(|MC|−1)

2 returns the number of distinct ter-minology pairs constructed from the set MC. Thefunction RS(d) basically tries to measure the aver-age semantic similarity among the pairs of matchingdomain terminologies found in the document d. Theprobability Pr(cs) of a concept cs is estimated basedon:

Pr(cs) =T F(cs)

N(8)

T F(cs) = ∑x∈Subsumed(cs)

count(x) (9)

where T F(cs) represents the frequency of the con-cept cs estimated according to the structure of anontology and the populated concept frequencies de-rived from a chosen document collection. The termN represents the sum of all concept frequencies withreference to the ontology and the chosen documentcollection. When the frequency of cs is estimated,any concepts subsumed by cs is also taken into ac-count. The set Subsumed(cs) represents all the con-cepts subsumed by cs according to the ontology. Thefunction count(x) simply returns the occurrence fre-quency of the concept x with reference to the on-tology and the chosen document collection. For theexperiment reported in this paper, we adopted theTREC-AP collection to estimate concept probabil-ity. The TREC-AP collection consists of 1,226,194unique terms.

By taking into account both terminologicalspecificity and referential specificity, the speci-ficity of a document can be estimated according toEq.(10). If we treat a query q as a short docu-ment and apply the same approach to estimate itsspecificity, the query specificity of q can be derivedfrom Eq.(11). The weight factor ϕd ∈ [0,1] controlsthe relative importance of terminological specificityand referential specificity in estimating the overalldocument specificity. Similarly, the weight factorϕq ∈ [0,1] is used to tune the query specificity mea-sure. The range of document specificity or queryspecificity falls into the unit interval. The automatedmeans of computing query specificity Eq.(11) is oneof the ways to deal with the variance of the perceived



granularity of documents among different informa-tion seekers. For instance, while an informationseeker perceives one document as specific, anotherperson may think that the same document is rela-tively general. Nevertheless, such a variance will bereflected by the different usage of terminologies inthe respective queries. As a result, the variance ofinformation seekers’ perceived document granular-ity due to different knowledge states or tasks at handcan be captured by our system.

DS(d) = ϕd×T S(d)+(1−ϕd)×RS(d) (10)

QS(q) = ϕq×T S(q)+(1−ϕq)×RS(q) (11)

To implement a holistic ranking function, ourgranular IR system can re-rank documents after ap-plying a similarity or popularity based ranking func-tion. The basic intuition is that if there is a largegranularity gap between a query and a document(e.g., a specific query versus a general document),the initial similarity or popularity score of the docu-ment should be adjusted (e.g., lowered). The reasonis that the document is unlikely to meet the informa-tion seeker’s granularity requirement. On the otherhand, if the granularity gap between a query and adocument is small, little or no adjustment of the sim-ilarity or the popularity score is required. The gran-ularity gap between a query q and a document d isestimated based on the absolute difference betweenDS(d) and QS(q). The parameter ϕG controls therelative weight of the granularity factor when docu-ments are ranked. The function SimPop(d,q) repre-sents the combined similarity and popularity baseddocument ranking function; it is assumed that sucha ranking function has been made available inside oroutside (e.g., from Internet search engines) our pro-posed granular IR system.

GScore(d,q)= SimPop(d,q)−ϕG×|DS(d)−QS(q)|(12)

6. Experiments and Results

Our experimental procedure was based on the rout-ing task employed in the TREC forum 28. Essen-tially, a set of pre-defined topics (i.e., queries) wasselected to represent the hypothetical user informa-tion needs. By invoking the respective IR systems(e.g., the granular IR system and the baseline sys-tem), documents from the benchmark corpora wereranked according to their relevance to the queries.Standard performance evaluation measures such asprecision, recall, mean average precision (MAP)were applied to assess the effectiveness of the re-spective IR systems 28. Precision is the fraction ofthe number of retrieved relevant documents to thenumber of retrieved documents, whereas recall isthe fraction of the number of retrieved relevant doc-uments to the number of relevant documents. Inparticular, we employed the TREC evaluation pack-age available at Cornell University to compute allthe performance data. We used the TREC-AP col-lection which comprises the Associated Press (AP)newswires covering the period from 1988 to 1990with a total number of 242,892 documents (withthe removal of 26 documents containing stop wordsonly) and the average document length of 450.6words 4.

Table 1. Results of the TREC-AP Benchmark TestRecall Baseline System Granular IR System t-statistics p values

Mean STD Mean STD d f (9)

0 0.5223 0.1001 0.5978 0.1013 5.285 < .01∗∗

0.1 0.3035 0.1730 0.3781 0.1826 4.911 < .01∗∗

0.2 0.2325 0.1888 0.2862 0.2019 4.591 < .01∗∗

0.3 0.1994 0.1706 0.2432 0.2115 3.716 < .01∗∗

0.4 0.1687 0.1528 0.2168 0.2109 3.354 < .01∗∗

0.5 0.1240 0.1115 0.1872 0.2033 2.719 = .01∗∗

0.6 0.0869 0.0791 0.1425 0.1384 2.443 < .05∗

0.7 0.0644 0.0743 0.1063 0.1052 2.069 < .05∗

0.8 0.0390 0.0759 0.0608 0.0554 1.081 = .15

0.9 0.0103 0.0235 0.0223 0.0311 1.179 = .13

1 0.0047 0.0026 0.0158 0.0250 1.536 = .08

MAP 0.1519 0.1782

∆% 17.31%

A baseline system was developed based on theclassical vector space model 23. With respect to eachtest query, the first 1,000 documents from the ranked



list were used to evaluate the performance of an IRsystem. Our granular IR system employed the ag-gregated document ranking function Eq.(12) to rankdocuments. The query specificity of each TREC-APquery was computed according to Eq.(11). For allthe experiments reported in this paper, the parame-ters ϕd =ϕq = 0.41 and ϕG = 0.83 were used. Thesesystem parameters were estimated based on the pilottests which involved a subset of the TREC-AP testqueries.

Fig. 4. The First Level Concepts of the “Acquisitions” On-tology

We randomly selected ten TREC-AP topics suchas “Antitrust”, “Acquisitions”, “AIDS treatments”,“Space Program”, “Water Pollution”, “JapaneseStock Market Trends”, “New Medical Technology”,“Influential Players in Multimedia”, “Impact of Re-ligious Right on U.S. Law”, and “Computer VirusOutbreaks” for our experiment. Each of these top-ics contains relevant documents. A query was con-structed based on the title and the narrative fieldof the topic. For each TREC topic, we employedour fuzzy ontology discovery method to automati-cally generate a domain ontology based on the full-text description of the topic. Figure. 4 shows thefirst level concepts of the “Acquisitions” ontologyafter applying our fuzzy domain ontology miningalgorithm to the TREC Topic 002; there are lowerlevels concepts which are not expanded in the di-

agram for better readability reason. The perfor-mance data as generated by the TREC evaluationpackage is tabulated in Table. 1. At every recalllevel, we tried to test the null hypothesis (Hnull :µGranular−µBaseline = 0) and the alternative hypoth-esis (Halternative : µGranular− µBaseline > 0), whereasµGranular and µBaseline represented the mean preci-sion values achieved by the granular IR system andthe baseline IR system respectively.

The granular IR system achieves better precisionat all levels of recall, and there are statistically sig-nificant improvement at most levels. In terms ofMAP, the granular IR system achieves a 17.31%overall improvement, and such an improvement isshown to be statistically significant. The last twocolumns of Table. 1 show the results of our pairedone tail t-test. An entry in the last column markedwith (**) indicates that the corresponding null hy-pothesis is rejected at the 0.01 level of significanceor below, whereas an entry marked with (*) indi-cates that the null hypothesis is rejected at the 0.05level of significance or below. The performance im-provement of the granular IR system over the base-line system can be explained with reference to Fig-ure. 4. For instance, the concepts “merger”, “arbi-trage”, “takeover”, “buyout”, and so on captured inthe system generated fuzzy domain ontology wereused to estimate the specificity of the TREC-AP doc-uments when the topic (query) “acquisitions” wasprocessed. This extra semantic information helpeda lot in determining if certain documents were spe-cific about the topic “acquisitions” or not, and ac-cordingly the rank of these documents was adjusted.

7. Conclusions and Future Work

By exploiting the granular computing methodology,we design and develop a novel granular IR systemto enhance document retrieval decision making forspecific domains. In particular, the notion of se-mantic granularity is proposed and the correspond-ing computational model is developed to estimatethe semantic granularity (i.e., specific vs. general) ofdocuments and queries. One main component of ourproposed computational model is the fuzzy domainontology mining mechanism. By automatically ex-



tracting the fuzzy ontology pertaining to various do-mains, our system can effectively measure the se-mantic granularity of documents. A TREC-basedbenchmark corpus was applied to evaluate the effec-tiveness of the fuzzy ontology based granular IR sys-tem. Our experimental results show that the fuzzyontology based granular IR system outperforms aclassical similarity-based IR system for some rout-ing tasks. Our research work opens the door to thedesign of the next generation of domain-specific IRsystems such as domain-specific search engines. Inthe future, we will evaluate our granular IR systemby comparing it with a baseline system which canutilize a general ontology such as the Library ofCongress Subject Headings (LCSH) for IR. More-over, the optimal values of the system parameterswill be sought by invoking heuristic search meth-ods such as a genetic algorithm. Finally, field testswill be conducted to compare the IR effectivenessbetween our fuzzy ontology based granular IR sys-tem and Internet search engines.

Acknowledgments

The work reported in this paper has been fundedin part by City University of Hong Kong’s Start-upGrant (Project No.: 7200126), Strategic ResearchGrant (Project No.: 7002426), and Strategic Re-search Grant (Project No.: 7008018).

References

1. A. Bargiela and W. Pedrycz. Toward a theory ofgranular computing for human-centered informationprocessing. IEEE Transactions on Fuzzy Systems,16(2):320–330, 2008.

2. P. Cimiano, A. Hotho, and S. Staab. Learning con-cept hierarchies from text corpora using formal con-cept analysis. Journal of Artificial Intelligence Re-search, 24:305–339, 2005.

3. T. R. Gruber. A translation approach to portable ontol-ogy specifications. Knowledge Acquisition, 5(2):199–220, 1993.

4. D. Hull. The TREC-7 Filtering Track: Description andAnalysis. In E.M. Voorhees and D.K. Harman, edi-tors, Proceedings of the seventh Text REtrieval Con-ference (TREC-7), pages 33–56, Gaithersburg, Mary-land, November 9–11 1998. NIST. Available fromhttp://trec.nist.gov/pubs/trec7/.

5. Hongyan Jing and Evelyne Tzoukermann. Informa-tion retrieval based on context distance and morphol-ogy. In Proceedings of the 22nd Annual Interna-tional ACM SIGIR Conference on Research and De-velopment in Information Retrieval, Language Analy-sis, pages 90–96, 1999.

6. R.Y.K. Lau. Context-Sensitive Text Mining and Be-lief Revision for Intelligent Information Retrieval onthe Web. Web Intelligence and Agent Systems An In-ternational Journal, 1(3-4):1–22, 2003.

7. R.Y.K. Lau. Fuzzy Domain Ontology Discovery forBusiness Knowledge Management. IEEE IntelligentInformatics Bulletin, 8(1):29–41, 2007.

8. R.Y.K. Lau, P. Bruza, and D. Song. Towards a Be-lief Revision Based Adaptive and Context-SensitiveInformation Retrieval System. ACM Transactions onInformation Systems, 26(2):8.1–8.38, 2008.

9. R.Y.K. Lau, C.L. Lai, and Y. Li. Mining fuzzy on-tology for a web-based granular information retrievalsystem. In Peng Wen, Yuefeng Li, Lech Polkowski,Yiyu Yao, Shusaku Tsumoto, and Guoyin Wang, edi-tors, Proceedings of the 4th International Conferenceon Rough Sets and Knowledge Technology, volume5589 of Lecture Notes in Computer Science, pages239–246, Gold Coast, Australia, July 14–16 2009.Springer.

10. R.Y.K. Lau, D. Song, Y. Li, C.H. Cheung, and J.X.Hao. Towards A Fuzzy Domain Ontology ExtractionMethod for Adaptive e-Learning. IEEE Transactionson Knowledge and Data Engineering, 21(6):800–813,2009.

11. Chang-Shing Lee, Zhi-Wei Jian, and Lin-Kai Huang.A fuzzy ontology and its application to news summa-rization. IEEE Transactions on Systems, Man, and Cy-bernetics, Part B, 35(5):859–880, 2005.

12. Yuefeng Li and Ning Zhong. Mining ontology forautomatically acquiring web user information needs.IEEE Transactions on Knowledge and Data Engineer-ing, 18(4):554–568, 2006.

13. Luis Martinez, Manuel J. Barranco, Luis G. Perez,and Macarena Espinilla. A knowledge based recom-mender system with multigranular linguistic informa-tion. International Journal of Computational Intelli-gence Systems, 1:225–236, 2008.

14. A. Mowshowitz and A. Kawaguchi. Efficientweb browsing on handheld devices using page andform summarization. Communications of the ACM,45(9):56–60, 2002.

15. Roberto Navigli, Paola Velardi, and Aldo Gangemi.Ontology learning and its application to automatedterminology translation. IEEE Intelligent Systems,18(1):22–31, 2003.

16. L. Page, S. Brin, R. Motwani, and T. Andwinograd.The PageRank citation ranking: Bringing order tothe Web. In Stanford Digital Library Technologies



Project, 1998. Technical Report.17. Witold Pedrycz. Hierarchical architectures of fuzzy

models: From type-1 fuzzy sets to information gran-ules of higher type. International Journal of Compu-tational Intelligence Systems, 3:202–214, 2010.

18. Patrick Perrin and Frederick Petry. Extraction and rep-resentation of contextual information for knowledgediscovery in texts. Information Sciences, 151:125–152, 2003.

19. Lech Polkowski and Piotr Artiemjew. On knowl-edge granulation and applications to classifier induc-tion in the framework of rough mereology. Interna-tional Journal of Computational Intelligence Systems,2:315–331, 2009.

20. Philip Resnik. Using information to evaluate seman-tic similarity in a taxonomy. In Chris Mellish, edi-tor, Proceedings of the Fourteenth International JointConference on Artificial Intelligence, pages 448–452.Morgan Kaufmann, 1995.

21. Anjum Reyaz-Ahmed, Yan-Qing Zhang, andRobert W. Harrison. Granular decision tree and evo-lutionary neural svm for protein secondary structureprediction. International Journal of ComputationalIntelligence Systems, 2:343–352, 2009.

22. G. Salton. Developments in automatic text retrieval.Science, 253(5023):974–980, August 1991.

23. G. Salton and M.J. McGill. Introduction to ModernInformation Retrieval. McGraw-Hill, New York, NewYork, 1983.

24. M. Sanderson and B. Croft. Deriving concept hierar-chies from text. In Proceedings of the 22nd Annual In-ternational ACM SIGIR Conference on Research andDevelopment in Information Retrieval, pages 206–213. ACM, 1999.

25. Hinrich Schutze. Automatic word sense discrimina-tion. Computational Linguistics, 24(1):97–124, 1998.

26. C. Shannon. A mathematical theory of communica-tion. Bell System Technology Journal, 27:379–423,1948.

27. Quan Thanh Tho, Siu Cheung Hui, Alvis Cheuk M.Fong, and Tru Hoang Cao. Automatic fuzzy ontol-ogy generation for semantic web. IEEE Transactionson Knowledge and Data Engineering, 18(6):842–856,2006.

28. E. Voorhees and D. Harman. Overview of theNinth Text REtrieval Conference (TREC-9). InE.M. Voorhees and D.K. Harman, editors, Proceed-ings of the ninth Text REtrieval Conference (TREC-9), pages 1–14, Gaithersburg, Maryland, Novem-ber 13–16 2000. NIST. Available from htt p ://trec.nist.gov/pubs/trec9/t9proceedings.html.

29. X. Yan, D. Song, and S. Li. Concept-based documentreadability in domain specific information retrieval.In Proceedings of the 15th ACM International Con-ference on Information and Knowledge Management,pages 540–549, 2006.

30. J.T. Yao. Information granulation and granular rela-tionships. In Proceedings of the 2005 IEEE Inter-national Conference on Granular Computing, pages326–329, 2005.

31. Y.Y. Yao. Information retrieval support systems. InProceedings of the 2002 IEEE World Congress onComputational Intelligence, pages 773–778, 2002.

32. Y.Y. Yao. Perspectives of granular computing. In Pro-ceedings of the 2005 IEEE International Conferenceon Granular Computing, pages 326–329, 2005.

33. L. A. Zadeh. Fuzzy sets. Journal of Information andControl, 8:338–353, 1965.

34. L. A. Zadeh. Toward a theory of fuzzy informationgranulation and its centrality in human reasoning andfuzzy logic. Fuzzy Sets and Systems, 90:111–127,1997.

35. X. Zhou, S.T. Wu, Y. Li, Y. Xu, R.Y.K. Lau, andP.D. Bruza. Utilizing search intent in topic ontology-based user profile for web mining. In Proceedingsof the 2006 IEEE/WIC/ACM International Conferenceon Web Intelligence, pages 558–564, 2006.


fuzzy ontology mining and semantic information granulation

Documents